# Data understanding

The data understanding phase starts with initial data collection and proceeds with activities to get familiar with the data, identify data quality problems, discover first insights, and detect interesting subsets that suggest hypotheses about hidden information.

## Data collection

<!-- ### task

Acquire within the project the data (or access to the data) listed in the
project resources. This initial collection includes data loading if necessary
for data understanding. For example, if you apply a specific tool for data
understanding, it makes perfect sense to load your data into this tool.
This effort possibly leads to initial data preparation steps.

Note: if you acquire multiple data sources, integration is an additional
issue, either here or in the later data preparation phase.

### output

List the dataset (or datasets) acquired, together with their locations
within the project, the methods used to acquire them and any problems
encountered. Record problems encountered and any solutions achieved
to aid with future replication of this project or with the execution of
similar future projects.
 -->

The initial dataset was obtained from Kaggle and is publicly available at [this link](https://www.kaggle.com/datasets/mohdph/saudi-arabia-real-estate-dataset). After unzipping, I stored it as `data/database.sqlite`. I prefer to analyze the data in Python with libraries such as `pandas` rather than to query SQLite directly. After data cleaning, I will store the data in `.csv` format for ease of future use.
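A minimal sketch of this workflow, assuming the SQLite file holds a table called `Listings` and that `pandas` is installed. The helper names `load_table` and `save_csv` are hypothetical and only illustrate the idea.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

import pandas as pd


def load_table(db_path, table="Listings"):
    """Read one whole table from a SQLite file into a DataFrame."""
    # Table names cannot be passed as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    with closing(sqlite3.connect(db_path)) as conn:
        return pd.read_sql_query(f"SELECT * FROM {quoted}", conn)


def save_csv(df, csv_path):
    """Write the DataFrame to CSV without the index, creating the folder if needed."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
```

With these helpers, `save_csv(load_table("data/database.sqlite"), "data/listings.csv")` would export the table, where `data/listings.csv` is an assumed output path and not one fixed by the project.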

## Data description

<!-- ### task

Examine the “gross” or “surface” properties of the acquired data and
report on the results.

### output

Describe the data which has been acquired, including: the format of
the data, the quantity of data, for example number of records and fields
in each table, the identities of the fields and any other surface features
of the data which have been discovered. Does the data acquired satisfy
the relevant requirements? -->

The dataset is a single `.sqlite` file containing one table named `Listings`, which has `48` columns and `663946` rows. Its features include user information, price, title, creation and update times, location, and other attributes related to the real estate. The columns hold text, dates, integers, and floats, all of which can be processed and analyzed further.
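Surface properties like these can be checked once the table is loaded into a `pandas` DataFrame. The sketch below is illustrative only: `surface_summary` is a hypothetical helper, and the toy columns in the usage note are not the real column names.

```python
import pandas as pd


def surface_summary(df):
    """Return one row per column: its dtype, non-null count, and distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
            "n_unique": df.nunique(),  # missing values are not counted
        }
    )
```

For a DataFrame `df`, `df.shape` gives the number of rows and columns, and `surface_summary(df)` lists the type and fill level of each field.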

For the current version of the project, the dataset contains all the relevant information and is ready for further analysis.

## Explore data

<!-- ----------

### task

This task tackles the data mining questions, which can be addressed
using querying, visualization and reporting. These include: distribution
of key attributes, for example the target attribute of a prediction task;
relations between pairs or small numbers of attributes; results of
simple aggregations; properties of significant sub-populations; simple
statistical analyses. These analyses may address directly the data mining goals; they may also contribute to or refine the data description
and quality reports and feed into the transformation and other data
preparation needed for further analysis.

### output

Describe results of this task including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate,
include graphs and plots, which indicate data characteristics or lead
to interesting data subsets for further examination. -->

The price distribution is skewed to the right, and the dataset contains outliers with implausibly low and implausibly high prices. This suggests that listings are not checked against any lower or upper price limit. The dataset also contains missing values, and some of the columns are not relevant for the analysis.

The same holds for the area parameter: there are outliers with areas so large that they cannot be real.
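One common way to quantify observations like these is the sample skewness together with the interquartile range (IQR) rule. The sketch below uses the IQR rule as an illustrative choice, not necessarily the method used in this project, and the multiplier `k=1.5` is the conventional default rather than a tuned value.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_mask(values, k=1.5):
    """Boolean mask that is True where a value lies outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return (values < low) | (values > high)
```

Applied to a price or area column, `values.skew()` greater than zero indicates a right-skewed distribution, and `outlier_mask(values).sum()` counts the flagged rows. For strongly skewed columns, the fences may be more meaningful on a log scale.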

The distribution of age shows a large number of 0 values, which suggests that many people do not want to share this information. We should therefore take into account that we have fewer usable features than we thought.
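If the 0 values are placeholders for "not provided", it can help to measure their share and to treat them as missing. This is a sketch under that assumption; the helper names are hypothetical.

```python
import pandas as pd


def zero_share(values):
    """Fraction of entries that are exactly 0."""
    return float((values == 0).mean())


def zeros_to_missing(values):
    """Return a float copy of the Series with 0 replaced by NaN."""
    return values.astype("float64").mask(values == 0)
```

Converting the placeholder to NaN keeps a zero from being read as a real age in later statistics or models.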

## Data quality

The dataset has no duplicates, which is good for the analysis. However, there are many missing values, and some of the columns are not relevant for the analysis; both should be handled properly. As mentioned previously, there are also many outliers, which confirms the need for data cleaning.
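Checks of this kind can be expressed in a few lines of `pandas`. The sketch below is a minimal example; `quality_report` is a hypothetical helper, and it counts only fully identical rows as duplicates.

```python
import pandas as pd


def quality_report(df):
    """Count fully duplicated rows and the share of missing values per column."""
    missing_share = df.isna().mean().sort_values(ascending=False)
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_share": missing_share,
    }
```

Columns at the top of `missing_share` are the first candidates for imputation or removal during data preparation.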

## Costs and benefits

Overall, the project cost is estimated to be under $1000, because the analysis does not require many resources. The benefits of the project are a better understanding of the real estate market in Saudi Arabia and a better ML model for predicting real estate prices.