# Data Science Challenge

In [2]:
# Libraries
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

## Data Description

Column | Description
:---|:---
`area` | Area of the real estate (in m²)
`bathrooms` | Number of bathrooms
`bedrooms` | Number of bedrooms
`condo_fee`| Monthly maintenance fee (in USD)
`parking_spots`| Number of parking spots
`price`| Price of the real estate (in USD)
`suites` | Number of bedrooms with direct access to a bathroom
`type` | Kind of real estate
`lat` | Latitude of the property
`lon` | Longitude of the property
`misc` | Other information

## Exploratory Data Analysis

### 1. Load and Inspect the Data

Not much information is given about the data itself. The best way to understand the data is to look directly at it.

> #### Task 1:
Use `pandas` to explore the data:
- **Take an overall look at the tabular data.**
- **Check the data shape.**
- **Check the types of values in each column.**

In [10]:
# Dataset is already loaded below
data = pd.read_csv("dataset.csv")
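
The three checks in Task 1 map onto a handful of `pandas` calls. The sketch below is only illustrative: it uses a small hypothetical frame named `toy` with invented values in place of `dataset.csv`, so the same calls would be applied to `data` in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for dataset.csv; the values are invented.
toy = pd.DataFrame({
    "area": [50.0, 72.5, 120.0],
    "bedrooms": [1, 2, 3],
    "price": [100000.0, 150000.0, 320000.0],
    "type": ["apartment", "apartment", "house"],
})

print(toy.head())    # first rows of the table
print(toy.shape)     # (number of rows, number of columns)
print(toy.dtypes)    # value type of each column
toy.info()           # dtypes and non-null counts in one summary
```

`info()` is handy here because it already hints at missing values, which the next section deals with.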

---

### 2. Data Cleaning

Another crucial step in data analysis is to identify and fix possible problems in the dataset, such as *Missing Values*, *Duplicates* and *Inconsistency*. Real-world datasets are usually quite messy and need to be cleaned before any analysis is done on them.

> #### Task 2:
Using `pandas`, clean the data making sure that you address these problems:
- **Missing values**
    Remove every column in which more than 95% of the values are missing. Also drop rows with more than 2 missing values, provided they make up **less than 3% of the total sample size.**
- **Duplicates**
    Drop duplicate rows
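
One possible reading of these rules is sketched below. The helper `clean` and its parameter names are hypothetical, and the toy frame is invented. Note the assumption that missing values per row are counted *after* the mostly-empty columns have been dropped; counting them before would be an equally defensible choice.

```python
import numpy as np
import pandas as pd

def clean(df, col_thresh=0.95, max_missing=2, row_share=0.03):
    """Hypothetical helper applying the Task 2 rules in order."""
    # Drop columns whose fraction of missing values exceeds col_thresh.
    col_missing = df.isna().mean()
    df = df.loc[:, col_missing <= col_thresh]
    # Flag rows with more than max_missing missing values...
    sparse_rows = df.isna().sum(axis=1) > max_missing
    # ...and drop them only if they are a small share of the sample.
    if sparse_rows.mean() < row_share:
        df = df.loc[~sparse_rows]
    # Finally remove exact duplicate rows.
    return df.drop_duplicates().reset_index(drop=True)

# Invented toy data: one empty column, one sparse row, one duplicate.
n = 50
toy = pd.DataFrame({
    "area": np.arange(n, dtype=float) + 40,
    "bedrooms": np.arange(n) % 4 + 1.0,
    "price": np.arange(n, dtype=float) * 1000 + 50000,
    "condo_fee": np.full(n, 100.0),
    "misc": np.nan,
})
toy.loc[0, ["area", "bedrooms", "price"]] = np.nan
toy = pd.concat([toy, toy.iloc[[1]]], ignore_index=True)

cleaned = clean(toy)
print(toy.shape, "->", cleaned.shape)
```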

---

### 3. Descriptive Statistics

A simple and fast way to better understand the data is to calculate descriptive statistics about it. They give a better notion of each variable's range and distribution. Additionally, they can reveal impossible values in the dataset. Removing these values is crucial for the performance of the model created later.
In this case, a possible inconsistency could be present in the columns `parking_spots` or `bedrooms`, which should never be negative.

**OBS.: Do not confuse inconsistent values with outliers. In general, outliers are unlikely but plausible data entries, while inconsistent entries are impossible, e.g. a negative area.**

> #### Task 3:
Explore the statistics behind the data with `pandas`:
- **Calculate descriptive statistics of numeric and categorical columns.**
- **Remove inconsistent entries.**
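
A minimal sketch of both steps, on an invented toy frame. The validity rules used here (strictly positive `area`, non-negative counts) are assumptions drawn from the examples in the text above, not a complete list of checks.

```python
import pandas as pd

# Invented toy data containing two impossible entries.
toy = pd.DataFrame({
    "area": [50.0, 72.5, -10.0, 120.0],
    "bedrooms": [1, 2, 2, -3],
    "parking_spots": [0, 1, 1, 2],
    "type": ["apartment", "apartment", "house", "house"],
})

# Descriptive statistics: numeric columns, then non-numeric ones.
numeric_stats = toy.describe()
categorical_stats = toy.select_dtypes(exclude="number").describe()
print(numeric_stats)
print(categorical_stats)

# Assumed rules: area must be positive, counts must not be negative.
count_cols = ["bedrooms", "parking_spots"]
valid = (toy["area"] > 0) & (toy[count_cols] >= 0).all(axis=1)
consistent = toy.loc[valid].reset_index(drop=True)
```

The `min` row of `numeric_stats` is usually the quickest place to spot negative values.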

---

### Questions

Q1. What is the most frequent <b>type</b> of real estate? <br>
Q2. What is the <b>median</b> number of <b>bedrooms</b>?

Your answer:<br>
1.<br>
2.
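
Both questions reduce to one-line `pandas` calls. The sketch below runs them on an invented toy frame, so its printed values are not the answers for the real dataset.

```python
import pandas as pd

# Invented toy data; the real answers come from the cleaned dataset.
toy = pd.DataFrame({
    "type": ["apartment", "apartment", "house"],
    "bedrooms": [1, 2, 4],
})

# mode() returns a Series (there may be ties), so take the first entry.
most_frequent_type = toy["type"].mode().iloc[0]
median_bedrooms = toy["bedrooms"].median()
print(most_frequent_type, median_bedrooms)
```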


---

### 4. Data Visualization

Visualization is one of the most effective ways to understand the data.

> #### Task 4:
Using any plotting library (`matplotlib`, `seaborn`, `folium`, etc.), select an appropriate graph for each step:
- **Plot the approximate distributions of numeric and categorical variables.**
- **Create relevant plots to easily represent the data.**
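
As a starting point, a histogram suits a numeric column and a bar chart of value counts suits a categorical one. The sketch below uses the plotting interface built into `pandas` on invented toy data; the column choices are only examples.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Invented toy data.
toy = pd.DataFrame({
    "price": [100.0, 120.0, 125.0, 150.0, 180.0, 400.0],
    "type": ["apartment", "apartment", "apartment", "house", "house", "house"],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Numeric column: histogram as an approximate distribution.
toy["price"].plot.hist(bins=5, ax=axes[0], title="price")
# Categorical column: frequency of each category.
toy["type"].value_counts().plot.bar(ax=axes[1], title="type")
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, `plt.show()` would be needed.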

---

### 5. Outliers

From the information gathered in the previous plots and statistics, we can remove outliers in columns with long-tailed distributions. Ideally, we should investigate these **outliers**: check where they came from and whether they are valid data points or the result of mistyping, etc.

> #### Task 5:
Perform these steps in the data:
- **Deal with the outliers in the dataset.**
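
The task does not prescribe a method. One common option, shown here purely as an assumption, is the interquartile range (IQR) rule: keep values within `k` times the IQR of the quartiles. The helper `iqr_mask` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_mask(series, k=1.5):
    """Hypothetical helper: True where a value lies inside the IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    # between() is inclusive and returns False for missing values.
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Invented toy data with one extreme price.
toy = pd.DataFrame({"price": [100.0, 110.0, 120.0, 130.0, 140.0, 5000.0]})
trimmed = toy.loc[iqr_mask(toy["price"])]
print(len(toy), "->", len(trimmed))
```

Clipping the values or applying a log transform are alternatives when dropping rows would discard too much data.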

---

### 6. Correlations

Now, let's examine the correlations between the features in the data. In general, good numeric features have a high absolute [**Pearson Correlation**](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) with the target variable and a low correlation with the other relevant features.

> #### Task 6:
Explore the correlations between the features in the dataset:
- **Calculate the correlation between numeric features.**
- **Choose appropriate visualizations to display these correlations.**
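
A correlation matrix drawn as a heatmap is one reasonable choice. The sketch below computes the Pearson correlations with `pandas` and draws them with plain `matplotlib`; the toy values are invented.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Invented toy data; price is exactly twice the area here.
toy = pd.DataFrame({
    "area": [50.0, 60.0, 80.0, 100.0, 120.0],
    "bedrooms": [1, 2, 2, 3, 4],
    "price": [100.0, 120.0, 160.0, 200.0, 240.0],
    "type": ["apartment", "apartment", "house", "house", "house"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes("number").corr(method="pearson")

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax)
```

Since `seaborn` is already imported in this notebook, `sns.heatmap(corr, annot=True)` would give a similar figure with less code.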

---

## Machine Learning

### 7. Feature Engineering

Now, it is time to prepare the features for the model. Combining some features can often improve model performance. Additionally, some features should be transformed to better represent their meaning.

> #### Task 7:
Perform what is asked below:
- **Transform the coordinate features into a more appropriate representation.**
- **Feel free to combine any other features.**

#### Hint:

For example, the `lat` and `lon` features are angles on a sphere, so differences between them are not a good measure of distance for our problem. A possible approach is to first convert the `lat` and `lon` coordinates to normalized Cartesian coordinates using the formulas below (with both angles expressed in radians):

$$ x = \cos{(lat)} \cdot \cos{(lon)} $$
$$ y = \cos{(lat)} \cdot \sin{(lon)} $$
$$ z = \sin{(lat)} $$

After that, create a clustering model to assign each entry to a group based on its closest neighbors. In general, the [Elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) is a good technique for choosing an adequate number of clusters.
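
The hint can be sketched as follows. It assumes the coordinates are stored in degrees, hence the conversion to radians. `KMeans` is used as one possible clustering model, and the toy coordinates and the column name `geo_cluster` are invented.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Invented toy coordinates forming two well-separated groups (degrees).
toy = pd.DataFrame({
    "lat": [10.00, 10.02, 10.01, 12.00, 12.01, 11.99],
    "lon": [20.00, 20.01, 19.99, 25.00, 25.02, 24.99],
})

# Assumption: lat/lon are in degrees, so convert to radians first.
lat = np.radians(toy["lat"])
lon = np.radians(toy["lon"])
toy["x"] = np.cos(lat) * np.cos(lon)
toy["y"] = np.cos(lat) * np.sin(lon)
toy["z"] = np.sin(lat)
xyz = toy[["x", "y", "z"]].to_numpy()

# Elbow method: inspect how the inertia drops as k grows.
inertias = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(xyz)
    inertias[k] = km.inertia_
print(inertias)

# The toy data has two obvious groups, so k=2 is used here.
toy["geo_cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(xyz)
```

The resulting cluster label is a categorical feature, so it would go through the same encoding as `type` in the next section.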

---

### 8. Prepare Features for Model

In general, some steps must be performed before feeding the data to the model. These steps depend on the model selected, but starting with a simple linear model is always a good idea.

> #### Task 8:
Considering a linear model, prepare all the numeric and categorical features following these steps:
- **Perform adequate transformations to numeric features.**
- **Perform adequate transformations to categorical features.**
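
For a linear model, a typical choice (an assumption here, not a requirement of the task) is to standardize numeric features and one-hot encode categorical ones. The sketch combines both with scikit-learn's `ColumnTransformer` on invented toy data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data.
toy = pd.DataFrame({
    "area": [50.0, 70.0, 90.0, 110.0],
    "bedrooms": [1, 2, 2, 3],
    "type": ["apartment", "house", "apartment", "house"],
})

numeric_cols = ["area", "bedrooms"]
categorical_cols = ["type"]

preprocess = ColumnTransformer(
    [
        # Zero mean and unit variance for numeric features.
        ("num", StandardScaler(), numeric_cols),
        # One indicator column per category; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X = preprocess.fit_transform(toy)
print(X.shape)
```

To avoid leaking information, the transformer should be fitted on the training split only and then applied to the test split with `transform`.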

---

### 9. Model Training and Evaluation

Finally, the data is ready to be fed to the model. In regression problems, linear models are good candidates because they are easy to explain and not computationally intensive.

> #### Task 9:
Using models from `sklearn.linear_model`, perform these tasks:
- **Create a linear model and fit it to the data.**
- **Calculate the expected *RMSE*, *MAE* and *R2* scores of the model.**
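
A minimal sketch of the training and evaluation loop is given below. It uses synthetic data generated from an invented linear relation, and estimates the scores on a held-out test split; cross-validation would be another reasonable way to estimate them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the coefficients and noise level are invented.
rng = np.random.default_rng(0)
area = rng.uniform(30, 200, size=200)
bedrooms = rng.integers(1, 5, size=200)
price = 1000 * area + 5000 * bedrooms + rng.normal(0, 2000, size=200)

X = np.column_stack([area, bedrooms])
y = price

# Hold out 25% of the rows to estimate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"RMSE={rmse:.1f}  MAE={mae:.1f}  R2={r2:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so a large gap between the two suggests that a few predictions are far off.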

---