# Introduction to Machine Learning - Exercise 2
* The aim of the exercise is to learn basic techniques for creating and interpreting visualizations using the Matplotlib and Seaborn libraries.

![meme02](https://github.com/rasvob/EFREI-Introduction-to-Machine-Learning/blob/main/images/fml_02_meme_02.png?raw=true)

# Exploratory data analysis

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Important attributes description:
* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* Heating: Type of heating
* CentralAir: Central air conditioning
* GrLivArea: Above grade (ground) living area square feet
* BedroomAbvGr: Number of bedrooms above basement level

## Import used packages

In [None]:
import pandas as pd # dataframes
import numpy as np # matrices and linear algebra
import matplotlib.pyplot as plt # plotting
import seaborn as sns # another matplotlib interface - styled and easier to use

## Load the data into a Pandas DataFrame - in our case it is a CSV file
* https://raw.githubusercontent.com/rasvob/EFREI-Introduction-to-Machine-Learning/main/datasets/zsu_cv1_data.csv

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/rasvob/EFREI-Introduction-to-Machine-Learning/main/datasets/zsu_cv1_data.csv', sep=',')
df.head()

## Price is the most interesting attribute in our dataset, let's take a look at its distribution
- SalePrice

## Visualize house prices in the form of a histogram
- What does it tell us about the prices?

In [None]:
sns.displot(df.SalePrice)

### Modify bins
- Perhaps we want to visualize the histogram with finer or coarser bins
- Use the *bins* parameter
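
To see what changing the number of bins does to the counts, here is a minimal sketch using `np.histogram` on a small made-up array (the values are illustrative and not taken from the dataset). It only counts; it does not draw anything.

```python
import numpy as np

# Made-up values, used only to illustrate how binning changes the counts
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], dtype=float)

# Two wide bins: most of the detail is hidden
counts_coarse, edges_coarse = np.histogram(values, bins=2)

# Eight narrow bins: the peak around 3 and the gap before 9 become visible
counts_fine, edges_fine = np.histogram(values, bins=8)

print(edges_coarse, counts_coarse)
print(edges_fine, counts_fine)
```

The same data can look like one big lump or like a peak with an isolated high value, depending only on the bin count.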

In [None]:
sns.displot(df.SalePrice, bins=30)

### Add probability density curve (Kernel density estimation)

### 💡 How is KDE useful?
* Imagine you have a bunch of data points, like the heights of people in a city. 
    * You can’t just count how many people are of a certain height, because there are too many possible heights.
    * Instead, you can group the heights into bins, like 150-160 cm, 160-170 cm, and so on, and count how many people fall into each bin.
    * This gives you a histogram, which is a kind of bar chart that shows how frequent each bin is.
        * This gives you an idea of how the heights are distributed in the city.
* But histograms have some problems.
    * They depend on how you choose the bins, which can be arbitrary and affect the shape of the histogram.
    * They also have sharp edges, which make them look jagged and unnatural.
    * They also don’t tell you anything about what happens between the bins, like how likely it is to find someone who is 165 cm tall.
* Kernel density estimation places a small smooth bump (a kernel) on every data point and adds the bumps up into one continuous curve. It is useful for data analysis because it lets you see the shape and features of your data more clearly than histograms.
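
The idea behind KDE can be sketched in a few lines of NumPy. This is a simplified illustration, not how Seaborn implements it: the helper `gaussian_kde_1d`, the made-up heights, and the hand-picked bandwidth of 5 cm are all assumptions for the example (Seaborn chooses the bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up heights in cm (hypothetical data)
heights = np.array([158.0, 162.0, 165.0, 166.0, 171.0, 174.0, 180.0])
grid = np.linspace(130, 210, 801)
density = gaussian_kde_1d(heights, grid, bandwidth=5.0)

# The curve is defined everywhere, including between histogram bins
density_at_165 = float(np.interp(165.0, grid, density))

# A density should integrate to roughly 1 (simple Riemann sum)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(density_at_165, area)
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual points more closely, which plays the same role as the bin width of a histogram.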

We can use the *kde* parameter to overlay an estimate of the probability density function.

Modify the colors of the plot using these parameters:
- *color*: green
- *edgecolor*: white

In [None]:
sns.histplot(df.SalePrice, kde=True, edgecolor='white', color='green')

### Use quartiles (Q1, Q3) for highlighting most common price range in histogram

* Check functions *describe* and *quantile* over price column

In [None]:
df.SalePrice.describe()

In [None]:
df.SalePrice.quantile(0.25), df.SalePrice.quantile(0.75)

## 🔎 What can we say about prices based on the quantiles?
### Use function *axvline* from Matplotlib to draw vertical lines at the quartile positions

In [None]:
fig = plt.figure(figsize=(9, 6))
sns.histplot(df.SalePrice, bins=60, edgecolor='white', color='green')
plt.axvline(df.SalePrice.quantile(0.25), color='red')
plt.axvline(df.SalePrice.quantile(0.75), color='red')

## 📊 Let's add more complexity to histogram visualizations.
### Does the price change for different values of the GarageFinish attribute?

### GarageFinish: Interior finish of the garage
- Fin: Finished
- RFn: Rough Finished
- Unf: Unfinished
- NA: No Garage

In [None]:
sns.displot(data=df, x='SalePrice', hue='GarageFinish', edgecolor='white')

### 💡 We can see that houses with no garage are missing from the plot
- Houses with no garage have a *nan* value in this feature, so they are not plotted
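
One way to confirm that missing values are the cause is to count them. The sketch below uses a small made-up Series in place of the real column. A likely reason for the *nan* values is that `pd.read_csv` treats the string `NA` as a missing value by default.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df['GarageFinish'] (hypothetical data)
toy = pd.Series(['Fin', 'Unf', np.nan, 'RFn', np.nan], name='GarageFinish')

# Number of missing entries
n_missing = int(toy.isna().sum())

# value_counts silently skips nan unless dropna=False is passed
counts_default = toy.value_counts()
counts_with_nan = toy.value_counts(dropna=False)

print(n_missing)
print(counts_with_nan)
```

On the real data, the same two calls on `df['GarageFinish']` would show how many houses are affected.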

### Fill nan values with 'NoGarage' string

In [None]:
df['GarageFinish'] = df['GarageFinish'].fillna('NoGarage')

### Take a look at the histogram again

In [None]:
sns.displot(data=df, x='SalePrice', hue='GarageFinish', edgecolor='white')

## 💡 Sometimes histograms distinguished only by color are not easily readable
- We can use the *col* parameter to automatically plot every histogram in its own subplot

In [None]:
sns.displot(data=df, x='SalePrice', col='GarageFinish')

# 📊 Another very useful technique for visualizing the distributions of numerical features is the boxplot
- 🔎 Did you see any of these before?

- 🔎 How should we read boxplots?
    - **Quartiles**
    - **Median**
    - Box size
    - **Outliers and IQR**
    - Min/max - with or without outliers
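
The numbers a boxplot draws can be computed by hand. The sketch below uses made-up prices (in thousands) and assumes the common 1.5 * IQR whisker rule, which is also the Matplotlib and Seaborn default.

```python
import numpy as np

# Made-up prices in thousands of dollars (hypothetical data)
prices = np.array([90, 120, 130, 140, 150, 160, 175, 190, 210, 480], dtype=float)

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1  # box size

# Points beyond these bounds are drawn individually as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
is_outlier = (prices < lower_bound) | (prices > upper_bound)

# Whiskers end at the most extreme points that are not outliers
inliers = prices[~is_outlier]
whisker_low, whisker_high = inliers.min(), inliers.max()

print(q1, median, q3, iqr)
print(prices[is_outlier], whisker_low, whisker_high)
```

Here the whisker stops at the largest non-outlier value, so the "max" shown by the whisker differs from the true maximum whenever outliers exist.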

### Plot only the *SalePrice* using boxplot

In [None]:
sns.boxplot(data=df, y='SalePrice')

## Let's take a look at the *SalePrice* for different *BldgType* values 
- 🔎 What can we say about the prices?

In [None]:
fig = plt.figure(figsize=(16, 9))
sns.boxplot(data=df, y='SalePrice', x='BldgType')

## We can do the same for *GrLivArea*

In [None]:
sns.boxplot(data=df, y='GrLivArea')

## Let's take a look at the *GrLivArea* for different *BldgType* values 
- 🔎 What can we say about the GrLivArea?

In [None]:
fig = plt.figure(figsize=(16, 9))
sns.boxplot(data=df, y='GrLivArea', x='BldgType')

# 📊 Scatter plots are commonly used for visualizing two numerical variables

### We can use standard *scatterplot* with *BldgType* as a *hue* so we can better grasp the relationship between *GrLivArea* and *SalePrice*
* Try setting the *alpha* parameter to control the opacity of the points

In [None]:
fig = plt.figure(figsize=(12,6))
sns.scatterplot(data=df, x='GrLivArea', y='SalePrice', hue='BldgType', alpha=0.5)

### We can see there are some outliers in the data. Let's zoom in on the area without them. How can we filter the data?

In [None]:
fig = plt.figure(figsize=(12,6))
sns.scatterplot(data=df[df.GrLivArea < 3000], x='GrLivArea', y='SalePrice', hue='BldgType', alpha=1)

# We have information about the month and year in which each house was sold.
## 🔎 Can you visualize the average house price by quarter?

* First, we need to create a new column **YearQuarterSold** that merges the year and quarter in this pattern: '2010-1', '2010-2', and so on
    * We can map the values using the *apply* function or use string concatenation directly
* 💡 If you need to change the data type of a column, you can use *astype*
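
For comparison with string concatenation, here is a sketch of the *apply* variant on a tiny made-up DataFrame. The helper name `year_quarter` and the toy values are assumptions for illustration; only the column names `YrSold` and `MoSold` come from the dataset.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the dataset
toy = pd.DataFrame({'YrSold': [2008, 2008, 2010], 'MoSold': [2, 7, 12]})

def year_quarter(row):
    """Build a 'YYYY-Q' label from one row (hypothetical helper)."""
    quarter = (row['MoSold'] - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, ...
    return f"{row['YrSold']}-{quarter}"

# axis=1 passes each row to the function
toy['YearQuarterSold'] = toy.apply(year_quarter, axis=1)

# Vectorized equivalent using astype and string concatenation
vectorized = toy.YrSold.astype(str) + '-' + ((toy.MoSold + 2) // 3).astype(str)

print(toy)
```

Both approaches give the same labels. The vectorized form is usually faster on large frames, while *apply* is easier to extend with custom logic.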

In [None]:
df['QuarterSold'] = (df.MoSold + 2) // 3
df['YearQuarterSold'] = df.YrSold.astype(str) + '-' + df.QuarterSold.astype(str)
df['YearQuarterSold'].head()

### Compute the average price for the *YearQuarterSold* attribute

In [None]:
df_agg = df.groupby('YearQuarterSold').SalePrice.mean().reset_index(name='MeanPrice')
df_agg.head()

## Visualize the data using *lineplot*
- If the x-axis labels overlap, try tuning the *rotation* and *horizontalalignment* parameters

In [None]:
fig = plt.figure(figsize=(9,6))
sns.lineplot(data=df_agg, x='YearQuarterSold', y='MeanPrice')
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()

### Add the max, min and median to the plot
- Use *describe* and the [Pandas Melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) function
    - use YearQuarterSold as an *id_vars*
    - use 'min', 'mean', '50%' (the median), 'max' as *value_vars*
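
If *melt* is new to you, this sketch shows what it does on a tiny made-up table: each column listed in *value_vars* is stacked into rows, producing the long format that the *hue* parameter of Seaborn works with. The numbers are illustrative only.

```python
import pandas as pd

# Made-up wide table: one row per quarter, one column per statistic
wide = pd.DataFrame({
    'YearQuarterSold': ['2008-1', '2008-2'],
    'min': [80000.0, 95000.0],
    'mean': [170000.0, 182000.0],
    'max': [400000.0, 450000.0],
})

# Long table: one row per (quarter, statistic) pair
long_df = pd.melt(wide, id_vars=['YearQuarterSold'], value_vars=['min', 'mean', 'max'])

print(long_df)
```

The result has the default column names `variable` and `value`, which is why the later plot uses `y='value'` and `hue='variable'`.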

In [None]:
df.groupby('YearQuarterSold').SalePrice.describe().reset_index()

![meme01](https://github.com/rasvob/EFREI-Introduction-to-Machine-Learning/blob/main/images/fml_02_meme_01.png?raw=true)

In [None]:
df_agg = df.groupby('YearQuarterSold').SalePrice.describe().reset_index()
df_melt = pd.melt(df_agg, id_vars=['YearQuarterSold'], value_vars=['min', 'mean', '50%', 'max'])
df_melt

## What can we say about the minimum and maximum values?
- 💡 Take a look at the standard deviation

In [None]:
fig = plt.figure(figsize=(9,6))
sns.lineplot(data=df_melt, x='YearQuarterSold', y='value', hue='variable')
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()

# We can take a look at the number of houses sold in the defined time periods as well
## 📊 We will use a standard bar plot
- 🔎 In which quarter were the most houses sold?
- 💡 Fun facts: [https://themortgagereports.com/44135/whats-the-best-time-of-year-to-sell-a-home](https://themortgagereports.com/44135/whats-the-best-time-of-year-to-sell-a-home)

In [None]:
df_cnt = df.groupby('YearQuarterSold').SalePrice.count().reset_index(name='Count')
df_cnt

In [None]:
fig = plt.figure(figsize=(9,6))
sns.barplot(data=df_cnt, x='YearQuarterSold', y='Count')
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()

# Tasks
## ✅ Task 1 - Outlier detection
- We need to somehow mark the outliers in the data according to the *SalePrice* and *GrLivArea*
    - One possibility is to compute [IQR](https://www.statisticshowto.com/probability-and-statistics/interquartile-range/) for both columns and mark outliers using lower and upper bounds
    - Lower bound: Q1 - 1.5*IQR
    - Upper bound: Q3 + 1.5*IQR
- 💡 If the house has *SalePrice* **OR** *GrLivArea* value outside of the bounds - it is an outlier
- Visualize the data using a scatter plot and distinguish the outlier and non-outlier data using different colors (*hue*) 💡

## ✅ Task 2 - Describe what you see in the data
- Try to visualize the relationship between *SalePrice* and *OverallQual*
    - 💡 You can use box plots, scatter plots, etc.; the choice of the right plot type is up to you 🙂
- Do the same for *SalePrice* and *OverallCond*, i.e. visualize the relationship and describe the insight
- **Describe the insight you got from the plots with a few sentences in a Markdown cell below the plot**