
# ML-Workshop
## Data Exploration & House price prediction
Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

## Load the data from .csv

In [2]:
import pandas as pd
import missingno
from sklearn.metrics import mutual_info_score
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks", color_codes=True)

# | remove max column restriction for printing DataFrames (default=20)
pd.options.display.max_columns = None 
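
The loading step itself could look like the sketch below. It reads from an in-memory string so that it runs on its own; the three rows and the choice of `Id` as index are illustrative assumptions. In the workshop you would pass the path of the downloaded Kaggle CSV to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV file (toy rows, not the full data).
csv_text = """Id,LotArea,Neighborhood,SalePrice
1,8450,CollgCr,208500
2,9600,Veenker,181500
3,11250,CollgCr,223500
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), index_col="Id")

print(df.shape)
print(df.head())
```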


## Missing values (NaN)

### Find NaN values 

In [3]:
# | get names of columns where values are missing

# | plot missing values only of columns where values are missing
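
One possible way to approach this cell, shown on a toy DataFrame (the values are made up; only the column names are borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two of the three columns.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
})

# Count NaN values per column and keep only the columns that have any.
missing_counts = df.isna().sum()
cols_with_nan = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_nan)

# Share of missing values, restricted to the affected columns.
missing_share = df[cols_with_nan].isna().mean()
print(missing_share)
```

For the plot, `missingno.matrix(df[cols_with_nan])` should draw the missing-value pattern of just these columns.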


### Remove/Replace the NaN Values

In [4]:
# | drop all columns where more than 90% of the values are N/A

# | check how many of the remaining rows have at least one NaN value
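
A minimal sketch of both steps, assuming a toy DataFrame in which `PoolQC` is almost entirely empty (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

n = 20
df = pd.DataFrame({
    "LotArea": np.arange(n) * 100 + 8000,
    "LotFrontage": [np.nan, np.nan] + [60.0] * (n - 2),  # 10% missing
    "PoolQC": [np.nan] * (n - 1) + ["Gd"],               # 95% missing
})

# Fraction of NaN values per column; drop columns above the 90% threshold.
nan_share = df.isna().mean()
df = df.drop(columns=nan_share[nan_share > 0.9].index)

# Number of remaining rows that contain at least one NaN value.
rows_with_nan = int(df.isna().any(axis=1).sum())
print(list(df.columns), rows_with_nan)
```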


Now we have two options: <br>
- A: remove all rows that have at least one missing value
- B: replace the NaN values, e.g. with the mean (for numerical columns) or the most frequent value (for categorical columns) of the corresponding column

In [5]:
# | A: remove all rows that have at least one missing value

# | B: replace the NaN values by the mean (for numerical) or the most frequent value (for categorical)
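
Both options can be sketched as follows. The DataFrame is a toy example, and the variable names `df_a` and `df_b` are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})

# A: drop every row that has at least one missing value.
df_a = df.dropna()

# B: impute per column, mean for numerical and mode for categorical columns.
df_b = df.copy()
num_cols = df_b.select_dtypes(include="number").columns
cat_cols = df_b.select_dtypes(exclude="number").columns
df_b[num_cols] = df_b[num_cols].fillna(df_b[num_cols].mean())
for col in cat_cols:
    df_b[col] = df_b[col].fillna(df_b[col].mode()[0])

print(len(df_a), "rows left with option A")
print(df_b)
```

Option A loses rows (here half of them), while option B keeps every row at the cost of inserting values that were never observed.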


### Calculate statistics such as mean, std & percentiles

In [6]:
# | note: df.describe() ignores the non-numerical columns
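
As a small illustration on toy data, `describe()` can also be pointed at the non-numerical columns through its `exclude` (or `include`) argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotArea": [8000, 9000, 10000, 11000],
    "Neighborhood": ["CollgCr", "CollgCr", "Veenker", "CollgCr"],
})

# Default: count, mean, std, min, percentiles and max of numerical columns only.
numeric_stats = df.describe()

# Non-numerical columns: count, number of unique values, top value and its frequency.
categorical_stats = df.describe(exclude=[np.number])

print(numeric_stats)
print(categorical_stats)
```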


## Separate categorical and numerical columns 

Nominal and numerical values cannot be analysed in the same way ...

In [7]:
# | check how many categorical columns we have


In [8]:
# | separate categorical and numerical columns for further analysis
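
One way to do the split is by dtype, sketched here on a toy DataFrame (`df_num` and `df_cat` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "MSZoning": ["RL", "RL", "RM"],
})

# Numerical columns are selected by dtype; everything else is treated as categorical.
df_num = df.select_dtypes(include="number")
df_cat = df.select_dtypes(exclude="number")

print(len(df_cat.columns), "categorical /", len(df_num.columns), "numerical columns")
```

Note that a dtype-based split puts integer-coded categories (such as `MSSubClass` in this dataset) on the numerical side, so such columns may need to be moved by hand.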


## Analyse distributions

In [9]:
# | plot distribution/histogram for target column SalePrice
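
A hedged sketch of such a histogram. The prices are synthetic log-normal samples standing in for `SalePrice`, not the real data, and the bin count is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for the SalePrice column.
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=500), name="SalePrice")

fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(sale_price, bins=30)
ax.set_xlabel("SalePrice")
ax.set_ylabel("Number of houses")

# A positive skewness indicates a long tail towards expensive houses.
skewness = sale_price.skew()
print(f"skewness: {skewness:.2f}")
```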


In [10]:
# | plot distributions for all columns


In [11]:
# | look at outliers for column LotArea
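
The notebook does not prescribe an outlier criterion. As one common choice, the sketch below applies the 1.5 x IQR rule to made-up `LotArea` values:

```python
import pandas as pd

# Toy values; the last one is deliberately extreme.
lot_area = pd.Series(
    [8000, 8500, 9000, 9500, 10000, 10500, 11000, 215000], name="LotArea"
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = lot_area.quantile(0.25), lot_area.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = lot_area[(lot_area < lower) | (lot_area > upper)]
print(outliers)
```

A box plot such as `sns.boxplot(x=lot_area)` shows the same fences visually, with outliers drawn as individual points.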


## Correlation analysis of numerical features

### Correlation between features and target 

In [12]:
# | get highest correlations between features and target variable 'SalePrice'
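
A minimal sketch of this ranking on synthetic data, in which `GrLivArea` is constructed to drive the price and `YrSold` is pure noise (both are assumptions made for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
gr_liv_area = rng.normal(1500, 400, n)
df = pd.DataFrame({
    "GrLivArea": gr_liv_area,
    "YrSold": rng.integers(2006, 2011, n),
    "SalePrice": 100 * gr_liv_area + rng.normal(0, 20000, n),
})

# Pearson correlation of every feature with the target, target itself removed.
corr_with_target = df.corr()["SalePrice"].drop("SalePrice")

# Rank by absolute value so that strong negative correlations are not missed.
ranking = corr_with_target.abs().sort_values(ascending=False)
print(ranking.head(10))
```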


Now let's look in more detail at the relationship between the most strongly correlated feature and the target.


In [13]:
# | scatter plot of highest correlating column with SalePrice


In [14]:
# | scatter plots of the top-10 correlating columns with SalePrice


Let's have a look at the remaining features, with lower correlations:

#### What about non-linear relationships?
Pearson correlation only captures linear relationships between two vectors. Metrics such as Mutual Information (MI) also measure non-linear relationships.

In [None]:
# | calculate MI scores
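
For a continuous target, one option is scikit-learn's `mutual_info_regression` (a different function from the `mutual_info_score` imported at the top, which works on discrete labels). The sketch below uses synthetic data with a purely quadratic relationship to show a feature that Pearson correlation misses and MI picks up:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
x_nonlinear = rng.uniform(-1, 1, n)
X = pd.DataFrame({
    "x_nonlinear": x_nonlinear,          # target depends on its square
    "x_noise": rng.uniform(-1, 1, n),    # unrelated to the target
})
y = pd.Series(x_nonlinear ** 2 + rng.normal(0, 0.05, n), name="target")

# MI scores (estimated with a nearest-neighbour method) and Pearson correlations.
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
pearson = X.corrwith(y)

print(pd.DataFrame({"MI": mi, "Pearson": pearson}))
```

Because the relationship is symmetric around zero, the Pearson correlation of `x_nonlinear` stays close to zero even though the feature almost fully determines the target.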


In [None]:
# | look at the columns with the highest scores and check whether any of them did not have a high Pearson correlation

### Correlation between features

Find features holding redundant information

In [None]:
# | plot correlation matrix 
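
Besides plotting, redundant feature pairs can be listed directly from the matrix. The sketch below uses synthetic data in which `GarageArea` is built from `GarageCars`, and the 0.8 threshold is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
garage_cars = rng.integers(0, 4, n)
df = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": 250 * garage_cars + rng.normal(0, 40, n),
    "LotArea": rng.normal(10000, 2000, n),
})

corr = df.corr()

# Walk over the upper triangle so that each pair is reported only once.
cols = corr.columns
redundant = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > 0.8
]
print(redundant)
```

For the plot itself, `sns.heatmap(corr, annot=True)` renders the same matrix as a colour-coded grid.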


## Analysis of categorical features

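Note that we cannot apply the same analysis directly to the categorical columns: Pearson correlation is only defined for numerical values. Mutual information is defined for discrete variables as well, but the columns would first have to be encoded as numbers.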

### Visualize frequencies/distribution of the nominal values

### Analyse relation between categorical features and target variable

Boxplots are a nice way to visualize the "correlation" between the classes of categorical features and the target variable.
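
The numbers behind such a boxplot can be sketched with a group-by. The prices below are invented; only the column names come from the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NoRidge"] * 3 + ["OldTown"] * 3,
    "SalePrice": [300000, 320000, 350000, 110000, 120000, 130000],
})

# Per-class summary of the target; clearly separated medians suggest
# that the categorical feature carries information about the price.
summary = df.groupby("Neighborhood")["SalePrice"].agg(["min", "median", "max"])
print(summary)
```

The corresponding plot would be something like `sns.boxplot(data=df, x="Neighborhood", y="SalePrice")`.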

## Machine Learning - House Price Prediction with Regression

In [None]:
# | convert categorical features to numbers
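
The notebook leaves the encoding open. One simple possibility is one-hot encoding with `pd.get_dummies`, sketched here on toy data (`df_cat_encoded` is an illustrative name):

```python
import pandas as pd

df_cat = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "MSZoning": ["RL", "RL", "RM"],
})

# One 0/1 indicator column per (feature, class) combination.
df_cat_encoded = pd.get_dummies(df_cat, dtype=int)
print(df_cat_encoded)
```

One-hot encoding avoids imposing an artificial order on the classes, at the price of one extra column per class.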


In [None]:
# | concatenate categorical and numerical features

# | define x,y
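
A minimal sketch of these two steps, assuming the numerical part and an already encoded categorical part share the same row index (all values are toy data):

```python
import pandas as pd

df_num = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})
df_cat_encoded = pd.DataFrame({
    "MSZoning_RL": [1, 1, 0],
    "MSZoning_RM": [0, 0, 1],
})

# Column-wise concatenation; rows are matched on the index.
data = pd.concat([df_num, df_cat_encoded], axis=1)

# Features are everything except the target column.
X = data.drop(columns="SalePrice")
y = data["SalePrice"]
print(X.shape, y.shape)
```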


In [None]:
from sklearn.model_selection import train_test_split
# | split data into train and test sets
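
The split could look like the following sketch. The 80/20 ratio and the fixed `random_state` are assumptions, and the data is a toy example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"LotArea": range(8000, 18000, 1000)})   # 10 toy rows
y = pd.Series(range(100000, 200000, 10000), name="SalePrice")

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```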


In [None]:
from sklearn.ensemble import RandomForestRegressor
# | define and train model
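
A hedged sketch of defining, training and scoring the model. The features and target are synthetic, and the hyperparameters are scikit-learn defaults chosen for illustration rather than tuned values:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n),
})
# Synthetic price: linear in both features plus noise.
y = 100 * X["GrLivArea"] + 15000 * X["OverallQual"] + rng.normal(0, 10000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For regressors, score() returns R^2 on the given data.
r2_test = model.score(X_test, y_test)
print(f"R^2 on the test set: {r2_test:.2f}")
```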


In [None]:
# | plot prediction vs. ground truth
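
One way to draw this plot is sketched below. The `y_true` and `y_pred` arrays are hypothetical placeholders for the test targets and the model's predictions:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values standing in for y_test and model.predict(X_test).
y_true = np.array([120000, 150000, 180000, 210000, 300000])
y_pred = np.array([125000, 140000, 190000, 200000, 270000])

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred, alpha=0.7)

# Diagonal of perfect predictions: points above it are overestimates.
limits = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(limits, limits, linestyle="--", color="gray")

ax.set_xlabel("Ground truth SalePrice")
ax.set_ylabel("Predicted SalePrice")
```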