# Machine Learning in sklearn: Data Exploration and Model Fitting

This notebook demonstrates a potential workflow for data exploration and Machine Learning model creation in Python using *sklearn*.

## Data Exploration

This section contains several methods for the descriptive statistical analysis of datasets.
Purpose: exploration and preparation of data for modeling.

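The script loads the Boston House Prices dataset from the *sklearn* package by default. Note that `load_boston` was removed in scikit-learn 1.2, so the loading cell requires an older release or a different dataset loader.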

In [None]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
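# load demo dataset
# note: load_boston was removed in scikit-learn 1.2; this cell requires an older
# release, or a different loader (e.g., fetch_california_housing)
from sklearn.datasets import load_boston
dataset = load_boston()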

## First check of data
*Print dataset description (only if available):*

In [None]:
if hasattr(dataset, 'DESCR'):
    print(dataset['DESCR'])

*Convert dataset to Pandas DataFrame (DF), add target data as new column, and check DF head:*

In [None]:
data_df = pd.DataFrame(data=dataset['data'])
if hasattr(dataset, 'feature_names'):
    data_df.columns=dataset['feature_names']
else:
    print("Please specify the feature names manually!")
data_df['TARGET'] = dataset['target']
print(data_df.head(5))

*Print DF statistics:*
- distribution of data
- sample count
- type of data

In [None]:
data_df.describe()

In [None]:
data_df.info()

*Count null values per feature:*

In [None]:
# count null values per feature:
print(data_df.isnull().sum())

# drop null values if present:
unprocessed_length = len(data_df)
data_df = data_df.dropna()
processed_length = len(data_df)
print(f"\nUnprocessed DF length: {unprocessed_length}, processed DF length: {processed_length}, dropped instances: {unprocessed_length - processed_length}.")

*Correlation/distribution preview:*

Plot Scatter Matrix of Data:

In [None]:
# scatter_matrix creates its own figure, so a separate plt.figure() call would only add an empty one
pd.plotting.scatter_matrix(data_df, figsize=(14, 12))
plt.show()

*Create correlation matrix and plot it as a heatmap:*
- features that are highly correlated with each other (positively or negatively) may carry redundant information and are candidates for removal, which can make modeling and prediction more efficient
- desired: features with a high correlation to the target

In [None]:
corr = data_df.corr()
fig = plt.figure(figsize=(14, 12))
sns.heatmap(corr, square=True, annot=True)  # switch off annot to hide values
plt.show()
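
The two points above can also be checked programmatically instead of by reading the heatmap. The following is a minimal sketch on synthetic data, not part of the original workflow: the DataFrame `demo_df`, its column names, and the cut-off `threshold = 0.9` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for data_df; names and values are illustrative only
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "A": a,
    "B": 2.0 * a + rng.normal(scale=0.1, size=n),  # nearly redundant with A
    "C": rng.normal(size=n),                        # independent noise
})
demo_df["TARGET"] = 3.0 * demo_df["A"] + rng.normal(scale=0.5, size=n)

corr = demo_df.corr()
threshold = 0.9  # hypothetical cut-off for "highly correlated"

# feature pairs whose absolute correlation exceeds the threshold
features = [c for c in corr.columns if c != "TARGET"]
redundant_pairs = [
    (f1, f2, corr.loc[f1, f2])
    for i, f1 in enumerate(features)
    for f2 in features[i + 1:]
    if abs(corr.loc[f1, f2]) > threshold
]

# features ranked by absolute correlation with the target
target_corr = corr["TARGET"].drop("TARGET").abs().sort_values(ascending=False)

print("Highly correlated feature pairs:", redundant_pairs)
print("Absolute correlation with target:\n", target_corr)
```

Whether a flagged feature should actually be dropped remains a judgment call; a high pairwise correlation only indicates a candidate.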

*Plot distribution information:*

Issues with the distribution (e.g., strong skew or outliers) may lead to problems during model fitting. Such issues can be compensated for by data transformation and/or under- or oversampling of the data.

- a) box and whisker plot
- b) histograms

In [None]:
# a) box and whisker plot
sns.set()  # use Seaborn Standard Formatting from here on

fig, ax = plt.subplots(2, 1, figsize=(14, 20))
data_df.plot(kind='box', logy=True, ax=ax[0], title='log. values')
data_df.plot(kind='box', logy=False, ax=ax[1], title='values')
plt.show()

In [None]:
# b) histograms
# DataFrame.hist creates its own figure, so a separate plt.figure() call would only add an empty one
data_df.hist(figsize=(14, 12))
plt.show()
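
As an illustration of the data transformation mentioned above, the sketch below measures skewness before and after a `log1p` transform. It runs on synthetic data; the DataFrame `skewed_df` and its column names are assumptions made for this example, and `log1p` is only one of several possible transforms.

```python
import numpy as np
import pandas as pd

# synthetic example data: one roughly symmetric and one right-skewed column
rng = np.random.default_rng(0)
skewed_df = pd.DataFrame({
    "symmetric": rng.normal(loc=10.0, scale=1.0, size=1000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
})

# sample skewness per column (values near 0 indicate a symmetric distribution)
skew_before = skewed_df.skew()

# log1p(x) = log(1 + x); only valid here because the column is non-negative
transformed_df = skewed_df.copy()
transformed_df["right_skewed"] = np.log1p(transformed_df["right_skewed"])
skew_after = transformed_df.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

If the target itself is transformed, predictions have to be mapped back (here with `np.expm1`) before they are interpreted.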

*No actions will be performed on the dataset in this demonstration.*

## Data preprocessing, model creation, and model fit

The following steps create the actual model and fit it to the data. The workflow assumes that the target is numeric, i.e., Regressor Models will be used for the fit.

Two models will be used:
- Random Forest Regressor (i.e., an ensemble of multiple Decision Trees)
- K-nearest Neighbors Regressor

In [None]:
# import scikit learn packages:
# data preprocessing:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
# models:
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.neighbors import KNeighborsRegressor as KNR

*Shuffle the DataFrame and split it into 20% Test Set and 80% Training Set.*

In [None]:
data_df = shuffle(data_df)
train_df, test_df = train_test_split(data_df, test_size=0.2)

# split dataframes into features and targets:
if hasattr(dataset, 'feature_names'):
    train_data = train_df[dataset['feature_names']]
    test_data = test_df[dataset['feature_names']]
else:
    print("Please provide the feature names manually!")
    
train_target = train_df['TARGET']
test_target = test_df['TARGET']

# create dictionary for the results of fitting and scoring:
res_dict = {}
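
The split above is random, so the scores change from run to run. A minimal sketch of a reproducible split follows, assuming a synthetic DataFrame `demo_df` built with `make_regression` and an arbitrary seed of 42. Since `train_test_split` shuffles by default, no separate `shuffle` call is used here.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# synthetic stand-in for data_df; the feature names f0..f3 are illustrative
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
demo_df = pd.DataFrame(X, columns=feature_names)
demo_df["TARGET"] = y

# a fixed random_state makes the shuffled split reproducible
train_df, test_df = train_test_split(demo_df, test_size=0.2, random_state=42)
train_df_again, test_df_again = train_test_split(demo_df, test_size=0.2, random_state=42)

print(f"Training rows: {len(train_df)}, test rows: {len(test_df)}")
print("Same test rows on repeat:", test_df.index.equals(test_df_again.index))
```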

*Creation of and fit to the Random Forest Regressor:*

In [None]:
# screening parameters - in this demonstration, the number of estimators and the tree depth will be varied:
n_estimators = [20, 50, 100]
max_depths = [2, 5, 10, 20]

for n_est in n_estimators:
    for max_dep in max_depths:
        current_identifier = f"RFR model (n_estimators: {n_est}, max_depth: {max_dep})"
        RFR_model = RFR(n_estimators=n_est, max_depth=max_dep)
        RFR_model.fit(train_data, train_target)
        train_score = RFR_model.score(train_data, train_target)
        test_score = RFR_model.score(test_data, test_target)
        res_dict[current_identifier] = {'model': RFR_model, 'train score': train_score, 'test score': test_score}
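
For scikit-learn regressors, `score` returns the coefficient of determination R² (1.0 is a perfect fit; values can be negative for very poor fits). The sketch below checks this on synthetic data by comparing `score` with `r2_score` and with a manual calculation; the dataset and hyperparameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# three routes to the same number
score_from_model = model.score(X_test, y_test)
y_pred = model.predict(X_test)
score_from_metric = r2_score(y_test, y_pred)

# manual R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
score_manual = 1.0 - ss_res / ss_tot

print(score_from_model, score_from_metric, score_manual)
```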

*Creation of and fit to the K-nearest Neighbors Regressor:*

In [None]:
# screening parameters - in this demonstration, only the number of neighbors will be varied:
n_neighbors = [2, 3, 5, 10]

for n_n in n_neighbors:
    current_identifier = f"KNR model (n_neighbors: {n_n})"
    KNR_model = KNR(n_neighbors=n_n)
    KNR_model.fit(train_data, train_target)
    train_score = KNR_model.score(train_data, train_target)
    test_score = KNR_model.score(test_data, test_target)
    res_dict[current_identifier] = {'model': KNR_model, 'train score': train_score, 'test score': test_score}
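
K-nearest Neighbors is distance-based, so features with large numeric ranges can dominate the neighbor search. The workflow above does not scale the features; the following sketch shows one possible way to add scaling with a pipeline. It is not part of the original workflow, and the synthetic data and the artificially enlarged first feature are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data; the first feature is rescaled to exaggerate the effect
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X[:, 0] *= 1000.0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# unscaled model: distances are dominated by the large-range feature
plain_knr = KNeighborsRegressor(n_neighbors=5)
plain_knr.fit(X_train, y_train)
plain_score = plain_knr.score(X_test, y_test)

# scaled model: the scaler is fitted on the training data only
scaled_knr = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_knr.fit(X_train, y_train)
scaled_score = scaled_knr.score(X_test, y_test)

print(f"Test R^2 without scaling: {plain_score:.2f}")
print(f"Test R^2 with scaling:    {scaled_score:.2f}")
```

Wrapping the scaler and the regressor in one pipeline keeps the test data out of the scaling statistics.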

*Iterate through results dictionary and create a results DataFrame:*

In [None]:
# collect one row per model and build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = [{'Model': key,
         'Train Score': round(value['train score'], 2),
         'Test Score': round(value['test score'], 2)}
        for key, value in res_dict.items()]
res_df = pd.DataFrame(rows, columns=['Model', 'Train Score', 'Test Score'])

# sort values by Test Score in descending order:
res_df.sort_values(by='Test Score', ascending=False, inplace=True)
res_df.reset_index(drop=True, inplace=True)

# print results:
print(res_df)
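
Ranking models by their Test Score means the test set also serves for model selection, so the reported score of the winner tends to be optimistic. One common alternative, shown here as a sketch rather than as the method of this notebook, is a cross-validated grid search on the training data with `GridSearchCV`. The synthetic data and the reduced parameter grid are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# reduced version of the screening grid used above
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 5]}

# 3-fold cross-validation on the training data only
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# one row per parameter combination, best rank first
cv_df = (pd.DataFrame(search.cv_results_)
         [["params", "mean_test_score", "rank_test_score"]]
         .sort_values("rank_test_score"))
print(cv_df)

# the test set is touched once, after selection
print("Best parameters:", search.best_params_)
print("Test R^2 of refitted best model:", search.score(X_test, y_test))
```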

### Feature Importance:

*Create a plot of the Feature Importance based on the best-performing Random Forest Regressor:*

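The Feature Importance reported by the Random Forest is impurity-based: for each feature, it is the normalized share of the total reduction of the split criterion achieved by splits on that feature, averaged over all trees. The values sum to 1; they rank the features by their contribution to the model rather than stating which proportion of the Target a Feature determines.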

In [None]:
# isolate identifier of best-performing Random Forest Regressor model:
for ind, row in res_df.iterrows():
    if 'RFR model' in row['Model']:
        identifier = row['Model']
        break

# isolate feature importances from model and create dictionary based on these values:
feature_imp_dict = dict(zip(dataset['feature_names'], res_dict[identifier]['model'].feature_importances_))
# convert dictionary to DataFrame:
imp_df = pd.DataFrame.from_dict(data=feature_imp_dict, orient='index')
imp_df.columns = ['Feature Importance']

# sort values in descending order:
imp_df.sort_values(by='Feature Importance', inplace=True, ascending=False)

# create plot (DataFrame.plot.bar creates its own figure):
imp_df.plot.bar(figsize=(12, 10))
plt.ylabel('Relative Feature Importance')
plt.xlabel('Feature')
plt.show()

# print Feature Importance DataFrame:
print("DataFrame of Feature Importance:\n\n", imp_df)
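
The `feature_importances_` values of a fitted Random Forest are non-negative and normalized to sum to 1. The sketch below illustrates this on synthetic data in which only the first two features (`f0`, `f1`) influence the target; the data, the feature names, and the hyperparameters are assumptions, and the informative features would typically, though not necessarily, rank first.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# shuffle=False keeps the informative features in the first columns
X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       noise=5.0, shuffle=False, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# label the importances and sort them in descending order
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```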

### Use the best-performing model for predictions:

In [None]:
# load model and print model properties:
best_model = res_dict[res_df.iloc[0]['Model']]['model']
print(f"Best-performing model: {res_df.iloc[0]['Model']}.\nModel properties:\n {best_model}")

In [None]:
# using the best-performing model for predictions:
# use first DF row as demo test dataset; replace this row by actual data for real predictions
# keep the row as a DataFrame so the feature names match those seen during fitting
test_dataset = data_df.iloc[[0]][dataset['feature_names']]
prediction = best_model.predict(test_dataset)
print(f"Features:\n{test_dataset}\n\n Model prediction: {prediction[0]}")