# WQD7005 - Data Mining
## Lab Exercise 2

#### Matrix Number : 17043640

#### Name                 : Gunasegarran Magadevan

#### 1. Prerequisite
Perform the following steps before attempting lab test 2:

- Import pandas as "pd" and load the lab 1 dataset (lab1.csv) into "df".
- Print the dataset information to refresh your memory.
- Run the data_prep function to perform the preprocessing steps discussed last week (lab 1 dataset).

In [1]:
import pandas as pd

# read the dataset
df = pd.read_csv('lab1.csv')
df.head()

Unnamed: 0,TargetB,ID,TargetD,GiftCnt36,GiftCntAll,GiftCntCard36,GiftCntCardAll,GiftAvgLast,GiftAvg36,GiftAvgAll,...,PromCntCardAll,StatusCat96NK,StatusCatStarAll,DemCluster,DemAge,DemGender,DemHomeOwner,DemMedHomeValue,DemPctVeterans,DemMedIncome
0,0,14974,,2,4,1,3,17.0,13.5,9.25,...,13,A,0,0,,F,U,0,0,0
1,0,6294,,1,8,0,3,20.0,20.0,15.88,...,24,A,0,23,67.0,F,U,186800,85,0
2,1,46110,4.0,6,41,3,20,6.0,5.17,3.73,...,22,S,1,0,,M,U,87600,36,38750
3,1,185937,10.0,3,12,3,8,10.0,8.67,8.5,...,16,E,1,0,,M,U,139200,27,38942
4,0,29637,,1,1,1,1,20.0,20.0,20.0,...,6,F,0,35,53.0,M,U,168100,37,71509


In [2]:
import numpy as np
import pandas as pd

def data_prep():
    # read the dataset inside the function so it does not depend on a global df
    df = pd.read_csv('lab1.csv')

    # change DemCluster from interval/integer to nominal/str
    df['DemCluster'] = df['DemCluster'].astype(str)

    # change DemHomeOwner into binary 0/1 variable
    dem_home_owner_map = {'U': 0, 'H': 1}
    df['DemHomeOwner'] = df['DemHomeOwner'].map(dem_home_owner_map)

    # mark erroneous values (below 1) in DemMedIncome as missing
    df['DemMedIncome'] = df['DemMedIncome'].mask(df['DemMedIncome'] < 1)

    # impute missing values in DemAge with its mean
    df['DemAge'] = df['DemAge'].fillna(df['DemAge'].mean())

    # impute med income using mean
    df['DemMedIncome'] = df['DemMedIncome'].fillna(df['DemMedIncome'].mean())

    # impute gift avg card 36 using mean
    df['GiftAvgCard36'] = df['GiftAvgCard36'].fillna(df['GiftAvgCard36'].mean())

    # drop ID and the unused target variable
    df = df.drop(['ID', 'TargetD'], axis=1)

    # one-hot encoding of the remaining categorical columns
    df = pd.get_dummies(df)

    return df

In [3]:
# auto import the python function from Lab_Exercise_2 path
from dm_tools import data_prep

#### 2. Data Partitioning
Perform the following operations and answer the following questions:

##### a) Describe training, validation and test dataset.
##### b) What is the purpose for each of these split?
- Training dataset: the set of examples used to fit (build) the model. Once a model is built, we still need to measure how good it is, which is the job of the other two sets.
- Validation dataset: a set of examples used to evaluate a model during development. It is unseen (not used in the training/fitting process) and typically has a distribution similar to the training dataset. It is commonly used to estimate performance and to choose among a number of candidate models. Comparing performance on the training and validation sets also reveals whether the model has overfit the training dataset, that is, whether it has learnt relationships specific to the provided data that do not hold in general. An overfit model tends to perform much worse outside the training set. We will see an example of overfitting in this practical note.
- Test dataset: a set of examples used to estimate how the model will perform in practice. Like the validation dataset it is unseen, but it is not used in the model selection process, so its estimate is not influenced by that selection.
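
The three roles above can be illustrated with a minimal sketch. It assumes synthetic stand-in data and a 60/20/20 split; the names `X_demo`, `y_demo` and the `_demo` suffixes are hypothetical and not part of the lab dataset. The split is produced by calling `train_test_split` twice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption): 100 samples, 4 features, balanced binary target
rng = np.random.RandomState(10)
X_demo = rng.rand(100, 4)
y_demo = np.array([0, 1] * 50)

# step 1: hold out 20% of all samples as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=10)

# step 2: split the remaining 80 samples 75/25 into training and validation sets
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=10)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 60 20 20
```

The model would be fitted on the training part, compared against alternatives on the validation part, and scored once on the test part at the very end.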
    
##### c) What is k-fold cross-validation?
- A common problem with a separate validation dataset is that it drastically reduces the number of samples available for training the model. In addition, the random split used for train/validation can influence which model is selected during training. To address both problems, k-fold cross-validation (k-fold CV) is commonly used.
- In k-fold cross-validation, an exclusive validation dataset is no longer required. Instead, the training dataset is randomly partitioned into $k$ equal-sized subsamples. Of the $k$ subsamples, a single subsample is retained as the validation data for testing the model, and the remaining $k-1$ subsamples are used as training data. The process is then repeated $k$ times (the folds), with each of the $k$ subsamples used exactly once as the validation data. The $k$ results from the folds can then be averaged (or otherwise combined) to produce a single estimate of model performance.
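
The partitioning described above can be checked with a small sketch using scikit-learn's `KFold`. The toy array `X_toy` and the choice of $k=5$ are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with 2 features each (assumption, not the lab data)
X_toy = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5, shuffle=True, random_state=10)

# count how many times each sample lands in a validation fold
val_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    val_counts[val_idx] += 1
    print("fold", fold, "train size:", len(train_idx), "validation size:", len(val_idx))

print(val_counts)  # every entry is 1: each sample is validated exactly once
```

With 10 samples and 5 folds, every fold trains on 8 samples and validates on the remaining 2.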
    
##### d) What are the advantages and disadvantages of k-fold CV compared to a fixed training/validation/test split?
- Cross-validation makes the performance estimate less dependent on the randomness of a single train/validation split and reduces data waste (very beneficial for small datasets). The drawback is computational cost: the model is trained $k$ times, so training time grows by roughly a factor of $k$. We will learn more about CV later in this tutorial through the usage of GridSearchCV.
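
As a rough illustration of this trade-off, the sketch below runs 5-fold CV with `cross_val_score` on synthetic data from `make_classification`. The data, the tree depth and $k=5$ are assumptions. One score is returned per fold, which means one model fit per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption): 200 samples, 8 features
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=10)

# cv=5 fits the tree five times, each time validating on a different fold
tree = DecisionTreeClassifier(max_depth=3, random_state=10)
scores = cross_val_score(tree, X_syn, y_syn, cv=5)

print("number of fits:", len(scores))
print("mean CV accuracy:", scores.mean())
```

The averaged score uses every sample for validation once, at the price of five fits instead of one.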

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
# preprocessing step
df = data_prep()

# target/input split (keep X as a DataFrame so X.columns is available later)
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)

In [6]:
# setting random state
rs = 10

# convert the DataFrame to a NumPy array (as_matrix() has been removed from pandas)
X_mat = X.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.3, stratify=y, random_state=rs)

#### 3. Decision Tree
Perform the following operations and answer the question.

##### a) Import and build a decision tree classifier. 
##### b) Set the random state to 0 so that your results match the answers.
##### c) Fit it against the training data.
##### d) What is the performance of the model against training data? 
##### e) How about against the test data? Do you see any indication of overfitting here?
##### f) What are the top 5 most important features in this model?
##### g) Visualise the structure of your decision tree. Can you identify characteristics of important features?

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

# simple decision tree training
model = DecisionTreeClassifier(random_state=rs)
model.fit(X_train, y_train)

In [None]:
print("Train accuracy:", model.score(X_train, y_train), ", Test accuracy:", model.score(X_test, y_test))

In [None]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
import numpy as np

# grab feature importances from the model and feature name from the original X
importances = model.feature_importances_
feature_names = X.columns

# sort them out in descending order
indices = np.argsort(importances)
indices = np.flip(indices, axis=0)

# limit to 20 features, you can leave this out to print out everything
indices = indices[:20]

for i in indices:
    print(feature_names[i], ':', importances[i])

In [None]:
import pydot
from io import StringIO
from sklearn.tree import export_graphviz

# visualize
dotfile = StringIO()
export_graphviz(model, out_file=dotfile, feature_names=X.columns)
graph = pydot.graph_from_dot_data(dotfile.getvalue())
graph[0].write_png("week3_dt_viz.png")

In [None]:
# retrain with a small max_depth limit

model = DecisionTreeClassifier(max_depth=3, random_state=rs)
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
importances = model.feature_importances_
feature_names = X.columns

# sort them out in descending order
indices = np.argsort(importances)
indices = np.flip(indices, axis=0)

# limit to 20 features, you can leave this out to print out everything
indices = indices[:20]

for i in indices:
    print(feature_names[i], ':', importances[i])

# visualize
dotfile = StringIO()
export_graphviz(model, out_file=dotfile, feature_names=X.columns)
graph = pydot.graph_from_dot_data(dotfile.getvalue())
graph[0].write_png("week3_dt_viz.png")  # overwrites the image saved by the earlier cell

In [None]:
test_score = []
train_score = []

# check the model performance for max depth from 2-20
for max_depth in range(2, 21):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=rs)
    model.fit(X_train, y_train)
    
    test_score.append(model.score(X_test, y_test))
    train_score.append(model.score(X_train, y_train))

In [None]:
import matplotlib.pyplot as plt

# plot max depth hyperparameter values vs training and test accuracy score
plt.plot(range(2, 21), train_score, 'b', range(2,21), test_score, 'r')
plt.xlabel('max_depth\nBlue = training acc. Red = test acc.')
plt.ylabel('accuracy')
plt.show()

In [None]:
# grid search CV
from sklearn.model_selection import GridSearchCV
params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(2, 7),
          'min_samples_leaf': range(20, 60, 10)}

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(random_state=rs), cv=10)
cv.fit(X_train, y_train)

print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

# test the best model
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)
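
To get a feel for the cost of a grid search like the one above, the number of model fits can be counted up front. This is a minimal sketch: `params_demo` is a hypothetical copy of the first grid (with the ranges written out as lists), and the count assumes 10 folds and the default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical copy of the first parameter grid, ranges expanded to lists
params_demo = {'criterion': ['gini', 'entropy'],
               'max_depth': list(range(2, 7)),
               'min_samples_leaf': list(range(20, 60, 10))}

n_candidates = len(ParameterGrid(params_demo))  # 2 * 5 * 4 combinations
n_folds = 10
n_fits = n_candidates * n_folds + 1             # +1 for the final refit on all training data

print(n_candidates, n_fits)  # 40 401
```

Narrowing the grid, as the second search does, is one way to keep this count manageable.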

In [None]:
# grid search CV #2
params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(2, 6),
          'min_samples_leaf': range(45, 56)}

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(random_state=rs), cv=10)
cv.fit(X_train, y_train)

print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

# test the best model
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

In [None]:
import numpy as np
import pydot
from io import StringIO
from sklearn.tree import export_graphviz

def analyse_feature_importance(dm_model, feature_names, n_to_display=20):
    # grab feature importances from the model
    importances = dm_model.feature_importances_
    
    # sort them out in descending order
    indices = np.argsort(importances)
    indices = np.flip(indices, axis=0)

    # limit to the top n_to_display features (default 20)
    indices = indices[:n_to_display]

    for i in indices:
        print(feature_names[i], ':', importances[i])

def visualize_decision_tree(dm_model, feature_names, save_name):
    dotfile = StringIO()
    export_graphviz(dm_model, out_file=dotfile, feature_names=feature_names)
    graph = pydot.graph_from_dot_data(dotfile.getvalue())
    graph[0].write_png(save_name)



In [None]:
# do the feature importance and visualization analysis on GridSearchCV's best model
from dm_tools import analyse_feature_importance, visualize_decision_tree

analyse_feature_importance(cv.best_estimator_, X.columns, 20)
visualize_decision_tree(cv.best_estimator_, X.columns, "optimal_tree.png")