***
# ISOM3360 Data Mining for Business Analytics
## Group 23 Project Code - Credit Card Defaultee Analysis
### Part 2.3 - Naive Bayes Classifier
***

Name: LAM, Ho Chit  
ITSC: hclamao  
SID: 20607878

Name: LEE, Ho Wan Owen  
ITSC: hwolee  
SID: 20604852

Name: LEE, Wai Chung  
ITSC: wcleeaj  
SID: 20702733

### Workflow of this notebook (TBC)

1. Explore features and characteristics of dataset
2. Drop columns of low data quality (e.g. large amounts of empty values)
3. Determine $k$ columns to keep in the dataset (feature selection)
4. Perform one-hot encoding
5. Split into training and testing sets
6. Perform data cleaning
   - Dealing with missing values
7. Perform data standardization / normalization
8. Export preprocessed data to .csv files at `./data_preprocessed/`
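
Steps 4 and 7 can be illustrated with a minimal sketch. The toy rows below are hypothetical stand-ins, not the project data, and the column names are chosen only so that the outputs resemble the `Sex_male`, `Embarked_Q`, `Embarked_S` and `*_zscore` features used later in this notebook. In a real pipeline the mean and standard deviation would normally be taken from the training split only.

```python
import pandas as pd

# hypothetical toy rows standing in for the raw data (not the real dataset)
raw = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# one-hot encode the categorical columns; drop_first removes one redundant level per column
encoded = pd.get_dummies(raw, columns=["Sex", "Embarked"], drop_first=True)

# z-score standardization: subtract the mean, then divide by the standard deviation
for col in ["Age", "Fare"]:
    encoded[col + "_zscore"] = (encoded[col] - encoded[col].mean()) / encoded[col].std()

print(encoded.columns.tolist())
```

Dropping the first level avoids carrying a column that is fully determined by the others, which is why no `Sex_female` or `Embarked_C` column appears.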

### First Decision Tree
We will use all of the training data (891 examples) to construct the tree and evaluate the model.

#### Step 1: Define features and target variable

In [None]:
# define independent variables / attributes / features
features = ['Pclass','Age_zscore','SibSp','Parch','Fare_zscore','Sex_male','Embarked_Q','Embarked_S']
# define one single target variable / label
target = ['Survived']

# get defined training dataset
X = train_df[features]
y = train_df[target]

X.info()

#### Step 2: Split data into training and validation sets

In [None]:
# import train split function
from sklearn.model_selection import train_test_split

# split data into 80% training and 20% validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=3360)

X_train.info()
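
With 891 rows and `test_size=0.2`, scikit-learn rounds the validation size up, so the split yields 712 training rows and 179 validation rows. The sketch below checks this on hypothetical stand-in arrays (`X_demo`, `y_demo` are not the project data) and also shows the optional `stratify` argument, which keeps the class ratio similar in both parts. Whether stratification is wanted here is left as an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-in: 891 rows, matching the row count quoted above
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.array([0, 1, 1] * 297)  # 891 labels, one third of them zeros

plain = train_test_split(X_demo, y_demo, test_size=0.2, random_state=3360)
strat = train_test_split(X_demo, y_demo, test_size=0.2, random_state=3360,
                         stratify=y_demo)

for name, (X_tr, X_va, y_tr, y_va) in [("plain", plain), ("stratified", strat)]:
    print(name, len(X_tr), len(X_va), "positive rate in validation:", y_va.mean())
```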

#### Step 3: Build a Tree based on 80% train data

In [None]:
# import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

def create_model(**params):
    
    model = DecisionTreeClassifier(random_state=3360, **params)

    # train the model by fitting it on the 80% training split
    model.fit(X_train, y_train)

    print("Depth:", model.get_depth())
    print("Leaves:", model.get_n_leaves())

    return model

In [None]:
# define model by using default hyperparameter values
model = create_model()

# get prediction for X_val
pred_val = model.predict(X_val)

#### Visualize the Decision Tree

In [None]:
from sklearn import tree
import matplotlib.pyplot as plt

# function for simple tree visualization

def simple_tree_vis(model):
    plt.figure(figsize = (100,150))
    tree.plot_tree(model,ax=None, fontsize=50)
    plt.show()
    return None

simple_tree_vis(model)

In [None]:
from sklearn.tree import export_graphviz
import graphviz

# function for fancy tree visualization (requires Graphviz to be installed)

def tree_vis(model):
    # call export_graphviz directly so this cell does not depend on `tree` from the previous cell
    dot_data = export_graphviz(model, out_file=None,
                      feature_names=features,
                      class_names=['Did not survive', 'Survived'],
                      filled=True, rounded=True,
                      special_characters=True)
    graph = graphviz.Source(dot_data)
    graph.render("titanic_decisiontree")
    return graph

# uncomment the next line for graphical representation of the decision tree
# tree_vis(model)
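
If Graphviz is not available, scikit-learn's `export_text` offers a plain-text view of the same splits. This is a minimal sketch on synthetic data; `demo_tree` and `demo_names` are hypothetical stand-ins for `model` and `features`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# synthetic stand-in data (assumption: replace with X_train / y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=3360)
demo_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# a shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=3360).fit(X_demo, y_demo)

# one line per split or leaf, indented by depth
rules = export_text(demo_tree, feature_names=demo_names)
print(rules)
```

For the notebook's own tree, the equivalent call would be `export_text(model, feature_names=features)`.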

#### Step 4: Evaluate the model on the 20% validation set

- Calculate:
  - Accuracy
  - Precision
  - Recall
  <!-- - F1 score -->
- Display confusion matrix
- Plot curves:
  - Precision-Recall curve
  - ROC curve
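
The two curves in the list above can be sketched as follows. This is an illustrative example on synthetic data, not the notebook's own evaluation code: `plot_curves`, `demo_model` and the `Xd_*` / `yd_*` names are hypothetical, and the positive class is assumed to be label 1.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: replace with the notebook's own splits)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=3360)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3360)
demo_model = DecisionTreeClassifier(max_depth=4, random_state=3360).fit(Xd_train, yd_train)

def plot_curves(model, X_eval, y_eval):
    # predicted probability of the positive class (column 1 of predict_proba)
    scores = model.predict_proba(X_eval)[:, 1]
    precision, recall, _ = precision_recall_curve(y_eval, scores)
    fpr, tpr, _ = roc_curve(y_eval, scores)
    roc_auc = auc(fpr, tpr)

    fig, (ax_pr, ax_roc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_pr.plot(recall, precision)
    ax_pr.set_xlabel("Recall")
    ax_pr.set_ylabel("Precision")
    ax_pr.set_title("Precision-Recall curve")

    ax_roc.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
    ax_roc.plot([0, 1], [0, 1], linestyle="--")  # chance-level reference line
    ax_roc.set_xlabel("False positive rate")
    ax_roc.set_ylabel("True positive rate")
    ax_roc.set_title("ROC curve")
    ax_roc.legend()
    return fig, roc_auc

fig, roc_auc = plot_curves(demo_model, Xd_val, yd_val)
print("ROC AUC:", roc_auc)
```

A decision tree yields only as many distinct probability values as it has leaves, so both curves tend to look piecewise rather than smooth. Call `plt.show()` if the figure does not display automatically.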

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report
# from sklearn.metrics import roc_curve, precision_recall_curve, auc

def evaluate_model(model):

    print("---------- Evaluation ----------\n")

    # evaluate on the training split first, then on the validation split
    for name, X_eval, y_eval in [("Training", X_train, y_train),
                                 ("Validation", X_val, y_val)]:
        print("Evaluation:", name)
        preds = model.predict(X_eval)

        # output all metrics scores
        print("\tAccuracy:", accuracy_score(y_eval, preds))
        print("\tPrecision:", precision_score(y_eval, preds))
        print("\tRecall:", recall_score(y_eval, preds))

        # display confusion matrix
        print("\tConfusion matrix:\n", confusion_matrix(y_eval, preds))

        # print classification report
        print("\tClassification report:\n", classification_report(y_eval, preds))

In [None]:
# evaluate model

evaluate_model(model)

Since the training accuracy is extremely close to 100% and substantially higher than the validation accuracy, the model with default hyperparameters is severely overfitting.  
Three techniques are used below to reduce overfitting and to measure it more reliably:
- Hyperparameter tuning (manual)
- Cross validation
- Hyperparameter tuning (via GridSearchCV)
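
To make the overfitting diagnosis concrete, the gap between training and validation accuracy can be tracked as the tree is allowed to grow deeper. The sketch below uses synthetic, deliberately noisy data as a stand-in, so the printed numbers say nothing about this project's dataset; all `*_demo` and `Xd_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with 10% label noise (assumption, not the project data)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, flip_y=0.1,
                                     random_state=3360)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3360)

gaps = {}
for depth in [2, 4, 8, None]:  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=3360)
    clf.fit(Xd_train, yd_train)
    train_acc = clf.score(Xd_train, yd_train)
    val_acc = clf.score(Xd_val, yd_val)
    gaps[depth] = (train_acc, val_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}  val={val_acc:.3f}  "
          f"gap={train_acc - val_acc:.3f}")
```

An unrestricted tree memorizes the training rows, so its training accuracy reaches 100% regardless of how well it generalizes. That is the pattern described above.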

### Manual Hyperparameter Tuning

##### max_depth = 8

In [None]:
model1 = create_model(max_depth=8)
evaluate_model(model1)

##### max_leaf_nodes = 50

In [None]:
model2 = create_model(max_leaf_nodes=50)
evaluate_model(model2)

##### min_samples_split = 2

In [None]:
# note: 2 is scikit-learn's default for min_samples_split, so this reproduces the default tree
model3 = create_model(min_samples_split=2)
evaluate_model(model3)

##### min_samples_leaf = 6

In [None]:
model4 = create_model(min_samples_leaf=6)
evaluate_model(model4)

##### min_impurity_decrease = 0.05

In [None]:
model5 = create_model(min_impurity_decrease=0.05)
evaluate_model(model5)

##### Combination of hyperparameters above
- max_depth = 8
- max_leaf_nodes = 50
- min_samples_split = 2
- min_samples_leaf = 6
<!-- - min_impurity_decrease = 0.1 -->

In [None]:
model6 = create_model(max_depth=8, 
                      max_leaf_nodes=50,
                      min_samples_split=2,
                      min_samples_leaf=6)
evaluate_model(model6)

### 10-fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

score_cv = cross_val_score(model6, X, y, cv=10)
print("CV results:", score_cv)
print("Mean =", score_cv.mean())
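
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. A minimal sketch of a more explicit setup follows, with shuffled stratified folds and the spread of the scores reported alongside the mean. The data is synthetic and `demo_tree` is a hypothetical stand-in for `model6`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: replace with X and y)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=3360)

demo_tree = DecisionTreeClassifier(max_depth=8, min_samples_leaf=6, random_state=3360)

# shuffled, stratified 10-fold split with a fixed seed for reproducibility
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=3360)
scores = cross_val_score(demo_tree, X_demo, y_demo, cv=folds, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation helps judge whether a difference between two hyperparameter settings is larger than the fold-to-fold variation.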

### GridSearchCV

In [None]:
import numpy as np

# create try_grid
try_grid = [{'max_depth': np.arange(3, 16),                     # 3 to 15
             'max_leaf_nodes': np.arange(1, 19) * 5,            # 5, 10, 15, ..., 90
             'min_samples_split': np.arange(2, 7),              # 2 to 6
             'min_samples_leaf': np.arange(3, 10),              # 3 to 9
             'min_impurity_decrease': np.linspace(0, 0.2, 9)}]  # 0, 0.025, 0.05, ..., 0.2

In [None]:
from sklearn.model_selection import GridSearchCV

# create GridSearchCV object
DTM = GridSearchCV(DecisionTreeClassifier(random_state=3360), param_grid=try_grid, cv=10, verbose=1)

In [None]:
DTM.fit(X, y)

print("Best params:", DTM.best_params_)
print("Best score :", DTM.best_score_)
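
A grid search fits one model per parameter combination per fold, so its cost grows multiplicatively with each added hyperparameter. The helper below is a hypothetical sketch (`count_fits` and `small_grid` are not part of the notebook) that counts the fits before a search is launched; it does not include the single final refit on the full data.

```python
from sklearn.model_selection import ParameterGrid

def count_fits(param_grid, n_folds):
    # ParameterGrid enumerates every combination, so its length is the candidate count
    n_candidates = len(ParameterGrid(param_grid))
    return n_candidates, n_candidates * n_folds

# small hypothetical grid: 3 * 3 * 2 = 18 candidates
small_grid = {"max_depth": [3, 5, 8],
              "min_samples_leaf": [3, 6, 9],
              "max_leaf_nodes": [10, 50]}

n_candidates, n_fits = count_fits(small_grid, n_folds=10)
print(n_candidates, "candidates,", n_fits, "fits")
```

Calling `count_fits(try_grid, 10)` gives the corresponding figure for the grid defined above. If the count is large, passing `n_jobs=-1` to `GridSearchCV` spreads the fits across CPU cores.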

In [None]:
# create instance of best model
best_model = create_model(**DTM.best_params_)

evaluate_model(best_model)
simple_tree_vis(best_model)
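
Because `DTM.fit(X, y)` sees every row, including those in `X_val`, the validation scores printed for `best_model` may be optimistic. One alternative is to run the search on the training split only and keep the validation split untouched until the end. The sketch below shows this on synthetic data with a small hypothetical grid (`demo_grid`); it is not the procedure used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: replace with the notebook's X and y)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=3360)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3360)

# small hypothetical grid to keep the example fast
demo_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [3, 6]}

# the search only ever sees the training split
search = GridSearchCV(DecisionTreeClassifier(random_state=3360),
                      param_grid=demo_grid, cv=5)
search.fit(Xd_train, yd_train)

# the held-out split is scored once, with the refitted best estimator
val_acc = search.score(Xd_val, yd_val)
print("Best params:", search.best_params_)
print("Held-out accuracy:", val_acc)
```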

### Generate data file for prediction results

In [None]:
import pandas as pd

# create dataframe for prediction results
preds = pd.DataFrame(index=test_df.index, columns=['Survived'])

# store prediction results of best model into dataframe
preds['Survived'] = best_model.predict(test_df[features])

# export to csv file
preds.to_csv('prediction.csv')

preds.describe()

### Conclusion and findings

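The tuned decision tree gives fairly predictive results, whereas the tree with default hyperparameters severely overfit the training data. See the evaluation output above for the exact scores.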

## This is the end of Part 2.3 Naive Bayes Classifier.