In [None]:
import pandas as pd
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz as gp
import statsmodels.api as sm
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import export_graphviz
from sklearn.metrics import accuracy_score

# Part 1: Running CART on the Titanic dataset

The goal here is to run CART on the full Titanic dataset. We take the training dataset provided by Kaggle and split it three ways into training, validation, and testing sets; the testing set provided by Kaggle does not include the "Survived" column, which makes it impossible to measure the quality of our model on it. This dataset has already been fully pre-processed. Note that the preprocessing was carried out separately on each split to avoid train-test contamination.

In [None]:
Titanic_train=pd.read_csv("Titanic_train_cleaned.csv")
Titanic_test=pd.read_csv("Titanic_test_cleaned.csv")
Titanic_validation=pd.read_csv("Titanic_val_cleaned.csv")

In [None]:
ytrain=Titanic_train[["Survived"]]
Xtrain=Titanic_train.drop(columns="Survived")

In [None]:
yvalidation=Titanic_validation[["Survived"]]
Xvalidation=Titanic_validation.drop(columns="Survived")

In [None]:
ytest=Titanic_test[["Survived"]]
Xtest=Titanic_test.drop(columns="Survived")

In [None]:
Xtrain

In [None]:
ytrain

1. With the maximum number of leaves set to 6, use the following snippet of code to fit a CART to the training dataset.

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier_DT = DecisionTreeClassifier(max_leaf_nodes = 6)
classifier_DT.fit(Xtrain, ytrain)

2. Graph the decision tree using the code below.

In [None]:
from sklearn.tree import export_graphviz
dot_data = export_graphviz(classifier_DT, feature_names = Xtrain.columns, filled = True, rounded = True, class_names=["Died","Survived"])
graph = gp.Source(dot_data)
graph

3. How accurate is this decision tree on the validation data? Generate the predicted values using the code below and then compute the accuracy of the model.

In [None]:
y_pred = classifier_DT.predict(Xvalidation)

from sklearn.metrics import accuracy_score
score=accuracy_score(yvalidation,y_pred)
score

4. We are now going to use the validation set to choose how many leaves our tree should have. The goal is to use a for loop to iterate over possible values of the maximum number of leaves, fit the model each time on the *training set*, and obtain the `accuracy_score` on the *validation set*. Use the snippet of code below to get started. For comparison, also find the `accuracy_score` on the *training set*.

In [None]:
n_max_leaf_nodes = range(2,60) # Let's train the models with 2, 3, 4, ..., 59 leaves (range excludes its endpoint)
train_array = []
validation_array = []

for n in n_max_leaf_nodes:
    
    #insert here the code to train the Decision Tree on the training set
    
    #insert here the code that makes predictions for both the validation and the training sets 
    
    #insert here the code that gives us the accuracy of the model on the validation and training sets
    #store the accuracy values as validation_score and train_score

    train_array.append([n,train_score])
    validation_array.append([n,validation_score])
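
For reference, the sweep described in question 4 can be sketched as follows. This is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than on the Titanic files, and the names `X_tr`, `X_val`, `train_scores` and `validation_scores` are placeholders of our own, not the variables used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (assumption: not the course files).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

train_scores = []
validation_scores = []
for n in range(2, 30):
    # Fit on the training split only.
    tree = DecisionTreeClassifier(max_leaf_nodes=n, random_state=0)
    tree.fit(X_tr, y_tr)
    # Score the same fitted tree on both splits.
    train_scores.append([n, accuracy_score(y_tr, tree.predict(X_tr))])
    validation_scores.append([n, accuracy_score(y_val, tree.predict(X_val))])

# Leaf count with the highest validation accuracy (first one in case of ties).
best_n, best_validation_score = max(validation_scores, key=lambda pair: pair[1])
print(best_n, best_validation_score)
```

Picking the single best validation score is one possible rule; when several leaf counts perform almost equally well, a smaller tree is often the more comfortable choice.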

5. We can now plot both accuracy levels against the maximum number of leaves. Which value of `max_leaf_nodes` would you feel comfortable picking?

In [None]:
array = pd.DataFrame(validation_array)
plt.scatter(array[0],array[1])

array_train = pd.DataFrame(train_array)
plt.scatter(array_train[0],array_train[1])
plt.legend(['Validation set','Training set'])
plt.xlabel("Number of leaves",fontsize=15)
plt.ylabel("Accuracy",fontsize=15)
plt.show()

6. Retrain your model using this value of `max_leaf_nodes` on the combined training and validation sets. Plot the corresponding tree.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
Xtrainval=pd.concat([Xtrain,Xvalidation])
ytrainval=pd.concat([ytrain,yvalidation])

In [None]:
#insert here code to retrain your model on the training + validation set

In [None]:
#insert here code to graph tree

7. What is the accuracy_score on the testing set?

8. Pulling up the Titanic Movie dataset, what do you predict for Rose and Jack in terms of survival? If you have seen the movie, does it agree with the outcomes they meet there?

In [None]:
Titanic_movie=pd.read_csv("Titanic_movie.csv")
Titanic_movie

In [None]:
classifier_DT.predict(Titanic_movie.drop(columns="Name"))

# Part 2: The Boston house prices dataset - CART and more advanced tools for regression (optional)

In this question, we use the Boston housing dataset, available from, e.g., http://lib.stat.cmu.edu/datasets/boston .

In [None]:
boston=sm.datasets.get_rdataset("Boston","MASS")

In [None]:
boston_data=boston.data

## A. Regression Trees

1. As usual, take a look at the header of the dataset and make sure you understand the features before you begin. Are there any duplicates? Any empty values?
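The checks in question 1 can be sketched on a small toy frame. The values below are made up for illustration and are not taken from the Boston data; only the column names mimic it.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one duplicated row and one missing entry.
toy = pd.DataFrame({
    "rm": [6.5, 6.5, 5.9, 7.1],
    "lstat": [4.9, 4.9, np.nan, 3.0],
    "medv": [24.0, 24.0, 19.5, 34.7],
})

print(toy.head())                        # first rows of the dataset
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
missing_per_column = toy.isna().sum()       # count of empty values per column
print(n_duplicates)
print(missing_per_column)
```

The same three calls should work unchanged on `boston_data`.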

2. We will be attempting to predict the median value of the houses based on the other features using CART. In class, we saw how to use decision trees to *classify*; now our goal is to perform *regression*. To do this, create the appropriate feature dataset `X` and the appropriate labels `y`. Then separate these into train/validation/test in proportions of 50/25/25 using `train_test_split`.
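Since `train_test_split` only produces two pieces at a time, one way to get a 50/25/25 split is to call it twice. The sketch below uses placeholder arrays and variable names of our own (suffix `_demo`), not the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows, 2 features (assumption, for illustration only).
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.arange(200)

# First call: keep 50% for training, hold out the other 50%.
X_train_demo, X_rest, y_train_demo, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)
# Second call: cut the held-out half in two, giving 25% validation and 25% test.
X_val_demo, X_test_demo, y_val_demo, y_test_demo = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 100 50 50
```

Fixing `random_state` makes the split reproducible, which helps when comparing models later on.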

3. Using `DecisionTreeRegressor`, which works very similarly to `DecisionTreeClassifier`, plot a decision tree for this problem with `max_leaf_nodes=8`. Try to interpret the tree: which variables seem to determine the value of a property? What information is given for each node of the tree?

4. Apply the fitted decision tree to `Xvalidation` to obtain `ypredict`. Compute the MSE between `ypredict` and `yvalidation` using `mean_squared_error`. Take a look at its square root. What do you think?
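As a reminder of what these two quantities are, here is a small worked example with made-up numbers (not predictions from the model above).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values, for illustration only.
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_demo = np.array([22.0, 23.6, 30.7, 33.4])

# Errors are 2, -2, 4, 0, so the squared errors are 4, 4, 16, 0.
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # (4 + 4 + 16 + 0) / 4 = 6.0
rmse_demo = np.sqrt(mse_demo)                            # about 2.45
print(mse_demo, rmse_demo)
```

The square root is expressed in the same units as the target, which makes it easier to judge against typical values of `medv` than the MSE itself.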

5. Similarly to what was done in lecture, (1) train this model for many different values of `max_leaf_nodes`, (2) predict the values taken on the validation set, (3) compute the MSE between predicted values and real values, (4) append this value to the `array` variable.

In [None]:
n_max_leaf_nodes = range(2,40) # Let's train the models with 2, 3, 4, ..., 39 leaves (range excludes its endpoint)

array = []

for n in n_max_leaf_nodes:
    
    #insert here the code to train the regressor on the dataset for varying levels of max_leaf_nodes   
    
    #insert here the code that computes the MSE of the model on the validation set and stores it as score
    
    array.append([n,score])

6. Let's contrast this with the error when evaluated on the training set. Obtain a second set `array_train` with the MSE for each value of the `max_leaf_nodes` parameter.

In [None]:
n_max_leaf_nodes = range(2,40) # Let's train the models with 2, 3, 4, ..., 39 leaves (range excludes its endpoint)
array_train= []

for n in n_max_leaf_nodes:
    
    #insert here the code to train the regressor on the dataset for varying levels of max_leaf_nodes
        
    #insert here the code that computes the MSE of the model on the training set and stores it as score
    
    array_train.append([n,score])

7. Plot the MSE on the training set and on the validation set as a function of the number of leaves. What do you observe? Does that make sense? How many leaves would you feel comfortable picking?

In [None]:
array = pd.DataFrame(array)
plt.scatter(array[0],array[1])

array_train = pd.DataFrame(array_train)
plt.scatter(array_train[0],array_train[1])
plt.legend(['Validation set','Training set'])
plt.xlabel("number of leaves",fontsize=15)
plt.ylabel("MSE",fontsize=15)
plt.show()

8. Retrain the model on the training and validation sets with 13 leaves. What is the value of the square root of the MSE on the testing set? Denote this value by `RMSE_DT`.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
Xtrainval=pd.concat([Xtrain,Xvalidation])
ytrainval=pd.concat([ytrain,yvalidation])

In [None]:
#insert here code to retrain your model on the training + validation set

In [None]:
#insert here code to graph the tree

In [None]:
#insert here code to obtain predictions and compute the RMSE

9. When you look at this tree, what features play a large role in determining house prices?

Five features stand out (ordered roughly from most to least important): lstat (% lower-status population), rm (average number of rooms per dwelling), dis (weighted distance to employment centers), nox (nitric oxides concentration, a measure of air pollution), and tax (full-value property-tax rate). Note, for example, that higher taxes are associated with lower prices.

## B. Bagging / Random Forests / Gradient Boosting [optional]

In this part, we contrast bagging/random forests/gradient boosting and see which one does the best. It stands to reason that these three methods should at least improve on what we have seen so far. 

Throughout, we will set `max_leaf_nodes=13` and train our models on `Xtrainval, ytrainval`.

We start first with Bagging/Random Forests and then move onto Gradient Boosting.

1. We use `RandomForestRegressor` for both, tweaking only the parameter `max_features`. Explain why this can be done and what we should set `max_features` to for it to be bagging, and what it should be set equal to for it to be random forests.
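To make the role of `max_features` concrete, the sketch below fits two forests on synthetic data from `make_regression` (an assumption; this is not the Boston data). One forest considers every feature at each split, the other only a random subset. The value `"sqrt"` is just one possible subset size chosen for illustration, and the `_demo` names are ours.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem with 12 features (assumption, for illustration).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=12, noise=10.0, random_state=0
)

# All features are candidates at every split: trees differ only through bootstrapping.
bagging_demo = RandomForestRegressor(
    n_estimators=50, max_leaf_nodes=13, max_features=None, random_state=0
).fit(X_demo, y_demo)

# Only a random subset of features is considered at every split.
forest_demo = RandomForestRegressor(
    n_estimators=50, max_leaf_nodes=13, max_features="sqrt", random_state=0
).fit(X_demo, y_demo)

# Number of features each individual tree examines per split.
print(bagging_demo.estimators_[0].max_features_)  # 12
print(forest_demo.estimators_[0].max_features_)   # 3
```

Everything else (bootstrap samples, averaging of the trees) is identical between the two models, which is why a single class can cover both methods.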

2. Let's start with Bagging. Use `RandomForestRegressor` in exactly the same way as `DecisionTreeRegressor` to fit the model on `Xtrainval,ytrainval` with `max_leaf_nodes=13` and `max_features` appropriately set.

Hint: you may want to use `ytrainval["medv"]` instead of `ytrainval` to avoid the error message you may be getting.
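The hint comes down to the shape of the target. A single-column DataFrame is two-dimensional, whereas selecting the column by name gives a one-dimensional Series, which is what the forest estimators expect (with a column-shaped target they typically emit a conversion warning). A minimal sketch on made-up data, with placeholder names of our own:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up data (assumption): three features and a target column named "medv".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_frame = pd.DataFrame({"medv": 2.0 * X_demo["a"] + rng.normal(size=100)})

print(y_frame.shape)         # (100, 1): a two-dimensional, one-column frame
y_series = y_frame["medv"]   # selecting the column yields a one-dimensional Series
print(y_series.shape)        # (100,)

model_demo = RandomForestRegressor(
    n_estimators=20, max_leaf_nodes=13, random_state=0
)
model_demo.fit(X_demo, y_series)   # one-dimensional target
pred_demo = model_demo.predict(X_demo)
```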

3. Use the model you've just obtained to predict the values on `Xtest`. Then, compute the MSE between the predicted values and `ytest`. What is its square root? Denote this by `RMSE_Bagging`. Does it improve on `RMSE_DT`?

4. Let's move onto random forests. Use `RandomForestRegressor` in exactly the same way as `DecisionTreeRegressor` to fit the model on `Xtrainval,ytrainval` with `max_leaf_nodes=13` and `max_features` appropriately set.

Hint: you may want to use `ytrainval["medv"]` instead of `ytrainval` to avoid the error message you may be getting.

5. Use the model you've just obtained to predict the values on `Xtest`. Then, compute the MSE between the predicted values and `ytest`. What is its square root? Denote this by `RMSE_RF`. Does it improve on `RMSE_DT`? on `RMSE_Bagging`?

6. Finally, for gradient boosting, use `GradientBoostingRegressor` (which again has a very similar syntax to `DecisionTreeRegressor`) to fit the model on `Xtrainval,ytrainval` with `max_leaf_nodes=13`.
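The call pattern can be sketched on synthetic data as follows. This assumes data from `make_regression` rather than the Boston dataset, leaves every other hyperparameter at its default, and uses placeholder names (`X_fit`, `X_hold`, `boost_demo`) of our own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=15.0, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same fit/predict interface as DecisionTreeRegressor.
boost_demo = GradientBoostingRegressor(max_leaf_nodes=13, random_state=0)
boost_demo.fit(X_fit, y_fit)

# Root mean squared error on the held-out part.
rmse_boost_demo = np.sqrt(mean_squared_error(y_hold, boost_demo.predict(X_hold)))
print(rmse_boost_demo)
```

One point worth checking against the scikit-learn documentation: as far as we can tell, the default `max_depth=3` of `GradientBoostingRegressor` still applies when `max_leaf_nodes` is set, so individual trees may end up with fewer than 13 leaves unless `max_depth` is raised.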

7. Use the model you've just obtained to predict the values on `Xtest`. Then, compute the MSE between the predicted values and `ytest`. What is its square root? Denote this by `RMSE_Boost`. Does it improve on `RMSE_DT`? on `RMSE_Bagging`? on `RMSE_RF`? What do you think is the best model?