Decision Tree - Bank Marketing Dataset
Description
You are given the 'Portuguese Bank' marketing dataset which contains data about a telemarketing campaign run by the bank to sell a product (term deposit - a type of investment product).

Each row represents a 'prospect' to whom phone calls were made to sell the product. The attributes describe the prospect: age, profession, education level, previous loans taken, and so on. The target variable is 'purchased' (1/0), where 1 indicates that the person purchased the product. A sample of the training data is shown below (note that 'id' shouldn't be used to train the model):

    age          job  marital          education default  housing     loan  \
0   30  blue-collar  married           basic.9y      no      yes       no   
1   39     services   single        high.school      no       no       no   
2   25     services  married        high.school      no      yes       no   
3   38     services  married           basic.9y      no  unknown  unknown   
4   47       admin.  married  university.degree      no      yes       no   

     contact month day_of_week ...  pdays  previous     poutcome  \
0   cellular   may         fri ...    999         0  nonexistent   
1  telephone   may         fri ...    999         0  nonexistent   
2  telephone   jun         wed ...    999         0  nonexistent   
3  telephone   jun         fri ...    999         0  nonexistent   
4   cellular   nov         mon ...    999         0  nonexistent   

   purchased  id  
0          0   1  
1          0   2  
2          0   3  
3          0   4  
4          0   5  

As an analyst, you want to predict whether a person will purchase the product or not. This will help the bank reduce its marketing costs, since it can then target only the prospects who are likely to buy.

Build a decision tree with default hyperparameters to predict whether a person will buy the product or not. 
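
The sample above shows several string-valued columns (job, marital, education and others). scikit-learn's tree estimators expect numeric input, so if the CSV files store these columns as text (an assumption; the files may already be encoded), they would need encoding before fitting. A minimal sketch using `pd.get_dummies` on small hypothetical frames, not the real data:

```python
import pandas as pd

# Tiny hypothetical frames that mimic the sample's mixed column types.
train = pd.DataFrame({
    "age": [30, 39, 25, 47],
    "job": ["blue-collar", "services", "services", "admin."],
    "marital": ["married", "single", "married", "married"],
})
test = pd.DataFrame({
    "age": [38, 52],
    "job": ["services", "retired"],  # 'retired' never appears in train
    "marital": ["married", "single"],
})

# One-hot encode the string columns; numeric columns pass through unchanged.
train_enc = pd.get_dummies(train, dtype=int)

# Reindex the test frame to the training columns: categories unseen in
# training are dropped, and categories missing from test are filled with 0.
test_enc = pd.get_dummies(test, dtype=int).reindex(
    columns=train_enc.columns, fill_value=0
)

print(train_enc.columns.tolist())
print(test_enc)
```

Aligning the test columns to the training columns matters because the fitted model expects the same features, in the same order, at prediction time.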

The training data is provided here:
/data/training/bank_train.csv

After you train the model, use the test data to make predictions. The test data can be accessed here. 
/data/test/bank_test.csv

You have to write the predictions in the file
/code/output/bank_predictions.csv

in the following format (note the column names carefully):
     bank_predicted    id
0               0  2041
1               0   399
2               0  1400
3               0  3709
4               0  2111
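
The format above has an unnamed row-number column followed by `bank_predicted` and then `id`. A small sketch of producing that layout with pandas, assuming hypothetical ids and predictions and writing to an in-memory buffer instead of the real output path:

```python
import io

import pandas as pd

# Hypothetical values standing in for bank_test['id'] and the model output.
ids = [2041, 399, 1400]
preds = [0, 0, 1]

# Passing `columns=` fixes the column order regardless of how the dict is built.
out = pd.DataFrame({"id": ids, "bank_predicted": preds},
                   columns=["bank_predicted", "id"])

# Write to a buffer here; a real run would pass the output file path instead.
buf = io.StringIO()
out.to_csv(buf)  # the default index=True keeps the leading row-number column
csv_text = buf.getvalue()
print(csv_text)
```

Pinning the order explicitly avoids relying on how a given pandas version orders dictionary keys.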




Datasets
Training dataset
Execution Time Limit
15 seconds


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn import metrics, preprocessing
from sklearn.tree import DecisionTreeClassifier

# read training data
bank_train = pd.read_csv("/data/training/bank_train.csv")

# read test data
bank_test = pd.read_csv("/data/test/bank_test.csv")

print(bank_train.head())
print(bank_test.head())

# build the model 
# create DT object
dt_default = DecisionTreeClassifier()

# train the model
print(bank_train.columns)
x_train = bank_train.drop(['purchased', 'id'], axis=1)
# use a 1-D Series as the target (scikit-learn expects y of shape (n_samples,))
y_train = bank_train['purchased']

dt_default.fit(x_train, y_train)

#  make predictions
print(bank_test.head())
predictions = dt_default.predict(bank_test.drop(['id'], axis=1))
print(predictions[:5])


# write columns bank_predicted, id (in the required order) into the output file
d = pd.DataFrame({'bank_predicted': predictions, 'id': bank_test['id']})
print("\n", "d", "\n", d.head())

# write the output
d.to_csv('/code/output/bank_predictions.csv', sep=",")
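
The cell above never measures how well the tree generalizes, because the test file has no labels. One common way to get an estimate is to hold out part of the training data. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the bank data (an assumption; the variable names are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumes numeric features).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Keep 25% of the labelled rows aside for validation, preserving class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default hyperparameters, with a fixed seed so the run is repeatable.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
val_acc = accuracy_score(y_val, tree.predict(X_val))
print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

A large gap between the two numbers is the usual sign that an unconstrained tree is overfitting, which is what tuning `max_depth` is meant to address.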

Decision Tree Hyperparameter Tuning
Description
You are given the 'Portuguese Bank' marketing dataset, which you have already seen in the previous question. It contains data about a telemarketing campaign run by the bank to sell a product (term deposit - a type of investment product). 



A sample of the training data is shown below (note that 'id' shouldn't be used to train the model) :

    age          job  marital          education default  housing     loan  \
0   30  blue-collar  married           basic.9y      no      yes       no   
1   39     services   single        high.school      no       no       no   
2   25     services  married        high.school      no      yes       no   
3   38     services  married           basic.9y      no  unknown  unknown   
4   47       admin.  married  university.degree      no      yes       no   

     contact month day_of_week ...  pdays  previous     poutcome  \
0   cellular   may         fri ...    999         0  nonexistent   
1  telephone   may         fri ...    999         0  nonexistent   
2  telephone   jun         wed ...    999         0  nonexistent   
3  telephone   jun         fri ...    999         0  nonexistent   
4   cellular   nov         mon ...    999         0  nonexistent   

   purchased  id  
0          0   1  
1          0   2  
2          0   3  
3          0   4  
4          0   5  

In the previous question on this dataset, you built a decision tree with default hyperparameters. In this question, you will find the optimal value of the hyperparameter max_depth using GridSearchCV(), and then build a model using the optimal value of max_depth to predict whether a given prospect will buy the product.

To find the optimal value, you can plot training and test accuracy versus max_depth using matplotlib (the code is already written - you will see the plot displayed below the coding console).

The training data is provided here:
/data/training/bank_train.csv

After you tune the model and find the optimal value of max_depth, use the test data to make predictions. The test data can be accessed here. 
/data/test/bank_test.csv

You have to write the predictions in the file
/code/output/bank_predictions.csv

in the following format (note the column names carefully):
     bank_predicted    id
0               0  2041
1               0   399
2               0  1400
3               0  3709
4               0  2111




Datasets
Training dataset
Execution Time Limit
15 seconds


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn import metrics, preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# read training data
bank_train = pd.read_csv("/data/training/bank_train.csv")

# read test data
bank_test = pd.read_csv("/data/test/bank_test.csv")

print(bank_train.head())
print(bank_test.head())

# build the model 

# create x_train and y_train
print(bank_train.columns)
x_train = bank_train.drop(['purchased', 'id'], axis=1)
# use a 1-D Series as the target (scikit-learn expects y of shape (n_samples,))
y_train = bank_train['purchased']

# Hyperparameter tuning: maxdepth
# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'max_depth': range(1, 40)}

# instantiate the model
dtree = DecisionTreeClassifier(random_state = 100)

# fit tree on training data
tree = GridSearchCV(dtree, parameters, 
                    cv=n_folds, 
                   scoring="accuracy",
                   return_train_score=True)
tree.fit(x_train, y_train)

# scores of GridSearch CV
scores = tree.cv_results_
print(pd.DataFrame(scores).head())

# plotting accuracies with max_depth
plt.figure()
plt.plot(scores["param_max_depth"], 
         scores["mean_train_score"], 
         label="training accuracy")
plt.plot(scores["param_max_depth"], 
         scores["mean_test_score"], 
         label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
# save before show(): with interactive backends, show() can leave an empty figure behind
plt.savefig('/code/output/hyperparam_c.png')
plt.show()

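# create DT with the max_depth read off the plot above (4 here);
# the fixed random_state keeps the result reproducible
best_tree = DecisionTreeClassifier(max_depth=4, random_state=100)
best_tree.fit(x_train, y_train)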

# make predictions
print(bank_test.head())
predictions = best_tree.predict(bank_test.drop(['id'], axis=1))
print(predictions[:5])


# write columns bank_predicted, id (in the required order) into the output file
d = pd.DataFrame({'bank_predicted': predictions, 'id': bank_test['id']})

# write the output
d.to_csv('/code/output/bank_predictions.csv', sep=",")
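
The cell above reads the best depth off the plot and hard-codes it. As an alternative, a fitted `GridSearchCV` object exposes the winning setting through `best_params_` and a refitted model through `best_estimator_`. A minimal sketch on synthetic data (an assumption; the grid and sample sizes are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumes numeric features).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=100),
    {"max_depth": range(1, 11)},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# The depth with the highest mean cross-validated accuracy.
best_depth = search.best_params_["max_depth"]

# With the default refit=True, this model is already retrained on all of X, y.
best_model = search.best_estimator_

print("best max_depth:", best_depth)
print("mean CV accuracy:", round(search.best_score_, 3))
print(best_model.predict(X[:5]))
```

Reading the value programmatically keeps the final model in sync with the search if the data or the grid changes.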

Random Forest - Bank Marketing Dataset
Description
You are given the 'Portuguese Bank' marketing dataset which contains data about a telemarketing campaign run by the bank to sell a product (term deposit - a type of investment product).

You have already built a decision tree on this dataset, which would have given about 92-93% accuracy. In this question, you'll build a random forest and compare the accuracy with that of the decision tree.

If you are familiar with the dataset, you can skip the description below.

Dataset Description
Each row represents a 'prospect' to whom phone calls were made to sell the product. The attributes describe the prospect: age, profession, education level, previous loans taken, and so on. The target variable is 'purchased' (1/0), where 1 indicates that the person purchased the product. A sample of the training data is shown below (note that 'id' shouldn't be used to train the model):

    age          job  marital          education default  housing     loan  \
0   30  blue-collar  married           basic.9y      no      yes       no   
1   39     services   single        high.school      no       no       no   
2   25     services  married        high.school      no      yes       no   
3   38     services  married           basic.9y      no  unknown  unknown   
4   47       admin.  married  university.degree      no      yes       no   

     contact month day_of_week ...  pdays  previous     poutcome  \
0   cellular   may         fri ...    999         0  nonexistent   
1  telephone   may         fri ...    999         0  nonexistent   
2  telephone   jun         wed ...    999         0  nonexistent   
3  telephone   jun         fri ...    999         0  nonexistent   
4   cellular   nov         mon ...    999         0  nonexistent   

   purchased  id  
0          0   1  
1          0   2  
2          0   3  
3          0   4  
4          0   5  

Build a random forest with default hyperparameters to predict whether a person will buy the product or not. 

The training data is provided here:
/data/training/bank_train.csv

After you train the model, use the test data to make predictions. The test data can be accessed here. 
/data/test/bank_test.csv

You have to write the predictions in the file
/code/output/bank_predictions.csv

in the following format (note the column names carefully):
     bank_predicted    id
0               0  2041
1               0   399
2               0  1400
3               0  3709
4               0  2111
Datasets
Training dataset
Execution Time Limit
15 seconds


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier

# read training data
bank_train = pd.read_csv("/data/training/bank_train.csv")

# read test data
bank_test = pd.read_csv("/data/test/bank_test.csv")

print(bank_train.head())
print(bank_test.head())

############################
### WRITE YOUR CODE HERE ###
############################

# Build the model 
# Create a random forest object rf (use default hyperparameters)
rf = RandomForestClassifier()

# Train the model

# Create x_train: Drop the columns 'purchased' (target) and 'id'
print(bank_train.columns)
x_train = bank_train.drop(['purchased', 'id'], axis=1)

# Create y_train
# (a 1-D Series; a one-column DataFrame makes the forest emit a DataConversionWarning)
y_train = bank_train['purchased']

# Fit the model
rf.fit(x_train, y_train)

#  Make predictions using test data
print(bank_test.head())

# remember to drop 'id' from the test dataset 
predictions = rf.predict(bank_test.drop(['id'], axis=1))
print(predictions[:5])

# Write the columns 'bank_predicted' and 'id' (in the required order) into the output file
d = pd.DataFrame({'bank_predicted': predictions, 'id': bank_test['id']})

# Write the output
d.to_csv('/code/output/bank_predictions.csv', sep=",")
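
The problem statement asks for a comparison between the random forest and the decision tree. Since the test file has no labels, one way to compare them is cross-validated accuracy on the training data. The sketch below does this on a synthetic dataset (an assumption; it says nothing about the accuracy either model reaches on the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumes numeric features).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score both models on the same 5 folds and keep the mean accuracy of each.
cv_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Using the same folds for both models makes the comparison fairer than comparing two separate single train/test splits.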

Random Forest - Hyperparameter Tuning
Description
Problem Description

You are given the 'Portuguese Bank' marketing dataset, which you have already seen.

In the previous question, you built a random forest with default hyperparameters. In this question, you will find the optimal value of the hyperparameter max_depth using GridSearchCV(), and then build a model using the optimal value of max_depth.

To find the optimal value, you can plot training and test accuracy (on the y-axis) versus max_depth (on the x-axis) using matplotlib. This time, you have to create the plot yourself (the comments will guide you to create a plot).

Data

A sample of the training data is shown below (note that 'id' shouldn't be used to train the model) :

    age          job  marital          education default  housing     loan  \
0   30  blue-collar  married           basic.9y      no      yes       no   
1   39     services   single        high.school      no       no       no   
2   25     services  married        high.school      no      yes       no   
3   38     services  married           basic.9y      no  unknown  unknown   
4   47       admin.  married  university.degree      no      yes       no   

     contact month day_of_week ...  pdays  previous     poutcome  \
0   cellular   may         fri ...    999         0  nonexistent   
1  telephone   may         fri ...    999         0  nonexistent   
2  telephone   jun         wed ...    999         0  nonexistent   
3  telephone   jun         fri ...    999         0  nonexistent   
4   cellular   nov         mon ...    999         0  nonexistent   

   purchased  id  
0          0   1  
1          0   2  
2          0   3  
3          0   4  
4          0   5  



The training data is provided here:
/data/training/bank_train.csv

After you tune the model and find the optimal value of max_depth, use the test data to make predictions. The test data can be accessed here. 
/data/test/bank_test.csv

You have to write the predictions in the file
/code/output/bank_predictions.csv

in the following format (note the column names carefully):
     bank_predicted    id
0               0  2041
1               0   399
2               0  1400
3               0  3709
4               0  2111




Datasets
Training dataset
Execution Time Limit
25 seconds


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# read training data
bank_train = pd.read_csv("bank_train.csv")

# read test data
bank_test = pd.read_csv("sample/bank_test.csv")

print(bank_train.head())
print(bank_test.head())

##########################
## WRITE YOUR CODE HERE ##
##########################

# create x_train and y_train
x_train = bank_train.drop(['purchased', 'id'], axis=1)
# (y_train is a 1-D Series; a one-column DataFrame makes the forest emit a DataConversionWarning)
y_train = bank_train['purchased']

#####################################################
## Implement GridSearchCV to find optimal max_depth
#####################################################

# specify number of folds for k-fold CV
n_folds = 5

# specify range of the hyperparameter max_depth 
parameters = {'max_depth': range(2, 20, 5)}

# instantiate the model
rf = RandomForestClassifier()

# wrap the model in a grid search over max_depth
rf = GridSearchCV(rf, parameters, cv=n_folds, 
                   scoring="accuracy", 
                   return_train_score = True)

# fit the rf model 
rf.fit(x_train, y_train)

# store scores/results of GridSearch CV in a df
scores = rf.cv_results_
print(pd.DataFrame(scores).head())

#####################################################
## Plot mean_train_score and mean_test_score (accuracies) on the y-axis
# and param_max_depth on the x-axis
#####################################################


# plotting accuracies with max_depth
plt.figure()
plt.plot(scores["param_max_depth"], 
         scores["mean_train_score"], 
         label="training accuracy")
plt.plot(scores["param_max_depth"], 
         scores["mean_test_score"], 
         label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
# save before show(): with interactive backends, show() can leave an empty figure behind
plt.savefig('/code/output/max_depth.png')
plt.show()


# from the plot, observe the optimal value of max_depth
# and store in max_depth_optimal
max_depth_optimal = 3

#########################################
# Build the model with optimal max_depth
#########################################
rf = RandomForestClassifier(max_depth = max_depth_optimal)
rf.fit(x_train, y_train)

## Make predictions
predictions = rf.predict(bank_test.drop(['id'], axis=1))
print(predictions[:5])

# Write columns bank_predicted, id (in the required order) into the output file
d = pd.DataFrame({'bank_predicted': predictions, 'id': bank_test['id']})

# write the output
d.to_csv('bank_predictions.csv', sep=",")
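
Note that the grid `range(2, 20, 5)` used above contains only the depths 2, 7, 12 and 17, so the value chosen from the plot has to be one of those for the plot to support it. The sketch below, on synthetic data with a smaller forest (both assumptions, made to keep the run short), plots the two accuracy curves from `cv_results_` and also picks the best depth from the same arrays:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the bank data (assumes numeric features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

depths = list(range(2, 20, 5))  # same grid as above: [2, 7, 12, 17]
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": depths},
    cv=3,
    scoring="accuracy",
    return_train_score=True,
)
search.fit(X, y)
results = search.cv_results_

# Accuracy on the y-axis, max_depth on the x-axis.
fig, ax = plt.subplots()
ax.plot(depths, results["mean_train_score"], label="training accuracy")
ax.plot(depths, results["mean_test_score"], label="test accuracy")
ax.set_xlabel("max_depth")
ax.set_ylabel("Accuracy")
ax.legend()

# Save to a temporary directory (a hypothetical location for this sketch).
plot_path = os.path.join(tempfile.mkdtemp(), "max_depth.png")
fig.savefig(plot_path)
plt.close(fig)

# The grid value with the highest mean cross-validated accuracy.
best_depth = depths[int(results["mean_test_score"].argmax())]
print("best max_depth in grid:", best_depth)
```

The rows of `cv_results_` follow the order of the parameter grid, which is why indexing `depths` with the position of the best score recovers the matching depth.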