## Train a Random Forest Classifier on the ISOLET Dataset

You work for a technology company that plans to launch a new voice assistant product. Your task is to build a classification model that recognizes the letters a user spells out, based on the signal frequencies captured. Each sound is captured and represented as a signal composed of multiple frequencies.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('ISOLET.csv')
df.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f609,f610,f611,f612,f613,f614,f615,f616,f617,class
0,-0.4394,-0.093,0.1718,0.462,0.6226,0.4704,0.3578,0.0478,-0.1184,-0.231,...,0.4102,0.2052,0.3846,0.359,0.5898,0.3334,0.641,0.5898,-0.4872,'1'
1,-0.4348,-0.1198,0.2474,0.4036,0.5026,0.6328,0.4948,0.0338,-0.052,-0.1302,...,0.0,0.2954,0.2046,0.4772,0.0454,0.2046,0.4318,0.4546,-0.091,'1'
2,-0.233,0.2124,0.5014,0.5222,-0.3422,-0.584,-0.7168,-0.6342,-0.8614,-0.8318,...,-0.1112,-0.0476,-0.1746,0.0318,-0.0476,0.1112,0.254,0.1588,-0.4762,'2'
3,-0.3808,-0.0096,0.2602,0.2554,-0.429,-0.6746,-0.6868,-0.665,-0.841,-0.9614,...,-0.0504,-0.036,-0.1224,0.1366,0.295,0.0792,-0.0072,0.0936,-0.151,'2'
4,-0.3412,0.0946,0.6082,0.6216,-0.1622,-0.3784,-0.4324,-0.4358,-0.4966,-0.5406,...,0.1562,0.3124,0.25,-0.0938,0.1562,0.3124,0.3124,0.2188,-0.25,'3'
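The `Unnamed: 0` column in the output above looks like a row index that was saved along with the CSV. If so, it stays in `df` and is later passed to the model as a feature. The following is a minimal sketch, assuming that column really is only a saved index, of reading it as the DataFrame index with `index_col=0`. The in-memory CSV is a hypothetical three-row stand-in for `ISOLET.csv`, and applying this to the real file may change the accuracies reported in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for ISOLET.csv: the first header field is empty,
# which is how a saved row index usually appears in a CSV file.
csv_text = (
    ",f1,f2,class\n"
    "0,-0.4394,-0.093,'1'\n"
    "1,-0.4348,-0.1198,'1'\n"
    "2,-0.233,0.2124,'2'\n"
)

# Default read: the unnamed first column becomes an ordinary column.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

An alternative with the same effect would be to keep the default read and call `df.drop(columns='Unnamed: 0')` before splitting.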


Extract the response variable using .pop() from pandas.

In [3]:
y = df.pop('class')
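As a small illustration of what `.pop()` does, the sketch below uses a hypothetical two-row DataFrame (not the ISOLET data). `.pop()` returns the column as a Series and removes it from the DataFrame in place, so `df` holds only the features afterwards. The last step, stripping the literal quote characters from the labels, is an optional assumption based on how the `class` values appear in the output above.

```python
import pandas as pd

# Hypothetical toy frame with the same layout: feature columns plus 'class'.
toy = pd.DataFrame(
    {"f1": [0.1, 0.2], "f2": [0.3, 0.4], "class": ["'1'", "'2'"]}
)

# pop() returns the column and deletes it from the frame in place.
target = toy.pop("class")

print(list(toy.columns))
print(target.tolist())

# Optional: remove the literal single quotes around each label.
clean_target = target.str.strip("'")
print(clean_target.tolist())
```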

Split the dataset into training and test sets using train_test_split() from sklearn.model_selection.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=888)
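With many letter classes, a purely random split can leave the class proportions slightly different in the training and test sets. One option, not used in this notebook, is the `stratify` argument of `train_test_split()`. The sketch below shows it on hypothetical synthetic data with four balanced classes; the names `X_toy`, `y_toy`, and the `_tr`/`_te` variables are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 4 features, 4 classes with 25 rows each.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])
y_toy = pd.Series(np.repeat(["a", "b", "c", "d"], 25), name="class")

# stratify=y_toy keeps each class's share roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=888, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().sort_index())
```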

### Model with default hyperparameters

In [5]:
rf_model = RandomForestClassifier(random_state=42)

In [6]:
rf_model.fit(X_train, y_train)

# predictions
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)

# accuracy scores
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

# difference between accuracy scores
acc_difference = train_acc - test_acc

print(train_acc)
print(test_acc)
print(acc_difference)

1.0
0.9414529914529914
0.05854700854700856


With the default hyperparameters, the model predicts with 100% accuracy on the training set and 94% on the testing set. The difference between the two is about 6 percentage points, which suggests the model is overfitting the training set. Let's tune the model and see if we can reduce the overfitting.

### Model(s) with some hyperparameter tuning

Create a function that will instantiate a RandomForestClassifier from sklearn.ensemble and fit it using .fit().
Try these hyperparameters:

    n_estimators = 20 and 50
    max_depth = 5 and 10
    min_samples_leaf = 10 and 50
    max_features = 0.5 and 0.3
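The function described above could be sketched as follows. The name `fit_and_score`, its signature, and the synthetic data from `make_classification()` are assumptions made for illustration; with the real data you would pass the `X_train`, `X_test`, `y_train`, and `y_test` created earlier. The two candidate settings are the ones listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(X_train, X_test, y_train, y_test, **rf_params):
    """Fit a RandomForestClassifier and return (train_acc, test_acc, gap)."""
    model = RandomForestClassifier(random_state=42, **rf_params)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc


# Synthetic stand-in data so the sketch runs on its own (not the ISOLET data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=888
)

# The two hyperparameter combinations suggested above.
candidates = [
    dict(n_estimators=20, max_depth=5, min_samples_leaf=10, max_features=0.5),
    dict(n_estimators=50, max_depth=10, min_samples_leaf=50, max_features=0.3),
]

results = []
for params in candidates:
    train_acc, test_acc, gap = fit_and_score(
        Xd_train, Xd_test, yd_train, yd_test, **params
    )
    results.append((params, train_acc, test_acc, gap))
    print(params, round(train_acc, 3), round(test_acc, 3), round(gap, 3))
```

Wrapping the steps in one function avoids repeating the fit, predict, and score code for every model and keeps the comparison consistent.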

### model2

In [10]:
rf_model2 = RandomForestClassifier(random_state=42, n_estimators=20, max_depth=5, min_samples_leaf=10, max_features=0.5)

In [11]:
rf_model2.fit(X_train, y_train)

# predictions
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)

# accuracy scores
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)

# difference between accuracy scores
acc_difference2 = train_acc2 - test_acc2

print(train_acc2)
print(test_acc2)
print(acc_difference2)

0.7106468755726589
0.6935897435897436
0.017057131982915363


model2 now predicts with 71% accuracy on the training set and 69% on the testing set. The difference between the two is about 2 percentage points (0.017), which means we have removed most of the overfitting, but accuracy on both the training and testing sets has dropped significantly. Let's see if the model can be improved.

### model3

In [12]:
rf_model3 = RandomForestClassifier(random_state=42, n_estimators=50, max_depth=10, min_samples_leaf=50, max_features=0.3)

In [13]:
rf_model3.fit(X_train, y_train)

# predictions
train_preds3 = rf_model3.predict(X_train)
test_preds3 = rf_model3.predict(X_test)

# accuracy scores
train_acc3 = accuracy_score(y_train, train_preds3)
test_acc3 = accuracy_score(y_test, test_preds3)

# difference between accuracy scores
acc_difference3 = train_acc3 - test_acc3

print(train_acc3)
print(test_acc3)
print(acc_difference3)

0.9014110317024006
0.8803418803418803
0.021069151360520233


model3 predicts with 90% accuracy on the training set and 88% on the testing set. The difference between the two is about 2 percentage points (0.021). Accuracy has improved on both the training and testing sets, but the gap is about 0.4 percentage points wider than for model2. Let's see if we can improve the model again.

### model4

In [27]:
rf_model4 = RandomForestClassifier(random_state=42, n_estimators=50, max_depth=10, min_samples_leaf=40, max_features=0.2)

In [28]:
rf_model4.fit(X_train, y_train)

# predictions
train_preds4 = rf_model4.predict(X_train)
test_preds4 = rf_model4.predict(X_test)

# accuracy scores
train_acc4 = accuracy_score(y_train, train_preds4)
test_acc4 = accuracy_score(y_test, test_preds4)

# difference between accuracy scores
acc_difference4 = train_acc4 - test_acc4

print(train_acc4)
print(test_acc4)
print(acc_difference4)

0.9144218435037567
0.8987179487179487
0.015703894785807915


model4 accurately predicts 91% of the training set and 90% of the testing set. The difference between the two is about 1.6 percentage points. Compared with model3, accuracy has increased on both the training and testing sets and overfitting has been reduced. Of the tuned models, this is the best one trained, although the default model still has the highest test accuracy (94%).
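The accuracies printed above can be collected into one table to compare the four models side by side. The sketch below only restates the numbers already reported in this notebook; the names `summary` and `gap_points` are illustrative.

```python
import pandas as pd

# Train and test accuracies copied from the outputs printed above.
summary = pd.DataFrame(
    {
        "train_acc": [1.0, 0.7106468755726589, 0.9014110317024006, 0.9144218435037567],
        "test_acc": [0.9414529914529914, 0.6935897435897436, 0.8803418803418803, 0.8987179487179487],
    },
    index=["default", "model2", "model3", "model4"],
)

# Gap between train and test accuracy, in percentage points.
summary["gap_points"] = (summary["train_acc"] - summary["test_acc"]) * 100

print(summary.round(4))
```

Reading the table by `gap_points` favors model4, while reading it by `test_acc` favors the default model, so the choice depends on which criterion matters more for the product.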