# Training a Random Forest Classifier

## Preprocessing

First, we import the necessary libraries and load the data.


In [11]:
import pandas as pd

df = pd.read_csv('../data/airlines_delay_cleaned.csv')
df.head()


Unnamed: 0,Time,Length,Airline,AirportFrom,AirportTo,DayOfWeek,Delayed
0,1296.0,141.0,DL,ATL,HOU,1,0
1,360.0,146.0,OO,COS,ORD,4,0
2,1170.0,143.0,B6,BOS,CLT,3,0
3,692.0,98.0,FL,BMI,ATL,4,0
4,580.0,60.0,WN,MSY,BHM,4,0


Next, we apply the `preprocess_data` function from the `preprocessing.py` file to the data. It performs the following transformations:
- Separate the data into features and target
- Scale the numerical features and encode the categorical features
- Split the data into train and test sets


In [2]:
from preprocessing import preprocess_data

X_train, X_test, y_train, y_test = preprocess_data(df)
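
The `preprocess_data` function itself is not shown in this notebook. As a rough illustration of the three steps listed above, a minimal sketch could look like the following. The name `preprocess_data_sketch`, its parameters, the choice of `StandardScaler` and `OneHotEncoder`, and the toy rows are all assumptions made for illustration; they are not the contents of `preprocessing.py`. This sketch also splits before fitting the scaler and encoder so that no test-set statistics leak into training, which may differ from the order used in the real function.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def preprocess_data_sketch(df, target="Delayed", test_size=0.25, random_state=17):
    """Hypothetical stand-in for preprocess_data (an assumption, not the real code)."""
    # Drop leftover index columns such as "Unnamed: 0"
    df = df.loc[:, ~df.columns.str.startswith("Unnamed")]

    # 1. Separate the data into features and target
    X = df.drop(columns=[target])
    y = df[target]

    # 2. Split first so the transformers are fitted on training rows only
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    # 3. Scale numerical columns and one-hot encode categorical columns
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
    transformer = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    X_train_t = transformer.fit_transform(X_train)
    X_test_t = transformer.transform(X_test)
    return X_train_t, X_test_t, y_train, y_test


# Toy rows (made up) with the same columns as the airline data
toy = pd.DataFrame({
    "Unnamed: 0": range(8),
    "Time": [1296.0, 360.0, 1170.0, 692.0, 580.0, 900.0, 450.0, 1010.0],
    "Length": [141.0, 146.0, 143.0, 98.0, 60.0, 120.0, 75.0, 200.0],
    "Airline": ["DL", "OO", "B6", "FL", "WN", "DL", "WN", "OO"],
    "AirportFrom": ["ATL", "COS", "BOS", "BMI", "MSY", "ATL", "MSY", "ORD"],
    "AirportTo": ["HOU", "ORD", "CLT", "ATL", "BHM", "BOS", "HOU", "COS"],
    "DayOfWeek": [1, 4, 3, 4, 4, 2, 5, 6],
    "Delayed": [0, 0, 0, 0, 1, 1, 1, 1],
})

X_tr, X_te, y_tr, y_te = preprocess_data_sketch(toy)
print(X_tr.shape, X_te.shape)
```

Passing `handle_unknown="ignore"` keeps the encoder from failing when the test set contains an airline or airport that never appeared in the training rows.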


Now that the data is preprocessed, we can train the model. We will use a Random Forest Classifier.

In [3]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=17)

# Fit the model to your training data
rf.fit(X_train, y_train)



After training, we can evaluate the model.

In [4]:
from sklearn.metrics import accuracy_score

# Make predictions using your model
y_pred = rf.predict(X_test)

# Evaluate your model
print('Accuracy:', accuracy_score(y_test, y_pred))


Accuracy: 0.3312735267663813
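
Assuming `Delayed` is a two-class target, an accuracy of 0.33 is lower than what a model that always predicts the most frequent class would typically reach. Before tuning, it may therefore be worth comparing against a trivial baseline and looking at the confusion matrix, since a score this low can point to a problem in the pipeline rather than in the hyperparameters. The sketch below shows one way to do that with scikit-learn's `DummyClassifier`. It runs on synthetic data from `make_classification` as a stand-in, so the variable names and numbers are illustrative and not results for the airline data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the airline features (assumption)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.55, 0.45], random_state=17
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=17
)

# Baseline: always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Same model configuration as in the notebook
model = RandomForestClassifier(n_estimators=100, random_state=17).fit(X_tr, y_tr)
model_pred = model.predict(X_te)
model_acc = accuracy_score(y_te, model_pred)
cm = confusion_matrix(y_te, model_pred)

print("Baseline accuracy:", baseline_acc)
print("Model accuracy:", model_acc)
print("Confusion matrix (rows = true class, columns = predicted class):")
print(cm)
```

If the model scores below the baseline on the real data, checking that the rows of `X_test` and `y_test` are still aligned after preprocessing is a reasonable first step.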


With an accuracy of only 0.33, the model performs poorly, so we will try to improve it by tuning the hyperparameters. We will use a grid search with 5-fold cross-validation to find the best maximum tree depth (`max_depth`).

In [6]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [5, 10, 15, 20, 25, 30, None]}

grid_search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=17), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print('Best parameter: ', grid_search.best_params_)
print('Best score: ', grid_search.best_score_)




Best parameter:  {'max_depth': 10}
Best score:  0.5821420747452328
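
The best score reported by `GridSearchCV` is the mean cross-validated score of the winning candidate, which for a classifier defaults to accuracy. To see how the other depths compare, the `cv_results_` attribute can be turned into a table. The following is a small sketch on synthetic data with a reduced grid and fewer trees so that it runs quickly; the name `search` and the data are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); the notebook uses X_train, y_train
X, y = make_classification(n_samples=300, n_features=10, random_state=17)

param_grid = {"max_depth": [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=17), param_grid, cv=5
)
search.fit(X, y)

# One row per candidate: mean and spread of the score across the 5 folds
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Comparing `std_test_score` with the gaps between the mean scores gives a feel for whether the chosen depth is clearly better or only marginally ahead.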


Now that the grid search has selected the best model and refit it on the full training set, we can evaluate it on the test set again.

In [9]:
from sklearn.metrics import accuracy_score

# Make predictions using your model
y_pred = grid_search.best_estimator_.predict(X_test)

# Evaluate your model
print('Accuracy:', accuracy_score(y_test, y_pred))


Accuracy: 0.5834804623738299


This looks much better, but the linear regression model was still slightly better. Finally, we save the best estimator to a file.

In [10]:
import pickle

# Save the best estimator to the Models directory one level above the working directory
pkl_filename = "../Models/random_forest_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(grid_search.best_estimator_, file)
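
To reuse the saved model later, it can be loaded back with `pickle.load`. The sketch below shows the round trip on a small model trained on synthetic data and written to a temporary directory, so the path and data are assumptions and not the notebook's `../Models/` file. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (assumption); the notebook saves grid_search.best_estimator_
X, y = make_classification(n_samples=200, n_features=8, random_state=17)
model = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=17)
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_model.pkl")

    # Write the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Read it back into a new object
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The restored model should give the same predictions as the original
same_predictions = bool((loaded.predict(X) == model.predict(X)).all())
print("Predictions match:", same_predictions)
```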
