# Unit 2 Assessment

In this assignment, we will focus on airline incidents. The data set for this assignment includes information on the cost of bird strikes. Use this data set to see if you can predict the cost of a bird strike (i.e., the `Total Cost` column in the data set) based on the attributes of the incident. This is important because this model can make a cost prediction as soon as a bird strike incident happens.

## Description of Variables

The variables are described in "Airline - Data Dictionary.docx".

## Goal

Use the **airline.csv** data set and build models to predict **Total Cost**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Important hints:

* This assignment requires you to work with a text-based column in addition to regular numeric/categorical columns. So you will have to pay attention to your pipelines during data processing.
* You can do your data prep before or after the train/test split. Regardless, you should use train_test_split only once. If you find yourself using it twice, it means you are doing something wrong.
* Recommended approach: 
    * import the data and perform the train/test split - like we always do. 
    * identify the names of numeric, categorical, feature engineered, and text columns - like we always do
    * create individual pipelines for each type of column - like we always do. For the text pipeline, I would recommend the TFIDF Vectorizer with SVDs. Though, you can also use TFIDF Vectorizer with top N terms (without SVDs).
    * combine all pipelines using the column transformer - like we always do 

# Section 1: 

## Data Prep (5 points)

In [253]:
import numpy as np
import pandas as pd

np.random.seed(1941)

In [254]:
airline = pd.read_csv("airline.csv")
airline.head()

Unnamed: 0,Aircraft,Number_Objects,Engines,Origin State,Phase,Description,Object Size,Weather,Warning,Altitude,Total Cost
0,PA,37,1.0,Florida,Descent,"BIRD, BROWN BIRD. A/C WAS DESCENDING INTO PATT...",Large,No Cloud,N,1500.0,690
1,C,43,1.0,Florida,Approach,BIRD SHATTERED L SIDE OF WINDHSLD. STUDENT REC...,Large,No Cloud,Y,2000.0,570
2,B-737,71,2.0,Oklahoma,Climb,MEDIUM SIZED BLACK BIRDS. CONTRACT MX INSPN OF...,Medium,No Cloud,N,1100.0,1027
3,Airbus,29,2.0,Wisconsin,Approach,"ID BY SMITHSONIAN, FAA 3881. DNA. 2 CRACKS IN ...",Large,No Cloud,Y,200.0,77479
4,B-737,32,2.0,Texas,Approach,BIRD SEEN AND HEARD THAT STRUCK RADOME. UPON I...,Small,No Cloud,N,1000.0,411


In [255]:
airline['Warning'] = airline['Warning'].map({'Y': 1, 'N': 0})

In [256]:
airline['Description'] = airline['Description'].fillna('missing')

In [257]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(airline, test_size=0.3)

In [258]:
train_set.isna().sum()

Aircraft           0
Number_Objects     0
Engines           16
Origin State      12
Phase             10
Description        0
Object Size       10
Weather            0
Altitude          10
Total Cost         0
dtype: int64

In [259]:
test_set.isna().sum()

Aircraft           0
Number_Objects     0
Engines           13
Origin State       6
Phase              4
Description        0
Object Size        4
Weather            0
Altitude           4
Total Cost         0
dtype: int64

In [260]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

In [261]:
train_target = train_set['Total Cost']
test_target = test_set['Total Cost']

train_inputs = train_set.drop(['Total Cost'], axis=1)
test_inputs = test_set.drop(['Total Cost'], axis=1)

## Feature Engineering (1 point)

Create one NEW feature from existing data. You either transform a single variable, or create a new variable from existing ones. 

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)

Justification 
The ratio of altitude to number of objects expresses, for a given incident, how high the aircraft was per object struck. It helps identify the altitude ranges at which strikes involving more birds occurred.

In [262]:

def fea_new_col(df):
    # Work on a copy so the input dataframe is not modified
    df1 = df.copy()
    ratio = (df1['Altitude'] / df1['Number_Objects']).fillna(0)
    # Division by zero yields inf; replace it with 1
    df1['Ratio of Altitude to Number_Objects'] = ratio.replace(np.inf, 1)
    return df1[['Ratio of Altitude to Number_Objects']]
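To see how the ratio feature treats missing values and division by zero, here is a small sketch on made-up rows. The four rows are assumptions for illustration and do not come from `airline.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the normal case, a zero denominator,
# and a missing altitude
toy = pd.DataFrame({
    "Altitude": [1000.0, 500.0, np.nan, 200.0],
    "Number_Objects": [10, 0, 5, 4],
})

ratio = toy["Altitude"] / toy["Number_Objects"]
ratio = ratio.fillna(0)           # missing altitude -> 0
ratio = ratio.replace(np.inf, 1)  # division by zero -> 1
print(ratio.tolist())  # [100.0, 1.0, 0.0, 50.0]
```

Note that the fill values 0 and 1 are arbitrary choices: a row with zero objects ends up looking like a genuine ratio of 1, which may or may not be desirable.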

In [263]:
fea_new_col(train_inputs)

Unnamed: 0,Ratio of Altitude to Number_Objects
984,0.000000
525,1.704545
863,0.000000
725,5.555556
298,61.224490
...,...
946,39.473684
181,2.127660
918,96.825397
556,0.158730


In [264]:
#Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

# Identifying the binary columns
binary_columns = ['Warning']

In [265]:
for col in binary_columns:
    numeric_columns.remove(col)

In [266]:
numeric_columns

['Number_Objects', 'Engines', 'Altitude']

In [267]:
for col in ['Description']:
    categorical_columns.remove(col)

In [268]:
categorical_columns

['Aircraft', 'Origin State', 'Phase', 'Object Size', 'Weather']

In [269]:
feat_eng_columns = ['Altitude','Number_Objects']

## Text Preparation 

In [270]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [271]:
def txt_new_col(df):
    # Convert the dataframe column to a NumPy array (this already makes a copy),
    # then call ravel() to make it one-dimensional
    return np.array(df).ravel()

In [272]:
txt_new_col(train_inputs['Description'])

array(['IMMATURE BALD EAGLE. DNT TO RT SIDE ANGLE OF ATTACK VANE. SMALL DENT IN FROTN OF AOA. A/C RETURNED TO LAND. FLT DELAYED 4 HRS 20 MINS. CREW PERFORMED A LOW APCH TO HAVE A/C INSPECTED FROM GROUND BEFORE LANDING.',
       'BIRD WAS LODGED IN OUT FLAP OF RT WING. FLAP AND UNDERSIDE OF WING DENTED.',
       ' ENG REPLACED DUE TO SEVERITY OF DAMAGED COMPRESSOR BLADES. INGESTION.',
       'GOOSE HIT RT SIDE DAMAGING RT WING STRUT FUSELAGE TO STRUT FAIRING. AFTER LDG, 15 GEESE WERE SEEN IN GRASS BY RWY. STRIKE WAS REPTD TO GSO ATC & MX. A/C RETURNED TO SVC LATER IN DAY.',
       "DARK, MOONLESS NIGHT. HEARD LOUD IMPACT. 2 LRG DENTS ON LE RT WING. NO PROBLEM LDG. FOUND WHITE FEATHERS, BLOOD AND TISSUE AT IMPACT SITES. DENTS ABOUT 3' APART.",
       'INGESTED OWL ON DEPTR. RETD TO LAND.  ENG VIBRATION WENT TO FULL SCALE 2-3 MINUTES AFTER INGESTION. BORESCOPED. REPLACED 4 PAIRS OF BLADES.',
       'BIRDS REPTD AS GULLS ON 1 FORM AND GOOSE ON ANOTHER. DMG TO ENG INLET REQD IMMED. REPAIR. 

In [273]:
text_column = ['Description']

# Pipeline

In [274]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [275]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [276]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [277]:
fea_new_column = Pipeline(steps=[('fea_new_column', FunctionTransformer(fea_new_col))])

In [278]:
text_transformer = Pipeline(steps=[
                ('txt_new_column', FunctionTransformer(txt_new_col)),
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=300, n_iter=10))
            ])

In [279]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns),
        ('trans', fea_new_column, feat_eng_columns),
        ('text', text_transformer,text_column)],   
        remainder='passthrough')

# Transform: fit_transform() for TRAIN

In [280]:
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-0.30226077,  0.10189067, -0.55807993, ..., -0.01682127,
         0.03367297, -0.01607395],
       [-0.44462019,  0.10189067, -0.52033778, ...,  0.02017042,
        -0.01790916, -0.03497772],
       [ 2.04666971,  0.10189067, -0.55807993, ...,  0.00612735,
        -0.00984678, -0.00646172],
       ...,
       [ 0.90779433,  0.10189067,  2.51161457, ...,  0.05851991,
         0.00718203,  0.01101136],
       [ 0.90779433,  0.10189067, -0.55304764, ...,  0.03673463,
        -0.00457338, -0.02735358],
       [-0.23108105,  0.10189067, -0.30646562, ...,  0.0193289 ,
        -0.00690638,  0.02337287]])

In [281]:
train_x.shape

(844, 435)

# Transform: transform() for TEST

In [282]:
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 4.09536350e-01,  1.01890674e-01, -5.58079928e-01, ...,
         2.32856930e-04, -2.09329768e-04,  6.21532445e-04],
       [ 9.78974042e-01,  1.01890674e-01, -5.58079928e-01, ...,
        -6.24670269e-03, -9.62082249e-03,  1.06790171e-02],
       [-5.15799900e-01,  1.01890674e-01, -5.07757067e-01, ...,
        -1.99390389e-02,  2.45678465e-02, -3.49901821e-03],
       ...,
       [ 3.38356638e-01,  1.01890674e-01, -5.58079928e-01, ...,
         2.54581725e-03, -1.41331147e-02,  6.77287936e-04],
       [ 9.07794330e-01, -1.76758168e+00, -5.20337783e-01, ...,
         1.58970956e-03, -1.13637904e-02, -5.06855792e-02],
       [-1.22759702e+00,  1.01890674e-01,  4.57943994e-02, ...,
         1.62553079e-02, -1.84109871e-02, -2.08168857e-03]])

In [283]:
test_x.shape

(363, 435)

## Find the Baseline (1 point)

In [284]:
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")

dummy_regr.fit(train_x, train_target)

In [285]:
from sklearn.metrics import mean_squared_error

In [286]:
dummy_train_pred = dummy_regr.predict(train_x)

baseline_train_mse = mean_squared_error(train_target, dummy_train_pred)

baseline_train_rmse = np.sqrt(baseline_train_mse)

print('Baseline Train RMSE: {}' .format(baseline_train_rmse))

Baseline Train RMSE: 332447.6233465298


In [287]:
dummy_test_pred = dummy_regr.predict(test_x)

baseline_test_mse = mean_squared_error (test_target, dummy_test_pred)

baseline_test_rmse = np.sqrt(baseline_test_mse)

print('Baseline Test RMSE: {}' .format(baseline_test_rmse))

Baseline Test RMSE: 320569.91745033476
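As a sanity check on what the baseline means: a `DummyRegressor` with `strategy="mean"` always predicts the training mean, so its training RMSE equals the (population) standard deviation of the training target. The sketch below shows this on the five `Total Cost` values visible in the `head()` output; the all-zero feature matrix is a placeholder, since the dummy model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# The five Total Cost values shown by airline.head()
toy_y = np.array([690.0, 570.0, 1027.0, 77479.0, 411.0])
toy_X = np.zeros((len(toy_y), 1))  # placeholder features, ignored by the dummy model

dummy = DummyRegressor(strategy="mean").fit(toy_X, toy_y)
rmse = np.sqrt(mean_squared_error(toy_y, dummy.predict(toy_X)))

# Both numbers are the same: RMSE around the mean is the standard deviation
print(rmse, np.std(toy_y))
```

This also shows how a single expensive incident inflates the baseline RMSE, which is worth keeping in mind when reading the large RMSE values in this notebook.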


# Section 2: 

Build the following models:


## Decision Tree: (1 point)

In [288]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=15) 

tree_reg.fit(train_x, train_target)

In [289]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Decision Tree Train Model RMSE: {}' .format(train_rmse))

Decision Tree Train Model RMSE: 26065.34836573627


In [290]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Decision Tree Test Model RMSE: {}' .format(test_rmse))

Decision Tree Test Model RMSE: 220339.37438473367


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Yes. The test RMSE (220339.37) is significantly higher than the training RMSE (26065.35), which suggests that the model is overfitting to the training data and performing poorly on the unseen test data. Further correction is required to reduce the overfitting.
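The train/test comparison is repeated for every model in this notebook, so it can be wrapped in a small helper. The function name `rmse_gap` and the returned keys are assumptions made for this sketch. For the decision tree above, the test/train ratio would be roughly 8.5.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse_gap(y_train, pred_train, y_test, pred_test):
    """Return train RMSE, test RMSE, and their ratio (test / train)."""
    train_rmse = float(np.sqrt(mean_squared_error(y_train, pred_train)))
    test_rmse = float(np.sqrt(mean_squared_error(y_test, pred_test)))
    return {
        "train_rmse": train_rmse,
        "test_rmse": test_rmse,
        "ratio": test_rmse / train_rmse,
    }

# Tiny made-up example: test errors are twice as large as train errors
report = rmse_gap([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [6.0, 8.0])
print(report)
```

A ratio near 1 suggests similar behavior on both sets; a large ratio is one informal sign of overfitting, though there is no fixed threshold.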

In [291]:
# Try several max_depth values to reduce overfitting
train_error = []
test_error = []

for x in range(1,41):
    tree_reg3 = DecisionTreeRegressor(max_depth=x)
    tree_reg3.fit(train_x, train_target)
    reg_train_predictions = tree_reg3.predict(train_x)
    reg_test_predictions = tree_reg3.predict(test_x)
    train_rmse = round(np.sqrt(mean_squared_error (train_target, reg_train_predictions)),4)
    test_rmse = round(np.sqrt(mean_squared_error (test_target,reg_test_predictions)),4)
    print('# Max depth = {}'.format(x) + "     " +'Train RMSE = {}'.format(train_rmse) + "   "
         'Test RMSE = {}'.format(test_rmse))
    
    train_error.append(train_rmse)
    test_error.append(test_rmse)

# Max depth = 1     Train RMSE = 176848.6276   Test RMSE = 199215.1509
# Max depth = 2     Train RMSE = 136035.6299   Test RMSE = 221221.7308
# Max depth = 3     Train RMSE = 114508.7978   Test RMSE = 217185.2164
# Max depth = 4     Train RMSE = 91221.0982   Test RMSE = 224414.0804
# Max depth = 5     Train RMSE = 78491.3662   Test RMSE = 223645.3903
# Max depth = 6     Train RMSE = 69826.9662   Test RMSE = 214642.3584
# Max depth = 7     Train RMSE = 61872.1754   Test RMSE = 241451.6475
# Max depth = 8     Train RMSE = 52634.357   Test RMSE = 233577.6215
# Max depth = 9     Train RMSE = 43323.6719   Test RMSE = 229390.9792
# Max depth = 10     Train RMSE = 38695.5477   Test RMSE = 210869.0218
# Max depth = 11     Train RMSE = 35800.5562   Test RMSE = 257882.1857
# Max depth = 12     Train RMSE = 33155.6291   Test RMSE = 236385.3471
# Max depth = 13     Train RMSE = 30623.1672   Test RMSE = 238794.2833
# Max depth = 14     Train RMSE = 28403.2567   Test RMSE = 225785.0262
# Max depth =

In [292]:
# Randomized search over min_samples_leaf and max_depth
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(1, 30), 
     'max_depth': np.arange(1,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_target)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [293]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

236085.96552709088 {'min_samples_leaf': 26, 'max_depth': 3}
184088.18630774444 {'min_samples_leaf': 7, 'max_depth': 12}
196028.43425198927 {'min_samples_leaf': 14, 'max_depth': 7}
209288.948745402 {'min_samples_leaf': 17, 'max_depth': 13}
255371.4531134454 {'min_samples_leaf': 1, 'max_depth': 24}
225399.66967075714 {'min_samples_leaf': 23, 'max_depth': 21}
210745.56195976163 {'min_samples_leaf': 19, 'max_depth': 10}
245783.98783379485 {'min_samples_leaf': 28, 'max_depth': 18}
196641.3918511926 {'min_samples_leaf': 15, 'max_depth': 27}
205849.9395478403 {'min_samples_leaf': 11, 'max_depth': 20}


In [294]:
grid_search.best_estimator_

In [295]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('DT_Random_Grid_search Train RMSE: {}' .format(train_rmse))

DT_Random_Grid_search Train RMSE: 107518.91626385371


In [296]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('DT_Random_Grid_search Test RMSE: {}' .format(test_rmse))

DT_Random_Grid_search Test RMSE: 186720.9726125196


## Voting regressor (1 point):

The voting regressor should have at least 3 individual models

In [297]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor

dtree_reg = DecisionTreeRegressor(max_depth=15)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=100000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

voting_reg.fit(train_x, train_target)

In [298]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Voting_Regressor_Train RMSE: {}' .format(train_rmse))

Voting_Regressor_Train RMSE: 2405213711582.3926


In [299]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Voting_Regressor_Test RMSE: {}' .format(test_rmse))

Voting_Regressor_Test RMSE: 2757795690589.5703


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Yes. The test RMSE (2757795690589.57) is higher than the train RMSE (2405213711582.39), so the model performs worse on the unseen test data than on the data it was fitted on. The relative gap between the two is moderate, suggesting a moderate rather than severe level of overfitting.
However, both values are many orders of magnitude above the baseline RMSE, so the larger problem is that the ensemble fits poorly overall.
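Because a `VotingRegressor` averages the predictions of its members, a single badly behaved member can dominate the ensemble. The sketch below shows how each fitted member could be inspected separately through `named_estimators_`. It uses synthetic data and different member models (`LinearRegression`, `Ridge`) chosen for determinism; which member, if any, is responsible for the very large RMSE in this notebook is not established here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to make the sketch runnable
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

voting = VotingRegressor(estimators=[
    ("dt", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("lin", LinearRegression()),
    ("ridge", Ridge(alpha=1.0)),
])
voting.fit(X, y)

# RMSE of each fitted member on its own
member_rmse = {}
for name, est in voting.named_estimators_.items():
    member_rmse[name] = float(np.sqrt(mean_squared_error(y, est.predict(X))))
    print(name, member_rmse[name])

# The ensemble prediction is the unweighted mean of the member predictions
member_preds = np.column_stack([est.predict(X) for est in voting.estimators_])
manual_average = member_preds.mean(axis=1)
print(np.allclose(manual_average, voting.predict(X)))
```

Applied to the notebook's `voting_reg`, the same loop over `named_estimators_` would show whether one member has an RMSE far above the others.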

In [300]:
#Performing RandomForest ensemble model
from sklearn.ensemble import RandomForestRegressor 

rnd_reg = RandomForestRegressor(n_estimators=500, max_depth=10, n_jobs=-1) 

rnd_reg.fit(train_x, train_target)

In [301]:
#Train RMSE
train_pred = rnd_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Random_forest_Train RMSE: {}' .format(train_rmse))

Random_forest_Train RMSE: 70174.30148425935


In [302]:
#Test RMSE
test_pred = rnd_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Random_forest_Test RMSE: {}' .format(test_rmse))

Random_forest_Test RMSE: 142144.68891550097


In [303]:
#Performing ExtraTreesRegressor
from sklearn.ensemble import ExtraTreesRegressor 

ext_reg = ExtraTreesRegressor(n_estimators=500, max_depth=10, n_jobs=-1) 

ext_reg.fit(train_x, train_target)

In [304]:
#Train RMSE
train_pred = ext_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Extra_Trees_Train RMSE: {}' .format(train_rmse))

Extra_Trees_Train RMSE: 44502.152906479016


In [305]:
#Test RMSE
test_pred = ext_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Extra_Trees_Test RMSE: {}' .format(test_rmse))

Extra_Trees_Test RMSE: 118513.93107396505


Both the random forest and extra-trees models still appear to be overfitting, since the test RMSE is well above the train RMSE in each case. This may be due to model complexity relative to the size of this data set, so the overfitting could not be fully corrected here.

## A Boosting model: (1 point)

Build either an Adaboost or a GradientBoost model

In [306]:
#Train on 75% of the sample only
from sklearn.ensemble import GradientBoostingRegressor
gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=100, 
                                   learning_rate=0.1, subsample=0.75) 

gb_reg.fit(train_x, train_target)

In [307]:
#Train RMSE
train_pred = gb_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('GBR_Train RMSE: {}' .format(train_rmse))

GBR_Train RMSE: 69784.61063333869


In [308]:
#Test RMSE
test_pred = gb_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('GBR_Test RMSE: {}' .format(test_rmse))

GBR_Test RMSE: 136289.93515976617


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Yes. The test RMSE (136289.94) is significantly higher than the training RMSE (69784.61), which suggests that the model is overfitting to the training data.

In [309]:
# Selecting the number of estimators
for x in range(1,30):
    gb_reg = GradientBoostingRegressor(max_depth=3, n_estimators=x, learning_rate=1.0) 
    gb_reg.fit(train_x, train_target)
    
    train_pred = gb_reg.predict(train_x)
    test_pred = gb_reg.predict(test_x)
    
    train_rmse = np.sqrt(mean_squared_error(train_target, train_pred))
    test_rmse = np.sqrt(mean_squared_error(test_target, test_pred))
    
    print('# Estimators = {}'.format(x) + "     " + 'Train rmse = {}'.format(train_rmse) + "   "
         'Test rmse = {}'.format(test_rmse))

# Estimators = 1     Train rmse = 114508.79776039407   Test rmse = 229981.5557653471
# Estimators = 2     Train rmse = 101532.97440698586   Test rmse = 235674.6973766747
# Estimators = 3     Train rmse = 97143.31453072195   Test rmse = 256725.06512603542
# Estimators = 4     Train rmse = 85416.50365703652   Test rmse = 220425.53696958142
# Estimators = 5     Train rmse = 79461.87197394701   Test rmse = 233824.63438709037
# Estimators = 6     Train rmse = 72482.75864974273   Test rmse = 230380.94645332123
# Estimators = 7     Train rmse = 68005.16160469527   Test rmse = 223581.70598771662
# Estimators = 8     Train rmse = 62726.05609194828   Test rmse = 253729.29458254826
# Estimators = 9     Train rmse = 58677.26321224798   Test rmse = 233084.73536470262
# Estimators = 10     Train rmse = 56301.773738699354   Test rmse = 260932.73769529583
# Estimators = 11     Train rmse = 51564.63064705665   Test rmse = 255853.42413280462
# Estimators = 12     Train rmse = 48959.28508307044   Test rm

In [310]:
# Early Stopping
gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=100, 
                                   learning_rate=1, 
                                  tol=0.1, n_iter_no_change=5, validation_fraction=0.2,
                                  verbose=1) 

gb_reg.fit(train_x, train_target)

      Iter       Train Loss   Remaining Time 
         1 19789147504.5983           10.22s
         2 15288644543.0339           10.03s
         3 11434189694.8943            9.65s
         4 10495032577.1999            9.52s
         5  9510168765.0698            9.51s
         6  8706717712.8772            9.36s


In [311]:
gb_reg.n_estimators_

6

In [312]:
#Train RMSE
train_pred = gb_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('modified_GBR_Train RMSE: {}' .format(train_rmse))

modified_GBR_Train RMSE: 158029.40260240863


In [313]:
#Test RMSE
test_pred = gb_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('modified_GBR_Test RMSE: {}' .format(test_rmse))

modified_GBR_Test RMSE: 302305.79740425944


The test RMSE (302305.80) is still higher than the training RMSE (158029.40), so some degree of overfitting remains. With early stopping the model kept only 6 estimators, and both errors rose compared with the first gradient boosting model (train 69784.61, test 136289.94), so these modifications did not improve generalization. Even with further modifications, I think the overfitting cannot be fully removed given the model complexity and the small data set.
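An alternative way to choose the number of boosting stages is to fit once with a smaller learning rate and score every stage on a held-out validation slice using `staged_predict`. The sketch below uses synthetic data and a simple index-based hold-out taken from the training rows, both of which are assumptions for illustration; the key point is that the selection uses validation rows, not the test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for the preprocessed training set
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

# Hold out the last 75 training rows for validation
X_fit, X_val = X[:225], X[225:]
y_fit, y_val = y[:225], y[225:]

gbr = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                learning_rate=0.1, random_state=0)
gbr.fit(X_fit, y_fit)

# staged_predict yields the prediction after each boosting stage
val_rmse = [np.sqrt(mean_squared_error(y_val, pred))
            for pred in gbr.staged_predict(X_val)]

best_n = int(np.argmin(val_rmse)) + 1  # stages are 1-indexed
print(best_n, val_rmse[best_n - 1])
```

A final model could then be refit with `n_estimators=best_n` and evaluated once on the test set.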

## Neural network: (1 point)

In [314]:
from sklearn.neural_network import MLPRegressor

#Default settings create 1 hidden layer with 100 neurons
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,),max_iter=1000)

mlp_reg.fit(train_x, train_target)



In [315]:
#Train RMSE
train_pred = mlp_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('NN_Train RMSE: {}' .format(train_rmse))

NN_Train RMSE: 339658.47208864306


In [316]:
#Test RMSE
test_pred = mlp_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('NN_Test RMSE: {}' .format(test_rmse))

NN_Test RMSE: 329209.77730830753


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Based on these RMSE values alone, it does not appear that the model is overfitting. However, the high RMSE values for both training and test sets indicate that the model's predictions have a significant deviation from the actual target values, suggesting poor overall performance.
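One possible cause of the near-baseline performance, offered here as an untested assumption, is that the network is trained directly on a large, heavily skewed cost target. A common remedy is to train on a transformed target and invert the transform at prediction time, which `TransformedTargetRegressor` handles. The sketch below uses synthetic data and a `log1p`/`expm1` pair; the layer size and iteration count are arbitrary.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Synthetic right-skewed, strictly positive "cost" target
y = np.exp(3.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200))

log_mlp = TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions
)
log_mlp.fit(X, y)

pred = log_mlp.predict(X)  # predictions are back on the original cost scale
print(pred[:3])
```

Whether this helps on the airline data would need to be checked by comparing test RMSE on the original scale.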

## Grid search (1 point)

Perform either a full or randomized grid search on any model you want. There has to be at least two parameters for the search. 

In [328]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'hidden_layer_sizes': [(50,25,10),(75,50,25,10),(100,75,50,25,10)],
     'alpha': [0.01,0.05,0.10],
     'max_iter': [500,750,1000]}]

dnn_reg = MLPRegressor()

grid_search = RandomizedSearchCV(dnn_reg, param_grid, cv=3, n_iter=5,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_target)

Fitting 3 folds for each of 5 candidates, totalling 15 fits




In [329]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

210675.2893235496 {'max_iter': 750, 'hidden_layer_sizes': (50, 25, 10), 'alpha': 0.1}
195638.24654420023 {'max_iter': 500, 'hidden_layer_sizes': (100, 75, 50, 25, 10), 'alpha': 0.1}
193444.9672960027 {'max_iter': 500, 'hidden_layer_sizes': (100, 75, 50, 25, 10), 'alpha': 0.05}
206523.98300014247 {'max_iter': 750, 'hidden_layer_sizes': (50, 25, 10), 'alpha': 0.05}
197917.6280520027 {'max_iter': 750, 'hidden_layer_sizes': (75, 50, 25, 10), 'alpha': 0.01}


In [330]:
grid_search.best_estimator_

In [338]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('NN_Random_search_Train RMSE: {}' .format(train_rmse))

NN_Random_search_Train RMSE: 5763.080455402821


In [339]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('NN_Random_search_Test RMSE: {}' .format(test_rmse))

NN_Random_search_Test RMSE: 160047.6136512824


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Yes. Based on the substantial gap between the low training RMSE (5763.08) and the high test RMSE (160047.61), there are clear signs that this model is overfitting to the training data.

In [340]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'hidden_layer_sizes': [(50,25,10),(75,50,25,10),(100,75,50,25,10),(125,100,75,50,25,10)], 
     'alpha': np.arange(0.01,1),  # note: with the default step of 1, this yields only [0.01]
     'max_iter': np.arange(1000,10000)}
  ]

dnn_reg = MLPRegressor()

grid_search = RandomizedSearchCV(dnn_reg, param_grid, cv=5, n_iter=5,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_target)

Fitting 5 folds for each of 5 candidates, totalling 25 fits




In [341]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

168020.60499542343 {'max_iter': 2183, 'hidden_layer_sizes': (100, 75, 50, 25, 10), 'alpha': 0.01}
173235.4019946772 {'max_iter': 9460, 'hidden_layer_sizes': (75, 50, 25, 10), 'alpha': 0.01}
170713.5754717724 {'max_iter': 1206, 'hidden_layer_sizes': (100, 75, 50, 25, 10), 'alpha': 0.01}
179139.57564360186 {'max_iter': 8902, 'hidden_layer_sizes': (50, 25, 10), 'alpha': 0.01}
170938.81681144692 {'max_iter': 6747, 'hidden_layer_sizes': (125, 100, 75, 50, 25, 10), 'alpha': 0.01}
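Every candidate above has `alpha` equal to 0.01. That follows from how `np.arange` works: without an explicit step it advances by 1, so `np.arange(0.01, 1)` contains a single value. The sketch below shows this and one possible replacement, a log-spaced grid; the bounds and the number of points are assumptions, not tuned values.

```python
import numpy as np

# As written in the search: start=0.01, stop=1, default step=1
alpha_as_written = np.arange(0.01, 1)
print(alpha_as_written)  # [0.01]

# One alternative: five values spaced evenly on a log scale from 0.01 to 1
alpha_log_grid = np.logspace(-2, 0, num=5)
print(alpha_log_grid)
```

Since `alpha` is the regularization strength of `MLPRegressor`, searching over a real range of values gives the search a chance to reduce overfitting.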


In [342]:
grid_search.best_estimator_

In [343]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Modified_NN_Random_search_Train RMSE: {}' .format(train_rmse))

Modified_NN_Random_search_Train RMSE: 4557.5831169372495


In [344]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Modified_NN_Random_search_Test RMSE: {}' .format(test_rmse))

Modified_NN_Random_search_Test RMSE: 160114.34033512385


After several adjustments to the randomized search, the overfitting concern persists: the train RMSE (4557.58) remains far below the test RMSE (160114.34).

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)
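One way to assemble this list is to collect the RMSE values printed in the cells above into a single table. The sketch below copies those printed values, rounded to two decimals; the model labels are shorthand chosen for this table. Sorting by test RMSE puts the model with the best test value first.

```python
import pandas as pd

# (model, train RMSE, test RMSE), copied from the outputs above
rows = [
    ("Baseline (mean)",             332447.62,        320569.92),
    ("Decision Tree (max_depth=15)", 26065.35,        220339.37),
    ("Decision Tree (random search)", 107518.92,      186720.97),
    ("Voting Regressor",            2405213711582.39, 2757795690589.57),
    ("Random Forest",               70174.30,         142144.69),
    ("Extra Trees",                 44502.15,         118513.93),
    ("Gradient Boosting",           69784.61,         136289.94),
    ("Gradient Boosting (early stopping)", 158029.40, 302305.80),
    ("Neural Network",              339658.47,        329209.78),
    ("Neural Network (random search)", 5763.08,       160047.61),
    ("Neural Network (modified search)", 4557.58,     160114.34),
]

results = pd.DataFrame(rows, columns=["model", "train_rmse", "test_rmse"])
results = results.sort_values("test_rmse").reset_index(drop=True)
print(results.to_string(index=False))
```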

## Which model performs the best and why? (1 point) 

Hint: The best model is the one that has the best TEST value (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## How does it compare to baseline? (1 point)