How to download Kaggle Data https://www.kaggle.com/discussions/general/74235



In [308]:
# Import necessary modules.
import pandas as pd
import numpy as np

Load the dataset

In [309]:
train_data = pd.read_csv("datathon_train.csv")

Begin preliminary analysis of the data. Exploratory phase

In [310]:
# Drop the Id column
train_data.drop("Id", axis=1, inplace=True)

In [311]:
# Get summary statistics
train_data.describe()

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,DISTANCE,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,MANUFACTURE_YEAR,NUMBER_OF_SEATS,AIRPORT_FLIGHTS_MONTH,AIRLINE_FLIGHTS_MONTH,...,FLT_ATTENDANTS_PER_PASS,GROUND_SERV_PER_PASS,PLANE_AGE,PRCP,SNOW,SNWD,TMAX,AWND,DEP_DELAY_NEW,IS_DELAYED
count,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,...,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0,697224.0
mean,6.620198,15.723732,3.931059,852.680886,3.064137,28.073635,2007.433931,134.799211,12398.966083,63379.64382,...,9.9e-05,0.000135,11.566069,0.10486,0.032057,0.090429,71.472508,8.369713,15.409382,0.399027
std,3.397212,8.756372,1.996984,611.715805,1.759975,21.329959,7.046577,47.035304,8315.616727,33670.96123,...,8.5e-05,4.7e-05,7.046577,0.342738,0.318871,0.72449,18.356162,3.609616,47.38517,0.489699
min,1.0,1.0,1.0,66.0,1.0,1.0,1987.0,44.0,1100.0,835.0,...,0.0,7e-06,0.0,0.0,0.0,0.0,-10.0,0.0,0.0,0.0
25%,4.0,8.0,2.0,402.0,2.0,11.0,2002.0,90.0,5445.0,25367.0,...,3.4e-05,9.9e-05,5.0,0.0,0.0,0.0,59.0,5.82,0.0,0.0
50%,7.0,16.0,4.0,687.0,3.0,23.0,2007.0,143.0,11398.0,68293.0,...,6.2e-05,0.000125,12.0,0.0,0.0,0.0,74.0,7.83,0.0,0.0
75%,10.0,23.0,6.0,1087.0,4.0,39.0,2014.0,173.0,17499.0,84667.0,...,0.000144,0.000177,17.0,0.02,0.0,0.0,86.0,10.51,10.0,1.0
max,12.0,31.0,7.0,5095.0,15.0,109.0,2019.0,337.0,33340.0,116615.0,...,0.000348,0.000229,32.0,11.63,17.2,25.2,115.0,33.78,1742.0,1.0


Do we have an imbalanced dataset? Yes...yes we do.

In [312]:
# Do we have an imbalanced dataset? Let's find the distribution of the target variable
# Notice how there are more flights (rows) that aren't delayed than flights that are
train_data["IS_DELAYED"].value_counts()


IS_DELAYED
0    419013
1    278211
Name: count, dtype: int64
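To put a number on the imbalance, the counts above can be turned into proportions. This is a minimal sketch that reuses the two counts printed above; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Class counts copied from the output above
counts = pd.Series({0: 419013, 1: 278211}, name="count")

# Share of each class. On the real data,
# train_data["IS_DELAYED"].value_counts(normalize=True) gives the same proportions.
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 60% of flights are on time and 40% are delayed, so the imbalance is moderate. A model that always predicts "not delayed" would already be about 60% accurate, which is one reason to look at a ranking metric rather than plain accuracy.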

Let's decide what columns/features to use in our prediction.

In [313]:
train_data.columns

Index(['MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'DEPARTING_AIRPORT',
       'ORIGIN_CITY_NAME', 'DEST', 'DEST_CITY_NAME', 'DEP_TIME_BLK',
       'DISTANCE', 'SEGMENT_NUMBER', 'CONCURRENT_FLIGHTS', 'MANUFACTURE_YEAR',
       'NUMBER_OF_SEATS', 'CARRIER_NAME', 'AIRPORT_FLIGHTS_MONTH',
       'AIRLINE_FLIGHTS_MONTH', 'AIRLINE_AIRPORT_FLIGHTS_MONTH',
       'AVG_MONTHLY_PASS_AIRPORT', 'AVG_MONTHLY_PASS_AIRLINE',
       'CARGO_HANDLING', 'FLT_ATTENDANTS_PER_PASS', 'GROUND_SERV_PER_PASS',
       'PLANE_AGE', 'PREVIOUS_AIRPORT', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'AWND',
       'DEP_DELAY_NEW', 'IS_DELAYED'],
      dtype='object')

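1. Are any of the features basically the "same" as another feature? Well, the city names say the same thing as the airports, and the manufacture year carries the same information as the age of the plane. DEP_DELAY_NEW should also be removed: it measures the departure delay directly, so keeping it would leak the target into the features. The cell below also drops PREVIOUS_AIRPORT.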

In [314]:
# Let's drop the redundant features to reduce our training time in the future
train_data.drop(['ORIGIN_CITY_NAME', 'DEST_CITY_NAME', 'MANUFACTURE_YEAR', 'DEP_DELAY_NEW','PREVIOUS_AIRPORT'], axis=1, inplace=True)
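One way to confirm that two numeric columns carry the same information is to check their correlation. The summary statistics above hint at it: MANUFACTURE_YEAR and PLANE_AGE have the same standard deviation (7.046577), and their means add up to 2019. The sketch below uses a small made-up frame and assumes that PLANE_AGE is 2019 minus MANUFACTURE_YEAR, which the notebook does not state explicitly.

```python
import pandas as pd

# Toy data, not the real dataset; PLANE_AGE is assumed to be 2019 - MANUFACTURE_YEAR
toy = pd.DataFrame({"MANUFACTURE_YEAR": [1999, 2005, 2012, 2018]})
toy["PLANE_AGE"] = 2019 - toy["MANUFACTURE_YEAR"]

# A correlation of -1 (or +1) means one column is a linear function of the other
corr = toy["MANUFACTURE_YEAR"].corr(toy["PLANE_AGE"])
print(corr)
```

Running the same check on the real columns before dropping one of them is a cheap sanity test.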

Next, we have to turn the categorical variables into numbers so we can use them to train our model! We'll use sklearn's `LabelEncoder()` for this.

Here's how a label encoder works:

`fit`
If we have a list of discrete variables, like ["a", "b", "b", "c"], the label encoder will locate each unique item in the list ("a", "b", "c") and assign an integer to that object, for instance,

"a" -> 0

"b" -> 1

"c" -> 2

`transform`
Now, when we encounter a list like ["b", "b", "c", "a"], the LabelEncoder will perform the translation between string and number, and output [1, 1, 2, 0], essentially replacing each string with the corresponding number.

Label encoders, however, do not handle unseen values. If we try to translate "d", the LabelEncoder will throw an error. So if our training set contains only "a", "b", and "c", and our testing set contains a new string "d", we'll run into a problem. To account for this, we'll add "UNSEEN" to the unique items when fitting, so that when we encounter an unknown value in the testing set, we can replace it with "UNSEEN" and continue encoding.
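Here is a small sketch of that behaviour on the toy list from above. It is only an illustration; the variable names are made up and it does not touch the flight data.

```python
from sklearn.preprocessing import LabelEncoder

# fit: learn the unique values (stored in sorted order)
plain_le = LabelEncoder().fit(["a", "b", "b", "c"])
print(plain_le.transform(["b", "b", "c", "a"]))  # [1 1 2 0]

# transform on a value that was never fitted raises a ValueError
try:
    plain_le.transform(["d"])
    raised = False
except ValueError:
    raised = True
print(raised)

# With the placeholder added at fit time, unknown values can be mapped to it first
le = LabelEncoder().fit(["a", "b", "b", "c"] + ["UNSEEN"])
known = set(le.classes_)
safe = [v if v in known else "UNSEEN" for v in ["b", "d", "a"]]
encoded = le.transform(safe)
print(list(le.classes_), encoded)
```

Because the encoder sorts its classes, adding "UNSEEN" shifts the integers: here "a" maps to 1 rather than 0. That is harmless as long as the same fitted encoder is used for both train and test data.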

In [315]:
# For each of the string, categorical variables, we must encode these values as numbers.
from sklearn.preprocessing import LabelEncoder

dest_le = LabelEncoder().fit(train_data["DEST"].tolist() + ["UNSEEN"])
train_data["DEST"] = dest_le.transform(train_data["DEST"])

carrier_name_le = LabelEncoder().fit(train_data["CARRIER_NAME"].tolist()+ ["UNSEEN"])
train_data["CARRIER_NAME"] = carrier_name_le.transform(train_data["CARRIER_NAME"])

dep_time_blk_le = LabelEncoder().fit(train_data["DEP_TIME_BLK"].tolist()+ ["UNSEEN"])
train_data["DEP_TIME_BLK"] = dep_time_blk_le.transform(train_data["DEP_TIME_BLK"])

departing_airport_le = LabelEncoder().fit(train_data["DEPARTING_AIRPORT"].tolist()+ ["UNSEEN"])
train_data["DEPARTING_AIRPORT"] = departing_airport_le.transform(train_data["DEPARTING_AIRPORT"])

2. For this starter code, I'll select a subset of the columns to use as my features. Each selection cell below keeps a different subset, so run only one of them. You should do your own selection, and think about what features would be useful!

In [307]:
train_data = train_data[['DEST',
       'DEP_TIME_BLK', 'DISTANCE', 'SEGMENT_NUMBER', 'CONCURRENT_FLIGHTS',
       'NUMBER_OF_SEATS', 'CARRIER_NAME', 'AIRPORT_FLIGHTS_MONTH',
       'AIRLINE_FLIGHTS_MONTH', 'AIRLINE_AIRPORT_FLIGHTS_MONTH',
       'AVG_MONTHLY_PASS_AIRPORT', 'AVG_MONTHLY_PASS_AIRLINE',
       'CARGO_HANDLING','PLANE_AGE', 'PRCP', 'SNOW', 'SNWD', 'AWND','IS_DELAYED']]

In [189]:
train_data = train_data[['AWND','AIRLINE_FLIGHTS_MONTH','DISTANCE','AVG_MONTHLY_PASS_AIRLINE','SNOW','PRCP','SEGMENT_NUMBER','CARGO_HANDLING','IS_DELAYED']]

In [108]:
train_data = train_data[['AWND','DISTANCE','AIRLINE_AIRPORT_FLIGHTS_MONTH','SNOW','NUMBER_OF_SEATS','PRCP','SEGMENT_NUMBER', 'AIRLINE_FLIGHTS_MONTH', 'AVG_MONTHLY_PASS_AIRLINE','CARGO_HANDLING', 'IS_DELAYED']]


In [316]:
train_data.corr()

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,DEPARTING_AIRPORT,DEST,DEP_TIME_BLK,DISTANCE,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,...,CARGO_HANDLING,FLT_ATTENDANTS_PER_PASS,GROUND_SERV_PER_PASS,PLANE_AGE,PRCP,SNOW,SNWD,TMAX,AWND,IS_DELAYED
MONTH,1.0,0.0079,0.004115,-0.003732,-0.006372,0.000513,-0.003786,0.02013,0.026436,0.002932,...,-0.00912,0.003142,-0.006976,-0.018068,-0.005187,-0.053381,-0.088859,0.171282,-0.119111,-0.016193
DAY_OF_MONTH,0.0079,1.0,0.007654,0.002837,-0.001285,-0.001562,0.001973,-0.000607,0.00066,0.000729,...,-0.002825,0.002131,0.00023,-0.000269,0.016722,0.005042,0.006311,0.016879,0.040346,0.003041
DAY_OF_WEEK,0.004115,0.007654,1.0,0.001732,0.002943,0.008441,0.014112,-0.026466,-0.027232,0.00934,...,0.000413,0.000378,0.002181,-0.003918,0.017159,-0.004371,-0.011323,0.007677,0.000752,0.005666
DEPARTING_AIRPORT,-0.003732,0.002837,0.001732,1.0,0.012353,-0.042449,0.098492,-0.033514,-0.364522,0.061262,...,-0.000698,-0.050244,0.006561,-0.040894,-0.030233,0.004551,-0.038147,0.013161,-0.061727,-0.00247
DEST,-0.006372,-0.001285,0.002943,0.012353,1.0,0.017773,0.075202,-0.004623,0.035293,0.015641,...,0.009503,0.004917,0.028326,-0.037299,-0.020566,0.003673,0.00734,-0.017343,-0.006082,0.009433
DEP_TIME_BLK,0.000513,-0.001562,0.008441,-0.042449,0.017773,1.0,-0.02761,0.73731,0.049274,-0.023026,...,-0.020386,-0.00136,-0.024924,0.010948,-0.00378,-0.003925,-0.00571,0.021328,0.001732,0.177448
DISTANCE,-0.003786,0.001973,0.014112,0.098492,0.075202,-0.02761,1.0,-0.242376,-0.036127,0.456127,...,0.042916,0.169596,0.285271,-0.137902,-0.013098,0.000188,-0.007706,0.004168,0.024092,0.054035
SEGMENT_NUMBER,0.02013,-0.000607,-0.026466,-0.033514,-0.004623,0.73731,-0.242376,1.0,0.006736,-0.208146,...,-0.020133,-0.102945,-0.189199,0.079894,-0.016659,-0.014796,-0.007161,0.031896,-0.026695,0.129363
CONCURRENT_FLIGHTS,0.026436,0.00066,-0.027232,-0.364522,0.035293,0.049274,-0.036127,0.006736,1.0,-0.066207,...,-0.067438,0.127184,0.10417,0.037866,-0.012638,-0.01493,-0.021376,0.020139,0.054053,0.011263
NUMBER_OF_SEATS,0.002932,0.000729,0.00934,0.061262,0.015641,-0.023026,0.456127,-0.208146,-0.066207,1.0,...,0.358686,0.208898,0.344619,-0.108893,-0.011384,-0.006468,-0.014613,0.062273,-0.024643,0.08982
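The full matrix is hard to read at a glance. When choosing features, the column that matters most is the last one, the correlation of each feature with IS_DELAYED. A minimal sketch on made-up data (FEATURE_A and FEATURE_B are placeholder names):

```python
import pandas as pd

# Toy data, not the real dataset
toy = pd.DataFrame({
    "FEATURE_A": [1, 2, 3, 4, 5, 6],
    "FEATURE_B": [2, 1, 2, 1, 2, 1],
    "IS_DELAYED": [0, 0, 0, 1, 1, 1],
})

# Correlation of each feature with the target, strongest first
target_corr = (
    toy.corr()["IS_DELAYED"]
    .drop("IS_DELAYED")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Keep in mind that this only captures linear relationships, and the integer codes produced by a label encoder have no meaningful order, so a low value here does not prove that a feature is useless.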


Now, we convert this dataframe into a numpy array to begin the model training process

In [317]:
train_data_np = train_data.to_numpy()

Then, we separate the features from the target variable

In [318]:
X = train_data_np[:, :-1] # All rows, and every column except for the last one, which is the target variable
y = train_data_np[:, -1]

Now, we split the data into a training set and a testing set so we can both train the model and evaluate the model after training it

In [319]:
from sklearn.model_selection import train_test_split
# IF YOUR MODEL IS TAKING TOO LONG TO RUN, INCREASE THE TEST SIZE to 0.7, original 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
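The split above is random, so the score changes a little on every run. As an optional variation, `train_test_split` accepts `random_state` to make the split repeatable and `stratify` to keep the class ratio the same in both parts. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, not the real dataset: 100 rows, 40% positive
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

The value 42 is arbitrary; any fixed integer works.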

Now, we define our model. This is truly where the magic happens, and it's truly just plug and play. Feel free to swap out my model with any one of these, and explore how the results change!

In [320]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

In [133]:
model = RandomForestClassifier(
    n_estimators=150,
    n_jobs=-1
)

In [258]:
model = QuadraticDiscriminantAnalysis()

In [321]:
model = AdaBoostClassifier(
    n_estimators=200
)

Here, I'm just defining a random model

In [403]:
# model = GaussianNB()
model = KNeighborsClassifier()

In [322]:
model.fit(X_train, y_train)

Great! Now that our model is done training, let's see how we did. To evaluate our model, we must decide which metric to evaluate it on. We'll be using AUROC, the area under the ROC curve.
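One way to read AUROC: it is the probability that a randomly chosen delayed flight gets a higher score than a randomly chosen on-time flight. A score of 0.5 means random ranking and 1.0 means perfect ranking. The function below is a slow, illustration-only sketch of that definition on toy numbers; in practice we use sklearn's `roc_auc_score`.

```python
def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive is scored higher.

    Ties count as half. This O(n^2) version is for illustration only.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and scores, not from the dataset
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(y_true, scores))  # 0.75
```

Only the ordering of the scores matters, not their absolute values.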

In [323]:
from sklearn.metrics import roc_auc_score

In [324]:
# Use our testing subset and make predictions
y_test_predictions_probabilities = model.predict_proba(X_test)

`predict_proba` is a function that returns the probability/confidence of the model for each class.

In [325]:
y_test_predictions_probabilities

array([[0.49959828, 0.50040172],
       [0.50066796, 0.49933204],
       [0.50077064, 0.49922936],
       ...,
       [0.50099127, 0.49900873],
       [0.49896267, 0.50103733],
       [0.50237652, 0.49762348]])

If we examine the first row, [0.49959828, 0.50040172], we interpret this as the model being about 49.96% confident that the label should be 0, and about 50.04% confident that the label should be 1. The AUROC score is concerned only with the probability of the 1 label, so we must grab the second column.

In [326]:
y_test_predictions = y_test_predictions_probabilities[:, 1] # All rows, second column
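A quick sketch of why the second column is enough, using the first three rows of the output shown above. The column order follows `model.classes_`, so with labels 0 and 1 the second column belongs to class 1.

```python
import numpy as np

# First three rows of the predict_proba output shown above
proba = np.array([
    [0.49959828, 0.50040172],
    [0.50066796, 0.49933204],
    [0.50077064, 0.49922936],
])

# Each row sums to 1, so the second column alone carries all the information
row_sums = proba.sum(axis=1)
positive_scores = proba[:, 1]
print(row_sums, positive_scores)
```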

In [327]:
roc_auc_score(y_test, y_test_predictions)

0.6823558236652204

This is decent!
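Since the models really are plug and play, the same evaluation can be run over several candidates in a loop. The sketch below uses synthetic data from `make_classification` as a stand-in for the flight data, and the three models and their settings are arbitrary picks, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the flight data here
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gaussian_nb": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

results = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    # Score every model on the same held-out part with the same metric
    results[name] = roc_auc_score(y_te, candidate.predict_proba(X_te)[:, 1])

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On the real data, replace the synthetic arrays with `X_train`, `X_test`, `y_train`, and `y_test` from the split above.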

Now let's load our test data and make predictions on that, then create our submission file

In [340]:
test_data = pd.read_csv("datathon_test.csv")

In [341]:
test_data

Unnamed: 0,Id,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,DEPARTING_AIRPORT,ORIGIN_CITY_NAME,DEST,DEST_CITY_NAME,DEP_TIME_BLK,DISTANCE,...,CARGO_HANDLING,FLT_ATTENDANTS_PER_PASS,GROUND_SERV_PER_PASS,PLANE_AGE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
0,0,7,1,1,Minneapolis-St Paul International,"Minneapolis, MN",PHL,"Philadelphia, PA",2000-2059,980,...,1462,0.000144,0.000149,18,Bradley International,0.00,0.0,0.0,93.0,4.70
1,1,4,12,5,Los Angeles International,"Los Angeles, CA",BNA,"Nashville, TN",1000-1059,1797,...,1462,0.000144,0.000149,18,Cincinnati/Northern Kentucky International,0.00,0.0,0.0,73.0,12.30
2,2,10,11,5,Ronald Reagan Washington National,"Washington, DC",SYR,"Syracuse, NY",0700-0759,298,...,0,0.000000,0.000090,16,Norfolk International,0.00,0.0,0.0,75.0,5.82
3,3,9,26,4,LaGuardia,"New York, NY",ORD,"Chicago, IL",2000-2059,733,...,518,0.000254,0.000229,12,Chicago O'Hare International,0.00,0.0,0.0,82.0,9.40
4,4,8,13,2,Detroit Metro Wayne County,"Detroit, MI",ERI,"Erie, PA",2000-2059,164,...,10,0.000034,0.000099,16,Kalamazoo/Battle Creek International,0.00,0.0,0.0,88.0,7.61
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298806,298806,6,16,7,Chicago O'Hare International,"Chicago, IL",AVL,"Asheville, NC",1800-1859,536,...,10,0.000034,0.000099,17,Duluth International,0.07,0.0,0.0,60.0,9.62
298807,298807,3,20,3,Tampa International,"Tampa, FL",EWR,"Newark, NJ",0700-0759,997,...,518,0.000254,0.000229,4,NONE,0.00,0.0,0.0,76.0,8.50
298808,298808,10,6,7,John F. Kennedy International,"New York, NY",MCO,"Orlando, FL",2100-2159,944,...,62,0.000160,0.000127,13,Orlando International,0.01,0.0,0.0,71.0,10.96
298809,298809,7,17,3,General Mitchell Field,"Milwaukee, WI",LGA,"New York, NY",0600-0659,738,...,10955,0.000062,0.000099,19,NONE,0.00,0.0,0.0,84.0,6.71


We must select exactly the same feature columns, in the same order, that we used when training, because these features are what our model is trained on

Note, we MUST keep the Id column here to create our submission file

In [342]:
# .copy() gives an independent DataFrame, so later column assignments don't trigger SettingWithCopyWarning
test_data = test_data[['Id','DEST',
       'DEP_TIME_BLK', 'DISTANCE', 'SEGMENT_NUMBER', 'CONCURRENT_FLIGHTS',
       'NUMBER_OF_SEATS', 'CARRIER_NAME', 'AIRPORT_FLIGHTS_MONTH',
       'AIRLINE_FLIGHTS_MONTH', 'AIRLINE_AIRPORT_FLIGHTS_MONTH',
       'AVG_MONTHLY_PASS_AIRPORT', 'AVG_MONTHLY_PASS_AIRLINE',
       'CARGO_HANDLING','PLANE_AGE', 'PRCP', 'SNOW', 'SNWD', 'AWND']].copy()
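Typing the column list twice (once for train, once for test) makes it easy for the two to drift apart, and a mismatch in the number of columns makes `predict_proba` raise a `ValueError`. One way to avoid that is to keep the list in a single variable and use it on both sides. This is a sketch on toy frames; `feature_columns` is a hypothetical name that the notebook does not define.

```python
import pandas as pd

# Hypothetical name: feature_columns is not defined in the notebook
feature_columns = ["DISTANCE", "PRCP", "SNOW"]

# Toy frames, not the real dataset; note the different column order in the test frame
toy_train = pd.DataFrame({"Id": [0, 1], "DISTANCE": [980, 1797], "PRCP": [0.0, 0.1],
                          "SNOW": [0.0, 0.0], "TMAX": [93.0, 73.0], "IS_DELAYED": [0, 1]})
toy_test = pd.DataFrame({"Id": [0, 1], "SNOW": [0.0, 0.2], "TMAX": [60.0, 76.0],
                         "PRCP": [0.07, 0.0], "DISTANCE": [536, 997]})

# Select by the same list on both sides so the columns and their order always match
X_train_toy = toy_train[feature_columns]
X_test_toy = toy_test[feature_columns]

print(list(X_train_toy.columns) == list(X_test_toy.columns), X_test_toy.shape)
```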

In [330]:
test_data.head()

Unnamed: 0,Id,DEST,DEP_TIME_BLK,DISTANCE,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,AIRLINE_FLIGHTS_MONTH,AIRLINE_AIRPORT_FLIGHTS_MONTH,AVG_MONTHLY_PASS_AIRPORT,AVG_MONTHLY_PASS_AIRLINE,CARGO_HANDLING,PLANE_AGE,PRCP,SNOW,SNWD,AWND
0,0,PHL,2000-2059,980,4,47,132,Delta Air Lines Inc.,14787,86274,6606,1581456,12460183,1462,18,0.0,0.0,0.0,4.7
1,1,BNA,1000-1059,1797,2,40,160,Delta Air Lines Inc.,17338,76299,2995,2780593,12460183,1462,18,0.0,0.0,0.0,12.3
2,2,SYR,0700-0759,298,2,23,50,Comair Inc.,11637,23411,2415,955406,1245396,0,16,0.0,0.0,0.0,5.82
3,3,ORD,2000-2059,733,5,28,120,United Air Lines Inc.,13731,51182,740,1208249,8501631,518,12,0.0,0.0,0.0,9.4
4,4,ERI,2000-2059,164,9,64,50,SkyWest Airlines Inc.,14439,70841,4166,1486066,3472966,10,16,0.0,0.0,0.0,7.61


We must make DEST, CARRIER_NAME, and DEP_TIME_BLK numerical using the SAME label encoders we used on our train data for consistency (and DEPARTING_AIRPORT too, if you kept it as a feature). But first, as mentioned before, we must check whether there are any values in these columns that weren't in the training data so we don't run into any errors. If we find any, we replace them with "UNSEEN".

In [343]:
new_dep_time_blk = []
for value in test_data["DEP_TIME_BLK"]:
       # If the value is unknown, we tag the "UNSEEN"
       if value not in dep_time_blk_le.classes_:
              new_dep_time_blk.append("UNSEEN")
       # If the value is known to the labelencoder, we can safely append that value
       else:
              new_dep_time_blk.append(value)
# Replace
test_data["DEP_TIME_BLK"] = new_dep_time_blk



In [332]:
# DEPARTING_AIRPORT is only present if it was kept in the column selection above
if "DEPARTING_AIRPORT" in test_data.columns:
    new_departing_airport = []
    for value in test_data["DEPARTING_AIRPORT"]:
        # If the value is unknown, we tag it "UNSEEN"
        if value not in departing_airport_le.classes_:
            new_departing_airport.append("UNSEEN")
        # If the value is known to the label encoder, we can safely append that value
        else:
            new_departing_airport.append(value)
    # Replace
    test_data["DEPARTING_AIRPORT"] = new_departing_airport

In [344]:
new_carrier = []
for value in test_data["CARRIER_NAME"]:
       # If the value is unknown, we tag the "UNSEEN"
       if value not in carrier_name_le.classes_:
              new_carrier.append("UNSEEN")
       # If the value is known to the labelencoder, we can safely append that value
       else:
              new_carrier.append(value)
# Replace
test_data["CARRIER_NAME"] = new_carrier



In [345]:
new_dest = []
for value in test_data["DEST"]:
       # If the value is unknown, we tag the "UNSEEN"
       if value not in dest_le.classes_:
              new_dest.append("UNSEEN")
       # If the value is known to the labelencoder, we can safely append that value
       else:
              new_dest.append(value)
# Replace
test_data["DEST"] = new_dest
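The loops above all follow the same pattern, so they can be folded into one small helper. This is a sketch under the assumption that each encoder was fitted with "UNSEEN" included, as in the training cells; `encode_with_unseen` is a hypothetical name, not something the notebook defines.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_with_unseen(encoder, values):
    """Replace values the encoder does not know with "UNSEEN", then encode.

    Hypothetical helper; assumes the encoder was fitted with "UNSEEN" included.
    """
    values = pd.Series(values)
    # Keep known values, swap everything else for the placeholder
    safe = values.where(values.isin(encoder.classes_), "UNSEEN")
    return encoder.transform(safe)

# Toy example, not the real columns
toy_le = LabelEncoder().fit(["ATL", "ORD", "PHL", "UNSEEN"])
toy_codes = encode_with_unseen(toy_le, ["PHL", "XYZ", "ATL"])
print(toy_codes)
```

With the real data, a call such as `encode_with_unseen(dest_le, test_data["DEST"])` would do the replacement and the encoding in one step.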



In [76]:
test_data

Unnamed: 0,Id,AWND,AIRLINE_FLIGHTS_MONTH,DISTANCE,AVG_MONTHLY_PASS_AIRLINE,SNOW,PRCP,SEGMENT_NUMBER,CARGO_HANDLING
0,0,4.70,86274,980,12460183,0.0,0.00,4,1462
1,1,12.30,76299,1797,12460183,0.0,0.00,2,1462
2,2,5.82,23411,298,1245396,0.0,0.00,2,0
3,3,9.40,51182,733,8501631,0.0,0.00,5,518
4,4,7.61,70841,164,3472966,0.0,0.00,9,10
...,...,...,...,...,...,...,...,...,...
298806,298806,9.62,68083,536,3472966,0.0,0.07,6,10
298807,298807,8.50,52866,997,8501631,0.0,0.00,1,518
298808,298808,10.96,24795,944,3190369,0.0,0.01,5,62
298809,298809,6.71,116615,738,13382999,0.0,0.00,1,10955


In [346]:
test_data["DEP_TIME_BLK"] = dep_time_blk_le.transform(test_data["DEP_TIME_BLK"])
test_data["CARRIER_NAME"] = carrier_name_le.transform(test_data["CARRIER_NAME"])
test_data["DEST"] = dest_le.transform(test_data["DEST"])


In [77]:
test_data

Unnamed: 0,Id,DEPARTING_AIRPORT,NUMBER_OF_SEATS,PRCP,CARRIER_NAME,AIRLINE_FLIGHTS_MONTH,DEP_TIME_BLK
0,0,48,132,0.00,5,86274,15
1,1,41,160,0.00,5,76299,5
2,2,71,50,0.00,4,23411,2
3,3,36,120,0.00,16,51182,15
4,4,16,50,0.00,12,70841,15
...,...,...,...,...,...,...,...
298806,298806,10,50,0.07,12,68083,13
298807,298807,84,173,0.00,16,52866,2
298808,298808,31,162,0.01,9,24795,16
298809,298809,22,143,0.00,13,116615,1


Great! Let's now do the same thing as we did before

In [347]:
test_data_np = test_data.to_numpy()

This is now entirely test data, and we don't need to split using `train_test_split` because we're not training a new model.

In [348]:
X_TEST = test_data_np[:, 1:] # The first column is the Id column, which we do not want to keep in our predictions

In [349]:
# X_TEST must contain the same features, in the same order, as the X the model was fitted on;
# otherwise scikit-learn raises a ValueError about the number of features.
predictions = model.predict_proba(X_TEST)[:, 1] # Just like before

In [299]:
predictions

array([0.50050824, 0.50079592, 0.49995737, ..., 0.50015867, 0.50099369,
       0.50016569])

In [300]:
len(predictions)

298811

Now, time to make our submission file! The submission file has two columns, named exactly "Id" and "IS_DELAYED".

In [301]:
submission = test_data[["Id"]].copy() # .copy() so adding a column later doesn't trigger SettingWithCopyWarning

In [302]:
submission

Unnamed: 0,Id
0,0
1,1
2,2
3,3
4,4
...,...
298806,298806
298807,298807
298808,298808
298809,298809


In [303]:
submission["IS_DELAYED"] = predictions



In [304]:
submission

Unnamed: 0,Id,IS_DELAYED
0,0,0.500508
1,1,0.500796
2,2,0.499957
3,3,0.499865
4,4,0.500024
...,...,...
298806,298806,0.501114
298807,298807,0.499989
298808,298808,0.500159
298809,298809,0.500994


Now, we save the dataframe into a CSV

In [306]:
submission.to_csv("test2_submission.csv", index=False)
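Before uploading, it can be worth reading the file back and checking that it has the shape described above. The sketch below uses a toy submission and an in-memory buffer instead of a real file; the specific checks are suggestions, not official competition rules.

```python
import io
import pandas as pd

# Toy submission, not the real predictions
toy_submission = pd.DataFrame({"Id": [0, 1, 2], "IS_DELAYED": [0.50, 0.49, 0.51]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Expect exactly the two named columns, unique ids, and valid probabilities
columns_ok = list(check.columns) == ["Id", "IS_DELAYED"]
ids_ok = check["Id"].is_unique
probs_ok = check["IS_DELAYED"].between(0, 1).all()
print(columns_ok, ids_ok, probs_ok)
```

For the real file, `pd.read_csv("test2_submission.csv")` takes the place of the buffer.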