# Model Training Notebook

In this notebook, we train three models on the preprocessed data and generate predictions for the test set.

Because this is a binary classification task, the models used are Gaussian Naive Bayes, a support vector machine (SVM), and logistic regression.

### Load Training and Test Data

In [12]:
import pandas as pd

# Load training data
df = pd.read_csv(r'../data/train_clean.csv')

# Separate the features from the target column
x_train = df.drop('Transported', axis=1)
y_train = df['Transported']

print(x_train.head())
print(y_train.head())

   Unnamed: 0  HomePlanet  CryoSleep  Destination   Age    VIP  RoomService  \
0           0           1      False            2  39.0  False          0.0   
1           1           0      False            2  24.0  False        109.0   
2           2           1      False            2  58.0   True         43.0   
3           3           1      False            2  33.0  False          0.0   
4           4           0      False            2  16.0  False        303.0   

   FoodCourt  ShoppingMall     Spa  VRDeck  Deck  Num  Side  
0        0.0           0.0     0.0     0.0     1    0     0  
1        9.0          25.0   549.0    44.0     5    0     1  
2     3576.0           0.0  6715.0    49.0     0    0     1  
3     1283.0         371.0  3329.0   193.0     0    0     1  
4       70.0         151.0   565.0     2.0     5    1     1  
0    False
1     True
2    False
3    False
4     True
Name: Transported, dtype: bool
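
The `Unnamed: 0` column in the output above is most likely a row index that was written into the CSV, and it is passed to the models as a feature. A minimal sketch of one way to keep it out of the feature set, assuming the first CSV column really is a saved index (the inline CSV text here is hypothetical stand-in data, not the notebook's file):

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV that was saved together with its row index.
csv_text = ",HomePlanet,Age,Transported\n0,1,39.0,False\n1,0,24.0,True\n"

# Default read: the unnamed first column becomes a feature called 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead of a feature.
without_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Feature columns once the target is set aside.
feature_cols = [c for c in without_index.columns if c != 'Transported']
```

If this were applied to the training file, the same option would have to be applied to the test file so that both frames keep identical feature columns.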


In [2]:
# Load test data
test_df = pd.read_csv(r'../data/test_clean.csv')

print(test_df.head())

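# Fill missing numeric values with the column means; numeric_only=True
# skips string columns such as PassengerId
test_df.fillna(test_df.mean(numeric_only=True), inplace=True)

# Separate the passenger IDs from the test features
passenger_id = test_df['PassengerId']
x_test = test_df.drop('PassengerId', axis=1)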

   Unnamed: 0 PassengerId  HomePlanet CryoSleep  Destination   Age    VIP  \
0           0     0013_01           0      True            2  27.0  False   
1           1     0018_01           0     False            2  19.0  False   
2           2     0019_01           1      True            0  31.0  False   
3           3     0021_01           1     False            2  38.0  False   
4           4     0023_01           0     False            2  20.0  False   

   RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Deck  Num  Side  
0          0.0        0.0           0.0     0.0     0.0     6  3.0     1  
1          0.0        9.0           0.0  2823.0     0.0     5  4.0     1  
2          0.0        0.0           0.0     0.0     0.0     2  0.0     1  
3          0.0     6652.0           0.0   181.0   585.0     2  1.0     1  
4         10.0        0.0         635.0     0.0     0.0     5  5.0     1  
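
Mean imputation only touches numeric columns, so it can be worth checking whether anything is still missing afterwards. The sketch below uses a small hypothetical frame; in particular, the object-typed `CryoSleep` column is an assumption for illustration, since the output above does not show the column dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in numeric and non-numeric columns.
toy = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01', '0019_01', '0021_01'],
    'CryoSleep': [True, None, True, False],   # object dtype because of None
    'Age': [27.0, np.nan, 31.0, 29.0],
})

# Column means are computed for numeric columns only.
numeric_means = toy.mean(numeric_only=True)
filled = toy.fillna(numeric_means)

# Columns that still contain missing values after mean imputation.
still_missing = filled.columns[filled.isna().any()].tolist()
```

Any column listed in `still_missing` would need its own strategy (for example, the most frequent value) before being passed to an estimator that rejects NaN.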


### Naive Bayes

#### Train

In [3]:
# Imports

from sklearn.naive_bayes import GaussianNB

In [4]:
# Create the model
model = GaussianNB()

# Fit the model on the training data
model.fit(x_train, y_train)

GaussianNB()
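
The model above is fitted on the full training set, so nothing here estimates how well it generalizes. One common option is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the notebook's dataset), and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (hypothetical stand-in).
X_demo, y_demo = make_classification(n_samples=500, n_features=13, random_state=0)

# Fit and score a fresh GaussianNB on each of 5 train/validation splits.
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring='accuracy')

# Average validation accuracy across the folds.
mean_accuracy = scores.mean()
```

With the real data, `X_demo` and `y_demo` would be replaced by `x_train` and `y_train`.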

#### Make Predictions

In [5]:
# Predict output
transported = model.predict(x_test)
predicted_naive = pd.DataFrame({'PassengerId': passenger_id, 'Transported': transported})

In [6]:
print(predicted_naive.head())
predicted_naive.shape

  PassengerId  Transported
0     0013_01         True
1     0018_01        False
2     0019_01         True
3     0021_01         True
4     0023_01         True


(4277, 2)

In [7]:
predicted_naive.to_csv(r'../data/submission_naive_bayes.csv', index=False)

### SVM Classifier

#### Train SVM Classifier Model

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize the features, then fit an SVM on the scaled data
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(x_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])
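
Wrapping the scaler and the SVM in one pipeline means the scaling statistics are learned from the training data only and then reused at prediction time. A minimal sketch of that behavior on synthetic stand-in data (the data and the `demo_` names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, split into a fitting part and a "new" part.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_fit, X_new = X_demo[:200], X_demo[200:]
y_fit = y_demo[:200]

demo_clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
demo_clf.fit(X_fit, y_fit)

# The scaler inside the pipeline stores the means of the data it was fitted on.
fitted_scaler = demo_clf.named_steps['standardscaler']
uses_training_means = np.allclose(fitted_scaler.mean_, X_fit.mean(axis=0))

# predict() scales X_new with those stored statistics before the SVM sees it.
demo_predictions = demo_clf.predict(X_new)
```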

In [9]:
transported = clf.predict(x_test)
predicted_svm = pd.DataFrame({'PassengerId': passenger_id, 'Transported': transported})

In [10]:
predicted_svm.to_csv(r'../data/submission_svm.csv', index=False)

### Logistic Regression

#### Train Logistic Regression Model

In [17]:
from sklearn.linear_model import LogisticRegression

logistic_clf = LogisticRegression(random_state=0).fit(x_train, y_train)

# Accuracy on the training data itself, not a held-out estimate
logistic_clf.score(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7897366030881017
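
The warning above means the solver stopped at its iteration limit before converging, and it suggests either raising `max_iter` or scaling the data. A minimal sketch combining both, reusing the `StandardScaler` pipeline pattern from the SVM section. The synthetic data and the value `max_iter=1000` are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; a few columns are inflated to mimic features
# with large raw values.
X_demo, y_demo = make_classification(n_samples=500, n_features=13, random_state=0)
X_demo[:, :5] *= 1000.0

max_iter = 1000  # assumed limit for this sketch
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=max_iter, random_state=0),
)
scaled_logistic.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually ran.
iterations_used = int(scaled_logistic.named_steps['logisticregression'].n_iter_[0])
converged = iterations_used < max_iter
```

Whether this changes the accuracy on the real data is not something the sketch can show; it only illustrates how to address the convergence warning.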

#### Make Predictions

In [24]:
transported_logistic = logistic_clf.predict(x_test)

In [25]:
predicted_logistic = pd.DataFrame({'PassengerId': passenger_id, 'Transported': transported_logistic})
predicted_logistic.to_csv(r'../data/submission_logistic.csv', index=False)