# Assignment 3: Logistic Regression and SVMs
## Daisy Pinaroc

First, we will explore our data and prepare it for analysis through feature engineering techniques.

## Feature Engineering

In [100]:
# Load the dataset as a dataframe using pandas library 
import pandas as pd
# Importing OneHotEncoder class from scikit-learn preprocessing for later use
from sklearn.preprocessing import OneHotEncoder

mt_rainier_df = pd.read_csv("MtRainier_data.csv")

# Preview the first five rows using the head() function
mt_rainier_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Route,Succeeded,Battery Voltage AVG,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG
0,0,11/27/2015,Disappointment Cleaver,0,13.64375,26.321667,19.715,27.839583,68.004167,88.49625
1,1,11/21/2015,Disappointment Cleaver,0,13.749583,31.3,21.690708,2.245833,117.549667,93.660417
2,2,10/15/2015,Disappointment Cleaver,0,13.46125,46.447917,27.21125,17.163625,259.121375,138.387
3,3,10/13/2015,Little Tahoma,0,13.532083,40.979583,28.335708,19.591167,279.779167,176.382667
4,4,10/9/2015,Disappointment Cleaver,0,13.21625,38.260417,74.329167,65.138333,264.6875,27.791292


Features to keep:
* Route
* Temperature AVG
* Relative Humidity AVG
* Wind Speed Daily AVG
* Wind Direction AVG
* Solare Radiation AVG


* I'm not keeping "Date" because it is stored as text (dtype `object`) and caused problems with scaling later on.
* "Battery Voltage AVG" is also dropped because it seems unlikely to help the user decide whether to hike on a given day.
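
If we wanted to keep some of the information in "Date" instead of dropping it, one option would be to convert it into numeric parts before scaling. This is only a sketch on the five dates printed above, not something the rest of this notebook does; the column names `month` and `day_of_year` are hypothetical.

```python
import pandas as pd

# Dates copied from the head() output above
dates_df = pd.DataFrame(
    {"Date": ["11/27/2015", "11/21/2015", "10/15/2015", "10/13/2015", "10/9/2015"]}
)

# Parse month/day/year strings into datetimes
parsed = pd.to_datetime(dates_df["Date"], format="%m/%d/%Y")

# Hypothetical numeric features that a scaler can handle
dates_df["month"] = parsed.dt.month
dates_df["day_of_year"] = parsed.dt.dayofyear

print(dates_df)
```

Numeric columns like these could then be scaled together with the weather features.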

### Exploratory data analysis

In [101]:
# Checking for duplicate rows
mt_rainier_df.duplicated().sum()

0

We won't have to drop any duplicate rows because there aren't any in the dataset.

In [102]:
# Applying dropna() for completeness
mt_rainier_df = mt_rainier_df.dropna()

In [103]:
# Check the data types of the columns using info() function, also checking for completeness
mt_rainier_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1895 entries, 0 to 1894
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             1895 non-null   int64  
 1   Date                   1895 non-null   object 
 2   Route                  1895 non-null   object 
 3   Succeeded              1895 non-null   int64  
 4   Battery Voltage AVG    1895 non-null   float64
 5   Temperature AVG        1895 non-null   float64
 6   Relative Humidity AVG  1895 non-null   float64
 7   Wind Speed Daily AVG   1895 non-null   float64
 8   Wind Direction AVG     1895 non-null   float64
 9   Solare Radiation AVG   1895 non-null   float64
dtypes: float64(6), int64(2), object(2)
memory usage: 148.2+ KB


In [104]:
# Printing the number of rows and columns in the dataframe

# Rows
rows = len(mt_rainier_df.index)

# Columns
col = len(mt_rainier_df.axes[1])

print('Rows:',rows)
print('Columns:', col)

Rows: 1895
Columns: 10


I've determined the number of rows and columns in the dataframe because I would like to see if this changes after feature engineering.
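
As a side note, the same two counts can be read in one step from `DataFrame.shape`. A minimal sketch on a tiny stand-in frame (values taken from the head() output above; `df` is a hypothetical name):

```python
import pandas as pd

# Tiny stand-in frame; the notebook would use mt_rainier_df here
df = pd.DataFrame({
    "Route": ["Disappointment Cleaver", "Little Tahoma"],
    "Succeeded": [0, 0],
    "Temperature AVG": [26.321667, 40.979583],
})

rows, cols = df.shape  # (number of rows, number of columns)
print("Rows:", rows)
print("Columns:", cols)
```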

### Feature and Label Definition

In [105]:
mt_rainier_features_df = mt_rainier_df[["Route","Temperature AVG", "Relative Humidity AVG", "Wind Speed Daily AVG", "Wind Direction AVG","Solare Radiation AVG"]] 
mt_rainier_features_df.head()

Unnamed: 0,Route,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG
0,Disappointment Cleaver,26.321667,19.715,27.839583,68.004167,88.49625
1,Disappointment Cleaver,31.3,21.690708,2.245833,117.549667,93.660417
2,Disappointment Cleaver,46.447917,27.21125,17.163625,259.121375,138.387
3,Little Tahoma,40.979583,28.335708,19.591167,279.779167,176.382667
4,Disappointment Cleaver,38.260417,74.329167,65.138333,264.6875,27.791292


In [106]:
mt_rainier_labels_df = mt_rainier_df[["Succeeded"]]
mt_rainier_labels_df.head()

Unnamed: 0,Succeeded
0,0
1,0
2,0
3,0
4,0


"Succeeded" takes only the values 0 and 1, so it serves as our binary classification label.

### Feature Transformation

In [107]:
# Transforming categorical features into 1-hot
route_names_to_list = mt_rainier_features_df["Route"].to_list()

# Converting the 1-dimensional list into a list of lists(2D)
route_names_to_list_of_lists = []

for route in route_names_to_list:
    route_names_to_list_of_lists.append([route])

# Defining an object
route_encoder = OneHotEncoder()

# Fit our data (i.e., extract and order vocabulary)
route_encoder.fit(route_names_to_list_of_lists)

print(f"Unique vocabulary items {len(route_encoder.categories_[0])}\n")

# Now transform each example in our data into 1-hot form
route_names_transformed = route_encoder.transform(route_names_to_list_of_lists)

# Transform the result object into a matrix
route_names_transformed = route_names_transformed.toarray()
print(route_names_transformed)

# Create a dataframe back from the array
route_names_transformed_df = pd.DataFrame(route_names_transformed)

# Now concatenate this feature back to the original dataframe 
mt_rainier_features_df.reset_index(drop=True, inplace=True)
route_names_transformed_df.reset_index(drop=True, inplace=True)

mt_rainier_features_transformed_df = pd.concat([mt_rainier_features_df,route_names_transformed_df], axis=1)

# We don't need Route now since we have already transformed it
mt_rainier_features_transformed_df = mt_rainier_features_transformed_df.drop(columns=["Route"], axis=1)
mt_rainier_features_transformed_df.head()

Unique vocabulary items 22

[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]


Unnamed: 0,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG,0,1,2,3,4,...,12,13,14,15,16,17,18,19,20,21
0,26.321667,19.715,27.839583,68.004167,88.49625,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,31.3,21.690708,2.245833,117.549667,93.660417,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,46.447917,27.21125,17.163625,259.121375,138.387,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,40.979583,28.335708,19.591167,279.779167,176.382667,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,38.260417,74.329167,65.138333,264.6875,27.791292,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


"Route" is the only categorical feature in our data. Now that it has been transformed into one-hot columns, every remaining feature is numerical. We'll scale these features using standard scaling (zero mean, unit variance).

### Feature Scaling

In [108]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = mt_rainier_features_transformed_df.columns

mt_rainier_features_transformed_df[all_columns] = scaler.fit_transform(mt_rainier_features_transformed_df[all_columns])
mt_rainier_features_transformed_df.head()



Unnamed: 0,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG,0,1,2,3,4,...,12,13,14,15,16,17,18,19,20,21
0,-1.580891,-1.269311,1.895222,-0.958813,-1.567664,-0.032504,0.681507,-0.429389,-0.051434,-0.120225,...,-0.03982,-0.032504,-0.065112,-0.032504,-0.022978,-0.022978,-0.045992,-0.022978,-0.032504,-0.115624
1,-1.033951,-1.180109,-0.902775,-0.414849,-1.520897,-0.032504,0.681507,-0.429389,-0.051434,-0.120225,...,-0.03982,-0.032504,-0.065112,-0.032504,-0.022978,-0.022978,-0.045992,-0.022978,-0.032504,-0.115624
2,0.630261,-0.930861,0.72809,1.139477,-1.11585,-0.032504,0.681507,-0.429389,-0.051434,-0.120225,...,-0.03982,-0.032504,-0.065112,-0.032504,-0.022978,-0.022978,-0.045992,-0.022978,-0.032504,-0.115624
3,0.029488,-0.880092,0.993477,1.36628,-0.771758,-0.032504,-1.467337,-0.429389,-0.051434,-0.120225,...,-0.03982,-0.032504,-0.065112,-0.032504,-0.022978,-0.022978,-0.045992,-0.022978,-0.032504,-0.115624
4,-0.269251,1.196481,5.972851,1.200588,-2.117412,-0.032504,0.681507,-0.429389,-0.051434,-0.120225,...,-0.03982,-0.032504,-0.065112,-0.032504,-0.022978,-0.022978,-0.045992,-0.022978,-0.032504,-0.115624
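
Note that the cell above fits the scaler on every row before the data is split, and it also standardizes the one-hot columns. A common alternative, shown here only as a sketch on synthetic stand-in data, is to put the preprocessing in a `Pipeline` so the scaler is fitted on training rows only and the one-hot columns stay as 0/1. The data, the two-route vocabulary, and all variable names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; the real notebook would use mt_rainier_df instead
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Route": rng.choice(["Disappointment Cleaver", "Little Tahoma"], size=n),
    "Temperature AVG": rng.normal(35, 8, size=n),
    "Wind Speed Daily AVG": rng.normal(20, 10, size=n),
})
y = rng.integers(0, 2, size=n)

numeric_cols = ["Temperature AVG", "Wind Speed Daily AVG"]
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),                      # scale numeric columns only
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Route"]),  # keep one-hot columns as 0/1
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.10, random_state=42)
model.fit(x_train, y_train)  # scaler statistics come from x_train only
acc = model.score(x_test, y_test)
print(acc)
```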


## Data Splitting

In [109]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = mt_rainier_features_transformed_df.to_numpy()
labels = mt_rainier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_spliiter = KFold(n_splits=k)

folds_data = []  # stores a copy of every fold; simple, though memory-inefficient

fold = 1
for train_index, validation_index in kfold_spliiter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))

Fold 1 training data shape = ((1364, 27), (1364, 1))
Fold 1 validation data shape = ((341, 27), (341, 1))
Fold 2 training data shape = ((1364, 27), (1364, 1))
Fold 2 validation data shape = ((341, 27), (341, 1))
Fold 3 training data shape = ((1364, 27), (1364, 1))
Fold 3 validation data shape = ((341, 27), (341, 1))
Fold 4 training data shape = ((1364, 27), (1364, 1))
Fold 4 validation data shape = ((341, 27), (341, 1))
Fold 5 training data shape = ((1364, 27), (1364, 1))
Fold 5 validation data shape = ((341, 27), (341, 1))
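
The shapes printed above follow directly from the row count reported by `info()`. A small worked check, assuming the documented behavior of `train_test_split`, which rounds the test share up:

```python
import math

n_rows = 1895       # from mt_rainier_df.info()
test_size = 0.10
k = 5

n_test = math.ceil(test_size * n_rows)  # test share is rounded up
n_trainval = n_rows - n_test            # rows left for cross validation
n_valid = n_trainval // k               # 1705 divides evenly by 5
n_train = n_trainval - n_valid

n_features = 5 + 22                     # 5 numeric columns + 22 one-hot route columns

print(n_test, n_trainval, n_valid, n_train, n_features)
```

So each fold trains on 1364 rows and validates on 341, with 27 features per row.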


## Defining and training Classifiers

We'll use four model candidates:
* Logistic Regression with no regularization
* Logistic Regression with L2 regularization (scikit-learn's default)
* Support Vector Classifier with Linear Kernel
* Support Vector Classifier with Polynomial Kernel (Degree = 2)

In [110]:
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


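# LR with no regularizer
# (scikit-learn >= 1.2 expects penalty=None; the string "none" was deprecated in 1.2 and later removed)
lr_vanilla = LogisticRegression(penalty=None)

# LR with L2 regularizer (the scikit-learn default)
lr_L2 = LogisticRegression(penalty="l2")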

# SVM with linear kernel
svm_linear = SVC(kernel="linear")

# SVM with polynomial kernel
svm_poly = SVC(kernel="poly",degree=2)

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              "lr_L2":lr_L2,
              "svm_linear":svm_linear,
              "svm_poly":svm_poly}

## K-fold Cross Validation
For each model, we train on every fold, compute its accuracy on the training and validation data, and select the model with the best average validation accuracy.

In [111]:
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)

        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)

        # Compute training accuracy (accuracy_score expects y_true first, then y_pred)
        train_acc = accuracy_score(y_train.flatten(), y_pred_train)

        # Store training accuracy for each fold
        train_acc_for_all_folds.append(train_acc)

        # Compute validation accuracy
        valid_acc = accuracy_score(y_valid.flatten(), y_pred_valid)

        # Store validation accuracy for each fold
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.6277126099706745
Average validation accuracy for model lr_vanilla = 0.612316715542522
-----------------------------------
Evaluating lr_L2 ...
Average training accuracy for model lr_L2 = 0.6275659824046921
Average validation accuracy for model lr_L2 = 0.6129032258064516
-----------------------------------
Evaluating svm_linear ...
Average training accuracy for model svm_linear = 0.6174486803519061
Average validation accuracy for model svm_linear = 0.6070381231671554
-----------------------------------
Evaluating svm_poly ...
Average training accuracy for model svm_poly = 0.6117302052785923
Average validation accuracy for model svm_poly = 0.5829912023460411
-----------------------------------
Best model for the task is lr_L2 which offers the validation accuracy of 0.6129032258064516
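
Two things are worth noting about the loop above: `best_model` ends up holding the fit from the last fold only, and `x_test`/`y_test` are not used in that cell. The sketch below, on synthetic stand-in data, shows one way to get the same averages with scikit-learn's `cross_validate` and then score the chosen model once on the held-out test set. It is an illustration, not the notebook's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Synthetic stand-in for the scaled features and labels
features, labels = make_classification(n_samples=500, n_features=5, random_state=0)
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

model = LogisticRegression()
scores = cross_validate(model, _x, _y, cv=KFold(n_splits=5),
                        scoring="accuracy", return_train_score=True)
print("Average training accuracy:", scores["train_score"].mean())
# cross_validate calls the held-out fold "test"; it plays the role of our validation set
print("Average validation accuracy:", scores["test_score"].mean())

# After model selection: refit on all non-test data, then score once on the test set
model.fit(_x, _y)
test_acc = model.score(x_test, y_test)
print("Held-out test accuracy:", test_acc)
```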


## Insights 
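All four models perform about equally poorly: average training accuracy stays between 0.61 and 0.63, and validation accuracy is only slightly lower, which points to underfitting rather than overfitting. Ideally, training accuracy would exceed 0.9 (more than 90% of the training examples classified correctly).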

In such cases, we can try to improve the data and extract better features, such as trail conditions (e.g., rocky, muddy, or slippery), potential hazards, trail difficulty, and wildlife presence. Although these features aren't present in our current dataframe, I think they could help increase accuracy and improve the user experience over time. Collecting data over a longer span than one year could also help.
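
To judge how weak an accuracy of about 0.61 really is, it helps to compare it with a majority-class baseline, which this notebook does not compute. A minimal sketch using scikit-learn's `DummyClassifier` on synthetic, mildly imbalanced stand-in data (the 60/40 class split is an assumption, not the real label balance):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (_x, _y); the class weights are hypothetical
x, y = make_classification(n_samples=500, n_features=5, weights=[0.6, 0.4], random_state=0)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
baseline_acc = cross_val_score(baseline, x, y, cv=5, scoring="accuracy").mean()
model_acc = cross_val_score(LogisticRegression(), x, y, cv=5, scoring="accuracy").mean()

print("Majority-class baseline:", baseline_acc)
print("Logistic regression:", model_acc)
```

If the real baseline turned out to be close to 0.6, the models above would be learning very little beyond the class frequencies.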

## Extra Credit
Find the best set of features and the best model by **trying many possible combinations of features**.
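
Instead of copying the whole pipeline once per feature set, the search over combinations could be automated. The sketch below loops over every subset of three numeric columns with `itertools.combinations` and scores each with 5-fold cross validation. The data is synthetic and the label is constructed to depend on temperature, so the numbers say nothing about the real dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in that reuses three of the notebook's column names
rng = np.random.default_rng(0)
columns = ["Temperature AVG", "Relative Humidity AVG", "Wind Speed Daily AVG"]
df = pd.DataFrame(rng.normal(size=(300, len(columns))), columns=columns)
# Hypothetical label that depends on temperature, so some subsets beat others
y = (df["Temperature AVG"] + 0.5 * rng.normal(size=300) > 0).astype(int)

results = {}
for size in range(1, len(columns) + 1):
    for subset in combinations(columns, size):
        # Scaling lives inside the pipeline, so it is refit on each training fold
        model = make_pipeline(StandardScaler(), LogisticRegression())
        scores = cross_val_score(model, df[list(subset)], y, cv=5, scoring="accuracy")
        results[subset] = scores.mean()

best_subset = max(results, key=results.get)
print(best_subset, results[best_subset])
```

With the six candidate columns used in this notebook there would be 63 non-empty subsets, which is still cheap to evaluate for these models.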

## MODEL 2
Same features as the previous model, but **WITHOUT ROUTE**.
Features:
* Temperature AVG
* Relative Humidity AVG
* Wind Speed Daily AVG
* Wind Direction AVG
* Solare Radiation AVG

In [112]:
# Load the dataset as a dataframe using pandas library 
import pandas as pd
#Importing OneHotEncoder class from scikit-learn preprocessing for later use
from sklearn.preprocessing import OneHotEncoder

mt_rainier_df = pd.read_csv("MtRainier_data.csv")

mt_rainier_df = mt_rainier_df.drop_duplicates()

# Applying dropna() for completeness
mt_rainier_df = mt_rainier_df.dropna()

# Defining our features
mt_rainier_features_df = mt_rainier_df[["Temperature AVG", "Relative Humidity AVG", "Wind Speed Daily AVG", "Wind Direction AVG","Solare Radiation AVG"]] 

# Defining our labels
mt_rainier_labels_df = mt_rainier_df[["Succeeded"]]

# Scaling our features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = mt_rainier_features_df.columns

mt_rainier_features_df[all_columns] = scaler.fit_transform(mt_rainier_features_df[all_columns])

# Data splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = mt_rainier_features_df.to_numpy()
labels = mt_rainier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_spliiter = KFold(n_splits=k)

folds_data = []  # stores a copy of every fold; simple, though memory-inefficient

fold = 1
for train_index, validation_index in kfold_spliiter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))
    
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# LR with no regularizer (scikit-learn >= 1.2 expects penalty=None instead of "none")
lr_vanilla = LogisticRegression(penalty=None)

# LR with regularizer
lr_L2 = LogisticRegression(penalty="l2")

# SVM with linear kernel
svm_linear = SVC(kernel="linear")

# SVM with polynomial kernel
svm_poly = SVC(kernel="poly",degree=2)

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              "lr_L2":lr_L2,
              "svm_linear":svm_linear,
              "svm_poly":svm_poly}

# K-fold cross validation
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)

        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)

        # Compute training accuracy (accuracy_score expects y_true first, then y_pred)
        train_acc = accuracy_score(y_train.flatten(), y_pred_train)

        # Store training accuracy for each fold
        train_acc_for_all_folds.append(train_acc)

        # Compute validation accuracy
        valid_acc = accuracy_score(y_valid.flatten(), y_pred_valid)

        # Store validation accuracy for each fold
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Fold 1 training data shape = ((1364, 5), (1364, 1))
Fold 1 validation data shape = ((341, 5), (341, 1))
Fold 2 training data shape = ((1364, 5), (1364, 1))
Fold 2 validation data shape = ((341, 5), (341, 1))
Fold 3 training data shape = ((1364, 5), (1364, 1))
Fold 3 validation data shape = ((341, 5), (341, 1))
Fold 4 training data shape = ((1364, 5), (1364, 1))
Fold 4 validation data shape = ((341, 5), (341, 1))
Fold 5 training data shape = ((1364, 5), (1364, 1))
Fold 5 validation data shape = ((341, 5), (341, 1))
Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.6227272727272727
Average validation accuracy for model lr_vanilla = 0.6041055718475073
-----------------------------------
Evaluating lr_L2 ...
Average training accuracy for model lr_L2 = 0.6227272727272727
Average validation accuracy for model lr_L2 = 0.6041055718475073
-----------------------------------
Evaluating svm_linear ...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mt_rainier_features_df[all_columns] = scaler.fit_transform(mt_rainier_features_df[all_columns])


Average training accuracy for model svm_linear = 0.6087976539589443
Average validation accuracy for model svm_linear = 0.6064516129032258
-----------------------------------
Evaluating svm_poly ...
Average training accuracy for model svm_poly = 0.5942815249266863
Average validation accuracy for model svm_poly = 0.5894428152492669
-----------------------------------
Best model for the task is svm_linear which offers the validation accuracy of 0.6064516129032258
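
The pandas warning in the output above appears because the scaled values are assigned into a frame that was sliced from `mt_rainier_df`. One common way to avoid it is to take an explicit `.copy()` of the selected columns before assigning. A minimal sketch on three rows from the head() output (`df`, `numeric_cols`, and `features_df` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Route": ["Disappointment Cleaver", "Little Tahoma", "Disappointment Cleaver"],
    "Temperature AVG": [26.321667, 40.979583, 38.260417],
    "Wind Speed Daily AVG": [27.839583, 19.591167, 65.138333],
})

numeric_cols = ["Temperature AVG", "Wind Speed Daily AVG"]

# .copy() makes an independent frame, so the assignment below is unambiguous
features_df = df[numeric_cols].copy()
features_df[numeric_cols] = StandardScaler().fit_transform(features_df[numeric_cols])

print(features_df.mean().round(6))     # each column now has mean ~0
print(df["Temperature AVG"].tolist())  # the original frame is untouched
```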


## MODEL 3
Features:
* Route
* Battery Voltage AVG
* Temperature AVG
* Relative Humidity AVG
* Wind Speed Daily AVG
* Wind Direction AVG

In [113]:
# Load the dataset as a dataframe using pandas library 
import pandas as pd
#Importing OneHotEncoder class from scikit-learn preprocessing for later use
from sklearn.preprocessing import OneHotEncoder

mt_rainier_df = pd.read_csv("MtRainier_data.csv")

mt_rainier_df = mt_rainier_df.drop_duplicates()

# Applying dropna() for completeness
mt_rainier_df = mt_rainier_df.dropna()


# Defining our features
mt_rainier_features_df = mt_rainier_df[["Route", "Battery Voltage AVG", "Temperature AVG", "Relative Humidity AVG", "Wind Speed Daily AVG", "Wind Direction AVG"]] 

# Defining our labels
mt_rainier_labels_df = mt_rainier_df[["Succeeded"]]


# Transforming categorical features (Route) into 1-hot
route_names_to_list = mt_rainier_features_df["Route"].to_list()

# Converting the 1-dimensional list into a list of lists(2D)
route_names_to_list_of_lists = []

for route in route_names_to_list:
    route_names_to_list_of_lists.append([route])

# Defining an object
route_encoder = OneHotEncoder()

# Fit our data (i.e., extract and order vocabulary)
route_encoder.fit(route_names_to_list_of_lists)

print(f"Unique vocabulary items {len(route_encoder.categories_[0])}\n")

# Now transform each example in our data into 1-hot form
route_names_transformed = route_encoder.transform(route_names_to_list_of_lists)

# Transform the result object into a matrix
route_names_transformed = route_names_transformed.toarray()
print(route_names_transformed)

# Create a dataframe back from the array
route_names_transformed_df = pd.DataFrame(route_names_transformed)

# Now concatenate this feature back to the original dataframe 
mt_rainier_features_df.reset_index(drop=True, inplace=True)
route_names_transformed_df.reset_index(drop=True, inplace=True)

mt_rainier_features_transformed_df = pd.concat([mt_rainier_features_df,route_names_transformed_df], axis=1)

# We don't need Route now since we have already transformed it
mt_rainier_features_transformed_df = mt_rainier_features_transformed_df.drop(columns=["Route"], axis=1)
mt_rainier_features_transformed_df.head()


# Scaling our features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = mt_rainier_features_transformed_df.columns

mt_rainier_features_transformed_df[all_columns] = scaler.fit_transform(mt_rainier_features_transformed_df[all_columns])


# Data splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = mt_rainier_features_transformed_df.to_numpy()
labels = mt_rainier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_spliiter = KFold(n_splits=k)

folds_data = []  # stores a copy of every fold; simple, though memory-inefficient

fold = 1
for train_index, validation_index in kfold_spliiter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))
    
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# LR with no regularizer (scikit-learn >= 1.2 expects penalty=None instead of "none")
lr_vanilla = LogisticRegression(penalty=None)

# LR with regularizer
lr_L2 = LogisticRegression(penalty="l2")

# SVM with linear kernel
svm_linear = SVC(kernel="linear")

# SVM with polynomial kernel
svm_poly = SVC(kernel="poly",degree=2)

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              "lr_L2":lr_L2,
              "svm_linear":svm_linear,
              "svm_poly":svm_poly}

# K-fold cross validation
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)

        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)

        # Compute training accuracy (accuracy_score expects y_true first, then y_pred)
        train_acc = accuracy_score(y_train.flatten(), y_pred_train)

        # Store training accuracy for each fold
        train_acc_for_all_folds.append(train_acc)

        # Compute validation accuracy
        valid_acc = accuracy_score(y_valid.flatten(), y_pred_valid)

        # Store validation accuracy for each fold
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Unique vocabulary items 22

[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
Fold 1 training data shape = ((1364, 27), (1364, 1))
Fold 1 validation data shape = ((341, 27), (341, 1))
Fold 2 training data shape = ((1364, 27), (1364, 1))
Fold 2 validation data shape = ((341, 27), (341, 1))
Fold 3 training data shape = ((1364, 27), (1364, 1))
Fold 3 validation data shape = ((341, 27), (341, 1))
Fold 4 training data shape = ((1364, 27), (1364, 1))
Fold 4 validation data shape = ((341, 27), (341, 1))
Fold 5 training data shape = ((1364, 27), (1364, 1))
Fold 5 validation data shape = ((341, 27), (341, 1))
Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.6043988269794721
Average validation accuracy for model lr_vanilla = 0.5935483870967742
-----------------------------------
Evaluating lr_L2 ...




Average training accuracy for model lr_L2 = 0.6043988269794721
Average validation accuracy for model lr_L2 = 0.5935483870967742
-----------------------------------
Evaluating svm_linear ...
Average training accuracy for model svm_linear = 0.58841642228739
Average validation accuracy for model svm_linear = 0.5747800586510264
-----------------------------------
Evaluating svm_poly ...
Average training accuracy for model svm_poly = 0.5998533724340176
Average validation accuracy for model svm_poly = 0.5782991202346042
-----------------------------------
Best model for the task is lr_vanilla which offers the validation accuracy of 0.5935483870967742
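
To compare the runs so far side by side, the best average validation accuracies printed above can be collected into one small table. The numbers are copied from the outputs in this notebook; the label "Model 1" for the first run and the column names are my own shorthand.

```python
import pandas as pd

# Best average validation accuracy reported by each run above
summary = pd.DataFrame(
    [
        ("Model 1", "Route + 5 weather features", "lr_L2", 0.6129032258064516),
        ("Model 2", "5 weather features, no Route", "svm_linear", 0.6064516129032258),
        ("Model 3", "Route, Battery Voltage, 4 weather features", "lr_vanilla", 0.5935483870967742),
    ],
    columns=["run", "features", "best_model", "validation_accuracy"],
)

summary = summary.sort_values("validation_accuracy", ascending=False).reset_index(drop=True)
print(summary)
```

The differences between runs are under two percentage points, so none of these feature sets clearly stands out.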


## MODEL 4
Features:
* Route
* Temperature AVG
* Relative Humidity AVG
* Wind Speed Daily AVG
* Solare Radiation AVG

In [114]:
# Load the dataset as a dataframe using pandas library 
import pandas as pd
#Importing OneHotEncoder class from scikit-learn preprocessing for later use
from sklearn.preprocessing import OneHotEncoder

mt_rainier_df = pd.read_csv("MtRainier_data.csv")

mt_rainier_df = mt_rainier_df.drop_duplicates()

# Applying dropna() for completeness
mt_rainier_df = mt_rainier_df.dropna()


# Defining our features
mt_rainier_features_df = mt_rainier_df[["Route", "Temperature AVG", "Relative Humidity AVG", "Wind Speed Daily AVG", "Solare Radiation AVG"]] 

# Defining our labels
mt_rainier_labels_df = mt_rainier_df[["Succeeded"]]


# Transforming categorical features (Route) into 1-hot
route_names_to_list = mt_rainier_features_df["Route"].to_list()

# Converting the 1-dimensional list into a list of lists(2D)
route_names_to_list_of_lists = []

for route in route_names_to_list:
    route_names_to_list_of_lists.append([route])

# Defining an object
route_encoder = OneHotEncoder()

# Fit our data (i.e., extract and order vocabulary)
route_encoder.fit(route_names_to_list_of_lists)

print(f"Unique vocabulary items {len(route_encoder.categories_[0])}\n")

# Now transform each example in our data into 1-hot form
route_names_transformed = route_encoder.transform(route_names_to_list_of_lists)

# Transform the result object into a matrix
route_names_transformed = route_names_transformed.toarray()
print(route_names_transformed)

# Create a dataframe back from the array
route_names_transformed_df = pd.DataFrame(route_names_transformed)

# Now concatenate this feature back to the original dataframe 
mt_rainier_features_df.reset_index(drop=True, inplace=True)
route_names_transformed_df.reset_index(drop=True, inplace=True)

mt_rainier_features_transformed_df = pd.concat([mt_rainier_features_df,route_names_transformed_df], axis=1)

# We don't need Route now since we have already transformed it
mt_rainier_features_transformed_df = mt_rainier_features_transformed_df.drop(columns=["Route"], axis=1)


# Scaling our features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = mt_rainier_features_transformed_df.columns

mt_rainier_features_transformed_df[all_columns] = scaler.fit_transform(mt_rainier_features_transformed_df[all_columns])


# Data splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = mt_rainier_features_transformed_df.to_numpy()
labels = mt_rainier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_spliiter = KFold(n_splits=k)

folds_data = []  # stores a copy of every fold; simple, though memory-inefficient

fold = 1
for train_index, validation_index in kfold_spliiter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))
    
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# LR with no regularizer (scikit-learn >= 1.2 expects penalty=None instead of "none")
lr_vanilla = LogisticRegression(penalty=None)

# LR with regularizer
lr_L2 = LogisticRegression(penalty="l2")

# SVM with linear kernel
svm_linear = SVC(kernel="linear")

# SVM with polynomial kernel
svm_poly = SVC(kernel="poly",degree=2)

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              "lr_L2":lr_L2,
              "svm_linear":svm_linear,
              "svm_poly":svm_poly}

# K-fold cross validation
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)
        
        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)
        
        # Compute training accuracy
        train_acc = accuracy_score(y_pred_train , y_train.flatten())
        
        # Store training accuracy for each folds
        train_acc_for_all_folds.append(train_acc)
        
        # Compute validation accuracy
        valid_acc = accuracy_score(y_pred_valid , y_valid.flatten())

        # Store validation accuracy for each folds
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Unique vocabulary items 22

[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
Fold 1 training data shape = ((1364, 26), (1364, 1))
Fold 1 validation data shape = ((341, 26), (341, 1))
Fold 2 training data shape = ((1364, 26), (1364, 1))
Fold 2 validation data shape = ((341, 26), (341, 1))
Fold 3 training data shape = ((1364, 26), (1364, 1))
Fold 3 validation data shape = ((341, 26), (341, 1))
Fold 4 training data shape = ((1364, 26), (1364, 1))
Fold 4 validation data shape = ((341, 26), (341, 1))
Fold 5 training data shape = ((1364, 26), (1364, 1))
Fold 5 validation data shape = ((341, 26), (341, 1))
Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.6221407624633432
Average validation accuracy for model lr_vanilla = 0.6129032258064516
-----------------------------------
Evaluating lr_L2 ...
Average training accuracy for model lr_L2 = 0.6230205278592376
Average validation accuracy for model lr_L2 = 0.612316715542522
-----------------------------------
Evaluating svm_linear ...
Average training accuracy for model svm_linear = 0.6171554252199414
Average validation accuracy for model svm_linear = 0.6082111436950146
-----------------------------------
Evaluating svm_poly ...
Average training accuracy for model svm_poly = 0.6089442815249267
Average validation accuracy for model svm_poly = 0.580058651026393
-----------------------------------
Best model for the task is lr_vanilla which offers the validation accuracy of 0.6129032258064516


## MODEL 5
Features:
* Route
* Battery Voltage AVG
* Wind Speed Daily AVG
* Wind Direction AVG 
* Solare Radiation AVG

In [115]:
# Load the dataset as a dataframe using pandas library 
import pandas as pd
#Importing OneHotEncoder class from scikit-learn preprocessing for later use
from sklearn.preprocessing import OneHotEncoder

mt_rainier_df = pd.read_csv("MtRainier_data.csv")

# Dropping duplicates, if any
mt_rainier_df = mt_rainier_df.drop_duplicates()

# Applying dropna() for completeness
mt_rainier_df = mt_rainier_df.dropna()


# Defining our features
mt_rainier_features_df = mt_rainier_df[["Route", "Battery Voltage AVG","Wind Speed Daily AVG", "Wind Direction AVG", "Solare Radiation AVG"]] 

# Defining our labels
mt_rainier_labels_df = mt_rainier_df[["Succeeded"]]


# Transforming categorical features (Route) into 1-hot
route_names_to_list = mt_rainier_features_df["Route"].to_list()

# Converting the 1-dimensional list into a list of lists(2D)
route_names_to_list_of_lists = []

for route in route_names_to_list:
    route_names_to_list_of_lists.append([route])

# Defining an object
route_encoder = OneHotEncoder()

# Fit our data (i.e., extract and order vocabulary)
route_encoder.fit(route_names_to_list_of_lists)

print(f"Unique vocabulary items {len(route_encoder.categories_[0])}\n")

# Now transform each example in our data into 1-hot form
route_names_transformed = route_encoder.transform(route_names_to_list_of_lists)

# Transform the result object into a matrix
route_names_transformed = route_names_transformed.toarray()
print(route_names_transformed)

# Create a dataframe back from the array
route_names_transformed_df = pd.DataFrame(route_names_transformed)

# Now concatenate this feature back to the original dataframe 
mt_rainier_features_df.reset_index(drop=True, inplace=True)
route_names_transformed_df.reset_index(drop=True, inplace=True)

mt_rainier_features_transformed_df = pd.concat([mt_rainier_features_df,route_names_transformed_df], axis=1)

# We don't need Route now since we have already transformed it
mt_rainier_features_transformed_df = mt_rainier_features_transformed_df.drop(columns=["Route"])


# Scaling our features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = mt_rainier_features_transformed_df.columns

mt_rainier_features_transformed_df[all_columns] = scaler.fit_transform(mt_rainier_features_transformed_df[all_columns])


# Data splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = mt_rainier_features_transformed_df.to_numpy()
labels = mt_rainier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_splitter = KFold(n_splits=k)

folds_data = [] # Materialize every fold up front (memory-inefficient, but simple)

fold = 1
for train_index, validation_index in kfold_splitter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))
    
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# LR with no regularizer (penalty=None needs scikit-learn >= 1.2; older releases use penalty="none")
lr_vanilla = LogisticRegression(penalty=None)

# LR with regularizer
lr_L2 = LogisticRegression(penalty="l2")

# SVM with linear kernel
svm_linear = SVC(kernel="linear")

# SVM with polynomial kernel
svm_poly = SVC(kernel="poly",degree=2)

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              "lr_L2":lr_L2,
              "svm_linear":svm_linear,
              "svm_poly":svm_poly}

# K-fold cross validation
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)
        
        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)
        
        # Compute training accuracy
        train_acc = accuracy_score(y_pred_train , y_train.flatten())
        
        # Store training accuracy for each folds
        train_acc_for_all_folds.append(train_acc)
        
        # Compute validation accuracy
        valid_acc = accuracy_score(y_pred_valid , y_valid.flatten())

        # Store validation accuracy for each folds
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Unique vocabulary items 22

[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
Fold 1 training data shape = ((1364, 26), (1364, 1))
Fold 1 validation data shape = ((341, 26), (341, 1))
Fold 2 training data shape = ((1364, 26), (1364, 1))
Fold 2 validation data shape = ((341, 26), (341, 1))
Fold 3 training data shape = ((1364, 26), (1364, 1))
Fold 3 validation data shape = ((341, 26), (341, 1))
Fold 4 training data shape = ((1364, 26), (1364, 1))
Fold 4 validation data shape = ((341, 26), (341, 1))
Fold 5 training data shape = ((1364, 26), (1364, 1))
Fold 5 validation data shape = ((341, 26), (341, 1))
Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.6387096774193548
Average validation accuracy for model lr_vanilla = 0.6228739002932551
-----------------------------------
Evaluating lr_L2 ...
Average training accuracy for model lr_L2 = 0.6387096774193548
Average validation accuracy for model lr_L2 = 0.6228739002932551
-----------------------------------
Evaluating svm_linear ...
Average training accuracy for model svm_linear = 0.6177419354838709
Average validation accuracy for model svm_linear = 0.6111436950146627
-----------------------------------
Evaluating svm_poly ...
Average training accuracy for model svm_poly = 0.6095307917888563
Average validation accuracy for model svm_poly = 0.5835777126099707
-----------------------------------
Best model for the task is lr_vanilla which offers the validation accuracy of 0.6228739002932551


## MODEL 6
Features:
* Route
* Wind Speed Daily AVG
* Solare Radiation AVG

In [116]:
# Load the dataset as a dataframe using pandas library 
import pandas as pd
#Importing OneHotEncoder class from scikit-learn preprocessing for later use
from sklearn.preprocessing import OneHotEncoder

mt_rainier_df = pd.read_csv("MtRainier_data.csv")

# Dropping duplicates, if any
mt_rainier_df = mt_rainier_df.drop_duplicates()

# Applying dropna() for completeness
mt_rainier_df = mt_rainier_df.dropna()


# Defining our features
mt_rainier_features_df = mt_rainier_df[["Route","Wind Speed Daily AVG", "Solare Radiation AVG"]] 

# Defining our labels
mt_rainier_labels_df = mt_rainier_df[["Succeeded"]]


# Transforming categorical features (Route) into 1-hot
route_names_to_list = mt_rainier_features_df["Route"].to_list()

# Converting the 1-dimensional list into a list of lists(2D)
route_names_to_list_of_lists = []

for route in route_names_to_list:
    route_names_to_list_of_lists.append([route])

# Defining an object
route_encoder = OneHotEncoder()

# Fit our data (i.e., extract and order vocabulary)
route_encoder.fit(route_names_to_list_of_lists)

print(f"Unique vocabulary items {len(route_encoder.categories_[0])}\n")

# Now transform each example in our data into 1-hot form
route_names_transformed = route_encoder.transform(route_names_to_list_of_lists)

# Transform the result object into a matrix
route_names_transformed = route_names_transformed.toarray()
print(route_names_transformed)

# Create a dataframe back from the array
route_names_transformed_df = pd.DataFrame(route_names_transformed)

# Now concatenate this feature back to the original dataframe 
mt_rainier_features_df.reset_index(drop=True, inplace=True)
route_names_transformed_df.reset_index(drop=True, inplace=True)

mt_rainier_features_transformed_df = pd.concat([mt_rainier_features_df,route_names_transformed_df], axis=1)

# We don't need Route now since we have already transformed it
mt_rainier_features_transformed_df = mt_rainier_features_transformed_df.drop(columns=["Route"])


# Scaling our features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = mt_rainier_features_transformed_df.columns

mt_rainier_features_transformed_df[all_columns] = scaler.fit_transform(mt_rainier_features_transformed_df[all_columns])


# Data splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = mt_rainier_features_transformed_df.to_numpy()
labels = mt_rainier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_splitter = KFold(n_splits=k)

folds_data = [] # Materialize every fold up front (memory-inefficient, but simple)

fold = 1
for train_index, validation_index in kfold_splitter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))
    
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# LR with no regularizer (penalty=None needs scikit-learn >= 1.2; older releases use penalty="none")
lr_vanilla = LogisticRegression(penalty=None)

# LR with regularizer
lr_L2 = LogisticRegression(penalty="l2")

# SVM with linear kernel
svm_linear = SVC(kernel="linear")

# SVM with polynomial kernel
svm_poly = SVC(kernel="poly",degree=2)

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              "lr_L2":lr_L2,
              "svm_linear":svm_linear,
              "svm_poly":svm_poly}

# K-fold cross validation
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)
        
        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)
        
        # Compute training accuracy
        train_acc = accuracy_score(y_pred_train , y_train.flatten())
        
        # Store training accuracy for each folds
        train_acc_for_all_folds.append(train_acc)
        
        # Compute validation accuracy
        valid_acc = accuracy_score(y_pred_valid , y_valid.flatten())

        # Store validation accuracy for each folds
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Unique vocabulary items 22

[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
Fold 1 training data shape = ((1364, 24), (1364, 1))
Fold 1 validation data shape = ((341, 24), (341, 1))
Fold 2 training data shape = ((1364, 24), (1364, 1))
Fold 2 validation data shape = ((341, 24), (341, 1))
Fold 3 training data shape = ((1364, 24), (1364, 1))
Fold 3 validation data shape = ((341, 24), (341, 1))
Fold 4 training data shape = ((1364, 24), (1364, 1))
Fold 4 validation data shape = ((341, 24), (341, 1))
Fold 5 training data shape = ((1364, 24), (1364, 1))
Fold 5 validation data shape = ((341, 24), (341, 1))
Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.6234604105571847
Average validation accuracy for model lr_vanilla = 0.612316715542522
-----------------------------------
Evaluating lr_L2 ...
Average training accuracy for model lr_L2 = 0.6234604105571847
Average validation accuracy for model lr_L2 = 0.612316715542522
-----------------------------------
Evaluating svm_linear ...
Average training accuracy for model svm_linear = 0.6195014662756598
Average validation accuracy for model svm_linear = 0.6134897360703813
-----------------------------------
Evaluating svm_poly ...
Average training accuracy for model svm_poly = 0.5972140762463344
Average validation accuracy for model svm_poly = 0.5759530791788856
-----------------------------------
Best model for the task is svm_linear which offers the validation accuracy of 0.6134897360703813


## MODEL 7
Features:
* Route
* Battery Voltage AVG
* Wind Speed Daily AVG
* Solare Radiation AVG

In [117]:
# Load the dataset as a dataframe using pandas library 
import pandas as pd
#Importing OneHotEncoder class from scikit-learn preprocessing for later use
from sklearn.preprocessing import OneHotEncoder

mt_rainier_df = pd.read_csv("MtRainier_data.csv")

# Dropping duplicates, if any
mt_rainier_df = mt_rainier_df.drop_duplicates()

# Applying dropna() for completeness
mt_rainier_df = mt_rainier_df.dropna()


# Defining our features
mt_rainier_features_df = mt_rainier_df[["Route","Battery Voltage AVG","Wind Speed Daily AVG", "Solare Radiation AVG"]] 

# Defining our labels
mt_rainier_labels_df = mt_rainier_df[["Succeeded"]]


# Transforming categorical features (Route) into 1-hot
route_names_to_list = mt_rainier_features_df["Route"].to_list()

# Converting the 1-dimensional list into a list of lists(2D)
route_names_to_list_of_lists = []

for route in route_names_to_list:
    route_names_to_list_of_lists.append([route])

# Defining an object
route_encoder = OneHotEncoder()

# Fit our data (i.e., extract and order vocabulary)
route_encoder.fit(route_names_to_list_of_lists)

# Now transform each example in our data into 1-hot form
route_names_transformed = route_encoder.transform(route_names_to_list_of_lists)

# Transform the result object into a matrix
route_names_transformed = route_names_transformed.toarray()

# Create a dataframe back from the array
route_names_transformed_df = pd.DataFrame(route_names_transformed)

# Now concatenate this feature back to the original dataframe 
mt_rainier_features_df.reset_index(drop=True, inplace=True)
route_names_transformed_df.reset_index(drop=True, inplace=True)

mt_rainier_features_transformed_df = pd.concat([mt_rainier_features_df,route_names_transformed_df], axis=1)

# We don't need Route now since we have already transformed it
mt_rainier_features_transformed_df = mt_rainier_features_transformed_df.drop(columns=["Route"])


# Scaling our features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = mt_rainier_features_transformed_df.columns

mt_rainier_features_transformed_df[all_columns] = scaler.fit_transform(mt_rainier_features_transformed_df[all_columns])


# Data splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = mt_rainier_features_transformed_df.to_numpy()
labels = mt_rainier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_splitter = KFold(n_splits=k)

folds_data = [] # Materialize every fold up front (memory-inefficient, but simple)

fold = 1
for train_index, validation_index in kfold_splitter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))
    
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# LR with no regularizer (penalty=None needs scikit-learn >= 1.2; older releases use penalty="none")
lr_vanilla = LogisticRegression(penalty=None)

# LR with regularizer
lr_L2 = LogisticRegression(penalty="l2")

# SVM with linear kernel
svm_linear = SVC(kernel="linear")

# SVM with polynomial kernel
svm_poly = SVC(kernel="poly",degree=2)

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              "lr_L2":lr_L2,
              "svm_linear":svm_linear,
              "svm_poly":svm_poly}

# K-fold cross validation
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)
        
        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)
        
        # Compute training accuracy
        train_acc = accuracy_score(y_pred_train , y_train.flatten())
        
        # Store training accuracy for each folds
        train_acc_for_all_folds.append(train_acc)
        
        # Compute validation accuracy
        valid_acc = accuracy_score(y_pred_valid , y_valid.flatten())

        # Store validation accuracy for each folds
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Fold 1 training data shape = ((1364, 25), (1364, 1))
Fold 1 validation data shape = ((341, 25), (341, 1))
Fold 2 training data shape = ((1364, 25), (1364, 1))
Fold 2 validation data shape = ((341, 25), (341, 1))
Fold 3 training data shape = ((1364, 25), (1364, 1))
Fold 3 validation data shape = ((341, 25), (341, 1))
Fold 4 training data shape = ((1364, 25), (1364, 1))
Fold 4 validation data shape = ((341, 25), (341, 1))
Fold 5 training data shape = ((1364, 25), (1364, 1))
Fold 5 validation data shape = ((341, 25), (341, 1))
Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.6372434017595308
Average validation accuracy for model lr_vanilla = 0.6287390029325512
-----------------------------------
Evaluating lr_L2 ...
Average training accuracy for model lr_L2 = 0.6370967741935483
Average validation accuracy for model lr_L2 = 0.6281524926686217
-----------------------------------
Evaluating svm_linear ...
Average training accuracy for model svm_linear = 0.6165689149560117
Average validation accuracy for model svm_linear = 0.612316715542522
-----------------------------------
Evaluating svm_poly ...
Average training accuracy for model svm_poly = 0.6070381231671556
Average validation accuracy for model svm_poly = 0.5777126099706745
-----------------------------------
Best model for the task is lr_vanilla which offers the validation accuracy of 0.6287390029325512


### Summary (ranked from best to worst by average validation accuracy):
1. Model 7 (lr_vanilla): 0.62874
2. Model 5 (lr_vanilla): 0.62287
3. Model 6 (svm_linear): 0.61349
4. Original (lr_L2): 0.61290
5. Model 4 (lr_vanilla): 0.61290
6. Model 2 (svm_linear): 0.60645
7. Model 3 (lr_vanilla): 0.59355
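
As a small sketch, the ranking above can be reproduced by putting the reported accuracies into a dataframe and sorting it. The numbers are copied from the summary list; the names `summary_df`, `ranked_df`, and the column labels are my own and are not part of the notebook.

```python
import pandas as pd

# Best average validation accuracy per feature set, in notebook order.
# Values are copied from the summary list; column names are illustrative.
summary_df = pd.DataFrame(
    [
        ("Original", "lr_L2", 0.61290),
        ("Model 2", "svm_linear", 0.60645),
        ("Model 3", "lr_vanilla", 0.59355),
        ("Model 4", "lr_vanilla", 0.61290),
        ("Model 5", "lr_vanilla", 0.62287),
        ("Model 6", "svm_linear", 0.61349),
        ("Model 7", "lr_vanilla", 0.62874),
    ],
    columns=["feature_set", "best_model", "avg_validation_accuracy"],
)

# A stable sort keeps tied rows (Original and Model 4) in their input order
ranked_df = summary_df.sort_values(
    "avg_validation_accuracy", ascending=False, kind="stable"
).reset_index(drop=True)

print(ranked_df)
```

The spread between the best and worst entries is only about 0.035, so the ordering of neighbouring entries should be read with some caution.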

## Which was the best model?
The best model was Model 7: logistic regression with no regularization (lr_vanilla), which reached an average validation accuracy of 0.629 across the 5 folds.

The best-performing feature set was:
* Route
* Battery Voltage AVG
* Wind Speed Daily AVG
* Solare Radiation AVG

### Notes:
* The original model and Model 4 produced the same validation accuracy (0.61290), even though Model 4 did not include "Wind Direction AVG". I find this interesting. It suggests that "Wind Direction AVG" may be less impactful than the other features.
* I was surprised that "Battery Voltage AVG" was more impactful than I expected: the two top-ranked models (Models 5 and 7) both include it. I wonder why that is.
* *I was a bit confused about when to use MinMaxScaler vs. StandardScaler. I decided to use StandardScaler because the lab did; however, I noticed that it also scaled the one-hot encoded columns, so I wasn't 100% sure it was the right scaler to use.*
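
One possible way to address the scaling question in the last note is to scale only the continuous columns and leave the one-hot columns as 0/1. The sketch below uses scikit-learn's `ColumnTransformer` inside a `Pipeline`, which also refits the scaler on each training fold rather than on the full dataset. It is not what the notebook does: the data is a synthetic stand-in (only two route names, made-up values), the classifier is the default L2-regularized `LogisticRegression`, and names such as `demo_df` and `preprocessor` are my own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for MtRainier_data.csv (values are made up for illustration)
rng = np.random.default_rng(0)
n_rows = 200
demo_df = pd.DataFrame({
    "Route": rng.choice(["Disappointment Cleaver", "Little Tahoma"], size=n_rows),
    "Battery Voltage AVG": rng.normal(13.5, 0.2, size=n_rows),
    "Wind Speed Daily AVG": rng.normal(20.0, 10.0, size=n_rows),
    "Solare Radiation AVG": rng.normal(120.0, 40.0, size=n_rows),
})
demo_labels = rng.integers(0, 2, size=n_rows)

numeric_columns = ["Battery Voltage AVG", "Wind Speed Daily AVG", "Solare Radiation AVG"]

# Scale the continuous columns only; one-hot encode Route and leave it as 0/1
preprocessor = ColumnTransformer(
    [
        ("scale_numeric", StandardScaler(), numeric_columns),
        ("one_hot_route", OneHotEncoder(handle_unknown="ignore"), ["Route"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# The preprocessing is refit on the training part of every fold
cv_scores = cross_val_score(
    pipeline, demo_df, demo_labels, cv=KFold(n_splits=5), scoring="accuracy"
)
print(f"Average validation accuracy = {cv_scores.mean()}")

# Inspect the transformed matrix: 3 scaled columns followed by 2 one-hot columns
transformed = preprocessor.fit_transform(demo_df)
print(transformed[:3])
```

Whether leaving the one-hot columns unscaled changes the accuracies reported above is an open question; this only shows how the two kinds of columns could be treated differently.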