# Introduction

### Welcome to the Davidson AI/ML Competition. This is the starter notebook for the competition: if you run it to completion, you will have a valid submission to upload to the leaderboard. Instructions for submitting are at the end of this notebook. The notebook is provided as a guide to constructing a valid submission and as a basic introduction to AI/ML. It is heavily commented and prioritizes clarity over brevity in its code.

### Jupyter Notebooks are interactive environments where you can write Python code. A great introduction to Jupyter Notebooks can be found [here](https://www.kaggle.com/code/jhoward/jupyter-notebook-101).

---
# Reading in Data

In [6]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "./input/" directory (the path used throughout this notebook)
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

./input\test.csv
./input\train.csv


In [8]:
# Read in the data using Pandas
train_df = pd.read_csv("./input/train.csv")
test_df = pd.read_csv("./input/test.csv")

In [9]:
# Let's view the training data
train_df.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Ease of Online booking,Convenience of Departure/Arrival Time,Baggage Handling,...,Food and Drink,Seat Comfort,Inflight Entertainment,On-Board Service,Leg Room,Inflight Service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,Satisfaction Rating
0,100157,Female,Loyal Customer,29,Business,Business,2207.45,2,2,4,...,4,4,4,3,5,4,4,0,8.0,Satisfied
1,35082,Male,Loyal Customer,7,Personal,Economy,362.47,4,4,5,...,2,2,2,3,5,4,2,24,46.0,Neutral / Dissatisfied
2,115833,Female,Loyal Customer,42,Business,Economy,527.26,5,5,5,...,4,1,5,5,5,5,2,5,0.0,Satisfied
3,70660,Male,Loyal Customer,55,Business,Business,1505.25,4,4,2,...,3,5,2,2,2,2,5,0,0.0,Satisfied
4,12546,Male,Loyal Customer,47,Personal,Economy,1072.8,2,4,4,...,1,1,1,3,5,5,1,0,0.0,Neutral / Dissatisfied


In [10]:
# Let's view the testing data
test_df.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Ease of Online booking,Convenience of Departure/Arrival Time,Baggage Handling,...,Inflight Wifi Service,Food and Drink,Seat Comfort,Inflight Entertainment,On-Board Service,Leg Room,Inflight Service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes
0,0,Male,Non-Loyal Customer,49,Personal,Economy,2485.7,1,2,4,...,0,2,3,2,1,1,3,2,0,0.0
1,1,Female,Non-Loyal Customer,65,Personal,Economy,585.18,1,1,4,...,1,4,4,4,3,4,3,4,0,13.0
2,2,Male,Non-Loyal Customer,40,Personal,Economy,1014.2,1,2,4,...,1,1,1,1,1,4,3,1,0,0.0
3,3,Female,Non-Loyal Customer,14,Personal,Economy,324.96,1,2,4,...,1,3,3,3,3,1,4,3,0,0.0
4,4,Female,Non-Loyal Customer,45,Personal,Economy,647.04,0,3,4,...,1,5,5,5,2,3,5,5,14,7.0


### Note that the test data does not have the `Satisfaction Rating` column. Your model will predict this column, and you will submit those predictions to the leaderboard.

---
# Data Exploration

### Let's check whether our dataset has any missing data or NaNs. If either is present in the dataset, it will lead to errors when training the ML Model.

In [11]:
for column in train_df.columns:
    if train_df[column].isna().any():
        # Count the NaNs in this column only, not across the whole DataFrame
        nan_counts = train_df[column].isna().sum()
        print(f"{nan_counts} NaNs found in column '{column}'")

281 NaNs found in column 'Arrival Delay in Minutes'


### It looks like we have missing data/NaNs in the `Arrival Delay in Minutes` column. Only 281 of the 90916 total rows, or about 0.3%, are missing data. Because so few rows are affected, a common approach is to replace the missing values with the most common value in the column. There are smarter methods that you can explore, but we will use the simpler one for the starter notebook.

In [12]:
# Find the most common value in Arrival Delay in Minutes
most_common_arrival_delay = train_df['Arrival Delay in Minutes'].mode()[0]

print(most_common_arrival_delay)

0.0


In [13]:
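# Update the training data to fill in the missing values
# (assign the result back; recent pandas versions warn against inplace=True on a selected column)
train_df['Arrival Delay in Minutes'] = train_df['Arrival Delay in Minutes'].fillna(most_common_arrival_delay)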

### In general, when we make updates to the training data we want to apply the same updates to the test data. The test data also has some missing values in the same column, so we will update in the same way.

In [14]:
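# Update the testing data to fill in the missing values
# (assign the result back; recent pandas versions warn against inplace=True on a selected column)
test_df['Arrival Delay in Minutes'] = test_df['Arrival Delay in Minutes'].fillna(most_common_arrival_delay)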
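### As one example of the other methods mentioned above, here is a minimal sketch that fills missing values with the median instead of the most common value, using scikit-learn's `SimpleImputer`. The `toy_train` and `toy_test` DataFrames are made-up stand-ins for `train_df` and `test_df`, and choosing the median is an assumption for illustration, not a recommendation specific to this dataset. The fill value is learned from the training data only and then reused on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for train_df and test_df (values are made up for illustration)
toy_train = pd.DataFrame({"Arrival Delay in Minutes": [0.0, 10.0, np.nan, 4.0, 0.0]})
toy_test = pd.DataFrame({"Arrival Delay in Minutes": [np.nan, 7.0]})

delay_column = ["Arrival Delay in Minutes"]

# Learn the fill value (here the median) from the training data only
imputer = SimpleImputer(strategy="median")
toy_train[delay_column] = imputer.fit_transform(toy_train[delay_column])

# Reuse the same fitted imputer on the test data so both sets are treated identically
toy_test[delay_column] = imputer.transform(toy_test[delay_column])

print(toy_train["Arrival Delay in Minutes"].tolist())
print(toy_test["Arrival Delay in Minutes"].tolist())
```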

---
# Data Processing
### Similar to the data exploration stage, there are common data processing techniques that are used to prepare a dataset for ML Model training. Most ML Models do not allow you to pass strings in directly. Instead it is common to encode these values. For example, instead of "Male" and "Female", we can encode that as 0 and 1. This is known as a categorical variable, where each value corresponds to a different category. Our dataset has a few of these, so let's update them. 

In [15]:
# Encode Female as 0 and Male as 1
# (assign the result back; recent pandas versions warn against inplace=True on a selected column)
train_df['Gender'] = train_df['Gender'].replace({'Female': 0, 'Male': 1})

# Encode Loyal Customer as 0 and Non-Loyal Customer as 1
train_df['Customer Type'] = train_df['Customer Type'].replace({'Loyal Customer': 0, 'Non-Loyal Customer': 1})

# Encode Business as 0 and Personal as 1
train_df['Type of Travel'] = train_df['Type of Travel'].replace({'Business': 0, 'Personal': 1})

# Encode Business as 0, Economy as 1, and Economy Plus as 2
train_df['Class'] = train_df['Class'].replace({'Business': 0, 'Economy': 1, 'Economy Plus': 2})

### We've updated our training data and need to apply the same encodings to the test data. Note: I recommend using functions for your data processing, since the code is nearly identical for the training and testing data. It is written out in full here for clarity.

In [16]:
# Encode Female as 0 and Male as 1
# (assign the result back; recent pandas versions warn against inplace=True on a selected column)
test_df['Gender'] = test_df['Gender'].replace({'Female': 0, 'Male': 1})

# Encode Loyal Customer as 0 and Non-Loyal Customer as 1
test_df['Customer Type'] = test_df['Customer Type'].replace({'Loyal Customer': 0, 'Non-Loyal Customer': 1})

# Encode Business as 0 and Personal as 1
test_df['Type of Travel'] = test_df['Type of Travel'].replace({'Business': 0, 'Personal': 1})

# Encode Business as 0, Economy as 1, and Economy Plus as 2
test_df['Class'] = test_df['Class'].replace({'Business': 0, 'Economy': 1, 'Economy Plus': 2})
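### The recommendation to use functions for data processing could look something like the sketch below. The names `CATEGORY_ENCODINGS`, `encode_categories`, and `toy_df` are hypothetical and not part of the original notebook; the encodings themselves match the ones used above. This version uses `Series.map`, which turns any value missing from the mapping into NaN, so unexpected categories are easy to spot.

```python
import pandas as pd

# Mapping from each text category to its integer code, matching the encodings used above
CATEGORY_ENCODINGS = {
    "Gender": {"Female": 0, "Male": 1},
    "Customer Type": {"Loyal Customer": 0, "Non-Loyal Customer": 1},
    "Type of Travel": {"Business": 0, "Personal": 1},
    "Class": {"Business": 0, "Economy": 1, "Economy Plus": 2},
}


def encode_categories(df):
    """Return a copy of df with the text columns above replaced by integer codes."""
    encoded = df.copy()
    for column, mapping in CATEGORY_ENCODINGS.items():
        # Values not present in the mapping become NaN
        encoded[column] = encoded[column].map(mapping)
    return encoded


# Toy stand-in for train_df or test_df (rows are made up for illustration)
toy_df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Customer Type": ["Loyal Customer", "Non-Loyal Customer", "Loyal Customer"],
    "Type of Travel": ["Business", "Personal", "Personal"],
    "Class": ["Business", "Economy", "Economy Plus"],
})

encoded_toy_df = encode_categories(toy_df)
print(encoded_toy_df)
```

### With a helper like this, the same call could be applied to both datasets, for example `train_df = encode_categories(train_df)` followed by `test_df = encode_categories(test_df)`.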

### The final column we need to update is the `Satisfaction Rating` column. This column does not exist in the test set, so we only apply the encoding to the training data.

In [17]:
# Encode Neutral / Dissatisfied as 0 and Satisfied as 1
train_df['Satisfaction Rating'] = train_df['Satisfaction Rating'].replace({'Neutral / Dissatisfied': 0, 'Satisfied': 1})

### There are a lot of additional data processing techniques that you can use for this competition. The starter notebook is meant to get you started with a baseline model, not be a complete guide!
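### As one example of an additional technique, the sketch below standardizes numeric columns so that each has mean 0 and unit variance, using scikit-learn's `StandardScaler`. Whether this helps on the competition data is an assumption you would need to test yourself. The `toy_train` and `toy_test` DataFrames are made up, and the scaler is fit on the training data only so that the test data is transformed with the same statistics.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the numeric part of train_df and test_df (values are made up)
toy_train = pd.DataFrame({"Age": [20, 40, 60], "Flight Distance": [300.0, 1500.0, 2700.0]})
toy_test = pd.DataFrame({"Age": [40], "Flight Distance": [2700.0]})

numeric_columns = ["Age", "Flight Distance"]

# Learn each column's mean and standard deviation from the training data only
scaler = StandardScaler()
scaled_train = pd.DataFrame(
    scaler.fit_transform(toy_train[numeric_columns]), columns=numeric_columns
)

# Apply the training statistics to the test data
scaled_test = pd.DataFrame(
    scaler.transform(toy_test[numeric_columns]), columns=numeric_columns
)

print(scaled_train)
print(scaled_test)
```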

---


# Train ML Model

### Now that our data is processed, we can train an ML Model. An ML Model works with two main pieces of information. The `y` data is the variable you are trying to predict; in our case, it is the `Satisfaction Rating`. The `y` data is also called the target, output, or response. To predict our target, we use our `x` data. The `x` data (also known as the input or features) is the set of variables the ML Model can use to predict the `y` data. In our example, we will use all of the variables we are provided to predict the `Satisfaction Rating`.

### The specific ML Model we will use is `LogisticRegression`, a binary classification model, which will predict the `Satisfaction Rating`. In general, any classification algorithm could be used to predict the `Satisfaction Rating`.

In [18]:
# Import the model from scikit-learn
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
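### To illustrate that other classification algorithms can be swapped in, here is a small sketch on made-up toy data. It fits a `LogisticRegression` and a `DecisionTreeClassifier` through the same `fit` and `predict` calls. The names `toy_x`, `toy_y`, and `candidate_models` are hypothetical, and the decision tree is only one possible alternative, not a suggestion that it performs better on the competition data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data: a single feature, with label 1 whenever the feature is greater than 5
toy_x = np.arange(10).reshape(-1, 1)
toy_y = (toy_x.ravel() > 5).astype(int)

# Any scikit-learn classifier exposes the same fit/predict interface
candidate_models = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

toy_predictions = {}
for name, candidate in candidate_models.items():
    candidate.fit(toy_x, toy_y)
    toy_predictions[name] = candidate.predict(toy_x)
    print(name, toy_predictions[name])
```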

In [19]:
# Setup the train x data and train y data

# Get all of the column names
column_names = list(train_df.columns)
print(f'All columns are {column_names}')
print('\n')

# For our x data, we need to remove 'Satisfaction Rating' since we are trying to predict that
column_names.remove('Satisfaction Rating')

x_column_names = column_names

# Have successfully removed Satisfaction Rating
print(f'X columns are {x_column_names}')

# Assign the data to train_x and train_y
train_x = train_df[x_column_names]
train_y = train_df['Satisfaction Rating']

All columns are ['id', 'Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class', 'Flight Distance', 'Ease of Online booking', 'Convenience of Departure/Arrival Time ', 'Baggage Handling', 'Check-In Service', 'Gate Location', 'Online Boarding', 'Inflight Wifi Service', 'Food and Drink', 'Seat Comfort', 'Inflight Entertainment', 'On-Board Service', 'Leg Room', 'Inflight Service', 'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes', 'Satisfaction Rating']


X columns are ['id', 'Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class', 'Flight Distance', 'Ease of Online booking', 'Convenience of Departure/Arrival Time ', 'Baggage Handling', 'Check-In Service', 'Gate Location', 'Online Boarding', 'Inflight Wifi Service', 'Food and Drink', 'Seat Comfort', 'Inflight Entertainment', 'On-Board Service', 'Leg Room', 'Inflight Service', 'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes']


### Let's take a quick look at `train_x` and `train_y` to see their structure

In [20]:
train_x.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Ease of Online booking,Convenience of Departure/Arrival Time,Baggage Handling,...,Inflight Wifi Service,Food and Drink,Seat Comfort,Inflight Entertainment,On-Board Service,Leg Room,Inflight Service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes
0,100157,0,0,29,0,0,2207.45,2,2,4,...,2,4,4,4,3,5,4,4,0,8.0
1,35082,1,0,7,1,1,362.47,4,4,5,...,4,2,2,2,3,5,4,2,24,46.0
2,115833,0,0,42,0,1,527.26,5,5,5,...,5,4,1,5,5,5,5,2,5,0.0
3,70660,1,0,55,0,0,1505.25,4,4,2,...,4,3,5,2,2,2,2,5,0,0.0
4,12546,1,0,47,1,1,1072.8,2,4,4,...,2,1,1,1,3,5,5,1,0,0.0


In [21]:
pd.DataFrame(train_y).head()

Unnamed: 0,Satisfaction Rating
0,1
1,0
2,1
3,1
4,0


In [22]:
# To demonstrate that train_x and train_y correspond to the original train_df, you can uncomment the line below to show the original data.
# train_df.head()

In [23]:
# Train the ML Model
model.fit(train_x, train_y)

### Let's check how well our model does at making predictions on the training set

In [24]:
from sklearn.metrics import accuracy_score

# Make predictions for our train_x using our model that we just fit
y_pred_train = model.predict(train_x)

# Calculate the accuracy score between the true values and our model's predictions
# (accuracy_score takes the true labels first, then the predictions)
np.round(accuracy_score(train_y, y_pred_train) * 100, 4)

74.6535
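### Accuracy on the training set tends to be optimistic, because the model has already seen those rows. A common complement is to hold out part of the training data and score the model on rows it was not fit on. The sketch below shows this on randomly generated toy data; `toy_x`, `toy_y`, `holdout_model`, and the 80/20 split are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data: 200 rows, 3 features, label depends on the first two features
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 3))
toy_y = (toy_x[:, 0] + toy_x[:, 1] > 0).astype(int)

# Hold out 20% of the rows for validation, keeping the class balance similar
fit_x, val_x, fit_y, val_y = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# Fit only on the remaining 80%
holdout_model = LogisticRegression()
holdout_model.fit(fit_x, fit_y)

# Compare accuracy on rows the model has seen with rows it has not
fit_accuracy = accuracy_score(fit_y, holdout_model.predict(fit_x))
val_accuracy = accuracy_score(val_y, holdout_model.predict(val_x))
print(f"Accuracy on fitted rows: {fit_accuracy:.3f}")
print(f"Accuracy on held-out rows: {val_accuracy:.3f}")
```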

---
# Make Predictions 

### Now that we have a trained ML Model, let's use it to make predictions on the test set, for which we don't know the correct answers. We will take the `x` data from our testing dataset and pass it to our model. The model's predictions (often referred to as `y_pred`) will be our submission to the leaderboard!

In [25]:
# Create test x data
test_x = test_df[x_column_names]

### Let's take a quick look at `test_x`

In [26]:
test_x.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Ease of Online booking,Convenience of Departure/Arrival Time,Baggage Handling,...,Inflight Wifi Service,Food and Drink,Seat Comfort,Inflight Entertainment,On-Board Service,Leg Room,Inflight Service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes
0,0,1,1,49,1,1,2485.7,1,2,4,...,0,2,3,2,1,1,3,2,0,0.0
1,1,0,1,65,1,1,585.18,1,1,4,...,1,4,4,4,3,4,3,4,0,13.0
2,2,1,1,40,1,1,1014.2,1,2,4,...,1,1,1,1,1,4,3,1,0,0.0
3,3,0,1,14,1,1,324.96,1,2,4,...,1,3,3,3,3,1,4,3,0,0.0
4,4,0,1,45,1,1,647.04,0,3,4,...,1,5,5,5,2,3,5,5,14,7.0


In [27]:
# Make predictions
y_pred = model.predict(test_x)
pd.DataFrame(y_pred, columns=['Satisfaction Rating'])

Unnamed: 0,Satisfaction Rating
0,0
1,0
2,0
3,0
4,0
...,...
38959,0
38960,0
38961,1
38962,0


### Woohoo! We've successfully made predictions with our trained ML Model. You'll notice that the predictions are 0 and 1, instead of "Neutral / Dissatisfied" and "Satisfied". Remember that we encoded those strings to 0 and 1 earlier.

---
# Make Submission

### The Kaggle leaderboard expects the string version ("Neutral / Dissatisfied" and "Satisfied"), so let's convert our predictions and make a submission!

In [28]:
# The sample submission file provides the structure of the submission so it's a good starting place. 
submission_df = pd.read_csv("./input/sample_submission.csv")

# The default sample submission has "Satisfied" for all rows.
submission_df.head()

Unnamed: 0,id,Satisfaction Rating
0,0,Satisfied
1,1,Satisfied
2,2,Satisfied
3,3,Satisfied
4,4,Satisfied


In [29]:
# Update submission to use our predictions
submission_df['Satisfaction Rating'] = y_pred

# Replace 0 with "Neutral / Dissatisfied" and replace 1 with "Satisfied"
submission_df['Satisfaction Rating'] = submission_df['Satisfaction Rating'].replace({0: 'Neutral / Dissatisfied', 1: 'Satisfied'})

In [30]:
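# Save your submission file (create the output directory first in case it does not exist)
os.makedirs("./submission", exist_ok=True)
submission_df.to_csv("./submission/submission.csv", index=False)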
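### Before uploading, it can be worth checking that the submission has the expected shape. The sketch below assumes the format shown by the sample submission (an `id` column and a `Satisfaction Rating` column containing only the two string labels). The `check_submission` helper and the toy DataFrames are hypothetical, and the exact rules the leaderboard enforces may differ.

```python
import pandas as pd

# The two string labels the submission is assumed to contain
EXPECTED_LABELS = {"Neutral / Dissatisfied", "Satisfied"}


def check_submission(submission, expected_rows):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    problems = []
    if list(submission.columns) != ["id", "Satisfaction Rating"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(submission)}")
    unexpected = set(submission["Satisfaction Rating"].unique()) - EXPECTED_LABELS
    if unexpected:
        problems.append(f"unexpected labels: {sorted(map(str, unexpected))}")
    if submission["id"].duplicated().any():
        problems.append("duplicate ids found")
    return problems


# Toy submissions (made up): one valid, one still holding the 0/1 encoding
good_submission = pd.DataFrame(
    {"id": [0, 1, 2], "Satisfaction Rating": ["Satisfied", "Neutral / Dissatisfied", "Satisfied"]}
)
bad_submission = pd.DataFrame({"id": [0, 1, 2], "Satisfaction Rating": [1, 0, 1]})

good_problems = check_submission(good_submission, expected_rows=3)
bad_problems = check_submission(bad_submission, expected_rows=3)
print(good_problems)
print(bad_problems)
```

### In the notebook itself, a call such as `check_submission(submission_df, expected_rows=len(test_df))` would apply the same checks to the real submission.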

## Methods for submitting

1. On the right panel, scroll down a little bit until you see the `Submit to Competition` panel. If you click the `Submit` button, the entire notebook will rerun each cell in order. At the end, if you are creating a `submission.csv` (NAME HAS TO MATCH EXACTLY), then Kaggle will submit this to the leaderboard. You'll see your latest score and best score in the panel after a successful submission. 

2. After saving your submission to a csv, you can click the dropdown under `Output` on the right panel. Hover over your submission and click the 3 dots on the right. Select download and a local copy will be downloaded to your machine. Then navigate back to the competition site on Kaggle and click `Submit Predictions` in the top right. This will open a modal where you can upload your submission file. If you go this route, the name of your submission file does not matter.  