<a href="https://colab.research.google.com/github/jacqueslethuaut/hackathon-shikansen/blob/main/SoloTraveller01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Shinkansen Travel Experience
The goal is to predict whether a passenger was satisfied, based on their overall experience of travelling on the Shinkansen Bullet Train.

Team:
- Raghavendar Lokineni
- Jacques Le Thuaut

## Initialisation

In [1]:
USING_GOOGLE_COLAB = True
HACKATHON_DATA = '/Hackathon'
TRAVEL_DATA_TRAIN = '/Traveldata_train_(1).csv'
TRAVEL_DATA_TEST = '/Traveldata_test_(1).csv'
SURVEY_DATA_TRAIN = '/Surveydata_train_(1).csv'
SURVEY_DATA_TEST = '/Surveydata_test_(1).csv'
DATA_DICTIONNARY = '/Data_Dictionary_(1).xlsx'

## Connecting to Google Drive

In [2]:
if USING_GOOGLE_COLAB:
  from google.colab import drive
  drive.mount('/content/drive')

Mounted at /content/drive


## Loading libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Loading the data


In [4]:
# Build the shared data directory once instead of repeating it in every call
DATA_DIR = '/content/drive/MyDrive/Colab Notebooks' + HACKATHON_DATA
travel_data_train = pd.read_csv(DATA_DIR + TRAVEL_DATA_TRAIN)
travel_data_test = pd.read_csv(DATA_DIR + TRAVEL_DATA_TEST)
survey_data_train = pd.read_csv(DATA_DIR + SURVEY_DATA_TRAIN)
survey_data_test = pd.read_csv(DATA_DIR + SURVEY_DATA_TEST)

## Inspecting the datasets

### Travel data (train)

In [5]:
print(travel_data_train.head())
print(travel_data_train.tail())

         ID  Gender   Customer_Type   Age      Type_Travel Travel_Class  \
0  98800001  Female  Loyal Customer  52.0              NaN     Business   
1  98800002    Male  Loyal Customer  48.0  Personal Travel          Eco   
2  98800003  Female  Loyal Customer  43.0  Business Travel     Business   
3  98800004  Female  Loyal Customer  44.0  Business Travel     Business   
4  98800005  Female  Loyal Customer  50.0  Business Travel     Business   

   Travel_Distance  Departure_Delay_in_Mins  Arrival_Delay_in_Mins  
0              272                      0.0                    5.0  
1             2200                      9.0                    0.0  
2             1061                     77.0                  119.0  
3              780                     13.0                   18.0  
4             1981                      0.0                    0.0  
             ID Gender   Customer_Type   Age      Type_Travel Travel_Class  \
94374  98894375   Male  Loyal Customer  32.0  Business Tr

In [6]:
travel_data_train.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 94379.0 | 98847190.0 | 27245.014865 | 98800001.0 | 98823595.5 | 98847190.0 | 98870784.5 | 98894379.0 |
| Age | 94346.0 | 39.41965 | 15.116632 | 7.0 | 27.0 | 40.0 | 51.0 | 85.0 |
| Travel_Distance | 94379.0 | 1978.888 | 1027.961019 | 50.0 | 1359.0 | 1923.0 | 2538.0 | 6951.0 |
| Departure_Delay_in_Mins | 94322.0 | 14.64709 | 38.138781 | 0.0 | 0.0 | 0.0 | 12.0 | 1592.0 |
| Arrival_Delay_in_Mins | 94022.0 | 15.00522 | 38.439409 | 0.0 | 0.0 | 0.0 | 13.0 | 1584.0 |


In [7]:
travel_data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       94379 non-null  int64  
 1   Gender                   94302 non-null  object 
 2   Customer_Type            85428 non-null  object 
 3   Age                      94346 non-null  float64
 4   Type_Travel              85153 non-null  object 
 5   Travel_Class             94379 non-null  object 
 6   Travel_Distance          94379 non-null  int64  
 7   Departure_Delay_in_Mins  94322 non-null  float64
 8   Arrival_Delay_in_Mins    94022 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 6.5+ MB


Count the missing values per column, then verify that the 'ID' column contains no duplicate IDs.

In [9]:
travel_data_train.isnull().sum()

ID                            0
Gender                       77
Customer_Type              8951
Age                          33
Type_Travel                9226
Travel_Class                  0
Travel_Distance               0
Departure_Delay_in_Mins      57
Arrival_Delay_in_Mins       357
dtype: int64

In [14]:
100.0 * travel_data_train.isnull().sum() / travel_data_train.shape[0]

ID                         0.000000
Gender                     0.081586
Customer_Type              9.484101
Age                        0.034965
Type_Travel                9.775480
Travel_Class               0.000000
Travel_Distance            0.000000
Departure_Delay_in_Mins    0.060395
Arrival_Delay_in_Mins      0.378262
dtype: float64
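The two cells above report missing counts and missing percentages separately. As a minimal sketch, they could be combined into one table. The helper name `missing_summary` and the toy frame are hypothetical and only stand in for `travel_data_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for travel_data_train (illustrative values only).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Gender": ["Female", None, "Male", "Male"],
    "Age": [52.0, 48.0, np.nan, np.nan],
})

def missing_summary(df):
    """Return per-column missing counts and percentages, worst column first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100.0 * counts / len(df),
    })
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

On the real data, the call would be `missing_summary(travel_data_train)`.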

In [8]:
len(travel_data_train['ID'].unique())

94379
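The count of unique IDs matches the number of rows, so there are no duplicates. An alternative sketch, shown on a hypothetical toy frame, uses `is_unique` for a direct yes/no answer and `duplicated` to list offending rows if any exist.

```python
import pandas as pd

# Toy frame with one repeated ID (illustrative values only).
toy = pd.DataFrame({"ID": [98800001, 98800002, 98800002]})

# is_unique is True only when no value repeats.
print(toy["ID"].is_unique)

# duplicated(keep=False) flags every row that shares its ID with another row.
dupes = toy[toy["ID"].duplicated(keep=False)]
print(dupes)
```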

First observations:
- The train dataset contains 94379 rows, each with a unique ID (column 'ID').
- The 8 other columns are populated as follows:
  - Gender: 77 missing values (~0.08%)
  - Customer_Type: 8951 missing values (~9.5%)
  - Age (range 7 - 85): 33 missing values (~0.03%)
  - Type_Travel: 9226 missing values (~9.8%)
  - Travel_Class: no missing values
  - Travel_Distance (range 50 - 6951): no missing values
  - Departure_Delay_in_Mins (range 0 - 1592): 57 missing values (~0.06%)
  - Arrival_Delay_in_Mins (range 0 - 1584): 357 missing values (~0.38%). Note: its range, mean and quartiles are almost the same as those of the departure delay, so the departure delay could help fill in missing arrival delays.
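The note on arrival and departure delays suggests one possible imputation. The sketch below is an assumption, not the method used in this notebook: it fills a missing arrival delay with the departure delay of the same row, then falls back to the column median. Similar ranges do not prove that the two columns agree row by row, so it would be worth checking their correlation on the real data first. The toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two delay columns (illustrative values only).
toy = pd.DataFrame({
    "Departure_Delay_in_Mins": [0.0, 9.0, 77.0, 13.0, np.nan],
    "Arrival_Delay_in_Mins": [5.0, np.nan, 119.0, np.nan, np.nan],
})

filled = toy.copy()

# Where the arrival delay is missing, borrow the departure delay of the same row.
filled["Arrival_Delay_in_Mins"] = filled["Arrival_Delay_in_Mins"].fillna(
    filled["Departure_Delay_in_Mins"]
)

# Rows missing both delays fall back to the median of the arrival column.
filled["Arrival_Delay_in_Mins"] = filled["Arrival_Delay_in_Mins"].fillna(
    filled["Arrival_Delay_in_Mins"].median()
)

print(filled)
```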

Verify the values of 'Gender', 'Customer_Type', 'Type_Travel' and 'Travel_Class':

In [20]:
print('values of Gender')
print(travel_data_train['Gender'].unique())
print('values of Customer_Type')
print(travel_data_train['Customer_Type'].unique())
print('values of Type_Travel')
print(travel_data_train['Type_Travel'].unique())
print('values of Travel_Class')
print(travel_data_train['Travel_Class'].unique())

values of Gender
['Female' 'Male' nan]
values of Customer_Type
['Loyal Customer' 'Disloyal Customer' nan]
values of Type_Travel
[nan 'Personal Travel' 'Business Travel']
values of Travel_Class
['Business' 'Eco']


Next observation:
- Once the NaN values are handled, 'Gender', 'Customer_Type', 'Type_Travel' and 'Travel_Class' can be encoded, since each has only two categories.
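One way to do this encoding is sketched below. It rests on two assumptions that this notebook has not settled: missing categories are filled with the most frequent value of the column, and each two-valued column is mapped to 0/1. The `mappings` dictionary and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the categorical columns (illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", np.nan],
    "Type_Travel": [np.nan, "Personal Travel", "Business Travel", "Business Travel"],
    "Travel_Class": ["Business", "Eco", "Business", "Business"],
})

# Hypothetical 0/1 codes for each two-valued column.
mappings = {
    "Gender": {"Female": 0, "Male": 1},
    "Type_Travel": {"Personal Travel": 0, "Business Travel": 1},
    "Travel_Class": {"Eco": 0, "Business": 1},
}

encoded = toy.copy()
for col, mapping in mappings.items():
    # Assumption: replace NaN with the most frequent category of the column.
    encoded[col] = encoded[col].fillna(encoded[col].mode()[0])
    # Replace each category label with its integer code.
    encoded[col] = encoded[col].map(mapping).astype(int)

print(encoded)
```

Filling with the mode is only one option. With close to 10% of 'Customer_Type' and 'Type_Travel' missing, keeping a separate "missing" category may preserve more information.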


