# **GROUP ASSIGNMENT - MACHINE LEARNING TECHNIQUES FOR PATTERN RECOGNITION APPLICATIONS**

--------------------
## **1. Introduction**
--------------------

--------------------
### **1.1 Context**
--------------------

The problem statement is based on the Shinkansen Bullet Train in Japan and passengers’ experience with that mode of travel. This machine learning exercise aims to determine the relative importance of each parameter with regard to its contribution to the passengers’ overall travel experience.

-----------------------
### **1.2 Objective** 
-----------------------

The objective is to understand which parameters play an important role in swaying passenger feedback in a positive direction. The goal is to predict whether a passenger was satisfied with their experience of travelling on the Shinkansen Bullet Train.

------------------------------------
### **1.3 Dataset Description**
------------------------------------

The dataset contains a random sample of individuals who traveled on this train. It consists of information about the passengers, attributes of the Shinkansen train, and the post-service experience. Each passenger was explicitly asked whether they were satisfied with their overall travel experience, and the answer is captured in the survey data under the variable labeled ‘Overall_Experience’.

## **2. Prepare script**
### **2.1 Load Libraries**

In [1]:
import pandas as pd


### **2.2 Loading the Dataset**

In [2]:
ds = pd.read_csv("datashinkansen.csv") #ds - datashinkansen

## **3. Understand the data** 
### **3.1 Check the head and tail of the data**

In [3]:
ds.head() #Show data shinkansen head

| | ID | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | Gender | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Onboard_Wifi_Service | Ease_of_Online_Booking | Baggage_Handling | Legroom | CheckIn_Service | Cleanliness | Overall_Experience |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98800001 | 0 | 5 | Female | Needs Improvement | Green Car | Excellent | Good | Needs Improvement | Needs Improvement | Acceptable | Good | Needs Improvement | 0 |
| 1 | 98800002 | 9 | 0 | Male | Poor | Ordinary | Excellent | Good | Good | Poor | Needs Improvement | Needs Improvement | Good | 0 |
| 2 | 98800003 | 77 | 119 | Female | Needs Improvement | Green Car | Needs Improvement | Needs Improvement | Excellent | Excellent | Excellent | Good | Excellent | 1 |
| 3 | 98800004 | 13 | 18 | Female | Acceptable | Ordinary | Needs Improvement | Acceptable | Acceptable | Acceptable | Acceptable | Good | Acceptable | 0 |
| 4 | 98800005 | 0 | 0 | Female | Acceptable | Ordinary | Acceptable | Needs Improvement | Good | Good | Good | Good | Good | 1 |


In [4]:
ds.tail()#Show data shinkansen tail

| | ID | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | Gender | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Onboard_Wifi_Service | Ease_of_Online_Booking | Baggage_Handling | Legroom | CheckIn_Service | Cleanliness | Overall_Experience |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 94374 | 98894375 | 83 | 125 | Male | Poor | Ordinary | Good | Poor | Poor | Good | Good | Needs Improvement | Good | 0 |
| 94375 | 98894376 | 5 | 11 | Male | Good | Ordinary | Good | Needs Improvement | Acceptable | Acceptable | Acceptable | Good | Acceptable | 1 |
| 94376 | 98894377 | 0 | 0 | Male | Needs Improvement | Green Car | Needs Improvement | Good | Good | Good | Good | Acceptable | Good | 1 |
| 94377 | 98894378 | 0 | 0 | Male | Needs Improvement | Ordinary | | Good | Good | Good | Good | Good | Excellent | 0 |
| 94378 | 98894379 | 28 | 28 | Male | Acceptable | Ordinary | Poor | Acceptable | Acceptable | Good | Good | Poor | Good | 0 |


### **3.2 Use the info() and describe() functions for more information**

In [5]:
ds.info() #Show data shinkansen information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   ID                       94379 non-null  int64 
 1   Departure_Delay_in_Mins  94379 non-null  int64 
 2   Arrival_Delay_in_Mins    94379 non-null  int64 
 3   Gender                   94302 non-null  object
 4   Seat_Comfort             94318 non-null  object
 5   Seat_Class               94379 non-null  object
 6   Arrival_Time_Convenient  85449 non-null  object
 7   Onboard_Wifi_Service     94349 non-null  object
 8   Ease_of_Online_Booking   94306 non-null  object
 9   Baggage_Handling         94237 non-null  object
 10  Legroom                  94289 non-null  object
 11  CheckIn_Service          94302 non-null  object
 12  Cleanliness              94373 non-null  object
 13  Overall_Experience       94379 non-null  int64 
dtypes: int64(4), object(10)
memory usage: 

**Observations:**
* There are 94379 participants (rows) and 14 columns.
* Not all columns have a non-null count of 94379, which means that not every question was answered by every participant.

In [6]:
ds.describe() #Show data shinkansen description

| | ID | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | Overall_Experience |
|---|---|---|---|---|
| count | 94379.0 | 94379.0 | 94379.0 | 94379.0 |
| mean | 98847190.0 | 14.638246 | 14.948463 | 0.546658 |
| std | 27245.01 | 38.128961 | 38.377695 | 0.497821 |
| min | 98800000.0 | 0.0 | 0.0 | 0.0 |
| 25% | 98823600.0 | 0.0 | 0.0 | 0.0 |
| 50% | 98847190.0 | 0.0 | 0.0 | 1.0 |
| 75% | 98870780.0 | 12.0 | 13.0 | 1.0 |
| max | 98894380.0 | 1592.0 | 1584.0 | 1.0 |


**Observations:**
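* `describe()` summarises only the integer columns, of which there are just 4, so this output alone supports little analysis.
* The `ID` column is only an identifier, so its summary statistics are not meaningful.
* For the delay values the standard deviation is large (about 38 minutes against a mean of about 15 minutes). At least half of the trips have no delay (the median is 0), while some delayed trips have very long delays (up to 1592 minutes).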
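
The observations above can be checked with a short sketch. It is only an illustration: `sample` is a hypothetical stand-in for `ds`, built from a few values in the head output, and the 0.9 quantile is an arbitrary choice. It summarises the non-numeric columns that the default `describe()` call leaves out and measures how many trips leave on time.

```python
import pandas as pd

# Hypothetical stand-in for `ds`: values copied from the first five rows shown earlier.
sample = pd.DataFrame({
    "Departure_Delay_in_Mins": [0, 9, 77, 13, 0],
    "Gender": ["Female", "Male", "Female", "Female", "Female"],
    "Seat_Class": ["Green Car", "Ordinary", "Green Car", "Ordinary", "Ordinary"],
})

# Keep only the non-numeric columns and describe them:
# the result has the rows count, unique, top and freq.
categorical_summary = sample.select_dtypes(exclude="number").describe()
print(categorical_summary)

# Share of trips without a departure delay.
share_on_time = (sample["Departure_Delay_in_Mins"] == 0).mean()

# A high quantile exposes the long tail better than the mean does.
p90_delay = sample["Departure_Delay_in_Mins"].quantile(0.9)
print(share_on_time, p90_delay)
```

On the full dataset the same calls would be applied to `ds` instead of `sample`.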

### **3.3 Look for the presence of null values in the dataset**

In [20]:
ds.isnull().sum().sum() #Shows number of null in total

9486

In [21]:
ds.isnull().sum() #Shows number of null in each column

ID                            0
Departure_Delay_in_Mins       0
Arrival_Delay_in_Mins         0
Gender                       77
Seat_Comfort                 61
Seat_Class                    0
Arrival_Time_Convenient    8930
Onboard_Wifi_Service         30
Ease_of_Online_Booking       73
Baggage_Handling            142
Legroom                      90
CheckIn_Service              77
Cleanliness                   6
Overall_Experience            0
dtype: int64

**Observations:**
* There are 9486 null values in total, of which 8930 are in `Arrival_Time_Convenient`.
* The columns without null values are `ID`, `Departure_Delay_in_Mins`, `Arrival_Delay_in_Mins`, `Seat_Class` and `Overall_Experience`.
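
Absolute counts are easier to judge as a share of the rows. In the output above, `Arrival_Time_Convenient` is missing in 8930 of 94379 rows, roughly 9.5%. The sketch below shows one way to compute such shares. `sample` is a hypothetical miniature frame, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for `ds` with a few missing answers.
sample = pd.DataFrame({
    "Seat_Class": ["Green Car", "Ordinary", "Ordinary", "Ordinary"],
    "Arrival_Time_Convenient": ["Excellent", None, "Good", None],
    "Cleanliness": ["Good", "Good", None, "Excellent"],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Number of rows that have at least one missing value.
rows_with_missing = int(sample.isnull().any(axis=1).sum())
print(rows_with_missing)
```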

## **4. Preprocessing** 
### **4.1 Remove the insignificant parameters (e.g. ID)**
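
A minimal sketch of this step, assuming `ID` is the only column to remove. `sample` and `features` are hypothetical names used for illustration, and the values are taken from the head output.

```python
import pandas as pd

# Hypothetical stand-in for `ds`.
sample = pd.DataFrame({
    "ID": [98800001, 98800002, 98800003],
    "Departure_Delay_in_Mins": [0, 9, 77],
    "Overall_Experience": [0, 0, 1],
})

# An identifier is unique per row, so a model cannot generalise from it.
print(sample["ID"].is_unique)

# drop() returns a new frame; `sample` itself is left unchanged.
features = sample.drop(columns=["ID"])
print(features.columns.tolist())
```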

### **4.2 Encode the categorical object variables in both the train & test set**
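
One possible encoding is sketched below. It rests on assumptions the notebook does not confirm: the rating labels are treated as ordered from `Poor` to `Excellent`, only the labels visible in the output above are mapped, and `Gender` and `Seat_Class` are treated as binary. The names `RATING_MAP`, `BINARY_MAPS` and `encode_features` are hypothetical. Using fixed dictionaries means the training and test sets receive identical codes.

```python
import pandas as pd

# Assumed ordering of the rating labels, worst to best.
RATING_ORDER = ["Poor", "Needs Improvement", "Acceptable", "Good", "Excellent"]
RATING_MAP = {label: rank for rank, label in enumerate(RATING_ORDER)}

# Assumed binary codes for the two nominal columns.
BINARY_MAPS = {
    "Gender": {"Female": 0, "Male": 1},
    "Seat_Class": {"Ordinary": 0, "Green Car": 1},
}

def encode_features(df, rating_cols):
    """Return a copy of df with rating and binary columns mapped to numbers."""
    out = df.copy()
    for col in rating_cols:
        # Labels absent from RATING_MAP, including missing values, become NaN.
        out[col] = out[col].map(RATING_MAP)
    for col, mapping in BINARY_MAPS.items():
        out[col] = out[col].map(mapping)
    return out

# Hypothetical stand-in for `ds`.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Seat_Class": ["Green Car", "Ordinary", "Ordinary"],
    "Seat_Comfort": ["Needs Improvement", "Poor", "Acceptable"],
    "Cleanliness": ["Needs Improvement", "Good", None],
})

encoded = encode_features(sample, rating_cols=["Seat_Comfort", "Cleanliness"])
print(encoded)
```

Missing answers stay missing after encoding, so they still need to be imputed or dropped before a model is fitted.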

### **4.3 Separate the dataset into Training and Testing data**
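
A minimal sketch of a split using only pandas. The 80/20 ratio and the seed `42` are assumptions, not values from the assignment, and `sample` is a hypothetical stand-in for the preprocessed `ds`. scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd

# Hypothetical stand-in for the preprocessed data (values from the head and tail outputs).
sample = pd.DataFrame({
    "Departure_Delay_in_Mins": [0, 9, 77, 13, 0, 83, 5, 0, 0, 28],
    "Overall_Experience": [0, 0, 1, 0, 1, 0, 1, 1, 0, 0],
})

# Draw 80% of the rows at random; random_state makes the split reproducible.
train = sample.sample(frac=0.8, random_state=42)

# Every row that was not drawn for training goes to the test set.
test = sample.drop(train.index)

# Separate the features from the target column.
X_train = train.drop(columns=["Overall_Experience"])
y_train = train["Overall_Experience"]
X_test = test.drop(columns=["Overall_Experience"])
y_test = test["Overall_Experience"]

print(len(train), len(test))
```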