# POC - AI Pool 2023 - Day 01 - Data Science

## Introduction

#### Data Science & Data scientist

Before going further in this subject, let's start with a short definition of Data Science: it is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and to apply that knowledge across a broad range of application domains.

A Data Scientist is often seen as a jack-of-all-trades who does everything from fetching the data to putting a machine learning model into production.
In reality, each part related to AI and data has its own job: the Data Miner fetches the data, the Machine Learning Engineer builds machine learning models and the MLOps engineer deploys those models.

Another way to see the Data Scientist (which I prefer) is as the one who knows how to handle all the work related to data: data mining, exploration, interpretation, visualization and processing.

We will not go into the details of each job in AI, but if you want to know more I advise you to read [this great book](https://huyenchip.com/ml-interviews-book/contents/chapter-1.-ml-jobs.html) written by _Chip Huyen_, who explains each job in every part of AI.

#### What you will see in this subject

In this subject you will discover a few basics of Data Science: how to manipulate data, explore it, visualize it and interpret it.\
Finally, you will learn how to use a machine learning model with the `sklearn` library.

If you have any questions, don't hesitate to ask other candidates or one of the supervisors.\
Good luck and have fun.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## I - Data Exploration

Before manipulating our data or even interpreting it, we need to explore it to know what type of data we have and what it means.\
So let's start by exploring our data using the `pandas` and `seaborn` libraries.

### I-I Reading a csv

We have a csv at our disposal (`./data/train.csv`) that we want to explore. The first step is to find out what data it contains.

**Tasks:**
* Using pandas, open `./data/train.csv`
* Find what columns our csv contains (name, type and number of values)
* Find the shape of our DataFrame
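
As a minimal sketch of these steps, the example below reads a tiny inline CSV instead of `./data/train.csv`. The rows and the `toy_df` name are made up for illustration; `pd.read_csv`, `DataFrame.info` and `DataFrame.shape` are used the same way with a real file path.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for ./data/train.csv
csv_text = """PassengerId,HomePlanet,Age,VIP
0001_01,Europa,39.0,False
0002_01,Earth,24.0,False
0003_01,Europa,58.0,True
"""

# read_csv accepts a path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

toy_df.info()        # column names, types and non-null counts
print(toy_df.shape)  # (number of rows, number of columns)
```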

### I-II Set indexes

Nice! We now have a better understanding of our data. It seems we are facing the `spaceship titanic` dataset, which lists each passenger who was on board the spaceship Titanic.\
Our goal is to explore this dataset and finally to create a simple machine learning model that predicts whether a passenger was transported, using the passenger's information.

To give you a better understanding of our data, here is a description of each column:
* **PassengerId**: ID of the passenger.
* **HomePlanet**: Planet the passenger departed from.
* **CryoSleep**: Whether the passenger was put in suspended animation for the duration of the trip.
* **Cabin**: Cabin number of the passenger.
* **Destination**: The planet the passenger will be debarking to.
* **Age**: The age of the passenger.
* **VIP**: Whether the passenger paid for special VIP service.
* **RoomService / FoodCourt / ShoppingMall / Spa / VRDeck**: Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* **Name**: The first and last names of the passenger.
* **Transported**: Whether the passenger was transported to another dimension.

Using the information above, we can see that the `PassengerId` column is just full of identifiers, one per passenger.\
Before going further, let's specify that we will use the `PassengerId` column as the index.

**Tasks:**
* Set the DataFrame index using the `PassengerId` column.
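
To illustrate what setting an index changes, here is a small sketch on a made-up DataFrame (the values and the `toy_df` / `indexed_df` names are assumptions, not the real dataset). Once a column is the index, rows can be looked up by its values with `.loc`.

```python
import pandas as pd

# Hypothetical rows, only here to show the mechanics
toy_df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01"],
    "Age": [39.0, 24.0, 58.0],
})

# set_index returns a new DataFrame whose row labels are the column's values
indexed_df = toy_df.set_index("PassengerId")

# Rows can now be selected by passenger identifier
print(indexed_df.loc["0002_01", "Age"])
```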

Good! Now we can start.

### I-III Cleaning dataset

One of the main issues in Data Science is missing values. Look at the information you have about your columns and ask yourself which column could be a problem and should be dropped.
If you said `Cabin`, you are right! (If you said `Age`, remember what our final goal is in this subject.)

(In reality there are techniques to deal with missing values, but to simplify this subject we will not cover them.)

Indeed, the `Cabin` column is missing so many values that it is useless, so we prefer to drop it.\
We can also see that values are missing in the `Age` and `HomePlanet` columns; to simplify the next steps, we also decide to drop every row containing one or more missing values.

**Tasks:**
* Drop the `Cabin` column
* Drop every row with one or more missing values
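
The sketch below shows one possible way to count and remove missing values, on a made-up DataFrame (the values are assumptions chosen so that some cells are empty). It relies on `isna`, `drop` and `dropna` from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing cells
toy_df = pd.DataFrame({
    "Cabin": [np.nan, np.nan, "F/1/S"],
    "Age": [39.0, np.nan, 58.0],
    "HomePlanet": ["Europa", "Earth", "Europa"],
})

# Number of missing values in each column
print(toy_df.isna().sum())

# Remove a column that is too incomplete to be useful
cleaned_df = toy_df.drop(columns=["Cabin"])

# Remove every row that still contains at least one missing value
cleaned_df = cleaned_df.dropna()
```

Both `drop` and `dropna` return a new DataFrame by default, which is why the result is assigned back to a variable.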

### I-IV Basic data exploration

Now that we are sure we no longer have missing values, we can go further.

As we can see, our csv contains numeric and alphanumeric values. Both can be explored, but to start we will focus only on the numeric values.\
A good start would be to look at the distribution of each column.

**Tasks:**
* Find the mean value for each numerical column
* Find the standard deviation (std) for each numerical column
* Find the min value for each numerical column
* Find the 25th percentile (lower quartile) for each numerical column
* Find the median for each numerical column
* Find the 75th percentile (upper quartile) for each numerical column
* Find the max value for each numerical column
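
All of these statistics can be obtained in one call. As a hedged sketch on made-up numbers (the `toy_df` values are assumptions), `DataFrame.describe` returns the count, mean, std, min, quartiles and max of each numerical column.

```python
import pandas as pd

# Hypothetical numerical columns
toy_df = pd.DataFrame({
    "Age": [10.0, 20.0, 30.0, 40.0],
    "Spa": [0.0, 0.0, 100.0, 300.0],
})

# One row per statistic, one column per numerical column
summary = toy_df.describe()
print(summary)

# Individual statistics are also available as separate methods
print(toy_df["Age"].median())
print(toy_df["Age"].quantile(0.25))
```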

We are starting to see a little more clearly. What can we interpret from this data?

We can see that the average passenger aboard the spaceship Titanic is about 30 years old.
On the other hand, we do not learn much more about the `VIP` column.

Let's continue to learn about the passengers aboard the spaceship Titanic by looking at the number of passengers in the VIP class.

**Tasks:**
* Find how many passengers were in each class (VIP and non-VIP)
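
Counting how often each value appears in a column is what `Series.value_counts` does. A minimal sketch on a made-up boolean column:

```python
import pandas as pd

# Hypothetical VIP flags for four passengers
vip = pd.Series([False, True, False, False], name="VIP")

# Number of occurrences of each distinct value, most frequent first
counts = vip.value_counts()
print(counts)
```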

We can see how the passengers are split between the VIP and non-VIP classes, it changes our vision of the spaceship Titanic ... \
Shall we explore the profile of a passenger in each class?

**Tasks:**
* Find the mean value of the `RoomService` column for each class.
* Find the mean value of the `FoodCourt` column for each class.
* Display the average age of a passenger in each class
* Display the average amount spent on purchases for each class
* Display the rate of passengers transported to another dimension in each class
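
Per-class statistics like these are typically computed with `groupby`. The sketch below uses made-up rows (the values are assumptions); note that the mean of a boolean column is the proportion of `True` values, which gives a rate directly.

```python
import pandas as pd

# Hypothetical passengers
toy_df = pd.DataFrame({
    "VIP": [False, False, True, True],
    "Age": [20.0, 30.0, 40.0, 50.0],
    "RoomService": [0.0, 100.0, 200.0, 400.0],
    "Transported": [True, False, False, False],
})

# Group the rows by class, then average the selected columns in each group
by_class = toy_df.groupby("VIP")[["Age", "RoomService", "Transported"]].mean()
print(by_class)
```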

We can see some very interesting information, such as:
* The "old" population is predominantly in the VIP class.
* The population that spends the most is the VIP class.

Now let's move on to the different planets of embarkation: which one do you think was used the most?

To help you, here is the spaceship Titanic's journey:

<img src="./img/solarsystem.jpg" width="700px" />

**Task:**
* Find how many passengers embarked from each planet

As expected, we can see that the spaceship Titanic took on the most passengers on Earth, followed by Europa, its first stop, and Mars, its second stop.
Let's now see how many passengers of each class embarked from each planet.

**Task:**
* For each class, find how many people embarked from each planet.
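
Counting rows for every combination of two columns can be sketched with `pd.crosstab`. The rows below are made up for illustration and do not reflect the real distribution.

```python
import pandas as pd

# Hypothetical passengers
toy_df = pd.DataFrame({
    "VIP": [False, False, False, True, True],
    "HomePlanet": ["Earth", "Earth", "Mars", "Europa", "Europa"],
})

# Rows: class, columns: planet, cells: number of passengers
table = pd.crosstab(toy_df["VIP"], toy_df["HomePlanet"])
print(table)
```

A `groupby` on both columns followed by `size()` would give the same counts in a long format.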

We notice that for the non-VIP class, the vast majority of the passengers embarked on Earth, whereas for the VIP class, a significant proportion of passengers embarked on Europa.

### I-V Advanced Data Exploration

We're starting to see our data much more clearly, aren't we? \
Now it's time to explore the correlations between our different values, in particular with the transport rate.

So start by displaying a simple correlation table between the numerical values.

**Task:**
* Find and display the correlation between each pair of numerical columns
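
As a sketch of what a correlation table contains, the example below builds made-up columns that are perfectly correlated (`Spending`) and perfectly anti-correlated (`Discount`) with `Age`. The column names and values are assumptions; `DataFrame.corr` computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical data: Spending grows with Age, Discount shrinks with Age
toy_df = pd.DataFrame({
    "Age": [10, 20, 30, 40],
    "Spending": [20, 40, 60, 80],
    "Discount": [4, 3, 2, 1],
    "Name": ["a", "b", "c", "d"],
})

# Keep only the numerical columns, then compute pairwise correlations
numeric_df = toy_df.select_dtypes(include="number")
corr = numeric_df.corr()
print(corr)
```

Values close to $1$ or $-1$ indicate a strong linear relationship, values close to $0$ a weak one. The same table can be passed to `sns.heatmap(corr, annot=True)` to color it.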

We can already interpret a lot of information but before taking a look I suggest that we add some colors.

**Task:**
* Display a heatmap showing the correlation between each pair of numerical columns

Isn't it more pleasant to read? What can we interpret from this graph about whether or not a passenger was transported?

We can see that the class of the passenger was a factor that had a great influence on the passenger's transport rate.
We can also see a semblance of correlation between age and whether a passenger was transported, so let's try to find out more.

**Task:**
* Show the relationship between age and whether or not a passenger was transported using a histogram.

Well, we are sure that there is a correlation between age and being transported. \
You know what you have to do...

**Task:**
* Show if there is a link between a passenger's age and whether or not they were transported

Now that we have explored different correlations, we can prepare our data for our model.

Our model only accepts numerical values, so what do we do about the `Transported` column?
We just have to convert it into a numerical value.

We will also try to show the correlation between age and the fact that a passenger was transported (we consider a passenger aged five or less to be a child).


**Tasks:**
* Create a new column named `Child` and fill it in (remember, we consider a passenger who is under 6 years old to be a child)
* Convert the `Transported`, `VIP` and `HomePlanet` columns into numeric columns.
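
Here is one possible way to build such columns, sketched on made-up rows (the values are assumptions). A comparison produces booleans, `astype(int)` turns booleans into $0$/$1$, and `pd.factorize` assigns an integer code to each distinct label.

```python
import pandas as pd

# Hypothetical passengers
toy_df = pd.DataFrame({
    "Age": [3.0, 30.0, 5.0],
    "Transported": [True, False, True],
    "HomePlanet": ["Earth", "Mars", "Earth"],
})

# 1 if the passenger is under 6 years old, 0 otherwise
toy_df["Child"] = (toy_df["Age"] < 6).astype(int)

# Booleans become 0 (False) or 1 (True)
toy_df["Transported"] = toy_df["Transported"].astype(int)

# Each distinct planet gets an integer code, in order of first appearance
codes, planets = pd.factorize(toy_df["HomePlanet"])
toy_df["HomePlanet"] = codes
print(toy_df)
```

Integer codes suggest an order between planets that does not really exist; one-hot encoding with `pd.get_dummies` is a common alternative when that matters.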

Our data is ready. Before we create the model, let's take one last look at the correlations in our data to help us decide which columns might be useful.

**Tasks:**
* Using a heatmap, show the correlation between all the numerical columns.
* Using the `groupby` method of pandas, show the relationship between `Age` and `Transported`.
* Using the `groupby` method of pandas, show the relationship between `Child` and `Transported`.

## II - Machine learning

So far we have taken the time to:
* Explore the data
* View the data
* Correlate the data
* Interpret the data

It's a good start, don't you think?

Now let's get down to business (_add a drumbeat_): machine learning ("_tin tin tin_").\
For now we're not going to go into too much detail on how to create our models ourselves; we'll just use the `sklearn` library, which will do most of the work for us.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

### II-I Data

Before creating our model (I promise this is the last preparation step) we must create a testing set and a training set ("_Set what?_" said a student in the distance).\
To understand what a test set is and why it is necessary, it is best to go over what machine learning is, so let's start with a short definition.

<ins>Machine learning</ins>: Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence.

There are two things to remember from this definition:
- "_computer algorithms that can improve automatically_": In machine learning, we do not directly create the solution but an algorithm that will adjust "automatically" until potentially reaching the desired result.
- "_can improve automatically through experience and by the use of data._": Our model learns thanks to data, so the model is not at the center of our attention, it is first and foremost our data that is.

A machine learning model will adjust to meet a single criterion: bringing the _cost_ (also called the _loss_) closer to zero.\
As a reminder, the loss function is a function which, from a prediction and the labels, indicates how wrong the model is: the closer the loss is to zero, the better.

To illustrate these remarks, I suggest that we take a look at the cost function named MSE (mean squared error).\
<img src="https://www.gstatic.com/education/formulas2/355397047/en/mean_squared_error.svg"/>

Here $\hat{Y}_i$ is the model's prediction for data item number $i$ and $Y_i$ is the expected result for that same item.\
We square the difference between the two for each item, sum the squares over the items numbered from $1$ to $n$, and take the average by dividing the sum by $n$.

We thus obtain the average squared difference between the predictions of the model and the expected results: this is our cost.
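
To make the computation concrete, here is a minimal sketch of the MSE in plain Python. The function name `compute_mse` and the numbers are made up for illustration.

```python
def compute_mse(expected, predicted):
    """Average of the squared differences between expected and predicted values."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(expected, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical expected results and model predictions
expected = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

# Squared errors are 1, 0 and 4, so the cost is 5 / 3
mse = compute_mse(expected, predicted)
print(mse)
```

Squaring makes every error positive and penalizes large errors more than small ones.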

The loss is useful for checking that a model is learning: it suffices to verify that the cost decreases as the model learns. On the other hand, if I show you a cost of $100$, it's hard to know whether it's good or not. That's where the accuracy comes in: it's the percentage of times the model has found the right result.\
An accuracy of $50\%$ would mean that our model is wrong every other time, $90\%$ that it is wrong once in 10, etc ...
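
As a small illustration with made-up labels and predictions, the accuracy is simply the number of correct predictions divided by the total number of predictions.

```python
# Hypothetical expected labels and model predictions
labels = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]

# Count the positions where the prediction matches the label
correct = sum(p == y for p, y in zip(predictions, labels))

# 3 correct predictions out of 4
accuracy = correct / len(labels)
print(accuracy)
```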

On the other hand, we cannot always compute an accuracy. Take for example a model which aims to predict the exact speed of a car.\
It predicted $121.5$ km/h and the car was going at $119$ km/h: you can't say that your model is "right" or "wrong". You would rather say that it was off by $2.5$ km/h (which is a loss).


"_And where do the testing and training sets come into all this?_" exclaims the impatient one.\
To summarize, our model learns from the data we give it and tries to reduce the cost computed from its predictions and the expected results. But how do we find out how our model behaves on data that it has never seen? We create a test set, a set of data that our model has never seen, and evaluate the model on it ...

Our training set is the data that our model uses to train; our test set is data that our model has never seen, which we use to find out how it behaves on new data.
To be precise, there is even a third set called the validation set, but we will not discuss it for the moment.

Here, as we only have one csv, we will have to divide it into two sets (training and testing). \
Did you understand everything? Perfect! Enough explanation, let's take action!

**Tasks:**
* Create a dataframe named `train_df` containing 80% of our data
* Create a dataframe named `test_df` containing 20% of our data
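
One possible way to split a DataFrame, sketched on made-up rows: draw 80% of the rows at random with `sample`, then keep the remaining rows as the test set. The `random_state=0` seed is an arbitrary assumption that only makes the split reproducible.

```python
import pandas as pd

# Hypothetical dataset of 10 rows
toy_df = pd.DataFrame({"Age": range(10), "Transported": [0, 1] * 5})

# Randomly pick 80% of the rows for training
train_df = toy_df.sample(frac=0.8, random_state=0)

# The rows that were not picked form the test set
test_df = toy_df.drop(train_df.index)

print(len(train_df), len(test_df))
```

`sklearn.model_selection.train_test_split` offers the same kind of split in a single call.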

Now that we have our sets, it is time to choose the data we will use to train our model.
To begin with, we recommend using the columns `VIP`, `Age` and `Child` but you are free to change this selection.

**Task:**
* Select the columns that you think will be useful in predicting whether a passenger has been transported.

In [None]:
columns = ['TODO']

We will **FINALLY** be able to move on to the buzzword: machine learning.

To make our first prediction we will use an extremely simple model that some of you may have already seen or used: linear regression.\
The principle of a linear regression is to draw a line in $N$ dimensions, where $N$ represents the number of values that we give to our model.

To illustrate, here is how a linear regression learns on two-dimensional data that is linear: \
![LiRegURL](https://miro.medium.com/max/700/1*CjTBNFUEI_IokEOXJ00zKw.gif "Linear regression")

This algorithm is quick and easy to set up but only works well if the data is linear (that is, if it follows an equation of the form $y = b_0 + b_1x$).\
Is ours? Let's try and we'll see.

**Task:**
* Train a linear regression model on your training set and test it on your test set
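
The sketch below shows the general `fit` / `score` workflow of `sklearn` on made-up data that follows $y = 1 + 2x$ exactly (the arrays are assumptions, not the real dataset). For a regression model, `score` returns the coefficient of determination $R^2$, not an accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical perfectly linear data: y = 1 + 2x
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2 * X_train[:, 0] + 1

X_test = np.array([[4.0], [5.0]])
y_test = np.array([9.0, 11.0])

# Learn b_0 (intercept_) and b_1 (coef_) from the training set
model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on data the model has never seen (1.0 is a perfect fit)
score = model.score(X_test, y_test)
print(model.intercept_, model.coef_, score)
```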

If you get poor results (a score below $0.23$), don't be surprised.\
Our data is obviously not linear (no surprise there); you can check by executing the code below:

In [None]:
plt.scatter(file.Age, file.Transported)

An algorithm that might be more promising is logistic regression, which tries to fit the following formula:
## $\frac{1}{1 + e^{-(b_0 + b_1x)}}$
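
To get a feel for this formula, here is a minimal sketch that evaluates it in plain Python. The function name `sigmoid` and the default values of `b0` and `b1` are assumptions chosen for illustration.

```python
import math


def sigmoid(x, b0=0.0, b1=1.0):
    """Evaluate 1 / (1 + e^-(b0 + b1 * x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# The output always lies between 0 and 1
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x))
```

Because the output lies between $0$ and $1$, it can be read as the probability of belonging to the positive class, which suits a yes/no target such as `Transported`.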

Let's see what it looks like!

**Task:**
* Train a logistic regression model on your training set, test it on your test set and display your score

You should have much better results (over $0.70$).
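
For reference, here is a hedged sketch of the same workflow on made-up, easily separable data (the arrays are assumptions, not the real dataset). For a classifier, `score` returns the accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: small values belong to class 0, large values to class 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_test = np.array([[1.5], [11.5]])
y_test = np.array([0, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# Probability of class 1 for each test item, then the accuracy
probabilities = model.predict_proba(X_test)[:, 1]
accuracy = model.score(X_test, y_test)
print(probabilities, accuracy)
```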

To conclude, let's try another kind of algorithm: a random forest, which combines many decision trees.\
We will not detail how it works here, but we strongly encourage you to read up on it.

**Task:**
* Train a random forest classifier on your training set and test it on your test set
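
The interface is the same as for the previous models. The sketch below uses made-up data (the arrays are assumptions); `n_estimators=50` and `random_state=0` are arbitrary choices, the latter only to make the run reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: small values belong to class 0, large values to class 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_test = np.array([[1.5], [11.5]])
y_test = np.array([0, 1])

# 50 decision trees, each trained on a random resampling of the training set
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Accuracy on data the forest has never seen
accuracy = forest.score(X_test, y_test)
print(forest.predict(X_test), accuracy)
```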

Congratulations! You have quickly discovered the basics of data science and used your first machine learning models, I am impressed.

## III - It's your turn!

To conclude this subject, we have a challenge for you. Go to [this website](https://www.kaggle.com/competitions/spaceship-titanic/) and try to solve the challenge.\
The one with the best results will earn **100 points** on the day!