# Part 2: Basics

## Load Titanic data

* Downloaded from [here](https://github.com/datasciencedojo/datasets/blob/master/titanic.csv).
* [Description on Kaggle](https://www.kaggle.com/c/titanic/data)

In [44]:
# import `pandas` which is one of the main libraries for data analytics in Python:
# website: https://pandas.pydata.org/
import pandas as pd

In [50]:
# load titanic data
# we set the index of the data to `PassengerId`
data_titanic = pd.read_csv("titanic.csv", index_col="PassengerId")
data_titanic

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Predict survival (`Survived`)

In [52]:
# define the target (what we want to predict)
y = data_titanic["Survived"]
y

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

In [53]:
# define the set of input features (the data we want to use to predict survival)
X = data_titanic.drop(columns="Survived")

In [57]:
# import logistic regression (a linear model for classification) from scikit-learn
# scikit-learn is the main library for machine learning in Python
# website: https://scikit-learn.org
from sklearn.linear_model import LogisticRegression

In [56]:
# define our model (many different models are possible; we heard about k-nearest neighbors and decision trees)
model = LogisticRegression()

In [58]:
# try to fit the model
model.fit(X, y)

ValueError: could not convert string to float: 'Braund, Mr. Owen Harris'

Fitting the model failed with the error `ValueError: could not convert string to float: 'Braund, Mr. Owen Harris'`.
This is because `LogisticRegression` can only handle numeric input.
But the Titanic dataset contains many variables that are not numbers, such as `Name`, which contains the string `'Braund, Mr. Owen Harris'`.
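
To see which columns would cause this error, one option is to check each column's dtype. The sketch below does this on a tiny hand-made frame built from three of the rows shown above; `toy` and `non_numeric` are hypothetical names used only for illustration and are not part of the notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Tiny hand-made stand-in for the Titanic table (illustration only)
toy = pd.DataFrame(
    {
        "Pclass": [3, 3, 3],
        "Name": [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Allen, Mr. William Henry",
        ],
        "Sex": ["male", "female", "male"],
        "Age": [22.0, 26.0, 35.0],
    }
)

# Collect the columns whose dtype is not numeric;
# these cannot be passed to the model as-is
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric)  # ['Name', 'Sex']
```

Running the same check on `X` should list the text columns of the real dataset.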

In [62]:
# select columns that are numbers
# For this, we use `numpy`, the 'fundamental package for scientific computing with Python'.
# website: https://numpy.org/
import numpy as np
X.select_dtypes(np.number).columns

Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')

In [63]:
# define a new set of input features that only contains numeric features 
X_numeric = X[X.select_dtypes(np.number).columns]
X_numeric

PassengerId,Pclass,Age,SibSp,Parch,Fare
1,3,22.0,1,0,7.2500
2,1,38.0,1,0,71.2833
3,3,26.0,0,0,7.9250
4,1,35.0,1,0,53.1000
5,3,35.0,0,0,8.0500
...,...,...,...,...,...
887,2,27.0,0,0,13.0000
888,1,19.0,0,0,30.0000
889,3,,1,2,23.4500
890,1,26.0,0,0,30.0000


In [65]:
# try to fit our model again
model.fit(X_numeric, y)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This time, we see the error `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').`
This means that at least one column contains missing values (`NaN`).
Let's have a look at them.

In [69]:
# we see that the column `Age` contains `177` empty values
X_numeric.isna().sum()

Pclass      0
Age       177
SibSp       0
Parch       0
Fare        0
dtype: int64

In [75]:
# for now let's drop all rows with NAs
# IMPORTANT: this loses A LOT of data (177 of 891 rows here), which should only be done as a last resort in practice.
# We will learn about feature imputation, which would be an alternative.
idx_notna = ~ X_numeric["Age"].isna()
X_numeric_notna = X_numeric[idx_notna]
y_notna = y[idx_notna]
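
As a rough preview of what imputation could look like, the sketch below fills missing ages with the median of the observed ages instead of dropping rows. The toy frame, the variable names, and the choice of the median are assumptions made for illustration, not necessarily the approach used later.

```python
import numpy as np
import pandas as pd

# Tiny hand-made stand-in for X_numeric with one missing age (illustration only)
toy = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 35.0], "Fare": [7.25, 71.2833, 23.45, 53.1]}
)

# Option used above: drop rows with a missing age (loses one of four rows here)
dropped = toy[toy["Age"].notna()]

# Alternative: replace missing ages with the median of the observed ages
imputed = toy.fillna({"Age": toy["Age"].median()})

print(len(dropped), len(imputed))  # 3 4
print(imputed["Age"].tolist())     # [22.0, 38.0, 35.0, 35.0]
```

The imputed frame keeps every row, at the cost of replacing unknown ages with a guess.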

In [78]:
# fit the model AGAIN
model.fit(X_numeric_notna, y_notna)

LogisticRegression()

In [80]:
# predict
model.predict(X_numeric_notna)

array([0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,