# Data preprocessing

Let's start by importing the basic modules and libraries. Make sure you have ```scikit-learn``` installed.

In [229]:
import sys
import os
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
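# Compare numerically; comparing version strings directly would be lexicographic.
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)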

Next, we download the dataset from the famous "Titanic - Machine Learning from Disaster" competition, available at the link below. <br/>
https://www.kaggle.com/c/titanic/data

Kaggle is an important site in the world of data science and machine learning. It hosts a wealth of datasets, practical micro-courses, notebooks and, what it is best known for, competitions (including the upcoming local BIT AI ones ;) ). If you have not done so yet, I warmly encourage you to create an account.

We unpack the data into a folder of our choice and then load it into data frames. The code below assumes the files are in a `data` folder next to this notebook; change `datapath` if yours are elsewhere.

In [230]:
datapath = 'data' #change accordingly

def load_data(filename):
    csv_path = os.path.join(datapath, filename)
    return pd.read_csv(csv_path)

In [231]:
%%time
train_data = load_data('titanic_train.csv')
test_data = load_data('titanic_test.csv')

CPU times: user 9.41 ms, sys: 0 ns, total: 9.41 ms
Wall time: 9.13 ms


In [232]:
?pd.read_csv

In [233]:
??pd.read_csv

## Basic data analysis

In [234]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [235]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [236]:
train_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [237]:
%matplotlib notebook
train_data.hist(figsize=(8,8))
plt.show()


In [238]:
num_train_data = train_data.select_dtypes(exclude=['object'])
cat_train_X = train_data.select_dtypes(include=['object'])

In [303]:
num_train_X, y = num_train_data.drop('Survived', axis=1), num_train_data['Survived']

## Eliminating "useless" variables

In [240]:
corr_matrix = num_train_data.corr()  # numeric columns only; recent pandas raises on text columns
corr_matrix['Survived'].sort_values(ascending=False)

Survived       1.000000
Fare           0.257307
Parch          0.081629
PassengerId   -0.005007
SibSp         -0.035322
Age           -0.077221
Pclass        -0.338481
Name: Survived, dtype: float64
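The correlation matrix only covers numeric columns, so a text column such as `Sex` never appears in it. A minimal sketch of how the survival rate per category could be inspected with `groupby` before deciding which columns to drop; the rows below are made up for illustration and are not taken from the Titanic data:

```python
import pandas as pd

# Toy rows invented for illustration; not taken from the Titanic dataset.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# The mean of a 0/1 target within a group is the survival rate of that group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

A large gap between groups suggests the column carries signal even though it has no entry in `corr_matrix`.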

In [241]:
num_train_X = num_train_X.drop('PassengerId', axis=1)
cat_train_X = cat_train_X.drop('Name', axis=1)

## The missing values problem

In [242]:
cat_train_X = cat_train_X.drop('Cabin', axis=1)
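`Cabin` is a natural candidate for dropping: according to the `info()` output, only 204 of 891 rows have a value. One way to spot such gaps at a glance is the fraction of missing values per column. The sketch below uses a small made-up frame, so the values are assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() gives booleans; their mean is the share of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```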

In [243]:
from sklearn.impute import SimpleImputer

num_simple_imputer = SimpleImputer(strategy='median')
num_simple_imputer.fit(num_train_X)
simple_imputed_num_train_X = num_simple_imputer.transform(num_train_X)

In [244]:
simple_num_train_X = pd.DataFrame(simple_imputed_num_train_X,
                              columns=num_train_X.columns,
                              index=num_train_X.index)
simple_num_train_X

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
0,3.0,22.0,1.0,0.0,7.2500
1,1.0,38.0,1.0,0.0,71.2833
2,3.0,26.0,0.0,0.0,7.9250
3,1.0,35.0,1.0,0.0,53.1000
4,3.0,35.0,0.0,0.0,8.0500
...,...,...,...,...,...
886,2.0,27.0,0.0,0.0,13.0000
887,1.0,19.0,0.0,0.0,30.0000
888,3.0,28.0,1.0,2.0,23.4500
889,1.0,26.0,0.0,0.0,30.0000
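To make the effect of `SimpleImputer(strategy='median')` concrete, here is a minimal sketch on a tiny hand-made array (the numbers are invented). After fitting, the per-column medians are available in the `statistics_` attribute and are reused for every later `transform` call:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two columns, each with one missing value.
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan],
                   [5.0, 60.0]])

demo_imputer = SimpleImputer(strategy="median")
X_filled = demo_imputer.fit_transform(X_demo)

print(demo_imputer.statistics_)  # medians learned during fit
print(X_filled)                  # NaNs replaced by those medians
```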


In [245]:
cat_imputer = SimpleImputer(strategy='most_frequent')
cat_imputer.fit(cat_train_X)
imputed_cat_train_X = cat_imputer.transform(cat_train_X)

In [246]:
cat_train_X = pd.DataFrame(imputed_cat_train_X,
                              columns=cat_train_X.columns,
                              index=cat_train_X.index)
cat_train_X

Unnamed: 0,Sex,Ticket,Embarked
0,male,A/5 21171,S
1,female,PC 17599,C
2,female,STON/O2. 3101282,S
3,female,113803,S
4,male,373450,S
...,...,...,...
886,male,211536,S
887,female,112053,S
888,female,W./C. 6607,S
889,male,111369,C


---
<h2><span style="color:orange">Bonus</span></h2>
We used traditional imputation. Below are plots of the observations that have no age. Do you see any patterns? Try multivariate imputation. Does it improve the result of our model? <br/>

*Tip:* `IterativeImputer`

In [247]:
nan_age_df = train_data[train_data['Age'].isna()]
nan_age_df.hist(figsize=(8,8))
plt.show()


In [248]:
# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer
# now you can import normally from sklearn.impute
from sklearn.impute import IterativeImputer

num_iterative_imputer = IterativeImputer()
num_iterative_imputer.fit(num_train_X)
iterative_imputed_num_train_X = num_iterative_imputer.transform(num_train_X)

iterative_num_train_X = pd.DataFrame(iterative_imputed_num_train_X,
                              columns=num_train_X.columns,
                              index=num_train_X.index)

simple_num_train_X.compare(iterative_num_train_X)
# differences appear
# a few negative age values among IterativeImputer's predictions
# does not work for categorical data

Unnamed: 0_level_0,Age,Age
Unnamed: 0_level_1,self,other
5,28.0,27.617131
17,28.0,34.111643
19,28.0,27.639522
26,28.0,27.639522
28,28.0,27.627645
...,...,...
859,28.0,27.639446
863,28.0,-6.156028
868,28.0,27.598218
878,28.0,27.627343
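The negative age in row 863 comes from the regression model extrapolating without limits. `IterativeImputer` accepts `min_value` and `max_value` bounds. The sketch below shows one possible way to keep imputed values non-negative; it runs on synthetic data, and the age/fare relationship in it is invented:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: "age" loosely depends on "fare" (an invented relationship).
rng = np.random.default_rng(0)
fare = rng.uniform(5, 100, size=50)
age = np.clip(0.5 * fare + rng.normal(0, 5, size=50), 1, None)
X_demo = np.column_stack([age, fare])
X_demo[::5, 0] = np.nan  # hide every fifth age

# min_value clips the imputed values from below.
bounded_imputer = IterativeImputer(min_value=0, random_state=0)
X_filled = bounded_imputer.fit_transform(X_demo)
print(X_filled[::5, 0])  # the imputed ages
```

Whether clipping is preferable to simply falling back to the median is a judgment call worth checking with cross-validation.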


In [249]:
# num_train_X = simple_num_train_X
# num_imputer = num_simple_imputer

num_train_X = iterative_num_train_X
num_imputer = num_iterative_imputer

# the accuracies differ only minimally

---

## Categorical variables

In [250]:
cat_train_X['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [251]:
cat_train_X['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [252]:
cat_train_X['Ticket'].value_counts()

1601            7
347082          7
CA. 2343        7
3101295         6
347088          6
               ..
SC 1748         1
C.A. 17248      1
F.C.C. 13528    1
A./5. 2152      1
364846          1
Name: Ticket, Length: 681, dtype: int64

In [253]:
cat_train_X_se = cat_train_X.drop('Ticket', axis=1)

In [254]:
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse=False)  # named sparse_output in scikit-learn >= 1.2
one_hot_encoder.fit(cat_train_X_se)
one_hot_encoder.categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

In [255]:
cat_train_X_se = one_hot_encoder.transform(cat_train_X_se)
cat_train_X_se = pd.DataFrame(cat_train_X_se,
                              columns=['Female', 'Male', 'C', 'Q', 'S'])
cat_train_X_se

Unnamed: 0,Female,Male,C,Q,S
0,0.0,1.0,0.0,0.0,1.0
1,1.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,0.0,1.0
4,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...
886,0.0,1.0,0.0,0.0,1.0
887,1.0,0.0,0.0,0.0,1.0
888,1.0,0.0,0.0,0.0,1.0
889,0.0,1.0,1.0,0.0,0.0
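The column names above are typed by hand, which silently breaks if the category order ever changes. A minimal sketch of deriving them from the fitted encoder's `categories_` instead; the toy frame and the `column_category` naming scheme are assumptions made for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame invented for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# handle_unknown="ignore" encodes unseen categories as all zeros at transform time.
demo_encoder = OneHotEncoder(handle_unknown="ignore")
encoded = demo_encoder.fit_transform(toy).toarray()  # default output is sparse

# Build names in the same order the encoder emits its columns.
column_names = [f"{col}_{cat}"
                for col, cats in zip(toy.columns, demo_encoder.categories_)
                for cat in cats]
encoded_df = pd.DataFrame(encoded, columns=column_names, index=toy.index)
print(encoded_df)
```

Passing `index=toy.index` also keeps the rows aligned when the result is later concatenated with other frames.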


---
<h2><span style="color:orange">Bonus</span></h2>


We skipped the `Ticket` variable, which may be important. Try to encode it using hashing or binary encoding.
Doing it yourself is a big plus, but you may use the library http://contrib.scikit-learn.org/category_encoders/.

---

In [256]:
cat_train_X_t = cat_train_X.drop(['Sex', 'Embarked'], axis=1)
cat_train_X_t

Unnamed: 0,Ticket
0,A/5 21171
1,PC 17599
2,STON/O2. 3101282
3,113803
4,373450
...,...
886,211536
887,112053
888,W./C. 6607
889,111369


In [257]:
from category_encoders.hashing import HashingEncoder

hashing_encoder = HashingEncoder(cols=['Ticket'])
hashing_encoder.fit(cat_train_X_t)

cat_train_X_t = hashing_encoder.transform(cat_train_X_t)
cat_train_X_t.info()
cat_train_X_t

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_0   891 non-null    int64
 1   col_1   891 non-null    int64
 2   col_2   891 non-null    int64
 3   col_3   891 non-null    int64
 4   col_4   891 non-null    int64
 5   col_5   891 non-null    int64
 6   col_6   891 non-null    int64
 7   col_7   891 non-null    int64
dtypes: int64(8)
memory usage: 55.8 KB


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,1
2,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1
4,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...
886,0,0,0,0,0,0,1,0
887,0,0,0,0,1,0,0,0
888,0,1,0,0,0,0,0,0
889,0,0,0,0,1,0,0,0
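For the "do it yourself" variant, the idea of hashing encoding fits in a few lines: hash each value, take the result modulo the number of buckets, and set the matching column to 1. The sketch below is a hand-rolled illustration with hypothetical helper names (`hash_bucket`, `hash_encode`); it is not meant to reproduce the bucket assignments of `category_encoders`:

```python
import hashlib

import pandas as pd


def hash_bucket(value, n_buckets=8):
    """Map a value to a stable bucket number in [0, n_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(series, n_buckets=8):
    """Return one indicator column per bucket for the given series."""
    buckets = series.map(lambda v: hash_bucket(v, n_buckets))
    out = pd.DataFrame(0, index=series.index,
                       columns=[f"col_{i}" for i in range(n_buckets)])
    for idx, bucket in buckets.items():
        out.loc[idx, f"col_{bucket}"] = 1
    return out


tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "PC 17599"])
encoded_tickets = hash_encode(tickets)
print(encoded_tickets)
```

Equal tickets always land in the same bucket, while different tickets may collide. That is the price paid for squeezing 681 distinct values into 8 columns.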


In [258]:
# combine the categorical features
cat_train_X = pd.concat([cat_train_X_se, cat_train_X_t], axis=1)
cat_train_X

Unnamed: 0,Female,Male,C,Q,S,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,0.0,1.0,0.0,0.0,1.0,0,0,1,0,0,0,0,0
1,1.0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,1
2,1.0,0.0,0.0,0.0,1.0,0,0,0,0,1,0,0,0
3,1.0,0.0,0.0,0.0,1.0,0,0,0,0,0,0,0,1
4,0.0,1.0,0.0,0.0,1.0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0.0,1.0,0.0,0.0,1.0,0,0,0,0,0,0,1,0
887,1.0,0.0,0.0,0.0,1.0,0,0,0,0,1,0,0,0
888,1.0,0.0,0.0,0.0,1.0,0,1,0,0,0,0,0,0
889,0.0,1.0,1.0,0.0,0.0,0,0,0,0,1,0,0,0


## Scaling

In [259]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(num_train_X)
scaled_num_train_X = scaler.transform(num_train_X)
num_train_X = pd.DataFrame(scaled_num_train_X,
                           columns=num_train_X.columns)

In [260]:
num_train_X.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0
mean,0.654321,0.411444,0.065376,0.063599,0.062858
std,0.418036,0.157946,0.137843,0.134343,0.096995
min,0.0,0.0,0.0,0.0,0.0
25%,0.5,0.326803,0.0,0.0,0.01544
50%,1.0,0.392152,0.0,0.0,0.028213
75%,1.0,0.489299,0.125,0.0,0.060508
max,1.0,1.0,1.0,1.0,1.0
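Every column now spans exactly 0 to 1 because `MinMaxScaler` applies `(x - min) / (max - min)` per column, using the minimum and maximum seen during `fit`. A small sketch on invented numbers that checks the formula against the scaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values: a class-like column and a fare-like column.
X_demo = np.array([[1.0, 7.25],
                   [2.0, 71.28],
                   [3.0, 512.33]])

# The same transformation written out by hand, column by column.
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(manual, scaled))
```

Because the minimum and maximum come from the training data, test values outside that range end up below 0 or above 1 after `transform`.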


## Discretization

In [261]:
from sklearn.preprocessing import KBinsDiscretizer
?pd.cut
# Use cut when you need to segment and sort data values into bins.
# This function is also useful for going from a continuous variable to a categorical variable.
# For example, cut could convert ages to groups of age ranges.
# Supports binning into an equal number of bins, or a pre-specified array of bins.
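The cell above only imports `KBinsDiscretizer` and opens the help for `pd.cut`. A minimal sketch of how both could be used on an age-like column; the ages, bin edges and labels below are assumptions chosen for illustration:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = pd.Series([2.0, 15.0, 30.0, 45.0, 70.0])  # invented values

# pd.cut with hand-picked edges; intervals are right-inclusive by default.
age_group = pd.cut(ages, bins=[0, 12, 18, 60, 120],
                   labels=["child", "teen", "adult", "senior"])

# KBinsDiscretizer learns the edges itself: here 3 equal-width bins.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bin = discretizer.fit_transform(ages.to_numpy().reshape(-1, 1)).ravel()

print(list(age_group))
print(age_bin)
```

`pd.cut` is convenient when the bins have a domain meaning, while `KBinsDiscretizer` is a fitted transformer and so can be reused on the test data like the imputers and the scaler.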

## Modeling

In [262]:
train_X = pd.concat([num_train_X, cat_train_X], axis=1)
train_X.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Female,Male,C,Q,S,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,1.0,0.326803,0.125,0.0,0.014151,0.0,1.0,0.0,0.0,1.0,0,0,1,0,0,0,0,0
1,0.0,0.512512,0.125,0.0,0.139136,1.0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,1
2,1.0,0.37323,0.0,0.0,0.015469,1.0,0.0,0.0,0.0,1.0,0,0,0,0,1,0,0,0
3,0.0,0.477692,0.125,0.0,0.103644,1.0,0.0,0.0,0.0,1.0,0,0,0,0,0,0,0,1
4,1.0,0.477692,0.0,0.0,0.015713,0.0,1.0,0.0,0.0,1.0,0,0,0,0,1,0,0,0


In [286]:
from sklearn.neighbors import KNeighborsClassifier

# use the optimal parameters found by the grid search:
# {'n_neighbors': 43, 'p': 8, 'weights': 'uniform'}
classifier = KNeighborsClassifier(n_neighbors=43, p=8, weights='uniform')
classifier.fit(train_X, y)

KNeighborsClassifier(n_neighbors=43, p=8)

In [291]:
from sklearn.model_selection import cross_val_score

train_pred = classifier.predict(train_X)
train_scores = cross_val_score(classifier, train_X, y,
                               scoring='accuracy', cv=100)
np.mean(train_scores)

0.7970833333333333
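With `cv=100` on 891 rows, each validation fold holds only about 9 passengers, so single fold scores are very coarse and only their mean is informative. The sketch below, on synthetic data from `make_classification` (not the Titanic set), shows one way to put the accuracy on the training data next to a 5-fold cross-validated estimate:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; sizes and seed are arbitrary choices.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

# Accuracy on the data the model was fitted on (tends to be optimistic).
train_accuracy = knn.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy estimated on held-out folds.
cv_scores = cross_val_score(knn, X_demo, y_demo, scoring="accuracy", cv=5)

print(train_accuracy, cv_scores.mean(), cv_scores.std())
```

Reporting the standard deviation alongside the mean gives a feel for how much the estimate depends on the split.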

## Preprocessing the test data

In [292]:
def preprocess(df, num_imputer, cat_imputer, one_hot_encoder, hashing_encoder, scaler):
    num_df = df.select_dtypes(exclude=['object'])
    cat_df = df.select_dtypes(include=['object'])
    #redundancy removal
    num_df = num_df.drop('PassengerId', axis=1)
    cat_df = cat_df.drop(['Name', 'Cabin'], axis=1)
    #handle missing values
    imputed_num = num_imputer.transform(num_df) #notice that we do NOT fit
    imputed_cat = cat_imputer.transform(cat_df)
    num_df = pd.DataFrame(imputed_num,
                          columns=num_df.columns,
                          index=num_df.index)
    cat_df = pd.DataFrame(imputed_cat,
                          columns=cat_df.columns,
                          index=cat_df.index)
    #encode categorical variables
    cat_df_se = one_hot_encoder.transform(cat_df.drop('Ticket', axis=1))
    cat_df_se = pd.DataFrame(cat_df_se,
                          columns=['Female', 'Male', 'C', 'Q', 'S'])
    cat_df_t = hashing_encoder.transform(cat_df.drop(['Sex', 'Embarked'], axis=1))
    cat_df = pd.concat([cat_df_se, cat_df_t], axis=1)
    #scaling
    scaled_num = scaler.transform(num_df)
    num_df = pd.DataFrame(scaled_num,
                          columns=num_df.columns)
    result_df = pd.concat([num_df, cat_df], axis=1)
    return result_df

In [293]:
test_X = preprocess(test_data, num_imputer, cat_imputer, one_hot_encoder, hashing_encoder, scaler)
test_X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 18 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  418 non-null    float64
 1   Age     418 non-null    float64
 2   SibSp   418 non-null    float64
 3   Parch   418 non-null    float64
 4   Fare    418 non-null    float64
 5   Female  418 non-null    float64
 6   Male    418 non-null    float64
 7   C       418 non-null    float64
 8   Q       418 non-null    float64
 9   S       418 non-null    float64
 10  col_0   418 non-null    int64  
 11  col_1   418 non-null    int64  
 12  col_2   418 non-null    int64  
 13  col_3   418 non-null    int64  
 14  col_4   418 non-null    int64  
 15  col_5   418 non-null    int64  
 16  col_6   418 non-null    int64  
 17  col_7   418 non-null    int64  
dtypes: float64(10), int64(8)
memory usage: 58.9 KB


In [294]:
test_pred = classifier.predict(test_X)

In [295]:
# Take the ids straight from the test set instead of reconstructing them from row counts.
pred_df = pd.DataFrame({'PassengerId': test_data['PassengerId'].values,
                        'Survived': test_pred})

In [296]:
pred_df

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [298]:
os.makedirs('results', exist_ok=True)  # to_csv does not create missing folders
pred_df.to_csv('results/titanic_predictions.csv', index=False)

Congratulations! We have built a full-fledged machine learning model.

---
<h2><span style="color:orange">Bonus</span></h2>

We have built a model, but we used the default (very well thought-out) hyperparameters. To improve it, we need to find values that suit our problem.

In [312]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
     'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18, 22, 26, 31, 36, 43,
                     50, 59, 69, 80, 92, 105, 119, 134, 150, 167, 184, 202, 221, 241],
     'weights': ['uniform', 'distance'],
     'p': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
  ]

In [285]:
%%time
grid_search = GridSearchCV(classifier, param_grid, cv=5,
                           scoring='accuracy',
                           return_train_score=True)
grid_search.fit(train_X, y)
grid_search.best_params_

CPU times: user 7min 29s, sys: 2.51 s, total: 7min 31s
Wall time: 9min 12s


{'n_neighbors': 43, 'p': 8, 'weights': 'uniform'}
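The grid above has 30 × 2 × 10 = 600 combinations, each fitted 5 times, which explains the runtime. Since the search was run with `return_train_score=True`, the full table in `cv_results_` can be inspected rather than only `best_params_`. A minimal sketch on synthetic data with a deliberately tiny grid (both are assumptions made to keep the example fast):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data and a tiny grid, just to show the mechanics.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
small_grid = {"n_neighbors": [3, 15], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), small_grid, cv=5,
                      scoring="accuracy", return_train_score=True)
search.fit(X_demo, y_demo)

# One row per parameter combination, best validation score first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "param_weights", "mean_train_score", "mean_test_score"]
].sort_values("mean_test_score", ascending=False)
print(results)
```

A large gap between `mean_train_score` and `mean_test_score` for a combination is a hint of overfitting, and near-ties at the top suggest the exact winner is partly luck.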

---

<h2><span style="color:orange">Bonus II</span></h2>

Preprocessing can be a tedious process. Whatever we did to the training data has to be repeated for the test data. Writing large, long functions such as `preprocess` can be inconvenient and brings limitations.
Also note that the preprocessing steps can (and in fact should) be tuned by choosing suitable hyperparameters. In its current form this is quite difficult.
Pipelines come to the rescue:


https://blog.prokulski.science/2020/10/10/pipeline-w-scikit-learn/

Read the article and try to build a simple pipeline for the numeric data.

In [311]:
from sklearn.pipeline import Pipeline

num_train_data = train_data.select_dtypes(exclude=['object'])
num_train_X, y = num_train_data.drop('Survived', axis=1), num_train_data['Survived']

pipe = Pipeline(steps=[
    ('fill missing data', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler()),
    ('classifier', KNeighborsClassifier())
])

pipe.fit(num_train_X, y)
print(pipe.score(num_train_X, y))

0.7811447811447811
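The pipeline above covers only the numeric columns. One possible extension, sketched below, uses `ColumnTransformer` to route numeric and categorical columns through separate steps and lets `GridSearchCV` tune a preprocessing choice together with the classifier. The toy frame, its invented target and the step names are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data invented for illustration; the target simply mirrors "Sex".
rng = np.random.default_rng(0)
n_rows = 60
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, n_rows),
    "Fare": rng.uniform(5, 100, n_rows),
    "Sex": rng.choice(["male", "female"], n_rows),
})
toy.loc[::7, "Age"] = np.nan
y_demo = (toy["Sex"] == "female").astype(int)

# Numeric columns: impute, then scale. Categorical column: one-hot encode.
numeric_steps = Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", MinMaxScaler())])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
full_pipe = Pipeline([("preprocess", preprocessor),
                      ("classifier", KNeighborsClassifier())])

# Nested step names are joined with double underscores.
demo_grid = {"preprocess__num__imputer__strategy": ["mean", "median"],
             "classifier__n_neighbors": [3, 5]}
search = GridSearchCV(full_pipe, demo_grid, cv=3, scoring="accuracy")
search.fit(toy, y_demo)
print(search.best_params_)
```

Because the whole pipeline is refitted inside each fold, the imputer and the scaler only ever see that fold's training part, which avoids leaking information from the validation rows.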
