<a href="https://colab.research.google.com/github/krakowiakpawel9/ml_course/blob/master/unsupervised/02_dimensionality_reduction/03_pca_wine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* @author: krakowiakpawel9@gmail.com  
* @site: e-smartdata.org

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
This course was developed against version `0.22.1`.

### Table of contents:
1. [Importing libraries](#0)
2. [Loading the data](#1)
3. [Train/test split](#2)
4. [Standardization](#3)
5. [PCA](#4)




### <a name='0'></a> Importing libraries

In [0]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

np.set_printoptions(precision=4, suppress=True, edgeitems=5, linewidth=200)

### <a name='1'></a> Loading the data

In [2]:
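# The UCI file has no header row, so pandas numbers the columns 0..13
df_raw = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
# Work on a copy so the raw download stays untouched
df = df_raw.copy()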
df.head()

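|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |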
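
Because the file is read with `header=None`, the columns are only numbered. As a minimal sketch, readable names could be attached as shown below. The names follow the attribute list in the dataset's `wine.names` description; the snake_case spelling and the identifiers `COLUMN_NAMES` and `sample` are assumptions made for this illustration and are not part of the original notebook. The two rows are copied from the output above so that the example runs without a network connection.

```python
import pandas as pd

# Assumed snake_case names based on the attribute list in the UCI `wine.names` file
COLUMN_NAMES = [
    'class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
    'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
    'color_intensity', 'hue', 'od280_od315_of_diluted_wines', 'proline',
]

# First two rows of the dataset, copied from the df.head() output
sample = pd.DataFrame([
    [1, 14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.06, 0.28, 2.29, 5.64, 1.04, 3.92, 1065],
    [1, 13.2, 1.78, 2.14, 11.2, 100, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.4, 1050],
])
sample.columns = COLUMN_NAMES  # replace the integer labels 0..13
print(sample[['class', 'alcohol', 'proline']])
```

For the full file, the same list could be passed as `names=COLUMN_NAMES` to `pd.read_csv` in place of `header=None`. The rest of this notebook keeps the integer column labels.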


In [3]:
# Columns 1..13 hold the features; column 0 holds the class label
data = df.iloc[:, 1:]
target = df.iloc[:, 0]

data.head()

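|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |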


In [4]:
target.head()

0    1
1    1
2    1
3    1
4    1
Name: 0, dtype: int64

### <a name='2'></a> Train/test split

In [5]:
from sklearn.model_selection import train_test_split

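# Default split: 25% of the rows go to the test set.
# No random_state is set, so the rows selected differ between runs.
X_train, X_test, y_train, y_test = train_test_split(data, target)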

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')

X_train shape: (133, 13)
X_test shape: (45, 13)
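
The split above is random, so the standardized values and PCA results further down change from run to run. A minimal sketch of a reproducible, stratified variant follows. It assumes the copy of the wine data bundled with scikit-learn (`load_wine`, which labels the classes 0 to 2 rather than 1 to 3) so that it runs offline; the seed `42` and the `X_tr`/`X_te` names are arbitrary choices for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Bundled copy of the wine data: 178 samples, 13 features
X, y = load_wine(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.25,    # the default, written out explicitly
    random_state=42,   # arbitrary seed; fixes which rows land in each set
    stratify=y,        # keep class proportions similar in both sets
)

print(f'X_tr shape: {X_tr.shape}')
print(f'X_te shape: {X_te.shape}')
```

The shapes match the ones printed above, but the same rows are now selected on every run.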


### <a name='3'></a> Standardization

In [6]:
from sklearn.preprocessing import StandardScaler

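scaler = StandardScaler()
# Learn the per-feature mean and standard deviation from the training set only
X_train_std = scaler.fit_transform(X_train)
# Reuse the training statistics on the test set to avoid leaking test information
X_test_std = scaler.transform(X_test)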
X_train_std[:5]

array([[-0.7619, -0.7462, -0.3665,  0.6218, -0.952 ,  0.6828,  1.075 ,  0.2177,  0.3759, -0.4961, -1.2259,  0.2719, -1.2265],
       [-0.4311,  1.0038, -0.1032,  0.6218,  0.4432, -0.9864, -0.8622, -1.6125, -1.3735, -0.0262, -0.838 , -1.9685, -0.4779],
       [ 1.5288, -0.4084,  1.2508,  0.1624,  1.4199,  0.7791,  1.0651, -0.2815,  0.7522,  0.5058,  0.4119, -0.0027,  1.5745],
       [ 0.5243, -0.625 ,  0.9499,  0.9281, -0.7427,  0.4581, -0.9616,  1.2992,  1.3541,  2.9664, -1.7431, -1.2602, -0.4176],
       [ 0.7081, -0.5037,  1.1756, -0.6952,  0.8618,  0.8593,  0.8366, -0.531 , -0.2072,  0.9935,  1.317 ,  0.3297,  1.6651]])
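
To make the scaling step concrete, here is a minimal NumPy sketch of the same computation: subtract the per-feature training mean and divide by the per-feature training standard deviation (population form, `ddof=0`). The arrays `train` and `test` are synthetic stand-ins, not the wine data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for X_train and X_test (4 features)
train = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
test = rng.normal(loc=10.0, scale=3.0, size=(30, 4))

# Statistics are computed on the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, the population standard deviation

train_std = (train - mean) / std
test_std = (test - mean) / std  # same mean and std as for the training set

print(train_std.mean(axis=0).round(4))  # approximately 0 for every feature
print(train_std.std(axis=0).round(4))   # approximately 1 for every feature
```

The test set is scaled with the training statistics, so its columns will generally not have exactly zero mean or unit variance.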

### <a name='4'></a> PCA

In [0]:
from sklearn.decomposition import PCA

# With no n_components argument, all components are kept (13 here, one per feature)
pca = PCA()
X_pca = pca.fit_transform(X_train_std)
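
As a rough illustration of what the fit computes, the sketch below derives principal components from the eigendecomposition of the sample covariance matrix. scikit-learn itself uses an SVD-based solver, so this is an equivalent formulation rather than the library's implementation, and component signs may differ. The matrix `Z` is synthetic data standing in for `X_train_std`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data standing in for X_train_std
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
Z = rng.normal(size=(200, 3)) @ mixing
Z = Z - Z.mean(axis=0)  # PCA operates on centered data

cov = np.cov(Z, rowvar=False)           # sample covariance (n - 1 denominator)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort components by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_variance_ratio = eigvals / eigvals.sum()
Z_pca = Z @ eigvecs  # project the data onto the principal axes

print(explained_variance_ratio)
```

Each eigenvalue equals the variance of the data along its component, which is why the ratios sum to 1.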

Explained variance

In [8]:
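# Share of the total variance captured by each principal component
results = pd.DataFrame(data={'explained_variance_ratio': pca.explained_variance_ratio_})
# Running total: variance retained when keeping the first k components
results['cumulative'] = results['explained_variance_ratio'].cumsum()
# 1-based component number for plotting
results['component'] = results.index + 1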
results

|   | explained_variance_ratio | cumulative | component |
|---|---|---|---|
| 0 | 0.37595 | 0.37595 | 1 |
| 1 | 0.197179 | 0.573129 | 2 |
| 2 | 0.113598 | 0.686726 | 3 |
| 3 | 0.063909 | 0.750636 | 4 |
| 4 | 0.059108 | 0.809744 | 5 |
| 5 | 0.047583 | 0.857327 | 6 |
| 6 | 0.040203 | 0.89753 | 7 |
| 7 | 0.026766 | 0.924296 | 8 |
| 8 | 0.020851 | 0.945147 | 9 |
| 9 | 0.01829 | 0.963437 | 10 |
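
A common use of the cumulative column is to pick the smallest number of components that retains a chosen share of the variance. The sketch below does this for the ten ratios shown in the table above; the helper `components_for` and the thresholds 0.90 and 0.95 are assumptions made for this illustration.

```python
import numpy as np

# Ratios copied from the table above (the ten rows shown)
ratios = np.array([0.37595, 0.197179, 0.113598, 0.063909, 0.059108,
                   0.047583, 0.040203, 0.026766, 0.020851, 0.01829])
cumulative = np.cumsum(ratios)

def components_for(threshold):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`.

    Only meaningful when the threshold is actually reached by the listed ratios.
    """
    return int(np.argmax(cumulative >= threshold)) + 1

n_90 = components_for(0.90)
n_95 = components_for(0.95)
print(n_90, n_95)  # 8 10
```

scikit-learn can make the same choice during fitting: passing a float between 0 and 1 as `n_components`, for example `PCA(n_components=0.95)`, keeps just enough components to reach that share of explained variance.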


In [9]:
fig = go.Figure(data=[go.Bar(x=results['component'], y=results['explained_variance_ratio'], name='explained variance ratio'),
                      go.Scatter(x=results['component'], y=results['cumulative'], name='cumulative explained variance')],
                layout=go.Layout(title=f'PCA - {pca.n_components_} components', width=700))
fig.show()