<a href="https://colab.research.google.com/github/kszymon/machine-learning-bootcamp/blob/main/unsupervised%20/02_dimensionality_reduction/03_pca_wine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
This course was built on version `0.22.1`.

### Table of contents:
1. [Importing libraries](#0)
2. [Loading the data](#1)
3. [Splitting into training and test sets](#2)
4. [Standardization](#3)
5. [PCA](#4)

### <a name='0'></a> Importing libraries

In [3]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px

np.set_printoptions(precision=4, suppress=True, edgeitems=5, linewidth=200)

### <a name='1'></a> Loading the data

In [4]:
# The UCI wine file has no header row, so columns are numbered 0-13
df_raw = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df = df_raw.copy()
df.head()

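|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |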


In [5]:
# Column 0 holds the class label; columns 1-13 are the features
data = df.iloc[:, 1:]
target = df.iloc[:, 0]
data.head()

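|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |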


In [6]:
target.value_counts()

| 0 | count |
|---|---|
| 2 | 71 |
| 1 | 59 |
| 3 | 48 |


### <a name='2'></a> Splitting into training and test sets

In [7]:
from sklearn.model_selection import train_test_split

# Default split: 75% train, 25% test; no random_state is set, so the rows differ between runs
X_train, X_test, y_train, y_test = train_test_split(data, target)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')

X_train shape: (133, 13)
X_test shape: (45, 13)
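
The split above uses the defaults, so it changes on every run and is not guaranteed to preserve the class proportions. A minimal sketch of a reproducible, stratified variant follows. The synthetic `X_demo`/`y_demo` arrays (mirroring the 59/71/48 class counts shown earlier) and the `random_state=42` value are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption): 178 rows, 13 features,
# with the same class counts as target.value_counts() above.
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3], [59, 71, 48])
X_demo = rng.normal(size=(y_demo.size, 13))

# stratify keeps the class proportions similar in both subsets;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr)[1:] / y_tr.size)  # class shares in the training set
print(np.bincount(y_te)[1:] / y_te.size)  # class shares in the test set
```

With real data, the same `stratify=target` and `random_state` arguments could be passed to the call in the cell above.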


### <a name='3'></a> Standardization

In [13]:
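from sklearn.preprocessing import StandardScaler

# Learn the mean and standard deviation from the training set only,
# then apply the same statistics to the test set
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
X_train_std[:5]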

array([[-1.5415, -0.1677,  1.5271,  2.6217, -0.5199, -0.2882,  0.2038,  1.6854,  0.2565, -0.9117,  0.0347, -0.2502, -0.9155],
       [ 1.069 ,  1.5953,  0.043 , -0.0161, -0.7239, -0.8052, -1.1969,  0.91  , -0.0961,  1.7455, -1.6848, -1.3966, -0.8693],
       [ 0.2922,  0.8688, -0.3281, -0.3092, -0.112 , -0.8052, -1.1969,  1.918 ,  0.4505,  2.4131, -1.7278, -1.5829, -0.2252],
       [-0.3954, -0.6992, -0.4023,  0.3356, -1.3358, -1.4633, -0.5711,  1.6854,  0.0097, -0.8896, -0.0083, -0.7947, -0.8197],
       [-0.8029, -1.0181, -1.6638,  0.0132, -1.4718, -0.3195, -0.0347, -0.7182, -1.013 , -0.1689,  0.6796,  1.2402, -0.7702]])
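
To make the transformation concrete: `StandardScaler` subtracts each column's training mean and divides by its training standard deviation (the population form, `ddof=0`). The sketch below checks this on synthetic data; the array and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(100, 4))  # hypothetical training data
test = rng.normal(loc=10.0, scale=3.0, size=(20, 4))    # hypothetical test data

demo_scaler = StandardScaler().fit(train)

# Manual z-score using the training statistics only
mean = train.mean(axis=0)
std = train.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
train_manual = (train - mean) / std
test_manual = (test - mean) / std

print(np.allclose(demo_scaler.transform(train), train_manual))
print(np.allclose(demo_scaler.transform(test), test_manual))
```

Because the test set is scaled with the training statistics, its columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1.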

### <a name='4'></a> PCA

In [14]:
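from sklearn.decomposition import PCA

# Fit the principal axes on the standardized training data,
# then project both sets onto the first three components
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
X_train_pca.shape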

(133, 3)
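
As a rough illustration of what `fit_transform` computes, PCA on centered data can be sketched with a singular value decomposition in plain NumPy. The data and names here are hypothetical, and component signs may differ from scikit-learn's output, since each principal axis is only defined up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 133 samples, 13 correlated features
X = rng.normal(size=(133, 13)) @ rng.normal(size=(13, 13))

k = 3
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt[:k].T  # projection onto the first k axes

# Variance captured by each axis, and its share of the total
explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(scores.shape)
print(explained_variance_ratio[:k])
```

The projected columns are uncorrelated, and their variances equal the first `k` entries of `explained_variance`.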

Explained variance

In [20]:
results = pd.DataFrame(data={'explained_variance_ratio': pca.explained_variance_ratio_})
results['cumulative'] = results['explained_variance_ratio'].cumsum()
results['component'] = results.index + 1
results

|   | explained_variance_ratio | cumulative | component |
|---|---|---|---|
| 0 | 0.357884 | 0.357884 | 1 |
| 1 | 0.193301 | 0.551185 | 2 |
| 2 | 0.111272 | 0.662456 | 3 |
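
Three components capture about 66% of the variance here. If a specific share is required instead, one option is to fit PCA with all components and read the count off the cumulative curve. The sketch below assumes the copy of the wine data bundled with scikit-learn (`load_wine`), a fixed `random_state`, and an illustrative 95% threshold, none of which appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_tr, X_te = train_test_split(X, random_state=0)
X_tr_std = StandardScaler().fit_transform(X_tr)

threshold = 0.95  # illustrative assumption

# Fit with all components, then find the smallest count whose cumulative
# explained variance ratio exceeds the threshold
pca_full = PCA().fit(X_tr_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative > threshold)) + 1

# Passing a float in (0, 1) asks PCA to select that count itself
pca_auto = PCA(n_components=threshold).fit(X_tr_std)

print(n_needed, pca_auto.n_components_)
```

The two printed numbers should agree, so the float form of `n_components` is a shortcut for the manual cumulative-sum lookup.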


In [21]:
fig = go.Figure(data=[go.Bar(x=results['component'], y=results['explained_variance_ratio'], name='explained variance ratio'),
                      go.Scatter(x=results['component'], y=results['cumulative'], name='cumulative explained variance')],
                layout=go.Layout(title=f'PCA - {pca.n_components_} components', width=950, template='plotly_dark'))
fig.show()

In [22]:
# Combine the three components with the class labels for plotting
X_train_pca_df = pd.DataFrame(data=np.c_[X_train_pca, y_train], columns=['pca1', 'pca2', 'pca3', 'target'])
X_train_pca_df.head()

|   | pca1 | pca2 | pca3 | target |
|---|---|---|---|---|
| 0 | -1.598747 | -1.388958 | 3.284172 | 2.0 |
| 1 | -2.921893 | 1.887945 | -0.868741 | 3.0 |
| 2 | -2.787039 | 2.134431 | -0.93486 | 3.0 |
| 3 | -1.903033 | -1.642034 | -0.129321 | 2.0 |
| 4 | 0.24008 | -2.411627 | -1.1309 | 2.0 |


In [23]:
px.scatter_3d(X_train_pca_df, x='pca1', y='pca2', z='pca3', color='target', template='plotly_dark', width=950)

In [24]:
X_train_pca[:5]

array([[-1.5987, -1.389 ,  3.2842],
       [-2.9219,  1.8879, -0.8687],
       [-2.787 ,  2.1344, -0.9349],
       [-1.903 , -1.642 , -0.1293],
       [ 0.2401, -2.4116, -1.1309]])

In [25]:
X_test_pca[:5]

array([[ 2.9817,  1.2863, -0.2515],
       [-1.0763, -1.8856,  0.8991],
       [-0.0928, -1.9649,  0.5671],
       [-3.9091,  0.0178,  0.0805],
       [ 2.5314,  0.9228, -0.0813]])
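
The scaling and projection steps above can also be chained, so that both are fitted on the training set only and then reused on the test set. A minimal sketch with scikit-learn's `make_pipeline` follows, again assuming the bundled `load_wine` data and a fixed `random_state` rather than the notebook's own variables; `reducer` is a hypothetical name.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_tr, X_te = train_test_split(X, random_state=0)

# fit_transform learns the scaler statistics and PCA axes from the training data;
# transform applies the same fitted steps to the test data
reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
X_tr_pca = reducer.fit_transform(X_tr)
X_te_pca = reducer.transform(X_te)

print(X_tr_pca.shape, X_te_pca.shape)
```

Keeping both steps in one object makes it harder to fit the scaler or the PCA on test data by accident.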