# Class 5: Introduction to `scikit-learn`

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

* Tour of what the [library](http://scikit-learn.org/stable/user_guide.html) offers
* Getting familiar with the [documentation](http://scikit-learn.org/stable/modules/classes.html)
* Implementing a simple regression workflow (http://scikit-learn.org/stable/tutorial/basic/tutorial.html)

https://www.facebook.com/groups/DataScienceArgentina

### Uses

* Supervised learning
    * Classification
    * Regression
* Unsupervised learning
* ~~Reinforcement learning~~

Neural networks: up to the multilayer perceptron.

### Extensions

http://scikit-learn.org/stable/related_projects.html

* Pandas
* More algorithms
* **Automation** 😀
* Specific domains
    * Computer vision (images)
    * Language processing (text)
    * Medicine, astronomy, geography...


### Data

`scikit-learn` consumes **data** shaped as a matrix or two-dimensional array of shape `(n_samples, n_features)`. This matches how data is usually pictured: laid out in a table whose **columns** are the features, with one **row** per sample.

By convention, the documentation uses the variable `X` for the **features** themselves and the variable `y` for the **targets**. When there is a single target, `y` is usually a one-dimensional array of shape `(n_samples,)`.
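The shape convention can be checked on a small array. This is a minimal sketch with made-up toy values; the data itself is hypothetical and only the shapes matter.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features.
X = np.arange(15, dtype=float).reshape(5, 3)   # shape (n_samples, n_features)
y = np.array([0, 1, 0, 1, 1])                  # shape (n_samples,)

n_samples, n_features = X.shape
print(X.shape, y.shape)
```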

### Objects

`scikit-learn` has two fundamental kinds of objects:

* **Transformers**, which implement the methods
    * `fit(X, y)` and
    * `transform(X)`,

* and **estimators**, which implement
    * `fit(X, y)`,
    * `predict(X)` and,
    * depending on the estimator, `predict_proba(X)`.
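A minimal sketch of both kinds of objects, assuming `StandardScaler` as the transformer and `LogisticRegression` as the estimator (both choices are illustrative, not prescribed by these notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit learns per-feature mean and scale, transform applies them.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Estimator: fit learns the model, predict returns one label per sample.
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
labels = clf.predict(X_scaled)

# predict_proba returns one column per class; each row sums to 1.
proba = clf.predict_proba(X_scaled)
print(labels.shape, proba.shape)
```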

### Supervised learning
    
- 1.1. **Generalized Linear Models**
- 1.2. Linear and Quadratic Discriminant Analysis
- 1.3. Kernel ridge regression
- 1.4. **Support Vector Machines**
- 1.5. Stochastic Gradient Descent
- 1.6. **Nearest Neighbors**
- 1.7. Gaussian Processes
- 1.8. Cross decomposition
- 1.9. **Naive Bayes**
- 1.10. **Decision Trees**
- 1.11. Ensemble methods
- 1.12. Multiclass and multilabel algorithms
- 1.13. Feature selection
- 1.14. Semi-Supervised
- 1.15. Isotonic regression
- 1.16. Probability calibration
- 1.17. Neural network models (supervised)

### Unsupervised learning

- 2.1. Gaussian mixture models
- 2.2. Manifold learning
- 2.3. **Clustering**
- 2.4. Biclustering
- 2.5. **Decomposing signals in components** (matrix factorization problems)
- 2.6. Covariance estimation
- 2.7. Novelty and Outlier Detection
- 2.8. Density Estimation
- 2.9. Neural network models (unsupervised)

### `predict_proba(X)`

1.16 http://scikit-learn.org/stable/modules/calibration.html

When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not support probability prediction. The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.
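As a minimal sketch of adding probability support, assume a `LinearSVC` (which has no `predict_proba`) wrapped in `CalibratedClassifierCV`; the choice of base classifier and of `cv=3` is illustrative only.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# LinearSVC exposes decision_function but not predict_proba.
base = LinearSVC(max_iter=10000)
has_proba_before = hasattr(base, "predict_proba")

# Wrapping it in CalibratedClassifierCV adds calibrated probabilities.
calibrated = CalibratedClassifierCV(base, cv=3).fit(X, y)
proba = calibrated.predict_proba(X)
print(has_proba_before, proba.shape)
```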

---

# Workflow

![](https://docs.google.com/drawings/d/1HJH4Al7gkcIKOr21w-ciZwAFZad6CsU_YKdeAiHHolA/pub?w=960&h=720)

## Iris plants dataset

**Number of instances**: 150

**Features** (4)
    1. Sepal length [cm]
    2. Sepal width [cm]
    3. Petal length [cm]
    4. Petal width [cm]

**Targets** (1)
    5. class
        Setosa
        Versicolour
        Virginica

**Missing values**: No

<img src='https://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg' alt="Drawing" style="width: 400px;"/>

In [30]:
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

print('Data', '\t\t',    X.shape)
print('Targets', '\t',   y.shape)

Data 		 (150, 4)
Targets 	 (150,)


## Data preprocessing

### Loading data

Integrating `Pandas` with `scikit-learn` using the [`sklearn-pandas`](https://github.com/pandas-dev/sklearn-pandas) package.

```
# pip install sklearn-pandas
```

In [15]:
from sklearn_pandas import DataFrameMapper

import sklearn.preprocessing

data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

data

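|   | children | pet  | salary |
|---|----------|------|--------|
| 0 | 4.0      | cat  | 90     |
| 1 | 6.0      | dog  | 24     |
| 2 | 3.0      | dog  | 44     |
| 3 | 3.0      | fish | 27     |
| 4 | 2.0      | cat  | 32     |
| 5 | 3.0      | dog  | 59     |
| 6 | 5.0      | cat  | 36     |
| 7 | 4.0      | fish | 27     |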


In [17]:
mapper = DataFrameMapper([
    ('pet', sklearn.preprocessing.LabelBinarizer()),
    (['children'], sklearn.preprocessing.StandardScaler())
], df_out=True)

mapper.fit_transform(data)

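|   | pet_cat | pet_dog | pet_fish | children  |
|---|---------|---------|----------|-----------|
| 0 | 1.0     | 0.0     | 0.0      | 0.208514  |
| 1 | 0.0     | 1.0     | 0.0      | 1.87663   |
| 2 | 0.0     | 1.0     | 0.0      | -0.625543 |
| 3 | 0.0     | 0.0     | 1.0      | -0.625543 |
| 4 | 1.0     | 0.0     | 0.0      | -1.459601 |
| 5 | 0.0     | 1.0     | 0.0      | -0.625543 |
| 6 | 1.0     | 0.0     | 0.0      | 1.042572  |
| 7 | 0.0     | 0.0     | 1.0      | 0.208514  |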


### Data cleaning

* Cardinality
* Range
* Deviation
* Format
    * Boolean
    * Number (separators)
    * Text
        * whitespace (*trimming*)
        * accents
        * case (uppercase, lowercase)
* **Encoding** (UTF-8, etc.)
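The text-format checks in the list above can be sketched with `pandas` string methods. The column values and the `strip_accents` helper are hypothetical and serve only as an illustration.

```python
import unicodedata

import pandas as pd

# Hypothetical raw column with inconsistent spacing, case and accents.
raw = pd.Series(["  Córdoba", "cordoba ", "CÓRDOBA", "Salta"])

def strip_accents(text):
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

clean = raw.str.strip().str.lower().map(strip_accents)

# Cardinality before and after cleaning.
print(raw.nunique(), clean.nunique())
```

Comparing the cardinality before and after cleaning is a quick way to spot values that differ only in formatting.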

### Splitting the dataset

* training (50%)
* validation (25%), unless cross-validation is used or there are no hyperparameters
* test (25%)

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
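`train_test_split` produces two parts per call, so a three-way 50/25/25 split needs two calls. A minimal sketch on the iris data follows; the names `X_rest`, `X_val` and `y_val` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 25% of the data for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Then take one third of the remaining 75% (25% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```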

## Sampling: imbalanced datasets

*Training* and *validating* estimators usually requires **balanced** datasets; *testing* the model does not, since the model must face real data from the problem (**imbalanced**). `scikit-learn` provides hardly any sampling algorithms, but the [`imbalanced-learn`](http://contrib.scikit-learn.org/imbalanced-learn/index.html) extension implements several.

    # pip install imbalanced-learn

`imbalanced-learn` provides *sampler* objects that implement the method `fit_resample(X, y)` (called `fit_sample(X, y)` in older releases).

**Under-sampling**

* ClusterCentroids
* RandomUnderSampler

**Over-sampling**

* SMOTE
* RandomOverSampler

##### Example with RandomUnderSampler

http://contrib.scikit-learn.org/imbalanced-learn/generated/imblearn.under_sampling.RandomUnderSampler.html

In [44]:
from sklearn.datasets import make_classification

from imblearn.under_sampling import RandomUnderSampler

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=200, random_state=10)

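# Apply random under-sampling.
# Recent imbalanced-learn releases use fit_resample (formerly fit_sample) and
# expose the selected indices as sample_indices_ (formerly return_indices=True).
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, y)
idx_resampled = rus.sample_indices_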

## Feature preprocessing

- 4.1. Pipeline and FeatureUnion: combining estimators
- 4.2. Feature extraction
- 4.3. Preprocessing data
- 4.4. Unsupervised dimensionality reduction
- 4.5. Random Projection
- 4.6. Kernel Approximation
- 4.7. Pairwise metrics, Affinities and Kernels
- 4.8. Transforming the prediction target (y)

### Extraction

4.2 http://scikit-learn.org/stable/modules/feature_extraction.html

* Images
* Language

### Transformation

4.3.1 http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

* Standardization (StandardScaler): very common; subtracts each feature's mean and divides the result by the feature's standard deviation.
* Rescaling (MinMaxScaler, MaxAbsScaler)

4.3.2 http://scikit-learn.org/stable/modules/preprocessing.html#normalization

* Normalization (Normalizer): divides vectors by their norm (operates on rows).
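The difference between the three transformations shows up on a small array. A minimal sketch with hypothetical values: the scalers work column by column, while `Normalizer` works row by row.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

standardized = StandardScaler().fit_transform(X)  # columns: mean 0, std 1
rescaled = MinMaxScaler().fit_transform(X)        # columns: min 0, max 1
normalized = Normalizer().fit_transform(X)        # rows: Euclidean norm 1

print(standardized.mean(axis=0), rescaled.max(axis=0))
print(np.linalg.norm(normalized, axis=1))
```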

##### Example with StandardScaler

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
scaler.transform(X)     

### Imputation of missing values

4.3.5 http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values

* Discarding (drop the sample)
* **Most frequent value**
* **Mean value**
* **Median value**
* Estimation (classification/regression)
* Hot-deck (the value from the most similar sample)
* NA as another value
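The three strategies in bold can be sketched with `SimpleImputer`. This assumes scikit-learn 0.20 or later, where the class lives in `sklearn.impute`; the toy matrix and the `filled` dictionary are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [3.0, np.nan],
              [8.0, 5.0]])

# Fill the missing entries column by column with each strategy.
filled = {}
for strategy in ("most_frequent", "mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    filled[strategy] = imputer.fit_transform(X)
    print(strategy)
    print(filled[strategy])
```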

### Creation

4.3.6 http://scikit-learn.org/stable/modules/preprocessing.html#generating-polynomial-features

From $(X_1, X_2)$ to $(1, X_1, X_2, X_1^2, X_1X_2, X_2^2)$.
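A minimal sketch of this expansion with `PolynomialFeatures`, using a single hypothetical sample with $X_1 = 2$ and $X_2 = 3$:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])   # one sample with X_1 = 2, X_2 = 3

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Columns: 1, X_1, X_2, X_1^2, X_1*X_2, X_2^2
print(X_poly)   # [[1. 2. 3. 4. 6. 9.]]
```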

### Dimensionality reduction

4.4 http://scikit-learn.org/stable/modules/unsupervised_reduction.html

* PCA: principal component analysis
* Random projections
* Several **unsupervised** estimators implement the `transform(X)` method

##### Example with PCA

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [None]:
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=1)
pca.fit(X)

pca.explained_variance_ratio_ 
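The other two options for dimensionality reduction, random projections and unsupervised estimators that implement `transform(X)`, can be sketched in the same way. The random data, the number of components and the choice of `KMeans` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 50))   # hypothetical data: 100 samples, 50 features

# Random projection down to 5 components.
X_rp = GaussianRandomProjection(n_components=5, random_state=0).fit_transform(X)

# KMeans.transform maps each sample to its distances to the cluster centers.
X_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).transform(X)

print(X_rp.shape, X_km.shape)
```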

### Selection: supervised learning only

1.13 http://scikit-learn.org/stable/modules/feature_selection.html

* Variance threshold
* Univariate analysis
* Using an estimator
* Recursive elimination (recursive addition also exists)
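Three of the approaches in the list can be sketched on the iris data. The threshold, the number of features kept and the `LogisticRegression` base model are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance is below 0.2.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Univariate analysis: keep the 2 features with the best ANOVA F-score.
X_uni = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Recursive elimination: repeatedly drop the weakest feature of a fitted model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
X_rfe = rfe.transform(X)

print(X_var.shape, X_uni.shape, X_rfe.shape)
```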

##### Example with SelectFromModel

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

In [None]:
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the Boston dataset (load_boston was removed in scikit-learn 1.2,
# so this cell needs an older release).
boston = load_boston()
X, y = boston.data, boston.target

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()

# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
sfm.transform(X)

### Model selection

- 3.1. Cross-validation: evaluating estimator performance
- 3.2. Tuning the hyper-parameters of an estimator
- 3.3. Model evaluation: quantifying the quality of predictions
- 3.4. Model persistence
- 3.5. Validation curves: plotting scores to evaluate models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

X_train.shape, y_train.shape

X_test.shape, y_test.shape


clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

#### Model evaluation

3.3 http://scikit-learn.org/stable/modules/model_evaluation.html

Every estimator implements a `score(X, y)` method that returns a score for the estimator's performance. The score is computed with a metric suited to the kind of estimator: by default, regressors use the *coefficient of determination* ($R^2$) and classifiers use mean *accuracy*.
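For scikit-learn's built-in classifiers and regressors, the default `score` is mean accuracy and $R^2$ respectively. A minimal sketch that checks this against the metric functions; the models and the synthetic regression data are illustrative choices.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Classifier: score is the mean accuracy.
X_c, y_c = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)
clf_matches = clf.score(X_c, y_c) == accuracy_score(y_c, clf.predict(X_c))

# Regressor: score is the coefficient of determination R^2.
X_r, y_r = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
reg_matches = abs(reg.score(X_r, y_r) - r2_score(y_r, reg.predict(X_r))) < 1e-12

print(clf_matches, reg_matches)
```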

##### Report

In [40]:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

             precision    recall  f1-score   support

    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3

avg / total       0.70      0.60      0.61         5



##### Confusion matrix

In [33]:
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

##### F1

$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

In [36]:
from sklearn.metrics import f1_score
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 1]
f1_score(y_true, y_pred)  

0.8571428571428571
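The same value follows from the formula. A minimal sketch that computes precision and recall separately for the example labels and combines them by hand:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)   # 3 true positives / 3 predicted positives
recall = recall_score(y_true, y_pred)         # 3 true positives / 4 actual positives
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual)
```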

##### Cohen's kappa

https://en.wikipedia.org/wiki/Cohen's_kappa

$\kappa = \frac{p_o - p_e}{1 - p_e}$

In [42]:
from sklearn.metrics import cohen_kappa_score
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 1]
cohen_kappa_score(y_true, y_pred)

0.66666666666666674
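In the formula, $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. A minimal sketch that computes both by hand for the example labels and compares the result with `cohen_kappa_score`:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([0, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

# Observed agreement: fraction of samples where both labelings coincide.
p_o = np.mean(y_true == y_pred)

# Chance agreement: per class, the product of the two marginal frequencies.
classes = np.unique(np.concatenate([y_true, y_pred]))
p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)

kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, kappa)
```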

#### Cross-validation

3.1 http://scikit-learn.org/stable/modules/cross_validation.html

Two things are needed:
- A strategy for partitioning the data
- An evaluation metric

##### Strategies

* K-fold, stratified k-fold: the default strategies for regressors and classifiers respectively.
* Leave one out (LOO)
* Leave P out (LPO)
* Shuffle & split, stratified shuffle & split

![](http://tomaszkacmajor.pl/wp-content/uploads/2016/05/cross-validation.png)
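Each strategy corresponds to a splitter object that can be passed as the `cv` argument of `cross_val_score`. A minimal sketch on the iris data; the number of splits and the test size are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     StratifiedKFold, cross_val_score)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear', C=1)

strategies = {
    'k-fold': KFold(n_splits=5, shuffle=True, random_state=0),
    'stratified k-fold': StratifiedKFold(n_splits=5),
    'leave one out': LeaveOneOut(),
    'shuffle & split': ShuffleSplit(n_splits=5, test_size=0.25, random_state=0),
}

# Mean score per strategy; leave one out fits one model per sample.
mean_scores = {}
for name, cv in strategies.items():
    scores = cross_val_score(clf, X, y, cv=cv)
    mean_scores[name] = scores.mean()
    print(name, len(scores), round(mean_scores[name], 3))
```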

##### Metrics

* If none is specified, the estimator's `score` method is used.
* The most common metrics can be passed as an option; see the [table](http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).
* **Scorers** can be built from any metric, whether from the API or user-defined.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [32]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5) 

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)
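A metric other than the default can be chosen by name or through a scorer built with `make_scorer`. A minimal sketch, assuming macro-averaged F1 and F2 as the metrics of interest:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear', C=1)

# Predefined metric, selected by name.
f1_scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

# Scorer built from a metric function and its keyword arguments.
f2_scorer = make_scorer(fbeta_score, beta=2, average='macro')
f2_scores = cross_val_score(clf, X, y, cv=5, scoring=f2_scorer)

print(f1_scores.mean(), f2_scores.mean())
```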


### Optimization

3.2 http://scikit-learn.org/stable/modules/grid_search.html

In [39]:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()

parámetros = {
    'kernel': ('linear', 'rbf'),
    'C': [1, 10]
}

# Use a distinct name so the svm module is not shadowed.
svc = svm.SVC()

clf = GridSearchCV(svc, parámetros)
clf.fit(iris.data, iris.target)
                            
clf.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### Model persistence

3.4 http://scikit-learn.org/stable/modules/model_persistence.html

In [None]:
from sklearn import svm
from sklearn import datasets

clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)  

import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X)

## Boston price data set

**Number of instances**: 506
   	
**Features** (13)
    1.  CRIM per capita crime rate by town
    2.  ZN proportion of residential land zoned for lots over 25,000 sq.ft.
    3.  INDUS proportion of non-retail business acres per town
    4.  CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5.  NOX nitric oxides concentration (parts per 10 million)
    6.  RM average number of rooms per dwelling
    7.  AGE proportion of owner-occupied units built prior to 1940
    8.  DIS weighted distances to five Boston employment centres
    9.  RAD index of accessibility to radial highways
    10. TAX full-value property-tax rate per 10,000 USD
    11. PTRATIO pupil-teacher ratio by town
    12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT % lower status of the population

**Targets** (1)
    14. MEDV Median value of owner-occupied homes in 1000's USD

**Missing values**: No

In [31]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()
X, y   = boston.data, boston.target

print('Data', '\t\t',    X.shape)
print('Targets', '\t',   y.shape)

Data 		 (506, 13)
Targets 	 (506,)


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the boston dataset.
boston = load_boston()
X, y = boston['data'], boston['target']

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()

# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
n_features = sfm.transform(X).shape[1]

# Reset the threshold till the number of features equals two.
# Note that the attribute can be set directly instead of repeatedly
# fitting the metatransformer.
while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]

# Plot the selected two features from X.
plt.title(
    "Features selected from Boston using SelectFromModel with "
    "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature number 1")
plt.ylabel("Feature number 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()

---

### Pipeline

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).

In [21]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer

pipe = make_pipeline(Binarizer(), MultinomialNB()) 

In [25]:
pipe.steps

[('binarizer', Binarizer(copy=True, threshold=0.0)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

In [27]:
from sklearn.model_selection import GridSearchCV
# Parameter names are prefixed with the step names listed in pipe.steps.
params = dict(binarizer__threshold=[0.0, 0.5, 1.0],
              multinomialnb__alpha=[0.1, 1.0, 10.0])
grid_search = GridSearchCV(pipe, param_grid=params)
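Grid-search parameter names for a pipeline take the form `<step name>__<parameter>`. A minimal sketch, assuming a hypothetical PCA + SVC pipeline whose steps are explicitly named `reduce_dim` and `clf`; the variable names are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step names are chosen explicitly; they prefix the parameter names below.
pca_svc_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

pca_svc_params = {
    'reduce_dim__n_components': [2, 3],
    'clf__C': [0.1, 10, 100],
}
search = GridSearchCV(pca_svc_pipe, param_grid=pca_svc_params, cv=5)
search.fit(X, y)

print(search.best_params_)
```

With `make_pipeline` the step names are generated from the class names in lowercase, so the prefixes would be `pca__` and `svc__` instead.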

### Number of targets

* Classifier
    * Binary
    * **Multiclass**
    * **Multilabel**
    * Multiclass-multilabel
* Regressor
    * Univariate
    * Multivariate
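When there is more than one target, `y` becomes a two-dimensional array with one column per target. A minimal sketch with hypothetical random data, assuming `KNeighborsClassifier` for the multilabel case and `LinearRegression` for the multivariate case (both accept a two-dimensional `y`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))   # hypothetical data

# Multilabel classification: each column of Y is an independent yes/no label.
Y_labels = np.column_stack([X[:, 0] > 0, X[:, 1] > 0]).astype(int)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, Y_labels)
label_pred = knn.predict(X)

# Multivariate regression: each column of Y is a separate continuous target.
Y_targets = np.column_stack([X.sum(axis=1), X[:, 0] - X[:, 2]])
reg = LinearRegression().fit(X, Y_targets)
target_pred = reg.predict(X)

print(label_pred.shape, target_pred.shape)
```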

## Automation

[TPOT](https://github.com/rhiever/tpot) is an automated machine learning tool that optimizes the workflow.

    # pip install tpot

In [8]:
from tpot import TPOTRegressor

tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)



Generation 1 - Current best internal CV score: 13.050775893707737

Generation 2 - Current best internal CV score: 13.050775893707737

Generation 3 - Current best internal CV score: 11.119271792901147

Generation 4 - Current best internal CV score: 11.119271792901147

Generation 5 - Current best internal CV score: 11.119271792901147

Best pipeline: GradientBoostingRegressor(input_matrix, GradientBoostingRegressor__alpha=0.8, GradientBoostingRegressor__learning_rate=0.1, GradientBoostingRegressor__loss=DEFAULT, GradientBoostingRegressor__max_depth=5, GradientBoostingRegressor__max_features=1.0, GradientBoostingRegressor__min_samples_leaf=11, GradientBoostingRegressor__min_samples_split=3, GradientBoostingRegressor__n_estimators=100, GradientBoostingRegressor__subsample=0.5)




17.76419689066346

In [9]:
tpot.export('tpot_boston_pipeline.py')