# Machine Learning

In this notebook we look at implementations of several machine learning algorithms.

## Libraries

We begin by importing the required libraries. Most of them come from sklearn and cover sample datasets, data splitting, models, and metrics.


In [1]:
from sklearn.datasets import fetch_california_housing, load_iris
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, recall_score, f1_score

import pandas as pd

## Supervised learning
### Regression
We load the California housing dataset.


In [3]:
data = fetch_california_housing()
X_regr = data.data
y_regr = data.target

In [4]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [5]:
df = pd.DataFrame(data=X_regr, columns=data['feature_names'])
df["Value"] = y_regr
df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


We split the data into training and test sets.



In [6]:
X_regr_train, X_regr_test, y_regr_train, y_regr_test = train_test_split(X_regr, y_regr, test_size=0.1)
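
Because `train_test_split` shuffles the rows at random, every run of the cell above produces a different split and slightly different metrics. The following is a minimal sketch, assuming a fixed seed is acceptable for this tutorial, of how `random_state` makes the split repeatable. The toy arrays and the seed value 42 are illustrative choices that do not appear in the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X_regr / y_regr (hypothetical values).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Same seed, same split: both calls return identical partitions.
split_a = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)                  # sizes of train and test sets
print(np.array_equal(split_a[1], split_b[1]))  # do the test sets coincide?
```

Omitting `random_state` is still reasonable when the goal is to see how much the metrics vary from one split to the next.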

#### Linear regression
We create the model and train it.
With the trained model we then generate predictions.


In [7]:
model_lr = LinearRegression()
model_lr.fit(X_regr_train, y_regr_train)
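
A fitted `LinearRegression` exposes the learned weights through `coef_` and `intercept_`. The sketch below uses synthetic data with known weights (an assumption, made so that it runs without downloading any dataset) to show how those attributes can be read back. The names `X_demo`, `y_demo`, and `true_coef` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

# Hypothetical ground truth: y = 2*x0 - 1*x1 + 0.5*x2 + 3, without noise.
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 3.0

model_demo = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature, plus a single intercept.
for name, coef in zip(["x0", "x1", "x2"], model_demo.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {model_demo.intercept_:.3f}")
```

For the housing model, the same loop over `data['feature_names']` and `model_lr.coef_` would pair each column with its weight.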

In [8]:
y_pred_lr = model_lr.predict(X_regr_test)
y_pred_lr

array([1.69740543, 1.75327957, 1.82536379, ..., 1.68324364, 1.05193778,
       1.77323238])

### Classification
We load another dataset, better suited to classification: the Iris dataset.

In [9]:
data = load_iris()
X_clas = data.data
y_clas = data.target

In [10]:
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [11]:
df_class = pd.DataFrame(data=X_clas, columns=data['feature_names'])
df_class["class"] = y_clas
df_class

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


We split the data into training and test sets.

In [12]:
X_clas_train, X_clas_test, y_clas_train, y_clas_test = train_test_split(X_clas, y_clas, test_size=0.2)
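
With only 30 test samples, a purely random split can leave one species over- or under-represented. As a sketch of one possible remedy (not what the original cell does), `train_test_split` accepts a `stratify` argument that preserves class proportions. The variable names and the seed are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# stratify=y_iris keeps the equal class balance in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, stratify=y_iris, random_state=0
)

print(np.bincount(y_tr))  # class counts in the training set
print(np.bincount(y_te))  # class counts in the test set
```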


#### Logistic regression
We create the model and train it.
With the trained model we then generate predictions.


In [13]:
model_logr = LogisticRegression()
model_logr.fit(X_clas_train, y_clas_train)

In [14]:
y_pred_logr = model_logr.predict(X_clas_test)
y_pred_logr

array([2, 2, 0, 0, 0, 0, 2, 0, 1, 1, 1, 1, 2, 2, 1, 2, 1, 0, 1, 0, 0, 1,
       2, 0, 0, 0, 2, 2, 2, 2])
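
Besides hard labels, `LogisticRegression` can return class probabilities through `predict_proba`. The sketch below is self-contained and makes its own split. The `max_iter=1000` setting and the seed are assumptions added to avoid convergence warnings and to keep the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0
)

# max_iter is raised above the default of 100 to give the solver room to converge.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)  # shape (n_samples, n_classes)
print(proba[:3].round(3))        # each row sums to 1

# The predicted label is the class with the highest probability.
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
print(labels_from_proba[:5], clf.predict(X_te)[:5])
```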

#### Support vector machines
We create the model and train it.
With the trained model we then generate predictions.


In [15]:
model_svc = SVC()
model_svc.fit(X_clas_train, y_clas_train)

In [16]:
y_pred_svc = model_svc.predict(X_clas_test)
y_pred_svc

array([2, 2, 0, 0, 0, 0, 2, 0, 1, 1, 1, 1, 1, 2, 1, 2, 1, 0, 1, 0, 0, 1,
       2, 0, 0, 0, 2, 2, 2, 2])
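
Kernel SVMs depend on distances between samples, so they are usually trained on standardized features. The sketch below shows one common way to do that with a pipeline; it is an optional variation, not the notebook's method. On Iris, where all features are in centimetres, the difference may be small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0
)

# The scaler is fitted on the training data only, then reused at predict time.
svc_pipe = make_pipeline(StandardScaler(), SVC())
svc_pipe.fit(X_tr, y_tr)

test_accuracy = svc_pipe.score(X_te, y_te)
print(test_accuracy)
```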

#### Decision trees
We create the model and train it.
With the trained model we then generate predictions.


In [17]:
model_dt = DecisionTreeClassifier()
model_dt.fit(X_clas_train, y_clas_train)

In [18]:
y_pred_dt = model_dt.predict(X_clas_test)
y_pred_dt

array([2, 2, 0, 0, 0, 0, 2, 0, 1, 2, 1, 1, 2, 2, 1, 2, 1, 0, 2, 0, 0, 1,
       2, 0, 0, 0, 2, 2, 2, 2])
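
One advantage of decision trees is that the learned rules can be read directly. As an illustrative sketch, `export_text` from `sklearn.tree` prints them. The depth limit of 2 and the seed are assumptions chosen to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```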

#### Random forests
We create the model and train it.
With the trained model we then generate predictions.


In [19]:
model_rf = RandomForestClassifier()
model_rf.fit(X_clas_train, y_clas_train)

In [20]:
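# Predict with the random forest. The original cell called model_dt.predict,
# so the outputs recorded for y_pred_rf in this notebook match the decision tree's.
y_pred_rf = model_rf.predict(X_clas_test)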
y_pred_rf

array([2, 2, 0, 0, 0, 0, 2, 0, 1, 2, 1, 1, 2, 2, 1, 2, 1, 0, 2, 0, 0, 1,
       2, 0, 0, 0, 2, 2, 2, 2])
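
A fitted random forest also reports how much each feature contributed to its splits through `feature_importances_`. The sketch below is a self-contained illustration; `n_estimators=100` and the seed are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Importances are normalized to sum to 1; list them from most to least important.
ranked = sorted(
    zip(iris.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```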

#### Neural networks (multilayer perceptron)
We create the model and train it.
With the trained model we then generate predictions.


In [21]:
model_nn = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(8,10,10,3))
model_nn.fit(X_clas_train, y_clas_train)

In [22]:
y_pred_nn = model_nn.predict(X_clas_test)
y_pred_nn

array([2, 2, 0, 0, 0, 0, 2, 0, 1, 2, 1, 1, 2, 2, 1, 2, 1, 0, 1, 0, 0, 1,
       2, 0, 0, 0, 2, 2, 2, 2])
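
In `hidden_layer_sizes=(8,10,10,3)` every entry describes a hidden layer; sklearn adds the output layer itself, sized to the number of classes. The sketch below inspects the shapes of the weight matrices in `coefs_` to make this visible. `max_iter=50` and the seed are assumptions that keep the run short.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X_iris, y_iris = load_iris(return_X_y=True)

mlp = MLPClassifier(
    solver="lbfgs", hidden_layer_sizes=(8, 10, 10, 3), max_iter=50, random_state=0
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a short run may stop before converging
    mlp.fit(X_iris, y_iris)

# One weight matrix per pair of consecutive layers:
# 4 inputs -> 8 -> 10 -> 10 -> 3 (last hidden layer) -> 3 outputs.
weight_shapes = [w.shape for w in mlp.coefs_]
print(weight_shapes)
```

The trailing 3 in the tuple is therefore a narrow hidden layer, not the output layer.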

## Unsupervised learning
We load the Iris dataset again, this time without the labels.


In [23]:
data = load_iris()
X_clus = data.data

In [24]:
## The class column is gone: in the unsupervised setting it is what we want to recover, so the model never sees it
df_clus = pd.DataFrame(data=X_clus, columns=data['feature_names'])
df_clus

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


### K-means
Useful for finding clusters of elements that behave similarly.


In [27]:
## Three clusters because we want three classes
model_km = KMeans(n_clusters=3)
model_km.fit(X_clus)



In [28]:
model_km.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)

In [29]:
model_km.cluster_centers_

array([[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [5.006     , 3.428     , 1.462     , 0.246     ],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])
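
Cluster ids are arbitrary, so `labels_` cannot be compared to the species codes with plain accuracy. As a sketch of one way to check the clustering against the known species, the code below builds a contingency table and computes the adjusted Rand index, which ignores how clusters are numbered. `n_init=10` and the seed are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Rows: true species; columns: cluster ids (arbitrary numbering).
table = pd.crosstab(
    iris.target_names[iris.target], km.labels_,
    rownames=["species"], colnames=["cluster"],
)
print(table)

ari = adjusted_rand_score(iris.target, km.labels_)
print(f"Adjusted Rand index: {ari:.3f}")
```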

### Isolation forest
Used to find outliers.


In [30]:
model_if = IsolationForest(random_state=42)
model_if.fit(X_clas_train)

In [31]:
model_if.predict(X_clas_train)


array([-1,  1,  1, -1, -1,  1,  1,  1,  1, -1, -1,  1,  1,  1, -1,  1,  1,
        1,  1, -1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
        1, -1,  1,  1,  1, -1,  1,  1, -1, -1,  1,  1,  1,  1,  1, -1,  1,
        1,  1, -1,  1, -1,  1, -1,  1,  1,  1,  1, -1,  1, -1, -1,  1, -1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1, -1, -1,  1,  1,  1,  1,  1,  1, -1,  1,
        1,  1, -1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,
        1])
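
With the default `contamination='auto'`, the cell above labels a sizeable share of the training rows as -1. The sketch below shows how an explicit `contamination` value bounds that share, and how `decision_function` ranks the samples. The value 0.05 is an illustrative guess at the outlier fraction, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X_iris, _ = load_iris(return_X_y=True)

# contamination=0.05 is a hypothetical estimate of the outlier fraction.
iso = IsolationForest(contamination=0.05, random_state=42).fit(X_iris)

labels = iso.predict(X_iris)            # -1 = outlier, 1 = inlier
scores = iso.decision_function(X_iris)  # negative values fall on the outlier side

n_outliers = int((labels == -1).sum())
print(n_outliers, "points flagged")
print(X_iris[np.argsort(scores)[:3]])   # the three most anomalous rows
```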

# Evaluation metrics

In [32]:
print(f"""Linear regression
  Mean Squared Error {mean_squared_error(y_regr_test, y_pred_lr)}
  R^2 {r2_score(y_regr_test, y_pred_lr)}
"""
)

Linear regression
  Mean Squared Error 0.5125322087384078
  R^2 0.6066383113891223
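
Since the target is expressed in units of $100,000, an MSE of about 0.51 corresponds to a root mean squared error of about 0.72, that is, roughly $72,000. The sketch below recomputes both metrics by hand on a few made-up values to show what the sklearn functions calculate. The arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, chosen only for illustration.
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_pred = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
print(np.sqrt(mse_manual))  # RMSE, in the same units as the target
```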



In [33]:
print(f"""Logistic regression
  Accuracy {accuracy_score(y_clas_test, y_pred_logr)}
  Recall {recall_score(y_clas_test, y_pred_logr, average=None)}
  F1 {f1_score(y_clas_test, y_pred_logr, average=None)}
"""
)

Logistic regression
  Accuracy 0.9666666666666667
  Recall [1.         0.88888889 1.        ]
  F1 [1.         0.94117647 0.95238095]
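
With `average=None`, recall and F1 are reported per class. In the output above, the third class has recall 1 but F1 below 1: every true member was found, but another class's sample was also assigned to it, which lowers precision. The sketch below reproduces that pattern on made-up labels and derives the per-class values from the confusion matrix. The arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Hypothetical labels for a 3-class problem: one class-1 sample is predicted as 2.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

recall_manual = cm.diagonal() / cm.sum(axis=1)     # correct / all true members
precision_manual = cm.diagonal() / cm.sum(axis=0)  # correct / all predicted members
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(cm)
print(recall_manual, recall_score(y_true, y_pred, average=None))
print(f1_manual, f1_score(y_true, y_pred, average=None))
```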



In [34]:
print(f"""Support vector machines
  Accuracy {accuracy_score(y_clas_test, y_pred_svc)}
  Recall {recall_score(y_clas_test, y_pred_svc, average=None)}
  F1 {f1_score(y_clas_test, y_pred_svc, average=None)}
"""
)

Support vector machines
  Accuracy 0.9333333333333333
  Recall [1.         0.88888889 0.9       ]
  F1 [1.         0.88888889 0.9       ]



In [35]:
print(f"""Decision trees
  Accuracy {accuracy_score(y_clas_test, y_pred_dt)}
  Recall {recall_score(y_clas_test, y_pred_dt, average=None)}
  F1 {f1_score(y_clas_test, y_pred_dt, average=None)}
"""
)

Decision trees
  Accuracy 0.9
  Recall [1.         0.66666667 1.        ]
  F1 [1.         0.8        0.86956522]



In [36]:
print(f"""Random forest
  Accuracy {accuracy_score(y_clas_test, y_pred_rf)}
  Recall {recall_score(y_clas_test, y_pred_rf, average=None)}
  F1 {f1_score(y_clas_test, y_pred_rf, average=None)}
"""
)

Random forest
  Accuracy 0.9
  Recall [1.         0.66666667 1.        ]
  F1 [1.         0.8        0.86956522]



In [37]:
print(f"""Neural networks
  Accuracy {accuracy_score(y_clas_test, y_pred_nn)}
  Recall {recall_score(y_clas_test, y_pred_nn, average=None)}
  F1 {f1_score(y_clas_test, y_pred_nn, average=None)}
"""
)

Neural networks
  Accuracy 0.9333333333333333
  Recall [1.         0.77777778 1.        ]
  F1 [1.         0.875      0.90909091]
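
The scores above come from a single test set of 30 samples, where one misclassified flower shifts accuracy by more than three points. As a sketch of a steadier comparison (an addition, not part of the original notebook), `cross_val_score` averages accuracy over several folds. The choice of three models, `cv=5`, and the other settings are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    # Five folds: each sample is used for testing exactly once.
    scores = cross_val_score(model, X_iris, y_iris, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```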



In [44]:
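# Exercise: measure accuracy on the training dataset
print(f"Accuracy {accuracy_score(y_clas_test, y_pred_dt)}")

# y_pred_dt holds predictions for the 30 test samples, so scoring it against
# the 120 training labels raises the ValueError below; training accuracy needs
# predictions made on X_clas_train.
print(f"Accuracy {accuracy_score(y_clas_train, y_pred_dt)}")

Accuracy 0.9


ValueError: Found input variables with inconsistent numbers of samples: [120, 30]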
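
The ValueError appears because the labels and the predictions come from different subsets. One possible solution to the exercise, sketched here on a fresh split so that it runs on its own, is to predict on the training features before scoring. The variable names and the seed are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Labels and predictions must come from the same subset.
acc_train = accuracy_score(y_tr, tree.predict(X_tr))
acc_test = accuracy_score(y_te, tree.predict(X_te))

print(f"Accuracy (train) {acc_train}")
print(f"Accuracy (test)  {acc_test}")
```

A training accuracy well above the test accuracy is a common sign of overfitting, which unpruned decision trees are prone to.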