In [1]:
%load_ext watermark
%watermark

2020-09-16T14:32:53-05:00

CPython 3.7.6
IPython 7.13.0

compiler   : GCC 7.3.0
system     : Linux
release    : 5.4.0-47-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit


In [2]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (12,12)
np.random.seed(42)

## KNN - K nearest neighbors

Let's see how to use the KNN algorithm in scikit-learn.

The KNN algorithm can be used both for classification problems (with the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) estimator) and for regression problems (with the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) estimator).

In [3]:
pelis = pd.read_csv("datos_peliculas.csv")
pelis.head()

| | pelicula | año | ratings | genero | ventas | presupuesto | secuela | vistas_youtube | positivos_youtube | negativos_youtube | comentarios | seguidores_agregados |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 Sins | 2014 | 6.3 | 8 | 9130 | 4000000.0 | 1 | 3280543 | 4632 | 425 | 636 | 1120000.0 |
| 1 | 22 Jump Street | 2014 | 7.1 | 1 | 192000000 | 50000000.0 | 2 | 583289 | 3465 | 61 | 186 | 12350000.0 |
| 2 | 3 Days to Kill | 2014 | 6.2 | 1 | 30700000 | 28000000.0 | 1 | 304861 | 328 | 34 | 47 | 483000.0 |
| 3 | 300: Rise of an Empire | 2014 | 6.3 | 1 | 106000000 | 110000000.0 | 2 | 452917 | 2429 | 132 | 590 | 568000.0 |
| 4 | A Haunted House 2 | 2014 | 4.7 | 8 | 17300000 | 3500000.0 | 2 | 3145573 | 12163 | 610 | 1082 | 1923800.0 |


In [4]:
pelis.shape

(231, 12)

In [5]:
pelis = pelis.drop("pelicula", axis=1)

In [6]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split

## KNN for classification problems

Predict the genre of a movie based on its popularity.

In [7]:
from sklearn.metrics import f1_score

In [8]:
variable_objetivo_clasificacion = "genero"
variables_independientes_clasificacion = pelis.drop(
    [variable_objetivo_clasificacion], axis=1).columns

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    pelis[variables_independientes_clasificacion],
    pelis[variable_objetivo_clasificacion], test_size=0.20)

In [10]:
KNeighborsClassifier?

The estimator has several important parameters:

* n_neighbors: number of neighbors to consider.
* weights: how the neighbors' votes are weighted. There are two built-in options:
    * uniform: all neighbors have the same weight regardless of distance.
    * distance: each neighbor is weighted by the inverse of its distance, so closer neighbors count more.
* metric: how the distance between points is computed (the default, minkowski with p=2, is the Euclidean distance).
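To make the effect of weights concrete, here is a minimal NumPy sketch of the voting step. It is not scikit-learn's implementation: the function `knn_vote` and the toy data are hypothetical, and the small floor on the distance is a simplification to avoid dividing by zero.

```python
import numpy as np

def knn_vote(X_train, y_train, x, k=3, weights="uniform"):
    """Predict the class of x by a (possibly weighted) vote of its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x to every training point
    dist = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    if weights == "uniform":
        w = np.ones(len(nearest))
    else:
        # inverse-distance weights (hypothetical floor avoids division by zero)
        w = 1.0 / np.maximum(dist[nearest], 1e-12)
    labels = y_train[nearest]
    classes = np.unique(labels)
    # total weight collected by each class among the k neighbors
    totals = np.array([w[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# One very close point of class 1 and two farther points of class 0
X_toy = [[0.1], [1.0], [1.1]]
y_toy = [1, 0, 0]
vote_uniform = knn_vote(X_toy, y_toy, [0.0], k=3, weights="uniform")
vote_distance = knn_vote(X_toy, y_toy, [0.0], k=3, weights="distance")
print(vote_uniform, vote_distance)
```

With uniform weights the two class-0 points outvote the single class-1 point; with distance weights the very close class-1 point dominates the vote.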

In [11]:
X_train.head()

| | año | ratings | ventas | presupuesto | secuela | vistas_youtube | positivos_youtube | negativos_youtube | comentarios | seguidores_agregados |
|---|---|---|---|---|---|---|---|---|---|---|
| 55 | 2014 | 8.7 | 188000000 | 165000000.0 | 1 | 5421705 | 16635 | 751 | 4316 | 1865000.0 |
| 229 | 2015 | 5.4 | 12300000 | 3000000.0 | 1 | 66872 | 400 | 67 | 201 | 0.0 |
| 69 | 2014 | 6.4 | 127000000 | 40000000.0 | 1 | 1142964 | 2346 | 167 | 311 | 0.0 |
| 168 | 2015 | 6.7 | 183000000 | 29000000.0 | 2 | 9214467 | 39824 | 998 | 1987 | 7336000.0 |
| 109 | 2014 | 7.1 | 2590000 | 9500000.0 | 1 | 134353 | 280 | 43 | 308 | 0.0 |


In [12]:
clasificador_knn = KNeighborsClassifier(n_neighbors=10, 
                                        weights="uniform")

In [13]:
clasificador_knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=10)

In [14]:
preds = clasificador_knn.predict(X_test)
preds[:10]

array([3, 8, 8, 1, 3, 2, 8, 3, 2, 1])

In [15]:
# score() returns accuracy; for single-label multiclass problems it equals the micro-averaged F1
print(preds[:10], "\nScore F1: {}".format(clasificador_knn.score(X_test,y_test)))

f1_score(y_test, preds, average="micro")

[3 8 8 1 3 2 8 3 2 1] 
Score F1: 0.3617021276595745


0.3617021276595745
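The two numbers above are identical because, when every sample has exactly one label, the micro-averaged F1 reduces to accuracy: each misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class). A small NumPy sketch of that identity, using a hypothetical helper `micro_f1` and made-up labels:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes before computing F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = fp = fn = 0
    for c in np.union1d(y_true, y_pred):
        tp += np.sum((y_pred == c) & (y_true == c))
        fp += np.sum((y_pred == c) & (y_true != c))
        fn += np.sum((y_pred != c) & (y_true == c))
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels, only for illustration
y_true_toy = [1, 3, 8, 8, 2, 1]
y_pred_toy = [1, 8, 8, 3, 2, 3]

accuracy = np.mean(np.asarray(y_true_toy) == np.asarray(y_pred_toy))
f1_micro_toy = micro_f1(y_true_toy, y_pred_toy)
print(accuracy, f1_micro_toy)
```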

In [16]:
clasificador_knn = KNeighborsClassifier(n_neighbors=10,
                                       weights="distance")

clasificador_knn.fit(X_train, y_train)
preds = clasificador_knn.predict(X_test)
# score() returns accuracy, which matches the micro-averaged F1 computed on the next line
print(preds[:10], "\nscore F1: {}".format(clasificador_knn.score(X_test,y_test)))
f1_score(y_test,preds,average="micro")

[3 8 8 1 3 2 8 3 3 9] 
score F1: 0.3829787234042553


0.38297872340425526

In [17]:
from sklearn.metrics import classification_report

In [18]:
print(classification_report(y_test,preds))

              precision    recall  f1-score   support

           1       0.31      0.50      0.38        10
           2       0.00      0.00      0.00         2
           3       0.45      0.42      0.43        12
           4       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         0
           8       0.57      0.67      0.62        12
           9       0.00      0.00      0.00         4
          10       0.00      0.00      0.00         3
          12       0.00      0.00      0.00         3
          15       0.00      0.00      0.00         0

    accuracy                           0.38        47
   macro avg       0.13      0.16      0.14        47
weighted avg       0.33      0.38      0.35        47





In [19]:
classification_report?

We can use the kneighbors method of the fitted classifier to retrieve the nearest neighbors of a point.

In [20]:
X_test.iloc[0]

año                         2015.0
ratings                        7.7
ventas                  49500000.0
presupuesto             30000000.0
secuela                        1.0
vistas_youtube          11476882.0
positivos_youtube          40496.0
negativos_youtube           1383.0
comentarios                 4435.0
seguidores_agregados           0.0
Name: 218, dtype: float64

In [21]:
distancia, indice = clasificador_knn.kneighbors(
    [X_test.iloc[0]], n_neighbors=1
)
distancia, indice

(array([[8648469.61304606]]), array([[156]]))

In [22]:
X_train.iloc[indice[0]]

| | año | ratings | ventas | presupuesto | secuela | vistas_youtube | positivos_youtube | negativos_youtube | comentarios | seguidores_agregados |
|---|---|---|---|---|---|---|---|---|---|---|
| 191 | 2015 | 7.3 | 42500000 | 25000000.0 | 1 | 11036701 | 50002 | 1005 | 3525 | 776000.0 |
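The distance returned by kneighbors can be reproduced by hand from the two rows printed above, which also shows which columns drive it. The sketch below copies the values of the query row and of its nearest neighbor. The variable names are hypothetical.

```python
import numpy as np

# Values copied from the two rows printed above:
# the query point (X_test.iloc[0]) and its nearest neighbor in X_train
columns = ["año", "ratings", "ventas", "presupuesto", "secuela", "vistas_youtube",
           "positivos_youtube", "negativos_youtube", "comentarios", "seguidores_agregados"]
query = np.array([2015, 7.7, 49500000, 30000000, 1,
                  11476882, 40496, 1383, 4435, 0], dtype=float)
neighbor = np.array([2015, 7.3, 42500000, 25000000, 1,
                     11036701, 50002, 1005, 3525, 776000], dtype=float)

squared_diff = (query - neighbor) ** 2
distance = np.sqrt(squared_diff.sum())       # Euclidean distance
share = squared_diff / squared_diff.sum()    # contribution of each column

print(distance)
for name, s in zip(columns, share):
    print(f"{name:22s}{s:8.4f}")
```

Almost all of the squared distance comes from ventas and presupuesto, simply because those columns have the largest magnitudes, while ratings and año contribute essentially nothing. Standardizing the features before fitting (for example with scikit-learn's StandardScaler) is a common way to keep a few large-scale columns from dominating the neighbor search.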


## KNN for regression problems

In [23]:
from sklearn.metrics import mean_squared_error

variable_objetivo_regresion = "ventas"
variables_independientes_regresion = pelis.drop(variable_objetivo_regresion,axis=1).columns

In [26]:
X_train, X_test, y_train, y_test = train_test_split(
    pelis[variables_independientes_regresion],
    pelis[variable_objetivo_regresion], test_size=0.20)

In [27]:
KNeighborsRegressor?

It takes the same parameters as the classifier.

In [28]:
regresor_knn = KNeighborsRegressor(n_neighbors=10, weights="distance")

regresor_knn.fit(X_train,y_train)

KNeighborsRegressor(n_neighbors=10, weights='distance')
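For regression, the prediction is an average of the target values of the k nearest neighbors; with weights="distance" that average is weighted by the inverse of each neighbor's distance. A minimal NumPy sketch of the idea follows. The function `knn_regress` and the toy data are hypothetical, and the floor on the distance is a simplification rather than scikit-learn's exact handling of zero distances.

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=2, weights="distance"):
    """Predict a value for x as the (weighted) mean of its k nearest neighbors' targets."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    dist = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    if weights == "uniform":
        w = np.ones(len(nearest))
    else:
        # inverse-distance weights (hypothetical floor avoids division by zero)
        w = 1.0 / np.maximum(dist[nearest], 1e-12)
    return float(np.average(y_train[nearest], weights=w))

X_toy = [[1.0], [2.0], [4.0]]
y_toy = [10.0, 20.0, 40.0]
pred_uniform = knn_regress(X_toy, y_toy, [0.0], k=2, weights="uniform")
pred_distance = knn_regress(X_toy, y_toy, [0.0], k=2, weights="distance")
print(pred_uniform, pred_distance)
```

With uniform weights the prediction is the plain mean of the two nearest targets (15); with distance weights the nearer point counts twice as much, pulling the prediction toward 10.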

In [29]:
preds = regresor_knn.predict(X_test)
preds

array([5.77598582e+06, 1.99218779e+08, 5.22718267e+07, 3.04748097e+06,
       7.80062231e+07, 4.78727214e+07, 5.21240238e+07, 5.93022167e+07,
       7.49468197e+07, 9.12329679e+07, 7.03489473e+07, 1.82692317e+08,
       2.82445048e+07, 1.08624687e+07, 3.97173720e+07, 4.03493052e+07,
       3.63396779e+07, 3.23651506e+07, 1.21506362e+07, 2.18446367e+07,
       1.29536895e+08, 1.12369979e+07, 4.43514409e+07, 4.01796045e+07,
       1.32035921e+07, 1.02480559e+08, 9.88894202e+07, 9.56558550e+07,
       7.74208639e+07, 3.06222002e+08, 9.01985135e+07, 3.57704603e+06,
       7.15873565e+06, 1.59847370e+08, 5.88276776e+07, 2.23216488e+08,
       1.27252372e+07, 3.99596725e+07, 3.71233940e+07, 8.00191716e+07,
       3.10354554e+07, 3.86931786e+07, 1.94912972e+08, 2.23710764e+08,
       5.81636294e+07, 2.26283159e+07, 1.19539899e+07])

In [30]:
# mean_squared_error returns the MSE; its square root is the RMSE
print("root mean squared error: {}".format(
    np.sqrt(mean_squared_error(y_test, preds))))

root mean squared error: 43959789.048082694


Performance of the classifier and the regressor with cross-validation

In [31]:
from sklearn.model_selection import cross_val_score

In [35]:
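# Note: np.sqrt is not meaningful for an F1 score (it is only needed to turn an MSE into an RMSE).
# The mean micro-F1 itself is the square of the value printed below, roughly 0.307.
error = np.sqrt(cross_val_score(KNeighborsClassifier(n_neighbors=10, weights="distance"),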
                       X=pelis[variables_independientes_clasificacion],
                       y=pelis[variable_objetivo_clasificacion],scoring="f1_micro").mean()
               )
error



0.5541033543388334

In [37]:
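# scoring="neg_mean_squared_error" returns negated MSE values (higher is better),
# so take the absolute value of the mean and then the square root to get an RMSE
error = np.sqrt(np.abs((cross_val_score(KNeighborsRegressor(n_neighbors=10, weights="distance"),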
                       X=pelis[variables_independientes_regresion],
                       y=pelis[variable_objetivo_regresion],scoring="neg_mean_squared_error").mean()
               )))
error

67491922.421441
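The sign convention of neg_mean_squared_error is easy to check on synthetic data. The sketch below does not use the movie dataset: the data from make_regression and the variable names are assumptions for illustration only. It also compares two ways of summarizing the folds: the square root of the mean MSE (as in the cell above) and the mean of the per-fold RMSE values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, only for illustration
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One score per fold; each score is the negated MSE of that fold
neg_mse = cross_val_score(
    KNeighborsRegressor(n_neighbors=10, weights="distance"),
    X_syn, y_syn, scoring="neg_mean_squared_error", cv=5,
)

rmse_per_fold = np.sqrt(-neg_mse)           # RMSE of each fold
rmse_from_mean_mse = np.sqrt(-neg_mse.mean())  # square root of the averaged MSE

print(neg_mse)
print(rmse_per_fold.mean(), rmse_from_mean_mse)
```

The two summaries are close but not identical: the mean of the per-fold RMSE values can never exceed the square root of the mean MSE, so it is worth stating which one is being reported.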