# Sentiment Analysis with KNN and PCA

We will use the IMDB dataset from [Maas et al. (2011)](https://ai.stanford.edu/~amaas/data/sentiment/).


In [None]:
!wget https://github.com/finiteautomata/imdb-dataset/raw/master/imdb_dataset.csv.zip
!unzip imdb_dataset.csv.zip

In [1]:
import pandas as pd 

df = pd.read_csv("IMDB Dataset.csv")


print("Number of documents: {}".format(df.shape[0]))

Number of documents: 50000


In [2]:
pd.options.display.max_colwidth = 200

df[:10]

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",positive
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",positive
5,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 o...",positive
6,I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gun...,positive
7,"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny ...",negative
8,Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful i...,negative
9,"If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!",positive


We shuffle the rows so that they are no longer in their original order.

In [3]:
# sample(frac=1) draws a sample the size of the whole DataFrame, i.e. it shuffles the rows

df = df.sample(frac=1, random_state=2020)

df[:10]

Unnamed: 0,review,sentiment
24397,"Admittedly Alex has become a little podgey, but they are still (for me) the greatest rock trio, ever. I wholeheartedly recommend this DVD to any fan.<br /><br />I was very disappointed that they c...",positive
39273,"An absolutely brilliant film! Jiri Trnka, the master of puppet animation, confronts totalitarianism in this, his final, film. It would be banned by the Communist Czechoslovakian government (at the...",positive
6546,"All the boys seem to be sexually aroused by Mandy Lane. All the girls seem to be jealous of Mandy Lane. But, nothing seems to become of it, and this viewer wonders why? Mandy is beautiful and a ma...",negative
13504,"Superhero movies pretty much always suck, and this is no exception. Its only redeeming quality is the fact the movie COULD have been even worse. I would put 'Batman & Robin' and 'Steel' above this...",negative
49765,The genre of suspense films really takes a dive in this one. The big problem is IMPLAUSABILITY. I realize that you need to create difficult situations which would cause suspense and the tense feel...,negative
31361,SPOILERS AHEAD<br /><br />This is one of the worst movies ever made - it's that simple. There is not one redeeming quality about this movie. The first 10 minutes are quite tricky - they actually l...,negative
27741,This Batman movie isn't quite as good as Batman mask of The Phantasm and Batman and Mr. Freeze subzero But it is still a good installment to the Batman cartoons I say it is equally good as Batman ...,positive
47758,"As you all may know, JIGSAW did not make its way to Blackbuster because of a member of Full Moon's own staff, Devin Hamilton. Devin is the one who sells to all of the video chains. He recently rel...",negative
22393,"This is the prime example of low budget, winning over what would be a good story line. Let's bring back Samaire Armstrong (having seen her work on the O.C. I know she can do better), then find a b...",negative
18289,"""Eagle's Wing"" is a pleasant surprise of a movie, & keeps the viewer interested. I didn't know anything about it being made by the British until I read the other viewer comments. I can understand ...",positive


## Train and Test

We keep one fraction of the data for training and another for testing.

In [4]:
import sklearn

df_train = df[:10000]
df_test = df[10000:13000]

text_train, text_test = df_train["review"], df_test["review"]
label_train, label_test = df_train["sentiment"], df_test["sentiment"]

print("Class balance : {} pos {} neg".format(
    (label_train == 'positive').sum() / label_train.shape[0], 
    (label_train == 'negative').sum() / label_train.shape[0]
))

Class balance : 0.5007 pos 0.4993 neg


In [5]:
(label_test == "positive").sum() / label_test.shape[0]

0.49866666666666665

The classes are roughly balanced, so let's use accuracy (number of correct predictions / number of predictions) as the metric.
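To make the metric concrete, here is a minimal sketch that computes accuracy by hand and compares it with `accuracy_score` from `sklearn`. The labels are a made-up toy example, not taken from the IMDB data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not taken from the IMDB data
y_true = np.array(["positive", "negative", "positive", "positive", "negative"])
y_pred = np.array(["positive", "positive", "positive", "negative", "negative"])

# Accuracy = number of correct predictions / number of predictions
manual_acc = (y_true == y_pred).mean()
sklearn_acc = accuracy_score(y_true, y_pred)

print(manual_acc, sklearn_acc)
```

Three of the five predictions match, so both values are 0.6.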

## Converting to bag of words

Let's see how `CountVectorizer` works.

The general idea is that `CountVectorizer` converts a collection of texts into the bag-of-words model, where each text is represented as a vector in $\mathbb{R}^{|V|}$ and $V$ is the chosen vocabulary. Each coordinate counts how many times the corresponding word appears in the text.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

textos = [
    "bolsa de palabras",
    "bolsa es una palabra",
    "palabra no es una bolsa",
    "bolsa es una bolsa",
    "bolsa es una bolsa y es una palabra"
]

vect = CountVectorizer()

We fit it on these texts.

In [8]:
vect.fit(textos)

CountVectorizer()

In [9]:
vect.vocabulary_

{'bolsa': 0, 'de': 1, 'palabras': 5, 'es': 2, 'una': 6, 'palabra': 4, 'no': 3}

In [10]:
mat = vect.transform(textos)

mat

<5x7 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

It is a sparse matrix: only the nonzero entries are stored.

In [11]:
import pandas as pd

# Invert the vocabulary (index -> word)...
vocabulario = {v:k for k, v in vect.vocabulary_.items()}
vocabulario = [vocabulario[i] for i in range(len(vocabulario))]
# Use a separate name so the IMDB DataFrame `df` is not overwritten
df_bow = pd.DataFrame(mat.todense(), columns=vocabulario)
df_bow["texto"] = textos
df_bow

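|   | bolsa | de | es | no | palabra | palabras | una | texto |
|---|-------|----|----|----|---------|----------|-----|-------|
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | bolsa de palabras |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | bolsa es una palabra |
| 2 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | palabra no es una bolsa |
| 3 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | bolsa es una bolsa |
| 4 | 2 | 0 | 2 | 0 | 1 | 0 | 2 | bolsa es una bolsa y es una palabra |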


## Back to IMDB

Now let's apply this to our reviews.

We will not keep every word:

- Drop very frequent words
- Drop words that appear very few times

Why does this help? For the reasons we discussed in the presentation.

In [12]:
vect = CountVectorizer()

vect.fit(text_train)

len(vect.vocabulary_)

52815

That is a lot. Let's reduce it a bit.

- `min_df`: keep only words that appear in at least 3 documents
- `max_features`: keep only the 5000 most frequent words
- `binary`: only store 0 or 1, depending on whether the word appears or not
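As a rough illustration of what each option does, the sketch below applies them one at a time to the same toy corpus used earlier. The thresholds (`min_df=2`, `max_features=3`) are chosen only to fit that tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same toy corpus as in the CountVectorizer example earlier in the notebook
textos = [
    "bolsa de palabras",
    "bolsa es una palabra",
    "palabra no es una bolsa",
    "bolsa es una bolsa",
    "bolsa es una bolsa y es una palabra",
]

# min_df=2: keep only words that occur in at least 2 documents
vect_min_df = CountVectorizer(min_df=2).fit(textos)
vocab_min_df = sorted(vect_min_df.vocabulary_)

# max_features=3: keep only the 3 words with the highest total count
vect_top = CountVectorizer(max_features=3).fit(textos)
vocab_top = sorted(vect_top.vocabulary_)

# binary=True: every nonzero count is stored as 1
counts = CountVectorizer().fit_transform(textos).toarray()
binary = CountVectorizer(binary=True).fit_transform(textos).toarray()

print(vocab_min_df)
print(vocab_top)
print(counts.max(), binary.max())
```

With `min_df=2` the words that occur in a single text ("de", "no", "palabras") are dropped, and with `binary=True` the repeated "bolsa" counts as 1 instead of 2.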

In [None]:
vect = CountVectorizer(min_df=3, max_features=5000, binary=True)

vect.fit(text_train)

len(vect.vocabulary_)

In [None]:
X_train = vect.transform(text_train)
X_test = vect.transform(text_test)

# Labels stay as strings; compare with == "positive" to get boolean vectors instead
y_train = label_train
y_test = label_test

We are going to use a KNN classifier with 10 neighbors.

Recall that the idea behind KNN is the following:

1. Find the k nearest neighbors of the query point in our space $\mathbb{R}^n$
2. Hold a "vote" among those k neighbors and choose the most frequent label

We are not building any complex probabilistic representation, nor are we making any assumption about the distribution of our data. It is a **non-parametric** model.
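The two steps can be sketched from scratch with NumPy. This is only an illustration, not the `sklearn` implementation: the function `knn_predict` and the toy data are hypothetical, and Euclidean distance is an assumption of this sketch.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: vote among the labels of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data in R^2: two well-separated groups
X_toy = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
y_toy = np.array(["negative"] * 3 + ["positive"] * 3)

pred_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```

Note that there is no training step beyond storing the data: all the work happens at prediction time.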

We will use the `sklearn` implementation. See the [User Guide](https://scikit-learn.org/stable/modules/neighbors.html#classification) and the [classifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

First, we call the `fit` method, which "fits" our model to the data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=10)

clf.fit(X_train, y_train)

In [None]:
%%time
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Accuracy: {:.3f}".format(acc))

Can we improve it?

Let's look at the algorithm's performance for several values of $k$.

In [None]:
import matplotlib.pyplot as plt 
from tqdm.auto import tqdm

results = []
ks = range(2, 82, 4)

for k in tqdm(ks):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    results.append(accuracy_score(y_test, y_pred))



In [None]:
plt.plot(ks, results, "o-")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.title("Test accuracy as a function of K")

## Power method

A short intermezzo: let's see how to compute eigenvalues and eigenvectors using the power method.
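As a reminder, the iteration can be written as follows. This is a sketch under the usual assumptions: $A$ is diagonalizable, it has a single dominant eigenvalue $|\lambda_1| > |\lambda_2| \geq \dots$, and the starting vector $v^{(0)}$ has a nonzero component along the dominant eigenvector.

$$
v^{(k+1)} = \frac{A v^{(k)}}{\| A v^{(k)} \|_2},
\qquad
\lambda^{(k)} = \frac{(v^{(k)})^T A \, v^{(k)}}{(v^{(k)})^T v^{(k)}}
$$

Under those assumptions $\lambda^{(k)} \to \lambda_1$ and $v^{(k)}$ aligns with the dominant eigenvector, at a rate governed by $|\lambda_2 / \lambda_1|$.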

Implement the following functions (`power_iteration` and `eig`).

In [None]:
import numpy as np

def power_iteration(A, niter=10_000, eps=1e-6):
    """
    Computes the eigenvalue of largest absolute value and its associated eigenvector.

    Returns (a, v), with a an eigenvalue and v an eigenvector of A.

    Arguments:
    ----------

    A: np.array
        Matrix whose eigenvalue and eigenvector we want to compute

    niter: int (> 0)
        Maximum number of iterations

    eps: float
        Tolerance used in the stopping criterion
    """

    a = 1
    v = np.ones(A.shape[0])
    """
    TODO: Complete the power method

    IMPORTANT: Add a stopping criterion!
    """

    return a, v


Let's work through an example we have already seen:

$$
A = Q^T \begin{pmatrix}
d_1    & 0      & \cdots & 0      \\
0      & d_2    & \cdots & 0      \\
\vdots & \vdots & \ddots & \vdots \\
0      & 0      & \cdots & d_n
\end{pmatrix} Q
$$

with $Q = I - 2 v v^T$, $\|v\|_2 = 1$, the Householder reflection matrix, which we know is orthogonal.
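For completeness, orthogonality follows in one line. Since $(v v^T)^T = v v^T$, the matrix $Q$ is symmetric, and using $v^T v = 1$:

$$
Q^T Q = (I - 2 v v^T)^2 = I - 4 v v^T + 4 v (v^T v) v^T = I - 4 v v^T + 4 v v^T = I
$$

So $A$ is similar to the diagonal matrix, and its eigenvalues are exactly $d_1, \dots, d_n$.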

Let's try running the power method on this matrix.

In [None]:
import numpy as np

D = np.diag([5.0, 4.0, 3.0, 2.0, 1.0])

v = np.ones((D.shape[0], 1))

v = v / np.linalg.norm(v)

# Householder matrix
B = np.eye(D.shape[0]) - 2 * (v @ v.T)

# Matrix with known eigendecomposition: its eigenvalues are the diagonal of D
M = B.T @ D @ B

power_iteration(M)

### Power method + deflation

Implement the power method with deflation.

In [None]:
def eig(A, num=2, niter=10000, eps=1e-6):
    """
    Computes num eigenvalues and eigenvectors using the power method with deflation
    """
    A = A.copy()
    eigenvalues = []
    eigenvectors = np.zeros((A.shape[0], num))
    for i in range(num):
        """
        TODO: Complete the code
        """
        pass
    return np.array(eigenvalues), eigenvectors
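One possible way to fill in both stubs is sketched below; it is not meant as the reference solution. It assumes $A$ is symmetric (as `M` is), so that its eigenvectors are orthogonal and Hotelling deflation $A \leftarrow A - \lambda v v^T$ removes the eigenpair just found. The names `power_iteration_sketch` and `eig_sketch`, the tolerance, and the stopping criterion (the eigenvalue estimate stops changing) are choices made for this sketch.

```python
import numpy as np


def power_iteration_sketch(A, niter=10_000, eps=1e-10):
    """Dominant eigenpair (a, v) of A via the power method."""
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    a = v @ A @ v
    for _ in range(niter):
        w = A @ v
        norm = np.linalg.norm(w)
        if norm == 0:  # v lies in the null space of A
            return 0.0, v
        w = w / norm
        a_new = w @ A @ w  # Rayleigh quotient (w has unit norm)
        # Stopping criterion: the eigenvalue estimate has stabilized
        converged = abs(a_new - a) < eps
        a, v = a_new, w
        if converged:
            break
    return a, v


def eig_sketch(A, num=2, niter=10_000, eps=1e-10):
    """First `num` eigenpairs of a symmetric matrix via Hotelling deflation."""
    A = A.astype(float).copy()
    eigenvalues = []
    eigenvectors = np.zeros((A.shape[0], num))
    for i in range(num):
        a, v = power_iteration_sketch(A, niter=niter, eps=eps)
        eigenvalues.append(a)
        eigenvectors[:, i] = v
        # Deflation: subtract the component along the eigenvector just found
        A = A - a * np.outer(v, v)
    return np.array(eigenvalues), eigenvectors


# Same construction as the earlier example: M = B^T D B with B a Householder matrix
D = np.diag([5.0, 4.0, 3.0, 2.0, 1.0])
u = np.ones((5, 1)) / np.sqrt(5)
B = np.eye(5) - 2 * (u @ u.T)
M = B.T @ D @ B

vals, vecs = eig_sketch(M, num=3)
print(vals)
```

On this matrix the recovered values should be close to 5, 4 and 3. Because each deflation step reuses an approximate eigenvector, errors accumulate, so a tight tolerance helps when several eigenpairs are needed.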
