# Natural Language Processing I

**Author:** Gonzalo G. Fernandez

Class 1: Introduction to NLP, document vectorization.

## Challenge 1 assignment

1. Vectorize documents. Pick 5 documents at random and measure their similarity to the rest of the documents. Study the 5 most similar documents for each one and analyze whether the similarity makes sense given the text content and the classification label.

2. Train Naïve Bayes classification models to maximize classification performance (macro F1-score) on the test set. Consider changing the instantiation parameters of the vectorizer and the models, and try both the Multinomial Naïve Bayes and ComplementNB models.

3. Transpose the document-term matrix. This yields a term-document matrix that can be interpreted as a collection of word vectorizations. Now study similarity between words by taking 5 words and examining the 5 most similar to each. The words should not be chosen at random, to avoid hard-to-interpret terms; pick them "manually".

In [54]:
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, GridSearchCV
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.pipeline import Pipeline

## Solution
### Document vectorization
The *20newsgroups* dataset comprises approximately 18,000 posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the training and test sets is based on whether a message was posted before or after a specific date.

In [None]:
newsgroups_train = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes")
)
newsgroups_test = fetch_20newsgroups(
    subset="test", remove=("headers", "footers", "quotes")
)

print(newsgroups_train.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


**Tf-idf** (term frequency–inverse document frequency) is a numerical measure of how relevant a word is to a document within a collection. The term frequency counts how often the term occurs in the document, while the inverse document frequency down-weights terms that occur in many documents of the collection.

*TfidfVectorizer*: converts a collection of raw documents into a matrix of TF-IDF features.

In [None]:
tfidfvect = TfidfVectorizer()
X_train = tfidfvect.fit_transform(newsgroups_train.data)
print(f"Cantidad de documentos: {X_train.shape[0]}")

Cantidad de documentos: 11314


Selection of 5 random documents and similarity analysis:

In [13]:
random.seed(42)
# Valid row indices run from 0 to n_docs - 1, so the range must stop at n_docs.
random_idxs = random.sample(range(X_train.shape[0]), 5)
random_idxs_class = [
    newsgroups_train.target_names[newsgroups_train.target[idx]] for idx in random_idxs
]
print(f"Índices aleatorios: {random_idxs} - {random_idxs_class}")
cossim = [cosine_similarity(X_train[idx], X_train)[0] for idx in random_idxs]

Índices aleatorios: [10476, 1824, 409, 4506, 4012] - ['rec.sport.hockey', 'comp.sys.mac.hardware', 'comp.graphics', 'rec.autos', 'rec.sport.hockey']
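The retrieval step can be sketched on a small scale as follows. This is only an illustration under stated assumptions: the toy corpus and the helper `top_k_similar` are hypothetical and not part of the notebook. It also shows that, because `TfidfVectorizer` L2-normalises each row by default, cosine similarity between documents reduces to a dot product.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration.
docs = [
    "hockey playoff game on tv",
    "the playoff game went to overtime",
    "new graphics card for my computer",
    "computer graphics rendering software",
]
X = TfidfVectorizer().fit_transform(docs)


def top_k_similar(X, query_idx, k=2):
    """Return the indices of the k rows most similar to row query_idx."""
    sims = cosine_similarity(X[query_idx], X)[0]
    sims[query_idx] = -np.inf  # exclude the query document explicitly
    return np.argsort(sims)[::-1][:k]


# Rows are unit-length, so the dot product equals the cosine similarity.
print(np.allclose((X @ X.T).toarray(), cosine_similarity(X)))

print(top_k_similar(X, 0, k=1))  # nearest neighbour of the first hockey post
print(top_k_similar(X, 2, k=1))  # nearest neighbour of the first graphics post
```

Masking the query index explicitly is a design choice: skipping the first ranked position assumes the query always ranks first, which may not hold for duplicated documents or for documents left empty once headers, footers and quotes are removed.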


Function for exploring the documents retrieved through cosine similarity:

In [17]:
def dump_cossim_documents(doc_idx: int):
    """Print the chosen document and its 5 most similar training documents."""
    print(f"Documento elegido (clase {random_idxs_class[doc_idx]}):")
    print(newsgroups_train.data[random_idxs[doc_idx]])
    print("*" * 80)
    # Sort by descending similarity; position 0 is skipped because it is
    # expected to be the chosen document itself.
    for idx in np.argsort(cossim[doc_idx])[::-1][1:6]:
        class_name = newsgroups_train.target_names[newsgroups_train.target[idx]]
        print(f"Índice: {idx}, Etiqueta: {class_name}")
        print(newsgroups_train.data[idx])
        print("*" * 80)

Exploration of the randomly chosen documents and their most similar documents:

In [18]:
dump_cossim_documents(0)

Documento elegido (clase rec.sport.hockey):
This is a general question for US readers:

How extensive is the playoff coverage down there?  In Canada, it is almost
impossible not to watch a series on TV (ie the only two series I have not had
an opportunity to watch this year are Wash-NYI and Chi-Stl, the latter because
I'm in the wrong time zone!).  We (in Canada) are basically swamped with 
coverage, and I wonder how many series/games are televised nationally or even
locally in the US and how much precedence they take over, say, local news if
the games go into double-OT.

Email me so as not to waste bandwidth, please.  My news feed is kind of slow
anyways.
********************************************************************************
Índice: 5064, Etiqueta: rec.sport.hockey

I only have one comment on this:  You call this a *classic* playoff year
and yet you don't include a Chicago-Detroit series.  C'mon, I'm a Boston
fan and I even realize that Chicago-Detroit games are THE most exc

**Conclusion:** All of the retrieved documents deal with computing topics, specifically graphics, except for the chosen one, which is actually about hockey. The similarity may come from terms such as "games" and "televised".

In [19]:
dump_cossim_documents(1)

Documento elegido (clase comp.sys.mac.hardware):


	I think this kind of comparison is pretty useless in general.  The
processor is only good when a good computer is designed around it adn the
computer is used in its designed purpose.  Comparing processor speed is
pretty dumb because all you have to do is just increase the clock speed
to increase speed among other things.

	I mean how can you say a 040 is faster than a 486 without 
giving is operational conditions?  Can you say the same when 
you are running a program that uses a lot of transidental functions.
Knowing that 040 does not have transidental functions building in to 
its FPU and 486 does, can you say that 040 is still faster?

	Anyway, I hope people do not decided upon wether a computers
is good or not solely on its processor.  Or how fast a processor is
based on its name, because one can alway do a certain things to a
processor to speed it up.  

	But if we restrict our arguements to, for example, pure
processor architectu

In [20]:
dump_cossim_documents(2)

Documento elegido (clase comp.graphics):
I can't fiqure this out.  I have properly compiled pov on a unix machine
running SunOS 4.1.3  The problem is that when I run the sample .pov files and
use the EXACT same parameters when compiling different .tga outputs.  Some
of the .tga's are okay, and other's are unrecognizable by any software.
********************************************************************************
Índice: 3444, Etiqueta: comp.graphics
Hi, I'm just getting into PoVRay and I was wondering if there is a graphic
package that outputs .POV files.  Any help would be appreciated.
Thanks.

Later'ish
Craig

********************************************************************************
Índice: 5799, Etiqueta: comp.graphics
I finally got a 24 bit viewer for my POVRAY generated .TGA files.

It was written in C by Sean Malloy and he kindly sent me a copy.  He
wrote it for the same purpose, to view .TGA files using his SpeedStar 24.

It ONLY works with the SpeedStar 24 and I cann

In [21]:
dump_cossim_documents(3)

Documento elegido (clase rec.autos):

This does sound good, but I heard it tends to leave more grit, etc in the 
oil pan.  Also, I've been told to change the old when it's hot before the
grit has much time to settle.

Any opinions?

********************************************************************************
Índice: 4211, Etiqueta: rec.motorcycles


It's normal for the BMW K bikes to use a little oil in the first few thousand 
miles.  I don't know why.  I've had three new K bikes, and all three used a
bit of oil when new - max maybe .4 quart in first 1000 miles; this soon quits
and by the time I had 10,000 miles on them the oil consumption was about zero.
I've been told that the harder you run the bike (within reason) the sooner
it stops using any oil.

********************************************************************************
Índice: 5928, Etiqueta: comp.sys.mac.hardware
or
there


Okay, I guess its time for a quick explanation of Mac sound.

The original documentation for t

In [22]:
dump_cossim_documents(4)

Documento elegido (clase rec.sport.hockey):
For those Leaf fans who are concerned, the following players are slated for
return on Thursday's Winnipeg-Toronto game :
    Peter Zezel, John Cullen

  Mark Osborne and Dave Ellett are questionable to return on Thursday.
********************************************************************************
Índice: 6599, Etiqueta: soc.religion.christian
True.

Also read 2 Peter 3:16

Peter warns that the scriptures are often hard to understand by those who
are not learned on the subject.
********************************************************************************
Índice: 10644, Etiqueta: rec.sport.hockey
In  <1qvos8$r78@cl.msu.>, vergolin@euler.lbs.msu.edu (David Vergolini) writes...

There's quite a few Wings fans lurking about here, they just tend
to be low key and thoughtful rather than woofers.  I suppose every
family must have a Roger Clinton, though.  But remember (to paraphrase
one of my favorite Star Trek lines), "if we adopt the ways o

### Training Naïve Bayes classification models

Training Multinomial Naïve Bayes with a hyperparameter search:

In [50]:
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

param_grid = {
    "tfidf__max_df": [0.9, 1.0],
    "tfidf__min_df": [1, 5],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "nb__alpha": [0.01, 0.1, 1.0, 5.0],
    "nb__fit_prior": [True, False],
}

# grid_search = GridSearchCV(
#     estimator=pipeline, param_grid=param_grid, cv=5, scoring="f1_macro"
# )
search = HalvingGridSearchCV(
    pipeline, param_grid, scoring="f1_macro", cv=3, factor=2, random_state=42
)
search.fit(newsgroups_train.data, newsgroups_train.target)

tfidfvect = search.best_estimator_.named_steps["tfidf"]
clf = search.best_estimator_.named_steps["nb"]

print("Best TFIDF parameters:")
print("max_df:", tfidfvect.max_df)
print("min_df:", tfidfvect.min_df)
print("ngram_range:", tfidfvect.ngram_range)
print("use_idf:", tfidfvect.use_idf)

print("Best model:")
print("Parameters:", search.best_params_)
print("F1 score:", search.best_score_)

Best TFIDF parameters:
max_df: 0.9
min_df: 1
ngram_range: (1, 1)
use_idf: True
Best model:
Parameters: {'nb__alpha': 0.01, 'nb__fit_prior': False, 'tfidf__max_df': 0.9, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1), 'tfidf__use_idf': True}
F1 score: 0.7284207900137819


Evaluation on the test data:

In [51]:
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average="macro")

0.6895480073763225

Training Complement Naïve Bayes with a hyperparameter search:

In [52]:
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", ComplementNB())])

param_grid = {
    "tfidf__max_df": [0.9, 1.0],
    "tfidf__min_df": [1, 5],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "nb__alpha": [0.01, 0.1, 1.0, 5.0],
    "nb__fit_prior": [True, False],
    "nb__norm": [True, False],
}

search = HalvingGridSearchCV(
    pipeline, param_grid, scoring="f1_macro", cv=3, factor=2, random_state=42
)
search.fit(newsgroups_train.data, newsgroups_train.target)

tfidfvect = search.best_estimator_.named_steps["tfidf"]
clf = search.best_estimator_.named_steps["nb"]

print("Best TFIDF parameters:")
print("max_df:", tfidfvect.max_df)
print("min_df:", tfidfvect.min_df)
print("ngram_range:", tfidfvect.ngram_range)
print("use_idf:", tfidfvect.use_idf)

print("Best model:")
print("Parameters:", search.best_params_)
print("F1 score:", search.best_score_)

Best TFIDF parameters:
max_df: 0.9
min_df: 1
ngram_range: (1, 1)
use_idf: False
Best model:
Parameters: {'nb__alpha': 0.1, 'nb__fit_prior': False, 'nb__norm': False, 'tfidf__max_df': 0.9, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1), 'tfidf__use_idf': False}
F1 score: 0.7335787847219347


Evaluation on the test data:

In [53]:
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average="macro")

0.6987411616066841

**Conclusions:**
- The Complement Naïve Bayes model gives slightly better results than the Multinomial Naïve Bayes model (macro F1 on test of 0.699 versus 0.690).
- With the hyperparameter search, the models reach a macro F1-score of up to about 0.70 on test.
- `HalvingGridSearchCV` was used to cut execution time, but `GridSearchCV` would probably yield better-performing models.
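The difference between the two search strategies is that `GridSearchCV` evaluates every parameter combination on the full training data, while successive halving starts with few samples and discards candidates early, so it can miss the best combination. The sketch below shows what an exhaustive search could look like on the same kind of pipeline. It is not the notebook's experiment: the toy corpus, the labels and the reduced grid are hypothetical, chosen only so that the example runs in seconds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only for illustration (0 = hockey, 1 = space).
train_docs = [
    "the team won the hockey game",
    "great goal in the playoff game",
    "the goalie saved the puck",
    "hockey playoff series tonight",
    "nasa launched the space shuttle",
    "the shuttle reached orbit",
    "a satellite in orbit around earth",
    "nasa plans a new space mission",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["the hockey team scored a goal", "the space shuttle is in orbit"]
test_labels = [0, 1]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", ComplementNB())])

# Deliberately small grid; every combination is cross-validated.
param_grid = {
    "tfidf__use_idf": [True, False],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(train_docs, train_labels)

# The refitted pipeline vectorizes raw text itself, so it can predict directly.
y_pred = search.best_estimator_.predict(test_docs)
print("Best parameters:", search.best_params_)
print("Test macro F1:", f1_score(test_labels, y_pred, average="macro"))
```

Calling `predict` on `best_estimator_` keeps the vectorizer and the classifier together, which avoids having to transform the test data with the right fitted vectorizer by hand.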

### Transposing the document-term matrix
Transposing the document-term matrix yields a term-document matrix that can be interpreted as a collection of word vectorizations: each row describes a term by the documents in which it appears.
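As a small-scale illustration of this idea, the sketch below builds both matrices for a toy corpus and compares terms with cosine similarity. The corpus and the variable names `dt_matrix` and `term_sims` are hypothetical. Note that the term vectors are not unit-length (the TF-IDF normalisation is applied per document, not per term), so `cosine_similarity` performs the normalisation itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration.
docs = [
    "nasa space shuttle",
    "space shuttle launch",
    "hockey game score",
    "hockey score",
]

vectorizer = TfidfVectorizer()
dt_matrix = vectorizer.fit_transform(docs)  # shape: (n_docs, n_terms)
td_matrix = dt_matrix.T.tocsr()             # shape: (n_terms, n_docs)
vocab = vectorizer.vocabulary_              # term -> row index in td_matrix

# Pairwise similarity between all term vectors: (n_terms, n_terms).
term_sims = cosine_similarity(td_matrix)

print(dt_matrix.shape, td_matrix.shape)
print(term_sims[vocab["hockey"], vocab["score"]])  # terms sharing documents
print(term_sims[vocab["hockey"], vocab["space"]])  # terms with no shared document
```

Terms that always occur in the same documents end up with parallel vectors, while terms that never co-occur are orthogonal.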

In [91]:
tfidfvect = TfidfVectorizer()
X_train = tfidfvect.fit_transform(newsgroups_train.data)
td_matrix = X_train.T
vocab = tfidfvect.vocabulary_  # maps term -> row index in td_matrix
index_term = {v: k for k, v in vocab.items()}

words = ["wheel", "score", "space", "god", "transistor"]
cossim = [cosine_similarity(td_matrix[vocab[word]], td_matrix)[0] for word in words]

In [94]:
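def dump_cossim_terms(word_idx: int):
    """Print the chosen word and its 5 most similar terms."""
    print("Palabra elegida:", words[word_idx])
    print("Similares:", end=" ")
    # Position 0 is skipped: it is expected to be the chosen word itself.
    for idx in np.argsort(cossim[word_idx])[::-1][1:6]:
        print(index_term[idx], end=", ")
    print("\n")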


for i in range(len(words)):
    dump_cossim_terms(i)

Palabra elegida: wheel
Similares: steering, hops, _rear_, 1890s, tiller, 

Palabra elegida: score
Similares: scoresheets, throws, inquisitively, zamboni, homered, 

Palabra elegida: space
Similares: nasa, seds, shuttle, enfant, seti, 

Palabra elegida: god
Similares: jesus, bible, that, existence, christ, 

Palabra elegida: transistor
Similares: ym3623b, bc546b, pdif, mouser, bat85, 



It can be seen that, for the chosen words, cosine similarity over the transposed matrix yields terms close to the topic each word can be associated with.

- *wheel*: words related to cars and motorcycles (1890s perhaps because of the year of some model?)
- *score*: words related to sports. zamboni is related to ice hockey, homered to baseball, throws to American football.
- *space*: words strictly related to space research, for example nasa, shuttle, seti (search for extraterrestrial intelligence).
- *god*: words related to religion, for example jesus, bible, existence, christ; the only exception is the stop word "that".
- *transistor*: words corresponding to electronic component part numbers (for example the bc546b transistor) or suppliers (mouser).