# Text vectorization and Naïve Bayes classification with the 20 newsgroups dataset

### Challenge 1 assignment

**1**. Vectorize the documents. Pick 5 documents at random and measure their similarity to the rest of the documents.
For each one, study its 5 most similar documents and analyze whether the similarity makes sense
given the text content and the classification label.

**2**. Train Naïve Bayes classification models to maximize classification performance
(macro F1-score) on the test set. Consider changing the instantiation parameters
of the vectorizer and of the models, and try both the Multinomial Naïve Bayes
and ComplementNB models.

**3**. Transpose the document-term matrix. This yields a term-document matrix
that can be interpreted as a collection of word vectors.
Now study similarity between words by picking 5 words and examining the 5 most similar to each.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score

from sklearn.datasets import fetch_20newsgroups
import numpy as np

## Data loading

In [2]:
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

## Vectorization with TfidfVectorizer

In [5]:
tfidfvect = TfidfVectorizer()


In [6]:
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [7]:
X_train = tfidfvect.fit_transform(newsgroups_train.data)
# `X_train` is what we can call the document-term matrix

In [8]:
print(type(X_train))
print(f'shape: {X_train.shape}')
print(f'number of documents: {X_train.shape[0]}')
print(f'vocabulary size (dimensionality of the vectors): {X_train.shape[1]}')

<class 'scipy.sparse._csr.csr_matrix'>
shape: (11314, 101631)
number of documents: 11314
vocabulary size (dimensionality of the vectors): 101631
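
To make the shape above concrete, here is a minimal sketch on a three-sentence toy corpus (the sentences and the names `toy_docs`, `toy_vect`, `X_toy` are invented for illustration and are not part of 20 newsgroups). Each row is a document, each column a vocabulary term, and with the default settings of `TfidfVectorizer` each non-empty row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
toy_docs = [
    "the car is fast",
    "the car is red",
    "space shuttle launch",
]

toy_vect = TfidfVectorizer()
X_toy = toy_vect.fit_transform(toy_docs)  # sparse document-term matrix

print(X_toy.shape)                   # (documents, vocabulary terms)
print(sorted(toy_vect.vocabulary_))  # terms, in column order
print(toy_vect.vocabulary_["car"])   # column index of the term 'car'

# With the default norm='l2', every non-empty row has unit length
row_norms = np.sqrt(np.asarray(X_toy.multiply(X_toy).sum(axis=1))).ravel()
print(row_norms)
```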


In [9]:
tfidfvect.vocabulary_['car']

25775

In [19]:
idx2word = {v: k for k,v in tfidfvect.vocabulary_.items()}

In [20]:
y_train = newsgroups_train.target
y_train[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [21]:
print(f'classes {np.unique(newsgroups_test.target)}')
newsgroups_test.target_names

classes [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Document similarity

In [44]:
# Sample 5 distinct document indices (no seed is set, so the sample changes between runs)
indices_documentos_aleatorios = np.random.choice(X_train.shape[0], size=5, replace=False)

# Cosine similarity between each sampled document and every training document
similarity_matrix = cosine_similarity(X_train[indices_documentos_aleatorios], X_train)

similar_documents_dict = {}

# Find the 5 most similar documents for each sampled document
for i, doc_index in enumerate(indices_documentos_aleatorios):
    # Rank all documents by descending similarity and drop the document itself
    # explicitly, since it is not guaranteed to rank first (e.g. exact duplicates)
    ranked_indices = np.argsort(similarity_matrix[i])[::-1]
    similar_indices = ranked_indices[ranked_indices != doc_index][:5]
    similar_documents_dict[doc_index] = similar_indices
    original_class_label = newsgroups_train.target_names[y_train[doc_index]]

    print(f"Document {i+1} (Index: {doc_index}, Class: {original_class_label})")
    print("Content:")
    #print(newsgroups_train.data[doc_index])
    print()

    print("Top 5 most similar documents:")
    for j, similar_index in enumerate(similar_indices):
        similar_class_label = newsgroups_train.target_names[y_train[similar_index]]
        print(f"Similar Document {j+1} (Index: {similar_index}, Class: {similar_class_label})")

    print()


Document 1 (Index: 8110, Class: comp.sys.ibm.pc.hardware)
Content:

Top 5 most similar documents:
Similar Document 1 (Index: 2997, Class: comp.sys.ibm.pc.hardware)
Similar Document 2 (Index: 196, Class: sci.electronics)
Similar Document 3 (Index: 7773, Class: sci.electronics)
Similar Document 4 (Index: 3412, Class: sci.electronics)
Similar Document 5 (Index: 6626, Class: comp.sys.ibm.pc.hardware)

Document 2 (Index: 8155, Class: sci.med)
Content:

Top 5 most similar documents:
Similar Document 1 (Index: 385, Class: sci.med)
Similar Document 2 (Index: 8550, Class: sci.med)
Similar Document 3 (Index: 8660, Class: sci.med)
Similar Document 4 (Index: 8899, Class: sci.med)
Similar Document 5 (Index: 2189, Class: sci.med)

Document 3 (Index: 6606, Class: misc.forsale)
Content:

Top 5 most similar documents:
Similar Document 1 (Index: 6069, Class: comp.sys.mac.hardware)
Similar Document 2 (Index: 10584, Class: sci.space)
Similar Document 3 (Index: 3362, Class: misc.forsale)
Similar Document 4

In [45]:
similar_documents_dict

{8110: array([2997,  196, 7773, 3412, 6626]),
 8155: array([ 385, 8550, 8660, 8899, 2189]),
 6606: array([ 6069, 10584,  3362,  9668,  8483]),
 9818: array([ 3717,  5568,  9010, 11118,  2797]),
 8359: array([10139,  2557,  1239, 10122,  7342])}
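
Because `TfidfVectorizer` L2-normalizes rows by default, the cosine similarity used above reduces to a plain dot product between rows. The following NumPy sketch, with made-up vectors that are not taken from the dataset, illustrates the definition and one explicit way to leave the query out of its own ranking. The helper `cosine` and the array `docs` are assumptions introduced here.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up document vectors (rows), for illustration only
docs = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as row 0
    [0.0, 0.0, 3.0],   # shares no terms with row 0
    [1.0, 0.0, 1.0],
])

# After L2-normalizing the rows, cosine similarity is just a dot product
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
sims = unit @ unit.T

query = 0
ranked = np.argsort(sims[query])[::-1]
neighbors = ranked[ranked != query][:2]   # drop the query itself, keep the top 2
print(neighbors, sims[query, neighbors])
```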

In [48]:
for key, values in similar_documents_dict.items():
    print(f"DOCUMENT INDEX: {key}")
    print()
    print(newsgroups_train.data[key])
    print()
    print("SIMILAR DOCUMENTS: ")
    print()
    for value in values:
        print(f"INDEX: {value}")
        print()
        print(newsgroups_train.data[value])
        print()

DOCUMENT INDEX: 8110

I am looking for a CDROM audio cable to connect my Toshiba 3401B (L/R audio) to
the Pro Audio Spectrum 16 sound card.  Thanks in advance for any pointers...

SIMILAR DOCUMENTS: 

INDEX: 2997

Hi, I need some advice from the netland in selecting a sound card.

I am about to buy a sound card for my kid. I don't know which one to buy.
Which one to select from the following list:

- Sound Blaster 16
- Miscrosoft- sound card
- Audio Spectrum
- Sound Blaster pro
- Sound Blaster


My allocated budget is around $250.


Could some of you know about sound cards help me to select the most appropriate
one for my kid ?


I have 486-33 Mz OPTI MB.
I also have NEC CDROM that I would like to connect to the sound card.


Thank you.



INDEX: 196


As a general rule, no relay will cleanly switch audio if you try to tranfer
the circuit with the contacts.  The noise you hear is due to the momentary
opening and closing of the path.

The noiseless way of transfering audio is to ground 

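For each of the 5 sampled documents, the retrieved documents are indeed similar. In some cases the class label confirms it: every neighbor of document 8155 (`sci.med`) is also labeled `sci.med`. In others the labels differ, but reading the text shows related subject matter: document 8110 (`comp.sys.ibm.pc.hardware`) asks about an audio cable for a sound card, and its `sci.electronics` neighbor 196 discusses switching audio signals.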
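
One way to quantify the label side of this analysis is the fraction of retrieved neighbors that share the query document's class. The sketch below hard-codes the class labels printed earlier for the first two sampled documents (the third listing is cut off, so it is left out). The helper name `label_agreement` is an assumption introduced here.

```python
def label_agreement(query_label, neighbor_labels):
    """Fraction of neighbors whose class matches the query document's class."""
    matches = sum(label == query_label for label in neighbor_labels)
    return matches / len(neighbor_labels)

# Labels copied from the printed output for documents 8110 and 8155
retrieved = {
    8110: ("comp.sys.ibm.pc.hardware", [
        "comp.sys.ibm.pc.hardware", "sci.electronics", "sci.electronics",
        "sci.electronics", "comp.sys.ibm.pc.hardware",
    ]),
    8155: ("sci.med", ["sci.med"] * 5),
}

for doc_index, (query_label, neighbor_labels) in retrieved.items():
    print(doc_index, label_agreement(query_label, neighbor_labels))
```

A low agreement does not necessarily mean a bad match: at least one `sci.electronics` neighbor of document 8110 (index 196) still deals with audio circuitry.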

### Naïve Bayes classification model

In [62]:
mnb_clf = MultinomialNB()
cnb_clf = ComplementNB()


In [63]:
mnb_clf.fit(X_train, y_train)
cnb_clf.fit(X_train, y_train)

In [64]:
# with our vectorizer already fitted on train, we vectorize the texts
# of the test set
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target
y_pred_mnb = mnb_clf.predict(X_test)
y_pred_cnb = cnb_clf.predict(X_test)

In [65]:
f1_mnb = f1_score(y_test, y_pred_mnb, average='macro')
f1_cnb = f1_score(y_test, y_pred_cnb, average='macro')

In [66]:
print("F1-score macro - Multinomial Naïve Bayes:", f1_mnb)
print("F1-score macro - Complement Naïve Bayes:", f1_cnb)

F1-score macro - Multinomial Naïve Bayes: 0.5854345727938506
F1-score macro - Complement Naïve Bayes: 0.692953349950875
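
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so each of the 20 newsgroups counts equally regardless of its size. Below is a dependency-free sketch on made-up labels. The function name `macro_f1` is introduced here for illustration only; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples contributes 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
print(macro_f1(y_true_toy, y_pred_toy))
```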


### Experiments with tfidfvect variants

In [67]:
# tfidfvect variants
tfidfvect_variant1 = TfidfVectorizer(ngram_range=(1, 1), min_df=1, max_df=1.0)
tfidfvect_variant2 = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.8)

In [68]:
X_train_variant1 = tfidfvect_variant1.fit_transform(newsgroups_train.data)
X_train_variant2 = tfidfvect_variant2.fit_transform(newsgroups_train.data)

In [69]:
X_test_variant1 = tfidfvect_variant1.transform(newsgroups_test.data)
X_test_variant2 = tfidfvect_variant2.transform(newsgroups_test.data)

In [70]:
mnb_classifier_variant1 = MultinomialNB()
cnb_classifier_variant1 = ComplementNB()

In [72]:
mnb_classifier_variant1.fit(X_train_variant1, y_train)
cnb_classifier_variant1.fit(X_train_variant1, y_train)

In [73]:
y_pred_mnb_variant1 = mnb_classifier_variant1.predict(X_test_variant1)
y_pred_cnb_variant1 = cnb_classifier_variant1.predict(X_test_variant1)

In [74]:
f1_mnb_variant1 = f1_score(y_test, y_pred_mnb_variant1, average='macro')
f1_cnb_variant1 = f1_score(y_test, y_pred_cnb_variant1, average='macro')

In [75]:
print("Variant 1 - F1-score macro - Multinomial Naïve Bayes:", f1_mnb_variant1)
print("Variant 1 - F1-score macro - Complement Naïve Bayes:", f1_cnb_variant1)

Variant 1 - F1-score macro - Multinomial Naïve Bayes: 0.5854345727938506
Variant 1 - F1-score macro - Complement Naïve Bayes: 0.692953349950875


In [78]:
mnb_classifier_variant2 = MultinomialNB()
cnb_classifier_variant2 = ComplementNB()

In [79]:
mnb_classifier_variant2.fit(X_train_variant2, y_train)
cnb_classifier_variant2.fit(X_train_variant2, y_train)

In [80]:
y_pred_mnb_variant2 = mnb_classifier_variant2.predict(X_test_variant2)
y_pred_cnb_variant2 = cnb_classifier_variant2.predict(X_test_variant2)

In [81]:
f1_mnb_variant2 = f1_score(y_test, y_pred_mnb_variant2, average='macro')
f1_cnb_variant2 = f1_score(y_test, y_pred_cnb_variant2, average='macro')

In [82]:
print("Variant 2 - F1-score macro - Multinomial Naïve Bayes:", f1_mnb_variant2)
print("Variant 2 - F1-score macro - Complement Naïve Bayes:", f1_cnb_variant2)

Variant 2 - F1-score macro - Multinomial Naïve Bayes: 0.5703496397235439
Variant 2 - F1-score macro - Complement Naïve Bayes: 0.6878218782600645


The F1-score changes when the vectorizer's instantiation parameters are modified. Variant 1 spells out the default values (`ngram_range=(1, 1)`, `min_df=1`, `max_df=1.0`), so it uses only unigrams, discards nothing, and reproduces the baseline scores exactly. Variant 2 uses unigrams and bigrams and applies document-frequency thresholds: `min_df=2` ignores terms that appear in fewer than two documents, and `max_df=0.8` ignores terms that appear in more than 80% of the documents. In this case the second variant yields a slightly lower macro F1-score for both classifiers (0.570 vs. 0.585 for MultinomialNB and 0.688 vs. 0.693 for ComplementNB). Since three parameters change at once, the drop cannot be attributed to any single one of them.

### Experiments with different alpha values

In [85]:
alpha_values = [0.1, 1.0, 10.0]

In [87]:
from sklearn.pipeline import Pipeline

pipelines = []
for alpha in alpha_values:
    # MultinomialNB pipelines
    mnb_pipeline = Pipeline([
        ('clf', MultinomialNB(alpha=alpha))
    ])
    pipelines.append(('MultinomialNB_alpha_' + str(alpha), mnb_pipeline))
    
    # ComplementNB pipelines
    cnb_pipeline = Pipeline([
        ('clf', ComplementNB(alpha=alpha))
    ])
    pipelines.append(('ComplementNB_alpha_' + str(alpha), cnb_pipeline))

In [92]:
from sklearn.metrics import f1_score 

results = []
for name, pipeline in pipelines:
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='macro')
    results.append((name, f1))

In [93]:
for result in results:
    print(f"Name: {result[0]}, f1_score: {result[1]}")

Name: MultinomialNB_alpha_0.1, f1_score: 0.6564514103512165
Name: ComplementNB_alpha_0.1, f1_score: 0.6953652590540836
Name: MultinomialNB_alpha_1.0, f1_score: 0.5854345727938506
Name: ComplementNB_alpha_1.0, f1_score: 0.692953349950875
Name: MultinomialNB_alpha_10.0, f1_score: 0.39858607610754687
Name: ComplementNB_alpha_10.0, f1_score: 0.6450088043260654


Changing `alpha` can improve the F1-score. The effect is strongest for Multinomial Naïve Bayes, which is much more sensitive to it: its macro F1 falls from 0.656 at `alpha=0.1` to 0.585 at `alpha=1.0` and 0.399 at `alpha=10.0`, while ComplementNB only moves from 0.695 to 0.693 to 0.645. Both models obtain their best F1-score with `alpha=0.1`, the smallest value tried.
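
The role of `alpha` can be illustrated with additive (Lidstone) smoothing, which scikit-learn documents for `MultinomialNB` as (N_yi + alpha) / (N_y + alpha * n), where n is the number of features. The dependency-free sketch below uses made-up term counts for a single class, and the function name `smoothed_probs` is an assumption introduced here. It shows how a larger `alpha` pulls the estimated term probabilities toward a uniform distribution.

```python
def smoothed_probs(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary_size)."""
    total = sum(counts)
    vocab_size = len(counts)
    return [(c + alpha) / (total + alpha * vocab_size) for c in counts]

toy_counts = [8, 2, 0]   # made-up term counts for one class

for alpha in [0.1, 1.0, 10.0]:
    probs = smoothed_probs(toy_counts, alpha)
    print(alpha, [round(p, 3) for p in probs])
```

With TF-IDF features the "counts" are fractional weights rather than integers, which may make a given `alpha` weigh more heavily than it would on raw counts.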

### Transposing the document-term matrix

In [99]:
newsgroups_data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

In [100]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [101]:
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups_data.data)

In [102]:
term_document_matrix = tfidf_matrix.transpose()

In [103]:
feature_names = tfidf_vectorizer.get_feature_names_out()
feature_names

array(['00', '000', '0000', ..., 'zzzzzzt', '³ation', 'ýé'], dtype=object)

In [104]:
words_to_study = ['computer', 'baseball', 'space', 'government', 'medicine']

In [106]:
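for word in words_to_study:
    # Row of the term-document matrix that corresponds to this word
    word_index = np.where(feature_names == word)[0][0]
    # Cosine similarity between this word's row and every term's row
    similarity_scores = cosine_similarity(term_document_matrix[word_index], term_document_matrix)
    # Take the 6 highest scores, drop the last one (assumed to be the word itself)
    # and reverse so that the most similar word comes first
    top_similar_indices = similarity_scores.argsort()[0][-6:-1][::-1]
    top_similar_words = [feature_names[idx] for idx in top_similar_indices]
    print(f"Words most similar to '{word}': {top_similar_words}")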

Words most similar to 'computer': ['decwriter', 'deluged', 'harkens', 'shopper', 'delicate']
Words most similar to 'baseball': ['tommorrow', 'football', 'spl2', 'lubchansky', 'penna']
Words most similar to 'space': ['nasa', 'shuttle', 'seds', 'enfant', 'exploration']
Words most similar to 'government': ['libertarian', 'encryption', 'agencies', 'regulation', 'people']
Words most similar to 'medicine': ['strengthens', 'dislikes', 'homepathy', 'neurodermitis', 'alleiating']


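The nearest words are only partly coherent. 'space' retrieves sensible neighbors ('nasa', 'shuttle', 'exploration') and 'government' mostly does too, but 'computer' and 'medicine' return misspelled or very rare tokens ('homepathy', 'alleiating', 'decwriter') or words that cannot be considered similar ('deluged', 'dislikes').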
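
A plausible explanation for the odd neighbors (an interpretation, not something verified in this notebook) is that a token occurring in a single document has a term vector with one nonzero entry, so it scores a high cosine similarity with any word whose weight is concentrated in that same document. The made-up term-document rows below illustrate the effect; `typo_token` is a hypothetical stand-in for a misspelling.

```python
import numpy as np

# Rows are terms, columns are documents; weights are made up for illustration
terms = ["medicine", "typo_token", "doctor"]
term_doc = np.array([
    [0.7, 0.3, 0.0, 0.0],   # 'medicine': concentrated in document 0
    [0.9, 0.0, 0.0, 0.0],   # hypothetical misspelling seen only in document 0
    [0.3, 0.3, 0.4, 0.5],   # 'doctor': spread over many documents
])

# L2-normalize each term vector, then compare every term with 'medicine'
unit = term_doc / np.linalg.norm(term_doc, axis=1, keepdims=True)
sims_to_medicine = unit @ unit[0]

for term, sim in zip(terms, sims_to_medicine):
    print(f"{term}: {sim:.3f}")
```

In this toy setup the single-document token ranks above the related but widely used word. Raising `min_df` in the vectorizer would be one way to filter such tokens, at the cost of a smaller vocabulary.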