# NLP Workshop: Processing Text with Simple Techniques

For this workshop you will need a few libraries, including scikit-learn, NLTK, spaCy, and fbpca.

In [1]:
import pandas as pd
import numpy as np
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

## for processing
import re
import nltk

from sklearn import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing, metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    '''
    Preprocess a text.
    :parameter
        :param text: string - the text to clean
        :param flg_stemm: bool - whether to apply stemming
        :param flg_lemm: bool - whether to apply lemmatization
        :param lst_stopwords: list - stopwords to remove
    :return
        cleaned text
    '''
    ## clean (lowercase, strip surrounding whitespace, then remove punctuation)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## tokenize (convert from string to list)
    lst_text = text.split()

    ## remove stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]

    ## stemming (remove -ing, -ly, ...)
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## lemmatization (convert the word to its dictionary form)
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## convert back from list to string
    text = " ".join(lst_text)
    return text

In [3]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/santi/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/santi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
lst_stopwords = nltk.corpus.stopwords.words("english")
#print(lst_stopwords)

In this workshop we will work with the `progressive-tweet-sentiment` dataset, which contains collected tweets, each labeled with one of 4 targets: 'Legalization of Abortion', 'Hillary Clinton', 'Feminist Movement', 'Atheism'.

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/profesanti/intro_NLP_PUJ/main/Martes/data/progressive-tweet-sentiment.csv",encoding='latin-1')
df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,q1_from_reading_the_tweet_which_of_the_options_below_is_most_likely_to_be_true_about_the_stance_or_outlook_of_the_tweeter_towards_the_target,q1_from_reading_the_tweet_which_of_the_options_below_is_most_likely_to_be_true_about_the_stance_or_outlook_of_the_tweeter_towards_the_target:confidence,q2_which_of_the_options_below_is_true_about_the_opinion_in_the_tweet,q2_which_of_the_options_below_is_true_about_the_opinion_in_the_tweet:confidence,orig__golden,internal_id,orig_q1_from_reading_the_tweet_which_of_the_options_below_is_most_likely_to_be_true_about_the_stance_or_outlook_of_the_tweeter_towards_the_target,q1_from_reading_the_tweet_which_of_the_options_below_is_most_likely_to_be_true_about_the_stance_or_outlook_of_the_tweeter_towards_the_target_gold,orig_q2_which_of_the_options_below_is_true_about_the_opinion_in_the_tweet,target,tweet,tweet_id
0,713632888,True,golden,30,,AGAINST: We can infer from the tweet that the ...,0.6581,2. The tweet does NOT expresses opinion about ...,0.4976,True,189,,AGAINST: We can infer from the tweet that the ...,,Legalization of Abortion,Thank you for another day of life Lord. #Chris...,id588718177095266305
1,713632889,False,golden,2,,NONE OF THE ABOVE: There is no clue in the twe...,1.0,2. The tweet does NOT expresses opinion about ...,0.5294,True,190,,AGAINST: We can infer from the tweet that the ...,,Legalization of Abortion,@rosaryrevival Lovely to use Glorious Mysterie...,id592798858725425152
2,713632890,True,golden,26,,AGAINST: We can infer from the tweet that the ...,0.8859,1. The tweet explicitly expresses opinion abo...,0.882,True,207,,AGAINST: We can infer from the tweet that the ...,,Legalization of Abortion,@Niall250 good thing is that #DUP have consist...,id593472619208380419
3,713632891,False,golden,3,,AGAINST: We can infer from the tweet that the ...,0.6323,1. The tweet explicitly expresses opinion abo...,0.6323,True,211,,AGAINST: We can infer from the tweet that the ...,,Legalization of Abortion,"So, you tell me... is murder okay if the victi...",id592699132399194112
4,713632892,True,golden,31,,AGAINST: We can infer from the tweet that the ...,0.892,1. The tweet explicitly expresses opinion abo...,0.8939,True,213,,AGAINST: We can infer from the tweet that the ...,,Legalization of Abortion,@HillaryClinton Don't you mean to say (all chi...,id588527665365188608


In [6]:
df = df[["target", "tweet"]]
df.head()

Unnamed: 0,target,tweet
0,Legalization of Abortion,Thank you for another day of life Lord. #Chris...
1,Legalization of Abortion,@rosaryrevival Lovely to use Glorious Mysterie...
2,Legalization of Abortion,@Niall250 good thing is that #DUP have consist...
3,Legalization of Abortion,"So, you tell me... is murder okay if the victi..."
4,Legalization of Abortion,@HillaryClinton Don't you mean to say (all chi...


#### 1. Preprocessing. Preprocess the text in the `tweet` column: tokenize, remove stopwords, and apply stemming and lemmatization. The final result must be a string (not a list of words).

In [7]:
# SOLVE HERE

Unnamed: 0,target,tweet,text_clean
0,Legalization of Abortion,Thank you for another day of life Lord. #Chris...,thank another day life lord christian catholic...
1,Legalization of Abortion,@rosaryrevival Lovely to use Glorious Mysterie...,rosaryrevival lovely use glorious mystery east...
2,Legalization of Abortion,@Niall250 good thing is that #DUP have consist...,niall250 good thing dup consistently said murd...
3,Legalization of Abortion,"So, you tell me... is murder okay if the victi...",tell murder okay victim mentally disabled
4,Legalization of Abortion,@HillaryClinton Don't you mean to say (all chi...,hillaryclinton dont mean say child deserve cha...
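
One way to approach this step is to apply a cleaning function row by row with `Series.apply`. The sketch below is only an illustration on a hypothetical two-row DataFrame: `toy_df`, `toy_stopwords`, and `simple_clean` are made-up names, and the tiny stopword set stands in for the NLTK list so that the block runs without any downloads. In the notebook you would presumably pass `utils_preprocess_text` and `lst_stopwords` instead.

```python
import re
import pandas as pd

# Hypothetical toy data standing in for the `tweet` column.
toy_df = pd.DataFrame({"tweet": ["Thank you for another day of life, Lord!",
                                 "@user Don't you mean ALL children?"]})

# Tiny stand-in stopword set (assumption; the notebook uses NLTK's English list).
toy_stopwords = {"you", "for", "of", "the", "a"}

def simple_clean(text, stopwords):
    """Lowercase, drop punctuation, remove stopwords, and return a string."""
    text = re.sub(r"[^\w\s]", "", str(text).lower().strip())
    tokens = [w for w in text.split() if w not in stopwords]
    return " ".join(tokens)

# Apply the cleaner to every row and store the result in a new column.
toy_df["text_clean"] = toy_df["tweet"].apply(lambda t: simple_clean(t, toy_stopwords))
print(toy_df["text_clean"].tolist())
```

Note that punctuation is removed before the stopword filter, so a contraction such as "Don't" becomes "dont" and no longer matches a stopword entry spelled with an apostrophe.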


With all the preprocessed text, build a word cloud and display it.

In [None]:
# pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all text_clean into a single string
text_combined = " ".join(df["text_clean"])

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_combined)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


#### 2. Apply a Bag of Words (BoW) strategy. Build a document-term matrix (documents as rows, terms as columns), in this case a simple word count (`sklearn.feature_extraction.text.CountVectorizer`). What is the vocabulary? What are the dimensions of the matrix?

In [11]:
# SOLVE HERE

(1159, 4419)


In [14]:
vocab = np.array(vectorizer.get_feature_names_out())
print(vocab[-10:])

['yoursisright' 'youve' 'yr9' 'yup' 'yvrmoms' 'zacharyebell' 'zahara'
 'zelda' 'zubair' 'ìååôåãåôì']


#### 3. Choose a number of topics (not too large, between 4 and 10) and analyze the document-term matrix obtained above with a singular value decomposition (SVD). How do you interpret the result? What can you conclude from the matrices U, S, and V? Remember that this is unsupervised learning.

In [15]:
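num_top_words = 8  # number of top words shown per topic (the outputs in this notebook list 8)

def show_topics(a):
    # For each row (topic) of `a`, return its `num_top_words` highest-weighted terms.
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]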

In [16]:
# SOLVE HERE

In [17]:
# SOLVE HERE

(1159, 10) (10,) (10, 4419)


In [18]:
# SOLVE HERE

['woman right dont men feminist people abortion want',
 'god dont know life love best whats people',
 'know best whats dont well people feminist want',
 'know best woman whats god well happiness hillaryclinton',
 'hillaryclinton hillary rt clinton world 2016 ready marcorubio',
 'people abortion prochoice want never rt best kid',
 'right people abortion life human equal believe whats',
 'rt ready 2016 marcorubio futuretxleader yall feminist texas',
 'dont woman abortion rt want life love fuck',
 'hillaryclinton right feminist world people rt equal 2016']
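
To make the roles of U, S, and V concrete, here is a minimal sketch on a hypothetical 4 x 5 count matrix (`toy_X`, `toy_vocab`, and `k` are invented for this example). It uses NumPy's exact thin SVD and truncates it to `k` components; the notebook may well use a randomized or truncated routine instead, which this sketch does not try to reproduce.

```python
import numpy as np

# Hypothetical 4-document x 5-term count matrix and its vocabulary.
toy_vocab = np.array(["abortion", "clinton", "god", "hillary", "life"])
toy_X = np.array([[3, 0, 0, 0, 1],
                  [1, 0, 0, 0, 2],
                  [0, 2, 0, 2, 0],
                  [0, 1, 0, 3, 0]], dtype=float)

k = 2  # number of topics kept (assumption for this toy example)

# Thin SVD: toy_X = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(toy_X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
print(U_k.shape, s_k.shape, Vt_k.shape)

# Each row of Vt_k weights the terms of one topic. The sign of a singular
# vector is arbitrary, so terms are ranked here by absolute weight.
toy_topics = []
for topic in Vt_k:
    top = np.argsort(np.abs(topic))[::-1][:2]
    toy_topics.append(list(toy_vocab[top]))
print(toy_topics)
```

Rows of `U_k` place documents in topic space, `s_k` measures how much each topic contributes, and rows of `Vt_k` describe topics in terms of words. Because SVD allows negative weights and forces topics to be orthogonal, the same word can appear in several topics, as in the output above.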

#### 4. Repeat (3) using non-negative matrix factorization (NMF). How do you interpret the resulting matrices W and H?

In [19]:
# SOLVE HERE

(1159, 10)
(10, 4419)


In [20]:
# SOLVE HERE

['woman men need abortion like equal president man',
 'god pray sinner life mother death love teamjesus',
 'know best whats well happiness life health way',
 'dont want fuck love men understand get make',
 'hillaryclinton world president support success would readyforhrc bad',
 'people abortion prochoice want life never kid choice',
 'right human life equal one baby abortion believe',
 'rt ready 2016 marcorubio futuretxleader yall texas gop',
 'feminist men get youre equality shit ugly go',
 'hillary clinton im like good time get think']
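
The following sketch shows what W and H look like with scikit-learn's `NMF` on a hypothetical 4 x 5 count matrix. The data, the `toy_` names, and the choices `n_components=2`, `init="nndsvd"`, and `max_iter=500` are assumptions made for illustration, not the settings used to produce the output above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4-document x 5-term count matrix and its vocabulary.
toy_vocab = np.array(["abortion", "clinton", "god", "hillary", "life"])
toy_X = np.array([[3, 0, 0, 0, 1],
                  [1, 0, 0, 0, 2],
                  [0, 2, 0, 2, 0],
                  [0, 1, 0, 3, 0]], dtype=float)

# Factorize toy_X ~ W @ H with all entries of W and H non-negative.
toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(toy_X)  # documents x topics
H = toy_nmf.components_           # topics x terms
print(W.shape, H.shape)

# Top two terms of each topic, ranked by weight in H.
for topic in H:
    top = np.argsort(topic)[::-1][:2]
    print(list(toy_vocab[top]))
```

Since every weight is non-negative, a row of H can be read directly as "how much each word belongs to this topic" and a row of W as "how much of each topic this document contains", which tends to give more separable topics than SVD.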

#### 5. Repeat steps (3) and (4), this time using the matrix produced by TF-IDF.

In [21]:
vectorizer_tfidf = ...  # SOLVE HERE
vectors_tfidf = ...  # SOLVE HERE
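
As a rough illustration of what TF-IDF changes relative to raw counts, the sketch below fits a `TfidfVectorizer` with default settings on a hypothetical three-document corpus (all `toy_` names are invented). With the defaults, each row is L2-normalized and terms that occur in fewer documents receive a larger weight.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned documents.
toy_docs = ["woman right equal", "god life love", "woman life right right"]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs)  # sparse matrix, documents x terms
toy_dense = toy_weights.toarray()

print(toy_weights.shape)
print(np.linalg.norm(toy_dense, axis=1))  # row norms under the default L2 normalization

# In document 0 both terms occur once, but "equal" is rarer across the corpus.
idx_equal = toy_tfidf.vocabulary_["equal"]
idx_woman = toy_tfidf.vocabulary_["woman"]
print(toy_dense[0, idx_equal], toy_dense[0, idx_woman])
```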

In [23]:
# SOLVE HERE (SVD)

['woman right dont feminist men god abortion people',
 'god know life pray islam sinner mother amen',
 'know best whats dont happiness well hillaryclinton health',
 'hillaryclinton world 2016 ready rt marcorubio president hillary',
 'ready abortion rt 2016 marcorubio futuretxleader yall dont',
 'abortion life right human choice people adoption pregnant',
 'right woman know 2016 marcorubio ready best whats',
 'right hillary clinton human im dont time get',
 'hillary clinton im abortion woman many think good',
 'dont want love woman abortion baby choice mean']

In [22]:
# SOLVE HERE (NMF)

['woman men equal president need man like le',
 'god pray sinner amen mary rosary hour mother',
 'know best whats happiness well body health way',
 'hillaryclinton world barackobama whole supporting support success president',
 'ready 2016 marcorubio rt futuretxleader yall texas newamericancentury',
 'abortion choice life one adoption beautiful day chooses',
 'right human believe equal care life live baby',
 'feminist dont want understand feminism equality men love',
 'hillary clinton im good like get many way',
 'people prochoice never kid love pregnant cant pro']

#### 7. Use PCA and interpret the result as you did for the previous factorizations, using the TF-IDF matrix (what do the principal components mean, and which words contribute most to each component?).

Hint: the rows of `pca.components_` are the principal axes, i.e. the eigenvectors of the covariance matrix of the centered data. In text analysis, each component (row of `pca.components_`) can be read as a "theme" or "topic".

In [25]:
from sklearn.decomposition import PCA

# SOLVE HERE


['woman right men feminist equal abortion equality want',
 'god right dont know life woman feminist men',
 'know best whats dont happiness well health hillaryclinton',
 'hillaryclinton world god woman president right success support',
 'right abortion life know rt ready woman 2016',
 '2016 marcorubio ready feminist rt men woman futuretxleader',
 'right hillary im human clinton like get believe',
 'dont right love hillaryclinton world people want life',
 'people feminist ugly world equal equality right whats',
 'people hillary clinton want woman many god best']

#### 8. So far we have done unsupervised learning, but we can also apply supervised learning. Use the document-term matrices from the previous steps to classify the tweets with several models (Multinomial Naive Bayes, Random Forest, LDA, etc.). To evaluate the results properly you must use at least a train/test split (cross-validation is recommended). Compare and discuss the results using several performance metrics (do not forget the confusion matrix). Hint: put the vectorization (word counts) and the classifier in a Pipeline.

In [32]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# SOLVE HERE


Cross-validation scores: [0.74193548 0.69892473 0.66486486 0.65405405 0.65945946]
Mean CV score: 0.6838
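
The sketch below illustrates the suggested Pipeline on a hypothetical twelve-document corpus with two of the dataset's target labels. The texts, the `toy_` names, the choice of `MultinomialNB`, and the split and fold sizes are all assumptions for illustration, not the setup that produced the scores above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned texts and their labels (six per class).
toy_texts = ["god pray life", "god love faith", "pray god amen",
             "faith god sinner", "god life love pray", "amen pray faith",
             "hillary clinton president", "clinton 2016 ready",
             "hillary ready president", "vote hillary clinton",
             "clinton president 2016", "ready hillary vote"]
toy_labels = ["Atheism"] * 6 + ["Hillary Clinton"] * 6
class_names = ["Atheism", "Hillary Clinton"]

# Vectorizer and classifier in one estimator, so the vocabulary is learned
# only from the training portion of each split.
toy_clf = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])

# Hold-out evaluation with a stratified train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0)
toy_clf.fit(X_train, y_train)
y_pred = toy_clf.predict(X_test)

toy_cm = confusion_matrix(y_test, y_pred, labels=class_names)
print(toy_cm)  # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred, zero_division=0))

# Stratified 3-fold cross-validation on the full toy corpus.
toy_scores = cross_val_score(toy_clf, toy_texts, toy_labels, cv=3)
print(toy_scores, toy_scores.mean())
```

Swapping `MultinomialNB()` for another classifier, or `CountVectorizer()` for `TfidfVectorizer()`, only changes one pipeline step, which makes it straightforward to compare models under the same evaluation protocol.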
