# Assignment 3

- Martínez Ostoa Néstor I. 
- Text Mining
- LCD - IIMAS - UNAM

---

*Download the file wiki-sub.jsonl and complete the following activities. The file contains 100k documents, each of which is the introduction of a Wikipedia article.*

*Download the file query-sub.jsonl. It contains 50 queries that will be used to evaluate the IR approaches. Each query lists the identifiers of the documents that were judged relevant.*

---



In [1]:
import numpy as np
import pandas as pd
import os
import tempfile

In [2]:
docs_df = pd.read_json('docs.json', lines=True)
queries_df = pd.read_json('query.json', lines=True)

In [3]:
print(f"Documents: {docs_df.shape}\nQueries: {queries_df.shape}")

Documents: (100000, 2)
Queries: (50, 2)


In [4]:
docs_df.head()

| | id | text |
|---|---|---|
| 0 | HazMat_-LRB-film-RRB- | HazMat is a 2013 horror film written and direc... |
| 1 | Pseudodrephalys | Pseudodrephalys is a genus of South American s... |
| 2 | Anathallis_linearifolia | Anathallis linearifolia is a species of orchid... |
| 3 | Usserød_Å | Usserød Å , the principal drainage of Sjælsø L... |
| 4 | Swimming_at_the_2015_World_Aquatics_Championsh... | The Women 's 100 metre freestyle competition o... |


In [5]:
queries_df.head()

| | query | docs |
|---|---|---|
| 0 | Nikolaj Coster-Waldau worked with the Fox Broa... | [Waldau_-LRB-surname-RRB-, Waldau, Fox_Broadca... |
| 1 | Adrienne Bailon is an accountant. | [Accountant, Adrienne_Bailon] |
| 2 | Beautiful reached number two on the Billboard ... | [Billboard_Hot_100, Number_Two_-LRB-film-RRB-,... |
| 3 | Neal Schon was named in 1954. | [Neal_Schon, 1954] |
| 4 | The Boston Celtics play their home games at TD... | [TD_Garden, TD, Boston_Garden, Boston_Celtics] |


In [6]:
docs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      100000 non-null  object
 1   text    100000 non-null  object
dtypes: object(2)
memory usage: 1.5+ MB


In [7]:
queries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   query   50 non-null     object
 1   docs    50 non-null     object
dtypes: object(2)
memory usage: 928.0+ bytes


## Activity 1

Implement 2 IR models that use the LSA (truncated SVD) and LDA algorithms to identify the topic distribution of the documents. The IR models must use this topic distribution to retrieve the documents relevant to a given query. There are several ways to use topic analysis in IR. Explain which technique is used and what the purpose of using topic analysis in IR is.

---

- **Technique used: replace the original term-document matrix $C$ with a lower-rank approximation $C_k$ obtained through singular value decomposition (SVD)**
- **The purpose of topic analysis is to mitigate the problems of polysemy (one word with several meanings) and synonymy (several words with the same meaning)**
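To make the low-rank idea concrete, here is a minimal sketch that uses NumPy instead of gensim. The toy matrix `C`, its term labels, and the helper `low_rank_approximation` are assumptions invented for this illustration; they are not part of the assignment data.

```python
import numpy as np

# Toy term-document matrix C (rows = terms, columns = documents).
# The counts are made up purely for illustration.
C = np.array([
    [1, 1, 0, 0],  # "car"
    [1, 0, 0, 0],  # "automobile"
    [0, 1, 0, 0],  # "engine"
    [0, 0, 1, 1],  # "orchid"
    [0, 0, 1, 0],  # "flower"
], dtype=float)

def low_rank_approximation(C, k):
    """Return C_k, built from the k largest singular values of C."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

C_k = low_rank_approximation(C, k=2)

# C_k has the same shape as C but only rank 2.
print(np.linalg.matrix_rank(C_k, tol=1e-8))
# Document 1 never contained "automobile", yet its entry in C_k is non-zero,
# because "automobile" and "car" fall into the same latent dimension.
print(C_k[1, 1])
```

The second print is the synonymy effect in miniature: a term that was absent from a document receives weight because it co-occurs with terms that are present.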

### Preprocessing

1. Obtain the documents
2. Clean the documents
3. Build the term-document matrix

In [8]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import gensim
from gensim import corpora, models

In [9]:
# 1. Obtain the documents
docs = docs_df["text"]

In [10]:
print(docs[:3])

0    HazMat is a 2013 horror film written and direc...
1    Pseudodrephalys is a genus of South American s...
2    Anathallis linearifolia is a species of orchid...
Name: text, dtype: object


In [11]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

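def clean(doc):
    """Lowercase, drop stopwords and punctuation, then lemmatize each token."""
    # Remove English stopwords after lowercasing
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize the remaining tokens (WordNet treats them as nouns by default)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized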

In [12]:
# 2. Clean the documents
docs = [clean(doc).split() for doc in docs]

In [13]:
print(docs[:3])

[['hazmat', '2013', 'horror', 'film', 'written', 'directed', 'lou', 'simon'], ['pseudodrephalys', 'genus', 'south', 'american', 'skipper', 'butterfly', 'family', 'hesperiidae'], ['anathallis', 'linearifolia', 'specie', 'orchid', 'linearifolia']]


In [14]:
# 3. Build the term-document matrix
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [15]:
print(dictionary)

Dictionary(315882 unique tokens: ['2013', 'directed', 'film', 'hazmat', 'horror']...)


In [16]:
print(f"Num of docs: {len(corpus)}")
corpus[:3]

Num of docs: 100000


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
 [(16, 1), (17, 2), (18, 1), (19, 1)]]

### Latent Semantic Analysis (LSA) model

- The general idea is to use a representation $C_k$ of lower rank than the term-document matrix $C$, obtained through singular value decomposition, where $k$ is on the order of a few hundred
- Each document and each term is mapped to a set of topics
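One common way to use this mapping for retrieval is to fold the query into the same latent space and rank documents by cosine similarity. The sketch below shows that idea with NumPy on a made-up matrix; it is not necessarily the matching strategy used in the cells of this notebook, and the names `fold_in` and `rank_documents` are hypothetical helpers.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are made up.
terms = ["car", "automobile", "engine", "orchid", "flower"]
C = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

k = 2  # number of latent topics kept
U, s, Vt = np.linalg.svd(C, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
docs_k = Vt[:k, :].T  # one row per document, in topic space

def fold_in(query_vec):
    """Project a term-space query into topic space: q_k = S_k^{-1} U_k^T q."""
    return (U_k.T @ query_vec) / s_k

def rank_documents(query_vec):
    """Return document indices sorted by cosine similarity, plus the similarities."""
    q_k = fold_in(query_vec)
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12
    sims = (docs_k @ q_k) / norms
    return np.argsort(-sims), sims

# Query containing only the term "automobile"
query = np.array([0, 1, 0, 0, 0], dtype=float)
order, sims = rank_documents(query)
print(order, np.round(sims, 3))
```

In this toy setting document 1, which contains "car" and "engine" but not "automobile", scores as high as document 0, while the two orchid documents score close to zero.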

In [19]:
NUM_TOPICS = 180
lsi_model = models.LsiModel(corpus, id2word=dictionary, num_topics=NUM_TOPICS)
corpus_lsi = lsi_model[corpus]

In [20]:
lsi_model.print_topics(num_topics=10, num_words=3)

[(0, '0.636*"rrb" + 0.636*"lrb" + 0.281*"paris"'),
 (1, '0.541*"judaism" + 0.390*"press" + 0.300*"study"'),
 (2, '-0.878*"paris" + -0.232*"de" + 0.208*"rrb"'),
 (3, '0.862*"part" + 0.232*"s" + -0.083*"lrb"'),
 (4, '0.537*"s" + -0.425*"part" + 0.160*"first"'),
 (5, '-0.999*"amomum" + -0.005*"specie" + -0.005*"flavorubellum"'),
 (6, '-0.646*"county" + -0.271*"ft" + -0.267*"el"'),
 (7, '-0.657*"robert" + -0.476*"raf" + -0.365*"richard"'),
 (8, '0.443*"julien" + 0.431*"born" + 0.320*"french"'),
 (9, '0.378*"album" + -0.290*"school" + 0.265*"county"')]

### Latent Dirichlet Allocation (LDA) model
- Given a collection of documents, the idea is to work backwards and infer the topics that would have generated those documents
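The generative story that LDA tries to invert can be sketched in a few lines. This is only an illustration of the assumed process, not of how gensim fits the model; the vocabulary, the number of topics, and the Dirichlet parameters `alpha` and `beta` are all made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and hyperparameters, chosen only for illustration.
vocab = ["ship", "navy", "film", "star", "hotel", "bridge"]
num_topics = 2
alpha = np.full(num_topics, 0.5)   # prior over topics per document
beta = np.full(len(vocab), 0.5)    # prior over words per topic

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(beta, size=num_topics)

def generate_document(length):
    """Sample one document following the LDA generative process."""
    theta = rng.dirichlet(alpha)                   # topic mixture of this document
    words = []
    for _ in range(length):
        z = rng.choice(num_topics, p=theta)        # pick a topic for this position
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(np.round(theta, 3), words)
```

Training LDA goes in the opposite direction: given only the words, it estimates `topic_word` and a `theta` for every document.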

In [21]:
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=NUM_TOPICS)
corpus_lda = lda_model[corpus]

In [22]:
lda_model.print_topics(num_topics=10, num_words=3)

[(80, '0.091*"navy" + 0.082*"ship" + 0.059*"character"'),
 (17, '0.076*"better" + 0.055*"left" + 0.046*"deputy"'),
 (90, '0.114*"light" + 0.067*"winter" + 0.042*"commonwealth"'),
 (111, '0.089*"mark" + 0.063*"m" + 0.051*"dedicated"'),
 (173, '0.092*"hotel" + 0.082*"bill" + 0.076*"sun"'),
 (28, '0.094*"competition" + 0.053*"metre" + 0.030*"competed"'),
 (96, '0.144*"produced" + 0.117*"star" + 0.101*"production"'),
 (41, '0.113*"point" + 0.064*"half" + 0.052*"coast"'),
 (89, '0.176*"north" + 0.158*"right" + 0.075*"carolina"'),
 (76, '0.094*"west" + 0.057*"water" + 0.043*"bridge"')]

### Model persistence

In [23]:
def save_model(model, suffix):
    with tempfile.NamedTemporaryFile(prefix='model-', suffix=suffix, delete=False) as tmp:
        model.save(tmp.name) 
    return tmp.name

In [24]:
lsi_name = save_model(lsi_model, suffix='.lsi')
print(lsi_name)

/var/folders/48/3g13bfjj3g56jyfv2g7zr_c00000gp/T/model-zidtfwyt.lsi


In [25]:
lda_name = save_model(lda_model, suffix='.lda')
print(lda_name)

/var/folders/48/3g13bfjj3g56jyfv2g7zr_c00000gp/T/model-m4gu3j94.lda


## Activity 2

Using the queries in the file query-sub.jsonl, obtain the recall-precision curves for the 2 topic-based IR models. Report the best F1 value obtained for each. Also report the resulting MAP.

---
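For reference, the metrics requested above can be computed from a ranked result list as in the following sketch. It assumes each query yields an ordered list of document ids and a non-empty set of relevant ids; the function names and the ids `d1` to `d6` are hypothetical and do not come from the dataset.

```python
def recall_precision_points(ranked_ids, relevant_ids):
    """Return one (recall, precision) pair per cutoff of the ranked list."""
    relevant = set(relevant_ids)
    points, hits = [], 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points

def best_f1(points):
    """Largest F1 over all cutoffs; 0 when no relevant document is retrieved."""
    return max((2 * p * r / (p + r) for r, p in points if p + r > 0), default=0.0)

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the ranks where a relevant document appears."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up example with two queries
runs = [
    (["d1", "d2", "d3", "d4"], ["d1", "d3"]),
    (["d5", "d6"], ["d6"]),
]
points = recall_precision_points(*runs[0])
print(points)
print(best_f1(points), average_precision(*runs[0]), mean_average_precision(runs))
```

Plotting the `(recall, precision)` pairs of each query, or their average over queries, gives the recall-precision curve.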

In [137]:
def set_from_list_of_tuples(list_of_tuples):
    # Keep only the first element (the topic id) of each (topic_id, weight) tuple
    return {t[0] for t in list_of_tuples}

def get_docs_from_topics(topics_of_interest, corpus):
    # Return the indices of the documents whose topic ids include every topic of interest
    s_topics_of_interest = set_from_list_of_tuples(topics_of_interest)
    indices = []
    for doc_idx, topics in enumerate(corpus):
        if s_topics_of_interest.issubset(set_from_list_of_tuples(topics)):
            indices.append(doc_idx)
    return indices

In [28]:
loaded_lsi_model = models.LsiModel.load(lsi_name)
loaded_lda_model = models.LdaModel.load(lda_name)

In [29]:
corpus_lsi = loaded_lsi_model[corpus]
corpus_lda = loaded_lda_model[corpus]

In [30]:
queries_df.head()

| | query | docs |
|---|---|---|
| 0 | Nikolaj Coster-Waldau worked with the Fox Broa... | [Waldau_-LRB-surname-RRB-, Waldau, Fox_Broadca... |
| 1 | Adrienne Bailon is an accountant. | [Accountant, Adrienne_Bailon] |
| 2 | Beautiful reached number two on the Billboard ... | [Billboard_Hot_100, Number_Two_-LRB-film-RRB-,... |
| 3 | Neal Schon was named in 1954. | [Neal_Schon, 1954] |
| 4 | The Boston Celtics play their home games at TD... | [TD_Garden, TD, Boston_Garden, Boston_Celtics] |


In [165]:
queries = queries_df['query']
queries = [clean(q).split() for q in queries]
q_corpus = [dictionary.doc2bow(q) for q in queries]

In [172]:
def get_recall_precision(queries_corpus, docs_corpus, model_str, docs_df, queries_df):
    q_idx = 0
    recalls = []
    precisions = []
    f1s = []
    for query in queries_corpus:
        if model_str == 'lsi':
            topics = loaded_lsi_model[query]
            topics.sort(key=lambda y: y[1])
            topics_of_interest = topics[-4:]
        else:
            topics_of_interest = loaded_lda_model[query][:4]
        docs_indices = get_docs_from_topics(topics_of_interest, docs_corpus)
        if (len(docs_indices) == 0):
            q_idx += 1
            continue
        
        relevant_docs = queries_df.iloc[q_idx, 1]
        retrieved_docs_df = docs_df.iloc[docs_indices]
        relevant_retrieved_docs_df = retrieved_docs_df[retrieved_docs_df['id'].isin(relevant_docs)]
        
        # recall
        total_relevant = len(relevant_docs)
        relevant_retrieved = relevant_retrieved_docs_df.shape[0]
        R = relevant_retrieved / total_relevant
        recalls.append(R)
        
        # precision
        total_retrieved = retrieved_docs_df.shape[0]
        P = relevant_retrieved / total_retrieved
        precisions.append(P)
        
        # f1 score (defined as 0 when both precision and recall are 0)
        f1 = (2*P*R) / (P + R) if (P + R) > 0 else 0
        f1s.append(f1)
        
        q_idx += 1
    return pd.DataFrame({
        'recall': recalls, 'precision': precisions, 'f1': f1s
    })

In [None]:
recall_precision_lda_df = get_recall_precision(q_corpus, corpus_lda, 'lda', docs_df, queries_df)

In [None]:
recall_precision_lsi_df = get_recall_precision(q_corpus, corpus_lsi, 'lsi', docs_df, queries_df)

In [None]:
recall_precision_lda_df

In [None]:
recall_precision_lsi_df