<a href="https://colab.research.google.com/github/nosgueira/PLN-2022-1/blob/main/Atividade03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity 3: Text Retrieval

- Student: Gabriel da Silva Corvino Nogueira (18/0113330)

## Imports

In [1]:
import re
import nltk
import pandas as pd

nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import reuters
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tqdm.notebook import tqdm


[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data Import

In [2]:
cats = reuters.categories()
print("Reuters has %d categories:\n%s" % (len(cats), cats))

Reuters has 90 categories:
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']


In [3]:
fileids=reuters.fileids()

categories = []
text = []

for file in fileids:
    categories.append(reuters.categories(file))
    text.append(reuters.raw(file))
corpus_df = pd.DataFrame({'ids':fileids, 'categories':categories, 'text':text})

In [4]:
len(corpus_df)

10788

As shown above, the Reuters corpus contains more than 10,000 texts. To keep the experiments feasible to run, we will work with a reduced version of the corpus (a 10% random sample):


In [5]:
corpus_reduzido = corpus_df.sample(frac=0.1, random_state=666, ignore_index=True)
corpus_reduzido

Unnamed: 0,ids,categories,text
0,training/13192,[earn],FREEDOM SAVINGS AND LOAN ASS'N &lt;FRDM> YEAR ...
1,test/15709,[earn],NORTHWESTERN NATIONAL LIFE &lt;NWNL> UPS PAYOU...
2,test/20031,[jet],BANGLADESH TENDERS FOR TWO MLN BARRELS PETROLE...
3,training/4525,[crude],HOUSE SPEAKER BACKS OIL IMPORT FORECAST PLAN\n...
4,training/13889,[trade],VOLCKER URGES INDUSTRIAL NATIONS TO KEEP TRADE...
...,...,...,...
1074,training/13120,"[grain, ship]",GRAIN SHIPS LOADING AT PORTLAND\n There were ...
1075,training/4748,[acq],SERVICE CONTROL BIDS FOR AMERICAN SERVICE\n &...
1076,training/12743,[jobs],GERMAN MARCH UNADJUSTED JOBLESS FALLS\n West ...
1077,training/5502,[earn],NBD BANCORP &lt;NBD> REGULAR DIVIDEND SET\n Q...


## Preprocessing

In [6]:
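def preproc(text):
    ps = PorterStemmer()
    text = text.lower()

    stop_words = set(stopwords.words('english'))
    # Matches tokens that start with punctuation or whitespace,
    # or that are purely numeric (e.g. "1,000", "3/4")
    pattern = re.compile(r"['\.&,'\(\)><\,;\s]+|(\d+[\./,]?)+$")

    tokens = word_tokenize(text)
    tokens = [ps.stem(token) for token in tokens]
    # Remove the first run of hyphens or apostrophes found in each token
    tokens = [re.sub(r"[-']+(.*)", r'\1', token) for token in tokens]

    # Drop stop words, tokens matched by the pattern, and empty strings
    return [token for token in tokens if token not in stop_words
                                         and not pattern.match(token)
                                         and token != '']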


In [7]:
textos = corpus_reduzido.text.to_list()

## Bag of Words Representation

This example uses the Bag of Words representation, which takes into account the frequency of each word in a text.
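
As a minimal sketch of the idea, the toy example below counts word frequencies in two made-up sentences with `collections.Counter`. The sentences, the whitespace tokenization, and the `toy_*` names are illustrative assumptions, not the `preproc` pipeline used in this notebook.

```python
from collections import Counter

# Hypothetical toy documents (not taken from the Reuters corpus)
toy_docs = {
    "doc_a": "oil prices rise as oil demand grows",
    "doc_b": "grain exports rise",
}

# Bag of Words: map each document to its word -> count dictionary
toy_bow = {name: Counter(text.split()) for name, text in toy_docs.items()}

# Shared vocabulary, sorted so every document uses the same column order
toy_vocab = sorted(set(word for counts in toy_bow.values() for word in counts))

# Dense count vector per document; words absent from a document count as 0
toy_vectors = {name: [counts[word] for word in toy_vocab]
               for name, counts in toy_bow.items()}

print(toy_vocab)
print(toy_vectors["doc_a"])
```

Each document becomes one row of counts over the shared vocabulary, which is the same shape as the `bow` table built in the cells that follow.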

In [8]:
bow_dict = {}
for i,doc in enumerate(textos):
    doc_name = corpus_reduzido.ids[i]
    bow_dict[doc_name] = dict()
    for word in preproc(doc):
        bow_dict[doc_name][word] = 1 if word not in bow_dict[doc_name] else bow_dict[doc_name][word]+1
        


In [9]:
bow = pd.DataFrame().from_records(bow_dict).fillna(0).T.astype(int)
bow = bow.reindex(sorted(bow.columns), axis=1)

In [10]:
bow

Unnamed: 0,/exxon,/oapec/opec,/ompani,1/march,10.5p,100dlr-a-shar,109-7/8,10th,10year,111-1,...,zico,zimbabw,zimbabwe,zinc,zoet,zone,zoran,zorinski,zuccherifici,zurich
test/14863,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
test/14918,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
test/14928,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
test/14965,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
test/14969,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
training/9936,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
training/9942,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
training/9952,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
training/9953,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## TF-IDF Representation

### Term Frequency

In [11]:
tf = {}

for i, doc in enumerate(textos):
    
    doc_name = corpus_reduzido.ids[i]
    tf[doc_name] = dict()
    words = preproc(doc)

    for word in words:
        tf[doc_name][word] = (1 if  word not in tf[doc_name] 
                                    else tf[doc_name][word]+1)
    tf[doc_name] = {word:(freq/len(words)) 
                    for word, freq in tf[doc_name].items()}



### Inverse Document Frequency

In [12]:
import math
vocab = bow.columns
idf =  {}

for word in vocab:
    idf[word] = math.log(len(bow)/len(bow[word][bow[word]!=0]))
    
    


### TF-IDF

In [13]:
tfidf_dict = {}

for doc in tf:
    tfidf_dict[doc] = dict()
    for word in tf[doc]:
        tfidf_dict[doc][word] = tf[doc][word]*idf[word]
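
In formula form, the three cells above compute the quantities below. The notation is introduced here for illustration only: $f_{t,d}$ is the number of occurrences of term $t$ in the preprocessed document $d$, $N$ is the number of documents in the reduced corpus, and $n_t$ is the number of documents that contain $t$. The logarithm is the natural one, as returned by `math.log`.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this definition a term that occurs in every document has $\mathrm{idf}(t) = \ln 1 = 0$, so it contributes nothing to the TF-IDF vectors.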

In [14]:
tfidf = pd.DataFrame().from_records(tfidf_dict).fillna(0).T
tfidf = tfidf.reindex(sorted(tfidf.columns), axis=1)

In [15]:
tfidf

Unnamed: 0,/exxon,/oapec/opec,/ompani,1/march,10.5p,100dlr-a-shar,109-7/8,10th,10year,111-1,...,zico,zimbabw,zimbabwe,zinc,zoet,zone,zoran,zorinski,zuccherifici,zurich
test/14863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
test/14918,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
test/14928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
test/14965,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
test/14969,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
training/9936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
training/9942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
training/9952,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
training/9953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Document Similarity

### Cosine Similarity

In [16]:
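def cosine_similarity(A, B):
    dot = sum(a*b for a,b in zip(A,B))
    normA = (sum(a**2 for a in A))**0.5
    normB = (sum(b**2 for b in B))**0.5
    # A document with no tokens left after preprocessing yields a zero vector
    if normA == 0 or normB == 0:
        return 0.0
    return dot/(normA*normB)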
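
As a quick sanity check of the formula, the sketch below re-declares the same computation under the hypothetical name `cosine_sim` and applies it to toy vectors chosen for illustration: parallel vectors should score 1 and orthogonal vectors 0.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

parallel = cosine_sim([1, 2, 0], [2, 4, 0])    # same direction, different length
orthogonal = cosine_sim([1, 0, 0], [0, 3, 0])  # no shared non-zero component
partial = cosine_sim([1, 1, 0], [1, 0, 0])     # one of two components shared

print(parallel, orthogonal, partial)
```

Because the measure depends only on direction, a long and a short document with the same word proportions are treated as identical. For the non-negative count and TF-IDF vectors used here, the value always lies between 0 and 1.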


In [17]:
def comp(id_a, id_b, embedding):
    a = embedding.T[id_a].values
    b = embedding.T[id_b].values
    return cosine_similarity(a, b)


### Evaluation Function

In [18]:
def get_categories(df, id):
    return set(df[df.ids==id].categories.explode())


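The following function returns a table (a list) with 10 positions, giving the number of correct retrievals at each position of the similarity ranking. A retrieval counts as correct when the retrieved document shares at least one category with the query document.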

In [19]:
def get_score(embedding, df):

    scores = [0]*10
    ids = df.ids.to_list()
    
    for id_a in tqdm(ids):
        rank = [(id_b, comp(id_a, id_b,embedding)) for id_b in ids
                if id_b!=id_a]
    
        top10 = sorted(rank, key=lambda pair : pair[1], reverse=True)[:10]
    
        for i, (id, _)in enumerate(top10):
            cats_a = get_categories(df,id_a)
            cats_b = get_categories(df,id)
            if len(cats_a & cats_b)!=0 :
                scores[i]+=1
    return scores
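
To make the scoring rule concrete, here is a minimal sketch on hypothetical data. Each query has a ranked list of retrieved document ids, and a retrieval counts as a hit when it shares at least one category with the query. The ids, category sets, rankings, and the helper name `hits_per_position` are all invented for illustration.

```python
def hits_per_position(rankings, categories, k):
    """Count, for each rank position 0..k-1, how many queries retrieved
    a document sharing at least one category with the query."""
    scores = [0] * k
    for query_id, ranked_ids in rankings.items():
        for position, doc_id in enumerate(ranked_ids[:k]):
            if categories[query_id] & categories[doc_id]:
                scores[position] += 1
    return scores

# Hypothetical category sets per document
toy_categories = {
    "d1": {"earn"},
    "d2": {"earn", "acq"},
    "d3": {"grain"},
    "d4": {"grain", "ship"},
}

# Hypothetical similarity rankings (most similar first), query excluded
toy_rankings = {
    "d1": ["d2", "d3", "d4"],
    "d2": ["d1", "d4", "d3"],
    "d3": ["d4", "d1", "d2"],
    "d4": ["d2", "d3", "d1"],
}

toy_scores = hits_per_position(toy_rankings, toy_categories, k=3)
toy_accuracy = [hits / len(toy_rankings) for hits in toy_scores]
print(toy_scores, toy_accuracy)  # [3, 1, 0] [0.75, 0.25, 0.0]
```

Dividing each count by the number of queries gives the per-position accuracy, which is the quantity reported in the results table.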

In [20]:
def gen_table(scores):
    # Accuracy is computed from the `scores` argument, not from a global variable
    pontuacao = pd.DataFrame(
    {'Position':range(1,11),
     'Hits': scores,
    'Accuracy':[acertos/len(corpus_reduzido) for acertos in scores]})
    pontuacao.set_index(pontuacao.columns[0], inplace=True)
    return pontuacao

### Bag of Words Similarity (counts)

In [None]:
scores_bow = get_score(bow, corpus_reduzido)


The table below shows the number of hits per position in our retrieval process using the count-based Bag of Words representation:

In [None]:
gen_table(scores_bow)

### TF-IDF Similarity

In [None]:
scores_tfidf = get_score(tfidf, corpus_reduzido)

The table below shows the number of hits per position in our retrieval process using the TF-IDF representation:

In [None]:
gen_table(scores_tfidf)