# Recommendation
### How do we know what is most interesting to a user?
#### To recommend content, you need to know about that content and about the user you want to impress. Nothing beats treating both as vectors and comparing distances!

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import requests

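# Distance and similarity between data points
What is distance? (We already saw distance in the previous class, for the KNN algorithm.)
<br>
A distance measure between two points of dimension n (data with n columns; a vector with n positions) is a function that takes the two points and returns a non-negative number: the farther apart the points, the larger the value.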

### Euclidean distance

In [2]:
def euclidean(a,b):
    s = 0
    n = len(a)
    for i in range(n):
        s = s + (a[i] - b[i])**2
    # raising to the power 0.5 is the same as taking the square root
    return s**0.5
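As a quick sanity check, the same distance can be computed with the standard library's `math.dist` (Python 3.8+). This sketch is not part of the original notebook; the two points are made up and chosen so that the answer is the familiar 3-4-5 right triangle.

```python
import math

# Two made-up points in 2 dimensions (illustrative values only)
a = [0, 0]
b = [3, 4]

# Manual computation: square root of the sum of squared differences
manual = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# math.dist computes the Euclidean distance between two points
builtin = math.dist(a, b)

print(manual, builtin)
```

Both values should agree, which is a cheap way to test a hand-written distance function before using it on real data.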

### Manhattan distance
Unlike Euclidean distance, Manhattan distance does not measure the straight line connecting the two points; it adds up the absolute differences along each axis.
- Always greater than or equal to the Euclidean distance

In [3]:
def manhattan(a,b):
    s = 0
    n = len(a)
    for i in range(n):
        s = s + abs(a[i]-b[i])
    return s
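Manhattan distance is never smaller than Euclidean distance, and the two coincide when the points differ along a single axis. The sketch below illustrates both cases with made-up points; the helper names are hypothetical and simply restate the two definitions in compact form.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Length of the straight line between the two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Made-up points (illustrative values only)
origin = [0, 0]
diagonal = [3, 4]   # differs from origin on both axes
on_axis = [0, 5]    # differs from origin on one axis only

print(manhattan_distance(origin, diagonal), euclidean_distance(origin, diagonal))
print(manhattan_distance(origin, on_axis), euclidean_distance(origin, on_axis))
```

For the diagonal point Manhattan is strictly larger; for the on-axis point the two distances are equal.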

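### Cosine distance/similarity
Each row of the dataset is a vector.
A vector is a mathematical object with direction, sense (orientation) and magnitude.
- Cosine similarity: cos(θ), where θ is the angle between the two vectors
- Cosine distance: 1 - | cos(θ) | (for vectors with non-negative entries, such as TF-IDF, this equals 1 - cos(θ))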

In [4]:
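def cosine(a,b):
    # returns |cos(θ)|, the absolute cosine similarity between a and b
    mod_a = 0
    mod_b = 0
    a_dot_b = 0

    n = len(a)
    for i in range(n):
        mod_a += a[i]**2
        mod_b += b[i]**2

        a_dot_b += a[i]*b[i]

    # vector norms; an all-zero vector makes the division below a division by zero
    mod_a = mod_a**0.5
    mod_b = mod_b**0.5

    return abs(a_dot_b/(mod_a*mod_b))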
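What sets cosine apart from the two previous distances is that it ignores magnitude: only the angle between the vectors matters. A minimal sketch with made-up vectors (the function name `cosine_similarity_sketch` is hypothetical, chosen to avoid clashing with library functions):

```python
import math

def cosine_similarity_sketch(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors (illustrative values only)
a = [1.0, 2.0, 0.0]
b = [10.0, 20.0, 0.0]   # same direction as a, ten times longer
c = [0.0, 0.0, 3.0]     # orthogonal to a

sim_ab = cosine_similarity_sketch(a, b)
sim_ac = cosine_similarity_sketch(a, c)

# Print similarity and the corresponding distance 1 - |cos|
print(sim_ab, 1 - abs(sim_ab))
print(sim_ac, 1 - abs(sim_ac))
```

This is why cosine is popular for text: a long article and a short one about the same subject point in roughly the same direction even though their word counts differ a lot.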

# Content-based recommendation
There are two common approaches to recommending content:
- **Content-based**: recommendation based on attributes of the content
- **User-based**: recommendation based on information about the user and what they consume
<br>
<br>
Since our dataset consists of articles, we can build a recommender system based on the similarity between their contents.

### Preparing the data

In [5]:
import pandas as pd
news_load = pd.read_csv('articles_1000.csv')
news_df = news_load[pd.notnull(news_load['text'])]
news_df.head(50)

Unnamed: 0,title,text,date,category,subcategory,link
0,"Lula diz que está 'lascado', mas que ainda tem...",Com a possibilidade de uma condenação impedir ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
1,"'Decidi ser escrava das mulheres que sofrem', ...","Para Oumou Sangaré, cantora e ativista malines...",2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
2,Três reportagens da Folha ganham Prêmio Petrob...,Três reportagens da Folha foram vencedoras do ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
3,Filme 'Star Wars: Os Últimos Jedi' ganha trail...,A Disney divulgou na noite desta segunda-feira...,2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
4,CBSS inicia acordos com fintechs e quer 30% do...,"O CBSS, banco da holding Elopar dos sócios Bra...",2017-09-10,mercado,,http://www1.folha.uol.com.br/mercado/2017/10/1...
5,"Em encontro, Bono pergunta a Macri sobre argen...","O vocalista da banda irlandesa U2, Bono, fez u...",2017-09-10,mundo,,http://www1.folha.uol.com.br/mundo/2017/10/192...
6,"Posso sair do Brasil quando e como quiser, diz...",O italiano Cesare Battisti disse nesta segunda...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
7,Tite diz querer seguir na seleção após o Mundi...,Pela primeira vez desde que assumiu o comando ...,2017-09-10,esporte,,http://www1.folha.uol.com.br/esporte/2017/10/1...
8,Supremo nega pedido para Senado analisar impea...,O STF (Supremo Tribunal Federal) negou na quin...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
9,"Em teste, WhatsApp Business permite que empres...",O aplicativo de mensagens instantâneas WhatsAp...,2017-09-10,tec,,http://www1.folha.uol.com.br/tec/2017/10/19257...


In [6]:
# Stopword list (source: USP), one word per line.
# Read directly: recent pandas versions reject sep='\n' in read_csv.
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

In [7]:
import spacy # https://spacy.io/
nlp = spacy.load('pt_core_news_sm')

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

def wm2df(wm, feat_names):
    # Row names are positional (News0, News1, ...) and follow the row order of wm
    doc_names = ['News{:d}'.format(idx) for idx, _ in enumerate(wm)]

    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return df

def tokenizer_without_stopwords(docs):
    # Lemmatize with spaCy, dropping punctuation, whitespace and stopwords
    doc = nlp(docs)
    return [token.lemma_.lower() for token in doc if not (token.is_punct or token.is_space or (token.text.lower() in stopwords))]


def my_preprocessor(list_):
    return str(list_)

custom_vec = TfidfVectorizer(preprocessor=my_preprocessor, tokenizer=tokenizer_without_stopwords)
cwm = custom_vec.fit_transform(news_df.text)
# get_feature_names_out() requires scikit-learn >= 1.0
corpus_tfidf = wm2df(cwm, custom_vec.get_feature_names_out())

corpus_tfidf.head(10)

Unnamed: 0,$,'s,+,"-""single","-""você","-0,12","-0,41","-0,6",-1,"-1,75",...,—uma,—valorização,—vejam,—à,—às,—é,€,─,▶,★
News0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
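To make the table above less opaque: each cell is a TF-IDF weight, which grows with how often a term appears in a document and shrinks with how many documents contain that term. The sketch below uses the textbook formula tf × log(N / df) on a made-up, already tokenized corpus. That formula is an assumption for illustration; scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up, already tokenized corpus (illustrative only)
docs = {
    "News0": ["uber", "app", "market"],
    "News1": ["uber", "license", "norway"],
    "News2": ["senate", "court", "market"],
}

n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (count / length) * log(N / df) for each term."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = {name: tfidf(tokens) for name, tokens in docs.items()}
print(weights["News1"])
```

In `News1`, "norway" gets a higher weight than "uber" because "uber" also appears in another document and is therefore less distinctive.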


In [9]:
doc = "News12"

words_vec = corpus_tfidf.loc[doc]
words_vec.head(10)

$           0.0
's          0.0
+           0.0
-"single    0.0
-"você      0.0
-0,12       0.0
-0,41       0.0
-0,6        0.0
-1          0.0
-1,75       0.0
Name: News12, dtype: float64

## Applying distance between documents
- Euclidean: when you care about the magnitude of the distance
- Manhattan:

In [10]:
words_vec_a = corpus_tfidf.loc["News26"]
words_vec_b = corpus_tfidf.loc["News48"]

In [11]:
news_df.title[26]

'Rivais franceses da Uber intensificam esforços para roubar fatia de mercado'

In [12]:
news_df.title[48]

'Uber muda tática e suspende serviço mais barato sem licença na Noruega'

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

s_c = cosine_similarity([words_vec_a.values], [words_vec_b.values])
print("The cosine similarity between the two documents is {}".format(s_c[0][0]))

The cosine similarity between the two documents is 0.26018714614461536


## Exercise: Let's build a recommender system
There is more than one approach to recommending content. One of them is to use the document being consumed to return similar documents. This type of approach is called **Content-based**.
### 1. Write a function that receives 2 docIds and returns the cosine similarity between them

In [14]:
def similaridade(doc_a, doc_b, matriz):
    return cosine_similarity([matriz.loc[doc_a].values], [matriz.loc[doc_b].values])[0][0]

In [15]:
similaridade("News26", "News48", corpus_tfidf)

0.26018714614461536

### 2. Write a function that receives a docId and computes the similarity between that document and all the others.
Return a list of (docId, similarity)

In [16]:
def recomendacao(doc_a, matriz):
    return [(doc_,similaridade(doc_a, doc_, matriz)) for doc_ in matriz.index.values]

In [17]:
recomendacao("News48", corpus_tfidf)

[('News0', 0.02910351244756649),
 ('News1', 0.018446770050393343),
 ('News2', 0.017980790343034826),
 ('News3', 0.026609379642142817),
 ('News4', 0.022483753473278207),
 ('News5', 0.017334235870466463),
 ('News6', 0.019438468677169335),
 ('News7', 0.009898725407404396),
 ('News8', 0.014621510060637184),
 ('News9', 0.16834441580271203),
 ('News10', 0.02190333063199363),
 ('News11', 0.0309497817485229),
 ('News12', 0.02393472298171934),
 ('News13', 0.05006759328120609),
 ('News14', 0.06533970466361808),
 ('News15', 0.021247879197191673),
 ('News16', 0.039634806246397584),
 ('News17', 0.018931155627958315),
 ('News18', 0.018965141553977077),
 ('News19', 0.013064345047658858),
 ('News20', 0.02972034108851554),
 ('News21', 0.013636930950323737),
 ('News22', 0.019911088871147783),
 ('News23', 0.0565033595549413),
 ('News24', 0.021463492139470717),
 ('News25', 0.019080096371238205),
 ('News26', 0.26018714614461536),
 ('News27', 0.041726433191293516),
 ('News28', 0.08974075713124824),
 ('News2

#### Sorting

In [18]:
sorted(recomendacao("News48", corpus_tfidf), key=lambda x: x[1], reverse=True)

[('News48', 1.0),
 ('News829', 0.41550841542641725),
 ('News26', 0.26018714614461536),
 ('News9', 0.16834441580271203),
 ('News558', 0.10224009429540874),
 ('News532', 0.10052504887068697),
 ('News866', 0.09644368393524447),
 ('News439', 0.0955784416262829),
 ('News28', 0.08974075713124824),
 ('News869', 0.08068252846812078),
 ('News589', 0.07938272909569428),
 ('News152', 0.07894443069892318),
 ('News63', 0.07528340922408688),
 ('News743', 0.07249163468834376),
 ('News930', 0.07121405255542415),
 ('News853', 0.06948403838933907),
 ('News893', 0.0683062357495419),
 ('News698', 0.06828664057503564),
 ('News38', 0.06820754277885042),
 ('News14', 0.06533970466361808),
 ('News452', 0.06496977946501376),
 ('News69', 0.06402432497007823),
 ('News169', 0.06287374619202549),
 ('News383', 0.06106997762771193),
 ('News305', 0.06045337652419953),
 ('News335', 0.05966492423038652),
 ('News162', 0.05946289453334723),
 ('News715', 0.05878679045501253),
 ('News401', 0.05868126126280032),
 ('News598',
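The sorted list starts with the query document itself (similarity 1.0), which is not a useful recommendation. A minimal sketch of keeping only the top k results while skipping the query; `top_k` is a hypothetical helper, and the scores are rounded copies of the first rows of the output above.

```python
import heapq

def top_k(scores, query_doc, k=3):
    """Return the k highest-scoring (doc_id, score) pairs, skipping the query itself."""
    candidates = [(doc, score) for doc, score in scores if doc != query_doc]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Rounded scores taken from the sorted output above, deliberately shuffled
scores = [
    ("News48", 1.0),
    ("News9", 0.168),
    ("News829", 0.416),
    ("News26", 0.260),
    ("News558", 0.102),
]

print(top_k(scores, "News48", k=3))
```

`heapq.nlargest` avoids sorting the whole list when only a handful of recommendations is needed.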

In [19]:
news_df.title[829]

'Fachin nega pedido de Aécio para devolver o mandato'

In [20]:
news_df.title[48]

'Uber muda tática e suspende serviço mais barato sem licença na Noruega'
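One thing worth checking when a recommendation looks off-topic, as it does here: the names `News0`, `News1`, ... are built from row positions in the TF-IDF matrix, while `news_df.title[829]` looks a row up by its index label. If the null-text filter dropped any rows, the two can point at different articles. Whether that happens in this dataset is not settled by the outputs shown; the sketch below only demonstrates the mechanism on a made-up frame.

```python
import pandas as pd

# Made-up frame with one missing text (illustrative only)
raw = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "text": ["aaa", None, "ccc", "ddd"],
})

# Same filtering pattern as the notebook: keep rows whose text is not null
filtered = raw[pd.notnull(raw["text"])]

# Label-based lookup: the row whose index label is 2
by_label = filtered.title[2]

# Position-based lookup: the third row of the filtered frame (positions 0, 1, 2)
by_position = filtered.title.iloc[2]

print(by_label, by_position)
```

If the two lookups disagree on the real data, using `.iloc` with the number taken from the `NewsN` name (or calling `reset_index(drop=True)` after filtering) keeps titles and matrix rows aligned.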

# Collaborative filtering
#### Using what some users accessed to recommend items to other users.
Here, we will recommend words for articles (documents play the role of users and words the role of items): words that the journalists may have forgotten to mention in an article.

In [21]:
unstack = corpus_tfidf.unstack()
df = pd.DataFrame(unstack)

word_df = df.reset_index()
word_df.columns=["word", "document", "score"]

word_df = word_df[["document", "word", "score"]]
word_df = word_df[word_df.score > 0]
print(word_df.count())
word_df.head()

document    207918
word        207918
score       207918
dtype: int64


Unnamed: 0,document,word,score
298,News298,$,0.031385
809,News809,$,0.04862
1019,News80,'s,0.052006
1091,News152,'s,0.042241
1156,News217,'s,0.054083


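## SVD (Singular Value Decomposition)
The user-item matrix is factored into three matrices (U, Σ and Vᵀ). The singular values on the diagonal of Σ represent the "strength" of each latent factor, and the singular vectors in U and Vᵀ describe how each row and column relates to those factors.

An example of using SVD on sparse matrices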

In [22]:
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

A = csc_matrix([[1,0,0],[5,0,2],[0,2,0],[0,0,3]], dtype=float)
A.shape

(4, 3)

In [23]:
u, s, vt = svds(A, k=2) # k is the number of factors
print("s",s)
print("u",u)
print("vt",vt)

s [2.75193379 5.6059665 ]
u [[-1.73323831e-01  1.56782328e-01]
 [-2.27856346e-01  9.54078802e-01]
 [-1.04795289e-18  1.90587647e-19]
 [ 9.58144214e-01  2.55250744e-01]]
vt [[-4.76975707e-01 -1.44194848e-18  8.78916478e-01]
 [ 8.78916478e-01  5.34213982e-19  4.76975707e-01]]


In [24]:
import numpy as np

# diagonal matrix with the singular values on its diagonal
sigma = np.diag(s)

In [25]:
result = np.dot(u, np.dot(sigma, vt))
result

array([[ 1.00000000e+00,  1.15730387e-18, -1.66533454e-16],
       [ 5.00000000e+00,  3.76142896e-18,  2.00000000e+00],
       [ 2.31460773e-18,  4.72919999e-36, -2.02509038e-18],
       [-6.66133815e-16, -3.03763556e-18,  3.00000000e+00]])

In [26]:
A.todense()

matrix([[1., 0., 0.],
        [5., 0., 2.],
        [0., 2., 0.],
        [0., 0., 3.]])
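Note that `result` matches `A` everywhere except the entry 2 in the third row, which comes back as (numerically) zero. That is expected: `A` has three non-zero singular values and `k=2` keeps only the two largest. The sketch below redoes the decomposition with dense NumPy (`np.linalg.svd`, used here instead of `svds` so that all three factors are available) and measures what is lost.

```python
import numpy as np

A = np.array([[1, 0, 0], [5, 0, 2], [0, 2, 0], [0, 0, 3]], dtype=float)

# Thin SVD: singular values come back sorted from largest to smallest
u, s, vt = np.linalg.svd(A, full_matrices=False)

def reconstruct(k):
    """Rebuild A from its k largest singular values and vectors."""
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Frobenius norm of the part of A that the truncation drops
error_k2 = np.linalg.norm(A - reconstruct(2))
error_k3 = np.linalg.norm(A - reconstruct(3))

print(s)
print(error_k2, error_k3)
```

With two factors the error equals the discarded singular value; with all three the reconstruction is exact up to floating-point noise. In recommendation, this lossy low-rank approximation is the point: the dropped detail is treated as noise and the kept factors generalize to unseen cells.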

### There are some ready-made packages for recommendation

In [36]:
from surprise import SVD, Dataset
from surprise.reader import Reader
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(0,1))
data = Dataset.load_from_df(word_df, reader)

algo = SVD()

cross_validate(algo, data, measures=['RMSE', 'MAE'])

# Or NDCG

{'fit_time': (29.431864976882935,
  16.924479007720947,
  17.809990167617798,
  23.95526099205017,
  18.561808824539185),
 'test_mae': array([0.03920984, 0.03939817, 0.03889182, 0.03931086, 0.0393398 ]),
 'test_rmse': array([0.05598938, 0.05584189, 0.05514933, 0.05589311, 0.05589038]),
 'test_time': (0.7760980129241943,
  0.4907360076904297,
  0.5302860736846924,
  0.5542788505554199,
  0.6796102523803711)}

In [37]:
word_df[word_df.score > 0].sample(10)

Unnamed: 0,document,word,score
4016955,News852,ba,0.07541
21660197,News284,roma,0.036379
16702755,News762,moda,0.050668
3158215,News358,apenas,0.019929
13436336,News185,incorporadora,0.055041
13933756,News874,internacional,0.016037
11024699,News839,federal,0.021501
7442635,News121,credibilidade,0.060174
2702393,News890,além,0.013888
4728512,News647,bolsar,0.043653


In [38]:
word_df[(word_df.document=='News195') & (word_df.word=='alfabetizar')]

Unnamed: 0,document,word,score


In [39]:
algo.predict('News195','alfabetizar')

Prediction(uid='News195', iid='alfabetizar', r_ui=None, est=0.005901088637397578, details={'was_impossible': False})
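The pair (`News195`, `alfabetizar`) has no score in `word_df`, yet the model still returns an estimate. As documented for Surprise's `SVD`, the estimate is the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors, clipped to the rating scale by default. The sketch below reproduces that arithmetic with made-up parameters; none of the numbers come from the fitted `algo`, and `predict_rating` is a hypothetical helper.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(0.0, 1.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = global_mean + user_bias + item_bias + sum(
        p * q for p, q in zip(user_factors, item_factors)
    )
    low, high = rating_scale
    return min(high, max(low, raw))

# Made-up parameters for one (document, word) pair; not taken from `algo`
estimate = predict_rating(
    global_mean=0.04,
    user_bias=-0.01,
    item_bias=0.005,
    user_factors=[0.1, -0.2],
    item_factors=[0.05, 0.1],
)
print(estimate)
```

In this notebook the "user" is a document and the "item" is a word, so the factors capture which groups of words tend to appear together in which groups of articles.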

## Exercise: Write a function that gives the best word recommendations for an article passed to it
Return a sorted list of (word, score)

In [40]:
def algo_word(doc, word):
    return algo.predict(doc, word).est

def recomend_words(doc):
    lista = [(word, algo_word(doc, word)) for word in word_df.word.unique()]
    
    return sorted(lista, key=lambda x: x[1], reverse=True)

In [41]:
recomend_words('News3')

[('confeitar', 0.4765386168340888),
 ('1ª', 0.4470087856477225),
 ('battisti', 0.44484290753722733),
 ('iniciativa', 0.4373300203568129),
 ('emocional', 0.41594914459167676),
 ('italiano', 0.40823132162358466),
 ('juizado', 0.4026649641187814),
 ('creche', 0.39724832203321014),
 ('descobri', 0.3953178780292994),
 ('pet', 0.3935480474933952),
 ('dizzy', 0.39340075010615605),
 ('híbrido', 0.39232112447824385),
 ('anticorrupção', 0.39091250466812477),
 ('empresismo', 0.38619258962913117),
 ('irlandês', 0.38565395323380747),
 ('pnad', 0.384354992522457),
 ('igreja', 0.38428124002113817),
 ('monitore', 0.38380573353192515),
 ('paddock', 0.382855305274576),
 ('flamengo', 0.3798761764560479),
 ('responsabilização', 0.3797152823260759),
 ('confaz', 0.3787108094118419),
 ('coautor', 0.37846262018617693),
 ('détente', 0.37832725865485384),
 ('leveza', 0.3770368530953865),
 ('precision', 0.3766624845637283),
 ('biológico', 0.3764053838801227),
 ('rigor', 0.37592436923216654),
 ('cartel', 0.373940
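`recomend_words` scores every word in the vocabulary, including words the article already contains. Since the goal is to suggest words the journalist may have left out, one option is to drop words already present before taking the top results. A minimal sketch with made-up data; `unseen_recommendations` is a hypothetical helper, not part of the notebook.

```python
def unseen_recommendations(ranked_words, words_in_doc, k=3):
    """Keep the first k (word, score) pairs whose word is not already in the document."""
    seen = set(words_in_doc)
    unseen = [(word, score) for word, score in ranked_words if word not in seen]
    return unseen[:k]

# Made-up ranked list (already sorted by score) and document vocabulary
ranked = [
    ("trailer", 0.48),
    ("disney", 0.45),
    ("jedi", 0.44),
    ("saga", 0.41),
    ("cinema", 0.40),
]
already_in_doc = ["disney", "trailer", "filme"]

print(unseen_recommendations(ranked, already_in_doc, k=2))
```

On the real data, the words already in a document could be read from `word_df[word_df.document == doc].word`.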

In [42]:
news_df.title[3]

"Filme 'Star Wars: Os Últimos Jedi' ganha trailer definitivo; assista"