# Demo notebook for article recommendation

The goal of this notebook is to show a typical recommendation workflow, as it will be performed through the chatbot.

Here we only demonstrate the use of the pretrained models and explore some ways of personalizing the recommendations.


### Contents

__Preliminaries__
* a. Package Imports
* b. Data and models

__1. Geo & Topic prediction__

* Prediction of input article's geographic zone and topic

__2. Basic reco based on article body similarity__

* a. Most similar article
    
* b. Random articles from the same cluster

__3. Similarity-based reco with filter on geo and topic__

__4. How to use the title__

__5. Entity-based recommendation__  




### Preliminaries

A. Packages

In [1]:
## Classic packages ##

import os 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
import random 
random.seed(a=2905) # set random seed 
import pickle
import re
from scipy import stats
from time import time

## NLP packages ##

import gensim
from gensim import corpora

from wordcloud import WordCloud

import spacy
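try: 
    nlp = spacy.load("fr_core_news_sm") # load pre-trained models for French
    print("fr_core_news_sm loaded") # only printed once the load has succeeded
except OSError: # raised by spaCy when the model package is not found
    nlp = spacy.load('fr') # fr is a shortcut link to fr_core_news_sm
    print("fr loaded")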
from spacy.lang.fr import French

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer # WordNet-based, English only: not suited to French
from nltk.stem.snowball import FrenchStemmer # Snowball stemmer for French

## ML with sklearn ##

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer 

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
import sklearn.cluster
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score, cross_validate

from sklearn.base import TransformerMixin
from sklearn.compose import make_column_selector

from sklearn.metrics import classification_report
from sklearn.metrics import silhouette_score

fr_core_news_sm loaded


B. Data and models

* Dataset of past articles = our database for article recommendation

* User input = some recent articles manually picked from the Internet

We need: 

* tfidf model to transform the article body --> saved pipeline 
* models for geo and topic classification --> saved pipeline
* cosine distance for basic comparison --> computed directly in Python
* kmeans model for clustering --> still to save
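
The cosine comparison listed above can be sketched in a few lines. This is a minimal pure-Python illustration on toy vectors: the helper name `cosine_similarity_dense` and the numbers are made up for the example. In the notebook itself, one would more likely call `sklearn.metrics.pairwise.cosine_similarity` on the sparse output of the saved tfidf vectorizer.

```python
import math

def cosine_similarity_dense(u, v):
    """Cosine similarity between two equal-length dense vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Toy tfidf-like vectors (illustrative values, not taken from the dataset)
query = [0.0, 0.5, 0.5, 0.0]
candidates = {
    "article_a": [0.0, 0.4, 0.6, 0.0],
    "article_b": [0.7, 0.0, 0.0, 0.3],
}

# Rank candidates from most to least similar to the query
ranking = sorted(
    candidates,
    key=lambda name: cosine_similarity_dense(query, candidates[name]),
    reverse=True,
)
print(ranking)
```

The cosine distance is simply one minus this similarity, so ranking by ascending distance gives the same order.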


In [2]:
# tfidfs used for clustering

with open('tfidf_vectorizer_base', 'rb') as file:
    tfidf_vectorizer_base = pickle.load(file)

with open('tfidf_vectorizer_vocab', 'rb') as file:
    tfidf_vectorizer_vocab = pickle.load(file)

In [3]:
# Load pipelines from saved pickle files


with open('best_topic_lr_basic_vocab', 'rb') as file:
    best_topic_lr_basic_vocab = pickle.load(file)

with open('best_topic_lr_basic_vocab_l2', 'rb') as file:
    best_topic_lr_basic_vocab_l2 = pickle.load(file)
    
with open('best_topic_rf_basic_vocab', 'rb') as file:
    best_topic_rf_basic_vocab = pickle.load(file)

with open('best_geo_rf_entity_vocab', 'rb') as file:
    best_geo_rf_entity_vocab = pickle.load(file)
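
The repeated `open` / `pickle.load` pattern could be factored into a small helper. This is only a sketch: `load_pickles` is a hypothetical name, and the demo round-trips a throwaway object in a temporary directory rather than the real model files.

```python
import os
import pickle
import tempfile

def load_pickles(names, directory="."):
    """Load several pickle files into a dict keyed by file name (hypothetical helper)."""
    loaded = {}
    for name in names:
        with open(os.path.join(directory, name), "rb") as file:
            loaded[name] = pickle.load(file)
    return loaded

# Demo with a placeholder object instead of the real pipelines
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "dummy_model"), "wb") as file:
        pickle.dump({"kind": "placeholder"}, file)
    models = load_pickles(["dummy_model"], directory=tmp_dir)

print(models["dummy_model"])
```

With the files used in this notebook, the call would look like `load_pickles(['best_topic_lr_basic_vocab', 'best_geo_rf_entity_vocab'])`, assuming they sit in the working directory.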
    


In [4]:
# geo and topic codes

with open('geo_code_dic', 'rb') as file:
    geo_code_dic = pickle.load(file)
    
with open('topic_code_dic', 'rb') as file:
    topic_code_dic = pickle.load(file)
    
print(geo_code_dic)
print(topic_code_dic)

{0: 'afr', 1: 'as', 2: 'eu', 3: 'fr', 4: 'lat', 5: 'me', 6: 'spa', 7: 'usa', 8: 'wo'}
{0: 'cu', 1: 'eco', 2: 'ju', 3: 'mi', 4: 'po', 5: 'sc', 6: 'so', 7: 'spo'}
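
These dictionaries map integer codes to labels. When the opposite lookup is needed (for instance to turn a label into the code stored in the `geo_code` or `topic_code` column), they can be inverted. The sketch below copies the two dictionaries from the printed output; `invert_code_dic` is a hypothetical helper name.

```python
# Dictionaries copied from the printed output above
geo_code_dic = {0: 'afr', 1: 'as', 2: 'eu', 3: 'fr', 4: 'lat',
                5: 'me', 6: 'spa', 7: 'usa', 8: 'wo'}
topic_code_dic = {0: 'cu', 1: 'eco', 2: 'ju', 3: 'mi', 4: 'po',
                  5: 'sc', 6: 'so', 7: 'spo'}

def invert_code_dic(code_dic):
    """Return a label -> code mapping (hypothetical helper)."""
    inverse = {label: code for code, label in code_dic.items()}
    if len(inverse) != len(code_dic):
        raise ValueError("labels are not unique, the mapping cannot be inverted")
    return inverse

geo_label_to_code = invert_code_dic(geo_code_dic)
topic_label_to_code = invert_code_dic(topic_code_dic)

print(geo_label_to_code['fr'], topic_label_to_code['sc'])
```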


In [5]:
labeled_df=pd.read_csv("labeled_articles_clean.csv")
labeled_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,year,title,text,url,geo,topic,geo_code,topic_code
0,0,0,1988,Tintin dans l'espace,Trois semaines à bord de la station soviétique...,https://www.lexpress.fr/informations/tintin-da...,fr,sc,3,5
1,1,1,1988,Le faux suicide de Robert Boulin,1979 : son corps est découvert en forêt de Ram...,https://www.lexpress.fr/actualite/politique/le...,fr,ju,3,2
2,2,2,1988,Des pierres contre les certitudes,"Rideaux de fer baissés, silhouettes furtives, ...",https://www.lexpress.fr/actualite/monde/proche...,me,po,5,4
3,3,3,1988,"Otages: soudain, mercredi soir...",""" Je lui ai dit: ""Ça suffit"", et j'ai raccroch...",https://www.lexpress.fr/informations/otages-so...,me,ju,5,2
4,4,4,1988,Les secrets de la planète rouge,"S'il existe, dans le système solaire, un seul ...",https://www.lexpress.fr/actualite/sciences/les...,spa,sc,6,5


In [6]:
sample_articles=pd.read_excel("sample_articles.xlsx")
sample_articles.head()

Unnamed: 0,title,text,url
0,"« L’Ickabog », de J. K. Rowling : le triste ro...",C’est l’histoire d’un conte de fées qui se tra...,https://www.lemonde.fr/livres/article/2020/12/...
1,Covid-19 en France : Jean Castex expose la str...,Le premier ministre doit présenter la stratégi...,https://www.lemonde.fr/planete/article/2020/12...
2,Claude Guéant mis en examen pour « association...,L’ex-ministre de l’intérieur était déjà mis en...,https://www.lemonde.fr/societe/article/2020/12...
3,Pédocriminalité : quinze ans de réclusion requ...,L’homme de 70 ans comparait à huis clos pour r...,https://www.lemonde.fr/societe/article/2020/12...
4,Le prix littéraire Interallié décerné à Irène ...,"L’autrice, qui raconte les suites du meurtre d...",https://www.lemonde.fr/culture/article/2020/12...


## 1. Geo and Topic prediction 

In [7]:
pred_geo=best_geo_rf_entity_vocab.predict(sample_articles.text)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed:    0.0s finished


In [8]:
pred_geo_labels = [geo_code_dic[x] for x in pred_geo]
print(pred_geo_labels)

['fr', 'fr', 'fr', 'fr', 'fr', 'fr']


In [9]:
pred_topic = best_topic_lr_basic_vocab.predict(sample_articles.text)
pred_topic_labels = [topic_code_dic[x] for x in pred_topic]
print(pred_topic_labels)

['po', 'po', 'po', 'eco', 'eco', 'eco']
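
To get a quick overview of the predictions, the geo and topic labels can be paired per article and counted. The sketch below hard-codes the two lists from the outputs above so that it runs on its own; in the notebook, the existing `pred_geo_labels` and `pred_topic_labels` variables would be used directly.

```python
from collections import Counter

# Labels copied from the cell outputs above
pred_geo_labels = ['fr', 'fr', 'fr', 'fr', 'fr', 'fr']
pred_topic_labels = ['po', 'po', 'po', 'eco', 'eco', 'eco']

# One (geo, topic) pair per sample article, in the original order
pairs = list(zip(pred_geo_labels, pred_topic_labels))

# Count how many articles fall into each (geo, topic) combination
pair_counts = Counter(pairs)
for (geo, topic), count in pair_counts.most_common():
    print(f"geo={geo} topic={topic}: {count} article(s)")
```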


## 2. Basic reco based on article body similarity

