# Suggestions

Le but de ce notebook est qu'à partir d'un article on puisse trouver un groupe d'article proches.

Idées :
- tout prérassembler et classer dans des groupes pour limiter le temps de calculs. On pourrait imaginer faire des groupes de 10 articles proches et les afficher en suggestion -> existence de méthodes de clustering qui permettent de faire assez simplement je pense
- pour chaque article calculer ses plus proches voisins. On peut le faire à l'avance et stocker le numéro de ces articles. Comment définir la métrique ?


In [1]:
import numpy as np
import pandas as pd
import glob
import json
import matplotlib.pyplot as plt

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [3]:
meta_df=pd.read_csv("metadata_processed.csv")


In [4]:
def drop_nan(x):
    if str(x)=="nan":
        return ""
    return x

In [5]:
meta_df["abstract_process"]=meta_df["abstract_process"].apply(lambda x:drop_nan(x))

X=list(meta_df["abstract_process"])


In [6]:
vectorizer=TfidfVectorizer(lowercase=True,stop_words={'english'},min_df=0.01,analyzer='word')
vectorizer.fit(X)
title_vect=vectorizer.transform(X)

In [7]:
title_vect.shape


(157712, 1273)

In [8]:
print(title_vect[0,:])

  (0, 1240)	0.07639702492835487
  (0, 1219)	0.07744141286600782
  (0, 1216)	0.07702393437424933
  (0, 1208)	0.0737781331725699
  (0, 1192)	0.06918093827528092
  (0, 1153)	0.0570009151735699
  (0, 1100)	0.07395455998108887
  (0, 1080)	0.05957312705125294
  (0, 1074)	0.0742578606557069
  (0, 1021)	0.07198930518801817
  (0, 1020)	0.05061153385918455
  (0, 1016)	0.07054052529396325
  (0, 1013)	0.06593585223171049
  (0, 1009)	0.08661550573852737
  (0, 998)	0.06234298412373216
  (0, 966)	0.07705736125981294
  (0, 965)	0.08362749774292597
  (0, 963)	0.07863671146660153
  (0, 946)	0.04928295411413267
  (0, 932)	0.0699099996698374
  (0, 881)	0.05075784231428755
  (0, 866)	0.05704736615716727
  (0, 857)	0.38158562050952255
  (0, 830)	0.389118767163832
  (0, 783)	0.06722310672042628
  :	:
  (0, 335)	0.08212486430919541
  (0, 319)	0.06923425762552504
  (0, 312)	0.04330227022894022
  (0, 304)	0.07098333504523789
  (0, 289)	0.0790315073008854
  (0, 255)	0.0391835785851723
  (0, 247)	0.05935856209692

In [9]:
from sklearn.metrics.pairwise import linear_kernel

In [10]:
n,_=title_vect.shape
indices=np.zeros((n,5))
for i in range(n):
    sim=linear_kernel(title_vect[i:i+1],title_vect).flatten()
    indices[i]=sim.argsort()[-7:-2]
    if i%10000==0:
        print(i)

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000


In [19]:
df_indices=pd.DataFrame(data=indices,columns=[f"suggestion_{i}" for i in range(5)],dtype='int32')

In [20]:
meta_df2=pd.concat([meta_df,df_indices],axis=1)

In [21]:
meta_df2.head()

Unnamed: 0.1,Unnamed: 0,title,abstract,authors,title_process,abstract_process,date,path,note_date,nb_ref_linked,note_ref,suggestion_0,suggestion_1,suggestion_2,suggestion_3,suggestion_4
0,0,Clinical features of culture-proven Mycoplasma...,OBJECTIVE: This retrospective chart review des...,"Madani, Tariq A; Al-Ghamdi, Aisha A",clinical feature of cultureproven mycoplasma p...,objective retrospective chart review describes...,2001.0,document_parses/pdf_json/d1aafb70c066a2068b027...,0.05,0,0.0,148401,143499,102731,37888,147696
1,1,Nitric oxide: a pro-inflammatory mediator in l...,Inflammatory diseases of the respiratory tract...,"Vliet, Albert van der; Eiserich, Jason P; Cros...",nitric oxide a proinflammatory mediator in lun...,inflammatory disease respiratory tract commonl...,2000.0,document_parses/pdf_json/6b0567729c2143a66d737...,0.047619,1,0.00053,27159,15861,14983,25575,48591
2,2,Surfactant protein-D and pulmonary host defense,Surfactant protein-D (SP-D) participates in th...,"Crouch, Erika C",surfactant proteind and pulmonary host defense,surfactant proteind spd participates innate re...,2000.0,document_parses/pdf_json/06ced00a5fc04215949aa...,0.047619,6,0.003178,8712,4088,72868,39313,143632
3,3,Role of endothelin-1 in lung disease,Endothelin-1 (ET-1) is a 21 amino acid peptide...,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",role of endothelin1 in lung disease,endothelin1 et1 21 amino acid peptide diverse ...,2001.0,document_parses/pdf_json/348055649b6b8cf2b9a37...,0.05,2,0.001059,15360,130980,48591,15861,35032
4,4,Gene expression in epithelial cells in respons...,Respiratory syncytial virus (RSV) and pneumoni...,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",gene expression in epithelial cell in response...,respiratory syncytial virus rsv pneumonia viru...,2001.0,document_parses/pdf_json/5f48792a5fa08bed9f560...,0.05,2,0.001059,157555,127074,21507,26009,135944


In [22]:
meta_df2.to_csv("metadata_processed_suggestions.csv")

In [23]:
meta_df2.columns

Index(['Unnamed: 0', 'title', 'abstract', 'authors', 'title_process',
       'abstract_process', 'date', 'path', 'note_date', 'nb_ref_linked',
       'note_ref', 'suggestion_0', 'suggestion_1', 'suggestion_2',
       'suggestion_3', 'suggestion_4'],
      dtype='object')

In [27]:
article=meta_df2.loc[0]


In [32]:
suggestions=article[[f"suggestion_{i}" for i in range(5)]].tolist()

In [33]:
print(suggestions)

[148401, 143499, 102731, 37888, 147696]
