# Construction d'un tableau récapitulatif

Le but de ce notebook est de construire un pandas de taille (sous forme de matrice (nombre d'articles, nombres de champs lexicaux * 6) qui pour tous les articles récapitule ses "performances" dans différentes catégories, et ce pour chaque champ lexical déterminé à l'avance. 

Commençons par définir ces champs lexicaux, qui correspondent à des facteurs de risques. Ceux-ci sont suggérés par le site Kaggle (d'où est extraite la base de données). 
Ces facteurs sont : 
- Hypertension
- Diabetes
- Male gender
- Heart Disease
- COPD and respiratory system diseases
- Smoking Status
- Age
- Cerebrovascular disease
- Cardio- and cerebrovascular disease
- Cancer
- Chronic kidney diseases
- Drinking
- Overweight or obese
- Chronic liver disease

Il faut maintenant créer un "champ lexical" relatif à chacun de ces thèmes. Il est important d'avoir suffisament de synonymes pour ne pas manquer d'oublier des termes dans les articles. Ces derniers étant prétraités en amont, inutiles de faire attention aux majuscules et aux tirets. 

In [56]:
hypertension = ['hypertension', 'high blood pressure', 'hypertensive', 'high pressure', 'raised blood pressure', 'htn', 'hbp', 'ht', 'ace inhibitor', 'sartan']
diabetes = ['diabetes', 'high blood sugar', 'insulin resistance', 'diabetic', 'diabetics', 'dm']
gender = ['male gender', 'male', 'gender', 'sex', 'masculine', 'female']
heart_disease = ['cardiopathy', 'heart disease', 'heart', 'chd', 'arrhythmia', 'tachycardia', 'bradycardia', 'fibrillation', 'cardiomyopathy', 'infarction', 'ischemic heart disease']
COPD_respiratory_system = ['copd', 'emphysema', 'bronchitis', 'asthma', 'bronchiectasis', 'respiratory', 'trachea', 'lung', 'lungs', 'pulmonary', 'pneumoniae', 'pharyngitis', 'bronchiolitis', 'bronchopneumonia' ]
smoking_status = ['smoking', 'smoke', 'smoker', 'tobacco']
age = ['age', 'old', 'young', 'senior', 'child', 'children']
cerebrovascular_disease = ['cerebrovascular', 'embolism', 'ischemic', 'stroke', 'aneurysm', 'tia']
cancer = ['cancer', 'leukemia', 'cancerology', 'lymphom']
kidney_disease = ['kidney', 'gfr', 'dialysis']
drinking = ['drinking', 'alcohol', 'alcoholic']
overweight = ['overweight', 'obesity', 'obese', 'bmi']
liver_disease = ['liver', 'fascioliasis', 'cirrhosis', 'hepatitis']

## 0. Prétraitement des données

Afin de travailler sur des textes le plus efficacement possible, on va procéder à un prétraitement qui consiste :
- à récupérer les données nécessaires dans le dataset
- à en extraire texte, titre, abstract, 

In [181]:
import numpy as np
import pandas as pd
import json
import glob
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.blank('en')
nlp.max_length = 2000000

In [57]:
hta_tokens_list = [nlp(factor) for factor in hypertension]
diabetes_tokens_list = [nlp(factor) for factor in diabetes]
gender_tokens_list = [nlp(factor) for factor in gender]
heart_tokens_list = [nlp(factor) for factor in heart_disease]
respi_tokens_list = [nlp(factor) for factor in COPD_respiratory_system]
smoking_tokens_list = [nlp(factor) for factor in smoking_status]
age_tokens_list = [nlp(factor) for factor in age]
cerebrovascular_tokens_list = [nlp(factor) for factor in cerebrovascular_disease]
cancer_tokens_list = [nlp(factor) for factor in cancer]
kidney_tokens_list = [nlp(factor) for factor in kidney_disease]
drinking_tokens_list = [nlp(factor) for factor in drinking]
overweight_tokens_list = [nlp(factor) for factor in overweight]
liver_tokens_list = [nlp(factor) for factor in liver_disease]

In [5]:
root_path = '/Users/juliettemontanteme/Desktop/data_covid'
#astuce : pour trouver facilement le chemin, ouvrir le dossier correspondant 
#et en appuyant sur "alt" aller sur édition et copier le chemin
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path)

  interactivity=interactivity, compiler=compiler, result=result)


In [188]:
len(meta_df)

138794

In [99]:
#Classe permettant de faire un preprocessing d'un texte

class Preprocess:
    def __init__(self, path):
        self.path = path
        self.article = ""
        self.text = ""
        self.text1 = ""
        self.text2 = ""
        self.title = ""
        self.abstract = ""
    def open_article(self):
        file=open(str(root_path)+"/"+str(path),'r')
        article=json.load(file)
        file.close()
        self.article = article
    def make_text(self):
        text_list = []
        for entry in self.article['body_text']:
            text_list.append(entry['text'])
            text_list.append("\n")
        text_full=''.join(text_list)
        self.text = text_full
        self.title = nlp(self.article['metadata']['title'])
        self.abstract = nlp(meta_df_full_text.loc[lambda df: df['title'] == title] ['abstract'].to_numpy()[0])
    
    def preprocess(self): 
        self.text1 = nlp(self.text)
        text2 = ""
        for token in self.text1: 
            if token.is_stop == False:
                text2 = text2 + " " + token.lemma_
        self.text2 = nlp(text2)

    def repr_1(self):
        return(self.text1)
    
    def repr_2(self):
        return(self.text2)

## 1. La "diffusivité" d'un thème dans un article

Le but de cette partie est de travailler sur un corpus restreint de textes comportant déjà un mot-clé (ou un groupe de mots clés) correspondant à un facteur de risque et éventuellement prétraités. On parcourt chaque texte (titre et abstract inclus) et pour chaque texte, on sauvegarde les positions des mots choisis. On calcule ensuite la moyenne et l'écart-type de la position du mot.



In [131]:
import numpy as np
import math


class Diffusivity:
    def __init__(self, matches, doc):
        self.matches = matches
        self.doc = doc
        self.positions = []
        self.occurence = len(self.matches)
        self.score = 0
    def position(self):
        if self.matches != []:
            for match in self.matches :
                pos = (match[1] + match[2])/2
                self.positions.append(int(pos))   
    def scoring(self):
        if self.positions != []:
            mean_pos = np.mean(self.positions)
            ecart_pos = math.sqrt(np.var(self.positions))
            self.score = ecart_pos / len(self.doc)
        

On considère que lorsque l'écart type est plus grand que la moitié du nombre de mots du texte, cela veut dire que le mot-clé est cité dans plus de la moitié de l'article et donc que l'article parle principalement du mot-clé.


In [128]:
path = path_list[0]
proc = Preprocess(path)
proc.open_article()
proc.make_text()
proc.preprocess()
doc = proc.text1
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
matcher.add("RESPI", respi_tokens_list)
match = matcher(doc)
dif = Diffusivity(match, doc)
dif.position()
dif.scoring()
dif.score

0.2837764272295505

## 2. La fréquence d'un mot dans un groupe de mots

Dans cette section, on peut travailler sur un texte, sur un titre ou un abstract.

In [144]:
class Frequency: 
    def __init__(self, matches, doc):
        self.matches = matches
        self.doc = doc
        self.occurence = len(self.matches)
        self.score = 0

    def scoring(self):
        if len(self.matches) != 0:
            self.score = len(self.matches)/len(self.doc)



## 3. Processus de notation

In [9]:
def has_full_text(x):
    if x["pdf_json_files"]==True and x["pmc_json_files"]==True:
        return False
    return True

In [86]:
meta_bool=meta_df[["pdf_json_files","pmc_json_files"]].isnull()
meta_bool["has_full"]=meta_bool.apply(lambda x:has_full_text(x),axis=1)
meta_df_full_text=meta_df.loc[meta_bool["has_full"]]

In [11]:
n = len(meta_df_full_text)

In [161]:
path_list = []
for i in range(n):
    path = meta_df_full_text.iloc[i]['pmc_json_files']
    if str(path)[0] == 'd':
        path_list.append(path)
    else :
        path = meta_df_full_text.iloc[i]['pdf_json_files'].split(";")[0]
        path_list.append(path)


In [147]:
class Scoring:
    def __init__(self, path, lex, lex_name):
        self.path = path
        self.lex = lex
        self.lex_name = lex_name
        self.match_doc = []
        self.match_title = []
        self.match_abstract = []
        self.doc = ""
        self.score_dif = 0
        self.title = ""
        self.abstract = ""
        self.score_fr_doc = 0
        self.score_fr_title = 0
        self.score_fr_abstract = 0
        self.score_dif = 0
    def matching(self):
        proc = Preprocess(self.path)
        proc.open_article()
        proc.make_text()
        proc.preprocess()
        self.doc = proc.repr_1()
        self.title = proc.title
        self.abstract = proc.abstract
        matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
        matcher.add(self.lex_name, self.lex)
        self.match_doc = matcher(self.doc)
        self.match_title = matcher(self.title)
        self.match_abstract = matcher(self.abstract)
        
    def scoring(self):
        #Calcul du score de diffusivité
        dif = Diffusivity(self.match_doc, self.doc)
        dif.position()
        dif.scoring()
        self.score_dif = dif.score
        #Calcul du score de fréquence dans le texte
        fr_doc = Frequency(self.match_doc, self.doc)
        fr_doc.scoring()
        self.score_fr_doc = fr_doc.score
        #Calcul du score de fréquence dans le titre
        fr_title = Frequency(self.match_title, self.title)
        fr_title.scoring()
        self.score_fr_title = fr_title.score
        #Calcul du score de fréquence dans l'abstract
        fr_abstract = Frequency(self.match_abstract, self.doc)
        fr_abstract.scoring()
        self.score_fr_abstract = fr_abstract.score
        

In [139]:
path = path_list[0]
proc = Preprocess(path)
proc.open_article()
proc.make_text()
proc.preprocess()
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
matcher.add("RESPI", respi_tokens_list)
match = matcher(doc)
dif = Diffusivity(match, doc)
dif.position()
dif.scoring()

In [163]:
path = path_list[51]
sc = Scoring(path, respi_tokens_list, "RESPI")
sc.matching()
sc.scoring()
print(sc.score_dif, sc.score_fr_doc, sc.score_fr_title, sc.score_fr_abstract)

0.30831900241799925 0.002888917295155311 0 3.408751970684733e-05


In [182]:
tokens_list = [hta_tokens_list, diabetes_tokens_list, gender_tokens_list, heart_tokens_list,respi_tokens_list, smoking_tokens_list, age_tokens_list, cerebrovascular_tokens_list, cancer_tokens_list, kidney_tokens_list, drinking_tokens_list, overweight_tokens_list, liver_tokens_list] 
tokens_names = ["HTA", "DIABETES", "GENDER", "HEART", "RESPI", "SMOKING", "AGE", "CEREBROVASCULAR", "CANCER", "KIDNEY", "DRINKING", "OVERWEIGHT", "LIVER"]
for i in range(13):
    lex = tokens_list[i]
    lex_name = tokens_names[i]
    for path in path_list:  
        sc = Scoring(path, lex, lex_name)
        sc.matching()
        sc.scoring()
        meta_df_full_text[f'Diffusivité {lex_name}'] = sc.score_dif
        meta_df_full_text[f'FreqTitre {lex_name}'] = sc.score_fr_title
        meta_df_full_text[f'FreqAbstrac {lex_name}'] = sc.score_fr_abstract
        meta_df_full_text[f'FreqTexte {lex_name}'] = sc.score_fr_doc
    


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .

FileNotFoundError: [Errno 2] No such file or directory: '/Users/juliettemontanteme/Desktop/data_covid/document_parses/pmc_json/PMC3694918.xml.json'

In [185]:
path = 'document_parses/pmc_json/PMC3694918.xml.json'
sc = Scoring(path, respi_tokens_list, "RESPI")
sc.matching()
sc.scoring()
print(sc.score_dif, sc.score_fr_doc, sc.score_fr_title, sc.score_fr_abstract)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/juliettemontanteme/Desktop/data_covid/document_parses/pmc_json/PMC3694918.xml.json'

In [187]:
meta_df_full_text.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,...,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,Diffusivité HTA,FreqTitre HTA,FreqAbstrac HTA,FreqTexte HTA
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636.0,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,...,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,0,0,0,0
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967.0,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,...,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,0,0,0,0
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972.0,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,...,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,0,0,0,0
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871.0,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,...,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,0,0,0,0
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,11686888.0,no-cc,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,...,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,0,0,0,0


In [189]:
path = 'document_parses/pmc_json/PMC3694918.xml.json'
meta_df_full_text.loc[lambda df: df['pmc_json_files'] == path] 

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,...,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,Diffusivité HTA,FreqTitre HTA,FreqAbstrac HTA,FreqTexte HTA
1183,2fd5op27,94f391ebd1b25ac7fb8159d7b24b7414c582a630; 3154...,PMC,A Simulation Optimization Approach to Epidemic...,10.1371/journal.pone.0067164,PMC3694918,23826222.0,cc-by,Reliable forecasts of influenza can aid in the...,2013-06-27,...,,,document_parses/pdf_json/94f391ebd1b25ac7fb815...,document_parses/pmc_json/PMC3694918.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,0,0,0,0


In [None]:
sc.matching()
sc.scoring()
print(sc.score_dif, sc.score_fr_doc, sc.score_fr_title, sc.score_fr_abstract)