# Text data preprocessing phase

Once the data has been cleaned, we move on to preprocessing. The raw text cannot be fed directly to the model: it has to go through several steps before the model can learn anything from it.

In [1]:
import sys, re, inflect
import pandas as pd
from clean import clean_claimKG

import contractions as c
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.snowball import SnowballStemmer 
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer 
#nltk.download('all')

We start by loading and cleaning the data. See [Data presentation](./data-presentation.ipynb).

In [2]:
file_name = "../data/claimKG.csv"

# Read the file
kg_origin = pd.read_csv(file_name)
kg = kg_origin.copy()
clean_claimKG(kg, inplace=True, verbose=True)

Taille du dataframe: (39218, 23)
Suppression des 10 columns suivant:
	-> Unnamed: 0
	-> claimReview_source
	-> claimReview_author
	-> claimReview_author_url
	-> creativeWork_author_name
	-> creativeWork_author_sameAs
	-> creativeWork_datePublished
	-> rating_bestRating
	-> rating_ratingValue
	-> rating_worstRating
Suppression de 5749 lignes en doubles.
Suppression de 5 lignes.
Taille finale: (33464, 13)


Unnamed: 0,claimReview_author_name,claimReview_claimReviewed,claimReview_datePublished,claimReview_url,extra_body,extra_entities_author,extra_entities_body,extra_entities_claimReview_claimReviewed,extra_entities_keywords,extra_refered_links,extra_tags,extra_title,rating_alternateName
0,snopes,Finnish President Sauli Niinistö posted a vide...,2019-10-07,https://www.snopes.com/fact-check/president-fi...,"On Oct. 2, 2019, a joint press conference at t...",[],"[{""id"" : 33057"",""""begin"": 46,""end"": 57,""entity...","[{""id"" : 1042690"",""""begin"": 18,""end"": 32,""enti...",[],"https://t.co/Oo5Q56ALAu,https://twitter.com/ia...",,Did the President of Finland Post a Video Resp...,False
1,snopes,A supporter of U.S. Rep. Alexandria Ocasio-Cor...,2019-10-04,https://www.snopes.com/fact-check/babies-clima...,"An Oct. 3, 2019, town hall event in New York C...",[],"[{""id"" : 645042"",""""begin"": 33,""end"": 46,""entit...","[{""id"" : 54885332"",""""begin"": 22,""end"": 45,""ent...",[],https://twitter.com/redsteeze/status/117991491...,,Did an AOC Supporter Suggest ‘Eating Babies’ t...,Mixture
2,snopes,A photograph shows a bride and groom during a ...,2019-10-04,https://www.snopes.com/fact-check/handmaid-tal...,"In October 2019, a photograph supposedly showi...",[],"[{""id"" : 50430110"",""""begin"": 91,""end"": 106,""en...","[{""id"" : 50430110"",""""begin"": 46,""end"": 61,""ent...",[],https://twitter.com/God_loves_women/status/117...,,Is This a Photo of a ‘Handmaid’s Tale’-Themed ...,Miscaptioned
3,snopes,Canada legalized the medicinal use of cocaine.,2019-10-04,https://www.snopes.com/fact-check/medicinal-co...,"On Sep. 20, 2019, Huzlers published an article...",[],"[{""id"" : 7701"",""""begin"": 96,""end"": 103,""entity...","[{""id"" : 7701"",""""begin"": 38,""end"": 45,""entity""...",[],https://web.archive.org/web/20191004171021/htt...,,Did Canada Legalize the Medicinal Use of Cocaine?,Labeled Satire
4,snopes,"In September 2019, U.S. President Donald Trump...",2019-10-04,https://www.snopes.com/fact-check/trump-autism...,We received multiple inquiries from readers in...,[],"[{""id"" : 4848272"",""""begin"": 121,""end"": 133,""en...","[{""id"" : 4848272"",""""begin"": 31,""end"": 43,""enti...",[],"http://archive.is/ymlJP,http://archive.is/JgYP...",,Did Donald Trump Sign a $1.8 Billion Autism-Se...,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39213,factcheck_afp,Rat meat from China is sold as boneless chicke...,,https://factcheck.afp.com//no-massive-quantiti...,Claims that 1 million pounds of rat meat from ...,[],"[{""id"" : 11632"",""""begin"": 131,""end"": 162,""enti...",[],[],https://web.archive.org/web/20180725201453/htt...,,"No, massive quantities of rat meat are still n...",FALSE
39214,factcheck_afp,Mars will appear as big as the Moon on July 27,,https://factcheck.afp.com//no-mars-will-not-be...,Mars will be as big as the Moon in the sky thi...,[],"[{""id"" : 13586"",""""begin"": 757,""end"": 762,""enti...","[{""id"" : 14640471"",""""begin"": 0,""end"": 4,""entit...",[],https://web.archive.org/web/20180725152030/htt...,,"No, Mars will not be as big as the Moon in the...",Hoax
39215,factcheck_afp,Massachusetts repealed the Second Amendment of...,,https://factcheck.afp.com//massachusetts-did-n...,Online reports claim that the US state of Mass...,"[{""id"" : 58299742"",""""begin"": 0,""end"": 12,""enti...","[{""id"" : 18618239"",""""begin"": 30,""end"": 38,""ent...","[{""id"" : 31655"",""""begin"": 27,""end"": 43,""entity...",[],https://web.archive.org/web/20180720190537/htt...,,Massachusetts did not repeal the Second Amendm...,Misleading
39216,factcheck_afp,Une photo montre une foule très dense sur une ...,,https://factcheck.afp.com//no-photo-does-not-s...,"According to several posts on Facebook, a vira...","[{""id"" : 7529378"",""""begin"": 13,""end"": 21,""enti...","[{""id"" : 7529378"",""""begin"": 30,""end"": 38,""enti...",[],[],https://www.facebook.com/strangworldstrangerpe...,,"No, this photo does not show a crowded beach i...",Photo détournée


Next, we select 10 claims at random so we can manipulate them and see what can be done with them.

In [6]:
claims_text = list(kg['claimReview_claimReviewed'].sample(10))

for index, claim in enumerate(claims_text):
    print(index,"->",claim)

0 -> The governor has made a ""commitment to billions of dollars in debt and new spending without any explanation of how he plans to pay that money back.""
1 -> Says Mitt Romney once supported President Obama’s health care plan but now opposes it.
2 -> The toy company Hasbro acquired Death Row Records.
3 -> “Our real Gross Domestic Product (GDP) grew by 4.9 percent in 2017.”
4 -> Says ""Scott Baio .. dies in small plane crash.""
5 -> The Texas Senate ""approved a bill to put a special label on the insurance cards of anyone who bought a plan through Obamacare"" that includes the letter ""S"" for subsidy.
6 -> E-mail reproduces Ben Stein’s defense of President Bush’s actions in the aftermath of Hurricane Katrina.
7 -> The Massachusetts health care plan ""dealt with 8 percent of our population,"" far less than the ""100 percent of American people"" affected by President Barack Obama’s health care law.
8 -> Retailer Forever 21 is selling rings and other jewelry that include swastikas.
9 ->

## Tokenization

Splitting the claim into tokens (words).

In [8]:
def tokenize(text):
    return nltk.word_tokenize(text)

tokenize(claims_text[8])

['Retailer',
 'Forever',
 '21',
 'is',
 'selling',
 'rings',
 'and',
 'other',
 'jewelry',
 'that',
 'include',
 'swastikas',
 '.']
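
To see what a tokenizer adds over a plain whitespace split, here is a rough sketch using only the standard library. The `regex_tokenize` helper and its pattern are illustrative assumptions, not NLTK's algorithm; they only mimic the way the final period is split off in the output above.

```python
import re

def regex_tokenize(text):
    # Hypothetical helper: runs of word characters, or single punctuation marks.
    # A crude approximation, not what nltk.word_tokenize does internally.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Retailer Forever 21 is selling rings."
print(sentence.split())          # whitespace split keeps "rings." glued together
print(regex_tokenize(sentence))  # the period becomes its own token
```

A real tokenizer also handles quotes, abbreviations and contractions, which this pattern does not attempt.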

## Lowercasing

Lowercasing can be useful in some cases.

In [9]:
def lowercase(text):
    return text.lower()

lowercase(claims_text[2])

'says that 500,000 federal workers -- one-fourth of the federal workforce -- make more than $100,000 a year.'

## Numbers to words

Convert numbers into words.

NB: apply this step before `ponctuations` to avoid splitting `15.25` into `15` and `25`.
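
The ordering constraint in the note can be checked with a minimal sketch. `strip_punctuation` is a stand-in name that reuses the same regular expression as the `ponctuations` function defined in the punctuation section of this notebook.

```python
import re

def strip_punctuation(text):
    # Same regex as `ponctuations`: every non-word, non-space character becomes a space
    return re.sub(r'[^\w\s]', ' ', text)

sample = "grew by 15.25 percent"
tokens = strip_punctuation(sample).split()
print(tokens)  # the decimal point is gone, so "15.25" is now two separate tokens
```

Once the number is split this way, a later number-to-words step would spell out two unrelated integers instead of one decimal value.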

In [11]:
def number_to_words(text):
    # Spell out a numeric string using the inflect library
    return inflect.engine().number_to_words(text)

#print(number_to_words("15.2"))

def number2words(text):
    tokens = tokenize(text)
    for i, m in enumerate(tokens):
        try:
            # Only tokens that parse as a number are converted
            float(m)
        except ValueError:
            # Not a number: leave the token unchanged
            continue
        else:
            tokens[i] = number_to_words(m)
    return ' '.join(tokens)

print(claims_text[3])
print("\n")
print(number2words(claims_text[3]))

“Our real Gross Domestic Product (GDP) grew by 4.9 percent in 2017.”


“ Our real Gross Domestic Product ( GDP ) grew by four point nine percent in two thousand and seventeen . ”


## Handling contractions and punctuation

Removing punctuation can affect the quality of the model, for example in opinion detection. It is preferable to expand the contractions in the sentences first, and only then remove the punctuation.

In [11]:
import contractions as c
def contractions(text):
    return c.fix(text)

print(contractions("couldn't"))

def ponctuations(text):
    return re.sub(r'[^\w\s]', ' ', text)

print(ponctuations(claims_text[5]))

could not
A photograph shows a female wolf protecting a male s throat during a fight 
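
The effect of the recommended order can be sketched as follows. The tiny `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical stand-ins for `contractions.fix`, used here only so that the example runs with the standard library.

```python
import re

# Hypothetical, tiny contraction table standing in for contractions.fix
CONTRACTIONS = {"couldn't": "could not", "isn't": "is not"}

def expand_contractions(text):
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def strip_punctuation(text):
    # Same regex as `ponctuations` above
    return re.sub(r'[^\w\s]', ' ', text)

text = "He couldn't pay."
good = strip_punctuation(expand_contractions(text)).split()
bad = expand_contractions(strip_punctuation(text)).split()
print(good)  # contraction expanded before the apostrophe is removed
print(bad)   # apostrophe removed first, leaving the fragments "couldn" and "t"
```

The same mechanism explains the stray `s` in the output above: the possessive apostrophe is replaced by a space.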


## Stopwords

Remove the most frequent words of the language. In our case: `the`, `a`, `an`, `in`, ...

NB: the default stopword list provided by NLTK includes negation forms.

In [12]:
stopwords.words('english')
stopwords.words('french')

def remove_stopwords(text_tokenized, language='english'):
    stop_words = set(stopwords.words(language))
    # Note: the comparison is case-sensitive, so a capitalized "An" is not removed
    return [w for w in text_tokenized if w not in stop_words]

print(claims_text[8])
claim = ponctuations(claims_text[8])
claim = tokenize(claim)
claim = remove_stopwords(claim)
print(' '.join(claim))

An inattentive janitor caused several deaths in a hospital when he disconnected patients' life support systems to plug in a floor polisher.
An inattentive janitor caused several deaths hospital disconnected patients life support systems plug floor polisher
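
In the output above, the leading `An` survives because the comparison is case-sensitive, and the note about negations suggests that words such as `not` may be worth keeping. A possible variant is sketched below. `STOP_WORDS` is a small hand-written stand-in for `set(stopwords.words('english'))`, and the `NEGATIONS` set and the function name are assumptions made for this example.

```python
# Tiny stand-in for set(stopwords.words('english')); NOT the real NLTK list
STOP_WORDS = {"a", "an", "the", "in", "when", "he", "to", "not", "no", "nor"}
# Assumed set of negation words we want to keep in the text
NEGATIONS = {"not", "no", "nor"}

def remove_stopwords_keep_negations(tokens, stop_words=STOP_WORDS, keep=NEGATIONS):
    # Drop the negations from the stopword set, then compare in lowercase
    effective = stop_words - keep
    return [t for t in tokens if t.lower() not in effective]

tokens = "An inattentive janitor did not plug in a floor polisher".split()
filtered = remove_stopwords_keep_negations(tokens)
print(filtered)
```

Whether keeping negations helps depends on the downstream model, so this is one of the combinations worth comparing rather than a fixed recommendation.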


## Stemming

Stemming aims to keep only the root of a word. The root is what remains of the word once its prefix(es) and suffix(es) have been removed, that is, its stem.
Several variants of a term can thus be grouped under a single representative form.

Several stemming algorithms exist; the one used here is the `Snowball Stemmer`. There is also the `Lancaster Stemmer`, which is considered more aggressive.

In [19]:
def stem(text_tokenized, language='english', stemmer_name='snowball', verbose=False):
    if stemmer_name == 'snowball':
        if verbose:
            print('Snowball stemmer used!')
        stemmer = SnowballStemmer(language=language)
    elif stemmer_name == 'lancaster':
        if language != 'english':
            # LancasterStemmer only handles English
            raise ValueError("LancasterStemmer does not support " + language)
        stemmer = LancasterStemmer()
    else:
        # Fail explicitly instead of hitting an unbound `stemmer` below
        raise ValueError("Unknown stemmer: " + stemmer_name)
    return [stemmer.stem(term) for term in text_tokenized]

claim = claims_text[0]
print(claim)
claim = ponctuations(claim)
claim = tokenize(claim)
claim = stem(claim,stemmer_name='snowball',verbose=True)
print(' '.join(claim))

In 2005 and 2007, "" Joe Straus received a 100 percent rating by NARAL (the National Abortion and Reproductive Rights Action League).""
Snowball stemmer used!
in 2005 and 2007 joe straus receiv a 100 percent rate by naral the nation abort and reproduct right action leagu
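
The grouping effect of stemming can be illustrated with a deliberately naive suffix stripper. This is not the Snowball or Lancaster algorithm; `naive_stem` and its suffix list are assumptions made for the example.

```python
# Assumed, very short suffix list; real stemmers apply many ordered rules
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

variants = ["connect", "connected", "connecting", "connects"]
stems = {naive_stem(w) for w in variants}
print(stems)  # all four variants collapse to a single stem
```

Like the real stemmers, this kind of rule can produce strings that are not dictionary words, as `receiv` and `leagu` show in the Snowball output above.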


## Lemmatization

Stemming and lemmatization are closely related notions, but there are fundamental differences.
Lemmatization aims to recover the lemma of a word, for example the infinitive for verbs. Stemming consists of stripping the end of words, which can produce a word that does not exist in the language.

NB: lemmatization works much better when each word comes with its part-of-speech (POS) tag.
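
To make the role of the POS tag concrete, here is a toy sketch in which the same surface form maps to different lemmas depending on its tag. The `LEMMAS` table and `toy_lemmatize` are hypothetical and hand-written for this example; a real lemmatizer such as `WordNetLemmatizer` relies on a full lexicon instead.

```python
# Hypothetical lemma table keyed by (word, WordNet-style POS letter)
LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("leaves", "v"): "leave",
    ("leaves", "n"): "leaf",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return LEMMAS.get((word.lower(), pos), word)

for word in ("saw", "leaves"):
    print(word, "-> noun:", toy_lemmatize(word, "n"), "| verb:", toy_lemmatize(word, "v"))
```

Without the tag, the lemmatizer has to pick one reading by default, which is why untagged verbs are often left unchanged.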

In [34]:
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

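def lem(words, pos_tags=None):
    lemmatizer = WordNetLemmatizer()
    # Use the POS tags only when there is exactly one tag per word
    if pos_tags is not None and len(words) == len(pos_tags):
        lemmas = []
        for w, pos in zip(words, pos_tags):
            if pos is None:
                # No WordNet equivalent for this tag: use the lemmatizer's default
                lemmas.append(lemmatizer.lemmatize(w))
            else:
                lemmas.append(lemmatizer.lemmatize(w, pos=pos))
        return lemmas
    else:
        return [lemmatizer.lemmatize(w) for w in words]

claim = claims_text[8]
print(claim)
claim = ponctuations(claim)
claim = tokenize(claim)

# Call nltk.pos_tag directly: the pos_tag helper is only defined in a later cell
pos = nltk.pos_tag(claim)

claim = lem(claim, [get_wordnet_pos(p[1]) for p in pos])
print(' '.join(claim))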

An inattentive janitor caused several deaths in a hospital when he disconnected patients' life support systems to plug in a floor polisher.
An inattentive janitor cause several death in a hospital when he disconnect patient life support system to plug in a floor polisher


## POS tagging

Part-of-speech tagging is the process of associating the words of a text with the corresponding grammatical information, such as part of speech, gender, number, etc.

In [35]:
def pos_tag(words):
    return nltk.pos_tag(words)

claim = claims_text[8]
print(claim)
claim = ponctuations(claim)
claim = tokenize(claim)
pos = pos_tag(claim)
pos

An inattentive janitor caused several deaths in a hospital when he disconnected patients' life support systems to plug in a floor polisher.


[('An', 'DT'),
 ('inattentive', 'JJ'),
 ('janitor', 'NN'),
 ('caused', 'VBD'),
 ('several', 'JJ'),
 ('deaths', 'NNS'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('hospital', 'NN'),
 ('when', 'WRB'),
 ('he', 'PRP'),
 ('disconnected', 'VBD'),
 ('patients', 'NNS'),
 ('life', 'NN'),
 ('support', 'NN'),
 ('systems', 'NNS'),
 ('to', 'TO'),
 ('plug', 'VB'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('floor', 'NN'),
 ('polisher', 'NN')]

## Preprocessing function

The preprocessing function combines several of the functions described above.
The idea is to train the model with a few of these combinations to see their effect on the learning process. The function takes raw text as input and returns a list of tokens, optionally accompanied by their POS tags.
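
One way to compare several combinations is to build each variant as a composition of text-to-text steps. The sketch below uses only the standard library; `build_pipeline`, `strip_punctuation` and the configuration names are assumptions made for illustration, and the final whitespace split stands in for the NLTK tokenizer.

```python
import re

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # Same regex as `ponctuations` earlier in this notebook
    return re.sub(r'[^\w\s]', ' ', text)

def build_pipeline(*steps):
    """Compose text -> text steps, applied left to right, then split into tokens."""
    def run(text):
        for step in steps:
            text = step(text)
        return text.split()
    return run

# Each entry is one preprocessing combination to try
configs = {
    "raw": build_pipeline(),
    "lower": build_pipeline(lowercase),
    "lower+punct": build_pipeline(lowercase, strip_punctuation),
}

claim = "The toy company Hasbro acquired Death Row Records."
for name, pipeline in configs.items():
    print(name, "->", pipeline(claim))
```

Keeping each step as a plain function makes it easy to add or reorder steps (contractions, stopwords, stemming) without rewriting the pipeline itself.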

In [None]:
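class TextPreTraitement:
    lowercase=False
    lem=False
    pos=False
    stem=False
    ponctuation=False
    contraction=False
    tokenize=False
    stopword=False

    def __init__(self,lowercase=False,
            lem=False,
            pos=False,
            stem=False,
            ponctuation=False,
            contraction=False,
            tokenize=False,
            stopword=False):
        # The original cell stops here; storing the flags is an assumed minimal body
        self.lowercase = lowercase
        self.lem = lem
        self.pos = pos
        self.stem = stem
        self.ponctuation = ponctuation
        self.contraction = contraction
        self.tokenize = tokenize
        self.stopword = stopword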

In [49]:
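def pretraitement(text_brut):
    # 1. Remove punctuation
    text = ponctuations(text_brut)
    # 2. Split into tokens
    text_tokenized = tokenize(text)
    # 3. Tag each token with its Treebank POS tag
    text_tagged = pos_tag(text_tokenized)
    # 4. Lemmatize each token using the WordNet equivalent of its tag
    text_lematized = lem(text_tokenized, [get_wordnet_pos(p[1]) for p in text_tagged])
    # Pair each lemma with its original Treebank tag
    return [(lemma, p) for lemma, (w, p) in zip(text_lematized, text_tagged)]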

for claim in claims_text:
    print("Claim: ",claim)
    pre = pretraitement(claim)
    for p in pre:
        print(p)
    print("\n")

Claim:  In 2005 and 2007, "" Joe Straus received a 100 percent rating by NARAL (the National Abortion and Reproductive Rights Action League).""
('In', 'IN')
('2005', 'CD')
('and', 'CC')
('2007', 'CD')
('Joe', 'NNP')
('Straus', 'NNP')
('receive', 'VBD')
('a', 'DT')
('100', 'CD')
('percent', 'NN')
('rating', 'NN')
('by', 'IN')
('NARAL', 'NNP')
('the', 'DT')
('National', 'NNP')
('Abortion', 'NNP')
('and', 'CC')
('Reproductive', 'NNP')
('Rights', 'NNP')
('Action', 'NNP')
('League', 'NNP')


Claim:  Says that except for Donald Trump, ""every other major party nominee"" for the past 40 years has released their tax returns.
('Says', 'VBZ')
('that', 'WDT')
('except', 'IN')
('for', 'IN')
('Donald', 'NNP')
('Trump', 'NNP')
('every', 'DT')
('other', 'JJ')
('major', 'JJ')
('party', 'NN')
('nominee', 'NN')
('for', 'IN')
('the', 'DT')
('past', 'JJ')
('40', 'CD')
('year', 'NNS')
('have', 'VBZ')
('release', 'VBN')
('their', 'PRP$')
('tax', 'NN')
('return', 'NNS')


Claim:  Says that 500,000 federal wo