# Tutored Project, M1 S2 Computer Science, DS4H

### Subject:

#### Intelligent restructuring of a political debate dataset for the extraction of argumentative structures

The goal of this work is to structure the data needed to study the components and relations within the political debates held during the United States presidential elections from 1960 to 2016. The debates take the form of a dialogue between the candidates, who answer questions posed by a moderator on topics such as the economy, security, education, war, and health care. Each debate has been divided into sections according to the topics discussed.

All debates have been annotated from an argumentation perspective. The annotations cover the argumentative components:
- Claims
- Premises

and the relations between these components:
- Attack
- Support
- Equivalent

The objective of the project is to design and implement a textual data structure that is easy to manipulate for one of the many NLP tasks, namely argument extraction.
More precisely, two structures are required:
1) A dataset referencing the components (Claim, Premise), with the following columns:
- Dialogue line
- Argumentation components
- BIO scheme of the components
- Linguistic, lexical, grammatical, syntactic, etc. features of the component (one column per feature)
2) A dataset of the relations (Attack, Support, Equivalent), grouped by section, with the following columns:
- Component 1 (Claim/Premise)
- Component 2 (Claim/Premise)
- Relation type (Attack/Support/Equivalent)
- BIO scheme of the components and relations, with their distance
- Linguistic, lexical, grammatical, syntactic, etc. features of the relation (one column per feature)
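
To make the "BIO scheme" column concrete, here is a minimal sketch. It assumes whitespace tokenization and that the component appears verbatim as a contiguous run of tokens in the dialogue line. The function name `bio_tags` and the tag names (`B-Claim`, `I-Claim`, `O`) are illustrative assumptions, not the encoding actually used by the dataset.

```python
def bio_tags(line, component, label):
    """Tag each whitespace token of `line` as B-<label>, I-<label> or O.

    Hypothetical helper: assumes `component` occurs verbatim in `line`
    as a contiguous run of whitespace-separated tokens.
    """
    tokens = line.split()
    comp_tokens = component.split()
    tags = ["O"] * len(tokens)
    n = len(comp_tokens)
    if n == 0:
        return tokens, tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == comp_tokens:
            tags[start] = f"B-{label}"          # first token of the component
            for k in range(start + 1, start + n):
                tags[k] = f"I-{label}"          # remaining tokens of the component
            break                               # tag only the first occurrence
    return tokens, tags


tokens, tags = bio_tags(
    "Well , we can no longer afford to be second best .",
    "we can no longer afford to be second best",
    "Claim",
)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Tokens outside the component keep the tag `O`, so the same function could be called once per component and the results merged if a line holds several components.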

## Questions
* What if a component spans two sentences? We do not have any punctuation in the current component dataset.

In [23]:
# import the libraries
import pandas as pd
import string
from nltk.tokenize import sent_tokenize

In [24]:
# open the data file
components_data = pd.read_csv('./data/test_components.csv')
speeches_data = pd.read_csv('./data/test_speeches.csv')
# work with only the first 5 rows
# components_data = components_data.head(5)

In [25]:
# create a new dataframe that copies components_data without the "Current_Sentence", "Previous_Sentence" and "Next_Sentence" columns
components_data_context = components_data.drop(['Current_Sentence', 'Previous_Sentence', 'Next_Sentence'], axis=1)
# add columns for contexts
components_data_context['Context1'] = ''
components_data_context['Context2'] = ''

In [26]:
# indices of the components that were not found
components_not_found = []

# for each component, find the speech that contains it
for index, row in components_data.iterrows():
    textToFind = row.Text
    # keep the speeches whose text contains the component verbatim
    speeches = speeches_data[speeches_data['Speech'].str.find(textToFind) != -1]

    # speeches['Speech'].values[0] raises IndexError when no speech matches
    try:
        speech = speeches['Speech'].values[0]
        # tokenize the speech into sentences
        sentences = sent_tokenize(speech)

        context1 = ''
        # find the sentence that contains the component
        # (use `i`, not `index`, so the row label of the outer loop is not overwritten)
        for i, sentence in enumerate(sentences):
            if textToFind in sentence:
                # previous sentence (if any), current sentence, next sentence (if any)
                context1 = ' '.join(sentences[max(i - 1, 0):i + 2])
                break
        # Context1: the sentence containing the component and its neighbours
        components_data_context.loc[index, 'Context1'] = context1
        # Context2: the full speech
        components_data_context.loc[index, 'Context2'] = speech
    except IndexError:
        # record the index of the component that was not found
        components_not_found.append(index)
        print('Component not found in speeches: ' + str(index))
        
    

Component not found in speeches: 56
Component not found in speeches: 75
Component not found in speeches: 94
Component not found in speeches: 95
Component not found in speeches: 603
Component not found in speeches: 927
Component not found in speeches: 935
Component not found in speeches: 1335
Component not found in speeches: 1410
Component not found in speeches: 1416
Component not found in speeches: 1429
Component not found in speeches: 1525
Component not found in speeches: 1709
Component not found in speeches: 3552
Component not found in speeches: 4821
Component not found in speeches: 5416
Component not found in speeches: 5434
Component not found in speeches: 5521
Component not found in speeches: 5537
Component not found in speeches: 5579
Component not found in speeches: 5583
Component not found in speeches: 5601
Component not found in speeches: 5644
Component not found in speeches: 5650
Component not found in speeches: 5757
Component not found in speeches: 5817
Component not found in 

A number of components were not found in the speeches; this may be due to the format in which the data was stored.


In [27]:
# a number of components were not found in the speeches
print('Percentage of components not found: ' + str(len(components_not_found)/len(components_data)*100) + '%')

Percentage of components not found: 0.4634994206257242%


The next step is therefore to detect the different formats so that we can patch them and add these components to our first context.
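
Since the patched components may cover several sentences, the context-building step (previous sentence, component, next sentence) can be isolated in a small helper. This is only a sketch: the name `sentence_window` is hypothetical, and it assumes the sentences are already available as a list, for example from sent_tokenize.

```python
def sentence_window(sentences, first, last):
    """Join sentences[first..last] with one neighbouring sentence on each side.

    Hypothetical helper: `first` and `last` are the positions of the first
    and last sentence covered by the component (equal for a one-sentence
    component). Neighbours that fall outside the list are simply skipped.
    """
    lo = max(first - 1, 0)
    hi = min(last + 1, len(sentences) - 1)
    return " ".join(sentences[lo:hi + 1])


sentences = ["Thank you.", "We cannot stand still.", "We must move.", "That is all."]
print(sentence_window(sentences, 1, 2))  # component covering sentences 1 and 2
print(sentence_window(sentences, 0, 0))  # component in the first sentence
```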

In [28]:
# take as an example the first component that was not found
cps = components_data.loc[components_not_found[0]]
print(cps.Text)


I believe it my responsibility as the leader of the Democratic party in 1960 to try to warn the American people that in this crucial time we can no longer afford to stand still We can no longer afford to be second best


Here the component (ID 56) is made up of two sentences, but it was stored without punctuation, so sent_tokenize does not detect it as two sentences and it is not found verbatim in the punctuated speech.

We will therefore run the loop again, this time removing the punctuation from the speeches before matching.
(One small problem with this technique: components containing apostrophes (') are not matched, so we will look for another solution afterwards.)

In [29]:
# indices of the components that are still not found after this pass
components_not_found_2 = []
# translation table that deletes every punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

# loop through the components that have not been found
for index, row in components_data.loc[components_not_found].iterrows():
    textToFind = row.Text
    # find the speeches that contain the component once the punctuation is removed from the speech
    speeches = speeches_data[speeches_data['Speech'].str.translate(strip_punct).str.find(textToFind) != -1]
    # speeches['Speech'].values[0] raises IndexError when no speech matches
    try:
        sentences = sent_tokenize(speeches['Speech'].values[0])

        component = ""
        first_id = -1
        last_id = -1
        # enumerate gives each sentence its true position
        # (sentences.index() would return the first duplicate instead)
        for i, sentence in enumerate(sentences):
            # if the sentence, without its punctuation, is part of the component text
            if textToFind.find(sentence.translate(strip_punct)) != -1:
                if first_id == -1:
                    first_id = i
                last_id = i
                # concat the sentence to the component
                component = component + ' ' + sentence
        if first_id == -1:
            # do not overwrite the component with an empty string
            raise ValueError('no sentence of the speech matches the component')
        component = component.strip()
        # replace the component text with its punctuated version
        components_data.loc[index, 'Text'] = component
        components_data_context.loc[index, 'Text'] = component
        # Context1: previous sentence (if any), component, next sentence (if any)
        context1 = ''
        if first_id > 0:
            context1 += sentences[first_id-1]
        context1 += ' ' + component
        if last_id < len(sentences)-1:
            context1 += ' ' + sentences[last_id+1]
        components_data_context.loc[index, 'Context1'] = context1.strip()
        # Context2: the full speech
        components_data_context.loc[index, 'Context2'] = speeches['Speech'].values[0]

    except Exception as e:
        # record the index of the component that was not found
        components_not_found_2.append(index)
        # print('Component not found in speeches: ' + str(index))
        print(e)

index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with size 0
index 0 is out of bounds for axis 0 with

In [30]:
# get first component that has not been found
comp = components_data.loc[components_not_found_2[0]]
# print the speechid and the section id and the text of the component
print(comp[['SpeechID', 'SectionID', 'Text']])
# print the percentage of components that have not been found
print('Percentage of components not found: ' + str(len(components_not_found_2)/len(components_data)*100) + '%')

SpeechID                                                     3
SectionID                                                    9
Text         Anybody that says America has been standing st...
Name: 75, dtype: object
Percentage of components not found: 0.44243126514273673%


As we can see, the percentage has barely changed, so we will try another solution.
The main problem here is that components containing apostrophes (') are not matched: removing all punctuation from the speech also removes the apostrophes, while the component text still contains them. This is a frequent case in English, so we need a way to match these sentences.

In [31]:
sentence = "Anybody that says America has been standing still for the last seven and a half years hasn't been traveling in America. He's been in some other country!"
print(sentence)
print(sentence.translate(str.maketrans('', '', string.punctuation)))

Anybody that says America has been standing still for the last seven and a half years hasn't been traveling in America. He's been in some other country!
Anybody that says America has been standing still for the last seven and a half years hasnt been traveling in America Hes been in some other country


To do so, we can use a regex that removes only the sentence-ending punctuation (. ! ?) and leaves the apostrophes in place.

In [32]:
import re
new_s = re.sub(r'[\.!\?]','',sentence)
print(new_s)

Anybody that says America has been standing still for the last seven and a half years hasn't been traveling in America He's been in some other country
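
An alternative to the regex above would be to remove punctuation from both the speech and the component, while remembering where each kept character came from, so that the match can be mapped back to the original, punctuated speech. The sketch below is not part of the notebook's pipeline: the names `strip_punct_with_map` and `find_original_span` are hypothetical, and it assumes that spacing is identical on both sides once punctuation is removed.

```python
import string

_PUNCT = set(string.punctuation)


def strip_punct_with_map(text):
    """Return `text` without punctuation and the original index of each kept character."""
    kept, positions = [], []
    for i, ch in enumerate(text):
        if ch not in _PUNCT:
            kept.append(ch)
            positions.append(i)
    return "".join(kept), positions


def find_original_span(speech, component):
    """Locate `component` in `speech`, ignoring punctuation on both sides.

    Returns (start, end) offsets into the original speech, or None.
    """
    norm_speech, positions = strip_punct_with_map(speech)
    norm_component = strip_punct_with_map(component)[0].strip()
    if not norm_component:
        return None
    start = norm_speech.find(norm_component)
    if start == -1:
        return None
    last = start + len(norm_component) - 1
    # map the first and last matched characters back to the original speech
    return positions[start], positions[last] + 1


speech = "Well, that's true. We can't stand still! Thank you."
component = "that's true We can't stand still"
span = find_original_span(speech, component)
print(speech[span[0]:span[1]])
```

Because the returned offsets point into the original speech, the recovered text keeps its punctuation and apostrophes, even when the component spans two sentences.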


In [33]:
# indices of the components that are still not found after this pass
components_not_found_3 = []
# sentence-ending punctuation only, so that apostrophes are kept
end_punct = r'[.!?]'

# loop through the components that have not been found
for index, row in components_data.loc[components_not_found_2].iterrows():
    textToFind = row.Text
    # find the speeches that contain the component, ignoring only sentence-ending punctuation
    speeches = speeches_data[speeches_data['Speech'].str.replace(end_punct, '', regex=True).str.find(textToFind) != -1]
    # speeches['Speech'].values[0] raises IndexError when no speech matches
    try:
        sentences = sent_tokenize(speeches['Speech'].values[0])
        component = ""
        print(index)
        first_id = -1
        last_id = -1
        # enumerate gives each sentence its true position
        for i, sentence in enumerate(sentences):
            # a plain str has no regex-aware replace(), so re.sub is used here
            if textToFind.find(re.sub(end_punct, '', sentence)) != -1:
                if first_id == -1:
                    first_id = i
                last_id = i
                # concat the sentence to the component
                component = component + ' ' + sentence
        if first_id == -1:
            # do not overwrite the component with an empty string
            raise ValueError('no sentence of the speech matches the component')
        component = component.strip()
        # replace the component text with its punctuated version
        components_data.loc[index, 'Text'] = component
        components_data_context.loc[index, 'Text'] = component
        # Context1: previous sentence (if any), component, next sentence (if any)
        context1 = ''
        if first_id > 0:
            context1 += sentences[first_id-1]
        context1 += ' ' + component
        if last_id < len(sentences)-1:
            context1 += ' ' + sentences[last_id+1]
        components_data_context.loc[index, 'Context1'] = context1.strip()
        # Context2: the full speech
        components_data_context.loc[index, 'Context2'] = speeches['Speech'].values[0]

    except (IndexError, ValueError):
        # record the index of the component that was not found
        components_not_found_3.append(index)
        # print('Component not found in speeches: ' + str(index))

75
95


In [34]:
# print the component with ID 603
print(components_data.loc[603]['Text'])

the tax program of the Ford administration will bring jobs where people are, and help to revitalize those cities as they can be


In [35]:
# search the speech that contains the word 'the tax program'
speeches_data[speeches_data['Speech'].str.find('the tax program of the Ford administration') != -1]

Unnamed: 0,Year,Date,Speaker,SectionID,SpeechID,Speech,Start,End
62,1976,22Oct,FORD,8,4,Let me uh - speak out very strongly. The Ford...,3485,5213


In [36]:
m_string = 'the tax program of the Ford administration will bring jobs where people'

# try all the speech files (validation, train, test)
validation_speeches = pd.read_csv('data/validation_speeches.csv')
print(len(validation_speeches[validation_speeches['Speech'].str.find(m_string) != -1]))

train_speeches = pd.read_csv('data/train_speeches.csv')
print(len(train_speeches[train_speeches['Speech'].str.find(m_string) != -1]))

test_speeches = pd.read_csv('data/test_speeches.csv')
print(len(test_speeches[test_speeches['Speech'].str.find(m_string) != -1]))

# does this component not exist verbatim in any of the speech files?
# print the component with ID 603
print(components_data.loc[603]['Text'])

0
0
0
the tax program of the Ford administration will bring jobs where people are, and help to revitalize those cities as they can be
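
Since a shorter prefix of this component is found in a speech while the longer string is not, it can help to see where the two texts diverge. One possible diagnostic, sketched below with the standard-library difflib module, is to extract the longest run of characters shared by the component and a candidate speech. The helper name `longest_shared_run` is hypothetical, and the speech string in the example is invented for illustration (the real speech is truncated in the output above), so the disfluency it contains is only an assumption about what the transcript might look like.

```python
from difflib import SequenceMatcher


def longest_shared_run(component, speech):
    """Return the longest substring common to both texts and its offset in `speech`."""
    matcher = SequenceMatcher(None, component, speech, autojunk=False)
    match = matcher.find_longest_match(0, len(component), 0, len(speech))
    return component[match.a:match.a + match.size], match.b


component = "the tax program of the Ford administration will bring jobs where people are"
# invented speech text, NOT the actual transcript
speech = "the tax program of the Ford administration uh will bring jobs where people are"

run, offset = longest_shared_run(component, speech)
print(repr(run), offset)
# the characters right after the shared run show where the texts diverge
print(repr(speech[offset + len(run):offset + len(run) + 20]))
```

If the shared run stops well before the end of the component, the characters that follow it in the speech indicate which kind of normalization (filler words, dashes, spacing) would be needed to match the component.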


In [50]:
# print the component with ID 927
print(components_data.loc[927])


Year                                                              1976
Date                                                             22Oct
SectionID                                                            2
ID                                                                 T86
SpeechID                                                             0
Label                                                            Claim
Text                 I do make a pledge that in the next ten days w...
Start                                                               -1
End                                                                213
SentenceID_begin                                                     0
SentenceID_end                                                       0
Current_Sentence                                   WALTERS: Thank you.
Previous_Sentence                                  WALTERS: Thank you.
Next_Sentence          Mr. Maynard, your question for Governor Carter.
Speake