# Tutorat M1 S2 Informatique DS4H

### Subject:

#### Re-structuration intelligente d'un jeu de donnée de debats politiques pour l'extraction de structures argumentaires

L'objectif du travail à réaliser est de structurer des données nécessaires à l'étude des composantes et des relations au sein des débats politiques qui ont eu lieu lors des élections du président des États-Unis de 1960 à 2016. Les débats se présentent sous la forme d'un dialogue entre un candidat et l'autre, qui répondent aux questions posées par un orateur sur divers sujets tels que l'économie, la sécurité, l'éducation, la guerre, les soins de santé, etc. Chaque débat a été divisé en sections en tenant compte des différents sujets abordés.

Tous les débats ont été annotés d'un point de vue argumentatif. Des annotations concernant les composantes argumentatives sont présentes à l'intérieur :
- Conclusions
- Prémisses
et des annotations faisant référence aux relations entre ces deux composants :
- Attaque
- Soutien
- Équivalent

L'objectif du projet est de concevoir et implémenter une structure de données textuels qui soit facile à manipuler pour la réalisation d'une des nombreuses tâches du TAL, à savoir l'extraction d'arguments.
Plus précisément, il s’agit de deux structures :
1) Un jeu de données référençant les composants (Claim, Premise) représentés par les colonnes suivantes :
- Ligne de dialogue
- Composants de l'argumentation
- Schéma BIO des composants
- Caractéristiques linguistiques, lexicales, grammaticales, syntaxiques, etc... (Chaque caractéristique séparée par une colonne) concernant le composant considéré
2) Un ensemble de données se référant aux relations (Attaque, Soutien, Équivalent) regroupées par section et représentées par les colonnes suivantes :
- Composante 1 (Claim/Premise)
- Composante 2 (Claim/Premise)
- Type de relation (Attaque/Soutien/Équivalent)
- Schéma BIO des composants et des relations avec leur distance
- Caractéristiques linguistiques, lexicales, grammaticales, syntaxiques, etc. (chaque caractéristique séparée par une colonne) concernant la relation considérée

## Questions
* What if the component is two sentences ? Because we don't have any ponctuation in the current component dataset.

In [107]:
# import the libraries
import pandas as pd
import string
from nltk.tokenize import sent_tokenize

In [108]:
# open the data file
components_data = pd.read_csv('./data/test_components.csv')
speeches_data = pd.read_csv('./data/test_speeches.csv')
# work with only the first 5 rows
# components_data = components_data.head(5)

In [109]:
# create new dataframe that copy the components_data but without the "Previous_Sentence" and "Next_Sentence" columns
components_data_context = components_data.drop(['Current_Sentence', 'Previous_Sentence', 'Next_Sentence'], axis=1)
# add columns for contexts
components_data_context['Context1'] = ''
components_data_context['Context2'] = ''

In [110]:
# tableau d'index des components qui n'ont pas été trouvés
components_not_found = []

# for each component, find the speeches that mention it
for index, row in components_data.iterrows():
    textToFind = row.Text
    # find the speeches that have the component in the text
    speeches = speeches_data[speeches_data['Speech'].str.find(textToFind) != -1]
    
    # make a try catch block to handle the case where the component is not found in the speeches
    try:
        # tokenize the speech into sentences
        sentences = sent_tokenize(speeches['Speech'].values[0])
        
        context1 = ''
        # get the sentence that contains the component
        for index, sentence in zip(range(0,len(sentences)), sentences):
            if sentence.find(textToFind) != -1:
                if(index > 0):
                    context1 = sentences[index-1]
                # add the current sentence
                context1 = context1 + ' ' + sentence
                if(index < len(sentences)-1):
                    context1 = context1 + ' ' + sentences[index+1]
                break
        # add context1
        components_data_context.loc[index, 'Context1'] = context1
        # add full speech to the context2 column
        components_data_context.loc[index, 'Context2'] = speeches['Speech'].values[0]
    except:
        # stocker les index des components qui n'ont pas été trouvés
        components_not_found.append(index)
        print('Component not found in speeches: ' + str(index))
        
    

Component not found in speeches: 56
Component not found in speeches: 75
Component not found in speeches: 94
Component not found in speeches: 95
Component not found in speeches: 603
Component not found in speeches: 927
Component not found in speeches: 935
Component not found in speeches: 1335
Component not found in speeches: 1410
Component not found in speeches: 1416
Component not found in speeches: 1429
Component not found in speeches: 1525
Component not found in speeches: 1709
Component not found in speeches: 3552
Component not found in speeches: 4821
Component not found in speeches: 5416
Component not found in speeches: 5434
Component not found in speeches: 5521
Component not found in speeches: 5537
Component not found in speeches: 5579
Component not found in speeches: 5583
Component not found in speeches: 5601
Component not found in speeches: 5644
Component not found in speeches: 5650
Component not found in speeches: 5757
Component not found in speeches: 5817
Component not found in 

Un certain nombre de composants n'ont pas été trouvés dans le texte, cela peut être du au format dans lesquels les données ont été enregistrées.


In [111]:
# Un certain nombre de components n'ont pas été trouvés dans les speeches
print('Percentage of components not found: ' + str(len(components_not_found)/len(components_data)*100) + '%')

Percentage of components not found: 0.4634994206257242%


La suite va donc consister à détecter les différents formats pour pouvoir les patcher et les ajouter à notre premier contexte.

In [112]:
# prenons pour exemple le premier component qui n'a pas été trouvé
cps = components_data.loc[components_not_found[0]]
print(cps.Text)


I believe it my responsibility as the leader of the Democratic party in 1960 to try to warn the American people that in this crucial time we can no longer afford to stand still We can no longer afford to be second best


Ici le composent (id 56) est composé de 2 phrases, or la fonction sent_tokenize ne les détecte pas comme 2 phrases car il a été stocké sans la ponctuation.

On va donc refaire une boucle mais cette fois sans la ponctuation pour détecter les phrases.
(petit soucis avec cette technique, les phrases contenant les ' ne sont pas détectées, on va donc pas la suite essayer de trouver une autre solution)

In [113]:
# vider le tableau d'index des components qui n'ont pas été trouvés
components_not_found_2 = []

#loop through the components_data that have not been found
for index, row in components_data.loc[components_not_found].iterrows():
    textToFind = row.Text
    # find the speeches that have the component in the text but this time we remove the ponctuation
    speeches = speeches_data[speeches_data['Speech'].str.translate(str.maketrans('', '', string.punctuation)).str.find(textToFind) != -1]
    # make a try catch block to handle the case where the component is not found in the speeches
    try:
        sentences = sent_tokenize(speeches['Speech'].values[0])
        
        component = ""
        first_id = -1
        last_id = -1
        # for each sentence of the speech
        for sentence in sentences:
            # if the sentence is in the component text
            if textToFind.find(sentence.translate(str.maketrans('', '', string.punctuation))) != -1:
                if first_id == -1:
                    first_id = sentences.index(sentence)
                last_id = sentences.index(sentence)
                # concat the sentence to the component
                component = component + ' ' + sentence
        # update the component text
        components_data.loc[index, 'Text'] = component
        # update the component text in the components_data_context
        components_data_context.loc[index, 'Text'] = component
        # add the previous sentence
        context1 = ''
        if first_id > 0:
            context1 += sentences[first_id-1]
        context1 += ' ' + component
        if last_id < len(sentences)-1:
            context1 += ' ' + sentences[last_id+1]
        # update the context1
        components_data_context.loc[index, 'Context1'] = context1
        # update the context2
        components_data_context.loc[index, 'Context2'] = speeches['Speech'].values[0]
        
    except Exception as e:
        # stocker les index des components qui n'ont pas été trouvés
        components_not_found_2.append(index)

In [114]:
# get first component that has not been found
comp = components_data.loc[components_not_found_2[0]]
# print the speechid and the section id and the text of the component
print(comp[['SpeechID', 'SectionID', 'Text']])
# print the percentage of components that have not been found
print('Percentage of components not found: ' + str(len(components_not_found_2)/len(components_data)*100) + '%')

SpeechID                                                     3
SectionID                                                    9
Text         Anybody that says America has been standing st...
Name: 75, dtype: object
Percentage of components not found: 0.44243126514273673%


Comme nous pouvons le voir, le pourcentage n'a pas beaucoup changé, on va donc essayer de trouver une autre solution.
Le principal soucis ici est que les phrases contenant des ' ne sont pas détectées. Ce qui est un cas fréquent en anglais. Il faut donc que l'on trouve une solution pour détecter ces phrases.

In [115]:
sentence = "Anybody that says America has been standing still for the last seven and a half years hasn't been traveling in America. He's been in some other country!"
print(sentence)
print(sentence.translate(str.maketrans('', '', string.punctuation)))

Anybody that says America has been standing still for the last seven and a half years hasn't been traveling in America. He's been in some other country!
Anybody that says America has been standing still for the last seven and a half years hasnt been traveling in America Hes been in some other country


Pour cela nous pouvons utiliser une regex qui va nous permettre de détecter les phrases contenant des ponctuations non désiré.

In [116]:
import re
new_s = re.sub(r'[\.!\?]','',sentence)
print(new_s)

Anybody that says America has been standing still for the last seven and a half years hasn't been traveling in America He's been in some other country


In [117]:
# vider le tableau d'index des components qui n'ont pas été trouvés
components_not_found_3 = []

#loop through the components_data that have not been found
for index, row in components_data.loc[components_not_found_2].iterrows():
    textToFind = row.Text
    # find the speeches that have the component in the text but this time we remove the ponctuation
    speeches = speeches_data[speeches_data['Speech'].str.replace(r'[\.\!\?]', '', regex=True).str.find(textToFind) != -1]
    # make a try catch block to handle the case where the component is not found in the speeches
    try:
        sentences = sent_tokenize(speeches['Speech'].values[0])
        component = ""
        print(index)
        first_id = -1
        last_id = -1
        # for each sentence of the speech
        for sentence in sentences:
            # if the sentence is in the component text
            if textToFind.find(sentence.replace(r'[\.\!\?]','',regex=True)) != -1:
                if first_id == -1:
                    first_id = sentences.index(sentence)
                last_id = sentences.index(sentence)
                # concat the sentence to the component
                component = component + ' ' + sentence
        # update the component text
        components_data.loc[index, 'Text'] = component
        # update the component text in the components_data_context
        components_data_context.loc[index, 'Text'] = component
        # add the previous sentence
        context1 = ''
        if first_id > 0:
            context1 += sentences[first_id-1]
        context1 += ' ' + component
        if last_id < len(sentences)-1:
            context1 += ' ' + sentences[last_id+1]
        # update the context1
        components_data_context.loc[index, 'Context1'] = context1
        # update the context2
        components_data_context.loc[index, 'Context2'] = speeches['Speech'].values[0]
        
    except:
        # stocker les index des components qui n'ont pas été trouvés
        components_not_found_3.append(index)
        # print('Component not found in speeches: ' + str(index))

75
95


In [118]:
# print le composant d'ID 603
print(components_data.loc[603]['Text'])

the tax program of the Ford administration will bring jobs where people are, and help to revitalize those cities as they can be


In [119]:
# search the speech that contains the word 'the tax program'
speeches_data[speeches_data['Speech'].str.find('the tax program of the Ford administration') != -1]

Unnamed: 0,Year,Date,Speaker,SectionID,SpeechID,Speech,Start,End
62,1976,22Oct,FORD,8,4,Let me uh - speak out very strongly. The Ford...,3485,5213


In [120]:
m_string = 'the tax program of the Ford administration will bring jobs where people'

# try all speechs
validation_speeches = pd.read_csv('data/validation_speeches.csv')
print(len(validation_speeches[validation_speeches['Speech'].str.find(m_string) != -1]))

train_speeches = pd.read_csv('data/train_speeches.csv')
print(len(train_speeches[train_speeches['Speech'].str.find(m_string) != -1]))

test_speeches = pd.read_csv('data/test_speeches.csv')
print(len(test_speeches[test_speeches['Speech'].str.find(m_string) != -1]))

# this component didn't exist at all ? 
# print le composant d'ID 603
print(components_data.loc[603]['Text'])
# Actualy, this component exist but in this case, the sentence, was split in three and we have only the first part and last part of the sentence
# the results is that we can't find the component in the speech but it is the speech 62 by Ford

0
0
0
the tax program of the Ford administration will bring jobs where people are, and help to revitalize those cities as they can be


In [121]:
print(len(components_not_found_3))

42


In [122]:
components_data.loc[components_not_found_3[0]]

Year                                                              1960
Date                                                             21Oct
SectionID                                                            9
ID                                                                T448
SpeechID                                                             3
Label                                                            Claim
Text                 Anybody that says America has been standing st...
Start                                                             5524
End                                                               5675
SentenceID_begin                                                    11
SentenceID_end                                                      12
Current_Sentence     Anybody that says America has been standing st...
Previous_Sentence    We find four times as many projects undertaken...
Next_Sentence                      Let's get that straight right away.
Speake

### Traiter tous les cas particuliers

Identifications des cas particuliers:

* Certaines phrases sont tronqués, il faut donc les compléter.
* Manque de ponctuation (manque des points par exemple)

Nombre d'éléments non trouvés: 42 (dans le dataset test)
Ponctuation manquante: 2
Ponctuation manquante: [75, 95, ]
Correspondance:[8, 8, ]
Phrase tronquée: 2
Phrase tronquée: [603, 927, ]
Correspondance:[62, 103, ]


In [123]:
components_data.loc[components_not_found_3[3]]['Text']


"I do make a pledge that in the next ten days when we're asking the American people to make one of the most important decisions in their lifetime that uh - we do together what we can to stimulate voter participation"

In [125]:
speeches_data.loc[speeches_data.Speech.str.find("I do make a pledge that in the next ten days when we're asking the American people to make one of the most important decisions in ") != -1]
#  I believe that the uh - American people have been turned off in this election, uh - Mr. Maynard, for a variety of reasons. We have seen on Capitol Hill, in the Congress, uh - a great many uh - allegations of wrong-doing, of uh - alleged immorality, uh - those are very disturbing to the American people. They wonder how an elected representative uh - can serve them and participate in such activities uh - serving in the Congress of the United States. Yes, and I'm certain many, many Americans were turned off by the revelations of Watergate, a very, very uh - bad period of time in American political history. Yes, and thousands, maybe millions of Americans were turned off because of the uh - problems that came out of our involvement in Vietnam. But on the other hand, I found on July fourth of this year, a new spirit born in America. We were celebrating our Bicentennial; and I find that uh - there is a - a movement as I travel around the country of greater interest in this campaign. Now, like uh - any hardworking uh - person seeking public office uh - in the campaign, inevitably sometimes you will use uh - rather graphic language and I'm guilty of that just like I think most others in the political arena. But I do make a pledge that in the next ten days when we're asking the American people to make one of the most important decisions in their lifetime, because I think this election is one of the mast vital in the history of America, that uh - we do together what we can to stimulate voter participation.

Unnamed: 0,Year,Date,Speaker,SectionID,SpeechID,Speech,Start,End
103,1976,22Oct,FORD,2,4,I believe that the uh - American people have ...,3208,4735


Essayer de faire une recherche automatique des phrases qui sont tronquées.

In [126]:
# get le component d'ID 603
comp = components_data.loc[603]
# recuperer le texte du component
text = comp['Text']
# split le text sur les espaces
tab_text = text.split(' ')

# tester jusqu'où la phrase est présente dans le texte (non tronqué)
startComp = tab_text[0]
while(len(speeches_data.loc[speeches_data.Speech.str.find(startComp) != -1]) > 0):
    # verifier que la phrase ne soit présente que dans un seul speech (meilleur des cas)
    tab_text.pop(0)
    startComp += ' '+tab_text[0]

# enelever le dernier mot de current
startComp = ' '.join(startComp.split(' ')[:-1])
# print(speeches_data.loc[speeches_data.Speech.str.find(startComp) != -1])
speechs = speeches_data.loc[speeches_data.Speech.str.find(startComp) != -1]
endCom = ' '.join(tab_text)
m_speech = []
for index, speech in zip(speechs.index, speechs.Speech):
    if(speech.find(endCom) != -1):
        print(index)
        print(speech)
        m_speech = speech
# une fois que le debut du component est trouvé, verifier que la fin du component est aussi dans le speech (et ainsi retrouvé les deux partie de la phrase)
print(m_speech)

62
 Let me uh - speak out very strongly. The Ford administration does have a very comprehensive program to help uh - our major metropolitan areas. I fought for, and the Congress finally went along with a general revenue sharing program, whereby cities and uh - states, uh - the cities two-thirds and the states one-third, get over six billion dollars a year in cash through which they can uh - provide many, many services, whatever they really want. In addition we uh - in the federal government make available to uh - cities about uh - three billion three hundred million dollars in what we call community development. In adesh- in addition, uh - uh - as a result of my pressure an the Congress, we got a major mass transit program uh - over a four-year period, eleven billion eight-hundred million dollars. We have a good housing program, uh - that uh - will result in cutting uh - the down payments by 50 percent and uh - having mortgage payments uh lower at the beginning of any mortgage period. 