# French adverbials and subjunctive use

Juan Berrios | juanberrios@pitt.edu | Last updated: October 26, 2022

**Summary and overview of the data:**

- This notebook is part of a project on mood variation as conditioned by adverbials in French as spoken in Metropolitan France. The purpose of the code included in this notebook is to build a `DataFrame` object from the `.txt` file containing the entirety of the *Corpus de Français Parlé Parisien* (CFPP2000). 

**Contents:**
1. [Preparation](#1.-Preparation) loads the libraries and notebook settings.
2. [Corpus processing](#2.-Corpus-processing) includes code for loading the corpus file, cleaning its lines, and extracting the concordances that contain the adverbials of interest.
3. [Data frame building](#3.-Data-frame-building) includes code for turning the extracted concordances into a manageable data frame. The resulting data frame is stored as a `.csv` file and as a `.pkl` file in case further processing is needed.

## 1. Preparation

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import pandas as pd
import numpy as np

#Turning pretty print off:
%pprint

#Releasing all output:
from IPython.core.interactiveshell import InteractiveShell #Displays the result of every expression in a cell rather than only the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


## 2. Corpus processing

- I'm going to first load the `.txt` file line by line, then filter it so that only speech lines remain and tag lines (those starting with "<") are dropped. The last step will be to extract the concordances containing the adverbials of interest ("*quand*", "*tandis que*", "*avant que*", "*jusqu'à ce que*").

In [2]:
#Open file and extract lines as a list

fname = "./data/cfpp2000-v42-utf8.txt"

with open(fname, encoding="utf8") as corpus:             #UTF-8 encoding included so special characters are kept.
    lines = corpus.readlines()

In [3]:
#Strip the newline character from each element of the list

concordances = [] #Create new list because otherwise it returns results to the shell

for line in lines:
    concordances.append(line.strip())

In [4]:
#Verifying results

print("There are:",len(concordances),"concordances.")

#Sample of first ten lines

print("---Sample---")
for line in concordances[:10]: print(line)

There are: 189845 concordances.
---Sample---
<quartier="03">
<transc="03-01">
<user="Ozgur_Kilic_H_32_alii_3e (1)">
<speaker="spk1_03-01">
vous pouvez la reposer encore si vous voulez §
</speaker>
<speaker="spk2_03-01">
non  §
vous l'avez entendue deux fois ça suffit §
</speaker>
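
- The sample also suggests that each utterance sits inside a `<speaker="...">` block. The notebook itself discards this information, but if speaker identity were needed later, it could be carried along with each speech line. A minimal sketch, assuming the tag format seen in the sample holds throughout the file; `TAG_RE` and `attach_speakers` are hypothetical names, not part of the original code:

```python
import re

# Assumed tag shape, inferred from the sample output: <name="value">
TAG_RE = re.compile(r'^<(\w+)="([^"]*)">$')

def attach_speakers(lines):
    """Pair each speech line with the most recent <speaker="..."> value."""
    current_speaker = None
    pairs = []
    for line in lines:
        match = TAG_RE.match(line)
        if match:
            name, value = match.groups()
            if name == "speaker":
                current_speaker = value
        elif line and not line.startswith("<"):
            # Non-empty line that is not a tag (closing tags are skipped too)
            pairs.append((current_speaker, line))
    return pairs

sample = [
    '<quartier="03">',
    '<speaker="spk1_03-01">',
    "vous pouvez la reposer encore si vous voulez §",
    "</speaker>",
    '<speaker="spk2_03-01">',
    "non  §",
    "</speaker>",
]
pairs = attach_speakers(sample)
```

Each element of `pairs` is a `(speaker, utterance)` tuple, which could later become two columns of the data frame.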


- As can be seen from the sample, there are tags related to matters such as the speaker or the location where the data were collected. As we're mainly interested in the speech for this task, we'll remove the tags and keep only speech tokens:

In [5]:
#Extract only speech (no tags)

tokens = []

for line in concordances:
    if line and not line.startswith("<"):   #Non-empty line that does not start with "<"
        tokens.append(line)

In [6]:
#Verifying result

print("There are:",len(tokens),"concordances.")

#Sample

print("---Sample---")
for line in tokens[:10]: print(line)

There are: 91714 concordances.
---Sample---
vous pouvez la reposer encore si vous voulez §
non  §
vous l'avez entendue deux fois ça suffit §
tchin tchin §
comment on s'est rencontrés §
bah §
bah les deux c'est  §
ça §
simple §
là avec Steve bah c'est mon frère donc euh tchk §


- We'll now build individual lists for the adverbials of interest. Note that there is one list for the regular spelling and another for the alternative spellings (elided before a vowel or, for *quand*, capitalized):

In [7]:
#Regular

avant = [line for line in tokens if "avant que" in line]
tandis = [line for line in tokens if 'tandis que' in line]
jusque = [line for line in tokens if "jusqu'à ce que" in line]
quand = [line for line in tokens if "quand" in line]

#Alternative

avant_alt = [line for line in tokens if "avant qu'" in line]        
tandis_alt = [line for line in tokens if "tandis qu'" in line]
jusque_alt = [line for line in tokens if "jusqu'à ce qu'" in line]
quand_alt = [line for line in tokens if "Quand" in line]

In [8]:
#Verify length of each list

print("Original spelling")
len(avant)
len(tandis)
len(jusque)
len(quand)
print("Alternative")
len(avant_alt)
len(tandis_alt)
len(jusque_alt)
len(quand_alt)

Original spelling


10

28

4

3190

Alternative


5

1

3

4
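
- The substring tests above are case-sensitive and would also match inside longer words (for example, "avant quelques" contains "avant que"). One possible alternative is a regular expression per adverbial that covers the full and elided forms, ignores case, and requires word boundaries. This is only a sketch: `PATTERNS` and `find_adverbials` are hypothetical names, and the counts it yields on the corpus may differ from the ones reported above.

```python
import re

# One pattern per adverbial: "que" as a whole word, or elided "qu'"
PATTERNS = {
    "avant que": re.compile(r"\bavant qu(?:e\b|')", re.IGNORECASE),
    "tandis que": re.compile(r"\btandis qu(?:e\b|')", re.IGNORECASE),
    "jusqu'à ce que": re.compile(r"\bjusqu'à ce qu(?:e\b|')", re.IGNORECASE),
    "quand": re.compile(r"\bquand\b", re.IGNORECASE),
}

def find_adverbials(line):
    """Return the labels of all adverbials found in a line."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(line)]

examples = [
    "Quand il arrive avant qu'elle parte §",
    "avant quelques années §",
    "jusqu'à ce qu'il vienne §",
]
results = [find_adverbials(line) for line in examples]
```

The sketch assumes straight apostrophes (`'`), as in the samples shown in this notebook.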

- Let's clean up "quand" a little bit since about half of the list is made up of the expression "quand même", which doesn't correspond to the construction of interest:

In [9]:
quand = [line for line in quand if "quand même" not in line] #Removing lines with "quand même", a common fixed expression that is not the construction of interest
quand_alt = [line for line in quand_alt if "quand même" not in line] 

In [10]:
#Verify results

len(quand)
len(quand_alt) #No "quand même" lines here, so the total stays the same

#Sample
print("---Sample---")
for line in quand[:10]: print(line)

1705

4

---Sample---
et quand on a redoublé tous les deux on s'est retrouvés plus  §
collège Montgolfier lycée Turgot quand on s'est rencontrés §
et même quand on avait dix-sept dix-huit enfin seize dix-sept dix-huit on était contents d'être là-bas on le kiffait bien le troisième §
nous quand on est partis §
et quand euh quand la mère de Steve a acheté ou quand mes parents ont acheté euh c'était un quartier il y avait encore les Halles c'était  §
enfin moi aujourd'hui j- je vois abso- absolument plus de vie de quartier alors que quand j'étais gamin  §
euh quand j'étais gamin il y avait pas vraiment de vie de quartier c'est-à-dire que je descendais de chez moi je vois il y avait une boulangerie  §
non parce que je bah quand j'y repasse en fait j'y repasse pas souvent mais quand j'y repasse euh je sais pas c'est peut-être un peu de nostalgie aussi hein c'est c'est juste ça mais  §
c'est c'était dans une école et donc la cafèt ça donnait euh sur rue et donc voilà on se voyait quand il y avait des
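
- Note that filtering on whole lines also drops any line that contains both "quand même" and a genuine temporal "quand". If those lines should be kept, one option is a negative lookahead that matches "quand" only when it is not followed by "même". A minimal sketch under that assumption; `QUAND_NOT_MEME` is a hypothetical name and the example lines are made up for illustration:

```python
import re

# "quand" as a whole word, not followed by whitespace + "même"
QUAND_NOT_MEME = re.compile(r"\bquand\b(?!\s+même\b)", re.IGNORECASE)

lines = [
    "c'est quand même bien §",                          # only "quand même"
    "quand j'étais gamin c'était quand même calme §",   # both uses
    "nous quand on est partis §",                       # only temporal "quand"
]

# Keep a line if at least one "quand" in it is not part of "quand même"
kept = [line for line in lines if QUAND_NOT_MEME.search(line)]
```

Here the second line survives because its first "quand" is not followed by "même", whereas the line-level filter would have removed it.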

- Merging the lists before moving to the next step:

In [11]:
#Merge

avant.extend(avant_alt) #Using .extend because otherwise it would append the list and not the elements thereof
tandis.extend(tandis_alt)
jusque.extend(jusque_alt)
quand.extend(quand_alt)

#Verify lengths

len(avant)
len(tandis)
len(jusque)
len(quand)

15

29

7

1709
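
- One thing to keep in mind with this merge: if a line happens to contain both spellings (say, "avant que" and "avant qu'"), it ends up in both lists and therefore twice in the merged list. A possible way around that is to test for all spellings in a single pass, so each line is kept at most once and corpus order is preserved. A sketch with made-up lines; `lines_with_any` is a hypothetical helper:

```python
def lines_with_any(lines, spellings):
    """Keep each line once if it contains at least one of the spellings."""
    return [line for line in lines if any(s in line for s in spellings)]

tokens_sample = [
    "avant que ça §",
    "avant qu'il parte et avant que je revienne §",
    "tandis qu'elle parlait §",
]

# Two-list approach: the second line is counted in both lists
regular = [line for line in tokens_sample if "avant que" in line]
alt = [line for line in tokens_sample if "avant qu'" in line]
merged = regular + alt

# Single-pass approach: the second line is counted once
avant_sample = lines_with_any(tokens_sample, ["avant que", "avant qu'"])
```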

## 3. Data frame building

In [12]:
# Building individual data frames to do some preliminary tagging

avant_df = pd.DataFrame(avant,columns=["concordance"])
tandis_df = pd.DataFrame(tandis,columns=["concordance"])
jusque_df = pd.DataFrame(jusque,columns=["concordance"])
quand_df = pd.DataFrame(quand,columns=["concordance"])

In [13]:
#Tagging for adverbial

avant_df["adverbial"] = "avant que"
tandis_df["adverbial"] = "tandis que"
jusque_df["adverbial"] = "jusqu'à ce que"
quand_df["adverbial"] = "quand"

In [14]:
#Merging data frames. Rows are stacked in the order in which the frames are listed

adverbials_df = pd.concat([avant_df, tandis_df, jusque_df, quand_df], ignore_index=True)

In [15]:
#Tagging for corpus and mood (using the indicative as a default, which will be revised later in manual coding)

adverbials_df["corpus"] = "CFPP2000"
adverbials_df["mood"] = "indicative" 

In [16]:
#Previewing

adverbials_df #First and last five

| | concordance | adverbial | corpus | mood |
|---|---|---|---|---|
| 0 | avant que ça § | avant que | CFPP2000 | indicative |
| 1 | avant que Casino prenne la place à l'époque où... | avant que | CFPP2000 | indicative |
| 2 | pas avant plusieurs années pas avant que ça ... | avant que | CFPP2000 | indicative |
| 3 | avant que ce soit rendu à la ville dans les an... | avant que | CFPP2000 | indicative |
| 4 | et voyez malgré tout euh on s'en sort mais on ... | avant que | CFPP2000 | indicative |
| ... | ... | ... | ... | ... |
| 1755 | quand on y est depuis septembre 2003 donc ça... | quand | CFPP2000 | indicative |
| 1756 | et dans une famille communisante on allait voi... | quand | CFPP2000 | indicative |
| 1757 | ah Quand passent § | quand | CFPP2000 | indicative |
| 1758 | ah Quand passent les § | quand | CFPP2000 | indicative |
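
- As a sanity check, the number of rows per adverbial in the merged frame should match the list lengths verified earlier. The sketch below shows one way to build the tagged frame from a label-to-lines mapping and count rows per label; the `hits` dictionary and its contents are illustrative stand-ins, not the actual lists:

```python
import pandas as pd

# Illustrative stand-in for the avant/tandis/jusque/quand lists
hits = {
    "avant que": ["avant que ça §"],
    "quand": ["nous quand on est partis §", "ah Quand passent §"],
}

# One small frame per adverbial; the scalar label is broadcast to every row
frames = [
    pd.DataFrame({"concordance": lines, "adverbial": label})
    for label, lines in hits.items()
]
sample_df = pd.concat(frames, ignore_index=True)

# Rows per adverbial, to compare against the list lengths
counts = sample_df["adverbial"].value_counts()
```

On the real data, `adverbials_df["adverbial"].value_counts()` would give the same kind of summary.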


- Saving result as `.csv` file:

In [17]:
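adverbials_df.to_csv("./data/adverbials_df.csv", encoding="utf-8-sig", header=0, index=True) #utf-8-sig writes a byte order mark so that spreadsheet software detects the encoding; header=0 is falsy, so no header row is written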

- Pickling file for use in follow-up notebook:

In [18]:
adverbials_df.to_pickle("./data/adverbials_df.pkl")
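
- For the follow-up notebook, the pickle can be read back directly, while the `.csv` needs the column names supplied if it was written without a header row. The following is a hedged sketch of that round trip on a tiny stand-in frame written to a temporary directory; the file names and the `"index"` column name are assumptions for illustration only:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"concordance": ["avant que ça §"], "adverbial": ["avant que"]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "sample.pkl")
    csv_path = os.path.join(tmp, "sample.csv")

    # Write both formats, mirroring the options used above (no header row)
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, encoding="utf-8-sig", header=False, index=True)

    # The pickle restores the frame as it was
    from_pickle = pd.read_pickle(pkl_path)

    # The CSV has no header, so column names are given explicitly
    from_csv = pd.read_csv(
        csv_path,
        encoding="utf-8-sig",
        header=None,
        names=["index", "concordance", "adverbial"],
        index_col="index",
    )
```

The pickle is the more convenient format for further processing in pandas, whereas the `.csv` is easier to open for manual coding.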