# French adverbials and subjuntive use
Juan Berrios | juanberrios@pitt.edu | Last updated: April 21, 2021

**Summary and overview of the data:**

- This is part of a project of mood variation conditioned by adverbials in French as spoken in Metropolitan France. The purpose of the code included in this notebook is to build a `DataFrame` object from the `.txt` file containing the entirety of the Corpus de Français Parlé Parisien (CFPP2000). 

**Contents:**
1. [Preparation](#1.-Preparation)  includes the necessary preparations.
2. [Loading files](#2.-Loading-files)  includes code for loading the files, turning them into a data frame, and cleaning them using one of the `.txt` files as a sample.
3. [Processing corpus directories](#3.-Processing-corpus-directories)  includes code for performing the operations on a corpus directory containing all the text files for Argentinean Spanish. The resulting data frames is stored as a `.pkl` file in case further processing is needed.

## 1. Preparation

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np

#Turning pretty print off:
%pprint

#Releasing all output:                                            
from IPython.core.interactiveshell import InteractiveShell #Prints all commands rather than the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


## 2. Loading files

- I'm going to first load the `.txt` file by lines. Then refine it so that I get lines that are only dialogues rather than tags. The last step will be to extract lines containing the adverbails of interest ("quand", "tandis que", "avant" que", "jusqu'à ce que").

In [2]:
#Open file and extract lines as a list

fname = './data/cfpp2000-v42-utf8.txt'

with open(fname, encoding='utf8') as corpus:             #UTF-8 encoding included so special characters are kept.
    lines = corpus.readlines()

In [3]:
#Stip new line character from elements of list

concordances = []

for line in lines:
    concordances.append(line.strip())

In [4]:
#Verifying results

len(concordances)
type(concordances)

#Sample of first ten lines

print('---Sample---')
for line in concordances[:10]: print(line)

189845

<class 'list'>

---Sample---
<quartier="03">
<transc="03-01">
<user="Ozgur_Kilic_H_32_alii_3e (1)">
<speaker="spk1_03-01">
vous pouvez la reposer encore si vous voulez §
</speaker>
<speaker="spk2_03-01">
non  §
vous l'avez entendue deux fois ça suffit §
</speaker>


In [5]:
#Extract tokens only (no tags)

tokens = []

for line in concordances:
    if line[0] != '<':
        tokens.append(line)

In [6]:
#Verifying result

len(tokens)

#Sample

print('---Sample---')
for line in tokens[:10]: print(line)

91714

---Sample---
vous pouvez la reposer encore si vous voulez §
non  §
vous l'avez entendue deux fois ça suffit §
tchin tchin §
comment on s'est rencontrés §
bah §
bah les deux c'est  §
ça §
simple §
là avec Steve bah c'est mon frère donc euh tchk §


In [7]:
#Build individual lists for adverbs of interest

quand = [line for line in tokens if 'quand' in line]
avant = [line for line in tokens if "avant que" in line]
jusqua = [line for line in tokens if "jusqu'Ã ce que" in line]
tandis = [line for line in tokens if 'tandis que' in line]

In [8]:
#Verify length of each list

len(quand)
len(avant)
len(jusqua)
len(tandis)

3190

10

0

28

- No hits for "jusqu'à ce que". Let's clean up quand a little bit since aboout half of the list is made up of the expression "quand même".

In [9]:
quand = [line for line in quand if 'quand même' not in line] #Removing a extraneous fairly known expression

In [10]:
#Verify results

len(quand)

#Sample
print('---Sample---')
for line in quand[:10]: print(line)

1705

---Sample---
et quand on a redoublé tous les deux on s'est retrouvés plus  §
collège Montgolfier lycée Turgot quand on s'est rencontrés §
et même quand on avait dix-sept dix-huit enfin seize dix-sept dix-huit on était contents d'être là-bas on le kiffait bien le troisième §
nous quand on est partis §
et quand euh quand la mère de Steve a acheté ou quand mes parents ont acheté euh c'était un quartier il y avait encore les Halles c'était  §
enfin moi aujourd'hui j- je vois abso- absolument plus de vie de quartier alors que quand j'étais gamin  §
euh quand j'étais gamin il y avait pas vraiment de vie de quartier c'est-à-dire que je descendais de chez moi je vois il y avait une boulangerie  §
non parce que je bah quand j'y repasse en fait j'y repasse pas souvent mais quand j'y repasse euh je sais pas c'est peut-être un peu de nostalgie aussi hein c'est c'est juste ça mais  §
c'est c'était dans une école et donc la cafèt ça donnait euh sur rue et donc voilà on se voyait quand il y avait des

## 3. Processing corpus directories

In [11]:
# Building individual data frames to do some preliminary tagging

avant_df = pd.DataFrame(avant,columns=['concordance'])
tandis_df = pd.DataFrame(tandis,columns=['concordance'])
quand_df = pd.DataFrame(quand,columns=['concordance'])

In [12]:
#Tagging for adverbial

avant_df['adverbial'] = 'avant que'
tandis_df['adverbial'] = 'tandis que'
quand_df['adverbial'] = 'quand'

In [13]:
#Merging data frames. They will be sorted in the same order

adverbials_df = pd.concat([avant_df, tandis_df, quand_df], ignore_index=True)

In [14]:
#Tagging for the most common categories that  will later modify manually

adverbials_df['mood'] = 'subjunctive' 
adverbials_df['tense'] = 'present indicative' 
adverbials_df['corpus'] = 'CFPP200' 

In [15]:
adverbials_df.head(5) #First five
adverbials_df.tail(5) #Last five

Unnamed: 0,concordance,adverbial,mood,tense,corpus
0,avant que ça §,avant que,subjunctive,present indicative,CFPP200
1,avant que Casino prenne la place à l'époque où...,avant que,subjunctive,present indicative,CFPP200
2,pas avant plusieurs années pas avant que ça ...,avant que,subjunctive,present indicative,CFPP200
3,avant que ce soit rendu à la ville dans les an...,avant que,subjunctive,present indicative,CFPP200
4,et voyez malgré tout euh on s'en sort mais on ...,avant que,subjunctive,present indicative,CFPP200


Unnamed: 0,concordance,adverbial,mood,tense,corpus
1738,voilà mais après aussi ce qu'il faut savoir ...,quand,subjunctive,present indicative,CFPP200
1739,quand elle s'adresse à ses petites-filles §,quand,subjunctive,present indicative,CFPP200
1740,voilà ou pff quand je vais dans la rue ou ou...,quand,subjunctive,present indicative,CFPP200
1741,ou enfin au moins pas anonymes quoi parce que ...,quand,subjunctive,present indicative,CFPP200
1742,quand on y est depuis septembre 2003 donc ça...,quand,subjunctive,present indicative,CFPP200


- Saving result as `.csv` file:

In [16]:
adverbials_df.to_csv("./data/adverbials_df.csv", encoding='utf-8', header=True, index=False)