# French adverbials and subjunctive use
Juan Berrios | juanberrios@pitt.edu | Last updated: April 21, 2021

**Summary and overview of the data:**

- This notebook is part of a project on mood variation conditioned by adverbials in French as spoken in Metropolitan France. The code builds a `DataFrame` object from the `.txt` file containing the entirety of the Corpus de Français Parlé Parisien (CFPP2000).

**Contents:**
1. [Preparation](#1.-Preparation): includes the necessary preparations.
2. [Loading files](#2.-Loading-files): includes code for loading the corpus file, keeping only the utterance lines, and filtering them for the adverbials of interest.
3. [Processing corpus directories](#3.-Processing-corpus-directories): includes code for building a data frame for each adverbial, combining them, and adding annotation columns. The resulting data frame is stored as a `.csv` file in case further processing is needed.

## 1. Preparation

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np
import os

#Turning pretty print off:
%pprint

#Releasing all output:                                            
from IPython.core.interactiveshell import InteractiveShell #Prints all commands rather than the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


## 2. Loading files

- The corpus `.txt` file is large (about 190,000 lines). Each line is either a markup tag (for example `<speaker="...">`) or a transcribed utterance ending in `§`. I will load every line first and then keep only the utterances.

In [2]:
os.getcwd()

'C:\\Users\\Juan\\Documents\\code\\research\\french_adverbials'

In [3]:
fname = 'C:/Users/Juan/Documents/code/research/french_adverbials/data/cfpp2000-v42-utf8.txt'

#The file name indicates UTF-8, so the encoding is set explicitly
with open(fname, encoding='utf-8') as corpus:
    lines = corpus.readlines()

In [4]:
concordances = []

for line in lines:
    concordances.append(line.strip())

In [5]:
len(concordances)
type(concordances)

189845

<class 'list'>

In [6]:
for line in concordances[:10]: print(line)

<quartier="03">
<transc="03-01">
<user="Ozgur_Kilic_H_32_alii_3e (1)">
<speaker="spk1_03-01">
vous pouvez la reposer encore si vous voulez §
</speaker>
<speaker="spk2_03-01">
non  §
vous l'avez entendue deux fois ça suffit §
</speaker>
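
The lines printed above mix markup tags with utterances, so dropping the tags also drops the information about who is speaking. If that information is needed later, it could be carried along with each utterance. The following is only a sketch under assumptions: `sample_lines`, `OPEN_TAG`, and `attach_metadata` are hypothetical names, and the tag format is inferred from the ten lines shown, not from the corpus documentation.

```python
import re

# Hypothetical sample in the same shape as the lines printed above.
sample_lines = [
    '<quartier="03">',
    '<transc="03-01">',
    '<speaker="spk1_03-01">',
    'vous pouvez la reposer encore si vous voulez §',
    '</speaker>',
    '<speaker="spk2_03-01">',
    'non  §',
    "vous l'avez entendue deux fois ça suffit §",
    '</speaker>',
]

# Assumed tag shape: <name="value">, as in <speaker="spk1_03-01">.
OPEN_TAG = re.compile(r'^<(\w+)="([^"]*)">$')

def attach_metadata(lines):
    """Pair each utterance with the most recent transcript and speaker IDs."""
    context = {'transc': None, 'speaker': None}
    rows = []
    for line in lines:
        match = OPEN_TAG.match(line)
        if match:
            tag, value = match.groups()
            if tag in context:          # other tags (quartier, user) are ignored
                context[tag] = value
        elif line == '</speaker>':
            context['speaker'] = None   # the speaker turn has ended
        elif line and not line.startswith('<'):
            rows.append({'transc': context['transc'],
                         'speaker': context['speaker'],
                         'concordance': line})
    return rows

rows = attach_metadata(sample_lines)
```

Each element of `rows` is a dictionary, so `pd.DataFrame(rows)` would give a data frame with one utterance per row.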


In [7]:
tokens = []

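#Keeping only utterances; the emptiness check avoids an IndexError on blank lines
for line in concordances:
    if line and not line.startswith('<'):
        tokens.append(line)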

In [8]:
len(tokens)

91714

In [9]:
for line in tokens[:10]: print(line)

vous pouvez la reposer encore si vous voulez §
non  §
vous l'avez entendue deux fois ça suffit §
tchin tchin §
comment on s'est rencontrés §
bah §
bah les deux c'est  §
ça §
simple §
là avec Steve bah c'est mon frère donc euh tchk §


In [10]:
quand = [line for line in tokens if 'quand' in line]
avant = [line for line in tokens if "avant que" in line]
jusqua = [line for line in tokens if "jusqu'à ce que" in line]
tandis = [line for line in tokens if 'tandis que' in line]

In [11]:
len(quand)
len(avant)
len(jusqua)
len(tandis)

3190

10

0

28
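
Plain substring tests are strict about spelling: they miss elided forms such as `avant qu'il`, and they fail on a typographic apostrophe or a non-breaking space. Whether that explains the zero hits for `jusqu'à ce que` is an assumption that would need checking against the corpus. A more tolerant search might look like the sketch below, where `sample_tokens` and `patterns` are hypothetical names and the utterances are invented for illustration.

```python
import re

# Invented utterances used only to illustrate the matching behaviour.
sample_tokens = [
    "quand on est partis §",
    "avant qu'il parte §",
    "jusqu\u2019à ce que ça change §",   # typographic apostrophe
    "tandis que moi je suis venue §",
    "bah §",
]

# \b restricts matches to whole words; \s also matches non-breaking spaces;
# qu(?:e\b|['\u2019]) accepts both "que" and the elided "qu'".
patterns = {
    'quand': re.compile(r"\bquand\b", re.IGNORECASE),
    'avant que': re.compile(r"\bavant\s+qu(?:e\b|['\u2019])", re.IGNORECASE),
    "jusqu'à ce que": re.compile(
        r"\bjusqu['\u2019]\s*à\s+ce\s+qu(?:e\b|['\u2019])", re.IGNORECASE),
    'tandis que': re.compile(r"\btandis\s+qu(?:e\b|['\u2019])", re.IGNORECASE),
}

matches = {name: [t for t in sample_tokens if pattern.search(t)]
           for name, pattern in patterns.items()}
counts = {name: len(found) for name, found in matches.items()}
```

Comparing `counts` with the lengths reported above would show how many concordances the substring tests miss, if any.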

In [12]:
quand = [line for line in quand if 'quand même' not in line] #Removing lines with the fixed expression 'quand même', which is not a temporal adverbial

In [13]:
len(quand)

1705
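
The filter above discards a whole line whenever it contains `quand même`, even if the same line also has a temporal `quand`. One possible alternative, sketched here with invented utterances and hypothetical names (`sample`, `QUAND_MEME`, `QUAND`, `kept`), is to blank out the fixed expression first and then check whether any `quand` remains.

```python
import re

# Invented utterances covering the three cases of interest.
sample = [
    "quand même §",                                   # only the fixed expression
    "c'est quand même bien quand on est ensemble §",  # both uses in one line
    "nous quand on est partis §",                     # only the adverbial
]

QUAND_MEME = re.compile(r"\bquand\s+même\b")
QUAND = re.compile(r"\bquand\b")

# Keep a line if a 'quand' survives once every 'quand même' is blanked out.
kept = [line for line in sample if QUAND.search(QUAND_MEME.sub(' ', line))]
```

With this approach the second utterance is retained, whereas the substring filter would drop it.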

## 3. Processing corpus directories

In [14]:
quand_df = pd.DataFrame(quand,columns=['concordance'])
avant_df = pd.DataFrame(avant,columns=['concordance'])
tandis_df = pd.DataFrame(tandis,columns=['concordance'])

In [15]:
quand_df['adverbial'] = 'quand'
avant_df['adverbial'] = 'avant que'
tandis_df['adverbial'] = 'tandis que'

In [16]:
adverbials_df = pd.concat([quand_df, avant_df, tandis_df], ignore_index=True)
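
The three cells above repeat the same two steps for each adverbial. If more adverbials are added later, the labelling and concatenation could be done in a loop over a dictionary. This is a minimal sketch with invented concordances; `hits`, `frames`, and `combined` are hypothetical names.

```python
import pandas as pd

# Invented concordances keyed by adverbial label.
hits = {
    'quand': ["nous quand on est partis §"],
    'avant que': ["avant que ça ferme §"],
    'tandis que': ["madame est née tandis que moi je suis venue §",
                   "tandis que là non §"],
}

# One frame per adverbial; the scalar label is broadcast to every row.
frames = [pd.DataFrame({'concordance': lines, 'adverbial': label})
          for label, lines in hits.items()]
combined = pd.concat(frames, ignore_index=True)
```

The result has the same two columns as `adverbials_df` at this point, with a fresh integer index.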

In [17]:
#Adding annotation columns with default values
adverbials_df['mood'] = 'subjunctive'
adverbials_df['tense'] = 'present indicative'
adverbials_df['corpus'] = 'CFPP2000'

In [18]:
adverbials_df

| | concordance | adverbial | mood | tense | corpus |
|---|---|---|---|---|---|
| 0 | et quand on a redoublé tous les deux on s'est... | quand | subjunctive | present indicative | CFPP2000 |
| 1 | collège Montgolfier lycée Turgot quand on s'... | quand | subjunctive | present indicative | CFPP2000 |
| 2 | et même quand on avait dix-sept dix-huit enfi... | quand | subjunctive | present indicative | CFPP2000 |
| 3 | nous quand on est partis § | quand | subjunctive | present indicative | CFPP2000 |
| 4 | et quand euh quand la mère de Steve a acheté... | quand | subjunctive | present indicative | CFPP2000 |
| ... | ... | ... | ... | ... | ... |
| 1738 | quand on milite quelque part et qu'on s'aperç... | tandis que | subjunctive | present indicative | CFPP2000 |
| 1739 | là c'est entre deux quartiers peut-être de v... | tandis que | subjunctive | present indicative | CFPP2000 |
| 1740 | le parc le parc des Guilands c'est le parc des... | tandis que | subjunctive | present indicative | CFPP2000 |
| 1741 | madame est née tandis que moi je suis venue j... | tandis que | subjunctive | present indicative | CFPP2000 |


In [19]:
adverbials_df.to_csv(r"C:/Users/Juan/Documents/code/research/french_adverbials/data/adverbials_df.csv", index=False)
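
Because the concordances contain accented characters and `§`, it may be worth confirming that the saved file reads back unchanged. The check below is a sketch that writes to a temporary directory and not to the project folder; `df`, `out_path`, and `restored` are hypothetical names and the single row is invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented one-row frame with the same kind of non-ASCII content.
df = pd.DataFrame({'concordance': ["madame est née tandis que moi §"],
                   'adverbial': ['tandis que']})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'adverbials_df.csv'
    # Writing and reading with the same explicit encoding.
    df.to_csv(out_path, index=False, encoding='utf-8')
    restored = pd.read_csv(out_path, encoding='utf-8')
```

If the file will be opened in a spreadsheet program, `encoding='utf-8-sig'` is a common alternative, since the byte order mark helps some programs detect UTF-8.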