# Creating a list of French Words


## Summary
To decide which words in a profession string are already spelled correctly and need no cleaning, the pipeline checks them against a set of known correct words. Three external data sources are used to create this set, which is built in the `cleaning_and_creating_tags.ipynb` notebook. Because two of the three sources are time-consuming to preprocess, they are preprocessed once in the `creating_french_dictionary_words_set.ipynb` notebook and stored as JSON. That notebook reads the two sources and writes a JSON file in which each key is the root form of a word (lemma) and each value is the list of words derived from that lemma (flexions).

The lexicon of the French language comes from [Morphalou3](https://repository.ortolang.fr/api/content/morphalou/2/LISEZ_MOI.html#idp37913792), distributed by ORTOLANG, which contains 159,271 lemmas and 954,690 inflected forms of modern French.

Proper nouns are taken from the [Prolex-Unitex](https://tln.lifat.univ-tours.fr/version-francaise/ressources/prolex-unitex) dictionary.

In this notebook, the following files are read:
- the CSV file provided by Morphalou3 in [Morphalou3_formatCSV_toutEnUn](https://repository.ortolang.fr/api/content/morphalou/2/), whose format is described [here](https://repository.ortolang.fr/api/content/morphalou/2/LISEZ_MOI.html#idp37944688), and
- the [DIC file](https://tln.lifat.univ-tours.fr/medias/fichier/prolex-unitex-1-2_1562935068094-zip?ID_FICHE=321994&INLINE=FALSE) provided by Prolex-Unitex, in which each row holds a word and its details (such as the context in which it is used, whether it is masculine or feminine, and whether it is singular or plural) on a single line, separated by a comma.

The CSV and DIC files are read and processed to create a JSON file in the format mentioned above.

The data from `Morphalou3` looks like this:

![Morphalou3_CSV_file_format.png](./images/Morphalou3_CSV_file_format.png)

and is saved to JSON as follows:

```json
{
    "aalénien": [
        "aalénien",
        "aaléniens",
        "aalénienne",
        "aaléniennes"
    ]
}
```

The data from `Prolex-Unitex` looks like this:

![prolex_unitex_file_format.png](./images/prolex_unitex_file_format.png)

and is saved to JSON as follows:

```json
{
    "william pitt, comte de chatham": [
        "comte de chatham",
        "william pitt",
        "william pitt, comte de chatham"
    ],
    "william pitt": [
        "william pitt"
    ]
}
```

The entries from both files are combined and stored in a single JSON file:

```json
{
    "aalénien": [
        "aalénien",
        "aaléniens",
        "aalénienne",
        "aaléniennes"
    ], 
    "william pitt, comte de chatham": [
        "comte de chatham",
        "william pitt",
        "william pitt, comte de chatham"
    ],
    "william pitt": [
        "william pitt"
    ]
}
```

## Imports

The pandas library is used to read the CSV file, the json library to save the output, and tqdm to display progress bars.

In [1]:
import pandas as pd
import json
from tqdm.notebook import tqdm_notebook

### Reading the Morphalou3

The CSV file has many columns, so only the two required ones, the root word (lemme) and the derived word (flexion), are read. Capitalization can carry meaning, but because the pipeline lowercases the words it processes, the dictionary words are lowercased as well. Each lemme appears only once in the file, on the row of its first occurrence; the following rows give only the flexion.

The first 15 rows of `Morphalou3_CSV.csv` contain metadata about the file, so they are skipped when reading it.

In [2]:
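# positions of the two required columns: root word and derived word
columns = [0, 9]

# names given to those two columns
header_list = ["lemme", "flexion"]

# skip the metadata rows, replace the file's own header row with header_list,
# and read empty cells as NaN
morphalou_words_df = pd.read_csv("./../data/external_data/Morphalou3_formatCSV_toutEnUn/Morphalou3_CSV.csv",
                                 skiprows=15, sep=";", dtype="str", usecols=columns,
                                 header=0, encoding="utf-8", keep_default_na=False, na_values="", names=header_list)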

morphalou_words_df["lemme"] = morphalou_words_df["lemme"].str.lower()
morphalou_words_df["flexion"] = morphalou_words_df["flexion"].str.lower()

In [3]:
# empty dictionary to store words
french_dictionary_category = {}

# most recent lemme seen; rows with an empty lemme belong to it
prev_lemme = None

for lemme, flexion in tqdm_notebook(morphalou_words_df.itertuples(index=False), total=len(morphalou_words_df)):
    # for each lemma and flexion pair
    if isinstance(flexion, str):
        # if flexion is not NaN
        if isinstance(lemme, str):
            # if lemme is not NaN, that means there is a new root word
            prev_lemme = lemme
            if prev_lemme not in french_dictionary_category:
                # create its entry in the python dictionary to store the derived words under it
                french_dictionary_category[prev_lemme] = []

        if prev_lemme is not None and flexion not in french_dictionary_category[prev_lemme]:
            # add the flexion under the lemma
            french_dictionary_category[prev_lemme].append(flexion)

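The grouping rule used in the loop above (a flexion belongs to the most recent non-empty lemme) can be checked on a few rows without loading the full CSV. The following is a minimal sketch, not part of the original notebook: the function name `group_flexions` and the sample rows are illustrative assumptions. It stores flexions as dictionary keys so that duplicates are dropped while insertion order is kept, instead of scanning a list on every row.

```python
def group_flexions(rows):
    """Group (lemme, flexion) rows under the most recent non-empty lemme."""
    grouped = {}          # lemme -> {flexion: None}; dict keys keep insertion order
    current_lemme = None
    for lemme, flexion in rows:
        if not isinstance(flexion, str):
            continue      # skip rows without a flexion, as the loop above does
        if isinstance(lemme, str):
            current_lemme = lemme
        if current_lemme is None:
            continue      # flexion seen before any lemme: nothing to attach it to
        grouped.setdefault(current_lemme, {})[flexion] = None
    return {lemme: list(flexions) for lemme, flexions in grouped.items()}


# made-up sample rows; NaN stands for the empty lemme cells in the CSV
nan = float("nan")
sample_rows = [
    ("chat", "chat"),
    (nan, "chats"),
    ("chien", "chien"),
    (nan, "chiens"),
    (nan, "chiens"),   # duplicate flexion, kept only once
]

result = group_flexions(sample_rows)
print(result)
# {'chat': ['chat', 'chats'], 'chien': ['chien', 'chiens']}
```

Using dictionary keys for de-duplication is a design choice of this sketch; the output has the same shape as `french_dictionary_category`.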

### Reading the Prolex-Unitex

Python has no standard library for reading `dic` files, so each line of the file is read and processed directly. Unlike in `Morphalou3`, the distinction between lemma and flexion is not meaningful for proper nouns, and most of the time the two are the same. To store the data in the same format, the following steps are performed.

1. For each line, keep the text component and discard the additional information (such as gender or number) by splitting the line at the last period (`.`).
2. Temporarily replace each escaped comma (`\,`) in the text with `$`, because an unescaped comma separates the flexion from the lemme.
3. Replace each escaped period (`\.`) in the text with a normal period.
4. Split the text component at the comma. If a lemme is present, it is the second element of the resulting list; otherwise the lemme and the flexion are the same.
5. Replace each `$` with a comma again.
6. Add the lemme and the flexion to the Python dictionary.

In [4]:
# read all lines and close the file handle afterwards
with open("./../data/external_data/Prolex-Unitex_1_2/Prolex-Unitex_1_2.dic", encoding="utf-8") as dic_file:
    proper_nouns_file = dic_file.readlines()

In [5]:
for dict_line in tqdm_notebook(proper_nouns_file):
    # get the text component of the line and ignore the additional information by splitting the text at the last period.
    # replace the escaped period in the text with a normal period.
    # replace the escaped comma in the text with $ temporarily (as the flexion is separated from the lemme by a comma).
    # Then split the text component at the comma. If there is a lemme, it is the second element in the list. Otherwise
    # it is empty.
    # The backslashes are doubled because "\." and "\," are invalid escape sequences in Python string literals.
    text_component = dict_line.lower().rsplit(".", 1)[0].replace("\\.", ".").replace("\\,", "$").split(",")

    flexion, lemme = text_component
    # restore the commas that were protected with $
    flexion, lemme = flexion.replace("$", ","), lemme.replace("$", ",")

    if not lemme:
        # no lemme given: the lemme and the flexion are the same
        lemme = flexion
        
    if lemme not in french_dictionary_category:
        french_dictionary_category[lemme] = []

    if flexion not in french_dictionary_category[lemme]:
        french_dictionary_category[lemme].append(flexion)

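The per-line parsing steps listed in this section can also be written as a small function and tried on a few lines in isolation. This is a sketch under stated assumptions: the function name `parse_dic_line` is hypothetical, and the sample lines and their grammatical codes (`N+Hum:ms`, `N:ms`) are made up for illustration rather than copied from the Prolex-Unitex file. Unlike the loop above, it uses `str.partition`, so a line with more than one unescaped comma does not raise an error; everything after the first comma is treated as the lemme.

```python
def parse_dic_line(line):
    """Return (lemme, flexion) for one line of the dictionary file."""
    # keep the text before the last period, dropping the grammatical codes
    text = line.strip().lower().rsplit(".", 1)[0]
    # protect escaped commas with a placeholder, then unescape periods
    text = text.replace("\\,", "$").replace("\\.", ".")
    # split the flexion from the lemme at the first unescaped comma
    flexion, _, lemme = text.partition(",")
    # restore the protected commas
    flexion, lemme = flexion.replace("$", ","), lemme.replace("$", ",")
    # an empty lemme means the lemme and the flexion are the same
    return (lemme or flexion), flexion


# made-up sample lines (raw strings keep the backslashes as written)
sample_lines = [
    r"William Pitt\, comte de Chatham,.N+Hum:ms",
    r"Comte de Chatham,William Pitt\, comte de Chatham.N+Hum:ms",
    r"St\. Louis,.N:ms",
]

parsed = [parse_dic_line(line) for line in sample_lines]
for lemme, flexion in parsed:
    print(repr(lemme), "<-", repr(flexion))
```

The second sample line shows how a flexion such as "comte de chatham" ends up under a lemme that itself contains a comma.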

### Saving

The Python dictionary is saved to disk as JSON.

In [6]:
with open("./../data/intermediate_steps/words_from_french_language_dictionaries.json", "w", encoding="utf8") as outfile:
    json.dump(french_dictionary_category, outfile, indent=4, ensure_ascii=False)
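A file in this format can later be read back and flattened into a single set of known words. How `cleaning_and_creating_tags.ipynb` actually does this is not shown here, so the following is only a minimal sketch under assumptions: it writes a small sample dictionary to a temporary file (the file name `sample_words.json` is hypothetical) with the same `json.dump` options, reloads it, and collects every lemma and flexion into a set.

```python
import json
import tempfile
from pathlib import Path

# small sample in the same lemma -> flexions format as the saved file
sample = {
    "aalénien": ["aalénien", "aaléniens", "aalénienne", "aaléniennes"],
    "william pitt": ["william pitt"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "sample_words.json"  # hypothetical file name
    # write with the same options as the saving cell above
    with open(json_path, "w", encoding="utf8") as outfile:
        json.dump(sample, outfile, indent=4, ensure_ascii=False)
    # read the file back
    with open(json_path, encoding="utf8") as infile:
        lemma_to_flexions = json.load(infile)

# flatten lemmas and flexions into one set for fast membership checks
known_words = set(lemma_to_flexions)
for flexions in lemma_to_flexions.values():
    known_words.update(flexions)

print("aaléniennes" in known_words)
```

With `ensure_ascii=False` the accented characters are written as-is, which is why both the write and the read specify a UTF-8 encoding.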