# Tags for Profession

This notebook contains the code and commentary on the steps involved in the creation of tags for the professions of people in Paris.

The algorithm is explained in detail at <ADD THE URL TO THE DOCUMENT>. In brief, the [first phase of the project](https://quartier-richelieu.fr/) produced a dataset of entries from the trade directories of Paris. For each year, a list of names, professions, and address sets was created. The addresses of the entries belonging to the Richelieu Quartier were cleaned and geo-referenced. However, the professions were not cleaned or normalized across years, and the abbreviations were not expanded. In this notebook, the profession of each entry in the dataset is cleaned by assigning tags that represent the profession and by expanding the abbreviations to their full forms.

Two steps were completed before this notebook.

First, the `cleaning_special_characters.ipynb` notebook was run to generate the `all_paris_jobs_splchar_cleaned.csv` file (its content is explained in the [Reading the Data after cleaning for special characters](#reading-the-data-after-cleaning-for-special-characters) section). This file serves as the starting point of this notebook.

Second, the French language dictionaries provided by `Morphalou3` and `Prolex-Unitex` were used to create a set of unique words that serves as the list of correctly spelled words for this project (the street names are added to the list in this notebook). The set was saved as `words_from_french_language_dictionaries.json`.


## Summary

This notebook performs the following key tasks:

1. During the OCR process, a single word is sometimes mistakenly broken into multiple words, so an attempt is made to combine them.
2. As the profession strings contain keywords and connecting words, the connecting words are removed and the keywords are stored as tokens after cleaning at the apostrophe and the dot.
3. An attempt is made to merge the tokens that are not in the list of correct words with those that are, based on token co-occurrence, frequency, and similarity.
4. After merging, the tokens containing a dot are identified as potential abbreviations and are filled based on co-occurrence, frequency, and similarity.
5. After the abbreviations are completed, the tokens are called tags, and the data is stored on disk.

The pipeline is built on the following two major assumptions:

1. Most of the words have a correct spelling.
2. The correct spelling appears more frequently than the misspelled one.
    
## Similarity of words (tokens)
 
The token similarity is the Levenshtein similarity between the tokens. The [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between two strings is the "minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other". The distance is normalized to a value between 0 and 1 and converted into a similarity. For this project, the similarity is calculated using the `fuzz.ratio` function of the [RapidFuzz](https://github.com/maxbachmann/RapidFuzz) library without any pre-processing of the strings. `fuzz.ratio` reports the similarity on a scale from 0 to 100, which is the scale used by the thresholds in this notebook. This library was chosen because its similarity function accepts a similarity threshold and uses it as an early stopping criterion: based on the lengths of the strings alone, a high similarity cannot be obtained when the lengths differ significantly. The function returns zero when the similarity is less than the given threshold.
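The thresholded similarity described above can be illustrated with a minimal, standard-library sketch. It is not RapidFuzz's implementation: the function names `levenshtein_distance` and `similarity_with_cutoff` are hypothetical, and the normalization (distance divided by the longer length) is an assumption made for illustration, so the scores may differ from those returned by `fuzz.ratio`.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity_with_cutoff(a: str, b: str, cutoff: float = 0.0) -> float:
    """Similarity in [0, 100]; returns 0.0 when the score is below ``cutoff``."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    # Length-based early exit: the distance is at least the length difference,
    # so the similarity can never exceed this upper bound.
    upper_bound = 100.0 * (1 - abs(len(a) - len(b)) / longest)
    if upper_bound < cutoff:
        return 0.0
    score = 100.0 * (1 - levenshtein_distance(a, b) / longest)
    return score if score >= cutoff else 0.0
```

With a cutoff of 74.49, `similarity_with_cutoff("vin", "marchand", 74.49)` returns zero without computing the distance, because the length difference alone rules out a score above the cutoff.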

## Imports

Import the existing Python libraries.

In [None]:
import copy
import itertools
import json
import os
import pickle
import re
import string
from collections import Counter
from typing import Callable
from typing import Counter as Counter_type
from typing import Dict, FrozenSet, List, NoReturn, Optional, Set, Tuple, Union

import colorama
import numpy as np
import pandas as pd
import textdistance
from nltk.corpus import stopwords
from rapidfuzz import fuzz, process
from tqdm.notebook import tqdm_notebook
from unidecode import unidecode

# French stop words from NLTK (requires the "stopwords" corpus to be available locally)
stop_words_french = set(stopwords.words("french"))

# Re-initialise colorama so that ANSI colour codes are not stripped from the output
colorama.deinit()
colorama.init(strip=False)

## Hyperparameters

The pipeline requires the user to set some hyperparameters at various steps in the process. The hyperparameters are:

1. `MINIMUM_TOKEN_LENGTH`: The minimum number of alphanumeric characters a token must contain to be considered valid.
2. `MINIMUM_TOKEN_FREQUENCY`: The minimum frequency a token must have so that it is not considered a low-frequency token.
3. `MIN_THRESHOLD_REPLACE_BROKEN_WORD`: The minimum similarity for a word to be considered a replacement for a set of consecutive low-frequency tokens that are not in the list of correct words.
4. `MINIMUM_TOKEN_SIMILARITY`: The minimum similarity (_fuzz.ratio_) between tokens for them to be considered similar.
5. `MINIMUM_ABVR_FULLFORM_SIMILARITY`: The minimum threshold for the modified Jaccard similarity score (detailed in [Algorithm for completing the abbreviations with support](#algorithm-for-completing-the-abbreviations-with-support)) between the full form and the abbreviation.
6. `MINIMUM_INTER_ABVR_SIMILARITY`: The minimum similarity between the filled and unfilled abbreviations for them to be considered similar.

In [None]:
MINIMUM_TOKEN_LENGTH = 3
MINIMUM_TOKEN_FREQUENCY = 51
MIN_THRESHOLD_REPLACE_BROKEN_WORD = 79.49
MINIMUM_TOKEN_SIMILARITY = 74.49
MINIMUM_ABVR_FULLFORM_SIMILARITY = 50
MINIMUM_INTER_ABVR_SIMILARITY = 69.49

The intermediate outcomes of the process are stored in a folder whose name contains the similarity score. The next cell creates this folder inside the `intermediate_steps` folder, using `MINIMUM_TOKEN_SIMILARITY` rounded up to the next integer (`sim_score_75` for the value set above).

In [None]:
intermediate_steps_folder_prefix = "./../data/intermediate_steps/sim_score_"+str(int(np.ceil(MINIMUM_TOKEN_SIMILARITY)))+"/"
os.makedirs(os.path.dirname(intermediate_steps_folder_prefix), exist_ok=True)

## Splitting a string into a set of words

As the aim of the project is to create one-word tags, the strings with more than one word should be split into single words. The words are split at space, hyphen, apostrophe, and dot. After splitting the strings, the words that have fewer than the specified number of alphanumeric characters are removed.

While the space is an obvious choice, the reasons for choosing the other delimiters are:

- Hyphen: Generally, hyphens are used to indicate a link between words or the end of a line, and em dashes are used to provide emphasis. However, inspection of the dataset revealed that their usage is not consistent, i.e. the hyphen and the em dash were used interchangeably by the OCR process. To standardize the process, the words are split at the hyphen.
- Apostrophe: The apostrophe is mostly used when combining a complementary pronoun with a word starting with a vowel. Thus, the word is split at the apostrophe and only the keyword is retained.
- Dot: In the dataset, a dot was mostly used to write abbreviations. However, a word with a dot at the end is sometimes fused with the next word in the string. Thus, an attempt is made to split such words around the dot and to check whether they can be disambiguated into multiple probable words.

Examples:
- a.p. moller-maersk to [a.p., moller, maersk]
- Aire-sur-l'Adour to [aire, sur, l'adour] and [aire, sur, adour] (as _l_ is a complementary pronoun)

The method of splitting a string designed for this project is as follows:

1. The string is split at spaces and hyphens.
2. For each substring obtained from the split:
    1. Retain the substring if it is not a French stop word (obtained from `nltk.corpus.stopwords`) and has the minimum length after removing the non-alphanumeric characters.
    2. If the substring contains an apostrophe, split it at the apostrophe:
        1. If the part before the apostrophe is a complementary pronoun, only the part after the apostrophe is kept as a possible substring to return.
        2. If the part before the apostrophe has more than one character and its last character is a complementary pronoun, two words may have been fused. The first part without its last character and the second part are kept as possible substrings to return if both are in the list of correctly spelled words.
        3. If there is no complementary pronoun before the apostrophe, the substring itself is kept as a possible substring to return.
    3. If any of the possible substrings contains a dot, split it at the dot:
        1. If the part after the dot is a stop word, the substring is replaced by the part up to and including the dot.
        2. If both parts around the dot are in the list of correctly spelled words, the substring is replaced by the two parts without the dot.
    4. Finally, all the possible substrings that are valid are returned.
        - A string (word) is valid if it is not a stop word and has at least the specified number of alphanumeric characters.
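The first steps of this method can be sketched in a few lines. This is a simplified illustration only: the stop-word set and the prefix set are tiny hypothetical stand-ins, the function name `simple_split` is made up, and the sketch handles neither the fused-word case nor the splitting at a dot. The notebook's own functions in the next section are the ones used by the pipeline.

```python
import re
from typing import List, Set

# Hypothetical, tiny stand-ins for the NLTK stop-word list and the
# complementary pronouns that may precede an apostrophe.
STOP_WORDS: Set[str] = {"de", "du", "la", "le", "et", "en"}
ELIDED_PREFIXES: Set[str] = {"l", "d", "n", "s", "m"}


def simple_split(text: str, minimum_length: int = 3) -> List[str]:
    """Split at spaces and hyphens, drop elided prefixes, keep valid tokens."""
    tokens = []
    for part in filter(None, re.split(r"\s|-", text.lower())):
        if "'" in part:
            before, after = part.split("'", 1)
            # Drop an elided prefix such as "l'" or "d'" and keep the keyword.
            if before in ELIDED_PREFIXES:
                part = after
        # Keep only non-stop-words with enough alphanumeric characters.
        alphanumeric_length = len(re.sub(r"\W+", "", part))
        if part not in STOP_WORDS and alphanumeric_length >= minimum_length:
            tokens.append(part)
    return tokens
```

For instance, `simple_split("marchand de vin")` drops the connecting word `de`, and `simple_split("fab. d'huiles")` keeps the potential abbreviation `fab.` together with the keyword `huiles`.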

### Utility functions to split a string

Below are utility functions to split a string. These functions are used to split the strings in creating the list of correctly spelled words and the tokens for the professions. 

The functions are

1. `check_presence`: Returns whether an element is present in a given collection of objects.
2. `check_token_length`: Returns whether the string has the minimum length after removing non-word characters.
3. `is_valid_token`: Returns whether the string is valid, based on `check_token_length` and on `check_presence` in the stop words.
4. `split_metier_strings_to_tokens`: Splits the input string at spaces, hyphens, and apostrophes, removes the stop words, and returns the tokens.
5. `clean_tokens`: Splits a token (generated by splitting at space and hyphen) that contains an apostrophe into singular tokens if it contains a complementary pronoun. It then splits these tokens at the dot (if they contain any) and validates all of them.
6. `apos_split`: Accepts a string with an apostrophe and splits the string at the apostrophe.
7. `complementary_pronouns_check`: Returns the string after the apostrophe if the part before the apostrophe is a complementary pronoun; otherwise returns None.
8. `dot_split`: Returns the list of tokens after splitting at a dot.

In [None]:
def check_presence(element: Union[str, FrozenSet], check_in: Union[Set, Dict]) -> bool:
    """Returns whether an element is present in a given sequence of objects.

        Parameters
        ----------
        element : str or FrozenSet
            The element to check.
        check_in : Set or Dict
             The sequence of objects to check for the element.
        
        
        Returns
        -------
        bool
            Boolean indicating the presence (or not) of the element.
        """
    return element in check_in


def check_token_length(token: str, minimum_length: int) -> bool:
    """Returns whether the string has the minimum length after removing non-word characters.

        Parameters
        ----------
        token : str
            The string to check.
        minimum_length : int
             The minimum length of the string without non-alphanumeric characters.
        
        
        Returns
        -------
        bool
            Boolean indicating whether the string is of minimum length.
        """
    # raw string so that "\W" is passed to the regex engine unchanged
    return len(re.sub(r"\W+", "", token)) >= minimum_length


def is_valid_token(token: str, minimum_length: int, stopwords: Set[str]) -> bool:
    """Returns whether the string is valid based on the :func:`check_token_length` and :func:`check_presence` in stopwords.
        A string is valid if it is not in the french stop words and has a length at least of ``minimum_length``.

        Parameters
        ----------
        token : str
            The string to check.
        minimum_length : int
            The minimum length of the string without non-alphanumeric characters.
        stopwords: Set[str]
            The set of stop words in French.

        
        Returns
        -------
        bool
            A boolean indicating whether a string is valid.
        """

    # Initialise the validities to True
    length_validity, stop_word_validity = True, True

    if isinstance(minimum_length, int):
        # check the length validity
        length_validity = check_token_length(token, minimum_length)

    if stopwords:
        # check the presence in stop words
        stop_word_validity = not check_presence(token, stopwords)

    return length_validity and stop_word_validity


def complementary_pronouns_check(
    apos_splits: List[str],
    before_apos_possibilities: Set[str],
    minimum_length: int,
    stop_words: Set[str],
) -> Union[str, None]:
    """Returns the string after the apostrophe,
        if the character before the apostrophe is a complementary pronoun provided in ``before_apos_possibilities``,
        else returns None.

        Parameters
        ----------
        apos_splits : List[str]
            The list of words that are generated by splitting a string at the apostrophe.
        before_apos_possibilities : Set[str]
            A set of words that are considered complementary pronouns.
        minimum_length : int
            The minimum length of the string without non-alphanumeric characters.
        stop_words : Set[str]
            The set of stop words in French.

        
        Returns
        -------
        possible_token : str
            The valid string after the apostrophe.
        None
            if the character before the apostrophe is not a complementary pronoun or if the string after the apostrophe is not valid.
        """
    if check_presence(apos_splits[0], before_apos_possibilities):
        # if first part of the split is in the complementary pronouns (Pronoms compléments) then the second part is considered as the token
        possible_token = apos_splits[1]
        if is_valid_token(possible_token, minimum_length, stop_words):
            # if the new token is valid, return the new token
            return possible_token
    return None


def dot_split(
    after_apos_splits: List[str],
    stop_words: Set[str],
    correctly_spelled_words: Set[str],
) -> List[str]:
    """Returns the list of tokens splitting at a dot.
        If the token after the dot belongs to stop words, then the token is changed to be the token up to the dot.
        If the part of the token after the dot does not belong to stop words but both the parts around the dot belong
            to the list of correctly spelled words then the token is changed to two sub tokens with parts around the dot.

    Parameters
    ----------
    after_apos_splits : List[str]
        The list of words that are generated by splitting a string at apostrophe.
    stop_words : Set[str]
        The set of stop words in French.
    correctly_spelled_words: Set[str]
        A set of words with correct spellings (mostly belonging to the French language).
    
    
    Returns
    -------
    dot_splitted_toks : List[str]
        The result of splitting the tokens (already split at apostrophe) at a dot.
    """
    dot_splitted_toks = []
    # empty list to store the possible splits

    for token in after_apos_splits:
        # for each token obtained after splitting at the apostrophe
        if "." in token and not token.endswith("."):
            # the token contains a dot that is not at the end
            # (a dot at the end can mean that the token is potentially an abbreviation)
            dot_split_parts = token.split(".", 1)
            # split at the first occurrence of the dot
            if check_presence(dot_split_parts[-1], stop_words):
                # the part after the dot is a stop word:
                # keep the part before the dot along with the dot
                dot_splitted_toks.append(dot_split_parts[0] + ".")
                continue
            elif all(
                dot_split_tok in correctly_spelled_words
                for dot_split_tok in dot_split_parts
            ):
                # both parts around the dot are in the list of correctly spelled words
                dot_splitted_toks.extend(dot_split_parts)
                continue
        # otherwise keep the token without splitting at the dot
        dot_splitted_toks.append(token)

    return dot_splitted_toks


def apos_split(
    token: str,
    minimum_length: int,
    correctly_spelled_words: Set[str],
    stop_words: Set[str],
) -> List[str]:
    """The function accepts a string with an apostrophe and splits the string at the apostrophe.
        If the token before the apostrophe belongs to complementary pronouns, then the word after the apostrophe becomes the token.
        Else if the last character of the part of the token before the apostrophe is a complementary pronoun
            and both the words around the apostrophe (after removing the complementary pronoun before the apostrophe)
            belong to the list of correctly spelled words then the token is changed to two sub tokens with parts around the apostrophe.

    Parameters
    ----------
    token : str
        The token to split at the apostrophe.
    minimum_length : int
        The minimum length of the string without non-alphanumeric characters.
    correctly_spelled_words : Set[str]
        A set of words with correct spellings.
    stop_words : Set[str]
        The set of stop words in French.

    
    Returns
    -------
    apos_split_subtokens : List[str]
        The result of splitting the token at an apostrophe.
    """

    apos_split_subtokens = [token]
    # Complementary pronouns that can be ignored when they occur before an apostrophe.
    before_apos_possibilities = {"l", "d", "n", "s", "m", "del"}

    # Split the token at the first occurrence of the apostrophe.
    apos_parts = token.split("'", 1)

    second_part = complementary_pronouns_check(
        apos_parts, before_apos_possibilities, minimum_length, stop_words,
    )
    if second_part:
        # The part before the apostrophe is a complementary pronoun:
        # the part after the apostrophe becomes the new token.
        apos_split_subtokens = [second_part]

    elif len(apos_parts[0]) > 1:
        # The last character of the first part may be a complementary pronoun
        # (pronom complément) that was fused to the previous word.
        if check_presence(apos_parts[0][-1], before_apos_possibilities):
            subtokens_around_apos = [apos_parts[0][:-1], apos_parts[1]]

            if correctly_spelled_words and all(
                check_presence(pos_tok, correctly_spelled_words)
                for pos_tok in subtokens_around_apos
            ):
                # Both parts are in the list of correctly spelled words,
                # so the valid ones are kept as separate tokens.
                apos_split_subtokens = [
                    ele
                    for ele in subtokens_around_apos
                    if is_valid_token(ele, minimum_length, stop_words)
                ]
    return apos_split_subtokens


def clean_tokens(
    splitted_token: str,
    minimum_length: int,
    correctly_spelled_words: Set[str],
    stop_words: Set[str],
) -> List[str]:
    """Splits the tokens containing an apostrophe and dot to singular tokens if they contain complementary pronouns.
        Further, splits these tokens at the dot (if they contain any) and validates all of them.

    Parameters
    ----------
    splitted_token : str
        The string generated due to split at space and hyphen.
    minimum_length : int
        The minimum length of the string without non-alphanumeric characters.
    correctly_spelled_words : Set[str]
        A set of words with correct spellings.
    stop_words : Set[str]
        The set of stop words in French.

    
    Returns
    -------
    List[str]
        A list of strings containing the substring of the original token that is first split at the apostrophe
            and then at a dot or the input token, if it does not contain an apostrophe.
        An empty list if the input token is not valid.
    """

    possible_subtokens = []
    if is_valid_token(splitted_token, minimum_length, stop_words):
        # if the input string is valid

        possible_subtokens = [splitted_token]

        if "'" in splitted_token:
            # if the token contains a apostrophe, try splitting at apostrophe and return the valid tokens
            possible_subtokens = apos_split(
                splitted_token, minimum_length, correctly_spelled_words, stop_words
            )

        if any("." in pos_split for pos_split in possible_subtokens):
            # if the token or the tokens after splitting at apostrophe contains a dot, try splitting at dot and return the valid tokens
            possible_subtokens = dot_split(
                possible_subtokens, stop_words, correctly_spelled_words
            )

    return [
        ele
        for ele in possible_subtokens
        if is_valid_token(ele, minimum_length, stop_words)
    ]


def split_metier_strings_to_tokens(
    string_to_split: str,
    minimum_length: int,
    correctly_spelled_words: Union[Set[str], None],
    stop_words: Union[Set[str], None],
) -> List[str]:
    """Splits and returns the input strings at space, hyphens, apostrophe after removing stop words.

    Parameters
    ----------
    string_to_split : str
        The string to split into tokens.
    minimum_length : int
        The minimum length of the string without non-alphanumeric characters.
    correctly_spelled_words : Union[Set[str], None]
        A set of words with correct spellings.
        When creating a list of correct words, this parameter should be set to None as there is no list of correct words yet.
    stop_words : Union[Set[str], None]
        The set of stop words in French.

    
    Returns
    -------
    metier_tokens : List[str]
        The list of tokens for the given input string
    """

    # split the input string at space and hyphen
    split_space_hyphen = list(filter(None, re.split(r"\s|-", string_to_split)))

    metier_tokens = []
    # a list to store the token for the given input string
    for split_token in split_space_hyphen:
        # for each split part, check if the token is valid and try to split at apostrophe, if it contains one.
        cleaned = clean_tokens(
            split_token, minimum_length, correctly_spelled_words, stop_words
        )
        if cleaned:
            # append the valid token to the list
            metier_tokens.extend(cleaned)
    return metier_tokens

## Creation of list of correctly spelled words

The list of correctly spelled words plays a key role in the process of creating tags as described in the algorithm. In this section, three sources are used to create a set of words that are assumed to have correct spellings. As the entries in these sources can contain more than one word, the strings are split at space and hyphen. The apostrophe and the complementary pronoun before the apostrophe are also removed, as the same pre-processing step is applied to the professions. The sources fall into two types of datasets. The first type is the language dictionaries obtained from `Morphalou3` (for various French words) and `Prolex-Unitex` (for proper nouns in French). The second type is the list of street names of Paris obtained from `Open Data Paris`.


- Using the dictionaries provided by `Morphalou3` and `Prolex-Unitex`, a list of words was created and saved as `words_from_french_language_dictionaries.json`. The keys in the JSON file are root forms of the words (the term root form is used very loosely here) and each value is a list of words that are derived from the root form. The script used to create the JSON file can be found in `creating_french_dictionary_words_set.ipynb`.
- For the names of the streets, the table of streets of Paris is obtained from [Street names table](https://opendata.paris.fr/explore/dataset/denominations-emprises-voies-actuelles/table/?disjunctive.siecle&disjunctive.statut&disjunctive.typvoie&disjunctive.arrdt&disjunctive.quartier&disjunctive.feuille&sort=typo_min). The `denominations-emprises-voies-actuelles.csv` file contains details of the streets of Paris in 28 columns. The `Dénomination complète minuscule` column is read, and its strings are split and added to the list of correctly spelled words.
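To make the JSON layout described above concrete, here is a small sketch that flattens a dictionary of root forms and derived forms into a set of words. The two entries are invented for illustration and are not taken from the real file, and `flatten_dictionary` is a hypothetical helper that skips the token splitting performed by the notebook's own function.

```python
import json
from typing import Dict, List, Set

# Hypothetical excerpt with the same shape as the JSON file:
# root form -> list of derived forms.
example_json = (
    '{"boulanger": ["boulangers", "boulangère"],'
    ' "marchand": ["marchands", "marchande"]}'
)


def flatten_dictionary(raw_json: str) -> Set[str]:
    """Collect every root form and derived form, lower-cased, into one set."""
    entries: Dict[str, List[str]] = json.loads(raw_json)
    words: Set[str] = set()
    for root, derived_forms in entries.items():
        words.add(root.lower())
        words.update(form.lower() for form in derived_forms)
    return words


words = flatten_dictionary(example_json)
```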


### Utility function to read the JSON file and street names file and return the list of correctly spelled words

The functions are 

1. `create_correctly_spelled_words_list_from_dictionaries`: Reads the words from the language dictionary (`category_wise_frech_dictionary.json`), splits the strings at space, hyphen, and apostrophe, and stores the result as a set of words.
2. `create_correctly_spelled_words_list_from_street_names`: Reads the street names from the given column of the street names CSV file, splits them at space, hyphen, and apostrophe, and stores the result as a set of words.

In [None]:
def create_correctly_spelled_words_list_from_dictionaries(
    dictionary_file_path: str,
    minimum_length: int,
    include_original_string: bool = False,
) -> Set[str]:
    """Reads the words from the language dictionary, splits the strings at space, hyphen, and apostrophe,
        and stores them as a set of words.

    Parameters
    ----------
    dictionary_file_path : str
        The path to the JSON dictionary file with keys as root words and values as a list of derived words.
    minimum_length : int
        The minimum length of the string without non-alphanumeric characters.
    include_original_string : bool, optional
        After splitting the derived words, the derived word is also added to the list of correctly spelled words if this is set to True,
        by default False.

    
    Returns
    -------
    dictionary_words_correctly_spelled : Set[str]
        A set of words that can be considered correctly spelled.
    """
    # load the json file
    with open(dictionary_file_path, encoding="utf8") as fr_dict:
        dictionary = json.load(fr_dict)

    dictionary_words_correctly_spelled = set()

    for lemme, flexions in dictionary.items():
        # for each root
        flexions = flexions + [lemme]
        for flexion in flexions:
            flexion_ = flexion.lower()
            splitted_flexion = split_metier_strings_to_tokens(
                flexion_,
                minimum_length,
                correctly_spelled_words=set(),
                stop_words=set(),
            )
            if include_original_string:
                if not check_presence(flexion_, set(splitted_flexion)):
                    splitted_flexion.append(flexion_)
            dictionary_words_correctly_spelled.update(splitted_flexion)

    return dictionary_words_correctly_spelled


def create_correctly_spelled_words_list_from_street_names(
    street_names_csv_file: str,
    sep: str,
    column_name: str,
    minimum_length: int,
    include_original_string: bool = False,
) -> Set[str]:
    """Reads the street names from the given column of the CSV file, splits them at space, hyphen,
        and apostrophe, and stores them as a set of words.

    Parameters
    ----------
    street_names_csv_file : str
        The path to the CSV file with street names.
    sep : str
        The separator of the CSV file.
    column_name : str
        The name of the column in the CSV file with street names to read.
    minimum_length : int
        The minimum length of the string without non-alphanumeric characters.
    include_original_string : bool, optional
        After splitting the street name, the street name is also added to the list of correctly spelled words if this is set to True,
        by default False.

    
    Returns
    -------
    street_names_correctly_spelled : Set[str]
        A set of words that can be considered correctly spelled.
    """
    # Read the CSV file
    col_list = [column_name]

    street_names_df = pd.read_csv(
        street_names_csv_file,
        dtype={column_name: "str"},
        usecols=col_list,
        sep=sep,
        header=0,
        encoding="utf-8",
    )

    street_names_df[column_name] = street_names_df[column_name].str.lower()

    street_names_correctly_spelled = set()

    for street_name in street_names_df[column_name].unique():
        splitted_street_name = split_metier_strings_to_tokens(
            street_name, minimum_length, correctly_spelled_words=set(), stop_words=set()
        )

        if include_original_string:
            if not check_presence(street_name, set(splitted_street_name)):
                splitted_street_name.append(street_name)

        street_names_correctly_spelled.update(splitted_street_name)

    return street_names_correctly_spelled

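The set of correctly spelled words is stored in `words_with_correct_spellings`, keeping only tokens with at least `MINIMUM_TOKEN_LENGTH` (3) alphanumeric characters. In the next cell as written, the block that reads the street names and merges them with the dictionary words is wrapped in a string literal and therefore does not run, so the set is built from the dictionary words only, excluding any word that contains a dot.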

In [None]:
# get the set of words from the language dictionary
words_from_lang_dictionary = create_correctly_spelled_words_list_from_dictionaries(
    dictionary_file_path="./../data/intermediate_steps/words_from_french_language_dictionaries.json",
    minimum_length=MINIMUM_TOKEN_LENGTH,
    include_original_string=False,
)

'''
# get the set of words from the street names
words_from_street_names = create_correctly_spelled_words_list_from_street_names(
    street_names_csv_file="./../data/external_data/Street-names/denominations-emprises-voies-actuelles.csv",
    sep=";",
    column_name="Dénomination complète minuscule",
    minimum_length=MINIMUM_TOKEN_LENGTH,
    include_original_string=False,
)


# combine the two sets to get a single set of correctly spelled words.
words_with_correct_spellings = set.union(
    words_from_lang_dictionary, words_from_street_names
)
'''

words_with_correct_spellings = set(
    filter(lambda word: "." not in word, words_from_lang_dictionary)
)

### Saving

The set of correctly spelled words is stored as a pickle file on disk, to avoid re-creating the set every time and so that it can be used in other notebooks.

In [None]:
with open(
    "./../data/intermediate_steps/words_with_correct_spellings.pickle", "wb"
) as outfile:
    pickle.dump(words_with_correct_spellings, outfile)

### Reading

The set of correctly spelled words stored as a pickle file is read back from disk.

In [None]:
with open(
    "./../data/intermediate_steps/words_with_correct_spellings.pickle", "rb"
) as infile:
    words_with_correct_spellings = pickle.load(infile)

## Combine the words broken by mistake during OCR (Step 1)

In the first step, we try to combine the words that were broken by mistake. The algorithm is as follows:

1. Split the profession string at spaces, hyphens, apostrophes, and dots to create a token set for each unique profession.
2. Create a counter for each unique token.
3. Go over each token set. If either of two consecutive tokens is not in the list of correct words, or either of them has a frequency below a threshold, try to concatenate them.
4. Check whether the concatenated word, or a word whose similarity to it reaches a certain threshold, exists among the existing tokens.
    1. If the exact concatenated word is not present in the existing tokens, the **process** module of the [RapidFuzz](https://github.com/maxbachmann/RapidFuzz) library is used to extract the closest word from a list of words. The `process.extractOne` function is used for this purpose. It is called with the following arguments:
        - The seed word (the concatenated word)
        - The list of words to search in (the list of unique tokens)
        - A processor to pre-process the strings
        - A scorer to calculate the similarity between strings (`fuzz.ratio`)
        - The minimum similarity score that a word in the list must reach to be selected and returned.
    
5. If such a word exists and if the frequency of such a word is greater than or equal to all the individual consecutive tokens used to obtain it, then replace the consecutive tokens with the found word.
6. Update the token set for the profession using the words obtained after combining.
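
The concatenation, closest-match, and frequency checks above can be sketched on toy data. This is only an illustration: the token lists are made up, the `closest_token` helper is hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for RapidFuzz's `fuzz.ratio`, so the scores may differ slightly from what the notebook computes.

```python
from collections import Counter
from difflib import SequenceMatcher

# Toy token lists for three profession strings (hypothetical data).
token_lists = [
    ["bou", "langer"],          # "boulanger" broken in two by OCR
    ["boulanger"],
    ["boulanger", "patissier"],
]
token_counter = Counter(tok for toks in token_lists for tok in toks)
unique_tokens = set(token_counter)


def closest_token(word, candidates, min_similarity):
    """Return (candidate, score) for the best score >= min_similarity, else None.

    Stand-in for process.extractOne; scores are on a 0-100 scale.
    """
    best = None
    for cand in sorted(candidates):  # sorted for a deterministic tie-break
        score = 100 * SequenceMatcher(None, word, cand).ratio()
        if score >= min_similarity and (best is None or score > best[1]):
            best = (cand, score)
    return best


pair = token_lists[0]
concatenated = "".join(pair)  # "boulanger"
match = closest_token(concatenated, unique_tokens, min_similarity=90)

# Accept the match only if it is at least as frequent as each original token.
accepted = match is not None and all(
    token_counter[match[0]] >= token_counter[tok] for tok in pair
)
print(match, accepted)
```

Here the concatenated word already exists among the tokens and is more frequent than either fragment, so the pair would be replaced by `boulanger`.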

### Utility functions to combine the mistakenly broken words

The functions are:

1. `simple_processor`: A string processor to return the same string as input.
2. `tokens_after_combining_low_freq_consecutive_tokens`: The function accepts a dictionary with key as the profession and value as the list of corresponding tokens, a threshold frequency to determine if a token is less frequent, and the minimum similarity for the word to replace the less frequent or incorrectly spelled tokens. It returns the dictionary with key as the profession and value as the list of corresponding updated tokens.
3. `get_possible_missplit_tokens`: The function accepts a list of tokens along with their counter and a threshold frequency to determine if a token is less frequent. It returns a list of sublists of consecutive tokens that can be combined, that is, pairs in which either token is not in the list of correctly spelled words or either token has a frequency below the threshold.
4. `get_possible_combined_word`: The function accepts a list of tokens that are candidates for combining and returns the possible replacement with its similarity score, or None if there is no word within the given similarity threshold.
5. `get_updated_tokens_after_combining`: This function accepts the current list of tokens for a profession string and the list of possible combinations, and returns the updated list of tokens for the profession after combining the tokens. The combinations are applied in descending order of similarity score and only one combination is applied per token.
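
The rule that combinations are applied in descending order of similarity, with each token used at most once, can be illustrated with a simplified stand-in. The `apply_combinations` function and the sample data below are hypothetical and only mirror the idea behind `get_updated_tokens_after_combining`.

```python
def apply_combinations(tokens, combinations):
    """Greedily apply (sub_tokens, replacement, score) triples, best score first."""
    updated = list(tokens)
    used = set()  # tokens that were already merged into a replacement
    ordered = sorted(combinations, key=lambda comb: comb[2], reverse=True)
    for sub_tokens, replacement, _score in ordered:
        sub = list(sub_tokens)
        if any(tok in used for tok in sub):
            continue  # one of these tokens was consumed by a better combination
        for start in range(len(updated) - len(sub) + 1):
            if updated[start:start + len(sub)] == sub:
                updated[start:start + len(sub)] = [replacement]
                used.update(sub)
                break
    return updated


tokens = ["fab", "ri", "cant", "de", "meubles"]
combinations = [
    (("fab", "ri"), "fabri", 80.0),
    (("fab", "ri", "cant"), "fabricant", 100.0),
    (("ri", "cant"), "ricant", 70.0),
]
result = apply_combinations(tokens, combinations)
print(result)
```

The three-token combination has the highest score, so it is applied first and the two overlapping lower-scoring combinations are skipped.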

In [None]:
def simple_processor(token: str) -> str:
    """A string processor to return the same string as input.
        This dummy processor is used to avoid the default processor of the Rapidfuzz module to calculate string similarity.

    Parameters
    ----------
    token : str
        The input string to process.

    
    Returns
    -------
    str
        The output string same as the input string.
    """
    return token

def get_possible_combined_word(
    tokens_to_combine: List[str],
    unique_tokens: Set[str],
    previous_combinations: Dict[
        FrozenSet[str], Set[Tuple[Tuple[str], str, float]]
    ],
    token_counter_full_data: Counter_type[str],
    minimum_threshold_for_replacement_word: float,
    processor: Optional[Callable[[str], str]] = simple_processor,
    scorer: Optional[Callable[[str, str], float]] = fuzz.ratio,
) -> Union[Tuple[str, float], None]:
    """The function accepts a list of words that have low frequency and are not in the list of correctly spelled words,
        the unique tokens in the dataset, the minimum similarity for the word to replace the tokens with,
        and other keyword arguments and returns the list containing possible replacement and similarity score
        or None if there is no word within the given similarity threshold.

    Parameters
    ----------
    tokens_to_combine : List[str]
        The list of tokens that can be combined.
    unique_tokens : Set[str]
        The unique tokens that are present in the dataset.
    previous_combinations : Dict[FrozenSet[str], Set[Tuple[Tuple[str], str, float]]]
        A python dictionary that stores the previously combined tokens. The key is a frozenset of the tokens that can be combined.
        The value is a set of tuples that contains three entries.
            First, the tokens combined (as a tuple to preserve the order of token appearance),
            second, the string that the tokens are changed to, and
            lastly, the similarity score between the string obtained by concatenating the strings in the ``tokens_to_combine`` list
                and the string that the tokens are changed to.
    token_counter_full_data : Counter_type[str]
        A counter object that holds the number of times a token has appeared in the dataset.
    minimum_threshold_for_replacement_word : float
        The minimum similarity score to replace the tokens in the ``tokens_to_combine`` with the word found in the ``unique_tokens``.
    processor :  Optional[Callable[[str], str]] = simple_processor, optional
        The preprocessing to be performed on the string before computing the similarity score, by default simple_processor.
    scorer : Optional[Callable[[str, str], float]], optional
        The function to be used to compute similarity, by default fuzz.ratio.

    
    Returns
    -------
    Tuple[str, float]
        If the tokens in the ``tokens_to_combine`` have a suitable replacement then a list with the replacement word,
            its similarity score with the string obtained by concatenating tokens in ``tokens_to_combine``.
    None
        If there is no suitable replacement for the tokens in ``tokens_to_combine`` then None is returned.
    """

    # concatenate the words in ``tokens_to_combine``
    possible_combination = "".join(tokens_to_combine)

    if check_presence(possible_combination, unique_tokens):
        # if the concatenated word is already in the dataset, then that word is the possible replacement and the similarity score is 100 as the exact word is present.
        replacement = [possible_combination, 100, ""]
    else:
        # if the concatenated word is not in the dataset, then first check if the same words were combined earlier in this process.
        set_tokens = frozenset(tokens_to_combine)
        # Check if the frozenset of the ``tokens_to_combine`` is present in the previous_combinations``
        if check_presence(set_tokens, previous_combinations):
            # if the set of tokens is present, then check if the tokens were present in the same order as in the ``tokens_to_combine``.
            for combinations in previous_combinations[set_tokens]:
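                # the stored tokens are a tuple, so convert the list before comparing (a tuple never equals a list)
                if combinations[0] == tuple(tokens_to_combine):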
                    # if the order of the tokens is same then
                    if None in combinations:
                        # if this combination is rejected earlier, return None
                        return None
                    # return the replacement word and its similarity score.
                    return combinations[-2:]
        # if the tokens were not combined earlier or not in the same order, then extract the closest match from the unique tokens of the dataset with the given similarity threshold.
        # The process returns None if no match is found.
        replacement = process.extractOne(
            possible_combination,
            unique_tokens,
            processor=processor,
            scorer=scorer,
            score_cutoff=minimum_threshold_for_replacement_word,
        )

    if replacement:
        # If there is a replacement (either the concatenated word or the word extracted from the unique tokens)
        if not check_presence(replacement[0], tokens_to_combine):
            # check if the replacement word is not one of the tokens to combine. If so, then the tokens are not replaced. Otherwise,
            freq_of_replacement = token_counter_full_data[replacement[0]]
            freq_of_org_tokens = [
                token_counter_full_data[tok_to_comb]
                for tok_to_comb in tokens_to_combine
            ]
            if all(
                freq_of_replacement >= freq_org_token
                for freq_org_token in freq_of_org_tokens
            ):
                # check if the frequency of the replacement word is greater than or equal to the frequency of the words in the ``tokens_to_combine``.
                # if yes, then return the replacement word and its similarity score. The last entry of ``replacement`` is the index of the word in the ``unique_tokens`` set. As it is not needed, only the first two entries, which are the word and the similarity, are returned.
                return replacement[:-1]
    # Return None, if a replacement is not found.
    return None


def get_updated_tokens_after_combining(
    current_tokens: List[str],
    possible_combinations: List[Tuple[Tuple[str], str, float]],
) -> List[str]:
    """This function accepts the current list of tokens for a profession string, the list of possible combinations,
        and returns the updated list of tokens for the profession after combining the tokens as per the ``possible combinations``.
        The combinations are applied in the order of the similarity score and only one combination is applied per token.

    Parameters
    ----------
    current_tokens : List[str]
        The list of tokens for a profession that are generated by splitting the string.
    possible_combinations : List[Tuple[Tuple[str], str, float]]
        The list of the possible combinations obtained by trying to combine words in the ``current_tokens`` that are less frequent
            and/or do not appear in the list of correctly spelled words.
        The list of possible combinations has 3 elements, the first is the tokens that are to be combined,
            second, the word to replace the tokens, and
            third is the similarity score of the replacement word with the word obtained by concatenating tokens in the first element.

    
    Returns
    -------
    List[str]
        The updated list of tokens for the profession that is obtained by combining words according to ``possible_combinations``.
    """

    # sort the possible_combinations in descending order of their similarity
    sorted_combinations = sorted(
        possible_combinations, key=lambda x: x[-1], reverse=True
    )

    # deepcopy of the current tokens
    updated_tokens = copy.deepcopy(current_tokens)

    changed_indv_tokens = set()
    # The tokens replaced for this tokens list are stored in the ``changed_indv_tokens`` set, to avoid replacing the same token multiple times.

    for combination in sorted_combinations:
        # for each possible combination

        start_ind, end_ind = None, None
        sub_tokens = list(combination[0])

        if not any(check_presence(item, changed_indv_tokens) for item in sub_tokens):
            # If the tokens to combine are not already combined in another possible combination for the same token list,

            sub_tokens_len = len(sub_tokens)

            for itr in range(len(updated_tokens)):

                if updated_tokens[itr : itr + sub_tokens_len] == sub_tokens:
                    # find the indices where the tokens to be combined start and end.
                    start_ind = itr
                    end_ind = itr + sub_tokens_len - 1

                    # Replace the tokens with the word
                    updated_tokens = (
                        updated_tokens[:start_ind]
                        + [combination[-2]]
                        + updated_tokens[end_ind + 1 :]
                    )

                    # add the replaced tokens to the ``changed_indv_tokens``set.
                    changed_indv_tokens.update(sub_tokens)
                    break
    # After applying all possible combinations, return the new list of tokens.
    return updated_tokens


def get_possible_missplit_tokens(
    current_tokens: List[str],
    token_counter_full_data: Counter_type[str],
    correctly_spelled_words: Set[str],
    min_tok_frequency: int,
) -> List[List[str]]:
    """The function accepts a list of tokens along with their counter, their presence in the list of correctly spelled words
        and a threshold frequency to determine if a token is less frequent and
        returns a list of sublists of consecutive tokens that can be combined. A pair of consecutive tokens is kept unless
        both tokens are in the list of correctly spelled words and both have a frequency of at least the threshold.

    Parameters
    ----------
    current_tokens : List[str]
        The list of tokens for a profession string obtained by splitting the string.
    token_counter_full_data : Counter_type[str]
        A counter object that holds the number of times a token has appeared in the dataset.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to list of correctly spelled words.
    min_tok_frequency : int
        The threshold frequency for a token to be considered as less frequent.

    
    Returns
    -------
    List[List[str]]
        The list of sublists of tokens that can be combined into one.
    """

    # initialise list to store potentially broken words
    sub_set_tokens = []

    for itr in range(len(current_tokens) - 1):

        # for each token compare it with its next token

        curr_tok = current_tokens[itr]
        next_tok = current_tokens[itr + 1]

        if check_presence(curr_tok, correctly_spelled_words) and check_presence(
            next_tok, correctly_spelled_words
        ):
            if (token_counter_full_data[curr_tok] >= min_tok_frequency) and (
                token_counter_full_data[next_tok] >= min_tok_frequency
            ):
                # ignore the pair of tokens if both of them are in the list of correctly spelled words and both of them have frequency greater than the threshold for minimum frequency
                continue
        # else append the pair of tokens to potentially broken words
        sub_set_tokens.append([curr_tok, next_tok])

    # The above loop only searches for words broken into two parts; however, words can be broken into more than two parts.
    # The next step combines existing lists to create new lists of possibly broken words. Two lists of potentially broken words
    # are combined if the tail of one list is the same as the head of the next list.
    # The loop is repeated as long as more than one new combination is found.
    search_for_additional_combs = True

    while search_for_additional_combs:
        initial_combinations = copy.deepcopy(sub_set_tokens)
        # copy the current list of broken words

        new_combs = []
        # list to store the new combinations of broken words
        for itr in range(len(initial_combinations) - 1):
            if sub_set_tokens[itr][1:] == sub_set_tokens[itr + 1][:-1]:
                # if the last part of the current list is same as the first part of next list
                new_comb = sub_set_tokens[itr] + sub_set_tokens[itr + 1][-1:]
                # combine them to create a new combination
                if not check_presence(new_comb, sub_set_tokens):
                    # if the combination is not already added, append it to the list of new combination of broken words
                    new_combs.append(new_comb)

        if new_combs:
            # update the list of potentially broken words
            sub_set_tokens += new_combs
            if len(new_combs) == 1:
                search_for_additional_combs = False
        else:
            # stop the search for potentially broken words if there are no more lists to combine.
            search_for_additional_combs = False

    return sub_set_tokens


def tokens_after_combining_low_freq_consecutive_tokens(
    metier_token_mapping: Dict[str, List[str]],
    unique_tokens: Set[str],
    token_counter_full_data: Counter_type[str],
    correctly_spelled_words: Set[str],
    min_tok_frequency: int,
    minimum_threshold_for_replacement_word: float,
) -> Dict[str, List[str]]:
    """The function accepts a dictionary with key as the profession and value as the list of corresponding tokens,
        along with the list of all unique tokens and their counter, their presence in the list of correctly spelled words,
        a threshold frequency to determine if a token is less frequent and
        the minimum similarity for the word to replace the less frequent & incorrectly spelled tokens and
        returns the dictionary with key as the profession and value as the list of corresponding updated tokens.

    Parameters
    ----------
    metier_token_mapping : Dict[str, List[str]]
        A python dictionary with the key as the profession string obtained from the dataset and the value is a list of tokens obtained by splitting the string. 
    unique_tokens : Set[str]
        A set of unique tokens obtained from all the profession strings in the dataset.
    token_counter_full_data : Counter_type[str]
        A counter object that holds the number of times a token has appeared in the dataset
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    min_tok_frequency : int
        The threshold frequency for a token to be considered as less frequent.
    minimum_threshold_for_replacement_word : float
        The minimum similarity score to replace the tokens in the ``tokens_to_combine`` with the word found in the ``unique_tokens``.

    
    Returns
    -------
    Dict[str, List[str]]
        A python dictionary with the key as the profession string obtained from the dataset and
        the value is a list of tokens updated after combining the possibly wrongly split words. 
    """

    earlier_combinations = {}
    # a dictionary to store the previously combined tokens. it will help in reducing the run time.

    updated_metier_token_mapping = {}
    # a dictionary to store the new profession string and tokens list mapping.

    updated_tokenlists = {}
    # a dictionary to store the tokens for a profession strings as keys and the updated list of tokens for the token set in  the key. This will help reducing the run time for two profession strings that will have same tokens after splitting the string.

    for unique_met_string, tokens in tqdm_notebook(metier_token_mapping.items()):
        # for each unique profession string and its tokens

        updated_tokens = tokens
        # set the updated tokens to be the current list of tokens (returned unchanged if no tokens are combined.)

        set_tokens = frozenset(tokens)

        if len(tokens) > 1:
            # if there is more than one token, otherwise there is no interest of combining the tokens at all

            found_updated_tokens = False
            if check_presence(set_tokens, updated_tokenlists):
                # check if the same set of tokens are already updated for a different profession string.
                for toks, updtoks in updated_tokenlists[set_tokens]:
                    if toks == tokens:
                        # not only the set of strings should match but also the order of the tokens in both the lists should be same.
                        updated_tokens = updtoks
                        found_updated_tokens = True
                        break

            if not found_updated_tokens:
                # if the list of  tokens are not previously merged, get the list of potentially broken words
                sub_set_tokens = get_possible_missplit_tokens(
                    tokens,
                    token_counter_full_data,
                    correctly_spelled_words,
                    min_tok_frequency,
                )
                possible_combinations = (
                    []
                )  # a list to store the possible combinations of tokens
                for combinable_tokens in sub_set_tokens:
                    # for each list potentially broken words, get a replacement word

                    replacement = get_possible_combined_word(
                        combinable_tokens,
                        unique_tokens,
                        earlier_combinations,
                        token_counter_full_data,
                        minimum_threshold_for_replacement_word,
                    )

                    if replacement:
                        # if there is a replacement word, then append it to the list of possible combinations
                        possible_combinations.append(
                            (tuple(combinable_tokens), replacement[0], replacement[1])
                        )
                    else:
                        # even if there is no replacement word, append None to the possible combinations as this combination will not be searched in other profession strings.
                        possible_combinations.append(
                            (tuple(combinable_tokens), None, None)
                        )

                filtered_possible_combinations = [
                    pos_comb
                    for pos_comb in possible_combinations
                    if None not in pos_comb
                ]
                # filter the combinations that are not void

                if filtered_possible_combinations:
                    # if there are any possible combinations, update the current tokens list.
                    updated_tokens = get_updated_tokens_after_combining(
                        tokens, filtered_possible_combinations
                    )

                # using the list of possible combinations for this profession string, update the previous_combinations dictionary to remember the merges
                for combination in possible_combinations:
                    # print("combination: ", combination, type(combination))
                    set_combined_toks = frozenset(combination[0])
                    if not check_presence(set_combined_toks, earlier_combinations):
                        earlier_combinations[set_combined_toks] = set()
                    if not check_presence(
                        combination, earlier_combinations[set_combined_toks]
                    ):
                        earlier_combinations[set_combined_toks].update([combination])

        # update the mapping of profession string to tokens updated after combining tokens (if combined, otherwise the original set of tokens are assigned)
        updated_metier_token_mapping[unique_met_string] = updated_tokens

        # update the dictionary that maintains the updated list of tokens based on the token lists.
        if not check_presence(set_tokens, updated_tokenlists):
            updated_tokenlists[set_tokens] = []
        updated_tokenlists[set_tokens].append([tokens, updated_tokens])

    return updated_metier_token_mapping

### Reading the Data after cleaning for special characters

After performing some cleaning on the special characters, the data set is stored at `all_paris_jobs_splchar_cleaned.csv`. 

The CSV file contains 10 columns, described below:
1. `doc_id`: The unique id of the document on Gallica.
2. `page`: The page in which the entry is present in the document.
3. `row`: The row of the entry on the page.
4. `Nom`: The name of the person.
5. `métier_original`: The profession before cleaning the special characters.
6. `rue`: The name of the street in the address of the person.
7. `numéro`: The number in the street in the address of the person.
8. `annee`: The year in which the entry is published.
9. `gallica_page`: The page number adjusted from the `page` column that can be used on the Gallica.
10. `métier`: The profession string after cleaning special characters.

In the next cells, the data is read as a data frame, and the rows whose `métier` has fewer than 3 characters are removed.

In [None]:
paris_jobs_spl_clean = pd.read_csv("./../data/strict_addressing.csv", names=["doc_id", "page", "row", "Nom", "métier_original", "rue", "numéro" , "annee"],
                             dtype={"doc_id":'str', "page":'int', "row":'str', "Nom":'str', "métier_original":'str', "rue":'str', "numéro":'str', "annee":'str'},
                             header=0, encoding="utf-8")
paris_jobs_spl_clean["métier"] = paris_jobs_spl_clean["métier_original"]

In [None]:
'''paris_jobs_spl_clean = pd.read_csv(
    "./../data/intermediate_steps/all_paris_jobs_splchar_cleaned.csv",
    dtype={
        "doc_id": "str",
        "page": "str",
        "row": "str",
        "Nom": "str",
        "métier_original": "str",
        "rue": "str",
        "numéro": "str",
        "annee": "str",
        "gallica_page": "str",
        "métier": "str",
    },
    header=0,
    encoding="utf-8",
)
'''

paris_jobs_spl_clean.dropna(subset=["métier"], inplace=True)

# Remove the rows whose métier column has fewer than MINIMUM_TOKEN_LENGTH characters.
paris_jobs_spl_clean = paris_jobs_spl_clean[
    ~(paris_jobs_spl_clean["métier"].str.len() < MINIMUM_TOKEN_LENGTH)
]

After reading the data, the following steps are performed:

1. Create a python dictionary (`unique_metier_split_mapping`) whose keys are the unique professions in the dataset and whose values are lists of tokens obtained by splitting the profession string. No minimum length is required and stop words are not filtered, because the aim here is to merge the words that were broken into multiple words during the OCR process.
2. Create a set of unique tokens (`unique_tokens_before_step_1`) from all the professions and a counter (`tokens_counter_before_step_1`) to hold the frequency of their appearance.
3. Create a set of tokens that are in the set of correct words (`tokens_in_correct_words_before_step_1`) from the set of unique tokens.
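
A small sketch of these three steps on made-up data follows. The `split_tokens` helper is a hypothetical stand-in for the notebook's `split_metier_strings_to_tokens` (it only splits at spaces, hyphens, apostrophes, and dots), and the word list is invented for the example.

```python
import re
from collections import Counter

# Hypothetical profession strings, one per directory entry (duplicates allowed).
metiers = ["fab. de meubles", "fab. de meubles", "md-tailleur", "tailleur"]


def split_tokens(metier):
    """Split at spaces, hyphens, apostrophes and dots; drop empty pieces."""
    return [tok for tok in re.split(r"[\s\-'.]+", metier) if tok]


# 1. Mapping from each unique profession string to its tokens.
unique_mapping = {metier: split_tokens(metier) for metier in set(metiers)}

# 2. Token frequencies are counted over every entry, not over unique strings.
all_tokens = [tok for metier in metiers for tok in unique_mapping[metier]]
tokens_counter = Counter(all_tokens)
unique_tokens = set(tokens_counter)

# 3. Tokens that also appear in the (hypothetical) list of correct words.
correct_words = {"de", "meubles", "tailleur", "fabricant"}
tokens_in_correct_words = unique_tokens & correct_words
print(tokens_counter, tokens_in_correct_words)
```

Because the counter is built from all rows, a token that appears in a frequent profession string gets a high count even if that string is a single unique key in the mapping.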

In [None]:
# dictionary to store the profession string and corresponding tokens
unique_metier_split_mapping = {
    unique_met_string: split_metier_strings_to_tokens(
        unique_met_string,
        minimum_length=None,
        correctly_spelled_words=set(),
        stop_words=set(),
    )
    for unique_met_string in paris_jobs_spl_clean["métier"].unique()
}

# get the list of all tokens
individual_tokens_full_data = [
    tok
    for met_string in paris_jobs_spl_clean["métier"]
    for tok in unique_metier_split_mapping[met_string]
]

# create a Counter object on the list of all tokens
tokens_counter_before_step_1 = Counter(individual_tokens_full_data)

# get the keys of the counter object as they will be unique entries and store them as a set.
unique_tokens_before_step_1 = set(tokens_counter_before_step_1.keys())

# create a set of tokens in list of correct words
tokens_in_correct_words_before_step_1 = unique_tokens_before_step_1.intersection(
    words_with_correct_spellings
)

### Combining the mistakenly split tokens

The `tokens_after_combining_low_freq_consecutive_tokens` function is called to combine the wrongly broken tokens in a profession string, and the updated tokens for each profession are stored in `unique_metier_split_mapping_consq_merged`.

In [None]:
unique_metier_split_mapping_consq_merged = tokens_after_combining_low_freq_consecutive_tokens(
    metier_token_mapping=unique_metier_split_mapping,
    unique_tokens=unique_tokens_before_step_1,
    token_counter_full_data=tokens_counter_before_step_1,
    correctly_spelled_words=tokens_in_correct_words_before_step_1,
    min_tok_frequency=MINIMUM_TOKEN_FREQUENCY,
    minimum_threshold_for_replacement_word=MIN_THRESHOLD_REPLACE_BROKEN_WORD,
)

#### Saving the tokens per metier after merging low frequency consecutive tokens

In [None]:
with open(
    "./../data/intermediate_steps/unique_metier_tokens_mapping_after_consq_merged.json",
    "w",
    encoding="utf8",
) as outfile:
    json.dump(
        unique_metier_split_mapping_consq_merged, outfile, indent=4, ensure_ascii=False
    )

#### Reading the tokens per metier after merging low frequency consecutive tokens

In [None]:
with open(
    "./../data/intermediate_steps/unique_metier_tokens_mapping_after_consq_merged.json",
    encoding="utf8",
) as f:
    unique_metier_split_mapping_consq_merged = json.load(f)

### Updating the dataset (CSV file) with the new tokens

The updated tokens are joined with a space to form a string again (`metier_after_setp_1`) and the data frame is updated with the new profession string under the column `metier_after_S1`.

In [None]:
metier_after_setp_1 = {
    met_str: " ".join(updated_tokens)
    for met_str, updated_tokens in unique_metier_split_mapping_consq_merged.items()
}
paris_jobs_spl_clean["metier_after_S1"] = paris_jobs_spl_clean["métier"].map(
    metier_after_setp_1
)

#### Saving the data frame with updated metier after step 1

In [None]:
paris_jobs_spl_clean.to_csv(
    "./../data/intermediate_steps/all_paris_jobs_after_step_1.csv", index=False
)

## Split the metier into tokens (Step 2)

The crux of the second step is to create a set of tokens for each profession string, excluding the stop words. While creating these tokens, some cleaning of the tokens is also performed.
- In the first step, the profession strings were split at spaces, hyphens, apostrophes, and dots, and later joined with a space after combining mistakenly split tokens. Hence, although the same pipeline (the same set of functions) is used in this step, the profession strings obtained after step 1 are in practice split at spaces. In this split, however, words that have fewer than 3 alphanumeric characters and words in the stop words list are not included.
- Then the words that potentially indicate a number are removed.
- Next, the words that are not in the list of correct words are converted into their normal form and replaced with the most frequent non-normalized form.
- Lastly, tokens containing a dot in the middle are split into separate tokens when part of the word represents an abbreviation. This step results in a set of tokens for each profession.

### Reading the data frame with updated profession token sets (as strings) after step 1

In [None]:
paris_jobs_after_s1 = pd.read_csv(
    "./../data/intermediate_steps/all_paris_jobs_after_step_1.csv",
    dtype={
        "doc_id": "str",
        "page": "str",
        "row": "str",
        "Nom": "str",
        "métier_original": "str",
        "rue": "str",
        "numéro": "str",
        "annee": "str",
        #"gallica_page": "str",
        "métier": "str",
        "metier_after_S1": "str",
    },
    header=0,
    encoding="utf-8",
)

### Removing the words shorthanded for numbers

Three-letter words ending in `re` or `er` were used as shorthand for ordinal numbers such as `première`, `troisième`, or `quatrième`. These will be removed. The three-letter words ending in `re` or `er` that are present in the data are `lre`,`fer`,`ire`,`ler`,`įre`,`tre`,`ser`,`pre`,`fre`,`!re`,`ère`,`are`,`ier`,`bre`,`jre`,`jer`,`der`,`mer`,`her`,`(re`,`nre`,`gre`,`ere`,`īre`,`cer`,`ter`,`per`,`ger`,`ìre`,`ver`,`yre`,`qre`,`cre`,`vre`,`mre`,`ner`,`íre`,`dre`,`rer`,`ber`,`ure`,`îre`. 

However, some of these three-letter words are present in the list of correct words or do not indicate a number. The rows containing these words were studied and only the following will be replaced

1. All the three-letter words starting with `l`, `q`, or any digit and ending with `re` or `er`.
2. All the three-letter words starting with `i`, `f`, `m`, `(`, `ī`, `!`, `ì`, `į`, `j`, `î`, or `í` and ending with `re`.
3. All the three-letter words starting with `g`, `a`, `p`, `b`, or `y` and ending with `re`, when followed by a space and `i`.
4. All `ter` words.
5. All `jer` words followed by a space and `i` or `a`.
6. All `tre` words followed by a space and `ins`, `cl`, or `iss`.
7. All `fer` words followed by a space and `inst` or `ar`.

The idea of checking whether the word is followed by certain characters comes from the fact that, most of the time, the numbers are used in the context of mentioning the arrondissement or the instances. Hence the entries are checked for a following `i`, `a`, `ins`, `iss`, `inst`, `ar`, etc.
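
As an illustration of how such a rule behaves, the snippet below applies the pattern for rule 1 to a few made-up strings using the standard `re` module. The sample strings are assumptions chosen for the example, not rows from the dataset.

```python
import re

# Rule 1: a digit, `l` or `q` followed by `re` or `er`,
# with whitespace required on both sides of the word.
rule_1 = re.compile(r"(?<=\s)(\d|l|q)(re|er)(?=\s)")

samples = [
    "juge lre instance",          # OCR reading of "1re instance"
    "avocat 1er arrondissement",
    "marchand de fer",            # "fer" is a real word and is not covered by rule 1
    "lre instance",               # no whitespace before "lre", so the lookbehind fails
]
cleaned = [rule_1.sub("", sample) for sample in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two details are worth noting in this sketch: the surrounding spaces are kept, so a removed word leaves a double space behind, and a word at the very start or end of the string is not matched because the lookbehind and lookahead both require a whitespace character.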

In [None]:
paris_jobs_after_s1["metier_after_S1"] = paris_jobs_after_s1["metier_after_S1"].str.replace(r"(?<=\s)(\d|l|q)(re|er)(?=\s)", "", regex=True)

paris_jobs_after_s1["metier_after_S1"] = paris_jobs_after_s1["metier_after_S1"].str.replace(r"(?<=\s)(i|f|m|\(|ī|!|ì|į|j|î|í)re(?=\s)", "", regex=True)

paris_jobs_after_s1["metier_after_S1"] = paris_jobs_after_s1["metier_after_S1"].str.replace(r"(?<=\s)(g|a|p|b|y)re(?=\si)", "", regex=True)

paris_jobs_after_s1["metier_after_S1"] = paris_jobs_after_s1["metier_after_S1"].str.replace(r"(?<=\s)ter(?=\s)", "", regex=True)

paris_jobs_after_s1["metier_after_S1"] = paris_jobs_after_s1["metier_after_S1"].str.replace(r"(?<=\s)jer(?=\s(i|a))", "", regex=True)

paris_jobs_after_s1["metier_after_S1"] = paris_jobs_after_s1["metier_after_S1"].str.replace(r"(?<=\s)tre(?=\s(ins|iss|cl))", "", regex=True)

paris_jobs_after_s1["metier_after_S1"] = paris_jobs_after_s1["metier_after_S1"].str.replace(r"(?<=\s)fer(?=\s(inst|ar))", "", regex=True)

### Split the profession string and filter stop words

The métier string is split into a list of tokens, dropping stop words and words with fewer than 3 alphanumeric characters.

The next cell creates a Python dictionary (`nonnormal_metier_token_mapping`) whose keys are the unique profession strings in the dataset (after combining wrongly broken words) and whose values are the lists of tokens obtained by splitting each profession string with a minimum token length of 3, filtering the stop words, and splitting the words at dots and apostrophes.

In [None]:
nonnormal_metier_token_mapping = {
    met_str: split_metier_strings_to_tokens(
        string_to_split=met_str,
        minimum_length=MINIMUM_TOKEN_LENGTH,
        correctly_spelled_words=words_with_correct_spellings,
        stop_words=stop_words_french,
    )
    for met_str in paris_jobs_after_s1["metier_after_S1"].unique()
}

#### Saving the splitting of profession strings to tokens (without cleaning)

In [None]:
with open(
    "./../data/intermediate_steps/nonnormal_profession_token_mapping.json",
    "w",
    encoding="utf8",
) as outfile:
    json.dump(nonnormal_metier_token_mapping, outfile, indent=4, ensure_ascii=False)

#### Reading the splitting of profession strings to tokens (without cleaning)

In [None]:
with open(
    "./../data/intermediate_steps/nonnormal_profession_token_mapping.json",
    encoding="utf8",
) as f:
    nonnormal_metier_token_mapping = json.load(f)

### Normalizing (and Denormalizing) tokens

Some tokens in the dataset contain non-French characters, and words with a ligature are sometimes written with the ligature and sometimes with separate characters. For example, eggs appears as `œufs` and `oeufs`, sister as `sœur` and `soeur`, and the word vins has the variants vinš, vïns, vịns, viñs, vîns, vỉns, víns, vinş, viņs, vìns, vińs, vīns. To correct these tokens, each token of the dataset that is not in the set of correctly spelled words is converted to its normal form using the `unidecode` function. If the normal form differs from the original (non-normal) form, the token is temporarily mapped to its normal form. Then all the tokens sharing that normal form are replaced with the form that appears most often in the dataset (the normal form itself is also a candidate).
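
The idea can be sketched on a toy example. The snippet below is only an illustration: it uses the standard library's `unicodedata` as a stand-in for `unidecode` (it strips diacritics but, unlike `unidecode`, does not expand ligatures such as `œ`), and the token counts and the `known_words` set are invented.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(token: str) -> str:
    """Stand-in for unidecode: drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical token frequencies and dictionary, invented for illustration.
toy_counts = Counter({"vins": 30, "vîns": 5, "víns": 2, "vīns": 1})
known_words = {"vins"}

# Group the tokens that are not known words by their normal form.
variants_by_normal_form = defaultdict(list)
for token in toy_counts:
    if token in known_words:
        continue
    normal_form = strip_diacritics(token)
    if normal_form != token:
        variants_by_normal_form[normal_form].append(token)

# Within each group (normal form included), keep the most frequent form.
replacement = {}
for normal_form, variants in variants_by_normal_form.items():
    candidates = [normal_form] + variants
    most_frequent = max(candidates, key=lambda tok: toy_counts[tok])
    for candidate in candidates:
        replacement[candidate] = most_frequent

print(replacement)
```

With these invented counts every variant is mapped to `vins`, because the normal form is the most frequent member of its group.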

#### Utility function to normalize (and denormalize) tokens.

The functions are

1. `update_tokens_to_most_frequent_form`: The function accepts the dictionary of profession string to non-normal tokens list, along with a dictionary whose keys are the tokens to be modified and whose values are the strings that replace them, and returns the updated mapping of profession string to list of tokens (after selecting the most frequent form among the tokens having the same normal form).
2. `get_normal_nonnormal_map`: The function accepts the list of unique tokens obtained from the profession strings and returns a dictionary that maps each normalized form to the list of tokens (not in the correct words set) that share it.
3. `get_denormalised_mapping`: The function accepts the mapping of normal form to list of non-normal forms, along with the counter of the tokens in the dataset, and selects the token with the highest frequency among those having the same normal form to replace them.

In [None]:
def get_normal_nonnormal_map(
    unique_tokens: List[str], correctly_spelled_words: Set[str]
) -> Dict[str, List[str]]:
    """The function accepts the list of unique tokens obtained from the profession strings, a set of correctly spelled words,
        and returns a dictionary with the normalized forms of the tokens not in the correct words set.
        The dictionary contains the normalized form of the token as a key and the list of non-normalized forms of the tokens as value.

    Parameters
    ----------
    unique_tokens : List[str]
        The list of unique tokens of the dataset.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    
    
    Returns
    -------
    normal_to_original_list_mapping : Dict[str, List[str]]
        A python dictionary with the normalized form of the token as key and the list of non-normalized forms of the tokens as value.
    """

    # empty dictionary to hold the return values
    normal_to_original_list_mapping = {}

    for tok in unique_tokens:
        # for each unique token
        if not check_presence(tok, correctly_spelled_words):
            # if the token is not in the list of correctly spelled words, get its normalized form
            decoded = unidecode(tok)
            if decoded != tok:
                if not check_presence(decoded, normal_to_original_list_mapping):
                    normal_to_original_list_mapping[decoded] = []
                # save the inverse mapping of normalized form to non-normalized form in another dictionary.
                normal_to_original_list_mapping[decoded].append(tok)
    return normal_to_original_list_mapping


def get_denormalised_mapping(
    normal_to_original_list_mapping: Dict[str, List[str]],
    tokens_counter_before_normalization: Counter_type[str],
) -> Dict[str, str]:
    """The function accepts the mapping of normal to a list of the non-normal forms along with the counter for the tokens
        from the dataset and selects the token with the highest frequency among those having the same normal form to replace them.

    Parameters
    ----------
    normal_to_original_list_mapping : Dict[str, List[str]]
         A python dictionary with the normalized form of the token as key and the list of non-normalized forms of the tokens as value.
    tokens_counter_before_normalization : Counter_type[str]
        The counter of the unique tokens in the dataset before normalizing the tokens.

    
    Returns
    -------
    Dict[str, str]
         A python dictionary with the non-normalized form of the token as key and the token to replace it as value.
    """

    # initialize a dictionary to store, for each token, the form selected to replace it
    denormalization_mapping = {}

    for normal_token, nonnormal_tokens in normal_to_original_list_mapping.items():
        # for each normal form and all the tokens that have the considered normal form

        all_tokens_ = [normal_token] + nonnormal_tokens

        # create a dictionary with key as the token and value as its frequency in the dataset.
        tags_freq_counts = {
            token_: tokens_counter_before_normalization[token_]
            for token_ in all_tokens_
        }

        # select the tag that has the maximum frequency
        selected_tag = max(tags_freq_counts, key=lambda k: tags_freq_counts[k])

        # update the mapping to store the selected tag for all the tokens in the current loop.
        for token_ in all_tokens_:
            denormalization_mapping[token_] = selected_tag

    # return the mapping from each token to the form selected to replace it
    return denormalization_mapping


def update_tokens_to_most_frequent_form(
    profession_token_mapping_non_normalised: Dict[str, List[str]],
    denormalization_mapping: Dict[str, str],
) -> Dict[str, List[str]]:
    """The function accepts the dictionary of profession string to non-normal tokens list,
        along with a dictionary whose keys are the tokens to be modified and whose values are the strings that replace them,
        and returns the updated mapping of profession string to list of tokens
        (after selecting the most frequent form among the tokens having the same normal form).

    Parameters
    ----------
    profession_token_mapping_non_normalised : Dict[str, List[str]]
        The dictionary with key as a unique profession in the dataset and the values are the list of non-normalized tokens
        (that were obtained by splitting the profession string).
    denormalization_mapping : Dict[str, str]
        A python dictionary with the non-normalized form of the token as key and the token to replace it as value.

    
    Returns
    -------
    profession_token_mapping_de_normalised : Dict[str, List[str]]
        The dictionary with key as a unique profession in the dataset and the values are the list of denormalized tokens
        (after selecting the most frequent token with the same normal form).
    """

    # replace each token with its selected form from the mapping, or keep the token if it has no entry in the mapping.
    profession_token_mapping_de_normalised = {
        profession_string: [
            denormalization_mapping.get(curr_token, curr_token)
            for curr_token in profs_tokens
        ]
        for profession_string, profs_tokens in profession_token_mapping_non_normalised.items()
    }

    return profession_token_mapping_de_normalised

In [None]:
# get the list of non-normalized tokens from the dataset
all_tokens_before_normalzing = [
    metier_token
    for met_str in paris_jobs_after_s1["metier_after_S1"].to_list()
    for metier_token in nonnormal_metier_token_mapping[met_str]
]

# create a counter for the tokens before normalising
counter_tokens_before_normalizing = Counter(all_tokens_before_normalzing)

# get the list of unique non-normalized tokens from the dataset
unique_nonnormal_tokens = set(counter_tokens_before_normalizing.keys())

# separate the non-normalized tokens that have correct spellings
nonnormal_tokens_with_correct_spellings = unique_nonnormal_tokens.intersection(
    words_with_correct_spellings
)

# get the mapping of each normalized form to its list of non-normalized forms
normal_to_nonnormal_list_mapping = get_normal_nonnormal_map(
    unique_nonnormal_tokens, nonnormal_tokens_with_correct_spellings
)

# get the mapping that replaces tokens sharing a normal form with the most frequent form among them
de_normalised_maping = get_denormalised_mapping(
    normal_to_nonnormal_list_mapping, counter_tokens_before_normalizing
)

# update the tokens in the profession to tokens mapping with their selected forms
denormalised_metier_token_mapping = update_tokens_to_most_frequent_form(
    nonnormal_metier_token_mapping, de_normalised_maping
)

### Splitting words with dots between words

In the text, a profession is sometimes described by two abbreviations that were not separated by a space during OCR, for example `imprimeur.lithogr.` or `négoc.commiss.`. In this step, which follows the normalization of the tokens, such tokens joined by a dot are split into multiple tokens. The idea is as follows:


1. For each token that is not in the list of correctly spelled words and contains a dot (tokens whose only dot is at the end are ignored, as they are potential abbreviations):
    1. The token is split at every dot.
    2. If more than half of the resulting parts (at least `floor(n / 2) + 1` of `n` parts) have more than one character (otherwise the token is probably a misspelling in which dots replaced letters, and it is ignored), then for each sub-token (each part of the split):
        1. The sub-token is split at the apostrophe if it contains one (this provides a chance to split tokens with multiple apostrophes that were not split earlier).
        2. If the sub-token is not a stop word, a dot is appended to it when the form with a trailing dot is more frequent than the form without the dot.
    3. If the list of updated sub-tokens differs from the plain result of splitting at dots, the token is replaced by the new sub-tokens (after dropping those that fail the validity check on minimum length and stop words).
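
The majority condition in step 1.2 can be illustrated in isolation. The helper below is a hypothetical sketch written for this explanation (its name is not part of the notebook); it uses `math.floor` where the notebook's function uses `np.floor`, which gives the same result for these values.

```python
import math


def passes_majority_check(token: str) -> bool:
    """Return True when more than half of the dot-separated parts are longer than one character."""
    # split at every dot and drop the empty strings produced by trailing or repeated dots
    parts = [part for part in token.split(".") if part]
    # keep only the parts that have more than one character
    long_parts = [part for part in parts if len(part) != 1]
    return len(long_parts) >= math.floor(len(parts) / 2 + 1)


for example in ["imprimeur.lithogr.", "négoc.commiss.", "a.b.cd", "m.d"]:
    print(example, passes_majority_check(example))
```

The first two tokens pass because both parts are long. `a.b.cd` and `m.d` fail because single characters dominate, which suggests a dot standing in for letters rather than two joined abbreviations.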
    
    
#### Utility function to split words with dots between words

The functions are

1. `break_tokens_with_dots_around_words`: The function accepts the list of unique tokens present in the dataset along with their counter, the minimum length of a token, a set of correctly spelled words, and a set of stop words. For the tokens that are not in the list of correctly spelled words and have a dot in them, it returns a mapping from each token to the list of tokens it should be split into.
2. `update_token_mappings_after_multiple_dot_split`: The function accepts the token sets before splitting the tokens with a dot in between and the mapping of tokens with a dot in between and their split and returns the updated token sets after replacing the tokens in the mapping. 

In [None]:
def break_tokens_with_dots_around_words(
    denormalized_unique_tokens: List[str],
    denormalized_tokens_counter: Counter_type[str],
    minimum_length: int,
    correctly_spelled_words: Set[str],
    stop_words: Set[str],
) -> Dict[str, List[str]]:
    """The function accepts the list of unique tokens present in the dataset along with their counter,
        the minimum length of the token, a set of correctly spelled words, and a list of stop words
        and returns a mapping to further split for the tokens that are not in the list of correctly spelled words
        and have a dot in them.

    Parameters
    ----------
    denormalized_unique_tokens : List[str]
        A list of unique tokens from the profession strings after normalizing them.
    denormalized_tokens_counter : Counter_type[str]
        The counter of the ``denormalized_unique_tokens`` in the dataset.
    minimum_length : int
        The minimum length of the string without non-alphanumeric characters.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    stop_words : Set[str]
        The set of stop words in French.

    
    Returns
    -------
    resplitted_tokens : Dict[str, List[str]]
        A dictionary with a unique token as key and, as value, the list of strings that the token should be changed to.
    """

    # Initialise a dictionary to store the tokens that are updated after this split.
    resplitted_tokens = {}

    for normal_token in denormalized_unique_tokens:
        # Don't change the tokens that are in the dictionary or whose only dot is at the end (probable abbreviation)
        if (
            not check_presence(normal_token, correctly_spelled_words)
            and ("." in normal_token)
            and not (normal_token.count(".") == 1 and normal_token[-1] == ".")
        ):
            # split the token at dot (all occurrences)
            split_on_dot = list(filter(None, normal_token.split(".")))
            # get the length of the strings after split.
            lens_of_splits = [len(split) for split in split_on_dot]
            # get the lengths of splits after removing tokens with only one character.
            lens_of_splits_without_singles = [
                split_len for split_len in lens_of_splits if split_len != 1
            ]

            # if the majority of the splits have more than one character, update the tokens based on the frequency
            if len(lens_of_splits_without_singles) >= np.floor(
                (len(lens_of_splits) / 2) + 1
            ):

                # list to store the result of splitting of the token
                sep_toks = []

                for split_word in split_on_dot:
                    # for each sub token obtained as the result of splitting at dot
                    split_word_toks = [split_word]
                    if "'" in split_word:
                        # split the sub token at apostrophe if there is an apostrophe
                        split_word_toks = apos_split(
                            split_word,
                            minimum_length,
                            correctly_spelled_words,
                            stop_words,
                        )
                    if split_word_toks:
                        # if the token still remains after splitting at apostrophe (if there is no apostrophe then this condition is redundant)
                        for split_word_tok in split_word_toks:
                            # for each sub sub token that is the result of splitting at apostrophe
                            if not check_presence(split_word_tok, stop_words):
                                # if the sub sub token is not in stop words
                                if (
                                    denormalized_tokens_counter[split_word_tok + "."]
                                    > denormalized_tokens_counter[split_word_tok]
                                ):
                                    # if the sub sub token with a dot at the end is more frequent than the one without a dot then append the token with the dot
                                    sep_toks.append(split_word_tok + ".")
                                    continue
                            # if either the sub sub token is a stopword or the frequency with a dot at the end is not more than the frequency without a dot, then append the sub sub token as it is (also the case when there is no sub sub token with a dot the end)
                            sep_toks.append(split_word_tok)

                if sep_toks != split_on_dot:
                    # if the result of the token split at dot and updating of those tokens based on the frequency is different (indicating that the tokens are updated)
                    sep_toks = [
                        septok
                        for septok in sep_toks
                        if is_valid_token(septok, minimum_length, stop_words)
                    ]
                    # the mapping from the original token to the splitted tokens is stored in the dictionary
                    resplitted_tokens[normal_token] = sep_toks

    return resplitted_tokens


def update_token_mappings_after_multiple_dot_split(
    profession_token_mapping_before_resplit: Dict[str, List[str]],
    dict_of_splitted_tokens: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """The function accepts the token sets before splitting the tokens with a dot in between and
        the mapping of tokens with a dot in between and their split and
        returns the updated token sets after replacing the tokens in the mapping. 

    Parameters
    ----------
    profession_token_mapping_before_resplit : Dict[str, List[str]]
        The dictionary with key as unique profession in the dataset and the values are the list of normalised tokens (that were obtained by splitting the profession string).
    dict_of_splitted_tokens : Dict[str, List[str]]
        A dictionary with a unique token as key and, as value, the list of strings that the token should be changed to.

    
    Returns
    -------
    new_profession_token_map : Dict[str, List[str]]
        The dictionary with key as unique profession in the dataset and the values are the list of normalised and resplitted for dot tokens.
    """

    # Initialise dictionary to store the new profession string to tokens mapping
    new_profession_token_map = {}

    for profs_string, profs_tokens in profession_token_mapping_before_resplit.items():
        # for each profession string and its tokens
        new_tokens = []
        # initialise a list to store the new list of tokens for the profession string
        for token in profs_tokens:
            # for each token of the profession string
            if token in dict_of_splitted_tokens:
                # extend with the updated list of tokens, if the token is in the dictionary of updated tokens
                new_tokens.extend(dict_of_splitted_tokens[token])
            else:
                # else append the token itself
                new_tokens.append(token)

        # add the profession string and its new tokens list to the mapping
        new_profession_token_map[profs_string] = new_tokens

    # return mapping
    return new_profession_token_map

In [None]:
# get the list of all tokens in the dataset after denormalising them
all_tokens_after_denormalzing = [
    metier_token
    for met_str in paris_jobs_after_s1["metier_after_S1"].to_list()
    for metier_token in denormalised_metier_token_mapping[met_str]
]

# create a counter for the tokens after denormalising
counter_tokens_after_denormalizing = Counter(all_tokens_after_denormalzing)

# get the list of unique denormalized tokens from the dataset
unique_tokens_after_denormalizing = set(counter_tokens_after_denormalizing.keys())

# separate the denormalized tokens that have correct spellings
correctly_spelled_tokens_after_denormalizing = unique_tokens_after_denormalizing.intersection(
    words_with_correct_spellings
)

# get the re split mapping of tokens with dot between the words
tokens_resplitted_mapping = break_tokens_with_dots_around_words(
    unique_tokens_after_denormalizing,
    counter_tokens_after_denormalizing,
    MINIMUM_TOKEN_LENGTH,
    correctly_spelled_tokens_after_denormalizing,
    stop_words_french,
)

resplitted_metier_token_mapping = update_token_mappings_after_multiple_dot_split(
    denormalised_metier_token_mapping, tokens_resplitted_mapping
)

### Collecting professions with the same tokens

Up to this stage, the order of the tokens is preserved for each unique profession string. At this stage, the cleaning (without spelling correction) is complete, and since the intention is to create single-word tags for each profession, the order of the tokens is ignored from here on. The tokens and the dataset are updated accordingly.
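
A minimal sketch of this grouping on invented profession strings is shown below (the strings and their tokens are assumptions made for illustration). One detail differs from the notebook's cell: the tokens are sorted before being joined. This is a suggestion rather than the notebook's behaviour, and it makes the serialized key deterministic, since the iteration order of a `frozenset` is not guaranteed.

```python
from collections import defaultdict

# Invented profession strings with their tokens, for illustration only.
toy_mapping = {
    "fab. de chapeaux": ["fab.", "chapeaux"],
    "chapeaux (fab.)": ["chapeaux", "fab."],
    "tailleur": ["tailleur"],
}

# Group the profession strings by their unordered token sets.
professions_by_token_set = defaultdict(list)
for profession, tokens in toy_mapping.items():
    professions_by_token_set[frozenset(tokens)].append(profession)

# Serialize each token set into a space-separated string so it can be a JSON key.
serialized = {
    " ".join(sorted(token_set)): professions
    for token_set, professions in professions_by_token_set.items()
}

print(serialized)
```

The two spellings of the hat maker end up under the same key because their token sets are equal once the order is ignored.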

#### Utility function to collect professions with same tokens

The function is 

1. `get_unordered_tokens_to_metiers_mapping`: This function accepts the mapping of the profession strings to their tokens and returns an inverse mapping with tokens sets as the keys and the values as the list of profession strings having those tokens (the ordering of the tokens is ignored.)

In [None]:
def get_unordered_tokens_to_metiers_mapping(
    metier_token_mapping: Dict[str, List[str]]
) -> Dict[FrozenSet[str], List[str]]:
    """This function accepts the mapping of the profession strings to their tokens and
        returns an inverse mapping with tokens sets as the keys and the values as the list of profession strings
        having those tokens (the ordering of the tokens is ignored.)

    Parameters
    ----------
    metier_token_mapping : Dict[str, List[str]]
        The dictionary with key as unique profession in the dataset and the values are the list of normalised and resplitted for dot tokens.

    Returns
    -------
    Dict[FrozenSet[str], List[str]]
        The dictionary with a token set as key and, as value, the list of unique professions in the dataset that have that token set.
    """

    # initialise a dictionary to store the tokens as key and the profession strings as the value
    unordered_same_token_metiers = {}

    for met_str, met_toks in metier_token_mapping.items():
        # for each profession string and its corresponding tokens
        frozen_toks = frozenset(met_toks)
        # the list of tokens is converted into a frozen set, as this discards the order of the tokens and a frozenset can be used as a dictionary key.
        if not check_presence(frozen_toks, unordered_same_token_metiers):
            # if the tokens set is not already present in the mapping
            unordered_same_token_metiers[frozen_toks] = []
            # create an empty list corresponding to the set of tokens to store the profession strings having those tokens
        # append the profession string to the list corresponding to its tokens set.
        unordered_same_token_metiers[frozen_toks].append(met_str)

    return unordered_same_token_metiers

In [None]:
# get the mapping of token sets to profession strings
same_token_sets_metiers = get_unordered_tokens_to_metiers_mapping(
    resplitted_metier_token_mapping
)

# The keys of the mapping of token sets to profession strings are converted from frozen sets to strings (separated by space), as frozen sets cannot be used as JSON keys
same_token_metiers = {
    " ".join(set_tok): org_mets for set_tok, org_mets in same_token_sets_metiers.items()
}

# The inverse mapping of profession strings to serialized token sets is created to update the dataset.
same_token_metiers_inv_map = {
    org_met: set_tok
    for set_tok, org_mets in same_token_metiers.items()
    for org_met in org_mets
}

### Saving the serialized token sets to profession strings mapping

In [None]:
with open(
    "./../data/intermediate_steps/searlized_tokens_profession_mapping.json",
    "w",
    encoding="utf8",
) as outfile:
    json.dump(same_token_metiers, outfile, indent=4, ensure_ascii=False)

### Reading the serialized token sets to profession strings mapping

In [None]:
with open(
    "./../data/intermediate_steps/searlized_tokens_profession_mapping.json",
    encoding="utf8",
) as f:
    same_token_metiers = json.load(f)

The dataframe holding the dataset is updated with the serialized tokens as the profession string after step 2, stored in a column named `metier_after_S2`.

In [None]:
# create a new column to store the serialized token sets
paris_jobs_after_s1["metier_after_S2"] = paris_jobs_after_s1["metier_after_S1"].map(
    same_token_metiers_inv_map
)

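# remove the rows of the dataset that have empty serialized token sets
# (assigning the result back avoids relying on an in-place replace on a column selection)
paris_jobs_after_s1["metier_after_S2"] = paris_jobs_after_s1["metier_after_S2"].replace(
    "", float("NaN")
)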
paris_jobs_after_s1.dropna(subset=["metier_after_S2"], inplace=True)

### Saving the dataframe with serialized token sets (separated by space) for each row

In [None]:
paris_jobs_after_s1.to_csv(
    "./../data/intermediate_steps/all_paris_jobs_after_step_2.csv", index=False
)

### Reading the dataframe with serialized token sets (separated by space) for each row

In [None]:
paris_jobs_after_s1 = pd.read_csv(
    "./../data/intermediate_steps/all_paris_jobs_after_step_2.csv",
    dtype={
        "doc_id": "str",
        "page": "str",
        "row": "str",
        "Nom": "str",
        "métier_original": "str",
        "rue": "str",
        "numéro": "str",
        "annee": "str",
        #"gallica_page": "str",
        "métier": "str",
        "metier_after_S1": "str",
        "metier_after_S2": "str",
    },
    header=0,
    encoding="utf-8",
)

In the next cells, before going to step 3, the unique tokens present in the dataset after step 2 are stored to disk. For each token, its frequency and the Gallica URLs of the entries where it appears are also stored.

In [None]:
# initialise empty dictionary to store token, its count and the urls
unique_denormalised_tokens = {}

for _, row in tqdm_notebook(
    paris_jobs_after_s1.iterrows(), total=len(paris_jobs_after_s1)
):
    # for each row in the dataset

    url = "https://gallica.bnf.fr/ark:/12148/{}/f{}.zoom".format(
        row["doc_id"], row["gallica_page"]
    )
    # generate the url for the entry

    tokens = row["metier_after_S2"].split()
    # get the tokens for that row from the `metier_after_S2` column

    for token in tokens:
        # for each token
        if not check_presence(token, unique_denormalised_tokens):
            # create an entry in the dictionary if it was not already created.
            unique_denormalised_tokens[token] = {"count": 0, "links": []}

        # increase the count of the token
        unique_denormalised_tokens[token]["count"] += 1
        # add the url of the entry
        unique_denormalised_tokens[token]["links"].append(url)

The unique tokens with their counts and links are sorted by descending frequency, and duplicate URLs are removed.

In [None]:
sorted_unique_denormalised_tokens = {
    k: v
    for k, v in sorted(unique_denormalised_tokens.items(), key=lambda e: -e[1]["count"])
}

for token_key, token_val_dict in sorted_unique_denormalised_tokens.items():
    token_val_dict["links"] = list(set(token_val_dict["links"]))

In [None]:
print(
    "The number of unique professions at the outset {},\nafter cleaning special characters {},\nafter merging mistakenly split words {},\nafter token cleaning {}.".format(
        paris_jobs_after_s1["métier_original"].nunique(),
        paris_jobs_after_s1["métier"].nunique(),
        paris_jobs_after_s1["metier_after_S1"].nunique(),
        paris_jobs_after_s1["metier_after_S2"].nunique(),
    )
)
print(
    "The number of unique tokens after step 2 {}".format(
        len(sorted_unique_denormalised_tokens)
    )
)

### Saving the unique tokens with counts and urls

In [None]:
with open(
    "./../data/intermediate_steps/unique_denormalised_tokens.json", "w", encoding="utf8"
) as outfile:
    json.dump(sorted_unique_denormalised_tokens, outfile, indent=4, ensure_ascii=False)

### Reading the unique tokens with counts and urls

In [None]:
with open(
    "./../data/intermediate_steps/unique_denormalised_tokens.json", encoding="utf8"
) as f:
    sorted_unique_denormalised_tokens = json.load(f)

## Merge misspelled tokens with the correctly spelled tokens (Step 3)

In the third step, the tokens that do not have a correct spelling (identified through the list of correct words) are merged with the closest correctly spelled word, iteratively, until no more merges are possible. The merge is performed in three rounds. In the first two rounds, the context of the tokens (with respect to other tokens) is considered, and in the last round, the spellings are merged ignoring the context.

### Algorithm for merging tokens

- In the first round, the tokens are merged according to the [Algorithm for contextual merging](#algorithm-for-contextual-merging).
- In the second round, the less frequent tokens are removed from each per-profession token set (after round 1). A threshold defined a priori determines whether a token is frequent. The same process as round 1 is then repeated.
- In the third round, the low-frequency tokens removed in round 2 are added back to the per-profession token sets (after round 2). This time, however, the tokens are considered individually, i.e. all token sets have length 1, and the same process as round 1 is repeated.
- After round 3, the per-profession token sets (after round 2) are updated to replace the tokens that were merged, and the dataset is saved to disk.

#### Algorithm for contextual merging

1. While the tokens can be merged, continue the iteration
2. Create a counter for each unique token present in the dataset at this step
3. Group the per profession token sets (the result of step 2) based on the length of the sets.
4. For each length of the token set
    1. For each token set of the considered token set length
        1. Compare with all other token sets of the same length, and check whether the two token sets differ by only one token (i.e. all the tokens are the same except one) and whether the two differing tokens are mergeable (cf. [Algorithm for checking tokens mergeability](#algorithm-for-checking-tokens-mergeability)).
            1. With the tokens shared by both token sets as a base, store all such pairs of mergeable tokens as values (intuitively, this means that the tokens of a pair are probably the same word and can be merged with confidence).
                1. For token sets of length one, there is no specific base, so a dummy string is used as the base and the merge continues.
5. For each base of token sets, produce a base-wise update mapping that stores the token and the token to which it should be updated (cf. [Algorithm for merging token pairs](#algorithm-for-merging-token-pairs)).
6. Combine the base-wise update mappings produced per base of token sets to generate a token update mapping for the iteration.
    1. This step is performed to combine the updates of tokens from the same token set that were merged under different bases.
7. Using the token update mapping for the iteration, update the profession string to token sets mapping.
8. If the profession string to token sets mapping is unchanged from the previous iteration (i.e. if no token is updated in the iteration) then stop the merging, else continue the merging from _2_.
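
The pairing in step 4 can be sketched with set operations. This is a minimal illustration and not the notebook's implementation: the function name `pairs_differing_by_one_token` and the example token sets are hypothetical, the mergeability check is left out, and the empty frozenset stands in for the dummy base used for token sets of length one.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def pairs_differing_by_one_token(
    token_sets: List[FrozenSet[str]],
) -> Dict[FrozenSet[str], Set[FrozenSet[str]]]:
    """Map each base (shared tokens) to the token pairs that differ under it."""
    # group the unique token sets by their length
    by_length = defaultdict(list)
    for token_set in set(token_sets):
        by_length[len(token_set)].append(token_set)

    candidates = defaultdict(set)
    for same_length_sets in by_length.values():
        for set_a, set_b in combinations(same_length_sets, 2):
            differing = set_a ^ set_b  # tokens present in only one of the two sets
            if len(differing) == 2:
                # the two sets differ by exactly one token each
                candidates[set_a & set_b].add(differing)
    return dict(candidates)


# hypothetical token sets, for illustration only
example = [
    frozenset({"marchand", "vin"}),
    frozenset({"marchand", "vln"}),
    frozenset({"marchand", "bois"}),
    frozenset({"avocat"}),
]
candidates = pairs_differing_by_one_token(example)
```

Here all three candidate pairs share the base `{"marchand"}`. Only the pairs that also pass the mergeability check would be kept in the actual algorithm.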

##### Algorithm for checking tokens mergeability

Two tokens are mergeable if their similarity is greater than a certain threshold (defined a priori) and one of the following holds:

1. The more frequent token is in the list of correctly spelled words and the less frequent token is not.
2. Both tokens have the same frequency and only one of them is in the list of correctly spelled words.

In both cases, the token that is not in the list of correctly spelled words is merged into the token that is in the list of correctly spelled words.
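
The two conditions reduce to one rule: exactly one token is in the list of correctly spelled words, and that token is at least as frequent as the other. A minimal sketch follows, assuming the similarity threshold has already been passed. The function name `pick_merge_target` and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Optional, Set


def pick_merge_target(
    token_a: str, token_b: str, counts: Counter, dictionary: Set[str]
) -> Optional[str]:
    """Return the token that absorbs the other, or None if they are not mergeable."""
    a_correct = token_a in dictionary
    b_correct = token_b in dictionary
    if a_correct == b_correct:
        # both or neither are correctly spelled: no merge direction
        return None
    correct, misspelled = (token_a, token_b) if a_correct else (token_b, token_a)
    # the correctly spelled token must be at least as frequent as the other one
    if counts[correct] >= counts[misspelled]:
        return correct
    return None


# hypothetical frequencies and dictionary, for illustration only
counts = Counter({"boulanger": 10, "boulanqer": 2, "epicier": 1, "epicior": 5})
dictionary = {"boulanger", "epicier"}

frequent_correct = pick_merge_target("boulanqer", "boulanger", counts, dictionary)
rare_correct = pick_merge_target("epicier", "epicior", counts, dictionary)
```

In this sketch `frequent_correct` is `"boulanger"`, while `rare_correct` is `None` because the correctly spelled token is the less frequent one.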

###### Token similarity

The token similarity is the Levenshtein similarity between the tokens. The [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between two strings is the "minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other". The distance is normalized and converted into a similarity score (`fuzz.ratio` reports it on a scale from 0 to 100). For this project, the similarity is calculated using the [RapidFuzz](https://github.com/maxbachmann/RapidFuzz) library's `fuzz.ratio` function without any preprocessing of the strings. This library was chosen because its similarity function accepts a similarity threshold and uses it as an early stopping criterion when comparing the strings (based on the lengths of the strings alone, a high similarity cannot be reached when the lengths differ significantly). It returns zero as the similarity when the similarity is less than the given threshold.
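
To make the idea of a thresholded similarity concrete, here is a pure-Python sketch. It is an illustration under stated assumptions and not the formula RapidFuzz uses: the distance is the textbook Levenshtein distance, the normalization divides by the longer string's length, and the names `levenshtein_distance` and `similarity_with_cutoff` are hypothetical. The score is scaled to 0 to 100 to match the scale of `fuzz.ratio`.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity_with_cutoff(a: str, b: str, cutoff: float = 0.0) -> float:
    """Similarity in [0, 100]; 0.0 when the score would fall below `cutoff`."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    # the length difference is a lower bound on the distance, so the
    # comparison can stop early when even the best case misses the cutoff
    best_possible = 100.0 * (1 - abs(len(a) - len(b)) / longest)
    if best_possible < cutoff:
        return 0.0
    score = 100.0 * (1 - levenshtein_distance(a, b) / longest)
    return score if score >= cutoff else 0.0


close_score = similarity_with_cutoff("boulanger", "boulanqer", cutoff=80)
far_score = similarity_with_cutoff("vin", "vinaigrier", cutoff=80)
```

With these inputs `close_score` is about 88.9 (one substitution over nine characters), and `far_score` is 0.0 because the length difference alone rules out a score of 80 or more.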

##### Algorithm for merging token pairs

For a given base and its pairs of mergeable tokens, a two-step process determines the token into which each token to be merged is changed.

1. Loop over all the pairs to create a mapping whose key is the token to be merged and whose value is a list of tuples, each containing a token that it can be merged with and the similarity between the two tokens.
2. For each of the tokens to be merged,
    1. If there is only one token that it can be merged with, then the token to be merged is changed to that token.
    2. Otherwise, if there is more than one possible tuple, then the token to be merged is changed to the token with the highest similarity score.
        1. If multiple tokens share the highest similarity score, then the token to be merged is changed to the token that appears most frequently among them.
    3. If the token is not yet merged (the tie remains), then the token is left as it is without merging.
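
The selection in step 2 can be sketched as follows. This is a simplified illustration of the rule described above, with a hypothetical function name `choose_merge_target` and made-up candidates and frequencies.

```python
from collections import Counter
from typing import List, Optional, Tuple


def choose_merge_target(
    candidates: List[Tuple[str, float]], counts: Counter
) -> Optional[str]:
    """Pick the merge target: highest similarity first, then highest frequency."""
    if not candidates:
        return None
    best_score = max(score for _, score in candidates)
    best = [token for token, score in candidates if score == best_score]
    if len(best) == 1:
        return best[0]
    # tie on similarity: fall back to the frequency in the dataset
    top_frequency = max(counts[token] for token in best)
    most_frequent = [token for token in best if counts[token] == top_frequency]
    # a remaining tie means the token is left unmerged
    return most_frequent[0] if len(most_frequent) == 1 else None


# hypothetical frequencies, for illustration only
counts = Counter({"marchand": 50, "marchant": 5, "marchande": 50})

by_similarity = choose_merge_target([("marchand", 94), ("marchant", 88)], counts)
by_frequency = choose_merge_target([("marchand", 94), ("marchant", 94)], counts)
unresolved = choose_merge_target([("marchand", 94), ("marchande", 94)], counts)
```

Here `by_similarity` and `by_frequency` are both `"marchand"`, and `unresolved` is `None` because the two candidates tie on similarity and on frequency.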
    
### Utility functions to merge misspelled tokens with the correctly spelled tokens

The functions are

1. `merge_close_spellings`: The function accepts the list of token sets for the whole dataset along with a set of correctly spelled words and the thresholds used to decide whether a token is low frequency and whether two tokens are similar. It returns the updated list of token sets for the whole dataset and an array with the similarity scores of the tokens that are merged.
2. `get_similar_tokens_differed_by_one_token`: The function accepts the clustering of token sets based on their length at the current iteration, the clustering at the previous iteration (to help in early stopping), the frequency counter for the tokens in the dataset, the list of correctly spelled words, and a minimum similarity score for two tokens to be declared similar. It returns a three-level mapping (Python dictionary). The primary key is the set of tokens common to two token sets that differ by one token. The secondary key is the pair of mergeable tokens by which those two token sets differ. The tertiary and last dictionary has two elements: the `sim_score` key holds the similarity score between the tokens of the pair, and the `merge_to` key holds the token of the pair that will absorb the other token.
3. `are_tokens_mergable`: The function accepts two tokens, along with their frequency, presence in the list of correctly spelled words, and a threshold of similarity score and returns the similarity score and the token that will absorb the other token. The algorithm to decide if two tokens are mergeable is described under `Algorithm for checking tokens mergeability` in the documentation above.
4. `group_connected_words_with_common_base`: The function accepts the dictionary of common tokens and the mergeable tokens for each base along with the token frequency counter and returns the mapping of the updates to the tokens per token set of profession strings.
5. `get_replacements`: This function accepts the mergeable token pairs for a common base along with their similarity scores and returns (1) a dictionary with the key as the token and the value as the token to replace the key, and (2) a set of tokens updated in the current iteration.
6. `get_tok_count`: The function accepts the profession token sets for all the entries of a dataset and returns a counter containing the frequency for each token.
7. `get_unique_token_sets`: Return the unique profession token sets present in the dataset
8. `get_length_wise_token_sets`: The function accepts the unique token sets of the dataset and clusters them based on the length of the token sets.
9. `print_token_statistics`: The function prints the statistics about the changes in the number of tokens after the iteration or round.
10. `print_merge_statistics`: The function prints the mean, median, standard deviation, minimum and maximum values of the similarity scores.
11. `get_current_token_sets`: Get the last token set from each entry of a list of lists of token sets.

In [None]:
def simple_processor(token: str) -> str:
    """A string processor to return the same string as input.
        This dummy processor is used to avoid the default processor of the Rapidfuzz module to calculate string similarity.

    Parameters
    ----------
    token : str
        The input string to process.

    
    Returns
    -------
    str
        The output string same as the input string.
    """
    return token

def get_current_token_sets(
    list_of_list_of_token_sets: List[List[FrozenSet[str]]],
) -> List[FrozenSet[str]]:
    """Get the last tokenset for a list of list of tokensets.

    Parameters
    ----------
    list_of_list_of_token_sets : List[List[FrozenSet[str]]]
        The current list of list of tokensets.

    Returns
    -------
    List[FrozenSet[str]]
        The current list of tokensets
    """
    return [tok_list[-1] for tok_list in list_of_list_of_token_sets]


def get_tok_count(token_sets_list: List[FrozenSet[str]]) -> Counter_type[str]:
    """The function accepts the profession token sets for all the entries of a dataset and returns a counter containing the frequency for each token.

    Parameters
    ----------
    token_sets_list : List[FrozenSet[str]]
        The list of tokens sets that represent the profession string for all the entries of the dataset.

    Returns
    -------
    counter_all_tokens : Counter_type[str]
        The counter with the frequency for each token in the dataset.
    """

    # collect the individual tokens in each tokenset into a list
    all_tokens = [
        metier_token
        for metier_tokens in token_sets_list
        for metier_token in metier_tokens
    ]
    # Create a counter object on the list of tokens
    counter_all_tokens = Counter(all_tokens)
    # return the counter object
    return counter_all_tokens


def get_unique_token_sets(
    token_sets_list: List[FrozenSet[str]],
) -> Set[FrozenSet[str]]:
    """Return the unique profession token sets present in the dataset.

    Parameters
    ----------
    token_sets_list : List[FrozenSet[str]]
        The list of token sets that represent the profession string for all the entries of the dataset.

    Returns
    -------
    Set[FrozenSet[str]]
        The set of unique token sets obtained by removing the duplicates in the input list.
    """

    # frozensets are hashable, so building a set removes the duplicates
    return set(token_sets_list)


def get_length_wise_token_sets(
    unique_profession_token_sets: List[FrozenSet[str]],
) -> Dict[int, List[FrozenSet[str]]]:
    """The function accepts the unique token sets of the dataset and clusters them based on the length of the token sets.

    Parameters
    ----------
    unique_profession_token_sets : List[FrozenSet[str]]
        The list of unique tokens sets per profession string in the dataset.

    Returns
    -------
    len_wise_token_sets : Dict[int, List[FrozenSet[str]]]
        A dictionary with length of the token sets as the key and the token sets as the values in a list.
    """

    # initialise a dictionary to store the length of the token sets as the key and the token sets as the values in a list.
    len_wise_token_sets = {}

    for froz_tok in unique_profession_token_sets:
        # for each token set
        set_len = len(froz_tok)  # get the length of the token set
        if not check_presence(set_len, len_wise_token_sets):
            # if the length of the token set is not present as a key, add it to the dictionary
            len_wise_token_sets[set_len] = []
        # append the token set to the dictionary at the corresponding length
        len_wise_token_sets[set_len].append(froz_tok)

    # sort the length wise clustered token sets based on the length (key of the clusters)
    # the sorting is performed for the sake of neatness.
    len_wise_token_sets = {
        k: v for k, v in sorted(len_wise_token_sets.items(), key=lambda e: e[0])
    }

    # return the length wise clustered token sets.
    return len_wise_token_sets


def are_tokens_mergable(
    tokens_to_check: Set[str],
    tokens_counter: Counter_type[str],
    correctly_spelled_words: Set[str],
    minimum_similarity: float,
) -> Tuple[Union[float, None], Union[str, None]]:
    """The function accepts two tokens, along with their frequency, presence in the list of correctly spelled words and a threshold of similarity score and returns the similarity score and the token that will absorb the other token. The algorithm to decide if two tokens are mergeable is described as ``Algorithm for checking tokens mergeability`` in the documentation above.
    
    Parameters
    ----------
    tokens_to_check : Set[str]
        A set of two token strings to be checked for mergability.
    tokens_counter : Counter_type[str]
        The frequency counter of the tokens in the dataset.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    minimum_similarity : float
        The minimum threshold for the similarity score between the token strings to be mergable.

    Returns
    -------
    sim_score : Union[float, None]
        The similarity score between the tokens. None if the similarity score is less than the threshold similarity score or if the tokens are not mergable due to other conditions
    merge_to : Union[str, None]
        One of the two input tokens that will absorb the other token as a result of the merge. None if the tokens are not mergable.
    """

    # initialise the similarity score and the token to be merged to as None, assuming the tokens are not mergeable
    sim_score = None
    merge_to = None
    # a flag to indicate if both the tokens have the same frequency
    tokens_with_same_frequency = False

    # unpack the two tokens; the order is arbitrary here and is fixed by frequency below.
    token_one, token_two = tokens_to_check

    if tokens_counter[token_one] == tokens_counter[token_two]:
        # if both the tokens have same frequency the flag is set to True
        tokens_with_same_frequency = True
    elif tokens_counter[token_two] > tokens_counter[token_one]:
        # assign the higher frequency token as first token and the lower frequency token as second token.
        token_one, token_two = token_two, token_one

    # flags to store if the tokens are in the list of correctly spelled words
    token_one_correct_spell = check_presence(token_one, correctly_spelled_words)
    token_two_correct_spell = check_presence(token_two, correctly_spelled_words)

    if (token_one_correct_spell and (not token_two_correct_spell)) or (
        tokens_with_same_frequency
        and (token_one_correct_spell ^ token_two_correct_spell)
    ):
        # calculate similarity
        # 1. if higher frequency token is in the dictionary and lower frequency token is not in the dictionary
        # or
        # 2. if only one token is in the dictionary when the tokens have same frequency.
        # the `fuzz.ratio` function from the rapidfuzz library returns 0 if the similarity score between the input strings is less than the threshold provided as `score_cutoff`.
        sim_score = round(
            fuzz.ratio(
                token_one, token_two, processor=None, score_cutoff=minimum_similarity
            )
        )

        if sim_score:
            # the token present in the dictionary will be assigned to `merge_to` to absorb the other token
            merge_to = token_two if token_two_correct_spell else token_one
        else:
            # below the threshold: report the tokens as not mergeable, as stated in the docstring
            sim_score = None

    return sim_score, merge_to


def get_similar_tokens_differed_by_one_token(
    len_wise_token_sets: Dict[int, List[FrozenSet[str]]],
    prev_len_wise_token_sets: Dict[int, List[FrozenSet[str]]],
    tokens_counter: Counter_type[str],
    correctly_spelled_words: Set[str],
    minimum_similarity: float,
) -> Dict[FrozenSet[str], Dict[FrozenSet[str], Dict[str, Union[float, str]]]]:
    """The function accepts the clustering of token sets based on their length at the current iteration, along with the clustering at the previous iteration (to help in early stopping), the frequency counter for the tokens in the dataset, the list of correctly spelled words and a minimum similarity score for the tokens to be declared similar, and returns a mapping (Python dictionary). The primary key of the returned dictionary is the set of tokens common to two token sets that differ by one token, and the value is another dictionary. The key of the secondary dictionary is the pair of mergeable tokens by which the two token sets differ, with the value as another dictionary. In the tertiary and last dictionary there are two elements: the ``sim_score`` key has the similarity score between the secondary token pair as the value and the ``merge_to`` key has the token of the secondary key pair that will absorb the other token.

    Parameters
    ----------
    len_wise_token_sets : Dict[int, List[FrozenSet[str]]]
        A dictionary with length of the token sets as the key and the token sets as the values in a list at current iteration.
    prev_len_wise_token_sets : Dict[int, List[FrozenSet[str]]]
        A dictionary with length of the token sets as the key and the token sets as the values in a list at previous iteration.
    tokens_counter : Counter_type[str]
        The frequency counter of the tokens in the dataset.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    minimum_similarity : float
        The minimum threshold for the similarity score between the token strings to be mergable.

    Returns
    -------
    similar_tokens_with_base : Dict[FrozenSet[str], Dict[FrozenSet[str], Dict[str, Union[float, str]]]]
        The three level dictionary with common tokens as primary key and the mergable tokens as secondary key and the similarity score and the token that will absorb the other token in the third level.
    """

    # initialise the empty dictionary to store return values.
    similar_tokens_with_base = {}

    # loop over the length wise clustered token sets
    for str_len, str_lists in len_wise_token_sets.items():
        # for token sets of same length

        if (set(str_lists) != set(prev_len_wise_token_sets.get(str_len, []))):
            # If the token sets of a given length at this iteration are not the same as the token sets of the same length in the previous iteration, continue with the process. If the sets are the same as in the previous iteration, none of those tokens were merged and checking for mergeability again is redundant.

            print("Checking for token sets of length {}".format(str_len))

            len_str_lists = len(str_lists)

            for ind_lst_1 in tqdm_notebook(range(len_str_lists)):

                for ind_lst_2 in range(ind_lst_1 + 1, len_str_lists):
                    # for each pair of token sets of same length

                    # get the tokens that are different in the two token sets (acts as secondary key)
                    set_sym_diff = str_lists[ind_lst_1] ^ str_lists[ind_lst_2]

                    if len(set_sym_diff) == 2:
                        # if the symmetric difference has exactly two elements, i.e. the sets differ by only one token each, check if the two tokens can be merged

                        score, merge_to = are_tokens_mergable(
                            set_sym_diff,
                            tokens_counter,
                            correctly_spelled_words,
                            minimum_similarity,
                        )

                        if score:
                            # if the tokens are mergable,

                            # the same tokens present in both the token sets are obtained to be the primary key of the dictionary.
                            froz_set_same = frozenset(
                                str_lists[ind_lst_1].intersection(str_lists[ind_lst_2])
                            )

                            # for token sets of length one, there will not be any primary key. So a dummy value is created to act as the primary key
                            if not froz_set_same:
                                froz_set_same = "SINGLE_TOKEN"

                            if not check_presence(
                                froz_set_same, similar_tokens_with_base
                            ):
                                # create the second level dictionary with the primary key
                                similar_tokens_with_base[froz_set_same] = {}
                            if not check_presence(
                                set_sym_diff, similar_tokens_with_base[froz_set_same]
                            ):
                                # create the third level dictionary with the secondary key and add the tertiary keys and the corresponding values.
                                similar_tokens_with_base[froz_set_same][
                                    set_sym_diff
                                ] = {
                                    "sim_score": score,
                                    "merge_to": merge_to,
                                }
        else:
            print("Skipping checking for token sets of length {}".format(str_len))
    return similar_tokens_with_base


def get_replacements(
    similar_tokens_for_base: Dict[FrozenSet[str], Dict[str, Union[float, str]]],
    tokens_counter: Counter_type[str],
) -> Tuple[Dict[str, str], Set[str]]:
    """This function accepts the similar tokens for having common tokens along with their similarity and 
        returns one, a dictionary with the key as the token and the value as the token to replace the key,
        and two, a set of tokens updated in the current iteration. 

    Parameters
    ----------
    similar_tokens_for_base : Dict[FrozenSet[str], Dict[str, Union[float, str]]]
        The two level dictionary per a set of common tokens with the mergable tokens as primary key and
        the similarity score and the token that will absorb the other token.
    tokens_counter : Counter_type[str]
        The frequency counter of the tokens in the dataset.

    Returns
    -------
    replacements : Dict[str, str]
        A dictionary with the mapping from the token to change to the token to be changed with for
        a common set of tokens that are different only by one token.
    all_tokens_round : Set[str]
        The set of tokens that are updated in the iteration
    """

    # initialize an empty dictionary to hold the possible pairs of token and similarity score that a token with low frequency and not in the list of correctly spelled words can be merged with.
    local_replacements = {}

    # initialize an empty dictionary to hold the merge mapping from the token with low frequency and not in the dictionary to the token with high frequency that is present in the list of correctly spelled words.
    replacements = {}

    # initialize an empty set to hold the list of tokens being updated for the considered base set of common tokens.
    all_tokens_round = set()

    # First, get the possible list of tokens the low frequent and not in the list of correctly spelled words token can be merged with.

    for similar_tokens, props in similar_tokens_for_base.items():
        # for each pair of similar tokens and their properties of similarity scores and the token that will absorb the other token

        # the token in the list of correctly spelled words and with high frequency is set as the `merge_to` variable and the lower frequency, not in the list of correctly spelled words token is assigned to `to_be_merged` variable.
        token_one, token_two = similar_tokens
        merge_to = props["merge_to"]
        if merge_to == token_one:
            to_be_merged = token_two
        else:
            to_be_merged = token_one

        if not check_presence(to_be_merged, local_replacements):
            # add the `to_be_merged` key to the local_replacements dictionary as key
            local_replacements[to_be_merged] = []
        if not any(
            candidate == merge_to for candidate, _ in local_replacements[to_be_merged]
        ):
            # add the `merge_to` token and the similarity score under the possible replacements for `to_be_merged`, unless the token is already listed.
            local_replacements[to_be_merged].append((merge_to, props["sim_score"]))

    # Second, decide the token to which the low frequent not in the list of correctly spelled words will be merged to.

    for to_be_merged, possible_merges in local_replacements.items():
        # for each `to_be_merged` token and the list of possible merges

        selected_mergee = None
        if len(possible_merges) == 1:
            # if there is only one token, then the low frequent and not in the list of correctly spelled words token is merged with it.
            selected_mergee = possible_merges[0][0]
        else:
            # if there is more than one possible token, keep the `merge_to` tokens with the highest similarity to the `to_be_merged` token.
            # plain Python comparisons keep the scores numeric and the tokens as `str` (a NumPy array of mixed tuples would cast both to strings).
            max_sim_score = max(sim_score for _, sim_score in possible_merges)
            mergee_contenders = [
                mergee
                for mergee, sim_score in possible_merges
                if sim_score == max_sim_score
            ]
            if len(mergee_contenders) == 1:
                # select the token with highest similarity score to merge with.
                selected_mergee = mergee_contenders[0]
            else:
                # if several tokens share the highest similarity score, keep those with the highest frequency in the dataset.
                max_freq = max(tokens_counter[mergee] for mergee in mergee_contenders)
                mergee_contenders_freq = [
                    mergee
                    for mergee in mergee_contenders
                    if tokens_counter[mergee] == max_freq
                ]
                if len(mergee_contenders_freq) == 1:
                    # select the token with highest frequency to merge with.
                    selected_mergee = mergee_contenders_freq[0]

                # if more than one token has the same similarity score and the same frequency, then `to_be_merged` is not merged with any token for this base.
        if selected_mergee:
            replacements[to_be_merged] = selected_mergee
            all_tokens_round.update(
                [to_be_merged]
            )  # update the list of tokens updated in the iteration
    return replacements, all_tokens_round


def group_connected_words_with_common_base(
    similar_tokens_with_base: Dict[
        FrozenSet[str], Dict[FrozenSet[str], Dict[str, Union[float, str]]]
    ],
    tokens_counter: Counter_type[str],
) -> Dict[FrozenSet[str], Dict[str, str]]:
    """The function accepts the dictionary of common tokens and the mergable tokens for the base along with their counter and their presence in the list of correctly spelled words and returns the mapping of the updates to the tokens per tokens set of profession strings.

    Parameters
    ----------
    similar_tokens_with_base : Dict[FrozenSet[str], Dict[FrozenSet[str], Dict[str, Union[float, str]]]]
        The three level dictionary with common tokens as primary key and the mergable tokens as secondary key and the similarity score and the token that will absorb the other token in the third level.
    tokens_counter : Counter_type[str]
        The frequency counter of the tokens in the dataset.

    Returns
    -------
    replacements_per_profession_token_set : Dict[FrozenSet[str], Dict[str, str]]
        A two level dictionary, with the token set per profession string as the primary key and the value is a dictionary with the token to change as the key and the token to change it to as the value.
    """

    print("Merging connected words with common base")

    # initialise a dictionary to store the updates on each token per profession string to token set mapping (only the token sets that have a change are stored)
    replacements_per_profession_token_set = {}

    for same_key, mergable_token_pairs in similar_tokens_with_base.items():
        # for each base and dictionary of mergable token pairs for that base, get the possible replacements for the tokens for the given base

        (same_base_replacements, all_tokens_this_round,) = get_replacements(
            mergable_token_pairs, tokens_counter,
        )

        for token in all_tokens_this_round:
            # for each token dealt during the iteration for merge

            # create the complete token set
            if same_key == "SINGLE_TOKEN":
                # if the base is the predefined keyword "SINGLE_TOKEN", then the token set consists only the token
                met_str_tokens = frozenset([token])
            else:
                # else, merge base with the token to generate the full token set (as we created base for those token sets that are differed by one token)
                met_str_tokens = frozenset(same_key.union(set([token])))

            if not check_presence(
                met_str_tokens, replacements_per_profession_token_set
            ):
                # create an entry in the dictionary to store the per profession string token changes
                replacements_per_profession_token_set[met_str_tokens] = {}

            if check_presence(token, same_base_replacements):
                # if the token is merged to another token during the process, add the mapping to the dictionary
                replacements_per_profession_token_set[met_str_tokens][
                    token
                ] = same_base_replacements.get(token)

    return replacements_per_profession_token_set


def print_token_statistics(
    prev_token_sets_len: int,
    curr_token_sets_len: int,
    unique_tokens_count_before: int,
    unique_tokens_count_filtered: int,
    unique_tokens_count_after: int,
) -> None:
    """The function prints the statistics about the changes in the number of tokens after an iteration or a round.

    Parameters
    ----------
    prev_token_sets_len : int
        The number of token sets i.e. tokens of profession strings before the current iteration.
    curr_token_sets_len : int
        The number of token sets i.e. tokens of profession strings after the current iteration.
    unique_tokens_count_before : int
        The number of unique tokens before the current iteration.
    unique_tokens_count_filtered : int
        The number of unique tokens before the current iteration after filtering based on frequency.
    unique_tokens_count_after : int
        The number of unique tokens after the current iteration.

    Returns
    -------
    None

    """
    print(
        "\t\tThe number of unique token sets before merge is {} and after merge is {}".format(
            prev_token_sets_len, curr_token_sets_len
        )
    )
    print(
        "\t\tThe number of unique tokens before is {}, after filtering {} and after merge is {}".format(
            unique_tokens_count_before,
            unique_tokens_count_filtered,
            unique_tokens_count_after,
        )
    )


def print_merge_statistics(replacement_scores_list: np.ndarray) -> None:
    """The function prints the mean, median, standard deviation, minimum and maximum values of the similarity scores.

    Parameters
    ----------
    replacement_scores_list : np.ndarray
        The array of similarity scores for all the merges during the current iteration/round.

    Returns
    -------
    None
    """

    avg_score, min_score, max_score, median_score, stddev_score = (
        np.mean(replacement_scores_list),
        np.min(replacement_scores_list),
        np.max(replacement_scores_list),
        np.median(replacement_scores_list),
        np.std(replacement_scores_list),
    )

    print("\t\tNumber of Unique Replacements: {}".format(len(replacement_scores_list)))
    print(
        "\t\tLeast Similarity Score: {}, Max Similarity Score: {}".format(
            min_score, max_score
        )
    )
    print(
        "\t\tAverage Score: {}, Standard Deviation: {}, Median: {}\n".format(
            avg_score, stddev_score, median_score
        )
    )


def merge_close_spellings(
    profession_token_sets_full_dataset: List[List[FrozenSet[str]]],
    correctly_spelled_words: Set[str],
    minimum_token_frequency: int,
    minimum_similarity: float,
) -> Tuple[List[List[FrozenSet[str]]], np.ndarray]:
    """The function accepts the list of tokens sets for all the dataset along with set of correctly spelled words,
        the thresholds to check if tokens can be classified as low frequent and if two tokens are similar and
        returns the updated list of token sets for all the dataset and an array with the similarity scores of tokens that are merged.

    Parameters
    ----------
    profession_token_sets_full_dataset : List[List[FrozenSet[str]]]
        A list of token sets for the unique profession strings stored in a list. The token sets are stored in a list to track the changes.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    minimum_token_frequency : int
        The minimum threshold for the frequency of token to be considered frequent (or not low frequent).
    minimum_similarity : float
        The minimum threshold for the similarity score between the token strings to be mergable.

    Returns
    -------
    updated_token_sets_full_dataset_tracked : List[List[FrozenSet[str]]]
        A list of token sets for the unique profession strings stored in a list. The last entry of the list is the updated token set after the current round.
    replacement_scores_round : List[np.ndarray]
        A list with one array per iteration, holding the similarity scores of all the merges during the current round.
    """

    # create a copy of the profession token sets lists to avoid in place editing.
    updated_token_sets_full_dataset_tracked = copy.deepcopy(
        profession_token_sets_full_dataset
    )

    # get the last entry from the list of list of token sets as the last entry is the last updated entry as current token sets.
    token_sets_full_dataset = [
        met_list[-1] for met_list in updated_token_sets_full_dataset_tracked
    ]

    # A list to store the token sets before the update, used to stop the (merging) round when the token sets no longer change.
    prev_token_sets_full_dataset = []
    # A dictionary to store the length-wise clustered profession token sets before the update, used to skip a particular length once its token sets no longer change.
    prev_token_sets_per_length = {}
    # A list to store the similarity scores of the tokens merged during the whole round.
    replacement_scores_round = []
    # A counter for the number of iterations in the round
    itration_count = 0

    while set(token_sets_full_dataset) != set(prev_token_sets_full_dataset):
        # Merge the token sets as long as the previous token sets of the profession strings differ from the current ones. In other words, the round of merging terminates when the token sets no longer change.

        # increase and print the iteration count
        itration_count += 1
        print("Merging {} time".format(itration_count))

        unique_replacement_similarity_scores = {}

        # get the frequency counter for all the unique tokens in the current list of token sets for profession strings.
        counter_all_tokens = get_tok_count(token_sets_full_dataset)

        # assign the current list of list of token sets as previous iteration's list of list of token sets to be used in the next iteration to break the merge loop
        prev_token_sets_full_dataset = token_sets_full_dataset

        # According to the ``Algorithm for merging tokens`` mentioned above, in the second round the low frequent tokens are removed in the token sets. The filtering of tokens is performed here based on the ``minimum_token_frequency`` variable.
        filtered_token_sets = [
            frozenset(
                [
                    tok
                    for tok in met_tok_set
                    if (
                        (counter_all_tokens[tok] > minimum_token_frequency)
                        or (tok in correctly_spelled_words)
                    )
                ]
            )
            for met_tok_set in token_sets_full_dataset
        ]

        # get the list of unique token sets from the full dataset.
        unique_token_sets = get_unique_token_sets(filtered_token_sets)

        # get the length wise clustered token sets
        token_sets_per_length = get_length_wise_token_sets(unique_token_sets)

        # get the pairs of similar tokens whose token sets share the same remaining tokens
        contxt_similar_tokens = get_similar_tokens_differed_by_one_token(
            token_sets_per_length,
            prev_token_sets_per_length,
            counter_all_tokens,
            correctly_spelled_words,
            minimum_similarity,
        )

        # assign the current clustering of token sets per length as previous iteration's token sets per length to be used in the next iteration to break the merge loop
        prev_token_sets_per_length = token_sets_per_length

        # get the replacements for the tokens using the similar tokens obtained in the previous step
        replacements_per_unique_token_set = group_connected_words_with_common_base(
            contxt_similar_tokens, counter_all_tokens
        )

        # empty list to store the updated list of token sets.
        token_sets_full_dataset_updated = []

        # loop over the full dataset of token sets and the filtered full dataset of token sets together to add the earlier filtered tokens back
        for tok_set, tok_set_filt in zip(token_sets_full_dataset, filtered_token_sets):
            # for the token set and its filtered version, get the updated token set. If the token set is not updated then the tokens are left unchanged.

            updated_met_tok_set = tok_set
            replacements_for_the_token_set = replacements_per_unique_token_set.get(
                tok_set_filt, {}
            )
            if replacements_for_the_token_set:
                updated_met_tok_set = []
                for tok in tok_set:
                    replacement_tok = replacements_for_the_token_set.get(tok, tok)
                    updated_met_tok_set.append(replacement_tok)
                    if tok != replacement_tok:
                        toks_as_frozenset = frozenset([tok, replacement_tok])
                        if not check_presence(
                            toks_as_frozenset, unique_replacement_similarity_scores
                        ):
                            unique_replacement_similarity_scores[
                                toks_as_frozenset
                            ] = fuzz.ratio(tok, replacement_tok, processor=None)

            # append the updated token set to the list of updated token sets
            token_sets_full_dataset_updated.append(frozenset(updated_met_tok_set))

        # assign the updated token sets as the current token sets.
        token_sets_full_dataset = token_sets_full_dataset_updated

        # append the updated token set to the history of each profession in the tracked copy, so that the input list is left untouched.
        for tok_set_history, curr_met in zip(
            updated_token_sets_full_dataset_tracked, token_sets_full_dataset
        ):
            if tok_set_history[-1] != curr_met:
                tok_set_history.append(curr_met)

        # print the tokens related statistics for the iteration
        print(
            "\tDescriptive statistics of Merge Results for {} time".format(
                itration_count
            )
        )

        print_token_statistics(
            prev_token_sets_len=len(set(prev_token_sets_full_dataset)),
            curr_token_sets_len=len(set(token_sets_full_dataset)),
            unique_tokens_count_before=len(counter_all_tokens),
            unique_tokens_count_filtered=len(get_tok_count(filtered_token_sets)),
            unique_tokens_count_after=len(get_tok_count(token_sets_full_dataset)),
        )

        # print the similarity score related statistics for the iteration
        replacement_scores_iteration = np.fromiter(
            unique_replacement_similarity_scores.values(), dtype=float
        )

        if len(replacement_scores_iteration):
            replacement_scores_round.append(replacement_scores_iteration)

            flattened_replacement_scores = replacement_scores_iteration.flatten()

            print_merge_statistics(replacement_scores_list=flattened_replacement_scores)

        else:
            print("\t\tNo Replacements")

    # print the tokens related statistics for the complete round
    print("\n")
    print("*" * 100)
    print("Descriptive statistics of Merge Results for this Round")

    # print the similarity score related statistics for the complete round
    if len(replacement_scores_round):

        flattened_replacement_scores_round = np.hstack(replacement_scores_round)

        print_merge_statistics(
            replacement_scores_list=flattened_replacement_scores_round
        )

    else:
        print("\t\tNo Replacements")
    print("*" * 100)

    # return the list of list of token sets and the list of similarity scores for tokens merged during the round.
    return updated_token_sets_full_dataset_tracked, replacement_scores_round

### Read the data frame with serialized token sets (separated by space) for each row 

Only the column that contains the serialized token sets (separated by space) is read, to save memory.

In [None]:
col_list = ["metier_after_S2"]

paris_jobs_only_s2 = pd.read_csv(
    "./../data/intermediate_steps/all_paris_jobs_after_step_2.csv",
    dtype={"metier_after_S2": "str"},
    usecols=col_list,
    header=0,
    encoding="utf-8",
)

The serialized token sets are split at space (as they represent tokens) and the token sets for all the rows are stored in a list. The reason for storing them in a list is to keep track of the changes to the token sets throughout the process.
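
As a minimal sketch of this history-tracking idea (the profession strings below are made up for illustration and are not taken from the dataset), each row starts with a one-element list, and a later step appends a new token set only when it differs from the latest one:

```python
from typing import FrozenSet, List

# Hypothetical serialized professions, in the shape of the `metier_after_S2` column.
serialized = ["marchand vins", "marchnd vins", "tailleur"]

# One history list per row; the first entry is the initial token set.
histories: List[List[FrozenSet[str]]] = [[frozenset(s.split())] for s in serialized]

# A later merge step appends a new entry only when the token set changes.
updated = frozenset({"marchand", "vins"})
if histories[1][-1] != updated:
    histories[1].append(updated)

# The latest token set of every row is always the last entry of its history.
current = [history[-1] for history in histories]
```

Rows that never change keep a history of length one, so the full sequence of changes for any row can be inspected afterwards.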

In [None]:
paris_all_profession_token_sets = [
    [frozenset(clean_met.split())]
    for clean_met in paris_jobs_only_s2["metier_after_S2"]
    if len(frozenset(clean_met.split()))
]

### Round 1

In [None]:
# get the current token sets for each profession
initial_token_sets = get_current_token_sets(paris_all_profession_token_sets)

# get the set of unique tokens from the dataset
intial_indv_tokens = set(get_tok_count(initial_token_sets).keys())

# create a set of tokens in list of correct words
correctly_spelled_tokens = intial_indv_tokens.intersection(words_with_correct_spellings)

num_of_correctly_spelled_tokens = len(correctly_spelled_tokens)

In [None]:
print(
    "The number of unique tokens that represent the professions before round 1 is {}, \nand {} of them are in the dictionary ({}%).".format(
        len(intial_indv_tokens),
        num_of_correctly_spelled_tokens,
        round((num_of_correctly_spelled_tokens / len(intial_indv_tokens)) * 100),
    )
)

In [None]:
# Round 1: merge tokens with close spellings when the remaining tokens of the two token sets are the same.
# The minimum token frequency is set to zero because tokens are not filtered by frequency in the first round.

(
    paris_all_profession_token_sets_after_round_1,
    replacement_scores_round_after_round_1,
) = merge_close_spellings(
    profession_token_sets_full_dataset=paris_all_profession_token_sets,
    correctly_spelled_words=correctly_spelled_tokens,
    minimum_token_frequency=0,
    minimum_similarity=MINIMUM_TOKEN_SIMILARITY,
)

#### Saving 

##### The updated list of lists of token sets after round 1, saved to disk as a pickle file because it contains frozensets

In [None]:
with open(
    intermediate_steps_folder_prefix + "paris_all_profession_token_sets_after_round_1.pickle", "wb",
) as outfile:
    pickle.dump(paris_all_profession_token_sets_after_round_1, outfile)

##### The replacement similarity scores after round 1, saved to disk as a pickle file because they are stored in a list

In [None]:
with open(
    intermediate_steps_folder_prefix + "replacement_scores_round_after_round_1.pickle", "wb"
) as outfile:
    pickle.dump(replacement_scores_round_after_round_1, outfile)

#### Reading

##### The updated list of lists of token sets from disk after round 1

In [None]:
with open(
    intermediate_steps_folder_prefix + "paris_all_profession_token_sets_after_round_1.pickle", "rb"
) as infile:
    paris_all_profession_token_sets_after_round_1 = pickle.load(infile)

##### The replacement similarity scores from disk after round 1

In [None]:
with open(
    intermediate_steps_folder_prefix + "replacement_scores_round_after_round_1.pickle", "rb"
) as infile:
    replacement_scores_round_after_round_1 = pickle.load(infile)

### Round 2

In [None]:
# get the current token sets for each profession
token_sets_after_round_1 = get_current_token_sets(
    paris_all_profession_token_sets_after_round_1
)

# get the set of unique tokens from the dataset
indv_tokens_after_round_1 = set(get_tok_count(token_sets_after_round_1).keys())

In [None]:
print(
    "The number of unique tokens that represent the professions before round 2 is {}, \nand {} of them are in the dictionary ({}%).".format(
        len(indv_tokens_after_round_1),
        num_of_correctly_spelled_tokens,
        round((num_of_correctly_spelled_tokens / len(indv_tokens_after_round_1)) * 100),
    )
)

In [None]:
# Round 2: merge tokens with close spellings when the remaining tokens are the same, after keeping only the tokens that occur more than 50 times or are in the list of correctly spelled words

(
    paris_all_profession_token_sets_after_round_2,
    replacement_scores_round_after_round_2,
) = merge_close_spellings(
    profession_token_sets_full_dataset=paris_all_profession_token_sets_after_round_1,
    correctly_spelled_words=correctly_spelled_tokens,
    minimum_token_frequency=MINIMUM_TOKEN_FREQUENCY,
    minimum_similarity=MINIMUM_TOKEN_SIMILARITY,
)

#### Saving 

##### The updated list of lists of token sets after round 2, saved to disk as a pickle file because it contains frozensets

In [None]:
with open(
    intermediate_steps_folder_prefix + "paris_all_profession_token_sets_after_round_2.pickle", "wb"
) as outfile:
    pickle.dump(paris_all_profession_token_sets_after_round_2, outfile)

##### The replacement similarity scores after round 2, saved to disk as a pickle file because they are stored in a list

In [None]:
with open(
    intermediate_steps_folder_prefix + "replacement_scores_round_after_round_2.pickle", "wb"
) as outfile:
    pickle.dump(replacement_scores_round_after_round_2, outfile)

#### Reading

##### The updated list of lists of token sets from disk after round 2

In [None]:
with open(
    intermediate_steps_folder_prefix + "paris_all_profession_token_sets_after_round_2.pickle", "rb"
) as infile:
    paris_all_profession_token_sets_after_round_2 = pickle.load(infile)

##### The replacement similarity scores from disk after round 2

In [None]:
with open(
    intermediate_steps_folder_prefix + "replacement_scores_round_after_round_2.pickle", "rb"
) as infile:
    replacement_scores_round_after_round_2 = pickle.load(infile)

### Round 3

In [None]:
# get the current token sets for each profession
token_sets_after_round_2 = get_current_token_sets(
    paris_all_profession_token_sets_after_round_2
)

# get the set of unique tokens from the dataset
indv_tokens_after_round_2 = set(get_tok_count(token_sets_after_round_2).keys())

# wrap every token occurrence in its own single-token set, so that round 3 works on individual tokens rather than token sets
full_inv_token_after_round_2 = [
    [frozenset([tok])] for tok_set in token_sets_after_round_2 for tok in tok_set
]

In [None]:
print(
    "The number of unique tokens that represent the professions before round 3 is {}, \nand {} of them are in the dictionary ({}%).".format(
        len(indv_tokens_after_round_2),
        num_of_correctly_spelled_tokens,
        round((num_of_correctly_spelled_tokens / len(indv_tokens_after_round_2)) * 100),
    )
)

In [None]:
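# Round 3: merge individual tokens with close spellings.
# The minimum token frequency is zero, so no token is filtered out by frequency.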

(
    unique_tokens_after_merging,
    replacement_scores_round_after_round_3,
) = merge_close_spellings(
    profession_token_sets_full_dataset=full_inv_token_after_round_2,
    correctly_spelled_words=correctly_spelled_tokens,
    minimum_token_frequency=0,
    minimum_similarity=MINIMUM_TOKEN_SIMILARITY,
)

#### Saving 

##### All unique tokens (with their history of changes) after round 3, saved to disk as a pickle file because they are stored as frozensets

In [None]:
with open(
    intermediate_steps_folder_prefix + "unique_tokens_with_changes_after_round_3.pickle", "wb"
) as outfile:
    pickle.dump(unique_tokens_after_merging, outfile)

##### The replacement similarity scores after round 3, saved to disk as a pickle file because they are stored in a list

In [None]:
with open(
    intermediate_steps_folder_prefix + "replacement_scores_round_after_round_3.pickle", "wb"
) as outfile:
    pickle.dump(replacement_scores_round_after_round_3, outfile)

#### Reading

##### The unique tokens (with their history of changes) from disk after round 3

In [None]:
with open(
    intermediate_steps_folder_prefix + "unique_tokens_with_changes_after_round_3.pickle", "rb"
) as infile:
    unique_tokens_after_merging = pickle.load(infile)

##### The replacement similarity scores from disk after round 3

In [None]:
with open(
    intermediate_steps_folder_prefix + "replacement_scores_round_after_round_3.pickle", "rb"
) as infile:
    replacement_scores_round_after_round_3 = pickle.load(infile)

### Updating Profession token sets after Round 3

The result of the third round of merging is a history of changes for each unique token in the dataset. To apply these changes to the dataset, the next step creates a dictionary that maps each token before round 3 to the same token after round 3. Only the tokens that have changed are included in the dictionary.

In [None]:
# a dictionary to hold the changes to the tokens
unique_token_changes = {}

for unique_token_updates in unique_tokens_after_merging:
    # each entry is the history of changes of a single-token set
    # the token before round 3
    original_tok = list(unique_token_updates[0])[0]
    # the token after round 3
    changed_tok = list(unique_token_updates[-1])[0]

    if original_tok != changed_tok:
        # the token has changed
        if not check_presence(original_tok, unique_token_changes):
            # create an entry in the dictionary the first time the token is seen
            unique_token_changes[original_tok] = changed_tok

In [None]:
def update_unique_token_sets_after_round_3(
    current_unique_profession_token_sets: List[FrozenSet[str]],
    individual_token_changes: Dict[str, str],
) -> Dict[FrozenSet[str], FrozenSet[str]]:
    """The function accepts the unique token sets and the individual token changes after round 3 and returns a dictionary with the token sets before the update as keys and the updated token sets as values.

    Parameters
    ----------
    current_unique_profession_token_sets : List[FrozenSet[str]]
        A list of unique token sets for the professions.
    individual_token_changes : Dict[str, str]
        A dictionary with the tokens before round 3 as keys and the tokens after round 3 as values.

    Returns
    -------
    Dict[FrozenSet[str], FrozenSet[str]]
        A dictionary with the token sets before the update as keys and the updated token sets as values.
    """
    updated_unique_profession_token_sets = {}
    for curr_tokens_set in current_unique_profession_token_sets:
        updated_unique_profession_token_sets[curr_tokens_set] = frozenset(
            [individual_token_changes.get(token, token) for token in curr_tokens_set]
        )

    return updated_unique_profession_token_sets

In [None]:
# get the current token sets for each profession
token_sets_after_round_2 = get_current_token_sets(
    paris_all_profession_token_sets_after_round_2
)

# get the unique token sets after round 2 (as round 3 deals with tokens and not token sets)
unique_token_sets_after_round_2 = get_unique_token_sets(token_sets_after_round_2)

# update the unique token sets with changes after round 3
updated_token_sets_mapping_after_round_3 = update_unique_token_sets_after_round_3(
    unique_token_sets_after_round_2, unique_token_changes
)

# Update the list of token sets for all the profession strings stored in a list.
paris_all_profession_token_sets_after_round_3 = []

for token_set_list in tqdm_notebook(paris_all_profession_token_sets_after_round_2):
    if check_presence(token_set_list[-1], updated_token_sets_mapping_after_round_3):
        if (
            updated_token_sets_mapping_after_round_3[token_set_list[-1]]
            != token_set_list[-1]
        ):
            paris_all_profession_token_sets_after_round_3.append(
                token_set_list
                + [updated_token_sets_mapping_after_round_3[token_set_list[-1]]]
            )
            continue
    paris_all_profession_token_sets_after_round_3.append(token_set_list)

In [None]:
# get the current token sets for each profession after round 3
token_sets_after_round_3 = get_current_token_sets(
    paris_all_profession_token_sets_after_round_3
)

# get the set of unique tokens from the dataset after round 3
indv_tokens_after_round_3 = get_tok_count(token_sets_after_round_3)

### Saving the tokens to csv file

The token sets are converted from frozensets to strings (separated by spaces), as frozensets cannot be stored directly in a CSV file. Earlier, only one column was read to save memory. Here, the full file is read and a new column is added.
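
The round trip between a frozenset and its space-separated string can be sketched as follows. The token set is a made-up example, and the `sorted` call is an addition of this sketch rather than something the notebook does: it makes the serialized string reproducible, since the iteration order of a frozenset of strings can differ between runs.

```python
# A made-up token set, for illustration only.
token_set = frozenset({"vins", "marchand"})

# Serialize: join the tokens with spaces (sorted here for a stable order).
serialized = " ".join(sorted(token_set))

# Deserialize: splitting at spaces recovers the original token set.
restored = frozenset(serialized.split())
```

Because a token set is unordered, the order of the tokens in the string carries no meaning; only the set of tokens matters when the column is read back.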

In [None]:
# read the full dataset
paris_jobs_after_s2 = pd.read_csv(
    "./../data/intermediate_steps/all_paris_jobs_after_step_2.csv",
    dtype={
        "doc_id": "str",
        "page": "str",
        "row": "str",
        "Nom": "str",
        "métier_original": "str",
        "rue": "str",
        "numéro": "str",
        "annee": "str",
        #"gallica_page": "str",
        "métier": "str",
        "metier_after_S1": "str",
        "metier_after_S2": "str",
    },
    header=0,
    encoding="utf-8",
)

# serialize the token sets to strings (separated by space)
metier_strings_after_round_3 = [
    " ".join(tok_frozen_set) for tok_frozen_set in token_sets_after_round_3
]

# add the new column to the dataset
paris_jobs_after_s2["metier_after_S3"] = metier_strings_after_round_3

paris_jobs_after_s2.to_csv(
    intermediate_steps_folder_prefix + "all_paris_jobs_after_step_3.csv", index=False
)

In [None]:
print(
    "The number of unique tokens that represent the professions after round 3 is {}, \nand {} of them are in the dictionary ({}%).".format(
        len(indv_tokens_after_round_3),
        num_of_correctly_spelled_tokens,
        round((num_of_correctly_spelled_tokens / len(indv_tokens_after_round_3)) * 100),
    )
)

In [None]:
# Store the unique tokens after round 3 in a dictionary sorted by decreasing count, with the token as key and a nested dictionary holding its count as value

sorted_unique_tokens_after_round_3 = {
    k: {"count": v}
    for k, v in sorted(
        indv_tokens_after_round_3.items(), key=lambda e: e[1], reverse=True
    )
}

### Saving 

##### The updated list of lists of token sets after round 3, saved to disk as a pickle file because it contains frozensets

In [None]:
with open(
    intermediate_steps_folder_prefix + "paris_all_profession_token_sets_after_round_3.pickle",
    "wb",
) as outfile:
    pickle.dump(paris_all_profession_token_sets_after_round_3, outfile)

##### The unique tokens after round 3, saved to disk as a JSON file

In [None]:
with open(
    intermediate_steps_folder_prefix + "unique_tokens_after_round_3.json",
    "w",
    encoding="utf8",
) as outfile:
    json.dump(sorted_unique_tokens_after_round_3, outfile, indent=4, ensure_ascii=False)

### Reading

##### The updated list of lists of token sets from disk after round 3

In [None]:
with open(
    intermediate_steps_folder_prefix + "paris_all_profession_token_sets_after_round_3.pickle",
    "rb",
) as infile:
    paris_all_profession_token_sets_after_round_3 = pickle.load(infile)

##### The unique tokens from disk after round 3

In [None]:
with open(
    intermediate_steps_folder_prefix + "unique_tokens_after_round_3.json",
    encoding="utf8",
) as f:
    sorted_unique_tokens_after_round_3 = json.load(f)

## Completing the Abbreviations with the full word (Step 4)

In the previous step, the tokens that do not have a correct spelling were iteratively merged with the closest correctly spelled word. In this last algorithmic step, the tokens that are classified as abbreviations are completed into full words. Any token containing a dot (`.`) is classified as an abbreviation. The idea is drawn from round 1 of the previous step, i.e. the contextual tokens are used to decide the full form of an abbreviation. After the abbreviations are filled with contextual information, the remaining ones are filled in a second pass based only on frequency and similarity.
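
The classification rule can be illustrated with a short sketch. The helper name `collect_potential_abbreviations` and the example token sets are hypothetical and only serve to show the rule "a token containing a dot is an abbreviation":

```python
from collections import Counter
from typing import FrozenSet, List


def collect_potential_abbreviations(token_sets: List[FrozenSet[str]]) -> Counter:
    """Count every token that contains a dot across the given token sets."""
    return Counter(tok for tok_set in token_sets for tok in tok_set if "." in tok)


# Made-up token sets, for illustration only.
example_token_sets = [
    frozenset({"fab.", "chapeaux"}),
    frozenset({"fabricant", "chapeaux"}),
    frozenset({"fab.", "gants"}),
]
abbreviations = collect_potential_abbreviations(example_token_sets)
```

Here `fab.` is the only potential abbreviation, and it is counted twice because it appears in two token sets.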

### Algorithm for completing the abbreviations

1. The current token sets for all the entries in the dataset are considered.
2. The abbreviations are completed using the contextual tokens (see [Algorithm for completing the abbreviations with support](#algorithm-for-completing-the-abbreviations-with-support))
3. The tokens that still contain a dot after _2_ are collected as potential abbreviations.
4. The abbreviations are completed without contextual tokens (see [Algorithm for completing the abbreviations without support](#algorithm-for-completing-the-abbreviations-without-support))

#### Algorithm for completing the abbreviations with support

1. While abbreviations can still be filled, continue the iteration over the current list of token sets per profession. (Token sets with only one token are ignored in this sub-step as they do not have contextual information.)
2. Get the co-occurrence frequency over the complete dataset for every combination of two or more tokens of a given token set, i.e. for a set of 3 tokens T1, T2 and T3, the combinations are `{T1, T2}`, `{T1, T3}`, `{T2, T3}` and `{T1, T2, T3}`.
3. The unique token sets are clustered based on the length. 
4. For each length of the token set
    1. For each token set of the considered token set length
        1. If not all the tokens are in the list of correctly spelled words and there is any token with a dot
            1. Compare with all other token sets of the same length that have at least one token in the list of correctly spelled words
                1. If the two token sets differ by exactly one token (i.e. all the tokens are the same except one), the differing token in one set has a dot (the abbreviation) and the differing token in the other set is in the list of correctly spelled words (the full form), the full form is longer than the abbreviation, and the 2-gram Jaccard similarity between the abbreviation without its dot and the full form truncated to the length of the abbreviation without its dot (called the modified Jaccard score) is greater than the set threshold
                    1. Using the tokens common to both token sets as a base, store all such pairs of abbreviation and full form tokens along with the modified Jaccard score and the frequency of the full form with the common tokens.
5. For each base of token sets,
    1. For each abbreviation and its possible full forms, select the full form with the highest modified Jaccard score, breaking ties by the frequency of the full form with the common tokens, to produce a base-wise update mapping that stores each token and the token to which it should be updated.
6. Using the base-wise full forms, the token sets containing those abbreviations are updated. While updating the abbreviations, a counter is created that records the number of times an abbreviation is replaced by a particular full form. Note that this counter is computed over the unique token sets rather than the full dataset.
7. If there are no more full-form suggestions from _5_, stop the iteration; otherwise continue from _2_.
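
The modified Jaccard score and the selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation (which lives in `get_possible_full_form` and may differ): the helper names `char_bigrams`, `modified_jaccard` and `pick_full_form` are hypothetical, all dots are stripped from the abbreviation, a score of 0.0 is returned when neither string has a 2-gram, and the frequencies are made up.

```python
from typing import Dict, Set, Tuple


def char_bigrams(text: str) -> Set[str]:
    """Return the set of character 2-grams of a string."""
    return {text[i : i + 2] for i in range(len(text) - 1)}


def modified_jaccard(abbreviation: str, full_form: str) -> float:
    """2-gram Jaccard similarity between the abbreviation without dots and
    the full form truncated to the same length (assumption: 0.0 if no 2-grams)."""
    abbr = abbreviation.replace(".", "")
    truncated = full_form[: len(abbr)]
    abbr_grams, full_grams = char_bigrams(abbr), char_bigrams(truncated)
    union = abbr_grams | full_grams
    if not union:
        return 0.0
    return len(abbr_grams & full_grams) / len(union)


def pick_full_form(candidates: Dict[str, Tuple[float, int]]) -> str:
    """Select the full form with the highest score, then the highest frequency."""
    return max(candidates, key=lambda full_form: candidates[full_form])


# Candidate full forms for "fab." with made-up co-occurrence frequencies.
candidates = {
    "fabricant": (modified_jaccard("fab.", "fabricant"), 120),
    "fabrique": (modified_jaccard("fab.", "fabrique"), 45),
    "facteur": (modified_jaccard("fab.", "facteur"), 80),
}
best = pick_full_form(candidates)
```

Truncating the full form means that an abbreviation is compared only against the beginning of the word, so `fab.` matches `fabricant` and `fabrique` equally well and the frequency decides between them.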

#### Algorithm for completing the abbreviations without support

1. For each possible abbreviation
    1. If the same abbreviation was filled while using the support, the full form that most often replaced it while using the support is used to complete the unfilled abbreviation.
    2. If the abbreviation was not filled while using the support, the closest abbreviation that was filled using the support is found with the `fuzz.ratio` similarity. If multiple filled abbreviations have the same similarity to the unfilled one, the filled abbreviation with the highest frequency is chosen, and the unfilled abbreviation is filled with the full form that most often replaced the chosen abbreviation while using the support.
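
The two cases above can be sketched as below. This is a hedged illustration rather than the notebook's `fill_abbreviations_without_support`: the function name `fill_without_support`, its arguments and the example counts are hypothetical, and `difflib.SequenceMatcher` from the standard library is swapped in for `fuzz.ratio` so that the example has no third-party dependency (the two measures are similar in spirit but not guaranteed to give identical scores).

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, Optional


def similarity(first: str, second: str) -> float:
    """Stand-in for `fuzz.ratio`: a similarity score on a 0-100 scale."""
    return 100 * SequenceMatcher(None, first, second).ratio()


def fill_without_support(
    abbreviation: str,
    replacement_counts: Dict[str, Counter],
    abbreviation_frequency: Dict[str, int],
) -> Optional[str]:
    """Pick a full form using only the replacements made with support."""
    if not replacement_counts:
        return None
    if abbreviation in replacement_counts:
        # Case 1: reuse the full form that most often replaced this abbreviation.
        return replacement_counts[abbreviation].most_common(1)[0][0]
    # Case 2: find the closest filled abbreviation, preferring the more frequent one on ties.
    closest = max(
        replacement_counts,
        key=lambda filled: (
            similarity(abbreviation, filled),
            abbreviation_frequency.get(filled, 0),
        ),
    )
    return replacement_counts[closest].most_common(1)[0][0]


# Made-up counters of (abbreviation -> full form) replacements and frequencies.
replacement_counts = {
    "fab.": Counter({"fabricant": 30, "fabrique": 4}),
    "tail.": Counter({"tailleur": 12}),
}
abbreviation_frequency = {"fab.": 500, "tail.": 200}

same_abbreviation = fill_without_support("fab.", replacement_counts, abbreviation_frequency)
close_abbreviation = fill_without_support("fabr.", replacement_counts, abbreviation_frequency)
```

In this example `fabr.` was never filled with support, so it borrows the most frequent full form of its closest filled neighbour, `fab.`.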


### Utility functions to fill the abbreviations 

The functions are

1. `get_cooccurring_frequency`: Returns the co-occurrence frequency using the current list of token sets with all possible combinations of tokens for a given token set.
2. `get_possible_full_form`: The function accepts the token sets grouped length-wise along with a set of correctly spelled words, the co-occurrence frequency of the tokens, and a threshold score for the similarity of full form and abbreviation, and returns a dictionary with the common tokens as keys and the abbreviation and potential full forms as values.
3. `get_full_form_suggestions_per_unique_tokens_set`: The function accepts a list of possible full forms for a given abbreviation with a set of common tokens between the abbreviation and the full form and selects the full form that is most similar and most frequent.
4. `fill_abbreviations_with_support`: The function accepts the current token sets for the full dataset and iteratively fills the abbreviations with support until no more abbreviations can be filled.
5. `fill_abbreviations_without_support`: The function accepts the unique tokens left after filling abbreviations using contextual tokens and other related arguments and returns a mapping of unfilled abbreviations to a full form and the counter of abbreviation and full form pairs.
6. `fill_abbreviations_with_full_forms`: The function accepts the list of token sets after round 3 of step 3 along with the list of correctly spelled words and the similarity thresholds, and returns the list of lists of token sets after completing full forms for abbreviations with and without the support, together with the counter of abbreviation and full form pairs.

In [None]:
def get_cooccurring_frequency(
    current_profession_token_sets: List[FrozenSet[str]],
) -> Dict[FrozenSet[str], int]:
    """Returns the co-occurrence frequency over the current list of token sets for every combination of two or more tokens of a given token set, i.e. for a set of 3 tokens T1, T2 and T3, the combinations are `{T1, T2}`, `{T1, T3}`, `{T2, T3}` and `{T1, T2, T3}`.

    Parameters
    ----------
    current_profession_token_sets : List[FrozenSet[str]]
        The list of profession token sets for the full dataset.

    
    Returns
    -------
    pair_wise_count : Dict[FrozenSet[str], int]
        A dictionary with a frozenset of tokens as key and the number of times they appeared together as the value.
    """

    # a dictionary to store cooccurring frequency of tokens with the words as key and count as value
    pair_wise_count = {}

    for prfes_toks in current_profession_token_sets:
        # for each token set for profession
        if len(prfes_toks) > 1:
            # if there are more than one tokens, otherwise there is no cooccurring frequency
            for comb_len in range(2, len(prfes_toks) + 1):
                # for all length combinations of tokens
                for tok_pair in itertools.combinations(prfes_toks, comb_len):
                    # for each combination
                    frozen_tok_pair = frozenset(tok_pair)
                    if frozen_tok_pair not in pair_wise_count:
                        pair_wise_count[frozen_tok_pair] = 0
                    # increase the count
                    pair_wise_count[frozen_tok_pair] += 1
    return pair_wise_count


def get_possible_full_form(
    length_wise_token_sets: Dict[int, List[FrozenSet[str]]],
    prev_len_wise_token_sets: Dict[int, List[FrozenSet[str]]],
    correctly_spelled_words: Set[str],
    tokens_cooccur_freq: Dict[FrozenSet[str], int],
    abbvr_similarity_theshold: float,
) -> Dict[FrozenSet[str], Dict[str, Dict[str, Tuple[float, int]]]]:
    """The function accepts the token sets grouped length-wise along with a set of correctly spelled words, the tokens cooccurring frequency, and a threshold score for similarity of full form and abbreviation and return a dictionary with the common tokens as keys and the abbreviation and potential full forms as values.

    Parameters
    ----------
    length_wise_token_sets : Dict[int, List[FrozenSet[str]]]
         A dictionary with the length of the token sets as the key and the token sets as the values in a list.
    prev_len_wise_token_sets : Dict[int, List[FrozenSet[str]]]
        A dictionary with length of the token sets as the key and the token sets as the values in a list at previous iteration.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    tokens_cooccur_freq : Dict[FrozenSet[str], int]
        A dictionary with a frozenset of tokens as key and the number of times they appeared together as the value.
    abbvr_similarity_theshold : float
        The minimum threshold for the modified Jaccard similarity score between the full form and the abbreviation.

    
    Returns
    -------
    possible_full_forms : Dict[FrozenSet[str], Dict[str, Dict[str, Tuple[float, int]]]]
        A three-level dictionary with the common tokens between two token sets as primary key, the abbreviation as the secondary key and the full form as the tertiary key, and the tuple of modified Jaccard similarity score and the frequency of full form with the common tokens as value.
    """

    possible_full_forms = {}
    for str_len, str_lists in length_wise_token_sets.items():
        # for token sets of same length
        if str_len != 1:
            # ignore the sets with only one token

            if set(str_lists) != set(prev_len_wise_token_sets.get(str_len, [])):
                # If the token sets of a given length at this iteration differ from those of the same length in the previous iteration, continue with the process. If they are identical, no abbreviation of that length was filled in the last iteration, so checking for full forms again is redundant.

                print("Checking for token sets of length {}".format(str_len))

                len_str_lists = len(str_lists)
                for ind_lst_1 in tqdm_notebook(range(len_str_lists)):
                    # for each token set
                    tok_set1 = str_lists[ind_lst_1]
                    if (
                        not all(tok in correctly_spelled_words for tok in tok_set1)
                    ) and any("." in tok for tok in tok_set1):
                        # if not all the tokens are in the dictionary and there is at least one token with a dot
                        for ind_lst_2 in range(ind_lst_1 + 1, len_str_lists):
                            # for each pair of token sets of same length
                            tok_set2 = str_lists[ind_lst_2]
                            if any(tok in correctly_spelled_words for tok in tok_set2):
                                # if at least one token is in the list of correctly spelled words,
                                # get the tokens that differ between the two token sets (these act as the secondary and tertiary keys)
                                set_sym_diff = tok_set1 ^ tok_set2
                                if len(set_sym_diff) == 2:
                                    # if the two token sets only differ by one element
                                    abbvr, full_form = set_sym_diff
                                    if "." in full_form:
                                        abbvr, full_form = full_form, abbvr

                                    if (
                                        (("." in abbvr) and ("." not in full_form))
                                        and check_presence(
                                            full_form, correctly_spelled_words
                                        )
                                        and (len(full_form) > len(abbvr))
                                    ):
                                        # if only the abbreviation has a dot and the longer full form is in the list of correctly spelled words

                                        froz_set_same = frozenset(
                                            tok_set1.intersection(tok_set2)
                                        )

                                        if froz_set_same not in possible_full_forms:
                                            possible_full_forms[froz_set_same] = {}

                                        if (
                                            abbvr
                                            not in possible_full_forms[froz_set_same]
                                        ):
                                            possible_full_forms[froz_set_same][
                                                abbvr
                                            ] = {}
                                        if (
                                            full_form
                                            not in possible_full_forms[froz_set_same][
                                                abbvr
                                            ]
                                        ):
                                            # get the modified jaccard score
                                            modified_jaccard_score = (
                                                textdistance.Jaccard(
                                                    qval=2
                                                ).normalized_similarity(
                                                    abbvr[:-1],
                                                    full_form[: len(abbvr) - 1],
                                                )
                                                * 100
                                            )
                                            if (
                                                modified_jaccard_score
                                                > abbvr_similarity_theshold
                                            ):
                                                # if the modified jaccard score is greater than the threshold, create the mapping
                                                possible_full_forms[froz_set_same][
                                                    abbvr
                                                ][full_form] = (
                                                    modified_jaccard_score,
                                                    tokens_cooccur_freq.get(
                                                        froz_set_same.union(
                                                            frozenset([full_form])
                                                        ),
                                                        0,
                                                    ),
                                                )
            else:
                print("Skipping checking for token sets of length {}".format(str_len))
    return possible_full_forms


def get_full_form_suggestions_per_unique_tokens_set(
    possible_full_forms: Dict[FrozenSet[str], Dict[str, Dict[str, Tuple[float, int]]]]
) -> Dict[FrozenSet[str], Dict[str, str]]:
    """The function accepts the possible full forms for each abbreviation, grouped by the set of common tokens shared by the abbreviation and the full form, and selects the full form with the highest similarity score, breaking ties by frequency.

    Parameters
    ----------
    possible_full_forms : Dict[FrozenSet[str], Dict[str, Dict[str, Tuple[float, int]]]]
        A three-level dictionary with the common tokens between two token sets as primary key, the abbreviation as the secondary key and the full form as the tertiary key, and the tuple of modified Jaccard similarity score and the frequency of full form with the common tokens as value.

    
    Returns
    -------
    replacements_per_token_set: Dict[FrozenSet[str], Dict[str, str]]
        A two-level dictionary with the common tokens between two token sets as the primary key, the abbreviation as the secondary key, and the full form as the value.
    """

    # dictionary to store the full form for abbreviation per common base
    replacements_per_token_set = {}

    for context_set, abvr_full_form_map in possible_full_forms.items():
        # for each common set of tokens
        for abvr, possible_full_forms_tuples in abvr_full_form_map.items():
            # for each abbreviation per common base
            if possible_full_forms_tuples:
                # if there are any possible full forms
                selected_full_forms = sorted(
                    possible_full_forms_tuples.items(),
                    key=lambda item: (item[1][0], item[1][1]),
                    reverse=True,
                )[0]
                # the possible full forms are sorted by similarity score and then by the frequency of the full form with the common set of tokens
                # the first element of the sorted list is selected and the mapping is added to the dictionary.
                tok_set = context_set.union(frozenset([abvr]))
                if tok_set not in replacements_per_token_set:
                    replacements_per_token_set[tok_set] = {}
                if abvr not in replacements_per_token_set[tok_set]:
                    replacements_per_token_set[tok_set][abvr] = selected_full_forms[0]

    return replacements_per_token_set


def fill_abbreviations_with_support(
    current_profession_token_sets: List[FrozenSet[str]],
    correctly_spelled_words: Set[str],
    abbvr_similarity_theshold: float,
) -> Tuple[List[FrozenSet[str]], Dict[str, Dict[str, int]]]:
    """The function accepts the current token sets for the full dataset and iteratively fills the abbreviations based on the common tokens in other sets until no more abbreviations can be filled.

    Parameters
    ----------
    current_profession_token_sets : List[FrozenSet[str]]
        The list of profession token sets for the full dataset.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    abbvr_similarity_theshold : float
        The minimum threshold for the modified Jaccard similarity score between the full form and the abbreviation.

    
    Returns
    -------
    current_profession_token_sets : List[FrozenSet[str]]
        The list of profession token sets for the full dataset after filling full forms with support.
    filled_abbreviations : Dict[str, Dict[str, int]]
        A two-level dictionary with the abbreviation as the first key and the full form as the secondary key and the number of times the abbreviation is filled with that full form per unique set of tokens as the count.
    """

    fill_abbreviations = True
    iteration_counter = 0
    filled_abbreviations = {}
    # A dictionary to store the length-wise clustered profession token sets before the update; it is used to stop the (filling) iteration for a particular length when its token sets no longer change.
    prev_token_sets_per_length = {}

    while fill_abbreviations:
        # try to fill the abbreviations as long as possible
        iteration_counter += 1
        print("\nFilling abbreviations: iteration {}".format(iteration_counter))

        # get the unique tokens sets
        unique_profes_toks = get_unique_token_sets(current_profession_token_sets)
        # get the length wise clustered token sets
        token_sets_per_length = get_length_wise_token_sets(unique_profes_toks)
        # get the tokens cooccurring frequency
        cooccr_freq_map = get_cooccurring_frequency(current_profession_token_sets)

        # get the possible full forms for each abbreviation
        full_form_suggestions = get_possible_full_form(
            token_sets_per_length,
            prev_token_sets_per_length,
            correctly_spelled_words,
            cooccr_freq_map,
            abbvr_similarity_theshold,
        )

        # assign the current clustering of token sets per length as previous iteration's token sets per length to be used in the next iteration to break the merge loop
        prev_token_sets_per_length = token_sets_per_length

        # select the full form for abbreviation
        full_form_sugges_per_token_set = get_full_form_suggestions_per_unique_tokens_set(
            full_form_suggestions
        )

        if full_form_sugges_per_token_set:
            # if there are some full forms for abbreviations
            updated_unique_token_sets = {}

            # change the abbreviation to the full form and increase or create a count for the abbreviation full form pair
            for org_tok_set, replacements in full_form_sugges_per_token_set.items():

                new_tokens = []
                for org_tok in org_tok_set:
                    new_token = replacements.get(org_tok, org_tok)
                    new_tokens.append(new_token)
                    if new_token != org_tok:
                        if org_tok not in filled_abbreviations:
                            filled_abbreviations[org_tok] = {}
                        if new_token not in filled_abbreviations[org_tok]:
                            filled_abbreviations[org_tok][new_token] = 0
                        filled_abbreviations[org_tok][new_token] += 1

                updated_unique_token_sets[org_tok_set] = frozenset(new_tokens)

            # update the token sets with the new full forms and continue filling the abbreviations
            current_profession_token_sets = [
                updated_unique_token_sets.get(token_set, token_set)
                for token_set in current_profession_token_sets
            ]
        else:
            # if there aren't any suggestions for abbreviations then stop the iteration
            fill_abbreviations = False

    return current_profession_token_sets, filled_abbreviations


def fill_abbreviations_without_support(
    possible_abbreviations: List[str],
    filled_abvr_full_forms: Dict[str, Dict[str, int]],
    tokens_counter: Counter_type[str],
    min_abvr_abvr_similarity: float,
) -> Tuple[Dict[str, str], Dict[str, Dict[str, int]]]:
    """The function accepts the unique tokens left after filling abbreviations using contextual tokens and other related arguments and returns a mapping of unfilled abbreviations to a full form and the counter of abbreviation and full form.

    Parameters
    ----------
    possible_abbreviations : List[str]
        A list of unique tokens from the dataset that contain a dot after filling the abbreviations with support (unfilled abbreviations).
    filled_abvr_full_forms : Dict[str, Dict[str, int]]
        A two-level dictionary with the abbreviation as the first key and the full form as the secondary key and the number of times the abbreviation is filled with that full form per unique set of tokens as the count.
    tokens_counter : Counter_type[str]
        The frequency counter of the individual tokens in the dataset.
    min_abvr_abvr_similarity : float
        The minimum threshold between the filled and unfilled abbreviations to be considered similar.

    
    Returns
    -------
    new_abvr_fullforms_mapping : Dict[str, str]
        A dictionary with unfilled abbreviation as the key and the selected full form as the value.
    filled_abvr_full_forms_updated : Dict[str, Dict[str, int]]
        A two-level dictionary with the abbreviation as the first key and the full form as the secondary key and the number of times the abbreviation is filled with that full form per unique set of tokens as the count.
    """

    filled_abvr_full_forms_updated = copy.deepcopy(filled_abvr_full_forms)
    filled_abbreviations = set(filled_abvr_full_forms.keys())
    new_abvr_fullforms_mapping = {}

    print("Filling abbreviations without support")

    for pos_abvr in tqdm_notebook(possible_abbreviations):
        # for each unfilled abbreviation
        filled_dict = None
        if pos_abvr in filled_abvr_full_forms:
            # if the same abbreviation was filled in the previous step then get the most frequently substituted full form
            filled_dict = filled_abvr_full_forms[pos_abvr]
        else:
            # if the abbreviation was not filled previously, get the closest filled abbreviations.
            # If several are close, select the one with the highest similarity score and, on a tie, the highest frequency in the entire dataset.

            close_abvrs = process.extract(
                pos_abvr,
                filled_abbreviations,
                processor=simple_processor,
                scorer=fuzz.ratio,
                score_cutoff=min_abvr_abvr_similarity,
            )

            selected_filled_abvr = None
            if close_abvrs:
                if len(close_abvrs) == 1:
                    # if there is only one abbreviation, select it.
                    selected_filled_abvr = close_abvrs[0][0]
                else:
                    # if there is more than one possible abbreviation, keep those with the highest similarity score
                    sim_scores = [sim_score for _, sim_score, _ in close_abvrs]
                    abvr_contenders = np.array(close_abvrs)[
                        np.where(sim_scores == np.max(sim_scores))
                    ]
                    if len(abvr_contenders) == 1:
                        # select the abbreviation with highest similarity score to merge with.
                        selected_filled_abvr = abvr_contenders[0][0]
                    else:
                        # if there are more than one possible abbreviations with same similarity score, then they are sorted based on their frequency in the dataset.
                        abvr_freq = [
                            tokens_counter[mergee] for mergee, _, _ in abvr_contenders
                        ]
                        abvr_contenders_freq = np.array(abvr_contenders)[
                            np.where(abvr_freq == np.max(abvr_freq))
                        ]
                        if len(abvr_contenders_freq) == 1:
                            # select the token with highest frequency to merge with.
                            selected_filled_abvr = abvr_contenders_freq[0][0]

            if selected_filled_abvr:
                # if any filled abbreviation is selected, get the full forms of the filled abbreviation.
                filled_dict = filled_abvr_full_forms[selected_filled_abvr]

        if filled_dict:
            # select the full form that is replaced highest number of times and update the count
            suggested_fullform = max(filled_dict, key=filled_dict.get)

            new_abvr_fullforms_mapping[pos_abvr] = suggested_fullform

            if pos_abvr not in filled_abvr_full_forms_updated:
                filled_abvr_full_forms_updated[pos_abvr] = {}
            if suggested_fullform not in filled_abvr_full_forms_updated[pos_abvr]:
                filled_abvr_full_forms_updated[pos_abvr][suggested_fullform] = 0

            filled_abvr_full_forms_updated[pos_abvr][suggested_fullform] += 1
    return new_abvr_fullforms_mapping, filled_abvr_full_forms_updated


def fill_abbreviations_with_full_forms(
    profession_token_sets_full_dataset: List[List[FrozenSet[str]]],
    correctly_spelled_words: Set[str],
    abbvr_full_form_similarity_theshold: float,
    min_abvr_abvr_similarity: float,
) -> Tuple[List[List[FrozenSet[str]]], Dict[str, Dict[str, int]]]:
    """The function accepts the list of token sets after round 3 of step 3 along with the list of correctly spelled words and the similarity thresholds, and returns the list of lists of token sets after filling abbreviations with full forms, with and without support, together with the counter for each abbreviation and full form pair.

    Parameters
    ----------
    profession_token_sets_full_dataset : List[List[FrozenSet[str]]]
        A list of token sets for the unique profession strings is stored in a list. The token sets are stored in a list to track the changes.
    correctly_spelled_words : Set[str]
        The set of tokens from the unique tokens for the dataset belonging to the list of correctly spelled words.
    abbvr_full_form_similarity_theshold : float
        The minimum threshold for the modified Jaccard similarity score between the full form and the abbreviation.
    min_abvr_abvr_similarity : float
        The minimum threshold between the filled and unfilled abbreviations to be considered similar.

    Returns
    -------
    token_sets_full_dataset_with_full_forms : List[List[FrozenSet[str]]]
        A list of token sets for the unique profession strings is stored in a list after changing abbreviations to full forms. The token sets are stored in a list to track the changes.
    full_forms_counter : Dict[str, Dict[str, int]]
        A two-level dictionary with the abbreviation as the first key and the full form as the secondary key and the number of times the abbreviation is filled with that full form per unique set of tokens as the count.
    """

    # create a copy of the profession token sets lists to avoid in place editing.
    updated_token_sets_full_dataset_tracked = copy.deepcopy(
        profession_token_sets_full_dataset
    )

    # get the last entry from the list of list of token sets as the last entry is the last updated entry as current token sets.
    token_sets_full_dataset = [
        tok_list[-1] for tok_list in updated_token_sets_full_dataset_tracked
    ]
    # get the counter of tokens before completing full forms
    tokens_count_before_fullforms = get_tok_count(token_sets_full_dataset)

    # fill the abbreviations using contextual tokens
    (
        token_sets_full_dataset_with_support_full_forms,
        support_full_forms,
    ) = fill_abbreviations_with_support(
        token_sets_full_dataset,
        correctly_spelled_words,
        abbvr_full_form_similarity_theshold,
    )

    # get the individual tokens after filling the full form using support tokens
    indv_tokens_after_support_fullforms = get_tok_count(
        token_sets_full_dataset_with_support_full_forms
    )
    # get the unique token sets after filling the full form using support tokens
    unique_token_sets_after_support_fullforms = get_unique_token_sets(
        token_sets_full_dataset_with_support_full_forms
    )

    # get the list of tokens with dot
    left_over_abbrvs = [
        tok for tok in indv_tokens_after_support_fullforms if "." in tok
    ]

    # fill the abbreviations without using contextual tokens
    abvrs_fullforms, full_forms_counter = fill_abbreviations_without_support(
        left_over_abbrvs,
        support_full_forms,
        tokens_count_before_fullforms,
        min_abvr_abvr_similarity,
    )

    # update the token sets for full dataset by adding the possible full forms
    unique_professions_before_after_fullforms = {
        tok_froz_set: frozenset(
            abvrs_fullforms.get(token, token) for token in tok_froz_set
        )
        for tok_froz_set in unique_token_sets_after_support_fullforms
    }

    token_sets_full_dataset_with_full_forms = [
        [unique_professions_before_after_fullforms.get(token_set)]
        for token_set in token_sets_full_dataset_with_support_full_forms
    ]

    return token_sets_full_dataset_with_full_forms, full_forms_counter

Using the list of lists of token sets for the whole dataset, the abbreviations are filled.
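
As a rough illustration of how a candidate full form is scored and chosen, the sketch below re-implements the prefix comparison with the standard library only. It assumes the abbreviation ends with a dot, approximates `textdistance.Jaccard(qval=2)` with a bigram multiset overlap (edge cases may differ from the library), and uses made-up tokens and frequencies. The names `bigram_counts`, `prefix_bigram_score`, and `pick_full_form` are hypothetical and are not functions of this notebook.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def bigram_counts(text: str) -> Counter:
    """Count the overlapping two-character substrings of a string."""
    return Counter(text[i : i + 2] for i in range(len(text) - 1))


def prefix_bigram_score(abbreviation: str, full_form: str) -> float:
    """Score (0-100) the abbreviation stem against the equally long prefix of the full form."""
    stem = abbreviation[:-1]  # assumption: the last character is the dot
    prefix = full_form[: len(stem)]
    if stem == prefix:
        return 100.0
    stem_grams, prefix_grams = bigram_counts(stem), bigram_counts(prefix)
    union = sum((stem_grams | prefix_grams).values())
    if union == 0:
        # neither string is long enough to contain a bigram
        return 0.0
    return 100.0 * sum((stem_grams & prefix_grams).values()) / union


def pick_full_form(candidates: Dict[str, Tuple[float, int]]) -> Optional[str]:
    """Pick the candidate with the highest score, breaking ties by co-occurrence frequency."""
    if not candidates:
        return None
    return max(candidates.items(), key=lambda item: item[1])[0]


# Illustrative values only: two dictionary words seen with the same context tokens.
abbreviation = "fab."
candidates = {
    word: (prefix_bigram_score(abbreviation, word), freq)
    for word, freq in {"fabricant": 12, "fabrique": 3}.items()
}
selected = pick_full_form(candidates)
```

Both candidates share the prefix `fab`, so the similarity scores tie and the (invented) co-occurrence frequency decides in favour of `fabricant`.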

In [None]:
(
    paris_profession_token_sets_after_round_4,
    fullforms_counter,
) = fill_abbreviations_with_full_forms(
    paris_all_profession_token_sets_after_round_3,
    words_with_correct_spellings,
    abbvr_full_form_similarity_theshold=MINIMUM_ABVR_FULLFORM_SIMILARITY,
    min_abvr_abvr_similarity=MINIMUM_INTER_ABVR_SIMILARITY,
)

### Saving 

#### the dataset with abbreviation-filled tokens

The token sets are converted from frozen sets to strings (separated by spaces), as frozen sets cannot be saved to disk directly. Earlier, only one column was read to save space. Here, the full file is read and a new column is added.
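
A minimal sketch of this round trip, with made-up tokens: a frozenset is joined into a space-separated string and later split back. Because a frozenset has no defined order, the sketch sorts the tokens before joining so that the output is reproducible; this sorting is an assumption of the sketch, and the helper names are hypothetical.

```python
from typing import FrozenSet


def tokens_to_string(token_set: FrozenSet[str]) -> str:
    """Serialise a token set as a space-separated string (sorted for reproducibility)."""
    return " ".join(sorted(token_set))


def string_to_tokens(token_string: str) -> FrozenSet[str]:
    """Rebuild the token set; this only works if no token contains whitespace."""
    return frozenset(token_string.split())


example = frozenset({"marchand", "vin", "gros"})
as_text = tokens_to_string(example)
restored = string_to_tokens(as_text)
```

The round trip is lossless only as long as no individual token contains a space.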

In [None]:
# read the full dataset
paris_jobs_after_s3 = pd.read_csv(
    intermediate_steps_folder_prefix + "/all_paris_jobs_after_step_3.csv",
    dtype={
        "doc_id": "str",
        "page": "str",
        "row": "str",
        "Nom": "str",
        "métier_original": "str",
        "rue": "str",
        "numéro": "str",
        "annee": "str",
        #"gallica_page": "str",
        "métier": "str",
        "metier_after_S1": "str",
        "metier_after_S2": "str",
        "metier_after_S3": "str",
    },
    header=0,
    encoding="utf-8",
)

# get the current token sets for each profession after step 4
token_sets_after_round_4 = get_current_token_sets(
    paris_profession_token_sets_after_round_4
)


# serialize the token sets to strings (separated by space)
metier_strings_after_round4 = [
    " ".join(met_frozen_set) for met_frozen_set in token_sets_after_round_4
]

# add the new column to the dataset
paris_jobs_after_s3["metier_after_S4"] = metier_strings_after_round4

paris_jobs_after_s3.to_csv(
    intermediate_steps_folder_prefix + "all_paris_jobs_after_step_4.csv", index=False
)

#### the individual tokens for the dataset after merging and filling full forms

In [None]:
# Store the unique tokens after step 4 as a dictionary keyed by token; each value is a dictionary holding the token's count.

sorted_unique_tokens_after_step_4 = {
    k: {"count": v}
    for k, v in sorted(
        get_tok_count(token_sets_after_round_4).items(), key=lambda e: e[1], reverse=True
    )
}


with open(
    intermediate_steps_folder_prefix + "cleaned_unique_tokens.json",
    "w",
    encoding="utf8",
) as outfile:
    json.dump(sorted_unique_tokens_after_step_4, outfile, indent=4, ensure_ascii=False)

#### the abbreviation and full form mapping

In [None]:
with open(
    intermediate_steps_folder_prefix + "abbreviation_full_forms.json", "w", encoding="utf8"
) as outfile:
    json.dump(fullforms_counter, outfile, indent=4, ensure_ascii=False)

## Saving the tags (Step 5)

Until this step, the profession string of the dataset has been cleaned at multiple levels. In the last step, the data is stored on disk.

Data storage has two main components. The first is the complete dataset (nearly 4.5 million lines), which contains the information for all the addresses of Paris. The second is the subset of those lines that corresponds to the Richelieu district, for which the street name (rue) and the number (numéro) were cleaned in the previous phase of the project.

For the first component, the data shall be stored as a CSV. The reason for choosing a CSV file format is that the street name (rue) and the number (numéro) are not cleaned and normalized. Thus it is difficult to group the addresses spatially. The CSV file will contain the following columns

1. `annee`: The year of the entry
2. `gallica_ark`: The ARK identifier of the document on Gallica
3. `gallica_page`: The page in the document (identified through `gallica_ark`) where the entry is present.
4. `row`: The row in the page `gallica_page` where the entry is present
5. `name`: The name of the person (sometimes, it is a business/ entity)
6. `métier_from_ocr`: The profession obtained as a result of OCRisation during phase 1 of the project
7. `rue`: The street name
8. `numéro`: The number of the house in the given street
9. `tags`: A list of strings called tags that represent the profession of the person/entity.
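
One point worth keeping in mind when reading this file back: a list written to a CSV cell is stored as its string representation, so the `tags` column comes back as text rather than as a list. The sketch below uses only the standard library and an invented row to show one way of recovering the list with `ast.literal_eval`; it is an illustration, not part of the notebook's pipeline.

```python
import ast
import csv
import io

# Write one invented row whose last field is a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["annee", "name", "tags"])
writer.writerow(["1845", "Dupont", ["marchand", "vin"]])

# Read the row back: the list is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_tags = rows[0]["tags"]

# Parse the string representation back into a list of tags.
parsed_tags = ast.literal_eval(raw_tags)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.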


For the second component, the data will be stored as JSON pivoted based on the address. The format of the file will be 

```json
{
    "rue name": {
        "details": [
            {
                "rue number": {
                    "location": {
                        "geo coordinates": {
                            "latitude": "",
                            "longitude": ""
                        }
                    },
                    "people": {
                        "year": [
                            {
                                "Person Name": "",
                                "Profession": "",
                                "Tags": [],
                                "Related Data": {
                                    "Multimedia URLs": [],
                                    "Other URLs": []
                                },
                                "gallica link": ""
                            }
                        ]
                    }
                }
            }
        ]
    }
}
```
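
The sketch below shows one way such a structure could be assembled from flat records, using only the standard library. The records, the placeholder coordinates, and the helper name `add_entry` are invented for illustration, and the layout follows the template above rather than any particular implementation.

```python
import json
from typing import Any, Dict

street_data: Dict[str, Any] = {}


def add_entry(data, rue, numero, lat, lon, year, person):
    """Insert one person under street -> number -> year, creating levels as needed."""
    details = data.setdefault(rue, {"details": []})["details"]
    # find the block for this street number, or create it if it does not exist yet
    for block in details:
        if numero in block:
            address = block[numero]
            break
    else:
        address = {
            "location": {"geo coordinates": {"latitude": lat, "longitude": lon}},
            "people": {},
        }
        details.append({numero: address})
    address["people"].setdefault(year, []).append(person)


# Invented records with placeholder coordinates.
add_entry(street_data, "rue de Richelieu", "12", "48.86", "2.33", "1845",
          {"Person Name": "Dupont", "Profession": "march. de vin", "Tags": ["marchand", "vin"]})
add_entry(street_data, "rue de Richelieu", "12", "48.86", "2.33", "1850",
          {"Person Name": "Martin", "Profession": "tailleur", "Tags": ["tailleur"]})

serialised = json.dumps(street_data, ensure_ascii=False, indent=4)
```

Both records share the same street and number, so they end up in a single address block with one list of people per year.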

### Data for complete Paris (first component)

During the process of creating and cleaning the tokens, some rows in the dataset were dropped as they contained small profession strings or an empty profession. In the next step, the rows that were dropped earlier are added back.

In [None]:
# Read the dataset before any post processing
paris_jobs_before_normalising = pd.read_csv(
    "./../data/intermediate_steps/all_paris_jobs_with_gallica_pageno.csv",
    names=[
        "doc_id",
        "page",
        "row",
        "Nom",
        "métier_original",
        "rue",
        "numéro",
        "annee",
        #"gallica_page",
    ],
    dtype={
        "doc_id": "str",
        "page": "str",
        "row": "str",
        "Nom": "str",
        "métier_original": "str",
        "rue": "str",
        "numéro": "str",
        "annee": "str",
        #"gallica_page": "str",
    },
    header=0,
    encoding="utf-8",
)

In [None]:
# The dataset before adding tags is merged with dataset after four steps of cleaning on the components that remain unchanged.
paris_jobs_full_dataset = pd.merge(
    paris_jobs_before_normalising,
    paris_jobs_after_s3,
    how="outer",
    on=[
        "doc_id",
        "page",
        "row",
        "Nom",
        "rue",
        "numéro",
        "métier_original",
        "annee",
        #"gallica_page",
    ],
)
# The columns containing the tokens in the intermediate steps are removed.
paris_jobs_full_dataset.drop(
    columns=["page", "métier", "metier_after_S1", "metier_after_S2", "metier_after_S3"],
    inplace=True,
)
# The columns are renamed to understand easily.
paris_jobs_full_dataset.rename(
    columns={
        "doc_id": "gallica_ark",
        "Nom": "name",
        "métier_original": "métier_from_ocr",
        "metier_after_S4": "tags",
    },
    inplace=True,
)
# the tokens after step 4 were stored as a string separated by spaces. Now, the token string is split to produce a list of tags, which replaces the string in the "tags" column.
paris_jobs_full_dataset["tags"] = paris_jobs_full_dataset["tags"].str.split()

# This dataset is saved to disk
paris_jobs_full_dataset.to_csv(
    "./../data/outcome_of_current_project/paris_jobs_with_tags_richelieu_project.csv",
    index=False,
)

### Data for Richelieu district (second component)

The data extracted for the Richelieu district is stored in `df_addressing_with_numbers.csv`. As with the previous component, the two datasets are merged and then stored as a CSV. However, the profession under `métier` is wrong and simplified, while the `rue` is cleaned.

In [None]:
'''
richelieu_jobs_before_normalising = pd.read_csv(
    "./../data/from_previous_project/people_of_richelieu_1839_1922.csv",
    names=[
        "old_index",
        "doc_id",
        "annee",
        "page",
        "row",
        "Nom",
        "métier_simplified",
        "rue",
        "numéro",
        "latitude",
        "longitude",
    ],
    dtype={
        "old_index": "str",
        "doc_id": "str",
        "annee": "str",
        "page": "str",
        "row": "str",
        "Nom": "str",
        "métier_simplified": "str",
        "rue": "str",
        "numéro": "str",
        "latitude": "str",
        "longitude": "str",
    },
    header=0,
    encoding="utf-8",
    usecols=[
        "doc_id",
        "annee",
        "page",
        "row",
        "Nom",
        "métier_simplified",
        "rue",
        "numéro",
        "latitude",
        "longitude",
    ],
)
'''

In [None]:
'''
# The dataset of Richelieu district (without tags) is merged with dataset after four steps of cleaning on the components that remain unchanged.
richelieu_jobs_with_tags = pd.merge(
    richelieu_jobs_before_normalising,
    paris_jobs_after_s3,
    how="left",
    on=["doc_id", "page", "row", "Nom", "annee"],
)

# The columns containing the tokens in the intermediate steps are removed.
richelieu_jobs_with_tags.drop(
    columns=[
        "page",
        "métier_simplified",
        "rue_y",
        "numéro_y",
        "métier",
        "metier_after_S1",
        "metier_after_S2",
        "metier_after_S3",
    ],
    inplace=True,
)

# The columns are renamed to understand easily.
richelieu_jobs_with_tags.rename(
    columns={
        "doc_id": "gallica_ark",
        "Nom": "name",
        "métier_original": "métier_from_ocr",
        "metier_after_S4": "tags",
        "rue_x": "rue",
        "numéro_x": "numéro",
    },
    inplace=True,
)

# The tokens after step 4 were stored as a string separated by spaces. The token string is now split into a list of tags, which replaces the string in the "tags" column.
richelieu_jobs_with_tags["tags"] = richelieu_jobs_with_tags["tags"].str.split()
'''
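
The `rue_x`/`rue_y` and `numéro_x`/`numéro_y` columns handled above come from how `pd.merge` treats columns that exist in both frames but are not merge keys. The sketch below illustrates this on two tiny frames, and also shows what `str.split()` yields for rows that find no match in a left merge. All values and the `doc_a`/`doc_b` identifiers are invented for illustration; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; every value here is invented.
left = pd.DataFrame(
    {
        "doc_id": ["doc_a", "doc_b"],
        "row": ["1", "2"],
        "rue": ["Rue de Richelieu", "Rue Vivienne"],
    }
)
right = pd.DataFrame(
    {
        "doc_id": ["doc_a"],
        "row": ["1"],
        "rue": ["r. Richelieu"],
        "metier_after_S4": ["marchand vin"],
    }
)

# "rue" is present on both sides but is not a merge key, so pandas keeps
# both copies and appends the default suffixes "_x" (left) and "_y" (right).
merged = pd.merge(left, right, how="left", on=["doc_id", "row"])
print(list(merged.columns))

# Splitting the space-separated token string gives a list of tags.
# Rows without a match in the right frame hold NaN and stay NaN.
merged["tags"] = merged["metier_after_S4"].str.split()
print(merged["tags"].tolist())
```

If unmatched rows are possible in the real data, the `tags` column would mix lists and missing values, which may need handling before the JSON export.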

In [None]:
'''
richelieu_jobs_with_tags
'''

To store the data as JSON, it is first grouped by street name, number, latitude, longitude, and year.

In [None]:
'''
richelieu_jobs_with_tags_spatial_grouped = richelieu_jobs_with_tags.groupby(
    by=["rue", "numéro", "latitude", "longitude", "annee"]
)
'''

### Utility function to create a dictionary object for each entry in the dataset

In [None]:
def create_person_object(
    data_row: pd.Series,
) -> Dict[str, Union[str, Dict[str, List[str]]]]:
    """Convert a row of the dataset into a dictionary that can be stored in a JSON file.

    Parameters
    ----------
    data_row : pd.Series
        A row of the dataset with name, profession, tags, gallica ark and gallica page number

    
    Returns
    -------
    Dict[str, Union[str, Dict[str, List[str]]]]
        The data in the row is restructured as a dictionary.
    """

    return {
        "Person Name": data_row["name"],
        "Profession": data_row["métier_from_ocr"],
        "Tags": data_row["tags"],
        "Related Data": {"Multimedia URLs": [], "Other URLs": []},
        "gallica link": "https://gallica.bnf.fr/ark:/12148/{}/f{}.zoom".format(
            data_row["gallica_ark"], data_row["gallica_page"]
        ),
    }

Loop over each group of the dataframe grouped by address and year, and store the data in a nested dictionary with the street name as the primary key and the number as the secondary key. Under the secondary key, the values are the geographical coordinates and the person information organised by year.

In [None]:
'''
street_wise_data = {}

for grp_name, group in tqdm_notebook(richelieu_jobs_with_tags_spatial_grouped):
    street_name, number, lat, long, year = grp_name
    if street_name not in street_wise_data:
        street_wise_data[street_name] = {}
    if number not in street_wise_data[street_name]:
        street_wise_data[street_name][number] = {
            "geographic_coordinate": {"latitude": lat, "longitude": long},
            "people": {},
        }
    if year not in street_wise_data[street_name][number]["people"]:
        street_wise_data[street_name][number]["people"][year] = []

    for _, person_data in group.iterrows():
        street_wise_data[street_name][number]["people"][year].append(
            create_person_object(person_data)
        )
'''
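
To make the shape of the nested dictionary concrete, here is a minimal sketch of the same grouping on a toy frame. The rows, names, and coordinates are invented, and `dict.setdefault` is used as a compact alternative to the explicit `if ... not in` checks; it is an illustration under these assumptions, not the code that produced the project output.

```python
import pandas as pd

# Invented rows for illustration only; they are not taken from the dataset.
toy = pd.DataFrame(
    {
        "rue": ["Rue Vivienne", "Rue Vivienne", "Rue Vivienne"],
        "numéro": ["12", "12", "12"],
        "latitude": ["48.868", "48.868", "48.868"],
        "longitude": ["2.340", "2.340", "2.340"],
        "annee": ["1850", "1850", "1860"],
        "name": ["Person A", "Person B", "Person C"],
    }
)

nested = {}
for (rue, numero, lat, lon, annee), group in toy.groupby(
    by=["rue", "numéro", "latitude", "longitude", "annee"]
):
    # Street name is the primary key, house number the secondary key.
    address = nested.setdefault(rue, {}).setdefault(
        numero,
        {"geographic_coordinate": {"latitude": lat, "longitude": lon}, "people": {}},
    )
    # People are collected per year under the address.
    address["people"].setdefault(annee, []).extend(group["name"].tolist())

print(nested)
```

Here only the names are stored per year to keep the example short; in the notebook each entry is the dictionary returned by `create_person_object`.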

### Saving

Save the street-wise dictionary created above to a JSON file.

In [None]:
'''
with open(
    "./../data/outcome_of_current_project/street_wise_richelieu_people.json",
    "w",
    encoding="utf8",
) as outfile:
    json.dump(street_wise_data, outfile, indent=4, ensure_ascii=False)
'''
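
As a quick check of the export, the following sketch writes a toy dictionary with the same nesting to a temporary file and reads it back. The data and the file name `toy_street_wise.json` are hypothetical; the point is that `ensure_ascii=False` together with `encoding="utf8"` keeps accented characters readable in the file, and that an address and year can be looked up directly by key after loading.

```python
import json
import tempfile
from pathlib import Path

# Toy data with the same nesting as the street-wise dictionary (invented values).
toy_data = {
    "Rue Vivienne": {
        "12": {
            "geographic_coordinate": {"latitude": "48.868", "longitude": "2.340"},
            "people": {"1850": [{"Person Name": "Person A", "Tags": ["épicier"]}]},
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "toy_street_wise.json"  # hypothetical file name
    with open(json_path, "w", encoding="utf8") as outfile:
        json.dump(toy_data, outfile, indent=4, ensure_ascii=False)

    # The raw text contains "é" itself rather than an escaped \u00e9 sequence.
    raw_text = json_path.read_text(encoding="utf8")

    with open(json_path, encoding="utf8") as infile:
        loaded = json.load(infile)

# Look up the people recorded at one address in one year.
people_1850 = loaded["Rue Vivienne"]["12"]["people"]["1850"]
print(people_1850[0]["Tags"])
```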