# Cleaning Special Characters

This notebook contains the code and commentary on the steps involved in cleaning the special characters present in the professions of people in Paris.

Before this notebook, the `add_correct_page_numbers.ipynb` notebook was run to generate the `all_paris_jobs_with_gallica_pageno.csv` file. This file serves as the starting point of this notebook.

## Summary

This notebook cleans the strings that represent the professions of the people whose data was extracted from the directories. Most of the corrections are generic, but some are specific to individual entries. The special characters that require a manual or character-specific correction are handled first. The generic corrections are applied towards the end.

The generic correction, applied irrespective of the type of special character, removes the character when it is at the start or the end of the string or is surrounded by spaces. The majority of the corrections in the notebook fall into this category. The second type of correction applies when the special character occurs inside a word. These entries were extracted, verified against the directory, and then corrected manually.
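
The generic rule described above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's actual implementation: the function name `strip_isolated_specials` and the character set `SPECIAL_CHARS` are assumptions chosen for the example.

```python
import re

# Hypothetical set of special characters; the notebook's real list may differ.
SPECIAL_CHARS = "«»„`^=+"


def strip_isolated_specials(text, chars=SPECIAL_CHARS):
    """Remove special characters at the start, at the end, or surrounded by spaces."""
    cls = "[" + re.escape(chars) + "]+"
    # Characters at the very start or the very end of the string.
    text = re.sub(rf"^\s*{cls}|{cls}\s*$", "", text)
    # Characters that stand alone between two whitespace characters.
    text = re.sub(rf"(?<=\s){cls}(?=\s)", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


for sample in ["„peaussier", "imprim. = lithogr.", "vins=traiteur"]:
    print(repr(sample), "->", repr(strip_isolated_specials(sample)))
```

A character inside a word, as in `vins=traiteur`, is left untouched by this rule, which is why such entries have to be checked against the directory.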

## Imports

In [1]:
import pandas as pd
import string
import re
import numpy as np

### Reading the Data after adding the correct page numbers

After adding the correct page numbers for each entry corresponding to Gallica, the data set is stored at `all_paris_jobs_with_gallica_pageno.csv`. 

The CSV file contains 9 columns, described below:
1. `doc_id`: The unique id of the document on Gallica.
2. `page`: The page in which the entry is present in the document.
3. `row`: The row of the entry on the page.
4. `Nom`: The name of the person.
5. `métier_original`: The profession before cleaning the special characters.
6. `rue`: The name of the street in the address of the person.
7. `numéro`: The number in the street in the address of the person.
8. `annee`: The year in which the entry is published.
9. `gallica_page`: The page number, adjusted from the `page` column, that can be used on Gallica.

In the next cell, this file is read as a data frame, and a copy of the `métier_original` column is created (to store the original profession string) named `métier`.

All the strings in the `métier` column are lowercased, and the rows in which this column is empty are removed.

In [2]:
# Reading the csv file
raw_paris_jobs = pd.read_csv("./../data/intermediate_steps/all_paris_jobs_with_gallica_pageno.csv", names=["doc_id", "page", "row", "Nom", "métier_original", "rue", "numéro" , "annee", "gallica_page"],
                             dtype={"doc_id":'str', "page":'str', "row":'str', "Nom":'str', "métier_original":'str', "rue":'str', "numéro":'str', "annee":'str', "gallica_page":'str'},
                             header=0, encoding="utf-8")

raw_paris_jobs["métier"] = raw_paris_jobs["métier_original"]
# converting the strings in a column to lower case
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.lower()

# remove the rows with an empty job (assignment instead of chained inplace calls)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].replace("", np.nan)
raw_paris_jobs = raw_paris_jobs.dropna(subset=["métier"])

## Individual special character cleaning

### Dealing with `(` and `)`

First, we shall remove the symbols when they surround a text. While doing so, only the cases when both of them are present and contain any text between them are considered.

**The remaining `(` and `)` will be dealt with at a later stage.**

In [None]:
raw_paris_jobs[raw_paris_jobs["métier"].str.contains(r"\(.*?\)")]


Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
395,bpt6k6282019m,146,114,Alix,bur. de poste (Annexe E),Londres,33.,1855,74,bur. de poste (annexe e)
567,bpt6k6282019m,147,116,Alyon fils NC,loueur de voitures (les Carolines),St-Dominiqne-St-Germain,145.,1855,75,loueur de voitures (les carolines)
1317,bpt6k6282019m,152,37,Aubin,directeur de la Cie d'armements mari. times (l...,Laffitte,42.,1855,80,directeur de la cie d'armements mari. times (l...
1840,bpt6k6282019m,155,70,Badoureau père,école communale (3e arrondiss. ),Sentier,21.,1855,83,école communale (3e arrondiss. )
2255,bpt6k6282019m,157,179,Barbey,(A. ) jurisconsulte,Ste-Anne,18.,1855,85,(a. ) jurisconsulte
...,...,...,...,...,...,...,...,...,...,...
4404129,bpt6k9780089g,1592,336,Voisin (Albert) et Marin fils,(Etablissements) bouchons et articles de cave,r. de St-Quentir,12. (10). T. Nord A3.,1922,1253,(etablissements) bouchons et articles de cave
4404781,bpt6k9780089g,1597,237,Gambetta,à Aubervilliers (Seine). T. Nord 10. 9.,St-Fiacre,14.,1922,1258,à aubervilliers (seine). t. nord 10. 9.
4405208,bpt6k9780089g,1601,0,Wichard & Conge,forges de Courcelles (No-1 gent-en-Bassigny (H...,boul. de Clichy,60.,1922,1262,forges de courcelles (no-1 gent-en-bassigny (h...
4405521,bpt6k9780089g,1603,48,Wolff (Dr Amédée),médecin-spécialiste (voies urinaires),r. de Grenelle,137.,1922,1264,médecin-spécialiste (voies urinaires)


In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\((.*?)\)", r'\1', regex=True)
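
The effect of this pattern can be checked on a few sample strings with the standard `re` module, without loading the full data frame. The first two samples come from the table above. The third is a hypothetical string with an unmatched `(`, included to show what is left for the later stage.

```python
import re

# Same pattern as in the cell above: a "(", the shortest possible text, then a ")".
PAIRED_PARENS = re.compile(r"\((.*?)\)")

samples = [
    "bur. de poste (annexe e)",
    "(a. ) jurisconsulte",
    "loueur de voitures (les carolines",  # hypothetical: no closing ")"
]

# Keep only the text between the parentheses (capture group 1).
cleaned = [PAIRED_PARENS.sub(r"\1", s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The second sample keeps the space that preceded `)`, so it ends up with two consecutive spaces. The notebook collapses repeated whitespace later, in the number-cleaning step.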

### Dealing with `«` and `»`

We shall remove the symbols (`«` and `»`) when they surround a text.

For example, *direeteur du journal «le travail »* will become *direeteur du journal le travail*. While doing so, only the cases in which both symbols are present with text between them are considered.

**The remaining `«` and `»` will be dealt with at a later stage.**

In [None]:
raw_paris_jobs[raw_paris_jobs["métier"].str.contains(r"«.*?»")]


Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
1385755,bpt6k9668037f,410,144,Bourne (Louis),direeteur du journal «le Travail »,rue de Provence,2.,1884,243,direeteur du journal «le travail »
1419893,bpt6k9668037f,620,129,Hébert (G.),directeur du journal « l'Europe artiste »,Lamartine,8.,1884,453,directeur du journal « l'europe artiste »
1426059,bpt6k9668037f,658,69,Laffitte (J.),directeur du journal « le Voltaire »,boul. des Italiens,6.,1884,491,directeur du journal « le voltaire »
1442154,bpt6k9668037f,758,13,Moreau (A.),directeur de la Cie « Le Lion» incendie,rue de la Banque,14.,1884,591,directeur de la cie « le lion» incendie
1466341,bpt6k9668037f,906,148,Vanlinden et Cie,directeurs du journal financier « le Crédit pu...,St-Marc,20.,1884,739,directeurs du journal financier « le crédit pu...
...,...,...,...,...,...,...,...,...,...,...
4399185,bpt6k9780089g,1558,120,Trèves (Marcel),admin.-délégué de la sté fue des Tissus « Tétra »,r. de Hanovre,12. (ge). T. Centr. 24.,1922,1219,admin.-délégué de la sté fue des tissus « tétra »
4399491,bpt6k9780089g,1560,120,Trouche (G.),propulseur p. tous bateaux la « Motogodille »,pass. Verdeau,26.,1922,1221,propulseur p. tous bateaux la « motogodille »
4402745,bpt6k9780089g,1583,285,Vidal,café « Au Grand Turenne »,boul. du Temple,27.,1922,1244,café « au grand turenne »
4404231,bpt6k9780089g,1593,203,Polterra (Jules),restaurant « le Capitole »,r. N.-D.-de-Lorette,58.,1922,1254,restaurant « le capitole »


In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"«(.*?)»", r'\1', regex=True)

### Cleaning Numbers

As the numbers do not provide essential information about the profession, they are removed.

In French, ordinal numbers are written as digits followed by a suffix such as `e`, and numbers are often followed by `bis` or `ter`. The profession strings are therefore processed to remove each number (at the start of the string or preceded by a space) together with up to 4 following non-space characters, with or without a single space between the number and those characters. The words with `er` or `re` are replaced later, during spelling correction.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\d+"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
163,bpt6k6282019m,145,35,Italiens,15,et Grammont,30.,1855,73,15
377,bpt6k6282019m,146,88,Antoine,159,passage Saint-Bernard,3.,1855,74,159
414,bpt6k6282019m,146,135,Meslay,34,et boul. St-Martin,43.,1855,74,34
551,bpt6k6282019m,147,99,Altairac,secrétaire-trésor. du bureau de bienfaisance d...,Varennes,39.,1855,75,secrétaire-trésor. du bureau de bienfaisance d...
624,bpt6k6282019m,147,189,Croix-des-Petits-Champs,48,et VieuxAugustins,3*.,1855,75,48
...,...,...,...,...,...,...,...,...,...,...
4405584,bpt6k9780089g,1603,195,(199). T,Nord 14. 42.,(2º). T. Louv. 03. 83. Wormser.ciseleur-grav.r...,24.,1922,1264,nord 14. 42.
4405785,bpt6k9780089g,1605,68,LUSMAIBAUM-PARIS. Zabern,mercerie. 1.,Keller,27.,1922,1266,mercerie. 1.
4405801,bpt6k9780089g,1605,107,(9%). T. Louv,01. 35.,Paradis,21 bis.,1922,1266,01. 35.
4405828,bpt6k9780089g,1605,179,Faub. Poissonnière,22,et r. d'Enghien,54.,1922,1266,22


In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(?<!\S)(\d+\s{0,1}\S{0,4})(?!\S)", r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
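
To make the behaviour of the two patterns above concrete, here is a minimal sketch that applies them to single strings with the standard `re` module. The helper name `remove_numbers` is an assumption for this example. The first two samples are taken from entries shown in this notebook, and the last two are hypothetical.

```python
import re

# Same two patterns as in the cell above.
NUMBER_TOKEN = re.compile(r"(?<!\S)(\d+\s{0,1}\S{0,4})(?!\S)")
MULTI_SPACE = re.compile(r"\s\s+")


def remove_numbers(text):
    """Replace number tokens with a space, then collapse repeated whitespace."""
    text = NUMBER_TOKEN.sub(" ", text)
    return MULTI_SPACE.sub(" ", text)


samples = [
    "2e vicaire",           # ordinal with suffix
    "26; £ab. et magasin",  # number with trailing punctuation
    "12 bis vins",          # hypothetical: number followed by a short word
    "15 boulanger",         # hypothetical: number followed by a long word
]
for s in samples:
    print(repr(s), "->", repr(remove_numbers(s)))
```

Two properties are worth noting. A short word of up to 4 characters directly after a number (such as `bis`) is removed together with the number, whereas a longer word is kept. A leading space can remain where a number stood at the start of the string, because the replacement is a space and the result is not stripped here.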

### Dealing with `ſ`

The OCR misinterpreted `f` as `ſ`. Thus `ſ` will be replaced with `f`.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"ſ"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
302,bpt6k6282019m,145,212,Albitès (Mme et Mlle),proſess. de langues,St-Lazare,135.,1855,73,proſess. de langues
964,bpt6k6282019m,149,193,Arbey,habillements conſect.,Cafarelli,14.,1855,77,habillements conſect.
1633,bpt6k6282019m,154,24,Auzat,poèlier-ſuiniste,Mont-Thabor,7.,1855,82,poèlier-ſuiniste
1706,bpt6k6282019m,154,118,Azaïs,ſab. de cartonnages,Cléry,82.,1855,82,ſab. de cartonnages
3058,bpt6k6282019m,162,124,Baulant,ſab. de feuillages pour fleurs,Nvedes-Petits-Champs,21.,1855,90,ſab. de feuillages pour fleurs
...,...,...,...,...,...,...,...,...,...,...
4402635,bpt6k9780089g,1583,4,Viard,conſect. p. enfants,r. d'Hauteville,18 bis (100). T. Louvre 40. 33.,1922,1244,conſect. p. enfants
4403129,bpt6k9780089g,1586,97,Vignon,ſabr. d'horlogerie,r. d'Angoulême,70.,1922,1247,ſabr. d'horlogerie
4403216,bpt6k9780089g,1586,311,Vilain (Atej (Barbotte & Cie success.),amorces en papier pr pistolets et ſusils d'enf...,r. Rébeval,15.,1922,1247,amorces en papier pr pistolets et ſusils d'enf...
4404218,bpt6k9780089g,1593,174,Volney Importing Company,ſabr. de bonneterie,r. St-Georges,22 et 21.,1922,1254,ſabr. de bonneterie


In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace("ſ", 'f', regex=False)

### Dealing with `ď`

The OCR misinterpreted `d'` as `ď`. Thus `ď` will be replaced with `d'`.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"ď"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
493307,bpt6k6315985z,250,18,De la Chaumelle,aud. au Cons. -ď'Etat,Tournon,12.,1850,168,aud. au cons. -ď'etat
1354296,bpt6k63959929,378,173,Levard (C.) et Cie,fab. ď'ornements plaqués et moules pour confis...,Phélippeaux,42 et 44.,1851,296,fab. ď'ornements plaqués et moules pour confis...
1377044,bpt6k9668037f,359,80,Barbizet tils (NC.,faience ďart émaillée,place de la Nation,15.,1884,192,faience ďart émaillée
1377602,bpt6k9668037f,362,135,Barthélemy (A.),fabr. ďencriers,Temple,104.,1884,195,fabr. ďencriers
1379546,bpt6k9668037f,373,177,Bénel et Télot,fab. ďappareils à gaz,aven. de Clichy,47 et 47 bis.,1884,206,fab. ďappareils à gaz
...,...,...,...,...,...,...,...,...,...,...
4388622,bpt6k9780089g,1467,270,Saglier frères et Cie,fabr. ďor fèvrerie,r. d'Enghien,12.,1922,1128,fabr. ďor fèvrerie
4393791,bpt6k9780089g,1519,157,Société Peerless,importation ďaliments reconstituants,boul. de Strasbourg,30.,1922,1180,importation ďaliments reconstituants
4395363,bpt6k9780089g,1532,116,Suédoise (la) (société anonyme),fabr. ďallumettes,r. de la Pépinière,14.,1922,1193,fabr. ďallumettes
4396977,bpt6k9780089g,1543,169,Théry (José),avocat Cour ďappel,r. de la Pépinière,21.,1922,1204,avocat cour ďappel


In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace("ď", "d'", regex=False)

### Dealing with `ľ`

The OCR misinterpreted `l'` as `ľ`. Thus `ľ` will be replaced with `l'`.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"ľ"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
205338,bpt6k62931221,388,151,Dreux,cheľà l'Etat civil,Fontaine-au-Roi,10.,1841,237,cheľà l'etat civil
1383468,bpt6k9668037f,396,207,Bomier (Albert),secrétaire du conseil général des bâtiments ci...,Tour-Passy,62.,1884,229,secrétaire du conseil général des bâtiments ci...
1393255,bpt6k9668037f,455,62,Chesnier du Chesne,ancien administrateur gé rant du journal ĽUnion,St-Dominique,108.,1884,288,ancien administrateur gé rant du journal ľunion
1393536,bpt6k9668037f,457,21,Chevrolat,commis principal à ľadministration de l'octroi,Fontaine-St-Georges,25.,1884,290,commis principal à ľadministration de l'octroi
1441569,bpt6k9668037f,754,69,Monfils (Ernest),administrateur général du journal artistique a...,rue Lepic,78.,1884,587,administrateur général du journal artistique a...
1479055,bpt6k9669143t,341,52,Billard (l'abbé),2e vicaire à St-Germain-ľ Auxerrois,pl. du Louvre,3.,1882,196,vicaire à st-germain-ľ auxerrois
1556272,bpt6k9669143t,812,148,Soffar,maison eľachats,Blancs-Manteaux,23.,1882,667,maison eľachats
1572198,bpt6k9672117f,293,225,Biron et fils,représentants des carrières de ľEchaillon,boul. Richard-Lenoir,36.,1874,168,représentants des carrières de ľechaillon
1717017,bpt6k9672776c,695,226,Ragueneau,presses à imprimer eľ expéditifRagueneau,Joquelet,5 et 7.,1880,576,presses à imprimer eľ expéditifragueneau
1786560,bpt6k96727875,587,24,Maas (E.),directeur de la Cie d'assurances contre l'ince...,Banque,15.,1870,452,directeur de la cie d'assurances contre l'ince...


In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace("ľ", "l'", regex=False)

### Dealing with `ⓡ`

The OCR misinterpreted `d'` as `ⓡ`. Thus `ⓡ` will be replaced with `d'`.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"ⓡ"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
1429599,bpt6k9668037f,680,71,Lecomte (E.),fabr. Ⓡétain en feuilles,StMartin,220.,1884,513,fabr. ⓡétain en feuilles
1440229,bpt6k9668037f,746,8,Meyer (Frédéric) et Cle,fabr. Ⓡéventails,Meslay,55.,1884,579,fabr. ⓡéventails
1443098,bpt6k9668037f,763,209,Mouton (A.),fabr. Ⓡétuis à lunettes,Faubdu-Temple,83.,1884,596,fabr. ⓡétuis à lunettes
1497856,bpt6k9669143t,453,6,Delmotte (J.),fabr. Ⓡaccordéons,pass. du Grand-Cerf,1.,1882,308,fabr. ⓡaccordéons
1503653,bpt6k9669143t,488,170,Duru (Ch.) fils,fabr. Ⓡhorlogerie,Gravilliers,70.,1882,343,fabr. ⓡhorlogerie
...,...,...,...,...,...,...,...,...,...,...
3643143,bpt6k9764402m,1321,118,Simon (M.),fabr. Ⓡétalages de boutiques,r. des Gravilliers,43.,1900,992,fabr. ⓡétalages de boutiques
3742545,bpt6k9764647w,289,133,Beuve,apprét. Ⓡétoffes,Faub.-St-Denis,22.,1881,168,apprét. ⓡétoffes
3775707,bpt6k9764647w,479,70,Georges,fabr. Ⓡustensiles de pêche,Faub.du Temple,117.,1881,358,fabr. ⓡustensiles de pêche
3891515,bpt6k9764746t,547,68,Pacaud,fabr. Ⓡétalages de boutique,Chạpon,15.,1871,448,fabr. ⓡétalages de boutique


In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace("ⓡ", "d'", regex=False)
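
The four single-character substitutions above (`ſ`, `ď`, `ľ`, `ⓡ`) could also be expressed as one translation table. The sketch below works on plain Python strings. The names `OCR_GLYPH_FIXES` and `fix_glyphs` are assumptions for illustration and do not appear in the notebook.

```python
# Mapping from the misread glyph to the intended text, as described above.
OCR_GLYPH_FIXES = {
    "ſ": "f",
    "ď": "d'",
    "ľ": "l'",
    "ⓡ": "d'",
}

# str.maketrans accepts single-character keys and replacement strings of any length.
GLYPH_TABLE = str.maketrans(OCR_GLYPH_FIXES)


def fix_glyphs(text):
    """Apply all glyph substitutions in one pass."""
    return text.translate(GLYPH_TABLE)


for sample in ["ſabr. ďhorlogerie", "fabr. ⓡéventails", "aud. au cons. -ď'etat"]:
    print(repr(sample), "->", repr(fix_glyphs(sample)))
```

The last sample, taken from the `ď` table above, already contains an apostrophe after `ď`, so the substitution yields a doubled apostrophe (`d''etat`). Entries of that kind may need a follow-up pass.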

### Dealing with `\`

- Get the rows containing `\`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\\"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
555874,bpt6k6318531z,383,47,Gillot jne,\c bois à ouvrer,q. d’Austerlitz,55,1858,275,\c bois à ouvrer
1674717,bpt6k9672776c,431,38,Dussort,\boucher,Michodière,19. disana,1880,312,\boucher
2999373,bpt6k9762929c,375,164,Goccoz,\libraire,Ancienne-Comédie,11.,1879,246,\libraire
4063787,bpt6k9776121t,294,285,Constantacopoulo frères,\commissionnaires,r. des Petites-Ecuries,21.,1907,245,\commissionnaires


There are 4 rows that have the `\`. 

The métier column will be replaced manually.

1. For index 555874, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f275.item.r=Giljot%20jne.zoom. The `N.C` in a box was misinterpreted as `\c`. Thus the job will be changed from `\c bois à ouvrer` to `bois à ouvrer`.

2. For index 1674717, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f312.item.r=dussort.zoom. The printing on the back side of the page was misinterpreted as `\`. Thus the job will be changed from `\boucher` to `boucher`.

3. For index 2999373, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f246.item.r=uoGcoz.zoom. The printing on the back side of the page was misinterpreted as `\`. Thus the job will be changed from `\libraire` to `libraire`. 
    1. The name in the data frame is Goccoz. However, this is a mistake caused by the quality of the scanned document. The actual name is Coccoz.

4. For index 4063787, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f245.item.r=Constantacopoulo.zoom. A mark (not due to printing) on the page was misinterpreted as `\`. Thus the job will be changed from `\commissionnaires` to `commissionnaires`.

In [None]:
raw_paris_jobs.loc[555874, "métier"] = "bois à ouvrer"
raw_paris_jobs.loc[1674717, "métier"] = "boucher"
raw_paris_jobs.loc[2999373, "métier"] = "libraire"
raw_paris_jobs.loc[4063787, "métier"] = "commissionnaires"
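
The verification links used above follow a visible pattern built from the `doc_id` and `gallica_page` columns plus a search term. The helper below is a minimal sketch of that pattern. The function name `gallica_url` is hypothetical, and the URL layout is inferred only from the links in this notebook, not from Gallica's documentation.

```python
from urllib.parse import quote


def gallica_url(doc_id, gallica_page, query):
    """Build a link to a directory page, following the links used in this notebook."""
    # The search term is percent-encoded, e.g. a space becomes %20.
    return (
        f"https://gallica.bnf.fr/ark:/12148/{doc_id}"
        f"/f{gallica_page}.item.r={quote(query)}.zoom"
    )


print(gallica_url("bpt6k6318531z", 275, "Giljot jne"))
print(gallica_url("bpt6k9672776c", 312, "dussort"))
```

Generating the link from the row's own columns reduces the chance of checking a correction against the wrong page.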

### Dealing with `=`

- Get the rows containing `=`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"="))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
258861,bpt6k6305463c,373,144,Fourcaux,imprim. = lithogr.,Vinaigriers,49.0,1857,260,imprim. = lithogr.
1552477,bpt6k9669143t,787,208,Rousselot,vins=traiteur,boul. Montparnasse,45.0,1882,642,vins=traiteur


There are 2 rows that have the `=`. 

The métier column will be replaced manually.

1. For index 258861, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f260.item.r=Vinaigriers.zoom. The `-` had an extra print mark on it, which was misinterpreted as `=`. Thus the job will be changed from `imprim. = lithogr.` to `imprim.-lithogr.`.

2. For index 1552477, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f642.item.r=Montparnasse.zoom. The `-` had an extra print mark on it, which was misinterpreted as `=`. Thus the job will be changed from `vins=traiteur` to `vins-traiteur`.

In [None]:
raw_paris_jobs.loc[258861, "métier"] = "imprim.-lithogr."
raw_paris_jobs.loc[1552477, "métier"] = "vins-traiteur"

### Dealing with `^`

- Get the rows containing `^`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\^"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
191803,bpt6k6292987t,982,38,Cordier-Lalande,^ S. E. 1831,Gravilliers,10 3.,1845,629,^ s. e.


There is 1 row that has the `^`. 

The métier column will be replaced manually.

1. For index 191803, the OCR transcription appears to be a mistake and the entry image could not be identified. The scanned page containing the names starting with Z (https://gallica.bnf.fr/ark:/12148/bpt6k6292987t/f629.item.r=Gravilliers.zoom) includes entries that are not related to the address book that was intended to be generated. The original entry could be https://gallica.bnf.fr/ark:/12148/bpt6k6292987t/f416.item.r=lalande.zoom. Hence the job will be changed from `^ s. e. 1831` to `fab. de bijoux dorés`.

In [None]:
raw_paris_jobs.loc[191803, "métier"] = "fab. de bijoux dorés"

### Dealing with `©`

- Get the rows containing `©`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"©"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
3953151,bpt6k9775724t,336,175,(10e). Dupin (-e),bouchons. © Brochant. 17.,(170). Dupin (L.). tonnelier. r. de Verneuil,27.0,1914,303,bouchons. © brochant.


There is 1 row that has the `©`. 

The métier column will be replaced manually.

1. For index 3953151, the OCR transcription appears to be a mistake and the entry image appears to be https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f303.item.r=hmichnns.zoom. Hence the job will be changed from `bouchons. © brochant. 17.` to `bouchons`.

In [None]:
raw_paris_jobs.loc[3953151, "métier"] = "bouchons"

### Dealing with `¢`

- Get the rows containing `¢`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"¢"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
975836,bpt6k63243920,488,37,Nihart,¢picier,avenue St-Charles-Grenelle,31.0,1860,408,¢picier


There is 1 row that has the `¢`. 

The métier column will be replaced manually.

1. For index 975836, the image appears to be https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f408.image.r=Nihart.zoom. The job will be changed from `¢picier` to `épicier`.

In [None]:
raw_paris_jobs.loc[975836, "métier"] = "épicier"

### Dealing with `®`

- Get the rows containing `®`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"®"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
191810,bpt6k6292987t,982,68,les colonies,® 1839,boul. StMartin,18 5.,1845,629,®
1035233,bpt6k6331310g,699,61,les colonies,® 1839,boul. StMartin,18 5).,1844,483,®
2545673,bpt6k9685861g,386,2,Ancelin,tabac el ®,r. St-Jacques,55.,1887,197,tabac el ®
3040035,bpt6k9762929c,624,52,Mathieu de la Redorte (ce) *,rue du Faub.-StMathieu de Vienne (A. J. B.) ® ...,Université,67.,1879,495,rue du faub.-stmathieu de vienne a. j. b. ® co...


There are 4 rows that have the `®`. 

The métier column will be replaced manually.

1. For index 191810, the entry in the CSV file is wrong. The scanned page containing the names starting with Z includes entries that are not related to the address book that was intended to be generated. However, the original entry image appears at https://gallica.bnf.fr/ark:/12148/bpt6k6292987t/f523.item.r=quincaillier.zoom. Thus the job will be changed from `® 1839` to `quincaillier-commissionn.`. Note that the name remains wrong.

2. For index 1035233, the situation is the same as for 191810. The image appears at https://gallica.bnf.fr/ark:/12148/bpt6k6331310g/f379.item.r=quincaillier.zoom. Thus the job will be changed from `® 1839` to `quincaillier-commissionn.`. Note that the name remains wrong.

3. For index 2545673, the image appears at https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f197.item.r=Jacques.zoom. The mailbox symbol was misinterpreted as `®`. Thus the job will be changed from `tabac el ®` to `tabac` (as the mailbox entry was not interpreted in all cases).

4. For index 3040035, the image appears at https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f495.item.r=edorte.zoom. The OCR combined multiple lines of data, and there is no correct entry to fill in as there is no exact address. The job will be changed from `rue du faub.-stmathieu de vienne (a. j. b.) ® cour d'appel` to `conseiller à la cour d'appel`.

In [None]:
raw_paris_jobs.loc[191810, "métier"] = "quincaillier-commissionn."
raw_paris_jobs.loc[1035233, "métier"] = "quincaillier-commissionn."
raw_paris_jobs.loc[2545673, "métier"] = "tabac"
raw_paris_jobs.loc[3040035, "métier"] = "conseiller à la cour d'appel"

### Dealing with `¡`

- Get the rows containing `¡`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"¡"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
1691454,bpt6k9672776c,537,13,Klopp (N.),¡abr. de registres,Tiquetonne,66.0,1880,418,¡abr. de registres
2555949,bpt6k9685861g,446,90,Boquet (Aug.),¡abrique de chaussures,avenue Richerand,2.0,1887,257,¡abrique de chaussures


There are 2 rows that have the `¡`. 

The métier column will be replaced manually.

1. For index 1691454, the image appears at https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f418.item.r=Klopp.zoom. The `f` was misinterpreted as `¡`. Thus the job will be changed from `¡abr. de registres` to `fabr. de registres`.

2. For index 2555949, the image appears at https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f257.item.r=Hoquet.zoom. The `f` was misinterpreted as `¡`. Thus the job will be changed from `¡abrique de chaussures` to `fabrique de chaussures`.

In [None]:
raw_paris_jobs.loc[1691454, "métier"] = "fabr. de registres"
raw_paris_jobs.loc[2555949, "métier"] = "fabrique de chaussures"

### Dealing with `` ` ``

- Get the rows containing `` ` ``

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"`"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
535617,bpt6k6318531z,259,90,Bouchardat *,`professeur à la Faculté de médecine,Cloître--Notre-Dame,8.0,1858,151,`professeur à la faculté de médecine
767511,bpt6k63243601,367,77,Giraud Als,fournit. `pour la chapellerie,Billettes,19.0,1839,244,fournit. `pour la chapellerie
1722251,bpt6k9672776c,728,204,Satarnier er Pfaut,fournitures d`ébénisterie,Traversière,41.0,1880,609,fournitures d`ébénisterie


There are 3 rows that have the `` ` ``. 

The métier column will be replaced manually.

1. For index 535617, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f151.item.r=Cloitre.zoom. The job had an extra print mark at the start, which was misinterpreted as `` ` ``. Thus the job will be changed from `` `professeur à la faculté de médecine`` to `professeur à la faculté de médecine`.

2. For index 767511, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243601/f244.item.r=chapellerie.zoom. The job had an extra print mark at the start, which was misinterpreted as `` ` ``. Thus the job will be changed from `` fournit. `pour la chapellerie `` to `fournit. pour la chapellerie`.

3. For index 1722251, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f609.item.r=Traversiere.zoom. The `'` was misinterpreted as `` ` ``. Thus the job will be changed from `` fournitures d`ébénisterie `` to `fournitures d'ébénisterie`.

In [None]:
raw_paris_jobs.loc[535617, "métier"] = "professeur à la faculté de médecine"
raw_paris_jobs.loc[767511, "métier"] = "fournit. pour la chapellerie"
raw_paris_jobs.loc[1722251, "métier"] = "fournitures d'ébénisterie"

### Dealing with `—`

- Get the rows containing `—`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"—"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
480089,bpt6k6315985z,164,6,Abattoirs : Grenelle. — Ménilmontant,— Miroménil. --- Montmartre. - Villejuif.,Adam,0.0,1850,82,— miroménil. --- montmartre. - villejuif.


One row has the `—`. 

The métier column will be replaced manually.

1. For index 480089, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6315985z/f82.item.r=abattoirs.zoom. The job name was ambiguous. An expert opinion about the row was sought from Loïc, and the job will be changed from `— miroménil. --- montmartre. - villejuif.` to `abattoirs`.

In [None]:
raw_paris_jobs.loc[480089, "métier"] = "abattoirs"

### Dealing with `¿`

- Get the rows containing `¿`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"¿"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
529421,bpt6k6318531z,216,20,Abbatacei (Ch.) *,conseiller ¿'Etat,Caumartin,3.0,1858,108,conseiller ¿'etat


One row has the `¿`. 

The métier column will be replaced manually.

1. For index 529421, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f108.item.r=Abbatncei.zoom. The `d` was misinterpreted as `¿`. The job will be changed from `conseiller ¿'etat` to `conseiller d'etat`.

In [None]:
raw_paris_jobs.loc[529421, "métier"] = "conseiller d'etat"

### Dealing with `£`

- Get the rows containing `£`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"£"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
399290,bpt6k6314752k,563,134,Beaumarchais,26; £ab. et magasin,HarlayMarais,3 et 5.,1856,373,£ab. et magasin
2090361,bpt6k9677737t,744,61,Pontremoli NC. et Cie,fabr. £équipements militaires,Faub.-St-Martin,140.,1883,605,fabr. £équipements militaires


There are 2 rows that have the `£`. 

The métier column will be replaced manually.

1. For index 399290, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f373.item.r=HarlayMarais.zoom. The job name was ambiguous. However, the `f` was misinterpreted as `£`. Thus the job will be changed from `26; £ab. et magasin` to `26; fab. et magasin`, and further to `fab. et magasin` as the numbers are removed.

2. For index 2090361, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677737t/f605.item.r=Pontremoli.zoom. The `d'` was misinterpreted as `£`. Thus the job will be changed from `fabr. £équipements militaires` to `fabr. d'équipements militaires`.

In [None]:
raw_paris_jobs.loc[399290, "métier"] = "fab. et magasin"
raw_paris_jobs.loc[2090361, "métier"] = "fabr. d'équipements militaires"

### Dealing with `€`

- Get the rows containing `€`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"€"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
1684584,bpt6k9672776c,492,217,Grujon Le Bas,directeur de € Hospice de la vieillesse (femme...,boul. de l'Hôpital,47.0,1880,373,directeur de € hospice de la vieillesse femmes...


The two cases below will be corrected manually. In the remaining cases, `€` will be removed, as the numbers were already removed.

1. For index 1684584, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f373.item.r=Grujon.zoom. The job will be changed from `directeur de € hospice de la vieillesse femmes à la salpêtrière` to `directeur de l'hospice de la vieillesse femmes à la salpêtrière` (the parentheses around `femmes` were already removed in an earlier step).

2. For index 3095906, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f204.item.r=peaussier.zoom. The job is placed in the address. Thus the job will be changed from `O. 25€` to `colonel de cavalerie`. (This row does not appear in the list above because the `€` was removed when dealing with numbers.)

In [None]:
raw_paris_jobs.loc[1684584, "métier"] = "directeur de l'hospice de la vieillesse femmes à la salpêtrière"
raw_paris_jobs.loc[3095906, "métier"] = "colonel de cavalerie"

### Dealing with `„`

- Get the rows containing `„`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"„"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
313434,bpt6k6309075f,300,61,Cretey (E.),„peaussier,Chapon,44.0,1861,204,„peaussier
324229,bpt6k6309075f,369,4,Fourrier cadet,„vins en gros,Grande-RueBercy,2.0,1861,273,„vins en gros
723235,bpt6k6319811j,359,97,Husquin de Rhéville,„secrétaire architecte de la Société des ingén...,Buffault,26.0,1854,280,„secrétaire architecte de la société des ingén...
2065955,bpt6k9677737t,604,53,Lang,(Yve) „négociant,passage Chausson,5.0,1883,465,yve „négociant


There are 4 rows that have the `„`. 

The métier column will be replaced manually.

1. For index 313434, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f204.item.r=peaussier.zoom. The job had an extra print mark at the start, which was misinterpreted as `„`. Thus the job will be changed from `„peaussier` to `peaussier`.

2. For index 324229, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f273.item.r=RueBercy.zoom. The job had an extra print mark at the start, which was misinterpreted as `„`. Thus the job will be changed from `„vins en gros` to `vins en gros`.

3. For index 723235, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f280.item.r=Buffault.zoom. The job had an extra print mark at the start, which was misinterpreted as `„`. Thus the job will be changed from `„secrétaire architecte de la société des ingénieurs civils` to `secrétaire architecte de la société des ingénieurs civils`.

4. For index 2065955, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677737t/f465.item.r=negociant.zoom. The OCR misinterpreted the first comma as the splitting point. Thus the job will be changed from `(Yve) „négociant` to `négociant`.

In [None]:
raw_paris_jobs.loc[313434, "métier"] = "peaussier"
raw_paris_jobs.loc[324229, "métier"] = "vins en gros"
raw_paris_jobs.loc[723235, "métier"] = "secrétaire architecte de la société des ingénieurs civils"
raw_paris_jobs.loc[2065955, "métier"] = "négociant"
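
The same kind of manual correction can also be kept in one mapping from row index to corrected job and applied in a single pass. The block below is a minimal sketch on a toy frame, not the notebook's own method; `apply_corrections`, `corrections` and `toy_jobs` are hypothetical names introduced for illustration.

```python
import pandas as pd


def apply_corrections(df, corrections, column="métier"):
    """Write each corrected value into `column`; fail loudly on unknown indexes."""
    missing = [idx for idx in corrections if idx not in df.index]
    if missing:
        raise KeyError(f"indexes not found in the frame: {missing}")
    for idx, value in corrections.items():
        df.loc[idx, column] = value
    return df


# Toy stand-in for raw_paris_jobs, using two rows from the listing above.
toy_jobs = pd.DataFrame(
    {"métier": ["„peaussier", "yve „négociant"]},
    index=[313434, 2065955],
)

# Hypothetical mapping: row index -> job verified on the directory page.
corrections = {313434: "peaussier", 2065955: "négociant"}

toy_jobs = apply_corrections(toy_jobs, corrections)
print(toy_jobs)
```

Checking the indexes before writing matters because `.loc` assignment with a label that is not in the index adds a new row instead of raising an error.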

### Dealing with `்`

- Get the rows containing `்`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"்"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier


There was 1 row that had the `்`.

The métier column will be replaced manually.

1. For index 648251, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63197984/f102.item.r=aubergiste.zoom. The job was split across two lines and was misinterpreted. Thus the job will be changed from `aubergiste et com( 11ம்' t missionnaire de roulage` to `aubergiste et commissionnaire de roulage`. (This row is not seen in the above list because the symbol was removed when dealing with numbers.)

In [None]:
raw_paris_jobs.loc[648251, "métier"] = "aubergiste et commissionnaire de roulage"
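
When an earlier step has already stripped a symbol from `métier`, the row can still be found through `métier_original`, which keeps the raw OCR text. The following is a small sketch on invented toy data; the `métier` value shown for the first row is an assumption about what the earlier number-cleaning step left behind, not the real value.

```python
import pandas as pd

symbol = "\u0bcd"  # the combining mark searched for in this section

# Toy stand-in for raw_paris_jobs; the cleaned text of row 648251 is assumed.
toy_jobs = pd.DataFrame(
    {
        "métier_original": [
            "aubergiste et com( 11\u0bae\u0bcd' t missionnaire de roulage",
            "peaussier",
        ],
        "métier": [
            "aubergiste et com( ' t missionnaire de roulage",
            "peaussier",
        ],
    },
    index=[648251, 313434],
)

in_original = toy_jobs["métier_original"].str.contains(symbol, regex=False)
in_cleaned = toy_jobs["métier"].str.contains(symbol, regex=False)

# Rows where the symbol was present in the OCR text but is gone from métier.
hidden_rows = toy_jobs[in_original & ~in_cleaned]
print(hidden_rows)
```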

### Dealing with `+`

- Get the rows containing `+`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\+"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
108367,bpt6k62906378,605,211,Bonjour fils aîne et Verrier (C.),commis+ sionnaires do roulage,St-Denis,148 6).sanoa,1846,317,commis+ sionnaires do roulage
544978,bpt6k6318531z,316,54,De Brotonne,avoué lr+ inst.,Ste-Anne,23.,1858,208,avoué lr+ inst.
1054081,bpt6k6333170p,396,61,Dauphin et Cie,fab. de papiers à cigar+ties,St-Denis,364.,1864,259,fab. de papiers à cigar+ties
1234019,bpt6k6389871r,510,97,Thurcan (Mme),isaga+femme,Four-St-Ger. main,9.,1853,433,isaga+femme
1674324,bpt6k9672776c,428,155,Dupuy (1) NC.,sculpt+ornemaniste,Rocról,5.,1880,309,sculpt+ornemaniste
3063767,bpt6k9762929c,768,128,Tournemine,lavoir $t+Jean,Nys,9.,1879,639,lavoir $t+jean
3787639,bpt6k9764647w,547,196,Lacour (Emile),corroyeur et fabr. de chaus+ sures,Trois-Portes,10.,1881,426,corroyeur et fabr. de chaus+ sures
3803656,bpt6k9764647w,640,20,Mondoré et Leherteur,confections pour hom+i mes,Oberkampf,97.,1881,519,confections pour hom+i mes
4126811,bpt6k9776121t,720,187,Peiffer (E.),commissionn:+expéditeurs commissionn en marcha...,r. Amelot,14 bisa (110). TÉLÉPH) 933. 58.,1907,671,commissionn:+expéditeurs commissionn en marcha...
4258830,bpt6k97774838,1405,47,Rover (A.) (L. Prevost succ.),fondeur en fer. r. de appr. 2: (+4). T. Rog. 1...,Daguerre,54.,1921,1078,fondeur en fer. r. de appr. +4. t. rog.


There are 10 rows that have the `+`. 
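
The search above escapes `+` as `\+` because `+` is a quantifier in regular expressions. Two alternatives give the same mask without hand-written escapes: a literal search with `regex=False`, or `re.escape`. The sketch below uses a toy series; `contains_literal` is a hypothetical helper.

```python
import re

import pandas as pd

# Toy values taken from the listings in this notebook.
toy_jobs = pd.Series(
    ["isaga+femme", "sage-femme", "lavoir $t+jean", "{peintre-artiste"]
)


def contains_literal(series, symbol):
    """Literal substring search, so regex metacharacters need no escaping."""
    return series.str.contains(symbol, regex=False)


plus_mask = contains_literal(toy_jobs, "+")

# Equivalent regex search, with the escaping delegated to re.escape.
escaped_mask = toy_jobs.str.contains(re.escape("+"))

print(toy_jobs[plus_mask])
```

The same helper works unchanged for `{`, `}` and `$`, which are also regex metacharacters.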

The métier column will be replaced manually.

1. For index 108367, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62906378/f317.item.r=verrier.zoom#. The job was split across two lines and was misinterpreted. Thus the job will be changed from `commis+ sionnaires do roulage` to `commissionnaires de roulage`.

2. For index 544978, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f208.item.r=tourte.zoom. The `1re` was misinterpreted as `lr+`. Thus the job will be changed from `avoué lr+ inst.` to `avoué première inst.`.

3. For index 1054081, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f259.item.r=cigarettes.zoom. The `t` was misinterpreted as `+`. Thus the job will be changed from `fab. de papiers à cigar+ties` to `fab. de papiers à cigarettes`.

4. For index 1234019, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6389871r/f433.item.r=isag.zoom. The `-` was misinterpreted as `+`. Thus the job will be changed from `isaga+femme` to `sage-femme`.

5. For index 1674324, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f309.item.r=pcq.zoom. The `.-` was misinterpreted as `+`. Thus the job will be changed from `sculpt+ornemaniste` to `sculpt.-ornemaniste`.

6. For index 3063767, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f639.item.r=Tournemine.zoom. The `s` was misinterpreted as `$` and `-` as `+`. Thus the job will be changed from `lavoir $t+jean` to `lavoir st-jean`.

7. For index 3787639, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f426.item.r=coitroptur.zoom. The job was split across two lines and was misinterpreted. Thus the job will be changed from `corroyeur et fabr. de chaus+ sures` to `corroyeur et fabr. de chaussures`.

8. For index 3803656, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f519.item.r=Lehcrl.zoom. The job was split across two lines and was misinterpreted. Thus the job will be changed from `confections pour hom+i mes` to `confections pour hommes`.

9. For index 4126811, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f671.item.r=cotnmissionn.zoom. The job was split across two lines and was misinterpreted. Thus the job will be changed from `commissionn:+expéditeurs commissionn en marchandises` to `commissionnaire expéditeurs commissionnaire en marchandises`.

10. For index 4258830, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f204.item.r=peaussier.zoom. The address and the job were both read as the job. Thus the job will be changed from `fondeur en fer. r. de appr. 2: (+4). t. rog. 14. 71.` to `fondeur en fer`. However, the address is still wrong.

In [None]:
raw_paris_jobs.loc[108367, "métier"] = "commissionnaires de roulage"
raw_paris_jobs.loc[544978, "métier"] = "avoué première inst."
raw_paris_jobs.loc[1054081, "métier"] = "fab. de papiers à cigarettes"
raw_paris_jobs.loc[1234019, "métier"] = "sage-femme"
raw_paris_jobs.loc[1674324, "métier"] = "sculpt.-ornemaniste"
raw_paris_jobs.loc[3063767, "métier"] = "lavoir st-jean"
raw_paris_jobs.loc[3787639, "métier"] = "corroyeur et fabr. de chaussures"
raw_paris_jobs.loc[3803656, "métier"] = "confections pour hommes"
raw_paris_jobs.loc[4126811, "métier"] = "commissionnaire expéditeurs commissionnaire en marchandises"
raw_paris_jobs.loc[4258830, "métier"] = "fondeur en fer"
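
The verification links in these lists share one shape: the `doc_id`, then `f` followed by the `gallica_page`, then an optional search term. Building the link from the row's own columns keeps it in step with the data. The helper below is a sketch; `gallica_url` is a hypothetical name and the URL pattern is inferred only from the links in this notebook, not from Gallica's documentation.

```python
from urllib.parse import quote


def gallica_url(doc_id, gallica_page, term=None):
    """Build a page link in the shape used by the links in this notebook."""
    base = f"https://gallica.bnf.fr/ark:/12148/{doc_id}/f{gallica_page}.item"
    if term:
        # The search term is percent-encoded, e.g. a space becomes %20.
        return f"{base}.r={quote(term)}.zoom"
    return f"{base}.zoom"


# Values for index 313434 from the `„` listing.
url = gallica_url("bpt6k6309075f", 204, "peaussier")
print(url)
```

With a frame at hand, the same call could take `row["doc_id"]` and `row["gallica_page"]` for any index that needs a visual check.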

### Dealing with `{`

- Get the rows containing `{`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\{"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
1089871,bpt6k6333170p,631,160,Noel,{ab. de bouteilles en pierre pour filtrer les ...,Chemin de ronde des Vertus,9.,1864,494,{ab. de bouteilles en pierre pour filtrer les ...
1469297,bpt6k9668037f,923,109,Wallet,{peintre-artiste,Denfert-Rochereau,77.,1884,756,{peintre-artiste
1591243,bpt6k9672117f,406,83,Dubois,{marchand de paniers,Montdétour,14.,1874,281,{marchand de paniers
1737809,bpt6k96727875,299,242,Belard,'hôtel St-1{omain,Dauphin,5 et 7.,1870,164,'hôtel st-1{omain
1898931,bpt6k96762564,847,216,Pinchon fils ainé et Cie,nég {s-commissionnaires pour l'Angleterre et l...,Michel- le-Comte,30.,1886,662,nég {s-commissionnaires pour l'angleterre et l...
2224730,bpt6k9684013b,443,185,Boca (P.) et Cie,experts du commerce des tabacs près la régie {...,r'. de Chateaudun,50.,1896,254,experts du commerce des tabacs près la régie {...
2240872,bpt6k9684013b,542,122,Couvreur (V.) et fils,fabr. de caisses et bo{tes en bois,imp. Célestin,11.,1896,353,fabr. de caisses et bo{tes en bois
2442588,bpt6k9685098r,655,62,Bouzu et Silvestre,{abr. de voitures,cité Bayvet,15.,1898,374,{abr. de voitures
2644532,bpt6k9685861g,981,111,Vincent (René),avocat {cour d'appel,place d'Iena,3.,1887,792,avocat {cour d'appel
3002722,bpt6k9762929c,396,63,Danlos fils et Delisle,estampes et {ibrairie,quai Malaquais,15.,1879,267,estampes et {ibrairie


There are 17 rows that have the `{`. 
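
The count is larger than the number of rows visible in the listing. A minimal sketch for counting the matches and printing every one of them, with full column text, is given below. It runs on an invented toy frame; `pd.option_context` only changes the display settings inside the `with` block.

```python
import pandas as pd

# Toy stand-in for raw_paris_jobs: 17 matching rows and one clean row.
toy_jobs = pd.DataFrame({"métier": ["{abr. de voitures"] * 17 + ["tonnelier"]})

matches = toy_jobs[toy_jobs["métier"].str.contains("{", regex=False)]
n_matches = len(matches)
print(f"{n_matches} rows contain the symbol")

# Lift the row and column-width limits for this one print only.
with pd.option_context("display.max_rows", None, "display.max_colwidth", None):
    print(matches)
```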

The métier column will be replaced manually.

1. For index 1089871, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f494.item.r=bouteilles.zoom. The `f` was misinterpreted as `{`. Thus the job will be changed from `{ab. de bouteilles en pierre pour filtrer les eaux` to `fab. de bouteilles en pierre pour filtrer les eaux`.

2. For index 1469297, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9668037f/f756.item.r=Denfert%20.zoom. The job had an extra print mark at the start, which was misinterpreted as `{`. Thus the job will be changed from `{peintre-artiste` to `peintre-artiste`.

3. For index 1591243, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f281.item.r=marclland.zoom. The job had an extra print mark at the start, which was misinterpreted as `{`. Thus the job will be changed from `{marchand de paniers` to `marchand de paniers`.

4. For index 1737809, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f259.item.r=cigarettes.zoom. The `R` was misinterpreted as `1{`. Thus the job will be changed from `'hôtel st-1{omain` to `hôtel st-romain`.

5. For index 1898931, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f662.item.r=CAngielerre.zoom. The `t` was misinterpreted as `{`. Thus the job will be changed from `nég {s-commissionnaires pour l'angleterre et les colonies` to `négts-commissionnaires pour l'angleterre et les colonies`.

6. For index 2224730, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684013b/f254.item.r=boca.zoom. The `f` was misinterpreted as `{`. Thus the job will be changed from `experts du commerce des tabacs près la régie {rançaise` to `experts du commerce des tabacs près la régie française`.

7. For index 2240872, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684013b/f353.item.r=caisses.zoom. The `î` was misinterpreted as `{`. Thus the job will be changed from `fabr. de caisses et bo{tes en bois` to `fabr. de caisses et boîtes en bois`.

8. For index 2442588, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685098r/f374.item.r=Bouzu.zoom. The `f` was misinterpreted as `{`. Thus the job will be changed from `{abr. de voitures` to `fabr. de voitures`.

9. For index 2644532, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f792.image.r=jeour.zoom. The job had an extra print mark before `cour`, which was misinterpreted as `{`. Thus the job will be changed from `avocat {cour d'appel` to `avocat cour d'appel`.

10. For index 3002722, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f267.item.r=eSlmmpcs.zoom. The `l` was misinterpreted as `{`. Thus the job will be changed from `estampes et {ibrairie` to `estampes et librairie`.

11. For index 3093807, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f164.item.r=Cron.zoom. The job had an extra print mark at the start, which was misinterpreted as `{`. Thus the job will be changed from `{dessinateurs pour tissus d'itmeublements` to `dessinateurs pour tissus d'ameublements`.

12. For index 3312228, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763471j/f449.item.r=lleckmann.zoom. The job had an extra print mark at the start, which was misinterpreted as `{`. Thus the job will be changed from `{tonnelier` to `tonnelier`.

13. For index 3681828, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f281.image.r=Dumoret.zoom. The `d` was misinterpreted as `{`. Thus the job will be changed from `chef {'institution` to `chef d'institution`.

14. For index 3796176, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f476.item.r=Popincourt.zoom. The `é` was misinterpreted as `{`. Thus the job will be changed from `{picier` to `épicier`.

15. For index 3812185, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f164.item.r=Cron.zoom. The `d'` was misinterpreted as `{`. Thus the job will be changed from `réparation {objeis de curiosités` to `réparation d'objets de curiosités`.

16. For index 3818381, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f164.item.r=Cron.zoom. The `d'é` was misinterpreted as `{ l`. Thus the job will be changed from `{fabr. d'articles { lbénisterie pour bureaux` to `fabr. d'articles d'ébénisterie pour bureaux`.

17. For index 3882817, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f397.item.r=Luppe.zoom. The `d'` was misinterpreted as `{`. Thus the job will be changed from `auditeur au conseil {'etat` to `auditeur au conseil d'etat`.

In [None]:
raw_paris_jobs.loc[1089871, "métier"] = "fab. de bouteilles en pierre pour filtrer les eaux"
raw_paris_jobs.loc[1469297, "métier"] = "peintre-artiste"
raw_paris_jobs.loc[1591243, "métier"] = "marchand de paniers"
raw_paris_jobs.loc[1737809, "métier"] = "hôtel st-romain"
raw_paris_jobs.loc[1898931, "métier"] = "négts-commissionnaires pour l'angleterre et les colonies"
raw_paris_jobs.loc[2224730, "métier"] = "experts du commerce des tabacs près la régie française"
raw_paris_jobs.loc[2240872, "métier"] = "fabr. de caisses et boîtes en bois"
raw_paris_jobs.loc[2442588, "métier"] = "fabr. de voitures"
raw_paris_jobs.loc[2644532, "métier"] = "avocat cour d'appel"
raw_paris_jobs.loc[3002722, "métier"] = "estampes et librairie"
raw_paris_jobs.loc[3093807, "métier"] = "dessinateurs pour tissus d'ameublements"
raw_paris_jobs.loc[3312228, "métier"] = "tonnelier"
raw_paris_jobs.loc[3681828, "métier"] = "chef d'institution"
raw_paris_jobs.loc[3796176, "métier"] = "épicier"
raw_paris_jobs.loc[3812185, "métier"] = "réparation d'objets de curiosités"
raw_paris_jobs.loc[3818381, "métier"] = "fabr. d'articles d'ébénisterie pour bureaux"
raw_paris_jobs.loc[3882817, "métier"] = "auditeur au conseil d'etat"

### Dealing with `}`

- Get the rows containing `}`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\}"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
21195,bpt6k6282019m,272,65,Enne,avoué }re instance,Richelieu,15.,1855,200,avoué }re instance
853031,bpt6k63243905,292,64,Bauerkeller (Guillaume),professeur de vio}ọ,naue,24.,1863,158,professeur de vio}ọ
853361,bpt6k63243905,294,94,Beaucelou,avoué } re instance,Gaillon,14.,1863,160,avoué } re instance
964070,bpt6k63243920,413,179,Laveissière (E.) et Courtois,quincai!}iors,Fidélné,20 22.,1860,333,quincai!}iors
1230395,bpt6k6389871r,481,0,Rolland,inspecteur de la Manufacture des ta. } bacs,Bell -Cha-se,21,1853,404,inspecteur de la manufacture des ta. } bacs
1302116,bpt6k6393838j,564,126,Habard,} caussier,St-Martin,71.,1843,345,} caussier
1756014,bpt6k96727875,408,0,Délivré,secrétaire au conseil des Prud'hom- } mes,Sévigné,19.,1870,273,secrétaire au conseil des prud'hom- } mes
2106291,bpt6k9677737t,837,7,Vaillaud,hôtel meubl},Bellechasse,8.,1883,698,hôtel meubl}
2406011,bpt6k9684454n,966,118,Potier (A.),} herborisle,Vieille du-Temple,90.,1893,731,} herborisle
3180338,bpt6k97631451,887,142,Drin (G.) et Cie,blanchiment et apprêts. usines à Courbevoie et...,Pompe,95,1901,565,blanchiment et apprêts. usines à courbevoie et...


There are 11 rows that have the `}`. 

The métier column will be replaced manually.

1. For index 21195, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6282019m/f200.image.r=instance.zoom. The `1` was misinterpreted as `}`. Thus the job will be changed from `avoué }re instance` to `avoué première instance`.

2. For index 853031, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9668037f/f756.item.r=Denfert%20.zoom. The job was in two lines and the print was not clear, which was misinterpreted as `}`. Thus the job will be changed from `professeur de vio}ọ` to `professeur de violon`.

3. For index 853361, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f160.item.r=Gaillon.zoom. The `1` was misinterpreted as `}`. Thus the job will be changed from `avoué } re instance` to `avoué première instance`.

4. For index 964070, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f333.item.r=courtios.zoom. The print was unclear and was misinterpreted as `}`. Thus the job will be changed from `quincai!}iors` to `quincailliers`.

5. For index 1230395, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6389871r/f404.item.r=manufacture.zoom. The job was in two lines and the print was not clear, which was misinterpreted as `}`. Thus the job will be changed from `inspecteur de la manufacture des ta. } bacs` to `inspecteur de la manufacture des tabacs`.

6. For index 1302116, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f345.item.r=peaussier.zoom. The print was unclear and was misinterpreted as `}`. Thus the job will be changed from `} caussier` to `peaussier`.

7. For index 1756014, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f273.item.r=Delivre.zoom. The job was in two lines, which was misinterpreted as `}`. Thus the job will be changed from `secrétaire au conseil des prud'hom- } mes` to `secrétaire au conseil des prud'hommes`.

8. For index 2106291, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677737t/f698.item.r=Vaillaud.zoom. The `é` was misinterpreted as `}`. Thus the job will be changed from `hôtel meubl}` to `hôtel meublé`.

9. For index 2406011, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f731.item.r=herboriste.zoom. The job had an extra print mark at the start, which was misinterpreted as `}`. Thus the job will be changed from `} herborisle` to `herboriste`.

10. For index 3180338, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f267.item.r=eSlmmpcs.zoom. The `l'` was misinterpreted as `}`. Thus the job will be changed from `blanchiment et apprêts. usines à courbevoie et à}lile st-denis (seine` to `blanchiment et apprêts. usines à courbevoie et à l'ile st-denis (seine)`.

11. For index 3450599, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f124.item.r=Jtlvreut.zoom. The job was misinterpreted as the name and the street name. Thus the job will be changed from `st-sau} veur` to `plombier-couvreur`.

In [None]:
raw_paris_jobs.loc[21195, "métier"] = "avoué première instance"
raw_paris_jobs.loc[853031, "métier"] = "professeur de violon"
raw_paris_jobs.loc[853361, "métier"] = "avoué première instance"
raw_paris_jobs.loc[964070, "métier"] = "quincailliers"
raw_paris_jobs.loc[1230395, "métier"] = "inspecteur de la manufacture des tabacs"
raw_paris_jobs.loc[1302116, "métier"] = "peaussier"
raw_paris_jobs.loc[1756014, "métier"] = "secrétaire au conseil des prud'hommes"
raw_paris_jobs.loc[2106291, "métier"] = "hôtel meublé"
raw_paris_jobs.loc[2406011, "métier"] = "herboriste"
raw_paris_jobs.loc[3180338, "métier"] = "blanchiment et apprêts. usines à courbevoie et à l'ile st-denis (seine)"
raw_paris_jobs.loc[3450599, "métier"] = "plombier-couvreur"
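
Manual assignments like these are easy to paste onto the wrong index. One possible safeguard, sketched below, compares each corrected value with the OCR text it replaces and flags pairs that share very little. This is not part of the notebook: `suspicious_corrections` is a hypothetical helper, the threshold of 0.5 is an arbitrary assumption, and the second pair is a deliberately wrong example.

```python
from difflib import SequenceMatcher


def suspicious_corrections(pairs, threshold=0.5):
    """Return the keys whose corrected text shares little with the OCR text."""
    flagged = []
    for key, (ocr_text, corrected) in pairs.items():
        ratio = SequenceMatcher(None, ocr_text, corrected).ratio()
        if ratio < threshold:
            flagged.append(key)
    return flagged


# Hypothetical (OCR text, corrected text) pairs keyed by a label.
pairs = {
    "row_a": ("hôtel meubl}", "hôtel meublé"),
    "row_b": ("} herborisle", "peaussier"),  # deliberately wrong paste
}

flagged = suspicious_corrections(pairs)
print(flagged)
```

A flagged pair is only a prompt to look at the page again. Legitimate corrections that rewrite the whole job, such as `st-sau} veur` to `plombier-couvreur`, would be flagged as well.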

### Dealing with `¥`

- Get the rows containing `¥`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"¥"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
107369,bpt6k62906378,599,73,e musique miBesson (C.),C. ¥ pair,boul. Poissonnièro,19.,1846,311,c. ¥ pair
143491,bpt6k62906378,823,180,Saint-Didier (Bon de),C. ¥pair,Villel'Evêque,17.,1846,535,c. ¥pair
152098,bpt6k6292987t,729,30,Bessun (C.),C. ¥. pair,boul. Poissonnière,19.,1845,376,c. ¥. pair
164980,bpt6k6292987t,811,211,Tissot jeune,élig. ¥,Vivienne,7.,1845,458,élig. ¥
186866,bpt6k6292987t,949,182,Saint-Didier (Bon de),C. ¥. pair,Villel'Evêque,17.,1845,596,c. ¥. pair
196555,bpt6k62931221,332,174,Borgella,capit. d'artil er. ¥,Vaugirard,20.,1841,181,capit. d'artil er. ¥
251624,bpt6k6305463c,330,4,Delaunay (baron),C. ¥. intendant militaire en retraite,Grenelle-St. Germain,71.,1857,217,c. ¥. intendant militaire en retraite
474270,bpt6k6315927h,986,10,Saint-Didier (Bon de),C. ¥. pair,Villel'Evèque,17.,1848,637,c. ¥. pair
977647,bpt6k63243920,500,105,Pelletier-Descarrières (A.),C. ¥. général de division en retraite,St-Lazare,136.,1860,420,c. ¥. général de division en retraite
987729,bpt6k63243920,564,178,Soutif,caissier de la maison Gréau ¥ et Cie,Deux-Boules,3.,1860,484,caissier de la maison gréau ¥ et cie


There are 18 rows that have the `¥`. 

All the `¥` were misinterpreted for a similar-looking symbol that was used in the directory to indicate a title (see the bottom of the page at https://gallica.bnf.fr/ark:/12148/bpt6k62906378/f9.item.zoom).

![symbols_meaning.png](./images/symbols_meaning.png)

For the purposes of tag generation, `¥` symbol shall be removed.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace('¥', '', regex=False)
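
Removing a symbol that sat between two spaces leaves a double space behind, and one at the end of a job leaves a trailing space. The sketch below shows one way to tidy this up on toy values from the listing above. The whitespace step is an assumption added for illustration and is not taken from this section of the notebook.

```python
import pandas as pd

# Toy values from the `¥` listing.
toy_jobs = pd.Series(["c. ¥ pair", "élig. ¥", "c. ¥. pair"])

# Same literal removal as in the cell above.
without_symbol = toy_jobs.str.replace("¥", "", regex=False)

# Assumed follow-up: collapse runs of whitespace and trim both ends.
tidy = without_symbol.str.replace(r"\s+", " ", regex=True).str.strip()
print(tidy.tolist())
```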

### Dealing with `#`

- Get the rows containing `#`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"#"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
92052,bpt6k6286466w,598,168,Mortier (Bon H.),C. # pair,Paix,3.,1842,409,c. # pair
302259,bpt6k6309075f,227,1,Bertsch,#photogr.,Fontaine-St-Georges,27,1861,131,#photogr.
342461,bpt6k6309075f,482,86,Loudun,# sous-biblioth. à l'Arsenal,Saliy,1.,1861,386,# sous-biblioth. à l'arsenal
355454,bpt6k6309075f,564,33,Questel (Ch. ),# architecte,Mazarine,20.,1861,468,# architecte
369861,bpt6k6314752k,393,39,Baradère,C. #ancien conseiller d'État,Université,35.,1856,203,c. #ancien conseiller d'état
...,...,...,...,...,...,...,...,...,...,...
3519048,bpt6k9763554c,646,195,Prevost,de la maison Jeanti #NC. et Prevost,Rougemont,14.,1875,523,de la maison jeanti #nc. et prevost
3533915,bpt6k9763554c,733,135,Walferdin,#anc. chef aux douanes,Budé,1.,1875,610,#anc. chef aux douanes
3759572,bpt6k9764647w,387,156,Darrac,maison de literie Séguier # NCS,Cadet,24.,1881,266,maison de literie séguier # ncs
3780870,bpt6k9764647w,508,71,Guyot-Sionnest,de la maison Chaligny # NC et Guyot-Sionnest,Philippe-de-Girard,54.,1881,387,de la maison chaligny # nc et guyot-sionnest


There are 61 rows that have the `#`.

Similar to `¥`, `#` is most of the time a misreading of the same symbol.

Some of the job names, however, contain `#` in place of misread characters. These will be replaced manually first. The rest of the `#` correspond to an award (verified visually) and shall be removed.

1. For index 1037978, https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f155.item.r=tribunal.zoom. The job name was ambiguous. An expert opinion about the row was sought from Loïc and the job will be changed from `commis-#reffier au tribunal de com merce` to `commis-greffier au tribunal de commerce`.

2. For index 1746023, https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f212.item.r=torcy.zoom. The `d'` was misinterpreted as `#`. The job will be changed from `directrice de la salle #asile` to `directrice de la salle d'asile`.

3. For index 1755653, https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f270.item.r=worth.zoom. The `et` was misinterpreted as `#`. The job will be changed from `commissionn. en vins # spiritueux` to `commissionn. en vins et spiritueux`.

4. For index 1760796, https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f301.item.r=Ouplessy.zoom. The `d'` was misinterpreted as `#`. The job will be changed from `fondeur # or et d'argent` to `fondeur d'or et d'argent`.

5. For index 2642414, https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f779.item.r=Vanlinden.zoom. The `«` was misinterpreted as `#`. The job will be changed from `directeurs du journal financier # le crédit public »` to `directeurs du journal financier « le crédit public »`. This is further reduced to `directeurs du journal financier le crédit public` as « and » will be removed eventually.

6. For index 2905684, https://gallica.bnf.fr/ark:/12148/bpt6k9732740w/f581.item.r=david.zoom. The `d'` was misinterpreted as `#`. The job will be changed from `fabr. #engrais` to `fabr. d'engrais`.

7. For index 3014436, https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f338.item.r=avocat.zoom. The `d'` was misinterpreted as `#`. The job will be changed from `avocat général à la cour #appel` to `avocat général à la cour d'appel`.

8. For index 3362469, https://gallica.bnf.fr/ark:/12148/bpt6k9763471j/f751.image.r=Versigny.zoom. The `H` was misinterpreted as `#`. The job will be changed from `député de la #te-saône` to `député de la hte-saône`.

9. For index 3816360, https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f591.item.r=Reverchon.zoom. The `d'` was misinterpreted as `#`. The job will be changed from `lunettes et objets #optique` to `lunettes et objets d'optique`.

10. For index 975097, https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f403.item.r=grains.zoom. The `g` was misinterpreted as `#`. The job will be changed from `#rains et fourrag` to `grains et fourrag`.

11. For index 1063875, https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f323.image.r=issy.zoom. The `s` was misinterpreted as `#`. The job will be changed from `#tatuaire` to `statuaire`.
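
With 61 matching rows, it helps to narrow down which ones need a visual check. The sketch below separates a `#` that stands alone between spaces (or at the string boundary) from a `#` attached to a word. It is only a heuristic on toy values: an attached `#` can still be an award mark, as in `#photogr.`, so the second group is a list of candidates for review and not a list of errors.

```python
import pandas as pd

# Toy values from the `#` listing and the manual corrections above.
toy_jobs = pd.Series(
    [
        "c. # pair",
        "#photogr.",
        "commis-#reffier au tribunal de com merce",
        "fabr. #engrais",
        "maison de literie séguier # ncs",
    ]
)

# `#` with whitespace or a string boundary on both sides.
standalone = toy_jobs.str.contains(r"(?:^|\s)#(?:\s|$)")

# Any other `#` touches a word and is a candidate for manual review.
attached = toy_jobs.str.contains("#", regex=False) & ~standalone

print(toy_jobs[attached].tolist())
```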

In [None]:
raw_paris_jobs.loc[1037978, "métier"] = "commis-greffier au tribunal de commerce"
raw_paris_jobs.loc[1746023, "métier"] = "directrice de la salle d'asile"
raw_paris_jobs.loc[1755653, "métier"] = "commissionn. en vins et spiritueux"
raw_paris_jobs.loc[1760796, "métier"] = "fondeur d'or et d'argent"
raw_paris_jobs.loc[2642414, "métier"] = "directeurs du journal financier le crédit public"
raw_paris_jobs.loc[2905684, "métier"] = "fabr. d'engrais"
raw_paris_jobs.loc[3014436, "métier"] = "avocat général à la cour d'appel"
raw_paris_jobs.loc[3362469, "métier"] = "député de la hte-saône"
raw_paris_jobs.loc[3816360, "métier"] = "lunettes et objets d'optique"
raw_paris_jobs.loc[975097, "métier"] = "grains et fourrag"
raw_paris_jobs.loc[1063875, "métier"] = "statuaire"

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace('#', '', regex=False)
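
The order of the two steps in the cell above matters: the manual fixes have to run before the blanket removal, otherwise the misread characters are lost. A small sketch on two toy rows makes the difference visible; `remove_symbol` is a hypothetical helper wrapping the same `str.replace` call.

```python
import pandas as pd


def remove_symbol(series, symbol):
    """Literal removal of `symbol` from every value."""
    return series.str.replace(symbol, "", regex=False)


# Toy rows: one misread `d'` and one award mark.
original = pd.Series(["fabr. #engrais", "c. # pair"], index=[2905684, 92052])

# Blanket removal only: the misread d' disappears.
removed_only = remove_symbol(original, "#")

# Manual fix first, then blanket removal, as in the cell above.
fixed_first = original.copy()
fixed_first.loc[2905684] = "fabr. d'engrais"
fixed_first = remove_symbol(fixed_first, "#")

print(removed_only.loc[2905684], "|", fixed_first.loc[2905684])
```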

### Dealing with `$`

- Get the rows containing `$`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\$"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
427509,bpt6k6314752k,729,139,Verniol,$ins,Bourg-l'Abbé,26.,1856,539,$ins
648201,bpt6k63197984,158,70,"Boufnouveau""",tab. de papier av fantaisit: $_1,Louis-en-l'lle,4610-34-50305110722,1852,102,tab. de papier av fantaisit: $_1
825297,bpt6k6324389h,384,134,Legrand,serrurier en b$uin.,Chabrol,25.,1859,312,serrurier en b$uin.
1157442,bpt6k6333200c,566,139,Mailly (comte de),prince d'Orange en France et de l'Isle-800$-Mo...,Université,53.,1862,434,prince d'orange en france et de l'isle-800$-mo...
1458342,bpt6k9668037f,856,63,) NCS (A. Démichel succ.),instruu$ pour les sciences,Pavée-Marais,24.,1884,689,instruu$ pour les sciences
1617561,bpt6k9672117f,564,122,Maigret et Vve Massa,cafe $t-Thomas,FillesSt-Thomas,9.,1874,439,cafe $t-thomas
1665700,bpt6k9672776c,371,102,Courant se,pharmacien de tre classe à l'hôpital militaire...,Récollets,8.,1880,252,pharmacien de tre classe à l'hôpital militaire...
1671394,bpt6k9672776c,407,237,'Devert et Schwob,cagents de fabriques de dis- $48,boul. Koltaire,41 bise (LA) sa sees,1880,288,cagents de fabriques de dis- $48
1755953,bpt6k96727875,407,152,Delhorme,G. O. $ général du cadre de ré serve,Anjou-St-Honoré,19.,1870,272,g. o. $ général du cadre de ré serve
1985887,bpt6k9677392n,593,77,Muller,propriétaire du Grand-H$tel d'Amérique,Pasquier,39.,1877,483,propriétaire du grand-h$tel d'amérique


There are 24 rows that have the `$`. 

The métier column will be replaced manually.

1. For index 427509, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f539.item.r=Verniol.zoom. The `v` was misinterpreted as `$`. Thus the job will be changed from `$ins` to `vins`.

2. For index 648201, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63197984/f102.item.r=Boufnouveau.zoom. The `St` in the address was misinterpreted as `$_1`. Thus the job will be changed from `tab. de papier av fantaisit: $_1` to `fab. de papier de fantaisie`.

3. For index 825297, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f312.item.r=chabrol.zoom. The `ât` was misinterpreted as `$u`. Thus the job will be changed from `serrurier en b$uin.` to `serrurier en bâtim.`.

4. For index 1157442, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f434.item.r=orange.zoom. The `sous` was misinterpreted as `800$`. Thus the job will be changed from `prince d'orange en france et de l'isle-800$-montréal` to `prince d'orange en france et de l'isle-sous-montréal`.

5. For index 1458342, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9668037f/f689.item.r=Demichel.zoom. The job was present in two lines. Thus the job will be changed from `instruu$ pour les sciences` to `instruments pour les sciences`. However, the name is still wrong.

6. For index 1617561, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f439.image.r=Massa.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `cafe $t-thomas` to `café st-thomas`.

7. For index 1665700, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f252.image.r=pharmacien.zoom. The `st` was misinterpreted as `$1`. Thus the job will be changed from `pharmacien de tre classe à l'hôpital militaire $1-91artin` to `pharmacien de 1re classe à l'hôpital militaire st-martin`.

8. For index 1671394, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f288.image.r=Schwob.zoom. The job was present in two lines. Thus the job will be changed from `cagents de fabriques de dis- $48` to `agents de fabriques de tissus`.

9. For index 1755953, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f272.item.r=cadre.zoom. The award information was included in the job name. Thus the job will be changed from `g. o. $ général du cadre de ré serve` to `général du cadre de réserve`.

10. For index 1985887, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f483.image.r=Muller.zoom. The `ô` was misinterpreted as `$`. Thus the job will be changed from `propriétaire du grand-h$tel d'amérique` to `propriétaire du grand-hôtel d'amérique`.

11. For index 2002421, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f581.image.r=Sutin.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `vicaire de l'église $t-sulpice` to `vicaire de l'église st-sulpice`.

12. For index 2330316, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f266.item.r=Bathias.zoom. The `r$` was extra. Thus the job will be changed from `fabr. de r$ jouets` to `fabr. de jouets`.

13. For index 2718443, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692626p/f733.item.r=Machin.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `boulanger-$` to `boulangers`.

14. For index 3026425, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f411.item.r=Wivct.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `$errurier` to `serrurier`.

15. For index 3476446, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f275.item.r=contre.zoom. The award information was included in the job name. Thus the job will be changed from `c. $ contre-amiral` to `contre-amiral`.

16. For index 3671453, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f221.item.r=courtier.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `courtier aux docks de $l-ouen` to `courtier aux docks de st-ouen`.

17. For index 3909655, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f551.image.r=germain.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `vins et hôtel $t-germain` to `vins et hôtel st-germain`.

18. For index 4079280, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f350.item.r=dentiste.zoom. The name was present in the job. Thus the job will be changed from `fay $4. dentiste` to `dentiste`. However, the name is wrong.

19. For index 4157114, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f331.item.r=gallois.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `$ociete frunçaise de produits pharmaceutiques` to `société française de produits pharmaceutiques`.

20. For index 4190886, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f577.item.r=success%20.zoom. The job will be changed from `fabr. de $; 1.` to `fabr. de bronzes`. However, the name is wrong.

21. For index 4198495, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f577.item.r=success%20.zoom. The job will be changed from `cartes postales illus$` to `cartes postales illustrées`. However, the name is wrong.

22. For index 4199562, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f641.item.r=Eglantine.zoom. The job will be changed from `quincaillerie en $. 1.` to `quincaillerie en gros`. However, the name is wrong.

23. For index 4253356, the image from the directory is . The job is present in the name. Thus the job will be changed from `n membre de la chambre de commerce $; 4.` to `bois des iles et sciage ancien membre de la chambre de commerce de paris`. However, the name is wrong.

24. For index 4390179, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1139.item.r=Sawisky.zoom. The `s` was misinterpreted as `$`. Thus the job will be changed from `repousseur $. métaux` to `repousseur s. métaux`.

In [None]:
raw_paris_jobs.loc[427509, "métier"] = "vins"
raw_paris_jobs.loc[648201, "métier"] = "fab. de papier de fantaisie"
raw_paris_jobs.loc[825297, "métier"] = "serrurier en bâtim."
raw_paris_jobs.loc[1157442, "métier"] = "prince d'orange en france et de l'isle-sous-montréal"
raw_paris_jobs.loc[1458342, "métier"] = "instruments pour les sciences"
raw_paris_jobs.loc[1617561, "métier"] = "café st-thomas"
raw_paris_jobs.loc[1665700, "métier"] = "pharmacien de classe à l'hôpital militaire st-martin"
raw_paris_jobs.loc[1671394, "métier"] = "agents de fabriques de tissus"
raw_paris_jobs.loc[1755953, "métier"] = "général du cadre de réserve"
raw_paris_jobs.loc[1985887, "métier"] = "propriétaire du grand-hôtel d'amérique"
raw_paris_jobs.loc[2002421, "métier"] = "vicaire de l'église st-sulpice"
raw_paris_jobs.loc[2330316, "métier"] = "fabr. de jouets"
raw_paris_jobs.loc[2718443, "métier"] = "boulangers"
raw_paris_jobs.loc[3026425, "métier"] = "serrurier"
raw_paris_jobs.loc[3476446, "métier"] = "contre-amiral"
raw_paris_jobs.loc[3671453, "métier"] = "courtier aux docks de st-ouen"
raw_paris_jobs.loc[3909655, "métier"] = "vins et hôtel st-germain"
raw_paris_jobs.loc[4079280, "métier"] = "dentiste"
raw_paris_jobs.loc[4157114, "métier"] = "société française de produits pharmaceutiques"
raw_paris_jobs.loc[4190886, "métier"] = "fabr. de bronzes"
raw_paris_jobs.loc[4198495, "métier"] = "cartes postales illustrées"
raw_paris_jobs.loc[4199562, "métier"] = "quincaillerie en gros"
raw_paris_jobs.loc[4253356, "métier"] = "bois des iles et sciage ancien membre de la chambre de commerce de paris"
raw_paris_jobs.loc[4390179, "métier"] = "repousseur s. métaux"

### Dealing with `@`

- Get the rows containing `@`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"@"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
34041,bpt6k6282019m,349,188,Latroucherie,aumonier des Dames-du-SacréC@ur,Varennes,94.,1855,277,aumonier des dames-du-sacréc@ur
93225,bpt6k6286466w,606,171,Pacotte,beurre et@ufs,Piliers-Potiers-d'Etain,10.,1842,417,beurre et@ufs
184730,bpt6k6292987t,936,90,Raulet,beurre et @ufs,Prècheurs,33.,1845,583,beurre et @ufs
237257,bpt6k6305463c,243,27,Bnrthélemy (marquise de),directrice des s@eurs de l'ordre de St-Louis,Clichy,64.,1857,130,directrice des s@eurs de l'ordre de st-louis
270520,bpt6k6305463c,442,131,Latroucherie,aumônier des Dames-du-SacréC@ur,Varennes,94.,1857,329,aumônier des dames-du-sacréc@ur
...,...,...,...,...,...,...,...,...,...,...
4353335,bpt6k9780089g,1224,312,Lemeunier,beurre & @ufs,r. Beaurepaire,4.,1922,885,beurre & @ufs
4354358,bpt6k9780089g,1231,302,Leroy,beurre & @ufs,r. d'Avron,62.,1922,892,beurre & @ufs
4387146,bpt6k9780089g,1457,133,Roulier,beurre & @ufs,Marché St-Germain,231.,1922,1118,beurre & @ufs
4396848,bpt6k9780089g,1542,177,Thébert,beurre et@ufs,r.du Château-d'Eau,28.,1922,1203,beurre et@ufs


79 rows contain `@`. In most of them, `@` is a misreading of `œ`. To handle this, we count the words that contain `@` and replace each one with the correct word.

In [None]:
at_word_freq = {}
for _,row in raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"@"))].iterrows():
    at_tokens = row["métier"].split()
    for token in at_tokens:
        if '@' in token:
            if token not in at_word_freq:
                at_word_freq[token] = 0
            at_word_freq[token] += 1
            
at_word_freq

{'dames-du-sacréc@ur': 2,
 'et@ufs': 2,
 '@ufs': 52,
 's@eurs': 1,
 's@urs': 2,
 'sacré-c@ur': 2,
 '1@tes-du-nord': 1,
 't@lier': 2,
 "l'@upre": 1,
 '@uvre': 1,
 'c@ur': 1,
 'sacrés-c@urs': 1,
 '@ommissionnairesi': 1,
 'oh@pelier': 1,
 '@u-|': 1,
 'c@q-héron': 1,
 '17@wanzos': 1,
 '@illets': 1,
 "l'@uvre": 2,
 '@wes': 1,
 'saintc@ur-de-marie': 1,
 'au@cat': 1}
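
The same token count can also be written with `collections.Counter`. The following is a minimal sketch on a toy `jobs` series (a hypothetical stand-in for `raw_paris_jobs["métier"]`, with illustrative values only), not the code that produced the output above.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for raw_paris_jobs["métier"]; the values are illustrative only.
jobs = pd.Series(["beurre et @ufs", "beurre & @ufs", "directrice des s@urs", "vins"])

# Keep only the entries that contain "@", split them into tokens,
# and count the tokens that themselves contain "@".
at_tokens = (
    token
    for job in jobs[jobs.str.contains("@", regex=False)]
    for token in job.split()
    if "@" in token
)
at_word_freq = Counter(at_tokens)

# most_common() lists the tokens from most to least frequent.
print(at_word_freq.most_common())  # [('@ufs', 2), ('s@urs', 1)]
```

Sorting by frequency makes it easier to see which tokens are safe for a blanket replacement and which rare ones need a manual check.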

First, the ambiguous ones are replaced manually after checking the image in the bottin. In the remaining cases, `@` is replaced with `œ`.

In [None]:
raw_paris_jobs.loc[1783671, "métier"] = "trésorier général de l'œuvre de l'adoption"
raw_paris_jobs.loc[799647, "métier"] = "député des côtes-du-nord"
raw_paris_jobs.loc[1088812, "métier"] = "tôlier"
raw_paris_jobs.loc[1651328, "métier"] = "tôlier"
raw_paris_jobs.loc[1709813, "métier"] = "commissionnaires en marchandises"
raw_paris_jobs.loc[1717611, "métier"] = "chapelier"
raw_paris_jobs.loc[1877976, "métier"] = "grand hôtel coq-héron"
raw_paris_jobs.loc[3001979, "métier"] = "fabr. de claviers pour pianos et orgues"
raw_paris_jobs.loc[3727038, "métier"] = "directrice de l'asile du saint-cœur-de-marie"
raw_paris_jobs.loc[3864697, "métier"] = "avocat cour d'appel"
raw_paris_jobs.loc[93225, "métier"] = "beurre et œufs"
raw_paris_jobs.loc[4396848, "métier"] = "beurre et œufs"
raw_paris_jobs.loc[34041, "métier"] = "aumonier des dames-du-sacré-cœur"
raw_paris_jobs.loc[270520, "métier"] = "aumonier des dames-du-sacré-cœur"
raw_paris_jobs.loc[1177209, "métier"] = "supérieure générale de l'œuvre de notre-dame-de-sion"
raw_paris_jobs.loc[3715616, "métier"] = "propriétaire des œuvres de feu a. panseron"


raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace('@', 'œ', regex=False)
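
After a blanket replacement like the one above, it can be useful to confirm that no `@` is left. The sketch below shows one possible check on a toy frame; `toy_jobs` and its values are assumptions made for illustration and are not part of the data set.

```python
import pandas as pd

# Toy frame standing in for raw_paris_jobs; values are illustrative only.
toy_jobs = pd.DataFrame({"métier": ["beurre et @ufs", "sacré-c@ur", "tôlier"]})

# Same literal replacement as above (regex=False treats "@" as plain text).
toy_jobs["métier"] = toy_jobs["métier"].str.replace("@", "œ", regex=False)

# Count the rows that still contain "@"; zero means the pass is complete.
remaining = int(toy_jobs["métier"].str.contains("@", regex=False).sum())
print(remaining)                    # 0
print(toy_jobs["métier"].tolist())  # ['beurre et œufs', 'sacré-cœur', 'tôlier']
```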

### Dealing with `_`

- Get the rows containing `_`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"_"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
241855,bpt6k6305463c,270,181,Bouin (A.) *,chef de bureau au ministère_de l'instr. publiq...,Varennes,13.,1857,157,chef de bureau au ministère_de l'instr. publiq...
522275,bpt6k6315985z,439,48,Riff,hơiel d'Al_sterdam,Vieux-Augusting,50.,1850,357,hơiel d'al_sterdam
716206,bpt6k6319811j,311,168,Fouquet (Vict.) NC),entrepreneur de bati_ments,Londres,7.,1854,232,entrepreneur de bati_ments
893767,bpt6k63243905,548,52,Leemann et Dinslage tailleurs,pass. del'Opé_ra,galerie du Baromètre,17.,1863,414,pass. del'opé_ra
960521,bpt6k63243920,390,190,Jollivet (P.),fab_ de tiges de bottines et souliers,St-André-des-Arts,65.,1860,310,fab_ de tiges de bottines et souliers
1115796,bpt6k6333200c,308,131,Bochet,fab._carmin d'indigo,Glacière-Gentilly,69.,1862,176,fab._carmin d'indigo
1553077,bpt6k9669143t,791,98,Ruiz (Vve) et Cle,broderies_sur soie,Montmartre,111.,1882,646,broderies_sur soie
1571018,bpt6k9672117f,287,71,Berthet,dépositaire des tourbières de St-Ma_xence (Oise),quai Valmy,131.,1874,162,dépositaire des tourbières de st-ma_xence oise
1667238,bpt6k9672776c,380,255,Dantès (Altred),homme de lettres et proprié|_taire,Notre-Dame des Champs,752,1880,261,homme de lettres et proprié|_taire
1756741,bpt6k96727875,412,27,Deny,serrurier en voiture et charron-forge_?on,Chaussée-du-Maine,22.,1870,277,serrurier en voiture et charron-forge_?on


1. For index 241855, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f157.item.r=cultes.zoom, The job will be changed from `chef de bureau au ministère_de l'instr. publique et des cultes` to `chef de bureau au ministère de l'instr. publique et des cultes`

2. For index 522275, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6315985z/f357.item.r=Riff.zoom, The job will be changed from `hơiel d'al_sterdam` to `hôtel d'amsterdam`

3. For index 716206, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f232.image.r=Fouquet.zoom, The job will be changed from `entrepreneur de bati_ments` to `entrepreneur de batiments`

4. For index 893767, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f414.image.r=Leemann.zoom, The job will be changed from `pass. del'opé_ra` to `pass. de l'opéra`

5. For index 960521, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f381.image.r=bottins.zoom, The job will be changed from `fab_ de tiges de bottines et souliers` to `fab. de tiges de bottines et souliers`

6. For index 1115796, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f176.item.r=Bochet.zoom, The job will be changed from `fab._carmin d'indigo` to `fab. carmin d'indigo`

7. For index 1553077, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f646.image.r=Ruiz.zoom, The job will be changed from `broderies_sur soie` to `broderies sur soie`

8. For index 1571018, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f162.item.r=tourbieres.zoom, The job will be changed from `dépositaire des tourbières de st-ma_xence oise` to `dépositaire des tourbières de st-maxence oise`

9. For index 1667238, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f261.item.r=Dantes.zoom, The job will be changed from `homme de lettres et proprié|_taire` to `homme de lettres et propriétaire`

10. For index 1756741, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f277.item.r=Deny.zoom, The job will be changed from `serrurier en voiture et charron-forge_?on` to `serrurier en voiture et charron-forgeron`

11. For index 1775469, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f387.item.r=jahn.zoom, The job will be changed from `représentan_ de fabriques` to `représentant de fabriques`

12. For index 1780454, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f416.item.r=manufacture.zoom, The job will be changed from `manufacture de dentel_les` to `manufacture de dentelles`

13. For index 1801778, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f541.item.r=delbes.zoom, The job will be changed from `fabr. de chevilles et ri_vets pour chaussures` to `fabr. de chevilles et rivets pour chaussures`

14. For index 1919390, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f780.item.r=Houssiaux.zoom, The job will be changed from `librai_re` to `libraire`

15. For index 1934867, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f186.item.r=serrures.zoom, The job will be changed from `fabr. de serrures pour né_cessaires` to `fabr. de serrures pour nécessaires`

16. For index 1946061, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f252.item.r=Delaroque.zoom, The job will be changed from `fabi. de produits chimi_gues` to `fabr. de produits chimiques`

17. For index 1963097, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f349.item.r=gendres.zoom, The job will be changed from `pierres meuliè_res` to `pierres meulières`

18. For index 1969446, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f386.item.r=Conservatoire.zoom#, The job will be changed from `professeur au conservatoire de musi|_que` to `professeur au conservatoire de musique`

19. For index 1996863, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f547.item.r=entrepreneurs.zoom, The job will be changed from `entrepreneurs de tra_vaux publics` to `entrepreneurs de travaux publics`

20. For index 2006152, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f602.item.r=Varnier.zoom, The job will be changed from `entrepreneur de_transports` to `entrepreneur de transports`

21. For index 2383653, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f594.item.r=Lechevretel.zoom, The job will be changed from `ustensiles de ména_ ge` to `ustensiles de ménage`

22. For index 2582159, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f416.item.r=neufs.zoom, The job will be changed from `p_anos neufs et d'occasion` to `pianos neufs et d'occasion`

23. For index 2615085, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f614.item.r=lyon.zoom, The job will be changed from `de la maison michaud et lyon_net` to `de la maison michaud et lyonnet`

24. For index 2634835, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f731.item.r=Savaresse.zoom, The job will be changed from `cordes har|_moniques` to `cordes harmoniques`

25. For index 2640580, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f768.item.r=Touluch.zoom, The job will be changed from `_vins et liqueurs` to `vins et liqueurs`

26. For index 2778820, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692809v/f220.image.r=instruments.zoom, The job will be changed from `facteur d'instruments de mu_sique en cuivre` to `facteur d'instruments de musique en cuivre`

27. For index 3047938, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f564.image.r=archiviste.zoom, The job will be changed from `secrétaire-archiviste de l'arche_vêché` to `secrétaire-archiviste de l'archevêché`

28. For index 3138510, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f438.item.r=chaudronniers.zoom, The job will be changed from `chaudronniers et ma_ chines d'occasion` to `chaudronniers et machines d'occasion`

29. For index 3214560, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97631451/f776.item.r=gendre.zoom, The job will be changed from `graveurs_sur métaux` to `graveurs sur métaux`

30. For index 3435821, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f516.item.r=Rapicault.zoom, The job will be changed from `me: _erie` to `mercerie`

31. For index 3478825, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f288.item.r=etiquettes.zoom, The job will be changed from `fabr. de_boueles et étiquettes pour conserves` to `fabr. de boucles et étiquettes pour conserves`

32. For index 3493909, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f375.item.r=Adrien.zoom, The job will be changed from `ceinturon_nier` to `ceinturonnier`

33. For index 3569239, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764402m/f543.item.r=Delzenne.zoom, The job will be changed from `timbres-poste pour collec_ tions` to `timbres-poste pour collections`

34. For index 3729350, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f561.item.r=Lefrancois.zoom, The job will be changed from `tabr_d'irrigateurs` to `fabr. d'irrigateurs`

35. For index 3959866, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f347.item.r=Fradet.zoom, The job will be changed from `arbitre près le tribunal de com|_merce` to `arbitre près le tribunal de commerce`

36. For index 3983392, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f502.item.r=chimiques.zoom, The job will be changed from `fabr. de_ benzine et produits chimiques` to `fabr. de benzine et produits chimiques`

37. For index 3989727. The job will be changed from `_beurre et oeufs` to `beurre et œufs`

38. For index 4039236. The job will be changed from `beurre et_ ufs` to `beurre et œufs`

39. For index 4124868, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f657.item.r=chaudronnerie.zoom, The job will be changed from `chaudronnerie pour automo: _1.` to `chaudronnerie pour automobiles`

40. For index 4173690, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f448.item.r=tissus.zoom, The job will be changed from `tissus pour_ameublements` to `tissus pour ameublements`

41. For index 4182797, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f512.item.r=Chopard%20freres.zoom, The job will be changed from `bijouterie_imitation` to `bijouterie imitation`

42. For index 4214148, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f745.item.r=delamaire.zoom, The job will be changed from `_marcel delamaire` to `halle aux cuirs`

43. For index 4215735. The job will be changed from `_chauffage central` to `chauffage central`

44. For index 4253362, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1035.item.r=cour.zoom, The job will be changed from `avooat_cour 'appel` to `avocat cour d'appel`

45. For index 4317038, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f631.item.r=impressions.zoom, The job will be changed from `graveur_pour impressions` to `graveur pour impressions`

46. For index 4329857. The job will be changed from `avocat_à la cour d'appel` to `avocat à la cour d'appel`

47. For index 4334843. The job will be changed from `fabr. de_ chaussures` to `fabr. de chaussures`

48. For index 4376196, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1044.item.r=Pezzani.zoom, The job will be changed from `agence_théâtrale` to `agence théâtrale`

49. For index 4379706. The job will be changed from `_serruriers` to `serruriers`

50. For index 4382626, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1042.item.r=machine.zoom, The job will be changed from `machine à _affüier` to `machine à affûter`

In [None]:
raw_paris_jobs.loc[241855, "métier"] = "chef de bureau au ministère de l'instr. publique et des cultes"
raw_paris_jobs.loc[522275, "métier"] = "hôtel d'amsterdam"
raw_paris_jobs.loc[716206, "métier"] = "entrepreneur de batiments"
raw_paris_jobs.loc[1553077, "métier"] = "broderies sur soie"
raw_paris_jobs.loc[1571018, "métier"] = "dépositaire des tourbières de st-maxence oise"
raw_paris_jobs.loc[1780454, "métier"] = "manufacture de dentelles"
raw_paris_jobs.loc[1801778, "métier"] = "fabr. de chevilles et rivets pour chaussures"
raw_paris_jobs.loc[1919390, "métier"] = "libraire"
raw_paris_jobs.loc[1946061, "métier"] = "fabr. de produits chimiques"
raw_paris_jobs.loc[1996863, "métier"] = "entrepreneurs de travaux publics"
raw_paris_jobs.loc[2006152, "métier"] = "entrepreneur de transports"
raw_paris_jobs.loc[2582159, "métier"] = "pianos neufs et d'occasion"
raw_paris_jobs.loc[2615085, "métier"] = "de la maison michaud et lyonnet"
raw_paris_jobs.loc[2778820, "métier"] = "facteur d'instruments de musique en cuivre"
raw_paris_jobs.loc[3047938, "métier"] = "secrétaire-archiviste de l'archevêché"
raw_paris_jobs.loc[3214560, "métier"] = "graveurs sur métaux"
raw_paris_jobs.loc[3478825, "métier"] = "fabr. de boucles et étiquettes pour conserves"
raw_paris_jobs.loc[3493909, "métier"] = "ceinturonnier"
raw_paris_jobs.loc[3729350, "métier"] = "fabr. d'irrigateurs"
raw_paris_jobs.loc[4173690, "métier"] = "tissus pour ameublements"
raw_paris_jobs.loc[4182797, "métier"] = "bijouterie imitation"
raw_paris_jobs.loc[4253362, "métier"] = "avocat cour d'appel"
raw_paris_jobs.loc[4317038, "métier"] = "graveur pour impressions"
raw_paris_jobs.loc[4376196, "métier"] = "agence théâtrale"
raw_paris_jobs.loc[893767, "métier"] = "pass. de l'opéra"
raw_paris_jobs.loc[960521, "métier"] = "fab. de tiges de bottines et souliers"
raw_paris_jobs.loc[1115796, "métier"] = "fab. carmin d'indigo"
raw_paris_jobs.loc[1667238, "métier"] = "homme de lettres et propriétaire"
raw_paris_jobs.loc[1756741, "métier"] = "serrurier en voiture et charron-forgeron"
raw_paris_jobs.loc[1775469, "métier"] = "représentant de fabriques"
raw_paris_jobs.loc[1934867, "métier"] = "fabr. de serrures pour nécessaires"
raw_paris_jobs.loc[1963097, "métier"] = "pierres meulières"
raw_paris_jobs.loc[1969446, "métier"] = "professeur au conservatoire de musique"
raw_paris_jobs.loc[2383653, "métier"] = "ustensiles de ménage"
raw_paris_jobs.loc[2634835, "métier"] = "cordes harmoniques"
raw_paris_jobs.loc[2640580, "métier"] = "vins et liqueurs"
raw_paris_jobs.loc[3138510, "métier"] = "chaudronniers et machines d'occasion"
raw_paris_jobs.loc[3435821, "métier"] = "mercerie"
raw_paris_jobs.loc[3569239, "métier"] = "timbres-poste pour collections"
raw_paris_jobs.loc[3959866, "métier"] = "arbitre près le tribunal de commerce"
raw_paris_jobs.loc[3983392, "métier"] = "fabr. de benzine et produits chimiques"
raw_paris_jobs.loc[3989727, "métier"] = "beurre et œufs"
raw_paris_jobs.loc[4039236, "métier"] = "beurre et œufs"
raw_paris_jobs.loc[4215735, "métier"] = "chauffage central"
raw_paris_jobs.loc[4329857, "métier"] = "avocat à la cour d'appel"
raw_paris_jobs.loc[4334843, "métier"] = "fabr. de chaussures"
raw_paris_jobs.loc[4379706, "métier"] = "serruriers"
raw_paris_jobs.loc[4382626, "métier"] = "machine à affûter"
raw_paris_jobs.loc[4124868, "métier"] = "chaudronnerie pour automobiles"
raw_paris_jobs.loc[4214148, "métier"] = "halle aux cuirs"
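
Long runs of single-row assignments like the ones above can also be kept in a dictionary and applied in a loop, which makes duplicated or mistyped indices easier to spot. This is only a sketch under assumptions: `toy_jobs`, `underscore_fixes`, and the index `42` are hypothetical names and values introduced for illustration.

```python
import pandas as pd

# Toy frame standing in for raw_paris_jobs, indexed like the real data.
toy_jobs = pd.DataFrame(
    {"métier": ["broderies_sur soie", "librai_re", "vins"]},
    index=[1553077, 1919390, 42],
)

# Hypothetical mapping: row index -> corrected profession.
underscore_fixes = {
    1553077: "broderies sur soie",
    1919390: "libraire",
}

# Fail early if an index is missing, since .loc with a new label would
# silently append a row instead of correcting an existing one.
missing = [idx for idx in underscore_fixes if idx not in toy_jobs.index]
assert not missing, f"unknown indices: {missing}"

for idx, fixed in underscore_fixes.items():
    toy_jobs.loc[idx, "métier"] = fixed

print(toy_jobs["métier"].tolist())  # ['broderies sur soie', 'libraire', 'vins']
```

Because dictionary keys are unique, an index entered twice collapses to a single entry instead of being assigned twice.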

### Dealing with `‘`

- Get the rows with `‘`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"‘"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
37197,bpt6k6282019m,368,100,Leroy,directeur au sous-comptoir de la‘librairie,Bonaparte,5.0,1855,296,directeur au sous-comptoir de la‘librairie
126484,bpt6k62906378,719,113,Jacob,fab. de cein‘ures,Bourg-l'Abbé,56.0,1846,431,fab. de cein‘ures
658138,bpt6k63197984,227,111,Dreyfus aîné et Cie,fab. chapeaux et cas‘quettes,passage Ste-Avoie,4.0,1852,171,fab. chapeaux et cas‘quettes
892974,bpt6k63243905,543,97,Lebreton,bo‘lier,Caumartin,58.0,1863,409,bo‘lier
1018151,bpt6k6331310g,591,12,Lefebvre,fab. ‘et magasin de nécessaires et gaînerie en...,Ste-Croix-de-la-Bretonnerie,44.0,1844,375,fab. ‘et magasin de nécessaires et gaînerie en...
1041756,bpt6k6333170p,316,142,Blacoar Vse),co‘ons filés,Albouy,8.0,1864,179,co‘ons filés
1073489,bpt6k6333170p,521,185,Jamin et Cie,photog‘aphie-microscopique,Chapop,13.0,1864,384,photog‘aphie-microscopique


1. For index 37197, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6282019m/f296.image.r=directeur.zoom, The job will be changed from `directeur au sous-comptoir de la‘librairie` to `directeur au sous-comptoir de la librairie`

2. For index 126484, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62906378/f431.item.r=fub.zoom, The job will be changed from `fab. de cein‘ures` to `fab. de ceintures`

3. For index 658138, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63197984/f171.image.r=Dreyfus.zoom, The job will be changed from `fab. chapeaux et cas‘quettes` to `fab. chapeaux et casquettes`

4. For index 892974, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f409.item.r=Caumartin.zoom, The job will be changed from `bo‘lier` to `bottier`

5. For index 1018151, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6331310g/f375.item.r=gainerie.zoom, The job will be changed from `fab. ‘et magasin de nécessaires et gaînerie en tous genres` to `fab. et magasin de nécessaires et gaînerie en tous genres`

6. For index 1041756, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f179.item.r=Albouy.zoom, The job will be changed from `co‘ons filés` to `cotons filés`

7. For index 1073489, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f384.image.r=jamin.zoom, The job will be changed from `photog‘aphie-microscopique` to `photographie-microscopique`

In [None]:
raw_paris_jobs.loc[37197, "métier"] = "directeur au sous-comptoir de la librairie"
raw_paris_jobs.loc[126484, "métier"] = "fab. de ceintures"
raw_paris_jobs.loc[658138, "métier"] = "fab. chapeaux et casquettes"
raw_paris_jobs.loc[892974, "métier"] = "bottier"
raw_paris_jobs.loc[1018151, "métier"] = "fab. et magasin de nécessaires et gaînerie en tous genres"
raw_paris_jobs.loc[1041756, "métier"] = "cotons filés"
raw_paris_jobs.loc[1073489, "métier"] = "photographie-microscopique"

### Dealing with `”`

- Get the rows with `”`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"”"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
403469,bpt6k6314752k,588,93,Leblanc,” fab. d'instruments de chirurgie,St-Antoine,86.,1856,398,” fab. d'instruments de chirurgie
540834,bpt6k6318531z,290,160,Chaumonot et Cie,fab. de chapeaux de paill”,Montmartre,138.,1858,182,fab. de chapeaux de paill”
958186,bpt6k63243920,376,66,Hérault,horticulteu”,Sèvres-Vaugirard,172,1860,296,horticulteu”
1024821,bpt6k6331310g,632,49,Nortier,libraire et cabinet de lectur”,VienxAugustins,64.,1844,416,libraire et cabinet de lectur”
1077180,bpt6k6333170p,547,176,Luugi-r ère et fils (Jh. Sichel successeur),fab. de pariumeri”,boul. Sébastopol,105.,1864,410,fab. de pariumeri”
1271812,bpt6k6391515w,859,58,Nereu,tab”c,Ecole-de-Médecine,21.,1847,540,tab”c
1844791,bpt6k96762564,534,68,Defrémont,teinturier dégraisseu”,boul. de Grenelle,60.,1886,349,teinturier dégraisseu”
2572204,bpt6k9685861g,544,13,Defrémont,teinturier dégraisseu”,boul. Garibaldi,17.,1887,355,teinturier dégraisseu”
3400256,bpt6k9763553z,411,120,Gandolle,tabac et ”,boul. Bonne-Nouvelle,18.,1876,306,tabac et ”
3475054,bpt6k9763554c,389,161,De Plument,fab. dejupo”,Vivienne,33.,1875,266,fab. dejupo”


1. For index 403469, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f398.image.r=instruments.zoom, The job will be changed from `” fab. d'instruments de chirurgie` to `fab. d'instruments de chirurgie`

2. For index 540834, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f182.image.r=Chaumonot, The job will be changed from `fab. de chapeaux de paill”` to `fab. de chapeaux de paille`

3. For index 958186, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f296.image.r=Herault.zoom, The job will be changed from `horticulteu”` to `horticulteur`

4. For index 1024821, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6331310g/f416.image.r=Nortier, The job will be changed from `libraire et cabinet de lectur”` to `libraire et cabinet de lecture`

5. For index 1077180, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f410.image.r=Sichel.zoom, The job will be changed from `fab. de pariumeri”` to `fab. de pariumerie`

6. For index 1271812. The job will be changed from `tab”c` to `tabac`

7. For index 1844791, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f349.image.r=Defremont, The job will be changed from `teinturier dégraisseu”` to `teinturier dégraisseur`

8. For index 2572204. The job will be changed from `teinturier dégraisseu”` to `teinturier dégraisseur`

9. For index 3400256, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f306.image.r=Gandolle.zoom, The mailbox symbol was misinterpreted. The job will be changed from `tabac et ”` to `tabac`

10. For index 3475054, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f266.image.r=Plument.zoom, The job will be changed from `fab. dejupo”` to `fab. de jupons`

11. For index 3490488, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f355.item.r=Guyard.zoom, The job will be changed from `boucher”` to `boucher`

12. For index 3965994, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f385.image.r=Goguey, The job will be changed from `autographi”` to `autographie`

13. For index 4026461, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f791.image.r=Thiolier.zoom, The job will be changed from `éditeur des «livretsguides du touriste”` to `éditeur des livrets-guides du touriste`

14. For index 4273362, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f977.image.r=Pansements.zoom, The name is seen in the job. The job will be changed from `pansements la croix-soleil”` to `pansements la croix-soleil`

In [None]:
raw_paris_jobs.loc[403469, "métier"] = "fab. d'instruments de chirurgie"
raw_paris_jobs.loc[540834, "métier"] = "fab. de chapeaux de paille"
raw_paris_jobs.loc[958186, "métier"] = "horticulteur"
raw_paris_jobs.loc[1024821, "métier"] = "libraire et cabinet de lecture"
raw_paris_jobs.loc[1077180, "métier"] = "fab. de pariumerie"
raw_paris_jobs.loc[1271812, "métier"] = "tabac"
raw_paris_jobs.loc[1844791, "métier"] = "teinturier dégraisseur"
raw_paris_jobs.loc[2572204, "métier"] = "teinturier dégraisseur"
raw_paris_jobs.loc[3400256, "métier"] = "tabac"
raw_paris_jobs.loc[3475054, "métier"] = "fab. de jupons"
raw_paris_jobs.loc[3490488, "métier"] = "boucher"
raw_paris_jobs.loc[3965994, "métier"] = "autographie"
raw_paris_jobs.loc[4026461, "métier"] = "éditeur des livrets-guides du touriste"
raw_paris_jobs.loc[4273362, "métier"] = "pansements la croix-soleil"

### Dealing with `°`

- Get the rows containing `°`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"°"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
71399,bpt6k6286466w,459,27,Chine,3 v. in-8°,Echelle-St-Honoré,9.,1842,270,in-8°
137427,bpt6k62906378,786,99,Nortier,libraire et cabinet de lectur°,VieuxAugustins,64.,1846,498,libraire et cabinet de lectur°
1074363,bpt6k6333170p,527,158,Jousgelin,directeur du bureau de Poste n° 23,ſaub. St-Antoine,174.,1864,390,directeur du bureau de poste n°
1287019,bpt6k6393838j,460,70,de l'Etat de la Typographie ottomane à Constan...,1 vol. in-8°,Paris,1821; 20 Guide des pèlerins de Constantinople ...,1843,241,in-8°
1291750,bpt6k6393838j,493,57,monumentale et pittoresque,3 vol. in-8°,Cassette,6.,1843,274,in-8°
1292297,bpt6k6393838j,497,0,matrice,in-8°,1828; Nouvelle méthode d'extraire la pierre de...,1830; Mémoire sur la compression et la ligatur...,1843,278,in-8°
1310002,bpt6k6393838j,618,57,quatre classes d'animaux vertébrés; Traité éle...,2 vol. in-8°,ornés de 160 pl. gravées ; plusieurs mémoires ...,38.,1843,399,in-8°
1480181,bpt6k9669143t,347,94,Blot (E.),directeur du dépôt n° 3 de la Société générale...,rue du Cotentin,6.,1882,202,directeur du dépôt n° la société générale de l...
1497139,bpt6k9669143t,448,175,Delamarche (Paul) *,7°eceveur d'octroi,Brézin,13.,1882,303,7°eceveur d'octroi
1850384,bpt6k96762564,566,193,Du Bucquoy,secrétaire chef des bureaux à la mairie du VI°...,Bonaparte,78.,1886,381,secrétaire chef des bureaux à la mairie du vi°...


First, each `°` is removed when it is surrounded by spaces, at the start and followed by a space, or at the end and preceded by a space.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\°(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
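
The same two-step pattern (drop the isolated symbol, then collapse repeated whitespace) recurs for every character handled in this notebook. The sketch below applies it to a few made-up strings to show which cases it catches and which it leaves for manual review. The helper name `strip_isolated` is hypothetical and not part of the notebook.

```python
import re
import pandas as pd


def strip_isolated(series: pd.Series, char: str) -> pd.Series:
    """Remove `char` when it stands alone (hypothetical helper, for illustration)."""
    # Same shape as the pattern used in the cell above, with the character escaped.
    pattern = r"(^|\s)" + re.escape(char) + r"(\s|$)"
    cleaned = series.str.replace(pattern, " ", regex=True)
    # Collapse runs of whitespace left behind by the replacement.
    return cleaned.str.replace(r"\s\s+", " ", regex=True)


sample = pd.Series(["maire du ° arrond", "° boucher", "tabac °", "in-8°"])
result = strip_isolated(sample, "°")
print(result.tolist())
```

A symbol attached to a word, as in `in-8°`, is left untouched, which is why those rows still appear in the next output. The replacement can also leave a leading or trailing space; whether to strip it is a separate decision.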

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"°"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
71399,bpt6k6286466w,459,27,Chine,3 v. in-8°,Echelle-St-Honoré,9.,1842,270,in-8°
137427,bpt6k62906378,786,99,Nortier,libraire et cabinet de lectur°,VieuxAugustins,64.,1846,498,libraire et cabinet de lectur°
1074363,bpt6k6333170p,527,158,Jousgelin,directeur du bureau de Poste n° 23,ſaub. St-Antoine,174.,1864,390,directeur du bureau de poste n°
1287019,bpt6k6393838j,460,70,de l'Etat de la Typographie ottomane à Constan...,1 vol. in-8°,Paris,1821; 20 Guide des pèlerins de Constantinople ...,1843,241,in-8°
1291750,bpt6k6393838j,493,57,monumentale et pittoresque,3 vol. in-8°,Cassette,6.,1843,274,in-8°
1292297,bpt6k6393838j,497,0,matrice,in-8°,1828; Nouvelle méthode d'extraire la pierre de...,1830; Mémoire sur la compression et la ligatur...,1843,278,in-8°
1310002,bpt6k6393838j,618,57,quatre classes d'animaux vertébrés; Traité éle...,2 vol. in-8°,ornés de 160 pl. gravées ; plusieurs mémoires ...,38.,1843,399,in-8°
1480181,bpt6k9669143t,347,94,Blot (E.),directeur du dépôt n° 3 de la Société générale...,rue du Cotentin,6.,1882,202,directeur du dépôt n° la société générale de l...
1497139,bpt6k9669143t,448,175,Delamarche (Paul) *,7°eceveur d'octroi,Brézin,13.,1882,303,7°eceveur d'octroi
1850384,bpt6k96762564,566,193,Du Bucquoy,secrétaire chef des bureaux à la mairie du VI°...,Bonaparte,78.,1886,381,secrétaire chef des bureaux à la mairie du vi°...


Some entries in the list below do not appear in the output above, because their symbols were already removed during the deletion of numbers. They are still corrected here so that the earlier corrections continue to apply.

1. For index 71399, The job will be changed from ` v. in-°` to ` `

2. For index 137427, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f303.item.r=Brezin.zoom, The name is seen in the job. The job will be changed from `libraire et cabinet de lectur°` to `libraire et cabinet de lecture`

3. For index 1074363, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f390.item.r=directeur.zoom, The job will be changed from `directeur du bureau de poste n° ` to `directeur du bureau de poste`

4. For index 1287019, The job will be changed from ` vol. in-°` to ` `

5. For index 1291750, The job will be changed from ` vol. in-°` to ` `

6. For index 1292297, The job will be changed from `in-°` to ` `

7. For index 1310002, The job will be changed from ` vol. in-°` to ` `

8. For index 1480181, The job will be changed from `directeur du dépôt n°  de la société générale de laiterie` to `directeur du dépôt de la société générale de laiterie`

9. For index 1497139, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f303.item.r=Brezin.zoom, The name is seen in the job. The job will be changed from `°eceveur d'octroi` to `receveur d'octroi`

10. For index 1850384, The job will be changed from `secrétaire chef des bureaux à la mairie du vi° arrondissement` to `secrétaire chef des bureaux à la mairie du arrondissement`

11. For index 1896731, The job will be changed from `adjoint au maire du xii° arrondissement` to `adjoint au maire du arrondissement`

12. For index 2532718, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685098r/f901.item.r=auguste.zoom, The job will be changed from `mandataire aux halles (volaille et gibier pavillon n° ` to `mandataire aux halles volaille et gibier pavillon`

13. For index 2608795, The job will be changed from `hôtel du iv° arrondiss.` to `hôtel du arrondiss.`

14. For index 3161362, The job will be changed from `huissier de la justice de paix du xx° arrond.` to `huissier de la justice de paix du arrond.`

15. For index 3203919, The job will be changed from `bains du xx° siècle` to `bains du siècle`

16. For index 3212761, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97631451/f765.item.r=Leewitz.zoom, The job will be changed from `directeur général de markt et c° paris l'` to `directeur général de markt`

17. For index 3252354, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97631451/f1015.image.r=victoire.zoom. The entry spanned multiple lines and was not interpreted completely. The job will be changed from `°: crédint.` to ` `

18. For index 3277314, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763471j/f235.image.r=Boyeldieu.zoom. The job will be changed from `juge de paig du:° arrondissement` to `juge de paix du arrondissement`

19. For index 3554299, The job will be changed from `huissier de la justice de paix du xx° arrond.` to `huissier de la justice de paix du arrond.`

20. For index 3596150, The job will be changed from `bains du xx° siècle` to `bains du siècle`

21. For index 3604926. The job will be changed from `directeur général de markt et c° paris ld` to `directeur général de markt`

22. For index 3985724. The job will be changed from `blanchiss. r. linne. . °).` to `blanchiss.`

23. For index 3988946, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f543.image.r=MaKlu, The job will be changed from `r. eugène-marlin.. °).` to `boulanger`

24. For index 4126745, The job will be changed from ` °.` to ` `

25. For index 4158163, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f339.image.r=Ecossais.zoom, The job will be changed from `°. t. elys. . ; usine` to `usine`

26. For index 4181218, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f501.item.r=blonge.zoom, The job will be changed from `horu hassagne °). t. gob. .` to ` `

27. For index 4192828. The job will be changed from `bls. °.` to ` `

28. For index 4212655, The job will be changed from `maire du° arrond` to `maire du arrond`

29. For index 4222726. The job will be changed from `-). °. onnier` to ` `

30. For index 4237075, the image from the directory is not clear. It could be present in this page https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f913.item. The job will be changed from `emporte-p t. °)` to `emporte-pièces`

31. For index 4242116, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f948.image.r=Mortier#, The job will be changed from `medecin. r. de villejust.  his. °. t. pas. . .` to `medecin.`

32. For index 4244104. The job will be changed from `. °. t. centr. . .` to ` `

33. For index 4245283, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838, The job will be changed from `r. monsieur-le-prir °. t. gob. . .` to ` `

34. For index 4254500, The job will be changed from `). °).` to ` `

35. For index 4257529. The job will be changed from `r. davai.  °.` to ` `

36. For index 4266415. The job will be changed from ` bis . °.` to ` `

37. For index 4305806. The job will be changed from ` et . °). t. roq. . .` to ` `

38. For index 4362974. The job will be changed from `il. °.` to ` `

39. For index 4371075, The job will be changed from ` °.` to ` `

In [None]:
raw_paris_jobs.loc[71399, "métier"] = ""
raw_paris_jobs.loc[137427, "métier"] = "libraire et cabinet de lecture"
raw_paris_jobs.loc[1074363, "métier"] = "directeur du bureau de poste"
raw_paris_jobs.loc[1287019, "métier"] = ""
raw_paris_jobs.loc[1291750, "métier"] = ""
raw_paris_jobs.loc[1292297, "métier"] = ""
raw_paris_jobs.loc[1310002, "métier"] = ""
raw_paris_jobs.loc[1473793, "métier"] = "greffier à la justice de paix du arrondissement"
raw_paris_jobs.loc[1480181, "métier"] = "directeur du dépôt de la société générale de laiterie"
raw_paris_jobs.loc[1497139, "métier"] = "receveur d'octroi"
raw_paris_jobs.loc[1850384, "métier"] = "secrétaire chef des bureaux à la mairie du arrondissement"
raw_paris_jobs.loc[1896731, "métier"] = "adjoint au maire du arrondissement"
raw_paris_jobs.loc[2532718, "métier"] = "mandataire aux halles volaille et gibier pavillon"
raw_paris_jobs.loc[2608795, "métier"] = "hôtel du arrondiss."
raw_paris_jobs.loc[3161362, "métier"] = "huissier de la justice de paix du arrond."
raw_paris_jobs.loc[3203919, "métier"] = "bains du siècle"
raw_paris_jobs.loc[3212761, "métier"] = "directeur général de markt"
raw_paris_jobs.loc[3252354, "métier"] = ""
raw_paris_jobs.loc[3277314, "métier"] = "juge de paix du arrondissement"
raw_paris_jobs.loc[3554299, "métier"] = "huissier de la justice de paix du arrond."
raw_paris_jobs.loc[3596150, "métier"] = "bains du siècle"
raw_paris_jobs.loc[3604926, "métier"] = "directeur général de markt"
raw_paris_jobs.loc[3985724, "métier"] = "blanchiss."
raw_paris_jobs.loc[3988946, "métier"] = "boulanger"
raw_paris_jobs.loc[4126745, "métier"] = ""
raw_paris_jobs.loc[4158163, "métier"] = "usine"
raw_paris_jobs.loc[4181218, "métier"] = ""
raw_paris_jobs.loc[4192828, "métier"] = ""
raw_paris_jobs.loc[4212655, "métier"] = "maire du arrond"
raw_paris_jobs.loc[4222726, "métier"] = ""
raw_paris_jobs.loc[4237075, "métier"] = "emporte-pièces"
raw_paris_jobs.loc[4242116, "métier"] = "medecin."
raw_paris_jobs.loc[4244104, "métier"] = ""
raw_paris_jobs.loc[4245283, "métier"] = ""
raw_paris_jobs.loc[4254500, "métier"] = ""
raw_paris_jobs.loc[4257529, "métier"] = ""
raw_paris_jobs.loc[4266415, "métier"] = ""
raw_paris_jobs.loc[4305806, "métier"] = ""
raw_paris_jobs.loc[4362974, "métier"] = ""
raw_paris_jobs.loc[4371075, "métier"] = ""
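
Long runs of single `.loc` assignments like the one above can also be written as a mapping from index label to corrected value. The sketch below uses a toy frame in place of `raw_paris_jobs`; the names `jobs`, `corrections` and `fixes` are illustrative only. It checks the labels first, because assigning to a mistyped label with `.loc` adds a new row rather than raising an error.

```python
import pandas as pd

# Toy frame standing in for raw_paris_jobs; values are taken from the rows above.
jobs = pd.DataFrame(
    {"métier": ["libraire et cabinet de lectur°", "°eceveur d'octroi", "boucher"]},
    index=[137427, 1497139, 3490488],
)

# index label -> corrected profession
corrections = {
    137427: "libraire et cabinet de lecture",
    1497139: "receveur d'octroi",
}
fixes = pd.Series(corrections)

# Fail early on a mistyped label instead of silently enlarging the frame.
missing = fixes.index.difference(jobs.index)
if len(missing) > 0:
    raise KeyError(f"unknown index labels: {list(missing)}")

# Assignment aligns on the index, so only the listed rows change.
jobs.loc[fixes.index, "métier"] = fixes
print(jobs)
```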

### Dealing with `|`

- Get rows with `|`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'\|'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
980,bpt6k6282019m,149,215,Archéologie (cours d'),à la Bibliothèque impé| riale,Richelieu,58,1855,77,à la bibliothèque impé| riale
25981,bpt6k6282019m,300,146,Giraud,sous-chef à la division des procès| verbaux au...,Université,126 et 128.,1855,228,sous-chef à la division des procès| verbaux au...
26684,bpt6k6282019m,305,1,Gouré joune NC. et Grandjean,fabr. de cha- | les,Nve-St-Eustache,8.,1855,233,fabr. de cha- | les
30481,bpt6k6282019m,328,0,Intérieur ( ministère de l'),Grenelle-St-Gor-| main,101; Bureaux,103,1855,256,grenelle-st-gor-| main
31341,bpt6k6282019m,333,0,Joly fils (Edmond de),architecte au palais | du Corps législatif,Université,126-128.,1855,261,architecte au palais | du corps législatif
...,...,...,...,...,...,...,...,...,...,...
4326550,bpt6k9780089g,1041,0,Foyer Normand (le),habitations à bon | marché,r. de Clichy,25.,1922,702,habitations à bon | marché
4333442,bpt6k9780089g,1087,0,Gonot (Georges),Les Plombiers-Fontainiers | de Paris,av. Philippe-Auguste,84.,1922,748,les plombiers-fontainiers | de paris
4386635,bpt6k9780089g,1453,313,Rosier (Ch.),location d'automobiles particu| lières,r. Laffitte,18. (9°). T. Berg. 41. 23.,1922,1114,location d'automobiles particu| lières
4387221,bpt6k9780089g,1458,0,Rouquét (J.) et Cle,mandataires-défenseurs | près le tribunal de c...,r. François-Miron,26.,1922,1119,mandataires-défenseurs | près le tribunal de c...


There are 463 rows with a `|`. From the bottins, it appears that `|` was used to separate jobs or sometimes to give a specific description.

Each `|` is removed when it is surrounded by spaces, at the start and followed by a space, or at the end and preceded by a space. The remaining ones will be dealt with during tag generation, as there are nearly 300 such entries and it is not practical to clean all of them manually.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\|(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
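
The backslash before `|` in these cells matters, because `|` is the alternation operator in a regular expression. The short check below, on made-up values, compares the escaped pattern, a literal substring search, and the unescaped pattern.

```python
import pandas as pd

sample = pd.Series(["fabr. de cha- | les", "boucher"])

# Escaped pipe: matches only rows that really contain "|".
escaped = sample.str.contains(r"\|")
# Plain substring search, no regex involved.
literal = sample.str.contains("|", regex=False)
# Unescaped pipe: an alternation of two empty patterns, which matches every string.
unescaped = sample.str.contains("|")

print(escaped.tolist(), literal.tolist(), unescaped.tolist())
```

An unescaped `|` would therefore select every row rather than the 463 that contain the character.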

### Dealing with `;`

- Get the rows containing `;`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r";"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
313,bpt6k6282019m,146,7,Albrecht (T.),consul de Saxe; agence de bateaux à vapeur,Basse-du-Rempart,10.,1855,74,consul de saxe; agence de bateaux à vapeur
3541,bpt6k6282019m,165,168,Beichof (Daniel),dépôt ; d'ouates,Temple,145.,1855,93,dépôt ; d'ouates
5123,bpt6k6282019m,175,25,Billaud (A.) *,agent de change; membre de la commission munic...,Michodière,8.,1855,103,agent de change; membre de la commission munic...
5469,bpt6k6282019m,177,48,Blanchet,papiers en gros; domicile,Seine,43.,1855,105,papiers en gros; domicile
7500,bpt6k6282019m,189,51,Bouron fils,charbons de bois à Joigny; domicile,Paradis-Poissonnière,12.,1855,117,charbons de bois à joigny; domicile
...,...,...,...,...,...,...,...,...,...,...
4393164,bpt6k9780089g,1504,213,Société des Ateliers et Chantiers de Balbigny,constructions métalliques; siège so- cial,bou. Malesherbes,28.,1922,1165,constructions métalliques; siège so- cial
4395471,bpt6k9780089g,1533,46,Sutcliffe,Speakman & Cº; broyeurs-concasseurs et machine...,r. La fa- yette,15.,1922,1194,speakman & cº; broyeurs-concasseurs et machine...
4397816,bpt6k9780089g,1548,274,Thorne (Samuel),tissage mécanique de tissus nouveautés; dépôt,r. St-Honoré,203. (fer). T. Gut. 73. 66.,1922,1209,tissage mécanique de tissus nouveautés; dépôt
4400076,bpt6k9780089g,1565,211,Usines Chimiques du Pecq,fabr. de produits chimiques et pharmaceutiques...,r. Cambon,39.,1922,1226,fabr. de produits chimiques et pharmaceutiques...


There are 2060 rows that have the `;`.

Each `;` is removed when it is surrounded by spaces, at the start and followed by a space, or at the end and preceded by a space. Rows that contain `;` inside a word are corrected manually, because the `;` causes a wrong spelling.

The remaining ones will be dealt with during tag generation, as there are nearly 1500 such entries and it is not practical to clean all of them manually.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\;(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"[(a-z)+];[(a-z)+]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
244345,bpt6k6305463c,285,139,Calland (Victor) et Cie,directeurs de la Compapie dentais;alais de fam...,boul. ucs i apucines,3J. *,1857,172,directeurs de la compapie dentais;alais de fam...
275081,bpt6k6305463c,469,144,Lomenie-Demarin et Cie,vins en bou;eilles,Bergère,32.,1857,356,vins en bou;eilles
297621,bpt6k6309075f,196,168,Amyoi,ilvocat à la cour im;).,Prouaires,3.,1861,100,ilvocat à la cour im;).
306352,bpt6k6309075f,254,175,Briffault,représent. de fabric;ues de rubans de soie,Française,10 et 12.,1861,158,représent. de fabric;ues de rubans de soie
606142,bpt6k6319106t,738,71,De Navia (Chevalier),ancien ambassadeur d'Es;agne,Chaussée-d’Antin,38.,1849,414,ancien ambassadeur d'es;agne
981206,bpt6k63243920,522,147,Pucheran,aide-naturaliste au Muséum d'l;istoire naturelle,Cuvier,57.,1860,442,aide-naturaliste au muséum d'l;istoire naturelle
1110864,bpt6k6333200c,278,42,Badel (Bernard),banq;rier,Rossini,3.,1862,146,banq;rier
1277700,bpt6k6391515w,896,86,Rosier,vérificateur des travaux publics et des hos;ices,Lille,7.,1847,577,vérificateur des travaux publics et des hos;ices
1292762,bpt6k6393838j,500,82,Couchery,scul;teur ornemaniste,Folie-Rcgnault,6.,1843,281,scul;teur ornemaniste
1425940,bpt6k9668037f,657,95,Lacroix,cha;caux,Ste-Anne,43.,1884,490,cha;caux
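
One detail of the pattern `[(a-z)+];[(a-z)+]`: inside square brackets, `(`, `)` and `+` are literal characters. The class therefore matches a lowercase ASCII letter or one of those three symbols, which is why `im;).` is listed above. The check below uses Python's `re` module on strings taken from this notebook; the letters-only pattern is shown for comparison and is not the one the notebook uses.

```python
import re

pattern_used = re.compile(r"[(a-z)+];[(a-z)+]")  # pattern from the cell above
letters_only = re.compile(r"[a-z];[a-z]")        # comparison only

samples = [
    "vins en bou;eilles",       # letter ; letter
    "ilvocat à la cour im;).",  # letter ; ")"
    "é;icier",                  # accented letter before ";"
    "consul de saxe; agence",   # ";" followed by a space
]

matches_used = [bool(pattern_used.search(s)) for s in samples]
matches_letters = [bool(letters_only.search(s)) for s in samples]
print(matches_used)
print(matches_letters)
```

Accented letters such as `é` fall outside `a-z`, so an entry like `é;icier` is not found by this search.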


There are 15 rows that have a `;` inside a word.

The `métier` column will be replaced manually for these 15 entries.

1. For index 244345, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f172.item.r=calland.zoom. The job was present in two lines and the print faded away. Thus the job will be changed from `directeurs de la compapie dentais;alais de famille` to `directeurs de la compagnie générale des palais de famille`.

2. For index 275081, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f356.item.r=Lomenie.zoom. The `t` was misinterpreted as `;`. Thus the job will be changed from `vins en bou;eilles` to `vins en bouteilles`.

3. For index 297621, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f100.item.r=avocat.zoom. The `p` was misinterpreted as `;`. Thus the job will be changed from `ilvocat à la cour im;).` to `avocat à la cour imp.`.

4. For index 306352, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f158.item.r=Briffault.zoom. The `q` was misinterpreted as `c;`. Thus the job will be changed from `représent. de fabric;ues de rubans de soie` to `représent. de fabriques de rubans de soie`.

5. For index 606142, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319106t/f414.image.r=Navia.zoom. The `p` was misinterpreted as `;`. Thus the job will be changed from `ancien ambassadeur d'es;agne` to `ancien ambassadeur d'espagne`.

6. For index 981206, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f442.image.r=naturaliste.zoom. The `h` was misinterpreted as `;`. Thus the job will be changed from `aide-naturaliste au muséum d'l;istoire naturelle` to `aide-naturaliste au muséum d'histoire naturelle`.

7. For index 1110864, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f146.item.r=Badel.zoom. The `u` was misinterpreted as `;`. Thus the job will be changed from `banq;rier` to `banquier`.

8. For index 1277700, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6391515w/f577.image.r=Rosier.zoom. The `p` was misinterpreted as `;`. Thus the job will be changed from `vérificateur des travaux publics et des hos;ices` to `vérificateur des travaux publics et des hospices`.

9. For index 1292762, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f281.item.r=Coucliery.zoom. The `p` was misinterpreted as `;`. Thus the job will be changed from `scul;teur ornemaniste` to `sculpteur ornemaniste`.

10. For index 1425940, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f515.image.r=chapeaux.zoom. The `p` was misinterpreted as `;`. Thus the job will be changed from `cha;caux` to `chapeaux`.

11. For index 1827186, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f248.item.r=eoitjeur.zoom. The `ff` was misinterpreted as `l;`. Thus the job will be changed from `coil;eur` to `coiffeur`.

12. For index 2281525, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684013b/f590.item.r=boni.zoom. The `tt` was misinterpreted as `;`. Thus the job will be changed from `teint. et n;lloyage` to `teint. et nettoyage`.

13. For index 3960963, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f353.item.r=Carrefour.zoom. The ` ` was misinterpreted as ` `. Thus the job will be changed from `ocurre et wu;s` to `beurre et œufs`.

14. For index 4063345, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f220.item.r=siege.zoom. The entry is a company and the company name contains the business name and the street name appears in the job column. Thus the job will be changed from `siège social;r. st-jacques` to `siège social`.

15. For index 4299664, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f506.image.r=Chanipionnet.zoom. The `,` was misinterpreted as `;`. Thus the job will be changed from `papetier;r` to `papetier`.

In [None]:
raw_paris_jobs.loc[244345, "métier"] = "directeurs de la compagnie générale des palais de famille"
raw_paris_jobs.loc[275081, "métier"] = "vins en bouteilles"
raw_paris_jobs.loc[297621, "métier"] = "avocat à la cour imp."
raw_paris_jobs.loc[306352, "métier"] = "représent. de fabriques de rubans de soie"
raw_paris_jobs.loc[606142, "métier"] = "ancien ambassadeur d'espagne"
raw_paris_jobs.loc[981206, "métier"] = "aide-naturaliste au muséum d'histoire naturelle"
raw_paris_jobs.loc[1110864, "métier"] = "banquier"
raw_paris_jobs.loc[1277700, "métier"] = "vérificateur des travaux publics et des hospices"
raw_paris_jobs.loc[1292762, "métier"] = "sculpteur ornemaniste"
raw_paris_jobs.loc[1425940, "métier"] = "chapeaux"
raw_paris_jobs.loc[1827186, "métier"] = "coiffeur"
raw_paris_jobs.loc[2281525, "métier"] = "teint. et nettoyage"
raw_paris_jobs.loc[3960963, "métier"] = "beurre et œufs"
raw_paris_jobs.loc[4063345, "métier"] = "siège social"
raw_paris_jobs.loc[4299664, "métier"] = "papetier"
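
Each manual correction above was checked against the page image on Gallica. The links share one shape, built from the `doc_id` and `gallica_page` columns. The sketch below builds such a link; the function name `gallica_page_url` is hypothetical, and the URL layout is inferred only from the links quoted in this notebook, not from any Gallica documentation. It omits the `.r=<search term>.zoom` suffix that most of the quoted links carry.

```python
def gallica_page_url(doc_id: str, gallica_page: int) -> str:
    """Build a page link in the form seen in this notebook (assumed layout)."""
    return f"https://gallica.bnf.fr/ark:/12148/{doc_id}/f{gallica_page}.item"


# Row 275081 above: doc_id bpt6k6305463c, gallica_page 356.
url = gallica_page_url("bpt6k6305463c", 356)
print(url)
```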

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r";[(a-z)+]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
85503,bpt6k6286466w,555,46,Leclerc,;draps,Lavand.-Ste-Opportune,22.,1842,366,;draps
169428,bpt6k6292987t,839,108,Hamelin,é;icier,Fourreurs,17.,1845,486,é;icier
381874,bpt6k6314752k,462,153,Roquette,35 et 37;tissage,rue du MûrierSt-Victor,6.,1856,272,37;tissage
754076,bpt6k63243601,268,133,Besson,;bottier,Faub. du Temple,16.,1839,145,;bottier
2481614,bpt6k9685098r,884,32,Parc-Royal,8; TÉLÉPH. 218-91;ateliers-ďexpédition,r. St-Antoine,170. Guimet (E.) (0.,1898,603,téléph. 218-91;ateliers-d'expédition
2704034,bpt6k9692626p,951,139,Lancry,3;Entrepôi,r.del'Oureq,112.,1897,646,3;entrepôi


Some entries in the list below do not appear in the output above, because their symbols were already removed during the deletion of numbers. They are still corrected here so that the earlier corrections continue to apply.

1. For index 85503, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6286466w/f366.item.r=Lavand.zoom, The job will be changed from `;draps` to `draps`

2. For index 169428. The job will be changed from `é;icier` to `épicier`

3. For index 381874, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f272.image.r=tissage.zoom. The OCR misinterpreted a multiline entry. The job will be changed from ` et ;tissage` to `tissage`

4. For index 754076, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243601/f145.image.r=bottier.zoom, The job will be changed from `;bottier` to `bottier`

5. For index 2481614, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685098r/f603.image.r=Guimet.zoom, Multiple entries in the directory were combined together. The job will be changed from `téléph. -;ateliers-d'expédition` to `ateliers-d'expédition`

6. For index 2692047, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692626p/f576.item.r=Balzac.zoom. Multiple entries in the directory were combined together. The job will be changed from `;r. de la chaussée-d’antin` to `professeurs d'escrime`

7. For index 2704034, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692626p/f646.image.r=Oureq.zoom, The job will be changed from `;entrepôi` to `entrepôt`

8. For index 2921263, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9732740w/f678.item.r=Boetie.zoom, Multiple entries in the directory were combined together. The job will be changed from `;rue lallier` to `messageries du commerce`

9. For index 3224887, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97631451/f841.image.r=montaigne.zoom, Multiple entries in the directory were combined together. The job will be changed from `;av. la motte-picquet` to `messageries montaigne`

In [None]:
raw_paris_jobs.loc[85503, "métier"] = "draps"
raw_paris_jobs.loc[169428, "métier"] = "épicier"
raw_paris_jobs.loc[381874, "métier"] = "tissage"
raw_paris_jobs.loc[754076, "métier"] = "bottier"
raw_paris_jobs.loc[2481614, "métier"] = "ateliers-d'expédition"
raw_paris_jobs.loc[2692047, "métier"] = "professeurs d'escrime"
raw_paris_jobs.loc[2704034, "métier"] = "entrepôt"
raw_paris_jobs.loc[2921263, "métier"] = "messageries du commerce"
raw_paris_jobs.loc[3224887, "métier"] = "messageries montaigne"

### Dealing with `·`

Not to be confused with `.`

- Get rows with `·`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'\·'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
13049,bpt6k6282019m,222,90,Combette *,administr. à la direction de l'en· registr. et...,Mont-Thabor,7.,1855,150,administr. à la direction de l'en· registr. et...
23561,bpt6k6282019m,286,101,Fruntz de Lienhard *,capitaine d'état-ma- · jor,St-Louis-en-l'Ile,81.,1855,214,capitaine d'état-ma- · jor
41872,bpt6k6282019m,396,45,Merley (Louis),sculpteur et graveur en mé· dailles,Cassette,12.,1855,324,sculpteur et graveur en mé· dailles
44812,bpt6k6282019m,414,41,Nollet,fab. de plaques pour compagnies l'as· surances,Faub. -St-Denis,65.,1855,342,fab. de plaques pour compagnies l'as· surances
82885,bpt6k6286466w,537,47,Jolivet,inspecteur des travaux du Palais-de· Justice,cour du Palais-de-Justice,8 et 9.,1842,348,inspecteur des travaux du palais-de· justice
...,...,...,...,...,...,...,...,...,...,...
4259711,bpt6k97774838,1412,296,Salomon (Louis),achat et vente d'immeu· bles,boul. Hausmann,21.,1921,1085,achat et vente d'immeu· bles
4277358,bpt6k9780089g,695,84,André (Georges),concessionnaire des machines à coudre · Elias ...,boul. de Pic- pus,15.,1922,356,concessionnaire des machines à coudre · elias ...
4323683,bpt6k9780089g,1021,34,Feldmann (J.) et Volkaerts,Pansements · La,(40e). T. Trud. 01. 44-01. 43,01.,1922,682,pansements · la
4381879,bpt6k9780089g,1421,228,Randegger et Niestlé,constructions électro· mécaniques,boul. Voltaire,188.,1922,1082,constructions électro· mécaniques


First, each `·` is removed when it is surrounded by spaces, at the start and followed by a space, or at the end and preceded by a space.

Because a job can span two lines in the directory, it is sometimes split at `· `. So every `· ` is removed.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\·(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\·\s', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
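
The two `·` rules described above behave differently. An isolated `·` becomes a space, while `· ` inside a split word is dropped so that the two halves rejoin. The sketch below runs both rules on strings modelled on the rows above. It assumes the symbol is the middle dot U+00B7, which is not the same character as the full stop.

```python
import pandas as pd

MIDDLE_DOT = "\u00b7"  # assumed code point of the symbol seen in the data

sample = pd.Series([
    "sculpteur et graveur en mé\u00b7 dailles",  # word split at the middle dot
    "capitaine d'état-ma- \u00b7 jor",           # isolated middle dot
    "imprimeur\u00b7lithogr.",                   # dot inside a word, no space
    "fab. de jupons",                            # full stops only
])

# Rule 1: an isolated middle dot becomes a single space.
step1 = sample.str.replace(rf"(^|\s){MIDDLE_DOT}(\s|$)", " ", regex=True)
# Rule 2: a middle dot followed by whitespace is dropped, rejoining the word.
step2 = step1.str.replace(rf"{MIDDLE_DOT}\s", "", regex=True)
# Collapse any repeated whitespace.
result = step2.str.replace(r"\s\s+", " ", regex=True)
print(result.tolist())
```

Full stops are untouched, and a middle dot with no adjacent space stays in place for manual review.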

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'·'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
258739,bpt6k6305463c,373,9,Fougas (Mme),Athénée des jeunes person·nes,Université,4.0,1857,260,athénée des jeunes person·nes
391534,bpt6k6314752k,519,127,Fouqué,imprimeur·lithogr.,passage du Caire,49.0,1856,329,imprimeur·lithogr.
428200,bpt6k6314752k,733,167,Viollier(J.-A.),commissionn. en marchandi·ses,Victoire,88.0,1856,543,commissionn. en marchandi·ses
537249,bpt6k6318531z,268,154,Breuillard fils,syndic ·près le tribunal de commerce,place Breda,8.0,1858,160,syndic ·près le tribunal de commerce
809699,bpt6k6324389h,282,138,Durand-Brager,peint.·artiste,Amsterdam,71.0,1859,210,peint.·artiste
977771,bpt6k63243920,501,69,Pequignot (Aug.),gray.·artiste,Mouffetard,85.0,1860,421,gray.·artiste
1095492,bpt6k6333170p,668,43,Prudbome et Cie,hy dro·physique,Borchers-Passy,13.0,1864,531,hy dro·physique
1167871,bpt6k6333200c,631,54,Pillon,ferbl·lampiste,Chopinette,45.0,1862,499,ferbl·lampiste
1233296,bpt6k6389871r,504,165,Tassin,ferbl·ntier,Gravilliers,16.0,1853,427,ferbl·ntier
1268631,bpt6k6391515w,839,89,Martin,f·b. de ceintures,Auninire,11.0,1847,520,f·b. de ceintures


In the remaining rows, the `·` sits inside a word and causes a wrong spelling, so these words are corrected manually.

As some of the words are common mistakes, these have been changed directly without referring to the original document.

1. For index 258739. The job will be changed from `athénée des jeunes person·nes` to `athénée des jeunes personnes`

2. For index 391534. The job will be changed from `imprimeur·lithogr.` to `imprimeur-lithogr.`

3. For index 428200. The job will be changed from `commissionn. en marchandi·ses` to `commissionn. en marchandises`

4. For index 537249, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f160.image.r=Breuillard, The job will be changed from `syndic ·près le tribunal de commerce` to `syndic près le tribunal de commerce`

5. For index 809699, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f210.image.r=amsterdam.zoom, The job will be changed from `peint.·artiste` to `peint.-artiste`

6. For index 977771, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f421.image.r=Pequignot, The job will be changed from `gray.·artiste` to `grav.-artiste`

7. For index 1095492, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f531.item.r=hydro%20physique.zoom, The job will be changed from `hy dro·physique` to `hydro-physique`

8. For index 1167871, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f499.item.r=pillon.zoom, The job will be changed from `ferbl·lampiste` to `ferbl.-lampiste`

9. For index 1233296, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6389871r/f427.image.r=gravilliers.zoom, The job will be changed from `ferbl·ntier` to `ferblantier`

10. For index 1268631. The job will be changed from `f·b. de ceintures` to `fab. de ceintures`

11. For index 1313984, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f426.item.r=librair.zoom, The job will be changed from `librair ·` to `libraire`

12. For index 1363989. The job will be changed from `p·intre en voit.` to `peintre en voit.`

13. For index 1883632. The job will be changed from `sculpteur et moulures pour meu·bles` to `sculpteur et moulures pour meubles`

14. For index 2361548, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f458.item.r=SebasLopol.zoom, The job will be changed from `patin·caoutchouc fer` to `patin caoutchouc pour fer`

15. For index 2629368. The job will be changed from `marbrier pour bronzes et pendu·les` to `marbrier pour bronzes et pendules`

16. For index 3253849. The job will be changed from `attaché mi·litaire à l'ambassade d'allemagne` to `attaché militaire à l'ambassade d'allemagne`

17. For index 3374138. The job will be changed from `entreprise générale de maisons centrales et prisons départementa·les` to `entreprise générale de maisons centrales et prisons départementales`

18. For index 3382906. The job will be changed from `propriétaire de vigno·bles` to `propriétaire de vignobles`

19. For index 3870736. The job will be changed from `dessins et échantil·lons` to `dessins et échantillons`

20. For index 4182250. The job will be changed from `professeur de comptabi·lité` to `professeur de comptabilité`

In [None]:
raw_paris_jobs.loc[258739, "métier"] = "athénée des jeunes personnes"
raw_paris_jobs.loc[391534, "métier"] = "imprimeur-lithogr."
raw_paris_jobs.loc[428200, "métier"] = "commissionn. en marchandises"
raw_paris_jobs.loc[537249, "métier"] = "syndic près le tribunal de commerce"
raw_paris_jobs.loc[809699, "métier"] = "peint.-artiste"
raw_paris_jobs.loc[977771, "métier"] = "grav.-artiste"
raw_paris_jobs.loc[1095492, "métier"] = "hydro-physique"
raw_paris_jobs.loc[1167871, "métier"] = "ferbl.-lampiste"
raw_paris_jobs.loc[1233296, "métier"] = "ferblantier"
raw_paris_jobs.loc[1268631, "métier"] = "fab. de ceintures"
raw_paris_jobs.loc[1313984, "métier"] = "libraire"
raw_paris_jobs.loc[1363989, "métier"] = "peintre en voit."
raw_paris_jobs.loc[1883632, "métier"] = "sculpteur et moulures pour meubles"
raw_paris_jobs.loc[2361548, "métier"] = "patin caoutchouc pour fer"
raw_paris_jobs.loc[2629368, "métier"] = "marbrier pour bronzes et pendules"
raw_paris_jobs.loc[3253849, "métier"] = "attaché militaire à l'ambassade d'allemagne"
raw_paris_jobs.loc[3374138, "métier"] = "entreprise générale de maisons centrales et prisons départementales"
raw_paris_jobs.loc[3382906, "métier"] = "propriétaire de vignobles"
raw_paris_jobs.loc[3870736, "métier"] = "dessins et échantillons"
raw_paris_jobs.loc[4182250, "métier"] = "professeur de comptabilité"

### Dealing with `?`

- Get rows with `?`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'\?'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
14708,bpt6k6282019m,232,175,Dampoux,sculpteur-ornem?,Nve-Vavin,15.,1855,160,sculpteur-ornem?
94079,bpt6k6286466w,612,165,Perignon,say?urier,Las-Cases,15.,1842,423,say?urier
139744,bpt6k62906378,800,118,Pirioux,menuisier-rampist?,St-Laurent,4.,1846,512,menuisier-rampist?
229605,bpt6k62931221,544,104,Saint-Priest,directeur de la R?vuc du 19e siècle,Filles-St-Thomas,5.,1841,393,directeur de la r?vuc du siècle
263969,bpt6k6305463c,403,78,Guillaumin (G.),négoc.-?xport.,Englien,8.,1857,290,négoc.-?xport.
...,...,...,...,...,...,...,...,...,...,...
4272714,bpt6k97774838,1524,57,Vinant (G.) ( I),ingénieur-constr?,boul. St-Germain,232.,1921,1197,ingénieur-constr?
4339461,bpt6k9780089g,1127,341,Hermet,taillel?,r. du Regard,4.,1922,788,taillel?
4351821,bpt6k9780089g,1214,220,Lefèvre,fabr. d'eau de selt?,r. du MoulinVert,41.,1922,875,fabr. d'eau de selt?
4372046,bpt6k9780089g,1355,46,Pactat,vins et hô?el,r. Etienne-Dolet,10.,1922,1016,vins et hô?el


First, remove the `?` when surrounded by spaces.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\?(\s|$)', r' ', regex=True)
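
The effect of this pattern can be checked on a few strings before running it on the full data. The sketch below uses the standard `re` module on plain strings, on the assumption that `str.replace(..., regex=True)` applies the same `re` syntax; the sample values are hypothetical and only mimic the rows above.

```python
import re

# Hypothetical sample values; the real data lives in raw_paris_jobs["métier"].
samples = ["? épicier", "vins ? liqueurs", "tailleur ?", "say?urier"]

# Same pattern as in the cell above: a "?" that stands alone, i.e. preceded by
# the start of the string or whitespace and followed by whitespace or the end.
pattern = r'(^|\s)\?(\s|$)'

# The replacement is a single space, so a "?" at either end leaves a stray
# space behind; this sketch strips it (the cell above does not).
cleaned = [re.sub(pattern, ' ', s).strip() for s in samples]

print(cleaned)
# ['épicier', 'vins liqueurs', 'tailleur', 'say?urier']
```

A `?` inside a word, as in `say?urier`, is left untouched, which is why such rows still need a manual look.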

There are 117 rows, and it is not practical to correct all of them manually. Only the `?` that have a space before or after them are corrected manually.

In [None]:
raw_paris_jobs[raw_paris_jobs["métier"].str.contains(r"\? | \?")]


Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
294570,bpt6k6305463c,590,111,Vallée (A.),fab.de ca? nnages,St-Denis,211.,1857,477,fab.de ca? nnages
412537,bpt6k6314752k,641,43,Morel,teinturier-? graisseur,St-Roch,33.,1856,451,teinturier-? graisseur
446392,bpt6k6315927h,810,198,Durand jeune,fab. de reill.. Trois-aurs. ?.,Durand-Chane: rel (1. elig. Loursine,9 et 15.,1848,461,fab. de reill.. trois-aurs. ?.
456645,bpt6k6315927h,876,20,Lafſecteur,propriétaire du véritable Ro? antisyphilitique,Petits-Augustins,11.,1848,527,propriétaire du véritable ro? antisyphilitique
507196,bpt6k6315985z,340,71,Laffecteur,propriétaire du véritable Ro? antisyphilitique,Petits-Augustins,9.,1850,258,propriétaire du véritable ro? antisyphilitique
561250,bpt6k6318531z,414,194,Jacquemart (F.),fab. de produits chimiqu ?s,Faub. Poissonnière,58.,1858,306,fab. de produits chimiqu ?s
1006194,bpt6k6331310g,515,115,Dotard,fab? de moulures pour bâtiments,FourSt-G.,51.,1844,299,fab? de moulures pour bâtiments
1073179,bpt6k6333170p,519,92,Izainbard,inspecteur commis cia? du chemia de fer du Nord,hoal Magenta,181.,1864,382,inspecteur commis cia? du chemia de fer du nord
1340862,bpt6k63959929,289,19,Fauconnier et Cie,peintres de marbres? bois et agates sur pap.,Charonne,108.,1851,207,peintres de marbres? bois et agates sur pap.
1440014,bpt6k9668037f,744,176,Mettetal (A.),ancien substitut du procurest? au tribunal de ...,boul. Malesnerbes,79.,1884,577,ancien substitut du procurest? au tribunal de ...


Some entries in the list below do not appear in the output above. This is due to the fact that these symbols were removed during the deletion of numbers. However, to keep using the earlier corrections, they are still corrected.

1. For index 294570, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f597.image.r=vallee.zoom (https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f480.image.r=vallee), The job will be changed from `fab.de ca? nnages` to `fab.de cartonnages`

2. For index 412537, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f451.item.r=roch.zoom, The job will be changed from `teinturier-? graisseur` to `teinturier-dégraisseur`

3. For index 446392, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6315927h/f461.item.r=jeune.zoom. Multiple lines were interpreted together. The job will be changed from `fab. de reill.. trois-aurs. ?.` to `fab. d'aiguilles à bas`

4. For index 456645, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6315927h/f527.item.r=Lafsecteur.zoom, The job will be changed from `propriétaire du véritable ro? antisyphilitique` to `propriétaire du véritable rob antisyphilitique`

5. For index 507196. The job will be changed from `propriétaire du véritable ro? antisyphilitique` to `propriétaire du véritable rob antisyphilitique`

6. For index 561250. The job will be changed from `fab. de produits chimiqu ?s` to `fab. de produits chimiques`

7. For index 1006194. The job will be changed from `fab? de moulures pour bâtiments` to `fab. de moulures pour bâtiments`

8. For index 1073179, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f382.item.r=inspecteur.zoom, The job will be changed from `inspecteur commis cia? du chemia de fer du nord` to `inspecteur commercial du chemin de fer du nord`

9. For index 1340862, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f207.item.r=Fauconnier.zoom, The job will be changed from `peintres de marbres? bois et agates sur pap.` to `peintres de marbres bois et agates sur pap.`

10. For index 1440014, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9668037f/f577.item.r=Mettetal.zoom, The job will be changed from `ancien substitut du procurest? au tribunal de re instance` to `ancien substitut du procureur au tribunal de première instance`.

11. For index 1583431, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f234.item.r=titane.zoom. The job will be changed from `blanchissage et apprêtage? chapeaux de paille` to `blanchissage et apprêtage chapeaux de paille`

12. For index 1593909, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f296.item.r=Erckmann.zoom, The job will be changed from `représentant de maisons d'ala? lemagne` to `représentant de maisons d'allemagne`

13. For index 1663946, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f241.item.r=ultbout.zoom, The job will be changed from `propriétaire-géren? du journal u france-nouvelle` to `propriétaire-gérant du journal la france-nouvelle`

14. For index 1664483, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f244.item.r=baudet.zoom, The job will be changed from `grauta? sur métaux` to `graveur sur métaux`

15. For index 1922839, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f117.item.r=Versepuy.zoom, The job will be changed from `commissionnaires-ne? gociants` to `commissionnaires-négociants`

16. For index 1953486, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f295.item.r=Faradeche.zoom, The job will be changed from `représentant de fabr. de pas? piers` to `représentant de fabr. de papiers`

17. For index 2550892, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f227.item.r=Beliart.zoom, The job will be changed from `bijoutier en ?'` to `bijoutier en or`

18. For index 2805593, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692809v/f375.item.r=sertisseur.zoom, The job will be changed from `ser? seur` to `serrurier`

19. For index 3037726, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f481.item.r=gerant.zoom, The job will be changed from `gérant du marché des? ternes` to `gérant du marché des ternes`

20. For index 3044512, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f521.item.r=fournitures.zoom, The job will be changed from `fournitures d'horlogerie c? gros` to `fournitures d'horlogerie en gros`

21. For index 3052091, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f566.item.r=iFriedlanfl.zoom, The job will be changed from `artides d? courie` to `articles d'écurie`

22. For index 3467877, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f224.item.r=lagrange.zoom, The job will be changed from `de la maison ch. lagrange et chet? vreuil` to `de la maison ch. lagrange et chevreuil`

23. For index 3618634, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764402m/f842.image.r=Mistral, The job will be changed from `conse? vation de tapis` to `conservation de tapis`

24. For index 3746665, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f192.item.r=monuments.zoom, The job will be changed from `marbrier pour monuments fu ?! ores` to `marbrier pour monuments funèbres`

25. For index 3858635, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f260.item.r=Dusart.zoom, The job will be changed from `avou? de tre instance` to `avoué de première instance`

26. For index 3859931, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f268.item.r=Farrenc.zoom, The job will be changed from `professeur de piano au conse? vatoire de musique` to `professeur de piano au conservatoire de musique`

27. For index 4236101, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f906.item.r=bazars.zoom, The job will be changed from `art. p? bazars` to `art. pr bazars`

28. For index 4253063, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1033.item.r=Philippe.zoom, The job will be changed from `confec de.c? confections pour dames` to `confections pour dames`

29. For index 4399898, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1224.item.r=Unik.zoom, The job will be changed from `horlog ?r` to `horloger`

In [None]:
raw_paris_jobs.loc[294570, "métier"] = "fab.de cartonnages"
raw_paris_jobs.loc[412537, "métier"] = "teinturier-dégraisseur"
raw_paris_jobs.loc[446392, "métier"] = "fab. d'aiguilles à bas"
raw_paris_jobs.loc[456645, "métier"] = "propriétaire du véritable rob antisyphilitique"
raw_paris_jobs.loc[507196, "métier"] = "propriétaire du véritable rob antisyphilitique"
raw_paris_jobs.loc[561250, "métier"] = "fab. de produits chimiques"
raw_paris_jobs.loc[1006194, "métier"] = "fab. de moulures pour bâtiments"
raw_paris_jobs.loc[1073179, "métier"] = "inspecteur commercial du chemin de fer du nord"
raw_paris_jobs.loc[1340862, "métier"] = "peintres de marbres bois et agates sur pap."
raw_paris_jobs.loc[1440014, "métier"] = "ancien substitut du procureur au tribunal de première instance"
raw_paris_jobs.loc[1583431, "métier"] = "blanchissage et apprêtage chapeaux de paille"
raw_paris_jobs.loc[1593909, "métier"] = "représentant de maisons d'allemagne"
raw_paris_jobs.loc[1663946, "métier"] = "propriétaire-gérant du journal la france-nouvelle"
raw_paris_jobs.loc[1664483, "métier"] = "graveur sur métaux"
raw_paris_jobs.loc[1922839, "métier"] = "commissionnaires-négociants"
raw_paris_jobs.loc[1953486, "métier"] = "représentant de fabr. de papiers"
raw_paris_jobs.loc[2550892, "métier"] = "bijoutier en or"
raw_paris_jobs.loc[2805593, "métier"] = "serrurier"
raw_paris_jobs.loc[3037726, "métier"] = "gérant du marché des ternes"
raw_paris_jobs.loc[3044512, "métier"] = "fournitures d'horlogerie en gros"
raw_paris_jobs.loc[3052091, "métier"] = "articles d'écurie"
raw_paris_jobs.loc[3467877, "métier"] = "de la maison ch. lagrange et chevreuil"
raw_paris_jobs.loc[3618634, "métier"] = "conservation de tapis"
raw_paris_jobs.loc[3746665, "métier"] = "marbrier pour monuments funèbres"
raw_paris_jobs.loc[3858635, "métier"] = "avoué de première instance"
raw_paris_jobs.loc[3859931, "métier"] = "professeur de piano au conservatoire de musique"
raw_paris_jobs.loc[4236101, "métier"] = "art. pr bazars"
raw_paris_jobs.loc[4253063, "métier"] = "confections pour dames"
raw_paris_jobs.loc[4399898, "métier"] = "horloger"
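
As a possible alternative to one assignment per line, the verified corrections could be kept in a dictionary and applied together. This is only a sketch under assumptions: `jobs` is a hypothetical miniature of `raw_paris_jobs`, and `corrections` holds three of the entries listed above.

```python
import pandas as pd

# Hypothetical miniature of raw_paris_jobs; only the "métier" column matters here.
jobs = pd.DataFrame(
    {"métier": ["fab.de ca? nnages", "teinturier-? graisseur", "horlog ?r"]},
    index=[294570, 412537, 4399898],
)

# Corrections keyed by row index, taken from the list above.
corrections = {
    294570: "fab.de cartonnages",
    412537: "teinturier-dégraisseur",
    4399898: "horloger",
}

# Check the indices first: assigning through .loc with a label that does not
# exist would add a new row instead of correcting an existing one.
missing = [idx for idx in corrections if idx not in jobs.index]
assert not missing, f"unknown indices: {missing}"

# Labels and values are listed in the same order, so they line up one to one.
jobs.loc[list(corrections), "métier"] = list(corrections.values())
print(jobs)
```

The index check is the main reason to prefer this form: a mistyped index is reported instead of silently creating an extra row.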

### Dealing with `“`

- Get rows with `“`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'“'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
111061,bpt6k62906378,622,155,Carpot-Vignier,fab. de bandages et d'instru“ ments en gomme é...,Cité,31.,1846,334,fab. de bandages et d'instru“ ments en gomme é...
184466,bpt6k6292987t,934,188,Quinet (Edg.),professeur au collége de Fran“ ce,Mont-Parnasse,4 lis.,1845,581,professeur au collége de fran“ ce
277174,bpt6k6305463c,482,80,Martella,“peintre en bâtiments,Isly,6.,1857,369,“peintre en bâtiments
291275,bpt6k6305463c,570,9,Serpantié (Marie,directeur du Moniteur dra“ matique,passage Suuln.er,18.,1857,457,directeur du moniteur dra“ matique
383649,bpt6k6314752k,473,59,De Champeaux,avocat à la cour“ impériale,Cassette,25.,1856,283,avocat à la cour“ impériale
394471,bpt6k6314752k,535,221,Giraud (B.) ainé et Cie,fab. de boutons et ga“. lons de soie,St-Denis,229.,1856,345,fab. de boutons et ga“. lons de soie
412282,bpt6k6314752k,639,133,Montullet (Mme),supérieure de la Congréga“ tion de St-Vincent-...,Bac,140.,1856,449,supérieure de la congréga“ tion de st-vincent-...
412790,bpt6k6314752k,642,163,Mottet (Ch.) NC,membre du tribunal de com“ merce,Hauteville,23.,1856,452,membre du tribunal de com“ merce
488506,bpt6k6315985z,218,155,Cauchois-Lemaire *,chef de la section légis“ lative aux archives ...,Berry-Merais,14.,1850,136,chef de la section légis“ lative aux archives ...
721291,bpt6k6319811j,346,18,Gyssens,fab. d'instruments de musique en “ bois,Montmartre,46.,1854,267,fab. d'instruments de musique en “ bois


1. For index 111061. The job will be changed from `fab. de bandages et d'instru“ ments en gomme élastique` to `fab. de bandages et d'instruments en gomme élastique`

2. For index 184466, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6292987t/f581.image.r=quinet, The job will be changed from `professeur au collége de fran“ ce` to `professeur au collége de france`

3. For index 277174, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f369.image.r=Martella.zoom, The job will be changed from `“peintre en bâtiments` to `peintre en bâtiments`

4. For index 291275. The job will be changed from `directeur du moniteur dra“ matique` to `directeur du moniteur dramatique`

5. For index 383649. The job will be changed from `avocat à la cour“ impériale` to `avocat à la cour impériale`

6. For index 394471, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f345.item.r=boutons.zoom, The job will be changed from `fab. de boutons et ga“. lons de soie` to `fab. de boutons et galons de soie`

7. For index 412282. The job will be changed from `supérieure de la congréga“ tion de st-vincent-de-paul` to `supérieure de la congrégation de st-vincent-de-paul`

8. For index 412790. The job will be changed from `membre du tribunal de com“ merce` to `membre du tribunal de commerce`

9. For index 488506. The job will be changed from `chef de la section légis“ lative aux archives de la nation` to `chef de la section législative aux archives de la nation`

10. For index 721291, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f267.image.r=Gyssens.zoom, The job will be changed from `fab. d'instruments de musique en “ bois` to `fab. d'instruments de musique en bois`

11. For index 805966. The job will be changed from `“picier` to `épicier`

12. For index 807922. The job will be changed from `négociant-commission“ naire` to `négociant-commissionnaire`

13. For index 822125, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f291.item.r=Luget.zoom, The job will be changed from `'professeur al conservatoire imper. de musiqu“` to `professeur au conservatoire imper. de musique`

14. For index 830888, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f350.item.r=Menier.zoom, The job will be changed from `maison centrale de dro“ guerie` to `maison centrale de droguerie`

15. For index 936879, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f158.item.r=david.zoom, The job will be changed from `habillements d'en“ fants` to `habillements d'enfants`

16. For index 1049505, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f229.item.r=ressorts.zoom, The job will be changed from `aciers et ressorts pour ju“ pons` to `aciers et ressorts pour jupons`

17. For index 1068304, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f351.item.r=Gournay.zoom, The job will be changed from `correcteur à l'imprimerie “impériale` to `correcteur à l'imprimerie impériale`

18. For index 1083759, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f454.item.r=sage.zoom, The job will be changed from `sage-femm“` to `sage-femme`

19. For index 1091979. The job will be changed from `épicie“` to `épicier`

20. For index 1101286, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f568.image.r=Sclioll, The job will be changed from `ré lacteur en chef du nain.“` to `rédacteur en chef du nain jaune`

21. For index 1111704, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f151.image.r=Taranne;ZOOM, The job will be changed from `marbrie“` to `marbrier`

22. For index 1137457, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f311.item.r=vannier.zoom, The job will be changed from `boisselier“ vannier` to `boisselier vannier`

23. For index 1220729, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6389871r/f335.item.r=hartmann.zoom, The job will be changed from `agence de la compa“ gnie du chemin de fer du nord` to `agence de la compagnie du chemin de fer du nord`

24. For index 1635325, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f552.item.r=avo%20.zoom, The job will be changed from `avo“ st aguesseau` to `avocat`

25. For index 1717818, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f581.item.r=success.zoom, The job will be changed from `cou“ leurs` to `couleurs`

26. For index 2468177, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685098r/f526.item.r=Esteuf.zoom, The job will be changed from `doreur et “rnisseur` to `doreur et vernisseur`

27. For index 2597359, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f507.item.r=Joltrain.zoom, The job will be changed from `papier photographique gommo“ ferrique` to `papier photographique gommoferrique`

28. For index 2784116, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692809v/f253.item.r=Derache.zoom, The job will be changed from `agent comptable de la chambre des“ huissiers` to `agent comptable de la chambre des huissiers`

29. For index 3086739, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f121.item.r=%20ournit.zoom, The job will be changed from `“ournit. pour parapluies` to `fournit. pour parapluies`

30. For index 3407142, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f346.item.r=cows.zoom, The job will be changed from `cou“s de topographie pralique` to `cours de topographie pratique`

31. For index 4247352, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f990.item.r=outils.zoom, The job will be changed from `ing“ e. c. p.` to `ingr e.c.p. outils à découper`

In [None]:
raw_paris_jobs.loc[111061, "métier"] = "fab. de bandages et d'instruments en gomme élastique"
raw_paris_jobs.loc[184466, "métier"] = "professeur au collége de france"
raw_paris_jobs.loc[277174, "métier"] = "peintre en bâtiments"
raw_paris_jobs.loc[291275, "métier"] = "directeur du moniteur dramatique"
raw_paris_jobs.loc[383649, "métier"] = "avocat à la cour impériale"
raw_paris_jobs.loc[394471, "métier"] = "fab. de boutons et galons de soie"
raw_paris_jobs.loc[412282, "métier"] = "supérieure de la congrégation de st-vincent-de-paul"
raw_paris_jobs.loc[412790, "métier"] = "membre du tribunal de commerce"
raw_paris_jobs.loc[488506, "métier"] = "chef de la section législative aux archives de la nation"
raw_paris_jobs.loc[721291, "métier"] = "fab. d'instruments de musique en bois"
raw_paris_jobs.loc[805966, "métier"] = "épicier"
raw_paris_jobs.loc[807922, "métier"] = "négociant-commissionnaire"
raw_paris_jobs.loc[822125, "métier"] = "professeur au conservatoire imper. de musique"
raw_paris_jobs.loc[830888, "métier"] = "maison centrale de droguerie"
raw_paris_jobs.loc[936879, "métier"] = "habillements d'enfants"
raw_paris_jobs.loc[1049505, "métier"] = "aciers et ressorts pour jupons"
raw_paris_jobs.loc[1068304, "métier"] = "correcteur à l'imprimerie impériale"
raw_paris_jobs.loc[1083759, "métier"] = "sage-femme"
raw_paris_jobs.loc[1091979, "métier"] = "épicier"
raw_paris_jobs.loc[1101286, "métier"] = "rédacteur en chef du nain jaune"
raw_paris_jobs.loc[1111704, "métier"] = "marbrier"
raw_paris_jobs.loc[1137457, "métier"] = "boisselier vannier"
raw_paris_jobs.loc[1220729, "métier"] = "agence de la compagnie du chemin de fer du nord"
raw_paris_jobs.loc[1635325, "métier"] = "avocat"
raw_paris_jobs.loc[1717818, "métier"] = "couleurs"
raw_paris_jobs.loc[2468177, "métier"] = "doreur et vernisseur"
raw_paris_jobs.loc[2597359, "métier"] = "papier photographique gommoferrique"
raw_paris_jobs.loc[2784116, "métier"] = "agent comptable de la chambre des huissiers"
raw_paris_jobs.loc[3086739, "métier"] = "fournit. pour parapluies"
raw_paris_jobs.loc[3407142, "métier"] = "cours de topographie pratique"
raw_paris_jobs.loc[4247352, "métier"] = "ingr e.c.p. outils à découper"

### Dealing with `–`

- Get rows with `–` (not to be confused with `-`)

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'–'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
7379,bpt6k6282019m,188,117,Bourgeois,sellier – harnacheur,Basse-duRempart,14.,1855,116,sellier – harnacheur
12604,bpt6k6282019m,219,151,Cuchot * frères,ingénieurs – mécaniciens,Moreau-St-Antoine,12 et 14.,1855,147,ingénieurs – mécaniciens
61110,bpt6k6286466w,389,125,Ballieux et Heurtin,bijoutiers – garnisseurs,Montmorency,3.,1842,200,bijoutiers – garnisseurs
93058,bpt6k6286466w,605,119,Olivrel,jardinier – fleuriste,quai Jemmapes,252.,1842,416,jardinier – fleuriste
94587,bpt6k6286466w,615,170,Peyrol (F.),papetier – relieur,Taitbout,36 et 38.,1842,426,papetier – relieur
...,...,...,...,...,...,...,...,...,...,...
3961841,bpt6k9775724t,392,298,Gamounet-Dehollande tils,articles – Amiens,boul. de Magenta,139.,1914,359,articles – amiens
4088705,bpt6k9776121t,461,326,Goriot,instruments – optique,r. des Amandiers,63.,1907,412,instruments – optique
4106294,bpt6k9776121t,579,284,Lefebvre (Victor),dépositaire des filatures ei filteries –Alost,cité St-Martin,1 (Faub. SG,1907,530,dépositaire des filatures ei filteries –alost
4121315,bpt6k9776121t,681,223,Moreau-Teigne (Ad. et Ed. Deraisme succ.),fabr. d'instruments – optique,r. St-Maur,167.,1907,632,fabr. d'instruments – optique


First, `–` is removed when it is surrounded by spaces, at the start followed by a space, or at the end preceded by a space.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)–(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
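
Because `–` and `-` look alike, it can help to confirm that they are different code points and that only the stand-alone `–` is affected. The sketch below uses plain strings and the standard `re` module, on the assumption that `str.replace(..., regex=True)` applies the same `re` syntax; `drop_isolated_en_dash` is a hypothetical helper and the sample values only mimic the rows above.

```python
import re
import unicodedata

# The two characters look alike but are different code points.
for ch in ("–", "-"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+2013 EN DASH
# U+002D HYPHEN-MINUS

# Hypothetical sample values mirroring the rows shown above.
samples = ["sellier – harnacheur", "bijout. –joaill. fab.", "jardinier-fleuriste"]

def drop_isolated_en_dash(text):
    # Same two steps as the cell above: remove a stand-alone en dash,
    # then collapse runs of whitespace into a single space.
    text = re.sub(r'(^|\s)–(\s|$)', ' ', text)
    return re.sub(r'\s\s+', ' ', text)

cleaned = [drop_isolated_en_dash(s) for s in samples]
print(cleaned)
# ['sellier harnacheur', 'bijout. –joaill. fab.', 'jardinier-fleuriste']
```

An en dash attached to a word, as in `–joaill.`, is not matched, and the ordinary hyphen is never touched.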

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'–'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
249551,bpt6k6305463c,317,36,Crouzet,bijout. –joaill. fab.,Coquillière,40.,1857,204,bijout. –joaill. fab.
383270,bpt6k6314752k,471,14,David,administrateur du bureau de bienfai– sance du ...,Vaugirard,48.,1856,281,administrateur du bureau de bienfai– sance du ...
885671,bpt6k63243905,497,186,Heizler (H.),sculpt –gtat.,Ménilmontant,'56.,1863,363,sculpt –gtat.
1147514,bpt6k6333200c,505,135,Joffrin,chargé des affaires de M. le duc d'Os– sua,La Rochefoucault,19.,1862,373,chargé des affaires de m. le duc d'os– sua
1156581,bpt6k6333200c,561,75,Louis,bijout. –garnisseur,Aumaire,12.,1862,429,bijout. –garnisseur
1193340,bpt6k6389871r,225,52,Chasse'oup-Laubat (Comte P. de) C.,dé– puté de la Charente-Inférieure,Bienfaisance,9 et 11.,1853,148,dé– puté de la charente-inférieure
1293107,bpt6k6393838j,502,176,Croizat,horl –mécan.,Vieille-du-Temple,31.,1843,283,horl –mécan.
4106294,bpt6k9776121t,579,284,Lefebvre (Victor),dépositaire des filatures ei filteries –Alost,cité St-Martin,1 (Faub. SG,1907,530,dépositaire des filatures ei filteries –alost


1. For index 249551, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f204.item.r=bijont.zoom, The job will be changed from `bijout. –joaill. fab.` to `bijout.-joaill. fab.`

2. For index 383270, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f281.item.r=administrateur.zoom, The job will be changed from `administrateur du bureau de bienfai– sance du e arrondissement` to `administrateur du bureau de bienfaisance du arrondissement`

3. For index 885671, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f363.item.r=Hei7.zoom, The job will be changed from `sculpt –gtat.` to `sculpt.-stat.`

4. For index 1147514, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f373.item.r=Joffrin.zoom, The job will be changed from `chargé des affaires de m. le duc d'os– sua` to `chargé des affaires de m. le duc d'ossuna`

5. For index 1156581, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f429.item.r=Aumaira.zoom, The job will be changed from `bijout. –garnisseur` to `bijout.-garnisseur`

6. For index 1193340, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6389871r/f148.item.r=Laubat.zoom, The job will be changed from `dé– puté de la charente-inférieure` to `député de la charente-inférieure`

7. For index 1293107, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f283.item.r=Croizat.zoom, The job will be changed from `horl –mécan.` to `horl-mécan.`

8. For index 4106294, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f530.item.r=depositaire.zoom, The job will be changed from `dépositaire des filatures ei filteries –alost` to `dépositaire des filatures et filteries d'alost`

In [None]:
raw_paris_jobs.loc[249551, "métier"] = "bijout.-joaill. fab."
raw_paris_jobs.loc[383270, "métier"] = "administrateur du bureau de bienfaisance du arrondissement"
raw_paris_jobs.loc[885671, "métier"] = "sculpt.-stat."
raw_paris_jobs.loc[1147514, "métier"] = "chargé des affaires de m. le duc d'ossuna"
raw_paris_jobs.loc[1156581, "métier"] = "bijout.-garnisseur"
raw_paris_jobs.loc[1193340, "métier"] = "député de la charente-inférieure"
raw_paris_jobs.loc[1293107, "métier"] = "horl-mécan."
raw_paris_jobs.loc[4106294, "métier"] = "dépositaire des filatures et filteries d'alost"

### Dealing with `•`

- Get rows with `•`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'•'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
6466,bpt6k6282019m,183,57,Bordier,fabrique d'orfévrerie en doublé d'ar• gent,Montmorency,46.,1855,111,fabrique d'orfévrerie en doublé d'ar• gent
25921,bpt6k6282019m,300,75,Girardin,cadres p• porte-monnaie,EnfantsRouges,2.,1855,228,cadres p• porte-monnaie
56384,bpt6k6282019m,485,151,Tirrart,sculpteur-ornemaniste et carton-pier• re,impasse Sandrié,4 bis.,1855,413,sculpteur-ornemaniste et carton-pier• re
109421,bpt6k62906378,612,66,Boutigny,teint •dégr.,St-Lazare,137,1846,324,teint •dégr.
113720,bpt6k62906378,639,109,Cormier,directeur de la compagnie d'as• surance l’Agri...,Ste-Anne,51 bis 6.,1846,351,directeur de la compagnie d'as• surance l’agri...
...,...,...,...,...,...,...,...,...,...,...
4185207,bpt6k97774838,862,74,Commercial de l'habillement,con• dames,Faub. St-Martin,59.,1921,535,con• dames
4202818,bpt6k97774838,992,206,Fontaine-Bour (Georges),directeur du Mo• niteur des Intérêts Matériels,r. Thiers,4. (16e). T. Pas. 43. 01.,1921,665,directeur du mo• niteur des intérêts matériels
4260411,bpt6k97774838,1418,300,Savarin rientas). taillandier,r: des Jardins•St Pau,13,14 et 15 (4e).,1921,1091,r: des jardins•st pau
4286766,bpt6k9780089g,760,54,Beuchey (René),représ. de •commerce,r. Cler,30.,1922,421,représ. de •commerce


First, `•` is removed when it is surrounded by spaces, at the start followed by a space, or at the end preceded by a space.

When a job spans two lines, it is sometimes split at `• `, so all occurrences of `• ` are removed as well.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\•(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'• ', r'', regex=True)
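
The three replacements above can be traced on a few strings to see which cases each one handles. This is a sketch on plain strings with the standard `re` module, on the assumption that `str.replace(..., regex=True)` applies the same `re` syntax; `clean_bullet` is a hypothetical helper and the last sample value is made up.

```python
import re

# Sample values mirroring the rows shown above; the last one is made up.
samples = [
    "fabrique d'orfévrerie en doublé d'ar• gent",
    "cadres p• porte-monnaie",
    "teint •dégr.",
    "vins • liqueurs",
]

def clean_bullet(text):
    text = re.sub(r'(^|\s)•(\s|$)', ' ', text)  # stand-alone bullet becomes a space
    text = re.sub(r'\s\s+', ' ', text)           # collapse runs of whitespace
    return text.replace('• ', '')                # bullet at a line break: rejoin the word

cleaned = [clean_bullet(s) for s in samples]
print(cleaned)
# ["fabrique d'orfévrerie en doublé d'argent", 'cadres pporte-monnaie',
#  'teint •dégr.', 'vins liqueurs']
```

A word split across lines (`d'ar• gent`) is rejoined, while a bullet directly in front of a word (`•dégr.`) is left for manual correction. Entries where `•` stands in for an abbreviation point, such as `p• porte-monnaie`, are joined as well, so they may deserve a manual look.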

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'•'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
109421,bpt6k62906378,612,66,Boutigny,teint •dégr.,St-Lazare,137,1846,324,teint •dégr.
281809,bpt6k6305463c,511,26,Obry (Vve),mcd •leur-cisel.,Sainte-Anne,16.,1857,398,mcd •leur-cisel.
590259,bpt6k6318531z,592,113,Verb:ckmoes(N. G.),agence maritime des paquebots à heli•es au nord,Dronot,2.,1858,484,agence maritime des paquebots à heli•es au nord
1054297,bpt6k6333170p,397,155,Davy (Vor),contrôleur•è la garantie des matières d'or et ...,Montyon-Mont- ronge,10.,1864,260,contrôleur•è la garantie des matières d'or et ...
1298719,bpt6k6393838j,541,117,Fouque ainé,quincailli •r,Trois-Paiilions,18.,1843,322,quincailli •r
1314409,bpt6k6393838j,648,115,Piellard,horlog •r-bijoutier,Temple,87.,1843,429,horlog •r-bijoutier
1321022,bpt6k6393838j,694,75,Vax claire,epici •r,Vieille-du Temple,82.,1843,475,epici •r
2979674,bpt6k9762899p,482,113,Degeux,•sculpteur sur bois,pass. de la Maind'Or,23.,1890,325,•sculpteur sur bois
3970066,bpt6k9775724t,445,324,H bert et Cie,bi •yclettes,av. de la GrandeArmée,29. (160). T. Pas. 45. 98.,1914,412,bi •yclettes
4260411,bpt6k97774838,1418,300,Savarin rientas). taillandier,r: des Jardins•St Pau,13,14 et 15 (4e).,1921,1091,r: des jardins•st pau


1. For index 109421, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62906378/f324.image.r=degr.zoom, The job will be changed from `teint •dégr.` to `teint.-dégr.`

2. For index 281809, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f398.item.r=annee.zoom, The job will be changed from `mcd •leur-cisel.` to `modeleur-cisel.`

3. For index 590259, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f484.item.r=paquebots.zoom, The job will be changed from `agence maritime des paquebots à heli•es au nord` to `agence maritime des paquebots à helices au nord`

4. For index 1054297, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f260.item.r=vor.zoom, The job will be changed from `contrôleur•è la garantie des matières d'or et d'argent` to `contrôleur à la garantie des matières d'or et d'argent`

5. For index 1298719, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f322.item.r=quincailli.zoom, The job will be changed from `quincailli •r` to `quincaillier`

6. For index 1314409, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f429.item.r=Piellard.zoom, The job will be changed from `horlog •r-bijoutier` to `horloger-bijoutier`

7. For index 1321022, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f475.item.r=vax.zoom, The job will be changed from `epici •r` to `épicier`

8. For index 2979674, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762899p/f325.item.r=Degeux.zoom, The job will be changed from `•sculpteur sur bois` to `sculpteur sur bois`

9. For index 3970066, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f412.item.r=yclettes.zoom, The job will be changed from `bi •yclettes` to `bicyclettes`

10. For index 4260411, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1091.item.r=tjesi.zoom, The job will be changed from `r: des jardins•st pau` to `taillandier`

11. For index 4286766, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f421.item.r=rene.zoom, The job will be changed from `représ. de •commerce` to `représ. de commerce`

In [None]:
raw_paris_jobs.loc[109421, "métier"] = "teint.-dégr."
raw_paris_jobs.loc[281809, "métier"] = "modeleur-cisel."
raw_paris_jobs.loc[590259, "métier"] = "agence maritime des paquebots à helices au nord"
raw_paris_jobs.loc[1054297, "métier"] = "contrôleur à la garantie des matières d'or et d'argent"
raw_paris_jobs.loc[1298719, "métier"] = "quincaillier"
raw_paris_jobs.loc[1314409, "métier"] = "horloger-bijoutier"
raw_paris_jobs.loc[1321022, "métier"] = "épicier"
raw_paris_jobs.loc[2979674, "métier"] = "sculpteur sur bois"
raw_paris_jobs.loc[3970066, "métier"] = "bicyclettes"
raw_paris_jobs.loc[4260411, "métier"] = "taillandier"
raw_paris_jobs.loc[4286766, "métier"] = "représ. de commerce"

### Dealing with `/`

- Get rows with `/`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'/'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
4526,bpt6k6282019m,171,146,Berthier (Paul),peintre art/ste,quai St-Michel,19.,1855,99,peintre art/ste
40787,bpt6k6282019m,389,168,Massalve (J.),parap/uies,S2-Denis,319.,1855,317,parap/uies
63619,bpt6k6286466w,406,0,Bizet,inspect. des domaines de la ville et des / aba...,Rochechouart,73.,1842,217,inspect. des domaines de la ville et des / aba...
253943,bpt6k6305463c,344,35,Doneaud (Mme Adèle),peint.-art/,Bouloi,8.,1857,231,peint.-art/
649269,bpt6k63197984,165,211,Buisson,contr/leur de fabrication à la manuua facture ...,quai d'Orsay. 63. 7.,189 1900,1852,109,contr/leur de fabrication à la manuua facture ...
657361,bpt6k63197984,222,44,D-vaux,peintre-art/ste,Chabrol,16.,1852,166,peintre-art/ste
775895,bpt6k63243601,429,130,Malo (C.),direct. du journal la France lit/érairc,Eperon,10 b.s.,1839,306,direct. du journal la france lit/érairc
788367,bpt6k6324389h,145,27,Almussion lam les hôn. (bu,central / ),quai Le Peletier,4.,1859,73,central / )
1023608,bpt6k6331310g,624,152,Deux-Portes-St-Sauveur,34 4).. no/,H lo Montguyon (Cte de),0.,1844,408,no/
1668285,bpt6k9672776c,387,141,Decrais,/ 05 -conseiller d'Etat,avenue du Bois-de-Boulogne,62.,1880,268,/ -conseiller d'etat


First, each `/` is removed when it is surrounded by spaces, at the start followed by a space, or at the end preceded by a space.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)/(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
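The same two-step pattern (remove the isolated character, then collapse repeated whitespace) recurs for several characters in this notebook. Below is a minimal sketch of how it could be wrapped in a helper. The name `remove_isolated_char` and the toy series `demo` are hypothetical and are not part of the original notebook.

```python
import re

import pandas as pd


def remove_isolated_char(series: pd.Series, char: str) -> pd.Series:
    """Replace `char` by a space when it stands alone, then collapse whitespace.

    Hypothetical helper, written only to illustrate the pattern used above.
    """
    # re.escape keeps the helper safe for regex metacharacters.
    pattern = rf"(^|\s){re.escape(char)}(\s|$)"
    cleaned = series.str.replace(pattern, " ", regex=True)
    return cleaned.str.replace(r"\s\s+", " ", regex=True)


# Toy values copied from the rows shown above.
demo = pd.Series(["central / )", "peintre art/ste", "/ -conseiller d'etat"])
demo_cleaned = remove_isolated_char(demo, "/")
```

A `/` at the very start is replaced by a space rather than deleted, so the third value keeps a leading space. An extra `.str.strip()` would remove it, if that is desired.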

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'/'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
4526,bpt6k6282019m,171,146,Berthier (Paul),peintre art/ste,quai St-Michel,19.,1855,99,peintre art/ste
40787,bpt6k6282019m,389,168,Massalve (J.),parap/uies,S2-Denis,319.,1855,317,parap/uies
253943,bpt6k6305463c,344,35,Doneaud (Mme Adèle),peint.-art/,Bouloi,8.,1857,231,peint.-art/
649269,bpt6k63197984,165,211,Buisson,contr/leur de fabrication à la manuua facture ...,quai d'Orsay. 63. 7.,189 1900,1852,109,contr/leur de fabrication à la manuua facture ...
657361,bpt6k63197984,222,44,D-vaux,peintre-art/ste,Chabrol,16.,1852,166,peintre-art/ste
775895,bpt6k63243601,429,130,Malo (C.),direct. du journal la France lit/érairc,Eperon,10 b.s.,1839,306,direct. du journal la france lit/érairc
1023608,bpt6k6331310g,624,152,Deux-Portes-St-Sauveur,34 4).. no/,H lo Montguyon (Cte de),0.,1844,408,no/
1723391,bpt6k9672776c,736,14,Schneider,ins/itution israélite de jeunes gens,boul. Latour-Maubourg,96.,1880,617,ins/itution israélite de jeunes gens
1723980,bpt6k9672776c,739,182,Sérane (l),mécanicien/ modeleur,passage Maurice,12. 37919,1880,620,mécanicien/ modeleur
1740276,bpt6k96727875,314,43,Bisson,vins et liqueu/9,boul. Rochechouart,46. 45.,1870,179,vins et liqueu/9


One entry in the list below does not appear in the output above, because its `/` was already removed during the deletion of numbers. It is still corrected here so that the earlier corrections can continue to be used.

1. For index 4526. The job will be changed from `peintre art/ste` to `peintre artiste`

2. For index 40787, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6282019m/f317.item.r=Massalve.zoom, The job will be changed from `parap/uies` to `parapluies`

3. For index 253943, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f231.item.r=Doneaud.zoom, The job will be changed from `peint.-art/` to `peintre artiste`

4. For index 649269, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63197984/f109.item.r=fabrication.zoom, The job will be changed from `contr/leur de fabrication à la manuua facture de tabac` to `contrôleur de fabrication à la manufacture de tabac`

5. For index 657361. The job will be changed from `peintre-art/ste` to `peintre artiste`

6. For index 775895, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243601/f306.item.r=journal.zoom, The job will be changed from `direct. du journal la france lit/érairc` to `direct. du journal la france littéraire`

7. For index 1023608, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6331310g/f408.image.r=Montguyon. Several lines were merged into one during extraction. The job will be changed from ` ).. no/` to an empty string, as none of the characters are useful.

8. For index 1723391. The job will be changed from `ins/itution israélite de jeunes gens` to `institution israélite de jeunes gens`

9. For index 1723980, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f620.item.r=mecanicien.zoom, The job will be changed from `mécanicien/ modeleur` to `mécanicien modeleur`

10. For index 1740276. The job will be changed from `vins et liqueu/` to `vins et liqueur`

11. For index 2095784. The job will be changed from `chif/ons en gros et vieux papiers` to `chiffons en gros et vieux papiers`

12. For index 2190288. The job will be changed from `coiffeur par/umeur` to `coiffeur parfumeur`

13. For index 2337865. The job will be changed from `beurre etoeu/s` to `beurre et œufs`

14. For index 2359867. The job will be changed from `président du conseil de la société d'or/èvrerie d'ercuis` to `président du conseil de la société d'orfèvrerie d'ercuis`

15. For index 2366102. The job will be changed from `coiſ/eur` to `coiffeur`.

16. For index 2375297. The job will be changed from `coif/eur` to `coiffeur`

17. For index 2383213. The job will be changed from `directeur de l'administration yénerale de convois el transports /une- bres` to `directeur de l'administration génerale de convois et transports funèbres`

18. For index 2993571. The job will be changed from `ins/itution` to `institution`

19. For index 3063012. The job will be changed from `avocet/cour d'appel` to `avocat cour d'appel`

20. For index 3073171. The job will be changed from `beurre et qu/s` to `beurre et œufs`

21. For index 3153772, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97631451/f400.item.r=Boidin.zoom, The job will be changed from `contrôleur des contributions direc/tes` to `contrôleur des contributions directes`

22. For index 3338470. The job will be changed from `contrôleur des contributions diu/ rectes` to `contrôleur des contributions directes`

23. For index 3347303. The job will be changed from `bot/ier` to `bottier`

24. For index 3390243, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f247.item.r=Demolliens.zoom, The job will be changed from `greffier de la jus/ice dle pair du e arrondissement` to `greffier de la justice de paix du arrondissement`

25. For index 3437425. The job will be changed from `institu/ion` to `institution`

26. For index 3454446. The job will be changed from `co/feur` to `coiffeur`

27. For index 3657264, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f136.item.r=Avenin.zoom, The job will be changed from `fab. d'ingrédients tartri/ ges` to `fab. d'ingrédients tartrifuges`

28. For index 3705856, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f420.item.r=Linard.zoom, The job will be changed from `blanc/ssserie de tinge fin` to `blanchisserie de linge fin`

29. For index 3747096, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f194.item.r=Bousquet.zoom, The job will be changed from `fabrade/ bouchons` to `fabr. de bouchons`

30. For index 3985642, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f517.item.r=Roussel.zoom, The job will be changed from `assurances. r. girodet. . /e).` to `assurances`

31. For index 4003073, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f633.item.r=lycee.zoom, The job will be changed from `pro/esseur au lycée charlemagne` to `professeur au lycée charlemagne`

32. For index 4028672. The job will be changed from `architecte-exper/` to `architecte-expert`

33. For index 4118211, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f612.item.r=Esquirol.zoom, The job will be changed from `coi//eur` to `coiffeur`

34. For index 4284903, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f408.item.r=dynamos.zoom, The job will be changed from `/; usine` to `usine`

35. For index 4293239, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f463.item.r=Boutroue.zoom, The job will be changed from `graveurs s./métaux` to `graveurs sur métaux`

36. For index 4399790. The job will be changed from `coi/feur` to `coiffeur`

In [None]:
raw_paris_jobs.loc[4526, "métier"] = "peintre artiste"
raw_paris_jobs.loc[40787, "métier"] = "parapluies"
raw_paris_jobs.loc[253943, "métier"] = "peintre artiste"
raw_paris_jobs.loc[649269, "métier"] = "contrôleur de fabrication à la manufacture de tabac"
raw_paris_jobs.loc[657361, "métier"] = "peintre artiste"
raw_paris_jobs.loc[775895, "métier"] = "direct. du journal la france littéraire"
raw_paris_jobs.loc[1023608, "métier"] = ""
raw_paris_jobs.loc[1723391, "métier"] = "institution israélite de jeunes gens"
raw_paris_jobs.loc[1723980, "métier"] = "mécanicien modeleur"
raw_paris_jobs.loc[1740276, "métier"] = "vins et liqueur"
raw_paris_jobs.loc[2095784, "métier"] = "chiffons en gros et vieux papiers"
raw_paris_jobs.loc[2190288, "métier"] = "coiffeur parfumeur"
raw_paris_jobs.loc[2337865, "métier"] = "beurre et œufs"
raw_paris_jobs.loc[2359867, "métier"] = "président du conseil de la société d'orfèvrerie d'ercuis"
raw_paris_jobs.loc[2366102, "métier"] = "coiffeur"
raw_paris_jobs.loc[2375297, "métier"] = "coiffeur"
raw_paris_jobs.loc[2383213, "métier"] = "directeur de l'administration génerale de convois et transports funèbres"
raw_paris_jobs.loc[2993571, "métier"] = "institution"
raw_paris_jobs.loc[3063012, "métier"] = "avocat cour d'appel"
raw_paris_jobs.loc[3073171, "métier"] = "beurre et œufs"
raw_paris_jobs.loc[3153772, "métier"] = "contrôleur des contributions directes"
raw_paris_jobs.loc[3338470, "métier"] = "contrôleur des contributions directes"
raw_paris_jobs.loc[3347303, "métier"] = "bottier"
raw_paris_jobs.loc[3390243, "métier"] = "greffier de la justice de paix du arrondissement"
raw_paris_jobs.loc[3437425, "métier"] = "institution"
raw_paris_jobs.loc[3454446, "métier"] = "coiffeur"
raw_paris_jobs.loc[3657264, "métier"] = "fab. d'ingrédients tartrifuges"
raw_paris_jobs.loc[3705856, "métier"] = "blanchisserie de linge fin"
raw_paris_jobs.loc[3747096, "métier"] = "fabr. de bouchons"
raw_paris_jobs.loc[3985642, "métier"] = "assurances"
raw_paris_jobs.loc[4003073, "métier"] = "professeur au lycée charlemagne"
raw_paris_jobs.loc[4028672, "métier"] = "architecte-expert"
raw_paris_jobs.loc[4118211, "métier"] = "coiffeur"
raw_paris_jobs.loc[4284903, "métier"] = "usine"
raw_paris_jobs.loc[4293239, "métier"] = "graveurs sur métaux"
raw_paris_jobs.loc[4399790, "métier"] = "coiffeur"
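The long run of `.loc` assignments can also be written as a mapping from index to corrected profession. The sketch below uses a toy frame `jobs` in place of `raw_paris_jobs` and only three of the corrections listed above. The names `jobs`, `corrections` and `missing` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for raw_paris_jobs (values taken from the list above).
jobs = pd.DataFrame(
    {"métier": ["peintre art/ste", "parap/uies", "coi/feur"]},
    index=[4526, 40787, 4399790],
)

# Index -> corrected profession, as verified on the directory images.
corrections = {
    4526: "peintre artiste",
    40787: "parapluies",
    4399790: "coiffeur",
}

# Fail early on an unknown index: assigning through .loc with a label that
# does not exist would otherwise add a new row to the frame.
missing = [idx for idx in corrections if idx not in jobs.index]
if missing:
    raise KeyError(f"indices not found in the frame: {missing}")

for idx, job in corrections.items():
    jobs.loc[idx, "métier"] = job
```

Keeping the corrections in one dictionary makes it easier to check for a mistyped index before any value is written.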

### Dealing with `’`

- Get the rows with `’`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\’"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
1183,bpt6k6282019m,151,56,Artus,chef d'orchestre à l’Ambigu-Comique,Marais-St-Martin,46.,1855,79,chef d'orchestre à l’ambigu-comique
4943,bpt6k6282019m,174,31,Bibliothèque de l'Arsenal,à l’Arsenal,rue Sully,1.,1855,102,à l’arsenal
14263,bpt6k6282019m,230,5,Croneau du Plessis (Alph.) fils,peintre d’histoire,Bonaparte,15.,1855,158,peintre d’histoire
15059,bpt6k6282019m,234,184,Dauvergne (L. H.) *,chef d'escadron d’étatmajor,Ferme-des-Mathurins,19.,1855,162,chef d'escadron d’étatmajor
20134,bpt6k6282019m,265,200,Du Pays,collaborateur du journal l’Illustration,Lavoisier,19.,1855,193,collaborateur du journal l’illustration
...,...,...,...,...,...,...,...,...,...,...
4181522,bpt6k97774838,830,171,Chauffeurs - Réunis (les),garage d’autos,boul. Péreire,267.,1921,503,garage d’autos
4197346,bpt6k97774838,949,153,Dupont et Elluin (Elluin succ.),office international de brevets d’invention,boul. Bonne--Nouvelle,42.,1921,622,office international de brevets d’invention
4297637,bpt6k9780089g,831,285,Camus et Métay,importations d’Algérie,av. d'Orléans,5.,1922,492,importations d’algérie
4383938,bpt6k9780089g,1435,293,Ricard (A.),administrateur d’immeubles,r. de Pétrograd,15.,1922,1096,administrateur d’immeubles


`’` is mostly a misreading of `'`. No `’` is surrounded by spaces or located at the start or the end of an entry.

Thus, every `’` will be replaced by `'`.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\’", r"'", regex=True)
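Since a single fixed character is being replaced, a literal (non-regex) replacement is a possible alternative. The sketch below runs on a toy series `demo` (a hypothetical name) and adds a check that no typographic apostrophe is left.

```python
import pandas as pd

# Toy values: two taken from the output above, one already using "'".
demo = pd.Series(["peintre d’histoire", "garage d’autos", "chef d'orchestre"])

# Literal replacement of U+2019 (typographic apostrophe) by U+0027.
demo_fixed = demo.str.replace("\u2019", "'", regex=False)

# Sanity check: number of rows that still contain U+2019.
remaining = int(demo_fixed.str.contains("\u2019", regex=False).sum())
```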

### Dealing with `:`

- Get rows with `:`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r':'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
2387,bpt6k6282019m,158,148,Bardon (E.) et Asseline,drogueries NC: et produits chimiques,pass. Ste-Croix-de-la-Bro- tonnerie,11.,1855,86,drogueries nc: et produits chimiques
4015,bpt6k6282019m,168,148,Berendorff (Jh.) fils,INC: mécanicien,Mouffetard,294.,1855,96,inc: mécanicien
5610,bpt6k6282019m,178,22,Blin (Louis),NC: hôtel de France et de Champagne,Montmartre,i32.,1855,106,nc: hôtel de france et de champagne
7523,bpt6k6282019m,189,76,Bourru,fab: de jeux de patience,Bourgl'Abbé,23.,1855,117,fab: de jeux de patience
7972,bpt6k6282019m,191,195,Bréau,librairie et cabinet de lect:,Bac,144.,1855,119,librairie et cabinet de lect:
...,...,...,...,...,...,...,...,...,...,...
4403006,bpt6k9780089g,1585,185,Vigier (1) (André Lesure succ),phar: macien,r. du Bac,70. (70). T: Saxe 01.,1922,1246,phar: macien
4404243,bpt6k9780089g,1593,227,(101). T. Gut. 00. 74,Adr. T: ORGANSIN-PARIS.,(200). T. Roq. 50,13.,1922,1254,adr. t: organsin-paris.
4404489,bpt6k9780089g,1595,189,Walker (Fernand),fabr: de chaussures,boul. Voltaire,238.,1922,1256,fabr: de chaussures
4404843,bpt6k9780089g,1598,72,'Weil,importat: ur,k. de St Quentin,24.,1922,1259,importat: ur


First, each `:` is removed when it is surrounded by spaces, at the start followed by a space, or at the end preceded by a space.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s):(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r':'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
2387,bpt6k6282019m,158,148,Bardon (E.) et Asseline,drogueries NC: et produits chimiques,pass. Ste-Croix-de-la-Bro- tonnerie,11.,1855,86,drogueries nc: et produits chimiques
4015,bpt6k6282019m,168,148,Berendorff (Jh.) fils,INC: mécanicien,Mouffetard,294.,1855,96,inc: mécanicien
5610,bpt6k6282019m,178,22,Blin (Louis),NC: hôtel de France et de Champagne,Montmartre,i32.,1855,106,nc: hôtel de france et de champagne
7523,bpt6k6282019m,189,76,Bourru,fab: de jeux de patience,Bourgl'Abbé,23.,1855,117,fab: de jeux de patience
7972,bpt6k6282019m,191,195,Bréau,librairie et cabinet de lect:,Bac,144.,1855,119,librairie et cabinet de lect:
...,...,...,...,...,...,...,...,...,...,...
4403006,bpt6k9780089g,1585,185,Vigier (1) (André Lesure succ),phar: macien,r. du Bac,70. (70). T: Saxe 01.,1922,1246,phar: macien
4404243,bpt6k9780089g,1593,227,(101). T. Gut. 00. 74,Adr. T: ORGANSIN-PARIS.,(200). T. Roq. 50,13.,1922,1254,adr. t: organsin-paris.
4404489,bpt6k9780089g,1595,189,Walker (Fernand),fabr: de chaussures,boul. Voltaire,238.,1922,1256,fabr: de chaussures
4404843,bpt6k9780089g,1598,72,'Weil,importat: ur,k. de St Quentin,24.,1922,1259,importat: ur


There are still 2731 rows with `:`, too many to correct manually. However, `NC` (in a box) is used in the bottins to indicate `Notable Commercant` (see above image for awards).

Thus, we shall find the occurrences of `nc:` and delete them.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace('nc:', r'', regex=False)
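Before deleting `nc:` everywhere, it can help to look at the tokens that contain it, because a literal replacement also acts inside longer tokens such as `inc:`. The sketch below runs on a toy series `demo` built from rows shown above. The names `demo`, `tokens`, `token_counts` and `after` are assumptions made for illustration.

```python
import pandas as pd

# Toy values copied from the output above.
demo = pd.Series([
    "drogueries nc: et produits chimiques",
    "inc: mécanicien",
    "fab: de jeux de patience",
])

# Whitespace-delimited token ending in "nc:" (NaN where there is none).
tokens = demo.str.extract(r"(\S*nc:)", expand=False)
token_counts = tokens.value_counts()

# Same literal deletion as in the cell above, for comparison.
after = demo.str.replace("nc:", "", regex=False)
```

On this toy data the deletion leaves a stray `i` in the second value and a double space in the first, so a further whitespace collapse, or a pattern that drops the whole token, may be worth considering.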

### Dealing with `!`

- Get rows with `!`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'!'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
2624,bpt6k6282019m,160,6,Barry-Flamant,hôte!,Bourdonnais,13.,1855,88,hôte!
2959,bpt6k6282019m,162,2,Baudon,fab. bourda!oux,Temple,56.,1855,90,fab. bourda!oux
3821,bpt6k6282019m,167,92,Bénard,propr!étaire,Bondy,54.,1855,95,propr!étaire
21986,bpt6k6282019m,277,82,Félix (F.),joai!l.-sertisseur,Montmartre,31.,1855,205,joai!l.-sertisseur
28399,bpt6k6282019m,315,62,Hache,de la maison Lobligeois NC! et Hache,Ste-Croix-de-la-Bretonnerie,36.,1855,243,de la maison lobligeois nc! et hache
...,...,...,...,...,...,...,...,...,...,...
4378912,bpt6k9780089g,1401,85,Ponsot (MOR A.),papier ! Arménie et paniers medicinaux,r. St-Claude,26.,1922,1062,papier ! arménie et paniers medicinaux
4393196,bpt6k9780089g,1505,102,Société des burcaur d'assurances,Vida! Engourran frères,r. Laffitte,34.,1922,1166,vida! engourran frères
4394641,bpt6k9780089g,1527,106,Spiedt,pape!ier,r. Réaumur,26.,1922,1188,pape!ier
4398433,bpt6k9780089g,1552,333,Torné (Mue),cou!ur.,r. Lecourbe,84.,1922,1213,cou!ur.


First, each `!` is removed when it is surrounded by spaces, at the start followed by a space, or at the end preceded by a space.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\!(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'!'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
2624,bpt6k6282019m,160,6,Barry-Flamant,hôte!,Bourdonnais,13.,1855,88,hôte!
2959,bpt6k6282019m,162,2,Baudon,fab. bourda!oux,Temple,56.,1855,90,fab. bourda!oux
3821,bpt6k6282019m,167,92,Bénard,propr!étaire,Bondy,54.,1855,95,propr!étaire
21986,bpt6k6282019m,277,82,Félix (F.),joai!l.-sertisseur,Montmartre,31.,1855,205,joai!l.-sertisseur
28399,bpt6k6282019m,315,62,Hache,de la maison Lobligeois NC! et Hache,Ste-Croix-de-la-Bretonnerie,36.,1855,243,de la maison lobligeois nc! et hache
...,...,...,...,...,...,...,...,...,...,...
4355419,bpt6k9780089g,1240,40,Lévy(Auguste),ingénieur en chef de la Sté! du gaz de Paris,r. de Monceau,80.,1922,901,ingénieur en chef de la sté! du gaz de paris
4356846,bpt6k9780089g,1250,329,Lordier (Georges),entreprise generale !cinématographique,boul. Bonne Nouvelle,28.,1922,911,entreprise generale !cinématographique
4393196,bpt6k9780089g,1505,102,Société des burcaur d'assurances,Vida! Engourran frères,r. Laffitte,34.,1922,1166,vida! engourran frères
4394641,bpt6k9780089g,1527,106,Spiedt,pape!ier,r. Réaumur,26.,1922,1188,pape!ier


There are still 791 rows with `!`, too many to correct manually. However, `NC` (in a box) is used in the bottins to indicate `Notable Commercant` (see above image for awards).

Thus, we shall find the occurrences of `nc!` and delete them.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace('nc!', r'', regex=False)

### Dealing with `&`

- Get rows with `&`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'&'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
27614,bpt6k6282019m,310,142,Guérin,peintre en b&timents,Beaurepaire,8.,1855,238,peintre en b&timents
39503,bpt6k6282019m,382,72,Malsang,fab. d'encre d'imprimerie et lithogr&o phique,Poupée,6.,1855,310,fab. d'encre d'imprimerie et lithogr&o phique
51731,bpt6k6282019m,456,124,Rolland,peint. b&tim.,Faub.-Montmartre,30.,1855,384,peint. b&tim.
170727,bpt6k6292987t,847,206,Husbrocq fils,fab de paillons et poudit & d)rer,Ste-Avoie,69.,1845,494,fab de paillons et poudit & d)rer
247996,bpt6k6305463c,307,136,Collé et Damien,&cul.-marbr.,Roquette,153.,1857,194,&cul.-marbr.
...,...,...,...,...,...,...,...,...,...,...
4403369,bpt6k9780089g,1587,314,Villemer (René),fabr. de couleurs & vernis. av. de la République,98,à Aubervilliers (Seine). T. Nord 24.,1922,1248,fabr. de couleurs & vernis. av. de la république
4405593,bpt6k9780089g,1603,215,Wormser (Paul),clarifiants & produits cnologiques,boul. de Bercy,3 et 5.,1922,1264,clarifiants & produits cnologiques
4405737,bpt6k9780089g,1604,277,Ytier,charbons & vins,r. Taine,18.,1922,1265,charbons & vins
4405830,bpt6k9780089g,1605,183,Zamarou,beurre & oeufs,boul. Barbès,46.,1922,1266,beurre & oeufs


When `&` is surrounded by spaces, it mostly means "and", and all such occurrences are removed. An `&` at the start followed by a space, or at the end preceded by a space, is likewise replaced by a space. The remaining cases will be dealt with during tag generation.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)&(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'&'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
27614,bpt6k6282019m,310,142,Guérin,peintre en b&timents,Beaurepaire,8.,1855,238,peintre en b&timents
39503,bpt6k6282019m,382,72,Malsang,fab. d'encre d'imprimerie et lithogr&o phique,Poupée,6.,1855,310,fab. d'encre d'imprimerie et lithogr&o phique
51731,bpt6k6282019m,456,124,Rolland,peint. b&tim.,Faub.-Montmartre,30.,1855,384,peint. b&tim.
247996,bpt6k6305463c,307,136,Collé et Damien,&cul.-marbr.,Roquette,153.,1857,194,&cul.-marbr.
343521,bpt6k6309075f,489,65,Mallet,entrepr. &'éclairage,Martel,7.,1861,393,entrepr. &'éclairage
...,...,...,...,...,...,...,...,...,...,...
4325158,bpt6k9780089g,1031,22,Flutsch,Guerrier&Cie agentsdemanufactures,r. Meissonier,2.,1922,692,guerrier&cie agentsdemanufactures
4364031,bpt6k9780089g,1298,200,Mennessier,vins &hôtel,r. Saussure,75.,1922,959,vins &hôtel
4376587,bpt6k9780089g,1386,124,Picard& de Cooman,impr. typogr.& lithogr.,pass. Kuszner,17.,1922,1047,impr. typogr.& lithogr.
4380415,bpt6k9780089g,1411,76,Privat,charb.&vins,av. de Malakoff,140.,1922,1072,charb.&vins


1. For index 27614, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6282019m/f238. The job will be changed from `peintre en b&timents` to `peintre en bâtiments`

2. For index 39503, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6282019m/f310. The job will be changed from `fab. d'encre d'imprimerie et lithogr&o phique` to `fab. d'encre d'imprimerie et lithographique`

3. For index 51731, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6282019m/f384. The job will be changed from `peint. b&tim.` to `peint. bâtim.`

4. For index 247996, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f194. The job will be changed from `&cul.-marbr.` to `scul.-marbr.`

5. For index 343521, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f393.image.r=eclairage, The job will be changed from `entrepr. &'éclairage` to `entrepr. d'éclairage`

6. For index 559539, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f296.image.r=fabricant, The job will be changed from `fabricant de ch& les` to `fabricant de châles`

7. For index 712025, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f204.image.r=Douchin.zoom, The job will be changed from `serrur. en b&t.` to `serrur. en bât.`

8. For index 744976, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f431.image.r=Sangouard.zoom, The job will be changed from `fabric. de pouveautés ch&les` to `fabric. de nouveautés châles`

9. For index 859661, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f199. The job will be changed from `&vocat cour impériale` to `avocat cour impériale`

10. For index 876365, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f306.item.r=Farjon.zoom, The job will be changed from `com mis-greffier à la cour de ca&tion` to `commis-greffier à la cour de cassation`

11. For index 905120, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f485.image.r=Lejeune.zoom, The job will be changed from `fab. de tours de i&tes` to `fab. de tours de tétes`

12. For index 914337, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f541. The job will be changed from `&vocal cour impériale` to `avocat cour impériale`

13. For index 923912, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f601. The job will be changed from `professeur à la f&culté des lettres` to `professeur à la faculté des lettres`

14. For index 977541, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f419.item.r=Masson.zoom, The job will be changed from `montures de p&rapluies` to `montures de parapluies`

15. For index 1048975, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f226.item.r=Chareyron.zoom, The job will be changed from `&nc. sous-préfet` to `anc. sous-préfet`

16. For index 1049845, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f231. The job will be changed from `&voné de première inst.` to `avoué de première inst.`

17. For index 1056386, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f274.image.r=photographiques, The job will be changed from `fab. d'appareil& photographiques` to `fab. d'appareils photographiques`

18. For index 1065956, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f336.item.r=Lavendun.zoom, The job will be changed from `frangeuse de ch&les` to `frangeuse de châles`

19. For index 1069370, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f358.item.r=Gueret.zoom, The job will be changed from `&iphons et irrigateurs` to `siphons et irrigateurs`

20. For index 1073119, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f381. The job will be changed from `fab. d'instrum. de mathém&tiques` to `fab. d'instrum. de mathématiques`

21. For index 1080206, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f430. The job will be changed from `'&voué .re instance` to `avoué première instance`

22. For index 1091285, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f503.item.r=Passerat.zoom, The job will be changed from `fabr. chaussures et g&loches` to `fabr. chaussures et galoches`

23. For index 1230921, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6389871r/f408, The job will be changed from `agent th &tral` to `agent théâtral`

24. For index 1260046, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6391515w/f466.item.r=vizet.zoom, The job will be changed from `entrepreneurs des berlines ch&lonpaises` to `entrepreneurs des berlines châlonnaises`

25. For index 1341192, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f209.image.r=inarci, The job will be changed from `march. &c: fés` to `march. de cafés`

26. For index 1439726, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9668037f/f575.image.r=profess%20, The job will be changed from `profess. &'escrime` to `profess. d'escrime`

27. For index 1480449, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f204.image.r=etiquettes, The job will be changed from `fabr. &l'étiquettes` to `fabr. d'étiquettes`

28. For index 1555524, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f662.image.r=bonneterie, The job will be changed from `fabr. &articles de bonneterie` to `fabr. d'articles de bonneterie`

29. For index 1599701, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f330.image.r=appareils, The job will be changed from `fabr. &appareils gaz` to `fabr. d'appareils gaz`

30. For index 1621252, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f462.image.r=jeune, The job will be changed from `fabr. &acier poti` to `fabr. d'acier poli`

31. For index 1622073, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f467, The job will be changed from `fabr. &instruments pour les sciences` to `fabr. d'instruments pour les sciences`

32. For index 1626145, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f494.image.r=sculpteur, The job will be changed from `sculpteur &ornements` to `sculpteur d'ornements`

33. For index 1628519, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f509. The job will be changed from `adjoint au secr&ariat de l'ins- litut` to `adjoint au secrétariat de l'institut`

34. For index 1631473, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f527, The job will be changed from `caf& brasserie` to `café brasserie`

35. For index 1675651, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f318.image.r=grand%20duch, The job will be changed from `ancien consul du grandduch& de hesse` to `ancien consul du grand-duché de hesse`

36. For index 1687631, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f393.image.r=guipures, The job will be changed from `guipures &art` to `guipures d'art`

37. For index 1697055, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f453, The job will be changed from `fabr. &appareils à gaz` to `fabr. d'appareils à gaz`

38. For index 1709840, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f533.item.r=Paix.zoom, The job will be changed from `directeur de la caiss&générale des familles` to `directeur de la caisse générale des familles`

39. For index 1726575, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f637. The job will be changed from `&picier` to `épicier`

40. For index 1791998, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f482.image.r=animales, The job will be changed from `fabr. &huiles animales` to `fabr. d'huiles animales`

41. For index 1812180, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f601.item.r=Trionllier.zoom, The job will be changed from `orfevrerie et bronzes d&glise` to `orfévrerie et bronzes d'église`

42. For index 1857358, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f422.image.r=accessoires, The job will be changed from `fabr. &accessoires pour moulins` to `fabr. d'accessoires pour moulins`

43. For index 1905683, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f700.image.r=zoologique. The job will be changed from `secrétaire-trésorier de la soriété d'&tudes zoologiques` to `secrétaire-trésorier de la société d'études zoologiques`

44. For index 1961681, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f341.image.r=Gresillon.zoom, The job will be changed from `administrateur à la caisse d'&pargne` to `administrateur à la caisse d'epargne`

45. For index 1968550, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f381, The job will be changed from `fabr. &arbustes` to `fabr. d'arbustes`

46. For index 2051760, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677737t/f384, The job will be changed from `caf&-erémerie` to `café-erémerie`

47. For index 2063703, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677737t/f452.image.r=marchand, The job will be changed from `marchand &habits` to `marchand d'habits`

48. For index 2628634, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f694, The job will be changed from `fabr. &horlogerie et baromètres` to `fabr. d'horlogerie et baromètres`

49. For index 2635829, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f738, The job will be changed from `fabr. &'eau de seltz` to `fabr. d'eau de seltz`

50. For index 2704003, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692626p/f646, The job will be changed from `fabr. &horlogerie` to `fabr. d'horlogerie`

51. For index 2984929, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f159. The job ` obno us&` will be replaced with an empty string.

52. For index 2991897, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f202.image.r=piMuie. The job will be changed from `fabr. &orfèvrerte plaquée` to `fabr. d'orfèvrerie plaquée`

53. For index 2992032, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f203, The job will be changed from `do&teur-médecin` to `docteur-médecin`

54. For index 3011730, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f321.image.r=capitaine, The job will be changed from `capitaine &ėtal-major` to `capitaine d'état-major`

55. For index 3030091, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f433, The job will be changed from `caf&s` to `cafés`

56. For index 3056835, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f594, The job will be changed from `chef &'institution` to `chef d'institution`

57. For index 3383731, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f210, The job will be changed from `bijoutier-j&aillier` to `bijoutier-joaillier`

58. For index 3384280, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f213, The job will be changed from `fabr. &agrafes` to `fabr. d'agrafes`

59. For index 3409850, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f362.image.r=tabacs, The job will be changed from `preposé aux ventes &irectes à la manufacture des tabacs` to `preposé aux ventes directes à la manufacture des tabacs`

60. For index 3509225, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f464.image.r=etuis, The job will be changed from `fabr. &étuis aiguilles` to `fabr. d'étuis à aiguilles`

61. For index 3510741, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f473, The job will be changed from `&picier` to `épicier`

62. For index 3512492, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f484.item.r=Mussat.zoom, The job will be changed from `secrétaire général de la société d'&mulation pour les sciences pharmaceutiques` to `secrétaire général de la société d'émulation pour les sciences pharmaceutiques`

63. For index 3632332, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764402m/f925.item.r=Pyhuit.zoom, The job will be changed from `&biblioth. universelle` to `biblioth. universelle`

64. For index 3666637, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f192, The job will be changed from `fabr. spéciale de bronze et orf&vrerie d'église` to `fabr. spéciale de bronze et orfévrerie d'église`

65. For index 3711308, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f452, The job will be changed from `fabr. &'armes blanches` to `fabr. d'armes blanches`

66. For index 3753515, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f231, The job will be changed from `fabr. &instruments de précision en verre` to `fabr. d'instruments de précision en verre`

67. For index 3769325, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f322.item.r=Botcchardoo.zoom, The job will be changed from `fabr. &albâtre` to `fabr. d'albâtre`

68. For index 3786801, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f422, The job will be changed from `marchand &habits` to `marchand d'habits`

69. For index 3817728, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f598, The job will be changed from `secrétaire-trésorier de la société d'&tudes zoologiques` to `secrétaire-trésorier de la société d'études zoologiques`

70. For index 3864347, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f293, The job will be changed from `sel d'or de fordos et g&tis` to `sel d'or de fordos et gétis`

71. For index 3871754, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f335, The job will be changed from `&picier` to `épicier`

72. For index 3872928, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f342, The job will be changed from `juge de paix du &me arrondissement` to `juge de paix du arrondissement`

73. For index 4094098, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f448. The job will be changed from `ingénieur a: &m.e.c.p.` to `ingénieur e.c.p`

74. For index 4157074, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f330, The job will be changed from `transports &camionnages` to `transports camionnages`

75. For index 4181410, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f502, The job will be changed from `café-limonadier et tab&c` to `café-limonadier et tabac`

76. For index 4192849, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f590, The job will be changed from `fruits &légumes` to `fruits légumes`

77. For index 4202661, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f664, The job will be changed from `hôtel &vins` to `hôtel vins`

78. For index 4238112, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f920, The job will be changed from `vins &hôtel` to `vins hôtel`

79. For index 4249176, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1004.item.r=cooman.zoom, The job will be changed from `impr. typogr.&tithogr.` to `impr. typogr. lithogr.`

80. For index 4252440, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1028, The job will be changed from `charb.& vins` to `charb. vins`

81. For index 4266356, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1150, The job will be changed from `peinture &vitrerie` to `peinture vitrerie`

82. For index 4283340, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f398, The job will be changed from `peinture &vitrerie` to `peinture vitrerie`

83. For index 4286643, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f420, The job will be changed from `g& hôtel des acacias et restaurant` to `hôtel des acacias et restaurant`

84. For index 4295247, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f476, The job will be changed from `archite&te` to `architecte`

85. For index 4307411, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f567, The job will be changed from `charb. &vins` to `charb. vins`

86. For index 4314488, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f614, The job will be changed from `fruils &légumes` to `fruits légumes`

87. For index 4325158, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f692, The job will be changed from `guerrier&cie agentsdemanufactures` to `agents de manufactures`

88. For index 4364031, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f959, The job will be changed from `vins &hôtel` to `vins hôtel`

89. For index 4376587, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1047, The job will be changed from `impr. typogr.& lithogr.` to `impr. typogr. lithogr.`

90. For index 4380415, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1072, The job will be changed from `charb.&vins` to `charb. vins`

91. For index 4396563, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1201, The job will be changed from `peinture &vitrerie` to `peinture vitrerie`

In [None]:
raw_paris_jobs.loc[27614, "métier"] = "peintre en bâtiments"
raw_paris_jobs.loc[39503, "métier"] = "fab. d'encre d'imprimerie et lithographique"
raw_paris_jobs.loc[51731, "métier"] = "peint. bâtim."
raw_paris_jobs.loc[247996, "métier"] = "scul.-marbr."
raw_paris_jobs.loc[343521, "métier"] = "entrepr. d'éclairage"
raw_paris_jobs.loc[559539, "métier"] = "fabricant de châles"
raw_paris_jobs.loc[712025, "métier"] = "serrur. en bât."
raw_paris_jobs.loc[744976, "métier"] = "fabric. de nouveautés châles"
raw_paris_jobs.loc[859661, "métier"] = "avocat cour impériale"
raw_paris_jobs.loc[876365, "métier"] = "commis-greffier à la cour de cassation"
raw_paris_jobs.loc[905120, "métier"] = "fab. de tours de tétes"
raw_paris_jobs.loc[914337, "métier"] = "avocat cour impériale"
raw_paris_jobs.loc[923912, "métier"] = "professeur à la faculté des lettres"
raw_paris_jobs.loc[977541, "métier"] = "montures de parapluies"
raw_paris_jobs.loc[1048975, "métier"] = "anc. sous-préfet"
raw_paris_jobs.loc[1049845, "métier"] = "avoué de première inst."
raw_paris_jobs.loc[1056386, "métier"] = "fab. d'appareils photographiques"
raw_paris_jobs.loc[1065956, "métier"] = "frangeuse de châles"
raw_paris_jobs.loc[1069370, "métier"] = "siphons et irrigateurs"
raw_paris_jobs.loc[1073119, "métier"] = "fab. d'instrum. de mathématiques"
raw_paris_jobs.loc[1080206, "métier"] = "avoué première instance"
raw_paris_jobs.loc[1091285, "métier"] = "fabr. chaussures et galoches"
raw_paris_jobs.loc[1230921, "métier"] = "agent théâtral"
raw_paris_jobs.loc[1260046, "métier"] = "entrepreneurs des berlines châlonnaises"
raw_paris_jobs.loc[1341192, "métier"] = "march. de cafés"
raw_paris_jobs.loc[1439726, "métier"] = "profess. d'escrime"
raw_paris_jobs.loc[1480449, "métier"] = "fabr. d'étiquettes"
raw_paris_jobs.loc[1555524, "métier"] = "fabr. d'articles de bonneterie"
raw_paris_jobs.loc[1599701, "métier"] = "fabr. d'appareils gaz"
raw_paris_jobs.loc[1621252, "métier"] = "fabr. d'acier poli"
raw_paris_jobs.loc[1622073, "métier"] = "fabr. d'instruments pour les sciences"
raw_paris_jobs.loc[1626145, "métier"] = "sculpteur d'ornements"
raw_paris_jobs.loc[1628519, "métier"] = "adjoint au secrétariat de l'institut"
raw_paris_jobs.loc[1631473, "métier"] = "café brasserie"
raw_paris_jobs.loc[1675651, "métier"] = "ancien consul du grand-duché de hesse"
raw_paris_jobs.loc[1687631, "métier"] = "guipures d'art"
raw_paris_jobs.loc[1697055, "métier"] = "fabr. d'appareils à gaz"
raw_paris_jobs.loc[1709840, "métier"] = "directeur de la caisse générale des familles"
raw_paris_jobs.loc[1726575, "métier"] = "épicier"
raw_paris_jobs.loc[1791998, "métier"] = "fabr. d'huiles animales"
raw_paris_jobs.loc[1812180, "métier"] = "orfévrerie et bronzes d'église"
raw_paris_jobs.loc[1857358, "métier"] = "fabr. d'accessoires pour moulins"
raw_paris_jobs.loc[1905683, "métier"] = "secrétaire-trésorier de la société d'études zoologiques"
raw_paris_jobs.loc[1961681, "métier"] = "administrateur à la caisse d'epargne"
raw_paris_jobs.loc[1968550, "métier"] = "fabr. d'arbustes"
raw_paris_jobs.loc[2051760, "métier"] = "café-erémerie"
raw_paris_jobs.loc[2063703, "métier"] = "marchand d'habits"
raw_paris_jobs.loc[2628634, "métier"] = "fabr. d'horlogerie et baromètres"
raw_paris_jobs.loc[2635829, "métier"] = "fabr. d'eau de seltz"
raw_paris_jobs.loc[2704003, "métier"] = "fabr. d'horlogerie"
raw_paris_jobs.loc[2984929, "métier"] = ""
raw_paris_jobs.loc[2991897, "métier"] = "fabr. d'orfèvrerie plaquée"
raw_paris_jobs.loc[2992032, "métier"] = "docteur-médecin"
raw_paris_jobs.loc[3011730, "métier"] = "capitaine d'état-major"
raw_paris_jobs.loc[3030091, "métier"] = "cafés"
raw_paris_jobs.loc[3056835, "métier"] = "chef d'institution"
raw_paris_jobs.loc[3383731, "métier"] = "bijoutier-joaillier"
raw_paris_jobs.loc[3384280, "métier"] = "fabr. d'agrafes"
raw_paris_jobs.loc[3409850, "métier"] = "preposé aux ventes directes à la manufacture des tabacs"
raw_paris_jobs.loc[3509225, "métier"] = "fabr. d'étuis à aiguilles"
raw_paris_jobs.loc[3510741, "métier"] = "épicier"
raw_paris_jobs.loc[3512492, "métier"] = "secrétaire général de la société d'émulation pour les sciences pharmaceutiques"
raw_paris_jobs.loc[3632332, "métier"] = "biblioth. universelle"
raw_paris_jobs.loc[3666637, "métier"] = "fabr. spéciale de bronze et orfévrerie d'église"
raw_paris_jobs.loc[3711308, "métier"] = "fabr. d'armes blanches"
raw_paris_jobs.loc[3753515, "métier"] = "fabr. d'instruments de précision en verre"
raw_paris_jobs.loc[3769325, "métier"] = "fabr. d'albâtre"
raw_paris_jobs.loc[3786801, "métier"] = "marchand d'habits"
raw_paris_jobs.loc[3817728, "métier"] = "secrétaire-trésorier de la société d'études zoologiques"
raw_paris_jobs.loc[3864347, "métier"] = "sel d'or de fordos et gétis"
raw_paris_jobs.loc[3871754, "métier"] = "épicier"
raw_paris_jobs.loc[3872928, "métier"] = "juge de paix du arrondissement"
raw_paris_jobs.loc[4094098, "métier"] = "ingénieur e.c.p"
raw_paris_jobs.loc[4157074, "métier"] = "transports camionnages"
raw_paris_jobs.loc[4181410, "métier"] = "café-limonadier et tabac"
raw_paris_jobs.loc[4192849, "métier"] = "fruits légumes"
raw_paris_jobs.loc[4202661, "métier"] = "hôtel vins"
raw_paris_jobs.loc[4238112, "métier"] = "vins hôtel"
raw_paris_jobs.loc[4249176, "métier"] = "impr. typogr. lithogr."
raw_paris_jobs.loc[4252440, "métier"] = "charb. vins"
raw_paris_jobs.loc[4266356, "métier"] = "peinture vitrerie"
raw_paris_jobs.loc[4283340, "métier"] = "peinture vitrerie"
raw_paris_jobs.loc[4286643, "métier"] = "hôtel des acacias et restaurant"
raw_paris_jobs.loc[4295247, "métier"] = "architecte"
raw_paris_jobs.loc[4307411, "métier"] = "charb. vins"
raw_paris_jobs.loc[4314488, "métier"] = "fruits légumes"
raw_paris_jobs.loc[4325158, "métier"] = "agents de manufactures"
raw_paris_jobs.loc[4364031, "métier"] = "vins hôtel"
raw_paris_jobs.loc[4376587, "métier"] = "impr. typogr. lithogr."
raw_paris_jobs.loc[4380415, "métier"] = "charb. vins"
raw_paris_jobs.loc[4396563, "métier"] = "peinture vitrerie"
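
The long run of `.loc` assignments above could also be driven from a single mapping of index to corrected job, which makes it easier to check that every corrected row really contained the special character. The following is only a sketch: `apply_corrections` is a hypothetical helper, and a plain `dict` stands in for the `métier` column so that the example runs without pandas.

```python
def apply_corrections(jobs, corrections, marker="&"):
    """Overwrite jobs[index] for each correction and return the skipped indices."""
    skipped = []
    for index, corrected in corrections.items():
        original = jobs.get(index)
        # Only touch rows that exist and still contain the special character.
        if original is None or marker not in original:
            skipped.append(index)
            continue
        jobs[index] = corrected
    return skipped


# A few rows taken from the list above; the dict plays the role of the column.
jobs = {
    1726575: "&picier",
    4295247: "archite&te",
    4192849: "fruits &légumes",
}
corrections = {
    1726575: "épicier",
    4295247: "architecte",
    4192849: "fruits légumes",
    999: "never applied",  # hypothetical index that is absent from the data
}

skipped = apply_corrections(jobs, corrections)
# Rows that still contain the marker after the corrections.
remaining = [index for index, job in jobs.items() if "&" in job]
```

Any index reported in `skipped` points at a typo in the mapping or at a row that was already cleaned by an earlier step.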

### Dealing with `"`

- Get rows with `"`.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'"'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
32065,bpt6k6282019m,337,103,Kleinfelder et Hoffmann,"commiss. de rou."" lage",Chabrol,15.,1855,265,"commiss. de rou."" lage"
32293,bpt6k6282019m,338,192,Labbé de Montais fils,"propr.""",Choiseul,20.,1855,266,"propr."""
35691,bpt6k6282019m,359,212,Lofrançois-Elwart et A. Poret,"fab. de dentel""les",Vivienne,33 *,1855,287,"fab. de dentel""les"
55275,bpt6k6282019m,478,142,Talbotier,"directeur de ""l'Office général du contentieux",Faub. -St-Denis,23.,1855,406,"directeur de ""l'office général du contentieux"
56832,bpt6k6282019m,488,97,Tresvaux du Fayral,"""chanoine",Cloître-NotreDame,14.,1855,416,"""chanoine"
...,...,...,...,...,...,...,...,...,...,...
4401879,bpt6k9780089g,1577,223,Vercken (Fernandi,"ing""",r. Canbon,47.,1922,1238,"ing"""
4403866,bpt6k9780089g,1591,78,Vissaguet (L.),"""ébéniste",av. de La Motte-Picquet,16.,1922,1252,"""ébéniste"
4405330,bpt6k9780089g,1601,285,· Willmin,"ing""",pl. de Laborde,12bis.,1922,1262,"ing"""
4405436,bpt6k9780089g,1602,206,Witzig (A.),"ing""",av. du Président-Wilson,12.,1922,1263,"ing"""


- Get the rows with `"` at the end preceded by a space.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'\s"$'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
2804815,bpt6k9692809v,484,88,Jalillier jeu e,"chemi ie """,Palais-Royal,galepe Valo. 131.,1878,371,"chemi ie """
3694989,bpt6k97645375,480,29,Hory,"tabac et """,St-Jacques,161.,1873,357,"tabac et """
3725249,bpt6k97645375,658,203,Ruinet de Tailly,"sous-chef de gare au chemin de fer de Ly """,boul. Mazas,20.,1873,535,"sous-chef de gare au chemin de fer de ly """
4304340,bpt6k9780089g,877,213,Clément-Bayard (Etablissements),"fabr. d'automobiles"" Bayard """,quai Michelet,à Le- vallois (Seine). T. Wagr. 34. 41et 17. 99.,1922,538,"fabr. d'automobiles"" bayard """


These four rows will be corrected manually.

1. For index 2804815, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692809v/f371.item.r=chemi.zoom, The job will be changed from `chemi ie "` to `chemisier`

2. For index 3694989, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f357.item.r=Hory.zoom. The mailbox symbol was misinterpreted as `"`. Thus the job will be changed from `tabac et "` to `tabac et bureau de poste aux lettres` (cf. https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f8.item.zoom).

3. For index 3725249, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f535.item.r=tailly.zoom, The job will be changed from `sous-chef de gare au chemin de fer de ly "` to `sous-chef de gare au chemin de fer de lyon`

4. For index 4304340, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f538.item.r=Bayard.zoom, The job will be changed from `fabr. d'automobiles" bayard "` to `fabr. d'automobiles bayard`
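
The directory links used for verification follow a regular shape: the `doc_id` column, then `/f`, then the `gallica_page` column. A minimal sketch of a helper that rebuilds the base link for a row is given below. The name `gallica_url` is hypothetical, and the optional suffixes seen above (such as `.item.r=chemi.zoom`) are left out because they appear to carry only viewer and search state.

```python
GALLICA_BASE = "https://gallica.bnf.fr/ark:/12148"


def gallica_url(doc_id, gallica_page):
    """Build the link to a directory page from the doc_id and gallica_page columns."""
    # int() accepts both 371 and "371", since the CSV may be read as text.
    return f"{GALLICA_BASE}/{doc_id}/f{int(gallica_page)}"


# Row 2804815 from the table above: doc_id bpt6k9692809v, gallica_page 371.
url = gallica_url("bpt6k9692809v", 371)
```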

In [None]:
raw_paris_jobs.loc[2804815, "métier"] = "chemisier"
raw_paris_jobs.loc[3694989, "métier"] = "tabac et bureau de poste aux lettres"
raw_paris_jobs.loc[3725249, "métier"] = "sous-chef de gare au chemin de fer de lyon"
raw_paris_jobs.loc[4304340, "métier"] = "fabr. d'automobiles bayard"

Every `"` surrounded by spaces was preceded by a number and was a misreading of the superscript <sup>e</sup>. These have therefore already been removed.

- Get the rows with `"` at the start followed by space.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'^"\s'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
381698,bpt6k6314752k,461,159,Costil et Boissel,""" maçons",Bayard-ChampsElysées,26.,1856,271,""" maçons"
426434,bpt6k6314752k,723,79,Tubeuf aîné,""" nouveautés",St-Martin,308.,1856,533,""" nouveautés"
645312,bpt6k63197984,137,148,Berger *,""" substitut au 2e conseil do la guerre",Cherche-Midi,37.,1852,81,""" substitut au conseil do la guerre"
647513,bpt6k63197984,153,95,Bouchart jebne,""" Hins",St-Gervais,1,1852,97,""" hins"
901709,bpt6k63243905,597,136,Mercier (NC),""" parcheminier",Parcheminerio,18,1863,463,""" parcheminier"
1555890,bpt6k9669143t,809,61,Simon (vve),""" ""fabr. de lanternes de voitures et de marine",St-Sabin,60.,1882,664,""" ""fabr. de lanternes de voitures et de marine"
1557160,bpt6k9669143t,818,103,Sulzer et Bahuaud,""" chaussures",Vingt-NeufJuillet,7.,1882,673,""" chaussures"
1696226,bpt6k9672776c,567,19,Lefebvre (l'abbé),""" vieaire à St-Louis du St",01 Lefebvre-de-Ste-Marie,0.,1880,448,""" vieaire à st-louis du st"
1711253,bpt6k9672776c,661,175,Paul,""" tailleur",Toullier,4.,1880,542,""" tailleur"
1794008,bpt6k96727875,630,58,Murat NC),""" bijoutier en doublé d'or",GrandChantier,4. no :,1870,495,""" bijoutier en doublé d'or"


In all 30 matching rows, the `"` at the start can be removed, as it was most likely added by mistake during the OCR process.

In [None]:
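# Replace a `"` that stands alone (at the start, at the end, or between spaces) with a single space.
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)"(\s|$)', r' ', regex=True)
# Collapse the runs of whitespace left behind by the replacement.
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)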
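
To see what the two substitutions do to individual strings, here is a small sketch that uses the standard `re` module instead of the pandas string accessor. `drop_isolated_quotes` is a hypothetical name, and the final `strip()` is an addition that the cell above does not perform: without it, a `"` removed from the start of a string leaves a leading space behind.

```python
import re

# A double quote that stands alone: at the start, at the end, or between spaces.
ISOLATED_QUOTE = re.compile(r'(^|\s)"(\s|$)')
# Two or more consecutive whitespace characters.
MULTI_SPACE = re.compile(r"\s\s+")


def drop_isolated_quotes(job):
    """Remove stand-alone double quotes and normalise the surrounding spaces."""
    job = ISOLATED_QUOTE.sub(" ", job)
    job = MULTI_SPACE.sub(" ", job)
    # Extra step (not in the notebook cell): trim the space left at either end.
    return job.strip()


cleaned = [drop_isolated_quotes(s) for s in ['" tailleur', 'a " b', 'fab. de dentel"les', 'ing"']]
```

A `"` attached to a word, as in `fab. de dentel"les` or `ing"`, is left untouched by this step.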

In [None]:
# Get the rows where `"` sits inside a word, i.e. directly between two lowercase letters.
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r'[a-z]"[a-z]'))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
35691,bpt6k6282019m,359,212,Lofrançois-Elwart et A. Poret,"fab. de dentel""les",Vivienne,33 *,1855,287,"fab. de dentel""les"
107645,bpt6k62906378,601,35,Billot,"limon""dier",St-Honoré,306.,1846,313,"limon""dier"
253923,bpt6k6305463c,344,8,Dommergue Mlle,"ext""rnat",St-Louis-Mar.,33.,1857,231,"ext""rnat"
378138,bpt6k6314752k,440,129,Castrique,"directeur de l'entreprise de net""toyage",Enghien,39,1856,250,"directeur de l'entreprise de net""toyage"
648252,bpt6k63197984,158,165,Boussardière (chevalier de la) *,"chef de ba""taillon en retraite",Faub.-St-Honoré,222.,1852,102,"chef de ba""taillon en retraite"
679957,bpt6k63197984,375,55,Morel-Cornet,"représentant du peuple (Som""me)",Godot,41.,1852,319,"représentant du peuple som""me"
704734,bpt6k6319811j,232,33,Charrière et Raffiné,"coif""eurs",Temple,219.,1854,153,"coif""eurs"
751694,bpt6k63243601,251,100,Arnault Robert,"éditeurs d'ouvrages de librai""rie",Vivienne,36.,1839,128,"éditeurs d'ouvrages de librai""rie"
765174,bpt6k63243601,350,167,Fesser,"crista""x",Paix,19.,1839,227,"crista""x"
807510,bpt6k6324389h,269,29,Doljé,"co""royeur",Geoffroy-Langevin,9.,1859,197,"co""royeur"
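
A `"` inside a word can be described as one lowercase letter, the quote, and another lowercase letter. The sketch below shows that pattern on a few strings with the standard `re` module. The names are hypothetical, and the second, broader pattern is an assumption about what might be wanted for accented letters, not something the notebook uses.

```python
import re

# One ASCII lowercase letter on each side of the quote.
INNER_QUOTE = re.compile(r'[a-z]"[a-z]')
# Broader variant (assumption): any Unicode letter on each side, so that
# accented characters such as "é" are also treated as part of a word.
INNER_QUOTE_UNICODE = re.compile(r'[^\W\d_]"[^\W\d_]')

samples = ['limon"dier', 'som"me', 'ing"', '"chanoine', 'propr."', 'é"cole']

flagged = [s for s in samples if INNER_QUOTE.search(s)]
flagged_unicode = [s for s in samples if INNER_QUOTE_UNICODE.search(s)]
```

With the ASCII pattern, a string such as `é"cole` is not flagged, because `é` falls outside `a-z`; whether such rows exist in the data would need to be checked separately.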


1. For index 35691. The job will be changed from `fab. de dentel"les` to `fab. de dentelles`

2. For index 107645, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62906378/f313.item.r=billot.zoom, The job will be changed from `limon"dier` to `limonadier`

3. For index 253923, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f231.image.r=Dommergue.zoom, The job will be changed from `ext"rnat` to `externat`

4. For index 378138. The job will be changed from `directeur de l'entreprise de net"toyage` to `directeur de l'entreprise de nettoyage`

5. For index 648252. The job will be changed from `chef de ba"taillon en retraite` to `chef de bataillon en retraite`

6. For index 679957, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63197984/f319.item.r=representant, The job will be changed from `représentant du peuple som"me` to `représentant du peuple somme`

7. For index 704734. The job will be changed from `coif"eurs` to `coiffeurs`

8. For index 751694. The job will be changed from `éditeurs d'ouvrages de librai"rie` to `éditeurs d'ouvrages de librairie`

9. For index 765174, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243601/f227.item.r=Fesser.zoom, The job will be changed from `crista"x` to `cristaux`

10. For index 807510, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f197.image.r=Geoffroy.zoom, The job will be changed from `co"royeur` to `corroyeur`

11. For index 1003094, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6331310g/f279.image.r=Croisset.zoom, The job will be changed from `laye"er` to `layetier`

12. For index 1075163, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f396.item.r=abit.zoom, The job will be changed from `tissus nouse"u'és` to `tissus nouveautés`

13. For index 1233892. The job will be changed from `co"donnier` to `cordonnier`

14. For index 1346062. The job will be changed from `restaurat"ur` to `restaurateur`

15. For index 1563683. The job will be changed from `emballeur et articles de voya"ge` to `emballeur et articles de voyage`

16. For index 1565530. The job will be changed from `officier d'administration aux sub"sistances militaires` to `officier d'administration aux subsistances militaires`

17. For index 1649462, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f153.image.r=coutellerie. The job will be changed from `coutellerie et gar nit. de néces"saires` to `coutellerie et garnit. de nécessaires`

18. For index 1651058. The job will be changed from `commissionnaires en mar"chandises` to `commissionnaires en marchandises`

19. For index 1727069. The job will be changed from `représentant de fabrique de cou"leur 'aniline` to `représentant de fabrique de couleur d'aniline`

20. For index 1768454. The job will be changed from `supérieur général de la so"ciété des ecoles chrétiennes de st-antoine` to `supérieur général de la société des ecoles chrétiennes de st-antoine`

21. For index 1792649, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f487.image.r=Montalembert, The job will be changed from `st"guillaume` to `ctesse`

22. For index 1813970, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f612.image.r=Verny, The job will be changed from `m"e. teinturier-dégraisseur` to `teinturier-dégraisseur`

23. For index 2359454. The job will be changed from `cou"erture et plomberie` to `couverture et plomberie`

24. For index 2361210, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f456.item.r=auriferes.zoom, The job will be changed from `traitement des mu"tières aurifères et argentiſeres` to `traitement des matières aurifères et argentiferes`

25. For index 2385367. The job will be changed from `eau minérale sulfureuse na"turelle` to `eau minérale sulfureuse naturelle`

26. For index 2666546. The job will be changed from `vins en"gros` to `vins en gros`

27. For index 2845040. The job will be changed from `fabr. de couronnes mortuai"res` to `fabr. de couronnes mortuaires`

28. For index 2971670. The job will be changed from `vins"en gros; entrepôt de bercy` to `vins en gros; entrepôt de bercy`

29. For index 3012174. The job will be changed from `entrepôt de bières de stras"bourg` to `entrepôt de bières de strasbourg`

30. For index 3036589, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f474.image.r=sebastopol, The job will be changed from `fabr. de passemen"terie` to `fabr. de passementerie`

31. For index 3185857. The job will be changed from `manufacture ďhor"logerie` to `manufacture d'horlogerie`

32. For index 3257114. The job will be changed from `fabr. de bérets et casquet"tes` to `fabr. de bérets et casquettes`

33. For index 3466045, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f214.image.r=jeune, The job will be changed from `vins-traiteur et hôtel meu"bte` to `vins-traiteur et hôtel meublé`

34. For index 3483032. The job will be changed from `négts-commissionnai"res` to `négts-commissionnaires`

35. For index 3512271, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f482.item.r=blocq.zoom, The job will be changed from `draperie et noil"veautés en gros` to `draperie et nouveautés en gros`

36. For index 3573977. The job will be changed from `directrice de l'école libre de st"vincent-de-paul` to `directrice de l'école libre de st-vincent-de-paul`

37. For index 3725940. The job will be changed from `chaudières et machines à va"peur d'occasion` to `chaudières et machines à vapeur d'occasion`

38. For index 3744226. The job will be changed from `nou"eautés pour enfants` to `nouveautés pour enfants`

39. For index 3860681, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f272.item.r=fex.zoom, The job will be changed from `fab. d'ullumeltes à montreuil"sous-bois` to `fab. d'allumettes à montreuil sous-bois`

40. For index 3885446. The job will be changed from `maison spéciale de gardes-ma"lades` to `maison spéciale de gardes-malades`

41. For index 4129578. The job will be changed from `directeur de la cave cen"trale des hôpitaux` to `directeur de la cave centrale des hôpitaux`

42. For index 4267934. The job will be changed from `peinture et vitre"rie` to `peinture et vitrerie`

43. For index 4335517, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f761.item.r=fabriques.zoom. The job will be changed from `confect. p"dames` to `confect. p. dames`

In [None]:
raw_paris_jobs.loc[35691, "métier"] = "fab. de dentelles"
raw_paris_jobs.loc[107645, "métier"] = "limonadier"
raw_paris_jobs.loc[253923, "métier"] = "externat"
raw_paris_jobs.loc[378138, "métier"] = "directeur de l'entreprise de nettoyage"
raw_paris_jobs.loc[648252, "métier"] = "chef de bataillon en retraite"
raw_paris_jobs.loc[679957, "métier"] = "représentant du peuple somme"
raw_paris_jobs.loc[704734, "métier"] = "coiffeurs"
raw_paris_jobs.loc[751694, "métier"] = "éditeurs d'ouvrages de librairie"
raw_paris_jobs.loc[765174, "métier"] = "cristaux"
raw_paris_jobs.loc[807510, "métier"] = "corroyeur"
raw_paris_jobs.loc[1003094, "métier"] = "layetier"
raw_paris_jobs.loc[1075163, "métier"] = "tissus nouveautés"
raw_paris_jobs.loc[1233892, "métier"] = "cordonnier"
raw_paris_jobs.loc[1346062, "métier"] = "restaurateur"
raw_paris_jobs.loc[1563683, "métier"] = "emballeur et articles de voyage"
raw_paris_jobs.loc[1565530, "métier"] = "officier d'administration aux subsistances militaires"
raw_paris_jobs.loc[1649462, "métier"] = "coutellerie et garnit. de nécessaires"
raw_paris_jobs.loc[1651058, "métier"] = "commissionnaires en marchandises"
raw_paris_jobs.loc[1727069, "métier"] = "représentant de fabrique de couleur d'aniline"
raw_paris_jobs.loc[1768454, "métier"] = "supérieur général de la société des ecoles chrétiennes de st-antoine"
raw_paris_jobs.loc[1792649, "métier"] = "ctesse"
raw_paris_jobs.loc[1813970, "métier"] = "teinturier-dégraisseur"
raw_paris_jobs.loc[2359454, "métier"] = "couverture et plomberie"
raw_paris_jobs.loc[2361210, "métier"] = "traitement des matières aurifères et argentiferes"
raw_paris_jobs.loc[2385367, "métier"] = "eau minérale sulfureuse naturelle"
raw_paris_jobs.loc[2666546, "métier"] = "vins en gros"
raw_paris_jobs.loc[2845040, "métier"] = "fabr. de couronnes mortuaires"
raw_paris_jobs.loc[2971670, "métier"] = "vins en gros; entrepôt de bercy"
raw_paris_jobs.loc[3012174, "métier"] = "entrepôt de bières de strasbourg"
raw_paris_jobs.loc[3036589, "métier"] = "fabr. de passementerie"
raw_paris_jobs.loc[3185857, "métier"] = "manufacture d'horlogerie"
raw_paris_jobs.loc[3257114, "métier"] = "fabr. de bérets et casquettes"
raw_paris_jobs.loc[3466045, "métier"] = "vins-traiteur et hôtel meublé"
raw_paris_jobs.loc[3483032, "métier"] = "négts-commissionnaires"
raw_paris_jobs.loc[3512271, "métier"] = "draperie et nouveautés en gros"
raw_paris_jobs.loc[3573977, "métier"] = "directrice de l'école libre de st-vincent-de-paul"
raw_paris_jobs.loc[3725940, "métier"] = "chaudières et machines à vapeur d'occasion"
raw_paris_jobs.loc[3744226, "métier"] = "nouveautés pour enfants"
raw_paris_jobs.loc[3860681, "métier"] = "fab. d'allumettes à montreuil sous-bois"
raw_paris_jobs.loc[3885446, "métier"] = "maison spéciale de gardes-malades"
raw_paris_jobs.loc[4129578, "métier"] = "directeur de la cave centrale des hôpitaux"
raw_paris_jobs.loc[4267934, "métier"] = "peinture et vitrerie"
raw_paris_jobs.loc[4335517, "métier"] = "confect. p. dames"

Nearly 200 entries in the `métier` column contain `ing"` or `ingen"`, which stand for `ingénieur`. These are replaced; the remaining entries containing `"` are left uncorrected, since there are too many to fix manually, and will be dealt with during tag generation.

In [None]:
pattern = '|'.join(['ingen"', 'ing"', 'ingén"'])

# regex=True is required for the `|` alternation to be interpreted as a pattern
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(pattern, 'ingénieur', regex=True)
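As a minimal sketch of what this replacement does, the same alternation pattern can be tried on a toy Series. The sample strings below are invented for illustration and are not taken from the data set.

```python
import pandas as pd

# Toy data (hypothetical); the notebook itself works on raw_paris_jobs["métier"].
jobs = pd.Series(['ing" civil', 'ingen"', 'ingén" des mines', 'rentier'])

# Join the three abbreviations into one alternation pattern.
pattern = '|'.join(['ingen"', 'ing"', 'ingén"'])

# The alternation is only honoured when the pattern is treated as a regex.
cleaned = jobs.str.replace(pattern, 'ingénieur', regex=True)

print(cleaned.tolist())
```

Entries without any of the three abbreviations, such as `rentier`, pass through unchanged.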

### Dealing with `*`

- Get the rows containing `*`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\*"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
2158,bpt6k6282019m,157,56,Baradère,C. *.ancien conseiller d'État,Université,35.,1855,85,c. *.ancien conseiller d'état
9687,bpt6k6282019m,202,24,Carbonnel,G.O. *. général de brigade en retraite,Anjou-St-Honoré,42.,1855,130,g.o. *. général de brigade en retraite
13327,bpt6k6282019m,224,53,Corbel (P.),rentier *,Lancry,16.,1855,152,rentier *
17087,bpt6k6282019m,247,106,Depaquit,* corrtier-gourmet,boulev. Beaumarchais,84.,1855,175,* corrtier-gourmet
17769,bpt6k6282019m,251,143,Despolaines (Mme),reprises dans les den* telles et cachemires,Bourbon-Villeneuve,19.,1855,179,reprises dans les den* telles et cachemires
...,...,...,...,...,...,...,...,...,...,...
4287559,bpt6k9780089g,765,183,Binoche (Laus) ( I),avoodt * la Cour d'appel,r. Marbeuf,37.,1922,426,avoodt * la cour d'appel
4299822,bpt6k9780089g,846,94,Central Agence Cinéma,appareils p* cinémas,Faub. SL-Denis,77.,1922,507,appareils p* cinémas
4359295,bpt6k9780089g,1267,174,Mancini (A.),fo*mes p. chapeaux,r. Ste. Apolline,9.,1922,928,fo*mes p. chapeaux
4387067,bpt6k9780089g,1456,310,Rougier (Henry),* automobiles TurcatMéry,av. des Champs-Elysées,122.,1922,1117,* automobiles turcatméry


There are 775 rows with `*`.

First, double stars are replaced by a single star. Then `*` is removed when it is surrounded by spaces, at the start followed by a space, or at the end preceded by a space.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\*\*', r'*', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)\*(\s|$)', r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
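The effect of these three replacements can be traced on a few toy strings. This is only a sketch: the sample values are illustrative, and the final `str.strip()` is an assumption added here to show that a removed leading or trailing `*` leaves a single space behind; the notebook may handle that whitespace elsewhere.

```python
import pandas as pd

# Toy data (illustrative only).
jobs = pd.Series([
    'rentier *',            # star at the end, preceded by a space
    '* corrtier-gourmet',   # star at the start, followed by a space
    'avocat ** à la cour',  # double star surrounded by spaces
    'fo*mes p. chapeaux',   # star inside a word: left for manual correction
])

step1 = jobs.str.replace(r'\*\*', '*', regex=True)               # ** -> *
step2 = step1.str.replace(r'(^|\s)\*(\s|$)', ' ', regex=True)    # isolated * -> space
step3 = step2.str.replace(r'\s\s+', ' ', regex=True)             # collapse whitespace runs

# Assumption: trim the single space left at the start or end of a string.
stripped = step3.str.strip()

print(step3.tolist())
print(stripped.tolist())
```

A star embedded in a word is untouched by these rules, which is why those rows are handled separately.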

Spelling corrections

- Get the rows that have * in a word.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"[(a-z)+]\*[(a-z)+]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
521028,bpt6k6315985z,430,197,Quillé et Bernier,commissionn. en denrées co*loniales,Braque,3.,1850,348,commissionn. en denrées co*loniales
1532840,bpt6k9669143t,667,206,Marckert (Eugène),fabr. de chapelets et a*ticles de religion,boul. Sébastopol,60.,1882,522,fabr. de chapelets et a*ticles de religion
1861829,bpt6k96762564,631,221,Giriak et Ronto,tailleurs pour dames ei hom*mes,Courcelles,30.,1886,446,tailleurs pour dames ei hom*mes
1945546,bpt6k9677392n,358,205,Deguingand,marbrier pour bronzes et pen*dules,Commines,18.,1877,248,marbrier pour bronzes et pen*dules
2936541,bpt6k9732740w,1011,165,Revues diverses. (Voyez Journaux,aux Pro*fessions.),La Fontaine,41 bis.,1894,772,aux pro*fessions.)
3030957,bpt6k9762929c,567,146,Lavalard,Edmond)*directeur de la cavalerie et des foura...,Lavallard (Alfred-L.) teinturier-dégraissen Lo...,20.,1879,438,edmond)*directeur de la cavalerie et des foura...
3449628,bpt6k9763553z,701,165,Walter (F.) (A. Bonnardot success.),guêtrier*culollier,boul. Haussmann,49.,1876,596,guêtrier*culollier
3520044,bpt6k9763554c,652,186,Raingo (Jules) NCH,fabr. de bronzes et pen*dules,Vieille-du-Temple,102. 21.,1875,529,fabr. de bronzes et pen*dules
3622737,bpt6k9764402m,1197,96,Noetinger,contrôleur des contributions direc*tes,r. Gay-Lussac,64.,1900,868,contrôleur des contributions direc*tes
3762751,bpt6k9764647w,405,179,De Planard *,secrétaire de la section des finan*ces au cons...,Mont-Thabor,10.,1881,284,secrétaire de la section des finan*ces au cons...


There are 13 rows with a * inside the word. These shall be replaced manually.
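A note on the pattern used in the cell above: inside square brackets, `(`, `)` and `+` are literal characters, so `[(a-z)+]` matches a lowercase letter or one of those three symbols. That is why an entry such as `edmond)*directeur` is picked up. The sketch below compares it with a hypothetical stricter variant, `[a-z]\*[a-z]`, which is not used in this notebook.

```python
import re

loose = re.compile(r"[(a-z)+]\*[(a-z)+]")  # pattern used in the notebook
strict = re.compile(r"[a-z]\*[a-z]")       # hypothetical letters-only variant

samples = ["fo*mes p. chapeaux", "edmond)*directeur", "rentier *"]

# For each sample: (text, matched by loose pattern, matched by strict pattern)
results = [(s, bool(loose.search(s)), bool(strict.search(s))) for s in samples]

for row in results:
    print(row)
```

With the looser class, rows where the star touches a parenthesis are also listed, which is useful here because they need manual correction as well.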


1. For index 521028, the image from the directory is . The job was present in two lines. Thus the job will be changed from `commissionn. en denrées co*loniales` to `commissionn. en denrées coloniales`.

2. For index 1532840, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f522.image.r=Marckert.zoom. The `r` was misinterpreted as `*`. Thus the job will be changed from `fabr. de chapelets et a*ticles de religion` to `fabr. de chapelets et articles de religion`.

3. For index 1861829, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f446.image.r=Giriak.zoom. The job was present in two lines. Thus the job will be changed from `tailleurs pour dames ei hom*mes` to `tailleurs pour dames et hommes`.

4. For index 1945546, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f248.image.r=Deguingand.zoom. The job was present in two lines. Thus the job will be changed from `marbrier pour bronzes et pen*dules` to `marbrier pour bronzes et pendules`.

5. For index 2936541, the OCR has misinterpreted multiple lines. The name and the address do not match. However, the job will be changed from `aux pro*fessions.)` to `aux professions`.

6. For index 3030957, the image from the directory is . The name and the job were combined. Thus the job will be changed from `edmond)*directeur de la cavalerie et des fourages de la cie générale des omni` to `directeur de la cavalerie et des fourages de la cie générale des omnibus`.

7. For index 3449628, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f596.image.r=Walter. The job was present in two lines. Thus the job will be changed from `guêtrier*culollier` to `guêtrier culottier`.

8. For index 3520044, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f529.image.r=Raingo.zoom. The job was present in two lines. Thus the job will be changed from `fabr. de bronzes et pen*dules` to `fabr. de bronzes et pendules`.

9. For index 3622737, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764402m/f868.image.r=Noetinger.zoom. The job was present in two lines. Thus the job will be changed from `contrôleur des contributions direc*tes` to `contrôleur des contributions directes`.

10. For index 3762751, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f284.image.r=planard.zoom. The job was present in two lines. Thus the job will be changed from `secrétaire de la section des finan*ces au conseil d'etat` to `secrétaire de la section des finances au conseil d'etat`.

11. For index 4228861, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f851.image.r=coutur%20.zoom. The job will be changed from `vel (m*e.)` to `coutur.`.

12. For index 4359295, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f928.item.r=Mancini.zoom. The job will be changed from `fo*mes p. chapeaux` to `formes p. chapeaux`.

13. For index 4404621, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1257.item.r=Warlier.zoom. The job was present in two lines. Thus the job will be changed from `installation générale d'électri*cité` to `installation générale d'électricité`.

In [None]:
raw_paris_jobs.loc[521028, "métier"] = "commissionn. en denrées coloniales"
raw_paris_jobs.loc[1532840, "métier"] = "fabr. de chapelets et articles de religion"
raw_paris_jobs.loc[1861829, "métier"] = "tailleurs pour dames et hommes"
raw_paris_jobs.loc[1945546, "métier"] = "marbrier pour bronzes et pendules"
raw_paris_jobs.loc[2936541, "métier"] = "aux professions"
raw_paris_jobs.loc[3030957, "métier"] = "directeur de la cavalerie et des fourages de la cie générale des omnibus"
raw_paris_jobs.loc[3449628, "métier"] = "guêtrier culottier"
raw_paris_jobs.loc[3520044, "métier"] = "fabr. de bronzes et pendules"
raw_paris_jobs.loc[3622737, "métier"] = "contrôleur des contributions directes"
raw_paris_jobs.loc[3762751, "métier"] = "secrétaire de la section des finances au conseil d'etat"
raw_paris_jobs.loc[4228861, "métier"] = "coutur."
raw_paris_jobs.loc[4359295, "métier"] = "formes p. chapeaux"
raw_paris_jobs.loc[4404621, "métier"] = "installation générale d'électricité"
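Manual corrections like the ones above can also be kept in a dictionary keyed by row index and applied in a loop. The following is a minimal sketch on a toy frame that reuses three indices from the list; the `corrections` dictionary and the missing-index check are illustrative additions, not part of the original notebook. The check is there because assigning through `.loc` with a label that does not exist adds a new row instead of raising an error.

```python
import pandas as pd

# Toy frame reusing three of the indices listed above.
df = pd.DataFrame(
    {"métier": [
        "commissionn. en denrées co*loniales",
        "fo*mes p. chapeaux",
        "guêtrier*culollier",
    ]},
    index=[521028, 4359295, 3449628],
)

# Hypothetical container for the manual corrections: index -> corrected job.
corrections = {
    521028: "commissionn. en denrées coloniales",
    4359295: "formes p. chapeaux",
    3449628: "guêtrier culottier",
}

# Guard against typos in the index: .loc would otherwise append a new row.
missing = [idx for idx in corrections if idx not in df.index]
if missing:
    raise KeyError(f"indices not found in the data: {missing}")

for idx, value in corrections.items():
    df.loc[idx, "métier"] = value

print(df)
```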

Similar to ¥ and #, `*` is also an OCR misreading of the award symbol. As shown in the image above, the awards follow a pattern with `G.`, `C.`, `O.`, etc. We remove `*` in those cases.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"c\.\s\*", r"c. ", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"c\.\*", r"c.", regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"o\.\*", r"o.", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"o\.\s\*", r"o. ", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"o\*", r"o", regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\*\.", r"", regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"g\.\s\*", r"g. ", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r':\*', r':', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
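Each replacement above runs on the output of the previous one, so the rules act as an ordered list. A minimal sketch of the same rules, written as `(pattern, replacement)` pairs and applied to invented sample strings, is given below; the helper `apply_rules` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

def apply_rules(series, rules):
    """Apply (pattern, replacement) regex pairs to a Series, in order."""
    for pattern, repl in rules:
        series = series.str.replace(pattern, repl, regex=True)
    return series

# Same patterns and order as in the cell above.
award_rules = [
    (r"c\.\s\*", "c. "),
    (r"c\.\*", "c."),
    (r"o\.\*", "o."),
    (r"o\.\s\*", "o. "),
    (r"o\*", "o"),
    (r"\*\.", ""),
    (r"g\.\s\*", "g. "),
    (r":\*", ":"),
    (r"\s\s+", " "),
]

# Toy data (illustrative only).
jobs = pd.Series([
    "c.* ancien conseiller d'état",
    "g. * général de brigade",
    "o* avocat",
])

cleaned = apply_rules(jobs, award_rules)
print(cleaned.tolist())
```

Keeping the rules in one list makes the order explicit, which matters because an earlier rule can consume a `*` that a later rule would otherwise have matched.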

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\*"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
17769,bpt6k6282019m,251,143,Despolaines (Mme),reprises dans les den* telles et cachemires,Bourbon-Villeneuve,19.,1855,179,reprises dans les den* telles et cachemires
38168,bpt6k6282019m,374,38,Lioret aîné et jeune,entrepreneurs de trans* ports par eau,quai d'Austerlitz,61.,1855,302,entrepreneurs de trans* ports par eau
75126,bpt6k6286466w,485,11,Dutens aîné,. *(de l'Institut),Grammont,· 3.,1842,296,*de l'institut
78135,bpt6k6286466w,505,126,Gu: (Dubochet,Pauwels* et Cie),Lafayette,3.,1842,316,pauwels* et cie)
289264,bpt6k6305463c,556,139,Roplet (Alexis),fab. de taben pour ga*,Guier rin-Boisseau,31.,1857,443,fab. de taben pour ga*
349330,bpt6k6309075f,525,96,Narp (Marquise,Vve de)*,Vaugirard,93.,1861,429,vve de)*
391258,bpt6k6314752k,517,216,Forestier de Périgny,*receveur-percepteur,boul. St-Denis,24.,1856,327,*receveur-percepteur
572521,bpt6k6318531z,482,188,Mauriès,*chaudronpier,Boucheries-des-Invalides,31.,1858,374,*chaudronpier
748392,bpt6k6319811j,534,17,Trémolinaire,menuisier en bâtimer *s,Condé,29.,1854,455,menuisier en bâtimer *s
766802,bpt6k63243601,362,87,Gauthier de Charnacé (Bon),*conseiller à la cour royale,Nve-St-Paul,9.,1839,239,*conseiller à la cour royale


1. For index 17769. The job will be changed from `reprises dans les den* telles et cachemires` to `reprises dans les dentelles et cachemires`

2. For index 38168. The job will be changed from `entrepreneurs de trans* ports par eau` to `entrepreneurs de transports par eau`

3. For index 75126, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6286466w/f296.image.r=Dutensaine.zoom, The job will be changed from `. *de l'institut` to `de l'institut`

4. For index 78135, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6286466w/f286.item.r=Dubochet.zoom, The job will be changed from `pauwels* et cie)` to `entrep. et gérants d'éclairge au gaz`

5. For index 289264, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f443.item.r=Boisseau.zoom, The job will be changed from `fab. de taben pour ga*` to `fab. de tubes pour gaz`

6. For index 349330, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f429.image.r=Narp.zoom. There is no job so the job will be changed from `vve de)*` to `marquise vve de`

7. For index 391258, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f327.image.r=perigny.zoom, The job will be changed from `*receveur-percepteur` to `receveur-percepteur`

8. For index 572521, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f374.image.r=Mauries, The job will be changed from `*chaudronpier` to `chaudronnier`

9. For index 748392, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f455.image.r=Tremolinaire, The job will be changed from `menuisier en bâtimer *s` to `menuisier en bâtiments`

10. For index 766802. The job will be changed from `*conseiller à la cour royale` to `conseiller à la cour royale`

11. For index 793431. The job will be changed from `*chef de bureau aux finances` to `chef de bureau aux finances`

12. For index 831599, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f354.item.r=ird.zoom, The job will be changed from `passementerie et rubans de velour*` to `passementerie et rubans de velours`

13. For index 851985, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f151.item.r=Baralte.zoom, The job will be changed from `*oleries et lamages` to `soieries et lainages`

14. For index 929102, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f109.image.r=Berard.zoom. The name is present in the job column and job is in the rue column. The job will be changed from `e. levainville* et cie` to `ingenieurs`

15. For index 964592, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f337.item.r=Rivoli.zoom, The job will be changed from `*orfévre-fab.` to `orfévre-fab.`

16. For index 1024652, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6331310g/f415.item.r=astorg.zoom. The job will be changed from `duc de poix)*` to `juste duc de poix`

17. For index 1089329, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f491.item.r=Narp.zoom, The job will be changed from `vvede)*` to `marquise vve de`

18. For index 1113487, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c, The job will be changed from `e. levainville* et cie` to `ingenieurs`

19. For index 1162233. The job will be changed from `sage-femm*` to `sage-femme`

20. For index 1349865, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f267.image.r=Laclaudure, The job will be changed from `représentant du peuple haute* vienne` to `représentant du peuple haute vienne`

21. For index 1356276, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f310.image.r=Marcuard, The job will be changed from `de la maison adolphe mar* cuard et cie` to `de la maison adolphe marcuard et cie`

22. For index 1357960, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f321.image.r=chatelin, The job will be changed from `bijoutiers-joailliers* fabricants` to `bijoutiers-joailliers-fabricants`

23. For index 1419017, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9668037f/f447.image.r=Sionnest, The job will be changed from `de la maison chaligny *nce et guyot-sionnest` to `de la maison chaligny et guyot-sionnest`

24. For index 1547544, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9669143t/f613.image.r=Quenay, The job will be changed from `de la maison binder aîné*` to `de la maison binder aîné`

25. For index 1588262, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672117f/f263.image.r=De%20Metz, The job will be changed from `* architecte-expert` to `architecte-expert`

26. For index 1688272, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f398.image.r=goupillon, The job will be changed from `fabr. de goupillon-tubes coulis* saud` to `fabr. de goupillon-tubes coulissaud`

27. For index 1688778. The job will be changed from `a. j.* graveur sur doier` to `graveur sur doier`

28. For index 1751036. The job will be changed from `ancien avoué à la cour impé* riale` to `ancien avoué à la cour impériale`

29. For index 1800759, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f535.item.r=Rougeraont.zoom, The job will be changed from `de la maison jeanti *nco et evost` to `de la maison jeanti et prevost`

30. For index 1965794, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f365.image.r=IfoudaiItc, The job will be changed from `de la maison a. landier et hou* daille` to `de la maison a. landier et houdaille`

31. For index 1980210. The job will be changed from `ancien contróleur des contribu* tions directes` to `ancien contrôleur des contributions directes`

32. For index 1994314. The job will be changed from `*receveur particulier des finances` to `receveur particulier des finances`

33. For index 2066044. The job will be changed from `de la maison c. savart *nch et cie` to `de la maison c. savart et cie` (nch is a symbol)

34. For index 2113598. The job will be changed from `.*procureur général près la cour des comptes` to `procureur général près la cour des comptes`

35. For index 2202649, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96839542/f712.image.r=Schuhl, The job will be changed from `de la maison lantz* frères` to `de la maison lantz frères`

36. For index 2345534, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f358.item.r=portefoin.zoom, The job will be changed from `fabr. de garde-robe. *ro binets` to `fabr. de garde-robes et robinets`

37. For index 2546082, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f199.item.r=paradis.zoom, The job will be changed from `de la maison l. nicolas *c. do` to `de la maison l. nicolas`

38. For index 2587361. The job will be changed from `*propriétaire` to `propriétaire`

39. For index 2769614. The job will be changed from `g* avocat cour d'appel` to `avocat cour d'appel`

40. For index 2803530. The job will be changed from `.*nc secrétaire de la chambre de commerce` to `secrétaire de la chambre de commerce`

41. For index 2854184, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9732740w/f269.item.r=Buffon.zoom, The job will be changed from `*nc. de la maison mathieuplessy` to `de la maison mathieu plessy`

42. For index 2959031, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762899p/f189.item.r=villiers.zoom, The job will be changed from `chaussures pow* hommes` to `chaussures pour hommes`

43. For index 3014937. The job will be changed from `*; conservateur adjoint à la bibliothèque mazarine` to `conservateur adjoint à la bibliothèque mazarine`

44. For index 3434600. The job will be changed from `de la maison jeanti *nc. et prevost` to `de la maison jeanti et prevost`

45. For index 3450397, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f601.item.r=Zinner.zoom, The job will be changed from `de la maison kolbe et zin* ner` to `de la maison kolbe et zinner`

46. For index 3655915. The job will be changed from `*nc. héliogravure` to `héliogravure`

47. For index 3715127. The job will be changed from `*inspecteur-agent de l'ecole spéciale des beaux-arts` to `inspecteur-agent de l'ecole spéciale des beaux-arts`

48. For index 3719534, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f502.item.r=Portzert.zoom, The job will be changed from `teinturier en laine et cachemi* res` to `teinturier en laine et cachemires`

49. For index 3752036, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764647w/f222.image.r=Chablin, The job will be changed from `peintre et décorateur sui* celaine` to `peintre et décorateur sur porcelaine`

50. For index 3826679. The job will be changed from `*arehitecte` to `arehitecte`

51. For index 3861747. The job will be changed from `*juge d'instruction au trib. de er inst.` to `juge d'instruction au trib. de première inst.`

52. For index 3876583, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9764746t/f362.image.r=Latimier.zoom, The job will be changed from `. *régent de la ban. que de france` to `régent de la banque de france`

53. For index 3980556, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f484.image.r=Laquinla.zoom, The job will be changed from `*intur.` to `teintur.`

54. For index 3986758, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f527, The job will be changed from `noucher.r. du commerce. *5.` to `boucher`

55. For index 3999080, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f606.item.r=Moutaillier.zoom, The job will be changed from `peaux de couleurs p* reliure et chaussures` to `peaux de couleurs pr reliure et chaussures`

56. For index 4299822, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f507.item.r=Cinema.zoom, The job will be changed from `appareils p* cinémas` to `appareils pr cinémas`

57. For index 4326418. The job will be changed from `aveué instance` to `avoué première instance`

In [None]:
raw_paris_jobs.loc[17769, "métier"] = "reprises dans les dentelles et cachemires"
raw_paris_jobs.loc[38168, "métier"] = "entrepreneurs de transports par eau"
raw_paris_jobs.loc[75126, "métier"] = "de l'institut"
raw_paris_jobs.loc[78135, "métier"] = "entrep. et gérants d'éclairge au gaz"
raw_paris_jobs.loc[289264, "métier"] = "fab. de tubes pour gaz"
raw_paris_jobs.loc[349330, "métier"] = "marquise vve de"
raw_paris_jobs.loc[391258, "métier"] = "receveur-percepteur"
raw_paris_jobs.loc[572521, "métier"] = "chaudronnier"
raw_paris_jobs.loc[748392, "métier"] = "menuisier en bâtiments"
raw_paris_jobs.loc[766802, "métier"] = "conseiller à la cour royale"
raw_paris_jobs.loc[793431, "métier"] = "chef de bureau aux finances"
raw_paris_jobs.loc[831599, "métier"] = "passementerie et rubans de velours"
raw_paris_jobs.loc[851985, "métier"] = "soieries et lainages"
raw_paris_jobs.loc[929102, "métier"] = "ingenieurs"
raw_paris_jobs.loc[964592, "métier"] = "orfévre-fab."
raw_paris_jobs.loc[1024652, "métier"] = "juste duc de poix"
raw_paris_jobs.loc[1089329, "métier"] = "marquise vve de"
raw_paris_jobs.loc[1113487, "métier"] = "ingenieurs"
raw_paris_jobs.loc[1162233, "métier"] = "sage-femme"
raw_paris_jobs.loc[1349865, "métier"] = "représentant du peuple haute vienne"
raw_paris_jobs.loc[1356276, "métier"] = "de la maison adolphe marcuard et cie"
raw_paris_jobs.loc[1357960, "métier"] = "bijoutiers-joailliers-fabricants"
raw_paris_jobs.loc[1419017, "métier"] = "de la maison chaligny et guyot-sionnest"
raw_paris_jobs.loc[1547544, "métier"] = "de la maison binder aîné"
raw_paris_jobs.loc[1588262, "métier"] = "architecte-expert"
raw_paris_jobs.loc[1688272, "métier"] = "fabr. de goupillon-tubes coulissaud"
raw_paris_jobs.loc[1688778, "métier"] = "graveur sur doier"
raw_paris_jobs.loc[1751036, "métier"] = "ancien avoué à la cour impériale"
raw_paris_jobs.loc[1800759, "métier"] = "de la maison jeanti et prevost"
raw_paris_jobs.loc[1965794, "métier"] = "de la maison a. landier et houdaille"
raw_paris_jobs.loc[1980210, "métier"] = "ancien contrôleur des contributions directes"
raw_paris_jobs.loc[1994314, "métier"] = "receveur particulier des finances"
raw_paris_jobs.loc[2066044, "métier"] = "de la maison c. savart et cie"
raw_paris_jobs.loc[2113598, "métier"] = "procureur général près la cour des comptes"
raw_paris_jobs.loc[2202649, "métier"] = "de la maison lantz frères"
raw_paris_jobs.loc[2345534, "métier"] = "fabr. de garde-robes et robinets"
raw_paris_jobs.loc[2546082, "métier"] = "de la maison l. nicolas"
raw_paris_jobs.loc[2587361, "métier"] = "propriétaire"
raw_paris_jobs.loc[2769614, "métier"] = "avocat cour d'appel"
raw_paris_jobs.loc[2803530, "métier"] = "secrétaire de la chambre de commerce"
raw_paris_jobs.loc[2854184, "métier"] = "de la maison mathieu plessy"
raw_paris_jobs.loc[2959031, "métier"] = "chaussures pour hommes"
raw_paris_jobs.loc[3014937, "métier"] = "conservateur adjoint à la bibliothèque mazarine"
raw_paris_jobs.loc[3434600, "métier"] = "de la maison jeanti et prevost"
raw_paris_jobs.loc[3450397, "métier"] = "de la maison kolbe et zinner"
raw_paris_jobs.loc[3655915, "métier"] = "héliogravure"
raw_paris_jobs.loc[3715127, "métier"] = "inspecteur-agent de l'ecole spéciale des beaux-arts"
raw_paris_jobs.loc[3719534, "métier"] = "teinturier en laine et cachemires"
raw_paris_jobs.loc[3752036, "métier"] = "peintre et décorateur sur porcelaine"
raw_paris_jobs.loc[3826679, "métier"] = "arehitecte"
raw_paris_jobs.loc[3861747, "métier"] = "juge d'instruction au trib. de première inst."
raw_paris_jobs.loc[3876583, "métier"] = "régent de la banque de france"
raw_paris_jobs.loc[3980556, "métier"] = "teintur."
raw_paris_jobs.loc[3986758, "métier"] = "boucher"
raw_paris_jobs.loc[3999080, "métier"] = "peaux de couleurs pr reliure et chaussures"
raw_paris_jobs.loc[4299822, "métier"] = "appareils pr cinémas"
raw_paris_jobs.loc[4326418, "métier"] = "avoué première instance"

### Dealing with `(` and `)`

We need to deal with both these characters together.
Parentheses that enclose a piece of text were already removed in the first step.

Even after that there are 1148 rows with a `(` and 2899 rows with a `)`.
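Counts like these can be obtained by testing each entry for the literal character. The sketch below uses a toy Series with invented values; `regex=False` avoids having to escape the parentheses, and `na=False` treats missing jobs as non-matching.

```python
import pandas as pd

# Toy data (illustrative only), including a missing value.
jobs = pd.Series(["épi(ier", "vve de)", "duc de poix)", "cordonnier", None])

# Number of rows containing each character, matched literally.
counts = {
    char: int(jobs.str.contains(char, regex=False, na=False).sum())
    for char in ["(", ")"]
}

print(counts)
```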

- Remove `(` and `)` when they are surrounded by spaces, at the start followed by a space, or at the end preceded by a space. The remaining cases will be dealt with during tag generation.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'^\(\s', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s\($', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\(\.", r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\.\(", r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\(nc", r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s\(\s', r' ', regex=True)


raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s\)$', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'^\)\s', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\)\.", r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\.\)", r'.', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'nc\)', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s\)\s', r' ', regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"[(a-z)+]\([(a-z)+]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
157088,bpt6k6292987t,761,69,Chedeville-Dépinay,articles de Tarare et de St(uentin,Sentier,8.,1845,408,articles de tarare et de st(uentin
605111,bpt6k6319106t,731,176,Dehèque,chef de bureau à la I(e mairie,Grenelle-St-Germain,7.,1849,407,chef de bureau à la i(e mairie
717138,bpt6k6319811j,318,62,Gallais (l'abbé),professeur de dogme(grand,et galerie Feydeau,14,1854,239,professeur de dogme(grand
882598,bpt6k63243905,478,132,Gouin,épi(ier,Mondétour,30.,1863,344,épi(ier
946531,bpt6k63243920,301,23,Driou,Morel(Mmes et Cie,brocieries,Mail. 23,1860,221,morel(mmes et cie
1371518,bpt6k63959929,497,4,Vergnon fils,entrepreneur de ma(onnerie,Chanoinesse,10.,1851,415,entrepreneur de ma(onnerie
1408270,bpt6k9668037f,549,88,Farradesche et Cie,monture-fermeture pari(sienne de parapluies,rue St-Martin,325.,1884,382,monture-fermeture pari(sienne de parapluies
1755751,bpt6k96727875,406,96,Delavigne (H.),commissionnaire en pharma(cie,Quincampoix,70.,1870,271,commissionnaire en pharma(cie
1775394,bpt6k96727875,521,194,Jacquin (maison),aciérage de planches gra(vées en cuivre,N.-D.-des-Champs,71.,1870,386,aciérage de planches gra(vées en cuivre
1796384,bpt6k96727875,644,107,Passerat (Edouard),fabr. de chaussures et ga(loches,passage Raoul,11.,1870,509,fabr. de chaussures et ga(loches


1. For index 157088, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6292987t/f408.item.r=ChedeviliA.zoom, The job will be changed from `articles de tarare et de st(uentin` to `articles de tarare et de st-quentin`

2. For index 605111, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319106t/f407.image.r=Deheque, The job will be changed from `chef de bureau à la i(e mairie` to `chef de bureau à la mairie`

3. For index 717138, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f239.image.r=abbe, The job will be changed from `professeur de dogme(grand` to `professeur de dogme grand cours`

4. For index 882598. The job will be changed from `épi(ier` to `épicier`

5. For index 946531, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f221.item.r=moret.zoom, The job will be changed from `morel(mmes et cie` to `borderies`

6. For index 1262761, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6391515w/f483.item.r=Mistfnem%20.zoom. However, it is not clear, and the same entry can be found at https://gallica.bnf.fr/ark:/12148/bpt6k6391515w/f267.image.r=Lalleiuant.zoom. The job will be changed from `g(eare` to `grain`

7. For index 1371518, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f415.image.r=Vergnon, The job will be changed from `entrepreneur de ma(onnerie` to `entrepreneur de maçonnerie`

8. For index 1408270. The job will be changed from `monture-fermeture pari(sienne de parapluies` to `monture-fermeture parisienne de parapluies`

9. For index 1755751. The job will be changed from `commissionnaire en pharma(cie` to `commissionnaire en pharmacie`

10. For index 1775394, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f386.item.r=maison.zoom, The job will be changed from `aciérage de planches gra(vées en cuivre` to `aciérage de planches gravées en cuivre`

11. For index 1796384, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f509.image.r=edouard, The job will be changed from `fabr. de chaussures et ga(loches` to `fabr. de chaussures et galoches`

12. For index 2583943. The job will be changed from `imprimeur typographe et litho(graphe` to `imprimeur typographe et lithographe`

13. For index 2968585. The job will be changed from `esq. e secrétaire à l'am(bassade de la grande-bretagne` to `esq. e secrétaire à l'ambassade de la grande-bretagne`

14. For index 3051198, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f561.item.r=PommeryetCle.zoom, The job will be changed from `agents(l'affaires` to `agents d'affaires`

15. For index 3193681, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97631451/f645.item.r=societe.zoom, The job will be changed from `société des joailliersbijoutiers-orfèvres(dite société des cendres` to `société des joailliers-bijoutiers-orfèvres dite société des cendres`

16. For index 3347266. The job will be changed from `restaurant et bat du lac st-far(geau` to `restaurant et bat du lac st-fargeau`

17. For index 3382629. The job will be changed from `équipements militai(res` to `équipements militaires`

18. For index 3407503. The job will be changed from `courtiers en marchan(lises` to `courtiers en marchandises`

19. For index 3471355. The job will be changed from `repriseuses en cachemi(res` to `repriseuses en cachemires`

20. For index 3474816. The job will be changed from `blondes et den(telles` to `blondes et dentelles`

21. For index 3507360. The job will be changed from `chef de service à la préfecture de po(lice` to `chef de service à la préfecture de police`

22. For index 3513083. The job will be changed from `fournitures pour dentis(tes` to `fournitures pour dentistes`

23. For index 3765997. The job will be changed from `conseiller(etat` to `conseiller d'état`

24. For index 3955980, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f323.item.r=Porte.zoom, The job will be changed from `expéditeurs de denrées ali(mentaires` to `expéditeurs de denrées alimentaires`.

25. For index 4146847, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9776121t/f815.item.r=Emile.zoom, The job will be changed from `coij(eur` to `coiffeur`

26. For index 4192051, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f585.image.r=Soules.zoom, The job will be changed from `fabr. de veil(ses` to `fabr. de veilleuses`

In [None]:
raw_paris_jobs.loc[157088, "métier"] = "articles de tarare et de st-quentin"
raw_paris_jobs.loc[605111, "métier"] = "chef de bureau à la mairie"
raw_paris_jobs.loc[717138, "métier"] = "professeur de dogme grand cours"
raw_paris_jobs.loc[882598, "métier"] = "épicier"
raw_paris_jobs.loc[946531, "métier"] = "borderies"
raw_paris_jobs.loc[1262761, "métier"] = "grain"
raw_paris_jobs.loc[1371518, "métier"] = "entrepreneur de maçonnerie"
raw_paris_jobs.loc[1408270, "métier"] = "monture-fermeture parisienne de parapluies"
raw_paris_jobs.loc[1755751, "métier"] = "commissionnaire en pharmacie"
raw_paris_jobs.loc[1775394, "métier"] = "aciérage de planches gravées en cuivre"
raw_paris_jobs.loc[1796384, "métier"] = "fabr. de chaussures et galoches"
raw_paris_jobs.loc[2583943, "métier"] = "imprimeur typographe et lithographe"
raw_paris_jobs.loc[2968585, "métier"] = "esq. e secrétaire à l'ambassade de la grande-bretagne"
raw_paris_jobs.loc[3051198, "métier"] = "agents d'affaires"
raw_paris_jobs.loc[3193681, "métier"] = "société des joailliers-bijoutiers-orfèvres dite société des cendres"
raw_paris_jobs.loc[3347266, "métier"] = "restaurant et bat du lac st-fargeau"
raw_paris_jobs.loc[3382629, "métier"] = "équipements militaires"
raw_paris_jobs.loc[3407503, "métier"] = "courtiers en marchandises"
raw_paris_jobs.loc[3471355, "métier"] = "repriseuses en cachemires"
raw_paris_jobs.loc[3474816, "métier"] = "blondes et dentelles"
raw_paris_jobs.loc[3507360, "métier"] = "chef de service à la préfecture de police"
raw_paris_jobs.loc[3513083, "métier"] = "fournitures pour dentistes"
raw_paris_jobs.loc[3765997, "métier"] = "conseiller d'état"
raw_paris_jobs.loc[3955980, "métier"] = "expéditeurs de denrées alimentaires"
raw_paris_jobs.loc[4146847, "métier"] = "coiffeur"
raw_paris_jobs.loc[4192051, "métier"] = "fabr. de veilleuses"
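
The assignments above can also be driven from a single mapping, which keeps each index next to its replacement and makes it possible to check that every index exists before writing. The following is a minimal sketch on a toy frame, not the notebook's own code: `jobs` and `corrections` are hypothetical names, two entries are copied from the list above, and the row at index 0 is made up.

```python
import pandas as pd

# Toy stand-in for raw_paris_jobs; index 0 is a made-up row that must stay untouched.
jobs = pd.DataFrame(
    {"métier": ["commissionnaire en pharma(cie", "coij(eur", "vins"]},
    index=[1755751, 4146847, 0],
)

# index -> corrected job, copied from the numbered list above
corrections = {
    1755751: "commissionnaire en pharmacie",
    4146847: "coiffeur",
}

# Fail early on a mistyped index: .loc with an unknown label would append a new row.
unknown = [idx for idx in corrections if idx not in jobs.index]
assert not unknown, f"indices not found: {unknown}"

for idx, value in corrections.items():
    jobs.loc[idx, "métier"] = value
```

Keeping the corrections in one dictionary also makes them easy to export or review separately from the code that applies them.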

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"[(a-z)+]\)[(a-z)+]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
170727,bpt6k6292987t,847,206,Husbrocq fils,fab de paillons et poudit & d)rer,Ste-Avoie,69.,1845,494,fab de paillons et poudit d)rer
901578,bpt6k63243905,596,160,Mennel,bi)ontier en doré,Temple,176.,1863,462,bi)ontier en doré
954207,bpt6k63243920,350,123,Cilly,iab)d'ustensiles de menage,QuatreFils,8.,1860,270,iab)d'ustensiles de menage
1671439,bpt6k9672776c,408,60,De Villefosse {bon,de)s avocat cour dappel,chelieu,98. 81 ftobucdono& 29035,1880,289,de)s avocat cour dappel
2607822,bpt6k9685861g,759,10,L'Hay (M. E,de)peintre-artiste,Rochechouart,74.,1887,570,de)peintre-artiste
2645388,bpt6k9685861g,986,77,Waldthausen (Adolphe),dépositaire de mar)quinerie,r. St-Denis,101.,1887,797,dépositaire de mar)quinerie
2843688,bpt6k9692809v,708,222,Thys (Mme),vi)ns,Moret,22. 96.,1878,595,vi)ns
3717901,bpt6k97645375,615,54,Piaultjeune ('Transon,successeur)coutellerie,St-Denis,229. 5 1 niggisda,1873,492,successeur)coutellerie
4240510,bpt6k97774838,1264,14,Moissel (Maurice,I)peintre-artiste,r. Viète,3.,1921,937,i)peintre-artiste
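
A note on the filter used in the cell above: inside square brackets, `(`, `)` and `+` are literal characters, so `[(a-z)+]` matches a single character that is a lowercase letter, a parenthesis, or a plus sign. The sketch below compares it with a letters-only class. That variant is a hypothetical alternative, not what the notebook runs, and the last sample string is made up for illustration.

```python
import re

original = re.compile(r"[(a-z)+]\)[(a-z)+]")  # pattern used in the cell above
letters_only = re.compile(r"[a-z]\)[a-z]")    # hypothetical stricter variant

# The first three strings come from the table above; "a))b" is made up.
samples = ["bi)ontier en doré", "vi)ns", "de)peintre-artiste", "a))b"]

original_hits = [s for s in samples if original.search(s)]
strict_hits = [s for s in samples if letters_only.search(s)]

# The original pattern also accepts a parenthesis next to ")", so it flags "a))b";
# the letters-only variant does not.
print(original_hits)
print(strict_hits)
```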


1. For index 170727, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6292987t/f494.image.r=Husbrocq.zoom, The job will be changed from `fab de paillons et poudit d)rer` to `fab de paillons et poudre à dorer`

2. For index 901578, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f462.image.r=Mennel.zoom, The job will be changed from `bi)ontier en doré` to `bijoutier en doré`

3. For index 954207, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f270.item.r=QuatreFils.zoom, The job will be changed from `iab)d'ustensiles de menage` to `fab. d'ustensiles de menage`

4. For index 1671439. The job will be changed from `de)s avocat cour dappel` to `avocat cour d'appel`

5. For index 2607822. The job will be changed from `de)peintre-artiste` to `peintre-artiste`

6. For index 2645388, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f797.item.r=Adolphe.zoom, The job will be changed from `dépositaire de mar)quinerie` to `dépositaire de maroquinerie`

7. For index 2843688, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692809v/f595.image.r=Thys, The job will be changed from `vi)ns` to `vins`

8. For index 3717901, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97645375/f492.item.r=Transou.zoom, The job will be changed from `successeur)coutellerie` to `coutellerie`

9. For index 4240510. The job will be changed from `i)peintre-artiste` to `peintre-artiste`

In [None]:
raw_paris_jobs.loc[170727, "métier"] = "fab de paillons et poudre à dorer"
raw_paris_jobs.loc[901578, "métier"] = "bijoutier en doré"
raw_paris_jobs.loc[954207, "métier"] = "fab. d'ustensiles de menage"
raw_paris_jobs.loc[1671439, "métier"] = "avocat cour d'appel"
raw_paris_jobs.loc[2607822, "métier"] = "peintre-artiste"
raw_paris_jobs.loc[2645388, "métier"] = "dépositaire de maroquinerie"
raw_paris_jobs.loc[2843688, "métier"] = "vins"
raw_paris_jobs.loc[3717901, "métier"] = "coutellerie"
raw_paris_jobs.loc[4240510, "métier"] = "peintre-artiste"

The remaining entries will be dealt with during tag generation.

### Dealing with `[` and `]`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\["))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
120213,bpt6k62906378,680,86,Fernet (E.),sous-directeur de la caisse pa[ternelle,Richelieu,110.,1846,392,sous-directeur de la caisse pa[ternelle
313327,bpt6k6309075f,299,110,Cramail (J. ) *,juge d'instruct. au tribunal de [re instance,Jacob,30.,1861,203,juge d'instruct. au tribunal de [re instance
382222,bpt6k6314752k,464,166,Cramail *,juge d'instruct. au tribunal de [re instance,pl. St-Germ.-l'Auxerrois,20.,1856,274,juge d'instruct. au tribunal de [re instance
810753,bpt6k6324389h,289,90,Fagniez *,juge suppl. au trib. de [re instance,Mogador,15.,1859,217,juge suppl. au trib. de [re instance
897603,bpt6k63243905,572,0,Lorget (E.),avoué [ro inst.,St-Honoré,862,1863,438,avoué [ro inst.
1162427,bpt6k6333200c,597,123,Montucci,prof. zu ircée St:[nu',Sentier,38,1862,465,prof. zu ircée st:[nu'
1253897,bpt6k6391515w,746,17,Fagniez *,juge suppl. au trib. de [re instance,Amsterdam,47.,1847,427,juge suppl. au trib. de [re instance
1306346,bpt6k6393838j,594,67,Achèvement des écuries d'Artois,restauration du théâtre royal [talien,etc.; St-Lazale,34.,1843,375,restauration du théâtre royal [talien
1384512,bpt6k9668037f,402,236,Bouche (l'abbé),1er vicaire à St-Germain[Auxerrois,place du Louvre,3.,1884,235,vicaire à st-germain[auxerrois
1495287,bpt6k9669143t,437,21,Dambrine (l'abbé),vicaire à St-Germain[Auxerrois,place du Louvre,3.,1882,292,vicaire à st-germain[auxerrois


1. For index 120213, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62906378/f392.image.r=Fernet, The job will be changed from `sous-directeur de la caisse pa[ternelle` to `sous-directeur de la caisse paternelle`

2. For index 313327. The job will be changed from `juge d'instruct. au tribunal de [re instance` to `juge d'instruct. au tribunal de première instance`

3. For index 382222. The job will be changed from `juge d'instruct. au tribunal de [re instance` to `juge d'instruct. au tribunal de première instance`

4. For index 810753. The job will be changed from `juge suppl. au trib. de [re instance` to `juge suppl. au trib. de première instance`

5. For index 897603. The job will be changed from `avoué [ro inst.` to `avoué première inst.`

6. For index 1162427, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333200c/f465.item.r=Montucci.zoom, The job will be changed from `prof. zu ircée st:[nu'` to `prof. au lycée st-louis`

7. For index 1253897. The job will be changed from `juge suppl. au trib. de [re instance` to `juge suppl. au trib. de première instance`

8. For index 1306346. The job will be changed from `restauration du théâtre royal [talien` to `restauration du théâtre royal italien`

9. For index 1384512, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9668037f/f235.item.r=abbe.zoom, The job will be changed from `er vicaire à st-germain[auxerrois` to `vicaire à st-germain l'auxerrois`

10. For index 1495287. The job will be changed from `vicaire à st-germain[auxerrois` to `vicaire à st-germain l'auxerrois`

11. For index 1685700, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f380.item.r=Guin.zoom, The job will be changed from `[ondeurs en cuivre et fabr. de robinets` to `fondeurs en cuivre et fabr. de robinets`

12. For index 1733684. The job will be changed from `hôte [ ste-anne` to `hôtel ste-anne`

13. For index 1765814, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f331.image.r=Gaiffe, The job will be changed from `fabr. [instruments de chirurgie` to `fabr. d'instruments de chirurgie`

14. For index 1851040, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f385.image.r=Dutrenoy, The job will be changed from `employé aux [finances` to `employé aux finances`

15. For index 1903134. The job will be changed from `bibliothécaire à [ecole normale supérieure` to `bibliothécaire à l'ecole normale supérieure`

16. For index 1979703, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677392n/f446.item.r=pigeard.zoom, The job will be changed from `ferblantier-[ampiste` to `ferblantier-lampiste`

17. For index 2118298, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96839542/f214.image.r=Bergmann.zoom, The job will be changed from `brasserie [amstel` to `brasserie l'amstel`

18. For index 2198885, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96839542/f690.item.r=rollin.zoom, The job will be changed from `fabr. d'articles [lbénisterie pour bureaux` to `fabr. d'articles d'ébénisterie pour bureaux`

19. For index 2324108, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684013b/f841.item.r=Ulysse.zoom, The job will be changed from `[éditeur de musique` to `éditeur de musique`

20. For index 2346462, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f364.image.r=Chesnier, The job will be changed from `ancien administrateur gérant du journal [union` to `ancien administrateur gérant du journal l'union`

21. For index 2372935, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f527.image.r=croissant, The job will be changed from `de [express des journaux` to `de l'express des journaux`

22. For index 2611810, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f594.image.r=Marseuil, The job will be changed from `directeur du journal [abeille` to `directeur du journal l'abeille`

23. For index 2624514, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685861g/f670.image.r=Comte, The job will be changed from `nég[s-commissionnaires pour l'angleterre et les colonies` to `négts-commissionnaires pour l'angleterre et les colonies`

24. For index 2756922, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692626p/f963.image.r=chorale, The job will be changed from `directeur de la société chorale [ abeille` to `directeur de la société chorale l'abeille`

25. For index 2773777. The job will be changed from `hôtel françois [er` to `hôtel françois`

26. For index 2849491, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9732740w/f240.item.r=Normandie.zoom, The job will be changed from `hôtel[de normandie` to `hôtel de normandie`

27. For index 3031134, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f439.item.r=Lawa.zoom, The job will be changed from `[ubr. de cuirs et courroies mécaniques` to `fabr. de cuirs et courroies mécaniques`

28. For index 3086546, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f119.image.r=gurin, The job will be changed from `député de [ain` to `député de l'ain`

29. For index 3132282, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f401.image.r=Spitzer, The job will be changed from `bains francois [er` to `bains francois`

30. For index 3345317, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763471j/f646.image.r=Poussin, The job will be changed from `[tapissier` to `tapissier`

31. For index 3372045, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f143.item.r=Ferdinand.zoom, The job will be changed from `ancien professeur à l'institut des sourds-m[uets` to `ancien professeur à l'institut des sourds-muets`

32. For index 3373078. The job will be changed from `e vicaire à st-germain-[auacerrois` to `vicaire à st-germain l'auxerrois`

33. For index 3382626, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763553z/f203.image.r=Chauveau, The job will be changed from `[truffes et comestibles` to `truffes et comestibles`

34. For index 3473104. The job will be changed from `vicaire à st-germain-[auxerrois` to `vicaire à st-germain l'auxerrois`

35. For index 3923071, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f108.item.r=Billon.zoom, The job will be changed from `ameub[ts` to `ameublts`

36. For index 3931237. The job will be changed from `[obr. de mèches et veilleuses` to `fabr. de mèches et veilleuses`

37. For index 3941450. The job will be changed from `[art. de ménage` to `art. de ménage`

38. For index 3961753. The job will be changed from `ameub[ls` to `ameublts`

39. For index 3986407. The job will be changed from `[anr. d'instruments de précision` to `fabr. d'instruments de précision`

40. For index 3997874, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f599.item.r=20.zoom, The job will be changed from `[erronnerie d'art` to `ferronnerie d'art`

41. For index 4172171. The job will be changed from `coif[eur` to `coiffeur`

42. For index 4206984, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f695.image.r=alimentaires, The job will be changed from `conserves alimentaires et truf[es` to `conserves alimentaires et truffes`

43. For index 4249232, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1004.item.r=Martery.zoom, The job will be changed from `[euillagiste` to `feuillagiste`

44. For index 4374210. The job will be changed from `[charb. et vins` to `charb. et vins`

45. For index 4393376, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1171.item.r=Milourd.zoom, The job will be changed from `aciers et outillage etab[ts ebord` to `aciers et outillage etablts elword`

46. For index 4400996, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1232.item.r=Vanet.zoom, The job will be changed from `[acteur de pianos` to `facteur de pianos`

In [None]:
raw_paris_jobs.loc[120213, "métier"] = "sous-directeur de la caisse paternelle"
raw_paris_jobs.loc[313327, "métier"] = "juge d'instruct. au tribunal de première instance"
raw_paris_jobs.loc[382222, "métier"] = "juge d'instruct. au tribunal de première instance"
raw_paris_jobs.loc[810753, "métier"] = "juge suppl. au trib. de première instance"
raw_paris_jobs.loc[897603, "métier"] = "avoué première inst."
raw_paris_jobs.loc[1162427, "métier"] = "prof. au lycée st-louis"
raw_paris_jobs.loc[1253897, "métier"] = "juge suppl. au trib. de première instance"
raw_paris_jobs.loc[1306346, "métier"] = "restauration du théâtre royal italien"
raw_paris_jobs.loc[1384512, "métier"] = "vicaire à st-germain l'auxerrois"
raw_paris_jobs.loc[1495287, "métier"] = "vicaire à st-germain l'auxerrois"
raw_paris_jobs.loc[1685700, "métier"] = "fondeurs en cuivre et fabr. de robinets"
raw_paris_jobs.loc[1733684, "métier"] = "hôtel ste-anne"
raw_paris_jobs.loc[1765814, "métier"] = "fabr. d'instruments de chirurgie"
raw_paris_jobs.loc[1851040, "métier"] = "employé aux finances"
raw_paris_jobs.loc[1903134, "métier"] = "bibliothécaire à l'ecole normale supérieure"
raw_paris_jobs.loc[1979703, "métier"] = "ferblantier-lampiste"
raw_paris_jobs.loc[2118298, "métier"] = "brasserie l'amstel"
raw_paris_jobs.loc[2198885, "métier"] = "fabr. d'articles d'ébénisterie pour bureaux"
raw_paris_jobs.loc[2324108, "métier"] = "éditeur de musique"
raw_paris_jobs.loc[2346462, "métier"] = "ancien administrateur gérant du journal l'union"
raw_paris_jobs.loc[2372935, "métier"] = "de l'express des journaux"
raw_paris_jobs.loc[2611810, "métier"] = "directeur du journal l'abeille"
raw_paris_jobs.loc[2624514, "métier"] = "négts-commissionnaires pour l'angleterre et les colonies"
raw_paris_jobs.loc[2756922, "métier"] = "directeur de la société chorale l'abeille"
raw_paris_jobs.loc[2773777, "métier"] = "hôtel françois"
raw_paris_jobs.loc[2849491, "métier"] = "hôtel de normandie"
raw_paris_jobs.loc[3031134, "métier"] = "fabr. de cuirs et courroies mécaniques"
raw_paris_jobs.loc[3086546, "métier"] = "député de l'ain"
raw_paris_jobs.loc[3132282, "métier"] = "bains francois"
raw_paris_jobs.loc[3345317, "métier"] = "tapissier"
raw_paris_jobs.loc[3372045, "métier"] = "ancien professeur à l'institut des sourds-muets"
raw_paris_jobs.loc[3373078, "métier"] = "vicaire à st-germain l'auxerrois"
raw_paris_jobs.loc[3382626, "métier"] = "truffes et comestibles"
raw_paris_jobs.loc[3473104, "métier"] = "vicaire à st-germain l'auxerrois"
raw_paris_jobs.loc[3923071, "métier"] = "ameublts"
raw_paris_jobs.loc[3931237, "métier"] = "fabr. de mèches et veilleuses"
raw_paris_jobs.loc[3941450, "métier"] = "art. de ménage"
raw_paris_jobs.loc[3961753, "métier"] = "ameublts"
raw_paris_jobs.loc[3986407, "métier"] = "fabr. d'instruments de précision"
raw_paris_jobs.loc[3997874, "métier"] = "ferronnerie d'art"
raw_paris_jobs.loc[4172171, "métier"] = "coiffeur"
raw_paris_jobs.loc[4206984, "métier"] = "conserves alimentaires et truffes"
raw_paris_jobs.loc[4249232, "métier"] = "feuillagiste"
raw_paris_jobs.loc[4374210, "métier"] = "charb. et vins"
raw_paris_jobs.loc[4393376, "métier"] = "aciers et outillage etablts elword"
raw_paris_jobs.loc[4400996, "métier"] = "facteur de pianos"
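
Since these replacements are typed by hand, it can help to re-run the filter afterwards and to scan the values for slips such as an accidentally doubled word. The following is a minimal sketch on toy data: `jobs` is a hypothetical stand-in for `raw_paris_jobs`, the third row is made up, and the doubled-word pattern is a heuristic that may also flag legitimate repetitions.

```python
import re

import pandas as pd

jobs = pd.DataFrame({
    "métier": [
        "hôtel de normandie",
        "coif[eur",            # bracket still present
        "vins vins en gros",   # made-up example of a doubled word
    ]
})

# Rows that still contain "[" after the manual fixes
left_over = jobs[jobs["métier"].str.contains(r"\[")]

# Rows where a whole word is immediately repeated ("vins vins")
doubled_word = re.compile(r"\b(\w+) \1\b")
doubled = jobs[jobs["métier"].map(lambda s: doubled_word.search(s) is not None)]

print(left_over)
print(doubled)
```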

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"\]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
138803,bpt6k62906378,794,199,Perrin,juge d'instr. de ]re inst.,Madame,20.,1846,506,juge d'instr. de ]re inst.
242242,bpt6k6305463c,273,20,Bourgain,juge ]re inst.,St-Louis-Marais,30.,1857,160,juge ]re inst.
254041,bpt6k6305463c,344,162,Dorléans,hor] -mécan.,Faub.-du-Temple,106.,1857,231,hor] -mécan.
403009,bpt6k6314752k,585,117,Laurens-Rabier,avoué de ]re instance,Ri-. voli,118.,1856,395,avoué de ]re instance
480208,bpt6k6315985z,165,0,Adler-Mesnard,professeur d'allemand à l'E ] cole normale,Ulm,45.,1850,83,professeur d'allemand à l'e ] cole normale
536340,bpt6k6318531z,263,100,Bourgain,juge ]re inst.,St-Louis-Marais,30.,1858,155,juge ]re inst.
630868,bpt6k6319106t,895,11,Perrin,juge d'instr. de ]re inst.,Vaugirard,31.,1849,571,juge d'instr. de ]re inst.
699811,bpt6k6319811j,198,169,Bonabeau,conimis-greffier de la justice de paix du ] er...,Pépinière,40.,1854,119,conimis-greffier de la justice de paix du ] er...
798737,bpt6k6324389h,210,46,Caron,avoné ]re instance,Richelieu,15.,1859,138,avoné ]re instance
814737,bpt6k6324389h,314,124,Germain (Mle),école communale du ] arrondiss.,Bac,119.,1859,242,école communale du ] arrondiss.


1. For index 120371. The job will be changed from `chef de l'état civil du ]e arrondissement` to `chef de l'état civil du arrondissement`

2. For index 128622. The job will be changed from ` et  ].` to ` `

3. For index 134710. The job will be changed from ` ].` to ` `

4. For index 138803. The job will be changed from `juge d'instr. de ]re inst.` to `juge d'instr. de première inst.`

5. For index 168780. The job will be changed from ` ].` to ` `

6. For index 242242. The job will be changed from `juge ]re inst.` to `juge première inst.`

7. For index 254041, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6305463c/f231.image.r=Dorleans, The job will be changed from `hor] -mécan.` to `horl-mécan.`

8. For index 403009. The job will be changed from `avoué de ]re instance` to `avoué de première instance`

9. For index 480208. The job will be changed from `professeur d'allemand à l'e ] cole normale` to `professeur d'allemand à l'ecole normale`

10. For index 536340. The job will be changed from `juge ]re inst.` to `juge première inst.`

11. For index 620412, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319106t/f504.image.r=capitaine, The job will be changed from `capitaine ]e légion` to `capitaine légion`

12. For index 630868. The job will be changed from `juge d'instr. de ]re inst.` to `juge d'instr. de première inst.`

13. For index 699811. The job will be changed from `conimis-greffier de la justice de paix du ] er arrondissement` to `conimis-greffier de la justice de paix du arrondissement`

14. For index 798737. The job will be changed from `avoné ]re instance` to `avoué première instance`

15. For index 814737. The job will be changed from `école communale du ] arrondiss.` to `école communale du arrondiss.`

16. For index 820152. The job will be changed from `avoué ]re instance` to `avoué première instance`

17. For index 857577. The job will be changed from `juge ]re instance` to `juge première instance`

18. For index 932293. The job will be changed from `juge ]re instance` to `juge première instance`

19. For index 933108. The job will be changed from `juge ]re inst.` to `juge première inst.`

20. For index 1043393. The job will be changed from `juge ]re instance` to `juge première instance`

21. For index 1044328. The job will be changed from `juge honor. ]ro instance` to `juge honor. première instance`

22. For index 1068538, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f352.item.r=Berthier.zoom, The job will be changed from `capit. d'artz]l.` to `capit. d'artill.`

23. For index 1117795. The job will be changed from `juge honor. ]re instance` to `juge honor. première instance`

24. For index 1202488. The job will be changed from `avoué au tribunal de ]re instance` to `avoué au tribunal de première instance`

25. For index 1270226. The job will be changed from ` ].` to ` `

26. For index 1298225. The job will be changed from `chef de l'état civil du l]e arrondissement` to `chef de l'état civil du arrondissement`

27. For index 1306395. The job will be changed from ` ].` to ` `

28. For index 1744412. The job will be changed from `peintre en bâtiment]` to `peintre en bâtiment`

29. For index 1751397, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f245.item.r=bougard.zoom, The job will be changed from `bouts de parapluie et ] tenons` to `bouts de parapluie et tenons`

30. For index 2161922. The job will be changed from `] fabr. de casquettes` to `fabr. de casquettes`

31. For index 2232363. The job will be changed from `vins et] spiritueux en gros à l'entrepôt` to `vins et spiritueux en gros à l'entrepôt`

32. For index 2254023. The job will be changed from `vins ] gros` to `vins gros`

33. For index 2802060. The job will be changed from `représentant de fabri-] ques` to `représentant de fabriques`

34. For index 2985977. The job will be changed from `] conseiller à la cour d'appel` to `conseiller à la cour d'appel`

35. For index 3156239, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97631451/f414.image.r=Denis, The job will be changed from `ancien nct]` to `ancien nct`

36. For index 3481682. The job will be changed from `.] peintres en bâtiments` to `peintres en bâtiments`

37. For index 3511179, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f476.item.r=Mentgauri.zoom, The job will be changed from `chef du contentieux à la doua.] ne` to `chef du contentieux à la douane`

38. For index 3549314. The job will be changed from `ancien not]` to `ancien nct`

In [None]:
raw_paris_jobs.loc[120371, "métier"] = "chef de l'état civil du arrondissement"
raw_paris_jobs.loc[128622, "métier"] = ""
raw_paris_jobs.loc[134710, "métier"] = ""
raw_paris_jobs.loc[138803, "métier"] = "juge d'instr. de première inst."
raw_paris_jobs.loc[168780, "métier"] = ""
raw_paris_jobs.loc[242242, "métier"] = "juge première inst."
raw_paris_jobs.loc[254041, "métier"] = "horl-mécan."
raw_paris_jobs.loc[403009, "métier"] = "avoué de première instance"
raw_paris_jobs.loc[480208, "métier"] = "professeur d'allemand à l'ecole normale"
raw_paris_jobs.loc[536340, "métier"] = "juge première inst."
raw_paris_jobs.loc[620412, "métier"] = "capitaine légion"
raw_paris_jobs.loc[630868, "métier"] = "juge d'instr. de première inst."
raw_paris_jobs.loc[699811, "métier"] = "conimis-greffier de la justice de paix du arrondissement"
raw_paris_jobs.loc[798737, "métier"] = "avoué première instance"
raw_paris_jobs.loc[814737, "métier"] = "école communale du arrondiss."
raw_paris_jobs.loc[820152, "métier"] = "avoué première instance"
raw_paris_jobs.loc[857577, "métier"] = "juge première instance"
raw_paris_jobs.loc[932293, "métier"] = "juge première instance"
raw_paris_jobs.loc[933108, "métier"] = "juge première inst."
raw_paris_jobs.loc[1043393, "métier"] = "juge première instance"
raw_paris_jobs.loc[1044328, "métier"] = "juge honor. première instance"
raw_paris_jobs.loc[1068538, "métier"] = "capit. d'artill."
raw_paris_jobs.loc[1117795, "métier"] = "juge honor. première instance"
raw_paris_jobs.loc[1202488, "métier"] = "avoué au tribunal de première instance"
raw_paris_jobs.loc[1270226, "métier"] = ""
raw_paris_jobs.loc[1298225, "métier"] = "chef de l'état civil du arrondissement"
raw_paris_jobs.loc[1306395, "métier"] = ""
raw_paris_jobs.loc[1744412, "métier"] = "peintre en bâtiment"
raw_paris_jobs.loc[1751397, "métier"] = "bouts de parapluie et tenons"
raw_paris_jobs.loc[2161922, "métier"] = "fabr. de casquettes"
raw_paris_jobs.loc[2232363, "métier"] = "vins et spiritueux en gros à l'entrepôt"
raw_paris_jobs.loc[2254023, "métier"] = "vins gros"
raw_paris_jobs.loc[2802060, "métier"] = "représentant de fabriques"
raw_paris_jobs.loc[2985977, "métier"] = "conseiller à la cour d'appel"
raw_paris_jobs.loc[3156239, "métier"] = "ancien nct"
raw_paris_jobs.loc[3481682, "métier"] = "peintres en bâtiments"
raw_paris_jobs.loc[3511179, "métier"] = "chef du contentieux à la douane"
raw_paris_jobs.loc[3549314, "métier"] = "ancien nct"
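
Many entries in the two lists above share one pattern: a token read as `[re`, `[ro`, `]re` or `]ro` is replaced with `première`. The notebook corrects these entries one by one. As a sketch only, the same pattern could be handled with a single regular expression. `sample` and `pattern` are hypothetical names, and the strings are taken from the tables above.

```python
import pandas as pd

sample = pd.Series([
    "juge d'instruct. au tribunal de [re instance",
    "avoué [ro inst.",
    "juge ]re inst.",
    "hôtel[de normandie",   # a bracket that does not follow the pattern
])

# A bracket followed by "re" or "ro", standing alone as a token:
# not preceded by a non-space character, and followed by whitespace.
pattern = r"(?<!\S)[\[\]]r[eo](?=\s)"

fixed = sample.str.replace(pattern, "première", regex=True)
print(fixed.tolist())
```

The last string is left unchanged, so entries such as `hôtel[de normandie` would still need an individual correction.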

### Dealing with `<`

- Get the rows containing `<`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"<"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
993942,bpt6k6331310g,436,176,Apponi Ier (Cte Rodolphe d'),secrét. <le l'ambass. d'Autriche,Grenelle-St-Germain,121.,1844,220,secrét. <le l'ambass. d'autriche
2167096,bpt6k96839542,673,41,Lafitte de Canson,O. < (et Mme),Néva,6.,1885,502,o. < et mme
2831084,bpt6k9692809v,636,5,Pommier-Dunoyer,ancien avocat cour d'<ppel,Royale,10.,1878,523,ancien avocat cour d'<ppel
2993124,bpt6k9762929c,338,101,Brun (J.),terrines de foies gras et comesti<bles en gros,Arbre-Sec,50. 19 W,1879,209,terrines de foies gras et comesti<bles en gros
3251682,bpt6k97631451,1328,13,Simon,cycles < Solide »,boul. Ménilmontant,14.,1901,1006,cycles < solide »
3960932,bpt6k9775724t,386,235,Frydmane (J.) & Cie,papiers < Frane »,r. SLMerrí,12. (40). T. Arch.28. 19.,1914,353,papiers < frane »
4019685,bpt6k9775724t,770,90,Savoye (P.),roues < Celer » pour automobiles,av. de la Grande-Armée,8.,1914,737,roues < celer » pour automobiles
4144939,bpt6k9776121t,848,261,Société des inventions économiques,machine à laver < l'Economique »,r. du Faub. StDenis,190.,1907,799,machine à laver < l'economique »
4231882,bpt6k97774838,1202,102,Leu (le),revue théosophique franalions < Rhéa »,square Rapp,4.,1921,875,revue théosophique franalions < rhéa »
4244493,bpt6k97774838,1294,193,(69). Odin,courroies-chaines < 0.,T. Nord 22. 01. Succursales : r. de Londres,1.,1921,967,courroies-chaines <


There are 14 rows that have the `<`. 

In most of them, `<` is a misread `«`. The `métier` column is corrected manually for the entries where `<` does not correspond to `«`.

1. For index 993942, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6331310g/f220.item.r=secret.zoom. The `d` was misinterpreted as `<l`. Thus the job will be changed from `secrét. <le l'ambass. d'Autriche` to `secrét. de l'ambass. d'Autriche`.

2. For index 2167096, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96839542/f502.item.r=Canson.zoom. The job will be changed from `o. < et mme` to `mme`.

3. For index 2831084, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692809v/f523.item.r=avocat.zoom. The `a` was misinterpreted as `<`. Thus the job will be changed from `ancien avocat cour d'<ppel` to `ancien avocat cour d'appel`.

4. For index 2993124, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9762929c/f209.item.r=terrines.zoom. The job was split across two lines and was misinterpreted. Thus the job will be changed from `terrines de foies gras et comesti<bles en gros` to `terrines de foies gras et comestibles en gros`.

5. For index 4244493, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f967.image.r=courroies.zoom. The job was split across two lines and was misinterpreted. Thus the job will be changed from `courroies-chaines < 0.` to `courroies-chaines O.D.I.N pour transmissions`.

In the remaining entries, `<` is removed.

In [None]:
raw_paris_jobs.loc[993942, "métier"] = "secrét. de l'ambass. d'Autriche"
raw_paris_jobs.loc[2167096, "métier"] = "mme"
raw_paris_jobs.loc[2831084, "métier"] = "ancien avocat cour d'appel"
raw_paris_jobs.loc[2993124, "métier"] = "terrines de foies gras et comestibles en gros"
raw_paris_jobs.loc[4244493, "métier"] = "courroies-chaines o.d.i.n pour transmissions"

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'<', r'', regex=True)
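
Removing `<` from an entry such as `cycles < solide »` leaves two consecutive spaces behind. The sketch below shows the removal followed by a whitespace clean-up, on strings taken from the table above. The clean-up step is an assumption about how the leftover spaces could be handled, not something this cell performs, and `sample` and `cleaned` are hypothetical names.

```python
import pandas as pd

sample = pd.Series(["cycles < solide »", "papiers < frane »", "vins"])

cleaned = (
    sample
    .str.replace("<", "", regex=False)     # literal removal of "<"
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace into one space
    .str.strip()                           # drop leading and trailing spaces
)
print(cleaned.tolist())
```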

### Dealing with `>`

- Get the rows containing `>`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r">"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
198357,bpt6k62931221,343,194,Bussy *,prof. de chimie à l'école de phar-> macie,Verrerie,55,1841,192,prof. de chimie à l'école de phar-> macie
509734,bpt6k6315985z,356,139,Legendre,administrateur d'un bureau de bien-> faisance,Grenelle-St-Germain,66.,1850,274,administrateur d'un bureau de bien-> faisance
2084523,bpt6k9677737t,711,104,Olivier (Marius),ingénieur des arts et manu-> factures,boul. Malesherbes,97.,1883,572,ingénieur des arts et manu-> factures
2609722,bpt6k9685861g,770,151,Magnard (Francis),rédacteur en chef du «Figaro >>,boul. Montmorency,27.,1887,581,rédacteur en chef du «figaro >>
3467483,bpt6k9763554c,345,129,Chennevière,de la maison Lainé et Chenne> vière,Louvre,6. 30350,1875,222,de la maison lainé et chenne> vière
3493967,bpt6k9763554c,498,160,Jacquin (maison),aciérage de planches gra> vees en cuivre,N.-D.-des-Champs,71.,1875,375,aciérage de planches gra> vees en cuivre
3737551,bpt6k9764647w,260,98,Bachem (H.),directeur de la « Zurich > compagnie d'assuran...,Châteauduu,7.,1881,139,directeur de la « zurich > compagnie d'assuran...
4022853,bpt6k9775724t,798,93,Société génle de publicité et d'affiches,« ViaDecor >,r. Tronchet,35. (90). T. Louv.02. 8102. 82.,1914,765,« viadecor >
4365714,bpt6k9780089g,1310,104,Mijon-Gayet,confections en gros pour en-> fants,r. d'Aboukir,49. (2º). T. Centr.91.,1922,971,confections en gros pour en-> fants


There are 9 rows that contain `>`.

In some of them, `>` is a misinterpretation of `»`. The `métier` column is corrected manually for the entries where `>` does not correspond to `»`.

1. For index 198357, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62931221/f192.image.r=chimie.zoom. The job was printed across two lines and was misinterpreted. Thus the job will be changed from `prof. de chimie à l'école de phar-> macie` to `prof. de chimie à l'école de pharmacie`.

2. For index 509734, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6315985z/f274.image.r=administrateur.zoom. The job was printed across two lines and was misinterpreted. Thus the job will be changed from `administrateur d'un bureau de bien-> faisance` to `administrateur d'un bureau de bienfaisance`.

3. For index 2084523, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9677737t/f572.image.r=Malesherbes.zoom. The job was printed across two lines and was misinterpreted. Thus the job will be changed from `ingénieur des arts et manu-> factures` to `ingénieur des arts et manufactures`.

4. For index 3467483, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f222.image.r=Laine.zoom. The job was printed across two lines and was misinterpreted. Thus the job will be changed from `de la maison Lainé et Chenne> vière` to `de la maison Lainé et Chennevière`.

5. For index 3493967, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9763554c/f375.item.r=acierage.zoom.zoom. The job was printed across two lines and was misinterpreted. Thus the job will be changed from `aciérage de planches gra> vees en cuivre` to `aciérage de planches gravées en cuivre`.

6. For index 4365714, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f971.image.r=Mijon.zoom. The job was printed across two lines and was misinterpreted. Thus the job will be changed from `confections en gros pour en-> fants` to `confections en gros pour enfants`.

In the remaining entries, `>` is removed.

In [None]:
raw_paris_jobs.loc[198357, "métier"] = "prof. de chimie à l'école de pharmacie"
raw_paris_jobs.loc[509734, "métier"] = "administrateur d'un bureau de bienfaisance"
raw_paris_jobs.loc[2084523, "métier"] = "ingénieur des arts et manufactures"
raw_paris_jobs.loc[3467483, "métier"] = "de la maison lainé et chennevière"
raw_paris_jobs.loc[3493967, "métier"] = "aciérage de planches gravées en cuivre"
raw_paris_jobs.loc[4365714, "métier"] = "confections en gros pour enfants"

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'>', r'', regex=True)
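
Most of the `>` rows above share one cause: a word hyphenated across two printed lines, which the OCR rendered as `-> ` or `> `. As a rough sketch only (this notebook corrects each entry by hand after checking the scan), such fragments could be rejoined with a regular expression. The helper name `join_split_word` and the sample list are illustrative assumptions; the sample strings are taken from the table above.

```python
import re

# Hypothetical helper: rejoin a word that the OCR split as "phar-> macie".
# The '>' must directly follow a letter (optionally via a hyphen) and be
# followed by whitespace and another letter; an isolated '>' is left alone.
SPLIT_WORD = re.compile(r"(?<=[^\W\d_])-?>\s+(?=[^\W\d_])")

def join_split_word(job: str) -> str:
    return SPLIT_WORD.sub("", job)

samples = [
    "prof. de chimie à l'école de phar-> macie",
    "de la maison lainé et chenne> vière",
    "aciérage de planches gra> vees en cuivre",
    "« viadecor >",
]
joined = [join_split_word(s) for s in samples]
for before, after in zip(samples, joined):
    print(f"{before!r} -> {after!r}")
```

The third sample shows why the manual check against the directory still matters: the word is rejoined, but the missing accent in `gravées` is not restored.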

### Dealing with `«` and `»`

We need to deal with both these characters together. These symbols represent quotes in French.

First, the entries containing `«` are retrieved and manually corrected. Then the entries containing `»` are retrieved and manually corrected.

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"«"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
106622,bpt6k62906378,594,135,Beltz (H. B.),pr«p.,St-Denis,374.,1846,306,pr«p.
140389,bpt6k62906378,804,168,Poulain de Bossay *,proviseu« au collége royal St-Louis,Harpe,92-94-96.,1846,516,proviseu« au collége royal st-louis
157159,bpt6k6292987t,761,160,Chernowski (Nicolas),beurre et «-ufs,Aubryle-Boucher,49 bis.,1845,408,beurre et «-ufs
207783,bpt6k62931221,405,7,Flaniant,menuisier et mer«ier,St-Jacques,351.,1841,254,menuisier et mer«ier
282775,bpt6k6305463c,517,107,Patareau,ſab «le robinets,Lancry,6.,1857,404,fab «le robinets
...,...,...,...,...,...,...,...,...,...,...
4387704,bpt6k9780089g,1461,159,Roux et Duncan,chaines « Fartia,av. Jean-Jaurès,75.,1922,1122,chaines « fartia
4399394,bpt6k9780089g,1559,269,Trogan (Edouard),revue « Le Correspondant,r. St-Guillaume,31.,1922,1220,revue « le correspondant
4401842,bpt6k9780089g,1577,145,Véran (Claudius),lessive « Niké,boul. St-Jacques,52 bis. (14°). T. Gob. 49. 66.,1922,1238,lessive « niké
4402186,bpt6k9780089g,1579,310,Vernay,bar « Royal Moka,boul. Sébastopol,20. (4e). T. Arch. 96. 40),1922,1240,bar « royal moka


When the symbols (`«` and `»`) surround a piece of text, they are removed.

Then the symbols are removed when they are surrounded by spaces, at the start followed by a space, or at the end preceded by a space.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)«(\s|$)', r' ', regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'(^|\s)»(\s|$)', r' ', regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
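
To see what these three substitutions do, the same patterns can be applied to single strings with `re.sub`. This is a small sketch for illustration; the function name `strip_isolated_guillemets` is an assumption, and the sample strings are modelled on the rows shown above.

```python
import re

def strip_isolated_guillemets(job: str) -> str:
    # Same three substitutions as the cell above, applied to one string.
    job = re.sub(r"(^|\s)«(\s|$)", " ", job)
    job = re.sub(r"(^|\s)»(\s|$)", " ", job)
    return re.sub(r"\s\s+", " ", job)

examples = [
    "chaines « fartia",            # « surrounded by spaces
    "revue « le correspondant »",  # » at the end, preceded by a space
    "pr«p.",                       # « inside a word: left for manual correction
    "« viadecor",                  # « at the start, followed by a space
]
cleaned = [strip_isolated_guillemets(s) for s in examples]
for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Because each match is replaced by a space, a single leading or trailing space can remain in the result. A guillemet inside a word is not touched, which is why those entries are corrected one by one.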

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"[a-z]«[a-z]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
106622,bpt6k62906378,594,135,Beltz (H. B.),pr«p.,St-Denis,374.0,1846,306,pr«p.
207783,bpt6k62931221,405,7,Flaniant,menuisier et mer«ier,St-Jacques,351.0,1841,254,menuisier et mer«ier
743714,bpt6k6319811j,500,128,Renzi:r,r«lieur,St-Jacques,40.0,1854,421,r«lieur
796635,bpt6k6324389h,197,34,Branchu (F.),cominiss. en marchan«lises,Vieux-Augustins,16.0,1859,125,cominiss. en marchan«lises
824605,bpt6k6324389h,380,70,Ledanois,r«ferendaire au sceau,Nve-St-Augustin,11.0,1859,308,r«ferendaire au sceau
883405,bpt6k63243905,483,168,Griviaut,ho«elier,Buffault,32.0,1863,349,ho«elier
954811,bpt6k63243920,354,88,Goix,hồiel Cor n«iile,Corneille,5.0,1860,274,hồiel cor n«iile
1322227,bpt6k63959929,165,25,Adon,licen«ié en droit,Montorgueil,61.0,1851,83,licen«ié en droit
1344102,bpt6k63959929,310,73,Gicqueau (de),représentant du pr«uple (LoireInterieurel,Beaune,12.0,1851,228,représentant du pr«uple (loireinterieurel
1372283,bpt6k63959929,502,45,Vizet,voi«ures publiques pour Vouziers et Stenay,Jussienne,11.0,1851,420,voi«ures publiques pour vouziers et stenay


1. For index 106622, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62906378/f306.item.r=374.zoom, The job will be changed from `pr«p.` to `prop.`

2. For index 207783, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62931221/f254.item.r=menuisieret.zoom, The job will be changed from `menuisier et mer«ier` to `menuisier et mercier`

3. For index 743714, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6319811j/f421.item.r=Renzi%20r.zoom, The job will be changed from `r«lieur` to `relieur`

4. For index 796635, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f125.item.r=Branchu.zoom, The job will be changed from `cominiss. en marchan«lises` to `commiss. en marchandises`

5. For index 824605, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6324389h/f308.item.r=Ledanois.zoom, The job will be changed from `r«ferendaire au sceau` to `reférendaire au sceau`

6. For index 883405, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243905/f349.item.r=Griviaut.zoom, The job will be changed from `ho«elier` to `hôtelier`

7. For index 954811, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63243920/f274.item.r=goix.zoom, The job will be changed from `hồiel cor n«iile` to `hôtel corneille`

8. For index 1322227, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f83.item.r=adon.zoom, The job will be changed from `licen«ié en droit` to `licencié en droit`

9. For index 1344102, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f228.item.r=Gicqucau.zoom, The job will be changed from `représentant du pr«uple (loireinterieurel` to `représentant du peuple (loire-inférieure)`

10. For index 1372283, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f420.item.r=Vizet.zoom, The job will be changed from `voi«ures publiques pour vouziers et stenay` to `voitures publiques pour vouziers et stenay`

11. For index 1824786, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f234.item.r=Jules.zoom, The job will be changed from `couronn«s et articles funéraires` to `couronnes et articles funéraires`

12. For index 2390129, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f636.item.r=Malepeyre.zoom, The job will be changed from `profe«seur de comptabilité à l'ecole des haules brudes commerciales` to `professeur de comptabilité à l'ecole des hautes etudes commerciales`

13. For index 4287779, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f428.item.r=moleurs.zoom, The job will be changed from `moleurs pour automo«biles` to `moteurs pour automobiles`
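
The verification links in the list above share one shape: the `doc_id`, then `f` followed by the `gallica_page` value, then a search term after `r=`. Assuming that pattern holds (it is inferred from the links in this notebook, not from Gallica documentation), a link for any row could be built as sketched below. The function name `gallica_zoom_url` is hypothetical.

```python
from urllib.parse import quote

GALLICA_BASE = "https://gallica.bnf.fr/ark:/12148"

def gallica_zoom_url(doc_id: str, gallica_page: int, search_term: str) -> str:
    """Build a link in the same shape as the ones listed above (assumed pattern)."""
    # quote() percent-encodes characters such as spaces in the search term.
    return f"{GALLICA_BASE}/{doc_id}/f{gallica_page}.item.r={quote(search_term)}.zoom"

# Values taken from the rows with index 106622 and 743714.
print(gallica_zoom_url("bpt6k62906378", 306, "374"))
print(gallica_zoom_url("bpt6k6319811j", 421, "Renzi r"))
```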

In [None]:
raw_paris_jobs.loc[106622, "métier"] = "prop."
raw_paris_jobs.loc[207783, "métier"] = "menuisier et mercier"
raw_paris_jobs.loc[743714, "métier"] = "relieur"
raw_paris_jobs.loc[796635, "métier"] = "commiss. en marchandises"
raw_paris_jobs.loc[824605, "métier"] = "reférendaire au sceau"
raw_paris_jobs.loc[883405, "métier"] = "hôtelier"
raw_paris_jobs.loc[954811, "métier"] = "hôtel corneille"
raw_paris_jobs.loc[1322227, "métier"] = "licencié en droit"
raw_paris_jobs.loc[1344102, "métier"] = "représentant du peuple loire-inférieure"
raw_paris_jobs.loc[1372283, "métier"] = "voitures publiques pour vouziers et stenay"
raw_paris_jobs.loc[1824786, "métier"] = "couronnes et articles funéraires"
raw_paris_jobs.loc[2390129, "métier"] = "professeur de comptabilité à l'ecole des hautes etudes commerciales"
raw_paris_jobs.loc[4287779, "métier"] = "moteurs pour automobiles"

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"[a-z]»[a-z]"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
210774,bpt6k62931221,423,111,Gouyon,i»specteu de la gia de voirie,VieuxAugustins,41.,1841,272,i»specteu de la gia de voirie
345588,bpt6k6309075f,501,183,Mangras (I'r.),avocat à la cour impéri»le,Sept-Voies,21.,1861,405,avocat à la cour impéri»le
464387,bpt6k6315927h,924,33,Mellion,t»illeur,Monnaie,22.,1848,575,t»illeur
563172,bpt6k6318531z,426,98,Lacaille,p»piers à cigarettes,Paradis-Poissonnière,1.,1858,318,p»piers à cigarettes
868349,bpt6k63243905,388,160,Drophin,vins et restaur»nt,Feuillantines,40,1863,254,vins et restaur»nt
1037975,bpt6k6333170p,292,110,Barbier,cb»reutier,boul de la Villette,4.,1864,155,cb»reutier
1063014,bpt6k6333170p,455,60,Flament,l»mpes,Lancry,19.,1864,318,l»mpes
1064839,bpt6k6333170p,467,15,Gagpeax,chau»sures,Ménilmontant,5.,1864,330,chau»sures
1081129,bpt6k6333170p,574,4,Létacq (Ernest),fab. p»pies imperméables,Roate Militaire-Villette,2 et 3.,1864,437,fab. p»pies imperméables
1092621,bpt6k6333170p,649,105,*. député de l'Eure,Croix-desPetty-Ch»mps,27,Petit ( 1.,1864,512,croix-despetty-ch»mps


1. For index 210774, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k62931221/f272.item.r=nspecteu.zoom, The job will be changed from `i»specteu de la gia de voirie` to `inspecteur de la grande voirie`

2. For index 345588, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6309075f/f405.item.r=Mangras.zoom, The job will be changed from `avocat à la cour impéri»le` to `avocat à la cour impériale`

3. For index 464387, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6315927h/f575.image.r=Mellion.zoom, The job will be changed from `t»illeur` to `tailleur`

4. For index 563172, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f318.image.r=Lacaille, The job will be changed from `p»piers à cigarettes` to `papiers à cigarettes`

5. For index 868349. The job will be changed from `vins et restaur»nt` to `vins et restaurant`

6. For index 1037975, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f155.item.r=Villette.zoom, The job will be changed from `cb»reutier` to `charcutier`

7. For index 1063014, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f318.item.r=Lancry.zoom, The job will be changed from `l»mpes` to `lampes`

8. For index 1064839, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f330.item.r=Menilmontant.zoom, The job will be changed from `chau»sures` to `chaussures`

9. For index 1081129, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f437.item.r=Letacq.zoom, The job will be changed from `fab. p»pies imperméables` to `fab. papiers imperméables`

10. For index 1092621. The address field is present in the job column. The job will be changed from `croix-despetty-ch»mps` to `croix-des petite-champs`

11. For index 1252646, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6391515w/f418.item.r=Dumeril.zoom, The job will be changed from `nég»c.` to `négoc.`

12. For index 1293542, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6393838j/f286.image.r=Dartevelle, The job will be changed from `f»b. de papiers peints` to `fab. de papiers peints`

13. For index 1344286, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k63959929/f229.item.r=Pastourelle.zoom, The job will be changed from `f»b. d'agrafes` to `fab. d'agrafes`

In [None]:
raw_paris_jobs.loc[210774, "métier"] = "inspecteur de la grande voirie"
raw_paris_jobs.loc[345588, "métier"] = "avocat à la cour impériale"
raw_paris_jobs.loc[464387, "métier"] = "tailleur"
raw_paris_jobs.loc[563172, "métier"] = "papiers à cigarettes"
raw_paris_jobs.loc[868349, "métier"] = "vins et restaurant"
raw_paris_jobs.loc[1037975, "métier"] = "charcutier"
raw_paris_jobs.loc[1063014, "métier"] = "lampes"
raw_paris_jobs.loc[1064839, "métier"] = "chaussures"
raw_paris_jobs.loc[1081129, "métier"] = "fab. papiers imperméables"
raw_paris_jobs.loc[1092621, "métier"] = "croix-des petite-champs"
raw_paris_jobs.loc[1252646, "métier"] = "négoc."
raw_paris_jobs.loc[1293542, "métier"] = "fab. de papiers peints"
raw_paris_jobs.loc[1344286, "métier"] = "fab. d'agrafes"

Now get the rows containing `»`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"»"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
401494,bpt6k6314752k,576,56,Lafaurie,C. *»conseiller maître à la cour des comptes,Arcade,25.,1856,386,c. »conseiller maître à la cour des comptes
530793,bpt6k6318531z,226,63,Armoni,grav. sur bij »ux,Michel-le-Comte,20.,1858,118,grav. sur bij »ux
1073708,bpt6k6333170p,523,93,Jay ( Emile),»vocat à la cour imu.,Seine,12.,1864,386,»vocat à la cour imu.
1076553,bpt6k6333170p,543,84,Langlois,soetr.--»ér:fic.,La Rochefencault,35.,1864,406,soetr.--»ér:fic.
1105572,bpt6k6333170p,733,182,Vacquant,professeur au lycée Nap:»léon,Sorbonne,2,1864,596,professeur au lycée nap:»léon
1677678,bpt6k9672776c,450,133,Fonteray,»lombier-zingueur,Nicolo,17.,1880,331,»lombier-zingueur
1703504,bpt6k9672776c,612,138,Marty,rédacteur à l'aAgence Havas »),St-Horé,163.,1880,493,rédacteur à l'aagence havas »)
1829907,bpt6k96762564,448,174,Bourne (Louis),directeur du journal a le Travail»,rue de Provence,2.,1886,263,directeur du journal a le travail»
1887612,bpt6k96762564,782,78,Mayer (Eugène),directeur de la a Lanterne»,Pyramides,8.,1886,597,directeur de la a lanterne»
2359769,bpt6k9684454n,682,20,Durand (Paul),administrateur-gérant du journal ale Gaz»,r. du Faub.-Montmartre,66.,1893,447,administrateur-gérant du journal ale gaz»


1. For index 401494, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6314752k/f386.image.r=Lafaurie.zoom, The job will be changed from `c. »conseiller maître à la cour des comptes` to `conseiller maître à la cour des comptes`

2. For index 530793, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6318531z/f118.image.r=bij, The job will be changed from `grav. sur bij »ux` to `grav. sur bijoux`

3. For index 1073708. The job will be changed from `»vocat à la cour imu.` to `avocat à la cour imp.`

4. For index 1076553, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f406.item.r=LaRochefouoault.zoom, The job will be changed from `soetr.--»ér:fic.` to `métreur-vérific.`

5. For index 1105572. The job will be changed from `professeur au lycée nap:»léon` to `professeur au lycée napoléon`

6. For index 1677678, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f331.item.r=nicolo.zoom, The job will be changed from `»lombier-zingueur` to `plombier-zingueur`

7. For index 1703504, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9672776c/f493.item.r=redacteur.zoom, The job will be changed from `rédacteur à l'aagence havas »)` to `rédacteur à l'agence havas`

8. For index 1829907. The job will be changed from `directeur du journal a le travail»` to `directeur du journal a le travail`

9. For index 1887612. The job will be changed from `directeur de la a lanterne»` to `directeur de la a lanterne`

10. For index 2359769, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f447.image.r=administrateur, The job will be changed from `administrateur-gérant du journal ale gaz»` to `administrateur-gérant du journal le gaz`

11. For index 2407433, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9684454n/f740.item.r=Quesnel.zoom, The job will be changed from `dépôt du kalodont»` to `dépôt du kalodont`

12. For index 2745936. The job will be changed from `directeur de la a vie pratique»` to `directeur de la a vie pratique`

13. For index 2885194. The job will be changed from `administrateur-gérant du journal ale gaz»` to `administrateur-gérant du journal le gaz`

14. For index 3115992, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97630871/f302.item.r=Patinot.zoom, The job will be changed from `directeur du a journal des débats»` to `directeur du a journal des débats`

15. For index 3182870. The job will be changed from `administrateur-gérant du journal e le gaz»` to `administrateur-gérant du journal le gaz`

16. For index 3332297. The job will be changed from `directeur de la a lanterne»` to `directeur de la a lanterne`

17. For index 3948914, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9775724t/f276.item.r=rochechouart.zoom, The job will be changed from `constr»-serrurier` to `constru.-serrurier`

18. For index 4003982. The job will be changed from `vin»` to `vins`

19. For index 4028357. The job will be changed from `revue a le correspondant»` to `revue a le correspondant`

20. For index 4205611, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f685.item.r=adunol.zoom, The job will be changed from `a adunol »)` to `fabr. de parfumerie`

21. For index 4268711, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k97774838/f1167.item.r=Treves.zoom, The job will be changed from `admin.-délégué de la sté f des tissus tétra»` to `admin.-délégué de la sté fse des tissus tétra`

22. For index 4379292, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9780089g/f1064.item.r=chemin, The job will be changed from `accumulateur unic»` to `accumulateur unic`

In [None]:
raw_paris_jobs.loc[401494, "métier"] = "conseiller maître à la cour des comptes"
raw_paris_jobs.loc[530793, "métier"] = "grav. sur bijoux"
raw_paris_jobs.loc[1073708, "métier"] = "avocat à la cour imp."
raw_paris_jobs.loc[1076553, "métier"] = "métreur-vérific."
raw_paris_jobs.loc[1105572, "métier"] = "professeur au lycée napoléon"
raw_paris_jobs.loc[1677678, "métier"] = "plombier-zingueur"
raw_paris_jobs.loc[1703504, "métier"] = "rédacteur à l'agence havas"
raw_paris_jobs.loc[1829907, "métier"] = "directeur du journal a le travail"
raw_paris_jobs.loc[1887612, "métier"] = "directeur de la a lanterne"
raw_paris_jobs.loc[2359769, "métier"] = "administrateur-gérant du journal le gaz"
raw_paris_jobs.loc[2407433, "métier"] = "dépôt du kalodont"
raw_paris_jobs.loc[2745936, "métier"] = "directeur de la a vie pratique"
raw_paris_jobs.loc[2885194, "métier"] = "administrateur-gérant du journal le gaz"
raw_paris_jobs.loc[3115992, "métier"] = "directeur du a journal des débats"
raw_paris_jobs.loc[3182870, "métier"] = "administrateur-gérant du journal le gaz"
raw_paris_jobs.loc[3332297, "métier"] = "directeur de la a lanterne"
raw_paris_jobs.loc[3948914, "métier"] = "constru.-serrurier"
raw_paris_jobs.loc[4003982, "métier"] = "vins"
raw_paris_jobs.loc[4028357, "métier"] = "revue a le correspondant"
raw_paris_jobs.loc[4205611, "métier"] = "fabr. de parfumerie"
raw_paris_jobs.loc[4268711, "métier"] = "admin.-délégué de la sté fse des tissus tétra"
raw_paris_jobs.loc[4379292, "métier"] = "accumulateur unic"

The remaining entries will be dealt with during tag generation.

### Dealing with `%`

- Get the rows containing `%`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"%"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
473141,bpt6k6315927h,978,156,Rodriguez,entrepr. de déménazeinent%,Faub.St-Martin,11.0,1848,629,entrepr. de déménazeinent%
1035442,bpt6k6333170p,275,106,Affichard,grav.%. mét.,Michel-le-Comte,25.0,1864,138,grav.%. mét.
1047899,bpt6k6333170p,356,43,Catois aîné,faïenc %,Pastourel,6.0,1864,219,faïenc %
1062179,bpt6k6333170p,449,145,Fauconnier (Mme),corset%,Pont-de-Lodi,5.0,1864,312,corset%
1828992,bpt6k96762564,443,145,Bouillot (I.),fabr. de presses à copier (maison E. Ravasse %),rue Lafayeļte,203.0,1886,258,fabr. de presses à copier maison e. ravasse %
2513166,bpt6k9685098r,1070,4,Office Ch. Desnos,brevets d'invention (Auguste Canivet (%A),boul. Magenta,11.0,1898,789,brevets d'invention auguste canivet (%a


The list below is larger than the 6 entries in the output above. This is due to the fact that these symbols were removed during the deletion of numbers. However, to keep using the earlier corrections, they are still applied.

1. For index 135146. The job will be changed from ` %.` to ` `

2. For index 173725. The job will be changed from ` %` to ` `

3. For index 179603. The job will be changed from ` %` to ` `

4. For index 473141, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6315927h/f629.item.r=Rodriguez.zoom. The job will be changed from `entrepr. de déménazeinent%` to `entrepr. de déménagements`.

5. For index 799324. The job will be changed from `%.` to ` `

6. For index 1002961. The job will be changed from ` %` to ` `

7. For index 1009455. The job will be changed from ` %.` to ` `

8. For index 1035442, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f138.item.r=michel.zoom. The directory reads `grav. s. mét.` (graveurs sur métaux), so the job will be changed from `grav.%. mét.` to the expanded form `graveurs sur métaux`.

9. For index 1047899, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f219.item.r=Catois.zoom. The job will be changed from `faïenc %` to `faïences`.

10. For index 1062179, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k6333170p/f312.item.r=Fauconnier.zoom. The job will be changed from `corset%` to `corsets`

11. For index 1268072. The job will be changed from ` %` to ` `

12. For index 1753985, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96727875/f260.item.r=ciete%20journal.zoom. The job will be changed from `administrateur gérant de la 30% 70. ciété du journal des notaires et des avocats` to `administrateur-gérant de la société du journal des notaires et des avocats`

13. For index 1828992, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k96762564/f258.image.r=presses.zoom. The job will be changed from `fabr. de presses à copier (maison e. ravasse %)` to `fabr. de presses à copier maison e. ravasse`

14. For index 1875579. The job will be changed from `%. familmen-` to ` `

15. For index 2413412. The job will be changed from ` %` to ` `

16. For index 2445896. The job will be changed from `greffier à la justice de paix du % arrond` to `greffier à la justice de paix du arrond.`

17. For index 2492551, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685098r/f668.image.r=Larigaldie, The job will be changed from `de la maison dupanloup % et cie` to `de la maison dupanloup`

18. For index 2513166, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9685098r/f789.image.r=Canivet, The job will be changed from `brevets d'invention auguste canivet (%a` to `brevets d'invention auguste canivet`

19. For index 2773048, the image from the directory is https://gallica.bnf.fr/ark:/12148/bpt6k9692809v/f187.image.r=Briatte.zoom. The job will be changed from `(abr. de chaînes en 0%` to `fabr. de chaînes en or`

20. For index 2866708. The job will be changed from `greffier à la justice de paix du % arrond` to `greffier à la justice de paix du arrond`

21. For index 3091994. The job will be changed from `adjoint au maire su % arrond.` to `adjoint au maire du arrond.`

22. For index 3311253. The job will be changed from `secrétaire-trésorier du bureau de bienfaisance du % arrond.` to `secrétaire-trésorier du bureau de bienfaisance du arrond.`

23. For index 3453989. The job will be changed from `% de l'institut` to `de l'institut`

24. For index 4071113. The job will be changed from ` quai st-bernard. %. téléph . .` to `quai st-bernard`

25. For index 4159398. The job will be changed from `%` to ` `

26. For index 4159576. The job will be changed from `. %. t. gob. . .` to ` `

27. For index 4171453. The job will be changed from `. %. t. gob. . .` to ` `

28. For index 4185002. The job will be changed from `. % et inter.` to ` `

29. For index 4212776. The job will be changed from `% boucher` to `boucher`

30. For index 4245341. The job will be changed from `. %.  boulanger` to `boulanger`

31. For index 4263444. The job will be changed from `r. de labor %. t. wagr. . .` to `r. de labor wagr.`

32. For index 4267266. The job will be changed from `. %. nas l. office central` to `nas l. office central`

33. For index 4289157. The job will be changed from ` r. des prairies et r. stendhal %. t. roq. . .` to ` r. des prairies et r. stendhal`

34. For index 4315361. The job will be changed from ` imp. garnier. %.` to ` imp. garnier.`

35. For index 4393261. The job will be changed from `. % et . .` to ` `

36. For index 4402566. The job will be changed from `. %.` to ` `

In [None]:
raw_paris_jobs.loc[135146, "métier"] = ""
raw_paris_jobs.loc[173725, "métier"] = ""
raw_paris_jobs.loc[179603, "métier"] = ""
raw_paris_jobs.loc[473141, "métier"] = "entrepr. de déménagements"
raw_paris_jobs.loc[799324, "métier"] = ""
raw_paris_jobs.loc[1002961, "métier"] = ""
raw_paris_jobs.loc[1009455, "métier"] = ""
raw_paris_jobs.loc[1035442, "métier"] = "graveurs sur métaux"
raw_paris_jobs.loc[1047899, "métier"] = "faïences"
raw_paris_jobs.loc[1062179, "métier"] = "corsets"
raw_paris_jobs.loc[1268072, "métier"] = ""
raw_paris_jobs.loc[1753985, "métier"] = "administrateur-gérant de la société du journal des notaires et des avocats"
raw_paris_jobs.loc[1828992, "métier"] = "fabr. de presses à copier maison e. ravasse"
raw_paris_jobs.loc[1875579, "métier"] = ""
raw_paris_jobs.loc[2413412, "métier"] = ""
raw_paris_jobs.loc[2445896, "métier"] = "greffier à la justice de paix du arrond."
raw_paris_jobs.loc[2492551, "métier"] = "de la maison dupanloup"
raw_paris_jobs.loc[2513166, "métier"] = "brevets d'invention auguste canivet"
raw_paris_jobs.loc[2773048, "métier"] = "fabr. de chaînes en or"
raw_paris_jobs.loc[2866708, "métier"] = "greffier à la justice de paix du arrond."
raw_paris_jobs.loc[3091994, "métier"] = "adjoint au maire du arrond."
raw_paris_jobs.loc[3311253, "métier"] = "secrétaire-trésorier du bureau de bienfaisance du arrond."
raw_paris_jobs.loc[3453989, "métier"] = "de l'institut"
raw_paris_jobs.loc[4071113, "métier"] = "quai st-bernard"
raw_paris_jobs.loc[4159398, "métier"] = ""
raw_paris_jobs.loc[4159576, "métier"] = ""
raw_paris_jobs.loc[4171453, "métier"] = ""
raw_paris_jobs.loc[4185002, "métier"] = ""
raw_paris_jobs.loc[4212776, "métier"] = "boucher"
raw_paris_jobs.loc[4245341, "métier"] = "boulanger"
raw_paris_jobs.loc[4263444, "métier"] = "r. de labor wagr."
raw_paris_jobs.loc[4267266, "métier"] = "nas l. office central"
raw_paris_jobs.loc[4289157, "métier"] = "r. des prairies et r. stendhal"
raw_paris_jobs.loc[4315361, "métier"] = "imp. garnier."
raw_paris_jobs.loc[4393261, "métier"] = ""
raw_paris_jobs.loc[4402566, "métier"] = ""
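
The cell above assigns one correction per line. An equivalent, more compact form is sketched here on a toy frame: the corrections are kept in a dictionary keyed by row index, applied in a loop, and followed by a check that no `%` is left. The names `toy_jobs` and `percent_corrections` are illustrative assumptions; the three corrections are taken from the list above.

```python
import pandas as pd

# Toy stand-in for raw_paris_jobs; the row labelled 0 is a made-up clean entry.
toy_jobs = pd.DataFrame(
    {"métier": ["entrepr. de déménazeinent%", "faïenc %", "corset%", "boucher"]},
    index=[473141, 1047899, 1062179, 0],
)

percent_corrections = {
    473141: "entrepr. de déménagements",
    1047899: "faïences",
    1062179: "corsets",
}

# Apply every manual correction to the row with the matching index label.
for row_index, corrected_job in percent_corrections.items():
    toy_jobs.loc[row_index, "métier"] = corrected_job

# Sanity check: collect any row that still contains the character.
leftover = toy_jobs[toy_jobs["métier"].str.contains("%", regex=False)]
print(leftover.empty)
```

Keeping the corrections in one dictionary makes it easy to review them side by side with the numbered list, and the final check flags any entry that was missed.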

### Dealing with `'`

- Get rows with `'`

In [None]:
raw_paris_jobs[(raw_paris_jobs["métier"].str.contains(r"'"))]

Unnamed: 0,doc_id,page,row,Nom,métier_original,rue,numéro,annee,gallica_page,métier
13,bpt6k6282019m,144,24,Abbatucci # (Th.),maître des requêtes au conseil d'Etat,pl. Vendôme,ll et 13.,1855,72,maître des requêtes au conseil d'etat
14,bpt6k6282019m,144,25,Abbaye-au-Bois (communauté del'),église succursale de St-Thomas-d'Aquin,Sèvres,16.,1855,72,église succursale de st-thomas-d'aquin
19,bpt6k6282019m,144,31,Abel de Pujol * (de l'Institut),peintre d'hist.,Albouy,14.,1855,72,peintre d'hist.
46,bpt6k6282019m,144,65,Acar,premier pharmacien de l'Empereur,St-Honoré,313.,1855,72,premier pharmacien de l'empereur
84,bpt6k6282019m,144,110,Adam,ancien chef à l'enregistr.,Sentier,9.,1855,72,ancien chef à l'enregistr.
...,...,...,...,...,...,...,...,...,...,...
4406092,bpt6k9780089g,1607,108,Zoonens,fabr. d'ébénisterie d'art,av. de Taillebourg,9.,1922,1268,fabr. d'ébénisterie d'art
4406105,bpt6k9780089g,1607,141,Zuber ( A) (Ateliers),réparations d'automobiles,av. de Choisy,24.,1922,1268,réparations d'automobiles
4406106,bpt6k9780089g,1607,143,Zuber,vente d'immeubles,r. de Coulmiers,33.,1922,1268,vente d'immeubles
4406141,bpt6k9780089g,1607,216,Zurenger,agent représ. pour l'exportation,r. de Paradis,54.,1922,1268,agent représ. pour l'exportation


Only common patterns of occurrence, such as an apostrophe surrounded by spaces, are removed now. The rest of them will be dealt with during tag generation.

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"''", r"'", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s'", r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"'\s", r' ', regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(^|\s)'(\s|$)", r' ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
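
The effect of these rules can be checked on a few strings by applying the same patterns with `re.sub`. This is a sketch for illustration; the function name `clean_apostrophes` and the sample strings (apart from `peintre d'hist.`) are made up.

```python
import re

def clean_apostrophes(job: str) -> str:
    # Same substitutions as the cell above, in the same order.
    job = re.sub(r"''", "'", job)             # doubled apostrophe
    job = re.sub(r"\s'", " ", job)            # apostrophe after whitespace
    job = re.sub(r"'\s", " ", job)            # apostrophe before whitespace
    job = re.sub(r"(^|\s)'(\s|$)", " ", job)  # isolated apostrophe
    return re.sub(r"\s\s+", " ", job)         # collapse repeated whitespace

examples = [
    "peintre d'hist.",       # regular elision: kept
    "fabr. d''ébénisterie",  # doubled apostrophe
    "vins ' liqueurs",       # apostrophe surrounded by spaces
    "'tailleur",             # attached to a word at the start: not matched
]
cleaned = [clean_apostrophes(s) for s in examples]
for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The last sample is left unchanged, which is the kind of case deferred to tag generation.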

### Dealing with `.`

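- The `.` was used in the bottins to mark many abbreviations. Here we do the following:

1. Collapse runs of two or more dots into a single dot.
2. As mentioned earlier, `g.`, `a.`, `o.`, and `c.` were used as part of awards (along with `nc.`). They are removed at the start of the entry, and `nc.` is removed wherever it stands as a separate token. The code below also removes a leading `i.` and the token `anc.`.
3. Remove dots at the start of the entry, at the end when preceded by a space, and when surrounded by spaces.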

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\.\.+", r".", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^g\.", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^a\.\s", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^i\.", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^o\.\s", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^c\.\s", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\.$", r"", regex=True)
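# Replace with a space (not "") so that the words on either side are not joined;
# repeated spaces are collapsed at the end of this cell.
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(^|\s)anc\.(\s|$)", r" ", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(^|\s)nc\.(\s|$)", r" ", regex=True)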


raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^\.\s", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^\.", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^\.$", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"^\.", r"", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\. \.', r'. ', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s\.\s', r' ', regex=True)

raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)
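
As a rough illustration of the three steps listed above, the following sketch applies a subset of the dot rules to made-up strings, keeping the order used in the cell. The function name `clean_dots` and the samples are hypothetical; `re.sub` stands in for `Series.str.replace(..., regex=True)`.

```python
import re

def clean_dots(text: str) -> str:
    """Apply a subset of the dot rules from the cell above, in the same order."""
    text = re.sub(r"\.\.+", ".", text)   # step 1: collapse repeated dots
    text = re.sub(r"^g\.", "", text)     # step 2: leading award markers
    text = re.sub(r"^a\.\s", "", text)
    text = re.sub(r"^o\.\s", "", text)
    text = re.sub(r"^c\.\s", "", text)
    text = re.sub(r"\s\.$", "", text)    # step 3: trailing " ."
    text = re.sub(r"^\.\s", "", text)    # step 3: leading ". "
    text = re.sub(r"^\.", "", text)      # step 3: leading "."
    return text

# Made-up samples (hypothetical, not taken from the dataset)
samples = ["g. fabr.. de chapeaux", "c. épicier .", ". tailleur"]
cleaned = [clean_dots(s) for s in samples]
print(cleaned)
# [' fabr. de chapeaux', 'épicier', 'tailleur']
```

Note that removing `g.` leaves a leading space in the first sample; it is only dropped by the final whitespace-stripping step of the notebook.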

### Dealing with `-`

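- Normalize the `-`: collapse double hyphens, remove the spaces around a hyphen, and drop hyphens at the start or end of the entry. The cell also removes `.;` sequences.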

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'--', r'-', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s-\s', r'-', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'^-\s', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s-$', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'^-$', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\.\s-', r'.-', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\.;', r'', regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r'\s-\.', r'', regex=True)
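
Since every step in this notebook is a pattern and a replacement, the hyphen rules can also be written as a table that is applied in a loop. This is only a sketch of that idea: the names `HYPHEN_RULES` and `apply_rules` and the sample strings are assumptions made for illustration, while the patterns themselves are copied from the cell above.

```python
import re

# (pattern, replacement) pairs, in the order used by the cell above
HYPHEN_RULES = [
    (r"--", "-"),       # double hyphen -> single hyphen
    (r"\s-\s", "-"),    # hyphen surrounded by whitespace -> bare hyphen
    (r"^-\s", ""),      # leading "- "
    (r"\s-$", ""),      # trailing " -"
    (r"^-$", ""),       # entry consisting of a single hyphen
    (r"\.\s-", ".-"),   # ". -" -> ".-"
    (r"\.;", ""),       # ".;" sequence
    (r"\s-\.", ""),     # " -." sequence
]

def apply_rules(text: str, rules) -> str:
    """Apply each (pattern, replacement) pair in order to one string."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text

# Made-up samples (hypothetical, not taken from the dataset)
samples = ["porte -- plume", "- tailleur", "fab. -lampes", "-"]
cleaned = [apply_rules(s, HYPHEN_RULES) for s in samples]
print(cleaned)
# ['porte-plume', 'tailleur', 'fab.-lampes', '']
```

Keeping the rules in a list makes the order explicit, which matters here because later patterns rely on earlier ones (for example, `--` must be collapsed before `\s-\s` can match).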

### Dealing with `nc`

- Remove the `nc` markers

`NC` (in a box) is used in the bottins to indicate `Notable Commerçant`.

We remove all such occurrences, along with three-character tokens that start with `nc` (the third character is generally a misreading of the box surrounding the NC). Additionally, `anc`, the abbreviation of *ancien* (former), is also removed.

In [None]:
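# The unescaped "." in the first pattern is intentional: it matches whatever character the box around NC was misread as.
# Replace with a space (not "") so that the words on either side are not joined; extra spaces are removed in the cells below.
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(^|\s)nc.(\s|$)", r" ", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(^|\s)nc(\s|$)", r" ", regex=True)
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(^|\s)anc(\s|$)", r" ", regex=True)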
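
Two details of these patterns are easy to miss: the unescaped `.` after `nc` matches any single character, and the replacement string decides whether the words around a removed marker stay separated. The sketch below shows both on made-up strings; `NC_PATTERNS` and `drop_markers` are hypothetical names, and the replacement is passed in explicitly so that both choices can be compared.

```python
import re

NC_PATTERNS = [
    r"(^|\s)nc.(\s|$)",   # "nc" plus any one character (unescaped "." is a wildcard)
    r"(^|\s)nc(\s|$)",    # bare "nc"
    r"(^|\s)anc(\s|$)",   # bare "anc"
]

def drop_markers(text: str, repl: str) -> str:
    """Remove the nc/anc markers, substituting `repl` for each match."""
    for pattern in NC_PATTERNS:
        text = re.sub(pattern, repl, text)
    return text

# Made-up samples (hypothetical, not taken from the dataset)
misread_box = drop_markers("nc| épicier", "")
joined = drop_markers("boulanger nc pâtissier", "")
separated = drop_markers("boulanger nc pâtissier", " ")

print(repr(misread_box))  # 'épicier'
print(repr(joined))       # 'boulangerpâtissier'
print(repr(separated))    # 'boulanger pâtissier'
```

With an empty replacement, a marker in the middle of an entry fuses its neighbours into one word; replacing with a space avoids this, at the cost of stray leading or trailing spaces that the later whitespace steps clean up.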

### Dealing with `,`

Although there are no commas in the dataset, the next cell removes stand-alone commas so that the pipeline stays generic.

In [None]:
# Replace with a space (not "") so that the words on either side of the comma are not joined.
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"(^|\s),(\s|$)", r" ", regex=True)

### Collapsing repeated spaces

In [None]:
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.replace(r"\s\s+", r' ', regex=True)

### Removing Spaces at the start and end

In [None]:
# Strip leading and trailing whitespace (at most one character on each side after the previous cell).
raw_paris_jobs["métier"] = raw_paris_jobs["métier"].str.strip()

## Saving the data to a csv file after cleaning the special characters

In [None]:
raw_paris_jobs.to_csv("./../data/intermediate_steps/all_paris_jobs_splchar_cleaned.csv", index=False)