# Adding correct page numbers

## Summary

The dataset produced by the previous project describes entries extracted from scanned documents. For each entry, it records the id of the document, the page number within the document, and the row on that page. However, this page number does not match the page number needed to open the document in Gallica. This notebook computes the adjusted page number and stores it in a new column named `gallica_page`.

The dataset is stored in `data/from_previous_project/people_of_paris_1839_1922.csv`. Its page numbers are offset by a value that depends on the document. The file `data/from_previous_project/repertoires.xlsx` contains the starting page number of each document in Gallica. Subtracting the start page of a document in `repertoires.xlsx` from the page number in `people_of_paris_1839_1922.csv` gives the page number used to access the document on Gallica.
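As a minimal sketch of this arithmetic, consider the first row of the result shown later in this notebook: page 144 of document `bpt6k6282019m` maps to Gallica page 72. The start page of 72 used here is inferred from those two numbers, not read from `repertoires.xlsx`, and the names `start_page` and `to_gallica_page` are illustrative only.

```python
# Hypothetical start-page lookup. The value 72 is inferred from the output
# shown later in this notebook (144 -> 72), not read from repertoires.xlsx.
start_page = {"bpt6k6282019m": 72}


def to_gallica_page(page: int, doc_id: str) -> int:
    """Subtract the document's start page from the dataset page number."""
    return page - start_page[doc_id]


gallica_page = to_gallica_page(144, "bpt6k6282019m")
print(gallica_page)  # 72
```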

## Imports
Only the pandas module is used to read and write the data. Reading the Excel file relies on the `openpyxl` engine, which must be installed alongside pandas.

In [1]:
import pandas as pd

In [2]:
# Read the first sheet of the file into a DataFrame.
# The last row of the sheet is a summary of the rows above, so it is skipped.
repertoires = pd.read_excel("./../data/from_previous_project/repertoires.xlsx", skipfooter=1, engine='openpyxl')


# read the person profession dataset
# header=0 discards the header row of the file and replaces it with `names`;
# the dtype keys must match those names exactly.
people_of_paris = pd.read_csv("./../data/from_previous_project/people_of_paris_1839_1922.csv", names=["doc_id", "page", "row", "Nom", "métier_original", "rue", "numéro", "annee"],
                             dtype={"doc_id":'str', "page":'int', "row":'str', "Nom":'str', "métier_original":'str', "rue":'str', "numéro":'str', "annee":'str'},
                             header=0, encoding="utf-8")

For some years, the data on Gallica is split across two documents. Documents are therefore identified not by year but by a unique id, the ark identifier. The ark identifier of each document is the last part of the URL in the `lien_source` column of `repertoires`. First, it is extracted into a new column by splitting the URL string.

Then a dictionary is created with the ark identifier as the key and the start page number as the value.
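To illustrate the split, here is a minimal sketch on a single URL. The host and path prefix of `example_url` are assumptions about what `lien_source` contains. Only the ark identifier itself is taken from this notebook's output.

```python
# Illustrative URL only: the host and path prefix are assumptions about the
# content of lien_source; the identifier comes from this notebook's output.
example_url = "https://gallica.bnf.fr/ark:/12148/bpt6k6282019m"

# Same operation as in the notebook cell: keep the text after the last "/".
ark_iden = str(example_url).split("/")[-1]
print(ark_iden)  # bpt6k6282019m

# Caveat: with a trailing slash the last segment is empty, so stripping
# trailing slashes first is a cheap safeguard.
with_slash = example_url + "/"
naive = with_slash.split("/")[-1]
safe = with_slash.rstrip("/").split("/")[-1]
print(repr(naive), safe)  # '' bpt6k6282019m
```

If a URL in `lien_source` carried anything after the identifier (a trailing slash or a page suffix, for instance), the extracted key would not match the `doc_id` values in the dataset, so it is worth checking a few extracted identifiers by eye.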

In [3]:
repertoires["ark_iden"] = repertoires["lien_source"].apply(lambda x : str(x).split("/")[-1]).astype(str)

# the start page is cast to int so that it can be used in arithmetic
repertoires["vue_debut"] = repertoires["vue_debut"].astype(int)

# the dictionary of ark identifier (document id) to the starting page
vue_mapping = pd.Series(repertoires.vue_debut.values,index=repertoires.ark_iden).to_dict()

In [4]:
def page_gallica(curr_page: int, doc_id: str) -> int:
    """The function accepts the current page number and document id (ark identifier) and returns the page number on Gallica.
    """
    return curr_page - vue_mapping[doc_id]

people_of_paris["gallica_page"] = people_of_paris.apply(lambda x: page_gallica(x.page, x.doc_id), axis=1)
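The row-wise `apply` above calls a Python function once per row, which can be slow on a dataset of several million rows, and it raises `KeyError` if a `doc_id` is missing from `vue_mapping`. The sketch below shows one possible alternative, not the method used in this notebook: a vectorized lookup with `Series.map`. The toy data is an assumption. The document ids come from this notebook's output, the start pages are inferred from it (144 → 72 and 1607 → 1268), and `unknown_doc` is a made-up id added to show the missing-key case.

```python
import pandas as pd

# Toy stand-ins for the real objects. Start pages are inferred from the
# notebook output, and "unknown_doc" is a made-up id.
vue_mapping = {"bpt6k6282019m": 72, "bpt6k9780089g": 339}
people = pd.DataFrame({
    "doc_id": ["bpt6k6282019m", "bpt6k9780089g", "unknown_doc"],
    "page": [144, 1607, 10],
})

# Series.map looks up each doc_id in the dictionary; ids that are not in
# the dictionary become NaN instead of raising KeyError.
start_pages = people["doc_id"].map(vue_mapping)

# The nullable Int64 dtype keeps integer page numbers while allowing
# missing values.
people["gallica_page"] = (people["page"] - start_pages).astype("Int64")

# List the document ids for which no start page was found.
missing = people.loc[start_pages.isna(), "doc_id"].unique()
print(people)
print("doc_ids without a start page:", list(missing))
```

Unlike the `apply` version, this variant does not stop at the first unknown `doc_id`; it leaves the value missing, so the `missing` check is what surfaces such rows.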

In [5]:
people_of_paris

| | doc_id | page | row | Nom | métier_original | rue | numéro | annee | gallica_page |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bpt6k6282019m | 144 | 0 | Aaron | bronzes | passage Choiseal | 72 et 74. | 1855 | 72 |
| 1 | bpt6k6282019m | 144 | 1 | Aaron (Mic.) | manuf. de porcelaines | Bondy | 30. | 1855 | 72 |
| 2 | bpt6k6282019m | 144 | 3 | Abadie | architecte | Provence | 7. | 1855 | 72 |
| 3 | bpt6k6282019m | 144 | 5 | Abadie | tabac et estamin. | Ménilmontant | 158. | 1855 | 72 |
| 4 | bpt6k6282019m | 144 | 6 | Abanse | instituteur | Sts-Pères | 30. | 1855 | 72 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4406159 | bpt6k9780089g | 1607 | 258 | Zwobada (Ch.) et Clo | serruriers | r. d'Amsterdam | 47. (80). T. Centr. 77. 53. | 1922 | 1268 |
| 4406160 | bpt6k9780089g | 1607 | 259 | Zygomalas | perles fausses | r. de Constantinople | 28. | 1922 | 1268 |
| 4406161 | bpt6k9780089g | 1607 | 261 | Zysapel | restaurant | r. des Ecouffes | 14. | 1922 | 1268 |
| 4406162 | bpt6k9780089g | 1607 | 263 | Zyssmann | bar | r. de Pivoli | 40. | 1922 | 1268 |


In [6]:
people_of_paris.to_csv("./../data/intermediate_steps/all_paris_jobs_with_gallica_pageno.csv", index=False)