Manon FLEURY\
Yi XU \
Professor: Hugues de Mazancourt

Project NLP in Industry: Publication date extraction

# Introduction

Determining the publication date of official documents is crucial for monitoring systems that track updates from French city councils, regions, and other entities. The challenge lies in the fact that documents appearing on websites may not reflect their actual publication dates. Factors such as website restructuring, new regulations, or editorial practices can result in significant discrepancies between the upload date and the true publication date.

This project aims to address this challenge by developing a system to extract or predict the publication date of documents. Using a gold standard dataset, we will create and evaluate an extraction and prediction system. The ultimate goal is to ensure accurate identification of publication dates, providing reliable data for use in watch systems.

We will present here two approaches: one using regular expressions (regex) and one using a fine-tuned model.

**First step: Create a gold standard**

We created a gold standard by manually annotating the first 500 documents. The gold dates are available as a Hugging Face dataset, "maribr/publication_dates_fr" (https://huggingface.co/datasets/maribr/publication_dates_fr).

## 1. Date extraction with regex

**Analysis of the problem**

Building the gold standard was a good way to get to know the data. The following observations guided the implementation:

- Where to find publication dates? Usually at the very beginning or the very end of the document.

- Which words indicate publication dates? For example **"publié le"**, **"date de publication"**, **"affiché le"**, **"date de mise en ligne"**, **"disponible depuis"**, **"fait à [...], le"**.

- Most documents, however, contain no such explicit terms. One fallback is to extract all the dates in the document and select the latest one. The latest date may, however, refer to a measure that takes effect in the future. We therefore use the context around each date: a date is discarded if its sentence contains "**annonce**" (verb "annoncer") or "**à compter du**", both of which introduce a future measure, or if the sentence is in the future tense.

- One problem remains unsolved for the moment: when a PDF file has no working URL for the 'text version', we fall back on a PDF text reader (PyPDF2). In rare cases in our dataset the PDF itself is not readable, and since there is no text version URL either, we have no means to extract and analyze the text. OCR would be a possible extension, but the few tools we tried did not give reliable results.
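The context filter described in the list above can be sketched as follows. This is a minimal sketch rather than the project's implementation: the helper name `mentions_future` and the marker list are assumptions made for illustration, and the future-tense test is a rough suffix heuristic (French simple-future endings after an "r") that will produce both false positives and misses.

```python
import re

# Assumed markers of a future measure (taken from the observations above).
FUTURE_MARKERS = ("annonce", "à compter du")

# Rough heuristic for the French simple future: a word ending in
# -rai, -ras, -ra, -rons, -rez or -ront. It also matches unrelated words
# (for example "opéra"), so it is only a starting point.
FUTURE_TENSE = re.compile(r"\b\w+r(?:ai|as|a|ons|ez|ont)\b", re.IGNORECASE)

def mentions_future(sentence: str) -> bool:
    """Return True if the sentence looks like it announces a future measure."""
    lowered = sentence.lower()
    if any(marker in lowered for marker in FUTURE_MARKERS):
        return True
    return FUTURE_TENSE.search(lowered) is not None
```

A date would then be dropped whenever `mentions_future` returns True for the sentence that contains it. A part-of-speech tagger would likely be more reliable than this suffix test.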

Conclusions:

- With **regular expressions**, we search for dates that follow words related to publication ("publié le", "mise en ligne"), with the additional criterion that these dates appear at the very beginning or the very end of the document. (A "publié le" in the middle of a document sometimes refers to another document.)

- Since this search does not apply to every document (explicit words like "publié le" are not frequent), we process the remaining documents with a **global search** for dates (still at the beginning or the end of the document), **removing** those that appear in a sentence in the **future tense** or containing words like "**annonce**" and "**à compter du**".

- Finally, we compare these dates with the date written in the url (if there is one) and choose the most recent date.
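The first conclusion above (a publication keyword followed by a date, kept only near the start or end of the document) can be illustrated with a small self-contained sketch. The function name `keyword_dates`, the shortened keyword list, and the restriction to dd/mm/yyyy dates are simplifying assumptions. The window sizes (300 words at the start, 150 at the end) and the 75-character lookahead mirror the values used in the notebook code.

```python
import re

# Simplified patterns: a publication keyword, then a dd/mm/yyyy date shortly after it.
KEYWORD = re.compile(r"publié le|mise en ligne|affiché le", re.IGNORECASE)
NUMERIC_DATE = re.compile(r"(\d{1,2})\s*/\s*(\d{1,2})\s*/\s*(\d{4})")

def keyword_dates(text, head=300, tail=150, lookahead=75):
    """Return dates that follow a publication keyword near the start or end of the text."""
    n_words = len(text.split())
    found = []
    for kw in KEYWORD.finditer(text):
        position = len(text[:kw.end()].split())  # word index of the keyword
        if position > head and position < n_words - tail:
            continue  # the keyword sits in the middle of the document
        # Only search the few characters that follow the keyword
        date = NUMERIC_DATE.search(text, kw.end(), kw.end() + lookahead)
        if date:
            day, month, year = date.groups()
            found.append(f"{int(day):02d}/{int(month):02d}/{year}")
    return found
```

For example, a keyword in the first words of a long text is kept, while the same keyword placed 500 words into a 1000-word text is ignored.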


**Preparation of the data**

a. Load our gold dataset (made on Hugging Face)

In [None]:
import pandas as pd

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

gold_dataset = load_dataset("maribr/publication_dates_fr")


In [None]:
gold_dataset

DatasetDict({
    train: Dataset({
        features: ['Text', 'Gold published date', 'url'],
        num_rows: 500
    })
})

The result is a DatasetDict, so we select its only split (train):

In [None]:
from datasets import DatasetDict
train_dataset = gold_dataset["train"]
train_dataset

Dataset({
    features: ['Text', 'Gold published date', 'url'],
    num_rows: 500
})

And then we can use pandas to work on a dataframe:

In [None]:
df_gold = train_dataset.to_pandas()
print(df_gold.head())

                                                Text Gold published date  \
0  PROCES-VERBAL DE LA REUNION PUBLIQUE\nDU CONSE...          16/01/2023   
1  CONSEIL COMMUNAUTAIRE DU\n25 JANVIER 2023\nPRO...          25/01/2023   
2  Date de mise en ligne de\nl’acte : 02/ 02/2023...          02/02/2023   
3  Envoyé en préfecture le 26/01/2023\nReçu en pr...          26/01/2023   
4       \nFait à Bourg-en-Bresse, le 23 janvier 2023          23/01/2023   

                                                 url  
0  http://www.ville-saint-ay.fr/userfile/fichier-...  
1  https://www.gatine-racan.fr/wp-content/uploads...  
2  https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf  
3  https://www.fier-et-usses.com/cms_viewFile.php...  
4  https://www.grandbourg.fr/cms_viewFile.php?idt...  


In [None]:
df_gold['Gold published date']

Unnamed: 0,Gold published date
0,16/01/2023
1,25/01/2023
2,02/02/2023
3,26/01/2023
4,23/01/2023
...,...
495,24/01/2024
496,09/01/2024
497,22/11/2022
498,21/12/2023


b. Load the original dataset.csv file

We create a dataframe of the dataset.csv file containing the doc_id, url, cache, text version, nature, published, entity, entity_type.

In [None]:
df = pd.read_csv("dataset.csv", nrows=500) # we only take the first 500 rows since we did the gold dates for the first 500 rows
df.head(2)

Unnamed: 0,doc_id,url,cache,text version,nature,published,entity,entity_type
0,6357/71845_1698228833-PV---Conseil-Municipal-1...,http://www.ville-saint-ay.fr/userfile/fichier-...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,16/01/2023,Saint-Ay,Commune
1,2515/213c7_proces-verbal-25-01-2023.pdf,https://www.gatine-racan.fr/wp-content/uploads...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,25/01/2023,CC de Gâtine-Racan,Intercommunalité


1. Explicit publication dates in the document

Implementation method:

- We write an extract_text_from_pdf_url function that returns the text contained in the PDF, for the cases where the text version url (in our dataframe df) is not working or not available.

In [None]:
!pip install PyPDF2

In [None]:
# 1. Function to extract text from a PDF (called inside the next function, 'extract_publication_date')

import requests
from io import BytesIO
from PyPDF2 import PdfReader

def extract_text_from_pdf_url(url):
    try:
        # Download the PDF from the url
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Raise an error if the request failed

        # Wrap the PDF bytes in an in-memory binary stream
        pdf_content = BytesIO(response.content)

        # Read the PDF with PyPDF2
        reader = PdfReader(pdf_content)
        text = ""
        for page in reader.pages:
            text += page.extract_text() or ""  # Text of each page (may be empty)

        return text
    except Exception as e:
        # Return an empty string so that the error message is never searched for dates
        print(f"Error while extracting text from {url}: {e}")
        return ""

In [None]:
import requests
import re
import pandas as pd
from datetime import datetime

# Dictionary mapping French month names to month numbers
mois_to_num = {
    "janvier": "01", "février": "02", "mars": "03", "avril": "04", "mai": "05", "juin": "06",
    "juillet": "07", "août": "08", "septembre": "09", "octobre": "10", "novembre": "11", "décembre": "12"
}

# Regular expression that detects dates ("12 décembre 2023" or "12/12/2023")
pattern_date = r'(\d{1,2})\s*(janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre)\s*(\d{4})|(\d{1,2})\s*\/\s*(\d{1,2})\s*\/\s*(\d{4})'

# Terms that suggest a publication date
pattern_pub = r"(publié|paru|date de publication|disponible depuis|affiché|affichage|mise en ligne|reçu|réception|télétransmission|adopté|rédigé|approuvé|révision|modification)"

def extract_publication_date(text_url, pdf_url):
    # HTTP request to fetch the text version of the document
    response = requests.get(text_url)
    if response.status_code != 200:  # Plain text not available: read the PDF of this row instead
        content = extract_text_from_pdf_url(pdf_url)
    else:  # The text version is available
        content = response.text

    # Split the content into words
    words = content.split()

    # Look for dates that follow a publication term (case-insensitive)
    dates_pub = []
    for match in re.finditer(pattern_pub, content, re.IGNORECASE):
        # Character position right after the publication term
        start_index = match.end()

        # Word position of the term in the document
        word_position = len(content[:start_index].split())

        # Keep the term only if it is in the first 300 or the last 150 words
        if word_position <= 300 or word_position >= len(words) - 150:
            # Only look at the 75 characters that follow the term
            sub_text = content[start_index:start_index + 75]
            date_matches = re.findall(pattern_date, sub_text, re.IGNORECASE)

            for date_match in date_matches:
                if date_match[0]:  # Format "12 décembre 2023"
                    jour = date_match[0].zfill(2)
                    mois = mois_to_num[date_match[1].lower()]
                    annee = date_match[2]
                    dates_pub.append(f"{jour}/{mois}/{annee}")
                elif date_match[3]:  # Format "12/12/2023"
                    jour = date_match[3].zfill(2)
                    mois = date_match[4].zfill(2)
                    annee = date_match[5]
                    dates_pub.append(f"{jour}/{mois}/{annee}")

    # Convert the dates to datetime objects, skipping impossible dates such as 31/02/2023
    dates_pub_datetime = []
    for date in dates_pub:
        try:
            dates_pub_datetime.append(datetime.strptime(date, "%d/%m/%Y"))
        except ValueError:
            pass

    # If several dates are found, return the most recent one as a string
    if dates_pub_datetime:
        return max(dates_pub_datetime).strftime("%d/%m/%Y")
    return None

# Apply the function row by row: text version url first, PDF url as a fallback
df['extracted_publication_date'] = df.apply(
    lambda row: extract_publication_date(row['text version'], row['url']), axis=1
)

# Display the results
print(df[['url', 'extracted_publication_date']])

                                                   url  \
0    http://www.ville-saint-ay.fr/userfile/fichier-...   
1    https://www.gatine-racan.fr/wp-content/uploads...   
2    https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf   
3    https://www.fier-et-usses.com/cms_viewFile.php...   
4    https://www.grandbourg.fr/cms_viewFile.php?idt...   
..                                                 ...   
495  https://plombieres-les-dijon.fr/wp-content/upl...   
496  https://www.orne.gouv.fr/contenu/telechargemen...   
497  https://www.vosges.gouv.fr/contenu/telechargem...   
498  http://www.grandchambery.fr/fileadmin/mediathe...   
499  http://www.hauts-de-seine.fr/fileadmin/user_up...   

    extracted_publication_date  
0                         None  
1                         None  
2                   02/02/2023  
3                   26/01/2023  
4                         None  
..                         ...  
495                       None  
496                       None  
497   

Let's display the rows where a publication date has actually been extracted:

In [None]:
valeurs_non_none = df.loc[df['extracted_publication_date'].notna(), ['extracted_publication_date', 'url']]
valeurs_non_none

Unnamed: 0,extracted_publication_date,url
2,02/02/2023,https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf
3,26/01/2023,https://www.fier-et-usses.com/cms_viewFile.php...
6,13/02/2023,http://www.villeneuve-tolosane.fr/ad_attachmen...
8,16/02/2023,https://www.guingamp-paimpol-agglo.bzh/wp-cont...
9,22/02/2023,https://www.ales.fr/wp-content/uploads/2023/04...
...,...,...
487,16/07/2020,https://www.cc-flandrelys.fr/images/2-VIVRE-ET...
488,16/07/2020,https://www.cc-flandrelys.fr/images/2-VIVRE-ET...
489,16/07/2020,https://www.cc-flandrelys.fr/images/2-VIVRE-ET...
491,26/12/2023,https://www.gennesvaldeloire.fr/medias/2024/03...


We obtain dates for 150 documents out of 500. This is expected: not every document contains an explicit term related to publication, and this first function returns None unless such a term appears near the beginning or the end of the document and is followed by a date.

Let's analyze this result by comparing it with our gold dates. To do so, we merge the two dataframes on the 'url' column:

In [None]:
# Merge the DataFrames df and df_gold on the 'url' column
df_merged = pd.merge(df.loc[df['extracted_publication_date'].notna(), ['url', 'extracted_publication_date']],
                     df_gold[['url', 'Gold published date']],
                     on='url',
                     how='inner')

# Compare the dates
df_merged['date_match'] = df_merged['extracted_publication_date'] == df_merged['Gold published date']

# Display the results
print(df_merged.head())

# Compute the percentage of 'True' values in the 'date_match' column
percentage_true = df_merged['date_match'].mean() * 100
percentage_true



                                                 url  \
0  https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf   
1  https://www.fier-et-usses.com/cms_viewFile.php...   
2  http://www.villeneuve-tolosane.fr/ad_attachmen...   
3  https://www.guingamp-paimpol-agglo.bzh/wp-cont...   
4  https://www.ales.fr/wp-content/uploads/2023/04...   

  extracted_publication_date Gold published date  date_match  
0                 02/02/2023          02/02/2023        True  
1                 26/01/2023          26/01/2023        True  
2                 13/02/2023          13/02/2023        True  
3                 16/02/2023          16/02/2023        True  
4                 22/02/2023          22/02/2023        True  


62.66666666666667

On the documents for which a date was extracted, the extract_publication_date function reaches an accuracy of 62.67 %.
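Because this accuracy is measured only on the documents for which a date was extracted, it helps to separate coverage from accuracy. The sketch below does the arithmetic. The helper name `extraction_scores` is hypothetical, and the count of 94 exact matches is inferred from 62.67 % of 150, assuming the merge kept exactly one row per extracted document.

```python
def extraction_scores(n_documents, n_extracted, n_correct):
    """Coverage, accuracy on covered documents, and accuracy over all documents."""
    coverage = n_extracted / n_documents
    accuracy_on_extracted = n_correct / n_extracted if n_extracted else 0.0
    overall_accuracy = n_correct / n_documents
    return coverage, accuracy_on_extracted, overall_accuracy

# 150 of 500 documents yield a date; 94 matches is an inferred count (see above).
coverage, accuracy, overall = extraction_scores(500, 150, 94)
print(f"coverage={coverage:.1%}  accuracy={accuracy:.2%}  overall={overall:.1%}")
```

Under these assumptions, the first function alone labels under a fifth of the 500 documents correctly, which is why the following steps add fallbacks.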

In [None]:
df_gold[df_gold['Gold published date'].isnull()] # 9 gold dates None

Unnamed: 0,Text,Gold published date,url
163,URL can not open,,https://www.nimes.fr/fileadmin/user_upload/Act...
207,,,https://coarraze.fr/wp-content/uploads/2020/02...
360,,,https://www.villederueil.fr/sites/default/file...
365,,,https://www.terresdesconfluences.fr/sites/defa...
367,,,https://www.lecotentin.fr/system/files/2022-06...
368,,,https://www.ville-briancon.fr/sites/default/fi...
373,,,http://www.pontonx.fr/content/download/60475/4...
433,,,https://www.saulxleschartreux.fr/wp-content/up...
469,,,https://www.mairie-orly.fr/content/download/16...


In [None]:
# if we want to display the rows where the date_match is False:
df_no_match = df_merged[df_merged['date_match'] == False]
df_no_match
# df_no_match.to_string()

Unnamed: 0,url,extracted_publication_date,Gold published date,date_match
20,https://www.correze.gouv.fr/contenu/telecharge...,18/09/2020,06/03/2023,False
21,https://www.cc-tarnagout.fr/wp-content/uploads...,09/03/2023,15/03/2023,False
25,https://www.iledefrance.fr/actes/proces-verbau...,02/07/2020,29/03/2023,False
28,https://www.vienne.gouv.fr/contenu/telechargem...,05/01/2023,03/03/2023,False
31,http://www.mairiedeniherne.fr/wp-content/uploa...,09/01/2023,16/01/2023,False
32,https://www.iledefrance.fr/actes/deliberations...,07/07/2020,26/01/2023,False
36,http://www.sainte-maure-de-touraine.fr/userfil...,31/01/2023,08/02/2023,False
39,https://www.tours.fr/app/uploads/2023/02/PV-se...,09/11/2020,14/12/2022,False
40,http://recherche.lozere.fr/raa/pv_a_afficher.pdf,16/12/2022,25/10/2022,False
55,https://www.pessac.fr/fileadmin/medias/Publica...,08/02/2023,31/01/2023,False


**Analysis of the result:**

We thus obtain 62.67 % of exact matches between the gold dates and the dates extracted by this function.\
In any case, at the end we will compare all the extracted dates that meet our criteria and choose the most recent one.

This first function only extracted dates from 150 documents. To cover the remaining documents, we now turn to our second strategy: **extract all the dates** at the **beginning** and at the **very end** of the document and choose the latest one, skipping any date whose sentence contains words like "annonce" or "à compter du" or is in the future tense.

Finally, to avoid None values, we fall back on the FIRST date mentioned in the document if no date passes these criteria. (Again, at the end we compare all the extracted dates and select the most recent one.)
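Filtering a date by its sentence requires finding the sentence that contains each match. One way to do this, sketched below, is to use the match position and look for the nearest full stops on either side. The helper names `sentence_around` and `dates_with_context` are assumptions, only dd/mm/yyyy dates are handled, and full stops inside abbreviations will split sentences too early.

```python
import re

DATE = re.compile(r"\d{1,2}\s*/\s*\d{1,2}\s*/\s*\d{4}")

def sentence_around(text, start, end):
    """Return the sentence (delimited by full stops) that contains text[start:end]."""
    left = text.rfind(".", 0, start) + 1   # 0 if there is no earlier full stop
    right = text.find(".", end)
    if right == -1:                        # no later full stop: go to the end
        right = len(text)
    return text[left:right].strip()

def dates_with_context(text):
    """Yield (date string, enclosing sentence) for every numeric date in the text."""
    for match in DATE.finditer(text):
        yield match.group(0), sentence_around(text, match.start(), match.end())
```

Each sentence returned this way can then be checked for "annonce", "à compter du", or a future-tense marker before the date is kept.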

2. Dates in the beginning/end of the document according to specific criteria

In [None]:
# Function that extracts the most recent date at the beginning/end of the document
# (skipping dates whose sentence mentions a future measure), and falls back on
# the first date mentioned if no date passes the filter

import requests
import re
from datetime import datetime

def extract_most_recent_date(text_url, pdf_url):
    # Download the text version from its url
    response = requests.get(text_url)
    if response.status_code != 200:  # Plain text not available: read the PDF of this row instead
        content = extract_text_from_pdf_url(pdf_url)
    else:  # The text version is available
        content = response.text

    content = content.lower()
    mots = content.split()  # Split the content into words

    # Select the first 300 words and the last 100 (the whole text if it is short)
    extrait = mots if len(mots) <= 400 else mots[:300] + mots[-100:]
    extrait_texte = " ".join(extrait)

    # Regex that recognizes dates in the format "12 décembre 2023" or "12/12/2023"
    pattern = r'(\d{1,2})\s*(janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre)\s*(\d{4})|(\d{1,2})\s*\/\s*(\d{1,2})\s*\/\s*(\d{4})'

    # Dictionary mapping month names to month numbers
    mois_to_num = {
        "janvier": "01", "février": "02", "mars": "03", "avril": "04", "mai": "05", "juin": "06",
        "juillet": "07", "août": "08", "septembre": "09", "octobre": "10", "novembre": "11", "décembre": "12"
    }

    dates = []
    premiere_date = None  # First valid date encountered
    for match in re.finditer(pattern, extrait_texte):
        groups = match.groups()
        if groups[0]:  # Format "dd month yyyy"
            jour = groups[0].zfill(2)
            mois = mois_to_num[groups[1]]
            annee = groups[2]
        else:  # Format "dd/mm/yyyy"
            jour = groups[3].zfill(2)
            mois = groups[4].zfill(2)
            annee = groups[5]

        date_complete = f"{jour}/{mois}/{annee}"
        try:
            date_obj = datetime.strptime(date_complete, "%d/%m/%Y")
        except ValueError:
            continue  # Impossible date such as 31/02/2023

        # Remember the first date encountered as a fallback
        if premiere_date is None:
            premiere_date = date_complete

        # Sentence containing the date, located from the match position
        debut = extrait_texte.rfind(".", 0, match.start()) + 1
        fin = extrait_texte.find(".", match.end())
        phrase = extrait_texte[debut:fin if fin != -1 else len(extrait_texte)]

        # Exclusion criteria. Note: r"\bfutur\b" only catches the literal word "futur",
        # not verbs conjugated in the future tense.
        if any(terme in phrase for terme in ["annonce", "à compter d"]) or re.search(r"\bfutur\b", phrase):
            continue

        # Keep the date if it passes the filter
        dates.append(date_obj)

    if dates:
        # Return the most recent date that passed the filter
        return max(dates).strftime("%d/%m/%Y")
    # Otherwise fall back on the first date encountered (possibly None)
    return premiere_date

# Apply the function row by row: text version url first, PDF url as a fallback
df['extracted_most_recent_date'] = df.apply(
    lambda row: extract_most_recent_date(row['text version'], row['url']), axis=1
)

# Display the results
print(df[['url', 'extracted_most_recent_date']])


                                                   url  \
0    http://www.ville-saint-ay.fr/userfile/fichier-...   
1    https://www.gatine-racan.fr/wp-content/uploads...   
2    https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf   
3    https://www.fier-et-usses.com/cms_viewFile.php...   
4    https://www.grandbourg.fr/cms_viewFile.php?idt...   
..                                                 ...   
495  https://plombieres-les-dijon.fr/wp-content/upl...   
496  https://www.orne.gouv.fr/contenu/telechargemen...   
497  https://www.vosges.gouv.fr/contenu/telechargem...   
498  http://www.grandchambery.fr/fileadmin/mediathe...   
499  http://www.hauts-de-seine.fr/fileadmin/user_up...   

    extracted_most_recent_date  
0                   12/12/2023  
1                   25/01/2023  
2                   02/02/2023  
3                   26/01/2023  
4                   23/01/2023  
..                         ...  
495                 24/01/2024  
496                 09/01/2024  
497   

In [None]:
# Checking if we have "None" dates:
df[df['extracted_most_recent_date'].isna()]

Unnamed: 0,doc_id,url,cache,text version,nature,published,entity,entity_type,extracted_publication_date,extracted_most_recent_date
5,2785/384c7_D%C3%A9lib%C3%A9rations_Conseil_Com...,https://www.dinan-agglomeration.fr/content/dow...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,acte.delib,27/02/2023,CA Dinan Agglomération,Intercommunalité,,
12,2512/b2cf4_CR_09_f%C3%A9vrier_2023.pdf,https://www.legiennois.fr/images/3-Boismorand/...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,09/02/2023,CC Giennoises,Intercommunalité,,
24,693/f5db7_RAPPORT_BP_2023_COMMUNE.pdf,http://www.bagneux92.fr/images/1-Decouvrir/act...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,bdj,01/01/2023,Bagneux,Commune,,
30,2432/1775c_Couvron-2-Règlement.pdf,https://paysdelaserre.fr/wp-content/uploads/20...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,dlao.autres,01/01/2023,CC du Pays de la Serre,Intercommunalité,,
32,1450/ec09e_01 PV conseil municipal du 1er févr...,https://www.saint-medard-en-jalles.fr/storage/...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.full,24/01/2023,Saint-Médard-en-Jalles,Commune,,
44,713/46f28_Comptes-Administratifs-2022-Mairie-d...,https://www.olemps.fr/uploads/sites/95/2023/03...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,01/03/2023,Olemps,Commune,,
49,2965/18c3595af8e450d0b8afffe9827a617fcfa8450f_...,https://www.pevelecarembault.fr/sites/default/...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,acte.delib,29/03/2023,CC Pévèle-Carembault,Intercommunalité,,
51,1970/5011763f908fe9bdec498bdf9cb1517bb66fbb56_...,https://www.pau.fr/sites/default/files/media/d...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,30/03/2023,Pau,Commune,,
63,1342/99118_PV%20int%C3%A9gral%20CM%20121222.pdf,https://www.ville-gonesse.fr/sites/default/fil...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.full,06/02/2023,Gonesse,Commune,,
70,3029/a18582e994fceea2089730c835eba47315a1cc6d_...,https://www.melunvaldeseine.fr/fileadmin/01_-_...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,acte.delib,06/02/2023,CA Melun Val de Seine,Intercommunalité,,


We have 18 None dates. Even though the function falls back on the **first** date mentioned in the document when no other date is found, a None can still occur when: \
1) the PDF is not readable (as noted at the beginning)\
2) the text version url is damaged, leading to missing characters\
3) the text of the PDF was not read correctly by PyPDF2.

In addition, because some documents contain no publication date at all, we finally **use the url**:

- after extracting a date from the document, we compare it with the date in the url: if the url date is later, we take the url date; otherwise we keep the date extracted from the document.
- if there is no publication date in the document, we compare the first date found in the document with the date in the url and take the most recent one.
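The comparison rule in the two points above amounts to taking the latest valid date among the available candidates. A minimal sketch, assuming dates are dd/mm/yyyy strings or None (the helper name `latest_date` is hypothetical):

```python
from datetime import datetime

def latest_date(*candidates, fmt="%d/%m/%Y"):
    """Return the most recent valid date among the candidates, or None."""
    parsed = []
    for value in candidates:
        if not value:
            continue  # skip None and empty strings
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            continue  # skip impossible dates such as 31/02/2023
    return max(parsed).strftime(fmt) if parsed else None
```

For instance, `latest_date("23/01/2023", "16/01/2023")` keeps the document date, while `latest_date(None, "16/01/2023")` falls back on the url date.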

Third step: we extract the date from the url (if there is one) into a third date column of the df ('extracted_url_date').

3. Dates in the URLs

In [None]:
!pip install unidecode

In [None]:
import re
from unidecode import unidecode

# Month names and abbreviations, without accents because the url is normalized with unidecode
MONTH_MAP = {
    "janvier": 1, "jan": 1, "fevrier": 2, "fev": 2, "mars": 3, "mar": 3,
    "avril": 4, "avr": 4, "mai": 5, "juin": 6, "jun": 6, "juillet": 7, "jul": 7,
    "aout": 8, "aou": 8, "septembre": 9, "sept": 9, "sep": 9, "octobre": 10, "oct": 10,
    "novembre": 11, "nov": 11, "decembre": 12, "dec": 12
}
# Alternation built from the map (longest names first), so every matched month has a number
MONTHS = "|".join(sorted(MONTH_MAP, key=len, reverse=True))

# (pattern, order of the captured groups), tried in this order
URL_DATE_PATTERNS = [
    # Year first: 2023-01-16 or 2023/01/16
    (r'(\d{4})[-_/\.](\d{1,2})[-_/\.](\d{1,2})', "ymd"),
    # Classic format: day-month-year (25-01-2023, 25.01.23)
    (r'(\d{1,2})[-_/\.](\d{1,2})[-_/\.](\d{2,4})', "dmy"),
    # Month written in letters: 21_decembre_2023, 16-fevrier-2023
    (rf'(\d{{1,2}})[_-]({MONTHS})[_-](\d{{2,4}})', "dmy"),
    # Special case: "1er février 2023"
    (rf'(1)er[^\w]*({MONTHS})[^\w]*(\d{{2,4}})', "dmy"),
    # Compact year-month-day: 20230126 (not required to be isolated)
    (r'(\d{4})(\d{2})(\d{2})', "ymd"),
    # Compact day-month-year: 30012023 (must be isolated)
    (r'\b(\d{2})(\d{2})(\d{4})\b', "dmy"),
]

def extract_date_from_url(url):
    url_normalized = unidecode(url)

    for pattern, order in URL_DATE_PATTERNS:
        match = re.search(pattern, url_normalized, re.IGNORECASE)
        if not match:
            continue
        first, middle, last = match.groups()
        day, year = (last, first) if order == "ymd" else (first, last)
        # The middle group is either a month name or a month number
        month = MONTH_MAP.get(middle.lower()) or int(middle)
        day, year = int(day), int(year)

        # Check that the date is plausible, otherwise try the next pattern
        if 1 <= day <= 31 and 1 <= month <= 12:
            if year < 100:  # Two-digit year
                year += 2000
            return day, month, year

    return None, None, None  # No date found

def format_url_date(url):
    # Call the extractor once per url and format the result as dd/mm/yyyy
    day, month, year = extract_date_from_url(url)
    if not all((day, month, year)):
        return None
    return f"{day:02d}/{month:02d}/{year:04d}"

# Apply the function to extract the date
df['extracted_url_date'] = df['url'].apply(format_url_date)

# Display the DataFrame with the extracted dates
df[['url', 'extracted_url_date']]


Unnamed: 0,url,extracted_url_date
0,http://www.ville-saint-ay.fr/userfile/fichier-...,16/01/2023
1,https://www.gatine-racan.fr/wp-content/uploads...,25/01/2023
2,https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf,01/01/2023
3,https://www.fier-et-usses.com/cms_viewFile.php...,
4,https://www.grandbourg.fr/cms_viewFile.php?idt...,16/01/2023
...,...,...
495,https://plombieres-les-dijon.fr/wp-content/upl...,24/01/2024
496,https://www.orne.gouv.fr/contenu/telechargemen...,
497,https://www.vosges.gouv.fr/contenu/telechargem...,
498,http://www.grandchambery.fr/fileadmin/mediathe...,21/12/2023


4. Final extraction

Finally, we will get the most recent date from all of the 3 columns of extracted dates, according to the 3 functions we executed:

1. Extracted publication date (found after specific terms in the text)
2. Extracted most recent date (at the beginning/end of the text, subject to specific criteria)
3. Extracted url date


In [None]:
# Convert to datetime, handling errors
for col in ['extracted_publication_date', 'extracted_most_recent_date', 'extracted_url_date']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y', errors='coerce')
    # errors='coerce' will set invalid dates to NaT (Not a Time)

# We take the most recent date from the 3 columns
df['real_publication_date'] = df[['extracted_publication_date', 'extracted_most_recent_date', 'extracted_url_date']].max(axis=1)

# Convert the 'published' column (dd/mm/yyyy strings) to datetime in a new 'publication' column
df['publication'] = pd.to_datetime(df['published'], format='%d/%m/%Y', errors='coerce')
# Fill real_publication_date with the 'publication' dates where it is NaT
df['real_publication_date'] = df['real_publication_date'].fillna(df['publication'])

df



Unnamed: 0,doc_id,url,cache,text version,nature,published,entity,entity_type,extracted_publication_date,extracted_most_recent_date,extracted_url_date,real_publication_date,publication
0,6357/71845_1698228833-PV---Conseil-Municipal-1...,http://www.ville-saint-ay.fr/userfile/fichier-...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,16/01/2023,Saint-Ay,Commune,NaT,2023-12-12,2023-01-16,2023-12-12,2023-01-16
1,2515/213c7_proces-verbal-25-01-2023.pdf,https://www.gatine-racan.fr/wp-content/uploads...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,25/01/2023,CC de Gâtine-Racan,Intercommunalité,NaT,2023-01-25,2023-01-25,2023-01-25,2023-01-25
2,1086/ee2ec_2023_1_1.pdf,https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,31/01/2023,Mazères,Commune,2023-02-02,2023-02-02,2023-01-01,2023-02-02,2023-01-31
3,3020/68132_cms_viewFile.php,https://www.fier-et-usses.com/cms_viewFile.php...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,acte.delib,26/01/2023,CC Fier et Usses,Intercommunalité,2023-01-26,2023-01-26,NaT,2023-01-26,2023-01-26
4,3132/6df22_cms_viewFile.php,https://www.grandbourg.fr/cms_viewFile.php?idt...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,16/01/2023,CA du Bassin de Bourg-en-Bresse,Intercommunalité,NaT,2023-01-23,2023-01-16,2023-01-23,2023-01-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,6238/24d8a2f8cfd1989d316a84435f308170f1ba9fcc_...,https://plombieres-les-dijon.fr/wp-content/upl...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,24/01/2024,Plombières-lès-Dijon,Commune,NaT,2024-01-24,2024-01-24,2024-01-24,2024-01-24
496,6812/a18ebaf196fccbea780f909e3508d7d4cb14bf6d_...,https://www.orne.gouv.fr/contenu/telechargemen...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,acte.arrete,09/01/2024,Préfecture - Orne,Préfecture,NaT,2024-01-09,NaT,2024-01-09,2024-01-09
497,6834/594b09f7dd530aa0245edaf6193cb6238cd31659_...,https://www.vosges.gouv.fr/contenu/telechargem...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,acte.arrete,22/11/2022,Préfecture - Vosges,Préfecture,NaT,2022-11-22,NaT,2022-11-22,2022-11-22
498,2019/c6c89e9f7afb2419432222a54525df83a756768a_...,http://www.grandchambery.fr/fileadmin/mediathe...,https://datapolitics-public.s3.gra.io.cloud.ov...,https://datapolitics-public.s3.gra.io.cloud.ov...,pv.cr,21/12/2023,CA du Grand Chambéry,Intercommunalité,NaT,2023-12-21,2023-12-21,2023-12-21,2023-12-21


In [None]:
print(df[['url','extracted_publication_date', 'extracted_most_recent_date', 'extracted_url_date','real_publication_date']].to_string())

   url  extracted_publication_date  extracted_most_recent_date  extracted_url_date  real_publication_date
0  http://www.ville-saint-ay.fr/userfile/fichier-telechargement/1698228833-PV---Conseil-Municipal-16-01-2023.pdf  NaT  2023-12-12  2023-01-16  2023-12-12
1

5. Evaluation

Comparison to the gold dates

In [None]:
df_merged_f = df.merge(df_gold, on="url", how="inner")

# Convert the "Gold published date" column to datetime so it matches the other date columns
# (gold dates are written DD/MM/YYYY, hence dayfirst=True to avoid day/month swaps)
df_merged_f['Gold published date_format'] = pd.to_datetime(df_merged_f['Gold published date'], dayfirst=True, errors='coerce')

# Compare the dates
df_merged_f['date_match'] = df_merged_f['real_publication_date'] == df_merged_f['Gold published date_format']

# Show the results
print(df_merged_f.head())

# Compute the percentage of True values in the 'date_match' column
percentage_true_f = df_merged_f['date_match'].mean() * 100

# Display the percentage
percentage_true_f

                                              doc_id  \
0  6357/71845_1698228833-PV---Conseil-Municipal-1...   
1            2515/213c7_proces-verbal-25-01-2023.pdf   
2                            1086/ee2ec_2023_1_1.pdf   
3                        3020/68132_cms_viewFile.php   
4                        3132/6df22_cms_viewFile.php   

                                                 url  \
0  http://www.ville-saint-ay.fr/userfile/fichier-...   
1  https://www.gatine-racan.fr/wp-content/uploads...   
2  https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf   
3  https://www.fier-et-usses.com/cms_viewFile.php...   
4  https://www.grandbourg.fr/cms_viewFile.php?idt...   

                                               cache  \
0  https://datapolitics-public.s3.gra.io.cloud.ov...   
1  https://datapolitics-public.s3.gra.io.cloud.ov...   
2  https://datapolitics-public.s3.gra.io.cloud.ov...   
3  https://datapolitics-public.s3.gra.io.cloud.ov...   
4  https://datapolitics-public.s3.gra.io.cloud


60.199999999999996

We obtain an accuracy of 60.2%.

**Interpretation**:

The functions we wrote cover many of the cases found in the documents, but not all of them, which explains this accuracy.

**A few points help explain why this score is lower than we expected from our carefully designed functions:**

- In some cases, the dates extracted by our functions may be correct even though they differ from the gold dates. The annotators (students in the class) sometimes missed the publication date, and the publication date is very hard to deduce when the document does not state it explicitly.
- As mentioned before, some PDFs were not readable.
- Some publication dates were handwritten on the documents, so they do not appear in the extracted text. An improvement could be to use OCR.

Now let's look at another implementation, which uses a fine-tuned model.

## 2. Using a fine-tuned model

Goal: To extract the publication date of official documents accurately, using a combination of text analysis and URL-based fallback mechanisms.

1. Data Preparation: The dataset contains official documents with their text, gold-standard publication dates (for evaluation), and URLs. The publication date may be explicitly mentioned in the text or inferred from patterns in the URL.

2. Training and Prediction: A language model is fine-tuned (or prompted) to predict publication dates from the document's text. If the model cannot find a date in the text, a fallback mechanism extracts potential date information from the document's URL using regex.

3. Fallback Logic:

  - Primary Source: The model is prompted to extract the date from the document text, focusing on phrases indicating publication dates.
  - Fallback Source: If the model fails or is uncertain, regex patterns scan the URL to identify date-like patterns.

4. Evaluation: The predicted dates from the text and URLs are compared against the gold-standard publication dates. Rows where neither the model nor the URL extraction returns a valid date are excluded from accuracy calculations to avoid skewing results. Accuracy is defined as the proportion of cases where either the text-based or URL-based prediction matches the gold-standard date.
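
The fallback logic described in step 3 can be sketched as follows. This is a minimal illustration rather than the code used later in this notebook: the helper names `parse_ddmmyyyy` and `choose_date` are hypothetical, and the sketch assumes both sources return either a DD/MM/YYYY string or nothing usable.

```python
from datetime import datetime
from typing import Optional


def parse_ddmmyyyy(value: Optional[str]) -> Optional[datetime]:
    """Return a datetime for a DD/MM/YYYY string, or None if it does not parse."""
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), "%d/%m/%Y")
    except ValueError:
        return None


def choose_date(model_date: Optional[str], url_date: Optional[str]) -> Optional[str]:
    """Prefer the model's date, fall back to the URL date, None if both fail."""
    for candidate in (model_date, url_date):
        parsed = parse_ddmmyyyy(candidate)
        if parsed is not None:
            return parsed.strftime("%d/%m/%Y")
    return None


print(choose_date("25/01/2023", "26/01/2023"))  # 25/01/2023 (model date wins)
print(choose_date("MM", "26/01/2023"))          # 26/01/2023 (URL fallback)
print(choose_date("", None))                    # None (no usable source)
```

Returning a single date per document, as above, makes it possible to score one final prediction per row instead of counting a row as correct when either source matches.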

In [None]:
!pip install transformers datasets


In [None]:
import os
import pandas as pd
from datetime import datetime
import re
from datasets import load_dataset, Dataset

# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

# 1. Load dataset
gold_dataset = load_dataset("maribr/publication_dates_fr")

print(gold_dataset)

# Convert dataset to pandas for easier processing
df = pd.DataFrame(gold_dataset['train'])  # Use the train split for now
print(f"Loaded dataset with {len(df)} entries")

DatasetDict({
    train: Dataset({
        features: ['Text', 'Gold published date', 'url'],
        num_rows: 500
    })
})
Loaded dataset with 500 entries


In [None]:
# Loop through all URLs in the DataFrame and print them
for url in df['url']:
    print(url)

http://www.ville-saint-ay.fr/userfile/fichier-telechargement/1698228833-PV---Conseil-Municipal-16-01-2023.pdf
https://www.gatine-racan.fr/wp-content/uploads/2023/03/proces-verbal-25-01-2023.pdf
https://www.ville-mazeres.fr/IMG/pdf/2023_1_1.pdf
https://www.fier-et-usses.com/cms_viewFile.php?idtf=99450&path=deliberation-2023-5-annexe.pdf
https://www.grandbourg.fr/cms_viewFile.php?idtf=17579&path=2023-01-16-Proces-verbal-du-Bureau.pdf
https://www.dinan-agglomeration.fr/content/download/22870/298979/version/1/file/D%C3%A9lib%C3%A9rations_Conseil_Communautaire_27_f%C3%A9vrier_2023.pdf
http://www.villeneuve-tolosane.fr/ad_attachment/Conseil_Municipal/2023/8.02.2023/DEL-2023-006_Débat d'orientation budgétaire 2023_acteTampon.pdf
http://crosne.fr/data/1c-Procès-verbal du CM 7-02-2023.pdf
https://www.guingamp-paimpol-agglo.bzh/wp-content/uploads/2023/02/DELBU2023-02-012-CESSION-TERRAIN-SAINT-LOUP-PABU-tampon.pdf
https://www.ales.fr/wp-content/uploads/2023/04/Conseil-de-Communaute-du-16-fevrier-

In [None]:
# 2. Data Cleaning: Remove invalid rows
df = df[df['Text'].notna() & df['Gold published date'].notna() & df['url'].notna()]  # Ensure 'url' is not None
df = df[(df['Text'].str.strip() != "") & (df['Gold published date'].str.strip() != "") & (df['url'].str.strip() != "")]  # Ensure 'url' is not empty

print(f"Filtered dataset: {len(df)} entries remain after cleaning.")

Filtered dataset: 411 entries remain after cleaning.


In [None]:
# 3. Split into training and testing sets
train_split = df.sample(frac=0.8, random_state=42)  # 80% for training
test_split = df.drop(train_split.index)  # remaining 20% for testing

# Convert splits to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_split)
test_dataset = Dataset.from_pandas(test_split)

print(f"Training dataset: {len(train_dataset)} entries")
print(f"Testing dataset: {len(test_dataset)} entries")

Training dataset: 329 entries
Testing dataset: 82 entries


In [None]:
# 4. Prepare dataset for fine-tuning
def prepare_data(row):
    input_text = f"""
    The following is an official document. Extract and return the publication date of this document.

    - Focus on identifying the publication date as mentioned in the text of the document.
    - If the publication date cannot be found in the text, you can check the following URL for potential date information: {row['url']}.
    - Return the publication date as it appears in the text or URL.

    Document:
    {row['Text'][:2000]}

    URL:
    {row['url']}
    """
    target_text = row['Gold published date']
    return {"input_text": input_text, "target_text": target_text}

train_dataset = train_dataset.map(prepare_data, remove_columns=train_dataset.column_names)
test_dataset = test_dataset.map(prepare_data, remove_columns=test_dataset.column_names)

print("Dataset preparation complete.")

Dataset preparation complete.


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# 5. Load pre-trained model and tokenizer
model_name = "google/flan-t5-small"  # Use larger models like flan-t5-base if needed
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


In [None]:
# 6. Preprocess data
def preprocess_function(examples):
    inputs = tokenizer(examples['input_text'], truncation=True, max_length=512, padding="max_length")
    labels = tokenizer(examples['target_text'], truncation=True, max_length=10, padding="max_length").input_ids
    inputs["labels"] = labels
    return inputs

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)

print("Dataset preprocessing complete.")

Dataset preprocessing complete.
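
One detail worth checking in this preprocessing step: the labels are padded to a fixed length, and Hugging Face sequence-to-sequence models only ignore label positions set to `-100` when computing the loss. If padding ids are left in the labels, the model is also trained to predict padding. The sketch below shows the masking step on plain lists; the helper name `mask_label_padding` is hypothetical and the token ids are made up for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the Hugging Face seq2seq loss


def mask_label_padding(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]


# Toy example: 0 is the T5 padding id, the other ids are invented.
padded_labels = [[204, 87, 1, 0, 0], [511, 1, 0, 0, 0]]
print(mask_label_padding(padded_labels, pad_token_id=0))
# [[204, 87, 1, -100, -100], [511, 1, -100, -100, -100]]
```

In `preprocess_function`, this would be applied to `labels` with `tokenizer.pad_token_id` before assigning `inputs["labels"]`.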


In [None]:
from transformers import TrainingArguments, Trainer

# 7. Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    logging_steps=5,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    save_total_limit=2,
    load_best_model_at_end=True,
    report_to="none",
)

# 8. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
)

# 9. Train the model
print("Fine-tuning the model...")
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
print("Model fine-tuned and saved successfully!")

Fine-tuning the model...



Epoch,Training Loss,Validation Loss
1,17.6767,16.528364
2,15.4406,14.052052
3,13.9843,12.431303
4,12.7395,11.501972
5,12.3867,11.187079


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


Model fine-tuned and saved successfully!


In [None]:
import re
from datetime import datetime

# 10. Normalize Dates
def normalize_date(date_str):
    try:
        # Preprocess: remove unnecessary text (e.g., "Publié le", "URL:", etc.)
        date_str = re.sub(r"(Publié le|URL:.*$)", "", date_str).strip()

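        # Match standard format (DD/MM/YYYY) and return only the matched date,
        # not any text that follows it
        standard_match = re.match(r"\b(\d{2}/\d{2}/\d{4})\b", date_str)
        if standard_match:
            return standard_match.group(1)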

        # Match ISO format (YYYY-MM-DD)
        iso_match = re.match(r"\b(\d{4})-(\d{2})-(\d{2})\b", date_str)
        if iso_match:
            year, month, day = iso_match.groups()
            return f"{day}/{month}/{year}"

        # Define a comprehensive months mapping (case insensitive)
        months_map = {
            **{
                "JANVIER": "01", "FÉVRIER": "02", "FEVRIER": "02", "MARS": "03", "AVRIL": "04",
                "MAI": "05", "JUIN": "06", "JUILLET": "07", "AOÛT": "08", "AOUT": "08",
                "SEPTEMBRE": "09", "OCTOBRE": "10", "NOVEMBRE": "11", "DÉCEMBRE": "12", "DECEMBRE": "12"
            },
            **{
                "JANUARY": "01", "FEBRUARY": "02", "MARCH": "03", "APRIL": "04",
                "MAY": "05", "JUNE": "06", "JULY": "07", "AUGUST": "08",
                "SEPTEMBER": "09", "OCTOBER": "10", "NOVEMBER": "11", "DECEMBER": "12"
            },
            **{
                "JAN": "01", "FEB": "02", "MAR": "03", "APR": "04",
                "MAY": "05", "JUN": "06", "JUL": "07", "AUG": "08",
                "SEP": "09", "OCT": "10", "NOV": "11", "DEC": "12"
            }
        }

        # Match textual format (e.g., "25 janvier 2023" or "25 October 2023")
        text_match = re.search(r"(\d{1,2})\s([a-zéêûîôäëïöüèç]+)\s(\d{4})", date_str, re.IGNORECASE)
        if text_match:
            day, month, year = text_match.groups()
            # Normalize the month to uppercase without accents
            month_key = month.upper().replace("É", "E").replace("À", "A").replace("Û", "U").replace("Î", "I").replace("Ô", "O")
            month_number = months_map.get(month_key)
            if not month_number:
                # Report the original month text (not the failed lookup result)
                print(f"Unknown month in date: {month}")
                return None
            return f"{int(day):02d}/{month_number}/{year}"

        # Match flexible formats like "1/2/23"
        flexible_match = re.match(r"(\d{1,2})/(\d{1,2})/(\d{2,4})", date_str)
        if flexible_match:
            day, month, year = flexible_match.groups()
            if len(year) == 2:  # Expand two-digit year
                year = f"20{year}" if int(year) <= 30 else f"19{year}"
            return f"{int(day):02d}/{int(month):02d}/{year}"

        # Additional logging for debugging
        print(f"Date format not recognized: {date_str}")
        return None  # Return None if no valid format
    except Exception as e:
        print(f"Error normalizing date: {e} | Input: {date_str}")
        return None


def extract_date_from_url(url):
    """
    Extract a potential date from the URL using regex patterns for common date formats.
    """
    try:
        # Remove URL parameters and anchors (anything after `?` or `#`)
        clean_url = re.split(r'[?#]', url)[0]

        # Define regex patterns for different date formats
        date_patterns = [
            r"(\d{4})[/-](\d{2})[/-](\d{2})",   # YYYY-MM-DD or YYYY/MM/DD
            r"(\d{2})[/-](\d{2})[/-](\d{4})",   # DD-MM-YYYY or DD/MM/YYYY
            r"(\d{4})(\d{2})(\d{2})",           # YYYYMMDD
            r"(\d{2})\.(\d{2})\.(\d{4})",       # DD.MM.YYYY
            r"path=(\d{4})-(\d{2})-(\d{2})",    # YYYY-MM-DD in URL parameters
            r"date=(\d{4})(\d{2})(\d{2})",      # YYYYMMDD in URL parameters
        ]

        # Iterate through patterns to find a match
        for pattern in date_patterns:
            match = re.search(pattern, clean_url)
            if match:
                groups = list(map(int, match.groups()))  # Convert to integers for flexibility

                # Handle different formats
                if len(groups) == 3:
                    year, month, day = None, None, None
                    if groups[0] > 31:  # Assume YYYY-MM-DD or YYYYMMDD
                        year, month, day = groups
                    elif groups[2] > 31:  # Assume DD-MM-YYYY
                        day, month, year = groups
                    else:  # Handle ambiguous cases
                        year, month, day = groups

                    # Validate the date
                    try:
                        date_obj = datetime(year, month, day)
                        return date_obj.strftime("%d/%m/%Y")
                    except ValueError:
                        continue  # Skip invalid dates

        return None  # No match found
    except Exception as e:
        print(f"Error extracting date from URL: {e}")
        return None
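
Two things are easy to get wrong in URL-based extraction: dates often sit in the query string (for example `path=2023-01-16-...`), and a date pattern can match inside a longer run of digits or across mixed separators. The sketch below is one possible way to handle both, not the function used in this notebook. The name `find_url_date` and the pattern list are assumptions made for illustration.

```python
import re
from datetime import datetime
from typing import Optional
from urllib.parse import unquote

# (?<!\d) and (?!\d) prevent matches inside longer digit runs (e.g. timestamps).
# (?P=sep) forces both separators to be identical, so "2023/8.02" is rejected.
URL_DATE_PATTERNS = [
    re.compile(r"(?<!\d)(?P<y>\d{4})(?P<sep>[-_/.])(?P<m>\d{1,2})(?P=sep)(?P<d>\d{1,2})(?!\d)"),
    re.compile(r"(?<!\d)(?P<d>\d{1,2})(?P<sep>[-_/.])(?P<m>\d{1,2})(?P=sep)(?P<y>\d{4})(?!\d)"),
    re.compile(r"(?<!\d)(?P<y>\d{4})(?P<m>\d{2})(?P<d>\d{2})(?!\d)"),
]


def find_url_date(url: str) -> Optional[str]:
    """Return the first valid date found in the URL as DD/MM/YYYY, else None."""
    decoded = unquote(url)  # the query string is kept on purpose
    for pattern in URL_DATE_PATTERNS:
        for match in pattern.finditer(decoded):
            try:
                found = datetime(int(match["y"]), int(match["m"]), int(match["d"]))
            except ValueError:
                continue  # e.g. month 13: try the next candidate
            return found.strftime("%d/%m/%Y")
    return None


print(find_url_date("https://www.grandbourg.fr/cms_viewFile.php?idtf=17579&path=2023-01-16-Proces-verbal-du-Bureau.pdf"))
# 16/01/2023
print(find_url_date("https://www.fier-et-usses.com/cms_viewFile.php?idtf=99450&path=deliberation-2023-5-annexe.pdf"))
# None
```

Directory-style paths such as `uploads/2023/03/` followed by a number remain ambiguous with this approach, since an upload folder can look like a date.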

In [None]:
# 11. Predict and normalize dates
def predict_and_normalize_date(row):
    prompt = f"""
    The following is an official document. Extract and return the publication date of this document.

    - Focus on identifying the publication date as mentioned in the text of the document.
    - If the publication date cannot be found in the text, you can check the following URL for potential date information: {row['url']}.
    - Return the publication date as it appears in the text or URL.

    Document:
    {row['Text'][:2000]}

    URL:
    {row['url']}
    """
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_length=20)
    raw_output = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    print(f"Raw Output: {raw_output}")

    normalized_date = normalize_date(raw_output)
    if normalized_date:
        print(f"Normalized Date (from model): {normalized_date}")
    else:
        print("Model failed to predict a valid date.")

    extracted_date = extract_date_from_url(row['url'])
    if extracted_date:
        print(f"Date extracted from URL: {extracted_date}")

    return normalized_date, extracted_date

# Apply prediction and normalization to the test set
print("Predicting and normalizing dates...")
test_split[['predicted_date', 'url_date']] = test_split.apply(
    lambda row: pd.Series(predict_and_normalize_date(row)), axis=1
)

Predicting and normalizing dates...
Raw Output: CONSEIL COMMUNAUTAIRE DU 25 JANVIER 2023
Normalized Date (from model): 25/01/2023
Date extracted from URL: 25/01/2023
Raw Output: COMMUNAUTÉ DE COMMUNES PRESENTS: On two mill
Date format not recognized: COMMUNAUTÉ DE COMMUNES PRESENTS: On two mill
Model failed to predict a valid date.
Date extracted from URL: 15/02/2023
Raw Output: 
Date format not recognized: 
Model failed to predict a valid date.
Raw Output: 
Date format not recognized: 
Model failed to predict a valid date.
Raw Output: MM
Date format not recognized: MM
Model failed to predict a valid date.
Date extracted from URL: 26/01/2023
Raw Output: 2022
Date format not recognized: 2022
Model failed to predict a valid date.
Raw Output: 
Date format not recognized: 
Model failed to predict a valid date.
Raw Output: COMMUNITY DES COMMUNITY DES COMMUNITY
Date format not recognized: COMMUNITY DES COMMUNITY DES COMMUNITY
Model failed to predict a valid date.
Raw Output: 
Date format not

In [None]:
# 12. Evaluate predictions
test_split['gold_date'] = pd.to_datetime(test_split['Gold published date'], format='%d/%m/%Y', errors='coerce')
test_split['predicted_date'] = pd.to_datetime(test_split['predicted_date'], format='%d/%m/%Y', errors='coerce')
test_split['url_date'] = pd.to_datetime(test_split['url_date'], format='%d/%m/%Y', errors='coerce')

# Exclude rows where both predicted_date and url_date are None
valid_rows = ~(
    test_split['predicted_date'].isna() &
    test_split['url_date'].isna()
)

# Calculate accuracy only for valid rows (copy to avoid SettingWithCopyWarning)
valid_test_split = test_split[valid_rows].copy()
valid_test_split['is_correct'] = (
    (valid_test_split['predicted_date'] == valid_test_split['gold_date']) |
    (valid_test_split['url_date'] == valid_test_split['gold_date'])
)

accuracy = valid_test_split['is_correct'].mean()
print(f"Accuracy (excluding rows with both dates missing): {accuracy:.2%}")

# Display mismatches for valid rows
print("\nExamples of mismatches:")
mismatches = valid_test_split[~valid_test_split['is_correct']]
print(mismatches[['Gold published date', 'predicted_date', 'url_date']].head(10))

Accuracy (excluding rows with both dates missing): 74.07%

Examples of mismatches:
    Gold published date predicted_date   url_date
35           16/12/2022            NaT 2023-01-26
198          01/01/2020     2020-02-25 2020-02-25
203      Date not found            NaT 2020-01-29
289          14/03/2022            NaT 2022-03-10
442          21/04/2022            NaT 2022-04-14
490          05/12/2023            NaT 2005-12-23
493          31/01/2024            NaT 2024-02-06




Accuracy: 74.07%, computed only on test rows where the model or the URL extraction returned a valid date.

Potential Issues and Areas for Improvement
1. Model Performance on Text

  The model may not handle nuanced or varied date formats in the text well. Dates hidden in footnotes, metadata, or inconsistent phrasing could confuse it. We still need to measure the proportion of correct predictions made by the model alone: if it contributes less to accuracy than URL parsing, the model requires further fine-tuning or prompt adjustments.

2. URL Parsing Limitations

  Regex patterns may miss unconventional or less-structured date formats in URLs. For example, URLs with encoded characters or non-standard date positions might bypass the extraction rules. We should evaluate how many cases were resolved by URL extraction alone. If that performance is suboptimal, the regex patterns can be refined or extended with rules for edge cases.

3. Gold Standard vs. Extracted Dates

  Differences in date format or granularity (e.g., publication date vs. approval date) could cause false mismatches, reducing accuracy unfairly.
  Further investigate mismatched cases to see if the extracted dates were semantically valid but differed from the gold standard.
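
The per-source checks suggested in points 1 and 2 can be sketched as below. This is only an illustration under stated assumptions: the function name `source_report` is hypothetical, the rows are made-up examples (not results from this project), and each row is assumed to hold DD/MM/YYYY strings or `None`. With a DataFrame, the rows could be obtained with `df.to_dict("records")`.

```python
from typing import Dict, List, Optional


def source_report(rows: List[Dict[str, Optional[str]]], source: str) -> Dict[str, float]:
    """Coverage and accuracy of one prediction column against the gold date."""
    total = len(rows)
    answered = [row for row in rows if row[source] is not None]
    correct = [row for row in answered if row[source] == row["gold"]]
    return {
        "coverage": len(answered) / total if total else 0.0,
        "accuracy_when_answered": len(correct) / len(answered) if answered else 0.0,
        "accuracy_overall": len(correct) / total if total else 0.0,
    }


# Made-up rows, for illustration only.
rows = [
    {"gold": "25/01/2023", "model": "25/01/2023", "url": "25/01/2023"},
    {"gold": "15/02/2023", "model": None, "url": "15/02/2023"},
    {"gold": "16/12/2022", "model": None, "url": "26/01/2023"},
    {"gold": "09/01/2024", "model": None, "url": None},
]
for source in ("model", "url"):
    print(source, source_report(rows, source))
```

Reporting coverage next to accuracy shows how much of the final score comes from each source, and how much depends on excluding rows where no date was returned.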

To conclude on our two approaches, the fine-tuned model pipeline (model prediction with a URL fallback) reached a higher accuracy (74.07%) than the rule-based regex approach (60.2%). The two scores are not strictly comparable: the 74.07% is computed on the test split only and excludes rows where neither the model nor the URL yielded a date. With that caveat, the result suggests that combining a fine-tuned model with rule-based fallbacks can outperform rules alone on this task.

Both scores still show that publication date extraction is a difficult task. Results would likely have been better with clean, readable documents, a fully reliable set of gold dates, and the other improvements described above.