### Data Validation

Methodology:
I select 10 random projects from the 2024 participatory budget election in Toulouse. For each one, the following fields are kept:

1. Project Name
2. Description 
3. Cost
4. District
5. Votes

Then ask the model (guided / non-guided prompts):

#### Without searching in external sources

1.  Complete the description.
2.  In which neighborhood was the project proposed, and in which category?

The answers are compared with the original text using:

- ROUGE-L (lexical similarity)
- BLEURT (semantic similarity)


-> Methodology proposed here: https://arxiv.org/abs/2308.08493#
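
As a rough illustration of the lexical metric listed above, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence of two token lists. The function names `lcs_length` and `rouge_l_f1` are hypothetical, tokenization is plain lowercase whitespace splitting, and the example strings are made up. A published ROUGE implementation may tokenize and weight precision and recall differently, and BLEURT requires a trained model, so it is not sketched here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 on lowercased, whitespace-split tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens that are in the LCS
    recall = lcs / len(ref)      # share of reference tokens that are in the LCS
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only, not taken from the dataset.
reference = "install bike racks near the school"
candidate = "install new bike racks near school"
print(round(rouge_l_f1(reference, candidate), 3))
```

A score close to 1 for a model answer would suggest near-verbatim recall of the original description, while a low score suggests the text was not reproduced.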

#### Toulouse 2024 Dataset

In [12]:
# 1. load data
from src.data_loader import load_and_prepare_projects, load_prediction_set

path_22 = 'data/tls22_projects.csv'
path_24 = 'data/tls24_projects.csv'

df, df22_shuffled = load_and_prepare_projects(path_22,path_24)

ids_to_predict_path = 'data/tls24_projects_to_predict.csv'
test = load_prediction_set(df, ids_to_predict_path)
df

Unnamed: 0,project_id,project_name,description,category,cost,district,votes,district_number,rank,year
0,136,Piste cyclable avenue Saint-Exupéry,"Lorsque l'on habite vers la place de l'ormeau,...",Éco-mobilité,200000.0,12 - Pont des Demoiselles / Ormeau / Montaudra...,492,12,1,2022
1,7,Faire des voies vélos et piétons dédiées et bi...,Pour favoriser la cohabitation piétons/cyclist...,Éco-mobilité,150000.0,1 - Capitole / Arnaud Bernard / Carmes,467,1,2,2022
2,5,Végétalisation de la place du Capitole,Pour donner encore plus de majesté à ce lieu e...,Nature en ville,200000.0,1 - Capitole / Arnaud Bernard / Carmes,358,1,3,2022
3,132,Lutte anti-moustiques,Mise en place de pièges à moustiques dans le q...,Nature en ville,7500.0,12 - Pont des Demoiselles / Ormeau / Montaudra...,258,12,4,2022
4,71,Planter des arbres et jardins partagés partout...,Une voisine exploite un bout du terrain du sup...,Nature en ville,20000.0,8 - Minimes / Barrière de Paris / Ponts-Jumeau...,243,8,5,2022
...,...,...,...,...,...,...,...,...,...,...
378,372,Installation d'un portique pour réparer les vé...,Fleury : Place André Abbal (Reynerie): Il sera...,Éco-mobilité et transports,5000.0,17 - Mirail-Université / Reynerie / Bellefontaine,20,17,179,2024
379,356,Débitumiser les places de stationnement rue Be...,Je propose d’enlever le bitume sur les places ...,Nature en ville et biodiversité,240000.0,15 - Croix de Pierre / Route d'Espagne,19,15,180,2024
380,230,Installation de poubelles à déjection aux Sept...,"Installer sur le quartier des sept deniers, de...",Cadre de vie,6000.0,7 - Sept Deniers / Ginestous-Sesquières / Lalande,19,7,181,2024
381,370,Installer des tables et des chaises au Parc Ma...,Il n’y a pas assez de tables et de chaises pou...,Cadre de vie,85000.0,17 - Mirail-Université / Reynerie / Bellefontaine,18,17,182,2024


In [None]:
sample24 = df[df['year'] == 2024].sample(n=10, random_state=42).filter(['project_id', 
                                                                        'project_name', 
                                                                        'description', 
                                                                        'cost', 
                                                                        'district',
                                                                        'votes'])

sample24

In [None]:
### This option: validates if model memorized the dataset 
def build_guided_validation_prompt_1(project_name):
    prompt =f"""This project is part of the 2024 participatory budget of the city of Toulouse.\n
    Its name is: {project_name}.\n
    Please provide the exact original description as it was proposed during the election
    """
    return prompt

### This option: validates if LLM can reconstruct original description, using its training dataset.
def build_guided_validation_prompt_2(project_name, description):
    prompt =f"""This project is part of the 2024 participatory budget of the city of Toulouse.\n
    Its name is: {project_name}.
    Complete the description of the project exactly as it was proposed during the election.\n
    Description: {description[0:30]} ...
    """
    return prompt
### Some models may recall only the names + districts + costs from CSVs, databases, PDFs, etc.
def build_validation_prompt_cost_district(project_name):
    prompt = f"""This project is part of the 2024 participatory budget of the city of Toulouse.\n
    Its name is: {project_name}.
    Can you tell me:
    1. In which district of the city this project was proposed?
    2. What was the cost proposed for this project?
    """
    return prompt

In [None]:
# prompt 1
sample24['prompt'] = sample24.apply(lambda x: build_guided_validation_prompt_1(x['project_name']), axis=1)
print(sample24['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample24['out'] = sample24['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample24.filter(['project_id','prompt','out']).to_csv("output/validation/guided_validation_description.csv", sep=";", index=False)

In [None]:
# prompt 2
sample24['prompt'] = sample24.apply(lambda x: build_guided_validation_prompt_2(x['project_name'], x['description']), axis=1)
print(sample24['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample24['out'] = sample24['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample24.filter(['project_id','prompt','out']).to_csv("output/validation/tls24_guided_validation_description_2.csv", sep=";", index=False)
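
Because prompt 2 already shows the model the first characters of the description, it may be fairer to score only the continuation. The sketch below is one possible way to do that, not the method used in this notebook: `split_description` and `strip_echoed_prefix` are hypothetical helper names, and the default of 30 characters simply mirrors the `description[0:30]` slice used in the prompt builder.

```python
def split_description(description, prefix_len=30):
    """Return (prefix shown to the model, held-out remainder to score against)."""
    return description[:prefix_len], description[prefix_len:]


def strip_echoed_prefix(response, prefix):
    """Drop the prefix from the response if the model repeated it verbatim."""
    response = response.strip()
    prefix = prefix.strip()
    if response.startswith(prefix):
        return response[len(prefix):].lstrip()
    return response


# Illustrative string only, not taken from the dataset.
prefix, remainder = split_description("abcdefghij", prefix_len=4)
print(prefix, remainder)
print(strip_echoed_prefix("abcd efg", prefix))
```

The remainder would then serve as the reference text, and the stripped response as the candidate, when computing a similarity score.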

In [None]:
# prompt 3
sample24['prompt'] = sample24.apply(lambda x: build_validation_prompt_cost_district(x['project_name']), axis=1)
print(sample24['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample24['out'] = sample24['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample24.filter(['project_id','prompt','out']).to_csv("output/validation/tls24_guided_validation_district_cost.csv", sep=";", index=False)
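
The free-text answers to prompt 3 still have to be compared with the true `cost` and `district` values. The following is a minimal sketch of one way this could be automated with plain string matching. The helpers `mentions_cost` and `mentions_district` are hypothetical, the district check assumes labels shaped like `"1 - Capitole / Arnaud Bernard / Carmes"`, and manual review of the answers may still be needed.

```python
import re


def mentions_cost(response, cost):
    """True if the integer cost appears in the response, ignoring thousands separators."""
    target = str(int(round(cost)))
    # Join digit groups such as "200 000" or "200,000" into "200000".
    compact = re.sub(r"(?<=\d)[\s.,](?=\d{3}(?!\d))", "", response)
    return target in re.findall(r"\d+", compact)


def mentions_district(response, district):
    """True if any '/'-separated neighbourhood name of the district label appears in the response."""
    label = district.split(" - ", 1)[-1]  # drop a leading "12 - " style number, if present
    names = [name.strip().lower() for name in label.split("/") if name.strip()]
    text = response.lower()
    return any(name in text for name in names)


# Illustrative answer only, not an actual model output.
answer = "It was proposed in the Capitole area, with a budget of 200 000 euros."
print(mentions_cost(answer, 200000.0))
print(mentions_district(answer, "1 - Capitole / Arnaud Bernard / Carmes"))
```

Substring matching is deliberately loose: it can miss paraphrased district names and can match a number that appears for another reason, so the results are best read as a first filter.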

#### Wrocław 2017 Dataset

In [1]:
# 1. load data
from src.data_loader import load_and_prepare_projects, load_prediction_set

path_16 = 'data/wrc16_projects.csv'
path_17 = 'data/wrc17_projects.csv'

df, df16_shuffled = load_and_prepare_projects(path_16,path_17, city='Wroclaw')

df
ids_to_predict_path = 'data/wrc17_projects_to_predict.csv'
test = load_prediction_set(df, ids_to_predict_path, city='Wroclaw')
df

Unnamed: 0,project_id,project_name,description,category,cost,district,votes,rank,year
0,710,Drzewa dla Wrocławia - nasadzenia w całym mieś...,Uzasadnienie\nZnikające drzewa z krajobrazu mi...,greenery/recreation,1000000,"Kuźniki, Nowy Dwór, Muchobór Mały,Grabiszyn-Gr...",13938,1,2016
1,15,Zielona rowerowo-piesza obwodnica Wrocławia; E...,Uzasadnienie,walking/cycling infrastructure,1000000,"Grabiszyn-Grabiszynek, Krzyki-Partynice, Oporó...",12348,2,2016
2,764,Oświetlenie Parku Grabiszyńskiego (Alei Romera...,Uzasadnienie\nTaki mamy klimat... że przez pół...,other,1000000,"Grabiszyn-Grabiszynek, Krzyki-Partynice, Oporó...",7291,3,2016
3,685,Plac Zabaw dla Starszaków w parku Grabiszyński...,Uzasadnienie\nPlac Zabaw dla Starszaków w park...,playgrounds,560000,"Huby, Gaj, Tarnogaj,Nadodrze, Ołbin, Stare Mia...",6663,4,2016
4,379,"Parking przy ""Dobrzyńskiej"" - ułatwienie dojaz...",Uzasadnienie\nProjekt polega na utworzeniu w c...,roads,750000,"Gajowice, Powstańców Śląskich, Borek",6383,5,2016
...,...,...,...,...,...,...,...,...,...
97,422,"Akcja Plac - Gry uliczne ""Oswajamy beton""",Uzasadnienie\nProjekt zakłada lokowanie w prze...,greenery/recreation,150000,Ołbin,236,46,2017
98,406,Budowa wrocławskiej wypożyczalni rowerów integ...,Uzasadnienie\nProjekt przeznaczony jest zarówn...,walking/cycling infrastructure,600000,Tarnogaj,178,47,2017
99,629,Wrocław na dotknięcie ręki – tylfograficzne ma...,"Uzasadnienie\nRynek i wszystkie miejsca, takie...",other,130000,"Stare Miasto, Przedmieście Świdnickie,Przedmie...",156,48,2017
100,720,1997,"Uzasadnienie\nIdeą projektu ""1997"" było upamię...",other,50000,Ołtaszyn,118,49,2017


In [3]:
sample17 = df[df['year'] == 2017].sample(n=10, random_state=42).filter(['project_id', 
                                                                        'project_name', 
                                                                        'description', 
                                                                        'cost', 
                                                                        'district',
                                                                        'votes'])

sample17

Unnamed: 0,project_id,project_name,description,cost,district,votes
65,345,Park Rędziński,Uzasadnienie\nKoncepcję Parku na Maślicach zap...,1000000,Brochów,2484
91,334,infrastruktura przystankowa na ul. Dawida,Uzasadnienie\nNa ul. Dawida (między ul. Dawida...,250000,Ołbin,480
82,668,"""Park & ride"" przy pętli Pilczyce/Stadion Wrocław","Uzasadnienie\n""Park & ride"" to parkingi dla p...",750000,Leśnica,1060
97,422,"Akcja Plac - Gry uliczne ""Oswajamy beton""",Uzasadnienie\nProjekt zakłada lokowanie w prze...,150000,Ołbin,236
69,590,Zagospodarowanie terenu kąpieliska na Oporowie...,Uzasadnienie\nBudowa: boiska do siatkówki plaż...,1000000,"Leśnica,Maślice",2001
100,720,1997,"Uzasadnienie\nIdeą projektu ""1997"" było upamię...",50000,Ołtaszyn,118
78,83,Bezpieczne i przyjazne drogi we Wrocławiu - św...,Uzasadnienie\nProjekt zakłada zmianę trybu pra...,650000,"Karłowice-Różanka,Huby",1455
77,555,Sieć parków we Wrocławiu – Psie Pole - Zawidaw...,Uzasadnienie\nCelem projektu jest stworzenie d...,1000000,Przedmieście Oławskie,1512
84,694,Zielone przystanki w centrum miasta,Uzasadnienie\nProjekt zakłada utworzenie zielo...,320000,"Pilczyce-Kozanów-Popowice Płn.,Szczepin",943
71,46,Rondo Reagana na poziomie! Budowa peronów na R...,Uzasadnienie\nRondo Reagana to jeden z ważniej...,1000000,Karłowice-Różanka,1839


In [13]:
### This option: validates if model memorized the dataset 
def build_guided_validation_prompt_1(project_name, lang = 'eng'):
    if lang == 'eng':
        prompt =f"""This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.\n
        Its name is: {project_name}.\n
        Please provide the exact original justification of the project as it was proposed during the election
        """
        return prompt
    
    else:
        prompt =f"""Ten projekt jest częścią budżetu obywatelskiego z 2017 roku w mieście Wrocław w Polsce.\n
        Jego nazwa to: {project_name}.\n
        Proszę podać dokładne oryginalne uzasadnienie projektu, takie jak zostało przedstawione podczas głosowania.
        """
        return prompt

### This option: validates if LLM can reconstruct original description, using its training dataset.
def build_guided_validation_prompt_2(project_name, description):
    prompt =f"""This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.\n
    Its name is: {project_name}.
    Complete the justification of the project exactly as it was proposed during the election.\n
    Description: {description[0:50]} ...
    """
    return prompt

### Some models may recall only the names + districts + costs from CSVs, databases, PDFs, etc.
def build_validation_prompt_cost_district(project_name):
    prompt = f"""This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.\n
    Its name is: {project_name}.
    Can you tell me:
    1. In which district of the city this project was proposed?
    2. What was the cost proposed for this project?
    """
    return prompt

In [None]:
# prompt 1
sample17['prompt'] = sample17.apply(lambda x: build_guided_validation_prompt_1(x['project_name'], lang='pl'), axis=1)
print(sample17['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample17['out'] = sample17['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample17.filter(['project_id','prompt','out']).to_csv("output/validation/wrc_guided_validation_description_pl.csv", sep=";", index=False)

In [6]:
# prompt 2
sample17['prompt'] = sample17.apply(lambda x: build_guided_validation_prompt_2(x['project_name'], x['description']), axis=1)
print(sample17['prompt'].iloc[0])

This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.

    Its name is: Park Rędziński.
    Complete the justification of the project exactly as it was proposed during the election.

    Description: Uzasadnienie
Koncepcję Parku na Maślicach zaprojek ...
    


In [8]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample17['out'] = sample17['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!


In [10]:
sample17.filter(['project_id','prompt','out']).to_csv("output/validation/wrc17_guided_validation_description_2.csv", sep=";", index=False)

In [15]:
# prompt 3
sample17['prompt'] = sample17.apply(lambda x: build_validation_prompt_cost_district(x['project_name']), axis=1)
print(sample17['prompt'].iloc[0])

This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.

    Its name is: Park Rędziński.
    Can you tell me:
    1. In which district of the city this project was proposed?
    2. What was the cost proposed for this project?
    


In [16]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample17['out'] = sample17['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!


In [18]:
sample17.filter(['project_id','prompt','out']).to_csv("output/validation/wrc17_guided_validation_district_cost.csv", sep=";", index=False)