### Data Validation

Methodology:
I choose 10 random projects from the 2024 election in Toulouse. For each one, I keep the following fields:

1. Project Name
2. Description 
3. Cost
4. District
5. Votes

Then I ask the model (guided / non-guided prompts):

#### Without searching in external sources

1. Complete the description.
2. In which neighborhood and in which category was the project proposed?

The answers are compared with the original text using:

- ROUGE-L (lexical similarity)
- BLEURT (semantic similarity)

-> Methodology proposed here: https://arxiv.org/abs/2308.08493

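As a rough illustration of the lexical metric listed above, ROUGE-L can be computed from the longest common subsequence (LCS) of the reference and candidate tokens. This is a minimal sketch, assuming whitespace tokenization and lowercasing; the function names `lcs_length` and `rouge_l_f1` are hypothetical, and a maintained ROUGE package would normally be preferable for reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumption: whitespace tokens, lowercased)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score close to 1 between the model output and the original description would suggest near-verbatim recall; BLEURT needs a trained checkpoint and is not sketched here.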
#### Toulouse 2024 Dataset

In [12]:
# 1. load data
from src.data_loader import load_and_prepare_projects, load_prediction_set

path_22 = 'data/tls22_projects.csv'
path_24 = 'data/tls24_projects.csv'

df, df22_shuffled = load_and_prepare_projects(path_22,path_24)

ids_to_predict_path = 'data/tls24_projects_to_predict.csv'
test = load_prediction_set(df, ids_to_predict_path)
df

Unnamed: 0,project_id,project_name,description,category,cost,district,votes,district_number,rank,year
0,136,Piste cyclable avenue Saint-Exupéry,"Lorsque l'on habite vers la place de l'ormeau,...",Éco-mobilité,200000.0,12 - Pont des Demoiselles / Ormeau / Montaudra...,492,12,1,2022
1,7,Faire des voies vélos et piétons dédiées et bi...,Pour favoriser la cohabitation piétons/cyclist...,Éco-mobilité,150000.0,1 - Capitole / Arnaud Bernard / Carmes,467,1,2,2022
2,5,Végétalisation de la place du Capitole,Pour donner encore plus de majesté à ce lieu e...,Nature en ville,200000.0,1 - Capitole / Arnaud Bernard / Carmes,358,1,3,2022
3,132,Lutte anti-moustiques,Mise en place de pièges à moustiques dans le q...,Nature en ville,7500.0,12 - Pont des Demoiselles / Ormeau / Montaudra...,258,12,4,2022
4,71,Planter des arbres et jardins partagés partout...,Une voisine exploite un bout du terrain du sup...,Nature en ville,20000.0,8 - Minimes / Barrière de Paris / Ponts-Jumeau...,243,8,5,2022
...,...,...,...,...,...,...,...,...,...,...
378,372,Installation d'un portique pour réparer les vé...,Fleury : Place André Abbal (Reynerie): Il sera...,Éco-mobilité et transports,5000.0,17 - Mirail-Université / Reynerie / Bellefontaine,20,17,179,2024
379,356,Débitumiser les places de stationnement rue Be...,Je propose d’enlever le bitume sur les places ...,Nature en ville et biodiversité,240000.0,15 - Croix de Pierre / Route d'Espagne,19,15,180,2024
380,230,Installation de poubelles à déjection aux Sept...,"Installer sur le quartier des sept deniers, de...",Cadre de vie,6000.0,7 - Sept Deniers / Ginestous-Sesquières / Lalande,19,7,181,2024
381,370,Installer des tables et des chaises au Parc Ma...,Il n’y a pas assez de tables et de chaises pou...,Cadre de vie,85000.0,17 - Mirail-Université / Reynerie / Bellefontaine,18,17,182,2024


In [None]:
sample24 = df[df['year'] == 2024].sample(n=10, random_state=42).filter(['project_id', 
                                                                        'project_name', 
                                                                        'description', 
                                                                        'cost', 
                                                                        'district',
                                                                        'votes'])

sample24

In [None]:
### Prompt 1: checks whether the model memorized the dataset (only the project name is given)
def build_guided_validation_prompt_1(project_name):
    prompt =f"""This project is part of the 2024 participatory budget of the city of Toulouse.\n
    Its name is: {project_name}.\n
    Please provide the exact original description as it was proposed during the election
    """
    return prompt

### Prompt 2: checks whether the LLM can reconstruct the original description from its first characters, relying on its training data
def build_guided_validation_prompt_2(project_name, description):
    prompt =f"""This project is part of the 2024 participatory budget of the city of Toulouse.\n
    Its name is: {project_name}.
    Complete the description of the project exactly as it was proposed during the election.\n
    Description: {description[0:30]} ...
    """
    return prompt
### Prompt 3: some models may remember only the names + districts + costs from CSV files, databases, PDFs, etc.
def build_validation_prompt_cost_district(project_name):
    prompt = f"""This project is part of the 2024 participatory budget of the city of Toulouse.\n
    Its name is: {project_name}.
    Can you tell me:
    1. In which district of the city this project was proposed?
    2. What was the cost proposed for this project?
    """
    return prompt
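The methodology mentions a non-guided variant, but only guided builders are defined above. One possible baseline is a prompt that gives the same description prefix while withholding the city, year, and project name, so that any gain of the guided prompt over this one can be attributed to the dataset identity. The function name `build_general_validation_prompt`, its wording, and the `n_chars` parameter are assumptions for illustration, not the prompts actually used in this notebook.

```python
def build_general_validation_prompt(description, n_chars=30):
    """Non-guided variant: same completion task, no city, year, or project name."""
    prefix = description[:n_chars]
    return (
        "Complete the following text so that it forms a plausible project description.\n"
        f"Description: {prefix} ..."
    )


# Hypothetical usage with a made-up description
example_description = "Il faudrait installer des bancs le long de la promenade."
example_prompt = build_general_validation_prompt(example_description)
```

If guided and non-guided outputs score similarly against the original description, the overlap is more likely due to generic plausibility than to memorization.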

In [None]:
# prompt 1
sample24['prompt'] = sample24.apply(lambda x: build_guided_validation_prompt_1(x['project_name']), axis=1)
print(sample24['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample24['out'] = sample24['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample24.filter(['project_id','prompt','out']).to_csv("output/validation/guided_validation_description.csv", sep=";", index=False)

In [None]:
# prompt 2
sample24['prompt'] = sample24.apply(lambda x: build_guided_validation_prompt_2(x['project_name'], x['description']), axis=1)
print(sample24['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample24['out'] = sample24['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample24.filter(['project_id','prompt','out']).to_csv("output/validation/tls24_guided_validation_description_2.csv", sep=";", index=False)

In [None]:
# prompt 3
sample24['prompt'] = sample24.apply(lambda x: build_validation_prompt_cost_district(x['project_name']), axis=1)
print(sample24['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample24['out'] = sample24['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample24.filter(['project_id','prompt','out']).to_csv("output/validation/tls24_guided_validation_district_cost.csv", sep=";", index=False)
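The district and cost answers saved above are free text, so they need some matching rule before they can be compared with the `cost` and `district` columns. The following is a crude sketch of one such rule; the helper names `mentions_cost`, `district_names`, and `mentions_district` are hypothetical, and the matching is deliberately simple (substring and digit matching after normalization), so it can miss paraphrases or produce false positives.

```python
import re
import unicodedata


def _normalize(text):
    """Lowercase and strip combining accents so 'Université' matches 'universite'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def mentions_cost(response, cost):
    """True if the integer cost appears in the response, ignoring thousands separators."""
    target = str(int(float(cost)))
    # Capture digit groups such as "200 000", "200,000" or "200000"
    numbers = re.findall(r"\d+(?:[ \u00a0.,]\d{3})*", response)
    cleaned = {re.sub(r"[ \u00a0.,]", "", n) for n in numbers}
    return target in cleaned


def district_names(district):
    """Split '17 - A / B / C' or 'A,B' into individual neighborhood names."""
    without_number = re.sub(r"^\s*\d+\s*-\s*", "", district)
    parts = re.split(r"[/,]", without_number)
    return [p.strip() for p in parts if p.strip()]


def mentions_district(response, district):
    """True if any neighborhood name of the district appears in the response."""
    haystack = _normalize(response)
    return any(_normalize(name) in haystack for name in district_names(district))
```

Applied row by row to the `out` column, these checks would give a simple hit rate per prompt, which is easier to read for short factual answers than ROUGE-L.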

#### Wroclaw 2017 Dataset

In [1]:
# 1. load data
from src.data_loader import load_and_prepare_projects, load_prediction_set

path_16 = 'data/wrc16_projects.csv'
path_17 = 'data/wrc17_projects.csv'

df, df16_shuffled = load_and_prepare_projects(path_16,path_17, city='Wroclaw')

ids_to_predict_path = 'data/wrc17_projects_to_predict.csv'
test = load_prediction_set(df, ids_to_predict_path, city='Wroclaw')
df

Unnamed: 0,project_id,project_name,description,category,cost,district,votes,rank,year
0,710,Drzewa dla Wrocławia - nasadzenia w całym mieś...,Uzasadnienie\nZnikające drzewa z krajobrazu mi...,greenery/recreation,1000000,"Kuźniki, Nowy Dwór, Muchobór Mały,Grabiszyn-Gr...",13938,1,2016
1,15,Zielona rowerowo-piesza obwodnica Wrocławia; E...,Uzasadnienie,walking/cycling infrastructure,1000000,"Grabiszyn-Grabiszynek, Krzyki-Partynice, Oporó...",12348,2,2016
2,764,Oświetlenie Parku Grabiszyńskiego (Alei Romera...,Uzasadnienie\nTaki mamy klimat... że przez pół...,other,1000000,"Grabiszyn-Grabiszynek, Krzyki-Partynice, Oporó...",7291,3,2016
3,685,Plac Zabaw dla Starszaków w parku Grabiszyński...,Uzasadnienie\nPlac Zabaw dla Starszaków w park...,playgrounds,560000,"Huby, Gaj, Tarnogaj,Nadodrze, Ołbin, Stare Mia...",6663,4,2016
4,379,"Parking przy ""Dobrzyńskiej"" - ułatwienie dojaz...",Uzasadnienie\nProjekt polega na utworzeniu w c...,roads,750000,"Gajowice, Powstańców Śląskich, Borek",6383,5,2016
...,...,...,...,...,...,...,...,...,...
97,422,"Akcja Plac - Gry uliczne ""Oswajamy beton""",Uzasadnienie\nProjekt zakłada lokowanie w prze...,greenery/recreation,150000,Ołbin,236,46,2017
98,406,Budowa wrocławskiej wypożyczalni rowerów integ...,Uzasadnienie\nProjekt przeznaczony jest zarówn...,walking/cycling infrastructure,600000,Tarnogaj,178,47,2017
99,629,Wrocław na dotknięcie ręki – tylfograficzne ma...,"Uzasadnienie\nRynek i wszystkie miejsca, takie...",other,130000,"Stare Miasto, Przedmieście Świdnickie,Przedmie...",156,48,2017
100,720,1997,"Uzasadnienie\nIdeą projektu ""1997"" było upamię...",other,50000,Ołtaszyn,118,49,2017


In [3]:
sample17 = df[df['year'] == 2017].sample(n=10, random_state=42).filter(['project_id', 
                                                                        'project_name', 
                                                                        'description', 
                                                                        'cost', 
                                                                        'district',
                                                                        'votes'])

sample17

Unnamed: 0,project_id,project_name,description,cost,district,votes
65,345,Park Rędziński,Uzasadnienie\nKoncepcję Parku na Maślicach zap...,1000000,Brochów,2484
91,334,infrastruktura przystankowa na ul. Dawida,Uzasadnienie\nNa ul. Dawida (między ul. Dawida...,250000,Ołbin,480
82,668,"""Park & ride"" przy pętli Pilczyce/Stadion Wrocław","Uzasadnienie\n""Park & ride"" to parkingi dla p...",750000,Leśnica,1060
97,422,"Akcja Plac - Gry uliczne ""Oswajamy beton""",Uzasadnienie\nProjekt zakłada lokowanie w prze...,150000,Ołbin,236
69,590,Zagospodarowanie terenu kąpieliska na Oporowie...,Uzasadnienie\nBudowa: boiska do siatkówki plaż...,1000000,"Leśnica,Maślice",2001
100,720,1997,"Uzasadnienie\nIdeą projektu ""1997"" było upamię...",50000,Ołtaszyn,118
78,83,Bezpieczne i przyjazne drogi we Wrocławiu - św...,Uzasadnienie\nProjekt zakłada zmianę trybu pra...,650000,"Karłowice-Różanka,Huby",1455
77,555,Sieć parków we Wrocławiu – Psie Pole - Zawidaw...,Uzasadnienie\nCelem projektu jest stworzenie d...,1000000,Przedmieście Oławskie,1512
84,694,Zielone przystanki w centrum miasta,Uzasadnienie\nProjekt zakłada utworzenie zielo...,320000,"Pilczyce-Kozanów-Popowice Płn.,Szczepin",943
71,46,Rondo Reagana na poziomie! Budowa peronów na R...,Uzasadnienie\nRondo Reagana to jeden z ważniej...,1000000,Karłowice-Różanka,1839


In [13]:
### Prompt 1: checks whether the model memorized the dataset (only the project name is given)
def build_guided_validation_prompt_1(project_name, lang = 'eng'):
    if lang == 'eng':
        prompt =f"""This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.\n
        Its name is: {project_name}.\n
        Please provide the exact original justification of the project as it was proposed during the election
        """
        return prompt
    
    else:
        prompt =f"""Ten projekt jest częścią budżetu obywatelskiego z 2017 roku w mieście Wrocław w Polsce.\n
        Jego nazwa to: {project_name}.\n
        Proszę podać dokładne oryginalne uzasadnienie projektu, takie jak zostało przedstawione podczas głosowania.
        """
        return prompt

### Prompt 2: checks whether the LLM can reconstruct the original description from its first characters, relying on its training data
def build_guided_validation_prompt_2(project_name, description):
    prompt =f"""This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.\n
    Its name is: {project_name}.
    Complete the justification of the project exactly as it was proposed during the election.\n
    Description: {description[0:50]} ...
    """
    return prompt

### Prompt 3: some models may remember only the names + districts + costs from CSV files, databases, PDFs, etc.
def build_validation_prompt_cost_district(project_name):
    prompt = f"""This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.\n
    Its name is: {project_name}.
    Can you tell me:
    1. In which district of the city this project was proposed?
    2. What was the cost proposed for this project?
    """
    return prompt

In [None]:
# prompt 1
sample17['prompt'] = sample17.apply(lambda x: build_guided_validation_prompt_1(x['project_name'], lang='pl'), axis=1)
print(sample17['prompt'].iloc[0])

In [None]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample17['out'] = sample17['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

In [None]:
sample17.filter(['project_id','prompt','out']).to_csv("output/validation/wrc_guided_validation_description_pl.csv", sep=";", index=False)

In [6]:
# prompt 2
sample17['prompt'] = sample17.apply(lambda x: build_guided_validation_prompt_2(x['project_name'], x['description']), axis=1)
print(sample17['prompt'].iloc[0])

This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.

    Its name is: Park Rędziński.
    Complete the justification of the project exactly as it was proposed during the election.

    Description: Uzasadnienie
Koncepcję Parku na Maślicach zaprojek ...
    


In [8]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample17['out'] = sample17['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!


In [10]:
sample17.filter(['project_id','prompt','out']).to_csv("output/validation/wrc17_guided_validation_description_2.csv", sep=";", index=False)

In [15]:
# prompt 3
sample17['prompt'] = sample17.apply(lambda x: build_validation_prompt_cost_district(x['project_name']), axis=1)
print(sample17['prompt'].iloc[0])

This project is part of the 2017 participatory budget election on the city of Wroclaw in Poland.

    Its name is: Park Rędziński.
    Can you tell me:
    1. In which district of the city this project was proposed?
    2. What was the cost proposed for this project?
    


In [16]:
import os 
from llm_client import call_openai_model


api_key=os.getenv('OPENAI_API_KEY')
 
sample17['out'] = sample17['prompt'].apply(lambda prompt: 
                     call_openai_model(prompt=prompt, api_key=api_key))

LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!
LLM response receive!


In [18]:
sample17.filter(['project_id','prompt','out']).to_csv("output/validation/wrc17_guided_validation_district_cost.csv", sep=";", index=False)
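Each prompt round in this notebook repeats the same three steps: build the prompts, call the model, save `project_id`, `prompt`, and `out` to a semicolon-separated file. A possible consolidation is sketched below with the standard-library `csv` module instead of pandas so that it runs on its own. The names `run_prompt_round`, `fake_model`, and the demo row are hypothetical; `fake_model` merely stands in for `call_openai_model`, whose behavior is only known from the calls above.

```python
import csv
import os
import tempfile


def run_prompt_round(rows, build_prompt, call_model, out_path):
    """Build one prompt per row, query the model, and save the results as CSV."""
    results = []
    for row in rows:
        prompt = build_prompt(row)
        results.append(
            {"project_id": row["project_id"], "prompt": prompt, "out": call_model(prompt)}
        )
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["project_id", "prompt", "out"], delimiter=";"
        )
        writer.writeheader()
        writer.writerows(results)
    return results


# Demo with a stub model so no API key is needed (assumption: one row as a dict)
demo_rows = [{"project_id": 345, "project_name": "Park Rędziński"}]
fake_model = lambda prompt: "stub answer"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo_results = run_prompt_round(
    demo_rows, lambda r: f"Its name is: {r['project_name']}.", fake_model, demo_path
)
```

Passing the model call as an argument also makes it possible to rerun the same rounds with a different model or with a cached response table.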