## Description
This notebook classifies all labeled entries by sending them, together with a prompt, to the Google Gemini large language model (LLM). It processes the DataFrame in batches of 10 rows, one request per batch, and waits for each response before sending the next batch.
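
The batching described above can be sketched without any API calls. This is a minimal illustration, not the notebook's own code: `iter_batches` and `align_predictions` are hypothetical helper names, and the padding with `None` is an assumption about how missing replies might be handled.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def iter_batches(
    items: Sequence[str], batch_size: int = 10
) -> Iterator[Tuple[int, Sequence[str]]]:
    """Yield (start_offset, batch) pairs that cover `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


def align_predictions(reply_text: str, expected: int) -> List[Optional[str]]:
    """Return exactly `expected` predictions from a line-wise model reply.

    Blank lines are skipped, surplus lines are dropped, and missing
    lines are padded with None (an assumed convention).
    """
    lines = [line.strip() for line in reply_text.splitlines() if line.strip()]
    lines = lines[:expected]
    return lines + [None] * (expected - len(lines))
```

With 21 labelled rows and a batch size of 10, `iter_batches` yields batches of 10, 10, and 1 entries. Padding short replies keeps every prediction aligned with the row it belongs to.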

#### Improvements to consider
- Create multiple API keys to allow more requests per minute and per day.
- The prompt is somewhat unstable because multiple entries are processed in a single request. Processing one entry at a time with `max_output_tokens = 1` would be much more stable.
- Define prompts in other languages.
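
The one-entry-at-a-time idea from the list above could look roughly like the sketch below. It is an assumption-laden illustration: `build_prompt`, `parse_reply`, and `classify_entry` are hypothetical names, and `ask_model` is a hypothetical callable standing in for the actual Gemini request (for example, one made with `max_output_tokens=1`), so the sketch runs without an API key.

```python
from typing import Callable, Optional

# German prompt, adapted from the batch prompt for a single entry.
# English: "Is the following content traffic or mobility data?
# Answer only with T (True) or F (False)."
PROMPT_TEMPLATE = (
    "Handelt es sich bei folgendem Inhalt um Verkehrs- oder Mobilitätsdaten? "
    "Antworte nur mit T (True) oder F (False).\n\n"
    "Titel: {title}\nBeschreibung: {description}"
)


def build_prompt(title: str, description: str) -> str:
    """Build the prompt for exactly one dataset entry."""
    return PROMPT_TEMPLATE.format(title=title, description=description)


def parse_reply(reply: Optional[str]) -> Optional[bool]:
    """Map a reply starting with T/F (any case) to a bool, else None."""
    token = (reply or "").strip().upper()[:1]
    return {"T": True, "F": False}.get(token)


def classify_entry(
    title: str,
    description: str,
    ask_model: Callable[[str], Optional[str]],  # hypothetical model wrapper
) -> Optional[bool]:
    """Classify one entry; returns None if the reply cannot be parsed."""
    return parse_reply(ask_model(build_prompt(title, description)))
```

Because each request carries a single entry, a reply can never be assigned to the wrong row. The trade-off is one request per row instead of one per batch.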

The code was created with the assistance of ChatGPT-4.

In [None]:
import pandas as pd
from google import genai
from google.genai import types
import json

inputdata_file = 'data/01_labelled_data.csv'
outputdata_file = 'data/02_predicted_data.csv'

with open("data/apikeys.json") as f:
    config = json.load(f)
API_KEY = config["GOOGLE_API_KEY"]

In [8]:
df = pd.read_csv(inputdata_file, dtype={'mobilitydata_labelled': 'string'}, low_memory=False)

# Drop rows where 'mobilitydata_labelled' is empty (NaN)
df = df.dropna(subset=['mobilitydata_labelled'])

# Convert 'mobilitydata_labelled' to boolean type
df['mobilitydata_labelled'] = df['mobilitydata_labelled'].map({'True': True, 'False': False})

# Print the number of rows remaining after filtering
print(f"Number of labelled rows after filtering: {len(df)}")

Number of labelled rows after filtering: 21


In [None]:
chunk_size = 10
client = genai.Client(api_key=API_KEY)

# Prepare a new column with empty values to store the results
df['mobilitydata_generated'] = None

# Iterate over the DataFrame in chunks
for i in range(0, len(df), chunk_size):
    # Select the relevant columns from the current chunk
    chunk_df = df.iloc[i:i + chunk_size][['dataset_title_DE', 'dataset_description_DE']]

    # Combine title and description for each entry into a single string
    chunk_lines = chunk_df.apply(
        lambda row: f"Titel: {row['dataset_title_DE']}\nBeschreibung: {row['dataset_description_DE']}",
        axis=1
    ).tolist()

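    # Construct the prompt with all entries in the chunk. The German instruction reads:
    # "Is the following content traffic or mobility data? Answer only with
    # T (True) or F (False), line by line."
    prompt = "Handelt es sich bei folgendem Inhalt um Verkehrs- oder Mobilitätsdaten? Antworte nur mit T (True) oder F (False) Zeilenweise.\n\n" + "\n\n".join(chunk_lines)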

    # Send the prompt to the Gemini model
    response = client.models.generate_content_stream(
        model="gemini-2.0-flash",
        contents=[prompt],
        config=types.GenerateContentConfig(
            max_output_tokens=chunk_size * 2,
            temperature=0
        )
    )

    # Collect the response text from the stream (part.text can be None)
    result_text = ""
    for part in response:
        result_text += part.text or ""

    # Split the result into individual predictions, skipping blank lines and
    # dropping any surplus lines beyond the size of this chunk
    predictions = [line.strip() for line in result_text.splitlines() if line.strip()]
    predictions = predictions[:len(chunk_df)]

    # Write the predictions back into the corresponding rows of this chunk only
    target_indices = chunk_df.index[:len(predictions)]
    df.loc[target_indices, 'mobilitydata_generated'] = predictions

In [19]:
# Write the DataFrame to a new CSV file
df.to_csv(outputdata_file, index=False)

print(f'The file has been successfully saved as {outputdata_file}.')

The file has been successfully saved as data/02_predicted_data.csv.
