# Augmenting Data with the Gemini API in Python

This tutorial aims to illustrate how to interact with the Gemini LLM API to create labels using predefined tasks.

## Set up your API key

To use the Gemini API, you'll need an API key. If you don't already have one, create a key in Google AI Studio.

<a class="button" href="https://aistudio.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>

In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `GOOGLE_API_KEY`. Then pass the key to the SDK:

In [54]:
# Import the Python SDK
import google.generativeai as genai
# Used to securely store your API key
from google.colab import userdata

GOOGLE_API_KEY="COLLEZ_VOTRE_CLE_ICI"
genai.configure(api_key=GOOGLE_API_KEY)

## Initialize the Generative Model

Before you can make any API calls, you need to initialize the Generative Model.

In [86]:
VERSION = "gemini-1.5-flash"
# model = genai.GenerativeModel('gemini-pro')
model = genai.GenerativeModel(VERSION)


## Generate text

In [57]:
response = model.generate_content("Peux-tu me raconter une histoire en 5 phrases")
print(response.text)

Une petite fille nommée Lily aimait regarder les étoiles. Chaque soir, elle s'allongeait dans l'herbe et observait les points lumineux dans le ciel nocturne. Un soir, elle a vu une étoile filante traverser le ciel. Elle a fait un vœu en regardant l'étoile disparaitre. Le lendemain matin, Lily a trouvé un petit chaton errant dans son jardin. Elle l'a nommé Étoile, en souvenir de son vœu. Lily et Étoile sont devenus les meilleurs amis et ont passé de nombreuses heures à regarder les étoiles ensemble. 



# Hack Time

In [58]:
# Créez une un object contenant votre demande et lancez votre requête.
task = ""
response = model.generate_content(task)
print(response.text)


TypeError: contents must not be empty

# Let's get some external data

In [60]:
import pandas as pd

DATA_PATH = "https://raw.githubusercontent.com/mickaeltemporao/workshop-ai-augmented-data/main/data/raw/us_pols_20.csv"
df = pd.read_csv(DATA_PATH)
df.head()

Unnamed: 0,name,username,sex,age,party,description
0,William Timmons,votetimmons,male,40.0,Republican Party,Fighting for #SC04 in Congress. Captain in the...
1,Vicky Hartzler,RepHartzler,female,64.0,Republican Party,"The archived tweets of Vicky Hartzler, fmr Rep..."
2,Jill Stein,DrJillStein,female,74.0,Green Party of the United States,Medical doctor. Presidential candidate. People...
3,Madeleine Albright,madeleine,female,87.0,Democratic Party,Author of the NYT bestseller Hell and Other De...
4,Ron Wyden,WydenPress,male,75.0,Democratic Party,Official account of U.S. Senator @RonWyden 's ...


In [61]:
def make_task(task_text):
    return f"""
        You are an unbiased US politics expert.
        You will be provided with a Twitter account name and description.
        Your task is to classify the account into one of the following numbered categories:
        {task_text}
        For each of these categories, I have provided a short description to help you with your choice.
        Your output consists only of the number of the selected category, that is the number before the description provided.
    """

# Hack Time

In [63]:
# Test the newly created function


'\n        You are an unbiased US politics expert.\n        You will be provided with a Twitter account name and description.\n        Your task is to classify the account into one of the following numbered categories:\n        test\n        For each of these categories, I have provided a short description to help you with your choice.\n        Your output consists only of the number of the selected category, that is the number before the description provided.\n    '

# Let's add more tasks

In [80]:
tasks = {
    "ideology": make_task(
        """
        1. Left wing accounts are those that express political views and opinions or include content that focuses on issues of income equality, environmental protection, social justice, open borders, progressive policies to promote minority representation;
        2. Centre accounts are those that express political views that mix or combine left and right opinion and content such that one opinion or type of content does not dominate;
        3. Right wing accounts are those that express political views and opinions or include content that focuses on issues of economic liberalism, less state intervention in citizens lives, lower taxes, controlling borders and immigration);
        4. Non-partisan accounts are those that typically do not express political views or contain any political content;
        """
    ),
    "age": make_task(
        """
        1. 18-24 Early adulthood, references to college, social media trends, youth culture.
        2. 25-34: Early career stage, potential references to career growth, early family life, pop culture.
        3. 35-44: Mid-career, family-oriented topics, more established professional references.
        4. 45-54: Experienced career stage, references to leadership roles, mature pop culture.
        5. 55-64: Pre-retirement stage, discussions about retirement, long-term career, older family dynamics.
        6. 65+: Retirement, senior living, nostalgia, grandparenting.
        0. Unclassifiable/Insufficient Information.
        """
    ),
    "gender": make_task(
        """
        1. Male: Masculine language, traditional male-dominated interests or references.
        2. Female: Feminine language, topics or references more common among women.
        3. Non-binary/Genderqueer: Non-gendered language, a mix of traditionally male and female references.
        0. Unclassifiable/Insufficient Information.
        """
    ),
    "education": make_task(
        """
        1. High School or Lower: Basic language use, common knowledge, fewer technical terms.
        2. Some College/Technical School: Intermediate language, some industry-specific terms or references.
        3. Undergraduate Degree: Advanced language, references to undergraduate-level education.
        4. Graduate Degree (Master’s, PhD): Complex language, use of specialized terminology, advanced concepts.
        5. Professional Certifications: Industry-specific jargon, focus on certification-related content.
        0. Unclassifiable/Insufficient Information.
        """
    ),
}

# Hack Time

In [67]:
# Test it!


'\n        You are an unbiased US politics expert.\n        You will be provided with a Twitter account name and description.\n        Your task is to classify the account into one of the following numbered categories:\n        \n        1. 18-24 Early adulthood, references to college, social media trends, youth culture.\n        2. 25-34: Early career stage, potential references to career growth, early family life, pop culture.\n        3. 35-44: Mid-career, family-oriented topics, more established professional references.\n        4. 45-54: Experienced career stage, references to leadership roles, mature pop culture.\n        5. 55-64: Pre-retirement stage, discussions about retirement, long-term career, older family dynamics.\n        6. 65+: Retirement, senior living, nostalgia, grandparenting.\n        0. Unclassifiable/Insufficient Information.\n        \n        For each of these categories, I have provided a short description to help you with your choice.\n        Your output c

# Let's find a user!

In [70]:
df.iloc[0]

Unnamed: 0,0
name,William Timmons
username,votetimmons
sex,male
age,40.0
party,Republican Party
description,Fighting for #SC04 in Congress. Captain in the...


In [73]:
def make_content(obs):
    return f"""Account name: {obs.username}
Account description: {obs['description']}
"""

print(make_content(df.iloc[0]))

Account name: votetimmons
Account description: Fighting for #SC04 in Congress. Captain in the US Air Force @theSCANG. Small business owner. Endorsed by President Trump!



# Let's try this!

In [81]:
model=genai.GenerativeModel(
    model_name=VERSION,
    system_instruction=tasks['age']
)

response = model.generate_content(
    make_content(df.iloc[0]),
    generation_config=genai.types.GenerationConfig(
        candidate_count=1,
        stop_sequences=['x'],
        max_output_tokens=10,
        temperature=0.7,
    )
)
print(response.text)


4 



# We can automate this for more users & tasks!

In [88]:
def run_task(task, content):
    model=genai.GenerativeModel(
      model_name=VERSION,
      system_instruction=task
    )
    response = model.generate_content(
        content,
        generation_config=genai.types.GenerationConfig(
            candidate_count=1,
            stop_sequences=['x'],
            max_output_tokens=10,
            temperature=0.7,
        )
    )
    output = response.text
    return output


In [89]:
def main():
    for task in tasks:
        print(f"Starting {task}")
        newcol = f'task_{task}'

        if newcol not in df.columns:
            df[newcol] = "NONE"

        for i, j in df.iterrows():
            if df.loc[j.name, newcol] != "NONE":
                continue
            print(f"Running task for {j.name}")
            task_output = run_task(
                tasks[task],
                make_content(j)
            )
            df.loc[j.name, newcol] = task_output


In [103]:
main()

Starting ideology
Starting age
Starting gender
Starting education
Running task for 8
Running task for 9
Running task for 10
Running task for 11
Running task for 12
Running task for 13
Running task for 14


In [122]:
df

Unnamed: 0,name,username,sex,age,party,description,task_ideology,task_age,task_gender,task_education
0,William Timmons,votetimmons,male,40.0,Republican Party,Fighting for #SC04 in Congress. Captain in the...,3,4,1,2
1,Vicky Hartzler,RepHartzler,female,64.0,Republican Party,"The archived tweets of Vicky Hartzler, fmr Rep...",3,0,2,2
2,Jill Stein,DrJillStein,female,74.0,Green Party of the United States,Medical doctor. Presidential candidate. People...,1,5,2,3
3,Madeleine Albright,madeleine,female,87.0,Democratic Party,Author of the NYT bestseller Hell and Other De...,2,4,2,4
4,Ron Wyden,WydenPress,male,75.0,Democratic Party,Official account of U.S. Senator @RonWyden 's ...,1,0,0,2
5,Mike Pence,mike_pence,male,65.0,Republican Party,"Christian, Conservative, Republican- In That O...",3,4,1,2
6,Tim Burchett,timburchett,male,60.0,Republican Party,Married to Kelly. Father to Isabel. Congressma...,3,4,1,2
7,Dianne Feinstein,SenFeinstein,female,91.0,Democratic Party,United States Senator from California. On Face...,3,4,2,4
8,Aftab Pureval,aftabpureval,male,42.0,Democratic Party,70th Mayor of Cincinnati. Ohio born and raised...,4,3,1,0
9,Vanessa Gibson,Vanessalgibson,female,45.0,Democratic Party,"The 14th Bronx Borough President, Former Counc...",1,3,2,2


In [144]:
vars = df.columns[df.columns.str.contains('task')]
df[vars] = df[vars].apply(pd.to_numeric)


# Let's take a closer look...

In [119]:
import altair as alt


In [127]:
def make_bar_fig(df, var='task_age', title="Age Group", num=0):
    df = df[[f'{var}', 'task_ideology']].dropna()
    n = df.shape[0]
    fig = (
        alt.Chart(df).mark_bar().encode(
            x=alt.X('count(task_ideology)').stack("normalize").title(""),
            y=alt.Y(f'{var}').title(title + f" (n={n})"),
            color=alt.Color('task_ideology').title("Ideology").legend(orient='top')
        )
    )


In [130]:
make_bar_fig(df)

TypeError: no numeric data to plot