# Augmenting Data with the Gemini API in Python

This tutorial shows how to use the Gemini LLM API to label data with predefined classification tasks.

## Set up your API key

To use the Gemini API, you'll need an API key. If you don't already have one, create a key in Google AI Studio.

<a class="button" href="https://aistudio.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>

In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `GOOGLE_API_KEY`. Then pass the key to the SDK:

In [None]:
# Import the Python SDK
import google.generativeai as genai
# Used to securely store your API key
from google.colab import userdata

# Read the key stored in the Colab secrets manager under the name GOOGLE_API_KEY
GOOGLE_API_KEY = userdata.get("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)
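
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of one alternative, assuming the key has been exported as an environment variable named `GOOGLE_API_KEY`. The helper name `load_api_key` is hypothetical and not part of the SDK.

```python
import os


def load_api_key(env=None, name="GOOGLE_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    Hypothetical helper, not part of the Gemini SDK.
    `env` defaults to the process environment; pass a dict to test it.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key


# Example with an explicit mapping instead of the real environment
print(load_api_key({"GOOGLE_API_KEY": "my-secret-key"}))
```

The returned value can then be passed to `genai.configure(api_key=...)` in place of the Colab secret.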

## Initialize the Generative Model

Before you can make any API calls, you need to initialize the Generative Model.

In [None]:
VERSION = "gemini-1.5-flash"
# model = genai.GenerativeModel('gemini-pro')
model = genai.GenerativeModel(VERSION)


## Generate text

In [None]:
response = model.generate_content("Can you tell me a story in 5 sentences?")
print(response.text)

# Hack Time

In [None]:
# Create an object containing your request, then run your query.
task = ""
response = model.generate_content(task)
print(response.text)


# Let's get some external data

In [None]:
import pandas as pd

DATA_PATH = "https://raw.githubusercontent.com/mickaeltemporao/workshop-ai-augmented-data/main/data/raw/us_pols_20.csv"
df = pd.read_csv(DATA_PATH)
df.head()

In [None]:
def make_task(task_text):
    return f"""
        You are an unbiased US politics expert.
        You will be provided with a Twitter account name and description.
        Your task is to classify the account into one of the following numbered categories:
        {task_text}
        For each of these categories, I have provided a short description to help you with your choice.
        Your output consists only of the number of the selected category, that is the number before the description provided.
    """

# Hack Time

In [None]:
# Test the newly created function


# Let's add more tasks

In [None]:
tasks = {
    "ideology": make_task(
        """
        1. Left wing accounts are those that express political views and opinions or include content that focuses on issues of income equality, environmental protection, social justice, open borders, progressive policies to promote minority representation;
        2. Centre accounts are those that express political views that mix or combine left and right opinion and content such that one opinion or type of content does not dominate;
        3. Right wing accounts are those that express political views and opinions or include content that focuses on issues of economic liberalism, less state intervention in citizens' lives, lower taxes, controlling borders and immigration;
        4. Non-partisan accounts are those that typically do not express political views or contain any political content;
        """
    ),
    "age": make_task(
        """
        1. 18-24: Early adulthood, references to college, social media trends, youth culture.
        2. 25-34: Early career stage, potential references to career growth, early family life, pop culture.
        3. 35-44: Mid-career, family-oriented topics, more established professional references.
        4. 45-54: Experienced career stage, references to leadership roles, mature pop culture.
        5. 55-64: Pre-retirement stage, discussions about retirement, long-term career, older family dynamics.
        6. 65+: Retirement, senior living, nostalgia, grandparenting.
        0. Unclassifiable/Insufficient Information.
        """
    ),
    "gender": make_task(
        """
        1. Male: Masculine language, traditional male-dominated interests or references.
        2. Female: Feminine language, topics or references more common among women.
        3. Non-binary/Genderqueer: Non-gendered language, a mix of traditionally male and female references.
        0. Unclassifiable/Insufficient Information.
        """
    ),
    "education": make_task(
        """
        1. High School or Lower: Basic language use, common knowledge, fewer technical terms.
        2. Some College/Technical School: Intermediate language, some industry-specific terms or references.
        3. Undergraduate Degree: Advanced language, references to undergraduate-level education.
        4. Graduate Degree (Master’s, PhD): Complex language, use of specialized terminology, advanced concepts.
        5. Professional Certifications: Industry-specific jargon, focus on certification-related content.
        0. Unclassifiable/Insufficient Information.
        """
    ),
}
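
Each task lists its categories as numbered lines, and the prompt asks the model to answer with one of those numbers. As a rough sketch, the set of allowed numbers can be recovered from the category text with a regular expression, which is handy later for checking model answers. The function name `extract_labels` and the shortened `gender_text` sample are illustrative assumptions, not part of the original notebook.

```python
import re


def extract_labels(task_text):
    """Return the set of category numbers that start a line in `task_text`."""
    # Match lines such as "    1. Male: ..." and capture the leading number
    pattern = r"^[ \t]*(\d+)\.\s"
    return {int(n) for n in re.findall(pattern, task_text, flags=re.MULTILINE)}


# Shortened copy of the gender categories, for illustration only
gender_text = """
        1. Male: Masculine language.
        2. Female: Feminine language.
        3. Non-binary/Genderqueer: Non-gendered language.
        0. Unclassifiable/Insufficient Information.
"""

print(sorted(extract_labels(gender_text)))
```

The same function can be applied to the category text passed to `make_task`, since the instruction lines of the prompt do not start with a number.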

# Hack Time

In [None]:
# Test it!


# Let's find a user!

In [None]:
df.iloc[0]

In [None]:
def make_content(obs):
    return f"""Account name: {obs['username']}
Account description: {obs['description']}
"""

print(make_content(df.iloc[0]))
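
If some rows have an empty description, pandas reads them as `NaN`, and the prompt would then contain the literal text "nan". Whether this dataset has such rows is not stated here, so the following is only a defensive sketch. The names `clean_field` and `make_content_safe` are hypothetical, and a plain dict stands in for a DataFrame row.

```python
import math


def clean_field(value, default="(no description)"):
    """Return `value` as stripped text, or `default` if it is missing."""
    # pandas represents empty CSV cells as float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    text = str(value).strip()
    return text or default


def make_content_safe(obs):
    """Variant of make_content that tolerates missing fields."""
    return (
        f"Account name: {clean_field(obs['username'], '(unknown)')}\n"
        f"Account description: {clean_field(obs['description'])}\n"
    )


# A plain dict stands in for a DataFrame row here
print(make_content_safe({"username": "example_user", "description": float("nan")}))
```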

# Let's try this!

In [None]:
model = genai.GenerativeModel(
    model_name=VERSION,
    system_instruction=tasks['age']
)

response = model.generate_content(
    make_content(df.iloc[0]),
    generation_config=genai.types.GenerationConfig(
        candidate_count=1,
        stop_sequences=['x'],
        max_output_tokens=10,
        temperature=0.7,
    )
)
print(response.text)


# We can automate this for more users & tasks!

In [None]:
def run_task(task, content):
    model = genai.GenerativeModel(
        model_name=VERSION,
        system_instruction=task
    )
    response = model.generate_content(
        content,
        generation_config=genai.types.GenerationConfig(
            candidate_count=1,
            stop_sequences=['x'],
            max_output_tokens=10,
            temperature=0.7,
        )
    )
    output = response.text
    return output
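
The prompt asks for a bare category number, but the returned text may carry a trailing newline or extra words. A minimal sketch of a stricter parser follows, assuming the first integer in the answer is the intended label. `parse_label` is a hypothetical helper, not part of the SDK.

```python
import re


def parse_label(text, valid_labels):
    """Return the first integer found in `text` if it is a valid label, else None."""
    match = re.search(r"\d+", text or "")
    if match is None:
        return None
    label = int(match.group())
    return label if label in valid_labels else None


valid = {0, 1, 2, 3}
print(parse_label("2\n", valid))
print(parse_label("Category 3.", valid))
print(parse_label("I cannot tell", valid))
```

Returning `None` for anything outside the allowed set keeps unexpected answers visible instead of silently storing them.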


In [None]:
def main():
    for task in tasks:
        print(f"Starting {task}")
        newcol = f'task_{task}'

        if newcol not in df.columns:
            df[newcol] = "NONE"

        for idx, row in df.iterrows():
            # Skip rows that already have a result, so the loop can be resumed
            if df.loc[idx, newcol] != "NONE":
                continue
            print(f"Running task for {idx}")
            task_output = run_task(
                tasks[task],
                make_content(row)
            )
            df.loc[idx, newcol] = task_output


In [None]:
main()
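
The loop in `main()` makes one API call per row and task, so a single transient failure (for example a rate limit) stops the whole run. One possible mitigation is a small retry wrapper with exponential backoff. This is a generic sketch: `with_retries` is a hypothetical helper, and catching every `Exception` is a simplification, since the specific error types raised by the SDK are not covered here.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` and retry on any exception, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake call that fails twice before succeeding
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "1"


print(with_retries(flaky_call, base_delay=0.01))
```

Inside `main()`, the call could be wrapped as `with_retries(lambda: run_task(tasks[task], make_content(j)))`. Because finished rows are skipped on the next run, rerunning `main()` after a failure also resumes where it stopped.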

In [None]:
df

In [None]:
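task_cols = df.columns[df.columns.str.contains('task')]
# Outputs that cannot be parsed as numbers become NaN instead of raising an error
df[task_cols] = df[task_cols].apply(pd.to_numeric, errors='coerce')


In [None]:
df[task_cols] = df[task_cols].astype('category')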

# Let's take a closer look...

In [None]:
import altair as alt


In [None]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('count(task_ideology)').stack("normalize"),
    y=alt.Y('task_age'),
    color=alt.Color('task_ideology').legend(orient='top')
)


In [None]:
alt.Chart(df).mark_bar().encode(
  x=alt.X('count(task_education)'),
  y='task_education',
  color='task_ideology'
)
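
The charts display the numeric category codes, which are hard to read without the task definitions at hand. A small sketch of a code-to-label mapping for the ideology task follows, with label names taken from the category descriptions above. `IDEOLOGY_LABELS` and `label_for` are hypothetical names.

```python
# Short names for the four ideology categories defined in the task
IDEOLOGY_LABELS = {
    1: "Left wing",
    2: "Centre",
    3: "Right wing",
    4: "Non-partisan",
}


def label_for(code, labels, default="Unknown"):
    """Translate a numeric code (int, float, or numeric string) into a label."""
    try:
        return labels.get(int(code), default)
    except (TypeError, ValueError):
        # Missing or non-numeric values fall back to the default
        return default


print([label_for(c, IDEOLOGY_LABELS) for c in [1, "3", None, 9]])
```

To use it in a chart, a new column (here called `ideology_label`, an assumed name) could be created with `df['task_ideology'].map(lambda c: label_for(c, IDEOLOGY_LABELS))` and passed to the `color` encoding.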
