# Data Analyst Agent: Get Your Data's Insights in the Blink of an Eye

In this example, we will make a data analyst agent: **a Code agent armed with data analysis libraries, that can load and transform dataframes to extract insights from our data, and plot the results.**

## Setup

In [None]:
!pip install -qU seaborn smolagents transformers

## Load dataset

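We will use the data from the [Kaggle Titanic challenge](https://www.kaggle.com/competitions/titanic/data), whose goal is to predict the survival of individual passengers. Download the training and test datasets from the website and save them as `titanic/train.csv` and `titanic/test.csv`, the paths used in the cells below.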
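
Since the later cells read `titanic/train.csv` and `titanic/test.csv`, a quick check that both files are in place can save a failed agent run. This is a minimal sketch, assuming the CSVs sit in a local `titanic/` folder; the helper name `missing_data_files` is ours and not part of any library:

```python
from pathlib import Path

# File names used by the prompts later in this notebook.
EXPECTED_FILES = ("train.csv", "test.csv")


def missing_data_files(data_dir="titanic", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    folder = Path(data_dir)
    return [name for name in expected if not (folder / name).is_file()]


missing = missing_data_files()
if missing:
    print(f"Missing from ./titanic: {missing}")
else:
    print("Both dataset files found.")
```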

## Create agents

We will build the agent with `CodeAgent`. Because a `CodeAgent` directly executes the code it writes, we do not need to give it any tools. We do need to let it import data science libraries, by passing `additional_authorized_imports=['numpy', 'pandas', 'matplotlib.pyplot', 'seaborn']`.

In general, every library listed in `additional_authorized_imports` must be installed in our local environment, since the Python interpreter can only import what is installed there.
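
One way to verify this before creating the agent is to ask Python's import machinery whether each module can be found. The sketch below uses only the standard library; the helper name `find_missing_modules` is illustrative:

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            # For dotted names such as "matplotlib.pyplot", find_spec imports the
            # parent package first and raises ModuleNotFoundError if it is absent.
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            spec = None
        if spec is None:
            missing.append(name)
    return missing


authorized_imports = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
print("Not installed:", find_missing_modules(authorized_imports))
```

An empty list means every authorized import is available; anything else should be installed with `pip` before running the agent.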

We will use [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) to power our agent via `HfApiModel`, which calls Hugging Face's Inference API.

In [None]:
from smolagents import HfApiModel, CodeAgent

model = HfApiModel('meta-llama/Llama-3.1-70B-Instruct')

In [None]:
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=10,  # smolagents names this limit max_steps (max_iterations was the transformers.agents name)
    additional_authorized_imports=['numpy', 'pandas', 'matplotlib.pyplot', 'seaborn']
)

## Data analysis

Before running the agent, we create the `./figures` folder where it will save its plots. We also prepare additional notes, taken directly from the competition page, and pass them to the `run` method through `additional_args`:

In [None]:
import os

# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('./figures', exist_ok=True)

In [None]:
additional_notes = """
### Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
"""

In [None]:
analysis = agent.run(
    """You are an expert data analyst.
Please load the source file and analyze its content.
According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
Then answer these questions one by one, by finding the relevant numbers.
Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.

In your final answer: summarize these correlations and trends.
After each number, derive real-world insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggests people are more bored in winter".
Your final answer should have at least 3 numbered and detailed parts.
""",
    additional_args=dict(
        additional_notes=additional_notes,
        source_file='titanic/train.csv'
    )
)
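
The prompt asks for `plt.clf()` between plots because `matplotlib.pyplot` keeps drawing on the current figure until it is cleared, so an uncleared figure would carry earlier bars or lines into the next saved image. Here is a minimal sketch of that pattern, using placeholder numbers rather than Titanic results; the function name `save_bar_charts` and the chart names are illustrative:

```python
import os

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to files only
import matplotlib.pyplot as plt


def save_bar_charts(charts, out_dir="./figures"):
    """Save one bar chart per entry of charts and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, (labels, values) in charts.items():
        plt.clf()  # start from an empty figure so plots do not accumulate
        plt.bar(labels, values)
        plt.title(name)
        path = os.path.join(out_dir, f"{name}.png")
        plt.savefig(path)
        paths.append(path)
    return paths


# Placeholder values for illustration only, not computed from the dataset.
toy_charts = {
    "chart_a": (["x", "y", "z"], [3, 1, 2]),
    "chart_b": (["x", "y"], [5, 4]),
}
```

Calling `save_bar_charts(toy_charts)` writes `chart_a.png` and `chart_b.png` into `./figures/`, each containing only its own bars.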

In [None]:
print(analysis)
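
The agent writes its own code at every step, so the exact analysis differs from run to run. To give an idea of the kind of computation behind "finding the relevant numbers", here is a sketch on a tiny hand-made table. The rows are invented for illustration and are not real passenger data; the column names are assumed to follow Kaggle's `train.csv` (`Pclass`, `Sex`, `Survived`):

```python
import pandas as pd

# Tiny hand-made table with Titanic-style columns: NOT real passenger data.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "Sex": ["female", "male", "female", "male", "female", "male", "male", "male"],
        "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
    }
)

# Survival rate per class: the mean of a 0/1 column is the share of survivors.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Pearson correlation between ticket class and survival.
class_corr = toy["Pclass"].corr(toy["Survived"])

print(rate_by_class)
print(f"Correlation(Pclass, Survived) = {class_corr:.3f}")
```

On the real dataset, the same two lines (`groupby(...).mean()` and `.corr(...)`) would run on the dataframe loaded from `source_file`.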

## Data scientist agent

Now we will let our agent train a model and make predictions on the data. To do so, we also allow it to use `sklearn` by adding it to `additional_authorized_imports`.

In [None]:
agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=[
        'numpy',
        'pandas',
        'matplotlib.pyplot',
        'seaborn',
        'sklearn'
    ],
    max_steps=12  # smolagents names this limit max_steps
)

In [None]:
output = agent.run(
    """You are an expert machine learning engineer.
Please train an ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    # str() guards against a non-string final answer from the first agent
    additional_args=dict(additional_notes=additional_notes + '\n' + str(analysis))
)
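
Since the agent decides on its own how to write `./output.csv`, it is worth checking the file before using it. The prompt above does not fix the column names, so the sketch below makes an assumption: that the agent followed Kaggle's submission format, with a `PassengerId` column and a `Survived` column holding 0 or 1. The helper name `check_submission` is illustrative:

```python
import csv

# Assumed layout (Kaggle submission format); the prompt does not enforce it.
EXPECTED_COLUMNS = ("PassengerId", "Survived")


def check_submission(path):
    """Return a list of problems found in the prediction file (empty if none)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in EXPECTED_COLUMNS if col not in header]
        if missing:
            # Without the expected columns, the row checks below cannot run.
            return [f"missing columns: {missing}"]
        rows = list(reader)

    problems = []
    if not rows:
        problems.append("file has no prediction rows")
    # Values are compared as text, so "0.0" or "1.0" would also be flagged.
    bad = [row for row in rows if row["Survived"] not in ("0", "1")]
    if bad:
        problems.append(f"{len(bad)} rows have a Survived value other than 0 or 1")
    return problems
```

After the agent finishes, `check_submission('./output.csv')` returns an empty list when the file matches the assumed layout.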