# Data Analysis with GPT and LangChain

# Introduction

AI is fast becoming an integral part of business, government workflows, and everyday life. It can help at many points in a data analysis workflow: by leveraging AI, data analysts can streamline their work, gain deeper insights, and make more informed decisions based on the data at hand.

To use the OpenAI API for this work, we first describe the features of the dataset we want to analyze. This description is passed to the GPT model so that it knows what the dataset contains.

**`Example dataset description:`**

You have a dataset about unemployment rate by gender in Africa. It is available as a pandas DataFrame named `unemployment_rate`.

The dataset contains the following columns.

- `Entity` (object): The field represents the country or region.
- `Code` (object): A three-letter code representing the country where the unemployment rate data was collected.
- `Year` (integer): The year when the unemployment rate data was collected.
- `Unemployment, female (% of female labor force) (modeled ILO estimate)` (float): The percentage of the female labor force that is unemployed.
- `Unemployment, male (% of male labor force) (modeled ILO estimate)` (float): The percentage of the male labor force that is unemployed.
- `Population (historical estimates)` (integer): This column represents the population figures for the respective entity for the given year.
- `Continent` (object): This field indicates the continent where the `Entity (country or region)` is located.


# Aim

Our aim is to develop a simple data analysis bot using LangChain and GPT that suggests data analysis questions when prompted with a description of a dataset. The bot will also be able to hold a conversation about the dataset.

In [1]:
# Import packages

import warnings
import os
from IPython.display import display, Markdown

import openai
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage

# Ignore possible warnings
warnings.filterwarnings("ignore")

In [3]:
# Add OpenAI API key
openai.api_key = os.environ.get("OPENAI_API_KEY")
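
`os.environ.get` returns `None` when the variable is missing, so a forgotten key tends to surface only later, when the model is created or called. A minimal sketch of an earlier check, assuming the key lives in `OPENAI_API_KEY`; `require_api_key` is a hypothetical helper name, not part of the `openai` package:

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell before starting the notebook."
        )
    return key
```

With this in place, the assignment above could be written as `openai.api_key = require_api_key()`.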

# Dataset description

Here, we provide a description of the dataset we are interested in by describing its features.

In [4]:
# Dataset description
dataset_description = """
You have a dataset about unemployment rate by gender in Africa. It is available as pandas dataframe named `unemployment_rate`

The dataset contains the following columns.

- `Entity` (object): The field represents the country or region.
- `Code` (object): A three-letter code representing the country where the unemployment rate data was collected.
- `Year` (integer): The year when the unemployment rate data was collected.
- `Unemployment, female (% of female labor force) (modeled ILO estimate)` (float): The percentage of the female labor force that is unemployed.   
- `Unemployment, male (% of male labor force) (modeled ILO estimate)` (float): The percentage of the male labor force that is unemployed.       
- `Population (historical estimates)` (integer): This column represents the population figures for the respective entity for the given year
- `Continent` (object): This field indicates the continent where the `Entity (country or region)` is located.
"""

# Provide user message
user_msg = "Suggest some data analysis questions that can be answered with this dataset"

# Assign a role to the model via the system message, then combine the description and the request in the human message
messages = [
    SystemMessage(content = "You are a data analysis expert"),
    HumanMessage(content = f"{dataset_description}\n\n{user_msg}")
]
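
Writing the description by hand gets tedious as the number of columns grows. As a sketch, a small helper could assemble the same layout from a dictionary of column notes. `build_dataset_description` is a hypothetical name introduced here for illustration; it is not part of LangChain or pandas, and only two of the columns are shown:

```python
def build_dataset_description(summary: str, columns: dict[str, tuple[str, str]]) -> str:
    """Format a dataset summary and per-column notes as a prompt-ready description.

    `columns` maps each column name to a (dtype, meaning) pair.
    """
    lines = [summary, "", "The dataset contains the following columns.", ""]
    for name, (dtype, meaning) in columns.items():
        lines.append(f"- `{name}` ({dtype}): {meaning}")
    return "\n".join(lines)


description = build_dataset_description(
    "You have a dataset about unemployment rate by gender in Africa. "
    "It is available as a pandas DataFrame named `unemployment_rate`.",
    {
        "Entity": ("object", "The country or region."),
        "Year": ("integer", "The year when the unemployment rate data was collected."),
    },
)
print(description)
```

The resulting string can be used wherever `dataset_description` is used in the cell above.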


In [5]:
# Create a ChatOpenAI object.
chat = ChatOpenAI(temperature = 0)

In [7]:
# Display model response
response = chat(messages)
print(response.content)

1. What is the overall unemployment rate in Africa for each year?
2. How does the unemployment rate differ between males and females in Africa for each year?
3. Which African countries have the highest and lowest unemployment rates for females and males?
4. Is there a correlation between the population size and the unemployment rate in Africa?
5. How has the unemployment rate changed over time in Africa for both genders?
6. Are there any significant differences in the unemployment rates between different continents in Africa?
7. Is there a relationship between the unemployment rate and the level of economic development in African countries?
8. How does the unemployment rate in Africa compare to the global average?
9. Are there any specific years or periods where the unemployment rate significantly increased or decreased in Africa?
10. Is there a difference in the unemployment rate between urban and rural areas in African countries?


## Hold a conversation with GPT

Here, we ask follow-up questions based on the data analysis questions the model suggested. The AI assistant should be able to give precise responses and generate code when required.

In [8]:
# Append the model's response to messages: this gives the model a memory of the conversation so far
messages.append(response)

# Display messages
print(messages)

[SystemMessage(content='You are a data analysis expert', additional_kwargs={}), HumanMessage(content='\nYou have a dataset about unemployment rate by gender in Africa. It is available as pandas dataframe named `unemployment_rate`\n\nThe dataset contains the following columns.\n\n- `Entity` (object): The field represents the country or region.                                                               \n- `Code` (object) : A three letter column representing the country where the uemployment rate data was collected.                                                                   \n- `Year` (integer): The yead when the uenemployment rate data was collected.                                                                     \n- `Unemployment, female (% of female labor force) (modeled ILO estimate)` (float): The percentage of the female labor force that is unemployed.   \n- `Unemployment, male (% of male labor force) (modeled ILO estimate)` (float): The percentage of the male labor fo

In [12]:
prompt = HumanMessage(content = """Write some Python code to find out how unemployment rate for females compare
                      to the unemployment rate for males in Africa""")

# Append prompt to messages
messages.append(prompt)

# Model response
response = chat(messages)

# Display response
display(Markdown(response.content))

Certainly! Here's an example code in Python to compare the unemployment rate for females and males in Africa using the given dataset:

```python
import pandas as pd

# Load the dataset
unemployment_rate = pd.read_csv('unemployment_rate.csv')

# Filter the data for Africa
africa_data = unemployment_rate[unemployment_rate['Continent'] == 'Africa']

# Calculate the average unemployment rate for females and males in Africa
avg_female_unemployment = africa_data['Unemployment, female (% of female labor force) (modeled ILO estimate)'].mean()
avg_male_unemployment = africa_data['Unemployment, male (% of male labor force) (modeled ILO estimate)'].mean()

# Print the results
print("Average Unemployment Rate for Females in Africa: {:.2f}%".format(avg_female_unemployment))
print("Average Unemployment Rate for Males in Africa: {:.2f}%".format(avg_male_unemployment))
```

Make sure to replace `'unemployment_rate.csv'` with the actual file path or name of your dataset. This code will calculate the average unemployment rate for females and males in Africa and print the results.
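
Because the prompt states that `unemployment_rate` is already in memory, the `read_csv` step in the response above can be skipped. A minimal sketch of a per-year comparison follows. The tiny DataFrame holds made-up values for illustration only, and the `gap (percentage points)` column name is an assumption introduced here:

```python
import pandas as pd

FEMALE = "Unemployment, female (% of female labor force) (modeled ILO estimate)"
MALE = "Unemployment, male (% of male labor force) (modeled ILO estimate)"

# Made-up values for illustration only; in the notebook, `unemployment_rate` already exists.
unemployment_rate = pd.DataFrame(
    {
        "Entity": ["A", "B", "A", "B"],
        "Year": [2000, 2000, 2001, 2001],
        FEMALE: [10.0, 20.0, 12.0, 18.0],
        MALE: [8.0, 12.0, 9.0, 15.0],
        "Continent": ["Africa"] * 4,
    }
)

africa = unemployment_rate[unemployment_rate["Continent"] == "Africa"]

# Select the numeric columns before averaging so text columns are never aggregated.
yearly = africa.groupby("Year")[[FEMALE, MALE]].mean()

# Positive values mean the female rate is higher than the male rate.
yearly["gap (percentage points)"] = yearly[FEMALE] - yearly[MALE]
print(yearly.round(2))
```

Note that this is an unweighted mean across countries: every country counts equally regardless of its population.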

In [14]:
# Ask GPT for code for a bar plot
prompt = HumanMessage(content="""Write some Python code to draw a bar plot of how unemployment rate for females compare
to the unemployment rate for males in Africa""")

# Append prompt to messages
messages.append(prompt)

# Model response
response = chat(messages)

# Display response
display(Markdown(response.content))

Sure! Here's an example of Python code using the `matplotlib` library to draw a bar plot comparing the unemployment rate for females and males in Africa:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have already loaded the dataset into a pandas DataFrame named 'unemployment_rate'

# Filter the data for Africa
africa_data = unemployment_rate[unemployment_rate['Continent'] == 'Africa']

# Group the data by year and calculate the average unemployment rate for females and males
grouped_data = africa_data.groupby('Year').mean()[['Unemployment, female (% of female labor force) (modeled ILO estimate)',
                                                   'Unemployment, male (% of male labor force) (modeled ILO estimate)']]

# Reset the index to make 'Year' a column
grouped_data = grouped_data.reset_index()

# Plotting the bar plot
plt.figure(figsize=(10, 6))
plt.bar(grouped_data['Year'], grouped_data['Unemployment, female (% of female labor force) (modeled ILO estimate)'],
        width=0.4, label='Female Unemployment Rate')
plt.bar(grouped_data['Year'], grouped_data['Unemployment, male (% of male labor force) (modeled ILO estimate)'],
        width=0.4, label='Male Unemployment Rate')
plt.xlabel('Year')
plt.ylabel('Unemployment Rate (%)')
plt.title('Comparison of Unemployment Rate for Females and Males in Africa')
plt.legend()
plt.xticks(rotation=45)
plt.show()
```

This code will create a bar plot showing the unemployment rate for females and males in Africa for each year. The x-axis represents the years, and the y-axis represents the unemployment rate in percentage. The blue bars represent the female unemployment rate, and the orange bars represent the male unemployment rate.
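
The generated code may need two adjustments before it behaves as described. First, in recent pandas versions (2.0 and later, as far as I know), calling `.mean()` on a group that still contains text columns such as `Entity` raises a `TypeError`, so the numeric columns should be selected before averaging. Second, both `plt.bar` calls use the same x positions, so the male bars are drawn over the female bars. A sketch of a side-by-side version, using made-up values for illustration only:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

FEMALE = "Unemployment, female (% of female labor force) (modeled ILO estimate)"
MALE = "Unemployment, male (% of male labor force) (modeled ILO estimate)"

# Made-up values for illustration only; in the notebook, `unemployment_rate` already exists.
unemployment_rate = pd.DataFrame(
    {
        "Entity": ["A", "B", "A", "B"],
        "Year": [2000, 2000, 2001, 2001],
        FEMALE: [10.0, 20.0, 12.0, 18.0],
        MALE: [8.0, 12.0, 9.0, 15.0],
        "Continent": ["Africa"] * 4,
    }
)

africa = unemployment_rate[unemployment_rate["Continent"] == "Africa"]

# Pick the two rate columns first, then average per year.
yearly = africa.groupby("Year")[[FEMALE, MALE]].mean().reset_index()

# Shift each series by half a bar width so the bars sit side by side.
x = np.arange(len(yearly))
width = 0.4

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x - width / 2, yearly[FEMALE], width=width, label="Female unemployment rate")
ax.bar(x + width / 2, yearly[MALE], width=width, label="Male unemployment rate")
ax.set_xticks(x)
ax.set_xticklabels(yearly["Year"].astype(str), rotation=45)
ax.set_xlabel("Year")
ax.set_ylabel("Unemployment rate (%)")
ax.set_title("Unemployment rate for females and males in Africa")
ax.legend()
plt.show()
```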

# Conclusion
***`Doing more analysis with GPT assistance is simply a matter of continuing the conversation: append a new HumanMessage prompt to the message list and call chat(messages) again.`***
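
One way to package this loop is a small helper that appends the prompt, calls the model, and also stores the reply. The cells above append only the first response to `messages`, so later prompts do not see the code the model generated in between. The sketch below uses a stand-in model so it runs without an API key. `ask` and `fake_chat` are hypothetical names; in the notebook, `chat` and `HumanMessage(content=...)` would take the place of the stand-ins:

```python
from types import SimpleNamespace


def ask(chat, messages, prompt_message):
    """Send `prompt_message` with the history so far and record both turns in `messages`."""
    messages.append(prompt_message)
    response = chat(messages)
    messages.append(response)  # keep the reply so later prompts can refer to it
    return response


# Stand-in model for a dry run (hypothetical): it reports how many messages it received.
def fake_chat(history):
    return SimpleNamespace(content=f"received {len(history)} messages")


history = [SimpleNamespace(content="You are a data analysis expert")]
first = ask(fake_chat, history, SimpleNamespace(content="First question"))
second = ask(fake_chat, history, SimpleNamespace(content="Follow-up question"))
print(first.content)   # received 2 messages
print(second.content)  # received 4 messages
```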