#  Introduction to Large Language Models with GPT & LangChain

To chat with GPT, we first need to load the `openai` and `os` packages and set the API key from the environment variable you just created.

### Instructions

- Import the `os` package.
- Import the `openai` package.
- Set `openai.api_key` to the `OPENAI_API_KEY` environment variable.

In [2]:
# Import the os package
import os

# Import the openai package
import openai

# Set openai.api_key to the OPENAI_API_KEY environment variable
openai.api_key = os.environ["OPENAI_API_KEY"]
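
If the environment variable is missing, the lookup above fails with a bare `KeyError`. As a minimal sketch, assuming a hypothetical helper named `read_api_key` (it is not part of `openai` or `os`), the lookup could be wrapped so that the failure explains itself:

```python
import os


def read_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a descriptive error.

    `read_api_key` is a hypothetical helper, shown for illustration only.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "create it before running the notebook."
        )
    return key


# A plain dict stands in for os.environ so the example runs anywhere
print(read_api_key({"OPENAI_API_KEY": "sk-example"}))  # prints: sk-example
```

The `env` parameter exists only so the helper can be tried without touching the real environment.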

In [3]:
# Import the langchain package as lc
import langchain as lc

# From the langchain.chat_models module, import ChatOpenAI
from langchain.chat_models import ChatOpenAI

# From the langchain.schema module, import AIMessage, HumanMessage, SystemMessage
from langchain.schema import AIMessage, HumanMessage, SystemMessage

We need to do some light data manipulation with the `pandas` package and data visualization with `plotly.express`. Finally, the `IPython.display` package contains functions to display Markdown content prettily.

In [4]:
# Import pandas using the alias pd
import pandas as pd

# Import plotly.express using the alias px
import plotly.express as px

# From the IPython.display package, import display and Markdown
from IPython.display import display, Markdown

The electric cars data is contained in a CSV file named `electric_cars.csv`.

Each row in the dataset represents the count of the number of cars registered within a city, for a particular model.

The dataset contains the following columns.

- `city` (character): The city in which the registered owner resides.
- `county` (character): The county in which the registered owner resides.
- `model_year` (integer): The [model year](https://en.wikipedia.org/wiki/Model_year#United_States_and_Canada) of the car.
- `make` (character): The manufacturer of the car.
- `model` (character): The model of the car.
- `electric_vehicle_type` (character): Either "Plug-in Hybrid Electric Vehicle (PHEV)" or "Battery Electric Vehicle (BEV)".
- `n_cars` (integer): The count of the number of vehicles registered.
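
Before relying on these columns, it can help to confirm that the file actually contains them. The following is a rough sketch using only the standard library; the function name `missing_columns` and the sample text are assumptions made for illustration, not part of the dataset's tooling:

```python
import csv
import io

EXPECTED_COLUMNS = [
    "city", "county", "model_year", "make",
    "model", "electric_vehicle_type", "n_cars",
]


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in expected if name not in header]


# Two-line sample in the same layout as the dataset described above
sample = (
    "city,county,model_year,make,model,electric_vehicle_type,n_cars\n"
    "Seattle,King,2023,TESLA,MODEL Y,Battery Electric Vehicle (BEV),1514\n"
)
print(missing_columns(sample))  # prints: []
```

An empty list means every documented column is present; extra columns in the file are ignored.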

Our first step is to import and print the data.

In [5]:
# Read the data from electric_cars.csv. Assign to electric_cars.
electric_cars = pd.read_csv('electric_cars.csv')

# Display a description of the numeric columns
print("Description of numeric columns\n")
display(electric_cars.describe())

# Display a description of the text (object) columns
print("Description of text columns\n")
display(electric_cars.describe(include='O'))

# Print the whole dataset
print("The electric cars dataset\n")
display(electric_cars)

Description of numeric columns



| | model_year | n_cars |
|---|---|---|
| count | 26813.0 | 26813.0 |
| mean | 2019.375527 | 5.612166 |
| std | 3.286257 | 26.997325 |
| min | 1997.0 | 1.0 |
| 25% | 2017.0 | 1.0 |
| 50% | 2020.0 | 2.0 |
| 75% | 2022.0 | 4.0 |
| max | 2024.0 | 1514.0 |


Description of text columns



| | city | county | make | model | electric_vehicle_type |
|---|---|---|---|---|---|
| count | 26813 | 26813 | 26813 | 26813 | 26813 |
| unique | 683 | 183 | 37 | 127 | 2 |
| top | Bothell | King | TESLA | LEAF | Battery Electric Vehicle (BEV) |
| freq | 479 | 7066 | 5071 | 1889 | 15885 |


The electric cars dataset



| | city | county | model_year | make | model | electric_vehicle_type | n_cars |
|---|---|---|---|---|---|---|---|
| 0 | Seattle | King | 2023 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 1514 |
| 1 | Seattle | King | 2018 | TESLA | MODEL 3 | Battery Electric Vehicle (BEV) | 1153 |
| 2 | Seattle | King | 2021 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 1147 |
| 3 | Seattle | King | 2022 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 1122 |
| 4 | Bellevue | King | 2023 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 931 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 26808 | Lakewood | Pierce | 2022 | BMW | IX | Battery Electric Vehicle (BEV) | 1 |
| 26809 | Lakewood | Pierce | 2022 | BMW | X5 | Plug-in Hybrid Electric Vehicle (PHEV) | 1 |
| 26810 | Lakewood | Pierce | 2022 | FORD | TRANSIT | Battery Electric Vehicle (BEV) | 1 |
| 26811 | Lakewood | Pierce | 2022 | HYUNDAI | KONA ELECTRIC | Battery Electric Vehicle (BEV) | 1 |


### Types of Message

There are three types of message, documented in the [Introduction](https://platform.openai.com/docs/guides/chat/introduction) to the Chat documentation. We'll use two of them here.

- `system` messages describe the behavior of the AI assistant. If you don't know what you want, try "You are a helpful assistant".
- `user` messages describe what you want the AI assistant to say. We'll cover examples of this today.
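
The third type, `assistant`, holds replies that came from the model. As a small sketch of how the three roles might sit together in one message list (the helper `make_message` and the example wording are hypothetical, and only the three roles described here are accepted):

```python
VALID_ROLES = {"system", "user", "assistant"}


def make_message(role, content):
    """Build one chat message dict, rejecting roles outside VALID_ROLES."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    return {"role": role, "content": content}


conversation = [
    make_message("system", "You are a helpful assistant."),
    make_message("user", "What is a DataFrame?"),
    # An earlier reply from the model, replayed as context
    make_message("assistant", "A DataFrame is a table of data."),
    make_message("user", "How do I sort one?"),
]
print([m["role"] for m in conversation])
# prints: ['system', 'user', 'assistant', 'user']
```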

In [6]:
# Define the system message. Assign to system_msg_test.
system_msg_test = "You are a helpful assistant who understands data science. You write in a clear language that a ten year old can understand. You keep your answers brief."

# Define the user message. Assign to user_msg_test.
user_msg_test = 'Tell me some of the uses of GPT for data analysis'

# Create a message list from the system and user messages. Assign to msgs_test.
msgs_test = [
    {'role': 'system', 'content': system_msg_test},
    {'role': 'user', 'content': user_msg_test}
]

# Send the messages to GPT. Assign to rsps_test.
# Note: openai.ChatCompletion is the interface in openai versions before 1.0.
rsps_test = openai.ChatCompletion.create(model='gpt-3.5-turbo', messages=msgs_test)

Now you need to explore the response. The result is a highly nested object. As well as the text response that we want, there's a lot of metadata. You'll print the whole thing so you can see the structure, and extract just the text content.

<details>
<summary>Code hints</summary>
<p>
        
Buried within the response variable is the text we asked GPT to generate. Luckily, it's always in the same place.
        
```
response["choices"][0]["message"]["content"]
```

</p>
</details>

In [7]:
# Print the whole response
print("The whole response\n")
print(rsps_test)


print("\n\n----\n\n")

# Print just the response's content
print("Just the response's content\n")
print(rsps_test["choices"][0]["message"]["content"])

The whole response

{
  "id": "chatcmpl-8n7hRdF17XeK3qCLysVTDMx2I2piY",
  "object": "chat.completion",
  "created": 1706718361,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "GPT, or Generative Pre-trained Transformer, is a type of artificial intelligence that can understand and analyze data. Some of the uses of GPT for data analysis include predicting trends in financial markets, generating natural language summaries of documents, and helping in medical research by analyzing patient data."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 53,
    "completion_tokens": 57,
    "total_tokens": 110
  },
  "system_fingerprint": null
}


----


Just the response's content

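GPT, or Generative Pre-trained Transformer, is a type of artificial intelligence that can understand and analyze data. Some of the uses of GPT for data analysis include predicting trends in financial markets, generating natural language summaries of documents, and helping in medical research by analyzing patient data.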
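
Beyond the text itself, the metadata can be useful, for example to see why generation stopped or how many tokens were billed. Here is a minimal sketch that works on a plain dictionary shaped like the response printed above; the function name `summarize_response` and the shortened reply text are made up for illustration:

```python
def summarize_response(response):
    """Pull the reply text, stop reason, and token count from a response."""
    choice = response["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


# A trimmed stand-in with the same shape as the response printed above
example = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 53, "completion_tokens": 57, "total_tokens": 110},
}
print(summarize_response(example))
# prints: {'content': 'Hello!', 'finish_reason': 'stop', 'total_tokens': 110}
```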

Now that we know GPT is working, we can start asking questions about data analysis. Because we have details of our dataset, we can pass these into our prompt to improve the quality of the messages we get back.

Another change that we're going to make is to use the `langchain` package, which provides a convenience layer on top of the `openai` package.

### Why should we use LangChain?

One of the advantages of LangChain is that it simplifies the code for some tasks, letting you avoid messing about with too many square brackets and curly braces as you navigate these deep objects.

Secondly, if you want to swap GPT for a different model at a later date (as you might in a corporate setting), it can be easier to do so if you use the `langchain` package rather than the `openai` package directly.

### LangChain message types

The LangChain message types are named slightly differently from the OpenAI message types.

- `SystemMessage` is the equivalent of OpenAI's `system` message.
- `HumanMessage` is the equivalent of OpenAI's `user` message.
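
To make the correspondence concrete, here is a rough sketch that converts LangChain-style messages into OpenAI-style dictionaries. Plain strings stand in for the LangChain classes so the example runs without `langchain` installed, and `to_openai_format` is a hypothetical helper, not a LangChain function:

```python
# Mapping from LangChain message class names to OpenAI role strings
ROLE_BY_MESSAGE_TYPE = {
    "SystemMessage": "system",
    "HumanMessage": "user",
    "AIMessage": "assistant",
}


def to_openai_format(messages):
    """Convert (class_name, content) pairs into OpenAI-style message dicts."""
    return [
        {"role": ROLE_BY_MESSAGE_TYPE[class_name], "content": content}
        for class_name, content in messages
    ]


pairs = [
    ("SystemMessage", "You are a data analysis expert."),
    ("HumanMessage", "Suggest some questions."),
]
print(to_openai_format(pairs))
```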

In [8]:
# A description of the dataset
dataset_description = """
You have a dataset about electric cars registered in Washington state, USA in 2020. It is available as a pandas DataFrame named `electric_cars`.

Each row in the dataset represents the count of the number of cars registered within a city, for a particular model.

The dataset contains the following columns.

- `city` (character): The city in which the registered owner resides.
- `county` (character): The county in which the registered owner resides.
- `model_year` (integer): The [model year](https://en.wikipedia.org/wiki/Model_year#United_States_and_Canada) of the car.
- `make` (character): The manufacturer of the car.
- `model` (character): The model of the car.
- `electric_vehicle_type` (character): Either "Plug-in Hybrid Electric Vehicle (PHEV)" or "Battery Electric Vehicle (BEV)".
- `n_cars` (integer): The count of the number of vehicles registered.
"""

# Create a task for the AI. Assign to suggest_questions.
suggest_questions = "Suggest some data analysis questions that could be answered with this dataset."

# Concatenate the dataset description and the request. Assign to msgs_suggest_questions.
msgs_suggest_questions = [
    SystemMessage(content='You are a data analysis expert.'),
    HumanMessage(content=f'{dataset_description}\n\n{suggest_questions}')
]


In [9]:
# Create a ChatOpenAI object. Assign to chat.
chat = ChatOpenAI()

# Pass your message to GPT. Assign to rsps_suggest_questions.
rsps_suggest_questions = chat(msgs_suggest_questions)

# Print the response
print("The whole response\n")
print(rsps_suggest_questions)

print("\n----\n")

# Print just the response's content
print("Just the response's content\n")
print(rsps_suggest_questions.content)


print("\n----\n")

# Print the type of the response
print("The type of the response\n")
print(type(rsps_suggest_questions))


The whole response

content='1. What is the total number of electric cars registered in Washington state in 2020?\n2. What is the distribution of electric car models registered in Washington state in 2020?\n3. Which city in Washington state had the highest number of electric cars registered in 2020?\n4. Which county in Washington state had the highest number of electric cars registered in 2020?\n5. What is the average number of electric cars registered per city in Washington state in 2020?\n6. What is the most common electric vehicle type registered in Washington state in 2020?\n7. Which manufacturer had the highest number of electric cars registered in Washington state in 2020?\n8. How does the number of electric cars registered vary by model year in Washington state in 2020?\n9. What is the average number of cars registered per electric vehicle type in Washington state in 2020?\n10. Is there a correlation between the number of electric cars registered and the population size of a cit

Notice that the response from GPT is a LangChain message object rather than a nested dictionary. The most useful part is its `.content` attribute, which contains the text response to your prompt.

While a single prompt and response can be useful, often you want to have a longer conversation with GPT. In this case, you can pass previous messages so that GPT can "remember" what was said before.

### AI messages

The response from GPT had type `AIMessage`. By distinguishing `AIMessage`s from the `HumanMessage`s, you can tell who said what in a conversation with the AI.

In [10]:
# Append the response and a new message to the previous messages. 
# Assign to msgs_python_top_models.
msgs_python_top_models = msgs_suggest_questions + [
    rsps_suggest_questions,
    HumanMessage(content='Write python code to find the top make and model combinations of electric cars in Washington')
]

# Pass your message to GPT. Assign to rsps_python_top_models.
rsps_python_top_models = chat(msgs_python_top_models)

# Display the response's Markdown content
display(Markdown(rsps_python_top_models.content))

Sure! Here's a Python code snippet that finds the top make and model combinations of electric cars in Washington:

```python
# Group the DataFrame by make and model columns and sum the counts
top_make_model = electric_cars.groupby(['make', 'model']).sum('n_cars')

# Sort the results in descending order based on the count of registered cars
top_make_model = top_make_model.sort_values('n_cars', ascending=False)

# Display the top 10 make and model combinations
top_make_model.head(10)
```

This code first groups the DataFrame `electric_cars` by the 'make' and 'model' columns and then sums the counts of registered cars for each make and model combination. It then sorts the results in descending order based on the count of registered cars. Finally, it displays the top 10 make and model combinations using the `head()` function.

In [11]:
# Group the data by make and model and sum only the n_cars column
top_combinations = electric_cars.groupby(['make', 'model'])[['n_cars']].sum()

# Sort the combinations in descending order based on the count of registered cars
top_combinations = top_combinations.sort_values(by='n_cars', ascending=False)

# Select the top 10 combinations
top_combinations = top_combinations.head(10)

# Print the top make and model combinations
for index, row in top_combinations.iterrows():
    make = index[0]
    model = index[1]
    count = row['n_cars']
    print(f"Make: {make}, Model: {model}, Number of Cars: {count}")

Make: TESLA, Model: MODEL Y, Number of Cars: 28502
Make: TESLA, Model: MODEL 3, Number of Cars: 27708
Make: NISSAN, Model: LEAF, Number of Cars: 13186
Make: TESLA, Model: MODEL S, Number of Cars: 7611
Make: CHEVROLET, Model: BOLT EV, Number of Cars: 5733
Make: TESLA, Model: MODEL X, Number of Cars: 5114
Make: CHEVROLET, Model: VOLT, Number of Cars: 4890
Make: VOLKSWAGEN, Model: ID.4, Number of Cars: 2999
Make: KIA, Model: NIRO, Number of Cars: 2876
Make: CHRYSLER, Model: PACIFICA, Number of Cars: 2642
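
One detail worth checking in the generated code: `.sum('n_cars')` does not select the `n_cars` column. In recent pandas versions the first positional parameter of a grouped `sum()` is `numeric_only`, so the string is likely treated as a truthy flag and `model_year` is summed as well. A minimal sketch on a toy DataFrame (the rows are made up for illustration) selects the column explicitly instead:

```python
import pandas as pd

# Toy data for illustration only; not taken from electric_cars.csv
toy = pd.DataFrame({
    "make": ["TESLA", "TESLA", "NISSAN"],
    "model": ["MODEL Y", "MODEL Y", "LEAF"],
    "model_year": [2022, 2023, 2019],
    "n_cars": [10, 5, 7],
})

# Select n_cars before summing so no other numeric column is aggregated
totals = (
    toy.groupby(["make", "model"])["n_cars"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
print(totals)
```

The result has exactly three columns (`make`, `model`, `n_cars`), which is easier to verify than a frame that also carries a meaningless sum of model years.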


Doing more analysis with GPT assistance is simply a case of continuing the conversation: append the latest response and a new `HumanMessage` prompt to the message list, then call `chat()` again.
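
That pattern can be sketched without any API calls. The example below uses plain dictionaries rather than LangChain message objects, and `continue_conversation` is a hypothetical helper named here only for illustration:

```python
def continue_conversation(history, reply, new_prompt):
    """Return a new message list: history, then the reply, then the next prompt.

    The input list is left unmodified, mirroring the `+` used in this notebook.
    """
    return history + [reply, {"role": "user", "content": new_prompt}]


history = [
    {"role": "system", "content": "You are a data analysis expert."},
    {"role": "user", "content": "Suggest some questions."},
]
reply = {"role": "assistant", "content": "1. Which city has the most cars?"}

longer = continue_conversation(history, reply, "Write code for question 1.")
print(len(history), len(longer))  # prints: 2 4
```

Returning a new list rather than appending in place keeps each stage of the conversation available for reuse, as the notebook does with `msgs_suggest_questions` and `msgs_python_top_models`.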

The output from GPT is partly random, which isn't always desirable in data analysis because you'd like your results to be reproducible. With large language models, the amount of randomness can be controlled with a parameter known as "temperature".

- `temperature` controls the randomness of the response. It ranges from `0` to `2`, where `0` means minimal randomness (making results easier to reproduce) and `2` means maximum randomness (which often gives weird responses). If you use the OpenAI API directly, the default is `1`; LangChain's `ChatOpenAI` lowers the default to `0.7`.
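
To build intuition, temperature is commonly described as a divisor applied to the model's raw token scores before they are turned into probabilities. The sketch below follows that textbook description with made-up scores; it is an illustration under that assumption, not OpenAI's actual sampling code:

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature.

    A temperature of 0 is treated as "always pick the top score".
    """
    if temperature == 0:
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.7, 2):
    print(t, [round(p, 2) for p in apply_temperature(scores, t)])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it across the alternatives.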

In [12]:
# Create a new OpenAI chat object with temperature set to zero. Assign to chat0.
chat0 = ChatOpenAI(temperature=0)

In [13]:
# Ask GPT for code for a bar plot, as detailed in the instructions
msgs_messages_plots = msgs_python_top_models + [
    rsps_python_top_models,
    HumanMessage(content='Write some python code draw a bar plot of the total count of electric cars by model year, with bars colored by electric vehicle type. Use the plotply express package')
]
rsps_python_plot = chat0(msgs_messages_plots)
display(Markdown(rsps_python_plot.content))

To draw a bar plot of the total count of electric cars by model year, with bars colored by electric vehicle type using the `plotly express` package, you can use the following Python code:

```python
import plotly.express as px

# Group the DataFrame by model year and electric vehicle type and sum the counts
grouped_data = electric_cars.groupby(['model_year', 'electric_vehicle_type']).sum('n_cars').reset_index()

# Draw the bar plot
fig = px.bar(grouped_data, x='model_year', y='n_cars', color='electric_vehicle_type',
             title='Total Count of Electric Cars by Model Year',
             labels={'model_year': 'Model Year', 'n_cars': 'Count of Electric Cars'},
             color_discrete_map={'Plug-in Hybrid Electric Vehicle (PHEV)': 'blue',
                                 'Battery Electric Vehicle (BEV)': 'green'})

# Show the plot
fig.show()
```

This code first groups the DataFrame `electric_cars` by 'model_year' and 'electric_vehicle_type' columns and sums the counts of registered cars for each combination. It then uses `plotly express` to create a bar plot, where the x-axis represents the model year, the y-axis represents the count of electric cars, and the bars are colored based on the electric vehicle type. The `title` and `labels` parameters are used to set the title and axis labels of the plot. The `color_discrete_map` parameter is used to assign specific colors to each electric vehicle type. Finally, the `fig.show()` function is called to display the plot.

Setting temperature to zero minimizes randomness, so sending the same messages again should return the same response. That makes your workflow more reproducible.

The final task is to see the plot. As before, remember that GPT is only your assistant and you are the boss. Check the code and its output to make sure that you really have what you want.

In [14]:
# Send the same messages again to confirm the response is reproducible
rsps_python_plot1 = chat0(msgs_messages_plots)
display(Markdown(rsps_python_plot1.content))

To draw a bar plot of the total count of electric cars by model year, with bars colored by electric vehicle type using the `plotly express` package, you can use the following Python code:

```python
import plotly.express as px

# Group the DataFrame by model year and electric vehicle type and sum the counts
grouped_data = electric_cars.groupby(['model_year', 'electric_vehicle_type']).sum('n_cars').reset_index()

# Draw the bar plot
fig = px.bar(grouped_data, x='model_year', y='n_cars', color='electric_vehicle_type',
             title='Total Count of Electric Cars by Model Year',
             labels={'model_year': 'Model Year', 'n_cars': 'Count of Electric Cars'},
             color_discrete_map={'Plug-in Hybrid Electric Vehicle (PHEV)': 'blue',
                                 'Battery Electric Vehicle (BEV)': 'green'})

# Show the plot
fig.show()
```

This code first groups the DataFrame `electric_cars` by 'model_year' and 'electric_vehicle_type' columns and sums the counts of registered cars for each combination. It then uses `plotly express` to create a bar plot, where the x-axis represents the model year, the y-axis represents the count of electric cars, and the bars are colored based on the electric vehicle type. The `title` and `labels` parameters are used to set the title and axis labels of the plot. The `color_discrete_map` parameter is used to assign specific colors to each electric vehicle type. Finally, the `fig.show()` function is called to display the plot.
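
To review generated code before running it, it can be convenient to pull the fenced blocks out of the Markdown response. The following is a rough sketch using a regular expression; `extract_python_blocks` is a hypothetical helper, and it assumes the response marks its code with a fence labelled `python`, as the responses above do:

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def extract_python_blocks(markdown_text):
    """Return the contents of every fenced python block in the text."""
    pattern = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
    return [block.strip() for block in pattern.findall(markdown_text)]


reply = "Here is the code:\n" + FENCE + "python\nprint(1 + 1)\n" + FENCE + "\nDone."
for code in extract_python_blocks(reply):
    print(code)  # prints: print(1 + 1)
```

In this notebook the input would be something like `rsps_python_plot.content`; the extracted text should still be read and checked by a person before it is executed.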

## Summary

You've now seen how to access GPT through the OpenAI API both directly and using LangChain.

You saw how GPT can be used to come up with ideas for analyses to perform and to write code for you.

You also saw how to have an extended conversation and how to control the reproducibility of the responses.

In [15]:
# import plotly.express as px

# # Group the dataframe by model year and electric vehicle type, and sum the number of cars for each combination
# model_year_counts = electric_cars.groupby(['model_year', 'electric_vehicle_type'])['n_cars'].sum().reset_index()

# # Draw the bar plot
# fig = px.bar(model_year_counts, x='model_year', y='n_cars', color='electric_vehicle_type',
#              title='Total Count of Electric Cars by Model Year',
#              labels={'model_year': 'Model Year', 'n_cars': 'Count of Electric Cars'},
#              color_discrete_map={'Plug-in Hybrid Electric Vehicle (PHEV)': 'blue',
#                                  'Battery Electric Vehicle (BEV)': 'green'})

# # Show the plot
# fig.show()

import plotly.express as px

# Group the data by model year and electric vehicle type and sum only the n_cars column
grouped_data = electric_cars.groupby(['model_year', 'electric_vehicle_type'])['n_cars'].sum().reset_index()

# Plot the bar plot
fig = px.bar(grouped_data, x='model_year', y='n_cars', color='electric_vehicle_type',
             labels={'model_year': 'Model Year', 'n_cars': 'Count of Electric Cars'},
             title='Total Count of Electric Cars by Model Year',
             barmode='group')

# Show the plot
fig.show()



In [16]:
!pip install "nbformat>=4.2.0"