# Blog 1: Structured Output with LangChain + OpenAI

## Introduction
In this blog post, we will explore how to produce structured output using LangChain with OpenAI. LangChain is a powerful framework that simplifies the integration of language models into various applications, enabling us to manage inputs and outputs, including structured data formats, with ease. We will apply these capabilities to an Urban Heat Island Analysis project, leveraging NASA's EarthData API to access MODIS Land Surface Temperature (LST) data. Our goal is to dynamically adjust parameters to ensure high-quality data for our analysis.

### Objectives

By the end of this lesson, you will be able to:

1. Set up the initial parameters for a data analysis project.
2. Visualize selected regions using Plotly Express.
3. Use LangChain and OpenAI to dynamically adjust project parameters.
4. Produce structured output using Pydantic models and LangChain.

## 1. Initial Setup

First, we need to manually set parameters for our analysis, such as the city region, coordinates for urban and rural areas, and the time range. This serves as our starting point and ensures we have a baseline for our project.

In [1]:
# Define the initial data parameters for the analysis
data_params = {
    'city_region_name': 'Houston, TX',  # Name of the city region
    'coordinates': {
        'urban': {'SW': [29.69193, -95.47998], 'NE': [29.90719, -95.2251]},  # Coordinates for the urban area
        'rural': {'SW': [30.5, -96.5], 'NE': [31.0, -96.0]}  # Coordinates for the rural area
    },
    'time': {
        'start': '2023-06-01T00:00:00Z',  # Start time for data collection
        'end': '2023-08-31T23:59:59Z'  # End time for data collection
    }
}
data_params

{'city_region_name': 'Houston, TX',
 'coordinates': {'urban': {'SW': [29.69193, -95.47998],
   'NE': [29.90719, -95.2251]},
  'rural': {'SW': [30.5, -96.5], 'NE': [31.0, -96.0]}},
 'time': {'start': '2023-06-01T00:00:00Z', 'end': '2023-08-31T23:59:59Z'}}

In this step, we specify the city region as Houston, TX, define the coordinates for both urban and rural areas, and set a time range for our analysis.
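
Because these coordinates are typed in by hand, a quick sanity check can catch swapped corners or latitude/longitude mix-ups early. The following is a minimal sketch, not part of the original notebook: `check_bounding_box` is a hypothetical helper, and it assumes each corner is given as `[lat, lon]` and that no box crosses the antimeridian.

```python
def check_bounding_box(box):
    """Return a list of problems found in a {'SW': [lat, lon], 'NE': [lat, lon]} box."""
    problems = []
    (sw_lat, sw_lon), (ne_lat, ne_lon) = box['SW'], box['NE']
    for name, lat, lon in (('SW', sw_lat, sw_lon), ('NE', ne_lat, ne_lon)):
        if not -90 <= lat <= 90:
            problems.append(f"{name} latitude {lat} is outside [-90, 90]")
        if not -180 <= lon <= 180:
            problems.append(f"{name} longitude {lon} is outside [-180, 180]")
    if sw_lat >= ne_lat:
        problems.append("SW latitude should be less than NE latitude")
    if sw_lon >= ne_lon:
        problems.append("SW longitude should be less than NE longitude")
    return problems

# Same values as the data_params defined above
houston_boxes = {
    'urban': {'SW': [29.69193, -95.47998], 'NE': [29.90719, -95.2251]},
    'rural': {'SW': [30.5, -96.5], 'NE': [31.0, -96.0]},
}
for region, box in houston_boxes.items():
    print(region, check_bounding_box(box) or "looks OK")
```

An empty list means the box passed these basic checks; it does not confirm that the box actually covers the intended city.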


## 2. Visualizing Selected Regions


Visualizing the selected regions helps us verify the accuracy of our coordinates. We use Plotly Express to generate sample points within the bounding boxes and plot them.

In [2]:
import pandas as pd
import plotly.express as px

def generate_sample_points(sw, ne, num_points=10):
    """Generate sample points within a given bounding box."""
    latitudes = [sw[0] + i * (ne[0] - sw[0]) / (num_points - 1) for i in range(num_points)]
    longitudes = [sw[1] + i * (ne[1] - sw[1]) / (num_points - 1) for i in range(num_points)]
    return [(lat, lon) for lat in latitudes for lon in longitudes]

# Generate sample points for urban and rural areas
sampled_coordinates = []
for region, bounding_box in data_params['coordinates'].items():
    
    sample_points = generate_sample_points(bounding_box['SW'], bounding_box['NE'])
    for lat, lon in sample_points:
        sampled_coordinates.append({'Region': region, 'Latitude': lat, 'Longitude': lon})

# Convert the sample points to a DataFrame
coords_df = pd.DataFrame(sampled_coordinates)
coords_df.head(2)

| | Region | Latitude | Longitude |
|---|---|---|---|
| 0 | urban | 29.69193 | -95.47998 |
| 1 | urban | 29.69193 | -95.45166 |


In [3]:
## Plot the region suggested
fig = px.scatter_mapbox(coords_df, lat="Latitude", lon="Longitude", color='Region',
                        # color_continuous_scale="Viridis", 
                        mapbox_style="carto-positron",
                        title=f"Preview of Selected Bounding Boxes for {data_params['city_region_name']}",
                        height=600, width=800)
## Update the layout
fig.update_layout(
    margin={"r":0, "l":0,'b':0, 't':100},# Remove left and right side margins
    legend={'orientation':"h", 'yanchor':"top", 'y':1.05, 'xanchor':"left", 'x':0}, # Move legend to top   
)
fig.show()

This visualization helps us confirm that the regions are correctly defined and prepares us for the next steps.


## 3. Getting New Suggested Parameters from OpenAI 

### Setting Up OpenAI + LangChain


Before using OpenAI's API, we need to set up our environment. Follow these steps:



1. **Sign up for OpenAI's API:** 
   - Visit the [OpenAI website](https://www.openai.com) and sign up for an API key.



2. **Create a `.secret` folder in your home directory:**
   ```bash
   cd ~
   mkdir .secret
   ```

3. **Save your API key as a text file in the `.secret` folder:**
   - Open a text editor and paste your API key.
   - Save the file as `open-ai.txt` in the `.secret` folder.
   


4. **Export the key from the file to your `.bash_profile` or `.zshrc`:**
   - Open the file for editing in VS Code:
   ```bash
   # If using bash
   code ~/.bash_profile

   # If using zsh, use:
   code ~/.zshrc
   ```
   - Add the following line to export the API key:
   ```bash
   export OPENAI_API_KEY=$(cat ~/.secret/open-ai.txt)
   ```
   - Save the file and exit the editor.


After following these steps and opening a new terminal session (or running `source` on the file you edited), your API key will be available in your environment variables as `OPENAI_API_KEY`.



5. **Confirm the setup with Python to ensure the key is correctly loaded.**

You can confirm this with Python by importing the `os` module and checking the `os.environ` dictionary for 'OPENAI_API_KEY'.


In [4]:
import os
'OPENAI_API_KEY' in os.environ  # Should return True if the key is correctly loaded

True


> Note: Do NOT display the value of your OPENAI_API_KEY. If a key is exposed publicly (for example, committed to a public repository), OpenAI may automatically disable it, which breaks any program or app that relies on it.
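
If the environment variable is not picked up (for example, when a notebook server was started before the profile was edited), one option is to fall back to reading the key file directly. The sketch below is an assumption-laden illustration: `load_api_key` is a hypothetical helper, and the default path simply mirrors the `~/.secret/open-ai.txt` file created above. It only reports whether a key was found and never prints the key.

```python
import os
from pathlib import Path

def load_api_key(env_var="OPENAI_API_KEY", fallback_file="~/.secret/open-ai.txt"):
    """Return the key from the environment, else from the fallback file, else None."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    path = Path(fallback_file).expanduser()
    if path.is_file():
        return path.read_text().strip() or None
    return None

key = load_api_key()
# Report only whether a key was found, never the key itself
print("API key found" if key else "API key missing")
```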

### Using LangChain with ChatGPT

LangChain is a versatile framework designed to simplify the integration of language models into various applications, enabling seamless management of inputs and outputs, including structured data formats.

By leveraging LangChain, developers can efficiently implement advanced AI functionalities, such as function calling, tool calling, and JSON mode, ensuring consistent and reliable outputs tailored to specific needs. This framework facilitates the creation of sophisticated AI-driven solutions across different domains, making it an essential tool for modern AI development.

In [5]:
## Installing langchain
# !pip install -U langchain_openai langchain_core langchain_community pydantic

To construct a chain using LangChain's newer LCEL (LangChain Expression Language), we need to define the following elements:
- The PromptTemplate
- The LLM model/Chat object.
- The OutputParser

#### LangChain's `PromptTemplate`

Prompt engineering/construction is vital for obtaining high-quality results from any Large Language Model (LLM). To get the best suggestions from the API, it is important to provide sufficient context in our prompt/query, while leaving the option to customize the prompt on-the-fly.

To do so, we will create a `prompt_string` that contains our prompt plus a `specs` variable wrapped in curly brackets, as if it were an f-string (note, however, that we are not actually using it as an f-string).

In [6]:

prompt_string = """I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. 
I need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.
Help me select the urban and rural regions and time, keeping the following in mind:
{specs}
"""
print(prompt_string)

I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. 
I need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.
Help me select the urban and rural regions and time, keeping the following in mind:
{specs}



We will use this `prompt_string` to construct a `PromptTemplate` object with LangChain, which turns the curly-bracket placeholder into an input variable that we pass in with our request.

In [7]:
from langchain_core.prompts import PromptTemplate

# Converting the prompt_string to a PromptTemplate
prompt = PromptTemplate.from_template(prompt_string) 
prompt

PromptTemplate(input_variables=['specs'], template='I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. \nI need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.\nHelp me select the urban and rural regions and time, keeping the following in mind:\n{specs}\n')

In [8]:
# Any curly-bracket placeholders in the prompt string become input_variables.
prompt.input_variables

['specs']

In [9]:
# Fill in the placeholders with input values using the .format() method
specs_string= "1. Urban area should be within the city limits and rural area should be outside the city limits."
print(prompt.format(specs=specs_string))

I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. 
I need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.
Help me select the urban and rural regions and time, keeping the following in mind:
1. Urban area should be within the city limits and rural area should be outside the city limits.
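
To see why `{specs}` ends up as an input variable, it can help to look at how Python itself parses format-style strings. The sketch below uses only the standard library's `string.Formatter`; it illustrates the idea and is not LangChain's actual implementation. `find_placeholders` is a hypothetical helper, and the template is a shortened stand-in for the prompt above.

```python
from string import Formatter

def find_placeholders(template):
    """Return the named {placeholders} in a format-style template, in order of first appearance."""
    names = []
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names

template = "Help me select the regions, keeping the following in mind:\n{specs}\n"
print(find_placeholders(template))  # ['specs']
print(template.format(specs="1. Urban area should be within the city limits."))
```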



#### Instantiating an LLM

Next, we need to instantiate a language model. LangChain has separate packages for each of the LLMs it is compatible with. Since we are using OpenAI, we will use the ChatOpenAI object from `langchain_openai`. The ChatOpenAI object from LangChain will interact with the OpenAI API.

In [10]:
from langchain_openai import ChatOpenAI 
chat = ChatOpenAI(api_key=os.environ['OPENAI_API_KEY'], model="gpt-4o", temperature=0.0)
chat

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x16c3b93d0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x16c3c5750>, model_name='gpt-4o', temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='')

Setting the temperature to 0.0 makes the responses as consistent as possible from run to run, although the API does not guarantee fully deterministic output.


#### Adding an Output Parser

The final piece of our chain is the output parser. We could technically leave it out, but then the chain would return a message object (an `AIMessage`) rather than plain text. The default `StrOutputParser` extracts the message content as a simple string.

In [11]:
from langchain_core.output_parsers import StrOutputParser
output_parser = StrOutputParser()

## 4. Putting it All Together: Our First Chain

We construct the final chain using the prompt, LLM, and output parser. This chain allows us to query OpenAI for new parameters.


Our chain starts with the prompt, followed by a pipe symbol `|` and then the LLM object, followed by another pipe, and finally the output parser. 

In [12]:
# Constructing Our Chain
chain = prompt | chat | output_parser
chain

PromptTemplate(input_variables=['specs'], template='I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. \nI need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.\nHelp me select the urban and rural regions and time, keeping the following in mind:\n{specs}\n')
| ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x16c3b93d0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x16c3c5750>, model_name='gpt-4o', temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='')
| StrOutputParser()
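
The `|` syntax may look unusual if you have not seen it used this way. As a rough mental model only (this is a toy, not how LangChain is implemented), Python lets a class define `__or__` so that `a | b` builds a new object that runs `a` and feeds its result to `b`. The `Step` class and the stand-in prompt, model, and parser below are all hypothetical.

```python
class Step:
    """Wrap a function so steps can be chained with `|` and run with `.invoke()`."""
    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Build a new step that runs self first, then passes the result to `other`
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Stand-ins for the prompt, the model, and the output parser
toy_prompt = Step(lambda inputs: "Keep in mind:\n{specs}".format(**inputs))
toy_model = Step(lambda text: {"content": f"(model reply to {len(text)} characters)"})
toy_parser = Step(lambda message: message["content"])

toy_chain = toy_prompt | toy_model | toy_parser
print(toy_chain.invoke({"specs": "Stay within the southern US."}))
```

Each stage only needs to accept the previous stage's output, which is why the prompt, model, and parser can be swapped independently.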

To use the chain we use the .invoke method and must provide the `input_variables` that are required by the prompt template.

In [13]:
# Using the input variables to provide the necessary information
specs_string = """1. Urban area should be within the city limits and rural area should be outside the city limits.
2. The urban area should be selected within the southern US region to study the urban heat island effect. 
"""

# Invoke the chain to get the response
response = chain.invoke({'specs':specs_string})
print(response)

To perform an urban heat island analysis using MODIS data, you need to carefully select your urban and rural regions to ensure they are representative and non-overlapping. Here's a step-by-step guide to help you select the regions and time range:

### Step 1: Select the City
Choose a city in the southern US that is known for its urban heat island effect. For this example, let's select **Houston, Texas**.

### Step 2: Define the Urban Area
Identify the city limits of Houston. You can use GIS tools or online maps to delineate the boundaries. The urban area should be well within these limits to capture the urban heat island effect accurately.

### Step 3: Define the Rural Area
Select a rural area outside the city limits of Houston. Ensure this area is sufficiently far from the urban boundary to avoid any influence from the city. A good choice could be an agricultural or forested area to the northwest or southeast of Houston.

### Step 4: Select the Time Range
Choose a time range that co

In [14]:
# ChatGPT responds with Markdown-styled text, so we can use IPython's `Markdown` class to render it
from IPython.display import Markdown, display
# display(Markdown(response))

To leverage the new suggestions from ChatGPT, we would have to manually construct a new `data_params` dict from the raw text of the response. This manual step can be automated with a more capable output parser: the `JsonOutputParser`.
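
To make the idea concrete, here is a minimal sketch of what a JSON-oriented parser has to do with a model reply: locate the JSON payload (models often wrap it in a Markdown code fence) and load it into a dictionary. This is an illustration using only the standard library, not the code LangChain uses; `extract_json` and the sample reply are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to avoid nesting fences

def extract_json(reply):
    """Parse the first JSON object in a reply, with or without a Markdown fence."""
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    payload = fenced.group(1) if fenced else reply
    # Fall back to the outermost braces if extra prose surrounds the object
    start, end = payload.find("{"), payload.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in the reply")
    return json.loads(payload[start:end + 1])

sample_reply = (
    "Here are the parameters:\n"
    + FENCE + "json\n"
    + '{"city_region_name": "Houston, TX", "time": {"start": "2023-06-01", "end": "2023-08-31"}}\n'
    + FENCE
)
params = extract_json(sample_reply)
print(params["time"]["start"])
```

A sketch like this breaks easily on malformed replies, which is one reason to rely on a maintained parser plus explicit format instructions.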

## 5. Getting Structured Output from LangChain

#### Structured Output

We aim to produce a JSON-style dictionary with the same structure as our original `data_params`. To achieve this, we define a Pydantic data model.

In [15]:
## Define the structured output desired from ChatGPT
from pydantic import BaseModel
from typing import List, Optional, Dict
from langchain_core.output_parsers import JsonOutputParser

# Define Pydantic models for structured output
class Coordinates(BaseModel):
    SW: List[float]
    NE: List[float]

class RegionCoordinates(BaseModel):
    rural: Optional[Coordinates]
    urban: Optional[Coordinates]

class DataParams(BaseModel):
    city_region_name: str
    coordinates: Optional[RegionCoordinates]
    time: Dict[str, str]



We will add a new input variable, `format_instructions`, to our `PromptTemplate`. It will be filled in ahead of time as a partial prompt using `prompt.partial()`.
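
Conceptually, a partial prompt just remembers some values ahead of time so that only the remaining variables are needed at call time. The class below is a toy model of that behaviour built on plain `str.format`; `MiniTemplate` is hypothetical and far simpler than LangChain's `PromptTemplate`.

```python
from string import Formatter

class MiniTemplate:
    """Toy prompt template that supports pre-filling some variables."""
    def __init__(self, template, partials=None):
        self.template = template
        self.partials = dict(partials or {})

    @property
    def input_variables(self):
        # Variables still required at format time (pre-filled ones are excluded)
        names = {name for _, name, _, _ in Formatter().parse(self.template) if name}
        return sorted(names - set(self.partials))

    def partial(self, **values):
        # Return a new template; the original is left unchanged
        return MiniTemplate(self.template, {**self.partials, **values})

    def format(self, **values):
        return self.template.format(**self.partials, **values)

base = MiniTemplate("{specs}\n\nFormat Instructions:\n{format_instructions}")
filled = base.partial(format_instructions="Reply with a JSON object.")
print(base.input_variables)
print(filled.input_variables)
print(filled.format(specs="Pick a city in the southern US."))
```

Storing the pre-filled values separately, rather than substituting them into the template text, means instructions that themselves contain curly brackets (such as a JSON schema) are not mistaken for placeholders.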

In [16]:
from langchain_core.prompts import PromptTemplate

# Define the prompt template for the language model
prompt_string = """
I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas.
I need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.
Help me select the urban and rural regions and time, keeping the following in mind:
{specs}

Format Instructions:
{format_instructions}
"""
prompt = PromptTemplate.from_template(prompt_string)

# Define the output parser with the Pydantic model
output_parser = JsonOutputParser(pydantic_object=DataParams)

# Update the prompt template to include format instructions for JSON output
prompt = prompt.partial(format_instructions=output_parser.get_format_instructions())

# Construct the chain using the prompt, language model, and output parser
chain = prompt | chat | output_parser

# Define the specifications for the query

specs = """Select a region that will be a perfect example of the effects of urban heat islands.
Select identically-sized nearby non-overlapping regions from the selected area to minimize the size of the dataset.
Do not select a city in the desert or near a large body of water."""

# Invoke the chain to get the response in structured format
data_params = chain.invoke({'specs': specs})
data_params # Print the structured response from the language model

{'city_region_name': 'Atlanta, Georgia',
 'coordinates': {'urban': {'SW': [33.640411, -84.442575],
   'NE': [33.775622, -84.336193]},
  'rural': {'SW': [33.775622, -84.336193], 'NE': [33.910833, -84.229811]}},
 'time': {'start': '2022-06-01', 'end': '2022-08-31'}}

This structured output can be directly used to adjust our parameters for the Urban Heat Island Analysis project.
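
Structured does not necessarily mean correct: the model can still return boxes that overlap or dates that do not parse. A light check before downloading data may be worthwhile. The sketch below is one possible approach with hypothetical helper names, assuming corners are `[lat, lon]` and that both dates use the same ISO 8601 style.

```python
from datetime import datetime

def boxes_overlap(a, b):
    """True if two {'SW': [lat, lon], 'NE': [lat, lon]} boxes share interior area."""
    lat_overlap = max(a['SW'][0], b['SW'][0]) < min(a['NE'][0], b['NE'][0])
    lon_overlap = max(a['SW'][1], b['SW'][1]) < min(a['NE'][1], b['NE'][1])
    return lat_overlap and lon_overlap

def parse_date(text):
    """Parse an ISO 8601 date or datetime, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(text.replace('Z', '+00:00'))

def validate_params(params):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    coords = params['coordinates']
    if boxes_overlap(coords['urban'], coords['rural']):
        problems.append("urban and rural boxes overlap")
    if parse_date(params['time']['start']) >= parse_date(params['time']['end']):
        problems.append("start date is not before end date")
    return problems

# The values returned by the chain above
suggested = {
    'city_region_name': 'Atlanta, Georgia',
    'coordinates': {
        'urban': {'SW': [33.640411, -84.442575], 'NE': [33.775622, -84.336193]},
        'rural': {'SW': [33.775622, -84.336193], 'NE': [33.910833, -84.229811]},
    },
    'time': {'start': '2022-06-01', 'end': '2022-08-31'},
}
print(validate_params(suggested))
```

Note that the two Atlanta boxes touch at a single corner, which this check treats as non-overlapping. Whether a "rural" box that close to the urban one is appropriate is a separate judgment the check cannot make.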


In [17]:
# Generate sample points for urban and rural areas
sampled_coordinates = []
for region, bounding_box in data_params['coordinates'].items():
    sample_points = generate_sample_points(bounding_box['SW'], bounding_box['NE'])
    for lat, lon in sample_points:
        sampled_coordinates.append({'Region': region, 'Latitude': lat, 'Longitude': lon})

# Convert the sample points to a DataFrame
coords_df = pd.DataFrame(sampled_coordinates)

## Plot the region suggested
fig = px.scatter_mapbox(coords_df, lat="Latitude", lon="Longitude", color='Region',
                        # color_continuous_scale="Viridis", 
                        mapbox_style="carto-positron",
                        title=f"Preview of Selected Bounding Boxes for {data_params['city_region_name']}",
                        height=600, width=800)
## Update the layout
fig.update_layout(
    margin={"r":0, "l":0,'b':0, 't':100},# Remove left and right side margins
    legend={'orientation':"h", 'yanchor':"top", 'y':1.05, 'xanchor':"left", 'x':0}, # Move legend to top   
)
fig.show()

## Conclusion

Using LangChain with OpenAI allows us to dynamically adjust parameters and obtain structured outputs, significantly improving our data analysis workflow. This setup can be adapted for various projects requiring precise and structured data retrieval. By automating the parameter selection process, we can ensure high-quality data and streamline our analysis.

Feel free to reach out with questions or feedback, and stay tuned for more advanced tutorials!


___

# APPENDIX


### Creating a Function That Asks ChatGPT for Suggested Parameters

In [20]:
def suggest_data_params(specs: str, temperature=0.0, model_type='gpt-4o', return_json=True):
    """
    Suggests data parameters for downloading MODIS data for a specific region and time range.
    
    Args:
        specs (str): The requirements the suggested regions and time range should satisfy.
        temperature (float, optional): The temperature parameter for the language model. Defaults to 0.0.
        model_type (str, optional): The OpenAI model to use. Defaults to 'gpt-4o'.
        return_json (bool, optional): Whether to parse the response as JSON. Defaults to True.
    
    Returns:
        dict or str: The parsed JSON response (a dict) if return_json is True, otherwise the raw text response.
    """
    
    # The prompt template for suggesting data parameters
    prompt = """
    I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. 
    I need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.
    Help me select the urban and rural regions and time following the instructions below.
    {specs}
    
    Provide me the data parameters for the download (city_region_name, coordinates as SW [lat,long] NE [lat,long], time_start named 'start', time_end named 'end') in the following format:
    Format Instructions:
    Use the 2-letter abbreviations for the state.
    {format_instructions}
    """
    # Create a PromptTemplate object
    final_prompt_template = PromptTemplate.from_template(prompt)

    # Get the OpenAI API key from the environment
    api_key = os.getenv('OPENAI_API_KEY')
        
    # Instantiate the language model, setting the specific model
    # and the temperature (creativity level)
    llm = ChatOpenAI(temperature=temperature, model=model_type, api_key=api_key)
    
    if return_json:
        # JsonOutputParser will use the data model classes from above
        parser = JsonOutputParser(pydantic_object=DataParams)
        # Add formatting instructions generated from the Pydantic model
        instructions = parser.get_format_instructions()
            
    else:
        # StrOutputParser will return the response as a string
        parser = StrOutputParser()
        # Manually define the format instructions for a plain-text answer
        instructions = "Respond in plain text with the city/region name, the SW and NE [lat, long] coordinates for the urban and rural regions, and the start and end dates."
        
    # Add the instructions to the prompt template
    final_prompt_template = final_prompt_template.partial(format_instructions=instructions)
    
    # Make the final chain
    llm_chain = final_prompt_template | llm | parser
    
    # Invoke the chain with the specs to get the response
    return llm_chain.invoke(input=dict(specs=specs))

In [27]:

specs = """Select a region that will be a perfect example of the effects of urban heat islands.
Select identically-sized nearby non-overlapping regions from the selected area to minimize the size of the dataset.
"""

# ask ChatGPT to suggest another set of parameters
chatgpt_params = suggest_data_params(specs=specs, 
                                    return_json=True, temperature=0.0)
chatgpt_params

{'city_region_name': 'Phoenix, AZ',
 'coordinates': {'urban': {'SW': [33.4484, -112.074],
   'NE': [33.5484, -111.974]},
  'rural': {'SW': [33.4484, -112.274], 'NE': [33.5484, -112.174]}},
 'time': {'start': '2022-06-01', 'end': '2022-08-31'}}

In [28]:
def preview_regions(data_params):
    """Plot sample points inside the urban and rural bounding boxes and return the figure."""
    # List of sampled points (one dict per point)
    sampled_coordinates = []

    # Generate sample points within each region's bounding box
    for region, bounding_box in data_params['coordinates'].items():
        sample_points = generate_sample_points(bounding_box['SW'], bounding_box['NE'], num_points=10)
        for lat, lon in sample_points:
            sampled_coordinates.append({'Region': region, 'Latitude': lat, 'Longitude': lon})

    # Convert the sampled points to a DataFrame
    coords_df = pd.DataFrame(sampled_coordinates)

    # Plot the suggested regions
    fig = px.scatter_mapbox(coords_df, lat="Latitude", lon="Longitude", color='Region',
                            mapbox_style="carto-positron",
                            title=f"Preview of Selected Bounding Boxes for {data_params['city_region_name']}",
                            height=600, width=600)

    # Remove the side margins and move the legend to the top
    fig.update_layout(
        margin={"r":0, "l":0,'b':0, 't':100},
        legend={'orientation':"h", 'yanchor':"top", 'y':1.05, 'xanchor':"left", 'x':0},
    )

    return fig
    
preview_regions(chatgpt_params)

In [32]:

specs = """Select a region that will be a perfect example of the effects of urban heat islands.
Select small identically-sized nearby non-overlapping regions from the selected area to minimize the size of the dataset.
Select a region in a non-desert area."""

# ask ChatGPT to suggest another set of parameters
chatgpt_params = suggest_data_params(specs=specs, 
                                    return_json=True, temperature=0.0)
print(chatgpt_params)
preview_regions(chatgpt_params)

{'city_region_name': 'Atlanta, GA', 'coordinates': {'urban': {'SW': [33.7, -84.5], 'NE': [33.8, -84.4]}, 'rural': {'SW': [33.9, -84.6], 'NE': [34.0, -84.5]}}, 'time': {'start': '2022-06-01', 'end': '2022-08-31'}}


And then we can use these new data_params to download the NASA Earth Science data, which will be covered in a later blog post. 


### CENSUS Data

In [52]:
import requests
import json
import pandas as pd


url = "https://api.census.gov/data/2022/acs/acs5/variables.json"

temp_df = pd.read_json(url)
df_variables = pd.json_normalize(temp_df['variables'])
df_variables.index = temp_df.index
df_variables = df_variables.reset_index()
df_variables

| | index | label | concept | predicateType | group | limit | predicateOnly | hasGeoCollectionSupport | attributes | required |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | for | Census API FIPS 'for' clause | Census API Geography Specification | fips-for | | 0 | True | | | |
| 1 | in | Census API FIPS 'in' clause | Census API Geography Specification | fips-in | | 0 | True | | | |
| 2 | ucgid | Uniform Census Geography Identifier clause | Census API Geography Specification | ucgid | | 0 | True | True | | |
| 3 | B24022_060E | Estimate!!Total:!!Female:!!Service occupations... | Sex by Occupation and Median Earnings in the P... | int | B24022 | 0 | | | B24022_060EA,B24022_060M,B24022_060MA | |
| 4 | B19001B_014E | Estimate!!Total:!!$100,000 to $124,999 | Household Income in the Past 12 Months (in 202... | int | B19001B | 0 | | | B19001B_014EA,B19001B_014M,B19001B_014MA | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28188 | B25124_022E | Estimate!!Total:!!Owner occupied:!!3-person ho... | Tenure by Household Size by Units in Structure | int | B25124 | 0 | | | B25124_022EA,B25124_022M,B25124_022MA | |
| 28189 | B20005I_071E | Estimate!!Total:!!Female:!!Worked full-time, y... | Sex by Work Experience in the Past 12 Months b... | int | B20005I | 0 | | | B20005I_071EA,B20005I_071M,B20005I_071MA | |
| 28190 | B08113_054E | Estimate!!Total:!!Worked from home:!!Speak oth... | Means of Transportation to Work by Language Sp... | int | B08113 | 0 | | | B08113_054EA,B08113_054M,B08113_054MA | |
| 28191 | B06009_006E | Estimate!!Total:!!Graduate or professional degree | Place of Birth by Educational Attainment in th... | int | B06009 | 0 | | | B06009_006EA,B06009_006M,B06009_006MA | |


In [105]:
df_variables.to_csv('census_variables.csv', index=False)

In [48]:
import os
import json

# Load the Census API key from a JSON credentials file in the ~/.secret folder
creds_file = os.path.expanduser("~/.secret/census.json")
with open(creds_file) as f:
    creds = json.load(f)
api_key = creds['api-key']

Example API queries: https://api.census.gov/data/2022/acs/acs5/subject/examples.html

To retrieve data for every census tract in Georgia using the American Community Survey (ACS) API, you'll need to make use of the U.S. Census Bureau's API. Below, I'll guide you through the process, including how to access the API, construct your query, and extract the data for analysis.

### Step 1: Understanding the ACS API

The ACS API allows you to access a wide range of demographic, economic, and housing data. For this example, we'll focus on the ACS 5-Year estimates, which provide detailed information at smaller geographic levels like census tracts.

### Step 2: Obtain an API Key

Before using the Census API, you need to obtain an API key from the Census Bureau:

1. Visit the [Census Bureau's API Key Request page](https://api.census.gov/data/key_signup.html).
2. Fill out the form to request a key. The key will be sent to your email.

### Step 3: Construct the API Request

You'll need to construct a URL to query the API. The URL includes several parameters:

- **Base URL:** The endpoint for the ACS 5-Year Data.
- **Variables:** The specific data points you want to retrieve (e.g., total population, median household income).
- **Geography:** Specify the geographic level (e.g., state, county, tract).
- **Year:** The year of the ACS data you're interested in.
- **State and County Codes:** To filter results specifically for Georgia census tracts.

Here's a general structure of the API request:

```
https://api.census.gov/data/20XX/acs/acs5?get=VARIABLES&for=tract:*&in=state:13&key=YOUR_API_KEY
```

- Replace `20XX` with the year you're interested in (e.g., `2022`).
- Replace `VARIABLES` with a comma-separated list of variables you're interested in.
- `state:13` is the FIPS code for Georgia.
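
The URL pattern above can also be assembled programmatically, which avoids typos when the variable list grows. This is a minimal sketch using the standard library; `build_acs_url` is a hypothetical helper and `YOUR_API_KEY` remains a placeholder.

```python
from urllib.parse import urlencode

def build_acs_url(year, variables, state_fips, api_key):
    """Build an ACS 5-Year request URL for every census tract in one state."""
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    query = {
        "get": ",".join(["NAME", *variables]),
        "for": "tract:*",
        "in": f"state:{state_fips}",
        "key": api_key,
    }
    # Keep ':', '*' and ',' unescaped so the URL matches the pattern shown above
    return f"{base}?{urlencode(query, safe=':*,')}"

print(build_acs_url(2022, ["B01003_001E", "B19013_001E"], "13", "YOUR_API_KEY"))
```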



### Step 4: Identify the Variables You Want

You can find a list of available variables in the [ACS Data Dictionary](https://api.census.gov/data/2022/acs/acs5/variables.html). Commonly used variables include:

- **B01003_001E:** Total population.
- **B19013_001E:** Median household income.
- **B25077_001E:** Median home value.

### Step 5: Fetch the Data Using Python

Here's how you can use Python to fetch the data:

```python
import requests
import pandas as pd

# Define the parameters
year = "2022"
variables = "B01003_001E,B19013_001E,B25077_001E"
state_fips = "13"  # FIPS code for Georgia
api_key = "YOUR_API_KEY"

# Construct the API URL
url = f"https://api.census.gov/data/{year}/acs/acs5?get=NAME,{variables}&for=tract:*&in=state:{state_fips}&key={api_key}"

# Make the API request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()

    # Convert the data to a pandas DataFrame
    columns = data[0]
    rows = data[1:]
    df = pd.DataFrame(columns=columns, data=rows)

    # Rename columns for better readability
    df = df.rename(columns={
        "B01003_001E": "Total Population",
        "B19013_001E": "Median Household Income",
        "B25077_001E": "Median Home Value"
    })

    # Display the first few rows of the DataFrame
    print(df.head())
else:
    print("Error fetching data:", response.status_code)
```

### Step 6: Analyze and Use the Data

Once you have the data in a pandas DataFrame, you can analyze it, visualize it, or merge it with other datasets for further analysis.

### Explanation of the Code

- **API URL Construction:** The URL is constructed using the desired year, variables, and geographic filters. The wildcard `tract:*` specifies that you want data for all tracts in Georgia.
- **Response Parsing:** The API response is in JSON format, and we parse it into a DataFrame for easy manipulation.
- **Data Cleaning:** We rename the columns to make them more descriptive.
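
One detail worth handling during parsing is type conversion: the response is a list of lists whose first row is the header, and values typically arrive as strings. The sketch below converts selected columns to numbers without pandas. The rows are made up for illustration, `rows_to_records` is a hypothetical helper, and treating `-666666666` as a "not available" marker is an assumption you should confirm against the Census documentation.

```python
MISSING_SENTINEL = -666666666  # assumed placeholder for unavailable estimates

def rows_to_records(data, numeric_columns):
    """Convert a Census-style list of lists (header row first) into a list of dicts."""
    header, *rows = data
    records = []
    for row in rows:
        record = dict(zip(header, row))
        for column in numeric_columns:
            raw = record.get(column)
            value = float(raw) if raw not in (None, "") else None
            record[column] = None if value == MISSING_SENTINEL else value
        records.append(record)
    return records

# Made-up rows in the same shape as the API response (values are illustrative only)
sample = [
    ["NAME", "B01003_001E", "B19013_001E", "state", "county", "tract"],
    ["Tract A", "4200", "61000", "13", "001", "950100"],
    ["Tract B", "1800", "-666666666", "13", "001", "950200"],
]
for record in rows_to_records(sample, ["B01003_001E", "B19013_001E"]):
    print(record["NAME"], record["B01003_001E"], record["B19013_001E"])
```

Identifier columns such as `state`, `county`, and `tract` are deliberately left as strings so that leading zeros are preserved.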

### Additional Tips

- **Explore More Variables:** Use the [ACS Variables List](https://api.census.gov/data/2022/acs/acs5/variables.html) to find more variables you might be interested in.
- **Error Handling:** Incorporate error handling to manage cases where the API might not return data successfully.
- **Rate Limiting:** Be mindful of the Census API's rate limits and structure your requests accordingly.
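
As one way to act on the error-handling and rate-limiting tips, a request can be wrapped in a small retry loop that waits longer after each failure. The sketch below is generic and makes no network calls: `call_with_retries` and `flaky_request` are hypothetical names, and the delays are arbitrary choices rather than values recommended by the Census Bureau.

```python
import time

def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a stand-in function that fails twice before succeeding
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": 200}

# The sleep function is swapped out so the demo finishes instantly
print(call_with_retries(flaky_request, sleep=lambda seconds: None))
```

In real use, `func` would be something like a `lambda` that calls `requests.get(url)` and raises on a non-200 status.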

This approach will help you obtain and analyze ACS data for every census tract in Georgia.

To analyze socioeconomic differences between regions that are part of an urban heat island (UHI) and those that are not, you can select a variety of variables from the American Community Survey (ACS) that reflect the socioeconomic status, housing characteristics, and demographic profiles of these areas. Here are some key variables that would be useful for this type of analysis:

### Key Socioeconomic Variables

1. **Income and Poverty:**
   - **Median Household Income:** (`B19013_001E`) - Reflects the overall economic status of households.
   - **Per Capita Income:** (`B19301_001E`) - Provides insight into the average income per person in an area.
   - **Percentage of Population Below Poverty Level:** (`B17021_002E / B17021_001E`) - Indicates economic hardship and poverty prevalence.

2. **Employment and Education:**
   - **Unemployment Rate:** (`B23025_005E / B23025_002E`) - Measures the percentage of the labor force that is unemployed.
   - **Educational Attainment:** 
     - **Percentage with High School Diploma or Higher:** (sum of `B15003_017E` through `B15003_025E`, divided by `B15003_001E`)
     - **Percentage with Bachelor's Degree or Higher:** (sum of `B15003_022E` through `B15003_025E`, divided by `B15003_001E`) - Highlights educational disparities.

3. **Housing Characteristics:**
   - **Median Home Value:** (`B25077_001E`) - Reflects the economic conditions and housing market.
   - **Median Gross Rent:** (`B25064_001E`) - Provides insight into rental market dynamics.
   - **Percentage of Owner-Occupied Housing Units:** (`B25003_002E / B25003_001E`) - Indicates housing stability and investment.

4. **Demographic and Social Characteristics:**
   - **Race and Ethnicity:**
     - **Percentage of Population by Race:** (`B02001_002E` for White, `B02001_003E` for Black or African American, etc.)
     - **Percentage of Hispanic or Latino Origin:** (`B03003_003E`) - Identifies racial and ethnic diversity.
   - **Age Distribution:**
     - **Median Age:** (`B01002_001E`) - Provides demographic insights into age-related vulnerabilities.
   - **Language Spoken at Home:**
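     - **Percentage of Non-English Speakers:** (`1 - B16001_002E / B16001_001E`) - Highlights linguistic diversity and potential barriers. Note that `B16001_002E` counts people who speak only English at home, so the non-English share is its complement.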

5. **Transportation and Access:**
   - **Means of Transportation to Work:**
     - **Percentage Using Public Transit:** (`B08301_010E / B08301_001E`) - Indicates reliance on public transportation.
   - **Access to Vehicles:**
     - **Percentage of Households with No Vehicle Available:** (`B25044_003E / B25044_001E`) - Reflects access to transportation resources.
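
Several of the variables above are ratios of two ACS codes, so it may help to see how one could be computed defensively. The following is a minimal sketch under stated assumptions: `to_number` and `percentage` are hypothetical helpers, the sample record is made up, and the sentinel value is the one the Census API commonly uses for unavailable estimates (check the API notes for your dataset before relying on it).

```python
# Sentinel the Census API commonly returns when an estimate is unavailable
# (treated here as missing; an assumption to verify for your dataset)
MISSING_SENTINEL = -666666666


def to_number(value):
    """Convert a raw API value (string, number, or None) to float or None."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if number == MISSING_SENTINEL else number


def percentage(record, numerator_code, denominator_code):
    """Return numerator / denominator * 100, or None if it cannot be computed."""
    numerator = to_number(record.get(numerator_code))
    denominator = to_number(record.get(denominator_code))
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator * 100


# Hypothetical tract record using the variable codes listed above
tract = {"B17021_002E": "145", "B17021_001E": "2291", "B19013_001E": "-666666666"}
poverty_rate = percentage(tract, "B17021_002E", "B17021_001E")
print(poverty_rate)
```

Returning `None` instead of raising keeps a single bad tract from interrupting a loop over many tracts; the missing values can be filtered out afterwards.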

### Analysis Approach

To effectively analyze these variables in the context of urban heat islands, consider the following steps:

1. **Data Collection:**
   - Use the ACS API or download data directly from the [U.S. Census Bureau's data portal](https://data.census.gov/) for the relevant variables and geographic regions.

2. **Define Urban Heat Island Areas:**
   - Utilize geospatial analysis to identify areas that are part of urban heat islands. This may involve using satellite data (e.g., NASA MODIS LST) to identify temperature anomalies in urban areas.

3. **Spatial Analysis:**
   - Overlay socioeconomic data with urban heat island data using GIS software like QGIS or ArcGIS.
   - Use spatial joins to associate census tracts with UHI data.

4. **Comparative Analysis:**
   - Compare socioeconomic indicators between UHI areas and non-UHI areas using statistical methods.
   - Conduct t-tests, chi-square tests, or regression analysis to identify significant differences or correlations.

5. **Visualization:**
   - Create maps and charts to visually represent socioeconomic disparities.
   - Use heat maps, bar charts, and scatter plots to highlight key findings.
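
The comparative-analysis step above mentions t-tests. As a rough illustration, here is a sketch of Welch's unequal-variance t statistic using only the standard library. The function name `welch_t` and the sample values are assumptions for illustration; the numbers are made up and are not real ACS data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = mean_diff / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up poverty rates (%) for illustration only, not real ACS values
uhi_tracts = [10.0, 12.0, 14.0]
non_uhi_tracts = [20.0, 22.0, 24.0]
t_stat, dof = welch_t(uhi_tracts, non_uhi_tracts)
print(t_stat, dof)
```

Turning the statistic into a p-value requires the t distribution. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic together with a p-value.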

### Example Python Code for Data Retrieval

Here's how you might use Python to retrieve some of these variables from the ACS API:


In [53]:
import requests
import pandas as pd

# Variables to retrieve
vars_to_retrieve = {
        "B19013_001E": "Median Household Income",
        "B19301_001E": "Per Capita Income",
        "B17021_002E": "Population Below Poverty Level",
        "B17021_001E": "Total Population for Poverty",
        "B23025_005E": "Unemployed Population",
        "B23025_002E": "Labor Force Population",
        "B15003_017E": "High School Graduate",
        "B15003_022E": "Bachelor's Degree or Higher",
        "B25077_001E": "Median Home Value",
        "B25064_001E": "Median Gross Rent",
        "B25003_002E": "Owner Occupied Housing Units",
        "B25003_001E": "Total Housing Units",
        "B02001_002E": "White Population",
        "B02001_003E": "Black or African American Population",
        "B03003_003E": "Hispanic or Latino Population",
        "B01002_001E": "Median Age",
        "B16001_002E": "Non-English Speakers",
        "B08301_010E": "Public Transit Commuters",
        "B08301_001E": "Total Commuters",
        "B25044_003E": "Households with No Vehicle",
        "B25044_001E": "Total Households"
    }



# Define parameters
year = "2022"
variables = ",".join(vars_to_retrieve.keys())  # Comma-separated variable codes
state_fips = "13"  # FIPS code for Georgia
# NOTE: api_key must already hold your Census API key (defined in an earlier cell)

# Construct API URL
url = f"https://api.census.gov/data/{year}/acs/acs5?get=NAME,{variables}&for=tract:*&in=state:{state_fips}&key={api_key}"

# Make API request
response = requests.get(url)
response.status_code

200

In [55]:

# Check if request was successful
if response.status_code == 200:
    # Parse JSON response
    data = response.json()

    # Convert data to pandas DataFrame
    columns = data[0]
    rows = data[1:]
    df = pd.DataFrame(columns=columns, data=rows)

    # Rename columns for readability (reuses the code-to-label mapping defined above)
    df = df.rename(columns=vars_to_retrieve)

    # Calculate derived variables
    df['Poverty Rate'] = df['Population Below Poverty Level'].astype(float) / df['Total Population for Poverty'].astype(float) * 100
    df['Unemployment Rate'] = df['Unemployed Population'].astype(float) / df['Labor Force Population'].astype(float) * 100
    df['Percentage Owner Occupied'] = df['Owner Occupied Housing Units'].astype(float) / df['Total Housing Units'].astype(float) * 100
    df['Percentage Public Transit'] = df['Public Transit Commuters'].astype(float) / df['Total Commuters'].astype(float) * 100
    df['Percentage No Vehicle'] = df['Households with No Vehicle'].astype(float) / df['Total Households'].astype(float) * 100

else:
    print("Error fetching data:", response.status_code)

In [56]:
df

Unnamed: 0,NAME,Median Household Income,Per Capita Income,Population Below Poverty Level,Total Population for Poverty,Unemployed Population,Labor Force Population,High School Graduate,Bachelor's Degree or Higher,Median Home Value,...,Households with No Vehicle,Total Households,state,county,tract,Poverty Rate,Unemployment Rate,Percentage Owner Occupied,Percentage Public Transit,Percentage No Vehicle
0,Census Tract 9501; Appling County; Georgia,56997,27616,1042,3613,32,1583,685,172,98700,...,22,1253,13,001,950100,28.840299,2.021478,72.625698,0.000000,1.755786
1,Census Tract 9502.01; Appling County; Georgia,47473,25296,416,2392,44,1308,493,116,158600,...,7,939,13,001,950201,17.391304,3.363914,75.399361,0.000000,0.745474
2,Census Tract 9502.02; Appling County; Georgia,30320,18127,458,2160,49,732,499,45,53900,...,73,903,13,001,950202,21.203704,6.693989,59.911406,0.000000,8.084164
3,Census Tract 9503.01; Appling County; Georgia,44198,28281,336,2592,37,787,714,92,115500,...,0,1007,13,001,950301,12.962963,4.701398,78.649454,0.000000,0.000000
4,Census Tract 9503.02; Appling County; Georgia,35174,18903,664,1759,52,695,436,80,101000,...,14,814,13,001,950302,37.748721,7.482014,31.572482,0.000000,1.719902
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2791,Census Tract 9502.01; Worth County; Georgia,62885,34709,449,2067,18,989,780,71,62100,...,23,816,13,321,950201,21.722303,1.820020,93.014706,0.000000,2.818627
2792,Census Tract 9502.02; Worth County; Georgia,57607,31121,1349,3853,136,1800,878,383,141100,...,65,1576,13,321,950202,35.011679,7.555556,62.119289,0.000000,4.124365
2793,Census Tract 9504; Worth County; Georgia,46131,24675,1098,4800,90,2015,981,134,93700,...,55,1726,13,321,950400,22.875000,4.466501,73.580533,0.104004,3.186559
2794,Census Tract 9505; Worth County; Georgia,50282,23032,720,4241,161,2107,1184,132,83200,...,120,1795,13,321,950500,16.977128,7.641196,58.272981,2.739726,6.685237



### Conclusion

By selecting these socioeconomic variables, you can gain insights into how urban heat islands affect communities differently based on their economic status, housing characteristics, and demographic profiles. This analysis can inform urban planning, policy-making, and interventions to address vulnerabilities related to heat exposure.

## Updated ChatGPT Data Model to include Census Tract FIPS

In [291]:
# Define the structured output we want from ChatGPT
from pydantic import BaseModel, Field
from typing import List, Optional, Dict
from langchain_core.output_parsers import JsonOutputParser

# Define Pydantic models for structured output
class Coordinates(BaseModel):
    SW: List[float]  # Southwest corner as [lat, long]
    NE: List[float]  # Northeast corner as [lat, long]

class RegionCoordinates(BaseModel):
    rural: Optional[Coordinates]
    urban: Optional[Coordinates]

class FIPS(BaseModel):
    state_fips: str
    county_fips: List[str]
    census_tract_fips: List[str]

class RegionFIPS(BaseModel):
    state_fips: str
    rural: Optional[FIPS]
    urban: Optional[FIPS]

class DataParamsFips(BaseModel):
    city_region_name: str
    coordinates: Optional[RegionCoordinates]
    time: Dict[str, str]
    fips: Optional[RegionFIPS]


In [292]:
import os
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

def suggest_data_params(specs: str, temperature=0.0, model_type='gpt-4o', return_json=True,
                        pydantic_model=DataParamsFips):
    """
    Suggests data parameters for downloading MODIS data for a specific region and time range.
    
    Args:
        specs (str): Instructions describing the requirements for the region and time range.
        temperature (float, optional): The temperature parameter for the language model. Defaults to 0.0.
        model_type (str, optional): The type of language model to use. Defaults to 'gpt-4o'.
        return_json (bool, optional): Whether to return the response as parsed JSON. Defaults to True.
        pydantic_model (BaseModel, optional): The Pydantic model describing the desired JSON structure. Defaults to DataParamsFips.
    
    Returns:
        dict or str: The parsed JSON response (a dict) if return_json is True; otherwise the response text.
    """
    
    # The prompt template for suggesting data parameters
    prompt = """
    I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. 
    I need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.
    Help me select the urban and rural regions and time following the instructions below.
    {specs}
    
    Provide me the data parameters for the download (city_region_name, coordinates as SW [lat,long] NE [lat,long], time_start named 'start', time_end named 'end') 
    also include the state FIPS and a list of every census tract fips code included within the selected regions for the Census API, in the following format:
    Format Instructions:
    Use the 2-letter abbreviations for the state.
    {format_instructions}
    """
    # Create a PromptTemplate object
    final_prompt_template = PromptTemplate.from_template(prompt)

    # Get the OpenAI API key from the environment
    api_key = os.getenv('OPENAI_API_KEY')
        
    # Instantiate the language model with the chosen model (gpt-4o by default)
    # and set the temperature (creativity level)
    llm = ChatOpenAI(temperature=temperature, model=model_type, api_key=api_key)
    
    if return_json:
        # JsonOutputParser will use the data model classes from above
        parser = JsonOutputParser(pydantic_object=pydantic_model)
        # Add formatting instructions generated from the Pydantic model
        instructions = parser.get_format_instructions()
    else:
        # StrOutputParser will return the response as a string
        parser = StrOutputParser()
        # Manually define the format instructions
        instructions = "Respond with text for each topic as a nested list with the topic number,  descriptive label,top words, and short insight."
        
    # Add the instructions to the prompt template
    final_prompt_template = final_prompt_template.partial(format_instructions=instructions)
    
    # Build the final chain
    llm_chain = final_prompt_template | llm | parser
    
    # Invoke the chain with the specs to get the response
    return llm_chain.invoke(input=dict(specs=specs))

In [293]:

specs = """Select a region that will be a perfect example of the effects of urban heat islands.
Select small identically-sized nearby non-overlapping regions from the selected area to minimize the size of the dataset.
Select a region in a non-desert area."""

chatgpt_params = suggest_data_params(specs=specs,pydantic_model=DataParamsFips, return_json=True, temperature=0.0)
chatgpt_params

{'city_region_name': 'Atlanta, GA',
 'coordinates': {'rural': {'SW': [33.6, -84.5], 'NE': [33.7, -84.4]},
  'urban': {'SW': [33.7, -84.4], 'NE': [33.8, -84.3]}},
 'time': {'start': '2022-06-01', 'end': '2022-08-31'},
 'fips': {'state_fips': '13',
  'rural': {'state_fips': '13',
   'county_fips': ['089'],
   'census_tract_fips': ['13089020100', '13089020200', '13089020300']},
  'urban': {'state_fips': '13',
   'county_fips': ['121'],
   'census_tract_fips': ['13121010100', '13121010200', '13121010300']}}}
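
Because the FIPS codes in this response come from a language model, a quick format check before calling the Census API may save some debugging. The sketch below is an assumption-laden illustration: `check_tract_fips` is a hypothetical helper, and it only verifies that each tract code has 11 digits and is consistent with the state and county codes in the same block. It cannot tell whether a tract actually exists.

```python
def check_tract_fips(region_fips):
    """Return a list of format problems found in one region's FIPS block."""
    problems = []
    state = region_fips.get("state_fips", "")
    counties = region_fips.get("county_fips", [])
    for tract in region_fips.get("census_tract_fips", []):
        if len(tract) != 11 or not tract.isdigit():
            problems.append(f"{tract}: expected 11 digits")
        elif tract[:2] != state:
            problems.append(f"{tract}: does not start with state {state}")
        elif tract[2:5] not in counties:
            problems.append(f"{tract}: county {tract[2:5]} not in {counties}")
    return problems


# The 'urban' block copied from the response shown above
urban = {"state_fips": "13", "county_fips": ["121"],
         "census_tract_fips": ["13121010100", "13121010200", "13121010300"]}
print(check_tract_fips(urban))  # an empty list means the format is consistent
```

An empty result only means the codes are well formed; whether the Census API has data for them is a separate question.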

In [301]:
import requests
import pandas as pd

def fetch_acs_data(tract_fips_codes, variables=None, year="2022", api_key="YOUR_API_KEY"):
    """
    Fetches ACS data for specified census tracts.

    Parameters:
    - tract_fips_codes (list of str): List of 11-digit FIPS codes for census tracts.
    - variables (str, optional): Comma-separated list of ACS variables to retrieve.
      Defaults to the socioeconomic variables defined below.
    - year (str): Year of the ACS data (default is "2022").
    - api_key (str): Your Census API key.

    Returns:
    - DataFrame: A pandas DataFrame containing the requested ACS data for the specified tracts.
    """
    
    # Default variables to retrieve (also used to rename the columns)
    vars_to_retrieve = {
            "B19013_001E": "Median Household Income",
            "B19301_001E": "Per Capita Income",
            "B17021_002E": "Population Below Poverty Level",
            "B17021_001E": "Total Population for Poverty",
            "B23025_005E": "Unemployed Population",
            "B23025_002E": "Labor Force Population",
            "B15003_017E": "High School Graduate",
            "B15003_022E": "Bachelor's Degree or Higher",
            "B25077_001E": "Median Home Value",
            "B25064_001E": "Median Gross Rent",
            "B25003_002E": "Owner Occupied Housing Units",
            "B25003_001E": "Total Housing Units",
            "B02001_002E": "White Population",
            "B02001_003E": "Black or African American Population",
            "B03003_003E": "Hispanic or Latino Population",
            "B01002_001E": "Median Age",
            "B16001_002E": "Non-English Speakers",
            "B08301_010E": "Public Transit Commuters",
            "B08301_001E": "Total Commuters",
            "B25044_003E": "Households with No Vehicle",
            "B25044_001E": "Total Households"
        }
    # Fall back to the default variable set when none is supplied
    if variables is None:
        variables = ",".join(vars_to_retrieve.keys())

    # Initialize a list to store data rows and a placeholder for the header row
    data_list = []
    columns = None

    # Base URL for the ACS API
    url = f"https://api.census.gov/data/{year}/acs/acs5"

    # Loop over each FIPS code
    for fips_code in tract_fips_codes:
        # Extract state, county, and tract components
        state_fips = fips_code[:2]
        county_fips = fips_code[2:5]
        tract_code = fips_code[5:]

        # Define the parameters for the API request
        params = {
            "get": f"NAME,{variables}",
            "for": f"tract:{tract_code}",
            "in": f"state:{state_fips} county:{county_fips}",
            "key": api_key
        }

        # Make the API request (requests handles URL encoding of the parameters)
        response = requests.get(url, params=params)

        # Check if the request was successful
        if response.status_code == 200:
            # Parse the JSON response
            data = response.json()

            # Keep the header row and append the data row
            if len(data) > 1:
                columns = data[0]
                data_list.append(data[1])
        else:
            # Note: HTTP 204 means the response had no content
            print(f"Error fetching data for tract {fips_code}: {response.status_code}")

    # Convert the collected data to a pandas DataFrame
    if data_list:
        df = pd.DataFrame(columns=columns, data=data_list)

        # Rename columns for better readability
        df = df.rename(columns=vars_to_retrieve)

        # Calculate derived variables
        try:
            df['Poverty Rate'] = df['Population Below Poverty Level'].astype(float) / df['Total Population for Poverty'].astype(float) * 100
            df['Unemployment Rate'] = df['Unemployed Population'].astype(float) / df['Labor Force Population'].astype(float) * 100
            df['Percentage Owner Occupied'] = df['Owner Occupied Housing Units'].astype(float) / df['Total Housing Units'].astype(float) * 100
            df['Percentage Public Transit'] = df['Public Transit Commuters'].astype(float) / df['Total Commuters'].astype(float) * 100
            df['Percentage No Vehicle'] = df['Households with No Vehicle'].astype(float) / df['Total Households'].astype(float) * 100
        except KeyError as e:
            print(f"KeyError: {e}. Unable to calculate derived variables.")
        return df

    else:
        print("No data was fetched. Please check your FIPS codes or API request parameters.")
        return pd.DataFrame()  # Return an empty DataFrame in case of failure


In [303]:

specs = """Select a region that will be a perfect example of the effects of urban heat islands.
Select small identically-sized nearby non-overlapping regions from the selected area to minimize the size of the dataset."""
# Select a region in a non-desert area."""

chatgpt_params = suggest_data_params(specs=specs,pydantic_model=DataParamsFips, return_json=True, temperature=0.0)
chatgpt_params

{'city_region_name': 'Phoenix, AZ',
 'coordinates': {'rural': {'SW': [33.3484, -112.174],
   'NE': [33.3584, -112.164]},
  'urban': {'SW': [33.4484, -112.074], 'NE': [33.4584, -112.064]}},
 'time': {'start': '2022-06-01', 'end': '2022-08-31'},
 'fips': {'state_fips': '04',
  'rural': {'state_fips': '04',
   'county_fips': ['013'],
   'census_tract_fips': ['04013112300', '04013112400']},
  'urban': {'state_fips': '04',
   'county_fips': ['013'],
   'census_tract_fips': ['04013116600', '04013116700']}}}

In [304]:
df_urban = fetch_acs_data(chatgpt_params['fips']['urban']['census_tract_fips'], api_key=api_key)

Error fetching data for tract 04013116600: 204
Error fetching data for tract 04013116700: 204
No data was fetched. Please check your FIPS codes or API request parameters.
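
HTTP status 204 means "No Content": the request was accepted, but nothing came back for those tract codes. One way to catch this earlier is to compare the suggested codes against the tracts the Census API actually lists for the county (a request with `for=tract:*`). The sketch below assumes that response has already been parsed into a list of rows; `tract_geoids`, `unknown_tracts`, and the sample rows are hypothetical and for illustration only.

```python
def tract_geoids(census_rows):
    """Build 11-digit tract GEOIDs from a parsed Census API JSON response.

    `census_rows` is a header row followed by data rows that include
    'state', 'county', and 'tract' columns.
    """
    header, *rows = census_rows
    idx = {name: header.index(name) for name in ("state", "county", "tract")}
    return {row[idx["state"]] + row[idx["county"]] + row[idx["tract"]] for row in rows}


def unknown_tracts(candidates, census_rows):
    """Return the candidate GEOIDs that do not appear in the Census response."""
    valid = tract_geoids(census_rows)
    return [geoid for geoid in candidates if geoid not in valid]


# Hypothetical rows for illustration, shaped like the result of a
# `get=NAME&for=tract:*&in=state:04 county:013` request
sample_rows = [
    ["NAME", "state", "county", "tract"],
    ["Census Tract 1131; Maricopa County; Arizona", "04", "013", "113100"],
]
print(unknown_tracts(["04013116600", "04013113100"], sample_rows))
```

With a real county listing in place of `sample_rows`, any code returned by `unknown_tracts` could be dropped or replaced before requesting the full variable set.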


In [298]:
chatgpt_params['fips']['urban']['census_tract_fips']

['04013116600', '04013116700']


In [97]:
chatgpt_params['fips']['urban']

{'state_fips': '04',
 'country_fips': ['04013'],
 'census_tract_fips': ['04013100100', '04013100200', '04013100300']}

In [96]:
tract_fips_codes = chatgpt_params['fips']['urban']['census_tract_fips']
tract_fips_codes

['04013100100', '04013100200', '04013100300']

In [201]:
import requests
import pandas as pd

def fetch_acs_data(tract_fips_codes, variables, year="2022", api_key="YOUR_API_KEY"):
    """
    Fetches ACS data for specified census tracts.

    Parameters:
    - tract_fips_codes (list of str): List of 11-digit FIPS codes for census tracts.
    - variables (str): Comma-separated list of ACS variables to retrieve.
    - year (str): Year of the ACS data (default is "2022").
    - api_key (str): Your Census API key.

    Returns:
    - DataFrame: A pandas DataFrame containing the requested ACS data for the specified tracts.
    """
    # Initialize a list to store data
    data_list = []

    # Loop over each FIPS code
    for fips_code in tract_fips_codes:
        
        # Extract state, county, and tract components
        state_fips = fips_code[:2]
        county_fips = fips_code[2:5]
        tract_code = fips_code[5:]

        fips_data = []

        # Request one variable at a time to isolate which variables fail
        for var in variables.split(','):

            # Construct the API URL for the specific tract and variable
            url = f"https://api.census.gov/data/{year}/acs/acs5?get=NAME,{var}&for=tract:{tract_code}&in=state:{state_fips}%20county:{county_fips}&key={api_key}"
            # Make the API request
            response = requests.get(url)

            # Check if the request was successful
            if response.status_code == 200:
                # Parse the JSON response
                data = response.json()

                # Append the data row (skipping the header row)
                if len(data) > 1:
                    fips_data.append(data[1])
            else:
                print(f"Error fetching data for tract {fips_code} ({var}): {response.status_code}")
                print(f"URL: {url}")

        # Store the per-variable rows collected for this tract, if any
        if fips_data:
            data_list.append(fips_data)

    # Convert the collected data to a pandas DataFrame
    if data_list:
        try:
            columns = data[0]
            df = pd.DataFrame(columns=columns, data=data_list)

            # Rename columns for better readability
            df = df.rename(columns={
                "B19013_001E": "Median Household Income",
                "B19301_001E": "Per Capita Income",
                "B17021_002E": "Population Below Poverty Level",
                "B17021_001E": "Total Population for Poverty",
                "B23025_005E": "Unemployed Population",
                "B23025_002E": "Labor Force Population",
                "B15003_017E": "High School Graduate",
                "B15003_022E": "Bachelor's Degree or Higher",
                "B25077_001E": "Median Home Value",
                "B25064_001E": "Median Gross Rent",
                "B25003_002E": "Owner Occupied Housing Units",
                "B25003_001E": "Total Housing Units",
                "B02001_002E": "White Population",
                "B02001_003E": "Black or African American Population",
                "B03003_003E": "Hispanic or Latino Population",
                "B01002_001E": "Median Age",
                "B16001_002E": "Non-English Speakers",
                "B08301_010E": "Public Transit Commuters",
                "B08301_001E": "Total Commuters",
                "B25044_003E": "Households with No Vehicle",
                "B25044_001E": "Total Households"
            })

            # Calculate derived variables
            df['Poverty Rate'] = df['Population Below Poverty Level'].astype(float) / df['Total Population for Poverty'].astype(float) * 100
            df['Unemployment Rate'] = df['Unemployed Population'].astype(float) / df['Labor Force Population'].astype(float) * 100
            df['Percentage Owner Occupied'] = df['Owner Occupied Housing Units'].astype(float) / df['Total Housing Units'].astype(float) * 100
            df['Percentage Public Transit'] = df['Public Transit Commuters'].astype(float) / df['Total Commuters'].astype(float) * 100
            df['Percentage No Vehicle'] = df['Households with No Vehicle'].astype(float) / df['Total Households'].astype(float) * 100

            return df
        except Exception as e:
            print(f"Error converting data to DataFrame: {e}")
            return data_list

    else:
        print("No data was fetched. Please check your FIPS codes or API request parameters.")
        return pd.DataFrame()  # Return an empty DataFrame in case of failure


In [202]:
vars_to_retrieve = {
            "B19013_001E": "Median Household Income",
            "B19301_001E": "Per Capita Income",
            "B17021_002E": "Population Below Poverty Level",
            "B17021_001E": "Total Population for Poverty",
            "B23025_005E": "Unemployed Population",
            "B23025_002E": "Labor Force Population",
            "B15003_017E": "High School Graduate",
            "B15003_022E": "Bachelor's Degree or Higher",
            "B25077_001E": "Median Home Value",
            "B25064_001E": "Median Gross Rent",
            "B25003_002E": "Owner Occupied Housing Units",
            "B25003_001E": "Total Housing Units",
            "B02001_002E": "White Population",
            "B02001_003E": "Black or African American Population",
            "B03003_003E": "Hispanic or Latino Population",
            "B01002_001E": "Median Age",
            "B16001_002E": "Non-English Speakers",
            "B08301_010E": "Public Transit Commuters",
            "B08301_001E": "Total Commuters",
            "B25044_003E": "Households with No Vehicle",
            "B25044_001E": "Total Households"
        }
variables = ",".join(vars_to_retrieve.keys())
variables

'B19013_001E,B19301_001E,B17021_002E,B17021_001E,B23025_005E,B23025_002E,B15003_017E,B15003_022E,B25077_001E,B25064_001E,B25003_002E,B25003_001E,B02001_002E,B02001_003E,B03003_003E,B01002_001E,B16001_002E,B08301_010E,B08301_001E,B25044_003E,B25044_001E'
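
The 21 variables joined here fit in one request, but the list can grow quickly once more ACS variables are added. To my understanding the Census API caps the number of variables per request (50 at the time of writing), so a longer list would need to be split. The sketch below is an illustration only: `chunk_variables` is a hypothetical helper, the default of 49 is an assumption that leaves room for `NAME`, and the example codes are placeholders.

```python
def chunk_variables(codes, max_per_request=49):
    """Split variable codes into comma-joined groups of at most `max_per_request`."""
    codes = list(codes)
    return [
        ",".join(codes[start:start + max_per_request])
        for start in range(0, len(codes), max_per_request)
    ]


# Placeholder codes for illustration; in the notebook you would pass
# vars_to_retrieve.keys() instead
example_codes = [f"B00000_{i:03d}E" for i in range(1, 101)]
chunks = chunk_variables(example_codes)
print(len(chunks))
```

Each chunk can then be requested separately and the results merged on the `state`, `county`, and `tract` columns.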

In [239]:

# # Example usage:
# tract_fips_codes = [
#     "13051000100",  # Example FIPS codes
#     "13051000200",
#     # Add more FIPS codes as needed
# ]

# variables = "B19013_001E,B19301_001E,B17021_002E,B17021_001E,B23025_005E,B23025_002E,B15003_017E,B15003_022E,B25077_001E,B25064_001E,B25003_002E,B25003_001E,B02001_002E,B02001_003E,B03003_003E,B01002_001E,B16001_002E,B08301_010E,B08301_001E,B25044_003E,B25044_001E"

# Fetch data using the function
df = fetch_acs_data(tract_fips_codes, variables=variables, year="2022", api_key=api_key)
df

Error fetching data for tract 13051000100: 204
Error fetching data for tract 13051000200: 204
No data was fetched. Please check your FIPS codes or API request parameters.


In [240]:
# Save the available variable names as a list
vars_available = df_variables['index'].tolist()
vars_available[:15]


['for',
 'in',
 'ucgid',
 'B24022_060E',
 'B19001B_014E',
 'B07007PR_019E',
 'B19101A_004E',
 'B24022_061E',
 'B19001B_013E',
 'B07007PR_018E',
 'B19101A_005E',
 'B19001B_012E',
 'B24022_062E',
 'B01001B_029E',
 'B19101A_006E']

In [127]:
'B19013_001E' in vars_available

True

In [122]:

[i in vars_available for i in vars_to_retrieve.keys()]

[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True]

## üëâüìåTrying Python Package
- Package works, main problem is the census tracts being given from ChatGPT
- New Idea, go from coordinates to lookup census tracts: https://geo.fcc.gov/api/census/

### FCC API

In [343]:
chatgpt_params['coordinates']['urban']

{'SW': [33.4, -112.1], 'NE': [33.5, -112.0]}

In [315]:
# Calculate center of urban region
urban_center = [(chatgpt_params['coordinates']['urban']['SW'][0] + chatgpt_params['coordinates']['urban']['NE'][0]) / 2,
                (chatgpt_params['coordinates']['urban']['SW'][1] + chatgpt_params['coordinates']['urban']['NE'][1]) / 2]
urban_center

[33.4534, -112.06899999999999]

In [322]:
base_url= "https://geo.fcc.gov/api/census/block/find"
params = {"lat":urban_center[0],"lon":urban_center[1],"censusYear":2020,"format":"json"}

response = requests.get(base_url, params=params)
response.status_code

200

In [323]:
response.json()

{'input': {'lat': 33.4534, 'lon': -112.06899999999999, 'censusYear': '2020'},
 'results': [{'block_fips': '040131131003010',
   'bbox': [-112.069991, 33.451374, -112.067472, 33.454486],
   'county_fips': '04013',
   'county_name': 'Maricopa County',
   'state_fips': '04',
   'state_code': 'AZ',
   'state_name': 'Arizona',
   'block_pop_2020': 8,
   'amt': 'AMT010',
   'bea': 'BEA158',
   'bta': 'BTA347',
   'cma': 'CMA026',
   'eag': 'EAG005',
   'ivm': 'IVM026',
   'mea': 'MEA040',
   'mta': 'MTA027',
   'pea': 'PEA015',
   'rea': 'REA005',
   'rpc': 'RPC005',
   'vpc': 'VPC041'}]}
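
Only a few fields of this response matter for the tract lookup. The sketch below pulls them out defensively; `extract_block_info` is a hypothetical helper, and how the API responds for a point with no census block (an empty `results` list or a missing `block_fips`) is an assumption rather than documented behavior.

```python
def extract_block_info(fcc_response):
    """Return (block_fips, county_name, state_code), or None if no block was found."""
    results = fcc_response.get("results") or []
    if not results or not results[0].get("block_fips"):
        return None
    first = results[0]
    return first["block_fips"], first.get("county_name"), first.get("state_code")


# Trimmed copy of the response shown above
sample = {"results": [{"block_fips": "040131131003010",
                       "county_name": "Maricopa County",
                       "state_code": "AZ"}]}
print(extract_block_info(sample))
```

Returning `None` for an empty result lets the calling code skip points that fall outside any census block instead of failing with an `IndexError`.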

In [387]:
def fetch_census_block_data(coordinates, censusYear=2020,only_block_fips=False):
    """
    Fetches census block data based on the given coordinates.
    Parameters:
    - coordinates (dict or tuple): The coordinates of the location. If a dictionary is provided, it should contain 'SW' and 'NE' keys representing the southwest and northeast coordinates respectively. If a tuple is provided, it should contain the latitude and longitude values.
    - censusYear (int): The year for which the census data is requested. Default is 2020.
    - only_block_fips (bool): If True, only the block FIPS code will be returned. If False, the entire response will be returned. Default is False.
    Returns:
    - census_block_fips (str): The block FIPS code if only_block_fips is True.
    - response (dict): The response containing the census block data if only_block_fips is False.
    """
    # Calculate the center of the region
    if isinstance(coordinates, dict):
        SW = coordinates['SW']
        NE = coordinates['NE']
        center = [(SW[0] + NE[0]) / 2, (SW[1] + NE[1]) / 2]
    else:
        center = coordinates
        
    # Construct the API URL
    base_url= "https://geo.fcc.gov/api/census/block/find"
    params = {"lat":center[0],"lon":center[1],"censusYear":censusYear,"format":"json"}

    # Make the API request
    response = requests.get(base_url, params=params)
    
    # Check if the request was successful
    if response.status_code != 200:
        print(f"Error fetching data: {response.status_code}")
        return None
    
    # Return the block FIPS code or the entire response
    if only_block_fips:
        census_block_fips = response.json()['results'][0]['block_fips']
        return census_block_fips
    else:
        return response.json()

# Test the function
urban_census_fips = fetch_census_block_data(chatgpt_params['coordinates']['urban'])
urban_census_fips

{'input': {'lat': 33.774, 'lon': -84.363, 'censusYear': '2020'},
 'results': [{'block_fips': '131210014002002',
   'bbox': [-84.363739, 33.773597, -84.360962, 33.774632],
   'county_fips': '13121',
   'county_name': 'Fulton County',
   'state_fips': '13',
   'state_code': 'GA',
   'state_name': 'Georgia',
   'block_pop_2020': 23,
   'amt': 'AMT003',
   'bea': 'BEA040',
   'bta': 'BTA024',
   'cma': 'CMA017',
   'eag': 'EAG003',
   'ivm': 'IVM017',
   'mea': 'MEA008',
   'mta': 'MTA011',
   'pea': 'PEA011',
   'rea': 'REA002',
   'rpc': 'RPC002',
   'vpc': 'VPC003'}]}

In [388]:
census_block_fips = urban_census_fips['results'][0]['block_fips']
census_block_fips

'131210014002002'

In [389]:
census_block_fips = fetch_census_block_data(chatgpt_params['coordinates']['urban'], only_block_fips=True)
census_block_fips


'131210014002002'

In [390]:
# Extract state, county, and tract components from the census block FIPS code
state_fips = census_block_fips[:2]
county_fips = census_block_fips[2:5]
tract_code = census_block_fips[5:11]
print(f"{state_fips=}, {county_fips=}, {tract_code=}")

state_fips='13', county_fips='121', tract_code='001400'
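
The slicing above relies on the fixed layout of a 15-digit census block FIPS code: 2 digits for the state, 3 for the county, 6 for the tract, and 4 for the block. As a minimal sketch, the same logic can be wrapped in a small helper that also checks the input length. The name `split_block_fips` is hypothetical and is not part of the notebook.

```python
def split_block_fips(block_fips: str) -> dict:
    """Split a 15-digit census block FIPS code into its components."""
    if len(block_fips) != 15 or not block_fips.isdigit():
        raise ValueError(f"Expected a 15-digit block FIPS code, got {block_fips!r}")
    return {
        "state_fips": block_fips[:2],     # 2-digit state code
        "county_fips": block_fips[2:5],   # 3-digit county code
        "tract_code": block_fips[5:11],   # 6-digit tract code
        "block_code": block_fips[11:],    # 4-digit block code
    }


# Block FIPS code taken from the FCC response shown earlier
parts = split_block_fips("131210014002002")
print(parts)
```

Validating the length up front makes a truncated or malformed code fail loudly instead of silently producing the wrong tract.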


In [391]:
from census import Census

# Census api package from below 
c = Census(api_key, year=2020)
c.acs5.get(('NAME', ",".join(vars_to_retrieve.keys())),
           geo = {'for': f"tract:{tract_code}",
               "in" :f'state:{state_fips} county:{county_fips}'})

[{'NAME': 'Census Tract 14, Fulton County, Georgia',
  'B19013_001E': 82086.0,
  'B19301_001E': 77428.0,
  'B17021_002E': 145.0,
  'B17021_001E': 2291.0,
  'B23025_005E': 29.0,
  'B23025_002E': 1710.0,
  'B15003_017E': 62.0,
  'B15003_022E': 882.0,
  'B25077_001E': 397800.0,
  'B25064_001E': 1559.0,
  'B25003_002E': 680.0,
  'B25003_001E': 1420.0,
  'B02001_002E': 1885.0,
  'B02001_003E': 169.0,
  'B03003_003E': 73.0,
  'B01002_001E': 32.5,
  'B16001_002E': None,
  'B08301_010E': 54.0,
  'B08301_001E': 1661.0,
  'B25044_003E': 0.0,
  'B25044_001E': 1420.0,
  'state': '13',
  'county': '121',
  'tract': '001400'}]

In [392]:
def get_census_data_for_block(census_block_fips, year=2020):
    variable_dict = {
            "B19013_001E": "Median Household Income",
            "B19301_001E": "Per Capita Income",
            "B17021_002E": "Population Below Poverty Level",
            "B17021_001E": "Total Population for Poverty",
            "B23025_005E": "Unemployed Population",
            "B23025_002E": "Labor Force Population",
            "B15003_017E": "High School Graduate",
            "B15003_022E": "Bachelor's Degree or Higher",
            "B25077_001E": "Median Home Value",
            "B25064_001E": "Median Gross Rent",
            "B25003_002E": "Owner Occupied Housing Units",
            "B25003_001E": "Total Housing Units",
            "B02001_002E": "White Population",
            "B02001_003E": "Black or African American Population",
            "B03003_003E": "Hispanic or Latino Population",
            "B01002_001E": "Median Age",
            "B16001_002E": "Non-English Speakers",
            "B08301_010E": "Public Transit Commuters",
            "B08301_001E": "Total Commuters",
            "B25044_003E": "Households with No Vehicle",
            "B25044_001E": "Total Households"
        }
    from census import Census
    # Separate the FIPS code into its components
    state_fips = census_block_fips[:2]
    county_fips = census_block_fips[2:5]
    tract_code = census_block_fips[5:11]
    print(f"{state_fips=}, {county_fips=}, {tract_code=}")
    
    # Census api package from below 
    c = Census(api_key, year=year)
    results = c.acs5.get(('NAME', ",".join(variable_dict.keys())),
            geo = {'for': f"tract:{tract_code}",
                "in" :f'state:{state_fips} county:{county_fips}'})
    
    try:
        df = pd.DataFrame.from_records(results)
        df = df.rename(columns=variable_dict)
        # Calculate derived variables
        df['Poverty Rate'] = df['Population Below Poverty Level'].astype(float) / df['Total Population for Poverty'].astype(float) * 100
        df['Unemployment Rate'] = df['Unemployed Population'].astype(float) / df['Labor Force Population'].astype(float) * 100
        df['Percentage Owner Occupied'] = df['Owner Occupied Housing Units'].astype(float) / df['Total Housing Units'].astype(float) * 100
        df['Percentage Public Transit'] = df['Public Transit Commuters'].astype(float) / df['Total Commuters'].astype(float) * 100
        df['Percentage No Vehicle'] = df['Households with No Vehicle'].astype(float) / df['Total Households'].astype(float) * 100
        return df
    except Exception as e:
        display(e)
        return results
    

In [393]:
urban_census_fips

{'input': {'lat': 33.774, 'lon': -84.363, 'censusYear': '2020'},
 'results': [{'block_fips': '131210014002002',
   'bbox': [-84.363739, 33.773597, -84.360962, 33.774632],
   'county_fips': '13121',
   'county_name': 'Fulton County',
   'state_fips': '13',
   'state_code': 'GA',
   'state_name': 'Georgia',
   'block_pop_2020': 23,
   'amt': 'AMT003',
   'bea': 'BEA040',
   'bta': 'BTA024',
   'cma': 'CMA017',
   'eag': 'EAG003',
   'ivm': 'IVM017',
   'mea': 'MEA008',
   'mta': 'MTA011',
   'pea': 'PEA011',
   'rea': 'REA002',
   'rpc': 'RPC002',
   'vpc': 'VPC003'}]}

In [394]:
# Putting it all together
specs = """Select a region that will be a perfect example of the effects of urban heat islands.
Select small identically-sized nearby non-overlapping regions from 2 separate census FIPS counties.
Limit the selected area to minimize the size of the dataset."""
# Select a region in a non-desert area."""

chatgpt_params = suggest_data_params(specs=specs,pydantic_model=DataParamsFips, return_json=True, temperature=0.0)
print(chatgpt_params)

display(preview_regions(chatgpt_params))

## Get urban data
urban_census_block_fips = fetch_census_block_data(chatgpt_params['coordinates']['urban'], only_block_fips=True, censusYear=2020)
print(f"{urban_census_block_fips=}")

urban_data = get_census_data_for_block(urban_census_block_fips)
urban_data['Region'] = 'Urban'

## Get rural data
rural_census_block_fips = fetch_census_block_data(chatgpt_params['coordinates']['rural'], only_block_fips=True, censusYear=2020)
print(f"{rural_census_block_fips=}")

rural_data = get_census_data_for_block(rural_census_block_fips)
rural_data['Region'] = 'Rural'
rural_data

{'city_region_name': 'Atlanta, GA', 'coordinates': {'rural': {'SW': [33.35, -84.6], 'NE': [33.45, -84.5]}, 'urban': {'SW': [33.7, -84.45], 'NE': [33.8, -84.35]}}, 'time': {'start': '2022-01-01', 'end': '2022-12-31'}, 'fips': {'state_fips': '13', 'rural': {'state_fips': '13', 'county_fips': ['113'], 'census_tract_fips': ['13113010100', '13113010200', '13113010300', '13113010400', '13113010500']}, 'urban': {'state_fips': '13', 'county_fips': ['121'], 'census_tract_fips': ['13121010100', '13121010200', '13121010300', '13121010400', '13121010500']}}}


urban_census_block_fips='131210035002019'
state_fips='13', county_fips='121', tract_code='003500'
rural_census_block_fips='131131403044017'
state_fips='13', county_fips='113', tract_code='140304'


| | NAME | Median Household Income | Per Capita Income | Population Below Poverty Level | Total Population for Poverty | Unemployed Population | Labor Force Population | High School Graduate | Bachelor's Degree or Higher | Median Home Value | ... | Total Households | state | county | tract | Poverty Rate | Unemployment Rate | Percentage Owner Occupied | Percentage Public Transit | Percentage No Vehicle | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Census Tract 1403.04, Fayette County, Georgia | 101172.0 | 41859.0 | 782.0 | 5861.0 | 0.0 | 2722.0 | 622.0 | 1083.0 | 272000.0 | ... | 2001.0 | 13 | 113 | 140304 | 13.342433 | 0.0 | 80.109945 | 0.619915 | 3.998001 | Rural |


In [395]:
df = pd.concat([urban_data, rural_data], axis=0)
df

| | NAME | Median Household Income | Per Capita Income | Population Below Poverty Level | Total Population for Poverty | Unemployed Population | Labor Force Population | High School Graduate | Bachelor's Degree or Higher | Median Home Value | ... | Total Households | state | county | tract | Poverty Rate | Unemployment Rate | Percentage Owner Occupied | Percentage Public Transit | Percentage No Vehicle | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Census Tract 35, Fulton County, Georgia | 51474.0 | 35590.0 | 854.0 | 2203.0 | 187.0 | 1706.0 | 285.0 | 422.0 | 228400.0 | ... | 1111.0 | 13 | 121 | 3500 | 38.76532 | 10.961313 | 21.692169 | 19.920053 | 1.980198 | Urban |
| 0 | Census Tract 1403.04, Fayette County, Georgia | 101172.0 | 41859.0 | 782.0 | 5861.0 | 0.0 | 2722.0 | 622.0 | 1083.0 | 272000.0 | ... | 2001.0 | 13 | 113 | 140304 | 13.342433 | 0.0 | 80.109945 | 0.619915 | 3.998001 | Rural |


### Census Package

In [165]:
!pip install census
!pip install us


Collecting us
  Downloading us-3.2.0-py3-none-any.whl.metadata (10 kB)
Collecting jellyfish (from us)
  Downloading jellyfish-1.1.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.6 kB)
Downloading us-3.2.0-py3-none-any.whl (13 kB)
Downloading jellyfish-1.1.0-cp311-cp311-macosx_11_0_arm64.whl (303 kB)
Successfully installed jellyfish-1.1.0 us-3.2.0


In [241]:
from census import Census
from us import states

c = Census(api_key)
c
# c = Census("MY_API_KEY", year=2010)

# c.acs5.get(('NAME', 'B25034_010E'),
#           {'for': 'state:{}'.format(states.MD.fips)})

<census.core.Census at 0x31d61e3d0>

The `census` package also provides a tract-level helper method: `state_county_tract(fields, state_fips, county_fips, tract)`.



In [242]:
chatgpt_params['fips']['urban']#['census_tract_fips']

{'state_fips': '13',
 'county_fips': ['121'],
 'census_tract_fips': ['13121010100', '13121010200', '13121010300']}

In [243]:
tract = chatgpt_params['fips']['urban']['census_tract_fips'][0]
tract

'13121010100'

In [247]:
state_fips = tract[:2]
county_fips = tract[2:5]
tract_code = tract[5:]
state_fips, county_fips, tract_code 

('13', '121', '010100')

In [182]:
var_list = list(vars_to_retrieve.keys())
var_list[:5]

['B19013_001E', 'B19301_001E', 'B17021_002E', 'B17021_001E', 'B23025_005E']

In [180]:
# Test var name
var_name = var_list[0]
var_name


'B19013_001E'

In [263]:
c = Census(api_key, year=2022)
c.acs5.get(('NAME', ",".join(vars_to_retrieve.keys())),
           geo = {'for': "tract:*",
               "in" :f'state:{state_fips} county:{county_fips}'})

[{'NAME': 'Census Tract 1; Fulton County; Georgia',
  'B19013_001E': 154808.0,
  'B19301_001E': 89698.0,
  'B17021_002E': 438.0,
  'B17021_001E': 5159.0,
  'B23025_005E': 60.0,
  'B23025_002E': 2842.0,
  'B15003_017E': 131.0,
  'B15003_022E': 1882.0,
  'B25077_001E': 831500.0,
  'B25064_001E': 1707.0,
  'B25003_002E': 1545.0,
  'B25003_001E': 2458.0,
  'B02001_002E': 4439.0,
  'B02001_003E': 140.0,
  'B03003_003E': 90.0,
  'B01002_001E': 44.8,
  'B16001_002E': None,
  'B08301_010E': 17.0,
  'B08301_001E': 2765.0,
  'B25044_003E': 52.0,
  'B25044_001E': 2458.0,
  'state': '13',
  'county': '121',
  'tract': '000100'},
 {'NAME': 'Census Tract 2.01; Fulton County; Georgia',
  'B19013_001E': 120982.0,
  'B19301_001E': 119300.0,
  'B17021_002E': 44.0,
  'B17021_001E': 2233.0,
  'B23025_005E': 49.0,
  'B23025_002E': 1606.0,
  'B15003_017E': 85.0,
  'B15003_022E': 702.0,
  'B25077_001E': 612400.0,
  'B25064_001E': 1747.0,
  'B25003_002E': 805.0,
  'B25003_001E': 1220.0,
  'B02001_002E': 1923.

Open question: how do we go from `tract:*` (all tracts in the county) to a query for specific tract numbers (`tract:<tract number>`)?

In [276]:
# WORKING!
c = Census(api_key, year=2022)
c.acs5.get(('NAME', ",".join(vars_to_retrieve.keys())),
           geo = {'for': "tract:000100,980000",
               "in" :f'state:{state_fips} county:{county_fips}'})

[{'NAME': 'Census Tract 1; Fulton County; Georgia',
  'B19013_001E': 154808.0,
  'B19301_001E': 89698.0,
  'B17021_002E': 438.0,
  'B17021_001E': 5159.0,
  'B23025_005E': 60.0,
  'B23025_002E': 2842.0,
  'B15003_017E': 131.0,
  'B15003_022E': 1882.0,
  'B25077_001E': 831500.0,
  'B25064_001E': 1707.0,
  'B25003_002E': 1545.0,
  'B25003_001E': 2458.0,
  'B02001_002E': 4439.0,
  'B02001_003E': 140.0,
  'B03003_003E': 90.0,
  'B01002_001E': 44.8,
  'B16001_002E': None,
  'B08301_010E': 17.0,
  'B08301_001E': 2765.0,
  'B25044_003E': 52.0,
  'B25044_001E': 2458.0,
  'state': '13',
  'county': '121',
  'tract': '000100'},
 {'NAME': 'Census Tract 9800; Fulton County; Georgia',
  'B19013_001E': -666666666.0,
  'B19301_001E': -666666666.0,
  'B17021_002E': 0.0,
  'B17021_001E': 0.0,
  'B23025_005E': 0.0,
  'B23025_002E': 0.0,
  'B15003_017E': 0.0,
  'B15003_022E': 0.0,
  'B25077_001E': -666666666.0,
  'B25064_001E': -666666666.0,
  'B25003_002E': 0.0,
  'B25003_001E': 0.0,
  'B02001_002E': 0.0
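
The value `-666666666.0` returned for Census Tract 9800 is not a real income or home value. The ACS uses large negative placeholders like this when an estimate cannot be computed, so they should be masked before any averaging or ratio calculation. Below is a minimal sketch that replaces the placeholder with `None` in the list of records returned by the API. The helper name `mask_missing` is hypothetical, and only the placeholder seen in the output above is handled. Other placeholder values may exist and would need to be added.

```python
ACS_MISSING = -666666666  # placeholder seen in the Census Tract 9800 output above


def mask_missing(records, sentinel=ACS_MISSING):
    """Return a copy of the records with the placeholder value replaced by None."""
    return [
        {key: (None if value == sentinel else value) for key, value in record.items()}
        for record in records
    ]


# Two fields copied from the Census Tract 9800 output above
raw = [{"NAME": "Census Tract 9800; Fulton County; Georgia",
        "B19013_001E": -666666666.0,
        "B17021_002E": 0.0}]
cleaned = mask_missing(raw)
print(cleaned)
```

If the records are loaded into a DataFrame afterwards, the `None` values become missing values and drop out of most aggregations.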

In [268]:
tracts

'010100,010200,010300'

In [278]:
# Testing manual rearrangement of the first 2 digits
','.join(["100100",'100200','100300'])

'000100,000200,000300'

In [279]:
c = Census(api_key, year=2022)
c.acs5.get(('NAME', ",".join(vars_to_retrieve.keys())),
           geo = {'for': "tract:100100,100200,100300",
               "in" :f'state:{state_fips} county:{county_fips}'})

[]

In [305]:
chatgpt_params['fips']['urban']['census_tract_fips']

['04013116600', '04013116700']

In [306]:
chatgpt_params['fips']['urban']

{'state_fips': '04',
 'county_fips': ['013'],
 'census_tract_fips': ['04013116600', '04013116700']}

In [307]:
region_type = 'urban'

state_fips = chatgpt_params["fips"][region_type]["state_fips"]
county_fips = chatgpt_params["fips"][region_type]["county_fips"]#.replace(state_fips,"")
tracts =",".join(chatgpt_params["fips"][region_type]["census_tract_fips"])
state_fips,county_fips,tracts

('04', ['013'], '04013116600,04013116700')

- Not sure whether the package wants the 6-digit tract code or the full 11-digit FIPS code for each tract.
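
The earlier query that worked passed 6-digit tract codes (`tract:000100,980000`), with the state and county supplied separately in the `in` clause. Based on that observation, one possible approach is to strip the 5-digit state and county prefix from each full tract FIPS code and build the `geo` dictionary from the pieces. This is a sketch under that assumption, and `build_tract_geo` is a hypothetical helper name.

```python
def build_tract_geo(tract_fips_list):
    """Build a `geo` dict for a tract-level query from 11-digit tract FIPS codes."""
    if not tract_fips_list:
        raise ValueError("At least one tract FIPS code is required")
    # The first 5 digits are state (2) + county (3); all tracts must share them
    prefixes = {tract[:5] for tract in tract_fips_list}
    if len(prefixes) != 1:
        raise ValueError("All tracts must belong to the same state and county")
    prefix = prefixes.pop()
    tract_codes = ",".join(tract[5:] for tract in tract_fips_list)
    return {
        "for": f"tract:{tract_codes}",
        "in": f"state:{prefix[:2]} county:{prefix[2:]}",
    }


geo = build_tract_geo(["04013116600", "04013116700"])
print(geo)
```

The resulting dictionary has the same shape as the `geo` argument used in the cells above, so it could be passed to `c.acs5.get(...)` directly.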

In [308]:
region_type = 'urban'

state_fips = chatgpt_params["fips"][region_type]["state_fips"]
county_fips = chatgpt_params["fips"][region_type]["county_fips"][0]#.replace(state_fips,"")
# tracts =",".join(chatgpt_params["fips"][region_type]["census_tract_fips"])

# Get tracts without the state and county fips
tract_list = []
for tract in chatgpt_params["fips"][region_type]["census_tract_fips"]:
    tract_list.append(tract[5:])
tracts = ",".join(tract_list)
    
print(f"{state_fips=},{county_fips=},{tracts=}")


results_list = []

for tract in tract_list:
        
    results = c.acs5.get(('NAME', ",".join(vars_to_retrieve.keys())),
            geo={'for': f'tract:{tract}',#{tracts}',
                        'in': f'state:{state_fips} county:{county_fips}'})
    results_list.append(results)
# results = c.acs5.get(('NAME', ",".join(vars_to_retrieve.keys())),
#            geo={'for': f'tract:*',#{tracts}',
#                        'in': f'state:{state_fips} county:{county_fips}'})
# results

state_fips='04',county_fips='013',tracts='116600,116700'


In [309]:
df_results = pd.DataFrame(results)
df_results
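
Note that the loop above appends one list of records per tract to `results_list`, while `results` only holds the response for the last tract. A minimal sketch of flattening the per-tract responses into a single list of rows is shown below. The sample rows are made up for illustration, and `flatten_results` is a hypothetical helper name.

```python
from itertools import chain


def flatten_results(results_list):
    """Flatten a list of per-tract responses (each a list of dicts) into one list of rows."""
    return list(chain.from_iterable(results_list))


# Made-up responses: one list of records per requested tract
results_list = [
    [{"NAME": "Tract A", "tract": "116600"}],
    [{"NAME": "Tract B", "tract": "116700"}],
]
rows = flatten_results(results_list)
print(rows)
```

The flattened `rows` list can then be passed to `pd.DataFrame(rows)` to get one row per tract.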

In [227]:
tract_list

['000400', '000500', '000600']

In [229]:
df_results[df_results['tract'].isin(tract_list)]

Unnamed: 0,NAME,B19013_001E,B19301_001E,B17021_002E,B17021_001E,B23025_005E,B23025_002E,B15003_017E,B15003_022E,B25077_001E,...,B03003_003E,B01002_001E,B16001_002E,B08301_010E,B08301_001E,B25044_003E,B25044_001E,state,county,tract


# ⭐️ FINAL COMPLETE WORKFLOW

> Note: In the end, the workflow no longer requires the FIPS information from ChatGPT. The FIPS info is acquired using the selected region's lat/long.
> However, the suggested parameters from ChatGPT were better overall when it considered FIPS, so we are still using the new, updated prompt.

In [25]:
## Defining the structured output desired from ChatGPT
from pydantic import BaseModel, Field
from typing import List, Optional, Text, Dict
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.prompts import PromptTemplate

import pandas as pd
import plotly.express as px
import requests, json, os

# Define Pydantic models for structured output
class Coordinates(BaseModel):
    SW: List[float]
    NE: List[float]

class RegionCoordinates(BaseModel):
    rural: Optional[Coordinates]
    urban: Optional[Coordinates]



### NEW CLASSES FOR FIPS
class FIPS(BaseModel):
    state_fips: str
    county_fips: List[str]
    census_tract_fips: List[str]
    # urban_census_tract_fips: List[str]
class RegionFIPS(BaseModel):
    state_fips: str
    # rural_census_tract_fips: List[str]
    rural: Optional[FIPS]
    urban: Optional[FIPS]
    # urban_census_tract_fips: List[str]


## UPDATED DATA PARAMS
class DataParamsFips(BaseModel):
    city_region_name: str
    coordinates: Optional[RegionCoordinates]
    time: Dict[str, str]
    fips: Optional[RegionFIPS]
    

## Updated suggest function with FIPS added to prompt
def suggest_data_params(specs: str, temperature=0.0, model_type='gpt-4o', return_json=True,
                        pydantic_model=DataParamsFips):
    """
    Suggests data parameters for downloading MODIS data for a specific region and time range.
    
    Args:
        specs (str): The instructions describing the requirements for the data download.
        temperature (float, optional): The temperature parameter for the language model. Defaults to 0.0.
        model_type (str, optional): The type of language model to use. Defaults to 'gpt-4o'.
        return_json (bool, optional): Whether to parse the response as JSON. Defaults to True.
        pydantic_model (BaseModel, optional): The Pydantic model describing the desired output structure. Defaults to DataParamsFips.
    
    Returns:
        dict or str: The parsed JSON response (a dict) if return_json is True, otherwise the response as a string.
    """
    
    # The prompt template for suggesting data parameters
    prompt = """
    I am performing an urban heat island analysis project with MODIS data comparing urban areas vs. rural areas. 
    I need to download MODIS data for 2 nearby non-overlapping regions (urban area and rural area outside of city) and time range.
    Help me select the urban and rural regions and time following the instructions below.
    {specs}
    
    Provide me the data parameters for the download (city_region_name, coordinates as SW [lat,long] NE [lat,long], time_start named 'start', time_end named 'end') 
    also include the state FIPS and a list of every census tract fips code included within the selected regions for the Census API, in the following format:
    Format Instructions:
    Use the 2-letter abbreviations for the state.
    {format_instructions}
    """
    # Create a PromptTemplate object
    final_prompt_template = PromptTemplate.from_template(prompt)

    # Get the OpenAI API key from the environment
    api_key = os.getenv('OPENAI_API_KEY')
        
    # Instantiate the language model with the chosen model (gpt-4o is recent and reasonably priced)
    # and set the temperature (creativity level)
    llm = ChatOpenAI(temperature=temperature, model=model_type, api_key=api_key)
    
    if return_json:
        # JsonOutputParser will use the data model classes from above
        parser = JsonOutputParser(pydantic_object=pydantic_model)
        # Add formatting instructions for pydantic
        instructions = parser.get_format_instructions()
            
    else:
        # StrOutputParser will return the response as a string
        parser = StrOutputParser()
        # Manually define the format instructions
        instructions = "Respond in plain text, listing each requested data parameter and its value."
        
        
    ## Adding the instructions to the prompt template
    final_prompt_template = final_prompt_template.partial(format_instructions=instructions)
    
    
    # Making the final chain
    llm_chain = final_prompt_template | llm | parser
    
    # Invoke the chain with the query to get the response
    return llm_chain.invoke(input=dict(specs=specs))
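
The JSON parser returns a plain dictionary, so nothing guarantees that the suggested bounding boxes are well formed or that the urban and rural regions are actually non-overlapping, as the prompt requests. A minimal sketch of a sanity check on the returned coordinates is shown below. The helper names `is_valid_box` and `boxes_overlap` are hypothetical, and the coordinates are copied from the ChatGPT output printed in this notebook.

```python
def is_valid_box(box):
    """Check that SW is south-west of NE and both corners are valid lat/lon values."""
    sw_lat, sw_lon = box["SW"]
    ne_lat, ne_lon = box["NE"]
    return (-90 <= sw_lat < ne_lat <= 90) and (-180 <= sw_lon < ne_lon <= 180)


def boxes_overlap(box_a, box_b):
    """Return True if two bounding boxes share any interior area."""
    lat_overlap = box_a["SW"][0] < box_b["NE"][0] and box_b["SW"][0] < box_a["NE"][0]
    lon_overlap = box_a["SW"][1] < box_b["NE"][1] and box_b["SW"][1] < box_a["NE"][1]
    return lat_overlap and lon_overlap


# Coordinates copied from the suggested parameters for Atlanta, GA
coordinates = {
    "rural": {"SW": [33.35, -84.8], "NE": [33.45, -84.7]},
    "urban": {"SW": [33.7, -84.45], "NE": [33.8, -84.35]},
}
all_valid = all(is_valid_box(box) for box in coordinates.values())
overlapping = boxes_overlap(coordinates["urban"], coordinates["rural"])
print(f"{all_valid=}, {overlapping=}")
```

Running a check like this right after `suggest_data_params` would catch a bad suggestion before any data is downloaded.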

In [26]:
# Visualization Functions
def generate_sample_points(sw, ne, num_points=10):
    """Generate sample points within a given bounding box."""
    latitudes = [sw[0] + i * (ne[0] - sw[0]) / (num_points - 1) for i in range(num_points)]
    longitudes = [sw[1] + i * (ne[1] - sw[1]) / (num_points - 1) for i in range(num_points)]
    return [(lat, lon) for lat in latitudes for lon in longitudes]


def preview_regions(data_params):
    """
    Generate a preview of selected bounding boxes for a given city region.
    Parameters:
    - data_params (dict): A dictionary containing the following keys:
        - 'coordinates' (dict): A dictionary containing region names as keys and bounding boxes as values.
            Each bounding box is represented by a dictionary with 'SW' and 'NE' keys, representing the
            southwest and northeast coordinates respectively.
        - 'city_region_name' (str): The name of the city region.
    Returns:
    - fig (plotly.graph_objects.Figure): A scatter mapbox plot showing the preview of selected bounding boxes.
    """
    # List to store the sampled coordinates
    sampled_coordinates = []

    # Sample a grid of points inside each region's bounding box
    for region, bounding_box in data_params['coordinates'].items():
        # Generate sample points within the bounding box
        sample_points = generate_sample_points(bounding_box['SW'], bounding_box['NE'], num_points=10)
        for lat, lon in sample_points:
            sampled_coordinates.append({'Region': region, 'Latitude': lat, 'Longitude': lon})

    # Convert results to DataFrame
    coords_df = pd.DataFrame(sampled_coordinates)
    
    ## Plot the region suggested
    fig = px.scatter_mapbox(coords_df, lat="Latitude", lon="Longitude", color='Region',
                            # color_continuous_scale="Viridis", 
                            mapbox_style="carto-positron",
                            title=f"Preview of Selected Bounding Boxes for {data_params['city_region_name']}",
                            height=600, width=600)

    # Remove left and right side margins
    fig.update_layout(
        margin={"r":0, "l":0,'b':0, 't':100},
        legend={'orientation':"h", 'yanchor':"top", 'y':1.05, 'xanchor':"left", 'x':0},
        
    )
    return fig
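
Besides previewing the regions on a map, it can help to check numerically that the two boxes are "identically-sized", as the prompt requests. The sketch below estimates the width and height of a bounding box in kilometers using a simple flat-earth approximation. The constant of roughly 111.32 km per degree of latitude is an approximation and `box_size_km` is a hypothetical helper name, so treat the result as a rough estimate only.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def box_size_km(box):
    """Approximate (width_km, height_km) of a bounding box given SW/NE [lat, lon] corners."""
    sw_lat, sw_lon = box["SW"]
    ne_lat, ne_lon = box["NE"]
    # A degree of longitude shrinks with the cosine of the latitude
    mid_lat = math.radians((sw_lat + ne_lat) / 2)
    height_km = (ne_lat - sw_lat) * KM_PER_DEGREE_LAT
    width_km = (ne_lon - sw_lon) * KM_PER_DEGREE_LAT * math.cos(mid_lat)
    return width_km, height_km


# Urban box from the suggested parameters for Atlanta, GA
urban_box = {"SW": [33.7, -84.45], "NE": [33.8, -84.35]}
width_km, height_km = box_size_km(urban_box)
print(f"{width_km:.1f} km wide, {height_km:.1f} km tall")
```

Comparing the sizes of the urban and rural boxes this way gives a quick check on whether the suggestion followed the instructions.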


- Workflow:
    - Get ChatGPT suggestions.
    - Use the SW/NE corners of each region to calculate the center coordinate.
    - Use the center coordinate with the FCC Enterprise Area API: https://geo.fcc.gov/api/census/#!/block/get_block_find 
    - Use the census tracts from the FCC API to retrieve census data from ACS5 using the Census API.
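
The first two lookup steps of this workflow can be sketched without making a network call: compute the center of a bounding box, then build the request URL for the FCC block lookup. The query parameters mirror those used by `fetch_census_block_data` in this notebook. The helper names `bbox_center` and `build_block_find_url` are hypothetical.

```python
from urllib.parse import urlencode

FCC_BLOCK_FIND_URL = "https://geo.fcc.gov/api/census/block/find"


def bbox_center(box):
    """Return the (lat, lon) center of a bounding box with SW/NE [lat, lon] corners."""
    sw, ne = box["SW"], box["NE"]
    return (sw[0] + ne[0]) / 2, (sw[1] + ne[1]) / 2


def build_block_find_url(lat, lon, census_year=2020):
    """Build the request URL for the block lookup (no request is sent here)."""
    query = urlencode({"lat": lat, "lon": lon, "censusYear": census_year, "format": "json"})
    return f"{FCC_BLOCK_FIND_URL}?{query}"


# Urban box from the suggested parameters for Atlanta, GA
lat, lon = bbox_center({"SW": [33.7, -84.45], "NE": [33.8, -84.35]})
url = build_block_find_url(lat, lon)
print(url)
```

Separating URL construction from the request itself makes the lookup easy to inspect and test before calling the API.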

In [27]:
def fetch_census_block_data(coordinates, censusYear=2020,only_block_fips=False):
    """
    Fetches census block data based on the given coordinates.
    Parameters:
    - coordinates (dict or tuple): The coordinates of the location. If a dictionary is provided, it should contain 'SW' and 'NE' keys representing the southwest and northeast coordinates respectively. If a tuple is provided, it should contain the latitude and longitude values.
    - censusYear (int): The year for which the census data is requested. Default is 2020.
    - only_block_fips (bool): If True, only the block FIPS code will be returned. If False, the entire response will be returned. Default is False.
    Returns:
    - census_block_fips (str): The block FIPS code if only_block_fips is True.
    - response (dict): The response containing the census block data if only_block_fips is False.
    """
    # Calculate the center of the region
    if isinstance(coordinates, dict):
        SW = coordinates['SW']
        NE = coordinates['NE']
        center = [(SW[0] + NE[0]) / 2, (SW[1] + NE[1]) / 2]
    else:
        center = coordinates
        
    # Construct the API URL
    base_url= "https://geo.fcc.gov/api/census/block/find"
    params = {"lat":center[0],"lon":center[1],"censusYear":censusYear,"format":"json"}

    # Make the API request
    response = requests.get(base_url, params=params)
    
    # Check if the request was successful
    if response.status_code != 200:
        print(f"Error fetching data: {response.status_code}")
        return None
    
    # Return the block FIPS code or the entire response
    if only_block_fips:
        census_block_fips = response.json()['results'][0]['block_fips']
        return census_block_fips
    else:
        return response.json()
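
One caveat: `response.json()['results'][0]` raises an error if the API returns no results, which could happen for a center point that does not fall inside any census block (for example, over open water). A more defensive way to read the response is sketched below. The helper name `extract_block_fips` is hypothetical, and the exact shape of an empty response is an assumption.

```python
def extract_block_fips(payload):
    """Return the block FIPS code from an FCC block-find response, or None if absent."""
    # Treat a missing payload or an empty/missing 'results' list as "no match"
    results = (payload or {}).get("results") or []
    if not results:
        return None
    return results[0].get("block_fips")


# Trimmed copy of the response shown earlier in the notebook
sample_response = {
    "input": {"lat": 33.774, "lon": -84.363, "censusYear": "2020"},
    "results": [{"block_fips": "131210014002002", "county_name": "Fulton County"}],
}
print(extract_block_fips(sample_response))
print(extract_block_fips({"results": []}))
```

Returning `None` lets the caller decide whether to skip the region or ask for a new suggestion.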


In [28]:
def get_census_data_for_block(census_block_fips, year=2020):
    variable_dict = {
            "B19013_001E": "Median Household Income",
            "B19301_001E": "Per Capita Income",
            "B17021_002E": "Population Below Poverty Level",
            "B17021_001E": "Total Population for Poverty",
            "B23025_005E": "Unemployed Population",
            "B23025_002E": "Labor Force Population",
            "B15003_017E": "High School Graduate",
            "B15003_022E": "Bachelor's Degree or Higher",
            "B25077_001E": "Median Home Value",
            "B25064_001E": "Median Gross Rent",
            "B25003_002E": "Owner Occupied Housing Units",
            "B25003_001E": "Total Housing Units",
            "B02001_002E": "White Population",
            "B02001_003E": "Black or African American Population",
            "B03003_003E": "Hispanic or Latino Population",
            "B01002_001E": "Median Age",
            "B16001_002E": "Non-English Speakers",
            "B08301_010E": "Public Transit Commuters",
            "B08301_001E": "Total Commuters",
            "B25044_003E": "Households with No Vehicle",
            "B25044_001E": "Total Households"
        }
    from census import Census
    # Separate the FIPS code into its components
    state_fips = census_block_fips[:2]
    county_fips = census_block_fips[2:5]
    tract_code = census_block_fips[5:11]
    print(f"{state_fips=}, {county_fips=}, {tract_code=}")
    
    # Query the ACS 5-year estimates using the census package
    c = Census(api_key, year=year)
    results = c.acs5.get(('NAME', ",".join(variable_dict.keys())),
            geo = {'for': f"tract:{tract_code}",
                "in" :f'state:{state_fips} county:{county_fips}'})
    
    try:
        df = pd.DataFrame.from_records(results)
        df = df.rename(columns=variable_dict)
        # Calculate derived variables
        df['Poverty Rate'] = df['Population Below Poverty Level'].astype(float) / df['Total Population for Poverty'].astype(float) * 100
        df['Unemployment Rate'] = df['Unemployed Population'].astype(float) / df['Labor Force Population'].astype(float) * 100
        df['Percentage Owner Occupied'] = df['Owner Occupied Housing Units'].astype(float) / df['Total Housing Units'].astype(float) * 100
        df['Percentage Public Transit'] = df['Public Transit Commuters'].astype(float) / df['Total Commuters'].astype(float) * 100
        df['Percentage No Vehicle'] = df['Households with No Vehicle'].astype(float) / df['Total Households'].astype(float) * 100
        return df
    except Exception as e:
        display(e)
        return results
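
The derived rates above divide one column by another, which becomes a problem for tracts whose denominator is zero or missing (such as Census Tract 9800 in the earlier output, where every count is 0). A minimal sketch of a guarded percentage calculation is shown below. The helper name `safe_percentage` is hypothetical, and the sample numbers are the Fulton County Tract 35 poverty counts from the results table.

```python
def safe_percentage(numerator, denominator):
    """Return numerator / denominator * 100, or None when the ratio is undefined."""
    if numerator is None or denominator is None:
        return None
    if float(denominator) == 0:
        return None
    return float(numerator) / float(denominator) * 100


# Population below poverty level / total population for poverty (Tract 35)
poverty_rate = safe_percentage(854.0, 2203.0)
print(round(poverty_rate, 5))

# Undefined ratio for a tract with no population
print(safe_percentage(0.0, 0.0))
```

A helper like this could be applied row by row when the tract list may include unpopulated areas.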
    

In [29]:
import os
creds_file = "/Users/codingdojo/.secret/census.json"
with open(os.path.abspath(creds_file)) as f:
    creds = json.load(f)
    api_key = creds['api-key']
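
The cell above reads the key from a hard-coded path on one machine. A more portable variant could check an environment variable first and only fall back to a credentials file. In this sketch the environment variable name `CENSUS_API_KEY` and the helper name `load_census_api_key` are assumptions. The `'api-key'` field name matches the JSON file used above.

```python
import json
import os


def load_census_api_key(env_var="CENSUS_API_KEY", creds_file=None):
    """Load the Census API key from an environment variable, else from a JSON file."""
    key = os.getenv(env_var)
    if key:
        return key
    if creds_file and os.path.exists(creds_file):
        with open(creds_file) as f:
            return json.load(f)["api-key"]
    raise RuntimeError(
        f"No Census API key found: set {env_var} or provide a valid creds_file"
    )


# Example usage (the path is a placeholder):
# api_key = load_census_api_key(creds_file="/path/to/census.json")
```

Raising an explicit error when no key is found avoids a confusing failure later inside the Census client.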

In [30]:
# Putting it all together
specs = """Select a region that will be a perfect example of the effects of urban heat islands.
Select small identically-sized nearby non-overlapping regions from 2 separate census FIPS counties.
Limit the selected area to minimize the size of the dataset."""
# Select a region in a non-desert area."""

chatgpt_params = suggest_data_params(specs=specs,pydantic_model=DataParamsFips, return_json=True, temperature=0.0)
print(chatgpt_params)

display(preview_regions(chatgpt_params))

## Get urban data
urban_census_block_fips = fetch_census_block_data(chatgpt_params['coordinates']['urban'], only_block_fips=True, censusYear=2020)
print(f"{urban_census_block_fips=}")

urban_data = get_census_data_for_block(urban_census_block_fips)
urban_data['Region'] = 'Urban'

## Get rural data
rural_census_block_fips = fetch_census_block_data(chatgpt_params['coordinates']['rural'], only_block_fips=True, censusYear=2020)
print(f"{rural_census_block_fips=}")

rural_data = get_census_data_for_block(rural_census_block_fips)
rural_data['Region'] = 'Rural'

df = pd.concat([urban_data, rural_data], axis=0)
df


{'city_region_name': 'Atlanta, GA', 'coordinates': {'rural': {'SW': [33.35, -84.8], 'NE': [33.45, -84.7]}, 'urban': {'SW': [33.7, -84.45], 'NE': [33.8, -84.35]}}, 'time': {'start': '2022-06-01', 'end': '2022-08-31'}, 'fips': {'state_fips': '13', 'rural': {'state_fips': '13', 'county_fips': ['077'], 'census_tract_fips': ['13077010100', '13077010200', '13077010300']}, 'urban': {'state_fips': '13', 'county_fips': ['121'], 'census_tract_fips': ['13121010100', '13121010200', '13121010300']}}}


urban_census_block_fips='131210035002019'
state_fips='13', county_fips='121', tract_code='003500'
rural_census_block_fips='130771703081036'
state_fips='13', county_fips='077', tract_code='170308'


| | NAME | Median Household Income | Per Capita Income | Population Below Poverty Level | Total Population for Poverty | Unemployed Population | Labor Force Population | High School Graduate | Bachelor's Degree or Higher | Median Home Value | ... | Total Households | state | county | tract | Poverty Rate | Unemployment Rate | Percentage Owner Occupied | Percentage Public Transit | Percentage No Vehicle | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Census Tract 35, Fulton County, Georgia | 51474.0 | 35590.0 | 854.0 | 2203.0 | 187.0 | 1706.0 | 285.0 | 422.0 | 228400.0 | ... | 1111.0 | 13 | 121 | 3500 | 38.76532 | 10.961313 | 21.692169 | 19.920053 | 1.980198 | Urban |
| 0 | Census Tract 1703.08, Coweta County, Georgia | 46471.0 | 27590.0 | 425.0 | 4718.0 | 121.0 | 2437.0 | 1101.0 | 485.0 | 182900.0 | ... | 1934.0 | 13 | 77 | 170308 | 9.008054 | 4.965121 | 52.171665 | 0.0 | 0.672182 | Rural |
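
With one urban row and one rural row in hand, a natural next step is to compute the gap between the two regions for each metric. Below is a minimal sketch using plain dictionaries, with values copied from the results above. The helper name `region_gap` is hypothetical. With the DataFrame, the same comparison could be done by subtracting the two rows.

```python
def region_gap(urban, rural, metrics):
    """Return urban minus rural for each requested metric."""
    return {metric: urban[metric] - rural[metric] for metric in metrics}


# Values copied from the urban (Fulton County) and rural (Coweta County) rows
urban = {"Median Household Income": 51474.0, "Poverty Rate": 38.76532, "Unemployment Rate": 10.961313}
rural = {"Median Household Income": 46471.0, "Poverty Rate": 9.008054, "Unemployment Rate": 4.965121}

gaps = region_gap(urban, rural, ["Median Household Income", "Poverty Rate", "Unemployment Rate"])
for metric, gap in gaps.items():
    print(f"{metric}: {gap:+.2f}")
```

Positive values mean the urban tract is higher on that metric. Keep in mind that this compares a single tract per region, so it is an illustration rather than a regional summary.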
