# 📍 Location & Census Data Acquisition

## Overview
This notebook was designed for **local execution** to acquire **store location and census data** for the project. Unlike the **product API pipeline**, which is designed for periodic updates, this dataset was considered **static** since:
- Census data is from **2023 and was not intended to be updated**.
- Store locations were **retrieved once**, with no active tracking of new store openings or closures.

## Purpose
- Extract and clean **store location data** from Kroger’s API.
- Acquire **U.S. Census data** on **income, poverty, SNAP participation, education, and racial demographics**.
- Format and store these datasets for integration into the **main analysis pipeline**.


In [2]:
import pandas as pd
import requests
import base64
import sys
import os
from dotenv import load_dotenv, set_key, get_key
import time
from datetime import datetime, timedelta
from pprint import pprint

## Census Data Acquisition

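This section extracts **demographic and economic indicators** from the **U.S. Census Bureau’s ACS 2023 5-year estimates**, downloaded as one CSV file per table.  
Key metrics include:
- **Population Counts**
- **Poverty Counts** (people below the poverty level)
- **Median Household Income**
- **SNAP Participation** (households)
- **Education Levels**
- **Racial Demographics**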

### Why This Data Was Considered Static
- Data was pulled **once for 2023** with **no intention to update**.
- Unlike product pricing, **demographic shifts occur gradually**, making static data sufficient for this project.
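
The cleaning step in this section depends on two traits of the downloaded ACS tables: the first row after the header holds descriptive labels rather than data, and the ZIP code sits inside the `NAME` column. The sketch below reproduces that on a small in-memory table. The sample rows and the `"-"` annotation are illustrative assumptions, not values taken from the project files.

```python
import io

import pandas as pd

# Toy stand-in for an ACS "-Data.csv" download (values are illustrative).
raw_csv = io.StringIO(
    "GEO_ID,NAME,B19013_001E\n"
    "Geography,Geographic Area Name,Estimate!!Median household income\n"
    "860Z200US00601,ZCTA5 00601,18571\n"
    "860Z200US00602,ZCTA5 00602,-\n"
)

df = pd.read_csv(raw_csv, dtype=str)

# Row 0 holds human-readable labels, not data, so drop it.
df = df.iloc[1:].reset_index(drop=True)

# Pull the five-digit ZCTA out of NAME, keeping it as a string so that
# leading zeros survive.
df["ZIP Code"] = df["NAME"].str.extract(r"(\d{5})", expand=False)

# Non-numeric annotations such as "-" become NaN instead of raising.
df["Median Household Income"] = pd.to_numeric(df["B19013_001E"], errors="coerce")

print(df[["ZIP Code", "Median Household Income"]])
```

Keeping ZIP codes as strings matters for the Puerto Rico and New England ranges, where an integer conversion would silently drop the leading zero.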


In [None]:
BASE_DIR = os.path.abspath(os.path.dirname(os.getcwd()))

# Define directories relative to `BASE_DIR`
SRC_DIR = os.path.join(BASE_DIR, "src")  # Points to `src/`
DATA_DIR = os.path.join(SRC_DIR, "data/Census_Files")  # Points to `src/data/Census_Files`

# Ensure paths are correctly set
census_files = {
    "Population": os.path.join(DATA_DIR, "ACSDT5Y2023.B01003", "ACSDT5Y2023.B01003-Data.csv"),
    "Poverty Rate": os.path.join(DATA_DIR, "ACSDT5Y2023.B17001", "ACSDT5Y2023.B17001-Data.csv"),
    "Median Income": os.path.join(DATA_DIR, "ACSDT5Y2023.B19013", "ACSDT5Y2023.B19013-Data.csv"),
    "SNAP Participation": os.path.join(DATA_DIR, "ACSDT5Y2023.B22010", "ACSDT5Y2023.B22010-Data.csv"),
    "Race Demographics": os.path.join(DATA_DIR, "ACSDT5Y2023.B02001", "ACSDT5Y2023.B02001-Data.csv"),
    "Educational Attainment": os.path.join(DATA_DIR, "ACSDT5Y2023.B15003", "ACSDT5Y2023.B15003-Data.csv")
}

# Function to clean and standardize Census data
def clean_census_data(df, value_column, new_column_name, zip_column="NAME"):
    """Cleans Census data by removing headers, extracting ZIP codes, and selecting relevant columns."""
    df = df.iloc[1:].reset_index(drop=True)
    df["ZIP Code"] = df[zip_column].str.extract(r'(\d{5})')
    df[new_column_name] = pd.to_numeric(df[value_column], errors="coerce")
    return df[["ZIP Code", new_column_name]]

# Load and clean each dataset
cleaned_data = {
    "Population": clean_census_data(pd.read_csv(census_files["Population"], low_memory=False), "B01003_001E", "Total Population"),
    "Poverty Rate": clean_census_data(pd.read_csv(census_files["Poverty Rate"], low_memory=False), "B17001_002E", "Poverty Count"),
    "Median Income": clean_census_data(pd.read_csv(census_files["Median Income"], low_memory=False), "B19013_001E", "Median Household Income"),
    "SNAP Participation": clean_census_data(pd.read_csv(census_files["SNAP Participation"], low_memory=False), "B22010_002E", "SNAP Households"),
    "White Population": clean_census_data(pd.read_csv(census_files["Race Demographics"], low_memory=False), "B02001_002E", "White Population"),
    "Black Population": clean_census_data(pd.read_csv(census_files["Race Demographics"], low_memory=False), "B02001_003E", "Black Population"),
    "American Indian Population": clean_census_data(pd.read_csv(census_files["Race Demographics"], low_memory=False), "B02001_004E", "American Indian Population"),
    "Asian Population": clean_census_data(pd.read_csv(census_files["Race Demographics"], low_memory=False), "B02001_005E", "Asian Population"),
    "Pacific Islander Population": clean_census_data(pd.read_csv(census_files["Race Demographics"], low_memory=False), "B02001_006E", "Pacific Islander Population"),
    "Other Race Population": clean_census_data(pd.read_csv(census_files["Race Demographics"], low_memory=False), "B02001_007E", "Other Race Population"),
    "Two or More Races Population": clean_census_data(pd.read_csv(census_files["Race Demographics"], low_memory=False), "B02001_008E", "Two or More Races Population"),
    "High School Graduate": clean_census_data(pd.read_csv(census_files["Educational Attainment"], low_memory=False), "B15003_017E", "High School Graduate"),
    "Bachelor's Degree": clean_census_data(pd.read_csv(census_files["Educational Attainment"], low_memory=False), "B15003_022E", "Bachelor's Degree"),
    "Master's Degree": clean_census_data(pd.read_csv(census_files["Educational Attainment"], low_memory=False), "B15003_023E", "Master's Degree"),
    "Doctorate Degree": clean_census_data(pd.read_csv(census_files["Educational Attainment"], low_memory=False), "B15003_025E", "Doctorate Degree")
}

# Merge datasets on ZIP Code
census_merged = cleaned_data["Population"]
for key, df in cleaned_data.items():
    if key != "Population":
        census_merged = census_merged.merge(df, on="ZIP Code", how="left")

# Save cleaned and merged Census data
output_file = os.path.join(DATA_DIR, "cleaned_census_data.csv")
census_merged.to_csv(output_file, index=False)

print("Cleaned Census Data Saved as 'cleaned_census_data.csv'")
print(census_merged.head())
print(census_merged.describe())  # Call the method; without () this prints the bound method, not the summary


✅ Cleaned Census Data Saved as 'cleaned_census_data.csv'
  ZIP Code  Total Population  Poverty Count  Median Household Income  \
0    00601             16721          10199                  18571.0   
1    00602             37510          17504                  21702.0   
2    00603             48317          22683                  19243.0   
3    00606              5435           2984                  20226.0   
4    00610             25413          11145                  23732.0   

   SNAP Households  White Population  Black Population  \
0             3219             13904               314   
1             7138             13781               520   
2            10261             35550              1572   
3             1056              3697                12   
4             4936              6582               525   

   American Indian Population  Asian Population  Pacific Islander Population  \
0                           7                19                            0   
1
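
The merged table stores counts rather than rates, so any rate used later has to be derived. A minimal sketch of one way to do that follows, using two rows copied from the sample output above. The helper name `safe_share` and the `Poverty Share` column are hypothetical, and dividing by total population is only an approximation, since the ACS poverty table has its own universe (people for whom poverty status is determined).

```python
import numpy as np
import pandas as pd

# Two rows copied from the sample output above.
census = pd.DataFrame({
    "ZIP Code": ["00601", "00602"],
    "Total Population": [16721, 37510],
    "Poverty Count": [10199, 17504],
})


def safe_share(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Return numerator / denominator, with NaN where the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)


# Hypothetical derived column; not part of the original notebook output.
census["Poverty Share"] = safe_share(census["Poverty Count"], census["Total Population"])
print(census)
```

Replacing zero denominators with NaN keeps unpopulated ZIP codes from producing infinite values that would distort later correlations.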

## API Access & Credentials

This notebook retrieves **store location data** from Kroger’s API, which requires authentication via **client credentials** stored in an `.env` file.  
Since this notebook was designed for **local execution**, sensitive credentials are not embedded in the script.

### Key Considerations:
- The API credentials are stored in a **local environment file (`.env`)**.
- Authentication uses the **OAuth2 client-credentials grant**: the client ID and secret are exchanged for a short-lived **access token** before any API request is made.
- This notebook **was not refactored** into a modular pipeline since the store dataset was intended to remain static.
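
As a rough illustration of the credential handling described above, the sketch below builds the `Basic` authorization header and shows why the encoded value should never be logged. The helper names `build_basic_auth_header` and `mask` are hypothetical, and the credentials are placeholders.

```python
import base64


def build_basic_auth_header(client_id: str, client_secret: str) -> dict:
    """Build the headers for an OAuth2 client-credentials token request."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    return {
        "Authorization": f"Basic {encoded}",
        "Content-Type": "application/x-www-form-urlencoded",
    }


def mask(value: str, visible: int = 4) -> str:
    """Show only the first few characters of a sensitive value for logging."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


# Placeholder credentials, not real ones.
headers = build_basic_auth_header("demo-id", "demo-secret")
encoded_part = headers["Authorization"].split(" ", 1)[1]

# Base64 is an encoding, not encryption: decoding recovers the secret.
decoded = base64.b64decode(encoded_part).decode("utf-8")

print(mask(encoded_part))
```

Because the encoded string decodes straight back to `client_id:client_secret`, printing it in a notebook output is equivalent to publishing the secret.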

In [None]:
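# Load environment variables from the project-specific .env file
ENV_FILE = "kroger_client_info.env"  # Define .env file path

load_dotenv(ENV_FILE)  # Pass the path explicitly; the default only searches for a file named ".env"
CLIENT_ID = get_key(ENV_FILE, "KROGER_CLIENT_ID")
CLIENT_SECRET = get_key(ENV_FILE, "KROGER_CLIENT_SECRET")

# Encode CLIENT_ID and CLIENT_SECRET in Base64
encoded_auth = base64.b64encode(f"{CLIENT_ID}:{CLIENT_SECRET}".encode()).decode()

# Do not print encoded_auth: Base64 is reversible, so the output would expose the client secret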

# Define API endpoint and headers
TOKEN_URL = "https://api-ce.kroger.com/v1/connect/oauth2/token"
headers = {
    "Authorization": f"Basic {encoded_auth}",
    "Content-Type": "application/x-www-form-urlencoded"
}

# Define request body
data = "grant_type=client_credentials&scope=product.compact"  # Modify scope as needed

# Make POST request to get access token
response = requests.post(TOKEN_URL, headers=headers, data=data)

# Handle response
if response.status_code == 200:
    response_data = response.json()
    access_token = response_data.get("access_token")
    expires_in = response_data.get("expires_in", 1800)  # Default to 1800 seconds if missing

    # Store token and expiration timestamp
    token_expiration_time = datetime.now() + timedelta(seconds=expires_in)
    os.environ["PRODUCT_COMPACT_ACCESS_TOKEN"] = access_token
    os.environ["PRODUCT_COMPACT_ACCESS_TOKEN_EXPIRATION"] = token_expiration_time.isoformat()  # Same format as the .env entry
    
    # Save token details to .env file for persistent storage
    set_key(ENV_FILE, "PRODUCT_COMPACT_ACCESS_TOKEN", access_token)
    set_key(ENV_FILE, "PRODUCT_COMPACT_ACCESS_TOKEN_EXPIRATION", token_expiration_time.isoformat())

    print("Access Token Retrieved and Stored!")
    print(f"🔹 Token: {access_token[:20]}...")  # Only print part of the token for security
    print(f"🔹 Token Expiration Time: {token_expiration_time}")

else:
    print("Failed to retrieve access token")
    print("🔹 Status Code:", response.status_code)
    print("🔹 Response:", response.text)  # .text avoids a second error when the body is not JSON


✅ Access Token Retrieved and Stored!
🔹 Token: eyJhbGciOiJSUzI1NiIs...
🔹 Token Expiration Time: 2025-03-04 17:48:19.488080


## Kroger Location API Integration

### Overview
This module retrieves **store location data** from Kroger’s API using **OAuth2 authentication**. It allows searching for **grocery store locations near a given ZIP code**, providing details such as **store name, address, and distance** from the search point.

### Key Functions
#### `get_kroger_location_token()`
- Manages **OAuth2 token retrieval and refresh** for accessing the Kroger Location API.
- Requests a **new token** if the current one has expired.
- Writes the token and its expiration to the **local environment file** (`kroger_client_info.env`) for reuse.
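
The expiry check behind this behaviour can be sketched as a small pure function. This is an illustration under stated assumptions rather than the notebook's own code: the name `token_is_valid` and the 60-second safety margin are hypothetical additions.

```python
from datetime import datetime, timedelta
from typing import Optional


def token_is_valid(
    token: Optional[str],
    expiration_iso: Optional[str],
    now: Optional[datetime] = None,
    margin_seconds: int = 60,
) -> bool:
    """Return True only if a token exists and will not expire within the margin."""
    if not token or not expiration_iso:
        return False  # Nothing stored yet
    try:
        expires_at = datetime.fromisoformat(expiration_iso)
    except ValueError:
        return False  # Unparseable timestamp: treat as expired
    current = now or datetime.now()
    return current + timedelta(seconds=margin_seconds) < expires_at


# Fixed clock so the example is reproducible.
now = datetime(2025, 3, 4, 17, 0, 0)
print(token_is_valid("abc", "2025-03-04T17:48:19", now))  # still fresh
print(token_is_valid("abc", None, now))                   # no expiration stored
```

Treating a missing or malformed expiration as "expired" means the worst case is one extra token request instead of a crash.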

#### `search_kroger_locations(zip_code, radius, limit)`
- Searches for **Kroger stores near a given ZIP code**.
- Allows customization of the **search radius (default: 20 miles)** and the **number of results (default: 5)**.
- Catches **request errors**, prints the failure, and returns `None` so that one failed ZIP code does not stop the run.
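
The search function makes a single attempt per ZIP code. If transient failures were a concern, one option would be to wrap the call in a retry with exponential backoff. The sketch below is not part of the original notebook: `call_with_retries` is a hypothetical helper, and it assumes the wrapped function signals failure by returning `None`.

```python
import time
from typing import Callable, List, Optional


def call_with_retries(
    fetch: Callable[[], Optional[dict]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns a non-None result or attempts run out."""
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Wait 1x, 2x, 4x ... the base delay between attempts.
            sleep(base_delay * (2 ** attempt))
    return None


# Demo with a stand-in fetcher that fails twice, then succeeds.
responses = iter([None, None, {"data": []}])
delays: List[float] = []  # Record delays instead of actually sleeping
result = call_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

In the notebook this could be used as `call_with_retries(lambda: search_kroger_locations(zip_code, radius, limit=100))`, leaving the search function itself unchanged.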

### Purpose
This function was used for **one-time store location extraction** but could be **adapted for periodic updates** to track **new store openings and closures**.


In [None]:
TOKEN_URL = "https://api-ce.kroger.com/v1/connect/oauth2/token"

def get_kroger_location_token():
    """Retrieve or refresh the Kroger Location API access token."""
    token = os.getenv("LOCATION_ACCESS_TOKEN")
    expiration = os.getenv("LOCATION_ACCESS_TOKEN_EXPIRATION")  # Same key that is written below

    # If the token or its expiration is missing, or the token has expired, request a new one
    if not token or not expiration or datetime.now() >= datetime.fromisoformat(expiration):
        print("Location Token expired or missing, requesting a new one...")

        encoded_auth = base64.b64encode(f"{CLIENT_ID}:{CLIENT_SECRET}".encode()).decode()

        headers = {
            "Authorization": f"Basic {encoded_auth}",
            "Content-Type": "application/x-www-form-urlencoded"
        }

        data = "grant_type=client_credentials"

        response = requests.post(TOKEN_URL, headers=headers, data=data)

        if response.status_code == 200:
            response_data = response.json()
            token = response_data.get("access_token")
            expires_in = response_data.get("expires_in", 1800)
            expiration_time = datetime.now() + timedelta(seconds=expires_in)

            # Store new token under the key names that are read at the top of this function
            os.environ["LOCATION_ACCESS_TOKEN"] = token
            os.environ["LOCATION_ACCESS_TOKEN_EXPIRATION"] = expiration_time.isoformat()

            # Save token details to .env file for persistent storage
            set_key(ENV_FILE, "LOCATION_ACCESS_TOKEN", token)
            set_key(ENV_FILE, "LOCATION_ACCESS_TOKEN_EXPIRATION", expiration_time.isoformat())

            print("New Location Token Retrieved!")
        else:
            # .text avoids a second error when the error body is not JSON
            print("Failed to retrieve location token:", response.status_code, response.text)
            token = None  # Do not return an expired token as if it were valid

    return token

def search_kroger_locations(zip_code, radius=20, limit=5):
    """Search for Kroger store locations near a given ZIP code."""
    token = get_kroger_location_token()  # Get a valid token

    # Define API URL for locations
    LOCATIONS_API_URL = "https://api-ce.kroger.com/v1/locations"

    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json"
    }
    
    params = {
        "filter.zipCode.near": zip_code,  # Search by ZIP code
        "filter.radiusInMiles": radius, # Set search radius
        "filter.limit": limit  # Limit number of results
    }


    try:
        response = requests.get(LOCATIONS_API_URL, headers=headers, params=params, timeout=30)  # Timeout keeps a stalled request from hanging the loop
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching locations for ZIP {zip_code}: {e}")
        return None


## ZIP Code-Based Store Location Retrieval

### Overview
This section queries **Kroger’s Location API** to retrieve **store locations by ZIP code**, adjusting the search radius based on each ZIP code's **total population**, which serves here as a rough proxy for population density.

### Key Features
- **Dynamic Search Radius**  
  - High-population ZIP codes (treated as **urban**) → **smaller radius** (15 miles above 100,000 residents).  
  - Low-population ZIP codes (treated as **rural**) → **larger radius** (100 miles at 20,000 residents or fewer).  

- **ZIP Code Processing Logic**  
  - Uses a **pre-filtered list** of ZIP codes from Census data.  
  - **Tracks processed ZIPs** in a file (`processed_zips.txt`) to prevent redundant queries.  
  - **Reduces API requests** by querying every **third ZIP code** (`zip_codes[::3]`).  

- **Data Storage & Deduplication**  
  - Prevents **duplicate store records** using a set of **existing Location IDs**.  
  - **Writes results** to `kroger_locations.csv`, which accumulates stores across batched runs.  
  - Displays a **real-time progress tracker** to monitor request completion.
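
The deduplication step above can be sketched in isolation as follows. The payload is a hypothetical example shaped like the fields the notebook reads (`locationId`, `name`, `address`), the helper names are assumptions, and the search ZIP is illustrative.

```python
from typing import Optional


def flatten_store(store: dict, search_zip: str) -> dict:
    """Flatten one API store record into a single CSV-ready row."""
    address = store.get("address") or {}
    return {
        "Location ID": str(store["locationId"]),
        "Store Name": store.get("name", "unknown"),
        "City": address.get("city", "Unknown"),
        "State": address.get("state", "Unknown"),
        "ZIP Code": search_zip,  # The ZIP that was searched
    }


def add_new_stores(results: Optional[dict], search_zip: str, seen_ids: set, rows: list) -> int:
    """Append stores not seen before; return how many were added."""
    added = 0
    for store in (results or {}).get("data", []):
        location_id = str(store["locationId"])
        if location_id in seen_ids:
            continue  # Already captured by an earlier, overlapping search
        seen_ids.add(location_id)
        rows.append(flatten_store(store, search_zip))
        added += 1
    return added


# Hypothetical payload; the first store is already known.
sample = {"data": [
    {"locationId": "09700336", "name": "Harris Teeter - Long Neck",
     "address": {"city": "Millsboro", "state": "DE"}},
    {"locationId": "09700327", "name": "Harris Teeter - Bayside",
     "address": {"city": "Selbyville", "state": "DE"}},
]}
seen_ids = {"09700336"}
rows: list = []
added = add_new_stores(sample, "19966", seen_ids, rows)
print(added, [row["Location ID"] for row in rows])
```

Comparing IDs as strings is deliberate: location IDs such as `09700336` lose their leading zero if they are ever parsed as integers, which would defeat the set lookup.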

### Purpose
This method provides **targeted store location retrieval** while **minimizing API calls** and avoiding redundant queries.  
Although designed for **one-time execution**, it could be adapted for **periodic updates** to track **store openings and closures**.
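
To make the batching behaviour concrete, the sketch below shows how every-third sampling and the processed-ZIP checkpoint interact across two runs. It uses a temporary directory and a made-up ZIP range. The helper names (`load_processed`, `mark_processed`, `next_batch`) are hypothetical.

```python
import os
import tempfile


def load_processed(path: str) -> set:
    """Read the checkpoint file into a set; a missing file means nothing is done."""
    if not os.path.exists(path):
        return set()
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def mark_processed(path: str, zip_code: str) -> None:
    """Append one ZIP code to the checkpoint file."""
    with open(path, "a") as f:
        f.write(zip_code + "\n")


def next_batch(zip_codes: list, processed: set, step: int = 3, batch_size: int = 2000) -> list:
    """Take every `step`-th ZIP, drop processed ones, and cap the batch size."""
    remaining = [z for z in zip_codes[::step] if z not in processed]
    return remaining[:batch_size]


# Made-up ZIP range 00601..00612, kept as zero-padded strings.
zip_codes = [f"{n:05d}" for n in range(601, 613)]

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "processed_zips.txt")

    first = next_batch(zip_codes, load_processed(checkpoint), batch_size=2)
    for z in first:
        mark_processed(checkpoint, z)

    # A second run picks up where the first one stopped.
    second = next_batch(zip_codes, load_processed(checkpoint), batch_size=2)

print(first, second)
```

Writing each ZIP to the checkpoint as soon as it is handled means an interrupted run loses at most the request in flight.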


In [None]:
# Load census data from where the cleaning cell saved it, then map ZIP -> population and list the ZIP codes
census_df = pd.read_csv(os.path.join(DATA_DIR, "cleaned_census_data.csv"), dtype={"ZIP Code": str})
zip_pop_map = dict(zip(census_df["ZIP Code"], census_df["Total Population"].fillna(0)))
zip_codes = list(zip_pop_map.keys())

# Function to set dynamic search radius
def get_dynamic_radius(population):
    """Set search radius from ZIP code population, used as a rough proxy for density."""
    if population > 100000:
        return 15  # Dense urban areas
    elif population > 50000:
        return 25  # Urban areas
    elif population > 20000:
        return 50  # Suburban areas
    else:
        return 100  # Rural areas

# Load processed ZIPs from a file
processed_zip_file = "processed_zips.txt"

# Ensure the file exists
if not os.path.exists(processed_zip_file):
    open(processed_zip_file, "w").close()  # Creates an empty file

# Load processed ZIPs from the file
with open(processed_zip_file, "r") as f:
    processed_zip_codes = set(f.read().splitlines())

# Generate condensed ZIP list and remove processed ZIPs
reduced_zip_list = [zip_code for zip_code in zip_codes[::3] if zip_code not in processed_zip_codes]

# Batches each run to a smaller subset
condensed_zip_list = reduced_zip_list[:2000]

# Initialize storage elements
stores = []
existing_loc_ids = set()
request_count = 0
total_requests = len(condensed_zip_list)

stores_file = "kroger_locations.csv"
if os.path.exists(stores_file) and os.path.getsize(stores_file) > 0:
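    # Read ID columns as strings so leading zeros survive and match the API's IDs
    existing_stores_df = pd.read_csv(stores_file, dtype={"ZIP Code": str, "Location ID": str, "Store Number": str})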
    stores.extend(existing_stores_df.to_dict(orient="records"))
    existing_loc_ids.update(existing_stores_df["Location ID"].astype(str).tolist()) # prevent duplicate records
else:
    print("No existing stores file found or the file is empty.")
    
# Main loop for processing ZIP Code searches
for i, zip_code in enumerate(condensed_zip_list, start=1):
    if zip_code in processed_zip_codes:
        print(f"Skipping {zip_code}, already processed.")
        continue  # Skip already processed ZIPs
    
    if len(processed_zip_codes) >= len(zip_codes):
        print("All ZIP codes processed. Exiting loop.")
        break
    
    try:
        population = zip_pop_map.get(zip_code, 0)
        radius = get_dynamic_radius(population)  # Adjust the search radius based on the ZIP code's population
        
        location_results = search_kroger_locations(zip_code, radius, limit = 100)
        request_count += 1
        
        if location_results and "data"  in location_results:
            for store in location_results["data"]:
                location_id = store["locationId"]
                if location_id not in existing_loc_ids:
                    stores.append({
                        "Location ID": location_id,
                        "Store Name": store.get("name", "unknown"),
                        "Store Number": store.get("storeNumber", "unknown"),
                        "Chain Name": store.get("chain", "unknown"),
                        "Division Number": store.get("divisionNumber", "unknown"),
                        "Address": store.get("address", {}).get("addressLine1", "Unknown"),
                        "City": store.get("address", {}).get("city", "Unknown"),
                        "State": store.get("address", {}).get("state", "Unknown"),
                        # Note: this is the ZIP used for the search, which may differ from the store's own ZIP
                        "ZIP Code": zip_code})
                    existing_loc_ids.add(location_id)
        
        processed_zip_codes.add(zip_code)
        with open(processed_zip_file, "a") as f:
            f.write(zip_code + "\n")
        
        # Mental sanity feature so I don't constantly wonder how many records have been  processed    
        sys.stdout.write(f"\rProgress: {i}/{total_requests} ZIP codes processed")
        sys.stdout.flush()
    
    except Exception as e:
        print(f"Error processing ZIP {zip_code}: {e}")
    
    time.sleep(1) # prevent rate limiting


# `stores` already includes the rows loaded from the existing file, so overwrite
# the file with the combined list; appending would duplicate the earlier rows
stores_df = pd.DataFrame(stores)
if not stores_df.empty:
    stores_df.to_csv(stores_file, mode="w", index=False)
    print(f"\nTest Kroger store locations saved as '{stores_file}'.")
else:
    print("\nNo store data found. Skipping file update.")

print(stores_df.head())  # stores_df is always defined, even when empty



🔄 Location Token expired or missing, requesting a new one...
✅ New Location Token Retrieved!
🔄 Progress: 132/132 ZIP codes processed
✅ Test Kroger store locations saved as 'kroger_locations.csv'.
  Location ID                                    Store Name Store Number  \
0    542FC805          Harris Teeter - - Philadelphia Spoke        FC805   
1    09700336                     Harris Teeter - Long Neck        00336   
2    09700327                       Harris Teeter - Bayside        00327   
3    09700380            Harris Teeter - Easton Marketplace        00380   
4    09700392  Harris Teeter - The Shops at Canton Crossing        00392   

  Chain Name  Division Number               Address          City State  \
0       HART              542    7750 Essington Ave  Philadelphia    PA   
1       HART               97     26370 Bay Farm Rd     Millsboro    DE   
2       HART               97  31221 Americana Pkwy    Selbyville    DE   
3       HART               97    28528 Marlboro

## Final Thoughts

### Summary
- **Successfully acquired and cleaned location and census data**.
- **Ensured ZIP codes and population data were properly formatted**.
- **Stored the dataset for integration with pricing and correlation analysis**.

### Next Steps
- **Consider automating store location updates on a scheduled basis**.
- **If new ACS data is available, an updated census pull could improve analysis**.
