# DSCI 511: Data Acquisition and Pre-Processing <br> Term Project Phase 1: Scoping a data set

## Group 10
### Team Members: 
#### Ian Auger

My name is Ian Auger, and I am the sole member of this team. My background is deeply rooted in the food industry, where I spent nearly a decade working in and managing restaurants. After transitioning out of the restaurant space, I spent the past six years working in food technology companies, initially in Operations and Strategy, and more recently as a Data Analyst.

##### Background
My expertise lies in relational databases, query languages, data analysis, and data visualization, as these have been core responsibilities in my role for the past four years. While I am growing more proficient in object-oriented programming languages like Python and Java, my strongest skill is SQL.

From an industry perspective, I have worked with both D2C e-commerce companies and food delivery platforms, giving me exposure to a variety of operational and product challenges. This experience has fostered a strong sense of adaptability and problem-solving.

##### Personal Goals for This Project
Moving forward, I want to strengthen my proficiency in Python, as it is becoming increasingly critical in my career. Specifically, I aim to:

- Improve my Python skills for data acquisition, processing and cleaning.
- Develop a stronger understanding of APIs and how to integrate them effectively.
- Work with large datasets, learning to process and analyze them efficiently.

These are foundational skills that I recognize as gaps in my current expertise, and I look forward to addressing them through this project.

## Geographic Indexing of Grocery Prices & Socioeconomic Data

### Objective
This project aims to acquire, integrate, and preprocess datasets that enable an analysis of grocery product pricing, product availability, and socioeconomic conditions. The goal is to construct a clean, structured dataset that facilitates future analyses of pricing disparities and access to affordable groceries across different communities.

### Data Acquisition & Integration
#### 1️⃣ Grocery Product Pricing & Availability (Kroger)
##### **Data Needed:**
- Product prices at various Kroger store locations.
- Availability of different product categories (e.g., fresh produce, dairy, packaged goods).
- Store locations (latitude/longitude or ZIP code).

##### **Acquisition Methods:**
- **Kroger Public APIs**
  - **Location API**: Provides comprehensive data on the network of Kroger stores
  - **Product API**: Supplies location-specific product information
  
##### **Preprocessing Tasks:**
- Standardizing price data (unit price normalization).
- Handling missing or inconsistent pricing values.
- Assigning geographic identifiers (geocoding store locations to ZIP codes/census tracts).
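
The unit price normalization task above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names, the regular expression, and the unit alias table are all assumptions, and real size strings will likely need more cases (multi-packs, fractional sizes, weight vs. volume).

```python
import re
from typing import Optional, Tuple

# Hypothetical alias table: maps size-string suffixes to a canonical unit.
UNIT_ALIASES = {
    "ct": "ct", "count": "ct",
    "oz": "oz", "fl oz": "fl oz",
    "lb": "lb", "lbs": "lb",
    "gal": "gal",
}

# A number, optional whitespace, then a unit made of letters and spaces.
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z ]+?)\s*$", re.IGNORECASE)


def parse_size(size: str) -> Optional[Tuple[float, str]]:
    """Split a size string like '12 ct' or '1gal' into (quantity, unit)."""
    match = SIZE_PATTERN.match(size or "")
    if not match:
        return None
    quantity = float(match.group(1))
    unit = UNIT_ALIASES.get(match.group(2).lower())
    if unit is None or quantity <= 0:
        return None
    return quantity, unit


def unit_price(price: Optional[float], size: str) -> Optional[Tuple[float, str]]:
    """Return (price per unit, unit), or None when price or size is unusable."""
    parsed = parse_size(size)
    if parsed is None or price is None or price <= 0:
        return None
    quantity, unit = parsed
    return round(price / quantity, 4), unit
```

Returning `None` for unparseable sizes or non-positive prices keeps missing and inconsistent values explicit, so they can be counted and reviewed rather than silently dropped.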

#### 2️⃣ Demographic & Economic Data (U.S. Census 2023)
##### **Data Needed:**
- **Income Levels**: Median household income, poverty rate.
- **Population Density**: Urban vs. rural classification.
- **Ethnic Composition**: Racial and ethnic demographics.
- **Education Levels**: Percentage of the population with a college degree.
- **Employment Statistics**: Unemployment rate, occupation distribution.
- **Housing & Cost of Living**: Median home value, rent prices.

##### **Acquisition Methods:**
- **U.S. Census Bureau API** ([data.census.gov](https://data.census.gov))
  - American Community Survey (ACS) 1-Year Estimates
    - Demographic and economic data summarizing individual American metropolitan and micropolitan areas

##### **Preprocessing Tasks:**
- Mapping census data to store locations.
- Handling missing or aggregated demographic values.
- Normalizing data across different geographic resolutions (ZIP code vs. county level).
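
As a rough sketch of the mapping step, the snippet below left-joins census rows onto store rows using a normalized five-digit ZIP key. It assumes both tables carry a ZIP-level key; the census column name `zip`, the `census_` prefix, and the function names are hypothetical. The `zipCode` key mirrors the store address field in the Location API response.

```python
from typing import Dict, Iterable, List


def normalize_zip(raw) -> str:
    """Coerce a ZIP value to a 5-character, zero-padded string (2139 -> '02139')."""
    # Drop a ZIP+4 suffix ('45202-1234') or a float artifact ('45202.0').
    text = str(raw).strip().split("-")[0].split(".")[0]
    digits = "".join(ch for ch in text if ch.isdigit())
    return digits.zfill(5)[:5] if digits else ""


def attach_census(stores: Iterable[dict], census_rows: Iterable[dict]) -> List[dict]:
    """Left-join census fields onto store rows using the normalized ZIP code."""
    census_by_zip: Dict[str, dict] = {
        normalize_zip(row["zip"]): row for row in census_rows
    }
    merged = []
    for store in stores:
        key = normalize_zip(store["zipCode"])
        census = census_by_zip.get(key, {})
        # Prefix census columns so they cannot collide with store columns.
        extra = {f"census_{name}": value for name, value in census.items() if name != "zip"}
        merged.append({**store, "zipCode": key, **extra, "census_matched": bool(census)})
    return merged
```

The `census_matched` flag makes it easy to see which stores found no census row, which is where differences in geographic resolution would first surface.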

#### 3️⃣ Additional Socioeconomic Factors: SNAP Participation & Food Assistance
While not a requirement for this exercise, incorporating SNAP participation data per store and the rate of SNAP participation per household in each geographic area could enhance the dataset. 

Since household income data is available via census sources, SNAP participation rates could serve as a complementary indicator of a community’s relative wealth. Analyzing food pricing in these communities could help determine whether grocery chains adjust prices based on local economic conditions or maintain a uniform pricing model across different income brackets.


## Data Overview

### **Purpose of This Data**
 
The goal of assembling this dataset is to analyze how a national retailer like Kroger adjusts product pricing across diverse markets, considering their unique demographic and economic characteristics. Specifically, this project seeks to identify whether there is a correlation between regional market profiles and pricing strategies, including variations in product category diversity and affordability.

By leveraging data from Kroger’s extensive store network, which spans a wide range of metropolitan and suburban areas, this analysis aims to provide insights into potential pricing disparities. Understanding these patterns could help answer key questions, such as:

- Do lower-income areas experience proportionally lower grocery prices relative to wealthier neighborhoods?
- Are product availability and category diversity consistent across different economic regions, or do they vary based on market demand and demographic features?
- Does Kroger employ a dynamic pricing strategy, adjusting for local economic conditions, or is pricing relatively uniform nationwide?
- Does Kroger employ different pricing strategies for each chain and/or division, or is pricing consistent across all stores?

Given the heightened public attention on grocery pricing and affordability, this study seeks to contribute a data-driven perspective to a discussion often dominated by subjective viewpoints. By structuring and analyzing this dataset, we aim to provide a clearer, evidence-based understanding of pricing trends and their potential implications for consumers.  

---

### **Intended Audience**  
This dataset is relevant to a broad audience, as grocery pricing directly affects nearly every American household. In recent years, price volatility has increased due to a range of factors, including large-scale weather events intensified by climate change, mass food recalls, and livestock disease outbreaks. These disruptions have led to fluctuating prices and inconsistent availability of staple commodities, challenging consumer expectations of affordability and accessibility.  

Understanding these trends is essential for multiple stakeholders:  

- **Consumers & Advocacy Groups:** Many households are struggling with the rising cost of food, and consumers are becoming more conscious of regional disparities in pricing and product availability. This analysis could provide transparency into how pricing strategies vary across different economic and demographic landscapes, helping consumers make more informed decisions.  

- **Policy Makers & Economists:** With food insecurity and affordability becoming critical policy issues, government agencies and economists can use this data to better understand pricing structures in different communities. Insights from this study could inform policies aimed at promoting fair access to groceries and reducing price disparities between high- and low-income areas.  

- **Retail & Supply Chain Analysts:** Industry professionals in grocery retail, supply chain management, and food distribution can leverage this data to assess how pricing strategies respond to regional demand, supply disruptions, and economic conditions. This could help inform business decisions regarding localized pricing, inventory management, and product assortment strategies.  

- **Climate & Agricultural Researchers:** Given that climate change is increasingly affecting agricultural output and food supply chains, researchers studying the intersection of environmental factors and food pricing could use this dataset to analyze how extreme weather events correlate with price fluctuations at a granular level.  

By examining the pricing strategies of one of the largest grocery retailers in the U.S., this study seeks to provide actionable insights that go beyond anecdotal evidence. Whether for individual consumers, policymakers, businesses, or researchers, this data has the potential to contribute to a clearer, more evidence-based understanding of the economic and social factors shaping the cost of food.  

---

## **Study Limitations**

This study focuses exclusively on pricing dynamics within Kroger’s network of stores due to data availability. While Kroger holds a significant market presence as the second-largest grocery retailer in the United States, its roughly 10% market share means it is far from representative of the entire national grocery landscape.

Additionally, this study does not account for alternative grocery retailers that cater to different consumer segments, such as:

- **Budget-focused retailers** (e.g., Dollar General, Aldi) that often offer lower-cost alternatives.
- **Membership-based wholesale clubs** (e.g., Costco, Sam’s Club) where pricing structures differ due to bulk purchasing models.
- **Regional grocery chains** that may follow distinct pricing strategies influenced by local market conditions.

Despite these limitations, this analysis remains valuable in contributing to a broader discussion on grocery pricing, cost of living, and affordability. However, it is essential to recognize that Kroger's pricing strategies may not reflect the experience of all consumers nationwide. Future studies incorporating data from multiple retailers could provide a more comprehensive view of pricing dynamics across the grocery industry.


## Data Creation, Sources and Access

### **Data Sources and Creation**
This study will rely on two primary data sources:

#### **U.S. Census Data**
The first data source comes from the U.S. Census Bureau, which collects and publishes demographic and economic data annually. Specifically, this study will utilize data from the American Community Survey (ACS), which provides detailed insights into population size, demographic composition, income levels, and other economic conditions across American communities. These data points will help contextualize grocery pricing trends within different socioeconomic environments.

#### **Kroger Retail Data**
The second source of data is Kroger's public APIs, which expose detailed information about products sold in its stores. This dataset includes:

- Product pricing (regular and promotional prices)
- Item configurations (e.g., unit size, packaging)
- Availability (stock levels per store)
- Geographic store data (store locations, ZIP codes, and store attributes)

By combining these two datasets, this study aims to analyze how grocery pricing interacts with local economic and demographic conditions, helping to identify potential patterns and disparities in food accessibility and affordability. 

## **Data Access**

### **Kroger Public APIs**

To analyze grocery product pricing and availability across different locations, this project will leverage Kroger’s public APIs. These APIs provide access to store locations, product details, and pricing information. The integration process involves authentication, data retrieval, and preprocessing steps to structure the data for analysis.

### **API Endpoints & Data Retrieval**
#### 1️⃣ **Kroger Location API**
##### **Purpose:**
- Fetch the list of Kroger store locations, including geographic coordinates (latitude/longitude) and associated ZIP codes.

##### **Access Method:**
- Use API authentication to request store details.
- Query stores within specific geographic regions based on ZIP code or radius.
  - Leverage census data to build the list of ZIP codes to analyze, then iterate over that list and associate every store within range of each ZIP code with the matching census area.
  - Omit ZIP codes from the final dataset that do not contain a store within the Kroger network.
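
The ZIP code iteration described above might look like the sketch below. The base URL and the query parameter names (`filter.zipCode.near`, `filter.radiusInMiles`) are assumptions that should be verified against the Kroger developer documentation, and `fetch_locations` is a hypothetical stand-in for the project's authenticated Location API function, assumed to return the list under the response's `data` key.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlencode

# Assumed endpoint; confirm against the Kroger developer documentation.
LOCATIONS_URL = "https://api.kroger.com/v1/locations"


def build_locations_url(zip_code: str, radius_miles: int = 10) -> str:
    """Build a Location API query URL for stores near one ZIP code."""
    params = {"filter.zipCode.near": zip_code, "filter.radiusInMiles": radius_miles}
    return f"{LOCATIONS_URL}?{urlencode(params)}"


def collect_stores(
    zip_codes: Iterable[str],
    fetch_locations: Callable[[str], List[dict]],
) -> Dict[str, dict]:
    """Query each ZIP code and de-duplicate the results by locationId."""
    stores: Dict[str, dict] = {}
    for zip_code in zip_codes:
        for store in fetch_locations(build_locations_url(zip_code)):
            # A store near several ZIP codes is kept once, with every ZIP that matched it.
            entry = stores.setdefault(store["locationId"], {**store, "matchedZips": []})
            entry["matchedZips"].append(zip_code)
    return stores
```

ZIP codes that return no stores add nothing to the result, so they drop out of the final dataset without a separate filtering pass.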


##### **Key Data Fields:**
- `locationId`: Unique identifier for each store.
- `name`: Common name.
- `storeNumber`: Internal company identifier.
- `divisionNumber`: Division identifier.
- `chain`: Store branding (e.g., Kroger, Ralphs, etc.).
- `latitude`, `longitude`: Geographic coordinates.
- `address`: `addressLine1`, `city`, `state`, `zipCode`
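
To illustrate how the key fields above could be pulled out of the nested response, here is a small sketch that flattens one store object into a single row. The function name is hypothetical; the nested keys follow the sample Location API response included at the end of this document.

```python
def flatten_location(store: dict) -> dict:
    """Flatten one Location API store object into a flat row of key fields."""
    address = store.get("address") or {}
    geo = store.get("geolocation") or {}
    return {
        "locationId": store.get("locationId"),
        "name": store.get("name"),
        "storeNumber": store.get("storeNumber"),
        "divisionNumber": store.get("divisionNumber"),
        "chain": store.get("chain"),
        "latitude": geo.get("latitude"),
        "longitude": geo.get("longitude"),
        "addressLine1": address.get("addressLine1"),
        "city": address.get("city"),
        "state": address.get("state"),
        "zipCode": address.get("zipCode"),
    }
```

Using `.get` throughout means a store with a missing block yields `None` values instead of raising, which keeps a long collection run from failing on one incomplete record.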

---

#### 2️⃣ **Kroger Product API**
##### **Purpose:**
- Retrieve location-specific product details, including pricing and availability.

##### **Access Method:**
- Query products by `locationId` to get pricing and availability.
- Filter by product category (e.g., fresh produce, dairy, packaged goods).

##### **Key Data Fields:**
- `productId`: Unique internal product identifier.
- `upc`: Universal Product Code.
- `description`: Product name/label.
- `brand`: Item brand.
- `categories`: Grocery categories.
- `items.price.regular`: Regular retail price.
- `items.price.promo`: Discounted promotional price (if available).
- `items.size`: Size configuration (e.g., 12ct, 1gal).
- `items.inventory.stockLevel`: Stock status per store.
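
A possible way to turn one product object into tabular rows is sketched below, producing one row per entry in `items`. The function name and output column names are hypothetical, and treating a promo price of `0` as "no promotion" is an assumption based on the sample response later in this document.

```python
from typing import List


def flatten_product(product: dict, location_id: str) -> List[dict]:
    """Return one flat row per item in a Product API product object."""
    rows = []
    for item in product.get("items") or []:
        price = item.get("price") or {}
        promo = price.get("promo")
        rows.append({
            "locationId": location_id,
            "productId": product.get("productId"),
            "upc": product.get("upc"),
            "description": product.get("description"),
            "brand": product.get("brand"),
            # Categories arrive as a list; join them so the row stays flat.
            "categories": "|".join(product.get("categories") or []),
            "regularPrice": price.get("regular"),
            # Assumption: promo == 0 means no active promotion.
            "promoPrice": promo if promo else None,
            "size": item.get("size"),
            "stockLevel": (item.get("inventory") or {}).get("stockLevel"),
        })
    return rows
```

Carrying `locationId` on every row matters because price and stock level are location-specific, so the same `productId` will appear once per store.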

### U.S. Census Data

Because API access to Census data, and to government data generally, is currently unpredictable, I will download the raw census data manually and store it locally in CSV files. This data will need to be cleaned, but storing it locally ensures stable availability.
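
A minimal sketch of loading and cleaning such a local CSV file follows. The set of placeholder strings treated as missing and the column names passed in are assumptions; they should be adjusted to match the files actually downloaded.

```python
import csv
from typing import Iterable, List, Optional

# Assumed placeholder strings for suppressed or unavailable estimates.
MISSING_MARKERS = {"", "-", "N", "(X)", "**", "null"}


def to_number(value: Optional[str]) -> Optional[float]:
    """Convert a census cell such as '61,250' to a float, or None if missing."""
    text = (value or "").strip().replace(",", "")
    if text in MISSING_MARKERS:
        return None
    # Open-ended values such as '250,000+' keep only their numeric bound.
    text = text.rstrip("+")
    try:
        return float(text)
    except ValueError:
        return None


def load_census_rows(lines: Iterable[str], numeric_columns: List[str]) -> List[dict]:
    """Parse CSV lines (an open file works) and clean the named numeric columns."""
    rows = []
    for row in csv.DictReader(lines):
        for column in numeric_columns:
            row[column] = to_number(row.get(column))
        rows.append(row)
    return rows
```

In use, the function would be called with an open file handle, for example `load_census_rows(open(path, newline=""), ["median_income"])`, where `path` and `median_income` are placeholders. Geographic identifiers are left as strings so that leading zeros survive.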

---

## Current Progress
### **Authentication & Request Workflow**
1. **Obtain API Credentials**
   - Register for a Kroger Developer Account to obtain API keys.
   - Use OAuth 2.0 authentication to generate access tokens.
   - Create individual functions for generating access tokens per scope parameter, and establish secure, persistent environment storage for all tokens, keys, and key expiration data.
      - These functions check key expiration and programmatically generate new keys whenever required before making a call to an API endpoint.

2. **Fetch Store Locations**
   - Call the Location API using geographic filters (ZIP code, radius).
   - Determine the JSON schema for the data and identify key fields of interest for the dataset.

3. **Retrieve Product & Pricing Data**
   - Query the Product API by `locationId`.
   - Determine the JSON schema for the data and identify key fields of interest for the dataset.
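
The token expiration handling in step 1 can be illustrated with the sketch below. It is not the project's actual code: `TokenCache` and `request_token` are hypothetical names, the cache lives in memory rather than in persistent environment storage, and `request_token` is assumed to perform the OAuth 2.0 request and return the standard token response fields `access_token` and `expires_in`.

```python
import time
from typing import Callable, Dict, Tuple


class TokenCache:
    """Cache one access token per scope and refresh it shortly before expiry."""

    def __init__(
        self,
        request_token: Callable[[str], dict],
        clock: Callable[[], float] = time.time,
        margin_seconds: float = 60.0,
    ) -> None:
        self._request_token = request_token
        self._clock = clock
        self._margin = margin_seconds
        self._tokens: Dict[str, Tuple[str, float]] = {}  # scope -> (token, expires_at)

    def get(self, scope: str) -> str:
        """Return a valid token for the scope, requesting a new one if needed."""
        cached = self._tokens.get(scope)
        if cached and self._clock() < cached[1] - self._margin:
            return cached[0]
        payload = self._request_token(scope)
        expires_at = self._clock() + float(payload["expires_in"])
        self._tokens[scope] = (payload["access_token"], expires_at)
        return payload["access_token"]
```

Calling `cache.get(scope)` before every API request means callers never deal with expiration themselves. The safety margin avoids sending a token that expires mid-request, and the injectable `clock` makes the refresh logic testable without waiting.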

### **Next Steps**
1. **Gather Census Data**
   - Review documentation for ACS 1-year tabular data and determine relevant fields of interest.
   - Fetch census data from census.gov.
   - Store census data as .csv file(s) in the local project folder for access by the notebook.
   - Review and clean census data as needed.

2. **Fetch Complete Store List**
   - Use cleaned census data to assemble a list of ZIP codes, and iterate through the list with the Location API function to generate a complete dataset of all stores.
   - Associate all stores with census tracts.

3. **Build Comprehensive Product Dataset**
   - Use the assembled store list in conjunction with the Product API function to generate a Product dataset and assign individual products to each store.
      - This will be limited in size to either a product category like Dairy or a single item type such as Eggs.
   - Create a process for recognizing new products and appending them to the dataset for monitoring.

4. **Create Product History Dataset**
   - Use the Product API function to call the endpoints daily and generate log entries for each product we are capturing, including price, promos, availability, and any other mutable field.
      - Store the time of the API call and use that as the datetime for the log entry to support time-series analysis in the future.
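
The history log in step 4 could be sketched as follows. The column list, function names, and CSV storage format are illustrative assumptions; the input rows are assumed to be flat dictionaries holding the mutable fields of interest.

```python
import csv
import os
from datetime import datetime, timezone
from typing import Iterable, Optional

# Hypothetical log schema: capture time, identifiers, and mutable fields.
LOG_FIELDS = ["capturedAt", "locationId", "productId",
              "regularPrice", "promoPrice", "stockLevel"]


def make_log_entry(row: dict, captured_at: Optional[datetime] = None) -> dict:
    """Build one log entry stamped with the time of the API call, in UTC."""
    captured_at = captured_at or datetime.now(timezone.utc)
    entry = {field: row.get(field) for field in LOG_FIELDS}
    entry["capturedAt"] = captured_at.isoformat()
    return entry


def append_log(path: str, entries: Iterable[dict]) -> None:
    """Append entries to a CSV log, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(entries)
```

Recording the timestamp in UTC and ISO 8601 format keeps entries sortable and unambiguous across time zones. Passing one shared `captured_at` for every row from the same API call keeps a daily snapshot aligned.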

## Data Availability

The dataset generated for this study will be sourced exclusively from public APIs, ensuring that there are no restrictions on making it publicly available. Sharing this dataset could provide value to other researchers, analysts, or anyone interested in grocery pricing trends. Additionally, making it accessible aligns with open data principles and contributes to transparency in food pricing research.

Beyond the dataset itself, the code used to generate it will also be made publicly available. This serves multiple purposes:

- Reproducibility: Allowing others to replicate the data collection process in their own environments.
- Educational Value: Demonstrating proficiency in API access, data extraction, and structuring raw data into a reusable format.
- Portfolio Enhancement: Showcasing technical skills in data acquisition and preprocessing on platforms like GitHub.

By ensuring both the dataset and code are openly accessible, this project can serve as a useful resource for others while also reinforcing best practices in data transparency and reproducibility.

## Sample Kroger Product Data

Single object returned from the Kroger Product API to demonstrate access and schema. 

```python
{'data': [{'aisleLocations': [{'bayNumber': '6',
                               'description': 'AISLE 20',
                               'number': '20',
                               'numberOfFacings': '3',
                               'shelfNumber': '3',
                               'shelfPositionInBay': '1',
                               'side': 'R'}],
           'brand': 'Simple Truth',
           'categories': ['Breakfast', 'Dairy'],
           'description': 'Simple Truth™ Natural Cage Free Grade A Large Brown '
                          'Eggs',
           'images': [{'perspective': 'back',
                       'sizes': [{'size': 'xlarge',
                                  'url': 'https://www.kroger.com/product/images/xlarge/back/0001111079770'},
                                 {'size': 'large',
                                  'url': 'https://www.kroger.com/product/images/large/back/0001111079770'},
                                 {'size': 'medium',
                                  'url': 'https://www.kroger.com/product/images/medium/back/0001111079770'},
                                 {'size': 'small',
                                  'url': 'https://www.kroger.com/product/images/small/back/0001111079770'},
                                 {'size': 'thumbnail',
                                  'url': 'https://www.kroger.com/product/images/thumbnail/back/0001111079770'}]},
                      {'featured': True,
                       'perspective': 'front',
                       'sizes': [{'size': 'xlarge',
                                  'url': 'https://www.kroger.com/product/images/xlarge/front/0001111079770'},
                                 {'size': 'large',
                                  'url': 'https://www.kroger.com/product/images/large/front/0001111079770'},
                                 {'size': 'medium',
                                  'url': 'https://www.kroger.com/product/images/medium/front/0001111079770'},
                                 {'size': 'small',
                                  'url': 'https://www.kroger.com/product/images/small/front/0001111079770'},
                                 {'size': 'thumbnail',
                                  'url': 'https://www.kroger.com/product/images/thumbnail/front/0001111079770'}]}],
           'itemInformation': {'depth': '4.0',
                               'height': '2.75',
                               'width': '11.5'},
           'items': [{'favorite': False,
                      'fulfillment': {'curbside': True,
                                      'delivery': True,
                                      'inStore': True,
                                      'shipToHome': False},
                      'inventory': {'stockLevel': 'HIGH'},
                      'itemId': '0001111079770',
                      'price': {'promo': 0, 'regular': 3.49},
                      'size': '12 ct',
                      'soldBy': 'UNIT'}],
           'productId': '0001111079770',
           'productPageURI': '/p/simple-truth-natural-cage-free-grade-a-large-brown-eggs/0001111079770?cid=dis.api.tpi_products-api_20240521_b:all_c:np_t:dsciproject-24326124',
           'temperature': {'heatSensitive': False, 'indicator': 'Refrigerated'},
           'upc': '0001111079770'}]}
```

## Sample Kroger Location Data

Single object returned from Kroger Location API to demonstrate access and schema.

```python
{'data': [{'address': {'addressLine1': '100 E Court St',
                       'city': 'Cincinnati',
                       'county': 'HAMILTON COUNTY',
                       'state': 'OH',
                       'zipCode': '45202'},
           'chain': 'KROGER',
           'departments': [{'departmentId': '0J', 'name': 'Home Chef'},
                           {'departmentId': '23',
                            'name': 'Drug & General Merchandise'},
                           {'departmentId': '45', 'name': 'Wine'},
                           {'departmentId': '46', 'name': 'Sushi'},
                           {'departmentId': '63', 'name': 'I-wireless'},
                           {'departmentId': 'TL', 'name': 'Tobacco'},
                           {'departmentId': '05', 'name': 'Seafood Department'},
                           {'departmentId': '58',
                            'name': 'Free Wireless Access'},
                           {'departmentId': '99', 'name': 'Check Cashing'},
                           {'departmentId': '13', 'name': 'Cell Phone'},
                           {'departmentId': '76', 'name': 'Rotisserie Chicken'},
                           {'departmentId': 'AF', 'name': 'Account Funding'},
                           {'departmentId': 'LY', 'name': 'Lottery Tickets'},
                           {'departmentId': '0K', 'name': 'Full Strength Beer'},
                           {'departmentId': '40', 'name': 'Coffee Bar'},
                           {'departmentId': '77', 'name': 'Fried Chicken'},
                           {'departmentId': '48', 'name': 'Starbucks'},
                           {'departmentId': '66', 'name': "Murray's Cheese"},
                           {'departmentId': '72', 'name': 'Expresslane'},
                           {'departmentId': '89', 'name': 'Western Union'},
                           {'departmentId': '01', 'name': 'Deli'},
                           {'departmentId': '02', 'name': 'Bakery'},
                           {'departmentId': '08', 'name': 'Floral'},
                           {'departmentId': '12', 'name': 'Cosmetics'},
                           {'departmentId': '30', 'name': 'Salad Bar'},
                           {'departmentId': '88', 'name': 'Perishables'},
                           {'departmentId': 'SB', 'name': 'Sports Betting'},
                           {'departmentId': 'MO', 'name': 'Money Orders'},
                           {'departmentId': '04', 'name': 'Meat Department'},
                           {'departmentId': '0G', 'name': 'Meal Kits'},
                           {'departmentId': '21', 'name': 'Self Checkout'},
                           {'departmentId': '24',
                            'name': 'Franchise Restaurant'},
                           {'departmentId': '44', 'name': 'Beer'},
                           {'departmentId': '54', 'name': 'Chef'},
                           {'departmentId': '65', 'name': 'Money Services'},
                           {'departmentId': '07', 'name': 'Restaurant'},
                           {'departmentId': '32',
                            'name': 'Natural And Organics'},
                           {'departmentId': '90', 'name': 'Snap'}],
           'divisionNumber': '014',
           'geolocation': {'latLng': '39.10682,-84.51253',
                           'latitude': 39.10682,
                           'longitude': -84.51253},
           'hours': {'friday': {'close': '22:00',
                                'open': '06:00',
                                'open24': False},
                     'gmtOffset': '(UTC-05:00) Eastern Time (US Canada)',
                     'monday': {'close': '22:00',
                                'open': '06:00',
                                'open24': False},
                     'open24': False,
                     'saturday': {'close': '22:00',
                                  'open': '06:00',
                                  'open24': False},
                     'sunday': {'close': '22:00',
                                'open': '06:00',
                                'open24': False},
                     'thursday': {'close': '22:00',
                                  'open': '06:00',
                                  'open24': False},
                     'timezone': 'America/New_York',
                     'tuesday': {'close': '22:00',
                                 'open': '06:00',
                                 'open24': False},
                     'wednesday': {'close': '22:00',
                                   'open': '06:00',
                                   'open24': False}},
           'locationId': '01400513',
           'name': 'Kroger - Kroger On the Rhine',
           'phone': '5132635900',
           'storeNumber': '00513'}]}
```