# DSCI 511: Data Acquisition and Pre-Processing <br> Term Project Phase 1: Scoping a data set

## Group 10
### Team Members: 
#### Ian Auger

My name is Ian Auger, and I am the sole member of this team. My background is deeply rooted in the food industry, where I spent nearly a decade working in and managing restaurants. After transitioning out of the restaurant space, I spent the past six years working in food technology companies, initially in Operations and Strategy, and more recently as a Data Analyst.

##### Background
My expertise lies in relational databases, querying languages, data analysis, and data visualization, as these have been core responsibilities in my role for the past four years. While I am growing more proficient in object-oriented programming languages like Python and Java, my strongest skill is SQL.

From an industry perspective, I have worked with both D2C e-commerce companies and food delivery platforms, giving me exposure to a variety of operational and product challenges. This experience has fostered a strong sense of adaptability and problem-solving.

##### Goals for This Project
Moving forward, I want to strengthen my proficiency in Python, as it is becoming increasingly critical in my career. Specifically, I aim to:

- Improve my Python skills for data acquisition, processing and cleaning.
- Develop a stronger understanding of APIs and how to integrate them effectively.
- Work with large datasets, learning to process and analyze them efficiently.

These are foundational skills that I recognize as gaps in my current expertise, and I look forward to addressing them through this project.

## Your topic

The course of your project will be determined by two things:

1. the motivations present in your project's team and
2. the data your project is able to pull together.

Thus, choosing your topic is closely tied to both your team and the data you are able to identify. To start, discuss the domain interests present on your project team. To get you on your way, let's start with two questions:

1. Is there an aspect of the IoT, natural world, society, literature, or art, etc. that you would like to investigate computationally through what might be considered 'data'? 

2. What sort of data medium are you interested in working with? For example: transaction records, stock prices, memes and online conversations, open-domain poems, congressional records, news articles, songs and popularity, Associated Press images, transit records, call logs, CCTV footage, etcetera.

Whatever direction you set for your project, please make sure you document it well, keeping track of how its objectives and strategies change as you encounter available materials and other existing work.

## Topic: Geographic Indexing of Grocery Product Prices Alongside Demographic and Economic Social Statistics

### Objective

This project aims to acquire, integrate, and preprocess datasets that allow for an analysis of grocery product pricing, product availability, and socioeconomic conditions. The goal is to construct a clean, structured dataset that enables future analyses of pricing disparities and access to affordable groceries across different communities.

### Data Sources
#### 1️⃣ Grocery Product Pricing & Availability (Kroger)
##### Data Needed:
- Product prices at various Kroger store locations.
- Availability of different product categories (e.g., fresh produce, dairy, packaged goods).
- Store locations (latitude/longitude or ZIP code).

##### Acquisition Methods:
- Kroger Public API.
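
A minimal sketch of how requests to the Kroger Public API might be prepared. The base URL, the token path, the `product.compact` scope, and the `filter.zipCode.near` and `filter.limit` parameters reflect my reading of Kroger's developer documentation and should be treated as assumptions to verify; the function names are hypothetical. The code only builds request objects and sends nothing.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: base URL and paths as described in Kroger's developer docs; verify before use.
KROGER_BASE = "https://api.kroger.com/v1"


def build_token_request(client_id, client_secret, scope="product.compact"):
    """Prepare an OAuth2 client-credentials token request (not sent here)."""
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urlencode({"grant_type": "client_credentials", "scope": scope}).encode()
    return Request(
        f"{KROGER_BASE}/connect/oauth2/token",
        data=body,
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def build_locations_request(token, zip_code, limit=10):
    """Prepare a store-location lookup near a ZIP code (not sent here)."""
    query = urlencode({"filter.zipCode.near": zip_code, "filter.limit": limit})
    return Request(
        f"{KROGER_BASE}/locations?{query}",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
```

Either request could then be passed to `urllib.request.urlopen` once real credentials are available; keeping construction separate from sending makes the acquisition code easier to check offline.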

##### Preprocessing Tasks:
- Standardizing price data (unit price normalization).
- Handling missing or inconsistent pricing values.
- Assigning geographic identifiers (geocoding store locations to ZIP codes/census tracts).
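
The first two preprocessing tasks above can be sketched as follows. This is a rough illustration under assumptions: the unit labels, the choice of ounces and fluid ounces as base units, and the function name `unit_price` are hypothetical, and the real Kroger responses may describe package sizes differently.

```python
from typing import Optional, Tuple

# Assumed conversion table: unit label -> (base unit, multiplier into that base unit).
TO_BASE = {
    "oz": ("oz", 1.0),
    "lb": ("oz", 16.0),
    "fl oz": ("fl oz", 1.0),
    "pt": ("fl oz", 16.0),
    "qt": ("fl oz", 32.0),
    "gal": ("fl oz", 128.0),
}


def unit_price(price, size, unit) -> Optional[Tuple[float, str]]:
    """Return (price per base unit, base unit), or None when it cannot be computed."""
    # Missing or nonsensical values are flagged as None rather than guessed.
    if price is None or size is None or price < 0 or size <= 0:
        return None
    key = unit.strip().lower() if unit else ""
    if key not in TO_BASE:
        return None
    base_unit, factor = TO_BASE[key]
    return round(price / (size * factor), 4), base_unit
```

Returning `None` for rows that cannot be normalized keeps missing or inconsistent prices visible in the dataset instead of silently filling them in.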

#### 2️⃣ Demographic & Economic Data (U.S. Census 2023)
##### Data Needed:
- Income Levels (Median household income, poverty rate).
- Population Density (Urban vs. rural classification).
- Ethnic Composition (Racial and ethnic demographics).
- Education Levels (Percentage of population with a college degree).
- Employment Statistics (Unemployment rate, occupation distribution).
- Housing & Cost of Living (Median home value, rent prices).

##### Acquisition Methods:
- U.S. Census Bureau API (data.census.gov).
    - American Community Survey (ACS) 1-Year Estimates.
        - Demographic and Economic data summarizing individual American metropolitan and micropolitan areas
- Kroger Public APIs
    - Location API
        - Comprehensive data on the network of stores within the Kroger Corporation
    - Product API
        - Location-specific product information
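
For the Census side, a minimal sketch of building an ACS 1-Year query and reshaping the response. It assumes the `https://api.census.gov/data/{year}/acs/acs1` endpoint and the variable codes `B19013_001E` (median household income) and `B01003_001E` (total population); the helper names are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

CENSUS_BASE = "https://api.census.gov/data"


def build_acs1_url(year, variables, geography, api_key=None):
    """Build an ACS 1-Year query URL for the given variables and geography."""
    params = {"get": ",".join(["NAME", *variables]), "for": geography}
    if api_key:
        params["key"] = api_key
    return f"{CENSUS_BASE}/{year}/acs/acs1?{urlencode(params)}"


def rows_to_records(payload):
    """Convert the API's list-of-lists JSON (header row first) into dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


url = build_acs1_url(
    2023,
    ["B19013_001E", "B01003_001E"],
    "metropolitan statistical area/micropolitan statistical area:*",
)
```

The JSON returned from `url` is a list of rows whose first row holds the column names, so `rows_to_records` gives one dictionary per metropolitan or micropolitan area.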

##### Preprocessing Tasks:
- Mapping census data to store locations.
- Handling missing or aggregated demographic values.
- Normalizing data across different geographic resolutions (ZIP code vs. county level).
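
The mapping step above might look like the following sketch. It assumes a ZIP-to-county crosswalk with exactly one county per ZIP code (in practice a ZIP code can span counties), and every field name here is hypothetical.

```python
def attach_census(stores, zip_to_county, census_by_county):
    """Attach county-level census fields to each store record."""
    enriched = []
    for store in stores:
        # Resolve the store's ZIP code to a county FIPS code, if the crosswalk has it.
        county = zip_to_county.get(store["zip_code"])
        census = census_by_county.get(county, {})
        enriched.append({
            **store,
            "county_fips": county,
            "median_household_income": census.get("median_household_income"),
            # Flag unmatched stores so missing demographics stay visible downstream.
            "census_matched": bool(census),
        })
    return enriched
```

Stores with no crosswalk entry keep their record with empty census fields and `census_matched` set to `False`, which makes coverage gaps easy to count before any analysis.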

#### 3️⃣ Supplemental Data: SNAP Participation & Food Assistance
Not a requirement for this exercise, but including SNAP participation per store and the rate of SNAP participation per household in each geographic area could augment this dataset. Household income will be available through census data; however, the rate of participation in food assistance programs could be a good indicator of a community's relative wealth. It would be important to understand whether food pricing in these communities reflects the wealth of the consumers, or whether grocery chains maintain a flatter, more rigid pricing model.

## What you're responsible for in this phase
Ok, so here's the goal again for phase 1. You must:

- scope a computationally tangible artifact, henceforth known as the data set, whose study is expected to satisfy goals pertaining to the project's topic of interest.

This phase of the project will set expectations and a work plan for your project's open-ended work. Not only should you scope the collection of your dataset, but you should also determine what modes of distribution will be possible once it is produced. Will you have to distribute access code, or will you be able to directly provide links to stored data?

Ultimately, the completion of your project will produce raw materials for other folks (possibly you) interested in trying out analysis applications in future coursework (DSCI 521). So, as you identify a potential data set, be sure to be realistic about what is possible to collect and how you can preprocess it for use! Please make sure that some portion of your target data is guaranteed to be collectable. However, it's okay to try for some data that are a reach; just document any unsuccessful or partially successful efforts in your report and discuss what obstacles prevented those data from being collected.

### Things I'll be looking for in a Phase 1 (due week 4) report

- a background report on the team's members, their self-identified skills, and individual contributions
- a discussion of what you would like your data to do and what you hope it is good for
- an exhibition of a sample of your data&mdash;show me it exists and what it looks like, even if very raw
- a discussion of who might be interested in your data set
- a discussion of how your data is limited and could be improved
- a discussion of how your data were created, e.g., people texting, The Earth's molten core spinning, etc.
- a discussion of what sort of access rights presently exist on your data and how/if you will make them available

As a heads up, by the end of the term and in your final report (due week 10, not now) I'll be looking for things like
- a data dictionary or README.md that describes what is present in the data set and where or how to access it
- code that documents the construction of your data&mdash;I should be able to re-construct/re-access it!
- code that allows me or someone else to interact with your data set
- tables and figures indicating the size and variety present in your data

_Note_: These are not exhaustive lists of topics or tasks worth covering in your project. In general, if there's something interesting about your dataset, whether relating to its construction, existence, representative population or _anything else_, then be sure to document it!