# COGS 108 - Data Checkpoint

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Muska Mesdaq: Background research, Writing - original draft, Writing - review & editing, Data Overview, Data Wrangling
- Shourya Kulkarni: Analysis, Data Overview, Ethics, Writing - review & editing, Writing - original draft
- Jasmine Le: Data Wrangling, Writing - original draft, Writing - review & editing
- Pauline Shah: Data Overview, Analysis, Writing - review & editing, Writing - original draft


## Research Question

Which product attributes most influence the price (in U.S. dollars) of athleticwear sold in the U.S.? The attributes considered are material, color, category, origin, gender target (the gender a product is marketed to), and popularity (measured as the difference from the mean number of ratings for the garment type).
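
The popularity measure in the research question can be sketched as follows. This is a minimal illustration, not the project's final wrangling code: the column names `category` and `reviews_count` and the toy values are assumptions and may not match the real schema.

```python
import pandas as pd

# Toy data; the column names are placeholders for the real schema.
products = pd.DataFrame({
    "category": ["shoes", "shoes", "shirts", "shirts"],
    "reviews_count": [100, 300, 10, 30],
})

# Mean number of ratings within each garment type, broadcast back to every row.
category_mean = products.groupby("category")["reviews_count"].transform("mean")

# Popularity = a product's rating count minus the mean for its garment type.
products["popularity"] = products["reviews_count"] - category_mean
```

Centering within each garment type keeps categories with very different review volumes comparable.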


## Background and Prior Work

## Hypothesis



We hypothesize that origin and material will be the product attributes that most influence the price of athleticwear. The clothing industry relies heavily on outsourcing production to foreign countries to reduce the cost of producing goods, so a product's origin should be reflected in its price. Material should matter for a similar reason: the cost of the material used to make a product is a direct input to its price.

## Data

### Data overview

### Dataset #1
- Dataset Name: Data on Adidas products available for purchase in the United States
- Link: https://www.kaggle.com/datasets/thedevastator/adidas-fashion-retail-products-dataset-9300-prod
- Number of observations: ~9,300
- Number of variables: 12
- Description of the variables most relevant to this project
- Descriptions of any shortcomings:
    - Missing material composition (e.g., cotton vs. polyester blend)
    - Missing production cost and sales frequency, likely proprietary
    - Imbalanced representation: heavily skewed toward Adidas products
    - Some missing values for `average_rating` and `reviews_count`
    - Text-based variables (`description`) require parsing to extract structured information

Despite these limitations, this dataset provides a strong foundation for exploring pricing patterns, consumer ratings, and category-based trends in the athletic wear market.
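
Since material composition is missing as a structured field, one option is to pull material keywords out of the free-text `description` column. The sketch below assumes a small hand-picked keyword list (`MATERIALS` is hypothetical and would need extending after inspecting real descriptions). It only detects mentions, not blend percentages.

```python
import re

# Hypothetical keyword list; extend after inspecting real descriptions.
MATERIALS = ["cotton", "polyester", "nylon", "elastane", "leather"]
MATERIAL_PATTERN = re.compile(r"\b(" + "|".join(MATERIALS) + r")\b", re.IGNORECASE)

def extract_materials(description):
    """Return the sorted, de-duplicated material keywords found in a description."""
    if not isinstance(description, str):
        return []  # treat missing descriptions as having no known material
    return sorted({m.lower() for m in MATERIAL_PATTERN.findall(description)})
```

With pandas, this could be applied row by row, for example `df["description"].apply(extract_materials)`.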

### Dataset #2
- Dataset Name: Adidas vs. Nike
- Link: https://www.kaggle.com/datasets/kaushiksuresh147/adidas-vs-nike
- Number of observations: ~1,000 (approximate, confirm in dataset)
- Number of variables: 7
- Description of the variables most relevant to this project
- Descriptions of any shortcomings: may have missing values, limited coverage of attributes

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and whether it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
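
The missingness and outlier steps above could start from something like the following sketch. The helper name `summarize_quality`, the `selling_price` column, and the toy values are assumptions for illustration. The 1.5 × IQR rule is one common convention for flagging outliers, not a requirement of the assignment.

```python
import pandas as pd

def summarize_quality(df, price_col):
    """Report per-column missingness and flag price outliers with the 1.5*IQR rule."""
    missing = df.isna().sum()
    q1, q3 = df[price_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (df[price_col] < q1 - 1.5 * iqr) | (df[price_col] > q3 + 1.5 * iqr)
    return missing, is_outlier

# Toy frame; real data would be read from data/00-raw/ with pd.read_csv.
toy = pd.DataFrame({
    "selling_price": [40.0, 45.0, 50.0, 55.0, 60.0, 500.0],
    "average_rating": [4.5, None, 4.0, 4.2, None, 3.9],
})
missing, is_outlier = summarize_quality(toy, "selling_price")
```

Flagged rows are better inspected than dropped automatically, since a very high price may be a legitimate premium product rather than a data error.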


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

This dataset includes several key metrics that describe both Adidas and Nike products. The Listing Price and the Sale Price are recorded in U.S. cents, which lets us analyze both the original pricing and the markdowns of the products. For example, a product listed at 14,999 U.S. cents costs 149.99 U.S. dollars, and a product sold for 7,499 U.S. cents costs 74.99 U.S. dollars. The Discount metric is expressed as a whole number and shows how much the Sale Price is reduced relative to the Listing Price. The dataset also captures customer engagement through Ratings (on a scale of 1-5, where 1 indicates dissatisfaction with the product and 5 indicates high satisfaction) and Reviews, which records the number of reviews a particular product has received. Additional features include the Brand of the listed product (Nike or Adidas), the Name of the specific product, its Product ID (a combination of numbers and letters), a brief text Description of the product, and Last Visited (a timestamp of the most recent customer interaction with the product). Together, these metrics describe the characteristics of a product and the customer interest associated with it.
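
The cents-to-dollars conversion described above can be sketched as below. The column labels mirror the names used in the paragraph but are assumptions about the actual file, and the derived column names (`listing_usd`, `sale_usd`, `markdown_pct`) are made up for illustration. The markdown percentage is computed only as a cross-check, on the assumption that Discount is a whole-number percentage.

```python
import pandas as pd

# Toy rows; column names follow the description above but are assumptions.
listings = pd.DataFrame({
    "Listing Price": [14999, 7999],
    "Sale Price": [7499, 7999],
})

# Convert integer cents to dollars.
listings["listing_usd"] = listings["Listing Price"] / 100
listings["sale_usd"] = listings["Sale Price"] / 100

# Markdown as a percentage of the listing price, for cross-checking Discount.
listings["markdown_pct"] = (
    (listings["Listing Price"] - listings["Sale Price"])
    / listings["Listing Price"] * 100
).round()
```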

Despite the number of metrics it has, the dataset may contain several sources of bias and limitations. Because the data is taken mainly from online shoppers, it reflects individuals who use e-commerce platforms and websites to purchase these products and excludes consumers who shop in person at physical retail stores. This introduces self-selection bias, since online shoppers may differ from in-store shoppers in their purchasing habits and characteristics, for example by being more comfortable with technology. The dataset is also restricted to Nike and Adidas, and since our research focuses only on Adidas, the Nike entries will be removed during preprocessing. Because of this sole focus on Adidas, the results of our analysis cannot be generalized to the broader athleticwear industry. Any insights from our analysis apply to Adidas products within the online shopping context.
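
Removing the Nike entries during preprocessing might look like the sketch below. The `Brand` and `Product Name` column labels and the specific brand strings are assumptions; the real labels should be checked with `value_counts()` before filtering.

```python
import pandas as pd

# Toy rows; the Brand labels are assumptions and may differ in the real file.
raw = pd.DataFrame({
    "Brand": ["Adidas ORIGINALS", "Nike", "Adidas CORE / NEO", "Nike"],
    "Product Name": ["A", "B", "C", "D"],
})

# Case-insensitive prefix match keeps every Adidas sub-brand and drops Nike.
# na=False treats rows with a missing Brand as non-Adidas.
is_adidas = raw["Brand"].str.strip().str.lower().str.startswith("adidas", na=False)
adidas_only = raw[is_adidas].reset_index(drop=True)
```

A prefix match is used rather than an exact comparison so that sub-brand labels, if present, are not silently discarded.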


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data used to which they consent?
> There are no human subjects involved in our data collection.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> This study reflects pricing and marketing choices specific to a single brand because the data being analyzed comes only from Adidas. This limits any general analysis of the larger athleticwear industry and introduces potential bias. Additionally, our dataset(s) focus on U.S. pricing and consumer trends, overlooking regional variations in pricing and affordability, which may skew perceptions of product value across global markets. To avoid overstating our conclusions, we will state clearly that the results of this project describe Adidas's pricing structure only, not the entire athleticwear industry.

 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> There is no personal information that is being collected in our data collection, and no privacy risks are present. All of the data is related to the products, not the individuals themselves.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
> We will check for unbalanced groups, especially with gender-specific items and the pricing associated with those products, to ensure that our model doesn't unintentionally reinforce bias or social stereotypes through Adidas’s marketing/pricing decisions.  

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
> No sensitive information is being stored. The datasets are from Kaggle and downloaded to a secure environment. 

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
> Not applicable to our project since the datasets were publicly shared by Kaggle and no personal information is stored or shared in the datasets.

 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> The data will be archived and/or deleted when the project concludes, once our team has finished the analysis and produced its results.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> Our dataset only includes Adidas products, so it lacks perspectives and inputs from other athleticwear companies, both larger and smaller. We will clearly state that our analysis focuses only on Adidas and doesn't represent the dynamics of the broader athleticwear industry.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> We will make sure that we check for any imbalanced classes/representation in certain product features (material, color, category, origin, gender-target) within our data before moving into the modeling process. If we find any uneven representation, we will make sure to be transparent about it in our results.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We will clearly and honestly state our data cleaning, analysis and modeling process steps and make sure that our visualizations and reports accurately reflect our findings. Our team will also be explicit in stating that the correlations we find are descriptive of Adidas’s dataset(s) and not causal.  

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
> No PII exists in our data, so there are no privacy risks or concerns regarding this.

 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> All steps of analysis and modeling will be well documented in a Jupyter Notebook and in the GitHub repository to allow any replication in the future.
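
The imbalanced-class check mentioned under C.2 could be sketched as follows. The helper `class_shares`, the 10% threshold, and the toy gender-target labels are assumptions chosen for illustration, not values taken from the datasets.

```python
import pandas as pd

def class_shares(series, min_share=0.05):
    """Return each level's share of rows and the levels falling below min_share."""
    shares = series.value_counts(normalize=True, dropna=False)
    rare = shares[shares < min_share].index.tolist()
    return shares, rare

# Toy gender-target column with one under-represented level.
gender_target = pd.Series(["men"] * 12 + ["women"] * 7 + ["kids"] * 1)
shares, rare = class_shares(gender_target, min_share=0.10)
```

Levels listed in `rare` would be candidates for reporting explicitly, or for merging into an "other" group, before modeling.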

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> Since the dataset includes a gender-target variable, we will analyze whether gendered products have consistent pricing patterns. We will check gender-specific products to avoid reinforcing any stereotypes. This will help ensure that our results do not unintentionally reproduce or justify pricing disparities.

 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
> We will check for variations in pricing across gender-specific products to make sure that our results are neutral and unbiased. We will test whether pricing differences across categories are statistically significant and whether they reflect Adidas’s marketing decisions.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> We will use modeling techniques such as multiple linear regression to analyze the features in our data, with loss functions such as mean absolute error (MAE) and/or mean squared error (MSE), both to fit the model and to report its performance transparently.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Coefficients and features that are involved in regression and other visualizations will be used to show how each variable, such as material, color, category, origin, gender-target, influences the pricing of a product. This will ensure that our results are understandable for a general audience.


 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> We will communicate our findings very carefully to ensure that readers understand that our analysis applies only to Adidas’s dataset. This is because we are aware that misinterpreting our results as an attempt to generalize pricing across all brands could unintentionally reinforce misleading marketing assumptions. Our findings will be communicated to be specific to Adidas's marketing and pricing strategy, not pricing for products in the broader athleticwear industry. We will include clear disclaimers in our project to prevent any misunderstanding and communicate these limitations.
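
For the metric selection item (D.3), the sketch below shows a multiple linear regression fit by ordinary least squares together with MAE and MSE. The design matrix and prices are invented toy values, and NumPy's `lstsq` stands in for whatever modeling library the project ends up using.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Toy design matrix: intercept column plus two made-up binary features.
X = np.array([[1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0]], dtype=float)
y = np.array([30.0, 50.0, 60.0, 20.0])  # invented prices in USD

# Ordinary least squares fit; coef holds the intercept and two slopes.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
```

MSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few badly predicted products dominate the error.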

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> Not applicable because our team is not planning to deploy this model.

 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
> Not applicable because individuals are not particularly affected by any outcomes of this model.

 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
> Not applicable because our model is not in production.

 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> We will make sure to clarify that our model should not be used to justify variations in pricing within different product categories.
While our model is not being deployed to the public, our findings in our model/project could be misused if taken out of context. For instance, someone could use our analysis from this project to justify the pricing differences for products by gender. To prevent any misuse and misinterpretation like this, we will explicitly state that our research does not justify, recommend, or endorse any discriminatory pricing strategies, especially for explicit commercial use. We will make a clear disclaimer which states that it is unethical to use our results/analysis for commercial use and pricing for products without proper equity testing, since it can perpetuate gender and/or economic inequalities. 


## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also if you have lost points on the previous checkpoint fix them