# COGS 108 - Data Checkpoint

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Team list and credits:
- Muska Mesdaq: Background research, Writing - original draft, Writing - review & editing, Data Overview, Data Wrangling
- Shourya Kulkarni: Analysis, Data Overview, Ethics, Writing - review & editing, Writing - original draft
- Jasmine Le: Data Wrangling,  Writing - original draft, Writing - review & editing
- Pauline Shah: Data Overview, Analysis, Writing - review & editing, Writing - original draft


## Research Question

What product attributes, such as material, color, category, origin, gender-target (gender product is marketed to), and popularity (measured by difference from mean number of ratings for garment type) most influence pricing in U.S. dollars of athleticwear in the U.S.?  
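
The popularity measure in the research question (difference from the mean number of ratings for a garment type) can be sketched as follows. This is a minimal illustration on toy data; the column names `garment_type` and `reviews_count` are assumptions and may not match the final dataset.

```python
import pandas as pd

# Toy data; the column names `garment_type` and `reviews_count` are assumptions.
products = pd.DataFrame({
    "garment_type": ["shorts", "shorts", "shorts", "jacket", "jacket"],
    "reviews_count": [10, 40, 70, 5, 15],
})

# Mean number of ratings within each garment type, broadcast back to every row.
type_mean = products.groupby("garment_type")["reviews_count"].transform("mean")

# Popularity = how far a product sits above (+) or below (-) its garment-type mean.
products["popularity"] = products["reviews_count"] - type_mean

print(products)
```

Centering within garment type keeps a heavily reviewed category from making every product in it look popular.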


## Background and Prior Work

The prevalence of athletic wear, whether worn for leisure or actual activity, has been on the rise. In fact, the athletic wear market now represents a substantial and rapidly growing segment of the global apparel industry, with the North American market alone valued at over 26 billion dollars in 2024, and a projected value of 677.26 billion dollars by 2030. Among other things, this rapid growth can be attributed to a culture that is increasingly health-conscious, and the not-unrelated merging of fashion and functionality, dubbed "athleisure". As casually-worn athletic wear becomes increasingly mainstream, however, there is the question of cost - specifically, what factors make it cost what it does?

Now, the question of what drives apparel pricing has been examined before. Multiple countries have used the hedonic model to calculate the Consumer Price Index, used to track inflation, for apparel in their countries. The hedonic model uses regression to "estimate a value for each characteristic of the item", which then all sum to the total price of the item. The use of this technique allows researchers to make adjustments to account for seasonal changes and evolving product characteristics, essentially isolating the contribution of individual attributes to overall product pricing.
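
To make the hedonic idea concrete, here is a minimal sketch that fits an ordinary least squares regression on one-hot encoded attributes. The data are invented so that prices are exactly additive, and the column names are assumptions; a real analysis would likely use a dedicated statistics library and many more attributes.

```python
import numpy as np
import pandas as pd

# Toy listings; column names and values are illustrative assumptions, not project data.
items = pd.DataFrame({
    "material": ["polyester", "cotton", "polyester", "cotton", "wool", "wool"],
    "gender":   ["men", "men", "women", "women", "men", "women"],
    "price":    [30.0, 35.0, 33.0, 38.0, 50.0, 53.0],
})

# One-hot encode attributes, dropping one level per attribute as the baseline
# (here the alphabetically first levels: cotton and men).
X = pd.get_dummies(items[["material", "gender"]], drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)

# Ordinary least squares: each coefficient is the implicit price of one characteristic
# relative to the baseline item.
coef, *_ = np.linalg.lstsq(X.to_numpy(), items["price"].to_numpy(), rcond=None)
implicit_prices = pd.Series(coef, index=X.columns)
print(implicit_prices.round(2))

# Predicted price = baseline + sum of the implicit prices of the item's characteristics.
predicted = X.to_numpy() @ coef
```

Each coefficient reads as a dollar premium or discount relative to the baseline item, which is what lets the model separate the contribution of individual attributes.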

Material composition is one of the most fundamental drivers of garment pricing, and prior research has demonstrated its importance. In the active wear market, synthetic materials have been very popular, led by the widespread use of polyester, which is durable and moisture-wicking, both important qualities for active wear. However, consumer preferences are evolving, and natural materials like organic cotton and bamboo blends are becoming more popular as eco-friendly, sustainable options now that sustainability is a more mainstream concern. Consumers' willingness to pay for sustainable materials has been documented before. Research on outdoor apparel using hedonic pricing models found that material type had a statistically significant impact on item price, with materials such as cotton, polyester, wool, and nylon valued at different price points based on their breathability, durability, and environmental impact. Raw material price fluctuations also significantly affect manufacturing costs for brands, with materials such as organic cotton subject to changes in supply and demand as well as energy and transportation costs. This cost pressure directly influences retail pricing strategies, as manufacturers must balance quality expectations with competitive pricing.

Another important consideration for pricing, one that extends beyond clothing, is the "Pink Tax", or gender-based pricing. This is a documented phenomenon, reported on even by the U.S. Congress's Joint Economic Committee, which notes that "... the frequency with which female consumers find themselves paying higher prices for gender-specific goods and services effectively becomes a tax on being a woman". Gender-targeted marketing and its relationship to pricing is one of the most controversial aspects of apparel pricing research: a 2015 study by the New York City Department of Consumer Affairs found that women's clothing was 8% more expensive than men's clothing on average. In addition, women's clothing imports face higher average tariff rates than men's, at 15.1% compared to 11.9%. However, other research challenges the notion of gender-based pricing, finding that most comparable products differed in their ingredient compositions, which suggests that price differences may reflect genuine product differentiation rather than discriminatory pricing. The debate centers on distinguishing legitimate reasons for price differences from discriminatory practices, a distinction we also wish to explore on a smaller scale in our analysis of what determines the cost of athletic wear.

**Country of Origin Effects**  
Country of origin (COO) represents another well-established driver of consumer perceptions and pricing in apparel markets. The COO effect describes how consumers' attitudes, perceptions, and purchasing decisions are influenced by products' country of origin labeling, which may refer to where a brand is based, where products are designed, or where they are manufactured. Research has empirically demonstrated that the COO effect has significant price-related consequences, with brands possessing favorable COO associations able to charge price premiums over and above those attributed to observable product differentiation. The strength of COO effects varies considerably by product category and consumer characteristics. In apparel specifically, research examining clothing from multiple countries found that COO was related to assessment of product quality, but when evaluating purchase likelihood, COO seemed not to be as important, suggesting a more complex situation.

The influence of popularity metrics on pricing is a newer area of research enabled by e-commerce platforms. In the absence of actual purchasing data, we measure popularity through customer reviews and ratings. Consumer surveys consistently show that ratings and reviews are considered very important, with up to 98% of shoppers saying reviews are an "essential resource when making purchase decisions". Research also demonstrates that products with five reviews have a 270% greater purchase likelihood than products with no reviews, but these benefits diminish rapidly after the first five reviews, according to Northwestern University. Conjoint analysis research also indicates that "consumers considered star rating, number of reviews, online shopping platform, and premium reviews to be more important than price when making an online shopping decision". However, many factors differ from study to study, so we would like to examine for ourselves how perceived popularity influences product pricing in the realm of active wear.

While individual attributes influencing apparel pricing have been studied on their own, comprehensive research examining the effects of material, color, category, origin, gender-targeting, and popularity metrics specifically within the athleticwear segment is still limited. As such, we plan to use a comprehensive dataset of U.S. athleticwear products to quantify the relative importance of these attributes in determining price and to provide insight into the factors that decide how much a product costs.

Athleticwear has become one of the biggest parts of the fashion industry, especially with the rise of fitness culture and athleisure clothing. Big brands like Adidas have built their image around performance and lifestyle wear, selling products that mix comfort, function, and style. Research on fashion pricing has shown that things like brand reputation, material quality, and product design have a big impact on how much items cost online. One study found that brands with stronger reputations tend to keep higher prices and are less likely to discount their items. [Source](https://www.jstage.jst.go.jp/article/isase/ISASE2019/0/ISASE2019_1_8/_article?utm_source)

Another study on denim clothing showed that items made from unique or mixed materials often had higher prices because people see them as higher quality. [Source](https://www.mdpi.com/2673-7248/3/1/2?utm_source)

For Adidas specifically, there’s been some research on its marketing and pricing strategies, but not much on what exact product attributes drive those prices. One paper mentioned that Adidas focuses heavily on innovation and sustainability while still trying to improve how it prices and markets its products. [Source](https://www.ewadirect.com/proceedings/aemps/article/view/3640?utm_source)

Another study found that brand image, product quality, and marketing are major factors influencing how customers view Adidas and what they’re willing to pay. [Source](https://bsq.cultechpub.com/index.php/bsq/article/view/2?utm_source)

Even though these studies show that branding and quality matter, they don’t really dig into which features like material, color, or product category make certain Adidas items cost more than others. Our project aims to explore that gap by analyzing online pricing data for Adidas products in the U.S. to see which product details have the biggest impact on price.

### Sources
1. [Consumer Price Index – U.S. Bureau of Labor Statistics](https://www.bls.gov/cpi/quality-adjustment/questions-and-answers.htm)
2. [Sportswear Fabric Market – Market Research Future](https://www.marketresearchfuture.com/reports/sportswear-fabric-market-37412)
3. [Lindahl, E. (2018) *The Outdoor Apparel Industry: Measuring the Premium for Sustainability with a Hedonic Pricing Model*](https://www.researchgate.net/publication/332530663_The_Outdoor_Apparel_IndustryMeasuring_the_Premium_for_Sustainability_with_a_Hedonic_Pricing_Model)
4. [The Pink Tax – U.S. Congress Joint Economic Committee](https://www.jec.senate.gov/public/_cache/files/8a42df04-8b6d-4949-b20b-6f40a326db9e/the-pink-tax---how-gender-based-pricing-hurts-women-s-buying-power.pdf)
5. [Bessendorf, A. (2015) *From Cradle to Cane: The Cost of Being a Female Consumer*](https://www.nyc.gov/assets/dca/downloads/pdf/partners/Study-of-Gender-Pricing-in-NYC.pdf)
6. [Taylor, L., Dar, J. (2015) *Fairer Trade: Removing Gender Bias in US Import Taxes*](http://bush.tamu.edu/wp-content/uploads/2020/07/V6-3-Tariff-Discrimination-Takeaway.pdf)
7. [Moshary, S., Tuchman, A., Vajravelu, N. (2023) *Gender-Based Pricing in Consumer Packaged Goods: A Pink Tax?*](https://pubsonline.informs.org/doi/pdf/10.1287/mksc.2023.1452)
8. [Florian, M., Diamantopoulos, A. (2012) *Activation of Country Stereotypes: Automaticity, Consonance, and Impact*](https://link.springer.com/article/10.1007/s11747-012-0318-1)
9. [Boutin Jr, P. (2011) *The Country-of-Origin Construct and Its Effect on Consumer Behavior*](https://www.researchgate.net/publication/318283638_The_Country-of-Origin_Construct_and_Its_Effect_on_Consumer_Behavior_A_Review_of_Selected_Literature_and_Proposed_Future_Research_Directions)
10. [The Ever-Growing Power of Reviews (2023 Edition)](https://www.powerreviews.com/power-of-reviews-2023/)
11. [How Online Reviews Influence Sales – Northwestern University](https://spiegel.medill.northwestern.edu/how-online-reviews-influence-sales/)
12. [Sung, E., Chung, W.Y. & Lee, D. (2023) *Factors that Affect Consumer Trust in Product Quality*](https://doi.org/10.1057/s41599-023-02277-7)


## Hypothesis



We hypothesize that origin and material are the product attributes that will most influence the pricing of athleticwear. We expect origin to matter because the clothing industry relies heavily on outsourcing production to other countries to reduce the cost of goods, so production costs vary by country. We expect material to matter because the cost of the material used to make a product is a direct input into its price.

## Data

### Data overview

### Dataset #1
- **Dataset Name**: Data on Adidas products available for purchase in the United States
- **Link**: [Kaggle Dataset](https://www.kaggle.com/datasets/thedevastator/adidas-fashion-retail-products-dataset-9300-prod)
- **Number of observations**: ~9,300
- **Number of variables**: 12
- **Description of the variables most relevant to this project**:
- **Descriptions of any shortcomings**:
    - Missing material composition (e.g., cotton vs. polyester blend)
    - Missing production cost and sales frequency, likely proprietary
    - Imbalanced representation: heavily skewed toward Adidas products
    - Some missing values for `average_rating` and `reviews_count`
    - Text-based variables (`description`) require parsing to extract structured information

Despite these limitations, this dataset provides a strong foundation for exploring pricing patterns, consumer ratings, and category-based trends in the athletic wear market.

### Dataset #2
- **Dataset Name**: Adidas vs. Nike
- **Link**: [Kaggle Dataset](https://www.kaggle.com/datasets/kaushiksuresh147/adidas-vs-nike)
- **Number of observations**: 3,268
- **Number of variables**: 10
- **Description of the variables most relevant to this project**:
  - Listing Price and Sale Price (in U.S. cents) allow analysis of original pricing vs. markdowns. Example: 14,999 cents = $149.99.
  - Discount (whole number) indicates reduction from Listing Price.
  - Ratings (1-5) capture customer satisfaction, with 5 = highly satisfied.
  - Reviews show the number of customer reviews.
  - Brand, Product Name, Product ID, Description, and Last Visited (timestamp) provide product and engagement context.
- **Descriptions of any shortcomings**: may have missing values, limited coverage of attributes.

- **Summary**: This dataset includes several key metrics describing both Adidas and Nike products. Listing Price and Sale Price are recorded in U.S. cents, enabling analysis of price reductions. Discount shows how much the Sale Price is reduced relative to the Listing Price. Ratings and Reviews provide customer engagement insights. Brand, Product Name, Product ID, Description, and Last Visited provide additional context.
- **Bias and scope**: This dataset may contain several sources of bias. Data comes mainly from online shoppers and excludes in-store customers, which introduces self-selection bias because online shoppers may differ in purchasing habits and characteristics. Because the research focuses solely on Adidas, Nike entries will be removed during preprocessing. Analysis results therefore cannot be generalized to the broader athleticwear industry; insights will only apply to Adidas products in an online shopping context.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Adidas U.S. Athleticwear Dataset

### Key Variables and Metrics
- **Product Name / Brand**: Identifies the item and confirms it belongs to Adidas.
- **Garment Subtype**: Describes the product type (e.g., *T-shirt*, *Shorts*, *Leggings*, *Jacket*). Useful for comparing price differences between categories.
- **Material Composition**: Describes the fabric makeup (e.g., *60% cotton, 40% polyester*). Synthetic materials like polyester are durable and cheaper, while natural fibers like organic cotton can increase price due to sustainability value.
- **Color**: The primary color of the garment (categorical). Limited-edition colors or trendy shades may be priced higher.
- **Gender**: The target audience—*Men*, *Women*, or *Unisex*. Allows testing for gender-based pricing (“Pink Tax”).
- **Country of Production**: Where the item was manufactured (e.g., *Vietnam, China, Indonesia*). May affect cost because of labor and production expenses.
- **Customer Rating**: Average rating (1–5 stars) reflecting consumer satisfaction. A higher rating may suggest higher perceived value or quality.
- **Number of Reviews**: Total customer reviews, a proxy for product popularity or sales volume.
- **Sale Price (USD)**: The main outcome variable, representing listed retail price in U.S. dollars.

### Potential Data Concerns and Biases
- **Online Marketplace Bias**: The dataset is sourced from online listings, which may not represent in-store pricing or promotional discounts.
- **Self-Selection Bias**: Popular products with more reviews are overrepresented, while new or niche products might lack visibility.
- **Missing or Inconsistent Entries**: Some records may have missing *country*, *rating*, or *review count* fields, especially for newly listed products.
- **Duplicate Products**: Variants of the same item (e.g., same shirt in multiple colors) may appear multiple times and need consolidation.
- **Temporal Variation**: Prices may change over time due to promotions, seasonal trends, or clearance sales.
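
One way to check whether the missing entries noted above are random or systematic is to compare missingness across groups. The sketch below uses toy data; the column names `garment_subtype`, `customer_rating`, and `review_count` are assumptions that would need to match the real file.

```python
import numpy as np
import pandas as pd

# Toy listings; column names are assumptions standing in for the real dataset.
listings = pd.DataFrame({
    "garment_subtype": ["shorts", "shorts", "jacket", "jacket", "leggings", "leggings"],
    "customer_rating": [4.5, np.nan, 4.0, 4.2, np.nan, np.nan],
    "review_count":    [120, 0, 35, 60, 0, 2],
})

# Share of missing values in each column.
missing_share = listings.isna().mean()

# Missing-rating rate per garment subtype: large gaps between groups hint that
# missingness is systematic rather than random.
missing_by_type = (
    listings["customer_rating"].isna().groupby(listings["garment_subtype"]).mean()
)

# Compare review counts for products with and without a rating.
has_rating = listings["customer_rating"].notna().map({True: "rated", False: "unrated"})
reviews_by_status = listings.groupby(has_rating)["review_count"].median()

print(missing_share, missing_by_type, reviews_by_status, sep="\n\n")
```

If unrated products cluster among items with few or no reviews, as in this toy example, imputing their ratings with a global median could blur a real difference between new and established products.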



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data/00-raw/adidas_athleticwear.csv')

# Show basic info
print('Shape:', df.shape)
display(df.head())

# Missingness check
print('\nMissing values:\n', df.isnull().sum())

sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Data Heatmap')
plt.show()

# Detect outliers in price
sns.boxplot(x=df['sale_price'])
plt.title('Price Distribution and Outliers')
plt.show()

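# Example cleaning
from pathlib import Path  # used below to make sure the output directory exists

df = df.drop_duplicates()
# Price is the outcome variable, so rows without it cannot be used
df = df.dropna(subset=['sale_price'])
# Fill missing ratings with the median; assign back instead of calling
# fillna(inplace=True) on a column selection, which may not update df in recent pandas
df['customer_rating'] = df['customer_rating'].fillna(df['customer_rating'].median())

# Save cleaned version
Path('data/02-processed').mkdir(parents=True, exist_ok=True)
df.to_csv('data/02-processed/adidas_athleticwear_cleaned.csv', index=False)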
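
The boxplot above shows outliers visually. To flag them in the data, a common option is the 1.5 × IQR rule, which is also the default whisker rule in seaborn boxplots. The sketch below is a hedged illustration on toy prices; `flag_iqr_outliers` is a hypothetical helper, not part of the project modules.

```python
import pandas as pd

def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy prices in USD; the last value is deliberately extreme.
prices = pd.Series([25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 400.0])
is_outlier = flag_iqr_outliers(prices)
print(prices[is_outlier])
```

Flagged rows are better inspected than dropped automatically, since a high price may be a legitimate premium product and not a data error.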


### Dataset #2 


This dataset includes several key metrics that describe both Adidas and Nike products. The Listing Price and the Sale Price are recorded in U.S. cents, which allows us to analyze the original pricing and the markdowns of the products. For example, a product listed at 14,999 U.S. cents would cost 149.99 U.S. dollars, and a product sold for 7,499 U.S. cents would cost 74.99 U.S. dollars. The Discount metric is expressed as a whole number and shows how much the Sale Price is reduced relative to the Listing Price. The dataset also captures customer engagement through Ratings (on a scale of 1-5, where 1 indicates dissatisfaction with the product and 5 indicates high satisfaction) and Reviews, which records the number of reviews a particular product has received. Additional features include the Brand of the listed product (Nike or Adidas), the Name of the specific product, its Product ID (a combination of numbers and letters), a brief text Description of the product, and Last Visited (a timestamp of the most recent customer interaction with the product). Together, these metrics describe characteristics of each product and the customer interest associated with it.

Despite the number of metrics it has, the dataset may contain several sources of bias and limitations. Because the data is taken mainly from online shoppers, it reflects individuals who use e-commerce platforms and online shops to purchase these products, and it excludes consumers who shop in person at physical retail stores. This introduces self-selection bias, since online shoppers may have different purchasing habits and characteristics than in-store shoppers, such as possibly being more comfortable with technology. The dataset is also restricted to Nike and Adidas, and since our research focuses only on Adidas, the Nike entries will be removed during preprocessing. Because of this sole focus on Adidas, the results of our analysis cannot be generalized to the broader athleticwear industry. Any insights from our analysis will apply to Adidas products within the online shopping context.
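
The preprocessing described above (dropping Nike rows and converting cents to dollars) can be sketched as follows. This is a minimal illustration on an in-memory toy table and not the final wrangling code: the column labels follow the names used in this section, and the brand strings are assumptions about how the raw file spells them.

```python
import pandas as pd

# Toy rows mimicking the described columns; the real data would be read from
# data/00-raw/ with pd.read_csv (exact file name and brand spellings are assumptions).
raw = pd.DataFrame({
    "Brand": ["Adidas", "Nike", "Adidas"],
    "Listing Price": [14999, 9999, 7999],
    "Sale Price": [7499, 9999, 5599],
})

# Keep only Adidas rows, since the analysis is restricted to that brand.
adidas = raw[raw["Brand"].str.contains("adidas", case=False, na=False)].copy()

# Convert prices from U.S. cents to U.S. dollars.
for col in ["Listing Price", "Sale Price"]:
    adidas[col + " (USD)"] = adidas[col] / 100

# Recompute the markdown as a percentage of the listing price, as a sanity check
# against the dataset's own Discount column.
adidas["Markdown (%)"] = (
    (adidas["Listing Price"] - adidas["Sale Price"]) / adidas["Listing Price"] * 100
).round(1)

print(adidas)
```

Matching the brand with a case-insensitive substring keeps rows whose brand label carries a sub-line suffix, if the raw file uses such labels.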


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data used to which they consent?
> There are no human subjects involved in our data collection.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> This study reflects pricing and marketing choices specific to only one brand/company because the data being analyzed comes only from Adidas. This limits an overall general analysis of the larger athleticwear industry and introduces potential bias. Additionally, our dataset(s) focus on U.S. pricing and consumer trends, overlooking regional variations in pricing and affordability, which may skew perceptions of product value across global markets. To limit any bias, we will clearly mention that the results from this project describe pricing strategies specific to Adidas pricing structure only and not the entire athleticwear industry to avoid overstating our conclusions.

 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> There is no personal information that is being collected in our data collection, and no privacy risks are present. All of the data is related to the products, not the individuals themselves.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
> We will check for unbalanced groups, especially with gender-specific items and the pricing associated with those products, to ensure that our model doesn't unintentionally reinforce bias or social stereotypes through Adidas’s marketing/pricing decisions.  

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
> No sensitive information is being stored. The datasets are from Kaggle and downloaded to a secure environment. 

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
> Not applicable to our project since the datasets were publicly shared by Kaggle and no personal information is stored or shared in the datasets.

 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> The data will be archived and/or deleted after our team finalizes analyzing and producing results - when the project concludes.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> Our dataset only includes Adidas products, so it lacks perspectives and inputs from other athleticwear companies, both larger and smaller. We will clearly state that our analysis focuses only on Adidas and doesn't represent the dynamics of the broader athleticwear industry.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> We will make sure that we check for any imbalanced classes/representation in certain product features (material, color, category, origin, gender-target) within our data before moving into the modeling process. If we find any uneven representation, we will make sure to be transparent about it in our results.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We will clearly and honestly state our data cleaning, analysis and modeling process steps and make sure that our visualizations and reports accurately reflect our findings. Our team will also be explicit in stating that the correlations we find are descriptive of Adidas’s dataset(s) and not causal.  

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
> No PII exists in our data, so there are no privacy risks or concerns regarding this.

 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> All steps of analysis and modeling will be well documented in a Jupyter Notebook and in the GitHub repository to allow any replication in the future.
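
The class-balance check mentioned in C.2 could look like the sketch below. The `gender` column name and the 15% threshold are assumptions chosen for illustration; the same check would be repeated for material, color, category, and origin.

```python
import pandas as pd

# Toy categorical column; `gender` is an assumed column name.
gender = pd.Series(
    ["Women", "Men", "Men", "Men", "Unisex", "Men", "Women", "Men"], name="gender"
)

# Share of products in each class.
shares = gender.value_counts(normalize=True)

# Flag classes below a chosen share threshold (15% here is an arbitrary assumption).
min_share = 0.15
underrepresented = shares[shares < min_share].index.tolist()

print(shares)
print("Underrepresented:", underrepresented)
```

Classes flagged this way can be reported alongside the results so that readers know which coefficients rest on few observations.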

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> Since the dataset includes a gender-target variable, we will analyze whether gendered products have consistent pricing patterns. We will check gender-specific products to avoid reinforcing any stereotypes. This will ensure that our results do not unintentionally reproduce or justify disparities within pricing.

 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
> We will check for any variations in pricing among gender-specific products to make sure that our results are neutral and unbiased. We will test whether pricing differences across categories are statistically significant and whether they reflect Adidas’s marketing decisions.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> We will use multiple linear regression to model how the features in our data relate to price, and we will report error metrics such as Mean Absolute Error (MAE) and/or Mean Squared Error (MSE) so that model performance is evaluated transparently.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Coefficients and features that are involved in regression and other visualizations will be used to show how each variable, such as material, color, category, origin, gender-target, influences the pricing of a product. This will ensure that our results are understandable for a general audience.


 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> We will communicate our findings very carefully to ensure that readers understand that our analysis applies only to Adidas’s dataset. This is because we are aware that misinterpreting our results as an attempt to generalize pricing across all brands could unintentionally reinforce misleading marketing assumptions. Our findings will be communicated to be specific to Adidas's marketing and pricing strategy, not pricing for products in the broader athleticwear industry. We will include clear disclaimers in our project to prevent any misunderstanding and communicate these limitations.
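
One way to test whether a price difference between two groups is statistically significant, as described in D.2, is a permutation test. The sketch below is one possible approach under stated assumptions, not the team's chosen method: `permutation_pvalue` is a hypothetical helper and the prices are invented.

```python
import numpy as np

def permutation_pvalue(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle group labels by permuting the pooled prices and re-splitting.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[: a.size].mean() - shuffled[a.size:].mean())
        count += diff >= observed
    # Add-one correction keeps the estimate away from an impossible p-value of 0.
    return (count + 1) / (n_permutations + 1)

# Toy prices in USD (illustrative values only).
womens = [42, 45, 47, 50, 52, 55, 58, 60]
mens = [30, 32, 35, 36, 38, 40, 41, 44]
p_value = permutation_pvalue(womens, mens)
print(f"p = {p_value:.4f}")
```

A small p-value only indicates that the gap is unlikely under random labeling. It does not show that the gap is unjustified, since the products may differ in material or category, which is why the regression controls for those attributes.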

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> Not applicable because our team is not planning to deploy this model.

 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
> Not applicable because individuals are not particularly affected by any outcomes of this model.

 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
> Not applicable because our model is not in production.

 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> We will clarify that our model should not be used to justify variations in pricing within different product categories.
> Although our model is not being deployed to the public, our findings could be misused if taken out of context. For instance, someone could use our analysis to justify pricing differences for products by gender. To prevent this kind of misuse and misinterpretation, we will explicitly state that our research does not justify, recommend, or endorse any discriminatory pricing strategies, especially for commercial use. We will include a clear disclaimer stating that it is unethical to use our results for commercial product pricing without proper equity testing, since doing so can perpetuate gender and/or economic inequalities.


## Team Expectations 

**Team Members:** Shourya Kulkarni, Pauline Shah, Muska Mesdaq, Jasmine Le

**Team Expectation 1:** Communication and Collaboration

Our team will communicate primarily through iMessage, providing frequent updates throughout the week. We’ll also hold FaceTime calls if necessary to check in on progress, assign tasks, and discuss any challenges. Everyone agrees to respond to messages within 24 hours on weekdays. We’ll keep communication clear and respectful, using a “blunt but polite” tone.

**Team Expectation 2:** Task Distribution and Accountability

Each team member will contribute equally to the project while focusing on their strengths. We’ll track progress using a shared Google Doc or GitHub Projects board. If someone is struggling to complete a task, they will notify the group at least one day in advance so adjustments can be made. Repeated lack of communication or contribution will be documented and reported if necessary.

**Team Expectation 3:** Respectful Conflict Resolution

We’ll address any disagreements calmly and constructively, discussing them through iMessage or FaceTime. If an issue cannot be resolved within the team, the professor will be contacted per course policy.

**Team Expectation 4:** Professionalism and Commitment

Each member commits to staying engaged, meeting deadlines, writing clear and well-commented code, and helping edit project documents. Everyone will maintain a respectful and collaborative attitude throughout the quarter.


## Project Timeline Proposal

| Week | Dates | Goals & Tasks | Deliverables / Check-ins |
|------|-------|---------------|--------------------------|
| **Week 5** | **Oct 28 – Nov 3** | • Finalize data science question and hypotheses.<br>• Search for and clean datasets.<br>• Assign roles (wrangling, visualization, writing, communication). | Clear research question, dataset confirmed, and GitHub repo organized. |
| **Week 6** | **Nov 4 – Nov 10** | • Begin **data wrangling and exploration**.<br>• Handle missing data, variable formatting, and merge datasets.<br>• Early exploratory plots to understand distributions. | Draft of wrangling and EDA sections pushed to GitHub. |
| **Week 7** | **Nov 11 – Nov 17** | • Conduct deeper **EDA and preliminary analysis** (correlations, trends, regressions, etc.).<br>• Meet to interpret findings and refine hypotheses.<br>• Begin writing “Methods” section. | EDA visuals and first analysis summary completed. |
| **Week 8** | **Nov 18 – Nov 24** | • Finalize main **analysis and modeling** (if applicable).<br>• Start writing “Results” and “Discussion.”<br>• Peer review teammates’ code and markdown explanations. | Completed analysis code and preliminary write-up in notebook. |
| **Week 9** | **Nov 25 – Dec 1** | • Create clean **visualizations** and polish narrative flow.<br>• Integrate feedback from TA / peers.<br>• Begin editing and proofreading entire notebook. | Draft of final report ready for revision. |
| **Week 10** | **Dec 2 – Dec 9** | • Final revisions, formatting, and submission prep.<br>• Double-check references, figure captions, and README file.<br>• Submit final project by deadline. | Final project submitted on GitHub and Canvas. |
| **Finals Week** | **Dec 10** | • Ensure project is properly turned in and accessible.<br>• Complete group reflection / survey if required. | Confirmation of successful submission and survey completion. |
