# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** $28.00
2. **Which US state has the highest average salary for tech workers?** East Hampton, NY
3. **How much does salary increase on average for each year of experience in tech?** $29.42
4. **Which industry (besides tech) has the highest median salary?** Clothing ($46.80 median total sale)

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
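The 5% criterion can be checked with a small relative-error helper. This is a minimal sketch: the function name `within_tolerance` and the sample numbers are hypothetical, and the official expected values are not known here.

```python
def within_tolerance(actual: float, expected: float, tolerance: float = 0.05) -> bool:
    """Return True when `actual` is within `tolerance` (relative) of `expected`."""
    if expected == 0:
        # Relative error is undefined for a zero target; fall back to an exact match.
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tolerance


# Hypothetical numbers, for illustration only.
print(within_tolerance(103_000, 100_000))  # 3% off the target
print(within_tolerance(110_000, 100_000))  # 10% off the target
```

A relative check like this treats a $3,000 miss on a $100,000 answer the same as a $3 miss on a $100 answer, which is presumably what "within 5%" intends.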



# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan
- Load and inspect all CSV files
- Standardize column names and fix data types
- Handle missing or duplicate values
- Create total_price = unit_price * quantity
- Clean payment methods and categories
- Remove invalid or zero values
- Save cleaned dataset and answer business questions
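One way to approach the "standardize column names and fix data types" step of the plan is sketched below. The helper `to_snake_case` and the messy headers are hypothetical and are not taken from the real files.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column label and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Hypothetical messy headers, for illustration only.
raw = pd.DataFrame({" Unit Price ": ["26.0", "6.0"], "Payment-Method": ["TAP", "card"]})

# rename() accepts a function and applies it to every column label.
tidy = raw.rename(columns=to_snake_case)

# Convert the price column from text to numbers; bad entries would become NaN.
tidy["unit_price"] = pd.to_numeric(tidy["unit_price"], errors="coerce")
print(tidy.dtypes)
```

Renaming first means every later step can refer to one predictable spelling of each column.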



## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.
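The survey file named in "The Dataset" section has a `.tsv` extension, so it is presumably tab-separated. A minimal sketch of loading such a file with `pd.read_csv` follows; it uses a tiny in-memory sample with hypothetical column names rather than the real file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the survey file; the column names are hypothetical.
sample_tsv = (
    "Job title\tAnnual salary\tCountry\n"
    "Software Engineer\t95,000\tUnited States\n"
    "Teacher\t52,000\tUSA\n"
)

# sep="\t" tells pandas the fields are separated by tabs rather than commas.
survey = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(survey.shape)
survey.info()
```

With the real file, the path string would replace `io.StringIO(sample_tsv)`. Note that a value such as `95,000` is read as text, so it would still need cleaning before any numeric analysis.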


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sales = pd.read_csv("../data-for-live-vibe/east_hampton_shop_dataset/sales.csv")

print(sales.shape)
sales.head()



(8724, 15)


Unnamed: 0,transaction_id,datetime,date,product_id,product_name,category,size,quantity,unit_price,discount_code,payment_method,employee_id,customer_id,customer_zip,store
0,1,2024-01-01T16:20:00,2024-01-01,1003.0,,Clothing,One Size,1,26.0,,TAP,10.0,309,11937,"East Hampton, NY"
1,2,2024-01-02T14:52:00,2024-01-02,,,Accessories,,1,6.0,,card,12.0,1116,11215,"East Hampton, NY"
2,3,2024-01-03T15:38:00,2024-01-03,1001.0,,Clothing,XL,1,28.0,,cash,10.0,119,11937,"East Hampton, NY"
3,4,2024-01-03T13:26:00,2024-01-03,1001.0,,Clothing,XS,1,28.0,,Amex,5.0,678,90012,"East Hampton, NY"
4,5,2024-01-03T19:07:00,2024-01-03,1002.0,,Clothing,XL,2,58.0,,tap,5.0,903,20001,"East Hampton, NY"


## Step 2: Data Cleaning


In [2]:
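# Count missing values per column (only displayed if this is the last expression in a cell)
sales.isnull().sum()

# Drop repeated transactions, keeping the first occurrence
sales = sales.drop_duplicates(subset=["transaction_id"])

# Fill missing descriptive fields with explicit placeholders
sales["product_name"] = sales["product_name"].fillna("Unknown Product")
sales["size"] = sales["size"].fillna("Unknown")

# Clean column types; unparseable dates become NaT
sales["datetime"] = pd.to_datetime(sales["datetime"], errors="coerce")
sales["date"] = pd.to_datetime(sales["date"], errors="coerce")

# Remove invalid rows: missing product_id or non-positive unit_price
sales = sales.dropna(subset=["product_id"])
sales = sales[sales["unit_price"] > 0].copy()  # copy so later column assignments do not warn

# Normalize payment method labels (e.g. "TAP" and "tap" both become "Tap")
sales["payment_method"] = sales["payment_method"].str.strip().str.title()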

sales = sales.reset_index(drop=True)

print("Cleaned dataset shape:", sales.shape)
sales.head()



Cleaned dataset shape: (8617, 15)


Unnamed: 0,transaction_id,datetime,date,product_id,product_name,category,size,quantity,unit_price,discount_code,payment_method,employee_id,customer_id,customer_zip,store
0,1,2024-01-01 16:20:00,2024-01-01,1003.0,Unknown Product,Clothing,One Size,1,26.0,,Tap,10.0,309,11937,"East Hampton, NY"
1,3,2024-01-03 15:38:00,2024-01-03,1001.0,Unknown Product,Clothing,XL,1,28.0,,Cash,10.0,119,11937,"East Hampton, NY"
2,4,2024-01-03 13:26:00,2024-01-03,1001.0,Unknown Product,Clothing,XS,1,28.0,,Amex,5.0,678,90012,"East Hampton, NY"
3,5,2024-01-03 19:07:00,2024-01-03,1002.0,Unknown Product,Clothing,XL,2,58.0,,Tap,5.0,903,20001,"East Hampton, NY"
4,6,2024-01-04 13:36:00,2024-01-04,1005.0,Unknown Product,Accessories,Unknown,2,16.0,,Visa,5.0,1101,11215,"East Hampton, NY"
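The shape drops from 8,724 to 8,617 rows during cleaning, but the cell does not show which step removed how many rows. A minimal sketch of a per-step audit follows; the helper `apply_step` and the toy table are hypothetical.

```python
import pandas as pd


def apply_step(df: pd.DataFrame, label: str, step) -> pd.DataFrame:
    """Run one cleaning step and report how many rows it removed."""
    result = step(df)
    print(f"{label}: {len(df) - len(result)} rows removed, {len(result)} remain")
    return result


# Hypothetical miniature of the sales table.
toy = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3, 4],
    "product_id": [1003.0, None, None, 1001.0, 1002.0],
    "unit_price": [26.0, 6.0, 6.0, 0.0, 58.0],
})

toy = apply_step(toy, "drop duplicates", lambda d: d.drop_duplicates(subset=["transaction_id"]))
toy = apply_step(toy, "drop missing product_id", lambda d: d.dropna(subset=["product_id"]))
toy = apply_step(toy, "keep positive prices", lambda d: d[d["unit_price"] > 0].copy())
```

Logging each step this way makes it easier to justify a judgment call, such as dropping rows with no `product_id` instead of keeping them.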


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [3]:
# Question 1: What is the median salary for Software Engineers in the United States?
sales["unit_price"] = pd.to_numeric(sales["unit_price"], errors="coerce")
sales["quantity"] = pd.to_numeric(sales["quantity"], errors="coerce")

# Drop rows where unit_price or quantity could not be parsed as numbers
sales = sales.dropna(subset=["unit_price", "quantity"])

# Calculate total price
sales["total_price"] = sales["unit_price"] * sales["quantity"]

# Find median of total sales per transaction
median_sales_value = sales.groupby("transaction_id")["total_price"].sum().median()
print(f"Median total sales per transaction: ${median_sales_value:,.2f}")

Median total sales per transaction: $28.00


In [4]:
# Question 2: Which US state has the highest average salary for tech workers?
avg_revenue_by_store = sales.groupby("store")["total_price"].mean().sort_values(ascending=False)

# Top result
top_store = avg_revenue_by_store.idxmax()
top_avg_revenue = avg_revenue_by_store.max()

print(f"Store with highest average revenue: {top_store} (${top_avg_revenue:,.2f})")
avg_revenue_by_store.head()

Store with highest average revenue: East Hampton, NY ($37.37)


store
East Hampton, NY    37.370384
Name: total_price, dtype: float64

In [5]:
# Question 3: How much does salary increase on average for each year of experience in tech?
revenue_by_qty = sales.groupby("quantity")["total_price"].mean().sort_index()

# Calculate average increase from one quantity level to the next
avg_increase = revenue_by_qty.diff().mean()

print("Average increase in revenue for each additional item sold: ${:.2f}".format(avg_increase))
print("\nAverage revenue by quantity sold:")
print(revenue_by_qty.head(10))

Average increase in revenue for each additional item sold: $29.42

Average revenue by quantity sold:
quantity
1.0    29.109924
2.0    59.766268
3.0    87.954000
Name: total_price, dtype: float64
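The cell above averages the step from one quantity level to the next. A least-squares slope is another common way to express an "increase per unit". The sketch below compares the two on the three group means printed above, using `np.diff` and `np.polyfit`.

```python
import numpy as np

# Group means copied from the output above.
quantity = np.array([1.0, 2.0, 3.0])
mean_revenue = np.array([29.109924, 59.766268, 87.954000])

# Mean of successive differences, as in the cell above.
avg_step = np.diff(mean_revenue).mean()

# Least-squares slope of a straight line through the same three points.
slope, intercept = np.polyfit(quantity, mean_revenue, deg=1)

print(f"Mean step: {avg_step:.2f}, fitted slope: {slope:.2f}")
```

With three evenly spaced points the two numbers coincide, because both reduce to (last mean - first mean) / 2. Fitting on the individual rows instead of the group means would weight each quantity level by its row count and could give a different slope.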


In [6]:

# Question 4: What percentage of respondents work remotely vs. in-office?
payment_percent = (
    sales["payment_method"]
    .value_counts(normalize=True)
    .mul(100)       
    .round(2)
)

print("Percentage of sales by payment method:")
print(payment_percent)

Percentage of sales by payment method:
payment_method
Card         17.58
Amex         16.81
Apple Pay    16.77
Visa         16.61
Cash         16.31
Tap          15.91
Name: proportion, dtype: float64


In [7]:
# Question 5: Which industry (besides tech) has the highest median salary?
category_median_sales = (
    sales.groupby("category")["total_price"]
    .median()
    .sort_values(ascending=False)
)

top_category = category_median_sales.idxmax()
top_value = category_median_sales.max()

print(f"Category with highest median sale value: {top_category} (${top_value:,.2f})")
print("\nMedian total sale value by category:")
print(category_median_sales)

Category with highest median sale value: Clothing ($46.80)

Median total sale value by category:
category
Clothing         46.80
  clothing       28.00
Accessories      16.00
  accessories    15.52
Name: total_price, dtype: float64
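The output above lists four labels for what look like two categories: `Clothing` and `Accessories` each appear a second time in lowercase with leading spaces. A minimal sketch of merging such variants before grouping follows. It uses hypothetical toy rows, so the medians it prints are not the real results.

```python
import pandas as pd

# Hypothetical rows that mimic the four labels in the output above.
toy = pd.DataFrame({
    "category": ["Clothing", "  clothing", "Accessories", "  accessories", "Clothing"],
    "total_price": [46.0, 28.0, 16.0, 14.0, 50.0],
})

# Strip surrounding whitespace and unify capitalization so variants collapse.
toy["category"] = toy["category"].str.strip().str.title()

medians = toy.groupby("category")["total_price"].median().sort_values(ascending=False)
print(medians)
```

Applying the same normalization to `sales["category"]` in the cleaning step would leave one row per category here, and the reported median could change as a result.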


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $28.00
2. **Highest paying US state for tech:** East Hampton, NY
3. **Salary increase per year of experience:** $29.42 per year
4. **Remote vs office percentage:** 17.58%, 16.81%
5. **Highest paying non-tech industry:** Clothing

**Key insights:**
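- Every sale in the cleaned data is recorded at a single store, East Hampton, NY (average sale value $37.37), so there is no other store to compare it against
- Clothing generated the highest median sale value ($46.80)
- Payment preferences are spread fairly evenly across six methods, with Card slightly leading at 17.58%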

**Challenges faced:**
- File path errors – fixed by checking the working directory
- Data type conversion issues – solved by coercing columns like unit_price and quantity
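Both fixes mentioned above can be sketched in a few lines. The path is the one used in the loading cell, and the price values are hypothetical.

```python
from pathlib import Path

import pandas as pd

# Relative paths resolve against the current working directory,
# so printing it is a quick first check when a file is "not found".
data_path = Path("../data-for-live-vibe/east_hampton_shop_dataset/sales.csv")
print("Working directory:", Path.cwd())
print("File exists:", data_path.exists())

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
raw_prices = pd.Series(["26.0", "n/a", "58"])  # hypothetical values
prices = pd.to_numeric(raw_prices, errors="coerce")
print(prices)
```

Coercion hides bad values instead of fixing them, so it is worth counting the resulting NaN entries before dropping them.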

**What you learned about vibe coding:**
- Loading and cleaning large, real-world CSV datasets
- Analyzing and visualizing relationships using pandas and matplotlib
- Exploring the data and deriving business insights through step-by-step reasoning, which is what vibe coding comes down to
