# **ACM GRADIENT FORGE: SHAPING THE FIRST METAL [PARTICIPANT'S NOTEBOOK]**

---

### **YOUR MISSION**

**Research Question:**  
*What factors influence romantic attraction and matching success in speed dating?*

**What You'll Do:**  
Follow along with the instructor as we explore the speed dating dataset together. You'll learn the complete data analysis workflow:

1. **Load and Preview** the dataset
2. **Profile** the data structure  
3. **Clean** messy data (duplicates, missing values)
4. **Engineer** meaningful features
5. **Visualize** key relationships
6. **Discover** insights about human attraction

**Let's find out what makes a successful match!**

---

## **I. INTRODUCTION - FIRST IMPRESSIONS**

**Dataset Name**: `speed_dating`

**Dataset Link**: https://www.kaggle.com/datasets/mexwell/speed-dating

> *Are You Dateable? The Data Science of First Impressions*



**Description:**

The dataset comes from Columbia University's 2002–2004 speed dating study, which ran 21 sessions with mostly young adults meeting potential partners. Each row represents one individual's record of a single speed date, including:

- `gender` - gender of the person being rated
  - 0: female
  - 1: male

- `age` - age during the speed date

- `income` - reported income level

- `goal` - what they're looking for
  - 1: seemed like a fun night out
  - 2: to meet new people
  - 3: to get a date
  - 4: looking for a serious relationship
  - 5: to say I did it
  - 6: other

- `career` - field of study or professional career

- `attr` - attractiveness rating (scale 1–10)

- `sinc` - sincerity rating (scale 1–10)

- `intel` - intelligence rating (scale 1–10)

- `fun` - fun / personality rating (scale 1–10)

- `amb` - ambitiousness rating (scale 1–10)

- `shar` - shared interests rating (scale 1–10)

- `like` - overall liking (scale 1–10)

- `prob` - how much did the rater think the interest was mutual? (scale 1–10)

- `met` - have they met before?
  - 1: yes
  - 2 or 0: no (varies by session coding)
  - NaN: missing value

- `dec` - final decision
  - 0: no (no match desired)
  - 1: yes (match desired)

---

## **II. DATA PROFILING - GETTING TO KNOW EACH OTHER**

**A. LOAD DATASET**

In [None]:
# ========================================
# IMPORT LIBRARIES
# ========================================
# pandas: Data manipulation and analysis framework
# numpy: Numerical computing library

import pandas as pd
import numpy as np

In [None]:
# ========================================
# LOAD THE SPEED DATING DATASET
# ========================================
# This loads our CSV file into a pandas DataFrame
# Each row represents one person's ratings of another during a speed date

df = pd.read_csv('speed_dating.csv')

**B. PREVIEW DATASET**

In [None]:
# ========================================
# PREVIEW THE FIRST FEW ROWS
# ========================================
# This gives us a quick look at the data structure and values
# Notice the mix of numerical ratings and categorical information
# ========================================

# YOUR CODE HERE: Display the first few rows

In [None]:
# PREVIEW LAST FEW ROWS OF DATA

# YOUR CODE HERE: Display the last few rows
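As a hint for the two preview cells above, here is a minimal sketch on a made-up toy frame. The `toy` DataFrame and its values are assumptions for illustration, not rows from `speed_dating.csv`. The same `head()` and `tail()` calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [21, 24, 27, 30, 33, 36, 22],
    "attr": [6, 7, 5, 8, 4, 9, 6],
    "dec": [0, 1, 0, 1, 0, 1, 0],
})

first_rows = toy.head()   # first 5 rows by default
last_rows = toy.tail(3)   # last 3 rows; pass any n you like

print(first_rows)
print(last_rows)
```

Both methods return a new DataFrame and leave the original untouched.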

**C. DATASET STRUCTURE**

In [None]:
# ========================================
# EXAMINE DATASET STRUCTURE
# ========================================
# info() shows us:
# - Number of entries (rows)
# - Data types of each column
# - Non-null counts (helps identify missing values)
# - Memory usage
# ========================================

# YOUR CODE HERE: Show dataframe info

In [None]:
# SEE DATASET SHAPE

# YOUR CODE HERE: Get shape (rows, columns)

In [None]:
# SEE DATASET COLUMNS ONLY

# YOUR CODE HERE: View column names
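The three structure checks in this section can be sketched on a small made-up frame as follows. The `toy` DataFrame is an assumption for illustration only. Note that `shape` and `columns` are attributes, so they take no parentheses, while `info()` is a method.

```python
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame({
    "age": [21.0, 24.0, None, 30.0],
    "career": ["law", None, "finance", "law"],
    "dec": [0, 1, 0, 1],
})

toy.info()                        # prints dtypes and non-null counts per column
n_rows, n_cols = toy.shape        # (rows, columns) tuple
column_names = list(toy.columns)  # column labels as a plain list

print(n_rows, n_cols, column_names)
```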

**D. INITIAL SUMMARY**

In [None]:
# ========================================
# STATISTICAL SUMMARY
# ========================================
# describe() generates descriptive statistics:
# - count: number of non-missing values
# - mean, std: central tendency and spread
# - min, max: range of values
# - 25%, 50%, 75%: quartiles (useful for understanding distributions)
# ========================================

# YOUR CODE HERE: Generate summary statistics
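A minimal sketch of `describe()` on made-up numbers (the `toy` frame is an assumption, not real study data). It shows that `count` excludes missing values, which is a quick way to spot incomplete columns.

```python
import pandas as pd

# Hypothetical ratings; the last 'fun' value is missing
toy = pd.DataFrame({
    "attr": [2, 4, 6, 8, 10],
    "fun": [5, 5, 6, 7, None],
})

summary = toy.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)
```

Here `summary.loc["count", "fun"]` is 4 rather than 5, because the missing entry is not counted.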

---
## **III. DATA CLEANING - SPOTTING THE RED FLAGS**

**A. REMOVE DUPLICATE ROWS**

In [None]:
# ========================================
# REMOVE DUPLICATE ENTRIES
# ========================================
# Duplicates can skew statistical analysis
# ========================================

# YOUR CODE HERE: Remove duplicate rows
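One possible approach, sketched on a made-up frame (the `toy` data is an assumption): count exact duplicate rows first, then drop them. Resetting the index is optional and shown here only to keep row labels contiguous.

```python
import pandas as pd

# Hypothetical frame where the second row repeats the first
toy = pd.DataFrame({
    "age": [21, 21, 25, 30],
    "attr": [6, 6, 7, 8],
})

n_dupes = int(toy.duplicated().sum())                   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keeps the first occurrence

print(n_dupes, len(deduped))
```

`drop_duplicates()` returns a new DataFrame, so on the real data the result needs to be assigned back (for example to `df`) to take effect.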

**B. IDENTIFY MISSING VALUES**

In [None]:
# ========================================
# IDENTIFY MISSING VALUES
# ========================================
# isna() creates a boolean mask of missing values
# sum() counts the True values (missing) for each column
# This helps us understand data quality and decide on imputation strategies
# ========================================

# YOUR CODE HERE: Count missing values
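The `isna().sum()` pattern described in the comments can be sketched on a made-up frame like this (the `toy` data is an assumption). Taking the mean of the same boolean mask gives the share of missing values per column, which is often easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns
toy = pd.DataFrame({
    "age": [21, np.nan, 25, np.nan],
    "career": ["law", None, "finance", "law"],
    "dec": [0, 1, 0, 1],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```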

**C. NUMERICAL IMPUTATION**

In [None]:
# ========================================
# IMPUTE NUMERICAL VARIABLES WITH MEDIAN
# ========================================
# Strategy: Use median (not mean) because:
# - Median is robust to outliers (e.g., a few very high incomes)
# - For rating scales, median preserves the "typical" value
# - Filled-in values are not pulled toward extreme observations
#
# We impute: age, income, and all rating variables (attr through prob)

numeric_cols = ['age', 'income', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob']

# YOUR CODE HERE: Fill missing values in df[numeric_cols] with each column's median and assign the result back to df[numeric_cols]
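A minimal sketch of per-column median imputation on made-up data. The `toy` frame and the `toy_numeric_cols` list are assumptions for illustration. Passing a Series of medians to `fillna` fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column
toy = pd.DataFrame({
    "age": [21, np.nan, 25, 40],
    "attr": [6, 8, np.nan, 10],
})
toy_numeric_cols = ["age", "attr"]

# Median of each column, ignoring missing values
medians = toy[toy_numeric_cols].median()

# fillna with a Series matches medians to columns by name
toy[toy_numeric_cols] = toy[toy_numeric_cols].fillna(medians)

print(toy)
```

In this toy case the gap in `age` becomes 25 and the gap in `attr` becomes 8.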

In [None]:
# ========================================
# RE-CHECK MISSING VALUES
# ========================================
# Run the same isna().sum() count again to confirm that the
# columns in numeric_cols no longer contain missing values
# ========================================

# YOUR CODE HERE: Count missing values

In [None]:
# HANDLE OTHER MISSING NUMERICAL VARIABLES
# 'met' is a coded yes/no field, so a missing answer is filled with 0 ("no")
# instead of a median

df['met'] = df['met'].fillna(0)

In [None]:
# YOUR CODE HERE: Count missing values

**D. CATEGORICAL IMPUTATION**

In [None]:
# ========================================
# IMPUTE CATEGORICAL VARIABLES WITH MODE
# ========================================
# Strategy: Use mode (most frequent value) for categorical data
# - mode()[0] gets the most common value (mode() returns a Series)
# - Preserves the distribution shape
# - Better than creating an 'unknown' category for this analysis
# ========================================

df['goal'] = df['goal'].fillna(df['goal'].mode()[0])

# YOUR CODE HERE: Impute missing values in career using mode
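The `goal` line above can be mirrored for any categorical column. Here is a sketch on a made-up `career` column (the values are assumptions, not the dataset's actual categories).

```python
import pandas as pd

# Hypothetical career column with two missing entries
toy = pd.DataFrame({"career": ["law", None, "finance", "law", None]})

# mode() returns a Series (there can be ties), so take the first entry
most_common = toy["career"].mode()[0]
toy["career"] = toy["career"].fillna(most_common)

print(toy["career"].value_counts())
```

If two values tie for most frequent, `mode()` returns them sorted, so `[0]` picks the first in sort order. That choice is arbitrary and worth keeping in mind.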

In [None]:
# ========================================
# FINAL CHECK FOR MISSING VALUES
# ========================================
# Run isna().sum() one last time after all imputation steps
# Every column that was imputed above should now report 0 missing values

df.isna().sum()

---

## **IV. FEATURE ENGINEERING - UNLOCKING CHEMISTRY**

**A. FEATURE SELECTION / REDUCTION**

In [None]:
# YOUR CODE HERE: Try dropping a column
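A minimal sketch of dropping a column, using a made-up frame. The `notes` column is hypothetical and does not exist in the speed dating data. Which column to drop from `df` is left to you.

```python
import pandas as pd

# Hypothetical frame with one column we do not need
toy = pd.DataFrame({
    "age": [21, 25],
    "attr": [6, 7],
    "notes": ["a", "b"],
})

# drop() returns a new DataFrame; the original is unchanged
trimmed = toy.drop(columns=["notes"])

print(list(trimmed.columns))
```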

**B. FEATURE CREATION**

In [None]:
# ========================================
# FEATURE ENGINEERING: CHARM SCORE
# ========================================
# Hypothesis: Attractiveness + Fun creates a "charm" composite
# Why combine these?
# - Together they capture "overall appeal" in first impressions
# - Simpler than using two separate features in analysis
#
# This creates a new column with values ranging from 2-20
# (each of the two ratings is on the 1-10 scale described above)
# ========================================

# YOUR CODE HERE: Create charm score feature by adding attr and fun
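The composite described above is a plain element-wise sum of two columns. This sketch uses made-up ratings, and the column name `charm` is an assumption. Later cells may expect a specific name, so follow the instructor's naming.

```python
import pandas as pd

# Hypothetical attractiveness and fun ratings
toy = pd.DataFrame({"attr": [6, 8, 10], "fun": [5, 9, 10]})

# Adding two Series aligns them row by row
toy["charm"] = toy["attr"] + toy["fun"]

print(toy)
```

If either rating is missing in a row, the sum is also missing, which is one reason the imputation step comes first.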

**C. FEATURE DISCRETIZATION - GROUPING / BINNING**

In [None]:
# ========================================
# FEATURE ENGINEERING: AGE GROUPS
# ========================================
# Convert continuous age into discrete bins for easier analysis
#
# bins=[18, 20, 26, 31, 36, 60]: defines the boundaries
# labels: provides meaningful names for each bin
# right=False: intervals are [left, right) - e.g., 18-20 includes 18, 19 but not 20
# Ages outside [18, 60) fall in no bin and become NaN
#
# Why bin ages? Easier to:
# - Visualize patterns across life stages
# - Compare match rates by age bracket
# - Account for non-linear age effects

df['age_group'] = pd.cut(
    df['age'],
    bins=[18, 20, 26, 31, 36, 60],
    labels=['18-19', '20-25', '26-30', '31-35', '36+'],
    right=False
)
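To see how the bin edges behave, the same `pd.cut` call can be applied to a handful of boundary ages. The `ages` Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([17, 18, 19, 20, 25, 26, 35, 36, 59, 60])

groups = pd.cut(
    ages,
    bins=[18, 20, 26, 31, 36, 60],
    labels=['18-19', '20-25', '26-30', '31-35', '36+'],
    right=False,   # left edge included, right edge excluded
)

print(pd.DataFrame({"age": ages, "age_group": groups}))
```

Ages 17 and 60 come back as NaN because they fall outside `[18, 60)`, so it is worth checking `df['age_group'].isna().sum()` after binning the real data.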

---
## **V. DATA VISUALIZATION & INSIGHTS - SEEING THE SPARKS & CONFESSION TIME**

In [None]:
# YOUR CODE HERE: Import matplotlib.pyplot and seaborn

In [None]:
# YOUR CODE HERE: Set the seaborn plot style to 'whitegrid'

**A. ATTRACTIVENESS VS FUN**

In [None]:
# ========================================
# ANALYZE DECISION DRIVERS: ATTRACTIVENESS VS FUN
# ========================================
# Create a scatterplot to visualize the relationship between 'Attractiveness'
# and 'Fun' ratings, color-coded by the final match decision.
# This reveals how these two specific traits influence a "Yes" vs. "No" result.

# Map binary decision (0/1) to readable labels for the legend
df['match_decision'] = df['dec'].map({0: 'No', 1: 'Yes'})

plt.figure(figsize=(8,6))
sns.scatterplot(
    data=df,
    x='attr',
    y='fun',
    hue='match_decision',
    palette={'No':'red', 'Yes':'green'},
    alpha=0.6            # Add transparency to handle overlapping points
)

plt.title('Attractiveness vs Fun')
plt.xlabel('Attractiveness Rating')
plt.ylabel('Fun Rating')
plt.legend(title='Match Decision')
plt.show()

**Key Findings:**
- [insert]
- [insert]

**B. CHARM VS DECISION**

In [None]:
# ========================================
# DISTRIBUTION OF "CHARM" SCORES BY OUTCOME
# ========================================
# Create a boxplot to compare the distribution of Charm (Attractiveness + Fun)
# scores between "Yes" and "No" decisions.
# This shows how much the typical score and its spread differ between the two groups.
# ========================================

# YOUR CODE HERE: Create distribution of charm score vs match decision boxplot
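One way to draw this comparison is `sns.boxplot`, sketched here on made-up scores. The `toy` frame and the `charm` column name are assumptions. The group medians are computed separately so the boxes can be read against actual numbers.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical charm scores for each decision outcome
toy = pd.DataFrame({
    "charm": [8, 10, 11, 12, 14, 15, 17, 18],
    "match_decision": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
})

fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(data=toy, x="match_decision", y="charm", ax=ax)
ax.set_title("Charm Score by Match Decision (toy data)")
ax.set_xlabel("Match Decision")
ax.set_ylabel("Charm Score")

# The line inside each box is the group median
medians = toy.groupby("match_decision")["charm"].median()
print(medians)
```

In a notebook the figure renders when the cell finishes. In a script, add `plt.show()` at the end.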

**Key Findings:**
- [insert]
- [insert]

**C. AVERAGE RATING BY AGE GROUP**

In [None]:
# ========================================
# COMPARE RATING PATTERNS ACROSS AGE GROUPS
# ========================================
# Create a heatmap showing how different age groups rate various attributes
# This helps us see if certain age groups value different qualities
#
# First, define which columns contain ratings

rating_cols = ['attr', 'sinc', 'intel', 'fun', 'amb', 'shar']

# Calculate mean rating for each attribute by age group
age_group_means = df.groupby('age_group', observed=True)[rating_cols].mean()

# YOUR CODE HERE: Create the heatmap of average ratings by age group
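A table shaped like `age_group_means` (age groups as rows, rating columns as columns) can be passed straight to `sns.heatmap`. The sketch below uses made-up averages in a hypothetical `toy_means` frame. The colormap and number format are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical mean ratings, indexed by age group
toy_means = pd.DataFrame(
    {"attr": [6.5, 6.1, 5.8], "fun": [6.8, 6.3, 6.0], "sinc": [7.0, 7.2, 7.4]},
    index=pd.Index(["20-25", "26-30", "31-35"], name="age_group"),
)

fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(
    toy_means,
    annot=True,     # write the value inside each cell
    fmt=".1f",      # one decimal place
    cmap="YlOrRd",  # sequential colormap: darker means higher
    ax=ax,
)
ax.set_title("Average Rating by Age Group (toy data)")
ax.set_xlabel("Rated Attribute")
ax.set_ylabel("Age Group")
```

With `annot=True` the exact averages stay readable even when color differences between neighboring cells are subtle.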

**Key Findings:**
- [insert]
- [insert]

---

### **CONGRATULATIONS!**
> *You have officially completed your first Data Alchemy session in the Gradient Forge. However, the forge never stays cold for long! Now that you have the tools, your next step is to apply what you've learned. Look out for the upcoming challenge!*