# Lab 3 – Grouping, Joining, and Reshaping

Survey: https://docs.google.com/forms/d/1GjFaaHNqYfTMPs0W4auMhbY0tJ2P_bGLI_Pc_xDP3mM/edit

In this lab, we’ll improve our data wrangling skills by learning three powerful Pandas techniques:

- **Grouping** → Summarizing data by categories (e.g., average commute time per section)
- **Joining** → Combining data from multiple tables (e.g., adding instructor info to each response)
- **Reshaping** → Pivoting or melting data between wide and long formats to make analysis easier

We’ll start with simple examples to see how each concept works.  
Then, we’ll apply them to our **Lab 3 Survey** results to generate useful summaries and insights.


In [3]:
import pandas as pd
import numpy as np
import re

# Load your data first (using your CSV or Google Sheet export link)
url = "https://docs.google.com/spreadsheets/d/1TdsMmeCbvNq2GHToHbDD67qaBXZ0LKSqj5eCkuDFNtY/export?format=csv&gid=942752529"
df = pd.read_csv(url)

# Rename columns to shorter, more usable names
df = df.rename(columns={
    "What is your academic year?": "Year",
    "what section are you in?": "Section",
    "What prerequisites did you take for this class?": "Prereqs",
    "What Track or Major Are You In?": "Track",
    "What courses are you taking?": "Courses",
    "How long does it take you to get to campus from home?": "CommuteTime",
    "What are your thoughts on the number 67?": "Num67Rating",
    "how many cups of coffee do you drink a week?": "CupsCoffeeWeekly"
})

# (Optional) Drop Timestamp since it has no data
df = df.drop(columns=["Timestamp"], errors="ignore")

print(df.head())
print(df.info())


     Year  Section   Prereqs                Track  \
0  Senior      1.0  SOCY 100         Data Science   
1  Junior      1.0  BSOS 233  Social Data Science   
2  Senior      1.0  INST 327                  NaN   
3  Senior      1.0  INST 327  Social Data Science   
4  Senior      1.0  INST 326         Data Science   

                                             Courses CommuteTime  Num67Rating  \
0   INST 346, INST 362, INST 354, INST 447, INST 414  25 minutes          3.0   
1                                 INST 366, INST 414  10 minutes         10.0   
2                                              alot   30 minutes         10.0   
3  NEUR398H, INST462, SURV400, BSCI330, INST447, ...  45 minutes          2.0   
4                 ECON422, ANTH222, INST414, INST447      1 hour          1.0   

  CupsCoffeeWeekly  
0                0  
1                0  
2               10  
3                0  
4                7  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 7

### Part 1 – We don't know how everyone answered the survey questions, so let's clean the data together.

In [4]:
df 

Unnamed: 0,Year,Section,Prereqs,Track,Courses,CommuteTime,Num67Rating,CupsCoffeeWeekly
0,Senior,1.0,SOCY 100,Data Science,"INST 346, INST 362, INST 354, INST 447, INST 414",25 minutes,3.0,0
1,Junior,1.0,BSOS 233,Social Data Science,"INST 366, INST 414",10 minutes,10.0,0
2,Senior,1.0,INST 327,,alot,30 minutes,10.0,10
3,Senior,1.0,INST 327,Social Data Science,"NEUR398H, INST462, SURV400, BSCI330, INST447, ...",45 minutes,2.0,0
4,Senior,1.0,INST 326,Data Science,"ECON422, ANTH222, INST414, INST447",1 hour,1.0,7
...,...,...,...,...,...,...,...,...
72,Junior,2.0,"SOCY 100, INST 314",Health Information,"INST326, HLTH200, INST335",15,5.0,5
73,Freshman,2.0,"AASP 101, STAT 100",Data Science,"INST201, INST326, MATH140",40,3.0,0
74,Senior,3.0,"ECON 201, GEOG 202",Social Data Science,"INST490, INST447, ECON330",5,10.0,7
75,Junior,1.0,"INST 201, INST 301",Cybersecurity & Privacy,"INST327, CMSC250, CMSC216",30,8.0,8


## Part 2 — Grouping

The first skill we’ll practice is **grouping** — summarizing data by categories.

In pandas, we use the `groupby()` method to split our data into groups, apply a function to each group, and combine the results.
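To make the split, apply, combine idea concrete, here is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are assumptions, not survey data). It computes the same per-section mean twice: once by looping over the groups by hand, and once with the usual one-line `groupby()` call.

```python
import pandas as pd

# Hypothetical toy data, not the real survey responses
toy = pd.DataFrame({
    "Section": [1, 1, 2, 2, 2],
    "CommuteTime": [10, 30, 20, 40, 60],
})

# Split and apply by hand: loop over each group and compute its mean
manual = {}
for section, group in toy.groupby("Section"):
    manual[section] = group["CommuteTime"].mean()

# Combine: collect the per-group results into one Series
manual = pd.Series(manual)

# The usual one-liner does all three steps at once
combined = toy.groupby("Section")["CommuteTime"].mean()

print(manual)
print(combined)
```

Both versions give one value per section; the one-liner is simply the shorter way to write the loop.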

### Example: Count students per academic year
This shows how many students responded in each `Year`.


In [5]:
# Count the number of students per Year
df.groupby("Year")["Year"].count()

Year
Freshman      3
Junior       12
Senior       57
Sophomore     3
Name: Year, dtype: int64

### Example: Average commute time per section
This shows how long it takes students in each `Section` to get to campus, on average.


In [6]:


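# Keep only digits and decimal points (e.g. "25 minutes" -> "25", "1 hour" -> "1")
df["CommuteTime"] = df["CommuteTime"].str.replace(r"[^0-9.]", "", regex=True)

# Convert the cleaned strings to numbers; anything unparseable becomes NaN
df["CommuteTime"] = pd.to_numeric(df["CommuteTime"], errors="coerce")

# Heuristic: treat values under 10 as hours and convert them to minutes.
# Caveat: a genuine 5-minute commute would be turned into 300 minutes.
df["CommuteMinutes"] = df["CommuteTime"].apply(lambda x: x * 60 if x < 10 else x)

# Note: this averages the raw cleaned numbers, so "1 hour" counts as 1.
# Group on "CommuteMinutes" instead to use the converted values.
df.groupby("Section")["CommuteTime"].mean()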


Section
1.0     88.315789
2.0    120.750000
3.0     24.461538
Name: CommuteTime, dtype: float64
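Stripping every non-digit character throws away the unit, so an answer like "1 hour" ends up as the number 1. One possible alternative, sketched below, is to read the unit before discarding it. The helper name `commute_to_minutes` and the sample answers are hypothetical, and the sketch assumes each answer contains at most one number (an answer such as "1 hour 30 min" would only count the first number).

```python
import re
import pandas as pd

def commute_to_minutes(text):
    """Convert a free-text commute answer to minutes (hypothetical helper)."""
    if pd.isna(text):
        return float("nan")
    s = str(text).lower()
    # Take the first number in the answer, if there is one
    match = re.search(r"\d+(?:\.\d+)?", s)
    if match is None:
        return float("nan")
    value = float(match.group())
    # Multiply by 60 only when the answer explicitly mentions hours
    if re.search(r"(?<![a-z])(?:h|hr|hrs|hour|hours)(?![a-z])", s):
        return value * 60
    return value

# Made-up answers covering the formats seen in free-text responses
sample = pd.Series(["25 minutes", "1 hour", "15", "5", None, "varies"])
minutes = sample.apply(commute_to_minutes)
print(minutes.tolist())
```

With this approach a bare "5" stays 5 minutes, and answers with no number at all become NaN rather than raising an error.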

### Example: Multiple aggregations at once
We can calculate several summary statistics in one go by using `.agg()`.


In [7]:
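# Keep only digits and decimal points in the coffee answers
df["CupsCoffeeWeekly"] = df["CupsCoffeeWeekly"].str.replace(r"[^0-9.]", "", regex=True)

# Convert to numbers; anything unparseable becomes NaN
df["CupsCoffeeWeekly"] = pd.to_numeric(df["CupsCoffeeWeekly"], errors="coerce")

# Map each column to the list of statistics we want for it
df.groupby("Year").agg({
    "CommuteTime": ["mean", "median"],
    "CupsCoffeeWeekly": ["mean", "max"]
})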

Year,CommuteTime mean,CommuteTime median,CupsCoffeeWeekly mean,CupsCoffeeWeekly max
Freshman,30.0,30.0,5.0,12.0
Junior,10.333333,10.0,14.166667,78.0
Senior,107.907407,27.5,7.462727,79.0
Sophomore,26.0,25.0,3.0,6.0
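Passing a dictionary of lists to `.agg()` produces two-level column headers, which can be awkward to work with. As a sketch of one alternative, pandas also supports named aggregation, where each output column gets its own flat name. The `toy` table and the output column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real survey responses
toy = pd.DataFrame({
    "Year": ["Junior", "Junior", "Senior", "Senior"],
    "CommuteTime": [10, 20, 30, 50],
    "CupsCoffeeWeekly": [0, 4, 7, 3],
})

# Named aggregation: new_column_name=(source_column, function)
summary = toy.groupby("Year").agg(
    commute_mean=("CommuteTime", "mean"),
    commute_median=("CommuteTime", "median"),
    coffee_mean=("CupsCoffeeWeekly", "mean"),
    coffee_max=("CupsCoffeeWeekly", "max"),
)

print(summary)
```

The result has ordinary single-level columns, so `summary["coffee_max"]` works without tuple indexing.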


In [8]:
# Create a column to determine eligibility
def check_prereqs(row):
    # Split the comma-separated answer into a set of upper-case course codes
    prereqs = {p.strip().upper() for p in str(row["Prereqs"]).split(",")}

    # Courses that are individually required
    has_stat100 = "STAT 100" in prereqs
    has_inst327 = "INST 327" in prereqs

    # Groups of alternative courses
    group1 = {"INST 201", "INST 301", "BSOS 233"}
    group2 = {"AASP 101", "ANTH 210", "ANTH 260", "ECON 200", "ECON 201", "GEOG 202", "GVPT 170", "PSYC 100", "SOCY 100"}
    group3 = {"BSOS 233", "INST 314"}
    group4 = {"BSOS 331", "GEOG 273", "INST 326"}

    # True when the student listed at least one course from the group
    has_group1 = bool(group1 & prereqs)
    has_group2 = bool(group2 & prereqs)
    has_group3 = bool(group3 & prereqs)
    has_group4 = bool(group4 & prereqs)

    track_ok = row["Track"] in ["Information Science", "Social Data Science"]

    # Eligible: both required courses, at least one group course, and a qualifying track
    return has_stat100 and has_inst327 and any([has_group1, has_group2, has_group3, has_group4]) and track_ok

# Apply function to each row
df["Eligible"] = df.apply(check_prereqs, axis=1)

# Check results
df[["Year", "Section", "Prereqs", "Track", "Eligible"]]


Unnamed: 0,Year,Section,Prereqs,Track,Eligible
0,Senior,1.0,SOCY 100,Data Science,False
1,Junior,1.0,BSOS 233,Social Data Science,False
2,Senior,1.0,INST 327,,False
3,Senior,1.0,INST 327,Social Data Science,False
4,Senior,1.0,INST 326,Data Science,False
...,...,...,...,...,...
72,Junior,2.0,"SOCY 100, INST 314",Health Information,False
73,Freshman,2.0,"AASP 101, STAT 100",Data Science,False
74,Senior,3.0,"ECON 201, GEOG 202",Social Data Science,False
75,Junior,1.0,"INST 201, INST 301",Cybersecurity & Privacy,False
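Exact string matching is sensitive to how a course code was typed: "INST327" and "INST 327" would count as different courses. The sketch below shows one way to normalize codes before comparing them. The helper names `normalize_course` and `parse_prereqs` are hypothetical, and the pattern assumes codes are four letters followed by three digits and an optional letter suffix.

```python
import re

def normalize_course(code):
    """Return a course code as 'DEPT 123' when it fits the assumed pattern."""
    cleaned = re.sub(r"\s+", "", str(code)).upper()
    match = re.fullmatch(r"([A-Z]{4})(\d{3}[A-Z]?)", cleaned)
    # Leave anything that does not look like a course code unchanged
    return f"{match.group(1)} {match.group(2)}" if match else cleaned

def parse_prereqs(text):
    """Turn a comma-separated answer into a set of normalized course codes."""
    if not isinstance(text, str):
        return set()  # missing answers (NaN) give an empty set
    return {normalize_course(part) for part in text.split(",") if part.strip()}

print(parse_prereqs("inst327, STAT 100"))
print(normalize_course("NEUR398H"))
```

A set built this way could be compared against the required-course sets without worrying about spacing or capitalization.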


### Practice Questions
1. Calculate the average rating of the number 67 for each Track.
2. Count how many students are in each Section.
3. Find which Year drinks the most coffee per week.
4. What percentage of students are eligible per Year?

In [9]:

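# Summarize all four questions at once for each (Track, Section, Year) combination.
# Group by a single column instead to answer one question on its own.
big = df.groupby(["Track", "Section", "Year"]).agg({
    "Num67Rating": ["mean"],              # Q1: average rating
    "Section": ["count"],                 # Q2: number of students
    "CupsCoffeeWeekly": ["mean"],         # Q3: average cups of coffee
    "Eligible": lambda x: 100 * x.mean()  # Q4: percentage eligible
})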

big



Track,Section,Year,Num67Rating mean,Section count,CupsCoffeeWeekly mean,Eligible <lambda>
Cybersecurity & Privacy,1.0,Junior,8.0,1,8.0,0.0
Cybersecurity & Privacy,1.0,Senior,7.25,4,3.25,0.0
Cybersecurity & Privacy,2.0,Sophomore,8.0,1,6.0,0.0
"Cybersecurity & Privacy, Data Science",1.0,Junior,6.0,1,78.0,0.0
"Cybersecurity & Privacy, Data Science",1.0,Senior,6.0,1,3.0,0.0
"Cybersecurity & Privacy, Data Science",2.0,Senior,9.0,2,7.0,0.0
"Cybersecurity & Privacy, Data Science",3.0,Senior,10.0,1,17.0,0.0
"Cybersecurity & Privacy, Data Science, None",1.0,Senior,7.0,1,2.0,0.0
Data Science,1.0,Freshman,5.5,2,7.5,0.0
Data Science,1.0,Junior,3.0,2,1.5,0.0
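Each practice question asks about a single category (Track, Section, or Year), so another way to approach them is one small `groupby()` per question. The sketch below uses a made-up `toy` table with the same column names as the cleaned survey data; the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data using the lab's column names
toy = pd.DataFrame({
    "Year": ["Junior", "Junior", "Senior", "Senior", "Senior"],
    "Section": [1, 2, 1, 1, 2],
    "Track": ["Data Science", "Social Data Science", "Data Science",
              "Data Science", "Social Data Science"],
    "Num67Rating": [4, 10, 6, 8, 2],
    "CupsCoffeeWeekly": [2, 6, 7, 1, 7],
    "Eligible": [False, True, False, True, True],
})

# Q1: average rating of the number 67 for each Track
rating_by_track = toy.groupby("Track")["Num67Rating"].mean()

# Q2: number of students in each Section
students_per_section = toy.groupby("Section").size()

# Q3: Year with the highest average weekly coffee consumption
coffee_by_year = toy.groupby("Year")["CupsCoffeeWeekly"].mean()
top_coffee_year = coffee_by_year.idxmax()

# Q4: percentage of eligible students per Year (mean of booleans = share of True)
pct_eligible_by_year = toy.groupby("Year")["Eligible"].mean() * 100

print(rating_by_track, students_per_section, top_coffee_year, pct_eligible_by_year, sep="\n\n")
```

Grouping by one column at a time keeps each answer to a short Series that can be read directly.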


# Part 3 — Joining

Now that we’ve grouped and checked prerequisites, let’s learn **joining** — combining two datasets together.  

This is super common when your main data (like survey results) needs extra information from another table.  

---

## Example: Adding Instructor & Room Info by Section

Let's say we have a separate table with information about each section, the instructor, and the classroom.  
We can "join" this information to the survey responses so each row shows where that student’s section meets.

We'll explore **four types of joins** in Pandas:

- **Inner Join** → keep only rows with matching keys in *both* tables  
- **Left Join** → keep *all* rows from the left (main) table, add matches if they exist  
- **Right Join** → keep *all* rows from the right (lookup) table, add matches if they exist  
- **Outer Join** → keep *all* rows from *both* tables, fill with NaN if no match
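The difference between the four join types only shows up when the keys do not fully overlap. As a minimal sketch, the two toy tables below (hypothetical names and values) share sections 1 and 2, while section 9 appears only for a student and section 3 appears only in the lookup table.

```python
import pandas as pd

# Hypothetical toy tables whose keys only partly overlap
students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Section": [1, 2, 9],   # section 9 is not in the lookup table
})
sections = pd.DataFrame({
    "Section": [1, 2, 3],   # nobody is enrolled in section 3
    "Room": ["A", "B", "C"],
})

# Run the same merge with each join type and record the row count
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(students, sections, on="Section", how=how)
    row_counts[how] = len(merged)

print(row_counts)
```

The inner join keeps only the two matching sections, the left and right joins each keep three rows (all students or all sections), and the outer join keeps all four distinct sections.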


In [10]:
# Example Section Info Table
section_info = pd.DataFrame({
    "Section": [1, 2, 3],
    "Instructor": ["Wei Ai", "Wei Ai", "Wei Ai"],
    "Room": ["TMH 0301 & HBK 0302J", "TMH 0301 & HBK 0302H", "TMH 0301 & HBK 0302H"]
})

section_info


Unnamed: 0,Section,Instructor,Room
0,1,Wei Ai,TMH 0301 & HBK 0302J
1,2,Wei Ai,TMH 0301 & HBK 0302H
2,3,Wei Ai,TMH 0301 & HBK 0302H


In [11]:
# Inner Join (only students whose section exists in section_info)
df_inner = pd.merge(df, section_info, on="Section", how="inner")
df_inner.head()


Unnamed: 0,Year,Section,Prereqs,Track,Courses,CommuteTime,Num67Rating,CupsCoffeeWeekly,CommuteMinutes,Eligible,Instructor,Room
0,Senior,1.0,SOCY 100,Data Science,"INST 346, INST 362, INST 354, INST 447, INST 414",25.0,3.0,0.0,25.0,False,Wei Ai,TMH 0301 & HBK 0302J
1,Junior,1.0,BSOS 233,Social Data Science,"INST 366, INST 414",10.0,10.0,0.0,10.0,False,Wei Ai,TMH 0301 & HBK 0302J
2,Senior,1.0,INST 327,,alot,30.0,10.0,10.0,30.0,False,Wei Ai,TMH 0301 & HBK 0302J
3,Senior,1.0,INST 327,Social Data Science,"NEUR398H, INST462, SURV400, BSCI330, INST447, ...",45.0,2.0,0.0,45.0,False,Wei Ai,TMH 0301 & HBK 0302J
4,Senior,1.0,INST 326,Data Science,"ECON422, ANTH222, INST414, INST447",1.0,1.0,7.0,60.0,False,Wei Ai,TMH 0301 & HBK 0302J


In [12]:
# Left Join (keep all students, add section info where possible)
df_left = pd.merge(df, section_info, on="Section", how="left")
df_left.head()

Unnamed: 0,Year,Section,Prereqs,Track,Courses,CommuteTime,Num67Rating,CupsCoffeeWeekly,CommuteMinutes,Eligible,Instructor,Room
0,Senior,1.0,SOCY 100,Data Science,"INST 346, INST 362, INST 354, INST 447, INST 414",25.0,3.0,0.0,25.0,False,Wei Ai,TMH 0301 & HBK 0302J
1,Junior,1.0,BSOS 233,Social Data Science,"INST 366, INST 414",10.0,10.0,0.0,10.0,False,Wei Ai,TMH 0301 & HBK 0302J
2,Senior,1.0,INST 327,,alot,30.0,10.0,10.0,30.0,False,Wei Ai,TMH 0301 & HBK 0302J
3,Senior,1.0,INST 327,Social Data Science,"NEUR398H, INST462, SURV400, BSCI330, INST447, ...",45.0,2.0,0.0,45.0,False,Wei Ai,TMH 0301 & HBK 0302J
4,Senior,1.0,INST 326,Data Science,"ECON422, ANTH222, INST414, INST447",1.0,1.0,7.0,60.0,False,Wei Ai,TMH 0301 & HBK 0302J


In [13]:
# Right Join (keep all section_info rows, add students if they exist)
df_right = pd.merge(df, section_info, on="Section", how="right")
df_right.head()


Unnamed: 0,Year,Section,Prereqs,Track,Courses,CommuteTime,Num67Rating,CupsCoffeeWeekly,CommuteMinutes,Eligible,Instructor,Room
0,Senior,1.0,SOCY 100,Data Science,"INST 346, INST 362, INST 354, INST 447, INST 414",25.0,3.0,0.0,25.0,False,Wei Ai,TMH 0301 & HBK 0302J
1,Junior,1.0,BSOS 233,Social Data Science,"INST 366, INST 414",10.0,10.0,0.0,10.0,False,Wei Ai,TMH 0301 & HBK 0302J
2,Senior,1.0,INST 327,,alot,30.0,10.0,10.0,30.0,False,Wei Ai,TMH 0301 & HBK 0302J
3,Senior,1.0,INST 327,Social Data Science,"NEUR398H, INST462, SURV400, BSCI330, INST447, ...",45.0,2.0,0.0,45.0,False,Wei Ai,TMH 0301 & HBK 0302J
4,Senior,1.0,INST 326,Data Science,"ECON422, ANTH222, INST414, INST447",1.0,1.0,7.0,60.0,False,Wei Ai,TMH 0301 & HBK 0302J


In [14]:
# Outer Join (keep everything from both tables)
df_outer = pd.merge(df, section_info, on="Section", how="outer")
df_outer.head()


Unnamed: 0,Year,Section,Prereqs,Track,Courses,CommuteTime,Num67Rating,CupsCoffeeWeekly,CommuteMinutes,Eligible,Instructor,Room
0,Senior,1.0,SOCY 100,Data Science,"INST 346, INST 362, INST 354, INST 447, INST 414",25.0,3.0,0.0,25.0,False,Wei Ai,TMH 0301 & HBK 0302J
1,Junior,1.0,BSOS 233,Social Data Science,"INST 366, INST 414",10.0,10.0,0.0,10.0,False,Wei Ai,TMH 0301 & HBK 0302J
2,Senior,1.0,INST 327,,alot,30.0,10.0,10.0,30.0,False,Wei Ai,TMH 0301 & HBK 0302J
3,Senior,1.0,INST 327,Social Data Science,"NEUR398H, INST462, SURV400, BSCI330, INST447, ...",45.0,2.0,0.0,45.0,False,Wei Ai,TMH 0301 & HBK 0302J
4,Senior,1.0,INST 326,Data Science,"ECON422, ANTH222, INST414, INST447",1.0,1.0,7.0,60.0,False,Wei Ai,TMH 0301 & HBK 0302J


In [15]:
df_joined = pd.merge(df, section_info, on="Section", how="left")
df_joined.head()

Unnamed: 0,Year,Section,Prereqs,Track,Courses,CommuteTime,Num67Rating,CupsCoffeeWeekly,CommuteMinutes,Eligible,Instructor,Room
0,Senior,1.0,SOCY 100,Data Science,"INST 346, INST 362, INST 354, INST 447, INST 414",25.0,3.0,0.0,25.0,False,Wei Ai,TMH 0301 & HBK 0302J
1,Junior,1.0,BSOS 233,Social Data Science,"INST 366, INST 414",10.0,10.0,0.0,10.0,False,Wei Ai,TMH 0301 & HBK 0302J
2,Senior,1.0,INST 327,,alot,30.0,10.0,10.0,30.0,False,Wei Ai,TMH 0301 & HBK 0302J
3,Senior,1.0,INST 327,Social Data Science,"NEUR398H, INST462, SURV400, BSCI330, INST447, ...",45.0,2.0,0.0,45.0,False,Wei Ai,TMH 0301 & HBK 0302J
4,Senior,1.0,INST 326,Data Science,"ECON422, ANTH222, INST414, INST447",1.0,1.0,7.0,60.0,False,Wei Ai,TMH 0301 & HBK 0302J


### Join Questions.

1. **Scenario:** Some students didn’t enter their section, but we still want to keep their survey responses.  
   ➜ Which join should we use?
   
Left join (students left joined with sections)

2. **Scenario:** We only want students in sections that actually exist (no typos or missing sections).  
   ➜ Which join should we use?

Inner join (students inner joined with sections)

3. **Scenario:** We want to see if there are any sections that no one signed up for.  
   ➜ Which join should we use?

Right join (students right joined with sections). A left join gives the same result if the sections table is placed first. Sections with no students show up as rows whose student columns are NaN.


4. **Scenario:** We want to see *all* students and *all* sections, even if there’s no match.  
   ➜ Which join should we use?

Outer join (students outer joined with sections)
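To check scenarios like these in code, `pd.merge` accepts `indicator=True`, which adds a `_merge` column saying whether each row came from the left table only, the right table only, or both. The sketch below uses hypothetical toy tables: one student has no section, and two sections have no students.

```python
import pandas as pd

# Hypothetical toy tables; NaN marks a student who did not enter a section
students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Section": [1.0, 1.0, float("nan")],
})
sections = pd.DataFrame({
    "Section": [1.0, 2.0, 3.0],
    "Room": ["A", "B", "C"],
})

# Outer join keeps everything; the indicator column records where each row came from
checked = pd.merge(students, sections, on="Section", how="outer", indicator=True)

# Sections that no student signed up for
empty_sections = sorted(checked.loc[checked["_merge"] == "right_only", "Section"].tolist())

# Students whose section is missing or not in the lookup table
students_without_section = checked.loc[checked["_merge"] == "left_only", "Name"].tolist()

print(empty_sections)
print(students_without_section)
```

Filtering on `_merge` turns the join into a quick data-quality check for both tables.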




# Part 4 — Reshaping Data (Pivoting & Melting)

**Reshaping** means changing the *layout* of your dataset without changing the underlying information.  

You can think of it as reorganizing the same data in a new shape to make it easier to:
- **Summarize** values by categories  
- **Compare** groups side-by-side  
- **Visualize** multiple variables in a single plot  

The two most common reshaping techniques in Pandas are:

1. **Pivot / Pivot Table** — turns "long" data into a "wide" matrix with rows & columns
2. **Melt** — turns "wide" data into a "long" format, stacking columns into rows
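As a small illustration that reshaping changes the layout but not the information, the sketch below pivots a made-up long table into wide format and then melts it back. The table names and values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Year, Metric) pair
long_df = pd.DataFrame({
    "Year": ["Junior", "Junior", "Senior", "Senior"],
    "Metric": ["Commute", "Coffee", "Commute", "Coffee"],
    "Value": [10, 4, 30, 7],
})

# Long -> wide: one row per Year, one column per Metric
wide_df = long_df.pivot(index="Year", columns="Metric", values="Value")

# Wide -> long: stack the metric columns back into rows
back_to_long = wide_df.reset_index().melt(
    id_vars="Year", var_name="Metric", value_name="Value"
)

print(wide_df)
print(back_to_long)
```

`pivot` requires each (index, column) pair to appear only once; when there are duplicates that need aggregating, `pivot_table` with an `aggfunc` is the tool to reach for.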


In [16]:
# Example: Pivot Table (Wide Format)
# ---------------------------------
# Let's see the average number of cups of coffee per week,
# grouped by academic year (rows) and section (columns).

coffee_pivot = df.pivot_table(
    values="CupsCoffeeWeekly",   # what we want to summarize
    index="Year",             # rows
    columns="Section",        # columns
    aggfunc="mean"            # how to aggregate the values
)

coffee_pivot


Section,1.0,2.0,3.0
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Freshman,7.5,0.0,
Junior,12.0,5.0,23.0
Senior,8.142857,4.117647,11.245
Sophomore,2.0,6.0,1.0


In [19]:
# Example: Melt (Long Format)
# ---------------------------
# Suppose we want to plot multiple numeric columns more easily.
# We can "melt" them into a long format.

df_long = df.melt(
    id_vars=["Year", "Section"],                 # columns to keep
    value_vars=["CommuteTime", "CupsCoffeeWeekly"], # columns to unpivot
    var_name="Metric",                           # new column with metric names
    value_name="Value"                           # new column with values
)

df_long.head(10)


Unnamed: 0,Year,Section,Metric,Value
0,Senior,1.0,CommuteTime,25.0
1,Junior,1.0,CommuteTime,10.0
2,Senior,1.0,CommuteTime,30.0
3,Senior,1.0,CommuteTime,45.0
4,Senior,1.0,CommuteTime,1.0
5,Senior,1.0,CommuteTime,40.0
6,Junior,1.0,CommuteTime,15.0
7,Junior,1.0,CommuteTime,2.0
8,Junior,1.0,CommuteTime,10.0
9,Senior,1.0,CommuteTime,30.0


### Practice Questions (Choose the Right Tool; it doesn't have to be just melt or pivot)

1. Show the average rating of the number 67 for each Track in a way that makes it easy to compare Tracks side by side.

2. Compare the maximum number of cups of coffee consumed per week by each Year. Which Year has the person who drinks the most coffee?

3. Put commute times and coffee consumption into the same column.

4. Add number 67 ratings to your reshaped data as well and compare how each metric (Commute, Coffee, Num67Rating) differs by Year by comparing their average.
  
    ***hint: use .unstack() to unstack the metrics***
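To illustrate the hint, here is a minimal sketch of what `.unstack()` does after a two-key `groupby()`. The `long_df` table is made up for the example. Grouping by two columns gives a Series with a two-level index, and unstacking moves one of those levels into the columns.

```python
import pandas as pd

# Hypothetical long-format data with repeated (Year, Metric) pairs
long_df = pd.DataFrame({
    "Year": ["Junior", "Junior", "Junior", "Junior", "Senior", "Senior"],
    "Metric": ["Coffee", "Coffee", "Commute", "Commute", "Coffee", "Commute"],
    "Value": [2, 6, 10, 20, 7, 30],
})

# Two grouping keys -> a Series indexed by (Year, Metric)
stacked = long_df.groupby(["Year", "Metric"])["Value"].mean()

# Move the "Metric" level from the index into the columns
unstacked = stacked.unstack("Metric")

print(stacked)
print(unstacked)
```

The unstacked result has one row per Year and one column per Metric, which makes the averages easy to compare across years.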


In [18]:
# Q1: average Num67Rating per Track, sorted so the Tracks are easy to compare

track_67_ratings = df.pivot_table(
    values='Num67Rating',  # column
    index='Track', # rows
    aggfunc='mean'  
).sort_values('Num67Rating', ascending=False) # descending order

print(track_67_ratings)

# comparing maximum cups of coffee consumed per week by each year

coffee_by_year = df.groupby('Year')['CupsCoffeeWeekly'].max().sort_values(ascending=False)

# year with the person who drinks the most coffee

max_coffee_year = coffee_by_year.idxmax() # index of max value
max_coffee_amount = coffee_by_year.max() # max value

print(f"{max_coffee_year} year has the heaviest coffee drinker: {max_coffee_amount} cups/week")
print(coffee_by_year)

# putting commute and coffee into the same column

melted_data = df.melt(
    id_vars=['Year', 'Track'],  
    value_vars=['CommuteTime', 'CupsCoffeeWeekly'], # columns to stack into one
    var_name='Metric',  
    value_name='Value'  
)

print(melted_data.head())

# melting data for combination again but with rating now

metrics_melted = df.melt(
    id_vars=['Year', 'Track'],
    value_vars=['CommuteTime', 'CupsCoffeeWeekly', 'Num67Rating'],  # Include all three metrics
    var_name='Metric',
    value_name='Value'
)

# pivot table showing average of each metric grouped by year

yearly_metrics = metrics_melted.pivot_table(
    index='Year',
    columns='Metric',
    values='Value',
    aggfunc='mean'
)

# comparison point (average)

print("Average metrics by Year:")
print(yearly_metrics)

# unstacked version

unstacked_metrics = metrics_melted.groupby(['Year', 'Metric'])['Value'].mean().unstack()
print("\nUnstacked data:")
print(unstacked_metrics)

# comparison: average of the yearly means for each metric, highest first

print("\nMetric averages by years:")
print(yearly_metrics.mean().sort_values(ascending=False)) # descending order

                                             Num67Rating
Track                                                   
Digital Curation                                9.000000
Cybersecurity & Privacy, Data Science           8.000000
Cybersecurity & Privacy                         7.500000
Social Data Science                             7.312500
Cybersecurity & Privacy, Data Science, None     7.000000
Data Science                                    6.757576
Health Information                              5.500000
Senior year has the heaviest coffee drinker: 79.0 cups/week
Year
Senior       79.0
Junior       78.0
Freshman     12.0
Sophomore     6.0
Name: CupsCoffeeWeekly, dtype: float64
     Year                Track       Metric  Value
0  Senior         Data Science  CommuteTime   25.0
1  Junior  Social Data Science  CommuteTime   10.0
2  Senior                  NaN  CommuteTime   30.0
3  Senior  Social Data Science  CommuteTime   45.0
4  Senior         Data Science  CommuteTime    1.0
Avera