# 🧭 Guided Pandas Assessment: Intro & Subset Selection
---
This guided notebook challenges you to apply your skills in **pandas basics** and **data selection**.
Each dataset tells a short real-world story, and your goal is to extract meaningful insights.

Datasets used:
- `bikes.csv` – Chicago Divvy rides
- `movie.csv` – Movie data from IMDb/The Numbers
- `sales_transactions.csv` – Synthetic business dataset

Each section progresses from intermediate to advanced level with clues provided along the way.
---

## 🚴 Part 1: The Winter Riding Challenge

### 1️⃣ Load and inspect the `bikes` dataset.
Display the shape, columns, and a few sample rows.

> 💡 *Hint:* Use `pd.read_csv()` and `.head()`.

In [3]:
import pandas as pd 

bikes = pd.read_csv('../data/bikes.csv')
bikes.head()

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
3,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,19.0,Clark St & Randolph St,31.0,72.0,16.1,mostlycloudy
4,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,19.0,Damen Ave & Pierce Ave,19.0,73.0,17.3,partlycloudy


In [5]:
bikes.shape

(50089, 11)

In [6]:
bikes.columns

Index(['gender', 'starttime', 'stoptime', 'tripduration', 'from_station_name',
       'start_capacity', 'to_station_name', 'end_capacity', 'temperature',
       'wind_speed', 'events'],
      dtype='object')

In [74]:
bikes.describe()

,tripduration,start_capacity,end_capacity,temperature,wind_speed
count,50089.0,50083.0,50077.0,50089.0,50089.0
mean,716.867755,21.340215,21.241708,62.608237,7.070111
std,1319.849896,7.634167,7.556756,48.151252,178.93798
min,60.0,0.0,0.0,-9999.0,-9999.0
25%,356.0,15.0,15.0,52.0,6.9
50%,572.0,19.0,19.0,66.9,10.4
75%,906.0,23.0,23.0,75.9,12.7
max,86188.0,55.0,55.0,96.1,42.6
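
The `min` of -9999.0 for both `temperature` and `wind_speed` looks like a placeholder for missing readings rather than real weather. That reading is an assumption; the dataset's documentation is not quoted here. A minimal sketch, on a small made-up frame, of converting such sentinels to `NaN` so they do not distort later filters and statistics:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the bikes data
sample = pd.DataFrame({
    "temperature": [73.9, -9999.0, 30.0],
    "wind_speed": [12.7, 6.9, -9999.0],
})

# Assumption: -9999 marks a missing reading, not a real measurement
cleaned = sample.replace(-9999.0, np.nan)

# NaN values are skipped by .min() and .mean(), and compare as False in masks
print(cleaned.min())
```

If the sentinel is left in place, a mask such as `temperature < 32` would also select those rows.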


### 2️⃣ Identify all rides that occurred during freezing temperatures (below 32°F).

> 💡 *Hint:* Use a Boolean mask like `df['temperature'] < 32`.

In [75]:
freezing_and_below = bikes[bikes['temperature'] < 32]

print(f"Total freezing rides: {len(freezing_and_below)}")
print(f"Percentage of total: {len(freezing_and_below)/len(bikes):.2%}")
freezing_and_below.head()

Total freezing rides: 2905
Percentage of total: 5.80%


,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events,temp_category
1306,Male,2013-10-22 06:00:00,2013-10-22 06:05:00,308,Orleans St & Elm St,15.0,Michigan Ave & Oak St,15.0,30.0,4.6,partlycloudy,cold
1354,Male,2013-10-25 06:52:00,2013-10-25 06:59:00,452,Halsted St & Waveland Ave,15.0,Sheffield Ave & Wellington Ave,23.0,30.0,6.9,partlycloudy,cold
1355,Male,2013-10-25 07:18:00,2013-10-25 07:35:00,1011,Clinton St & Washington Blvd,31.0,Cityfront Plaza & N Water St,19.0,30.0,5.8,partlycloudy,cold
1356,Male,2013-10-25 07:57:00,2013-10-25 08:12:00,862,Milwaukee Ave & Wabansia Ave,15.0,Larrabee St & Menomonee St,15.0,30.0,5.8,partlycloudy,cold
1529,Female,2013-11-07 06:24:00,2013-11-07 06:47:00,1386,California Ave & Division St,15.0,Ogden Ave & Chicago Ave,19.0,30.0,4.6,clear,cold


### 3️⃣ Create a function `classify_temp(temp)` that categorizes temperatures as `'cold'`, `'mild'`, or `'hot'`.
Apply it to create a new column `temp_category`.

> 💡 *Hint:* Use `apply()` with a custom function.

In [78]:
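def classify_temp(temp):
    """Label a temperature in °F as 'cold' (< 68), 'mild' (68 to < 86), or 'hot' (>= 86)."""
    if not isinstance(temp, (int, float)):
        raise TypeError(f"Temperature must be numeric, got {type(temp).__name__}")

    if temp < 68:
        return 'cold'
    elif temp < 86:  # temp >= 68 is already guaranteed at this point
        return 'mild'
    else:
        return 'hot'

# Apply the function element-wise to build the new column
bikes['temp_category'] = bikes['temperature'].apply(classify_temp)

bikes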

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events,temp_category
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy,mild
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy,mild
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy,mild
3,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,19.0,Clark St & Randolph St,31.0,72.0,16.1,mostlycloudy,mild
4,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,19.0,Damen Ave & Pierce Ave,19.0,73.0,17.3,partlycloudy,mild
...,...,...,...,...,...,...,...,...,...,...,...,...
50084,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,27.0,Clark St & Elm St,27.0,5.0,16.1,partlycloudy,cold
50085,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),16.0,Union Ave & Root St,11.0,5.0,16.1,partlycloudy,cold
50086,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,31.0,Halsted St & Blackhawk St (*),20.0,5.0,16.1,partlycloudy,cold
50087,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,23.0,Kingsbury St & Kinzie St,31.0,7.0,11.5,partlycloudy,cold
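
For larger frames, a vectorized alternative to `apply()` is `pd.cut`, which bins a whole column at once. The sketch below reuses the 68 and 86 thresholds from `classify_temp` on a few made-up temperatures; note that it yields a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures chosen to sit on and around the thresholds
temps = pd.Series([5.0, 67.9, 68.0, 85.9, 86.0])

# right=False closes each bin on the left: [-inf, 68), [68, 86), [86, inf)
categories = pd.cut(
    temps,
    bins=[-np.inf, 68, 86, np.inf],
    labels=["cold", "mild", "hot"],
    right=False,
)
print(categories.tolist())
```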


### 4️⃣ Using your new column, filter all rides labeled `'cold'` **and** with `tripduration` greater than 1000 seconds.

> 💡 *Hint:* Combine Boolean masks using `&`.

In [27]:
filt1 = bikes['temp_category'] == 'cold'
filt2 = bikes['tripduration'] > 1000

long_cold_ride = bikes[filt1 & filt2]

# Alternative: the same filter with both conditions written inline
long_cold_ride = bikes[(bikes['temp_category'] == 'cold') & (bikes['tripduration'] > 1000)]

long_cold_ride

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events,temp_category
26,Male,2013-07-12 01:07:00,2013-07-12 01:24:00,1043,State St & Harrison St,19.0,Racine Ave & 18th St,15.0,64.9,0.0,clear,cold
95,Male,2013-07-24 07:49:00,2013-07-24 08:11:00,1329,Wood St & Division St,15.0,Dearborn St & Monroe St,23.0,64.9,10.4,partlycloudy,cold
112,Male,2013-07-26 19:10:00,2013-07-26 19:33:00,1395,Larrabee St & Kingsbury St,27.0,Damen Ave & Pierce Ave,19.0,66.9,12.7,rain,cold
113,Female,2013-07-26 21:14:00,2013-07-26 21:37:00,1403,Peoria St & Jackson Blvd,19.0,Orleans St & Merchandise Mart Plaza,23.0,64.9,4.6,mostlycloudy,cold
117,Male,2013-07-27 14:12:00,2013-07-27 14:44:00,1920,Cannon Dr & Fullerton Ave,15.0,Lake Shore Dr & Ohio St,19.0,62.1,15.0,cloudy,cold
...,...,...,...,...,...,...,...,...,...,...,...,...
50058,Female,2017-12-23 13:48:00,2017-12-23 14:14:00,1559,Michigan Ave & Madison St,19.0,Shedd Aquarium,55.0,28.0,8.1,mostlycloudy,cold
50077,Male,2017-12-29 09:13:00,2017-12-29 09:53:00,2378,Clinton St & 18th St,15.0,Canal St & Taylor St,15.0,12.9,4.6,cloudy,cold
50080,Male,2017-12-29 15:28:00,2017-12-29 15:51:00,1378,Cityfront Plaza Dr & Pioneer Ct,23.0,Mies van der Rohe Way & Chestnut St,19.0,14.0,6.9,snow,cold
50083,Male,2017-12-30 11:36:00,2017-12-30 11:55:00,1175,Western Ave & Walton St,19.0,Damen Ave & Clybourn Ave,15.0,3.9,13.8,partlycloudy,cold


### 5️⃣ Using `.query()`, show all rides that happened during `'snow'` events **or** where `temperature < 0`.

> 💡 *Hint:* Inside `.query()`, you can use `events == 'snow' or temperature < 0`.

In [32]:
bikes.query('events == "snow" or temperature < 0')

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events,temp_category
1707,Male,2013-11-22 08:21:00,2013-11-22 08:31:00,608,Canal St & Adams St,27.0,Wells St & Erie St,19.0,35.1,13.8,snow,cold
1852,Male,2013-12-09 05:57:00,2013-12-09 06:08:00,684,Sedgwick St & North Ave,19.0,Wabash Ave & Grand Ave,15.0,23.0,11.5,snow,cold
1853,Male,2013-12-09 06:28:00,2013-12-09 06:33:00,302,Greenview Ave & Fullerton Ave,15.0,Sheffield Ave & Fullerton Ave,15.0,21.9,12.7,snow,cold
1860,Female,2013-12-10 08:32:00,2013-12-10 08:45:00,781,Ada St & Washington Blvd,15.0,Franklin St & Quincy St,15.0,10.4,19.6,snow,cold
1871,Male,2013-12-12 05:13:00,2013-12-12 05:27:00,878,California Ave & North Ave,11.0,Carpenter St & Huron St,19.0,-2.0,6.9,mostlycloudy,cold
...,...,...,...,...,...,...,...,...,...,...,...,...
50074,Male,2017-12-28 13:08:00,2017-12-28 13:13:00,258,Ashland Ave & Chicago Ave,15.0,Ashland Ave & Division St,19.0,12.9,9.2,snow,cold
50075,Male,2017-12-28 17:03:00,2017-12-28 17:07:00,226,Wabash Ave & Wacker Pl,19.0,State St & Randolph St,27.0,12.0,8.1,snow,cold
50079,Male,2017-12-29 12:08:00,2017-12-29 12:12:00,270,Southport Ave & Belmont Ave,15.0,Wilton Ave & Belmont Ave,23.0,14.0,6.9,snow,cold
50080,Male,2017-12-29 15:28:00,2017-12-29 15:51:00,1378,Cityfront Plaza Dr & Pioneer Ct,23.0,Mies van der Rohe Way & Chestnut St,19.0,14.0,6.9,snow,cold


### 6️⃣ Write a function `winter_summary(df)` that returns:
- The total number of cold rides
- The average `tripduration` for cold rides
- The minimum and maximum `temperature` recorded

> 💡 *Hint:* Use Boolean filtering and DataFrame methods like `.mean()` and `.min()`.

In [79]:
def temp_category_summary(df: pd.DataFrame, temp_category: str):
    """Return the average trip duration and temperature range for one temp category."""
    temp_categories = ['cold', 'mild', 'hot']

    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Expected pandas DataFrame, instead got {type(df).__name__}")

    if not isinstance(temp_category, str):
        raise TypeError(f"Expected string for temp category, instead got {type(temp_category).__name__}")

    if temp_category not in temp_categories:
        raise ValueError(f"Value should be in {temp_categories}, instead got {temp_category}")

    # Filter the DataFrame passed in (not the global `bikes`) once, then reuse the subset
    subset = df[df['temp_category'] == temp_category]
    avg_trip = round(subset['tripduration'].mean(), 2)
    min_temp = subset['temperature'].min()
    max_temp = subset['temperature'].max()

    return {
        'category': temp_category,
        'avg_tripduration': avg_trip,
        'min_temp': min_temp,
        'max_temp': max_temp
    }

temp_category_summary(bikes, 'mild')

{'category': 'mild',
 'avg_tripduration': np.float64(760.84),
 'min_temp': np.float64(68.0),
 'max_temp': np.float64(84.9)}
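
The task statement asks for a `winter_summary(df)` that also reports the number of cold rides. A minimal sketch of that signature follows, assuming the frame already has a `temp_category` column and reading "minimum and maximum temperature" as the range among cold rides. The sample rows are made up.

```python
import pandas as pd

def winter_summary(df: pd.DataFrame) -> dict:
    """Summarize rides labeled 'cold' in a frame that has a temp_category column."""
    cold = df[df["temp_category"] == "cold"]
    return {
        "cold_rides": len(cold),
        "avg_tripduration": round(cold["tripduration"].mean(), 2),
        "min_temp": cold["temperature"].min(),
        "max_temp": cold["temperature"].max(),
    }

# Hypothetical sample rows, not taken from bikes.csv
sample = pd.DataFrame({
    "tripduration": [300, 900, 1200],
    "temperature": [30.0, 72.0, 10.0],
    "temp_category": ["cold", "mild", "cold"],
})
summary = winter_summary(sample)
print(summary)
```

On the real data, the -9999 values visible in `describe()` would become the reported minimum unless they are handled first.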

### 7️⃣ Challenge: Find rides where `start_capacity` > `end_capacity` and `temperature < 40`.
Return only station names and capacities.

> 💡 *Hint:* Use `.loc` with column subset.

In [80]:
cols = ['from_station_name','to_station_name','start_capacity','end_capacity']
filt = (bikes['temperature'] < 40) & (bikes['start_capacity'] > bikes['end_capacity'])


bikes.loc[filt,cols]

,from_station_name,to_station_name,start_capacity,end_capacity
1300,Michigan Ave & Pearson St,Peoria St & Jackson Blvd,23.0,19.0
1301,Franklin St & Jackson Blvd,State St & Erie St,27.0,23.0
1302,Clark St & Congress Pkwy,Clinton St & 18th St,27.0,15.0
1305,Paulina St & 18th St,Halsted St & 21st St,15.0,11.0
1313,Orleans St & Merchandise Mart Plaza,Dearborn Pkwy & Delaware Pl,23.0,19.0
...,...,...,...,...
50078,Southport Ave & Waveland Ave,Clarendon Ave & Gordon Ter,23.0,15.0
50080,Cityfront Plaza Dr & Pioneer Ct,Mies van der Rohe Way & Chestnut St,23.0,19.0
50083,Western Ave & Walton St,Damen Ave & Clybourn Ave,19.0,15.0
50085,Halsted St & 35th St (*),Union Ave & Root St,16.0,11.0


### 🧩 Insight Task:
Determine whether rides in **snowy or subzero weather** tend to last longer than rides in clear weather.
Print your conclusion based on your filtering results.

In [73]:
# Average trip duration (seconds) for rides during snow vs. clear conditions
snowy_weather = round(bikes[bikes['events'] == 'snow']['tripduration'].mean(), 2)

clear_weather = round(bikes[bikes['events'] == 'clear']['tripduration'].mean(), 2)

if snowy_weather > clear_weather:
    print(f'Rides in snowy weather tend to last longer, with an average trip duration of {snowy_weather} seconds compared to {clear_weather} seconds in clear weather')
else:
    print(f'Rides in clear weather tend to last longer, with an average trip duration of {clear_weather} seconds compared to {snowy_weather} seconds in snowy weather')

Rides in clear weather tend to last longer, with an average trip duration of 767.72 seconds compared to 592.86 seconds in snowy weather
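
The comparison above covers snow events only, while the task mentions snowy **or** subzero weather. One way to widen the first group is to combine two masks with `|`. This is a sketch on made-up rows, so its numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical sample rows, not taken from bikes.csv
sample = pd.DataFrame({
    "events": ["snow", "clear", "mostlycloudy", "clear"],
    "temperature": [20.0, 50.0, -2.0, 60.0],
    "tripduration": [600, 800, 400, 1000],
})

# Snow events or temperatures below zero, combined with |
harsh = (sample["events"] == "snow") | (sample["temperature"] < 0)
clear = sample["events"] == "clear"

harsh_avg = sample.loc[harsh, "tripduration"].mean()
clear_avg = sample.loc[clear, "tripduration"].mean()
print(harsh_avg, clear_avg)
```

Two caveats apply on the real data: a clear-sky ride below zero would land in both groups, and the -9999 values seen in `describe()` also satisfy `temperature < 0`.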


## 🎬 Part 2: Spotlight on Directors

### 1️⃣ Load and inspect the `movie` dataset.
Focus on columns like `director_name`, `imdb_score`, `gross`, and `budget`.

> 💡 *Hint:* Use `read_csv()` with `index_col='title'` if available.

In [93]:
import pandas as pd

cols = ['director_name','imdb_score','gross','budget','title']

movies = pd.read_csv('../data/movie.csv', usecols=cols, index_col='title')

movies

title,director_name,gross,budget,imdb_score
Avatar,James Cameron,760505847.0,237000000.0,7.9
Pirates of the Caribbean: At World's End,Gore Verbinski,309404152.0,300000000.0,7.1
Spectre,Sam Mendes,200074175.0,245000000.0,6.8
The Dark Knight Rises,Christopher Nolan,448130642.0,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,Doug Walker,,,7.1
...,...,...,...,...
Signed Sealed Delivered,Scott Smith,,,7.7
The Following,,,,7.5
A Plague So Pleasant,Benjamin Roberds,,1400.0,6.3
Shanghai Calling,Daniel Hsia,10443.0,,6.3


### 2️⃣ Clean the data by removing movies missing both `budget` and `gross` values.

> 💡 *Hint:* Use `.isna().all(axis=1)` to detect rows with both missing.

In [104]:
filt = ~movies[['budget','gross']].isna().all(axis=1)

movies_cleaned = movies[filt]

movies_cleaned

title,director_name,gross,budget,imdb_score
Avatar,James Cameron,760505847.0,237000000.0,7.9
Pirates of the Caribbean: At World's End,Gore Verbinski,309404152.0,300000000.0,7.1
Spectre,Sam Mendes,200074175.0,245000000.0,6.8
The Dark Knight Rises,Christopher Nolan,448130642.0,250000000.0,8.5
John Carter,Andrew Stanton,73058679.0,263700000.0,6.6
...,...,...,...,...
The Mongol King,Anthony Vallone,,3250.0,7.8
Newlyweds,Edward Burns,4584.0,9000.0,6.4
A Plague So Pleasant,Benjamin Roberds,,1400.0,6.3
Shanghai Calling,Daniel Hsia,10443.0,,6.3
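
The same cleaning step can also be written with `dropna`, which some readers find easier to scan than a negated mask. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: A has both values, B is missing gross, C is missing both
sample = pd.DataFrame(
    {"gross": [100.0, np.nan, np.nan], "budget": [50.0, 10.0, np.nan]},
    index=["A", "B", "C"],
)

# how="all" drops a row only when every column listed in `subset` is missing
kept = sample.dropna(subset=["budget", "gross"], how="all")
print(kept.index.tolist())
```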


### 3️⃣ Write a function `director_summary(df, director)` that returns:
- Number of movies by that director
- Average IMDb score
- Average gross-to-budget ratio

> 💡 *Hint:* Use Boolean filtering on `director_name`.

In [None]:
def director_summary(df: pd.DataFrame, director: str) -> dict:
    """Summarize one director's movies: count, mean IMDb score, mean gross/budget."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Expected a pandas DataFrame but instead received {type(df).__name__}")

    if not isinstance(director, str):
        raise TypeError(f"Expected a director name but instead received {type(director).__name__}")

    # Case-insensitive match; rows with a missing director name compare as False
    filt = df['director_name'].str.lower() == director.lower()
    subset = df[filt]

    # Rows missing gross or budget produce NaN ratios, which .mean() skips
    ratio = subset['gross'] / subset['budget']

    return {
        'director': director,
        'movie_count': len(subset),
        'avg_imdb_score': round(subset['imdb_score'].mean(), 2),
        'avg_gross_to_budget': round(ratio.mean(), 2)
    }

director_summary(movies_cleaned, 'James Cameron')

### 4️⃣ Filter movies directed by `'Christopher Nolan'` where IMDb score ≥ 8 and `gross > budget`.

> 💡 *Hint:* Chain multiple conditions using `&`.
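
A minimal sketch of chaining the three conditions, using a small made-up frame with the same column names as `movies`. The titles and figures are placeholders, not real data.

```python
import pandas as pd

# Hypothetical rows with the same columns as the movies frame
sample = pd.DataFrame(
    {
        "director_name": ["Christopher Nolan", "Christopher Nolan", "Sam Mendes"],
        "imdb_score": [8.5, 7.2, 8.1],
        "gross": [400.0, 300.0, 500.0],
        "budget": [250.0, 100.0, 200.0],
    },
    index=["Film A", "Film B", "Film C"],
)

# Each condition needs its own parentheses because & binds tighter than ==, >= and >
mask = (
    (sample["director_name"] == "Christopher Nolan")
    & (sample["imdb_score"] >= 8)
    & (sample["gross"] > sample["budget"])
)
nolan_hits = sample[mask]
print(nolan_hits)
```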

### 5️⃣ Use `.query()` to find all movies released after 2010 with IMDb score > 8 and content rating not equal to `'R'`.
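
The `movies` frame loaded earlier holds only four columns, so this step needs the release year and rating columns as well. The sketch below assumes they are named `title_year` and `content_rating`; check `movie.csv` for the actual names. The rows are made up.

```python
import pandas as pd

# Hypothetical rows; `title_year` and `content_rating` are assumed column names
sample = pd.DataFrame(
    {
        "title_year": [2009.0, 2012.0, 2014.0, 2015.0],
        "imdb_score": [8.8, 8.5, 8.6, 7.0],
        "content_rating": ["PG-13", "PG-13", "R", "PG"],
    },
    index=["Film A", "Film B", "Film C", "Film D"],
)

# Inside query(), `and` combines conditions and string literals need quotes
recent_hits = sample.query(
    "title_year > 2010 and imdb_score > 8 and content_rating != 'R'"
)
print(recent_hits)
```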

### 6️⃣ Create a new column `profit_margin` = (`gross` - `budget`) / `budget`.
Then select all movies where `profit_margin > 1.0` (more than 100% profit).
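
A short sketch of the column arithmetic and the follow-up filter, on made-up figures:

```python
import pandas as pd

# Hypothetical gross and budget figures
sample = pd.DataFrame(
    {"gross": [300.0, 150.0, 90.0], "budget": [100.0, 100.0, 100.0]},
    index=["Film A", "Film B", "Film C"],
)

# Element-wise arithmetic: profit relative to what the movie cost
sample["profit_margin"] = (sample["gross"] - sample["budget"]) / sample["budget"]

profitable = sample[sample["profit_margin"] > 1.0]
print(profitable)
```

Rows with a missing `gross` or `budget` get a `NaN` margin and drop out of the filter, since `NaN > 1.0` is `False`.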

### 7️⃣ Challenge: Identify which **director** appears most often among movies with both high IMDb scores (≥8) and high profit margins (>1.0).

> 💡 *Hint:* Use `value_counts()` on the filtered subset.
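
One possible shape for this step, assuming `profit_margin` has already been computed. The director labels are placeholders.

```python
import pandas as pd

# Hypothetical rows; director names are placeholders
sample = pd.DataFrame({
    "director_name": ["X", "X", "Y", "Z"],
    "imdb_score": [8.2, 8.4, 8.1, 6.0],
    "profit_margin": [1.5, 2.0, 1.2, 3.0],
})

top = sample[(sample["imdb_score"] >= 8) & (sample["profit_margin"] > 1.0)]

# value_counts() sorts by frequency, so idxmax() returns the most common director
counts = top["director_name"].value_counts()
most_frequent = counts.idxmax()
print(counts)
print(most_frequent)
```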

### 🧩 Insight Task:
Write a short conclusion about what traits successful directors share (high IMDb + profitable movies).

## 🏪 Part 3: Regional Profit Watch

### 1️⃣ Load the `sales_transactions.csv` dataset and display its first few rows.

> 💡 *Hint:* Use `pd.read_csv()` and `.head()`.

### 2️⃣ Create a new column `margin_ratio` = `revenue` / (`unit_price` * `units_sold`).
This shows the fraction of revenue retained after discounts.

> 💡 *Hint:* Apply direct arithmetic on columns.
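
A minimal sketch of the ratio on two made-up transactions, using the column names given in the task:

```python
import pandas as pd

# Hypothetical transactions, not taken from sales_transactions.csv
sales = pd.DataFrame({
    "unit_price": [100.0, 50.0],
    "units_sold": [10, 4],
    "revenue": [900.0, 200.0],
})

# List-price total is unit_price * units_sold; the ratio is what was actually kept
sales["margin_ratio"] = sales["revenue"] / (sales["unit_price"] * sales["units_sold"])
print(sales)
```

A ratio of 1.0 means the sale went through at full list price.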

### 3️⃣ Write a function `classify_margin(row)` that returns `'high'` if `margin_ratio` > 0.9,
`'medium'` if between 0.7–0.9, otherwise `'low'`.
Apply this to create a new column `margin_level`.
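
A sketch of a row-wise classifier. The task does not say whether the 0.7 and 0.9 endpoints are inclusive, so this version assumes both belong to `'medium'`.

```python
import pandas as pd

def classify_margin(row: pd.Series) -> str:
    """Map a row's margin_ratio to 'high', 'medium', or 'low'."""
    ratio = row["margin_ratio"]
    if ratio > 0.9:
        return "high"
    elif ratio >= 0.7:  # assumption: 0.7 and 0.9 both count as medium
        return "medium"
    return "low"

# Hypothetical ratios chosen to sit on and around the thresholds
sales = pd.DataFrame({"margin_ratio": [0.95, 0.9, 0.7, 0.5]})

# axis=1 passes each row (not each column) to the function
sales["margin_level"] = sales.apply(classify_margin, axis=1)
print(sales)
```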

### 4️⃣ Filter all `'high'` margin sales that occurred in the `'West'` or `'North'` regions and involve products `'Laptop'` or `'Phone'`.

> 💡 *Hint:* Use `.isin()` and Boolean logic.
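
A sketch of combining `.isin()` with an equality mask. The column names `region` and `product` are assumptions about the CSV, and the rows are made up.

```python
import pandas as pd

# Hypothetical rows; `region` and `product` are assumed column names
sales = pd.DataFrame({
    "margin_level": ["high", "high", "low", "high"],
    "region": ["West", "South", "North", "North"],
    "product": ["Laptop", "Phone", "Laptop", "Tablet"],
})

# isin() replaces a chain of == comparisons joined with |
mask = (
    (sales["margin_level"] == "high")
    & sales["region"].isin(["West", "North"])
    & sales["product"].isin(["Laptop", "Phone"])
)
selected = sales[mask]
print(selected)
```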

### 5️⃣ Using `.query()`, find all transactions where `discount` > 10 **and** `revenue` < 5000.
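
A short sketch of the query string, assuming `discount` is stored as a number on the same scale as the threshold 10. The rows are made up.

```python
import pandas as pd

# Hypothetical transactions
sales = pd.DataFrame({
    "discount": [5, 15, 20],
    "revenue": [3000.0, 4000.0, 6000.0],
})

# Both conditions must hold, so they are joined with `and`
result = sales.query("discount > 10 and revenue < 5000")
print(result)
```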

### 6️⃣ Challenge: Find all transactions where actual `revenue` < (`units_sold` * `unit_price`)
and display columns `sales_rep`, `product`, `discount`, and `revenue`.

> 💡 *Hint:* Compare columns directly inside a Boolean mask.
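
A sketch of comparing one column against the product of two others, then selecting the requested columns with `.loc`. The representative names and amounts are made up.

```python
import pandas as pd

# Hypothetical transactions
sales = pd.DataFrame({
    "sales_rep": ["Ana", "Ben"],
    "product": ["Laptop", "Phone"],
    "discount": [10, 0],
    "units_sold": [2, 3],
    "unit_price": [1000.0, 500.0],
    "revenue": [1800.0, 1500.0],
})

# True where the recorded revenue fell short of the list-price total
below_list = sales["revenue"] < sales["units_sold"] * sales["unit_price"]

result = sales.loc[below_list, ["sales_rep", "product", "discount", "revenue"]]
print(result)
```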

### 7️⃣ Bonus Function – `region_summary(df)`:
Return the average revenue per region for `'high'` margin sales only.
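
One possible sketch uses `groupby`, which goes slightly beyond pure subset selection. The `region` column name is an assumption, and the rows are made up.

```python
import pandas as pd

def region_summary(df: pd.DataFrame) -> pd.Series:
    """Average revenue per region, restricted to 'high' margin sales."""
    high = df[df["margin_level"] == "high"]
    return high.groupby("region")["revenue"].mean()

# Hypothetical transactions; `region` is an assumed column name
sales = pd.DataFrame({
    "region": ["West", "West", "North", "North"],
    "margin_level": ["high", "high", "high", "low"],
    "revenue": [100.0, 300.0, 500.0, 900.0],
})
summary = region_summary(sales)
print(summary)
```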

### 🧩 Insight Task:
Based on your results, determine which region seems most profitable and which discounts lead to the steepest revenue drop.