# Pandas Fun Problems - Advanced

Dive even deeper, focusing on sophisticated Pandas techniques and operations. This notebook is designed to push your data manipulation skills further, leveraging the full power of Pandas for complex data analysis tasks.

- **Practice Problems Source:** [Explore Problems](https://www.practiceprobs.com/problemsets/python-pandas/advanced/)
- **YouTube Tutorial:** [Video Guide](https://www.youtube.com/watch?v=8xUgesdShE8)

Building on the foundational knowledge of DataFrames, this advanced problem set challenges you to solve more complex scenarios, offering a path to mastering data analysis with Pandas.

Feel free to share your thoughts, solutions, and any questions in the comments. Your contributions help make this learning journey richer for everyone.



# Q1 - Class Transitions Problem

https://www.practiceprobs.com/problemsets/python-pandas/advanced/class-transitions/

You have a DataFrame called schedules that represents the daily schedule of each student in a school. For example, if Ryan attends four classes (math, english, history, and chemistry), schedules will have four rows for Ryan, in the order he attends each class.



In [1]:
import numpy as np
import pandas as pd

generator = np.random.default_rng(seed=1234)
classes = ['english', 'math', 'history', 'chemistry', 'gym', 'civics', 'writing', 'engineering']

schedules = pd.DataFrame({
    'student_id':np.repeat(np.arange(100), 4),
    'class':generator.choice(classes, size=400, replace=True)
}).drop_duplicates()
schedules['grade'] = generator.integers(101, size=schedules.shape[0])

print(schedules)
#      student_id        class  grade
# 0             0  engineering     86
# 3             0    chemistry     75
# 4             1         math     85
# 5             1  engineering      0
# 6             1      english     73
# ..          ...          ...    ...
# 394          98      writing     16
# 395          98       civics     89
# 396          99  engineering     90
# 398          99         math     55
# 399          99      history     31
# 
# [339 rows x 3 columns]


You have this theory that the sequence of class-to-class transitions affects students' grades. For instance, you suspect Ryan would do better in his Chemistry class if it immediately followed his Math class instead of his History class.

Determine the average and median Chemistry grade for groups of students based on the class they have immediately prior to Chemistry. Also report how many students fall into each group.

## Solution

In [2]:
# Create a new column 'prev_class' that represents the class that each student had immediately prior to their current class
schedules['prev_class'] = schedules.groupby('student_id')['class'].shift()
schedules['prev_class']

0              NaN
3      engineering
4              NaN
5             math
6      engineering
          ...     
394           math
395        writing
396            NaN
398    engineering
399           math
Name: prev_class, Length: 339, dtype: object
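To make the effect of the grouped shift easier to see, here is a minimal sketch on a hypothetical two-student frame (the name `toy` and its contents are made up for illustration, not part of the problem data). It assumes, as in the problem, that rows are already in attendance order.

```python
import pandas as pd

# Hypothetical toy schedule: two students, rows already in attendance order
toy = pd.DataFrame({
    'student_id': [1, 1, 1, 2, 2],
    'class': ['math', 'chemistry', 'gym', 'history', 'chemistry'],
})

# shift() moves each value down one row *within* its student_id group,
# so the first class of every student gets NaN (there is no previous class)
toy['prev_class'] = toy.groupby('student_id')['class'].shift()
print(toy)
```

Because the shift is done per group, student 2's first row does not inherit student 1's last class.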

In [3]:
# Count students and compute the mean and median grade for every (prev_class, class) pair.
# Rows with no previous class (NaN) are dropped by groupby; the chemistry pairs are selected in the next step.
class_pairs = schedules.groupby(['prev_class', 'class']).agg(
    students=('student_id', 'count'),
    avg_grade=('grade', 'mean'),
    med_grade=('grade', 'median')
)
class_pairs

| prev_class | class | students | avg_grade | med_grade |
|---|---|---|---|---|
| chemistry | civics | 1 | 19.0 | 19.0 |
| chemistry | engineering | 2 | 50.0 | 50.0 |
| chemistry | english | 5 | 55.2 | 50.0 |
| chemistry | gym | 5 | 36.6 | 25.0 |
| chemistry | history | 6 | 53.166667 | 55.0 |
| chemistry | math | 3 | 38.0 | 10.0 |
| chemistry | writing | 5 | 70.2 | 85.0 |
| civics | chemistry | 2 | 55.0 | 55.0 |
| civics | engineering | 3 | 76.666667 | 78.0 |
| civics | english | 2 | 93.0 | 93.0 |


In [4]:
# Filter the class_pairs DataFrame for chemistry grades
filtered_class_pairs = class_pairs.xs(key='chemistry', axis=0, level=1, drop_level=False)
filtered_class_pairs

| prev_class | class | students | avg_grade | med_grade |
|---|---|---|---|---|
| civics | chemistry | 2 | 55.0 | 55.0 |
| engineering | chemistry | 6 | 43.333333 | 43.5 |
| english | chemistry | 6 | 32.666667 | 31.5 |
| gym | chemistry | 2 | 68.5 | 68.5 |
| history | chemistry | 3 | 35.333333 | 27.0 |
| math | chemistry | 3 | 26.0 | 23.0 |
| writing | chemistry | 3 | 45.333333 | 46.0 |


In [5]:
# Sort the filtered_class_pairs DataFrame by median grade
sorted_class_pairs = filtered_class_pairs.sort_values('med_grade')
sorted_class_pairs

| prev_class | class | students | avg_grade | med_grade |
|---|---|---|---|---|
| math | chemistry | 3 | 26.0 | 23.0 |
| history | chemistry | 3 | 35.333333 | 27.0 |
| english | chemistry | 6 | 32.666667 | 31.5 |
| engineering | chemistry | 6 | 43.333333 | 43.5 |
| writing | chemistry | 3 | 45.333333 | 46.0 |
| civics | chemistry | 2 | 55.0 | 55.0 |
| gym | chemistry | 2 | 68.5 | 68.5 |

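As an alternative sketch (not the approach used in the solution above), the chemistry rows can be filtered first and then grouped by the previous class only, which avoids building statistics for every class pair. The frame `toy` and its grades are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: student 4 has chemistry as their first class
toy = pd.DataFrame({
    'student_id': [1, 1, 2, 2, 3, 3, 4],
    'class': ['math', 'chemistry', 'math', 'chemistry', 'gym', 'chemistry', 'chemistry'],
    'grade': [90, 80, 70, 60, 50, 40, 30],
})
toy['prev_class'] = toy.groupby('student_id')['class'].shift()

# Keep only the chemistry rows, then summarise by the class that preceded them
chem = toy[toy['class'] == 'chemistry']
summary = (
    chem.groupby('prev_class')
        .agg(students=('student_id', 'count'),
             avg_grade=('grade', 'mean'),
             med_grade=('grade', 'median'))
        .sort_values('med_grade')
)
print(summary)
```

Note that students whose first class is chemistry have no previous class, so `groupby` drops them by default; pass `dropna=False` if that group should be reported as well.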

# Q2 - Rose Thorn Problem

https://www.practiceprobs.com/problemsets/python-pandas/advanced/rose-thorn/

You developed a multiplayer indie game called 🌹 Rose Thorn. Players compete in one of two venues - the ocean or the desert. You track the outcome of five games between three players in a DataFrame called games.



In [6]:
import numpy as np
import pandas as pd

games = pd.DataFrame({
    'bella1':   ['2nd', '3rd', '1st', '2nd', '3rd'],
    'billybob': ['1st', '2nd', '2nd', '1st', '2nd'],
    'nosoup4u': ['3rd', '1st', '3rd', '3rd', '3rd'],
    'venue': ['desert', 'ocean', 'desert', 'ocean', 'desert']
})

print(games)
#   bella1 billybob nosoup4u   venue
# 0    2nd      1st      3rd  desert
# 1    3rd      2nd      1st   ocean
# 2    1st      2nd      3rd  desert
# 3    2nd      1st      3rd   ocean
# 4    3rd      2nd      3rd  desert


Now you want to analyze the data. Convert the games DataFrame into a new DataFrame that counts how many times each (player, placement) pair occurs per venue, with venue as the row index and (player, placed) as the column MultiIndex.

## Solution

In [7]:
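# Reshape from wide (one column per player) to long (one row per venue, player, placement)
step1 = games.melt(id_vars='venue', var_name='player', value_name='placed')
# Categorical columns let the later groupby report combinations that never occur as 0
step1['venue'] = pd.Categorical(step1.venue)
step1['player'] = pd.Categorical(step1.player)
step1['placed'] = pd.Categorical(step1.placed)
step1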

|  | venue | player | placed |
|---|---|---|---|
| 0 | desert | bella1 | 2nd |
| 1 | ocean | bella1 | 3rd |
| 2 | desert | bella1 | 1st |
| 3 | ocean | bella1 | 2nd |
| 4 | desert | bella1 | 3rd |
| 5 | desert | billybob | 1st |
| 6 | ocean | billybob | 2nd |
| 7 | desert | billybob | 2nd |
| 8 | ocean | billybob | 1st |
| 9 | desert | billybob | 2nd |

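The reshaping done by `melt` can be seen more easily on a smaller frame. The following is a minimal sketch with hypothetical players (`alice`, `bob`) and two games; only the `melt` call mirrors the solution.

```python
import pandas as pd

# Hypothetical wide frame: one column per player plus the venue
wide = pd.DataFrame({
    'alice': ['1st', '2nd'],
    'bob':   ['2nd', '1st'],
    'venue': ['desert', 'ocean'],
})

# venue is kept as an identifier; every other column name becomes a value
# in 'player', and the cell contents land in 'placed'
long_form = wide.melt(id_vars='venue', var_name='player', value_name='placed')
print(long_form)
```

Each game contributes one row per player, so two games with two players produce four rows.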
In [8]:
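# observed=False keeps every (venue, player, placed) combination of the
# categorical groupers, so combinations that never occur get a count of 0
step2 = step1.groupby(['venue', 'player', 'placed'], observed=False).size()
step2
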

venue   player    placed
desert  bella1    1st       1
                  2nd       1
                  3rd       1
        billybob  1st       1
                  2nd       2
                  3rd       0
        nosoup4u  1st       0
                  2nd       0
                  3rd       3
ocean   bella1    1st       0
                  2nd       1
                  3rd       1
        billybob  1st       1
                  2nd       1
                  3rd       0
        nosoup4u  1st       1
                  2nd       0
                  3rd       1
dtype: int64
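The zero counts above (for example, billybob never placing 3rd) appear only because the grouping columns are categorical and unobserved combinations are kept. A minimal sketch on a hypothetical toy frame, contrasting `observed=False` with `observed=True`:

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    'venue': pd.Categorical(['desert', 'desert', 'ocean']),
    'placed': pd.Categorical(['1st', '2nd', '1st']),
})

# observed=False: one entry per combination of categories, including (ocean, 2nd) with count 0
all_combos = toy.groupby(['venue', 'placed'], observed=False).size()

# observed=True: only the combinations that actually appear in the data
seen_only = toy.groupby(['venue', 'placed'], observed=True).size()

print(all_combos)
print(seen_only)
```

Passing `observed` explicitly also avoids relying on the default, which has changed across pandas versions.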

In [9]:
# Move player and placed from the row index into a column MultiIndex; venue stays as the row index
step3 = step2.unstack(level=['player', 'placed'])
step3

| venue | bella1 / 1st | bella1 / 2nd | bella1 / 3rd | billybob / 1st | billybob / 2nd | billybob / 3rd | nosoup4u / 1st | nosoup4u / 2nd | nosoup4u / 3rd |
|---|---|---|---|---|---|---|---|---|---|
| desert | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 3 |
| ocean | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |

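An alternative sketch, not the approach used in the solution: `pd.crosstab` can build the same venue-by-(player, placed) counts directly from the melted data. On its own it only creates columns for pairs that occur, so the sketch reindexes against the full product of players and placements to bring back the zero columns. The names `melted` and `all_columns` are chosen here for illustration.

```python
import pandas as pd

games = pd.DataFrame({
    'bella1':   ['2nd', '3rd', '1st', '2nd', '3rd'],
    'billybob': ['1st', '2nd', '2nd', '1st', '2nd'],
    'nosoup4u': ['3rd', '1st', '3rd', '3rd', '3rd'],
    'venue': ['desert', 'ocean', 'desert', 'ocean', 'desert']
})

melted = games.melt(id_vars='venue', var_name='player', value_name='placed')

# Count occurrences; columns form a (player, placed) MultiIndex of observed pairs only
counts = pd.crosstab(index=melted['venue'],
                     columns=[melted['player'], melted['placed']])

# Add the pairs that never occur (e.g. billybob / 3rd) as all-zero columns
all_columns = pd.MultiIndex.from_product(
    [sorted(melted['player'].unique()), sorted(melted['placed'].unique())],
    names=['player', 'placed'],
)
counts = counts.reindex(columns=all_columns, fill_value=0)
print(counts)
```

The categorical route in the solution gets the zero columns for free; this version trades that for an explicit reindex step.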

# Q3 - Product Volumes Problem

https://www.practiceprobs.com/problemsets/python-pandas/advanced/product-volumes/

Given a Series of product descriptions like "birch table measures 3'x6'x2'", estimate the volume of each product. (Note: 3' means 3 feet in the imperial system of units.)



In [10]:
import numpy as np
import pandas as pd

descriptions = pd.Series([
    "soft and fuzzy teddy bear, product dims: 1'x2'x1', shipping not included",
    "birch table measures 3'x6'x2'",
    "tortilla blanket ~ sleep like a fajita ~ 6'x8'x1'",
    "inflatable arm tube man | 12'x1'x1' when inflated",
    "dinosaur costume -- 6'x4'x2' -- for kids and small adults"
], dtype='string')


## Solution 1

In [11]:
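# Capture the three numbers of the first N'xN'xN' pattern, one column per dimension
dims = descriptions.str.extract(r"(\d+)'x(\d+)'x(\d+)'").astype('int64')
# Multiply across each row to get the volume in cubic feet
dims.product(axis=1)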

0     2
1    36
2    48
3    12
4    48
dtype: int64
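Solution 1 assumes every description contains a dimension triplet; if one does not, `extract` returns missing values and the cast to `int64` fails. A minimal sketch of one way to tolerate that, using hypothetical descriptions and hypothetical column names (`length`, `width`, `height` are labels chosen for illustration; the problem does not say which dimension is which):

```python
import pandas as pd

# Hypothetical descriptions; the second one has no dimensions at all
toy = pd.Series([
    "birch table measures 3'x6'x2'",
    "gift card, no physical size",
])

# Named groups become column names; rows without a match are all missing
dims = toy.str.extract(r"(?P<length>\d+)'x(?P<width>\d+)'x(?P<height>\d+)'")

# float64 can represent missing values, unlike int64
dims = dims.astype('float64')

# min_count=3 yields NaN unless all three dimensions are present
volume = dims.prod(axis=1, min_count=3)
print(volume)
```

Without `min_count`, the product of an all-missing row would be 1.0, which would look like a real volume.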

## Solution 2

In [12]:
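# Capture every number followed by a foot mark; each match becomes its own row
dims = descriptions.str.extractall(r"(\d+)'").astype('int64')
# Multiply the matches belonging to each description (index level 0)
dims.groupby(level=0).prod()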

|  | 0 |
|---|---|
| 0 | 2 |
| 1 | 36 |
| 2 | 48 |
| 3 | 12 |
| 4 | 48 |

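The two solutions agree on the data above, but they are not equivalent in general: the `extractall` pattern picks up every number followed by a foot mark, not only the three inside the dimension triplet. A minimal sketch with a hypothetical description that contains an extra measurement:

```python
import pandas as pd

# Hypothetical description with a stray "5'" outside the dimension triplet
toy = pd.Series(["floor lamp, 5' cord included, 1'x1'x6'"])

# Solution 1 style: only the N'xN'xN' triplet is used -> 1 * 1 * 6
first_triplet = (
    toy.str.extract(r"(\d+)'x(\d+)'x(\d+)'")
       .astype('int64')
       .prod(axis=1)
)

# Solution 2 style: every N' is multiplied in -> 5 * 1 * 1 * 6
every_number = (
    toy.str.extractall(r"(\d+)'")
       .astype('int64')
       .groupby(level=0)
       .prod()[0]
)

print(first_triplet.tolist(), every_number.tolist())
```

If descriptions may mention other lengths, the stricter triplet pattern is the safer of the two.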