<a href="https://www.kaggle.com/code/matinmahmoodi/pandas-fun-problems-dataframe?scriptVersionId=166449988" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Pandas Fun Problems - DataFrame

Dive deeper into Pandas with this second notebook, which focuses on DataFrames, the backbone of data manipulation in Pandas. It walks through a set of DataFrame problems, each followed by one or more worked solutions.

- [Practice Problems Source](https://www.practiceprobs.com/problemsets/python-pandas/dataframe/)
- [YouTube Tutorial](https://www.youtube.com/watch?v=8xUgesdShE8)

After mastering Series, understanding DataFrames is crucial for any data science endeavor. This problem set and accompanying tutorial offer hands-on experience and insights into effective DataFrame usage.

Share your thoughts and solutions in the comments. Your interaction enriches the learning experience for all.

### **Found this useful? Support with an upvote!**


# Q1 - Hobbies Problem

You polled five heterosexual couples on their hobbies.

In [1]:
import numpy as np
import pandas as pd

couples = pd.DataFrame({
    'man': [
        ['fishing', 'biking', 'reading'],
        ['hunting', 'mudding', 'fishing'],
        ['reading', 'movies', 'running'],
        ['running', 'reading', 'biking', 'mudding'],
        ['movies', 'reading', 'yodeling']
    ],
    'woman': [
        ['biking', 'reading', 'movies'],
        ['fishing', 'drinking'],
        ['knitting', 'reading'],
        ['running', 'biking', 'fishing', 'movies'],
        ['movies']
    ]
})
print(couples)

                                   man                               woman
0           [fishing, biking, reading]           [biking, reading, movies]
1          [hunting, mudding, fishing]                 [fishing, drinking]
2           [reading, movies, running]                 [knitting, reading]
3  [running, reading, biking, mudding]  [running, biking, fishing, movies]
4          [movies, reading, yodeling]                            [movies]


For each couple, determine what hobbies each man has that his wife doesn’t and what hobbies each woman has that her husband doesn’t.

## Solution 1

In [2]:
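man_hobbies = []
woman_hobbies = []
for _, row in couples.iterrows():
    # Set difference keeps the hobbies one partner has and the other lacks
    # (sets are unordered, so the original list order is not preserved)
    man_hobbies.append(list(set(row['man']) - set(row['woman'])))
    woman_hobbies.append(list(set(row['woman']) - set(row['man'])))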

print(man_hobbies)
print(woman_hobbies)

[['fishing'], ['hunting', 'mudding'], ['running', 'movies'], ['reading', 'mudding'], ['reading', 'yodeling']]
[['movies'], ['drinking'], ['knitting'], ['movies', 'fishing'], []]
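Converting to sets discards the order in which the hobbies were listed, which is why the lists above are not in their original order. Below is a minimal sketch of an order-preserving variant. The helper name `only_in_first` is hypothetical (it is not part of the original notebook), and only two of the five couples are used to keep the example short.

```python
import pandas as pd

# Two-couple subset of the original data, for illustration only
couples = pd.DataFrame({
    'man': [['fishing', 'biking', 'reading'], ['movies', 'reading', 'yodeling']],
    'woman': [['biking', 'reading', 'movies'], ['movies']],
})

def only_in_first(first, second):
    """Return the items of `first` that are absent from `second`, keeping order."""
    exclude = set(second)  # set lookup keeps the membership test cheap
    return [hobby for hobby in first if hobby not in exclude]

pairs = list(zip(couples['man'], couples['woman']))
man_only = [only_in_first(m, w) for m, w in pairs]
woman_only = [only_in_first(w, m) for m, w in pairs]

print(man_only)
print(woman_only)
```

The trade-off is a plain Python loop over rows, which is fine for small frames like this one.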


## Solution 2 - more professional

In [3]:
couples = pd.DataFrame({
    'man': [
        ['fishing', 'biking', 'reading'],
        ['hunting', 'mudding', 'fishing'],
        ['reading', 'movies', 'running'],
        ['running', 'reading', 'biking', 'mudding'],
        ['movies', 'reading', 'yodeling']
    ],
    'woman': [
        ['biking', 'reading', 'movies'],
        ['fishing', 'drinking'],
        ['knitting', 'reading'],
        ['running', 'biking', 'fishing', 'movies'],
        ['movies']
    ]
})

In [4]:
couples.map(set)

                                   man                               woman
0           {reading, biking, fishing}           {movies, reading, biking}
1          {mudding, hunting, fishing}                 {drinking, fishing}
2           {running, movies, reading}                 {knitting, reading}
3  {running, reading, mudding, biking}  {running, movies, biking, fishing}
4          {movies, reading, yodeling}                            {movies}


In [5]:
sets = couples.map(set)

In [6]:
woman_not_man = sets.diff(axis=1, periods=1).drop(columns='man')

In [7]:
man_not_woman = sets.diff(periods=-1, axis=1).drop(columns='woman')

In [8]:
hobbies_not_shared = pd.concat((man_not_woman, woman_not_man), axis=1).map(list)

In [9]:
print(hobbies_not_shared)

                   man              woman
0            [fishing]           [movies]
1   [hunting, mudding]         [drinking]
2    [running, movies]         [knitting]
3   [reading, mudding]  [movies, fishing]
4  [reading, yodeling]                 []
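The `diff` calls in Solution 2 rely on the fact that subtracting one column of sets from another applies Python's set difference row by row. The sketch below spells out that subtraction directly on the two columns, without `diff`. It uses a two-couple subset of the data and builds the sets by hand, so it is an illustration of the idea rather than a replacement for the solution.

```python
import pandas as pd

# Two-couple subset, already converted to sets
sets = pd.DataFrame({
    'man': [{'fishing', 'biking', 'reading'}, {'movies', 'reading', 'yodeling'}],
    'woman': [{'biking', 'reading', 'movies'}, {'movies'}],
})

# For object columns, `-` is applied element by element,
# so each row computes set(man) - set(woman) or the reverse
man_not_woman = sets['man'] - sets['woman']
woman_not_man = sets['woman'] - sets['man']

print(man_not_woman.tolist())
print(woman_not_man.tolist())
```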


# Q2 - Party Time Problem

Whenever your friends John and Judy visit you together, y’all have a party 🥳. Given a DataFrame with 10 rows representing the next 10 days of your schedule and whether John and Judy are scheduled to make an appearance, insert a new column called days_til_party that indicates how many days until the next party.

In [10]:
import numpy as np
import pandas as pd

generator = np.random.default_rng(123)
df = pd.DataFrame({
    'john': generator.choice([True, False], size=10, replace=True),
    'judy': generator.choice([True, False], size=10, replace=True)
})
print(df)


    john   judy
0   True   True
1  False  False
2  False   True
3   True  False
4  False   True
5   True   True
6   True  False
7   True  False
8   True  False
9   True  False


days_til_party should be 0 on days when a party occurs, 1 on days when a party doesn’t occur but will occur the next day, etc.

## Solution 

In [11]:
party = df.john & df.judy
party

0     True
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
dtype: bool

In [12]:
grps = party.iloc[::-1].cumsum()
grps

9    0
8    0
7    0
6    0
5    1
4    1
3    1
2    1
1    1
0    2
dtype: int64

In [13]:
df['days_til_party'] = party.groupby(grps).cumcount(ascending=False)
df['days_til_party']

0    0
1    4
2    3
3    2
4    1
5    0
6    3
7    2
8    1
9    0
Name: days_til_party, dtype: int64

In [14]:
df.loc[(party.loc[party].index[-1] + 1):, 'days_til_party'] = pd.NA
df

    john   judy  days_til_party
0   True   True             0.0
1  False  False             4.0
2  False   True             3.0
3   True  False             2.0
4  False   True             1.0
5   True   True             0.0
6   True  False             NaN
7   True  False             NaN
8   True  False             NaN
9   True  False             NaN
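As a cross-check on the grouped `cumcount` approach, here is a sketch of an alternative that works from row positions: mark the position of each party day, back-fill it onto the preceding days, and subtract the current position. It assumes one row per day in chronological order, and the `party` Series is typed in by hand to match the values shown above.

```python
import numpy as np
import pandas as pd

# Party indicator copied from the output above (john & judy)
party = pd.Series([True, False, False, False, False,
                   True, False, False, False, False])

# Row position of every day
pos = pd.Series(np.arange(len(party)), index=party.index)

# Keep positions only on party days, then pull each one back
# onto the days that precede it
next_party_pos = pos.where(party).bfill()

# Days after the last party have no upcoming party and stay NaN
days_til_party = next_party_pos - pos
print(days_til_party)
```

This variant needs no separate clean-up step for the days after the final party, because `bfill` has nothing to fill them with.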


# Q3 - Vending Machines Problem

You own a collection of vending machines, and you want to evaluate your inventory. Given a DataFrame of (machine, product, stock), determine the number of machines carrying each product and how many of them are out of stock.

In [15]:
import numpy as np
import pandas as pd

machine_products = pd.DataFrame({
    'machine': ['abc', 'abc', 'def', 'def', 'def', 'ghi'],
    'product': ['skittles', 'soap', 'soap', 'm&ms', 'skittles', 'm&ms'],
    'stock': [10, 0, 15, 2, 0, 3]
})
print(machine_products)


  machine   product  stock
0     abc  skittles     10
1     abc      soap      0
2     def      soap     15
3     def      m&ms      2
4     def  skittles      0
5     ghi      m&ms      3


Build a new DataFrame with product as the row index, and calculated columns n_machines and n_machines_empty.

## Solution 1

In [16]:
new_df = machine_products.pivot_table(index='product', aggfunc={'machine': 'nunique', 'stock': lambda x: sum(x == 0)})
new_df

          machine  stock
product                 
m&ms            2      0
skittles        2      1
soap            2      1


In [17]:
new_df.columns = ['n_machines', 'n_machines_empty']
print(new_df)

          n_machines  n_machines_empty
product                               
m&ms               2                 0
skittles           2                 1
soap               2                 1


## Solution 2 - more professional

In [18]:
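def count_zero(x):
    """Count the entries of x that equal zero."""
    return (x == 0).sum()

# Counting 'stock' rows equals the number of machines only if each
# (machine, product) pair appears once, as it does in this data
machine_products.groupby('product').agg(
    n_machines=('stock', 'count'),
    n_machines_empty=('stock', count_zero)
)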


          n_machines  n_machines_empty
product                               
m&ms               2                 0
skittles           2                 1
soap               2                 1
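A third variant, sketched below, avoids the custom aggregation function by first adding a boolean column and then summing it. It also counts distinct machines with `nunique`, which stays correct even if a machine were to list the same product twice. The column name `empty` is an assumption introduced for this example.

```python
import pandas as pd

machine_products = pd.DataFrame({
    'machine': ['abc', 'abc', 'def', 'def', 'def', 'ghi'],
    'product': ['skittles', 'soap', 'soap', 'm&ms', 'skittles', 'm&ms'],
    'stock': [10, 0, 15, 2, 0, 3]
})

summary = (
    machine_products
    # True where the machine has none of the product left
    .assign(empty=machine_products['stock'].eq(0))
    .groupby('product')
    .agg(
        n_machines=('machine', 'nunique'),   # distinct machines per product
        n_machines_empty=('empty', 'sum'),   # summing booleans counts the Trues
    )
)
print(summary)
```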


# Q4 - Cradle Robbers Problem

Given a DataFrame of married couples and a separate DataFrame with each person’s age, identify “cradle robbers”, people:

- who are at least 20 years older than their spouse and
- whose spouse is under the age of 30


In [19]:
import numpy as np
import pandas as pd

couples = pd.DataFrame({
    'person1': ['Cody', 'Dustin', 'Peter', 'Adam', 'Ryan', 'Brian', 'Jordan', 'Gregory'],
    'person2': ['Sarah', 'Amber', 'Brianna', 'Caitlin', 'Rachel', 'Kristen', 'Alyssa', 'Morgan']
}).convert_dtypes()

ages = pd.DataFrame({
    'person': ['Adam', 'Alyssa', 'Amber', 'Brian', 'Brianna', 'Caitlin', 'Cody', 'Dustin', 'Gregory', 'Jordan',
               'Kristen', 'Rachel', 'Morgan', 'Peter', 'Ryan', 'Sarah'],
    'age': [62, 40, 41, 50, 65, 29, 27, 39, 42, 39, 33, 61, 43, 55, 28, 36]
}).convert_dtypes()

print(couples)

print(ages)

   person1  person2
0     Cody    Sarah
1   Dustin    Amber
2    Peter  Brianna
3     Adam  Caitlin
4     Ryan   Rachel
5    Brian  Kristen
6   Jordan   Alyssa
7  Gregory   Morgan
     person  age
0      Adam   62
1    Alyssa   40
2     Amber   41
3     Brian   50
4   Brianna   65
5   Caitlin   29
6      Cody   27
7    Dustin   39
8   Gregory   42
9    Jordan   39
10  Kristen   33
11   Rachel   61
12   Morgan   43
13    Peter   55
14     Ryan   28
15    Sarah   36


## Solution 

In [20]:
ages = ages.set_index('person').age
ages

person
Adam       62
Alyssa     40
Amber      41
Brian      50
Brianna    65
Caitlin    29
Cody       27
Dustin     39
Gregory    42
Jordan     39
Kristen    33
Rachel     61
Morgan     43
Peter      55
Ryan       28
Sarah      36
Name: age, dtype: Int64

In [21]:
couples['age1'] = ages.loc[couples.person1].to_numpy()
couples['age2'] = ages.loc[couples.person2].to_numpy()

In [22]:
cr1 = couples.loc[(couples.age1 - couples.age2 >= 20) & (couples.age2 < 30), 'person1']
cr2 = couples.loc[(couples.age2 - couples.age1 >= 20) & (couples.age1 < 30), 'person2']

In [23]:
cradle_robbers = pd.concat((cr1, cr2))
cradle_robbers

3      Adam
4    Rachel
dtype: string
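The age lookup can also be written with `Series.map`, which aligns on the index of the age Series and so needs no `to_numpy()` call. The sketch below uses a three-couple subset of the data with default dtypes. The names `age_by_person`, `older1` and `older2` are introduced here for illustration.

```python
import pandas as pd

# Three-couple subset of the original data
couples = pd.DataFrame({
    'person1': ['Cody', 'Adam', 'Ryan'],
    'person2': ['Sarah', 'Caitlin', 'Rachel'],
})
ages = pd.DataFrame({
    'person': ['Adam', 'Caitlin', 'Cody', 'Rachel', 'Ryan', 'Sarah'],
    'age': [62, 29, 27, 61, 28, 36],
})

# Name -> age lookup table
age_by_person = ages.set_index('person')['age']

# map() looks each name up in the index of age_by_person
age1 = couples['person1'].map(age_by_person)
age2 = couples['person2'].map(age_by_person)

# Check both directions: either partner may be the older one
older1 = (age1 - age2 >= 20) & (age2 < 30)
older2 = (age2 - age1 >= 20) & (age1 < 30)

cradle_robbers = pd.concat([couples.loc[older1, 'person1'],
                            couples.loc[older2, 'person2']])
print(cradle_robbers.tolist())
```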

# Q5 - Potholes Problem

Fed up with your city’s roads, you go around collecting data on potholes in your area. Due to an unfortunate ☕ coffee spill, you lost bits and pieces of your data.

In [24]:
import numpy as np
import pandas as pd

potholes = pd.DataFrame({
    'length':[5.1, np.nan, 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
    'width':[2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
    'depth':[2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
    'location':pd.Series(['center', 'north edge', np.nan, 'center', 'north edge', 'center', 'west edge',
                          'west edge', np.nan, np.nan], dtype='string')
})

print(potholes)

   length  width  depth    location
0     5.1    2.8    2.6      center
1     NaN    5.8    NaN  north edge
2     6.2    6.5    4.2        <NA>
3     4.3    6.1    0.8      center
4     6.0    5.8    2.6  north edge
5     5.1    NaN    NaN      center
6     6.5    6.3    3.9   west edge
7     4.3    6.1    4.8   west edge
8     NaN    5.4    4.0        <NA>
9     NaN    5.0    NaN        <NA>


Given your DataFrame of pothole measurements, discard rows where more than half the values are NaN, elsewhere impute NaNs with the average value per column unless the column is non-numeric, in which case use the mode.

## Solution 

In [25]:
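# Discard rows where more than half the values are NaN, i.e. keep rows in which
# at least half the values are present (thresh is documented as an int)
potholes = potholes.dropna(thresh=int(np.ceil(potholes.shape[1] / 2)))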
potholes

   length  width  depth    location
0     5.1    2.8    2.6      center
1     NaN    5.8    NaN  north edge
2     6.2    6.5    4.2        <NA>
3     4.3    6.1    0.8      center
4     6.0    5.8    2.6  north edge
5     5.1    NaN    NaN      center
6     6.5    6.3    3.9   west edge
7     4.3    6.1    4.8   west edge
8     NaN    5.4    4.0        <NA>
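The `thresh` argument of `dropna` is the minimum number of non-missing values a row needs in order to be kept. The sketch below makes that explicit by counting non-missing values per row on a three-row frame made up for this example, and checks that a boolean mask gives the same result as `dropna(thresh=2)`.

```python
import numpy as np
import pandas as pd

# Small made-up frame: rows have 4, 2 and 1 non-missing values
demo = pd.DataFrame({
    'length': [5.1, np.nan, np.nan],
    'width': [2.8, 5.8, 5.0],
    'depth': [2.6, np.nan, np.nan],
    'location': ['center', 'north edge', None],
})

# Number of non-missing values in each row
non_missing = demo.notna().sum(axis=1)

# Keep rows where at least half of the columns are filled in
kept = demo[non_missing >= demo.shape[1] / 2]

print(non_missing.tolist())
print(kept.index.tolist())
```

With four columns, a row is dropped only when three or four of its values are missing.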


In [26]:
# Separate numeric and non-numeric columns
numeric_columns = potholes.select_dtypes(include=np.number).columns
non_numeric_columns = potholes.select_dtypes(exclude=np.number).columns

In [27]:
# Impute NaNs with the average value per numeric column
potholes[numeric_columns] = potholes[numeric_columns].fillna(potholes[numeric_columns].mean())

# Impute NaNs in non-numeric columns with the mode
potholes[non_numeric_columns] = potholes[non_numeric_columns].fillna(potholes[non_numeric_columns].mode().iloc[0])

In [28]:
print(potholes)

     length  width     depth    location
0  5.100000    2.8  2.600000      center
1  5.357143    5.8  3.271429  north edge
2  6.200000    6.5  4.200000      center
3  4.300000    6.1  0.800000      center
4  6.000000    5.8  2.600000  north edge
5  5.100000    5.6  3.271429      center
6  6.500000    6.3  3.900000   west edge
7  4.300000    6.1  4.800000   west edge
8  5.357143    5.4  4.000000      center


# Q6 - AFOLs Problem

You're developing the comment system for a social platform for AFOLs (Adult Fans Of Legos). The comment data looks like this.

In [29]:
import numpy as np
import pandas as pd

comments = pd.DataFrame([
    [32, pd.NA, 'Legos are awesome'],
    [12, pd.NA, 'Legos are okay..'],
      [11, 12, 'Just okay??'],
      [4, 12, 'Okay, troll..'],
        [31, 4, "I'm serious"],
      [75, 12, 'yeah, nah'],
        [41, 75, 'u from down undah?'],
          [5, 41, 'nah, yeah'],
    [82, pd.NA, 'I love legos'],
      [81, 82, 'U 4 rl?'],
        [71, 81, 'no'],
      [95, 82, 'Me too!'],
      [96, 82, 'same']
], columns=['id', 'parent_id', 'comment'])

# Make id the index
comments.set_index('id', inplace=True)

print(comments)

   parent_id             comment
id                              
32      <NA>   Legos are awesome
12      <NA>    Legos are okay..
11        12         Just okay??
4         12       Okay, troll..
31         4         I'm serious
75        12           yeah, nah
41        75  u from down undah?
5         41           nah, yeah
82      <NA>        I love legos
81        82             U 4 rl?
71        81                  no
95        82             Me too!
96        82                same


Each comment has at most one parent_id, which identifies its parent comment; top-level comments have none.

Insert a column called n_descendants in comments that displays how many total comments are nested below each comment. For example, comment 12 has six descendants.



## Solution

In [30]:
# Function to calculate the number of descendants for each comment
def count_descendants(comment_id, comments_df):
    descendants = comments_df[comments_df['parent_id'] == comment_id]
    count = len(descendants)
    for desc_id in descendants.index:
        count += count_descendants(desc_id, comments_df)
    return count


In [31]:
# Applying the function to each comment and creating the n_descendants column
comments['n_descendants'] = comments.index.map(lambda x: count_descendants(x, comments))
comments

   parent_id             comment  n_descendants
id                                             
32      <NA>   Legos are awesome              0
12      <NA>    Legos are okay..              6
11        12         Just okay??              0
4         12       Okay, troll..              1
31         4         I'm serious              0
75        12           yeah, nah              2
41        75  u from down undah?              1
5         41           nah, yeah              0
82      <NA>        I love legos              4
81        82             U 4 rl?              1
71        81                  no              0
95        82             Me too!              0
96        82                same              0
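The recursive solution rescans the whole DataFrame for every comment it visits. A non-recursive sketch of the same count is shown below: for each comment, walk up its chain of parents and add one to every ancestor along the way. It assumes the parent links contain no cycles, drops the comment text for brevity, and the names `parent_of` and `counts` are introduced only for this example.

```python
import pandas as pd

# Same ids and parent links as the original data, comment text omitted
comments = pd.DataFrame(
    {'parent_id': [pd.NA, pd.NA, 12, 12, 4, 12, 75, 41, pd.NA, 82, 81, 82, 82]},
    index=pd.Index([32, 12, 11, 4, 31, 75, 41, 5, 82, 81, 71, 95, 96], name='id'),
)

# Plain dict lookups: comment id -> parent id (or pd.NA for top-level comments)
parent_of = comments['parent_id'].to_dict()
counts = dict.fromkeys(comments.index, 0)

for comment_id in comments.index:
    ancestor = parent_of[comment_id]
    # Every ancestor gains one descendant for this comment
    while not pd.isna(ancestor):
        counts[ancestor] += 1
        ancestor = parent_of[ancestor]

comments['n_descendants'] = pd.Series(counts)
print(comments)
```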


# Q7 - Humans Problem

You've developed a 🤖 machine learning model that classifies images. Specifically, it outputs labels with non-negligible probabilities.



In [32]:
import pandas as pd

predictions = pd.DataFrame.from_dict({
    'preds': {
        12: "{'dog': 0.55, 'cat': 0.25, 'squirrel': 0.2}",
        41: "{'telephone pole': 0.8, 'tower': 0.1, 'stick': 0.1}",
        43: "{'man': 0.65, 'woman': 0.33, 'monkey': 0.02}",
        46: "{'waiter': 0.45, 'waitress': 0.30, 'newspaper': 0.15, 'cat': 0.10}",
        49: "{'nurse': 0.50, 'doctor': 0.50}",
        72: "{'baseball': 0.8, 'basketball': 0.15, 'football': 0.05}",
        91: "{'woman': 0.62, 'man': 0.28, 'elephant': 0.10}"
    }
})

print(predictions)
#                                                                  preds
# 12                         {'dog': 0.55, 'cat': 0.25, 'squirrel': 0.2}
# 41                 {'telephone pole': 0.8, 'tower': 0.1, 'stick': 0.1}
# 43                        {'man': 0.65, 'woman': 0.33, 'monkey': 0.02}
# 46  {'waiter': 0.45, 'waitress': 0.30, 'newspaper': 0.15, 'cat': 0.10}
# 49                                     {'nurse': 0.50, 'doctor': 0.50}
# 72             {'baseball': 0.8, 'basketball': 0.15, 'football': 0.05}
# 91                      {'woman': 0.62, 'man': 0.28, 'elephant': 0.10}

                                                preds
12        {'dog': 0.55, 'cat': 0.25, 'squirrel': 0.2}
41  {'telephone pole': 0.8, 'tower': 0.1, 'stick':...
43       {'man': 0.65, 'woman': 0.33, 'monkey': 0.02}
46  {'waiter': 0.45, 'waitress': 0.30, 'newspaper'...
49                    {'nurse': 0.50, 'doctor': 0.50}
72  {'baseball': 0.8, 'basketball': 0.15, 'footbal...
91     {'woman': 0.62, 'man': 0.28, 'elephant': 0.10}


Each row in predictions represents predictions for a different image.

Insert a column called prob_human that calculates the probability each image represents a human. You can use the following list of strings to identify human labels.



In [33]:
humans = ['doctor', 'man', 'nurse', 'teacher', 'waiter', 'waitress', 'woman']

## Solution

In [34]:
# Convert preds from strings to dictionaries
import ast
predictions['preds'] = predictions.preds.map(ast.literal_eval)

# Helper function to sum subset of dict values
def sum_dict_vals(x, keys):
    """Given a dict, x, subset it by keys, then sum its values"""
    return sum(x[k] for k in keys if k in x)

In [35]:
# Insert prob_human column
predictions['prob_human'] = predictions.preds.apply(sum_dict_vals, args=(humans,))

print(predictions)

                                                preds  prob_human
12        {'dog': 0.55, 'cat': 0.25, 'squirrel': 0.2}        0.00
41  {'telephone pole': 0.8, 'tower': 0.1, 'stick':...        0.00
43       {'man': 0.65, 'woman': 0.33, 'monkey': 0.02}        0.98
46  {'waiter': 0.45, 'waitress': 0.3, 'newspaper':...        0.75
49                      {'nurse': 0.5, 'doctor': 0.5}        1.00
72  {'baseball': 0.8, 'basketball': 0.15, 'footbal...        0.00
91      {'woman': 0.62, 'man': 0.28, 'elephant': 0.1}        0.90
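Another way to reach the same column, sketched below, is to expand the parsed dictionaries into a wide DataFrame with one column per label, then select the human labels and sum across each row. It uses three of the seven images to stay short. The intermediate names `parsed` and `probs` are assumptions made for this example.

```python
import ast
import pandas as pd

humans = ['doctor', 'man', 'nurse', 'teacher', 'waiter', 'waitress', 'woman']

# Three-image subset of the original data
predictions = pd.DataFrame({'preds': {
    12: "{'dog': 0.55, 'cat': 0.25, 'squirrel': 0.2}",
    43: "{'man': 0.65, 'woman': 0.33, 'monkey': 0.02}",
    49: "{'nurse': 0.50, 'doctor': 0.50}",
}})

# Parse each string into a dict, then spread the dicts into columns;
# labels missing for an image become NaN
parsed = predictions['preds'].map(ast.literal_eval)
probs = pd.DataFrame(parsed.tolist(), index=predictions.index)

# reindex adds all-NaN columns for human labels that never occur
# (e.g. 'teacher'); sum() skips NaN by default
predictions['prob_human'] = probs.reindex(columns=humans).sum(axis=1)
print(predictions['prob_human'])
```

The wide `probs` table is also handy for other queries, such as finding the most likely label per image.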
