# 🌳 Pandas Mastery Series - Advanced

Welcome to the Pandas Mastery Series - Advanced! In this notebook, we will explore advanced pandas topics to enhance your data manipulation and analysis skills. This series will cover complex operations and provide practical examples to deepen your understanding of pandas. Let's dive into the advanced techniques!

## Table of Contents

### 1. **MultiIndex**
- Creating MultiIndex
- Accessing MultiIndex
- Advanced Indexing with MultiIndex

### 2. **Advanced GroupBy**
- Grouping by Multiple Columns
- Custom Aggregations
- Grouping with Functions

### 3. **Reshaping**
- Pivot and Pivot Tables
- Stack and Unstack
- Melting DataFrames

### 4. **Time Series**
- Date Range Generation
- Time Zone Handling
- Time Series Resampling

### 5. **Merging, Joining, and Concatenating**
- Concatenating DataFrames
- Merging on Index
- Advanced Joining Techniques

### 6. **Window Functions**
- Rolling Windows
- Expanding Windows
- Applying Custom Functions

### 7. **Text Data**
- String Methods
- Regular Expressions
- Extracting Information from Text

### 8. **Performance and Optimization**
- Efficient Data Loading
- Memory Usage Reduction
- Parallel Processing

### 9. **Fun Challenges**
- Challenge 1: The MultiIndex Mystery
- Challenge 2: The GroupBy Gauntlet
- Challenge 3: The Reshaping Riddle
- Challenge 4: The Time Series Trial
- Challenge 5: The Optimization Obstacle

### Ready for the Ultimate Challenge?

Once you've completed all the notebooks in the Pandas Mastery Series, you'll be ready to tackle the final challenge: [Pandas Mastery Series - Ultimate Challenge](https://www.kaggle.com/code/matinmahmoudi/pandas-mastery-series-ultimate-challenge). This ultimate challenge will put your pandas skills to the test and ensure you're truly a pandas master.

Let's get started and become advanced pandas masters!


# 1. MultiIndex

MultiIndex (also known as hierarchical indexing) is an advanced indexing technique that allows for multiple levels of indexing within a pandas DataFrame or Series. This feature enables more complex data analysis and manipulation.

### Creating MultiIndex
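A MultiIndex can be created from arrays, tuples, or the Cartesian product of iterables (`pd.MultiIndex.from_arrays`, `from_tuples`, `from_product`). You can also turn existing columns of a DataFrame into a MultiIndex with `set_index`.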

### Accessing MultiIndex
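Data in a MultiIndex DataFrame or Series is accessed with `.loc`: a single label selects everything under an outer level, while a tuple such as `('Wizard', 'Gandalf')` selects one entry across several levels.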

### Advanced Indexing with MultiIndex
Advanced indexing techniques, such as slicing and indexing with multiple levels, can be performed on MultiIndex objects.
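
As a minimal sketch of the slicing mentioned above (the variable names are illustrative, and the data simply mirrors the characters used in this notebook): label slices on a MultiIndex generally need a sorted index, so the example sorts first, then slices with `.loc`, `pd.IndexSlice`, and `xs`.

```python
import pandas as pd

# Illustrative data mirroring the characters used in this notebook
index = pd.MultiIndex.from_arrays(
    [['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
     ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas']],
    names=('Race', 'Character'),
)
ages = pd.DataFrame({'Age': [50, 38, 2019, 87, 2931]}, index=index)

# Label slicing on a MultiIndex generally needs a lexicographically sorted index
ages_sorted = ages.sort_index()

# Slice a range of labels on the first level (both endpoints are inclusive)
elf_to_human = ages_sorted.loc['Elf':'Human']

# Select one label on the second level across all races
sam_only = ages_sorted.loc[pd.IndexSlice[:, 'Sam'], :]

# Cross-section: pick one value of a level and drop that level from the result
hobbits = ages_sorted.xs('Hobbit', level='Race')

print(elf_to_human)
print(sam_only)
print(hobbits)
```

Sorting once up front is the design choice here: it keeps label slices valid and avoids repeated lookups on an unsorted index.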


In [1]:
# Import pandas library
import pandas as pd

arrays = [
    ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas']
]
multi_index = pd.MultiIndex.from_arrays(arrays, names=('Race', 'Character'))
print("MultiIndex from arrays:\n", multi_index)

# Creating a DataFrame with MultiIndex
data = {
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df_multi = pd.DataFrame(data, index=multi_index)
print("\nDataFrame with MultiIndex:\n", df_multi)

# Accessing MultiIndex
# Accessing data for 'Hobbit' race
hobbit_data = df_multi.loc['Hobbit']
print("\nData for 'Hobbit' race:\n", hobbit_data)

# Accessing data for 'Gandalf' in 'Wizard' race
gandalf_data = df_multi.loc[('Wizard', 'Gandalf')]
print("\nData for 'Gandalf' in 'Wizard' race:\n", gandalf_data)

# Advanced Indexing with MultiIndex
# Selecting several races at once by passing a list of labels
hobbit_wizard_data = df_multi.loc[['Hobbit', 'Wizard']]
print("\nData for 'Hobbit' and 'Wizard' races:\n", hobbit_wizard_data)


MultiIndex from arrays:
 MultiIndex([('Hobbit',   'Frodo'),
            ('Hobbit',     'Sam'),
            ('Wizard', 'Gandalf'),
            ( 'Human', 'Aragorn'),
            (   'Elf', 'Legolas')],
           names=['Race', 'Character'])

DataFrame with MultiIndex:
                    Age         Role
Race   Character                   
Hobbit Frodo        50  Ring-bearer
       Sam          38     Gardener
Wizard Gandalf    2019       Wizard
Human  Aragorn      87         King
Elf    Legolas    2931       Archer

Data for 'Hobbit' race:
            Age         Role
Character                  
Frodo       50  Ring-bearer
Sam         38     Gardener

Data for 'Gandalf' in 'Wizard' race:
 Age       2019
Role    Wizard
Name: (Wizard, Gandalf), dtype: object

Data for 'Hobbit' and 'Wizard' races:
                    Age         Role
Race   Character                   
Hobbit Frodo        50  Ring-bearer
       Sam          38     Gardener
Wizard Gandalf    2019       Wizard


# 2. Advanced GroupBy

The GroupBy operation in pandas is powerful for aggregating data. Advanced GroupBy techniques include grouping by multiple columns, custom aggregations, and grouping with functions.

### Grouping by Multiple Columns
You can group data by more than one column, allowing for more granular analysis.

### Custom Aggregations
Custom aggregation functions can be applied to grouped data for specific calculations.

### Grouping with Functions
You can group data using custom functions to define the grouping criteria.
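
A function can also be handed to `groupby` directly, without first storing its result in a new column. The sketch below uses a small illustrative DataFrame (`df_chars` is a hypothetical name): when `groupby` receives a function, it calls that function on each index label, so the example moves `Age` into the index first. Passing a Series of precomputed keys is an equivalent route.

```python
import pandas as pd

# Illustrative data; df_chars is a hypothetical name
df_chars = pd.DataFrame({
    'Character': ['Frodo', 'Sam', 'Gandalf', 'Gimli'],
    'Age': [50, 38, 2019, 139],
})

def age_group(age):
    return 'Young' if age < 100 else 'Old'

# A function passed to groupby is called on each index label,
# so Age is moved into the index before grouping
by_func = df_chars.set_index('Age')['Character'].groupby(age_group).count()

# Equivalent approach: pass a Series of group keys aligned with the rows
keys = df_chars['Age'].map(age_group)
by_series = df_chars.groupby(keys)['Age'].mean()

print(by_func)
print(by_series)
```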


In [2]:
# Import pandas library
import pandas as pd

# Creating a DataFrame for GroupBy operations
data = {
    'Character': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas', 'Boromir', 'Gimli', 'Pippin', 'Merry'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf', 'Human', 'Dwarf', 'Hobbit', 'Hobbit'],
    'Age': [50, 38, 2019, 87, 2931, 41, 139, 29, 37]
}
df = pd.DataFrame(data)

# Grouping by multiple columns
grouped_multi = df.groupby(['Race', 'Character'])['Age'].mean()
print("Grouped by Race and Character:\n", grouped_multi)

# Custom Aggregations
# Define custom aggregation functions
def range_agg(series):
    return series.max() - series.min()

custom_agg = df.groupby('Race').agg(
    count=('Age', 'size'),
    mean_age=('Age', 'mean'),
    age_range=('Age', range_agg)
)
print("\nCustom Aggregations:\n", custom_agg)

# Grouping with Functions
# Define a function to group characters into 'Young' and 'Old'
def age_group(age):
    return 'Young' if age < 100 else 'Old'

# Apply the function to the 'Age' column to create a new column 'AgeGroup'
df['AgeGroup'] = df['Age'].apply(age_group)

# Group by 'AgeGroup' and calculate the mean for numeric columns
grouped_func = df.groupby('AgeGroup').mean(numeric_only=True)
print("\nGrouped by Age Group:\n", grouped_func)



Grouped by Race and Character:
 Race    Character
Dwarf   Gimli         139.0
Elf     Legolas      2931.0
Hobbit  Frodo          50.0
        Merry          37.0
        Pippin         29.0
        Sam            38.0
Human   Aragorn        87.0
        Boromir        41.0
Wizard  Gandalf      2019.0
Name: Age, dtype: float64

Custom Aggregations:
         count  mean_age  age_range
Race                              
Dwarf       1     139.0          0
Elf         1    2931.0          0
Hobbit      4      38.5         21
Human       2      64.0         46
Wizard      1    2019.0          0

Grouped by Age Group:
                   Age
AgeGroup             
Old       1696.333333
Young       47.000000


# 3. Reshaping

Reshaping data in pandas involves changing the layout of a DataFrame or Series. This can include pivoting, stacking, unstacking, and melting data.

### Pivot and Pivot Tables
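Pivoting turns the unique values of one column into new column headers. `pivot_table` additionally aggregates entries that share the same row and column labels, which makes it suitable for summary statistics.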

### Stack and Unstack
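Stacking moves the column labels into the innermost level of the row index, producing a longer result. Unstacking does the reverse: it moves a level of the row index (the innermost by default) into the columns, producing a wider result.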

### Melting DataFrames
Melting converts a DataFrame from a wide format to a long format.
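
To make the wide-to-long idea concrete, here is a small sketch with made-up scores (the `Score_2022` and `Score_2023` columns are hypothetical). It melts the yearly columns into rows and then uses `pivot` to go back to the wide layout.

```python
import pandas as pd

# Hypothetical wide table: one row per character, one column per year
wide = pd.DataFrame({
    'Character': ['Frodo', 'Sam'],
    'Score_2022': [7, 9],
    'Score_2023': [8, 6],
})

# Wide to long: every (Character, year column) pair becomes its own row
long_df = wide.melt(id_vars='Character', var_name='Year', value_name='Score')
long_df['Year'] = long_df['Year'].str.replace('Score_', '', regex=False).astype(int)

# Long back to wide: pivot reverses the melt
wide_again = long_df.pivot(index='Character', columns='Year', values='Score')

print(long_df)
print(wide_again)
```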


In [3]:
# Import pandas library
import pandas as pd

data = {
    'Character': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df = pd.DataFrame(data)

# Pivot Table
pivot_table = df.pivot_table(values='Age', index='Race', columns='Role', aggfunc='mean')
print("Pivot Table:\n", pivot_table)

# Stack
stacked = df.set_index(['Race', 'Character']).stack()
print("\nStacked DataFrame:\n", stacked)

# Unstack
unstacked = stacked.unstack()
print("\nUnstacked DataFrame:\n", unstacked)

# Melting
melted = pd.melt(df, id_vars=['Character'], value_vars=['Race', 'Age', 'Role'], var_name='Attribute', value_name='Value')
print("\nMelted DataFrame:\n", melted)


Pivot Table:
 Role    Archer  Gardener  King  Ring-bearer  Wizard
Race                                               
Elf     2931.0       NaN   NaN          NaN     NaN
Hobbit     NaN      38.0   NaN         50.0     NaN
Human      NaN       NaN  87.0          NaN     NaN
Wizard     NaN       NaN   NaN          NaN  2019.0

Stacked DataFrame:
 Race    Character      
Hobbit  Frodo      Age              50
                   Role    Ring-bearer
        Sam        Age              38
                   Role       Gardener
Wizard  Gandalf    Age            2019
                   Role         Wizard
Human   Aragorn    Age              87
                   Role           King
Elf     Legolas    Age            2931
                   Role         Archer
dtype: object

Unstacked DataFrame:
                    Age         Role
Race   Character                   
Elf    Legolas    2931       Archer
Hobbit Frodo        50  Ring-bearer
       Sam          38     Gardener
Human  Aragorn      87

# 4. Time Series

Time series data is a sequence of data points recorded over time. pandas provides robust functionality for working with time series data, including creating time series, resampling, shifting, and applying rolling windows.

### Date Range Generation
`pd.date_range()` generates a fixed-frequency sequence of dates that can serve as the index of a time series.

### Time Zone Handling
Converting time series data to different time zones.
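
A brief sketch of the two steps involved, assuming the `America/New_York` zone as an illustrative target: `tz_localize` attaches a time zone to naive timestamps, while `tz_convert` re-expresses time-zone-aware timestamps in another zone without changing the underlying instants.

```python
import pandas as pd

# Naive timestamps carry no time zone information
naive = pd.date_range(start='2023-01-01', periods=2, freq='D')

# Attach a time zone: the wall-clock values are interpreted as UTC
utc = naive.tz_localize('UTC')

# Convert to another zone: same instants, different wall-clock values
eastern = utc.tz_convert('America/New_York')

# The two indexes still refer to the same moments in time
same_instants = bool((utc == eastern).all())

print(utc[0], '->', eastern[0])
print(same_instants)
```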

### Time Series Resampling
Changing the frequency of your time series data using resampling methods.
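
The sketch below uses a deterministic series (the values 1 to 10, chosen only for illustration) so the binning is easy to follow. With `'W'`, bins end on Sunday by default, and 2023-01-01 is a Sunday, so the first weekly bin contains a single day.

```python
import pandas as pd

days = pd.date_range(start='2023-01-01', periods=10, freq='D')
daily = pd.Series(range(1, 11), index=days)

# Weekly bins end on Sunday by default and are labelled by that end date
weekly_sum = daily.resample('W').sum()

# Three-day bins start at the first timestamp and are labelled by their start
three_day_mean = daily.resample('3D').mean()

print(weekly_sum)
print(three_day_mean)
```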


In [4]:
# Import pandas library
import pandas as pd
import numpy as np

# Date Range Generation
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
print("Date Range:\n", date_range)

# Creating a time series with random data
time_series = pd.Series(np.random.randn(10), index=date_range)
print("\nTime Series:\n", time_series)

# Time Zone Handling
# Localize the time series to a specific time zone
localized_series = time_series.tz_localize('UTC')
print("\nLocalized Time Series (UTC):\n", localized_series)

# Convert the time series to a different time zone
converted_series = localized_series.tz_convert('US/Eastern')
print("\nConverted Time Series (US/Eastern):\n", converted_series)

# Time Series Resampling
# Resampling to a different frequency (e.g., weekly)
weekly_series = time_series.resample('W').mean()
print("\nWeekly Resampled Series:\n", weekly_series)


Date Range:
 DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10'],
              dtype='datetime64[ns]', freq='D')

Time Series:
 2023-01-01    0.581692
2023-01-02    1.350540
2023-01-03   -0.706632
2023-01-04    0.189312
2023-01-05    1.085120
2023-01-06   -0.508063
2023-01-07   -1.023770
2023-01-08   -0.996731
2023-01-09   -1.139132
2023-01-10   -0.932358
Freq: D, dtype: float64

Localized Time Series (UTC):
 2023-01-01 00:00:00+00:00    0.581692
2023-01-02 00:00:00+00:00    1.350540
2023-01-03 00:00:00+00:00   -0.706632
2023-01-04 00:00:00+00:00    0.189312
2023-01-05 00:00:00+00:00    1.085120
2023-01-06 00:00:00+00:00   -0.508063
2023-01-07 00:00:00+00:00   -1.023770
2023-01-08 00:00:00+00:00   -0.996731
2023-01-09 00:00:00+00:00   -1.139132
2023-01-10 00:00:00+00:00   -0.932358
Freq: D, dtype: float64

Converted Time Series (US/Eastern):
 2022-12-31 

# 5. Merging, Joining, and Concatenating

Combining multiple DataFrames is a common task in data analysis. pandas provides several methods to accomplish this, including concatenation, merging, and joining.

### Concatenating DataFrames
Concatenation combines multiple DataFrames along a particular axis.

### Merging on Index
Merging allows you to combine DataFrames based on their index or a common column.
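
As a hedged illustration of index-based merging (the `races` and `weapons` frames are made up for this sketch), `pd.merge` can match on the index of both sides with `left_index=True` and `right_index=True`, or mix a column on one side with the index on the other.

```python
import pandas as pd

# Illustrative frames indexed by character name
races = pd.DataFrame({'Race': ['Hobbit', 'Wizard']}, index=['Frodo', 'Gandalf'])
weapons = pd.DataFrame({'Weapon': ['Sting', 'Staff', 'Axe']},
                       index=['Frodo', 'Gandalf', 'Gimli'])

# Match rows on the index of both frames
on_index = pd.merge(races, weapons, left_index=True, right_index=True, how='inner')

# Mixed case: a column on the left matched against the index on the right
races_col = races.rename_axis('Name').reset_index()
mixed = pd.merge(races_col, weapons, left_on='Name', right_index=True, how='left')

print(on_index)
print(mixed)
```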

### Advanced Joining Techniques
Advanced joining techniques involve performing SQL-like joins with DataFrames.
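
One SQL-like pattern worth sketching is the full outer join with a provenance column. The example below uses illustrative data: `indicator=True` adds a `_merge` column that records whether each row came from the left frame, the right frame, or both, which also makes an anti-join easy to express.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Frodo', 'Sam', 'Gandalf'],
                     'Race': ['Hobbit', 'Hobbit', 'Wizard']})
right = pd.DataFrame({'Name': ['Gandalf', 'Gimli'],
                      'Weapon': ['Staff', 'Axe']})

# Full outer join; _merge is 'left_only', 'right_only', or 'both'
outer = pd.merge(left, right, on='Name', how='outer', indicator=True)

# Anti-join: rows of the left frame with no match in the right frame
left_only = outer.loc[outer['_merge'] == 'left_only', ['Name', 'Race']]

print(outer)
print(left_only)
```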


In [5]:
# Import pandas library
import pandas as pd


df1 = pd.DataFrame({
    'Name': ['Frodo', 'Sam', 'Gandalf'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard']
})
df2 = pd.DataFrame({
    'Name': ['Aragorn', 'Legolas', 'Gimli'],
    'Race': ['Human', 'Elf', 'Dwarf']
})

# Concatenating DataFrames
concat_df = pd.concat([df1, df2])
print("Concatenated DataFrame:\n", concat_df)

# Creating DataFrames for merging
df3 = pd.DataFrame({
    'Name': ['Frodo', 'Sam', 'Gandalf'],
    'Weapon': ['Sting', 'Sword', 'Staff']
})
df4 = pd.DataFrame({
    'Name': ['Aragorn', 'Legolas', 'Gimli'],
    'Weapon': ['Anduril', 'Bow', 'Axe']
})

# Merging DataFrames on a common column
merged_df = pd.merge(df1, df3, on='Name', how='inner')
print("\nMerged DataFrame (inner join):\n", merged_df)

# Advanced Joining Techniques
# Left join
left_join_df = df1.join(df3.set_index('Name'), on='Name', how='left')
print("\nLeft Join DataFrame:\n", left_join_df)

# Right join
right_join_df = df1.join(df4.set_index('Name'), on='Name', how='right')
print("\nRight Join DataFrame:\n", right_join_df)


Concatenated DataFrame:
       Name    Race
0    Frodo  Hobbit
1      Sam  Hobbit
2  Gandalf  Wizard
0  Aragorn   Human
1  Legolas     Elf
2    Gimli   Dwarf

Merged DataFrame (inner join):
       Name    Race Weapon
0    Frodo  Hobbit  Sting
1      Sam  Hobbit  Sword
2  Gandalf  Wizard  Staff

Left Join DataFrame:
       Name    Race Weapon
0    Frodo  Hobbit  Sting
1      Sam  Hobbit  Sword
2  Gandalf  Wizard  Staff

Right Join DataFrame:
         Name Race   Weapon
NaN  Aragorn  NaN  Anduril
NaN  Legolas  NaN      Bow
NaN    Gimli  NaN      Axe


# 6. Window Functions

Window functions in pandas allow you to perform operations over a specified window or rolling time frame. These functions are useful for smoothing, trend analysis, and other time series operations.

### Rolling Windows
Applying operations over a rolling window of specified size.

### Expanding Windows
Applying cumulative operations over an expanding window.

### Applying Custom Functions
Using custom functions with rolling and expanding windows for more complex calculations.
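
The following sketch applies one custom function to both window types, using a short deterministic series chosen for illustration. Passing `raw=True` hands each window to the function as a NumPy array, which is usually faster than receiving a Series, and `min_periods=1` produces results for the first, partially filled windows.

```python
import pandas as pd

values = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])

def window_range(window):
    # With raw=True the window arrives as a NumPy array
    return window.max() - window.min()

# Full windows only: the first two results are NaN
rolling_range = values.rolling(window=3).apply(window_range, raw=True)

# Allow partial windows at the start of the series
partial_range = values.rolling(window=3, min_periods=1).apply(window_range, raw=True)

# Expanding window: range of everything seen so far
expanding_range = values.expanding().apply(window_range, raw=True)

print(rolling_range)
print(partial_range)
print(expanding_range)
```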


In [6]:
# Import pandas library
import pandas as pd
import numpy as np

# Creating a time series with random data
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
time_series = pd.Series(np.random.randn(10), index=date_range)
print("Original Time Series:\n", time_series)

# Rolling Windows
# Applying a rolling window of 3 days and calculating the mean
rolling_mean = time_series.rolling(window=3).mean()
print("\nRolling Mean (3-day window):\n", rolling_mean)

# Expanding Windows
# Applying an expanding window and calculating the cumulative sum
expanding_sum = time_series.expanding().sum()
print("\nExpanding Sum:\n", expanding_sum)

# Applying Custom Functions
# Custom function to calculate the range (max - min) over a rolling window
def range_func(x):
    return x.max() - x.min()

rolling_range = time_series.rolling(window=3).apply(range_func)
print("\nRolling Range (3-day window):\n", rolling_range)


Original Time Series:
 2023-01-01    0.157666
2023-01-02    0.089719
2023-01-03    0.617544
2023-01-04   -0.164820
2023-01-05    0.086518
2023-01-06   -0.190851
2023-01-07   -0.786600
2023-01-08   -0.083808
2023-01-09   -0.704289
2023-01-10   -0.468537
Freq: D, dtype: float64

Rolling Mean (3-day window):
 2023-01-01         NaN
2023-01-02         NaN
2023-01-03    0.288310
2023-01-04    0.180814
2023-01-05    0.179748
2023-01-06   -0.089718
2023-01-07   -0.296978
2023-01-08   -0.353753
2023-01-09   -0.524899
2023-01-10   -0.418878
Freq: D, dtype: float64

Expanding Sum:
 2023-01-01    0.157666
2023-01-02    0.247385
2023-01-03    0.864929
2023-01-04    0.700109
2023-01-05    0.786628
2023-01-06    0.595776
2023-01-07   -0.190824
2023-01-08   -0.274632
2023-01-09   -0.978921
2023-01-10   -1.447457
Freq: D, dtype: float64

Rolling Range (3-day window):
 2023-01-01         NaN
2023-01-02         NaN
2023-01-03    0.527825
2023-01-04    0.782363
2023-01-05    0.782363
2023-01-06    0.2773

# 7. Text Data

Working with text data in pandas involves various string operations, including manipulation, regular expressions, and extracting information from text. These operations are essential for cleaning and transforming text data.

### String Methods
pandas provides numerous built-in string methods for common text operations.

### Regular Expressions
Regular expressions (regex) are powerful tools for pattern matching and text extraction.

### Extracting Information from Text
You can extract specific information from text data using pandas' string methods and regex.
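
As a small sketch of regex-based extraction (the group names `first` and `rest` are arbitrary choices for this example): named capture groups passed to `str.extract` become column names, and `str.findall` returns every match per row.

```python
import pandas as pd

names = pd.Series(['Frodo Baggins', 'Gandalf the Grey', 'Legolas of Mirkwood'])

# Named groups become the column names of the resulting DataFrame
parts = names.str.extract(r'^(?P<first>\w+)\s+(?P<rest>.+)$')

# All capitalised words in each name
capitalised = names.str.findall(r'\b[A-Z]\w*')

print(parts)
print(capitalised)
```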


In [7]:
# Import pandas library
import pandas as pd

# Creating a Series with text data
characters = pd.Series(['Frodo Baggins', 'Samwise Gamgee', 'Gandalf the Grey', 'Aragorn son of Arathorn', 'Legolas of Mirkwood'])
print("Original Series:\n", characters)

# String Methods
# Converting to lowercase
lowercase_characters = characters.str.lower()
print("\nLowercase Series:\n", lowercase_characters)

# Extracting length of each string
length_characters = characters.str.len()
print("\nLength of each character name:\n", length_characters)

# Regular Expressions
# Extracting first names using regex
first_names = characters.str.extract(r'(\w+)', expand=False)
print("\nExtracted First Names:\n", first_names)

# Extracting titles (e.g., "the Grey", "son of Arathorn")
titles = characters.str.extract(r'(the \w+|son of \w+|of \w+)', expand=False)
print("\nExtracted Titles:\n", titles)

# Extracting Information from Text
# Checking if the character name contains "of"
contains_of = characters.str.contains('of')
print("\nCharacter names containing 'of':\n", contains_of)


Original Series:
 0              Frodo Baggins
1             Samwise Gamgee
2           Gandalf the Grey
3    Aragorn son of Arathorn
4        Legolas of Mirkwood
dtype: object

Lowercase Series:
 0              frodo baggins
1             samwise gamgee
2           gandalf the grey
3    aragorn son of arathorn
4        legolas of mirkwood
dtype: object

Length of each character name:
 0    13
1    14
2    16
3    23
4    19
dtype: int64

Extracted First Names:
 0      Frodo
1    Samwise
2    Gandalf
3    Aragorn
4    Legolas
dtype: object

Extracted Titles:
 0                NaN
1                NaN
2           the Grey
3    son of Arathorn
4        of Mirkwood
dtype: object

Character names containing 'of':
 0    False
1    False
2    False
3     True
4     True
dtype: bool


# 8. Performance and Optimization

Optimizing the performance of your pandas code is crucial for handling large datasets efficiently. This section covers techniques for efficient data loading, reducing memory usage, and parallel processing.

### Efficient Data Loading
Loading data efficiently using various methods provided by pandas.
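
A minimal sketch of three `pd.read_csv` options that help at load time, using a tiny in-memory CSV invented for this example: `usecols` skips unneeded columns, `dtype` sets compact types up front, and `chunksize` reads the file in pieces so it never has to fit in memory at once.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large file (illustrative data)
csv_text = (
    "name,race,age,notes\n"
    "Frodo,Hobbit,50,ring\n"
    "Sam,Hobbit,38,garden\n"
    "Gimli,Dwarf,139,axe\n"
)

# Read only the needed columns and choose compact dtypes while parsing
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['name', 'race', 'age'],
    dtype={'race': 'category', 'age': 'int16'},
)

# Process the file in chunks of two rows and accumulate a running total
total_age = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=['age'], chunksize=2):
    total_age += int(chunk['age'].sum())

print(subset.dtypes)
print(total_age)
```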

### Memory Usage Reduction
Techniques to reduce memory usage by optimizing data types and other strategies.
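
The sketch below shows two common reductions on illustrative data: converting a text column with few distinct values to `category`, and downcasting integers with `pd.to_numeric`. The category saving depends on values repeating; a column of mostly unique values can end up larger after conversion.

```python
import numpy as np
import pandas as pd

n = 10_000
frame = pd.DataFrame({
    # Only four distinct labels, repeated many times
    'race': np.tile(['Hobbit', 'Elf', 'Dwarf', 'Human'], n // 4),
    # Values between 0 and 99 fit comfortably in an 8-bit integer
    'age': np.arange(n) % 100,
})
before = frame.memory_usage(deep=True).sum()

# Low-cardinality text: store each label once plus small integer codes
frame['race'] = frame['race'].astype('category')

# Let pandas pick the smallest integer type that holds the values
frame['age'] = pd.to_numeric(frame['age'], downcast='integer')

after = frame.memory_usage(deep=True).sum()
print(before, after)
```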

### Parallel Processing
Leveraging parallel processing to speed up computations on large datasets.


In [8]:
# Import pandas library
import pandas as pd
import numpy as np

# Efficient Data Loading
# Build a large DataFrame first; the default integer dtype here is int64 (8 bytes per value)
data = pd.DataFrame({
    'A': np.random.randint(0, 100, size=1000000),
    'B': np.random.randint(0, 100, size=1000000)
})
print("Original DataFrame memory usage:\n", data.memory_usage(deep=True))

# Convert integer columns to smaller data types (values 0-99 fit in int16, 2 bytes per value)
data['A'] = data['A'].astype(np.int16)
data['B'] = data['B'].astype(np.int16)
print("\nOptimized DataFrame memory usage:\n", data.memory_usage(deep=True))

# Memory Usage Reduction
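# Creating a DataFrame with mixed data types
# Note: hash_pandas_object returns uint64 hashes, so the 'strings' column holds 1000 unique integers, not text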
data_mixed = pd.DataFrame({
    'integers': np.random.randint(0, 100, size=1000),
    'floats': np.random.rand(1000),
    'strings': pd.Series(pd.util.hash_pandas_object(pd.Series(np.random.rand(1000).astype(str))))
})
print("\nOriginal mixed DataFrame memory usage:\n", data_mixed.memory_usage(deep=True))

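# Convert the 'strings' column to category type
# Every value is unique here, so the category stores all values plus codes and memory usage goes up;
# category only saves memory when values repeat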
data_mixed['strings'] = data_mixed['strings'].astype('category')
print("\nOptimized mixed DataFrame memory usage:\n", data_mixed.memory_usage(deep=True))

# Parallel Processing
# Using Dask to handle large datasets with parallel processing
import dask.dataframe as dd

# Convert pandas DataFrame to Dask DataFrame
dask_df = dd.from_pandas(data, npartitions=4)

# Perform parallel computation with Dask
result = dask_df.groupby('A').mean().compute()
print("\nParallel computation with Dask:\n", result)


Original DataFrame memory usage:
 Index        128
A        8000000
B        8000000
dtype: int64

Optimized DataFrame memory usage:
 Index        128
A        2000000
B        2000000
dtype: int64

Original mixed DataFrame memory usage:
 Index        128
integers    8000
floats      8000
strings     8000
dtype: int64

Optimized mixed DataFrame memory usage:
 Index         128
integers     8000
floats       8000
strings     43064
dtype: int64

Parallel computation with Dask:
             B
A            
0   49.594467
1   49.462999
2   49.676426
3   49.573739
4   49.680146
..        ...
95  49.211500
96  49.664722
97  49.512751
98  50.021618
99  49.798630

[100 rows x 1 columns]


# Challenge 1: The MultiIndex Mystery

You are given a DataFrame with a MultiIndex. Your task is to perform the following operations:

1. Access the data for a specific level of the MultiIndex.
2. Slice the data for a range of values in one of the levels.
3. Perform an aggregation on one of the levels.

**Dataset:**
- Characters and their attributes from "The Lord of the Rings"

Perform these operations and display the results.


In [9]:
# Import pandas library
import pandas as pd

arrays = [
    ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas']
]
multi_index = pd.MultiIndex.from_arrays(arrays, names=('Race', 'Character'))
data = {
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df_multi = pd.DataFrame(data, index=multi_index)


In [10]:
# Step 1: Access the data for 'Hobbit' race
hobbit_data = df_multi.loc['Hobbit']
print("\nData for 'Hobbit' race:\n", hobbit_data)

# Step 2: Select the 'Hobbit' and 'Wizard' races with a list of labels
# (a true label slice such as 'Elf':'Human' would require a sorted index)
hobbit_wizard_data = df_multi.loc[['Hobbit', 'Wizard']]
print("\nData for 'Hobbit' and 'Wizard' races:\n", hobbit_wizard_data)

# Step 3: Perform an aggregation on the 'Race' level (mean age)
mean_age_by_race = df_multi.groupby('Race')['Age'].mean()
print("\nMean age by race:\n", mean_age_by_race)



Data for 'Hobbit' race:
            Age         Role
Character                  
Frodo       50  Ring-bearer
Sam         38     Gardener

Data for 'Hobbit' and 'Wizard' races:
                    Age         Role
Race   Character                   
Hobbit Frodo        50  Ring-bearer
       Sam          38     Gardener
Wizard Gandalf    2019       Wizard

Mean age by race:
 Race
Elf       2931.0
Hobbit      44.0
Human       87.0
Wizard    2019.0
Name: Age, dtype: float64
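
As a side note to Step 3, an index level can also be named explicitly with `groupby(level=...)`, and several aggregations can be computed in one pass. This is a small sketch on a reduced, illustrative version of the data.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['Hobbit', 'Hobbit', 'Wizard'], ['Frodo', 'Sam', 'Gandalf']],
    names=('Race', 'Character'),
)
ages = pd.Series([50, 38, 2019], index=index, name='Age')

# Group on the 'Race' index level and compute several statistics at once
summary = ages.groupby(level='Race').agg(['mean', 'max', 'count'])
print(summary)
```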


# Challenge 2: The GroupBy Gauntlet

You are given a DataFrame of characters and their attributes from "The Lord of the Rings." Your task is to perform the following operations:

1. Group the data by 'Race' and calculate the mean 'Age'.
2. Apply a custom aggregation to calculate the range of 'Age' within each 'Race'.
3. Group the data using a custom function to classify characters as 'Young' or 'Old' based on their age.

Perform these operations and display the results.


In [11]:
# Import pandas library
import pandas as pd

data = {
    'Character': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas', 'Boromir', 'Gimli', 'Pippin', 'Merry'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf', 'Human', 'Dwarf', 'Hobbit', 'Hobbit'],
    'Age': [50, 38, 2019, 87, 2931, 41, 139, 29, 37]
}
df = pd.DataFrame(data)

In [12]:
# Step 1: Group by 'Race' and calculate the mean 'Age'
mean_age_by_race = df.groupby('Race')['Age'].mean()
print("Mean age by race:\n", mean_age_by_race)

# Step 2: Custom aggregation to calculate the range of 'Age' within each 'Race'
def range_agg(series):
    return series.max() - series.min()

# Ensure that only numeric data is used for the custom aggregation
age_range_by_race = df.groupby('Race').agg(
    age_range=pd.NamedAgg(column='Age', aggfunc=range_agg)
)
print("\nAge range by race:\n", age_range_by_race)

# Step 3: Group data using a custom function to classify characters as 'Young' or 'Old'
def age_group(age):
    return 'Young' if age < 100 else 'Old'

df['AgeGroup'] = df['Age'].apply(age_group)
grouped_by_age_group = df.groupby('AgeGroup')['Age'].mean()
print("\nGrouped by Age Group:\n", grouped_by_age_group)


Mean age by race:
 Race
Dwarf      139.0
Elf       2931.0
Hobbit      38.5
Human       64.0
Wizard    2019.0
Name: Age, dtype: float64

Age range by race:
         age_range
Race             
Dwarf           0
Elf             0
Hobbit         21
Human          46
Wizard          0

Grouped by Age Group:
 AgeGroup
Old      1696.333333
Young      47.000000
Name: Age, dtype: float64


# Challenge 3: The Reshaping Riddle

You are given a DataFrame of characters and their attributes from "The Lord of the Rings." Your task is to perform the following operations:

1. Pivot the data to create a summary table of ages by race and role.
2. Stack the data to convert columns into rows.
3. Unstack the data to convert rows into columns.
4. Melt the data to convert it from wide to long format.

Perform these operations and display the results.


In [13]:
# Import pandas library
import pandas as pd

data = {
    'Character': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df = pd.DataFrame(data)

In [14]:
# Step 1: Pivot the data to create a summary table of ages by race and role
pivot_table = df.pivot_table(values='Age', index='Race', columns='Role', aggfunc='mean')
print("Pivot Table:\n", pivot_table)

# Step 2: Stack the data to convert columns into rows
stacked = df.set_index(['Race', 'Character']).stack()
print("\nStacked DataFrame:\n", stacked)

# Step 3: Unstack the data to convert rows into columns
unstacked = stacked.unstack()
print("\nUnstacked DataFrame:\n", unstacked)

# Step 4: Melt the data to convert it from wide to long format
melted = pd.melt(df, id_vars=['Character'], value_vars=['Race', 'Age', 'Role'], var_name='Attribute', value_name='Value')
print("\nMelted DataFrame:\n", melted)


Pivot Table:
 Role    Archer  Gardener  King  Ring-bearer  Wizard
Race                                               
Elf     2931.0       NaN   NaN          NaN     NaN
Hobbit     NaN      38.0   NaN         50.0     NaN
Human      NaN       NaN  87.0          NaN     NaN
Wizard     NaN       NaN   NaN          NaN  2019.0

Stacked DataFrame:
 Race    Character      
Hobbit  Frodo      Age              50
                   Role    Ring-bearer
        Sam        Age              38
                   Role       Gardener
Wizard  Gandalf    Age            2019
                   Role         Wizard
Human   Aragorn    Age              87
                   Role           King
Elf     Legolas    Age            2931
                   Role         Archer
dtype: object

Unstacked DataFrame:
                    Age         Role
Race   Character                   
Elf    Legolas    2931       Archer
Hobbit Frodo        50  Ring-bearer
       Sam          38     Gardener
Human  Aragorn      87

# Challenge 4: The Time Series Trial

You are given a time series dataset with daily observations. Your task is to perform the following operations:

1. Generate a date range and create a time series with random data.
2. Resample the data to a weekly frequency and calculate the mean.
3. Shift the data forward by 2 days.
4. Calculate the rolling mean with a window of 3 days.

Perform these operations and display the results.


In [15]:
# Import pandas library
import pandas as pd
import numpy as np

# Step 1: Generate a date range and create a time series with random data
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
time_series = pd.Series(np.random.randn(10), index=date_range)
print("Original Time Series:\n", time_series)

# Step 2: Resample the data to a weekly frequency and calculate the mean
weekly_series = time_series.resample('W').mean()
print("\nWeekly Resampled Series (Mean):\n", weekly_series)

# Step 3: Shift the data forward by 2 days
shifted_series = time_series.shift(2)
print("\nShifted Series (Forward by 2 days):\n", shifted_series)

# Step 4: Calculate the rolling mean with a window of 3 days
rolling_mean_series = time_series.rolling(window=3).mean()
print("\nRolling Mean Series (3-day window):\n", rolling_mean_series)


Original Time Series:
 2023-01-01    1.488081
2023-01-02    0.027298
2023-01-03   -0.729678
2023-01-04    1.745853
2023-01-05    0.442168
2023-01-06   -1.280220
2023-01-07   -0.922605
2023-01-08    0.746070
2023-01-09   -0.673303
2023-01-10    1.189986
Freq: D, dtype: float64

Weekly Resampled Series (Mean):
 2023-01-01    1.488081
2023-01-08    0.004127
2023-01-15    0.258341
Freq: W-SUN, dtype: float64

Shifted Series (Forward by 2 days):
 2023-01-01         NaN
2023-01-02         NaN
2023-01-03    1.488081
2023-01-04    0.027298
2023-01-05   -0.729678
2023-01-06    1.745853
2023-01-07    0.442168
2023-01-08   -1.280220
2023-01-09   -0.922605
2023-01-10    0.746070
Freq: D, dtype: float64

Rolling Mean Series (3-day window):
 2023-01-01         NaN
2023-01-02         NaN
2023-01-03    0.261900
2023-01-04    0.347824
2023-01-05    0.486114
2023-01-06    0.302600
2023-01-07   -0.586886
2023-01-08   -0.485585
2023-01-09   -0.283279
2023-01-10    0.420918
Freq: D, dtype: float64
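
A note on Step 3: `shift(2)` moves the values while the index stays fixed, which is why the first two entries become NaN. Passing `freq` moves the index instead and keeps every value. The sketch below contrasts the two on a short illustrative series.

```python
import pandas as pd

idx = pd.date_range(start='2023-01-01', periods=4, freq='D')
s = pd.Series([10, 20, 30, 40], index=idx)

# Values move down by two positions; the index is unchanged
shifted_values = s.shift(2)

# The index moves forward by two days; the values are unchanged
shifted_index = s.shift(2, freq='D')

print(shifted_values)
print(shifted_index)
```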


# Challenge 5: The Optimization Obstacle

You are given a DataFrame with mixed data types. Your task is to perform the following operations:

1. Optimize the DataFrame by converting data types to more efficient ones.
2. Measure the memory usage before and after optimization.
3. Perform a parallel computation using Dask to handle large datasets.

Perform these operations and display the results.


In [16]:
# Import necessary libraries
import pandas as pd
import numpy as np
import dask.dataframe as dd

# Given DataFrame with mixed data types
data = pd.DataFrame({
    'integers': np.random.randint(0, 100, size=100000),
    'floats': np.random.rand(100000),
    'strings': pd.Series(['string' + str(i) for i in range(100000)])
})
print("Original DataFrame memory usage:\n", data.memory_usage(deep=True))

Original DataFrame memory usage:
 Index           128
integers     800000
floats       800000
strings     6788890
dtype: int64


In [17]:
# Step 2 (part 1): Record memory usage before any conversion
memory_before = data.memory_usage(deep=True).sum()

# Step 1: Optimize the DataFrame by converting data types
data['integers'] = data['integers'].astype(np.int16)
data['floats'] = data['floats'].astype(np.float32)
# Every string here is unique, so 'category' stores all values plus codes and uses more memory
data['strings'] = data['strings'].astype('category')
print("\nOptimized DataFrame memory usage:\n", data.memory_usage(deep=True))

# Step 2 (part 2): Measure memory usage after optimization and compare
memory_after = data.memory_usage(deep=True).sum()
print("\nMemory usage before optimization: {} bytes".format(memory_before))
print("Memory usage after optimization: {} bytes".format(memory_after))

# Step 3: Perform a parallel computation using Dask
# Convert pandas DataFrame to Dask DataFrame
dask_df = dd.from_pandas(data[['integers', 'floats']], npartitions=4)

# Perform a parallel computation (mean of each group of 'integers')
result = dask_df.groupby('integers').floats.mean().compute()
print("\nParallel computation result using Dask:\n", result)



Optimized DataFrame memory usage:
 Index           128
integers     200000
floats       400000
strings     9302466
dtype: int64

Memory usage before optimization: 8389018 bytes
Memory usage after optimization: 9902594 bytes

Parallel computation result using Dask:
 integers
0     0.503266
1     0.496368
2     0.489691
3     0.504265
4     0.505815
        ...   
95    0.499826
96    0.490031
97    0.499781
98    0.497330
99    0.494543
Name: floats, Length: 100, dtype: float64
