# EDA Advanced Lesson 

As usual, we import the necessary libraries.

In [65]:
import pandas as pd
import numpy as np

## Covariance and Correlation

Covariance and correlation are two statistical concepts used to describe the relationship between two variables. Covariance indicates the direction of the linear relationship between two variables, but its magnitude depends on the units in which the variables are measured. Correlation, on the other hand, measures both the strength and the direction of the _linear relationship_ on a unit-free scale.

Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

The correlation coefficient is a _normalized_ form of covariance: the covariance divided by the product of the two standard deviations. This makes it easier to interpret, because the result no longer depends on the units of the variables. The coefficient can take any value from -1 to 1. As a rough guide, the values are interpreted as follows:

- **-1**: Perfect negative linear correlation
- **-0.8**: Strong negative linear correlation
- **-0.5**: Moderate negative linear correlation
- **-0.2**: Weak negative linear correlation
- **0**: No linear correlation
- **0.2**: Weak positive linear correlation
- **0.5**: Moderate positive linear correlation
- **0.8**: Strong positive linear correlation
- **1**: Perfect positive linear correlation
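
To make the link between the two quantities concrete, the sketch below computes the sample covariance and the Pearson correlation by hand and compares them with the pandas results. The Series `x` and `y` are made-up illustration data, not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration data (not from the Yahoo! Finance files)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.5, 5.5, 8.0, 11.0])

# Sample covariance: sum of products of deviations, divided by n - 1
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance divided by both sample standard deviations
corr_manual = cov_manual / (x.std() * y.std())

print(cov_manual, x.cov(y))    # the two values should agree
print(corr_manual, x.corr(y))  # the two values should agree
```

Rescaling `y` (for example, multiplying it by 100) would change the covariance but leave the correlation unchanged, which is why correlation is usually the easier number to compare across pairs of variables.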


Here, we'll use DataFrames of stock prices and volumes obtained from Yahoo! Finance, stored as binary Python pickle files.

In [66]:
price = pd.read_pickle("../data/yahoo_price.pkl")
volume = pd.read_pickle("../data/yahoo_volume.pkl")

In [67]:
price

Date,AAPL,GOOG,IBM,MSFT
2010-01-04,27.990226,313.062468,113.304536,25.884104
2010-01-05,28.038618,311.683844,111.935822,25.892466
2010-01-06,27.592626,303.826685,111.208683,25.733566
2010-01-07,27.541619,296.753749,110.823732,25.465944
2010-01-08,27.724725,300.709808,111.935822,25.641571
...,...,...,...,...
2016-10-17,117.550003,779.960022,154.770004,57.220001
2016-10-18,117.470001,795.260010,150.720001,57.660000
2016-10-19,117.120003,801.500000,151.259995,57.529999
2016-10-20,117.059998,796.969971,151.520004,57.250000


In [68]:
volume

Date,AAPL,GOOG,IBM,MSFT
2010-01-04,123432400,3927000,6155300,38409100
2010-01-05,150476200,6031900,6841400,49749600
2010-01-06,138040000,7987100,5605300,58182400
2010-01-07,119282800,12876600,5840600,50559700
2010-01-08,111902700,9483900,4197200,51197400
...,...,...,...,...
2016-10-17,23624900,1089500,5890400,23830000
2016-10-18,24553500,1995600,12770600,19149500
2016-10-19,20034600,116600,4632900,22878400
2016-10-20,24125800,1734200,4023100,49455600


Compute the percent change of each price relative to the previous trading day using a window function (window functions are explained in a later section).

In [69]:
returns = price.pct_change()

returns.tail()

Date,AAPL,GOOG,IBM,MSFT
2016-10-17,-0.00068,0.001837,0.002072,-0.003483
2016-10-18,-0.000681,0.019616,-0.026168,0.00769
2016-10-19,-0.002979,0.007846,0.003583,-0.002255
2016-10-20,-0.000512,-0.005652,0.001719,-0.004867
2016-10-21,-0.00393,0.003011,-0.012474,0.042096
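
As a rough illustration of what `pct_change` computes, the sketch below applies it to a tiny made-up price series and reproduces the result with `shift`. The name `toy_price` and its values are assumptions chosen for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices on four consecutive business days
toy_price = pd.Series(
    [100.0, 102.0, 99.96, 104.958],
    index=pd.date_range("2020-01-06", periods=4, freq="B"),
)

# pct_change compares each value with the one before it
auto = toy_price.pct_change()

# The same calculation written out explicitly with shift()
manual = toy_price / toy_price.shift(1) - 1

print(auto)
print(np.allclose(auto.iloc[1:], manual.iloc[1:]))
```

The first row has no predecessor, so its percent change is `NaN`.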


Compute the covariance and the correlation between the returns of `MSFT` and `IBM`:

In [70]:
returns["MSFT"].cov(returns["IBM"])

np.float64(8.870655479703546e-05)

In [71]:
returns["MSFT"].corr(returns["IBM"])

np.float64(0.49976361144151144)

You can also get the full (pair-wise) correlation or covariance matrix as a DataFrame:

In [72]:
returns.cov()

,AAPL,GOOG,IBM,MSFT
AAPL,0.000277,0.000107,7.8e-05,9.5e-05
GOOG,0.000107,0.000251,7.8e-05,0.000108
IBM,7.8e-05,7.8e-05,0.000146,8.9e-05
MSFT,9.5e-05,0.000108,8.9e-05,0.000215


In [73]:
returns.corr()

,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.407919,0.386817,0.389695
GOOG,0.407919,1.0,0.405099,0.465919
IBM,0.386817,0.405099,1.0,0.499764
MSFT,0.389695,0.465919,0.499764,1.0


With `corrwith`, you can also compute pair-wise correlations between a DataFrame’s columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [74]:
returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

Passing a DataFrame computes the correlations of matching column names.

In [75]:
returns.corrwith(volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
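
The name matching can be seen more clearly on two small frames that only partly share columns. This is a minimal sketch with randomly generated data; the frames `left` and `right` and their column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frames: both have "A" and "B"; "C" and "D" have no partner
left = pd.DataFrame(rng.standard_normal((50, 3)), columns=["A", "B", "C"])
right = pd.DataFrame(rng.standard_normal((50, 3)), columns=["A", "B", "D"])

# Columns are paired by name; unmatched columns appear as NaN
result = left.corrwith(right)
print(result)

# drop=True omits the unmatched columns from the result
result_dropped = left.corrwith(right, drop=True)
print(result_dropped)
```

Each matched entry equals the ordinary pairwise correlation, for example `left["A"].corr(right["A"])`.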

## Hierarchical Indexing

Hierarchical indexing (MultiIndex) allows you to have multiple (two or more) _index levels_ on an axis. It enables "higher dimensional" data in a lower dimensional data structure.

You create a hierarchical index by simply passing a list of arrays to the index argument of a pandas DataFrame or Series.
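
As a small sketch of that idea, the example below builds the same index three ways: implicitly from a list of arrays, and explicitly with `pd.MultiIndex.from_arrays` and `pd.MultiIndex.from_tuples`. The labels and values are made up for illustration.

```python
import pandas as pd

outer = ["a", "a", "b", "b"]
inner = [1, 2, 1, 2]

# Passing a list of arrays builds the MultiIndex implicitly
s_implicit = pd.Series([10, 20, 30, 40], index=[outer, inner])

# The explicit constructors produce an equivalent index
idx_arrays = pd.MultiIndex.from_arrays([outer, inner])
idx_tuples = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])

print(s_implicit.index.equals(idx_arrays))
print(idx_arrays.equals(idx_tuples))
print(s_implicit.index.nlevels)
```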

### <span style="color: #4472C4;">Detailed Explanation</span>

**Think of Hierarchical Indexing like organizing a filing cabinet with multiple levels of organization.**

Imagine you're organizing customer data for a retail business. Instead of having separate folders for each combination (like "Electronics_January_Online", "Electronics_January_Store", etc.), you create a hierarchical system:

- **Level 1**: Product Category (Electronics, Clothing, Books)
- **Level 2**: Month (January, February, March)  
- **Level 3**: Sales Channel (Online, Store)

This is exactly what pandas MultiIndex does - it creates multiple "levels" of indexing that work together.

**Key Benefits of Hierarchical Indexing:**

1. **Space Efficiency**: Instead of creating separate DataFrames for each group, you store everything in one structure
2. **Easy Grouping**: You can quickly slice and dice data by any level
3. **Natural Data Representation**: Many real-world datasets naturally have hierarchical structure

**Real-World Applications:**

- **Financial Data**: Company → Year → Quarter → Metric
- **Sales Data**: Region → Store → Product Category → Month
- **Survey Data**: Country → Age Group → Gender → Question
- **Scientific Data**: Experiment → Trial → Measurement Type → Time

**The Power of Partial Indexing:**

When you have hierarchical indexing, you can "zoom in" at any level:
- Want all data for a specific company? Use the first level
- Want Q1 data across all companies? Use the second level
- Want specific combinations? Use multiple levels together

Think of it like a spreadsheet where you can collapse and expand grouped rows, but much more powerful and programmatic!

In [76]:
# Let's create a practical example: Sales data for a retail company
import pandas as pd
import numpy as np

# Create sample sales data with hierarchical structure
np.random.seed(42)  # For reproducible results

# Define the hierarchy levels
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books']
months = ['Jan', 'Feb', 'Mar']

# Create all combinations
multi_index = pd.MultiIndex.from_product([regions, products, months], 
                                       names=['Region', 'Product', 'Month'])

# Generate sample sales data
sales_data = np.random.randint(1000, 10000, size=len(multi_index))

# Create the hierarchical Series
sales = pd.Series(sales_data, index=multi_index, name='Sales')

print("Sample of our hierarchical sales data:")
print(sales.head(10))

Sample of our hierarchical sales data:
Region  Product      Month
North   Electronics  Jan      8270
                     Feb      1860
                     Mar      6390
        Clothing     Jan      6191
                     Feb      6734
                     Mar      7265
        Books        Jan      1466
                     Feb      5426
                     Mar      6578
South   Electronics  Jan      9322
Name: Sales, dtype: int64


In [77]:
# Example 1: Get all sales for the North region (Level 1 slicing)
print("All sales in North region:")
print(sales['North'])
print("\n" + "="*50 + "\n")

# Example 2: Get Electronics sales across all regions (Level 2 slicing)
print("Electronics sales across all regions:")
print(sales.xs('Electronics', level='Product'))
print("\n" + "="*50 + "\n")

# Example 3: Get January sales across all regions and products (Level 3 slicing)
print("January sales across all regions and products:")
print(sales.xs('Jan', level='Month'))
print("\n" + "="*50 + "\n")

# Example 4: Get specific combination - North region, Electronics, February
print("North region, Electronics sales in February:")
print(sales['North', 'Electronics', 'Feb'])

All sales in North region:
Product      Month
Electronics  Jan      8270
             Feb      1860
             Mar      6390
Clothing     Jan      6191
             Feb      6734
             Mar      7265
Books        Jan      1466
             Feb      5426
             Mar      6578
Name: Sales, dtype: int64


Electronics sales across all regions:
Region  Month
North   Jan      8270
        Feb      1860
        Mar      6390
South   Jan      9322
        Feb      2685
        Mar      1769
East    Jan      5555
        Feb      4385
        Mar      7396
West    Jan      3734
        Feb      4005
        Mar      5658
Name: Sales, dtype: int64


January sales across all regions and products:
Region  Product    
North   Electronics    8270
        Clothing       6191
        Books          1466
South   Electronics    9322
        Clothing       7949
        Books          6051
East    Electronics    5555
        Clothing       9666
        Books          3047
West    Electronics 

In [78]:
# Let's create a DataFrame with hierarchical indexing
# This is like having a spreadsheet with multiple row headers

# Create additional metrics for our sales data
profit_margin = np.random.uniform(0.1, 0.3, size=len(multi_index))
units_sold = np.random.randint(10, 100, size=len(multi_index))

# Create DataFrame with multiple columns
sales_df = pd.DataFrame({
    'Sales': sales_data,
    'Profit_Margin': profit_margin,
    'Units_Sold': units_sold
}, index=multi_index)

print("DataFrame with Hierarchical Index:")
print(sales_df.head(10))
print(f"\nDataFrame shape: {sales_df.shape}")
print(f"Index levels: {sales_df.index.nlevels}")
print(f"Index names: {sales_df.index.names}")

DataFrame with Hierarchical Index:
                          Sales  Profit_Margin  Units_Sold
Region Product     Month                                  
North  Electronics Jan     8270       0.236062          38
                   Feb     1860       0.190100          24
                   Mar     6390       0.102653          54
       Clothing    Jan     6191       0.288440          74
                   Feb     6734       0.212658          98
                   Mar     7265       0.177083          80
       Books       Jan     1466       0.103193          18
                   Feb     5426       0.146179          97
                   Mar     6578       0.148205          10
South  Electronics Jan     9322       0.236653          17

DataFrame shape: (36, 3)
Index levels: 3
Index names: ['Region', 'Product', 'Month']


In [79]:
# Common Business Questions with Hierarchical Data

# 1. What are the total sales by region?
print("Total sales by region:")
region_sales = sales_df.groupby(level='Region')['Sales'].sum().sort_values(ascending=False)
print(region_sales)
print("\n" + "="*50 + "\n")

# 2. Which product category performs best across all regions?
print("Average sales by product category:")
product_sales = sales_df.groupby(level='Product')['Sales'].mean().sort_values(ascending=False)
print(product_sales)
print("\n" + "="*50 + "\n")

# 3. Monthly trend analysis
print("Average sales by month:")
monthly_sales = sales_df.groupby(level='Month')['Sales'].mean()
print(monthly_sales)
print("\n" + "="*50 + "\n")

# 4. Best performing region-product combination
print("Top 5 region-product combinations by average sales:")
region_product = sales_df.groupby(level=['Region', 'Product'])['Sales'].mean().sort_values(ascending=False).head()
print(region_product)

Total sales by region:
Region
North    50180
East     47392
South    47124
West     39271
Name: Sales, dtype: int64


Average sales by product category:
Product
Clothing       6154.666667
Electronics    5085.750000
Books          4090.166667
Name: Sales, dtype: float64


Average sales by month:
Month
Feb    4711.916667
Jan    5556.500000
Mar    5062.166667
Name: Sales, dtype: float64


Top 5 region-product combinations by average sales:
Region  Product    
East    Clothing       7357.666667
North   Clothing       6730.000000
South   Clothing       5897.666667
East    Electronics    5778.666667
North   Electronics    5506.666667
Name: Sales, dtype: float64


**Common Operations and Pro Tips:**

1. **Level Selection**: Use `.xs()` for cross-section selection when you want specific values from inner levels
2. **Multiple Level Grouping**: Group by multiple levels simultaneously for detailed analysis
3. **Index Manipulation**: Use `swaplevel()` and `sort_index()` to reorganize your hierarchy
4. **Memory Efficiency**: One hierarchically indexed object is usually easier to manage, and often lighter on memory, than many separate DataFrames

**When to Use Hierarchical Indexing:**

✅ **Good for:**
- Time series data with multiple dimensions (stock prices by company and date)
- Survey data with multiple categorical breakdowns
- Financial data with multiple levels of aggregation
- Any data where you frequently need to slice by different categorical combinations

❌ **Avoid when:**
- You have simple, flat data structures
- You only ever need to access data by one dimension
- Your data doesn't have natural hierarchical relationships

**Key Takeaway:** Hierarchical indexing transforms complex, multi-dimensional data into an organized, easily queryable structure. It's like having a well-organized library where you can find books by author, genre, year, or any combination thereof!

In [80]:
data = pd.Series(np.random.uniform(size=9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                 [1, 2, 3, 1, 3, 1, 2, 2, 3]])

data

a  1    0.382927
   2    0.971712
   3    0.848914
b  1    0.721730
   3    0.235985
c  1    0.256068
   2    0.040434
d  2    0.710663
   3    0.110891
dtype: float64

In [81]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

You can use _partial indexing_ to select subsets of data:

In [82]:
data["b"]

1    0.721730
3    0.235985
dtype: float64

In [83]:
data["b":"c"]

b  1    0.721730
   3    0.235985
c  1    0.256068
   2    0.040434
dtype: float64

### <span style="color: #4472C4;">Detailed Explanation</span>

**Understanding `data["b":"c"]` - Slice Indexing with Hierarchical Data**

This syntax is called **slice indexing** and it works like slicing a list, but with hierarchical index labels instead of numbers.

**Think of it like this:**
- Your hierarchical data is like a **sorted filing cabinet**
- `data["b":"c"]` means "give me everything from folder 'b' up to and including folder 'c'"
- It's **inclusive** on both ends, so you get all 'b' entries AND all 'c' entries

**What happens step by step:**
1. **pandas looks at the first level** of your hierarchical index
2. **Finds all entries starting from 'b'** (inclusive)
3. **Continues until it reaches 'c'** (inclusive)
4. **Returns all the data** in that range

**Real-world analogy:**
Imagine you have customer files organized alphabetically:
- `customers["Brown":"Davis"]` would give you all customers from Brown through Davis
- This includes Brown, Carter, Chen, Davis, etc.

**Key Points:**
- ✅ **Both endpoints are included** ('b' and 'c' entries are both returned)
- ✅ **Works with the first level** of hierarchical index by default
- ✅ **Maintains the hierarchical structure** in the result
- ⚠️ **Requires a sorted index**: slicing an unsorted MultiIndex by label can raise an error instead of returning a range

**Why is this useful?**
- **Quick range selection**: Get data for consecutive categories
- **Alphabetical filtering**: Perfect for name ranges, product codes, etc.
- **Efficient**: On a sorted index, pandas can locate the slice boundaries directly, which is typically faster than filtering with conditions
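
The sorted-index requirement can be checked before slicing. The sketch below uses a small made-up Series whose first level is deliberately out of order; on an unsorted MultiIndex the label slice is expected to fail with `UnsortedIndexError` (a `KeyError` subclass), so the example catches `KeyError` and then sorts.

```python
import pandas as pd

# Hypothetical Series whose first index level is deliberately out of order
unsorted = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_tuples([("c", 1), ("a", 1), ("b", 1), ("a", 2)]),
)

print(unsorted.index.is_monotonic_increasing)  # False

# Label slicing on a MultiIndex that is not sorted is expected to fail
try:
    unsorted.loc["a":"b"]
    slice_failed = False
except KeyError as exc:  # UnsortedIndexError derives from KeyError
    slice_failed = True
    print(type(exc).__name__)

# Sorting first makes the slice well defined
ordered = unsorted.sort_index()
print(ordered.loc["a":"b"])
```

Checking `index.is_monotonic_increasing` is a cheap way to decide whether a `sort_index()` call is needed before slicing.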

In [84]:
# Let's create a clear example to demonstrate slice indexing
import pandas as pd
import numpy as np

# Create sample data with clear hierarchical structure
np.random.seed(100)  # For consistent results
companies = ['Apple', 'Google', 'Microsoft', 'Netflix', 'Tesla']
quarters = ['Q1', 'Q2', 'Q3', 'Q4']

# Create hierarchical index
multi_idx = pd.MultiIndex.from_product([companies, quarters], 
                                     names=['Company', 'Quarter'])

# Create sample revenue data
revenue = pd.Series(np.random.randint(10, 100, size=len(multi_idx)), 
                   index=multi_idx, name='Revenue_Billions')

print("Complete dataset:")
print(revenue)
print("\n" + "="*60 + "\n")

# Example 1: Slice from Google to Netflix (inclusive)
print("Example 1: revenue['Google':'Netflix']")
print("This gets ALL data from Google through Netflix (alphabetically)")
print(revenue['Google':'Netflix'])
print("\n" + "="*60 + "\n")

# Example 2: Slice from Apple to Microsoft
print("Example 2: revenue['Apple':'Microsoft']") 
print("This gets ALL data from Apple through Microsoft")
print(revenue['Apple':'Microsoft'])
print("\n" + "="*60 + "\n")

# Example 3: What if we want just one company?
print("Example 3: revenue['Google'] (no slice, just single selection)")
print("This gets ALL quarters for Google only")
print(revenue['Google'])

Complete dataset:
Company    Quarter
Apple      Q1         18
           Q2         34
           Q3         77
           Q4         97
Google     Q1         89
           Q2         58
           Q3         20
           Q4         62
Microsoft  Q1         63
           Q2         76
           Q3         24
           Q4         44
Netflix    Q1         34
           Q2         25
           Q3         70
           Q4         68
Tesla      Q1         26
           Q2         19
           Q3         96
           Q4         12
Name: Revenue_Billions, dtype: int64


Example 1: revenue['Google':'Netflix']
This gets ALL data from Google through Netflix (alphabetically)
Company    Quarter
Google     Q1         89
           Q2         58
           Q3         20
           Q4         62
Microsoft  Q1         63
           Q2         76
           Q3         24
           Q4         44
Netflix    Q1         34
           Q2         25
           Q3         70
           Q4         68
Na

In [85]:
# Now let's look at the original example to understand it better
print("ORIGINAL EXAMPLE EXPLANATION:")
print("="*50)

# The original data has this structure:
# Level 1: ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd']
# Level 2: [1, 2, 3, 1, 3, 1, 2, 2, 3]

print("Original data structure:")
print("a  1    [some value]")
print("   2    [some value]") 
print("   3    [some value]")
print("b  1    [some value]")
print("   3    [some value]")  
print("c  1    [some value]")
print("   2    [some value]")
print("d  2    [some value]")
print("   3    [some value]")
print()

print("When you do data['b':'c'], you get:")
print("b  1    [some value]")
print("   3    [some value]")  
print("c  1    [some value]")
print("   2    [some value]")
print()
print("Notice: It includes BOTH 'b' and 'c' groups entirely!")
print("This is different from data['b'] which would only give you the 'b' group.")

# Let's demonstrate with the actual data
data_demo = pd.Series(np.random.uniform(size=9),
                     index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print("\n" + "="*50)
print("ACTUAL DEMONSTRATION:")
print("Full data:")
print(data_demo)
print("\ndata['b':'c'] result:")
print(data_demo['b':'c'])

ORIGINAL EXAMPLE EXPLANATION:
Original data structure:
a  1    [some value]
   2    [some value]
   3    [some value]
b  1    [some value]
   3    [some value]
c  1    [some value]
   2    [some value]
d  2    [some value]
   3    [some value]

When you do data['b':'c'], you get:
b  1    [some value]
   3    [some value]
c  1    [some value]
   2    [some value]

Notice: It includes BOTH 'b' and 'c' groups entirely!
This is different from data['b'] which would only give you the 'b' group.

ACTUAL DEMONSTRATION:
Full data:
a  1    0.978624
   2    0.811683
   3    0.171941
b  1    0.816225
   3    0.274074
c  1    0.431704
   2    0.940030
d  2    0.817649
   3    0.336112
dtype: float64

data['b':'c'] result:
b  1    0.816225
   3    0.274074
c  1    0.431704
   2    0.940030
dtype: float64


**Common Use Cases for Slice Indexing:**

🎯 **Business Scenarios:**
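- `sales['January':'March']` - Get Q1 data (only if the month labels are stored in calendar order; in an alphabetically sorted index this range would skip February)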
- `customers['Brown':'Davis']` - Get customers in alphabetical range
- `products['Electronics':'Furniture']` - Get product categories in range
- `regions['Asia':'Europe']` - Get geographical regions

**Important Notes:**

⚠️ **Order Matters**: Your index should be sorted for predictable results
```python
# Good: sorted index
data.sort_index()['b':'d']  # Predictable results

# Risky: unsorted index
data['b':'d']  # May raise UnsortedIndexError if the MultiIndex is not sorted
```

✅ **Best Practices:**
1. **Always sort your index first** when using slice indexing
2. **Use `.loc[]` for more explicit control**: `data.loc['b':'c']`
3. **Remember it's inclusive** on both ends
4. **Test with small examples** first to understand the behavior

**Quick Comparison:**
- `data['b']` → Only the 'b' group
- `data['b':'c']` → Both 'b' and 'c' groups (range)
- `data[['b', 'c']]` → Only 'b' and 'c' groups (specific selection, skips anything in between)

**Key Takeaway:** Slice indexing (`['start':'end']`) is like selecting a range of chapters in a book - you get everything from the start chapter through the end chapter, including both endpoints!

In [86]:
data.loc[["b", "d"]]

b  1    0.721730
   3    0.235985
d  2    0.710663
   3    0.110891
dtype: float64

You can also select from an "inner" level:

In [87]:
data.loc[:, 2]

a    0.971712
c    0.040434
d    0.710663
dtype: float64
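
The same inner-level selection can also be written with `xs`, which names the level explicitly. This is a minimal sketch on a small stand-in Series `s` with made-up values, not the `data` object above.

```python
import pandas as pd

# Small stand-in for the `data` Series above (values are made up)
s = pd.Series(
    [0.1, 0.2, 0.3, 0.4, 0.5],
    index=[["a", "a", "b", "c", "c"], [1, 2, 1, 1, 2]],
)

by_loc = s.loc[:, 2]      # all outer keys, inner key equal to 2
by_xs = s.xs(2, level=1)  # same selection as a cross-section

print(by_loc)
print(by_xs)
```

Both forms drop the selected inner level from the result, leaving only the outer keys that contain the label `2`.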

Hierarchical indexing works on both axes.

In [88]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                        index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                        columns=[['Ohio', 'Ohio', 'Colorado'],
                        ['Green', 'Red', 'Green']])

frame

,,Ohio,Ohio,Colorado
,,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


Setting names on the axes works as usual:

In [89]:
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

frame

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key1,key2,,,
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [90]:
frame.index.nlevels

2

Partial indexing works on columns too:

In [91]:
frame["Ohio"]

,color,Green,Red
key1,key2,,
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


You may need to rearrange the order of the levels on an axis. The `swaplevel` method will swap the levels in the MultiIndex on a particular axis. The default is to swap the levels on the rows:

In [92]:
frame.swaplevel()

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key2,key1,,,
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [93]:
frame.swaplevel(0, 1, axis=1)

,color,Green,Red,Green
,state,Ohio,Ohio,Colorado
key1,key2,,,
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


You can also sort by a single level or subset of levels:

In [94]:
frame.sort_index(level=1)

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key1,key2,,,
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


> Swap the levels on the rows then sort the index by level `0`.

It's common to use one or more columns from a DataFrame as the row index.

In [95]:
frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1), 
                      "c": ["one", "one", "one", "two", "two", "two", "two"], 
                      "d": [0, 1, 2, 0, 1, 2, 3]})

frame

,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


`set_index` will return a new DataFrame using one or more of its columns as the index.

In [96]:
frame2 = frame.set_index(["c", "d"])

frame2

c,d,a,b
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


`reset_index` does the opposite of `set_index` and moves the index levels back into columns.

In [97]:
frame2.reset_index()

,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


You can choose to discard the index levels instead of restoring them as columns:

In [98]:
frame2.reset_index(drop=True)

,a,b
0,0,7
1,1,6
2,2,5
3,3,4
4,4,3
5,5,2
6,6,1
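
The three variants can be compared side by side on a small frame. The sketch below uses a made-up DataFrame `df` with the same column layout as above; `set_index(drop=False)` is included as a related option that keeps the source columns in place.

```python
import pandas as pd

# Hypothetical frame mirroring the structure used above
df = pd.DataFrame({"a": [0, 1, 2], "c": ["one", "one", "two"], "d": [0, 1, 0]})

indexed = df.set_index(["c", "d"])

# reset_index() restores the index levels as ordinary columns
restored = indexed.reset_index()
print(list(restored.columns))

# reset_index(drop=True) discards the index levels altogether
dropped = indexed.reset_index(drop=True)
print(list(dropped.columns))

# set_index(drop=False) keeps the source columns in the body as well
kept = df.set_index(["c", "d"], drop=False)
print(list(kept.columns))
```

Note that `reset_index()` places the restored columns first, so the column order can differ from the original frame.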


## Date Time Data

Pandas is oriented towards working with arrays of dates, whether used as an axis index or a column.

The `to_datetime` function parses many different kinds of date representations:

In [99]:
dates = ["2011-07-06 12:00:00", "2011-08-06 00:00:00"]

pd.to_datetime(dates)

DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)

It uses `NaT` (Not a Time) as null values for datetime data.

In [100]:
idx = pd.to_datetime(dates + [None])

idx

DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

In [101]:
pd.isna(idx)

array([False, False,  True])
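
When the input format is ambiguous or may contain bad entries, `to_datetime` accepts a `format` string and an `errors` option. The sketch below is a minimal illustration with made-up strings; it is not meant to cover every format pandas can parse.

```python
import pandas as pd

# ISO 8601 strings are parsed directly
iso = pd.to_datetime("2011-07-06")
print(iso)

# An explicit format removes ambiguity for day-first dates
day_first = pd.to_datetime("06/07/2011", format="%d/%m/%Y")
print(day_first)

# errors="coerce" turns unparseable entries into NaT instead of raising
mixed = pd.to_datetime(["2011-07-06", "not a date"], errors="coerce")
print(mixed)
print(mixed.isna())
```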

Standard Python uses the `datetime` module to handle date and time data. Pandas has a `Timestamp` object that is similar to the `datetime` object. Pandas also has a `Timedelta` object that is similar to the `timedelta` object.

If you use `datetime` objects as the index of a Series or DataFrame, pandas will automatically convert them to a `DatetimeIndex`.
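
The relationship between the standard-library types and their pandas counterparts can be checked directly. The following is a small sketch; the dates and values are arbitrary.

```python
from datetime import datetime, timedelta

import pandas as pd

stamp = pd.Timestamp("2011-01-02")
delta = pd.Timedelta(days=3)

# The pandas objects are subclasses of their standard-library counterparts
print(isinstance(stamp, datetime), isinstance(delta, timedelta))

# Arithmetic mixes freely between the two families
print(stamp + delta)
print(stamp + timedelta(days=3) == datetime(2011, 1, 5))

# datetime objects used as an index become a DatetimeIndex
s = pd.Series([1.0, 2.0], index=[datetime(2011, 1, 2), datetime(2011, 1, 5)])
print(type(s.index).__name__)
```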

In [102]:
from datetime import datetime

In [103]:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7), datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]

ts = pd.Series(np.random.standard_normal(6), index=dates)

ts

2011-01-02   -0.438136
2011-01-05   -1.118318
2011-01-07    1.618982
2011-01-08    1.541605
2011-01-10   -0.251879
2011-01-12   -0.842436
dtype: float64

In [104]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

In [105]:
# [::2] selects every second element
ts + ts[::2]

2011-01-02   -0.876271
2011-01-05         NaN
2011-01-07    3.237963
2011-01-08         NaN
2011-01-10   -0.503758
2011-01-12         NaN
dtype: float64

`DatetimeIndex` is an array of `Timestamp` objects.

In [106]:
ts.index[0]

Timestamp('2011-01-02 00:00:00')

You can index by passing a `datetime`, a `Timestamp`, or a string that can be interpreted as a date:

In [107]:
ts[datetime(2011, 1, 7)]

np.float64(1.6189816606752596)

In [108]:
ts[pd.Timestamp("2011-01-07")]

np.float64(1.6189816606752596)

In [109]:
ts["2011-01-07"]

np.float64(1.6189816606752596)

You can even pass just a year or a year-month string to select a whole period:

In [110]:
# date_range generates an array of dates
longer_ts = pd.Series(np.random.standard_normal(1000), 
                      index=pd.date_range("2000-01-01", periods=1000))

longer_ts

2000-01-01    0.184519
2000-01-02    0.937082
2000-01-03    0.731000
2000-01-04    1.361556
2000-01-05   -0.326238
                ...   
2002-09-22   -0.528333
2002-09-23    0.700045
2002-09-24    0.074560
2002-09-25   -0.091808
2002-09-26   -0.810560
Freq: D, Length: 1000, dtype: float64

In [111]:
longer_ts["2001"]

2001-01-01   -0.154848
2001-01-02   -0.086056
2001-01-03   -0.335757
2001-01-04   -0.136629
2001-01-05    0.092776
                ...   
2001-12-27    0.321444
2001-12-28    0.233519
2001-12-29    1.014952
2001-12-30   -1.365343
2001-12-31    0.276531
Freq: D, Length: 365, dtype: float64

In [112]:
longer_ts["2001-05"]

2001-05-01    0.139862
2001-05-02    0.712539
2001-05-03    0.359624
2001-05-04    0.712748
2001-05-05    1.068786
2001-05-06   -0.495174
2001-05-07    0.132962
2001-05-08   -0.082899
2001-05-09    0.214219
2001-05-10    0.692696
2001-05-11   -1.908185
2001-05-12    0.116533
2001-05-13    1.755174
2001-05-14   -0.395566
2001-05-15   -0.932583
2001-05-16   -0.673565
2001-05-17    0.714532
2001-05-18   -1.790004
2001-05-19    0.224303
2001-05-20   -3.209955
2001-05-21    1.523585
2001-05-22   -0.069616
2001-05-23    1.365590
2001-05-24   -0.558695
2001-05-25    0.275393
2001-05-26    2.400918
2001-05-27   -1.011035
2001-05-28    0.942507
2001-05-29   -2.221511
2001-05-30   -2.585768
2001-05-31    0.688921
Freq: D, dtype: float64

Slicing with date strings works as well:

In [113]:
longer_ts["2001-05":]

2001-05-01    0.139862
2001-05-02    0.712539
2001-05-03    0.359624
2001-05-04    0.712748
2001-05-05    1.068786
                ...   
2002-09-22   -0.528333
2002-09-23    0.700045
2002-09-24    0.074560
2002-09-25   -0.091808
2002-09-26   -0.810560
Freq: D, Length: 514, dtype: float64
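
Slice endpoints do not have to be present in the index, as long as the index is sorted. The sketch below uses a small irregular series, `ts_demo`, with made-up values to show that both ends of the slice are inclusive and that missing endpoints simply bound the range.

```python
from datetime import datetime

import pandas as pd

# Hypothetical irregular series; 6 and 11 January are not in the index
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts_demo = pd.Series(range(6), index=dates)

# Endpoints only need to be interpretable as dates; both ends are inclusive
window = ts_demo.loc["2011-01-06":"2011-01-11"]
print(window)
```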

> Use `date_range` to generate a Series of random values from 1-31st January 2023. Then slice the Series to return data from 5-15th January.

In [114]:
price = pd.read_pickle('../data/yahoo_price.pkl')

In [115]:
price

Date,AAPL,GOOG,IBM,MSFT
2010-01-04,27.990226,313.062468,113.304536,25.884104
2010-01-05,28.038618,311.683844,111.935822,25.892466
2010-01-06,27.592626,303.826685,111.208683,25.733566
2010-01-07,27.541619,296.753749,110.823732,25.465944
2010-01-08,27.724725,300.709808,111.935822,25.641571
...,...,...,...,...
2016-10-17,117.550003,779.960022,154.770004,57.220001
2016-10-18,117.470001,795.260010,150.720001,57.660000
2016-10-19,117.120003,801.500000,151.259995,57.529999
2016-10-20,117.059998,796.969971,151.520004,57.250000


In [116]:
price.index

DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
               '2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
               '2010-01-14', '2010-01-15',
               ...
               '2016-10-10', '2016-10-11', '2016-10-12', '2016-10-13',
               '2016-10-14', '2016-10-17', '2016-10-18', '2016-10-19',
               '2016-10-20', '2016-10-21'],
              dtype='datetime64[ns]', name='Date', length=1714, freq=None)

You can access other attributes like `day_of_week` or `month`:

In [117]:
price.index.day_of_week

Index([0, 1, 2, 3, 4, 0, 1, 2, 3, 4,
       ...
       0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
      dtype='int32', name='Date', length=1714)

In [118]:
price.index.month

Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       ...
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
      dtype='int32', name='Date', length=1714)

If the datetime is in a column instead of the index, you can use the `dt` accessor to access the datetime properties.

In [119]:
price_reindex = price.reset_index()

price_reindex

,Date,AAPL,GOOG,IBM,MSFT
0,2010-01-04,27.990226,313.062468,113.304536,25.884104
1,2010-01-05,28.038618,311.683844,111.935822,25.892466
2,2010-01-06,27.592626,303.826685,111.208683,25.733566
3,2010-01-07,27.541619,296.753749,110.823732,25.465944
4,2010-01-08,27.724725,300.709808,111.935822,25.641571
...,...,...,...,...,...
1709,2016-10-17,117.550003,779.960022,154.770004,57.220001
1710,2016-10-18,117.470001,795.260010,150.720001,57.660000
1711,2016-10-19,117.120003,801.500000,151.259995,57.529999
1712,2016-10-20,117.059998,796.969971,151.520004,57.250000


In [120]:
price_reindex["Date"].dt.day_name()

0          Monday
1         Tuesday
2       Wednesday
3        Thursday
4          Friday
          ...    
1709       Monday
1710      Tuesday
1711    Wednesday
1712     Thursday
1713       Friday
Name: Date, Length: 1714, dtype: object

In [121]:
price_reindex

Unnamed: 0,Date,AAPL,GOOG,IBM,MSFT
0,2010-01-04,27.990226,313.062468,113.304536,25.884104
1,2010-01-05,28.038618,311.683844,111.935822,25.892466
2,2010-01-06,27.592626,303.826685,111.208683,25.733566
3,2010-01-07,27.541619,296.753749,110.823732,25.465944
4,2010-01-08,27.724725,300.709808,111.935822,25.641571
...,...,...,...,...,...
1709,2016-10-17,117.550003,779.960022,154.770004,57.220001
1710,2016-10-18,117.470001,795.260010,150.720001,57.660000
1711,2016-10-19,117.120003,801.500000,151.259995,57.529999
1712,2016-10-20,117.059998,796.969971,151.520004,57.250000


> Get the week of year from the date column and create a new column `week_of_year`.

As you can see above, the dates fall on business days only. To change the frequency to calendar days (an operation known as resampling):

In [122]:
price_resampled = price.resample('D').asfreq()

In [123]:
price_resampled.head(10)

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2010-01-04,27.990226,313.062468,113.304536,25.884104
2010-01-05,28.038618,311.683844,111.935822,25.892466
2010-01-06,27.592626,303.826685,111.208683,25.733566
2010-01-07,27.541619,296.753749,110.823732,25.465944
2010-01-08,27.724725,300.709808,111.935822,25.641571
2010-01-09,,,,
2010-01-10,,,,
2010-01-11,27.480148,300.255255,110.763844,25.315406
2010-01-12,27.167562,294.945572,111.644958,25.148142
2010-01-13,27.550775,293.252243,111.405433,25.382312


To fill the missing (NA) values with the most recent observed value, use the `.ffill()` (forward fill) method.

In [124]:
price_resampled = price.resample('D').ffill()

price_resampled.head(10)

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2010-01-04,27.990226,313.062468,113.304536,25.884104
2010-01-05,28.038618,311.683844,111.935822,25.892466
2010-01-06,27.592626,303.826685,111.208683,25.733566
2010-01-07,27.541619,296.753749,110.823732,25.465944
2010-01-08,27.724725,300.709808,111.935822,25.641571
2010-01-09,27.724725,300.709808,111.935822,25.641571
2010-01-10,27.724725,300.709808,111.935822,25.641571
2010-01-11,27.480148,300.255255,110.763844,25.315406
2010-01-12,27.167562,294.945572,111.644958,25.148142
2010-01-13,27.550775,293.252243,111.405433,25.382312
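Forward filling carries the last known price across every gap, however long. As a minimal sketch on a small made-up series (the dates and values are illustrative and not taken from `price`), the `limit` argument of `ffill` caps how many consecutive missing days are filled:

```python
import pandas as pd

# Hypothetical prices observed on three non-consecutive days.
s = pd.Series(
    [10.0, 11.0, 12.0],
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-09"]),
)

daily = s.resample("D").asfreq()          # calendar days, gaps become NaN
filled = s.resample("D").ffill()          # every gap takes the last known value
capped = s.resample("D").ffill(limit=1)   # fill at most one day per gap

print(pd.DataFrame({"asfreq": daily, "ffill": filled, "ffill_limit_1": capped}))
```

With `limit=1`, only the first missing day after an observation is filled; the remaining days of a longer gap stay missing, which makes long gaps easier to spot.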


If you want to resample to a lower frequency (e.g. monthly), you need to provide an aggregation method:

In [125]:
price_resampled = price.resample('MS').mean()

price_resampled.head()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2010-01-01,27.166942,289.012006,110.364037,25.212407
2010-02-01,26.00037,267.179857,107.496919,23.767984
2010-03-01,29.21976,280.235245,109.849295,24.584058
2010-04-01,32.847556,278.253415,111.324726,25.646245
2010-05-01,32.888483,248.418509,109.690449,23.670255


For more resampling frequency options, refer to the official [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).
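Downsampling is not limited to a single aggregation. The sketch below uses a synthetic business-day series (an assumption for illustration, not the Yahoo! data) and passes a list of reducers to `agg` on the resampler:

```python
import numpy as np
import pandas as pd

# Hypothetical business-day series covering January and February 2010.
idx = pd.bdate_range("2010-01-04", "2010-02-26")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="price")

# "MS" labels each bin with the month start; agg applies several reducers at once.
monthly = s.resample("MS").agg(["first", "last", "mean", "count"])
print(monthly)
```

Each reducer becomes a column in the result, so one call gives the first, last, and average value plus the number of observations per month.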

> Resample price to `yearly` (start of year) frequency, use `sum` as aggregation function.

### Window functions

You can apply functions evaluated over a sliding window using the `rolling` method.

For example, to compute the 30-day moving average of the Apple price (a window of 30 rows, which here means 30 trading days):

In [129]:
price_head = price["AAPL"].rolling(30).mean()

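By default, rolling functions require all of the values in the window to be non-NA, so the first 29 results of a 30-row window are NA. The `min_periods` argument relaxes this requirement to account for missing data, especially at the beginning of the time series where a full window is not yet available.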

In [None]:
price["AAPL"].rolling(30, min_periods=3).mean()
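The effect of `min_periods` is easiest to see on a tiny series. This is a sketch with made-up values and a window of 3 rather than 30:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

strict = s.rolling(3).mean()                   # needs 3 values, so the first 2 results are NaN
relaxed = s.rolling(3, min_periods=1).mean()   # averages whatever is available so far

print(pd.DataFrame({"strict": strict, "relaxed": relaxed}))
```

Once the window is full (from the third row onwards), both columns agree; they differ only in the warm-up rows.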

> Compute a 10-day moving average for `GOOG` with a min period of 5 days.

## Combining and Merging Datasets

Data can be combined or merged in a number of ways:

- `merge`: connects rows in DataFrames based on one or more keys. Equivalent to database `join` operations.
- `concat`: concatenates or "stacks" together objects along an axis. Equivalent to database `union` operations.
- `combine_first`: an instance method that splices together overlapping data, filling in missing values in one object with values from another. _We went through this in unit 7_.

### `merge`

In [None]:
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"], 
                    "data1": pd.Series(range(7), dtype="Int64")})

df2 = pd.DataFrame({"key": ["a", "b", "d"], 
                    "data2": pd.Series(range(3), dtype="Int64")})

In [None]:
df1

In [None]:
df2

Merging the two dataframes above constitutes a _many-to-one_ join; the data in `df1` has multiple rows labeled `a` and `b`, whereas `df2` has only one row for each value in the key column `key`.

In [None]:
pd.merge(df1, df2)

If you do not specify which column(s) to join on, `merge` uses the overlapping column names as the keys. It is good practice to specify them explicitly, though.

In [None]:
pd.merge(df1, df2, on="key")
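If you expect a particular relationship between the keys, `merge` can check it for you through its `validate` parameter. The frames below are hypothetical stand-ins (not the lesson's `df1` and `df2`); this is a sketch of the check rather than a required step:

```python
import pandas as pd

# Hypothetical frames; right_dupes repeats the key "b".
left_df = pd.DataFrame({"key": ["b", "b", "a"], "data1": [0, 1, 2]})
right_unique = pd.DataFrame({"key": ["a", "b"], "data2": [10, 20]})
right_dupes = pd.DataFrame({"key": ["a", "b", "b"], "data2": [10, 20, 30]})

# Passes: every key appears at most once on the right.
ok = pd.merge(left_df, right_unique, on="key", validate="many_to_one")

# Fails: "b" appears twice on the right, so this is not many-to-one.
try:
    pd.merge(left_df, right_dupes, on="key", validate="many_to_one")
    rejected = False
except pd.errors.MergeError as exc:
    rejected = True
    print("merge rejected:", exc)
```

This turns a silent row explosion caused by unexpected duplicate keys into an explicit error.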

If the column names are different in each object, you can specify them separately:

In [None]:
df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"], 
                    "data1": pd.Series(range(7), dtype="Int64")})

df4 = pd.DataFrame({"rkey": ["a", "b", "d"], 
                    "data2": pd.Series(range(3), dtype="Int64")})

pd.merge(df3, df4, left_on="lkey", right_on="rkey")

The default merge type is an `inner` join. You can select the other options (`left`, `right`, `outer`) via the `how` parameter.

In [None]:
pd.merge(df1, df2, how="outer")

In [None]:
pd.merge(df3, df4, left_on="lkey", right_on="rkey", how="outer")
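After an outer join it is often useful to know which side each row came from. A minimal sketch with made-up frames, using the `indicator` parameter of `merge`:

```python
import pandas as pd

df_left = pd.DataFrame({"key": ["a", "b", "c"], "data1": [1, 2, 3]})
df_right = pd.DataFrame({"key": ["a", "b", "d"], "data2": [10, 20, 30]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(df_left, df_right, on="key", how="outer", indicator=True)

print(merged)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are exactly the ones an inner join would have dropped.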

Let's consider a _many-to-many_ join:

In [None]:
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], 
                    "data1": pd.Series(range(6), dtype="Int64")})

df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"], 
                    "data2": pd.Series(range(5), dtype="Int64")})

In [None]:
df1

In [None]:
df2

Since there are three `"b"` rows in the left DataFrame and two in the right one, there are six `"b"` rows in the result:

In [None]:
pd.merge(df1, df2, how="inner")
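The row count of a many-to-many inner join can be predicted by multiplying, for each key, the number of rows on the left by the number on the right. The sketch below rebuilds the two frames with plain integer columns (the names `left_mm` and `right_mm` are illustrative) and compares the prediction with the merge result:

```python
import pandas as pd

left_mm = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})
right_mm = pd.DataFrame({"key": ["a", "b", "a", "b", "d"], "data2": range(5)})

# Rows per key on each side; a key missing from one side contributes 0.
left_counts = left_mm["key"].value_counts()
right_counts = right_mm["key"].value_counts()
expected_rows = left_counts.mul(right_counts, fill_value=0).astype(int)

merged_mm = pd.merge(left_mm, right_mm, on="key", how="inner")

print(expected_rows.sort_index())
print(merged_mm["key"].value_counts().sort_index())
```

Keys that exist on only one side (`"c"` and `"d"` here) have a product of zero, which is why they vanish from an inner join.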

> Merge `df1` and `df2` with a left join. 

To merge with multiple keys, pass a list of column names:

In [None]:
left = pd.DataFrame({"key1": ["foo", "foo", "bar"], 
                     "key2": ["one", "two", "one"],
                     "lval": pd.Series([1, 2, 3], dtype='Int64')})

right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": pd.Series([4, 5, 6, 7], dtype='Int64')})

pd.merge(left, right, on=["key1", "key2"], how="outer")

If there are overlapping non-key column names (`key2` in the example below), pandas appends the suffixes `_x` and `_y` by default:

In [None]:
pd.merge(left, right, on="key1")

You can pass `suffixes` to specify the strings to append to the overlapping names:

In [None]:
pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

If a merge key is in the index, you can pass `left_index=True` or `right_index=True` to indicate that the index should be used as the merge key.

In [None]:
left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
                      "value": pd.Series(range(6), dtype="Int64")})

right1 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])

In [None]:
left1

In [None]:
right1

In [None]:
pd.merge(left1, right1, left_on="key", right_index=True)

DataFrame has a `join` method which performs a left join by default. The join key on the right dataframe has to be the index. The join key on the left dataframe can be an index or a column (by specifying the `on` parameter):

In [None]:
left1.join(right1, on='key')
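To connect `join` with `merge`, the sketch below rebuilds the two frames (with a plain integer `value` column) and compares `join` against a left `merge` on the same key; the final line prints whether the two results agree:

```python
import pandas as pd

left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"], "value": range(6)})
right1 = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# join: left join by default, matching left1["key"] against right1's index.
joined = left1.join(right1, on="key")

# The same request spelled out with merge.
merged = pd.merge(left1, right1, left_on="key", right_index=True, how="left")

print(joined)
print(joined.reset_index(drop=True).equals(merged.reset_index(drop=True)))
```

Because the join is a left join, the `"c"` row is kept and its `group_val` is missing.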

### `concat`

You can join DataFrames along any axis, which is referred to as _concatenation_ or _stacking_. This is akin to database `union` operations, in any "direction" (axis).

In [None]:
s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

Calling `concat` with these objects in a list glues together the values and indexes:

In [None]:
s1

In [None]:
s2

In [None]:
s3

By default, `concat` works along `axis="index"`, producing another Series. If you pass `axis="columns"`, the result will instead be a DataFrame:

In [None]:
pd.concat([s1, s2, s3])

In [None]:
pd.concat([s1, s2, s3], axis="columns")

The default behavior of `concat` is to take the union (`outer` join) of the indexes. You can intersect them instead by passing `join='inner'`:

In [None]:
s4 = pd.concat([s1, s3])

s4

In [None]:
pd.concat([s1, s4], axis="columns")

In [None]:
pd.concat([s1, s4], axis="columns", join="inner")

When combining Series along `axis="columns"`, pass the `keys` argument to set the DataFrame column headers:

In [None]:
pd.concat([s1, s2, s3], axis="columns", keys=["one", "two", "three"])

> Concat `s1`, `s2` and `s3` along index and pass `keys=["one", "two", "three"]`.

For DataFrames, the `keys` become the outer level of a hierarchical index on the concatenation axis instead (here, the columns):

In [None]:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=["a", "b", "c"],
                   columns=["one", "two"])

df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=["a", "c"], 
                   columns=["three", "four"])               

In [None]:
df1

In [None]:
df2

In [None]:
pd.concat([df1, df2], axis="columns", keys=["level1", "level2"])

If the index does not contain any relevant data and you want to avoid concatenating based on indexes, pass the `ignore_index=True` argument. This assigns a new default index:

In [None]:
df1 = pd.DataFrame(np.random.standard_normal((3, 4)), 
                   columns=["a", "b", "c", "d"])

df2 = pd.DataFrame(np.random.standard_normal((2, 3)), 
                   columns=["b", "d", "a"])

pd.concat([df1, df2], ignore_index=True)

> Concat `df1` and `df2` on the column axis but ignore the index.

## Reshaping and Pivoting Data

Reshaping or pivoting dataframes refers to the process of changing the layout of a dataframe. This is useful when you want to change the granularity of your data or when you want to convert a _wide_ dataframe into a _long_ dataframe or vice versa.

### Reshaping

In [None]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)), 
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))

data

The `stack` method pivots the columns into rows, producing a Series with a MultiIndex.

In [None]:
result = data.stack()

result

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with `unstack`, which pivots rows into columns.

By default, the innermost level is unstacked (same with stack).

In [None]:
result.unstack()

You can unstack a different level by passing a level number or name:

In [None]:
result.unstack(level=0)

In [None]:
# or just stating the name of the level
result.unstack(level="state")
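Unstacking can introduce missing values when the groups do not all share the same inner labels. A small sketch with made-up Series (the names `part_one`, `part_two` and `combined` are illustrative):

```python
import pandas as pd

part_one = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
part_two = pd.Series([4, 5, 6], index=["c", "d", "e"])

# keys= builds a two-level index: outer level "one"/"two", inner level the letters.
combined = pd.concat([part_one, part_two], keys=["one", "two"])

# The inner level becomes the columns; labels absent from a group become NaN.
wide = combined.unstack()
print(wide)
```

Here `"e"` is missing from the first group and `"a"` and `"b"` from the second, so those cells are NaN and the values are shown as floats.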

When you unstack a DataFrame, the unstacked level becomes the lowest (innermost) level of the resulting columns:

In [None]:
df = pd.DataFrame({"left": result, "right": result + 5},
                  columns=pd.Index(["left", "right"], name="side"))

df

In [None]:
df.unstack(level="state")

In [None]:
df.unstack(level="state").stack(level="side")

### Pivoting between "Wide" and "Long" Format

Long format and wide format are two common ways of organizing data in the context of databases, spreadsheets, or data analysis. They refer to the arrangement of data rows and columns.

1. Long format

Each row typically represents a single observation or entry, and each column contains variables or attributes related to that observation. This format is also known as "tidy data" or "normalized data."

Example:

| Year | Country | Population |
| ---- | ------- | ---------- |
| 2019 | SG      | 5.7        |
| 2019 | MY      | 31.5       |
| 2019 | TH      | 69.8       |
| 2020 | SG      | 5.7        |
| 2020 | MY      | 32.7       |
| 2020 | TH      | 69.8       |

Advantages:

- It is easier to handle and analyze structured data with different attributes.
- Efficient storage for sparse data, since combinations with no value can simply be left out instead of being stored as empty cells.

2. Wide format

Each row holds several observations: the values of one variable are spread across multiple columns, one per category (here, one column per country).

Example:

| Year | SG   | MY   | TH   |
| ---- | ---- | ---- | ---- |
| 2019 | 5.7  | 31.5 | 69.8 |
| 2020 | 5.7  | 32.7 | 69.8 |

Advantages:

- Easier to read and understand when the number of variables is limited.
- Suitable for simple summary statistics and basic analyses.
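The two example tables can be reproduced in pandas. The sketch below builds the long table by hand and reshapes it to the wide layout with `pivot`; the frame names `long_df` and `wide_df` are illustrative:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020, 2020],
    "Country": ["SG", "MY", "TH", "SG", "MY", "TH"],
    "Population": [5.7, 31.5, 69.8, 5.7, 32.7, 69.8],
})

# One row per Year, one column per Country, cells filled from Population.
wide_df = long_df.pivot(index="Year", columns="Country", values="Population")
wide_df = wide_df[["SG", "MY", "TH"]]   # reorder columns to match the wide table above

print(wide_df)
```

The six rows of the long table collapse into two rows and three value columns, with `Year` as the index.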

In [None]:
price_reindex

We can "pivot" a table from a "wide" format to a "long" format using the `melt` function.

The `Date` column is the group indicator, while the other columns are data values. We need to indicate the group indicator(s):


In [None]:
melted = pd.melt(price_reindex, id_vars="Date")

melted

> Rerun `melt` and pass arguments such that the new columns are named `Company` and `Price` respectively.

Using `pivot`, we can reshape back to the original layout:

In [None]:
reshaped = melted.pivot(index='Date', columns='variable', values='value')

reshaped

## Data Aggregation

Data aggregation is the process of grouping data together and performing calculations on them. It is equivalent to the `GROUP BY` clause in SQL.

In [None]:
df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None], 
                   "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
                   "data1" : np.random.standard_normal(7), 
                   "data2" : np.random.standard_normal(7)})

df

If you want to compute the mean for each unique value in `key1`:

In [None]:
df.groupby("key1").mean()

It does not make sense to compute the mean for `key2` since it is a categorical variable and also serves as a key.

We can select the numeric columns to compute the mean for (after the `groupby` method):

In [None]:
df.groupby("key1")[["data1", "data2"]].mean()

Note that the following also works, since the returned result is a DataFrame. However, it is less efficient because the column selection happens after the computation.

In [None]:
df.groupby("key1").mean()[["data1", "data2"]]

You can group by more than one column. There is a useful GroupBy method `size` which returns a Series containing group sizes.

In [None]:
df.groupby(['key1', 'key2']).size()
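Rows whose group key is missing are left out of the result by default. A minimal sketch with fixed values (the frame `df_keys` is a hypothetical stand-in for `df`), using the `dropna` parameter of `groupby`:

```python
import pandas as pd

df_keys = pd.DataFrame({
    "key1": ["a", "a", None, "b", "b", "a", None],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

default_sizes = df_keys.groupby("key1").size()             # rows with a missing key are dropped
all_sizes = df_keys.groupby("key1", dropna=False).size()   # missing keys form their own group

print(default_sizes)
print(all_sizes)
```

Comparing the two totals with `len(df_keys)` is a quick way to see how many rows were excluded because of missing keys.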

You can also group by other `Series`/`array`/`list` with the same length:

In [None]:
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

df["data1"].groupby([states, years]).mean()

For built-in aggregation methods in pandas, refer to the [documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods).

> Group by `key1` and `key2` and compute the standard deviation.

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` method or its short alias `agg`:

In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [None]:
grouped = df.groupby("key1")

grouped.agg(peak_to_peak)

You can pass a list of functions, or function names (for built-in functions) to `aggregate`: 

In [None]:
grouped.agg([peak_to_peak, "mean", "std"])
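When different columns need different aggregations, pandas also supports named aggregation, where each keyword names an output column and maps to a `(column, function)` pair. A sketch on a small made-up frame (`df_demo` and the output names are illustrative):

```python
import pandas as pd

def peak_to_peak(arr):
    # Range of the values in one group.
    return arr.max() - arr.min()

df_demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "data1": [1.0, 4.0, 2.0, 10.0],
    "data2": [0.5, 1.5, 3.0, 5.0],
})

summary = df_demo.groupby("key1").agg(
    data1_range=("data1", peak_to_peak),   # custom function on data1
    data2_mean=("data2", "mean"),          # built-in aggregation on data2
)
print(summary)
```

The result has flat, descriptive column names instead of the hierarchical columns produced by passing a list of functions.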

### Apply

The most general-purpose GroupBy method is `apply`, which splits the object being manipulated into pieces, invokes the passed function on each piece, and then concatenates the pieces.

In [None]:
tips = pd.read_csv("../data/tips.csv")

tips

In [None]:
# add a column with the tip percentage

tips["tip_pct"] = tips["tip"] / tips["total_bill"]

Suppose we want to select the top five `tip_pct` values by group. First, write a function that selects the rows with the largest values in a particular column:

In [None]:
def top(df, n=5, column="tip_pct"):
    return df.sort_values(column, ascending=False)[:n]

In [None]:
top(tips, n=6)

We can then `apply` this function by different groups using `groupby`:

In [None]:
tips.groupby("smoker").apply(top)

You can pass the arguments to the function as follows:

In [None]:
tips.groupby(["smoker", "day"]).apply(top, n=2, column="total_bill")
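For the specific task of picking the top rows per group, there is an alternative that does not use `apply`: sort first, then take the head of each group. This is a different technique from the `top` function above, sketched here on a made-up frame (`tips_demo` is hypothetical, not the `tips` data):

```python
import pandas as pd

tips_demo = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.25, 0.15, 0.30, 0.05, 0.20],
})

# Sort by tip_pct descending, then keep the first 2 rows of each smoker group.
top2 = (
    tips_demo.sort_values("tip_pct", ascending=False)
             .groupby("smoker")
             .head(2)
)
print(top2)
```

Unlike `apply`, this keeps the original row labels and the overall sorted order rather than returning a result indexed by group.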

> Apply the function on `day` and `time` group.

> Create a function that selects the bottom five `tip_pct` values.
>
> Then apply it on `smoker` group.

### Transform

You can also transform your data using the `transform` method. It is similar to `apply` but imposes more restrictions on the type of function you can use. The function must:

- Produce a scalar value to be broadcast to the shape of the group chunk, or
- Return an object that is the same shape as the group chunk
- Not mutate its input

In [None]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})

df

In [None]:
g = df.groupby('key')['value']

g.mean()

`transform` produces a Series of the same shape as `df['value']`, but with each value replaced by the mean of its `key` group.

We can pass a function or function name (for built-in aggregation) to `transform`:

In [None]:
g.transform(lambda x: x.mean())

In [None]:
g.transform('mean')

In [None]:
def times_two(group):
    return group * 2

g.transform(times_two)

A common transformation in data analytics / science is _standardization_ or _standard scaling_. This is where we transform the data to have a mean of 0 and a standard deviation of 1. It is also known as _z-score normalization_.

The formula for standard scaling is:

$$
z = \frac{x - \mu}{\sigma}
$$

where $x$ is the value, $\mu$ is the mean, and $\sigma$ is the standard deviation.

We can achieve this using `transform`:

In [None]:
def normalize(x):
    return (x - x.mean()) / x.std()

g.transform(normalize)

Alternatively, the following also works:

In [None]:
standardized = (df['value'] - g.transform('mean')) / g.transform('std')

standardized
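A quick way to check the standardization is to regroup the result and confirm that each group has a mean of 0 and a standard deviation of 1. The sketch below rebuilds the small frame under the names `df_demo` and `g_demo` (illustrative names) so that it runs on its own; note that pandas' `std` uses the sample standard deviation (`ddof=1`):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"key": ["a", "b", "c"] * 4, "value": np.arange(12.0)})
g_demo = df_demo.groupby("key")["value"]

# z-score within each group: subtract the group mean, divide by the group std.
z = (df_demo["value"] - g_demo.transform("mean")) / g_demo.transform("std")

# Regroup the standardized values and summarize each group.
check = z.groupby(df_demo["key"]).agg(["mean", "std"])
print(check.round(6))
```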

## Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool used in data processing. Pivot tables summarize, sort, reorganize, group, count, total or average data. They let you transform columns into rows and rows into columns, and group by any data field.

In pandas, you can use the `pivot_table` method, which combines `groupby` with reshape operations that rely on hierarchical indexing. In addition, `pivot_table` can add partial totals, also known as _margins_.

The default aggregation for `pivot_table` is mean.

In [None]:
tips.pivot_table(index=["day", "smoker"], values=["size", "tip", "tip_pct", "total_bill"])

You can put `smoker` in the table columns and `time` and `day` in the rows:

In [None]:
tips.pivot_table(index=["time", "day"], columns="smoker", 
                 values=["tip_pct", "size"])
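To see how `pivot_table` relates to `groupby`, the sketch below computes the same table both ways on a small made-up frame (`tips_demo` is hypothetical, not the `tips` data): once with `pivot_table`, and once with a grouped mean followed by `unstack`.

```python
import numpy as np
import pandas as pd

tips_demo = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat", "Fri"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0],
})

# Default aggregation of pivot_table is the mean.
pivoted = tips_demo.pivot_table(index="day", columns="smoker", values="tip")

# The same summary: group by both keys, average, then move smoker to the columns.
grouped = tips_demo.groupby(["day", "smoker"])["tip"].mean().unstack("smoker")

print(pivoted)
print(np.allclose(pivoted.to_numpy(), grouped.to_numpy()))
```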

Add partial totals by passing `margins=True`:

In [None]:
tips.pivot_table(index=["time", "day"], columns="smoker", 
                 values=["tip_pct", "size"], margins=True)

To use another aggregation function, pass it to the `aggfunc` keyword:

In [None]:
tips.pivot_table(index=["time", "smoker"], columns="day", 
                 values="tip_pct", aggfunc=len, margins=True)

Use `fill_value` to fill missing values:

In [None]:
tips.pivot_table(index=["time", "smoker"], columns="day", 
                 values="tip_pct", aggfunc=len, margins=True, fill_value=0)

> Compute the sum of `tip` in a pivot table with `day` and `time` in the rows and `smoker` in the column.

A _cross-tabulation_ or _crosstab_ is a special case of pivot table that computes group frequencies (counts):

In [None]:
pd.crosstab(index=[tips["time"], tips["day"]], columns=tips["smoker"], margins=True)
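Counts are often easier to compare as proportions. A minimal sketch on a made-up frame (`tips_demo` is hypothetical), using the `normalize` parameter of `crosstab` to convert each row to shares:

```python
import pandas as pd

tips_demo = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
})

# Raw frequencies with row and column totals.
counts = pd.crosstab(index=tips_demo["day"], columns=tips_demo["smoker"], margins=True)

# Row-wise proportions: each row sums to 1.
shares = pd.crosstab(index=tips_demo["day"], columns=tips_demo["smoker"], normalize="index")

print(counts)
print(shares)
```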