# Recap from last week

1. Load Titanic data set
2. Count number of missing values in column `Age`
3. Compute mean age
4. Replace missing values with mean age rounded to nearest integer
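
The recap steps can be sketched on a tiny made-up table so that the snippet runs without the Titanic file. Only the column name `Age` comes from the data set; the `toy` DataFrame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data (assumption: only Age matters here)
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, np.nan]})

# Count the missing values in column Age
n_missing = toy['Age'].isna().sum()

# Compute the mean age (missing values are skipped by default)
mean_age = toy['Age'].mean()

# Replace missing values with the mean rounded to the nearest integer
toy['Age'] = toy['Age'].fillna(np.round(mean_age))
```

With the real data, `toy` would simply be replaced by the DataFrame returned from `pd.read_csv()`.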

In [3]:
# Use files in the local data/ directory (comment this out when loading from GitHub)
DATA_PATH = '../../data'

# Uncomment this to load data directly from GitHub
# DATA_PATH = 'https://raw.githubusercontent.com/richardfoltyn/TECH2-H24/main/data'

In [4]:
fn = f'{DATA_PATH}/titanic.csv'

In [35]:
import pandas as pd
df = pd.read_csv(fn)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,211536,13.0000,,S
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0000,C148,C


In [9]:
group = df.groupby("Sex")
group['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

In [37]:
dfage = df.loc[df["Age"] >50]
dfage.groupby("Sex")["Survived"].mean()

Sex
female    0.941176
male      0.127660
Name: Survived, dtype: float64

In [38]:
df.query('Age < 20').groupby(['Pclass', 'Sex'])['Survived'].mean()

Pclass  Sex   
1       female    0.928571
        male      0.571429
2       female    1.000000
        male      0.526316
3       female    0.533333
        male      0.190476
Name: Survived, dtype: float64

In [39]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,211536,13.0000,,S
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0000,C148,C


In [41]:
import numpy as np
def my_agg(x):
    return np.sum(x >= 40)

df.groupby('Pclass')['Age'].agg(my_agg)

Pclass
1    81
2    37
3    45
Name: Age, dtype: int64

In [42]:
df.groupby('Pclass')['Age'].agg(['mean', 'median', 'min', 'max'])

Pclass,mean,median,min,max
1,38.233441,37.0,0.92,80.0
2,29.87763,29.0,0.67,70.0
3,25.14062,24.0,0.42,74.0


***

# Grouping and aggregation with pandas

## Aggregation and reduction

Grouped aggregations follow the *split-apply-combine* pattern:

1. *Split* data into groups based on some criteria;
2. *Apply* some function to each group separately; and
3. *Combine* the results into a single `DataFrame` or `Series`.

See also the pandas [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for an illustration of such operations.
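
To make the three steps concrete, the following sketch performs them by hand and compares the result with the one-line `groupby()` version. The `toy` DataFrame is a made-up example; only the column names `Pclass` and `Fare` are borrowed from the Titanic data.

```python
import pandas as pd

# Hypothetical mini data set
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Fare': [80.0, 60.0, 20.0, 8.0, 10.0],
})

# Split: iterate over one sub-frame per class
# Apply: compute the mean fare of each sub-frame
# Combine: collect the results in a single Series
manual = pd.Series({key: grp['Fare'].mean() for key, grp in toy.groupby('Pclass')})

# The same operation in one step
short = toy.groupby('Pclass')['Fare'].mean()
```

Both `manual` and `short` hold one mean fare per class; in practice only the second form is used.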

### Working with entire DataFrames

- Apply functions such as `mean()`, `min()`, `max()`, etc. to entire columns
- See Titanic example from earlier
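
As a minimal sketch of such whole-column reductions, the snippet below uses a small hypothetical DataFrame (the values are made up) rather than the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Applied to a DataFrame: one value per column (missing values are skipped)
col_means = toy.mean()

# Applied to a single column: one number
oldest = toy['Age'].max()
```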

### Working on subsets of data (grouping)

- [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) allows us to apply operations to *sub-sets* of data

*Example: Compute means by class (first, second, third)*
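
One possible way to write this example is shown below on a hypothetical mini data set (the values are invented; the column names follow the Titanic data). Passing `numeric_only=True` skips text columns such as `Sex`.

```python
import pandas as pd

# Hypothetical mini data set
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Sex': ['female', 'male', 'male', 'female', 'male'],
    'Age': [38.0, 54.0, 27.0, 26.0, 22.0],
    'Fare': [71.0, 51.0, 13.0, 8.0, 7.0],
})

# Mean of all numeric columns within each class
means = toy.groupby('Pclass').mean(numeric_only=True)

# Mean of explicitly selected columns only
means_sel = toy.groupby('Pclass')[['Age', 'Fare']].mean()
```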

#### Built-in aggregations

There are numerous routines to aggregate grouped data, for example:

- [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html):
    averages within each group
- [`sum()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html):
    sum values within each group
- [`std()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.std.html), 
    [`var()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.var.html): 
    within-group standard deviation and variance
- [`quantile()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html):
    compute quantiles within each group
- [`size()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html): 
    number of observations in each group
- [`count()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html):
    number of non-missing observations in each group
- [`first()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html), 
    [`last()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.last.html): 
    first and last elements in each group
- [`min()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.min.html), 
    [`max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html): 
    minimum and maximum elements within a group

See the [official documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods) for a complete list.

*Example: Number of elements within each group*

*Example: Return first observation of each group*
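
The two examples above might look as follows. This is a sketch on made-up data that also illustrates the difference between `size()` and `count()` when values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set with two missing ages
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age': [38.0, np.nan, 27.0, 26.0, np.nan],
})
groups = toy.groupby('Pclass')

n_obs = groups.size()          # rows per group, including rows with missing age
n_age = groups['Age'].count()  # non-missing ages per group
first = groups.first()         # first non-missing entry of each column, per group
```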

<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following aggregations:
<ol>
    <li>Compute the average survival rate by sex (stored in the <TT>Sex</TT> column).</li>
    <li>Count the number of passengers aged 50+. Compute the average survival rate by sex for this group.</li>
    <li>Count the number of passengers below the age of 20 by class and sex. Compute the average survival rate for this group (by class and sex).</li>
</ol>
</div>

#### Writing custom aggregations

- Built-in functions don't cover all possible use cases
- Apply custom aggregation functions using
[`agg()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
(short-hand for [`aggregate()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html))
- Functions operate on each column separately
- Functions can be passed as a string (e.g., `"mean"`), as a function reference (e.g., `np.mean`), or as a lambda expression

*Example: Compute mean age by class using `agg()`*

*Example: Count number of passengers aged 40+ by class*
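
Both examples can be sketched with `agg()` as shown below. The data are hypothetical, and the lambda expression is just one of several ways to write the custom function.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age': [38.0, 54.0, 27.0, 45.0, np.nan],
})
ages = toy.groupby('Pclass')['Age']

# Built-in aggregation passed by name
mean_age = ages.agg('mean')

# Custom aggregation: number of passengers aged 40+ in each class
# (a missing age compares as False and is therefore not counted)
n_40plus = ages.agg(lambda x: (x >= 40).sum())
```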

#### Applying multiple functions at once

- Multiple functions applied to the same column: `.agg(['function1', 'function2'])`

*Example: Compute mean and median age by class*
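
A minimal sketch of this example on invented data; each function name in the list becomes a column of the result.

```python
import pandas as pd

# Hypothetical mini data set
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3],
    'Age': [38.0, 54.0, 40.0, 20.0, 30.0],
})

# One row per class, one column per aggregation
stats = toy.groupby('Pclass')['Age'].agg(['mean', 'median'])
```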

- Apply multiple functions to different columns (["named aggregation"](https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation)):

```python
    groups.agg(
        new_column_name1=('column_name1', 'operation1'),
        new_column_name2=('column_name2', 'operation2'),
        ...
    )
```

*Example: Compute maximum fare and mean age by class*

In [44]:
df.groupby('Pclass').agg(max_fare=('Fare', 'max'), avg_age=('Age', 'mean'))

Pclass,max_fare,avg_age
1,512.3292,38.233441
2,73.5,29.87763
3,69.55,25.14062


<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following aggregations:
<ol>
    <li>Compute the minimum, maximum and average age by embarkation port (stored in the column <TT>Embarked</TT>) in a single <TT>agg()</TT> operation.
    Note that there are several ways to solve this problem.</li>
    <li>Compute the number of passengers, the average age and the fraction of women by embarkation port in a single <TT>agg()</TT> operation. This one is more challenging and probably requires use of <TT>lambda</TT> expressions.</li>
</ol>
</div>

In [48]:
df.groupby('Embarked').agg(min_age=('Age', 'min'), avg_age=('Age', 'mean'), max_age=('Age', 'max'))

Embarked,min_age,avg_age,max_age
C,0.42,30.814769,71.0
Q,2.0,28.089286,70.5
S,0.67,29.445397,80.0


In [52]:
df['Female'] = (df['Sex'] == 'female')

In [53]:
df.groupby('Embarked').agg(number = ("Name", "count"), avg_age=('Age', 'mean'), fraction_w = ('Female', 'mean'))

Embarked,number,avg_age,fraction_w
C,168,30.814769,0.434524
Q,77,28.089286,0.467532
S,644,29.445397,0.315217
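
The second task can also be approached without a helper column by passing a `lambda` expression inside the named aggregation. The sketch below uses hypothetical data; the output names `number`, `avg_age` and `fraction_w` mirror the solution above.

```python
import pandas as pd

# Hypothetical mini data set
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'S', 'S', 'S', 'S'],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'male'],
    'Age': [38.0, 26.0, 35.0, 22.0, 54.0, 29.0],
})

res = toy.groupby('Embarked').agg(
    number=('Sex', 'size'),       # rows per port
    avg_age=('Age', 'mean'),      # mean age per port
    # Mean of a boolean Series is the fraction of True values
    fraction_w=('Sex', lambda x: (x == 'female').mean()),
)
```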


***

## Transformations

- Aggregations & reductions _reduce_ the dimensionality of the result (e.g., series of data => mean)
- Transformations: apply group-level operations to each _observation_, data dimension remains unchanged
- Transformations can be applied using [`transform()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html)

*Example: Compute average age by class and assign it to each observation*

In [56]:
df['Avg_Age'] = df.groupby('Pclass')['Age'].transform('mean')

In [57]:
df[['Pclass', 'Age', 'Avg_Age']]

Unnamed: 0,Pclass,Age,Avg_Age
0,3,22.0,25.140620
1,1,38.0,38.233441
2,3,26.0,25.140620
3,1,35.0,38.233441
4,3,35.0,25.140620
...,...,...,...
886,2,27.0,29.877630
887,1,19.0,38.233441
888,3,,25.140620
889,1,26.0,38.233441


*Example: Deviation from average fare*

In [59]:
import numpy as np
def my_diff(x):
    return x - np.mean(x)

df['Diff_Fare'] = df.groupby('Pclass')['Fare'].transform(my_diff)

In [60]:
df[['Pclass', 'Fare', 'Diff_Fare']]

Unnamed: 0,Pclass,Fare,Diff_Fare
0,3,7.2500,-6.425550
1,1,71.2833,-12.871387
2,3,7.9250,-5.750550
3,1,53.1000,-31.054687
4,3,8.0500,-5.625550
...,...,...,...
886,2,13.0000,-7.662183
887,1,30.0000,-54.154687
888,3,23.4500,9.774450
889,1,30.0000,-54.154687


<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following transformation:
<ol>
    <li>Compute the <i>excess</i> fare paid by each passenger relative to the minimum fare by embarkation port and class, i.e., compute <i>Fare - min(Fare)</i>
        by port and class.</li>
</ol>
</div>

In [70]:
import numpy as np
def my_diff(x):
    return x - np.min(x)

df['Diff_Fare'] = df.groupby(['Pclass', 'Embarked'])['Fare'].transform(my_diff)


In [71]:
df[['Pclass', 'Embarked', 'Fare', 'Diff_Fare']]

Unnamed: 0,Pclass,Embarked,Fare,Diff_Fare
0,3,S,7.2500,7.2500
1,1,C,71.2833,44.7333
2,3,S,7.9250,7.9250
3,1,S,53.1000,53.1000
4,3,S,8.0500,8.0500
...,...,...,...,...
886,2,S,13.0000,13.0000
887,1,S,30.0000,30.0000
888,3,S,23.4500,23.4500
889,1,C,30.0000,3.4500


***
# Working with time series data

- Time series data: indexed by time stamp, date, etc.
- Example: Quarterly GDP since 1950
- Pandas has comprehensive support for time series data

*Example: Create artificial daily data*

In [76]:
start = '2024-01-01'
end = '2024-03-31'

index = pd.date_range(start, end, freq = 'D')

In [78]:
import numpy as np
data = np.arange(len(index))

# Series of consecutive integers, one per day
s = pd.Series(data, index=index)

Construct three months of daily data from 2024-01-01 to 2024-03-31 using the
[`date_range()`](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html) function.

***

## Indexing with date/time indices

- Can pass dates, date ranges, etc., directly to `.loc[]`
- Supports partial indexing

*Example: Select single date, date range, whole months*

In [None]:
s.loc
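
The three kinds of selection could be written as follows. This is a sketch that assumes `s` is a `Series` of consecutive integers on the daily index from 2024-01-01 to 2024-03-31; it is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Assumed setup: daily data for the first three months of 2024
index = pd.date_range('2024-01-01', '2024-03-31', freq='D')
s = pd.Series(np.arange(len(index)), index=index)

one_day = s.loc['2024-01-15']                 # single date: returns one value
some_days = s.loc['2024-01-30':'2024-02-02']  # date range: both end points included
february = s.loc['2024-02']                   # partial indexing: the whole month
```

Unlike integer slices, label-based slices with dates include the end point.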

***

## Lags, differences, and other useful transformations

Common time-series operations:

- [`shift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html): move observations forward/backward
- [`diff()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html): compute difference across periods
- [`pct_change()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html): compute percentage change across periods
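
A short illustration of the three methods on a hypothetical four-day series (the values are made up):

```python
import pandas as pd

# Hypothetical daily series
index = pd.date_range('2024-01-01', periods=4, freq='D')
s = pd.Series([100.0, 110.0, 99.0, 99.0], index=index)

lagged = s.shift(1)      # previous day's value; the first entry is NaN
change = s.diff()        # same as s - s.shift(1)
growth = s.pct_change()  # same as (s - s.shift(1)) / s.shift(1)
```

Note that `pct_change()` returns relative changes (0.1 for +10%), not values multiplied by 100.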


***
## Resampling and aggregation

- Resampling is like `groupby()`, but applied to time periods
- Observations can be grouped by year (`'YE'`), quarter (`'QE'`), month (`'ME'`), week (`'W'`), etc.
- Apply built-in methods to the resampled object just like with `groupby()` aggregation

*Example: Compute monthly averages*

*Example: Select last weekly observation*
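
Both examples might be written as shown below. The sketch assumes the artificial daily series from 2024-01-01 to 2024-03-31, rebuilt here so the code is self-contained. The fallback to `'M'` is an assumption for older pandas releases that do not recognise the `'ME'` alias.

```python
import numpy as np
import pandas as pd

# Assumed setup: daily data for the first three months of 2024
index = pd.date_range('2024-01-01', '2024-03-31', freq='D')
s = pd.Series(np.arange(len(index), dtype=float), index=index)

# Monthly averages, labelled by month end
try:
    monthly = s.resample('ME').mean()
except ValueError:
    # Older pandas versions use 'M' for month-end frequency
    monthly = s.resample('M').mean()

# Last observation of each week (weeks end on Sunday by default)
weekly_last = s.resample('W').last()
```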