# Advanced Data Wrangling & Analysis

## Lesson Overview

This workshop transforms basic Python knowledge into professional data manipulation skills. We follow the "Extract-Transform-Analyze" workflow across 4 distinct sections:

1.  **Part 1: Financial Time Series & Window Functions**
    * Handling Datetime objects and indexing
    * Resampling and Frequency conversion
    * Window functions (Rolling means)
    * Covariance and Correlation
2.  **Part 2: Data Wrangling (Merge & Reshape)**
    * Merging datasets (Inner, Outer, Left, Right joins)
    * Reshaping data: Melt and Pivot
3.  **Part 3: Aggregation & Reporting**
    * GroupBy mechanics (Split-Apply-Combine)
    * Pivot Tables and Cross-Tabulations
4.  **Part 4: Advanced Toolkit (Optional/Deep Dive)**
    * Hierarchical Indexing (MultiIndex)
    * Concatenation
    * Stacking/Unstacking
    * Advanced GroupBy: Apply and Transform

---

**Setup:** Import necessary libraries.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

# Part 1: Financial Time Series & Window Functions

**Learning Objective:** Handle data where the *order* matters (Time Series). We will learn to convert strings to dates, handle missing business days, smooth volatile data, and analyze stock correlations.

## 1.1 Handling Date Time Data

Pandas is oriented towards working with arrays of dates, whether used as an axis index or a column.

The `to_datetime` function parses many different kinds of date representations:

In [None]:
dates = ["2011-07-06 12:00:00", "2011-08-06 00:00:00"]

pd.to_datetime(dates)

It uses `NaT` (Not a Time) as null values for datetime data.

In [None]:
idx = pd.to_datetime(dates + [None])
idx

In [None]:
pd.isna(idx)

Standard Python uses the `datetime` module to handle date and time data. Pandas has a `Timestamp` object that is similar to the `datetime` object. 

If you use `datetime` objects as the index of a Series or DataFrame, Pandas automatically converts them to a `DatetimeIndex`.

In [None]:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7), 
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]

ts = pd.Series(np.random.standard_normal(6), index=dates)
ts

In [None]:
ts.index

Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

In [None]:
# [::2] selects every second element
ts + ts[::2]

### Indexing & Slicing

You can index by passing a `datetime`, a `Timestamp`, or a string that is interpretable as a date:

In [None]:
ts[datetime(2011, 1, 7)]

In [None]:
ts[pd.Timestamp("2011-01-07")]

In [None]:
ts["2011-01-07"]

You can also pass a year or year-month string to select a whole range of data. This is very convenient for quick analysis.

In [None]:
# date_range generates an array of dates
longer_ts = pd.Series(np.random.standard_normal(1000), 
                      index=pd.date_range("2000-01-01", periods=1000))

longer_ts

In [None]:
# Select all data from 2001
longer_ts["2001"].head()

In [None]:
# Select all data from May 2001
longer_ts["2001-05"].head()

In [None]:
# Slicing with strings
longer_ts["2001-05":].head()

**Student Exercise:** Use `date_range` to generate a Series of random values from 1-31st January 2023. Then slice the Series to return data from 5-15th January.

## 1.2 Real-World Application: Stock Market Data

Let's load stock prices (AAPL, GOOG, IBM, MSFT) and trade volumes. We use `pd.read_pickle` here because a pickle file stores the DataFrame exactly as it was saved, including its datetime index, so no date parsing is needed.

In [None]:
price = pd.read_pickle("../data/yahoo_price.pkl")
volume = pd.read_pickle("../data/yahoo_volume.pkl")

In [None]:
price.head()

In [None]:
volume.head()

### Inspecting the Index
Notice the index is a `DatetimeIndex`.

In [None]:
price.index

We can access attributes like `day_of_week` directly:

In [None]:
price.index.day_of_week

In [None]:
price.index.month

If the datetime is in a column instead of the index, you can use the `dt` accessor to access the datetime properties.

In [None]:
price_reindex = price.reset_index()
price_reindex.head()

In [None]:
price_reindex["Date"].dt.day_name().head()

**Student Exercise:** Get the week of year from the date column and create a new column `week_of_year`.

## 1.3 Resampling (Frequency Conversion)

As the output above shows, the prices are recorded on business days only. To convert them to calendar-day frequency (that is, including weekends), use `resample`.

This introduces missing data (NaN) for weekends.

In [None]:
price_resampled = price.resample('D').asfreq()
price_resampled.head(10)

**Forward Fill (`ffill`)**: To fill the NA values with the most recent observation (common in finance, where Saturday's price is taken to be Friday's close), use the `.ffill()` method.

In [None]:
price_resampled = price.resample('D').ffill()
price_resampled.head(10)
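
The difference between `asfreq` and `ffill` is easier to see on a tiny series. The following is a minimal sketch on made-up data (the names `demo_bdays`, `as_is` and `filled` and the values are illustrative assumptions, not part of the stock dataset):

```python
import pandas as pd

# Hypothetical prices on five business days: Thu, Fri, Mon, Tue, Wed
demo_bdays = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0],
    index=pd.bdate_range("2023-01-05", periods=5),
)

as_is = demo_bdays.resample("D").asfreq()   # weekend rows appear as NaN
filled = demo_bdays.resample("D").ffill()   # weekend rows repeat Friday's value

# Side-by-side view: the two columns differ only on Saturday and Sunday
print(pd.DataFrame({"asfreq": as_is, "ffill": filled}))
```

Both results cover seven calendar days; only the treatment of the two weekend rows differs.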

If you want to resample to a **lower frequency** (e.g. Monthly 'MS' - Month Start) you need to provide an aggregation method (like `mean`):

In [None]:
price_resampled = price.resample('MS').mean()
price_resampled.head()
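
To see why downsampling needs an aggregation, here is a small sketch on synthetic daily data (the series `daily_demo` is a made-up example). Each month collapses many daily rows into one, so pandas has to be told how to combine them:

```python
import pandas as pd

# Hypothetical daily values 0, 1, ..., 58 covering January and February 2023
daily_demo = pd.Series(
    range(59),
    index=pd.date_range("2023-01-01", periods=59),
    dtype="float64",
)

# 31 January rows and 28 February rows are each reduced to a single mean,
# labelled with the first day of the month ("MS" = month start)
monthly_mean = daily_demo.resample("MS").mean()
print(monthly_mean)
```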

**Student Exercise:** Resample price to `yearly` (start of year) frequency, use `sum` as aggregation function.

## 1.4 Window Functions (Moving Averages)

You can apply functions evaluated over a sliding window using the `rolling` method.

For example, to compute the 30-day moving average for Apple price:

In [None]:
price["AAPL"].rolling(30).mean().tail()

By default, rolling functions require all of the values in the window to be non-NA. Pass `min_periods` to relax this requirement, which helps with missing data and with the first rows of the time series, where fewer than 30 observations are available.

In [None]:
price["AAPL"].rolling(30, min_periods=3).mean().head()
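
The effect of `min_periods` is easiest to see with a short window on a handful of values. This sketch uses a made-up series (`demo_prices` is an illustrative name) and a window of 3 rather than 30:

```python
import pandas as pd

demo_prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result is produced only once 3 observations are in the window
strict = demo_prices.rolling(3).mean()

# min_periods=1: a result is produced from the very first observation
relaxed = demo_prices.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"strict": strict, "relaxed": relaxed}))
```

The two columns agree from the third row onward; only the first two rows differ, where the strict version returns `NaN`.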

**Student Exercise:** Compute a 10-day moving average for `GOOG` with a min period of 5 days.

## 1.5 Covariance and Correlation

Covariance and correlation measure the relationship between two variables.

* **Covariance:** Measure of how much two random variables vary together. Hard to interpret magnitude.
* **Correlation:** Normalized measure (-1 to 1). 1 is perfect positive correlation, -1 is perfect negative.

In finance, we usually look at **Returns** (Percent Change), not raw prices.

In [None]:
returns = price.pct_change()
returns.tail()

Compute the correlation and covariance between the returns of `MSFT` and `IBM`:

In [None]:
print("Covariance:", returns["MSFT"].cov(returns["IBM"]))
print("Correlation:", returns["MSFT"].corr(returns["IBM"]))
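
The link between the two measures can be checked numerically: correlation is covariance divided by the product of the two standard deviations. The sketch below uses synthetic random data (the series `x_demo` and `y_demo` are assumptions for illustration, not stock returns) and also shows why the magnitude of a covariance is hard to interpret:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x_demo = pd.Series(rng.standard_normal(250))
y_demo = 0.5 * x_demo + pd.Series(rng.standard_normal(250))

cov_xy = x_demo.cov(y_demo)
corr_xy = x_demo.corr(y_demo)

# Correlation is covariance rescaled by both standard deviations
manual_corr = cov_xy / (x_demo.std() * y_demo.std())
print(np.isclose(corr_xy, manual_corr))

# Changing units (e.g. fractions to percent) scales the covariance
# by 100 * 100 but leaves the correlation unchanged
cov_pct = (100 * x_demo).cov(100 * y_demo)
corr_pct = (100 * x_demo).corr(100 * y_demo)
print(cov_pct / cov_xy, corr_pct - corr_xy)
```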

You can also get the full (pair-wise) correlation or covariance matrix as a DataFrame:

In [None]:
returns.corr()

You can also compute pair-wise correlations between a DataFrame's columns or rows with another Series or DataFrame.

In [None]:
# Correlation of all companies against IBM
returns.corrwith(returns["IBM"])

In [None]:
# Correlation of returns against volume
returns.corrwith(volume)

---
# Part 2: Data Wrangling (Merge & Reshape)

**Learning Objective:** Combine data from different sources (SQL-style Joins) and reshape table layouts (Wide to Long) to prepare for analysis.

## 2.1 Merging (Joins)

`merge` connects rows in DataFrames based on one or more keys. This is equivalent to database `join` operations.

In [None]:
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"], 
                    "data1": pd.Series(range(7), dtype="Int64")})

df2 = pd.DataFrame({"key": ["a", "b", "d"], 
                    "data2": pd.Series(range(3), dtype="Int64")})

print("DF1 (Left):\n", df1)
print("\nDF2 (Right):\n", df2)

**Many-to-One Join:** `df1` has multiple rows labeled `a` and `b`, whereas `df2` has only one row for each value in the key column `key`.

The default is an **Inner Join** (intersection of keys).

In [None]:
pd.merge(df1, df2)

It is good practice to specify the key explicitly:

In [None]:
pd.merge(df1, df2, on="key")

If the column names are different in each object, you can specify them separately using `left_on` and `right_on`:

In [None]:
df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"], 
                    "data1": pd.Series(range(7), dtype="Int64")})

df4 = pd.DataFrame({"rkey": ["a", "b", "d"], 
                    "data2": pd.Series(range(3), dtype="Int64")})

pd.merge(df3, df4, left_on="lkey", right_on="rkey")

### Join Types (Inner, Outer, Left, Right)

You can specify the other options via the `how` parameter.

In [None]:
# Outer Join: Union of keys. Fills missing with NaN
pd.merge(df1, df2, how="outer")

In [None]:
# Outer Join with mismatched key names
pd.merge(df3, df4, left_on="lkey", right_on="rkey", how="outer")

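**Many-to-Many Join:** When a key value appears several times in both DataFrames, the result contains every pairing of the matching rows. Here `b` occurs three times in `df1` and twice in `df2`, so the inner join returns six `b` rows.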

In [None]:
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], 
                    "data1": pd.Series(range(6), dtype="Int64")})

df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"], 
                    "data2": pd.Series(range(5), dtype="Int64")})

pd.merge(df1, df2, how="inner")
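
The number of rows produced by a many-to-many join can be predicted from the key counts on each side. The following sketch rebuilds the same keys under different names (`left_demo`, `right_demo`, chosen here so the notebook's `df1` and `df2` are not overwritten) and compares the prediction with the actual result:

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                          "data1": range(6)})
right_demo = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                           "data2": range(5)})

merged_demo = pd.merge(left_demo, right_demo, on="key", how="inner")

# For each key, rows in the result = (count on left) * (count on right).
# Keys present on only one side yield NaN here and drop out of an inner join.
left_counts = left_demo["key"].value_counts()
right_counts = right_demo["key"].value_counts()
expected_rows = (left_counts * right_counts).dropna()

print(expected_rows)
print(merged_demo["key"].value_counts())
```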

**Student Exercise:** Merge `df1` and `df2` with a left join.

### Merging on Multiple Keys & Suffixes

To merge with multiple keys, pass a list of column names:

In [None]:
left = pd.DataFrame({"key1": ["foo", "foo", "bar"], 
                     "key2": ["one", "two", "one"],
                     "lval": pd.Series([1, 2, 3], dtype='Int64')})

right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": pd.Series([4, 5, 6, 7], dtype='Int64')})

pd.merge(left, right, on=["key1", "key2"], how="outer")

If there are overlapping non-key column names, `merge` adds suffixes `_x` and `_y` by default. You can customize this:

In [None]:
pd.merge(left, right, on="key1")

In [None]:
pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

### Merging on Index

If the merge key(s) is in the index, you can pass `left_index=True` or `right_index=True`.

In [None]:
left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
                      "value": pd.Series(range(6), dtype="Int64")})

right1 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])

pd.merge(left1, right1, left_on="key", right_index=True)

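DataFrame has a `join` method which performs a left join by default. It's a convenient shortcut for index-on-index merging, and its `on` argument also lets you match a column of the calling DataFrame against the index of the other one.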

In [None]:
left1.join(right1, on='key')
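
For the pure index-on-index case, `join` needs no key arguments at all. A minimal sketch with two made-up frames (`left_idx` and `right_idx` are illustrative names):

```python
import pandas as pd

left_idx = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])
right_idx = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Default: left join on the index, so "c" is kept with a missing group_val
joined = left_idx.join(right_idx)

# The same result written out with merge
merged_equiv = pd.merge(left_idx, right_idx,
                        left_index=True, right_index=True, how="left")

# Other join types are available through `how`
joined_inner = left_idx.join(right_idx, how="inner")

print(joined)
print(joined_inner)
```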

## 2.2 Reshaping and Pivoting

We often need to switch between **Wide Format** (Excel style, years as columns) and **Long Format** (Database style, one row per observation).

### Melt (Wide to Long)
Let's look at our stock price data. It is currently **Wide**.

In [None]:
# Reset index so Date is a column
price_reindex = price.reset_index()
price_reindex.head()

In [None]:
# Melt into Long format
melted = pd.melt(price_reindex, id_vars="Date")
melted.head()

**Student Exercise:** Rerun `melt` and pass arguments such that the new columns are named `Company` and `Price` respectively.

### Pivot (Long to Wide)
Using `pivot`, we can reshape back to the original layout:

In [None]:
reshaped = melted.pivot(index='Date', columns='variable', values='value')
reshaped.head()
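
The round trip from wide to long and back can be verified on a table small enough to read in full. This is a sketch on made-up data (the tickers `AAA` and `BBB` and the prices are hypothetical):

```python
import pandas as pd

wide_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-02", "2023-01-03"]),
    "AAA": [10.0, 11.0],
    "BBB": [20.0, 21.0],
})

# Wide -> long: one row per (Date, ticker) pair
long_demo = pd.melt(wide_demo, id_vars="Date")
print(long_demo)

# Long -> wide: the tickers become columns again
back_demo = long_demo.pivot(index="Date", columns="variable", values="value")
print(back_demo)
```

The pivoted table holds the same values as `wide_demo.set_index("Date")`; the only visible difference is that its column axis carries the name `variable`.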

---
# Part 3: Aggregation & Reporting

**Learning Objective:** Summarize data using GroupBy, Custom Aggregations, and Pivot Tables to answer business questions.

## 3.1 Data Aggregation (GroupBy)

Data aggregation is the process of grouping data together and performing calculations on them.

In [None]:
df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None], 
                   "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
                   "data1" : np.random.standard_normal(7), 
                   "data2" : np.random.standard_normal(7)})
df

If you want to compute the mean for each unique value in `key1`:

In [None]:
df.groupby("key1").mean()

It does not make sense to compute the mean for `key2` since it is a categorical variable and also serves as a key.

We can select the numeric columns to compute the mean for (after the `groupby` method):

In [None]:
df.groupby("key1")[["data1", "data2"]].mean()

Note that the following also works because the result of `mean` is a DataFrame. It is less efficient, however, because the column selection happens after the means have been computed for every column.

In [None]:
df.groupby("key1").mean()[["data1", "data2"]]

You can group by more than 1 column. There is a useful GroupBy method `size` which returns a Series containing group sizes.

In [None]:
df.groupby(['key1', 'key2']).size()

You can also group by other `Series`/`array`/`list` with the same length:

In [None]:
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

df["data1"].groupby([states, years]).mean()

**Student Exercise:** Group by `key1` and `key2` and compute the standard deviation.

## 3.2 Custom Aggregation

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` method or its short alias `agg`:

In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [None]:
grouped = df.groupby("key1")
grouped.agg(peak_to_peak)

You can pass a list of functions, or function names (for built-in functions) to `aggregate`: 

In [None]:
grouped.agg([peak_to_peak, "mean", "std"])

## 3.3 Pivot Tables

Pivot tables are used to summarize, sort, reorganize, group, count, total or average data. They let you transform columns into rows and rows into columns.

We will use the `tips.csv` dataset.

In [None]:
tips = pd.read_csv("../data/tips.csv")

# add a column with the tip percentage
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

tips.head()

The default aggregation for `pivot_table` is mean.

In [None]:
tips.pivot_table(index=["day", "smoker"], values=["size", "tip", "tip_pct", "total_bill"])

You can put `smoker` in the table columns and `time` and `day` in the rows:

In [None]:
tips.pivot_table(index=["time", "day"], columns="smoker", 
                 values=["tip_pct", "size"])

Add partial totals by passing `margins=True`:

In [None]:
tips.pivot_table(index=["time", "day"], columns="smoker", 
                 values=["tip_pct", "size"], margins=True)

To use a different aggregation function, pass it to the `aggfunc` keyword:

In [None]:
tips.pivot_table(index=["time", "smoker"], columns="day", 
                 values="tip_pct", aggfunc=len, margins=True)

Use `fill_value` to fill missing values:

In [None]:
tips.pivot_table(index=["time", "smoker"], columns="day", 
                 values="tip_pct", aggfunc=len, margins=True, fill_value=0)

**Student Exercise:** Compute the sum of `tip` in a pivot table with `day` and `time` in the rows and `smoker` in the column.

### Cross-Tabulation

A _cross-tabulation_ or _crosstab_ is a special case of pivot table that computes group frequencies (counts):

In [None]:
pd.crosstab(index=[tips["time"], tips["day"]], columns=tips["smoker"], margins=True)
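
To make the "special case" concrete, the sketch below builds the same frequency table twice, once with `crosstab` and once with `pivot_table` using a count aggregation. The frame `demo_tips` is a small made-up stand-in for the tips data:

```python
import pandas as pd

demo_tips = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Lunch"],
    "smoker": ["No", "Yes", "No", "No", "Yes", "No"],
    "tip": [1.0, 2.0, 3.0, 1.5, 2.5, 2.0],
})

# Frequencies of each (time, smoker) combination
counts_crosstab = pd.crosstab(index=demo_tips["time"],
                              columns=demo_tips["smoker"])

# The same table as a pivot table that counts the non-missing tips
counts_pivot = demo_tips.pivot_table(index="time", columns="smoker",
                                     values="tip", aggfunc="count")

print(counts_crosstab)
print(counts_pivot)
```

The two approaches agree here because `tip` has no missing values; `count` skips missing entries, whereas `crosstab` counts rows.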

---
# Part 4: Advanced Toolkit (Optional / Deep Dive)

**Learning Objective:** Master complex data structures and advanced transformations. This section covers Hierarchical Indexing, Stacking, Concatenation, and custom Apply/Transform logic.

## 4.1 Hierarchical Indexing (MultiIndex)

Hierarchical indexing (MultiIndex) allows you to have multiple (two or more) _index levels_ on an axis. It enables "higher dimensional" data in a lower dimensional data structure.

In [None]:
data = pd.Series(np.random.uniform(size=9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                 [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

In [None]:
data.index

You can use _partial indexing_ to select subsets of data:

In [None]:
data["b"]

In [None]:
data["b":"c"]

In [None]:
data.loc[["b", "d"]]

You can also select from "inner" level:

In [None]:
data.loc[:, 2]

Hierarchical indexing works on both axes.

In [None]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                        index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                        columns=[['Ohio', 'Ohio', 'Colorado'],
                        ['Green', 'Red', 'Green']])
frame

Setting names on the axes works as usual:

In [None]:
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]
frame

In [None]:
frame.index.nlevels

Partial indexing works on columns too:

In [None]:
frame["Ohio"]

### Reordering and Sorting Levels

You may need to rearrange the order of the levels on an axis. The `swaplevel` method will swap the levels. The default is to swap the levels on the rows:

In [None]:
frame.swaplevel()

In [None]:
frame.swaplevel(0, 1, axis=1)

You can also sort by a single level or subset of levels:

In [None]:
frame.sort_index(level=1)

> **Student Exercise:** Swap the levels on the rows then sort the index by level `0`.

### Setting and Resetting Index

It's common to use one or more columns from a DataFrame as the row index.

In [None]:
frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1), 
                      "c": ["one", "one", "one", "two", "two", "two", "two"], 
                      "d": [0, 1, 2, 0, 1, 2, 3]})
frame

`set_index` will return a new DataFrame using one or more of its columns as the index.

In [None]:
frame2 = frame.set_index(["c", "d"])
frame2

`reset_index` does the opposite of `set_index` and turns the index back into a column.

In [None]:
frame2.reset_index()

Pass `drop=True` to discard the index levels instead of turning them back into columns:

In [None]:
frame2.reset_index(drop=True)

## 4.2 Concatenation

You can combine pandas objects along either axis, which is referred to as _concatenation_ or _stacking_. Along the rows, this is akin to a database `UNION ALL` operation.

In [None]:
s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

In [None]:
pd.concat([s1, s2, s3])

By default, `concat` works along `axis="index"`, producing another Series. If you pass `axis="columns"`, the result will instead be a DataFrame:

In [None]:
pd.concat([s1, s2, s3], axis="columns")

By default, `concat` takes the union (`outer` join) of the indexes. You can intersect them instead by passing `join='inner'`:

In [None]:
s4 = pd.concat([s1, s3])
pd.concat([s1, s4], axis="columns", join="inner")

When combining Series along axis="columns", pass the `keys` argument for the DataFrame column headers:

In [None]:
pd.concat([s1, s2, s3], axis="columns", keys=["one", "two", "three"])

**Student Exercise:** Concat `s1`, `s2` and `s3` along index and pass `keys=["one", "two", "three"]`.

If the index does not contain any relevant data, and you want to avoid concatenating based on indexes, you can pass the `ignore_index=True` argument:

In [None]:
df1 = pd.DataFrame(np.random.standard_normal((3, 4)), 
                   columns=["a", "b", "c", "d"])

df2 = pd.DataFrame(np.random.standard_normal((2, 3)), 
                   columns=["b", "d", "a"])

pd.concat([df1, df2], ignore_index=True)

## 4.3 Stacking and Unstacking

These are alternative reshaping methods to Melt/Pivot that work specifically on the Index levels.

In [None]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)), 
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))
data

The `stack` method pivots the columns into rows, producing a Series with a MultiIndex.

In [None]:
result = data.stack()
result

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with `unstack`, which pivots rows into columns.

In [None]:
result.unstack()

You can unstack a different level by passing a level number or name:

In [None]:
result.unstack(level=0)

## 4.4 Advanced GroupBy: Apply

The most general-purpose GroupBy method is `apply`, which splits the object being manipulated into pieces, invokes the passed function on each piece, and then concatenates the pieces.

Suppose we want to select the top five `tip_pct` values by group. First, write a function that selects the rows with the largest values in a particular column:

In [None]:
def top(df, n=5, column="tip_pct"):
    return df.sort_values(column, ascending=False)[:n]

In [None]:
top(tips, n=6)

We can then `apply` this function by different groups using `groupby`:

In [None]:
tips.groupby("smoker").apply(top)
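
The split, apply and combine steps described above can also be written out by hand, which may help clarify what `apply` does conceptually. This is only an illustrative sketch on made-up data (`demo_groups` and `top_rows` are hypothetical names), not how pandas implements it internally:

```python
import pandas as pd

demo_groups = pd.DataFrame({
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "tip_pct": [0.10, 0.20, 0.15, 0.05, 0.30, 0.25],
})

def top_rows(frame, n=2, column="tip_pct"):
    """Return the n rows of `frame` with the largest values in `column`."""
    return frame.sort_values(column, ascending=False)[:n]

# Split: iterating over a GroupBy yields (group name, sub-DataFrame) pairs.
# Apply: call the function on each piece.
pieces = {name: top_rows(chunk) for name, chunk in demo_groups.groupby("smoker")}

# Combine: concatenating a dict uses its keys as the outer index level
combined = pd.concat(pieces)
print(combined)
```

The result has a two-level index: the group name on the outside and the original row label on the inside.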

You can pass the arguments to the function as follows:

In [None]:
tips.groupby(["smoker", "day"]).apply(top, n=2, column="total_bill")

**Student Exercise:** Apply the function to the data grouped by `day` and `time`.

## 4.5 Advanced GroupBy: Transform

You can also transform your data using the `transform` method. It is similar to `apply` but the function must:
- Produce a scalar value to be broadcast to the shape of the group chunk, or
- Return an object that is the same shape as the group chunk

This is useful for z-score normalization within groups.

In [None]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
g = df.groupby('key')['value']
g.mean()

`transform` produces a Series of the same shape as `df['value']`, with each value replaced by the mean of its `key` group.

In [None]:
g.transform(lambda x: x.mean())

In [None]:
def normalize(x):
    return (x - x.mean()) / x.std()

g.transform(normalize)
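
A quick way to confirm that a within-group z-score behaves as intended is to regroup the transformed values: every group should then have mean 0 and standard deviation 1. The sketch below rebuilds the same data under separate names (`scores`, `by_key`, `zscores` are illustrative) so it can run on its own:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"key": ["a", "b", "c"] * 4, "value": np.arange(12.0)})
by_key = scores.groupby("key")["value"]

# Same-shape result: one z-score per original row
zscores = by_key.transform(lambda x: (x - x.mean()) / x.std())

# Scalar result: the group mean is broadcast back to every row of its group
group_means = by_key.transform("mean")

# Regroup the z-scores and summarize each group
check = zscores.groupby(scores["key"]).agg(["mean", "std"])
print(check)
```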