# Advanced Chickweight 

- joins/merges/concatenations 
- dealing with missing values (`dropna`)
- transformations instead of aggregations 

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline

url = 'http://koaning.io/old/theme/data/chickweight.csv'
chickweight = (pd.read_csv(url).rename(str.lower, axis='columns'))

## Combining Datasets 

In [None]:
chickweight.set_index('diet').head()

In [None]:
agg = (chickweight
       .groupby('diet')
       .apply(lambda d: pd.Series({"mean_weight": np.mean(d['weight'])})))

agg

In [None]:
chickweight.set_index("diet").join(agg).sample(6)

In [None]:
chickweight.merge(agg.reset_index()).sample(6)

You may notice that the functionality of `.join()` and `.merge()` overlaps. The difference is minor:

- **join** matches on the index by default
- **merge** matches on overlapping column names by default

You don't have to rely on these defaults, though. You can also state explicitly which columns should be matched against which.

In [None]:
chickweight.merge(agg.reset_index(), on="diet").sample(3)
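As a small illustration of spelling out the match yourself, here is a sketch on two toy frames. The frames `left` and `right` and their column names are made up for this example and are not part of the chickweight data. Both calls match the `key` column of one frame against the index of the other.

```python
import pandas as pd

# Hypothetical toy frames, only for illustration.
left = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
right = pd.DataFrame({"label": ["alpha", "beta"]}, index=["a", "b"])

# join: match the `key` column of `left` against the index of `right`.
joined = left.join(right, on="key")

# merge: the same match, stated with left_on and right_index.
merged = left.merge(right, left_on="key", right_index=True)

print(joined)
print(merged)
```

Both results carry the same `label` for every row of `left`; which of the two methods you use is mostly a matter of taste.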

# Assignment 

Suppose that we have an extra dataframe with some information we'd like to get joined to our original dataframe.

In [None]:
agg = (chickweight
       .groupby(['diet', 'time'])
       .apply(lambda d: pd.Series({
           "weight": d['weight'].mean(), 
           "variance": d['weight'].var()}))
       .reset_index()
       .rename(columns={"time": "tijd"}))

agg.head(3)

Write a statement that joins the two dataframes using `chickweight.merge()`. Make sure that you join on both the diet and the time, even though the time column has a different name in one of the dataframes. To make it a clean join, read the documentation of `chickweight.merge()` to figure out what the `suffixes` setting does and which alternatives to `on` exist.

Note that you should use a **single** `.merge()` command for the join and little else.

## Bonus Points 

Can you also perform the join using `agg.merge()`? 

In [None]:
%load answers/joined_frames.py

## Computation Time 

The operation we just did is quite common: you compute an aggregate per group (say, average session length) and add it to the raw dataset. Performing the aggregation first makes sense, but for large dataframes the join that follows can be a bit expensive.

This is why there's also a `.transform()` method. It computes the aggregate per group and places the result on every original row in one go, so no separate join is needed. To demonstrate how this works, let us first check what the mean weight is per diet.

In [None]:
chickweight.groupby("diet")['weight'].mean()

Next up we will use the `.transform()` method on the grouped object. We will first check the shape.

In [None]:
transformed = chickweight.groupby("diet")['weight'].transform('mean')
transformed.shape

Note that the result has as many rows as the original dataframe. Next we check whether the first values correspond with the mean weight of the 1st diet.

In [None]:
transformed.head()

It is indeed so.
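To make the link with the earlier join explicit, here is a minimal sketch on a toy frame. The frame `df` and the column names `group`, `x` and `mean_x` are made up for this example. It computes a per-group mean once via aggregate-then-merge and once via `.transform()`.

```python
import pandas as pd

# Hypothetical toy data, not the chickweight frame.
df = pd.DataFrame({"group": ["a", "a", "b", "b", "b"],
                   "x": [1.0, 3.0, 10.0, 20.0, 30.0]})

# Route 1: aggregate first, then merge the result back onto the rows.
means = df.groupby("group")["x"].mean().rename("mean_x").reset_index()
via_merge = df.merge(means, on="group")

# Route 2: transform puts the group mean on every row directly.
via_transform = df.assign(mean_x=df.groupby("group")["x"].transform("mean"))

print(via_merge)
print(via_transform)
```

Both routes give the same `mean_x` column; the transform route simply skips the intermediate dataframe and the join.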

# Assignment

Take the original `chickweight` dataframe and create these columns on the raw data without performing a join: 

1. **mean_weight_diet**: which calculates the mean weight per diet 
2. **mean_weight_diet_time**: which calculates the mean weight per diet at a given time
3. **num_chickens_diet**: which calculates the total number of chickens per diet

In [None]:
%load answers/transforms.py

# Analysis Assignment 

When do chickens grow the most? Does the growth per time depend on the diet? Is there even a difference?

- **Hint 1**: look up the `shift` method of a series and note that it behaves differently when a `groupby` is active.
- **Hint 2**: there is also an alternative function you can use for this. Can you find it in the documentation?
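To see what the first hint is pointing at, here is a small sketch on a toy frame. The values are made up and only the column names mimic the real data. It shows how `shift` behaves with and without a `groupby`, and leaves the actual growth computation to you.

```python
import pandas as pd

# Hypothetical toy frame with two chickens, values are made up.
df = pd.DataFrame({"chick": [1, 1, 1, 2, 2],
                   "weight": [40, 50, 65, 42, 49]})

# Plain shift: every value moves one row down, across chicken boundaries.
plain = df["weight"].shift(1)

# Grouped shift: values only move within the same chicken,
# so the first row of every chicken becomes NaN.
grouped = df.groupby("chick")["weight"].shift(1)

print(df.assign(plain=plain, grouped=grouped))
```

Without the `groupby`, the first row of chicken 2 receives the last weight of chicken 1, which is usually not what you want when comparing consecutive measurements.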

In [None]:
%load answers/chicken_growth.py

In [None]:
%load answers/analysis.py