# Creating New Columns

Often we will want to create a new column so that we can use it at a later date.

In this notebook we will cover:

* [Creating new columns: avoid common bad practice](#bad-pract)
* [Using `assign()` to create new columns](#assign)
* [Shifting a column](#shifting)
* [Other Verbs](#other-verbs)
    * [Renaming columns](#rename)
    * [Drop duplicate values](#drop)
* [<mark>Assignment</mark>](#dead-chickens)

Before we do that, though, let's import pandas and read in our data, as we're in a new notebook:

In [None]:
import pandas as pd

chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)

<a id='bad-pract'></a>
## Creating new columns: avoid common bad practice

Say we want to create a new column where the weight is doubled.

We can use the assignment operator (`=`) with square-bracket indexing to create the new column, as seen below.

In [None]:
chickweight['weight2'] = chickweight['weight'] * 2

We can see the result of this below.

In [None]:
chickweight.head()

However, adding columns like this is an anti-pattern because it changes the original dataframe in place.

**Code should always perform in the same way regardless of where it is in the project.**

Imagine that in our analysis we were using the code below to find the max of the second-last column of the dataframe (`'chick'` in the original data):

In [None]:
def max_penultimate_column(df):
    return df.iloc[:,-2].max()

In [None]:
max_penultimate_column(chickweight)

If someone else picks up our code and doesn't realise the original dataframe was overwritten, they will get a different result from this function when they run it on freshly loaded data:

In [None]:
chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)

In [None]:
max_penultimate_column(chickweight)
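The same effect can be reproduced without the CSV file. This is a minimal sketch using a small made-up dataframe (`toy` and its values are invented for illustration, and the column order is an assumption), showing how an in-place assignment changes what a position-based function returns:

```python
import pandas as pd

# Toy stand-in for chickweight; the values are made up for illustration.
toy = pd.DataFrame({
    'weight': [42, 51, 59],
    'time': [0, 2, 4],
    'chick': [7, 7, 7],
    'diet': [2, 2, 2],
})


def max_penultimate_column(df):
    # Positional lookup: whichever column happens to be second from the end.
    return df.iloc[:, -2].max()


before = max_penultimate_column(toy)   # second-last column is 'chick'

toy['weight2'] = toy['weight'] * 2     # in-place change appends a column

after = max_penultimate_column(toy)    # second-last column is now 'diet'

print(before, after)
```

The function itself never changed; only the dataframe underneath it did, which is what makes this kind of bug hard to spot.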

This kind of thing can lead to pandas frustration...

<img src='images/panda.gif' width='300px' align='left'>

To avoid this, we never want to overwrite our dataframe.

We should also avoid things like:

In [None]:
chickweight_temp = chickweight.copy()
chickweight_temp['weight2'] = chickweight_temp['weight'] * 2

OK, so we didn't overwrite our dataframe, but eventually we will end up with too many versions of it, which is not ideal and will become confusing in a long notebook.

***So what's the answer?***

Well first let's read our data in once more so we're back to the original:

In [None]:
chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)

<a id='assign'></a>
## Using `assign()` to create new columns
We can tell pandas to make a new column with `.assign()`. **How** the column is calculated is determined by a lambda that we pass into the method.

In [None]:
(
    chickweight
    .assign(weight2=lambda df: df['weight'] * 2)
).head()

Note that we have not changed our original dataframe.

In [None]:
chickweight.head()
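As a quick check of that behaviour, here is a minimal sketch on a made-up dataframe (the names `toy` and `doubled` are only for illustration). `.assign()` returns a new dataframe and leaves the one it was called on untouched:

```python
import pandas as pd

# Made-up data, standing in for chickweight.
toy = pd.DataFrame({'weight': [42, 51, 59]})

# .assign() returns a new dataframe with the extra column.
doubled = toy.assign(weight2=lambda df: df['weight'] * 2)

print(list(toy.columns))      # the original columns only
print(list(doubled.columns))  # the original columns plus weight2
```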

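Want to assign two columns at the same time? No problem! Here we also filter the rows first with `.loc`, and the lambdas make sure each new column is calculated from the filtered dataframe.

In [None]:
(
    chickweight
    .loc[lambda df: df['weight'] < 45]
    .assign(weight2=lambda df: df['weight'] * 2,
            weight3=lambda df: df['weight'] * 3)
).head()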
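A lambda passed to `.assign()` receives the dataframe as it stands at that point in the chain, so after a `.loc` filter it only sees the filtered rows. The sketch below uses a made-up dataframe (names and values are illustrative). It also shows a later keyword building on a column created earlier in the same call, which should work in recent pandas versions but is worth treating as an assumption on older ones:

```python
import pandas as pd

# Made-up data, standing in for chickweight.
toy = pd.DataFrame({'weight': [42, 51, 40, 59]})

result = (
    toy
    .loc[lambda df: df['weight'] < 45]
    .assign(
        # df here is the filtered dataframe, not the original toy
        weight2=lambda df: df['weight'] * 2,
        # weight4 is built from weight2, created one keyword earlier
        weight4=lambda df: df['weight2'] * 2,
    )
)

print(result)
```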

**Assuming that the chick weights are in grams, can you add a column that gives the chickweights in kg?**

Note that we can also drop columns if required!

In [None]:
(
    chickweight
    .drop(columns = ['rownum', 'time'])
).head()

<a id='shifting'></a>

## Shifting a column

The `.shift()` method of a pandas series shifts the values in a column down (positive argument) or up (negative argument), leaving `NaN` in the positions that are vacated.

In [None]:
(
    chickweight
    .assign(shifted_time=lambda df: df['time'].shift(5))
).head(14)
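To see the direction of the shift on something smaller, here is a minimal sketch with a made-up series (`times` is an illustrative name). The `fill_value` argument is an optional extra that replaces the `NaN` left behind:

```python
import pandas as pd

# Made-up series of timepoints.
times = pd.Series([0, 2, 4, 6])

down = times.shift(1)                  # values move down one row; first row becomes NaN
up = times.shift(-1)                   # values move up one row; last row becomes NaN
filled = times.shift(1, fill_value=0)  # same as down, but the gap is filled with 0

print(pd.DataFrame({'times': times, 'down': down, 'up': up, 'filled': filled}))
```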

Imagine you wanted to investigate the rate of growth for different diets. You may need the previous weight of a chick at each timepoint.

What **issue** can you see below?

In [None]:
(
    chickweight
    .assign(previous_weight=lambda df: df['weight'].shift())
).head(14)

Let's use `.groupby()` to solve this issue so the shifts only happen for each chicken separately.

In [None]:
(
    chickweight
    .assign(previous_weight=lambda df: df.groupby('chick')['weight'].shift())
).head(14)
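The difference between the two approaches is easiest to see side by side. This is a minimal sketch on a made-up dataframe with two chicks (all values are invented for illustration):

```python
import pandas as pd

# Made-up data: three rows for chick 1, two rows for chick 2.
toy = pd.DataFrame({
    'chick': [1, 1, 1, 2, 2],
    'weight': [42, 51, 59, 40, 49],
})

result = toy.assign(
    # Plain shift: chick 2's first row inherits chick 1's last weight.
    plain=lambda df: df['weight'].shift(),
    # Grouped shift: each chick's first row has no previous weight.
    per_chick=lambda df: df.groupby('chick')['weight'].shift(),
)

print(result)
```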

Can we get rid of the NaNs?

In [None]:
# drop rows with NaN in
(
    chickweight
    .assign(previous_weight=lambda df: df.groupby('chick')['weight'].shift())
    .dropna()
).head(14)

In [None]:
# We don't always want to drop rows with NaN, as we would lose vital data.
# How about instead we set the previous weight to the current weight in every first record:
(
    chickweight
    .assign(previous_weight=lambda df: df.groupby('chick')['weight'].shift().fillna(df['weight']))
).head(14)
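The trade-off between the two options can be sketched on a made-up dataframe (values are illustrative). Using `subset=` with `.dropna()` is an optional refinement, assumed here so that only missing values in the new column cause a row to be dropped:

```python
import pandas as pd

# Made-up data: two rows per chick.
toy = pd.DataFrame({
    'chick': [1, 1, 2, 2],
    'weight': [42, 51, 40, 49],
})

with_prev = toy.assign(
    previous_weight=lambda df: df.groupby('chick')['weight'].shift()
)

# Option 1: drop the rows where previous_weight is missing.
dropped = with_prev.dropna(subset=['previous_weight'])

# Option 2: keep every row and fall back to the current weight.
filled = with_prev.assign(
    previous_weight=lambda df: df['previous_weight'].fillna(df['weight'])
)

print(len(dropped), len(filled))
```

Dropping loses each chick's first record, while filling keeps every row at the cost of a first "previous" weight that is really the current one.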

**Can you create a new column called `differences` that contains the difference between the chick's current weight and its previous weight?**

<a id='other-verbs'></a>

## Other Verbs

The following methods can also be useful:
<a id='rename'></a>

### Cumulative counts

After a `.groupby()`, we can use the `.cumcount()` method to get a cumulative count of the rows of a particular group.

In [None]:
(
    chickweight
    .groupby('chick').cumcount()
).head(15)

We can add this information to a column using the `.assign()` method.

In [None]:
(
    chickweight
    .assign(rank=lambda df: df.groupby('chick').cumcount() + 1)
).head(15)

We can then refer to this new column in a later method using a lambda function...

For example, say we only want to look at the records that come after each chick's 5th measurement:

In [None]:
(
    chickweight
    .assign(rank=lambda df: df.groupby('chick').cumcount() + 1)
    .loc[lambda df: df['rank'] > 5]
).head(15)
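Here is the same pattern as a minimal sketch on a made-up dataframe (values are illustrative, and the threshold is lowered to 1 so the toy data has something left after filtering):

```python
import pandas as pd

# Made-up data: three rows for chick 1, two rows for chick 2.
toy = pd.DataFrame({
    'chick': [1, 1, 1, 2, 2],
    'weight': [42, 51, 59, 40, 49],
})

ranked = toy.assign(
    # cumcount() starts at 0 within each group, so add 1 for a 1-based rank
    rank=lambda df: df.groupby('chick').cumcount() + 1
)

# The lambda in .loc sees the dataframe that already contains 'rank'.
later = ranked.loc[lambda df: df['rank'] > 1]

print(later)
```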

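### Renaming columns

The `.rename()` method accepts either a function that is applied to every column name, or a dictionary that maps old names to new ones.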

In [None]:
(
    chickweight
    .rename(str.upper, axis="columns")
    .head(3)
)

In [None]:
(
    chickweight
    .rename({"chick": "chicken_id"}, axis="columns")
    .head(3)
)
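A minimal sketch of the dictionary form on a made-up dataframe (names and values are illustrative). It shows the equivalent `columns=` keyword, and that a key which does not match any column is ignored by default rather than raising an error:

```python
import pandas as pd

# Made-up data with the same column names as part of chickweight.
toy = pd.DataFrame({'chick': [1], 'diet': [1]})

# These two calls are equivalent ways of renaming a column.
by_axis = toy.rename({'chick': 'chicken_id'}, axis='columns')
by_keyword = toy.rename(columns={'chick': 'chicken_id'})

# A misspelt key ('chik') matches nothing, so the names stay as they were.
typo = toy.rename(columns={'chik': 'chicken_id'})

print(list(by_axis.columns), list(typo.columns))
```

Because a misspelt name passes silently, it is worth checking the resulting column names after a rename.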

<a id='drop'></a>
### Dropping duplicate information

When we select just a few columns from a table, we can be left with repeated information.

In [None]:
(
    chickweight
    [['chick', 'diet']]
).head()

The `.drop_duplicates()` method keeps only one copy of each repeated row:

In [None]:
(
    chickweight
    [['chick', 'diet']]
    .drop_duplicates()
).head()
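A minimal sketch on a made-up dataframe (values are illustrative) shows which rows survive. By default the first occurrence of each combination is kept along with its original index label, and the optional `subset=` argument restricts which columns are compared:

```python
import pandas as pd

# Made-up data: two rows per chick, each chick on a single diet.
toy = pd.DataFrame({
    'chick': [1, 1, 2, 2],
    'diet': [1, 1, 2, 2],
    'weight': [42, 51, 40, 49],
})

# One row per distinct (chick, diet) pair.
pairs = toy[['chick', 'diet']].drop_duplicates()

# Keep all columns, but only the first row seen for each chick.
first_rows = toy.drop_duplicates(subset=['chick'])

print(pairs)
print(first_rows)
```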

<a id='dead-chickens'></a>
## Assignment

<img src="images/assignment.png" width="240" height="240" align="center"/>

### 1. Find the dead chickens

There are some chickens that died prematurely. Find them! 

Hint: use `describe` to find some clues and use `groupby` to get your answer

Can you also find which diet they were on?

<a id='fattest-chicken'></a>
### 2. Find the fattest chicken per diet

Hint: use `groupby`

In [None]:
# %load answers/dead-chickens.py

In [None]:
# %load answers/fattest-chicken.py