# Data Transformations with Pandas in Python - Creating Data

In this notebook we will create some new data. We will look at three ways to create and use new data:
1. **Replacing data** - using the function `.replace()` we will change multiple values inside a column
2. **Categorizing data** - using the function `pd.cut()` we will put numerical data in different categories (bins)
3. **Getting more data out of timestamp data** - using the attributes in `.dt` we will access all kinds of time-related data

We will do these operations on the simple CSV file `createData.csv`. Run the cell below to load and check this data.

In [None]:
import pandas as pd
createData = pd.read_csv('createData.csv')
createData.timestamp = pd.to_datetime(createData.timestamp)
createData

## 1. Replacing data
You can create a new column from an existing column by replacing values with the method `.replace()`. You supply _what_ you want to replace and _with what_. Each of these can be a single value or a list of values.

For example, creating a column in which the station name `'arbaminch'` is replaced with `'AM'`:

In [None]:
other_name = createData.station.replace('arbaminch', 'AM')
other_name

Or replacing all month _numbers_ with month _names_:

In [None]:
# Creating a list with the numbers 1 to 12 (the end value 13 is not included)
numbers = list(range(1, 13))

# Creating list with month names
names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Replacing numbers with names
month_names = createData.Month.replace(numbers, names)
month_names
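The two-list replacement above can also be written with a dictionary, which keeps each old value next to its new value. The following is a minimal sketch on a small made-up Series (the `colours` data is hypothetical and not part of `createData.csv`). Values that are not in the dictionary are left unchanged.

```python
import pandas as pd

# Hypothetical toy data, not taken from createData.csv
colours = pd.Series(['r', 'g', 'b', 'g', 'k'])

# Each dictionary key is replaced by its value; 'k' has no entry and stays as it is
colour_names = colours.replace({'r': 'red', 'g': 'green', 'b': 'blue'})
print(colour_names.tolist())
```

This prints `['red', 'green', 'blue', 'green', 'k']`.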

**Exercise**: Create a column in which the names `arbaminch`, `hawassa` and `sodo` (of the column `createData.station`) are replaced by the abbreviations `am`, `ha` and `so`.

In [None]:
# Write your code for making a column with new station names here.

## 2. Categorizing data

We can put numerical data in different categories (_bins_) with the function `pd.cut()`. For `pd.cut()`, we need to know three things:
- What to categorize
- The edges of the bins
- The labels of the bins

For example, creating a column in which the months are divided into two _bins_: `first_half` and `second_half`:
- What to categorize: the column `createData.Month`
- The edges of the bins: `[0, 6, 12]`
- The labels of the bins: `first_half` and `second_half`

Additionally, if the categories have no meaningful order, it is best to pass `ordered=False`. This is also needed when the same label is used for more than one bin.

In [None]:
half_years = pd.cut(createData.Month, bins=[0, 6, 12], labels=['first_half', 'second_half'], ordered=False)
half_years
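To see how the bin edges are interpreted, here is a small sketch on made-up hour values (the `hours_of_day` Series and the labels `night` and `day` are assumptions for illustration only). By default, `pd.cut()` includes the right edge of each bin and excludes the left edge, which is why the month example starts at `0` rather than `1`. Because the label `night` is used for two bins, `ordered=False` is passed, which recent pandas versions require for duplicate labels.

```python
import pandas as pd

# Hypothetical hours of the day, chosen to sit on and around the bin edges
hours_of_day = pd.Series([0, 5, 6, 7, 18, 19, 23])

# Bins are (-1, 6], (6, 18] and (18, 23]: the right edge is included, the left edge is not
part_of_day = pd.cut(
    hours_of_day,
    bins=[-1, 6, 18, 23],
    labels=['night', 'day', 'night'],  # duplicate label, so ordered=False is needed
    ordered=False,
)
print(part_of_day.tolist())
```

This prints `['night', 'night', 'night', 'day', 'day', 'night', 'night']`: hour 6 still falls in the first bin, and hour 18 still falls in the second.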

You can add this data directly to the original DataFrame (`createData['half_years'] = half_years`), and use it, for example, in methods like `.groupby()`.
To get the average rainfall for `first_half` and `second_half`:

In [None]:
createData['rainfall'].groupby(by=half_years).mean()
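The same calculation can be done after storing the bins as a column, so that you can group by the column name. This is a minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical stand-ins for `createData`). Passing `observed=False` explicitly keeps categories without any rows in the result and avoids a warning in newer pandas versions.

```python
import pandas as pd

# Hypothetical toy data standing in for createData
toy = pd.DataFrame({
    'Month': [1, 4, 7, 10],
    'rainfall': [10.0, 30.0, 50.0, 20.0],
})

# Store the bins as a new column of the DataFrame
toy['half_years'] = pd.cut(
    toy['Month'], bins=[0, 6, 12],
    labels=['first_half', 'second_half'], ordered=False,
)

# Group by the new column and average the rainfall per group
half_year_mean = toy.groupby('half_years', observed=False)['rainfall'].mean()
print(half_year_mean)
```

With these values the mean is 20.0 for `first_half` and 35.0 for `second_half`.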

**Exercise**: Create a column with the seasons (Kiremt, Bega, Belg), based on the column `createData.Month`.
- What to categorize: the column `createData.Month`
- The edges of the bins: `[0, 1, 5, 9, 12]`
- The labels of the bins: `['Bega', 'Belg', 'Kiremt', 'Bega']` (or choose other edges and corresponding seasons if you think it is different)
- Because the label `'Bega'` is used for two bins, include `ordered=False`

In [None]:
# Write your code for creating the variable seasons, based on categorized month data, here.

## 3. Getting more data out of timestamp data
- If you have a column consisting of timestamps, you can get all kinds of time-related data.
    - N.B.: timestamps are not strings, but actual datetime objects!
    - Remember: if you have strings, you can turn those into timestamps with the function `pd.to_datetime()`
- For example, you can get a column with only the hours, or only the days, or only the years. This is useful input for functions like `.groupby()` or `pd.cut()`.
- You access that time data through `.dt.attributename` on your column with timestamp data. For example:
    - `.dt.year` for the years
    - `.dt.month` for the months
    - `.dt.hour` for the hours
    - `.dt.isocalendar().week` for the week number (the older `.dt.weekofyear` has been removed in pandas 2.0)
    - Etc...
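The accessors listed above can be tried out on a small example. This is a minimal sketch using two made-up timestamps (the `stamps` Series is hypothetical and not part of the course data).

```python
import pandas as pd

# Two hypothetical timestamps, created from strings with pd.to_datetime()
stamps = pd.Series(pd.to_datetime(['2021-01-04 08:30', '2021-07-15 17:45']))

print(stamps.dt.year.tolist())               # [2021, 2021]
print(stamps.dt.month.tolist())              # [1, 7]
print(stamps.dt.hour.tolist())               # [8, 17]
print(stamps.dt.isocalendar().week.tolist()) # [1, 28]
```

Each accessor returns a new Series of the same length as the timestamp column, so the result can be stored as a column or passed straight to `.groupby()`.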

Let's say we have PM2.5 air quality data (run the cell below to load it).

In [None]:
# Loading the data
pm25Data = pd.read_csv('pm25Data.csv')

# Creating timestamps out of the datetime strings
pm25Data.Datetime = pd.to_datetime(pm25Data.Datetime, dayfirst=True)
pm25Data
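The argument `dayfirst=True` in the cell above matters when a date string is ambiguous. A small sketch with a made-up string (the value of `ambiguous` is an assumption for illustration and not taken from `pm25Data.csv`):

```python
import pandas as pd

# Hypothetical date string: 3 April or 4 March?
ambiguous = '03/04/2021 14:00'

day_first = pd.to_datetime(ambiguous, dayfirst=True)  # read as day/month/year
month_first = pd.to_datetime(ambiguous)               # default: month/day/year

print(day_first.month, month_first.month)
```

This prints `4 3`: with `dayfirst=True` the string is read as 3 April, without it as 4 March.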

We would like to see the daily trend: one value per hour of the day (24 values in total, each being the average of all measurements taken in that hour: 0, 1, 2, ...). We do have timestamp data (column `Datetime`), but to group by hour of the day we need a column that contains only the hour. We can create this column by using `.dt.hour` on our column with timestamps.

In [None]:
hours = pm25Data.Datetime.dt.hour
print(hours)

The column `hours` is a suitable input for the method `.groupby()`.

In [None]:
hour_average = pm25Data.PM25.groupby(by=hours).mean()
hour_average

**Exercise**: Show the yearly trend, by calculating the average per month. 

Steps:
- Create a column of month numbers with `pm25Data.Datetime.dt.month`
- Use this column in the function `.groupby()`

In [None]:
# Write your code for calculating the average per month here

**Exercise**: Calculate the seasonal average.

Steps:
- Create a column of month numbers with `pm25Data.Datetime.dt.month` (or use the one from the previous exercise)
- Use this column as input for the function `pd.cut()`, to get as output the seasons (in other words: _categorize_ the month numbers)
- Use the created seasons in the function `.groupby()`

In [None]:
# Write your code for calculating the average per season here.