# Tidying data for analysis
Here, you'll learn about the principles of tidy data and, more importantly, why you should care about them and how they make subsequent data analysis more efficient. You'll gain first-hand experience with reshaping and tidying your data using techniques such as pivoting and melting.

# 1. Tidy data
## 1.1 Recognizing tidy data
For data to be tidy, it must have:

* Each variable as a separate column.
* Each row as a separate observation.

As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.

In this exercise, two example datasets have been pre-loaded into the DataFrames `df1` and `df2`. Only one of them is tidy. Your job is to explore these further in the IPython Shell and identify the one that is not tidy, and why it is not tidy.

In the rest of this course, you will frequently be asked to explore the structure of DataFrames in the IPython Shell prior to performing different operations on them. Doing this will not only strengthen your comprehension of the data cleaning concepts covered in this course, but will also help you realize and take advantage of the relationship between working in the Shell and in the script.

### Possible Answers:
1. df2 - the rows are not all separate observations.
2. df1 - each variable is not a separate column.
3. df2 - each variable is not a separate column.
4. df1 - the rows are not all separate observations.

In [1]:
import pandas as pd
import numpy as np

df1 = pd.read_csv('_datasets/airquality.csv')
df2 = pd.read_csv('_datasets/airquality2.csv')

In [2]:
df1.head(10)

,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5
5,28.0,,14.9,66,5,6
6,23.0,299.0,8.6,65,5,7
7,19.0,99.0,13.8,59,5,8
8,8.0,19.0,20.1,61,5,9
9,,194.0,8.6,69,5,10


In [3]:
df1.tail(10)

,Ozone,Solar.R,Wind,Temp,Month,Day
143,13.0,238.0,12.6,64,9,21
144,23.0,14.0,9.2,71,9,22
145,36.0,139.0,10.3,81,9,23
146,7.0,49.0,10.3,69,9,24
147,14.0,20.0,16.6,63,9,25
148,30.0,193.0,6.9,70,9,26
149,,145.0,13.2,77,9,27
150,14.0,191.0,14.3,75,9,28
151,18.0,131.0,8.0,76,9,29
152,20.0,223.0,11.5,68,9,30


In [4]:
df2.head(10)

,Month,Day,variable,value
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,
5,5,6,Ozone,28.0
6,5,7,Ozone,23.0
7,5,8,Ozone,19.0
8,5,9,Ozone,8.0
9,5,10,Ozone,


In [5]:
df2.tail(10)

,Month,Day,variable,value
602,9,21,Temp,64.0
603,9,22,Temp,71.0
604,9,23,Temp,81.0
605,9,24,Temp,69.0
606,9,25,Temp,63.0
607,9,26,Temp,70.0
608,9,27,Temp,77.0
609,9,28,Temp,75.0
610,9,29,Temp,76.0
611,9,30,Temp,68.0


__Answer:__ Notice that the `variable` column of `df2` contains the values `Solar.R`, `Ozone`, `Temp`, and `Wind`. For it to be tidy, these should all be in separate columns, as in `df1`. (3)
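
One quick way to spot this kind of untidiness is to list the distinct values of a suspect column. The sketch below uses a small made-up frame (`toy_long`, a hypothetical stand-in and not the course data) laid out like `df2`. If a column's values read like variable names, the data is probably in long form.

```python
import pandas as pd

# Hypothetical miniature of df2: one row per (Month, Day, variable) triple.
toy_long = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 1, 2, 2],
    "variable": ["Ozone", "Wind", "Ozone", "Wind"],
    "value": [41.0, 7.4, 36.0, 8.0],
})

# Values that look like variable names suggest the column stores variables, not data.
variable_names = sorted(toy_long["variable"].unique())
print(variable_names)

# Each (Month, Day) observation is spread over several rows instead of one.
rows_per_day = toy_long.groupby(["Month", "Day"]).size()
print(rows_per_day)
```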

## 1.2 Reshaping your data using melt
Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables `Ozone`, `Solar.R`, `Wind`, and `Temp` each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: the best shape for your data depends on what you want to do with it. A melted layout, for example, can make it easier to plot values.

In this exercise, you will practice melting a DataFrame using `pd.melt()`. There are two parameters you should be aware of: `id_vars` and `value_vars`. The `id_vars` represent the columns of the data you __do not__ want to melt (i.e., keep them in their current shape), while the `value_vars` represent the columns you __do__ wish to melt into rows. By default, if no `value_vars` are provided, all columns not set in the `id_vars` will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.
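
To make the difference between the two parameters concrete, here is a minimal sketch on a made-up two-row frame (`toy` is hypothetical and only borrows the `airquality` column names). It compares melting with an explicit `value_vars` list against the default behaviour.

```python
import pandas as pd

# Hypothetical two-day sample with the same column names as airquality.
toy = pd.DataFrame({
    "Ozone": [41.0, 36.0],
    "Wind": [7.4, 8.0],
    "Temp": [67, 72],
    "Month": [5, 5],
    "Day": [1, 2],
})

# Melt only Ozone and Wind; Temp is neither an id nor a value column, so it is dropped.
melted_some = pd.melt(toy, id_vars=["Month", "Day"], value_vars=["Ozone", "Wind"])

# Omit value_vars to melt every column that is not listed in id_vars.
melted_all = pd.melt(toy, id_vars=["Month", "Day"])

print(melted_some.shape)  # (4, 4): 2 days x 2 melted columns
print(melted_all.shape)   # (6, 4): 2 days x 3 melted columns
```

Note that a column left out of both lists silently disappears from the result, so an explicit `value_vars` is worth double-checking.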

The (tidy) DataFrame `airquality` has been pre-loaded. Your job is to melt its `Ozone`, `Solar.R`, `Wind`, and `Temp` columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.

### Instructions:
* Print the head of `airquality`.
* Use `pd.melt()` to melt the `Ozone`, `Solar.R`, `Wind`, and `Temp` columns of `airquality` into rows. Do this by using `id_vars` to specify the columns you __do not__ wish to melt: `'Month'` and `'Day'`.
* Print the head of `airquality_melt`.

In [6]:
airquality = pd.read_csv('_datasets/airquality.csv')

# Print the head of airquality
airquality.head()

,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [7]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
airquality_melt.head()

,Month,Day,variable,value
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


This exercise demonstrates that melting a DataFrame is not always appropriate if you want to make it tidy. You may have to perform other transformations depending on how your data is represented.

## 1.3 Customizing melted data
When melting DataFrames, it would be better to have column names more meaningful than `variable` and `value` (the default names used by `pd.melt()`).

The default names may work in certain situations, but it's best to always have data that is self-explanatory.

You can rename the `variable` column by specifying an argument to the `var_name` parameter, and the `value` column by specifying an argument to the `value_name` parameter. You will now practice doing exactly this. 

### Instructions:
* Print the head of `airquality`.
* Melt the columns of `airquality` with the default `variable` column renamed to `'measurement'` and the default `value` column renamed to `'reading'`. You can do this by specifying, respectively, the `var_name` and `value_name` parameters.
* Print the head of `airquality_melt`.

In [8]:
# Print the head of airquality
airquality.head()

,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [9]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

# Print the head of airquality_melt
airquality_melt.head()

,Month,Day,measurement,reading
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


The DataFrame is more informative now. In the next section, you'll learn about pivoting, which is the opposite of melting. You'll then be able to convert this DataFrame back into its original, tidy form!

# 2. Pivoting data
## 2.1 Pivot data
Pivoting data is the opposite of melting it. Remember the tidy form that the `airquality` DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the `.pivot_table()` method!

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

`.pivot_table()` has an `index` parameter which you can use to specify the columns that you __don't__ want pivoted: It is similar to the `id_vars` parameter of `pd.melt()`. Two other parameters that you have to specify are `columns` (the name of the column you want to pivot), and `values` (the values to be used when the column is pivoted). The melted DataFrame `airquality_melt` has been pre-loaded for you.
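
As a rough illustration of how the three parameters fit together, the sketch below pivots a made-up long-form frame (`toy_melt`, a hypothetical stand-in shaped like `airquality_melt`).

```python
import pandas as pd

# Hypothetical long-form sample shaped like airquality_melt.
toy_melt = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 1, 2, 2],
    "measurement": ["Ozone", "Wind", "Ozone", "Wind"],
    "reading": [41.0, 7.4, 36.0, 8.0],
})

# index: columns kept as row labels
# columns: column whose unique values become the new column headers
# values: column whose entries fill the new cells
toy_pivot = toy_melt.pivot_table(
    index=["Month", "Day"], columns="measurement", values="reading"
)

print(toy_pivot)
```

Each unique value of `measurement` becomes its own column, and `Month` and `Day` move into the row index.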

### Instructions:
* Print the head of `airquality_melt`.
* Pivot `airquality_melt` by using `.pivot_table()` with the rows indexed by `'Month'` and `'Day'`, the columns indexed by `'measurement'`, and the values populated with `'reading'`.
* Print the head of `airquality_pivot`.

In [10]:
# Print the head of airquality_melt
airquality_melt.head()

,Month,Day,measurement,reading
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


In [11]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = pd.pivot_table(airquality_melt, index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
airquality_pivot.head()

measurement,,Ozone,Solar.R,Temp,Wind
Month,Day,,,,
5,1,41.0,190.0,67.0,7.4
5,2,36.0,118.0,72.0,8.0
5,3,12.0,149.0,74.0,12.6
5,4,18.0,313.0,62.0,11.5
5,5,,,56.0,14.3


Notice that the pivoted DataFrame does not actually look like the original DataFrame. In the next exercise, you'll turn this pivoted DataFrame back into its original form.

## 2.2 Resetting the index of a DataFrame
After pivoting `airquality_melt` in the previous exercise, you didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a [hierarchical index (also known as a MultiIndex)](http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

Hierarchical indexes are covered in depth in "Manipulating DataFrames with pandas". In essence, they allow you to group columns or rows by another variable - in this case, by `'Month'` as well as `'Day'`.

There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: `.reset_index()`. Dan didn't show you how to use this method in the video, but you're now going to practice using it in this exercise to get back the original DataFrame from `airquality_pivot`, which has been pre-loaded.

### Instructions:
* Print the index of `airquality_pivot` by accessing its `.index` attribute. This has been done for you.
* Reset the index of `airquality_pivot` using its `.reset_index()` method.
* Print the new index of `airquality_pivot`.
* Print the head of `airquality_pivot`.

In [12]:
# Print the index of airquality_pivot
airquality_pivot.index

MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 

In [13]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
airquality_pivot_reset.index

RangeIndex(start=0, stop=153, step=1)

In [14]:
# Print the head of airquality_pivot_reset
airquality_pivot_reset.head()

measurement,Month,Day,Ozone,Solar.R,Temp,Wind
0,5,1,41.0,190.0,67.0,7.4
1,5,2,36.0,118.0,72.0,8.0
2,5,3,12.0,149.0,74.0,12.6
3,5,4,18.0,313.0,62.0,11.5
4,5,5,,,56.0,14.3


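You've now converted the DataFrame back into a flat form that closely matches the original! The columns appear in a different order and `Temp` is now stored as a float, but the data is the same.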
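
The reset DataFrame still carries `measurement` as the name of its columns axis, which is why that label shows up in the header row above. If this matters for your comparison, one possible cleanup is sketched below on a made-up frame (`toy_melt` and the chosen column order are assumptions for illustration).

```python
import pandas as pd

# Hypothetical long-form sample shaped like airquality_melt.
toy_melt = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 1, 2, 2],
    "measurement": ["Ozone", "Wind", "Ozone", "Wind"],
    "reading": [41.0, 7.4, 36.0, 8.0],
})

toy_flat = toy_melt.pivot_table(
    index=["Month", "Day"], columns="measurement", values="reading"
).reset_index()

# The columns axis keeps the name of the pivoted column.
print(toy_flat.columns.name)

# Drop the axis name, then select the columns in the order you want.
toy_clean = toy_flat.rename_axis(None, axis=1)[["Ozone", "Wind", "Month", "Day"]]
print(toy_clean.columns.name)
print(toy_clean)
```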

## 2.3 Pivoting duplicate values
So far, you've used the `.pivot_table()` method when there are multiple `index` values you want to hold constant during a pivot. In the video, Dan showed you how you can also use pivot tables to deal with duplicate values by providing an aggregation function through the `aggfunc` parameter. Here, you're going to combine both these uses of pivot tables.

Let's say your data collection method accidentally duplicated your dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as `airquality_dup`. In addition, the `airquality_melt` DataFrame from the previous exercise has been pre-loaded. Explore their shapes in the IPython Shell by accessing their `.shape` attributes to confirm the duplicate rows present in `airquality_dup`.

You'll see that by using `.pivot_table()` and the `aggfunc` parameter, you can not only reshape your data, but also remove duplicates. Finally, you can then flatten the columns of the pivoted DataFrame using `.reset_index()`.

### Instructions:
* Pivot `airquality_dup` by using `.pivot_table()` with the rows indexed by `'Month'` and `'Day'`, the columns indexed by `'measurement'`, and the values populated with `'reading'`. Use `np.mean` for the aggregation function.
* Print the head of `airquality_pivot`.
* Flatten `airquality_pivot` by resetting its index.
* Print the head of `airquality_pivot` and then the original `airquality` DataFrame to compare their structure.

In [15]:
airquality_dup = pd.concat([airquality_melt, airquality_melt])

In [16]:
# Pivot airquality_dup: airquality_pivot
airquality_pivot = pd.pivot_table(airquality_dup, index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean)

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
airquality_pivot.head()

measurement,Month,Day,Ozone,Solar.R,Temp,Wind
0,5,1,41.0,190.0,67.0,7.4
1,5,2,36.0,118.0,72.0,8.0
2,5,3,12.0,149.0,74.0,12.6
3,5,4,18.0,313.0,62.0,11.5
4,5,5,,,56.0,14.3


In [17]:
# Print the head of airquality
airquality.head()

,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


The default aggregation function used by `.pivot_table()` is `np.mean()`. So you could have pivoted the duplicate values in this DataFrame even without explicitly specifying the `aggfunc` parameter.
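
The sketch below checks this on a made-up frame (`toy_melt` and `toy_dup` are hypothetical stand-ins for the course data). It duplicates every row and then pivots without passing `aggfunc`. Because the mean of two identical readings is the reading itself, the duplicates collapse without changing any value.

```python
import pandas as pd

# Hypothetical long-form sample shaped like airquality_melt.
toy_melt = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 1, 2, 2],
    "measurement": ["Ozone", "Wind", "Ozone", "Wind"],
    "reading": [41.0, 7.4, 36.0, 8.0],
})

# Duplicate every row, as in airquality_dup.
toy_dup = pd.concat([toy_melt, toy_melt])

# No aggfunc given: pivot_table falls back to the mean of each group.
toy_dedup = toy_dup.pivot_table(
    index=["Month", "Day"], columns="measurement", values="reading"
)

print(toy_dup.shape)    # (8, 4)
print(toy_dedup.shape)  # (2, 2)
```

This only works as deduplication when the repeated rows really are identical. If the duplicated readings differed, the mean would quietly blend them.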

# 3. Beyond melt and pivot
## 3.1 Splitting a column with .str
The dataset you saw in the video, consisting of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame as `tb`.

In this exercise, you're going to tidy the `'m014'` column, which represents males aged 0-14 years. In order to parse this value, you need to extract the first letter into a new column for `gender`, and the rest into a column for `age_group`. Here, since you can parse values by position, you can take advantage of pandas' vectorized string slicing by using the `str` attribute of columns of type `object`.
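
As a small sketch of positional slicing through `.str`, the example below uses a made-up Series (`codes` is hypothetical; its entries follow the naming pattern of the `tb` columns).

```python
import pandas as pd

# Hypothetical codes following the pattern <gender letter><age group>.
codes = pd.Series(["m014", "f1524", "mu"])

# .str[...] applies ordinary Python slicing to every element.
gender = codes.str[0]
age_group = codes.str[1:]

print(gender.tolist())     # ['m', 'f', 'm']
print(age_group.tolist())  # ['014', '1524', 'u']
```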

Begin by printing the columns of `tb` in the IPython Shell using its `.columns` attribute, and take note of the problematic column.

### Instructions:
* Melt `tb` keeping `'country'` and `'year'` fixed.
* Create a `'gender'` column by slicing the first letter of the `variable` column of `tb_melt`.
* Create an `'age_group'` column by slicing the rest of the `variable` column of `tb_melt`.
* Print the head of `tb_melt`. This has been done for you, so hit 'Submit Answer' to see the results!

In [18]:
tb = pd.read_csv("_datasets/tb.csv")

tb.head()

,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014,f1524,f2534,f3544,f4554,f5564,f65,fu
0,AD,2000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,,,
1,AE,2000,2.0,4.0,4.0,6.0,5.0,12.0,10.0,,3.0,16.0,1.0,3.0,0.0,0.0,4.0,
2,AF,2000,52.0,228.0,183.0,149.0,129.0,94.0,80.0,,93.0,414.0,565.0,339.0,205.0,99.0,36.0,
3,AG,2000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,
4,AL,2000,2.0,19.0,21.0,14.0,24.0,19.0,16.0,,3.0,11.0,10.0,8.0,8.0,5.0,11.0,


In [19]:
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
tb_melt.head()

,country,year,variable,value,gender,age_group
0,AD,2000,m014,0.0,m,014
1,AE,2000,m014,2.0,m,014
2,AF,2000,m014,52.0,m,014
3,AG,2000,m014,0.0,m,014
4,AL,2000,m014,2.0,m,014


In [20]:
tb_melt.tail()

,country,year,variable,value,gender,age_group
3211,YE,2000,fu,,f,u
3212,YU,2000,fu,,f,u
3213,ZA,2000,fu,,f,u
3214,ZM,2000,fu,,f,u
3215,ZW,2000,fu,,f,u


Notice the new `'gender'` and `'age_group'` columns you created. It is vital to be able to split columns as needed so you can access the data that is relevant to your question.

## 3.2 Splitting a column with .split() and .get()
Another common way multiple variables are stored in columns is with a delimiter. You'll learn how to deal with such cases in this exercise, using a [dataset consisting of Ebola cases and death counts by state and country](https://data.humdata.org/dataset/ebola-cases-2014). It has been pre-loaded into a DataFrame as `ebola`.

Print the columns of `ebola` in the IPython Shell using `ebola.columns`. Notice that the data has column names such as `Cases_Guinea` and `Deaths_Guinea`. Here, the underscore `_` serves as a delimiter between the first part (cases or deaths), and the second part (country).

This time, you cannot directly slice the variable by position as in the previous exercise. You now need to use Python's built-in string method called `.split()`. By default, this method splits a string on whitespace. However, in this case you want it to split on an underscore. You can do this on the string `'Cases_Guinea'`, for example, using `'Cases_Guinea'.split('_')`, which returns the list `['Cases', 'Guinea']`.

The next challenge is to extract the first element of this list and assign it to a `type` variable, and the second element of the list to a `country` variable. You can accomplish this by accessing the `str` attribute of the column and using the `.get()` method to retrieve the `0` or `1` index, depending on the part you want.
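
The two steps can be sketched on a made-up Series (`names` is hypothetical; its entries only mimic the `ebola` column names). The last lines show `expand=True`, an alternative that this exercise does not use and that returns the pieces as separate columns in one call.

```python
import pandas as pd

# Hypothetical labels following the pattern <type>_<country>.
names = pd.Series(["Cases_Guinea", "Deaths_Liberia"])

# Step 1: split each string on the underscore; every element becomes a list.
parts = names.str.split("_")

# Step 2: pull one position out of every list with .str.get().
kind = parts.str.get(0)
country = parts.str.get(1)

print(kind.tolist())     # ['Cases', 'Deaths']
print(country.tolist())  # ['Guinea', 'Liberia']

# Alternative: expand=True returns a DataFrame with one column per piece.
split_df = names.str.split("_", expand=True)
print(split_df)
```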

### Instructions:
* Melt `ebola` using `'Date'` and `'Day'` as the `id_vars`, `'type_country'` as the `var_name`, and `'counts'` as the `value_name`.
* Create a column called `'str_split'` by splitting the `'type_country'` column of `ebola_melt` on `'_'`. Note that you will first have to access the `str` attribute of `type_country` before you can use `.split()`.
* Create a column called `'type'` by using the `.get()` method to retrieve index `0` of the `'str_split'` column of `ebola_melt`.
* Create a column called `'country'` by using the `.get()` method to retrieve index `1` of the `'str_split'` column of `ebola_melt`.
* Print the head of `ebola_melt`. This has been done for you, so hit 'Submit Answer' to view the results!

In [21]:
ebola=pd.read_csv('_datasets/ebola.csv')

In [22]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
ebola_melt.head()

,Date,Day,type_country,counts,str_split,type,country
0,1/5/2015,289,Cases_Guinea,2776.0,"[Cases, Guinea]",Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,"[Cases, Guinea]",Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,"[Cases, Guinea]",Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,"[Cases, Guinea]",Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,"[Cases, Guinea]",Cases,Guinea
