### Tidying data

For data to be tidy, each variable must be a separate column and each row must be a separate observation.

#### Melting data

Melting data is the process of turning columns of your data into rows of data. If we want the variables stored in columns to appear in rows instead, we can melt the DataFrame. Doing this to a DataFrame that is already tidy, however, makes the data untidy!

In this exercise, we will practice melting a DataFrame using `pd.melt()`. There are two parameters
we should be aware of: `id_vars` and `value_vars`. The `id_vars` represent the columns of the data we
do not want to melt (i.e., the columns to keep in their current shape), while the `value_vars` represent the
columns we do wish to melt into rows. By default, if no `value_vars` are provided, all columns
not set in the `id_vars` will be melted. This can save a bit of typing, depending on the number
of columns that need to be melted.
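To make the difference between explicit and default `value_vars` concrete, here is a minimal sketch on a small made-up DataFrame. The frame `toy` and its values are assumptions for illustration only, not part of the airquality data.

```python
import pandas as pd

# Hypothetical toy data (not the airquality dataset)
toy = pd.DataFrame({
    'Month': [5, 5],
    'Day': [1, 2],
    'Ozone': [41.0, 36.0],
    'Wind': [7.4, 8.0],
    'Temp': [67, 72],
})

# Explicit value_vars: only Ozone and Wind are melted, Temp is left out
explicit = pd.melt(toy, id_vars=['Month', 'Day'], value_vars=['Ozone', 'Wind'])

# Default value_vars: every column not listed in id_vars is melted
default = pd.melt(toy, id_vars=['Month', 'Day'])

print(explicit)
print(default['variable'].unique())
```

With explicit `value_vars`, columns that are named in neither list (here `Temp`) do not appear in the result at all, so the default is usually the safer choice when every remaining column should be kept.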

In [16]:
import pandas as pd

# Assign url of file: url
url = 'https://raw.githubusercontent.com/wblakecannon/DataCamp/master/10-cleaning-data-in-python/_datasets/airquality.csv'

# Read file into a DataFrame: airquality
airquality = pd.read_csv(url)

# Print the head of the DataFrame
print(airquality.head())

   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5


In [20]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())

   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN


#### Customizing melted data

When melting DataFrames, it is better to have column names that are more meaningful than `variable`
and `value`.

We can rename the `variable` column by passing an argument to the `var_name` parameter, and the
`value` column by passing an argument to the `value_name` parameter.

In [25]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())

   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN


### Pivot data
Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame
was in before we melted it? We'll now begin pivoting it back into that form using the
`.pivot_table()` method!

While melting takes a set of columns and turns it into a single column, pivoting will create a
new column for each unique value in a specified column.

`.pivot_table()` has an `index` parameter which we can use to specify the columns that we don't
want pivoted. It is similar to the `id_vars` parameter of `pd.melt()`. Two other parameters that
we have to specify are `columns` (the name of the column we want to pivot) and `values` (the
values to be used when the column is pivoted).
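As a small illustration of how `index`, `columns`, and `values` work together, the sketch below pivots a made-up long-format frame. The name `long_toy` and its contents are hypothetical and only loosely mirror the melted airquality data.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Month, Day, measurement)
long_toy = pd.DataFrame({
    'Month': [5, 5, 5, 5],
    'Day': [1, 1, 2, 2],
    'measurement': ['Ozone', 'Wind', 'Ozone', 'Wind'],
    'reading': [41.0, 7.4, 36.0, 8.0],
})

# Month and Day stay as row labels; each unique measurement becomes a column
wide_toy = long_toy.pivot_table(index=['Month', 'Day'],
                                columns='measurement',
                                values='reading')

print(wide_toy)
```

Each unique value in `measurement` becomes its own column, and the `reading` values fill the cells.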

In [36]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())

measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3


### Resetting the index of a DataFrame
After pivoting airquality_melt in the previous exercise, we didn't quite get back the original
DataFrame.
What we got back instead was a pandas DataFrame with a hierarchical index (also known as a
MultiIndex).

In essence, hierarchical indexes allow you to group columns or rows by another variable, in this case by 'Month' as
well as 'Day'.
There's a very simple method we can use to get back the original DataFrame from the pivoted
DataFrame: `.reset_index()`.
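The effect of `.reset_index()` is easiest to see on a tiny frame. The sketch below builds a made-up DataFrame with a two-level index by hand; the names `pivoted_toy` and `flat_toy` are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical frame with a (Month, Day) MultiIndex, similar in shape to a pivot result
idx = pd.MultiIndex.from_tuples([(5, 1), (5, 2)], names=['Month', 'Day'])
pivoted_toy = pd.DataFrame({'Ozone': [41.0, 36.0], 'Wind': [7.4, 8.0]}, index=idx)

# reset_index() moves every index level back into ordinary columns
flat_toy = pivoted_toy.reset_index()

print(pivoted_toy.index.nlevels)   # number of index levels before the reset
print(flat_toy.columns.tolist())   # Month and Day are now regular columns
print(flat_toy.index)              # a plain RangeIndex after the reset
```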


In [37]:
# Print the index of airquality
print(airquality.index)

# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the new index of airquality_pivot.
print(airquality_pivot.index)

# Print the head of airquality_pivot
print(airquality_pivot.head())

RangeIndex(start=0, stop=153, step=1)
MultiIndex([(5,  1),
            (5,  2),
            (5,  3),
            (5,  4),
            (5,  5),
            (5,  6),
            (5,  7),
            (5,  8),
            (5,  9),
            (5, 10),
            ...
            (9, 21),
            (9, 22),
            (9, 23),
            (9, 24),
            (9, 25),
            (9, 26),
            (9, 27),
            (9, 28),
            (9, 29),
            (9, 30)],
           names=['Month', 'Day'], length=153)
RangeIndex(start=0, stop=153, step=1)
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3


### Pivoting duplicate values

We can also use pivot tables to deal with duplicate values by providing an aggregation function
through the `aggfunc` parameter.

By using `.pivot_table()` with the `aggfunc` parameter, we can both reshape the data and collapse
duplicate rows into a single aggregated value.
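A minimal sketch of this idea, using a made-up frame in which every row appears twice. The name `dup_toy` and its values are hypothetical, and the string `'mean'` is used here as the aggregation; it requests the same average as passing `np.mean`.

```python
import pandas as pd

# Hypothetical long data in which each (Month, Day, measurement) row is duplicated
dup_toy = pd.DataFrame({
    'Month': [5, 5, 5, 5],
    'Day': [1, 1, 1, 1],
    'measurement': ['Ozone', 'Ozone', 'Wind', 'Wind'],
    'reading': [41.0, 41.0, 7.4, 7.4],
})

# Duplicates that share the same index/column combination are averaged into one cell
collapsed = dup_toy.pivot_table(index=['Month', 'Day'],
                                columns='measurement',
                                values='reading',
                                aggfunc='mean')

print(dup_toy.shape)
print(collapsed.shape)
```

Because the duplicated rows carry identical readings, their mean equals the original value. If duplicates disagreed, the mean would blend them, so the choice of aggregation function matters in that case.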

In [43]:
# Assign url of file: url
url = 'https://raw.githubusercontent.com/ksatola/Data-Science-Notes/master/data/airquality_dup.csv'

# Read file into a DataFrame: airquality_dup
airquality_dup = pd.read_csv(url, index_col=0)

# Print the head of the DataFrame
print(airquality_dup.head())

   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN


In [50]:
import numpy as np

# Pivot table the airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], 
                                              columns='measurement', 
                                              values='reading', 
                                              aggfunc=np.mean)

# Print the head of airquality_pivot before reset_index
print(airquality_pivot.head())

# Print the head of airquality_dup
print(airquality_dup.head())

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

print(airquality_dup.shape)
print(airquality_pivot.shape)

measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
(1224, 4)
(153, 6)


### Splitting a column with .str

The dataset consists of case counts of tuberculosis by country, year, gender, and age group.

In this exercise, you're going to tidy the 'm014' column, which represents males aged 0-14 years. In order to parse this value, you need to extract the first letter into a new column for gender, and the rest into a column for age_group. Since you can parse these values by position, you can take advantage of pandas' vectorized string slicing by using the `str` attribute of columns of type object.
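Positional slicing through `.str` can be sketched on a small made-up Series first. The name `codes` and its three values are assumptions chosen to resemble the column names in the tuberculosis data.

```python
import pandas as pd

# Hypothetical codes in the same style as the tb column names
codes = pd.Series(['m014', 'f1524', 'm65'])

# .str[0] takes the first character of every element
gender = codes.str[0]

# .str[1:] takes everything after the first character
age_group = codes.str[1:]

print(gender.tolist())     # ['m', 'f', 'm']
print(age_group.tolist())  # ['014', '1524', '65']
```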

In [52]:
# Assign url of file: url
url = 'https://raw.githubusercontent.com/wblakecannon/DataCamp/master/10-cleaning-data-in-python/_datasets/tb.csv'

# Read file into a DataFrame: tb
tb = pd.read_csv(url)

# Print the head of the DataFrame
print(tb.head())


  country  year  m014  m1524  m2534  m3544  m4554  m5564   m65  mu  f014  \
0      AD  2000   0.0    0.0    1.0    0.0    0.0    0.0   0.0 NaN   NaN   
1      AE  2000   2.0    4.0    4.0    6.0    5.0   12.0  10.0 NaN   3.0   
2      AF  2000  52.0  228.0  183.0  149.0  129.0   94.0  80.0 NaN  93.0   
3      AG  2000   0.0    0.0    0.0    0.0    0.0    0.0   1.0 NaN   1.0   
4      AL  2000   2.0   19.0   21.0   14.0   24.0   19.0  16.0 NaN   3.0   

   f1524  f2534  f3544  f4554  f5564   f65  fu  
0    NaN    NaN    NaN    NaN    NaN   NaN NaN  
1   16.0    1.0    3.0    0.0    0.0   4.0 NaN  
2  414.0  565.0  339.0  205.0   99.0  36.0 NaN  
3    1.0    1.0    0.0    0.0    0.0   0.0 NaN  
4   11.0   10.0    8.0    8.0    5.0  11.0 NaN  


In [53]:
# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())

  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014


### Splitting a column with `.split()` and `.get()`
Another common way multiple variables are stored in columns is with a delimiter.
The Ebola dataset used here has
column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore _ serves as a delimiter
between the first part (cases or deaths) and the second part (country).
By default, Python's `.split()` method splits a string into parts separated by whitespace. In this case, however, you want to split on
an underscore. You can do this on the string 'Cases_Guinea', for example, using `'Cases_Guinea'.split('_')`,
which returns the list `['Cases', 'Guinea']`.
The next challenge is to extract the first element of this list and assign it to a type variable,
and the second element of the list to a country variable. You can accomplish this by accessing
the `str` attribute of the column and using the `.get()` method to retrieve the 0 or 1 index, depending
on the part you want.
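The split-then-get pattern can be tried on a short made-up Series before applying it to the melted data. The names `names`, `parts`, `kind`, and `country` below are hypothetical and used only for this sketch.

```python
import pandas as pd

# Plain Python string split on an underscore
print('Cases_Guinea'.split('_'))  # ['Cases', 'Guinea']

# Hypothetical Series of column-style names
names = pd.Series(['Cases_Guinea', 'Deaths_Liberia'])

# .str.split('_') applies the split to every element, giving a list per row
parts = names.str.split('_')

# .str.get(i) pulls element i out of each list
kind = parts.str.get(0)
country = parts.str.get(1)

print(kind.tolist())     # ['Cases', 'Deaths']
print(country.tolist())  # ['Guinea', 'Liberia']
```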


In [55]:
# Assign url of file: url
url = 'https://raw.githubusercontent.com/wblakecannon/DataCamp/master/10-cleaning-data-in-python/_datasets/ebola.csv'

# Read file into a DataFrame: ebola
ebola = pd.read_csv(url)

# Print the head of the DataFrame
print(ebola.head())


         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0          

In [56]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
ebola_melt.head()

         Date  Day  type_country  counts
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0


In [57]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
ebola_melt.head()

         Date  Day  type_country  counts        str_split
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]


In [58]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
ebola_melt.head()

         Date  Day  type_country  counts        str_split   type
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases


In [61]:
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
ebola_melt.head()

         Date  Day  type_country  counts        str_split   type  country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases   Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases   Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases   Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases   Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases   Guinea
