# Tidy Data

Here, you'll learn about the principles of tidy data and, more importantly, why you should care about them and how they make subsequent data analysis more efficient. You'll gain first-hand experience with reshaping and tidying your data using techniques such as pivoting and melting.

Hadley Wickham, PhD, formalized a way to describe the shape of data. Tidy data provides a standard way to organize data values within a dataset. There are many reasons to prefer a standard approach to cleaning data, and the Tidy Data paper is worth reading to better understand how the shape of data fits into the various components of data analysis.

http://vita.had.co.nz/papers/tidy-data.html

#### 3 Principles of Tidy Data
- Columns represent separate variables
- Rows represent individual observations
- Observational units form tables (not covered in this chapter)


Some formats are better for reporting, and others are better for analysis. If we aim to make our data tidy during the data cleaning process, we can fix common data problems along the way and quickly transform the data into different shapes as needed.

Wickham defines the data problem we are trying to fix as columns containing values instead of variables.
- pandas melt function, pd.melt()

In [3]:
import pandas as pd

# df is assumed to be an existing DataFrame with the columns
# 'name', 'treatment_a', and 'treatment_b'
# frame: the DataFrame to melt
# id_vars: columns to hold constant; here the names of people stay fixed
# value_vars: columns to melt into rows
# The output DataFrame has a variable column and a value column for each name
# var_name: renames the variable column
# value_name: renames the value column, so the column names are more meaningful

pd.melt(frame=df, id_vars='name',
        value_vars=['treatment_a', 'treatment_b'],
        var_name='treatment',
        value_name='result')
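
The cell above needs pandas imported and a DataFrame named df to exist before it can run. As a minimal sketch, assuming a small made-up table of people and treatment results (the names and numbers below are illustrative, not from the course data), the same call could look like this:

```python
import pandas as pd

# Hypothetical data: one row per person, one column per treatment.
# The treatment columns hold values (results), not variables.
df = pd.DataFrame({
    'name': ['Daniel', 'John', 'Jane'],
    'treatment_a': [None, 12, 24],
    'treatment_b': [42, 31, 27],
})

# Keep 'name' fixed and stack the two treatment columns into rows.
melted = pd.melt(frame=df, id_vars='name',
                 value_vars=['treatment_a', 'treatment_b'],
                 var_name='treatment',
                 value_name='result')

print(melted)
```

Each person now appears once per treatment, with the treatment label in one column and the measured result in another.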

# Recognizing tidy data

For data to be tidy, it must have:

- Each variable as a separate column.
- Each row as a separate observation.


As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.

In this exercise, two example datasets have been pre-loaded into the DataFrames df1 and df2. Only one of them is tidy. Your job is to explore these further in the IPython Shell and identify the one that is not tidy, and why it is not tidy.

In the rest of this course, you will frequently be asked to explore the structure of DataFrames in the IPython Shell prior to performing different operations on them. Doing this will not only strengthen your comprehension of the data cleaning concepts covered in this course, but will also help you realize and take advantage of the relationship between working in the Shell and in the script.
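
The contents of df1 and df2 are not reproduced in these notes. As a rough sketch of what to look for, the two frames below are hypothetical stand-ins (invented values): one with a column per variable, and one where a single column mixes several variables. Checking whether the identifying columns repeat is one possible quick test, not the only one.

```python
import pandas as pd

# Hypothetical tidy frame: each column is one variable,
# each row is one observation (one day).
tidy = pd.DataFrame({
    'Ozone': [41.0, 36.0],
    'Wind': [7.4, 8.0],
    'Month': [5, 5],
    'Day': [1, 2],
})

# Hypothetical untidy frame: the 'variable' column mixes two different
# variables, so one day's observation is spread over several rows.
untidy = pd.DataFrame({
    'Month': [5, 5, 5, 5],
    'Day': [1, 2, 1, 2],
    'variable': ['Ozone', 'Ozone', 'Wind', 'Wind'],
    'value': [41.0, 36.0, 7.4, 8.0],
})

# Quick structural checks one might run in a shell.
print(tidy.head())
print(untidy['variable'].unique())

# In the tidy frame each (Month, Day) pair appears once; in the untidy
# frame it repeats once per measured variable.
tidy_dupes = tidy.duplicated(subset=['Month', 'Day']).any()
untidy_dupes = untidy.duplicated(subset=['Month', 'Day']).any()
print(tidy_dupes, untidy_dupes)
```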

#### Reshaping your data using melt

Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: the right shape depends on what you need to do with the data, and a melted, untidy shape can sometimes be more convenient (e.g., it could make it easier to plot values).

In this exercise, you will practice melting a DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., the columns that keep their current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.
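
To illustrate the default behavior of value_vars, here is a small sketch using a made-up two-row frame with airquality-style column names (the frame called sample is hypothetical and stands in for the real data):

```python
import pandas as pd

# Small hypothetical frame with the same kind of columns as airquality.
sample = pd.DataFrame({
    'Ozone': [41.0, 36.0],
    'Wind': [7.4, 8.0],
    'Month': [5, 5],
    'Day': [1, 2],
})

# Explicit value_vars: only the listed columns are melted.
explicit = pd.melt(sample, id_vars=['Month', 'Day'],
                   value_vars=['Ozone', 'Wind'])

# No value_vars: every column not named in id_vars is melted.
implicit = pd.melt(sample, id_vars=['Month', 'Day'])

# Both calls melt the same columns here, because Ozone and Wind are
# the only columns left once Month and Day are held fixed.
print(explicit)
print(implicit)
```

Listing value_vars explicitly is still useful when only some of the remaining columns should be melted.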

The (tidy) DataFrame airquality has been pre-loaded. Your job is to melt its Ozone, Solar.R, Wind, and Temp columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.

This exercise demonstrates that melting a DataFrame is not always appropriate if you want to make it tidy. You may have to perform other transformations depending on how your data is represented.



In [8]:
import pandas as pd

airquality = pd.read_csv('./data/airquality.csv')

In [9]:
airquality.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [10]:
# melt into rows

# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())


   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN


In [7]:
airquality_melt.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,variable,value
0,41.0,190.0,7.4,67,Month,5
1,36.0,118.0,8.0,72,Month,5
2,12.0,149.0,12.6,74,Month,5
3,18.0,313.0,11.5,62,Month,5
4,,,14.3,56,Month,5






In [12]:
# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())


   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN


The DataFrame is more informative now: the melted columns are named measurement and reading instead of the default variable and value.

In the next video, you'll learn about pivoting, which is the opposite of melting. You'll then be able to convert this DataFrame back into its original, tidy, form!



# Pivoting & Pivot Table