# Tidy Data

### Objectives

+ Explain what tidy data is
+ Spot messy data
+ Transform a simple messy dataset into a tidy data set
+ Master the reshaping methods **`melt`** and **`pivot`**

### Tidy Data
Tidy data is a term coined by Hadley Wickham, the creator of many useful R packages, to describe a structure of data that makes data analysis easier. It is highly recommended that you read [his paper](http://vita.had.co.nz/papers/tidy-data.pdf) to get a fuller understanding of tidy data. The basics will be covered below.

Following Wickham's definition, a dataset is tidy when:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

Any dataset that does not meet this definition is considered messy. 

### First example of messy data
Messy data can appear deceptively clean and tidy, especially if you have not been exposed to it before. The table below holds the weight of three kinds of fruit recorded for each of three states.

In [1]:
import pandas as pd

In [2]:
# looks so nice and clean!
df = pd.DataFrame(data=[['Texas', 12, 10, 40], 
                        ['Arizona', 9, 7, 12], 
                        ['Florida', 0, 14, 190]], 
                  columns=['State', 'Apple', 'Orange', 'Banana'])
df

Unnamed: 0,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


### What's wrong?
The main issue with the above dataset is that three of the column names (Apple, Orange and Banana) are actually values of a variable, not variable names.

### What are the variable names?
Only one of the variable names is actually part of the DataFrame above. You must infer the others from the context of the problem. The variables are:
+ State
+ Types of fruit 
+ Weight of fruit

### Actual Tidying
To tidy, we simply need to make sure the three tidy rules are followed. Let's start with forcing each variable into a column.

The types of fruit are column names and need to be transposed to a column.

The weights of the fruit are scattered across a three-by-three block of cells and need to be gathered into a single column.

### Melting the data
Pandas contains a flexible DataFrame method named **`melt`**. It accepts several parameters, two of which matter most:

+ **`id_vars`** - a list of column names that you want to keep as columns.
+ **`value_vars`** - a list of column names that you would like to move into one column

This 'moving' into one column is usually referred to as 'melting' or 'stacking'. The **`id_vars`** columns stay where they are, but their values repeat so that each one lines up with every newly stacked value from the **`value_vars`** columns.

In [3]:
df_melt = df.melt(id_vars='State', 
                  value_vars=['Apple', 'Orange', 'Banana'])
df_melt

Unnamed: 0,State,variable,value
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190
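The row count above follows from simple arithmetic: three states times three melted columns gives nine rows. The sketch below rebuilds the same DataFrame and checks that count. It also illustrates that, as far as the pandas documentation describes it, leaving out **`value_vars`** melts every column not listed in **`id_vars`**. The names `melted_default`, `melted_explicit`, `n_rows` and `same` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[['Texas', 12, 10, 40],
                        ['Arizona', 9, 7, 12],
                        ['Florida', 0, 14, 190]],
                  columns=['State', 'Apple', 'Orange', 'Banana'])

# Omitting value_vars melts every column that is not in id_vars
melted_default = df.melt(id_vars='State')
melted_explicit = df.melt(id_vars='State',
                          value_vars=['Apple', 'Orange', 'Banana'])

# 3 states x 3 fruit columns = 9 rows
n_rows = len(melted_default)

# Both calls should produce the same long DataFrame
same = melted_default.equals(melted_explicit)
print(n_rows, same)
```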


### Renaming with `melt`
**`melt`** has two other handy parameters, **`var_name`** and **`value_name`**, that let you name the melted column and the value column.

In [4]:
df_melt = df.melt(id_vars='State', 
                  value_vars=['Apple', 'Orange', 'Banana'],
                  var_name='Fruit', 
                  value_name='Weight')
df_melt

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190
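One way to see why the tidy layout makes analysis easier is to aggregate it. The sketch below rebuilds the melted DataFrame and answers two questions with ordinary column operations: the total weight per fruit and the single heaviest state/fruit combination. The names `total_by_fruit` and `heaviest` are illustrative, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame(data=[['Texas', 12, 10, 40],
                        ['Arizona', 9, 7, 12],
                        ['Florida', 0, 14, 190]],
                  columns=['State', 'Apple', 'Orange', 'Banana'])

df_melt = df.melt(id_vars='State', var_name='Fruit', value_name='Weight')

# With one column per variable, grouping is a single call
total_by_fruit = df_melt.groupby('Fruit')['Weight'].sum()

# The row holding the largest weight, with its state and fruit attached
heaviest = df_melt.loc[df_melt['Weight'].idxmax()]

print(total_by_fruit)
print(heaviest)
```

In the wide layout the same questions would require treating column names as data, which is exactly the problem tidying removes.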


# Your Turn

### Problem 1
<span  style="color:green; font-size:16px">There are three columns with actor names in them. Reshape the data so that you may count the frequency of all actors together regardless of their original column.</span>

In [5]:
movie = pd.read_csv('data/movie.csv')

In [13]:
# your code here
movie.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [27]:
movie_melt = movie.melt(id_vars=['movie_title'],
                        value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name'],
                        var_name='actor_seq',
                        value_name='actor')

movie_melt.sort_values(['movie_title','actor_seq']).head(10)

Unnamed: 0,movie_title,actor_seq,actor
4349,#Horror,actor_1_name,Timothy Hutton
9265,#Horror,actor_2_name,Balthazar Getty
14181,#Horror,actor_3_name,Lydia Hearst
3629,10 Cloverfield Lane,actor_1_name,Bradley Cooper
8545,10 Cloverfield Lane,actor_2_name,John Gallagher Jr.
13461,10 Cloverfield Lane,actor_3_name,Sumalee Montano
2964,10 Days in a Madhouse,actor_1_name,Christopher Lambert
7880,10 Days in a Madhouse,actor_2_name,Kelly LeBrock
12796,10 Days in a Madhouse,actor_3_name,Alexandra Callas
2799,10 Things I Hate About You,actor_1_name,Joseph Gordon-Levitt


### Problem 2
<span  style="color:green; font-size:16px">There are three columns with actor Facebook likes in them. Reshape the data and then sum up all the actor Facebook likes for the entire dataset.</span>

In [28]:
# your code here
movie.head()


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [39]:
# melt all three likes columns (not actor_1 three times)
movie_fb_melt = movie.melt(id_vars=['movie_title'],
                           value_vars=['actor_1_facebook_likes',
                                       'actor_2_facebook_likes',
                                       'actor_3_facebook_likes'],
                           var_name='actor_like_seq',
                           value_name='actor_likes')

# total likes across all three actor columns
movie_fb_melt['actor_likes'].sum()

42922570.0

### Problem 3
<span  style="color:green; font-size:16px">Tidy the dataset in the **`employee_messy1.csv`** file. It contains the count of all employees by race and gender.</span>

In [45]:
# your code here
employee1 = pd.read_csv('data/employee_messy1.csv')


FileNotFoundError: File b'data/employee_messy1.csv' does not exist

### Problem 4
<span  style="color:green; font-size:16px">Tidy the dataset in the **`employee_messy2.csv`** file. It contains the count of all employees by department, race and gender.</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Use the **`pivot`** method to reverse **`df_melt`** (from above) back to its original DataFrame.</span>

In [None]:
# your code here

# Solutions

### Problem 1
<span  style="color:green; font-size:16px">There are three columns with actor names in them. Reshape the data so that you may count the frequency of all actors together regardless of their original column.</span>

In [40]:
actors = movie[['actor_1_name', 'actor_2_name', 'actor_3_name']]
actors.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name
0,CCH Pounder,Joel David Moore,Wes Studi
1,Johnny Depp,Orlando Bloom,Jack Davenport
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,Doug Walker,Rob Walker,


In [41]:
actor_melt = actors.melt(value_name='name')
actor_melt.head()

Unnamed: 0,variable,name
0,actor_1_name,CCH Pounder
1,actor_1_name,Johnny Depp
2,actor_1_name,Christoph Waltz
3,actor_1_name,Tom Hardy
4,actor_1_name,Doug Walker


In [42]:
actor_melt['name'].value_counts().head(10)

Robert De Niro    53
Morgan Freeman    43
Bruce Willis      38
Matt Damon        37
Steve Buscemi     36
Johnny Depp       36
Brad Pitt         33
Nicolas Cage      33
Liam Neeson       32
Will Ferrell      32
Name: name, dtype: int64

### Problem 2
<span  style="color:green; font-size:16px">There are three columns with actor Facebook likes in them. Reshape the data and then sum up all the actor Facebook likes for the entire dataset.</span>

In [43]:
movie[['actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']].melt()['value'].sum()

42922570.0
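As a sanity check on this approach, the melted total should equal the sum of the individual column sums, with missing values skipped in both cases. The sketch below uses a small hypothetical DataFrame as a stand-in for the three likes columns, since `movie.csv` is not reproduced here; the values and the name `likes` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the three likes columns of movie.csv
likes = pd.DataFrame({'actor_1_facebook_likes': [1000.0, 40000.0, np.nan],
                      'actor_2_facebook_likes': [936.0, 5000.0, 12.0],
                      'actor_3_facebook_likes': [855.0, 1000.0, np.nan]})

# Melt into one long 'value' column, then sum it (NaN is skipped by default)
total_melted = likes.melt()['value'].sum()

# Sum each column first, then add the three column totals
total_wide = likes.sum().sum()

print(total_melted, total_wide)
```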

### Problem 3
<span  style="color:green; font-size:16px">Tidy the dataset in the **`employee_messy1.csv`** file. It contains the count of all employees by race and gender.</span>

In [None]:
em = pd.read_csv('data/employee_messy1.csv')
em.melt(id_vars='RACE', value_vars=['Female', 'Male'], var_name='GENDER', value_name='COUNT')

### Problem 4
<span  style="color:green; font-size:16px">Tidy the dataset in the **`employee_messy2.csv`** file. It contains the count of all employees by department, race and gender.</span>

In [None]:
em2 = pd.read_csv('data/employee_messy2.csv')
em2.head()

In [None]:
em2.melt(id_vars=['DEPARTMENT', 'GENDER'], var_name='RACE', value_name='COUNT').head(10)
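Because the contents of `employee_messy2.csv` are not shown here, the following sketch uses a tiny hypothetical DataFrame to illustrate melting with two **`id_vars`**. The column names `RACE_A` and `RACE_B` and all counts are invented placeholders; the real file may look different.

```python
import pandas as pd

# Hypothetical data; the real employee_messy2.csv may have different columns
em2_toy = pd.DataFrame({'DEPARTMENT': ['Police', 'Police', 'Fire', 'Fire'],
                        'GENDER': ['Female', 'Male', 'Female', 'Male'],
                        'RACE_A': [3, 5, 1, 4],
                        'RACE_B': [2, 6, 2, 7]})

# Every column not named in id_vars is stacked into RACE / COUNT
tidy = em2_toy.melt(id_vars=['DEPARTMENT', 'GENDER'],
                    var_name='RACE',
                    value_name='COUNT')

# 4 original rows x 2 melted columns = 8 rows
print(tidy.shape)
print(tidy)
```

Each (department, gender) pair repeats once per melted column, so no counts are lost in the reshape.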

### Problem 5
<span  style="color:green; font-size:16px">Use the **`pivot`** method to reverse **`df_melt`** (from above) back to its original DataFrame.</span>

In [None]:
df_melt.pivot(index='State', columns='Fruit', values='Weight')
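Note that **`pivot`** on its own returns a DataFrame with `State` as the index, the fruit columns in sorted order, and `Fruit` as the name of the column index. If an exact copy of the original layout is wanted, a few extra steps are needed. The sketch below is one possible way to do that, not the only one; the names `wide` and `restored` are introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[['Texas', 12, 10, 40],
                        ['Arizona', 9, 7, 12],
                        ['Florida', 0, 14, 190]],
                  columns=['State', 'Apple', 'Orange', 'Banana'])

df_melt = df.melt(id_vars='State', var_name='Fruit', value_name='Weight')

# Wide again, but indexed by State with alphabetically sorted columns
wide = df_melt.pivot(index='State', columns='Fruit', values='Weight')

# Restore the original row order and column order, then move State
# back into a regular column
restored = wide.loc[df['State'].tolist(),
                    ['Apple', 'Orange', 'Banana']].reset_index()

# Drop the leftover 'Fruit' label on the column index
restored.columns.name = None

print(restored)
```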