In [1]:
import pandas as pd
import numpy as np

**Data tidying** is the process of structuring datasets so that they are easy to analyze.

### The Goal of Tidy Data:
- data is tabular, i.e. made up of rows and columns
- there is one value per cell
- each variable is a column
- each observation is a row

#### General Ideas:
- each variable is a characteristic of an observation
- if the units are the same, maybe they should be in the same column
- if one column has measurements of different units, it should be spread out
- should you be able to group by some of the columns? If so, combine them into one column
- can I pass this data to seaborn?
- can we ask interesting questions and answer them by a group by? Generally we don't want to be taking row or column averages.
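
As a minimal sketch of the last point, the example below uses a small made-up table (the `bills` frame and its values are hypothetical, not taken from the `tips` dataset). Because each variable sits in its own column and each row is one observation, the question "what is the average tip per day?" is a single group by:

```python
import pandas as pd

# Hypothetical tidy data: one row per observation (a single bill),
# one column per variable.
bills = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun'],
    'time': ['Lunch', 'Dinner', 'Lunch', 'Dinner'],
    'tip': [2.0, 4.0, 3.0, 5.0],
})

# Group the observations by one variable and aggregate another.
avg_tip_by_day = bills.groupby('day')['tip'].mean()
print(avg_tip_by_day)
```

If the same data stored one tip column per day, answering this would need a column average instead, which is a hint that the data is not tidy.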

In [2]:
# example
from pydataset import data
data('tips').head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


### Tools for this:

#### Reshaping the data:
- wide data -> long data format (melt)
- long data -> wide data format (pivot_table, unstack)
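
`unstack` is listed above but not demonstrated in this notebook, so here is a rough sketch of one way it can take long data back to wide. The `long_df` frame and its values are made up for illustration; the idea is to move the identifying columns into the index and then unstack the innermost level into columns:

```python
import pandas as pd

# Hypothetical long data: one row per (id, variable) pair.
long_df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'variable': ['a', 'b', 'a', 'b'],
    'value': [10, 20, 30, 40],
})

# Put id and variable in the index, then move the innermost index
# level ('variable') into the columns.
wide_df = long_df.set_index(['id', 'variable'])['value'].unstack()
print(wide_df)
```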

In [3]:
np.random.seed(49)

df = pd.DataFrame({
    'id' : np.arange(1000,1003),
    'a' : np.random.randint(1,11,3),
    'b' : np.random.randint(1,11,3),
    'c' : np.random.randint(1,11,3),
})

In [4]:
df

Unnamed: 0,id,a,b,c
0,1000,9,5,6
1,1001,6,3,2
2,1002,7,6,3


Note that melting repeats each id once per melted column, so every (id, variable) pair ends up on its own row:

In [5]:
df.melt(id_vars=['id'])

Unnamed: 0,id,variable,value
0,1000,a,9
1,1001,a,6
2,1002,a,7
3,1000,b,5
4,1001,b,3
5,1002,b,6
6,1000,c,6
7,1001,c,2
8,1002,c,3


Renaming the second and third columns on the fly with var_name and value_name:

In [6]:
df.melt(id_vars=['id'], var_name='group', value_name='measure')

Unnamed: 0,id,group,measure
0,1000,a,9
1,1001,a,6
2,1002,a,7
3,1000,b,5
4,1001,b,3
5,1002,b,6
6,1000,c,6
7,1001,c,2
8,1002,c,3


**df.melt arguments**
- id_vars = columns you want to keep (they are not melted)
- var_name = name of the new column that holds the names of the melted columns
- value_name = name of the new column that holds the values
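
The sketch below puts these arguments together on a frame with the same values as the output shown earlier. It also uses `value_vars`, a `melt` parameter that this notebook does not cover, to melt only some of the columns; treat that part as an optional extra rather than something the notes above rely on.

```python
import pandas as pd

# Wide data: one row per id, one column per measurement.
wide = pd.DataFrame({
    'id': [1000, 1001, 1002],
    'a': [9, 6, 7],
    'b': [5, 3, 6],
    'c': [6, 2, 3],
})

# Melt every non-id column: 3 ids x 3 melted columns -> 9 rows.
long_df = wide.melt(id_vars=['id'], var_name='group', value_name='measure')

# Melt only columns a and b (value_vars): 3 ids x 2 columns -> 6 rows.
subset = wide.melt(id_vars=['id'], value_vars=['a', 'b'],
                   var_name='group', value_name='measure')

print(long_df.shape, subset.shape)
```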

### df.pivot_table helps us get variables into columns

In [9]:
import itertools as it

df = pd.DataFrame(it.product('ABC', ['one', 'two', 'three']), columns = ['group','subgroup'])
df['x'] = np.random.randn(df.shape[0])

In [10]:
df

Unnamed: 0,group,subgroup,x
0,A,one,0.659185
1,A,two,-0.707622
2,A,three,0.30329
3,B,one,0.542172
4,B,two,-2.937112
5,B,three,-0.06199
6,C,one,0.441694
7,C,two,0.387383
8,C,three,-0.130701


In [11]:
df.pivot_table(values='x', index='subgroup', columns ='group')

group,A,B,C
subgroup,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0.659185,0.542172,0.441694
three,0.30329,-0.06199,-0.130701
two,-0.707622,-2.937112,0.387383


**df.pivot_table arguments**
- index = column(s) you want to keep as row labels (not pivoted)
- columns = column whose values become the new column names
- values = column whose values populate the new columns
- aggfunc = how you want to aggregate rows that share the same index/columns combination (the default is the mean)
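
The example above has exactly one row per group/subgroup pair, so no aggregation is visible. As a small sketch of what `aggfunc` does, the made-up `scores` frame below (hypothetical values) contains two rows for the pair A/one:

```python
import pandas as pd

# Hypothetical data with a duplicated (group, subgroup) pair: A/one twice.
scores = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'subgroup': ['one', 'one', 'one', 'two'],
    'x': [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation is the mean of the duplicated rows.
mean_table = scores.pivot_table(values='x', index='subgroup', columns='group')

# aggfunc chooses a different aggregation, here the sum.
sum_table = scores.pivot_table(values='x', index='subgroup', columns='group',
                               aggfunc='sum')

print(mean_table)
print(sum_table)
```

Combinations that never occur in the data, such as A/two here, show up as NaN in the result.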