# 🧱 Structuring Data: Intro to Tidy Data

*A simple, powerful structure for your data.*

<center><a href="https://vita.had.co.nz/papers/tidy-data.html">https://vita.had.co.nz/papers/tidy-data.html</a></center>


![Tidy Data paper](../images/tidy-data-paper.png)

> A huge amount of effort is spent cleaning data to get it ready for analysis, but there
has been little research on how to make data cleaning as easy and effective as possible.
This paper tackles a small, but important, component of data cleaning: data tidying.
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table. 

> A dataset is a collection of __values__, usually either numbers (if quantitative) or strings (if qualitative). 

> Values are organised in two ways. Every value belongs to a __variable__ and an __observation__. 

> A __variable__ contains all values that measure the same underlying attribute (like height, temperature, duration) across units. 

> An __observation__ contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

# In "tidy data":

> - Every __column__ is a __variable__.

> - Every __row__ is an __observation__.

> - Every __cell__ is a single __value__.

Note: You want your __observational__ data tidy, but your __analyses__ can produce any "shape" of data.
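
A minimal sketch of that note, assuming a small hand-typed table. The names `observations` and `summary` and the temperature values are made up for illustration and are not part of the course data. The observational table stays tidy, while the analysis result derived from it is free to take a wide shape.

```python
import pandas as pd

# Tidy observational data: one row per (day, city) reading.
# These values are invented for illustration only.
observations = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue"],
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
    "temp_c": [3, 22, 5, 24],
})

# An analysis result can be any shape, e.g. a wide day-by-city grid.
summary = observations.pivot(index="day", columns="city", values="temp_c")
print(summary)
```

Here every column of `observations` is a variable and every row is one reading, whereas `summary` spreads the values of `city` across column headers, which is fine for a report but harder to keep analysing.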

# Examples of "messy" vs. "tidy"

In [1]:
import pandas as pd

In [2]:
students_raw_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vScPt_dOJIulsY96YvXYVVR4PZWSpJfMJKjjJowaz_P_bwRfkAHxVViNG8_mm7Dpc_44bvLO0cwkfLD/pub?gid=0&single=true&output=csv"
students_raw = pd.read_csv(students_raw_url)
students_raw

,student,quiz_1,quiz_2,essay_1,essay_2
0,Jarred,78,83,75,77
1,Richa,63,59,76,69
2,Ravi,73,70,64,94
3,Isabella,70,91,81,77
4,Janek,89,72,83,85


## Q: Is this "messy" or "tidy"?

- What is the unit of observation?
- What are the variables?

- Which student has the highest average across all assignments?
- Who got the highest single score, and on what assignment?
- Did students score higher on essays or quizzes?
- Did students score higher the second time(s), overall?
- Did scores improve/decrease more with quizzes or essays?
- Which student had the biggest gap between essay and quiz scores?

In [3]:
# Live coding
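
One possible way to attempt two of these questions directly on the wide table is sketched below. It rebuilds `students_raw` by hand from the output printed above so that it runs without the spreadsheet URL; the helper names `scores`, `student_means`, `quiz_mean`, and `essay_mean` are illustrative assumptions.

```python
import pandas as pd

# Rebuilt by hand from the table printed above, so the sketch runs offline.
students_raw = pd.DataFrame({
    "student": ["Jarred", "Richa", "Ravi", "Isabella", "Janek"],
    "quiz_1": [78, 63, 73, 70, 89],
    "quiz_2": [83, 59, 70, 91, 72],
    "essay_1": [75, 76, 64, 81, 83],
    "essay_2": [77, 69, 94, 77, 85],
})

scores = students_raw.set_index("student")

# Highest average across all assignments: a row-wise mean works here.
student_means = scores.mean(axis=1).sort_values(ascending=False)

# Essays vs. quizzes: the relevant column names must be listed by hand.
quiz_mean = scores[["quiz_1", "quiz_2"]].to_numpy().mean()
essay_mean = scores[["essay_1", "essay_2"]].to_numpy().mean()

print(student_means)
print(quiz_mean, essay_mean)
```

Each question needs its own column bookkeeping in this shape, because the assignment type and number live in the column names rather than in the data.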

## Now let's look at the tidy version:

In [4]:
students_tidy_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vScPt_dOJIulsY96YvXYVVR4PZWSpJfMJKjjJowaz_P_bwRfkAHxVViNG8_mm7Dpc_44bvLO0cwkfLD/pub?gid=509891746&single=true&output=csv"
students_tidy = pd.read_csv(students_tidy_url)
students_tidy.head(7)

,student,work_type,work_num,score
0,Jarred,quiz,1,78
1,Richa,quiz,1,63
2,Ravi,quiz,1,73
3,Isabella,quiz,1,70
4,Janek,quiz,1,89
5,Jarred,quiz,2,83
6,Richa,quiz,2,59


- Which student has the highest average across all assignments?
- Who got the highest single score, and on what assignment?
- Did students score higher on essays or quizzes?
- Did students score higher the second time(s), overall?
- Did scores improve/decrease more with quizzes or essays?
- Which student had the biggest gap between essay and quiz scores?

In [5]:
# Live coding

## Which student has the highest average across all assignments?

In [6]:
(
    students_tidy
    .groupby("student")
    ["score"].mean()
    .sort_values(ascending=False)
)

student
Janek       82.25
Isabella    79.75
Jarred      78.25
Ravi        75.25
Richa       66.75
Name: score, dtype: float64

## Who got the highest single score, and on what assignment?

In [7]:
(
    students_tidy
    .sort_values("score", ascending=False)
    .head(3)
)

,student,work_type,work_num,score
17,Ravi,essay,2,94
8,Isabella,quiz,2,91
4,Janek,quiz,1,89


## Did students score higher on essays or quizzes?

In [8]:
(
    students_tidy
    .groupby("work_type")
    ["score"]
    .mean()
)

work_type
essay    78.1
quiz     74.8
Name: score, dtype: float64

## Did students score higher the second time(s), overall?

In [9]:
(
    students_tidy
    .groupby("work_num")
    ["score"]
    .mean()
)

work_num
1    75.2
2    77.7
Name: score, dtype: float64

## Did scores improve/decrease more with quizzes or essays?

In [10]:
(
    students_tidy
    .groupby([ "work_type", "work_num" ])
    ["score"]
    .mean()
    .unstack()
    .assign(
        change = lambda df: df[2] - df[1]
    )
)

work_type,1,2,change
essay,75.8,80.4,4.6
quiz,74.6,75.0,0.4


## Which student had the biggest gap between essay and quiz scores?

In [11]:
(
    students_tidy
    .groupby([ "student", "work_type" ])
    ["score"]
    .mean()
    .unstack()
    .assign(
        diff = lambda df: df["essay"] - df["quiz"],
        diff_abs = lambda df: df["diff"].abs()
    )
    .sort_values("diff_abs", ascending=False)
)

student,essay,quiz,diff,diff_abs
Richa,72.5,61.0,11.5,11.5
Ravi,79.0,71.5,7.5,7.5
Jarred,76.0,80.5,-4.5,4.5
Janek,84.0,80.5,3.5,3.5
Isabella,79.0,80.5,-1.5,1.5


# How do we get from messy to tidy?

Generally: `.melt(...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html))

In [12]:
students_raw.melt()

,variable,value
0,student,Jarred
1,student,Richa
2,student,Ravi
3,student,Isabella
4,student,Janek
5,quiz_1,78
6,quiz_1,63
7,quiz_1,73
8,quiz_1,70
9,quiz_1,89


In [13]:
(
    students_raw
    .melt(id_vars=["student"])
)

,student,variable,value
0,Jarred,quiz_1,78
1,Richa,quiz_1,63
2,Ravi,quiz_1,73
3,Isabella,quiz_1,70
4,Janek,quiz_1,89
5,Jarred,quiz_2,83
6,Richa,quiz_2,59
7,Ravi,quiz_2,70
8,Isabella,quiz_2,91
9,Janek,quiz_2,72
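
The two calls above differ only in `id_vars`: without it, every column (including `student`) is stacked into `variable`/`value`, while listing `student` keeps it as an identifier on every row. As a small sketch of the other naming parameters of `melt`, assuming two rows copied by hand from the raw table and an illustrative column name `assignment` that is not used elsewhere in this notebook:

```python
import pandas as pd

# Two rows copied by hand from the raw table above, so the sketch runs offline.
students_raw = pd.DataFrame({
    "student": ["Jarred", "Richa"],
    "quiz_1": [78, 63],
    "quiz_2": [83, 59],
    "essay_1": [75, 76],
    "essay_2": [77, 69],
})

melted = students_raw.melt(
    id_vars=["student"],      # columns kept as identifiers on every row
    var_name="assignment",    # illustrative name for the old column headers
    value_name="score",       # name for the column holding the cell values
)
print(melted)
```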


In [14]:
students_tidy = (
    students_raw
    .melt(id_vars=["student"], value_name="score")
    .assign(
        work_type = lambda df: df["variable"].str.split("_").str.get(0),
        work_num = lambda df: df["variable"].str.split("_").str.get(1).astype(int),
    )
    .drop(columns = [ "variable" ])
    [[
        "student",
        "work_type",
        "work_num",
        "score"
    ]]
)

students_tidy.head(7)

,student,work_type,work_num,score
0,Jarred,quiz,1,78
1,Richa,quiz,1,63
2,Ravi,quiz,1,73
3,Isabella,quiz,1,70
4,Janek,quiz,1,89
5,Jarred,quiz,2,83
6,Richa,quiz,2,59
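
An alternative way to split the melted column names is sketched below, assuming every name follows the `type_number` pattern seen in this dataset. It uses `str.split(..., expand=True)` to produce both parts in one pass and stores the attempt number as an integer; the names `melted`, `parts`, and `students_tidy_alt` are illustrative, and the two-student table is copied by hand from the raw output.

```python
import pandas as pd

# Two rows copied by hand from the raw table above, so the sketch runs offline.
students_raw = pd.DataFrame({
    "student": ["Jarred", "Richa"],
    "quiz_1": [78, 63],
    "quiz_2": [83, 59],
    "essay_1": [75, 76],
    "essay_2": [77, 69],
})

melted = students_raw.melt(id_vars=["student"], value_name="score")

# Split "quiz_1" into two columns: 0 holds "quiz", 1 holds "1".
parts = melted["variable"].str.split("_", expand=True)

students_tidy_alt = (
    melted
    .assign(
        work_type=parts[0],
        work_num=parts[1].astype(int),  # keep the attempt number numeric
    )
    .drop(columns=["variable"])
    [["student", "work_type", "work_num", "score"]]
)
print(students_tidy_alt)
```

Keeping `work_num` numeric matters for later steps that select columns by the integers `1` and `2` after unstacking.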


## How do we un-tidy?

`.pivot(...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html))

In [15]:
(
    students_tidy
    .assign(
        work_id = lambda df: df["work_type"] + "_" + df["work_num"].astype(str)
    )
    .pivot(index="student", columns="work_id", values="score")
)

student,essay_1,essay_2,quiz_1,quiz_2
Isabella,81,77,70,91
Janek,83,85,89,72
Jarred,75,77,78,83
Ravi,64,94,73,70
Richa,76,69,63,59
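
The `work_id` column above is what makes each (student, column) pair unique. As a sketch of what happens without it, assuming a two-student subset copied by hand from the tidy table: `.pivot(...)` raises a `ValueError` when an index/column pair occurs more than once, whereas `.pivot_table(...)` aggregates the duplicates. The names `pivot_failed` and `by_type` are illustrative.

```python
import pandas as pd

# Two students' rows copied by hand from the tidy table, so this runs offline.
students_tidy = pd.DataFrame({
    "student": ["Jarred", "Richa"] * 4,
    "work_type": ["quiz"] * 4 + ["essay"] * 4,
    "work_num": [1, 1, 2, 2, 1, 1, 2, 2],
    "score": [78, 63, 83, 59, 75, 76, 77, 69],
})

# Each student has two quizzes and two essays, so (student, work_type)
# is not unique and .pivot() cannot place two scores in one cell.
try:
    students_tidy.pivot(index="student", columns="work_type", values="score")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# .pivot_table() aggregates the duplicates instead (here with the mean).
by_type = students_tidy.pivot_table(
    index="student", columns="work_type", values="score", aggfunc="mean"
)
print(pivot_failed)
print(by_type)
```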

