# What is tidy data?

<a href="https://vita.had.co.nz/papers/tidy-data.html">Tidy Data by Hadley Wickham</a>

<img src="img/tidy-data.png" width=700 height=700 />

<img src="img/tidy-data2.png" width=700 height=700 />

# In "tidy data"

- Every **row** is an **observation**.
- Every **column** is a **variable**.
- Every **cell** is a single **value**.

Note: We want tidy **observational data**; the output of your **analysis**, however, can take any shape.

In [1]:
import pandas as pd

In [2]:
students_raw_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSeQZ2fplhai3gyzTXeGtSTcE287R36fAqVCHFtsD7NQhSvf8TUmeo0bNBzOjAoakZ8VtByfsEn4qgx/pub?gid=0&single=true&output=csv"
students_raw = pd.read_csv(students_raw_url)
students_raw

,student,quiz_1,quiz_2,paper_1,paper_2
0,Sam,78,85,75,90
1,Sandhya,66,59,88,69
2,Ian,66,70,77,94
3,Christina,59,91,86,88
4,George,90,89,90,85


## So is this "messy" or "tidy"?

- What is the unit of observation?
- What are the variables?

In [3]:
students_tidy_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSeQZ2fplhai3gyzTXeGtSTcE287R36fAqVCHFtsD7NQhSvf8TUmeo0bNBzOjAoakZ8VtByfsEn4qgx/pub?gid=74676846&single=true&output=csv"
students_tidy = pd.read_csv(students_tidy_url)
students_tidy.head(8)

,student,work_category,work_number,score
0,Sam,quiz,1,78
1,Sandhya,quiz,1,66
2,Ian,quiz,1,66
3,Christina,quiz,1,59
4,George,quiz,1,90
5,Sam,quiz,2,85
6,Sandhya,quiz,2,59
7,Ian,quiz,2,70


## Who got the highest single score and on what assignment?

In [4]:
(
    students_tidy
    .nlargest(1, "score", keep="all")
)

,student,work_category,work_number,score
17,Ian,paper,2,94


## Which student has the highest average across all assignments?

In [5]:
(
    students_tidy
    .groupby("student")
    ["score"]
    .mean()
    .sort_values(ascending=False)
    .to_frame()
)

student,score
George,88.5
Sam,82.0
Christina,81.0
Ian,76.75
Sandhya,70.5


## Did students score higher on papers or quizzes?

In [6]:
(
    students_tidy
    .groupby("work_category")
    ["score"]
    .mean()
    .sort_values(ascending=False)
    .to_frame()
)

work_category,score
paper,84.2
quiz,75.3


## Did students score higher the second time(s), overall?

In [7]:
(
    students_tidy
    .groupby("work_number")
    ["score"]
    .mean()
    .to_frame()
)

work_number,score
1,77.5
2,82.0


## Did scores improve/decrease more with quizzes or papers?

In [8]:
(
    students_tidy
    .groupby([ "work_category", "work_number" ])
    ["score"]
    .mean()
    .unstack()
    .assign(
        change = lambda df: df[2] - df[1]
    )
)

work_category,1,2,change
paper,83.2,85.2,2.0
quiz,71.8,78.8,7.0


## Which student had the biggest gap between their average paper and quiz scores?

In [9]:
(
    students_tidy
    .groupby([ "student", "work_category" ])
    ["score"]
    .mean()
    .unstack()
    .assign(
        difference = lambda df: df["paper"] - df["quiz"],
    )
    .sort_values("difference", ascending=False)
)

student,paper,quiz,difference
Ian,85.5,68.0,17.5
Sandhya,78.5,62.5,16.0
Christina,87.0,75.0,12.0
Sam,82.5,81.5,1.0
George,87.5,89.5,-2.0


# How do we get from messy to tidy?
Generally: `.melt(...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html))

In [10]:
students_raw.melt()

,variable,value
0,student,Sam
1,student,Sandhya
2,student,Ian
3,student,Christina
4,student,George
5,quiz_1,78
6,quiz_1,66
7,quiz_1,66
8,quiz_1,59
9,quiz_1,90


The `id_vars=[...]` argument lets you **keep** one or more variables as columns, so each melted value stays **associated with the rest of the original row's data**:

In [11]:
(
    students_raw
    .melt(id_vars=["student"])
)

,student,variable,value
0,Sam,quiz_1,78
1,Sandhya,quiz_1,66
2,Ian,quiz_1,66
3,Christina,quiz_1,59
4,George,quiz_1,90
5,Sam,quiz_2,85
6,Sandhya,quiz_2,59
7,Ian,quiz_2,70
8,Christina,quiz_2,91
9,George,quiz_2,89
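
For readers who cannot reach the spreadsheet, here is a minimal sketch that rebuilds the raw table inline from the values printed earlier and melts it. The names `raw` and `long_df`, and the `var_name="assignment"` label, are illustrative choices and not part of the original notebook.

```python
import pandas as pd

# Rebuild the raw (wide) table from the values printed earlier.
raw = pd.DataFrame({
    "student": ["Sam", "Sandhya", "Ian", "Christina", "George"],
    "quiz_1": [78, 66, 66, 59, 90],
    "quiz_2": [85, 59, 70, 91, 89],
    "paper_1": [75, 88, 77, 86, 90],
    "paper_2": [90, 69, 94, 88, 85],
})

# "student" is kept as a column; every other column is stacked into
# (assignment, score) pairs. var_name / value_name replace the default
# "variable" / "value" column labels.
long_df = raw.melt(id_vars=["student"], var_name="assignment", value_name="score")

# 5 students x 4 assignments = 20 rows, one per score.
print(long_df.shape)
print(long_df.head())
```

Naming the melted columns up front with `var_name` and `value_name` can save a later `.rename(...)` step.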


You'll often also want to `.str.split(...)` to break compound values (like `quiz_2`) into their components (`quiz` and `2`):

In [12]:
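students_tidy = (
    students_raw
    # Stack the four score columns into (variable, score) pairs, one row per score.
    .melt(id_vars=["student"], value_name="score")
    .assign(
        # "quiz_2" -> work_type "quiz", work_num 2 (split pieces are strings, so cast the number)
        work_type = lambda df: df["variable"].str.split("_").str.get(0),
        work_num = lambda df: df["variable"].str.split("_").str.get(1).astype(int),
    )
    .drop(columns = [ "variable" ])
    [[
        "student",
        "work_type",
        "work_num",
        "score"
    ]]
)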

students_tidy.head(7)

,student,work_type,work_num,score
0,Sam,quiz,1,78
1,Sandhya,quiz,1,66
2,Ian,quiz,1,66
3,Christina,quiz,1,59
4,George,quiz,1,90
5,Sam,quiz,2,85
6,Sandhya,quiz,2,59


## How do we un-tidy?

`.pivot(...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html))

In [13]:
(
    students_tidy
    .assign(
        work_id = lambda df: df["work_type"] + "_" + df["work_num"].astype(str)
    )
    .pivot(
        index="student",
        columns="work_id",
        values="score"
    )
)

student,paper_1,paper_2,quiz_1,quiz_2
Christina,86,88,59,91
George,90,85,90,89
Ian,77,94,66,70
Sam,75,90,78,85
Sandhya,88,69,66,59
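
The pivoted result keeps `student` as the index and labels the column axis `work_id`. A minimal sketch of flattening it back into an ordinary table, and of what happens when the data contains a repeated (student, assignment) pair, might look like this. The small `tidy` frame is hand-built from scores shown earlier, and the `pivot_failed` flag exists only for illustration.

```python
import pandas as pd

# A small tidy frame built from scores shown earlier.
tidy = pd.DataFrame({
    "student": ["Sam", "Sam", "Ian", "Ian"],
    "work_id": ["quiz_1", "quiz_2", "quiz_1", "quiz_2"],
    "score": [78, 85, 66, 70],
})

wide = (
    tidy
    .pivot(index="student", columns="work_id", values="score")
    .rename_axis(columns=None)   # drop the "work_id" label on the column axis
    .reset_index()               # turn the "student" index back into a column
)
print(wide)

# .pivot(...) does not aggregate: a repeated (index, columns) pair is an error.
with_repeat = pd.concat([tidy, tidy.iloc[[0]]], ignore_index=True)
pivot_failed = False
try:
    with_repeat.pivot(index="student", columns="work_id", values="score")
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)
```

When repeats are expected, `.pivot_table(...)` with an explicit `aggfunc` is the usual alternative, at the cost of having to decide how duplicates are combined.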
