# Tidy Data
To type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo33_live.ipynb) instead.
To follow along by running the cells without typing, stay in this notebook.

In [None]:
import numpy as np
import pandas as pd

### Tidying Data

### Example 1

In [None]:
df = pd.read_csv("pew-raw.csv")
df

In [None]:
# MELT to tidy
tidy_df = pd.melt(df,id_vars = "religion", var_name="income", value_name="frequency")
tidy_df
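
If `pew-raw.csv` is not at hand, the same reshaping can be tried on a tiny stand-in table. This is a minimal sketch: the religion labels, income brackets, and counts below are made up for illustration and are not taken from the Pew file.

```python
import pandas as pd

# Hypothetical wide table: one column per income bracket (values are invented)
wide = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [1, 2],
    "$10-20k": [3, 4],
})

# Every column not listed in id_vars is unpivoted into (income, frequency) pairs
long = pd.melt(wide, id_vars="religion", var_name="income", value_name="frequency")
print(long)
```

Each row of `long` now holds one observation: a religion, an income bracket, and the count for that pair.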

### Example 2

In [None]:
person_data = pd.read_csv('https://gist.githubusercontent.com/Kimmirikwa/87886e7740d30697145d8a638a523b90/raw/ad12e081266db54c44a0a7c994306006c4096396/student_raw.csv')
person_data

In [None]:
performance_data = pd.read_csv('https://gist.githubusercontent.com/Kimmirikwa/98e0982d035a09a7c7441617b079c1c0/raw/5a20352893e097a1de23ee135cc9f9b82f86b449/performance_raw.csv')
performance_data

In [None]:
# Tidy the person data
person_data['sex'] = person_data['sex and age'].str.split('_').str[0]
person_data['age'] = person_data['sex and age'].str.split('_').str[1]
person_data.drop(columns = 'sex and age',inplace=True)
person_data

In [None]:
# There are some duplicates
tidy_person_data = person_data.drop_duplicates()
tidy_person_data
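
As an alternative sketch of the person-data steps, `str.split` with `expand=True` returns both pieces in one pass, and `drop` and `drop_duplicates` can be chained without `inplace`. The small `people` frame below is hypothetical and only assumes values shaped like `"M_32"`.

```python
import pandas as pd

# Hypothetical stand-in for the student file (values are invented)
people = pd.DataFrame({
    "id": [1, 2, 2],
    "sex and age": ["M_32", "F_28", "F_28"],
})

# Split once and assign both new columns at the same time
people[["sex", "age"]] = people["sex and age"].str.split("_", expand=True)
people["age"] = people["age"].astype(int)  # split yields strings; convert to numbers

# Drop the combined column, then remove the repeated row
people = (
    people.drop(columns="sex and age")
          .drop_duplicates()
          .reset_index(drop=True)
)
print(people)
```

Because nothing is modified in place, this cell can be rerun without raising an error about a missing `sex and age` column.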

In [None]:
# Tidy performance data
tidy_performance_data = performance_data.melt(id_vars=['id', 'test number'], value_vars=['term 1', 'term 2', 'term 3'], var_name='term', value_name='score')
tidy_performance_data

### Example 3

In [None]:
tuples = list(
    zip(
        *[
            ["level 1", "level 1", "level 2", "level 2", "level 3", "level 3", "level 4", "level 4"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df

In [None]:
# Sometimes tidying can be done with the "stack" method (if you have a MultiIndex)
df = df.stack().reset_index()  # reset_index already returns a DataFrame
df.columns = ['first', 'second', 'variable', 'value']
df
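
Since the example above uses random numbers, a deterministic variant can make the effect of `stack` easier to check by eye. This is a sketch on a smaller, made-up frame; naming the column axis and the stacked Series first is one possible way to avoid overwriting `columns` by hand afterwards.

```python
import numpy as np
import pandas as pd

# Small MultiIndex frame with predictable values 0..7
idx = pd.MultiIndex.from_product(
    [["level 1", "level 2"], ["one", "two"]], names=["first", "second"]
)
wide = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=["A", "B"])

tidy = (
    wide.rename_axis(columns="variable")  # name the column axis before stacking
        .stack()                          # column labels become the innermost index level
        .rename("value")                  # name the resulting Series
        .reset_index()                    # turn all index levels into ordinary columns
)
print(tidy)
```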

### Example 4

In [None]:
df = pd.read_csv("tb-raw.csv")
df

In [None]:
df = pd.melt(df, id_vars=["country","year"], value_name="cases", var_name="sex_and_age")
df.head()

In [None]:
# Extract sex, age lower bound, and age upper bound
# Raw string (r"...") so that \D and \d reach the regex engine without invalid-escape warnings
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})", expand=False)
tmp_df.head()

In [None]:
# Name columns
tmp_df.columns = ["sex", "age_lower", "age_upper"]
tmp_df.head()
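
The extraction and naming steps can also be combined by using named groups in the pattern. The sketch below runs the same pattern on a few made-up codes (assumed to resemble the melted column names) and shows that a code with only two digits does not match and yields missing values.

```python
import pandas as pd

# Hypothetical codes in the style the pattern expects
codes = pd.Series(["m014", "f1524", "m65"])

# (?P<name>...) names each group, and str.extract uses those names as column labels
parts = codes.str.extract(r"(?P<sex>\D)(?P<age_lower>\d+)(?P<age_upper>\d{2})")
print(parts)
```

For `"m65"` the pattern needs at least three digits (one or more for `age_lower`, exactly two for `age_upper`), so that row comes back as all `NaN`. Rows like this may need separate handling before analysis.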

In [None]:
# Create the `age` column from `age_lower` and `age_upper`
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]
tmp_df.head()

In [None]:
# Combine the extracted columns with the melted data (column-wise)
df = pd.concat([df, tmp_df], axis=1)
df.head()

In [None]:
# Drop columns that are no longer needed
df = df.drop(['sex_and_age',"age_lower","age_upper"], axis=1)
df.head()

### Activity

**Activity 1**: Run the following cell to get a messy dataset. Tidy the data.

In [None]:
data = {
    'Date': ['04/01/24', '04/02/24','04/03/24'],
    'New York_Temperature': [32, 35, 33],
    'New York_Humidity': [40, 45, 47],
    'Los Angeles_Temperature': [70, 72, 71],
    'Los Angeles_Humidity': [60, 65, 66]
}

df = pd.DataFrame(data)
df

In [None]:
df2 = 
df2

**Activity 2**: Run the following cell to get a messy dataset. Tidy the data.

In [None]:
df = pd.read_csv("billboard.csv", encoding="mac_latin2")
df.head(10)

In [None]:
# Melting
id_vars = 
df2 = 
df2.head()