<a href="https://colab.research.google.com/github/p-tech/wbs-dm/blob/main/Tidy_Data/Tidy_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tidy Data in Python
by [Jean-Nicholas Hould](http://www.jeannicholashould.com/)

In [2]:
import pandas as pd
import datetime
from os import listdir
from os.path import isfile, join
import glob
import re

## Column headers are values, not variable names

Here the dataset is in a wide format: important data values are stored in the column names instead of inside the table as proper values.

### Pew Research Center Dataset

In [None]:
df = pd.read_csv("./data/pew-raw.csv")
df

`formatted_df = pd.melt(df,["religion"], var_name="income", value_name="freq")`

pd.melt() is used to convert wide-format data into long-format data.

"religion" is kept as an identifier variable (it stays the same).

Other column names (except "religion") are converted into values under a new column called "income".

The corresponding values from those columns are placed in a new column called "freq".
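To make the reshaping concrete, here is a minimal sketch of the same `pd.melt()` call on a tiny table. The religion labels, income brackets and counts are made up for illustration and are not taken from the Pew file.

```python
import pandas as pd

# Hypothetical wide table: one row per religion, one column per income bracket
wide = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [1, 3],
    "$10-20k": [2, 4],
})

# "religion" stays as an identifier; every other column name becomes a value
# in "income", and the cell contents move into "freq"
long = pd.melt(wide, id_vars=["religion"], var_name="income", value_name="freq")
print(long)
```

Each of the two income columns contributes one row per religion, so the two-row wide table becomes a four-row long table.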

In [None]:
formatted_df = pd.melt(df,["religion"], var_name="income", value_name="freq")
formatted_df = formatted_df.sort_values(by=["religion"])
formatted_df.head(10)

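The first column of the output is the original index, which is now out of order because of the sort. To replace it with a fresh index running from 0, chain `reset_index(drop=True)` after the sort; `drop=True` discards the old index instead of keeping it as a new column.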

In [None]:
formatted_df = formatted_df.sort_values(by=["religion"]).reset_index(drop=True)
formatted_df.head(10)
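The effect of `reset_index(drop=True)` is easiest to see on a small frame. The sketch below uses made-up values and hypothetical variable names.

```python
import pandas as pd

# Hypothetical three-row frame with the default 0..2 index
toy = pd.DataFrame({"religion": ["B", "A", "C"], "freq": [3, 1, 2]})

# Sorting reorders the rows but keeps each row's original index label
sorted_toy = toy.sort_values(by=["religion"])
print(sorted_toy.index.tolist())

# reset_index(drop=True) throws the old labels away and renumbers from 0
reindexed = sorted_toy.reset_index(drop=True)
print(reindexed.index.tolist())
```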

### Billboard Top 100 Dataset

This dataset represents the weekly rank of songs from the moment they enter the Billboard Top 100 to the subsequent 75 weeks.

Problems:

The column headers are composed of values: the week number (x1st.week, …)

If a song is in the Top 100 for fewer than 75 weeks, the remaining columns are filled with missing values.

In [None]:
df = pd.read_csv("./data/billboard.csv", encoding="mac_latin2")
df.head(10)

A tidy version of this dataset is one without the week numbers as columns but rather as values of a single column.

In order to do so, we’ll melt the week columns into a single week column and then derive a date from it. We will create one row per week for each record.

If there is no data for the given week, we will not create a row.

What’s Happening?

The dataset originally has week numbers in its column headers (e.g., "x1st.week"), and each row represents a song.

`pd.melt()` converts the dataset from wide format to long format, keeping "year", "artist.inverted", "track", etc., as identifier variables.

All week columns are melted into one column called "week", and their values go into the "rank" column.

The "week" column then contains the old header text, such as "x1st.week".

`str.extract(r'(\d+)')` extracts only the numeric part (1, 2, 3, etc.), and `astype(int)` converts it from a string to an integer.
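As a small illustration of the extraction step, the sketch below runs the same pattern on a few header strings. The series is a hypothetical stand-in for the melted "week" column, using the "x1st.week" naming mentioned in the problem list.

```python
import pandas as pd

# Hypothetical header strings as they would appear after melting
week = pd.Series(["x1st.week", "x2nd.week", "x10th.week"])

# (\d+) captures the first run of digits; expand=False returns a Series
# because the pattern has a single capture group
week_number = week.str.extract(r"(\d+)", expand=False).astype(int)
print(week_number.tolist())
```

The raw-string prefix `r"..."` keeps Python from interpreting `\d` as a string escape.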

In [None]:
# Melting
id_vars = ["year",
           "artist.inverted",
           "track",
           "time",
           "genre",
           "date.entered",
           "date.peaked"]

df = pd.melt(frame=df, id_vars=id_vars, var_name="week", value_name="rank")

# Formatting
df["week"] = df["week"].str.extract(r"(\d+)", expand=False).astype(int)

# Remove rows where value is NaN
df = df.dropna(subset=['rank'])

# Assign the converted column back by name. Writing df.loc[:, "rank"] = ... instead
# would set the values into the existing float column in recent pandas versions
# (2.0 and later), so the column would stay float.
df["rank"] = df["rank"].astype(int)

# Cleaning out unnecessary rows
df = df.dropna()

# Create "date" columns
df['date'] = pd.to_datetime(df['date.entered']) + pd.to_timedelta(df['week'], unit='w') - pd.DateOffset(weeks=1)

df = df[["year",
         "artist.inverted",
         "track",
         "time",
         "genre",
         "week",
         "rank",
         "date"]]
df = df.sort_values(ascending=True, by=["year","artist.inverted","track","week","rank"])

# Assigning the tidy dataset to a variable for future usage
billboard = df

df.head(10)
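The date arithmetic in the cell above can be checked on a couple of rows. This is a minimal sketch with a made-up entry date; it subtracts one from the week number before converting, which is equivalent to adding the full number of weeks and then taking one week off.

```python
import pandas as pd

# Hypothetical song that entered the chart on 2000-02-26
toy = pd.DataFrame({"date.entered": ["2000-02-26", "2000-02-26"], "week": [1, 3]})

# Week 1 is the entry date itself, so the offset is (week - 1) weeks
toy["date"] = (pd.to_datetime(toy["date.entered"])
               + pd.to_timedelta(toy["week"] - 1, unit="W"))
print(toy)
```

Week 1 maps to the entry date and week 3 to the date fourteen days later.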

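## Multiple types in one table

Following up on the Billboard dataset, we’ll now address the repetition problem of the previous table.

Problems:

Multiple observational units (the song and its rank) in a single table.

To fix this, we first copy the song-level columns into their own table, drop the duplicate rows, and use the new index as a song_id.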

In [None]:
songs_cols = ["year", "artist.inverted", "track", "time", "genre"]
songs = billboard[songs_cols].drop_duplicates()
songs = songs.reset_index(drop=True)
songs["song_id"] = songs.index
songs.head(10)

What’s Happening?

pd.merge() combines the billboard and songs datasets based on common columns.

The on=[...] argument specifies the matching columns that should be the same in both datasets.

The resulting dataset (ranks) will contain the rows where both datasets have matching values for:

- year
- artist.inverted (artist name formatted as Last, First)
- track (song name)
- time (track duration)
- genre (music genre)

Why Merge?

The songs dataset contains the song_id column created in the previous cell, which is not present in billboard.

By merging, we connect ranking data (billboard) with song metadata (songs).
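The split-then-merge pattern can be sketched on a toy table. The artists, tracks and ranks below are invented, and only two key columns are used to keep the example short.

```python
import pandas as pd

# Hypothetical weekly chart data: song details repeat on every row
weekly = pd.DataFrame({
    "artist.inverted": ["Artist A", "Artist A", "Artist B"],
    "track": ["Song 1", "Song 1", "Song 2"],
    "week": [1, 2, 1],
    "rank": [87, 82, 91],
})

# One row per song, with a generated identifier
song_table = weekly[["artist.inverted", "track"]].drop_duplicates().reset_index(drop=True)
song_table["song_id"] = song_table.index

# Attach song_id to every weekly row, then keep only the rank-level columns
rank_table = pd.merge(weekly, song_table, on=["artist.inverted", "track"])
rank_table = rank_table[["song_id", "week", "rank"]]
print(rank_table)
```

After the merge, the rank table refers to each song only through `song_id`, so the song details are stored once.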

In [None]:
ranks = pd.merge(billboard, songs, on=["year","artist.inverted", "track", "time", "genre"])
ranks = ranks[["song_id", "date","rank"]]
ranks.head(10)

## Multiple variables stored in one column

### Tuberculosis Example

This dataset documents the count of confirmed tuberculosis cases by country, year, age and sex.

Problems:

- The columns starting with "m" or "f" contain multiple variables:
    - Sex ("m" or "f")
    - Age Group ("0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65", "unknown")
- Mixture of 0s and missing values ("NaN"). This is due to the data collection process and the distinction is important for this dataset.

In [None]:
df = pd.read_csv("./data/tb-raw.csv")
df

In order to tidy this dataset, we need to remove the different values from the header and unpivot them into rows.

We’ll first need to melt the sex + age group columns into a single one.

Once we have that single column, we’ll derive three columns from it: sex, age_lower and age_upper.

With those, we’ll be able to properly build a tidy dataset.

`df["sex_and_age"].str.extract("(\D)(\d+)(\d{2})", expand=False)`

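`.str.extract()` is a pandas string method that extracts the substrings matched by the capture groups of a regular expression (regex).

- `\D` matches one non-digit character (the sex).
- `\d+` matches the leading digits (the lower age bound).
- `\d{2}` matches the last two digits (the upper age bound).

Because the pattern has three capture groups, the result is a DataFrame with one column per group. `expand=False` only makes a difference for a pattern with a single group, where it returns a Series instead of a one-column DataFrame.

Headers that carry fewer than three digits, as the "65" and "unknown" age groups do, do not match the pattern and produce NaN, so their rows are removed by the later `dropna()`.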
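A short sketch of how this pattern splits a header may help. The codes below are hypothetical examples in the style of the raw column names (a sex letter followed by age digits), not a listing of the actual file.

```python
import pandas as pd

# Hypothetical sex-and-age codes
codes = pd.Series(["m014", "f1524", "m65", "mu"])

# \d+ is greedy but gives digits back so that \d{2} can still match the last two
parts = codes.str.extract(r"(\D)(\d+)(\d{2})", expand=False)
parts.columns = ["sex", "age_lower", "age_upper"]
print(parts)
```

The first two codes split into sex, lower bound and upper bound. The last two have too few digits for the pattern, so every extracted field is NaN for them.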

In [None]:
df = pd.melt(df, id_vars=["country","year"], value_name="cases", var_name="sex_and_age")

# Extract Sex, Age lower bound and Age upper bound group
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})", expand=False)

# Name columns
tmp_df.columns = ["sex", "age_lower", "age_upper"]

# Create `age` column based on `age_lower` and `age_upper`
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]

# Merge
df = pd.concat([df, tmp_df], axis=1)

# Drop unnecessary columns and rows
df = df.drop(['sex_and_age',"age_lower","age_upper"], axis=1)
df = df.dropna()
df = df.sort_values(ascending=True,by=["country", "year", "sex", "age"])
df.head(10)

## Variables are stored in both rows and columns

### Global Historical Climatology Network Dataset

This dataset represents the daily weather records for a weather station (MX17004) in Mexico for five months in 2010.

Problems:

Variables are stored in both rows (tmin, tmax) and columns (days).

In [None]:
df = pd.read_csv("./data/weather-raw.csv")
df.head(10)

In [None]:
df = pd.melt(df, id_vars=["id", "year","month","element"], var_name="day_raw")
df.head(10)

In order to make this dataset tidy, we want to move the three misplaced variables (tmin, tmax and days) into three individual columns: tmin, tmax and date.

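The pattern `d(\d+)` is applied to the "day_raw" column:

- `d` matches the letter "d" exactly.
- `(\d+)` captures one or more digits (the day number).

`df[["year","month","day"]] = df[["year","month","day"]].apply(lambda x: pd.to_numeric(x, errors='coerce'))`

This converts the year, month and day columns to numeric values (anything that cannot be parsed becomes NaN), so that they can then be combined into a single date.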
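One way to combine numeric year, month and day columns into a date is `pd.to_datetime`, which accepts a DataFrame with columns named "year", "month" and "day". This differs from the row-wise `datetime.datetime` helper used in this notebook and is shown only as a sketch on a made-up frame.

```python
import pandas as pd

# Hypothetical components; the second row (30 February) is not a real date
parts = pd.DataFrame({"year": [2010, 2010], "month": [1, 2], "day": [30, 30]})

# errors="coerce" turns impossible dates into NaT instead of raising
dates = pd.to_datetime(parts[["year", "month", "day"]], errors="coerce")
print(dates)
```

Rows that become NaT can then be removed with `dropna()`, in the same way as rows with missing measurements.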




In [None]:
# Extracting day
df["day"] = df["day_raw"].str.extract(r"d(\d+)", expand=False)
df["id"] = "MX17004"

# To numeric values
df[["year","month","day"]] = df[["year","month","day"]].apply(lambda x: pd.to_numeric(x, errors='coerce'))

# Creating a date from the different columns
def create_date_from_year_month_day(row):
    # Cast each part to int: the columns may be float after pd.to_numeric
    return datetime.datetime(year=int(row["year"]), month=int(row["month"]), day=int(row["day"]))

df["date"] = df.apply(create_date_from_year_month_day, axis=1)
df = df.drop(['year',"month","day", "day_raw"], axis=1)
df = df.dropna()

# Unmelting column "element"
df = df.pivot_table(index=["id","date"], columns="element", values="value")
df.reset_index(drop=False, inplace=True)
df
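The final "unmelting" step can be sketched on a small long table. The station id and temperatures below are invented for illustration.

```python
import pandas as pd

# Hypothetical long table: one row per station, date and measured element
long = pd.DataFrame({
    "id": ["S1"] * 4,
    "date": pd.to_datetime(["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"]),
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

# Each distinct value of "element" becomes its own column
wide = long.pivot_table(index=["id", "date"], columns="element", values="value")

# Turn the (id, date) index back into ordinary columns
wide = wide.reset_index()
print(wide)
```

Note that `pivot_table` aggregates duplicate (id, date, element) combinations, using the mean by default. With one measurement per combination, as here, the values pass through unchanged.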

## One type in multiple tables

### Baby Names in Illinois

Problems:

The data is spread across multiple tables/files.

The “Year” variable is present in the file name.

In order to load those different files into a single DataFrame, we can run a custom script that will append the files together.

Furthermore, we’ll need to extract the “Year” variable from the file name.

In [None]:
def extract_year(string):
    # Return the last run of four digits in the file path, or None if there is none
    match = re.match(r".+(\d{4})", string)
    if match is not None:
        return match.group(1)

path = './data'
allFiles = glob.glob(path + "/201*-baby-names-illinois.csv")
df_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    df.columns = df.columns.str.lower()
    df["year"] = extract_year(file_)
    df_list.append(df)

df = pd.concat(df_list)
df.head(10)
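The loading pattern can be tried without the real data. The sketch below writes two tiny CSV files into a temporary directory and then combines them; the names, counts and the `extract_year` variant (which searches only the file name and returns an integer) are assumptions made for this example.

```python
import glob
import os
import re
import tempfile

import pandas as pd


def extract_year(path):
    # Look for four consecutive digits in the file name only, not the directory
    match = re.search(r"(\d{4})", os.path.basename(path))
    return int(match.group(1)) if match else None


with tempfile.TemporaryDirectory() as tmp:
    # Write two hypothetical yearly files with made-up contents
    for year, name in [(2014, "Ava"), (2015, "Mia")]:
        pd.DataFrame({"Name": [name], "Frequency": [1]}).to_csv(
            os.path.join(tmp, f"{year}-baby-names-illinois.csv"), index=False)

    frames = []
    pattern = os.path.join(tmp, "201*-baby-names-illinois.csv")
    for file_ in sorted(glob.glob(pattern)):
        part = pd.read_csv(file_)
        part.columns = part.columns.str.lower()
        part["year"] = extract_year(file_)  # the year lives only in the file name
        frames.append(part)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Sorting the glob results makes the row order reproducible, since `glob.glob` does not guarantee any particular order.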