# Tidy Data (continued)
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo34_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd

## Revisit activity from last time
**Activity 1**: Run the following cell to get a messy dataset. Tidy the data.

In [None]:
data = {
    'Date': ['04/01/24', '04/02/24','04/03/24'],
    'New York_Temperature': [32, 35, 33],
    'New York_Humidity': [40, 45, 47],
    'Los Angeles_Temperature': [70, 72, 71],
    'Los Angeles_Humidity': [60, 65, 66]
}

df = pd.DataFrame(data)
df

In [None]:
# Melt to put in long form
df2 = df.melt(id_vars='Date',var_name='City_Var',value_name='Value')
df2

In [None]:
# Create new variables so that each column is a single variable
df2['City'] = df2.City_Var.str.split('_').str[0]
df2['Variable'] = df2.City_Var.str.split('_').str[1]
df2.drop(columns = 'City_Var',inplace=True)
df2

In [None]:
# Pivot to get Temperature and Humidity as separate columns
df2.pivot_table(index = ['Date','City'],columns = 'Variable',values = 'Value').reset_index()

**NOTE:** You can also use `str.split` with an `expand=True` argument to create multiple columns at once. You can also remove "Variable" as the name of the columns axis. See below.

In [None]:
df2 = df.melt(id_vars='Date',var_name='City_Var',value_name='Value')
df2[['City','Variable']] = df2.City_Var.str.split('_',expand=True) # create multiple columns at once
df2.drop('City_Var',axis=1,inplace=True) # Drop the unnecessary column
df2
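To see what `expand=True` changes, here is a small standalone sketch. The labels are made up to mimic the `City_Variable` style from the activity, and the names `labels`, `as_lists`, and `as_frame` are only for illustration.

```python
import pandas as pd

# Hypothetical labels in the same "City_Variable" style as the activity
labels = pd.Series(["New York_Temperature", "Los Angeles_Humidity"])

# Without expand: a Series whose elements are Python lists
as_lists = labels.str.split("_")

# With expand=True: a DataFrame with one column per piece
as_frame = labels.str.split("_", expand=True)

print(as_lists.str[0].tolist())  # first piece of each list
print(as_frame.shape)            # (rows, pieces)
```

Because `expand=True` returns a DataFrame, its columns can be assigned to several new columns in one statement.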

In [None]:
# Pivot
tidy = df2.pivot_table(index = ['Date','City'],columns = 'Variable',values = 'Value')
tidy

In [None]:
# remove the name for the columns
tidy.columns.name= None 
tidy

In [None]:
tidy.reset_index()

### More complex example: Tidying the Billboard Top 100 dataset
The dataset shows the Billboard top hits around the year 2000. This dataset records the date a song first entered the Billboard Top 100. It has variables for artist, track, date entered, date peaked, genre, time, rank and week.

In [None]:
df = pd.read_csv("billboard.csv", encoding="mac_latin2")
df.head(10)

In [None]:
df.columns

At first glance, the first seven columns look okay (we'll probably need to check the dtypes for those date columns, though), but the remaining columns each represent a week. Week is a variable. According to the principles of tidy data, "each variable is a column", so week should be a column. Let's use the `melt` method to do this.

In [None]:
# Melting
# We will use the first 7 columns as the identifier variables, use the name "week" for the variable column
# and the name "rank" for the value column 
id_vars = ["year","artist.inverted","track","time","genre","date.entered","date.peaked"]
df = pd.melt(frame=df,id_vars=id_vars, var_name="week", value_name="rank")
df.head()

In [None]:
# Check dtypes to see what we're dealing with
results = df.dtypes
results

As expected, those dates aren't Pandas datetime type, so we'll want to address that eventually. Before that, the `week` column is looking messy. We should extract the week number from the string in the `week` column. We can use a regular expression to do that.

Let's use the `extract` method ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html)), which allows us to extract regex capture groups from strings in a Pandas series.

In [None]:
# Extract the numbers
df["week"] = df['week'].str.extract(r'(\d+)', expand=False).astype(int) # regex \d+ matches one or more digits; the raw string avoids an invalid-escape warning
df.head()
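As a quick standalone illustration of what the capture group pulls out, here is a sketch on a few made-up labels. The strings are an assumption about what the melted column names look like, and `weeks` and `week_numbers` are illustrative names.

```python
import pandas as pd

# Hypothetical week labels, assumed to resemble the melted column names
weeks = pd.Series(["x1st.week", "x2nd.week", "x10th.week"])

# The parentheses form one capture group; \d+ matches the first run of digits
week_numbers = weeks.str.extract(r"(\d+)", expand=False).astype(int)

print(week_numbers.tolist())
```

With `expand=False` and a single capture group, `extract` returns a Series rather than a one-column DataFrame, which is why it can be assigned straight back to one column.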

Okay, now what are we working with? 

In [None]:
df.info()

There are quite a few null values in the `rank` column. This is because if a song is in the Top 100 for fewer than 76 weeks, the remaining week columns are filled with NaN. Let's remove those rows.

In [None]:
# Cleaning out unnecessary rows
df = df.dropna()
df.head()
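Note that `dropna()` with no arguments removes a row if *any* column is missing. If only missing ranks should be dropped, the `subset` argument is one option. The sketch below uses a made-up table (`toy`) to show the difference; it is not the Billboard data.

```python
import numpy as np
import pandas as pd

# Toy data: one row has a missing rank, another has a missing genre
toy = pd.DataFrame({
    "track": ["A", "A", "B"],
    "genre": ["Rock", "Rock", None],
    "week": [1, 2, 1],
    "rank": [87.0, np.nan, 91.0],
})

all_na = toy.dropna()                     # drops rows with a missing value in any column
rank_only = toy.dropna(subset=["rank"])   # drops only rows whose rank is missing

print(len(all_na), len(rank_only))
```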

It is also strange for ranks to be floats. They represent a position, so let's make them ints.

In [None]:
df['rank'] = df['rank'].astype(int)
df.head()

Looking closely at the data now, we might notice another interesting thing. Let's look at the first two rows. They both have a value of 1 in the `week` column, but the first song entered the Billboard data in September of 2000 and the second one entered in February of 2000. Does week 1 correspond to the same date for both of those songs? No. The `week` column counts weeks on the chart starting from `date.entered`, with week 1 being the week the song entered. So a `week` value of 1 does not correspond to the same date for all songs. If we want to put everything on the same scale of just `date`, we will have to create that column. Let's do this below.

At this point, it will be useful to convert `date.entered` to a datetime type. We will also want to use the `to_timedelta` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_timedelta.html)) to combine the `week` information with the datetime information.

In [None]:
# Create "date" column: week 1 is the entry week, so subtract one week after adding
df['date'] = pd.to_datetime(df['date.entered']) + pd.to_timedelta(df['week'], unit='W') - pd.DateOffset(weeks=1)
df
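To make the arithmetic concrete, here is a small worked sketch on made-up rows. The dates and the `toy` name are illustrative. It subtracts 1 from `week` before converting, which gives the same result as subtracting one week afterward.

```python
import pandas as pd

# Toy rows: two weeks for one song, one week for another
toy = pd.DataFrame({
    "date.entered": ["2000-09-23", "2000-09-23", "2000-02-12"],
    "week": [1, 3, 1],
})

entered = pd.to_datetime(toy["date.entered"])

# Week 1 is the entry week itself, so the offset is (week - 1) weeks
toy["date"] = entered + pd.to_timedelta(toy["week"] - 1, unit="W")

print(toy["date"].dt.strftime("%Y-%m-%d").tolist())
```

Week 3 for the first song lands 14 days after its entry date, while both week 1 rows keep their own, different entry dates.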

Finally, a last nice step in tidying data is to keep only relevant columns and sort the data in a nice order. 

In [None]:
df = df[["artist.inverted", "track", "time", "genre", "week", "rank", "date"]]
df = df.sort_values(by=["date","artist.inverted","track","week","rank"],ascending=True)
df.head()

One last thing we notice is that some of the information about these songs is repeated many times: `artist.inverted`, `track`, `time`, and `genre` are all the same within a single song. It's really only the rank information that is changing. In this case we are dealing with one of the common problems of "messy" data: multiple types of observational units are stored in the same table.

The two types of observational units here are song and rank. If we want to make this data a little tidier, we should split these into separate tables. 

In [None]:
# Create a dataframe that contains the info for each unique song
songs_cols = ["artist.inverted", "track", "time", "genre"]
songs = df[songs_cols].drop_duplicates()
songs = songs.reset_index(drop=True)
songs["song_id"] = songs.index
songs

In [None]:
# Create dataframe that contains the info about ranks through time
ranks = df.merge(songs, on=["artist.inverted", "track", "time", "genre"])
ranks = ranks[["song_id", "date","rank"]]
ranks.head(10)

Now each dataframe contains a single type of observational unit. We can still get the information about specific song ranks (`song_id` is repeated and allows us to map to the song info), but now we aren't carrying around a whole bunch of repeated columns. This could make downstream analysis more efficient.
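When the song details are needed alongside the ranks again, the two tables can be joined on `song_id`. The sketch below uses tiny made-up tables with the same key column. The artist and track values are placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy versions of the two tables, linked by song_id
songs = pd.DataFrame({
    "song_id": [0, 1],
    "artist.inverted": ["Artist A", "Artist B"],
    "track": ["Track A", "Track B"],
})
ranks = pd.DataFrame({
    "song_id": [0, 0, 1],
    "date": pd.to_datetime(["2000-01-01", "2000-01-08", "2000-01-01"]),
    "rank": [90, 85, 77],
})

# A left merge keeps every rank row and attaches the matching song info
combined = ranks.merge(songs, on="song_id", how="left")

print(combined[["track", "date", "rank"]])
```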

### Another example: Tuberculosis
Hadley Wickham is a statistician who introduced the concept of "tidy data". In his paper, he used the following dataset about tuberculosis cases. The column names indicate whether the group is male or female and their age range. For example, m1524 means males between the ages of 15 and 24, inclusive.

Also, there is a distinction between zeros and missing values due to the data collection process, and this distinction is important. Let's tidy this data.

In [None]:
df = pd.read_csv("tb-raw.csv")
df

Recall that one of the possible problems that occur in "messy" data is when multiple variables are stored in one column. We are dealing with a version of this issue here, since sex and age range are both encoded in the column names. Let's tidy this up by first melting those column names into a single column.

In [None]:
df = pd.melt(df, id_vars=["country","year"], value_name="cases", var_name="sex_and_age")
df.head()

We need to separate sex and age.

In [None]:
# Extract Sex, Age lower bound and Age upper bound group
df[["sex","age_lower",'age_upper']] = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})", expand=True)   # one non-digit (sex), then digits split so the last two are the upper bound
df.head()
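To check how this pattern splits a label, here is a standalone sketch on a few made-up labels. The labels `f65` and `mu` are hypothetical and included only to show what happens when a label does not fit the pattern. `labels` and `parts` are illustrative names.

```python
import pandas as pd

# Hypothetical labels; the last two do not fit the three-group pattern
labels = pd.Series(["m014", "m1524", "f65", "mu"])

# (\D) one non-digit, (\d+) lower bound, (\d{2}) two-digit upper bound
parts = labels.str.extract(r"(\D)(\d+)(\d{2})", expand=True)
parts.columns = ["sex", "age_lower", "age_upper"]

print(parts)
```

Labels that need at least three digits to match but have fewer come back as NaN in every group, so it is worth checking for unexpected nulls after extracting.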

In [None]:
# Create `age` column based on `age_lower` and `age_upper`
df["age"] = df["age_lower"] + "-" + df["age_upper"]
df.head()

In [None]:
# Drop unnecessary columns
df.drop(['sex_and_age',"age_lower","age_upper"], axis=1,inplace=True)
df

At this point, our data follows the principles of tidy data. We can follow up with a couple of nice-to-dos as well: dropping the null values and sorting.

In [None]:
# Drop nulls and sort
df = df.dropna()
df = df.sort_values(by=["country", "year", "sex", "age"],ascending=True)
df.head()
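Since zeros and missing values mean different things in this dataset, it helps to confirm that `dropna` removes only the missing values and keeps the zeros. A minimal sketch on made-up values (the `toy` table is illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: a true zero, a missing value, and a positive count
toy = pd.DataFrame({
    "country": ["AD", "AD", "AE"],
    "cases": [0.0, np.nan, 3.0],
})

kept = toy.dropna()

print(kept["cases"].tolist())
```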