# Tidy Data in Python
The examples and code in this notebook are made by [Jean-Nicholas Hould](http://www.jeannicholashould.com/)

Detailed explanations for important code snippets are provided by Mervat Abuelkheir as part of the CSEN1095 Data Engineering Course.

The goal of this notebook is to show how a messy dataset can be tidied into proper rows representing objects, columns representing attributes, and cells representing scalar values.

Pay attention to the <span style="color:red"> <b> paragraphs in bold red</b></span>; they ask you to do something and provide input!

The first thing we need to do is import some libraries.

In [55]:
import pandas as pd
import datetime # to handle date/time attributes
from os import listdir # os is a module for interacting with the OS
from os.path import isfile, join # to verify file object, and concatenate paths
import glob # to find pathnames matching a specific pattern
import re # regular expressions :)

## Examining the datasets

In this part of the exercise we will import a number of datasets and examine their structure to verify if the datasets are tidy.

Remember the requirements for a tidy dataset:
<br> 1- Each row describes a single object
<br> 2- Each column describes a property/attribute of that object
<br> 3- Column values have the same measurement unit
<br> 4- Columns contain atomic/scalar values (no multiple values per table cell)

For each dataset imported, test your ability to identify whether it is tidy or not.

### Dataset 1: Pew Research Center

Pew Research Center is a famous center in the US that performs polling surveys on citizens. This is example data about the breakdown of yearly income per religion.

In [56]:
df = pd.read_csv("./data/pew-raw.csv")

# Display dataframe
df

Unnamed: 0,religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k
0,Agnostic,27,34,60,81,76,137
1,Atheist,12,27,37,52,35,70
2,Buddhist,27,21,30,34,33,58
3,Catholic,418,617,732,670,638,1116
4,Dont know/refused,15,14,15,11,10,35
5,Evangelical Prot,575,869,1064,982,881,1486
6,Hindu,1,9,7,9,11,34
7,Historically Black Prot,228,244,236,238,197,223
8,Jehovahs Witness,20,27,24,24,21,30
9,Jewish,19,19,25,25,30,95


<span style="color:red"> <b> What are the attributes of interest? How are they organized? Is the dataset tidy? </b></span> 
    
You can brainstorm your thought process and document in a new cell if you like.
<br>Instructions for beginners:
<br>- Add a new cell from the notebook menu above (+ button).
<br>- Double click anywhere inside the new cell to enter edit mode.
<br>- When done, press CTRL+ENTER or SHIFT+ENTER to commit content.
<br>- You can edit content anytime by double clicking inside the cell.

## Let's tidy the dataset!

The `melt` function changes the format of a pandas data frame from wide to long: one or more columns are kept as identifiers and the remaining columns are "unpivoted" into rows.
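
As a minimal sketch of what `melt` does, consider a small made-up table. The column names and values below are illustrative assumptions, not taken from the Pew data:

```python
import pandas as pd

# Hypothetical wide table: one row per group, one column per income bracket
wide = pd.DataFrame({
    "group": ["A", "B"],
    "low": [1, 3],
    "high": [2, 4],
})

# "group" stays as the identifier; the other column headers become values
# of a new "bracket" column, and the cell values move into "count"
long = pd.melt(wide, id_vars=["group"], var_name="bracket", value_name="count")
print(long)
```

Each of the two original rows turns into two rows in the long table, one per bracket column.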

In [57]:
# pd.melt takes a dataframe, one or more identifier columns (id_vars), a name for the new column holding the old column headers (var_name), and a name for the new column holding the cell values (value_name)
# Define a new pandas dataframe; the religion column is the identifier attribute
# The income ranges spread across the column headers are unpivoted into the new attribute "income"
# The frequencies of citizens in each income range are unpivoted into the new attribute "freq"
formatted_df = pd.melt(df, id_vars=["religion"], var_name="income", value_name="freq")
formatted_df = formatted_df.sort_values(by=["religion"]) # sort the new table by the religion attribute
formatted_df.head(10) # show the first 10 rows

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
30,Agnostic,$30-40k,81
40,Agnostic,$40-50k,76
50,Agnostic,$50-75k,137
10,Agnostic,$10-20k,34
20,Agnostic,$20-30k,60
41,Atheist,$40-50k,35
21,Atheist,$20-30k,37
11,Atheist,$10-20k,27
31,Atheist,$30-40k,52


<span style="color:red"> <b> Why do the indices that are added automatically by pandas appear out of order? </b></span> 
<br>(Just a question to let you think of how pandas dataframes are indexed.)
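
To explore this question, the sketch below sorts a small made-up frame (the values are assumptions for illustration) and prints the index before and after renumbering:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]})

# Sorting moves rows around; each row keeps the index label it had before
sorted_toy = toy.sort_values(by=["name"])
print(sorted_toy.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 if a clean index is wanted
renumbered = sorted_toy.reset_index(drop=True)
print(renumbered.index.tolist())
```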

### Dataset 2: Billboard Top 100

This dataset outlines data about the top hit songs on the Billboard list. 

In [58]:
df = pd.read_csv("./data/billboard.csv", encoding="mac_latin2")
df.head(10)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,
5,2000,Janet,Doesn't Really Matter,4:17,Rock,2000-06-17,2000-08-26,59,52.0,43.0,...,,,,,,,,,,
6,2000,Destiny's Child,Say My Name,4:31,Rock,1999-12-25,2000-03-18,83,83.0,44.0,...,,,,,,,,,,
7,2000,"Iglesias, Enrique",Be With You,3:36,Latin,2000-04-01,2000-06-24,63,45.0,34.0,...,,,,,,,,,,
8,2000,Sisqo,Incomplete,3:52,Rock,2000-06-24,2000-08-12,77,66.0,61.0,...,,,,,,,,,,
9,2000,Lonestar,Amazed,4:25,Country,1999-06-05,2000-03-04,81,54.0,44.0,...,,,,,,,,,,


<span style="color:red"> <b> Again: What are the attributes of interest? How are they organized? Is the dataset tidy? </b></span>

The structure of the dataset is more complex than the previous one, and it is not immediately clear what a typical row should represent or look like. Answering the above questions helps you frame the data better. 
<br>You can brainstorm your thought process and document in a new cell if you like.

## Let's tidy the dataset!

One way to organize a record is to have it represent the rank of one song in one week that the song was on the Billboard list. This removes the need to keep columns for all 76 weeks, which are null for most of the songs.

A record would have data about the year, artist, track, time, genre, week, rank, and date.

No single attribute is a unique identifier, because a track by the same artist appears with the same year, genre, and time in every week it stays on the list. Only the week, rank, and date differ between those rows (the date follows directly from the week). Therefore, to identify a track's rank and week, we need to use the year, artist, track, time, genre, and date together as a combined unique identifier.
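
One way to check whether a set of columns works as a combined identifier is to look for duplicated combinations. The sketch below uses a small made-up frame (the rows and the shortened column names are assumptions for illustration) and uses the week in place of the date, since one determines the other:

```python
import pandas as pd

toy = pd.DataFrame({
    "artist": ["X", "X", "Y"],
    "track": ["Song", "Song", "Tune"],
    "week": [1, 2, 1],
    "rank": [50, 40, 70],
})

# artist + track repeat across weeks, so they do not identify a single row
repeats_without_week = bool(toy.duplicated(subset=["artist", "track"]).any())
print(repeats_without_week)

# adding week makes every combination unique
repeats_with_week = bool(toy.duplicated(subset=["artist", "track", "week"]).any())
print(repeats_with_week)
```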


### <span style="color:blue"> Note on conversions in Python</span>

<span style="color:blue"> The following conversions are accepted by Python:</span>
<br><span style="color:blue"> - passing a string representation of an integer into int</span>
<br><span style="color:blue"> - passing a string representation of a float into float</span>
<br><span style="color:blue"> - passing a string representation of an integer into float</span>
<br><span style="color:blue"> - passing an integer into float</span>
<br><span style="color:blue"> - passing a float into int</span>

<span style="color:blue"> You get an error if you pass a string representation of a float (or any string that does not represent an integer) into int</span>
<br><span style="color:blue"> This is especially problematic if a pandas column holds NaN values, which are floats, and you want to convert the column to integers. Converting with int does not work, and you have to use the pandas nullable integer dtype Int32 (with a capital I). </span>
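
The conversions listed above can be tried out with a short sketch. The sample values are made up for illustration:

```python
import pandas as pd

print(int("7"))      # string of an integer -> int
print(float("7.5"))  # string of a float -> float
print(float("7"))    # string of an integer -> float
print(float(7))      # integer -> float
print(int(7.9))      # float -> int (the fractional part is discarded)

# A string representation of a float is rejected by int()
try:
    int("7.5")
except ValueError as err:
    print("int('7.5') failed:", err)

# A float column containing NaN cannot become a plain int column,
# but the nullable Int32 dtype keeps the missing value as <NA>
s = pd.Series([1.0, float("nan"), 3.0])
try:
    s.astype(int)
except ValueError as err:
    print("astype(int) failed:", err)

nullable = s.astype("Int32")
print(nullable)
```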

Now back to tidying up the Billboard dataset!

In [59]:
# Melting
# Define the identifier columns in one variable. Keep both date.entered and date.peaked for now; they are replaced by a single date column later.
id_vars = ["year","artist.inverted","track","time","genre","date.entered","date.peaked"]
# Melt so that each row holds the identifiers, the variable name (week), and the value (rank)
df = pd.melt(frame=df, id_vars=id_vars, var_name="week", value_name="rank")

# Formatting
# First, extract the week number from column names such as "x1st.week", then convert to float and then to integer
# (the raw string r'...' keeps \d from being read as a string escape)
df["week"] = df["week"].str.extract(r'(\d+)', expand=False).astype(float).astype(int)
# Second, convert rank to the nullable integer dtype Int32, which can hold the missing values that are still present
df["rank"] = df["rank"].astype('Int32')

# Cleaning out unnecessary rows (weeks in which the song was not on the list have a missing rank)
df = df.dropna()

# Create "date" column
# The date for each week is the date the track entered the list plus the number of weeks passed since entry
# Example: if date entered is 26/02/2000, then this is the date for week 1, and the date for week 2 becomes 04/03/2000, and so on
df["date"] = pd.to_datetime(df["date.entered"]) + pd.to_timedelta(df["week"], unit='W') - pd.DateOffset(weeks=1)


# Frame the final tidy data, replacing the dates of entry and peak with only the date, then sort by the identifiers
final_df = df[["year", "artist.inverted", "track", "time", "genre", "week", "rank", "date"]]
final_df = final_df.sort_values(ascending=True, by=["year","artist.inverted","track","week","rank"])

# Assign the tidy dataset to a variable for future usage
billboard = final_df

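ValueError: Cannot convert non-finite values (NA or inf) to integer

pandas raises this error when `astype(int)` is applied to a column that still contains NaN, for example when the regular expression finds no digits in one of the values being converted.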

<span style="color:red"><b>Why did we convert the week string to float before converting it to int?</b></span>

<span style="color:red"><b>What does the parameter '(\d+)' in the str.extract method do? </b></span>
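
To experiment with this question, the sketch below applies the same pattern to a few week-style column names like the ones in the header shown earlier:

```python
import pandas as pd

weeks = pd.Series(["x1st.week", "x2nd.week", "x76th.week"])

# (\d+) captures the first run of one or more digits in each string;
# expand=False returns a Series because there is a single capture group
numbers = weeks.str.extract(r"(\d+)", expand=False)
print(numbers.tolist())

# The captured values are still strings until they are converted
as_int = numbers.astype(int)
print(as_int.tolist())
```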

In [None]:
# Now let's check the tidied data frame
# Separating this line of code to avoid running the formatting code multiple times and getting errors
final_df.head(10)
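
The date arithmetic used in the tidying cell can be checked on its own. This is a minimal sketch with made-up entry dates; it subtracts one from the week before converting, which is equivalent to adding the weeks and then removing a one-week offset:

```python
import pandas as pd

entered = pd.to_datetime(pd.Series(["2000-02-26", "2000-02-26"]))
week = pd.Series([1, 2])

# Week 1 is the entry date itself, so shift by (week - 1) whole weeks
date = entered + pd.to_timedelta(week - 1, unit="W")
formatted = date.dt.strftime("%Y-%m-%d").tolist()
print(formatted)
```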

### Dataset 3: Tuberculosis

This dataset outlines the number of tuberculosis patients in different countries in the year 2000.

A few notes on the raw data set:

- The columns starting with "m" or "f" contain multiple variables: 
    - Sex ("m" or "f")
    - Age Group ("0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65", "unknown")
- The data mix 0s and missing values ("NaN"). This is due to the data collection process, and the distinction is important for this dataset.

In [None]:
df = pd.read_csv("./data/tb-raw.csv")
df

<span style="color:red"> <b> Again: What are the attributes of interest? How are they organized? Is the dataset tidy? </b></span>

## Let's tidy the dataset!

Same as what we did before: We need identifiers, we need the column names to represent variables (two in this case, since the column names carry information about gender and age group), and we need the frequency values to be in one column.


In [None]:
# Let's use the country and year as unique identifiers, name the number of patients "cases", and name the column variable "sex_and_age"
df = pd.melt(df, id_vars=["country","year"], value_name="cases", var_name="sex_and_age")

# Extract sex, age lower bound, and age upper bound
# Column names that do not fit this pattern yield NaN here and are removed by dropna() below
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})", expand=False)

# tmp_df now has one column per captured group. Now name the columns
tmp_df.columns = ["sex", "age_lower", "age_upper"]

# Create "age" column based on "age_lower" and "age_upper"
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]

# Concatenate: the axis parameter indicates the axis along which the frames are joined; 1 means side by side, by columns
df = pd.concat([df, tmp_df], axis=1)

# Drop unnecessary columns
df = df.drop(['sex_and_age',"age_lower","age_upper"], axis=1)
# Drop rows with null values
df = df.dropna()
# Sort rows by all four attributes
df = df.sort_values(ascending=True,by=["country", "year", "sex", "age"])
df.head(10)

<span style="color:red"><b>What does the parameter value "(\D)(\d+)(\d{2})" do?</b></span>
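
To experiment with this question, the sketch below runs the pattern on a few column names. The names `m014`, `f1524`, `m65`, and `mu` are assumed spellings of the sex and age-group columns described in the notes above; check them against the real header before relying on them:

```python
import pandas as pd

# Assumed column-name spellings, for illustration only
cols = pd.Series(["m014", "f1524", "m65", "mu"])

# Three capture groups produce three columns:
# (\D) one non-digit, (\d+) one or more digits, (\d{2}) exactly two digits
parts = cols.str.extract(r"(\D)(\d+)(\d{2})", expand=False)
parts.columns = ["sex", "age_lower", "age_upper"]
print(parts)
```

Names with fewer than three digits after the letter do not match the pattern, so their row comes back as NaN and is later removed by `dropna()`.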

### Dataset 4: Global Historical Climatology Network

In [None]:
df = pd.read_csv("./data/weather-raw.csv")
df.head(10)

In this dataset, variables are stored in both rows and columns. The element column holds the variable names tmax and tmin, which stand for the maximum and minimum temperatures for each day. The date is broken down into a year column, a month column, and one column per day of the month. We need the data to represent min and max temperatures per date.

Notice that the dataset has many missing values.

## Let's tidy the dataset!

Same as what we did before: We need identifiers, we need the column names to represent variables (min and max, and date!), and we need the temperature values to be in two columns.


In [None]:
# Let's start first by putting the day values in one column. We will not play with min and max temperatures for now
df = pd.melt(df, id_vars=["id", "year","month","element"], var_name="day_raw")
df.head(10)

In [None]:
# Extracting day
# Assigning to df["day"] adds a new "day" column to the dataframe
df["day"] = df["day_raw"].str.extract(r"d(\d+)", expand=False)
df["id"] = "MX17004"

# Convert year, month, and day to numeric values
# Notice the use of the lambda function to apply one instruction to each of the selected columns
df[["year","month","day"]] = df[["year","month","day"]].apply(lambda x: pd.to_numeric(x))

# Let's define a function to create a date from the different columns. It accepts a row and returns the consolidated date
def create_date_from_year_month_day(row):
    # int() guards against values that arrive as floats or NumPy numbers
    return datetime.datetime(year=int(row["year"]), month=int(row["month"]), day=int(row["day"]))

# Define the date attribute by having the temporary lambda function call the create_date function on every row
df["date"] = df.apply(lambda row: create_date_from_year_month_day(row), axis=1)
# Drop the redundant columns used to compute date
df = df.drop(['year',"month","day", "day_raw"], axis=1)
# Now drop the missing values
df = df.dropna()

# Unmelting column "element": tmax and tmin become separate columns
df = df.pivot_table(index=["id","date"], columns="element", values="value")
df.reset_index(drop=False, inplace=True)
df
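
As an alternative to the row-wise function, pandas can assemble dates directly from columns named year, month, and day. This is a minimal sketch on made-up values, not the weather data itself:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2010, 2010],
    "month": [1, 2],
    "day": [30, 2],
    "value": [27.8, 14.4],
})

# pd.to_datetime builds one date per row from the year, month, and day columns
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]])
print(toy[["date", "value"]])
```

This avoids calling a Python function once per row, which can matter on larger tables.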

## <span style="color:red"> Exercise your tidying muscles! </span>

<span style="color:red"><b> The GapMinder dataset includes information about the life expectancy, the GDP per capita, and the population of various countries between the years 1952 and 2007.</b></span>

<span style="color:red"> <b>Import the dataset, investigate it to identify what the potential attributes should be, the problems with the current structure, and think of how to tidy the dataset, and then proceed to tidy the dataset.</b></span>

In [60]:
df = pd.read_csv("./data/gapminder.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 38 columns):
continent         142 non-null object
country           142 non-null object
gdpPercap_1952    142 non-null float64
gdpPercap_1957    142 non-null float64
gdpPercap_1962    142 non-null float64
gdpPercap_1967    142 non-null float64
gdpPercap_1972    142 non-null float64
gdpPercap_1977    142 non-null float64
gdpPercap_1982    142 non-null float64
gdpPercap_1987    142 non-null float64
gdpPercap_1992    142 non-null float64
gdpPercap_1997    142 non-null float64
gdpPercap_2002    142 non-null float64
gdpPercap_2007    142 non-null float64
lifeExp_1952      142 non-null float64
lifeExp_1957      142 non-null float64
lifeExp_1962      142 non-null float64
lifeExp_1967      142 non-null float64
lifeExp_1972      142 non-null float64
lifeExp_1977      142 non-null float64
lifeExp_1982      142 non-null float64
lifeExp_1987      142 non-null float64
lifeExp_1992      142 non-null float64


In [110]:
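df = pd.read_csv("./data/gapminder.csv")
# Unpivot every column except the identifiers; headers such as "gdpPercap_1952" go into "property_year"
df = pd.melt(df, id_vars=["continent", "country"], var_name="property_year")
# Split each header into the property name (non-digits before "_") and the year (digits after it)
df[["property","year"]] = df["property_year"].str.extract(r'(\D+)_(\d+)', expand=False)
df = df.drop(['property_year'], axis=1)
df.head()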

Unnamed: 0,continent,country,value,property,year
0,Africa,Algeria,2449.008185,gdpPercap,1952
1,Africa,Angola,3520.610273,gdpPercap,1952
2,Africa,Benin,1062.7522,gdpPercap,1952
3,Africa,Botswana,851.241141,gdpPercap,1952
4,Africa,Burkina Faso,543.255241,gdpPercap,1952


In [111]:
# Pivot the "property" values back into columns, giving one row per continent, country, and year
df = df.pivot_table(index=["continent", "country","year"], columns='property', values='value')
df = df.reset_index()
df.head()

property,continent,country,year,gdpPercap,lifeExp,pop
0,Africa,Algeria,1952,2449.008185,43.077,9279525.0
1,Africa,Algeria,1957,3013.976023,45.685,10270856.0
2,Africa,Algeria,1962,2550.81688,48.303,11000948.0
3,Africa,Algeria,1967,3246.991771,51.407,12760499.0
4,Africa,Algeria,1972,4182.663766,54.518,14760787.0


In [112]:
df.columns

Index(['continent', 'country', 'year', 'gdpPercap', 'lifeExp', 'pop'], dtype='object', name='property')

In [114]:
# Clear the "property" label that pivot_table left on the column index
df.columns.name = ''
df.head()

Unnamed: 0,continent,country,year,gdpPercap,lifeExp,pop
0,Africa,Algeria,1952,2449.008185,43.077,9279525.0
1,Africa,Algeria,1957,3013.976023,45.685,10270856.0
2,Africa,Algeria,1962,2550.81688,48.303,11000948.0
3,Africa,Algeria,1967,3246.991771,51.407,12760499.0
4,Africa,Algeria,1972,4182.663766,54.518,14760787.0


In [115]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
continent    1704 non-null object
country      1704 non-null object
year         1704 non-null object
gdpPercap    1704 non-null float64
lifeExp      1704 non-null float64
pop          1704 non-null float64
dtypes: float64(3), object(3)
memory usage: 80.0+ KB
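
The summary above lists year with dtype object, because values captured by a regular expression are strings. A minimal sketch of converting such a column to integers, using a made-up two-row frame:

```python
import pandas as pd

toy = pd.DataFrame({"year": ["1952", "1957"], "pop": [9279525.0, 10270856.0]})

# Strings captured by str.extract keep dtype object;
# astype(int) turns them into integers for sorting or arithmetic
toy["year"] = toy["year"].astype(int)
print(toy.dtypes)
```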


In [124]:
df['year'].value_counts()

1992    142
1972    142
2007    142
1987    142
1967    142
1962    142
1957    142
1977    142
1997    142
1982    142
1952    142
2002    142
Name: year, dtype: int64