## CMPINF 2110 Spring 2021 - Homework 04

### SOLUTION GUIDE

Dr. Joseph P. Yurko

### Overview

This notebook puts the Bob Ross 538 data set into tidy format. We walked through the steps in detail in the 3rd week of the semester, so this notebook only repeats the major steps needed to create the tidy data set. It then "decomposes" the tidy data into the unique tables required by the logical data model.

## Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

## Read and tidy the data

The data are downloaded from the 538 website in the cell below.

In [2]:
data_url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv'

df = pd.read_csv( data_url )

print( df.shape )

(403, 69)


Since this is the same data used in lecture, the solutions will not describe the "tidying" process in depth. The "tidy" steps are given in the cell below.

In [3]:
df_copy = df.copy()

df_copy[['season', 'episode']] = df_copy.EPISODE.str.extract( r'(\d+)[a-zA-Z]+(\d+)' )

df_copy.loc[:, ['EPISODE', 'season', 'episode']]

df_copy['season'] = df_copy.season.astype('int64')
df_copy['episode'] = df_copy.episode.astype('int64')

df_copy.drop(columns=['EPISODE'], inplace=True)
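
As a small illustration of the extraction step, the sketch below applies `.str.extract()` to a toy column. It assumes the raw codes look like `S01E01`; the `toy` DataFrame and the stricter `S...E...` pattern are illustrative and are not the notebook's actual data.

```python
import pandas as pd

# Toy episode codes; the 'S01E01' layout is an assumption about the raw EPISODE column.
toy = pd.DataFrame({'EPISODE': ['S01E01', 'S01E02', 'S10E13']})

# A raw string keeps the backslashes intact; each capture group becomes one column.
parts = toy.EPISODE.str.extract(r'S(\d+)E(\d+)')
parts.columns = ['season', 'episode']

# The captured groups are strings, so cast them before treating them as numbers.
toy = toy.join(parts.astype('int64'))

print(toy)
```

The cast matters because sorting string codes would place `'10'` before `'2'`.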

The artists within the data set are extracted and stored in two data sets below. One provides the *link* for the artist per episode, and the other provides information associated with each unique artist.

In [4]:
people = df_copy.loc[ :, ['season', 'episode', 'DIANE_ANDRE', 'GUEST', 'STEVE_ROSS']].copy()

people['BOB_ROSS'] = np.where( people.GUEST == 0, 1, 0 )

people['OTHER'] = np.where( people.BOB_ROSS + people.STEVE_ROSS + people.DIANE_ANDRE == 0, 1, 0 )

people.drop( columns=['GUEST'], inplace=True )

people_lf = people.melt( id_vars = ['season', 'episode'], 
                        value_vars = ['DIANE_ANDRE', 'OTHER', 'STEVE_ROSS', 'BOB_ROSS'],
                       var_name = 'artist')

artist_per_episode = people_lf.loc[ people_lf.value > 0, : ].copy()

artist_per_episode.sort_values(['season', 'episode'], inplace=True)

artist_per_episode.reset_index(drop=True, inplace=True)

artist_info = artist_per_episode.\
groupby(['artist']).\
aggregate(sum_value = ('value', 'sum'),
         num_episodes = ('episode', 'count'),
         num_seasons = ('season', 'nunique'),
         first_season = ('season', 'min'),
         last_season = ('season', 'max')).\
reset_index()

artist_per_episode.drop(columns=['value'], inplace=True)
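
The indicator-to-long-format pattern used above can be seen on a toy example. This is a minimal sketch with made-up rows and only one named guest column; it is meant to show the mechanics of `np.where()` followed by `.melt()`, not to reproduce the real data.

```python
import numpy as np
import pandas as pd

# Toy indicator columns (hypothetical rows): GUEST == 0 means Bob Ross painted.
toy = pd.DataFrame({
    'season': [1, 1, 1],
    'episode': [1, 2, 3],
    'GUEST': [0, 1, 1],
    'STEVE_ROSS': [0, 1, 0],
})

# Derive the indicators that are implied but not stored in the wide data.
toy['BOB_ROSS'] = np.where(toy.GUEST == 0, 1, 0)
toy['OTHER'] = np.where(toy.BOB_ROSS + toy.STEVE_ROSS == 0, 1, 0)

# Reshape to long format: one row per (episode, candidate artist).
long_form = toy.drop(columns=['GUEST']).melt(
    id_vars=['season', 'episode'], var_name='artist')

# Keep only the indicator that is switched on, leaving one artist per episode.
one_artist = (long_form.loc[long_form.value > 0, ['season', 'episode', 'artist']]
              .sort_values(['season', 'episode'])
              .reset_index(drop=True))

print(one_artist)
```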

Next, we need to tidy the "features" contained in each painting.

In [5]:
features = df_copy.drop(columns = ['DIANE_ANDRE', 'GUEST', 'STEVE_ROSS', 'TITLE']).copy()

features_names = features.columns[ ~features.columns.isin(['season', 'episode']) ]

features_lf = features.melt( id_vars=['season', 'episode'], value_vars=features_names.to_list(), var_name='feature')

present_features = features_lf.loc[ features_lf.value == 1, :].copy()

present_features.sort_values(['season', 'episode'], inplace=True)

present_features.reset_index(drop=True, inplace=True)

episode_info = features_lf.groupby(['season', 'episode']).\
aggregate(possible_features = ('feature', 'nunique'),
         num_features = ('value', 'sum')).\
reset_index()
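
One way to sanity check the long-format counts is to compare them against row sums of the wide data. The sketch below does this on a toy table with hypothetical values. It also shows that an episode with no features drops out once only the present features are kept.

```python
import pandas as pd

# Toy wide table (hypothetical values): episode 2 has no features switched on.
wide = pd.DataFrame({
    'season': [1, 1],
    'episode': [1, 2],
    'TREE': [1, 0],
    'RIVER': [1, 0],
    'SNOW': [0, 0],
})

long_form = wide.melt(id_vars=['season', 'episode'], var_name='feature')

# Count features per episode from the long format ...
per_episode = long_form.groupby(['season', 'episode']).\
aggregate(num_features = ('value', 'sum')).\
reset_index()

# ... and directly from the wide format as a cross-check.
row_sums = wide.drop(columns=['season', 'episode']).sum(axis=1)

# Filtering to present features removes episode 2 entirely.
present = long_form.loc[long_form.value == 1, :]

print(per_episode)
print(present.episode.unique())
```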

In [6]:
present_features

Unnamed: 0,season,episode,feature,value
0,1,1,BUSHES,1
1,1,1,DECIDUOUS,1
2,1,1,GRASS,1
3,1,1,RIVER,1
4,1,1,TREE,1
...,...,...,...,...
3182,31,13,DECIDUOUS,1
3183,31,13,GRASS,1
3184,31,13,MOUNTAIN,1
3185,31,13,TREE,1


In [7]:
tidy_df = pd.merge( present_features, artist_per_episode, on=['season', 'episode'], how='outer')

In [8]:
tidy_df

Unnamed: 0,season,episode,feature,value,artist
0,1,1,BUSHES,1.0,BOB_ROSS
1,1,1,DECIDUOUS,1.0,BOB_ROSS
2,1,1,GRASS,1.0,BOB_ROSS
3,1,1,RIVER,1.0,BOB_ROSS
4,1,1,TREE,1.0,BOB_ROSS
...,...,...,...,...,...
3185,31,13,TREE,1.0,BOB_ROSS
3186,31,13,TREES,1.0,BOB_ROSS
3187,9,10,,,BOB_ROSS
3188,15,4,,,BOB_ROSS
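
The rows with a missing `feature` come from episodes that appear in only one of the two merged tables. A minimal sketch of how to spot such rows uses the `indicator` argument of `pd.merge()`. The two toy DataFrames below are hypothetical stand-ins for `present_features` and `artist_per_episode`.

```python
import pandas as pd

# Toy stand-ins: episode 2 has an artist but no recorded features.
features_long = pd.DataFrame({
    'season': [1, 1],
    'episode': [1, 1],
    'feature': ['TREE', 'RIVER'],
})
artists_long = pd.DataFrame({
    'season': [1, 1],
    'episode': [1, 2],
    'artist': ['BOB_ROSS', 'BOB_ROSS'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
merged = pd.merge(features_long, artists_long,
                  on=['season', 'episode'], how='outer', indicator=True)

# Rows that did not match on both sides are the ones carrying missing values.
unmatched = merged.loc[merged['_merge'] != 'both', :]

print(unmatched)
```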


In [9]:
tidy_df.sort_values(['season', 'episode'], inplace=True)

tidy_df.drop(columns=['value'], inplace=True)

We now have a tidy data set where one row corresponds to one feature in one episode of a season, along with the artist who painted it.

In [10]:
tidy_df

Unnamed: 0,season,episode,feature,artist
0,1,1,BUSHES,BOB_ROSS
1,1,1,DECIDUOUS,BOB_ROSS
2,1,1,GRASS,BOB_ROSS
3,1,1,RIVER,BOB_ROSS
4,1,1,TREE,BOB_ROSS
...,...,...,...,...
3182,31,13,DECIDUOUS,BOB_ROSS
3183,31,13,GRASS,BOB_ROSS
3184,31,13,MOUNTAIN,BOB_ROSS
3185,31,13,TREE,BOB_ROSS


Check that all the seasons and episodes are present.

In [11]:
tidy_df.groupby(['season']).\
aggregate(num_episode = ('episode', 'nunique'),
          num_features = ('feature', 'nunique'),
          num_artists = ('artist', 'nunique')).\
reset_index()

Unnamed: 0,season,num_episode,num_features,num_artists
0,1,13,24,1
1,2,13,22,1
2,3,13,34,2
3,4,13,26,3
4,5,13,26,2
5,6,13,23,1
6,7,13,26,3
7,8,13,29,2
8,9,13,29,1
9,10,13,31,2
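
Rather than reading the summary table by eye, the check can also be done programmatically. The sketch below flags seasons whose episode count differs from an expected value. The toy data and the expected count of 3 are assumptions for illustration; for the Bob Ross data the `num_episode` column above suggests 13.

```python
import pandas as pd

# Toy data (hypothetical): season 2 is missing episode 2.
toy = pd.DataFrame({
    'season': [1, 1, 1, 2, 2],
    'episode': [1, 2, 3, 1, 3],
})

# Assumed value for the toy data; 13 matches the num_episode column shown above.
expected_per_season = 3

counts = toy.groupby('season')['episode'].nunique()

# Any season listed here has too few (or too many) distinct episodes.
incomplete = counts[counts != expected_per_season]

print(incomplete)
```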


## Create tables consistent with data model

Now that we have our tidy data set, it's time to "decompose" it into the tables required by the logical data model. These separate tables allow our database to satisfy the normal forms, which reduce redundancy and limit the chance of entry, alteration, and query errors.

We have already created several of these tables, but we will make them again from the tidy data set.

### Artists table

We already created a table containing the unique set of artists. However, let's make that table again by grouping the `tidy_df` by `artist`. We will apply several summary functions to check we get the same information as those presented in lecture.

In [12]:
artists = tidy_df.groupby(['artist']).\
aggregate(num_rows = ('feature', 'size'),
          num_seasons = ('season', 'nunique'),
          first_season = ('season', 'min'),
          last_season = ('season', 'max')).\
reset_index()

In [13]:
artists

Unnamed: 0,artist,num_rows,num_seasons,first_season,last_season
0,BOB_ROSS,3033,31,1,31
1,DIANE_ANDRE,8,1,4,4
2,OTHER,47,7,4,29
3,STEVE_ROSS,102,11,3,31


The `artists` DataFrame above is almost identical to the `artist_info` DataFrame from earlier in the notebook. Let's now add easier-to-read names for the artists.

In [14]:
artists['artist_name'] = ['Bob Ross', 'Diane Andre', 'Other', 'Steve Ross']
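
Assigning the names as a list works because the grouped rows are in alphabetical order. A slightly more defensive alternative, sketched below on a toy DataFrame, maps each code to its name with a dictionary so the result does not depend on row order. The `name_lookup` dictionary and `artists_toy` are illustrative names, not part of the original solution.

```python
import pandas as pd

# Toy artists table deliberately NOT in alphabetical order.
artists_toy = pd.DataFrame({
    'artist': ['STEVE_ROSS', 'BOB_ROSS', 'OTHER', 'DIANE_ANDRE'],
})

# Hypothetical lookup from the column-style code to a display name.
name_lookup = {
    'BOB_ROSS': 'Bob Ross',
    'DIANE_ANDRE': 'Diane Andre',
    'OTHER': 'Other',
    'STEVE_ROSS': 'Steve Ross',
}

# .map() matches on the value, so row order no longer matters.
artists_toy['artist_name'] = artists_toy['artist'].map(name_lookup)

# Codes missing from the lookup would show up as NaN here.
num_unmapped = artists_toy['artist_name'].isna().sum()

print(artists_toy)
```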

In lecture we included how each artist is related to Bob Ross, but we will not include that step here. For simplicity, we will focus on just the names of the artists. However, let's add an index for each artist.

In [15]:
artists['artist_id'] = artists.index + 1

In [16]:
artists

Unnamed: 0,artist,num_rows,num_seasons,first_season,last_season,artist_name,artist_id
0,BOB_ROSS,3033,31,1,31,Bob Ross,1
1,DIANE_ANDRE,8,1,4,4,Diane Andre,2
2,OTHER,47,7,4,29,Other,3
3,STEVE_ROSS,102,11,3,31,Steve Ross,4


We only need the `artist_id` and `artist_name`, but we will keep the original `artist` variable for now.

In [17]:
artists = artists.loc[:, ['artist_id', 'artist_name', 'artist']].copy()

In [18]:
artists

Unnamed: 0,artist_id,artist_name,artist
0,1,Bob Ross,BOB_ROSS
1,2,Diane Andre,DIANE_ANDRE
2,3,Other,OTHER
3,4,Steve Ross,STEVE_ROSS


### Seasons table

Next, we need to create the `seasons` table which stores the information about each unique season of the show. We simply need to group the `tidy_df` by `season`.

In [19]:
seasons = tidy_df.groupby(['season']).\
aggregate(num_episode = ('episode', 'nunique'),
          num_features = ('feature', 'nunique'),
          num_artists = ('artist', 'nunique')).\
reset_index()

In [20]:
seasons.head()

Unnamed: 0,season,num_episode,num_features,num_artists
0,1,13,24,1
1,2,13,22,1
2,3,13,34,2
3,4,13,26,3
4,5,13,26,2


Since the `season` number is by itself a unique integer we will not define a `season_id` variable. It would be redundant. We do not have any other information associated with the season, so for our purposes the `seasons` table just has a single column `season`.

In [21]:
seasons = seasons.loc[:, ['season']].copy()

In [22]:
seasons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   season  31 non-null     int64
dtypes: int64(1)
memory usage: 376.0 bytes


### Paintings table

Each episode of the show corresponds to a separate painting, so we will refer to each episode as a painting. Each painting is painted by a single artist. The `paintings` table is therefore created by grouping `tidy_df` by `season`, `episode`, and `artist`.

In [23]:
paintings = tidy_df.groupby(['season', 'episode', 'artist']).\
aggregate(num_rows = ('feature', 'size'),
          num_features = ('feature', 'nunique')).\
reset_index()

The `.info()` method shows us that we have 403 rows in the `paintings` table, which is the same number of rows as the original `df` DataFrame that we read in!

In [24]:
paintings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403 entries, 0 to 402
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   season        403 non-null    int64 
 1   episode       403 non-null    int64 
 2   artist        403 non-null    object
 3   num_rows      403 non-null    int64 
 4   num_features  403 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.9+ KB
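
The row count matching the original data is consistent with one artist per painting. A more direct check, sketched below on toy rows, is to confirm that no `(season, episode)` pair appears more than once after grouping. The `paintings_toy` DataFrame is a hypothetical stand-in.

```python
import pandas as pd

# Toy stand-in for the grouped paintings table.
paintings_toy = pd.DataFrame({
    'season': [1, 1, 2],
    'episode': [1, 2, 1],
    'artist': ['BOB_ROSS', 'BOB_ROSS', 'STEVE_ROSS'],
})

# If an episode had two artists, its (season, episode) key would repeat.
has_duplicate_key = paintings_toy.duplicated(subset=['season', 'episode']).any()

print(has_duplicate_key)
```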


Each painting has a title, so we need to join `paintings` with `df_copy` to bring in the `title`. As a reminder, `df_copy` includes the `season` and `episode` columns. We will join the two DataFrames using those two variables as a composite key.

In [25]:
df_copy.loc[:, ['season', 'episode', 'TITLE']]

Unnamed: 0,season,episode,TITLE
0,1,1,"""A WALK IN THE WOODS"""
1,1,2,"""MT. MCKINLEY"""
2,1,3,"""EBONY SUNSET"""
3,1,4,"""WINTER MIST"""
4,1,5,"""QUIET STREAM"""
...,...,...,...
398,31,9,"""EVERGREEN VALLEY"""
399,31,10,"""BALMY BEACH"""
400,31,11,"""LAKE AT THE RIDGE"""
401,31,12,"""IN THE MIDST OF WINTER"""


In [26]:
paintings = paintings.merge( df_copy.loc[:, ['season', 'episode', 'TITLE']].copy(),
                            on=['season', 'episode'],
                            how = 'left')
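
Because `season` and `episode` are meant to act as a composite key, the join should be one-to-one. As a hedged sketch, the `validate` argument of `.merge()` can enforce that assumption and raise an error if either side has duplicate keys. The toy DataFrames below are illustrative; the titles are taken from the output shown above.

```python
import pandas as pd

paintings_toy = pd.DataFrame({
    'season': [1, 1],
    'episode': [1, 2],
    'artist': ['BOB_ROSS', 'BOB_ROSS'],
})
titles_toy = pd.DataFrame({
    'season': [1, 1],
    'episode': [1, 2],
    'TITLE': ['"A WALK IN THE WOODS"', '"MT. MCKINLEY"'],
})

# validate='one_to_one' checks that the key is unique in both tables.
joined = paintings_toy.merge(titles_toy, on=['season', 'episode'],
                             how='left', validate='one_to_one')

# With a duplicated key on the right side the same call raises MergeError.
dup_titles = pd.concat([titles_toy, titles_toy.iloc[[0]]], ignore_index=True)
try:
    paintings_toy.merge(dup_titles, on=['season', 'episode'],
                        how='left', validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

print(joined)
print(raised)
```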

In [27]:
paintings

Unnamed: 0,season,episode,artist,num_rows,num_features,TITLE
0,1,1,BOB_ROSS,6,6,"""A WALK IN THE WOODS"""
1,1,2,BOB_ROSS,9,9,"""MT. MCKINLEY"""
2,1,3,BOB_ROSS,10,10,"""EBONY SUNSET"""
3,1,4,BOB_ROSS,8,8,"""WINTER MIST"""
4,1,5,BOB_ROSS,5,5,"""QUIET STREAM"""
...,...,...,...,...,...,...
398,31,9,BOB_ROSS,8,8,"""EVERGREEN VALLEY"""
399,31,10,BOB_ROSS,7,7,"""BALMY BEACH"""
400,31,11,STEVE_ROSS,10,10,"""LAKE AT THE RIDGE"""
401,31,12,BOB_ROSS,10,10,"""IN THE MIDST OF WINTER"""


Next, we need to merge in the `artist_id` so that we can identify who painted each painting with an integer instead of a character string.

In [28]:
paintings = paintings.merge( artists.loc[:, ['artist', 'artist_id']].copy(), on='artist', how='left')

Let's add in the unique identifier for the painting, `painting_id`. This identifier represents a composition of `season` and `episode`, but will allow us to point to a single column to uniquely define the rows of the `paintings` DataFrame.

In [29]:
paintings['painting_id'] = paintings.index + 1

Lastly, we only need to keep the `painting_id`, `season`, `episode`, `artist_id`, and `TITLE` columns. We will also rename the `TITLE` column to `title` to be consistent with the lowercase column names used throughout this notebook.

In [30]:
paintings = paintings.loc[:, ['painting_id', 'season', 'episode', 'artist_id', 'TITLE']].copy()

paintings.rename(columns={'TITLE': 'title'}, inplace=True)

As shown below, the `paintings` DataFrame still has 403 rows, but now only contains a limited set of integers and the painting title.

In [31]:
paintings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 403 entries, 0 to 402
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   painting_id  403 non-null    int64 
 1   season       403 non-null    int64 
 2   episode      403 non-null    int64 
 3   artist_id    403 non-null    int64 
 4   title        403 non-null    object
dtypes: int64(4), object(1)
memory usage: 18.9+ KB


In [32]:
paintings.nunique()

painting_id    403
season          31
episode         13
artist_id        4
title          401
dtype: int64
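
The `.nunique()` output shows fewer unique titles than paintings, so some titles repeat. A minimal sketch for listing repeated values uses `.duplicated(keep=False)`. The toy table and its placeholder titles below are made up for illustration.

```python
import pandas as pd

# Toy table with placeholder titles; 'TITLE A' appears twice.
toy = pd.DataFrame({
    'painting_id': [1, 2, 3, 4],
    'title': ['TITLE A', 'TITLE B', 'TITLE A', 'TITLE C'],
})

# keep=False marks every occurrence of a repeated title, not just the later ones.
repeated = toy.loc[toy.title.duplicated(keep=False), :]

print(repeated)
```

Repeated titles are one reason `title` cannot serve as the key for the `paintings` table.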

In [33]:
paintings

Unnamed: 0,painting_id,season,episode,artist_id,title
0,1,1,1,1,"""A WALK IN THE WOODS"""
1,2,1,2,1,"""MT. MCKINLEY"""
2,3,1,3,1,"""EBONY SUNSET"""
3,4,1,4,1,"""WINTER MIST"""
4,5,1,5,1,"""QUIET STREAM"""
...,...,...,...,...,...
398,399,31,9,1,"""EVERGREEN VALLEY"""
399,400,31,10,1,"""BALMY BEACH"""
400,401,31,11,4,"""LAKE AT THE RIDGE"""
401,402,31,12,1,"""IN THE MIDST OF WINTER"""


Since the `paintings` DataFrame now contains the `artist_id` unique identifier, we no longer need the `artist` column in the `artists` DataFrame.

In [34]:
artists.drop(columns=['artist'], inplace=True)

In [35]:
artists

Unnamed: 0,artist_id,artist_name
0,1,Bob Ross
1,2,Diane Andre
2,3,Other
3,4,Steve Ross


### Features table

The `features` table is created by grouping the `tidy_df` by the `feature` column.

In [36]:
features = tidy_df.groupby(['feature']).\
aggregate(num_rows = ('episode', 'size'),
          num_seasons = ('season', 'nunique'),
          num_episodes = ('episode', 'nunique'),
          num_artists = ('artist', 'nunique')).\
reset_index()

In [37]:
features

Unnamed: 0,feature,num_rows,num_seasons,num_episodes,num_artists
0,APPLE_FRAME,1,1,1,1
1,AURORA_BOREALIS,2,2,2,1
2,BARN,17,13,10,1
3,BEACH,27,23,11,1
4,BOAT,2,2,2,1
...,...,...,...,...,...
58,WAVES,34,27,12,2
59,WINDMILL,1,1,1,1
60,WINDOW_FRAME,1,1,1,1
61,WINTER,69,29,13,3


Let's add in the unique ID for the feature.

In [38]:
features['feature_id'] = features.index + 1

We only need the `feature_id` and `feature` columns in the `features` table.

In [39]:
features = features.loc[:, ['feature_id', 'feature']].copy()

In [40]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   feature_id  63 non-null     int64 
 1   feature     63 non-null     object
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


In [41]:
features

Unnamed: 0,feature_id,feature
0,1,APPLE_FRAME
1,2,AURORA_BOREALIS
2,3,BARN
3,4,BEACH
4,5,BOAT
...,...,...
58,59,WAVES
59,60,WINDMILL
60,61,WINDOW_FRAME
61,62,WINTER


### Features in paintings table

Lastly, we need to create the features in paintings link table which lets us know which feature is associated with each painting. The `tidy_df` already contains this information, but we want the `features_in_paintings` table to only consist of unique IDs, `painting_id` and `feature_id`.

Let's go ahead and merge the `paintings` table with the `tidy_df` so we can bring in the `painting_id` column. We will also chain together a merge with the `features` table, so we can bring in the `feature_id` column.

In [42]:
features_in_paintings = tidy_df.merge( paintings.loc[:, ['painting_id', 'season', 'episode']].copy(), 
                                       on=['season', 'episode'],
                                       how='left').\
merge( features, on='feature', how='left')

In [43]:
features_in_paintings

Unnamed: 0,season,episode,feature,artist,painting_id,feature_id
0,1,1,BUSHES,BOB_ROSS,1,8.0
1,1,1,DECIDUOUS,BOB_ROSS,1,17.0
2,1,1,GRASS,BOB_ROSS,1,27.0
3,1,1,RIVER,BOB_ROSS,1,46.0
4,1,1,TREE,BOB_ROSS,1,55.0
...,...,...,...,...,...,...
3185,31,13,DECIDUOUS,BOB_ROSS,403,17.0
3186,31,13,GRASS,BOB_ROSS,403,27.0
3187,31,13,MOUNTAIN,BOB_ROSS,403,35.0
3188,31,13,TREE,BOB_ROSS,403,55.0


The `feature_id` variable was converted to a float because of the missing values for `feature`: an integer column cannot hold `NaN`, so pandas falls back to a float. Let's confirm that the number of missing values for `feature_id` is the same as the number of missing values for `feature`.

In [44]:
features_in_paintings.isna().sum()

season         0
episode        0
feature        3
artist         0
painting_id    0
feature_id     3
dtype: int64

And let's confirm the rows with the missing values for `feature` are the same rows with missing values for `feature_id`.

In [45]:
features_in_paintings.loc[ features_in_paintings.feature.isna(), :]

Unnamed: 0,season,episode,feature,artist,painting_id,feature_id
889,9,10,,BOB_ROSS,114,
1526,15,4,,BOB_ROSS,186,
2675,26,10,,BOB_ROSS,335,
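
If an integer `feature_id` is preferred, there are at least two possible approaches, sketched below on a toy link table. Which one is appropriate depends on the data model, so both are shown as options and not as the required solution. The `link` DataFrame is a hypothetical stand-in.

```python
import numpy as np
import pandas as pd

# Toy link table: painting 2 has no recorded feature.
link = pd.DataFrame({
    'painting_id': [1, 1, 2],
    'feature_id': [8.0, 17.0, np.nan],
})

# Option A: keep every row and use the nullable integer dtype (capital 'I').
link_nullable = link.astype({'feature_id': 'Int64'})

# Option B: drop rows without a feature, then use a plain integer dtype.
link_dropped = link.dropna(subset=['feature_id']).astype({'feature_id': 'int64'})

print(link_nullable.dtypes)
print(link_dropped)
```

Option A preserves the fact that a painting exists with no features, while Option B keeps the link table limited to actual painting and feature pairs.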


The `features_in_paintings` link table only needs to store the `painting_id` and `feature_id` columns.

In [46]:
features_in_paintings = features_in_paintings.loc[:, ['painting_id', 'feature_id']].copy()

In [47]:
features_in_paintings

Unnamed: 0,painting_id,feature_id
0,1,8.0
1,1,17.0
2,1,27.0
3,1,46.0
4,1,55.0
...,...,...
3185,403,17.0
3186,403,27.0
3187,403,35.0
3188,403,55.0


### Save the tables

In [48]:
seasons.to_csv('seasons_table.csv', index=False, na_rep='NULL')

artists.to_csv('artists_table.csv', index=False, na_rep='NULL')

paintings.to_csv('paintings_table.csv', index=False, na_rep='NULL')

features.to_csv('features_table.csv', index=False, na_rep='NULL')

features_in_paintings.to_csv('features_in_paintings_table.csv', index=False, na_rep='NULL')
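
A useful check on a decomposition is that joining the tables back together recovers the tidy data. The sketch below rebuilds a tidy view from small toy versions of the four tables. The toy rows and placeholder titles are made up, but the column names follow the tables created above.

```python
import pandas as pd

# Toy versions of the normalized tables (hypothetical rows).
artists = pd.DataFrame({'artist_id': [1, 2],
                        'artist_name': ['Bob Ross', 'Steve Ross']})
paintings = pd.DataFrame({'painting_id': [1, 2],
                          'season': [1, 1],
                          'episode': [1, 2],
                          'artist_id': [1, 2],
                          'title': ['TITLE A', 'TITLE B']})
features = pd.DataFrame({'feature_id': [1, 2],
                         'feature': ['RIVER', 'TREE']})
features_in_paintings = pd.DataFrame({'painting_id': [1, 1, 2],
                                      'feature_id': [1, 2, 2]})

# Follow the keys outward from the link table to rebuild one row per feature.
rebuilt = features_in_paintings.\
merge(paintings, on='painting_id', how='left').\
merge(features, on='feature_id', how='left').\
merge(artists, on='artist_id', how='left').\
loc[:, ['season', 'episode', 'feature', 'artist_name']]

print(rebuilt)
```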