*  DSC 540-T302 Data Preparation
*  Week 7 & 8 Exercise
*  Peter Lozano

# Cleaning and Transforming Data

We have two different data sources that we can use for this exercise:
1.   [So Much Candy Data Seriously](https://www.scq.ubc.ca/so-much-candy-data-seriously/)
2.   [The Metropolitan Museum of Art Open Access](https://github.com/metmuseum/openaccess/)

I need to apply **8** different data cleaning and transformation methods to at least one of these datasets, choosing at least 2 methods from each chapter (Chapters 7, 8, 10, and 11) in the textbook.

## Import libraries

In [1]:
import pandas as pd

## Load dataset

In [21]:
# Loading all candy datasets
candy_data = pd.read_excel('Weeks 7 & 8 Data/CANDYDATA.xlsx')
candy_hierarchy_2015 = pd.read_excel('Weeks 7 & 8 Data/CANDY-HIERARCHY-2015-SURVEY-Responses.xlsx')
candy_hierarchy_2016 = pd.read_excel('Weeks 7 & 8 Data/BOING-BOING-CANDY-HIERARCHY-2016-SURVEY-Responses.xlsx')
candy_hierarchy_2017 = pd.read_excel('Weeks 7 & 8 Data/candyhierarchy2017.xlsx')



### Print the first 5 rows of each dataset to understand their structure and null values.

In [32]:
candy_data.head()

Unnamed: 0,ITEM,JOY,DESPAIR,NET FEELIES,NET CLOUT,DESPAIR (NEG)
0,York Peppermint Patties,634,78,556.0,1.639118,-78.0
1,Whole Wheat anything,21,419,-398.0,1.012938,-419.0
2,White Bread,15,473,-458.0,1.12344,-473.0
3,Vicodin,323,210,113.0,1.227036,-210.0
4,Twix,770,26,744.0,1.832497,-26.0


`candy_data` is a combined dataset from the 2015, 2016, and 2017 candy surveys. It is small enough that all of its columns fit on screen. What I'm looking for here are any obvious null values or structural issues.

In [37]:
candy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ITEM           86 non-null     object 
 1   JOY            87 non-null     int64  
 2   DESPAIR        87 non-null     int64  
 3   NET FEELIES    86 non-null     float64
 4   NET CLOUT      86 non-null     float64
 5   DESPAIR (NEG)  86 non-null     float64
dtypes: float64(3), int64(2), object(1)
memory usage: 4.2+ KB


I read the output of `.info()` by comparing the total number of entries [**87**] with the non-null count for each column. If a column's non-null count is less than 87, that column contains null values. Here, `ITEM`, `NET FEELIES`, `NET CLOUT`, and `DESPAIR (NEG)` each show 86 non-null values, so each has one null.
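The same comparison can be made explicit by counting the nulls per column directly. Below is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration, not the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for candy_data: 3 rows, with the second row
# missing both ITEM and NET FEELIES
toy = pd.DataFrame({
    "ITEM": ["Twix", None, "Vicodin"],
    "JOY": [770, 15, 323],
    "NET FEELIES": [744.0, np.nan, 113.0],
})

# Total rows minus the non-null count gives the nulls in each column,
# which is the arithmetic done by eye on the .info() output
null_counts = len(toy) - toy.count()

# isnull().sum() arrives at the same per-column counts in one step
null_counts_direct = toy.isnull().sum()

# Show only the columns that actually contain nulls
print(null_counts[null_counts > 0])
```

Either form avoids scanning the `.info()` table manually, which becomes more useful as the number of columns grows.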

Now, I will check the remaining datasets for null values and structural issues.

In [33]:
candy_hierarchy_2015.head()

Unnamed: 0,Timestamp,How old are you?,Are you going actually going trick or treating yourself?,[Butterfinger],[100 Grand Bar],[Anonymous brown globs that come in black and orange wrappers],[Any full-sized candy bar],[Black Jacks],[Bonkers],[Bottle Caps],...,[Necco Wafers],"Which day do you prefer, Friday or Sunday?",Please estimate the degrees of separation you have from the following folks [Bruce Lee],Please estimate the degrees of separation you have from the following folks [JK Rowling],Please estimate the degrees of separation you have from the following folks [Malala Yousafzai],Please estimate the degrees of separation you have from the following folks [Thom Yorke],Please estimate the degrees of separation you have from the following folks [JJ Abrams],Please estimate the degrees of separation you have from the following folks [Hillary Clinton],Please estimate the degrees of separation you have from the following folks [Donald Trump],Please estimate the degrees of separation you have from the following folks [Beyoncé Knowles]
0,2015-10-23 08:46:20.451,35,No,JOY,,DESPAIR,JOY,,,,...,,,,,,,,,,
1,2015-10-23 08:46:51.583,41,No,JOY,JOY,DESPAIR,JOY,DESPAIR,DESPAIR,JOY,...,DESPAIR,,,,,,,,,
2,2015-10-23 08:47:34.285,33,No,DESPAIR,DESPAIR,DESPAIR,JOY,DESPAIR,DESPAIR,DESPAIR,...,DESPAIR,,,,,,,,,
3,2015-10-23 08:47:58.964,31,No,JOY,JOY,DESPAIR,JOY,DESPAIR,DESPAIR,JOY,...,DESPAIR,,,,,,,,,
4,2015-10-23 08:48:11.719,30,No,,JOY,DESPAIR,JOY,,,,...,,,,,,,,,,


This is a wide dataset with many columns. Therefore, I will use the `verbose=True` and `show_counts=True` parameters in the `.info()` function to see all columns and their non-null counts.

In [None]:
candy_hierarchy_2015.info(
    # Show all columns
    verbose=True,
    # Show non-null counts
    show_counts=True
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 124 columns):
 #    Column                                                                                                             Non-Null Count  Dtype         
---   ------                                                                                                             --------------  -----         
 0    Timestamp                                                                                                          5630 non-null   datetime64[ns]
 1    How old are you?                                                                                                   5431 non-null   object        
 2    Are you going actually going trick or treating yourself?                                                           5630 non-null   object        
 3     [Butterfinger]                                                                                                    5247 non-nu

From the 2015 hierarchy dataset, I can see that there are a total of 124 columns. Listing every column is not practical in a Jupyter Notebook, so for wide datasets a summary count is more useful than the full `.info()` output.

Since the remaining datasets are also wide, I will count the columns that contain at least one null value by calling `.isnull().any()` on the DataFrame and selecting the columns that return `True`. Subtracting that count from the total number of columns gives the number of columns without any null values.
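Because the same count is needed for more than one dataset, the logic could be wrapped in a small helper. This is only a sketch: the function name `summarize_null_columns` and the `toy` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_null_columns(df):
    """Return (columns without nulls, columns with at least one null, total columns)."""
    has_null = df.isnull().any()      # one boolean per column
    with_nulls = int(has_null.sum())  # True counts as 1
    total = len(df.columns)
    return total - with_nulls, with_nulls, total

# Hypothetical stand-in for a wide survey frame
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-10-24", "2016-10-25"]),
    "Your gender:": ["Male", None],
    "[Twix]": ["JOY", np.nan],
})

complete, incomplete, total = summarize_null_columns(toy)
print(f"{complete} complete, {incomplete} with nulls, {total} total")
```

Calling the helper once per DataFrame would replace the repeated cell-by-cell arithmetic.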

In [34]:
candy_hierarchy_2016.head()

Unnamed: 0,Timestamp,Are you going actually going trick or treating yourself?,Your gender:,How old are you?,Which country do you live in?,"Which state, province, county do you live in?",[100 Grand Bar],[Anonymous brown globs that come in black and orange wrappers],[Any full-sized candy bar],[Black Jacks],...,Please estimate the degree(s) of separation you have from the following celebrities [JK Rowling],Please estimate the degree(s) of separation you have from the following celebrities [JJ Abrams],Please estimate the degree(s) of separation you have from the following celebrities [Beyoncé],Please estimate the degree(s) of separation you have from the following celebrities [Bieber],Please estimate the degree(s) of separation you have from the following celebrities [Kevin Bacon],Please estimate the degree(s) of separation you have from the following celebrities [Francis Bacon (1561 - 1626)],"Which day do you prefer, Friday or Sunday?","Do you eat apples the correct way, East to West (side to side) or do you eat them like a freak of nature, South to North (bottom to top)?","When you see the above image of the 4 different websites, which one would you most likely check out (please be honest).",[York Peppermint Patties] Ignore
0,2016-10-24 05:09:23.033,No,Male,22,Canada,Ontario,JOY,DESPAIR,JOY,MEH,...,3 or higher,2,3 or higher,3 or higher,3 or higher,3 or higher,Friday,South to North,Science: Latest News and Headlines,
1,2016-10-24 05:09:54.798,No,Male,45,usa,il,MEH,MEH,JOY,JOY,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Friday,East to West,Science: Latest News and Headlines,
2,2016-10-24 05:13:06.734,No,Female,48,US,Colorado,JOY,DESPAIR,JOY,MEH,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Sunday,East to West,Science: Latest News and Headlines,
3,2016-10-24 05:14:17.192,No,Male,57,usa,il,JOY,MEH,JOY,MEH,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Sunday,South to North,Science: Latest News and Headlines,
4,2016-10-24 05:14:24.625,Yes,Male,42,USA,South Dakota,MEH,DESPAIR,JOY,DESPAIR,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Friday,East to West,ESPN,


In [None]:
# Columns in candy_hierarchy_2016 that contain at least one null value
null_columns = candy_hierarchy_2016.columns[candy_hierarchy_2016.isnull().any()]

# Number of columns without any null values
number_of_non_nulls = len(candy_hierarchy_2016.columns) - len(null_columns)

print(f'There are \033[1m{number_of_non_nulls}\033[0m non-null columns and \033[1m{len(null_columns)}\033[0m columns with null values out of a total of \033[1m{len(candy_hierarchy_2016.columns)}\033[0m columns.')

There are 2 non-null columns and 121 columns with null values out of a total of 123 columns.


In [35]:
candy_hierarchy_2017.head()

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q8: DESPAIR OTHER,Q9: OTHER COMMENTS,Q10: DRESS,Unnamed: 113,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo],"Click Coordinates (x, y)"
0,90258773,,,,,,,,,,...,,,,,,,,,,
1,90272821,No,Male,44.0,USA,NM,MEH,DESPAIR,JOY,MEH,...,,Bottom line is Twix is really the only candy w...,White and gold,,Sunday,,1.0,,,"(84, 25)"
2,90272829,,Male,49.0,USA,Virginia,,,,,...,,,,,,,,,,
3,90272840,No,Male,40.0,us,or,MEH,DESPAIR,JOY,MEH,...,,Raisins can go to hell,White and gold,,Sunday,,1.0,,,"(75, 23)"
4,90272841,No,Male,23.0,usa,exton pa,JOY,DESPAIR,JOY,DESPAIR,...,,,White and gold,,Friday,,1.0,,,"(70, 10)"


In [67]:
# Columns in candy_hierarchy_2017 that contain at least one null value
null_columns = candy_hierarchy_2017.columns[candy_hierarchy_2017.isnull().any()]

# Number of columns without any null values
number_of_non_nulls = len(candy_hierarchy_2017.columns) - len(null_columns)

print(f'There are \033[1m{number_of_non_nulls}\033[0m non-null columns and \033[1m{len(null_columns)}\033[0m columns with null values out of a total of \033[1m{len(candy_hierarchy_2017.columns)}\033[0m columns.')

There are 1 non-null columns and 119 columns with null values out of a total of 120 columns.


## Chapter 7 #1: Filter out missing data

To find the columns that do or do not contain null values, I can use the following code:

In [None]:
# Columns without null values in candy_hierarchy_2017
candy_hierarchy_2017.columns[~candy_hierarchy_2017.isnull().any()].tolist()

['Internal ID']

I used the `~` operator to invert the boolean values returned by `.isnull().any()`, so that I get `True` for columns without null values. Then, I used `.tolist()` to convert the resulting index object to a list of column names.

Removing the `~` operator will give me the columns that do contain null values.
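Once the columns are identified, the missing data can also be filtered out with `dropna()`. The sketch below uses a made-up `toy` frame (an assumption; the column names only imitate the 2017 survey) to show both selections and two common `dropna()` calls.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a survey frame
toy = pd.DataFrame({
    "Internal ID": [1, 2, 3],
    "Q3: AGE": [44.0, np.nan, 40.0],
    "Q9: OTHER COMMENTS": [np.nan, np.nan, np.nan],
})

# Columns with no nulls at all, and columns with at least one null
complete_cols = toy.columns[~toy.isnull().any()].tolist()
incomplete_cols = toy.columns[toy.isnull().any()].tolist()

# Drop only the columns that are entirely empty
no_empty_cols = toy.dropna(axis=1, how="all")

# Then keep only the rows that have a value in every remaining column
complete_rows = no_empty_cols.dropna(axis=0, how="any")

print(complete_cols)
print(incomplete_cols)
print(complete_rows)
```

Using `how="all"` for columns is the conservative choice here: with survey data, dropping every column that has any null would leave almost nothing, as the counts above suggest.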

## Chapter 7 #2: Replace values

Since I'm working with null values, I might as well see what I can do to fill those null values with meaningful data.

Using the `candy_hierarchy_2016` dataset, I will replace null values in the `Your gender:` column with the string "Not Specified".

In [82]:
# Filling null values in the 'Your gender:' column with 'Not Specified'
candy_hierarchy_2016.fillna({'Your gender:': 'Not Specified'}, inplace=True)

# Counting the number of cases where gender was filled
count_of_filled = len(candy_hierarchy_2016.loc[candy_hierarchy_2016["Your gender:"] == "Not Specified"])

print(f'There were \033[1m{count_of_filled}\033[0m rows where gender was filled with "Not Specified".')

There were 9 rows where gender was filled with "Not Specified".
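A possible variation, sketched on made-up data: counting the nulls before filling keeps the tally from including any answers that already read "Not Specified", and `.replace()` with a dictionary can map inconsistent spellings to one label. The `toy` frame and the `country_map` mapping are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; column names follow the 2016 survey, values are made up
toy = pd.DataFrame({
    "Your gender:": ["Male", np.nan, "Female", np.nan],
    "Which country do you live in?": ["usa", "US", "Canada", "USA"],
})

# Count the nulls first so the tally reflects only the values about to be filled
missing_before = int(toy["Your gender:"].isnull().sum())
toy = toy.fillna({"Your gender:": "Not Specified"})

# Map spelling variants of one country to a single label (illustrative mapping)
country_map = {"usa": "USA", "US": "USA"}
country_col = "Which country do you live in?"
toy[country_col] = toy[country_col].replace(country_map)

print(missing_before)
print(toy[country_col].unique().tolist())
```

The country column in the real survey likely has many more variants than this mapping covers, so the dictionary would need to be built from the actual unique values.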


## Chapter 8 #1: Create hierarchical index