
"""
M3 - WEEK 5 | PROJECT: Make your data shine!

    Due 6 Mar by 21:59 

Many datasets you have seen so far were nice and clean, and ready to be analysed. 
But that's not the case in the real world, where data is often messy, with lots of information missing, 
wrong data types, etc. Before you can start working on that kind of dataset, you need to make it tidy.

In this module project your task is to pick a dataset from the link below and do the following:

    load it into Python using an appropriate library (pandas, sqlite3, etc.)
    understand the issues (take a look at the issues section for each dataset on the given URL)
    clean the data (take care of outliers, missing values, data types, etc.)
    provide explanations for all steps you took while cleaning the data
    explore and visualize your data

You'll be working in groups of two in this project.
Please go ahead and form your own groups, and remember that each member of the group needs to submit the project.
Also remember to list the group members on your project when submitting it.

Group formation sheet is available here.

Submit your work as a Jupyter Notebook with all the code and narrative. 

URL for the data sets: https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/ 

 
As part of this project, you are also expected to submit the following:

1- Self-evaluation 

2- Peer evaluation 

To be able to do the peer-evaluation you must first submit your project on Canvas.

"""

### Dataset chosen from: <br> https://www.scq.ubc.ca/so-much-candy-data-seriously/

# CANDY (CRUSH) TIME

In [3]:
# dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# magic command to render plots inline in the notebook, just in case
%matplotlib inline

In [4]:
# use the read_csv() function to read in the dataset from a file 
# then store it in a DataFrame
# File presented some encoding issues that needed to be sorted at loading stage
candy17_df = pd.read_csv("candyhierarchy2017.csv", encoding="latin-1") #read file into DF
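# A minimal sketch of why the encoding argument matters at load time. The bytes below are a made-up toy sample, not taken from candyhierarchy2017.csv, and the names raw, utf8_ok and toy_df are hypothetical. It only illustrates the general pattern: bytes written as latin-1 can fail to decode as UTF-8 (the pandas default), while reading them with encoding="latin-1" works.

```python
import io

import pandas as pd

# Toy CSV content (an assumption for illustration), encoded as latin-1.
raw = "Q4: COUNTRY,Q6 | Candy\nCanada,Caf\u00e9 bar\n".encode("latin-1")

# The accented character is a single byte in latin-1 and is not valid UTF-8 here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding lets pandas read the same bytes without errors.
toy_df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print("decodes as UTF-8:", utf8_ok)
print(toy_df)
```

# latin-1 maps every possible byte to a character, so it never raises a decode error; that also means it can silently produce wrong characters if the file was actually written in another encoding, so it is worth eyeballing a few text columns after loading.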

In [5]:
# EXPLORING THE DATASET
# Checking loaded df
candy17_df.head(3)

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q8: DESPAIR OTHER,Q9: OTHER COMMENTS,Q10: DRESS,Unnamed: 113,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo],"Click Coordinates (x, y)"
0,90258773,,,,,,,,,,...,,,,,,,,,,
1,90272821,No,Male,44.0,USA,NM,MEH,DESPAIR,JOY,MEH,...,,Bottom line is Twix is really the only candy w...,White and gold,,Sunday,,1.0,,,"(84, 25)"
2,90272829,,Male,49.0,USA,Virginia,,,,,...,,,,,,,,,,


In [8]:
# Checking nulls/NaN, df shape and datatypes
candy17_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2460 entries, 0 to 2459
Columns: 120 entries, Internal ID to Click Coordinates (x, y)
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB


#### Dropping columns: Internal ID, Q7, Q8, Q9, Q10, Unnamed: 113 and Q12  
####  We can drop them now and recover them from candy17_df later if we need them

In [32]:
# dropping columns whose information is not useful for this analysis
# df.drop(labels, axis="columns") returns a new DataFrame, which we assign to a new variable
candy17_c_df = candy17_df.drop(["Internal ID", "Q7: JOY OTHER", "Q8: DESPAIR OTHER", "Q9: OTHER COMMENTS", "Q10: DRESS", "Unnamed: 113", "Q12: MEDIA [Daily Dish]", "Q12: MEDIA [Science]", "Q12: MEDIA [ESPN]", "Q12: MEDIA [Yahoo]"] , axis="columns")
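# A possible alternative to typing every column name, sketched on a toy DataFrame: collect the columns that share a prefix and drop them together. The frame and the names toy_df, media_cols and toy_c_df are hypothetical; only the column labels are borrowed from the survey.

```python
import pandas as pd

# Empty toy frame with a few survey-style column labels (illustrative only).
toy_df = pd.DataFrame(
    columns=[
        "Internal ID",
        "Q3: AGE",
        "Q6 | Twix",
        "Q12: MEDIA [Science]",
        "Q12: MEDIA [ESPN]",
    ]
)

# Pick up every Q12 column by its shared prefix instead of listing each one.
media_cols = [col for col in toy_df.columns if col.startswith("Q12: MEDIA")]

# drop() returns a new DataFrame; toy_df itself is left unchanged.
toy_c_df = toy_df.drop(columns=["Internal ID"] + media_cols)

print(list(toy_c_df.columns))
```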

In [31]:
# candy17_c_df = candy17_df.drop(["Q12: MEDIA [Daily Dish]", "Q12: MEDIA [Science]", "Q12: MEDIA [ESPN]", "Q12: MEDIA [Yahoo]"] , axis="columns")

In [33]:
candy17_c_df.head(3)

Unnamed: 0,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),...,Q6 | Trail Mix,Q6 | Twix,"Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein",Q6 | Vicodin,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q11: DAY,"Click Coordinates (x, y)"
0,,,,,,,,,,,...,,,,,,,,,,
1,No,Male,44.0,USA,NM,MEH,DESPAIR,JOY,MEH,DESPAIR,...,DESPAIR,JOY,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,Sunday,"(84, 25)"
2,,Male,49.0,USA,Virginia,,,,,,...,,,,,,,,,,


In [37]:
# outputs the number of missing values in each column, in descending order
candy17_c_df.isnull().sum().sort_values(ascending=False) # sum() counts each True (missing) as 1 and each False as 0

Q6 | JoyJoy (Mit Iodine!)               1026
Q6 | Maynards                           1024
Q6 | Reggie Jackson Bar                 1014
Q6 | Bonkers (the board game)           1006
Q6 | Sweetums (a friend to diabetes)    1002
                                        ... 
Q1: GOING OUT?                           110
Q5: STATE, PROVINCE, COUNTY, ETC         100
Q3: AGE                                   84
Q4: COUNTRY                               64
Q2: GENDER                                41
Length: 110, dtype: int64
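
# Raw counts are easier to judge next to the share of rows they represent. A small sketch on a toy DataFrame (the data and the names toy_df, missing_count and missing_share are made up for illustration): isnull().mean() gives the fraction of missing values per column, because the mean of a boolean column is the proportion of True values.

```python
import pandas as pd

# Toy data with a known number of gaps per column (illustrative only).
toy_df = pd.DataFrame(
    {
        "a": [1.0, None, None, 4.0],
        "b": ["x", None, "y", "z"],
        "c": [1, 2, 3, 4],
    }
)

# Number of missing values per column, largest first.
missing_count = toy_df.isnull().sum().sort_values(ascending=False)

# Fraction of missing values per column, largest first.
missing_share = toy_df.isnull().mean().sort_values(ascending=False)

print(missing_count)
print(missing_share)
```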

### What are we going to do with the missing values?
### Maybe drop the rows whose missing-value count is above a threshold (100 or 500)
### And then fill in the others?
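
# One way the "drop, then fill" idea could look, sketched on a toy DataFrame. Everything here is an assumption for illustration: the toy data, the threshold max_missing_per_row, and the fill labels "Unknown" and "NO ANSWER" are hypothetical choices, not decisions made for the candy survey.

```python
import pandas as pd

# Toy survey-style data (illustrative only).
toy_df = pd.DataFrame(
    {
        "Q2: GENDER": ["Male", None, "Female", None],
        "Q3: AGE": [44.0, None, 49.0, 30.0],
        "Q6 | Twix": ["JOY", None, None, "MEH"],
    }
)

# Hypothetical threshold: keep rows with at most this many missing values.
max_missing_per_row = 1

# Count missing values per row (axis=1) and keep the rows under the threshold.
missing_per_row = toy_df.isnull().sum(axis=1)
kept_df = toy_df[missing_per_row <= max_missing_per_row]

# Fill the remaining gaps with explicit placeholder labels, column by column.
filled_df = kept_df.fillna({"Q2: GENDER": "Unknown", "Q6 | Twix": "NO ANSWER"})

print(filled_df)
```

# Filtering on a per-row count keeps the threshold visible and easy to tune; dropna(thresh=...) expresses the same rule in terms of the minimum number of non-missing values a row must have.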


In [None]:
# JUST SOME CODE FOR (MAYBE) LATER 

# ri_df.drop("county_name", axis="columns", inplace=True)
# ri_df.dropna(subset=["stop_date", "stop_time"], inplace=True)
# ri_df["is_arrested"] = ri_df["is_arrested"].astype("bool")
# ri_df["stop_date"].str.replace('/', '-')
# combo_datetime = ri_df["stop_date"].str.cat(ri_df["stop_time"], sep=' ')
# ri_df["combo_datetime"] = pd.to_datetime(combo_datetime)
# ri_df.set_index("combo_datetime", inplace=True)
# ## ADDITIONAL - not in course; to make row value_counts() for stop_outcome same as shape
# ri_df.dropna(subset=["stop_outcome"], inplace=True)
# #  ADDITIONAL - not in course; drop search_type column with 83232 missing values
# # ri_df.drop("search_type", axis="columns", inplace=True)
# # ri_df.info()