<a href="https://colab.research.google.com/github/mvince33/Coding-Dojo/blob/main/project2/project2_part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2 - Part 2
- Michael Vincent
- 8/8/22


## Rubric
- [x] deleted unnecessary columns
- [x] deleted duplicate rows
- [x] identified and addressed missing values 
- [x] identified and corrected inconsistencies in data for categorical values (e.g., Cat, cat, cats)
- [ ] produced univariate visuals for the target and all features
- [x] identified outliers
- [ ] clearly commented all of your cleaning steps and described any decisions you made  

## Imports

In [146]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load the data

In [147]:
# Load the data
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSu_3xbzvwqg6MpBKKDB3u8YHK31h6CTK5z1MClZorpRvHz4gTYJdv3IrrdSzwBA3gHuxlY7hsShEpZ/pub?output=csv'
df = pd.read_csv(url)
df.head()

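|  | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | 15.26 | x | g | o | f | e |  | w | 16.95 | ... | s | y | w | u | w | t | g |  | d | w |
| 1 | p | 16.6 | x | g | o | f | e |  | w | 17.99 | ... | s | y | w | u | w | t | g |  | d | u |
| 2 | p | 14.07 | x | g | o | f | e |  | w | 17.8 | ... | s | y | w | u | w | t | g |  | d | w |
| 3 | p | 14.17 | f | h | e | f | e |  | w | 15.77 | ... | s | y | w | u | w | t | p |  | d | w |
| 4 | p | 14.64 | x | h | o | f | e |  | w | 16.53 | ... | s | y | w | u | w | t | p |  | d | w |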


## Clean the data

### Duplicates

In [148]:
# Look for duplicates
print('Duplicates:', df.duplicated().sum())

# Remove the duplicates
df.drop_duplicates(inplace = True)

# Make sure the duplicate values were dropped
print('Duplicates:', df.duplicated().sum())

Duplicates: 146
Duplicates: 0
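
> As a minimal sketch of what the cell above does, the toy frame below (hypothetical values, not taken from the mushroom data) shows how `duplicated()` counts repeats and how `drop_duplicates()` leaves the original index labels in place.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 2 are identical.
toy = pd.DataFrame({
    'class': ['p', 'e', 'p', 'e'],
    'cap-shape': ['x', 'f', 'x', 's'],
})

# duplicated() marks every repeat after the first occurrence (keep='first').
n_dupes = toy.duplicated().sum()

# drop_duplicates() returns a new frame; surviving rows keep their index labels.
deduped = toy.drop_duplicates()

print('Duplicates:', n_dupes)
print(deduped.index.tolist())
```

> Because the index is not reset, a cleaned frame can have fewer rows than its largest index label suggests.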


### Missing values

In [149]:
# Check for missing values
print('Missing Values:', df.isna().sum().sum())

# Find missing values by column
print(df.isna().sum())
print('-' * 80)

# Get the percentage of missing data in each column that has missing values
for col, n in zip(df.isna().sum().index, df.isna().sum().values):
  if n > 0:
    print(f'Percentage of data missing in {col}: {round(n / len(df) * 100, 2)}')

# ring-type is missing only 4% of its values so we will drop the rows 
# with missing values in this column.
df.dropna(subset = ['ring-type'], inplace = True)

Missing Values: 307019
class                       0
cap-diameter                0
cap-shape                   0
cap-surface             14120
cap-color                   0
does-bruise-or-bleed        0
gill-attachment          9855
gill-spacing            25062
gill-color                  0
stem-height                 0
stem-width                  0
stem-root               51536
stem-surface            38122
stem-color                  0
veil-type               57746
veil-color              53510
has-ring                    0
ring-type                2471
spore-print-color       54597
habitat                     0
season                      0
dtype: int64
--------------------------------------------------------------------------------
Percentage of data missing in cap-surface: 23.18
Percentage of data missing in gill-attachment: 16.18
Percentage of data missing in gill-spacing: 41.14
Percentage of data missing in stem-root: 84.59
Percentage of data missing in stem-surface: 62.57
Perc

> We see the columns stem-root, veil-type, veil-color, and spore-print-color all have more than 80% of their values missing. We will drop these columns since it seems unwise to impute such a large amount of data.
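
> The 80% cutoff can also be expressed in vectorized form. The sketch below uses a toy frame with hypothetical columns `a`, `b`, and `c` to show how `isna().mean()` gives the missing fraction per column and how the `thresh` argument of `dropna` counts non-missing values. It illustrates the mechanics only and is not the notebook's own data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'b' is 90% missing, 'c' is 30% missing.
toy = pd.DataFrame({
    'a': range(10),
    'b': [1.0] + [np.nan] * 9,
    'c': [np.nan] * 3 + ['x'] * 7,
})

# isna().mean() gives the fraction of missing values in each column.
missing_frac = toy.isna().mean()

# Columns whose missing fraction exceeds the 80% cutoff.
too_sparse = missing_frac[missing_frac > 0.8].index.tolist()

# dropna(axis=1, thresh=k) keeps only columns with at least k non-missing values,
# so k = 20% of the rows drops columns that are more than 80% missing.
kept = toy.dropna(axis=1, thresh=int(0.2 * len(toy)))

print(too_sparse)
print(kept.columns.tolist())
```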

In [150]:
# Drop the columns with more than 80% of the data missing.
df.dropna(axis = 1, thresh = int(0.2*len(df)), inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58452 entries, 0 to 61068
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 58452 non-null  object 
 1   cap-diameter          58452 non-null  float64
 2   cap-shape             58452 non-null  object 
 3   cap-surface           44685 non-null  object 
 4   cap-color             58452 non-null  object 
 5   does-bruise-or-bleed  58452 non-null  object 
 6   gill-attachment       48950 non-null  object 
 7   gill-spacing          34449 non-null  object 
 8   gill-color            58452 non-null  object 
 9   stem-height           58452 non-null  float64
 10  stem-width            58452 non-null  float64
 11  stem-surface          21742 non-null  object 
 12  stem-color            58452 non-null  object 
 13  has-ring              58452 non-null  object 
 14  ring-type             58452 non-null  object 
 15  habitat            

In [156]:
# The remaining missing values will be imputed with 'M'. A separate copy
# of the data will be used for machine learning so that we don't risk data leakage.
df.fillna('M', inplace = True)

# Check to make sure missing values have been filled.
print('Missing Values:', df.isna().sum().sum())

Missing Values: 0


> We chose to impute the remaining missing values with 'M'. Columns missing more than 80% of their values were dropped because we felt they would not contribute much to our model. The remaining columns with missing data are missing roughly 16% to 63% of their values, and we feel that a label indicating the data is missing may be beneficial to our model. We will revisit these assumptions if we are unable to construct a model that scores well on this data.
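
> To illustrate why a missing-value label might help a model, the sketch below fills a toy column (hypothetical values) with 'M' and one-hot encodes it. The use of `pd.get_dummies` is our assumption about a possible later encoding step, not something this notebook has done yet.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing entries.
toy = pd.DataFrame({'gill-spacing': ['c', np.nan, 'd', np.nan, 'c']})

# Fill with the sentinel label 'M', as in the cell above.
filled = toy.fillna('M')

# One-hot encoding (an assumed modeling step) gives 'M' its own indicator
# column, so a model can learn from the fact that a value was missing.
encoded = pd.get_dummies(filled, columns=['gill-spacing'])

print(encoded.columns.tolist())
print('Rows flagged as missing:', int(encoded['gill-spacing_M'].sum()))
```

> Filling with a constant does not depend on any statistic computed from the data, so this particular step should not leak information between training and test sets.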

### Inconsistent labels

In [151]:
# Check for inconsistent labels
cat_cols = df.select_dtypes(include = 'object').columns
for col in cat_cols:
  print('Column:', col)
  print(df[col].value_counts(normalize = True))
  print()

Column: class
p    0.559143
e    0.440857
Name: class, dtype: float64

Column: cap-shape
x    0.431328
f    0.224235
s    0.122562
b    0.089646
o    0.056730
p    0.044447
c    0.031051
Name: cap-shape, dtype: float64

Column: cap-surface
t    0.175271
s    0.161732
y    0.130066
g    0.105673
h    0.103390
d    0.095446
e    0.053866
i    0.049793
w    0.048115
k    0.047376
l    0.029272
Name: cap-surface, dtype: float64

Column: cap-color
n    0.387446
y    0.135068
w    0.128071
g    0.073770
e    0.067252
o    0.062496
r    0.030435
u    0.029238
p    0.029135
k    0.021881
b    0.021043
l    0.014165
Name: cap-color, dtype: float64

Column: does-bruise-or-bleed
f    0.824865
t    0.175135
Name: does-bruise-or-bleed, dtype: float64

Column: gill-attachment
a    0.245128
d    0.201961
x    0.144229
s    0.115383
p    0.115383
e    0.108172
f    0.069745
Name: gill-attachment, dtype: float64

Column: gill-spacing
c    0.685738
d    0.215159
f    0.099103
Name: gill-spacing, dtype: 

> There do not appear to be any inconsistencies in our labels.
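
> Had we found inconsistent labels, one common fix is to normalize case and whitespace and then map any remaining variants by hand. The sketch below uses hypothetical labels that mirror the rubric's "Cat, cat, cats" example; none of these values appear in the mushroom data.

```python
import pandas as pd

# Hypothetical messy labels, mirroring the rubric's "Cat, cat, cats" example.
labels = pd.Series(['Cat', 'cat', ' cat', 'cats', 'dog'])

# Strip whitespace and lowercase to merge variants that differ only in
# case or padding.
cleaned = labels.str.strip().str.lower()

# Remaining variants need an explicit mapping chosen by inspection.
cleaned = cleaned.replace({'cats': 'cat'})

print(cleaned.value_counts())
```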

### Outliers

In [152]:
# Get the descriptive stats of the numeric data
df.describe()

|  | cap-diameter | stem-height | stem-width |
|---|---|---|---|
| count | 58452.0 | 58452.0 | 58452.0 |
| mean | 6.690789 | 6.509678 | 12.209577 |
| std | 5.318472 | 3.376686 | 10.200153 |
| min | 0.38 | 0.0 | 0.0 |
| 25% | 3.43 | 4.59 | 5.03 |
| 50% | 5.82 | 5.88 | 10.16 |
| 75% | 8.49 | 7.55 | 16.7 |
| max | 62.34 | 33.92 | 103.91 |


> There are values in 'stem-height' and 'stem-width' that are set to 0. We will investigate these data points more closely.

In [153]:
# Find the rows with 'stem-height' equal to 0.
display(df[df['stem-height'] == 0])

# Find the number of entries in 'stem-height' and
# 'stem-width' equal to 0
print(len(df[df['stem-height'] == 0]))
print(len(df[df['stem-width'] == 0]))

# Find the number of rows where exactly one of 'stem-height'
# and 'stem-width' is equal to 0
print(len(df[(df['stem-height'] == 0) & (df['stem-width'] != 0)]))
print(len(df[(df['stem-height'] != 0) & (df['stem-width'] == 0)]))

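|  | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | stem-width | stem-surface | stem-color | has-ring | ring-type | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56480 | p | 2.48 | o | t | n | f |  | c | w | 0.0 | 0.0 | f | f | f | f | d | u |
| 56481 | p | 4.29 | o | t | w | f |  | c | w | 0.0 | 0.0 | f | f | f | f | d | u |
| 56482 | p | 4.29 | o | t | n | f |  | c | w | 0.0 | 0.0 | f | f | f | f | d | u |
| 56483 | p | 4.72 | o | t | w | f |  | c | w | 0.0 | 0.0 | f | f | f | f | d | u |
| 56484 | p | 4.66 | o | t | w | f |  | c | w | 0.0 | 0.0 | f | f | f | f | d | a |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58234 | p | 2.21 | o | l | g | f | f | f | f | 0.0 | 0.0 | f | f | f | f | d | w |
| 58235 | p | 3.34 | o | l | g | f | f | f | f | 0.0 | 0.0 | f | f | f | f | d | w |
| 58238 | p | 2.28 | o | l | g | f | f | f | f | 0.0 | 0.0 | f | f | f | f | d | a |
| 58240 | p | 2.54 | o | l | g | f | f | f | f | 0.0 | 0.0 | f | f | f | f | d | u |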


915
915
0
0


> Every row with a 0 in stem-height also has a 0 in stem-width, and vice versa. Based on this, we assume that a 0 for stem-height and stem-width just means the mushroom under consideration doesn't have a stem.
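
> The same check can be written as a single comparison. The sketch below uses toy measurements (hypothetical values) and also derives a stem indicator. The `has-stem` column is a hypothetical name for illustration and is not part of the original data or of this notebook's cleaning steps.

```python
import pandas as pd

# Toy measurements (hypothetical values); two rows have no stem.
toy = pd.DataFrame({
    'stem-height': [4.5, 0.0, 6.1, 0.0],
    'stem-width': [9.8, 0.0, 12.3, 0.0],
})

zero_height = toy['stem-height'] == 0
zero_width = toy['stem-width'] == 0

# True when the two columns are zero in exactly the same rows.
zeros_agree = (zero_height == zero_width).all()

# 'has-stem' is a hypothetical indicator column added only for illustration.
toy['has-stem'] = ~(zero_height & zero_width)

print(zeros_agree)
print(toy['has-stem'].tolist())
```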

In [154]:
# Check the relatively large values
df.loc[df['stem-width'] > 60, ['cap-diameter', 'stem-height', 'stem-width']]

|  | cap-diameter | stem-height | stem-width |
|---|---|---|---|
| 48361 | 13.80 | 17.93 | 70.21 |
| 48362 | 17.63 | 17.55 | 69.37 |
| 48363 | 22.40 | 15.59 | 69.47 |
| 48364 | 24.73 | 16.28 | 65.87 |
| 48365 | 22.83 | 16.90 | 70.92 |
| ... | ... | ... | ... |
| 55400 | 4.90 | 6.56 | 62.04 |
| 55404 | 5.78 | 7.42 | 73.28 |
| 55407 | 4.39 | 7.43 | 67.74 |
| 55412 | 5.32 | 8.56 | 61.74 |


> We would need to consult a subject-matter expert to determine whether any of these data points are implausible, so we choose not to alter the numeric data.
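
> For reference, one common heuristic for flagging candidate outliers is Tukey's 1.5 × IQR rule. The sketch below applies it to the 'stem-width' quartiles reported by `df.describe()` and to a handful of 'stem-width' values copied from the outputs above. This is an illustration of the rule under those assumptions, not a step we apply to the data, and it only flags values without altering them.

```python
import pandas as pd

# Quartiles of 'stem-width' copied from the df.describe() output.
q1, q3 = 5.03, 16.7

# Tukey's rule: values more than 1.5 * IQR beyond the quartiles are flagged.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A few 'stem-width' values taken from the outputs (not the full column).
widths = pd.Series([5.03, 10.16, 16.7, 61.74, 103.91])
flagged = widths[(widths < lower_fence) | (widths > upper_fence)]

print(round(lower_fence, 3), round(upper_fence, 3))
print(flagged.tolist())
```

> Whether a flagged value is a genuine error remains a question for a subject-matter expert.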

### Data Visualizations