# Capstone 2 Notebook 2 Data wrangling<a id='Capstone_2_Notebook_2_Data_wrangling'></a>

## 2.2 Introduction<a id='2.2_Introduction'></a>

This step focuses on collecting your data, organizing it, and making sure it's well defined. Paying attention to these tasks will pay off greatly later on. Some data cleaning can be done at this stage, but it's important not to be overzealous in your cleaning before you've explored the data to better understand it.

### 2.2.1 Recap Of Data Science Problem<a id='2.2.1_Recap_Of_Data_Science_Problem'></a>

The purpose of this data science project is to build a classification model that predicts whether a mushroom found in our geographic location is edible. Mushroom foragers suspect they may not be maximizing their harvests, relative to what's optimal. They also do not have a strong sense of which mushroom features matter most in determining edibility. The model will be built from a set of categorical phenotypic features and will be used to guide mushroom foragers in mushroom identification.

### 2.2.2 Introduction To Notebook<a id='2.2.2_Introduction_To_Notebook'></a>

Notebooks grow organically as we explore our data. If you used paper notebooks, you could discover a mistake and cross out or revise some earlier work. Later work may give you a reason to revisit earlier work and explore it further. The great thing about Jupyter notebooks is that you can edit, add, and move cells around without needing to cross out figures or scrawl in the margin. However, this means you can lose track of your changes easily. If you worked in a regulated environment, the company may have a policy of always dating entries and clearly crossing out any mistakes, with your initials and the date.

**Best practice here is to commit your changes using a version control system such as Git.** Try to get into the habit of adding and committing your files to the Git repository you're working in after you save them. You are working in a Git repository, right? If you make a significant change, save the notebook and commit it to Git. In fact, if you're about to make a significant change, it's a good idea to commit beforehand as well. Then if the change is a mess, you've got the previous version to go back to.

**Another best practice with notebooks is to try to keep them organized with helpful headings and comments.** Not only does a good structure with associated headings help you keep track of what you've done and your current focus, but anyone reading your notebook will also have a much easier time following the flow of work. Remember, that 'anyone' will most likely be you. Be kind to future you!

In this notebook, note how we try to use well structured, helpful headings that frequently are self-explanatory, and we make a brief note after any results to highlight key takeaways. This is an immense help to anyone reading your notebook and it will greatly help you when you come to summarise your findings. **Top tip: jot down key findings in a final summary at the end of the notebook as they arise. You can tidy this up later.** This is a great way to ensure important results don't get lost in the middle of your notebooks.

In this, and subsequent notebooks, there are coding tasks marked with `#Code task n#` with code to complete. The `___` will guide you to where you need to insert code.

## 2.3 Imports<a id='2.3_Imports'></a>

Placing your imports all together at the start of your notebook means you only need to consult one place to check your notebook's dependencies. By all means import something 'in situ' later on when you're experimenting, but if the imported dependency ends up being kept, you should subsequently move the import statement here with the rest.

In [1]:
#Code task 1#
#Import pandas, matplotlib.pyplot, and seaborn in the correct lines below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

from library.sb_utils import save_file

## 2.4 Objectives<a id='2.4_Objectives'></a>

There are some fundamental questions to resolve in this notebook before you move on.

* Do you think you may have the data you need to tackle the desired question?
    * Have you identified the required target value?
    * Do you have potentially useful features?
* Do you have any fundamental issues with the data?

## 2.5 Load The Mushroom Data<a id='2.5_Load_The_Mushroom_Data'></a>

In [2]:
# the supplied CSV data file is in the raw_data directory
mushroom_data = pd.read_csv('../raw_data/mushrooms.csv')

Good first steps in auditing the data are calling the `info` method and displaying the first few records with `head`.

In [3]:
#Code task 2#
#Call the info method on mushroom_data to see a summary of the data
mushroom_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

`class` is whether a mushroom is edible (abbreviated by `e`) or poisonous (abbreviated by `p`). The other columns are potential features.

In [4]:
#Code task 3#
#Call the head method on mushroom_data to print the first several rows of the data
mushroom_data.head()

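|  | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |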


The output above suggests you've made a good start getting the mushroom data organized. You have plausible column headings.

## 2.6 Explore The Data<a id='2.6_Explore_The_Data'></a>

### 2.6.1 Number Of Missing Values By Column<a id='2.6.1_Number_Of_Missing_Values_By_Column'></a>

Count the number of missing values in each column and sort them.

In [5]:
#Code task 5#
#Count (using `.sum()`) the number of missing values (`.isnull()`) in each column of 
#mushroom_data as well as the percentages (using `.mean()` instead of `.sum()`).
#Order them (increasing or decreasing) using sort_values
#Call `pd.concat` to present these in a single table (DataFrame) with the helpful column names 'count' and '%'
missing = pd.concat([mushroom_data.isnull().sum(), 100 * mushroom_data.isnull().mean()], axis=1)
missing.columns=['count','%']
missing.sort_values(by=['count','%'],ascending=False)

|  | count | % |
|---|---|---|
| class | 0 | 0.0 |
| cap-shape | 0 | 0.0 |
| cap-surface | 0 | 0.0 |
| cap-color | 0 | 0.0 |
| bruises | 0 | 0.0 |
| odor | 0 | 0.0 |
| gill-attachment | 0 | 0.0 |
| gill-spacing | 0 | 0.0 |
| gill-size | 0 | 0.0 |
| gill-color | 0 | 0.0 |


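There are no null values. Keep in mind that `isnull()` only detects true nulls; missing data encoded with a placeholder character, such as the `?` that appears in `stalk-root` in section 2.6.2.3, is not counted here.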
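
The sketch below illustrates what this kind of check does and does not catch. It uses a small made-up frame in which `?` stands in for an unknown value (the rows are illustrative assumptions, not taken from the mushroom file): `isnull()` counts only `NaN`/`None`, while an explicit comparison finds placeholder codes.

```python
import pandas as pd

# Toy data, not the real mushroom file: '?' stands in for an unknown value.
toy = pd.DataFrame({
    'stalk-root': ['b', '?', 'e', '?'],
    'odor': ['n', 'f', 'n', 'a'],
})

# isnull() sees no NaN/None here, so every per-column count is zero.
null_counts = toy.isnull().sum()

# Comparing against the placeholder reveals the hidden gaps per column.
placeholder_counts = toy.eq('?').sum()

print(null_counts)
print(placeholder_counts)
```

Running the same comparison on `mushroom_data` would show which columns, if any, rely on such a code.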

### 2.6.2 Categorical Features<a id='2.6.2_Categorical_Features'></a>

The mushroom dataset has no numerical features, just categorical ones. These are discrete entities. 'edible' (abbreviated by `e`) is an adjective. Although adjectives can be sorted alphabetically, it makes no sense to take the average of 'edible' and 'poisonous' (abbreviated by `p`). Similarly, 'edible' comes before 'poisonous' only lexicographically; it is neither 'less than' nor 'greater than' 'poisonous'. As such, categorical features tend to require different handling than strictly numeric quantities.

Note that a feature _can_ be numeric but also categorical. For example, instead of giving the number_of_bruises on a mushroom, a feature might be has_bruises (`bruises` in our actual dataset) and have the value 0 or 1 (`f` or `t` in our actual dataset) to denote absence or presence of such bruises. In such a case it would not make sense to take an average of this or perform other mathematical calculations on it. Although we digress a little to make a point, colors are also, strictly speaking, categorical features. Yes, when a color is represented by its RGB color code it provides a convenient way to visualize data over a gradient. And, arguably, there is some logical interpretation of the average of red and blue (ff0000 and 0000ff) being purple (800080). However, the color wheel loops from purple back to red, whereas the RGB color cube does not.
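
The red-plus-blue arithmetic can be checked directly. This is a minimal sketch, assuming each channel is averaged separately and halves round up (127.5 becomes 128, which is hex 80); `average_hex` is a hypothetical helper name introduced only for this illustration.

```python
def average_hex(color_a, color_b):
    """Average two 6-digit RGB hex codes channel by channel, rounding halves up."""
    mixed = []
    for i in (0, 2, 4):  # offsets of the red, green, and blue byte pairs
        a = int(color_a[i:i + 2], 16)
        b = int(color_b[i:i + 2], 16)
        mixed.append((a + b + 1) // 2)  # integer average, rounding .5 upward
    return ''.join(f'{channel:02x}' for channel in mixed)

purple = average_hex('ff0000', '0000ff')
print(purple)  # 800080
```

No comparable calculation exists for codes such as `e` and `p`, which is the practical difference between a numeric encoding and a purely categorical one.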

In [6]:
#Code task 6#
#Use mushroom_data's `select_dtypes` method to select columns of dtype 'object'
mushroom_data.select_dtypes(include='object')

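|  | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8119 | e | k | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | b | c | l |
| 8120 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | n | o | p | b | v | l |
| 8121 | e | f | s | n | f | n | a | c | b | n | ... | s | o | o | p | o | o | p | b | c | l |
| 8122 | p | k | y | n | f | y | f | c | n | b | ... | k | w | w | p | w | o | e | w | v | l |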


#### 2.6.2.1 Unique Mushrooms<a id='2.6.2.1_Unique_Mushrooms'></a>

In [7]:
#Code task 7#
#Use pandas' DataFrame method `duplicated` to find any duplicated rows
mushroom_data[mushroom_data.duplicated()].head()

|  | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


There are no duplicated mushrooms.
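
An empty table is easy to misread, so a scalar count can make the same check more explicit. The sketch below runs `duplicated().sum()` on a made-up frame (the rows are illustrative assumptions) in which one row repeats an earlier one.

```python
import pandas as pd

# Toy data: row 2 repeats row 0 exactly; rows 1 and 3 differ in 'habitat'.
toy = pd.DataFrame({
    'class':   ['e', 'p', 'e', 'p'],
    'odor':    ['n', 'f', 'n', 'f'],
    'habitat': ['g', 'd', 'g', 'u'],
})

# duplicated() flags a row only when every column matches an earlier row.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)  # 1
```

On `mushroom_data`, a result of 0 from the same expression would confirm the finding above.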

#### 2.6.2.2 Number of distinct values for features<a id='2.6.2.2_Number_of_distinct_values_for_features'></a>

In [9]:
#Code task 12#
#Use the `nunique` method to calculate the number of unique values
#in each column of mushroom_data
mushroom_data.nunique()

class                        2
cap-shape                    6
cap-surface                  4
cap-color                   10
bruises                      2
odor                         9
gill-attachment              2
gill-spacing                 2
gill-size                    2
gill-color                  12
stalk-shape                  2
stalk-root                   5
stalk-surface-above-ring     4
stalk-surface-below-ring     4
stalk-color-above-ring       9
stalk-color-below-ring       9
veil-type                    1
veil-color                   4
ring-number                  3
ring-type                    5
spore-print-color            9
population                   6
habitat                      7
dtype: int64

Every mushroom has the same `veil-type`, so this column cannot help distinguish edible from poisonous mushrooms. Let's drop it.

In [10]:
mushroom_data.drop(columns='veil-type', inplace=True)
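
When there are many columns, constant ones can be found programmatically instead of by reading the `nunique` output. This is a minimal sketch on a made-up frame (the rows are illustrative assumptions); unlike the cell above, it returns a new DataFrame rather than modifying in place.

```python
import pandas as pd

# Toy data: 'veil-type' holds a single value, mirroring the real column.
toy = pd.DataFrame({
    'class':     ['e', 'p', 'e'],
    'veil-type': ['p', 'p', 'p'],
    'odor':      ['n', 'f', 'a'],
})

# Columns with exactly one distinct value carry no information.
distinct = toy.nunique()
constant_cols = distinct[distinct == 1].index.tolist()

# drop() without inplace=True leaves `toy` untouched and returns a copy.
trimmed = toy.drop(columns=constant_cols)

print(constant_cols)             # ['veil-type']
print(trimmed.columns.tolist())  # ['class', 'odor']
```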

In [11]:
mushroom_data.head()

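|  | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | s | w | w | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | s | w | w | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | s | w | w | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | s | w | w | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | s | w | w | w | o | e | n | a | g |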


Now it is gone. Here are the remaining features with their numbers of distinct values:

In [14]:
mushroom_data.nunique()

class                        2
cap-shape                    6
cap-surface                  4
cap-color                   10
bruises                      2
odor                         9
gill-attachment              2
gill-spacing                 2
gill-size                    2
gill-color                  12
stalk-shape                  2
stalk-root                   5
stalk-surface-above-ring     4
stalk-surface-below-ring     4
stalk-color-above-ring       9
stalk-color-below-ring       9
veil-color                   4
ring-number                  3
ring-type                    5
spore-print-color            9
population                   6
habitat                      7
dtype: int64

#### 2.6.2.3 Frequency Counts Of Values For Each Feature<a id='2.6.2.3_Frequency_Counts_Of_Values_For_Each_Feature'></a>

Now we look at frequency counts of values for each feature:

**Class**

In [15]:
mushroom_data['class'].value_counts()

e    4208
p    3916
Name: class, dtype: int64

There are 2 classes: `e` (edible) and `p` (poisonous).

Slightly over half of the mushrooms are edible.
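
To put a number on "slightly over half", `value_counts(normalize=True)` returns proportions instead of counts. The sketch below rebuilds a Series from the two class counts reported above rather than loading the file; that reconstruction is an assumption made only to keep the example self-contained.

```python
import pandas as pd

# Rebuild a Series with the reported class counts: 4208 edible, 3916 poisonous.
classes = pd.Series(['e'] * 4208 + ['p'] * 3916)

# normalize=True divides each count by the total number of rows.
shares = classes.value_counts(normalize=True)
print(shares.round(3))
```

The edible share works out to 4208 / 8124, or roughly 51.8%, so the two classes are close to balanced.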

**Cap Shape**

In [17]:
mushroom_data['cap-shape'].value_counts()

x    3656
f    3152
k     828
b     452
s      32
c       4
Name: cap-shape, dtype: int64

There are 6 cap shapes: `b` (bell), `c` (conical), `x` (convex), `f` (flat), `k` (knobbed), and `s` (sunken).

Convex cap shapes are the most common.
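
Single-letter codes are compact but hard to read in tables and plots. One option is to decode them with a dictionary and `Series.map`; the sketch below uses the cap-shape legend given above, with a short made-up Series of codes (an illustrative assumption) in place of the real column.

```python
import pandas as pd

# Code-to-label mapping taken from the cap-shape legend above.
CAP_SHAPE_LABELS = {
    'b': 'bell', 'c': 'conical', 'x': 'convex',
    'f': 'flat', 'k': 'knobbed', 's': 'sunken',
}

# Toy codes standing in for mushroom_data['cap-shape'].
codes = pd.Series(['x', 'f', 'k', 'x'])

# map() looks each code up in the dictionary; unknown codes would become NaN.
labels = codes.map(CAP_SHAPE_LABELS)
print(labels.tolist())  # ['convex', 'flat', 'knobbed', 'convex']
```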

**Cap Surface**

In [19]:
mushroom_data['cap-surface'].value_counts()

y    3244
s    2556
f    2320
g       4
Name: cap-surface, dtype: int64

There are 4 kinds of cap surfaces: `f` (fibrous), `g` (grooves), `y` (scaly), and `s` (smooth).

Scaly cap surfaces are the most common.

**Cap Color**

In [20]:
mushroom_data['cap-color'].value_counts()

n    2284
g    1840
e    1500
y    1072
w    1040
b     168
p     144
c      44
u      16
r      16
Name: cap-color, dtype: int64

There are 10 cap colors: `n` (brown), `b` (buff), `c` (cinnamon), `g` (gray), `r` (green), `p` (pink), `u` (purple), `e` (red), `w` (white), and `y` (yellow).

Brown caps are the most common.

**Bruises**

In [21]:
mushroom_data['bruises'].value_counts()

f    4748
t    3376
Name: bruises, dtype: int64

A mushroom may have bruises (`t`) or not (`f`).

A majority of the mushrooms don't have bruises.

**Odor**

In [24]:
mushroom_data['odor'].value_counts()

n    3528
f    2160
s     576
y     576
l     400
a     400
p     256
c     192
m      36
Name: odor, dtype: int64

There are 9 mushroom odors: `a` (almond), `l` (anise), `c` (creosote), `y` (fishy), `f` (foul), `m` (musty), `n` (none), `p` (pungent), `s` (spicy).

A plurality of mushrooms have no odor.

**Gill Attachment**

In [25]:
mushroom_data['gill-attachment'].value_counts()

f    7914
a     210
Name: gill-attachment, dtype: int64

There are 2 kinds of gill attachments: `a` (attached), and `f` (free).

A vast majority of mushrooms have gills free, not attached.

**Gill Spacing**

In [26]:
mushroom_data['gill-spacing'].value_counts()

c    6812
w    1312
Name: gill-spacing, dtype: int64

There are 2 kinds of gill spacing: `c` (close) and `w` (crowded).

A large majority of mushrooms have gills close together, but not crowded.

**Gill Size**

In [27]:
mushroom_data['gill-size'].value_counts()

b    5612
n    2512
Name: gill-size, dtype: int64

There are 2 gill sizes: `b` (broad) and `n` (narrow).

A large majority of gills are broad.

**Gill Color**

In [29]:
mushroom_data['gill-color'].value_counts()

b    1728
p    1492
w    1202
n    1048
g     752
h     732
u     492
k     408
e      96
y      86
o      64
r      24
Name: gill-color, dtype: int64

There are 12 gill colors: `k` (black), `n` (brown), `b` (buff), `h` (chocolate), `g` (gray), `r` (green), `o` (orange), `p` (pink), `u` (purple), `e` (red), `w` (white), and `y` (yellow).

Buff gills are the most common.

**Stalk Shape**

In [30]:
mushroom_data['stalk-shape'].value_counts()

t    4608
e    3516
Name: stalk-shape, dtype: int64

There are 2 stalk shapes: `e` (enlarging) and `t` (tapering).

A majority of stalks taper.

**Stalk Root**

In [32]:
mushroom_data['stalk-root'].value_counts()

b    3776
?    2480
e    1120
c     556
r     192
Name: stalk-root, dtype: int64

There are 5 possibilities for stalk roots: `b` (bulbous), `c` (club), `e` (equal), `r` (rooted), and `?` (missing). The `?` code marks 2480 mushrooms (about 31%) whose stalk root was not recorded.

A plurality of mushrooms have bulbous stalk roots.
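
If you would rather have pandas treat `?` as a true null, one option is the `na_values` parameter of `pd.read_csv`. The sketch below demonstrates this on a tiny in-memory CSV (made-up rows, an illustrative assumption). Whether to convert `?` or keep it as its own category is a judgment call; this notebook leaves it in place.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file; '?' marks unknown roots.
csv_text = "class,stalk-root\ne,b\np,?\ne,c\np,?\n"

# na_values tells read_csv to parse the listed strings as NaN in every column.
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

n_missing = int(toy['stalk-root'].isnull().sum())
print(n_missing)  # 2
```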

**Stalk Surface Above Ring**

In [34]:
mushroom_data['stalk-surface-above-ring'].value_counts()

s    5176
k    2372
f     552
y      24
Name: stalk-surface-above-ring, dtype: int64

There are 4 kinds of stalk surfaces above the ring: `f` (fibrous), `y` (scaly), `k` (silky), and `s` (smooth).

A large majority of mushrooms have smooth stalk surfaces above the ring.

**Stalk Surface Below Ring**

In [35]:
mushroom_data['stalk-surface-below-ring'].value_counts()

s    4936
k    2304
f     600
y     284
Name: stalk-surface-below-ring, dtype: int64

There are 4 kinds of stalk surfaces below the ring: `f` (fibrous), `y` (scaly), `k` (silky), and `s` (smooth).

A large majority of mushrooms have smooth stalk surfaces below the ring.

**Stalk Color Above Ring**

In [36]:
mushroom_data['stalk-color-above-ring'].value_counts()

w    4464
p    1872
g     576
n     448
b     432
o     192
e      96
c      36
y       8
Name: stalk-color-above-ring, dtype: int64

There are 9 stalk colors above the ring: `n` (brown), `b` (buff), `c` (cinnamon), `g` (gray), `o` (orange), `p` (pink), `e` (red), `w` (white), and `y` (yellow).

A majority of mushrooms have stalks white above the ring.

**Stalk Color Below Ring**

In [37]:
mushroom_data['stalk-color-below-ring'].value_counts()

w    4384
p    1872
g     576
n     512
b     432
o     192
e      96
c      36
y      24
Name: stalk-color-below-ring, dtype: int64

There are 9 stalk colors below the ring: `n` (brown), `b` (buff), `c` (cinnamon), `g` (gray), `o` (orange), `p` (pink), `e` (red), `w` (white), and `y` (yellow).

A majority of mushrooms have stalks white below the ring.

**Veil Color**

In [39]:
mushroom_data['veil-color'].value_counts()

w    7924
o      96
n      96
y       8
Name: veil-color, dtype: int64

There are 4 veil colors: `n` (brown), `o` (orange), `w` (white), and `y` (yellow).

A vast majority of mushrooms have white veils.

**Ring Number**

In [40]:
mushroom_data['ring-number'].value_counts()

o    7488
t     600
n      36
Name: ring-number, dtype: int64

There are 3 possible numbers of rings: `n` (none/zero), `o` (one), or `t` (two).

A large majority of mushrooms have one ring.

**Ring Type**

In [41]:
mushroom_data['ring-type'].value_counts()

p    3968
e    2776
l    1296
f      48
n      36
Name: ring-type, dtype: int64

There are 5 possibilities for rings: `e` (evanescent), `f` (flaring), `l` (large), `p` (pendant), and `n` (none/there isn't any).

A plurality of mushrooms have pendant rings. The number of mushrooms with ring type none is consistent with the number of mushrooms with no rings.
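
Matching totals are consistent with the two columns agreeing, but they do not prove that the same rows are involved. A cross-tabulation would settle it. The sketch below shows the idea on a made-up frame (the rows are illustrative assumptions), using `pd.crosstab` plus a row-wise comparison.

```python
import pandas as pd

# Toy data: both mushrooms without rings also have ring type 'n'.
toy = pd.DataFrame({
    'ring-number': ['o', 'o', 'n', 't', 'n'],
    'ring-type':   ['p', 'e', 'n', 'p', 'n'],
})

# Rows are ring-number codes, columns are ring-type codes, cells are counts.
table = pd.crosstab(toy['ring-number'], toy['ring-type'])
print(table)

# Stronger check: 'no rings' and 'ring type none' flag exactly the same rows.
same_rows = bool((toy['ring-number'].eq('n') == toy['ring-type'].eq('n')).all())
print(same_rows)  # True
```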

**Spore Print Color**

In [42]:
mushroom_data['spore-print-color'].value_counts()

w    2388
n    1968
k    1872
h    1632
r      72
b      48
u      48
o      48
y      48
Name: spore-print-color, dtype: int64

There are 9 spore print colors: `k` (black), `n` (brown), `b` (buff), `h` (chocolate), `r` (green), `o` (orange), `u` (purple), `w` (white), and `y` (yellow).

A plurality of mushrooms have white spore prints.

**Population**

In [43]:
mushroom_data['population'].value_counts()

v    4040
y    1712
s    1248
n     400
a     384
c     340
Name: population, dtype: int64

There are 6 kinds of mushroom populations: `a` (abundant), `c` (clustered), `n` (numerous), `s` (scattered), `v` (several), and `y` (solitary).

A plurality of mushrooms (just under half) have a population described by the word 'several'.

**Habitat**

In [44]:
mushroom_data['habitat'].value_counts()

d    3148
g    2148
p    1144
l     832
u     368
m     292
w     192
Name: habitat, dtype: int64

There are 7 kinds of mushroom habitats: `g` (grasses), `l` (leaves), `m` (meadows), `p` (paths), `u` (urban), `w` (waste), and `d` (woods).

A plurality of mushrooms grow in the woods.
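
The per-feature cells above all follow the same pattern, so they could also be produced in one pass. This is a minimal sketch on a made-up two-column frame (an illustrative assumption); the same loop would work on `mushroom_data`.

```python
import pandas as pd

# Toy data standing in for the full mushroom table.
toy = pd.DataFrame({
    'bruises':   ['f', 't', 'f', 'f'],
    'gill-size': ['b', 'b', 'n', 'b'],
})

# One value_counts() Series per column, keyed by column name.
counts_by_feature = {col: toy[col].value_counts() for col in toy.columns}

for col, counts in counts_by_feature.items():
    print(col)
    print(counts.to_string())
```

Separate cells remain useful when you want to comment on each feature individually, as this notebook does.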

#### 2.6.2.4 Categorical Data Summary<a id='2.6.2.4_Categorical_Data_Summary'></a>

In [16]:
#Call mushroom_data's `describe` method for a statistical summary of the categorical columns
#Hint: there are fewer summary stat columns than features, so displaying the transpose
#makes the table easier to read
mushroom_data.describe().T

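|  | count | unique | top | freq |
|---|---|---|---|---|
| class | 8124 | 2 | e | 4208 |
| cap-shape | 8124 | 6 | x | 3656 |
| cap-surface | 8124 | 4 | y | 3244 |
| cap-color | 8124 | 10 | n | 2284 |
| bruises | 8124 | 2 | f | 4748 |
| odor | 8124 | 9 | n | 3528 |
| gill-attachment | 8124 | 2 | f | 7914 |
| gill-spacing | 8124 | 2 | c | 6812 |
| gill-size | 8124 | 2 | b | 5612 |
| gill-color | 8124 | 12 | b | 1728 |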
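
For text columns, `describe` reports `count`, `unique`, `top` (the most frequent value), and `freq` (how often it occurs) rather than means and quartiles. The sketch below shows this on a made-up frame (the rows are illustrative assumptions).

```python
import pandas as pd

# Toy data: 'e' and 'n' are each the most frequent value in their column.
toy = pd.DataFrame({
    'class': ['e', 'e', 'p'],
    'odor':  ['n', 'n', 'f'],
})

# Transposing puts one feature per row, as in the table above.
summary = toy.describe().T
print(summary)
```

The `top` and `freq` columns should agree with the first line of each `value_counts` output in section 2.6.2.3.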


### 2.6.4 Numeric Features<a id='2.6.4_Numeric_Features'></a>

There are no numeric features.

## 2.7 Target Feature<a id='2.7_Target_Feature'></a>

The target feature will be the class of the mushroom, i.e. whether it is `e` (edible) or `p` (poisonous).

## 2.8 Save data<a id='2.8_Save_data'></a>

In [45]:
mushroom_data.shape

(8124, 22)

Save this to your data directory, separately. Note that you were provided with the data in `raw_data` and you should save derived data in a separate location. This guards against overwriting your original data.

In [46]:
# save the data to a new csv file
datapath = '../data'
save_file(mushroom_data, 'mushroom_data_cleaned.csv', datapath)

Writing file.  "../data/mushroom_data_cleaned.csv"
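
`save_file` comes from the course's `library.sb_utils` module, whose implementation is not shown here. If that module is unavailable, a stand-in along the following lines would serve the same purpose. This is a hypothetical sketch, not the actual `sb_utils` code: the function name `save_csv_safely` and its refuse-to-overwrite behavior are assumptions.

```python
import os
import tempfile
import pandas as pd

def save_csv_safely(df, filename, directory):
    """Hypothetical stand-in for a save helper; not the actual sb_utils code."""
    os.makedirs(directory, exist_ok=True)  # create the output folder if needed
    path = os.path.join(directory, filename)
    if os.path.exists(path):
        raise FileExistsError(f'{path} already exists; refusing to overwrite')
    df.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    return path

# Demonstrate on toy data in a temporary directory, then read the file back.
toy = pd.DataFrame({'class': ['e', 'p'], 'odor': ['n', 'f']})
with tempfile.TemporaryDirectory() as tmp:
    written = save_csv_safely(toy, 'toy_cleaned.csv', os.path.join(tmp, 'data'))
    reloaded = pd.read_csv(written)

print(reloaded.shape)  # (2, 2)
```

Reading the file back and comparing shapes is a cheap way to confirm the save worked as intended.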


## 2.13 Summary<a id='2.13_Summary'></a>

We started with 8124 rows and 23 columns of mushroom data. There were no null values and no duplicated rows, although `stalk-root` uses `?` to mark 2480 missing entries. We dropped the `veil-type` column because every mushroom had the same veil type, `p` (partial), so it carried no useful information. The target feature to predict is `class`, because mushroom foragers want edible mushrooms, not poisonous ones. After cleaning, we ended up with 8124 rows and 22 columns. This completes the data wrangling stage of the project; the next stage is exploratory data analysis.