## Data Preprocessing/Exploratory Data Analysis for the Poisonous Mushrooms Code



### Dataset Analysis

In [16]:
# Load the data into a DataFrame

import pandas as pd
import matplotlib.pyplot as plt

# Source: https://www.kaggle.com/datasets/davinascimento/poisonous-mushrooms?resource=download
# Assumes the CSV has been downloaded into the same directory as this notebook.
# Only the first 2000 rows are read for now to keep processing fast.
df = pd.read_csv("poisonous_mushrooms.csv", nrows=2000)

# For each column, print the value counts and the number of missing entries
for column in df:
	df_series = df[column]
	results = df_series.value_counts()
	print("Results for column: ", column)
	print(results)
	print("Missing: ", df_series.isnull().sum(), "\n")

# The cells below look at these results column by column, so there is no need to scroll through the full output




Results for column:  id
id
0       1
1329    1
1342    1
1341    1
1340    1
       ..
661     1
660     1
659     1
658     1
1999    1
Name: count, Length: 2000, dtype: int64
Missing:  0 

Results for column:  class
class
p    1118
e     882
Name: count, dtype: int64
Missing:  0 

Results for column:  cap-diameter
cap-diameter
1.52     9
1.43     8
1.49     8
3.77     8
3.84     8
        ..
1.18     1
13.55    1
5.27     1
11.61    1
11.65    1
Name: count, Length: 945, dtype: int64
Missing:  0 

Results for column:  cap-shape
cap-shape
x    936
f    413
s    232
b    219
p     76
o     62
c     62
Name: count, dtype: int64
Missing:  0 

Results for column:  cap-surface
cap-surface
t        303
s        249
y        199
h        184
g        168
d        132
e         90
k         71
i         63
w         62
l         36
15.94      1
Name: count, dtype: int64
Missing:  442 

Results for column:  cap-color
cap-color
n    875
w    254
y    221
g    160
e    138
o    109
p     57
u   

Focusing first on the value counts for the categorical columns...

First is the target: the poisonous-or-edible attribute (p = poisonous, e = edible).

As the results below show, about 56% of this subsample (N = 2000) is poisonous, so the classes are reasonably balanced.

In [4]:
print(df["class"].value_counts())

class
p    1118
e     882
Name: count, dtype: int64
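
The balance check can be made explicit by asking pandas for proportions rather than raw counts. This is a minimal sketch, assuming a `class` column coded `p`/`e` as above; the `class_balance` helper and the toy frame are hypothetical stand-ins so the snippet runs without the CSV.

```python
import pandas as pd

def class_balance(frame: pd.DataFrame, target: str = "class") -> pd.Series:
    """Return each target label's share of the non-missing rows."""
    return frame[target].value_counts(normalize=True)

# Toy stand-in (hypothetical) with the same p/e split as the 2000-row subsample
toy = pd.DataFrame({"class": ["p"] * 1118 + ["e"] * 882})
balance = class_balance(toy)
print(balance.round(3))
```

On the real data the same call would be `class_balance(df)`.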


Cap Data:

Labels from the Kaggle dataset

cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s, oval=o

cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s, silky=l

cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y, black=k


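These look like good candidate features in terms of data quality, with some caveats visible in the first cell's output: cap-surface is missing 442 of 2000 values and contains one stray numeric entry (15.94), and the value counts include codes that are not in the label lists above (for example p for cap-shape and t for cap-surface).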


In [5]:
for cat in ["cap-shape", "cap-surface", "cap-color"]:
	print(df[cat].value_counts())
	# Missing count for this column
	print("Missing: ", df[cat].isnull().sum(), "\n")


# Results suggest a solid mix of cap shapes and surface types; the shape and color data might need resampling to balance the input sets

cap-shape
x    936
f    413
s    232
b    219
p     76
o     62
c     62
Name: count, dtype: int64
Missing:  0 

cap-surface
t        303
s        249
y        199
h        184
g        168
d        132
e         90
k         71
i         63
w         62
l         36
15.94      1
Name: count, dtype: int64
Missing:  442 

cap-color
n    875
w    254
y    221
g    160
e    138
o    109
p     57
u     50
r     50
b     38
k     32
l     16
Name: count, dtype: int64
Missing:  0 
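
The stray `15.94` entry in cap-surface suggests that some rows carry a numeric value in a categorical column. One way to handle this, sketched here under the assumption that every valid code is a single lowercase letter, is to turn anything else into a missing value. The `clean_category_codes` helper and the toy series are hypothetical, not part of the original notebook.

```python
import pandas as pd

def clean_category_codes(series: pd.Series) -> pd.Series:
    """Keep single-letter codes; replace anything else (e.g. '15.94') with NaN."""
    # Assumption: valid codes are exactly one lowercase letter
    is_code = series.astype(str).str.fullmatch(r"[a-z]", na=False)
    return series.where(is_code)

# Toy stand-in (hypothetical) for a column such as cap-surface
toy_surface = pd.Series(["t", "s", "15.94", None, "y"])
cleaned = clean_category_codes(toy_surface)
print(cleaned)
```

After cleaning, the stray entry is counted together with the other missing values by `isnull().sum()`.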



cap-diameter:

In [6]:
print(df["cap-diameter"].describe())
print("Missing: ", df["cap-diameter"].isnull().sum(), "\n")

count    2000.000000
mean        6.177930
std         4.192656
min         0.510000
25%         3.267500
50%         5.710000
75%         8.252500
max        55.940000
Name: cap-diameter, dtype: float64
Missing:  0 



Gill Data:

All three gill columns are categorical, with the labels listed here. Some have missing entries, as the counts below show:

gill-attachment: attached=a, descending=d, free=f, notched=n

gill-spacing: close=c, crowded=w, distant=d

gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y

In [7]:

for column in ["gill-attachment", "gill-spacing", "gill-color"]:
	df_series = df[column]
	print(column)
	print("Missing: ", df_series.isnull().sum(), "\n")

gill-attachment
Missing:  328 

gill-spacing
Missing:  851 

gill-color
Missing:  0 



gill-spacing is missing 851 of 2000 values (about 43%), so it is probably simpler to drop it.
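
To compare columns on the same scale, the missing counts can be expressed as fractions of the row count. The following is a minimal sketch; `missing_summary` and the toy frame are hypothetical and only mimic the shape of the gill columns.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "fraction": frame.isna().mean(),
    })
    return summary.sort_values("fraction", ascending=False)

# Toy stand-in (hypothetical values) so the snippet runs without the CSV
toy_gills = pd.DataFrame({
    "gill-attachment": ["a", None, "d", "a"],
    "gill-spacing": [None, None, "c", None],
    "gill-color": ["w", "n", "w", "k"],
})
summary = missing_summary(toy_gills)
print(summary)
```

Running `missing_summary(df)` on the real subsample would rank every column at once.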

Stem Data:

In [15]:
for column in ["stem-width", "stem-height"]:
	df_series = df[column]
	print(column)
	print(df_series.describe())
	# A zero width or height is treated as a missing measurement
	zeros = (df_series == 0).sum()
	print("Missing: ", zeros, "\n")


stem-width
count    2000.000000
mean       11.088920
std         8.002323
min         0.000000
25%         4.967500
50%         9.560000
75%        15.777500
max        57.210000
Name: stem-width, dtype: float64
Missing:  1 

stem-height
count    2000.000000
mean        6.412225
std         2.852064
min         0.000000
25%         4.610000
50%         5.870000
75%         7.560000
max        25.930000
Name: stem-height, dtype: float64
Missing:  1 
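
If zero measurements are treated as missing, as the cell above does, they can be converted to NaN so that later steps handle them like any other gap. This is a sketch under that assumption; `zeros_to_nan` and the toy values are hypothetical.

```python
import pandas as pd

def zeros_to_nan(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy where exact zeros in the given columns become NaN."""
    out = frame.copy()
    # mask() replaces entries with NaN wherever the condition is True
    out[columns] = out[columns].mask(out[columns] == 0)
    return out

# Toy stand-in (hypothetical values) for the stem measurements
toy_stems = pd.DataFrame({
    "stem-width": [0.0, 9.56, 15.78],
    "stem-height": [5.87, 0.0, 7.56],
})
stems_clean = zeros_to_nan(toy_stems, ["stem-width", "stem-height"])
print(stems_clean.isna().sum())
```

The original frame is left untouched, which keeps the raw zeros available for inspection.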



Other relevant data:
does-bruise-or-bleed
has-ring

Both features are imbalanced, but not too hard to clean.


Spore color, veil color, and veil type have more missing entries than filled ones, so those will likely be ignored.
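
The "more missing than filled" rule can be written as a threshold on the missing fraction. The sketch below assumes a cutoff of 0.5; `drop_sparse_columns`, the cutoff parameter, and the column names in the toy frame are hypothetical placeholders rather than the notebook's actual cleaning step.

```python
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns whose missing fraction exceeds max_missing."""
    missing_fraction = frame.isna().mean()
    keep = missing_fraction[missing_fraction <= max_missing].index
    return frame[keep]

# Toy stand-in (hypothetical names and values)
toy_sparse = pd.DataFrame({
    "mostly-empty": [None, None, "u", None],
    "complete": ["x", "f", "x", "b"],
    "some-gaps": ["a", None, "a", "d"],
})
reduced = drop_sparse_columns(toy_sparse)
print(list(reduced.columns))
```

Lowering `max_missing` (for example to 0.4) would also drop a column with gill-spacing's level of missingness.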