# Challenge #224: Mushroom for Improvement

This week's challenge focuses on an encoded dataset describing mushrooms. In each of the 23 columns describing an observed mushroom, the value is represented by a single character. Use the Attribute Info table to replace each code with the value it represents.

Original challenge: https://community.alteryx.com/t5/Weekly-Challenge/Challenge-224-Mushroom-for-Improvement/td-p/602542

In [582]:
import pandas as pd
pd.options.display.max_colwidth = 100

### Import the datasets

In [583]:
df_mushrooms = pd.read_csv('./challenge_224_input.csv')

In [584]:
df_mushrooms.head(5)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [585]:
df_attributes = pd.read_csv('./challenge_224_input2.csv')
df_attributes.head(5)

Unnamed: 0,Field1
0,"Attribute Information: (classes: edible=e, poisonous=p)"
1,"cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s"
2,"cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s"
3,"cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y"
4,"bruises: bruises=t,no=f"


### Clean df_attributes dataset
First, we want to make sure that the headers between the two datasets match.

In [586]:
def clean_strings(text, replacement):
    df_attributes['Field1'] = df_attributes['Field1'].replace(to_replace=text, value=replacement, regex = True)

In [587]:
# Raw strings keep the regex escapes (\( and \)) intact and avoid invalid-escape warnings
clean_strings(r'Attribute Information: \(', '')
clean_strings(r'\)', '')
clean_strings('classes', 'class')
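
The same cleanup can also be written as a chain that returns a new Series instead of modifying df_attributes in place. This is only a minimal sketch on two sample rows copied from the table above; the names field and cleaned are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_attributes['Field1'] (two rows copied from the output above)
field = pd.Series([
    "Attribute Information: (classes: edible=e, poisonous=p)",
    "cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s",
])

cleaned = (
    field
    .str.replace(r"Attribute Information: \(", "", regex=True)  # drop the prefix
    .str.replace(r"\)", "", regex=True)                         # drop the closing bracket
    .str.replace("classes", "class", regex=False)               # match the mushroom header
)
print(cleaned.tolist())
```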

Here, we split the headers from the values which we can use to decode the abbreviations in the mushroom dataset.

In [588]:
df_attributes[['Header', 'Translations']] = df_attributes['Field1'].str.split(': ', expand = True)
df_attributes.head()

Unnamed: 0,Field1,Header,Translations
0,"class: edible=e, poisonous=p",class,"edible=e, poisonous=p"
1,"cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s",cap-shape,"bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s"
2,"cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s",cap-surface,"fibrous=f,grooves=g,scaly=y,smooth=s"
3,"cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y",cap-color,"brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y"
4,"bruises: bruises=t,no=f",bruises,"bruises=t,no=f"


We split the dataset into rows so that we get one attribute per row.

In [589]:
del df_attributes['Field1']
df_attributes = df_attributes.set_index(['Header']).apply(lambda x: x.str.split(',').explode()).reset_index() 
df_attributes.head()

Unnamed: 0,Header,Translations
0,class,edible=e
1,class,poisonous=p
2,cap-shape,bell=b
3,cap-shape,conical=c
4,cap-shape,convex=x
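
To see what the split and explode step does in isolation, here is a small sketch on a toy frame. The frame toy and the result long_rows are hypothetical names; the values are taken from the attribute rows shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "Header": ["class", "bruises"],
    "Translations": ["edible=e, poisonous=p", "bruises=t,no=f"],
})

# Turn each cell into a list, then give every list element its own row
long_rows = (
    toy.assign(Translations=toy["Translations"].str.split(","))
       .explode("Translations")
       .reset_index(drop=True)
)
print(long_rows)
```

Note that ' poisonous=p' keeps the space that followed the comma in the source string.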


We use = as the delimiter to split each translation into its decoded value and its abbreviation.

In [590]:
df_attributes[['Decoded', 'Abbreviation']] = df_attributes['Translations'].str.split('=', expand = True)
df_attributes.head()

Unnamed: 0,Header,Translations,Decoded,Abbreviation
0,class,edible=e,edible,e
1,class,poisonous=p,poisonous,p
2,cap-shape,bell=b,bell,b
3,cap-shape,conical=c,conical,c
4,cap-shape,convex=x,convex,x


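Splitting on the comma delimiter has left leading white space in some values (for example ' poisonous' and ' knobbed'), which we need to remove.

In [591]:
df_attributes['Header'] = df_attributes['Header'].str.strip()
# The space after a comma ends up in front of the decoded value, so strip that column too
df_attributes['Decoded'] = df_attributes['Decoded'].str.strip()
df_attributes['Abbreviation'] = df_attributes['Abbreviation'].str.strip()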
df_attributes.head()

Unnamed: 0,Header,Translations,Decoded,Abbreviation
0,class,edible=e,edible,e
1,class,poisonous=p,poisonous,p
2,cap-shape,bell=b,bell,b
3,cap-shape,conical=c,conical,c
4,cap-shape,convex=x,convex,x
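
A quick way to check where the white space ends up is to split two sample pieces by hand. This is a sketch with made-up variable names (pieces, parts); stripping every string column, as done here, is one safe option rather than the only one.

```python
import pandas as pd

# Two pieces as they look after the comma split
pieces = pd.Series(["edible=e", " poisonous=p"])
parts = pieces.str.split("=", expand=True)
parts.columns = ["Decoded", "Abbreviation"]

print(parts["Decoded"].tolist())  # the second entry still has a leading space

# Strip every column so no stray spaces survive in either half
parts = parts.apply(lambda col: col.str.strip())
print(parts["Decoded"].tolist())
```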


### Reshaping the mushroom dataset

We want to unpivot (melt) the mushroom dataset into a long format so we can join it to the attributes dataset. To bring it back into its original shape later on, we use the index to create a record ID.

In [592]:
df_mushrooms.reset_index(level=0, inplace=True)
df_mushrooms.head()

Unnamed: 0,index,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,0,p,x,s,n,t,p,f,c,n,...,s,w,w,p,w,o,p,k,s,u
1,1,e,x,s,y,t,a,f,c,b,...,s,w,w,p,w,o,p,n,n,g
2,2,e,b,s,w,t,l,f,c,b,...,s,w,w,p,w,o,p,n,n,m
3,3,p,x,y,w,t,p,f,c,n,...,s,w,w,p,w,o,p,k,s,u
4,4,e,x,s,g,f,n,f,w,b,...,s,w,w,p,w,o,e,n,a,g


The following step melts the dataset into one row per record ID and column.

In [593]:
df_mushrooms = pd.melt(df_mushrooms, id_vars='index')
df_mushrooms.sort_values(by='index', inplace = True)
df_mushrooms.head(5)

Unnamed: 0,index,variable,value
0,0,class,p
146232,0,ring-number,o
40620,0,odor,p
24372,0,cap-color,n
162480,0,spore-print-color,k
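
The reshaping is easier to follow on a tiny example. The sketch below melts a hypothetical two-row frame (wide) with the same kind of columns; every non-ID column becomes a variable/value pair.

```python
import pandas as pd

wide = pd.DataFrame({
    "index": [0, 1],
    "class": ["p", "e"],
    "odor": ["p", "a"],
})

# One row per (record ID, column) combination
melted = wide.melt(id_vars="index")
print(melted)
```

Two records with two attribute columns give four rows, with the default column names variable and value.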


We can now join df_mushrooms and df_attributes using both the header/variable and the abbreviation/value as join conditions, to make sure each abbreviation is associated with the correct header. Joining on the abbreviation alone would cause duplicates, because the same letter is reused under different headers.

In [594]:
df_join = df_mushrooms.merge(df_attributes, how="inner", left_on=['variable', 'value'], right_on=['Header', 'Abbreviation'])
df_join.sort_values(by='index', inplace = True)
df_join

Unnamed: 0,index,variable,value,Header,Translations,Decoded,Abbreviation
0,0,class,p,class,poisonous=p,poisonous,p
76874,0,stalk-color-above-ring,w,stalk-color-above-ring,white=w,white,w
76466,0,gill-color,k,gill-color,black=k,black,k
75218,0,population,s,population,scattered=s,scattered,s
72706,0,gill-size,n,gill-size,narrow=n,narrow,n
...,...,...,...,...,...,...,...
186241,8123,gill-attachment,a,gill-attachment,attached=a,attached,a
186535,8123,spore-print-color,o,spore-print-color,orange=o,orange,o
50369,8123,ring-type,p,ring-type,pendant=p,pendant,p
130313,8123,bruises,f,bruises,no=f,no,f
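
The effect of the two join keys can be illustrated with a small sketch. The frames observations and lookup are toy stand-ins for df_mushrooms and df_attributes; the codes used (p and e under class and ring-type) appear in the outputs above.

```python
import pandas as pd

observations = pd.DataFrame({
    "index": [0, 0],
    "variable": ["class", "ring-type"],
    "value": ["p", "p"],
})
lookup = pd.DataFrame({
    "Header": ["class", "class", "ring-type", "ring-type"],
    "Abbreviation": ["e", "p", "p", "e"],
    "Decoded": ["edible", "poisonous", "pendant", "evanescent"],
})

# Joining on the abbreviation alone matches 'p' under every header
loose = observations.merge(lookup, left_on="value", right_on="Abbreviation")

# Joining on header and abbreviation keeps one match per observation
strict = observations.merge(
    lookup,
    left_on=["variable", "value"],
    right_on=["Header", "Abbreviation"],
)
print(len(loose), len(strict))  # 4 2
```

With the single key, each observation picks up both 'poisonous' and 'pendant'; with both keys, each one gets only the translation for its own header.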


We keep only the three columns needed to pivot the table back into its wide shape.

In [595]:
df_join = df_join[['index', 'variable', 'Decoded']]
df_join.head()

Unnamed: 0,index,variable,Decoded
0,0,class,poisonous
76874,0,stalk-color-above-ring,white
76466,0,gill-color,black
75218,0,population,scattered
72706,0,gill-size,narrow


In [596]:
df_join = df_join.pivot(index = 'index', columns='variable', values='Decoded')
df_join

index,bruises,cap-color,cap-shape,cap-surface,class,gill-attachment,gill-color,gill-size,gill-spacing,habitat,...,ring-type,spore-print-color,stalk-color-above-ring,stalk-color-below-ring,stalk-root,stalk-shape,stalk-surface-above-ring,stalk-surface-below-ring,veil-color,veil-type
0,bruises,brown,convex,smooth,poisonous,free,black,narrow,close,urban,...,pendant,black,white,white,equal,enlarging,smooth,smooth,white,partial
1,bruises,yellow,convex,smooth,edible,free,black,broad,close,grasses,...,pendant,brown,white,white,club,enlarging,smooth,smooth,white,partial
2,bruises,white,bell,smooth,edible,free,brown,broad,close,meadows,...,pendant,brown,white,white,club,enlarging,smooth,smooth,white,partial
3,bruises,white,convex,scaly,poisonous,free,brown,narrow,close,urban,...,pendant,black,white,white,equal,enlarging,smooth,smooth,white,partial
4,no,gray,convex,smooth,edible,free,black,broad,crowded,grasses,...,evanescent,brown,white,white,equal,tapering,smooth,smooth,white,partial
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,no,brown,knobbed,smooth,edible,attached,yellow,broad,close,leaves,...,pendant,buff,orange,orange,missing,enlarging,smooth,smooth,orange,partial
8120,no,brown,convex,smooth,edible,attached,yellow,broad,close,leaves,...,pendant,buff,orange,orange,missing,enlarging,smooth,smooth,brown,partial
8121,no,brown,flat,smooth,edible,attached,brown,broad,close,leaves,...,pendant,buff,orange,orange,missing,enlarging,smooth,smooth,orange,partial
8122,no,brown,knobbed,scaly,poisonous,free,buff,narrow,close,leaves,...,evanescent,white,white,white,missing,tapering,smooth,silky,white,partial
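
The pivot returns the columns in alphabetical order rather than in the order of the input file. If the original order matters, one option is to reindex the columns afterwards. The sketch below uses toy data and assumes the column order was saved in a list (here the hypothetical original_order) before the dataset was melted.

```python
import pandas as pd

# Assumption: this list was captured from the wide dataset before melting
original_order = ["class", "cap-shape", "bruises"]

decoded_long = pd.DataFrame({
    "index": [0, 0, 0, 1, 1, 1],
    "variable": ["class", "cap-shape", "bruises"] * 2,
    "Decoded": ["poisonous", "convex", "bruises", "edible", "bell", "no"],
})

wide_again = decoded_long.pivot(index="index", columns="variable", values="Decoded")
wide_again = wide_again.reindex(columns=original_order)  # restore the saved order
wide_again.columns.name = None                            # drop the 'variable' label
print(wide_again)
```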
