# Homework 4
### Stanisław Antonowicz

As with the previous homeworks, I'm using the mushroom dataset. It might not be the best one for Ceteris Paribus plots (because all features are categorical), but maybe I'll be able to draw some meaningful conclusions.

In [11]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
import ceteris_paribus

### Loading data

A quick look at the data to remind us what it looks like:

In [4]:
data = np.genfromtxt('dataset_24_mushroom.csv', delimiter=',', dtype='<U20', skip_header=1)
labels = data[:,-1]
label_encoder = LabelEncoder()
label_encoder.fit(labels)
labels = label_encoder.transform(labels)
class_names = label_encoder.classes_
data = data[:,:-1]

categorical_features = list(range(22))


categorical_names = """1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s  
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y  
4. bruises?: bruises=t,no=f  
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s  
6. gill-attachment: attached=a,descending=d,free=f,notched=n  
7. gill-spacing: close=c,crowded=w,distant=d  
8. gill-size: broad=b,narrow=n  
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y  
10. stalk-shape: enlarging=e,tapering=t  
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?  
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s  
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s  
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y  
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y  
16. veil-type: partial=p,universal=u  
17. veil-color: brown=n,orange=o,white=w,yellow=y  
18. ring-number: none=n,one=o,two=t  
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z  
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y  
21. population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y  
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d""".split('\n')

feature_names = []
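for ind, line in enumerate(categorical_names):
    # Drop the leading "N. " counter, then split the feature name from its legend
    vals = line.strip().split('. ')[1]
    feature_name, values = vals.split(': ')
    feature_names.append(feature_name.strip())
    # Map each one-letter code to its full name; strip() guards against stray spaces (e.g. ", white=w")
    values = {x.split('=')[1].strip(): x.split('=')[0].strip() for x in values.split(',')}
    # Replace the codes in this column with the readable names
    data[:,ind] = np.array([values[x.strip("'")] for x in data[:,ind]])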

data = pd.DataFrame(data)
data.columns = feature_names
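
To make the legend parsing above easier to follow, here is a minimal standalone sketch of the same idea applied to a single legend line. The helper name `parse_legend_line` is hypothetical and not part of the notebook; it only illustrates how a line such as `8. gill-size: broad=b,narrow=n` turns into a feature name and a code-to-name mapping.

```python
def parse_legend_line(line):
    """Split '8. gill-size: broad=b,narrow=n' into ('gill-size', {'b': 'broad', 'n': 'narrow'})."""
    # Drop the leading "8. " counter, then separate the feature name from its values.
    body = line.strip().split('. ', 1)[1]
    feature_name, raw_values = body.split(': ', 1)
    code_to_name = {}
    for pair in raw_values.split(','):
        # Each pair looks like "broad=b"; strip() tolerates spaces after commas.
        name, code = pair.strip().split('=')
        code_to_name[code] = name
    return feature_name.strip(), code_to_name


name, mapping = parse_legend_line("8. gill-size: broad=b,narrow=n  ")
print(name, mapping)
```

Stripping each pair matters for lines like the gill-color one, where a space follows a comma (`red=e, white=w`); without it the decoded value would carry a leading space.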

In [23]:
data

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,convex,smooth,brown,bruises,pungent,free,close,narrow,black,enlarging,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
1,convex,smooth,yellow,bruises,almond,free,close,broad,black,enlarging,...,smooth,white,white,partial,white,one,pendant,brown,numerous,grasses
2,bell,smooth,white,bruises,anise,free,close,broad,brown,enlarging,...,smooth,white,white,partial,white,one,pendant,brown,numerous,meadows
3,convex,scaly,white,bruises,pungent,free,close,narrow,brown,enlarging,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
4,convex,smooth,gray,no,none,free,crowded,broad,black,tapering,...,smooth,white,white,partial,white,one,evanescent,brown,abundant,grasses
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,knobbed,smooth,brown,no,none,attached,close,broad,yellow,enlarging,...,smooth,orange,orange,partial,orange,one,pendant,buff,clustered,leaves
8120,convex,smooth,brown,no,none,attached,close,broad,yellow,enlarging,...,smooth,orange,orange,partial,brown,one,pendant,buff,several,leaves
8121,flat,smooth,brown,no,none,attached,close,broad,brown,enlarging,...,smooth,orange,orange,partial,orange,one,pendant,buff,clustered,leaves
8122,knobbed,scaly,brown,no,fishy,free,close,narrow,buff,tapering,...,silky,white,white,partial,white,one,evanescent,white,several,leaves


In [6]:
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

In [7]:
categorical_features = data.columns
categorical_transformer = Pipeline(
    steps=[
        ('one_hot_encoder', OneHotEncoder())
    ]
)


transformer = ColumnTransformer(
    transformers=[
        ('categorical', categorical_transformer, categorical_features)
    ]
)
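
As a rough illustration of what the one-hot encoding step does to categorical columns, the sketch below encodes a tiny hand-made table in plain Python. It is not scikit-learn's implementation (which, among other things, returns a sparse matrix by default); `one_hot_encode` and `toy_rows` are hypothetical names used only for this example.

```python
def one_hot_encode(rows, columns):
    """Return (encoded_rows, output_names) for a list of dict rows."""
    # Collect the sorted set of categories seen in each column.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    # One output column per (original column, category) pair.
    output_names = [f"{col}_{cat}" for col in columns for cat in categories[col]]
    encoded = [
        [1 if row[col] == cat else 0 for col in columns for cat in categories[col]]
        for row in rows
    ]
    return encoded, output_names


toy_rows = [
    {"odor": "none", "gill-size": "broad"},
    {"odor": "foul", "gill-size": "narrow"},
    {"odor": "none", "gill-size": "narrow"},
]
encoded, names = one_hot_encode(toy_rows, ["odor", "gill-size"])
print(names)
for row in encoded:
    print(row)
```

Each original column contributes exactly one 1 per row, so the 22 mushroom features expand into many more binary columns before they reach the model.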

In [8]:
model_rf = Pipeline(
    steps=[
        ('transformer', transformer),
        ('model', RandomForestClassifier())
    ])

In [24]:
_ = model_rf.fit(X_train, y_train)

I'll skip showing classification reports this time – as always, the prediction is perfect.

### Ceteris Paribus Profiles


In [12]:
from ceteris_paribus.profiles import individual_variable_profile
from ceteris_paribus.explainer import explain
from ceteris_paribus.plots.plots import plot
from ceteris_paribus.select_data import select_sample, select_neighbours

In [14]:
rf_explainer = explain(model_rf, X_train.columns, X_train, y_train,
                       predict_function=lambda x: model_rf.predict_proba(x)[:, 1], label='RandomForest')

In [56]:
idx = 1
print(f'Observation {idx}, label: {"poisonous" if y_train[idx] else "edible"}')

cp_profile = individual_variable_profile(
    rf_explainer, X_train.iloc[idx], y_train[idx],
    variables=['odor', 'gill-size']
)



plot(cp_profile, destination='notebook', height=600, width=950)

Observation 1, label: poisonous


In [57]:
idx = 766
print(f'Observation {idx}, label: {"poisonous" if y_train[idx] else "edible"}')

cp_profile = individual_variable_profile(
    rf_explainer, X_train.iloc[idx], y_train[idx],
    variables=['odor', 'gill-size']
)



plot(
    cp_profile, 
    destination='notebook', 
    height=600, 
    width=950
)



Observation 766, label: poisonous


Although both of these observations are labelled as poisonous, their profiles look different.

For example, for the first observation, if the mushroom had no odor, the predicted probability of it being poisonous would be lower, whereas for the second observation it seems that every other kind of odor would have this effect.

As for gill size, in the first observation changing it from narrow to broad would lower the predicted probability of being poisonous, whereas in the second observation it would do the opposite.
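
The readings above rely on what a Ceteris Paribus profile computes: one feature of a single observation is swapped through its possible values while everything else stays fixed, and the model is queried each time. Below is a minimal sketch of that idea in plain Python. It is not the `ceteris_paribus` package's implementation, and `toy_predict` is a hypothetical stand-in for a fitted model's "probability of poisonous", with made-up weights.

```python
def ceteris_paribus_profile(predict, observation, variable, candidate_values):
    """Predicted score for each candidate value of `variable`, other features fixed."""
    profile = {}
    for value in candidate_values:
        modified = dict(observation)  # copy so the original observation is untouched
        modified[variable] = value
        profile[value] = predict(modified)
    return profile


def toy_predict(mushroom):
    # Hypothetical scoring rule, NOT learned from the mushroom data.
    score = 0.1
    if mushroom["odor"] in {"foul", "pungent", "creosote"}:
        score += 0.7
    if mushroom["gill-size"] == "narrow":
        score += 0.15
    return round(score, 2)


observation = {"odor": "pungent", "gill-size": "narrow"}
odor_profile = ceteris_paribus_profile(
    toy_predict, observation, "odor", ["almond", "none", "pungent"]
)
print(odor_profile)
```

Because only one feature changes at a time, two observations with the same label can still have differently shaped profiles: the effect of a swap depends on the values of all the features that were held fixed.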

Now we'll look at the second model: a neural network from scikit-learn with default parameters.

In [46]:
model_nn = Pipeline(
    steps=[
        ('transformer', transformer),
        ('model', MLPClassifier())
    ]
)

In [47]:
_ = model_nn.fit(X_train, y_train)

In [48]:
nn_explainer = explain(model_nn, X_train.columns, X_train, y_train,
                       predict_function=lambda x: model_nn.predict_proba(x)[:, 1], label='NeuralNet')

In [53]:
idx = 2342
print(f'Observation {idx}, label: {"poisonous" if y_train[idx] else "edible"}')

cp_profile_rf = ceteris_paribus.profiles.individual_variable_profile(
    rf_explainer, X_train.iloc[idx], y_train[idx],
    variables=['odor', 'gill-size']
)

cp_profile_nn = ceteris_paribus.profiles.individual_variable_profile(
    nn_explainer, X_train.iloc[idx], y_train[idx],
    variables=['odor', 'gill-size']
)



ceteris_paribus.plots.plots.plot(cp_profile_rf, cp_profile_nn, destination='notebook', height=600, width=950)

Observation 2342, label: edible


The profiles look very different for this particular observation.

Changing the odor to creosote, fishy, foul, pungent or spicy would flip the neural net's prediction from edible to poisonous, whereas for the random forest it would not.

For gill size, the behaviour is the opposite: changing it from broad to narrow doesn't make a huge difference for the neural net, whereas it does for the random forest (although not enough to change the prediction).

I think this is a result of the way features are used for splitting in the random forest: each time a split is made, only a random subset of features is considered. This makes it less likely for one feature to "overpower" the others.
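
The feature-subsampling argument can be illustrated with a small simulation. This is a simplified sketch, assuming the square-root rule (consider about sqrt(n) randomly chosen features per split, which is scikit-learn's default for `RandomForestClassifier`); it ignores details of the real tree-building code. The column count `n_columns = 100` is a hypothetical round number, not the actual number of one-hot columns in this dataset.

```python
import math
import random


def candidate_features(n_features, rng):
    """Pick the random subset of feature indices considered at one split."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)


def share_of_splits_with_feature(feature, n_features, n_splits, seed=0):
    """Fraction of simulated splits at which `feature` was even a candidate."""
    rng = random.Random(seed)
    hits = sum(feature in candidate_features(n_features, rng) for _ in range(n_splits))
    return hits / n_splits


n_columns = 100  # hypothetical number of encoded columns
share = share_of_splits_with_feature(0, n_columns, 20_000)
print(f"feature 0 was a split candidate in {share:.1%} of simulated splits")
```

Under these assumptions a given column is available at only roughly sqrt(n)/n of the splits, so even a very informative column cannot be used everywhere, which is consistent with the forest spreading its predictions over more features than a single dominant one.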

