In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

# Load data

We load two alternative tables, as explained in the notebook '0-data-preparation'.

In [2]:
btvote = pd.read_pickle('../data/btvote.pkl')
btvote.head()

Unnamed: 0,party,vote_19001,vote_19002,vote_19003,vote_19004,vote_19005,vote_19006,vote_19007,vote_19008,vote_19009,...,vote_19235,vote_19236,vote_19237,vote_19238,vote_19239,vote_19240,vote_19241,vote_19242,vote_19243,vote_19244
0,CDU/CSU,yes,yes,yes,yes,yes,yes,yes,yes,no,...,yes,yes,yes,yes,yes,yes,no,yes,yes,yes
1,SPD,,,,,,,,,,...,yes,yes,yes,yes,yes,yes,no,absence,absence,absence
2,Linke,no,no,no,no,no,no,no,no,yes,...,no,no,no,no,no,no,no,abstain,no,no
3,CDU/CSU,yes,yes,yes,yes,yes,yes,yes,yes,no,...,yes,yes,yes,yes,yes,yes,no,yes,yes,absence
4,Linke,absence,absence,absence,absence,absence,absence,absence,absence,absence,...,no,no,no,no,no,no,no,abstain,no,absence


In [3]:
btvote_alt = pd.read_pickle('../data/btvote_alternative.pkl')
btvote_alt.head()

Unnamed: 0,party,vote_19001,vote_19002,vote_19003,vote_19004,vote_19005,vote_19006,vote_19007,vote_19008,vote_19009,...,vote_19235,vote_19236,vote_19237,vote_19238,vote_19239,vote_19240,vote_19241,vote_19242,vote_19243,vote_19244
0,CDU/CSU,yes,yes,yes,yes,yes,yes,yes,yes,no,...,yes,yes,yes,yes,yes,yes,no,yes,yes,yes
1,SPD,,,,,,,,,,...,yes,yes,yes,yes,yes,yes,no,excused absence,excused absence,unexcused absent
2,Linke,no,no,no,no,no,no,no,no,yes,...,no,no,no,no,no,no,no,abstain,no,no
3,CDU/CSU,yes,yes,yes,yes,yes,yes,yes,yes,no,...,yes,yes,yes,yes,yes,yes,no,yes,yes,excused absence
4,Linke,excused absence,excused absence,excused absence,excused absence,excused absence,unexcused absent,unexcused absent,unexcused absent,excused absence,...,no,no,no,no,no,no,no,abstain,no,excused absence


# Split data and encode target variable

In [4]:
# Split dataframe into 'data' and 'target'
btvote_data = btvote.drop('party', axis=1)
btvote_target = btvote['party']

# Create a 'data' part for the alternative dataset btvote_alt
btvote_data_alt = btvote_alt.drop('party', axis=1)

# Encode the target variable
label_encoder = preprocessing.LabelEncoder()
btvote_target = label_encoder.fit_transform(btvote_target)

# Pipeline and GridSearch setup

In the pipeline we include the SimpleImputer with different strategies as well as the KNNImputer with the number of neighbors between 1 and 9.\
For balancing, we use only the RandomOverSampler for now. Balancing is evaluated in detail in a later notebook.\
As estimators, we consider k-nearest neighbors, Decision Tree and Naive Bayes. More models are analysed in detail later. For the moment, this selection of estimators should just ensure well-founded results for the different imputing methods.

In [9]:
from imblearn.pipeline import Pipeline
# encoding
from sklearn.preprocessing import OneHotEncoder
# imputer
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# balancing
from imblearn.over_sampling import RandomOverSampler
# classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Pipeline
pipeline = Pipeline([
    ('imputer', None),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ('balancing', RandomOverSampler()),
    ('estimator', None)
])
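
The `None` placeholders in the pipeline are filled in by the grid search: every entry of the parameter grid that is named after a step replaces that step with the listed object. The following is a minimal sketch of this mechanism on made-up toy data. It uses the plain scikit-learn `Pipeline` without a balancing step, and all `toy_*` names and values are assumptions for illustration, not part of this notebook's experiments.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy data (made up): 30 samples, 4 numeric features, 2 balanced classes
rng = np.random.default_rng(0)
y_toy = np.tile([0, 1], 15)
X_toy = rng.normal(size=(30, 4))
X_toy[:, 1] += 2 * y_toy      # make one feature informative
X_toy[::5, 0] = np.nan        # insert some missing values
X_toy[1::7, 2] = np.nan

# Both steps are placeholders that the grid search replaces
toy_pipeline = Pipeline([("imputer", None), ("estimator", None)])
toy_grid = {
    "imputer": [SimpleImputer(strategy="most_frequent"), KNNImputer(n_neighbors=1)],
    "estimator": [KNeighborsClassifier(n_neighbors=3), GaussianNB()],
}

# An explicit scoring is needed, because the placeholder pipeline has no score method
toy_search = GridSearchCV(
    toy_pipeline, toy_grid, scoring="accuracy", cv=StratifiedKFold(n_splits=3)
)
toy_search.fit(X_toy, y_toy)

# 2 imputers x 2 estimators = 4 evaluated combinations
print(len(toy_search.cv_results_["params"]))
print(toy_search.best_params_)
```

Each combination of imputer and estimator is cloned into the pipeline and cross-validated, which is why a single pipeline definition can be reused across all experiments.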

In [10]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# specify the cross validation
stratified_10_fold_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# define the scoring function
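# Note: for single-label multiclass targets, micro-averaged F1 equals accuracy.
# The oversampler only rebalances the training folds, so the test folds keep the original class distribution.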
f1 = make_scorer(f1_score, average='micro')
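
As a side note on the scoring function: for a single-label multiclass target, the micro-averaged F1 score coincides with accuracy, while the macro average can differ. A small sketch with made-up labels (`y_true` and `y_pred` are illustrative assumptions, not data from this notebook):

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for three classes (illustration only)
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 1, 0, 1, 2]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

# micro-averaged F1 and accuracy agree; the macro average weights each class equally
print(micro, accuracy, macro)
```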

# NaN handling

### Data Enrichment
Definition of NaN
1. Only actual NaNs are considered as NaNs, 'abstain' and 'absence' both as separate values\
    a. 'unexcused absent' and 'excused absence' combined as 'absence'\
    b. 'unexcused absent' and 'excused absence' as separate values
2. Consider actual NaNs and 'absence' as NaN, keep 'abstain' as possible value\
    a. encode 'no', 'abstain' and 'yes' using OneHotEncoder in pipeline\
    b. encode 'no', 'abstain' and 'yes' ordinal
3. Consider only 'yes' and 'no' as allowed values

Imputing of missing values (NaN)

- sklearn [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) with strategies *mean*, *most_frequent* and *constant*
- sklearn [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer) with different *n_neighbors*

*Limitation:* The KNNImputer and the SimpleImputer with strategy 'mean' cannot handle categorical values. If we one-hot encoded the data before imputing, we would end up with an extra column for NaN. The only option is to apply an ordinal encoding before imputing. In that case, the ordinal encoding influences distances and averages. These two imputing approaches can therefore only be used in experiments 2b and 3, as we cannot define an order for experiment 1.\
*Exception:* KNNImputer with n_neighbors=1 can be used in all experiments, as no averaging is performed.
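
The limitation and the exception can be illustrated on a tiny, made-up matrix of category codes (the array `codes` below is an assumption for illustration, not taken from the dataset). Mean imputation produces values that do not correspond to any category, whereas KNNImputer with `n_neighbors=1` only copies a value from the nearest row that has one.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up category codes, e.g. 0='no', 1='yes', 2='abstain', 3='absence'
codes = np.array([
    [0, 1, 2],
    [1, 1, np.nan],
    [3, np.nan, 2],
    [np.nan, 0, 2],
    [1, 1, 0],
], dtype=float)

mean_filled = SimpleImputer(strategy="mean").fit_transform(codes)
knn_filled = KNNImputer(n_neighbors=1).fit_transform(codes)

# The column mean (0 + 1 + 3 + 1) / 4 = 1.25 is not a valid category code
print(mean_filled[3, 0])

# With a single neighbor, every imputed value is an existing category code
print(np.isin(knn_filled, [0, 1, 2, 3]).all())
```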

### 1. 'abstain' and 'absence' as separate values
#### a. 'unexcused absent' and 'excused absence' combined as 'absence'

In [7]:
# define parameter grid
parameters = {
    'imputer': [SimpleImputer(strategy='most_frequent'), SimpleImputer(strategy='constant'), KNNImputer(n_neighbors=1)],
    'estimator': [KNeighborsClassifier(n_neighbors=7), DecisionTreeClassifier(max_depth=5, random_state=42), GaussianNB()],
}

In [11]:
# encode data
btvote_data_1a = btvote_data.replace({'no':0, 'yes':1, 'abstain':2, 'absence':3})

# create the grid search instance
grid_search_estimator = GridSearchCV(pipeline, parameters, scoring=f1, cv=stratified_10_fold_cv, error_score='raise')

# run the grid search
grid_search_estimator.fit(btvote_data_1a, btvote_target)

# results of all hyper-parameter combinations
results = pd.DataFrame(grid_search_estimator.cv_results_)

# pivot the results for better visualization
results['param_imputer'] = results['param_imputer'].astype(str)
results['param_estimator'] = results['param_estimator'].astype(str)
pivoted_results = results.pivot(index='param_imputer', columns='param_estimator', values='mean_test_score')
pivoted_results['Average'] = pivoted_results[['DecisionTreeClassifier(max_depth=5, random_state=42)','GaussianNB()','KNeighborsClassifier(n_neighbors=7)']].mean(axis=1)
display(pivoted_results)

param_imputer,"DecisionTreeClassifier(max_depth=5, random_state=42)",GaussianNB(),KNeighborsClassifier(n_neighbors=7),Average
KNNImputer(n_neighbors=1),0.740811,0.810865,0.699297,0.750324
SimpleImputer(strategy='constant'),0.680577,0.724811,0.707387,0.704258
SimpleImputer(strategy='most_frequent'),0.68036,0.785297,0.731532,0.732396


We see that, on average, the KNNImputer with n_neighbors=1 performs best, followed by the SimpleImputer with strategy 'most_frequent' and then 'constant'. We won't go into detail on the different models, as they will be discussed in later notebooks.
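
The pivot step is repeated for every experiment. As a possible simplification, it could be wrapped in a small helper. The function name `summarise_grid_search` is hypothetical, and the sketch is fed with a hand-built dictionary that copies two rows of the table above instead of a real `cv_results_` object.

```python
import pandas as pd

def summarise_grid_search(cv_results, index="param_imputer"):
    """Pivot grid search results into an imputer x estimator table with a row average."""
    results = pd.DataFrame(cv_results)
    # Parameter objects are converted to strings so they can serve as labels
    for col in ("param_imputer", "param_estimator"):
        results[col] = results[col].astype(str)
    table = results.pivot(index=index, columns="param_estimator", values="mean_test_score")
    # Average over all estimator columns, without hard-coding their names
    table["Average"] = table.mean(axis=1)
    return table.sort_values("Average", ascending=False)

# Hand-built stand-in for grid_search_estimator.cv_results_ (scores copied from experiment 1a)
toy_results = {
    "param_imputer": ["KNNImputer(n_neighbors=1)"] * 3 + ["SimpleImputer(strategy='most_frequent')"] * 3,
    "param_estimator": ["DecisionTreeClassifier", "GaussianNB", "KNeighborsClassifier"] * 2,
    "mean_test_score": [0.740811, 0.810865, 0.699297, 0.680360, 0.785297, 0.731532],
}
summary = summarise_grid_search(toy_results)
print(summary)
```

Sorting by the average puts the best imputer in the first row, which makes the comparison across experiments easier to read.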

#### b. 'unexcused absent' and 'excused absence' as separate values

In [12]:
# encode data
btvote_data_1b = btvote_data_alt.replace({'no':0, 'yes':1, 'abstain':2, 'unexcused absent':3, 'excused absence':4})

# create the grid search instance
grid_search_estimator = GridSearchCV(pipeline, parameters, scoring=f1, cv=stratified_10_fold_cv, error_score='raise')

# run the grid search
grid_search_estimator.fit(btvote_data_1b, btvote_target)

# results of all hyper-parameter combinations
results = pd.DataFrame(grid_search_estimator.cv_results_)

# pivot the results for better visualization
results['param_imputer'] = results['param_imputer'].astype(str)
results['param_estimator'] = results['param_estimator'].astype(str)
pivoted_results = results.pivot(index='param_imputer', columns='param_estimator', values='mean_test_score')
pivoted_results['Average'] = pivoted_results[['DecisionTreeClassifier(max_depth=5, random_state=42)','GaussianNB()','KNeighborsClassifier(n_neighbors=7)']].mean(axis=1)
display(pivoted_results)

param_imputer,"DecisionTreeClassifier(max_depth=5, random_state=42)",GaussianNB(),KNeighborsClassifier(n_neighbors=7),Average
KNNImputer(n_neighbors=1),0.73955,0.804036,0.714144,0.752577
SimpleImputer(strategy='constant'),0.707441,0.739694,0.716919,0.721351
SimpleImputer(strategy='most_frequent'),0.686126,0.785279,0.734252,0.735219


We see that splitting 'absence' into 'unexcused absent' and 'excused absence' improves the test results only slightly, if at all. Keeping in mind that our Wahl-O-Mat will offer only one option for 'absence', we will from now on use 'unexcused absent' and 'excused absence' combined as 'absence'.

### 2. 'absence' is NaN, 'abstain' as allowed value

#### a. encode 'no', 'abstain' and 'yes' using OneHotEncoder in pipeline

The evaluation works the same as for the case before. We just convert all 'absence' values to NaN before executing the grid search.

In [13]:
# encode data
btvote_data_2a = btvote_data.replace({'absence':np.nan, 'no':0, 'yes':1, 'abstain':2})

# create the grid search instance
grid_search_estimator = GridSearchCV(pipeline, parameters, scoring=f1, cv=stratified_10_fold_cv, error_score='raise')

# run the grid search
grid_search_estimator.fit(btvote_data_2a, btvote_target)

# results of all hyper-parameter combinations
results = pd.DataFrame(grid_search_estimator.cv_results_)

# pivot the results for better visualization
results['param_imputer'] = results['param_imputer'].astype(str)
results['param_estimator'] = results['param_estimator'].astype(str)
pivoted_results = results.pivot(index='param_imputer', columns='param_estimator', values='mean_test_score')
pivoted_results['Average'] = pivoted_results[['DecisionTreeClassifier(max_depth=5, random_state=42)','GaussianNB()','KNeighborsClassifier(n_neighbors=7)']].mean(axis=1)
display(pivoted_results)

param_imputer,"DecisionTreeClassifier(max_depth=5, random_state=42)",GaussianNB(),KNeighborsClassifier(n_neighbors=7),Average
KNNImputer(n_neighbors=1),0.786486,0.822865,0.797243,0.802198
SimpleImputer(strategy='constant'),0.659189,0.706072,0.736973,0.700745
SimpleImputer(strategy='most_frequent'),0.691171,0.786667,0.773135,0.750324


Removing 'absence' as a value significantly improves performance for the KNNImputer. For the other two imputers, the results are similar to the previous experiments. Overall, treating 'absence' as an informative value reduces performance.

#### b. encode 'no', 'abstain' and 'yes' ordinal

Now we define a new pipeline without the OneHotEncoder. We encode the values ordinally before executing the grid search: {'no':0}<{'abstain':0.5}<{'yes':1}.
These ordinal encodings can then be used for both classification and imputing.
In this case, all imputers listed above can be used, as the voting behavior is in numeric format. That is why we redefine the parameter grid.
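
To make the effect of the ordinal encoding on distances concrete, the following sketch compares it with a one-hot encoding of the same three values. The helper names `ordinal_distance` and `one_hot_distance` are assumptions introduced for illustration only.

```python
import numpy as np

# Ordinal encoding used in this experiment
ordinal = {"no": 0.0, "abstain": 0.5, "yes": 1.0}

# One-hot encoding of the same three values, for comparison
one_hot = {
    "no": np.array([1.0, 0.0, 0.0]),
    "abstain": np.array([0.0, 1.0, 0.0]),
    "yes": np.array([0.0, 0.0, 1.0]),
}

def ordinal_distance(a, b):
    """Absolute difference between two votes under the ordinal encoding."""
    return abs(ordinal[a] - ordinal[b])

def one_hot_distance(a, b):
    """Euclidean distance between two votes under the one-hot encoding."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

for pair in [("no", "abstain"), ("abstain", "yes"), ("no", "yes")]:
    print(pair, ordinal_distance(*pair), round(one_hot_distance(*pair), 3))
```

Under the ordinal encoding, 'abstain' lies halfway between 'no' and 'yes', so distance-based methods treat it as a partial agreement with both. Under the one-hot encoding, all three values are equally far apart.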

In [14]:
# ordinal pipeline without OneHotEncoder
ordinal_pipeline = Pipeline([('imputer', None), ('balancing', RandomOverSampler()), ('estimator', None)])

In [15]:
# redefine parameter grid
ordinal_parameters = [
    {
        'imputer': [SimpleImputer(strategy='mean'), SimpleImputer(strategy='most_frequent'), SimpleImputer(strategy='constant')],
        'estimator': [KNeighborsClassifier(n_neighbors=7), DecisionTreeClassifier(max_depth=5, random_state=42), GaussianNB()],
    }, {
        'imputer': [KNNImputer()],
        'imputer__n_neighbors': range(1,10),
        'estimator': [KNeighborsClassifier(n_neighbors=7), DecisionTreeClassifier(max_depth=5, random_state=42), GaussianNB()],
    }
]

In [16]:
# encode data
btvote_data_2b = btvote_data.replace({'absence':np.nan, 'no':0, 'abstain':0.5, 'yes':1})

# create the grid search instance
grid_search_estimator = GridSearchCV(ordinal_pipeline, ordinal_parameters, scoring=f1, cv=stratified_10_fold_cv, error_score='raise')

# run the grid search
grid_search_estimator.fit(btvote_data_2b, btvote_target)

# results of all hyper-parameter combinations
results = pd.DataFrame(grid_search_estimator.cv_results_)

# pivot the results for better visualization
results['param_imputer'] = results['param_imputer'].astype(str)
results['param_estimator'] = results['param_estimator'].astype(str)
pivoted_results = results.pivot(index=['param_imputer','param_imputer__n_neighbors'], columns='param_estimator', values='mean_test_score')
pivoted_results['Average'] = pivoted_results[['DecisionTreeClassifier(max_depth=5, random_state=42)','GaussianNB()','KNeighborsClassifier(n_neighbors=7)']].mean(axis=1)
display(pivoted_results)


param_imputer,param_imputer__n_neighbors,"DecisionTreeClassifier(max_depth=5, random_state=42)",GaussianNB(),KNeighborsClassifier(n_neighbors=7),Average
KNNImputer(),1.0,0.782414,0.820144,0.801243,0.801267
KNNImputer(),2.0,0.786613,0.801495,0.798631,0.79558
KNNImputer(),3.0,0.786595,0.76,0.798595,0.78173
KNNImputer(),4.0,0.778486,0.695477,0.801423,0.758462
KNNImputer(),5.0,0.782523,0.688757,0.783874,0.751718
KNNImputer(),6.0,0.786559,0.683369,0.782667,0.750865
KNNImputer(),7.0,0.782631,0.682072,0.796,0.753568
KNNImputer(),8.0,0.770631,0.729171,0.801405,0.767069
KNNImputer(),9.0,0.783892,0.73982,0.785351,0.769688
SimpleImputer(),,0.695315,0.816144,0.708703,0.740054


Encoding the values 'no', 'abstain' and 'yes' ordinally yields scores similar to those of the methods that could already be applied in experiment 2a. Because it skips the OneHotEncoding step, which adds computational overhead by increasing the number of input features, this strategy is preferable.

### 3. only 'yes' and 'no' are allowed values

Again, all imputers listed above can be used, as the voting behavior is in numeric format and ordinal.

In [17]:
# encode data
btvote_data_3 = btvote_data.replace({'absence':np.nan, 'abstain':np.nan, 'no':0, 'yes':1})

# create the grid search instance
grid_search_estimator = GridSearchCV(ordinal_pipeline, ordinal_parameters, scoring=f1, cv=stratified_10_fold_cv, error_score='raise')

# run the grid search
grid_search_estimator.fit(btvote_data_3, btvote_target)

# results of all hyper-parameter combinations
results = pd.DataFrame(grid_search_estimator.cv_results_)

# pivot the results for better visualization
results['param_imputer'] = results['param_imputer'].astype(str)
results['param_estimator'] = results['param_estimator'].astype(str)
pivoted_results = results.pivot(index=['param_imputer','param_imputer__n_neighbors'], columns='param_estimator', values='mean_test_score')
pivoted_results['Average'] = pivoted_results[['DecisionTreeClassifier(max_depth=5, random_state=42)','GaussianNB()','KNeighborsClassifier(n_neighbors=7)']].mean(axis=1)
display(pivoted_results)


param_imputer,param_imputer__n_neighbors,"DecisionTreeClassifier(max_depth=5, random_state=42)",GaussianNB(),KNeighborsClassifier(n_neighbors=7),Average
KNNImputer(),1.0,0.77182,0.816198,0.785171,0.791063
KNNImputer(),2.0,0.774468,0.802811,0.790577,0.789285
KNNImputer(),3.0,0.777189,0.789387,0.789207,0.785261
KNNImputer(),4.0,0.783838,0.796036,0.790649,0.790174
KNNImputer(),5.0,0.786577,0.790703,0.797405,0.791562
KNNImputer(),6.0,0.77182,0.793387,0.788,0.784402
KNNImputer(),7.0,0.779892,0.793405,0.790667,0.787988
KNNImputer(),8.0,0.771766,0.793387,0.794649,0.786601
KNNImputer(),9.0,0.775856,0.790703,0.800054,0.788871
SimpleImputer(),,0.743694,0.805477,0.731586,0.760252


Compared to the previous experiment, the results for the KNNImputer are more stable for larger values of n_neighbors. The SimpleImputers perform similarly to the previous experiment.

# Conclusion
We have seen that performance is generally higher with the KNNImputer than with the SimpleImputer, regardless of the strategy used. The highest F1 score is achieved with n_neighbors set to 1, so that only the nearest neighbor is considered. From now on, we will use this as our imputer.

We simultaneously evaluated different input formats. The result is clear: using both 'abstain' and 'absence' as possible input values generally leads to a lower F1 score. Whether 'abstain' is kept as a value does not noticeably influence the results. As we want to offer an 'abstain' option in our model later, we will keep this value.\
Using a OneHotEncoder instead of an ordinal encoding does not improve the results and only adds computational overhead. We will therefore use option 2b in future experiments.

# Dataset generation

Finally, we generate and export the encoded dataset for option 2b, so that the next experiments can simply import it.

In [18]:
btvote_encoded = btvote.replace({'absence':np.nan, 'no':0, 'abstain':0.5, 'yes':1})
btvote_encoded.to_pickle('../data/btvote_encoded.pkl')
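
As an optional sanity check, the export can be tested with a round trip through a pickle file. The sketch below uses a made-up toy frame and a temporary directory instead of the real dataset and path. Unlike the cell above, it maps only the columns whose names start with `vote_` (an assumption based on the column names shown earlier), which yields float columns directly.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Made-up toy frame with the same layout as btvote (illustration only)
toy = pd.DataFrame({
    "party": ["SPD", "Linke", "CDU/CSU"],
    "vote_19001": ["yes", "absence", "no"],
    "vote_19002": ["abstain", "no", np.nan],
})

# Option 2b: 'absence' becomes NaN, the remaining votes are encoded ordinally
mapping = {"absence": np.nan, "no": 0.0, "abstain": 0.5, "yes": 1.0}
vote_cols = [c for c in toy.columns if c.startswith("vote_")]

encoded = toy.copy()
for col in vote_cols:
    encoded[col] = toy[col].map(mapping)

# Write the frame to a temporary pickle file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_encoded.pkl"
    encoded.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Raises an AssertionError if values or dtypes changed during the round trip
pd.testing.assert_frame_equal(encoded, reloaded)
print(reloaded.dtypes)
```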