In [1]:
pip install ucimlrepo

Note: you may need to restart the kernel to use updated packages.


In [140]:
from ucimlrepo import fetch_ucirepo  # not used below: the data is read from a local file

# libraries for loading and processing the dataset
import pandas as pd
import numpy as np

# adjust delimiter / header according to the dataset description
df = pd.read_csv("house-votes-84.data", header=None)
df.head()

            0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
0  republican  n  y  n  y  y  y  n  n  n   y   ?   y   y   y   n   y
1  republican  n  y  n  y  y  y  n  n  n   n   n   y   y   y   n   ?
2    democrat  ?  y  y  ?  y  y  n  n  n   n   y   n   y   y   n   n
3    democrat  n  y  y  n  ?  y  n  n  n   n   y   n   y   n   n   y
4    democrat  y  y  y  n  y  y  n  n  n   n   y   ?   y   y   y   y


In [141]:
mode = int(input().strip()) # choosing the mode of the program

Justification of the chosen method:

As the task specifies, the first approach treats the missing values as a third category, whereas the second one fills them with already existing values using a method of our choice. For this second mode, I fill each '?' with the most frequent value of that feature (column) within the same class (democrat or republican). I prefer this way of handling '?' because it does not introduce a third category that may blur the category values. Moreover, replacing '?' with the most plausible value of each column keeps the majority vote of every column within each class unchanged, although it slightly raises the share of that majority value.
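
The class-wise mode imputation described above can be illustrated on a tiny example. This is only a sketch on made-up data: the frame `toy` and its values are hypothetical, and it uses `groupby(...).transform(...)` rather than the explicit loops of the notebook.

```python
import pandas as pd

# Hypothetical toy data: column 0 is the party, column 1 is a single vote.
toy = pd.DataFrame({
    0: ["democrat", "democrat", "democrat",
        "republican", "republican", "republican"],
    1: ["y", "y", "?", "n", "?", "n"],
})

# Turn '?' into a missing value (NaN).
votes = toy[1].mask(toy[1] == "?")

# Within each party, fill the gaps with that party's most frequent vote.
filled = votes.groupby(toy[0]).transform(lambda s: s.fillna(s.mode()[0]))

print(filled.tolist())
```

Here the democrat row with '?' receives 'y' and the republican row with '?' receives 'n', because those are the majority votes of the respective party in the toy data.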

In [142]:
if mode == 0:
    # first approach: treating '?' as a separate category
    cols_to_encode = df.columns[1:] 
    encoder = {'y': 0, 'n': 1, '?': 2}     
    df[cols_to_encode] = df[cols_to_encode].replace(encoder)
elif mode == 1:
    # second approach: filling  missing values ('?') with the most frequent value per class
    cols_to_encode = df.columns[1:] 
    df[cols_to_encode] = df[cols_to_encode].replace('?', np.nan)
    for party in ['democrat', 'republican']:
        mask_class = (df[0] == party)
        for col in cols_to_encode:
            most_frequent = df.loc[mask_class, col].mode()[0]  
            df.loc[mask_class, col] = df.loc[mask_class, col].fillna(most_frequent)
    encoder = {'y': 0, 'n': 1}  
    df[cols_to_encode] = df[cols_to_encode].replace(encoder)     



In [143]:
X = df.iloc[:, 1:]  # all columns except the first one
y = df.iloc[:, 0]   # first column only

print(type(X))
print(X.shape)
print(type(y))
print(y.shape)
#print(X[16][0])

class_labels_list = y.unique()
for current_class in class_labels_list:
    print(current_class)

<class 'pandas.core.frame.DataFrame'>
(435, 16)
<class 'pandas.core.series.Series'>
(435,)
republican
democrat


In [144]:
# Sources for splitting the data into training and testing sets without sklearn:
# https://www.geeksforgeeks.org/python/how-to-split-data-into-training-and-testing-in-python-without-sklearn/
# https://www.geeksforgeeks.org/python/stratified-sampling-in-pandas/
import pandas as pd

data = pd.concat([X, y], axis=1)
data = data.sample(frac=1, random_state=14).reset_index(drop=True)  # shuffle the data

train = data.groupby(0, group_keys=False).apply(lambda x: x.sample(frac=0.8, random_state=14))
X_train = train.drop(0, axis=1)
y_train = train[0]

test = data.drop(train.index)
X_test = test.drop(0, axis=1)
y_test = test[0]

print(round(((y_train == 'democrat').sum()) / ((y_train == 'republican').sum()), 1))

print(round(((y_test == 'democrat').sum()) / ((y_test == 'republican').sum()), 1))
print(X_train)
print(y_train)

1.6
1.6
     1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
345   0   0   0   1   1   1   0   0   0   0   1   1   1   1   0   0
309   0   1   0   1   1   1   0   0   0   0   1   1   1   0   0   0
188   0   1   1   1   1   0   0   0   0   0   1   1   1   0   0   0
116   1   1   0   1   1   1   0   0   0   1   0   1   1   1   0   0
7     0   1   0   1   1   0   0   0   0   0   1   1   0   0   1   0
..   ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
144   1   0   0   0   0   0   1   1   1   0   1   0   0   0   1   0
161   1   0   1   0   0   0   1   1   1   0   0   0   0   0   1   0
64    1   1   1   0   0   0   1   1   1   1   1   0   0   0   1   1
108   0   0   1   0   0   0   0   0   1   1   1   0   0   0   1   0
262   1   0   1   0   0   0   1   1   1   1   1   0   0   0   1   1

[348 rows x 16 columns]
345      democrat
309      democrat
188      democrat
116      democrat
7        democrat
          ...    
144    republican
161    republican
64     



In [145]:
def get_pre_feature_probabilities(alpha, sample_class, X, y):
    # Smoothed estimates of P(feature = t | class) for every (feature, category) pair
    probabilities = {}
    in_class = (y == sample_class)   # boolean mask of the rows of this class
    Nc = in_class.sum()              # number of rows of this class

    for col_name in X.columns:
        current_feature = X[col_name]
        categories_count = current_feature.nunique()   # distinct categories of the feature
        counts_in_class = current_feature[in_class].value_counts()

        # probability for each specific category t
        for t in current_feature.unique():
            Ntic = counts_in_class.get(t, 0)   # rows of this class where the feature equals t
            p = (Ntic + alpha) / (Nc + alpha * categories_count)
            probabilities[(col_name, t)] = p

    return probabilities
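
For reference, the estimate computed inside the loop above can be written as a formula. The symbols $N_{tic}$ and $N_c$ mirror the variables `Ntic` and `Nc`; $K_i$ is a notation introduced here for `categories_count`, and the numbers in the worked example are hypothetical.

```latex
\[
P(x_i = t \mid c) \;=\; \frac{N_{tic} + \alpha}{N_c + \alpha K_i}
\]
% Worked example with hypothetical counts: N_c = 10, K_i = 3, \alpha = 1,
% and counts inside the class y: 7, n: 3, ?: 0.
\[
P(\text{y} \mid c) = \frac{7 + 1}{10 + 3} = \frac{8}{13}, \qquad
P(\text{n} \mid c) = \frac{3 + 1}{10 + 3} = \frac{4}{13}, \qquad
P(\text{?} \mid c) = \frac{0 + 1}{10 + 3} = \frac{1}{13}
\]
```

The three smoothed probabilities still sum to 1, and the category that was never observed in the class no longer has probability 0.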

In [146]:
def calculate_class_probability(alpha, sample_class, X, y, x_row_sample):
    # Log-score of the class: log prior plus the sum of the log likelihoods
    class_prob = np.log((y == sample_class).sum() / len(y))
    all_features_categories_probs_dict = get_pre_feature_probabilities(alpha, sample_class, X, y)
    for col_name, feature_value in zip(X.columns, x_row_sample):
        # a value never seen in training gets probability 0, so its log is -inf
        feature_probability = all_features_categories_probs_dict.get((col_name, feature_value), 0)
        class_prob += np.log(feature_probability)
    return class_prob

In [148]:
def get_result_class(alpha, X, y, x_row_sample): 
    result_class = 'democrat' 
    class_prob = -np.inf
    
    class_labels_list = y.unique()
    for current_class in class_labels_list:
        current_class_prob = calculate_class_probability(alpha, current_class, X, y, x_row_sample)
        
        if current_class_prob > class_prob:
            result_class = current_class
            class_prob = current_class_prob
        
    return result_class       
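
The two functions above score each class in log space and pick the class with the highest score. The sketch below shows the same idea with made-up numbers: the priors and likelihoods are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical priors and per-feature likelihoods for one sample.
priors = {"democrat": 0.6, "republican": 0.4}
likelihoods = {
    "democrat": [0.9, 0.2, 0.7],
    "republican": [0.1, 0.8, 0.6],
}

# Log-score of a class = log prior + sum of the log likelihoods.
log_scores = {
    label: np.log(priors[label]) + np.sum(np.log(likelihoods[label]))
    for label in priors
}

# The predicted class is the one with the highest log-score.
predicted = max(log_scores, key=log_scores.get)
print(predicted, log_scores)

# Why logs: a long product of small numbers underflows to 0.0,
# while the corresponding sum of logs stays finite.
many_small = np.full(2000, 0.1)
print(np.prod(many_small), np.sum(np.log(many_small)))
```

Exponentiating a log-score gives back the plain product of the prior and the likelihoods, so the ranking of the classes is the same in both forms.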

In [150]:
def compute_accuracy(X_train, y_train, X_set, y_set, alpha):
    correct_classifications = 0
    total_classifications = 0
    for xi, yi in zip(X_set.values, y_set.values):
        current_prediction = get_result_class(alpha, X_train, y_train, xi)
        if current_prediction == yi:
            correct_classifications += 1
        total_classifications += 1
    return correct_classifications/total_classifications * 100

In [151]:
def stratify_folds(X, y, fold_groups):
    data = pd.concat([X, y], axis=1)
    data = data.sample(frac=1, random_state=14).reset_index(drop=True)  # shuffle the data

    folds = []
    for fold_index in range(fold_groups):
        fold = data.groupby(0, group_keys=False).apply(lambda x: x.iloc[fold_index::fold_groups])
        folds.append(fold)
    return folds
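
The slicing `iloc[fold_index::fold_groups]` deals the rows of each class out to the folds in turn, which keeps the class ratio roughly equal across folds. A minimal sketch of that idea on hypothetical labels follows; it iterates over the groups explicitly instead of calling `groupby(...).apply(...)`.

```python
import pandas as pd

# Hypothetical labels: 6 rows of class "a" and 4 rows of class "b".
labels = pd.Series(["a"] * 6 + ["b"] * 4)
k = 2  # number of folds

folds = []
for fold_index in range(k):
    idx = []
    for _, group in labels.groupby(labels):
        # every k-th row of this class, starting at position fold_index
        idx.extend(group.index[fold_index::k])
    folds.append(labels.loc[idx])

for fold in folds:
    print(fold.value_counts().to_dict())
```

Each fold ends up with three rows of "a" and two rows of "b", and every row appears in exactly one fold.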

In [152]:
def get__cross_val_accuracy_scores(X, y, fold_groups, alpha):
    folds = stratify_folds(X, y, fold_groups)
    cross_validation_accuracy_scores = []
    
    for i in range(fold_groups):
        test = folds[i]
        train = pd.concat(folds[:i] + folds[i+1:])

        X_train_sample = train.drop(0, axis=1)
        y_train_sample = train[0]
        X_test_sample = test.drop(0, axis=1)
        y_test_sample = test[0]    
       
        current_accuracy = compute_accuracy(X_train_sample, y_train_sample, X_test_sample, y_test_sample, alpha)
        cross_validation_accuracy_scores.append(current_accuracy)
    return cross_validation_accuracy_scores

In [153]:
def get_cross_validation_results(fold_groups, alpha):
    cross_validation_accuracy_scores = get__cross_val_accuracy_scores(X, y, fold_groups, alpha)
    np_scores = np.array(cross_validation_accuracy_scores)
    for i, score in enumerate(cross_validation_accuracy_scores, start=1):
        print("Accuracy Fold {}: {:.2f} %".format(i, score))
    print("Average accuracy: {:.2f} %".format(np_scores.mean()))
    print("Standart deviation: {:.2f} %".format(np_scores.std()))    
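
One detail of the summary above is that `np.ndarray.std()` uses `ddof=0` by default, so it reports the population standard deviation of the fold scores. The sketch below recomputes both variants from the fold accuracies that this notebook reports for mode 0 with alpha = 0; the variable names are chosen here for illustration.

```python
import numpy as np

# Fold accuracies (in %) copied from the mode 0, alpha = 0 run in this notebook.
scores = np.array([93.18, 90.91, 88.64, 88.64, 95.45,
                   88.64, 88.64, 86.05, 88.10, 95.24])

mean_accuracy = scores.mean()
population_std = scores.std()        # ddof=0, as in the function above
sample_std = scores.std(ddof=1)      # divides by (n - 1) instead of n

print("mean: {:.2f}".format(mean_accuracy))
print("std (ddof=0): {:.2f}".format(population_std))
print("std (ddof=1): {:.2f}".format(sample_std))
```

With only ten folds the two estimates differ noticeably, so it helps to state which one is being reported.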

In [None]:
#Mode 0
alpha = 0.3

print("Train Set Accuracy:")
print("Accuracy: {:.2f} %".format(compute_accuracy(X_train, y_train, X_train, y_train, alpha)))

Train Set Accuracy:
Accuracy: 92.24 %


In [166]:
# Mode 1
alpha = 0.3

print("Train Set Accuracy:")
print("Accuracy: {:.2f} %".format(compute_accuracy(X_train, y_train, X_train, y_train, alpha)))

Train Set Accuracy:
Accuracy: 93.10 %


In [None]:
#Mode 0
print("10-Fold Cross-Validation Results:")
print("Cross-validation results for alpha =", 0)
get_cross_validation_results(fold_groups = 10, alpha = 0)
print()
print("Cross-validation results for alpha =", 0.5)
get_cross_validation_results(fold_groups = 10, alpha = 0.5)
print()
print("Cross-validation results for alpha =", 1)
get_cross_validation_results(fold_groups = 10, alpha = 1)

10-Fold Cross-Validation Results:
Cross-validation results for alpha = 0


Accuracy Fold 1: 93.18 %
Accuracy Fold 2: 90.91 %
Accuracy Fold 3: 88.64 %
Accuracy Fold 4: 88.64 %
Accuracy Fold 5: 95.45 %
Accuracy Fold 6: 88.64 %
Accuracy Fold 7: 88.64 %
Accuracy Fold 8: 86.05 %
Accuracy Fold 9: 88.10 %
Accuracy Fold 10: 95.24 %
Average accuracy: 90.35 %
Standart deviation: 3.06 %

Cross-validation results for alpha = 0.5


Accuracy Fold 1: 93.18 %
Accuracy Fold 2: 88.64 %
Accuracy Fold 3: 88.64 %
Accuracy Fold 4: 88.64 %
Accuracy Fold 5: 95.45 %
Accuracy Fold 6: 88.64 %
Accuracy Fold 7: 88.64 %
Accuracy Fold 8: 86.05 %
Accuracy Fold 9: 88.10 %
Accuracy Fold 10: 95.24 %
Average accuracy: 90.12 %
Standart deviation: 3.09 %

Cross-validation results for alpha = 1


Accuracy Fold 1: 93.18 %
Accuracy Fold 2: 88.64 %
Accuracy Fold 3: 88.64 %
Accuracy Fold 4: 88.64 %
Accuracy Fold 5: 95.45 %
Accuracy Fold 6: 88.64 %
Accuracy Fold 7: 88.64 %
Accuracy Fold 8: 86.05 %
Accuracy Fold 9: 88.10 %
Accuracy Fold 10: 95.24 %
Average accuracy: 90.12 %
Standart deviation: 3.09 %


In [156]:
#Mode 1
print("10-Fold Cross-Validation Results:")
print("Cross-validation results for alpha =", 0)
get_cross_validation_results(fold_groups = 10, alpha = 0)
print()
print("Cross-validation results for alpha =", 0.5)
get_cross_validation_results(fold_groups = 10, alpha = 0.5)
print()
print("Cross-validation results for alpha =", 1)
get_cross_validation_results(fold_groups = 10, alpha = 1)

10-Fold Cross-Validation Results:
Cross-validation results for alpha = 0


Accuracy Fold 1: 97.73 %
Accuracy Fold 2: 90.91 %
Accuracy Fold 3: 90.91 %
Accuracy Fold 4: 88.64 %
Accuracy Fold 5: 95.45 %
Accuracy Fold 6: 88.64 %
Accuracy Fold 7: 90.91 %
Accuracy Fold 8: 88.37 %
Accuracy Fold 9: 88.10 %
Accuracy Fold 10: 97.62 %
Average accuracy: 91.73 %
Standart deviation: 3.61 %

Cross-validation results for alpha = 0.5


Accuracy Fold 1: 97.73 %
Accuracy Fold 2: 90.91 %
Accuracy Fold 3: 90.91 %
Accuracy Fold 4: 88.64 %
Accuracy Fold 5: 95.45 %
Accuracy Fold 6: 88.64 %
Accuracy Fold 7: 90.91 %
Accuracy Fold 8: 88.37 %
Accuracy Fold 9: 88.10 %
Accuracy Fold 10: 97.62 %
Average accuracy: 91.73 %
Standart deviation: 3.61 %

Cross-validation results for alpha = 1


Accuracy Fold 1: 97.73 %
Accuracy Fold 2: 90.91 %
Accuracy Fold 3: 90.91 %
Accuracy Fold 4: 88.64 %
Accuracy Fold 5: 95.45 %
Accuracy Fold 6: 88.64 %
Accuracy Fold 7: 90.91 %
Accuracy Fold 8: 88.37 %
Accuracy Fold 9: 88.10 %
Accuracy Fold 10: 97.62 %
Average accuracy: 91.73 %
Standart deviation: 3.61 %


In [None]:
# Mode 0
# Compute test set accuracy
alpha = 0.001
print("Test Set Accuracy:")
print("Accuracy: {:.2f} %".format(compute_accuracy(X_train, y_train, X_test, y_test, alpha)))

Test Set Accuracy:
Accuracy: 91.95 %


In [161]:
# Mode 1
# Compute test set accuracy
alpha = 0.001
print("Test Set Accuracy:")
print("Accuracy: {:.2f} %".format(compute_accuracy(X_train, y_train, X_test, y_test, alpha)))

Test Set Accuracy:
Accuracy: 91.95 %


Conclusion: 

1. Mode 0 vs mode 1.

The classifier performs slightly better in the second mode than in the first one (where '?' is treated as a third category): the average 10-fold cross-validation accuracy is 91.73 % versus 90.12 % to 90.35 %, and the training set accuracy is 93.10 % versus 92.24 %, while the test set accuracy is the same in both modes (91.95 %). This is probably because in mode 0 the model may overfit to patterns such as "a row with '?' in a given column belongs to the democrat/republican party", which reduces generalization. In contrast, removing the artificial "abstained" value helps the classifier focus on the "yes vs. no" contrast, which is more informative for distinguishing democrats from republicans.

2. The choice of smoothing parameter's value.

During the exercises (seminars) we concluded that the Laplace smoothing parameter lambda (called "alpha" in my Categorical Bayes Classifier) should be between 0 and 1. With alpha = 0 there is no smoothing and the exact empirical frequencies are used, which may lead to the zero-probability problem: a single feature value that was never seen within a class can eliminate that class entirely.
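
The zero-probability problem can be seen in a small numeric sketch. The counts and likelihoods below are made up for illustration, and the helpers `smoothed` and `log_score` are hypothetical names that are not part of the notebook.

```python
import math

def smoothed(count, class_size, alpha, n_categories):
    """Laplace-smoothed estimate of P(feature = t | class)."""
    return (count + alpha) / (class_size + alpha * n_categories)

def log_score(feature_probs):
    """Sum of log probabilities; a zero probability contributes -inf."""
    return sum(math.log(p) if p > 0 else float("-inf") for p in feature_probs)

# Hypothetical class with 20 training rows and a binary feature whose
# observed value never occurred in this class (count 0).
for alpha in (0, 0.1, 1):
    p_unseen = smoothed(0, 20, alpha, 2)
    # two other features that fit the class well (made-up likelihoods)
    score = log_score([0.9, 0.8, p_unseen])
    print(f"alpha={alpha}: P(unseen)={p_unseen:.4f}, log-score={score:.3f}")
```

With alpha = 0 the score is minus infinity no matter how well the other features match, whereas any positive alpha keeps the score finite.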

In contrast, alpha = 1 removes zero probabilities, but this value and higher ones (alpha > 1) may distort the estimates when the overall counts are small.

"The truth is somewhere in between", as they say. After some trials with different values of alpha between 0 and 1, I found that alpha in [0.1, 0.3] gives the best model performance. This is due to the lighter smoothing, which stays closer to the raw counts but still solves the zero-probability problem.