## Mitigating bias in a research study
I found the following paper: https://www.researchgate.net/publication/2571315_Transforming_Classifier_Scores_into_Accurate_Multiclass_Probability_Estimates. Its aim is to "show how to obtain accurate probability estimates for multiclass problems by combining calibrated binary probability estimates".
The authors report the mean squared error (MSE) and error rate on several datasets, among them the Adult dataset from https://archive.ics.uci.edu/ml/datasets/Adult. They do so using both a Naive Bayes (NB) model and a Support Vector Machine (SVM) model.

Since the authors state that they did not perform any feature selection, I will reproduce their outcomes (MSE and error rate) and measure the gender bias in their results. I will then try to mitigate that bias and check whether the MSE and the error rate improve.
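One common way to quantify gender bias in a classifier is to compare how often each group receives the positive prediction (here, income `>50K`). The sketch below only illustrates that idea: the helper name `positive_rate_gap` and the toy values are hypothetical, and the text above does not say which bias metric will actually be used.

```python
import pandas as pd

def positive_rate_gap(y_pred, group):
    """Return per-group positive prediction rates and their max-min gap.

    y_pred: iterable of 0/1 predictions (1 = income >50K).
    group:  iterable of group labels (e.g. the `sex` column).
    """
    frame = pd.DataFrame({"pred": list(y_pred), "group": list(group)})
    rates = frame.groupby("group")["pred"].mean()
    return rates, float(rates.max() - rates.min())

# Toy example (made-up values, not taken from the Adult dataset).
toy_pred = [1, 0, 1, 1, 0, 0, 1, 0]
toy_sex = ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"]
rates, gap = positive_rate_gap(toy_pred, toy_sex)
print(rates)
print("gap:", gap)
```

A gap of 0 would mean both groups receive the positive prediction equally often; this quantity is usually called the demographic parity difference.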

## Importing the dataset

In [20]:
import pandas as pd  # "as pd" lets us refer to pandas by the abbreviation pd

df = pd.read_csv('adults/adults.csv')
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [21]:
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income' ]

In [22]:
from sklearn.preprocessing import LabelEncoder

# Helper to label-encode multiple columns at once, retrieved from:
# https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

In [23]:
df = MultiColumnLabelEncoder(columns = ['workclass','education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']).fit_transform(df)
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,0
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,1
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,0
32559,22,4,201490,11,9,4,1,3,4,1,0,0,20,39,0


In [24]:
y_train = df['income']  # income is the target (y) variable
X_train = df.loc[:, df.columns != 'income']  # all columns except the target
X_train

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39
32559,22,4,201490,11,9,4,1,3,4,1,0,0,20,39


In [25]:
df_test = pd.read_csv('adults/test.csv')
df_test

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
16277,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K.
16278,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
16279,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [26]:
df_test = MultiColumnLabelEncoder(columns = ['workclass','education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']).fit_transform(df_test)
df_test

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,4,226802,1,7,4,7,3,2,1,0,0,40,38,0
1,38,4,89814,11,9,2,5,0,4,1,0,0,50,38,0
2,28,2,336951,7,12,2,11,0,4,1,0,0,40,38,1
3,44,4,160323,15,10,2,7,0,2,1,7688,0,40,38,1
4,18,0,103497,15,10,4,0,3,4,0,0,0,30,38,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,4,215419,9,13,0,10,1,4,0,0,0,36,38,0
16277,64,0,321403,11,9,6,0,2,2,1,0,0,40,38,0
16278,38,4,374983,9,13,2,10,0,4,1,0,0,50,38,0
16279,44,4,83891,9,13,0,1,3,1,1,5455,0,40,38,0
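Note that the training and test sets are label-encoded independently here, so the same category can receive different codes: `United-States` is 39 in the training output above but 38 in the test output. A possible alternative, sketched below with hypothetical toy frames, is to fit one encoder on the training data and reuse it for the test data. The sketch uses scikit-learn's `OrdinalEncoder`, which is a swap for the `MultiColumnLabelEncoder` above rather than the method used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the real frames; the values are illustrative only.
train = pd.DataFrame({"native-country": ["Cuba", "Holand-Netherlands", "United-States"]})
test = pd.DataFrame({"native-country": ["United-States", "Cuba"]})

cat_cols = ["native-country"]

# Fit on the training data only; unseen test categories are mapped to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[cat_cols]).astype(int)
test_codes = encoder.transform(test[cat_cols]).astype(int)

train_enc = pd.DataFrame(train_codes, columns=cat_cols, index=train.index)
test_enc = pd.DataFrame(test_codes, columns=cat_cols, index=test.index)

print(train_enc)
print(test_enc)
```

With a shared encoder, `United-States` gets the same code in both frames even though the test frame lacks one of the training categories.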


In [27]:
y_test = df_test['income']  # income is the target (y) variable
X_test = df_test.loc[:, df_test.columns != 'income']  # all columns except the target
X_test

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,4,226802,1,7,4,7,3,2,1,0,0,40,38
1,38,4,89814,11,9,2,5,0,4,1,0,0,50,38
2,28,2,336951,7,12,2,11,0,4,1,0,0,40,38
3,44,4,160323,15,10,2,7,0,2,1,7688,0,40,38
4,18,0,103497,15,10,4,0,3,4,0,0,0,30,38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,4,215419,9,13,0,10,1,4,0,0,0,36,38
16277,64,0,321403,11,9,6,0,2,2,1,0,0,40,38
16278,38,4,374983,9,13,2,10,0,4,1,0,0,50,38
16279,44,4,83891,9,13,0,1,3,1,1,5455,0,40,38


## Multinomial Naive Bayes

In [28]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB() #clf = classifier
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Results

In [29]:
from sklearn.metrics import mean_squared_error
y_train_pred = clf.predict(X_train)

#### Based on the training set

In [30]:
mean_squared_error(y_train_pred, y_train)

0.21743803937225514

In [31]:
from sklearn.metrics import classification_report
print(classification_report(y_train,y_train_pred))

              precision    recall  f1-score   support

           0       0.80      0.96      0.87     24720
           1       0.63      0.24      0.35      7841

    accuracy                           0.78     32561
   macro avg       0.71      0.60      0.61     32561
weighted avg       0.76      0.78      0.74     32561



Since the accuracy is 0.78, the error rate is 1 - 0.78 = 0.22.
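Because `predict` returns hard 0/1 labels, the MSE computed above is simply the fraction of misclassified rows, which is why it matches the error rate (0.217 versus 1 - 0.78). If the goal is the MSE of the probability estimates, which is what the quoted aim of the paper is about, one option is to score `predict_proba` instead. The sketch below illustrates both on synthetic data; the data and variable names are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.naive_bayes import MultinomialNB

# Synthetic binary problem with non-negative integer features (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 3))
y = (X[:, 0] > 4).astype(int)

model = MultinomialNB().fit(X, y)
y_pred = model.predict(X)               # hard 0/1 labels
p_pos = model.predict_proba(X)[:, 1]    # estimated P(y = 1 | x)

error_rate = 1 - accuracy_score(y, y_pred)
mse_labels = mean_squared_error(y, y_pred)   # equals the error rate for 0/1 labels
mse_probs = mean_squared_error(y, p_pos)     # MSE of the probability estimates

print(error_rate, mse_labels, mse_probs)
```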

#### Based on the test set

In [32]:
y_test_pred = clf.predict(X_test)
mean_squared_error(y_test_pred, y_test)

0.21491308887660462

In [33]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

           0       0.80      0.96      0.87     12435
           1       0.62      0.23      0.34      3846

    accuracy                           0.79     16281
   macro avg       0.71      0.59      0.60     16281
weighted avg       0.76      0.79      0.75     16281



Since the accuracy is 0.79, the error rate is 1 - 0.79 = 0.21.

## Gaussian Naive Bayes

In [34]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)  # fit once, then reuse the fitted model
y_train_pred = gnb.predict(X_train)
y_test_pred = gnb.predict(X_test)

In [35]:
mean_squared_error(y_train_pred, y_train)

0.20496913485458063

In [36]:
mean_squared_error(y_test_pred, y_test)

0.20465573367729256

## Complement Naive Bayes

In [37]:
from sklearn.naive_bayes import ComplementNB
Compl_clf = ComplementNB()
Compl_clf.fit(X_train, y_train)

ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)

In [38]:
y_train_pred = Compl_clf.predict(X_train)
mean_squared_error(y_train_pred, y_train)

0.21743803937225514

In [39]:
y_test_pred = Compl_clf.predict(X_test)
mean_squared_error(y_test_pred, y_test)

0.21491308887660462

## Bernoulli Naive Bayes

In [40]:
from sklearn.naive_bayes import BernoulliNB
Bern_clf = BernoulliNB()
Bern_clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [41]:
y_train_pred = Bern_clf.predict(X_train)
mean_squared_error(y_train_pred, y_train)

0.2677743312551826

In [42]:
y_test_pred = Bern_clf.predict(X_test)
mean_squared_error(y_test_pred, y_test)

0.26208463853571645
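The same fit / predict / score pattern is repeated for every Naive Bayes variant above. A small helper could run them in one loop, as sketched here on synthetic non-negative integer features; the function name `error_rates` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

def error_rates(models, X_train, y_train, X_test, y_test):
    """Fit each model once and return {name: (train_mse, test_mse)}."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = (
            mean_squared_error(y_train, model.predict(X_train)),
            mean_squared_error(y_test, model.predict(X_test)),
        )
    return results

# Synthetic data (illustrative only), split into a train and a test part.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(300, 4))
y = (X[:, 0] + X[:, 1] > 9).astype(int)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

models = {
    "multinomial": MultinomialNB(),
    "gaussian": GaussianNB(),
    "complement": ComplementNB(),
    "bernoulli": BernoulliNB(),
}
results = error_rates(models, X_tr, y_tr, X_te, y_te)
for name, (train_mse, test_mse) in results.items():
    print(f"{name}: train={train_mse:.3f} test={test_mse:.3f}")
```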

## Categorical Naive Bayes

In [45]:
from sklearn.naive_bayes import CategoricalNB
Cat_clf = CategoricalNB()
Cat_clf.fit(X_train, y_train)

CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True)

In [49]:
y_train_pred = Cat_clf.predict(X_train)
mean_squared_error(y_train_pred, y_train)

0.11145235097202175

In [50]:
y_test_pred = Cat_clf.predict(X_test)
mean_squared_error(y_test_pred, y_test)

IndexError: index 1490400 is out of bounds for axis 1 with size 1484706
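`CategoricalNB` treats every column value as a category index and sizes its count tables from the largest value seen during `fit`. The error above is consistent with a test-set value (1490400, which looks like an `fnlwgt` value) exceeding anything in the training data, so the lookup goes out of bounds. A possible workaround, shown here as a sketch with hypothetical toy data, is to pass only the genuinely categorical columns to `CategoricalNB` and to encode them with an encoder fitted on the training set that reserves a code for unseen categories.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (illustrative only); "Never-worked" appears only in the test set.
toy_train = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
    "sex": ["Male", "Female", "Female", "Male"],
})
toy_y_train = [0, 0, 0, 1]
toy_test = pd.DataFrame({
    "workclass": ["Never-worked", "Private"],
    "sex": ["Male", "Female"],
})

# Unseen categories become -1; shifting by +1 keeps all codes non-negative,
# so code 0 is reserved for "unknown" and seen categories start at 1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_cat = encoder.fit_transform(toy_train).astype(int) + 1
X_test_cat = encoder.transform(toy_test).astype(int) + 1

cat_clf = CategoricalNB().fit(X_train_cat, toy_y_train)
toy_test_pred = cat_clf.predict(X_test_cat)
print(toy_test_pred)
```

Numeric columns such as `age` or `fnlwgt` would need to be binned or handled by a different model, since their raw values are not category indices.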