# Final Project - Logistic Regression Bagging for Multiclass Classification

We compare the classification metrics of our bagged logistic regression models against those of sklearn's logistic regression.

### What is "Logistic Bagging"?

Logistic bagging, as this notebook implements it, is a form of bootstrap aggregation. Bootstrap aggregation rests on the idea that we can train a group of machine learning models, each on a different random subset of the data, and combine their predictions so that the ensemble performs better than its individual members. Each of these 'weak learners' may learn the data only somewhat better than chance, but the combined learner often performs quite well. Similar ensembling is seen with random decision forests and neural nets, whose base learners have known weaknesses: decision trees often overfit the data, and perceptrons assume linear separability.

Logistic bagging applies this idea by creating multiple logistic regression models for a given dataset. Each of these models 'votes' on the class label of a given instance, and the label with the plurality of votes is assigned to that instance. Going forward, we implement this ML model and compare its results to logistic regression on a dataset of dry beans, obtained from [a zip file in this archive](https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset).
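
The voting step can be illustrated with a minimal sketch. The function name `plurality_vote` and the sample votes are made up for illustration; this is not the notebook's own implementation, which lives in the `predict` method of the class defined further down.

```python
from collections import Counter

def plurality_vote(votes):
    """Return the label with the most votes; ties go to the label seen first."""
    # Counter.most_common orders equal counts by first occurrence
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from five models for a single instance
votes = ["SIRA", "DERMASON", "SIRA", "SEKER", "SIRA"]
winner = plurality_vote(votes)
```

Note that a plurality does not require a majority: a label can win with fewer than half of the votes as long as no other label has more.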

## Importing Packages

Random is used by the bagging method, which relies on chance to decide which rows and columns go into each bag.

Pandas is helpful for dataframe manipulation.

Numpy helps with array manipulation.

Sklearn provides various machine learning algorithms, including the logistic regression model that we compare our model against. It can also take a set of predictions and compute various metrics from them, notably a confusion matrix.

In [1]:
# utility
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn utility
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# sklearn classifiers
from sklearn.linear_model import LogisticRegression

# sklearn grid search
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline

# sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import f1_score


from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

To make our results reproducible, we seed the random library.

In [2]:
random.seed(42)

Below is a group of helper functions that create what is referred to as a "bag": a subset of the instances and features of the data set. Bags are important for our purposes because training on only a subset of the data is what makes a single logistic model a 'weak learner.'

In [3]:
# random bagging:
# Shuffle a list of labels (row indices or column positions) in place.
# Take the first m items of the shuffled list.
# Create a dictionary recording, for each label, whether it was chosen.
# From that dictionary, create a pandas series of booleans. Then we can create a filtered dataframe.

# Time complexity: O(n * m), because `i in chosenInstances` scans a list
# of m items for each of the n labels.

def bagBooleans(array, m):
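    # Returns a pd series of booleans with m True values from n total elements.
    # Note: `array` is shuffled in place, and the series follows that shuffled
    # order, so its first m values are the True ones.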
    
    random.shuffle(array)
    
    chosenInstances = array[:m]
    
    # Create a pandas series of size n, to store booleans of which rows
    # we keep for a given logistic model.
    d = {}
    for i in array:
        d[i] = i in chosenInstances
    bagBooleans = pd.Series(data=d, index = array.copy())
    return bagBooleans

def createBag(dataframe, numInstances = 10, numFeatures = 3, random_state = -1):
    # random_state is currently unused: the seeding code below is disabled.
    '''
    if(random_state != -1):
        random.seed(random_state)
    '''
        
    numRows = dataframe.shape[0]
    numCols = dataframe.shape[1]
    
    # bagging for columns
    
    chosen_cols = bagBooleans(list(range(numCols)), numFeatures)
    chosen_cols_names = dataframe.columns[chosen_cols].values
    df_bagged_columns = dataframe.loc[:,chosen_cols_names]
    
    # bagging for rows
    chosen_rows = bagBooleans(list(dataframe.index), numInstances)
    bag = df_bagged_columns[list(chosen_rows)]
 
    return bag
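
One detail worth checking when building boolean masks this way is their order. A boolean mask is applied positionally, so its entries should follow the original order of the rows or columns rather than the shuffled order. The sketch below is one possible way to build such a mask with only the standard library; `ordered_mask` and its `rng` parameter are hypothetical names, not part of the notebook's code.

```python
import random

def ordered_mask(labels, m, rng=random):
    """Return a list of booleans, in the order of `labels`, with exactly m True values."""
    # sample m distinct positions without replacement; a set gives O(1) membership tests
    chosen = set(rng.sample(range(len(labels)), m))
    return [pos in chosen for pos in range(len(labels))]

# Hypothetical usage: pick 3 of 6 column names with a locally seeded generator
columns = ["Area", "Perimeter", "MajorAxisLength",
           "MinorAxisLength", "AspectRatio", "Eccentricity"]
mask = ordered_mask(columns, 3, rng=random.Random(42))
picked = [name for name, keep in zip(columns, mask) if keep]
```

A mask built like this could be passed to `dataframe.loc[:, mask]` to select columns or to `dataframe[mask]` to select rows. Passing a local `random.Random` instance keeps the sampling reproducible without reseeding the global generator.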

Below is the class we use to build and apply the logistic bagging ML model. It is modeled on sklearn's objects for ML models, so that its usage syntax is consistent. Most notably, the model has methods to fit to a training set and then predict on a test set.

In [4]:
#the absolute minimum for something like this is the ability to fit the model, and then the ability to predict with it.

class AggregatedLogistic:
    def __init__(self):
        self.isFit = False
        self.models = []
        self.model_cols = []
    
    def fit(self, x_train, y_train, n_estimators = 5, solver='lbfgs',
                            multi_class='multinomial',
                            C=1e-2,
                            random_state = 42):
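        # random_state is forwarded to each LogisticRegression; the bag
        # sampling itself uses the global `random` module.
        
        # reset so that calling fit again does not accumulate old models
        self.models = []
        self.model_cols = []
        self.isFit = True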
        
        # make number of models
        for i in range(n_estimators):
            # split the data randomly
            bag_num_instances = int(x_train.shape[0] / n_estimators)
            bag_num_predictors = random.randint(2,x_train.shape[1])
            
            bag = createBag(x_train, 
                            bag_num_instances,
                            bag_num_predictors,
                            random_state=random_state)
            # then fit a logreg to the bag that was created
            
            model = LogisticRegression(solver = solver,
                                   multi_class=multi_class,
                                    C=C,
                                    random_state=random_state)
            x_fit = bag
            y_fit = y_train.loc[list(bag.index)]
            
            model.fit(x_fit, y_fit)
            self.models.append(model)
            self.model_cols.append(bag.columns)
            
    def predict(self, x_test):
        # collect every model's predictions; the plurality vote winner becomes the label
        if(not self.isFit):
            print("Model not yet fit to a data set.")
            return
        predictions = np.ndarray(shape=(1, len(x_test)), dtype=object, order='F')
        
        model_preds = []
        
        for i in range(len(self.models)):
            test_data = (x_test[self.model_cols[i]])
            curr_preds = self.models[i].predict(test_data)
            model_preds.append(curr_preds)
        
        for i in range(predictions.shape[1]):
            array_predictions = []
            unique_preds = []
            for pred in model_preds:
                array_predictions.append(pred[i])
                if(not(pred[i] in unique_preds)):
                    unique_preds.append(pred[i])
            unique_preds = sorted(unique_preds, key = lambda x:array_predictions.count(x), reverse=True)
            predictions[0, i] = unique_preds[0]
        
        return predictions[0]
        
    
    

## Testing the Model

The data set in the zip file is messy, so we instead use a version that was cleaned during previous data exploration.

In [5]:
df = pd.read_csv("Clean_Bean_Data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRatio,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,Roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,SEKER
1,1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.27275,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.99843,SEKER
2,2,29380,624.11,212.82613,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,SEKER
3,3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,SEKER
4,4,30140,620.134,201.847882,190.279279,1.060798,0.33368,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.9419,0.999166,SEKER


We have to drop a leftover column that was added when the data set was saved; after that, the data is clean.

In [6]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)
df.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRatio,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,Roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,SEKER
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.27275,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.99843,SEKER
2,29380,624.11,212.82613,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,SEKER
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,SEKER
4,30140,620.134,201.847882,190.279279,1.060798,0.33368,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.9419,0.999166,SEKER


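Below, we show a test of creating a bag. We create one with ten instances and three features, which is just a subset of the overall dataframe. Note that in the output the bag holds rows 0 through 9 and the first three columns: the mask returned by `bagBooleans` follows the shuffled order, so when it is applied positionally its `True` values land on the leading rows and columns.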

In [7]:
sample_bag = createBag(df, numInstances = 10, numFeatures = 3, random_state=44)
print(type(sample_bag))
print(sample_bag)

<class 'pandas.core.frame.DataFrame'>
    Area  Perimeter  MajorAxisLength
0  28395    610.291       208.178117
1  28734    638.018       200.524796
2  29380    624.110       212.826130
3  30008    645.884       210.557999
4  30140    620.134       201.847882
5  30279    634.927       212.560556
6  30477    670.033       211.050155
7  30519    629.727       212.996755
8  30685    635.681       213.534145
9  30834    631.934       217.227813


### Splitting/Scaling our Data

To test the `AggregatedLogistic` class, we also need a plain logistic model to compare it against. Both require the data to be split and scaled before model building.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(df, df['Class'], test_size=0.2, random_state=45, stratify=df[['Class']])
x_train = x_train.drop('Class', axis=1)
x_test = x_test.drop('Class', axis=1)
x_train.shape[0], x_test.shape[0]
x_train.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRatio,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,Roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4
5412,91751,1146.703,445.17612,264.370534,1.68391,0.804571,92725,341.790874,0.670871,0.989496,0.876837,0.767766,0.004852,0.00104,0.589464,0.992604
5347,87979,1148.631,443.738692,255.653953,1.7357,0.817354,89768,334.691413,0.692835,0.980071,0.837969,0.754253,0.005044,0.001007,0.568898,0.987437
8100,42001,768.513,285.177858,188.821847,1.510301,0.749398,42528,231.251668,0.772304,0.987608,0.893649,0.810903,0.00679,0.001811,0.657564,0.99312
2118,54677,911.022,308.853903,226.398571,1.364204,0.680198,55858,263.850182,0.753013,0.978857,0.82786,0.854288,0.005649,0.001856,0.729808,0.995607
6015,49573,880.556,350.312735,181.419348,1.930956,0.855454,50084,251.233565,0.617025,0.989797,0.803417,0.717169,0.007067,0.001153,0.514332,0.993152


In [9]:
# IMPORTANT: index = x_train.index or index = x_test.index. these MUST be maintained to align
# with y_train. 

scaler = StandardScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns = x_train.columns, index = x_train.index)
x_test = pd.DataFrame(scaler.transform(x_test), columns = x_test.columns, index = x_test.index)

x_train.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRatio,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,Roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4
5412,1.322486,1.364378,1.465726,1.380619,0.419032,0.590841,1.311021,1.502853,-1.613505,0.504278,0.056924,-0.53037,-1.517329,-1.142718,-0.556561,-0.564727
5347,1.193825,1.373383,1.448936,1.186784,0.629594,0.729453,1.211687,1.382829,-1.164776,-1.508755,-0.596795,-0.749669,-1.347338,-1.198191,-0.764506,-1.749044
8100,-0.374463,-0.402155,-0.403078,-0.2994,-0.286795,-0.007456,-0.375247,-0.365922,0.458767,0.101103,0.339689,0.169745,0.201142,0.152109,0.132017,-0.446469
2118,0.05791,0.263507,-0.126538,0.536215,-0.88077,-0.757858,0.072548,0.185188,0.064639,-1.767993,-0.766816,0.873867,-0.810802,0.227491,0.862491,0.123609
6015,-0.116185,0.1212,0.357707,-0.464014,1.423428,1.142616,-0.121418,-0.028108,-2.713578,0.568636,-1.177915,-1.351533,0.446628,-0.952661,-1.316238,-0.439086
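
The index note in the cell above can be illustrated on a toy example. This is a minimal sketch under stated assumptions: the three-row data is made up, and the standardization is done with plain NumPy to stand in for a scaler that returns a bare array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index, as a shuffled split produces
x = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]}, index=[7, 2, 5])
y = pd.Series(["SEKER", "SIRA", "CALI"], index=[7, 2, 5])

# Standardizing yields a bare array that carries no index information
scaled = (x.values - x.values.mean(axis=0)) / x.values.std(axis=0)

lost = pd.DataFrame(scaled, columns=x.columns)                 # index resets to 0, 1, 2
kept = pd.DataFrame(scaled, columns=x.columns, index=x.index)  # index stays 7, 2, 5

# Label-based lookup only lines up when the original index was kept
labels_for_kept = y.loc[kept.index]
```

With `lost`, a lookup such as `y.loc[lost.index]` would refer to labels 0 and 1, which do not exist in `y`, so rows and class labels could no longer be matched.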


### Comparing Logistic Regression to Logistic Bagging

We create a logistic regression model below.

In [10]:
logreg = LogisticRegression(solver='lbfgs',
                            multi_class='multinomial',
                            C=1e-2,
                            random_state = 0)
logreg.fit(x_train, y_train)
logreg_report = classification_report(y_test, logreg.predict(x_test))
print(logreg_report)

              precision    recall  f1-score   support

    BARBUNYA       0.94      0.86      0.90       265
      BOMBAY       1.00      1.00      1.00       104
        CALI       0.90      0.94      0.92       326
    DERMASON       0.91      0.92      0.91       709
       HOROZ       0.94      0.95      0.94       372
       SEKER       0.95      0.92      0.93       406
        SIRA       0.83      0.86      0.84       527

    accuracy                           0.91      2709
   macro avg       0.92      0.92      0.92      2709
weighted avg       0.91      0.91      0.91      2709



From the above report, we can see that standard logistic regression learns this data fairly effectively, with an accuracy of 91%. The test is therefore not whether our algorithm finds something completely new, but whether it can maintain similar effectiveness. Now, let's compare it with our aggregated logistic regression ML algorithm!

In [14]:
agglog = AggregatedLogistic()
agglog.fit(x_train, y_train, solver='lbfgs',
                            multi_class='multinomial',
                            C=1e-2,
                            random_state = 0,
                            n_estimators = 5)
agglog_report = classification_report(y_test, agglog.predict(x_test))
print(agglog_report)

              precision    recall  f1-score   support

    BARBUNYA       0.95      0.75      0.84       265
      BOMBAY       1.00      1.00      1.00       104
        CALI       0.86      0.94      0.90       326
    DERMASON       0.85      0.95      0.89       709
       HOROZ       0.94      0.95      0.94       372
       SEKER       0.95      0.89      0.92       406
        SIRA       0.83      0.78      0.80       527

    accuracy                           0.89      2709
   macro avg       0.91      0.89      0.90      2709
weighted avg       0.89      0.89      0.89      2709



## Discussion and Conclusion

The above report gives an accuracy score of 89%, which is comparable to the 91% of plain logistic regression. Somewhat surprisingly, it also fully separated Bombay from the other class labels. This is notable because not every weak learner will have the information needed to fully separate Bombay.

The primary thing to note with this learner is what happens across repeated executions: if you run the above cell repeatedly, you will notice that the f1-scores and accuracy scores vary drastically. The major drawback of the random bagging method we perform is that the models we create are quite subject to random chance. We have no guarantee that every instance will be used in creating the models (in fact, it is very unlikely), and by chance some features may be considered in many more models than others.
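
To get a rough sense of how many instances go unused, suppose (as an assumption, not something measured in this notebook) that each of the `k` bags independently draws `1/k` of the training rows uniformly at random. A given row is then missed by one bag with probability `1 - 1/k`, and by every bag with probability `(1 - 1/k) ** k`. The helper name `fraction_never_used` is hypothetical.

```python
def fraction_never_used(n_estimators):
    """Chance that a given row lands in none of the bags, under the independence assumption."""
    miss_one_bag = 1 - 1 / n_estimators
    return miss_one_bag ** n_estimators

p_unused = fraction_never_used(5)  # 0.8 ** 5, roughly 0.33
```

With five estimators, about a third of the training rows would be expected to go unused under this assumption.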

Additionally, random bagging for logistic regression keeps many of the flaws of a logistic model: data sets that aren't linearly separable can still be difficult to fit effectively, since, unlike decision forests, we aren't trying to overfit smaller parts of our data. Further, this method is much more difficult to explain to someone seeking an accessible machine learning algorithm; by Occam's Razor, the simpler single model is preferable when the two perform similarly.

Developing this algorithm has been an insightful survey of bagging methods more generally, and it is promising that it maintains most of the effectiveness of a logistic model. Because of the model's random variance and the difficulty of explaining it, we hesitate to recommend it as an alternative to longer-standing models. However, bagging in general remains promising for further study: algorithms such as random decision forests use a very similar bagging method to create models that have seen much success.

## Acknowledgements

Thanks to the professor for everything in the course and in working to achieve this implementation. Thanks to a classmate for reminding me of loc versus iloc in pandas, which took way too much time to debug.


## References

- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
- https://pandas.pydata.org/docs/reference/api/pandas.Series.html
- https://stackoverflow.com/questions/11285613/selecting-multiple-columns-in-a-pandas-dataframe
- https://pandas.pydata.org/docs/reference/indexing.html
- https://pandas.pydata.org/docs/user_guide/indexing.html
- https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Int64Index.html

- https://stackoverflow.com/questions/52428472/typeerror-unhashable-type-list-when-calling-iloc (honorable mention: this is the exact problem I was having)

- https://stackoverflow.com/questions/35723472/how-to-use-sklearn-fit-transform-with-pandas-and-return-dataframe-instead-of-num
- https://stackoverflow.com/questions/46628837/fit-got-an-unexpected-keyword-argument-criterion
- https://towardsdatascience.com/indexing-best-practices-in-pandas-series-e455c7d2417
- https://www.sharpsightlabs.com/blog/sklearn-predict/
- https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html