## $\S$1. Load data

We use three datasets to compare how logistic regression performs on different kinds of data.

We start with the `iris` dataset, which has four features. We will use a logistic regression model to predict the species of each iris.

In [13]:
import pandas as pd
# read in the iris data set
iris = pd.read_csv("iris.csv")
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
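
The `Unnamed: 0` column in the output above is a row index that was saved into the CSV. Unless it is removed, it stays in the dataframe as an ordinary column and ends up among the predictors. If the rows of the file are ordered by species (an assumption, though the first five rows are all `setosa`), that column can leak the target. A minimal sketch of reading it as the index instead, using a small in-memory stand-in for `iris.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for iris.csv: the first, unnamed column is a saved row index.
csv_text = """,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
"""

# Default read: the unnamed column becomes a regular column called "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the index, so it is no longer a feature.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

The same idea applies to any CSV that was written with its index included.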


## $\S$2. Split and clean data

To train and test the model, we split the data: 80% for training and the remaining 20% for testing. We then define the `prep_data()` function, which drops rows with missing values, encodes the species labels as integers, and separates the predictors from the target.

In [14]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(iris, test_size = 0.2)
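
Called this way, `train_test_split` shuffles differently on every run, so the scores change each time the cell is executed. A minimal sketch of a reproducible, class-balanced split follows. The toy dataframe and the seed value `0` are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the iris dataframe: 10 rows of each of three classes.
toy = pd.DataFrame({
    "feature": range(30),
    "Species": ["setosa"] * 10 + ["versicolor"] * 10 + ["virginica"] * 10,
})

# random_state fixes the shuffle; stratify keeps class proportions in both parts.
# The seed value 0 is an arbitrary choice for illustration.
train_a, test_a = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["Species"]
)
train_b, test_b = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["Species"]
)

print(test_a.index.equals(test_b.index))           # same rows on every run
print(test_a["Species"].value_counts().to_dict())  # 2 rows per class
```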

In [15]:
from sklearn import preprocessing

def prep_data(data_df):
    """
    Takes in a dataframe, cleans the data entries, separates the predictor and target data 
    Returns a tuple of the predictor data variable and the target data variable
    """
    df = data_df.copy()
    df = df.dropna()
    
    # convert the values in the qualitative columns to integer values
    le = preprocessing.LabelEncoder()
    df['Species'] = le.fit_transform(df['Species'])
    
    # create the predictor data variable
    X = df.drop(['Species'], axis = 1)
    # create the target data variable
    y = df['Species']
        
    return (X, y)

# prepare the train data and the test data
X_train, y_train = prep_data(train)
X_test,  y_test  = prep_data(test)
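
Because `prep_data()` fits a new `LabelEncoder` on whichever frame it receives, the integer codes for train and test agree only when both splits contain the same set of species. The sketch below uses hypothetical label lists to show the mismatch that can occur, and one way to avoid it by fitting the encoder once on the training labels.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["setosa", "versicolor", "virginica", "setosa"]
test_labels = ["virginica", "versicolor"]  # hypothetical split with no "setosa"

# Fitting separately assigns codes from whatever classes each split contains.
separate = LabelEncoder().fit_transform(test_labels)  # versicolor -> 0, virginica -> 1

# Fitting once on the training labels keeps the mapping consistent.
le = LabelEncoder().fit(train_labels)
shared = le.transform(test_labels)                     # versicolor -> 1, virginica -> 2

print(list(separate), list(shared))
```

With a stratified split of a balanced dataset this mismatch is unlikely, but a shared encoder removes the possibility entirely.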

## $\S$3. Logistic Regression

We use the `LogisticRegression` model from the `sklearn` package in Python. We fit the model on the training data, then calculate the model's accuracy on the test data.

In [16]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(max_iter = 500)
LR.fit(X_train, y_train)
iris_score = LR.score(X_test, y_test)
iris_score

1.0
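
For a classifier, `score` returns the mean accuracy: the fraction of test rows whose predicted label matches the true label. A score of `1.0` therefore means every test row was classified correctly. The sketch below reproduces that calculation by hand on hypothetical labels (30 rows with one deliberate error), not on the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for 30 test rows (10 per class), not the notebook's data.
y_true = np.repeat([0, 1, 2], 10)
y_pred = y_true.copy()
y_pred[15] = 2  # introduce a single misclassification (true label is 1)

manual = np.mean(y_true == y_pred)  # fraction of matching labels
print(manual, accuracy_score(y_true, y_pred))
```

With 30 test rows, each misclassified row lowers the accuracy by 1/30, so small test sets produce coarse, jumpy scores.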

## $\S$4. Test the parameters

We refit the model using only three of the four features, which we chose based on a PCA of the data.

In [17]:
cols = ['Sepal.Length', 'Sepal.Width', 'Petal.Length']
LR.fit(X_train[cols], y_train)
LR.score(X_test[cols], y_test)

0.9666666666666667
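
The PCA step itself is not shown in this section. PCA produces linear combinations of the features rather than a list of columns, so one possible way to relate it back to individual features is to inspect the component loadings. The sketch below is an assumption about how that could be done, using scikit-learn's bundled copy of iris (whose column names differ from those in `iris.csv`).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of iris, used here instead of iris.csv.
data = load_iris(as_frame=True)
X = data.data

# Standardize so that no feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Rows: components; columns: original features. Large absolute loadings
# indicate features that contribute most to a component.
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=["PC1", "PC2"])
print(loadings.round(2))
print(pca.explained_variance_ratio_.round(2))
```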

## $\S$5. Compare the model with other datasets

Next, we perform the same process on the `penguins` dataset. This dataset has more features than the `iris` dataset, including the qualitative features `island` and `sex`, so we update the `prep_data()` function to convert them to integer values.

In [18]:
# read in the penguins data set
penguins = pd.read_csv("penguins.csv")
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [19]:
from sklearn import preprocessing

def prep_data(data_df):
    """
    Takes in a dataframe, encodes the qualitative columns, separates the predictor and target data 
    Returns a tuple of the predictor data variable and the target data variable
    """
    df = data_df.copy()
    df = df.dropna()
    
    # convert the values in the qualitative columns to integer values
    le = preprocessing.LabelEncoder()
    df['sex'] = le.fit_transform(df['sex'])
    df['island'] = le.fit_transform(df['island'])
    
    # create the predictor data variable
    X = df.drop(['species'], axis = 1)
    # create the target data variable
    y = df['species']
        
    return(X, y)

# prepare the train data and the test data
train, test = train_test_split(penguins, test_size = 0.2)
X_train, y_train = prep_data(train)
X_test,  y_test  = prep_data(test)
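
Label encoding turns `island` into the integers 0, 1, 2, which a linear model reads as an ordered quantity. A common alternative is one-hot encoding, which creates one indicator column per category. A minimal sketch with `pd.get_dummies` on hypothetical rows follows; it is not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical rows with the same categorical columns as the penguins data.
df = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "sex": ["male", "female", "female", "male"],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["island", "sex"])
print(list(encoded.columns))
```

If the encoding is applied separately to train and test, the two frames can end up with different columns, so it is safer to encode before splitting.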

In [20]:
LR = LogisticRegression(max_iter = 500)
LR.fit(X_train, y_train)
penguins_score = LR.score(X_test, y_test)
penguins_score

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


1.0
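
The warning above suggests scaling the data. The penguins features span very different ranges (bill measurements in tens of millimetres, body mass in thousands of grams), which can slow the solver. A minimal sketch of a pipeline that standardizes the features before fitting is shown below on synthetic data; the data and the feature scales are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales (like mm vs. g).
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(40 + 5 * y, 3),        # roughly "bill length" scale
    rng.normal(4000 + 800 * y, 400),  # roughly "body mass" scale
])

# Standardize each feature, then fit the classifier on the scaled values.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
model.fit(X, y)

print(model.named_steps["logisticregression"].n_iter_)  # iterations the solver used
print(model.score(X, y))
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when scoring the test data.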

In [21]:
cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
LR.fit(X_train[cols], y_train)
LR.score(X_test[cols], y_test)

0.9701492537313433

Last, we test the logistic regression model on the `seeds` dataset.

In [24]:
# read in the seeds data set
seeds = pd.read_csv("seeds_dataset.csv")
seeds.head()

Unnamed: 0,Area,Perim,Compact,K.Length,K.Width,Assym,G.Length,Class
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1


In [25]:
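def prep_data(data_df):
    """
    Takes in a dataframe, drops rows with missing values, separates the predictor and target data
    Returns a tuple of the predictor data variable and the target data variable
    """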
    df = data_df.copy()
    df = df.dropna()
    
    # create the predictor data variable
    X = df.drop(['Class'], axis = 1)
    # create the target data variable
    y = df['Class']
        
    return(X, y)

# prepare the train data and the test data
train, test = train_test_split(seeds, test_size = 0.2)
X_train, y_train = prep_data(train)
X_test,  y_test  = prep_data(test)

In [26]:
LR = LogisticRegression(max_iter = 500)
LR.fit(X_train, y_train)
seeds_score = LR.score(X_test, y_test)
seeds_score

0.9761904761904762

In [27]:
cols = ['Area', 'Perim', 'Compact']
LR.fit(X_train[cols], y_train)
LR.score(X_test[cols], y_test)

0.9285714285714286

In [28]:
iris_score, penguins_score, seeds_score

(0.9333333333333333, 0.9852941176470589, 0.9761904761904762)
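
The `iris` and `penguins` values in this tuple differ from the scores printed earlier (1.0 in both cases). A likely explanation, though it is an assumption, is that the cells were re-run and `train_test_split` drew a new random split each time. Cross-validation gives a more stable estimate by averaging over several splits. A minimal sketch on scikit-learn's bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# scikit-learn's bundled iris data, used here instead of iris.csv.
X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: five train/test splits, one accuracy per fold.
scores = cross_val_score(LogisticRegression(max_iter=500), X, y, cv=5)
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds makes comparisons between datasets less sensitive to a single lucky or unlucky split.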