## SHREYOSHI GHOSH HW 3

Using the Scikit-Learn library, train the Logistic Regression model using the following:

- All six cases of using two features at a time.

In [44]:
import numpy as np 
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 
from sklearn.preprocessing import StandardScaler

from itertools import combinations


In [45]:
# loading data with column names
colnames = ['sepal_length', 'sepal_width','petal_length','petal_width','species']
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
        'machine-learning-databases/iris/iris.data', header=None, names = colnames)
df.tail()

# column combos
combos = [combo for k in (2, 3, 4) for combo in combinations(colnames[0:4], k)]
combos

df_results = pd.DataFrame(index = [ '+'.join(combo) for combo in combos])
df_results

sepal_length+sepal_width
sepal_length+petal_length
sepal_length+petal_width
sepal_width+petal_length
sepal_width+petal_width
petal_length+petal_width
sepal_length+sepal_width+petal_length
sepal_length+sepal_width+petal_width
sepal_length+petal_length+petal_width
sepal_width+petal_length+petal_width
sepal_length+sepal_width+petal_length+petal_width


Using the Scikit-Learn library, train the Logistic Regression model using the following:

- All six cases of using two features at a time.
- All four cases of using three features at a time.
- The one case of using all features at once.
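
As a quick sanity check on the case counts listed above, the number of feature subsets of each size can be computed directly. This is a minimal sketch; the names `features`, `counts`, and `total` are illustrative and not part of the assignment code.

```python
from itertools import combinations

# The four measurement columns (same names as used in this notebook).
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Count the subsets of size 2, 3 and 4.
counts = {k: len(list(combinations(features, k))) for k in (2, 3, 4)}
total = sum(counts.values())

print(counts, total)
```

The total is the number of models that need to be trained and reported in the summary table.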

In [46]:
def iris_model(df, cols, df_results, c = 1.0, p = None):
    # p=None disables the penalty (scikit-learn >= 1.2; older releases used the string 'none')
    # splitting data for model (stratified 70/30 split)
    X_train, X_test, y_train, y_test = train_test_split(
        df.loc[:, list(cols)], df['species'], test_size=0.30,
        random_state=777, shuffle=True, stratify=df['species'])
    # scaling data
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # fitting model
    model = LogisticRegression(random_state= 777, C = c, penalty= p)
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    # scoring model
    train_acc = accuracy_score(y_pred=y_train_pred, y_true=y_train)
    test_acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    # storing results in table 
    df_results.loc['+'.join(cols),'train_acc'] = train_acc
    df_results.loc['+'.join(cols),'test_acc'] = test_acc
    # n_iter_ is a NumPy array; store its first entry as a scalar
    df_results.loc['+'.join(cols),'iterations'] = model.n_iter_[0]
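
The split settings used in the function above (`test_size=0.30`, `stratify=...`) should give equal class counts in both partitions. The sketch below checks that, under one assumption: it loads scikit-learn's bundled copy of Iris with `load_iris` instead of the UCI CSV, so the labels are integers rather than species names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the UCI CSV used in this notebook.
X, y = load_iris(return_X_y=True)

# Same split settings as in iris_model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=777, shuffle=True, stratify=y)

# Number of samples per class in each partition.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print(train_counts, test_counts)
```

With 50 samples per species, a stratified 70/30 split leaves 105 training and 45 testing samples, which is why every accuracy in the results is a multiple of 1/105 or 1/45.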


Summarize your results (i.e., what’s the best accuracy you can obtain for each of the 11 cases you considered, how many iterations does it take to converge, anything else you think is relevant and important) in a table.

In [47]:
for combo in combos:
    iris_model(df, combo, df_results)

df_results

features,train_acc,test_acc,iterations
sepal_length+sepal_width,0.809524,0.822222,37.0
sepal_length+petal_length,0.980952,0.955556,32.0
sepal_length+petal_width,0.961905,0.977778,33.0
sepal_width+petal_length,0.961905,0.933333,38.0
sepal_width+petal_width,0.952381,0.955556,33.0
petal_length+petal_width,0.980952,0.933333,33.0
sepal_length+sepal_width+petal_length,0.980952,0.955556,33.0
sepal_length+sepal_width+petal_width,0.961905,0.955556,35.0
sepal_length+petal_length+petal_width,1.0,0.955556,28.0
sepal_width+petal_length+petal_width,1.0,0.933333,37.0


Play with both L1 and L2 regularization and vary the regularization parameter C.

In [61]:
df_results_reg = pd.DataFrame()

def iris_model_reg(df, cols, df_results, c = 1.0, p = None):
    # p=None disables the penalty (scikit-learn >= 1.2; older releases used the string 'none')
    # splitting data for model (stratified 70/30 split)
    X_train, X_test, y_train, y_test = train_test_split(
        df.loc[:, list(cols)], df['species'], test_size=0.30,
        random_state=777, shuffle=True, stratify=df['species'])
    # scaling data
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # fitting model
    model = LogisticRegression(random_state= 777, C = c, penalty= p, solver='saga', max_iter= 200)
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    # scoring model
    train_acc = accuracy_score(y_pred=y_train_pred, y_true=y_train)
    test_acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    # storing results in table 
    name = '+'.join(cols) + f'( {p},{str(c)})'
    df_results.loc[name,'train_acc'] = train_acc
    df_results.loc[name,'test_acc'] = test_acc
    # n_iter_ is a NumPy array; store its first entry as a scalar
    df_results.loc[name,'iterations'] = model.n_iter_[0]


# using sepal_width+petal_length+petal_width feature combo
# trying values of c: 1.0, 0.1, 0.01, 0.001

for reg in ['l1','l2']:
    for c in [1.0, 0.1, 0.01, 0.001]:
        iris_model_reg(df, ['sepal_width','petal_length','petal_width'], df_results_reg,c,p = reg)
df_results_reg



features (penalty, C),train_acc,test_acc,iterations
"sepal_width+petal_length+petal_width( l1,1.0)",0.980952,0.933333,200.0
"sepal_width+petal_length+petal_width( l1,0.1)",0.942857,0.955556,20.0
"sepal_width+petal_length+petal_width( l1,0.01)",0.333333,0.333333,3.0
"sepal_width+petal_length+petal_width( l1,0.001)",0.333333,0.333333,1.0
"sepal_width+petal_length+petal_width( l2,1.0)",0.971429,0.933333,41.0
"sepal_width+petal_length+petal_width( l2,0.1)",0.914286,0.933333,16.0
"sepal_width+petal_length+petal_width( l2,0.01)",0.87619,0.844444,16.0
"sepal_width+petal_length+petal_width( l2,0.001)",0.809524,0.8,11.0

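The two L1 rows at 0.333333 in the table above are consistent with the penalty driving every coefficient to zero, which leaves the model predicting a single class (one third of a balanced, stratified split). The sketch below checks this under some assumptions: it uses scikit-learn's bundled copy of Iris via `load_iris` rather than the UCI CSV, and it takes the last three columns to stand in for sepal_width, petal_length and petal_width.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data; columns 1..3 are sepal_width, petal_length, petal_width.
X, y = load_iris(return_X_y=True)
X = X[:, 1:]

# Same split and scaling steps as in iris_model_reg.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=777, shuffle=True, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Very strong L1 penalty (small C), as in the last L1 row of the table.
model = LogisticRegression(penalty='l1', C=0.001, solver='saga',
                           max_iter=200, random_state=777)
model.fit(X_train, y_train)

# How many coefficients survived, and how many distinct classes are predicted?
n_nonzero = int(np.count_nonzero(model.coef_))
predicted_classes = np.unique(model.predict(X_test))
test_acc = model.score(X_test, y_test)

print(n_nonzero, predicted_classes, test_acc)
```

If no coefficient survives, the features no longer influence the prediction, so lowering C further cannot help; the useful range for L1 on this data sits at larger values of C.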

## Discuss your findings.  Does using more dimensions help when trying to classify the data in this dataset?  How important is regularization in these cases?

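To build the logistic regression models, I split the data 70/30 into training and testing sets (stratified by species) and standardized the features using statistics from the training set only. I first trained without regularization and found that the best combination was all four features, which had the highest training and testing accuracy. This indicates that using more dimensions does help when classifying the iris dataset, although the gain is modest: every pair that includes a petal measurement already reaches a test accuracy between 0.933333 and 0.977778, while sepal_length+sepal_width, the only pair without one, reaches just 0.822222.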

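Among the three-feature combinations, sepal_width+petal_length+petal_width had a training accuracy of 1.0 but a testing accuracy of 0.933333 without regularization. I chose this combination to test the different regularization methods (using the saga solver with max_iter = 200), to see if any change would cause the testing accuracy to increase. L1 regularization with C = 0.1 raised the test accuracy to 0.955556, which is one more correct prediction out of the 45 test samples, while the training accuracy fell to 0.942857. L2 regularization never exceeded 0.933333, and making the penalty too strong hurt both methods: L1 dropped to 0.333333 (chance level for three balanced classes) at C = 0.01 and below, and L2 fell to 0.8 at C = 0.001. The L1 run at C = 1.0 also used all 200 allowed iterations, so it may not have fully converged. This shows that regularization can help fine-tune the model, but here the improvement is small and depends on choosing C carefully.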
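
To put the reported test accuracies in perspective, they can be converted back into counts of correctly classified test samples. This is a small arithmetic sketch; the accuracies are copied from the tables above, `n_test = 45` follows from the 30% split of 150 samples, and the dictionary keys are illustrative labels only.

```python
# 30% of the 150 Iris samples are held out for testing.
n_test = 45

# Test accuracies reported for sepal_width+petal_length+petal_width.
reported = {
    'no penalty': 0.933333,
    'l1, C=0.1': 0.955556,
    'l1, C=0.01': 0.333333,
}

# Convert each accuracy into a number of correct predictions.
correct = {name: round(acc * n_test) for name, acc in reported.items()}

print(correct)
```

Each step of about 0.022 in test accuracy corresponds to a single test sample, so small differences between rows should be read with caution.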