## Exercise: Build a logistic regression model to predict fish species

Using the fish dataset in this repo (location relative to this notebook: `../datasets/fish/Fish.csv`), do the following:

1. Split the data into labels and features, with the "Species" column as the label and the remaining columns as features.
2. Split the fish data into training and validation sets.
3. Fit the model!
4. Score the model. What is this value, and how should you interpret it?
5. Examine the coefficients and intercept values.
    * These have a strange shape compared to linear regression.
    * Can you explain what they are?
6. Compute and plot a "confusion matrix".
    * Hint 1: [https://machinelearningmastery.com/confusion-matrix-machine-learning/](https://machinelearningmastery.com/confusion-matrix-machine-learning/)
    * Hint 2: [https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
    * Hint 3: [https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix](https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix)
    * Is your model good at some classes and bad at others?
    * Are there any interesting trends in the confusion matrix?

In [1]:
# Your code below here... use as many cells as you'd like.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt


# Load the data
fish_dataset = pd.read_csv('../datasets/fish/Fish.csv')
fish_dataset.head(5)

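|   | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---------|--------|---------|---------|---------|--------|-------|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.52 | 4.02 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.48 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 12.73 | 4.4555 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 12.444 | 5.134 |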


In [2]:
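# The "Species" column is the label; every other column is a feature.
labels = fish_dataset['Species']
features = fish_dataset.drop(columns=['Species'])

# Hold out part of the data for validation (train_test_split defaults to a 75/25 split).
training_data, test_data, training_labels, test_labels = train_test_split(features, labels)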
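
With only 159 rows, a purely random split can leave a rare class under-represented in the validation set, and the split changes on every run. A minimal sketch of a reproducible, stratified split follows, using the `stratify` and `random_state` parameters of `train_test_split`. The toy arrays and the names `toy_features` and `toy_labels` are made up for illustration and are not the fish data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the fish data (hypothetical values): 20 rows, 3 imbalanced classes.
toy_labels = np.array(["A"] * 8 + ["B"] * 8 + ["C"] * 4)
toy_features = np.arange(40, dtype=float).reshape(20, 2)

# stratify keeps the class proportions similar in both parts;
# random_state makes the shuffle repeatable.
X_train, X_val, y_train, y_val = train_test_split(
    toy_features, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# Every class should appear in the validation part.
print(sorted(set(y_val)))
print(X_train.shape, X_val.shape)
```

In the notebook, the equivalent call would pass `features` and `labels`, with `stratify=labels`.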

In [3]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Weight   159 non-null    float64
 1   Length1  159 non-null    float64
 2   Length2  159 non-null    float64
 3   Length3  159 non-null    float64
 4   Height   159 non-null    float64
 5   Width    159 non-null    float64
dtypes: float64(6)
memory usage: 7.6 KB


In [4]:
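# One-vs-rest logistic regression with no regularization and a generous iteration cap.
# Newer scikit-learn releases spell this penalty=None and deprecate multi_class.
model = LogisticRegression(max_iter=1000000, multi_class='ovr', penalty='none')
model.fit(training_data, training_labels)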

LogisticRegression(max_iter=1000000, multi_class='ovr', penalty='none')

In [6]:
training_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119 entries, 26 to 83
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Weight   119 non-null    float64
 1   Length1  119 non-null    float64
 2   Length2  119 non-null    float64
 3   Length3  119 non-null    float64
 4   Height   119 non-null    float64
 5   Width    119 non-null    float64
dtypes: float64(6)
memory usage: 6.5 KB


In [7]:
score_data = model.score(test_data, test_labels)
print(score_data)

0.975
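
For a classifier, `score` returns the mean accuracy: the fraction of validation rows whose predicted class matches the true class. A value of 0.975 on the 40 held-out rows therefore corresponds to 39 correct predictions and one mistake. Accuracy alone does not say which species get confused with each other, which is what a confusion matrix adds. The sketch below uses made-up label lists (`y_true` and `y_pred` are hypothetical, not output from the model above) to show how the two relate: accuracy is the sum of the diagonal divided by the total count.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
classes = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "C", "A", "C"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)

# Accuracy is the diagonal (correct predictions) over all predictions.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy, np.trace(cm) / cm.sum())

# Per-class recall: the share of each true class that was predicted correctly.
recall = np.diag(cm) / cm.sum(axis=1)
print({c: round(float(r), 2) for c, r in zip(classes, recall)})
```

With the notebook's variables, the same calls would take `test_labels` and `model.predict(test_data)`, with `labels=model.classes_`. scikit-learn's `ConfusionMatrixDisplay` can then draw the matrix as a heatmap.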


In [8]:
print(model.coef_, model.intercept_)

[[ 6.21080611e-02 -1.59220577e+01 -1.64794328e+01  2.42357783e+01
   9.46843559e+00 -7.29910553e+00]
 [-1.66826797e+00  4.70451945e+01  2.67579659e+01 -9.64562639e+01
   1.26342709e+02  9.71740920e+00]
 [ 5.13549039e-02 -8.20430978e+01  5.15587016e+02 -4.32073590e+02
   9.12254246e+01 -8.34704111e+00]
 [ 2.54490586e-01  6.51024993e+00  4.02300937e+00 -2.86901082e+00
  -4.02719425e+01 -1.16727545e+01]
 [-6.65241222e-02  1.32436668e+01 -2.21411834e+01  8.15516224e+00
  -2.42992647e+00  1.56877211e+01]
 [-1.67531894e+00  1.02651381e+00  9.94662689e-01  1.10133337e+00
  -1.59515893e-01  2.48138106e-02]
 [-3.59780596e-03 -1.03675808e+01  7.97149523e+00  1.64838965e+00
  -1.69022594e+00  4.24546738e+00]] [ -8.26037053   9.44610738  19.14218744  -2.88052422 -10.02267345
   0.07896935 -14.57897824]
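
The coefficient array has one row per class and one column per feature, so seven species and six features give a 7 × 6 array, with one intercept per class. With `multi_class='ovr'`, each row belongs to a separate "this species versus the rest" classifier. Each row defines a linear score for its class, and the prediction is the class with the largest score. The sketch below checks this on synthetic data from `make_blobs` (an assumption made so the example is self-contained; it is not the fish data) with a default `LogisticRegression`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 90 rows, 4 features, 3 classes.
X, y = make_blobs(n_samples=90, centers=3, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients and one intercept per class.
print(model.coef_.shape, model.intercept_.shape)

# Each class gets a linear score: features times that class's row, plus its intercept.
manual_scores = X @ model.coef_.T + model.intercept_

# The predicted class is the one with the highest score.
manual_pred = model.classes_[np.argmax(manual_scores, axis=1)]

print(np.allclose(manual_scores, model.decision_function(X)))
print(np.array_equal(manual_pred, model.predict(X)))
```

In the notebook, `model.classes_` gives the species order of the rows, and the column order matches `features.columns`.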


In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

fish_dataset = pd.read_csv('../datasets/fish/Fish.csv')

labels = fish_dataset['Species']
features = fish_dataset.drop(columns=['Species'])

tuned_parameters = {
    'max_iter': [100, 1000, 10000],
    'multi_class': ['ovr', 'auto', 'multinomial'],
    'penalty': ['l2', 'none']
}

# The lines below fit and score every possible combination of the parameters above,
# which can take a LONG TIME with large datasets.
clf = LogisticRegression()
grid_tree = GridSearchCV(clf, tuned_parameters)
grid_tree.fit(features, labels)

print("Best parameters set found on development set:")
print()
print(grid_tree.best_params_, grid_tree.best_score_)
print()
print("Grid scores on development set:")
print()
means = grid_tree.cv_results_['mean_test_score']
stds = grid_tree.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_tree.cv_results_['params']):
    print(f"{mean:.3f} (+/-{std * 2:.3f}) for {params!r}")

NameError: name 'GridSearchCV' is not defined
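
The cost of the grid search above grows multiplicatively with each parameter list. As a rough sketch, `ParameterGrid` can count the combinations before anything is fitted. The figure of five folds assumes the default `cv` setting of `GridSearchCV` in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = {
    'max_iter': [100, 1000, 10000],
    'multi_class': ['ovr', 'auto', 'multinomial'],
    'penalty': ['l2', 'none'],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(tuned_parameters))

# Assumed default: 5-fold cross-validation, so each candidate is fitted 5 times.
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Since `max_iter` only caps the number of solver iterations, dropping it from the grid and fixing it at a single large value would cut the number of candidates to a third.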