## Logistic Regression Lab

In this lab we will use scikit-learn (the `sklearn` package) to fit logistic regression models. In the next section, on gradient descent, we will carry out the optimization by hand.

In [None]:
import numpy as np
from scipy.special import softmax
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from bokeh.transform import factor_cmap
output_notebook()


We will begin by looking at a dataset from a food marketing company, obtained from the [Marketing Analytics](https://www.kaggle.com/datasets/jackdaoud/marketing-data?select=ifood_df.csv) project on Kaggle. The data contains a variety of information about customers
and whether or not they were moved by marketing campaigns.

We will introduce the use of pandas (instead of pure numpy) to help us deal with the mixed data types in this dataset.

The information is contained in the file `ifood_df.csv`.

The fields in this data are described in the table below.

![Food Metadata](food_metadata.png)



In [None]:

# the read_csv function from pandas creates a pandas "dataframe": a table with named columns, each of which can hold a different data type
df = pd.read_csv('../data/ifood_df.csv',delimiter=',')

In [None]:
df.head()

We'll use the dataframe as a data source for bokeh plotting.

In [None]:
source=ColumnDataSource(df)

We'll pull out the numerical features to use for logistic regression. Each column of the dataframe is a candidate feature. Here we list the columns with
their index numbers so we can see where they sit in the dataframe.

In [None]:
for i, x in enumerate(df.columns):
    print(f"{i}: {x}")

Now we pull out the numerical ones we care about.

In [None]:
features = df.columns[[0,4,5,6,7,8,9,10,11,12,13,14,24,36,37]]

In [None]:
df[features]

We'll fit a logistic regression model that predicts the `Response` column from these features, holding out part of the data for testing.

In [None]:
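# allow plenty of iterations so the solver can converge on the unscaled features
L=LogisticRegression(max_iter=10000)
# random split into training and test sets (by default 75% train, 25% test)
x_train, x_test, y_train, y_test = train_test_split(df[features].values, df['Response'].values)
L.fit(x_train,y_train)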
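
To make concrete what the fitted model computes, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration and are not the iFood data. For a binary response, the predicted probability of class 1 should be the logistic sigmoid applied to the linear score built from `coef_` and `intercept_`; the sketch checks this against `predict_proba`.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid: 1 / (1 + exp(-z))
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical, not the iFood dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# Linear score for each row: x . w + b
z = X_demo @ model.coef_.ravel() + model.intercept_[0]

# Probability of class 1 is the sigmoid of the linear score
p_manual = expit(z)
p_sklearn = model.predict_proba(X_demo)[:, 1]

# Compare the hand-computed probabilities with scikit-learn's
print(np.allclose(p_manual, p_sklearn))
```

A positive coefficient raises the log-odds of class 1 as its feature increases, and a negative coefficient lowers it.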

Let's look at the coefficients.  We'll set some printoptions so we can read the numbers.

In [None]:
np.set_printoptions(precision=3,suppress=True)

In [None]:
L.coef_

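Notice that the coefficients with the largest magnitude are in positions 7, 8, 9, 10, 11, and 12. Because the features were not standardized, the size of a coefficient also depends on the units of its feature, so treat this comparison as a rough guide.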

In [None]:
features[7:13]

Interestingly, it is the features related to the type of purchases, together with age, that seem to matter.
In particular, with the other features held fixed, the more store purchases a customer makes, the *less* likely they are to accept the marketing offer; and the older the customer, the less likely.
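
One way to make coefficient sizes easier to compare is to standardize the features before fitting. The following is a sketch on synthetic data, not a re-analysis of the iFood data: the two features `big_scale` and `small_scale` are hypothetical and are constructed to carry equal signal while being measured in very different units.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Two hypothetical features with equal influence but different units
big_scale = rng.normal(0, 100, size=n)    # e.g. an amount spent
small_scale = rng.normal(0, 1, size=n)    # e.g. a small count
X_demo = np.column_stack([big_scale, small_scale])
y_demo = (big_scale / 100 + small_scale + rng.normal(size=n) > 0).astype(int)

# Fit once on the raw features and once on standardized features
raw = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)).fit(X_demo, y_demo)

raw_coef = raw.coef_.ravel()
scaled_coef = scaled.named_steps["logisticregression"].coef_.ravel()

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", scaled_coef)
```

On the raw features the coefficient of the large-unit feature is much smaller, even though both features matter equally by construction; after standardization the two coefficients are of similar size.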


In [None]:
L.score(x_test,y_test)

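We get about 85% accuracy on the held-out test set. Because `train_test_split` shuffles the data randomly, the exact figure varies from run to run.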
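
An accuracy figure is easier to interpret next to a baseline. The sketch below uses synthetic, deliberately imbalanced labels (`X_demo` and `y_demo` are hypothetical, not the iFood data) and compares logistic regression against scikit-learn's `DummyClassifier`, which here always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 1000
X_demo = rng.normal(size=(n, 4))
# Hypothetical imbalanced labels: positives are the minority class
y_demo = (X_demo[:, 0] + rng.normal(size=n) > 1.5).astype(int)

# Stratify so train and test keep the same class proportions
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Baseline: always predict the majority class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(x_tr, y_tr)
model = LogisticRegression(max_iter=10000).fit(x_tr, y_tr)

baseline_acc = baseline.score(x_te, y_te)
model_acc = model.score(x_te, y_te)
print(f"majority-class baseline: {baseline_acc:.3f}")
print(f"logistic regression:     {model_acc:.3f}")
```

The same comparison can be run on the lab's data by fitting the dummy classifier on `x_train, y_train` and scoring it on `x_test, y_test`.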

## Classification of MNIST

We now apply logistic regression to a problem with more than two classes: recognizing handwritten digits from the MNIST dataset, where each image is a 28×28 grid of grayscale pixels.


In [None]:
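# each row of the CSV is one image: column 0 is the digit label, the remaining columns are the pixel values
mnist_train = np.genfromtxt(fname="mnist_train.csv",delimiter=',',skip_header=1)
labels = mnist_train[:,0]
pixels = mnist_train[:,1:]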

In [None]:
pixels.shape

In [None]:
# bokeh draws row 0 of an image at the bottom of the plot, so flip the rows to show the digit upright
img=np.flipud(pixels[6].reshape(28,28))
p=figure()
p.image(image=[img],x=0,y=0,dw=28,dh=28,palette="Greys256")
show(p)

In [None]:
# train on 30% of the images and hold out the remaining 70% for testing
pixels_train, pixels_test, labels_train, labels_test = train_test_split(pixels, labels, train_size=.3)

In [None]:
pixels_train.shape

In [None]:
L=LogisticRegression(max_iter=10000,solver='lbfgs')
L.fit(pixels_train,labels_train)
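
With ten digit classes, the fitted model holds one row of coefficients per class. The sketch below illustrates this on the small 8×8 digits dataset bundled with scikit-learn, used here as a stand-in for MNIST so that it runs without the CSV file. It assumes a recent scikit-learn version, in which the `lbfgs` solver fits a multinomial model for multi-class targets, so the class probabilities are the softmax of the per-class linear scores.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Small 8x8 digit images shipped with scikit-learn (stand-in for MNIST)
digits = load_digits()
X_small = digits.data / 16.0   # pixel values in this dataset range over 0..16
y_small = digits.target

clf = LogisticRegression(max_iter=10000, solver='lbfgs').fit(X_small, y_small)

# One row of weights per class, one column per pixel
print(clf.coef_.shape)

# Per-class linear scores, turned into probabilities with softmax
scores = clf.decision_function(X_small)
probs_manual = softmax(scores, axis=1)

# Compare with scikit-learn's own probabilities
print(np.allclose(probs_manual, clf.predict_proba(X_small)))
```

For MNIST itself the coefficient matrix would have one column for each of the 28×28 pixels instead of the 64 used here.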

In [None]:
L.score(pixels_test,labels_test)
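
A single accuracy number hides which digits get confused with which. As a sketch, again using scikit-learn's bundled 8×8 digits as a stand-in for MNIST, a confusion matrix breaks the test results down by true and predicted class. The variable names here are illustrative and not part of the lab's code.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data / 16.0, digits.target, train_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# Rows are true digits, columns are predicted digits
cm = confusion_matrix(y_te, clf.predict(X_te))

# Fraction of each true digit that was classified correctly
per_class_acc = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(np.round(per_class_acc, 3))
```

The diagonal counts the correct predictions, so the trace divided by the total equals the value returned by `score`.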

## Fashion MNIST
The Fashion-MNIST dataset is available from Kaggle
[here](https://kaggle.com/datasets/zalando-research/fashionmnist). Download this data and try using logistic regression to classify the images.