## Logistic Regression Lab

In this lab we will use the sklearn package to do some logistic regression.  In the next section, on gradient descent, we will do the "optimization" by hand.

In [104]:
import numpy as np
from scipy.special import softmax
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from bokeh.transform import factor_cmap
output_notebook()


We will begin by looking at a dataset from a food marketing company, obtained from the [Marketing Analytics](https://www.kaggle.com/datasets/jackdaoud/marketing-data?select=ifood_df.csv) project on Kaggle.  The data contains a variety of information about customers
and whether or not they were moved by marketing campaigns.

We will introduce the use of pandas (instead of pure numpy) to help us deal with the mixed data types in this dataset.

The information is contained in the file ```ifood_df.csv```.

The fields in this data are described here.

![Food Metadata](food_metadata.png)



In [3]:

# read_csv returns a pandas DataFrame: a table with labeled columns that can hold mixed data types
df = pd.read_csv('../data/ifood_df.csv',delimiter=',')

In [4]:
df.head()

,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,...,marital_Together,marital_Widow,education_2n Cycle,education_Basic,education_Graduation,education_Master,education_PhD,MntTotal,MntRegularProds,AcceptedCmpOverall
0,58138.0,0,0,58,635,88,546,172,88,88,...,0,0,0,0,1,0,0,1529,1441,0
1,46344.0,1,1,38,11,1,6,2,1,6,...,0,0,0,0,1,0,0,21,15,0
2,71613.0,0,0,26,426,49,127,111,21,42,...,1,0,0,0,1,0,0,734,692,0
3,26646.0,1,0,26,11,4,20,10,3,5,...,1,0,0,0,1,0,0,48,43,0
4,58293.0,1,0,94,173,43,118,46,27,15,...,0,0,0,0,0,0,1,407,392,0


We'll use the dataframe as a data source for bokeh plotting.

In [51]:
source=ColumnDataSource(df)

We'll pull out the numerical features to use for logistic regression.  Each column of the dataframe is a feature.  Here we list the columns with
their positions so we can select them by index.

In [82]:
for i, x in enumerate(df.columns):
    print(f"{i}: {x}")

0: Income
1: Kidhome
2: Teenhome
3: Recency
4: MntWines
5: MntFruits
6: MntMeatProducts
7: MntFishProducts
8: MntSweetProducts
9: MntGoldProds
10: NumDealsPurchases
11: NumWebPurchases
12: NumCatalogPurchases
13: NumStorePurchases
14: NumWebVisitsMonth
15: AcceptedCmp3
16: AcceptedCmp4
17: AcceptedCmp5
18: AcceptedCmp1
19: AcceptedCmp2
20: Complain
21: Z_CostContact
22: Z_Revenue
23: Response
24: Age
25: Customer_Days
26: marital_Divorced
27: marital_Married
28: marital_Single
29: marital_Together
30: marital_Widow
31: education_2n Cycle
32: education_Basic
33: education_Graduation
34: education_Master
35: education_PhD
36: MntTotal
37: MntRegularProds
38: AcceptedCmpOverall


Now we pull out the numerical ones we care about.

In [98]:
features = df.columns[[0,4,5,6,7,8,9,10,11,12,13,14,24,36,37]]

In [99]:
df[features]

,Income,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Age,MntTotal,MntRegularProds
0,58138.0,635,88,546,172,88,88,3,8,10,4,7,63,1529,1441
1,46344.0,11,1,6,2,1,6,2,1,1,2,5,66,21,15
2,71613.0,426,49,127,111,21,42,1,8,2,10,4,55,734,692
3,26646.0,11,4,20,10,3,5,2,2,0,4,6,36,48,43
4,58293.0,173,43,118,46,27,15,5,5,3,6,5,39,407,392
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,61223.0,709,43,182,42,118,247,2,9,3,4,5,53,1094,847
2201,64014.0,406,0,30,0,0,8,7,8,2,5,7,74,436,428
2202,56981.0,908,48,217,32,12,24,1,2,3,13,6,39,1217,1193
2203,69245.0,428,30,214,80,30,61,2,6,5,10,3,64,782,721
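Selecting columns by position works, but it silently picks the wrong features if the column order ever changes. As a minimal sketch, assuming a tiny hypothetical dataframe `toy` that reuses a few of the real column names, the same selection can be written by name:

```python
import pandas as pd

# Tiny hypothetical frame (not the real data) with a few of the real column names.
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0],
    "Kidhome": [0, 1],
    "MntWines": [635, 11],
    "Age": [63, 66],
})

# Positional selection, in the style used in the cell above.
by_position = toy.columns[[0, 2, 3]]

# Name-based selection: the same columns, but robust to reordering.
by_name = ["Income", "MntWines", "Age"]

same = list(by_position) == by_name
print(same)
print(toy[by_name])
```

Either form can be passed to `df[...]`; the name-based list is longer to type but easier to audit.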


We'll fit a logistic regression to predict the `Response` variable, holding out part of the data for testing.

In [105]:
L=LogisticRegression(max_iter=10000)
x_train, x_test, y_train, y_test = train_test_split(df[features].values, df['Response'].values)
L.fit(x_train,y_train)
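Because `train_test_split` shuffles at random, the coefficients and score below will change slightly from run to run. A minimal sketch of a reproducible split, using synthetic data (the arrays `X` and `y` here are made up for illustration, and the 15% positive rate is an assumption, not a measurement of the real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 features, imbalanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.15).astype(int)

# random_state fixes the shuffle; stratify keeps the class balance
# roughly equal in the training and test parts.
split_a = train_test_split(X, y, random_state=0, stratify=y)
split_b = train_test_split(X, y, random_state=0, stratify=y)

x_tr, x_te, y_tr, y_te = split_a
identical = np.array_equal(x_te, split_b[1])

print(x_tr.shape, x_te.shape)   # default test_size is 0.25
print(identical)
print(y_tr.mean(), y_te.mean())
```

In the lab itself, the same two keyword arguments could be added to the existing `train_test_split` call.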

Let's look at the coefficients.  We'll set some printoptions so we can read the numbers.

In [106]:
np.set_printoptions(precision=3,suppress=True)

In [107]:
L.coef_

array([[-0.   ,  0.002, -0.001,  0.003, -0.002, -0.001,  0.001,  0.028,
         0.074,  0.125, -0.261,  0.026, -0.033,  0.   , -0.   ]])

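Notice that the coefficients largest in magnitude are in positions 7 through 12.  The features are not standardized, so a near-zero coefficient on a large-scale feature such as `Income` does not by itself mean that the feature is unimportant.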

In [108]:
features[7:13]

Index(['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
       'NumStorePurchases', 'NumWebVisitsMonth', 'Age'],
      dtype='object')
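Raw coefficients are measured per unit of each feature, so their sizes depend on the units. The sketch below uses synthetic data (two made-up features, `income_like` and `count_like`, that by construction have equal influence on the label) to show how standardizing with `StandardScaler` puts coefficients on a comparable footing. It illustrates the idea only and is not a re-analysis of the marketing data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Two synthetic features on very different scales.
income_like = rng.normal(50000, 20000, size=n)
count_like = rng.normal(5, 2, size=n)

# Both features contribute equally to the log-odds once standardized.
z = (income_like - 50000) / 20000 + (count_like - 5) / 2
y = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)
X = np.column_stack([income_like, count_like])

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)).fit(X, y)
scaler = model.named_steps["standardscaler"]
clf = model.named_steps["logisticregression"]

per_std = clf.coef_[0]               # change in log-odds per one standard deviation
per_unit = per_std / scaler.scale_   # the same model expressed per raw unit

print("per standard deviation:", per_std)
print("per raw unit:          ", per_unit)
```

The per-standard-deviation coefficients come out similar, while the per-unit coefficient of the large-scale feature is tiny, even though both features matter equally in this construction.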

Interestingly, it's the features describing how customers purchase, together with age, that have the largest coefficients.
In particular, the more store purchases, the *less* likely a customer is to accept the marketing offer; and the older the customer, the less likely.
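One way to read these numbers is as odds ratios: a one-unit increase in a feature multiplies the odds of accepting the offer by the exponential of its coefficient. A short sketch, with the six coefficient values copied from the `L.coef_` output above (they come from one random split, so a rerun will give slightly different values):

```python
import numpy as np

names = ["NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases",
         "NumStorePurchases", "NumWebVisitsMonth", "Age"]
# Coefficients at positions 7..12 of the printed L.coef_ array.
coefs = np.array([0.028, 0.074, 0.125, -0.261, 0.026, -0.033])

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = np.exp(coefs)
for name, ratio in zip(names, odds_ratios):
    print(f"{name}: {ratio:.3f}")
```

For example, the `NumStorePurchases` coefficient of -0.261 corresponds to an odds ratio of roughly 0.77, so each additional store purchase lowers the odds of a positive response by about 23%, holding the other features fixed.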


In [110]:
L.score(x_test,y_test)

0.8478260869565217

We get about 85% accuracy on the held-out test set.
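An accuracy figure is easier to judge next to a baseline that always predicts the most common class. The sketch below uses synthetic labels with a hypothetical 15% positive rate (an assumption for illustration, not a figure measured from `ifood_df.csv`) and scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels: about 15% positives (hypothetical rate).
rng = np.random.default_rng(0)
y_toy = (rng.random(1000) < 0.15).astype(int)
X_toy = np.zeros((1000, 1))   # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(baseline_acc)
```

In the notebook, fitting the same baseline on `x_train, y_train` and scoring it on `x_test, y_test` would show how much of the 85% is due to the model rather than to class imbalance.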

## Classification of MNIST

Next we fit a multiclass logistic regression to the MNIST handwritten digits.  Each row of `mnist_train.csv` holds a digit label followed by the 784 pixel values of a 28x28 image.


In [119]:
# each row is one image: column 0 is the digit label, the remaining 784 columns are pixel values
mnist_train = np.genfromtxt(fname="mnist_train.csv",delimiter=',',skip_header=1)
labels = mnist_train[:,0]
pixels = mnist_train[:,1:]

In [122]:
pixels.shape

(60000, 784)

In [149]:
# bokeh draws the first row of an image array at the bottom, so flip vertically to show the digit upright
digit=np.flipud(pixels[6].reshape(28,28))
p=figure()
p.image(image=[digit],x=0,y=0,dw=28,dh=28,palette="Greys256")
show(p)

In [151]:
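# multiclass fit: each image is a 784-dimensional feature vector and the label is one of ten digits
L=LogisticRegression(max_iter=10000)
L.fit(pixels,labels)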
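With more than two classes, the fitted model produces one score per class and turns the scores into probabilities with a softmax. The sketch below checks this on scikit-learn's small built-in `load_digits` dataset (8x8 images, used here as a stand-in for MNIST so the example runs quickly). It assumes a scikit-learn version in which the default multiclass fit is multinomial rather than one-vs-rest.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Stand-in for MNIST: 1797 images of 8x8 pixels with values 0..16.
X, y = load_digits(return_X_y=True)
X = X / 16.0   # scale pixel values to [0, 1]

clf = LogisticRegression(max_iter=10000).fit(X, y)

# One linear score per class for the first five images.
scores = clf.decision_function(X[:5])

# Softmax across the class axis turns scores into probabilities.
probs = softmax(scores, axis=1)
matches = np.allclose(probs, clf.predict_proba(X[:5]))

print(scores.shape)
print(matches)
```

The same check should apply to the model fitted on `pixels` and `labels` above, with `decision_function` returning ten scores per image.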