## Introduction

This notebook contains a very early prototype of a scikit-learn LogisticRegression model.

Much more needs to be done to flesh out this solution. We currently predict the probability that an individual order item is cancelled, but it may make more sense to predict the probability that the overall order is cancelled.
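
As a rough illustration of what order-level prediction might look like, the sketch below collapses line items into one row per invoice. The toy rows and the `ItemCount`, `LineTotal`, and `OrderTotal` names are assumptions made for this example; only `InvoiceNo`, `Quantity`, `UnitPrice`, and `CustomerID` come from the dataset used later in this notebook.

```python
import pandas as pd

# Toy line items (made-up values); column names mirror the dataset used below.
items = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', 'C536379', '536380'],
    'Quantity': [6, 8, -1, 24],
    'UnitPrice': [2.55, 3.39, 27.50, 1.65],
    'CustomerID': [17850, 17850, 14527, 17809],
})

# Value of each line, using the absolute quantity (as the cleansing step does).
items['LineTotal'] = items['Quantity'].abs() * items['UnitPrice']

# One row per invoice: hypothetical order-level features.
orders = items.groupby('InvoiceNo').agg(
    CustomerID=('CustomerID', 'first'),
    ItemCount=('Quantity', 'size'),
    OrderTotal=('LineTotal', 'sum'),
).reset_index()

# An order counts as cancelled when its invoice number starts with 'C'.
orders['Cancelled'] = orders['InvoiceNo'].str.startswith('C').astype(int)
print(orders)
```

A model trained on `orders` would then predict one probability per invoice rather than one per line item.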

## COS S3 endpoint

Change the following variables to point to your COS S3 endpoint:

In [None]:
api_key = 'changeme'
service_instance_id = 'changeme'
auth_endpoint = 'https://iam.bluemix.net/oidc/token'
service_endpoint = 'https://s3-api.us-geo.objectstorage.service.networklayer.com' # you may need to change this
bucket = 'streams-python-models'

## Environment setup

In [None]:
!pip install --quiet numpy==1.11.1 --upgrade
!pip install --quiet scikit-learn
!pip install --quiet pandas

In [None]:
import sklearn
import numpy as np
import pandas as pd

print('sklearn: ' + sklearn.__version__)
print('numpy:   ' + np.__version__)
print('pandas:  ' + pd.__version__)

## Data download

In [None]:
ONLINE_RETAIL_XLSX  = './OnlineRetail.xlsx'

In [None]:
try:
    # python 3
    import urllib.request as urlrequest
except ImportError:
    import urllib as urlrequest

source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
urlrequest.urlretrieve(source_url, ONLINE_RETAIL_XLSX)

In [None]:
# !ls -l OnlineRetail.xlsx

In [None]:
# pandas 0.21 and later use `sheet_name`; older releases used `sheetname`
df = pd.read_excel(ONLINE_RETAIL_XLSX, sheet_name='Online Retail')

## Data cleansing

In [None]:
# An invoice number that starts with the letter 'C' indicates a cancellation.
# Create a column 'Cancelled' with the values 1=Cancelled, 0=Not cancelled

df['Cancelled'] = df['InvoiceNo'].str.startswith('C')

mask = df['Cancelled'] == True
df.loc[mask, 'Cancelled'] = 1

mask = df['Cancelled'].isnull()
df.loc[mask, 'Cancelled'] = 0

df.head()
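
The same flag can probably be derived in a single vectorized step. The sketch below uses a small made-up frame (`sample` is a hypothetical name) and assumes that non-cancelled invoice numbers may arrive as integers, which is why the column is cast to string first.

```python
import pandas as pd

# Made-up invoice numbers: integers for normal invoices, 'C...' strings for cancellations.
sample = pd.DataFrame({'InvoiceNo': [536365, 'C536379', 536380, 'C536383']})

# Cast to string so startswith never yields NaN, then convert True/False to 1/0.
sample['Cancelled'] = sample['InvoiceNo'].astype(str).str.startswith('C').astype(int)
print(sample)
```

This yields an integer column directly, instead of an object column that mixes booleans, NaN, and integers along the way.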

In [None]:
# Take the absolute value of Quantity and UnitPrice
df['Quantity'] = df['Quantity'].abs()
df['UnitPrice'] = df['UnitPrice'].abs()

# Remove rows where CustomerID is null
df.dropna(subset=['CustomerID'], how='all', inplace=True)

# Convert the CustomerID field to integer
df['CustomerID'] = df['CustomerID'].astype(int)

## Build Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()

In [None]:
X = df[['UnitPrice', 'Quantity', 'CustomerID']]
Y = df['Cancelled']

In [None]:
logistic.fit(X.values, list(Y.values))
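
The cell above fits on every row, so nothing is left over for evaluation. A minimal sketch of a held-out split follows; it uses synthetic data, and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration, not part of the prototype.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three features and the binary label.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

# Hold out 25% of the rows, keeping the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 0 is P(class 0), column 1 is P(class 1).
proba = model.predict_proba(X_test)
accuracy = model.score(X_test, y_test)
print(proba[:5])
print('held-out accuracy:', accuracy)
```

With the real data, `X.values` and `Y` would take the place of the synthetic arrays.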

## Try out the model

In [None]:
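# Predict the cancellation probability for every row.
# No separate test set is defined in this notebook, so the training features are reused.
prediction = logistic.predict_proba(X.values)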
p_df = pd.DataFrame(prediction)
p_df.head()

## Save the model to DSX local file system

In [None]:
# Remove previous copies of the saved model
! rm -f logistic.*
! ls

In [None]:
import pickle

# Use a context manager so the file is flushed and closed after writing
with open("logistic.pkl", "wb") as f:
    pickle.dump(logistic, f)

In [None]:
! ls logistic.*
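
To check that a saved model can be restored, a round trip along the following lines may help. This is a sketch on synthetic data written to a temporary directory; `logistic_demo.pkl`, `model`, and `restored` are names assumed for the example. Pickled scikit-learn models are generally expected to be loaded with the same scikit-learn version that wrote them, which is one reason to record the versions printed earlier.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the trained `logistic` object.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

# Write to a temporary location so existing files are not overwritten.
path = os.path.join(tempfile.mkdtemp(), 'logistic_demo.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load the model back and compare its predictions with the original's.
with open(path, 'rb') as f:
    restored = pickle.load(f)

same = np.allclose(model.predict_proba(X_demo), restored.predict_proba(X_demo))
print('predictions match:', same)
```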

## Save the model to IBM COS S3

In [None]:
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.resource('s3',
                      ibm_api_key_id=api_key,
                      ibm_service_instance_id=service_instance_id,
                      ibm_auth_endpoint=auth_endpoint,
                      config=Config(signature_version='oauth'),
                      endpoint_url=service_endpoint)

# Avoid shadowing the built-in `object`, and close the file after the upload
cos_object = cos.Object(bucket, 'logistic.pkl')
with open('logistic.pkl', 'rb') as f:
    cos_object.put(Body=f)