<h1>E-Commerce Actual Data Transactions from UK Retailer</h1>
<br>
In this notebook we explore an <a href='https://archive.ics.uci.edu/ml/datasets/Online+Retail'>e-commerce transactions dataset from a UK retailer</a>. The dataset lists purchases made by approximately 4,000 customers over a period of one year <i>(from 12/01/2010 to 12/09/2011)</i>. The main aim of this notebook is to develop a machine learning model that anticipates the purchases a new customer will make over the following year, based on their first purchases.
<br>
<br>
This notebook is divided into the following steps:
<ul>
    <li>Data Cleaning.</li>
    <li>Feature Exploration.</li>
    <li>Understanding Product Categories.</li>
    <li>Customers Categories.</li>
    <li>Classifying Customers.</li>
    <li>Testing Predictions.</li>
    <li>Explaining The Decisions of The Model.</li>
</ul>

<h2>Importing Necessary Packages, Modules and Libraries</h2>

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import datetime, nltk, warnings
import matplotlib.cm as cm
import itertools
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn import preprocessing, model_selection, metrics, feature_selection
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn import neighbors, linear_model, svm, tree, ensemble
from wordcloud import WordCloud, STOPWORDS
from sklearn.ensemble import AdaBoostClassifier
from sklearn.decomposition import PCA
from IPython.display import display, HTML
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode,iplot
warnings.filterwarnings("ignore")
plt.rcParams["patch.force_edgecolor"] = True
plt.style.use('fivethirtyeight')
mpl.rc('patch', edgecolor = 'dimgray', linewidth=1)
%matplotlib inline

<h2>Data Cleaning</h2>
<br>
Let's load the dataset into memory! Passing <code>ISO-8859-1</code> as the value of the <code>encoding</code> parameter allows us to read the dataset. For a better understanding of this parameter and its value, check the following links:
<ul>
    <li><code>encoding</code>: <a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html'>Pandas official docs for the <code>read_csv</code></a> method.</li>
    <li><code>ISO-8859-1</code>: <a href=''>Codec registry and base classes</a>.</li>
</ul>
With the <code>dtype</code> parameter we pass a dictionary so that the columns <code>CustomerID</code> and <code>InvoiceNo</code> are read as <code>string</code> values.

In [23]:
# Read the IDs as strings; the invoice column in this file is named InvoiceNo
df = pd.read_csv('ecommerce-data/data.csv', encoding='ISO-8859-1', dtype={'CustomerID': str, 'InvoiceNo': str})
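
To see why the <code>encoding</code> value matters, here is a minimal sketch on a hypothetical two-row CSV held in memory (the rows and the <code>sample</code> name are made up for illustration, they are not taken from the dataset). It assumes the file contains a byte such as the pound sign, which ISO-8859-1 stores as a single byte that is not valid UTF-8.

```python
import io
import pandas as pd

# Hypothetical CSV content, encoded as ISO-8859-1.
# The pound sign becomes the single byte 0xA3, which is invalid in UTF-8.
text = "InvoiceNo,Description,CustomerID\n536365,MUG £1 OFFER,17850\nC536379,Discount,14527\n"
raw = text.encode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With the matching encoding the data loads, and dtype keeps the IDs as strings.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1",
                     dtype={"CustomerID": str, "InvoiceNo": str})
print(utf8_ok)
print(sample)
```

Keeping the IDs as strings also preserves values such as <code>C536379</code> exactly as written in the file.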

Now with the data in memory, let's take a look at its dimensions.

In [24]:
df.shape

(541909, 8)

<b>541909</b> rows and <b>8</b> columns! Let's see some general information about this dataset.

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null object
Country        541909 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 33.1+ MB


Looks like there are null (missing) values. We must clean up this dataset!
<br>
<br>
Let's create a variable <code>columns_info</code> that will hold the data type of every column in the dataset. We build a <code>DataFrame</code> object from <code>df.dtypes</code>, which contains the types of the columns of <code>df</code>. Then with the attribute <code>T</code> we <a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html'>transpose</a> it, so that every column of <code>df</code> becomes a column of the summary, and with the function <code>rename</code> we give the resulting row the name <code>Column Type</code>.

In [26]:
columns_info= pd.DataFrame(df.dtypes).T.rename(index={0: 'Column Type'})

Looking for how many nulls we have in every single column of this dataset.
<br>
<br>
We append to <code>columns_info</code> another <i>row in the index column</i>, called <code>Null Values (NV)</code>. This new row holds the number of null values for every column in the dataset, obtained from <code>df.isnull().sum()</code>, which is passed as the value for the parameter <code>data</code> in the constructor of <code>DataFrame</code>.

In [27]:
columns_info = columns_info.append(pd.DataFrame(df.isnull().sum()).T.rename(index={0:'Null Values (NV)'}))

<b>What share of the dataset do these null values make up?</b>
<br>
<br>
Again we append to <code>columns_info</code> another <i>row in the index column</i>, called <code>Null Values (%)</code>. This new row holds the <b>percentage</b> of null values for every column in the dataset, obtained from <code>df.isnull().sum()/df.shape[0]*100</code>, which is passed as the value for the parameter <code>data</code> in the constructor of <code>DataFrame</code>.
<br>
<br>
<b>Explaining the operation</b> <code>df.isnull().sum()/df.shape[0]*100</code>:
<br>
With <code>df.isnull().sum()</code> we get the total number of null values in every column. Dividing by the number of rows, taken from <code>df.shape[0]</code>, gives the fraction of null values, and multiplying by <code>100</code> turns that fraction into the percentage of the dataset these null values represent.

In [28]:
columns_info=columns_info.append(pd.DataFrame(df.isnull().sum()/df.shape[0]*100).T.rename(index={0:'Null Values (%)'}))
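
As a small worked example of this operation, the sketch below applies it to a hypothetical four-row frame (<code>toy</code> is made up for illustration). It also shows that taking the mean of the boolean mask gives the same result, since the mean of <code>True</code>/<code>False</code> values is the fraction of <code>True</code>. For the real data the same arithmetic gives 135080 / 541909 * 100, which is about 24.93 for <code>CustomerID</code>.

```python
import pandas as pd

# Hypothetical toy frame: 1 of 4 descriptions and 2 of 4 customer IDs are missing.
toy = pd.DataFrame({
    "Description": ["MUG", None, "TRAY", "BOX"],
    "CustomerID": ["17850", None, None, "14527"],
})

# Count of nulls per column, divided by the number of rows, times 100.
pct_manual = toy.isnull().sum() / toy.shape[0] * 100

# Equivalent shortcut: the mean of a boolean mask is the fraction of True values.
pct_mean = toy.isnull().mean() * 100

print(pct_manual)  # Description 25.0, CustomerID 50.0
```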

Let's take a look at <code>columns_info</code>:

In [31]:
columns_info

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
Column Type,object,object,object,int64,object,float64,object,object
Null Values (NV),0,0,1454,0,0,0,135080,0
Null Values (%),0,0,0.268311,0,0,0,24.9267,0
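
A note for readers on newer pandas: <code>DataFrame.append</code> was removed in pandas 2.0, so the cells above only run on older versions. The following is a minimal sketch of the same summary built with <code>pd.concat</code>, shown on a hypothetical toy frame (<code>toy</code> and <code>summary</code> are illustrative names, not part of the original notebook).

```python
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    "Description": ["MUG", None, "TRAY", "BOX"],
    "CustomerID": ["17850", None, None, "14527"],
})

# Build each summary row as a one-row DataFrame, then stack the rows.
summary = pd.concat([
    pd.DataFrame(toy.dtypes).T.rename(index={0: "Column Type"}),
    pd.DataFrame(toy.isnull().sum()).T.rename(index={0: "Null Values (NV)"}),
    pd.DataFrame(toy.isnull().sum() / toy.shape[0] * 100).T.rename(index={0: "Null Values (%)"}),
])
print(summary)
```

Replacing <code>toy</code> with <code>df</code> should reproduce the table above on any pandas version that provides <code>pd.concat</code>.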


There are null values in the columns <code>Description</code> and <code>CustomerID</code>; they represent about <b>0.27%</b> and <b>24.93%</b> of the rows respectively.
<br>
<br>
Let's see a random sample from the dataset:

In [30]:
df.sample(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
53730,540848,22071,SMALL WHITE RETROSPOT MUG IN BOX,2,1/12/2011 9:26,3.36,,United Kingdom
50844,540562,22383,LUNCH BAG SUKI DESIGN,20,1/10/2011 10:35,1.65,12524.0,Germany
99680,544778,48194,DOORMAT HEARTS,2,2/23/2011 12:03,7.95,14978.0,United Kingdom
535547,581217,22113,GREY HEART HOT WATER BOTTLE,3,12/8/2011 9:20,8.29,,United Kingdom
438538,574326,79191C,RETRO PLASTIC ELEPHANT TRAY,12,11/4/2011 8:29,0.85,14913.0,United Kingdom


Now we know that almost <b>25%</b> of the transactions are not assigned to a particular client and around <b>0.27%</b> of the transaction descriptions are not specified. There are several ways to deal with missing values:
<ul>
    <li>Impute values for <code>CustomerID</code> and <code>Description</code>. In this case that is not possible: an 
        identifier or a product name cannot be sensibly estimated from the other columns.</li>
    <li>Apply clustering analysis and look for patterns in those <i>unknown</i> clients and <i>unknown</i> product 
        descriptions. Once these patterns are detected we can assign labels to them and use these labels as a generic 
        <i>CustomerID</i> and <i>Description</i>.</li>
    <li>Delete the rows where these missing values are found.</li>
</ul>

For simplicity we'll delete the rows with missing values using the method <code>dropna()</code>. Its parameters are:
<ul>
    <li><code>axis</code>: with the value of <code>0</code> to specify that we want to drop <i>rows</i> (not columns) that contain missing values.
    </li>
    <li><code>subset</code>: The names of the columns in which to look for missing values.</li>
    <li><code>inplace</code>: With the value <code>True</code>, the rows are dropped from <code>df</code> itself instead of returning a new dataset.</li>
</ul>

In [32]:
df.dropna(axis = 0, subset = ['CustomerID'], inplace = True)
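
To make the effect of <code>subset</code> and <code>inplace</code> concrete, here is a minimal sketch on a hypothetical four-row frame (the names <code>toy</code> and <code>new_frame</code> are made up for illustration). Only rows with a missing <code>CustomerID</code> are dropped; a row whose <code>Description</code> alone is missing survives.

```python
import pandas as pd

# Hypothetical toy frame; row 3 has a customer but no description.
toy = pd.DataFrame({
    "Description": ["MUG", None, "TRAY", None],
    "CustomerID": ["17850", None, None, "14527"],
})

# Without inplace, dropna returns a new frame and leaves toy untouched.
new_frame = toy.dropna(axis=0, subset=["CustomerID"])
rows_before = len(toy)

# With inplace=True, toy itself is modified.
toy.dropna(axis=0, subset=["CustomerID"], inplace=True)

print(rows_before, len(toy))               # 4 2
print(toy["Description"].isnull().sum())   # 1: subset only checked CustomerID
```

In the real dataset the <code>Description</code> nulls disappear as well after this step, which suggests that every row without a description also lacks a <code>CustomerID</code>.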

We have cleaned the dataset of missing values! Using the same methods as before, we recompute how many null values each column has and what percentage of the dataset they represent, to check that they were really deleted.

In [35]:
columns_info=pd.DataFrame(df.dtypes).T.rename(index={0:'Column Type'})
columns_info=columns_info.append(pd.DataFrame(df.isnull().sum()).T.rename(index={0:'Null Values (NB)'}))
columns_info=columns_info.append(pd.DataFrame(df.isnull().sum()/df.shape[0]*100).T.rename(index={0:'Null Values (%)'}))

The methods have been applied, now let's see the information:

In [36]:
columns_info

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
Column Type,object,object,object,int64,object,float64,object,object
Null Values (NB),0,0,0,0,0,0,0,0
Null Values (%),0,0,0,0,0,0,0,0


Nice! The missing values were deleted from the dataset. Next we check for duplicated rows: <i>duplicated rows can cause generalization problems and can bias the model.</i> The check is done as follows.

In [56]:
print('Quantity of duplicate values: {}'.format(df.duplicated().sum()))

Quantity of duplicate values: 5225
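
It helps to know exactly what <code>duplicated()</code> counts. By default it flags every repeat of a row but not its first occurrence. The minimal sketch below, on a hypothetical toy frame, contrasts this with <code>keep=False</code>, which flags all members of a duplicate group.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical in every column.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["22071", "22071", "22383", "22071"],
    "Quantity": [2, 2, 2, 2],
})

# Default keep='first': only the second copy is flagged.
n_repeats = toy.duplicated().sum()

# keep=False: both copies are flagged.
n_all_copies = toy.duplicated(keep=False).sum()

print(n_repeats, n_all_copies)  # 1 2
```

So the figure printed above is the number of rows that would be removed, not the number of rows involved in a duplicate group.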


There are <b>5225</b> duplicated rows, let's take a look at some of them.
<br>
First we get the indexes of 5 random duplicated rows and hold them in the <code>indexes</code> variable, then we use <code>loc[]</code> to select and print them.

In [73]:
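# Filter to the duplicated rows first, then sample; sampling df.duplicated() directly
# would pick 5 random rows from the whole dataset, duplicated or not
indexes = df[df.duplicated()].sample(5).index
print('Samples of duplicate values:\n{}'.format(df.loc[indexes]))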

Samples of duplicate values:
       InvoiceNo StockCode                        Description  Quantity  \
538643    581412     22847        BREAD BIN DINER STYLE IVORY         1   
179103    552262     21556       CERAMIC STRAWBERRY MONEY BOX         6   
419318    572767     22997      TRAVEL CARD WALLET UNION JACK        24   
436730    574239     22910  PAPER CHAIN KIT VINTAGE CHRISTMAS        12   
131987    547651     22292   HANGING CHICK  YELLOW DECORATION         1   

             InvoiceDate  UnitPrice CustomerID         Country  
538643   12/8/2011 14:38      16.95      14415  United Kingdom  
179103    5/8/2011 11:37       2.55      14911            EIRE  
419318  10/26/2011 10:11       0.42      17865  United Kingdom  
436730   11/3/2011 12:43       2.95      14849  United Kingdom  
131987   3/24/2011 12:11       1.45      16904  United Kingdom  


Cool! Time to drop them from the dataset with the function <code>drop_duplicates</code>. With the value <code>True</code> for the parameter <code>inplace</code>, we don't need to assign the result back to the <code>df</code> dataframe, because the operation modifies that same object.


In [76]:
df.drop_duplicates(inplace=True)
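
Here is a minimal sketch of what <code>drop_duplicates</code> leaves behind, on a hypothetical toy frame: the first copy of each row is kept and the original index labels are preserved, so the index has gaps unless it is reset. On the real data, the counts shown earlier suggest 406829 - 5225 = 401604 rows should remain after this step.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["22071", "22071", "22383", "22071"],
})

deduped = toy.drop_duplicates()

print(list(deduped.index))         # [0, 2, 3]: label 1 is gone, gaps remain
print(toy.duplicated().sum())      # 1 row was a repeat
print(len(toy) - len(deduped))     # 1 row was removed

# Optional: renumber the rows 0..n-1 after dropping.
renumbered = deduped.reset_index(drop=True)
```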

We have finished cleaning the dataset. Now we are going to work on <b>Feature Exploration</b> to <i>get an overview of, and form an opinion about, the columns that make up the dataset.</i>
<br>
<h2>Feature Exploration</h2>
<br>
In the <a href='https://archive.ics.uci.edu/ml/datasets/Online+Retail'>UCI Machine Learning Repository</a> we can find the explanation of every column. I'll list them below:
<ul>
    <li><code>InvoiceNo</code>: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. 
        If this code starts with letter <b>'c'</b>, it indicates a cancellation.</li>
    <li><code>StockCode</code>: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct 
        product.</li>
    <li><code>Description</code>: Product (item) name. Nominal.</li>
    <li><code>Quantity</code>: The quantities of each product (item) per transaction. Numeric.</li>
    <li><code>InvoiceDate</code>: Invoice date and time. Numeric, the day and time when each transaction was generated.</li>
    <li><code>UnitPrice</code>: Unit price. Numeric, Product price per unit in sterling.</li>
    <li><code>CustomerID</code>: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
    </li>
    <li><code>Country</code>: Country name. Nominal, the name of the country where each customer resides.</li>
</ul>
Every column in this dataset has now been explained, so we understand what each one represents, how its values are composed and what information it holds! Let's explore some of them.
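
The <code>InvoiceNo</code> rule above (a leading letter marks a cancellation) can be turned into a boolean flag. This is a minimal sketch on hypothetical invoice numbers; the column name <code>is_cancelled</code> is an illustrative choice, and the check is made case-insensitive as an assumption because the description says 'c' while files may use an uppercase letter.

```python
import pandas as pd

# Hypothetical invoice numbers; two of them carry the cancellation prefix.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "c536383", "536390"],
    "Quantity": [6, -1, -2, 4],
})

# Upper-case first so both 'c' and 'C' prefixes are detected.
toy["is_cancelled"] = toy["InvoiceNo"].str.upper().str.startswith("C")

print(toy)
```

This relies on <code>InvoiceNo</code> having been read as a string column.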

<h4>Exploring the Country Column</h4>
<br>
We are going to look at the countries from which customers placed their orders. We create a temporary dataframe object called <code>temp</code> that holds the columns <code>CustomerID</code>, <code>InvoiceNo</code> and <code>Country</code>, and group it by those same three columns. <a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html'>Look here to better understand the <code>groupby()</code> function</a>.
<br>
<br>
The result holds one entry for every order (invoice) a customer made, together with the country it came from, grouped first by the column <code>CustomerID</code>. The method <code>count()</code> returns a <code>DataFrame</code> object, which lets us display this information in a clearer way.

In [90]:
temp = df[['CustomerID', 'InvoiceNo', 'Country']].groupby(['CustomerID', 'InvoiceNo', 'Country']).count()

Then we call <code>reset_index</code> with the parameter <code>drop</code> set to its default value of <code>False</code>. This resets the index to the default integer index and, because <code>drop</code> is <code>False</code>, the old index levels (<code>CustomerID</code>, <code>InvoiceNo</code> and <code>Country</code>) are inserted back into the dataframe as columns. <a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html'>Check here for more info.</a>

In [91]:
temp = temp.reset_index(drop=False)
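
The two steps above can be sketched on a hypothetical toy frame to show what <code>temp</code> ends up containing: one row per unique combination of customer, invoice and country. The names <code>toy</code>, <code>orders</code> and <code>alt</code> are made up for illustration; the <code>drop_duplicates</code> variant is an alternative route to the same rows, not the notebook's method.

```python
import pandas as pd

# Hypothetical line items: invoice 540562 has two lines, so it appears twice.
toy = pd.DataFrame({
    "CustomerID": ["12524", "12524", "12524", "14978"],
    "InvoiceNo": ["540562", "540562", "540600", "544778"],
    "Country": ["Germany", "Germany", "Germany", "United Kingdom"],
})

# Grouping by all three columns leaves one group per unique combination.
grouped = toy.groupby(["CustomerID", "InvoiceNo", "Country"]).count()

# drop=False moves the three index levels back into ordinary columns.
orders = grouped.reset_index(drop=False)

# Alternative with the same rows: drop repeated combinations directly.
alt = toy.drop_duplicates().reset_index(drop=True)

print(orders)
```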

Great! Let's see how many countries buy from the e-commerce retailer. This is done with the <code>pandas</code> function <code>value_counts()</code>, which returns the number of occurrences of each unique value, so the length of its result is the number of distinct countries.

In [93]:
countries = temp['Country'].value_counts()
print('Number of countries that purchase from the e-commerce retailer: {}'.format(len(countries)))

Number of countries that purchase from the e-commerce retailer: 37
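
A minimal sketch of this counting step on hypothetical data: <code>value_counts()</code> returns one entry per distinct value, sorted from most to least frequent, so its length equals what <code>nunique()</code> reports directly.

```python
import pandas as pd

# Hypothetical orders, one row per invoice.
orders = pd.DataFrame({
    "Country": ["United Kingdom", "Germany", "United Kingdom", "France", "United Kingdom"],
})

counts = orders["Country"].value_counts()   # occurrences of each distinct country
n_countries = orders["Country"].nunique()   # number of distinct countries

print(counts.index[0], counts.iloc[0])  # United Kingdom 3
print(len(counts), n_countries)         # 3 3
```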


Let's see how many purchases were made from every country. We can do this with the <code>numpy</code> function <code>unique</code>, passing it the <code>Country</code> column and <code>True</code> as the value for <code>return_counts</code>, so that it also returns the frequency of every distinct value in the column. It returns two arrays: one with the names of the countries that appear in the column <code>Country</code> and the other with how often each of them appears. We then combine these two arrays into a single <code>DataFrame</code> sorted by number of purchases.

In [110]:
COUNTRY, COUNTRY_PURCHASES = np.unique(temp['Country'], return_counts=True)

In [122]:
data_country_purchases = pd.DataFrame({'Country':COUNTRY, 'Purchases': COUNTRY_PURCHASES})
data_country_purchases.sort_values(by='Purchases', ascending=False, inplace=True, ignore_index=False)
data_country_purchases

Unnamed: 0,Country,Purchases
35,United Kingdom,19857
14,Germany,603
13,France,458
10,EIRE,319
3,Belgium,119
30,Spain,105
23,Netherlands,101
32,Switzerland,71
26,Portugal,70
0,Australia,69
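
For reference, here is a minimal sketch of how <code>np.unique</code> with <code>return_counts=True</code> behaves, using a hypothetical array of country names. The distinct values come back sorted alphabetically, which is why the table above needs an explicit sort by <code>Purchases</code>.

```python
import numpy as np
import pandas as pd

# Hypothetical country column, one entry per order.
countries = np.array(["Germany", "United Kingdom", "France",
                      "United Kingdom", "Germany", "United Kingdom"])

# names: sorted distinct values; counts: occurrences of each, in the same order.
names, counts = np.unique(countries, return_counts=True)

table = (pd.DataFrame({"Country": names, "Purchases": counts})
         .sort_values(by="Purchases", ascending=False))

print(names)   # ['France' 'Germany' 'United Kingdom']
print(counts)  # [1 2 3]
print(table)
```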


In [103]:
COUNTRY_PURCHASES.sum()

22190
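
With the total in hand, each country's share of the orders follows by dividing its count by the total. The sketch below uses the four largest counts from the table above together with the total of 22190; the <code>Share (%)</code> column name is an illustrative choice.

```python
import pandas as pd

# Counts copied from the table above; total copied from COUNTRY_PURCHASES.sum().
top = pd.DataFrame({
    "Country": ["United Kingdom", "Germany", "France", "EIRE"],
    "Purchases": [19857, 603, 458, 319],
})
total = 22190

# Share of all orders, as a percentage rounded to two decimals.
top["Share (%)"] = (top["Purchases"] / total * 100).round(2)

print(top)  # United Kingdom accounts for about 89.49% of the orders
```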

14911    248
12748    224
17841    169
14606    128
15311    118
13089    118
12971     89
14527     86
13408     81
14646     77
16029     76
16422     75
14156     66
13798     63
18102     62
13694     60
15061     55
17450     55
16013     54
15189     53
15039     52
13767     52
17949     52
17811     51
12921     50
12471     49
16133     46
17511     46
14298     45
17315     43
        ... 
15262      1
17120      1
17038      1
13937      1
16776      1
16406      1
14039      1
14894      1
16276      1
16961      1
15619      1
16542      1
14241      1
17206      1
14098      1
15677      1
15168      1
14988      1
16144      1
12587      1
15930      1
16727      1
15460      1
15350      1
17128      1
16415      1
16527      1
17234      1
14830      1
12851      1
Name: CustomerID, Length: 4372, dtype: int64

In [78]:
!git add . e-commerce_uk_retailer_machine_learning_analysis.ipynb
!git commit -m "Data Preparation rename to Data Cleaning"
!git push origin master --force

The file will have its original line endings in your working directory


[master bad1f38] Data Preparation rename to Data Cleaning
 1 file changed, 40 insertions(+), 7 deletions(-)


To https://github.com/kleyersoma/E-Commerce_UK_Retailer_ML
   f52e5e4..bad1f38  master -> master
