# Customer Classification using E-Commerce Dataset


## About Dataset
This is a transnational data set that contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

## Aim
We aim to implement various classification algorithms, such as SVM, Logistic Regression, Naive Bayes, Random Forest, SGD and k-NN, to predict a customer's country of origin and to compare the performance of these supervised machine learning models.

### 1. Data Processing

In [0]:
#Importing necessary libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

In [3]:
from google.colab import files
uploaded = files.upload()

Saving data.csv to data.csv


In [0]:
#Loading dataset
df = pd.read_csv('data.csv', encoding = 'ISO-8859-1')

In [0]:
#Function to display information about the dataset
def information(df):
    print(df.describe())
    print(df.dtypes)
    print(df.head())

def shape(df):
    print(df.shape)

In [0]:
# Displaying data set information
# Both helpers print their output and return None, so they are called directly
shape(df)
information(df)

In [10]:
#Check missing values for each column
df.isnull().sum().sort_values(ascending=False)

CustomerID     135080
Description      1454
Country             0
UnitPrice           0
InvoiceDate         0
Quantity            0
StockCode           0
InvoiceNo           0
dtype: int64

In [11]:
#Since CustomerID is integral to our model, we drop only those rows that contain NA in the CustomerID field
df.dropna(axis=0, subset=['CustomerID'], inplace=True)
shape(df)

(406829, 8)


In [12]:
# Drop duplicates by keeping the first value
df.drop_duplicates(keep='first', inplace=True)
shape(df)

(401604, 8)


### 2. Exploratory Data Analysis

#### Exploring the content of variables

This dataframe contains 8 variables that correspond to:
<br><br>
__InvoiceNo:__ Invoice number. Nominal - A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'C', it indicates a cancellation. <br>
__StockCode:__ Product code. Nominal - A 5-digit integral number uniquely assigned to each distinct product.<br>
__Description:__ Product (item) name. Nominal. 
<br>
__Quantity:__ Numeric - The quantities of each product (item) per transaction. <br>
__InvoiceDate:__ Invoice date and time. Numeric - The day and time when each transaction was generated. <br>
__UnitPrice:__ Unit price. Numeric - Price per unit of the product <br>
__CustomerID:__ Customer number. Nominal - A 5-digit integral number uniquely assigned to each customer. <br>
__Country:__ Country name. Nominal - The name of the country where each customer resides.<br>
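
The 'C' prefix described for InvoiceNo gives a direct way to separate cancellations from ordinary sales. A minimal sketch on a small made-up frame follows; the rows and the `IsCancelled` column name are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; real invoices come from data.csv.
toy = pd.DataFrame({
    "InvoiceNo": ["100001", "C100002", "100003", "C100004"],
    "Quantity": [6, -1, 2, -3],
})

# Cast to text first so the prefix test also works if the column was parsed as numbers.
toy["IsCancelled"] = toy["InvoiceNo"].astype(str).str.startswith("C")

cancelled = toy[toy["IsCancelled"]]
sales = toy[~toy["IsCancelled"]]
print(len(cancelled), "cancellations,", len(sales), "sales")
```

The same boolean column could then be grouped by country to address the cancellation question in the to-do list.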

In [13]:
# Count the total number of countries - y label - multi-class classification with 37 classes
df['Country'].nunique()

37

In [14]:
#We now add another variable - Total Price for better EDA
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
shape(df)

(401604, 9)


In [0]:
# Number of orders made by each country
total_orders=df.groupby('Country')['Quantity'].count().sort_values(ascending=False)
total_orders.plot(kind='bar')
plt.xlabel('Country')
plt.ylabel('Number of Orders')
plt.title('Number of Orders per Country', fontsize=16)
plt.show()

In [0]:
#Total amount spent per country (sum of TotalPrice, not the mean)
total_amount=df.groupby('Country')['TotalPrice'].sum().sort_values(ascending=False)
total_amount.plot(kind='bar')
plt.xlabel('Country')
plt.ylabel('Total amount')
plt.title('Total amount spent per Country', fontsize=16)
plt.show()

In [0]:
#Mean amount spent per country plot
mean_amount=df.groupby('Country')['TotalPrice'].mean().sort_values(ascending=False)
mean_amount.plot(kind='bar')
plt.xlabel('Country')
plt.ylabel('Mean amount')
plt.title('Mean amount spent per Country', fontsize=16)
plt.show()

**ANALYSIS:** These plots show that although the United Kingdom places the largest number of orders, the mean amount spent per purchase there is very low compared with the highest-spending countries.
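
The distinction between order count, total spend and mean spend is what drives this comparison. The sketch below uses invented numbers (the country labels `A` and `B` and all values are assumptions for illustration) to show how the three aggregates can rank countries differently.

```python
import pandas as pd

# Toy data: country A places many small orders, country B a few large ones.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B"],
    "TotalPrice": [10.0, 12.0, 8.0, 10.0, 60.0, 40.0],
})

# One pass computes all three aggregates per country.
summary = toy.groupby("Country")["TotalPrice"].agg(["count", "sum", "mean"])
print(summary)

print("Most orders:", summary["count"].idxmax())
print("Highest mean spend:", summary["mean"].idxmax())
```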

In [0]:
# Distribution of purchases on the website by country
plt.figure(figsize=(40,10))
plt.title('No. of invoices vs Country');
sns.countplot(x='Country', data=df);

In [0]:
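# Top 20 most frequently purchased item descriptions
high_descrip = df['Description'].value_counts()[:20]
plt.figure(figsize=(40,10))
plt.title('Highest purchase');
# Plot the precomputed counts; a countplot over df['Description'] would draw every item
sns.barplot(x=high_descrip.index, y=high_descrip.values)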

In [0]:
#Assign numbers categorically to each unique description 
df['Description'] = pd.Categorical(df['Description'])
df['descriptioncode'] = df['Description'].cat.codes
df.drop(columns=['Description'], inplace=True)

In [0]:
#Assign numbers categorically to each unique country
df['Country'] = pd.Categorical(df['Country'])
df['countrycode'] = df['Country'].cat.codes
df.drop(columns=['Country'], inplace=True)
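
Because the original text columns are dropped after encoding, it can help to keep the code-to-label mapping so that predicted codes can be translated back to country names. A hedged sketch on a toy series follows; the `code_to_country` name is an assumption introduced here.

```python
import pandas as pd

countries = pd.Series(["France", "United Kingdom", "Germany", "France"])
cat = pd.Categorical(countries)

# Codes index into the sorted categories: code i corresponds to cat.categories[i].
# Missing values, if any, receive the code -1.
code_to_country = dict(enumerate(cat.categories))
print(code_to_country)

# Decode the integer codes back to their labels.
decoded = cat.categories[cat.codes].tolist()
print(decoded)
```

Building this dictionary before calling `drop` keeps the confusion matrices interpretable later on.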

In [0]:
#To-Do
#Which country has the most cancellations
#On which date does each country record its maximum sales
#Mean purchase time for each country
#Which country do the highest-purchasing customers belong to
#Word occurrence

In [0]:
#Creating the feature matrix X (it is split into train and test sets later)
d={'Customer_ID': df['CustomerID'], 'Description': df['descriptioncode'], 'Quantity': df['Quantity'], 'Unit_Price': df['UnitPrice']}
X = pd.DataFrame(d)

In [0]:
#Creating the target Y as a 1-D Series of country codes, the shape scikit-learn estimators expect
Y = df['countrycode']

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

#Splitting the dataset: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 0)
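
With 37 country classes and one country supplying most of the orders, a plain random split can leave rare classes under-represented in the test set. One option, sketched here on synthetic labels rather than the real data, is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both parts. It requires at least two samples in every class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 180 samples of class 0 and 20 of class 1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 180 + [1] * 20)

# stratify=y_toy preserves the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)
print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```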

In [50]:
#Decision Tree
decision_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train) 
decision_predictions = decision_model.predict(X_test) 
  
# creating a confusion matrix 
decision_cm = confusion_matrix(y_test, decision_predictions) 
#Returning accuracy score
decision_accuracy = accuracy_score(y_test, decision_predictions)
print(decision_accuracy)

0.9157279309967032
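
If one country accounts for most rows, as the order counts above suggest for the United Kingdom, raw accuracy can look high even for a model that always predicts that country. A majority-class baseline gives a reference point. The sketch below uses synthetic imbalanced labels (an assumption, not the dataset's actual class balance) with scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels: 90% of samples belong to class 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array([0] * 900 + [1] * 60 + [2] * 40)

# The baseline ignores the features and always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)
bal_acc = balanced_accuracy_score(y_toy, pred)  # mean of per-class recall
print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

Comparing each model's accuracy with such a baseline on the real split indicates how much the model has learned beyond the class imbalance.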


In [0]:
# Support Vector Machine
svm_model= SVC(kernel = 'linear').fit(X_train, y_train) 
svm_predictions = svm_model.predict(X_test) 
  
# model accuracy for X_test   
svm_accuracy = svm_model.score(X_test, y_test)  
# creating a confusion matrix 
svm_cm = confusion_matrix(y_test, svm_predictions) 
print(svm_accuracy)
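
`SVC` with a linear kernel can be slow on several hundred thousand rows, because its training time grows at least quadratically with the number of samples. A commonly used alternative, sketched here on synthetic data, is `LinearSVC` on standardized features. This is a swapped-in estimator rather than the model used in the cell above, and its results are not guaranteed to match.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Standardizing first puts all features on a comparable scale for the linear model.
linear_svm = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
linear_svm.fit(X_tr, y_tr)

linear_svm_accuracy = linear_svm.score(X_te, y_te)
print(linear_svm_accuracy)
```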



In [51]:
#K-NN

knn_model = KNeighborsClassifier().fit(X_train, y_train)

# predicting on X_test and creating a confusion matrix
knn_predictions = knn_model.predict(X_test)
knn_cm = confusion_matrix(y_test, knn_predictions)

# accuracy on X_test 
knn_accuracy = knn_model.score(X_test, y_test) 
print(knn_accuracy) 

  


0.9589047917849424


In [52]:
#Naive-Bayes

nb_model = GaussianNB().fit(X_train, y_train) 
gnb_predictions = nb_model.predict(X_test) 

# accuracy on X_test 
nb_accuracy = nb_model.score(X_test, y_test) 
print(nb_accuracy) 

# creating a confusion matrix 
nb_cm = confusion_matrix(y_test, gnb_predictions) 



0.7237776516170158
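
The confusion matrices computed above are never displayed, and the accuracies are printed one model at a time. A compact way to compare several classifiers side by side is sketched below on synthetic data. The `models` and `results` dictionaries are illustrative assumptions, and the numbers the sketch prints say nothing about the e-commerce dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=2, random_state=0),
    "k-NN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the same split and record its test accuracy.
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = accuracy_score(y_te, pred)
    print(name, round(results[name], 3))
    print(confusion_matrix(y_te, pred))
```

Using one shared split for every model keeps the comparison fair, since all accuracies are measured on the same test rows.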
