<a href="https://colab.research.google.com/github/jacquesbilombe/CRM-RFM-Analysis/blob/main/CustomerClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset for this project can be found on the project's GitHub repository or at [Kaggle](https://www.kaggle.com/datasets/ddosad/auto-sales-data). This project serves as a case study for the Data Science and Analytics course at PUC RIO. For more information, please refer to the project README.

In [18]:
import os
import csv
import sys
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from pathlib import Path
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

## Other configurations

In [3]:
%matplotlib inline
warnings.filterwarnings('ignore')

# Get the data access
! git clone https://github.com/jacquesbilombe/CRM-RFM-Analysis.git

Cloning into 'CRM-RFM-Analysis'...
remote: Enumerating objects: 44, done.
remote: Counting objects: 100% (44/44), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 44 (delta 14), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (44/44), 92.61 KiB | 3.43 MiB/s, done.
Resolving deltas: 100% (14/14), done.


## Exploratory Data Analysis

In [24]:
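# Path to the dataset folder inside the cloned repository
data_folder = os.path.join('CRM-RFM-Analysis', 'dataset')

# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv(os.path.join(data_folder, 'dataset.csv'))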
df.head(5)

,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,DAYS_SINCE_LASTORDER,STATUS,PRODUCTLINE,MSRP,PRODUCTCODE,CUSTOMERNAME,PHONE,ADDRESSLINE1,CITY,POSTALCODE,COUNTRY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
0,10107,30,95.7,2,2871.0,24/02/2018,828,Shipped,Motorcycles,95,S10_1678,Land of Toys Inc.,2125557818,897 Long Airport Avenue,NYC,10022,USA,Yu,Kwai,Small
1,10121,34,81.35,5,2765.9,07/05/2018,757,Shipped,Motorcycles,95,S10_1678,Reims Collectables,26.47.1555,59 rue de l'Abbaye,Reims,51100,France,Henriot,Paul,Small
2,10134,41,94.74,2,3884.34,01/07/2018,703,Shipped,Motorcycles,95,S10_1678,Lyon Souveniers,+33 1 46 62 7555,27 rue du Colonel Pierre Avia,Paris,75508,France,Da Cunha,Daniel,Medium
3,10145,45,83.26,6,3746.7,25/08/2018,649,Shipped,Motorcycles,95,S10_1678,Toys4GrownUps.com,6265557265,78934 Hillside Dr.,Pasadena,90003,USA,Young,Julie,Medium
4,10168,36,96.66,1,3479.76,28/10/2018,586,Shipped,Motorcycles,95,S10_1678,Technics Stores Inc.,6505556809,9408 Furth Circle,Burlingame,94217,USA,Hirano,Juri,Medium


In [25]:
# Missing Values
print(df.isnull().sum())
print("---------------")

print("Number of lines before removing missing values: ", df.shape[0])

# Dropping Missing Values
df.dropna(inplace=True)

print("Number of lines after removing missing values: ", df.shape[0])

ORDERNUMBER             0
QUANTITYORDERED         0
PRICEEACH               0
ORDERLINENUMBER         0
SALES                   0
ORDERDATE               0
DAYS_SINCE_LASTORDER    0
STATUS                  0
PRODUCTLINE             0
MSRP                    0
PRODUCTCODE             0
CUSTOMERNAME            0
PHONE                   0
ADDRESSLINE1            0
CITY                    0
POSTALCODE              0
COUNTRY                 0
CONTACTLASTNAME         0
CONTACTFIRSTNAME        0
DEALSIZE                0
dtype: int64
---------------
Number of lines before removing missing values:  2747
Number of lines after removing missing values:  2747


In [26]:
# Unique Invoice Count
print("Unique Invoice Count: ", df["ORDERNUMBER"].nunique())

# Unique Customer Count
print("Unique Customer Count: ", df["CUSTOMERNAME"].nunique())

Unique Invoice Count:  298
Unique Customer Count:  89


The dataset has no null rows or columns, but some column types don't match their content: `ORDERDATE`, for example, is stored as `object` rather than as a datetime.

In [27]:
df.dtypes

ORDERNUMBER               int64
QUANTITYORDERED           int64
PRICEEACH               float64
ORDERLINENUMBER           int64
SALES                   float64
ORDERDATE                object
DAYS_SINCE_LASTORDER      int64
STATUS                   object
PRODUCTLINE              object
MSRP                      int64
PRODUCTCODE              object
CUSTOMERNAME             object
PHONE                    object
ADDRESSLINE1             object
CITY                     object
POSTALCODE               object
COUNTRY                  object
CONTACTLASTNAME          object
CONTACTFIRSTNAME         object
DEALSIZE                 object
dtype: object

In [28]:
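# Convert "ORDERDATE" to datetime; the raw strings are day-first (e.g. 24/02/2018),
# so the format is given explicitly to avoid day/month ambiguity
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'], format='%d/%m/%Y')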

In [29]:
df["ORDERDATE"].max()

Timestamp('2020-05-31 00:00:00')
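
The sample rows shown earlier store `ORDERDATE` as day-first strings such as `24/02/2018`. As a minimal sketch of why an explicit format can matter, the snippet below parses three of those sample strings with `format="%d/%m/%Y"`, so an ambiguous value like `07/05/2018` is read as 7 May rather than 5 July. The names `raw_dates` and `parsed` are illustrative only.

```python
import pandas as pd

# Three ORDERDATE strings copied from the sample rows (day/month/year)
raw_dates = pd.Series(["24/02/2018", "07/05/2018", "01/07/2018"])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(parsed.dt.month.tolist())  # [2, 5, 7]
print(parsed.max())              # 2018-07-01 00:00:00
```

Without an explicit format, how ambiguous strings are interpreted can depend on the pandas version, so stating the format is the safer choice here.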

In [30]:
# Total quantity ordered per product line
df.groupby("PRODUCTLINE").agg({"QUANTITYORDERED": "sum"}).sort_values("QUANTITYORDERED", ascending=False).head(10)

PRODUCTLINE,QUANTITYORDERED
Classic Cars,33373
Vintage Cars,20059
Motorcycles,11080
Planes,10636
Trucks and Buses,10579
Ships,7989
Trains,2712


In [31]:
# Summary statistics of the data
df.describe().T

,count,mean,min,25%,50%,75%,max,std
ORDERNUMBER,2747.0,10259.761558,10100.0,10181.0,10264.0,10334.5,10425.0,91.877521
QUANTITYORDERED,2747.0,35.103021,6.0,27.0,35.0,43.0,97.0,9.762135
PRICEEACH,2747.0,101.098952,26.88,68.745,95.55,127.1,252.87,42.042549
ORDERLINENUMBER,2747.0,6.491081,1.0,3.0,6.0,9.0,18.0,4.230544
SALES,2747.0,3553.047583,482.13,2204.35,3184.8,4503.095,14082.8,1838.953901
ORDERDATE,2747.0,2019-05-13 21:56:17.211503360,2018-01-06 00:00:00,2018-11-08 00:00:00,2019-06-24 00:00:00,2019-11-17 00:00:00,2020-05-31 00:00:00,
DAYS_SINCE_LASTORDER,2747.0,1757.085912,42.0,1077.0,1761.0,2436.5,3562.0,819.280576
MSRP,2747.0,100.691664,33.0,68.0,99.0,124.0,214.0,40.114802


In [32]:
df.shape

(2747, 20)

## RFM Analysis
- For this analysis, we treat the day after the last purchase in the dataset as today's date, which avoids "null" values. If this doesn't suit your situation, pass the date as a string in "YYYY-MM-DD" format to the following function.

- Analysing the dataset,

In [19]:
def rfm_date(df, to_day):
  """Return the reference date used as "today" in the RFM analysis."""
  if to_day == "":
    # Default: the day after the most recent order
    return df["ORDERDATE"].max() + pd.Timedelta(days=1)
  # Validate the given date string before using it
  try:
    return pd.Timestamp(datetime.strptime(to_day, "%Y-%m-%d"))
  except ValueError as e:
    # Raise instead of calling sys.exit(0), which would report success
    raise ValueError("Wrong date value, please try again with the format 'YYYY-MM-DD'") from e

In [20]:
current_date = rfm_date(df=df, to_day="")
current_date

Timestamp('2020-06-01 00:00:00')
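
To illustrate how the reference date feeds into RFM, here is a minimal sketch on a toy table. The customers, order numbers and amounts are invented; only the column names match the dataset. It assumes the common RFM definitions: recency as the number of days between the reference date and a customer's last order, frequency as the number of distinct orders, and monetary as total sales. The `rfm` variable and its column names are hypothetical and not necessarily what the rest of the notebook uses.

```python
import pandas as pd

# Toy orders; all values are invented for illustration
orders = pd.DataFrame({
    "CUSTOMERNAME": ["A", "A", "B", "B", "B"],
    "ORDERNUMBER": [1, 2, 3, 3, 4],
    "ORDERDATE": pd.to_datetime(
        ["2020-05-01", "2020-05-31", "2020-04-10", "2020-04-10", "2020-05-20"]
    ),
    "SALES": [100.0, 250.0, 80.0, 20.0, 300.0],
})

# Reference date: the day after the last purchase in the data
reference_date = orders["ORDERDATE"].max() + pd.Timedelta(days=1)

# One row per customer with the three RFM measures
rfm = orders.groupby("CUSTOMERNAME").agg(
    Recency=("ORDERDATE", lambda d: (reference_date - d.max()).days),
    Frequency=("ORDERNUMBER", "nunique"),
    Monetary=("SALES", "sum"),
)

print(rfm)
```

Because the reference date is one day after the latest order, the most recent customer (A) gets a recency of 1 day rather than 0, while B, whose last order was on 2020-05-20, gets 12.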