In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

***Dataset Story***

* The data set named Online Retail - II includes the sales of an online store between 01/12/2009 - 09/12/2011.

* The product catalog of this company includes souvenirs.

* The majority of the company's customers are corporate customers.

Building CRM strategies that match customers' expectations and needs, and that follow the principle of right customer, right product, right time, right offer, is one of the most important approaches to deepening customer relationships.

For this purpose, you want to build customer-oriented strategies. You aim to reach your customers with different campaigns, scenarios, and attractive messages. So which customer will you contact with which strategy? In summary, do you know the answers to the following questions?

* How recent was a customer's latest purchase? (Recency)

* How often does a customer make a purchase? (Frequency)

* How much money does a customer spend? (Monetary)

At this point, the most effective way to get to know your customers is to combine CRM with analytics. RFM analysis is an essential application of CRM analytics that answers these questions and provides deep insight into customer habits.

This study covers the following topics:

* Calculating R, F, M values
* Dividing customers into groups according to RFM scores
* Personalizing marketing strategies for the relevant segments

***Business Problem & Goal:***

An e-commerce company believes that marketing activities based on customer segments with common behaviors will increase revenue. The goal is therefore to divide customers into segments and determine marketing strategies for each segment.

***Variables Description:***

* InvoiceNo (column `Invoice` in the code) : The invoice number, unique to each purchase. Refund invoice numbers contain "C"

* StockCode : Unique code for each item

* Description : Name of the item

* Quantity : The number of items within the invoice

* InvoiceDate : Date and time of the purchase

* UnitPrice (column `Price` in the code) : Price of a single item, in pounds sterling

* CustomerID (column `Customer ID` in the code) : Unique ID number for each customer

* Country : The country where the customer lives


In [None]:
# Import Libraries:


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
from datetime import timedelta

# Setting Configurations:

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Import Warnings:

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)


# Import helpers Module

from shutil import copyfile
copyfile(src = "../input/helpers/eda.py", dst = "../working/eda.py")
copyfile(src = "../input/helpers/data_prep.py", dst = "../working/data_prep.py")

from data_prep import *
from eda import *


In [None]:
# Import Data:

df = pd.read_csv("../input/online-retail-ii-data-set-from-ml-repository/Year 2010-2011.csv")

df.head()

***Exploratory Data Analysis***

In [None]:
check_df(df)

In [None]:
# Categorical / Numerical / Cardinal Features: 

cat_cols, num_cols, cat_but_car = grab_col_names(df)
num_cols = [col for col in num_cols if ("ID" not in col) and ("Date" not in col) ]

The dataset includes 8 features: 3 numerical columns and 5 categorical columns. None of the columns has high cardinality.

In [None]:
# Let's observe  numerical columns: 

for col in num_cols:
    num_summary(df,col)

In [None]:
# Missing Values:

missing_values_table(df)
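
The `missing_values_table` helper comes from `data_prep.py`, which is not shown in this notebook. As a rough idea of what such a report contains, here is a minimal sketch that assumes the helper lists the count and percentage of missing values per column. The function name `missing_values_sketch` and the toy values are hypothetical, and the real helper may differ.

```python
import numpy as np
import pandas as pd


def missing_values_sketch(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: count and percent of missing values per column."""
    n_miss = dataframe.isnull().sum()
    # Keep only the columns that actually have missing values
    n_miss = n_miss[n_miss > 0].sort_values(ascending=False)
    ratio = (n_miss / len(dataframe) * 100).round(2)
    return pd.DataFrame({"n_miss": n_miss, "ratio": ratio})


# Toy frame with made-up values, only to show the shape of the report
toy = pd.DataFrame({
    "Description": ["MUG", None, "BAG", "PEN"],
    "Customer ID": [17850.0, np.nan, np.nan, 13047.0],
    "Quantity": [6, 1, 2, 3],
})

missing_summary = missing_values_sketch(toy)
print(missing_summary)
```

Columns without missing values (here `Quantity`) are left out of the report.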

In [None]:
# Drop NA values:

df.dropna(inplace=True)
missing_values_table(df)

In [None]:
# Let's remove the returned product transactions (invoice numbers that contain "C")

is_return = df["Invoice"].astype(str).str.contains("C", na=False)

# Keep only the transactions that are not returns
df = df[~is_return].reset_index(drop=True)


In [None]:
# Delete values less than or equal to 0 in the variables Quantity and Price

df = df[df["Quantity"] > 0]
df = df[df["Price"] > 0]

In [None]:
# Let's only observe the outliers; we don't need to treat them, because the metrics will be converted to quantile-based scores.

for col in num_cols:
    grab_outliers(df,col)
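
The `grab_outliers` helper is also imported from `data_prep.py`, so its thresholds are not visible here. The sketch below assumes the common 1.5 x IQR rule on the first and third quartiles; the function name `iqr_bounds` and the sample values are hypothetical, and the helper may use different quantiles.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, q1: float = 0.25, q3: float = 0.75):
    """Return (low, high) limits using the 1.5 * IQR rule."""
    lower_q = series.quantile(q1)
    upper_q = series.quantile(q3)
    iqr = upper_q - lower_q
    return lower_q - 1.5 * iqr, upper_q + 1.5 * iqr


# Made-up quantities with one obviously extreme value
quantity = pd.Series([1, 2, 2, 3, 3, 4, 5, 500], name="Quantity")

low, high = iqr_bounds(quantity)
outliers = quantity[(quantity < low) | (quantity > high)]

print(low, high)
print(outliers)
```

Only the value 500 falls outside the limits in this toy series.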
 

In [None]:
# Unique Number of Products (with Description)

df.Description.nunique()

In [None]:
# Unique Number of Products (with StockCode)

df.StockCode.nunique()

In [None]:
# The number of unique values of these 2 variables (Description & StockCode) should be equal, because each stock code represents one product.

# 1st Step
df_product = df[["Description","StockCode"]].drop_duplicates()
df_product = df_product.groupby(["Description"]).agg({"StockCode":"count"}).reset_index()


df_product.rename(columns={'StockCode':'StockCode_Count'},inplace=True)
df_product.head()

In [None]:
df_product = df_product.sort_values("StockCode_Count", ascending=False)
df_product = df_product[df_product["StockCode_Count"]>1]

df_product.head()

In [None]:
# Let's delete products with more than one stock code 

df = df[~df["Description"].isin(df_product["Description"])]

print(df.StockCode.nunique())
print(df.Description.nunique())

In [None]:
# 2nd Step

df_product = df[["Description","StockCode"]].drop_duplicates()
df_product = df_product.groupby(["StockCode"]).agg({"Description":"count"}).reset_index()
df_product.rename(columns={'Description':'Description_Count'},inplace=True)
df_product = df_product.sort_values("Description_Count", ascending=False)
df_product = df_product[df_product["Description_Count"] > 1] 


df_product.head()


In [None]:
# Let's delete stock codes that represent multiple products

df = df[~df["StockCode"].isin(df_product["StockCode"])]

In [None]:
# Now each stock code represents a single product

print(df.StockCode.nunique())
print(df.Description.nunique())
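
To see why both steps are needed, here is a small sketch of the same two-step cleanup on a made-up table. The product names and codes are hypothetical; `nunique` is used per group, which gives the same result as `count` on de-duplicated pairs.

```python
import pandas as pd

toy = pd.DataFrame({
    "StockCode":   ["A1", "A2", "B1", "B1", "C1", "C1"],
    "Description": ["MUG", "MUG", "BAG", "TOTE", "PEN", "PEN"],
})

# Step 1: drop descriptions that appear under more than one stock code
pairs = toy[["Description", "StockCode"]].drop_duplicates()
codes_per_desc = pairs.groupby("Description")["StockCode"].nunique()
ambiguous_desc = codes_per_desc[codes_per_desc > 1].index
toy = toy[~toy["Description"].isin(ambiguous_desc)]

# Step 2: drop stock codes that carry more than one description
pairs = toy[["Description", "StockCode"]].drop_duplicates()
desc_per_code = pairs.groupby("StockCode")["Description"].nunique()
ambiguous_codes = desc_per_code[desc_per_code > 1].index
toy = toy[~toy["StockCode"].isin(ambiguous_codes)]

print(toy["StockCode"].nunique(), toy["Description"].nunique())
```

Step 1 removes "MUG" (two codes), step 2 removes "B1" (two descriptions), and only the consistent pair C1/PEN remains.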

In [None]:
# Stock codes containing "POST" denote postage cost; let's delete them because they are not products

df = df[~df["StockCode"].str.contains("POST", na=False)]

In [None]:
# Calculating Total Price:

df['TotalPrice'] = df['Quantity'] * df['Price']

In [None]:
df.head()

***Calculating RFM Metrics***

In [None]:
df.info()

In [None]:
# Let's observe the last transaction date,
# so that we can determine the performance/measurement date used to calculate how recent each customer's latest purchase was.

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['InvoiceDate'].max()

In [None]:
# Assign "performans_date" as 2 days after the last transaction date of purchase:

performans_date = df["InvoiceDate"].max() + timedelta(days=2)
performans_date

Let's create a new DataFrame called rfm_df by calculating the Recency, Frequency and Monetary values.

* Recency : the number of days between performans_date and the last purchase date of each customer
* Frequency : the number of transactions (unique invoices) of each customer
* Monetary : the sum of TotalPrice for each customer

In [None]:
rfm_df = df.groupby("Customer ID").agg \
                                    ({"InvoiceDate" : lambda InvoiceDate :(performans_date - InvoiceDate.max()).days,  # Recency
                                     "Invoice" : lambda Invoice: Invoice.nunique(),  # Frequency
                                     "TotalPrice":  lambda Total_Price: Total_Price.sum()})    # Monetary


In [None]:
rfm_df.head()

In [None]:
# Replace column names with Recency, Frequency and Monetary:

rfm_df.columns = ['recency', 'frequency', 'monetary']

rfm_df.head()
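
As a sanity check, the three metrics can be worked out by hand on a tiny example. This is a minimal sketch with made-up invoices; `analysis_date` is a hypothetical name standing in for `performans_date`. It uses pandas named aggregation, which sets the column names directly so no renaming step is needed.

```python
import pandas as pd
from datetime import timedelta

toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A", "A", "B", "C"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-05", "2011-11-09"]
    ),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})

# Measurement date: 2 days after the last transaction in the toy data
analysis_date = toy["InvoiceDate"].max() + timedelta(days=2)

toy_rfm = toy.groupby("Customer ID").agg(
    recency=("InvoiceDate", lambda s: (analysis_date - s.max()).days),
    frequency=("Invoice", "nunique"),
    monetary=("TotalPrice", "sum"),
)
print(toy_rfm)
```

Customer 1 last bought 2 days before the measurement date, on 2 distinct invoices, for a total of 35.0. Customer 2 last bought 28 days before, on 1 invoice, for 7.5.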

In [None]:
# Let's check if the values include any NaN values:

check_df(rfm_df)

***Assigning RFM Scores***

* RFM analysis scales each of the three metrics to a score from 1 to 5 for every customer; the higher the score, the better. The "best" customer receives the top score in every category. For Recency the raw values are ranked in reverse: the most valuable customers are those who purchased most recently, so the smallest recency values receive the score 5 and the largest receive 1.

* High Frequency and Monetary values mean that the customer purchases often and spends more, so the highest values receive the score 5 to represent the best customers.


In [None]:
rfm_df["Recency_Score"]  = pd.qcut(rfm_df['recency'], 5, [5, 4, 3, 2, 1])
rfm_df["Frequency_Score"]  = pd.qcut(rfm_df['frequency'].rank(method="first"), 5, [1, 2, 3, 4, 5])
rfm_df["Monetary_Score"]  = pd.qcut(rfm_df['monetary'], 5, [1, 2, 3, 4, 5])
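
The Frequency score is computed on `rank(method="first")` rather than on the raw values. A likely reason is that many customers share the same small frequency, so several quintile edges coincide and `pd.qcut` refuses to bin the raw column. The sketch below illustrates this on made-up numbers, together with the reversed labels used for Recency.

```python
import pandas as pd

frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 5, 9])

# Direct binning fails: several quintile edges coincide at the value 1
try:
    pd.qcut(frequency, 5, labels=[1, 2, 3, 4, 5])
    direct_ok = True
except ValueError:
    direct_ok = False

# Ranking first breaks ties by order of appearance, so the edges are unique
freq_score = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

# Recency uses reversed labels: the smallest values get the top score
recency = pd.Series([3, 10, 25, 60, 200, 5, 40, 90, 150, 300])
rec_score = pd.qcut(recency, 5, labels=[5, 4, 3, 2, 1])

print(direct_ok)
print(freq_score.tolist())
print(rec_score.tolist())
```

One side effect to keep in mind: with `method="first"`, customers with identical frequency can land in different score groups, depending on their row order.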

In [None]:

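# Combine the Recency and Frequency scores into a two-character code
# (the Monetary score is not part of this code)
rfm_df["RFM_SCORE"] = (rfm_df['Recency_Score'].astype(str) +
                    rfm_df['Frequency_Score'].astype(str))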

rfm_df.head() 

***Generating Segments Based on RFM Scores***

We can assign the segments by using the Recency & Frequency grid frequently seen in the literature.

In [None]:
rfm_df['Segment'] = rfm_df['RFM_SCORE']
rfm_df.head()

In [None]:
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

In [None]:
rfm_df['Segment'] = rfm_df['Segment'].replace(seg_map, regex=True)
rfm_df.reset_index(inplace=True)
rfm_df.head()
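
It is worth confirming that the ten patterns in `seg_map` cover every Recency/Frequency combination exactly once. The following sketch checks all 25 two-digit codes with `re.fullmatch`; the helper name `segment_for` is hypothetical. For two-character codes this gives the same result as the regex-based `replace` above.

```python
import re
from itertools import product

seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}


def segment_for(score: str) -> str:
    """Return the single segment whose pattern matches the two-digit score."""
    matches = [name for pattern, name in seg_map.items()
               if re.fullmatch(pattern, score)]
    assert len(matches) == 1, f"{score} matched {matches}"
    return matches[0]


# Build the full 5 x 5 grid: first digit is Recency, second is Frequency
grid = {f"{r}{f}": segment_for(f"{r}{f}")
        for r, f in product(range(1, 6), repeat=2)}

print(len(grid), grid["55"], grid["15"], grid["33"])
```

If a pattern were missing or two patterns overlapped, the assertion inside `segment_for` would fail for the affected code.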

In [None]:
rfm_df.groupby('Segment').agg({"Customer ID":"count"}).sort_values("Customer ID",ascending=False)

In [None]:
colors  = ("darkorange", "darkseagreen", "orange", "cyan", "cadetblue", "hotpink", "lightsteelblue", "coral",  "mediumaquamarine","palegoldenrod")
explodes = [0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25]

rfm_df["Segment"].value_counts(sort=False).plot.pie(colors=colors,
                                                 textprops={'fontsize': 12}, 
                                                 autopct = '%4.1f',
                                                 startangle= 90, 
                                                 radius =2, 
                                                 rotatelabels=True,
                                                 shadow = True, 
                                                 explode = explodes)
plt.ylabel("");


***Build Marketing Strategies***

In [None]:
rfm_df[["recency", "frequency", "monetary"]].agg(["mean"])

In [None]:
rfm_df[["Segment","recency", "frequency", "monetary"]].groupby("Segment").agg(["mean", "count","sum"])
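
Figures such as a segment's share of the customer base, or how its average compares with the overall average, can be derived from the same kind of summary. Here is a minimal sketch on made-up values; the column names `customers`, `customer_share_pct` and `vs_overall_mean` are hypothetical.

```python
import pandas as pd

toy_rfm = pd.DataFrame({
    "Segment": ["champions", "champions", "hibernating", "at_Risk", "hibernating"],
    "monetary": [900.0, 1100.0, 50.0, 300.0, 150.0],
})

summary = toy_rfm.groupby("Segment").agg(
    customers=("monetary", "size"),
    mean_monetary=("monetary", "mean"),
)

# Share of all customers that fall into each segment, in percent
summary["customer_share_pct"] = (summary["customers"] / len(toy_rfm) * 100).round(1)

# Segment average relative to the overall average (1.0 = same as overall)
summary["vs_overall_mean"] = (
    summary["mean_monetary"] / toy_rfm["monetary"].mean()
).round(2)

print(summary)
```

In this toy data the champions make up 40% of customers and spend twice the overall average.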

Now, let's focus on some segments which can be critically important for marketing strategies

* champions
* loyal_customers
* cant_loose
* need_attention 



* **Champions:**

This segment constitutes 15% of the customer portfolio and includes 641 customers who made their last purchase within the last week and generate an average turnover of 6000 pounds sterling. Because this segment consists of the customers who spend most frequently and who can easily adopt new products and services, cross-selling strategies can be applied to it.



* **Loyal Customers:**

There are 818 customers in this segment. While the average purchase frequency across all customer segments is 4, it is twice as high in this segment (average frequency of 8). The average monetary value of this segment is 50% above the overall average.

In conclusion, to keep customer loyalty sustainable, cross-sell communications in line with customer expectations and needs can be organized for this segment.






* **Need_Attention:**


There are 184 customers in this segment, and their last purchase was nearly 2 months ago. Although they do not purchase frequently, their total transaction amounts contribute to profitability.

As a result, cashback and bonus campaigns can be organized for this segment to retain these customers and even move them to a segment that purchases more. In addition, discounted product offers and gift-coupon campaigns can be planned by observing the habits of other customers with similar behavior and by analyzing product association rules.

* **Can't_loose:** 

Customers in the Can't_Loose segment have a higher transaction frequency, even though their spending amounts are close to those of the loyal customer segment. However, these customers are nearly lost: their last purchases were made nearly 4 months ago.

So, new campaign strategies based on rewards, discounts, and other special incentives can be planned to win these customers back and make them feel special and loyal again.