In [None]:
from include.customer_segmentation_functions import *

In [None]:
map_source_url= R"DataSets\country_map\ne_10m_admin_0_countries\ne_10m_admin_0_countries.shp"

In [None]:
data_file= R"DataSets\rfm_ana\online_retail_II.csv"
main_data= pd.read_csv(data_file,encoding ='cp1252')
data= main_data #.sample(10000)

<div class="alert alert-block alert-success">
To keep the repo and this notebook clean, all user-defined functions used here are kept in a separate Python file (include/customer_segmentation_functions.py).
</div>

<div class="center_header">

# About the data:

</div>
<p>
This Online Retail II data set contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.
</p>

<h5 style="margin: 0px; padding: 0px;">Attribute Information:</h5>
<ul>
<li>InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.</li>
<li>StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.</li>
<li>Description: Product (item) name. Nominal.</li>
<li>Quantity: The quantity of each product (item) per transaction. Numeric.</li>
<li>InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.</li>
<li>UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).</li>
<li>CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.</li>
<li>Country: Country name. Nominal. The name of the country where a customer resides.</li>
</ul>

Find the dataset __[here](https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci/data)__

<style>
  .center_header {
  line-height: 90px;
  height: 90px;
  border: 5px solid blue;
  text-align: center;
  font-size: xx-large;
  }
</style>

<div class="center_header">

## What is RFM?

</div>

### What is Customer Segmentation?

<p>
  Customer segmentation is the practice of dividing a company’s customers into groups that reflect similarity among customers in each group. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business.<br>
</p>

__[article link](https://www.optimove.com/resources/learning-center/customer-segmentation)__


### Different types of customer segmentation:

<ul>
<li>Demographic customer segmentation</li>
<li>Geographic customer segmentation</li>
<li>Behavioral customer segmentation</li>
<li>Psychographic customer segmentation</li>
<li>Technographic customer segmentation</li>
</ul>

### What is RFM Segmentation?

<p>
RFM segmentation is a marketing analysis method that involves analyzing customer behavior based on three key factors: recency, frequency, and monetary value. This RFM analysis helps businesses categorize customers into segments, enabling targeted and personalized marketing strategies.  
</p>

<ul>
<li>Recency: How much time has elapsed since a customer’s last activity or transaction with the brand? Activity is usually a purchase, although variations are sometimes used, e.g., the last visit to a website or use of a mobile app. In most cases, the more recently a customer has interacted or transacted with a brand, the more likely that customer will be responsive to communications from the brand.</li>
<li>Frequency: How often has a customer transacted or interacted with the brand during a particular period of time? Clearly, customers with frequent activities are more engaged, and probably more loyal, than customers who rarely do so. And one-time-only customers are in a class of their own.</li>
<li>Monetary: Also referred to as “monetary value,” this factor reflects how much a customer has spent with the brand during a particular period of time. Big spenders should usually be treated differently than customers who spend little. Looking at monetary divided by frequency indicates the average purchase amount – an important secondary factor to consider when segmenting customers.</li>
</ul>

__[article link](https://www.optimove.com/resources/learning-center/rfm-segmentation)__
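The three factors above can be derived from a transaction table with a single group-by. The following is a minimal sketch on hypothetical toy data, not the implementation used later in this notebook. The column names (`customerid`, `invoice`, `invoicedate`, `totalprice`) and the choice of reference date (one day after the last transaction) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy transactions; column names are illustrative assumptions.
tx = pd.DataFrame({
    "customerid": ["A", "A", "B", "B", "B", "C"],
    "invoice": ["1001", "1002", "1003", "1004", "1005", "1006"],
    "invoicedate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-02-20", "2011-03-09", "2010-12-15",
    ]),
    "totalprice": [20.0, 35.0, 10.0, 15.0, 5.0, 100.0],
})

# Assumed reference date: one day after the last transaction in the data.
snapshot = tx["invoicedate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("customerid").agg(
    # Recency: days between the reference date and the last purchase
    recency=("invoicedate", lambda d: (snapshot - d.max()).days),
    # Frequency: number of distinct invoices
    frequency=("invoice", "nunique"),
    # Monetary: total amount spent
    monetary=("totalprice", "sum"),
)
print(rfm)
```

A lower recency value means a more recent purchase, so recency is usually scored in the opposite direction to frequency and monetary value.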


















<div class="center_header">

# General view of the dataset:

</div>

In [None]:
data.sample(3)

In [None]:
data.info()

In [None]:
data.describe(include="all").T

<div class="center_header">

# Data Preprocessing:

</div>
<p>

`Preprocessing involves several steps, e.g. data cleaning, data transformation, and data reduction.`

</p>

In [None]:
# Normalize column names: lowercase and remove spaces
data.rename(columns = {x:x.lower().replace(' ','') for x in data.columns}, inplace = True)

# Convert columns to the appropriate data types
data['quantity'] = pd.to_numeric(data['quantity'])
data['price'] = pd.to_numeric(data['price'])
data["invoicedate"]=pd.to_datetime(data["invoicedate"])

# Strip leading and trailing whitespace from descriptions
data["description"]= data["description"].str.strip()
data.sample(3)

<div class="alert alert-block alert-info" style="width: 65%;">

### Checking for null (NaN) values in the dataset:
</div>

In [None]:
for col_name in ["stockcode", "description", "quantity", "price", "customerid"]:
    get_value_counts(data, col_name)

<div class="alert alert-block alert-info" style="width: 65%;">

### Removing rows with NaN values and cancelled items.<br>Invoices whose code starts with 'C' were cancelled.
</div>
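The two filters described above can be sketched as follows. This is an illustration on hypothetical toy rows, not the code of `drop_nun_val_in_col` or `drop_canceled_items`, whose implementations live in the helper file. The column name `invoice` is an assumption.

```python
import pandas as pd

# Hypothetical toy rows; "invoice" and "customerid" are assumed column names.
raw = pd.DataFrame({
    "invoice": ["489434", "C489449", "489435", "489436"],
    "customerid": [13085.0, 16321.0, None, 13078.0],
    "quantity": [12, -1, 6, 24],
})

# Drop rows that have no customer id.
cleaned = raw.dropna(subset=["customerid"])

# Drop cancellations: invoice codes that start with "C".
is_cancelled = cleaned["invoice"].astype(str).str.startswith("C")
cleaned = cleaned[~is_cancelled].reset_index(drop=True)
print(cleaned)
```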

In [None]:
data= drop_nun_val_in_col(data, "customerid")
data= drop_canceled_items(data)

In [None]:
data['customerid']= data['customerid'].astype(np.int64).astype("string")
data['quantity']= data['quantity'].astype(np.int64)
data["invoicetime"]= data["invoicedate"].dt.time
data["invoicedate"]= data["invoicedate"].dt.date

In [None]:
print(data[data["price"]<0]["price"].count())
print(data[data["quantity"]<0]["quantity"].count())

<div class="alert alert-block alert-info" style="width: 65%;">

### We can ignore products that have no price (price = 0.0).<br>We also remove duplicate entries.
</div>

In [None]:
print(f"Rows with zero price before cleaning:\n{data[data['price'] == 0]}")
data= data[data["price"] !=0.0]

In [None]:
print(f"Row count before removing duplicates: {len(data)}")
data = data.drop_duplicates()
print(f"Row count after removing duplicates: {len(data)}")

<div class="alert alert-block alert-info" style="width: 65%;">

### Removing outliers in Unit price and Quantity
</div>

<i>To calculate the upper and lower bounds, we used the 5th percentile as Q1 and the 95th percentile as Q3.</i>
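A percentile-based outlier filter of this kind might look like the sketch below. It is not the code of `outlier_remover`; the function name `remove_outliers`, the multiplier `k=1.5`, and the toy prices are assumptions for illustration.

```python
import pandas as pd

def remove_outliers(df, col, low=0.05, high=0.95, k=1.5):
    """Keep rows whose `col` lies within [Q1 - k*IQR, Q3 + k*IQR].

    Q1 and Q3 are taken at the `low` and `high` quantiles.
    The multiplier k=1.5 is an assumed, conventional choice.
    """
    q1 = df[col].quantile(low)
    q3 = df[col].quantile(high)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[col].between(lower, upper)]

# Hypothetical prices: 1.0 .. 99.0 plus one extreme value.
prices = pd.DataFrame({"price": [float(v) for v in range(1, 100)] + [10000.0]})
filtered = remove_outliers(prices, "price")
print(len(prices), len(filtered))
```

Because the 5th and 95th percentiles are much wider than the usual 25th and 75th, this filter only removes very extreme values.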

In [None]:
data[["quantity", "price"]].describe().T

In [None]:
new_df= outlier_remover(data, "price")
new_df= outlier_remover(new_df, "quantity")

In [None]:
data= new_df
data[["quantity", "price"]].describe().T

<div class="center_header">

# Exploratory data analysis (EDA):

</div>

<div class="alert alert-block alert-info" style="width: 65%;">

### Checking stockcode and description relationships
</div>

In [None]:
check_relationship_type(data, "stockcode", "description")

In [18]:
descr= recheck_relationship_type(data, "description", "stockcode")
stockc= recheck_relationship_type(data, "stockcode", "description")

In [None]:
print(len(stockc))
for val in random.sample(stockc, 5):
    print(f"{val}: {data[data['stockcode']==val]['description'].unique()}")

In [None]:
print(len(descr))
for val in random.sample(descr, 5):
    print(f"{val}: {data[data['description']==val]['stockcode'].unique()}")

In [None]:
data.drop(columns=['stockcode'], inplace= True)

<div class="alert alert-block alert-warning" style="width: 55%;">

##### Findings:

Stock code and product description do not have a strict one-to-one relationship in the data, because of typing mistakes and the use of synonymous wording for the same product code. Since these mismatches look like data-entry errors, we can assume the underlying relationship is one-to-one and drop one of the two columns.<br>We therefore remove the column 'stockcode'.
</div>
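One way to test whether two columns are one-to-one is to count distinct partner values on each side. The sketch below uses hypothetical toy data and a hypothetical helper `keys_with_many_values`; it is not the code of `check_relationship_type` or `recheck_relationship_type`.

```python
import pandas as pd

# Hypothetical toy data with one ambiguous code and one ambiguous description.
items = pd.DataFrame({
    "stockcode": ["A1", "A1", "A2", "A2", "A3", "A4"],
    "description": [
        "RED MUG", "RED MUG",
        "BLUE BOWL", "BLUE BOWL LARGE",
        "GREEN PLATE", "GREEN PLATE",
    ],
})

def keys_with_many_values(df, key, value):
    """Return the `key` entries that map to more than one distinct `value`."""
    counts = df.groupby(key)[value].nunique()
    return counts[counts > 1].index.tolist()

ambiguous_codes = keys_with_many_values(items, "stockcode", "description")
ambiguous_descriptions = keys_with_many_values(items, "description", "stockcode")
print(ambiguous_codes, ambiguous_descriptions)
```

The relationship is one-to-one only when both lists are empty.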

<div class="alert alert-block alert-info" style="width: 65%;">

### Checking customer and country relationships (whether any customer is associated with more than one country)
</div>

In [None]:
check_relationship_type(data, "customerid", "country")

In [None]:
cust_list= recheck_relationship_type(data, "customerid", "country")

In [None]:
for val in cust_list:
    print(f"{val}: {data[data['customerid']==val]['country'].unique()}")

<div class="alert alert-block alert-warning" style="width: 55%;">

##### Findings:

Very few customers made purchases from more than one country.
</div>

<div class="alert alert-block alert-info" style="width: 65%;">

### Since we have quantity and unit price, we will calculate the total price/revenue
</div>

> $\text{revenue} = \text{quantity} \times \text{unit price}$

In [None]:
data["totalprice"]= data["quantity"]* data["price"]
data.sample(3)

<div class="alert alert-block alert-info" style="width: 65%;">

### Descriptive stats for quantitative data:
</div>


In [None]:
get_descriptive_stats(data, ['quantity', 'price', 'totalprice'] )

<div class="alert alert-block alert-info" style="width: 65%;">

### Descriptive stats for qualitative data:
</div>


In [None]:
get_descriptive_stats(data, ['description', 'customerid', 'country'] )

<div class="center_header">

# RFM analysis:

</div>


<div class="alert alert-block alert-success">

`Here:`

- $T$ --> Interpurchase Time
- $L$ --> Shopping Cycle
- $F$ --> Frequency
- $T_1$ --> First purchase
- $T_n$ --> Last purchase
- $T = \dfrac{L}{F-1} = \dfrac{T_n - T_1}{F-1}$
</div>
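The interpurchase time formula above can be computed per customer as in the following sketch. It uses hypothetical toy data and assumed column names, and is not the code of `get_rmf_data_set`. Customers with a single purchase (F = 1) have no defined interpurchase time, so the sketch leaves them as NaN; how the notebook's helper handles that case is not shown here.

```python
import pandas as pd

# Hypothetical toy purchases; column names are illustrative assumptions.
purchases = pd.DataFrame({
    "customerid": ["A", "A", "A", "B"],
    "invoice": ["1", "2", "3", "4"],
    "invoicedate": pd.to_datetime(
        ["2011-01-01", "2011-01-11", "2011-01-31", "2011-02-01"]
    ),
})

per_customer = purchases.groupby("customerid").agg(
    first_purchase=("invoicedate", "min"),  # T1
    last_purchase=("invoicedate", "max"),   # Tn
    frequency=("invoice", "nunique"),       # F
)

# Shopping cycle L = Tn - T1, in days.
shopping_cycle = (
    per_customer["last_purchase"] - per_customer["first_purchase"]
).dt.days

# T = L / (F - 1); the denominator is set to NaN when F == 1.
freq = per_customer["frequency"]
per_customer["interpurchase_time"] = shopping_cycle / (freq - 1).where(freq > 1)
print(per_customer)
```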

In [None]:
RFM= get_rmf_data_set(data)

<div class="alert alert-block alert-success" style="width: 65%;">

- Calculate the R, F, M and T scores based on quartiles
- rfm_score = R + F + M
- rfm_score: Label
     - 01 - 03: Silver
     - 03 - 05: Gold
     - 05 - 09: Platinum
     - 09 - 12: Diamond
</div>
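Quartile scoring and labelling along these lines could be sketched as follows. This is not the code of `rfm_score_calculate`. The toy values, the helper `quartile_score`, and in particular the treatment of the shared boundaries (each range is taken as right-inclusive, so a score of 3 is Silver and a score of 5 is Gold) are assumptions, since the ranges listed above overlap at their edges.

```python
import pandas as pd

# Hypothetical RFM table for eight customers.
rfm = pd.DataFrame(
    {
        "recency": [5, 40, 90, 200, 10, 150, 30, 365],
        "frequency": [12, 5, 2, 1, 9, 1, 6, 1],
        "monetary": [900.0, 300.0, 80.0, 20.0, 650.0, 40.0, 410.0, 15.0],
    },
    index=[f"cust_{i}" for i in range(8)],
)

def quartile_score(series, higher_is_better=True):
    """Score 1 (worst) to 4 (best) by quartile of the ranked values."""
    # rank(method="first") breaks ties so qcut always gets distinct bin edges.
    ranks = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, 4, labels=[1, 2, 3, 4]).astype(int)

rfm["R"] = quartile_score(rfm["recency"], higher_is_better=False)
rfm["F"] = quartile_score(rfm["frequency"])
rfm["M"] = quartile_score(rfm["monetary"])
rfm["rfm_score"] = rfm["R"] + rfm["F"] + rfm["M"]

# Assumed right-inclusive bins: (0, 3], (3, 5], (5, 9], (9, 12].
rfm["label"] = pd.cut(
    rfm["rfm_score"],
    bins=[0, 3, 5, 9, 12],
    labels=["Silver", "Gold", "Platinum", "Diamond"],
)
print(rfm[["R", "F", "M", "rfm_score", "label"]])
```

Recency is scored in reverse because a smaller number of days since the last purchase is better.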

In [None]:
RFM_with_score= rfm_score_calculate(RFM)

In [None]:
barplot = dict(RFM_with_score['label'].value_counts())
bar_names = list(barplot.keys())
bar_values = list(barplot.values())
plt.bar(bar_names,bar_values)
print(pd.DataFrame(barplot, index=[' ']))

<div class="center_header">

# Choropleth Map:

</div>

Find the dataset with the map coordinates __[here](https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip)__

In [None]:
merged_df= choropleth_map_plot(data, map_source_url)