# Product Visibility and Marketing:

| Serial No.| Description |
| --------- | ----------- |
|1| Data Exploration.|
|2| Data Wrangling.|
|3| Data Analysis.| 
|4| Product Analytics:|
|4.1| Time Series Trends.|
|4.2| Trends In Revenue.|
|4.3| Trends of Repeating Customers.|
|4.4| Revenue From Repeated Customers.|
|4.5| Popular Items over Time.|
|4.6| Top 5 Items. |
|4.7| Trending Items.|
|5| **Product Recommendation.**|
|5.1| Customer Item Matrix.|
|5.2| Collaborative Filtering.|
|5.2.1| User-Based Collaborative Filtering.|
|5.2.1A| User-to-User Similarity Matrix.|
|5.2.1B| User-Based Recommendations.|
|5.2.2| Item-Based Collaborative Filtering.|
|5.2.2A| Item-to-Item Similarity Matrix.|
|5.2.2B| Item-Based Recommendation.|
|6| **Customer Life Time Value.**|
|6.1| Predicting 3-Month Customer Life-Time Value.|
|6.2| Regression Model.|
|7| Customer Segmentation.|
|7.1| Segmentation Using K-Means|

## 1: Data Exploration:

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib as mpl

mpl.rcParams['axes.grid'] = False

In [None]:
df = pd.read_excel('/content/Online.xlsx',
                   sheet_name=2)

In [None]:
data = df.copy()
print(f'Dimension of the Dataframe: {data.shape}')
for col in data.columns:
    print(f'Column: {col:15} | \
    Type: {str(data[col].dtype):15} | \
    Missing Values: {data[col].isnull().sum()}')

In [None]:
data.head()

## 2: Data Wrangling:

In [None]:
data.columns = map(str.lower, data.columns)
data.columns

In [None]:
data.rename(columns={
    'invoiceno':'invoice_no',
    'stockcode':'stock_code',
    'invoicedate':'invoice_date',
    'unitprice':'unit_price',
    'customerid': 'customer_id',
}, inplace=True)
data.head(3)

#### Negative values in Quantity Feature:

In [None]:
data.loc[data['quantity'] <= 0].shape

**10624** values in the **Quantity column** are zero or negative.

In [None]:
data = data.loc[data['quantity'] > 0]
data.shape

#### Missing values in Customer ID Feature:

In [None]:
pd.isnull(data['customer_id']).sum()

In [None]:
data = data[pd.notnull(data['customer_id'])]
data.shape

#### Total Sales:

In [None]:
data['sales'] = data['quantity'] * data['unit_price']
data.head(3)

#### Per Order Data:

In [None]:
orders_data = data.groupby(['customer_id',
                            'invoice_no']).agg({
                                'sales': 'sum',
                                'invoice_date': 'max'
                            })
orders_data

## 3: Data Analysis:

In [None]:
def groupby_mean(x):
    return x.mean()

def groupby_count(x):
    return x.count()

def purchase_duration(x):
    return (x.max() - x.min()).days

def avg_frequency(x):
    return (x.max() - x.min()).days/ x.count()

groupby_mean.__name__ = 'avg'
groupby_count.__name__ = 'count'
purchase_duration.__name__ = 'purchase_duration'
avg_frequency.__name__ = 'purchase_frequency'

In [None]:
summary_data = orders_data.reset_index().groupby('customer_id').agg({
    'sales': ['min', 'max', groupby_mean, groupby_count],
    'invoice_date': ['min', 'max', purchase_duration, avg_frequency]
})
summary_data


In [None]:
summary_data.columns = ['_'.join(col).lower() for \
                        col in summary_data.columns]
summary_data              

In [None]:
summary_data.shape

In [None]:
summary_data = summary_data.loc[summary_data[\
                    'invoice_date_purchase_duration']> 0]
summary_data.shape

In [None]:
plt.figure(figsize=(13,9))
ax = summary_data.groupby('sales_count')\
    .count()['sales_avg'][:20].plot(
        kind='bar', color='#bf263c'
    )
ax.set_ylabel('Count')
sns.despine(offset=20, trim=True)
plt.show()

In [None]:
summary_data['sales_count'].describe()

In [None]:
summary_data['sales_avg'].describe()

In [None]:
plt.figure(figsize=(13,7))
ax = summary_data['invoice_date_purchase_frequency']\
    .hist(bins=20, color='#bf263c', rwidth=0.8, 
          grid=False)
ax.set_xlabel('Avg no. of Days between Purchases')
ax.set_ylabel('Count')
sns.despine(offset=20, trim=True)
plt.show()

In [None]:
summary_data['invoice_date_purchase_frequency'].describe()

In [None]:
summary_data['invoice_date_purchase_duration'].describe()

## 4: Product Analytics:

**Product analytics** is a way to draw insights from data on how customers engage and interact with the products offered, how different products perform, and what the observable weaknesses and strengths of a business are. However, product analytics does not stop at analyzing the data. Its ultimate goal is to build actionable insights and reports that help optimize and improve product performance and generate new marketing or product ideas based on the findings.
Product analytics starts by tracking events. These events can be customer website visits, page views, browser histories, purchases, or any other actions that customers can take with the products that you offer. Then, you can start analyzing and visualizing any observable patterns in these events with the goal of creating actionable insights or reports.



### Time Series Trends:

In [None]:
data.columns

In [None]:
monthly_orders_data = data.set_index('invoice_date')['invoice_no'].resample('M').nunique()
monthly_orders_data

In [None]:
ax = pd.DataFrame(monthly_orders_data.values)\
.plot(color='#bf263c',figsize=(13,7), legend=False)
ax.set_xlabel('Date')
ax.set_ylabel('Number of Orders/Invoices')
ax.set_title('Total Number of Orders Over Time')
plt.xticks(
    range(len(monthly_orders_data.index)),
    [x.strftime('%m.%Y') 
    for x in monthly_orders_data.index],
    rotation=45
)
sns.despine(offset=10, trim=True)
plt.show()

One thing that is noticeable from this chart is that there is a sudden radical drop in the number of orders in December 2011. If you look closely at the data, this is simply because we do not have the data for the full month of December 2011. 

In [None]:
print('Date Range: %s ~ %s' % (data['invoice_date'].min(),
                    data['invoice_date'].max()))


In [None]:
data.loc[data['invoice_date'] >= '2011-12-01'].shape

In [None]:
data.shape

In [None]:
data = data.loc[data['invoice_date'] < '2011-12-01']
data.shape

In [None]:
monthly_orders_data = data.set_index('invoice_date')['invoice_no'].resample('M').nunique()
monthly_orders_data

In [None]:
ax = pd.DataFrame(monthly_orders_data.values)\
.plot(color='#bf263c',figsize=(13,7), legend=False)
ax.set_xlabel('Date')
ax.set_ylabel('Number of Orders/Invoices')
ax.set_title('Total Number of Orders Over Time')
plt.xticks(
    range(len(monthly_orders_data.index)),
    [x.strftime('%m.%Y') 
    for x in monthly_orders_data.index],
    rotation=45
)
sns.despine(offset=10, trim=True)
plt.show()

The monthly number of orders seems to float around 1,500 from December 2010 to August 2011, and then increases significantly from September 2011, and almost doubles by November 2011. One explanation for this could be that the business is actually growing significantly from September 2011. Another explanation could be seasonal effects. In e-commerce businesses, it is not rare to see spikes in sales as it approaches the end of the year. Typically, sales rise significantly from October to January for many e-commerce businesses, and without the data from the previous year, it is difficult to conclude whether this spike in sales is due to a growth in business or due to seasonal effects. When you are analyzing your data, we advise you to compare the current year's data against the previous year's data.
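The year-over-year comparison suggested above can be sketched as follows. The monthly order counts here are made up for illustration (the dataset in this notebook covers only one year), and `monthly_orders` and `yoy_growth` are hypothetical names.

```python
import pandas as pd

# Hypothetical monthly order counts for two full years (made-up numbers).
index = pd.period_range('2010-01', '2011-12', freq='M')
year_1 = [1000, 1000, 1100, 1100, 1200, 1200, 1200, 1300, 1500, 1800, 2400, 2000]
year_2 = [1100, 1100, 1210, 1210, 1320, 1320, 1320, 1430, 1650, 1980, 2640, 2200]
monthly_orders = pd.Series(year_1 + year_2, index=index)

# Year-over-year growth: compare each month with the same month one year earlier.
yoy_growth = monthly_orders.pct_change(periods=12) * 100

# Only the second year has a prior-year value to compare against.
yoy_2011 = yoy_growth.dropna().round(1)
print(yoy_2011)
```

In this toy series every month grows by the same percentage over the previous year, so the late-year spike would point to seasonality rather than a sudden acceleration of the business.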

### Time Series Revenue

In [None]:
monthly_revenue_data = data.set_index(
    'invoice_date')['sales'].resample('M').sum()

In [None]:
monthly_revenue_data

In [None]:
ax = pd.DataFrame(monthly_revenue_data.values).plot(
    figsize=(13,7), color='#bf263c'
)
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
ax.set_title('Total Revenue Over Time')
ax.set_ylim([0, 
    max(monthly_revenue_data.values + 100000)],)
plt.xticks(
    range(len(monthly_revenue_data.index)),
    [x.strftime('%m.%Y') 
    for x in monthly_revenue_data.index]
)

sns.despine(offset=20,trim=True)
plt.show()

We see a similar pattern to the previous monthly Total Number of Orders Over Time chart in this monthly revenue chart. The monthly revenue floats around 700,000 from December 2010 to August 2011 and then it increases significantly from September 2011. As discussed before, to verify whether this significant increase in sales and revenue is due to a growth in business or due to seasonal effects, we need to look further back in the sales history and compare the current year's sales against the previous year's sales.

### Time Series Analysis of Repeat Customers:

An important factor in a successful business is how well it retains customers and how many repeat purchases and repeat customers it has. In this section, we are going to analyze the number of monthly repeat purchases and how much of the monthly revenue is attributable to these repeat purchases and customers. A typical strong and stable business has a steady stream of sales from existing customers. Let's see how much of the sales come from repeat and existing customers of the online retail business.

In [None]:
data.head(3)

In [None]:
invoice_customer_data = data.groupby(
    by=['invoice_no', 'invoice_date']
).agg({
    'sales': 'sum',
    'customer_id': 'max',
    'country': 'max'
}).reset_index()

In [None]:
invoice_customer_data.head(3)

In [None]:
monthly_repeat_customers_data = invoice_customer_data\
.set_index('invoice_date').groupby([
          pd.Grouper(freq='M'),'customer_id'                         
]).filter(lambda x: len(x)> 1)\
.resample('M').nunique()['customer_id']


In [None]:
monthly_repeat_customers_data

In [None]:
monthly_unique_customer_data = data.set_index(
    'invoice_date')['customer_id'].resample('M').nunique()
monthly_unique_customer_data

In [None]:
monthly_repeat_percentage = (monthly_repeat_customers_data\
/ monthly_unique_customer_data) * 100
monthly_repeat_percentage

In [None]:
ax = pd.DataFrame(monthly_repeat_customers_data.values).plot(
    figsize=(13,7)
)
pd.DataFrame(monthly_unique_customer_data.values).plot(
    ax=ax, grid=False, color='#bf263c'
)

ax2 = pd.DataFrame(monthly_repeat_percentage.values).plot.bar(
    ax=ax, 
    grid=False,
    secondary_y=True,
    color='#323133',
    alpha=0.4
)
ax.set_xlabel('Date')
ax.set_ylabel('Number of Customers')
ax.set_title('Number of All vs Repeat Customers Over Time')
ax.legend(['Repeat Customers','All Customers'])

ax2.set_ylabel('Percentage (%)')
ax2.legend(['Percentage of Repeat'], loc='upper right')

ax.set_ylim([0, monthly_unique_customer_data.values.max()+100])
ax2.set_ylim([0,100])

plt.xticks(
    range(len(monthly_repeat_customers_data.index)),
    [x.strftime('%m.%Y') 
    for x in monthly_unique_customer_data.index],
    rotation=45
)
plt.show()

As you can see from this chart, the numbers of both repeat and all customers start to rise significantly from September 2011. The percentage of Repeat Customers seems to stay pretty consistent at about 20 to 30%. This online retail business will benefit from this steady stream of Repeat Customers, as they will help the business to generate a stable stream of sales. 
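To make the definition used above concrete, here is a minimal sketch on a made-up invoice table. As in the cells above, a repeat customer is assumed to be one with more than one invoice within the same calendar month; the `invoices` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical invoice-level data: one row per invoice (made-up values).
invoices = pd.DataFrame({
    'invoice_no': [1, 2, 3, 4, 5, 6],
    'customer_id': [10, 10, 11, 12, 12, 13],
    'invoice_date': pd.to_datetime([
        '2011-01-05', '2011-01-20', '2011-01-07',
        '2011-02-03', '2011-02-25', '2011-02-10',
    ]),
})

month = invoices['invoice_date'].dt.to_period('M')

# Number of distinct invoices per (month, customer).
invoices_per_customer = invoices.groupby(
    [month, 'customer_id'])['invoice_no'].nunique()

# All customers and repeat customers (more than one invoice) per month.
all_customers = invoices_per_customer.groupby(level=0).size()
repeat_customers = (invoices_per_customer > 1).groupby(level=0).sum()

repeat_percentage = repeat_customers / all_customers * 100
print(repeat_percentage)
```

Each month in this toy data has two customers, one of whom ordered twice, so the repeat percentage is 50% in both months.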

### Revenue From Repeat Customers:

In [None]:
monthly_rev_repeat_customers = invoice_customer_data\
.set_index('invoice_date').groupby([
    pd.Grouper(freq='M'), 'customer_id'
]).filter(lambda x: len(x) > 1)\
.resample('M').sum()['sales']

In [None]:
monthly_rev_perc_repeat_customers = (monthly_rev_repeat_customers\
/ monthly_revenue_data) * 100

In [None]:
monthly_rev_perc_repeat_customers

In [None]:
ax = pd.DataFrame(monthly_revenue_data.values).plot(
    figsize=(13,7)
)
pd.DataFrame(monthly_rev_repeat_customers.values).plot(
    ax=ax,
    grid=False
)
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
ax.set_title('Total Revenue vs Revenue from Repeat Customer')
ax.legend(['Total Revenue', 'Repeat Customer Revenue'])
ax.set_ylim([0, max(monthly_revenue_data.values)+100000])
ax2 = ax.twinx()
pd.DataFrame(monthly_rev_perc_repeat_customers.values).plot(
    ax=ax2,
    kind='bar',
    color='#323133',
    alpha=0.4
)
ax2.set_ylim([0, max(monthly_rev_perc_repeat_customers.values)+30])
ax2.set_ylabel('Percentage (%)')
ax2.legend(['Repeat Revenue Percentage'])
ax2.set_xticklabels([
    x.strftime('%m.%Y') 
    for x in monthly_rev_perc_repeat_customers.index
])
plt.show()

We see a similar pattern as before, where there is a significant increase in the revenue from September 2011. One interesting thing to notice here is the percentage of the monthly revenue from repeat customers. We have seen that roughly 20-30% of the customers who made purchases are repeat customers. However, in this graph, we can see that roughly 40-50% of the Total Revenue is from repeat customers. In other words, roughly half of the revenue was driven by the 20-30% of the customer base who are repeat customers. This shows how important it is to retain existing customers.
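The gap between the share of customers and the share of revenue can be illustrated with a small, made-up example. The `invoices` frame and its numbers are hypothetical and are chosen only to show how the two percentages are computed side by side.

```python
import pandas as pd

# Hypothetical invoices for a single month (made-up values).
invoices = pd.DataFrame({
    'month': ['2011-01'] * 4,
    'customer_id': [10, 10, 11, 12],
    'sales': [300.0, 200.0, 250.0, 250.0],
})

# Flag invoices from customers with more than one invoice in the month.
invoices_in_month = invoices.groupby(
    ['month', 'customer_id'])['sales'].transform('count')
invoices['is_repeat'] = invoices_in_month > 1
repeat = invoices.loc[invoices['is_repeat']]

# Share of customers who are repeat customers.
repeat_customer_share = (
    repeat.groupby('month')['customer_id'].nunique()
    / invoices.groupby('month')['customer_id'].nunique() * 100
)
# Share of revenue that comes from those repeat customers.
repeat_revenue_share = (
    repeat.groupby('month')['sales'].sum()
    / invoices.groupby('month')['sales'].sum() * 100
)
print(repeat_customer_share.round(1), repeat_revenue_share)
```

Here one customer out of three is a repeat customer, yet that customer accounts for half of the month's revenue.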

### Trending Items Over Time:

In [None]:
date_item_data = pd.DataFrame(
    data.set_index('invoice_date').groupby([
            pd.Grouper(freq='M'), 'stock_code'
    ])['quantity'].sum()
)
date_item_data

In [None]:
last_month_sorted_data = date_item_data\
.loc['2011-11-30'].sort_values(
    by='quantity',
    ascending=False
).reset_index()
last_month_sorted_data

### Top 5 Items:

In [None]:
date_item_data = pd.DataFrame(
    data.loc[
        data['stock_code'].isin(
            [23084,84826,22197,22086,'85099B'])
    ].set_index('invoice_date').groupby([
        pd.Grouper(freq='M'), 'stock_code'
    ])['quantity'].sum()
)
date_item_data

### Trending Items:

In [None]:
trending_items_data = date_item_data.reset_index()\
.pivot(index='invoice_date', columns='stock_code').fillna(0)

trending_items_data = trending_items_data.reset_index()
trending_items_data = trending_items_data.set_index('invoice_date')
trending_items_data.columns = trending_items_data.columns.droplevel(0)
trending_items_data



In [None]:
ax = pd.DataFrame(
    trending_items_data.values
).plot(
    figsize=(13,7), grid=False
)
ax.set_ylabel('Number of Purchases')
ax.set_xlabel('Date')
ax.set_title('Items Trends Over Time')
ax.legend(trending_items_data.columns,
          loc='upper left')
plt.xticks(
    range(len(trending_items_data.index)),
    [x.strftime('%m.%Y') for
     x in trending_items_data.index]
)
sns.despine(offset=20, trim=True)
plt.show()

Let's take a closer look at this time series plot. The sales of these five products spiked in November 2011. This is especially true of the product with stock code 85099B, whose sales were close to 0 from February 2011 to October 2011 and then suddenly spiked in November 2011. It might be worth taking a closer look into what might have driven this spike. It could be an item that is highly sensitive to seasonality, such that it becomes very popular during November, or it could be due to a genuine change in trends that made this item suddenly more popular than before.

The popularity of the rest of the top five products, 22086, 22197, 23084, and 84826, seems to have built up in the few months prior to November 2011. As a marketer, it would be worthwhile taking a closer look at the potential drivers behind this buildup of rising popularity for these items. You could look at whether these items are typically more popular in colder seasons or whether there is a rising trend for these specific items in the market.

# 5: Product Recommendation:

### Customer Item Matrix:

In [None]:
customer_item_matrix = data.pivot_table(
    index= 'customer_id',
    columns= 'stock_code',
    values= 'quantity',
    aggfunc= 'sum'
)
customer_item_matrix.loc[12481:].head()

In [None]:
customer_item_matrix.shape

In [None]:
data['stock_code'].nunique()

In [None]:
customer_item_matrix.loc[12348.0].sum()

In [None]:
customer_item_matrix = customer_item_matrix\
.applymap(lambda x: 1 if x > 0 else 0)
customer_item_matrix.loc[12481:].head()

## Collaborative Filtering

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

### User-Based Collaborative Filtering

### User-to-User Similarity Matrix

In [None]:
user_user_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix)
)
user_user_sim_matrix.head()

In [None]:
user_user_sim_matrix.columns = customer_item_matrix.index

user_user_sim_matrix['customer_id'] = customer_item_matrix.index
user_user_sim_matrix = user_user_sim_matrix.set_index('customer_id')
user_user_sim_matrix.head()

Let's take a closer look at this user-to-user similarity matrix. As you can imagine, the cosine similarity between a customer and themselves is 1, and this is what we can observe from this similarity matrix: the diagonal elements all have values of 1. The remaining elements represent the pairwise cosine similarity between two customers. For example, the cosine similarity between customers 12347 and 12348 is 0.063022, while the cosine similarity between customers 12347 and 12349 is 0.046130. This suggests that, based on the products they purchased, customer 12348 is more similar to customer 12347 than customer 12349 is. This way, we can easily tell which customers are similar to others, and which customers have bought similar items to others.
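As a worked illustration of the measure itself, the sketch below computes cosine similarity by hand for three made-up binary purchase vectors. The helper `cosine_sim` and the vectors are hypothetical; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical purchase vectors over four items (1 = bought, 0 = not bought).
customer_a = [1, 1, 0, 0]
customer_b = [1, 1, 1, 0]
customer_c = [0, 0, 0, 1]

sim_aa = cosine_sim(customer_a, customer_a)  # a customer compared with themselves
sim_ab = cosine_sim(customer_a, customer_b)  # two shared items
sim_ac = cosine_sim(customer_a, customer_c)  # no shared items
print(sim_aa, round(sim_ab, 4), sim_ac)
```

Customers who share more purchased items get a value closer to 1, and customers with no items in common get 0.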
These pairwise cosine similarity measures are what we are going to use for product recommendations. Let's work by picking one customer as an example. We will first rank the most similar customers to the customer with ID 12350, using the following code:

### Making Recommendation:

In [None]:
user_user_sim_matrix.loc[12350.0].sort_values(ascending=False)

These are the top 10 customers that are the most similar to customer 12350. Let's pick customer 17935 and discuss how we can recommend products using these results. The strategy is as follows. First, we need to identify the items that the customers 12350 and 17935 have already bought. Then, we are going to find the products that the target customer 17935 has not purchased, but customer 12350 has. Since these two customers have bought similar items in the past, we are going to assume that the target customer 17935 has a high chance of purchasing the items that he or she has not bought, but customer 12350 has bought. Lastly, we are going to use this list of items and recommend them to the target customer 17935.
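The strategy just described boils down to a set difference. A minimal sketch on a made-up two-customer matrix is shown here; `toy_matrix` and `recommend_from_neighbor` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical binary customer-item matrix (1 = bought, 0 = not bought).
toy_matrix = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 0, 0]],
    index=['customer_a', 'customer_b'],
    columns=['jar', 'mould', 'tin', 'lamp'],
)

def recommend_from_neighbor(matrix, neighbor, target):
    """Items the similar customer (neighbor) bought that the target has not."""
    neighbor_row = matrix.loc[neighbor].to_numpy() > 0
    target_row = matrix.loc[target].to_numpy() > 0
    bought_by_neighbor = set(matrix.columns[neighbor_row])
    bought_by_target = set(matrix.columns[target_row])
    return bought_by_neighbor - bought_by_target

recommended = recommend_from_neighbor(toy_matrix, 'customer_a', 'customer_b')
print(recommended)
```

The target customer already owns the jar and the mould, so only the tin is left to recommend.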

In [None]:
items_bought_by_A = set(customer_item_matrix.loc[
        12350.0].iloc[
            customer_item_matrix.loc[
                12350.0].to_numpy().nonzero()].index)
items_bought_by_A

In [None]:
items_bought_by_B = set(customer_item_matrix.loc[
        17935.0].iloc[
            customer_item_matrix.loc[
                17935.0].to_numpy().nonzero()].index)
items_bought_by_B

In [None]:
items_recommend_to_B = items_bought_by_A - items_bought_by_B
items_recommend_to_B

In [None]:
data.loc[
    data['stock_code'].isin(items_recommend_to_B),
    ['stock_code','description']
].drop_duplicates().set_index('stock_code')

Using user-based collaborative filtering, we can make targeted product recommendations for individual customers. We can include the products that each target customer is likely to purchase in tailored marketing messages, which can potentially drive more conversions.
However, user-based collaborative filtering has one main disadvantage. As we have seen, recommendations are based on the individual customer's purchase history, so for new customers there is not going to be enough data to compare them against the others.

## Item-Based Collaborative Filtering:

### Item-to-Item Similarity Matrix:

In [None]:
item_item_sim_matrix = pd.DataFrame(
    cosine_similarity(
        customer_item_matrix.T
    )
)

In [None]:
item_item_sim_matrix.columns = customer_item_matrix.T.index

item_item_sim_matrix['stock_code'] = customer_item_matrix.T.index
item_item_sim_matrix = item_item_sim_matrix.set_index('stock_code')
item_item_sim_matrix

As before, the diagonal elements have values of 1. This is because the similarity between an item and itself is 1, meaning the two are identical. The rest of the elements contain the similarity measure values between items based on the cosine similarity calculation. For example, looking at the preceding item-to-item similarity matrix, the cosine similarity between the item with Stock Code 10002 and the item with Stock Code 10120 is 0.094868. On the other hand, the cosine similarity between the item 10002 and the item 10125 is 0.090351. This suggests that the item with Stock Code 10120 is more similar to that with Stock Code 10002, than the item with Stock Code 10125 is to that with Stock Code 10002.


### Making Recommendation:

In [None]:
top_10_similar_items = list(
    item_item_sim_matrix.loc[
        23166
    ].sort_values(
        ascending=False).iloc[:10].index
)
top_10_similar_items

In [None]:
data.loc[
    data['stock_code'].isin(
        top_10_similar_items
    ),['stock_code','description']
].drop_duplicates().set_index(
    'stock_code').loc[
        top_10_similar_items]

The first item here is the item that the target customer just bought and the other nine items are the items that are frequently bought by others who have bought the first item. As you can see, those who have bought ceramic top storage jars often buy jelly moulds, spice tins, and cake tins. With this data, you can include these items in your marketing messages for this target customer as further product recommendations. Personalizing the marketing messages with targeted product recommendations typically yields higher conversion rates from customers. Using an item-based collaborative filtering algorithm, we can now easily do product recommendations for both new and existing customers.
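The lookup above can be wrapped in a small helper. The sketch below is one possible version on a made-up matrix: it computes the item-to-item cosine similarity with NumPy and, unlike the cell above, excludes the queried item from the result. The names `toy_matrix` and `top_similar_items` are hypothetical, and the code assumes every item has at least one buyer so that no norm is zero.

```python
import numpy as np
import pandas as pd

# Hypothetical binary customer-item matrix (rows: customers, columns: items).
toy_matrix = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]],
    index=['c1', 'c2', 'c3', 'c4'],
    columns=['jar', 'mould', 'tin', 'lamp'],
)

def top_similar_items(matrix, item, n=2):
    """Return the n items most similar to `item`, excluding the item itself."""
    items = matrix.to_numpy(dtype=float).T            # one row per item
    norms = np.linalg.norm(items, axis=1)
    sims = items @ items.T / np.outer(norms, norms)   # item-to-item cosine similarity
    sim_df = pd.DataFrame(sims, index=matrix.columns, columns=matrix.columns)
    return sim_df[item].drop(item).sort_values(ascending=False).head(n)

recommendations = top_similar_items(toy_matrix, 'jar')
print(recommendations)
```

Because the helper needs only the item a customer has just bought, it works for a brand-new customer with a single purchase.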

# 6: Customer Life Time Value (CLV):

In marketing, the CLV is one of the key metrics to have and monitor. The CLV measures customers' total worth to the business over the course of their lifetime relationship with the company. This metric is especially important to keep track of for acquiring new customers. It is generally more expensive to acquire new customers than to keep existing customers, so knowing the lifetime value and the costs associated with acquiring new customers is essential in order to build marketing strategies with a positive ROI.
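One common way to relate the two quantities is to treat the CLV as the return on the acquisition cost. The sketch below assumes that simple definition, ROI = (CLV - acquisition cost) / acquisition cost, and uses made-up figures; `acquisition_roi` is a hypothetical helper, not something defined in this notebook.

```python
def acquisition_roi(clv, acquisition_cost):
    """ROI of acquiring a customer, assuming ROI = (CLV - cost) / cost."""
    return (clv - acquisition_cost) / acquisition_cost

# Hypothetical figures: a customer worth 300 who cost 100 to acquire.
roi = acquisition_roi(clv=300.0, acquisition_cost=100.0)
print(roi)

# A customer whose CLV is below the acquisition cost gives a negative ROI.
negative_roi = acquisition_roi(clv=80.0, acquisition_cost=100.0)
print(negative_roi)
```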

## 6.1: Predicting 3 Month Customer Life-Time Value:

In [None]:
clv_freq = '3M'
clv_data = orders_data.reset_index().groupby([
    'customer_id', pd.Grouper(key='invoice_date', 
            freq=clv_freq)]).agg({
    'sales': ['sum', groupby_mean, groupby_count]
            })        

In [None]:
clv_data.columns = ['_'.join(col).lower() for \
            col in clv_data]
clv_data = clv_data.reset_index()
clv_data.head(10)

In [None]:
date_month_map = {
    str(x)[:10]: 'M_%s' % (i+1) for i, x in enumerate(
        sorted(clv_data.reset_index()['invoice_date'].unique(), reverse=True))
}


In [None]:
clv_data['M'] = clv_data['invoice_date']\
.apply(lambda x: date_month_map[str(x)[:10]])

date_month_map

In [None]:
clv_data.head(10)

### Building Sample Set:

In [None]:
features_data = pd.pivot_table(
    clv_data.loc[clv_data['M'] != 'M_1'],
    values= ['sales_sum','sales_avg','sales_count'],
    columns= 'M',
    index= 'customer_id'
)

In [None]:
features_data.columns = ['_'.join(col) \
                         for col in features_data.columns]
features_data.shape

In [None]:
features_data.head(10)

In [None]:
response_data = clv_data.loc[
            clv_data['M'] == 'M_1',
            ['customer_id','sales_sum']
]

response_data.columns = ['customer_id', 'CLV_'+ clv_freq]
response_data.shape

In [None]:
response_data.head()

In [None]:
sample_set_data = features_data.merge(
    response_data, left_index=True,
    right_on='customer_id', how='left'
)

sample_set_data.shape

In [None]:
sample_set_data.fillna(0, inplace=True)
sample_set_data.head(10)

In [None]:
sample_set_data['CLV_'+ clv_freq].describe()

## 6.2: Regression Models:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
target = 'CLV_'+clv_freq
features = [x for x in sample_set_data.columns\
            if x not in ['customer_id', target]]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    sample_set_data[features],
    sample_set_data[target],
    test_size=0.3
)

In [None]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)

In [None]:
coefs = pd.DataFrame(list(zip(features, linreg.coef_)))
coefs.columns = ['features','coefs']
coefs

### Evaluation:

In [None]:
from sklearn.metrics import r2_score, median_absolute_error

In [None]:
training_predictions = linreg.predict(X_train)
test_predictions = linreg.predict(X_test)

### R-Squared

In [None]:
print(f'In-Sample R-Squared: {r2_score(y_train, training_predictions):.4f}')
print(f'Out-of-Sample R-Squared: {r2_score(y_test, test_predictions):.4f}')

### Median Absolute Error:

In [None]:
print(f'In-Sample Median Absolute Error: {median_absolute_error(y_train, training_predictions):.4f}')
print(f'Out-of-Sample Median Absolute Error: {median_absolute_error(y_test, test_predictions):.4f}')

### Scatter Plot:

In [None]:
plt.figure(figsize=(13,7))
plt.scatter(y_train,training_predictions, color='#bf263c', s=70)
plt.plot([0, max(y_train)],
         [0, max(training_predictions)],
         color='#323133', lw=2,linestyle='dashed')
plt.xlabel('actual')
plt.ylabel('predicted')
plt.title('In-Sample Actual vs Predicted')
plt.grid()
sns.despine(offset=20, trim=True)
plt.show()

In [None]:
plt.figure(figsize=(13,7))
plt.scatter(y_test,test_predictions, color='#bf263c', s=70)
plt.plot([0, max(y_test)],
         [0, max(test_predictions)],
         color='#323133', lw=2,linestyle='dashed')
plt.xlabel('actual')
plt.ylabel('predicted')
plt.title('Out-of-Sample Actual vs Predicted')
plt.grid()
sns.despine(offset=20, trim=True)
plt.show()

# 7: Customer Segmentation:

#### Per Customer Data:

In [None]:
data.head()

In [None]:
customer_data = data.groupby('customer_id').agg({
    'sales': 'sum',
    'invoice_no': lambda x: x.nunique()
})
customer_data.columns = ['total_sales', 'order_count']
customer_data['avg_order_value'] = customer_data['total_sales']\
/ customer_data['order_count']

In [None]:
customer_data.head()

In [None]:
customer_data.describe()

In [None]:
rank_data = customer_data.rank(method='first')
rank_data.head()

In [None]:
normalize_data = (rank_data - rank_data.mean()) / rank_data.std()
normalize_data.head(10)

In [None]:
normalize_data.describe()

## 7.1: Segmentation Using K-Means Clustering:

In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=4)\
.fit(normalize_data[['total_sales',
     'order_count','avg_order_value']]\
         .copy(deep=True))


In [None]:
kmeans.labels_

In [None]:
kmeans.cluster_centers_

In [None]:
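# Work on a copy so that adding the label column does not
# trigger pandas' SettingWithCopyWarning
cluster_data = normalize_data[['total_sales',
          'order_count', 'avg_order_value']].copy()
cluster_data['cluster'] = kmeans.labels_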

In [None]:
cluster_data.head()

In [None]:
cluster_data.groupby('cluster').count()['total_sales']

In [None]:
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 0]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 0]['total_sales'],
            c='#bf263c', s=80
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 1]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 1]['total_sales'],
            c='dodgerblue'
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 2]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 2]['total_sales'],
            c='green'
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 3]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 3]['total_sales'],
            c='orange'
)
sns.despine(offset=20, trim=True)
plt.title('Total Sales vs. Order Count Clusters')
plt.xlabel('Order Count')
plt.ylabel('Total Sales')
plt.grid()
plt.show()

Let's take a closer look at this plot. The cluster in blue is the group of low-value customers, who have not purchased our products so much. On the other hand, the cluster in red is the group of high-value customers, who have purchased the greatest amount and who have purchased products frequently. We can also visualize the clusters with different angles, using the rest of the variables. 

In [None]:
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 0]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 0]['avg_order_value'],
            c='#bf263c', s=80
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 1]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 1]['avg_order_value'],
            c='dodgerblue'
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 2]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 2]['avg_order_value'],
            c='green'
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 3]['order_count'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 3]['avg_order_value'],
            c='orange'
)
sns.despine(offset=20, trim=True)
plt.title('Average Order value vs. Order Count Clusters')
plt.xlabel('Order Count')
plt.ylabel('Average Order Value')
plt.grid()
plt.show()

In [None]:
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 0]['total_sales'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 0]['avg_order_value'],
            c='#bf263c'
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 1]['total_sales'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 1]['avg_order_value'],
            c='dodgerblue'
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 2]['total_sales'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 2]['avg_order_value'],
            c='green'
)
plt.scatter(
    cluster_data.loc[cluster_data\
            ['cluster'] == 3]['total_sales'],
    cluster_data.loc[cluster_data\
            ['cluster'] == 3]['avg_order_value'],
            c='orange'
)
sns.despine(offset=20, trim=True)
plt.title('Average Order value vs. Total Sales Clusters')
plt.xlabel('Total Sales')
plt.ylabel('Average Order Value')
plt.grid()
plt.show()

The first of these two plots shows the clusters based on `avg_order_value` and `order_count`, while the second shows them based on `avg_order_value` and `total_sales`. As you can see from these plots, the cluster in blue has the lowest average per-order value and the lowest number of orders, whereas the cluster in red has the highest average per-order value and the greatest number of orders.
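Because K-Means assigns cluster numbers arbitrarily, it can help to identify the high-value and low-value clusters from their average metrics rather than from a colour or a fixed label. The following is a minimal sketch on made-up data; the `customers` frame and its cluster labels are hypothetical.

```python
import pandas as pd

# Hypothetical per-customer metrics with already-assigned cluster labels.
customers = pd.DataFrame({
    'total_sales': [100.0, 120.0, 5000.0, 5200.0, 900.0],
    'order_count': [1, 1, 20, 22, 5],
    'cluster': [1, 1, 0, 0, 2],
})
customers['avg_order_value'] = (
    customers['total_sales'] / customers['order_count']
)

# Average of each metric per cluster, in the original (unnormalized) units.
cluster_profile = customers.groupby('cluster')[
    ['total_sales', 'order_count', 'avg_order_value']
].mean()

# Pick the clusters with the highest and lowest average total sales.
high_value_label = cluster_profile['total_sales'].idxmax()
low_value_label = cluster_profile['total_sales'].idxmin()
print(cluster_profile)
print(high_value_label, low_value_label)
```

Selecting the label this way keeps the later analysis valid even if a rerun of the clustering shuffles the cluster numbers.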

#### Selecting the Best Number of Clusters:

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
for n_cluster in [4,5,6,7,8,9]:
    kmeans = KMeans(n_clusters=n_cluster).fit(
        normalize_data[['total_sales',
                'order_count', 'avg_order_value']]
    )
    silhouette_avg = silhouette_score(
        normalize_data[['total_sales',
                'order_count', 'avg_order_value']],
                kmeans.labels_
    )
    print('Silhouette Score for %i Cluster: %0.4f' %(n_cluster, silhouette_avg))


#### Interpreting Customer Segments:

In [None]:
kmeans = KMeans(n_clusters=4).fit(
    normalize_data[['total_sales',
            'order_count','avg_order_value']]
)

In [None]:
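# Work on a copy so that adding the label column does not
# trigger pandas' SettingWithCopyWarning
four_cluster_data = normalize_data[['total_sales',
            'order_count','avg_order_value']].copy()
four_cluster_data['cluster'] = kmeans.labels_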


In [None]:
four_cluster_data.head(10)

In [None]:
kmeans.cluster_centers_

In [None]:
high_value_cluster = four_cluster_data.loc[
                four_cluster_data['cluster']==2]
high_value_cluster.head()

In [None]:
customer_data.loc[high_value_cluster.index].describe()

In [None]:
# Best-selling items among the customers in the high-value cluster selected above
pd.DataFrame(data.loc[data['customer_id'].isin(
    high_value_cluster.index
)].groupby('description').count()['stock_code'].sort_values(
    ascending=False
).head())

For this high-value segment, the best-selling item was JUMBO BAG RED RETROSPOT and the second best-selling item was REGENCY CAKESTAND 3 TIER. You can utilize this information in marketing strategies, when you target this customer segment. In your marketing campaigns, you can recommend items similar to these best-selling items to this segment of customers, as they are the most interested in these types of items.