# Project 1, Part 5, Best Customer Analytics

University of California, Berkeley
Master of Information and Data Science (MIDS) program
w205 - Fundamentals of Data Engineering

Student: John (Jack) Galvin

Year: 2022

Semester: Spring

Section: 9


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  Remember you can use any code from the labs.

In [2]:
# Function to run a select query and return rows in a pandas dataframe
# pandas reads numeric Postgres values as float64 (including integer columns that contain NULLs)
# If every non-null value in a float column is a whole number, convert the column to integer


def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    """Run a select query and return the rows in a pandas dataframe."""
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # Convert float columns that really should be integers to pandas'
    # nullable integer type, which keeps NULLs as <NA>
    
    for column in df:
    
        if df[column].dtype == "float64":

            # ignore NULLs when checking for fractional parts
            values = df[column].dropna()

            if (values == np.floor(values)).all():
                df[column] = df[column].astype('Int64')
    
    return df
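
The integer fix in the function above can be illustrated on a toy column. This is a minimal sketch that assumes nothing beyond pandas itself; the variable names `counts` and `as_int` are made up for illustration and are not part of the project code.

```python
import pandas as pd

# A column of whole-number counts with one missing value, similar to what
# comes back from a left join. pandas stores it as float64 because NaN is a float.
counts = pd.Series([9, 14, None])
print(counts.dtype)

# The nullable Int64 type keeps the whole numbers as integers and
# represents the missing value as <NA>.
as_int = counts.astype("Int64")
print(as_int)
```

The conversion is only safe when every non-null value is a whole number, which is why the function checks for fractional parts first.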

In [3]:
# Connect to Postgres

connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [4]:
# Create a cursor for the connection

cursor = connection.cursor()

## The executives want you to come up with a high level design of a model, in the form of written criteria, to determine who the best customers are. 

## You do NOT have to code the model. 

## You do NOT have to give an actual list of best customers. 

## Create an executive summary explaining your model. You must support your summary with data, in the form of output of queries, data visualization, etc. There is a 1 query minimum.

In [6]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select c.customer_id,
        c.last_name,
        c.first_name,
        sum(sa.total_amount) as total_spent,
        count(distinct store_id || '-' || sale_id) as total_txns,
        max(sa.sale_date) as last_txn
from customers as c
    left join sales as sa
        on c.customer_id = sa.customer_id
group by c.customer_id, c.last_name, c.first_name
order by total_spent
        

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| | customer_id | last_name | first_name | total_spent | total_txns | last_txn |
|---|---|---|---|---|---|---|
| 0 | 6485 | Wainwright | Alyce | 612 | 9 | 2020-12-19 |
| 1 | 22530 | Maus | Thomasina | 648 | 14 | 2020-10-18 |
| 2 | 12861 | Leeburne | Kacie | 672 | 14 | 2020-12-19 |
| 3 | 1528 | Cherry | Lorin | 684 | 16 | 2020-11-21 |
| 4 | 4220 | Colnett | Laurel | 708 | 13 | 2020-12-26 |
| ... | ... | ... | ... | ... | ... | ... |
| 31077 | 30333 | Greenhead | Luise | | 0 | |
| 31078 | 29182 | Mouser | Simona | | 0 | |
| 31079 | 2676 | Arnke | Daniella | | 0 | |
| 31080 | 23653 | Rosenstock | Stephine | | 0 | |


# Criteria for the Best Customer Model

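I would adopt a recency, frequency, and monetary value (RFM) model to describe our customers. The query above shows that we already have the data required to do so: the date of each customer's last transaction (recency), their number of transactions (frequency), and their total spend (monetary value). The model captures the customer behaviors that drive business metrics, and it can scale as our business grows.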

The RFM model is one of the most widely used methods for segmenting an organization's customer base. The "best" customers are those who purchased relatively recently, purchase frequently, and have spent a relatively high total amount with the organization. They are good candidates to pilot new products before launch, and they ought to serve as the target market for those products.
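
One way to turn the three RFM dimensions into comparable numbers is to score each on a 1 to 5 scale by quintile. The sketch below is illustrative only, not a finished model. It assumes a dataframe with the columns returned by the query above (`total_spent`, `total_txns`, `last_txn`); the function name `rfm_scores`, the score column names, the snapshot date, and the choice of five bins are all assumptions.

```python
import pandas as pd

def rfm_scores(df, snapshot_date, n_bins=5):
    """Return a copy of df with recency_days and 1..n_bins R, F, and M scores."""
    # Customers with no sales have no last_txn and cannot be scored.
    out = df.dropna(subset=["last_txn"]).copy()
    out["recency_days"] = (
        pd.Timestamp(snapshot_date) - pd.to_datetime(out["last_txn"])
    ).dt.days

    def score(series, ascending):
        # Rank first so that tied values cannot produce duplicate bin edges.
        ranks = series.rank(method="first", ascending=ascending)
        labels = list(range(1, n_bins + 1))
        return pd.qcut(ranks, q=n_bins, labels=labels).astype(int)

    # Fewer days since the last purchase is better, so rank recency descending.
    out["r_score"] = score(out["recency_days"], ascending=False)
    out["f_score"] = score(out["total_txns"], ascending=True)
    out["m_score"] = score(out["total_spent"], ascending=True)
    return out

# Sample rows copied from the query output, plus one customer with no sales.
sample = pd.DataFrame({
    "customer_id": [6485, 22530, 12861, 1528, 4220, 30333],
    "total_spent": [612, 648, 672, 684, 708, None],
    "total_txns": [9, 14, 14, 16, 13, 0],
    "last_txn": ["2020-12-19", "2020-10-18", "2020-12-19",
                 "2020-11-21", "2020-12-26", None],
})
scored = rfm_scores(sample, snapshot_date="2020-12-31")
print(scored[["customer_id", "r_score", "f_score", "m_score"]])
```

With scores like these, a "best" customer is one who scores high on all three dimensions. The exact cutoffs would be a business decision.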

New customers are those who have purchased recently but whose frequency and monetary value are still relatively low. We should engage these customers with the aim of converting them into loyal, heavy spenders.

Customers with high frequency and monetary value scores who have not purchased recently are at risk of churning. We should engage these customers too, in order to (a) find out why they have not made a purchase recently and (b) try to retain them before they are lost.

Finally, customers who have not purchased recently and whose frequency and monetary value are low are already lost. It could be worthwhile to contact them to learn why they churned, but otherwise minimal effort should be invested in them.
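
The four segments described above can be written as simple rules over RFM scores. The following is a rough sketch under stated assumptions: each customer already has 1 to 5 scores in hypothetical columns named `r_score`, `f_score`, and `m_score`, and the `high` and `low` cutoffs are placeholders rather than tuned values. Customers who match none of the four descriptions are labeled `other`.

```python
import numpy as np
import pandas as pd

def label_segments(scores, high=4, low=2):
    """Map R, F, and M scores to the four segments; cutoffs are assumptions."""
    r, f, m = scores["r_score"], scores["f_score"], scores["m_score"]
    conditions = [
        (r >= high) & (f >= high) & (m >= high),  # recent, frequent, high spend
        (r >= high) & (f <= low) & (m <= low),    # recent, but low F and M
        (r <= low) & (f >= high) & (m >= high),   # high F and M, not recent
        (r <= low) & (f <= low) & (m <= low),     # low on all three
    ]
    labels = ["best", "new", "at_risk", "lost"]
    segment = np.select(conditions, labels, default="other")
    return pd.Series(segment, index=scores.index, name="segment")

# Made-up scores, one row per rule plus one row that matches no rule.
example = pd.DataFrame({
    "r_score": [5, 5, 1, 1, 3],
    "f_score": [5, 1, 5, 1, 3],
    "m_score": [5, 1, 5, 1, 3],
})
segments = label_segments(example)
print(segments.tolist())
```

Fixed rules like these are easy to explain to executives, at the cost of leaving many mid-scoring customers unlabeled.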

Since we have the data required to implement an RFM model, we could use an unsupervised learning algorithm such as k-means clustering to find naturally occurring groups of customers along these three dimensions. We could then use these clusters to inform our approach to growing our relationship with each group. The model could be refit at regular intervals to reflect changes in our customer base as our business grows.
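
As a rough illustration of the clustering idea, the sketch below standardizes the three RFM features and fits k-means with scikit-learn. It is not the model itself. The column names `recency_days`, `total_txns`, and `total_spent`, the function name `cluster_rfm`, the number of clusters, and the toy data are all assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_rfm(rfm, n_clusters=4, random_state=0):
    """Cluster customers on standardized recency, frequency, and monetary value."""
    features = rfm[["recency_days", "total_txns", "total_spent"]].astype(float)
    # k-means is distance based, so put the three features on a common scale.
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(scaled)
    return rfm.assign(cluster=labels), model

# Made-up data: three recent, frequent, high-spend customers and three
# customers who bought rarely, long ago, and spent little.
toy = pd.DataFrame({
    "recency_days": [5, 7, 6, 300, 310, 290],
    "total_txns": [50, 48, 52, 2, 1, 3],
    "total_spent": [5000, 4800, 5200, 50, 40, 60],
})
clustered, model = cluster_rfm(toy, n_clusters=2)
print(clustered)
```

In practice the number of clusters would need to be chosen by inspecting the data, and each cluster would need a business interpretation before it could guide outreach.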