# Lead Calculation with Rule-Based Classification

### CASE STUDY 

A game company wants to use customer attributes to define level-based customer profiles (personas), group these personas into segments, and estimate how much revenue a new customer is likely to generate on average, based on their segment.

<b>For example:</b><br>
The goal is to estimate, for instance, how much revenue a 25-year-old male iOS user from the Netherlands would generate for the company on average.

<b>Dataset</b><br>
The *persona.csv* dataset contains the prices of the products sold by an international game company and some demographic information about the users who bought them. Each row is one sales transaction, so the table is not deduplicated: a user with a given demographic profile may have made more than one purchase.
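
Because each row is a transaction, the same demographic profile can repeat. The following is a minimal sketch on a made-up five-row frame (the values are illustrative and not taken from persona.csv) showing one way to count exactly repeated rows and the number of transactions per profile:

```python
import pandas as pd

# Toy transactions (illustrative values, not from persona.csv)
toy = pd.DataFrame({
    "price":   [39, 39, 49, 29, 29],
    "source":  ["android", "android", "android", "ios", "ios"],
    "sex":     ["male", "male", "male", "female", "female"],
    "country": ["bra", "bra", "bra", "tur", "tur"],
    "age":     [17, 17, 17, 30, 30],
})

# Rows that exactly repeat an earlier row (same price and same demographics)
n_repeated_rows = int(toy.duplicated().sum())

# Number of transactions recorded for each demographic profile
profile_cols = ["country", "source", "sex", "age"]
transactions_per_profile = toy.groupby(profile_cols).size()

print(n_repeated_rows)
print(transactions_per_profile)
```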

<b>Variables</b><br>
**price** – Customer's spending amount<br>
**source** – The type of device the customer connects from<br>
**sex** – Customer's gender<br>
**country** – Customer's country<br>
**age** – Customer's age<br>

In [1]:
# Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [2]:
data = pd.read_csv("datas/persona.csv")
data.rename(columns={'PRICE': 'price',
                    'SOURCE': 'source',
                    'SEX': 'sex',
                    'COUNTRY': 'country',
                    'AGE': 'age'}, inplace=True)

In [3]:
def check_df(dataframe, head=5):
    
    print("                    HEAD                    ")
    print(dataframe.head(head))
    print("\n                    TAIL                    ")
    print(dataframe.tail(head))
    print("\n                    SHAPE                    ")
    print(dataframe.shape)
    print("\n                    TYPES                    ")
    print(dataframe.dtypes)
    print("\n                    MISSING VALUES                    ")
    print(dataframe.isnull().sum())
    print("\n                    DESCRIBE                    ")
    print(dataframe.describe().T)
    
check_df(data)

                    HEAD                    
   price   source   sex country  age
0     39  android  male     bra   17
1     39  android  male     bra   17
2     49  android  male     bra   17
3     29  android  male     tur   17
4     49  android  male     tur   17

                    TAIL                    
      price   source     sex country  age
4995     29  android  female     bra   31
4996     29  android  female     bra   31
4997     29  android  female     bra   31
4998     39  android  female     bra   31
4999     29  android  female     bra   31

                    SHAPE                    
(5000, 5)

                    TYPES                    
price       int64
source     object
sex        object
country    object
age         int64
dtype: object

                    MISSING VALUES                    
price      0
source     0
sex        0
country    0
age        0
dtype: int64

                    DESCRIBE                    
        count     mean        std   min   2

*The data has 5000 rows and 5 columns and contains no missing values. The average age is about 24 and the average price is about 34. The maximum values are 66 for age and 59 for price.*

# Data Investigation 

In [4]:
def unique_values(dataframe,column_name:str):
    
    values = dataframe[column_name].value_counts()
    num_nunique = dataframe[column_name].nunique()
    
    print(f"The feature {column_name} has {num_nunique} unique values as \n{values}.\n")

In [5]:
# unique_values prints its report and returns None, so the result is not assigned
unique_values(data,"source")
unique_values(data,"price")

The feature source has 2 unique values as 
android    2974
ios        2026
Name: source, dtype: int64.

The feature price has 6 unique values as 
29    1305
39    1260
49    1031
19     992
59     212
9      200
Name: price, dtype: int64.



*The most common price is 29 (1305 sales), closely followed by 39 (1260 sales).*

In [6]:
data.head()

Unnamed: 0,price,source,sex,country,age
0,39,android,male,bra,17
1,39,android,male,bra,17
2,49,android,male,bra,17
3,29,android,male,tur,17
4,49,android,male,tur,17


In [7]:
unique_values(data,"country")

The feature country has 6 unique values as 
usa    2065
bra    1496
deu     455
tur     451
fra     303
can     230
Name: country, dtype: int64.



*There are 6 countries in total. The USA has the most sales (2065), followed by Brazil (1496). Canada has the fewest (230).*

## How much was earned in total from sales by country?

In [8]:
data.groupby("country").agg({"price":"sum"})

Unnamed: 0_level_0,price
country,Unnamed: 1_level_1
bra,51354
can,7730
deu,15485
fra,10177
tur,15689
usa,70225


## What are the *sales* numbers by *source types*?

In [9]:
# Number of sales at each price point, counted separately for each source
data.groupby("source").agg({"price": "value_counts"})

Unnamed: 0_level_0,Unnamed: 1_level_0,price
source,price,Unnamed: 2_level_1
android,29,778
android,39,749
android,49,620
android,19,584
android,59,124
android,9,119
ios,29,527
ios,39,511
ios,49,411
ios,19,408
ios,59,88
ios,9,81


In [10]:
# unstack() moves the inner index level (price) into columns; missing combinations become 0
data.groupby("source").agg({"price": "value_counts"}).unstack(fill_value=0)

Unnamed: 0_level_0,price,price,price,price,price,price
price,9,19,29,39,49,59
source,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
android,119,584,778,749,620,124
ios,81,408,527,511,411,88
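
A source-by-price count table of this shape can also be built with `pd.crosstab`. The sketch below uses a small made-up frame (illustrative values only) and compares it with the `value_counts` plus `unstack` route:

```python
import pandas as pd

# Toy sales (illustrative values, not from persona.csv)
toy = pd.DataFrame({
    "source": ["android", "android", "ios", "ios", "android"],
    "price":  [29, 39, 29, 29, 29],
})

# Rows are sources, columns are prices, cells are sale counts;
# combinations that never occur are reported as 0
counts = pd.crosstab(toy["source"], toy["price"])

# Equivalent route: count per (source, price), then pivot price into columns
alt = toy.groupby("source")["price"].value_counts().unstack(fill_value=0)

print(counts)
```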


## What are the *price* averages by *country*?

In [11]:
data.groupby("country").agg({"price":"mean"})

Unnamed: 0_level_0,price
country,Unnamed: 1_level_1
bra,34.32754
can,33.608696
deu,34.032967
fra,33.587459
tur,34.78714
usa,34.007264


*On average, the countries contribute almost equally per sale (about 33.6 to 34.8). Turkey has the highest average price, at approximately 34.8.*
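
The three country views used so far (number of sales, total revenue, average price) can be computed in a single `agg` call. A minimal sketch on made-up data (illustrative values only) shows how two countries can have the same average price while their totals differ:

```python
import pandas as pd

# Toy sales (illustrative values, not from persona.csv)
toy = pd.DataFrame({
    "country": ["tur", "tur", "usa", "usa", "usa", "usa"],
    "price":   [30, 40, 20, 30, 40, 50],
})

# One row per country with the sale count, total revenue and average price
summary = toy.groupby("country")["price"].agg(["count", "sum", "mean"])

print(summary)
```

Here both countries average 35 per sale, but the total is driven by the number of sales.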

## What are the *price* averages by *source*?

In [12]:
data.groupby("source").agg({"price":"mean"})

Unnamed: 0_level_0,price
source,Unnamed: 1_level_1
android,34.174849
ios,34.069102


## What are the *price* averages in the *country*-*source* breakdown?

In [13]:
data.groupby(["country","source"]).agg({"price":"mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,price
country,source,Unnamed: 2_level_1
bra,android,34.387029
bra,ios,34.222222
can,android,33.330709
can,ios,33.951456
deu,android,33.869888
deu,ios,34.268817
fra,android,34.3125
fra,ios,32.776224
tur,android,36.229437
tur,ios,33.272727


*In the table, Android in Turkey has the highest average price, at about 36.2. In Germany, Canada and the USA, the average price on iOS is higher than on Android.*

## What are the average earnings broken down by *country*, *source*, *sex* and *age*?

In [14]:
data.groupby(["country","source","sex","age"]).agg({"price":"mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,price
country,source,sex,age,Unnamed: 4_level_1
bra,android,female,15,38.714286
bra,android,female,16,35.944444
bra,android,female,17,35.666667
bra,android,female,18,32.255814
bra,android,female,19,35.206897
...,...,...,...,...
usa,ios,male,42,30.250000
usa,ios,male,50,39.000000
usa,ios,male,53,34.000000
usa,ios,male,55,29.000000


In [15]:
agg_df = data.groupby(["country","source","sex","age"]).agg({"price":"mean"}).sort_values(by="price", ascending=False)
agg_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,price
country,source,sex,age,Unnamed: 4_level_1
bra,android,male,46,59.0
usa,android,male,36,59.0
fra,android,female,24,59.0
usa,ios,male,32,54.0
deu,android,female,36,49.0


In [16]:
agg_df = agg_df.reset_index()
agg_df.head()

Unnamed: 0,country,source,sex,age,price
0,bra,android,male,46,59.0
1,usa,android,male,36,59.0
2,fra,android,female,24,59.0
3,usa,ios,male,32,54.0
4,deu,android,female,36,49.0


# Converting a Numerical Variable to a Categorical Variable

In [17]:
age = agg_df["age"].unique()
print(np.sort(age))

[15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
 39 40 41 42 43 44 45 46 47 49 50 51 52 53 54 55 56 57 59 61 65 66]


In [18]:
labels=['0_18', '19_23', '24_30', '31_40', '41_66']
agg_df["age_cat"] = pd.cut(agg_df["age"],[0,18,23,30,40,agg_df["age"].max()], labels=labels)

In [19]:
agg_df.head()

Unnamed: 0,country,source,sex,age,price,age_cat
0,bra,android,male,46,59.0,41_66
1,usa,android,male,36,59.0,31_40
2,fra,android,female,24,59.0,24_30
3,usa,ios,male,32,54.0,31_40
4,deu,android,female,36,49.0,31_40
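
By default `pd.cut` uses right-closed intervals, so each upper edge belongs to the lower band. The check below is a small sketch that reuses the same edges and labels, with the maximum age written out as 66 (the largest value in the age list printed above), and applies them to a few boundary ages:

```python
import pandas as pd

ages = pd.Series([15, 18, 19, 23, 24, 40, 41, 66])
bins = [0, 18, 23, 30, 40, 66]
labels = ["0_18", "19_23", "24_30", "31_40", "41_66"]

# Default right=True: each bin is (left, right], so the right edge is included
age_cat = pd.cut(ages, bins=bins, labels=labels)

# Map each age to its band label for easy inspection
result = dict(zip(ages.tolist(), age_cat.astype(str).tolist()))
print(result)
```

An age of 18 lands in `0_18` and an age of 19 in `19_23`; an age outside the outer edges would become `NaN`.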


# Defining New Level-based Customers (Personas)

In [20]:
agg_df["age_cat"]=agg_df["age_cat"].astype("object")
agg_df["customers_level_based"] = agg_df["country"]+"_"+agg_df["source"]+"_"+agg_df["sex"]+"_"+agg_df["age_cat"]

In [21]:
agg_df.head()

Unnamed: 0,country,source,sex,age,price,age_cat,customers_level_based
0,bra,android,male,46,59.0,41_66,bra_android_male_41_66
1,usa,android,male,36,59.0,31_40,usa_android_male_31_40
2,fra,android,female,24,59.0,24_30,fra_android_female_24_30
3,usa,ios,male,32,54.0,31_40,usa_ios_male_31_40
4,deu,android,female,36,49.0,31_40,deu_android_female_31_40


*Attention! The customers_level_based values are built by concatenating columns, so they need to be deduplicated. For example, usa_ios_male_31_40 can occur more than once, because several ages fall into the same age band. Group by customers_level_based and take the mean price.*
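
The effect of this deduplication can be seen on a small made-up frame (illustrative values only), where two ages map to the same persona:

```python
import pandas as pd

# Toy per-age averages (illustrative values): ages 32 and 36 share one age band
toy = pd.DataFrame({
    "customers_level_based": ["usa_ios_male_31_40", "usa_ios_male_31_40",
                              "tur_android_female_24_30"],
    "age":   [32, 36, 25],
    "price": [54.0, 30.0, 31.0],
})

# True when at least one persona appears in more than one row
is_duplicated = bool(toy["customers_level_based"].duplicated().any())

# One row per persona; price becomes the mean of that persona's rows
personas = toy.groupby("customers_level_based", as_index=False)["price"].mean()

print(personas)
```

Note that this is an unweighted mean of the per-age averages: each age counts equally, regardless of how many transactions it had.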

In [22]:
clb_df = agg_df.groupby("customers_level_based").agg({"price":"mean"})
clb_df = clb_df.reset_index()
#clb_df = agg_df[["customers_level_based","price"]]
clb_df["customers_level_based"] = clb_df["customers_level_based"].str.upper()
clb_df.head()

Unnamed: 0,customers_level_based,price
0,BRA_ANDROID_FEMALE_0_18,35.645303
1,BRA_ANDROID_FEMALE_19_23,34.07734
2,BRA_ANDROID_FEMALE_24_30,33.863946
3,BRA_ANDROID_FEMALE_31_40,34.898326
4,BRA_ANDROID_FEMALE_41_66,36.737179


# Creating New Customers' Segments

In [23]:
clb_df["segment"] = pd.qcut(clb_df["price"], 4, labels=["D","C","B","A"])
clb_df.head()

Unnamed: 0,customers_level_based,price,segment
0,BRA_ANDROID_FEMALE_0_18,35.645303,B
1,BRA_ANDROID_FEMALE_19_23,34.07734,C
2,BRA_ANDROID_FEMALE_24_30,33.863946,C
3,BRA_ANDROID_FEMALE_31_40,34.898326,B
4,BRA_ANDROID_FEMALE_41_66,36.737179,A
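
`pd.qcut` splits values into quantile-based groups of roughly equal size, and the labels are assigned from the lowest group to the highest. A minimal sketch on eight made-up prices (illustrative values only):

```python
import pandas as pd

prices = pd.Series([30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0])

# Four quartile groups; "D" is the lowest-priced quarter, "A" the highest
segments = pd.qcut(prices, 4, labels=["D", "C", "B", "A"])

print(segments.tolist())
```

With eight distinct values, each segment receives two of them, so segment A holds the two highest prices.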


# Classifying New Customers

*Estimating how much revenue new customers can generate.*

In [24]:
new_customer = 'TUR_ANDROID_FEMALE_24_30'
clb_df[clb_df["customers_level_based"] == new_customer]

Unnamed: 0,customers_level_based,price,segment
71,TUR_ANDROID_FEMALE_24_30,30.785714,D


- Which segment does a 33-year-old Turkish woman using Android belong to, and how much revenue is she expected to generate on average?<br>
<br>
- Which segment does a 35-year-old French woman using iOS belong to, and how much revenue is she expected to generate on average?

In [25]:
new_customer = 'TUR_ANDROID_FEMALE_31_40'
clb_df[clb_df["customers_level_based"] == new_customer]

Unnamed: 0,customers_level_based,price,segment
72,TUR_ANDROID_FEMALE_31_40,41.833333,A


In [27]:
new_customer = 'FRA_IOS_FEMALE_31_40'
clb_df[clb_df["customers_level_based"] == new_customer]

Unnamed: 0,customers_level_based,price,segment
63,FRA_IOS_FEMALE_31_40,32.818182,C
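
The persona key in the cells above is typed by hand. One possible way to derive it from raw attributes is sketched below. `persona_key`, `AGE_BINS`, `AGE_LABELS` and `lookup` are hypothetical names introduced for illustration, and `lookup` is a stand-in for `clb_df` that holds only the two rows shown in the outputs above:

```python
import pandas as pd

# Same edges and labels as the age banding step, with the maximum written as 66
AGE_BINS = [0, 18, 23, 30, 40, 66]
AGE_LABELS = ["0_18", "19_23", "24_30", "31_40", "41_66"]

def persona_key(country: str, source: str, sex: str, age: int) -> str:
    """Build a key such as 'TUR_ANDROID_FEMALE_31_40' (hypothetical helper)."""
    # Assumes the age lies within the bin edges; otherwise the band is NaN
    age_cat = pd.cut([age], bins=AGE_BINS, labels=AGE_LABELS)[0]
    return f"{country}_{source}_{sex}_{age_cat}".upper()

# Stand-in for clb_df, limited to the two rows shown in the outputs above
lookup = pd.DataFrame({
    "customers_level_based": ["TUR_ANDROID_FEMALE_31_40", "FRA_IOS_FEMALE_31_40"],
    "price": [41.833333, 32.818182],
    "segment": ["A", "C"],
})

key = persona_key("tur", "android", "female", 33)
match = lookup[lookup["customers_level_based"] == key]
print(match)
```

With this helper, a 33-year-old Turkish woman on Android resolves to `TUR_ANDROID_FEMALE_31_40`, which matches the segment A row shown above.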
