# Lead Calculation with Rule-Based Classification

### Problem:
A game company wants to create new level-based personas from the attributes of its customers, group these personas into segments, and use the segments to estimate how much revenue a new customer is likely to bring in on average.

### Dataset Story
The Persona.csv dataset contains the prices of the products sold by an international game company and some demographic information of the users who buy these products.

__Price:__ Amount the customer paid<br>
__Source:__ Type of device the customer connects from<br>
__Sex:__ Gender of the customer<br>
__Country:__ Country of the customer<br>
__Age:__ Age of the customer<br>

In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("C:/Users/ozkan/Desktop/Miuul ML/Kural_Tabanli_Siniflandirma/persona.csv")
df

,PRICE,SOURCE,SEX,COUNTRY,AGE
0,39,android,male,bra,17
1,39,android,male,bra,17
2,49,android,male,bra,17
3,29,android,male,tur,17
4,49,android,male,tur,17
...,...,...,...,...,...
4995,29,android,female,bra,31
4996,29,android,female,bra,31
4997,29,android,female,bra,31
4998,39,android,female,bra,31


In [3]:
def check_df(dataframe, head=5):
    """Print a quick overview: shape, dtypes, missing values, quantiles, head and tail."""
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    # describe() summarises the numeric columns only (PRICE and AGE here).
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))


check_df(df)

##################### Shape #####################
(5000, 5)
##################### Types #####################
PRICE       int64
SOURCE     object
SEX        object
COUNTRY    object
AGE         int64
dtype: object
##################### NA #####################
PRICE      0
SOURCE     0
SEX        0
COUNTRY    0
AGE        0
dtype: int64
##################### Quantiles #####################
        count     mean        std   min    0%    5%   50%   95%   99%  100%  \
PRICE  5000.0  34.1320  12.464897   9.0   9.0  19.0  39.0  49.0  59.0  59.0   
AGE    5000.0  23.5814   8.995908  15.0  15.0  15.0  21.0  43.0  53.0  66.0   

        max  
PRICE  59.0  
AGE    66.0  
##################### Head #####################
   PRICE   SOURCE   SEX COUNTRY  AGE
0     39  android  male     bra   17
1     39  android  male     bra   17
2     49  android  male     bra   17
3     29  android  male     tur   17
4     49  android  male     tur   17
##################### Tail #####################
      P

In [5]:
# Average PRICE for every COUNTRY, SOURCE, SEX, AGE combination, highest first.
agg_df = (
    df.groupby(by=["COUNTRY", "SOURCE", "SEX", "AGE"])
      .agg({"PRICE": "mean"})
      .sort_values("PRICE", ascending=False)
)
agg_df

COUNTRY,SOURCE,SEX,AGE,PRICE
bra,android,male,46,59.0
usa,android,male,36,59.0
fra,android,female,24,59.0
usa,ios,male,32,54.0
deu,android,female,36,49.0
...,...,...,...,...
usa,ios,female,38,19.0
usa,ios,female,30,19.0
can,android,female,27,19.0
fra,android,male,18,19.0


In [6]:
agg_df = agg_df.reset_index()
agg_df.head()

,COUNTRY,SOURCE,SEX,AGE,PRICE
0,bra,android,male,46,59.0
1,usa,android,male,36,59.0
2,fra,android,female,24,59.0
3,usa,ios,male,32,54.0
4,deu,android,female,36,49.0


In [9]:
# Age bin edges. pd.cut uses right-closed intervals, so 18 falls in 0_18 and 19 in 19_23.
max_age = agg_df["AGE"].max()
bins = [0, 18, 23, 30, 40, max_age]

mylabels = ["0_18", "19_23", "24_30", "31_40", f"41_{max_age}"]

agg_df["age_cat"] = pd.cut(agg_df["AGE"], bins, labels=mylabels)
agg_df.head()

,COUNTRY,SOURCE,SEX,AGE,PRICE,age_cat
0,bra,android,male,46,59.0,41_66
1,usa,android,male,36,59.0,31_40
2,fra,android,female,24,59.0,24_30
3,usa,ios,male,32,54.0,31_40
4,deu,android,female,36,49.0,31_40
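
By default `pd.cut` closes each interval on the right, so the edges above place an 18-year-old in `0_18` and a 19-year-old in `19_23`. The sketch below checks this on a handful of hypothetical ages chosen to sit on the bin edges. The upper edge of 66 is the maximum age reported in the Quantiles output. The ages themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges.
ages = pd.Series([15, 18, 19, 23, 24, 40, 41, 66], name="AGE")

bins = [0, 18, 23, 30, 40, 66]  # 66 is the maximum age in the dataset
labels = ["0_18", "19_23", "24_30", "31_40", "41_66"]

# right=True is the default: each bin is (left, right], so 18 lands in 0_18.
age_cat = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"AGE": ages, "age_cat": age_cat}))
```

An age of 0 or anything above the last edge would come back as `NaN`, which is worth keeping in mind if the bins are reused on new data.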


In [10]:
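# Build the persona label COUNTRY_SOURCE_SEX_AGECAT in upper case.
# Positional indices in each row: 0=COUNTRY, 1=SOURCE, 2=SEX, 5=age_cat.
agg_df["customers_level_based"] = [row[0].upper() + "_" + row[1].upper() + "_" + row[2].upper() + "_" + row[5].upper() for row in agg_df.values]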
agg_df.head()

,COUNTRY,SOURCE,SEX,AGE,PRICE,age_cat,customers_level_based
0,bra,android,male,46,59.0,41_66,BRA_ANDROID_MALE_41_66
1,usa,android,male,36,59.0,31_40,USA_ANDROID_MALE_31_40
2,fra,android,female,24,59.0,24_30,FRA_ANDROID_FEMALE_24_30
3,usa,ios,male,32,54.0,31_40,USA_IOS_MALE_31_40
4,deu,android,female,36,49.0,31_40,DEU_ANDROID_FEMALE_31_40
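
The list comprehension depends on column positions (`row[0]`, `row[5]`), so it would produce wrong labels if the column order ever changed. As an alternative sketch, not the approach used in this notebook, the same label can be built by column name. The two-row frame below is hypothetical and only mimics the shape of `agg_df` at this step.

```python
import pandas as pd

# Hypothetical rows shaped like agg_df after the age_cat step.
toy = pd.DataFrame({
    "COUNTRY": ["bra", "usa"],
    "SOURCE": ["android", "ios"],
    "SEX": ["male", "female"],
    "age_cat": pd.Categorical(["41_66", "31_40"]),
})

# Concatenate by column name; age_cat is categorical, so cast it to str first.
toy["customers_level_based"] = (
    toy["COUNTRY"] + "_" + toy["SOURCE"] + "_" + toy["SEX"] + "_"
    + toy["age_cat"].astype(str)
).str.upper()

print(toy["customers_level_based"].tolist())
```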


In [13]:
agg_df = agg_df[["customers_level_based", "PRICE"]]
agg_df.head()

,customers_level_based,PRICE
0,BRA_ANDROID_MALE_41_66,59.0
1,USA_ANDROID_MALE_31_40,59.0
2,FRA_ANDROID_FEMALE_24_30,59.0
3,USA_IOS_MALE_31_40,54.0
4,DEU_ANDROID_FEMALE_31_40,49.0


In [14]:
agg_df["customers_level_based"].value_counts()

USA_IOS_FEMALE_24_30        7
BRA_ANDROID_MALE_24_30      7
USA_ANDROID_MALE_41_66      7
BRA_ANDROID_FEMALE_24_30    7
USA_ANDROID_MALE_24_30      7
                           ..
DEU_IOS_MALE_41_66          1
CAN_ANDROID_MALE_19_23      1
CAN_IOS_MALE_31_40          1
DEU_IOS_FEMALE_31_40        1
CAN_IOS_FEMALE_24_30        1
Name: customers_level_based, Length: 109, dtype: int64
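
Counts above 1 appear because several distinct ages fall into the same age bracket, so one persona label can still cover multiple rows. Grouping by the label again leaves one row per persona. The sketch below illustrates this with made-up prices. Note that it averages the per-age means, which is not necessarily the same as the mean over all individual sales.

```python
import pandas as pd

# Hypothetical per-age averages that share one persona label after binning.
toy = pd.DataFrame({
    "customers_level_based": [
        "USA_IOS_FEMALE_24_30",
        "USA_IOS_FEMALE_24_30",
        "CAN_IOS_FEMALE_24_30",
    ],
    "PRICE": [30.0, 40.0, 19.0],
})

# Before regrouping, the USA persona occupies two rows.
print(toy["customers_level_based"].value_counts())

# Regrouping leaves one row per persona, averaging the per-age means.
personas = (
    toy.groupby("customers_level_based", as_index=False)
       .agg({"PRICE": "mean"})
)
print(personas)
```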

In [17]:
agg_df = agg_df.groupby("customers_level_based").agg({"PRICE": "mean"})
agg_df.head()

customers_level_based,PRICE
BRA_ANDROID_FEMALE_0_18,35.645303
BRA_ANDROID_FEMALE_19_23,34.07734
BRA_ANDROID_FEMALE_24_30,33.863946
BRA_ANDROID_FEMALE_31_40,34.898326
BRA_ANDROID_FEMALE_41_66,36.737179


In [18]:
agg_df = agg_df.reset_index()
agg_df.head()

,customers_level_based,PRICE
0,BRA_ANDROID_FEMALE_0_18,35.645303
1,BRA_ANDROID_FEMALE_19_23,34.07734
2,BRA_ANDROID_FEMALE_24_30,33.863946
3,BRA_ANDROID_FEMALE_31_40,34.898326
4,BRA_ANDROID_FEMALE_41_66,36.737179


In [20]:
agg_df["customers_level_based"].value_counts()

TUR_IOS_FEMALE_41_66        1
TUR_ANDROID_FEMALE_31_40    1
BRA_IOS_FEMALE_41_66        1
CAN_ANDROID_FEMALE_41_66    1
FRA_IOS_MALE_24_30          1
                           ..
FRA_ANDROID_FEMALE_31_40    1
BRA_IOS_MALE_31_40          1
TUR_IOS_MALE_0_18           1
TUR_ANDROID_MALE_24_30      1
BRA_ANDROID_FEMALE_41_66    1
Name: customers_level_based, Length: 109, dtype: int64

In [21]:
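# Split personas into four equal-sized groups by average PRICE.
# Labels run from the lowest quartile to the highest, so "A" is the most valuable segment.
agg_df["SEGMENT"] = pd.qcut(agg_df["PRICE"], 4, labels=["D", "C", "B", "A"])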
agg_df.head(30)

,customers_level_based,PRICE,SEGMENT
0,BRA_ANDROID_FEMALE_0_18,35.645303,B
1,BRA_ANDROID_FEMALE_19_23,34.07734,C
2,BRA_ANDROID_FEMALE_24_30,33.863946,C
3,BRA_ANDROID_FEMALE_31_40,34.898326,B
4,BRA_ANDROID_FEMALE_41_66,36.737179,A
5,BRA_ANDROID_MALE_0_18,34.805861,B
6,BRA_ANDROID_MALE_19_23,31.673243,D
7,BRA_ANDROID_MALE_24_30,33.413919,C
8,BRA_ANDROID_MALE_31_40,34.327381,B
9,BRA_ANDROID_MALE_41_66,40.041667,A
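
`pd.qcut` cuts on the quantiles of `PRICE`, so each segment holds roughly the same number of personas, and the labels are applied from the lowest quartile upward. A minimal sketch with eight hypothetical average prices shows how the quartiles map to `D`, `C`, `B` and `A`:

```python
import pandas as pd

# Hypothetical persona-level average prices (eight distinct values).
prices = pd.Series([20.0, 25.0, 30.0, 32.0, 34.0, 36.0, 40.0, 45.0], name="PRICE")

# Labels are assigned from the lowest quartile to the highest: D, C, B, A.
segment = pd.qcut(prices, 4, labels=["D", "C", "B", "A"])

# Price range and size of each segment.
summary = (
    pd.DataFrame({"PRICE": prices, "SEGMENT": segment})
      .groupby("SEGMENT", observed=True)["PRICE"]
      .agg(["min", "max", "count"])
)
print(summary)
```

The same kind of summary on the real `agg_df` would show the price range each segment covers.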


### Classifying New Customers and Estimating Their Expected Revenue
Which segment does a Turkish woman aged 31 to 40 who uses Android belong to, and how much revenue is she expected to generate on average?

In [22]:
new_user = "TUR_ANDROID_FEMALE_31_40"
agg_df[agg_df["customers_level_based"] == new_user]

,customers_level_based,PRICE,SEGMENT
72,TUR_ANDROID_FEMALE_31_40,41.833333,A
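
The persona key above was typed by hand. As a sketch, a small helper could build it from a customer's raw attributes using the same age edges. `make_persona_key` is a hypothetical name, not part of the original notebook, and the age of 35 is an arbitrary value inside the `31_40` bracket. The `segments` frame is a stand-in for `agg_df` that reuses three rows from the outputs shown earlier.

```python
import pandas as pd

# Same age edges and labels as above; 66 is the maximum age in the dataset.
AGE_BINS = [0, 18, 23, 30, 40, 66]
AGE_LABELS = ["0_18", "19_23", "24_30", "31_40", "41_66"]


def make_persona_key(country, source, sex, age):
    """Hypothetical helper: build a customers_level_based key from raw attributes."""
    age_cat = pd.cut([age], bins=AGE_BINS, labels=AGE_LABELS)[0]
    return f"{country}_{source}_{sex}_{age_cat}".upper()


# Stand-in for agg_df, using three rows shown in the outputs above.
segments = pd.DataFrame({
    "customers_level_based": [
        "BRA_ANDROID_MALE_19_23",
        "BRA_ANDROID_MALE_41_66",
        "TUR_ANDROID_FEMALE_31_40",
    ],
    "PRICE": [31.673243, 40.041667, 41.833333],
    "SEGMENT": ["D", "A", "A"],
})

key = make_persona_key("tur", "android", "female", 35)
match = segments[segments["customers_level_based"] == key]
print(match)
```

An empty result would mean the persona does not occur in the data, so a real lookup would need a fallback for that case.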
