# Lead Based Expand with Rule Based Classifier
Summary
---
A game company would like to estimate how much money its new customers are likely to spend by observing purchasing behavior within the application.

The Company wants to segment its customers and plan the necessary notifications, additional scores or mail marketing.

*The dataset includes the Country, Source, Age, Sex information of the customers and their game purchases.*

Variables:
---
1. Price: Payments made by customers
2. Source: The operating system used, either iOS or Android
3. Sex: Gender of users, Female or Male
4. Country: The country each customer is from
5. Age: Age of customers



In [105]:
# Load the Pandas Module
import pandas as pd

In [106]:
# Read persona.csv file as pd.DF
persona_df = pd.read_csv("resources/persona.csv")

# Observe the dataset in the next cell:
# - get the first 5 rows to check that the data loaded into the DataFrame successfully
# - check the dimensionality of the DataFrame
# - generate descriptive statistics for the dataset (count, mean, std, min, max)
# - check whether any value is missing in the DataFrame
# - count how many missing values exist in each column

In [107]:
persona_df.head()
persona_df.shape
persona_df.describe().T
persona_df.isnull().values.any()
persona_df.isnull().sum()

PRICE      0
SOURCE     0
SEX        0
COUNTRY    0
AGE        0
dtype: int64
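
In a notebook cell only the value of the last expression is rendered, which is why only the `isnull().sum()` result appears above. The following is a minimal sketch, assuming a small made-up DataFrame (`toy_df`, a hypothetical stand-in with the same column names as `persona.csv`), of how each inspection step could be printed explicitly.

```python
import pandas as pd

# Hypothetical stand-in for persona.csv; the values are made up for illustration
toy_df = pd.DataFrame({
    "PRICE": [39, 49, 29, 19],
    "SOURCE": ["android", "ios", "android", "ios"],
    "SEX": ["male", "female", "male", "female"],
    "COUNTRY": ["usa", "bra", "tur", "deu"],
    "AGE": [17, 26, 23, 35],
})

print(toy_df.head())          # first rows, to confirm the data loaded
print(toy_df.shape)           # (number of rows, number of columns)
print(toy_df.describe().T)    # summary statistics of the numeric columns

has_missing = toy_df.isnull().values.any()    # True if any cell is missing
missing_per_column = toy_df.isnull().sum()    # number of missing cells per column
print(has_missing)
print(missing_per_column)
```
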

# A. Observing Dataset for Segmentation
A1. Observe the individual variables
A2. Examine multiple variables together and observe their breakdown
    A2.1 Getting the Index from DataFrame
    A2.2 Further divide the customers into age groups, i.e. [0_18, 19_23, 24_30, 31_40, 41_70]
    A2.3 Identify new label-based customers, with labels of the form country_source_sex_ageGroup, e.g. bra_android_male_41_70
    A2.4 Further divide the label_based customers into 4 Groups {A,B,C,D} and describe each segment in terms of mean, max, min, sum of Pricing


---

## A1. Observe Individual Variables

In [108]:
# Get to know the DataSet better
# Look at the unique value members and frequencies for a number of categorical variables
# 1. Count the number of distinct operating systems used by the users. Here OS --> variable "SOURCE"
persona_df["SOURCE"].nunique()       # returns an int (iOS and Android)

# Return the count of rows per source
persona_df["SOURCE"].value_counts()     # How many rows come from <android> devices and how many from <ios> devices

# 2. Count distinct countries
persona_df["COUNTRY"].nunique()

# Count total users from each country
persona_df["COUNTRY"].value_counts()


usa    2065
bra    1496
deu     455
tur     451
fra     303
can     230
Name: COUNTRY, dtype: int64
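
As a small illustration of the two calls used above, here is a sketch on a made-up Series (`countries` is hypothetical, not the real dataset). It also shows `value_counts(normalize=True)`, which returns shares instead of raw counts and may be easier to compare across categories.

```python
import pandas as pd

# Made-up country codes, for illustration only
countries = pd.Series(["usa", "usa", "bra", "tur", "usa", "bra"], name="COUNTRY")

n_distinct = countries.nunique()                   # number of distinct values
counts = countries.value_counts()                  # rows per value, most frequent first
shares = countries.value_counts(normalize=True)    # same, expressed as fractions of the total

print(n_distinct)
print(counts)
print(shares)
```
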

# A2. Examine Multiple Variables Together and Observe their breakdown
Using: **groupby** and **aggregation** functions
Aggregation functions: mean, median, prod, sum, std, var
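
A minimal sketch of how `groupby` combines with these aggregation functions, assuming a tiny made-up DataFrame (`toy_df` is hypothetical). Passing a list to `agg` computes several statistics in one pass, and grouping by a list of columns produces one result per combination.

```python
import pandas as pd

# Made-up purchases, for illustration only
toy_df = pd.DataFrame({
    "COUNTRY": ["usa", "usa", "bra", "bra"],
    "SOURCE": ["ios", "android", "ios", "ios"],
    "PRICE": [10, 30, 20, 40],
})

# Several aggregations of PRICE per country at once
summary = toy_df.groupby("COUNTRY")["PRICE"].agg(["mean", "median", "sum", "std"])

# Mean PRICE per (country, source) combination; the result has a MultiIndex
by_country_source = toy_df.groupby(["COUNTRY", "SOURCE"])["PRICE"].mean()

print(summary)
print(by_country_source)
```
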


In [109]:
# Country breakdown of payment average - how much avg payment is made from each country
persona_df.groupby("COUNTRY")["PRICE"].mean()

# How much avg payment is made from each country, grouped by (Country, UserDeviceOS), e.g. (usa, ios), (usa, android)
persona_df.groupby(["COUNTRY","SOURCE"])["PRICE"].mean()

# Avg spending on the basis of (country, source, sex)
persona_df.groupby(["COUNTRY","SOURCE","SEX"])["PRICE"].mean()

# Avg spending on the basis of (country, source, sex, age); sort in descending (decreasing) order
# and List the top five spending
spending_grBY_cn_os_gender_age = persona_df.groupby(["COUNTRY", "SOURCE", "SEX", "AGE"])["PRICE"].mean().sort_values(ascending=False)
spending_grBY_cn_os_gender_age        # Display the Series; the "Length" in its footer gives the number of rows, which may be needed later for numeric indexing



COUNTRY  SOURCE   SEX     AGE
bra      android  male    46     59.0
usa      android  male    36     59.0
fra      android  female  24     59.0
usa      ios      male    32     54.0
deu      android  female  36     49.0
                                 ... 
usa      ios      female  38     19.0
                          30     19.0
can      android  female  27     19.0
fra      android  male    18     19.0
deu      android  male    26      9.0
Name: PRICE, Length: 348, dtype: float64

# A2.1 Getting the index from the dataFrame
An index works like an address: it is how any data point in a DataFrame or Series is looked up.

In [110]:
# Check the index of the existing aggregated DF
spending_grBY_cn_os_gender_age.index
spending_grBY_cn_os_gender_age.get(('bra', 'android',   'male', 46))    # get a specific value from a given index

# Reset the MultiIndex, whose entries look like ('deu', 'android', 'male', 26), to a default integer index [0 .. n-1]; the index levels become regular columns
spending_grBY_cn_os_gender_age_reindexed=spending_grBY_cn_os_gender_age.reset_index()
spending_grBY_cn_os_gender_age_reindexed.index      # RangeIndex(start=0, stop=348, step=1)

# Observe the DF after reindexing
#spending_grBY_cn_os_gender_age_reindexed.head()


RangeIndex(start=0, stop=348, step=1)
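
To make the effect of `reset_index()` concrete, here is a small sketch on made-up data (`toy_df` and its values are hypothetical). A grouped Series carries the group keys as a MultiIndex; after the reset those keys become ordinary columns and the rows are addressed by position.

```python
import pandas as pd

# Made-up purchases, for illustration only
toy_df = pd.DataFrame({
    "COUNTRY": ["usa", "usa", "bra"],
    "AGE": [20, 20, 30],
    "PRICE": [10.0, 30.0, 50.0],
})

grouped = toy_df.groupby(["COUNTRY", "AGE"])["PRICE"].mean()
value = grouped.loc[("usa", 20)]    # look up one value by its MultiIndex key

flat = grouped.reset_index()        # keys become columns, index becomes 0..n-1
print(grouped.index)
print(flat)
```
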

# A2.2 Further Divide the customer into Certain Age Group
We will convert the AGE variable into a categorical variable with five levels: [0_18, 19_23, 24_30, 31_40, 41_70]

In [111]:
# Average age across the grouped rows (one row per country/source/sex/age combination, so not weighted by the number of purchases)
round(spending_grBY_cn_os_gender_age_reindexed["AGE"].mean())

28

In [112]:
# Convert the AGE variable to a categorical variable and add it to the reindexed spending DF
# Each row is assigned to the age group whose label matches its age range
# To slice the ages into these ranges, we use pd.cut() with bins, which turns a continuous variable into a categorical one
custom_agegroup_labels = ['0_18','19_23','24_30','31_40','41_70']
spending_grBY_cn_os_gender_age_reindexed["AGE_GROUP"] = pd.cut(x=spending_grBY_cn_os_gender_age_reindexed["AGE"], bins=[0,18,23,30,40,70], labels=custom_agegroup_labels)

# Check the data whether the AGE_GROUP applied
spending_grBY_cn_os_gender_age_reindexed.tail()

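|     | COUNTRY | SOURCE | SEX | AGE | PRICE | AGE_GROUP |
| --- | --- | --- | --- | --- | --- | --- |
| 343 | usa | ios | female | 38 | 19.0 | 31_40 |
| 344 | usa | ios | female | 30 | 19.0 | 24_30 |
| 345 | can | android | female | 27 | 19.0 | 24_30 |
| 346 | fra | android | male | 18 | 19.0 | 0_18 |
| 347 | deu | android | male | 26 | 9.0 | 24_30 |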
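
The bin edges passed to `pd.cut` are right-closed by default, so `bins=[0,18,23,30,40,70]` means (0, 18], (18, 23], and so on. The sketch below uses made-up ages (`ages` is hypothetical) to show where boundary values land, and that an age outside the outer edges (here 75) becomes `NaN` rather than raising an error.

```python
import pandas as pd

# Made-up ages chosen to sit on or near the bin edges
ages = pd.Series([15, 18, 19, 23, 24, 70, 75])
labels = ["0_18", "19_23", "24_30", "31_40", "41_70"]

# Same bins and labels as in the cell above
age_group = pd.cut(ages, bins=[0, 18, 23, 30, 40, 70], labels=labels)

print(pd.DataFrame({"AGE": ages, "AGE_GROUP": age_group}))
```

If ages above 70 (or an age of 0) could occur in new data, widening the outer bin edges would be one way to avoid missing labels.
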


# A2.3 Identify new label-based customers
The new labels have the form country_source_sex_ageGroup, e.g. bra_android_male_41_70

In [113]:
# Add a new column "customers_level_based" to the spending DataFrame, built from country, source, sex and age group for each row
# Instead of an explicit for loop, we use a list comprehension over the row values (i[0..2] are COUNTRY, SOURCE, SEX; i[-1] is AGE_GROUP)
spending_grBY_cn_os_gender_age_reindexed["customers_level_based"] = [f"{i[0]}_{i[1]}_{i[2]}_{i[-1]}" for i in spending_grBY_cn_os_gender_age_reindexed.values]

# inspect the data
spending_grBY_cn_os_gender_age_reindexed.head()
#spending_grBY_cn_os_gender_age_reindexed["customers_level_based"].head()

|     | COUNTRY | SOURCE | SEX | AGE | PRICE | AGE_GROUP | customers_level_based |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | bra | android | male | 46 | 59.0 | 41_70 | bra_android_male_41_70 |
| 1 | usa | android | male | 36 | 59.0 | 31_40 | usa_android_male_31_40 |
| 2 | fra | android | female | 24 | 59.0 | 24_30 | fra_android_female_24_30 |
| 3 | usa | ios | male | 32 | 54.0 | 31_40 | usa_ios_male_31_40 |
| 4 | deu | android | female | 36 | 49.0 | 31_40 | deu_android_female_31_40 |
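
The list comprehension above relies on column positions. As an alternative sketch (not the notebook's method), the same label can be built by concatenating the columns by name, which keeps working if the column order changes. `toy_df` below is a hypothetical two-row stand-in for the spending DataFrame.

```python
import pandas as pd

# Made-up rows with the same columns as the spending DataFrame
toy_df = pd.DataFrame({
    "COUNTRY": ["bra", "usa"],
    "SOURCE": ["android", "ios"],
    "SEX": ["male", "female"],
    "AGE": [46, 32],
    "PRICE": [59.0, 54.0],
})
toy_df["AGE_GROUP"] = pd.cut(
    toy_df["AGE"],
    bins=[0, 18, 23, 30, 40, 70],
    labels=["0_18", "19_23", "24_30", "31_40", "41_70"],
)

# Element-wise string concatenation; the categorical AGE_GROUP is cast to str first
toy_df["customers_level_based"] = (
    toy_df["COUNTRY"] + "_" + toy_df["SOURCE"] + "_" + toy_df["SEX"]
    + "_" + toy_df["AGE_GROUP"].astype(str)
)

print(toy_df["customers_level_based"].tolist())
```
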


In [114]:
# Access a group of rows and columns in the DF using .loc
# Select all rows --> using :
# Keep the columns "customers_level_based" and "PRICE"
# Group by "customers_level_based" and take the mean PRICE per label
# Then sort by mean PRICE in descending (decreasing) order

spending_grBY_cn_os_gender_age_reindexed= \
    spending_grBY_cn_os_gender_age_reindexed.loc[:,["customers_level_based","PRICE"]].groupby("customers_level_based")["PRICE"].mean().sort_values(ascending=False).reset_index()

# inspect the data
spending_grBY_cn_os_gender_age_reindexed.head()

|     | customers_level_based | PRICE |
| --- | --- | --- |
| 0 | fra_android_female_24_30 | 45.428571 |
| 1 | tur_ios_male_24_30 | 45.0 |
| 2 | tur_ios_male_31_40 | 42.333333 |
| 3 | tur_android_female_31_40 | 41.833333 |
| 4 | can_android_male_19_23 | 40.111111 |


# A2.4 Further divide the label_based customers into 4 Groups {A,B,C,D} and describe each segment in terms of mean, max, min, sum of Pricing
A --> Most Profitable Customer
D --> Least Profitable Customer

In [115]:
# Divide the personas into quantiles using qcut, not with cut
# Add a new column "SEGMENT" into the spending DF with these 4 labels
custom_profitability_labels = ["D","C","B","A"]
spending_grBY_cn_os_gender_age_reindexed["SEGMENT"] = pd.qcut(spending_grBY_cn_os_gender_age_reindexed["PRICE"],4, labels=custom_profitability_labels)
spending_grBY_cn_os_gender_age_reindexed.head()

|     | customers_level_based | PRICE | SEGMENT |
| --- | --- | --- | --- |
| 0 | fra_android_female_24_30 | 45.428571 | A |
| 1 | tur_ios_male_24_30 | 45.0 | A |
| 2 | tur_ios_male_31_40 | 42.333333 | A |
| 3 | tur_android_female_31_40 | 41.833333 | A |
| 4 | can_android_male_19_23 | 40.111111 | A |
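
Unlike `pd.cut`, which uses fixed bin edges, `pd.qcut` chooses the edges from the data so that each bin holds roughly the same number of rows. A minimal sketch on made-up prices (`prices` is hypothetical): with labels listed from lowest to highest quartile, the cheapest quarter is labelled "D" and the most expensive "A".

```python
import pandas as pd

# Made-up average prices, for illustration only
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])

# Four quantile-based bins; labels run from the lowest to the highest quartile
segments = pd.qcut(prices, 4, labels=["D", "C", "B", "A"])

print(pd.DataFrame({"PRICE": prices, "SEGMENT": segments}))
print(segments.value_counts())    # each segment holds two of the eight rows
```
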


In [116]:
# Describe each profitability segment in more detail
spending_grBY_cn_os_gender_age_reindexed.groupby("SEGMENT").agg({"PRICE": ["mean","max","min","sum"]})

| SEGMENT | PRICE mean | PRICE max | PRICE min | PRICE sum |
| --- | --- | --- | --- | --- |
| D | 29.20678 | 32.333333 | 19.0 | 817.789833 |
| C | 33.509674 | 34.07734 | 32.5 | 904.761209 |
| B | 34.999645 | 36.0 | 34.103727 | 944.990411 |
| A | 38.691234 | 45.428571 | 36.060606 | 1044.663328 |


In [117]:
# Describe Segment 'C'
spending_grBY_cn_os_gender_age_reindexed[spending_grBY_cn_os_gender_age_reindexed["SEGMENT"] == "C"].describe()

|     | PRICE |
| --- | --- |
| count | 27.0 |
| mean | 33.509674 |
| std | 0.492587 |
| min | 32.5 |
| 25% | 33.0 |
| 50% | 33.627634 |
| 75% | 34.0 |
| max | 34.07734 |


In [None]:
#--------------------------------- SEGMENTATION COMPLETES HERE --------------------------------------------

# B. Simulation

In [100]:
# Scenario_01: Let’s say a 25-year-old French man downloads the game and makes purchases in the game market on his Android device. Which segment does he belong to, and how much is he expected to spend on average?

In [118]:
new_user= "fra_android_male_24_30"
spending_grBY_cn_os_gender_age_reindexed[spending_grBY_cn_os_gender_age_reindexed["customers_level_based"]==new_user]



|     | customers_level_based | PRICE | SEGMENT |
| --- | --- | --- | --- |
| 74 | fra_android_male_24_30 | 33.0 | C |


In [None]:
# Interpretation: This customer is in segment C and can be expected to spend about $33 on average in the game.
# In the same way, every new customer can be assigned to a segment, and marketing strategies can be planned per segment.
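
The lookup above requires typing the label by hand. As a hedged sketch, the two helpers below (`build_persona_label` and `lookup_segment` are hypothetical names, not part of the notebook) derive the label from raw attributes using the same age bins, then look it up. `toy_segments` is a two-row stand-in whose values are copied from the outputs shown earlier.

```python
import pandas as pd

AGE_BINS = [0, 18, 23, 30, 40, 70]
AGE_LABELS = ["0_18", "19_23", "24_30", "31_40", "41_70"]


def build_persona_label(country, source, sex, age):
    """Build a country_source_sex_ageGroup label (hypothetical helper)."""
    age_group = pd.cut([age], bins=AGE_BINS, labels=AGE_LABELS)[0]
    if pd.isna(age_group):
        raise ValueError(f"age {age} falls outside the defined age bins")
    return f"{country}_{source}_{sex}_{age_group}"


def lookup_segment(segment_table, label):
    """Return the matching row as a Series, or None if the label is unknown."""
    match = segment_table[segment_table["customers_level_based"] == label]
    return None if match.empty else match.iloc[0]


# Stand-in for the segmented DataFrame, using two rows from the outputs above
toy_segments = pd.DataFrame({
    "customers_level_based": ["fra_android_female_24_30", "fra_android_male_24_30"],
    "PRICE": [45.428571, 33.0],
    "SEGMENT": ["A", "C"],
})

label = build_persona_label("fra", "android", "male", 25)
result = lookup_segment(toy_segments, label)
print(label, result["SEGMENT"], result["PRICE"])
```

Returning `None` for an unseen label is one possible design choice; a combination that never appeared in the purchase data has no segment, so the caller has to decide on a fallback.
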