## Heroes of Pymolia - Final Data Analysis
#### By: Mike Suomi 6/3/2018

- Observed Trend 1: Among age groups with more than one player, the Normalized Purchase Total per user is lowest in the 15-24 age range ($3.86 for ages 15-19 and $3.78 for ages 20-24). This is also our largest user demographic (about 63% of players), so enticing this group to make more repeat purchases would help our profit the most.


- Observed Trend 2: We need to figure out how to entice our users to make multiple purchases. Only six of our users have more than three purchases, and the vast majority have only one. How can we keep them playing longer and/or entice them to buy more add-ons, increasing the longevity of our game and its long-term revenue?


- Observed Trend 3: How do we increase sales of our top sellers? Out of 780 purchases, our two top sellers have only 11 purchases each (and these are relatively cheap items at around $2.30). If we can't increase how often the top sellers are bought, then we should consider raising the price of many of our items, because our highest-revenue items are those with a higher price and a still relatively low purchase count (roughly $3.60 to $4.95 and 6 to 9 purchases).

In [2]:
import pandas as pd
import numpy as np

In [3]:
#import json file to dataframe
json_file_path = 'purchase_data.json'

df = pd.read_json(json_file_path)

### Player Count

In [4]:
player_count = df.SN.nunique()  #each row is one purchase, so count unique SNs to get the number of players
pd.DataFrame({'Total Unique Players': [player_count]})

| | Total Unique Players |
|---|---|
| 0 | 573 |


### Purchasing Analysis (Total)

In [5]:
num_unique_items_purchased = len(df['Item ID'].unique())
average_purchase_price = df.Price.mean()
total_num_purchases = df.Price.count()
total_revenue = df.Price.sum()
purchasing_totals_df = pd.DataFrame([[num_unique_items_purchased,
                                    average_purchase_price,
                                    total_num_purchases,
                                    total_revenue]],
                                    columns = ['Number of Unique Items',
                                                'Average Purchase Price',
                                                'Total Number of Purchases',
                                                'Total Revenue'])

purchasing_totals_df['Average Purchase Price'] = purchasing_totals_df[
                                                'Average Purchase Price'].apply('${:.2f}'.format)
purchasing_totals_df['Total Revenue'] = purchasing_totals_df[
                                                'Total Revenue'].apply('${:.2f}'.format)

purchasing_totals_df

| | Number of Unique Items | Average Purchase Price | Total Number of Purchases | Total Revenue |
|---|---|---|---|---|
| 0 | 183 | $2.93 | 780 | $2286.33 |


### Gender Demographics

In [15]:
df_unique_users = df.drop_duplicates(subset=['SN'])[['SN','Age','Gender']].copy() #one row per unique user; copy so columns can be added later without a SettingWithCopyWarning

gender_counts = df_unique_users.Gender.value_counts()

#convert gender counts to a df; naming the column here avoids relying on the default column name, which differs across pandas versions
df_gender_counts = gender_counts.to_frame(name='Total Count')

df_gender_counts['% of Players'] = (df_gender_counts['Total Count'] / df_gender_counts['Total Count'].sum())*100
df_gender_counts['% of Players'] = df_gender_counts['% of Players'].apply('{:.2f}%'.format)

df_gender_counts = df_gender_counts[['% of Players','Total Count']] #reverse column order

gender_counts  = gender_counts.to_dict()  #store the gender user counts to a dictionary for later use

df_gender_counts

| Gender | % of Players | Total Count |
|---|---|---|
| Male | 81.15% | 465 |
| Female | 17.45% | 100 |
| Other / Non-Disclosed | 1.40% | 8 |


### Purchasing Analysis (Gender)

In [20]:
#'count' counts rows per group; np.count_nonzero would silently skip any zero values
gender_analysis = df.groupby('Gender').agg({'Age': 'count', 
                                           'Price': 'mean'})
gender_analysis.rename(columns = {'Age':'Purchase Count', 'Price': 'Average Purchase Price'}, inplace=True) 

gender_analysis['Total Purchase Value'] = df.groupby('Gender')['Price'].sum()

#normalized totals is looking for average purchase total per user (instead of per purchase)
#create a list of values that can then be passed into dataframe column
num_rows_in_gender_analysis = len(gender_analysis['Total Purchase Value'])
gender_norm_tot = [gender_analysis['Total Purchase Value'].iloc[row] / #lookup total purchase value by position
                 gender_counts[gender_analysis.index[row]] #lookup the user count based on gender index
                 for row in range(0, num_rows_in_gender_analysis)]

gender_analysis['Normalized Totals'] = gender_norm_tot

gender_analysis[['Average Purchase Price', 'Total Purchase Value', 'Normalized Totals']]=gender_analysis[
                ['Average Purchase Price', 'Total Purchase Value', 'Normalized Totals']].applymap(
                                                                                '${:.2f}'.format)
gender_analysis

| Gender | Purchase Count | Average Purchase Price | Total Purchase Value | Normalized Totals |
|---|---|---|---|---|
| Female | 136 | $2.82 | $382.91 | $3.83 |
| Male | 633 | $2.95 | $1867.68 | $4.02 |
| Other / Non-Disclosed | 11 | $3.25 | $35.74 | $4.47 |

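The Normalized Totals column is each gender's total purchase value divided by its number of unique players. As a minimal sketch of the same arithmetic, the figures from the two gender tables can be divided as index-aligned Series instead of looping by row position. The variable names here are illustrative and not part of the notebook.

```python
import pandas as pd

# Figures copied from the gender tables in this notebook.
total_purchase_value = pd.Series({
    'Female': 382.91,
    'Male': 1867.68,
    'Other / Non-Disclosed': 35.74,
})
unique_players = pd.Series({
    'Male': 465,
    'Female': 100,
    'Other / Non-Disclosed': 8,
})

# Series division aligns on index labels, so the differing row order is harmless.
normalized_totals = (total_purchase_value / unique_players).round(2)
print(normalized_totals)
```

The rounded values agree with the Normalized Totals column shown in the table.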

### Age Demographics

In [21]:
min_age = df_unique_users.Age.min()
max_age = df_unique_users.Age.max()

bins = list(range(9,max_age,5)) #starts with first bin edge at 9 and then counts up by 5 until one before max
labels = [str(bins[idx]+1)+"-"+str(bins[idx+1]) for idx in range(0,len(bins)-1)] #names the labels for the bin range (inclusive)

#insert the starting and ending bins/labels to capture all data
bins.insert(0,0)
labels.insert(0,'<10')
bins.append(max_age)
labels.append(str(bins[-2]+1)+"+")
#print(bins) # temporary check
#print(labels) # temporary check

#add bins cut to both the unique users df and the overall df for later use
df_unique_users['Age Group'] = pd.cut(df_unique_users['Age'], bins=bins, labels=labels)
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)

age_group_counts = df_unique_users['Age Group'].value_counts(sort=False)

#convert age counts to a df; naming the column here avoids relying on the default column name, which differs across pandas versions
df_age_group_counts = age_group_counts.to_frame(name='Total Count')

df_age_group_counts['% of Players'] = (df_age_group_counts['Total Count'] / df_age_group_counts['Total Count'].sum())*100
df_age_group_counts['% of Players'] = df_age_group_counts['% of Players'].apply('{:.2f}%'.format)
df_age_group_counts = df_age_group_counts[['% of Players','Total Count']] #reverse column order

age_group_counts  = age_group_counts.to_dict() #convert age group counts to dict for later use

df_age_group_counts

| Age Group | % of Players | Total Count |
|---|---|---|
| <10 | 3.32% | 19 |
| 10-14 | 4.01% | 23 |
| 15-19 | 17.45% | 100 |
| 20-24 | 45.20% | 259 |
| 25-29 | 15.18% | 87 |
| 30-34 | 8.20% | 47 |
| 35-39 | 4.71% | 27 |
| 40-44 | 1.75% | 10 |
| 45+ | 0.17% | 1 |

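The age groups depend on `pd.cut` treating each bin as closed on the right, so an edge of 14 puts age 14 in the 10-14 group. The sketch below illustrates this with made-up ages. The bin edges are the ones the cell above would generate if the maximum age were 45, which is an assumption based on the "45+" label.

```python
import pandas as pd

# Made-up ages chosen to sit on or next to the bin edges.
ages = pd.Series([7, 9, 10, 14, 15, 24, 45])

# Edges and labels as the notebook cell would build them for an assumed max age of 45.
bins = [0, 9, 14, 19, 24, 29, 34, 39, 44, 45]
labels = ['<10', '10-14', '15-19', '20-24', '25-29',
          '30-34', '35-39', '40-44', '45+']

# Default right=True gives intervals (0, 9], (9, 14], ..., (44, 45].
age_group = pd.cut(ages, bins=bins, labels=labels)
result = age_group.astype(str).tolist()
print(result)
```

Because the first interval is open on the left at 0, an age of 0 would fall outside every bin and come back as NaN.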

### Purchasing Analysis (Age)

In [23]:
#'count' counts rows per group; np.count_nonzero would silently skip any zero values
age_analysis = df.groupby('Age Group').agg({'Age': 'count', 
                                           'Price': 'mean'})
age_analysis.rename(columns = {'Age':'Purchase Count', 'Price': 'Average Purchase Price'}, inplace=True) 

age_analysis['Total Purchase Value'] = df.groupby('Age Group')['Price'].sum()

#normalized totals is looking for average purchase total per user (instead of per purchase)
#create a list of values that can then be passed into dataframe column
num_rows_in_age_analysis = len(age_analysis['Total Purchase Value'])
age_norm_tot = [age_analysis['Total Purchase Value'].iloc[row] / #lookup total purchase value by position
                 age_group_counts[age_analysis.index[row]] #lookup the user count based on age group index
                 for row in range(0, num_rows_in_age_analysis)]

age_analysis['Normalized Totals'] = age_norm_tot

age_analysis[['Average Purchase Price', 'Total Purchase Value', 'Normalized Totals']]=age_analysis[
                ['Average Purchase Price', 'Total Purchase Value', 'Normalized Totals']].applymap(
                                                                                '${:.2f}'.format)
age_analysis

| Age Group | Purchase Count | Average Purchase Price | Total Purchase Value | Normalized Totals |
|---|---|---|---|---|
| <10 | 28 | $2.98 | $83.46 | $4.39 |
| 10-14 | 35 | $2.77 | $96.95 | $4.22 |
| 15-19 | 133 | $2.91 | $386.42 | $3.86 |
| 20-24 | 336 | $2.91 | $978.77 | $3.78 |
| 25-29 | 125 | $2.96 | $370.33 | $4.26 |
| 30-34 | 64 | $3.08 | $197.25 | $4.20 |
| 35-39 | 42 | $2.84 | $119.40 | $4.42 |
| 40-44 | 16 | $3.19 | $51.03 | $5.10 |
| 45+ | 1 | $2.72 | $2.72 | $2.72 |

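Observed Trend 1 treats ages 15-24 as a single range, while the tables split it into two groups. A quick way to get the combined figure is to pool the two groups' totals and player counts, as sketched here with numbers copied from the age tables. The variable names are illustrative.

```python
# (total purchase value, unique players) copied from the age tables above.
age_groups = {
    '15-19': (386.42, 100),
    '20-24': (978.77, 259),
}
total_players = 573  # total unique players reported earlier in the notebook

combined_value = sum(value for value, _ in age_groups.values())
combined_players = sum(players for _, players in age_groups.values())

# Spend per player for the pooled 15-24 range, and that range's share of all players.
combined_normalized = round(combined_value / combined_players, 2)
combined_player_share = round(100 * combined_players / total_players, 2)
print(combined_normalized, combined_player_share)
```

Pooled this way, the 15-24 range comes to about $3.80 per player across roughly 63% of all players.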

### Top Spenders

In [30]:
#'count' counts rows per user, which is the number of purchases
user_spending = df.groupby('SN').agg({'Item Name': 'count', 
                                           'Price': 'mean'})
user_spending.rename(columns = {'Item Name':'Purchase Count', 'Price': 'Average Purchase Price'}, inplace=True) 

user_spending['Total Purchase Value'] = df.groupby('SN')['Price'].sum()

#get top 5 values by total purchase value before formatting changes to strings
user_spending_top5 = user_spending.nlargest(5, 'Total Purchase Value')
user_spending_top5[['Average Purchase Price', 'Total Purchase Value']]=user_spending_top5[
               ['Average Purchase Price', 'Total Purchase Value']].applymap('${:.2f}'.format)

user_spending_top5

| SN | Purchase Count | Average Purchase Price | Total Purchase Value |
|---|---|---|---|
| Undirrala66 | 5 | $3.41 | $17.06 |
| Saedue76 | 4 | $3.39 | $13.56 |
| Mindimnya67 | 4 | $3.18 | $12.74 |
| Haellysu29 | 3 | $4.24 | $12.73 |
| Eoda93 | 3 | $3.86 | $11.58 |

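Observed Trend 2 depends on how many purchases each user makes. The sketch below shows one way to tally that distribution, using a small made-up purchase log rather than `purchase_data.json`. The screen names and prices are hypothetical, and only the `SN` and `Price` column names mirror the dataset.

```python
import pandas as pd

# Hypothetical purchase log: one row per purchase, as in the real dataset.
toy = pd.DataFrame({
    'SN': ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'D'],
    'Price': [1.0, 2.0, 3.0, 4.0, 2.5, 1.5, 3.5, 2.0],
})

# Number of purchases made by each user.
purchases_per_user = toy.groupby('SN').size()

# How many users made exactly 1, 2, 3, ... purchases.
users_by_purchase_count = purchases_per_user.value_counts().sort_index()

# Users with more than three purchases.
heavy_buyers = int((purchases_per_user > 3).sum())
print(users_by_purchase_count)
print(heavy_buyers)
```

Running the same three steps on the notebook's `df` would give the counts behind the "more than three purchases" statement.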

### Most Popular Items

In [25]:
#'count' counts rows per item, which is the number of purchases
popular_items = df.groupby(['Item ID', 'Item Name']).agg({'SN': 'count', 
                                                       'Price': 'mean'})
popular_items.rename(columns = {'SN':'Purchase Count', 'Price': 'Item Price'}, inplace=True) 

popular_items['Total Purchase Value'] = df.groupby(['Item ID','Item Name'])['Price'].sum()
#popular_items

#get top 5 purchased items by purchase count before formatting changes to strings
popular_items_top5 = popular_items.nlargest(5, 'Purchase Count')
#popular_items_top5
popular_items_top5[['Item Price', 'Total Purchase Value']]=popular_items_top5[
                    ['Item Price', 'Total Purchase Value']].applymap('${:.2f}'.format)

popular_items_top5

| Item ID | Item Name | Purchase Count | Item Price | Total Purchase Value |
|---|---|---|---|---|
| 39 | Betrayal, Whisper of Grieving Widows | 11 | $2.35 | $25.85 |
| 84 | Arcane Gem | 11 | $2.23 | $24.53 |
| 13 | Serenity | 9 | $1.49 | $13.41 |
| 31 | Trickster | 9 | $2.07 | $18.63 |
| 34 | Retribution Axe | 9 | $4.14 | $37.26 |


### Most Profitable Items

In [27]:
#reuse the popular_items data frame and take the top 5 items by total purchase value
profitable_items_top5 = popular_items.nlargest(5, 'Total Purchase Value')
#profitable_items_top5
profitable_items_top5[['Item Price', 'Total Purchase Value']]=profitable_items_top5[
                    ['Item Price', 'Total Purchase Value']].applymap('${:.2f}'.format)

profitable_items_top5

| Item ID | Item Name | Purchase Count | Item Price | Total Purchase Value |
|---|---|---|---|---|
| 34 | Retribution Axe | 9 | $4.14 | $37.26 |
| 115 | Spectral Diamond Doomblade | 7 | $4.25 | $29.75 |
| 32 | Orenmir | 6 | $4.95 | $29.70 |
| 103 | Singed Scalpel | 6 | $4.87 | $29.22 |
| 107 | Splitter, Foe Of Subtlety | 8 | $3.61 | $28.88 |

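Observed Trend 3 notes how small a share of sales the top items account for. A rough check of that, using only figures from the tables in this notebook (the variable names are illustrative):

```python
# Totals from the Purchasing Analysis (Total) table.
total_purchases = 780
total_revenue = 2286.33

# (purchase count, total purchase value) for the two most popular items.
top_two_sellers = [(11, 25.85), (11, 24.53)]
# Total purchase value of the five highest-revenue items.
top_five_revenue = [37.26, 29.75, 29.70, 29.22, 28.88]

top_two_purchase_share = 100 * sum(c for c, _ in top_two_sellers) / total_purchases
top_two_revenue_share = 100 * sum(v for _, v in top_two_sellers) / total_revenue
top_five_revenue_share = 100 * sum(top_five_revenue) / total_revenue

print(round(top_two_purchase_share, 2),
      round(top_two_revenue_share, 2),
      round(top_five_revenue_share, 2))
```

The two best sellers make up under 3% of purchases, and the five highest-revenue items together bring in under 7% of revenue, so sales are spread thinly across the 183 items.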