- The majority of Pymoli players are males between the ages of 15-29. These players account for the majority of purchases by count and revenue.
- The average purchase size is just under $3.00. All but two players made 3 or fewer purchases.
- Our most popular items are $2.35 or under, which is in line with the average purchase price.
Potential courses of action include slightly increasing items prices, while keeping the prices under $3.00, to earn more money off typical purchasing, or increasing the availability of items below $2.35 to encourage more purchasing.
# ------------------------------------------------------
# Step 0: Import modules
# ------------------------------------------------------
import pandas as pd
from IPython.display import display, HTML
import matplotlib.pyplot as plt
# ------------------------------------------------------
# Step 1: Get the filename, then Read in and clean data.
# ------------------------------------------------------
# greet the user instructions
print("Welcome to Pymoli Data Analysis!")
filename = 'data/purchase_data.json'
# Read in the data file
purchases_df = pd.read_json(filename, orient='records')
# make sure everything is the right data type to use
purchases_df['Price']= purchases_df['Price'].replace("%","", regex=True).astype(float)
# check and remove null values
purchases_df.dropna(how='any')
purchases_df.to_csv("test.csv")
Welcome to Pymoli Data Analysis!
# -----------------------------------------------------------------------
# Step 2: Analyze the players: Look at
# Total number of players & Gender demographics (Double-checked in Excel)
# -----------------------------------------------------------------------
# Total Number of players
num_players_total = len(purchases_df.groupby('SN').count())
# remove all dupes to account for players who made multiple purchases
purchases_noDupes_df = purchases_df.set_index('SN')
purchases_noDupes_df = purchases_noDupes_df[~purchases_noDupes_df.index.duplicated(keep='first')]
# count and % of male players
num_players_male = purchases_noDupes_df[purchases_noDupes_df['Gender']=='Male'].count()['Age']
percent_players_male = num_players_male/num_players_total*100
# count and % of female players
num_players_female = purchases_noDupes_df[purchases_noDupes_df['Gender']=='Female'].count()['Age']
percent_players_female = num_players_female/num_players_total*100
# count and % of other/non-disclosed players
num_players_other = purchases_noDupes_df[purchases_noDupes_df['Gender']=='Other / Non-Disclosed'].count()['Age']
percent_players_other = num_players_other/num_players_total*100
# create dataframes to hold these result
gender_demographics_num_df = pd.DataFrame.from_dict({"Male":num_players_male, "Female":num_players_female,\
"Other / Non-Disclosed":num_players_other}, orient='index')
gender_demographics_percent_df = pd.DataFrame.from_dict({"Male":percent_players_male, "Female":percent_players_female,\
"Other / Non-Disclosed":percent_players_other}, orient='index')
# merge dataframes
gender_demographics_df = gender_demographics_num_df.merge(gender_demographics_percent_df, how='outer',\
left_index=True, right_index=True)
# rename columns
gender_demographics_df.rename(columns={'0_x': 'Number of Players', '0_y': 'Percent of Players'}, inplace=True)
# format results
gender_demographics_df['Percent of Players'] = gender_demographics_df['Percent of Players'].map('{0:.2f}%'.format)
# print out results
display(gender_demographics_df)
# create labels and wedge sizes
labels = ['Male', 'Female', 'Other / Non-Disclosed']
wedge_sizes = [gender_demographics_num_df[0]['Male'],\
gender_demographics_num_df[0]['Female'],\
gender_demographics_num_df[0]['Other / Non-Disclosed']]
# create pie chart
fig1, ax1 = plt.subplots()
ax1.pie(wedge_sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.legend(loc='lower right')
plt.show()
# ----------------------------------
# End Step 2
# ----------------------------------
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Number of Players | Percent of Players | |
---|---|---|
Male | 465 | 81.15% |
Female | 100 | 17.45% |
Other / Non-Disclosed | 8 | 1.40% |
As can be seen above, the majority of Pymoli players are male.
Includes Data on Top Spenders, Overall Purchasing, Purchases Split by Self-Identified Gender, & Age Demographics
# ---------------------------------------------------
# Step 3: Purchasing Analysis
# 1) Top Spenders
# 2) Purchasing Analysis (Total)
# 3) Purchasing Analysis (Gender)
# 4) Age Demographics
# ---------------------------------------------------
# ---------------------------------------------------
# Question 1 - Top Spenders (Double-checked in Excel)
# ---------------------------------------------------
# identify top five spender by total puchase value
purchases_byPurchaseValue_df = purchases_df.groupby('SN').sum().sort_values('Price',ascending=False)
top_five_spenders_bySN = list(purchases_byPurchaseValue_df[0:5].reset_index()['SN'])
# create purchase count dataframe
top_five_purchase_count_df = pd.DataFrame.from_dict(dict(purchases_df[purchases_df['SN'].isin(top_five_spenders_bySN)].\
groupby('SN').count()))
top_five_purchase_count_df = top_five_purchase_count_df.drop(['Age', 'Gender', 'Item Name', 'Price'], 1)
top_five_purchase_count_df.rename(columns={'Item ID' : 'Purchase Count'}, inplace=True)
# create purchase average price dataframe
top_five_purchase_avg_df = pd.DataFrame.from_dict(dict(purchases_df[purchases_df['SN'].isin(top_five_spenders_bySN)].\
groupby('SN').mean()))
top_five_purchase_avg_df = top_five_purchase_avg_df.drop(['Age', 'Item ID'], 1)
top_five_purchase_avg_df.rename(columns={'Price' : 'Average Purchase Price'}, inplace=True)
# create total purchases dataframe
top_five_purchase_total_df = pd.DataFrame.from_dict(dict(purchases_df[purchases_df['SN'].isin(top_five_spenders_bySN)].\
groupby('SN').sum()))
top_five_purchase_total_df = top_five_purchase_total_df.drop(['Age', 'Item ID'], 1)
top_five_purchase_total_df.rename(columns={'Price' : 'Total Amount Purchased'}, inplace=True)
# merge the dataframes together to create one table
top_spenders_analysis_df = top_five_purchase_count_df.merge(top_five_purchase_avg_df, how='outer',\
left_index=True, right_index=True)
top_spenders_analysis_df = top_spenders_analysis_df.merge(top_five_purchase_total_df, how='outer',\
left_index=True, right_index=True)
# fix the formatting for the currency columns: df['cost'] = df['cost'].map('${:,.2f}'.format)
top_spenders_analysis_df['Average Purchase Price'] = top_spenders_analysis_df['Average Purchase Price'].map('${:,.2f}'.format)
top_spenders_analysis_df['Total Amount Purchased'] = top_spenders_analysis_df['Total Amount Purchased'].map('${:,.2f}'.format)
# sort by total amount purchased
top_spenders_analysis_df = top_spenders_analysis_df.sort_values('Total Amount Purchased', ascending=False)
display(top_spenders_analysis_df)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Purchase Count | Average Purchase Price | Total Amount Purchased | |
---|---|---|---|
SN | |||
Undirrala66 | 5 | $3.41 | $17.06 |
Saedue76 | 4 | $3.39 | $13.56 |
Mindimnya67 | 4 | $3.18 | $12.74 |
Haellysu29 | 3 | $4.24 | $12.73 |
Eoda93 | 3 | $3.86 | $11.58 |
Even the top spenders don't make an extraordinary number of purchases. The top 3 spenders made 4-5 purchases, and everyone else made 3 or less.
# --------------------------------------------------------------------
# Question 2 - Purchasing Analysis (Total) (Double-checked in Excel)
# --------------------------------------------------------------------
# number of unique items purchased
num_items_unique = len(purchases_df['Item ID'].value_counts())
# average purchase price
overall_purchase_avg = '${:,.2f}'.format(purchases_df.mean()['Price'])
# total number of purchases
overall_purchase_count = purchases_df.count()['Item ID']
# total revenue
overall_purchase_revenue = '${:,.2f}'.format(purchases_df['Price'].sum())
# create dataframe to display results
purchasing_total_analysis_df = pd.DataFrame({'Number of Unique Items Purchased':[num_items_unique],\
'Average Purchase Total':[overall_purchase_avg],\
'Total Number of Purchases':[overall_purchase_count],\
'Total Revenue':[overall_purchase_revenue]})
# rearrange columns in a more sensible way
purchasing_total_analysis_df = purchasing_total_analysis_df[['Total Revenue', 'Total Number of Purchases',\
'Average Purchase Total', 'Number of Unique Items Purchased']]
display(purchasing_total_analysis_df)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Total Revenue | Total Number of Purchases | Average Purchase Total | Number of Unique Items Purchased | |
---|---|---|---|---|
0 | $2,286.33 | 780 | $2.93 | 183 |
The average purchase size is under $3.
# --------------------------------------------------------------------
# Question 3 - Purchasing Analysis (Gender) (Double-checked with Lena)
# --------------------------------------------------------------------
# purchase count by gender
gender_purchase_count_df = pd.DataFrame.from_dict(dict(purchases_df.groupby('Gender').count()))
gender_purchase_count_df = gender_purchase_count_df.drop(['Age', 'Item Name', 'SN', 'Price'], 1)
gender_purchase_count_df.rename(columns={'Item ID' : 'Purchase Count'}, inplace=True)
# average purchase price by gender
gender_purchase_avg_df = pd.DataFrame.from_dict(dict(purchases_df.groupby('Gender').mean()))
gender_purchase_avg_df = gender_purchase_avg_df.drop(['Age', 'Item ID'], 1)
gender_purchase_avg_df.rename(columns={'Price' : 'Average Purchase Price'}, inplace=True)
# sum of all purchases by gender
gender_purchase_total_df = pd.DataFrame.from_dict(dict(purchases_df.groupby('Gender').sum()))
gender_purchase_total_df = gender_purchase_total_df.drop(['Age', 'Item ID'], 1)
gender_purchase_total_df.rename(columns={'Price' : 'Total Purchases'}, inplace=True)
# create dataframe to display results
purchasing_gender_analysis_df = gender_purchase_count_df.merge(gender_purchase_total_df, how='outer',\
left_index=True, right_index=True)
purchasing_gender_analysis_df = purchasing_gender_analysis_df.merge(gender_purchase_avg_df, how='outer',\
left_index=True, right_index=True)
# add normalized totals
purchasing_gender_analysis_df['Normalized Total'] = purchasing_gender_analysis_df['Total Purchases']/gender_demographics_df['Number of Players']
# add percent count column
purchasing_gender_analysis_df['Percent Purchase (Count)'] = purchasing_gender_analysis_df['Purchase Count']/\
int(purchasing_total_analysis_df['Total Number of Purchases'])*100
# format columns
purchasing_gender_analysis_df['Total Purchases'] = purchasing_gender_analysis_df['Total Purchases'].map('${:,.2f}'.format)
purchasing_gender_analysis_df['Average Purchase Price'] = purchasing_gender_analysis_df['Average Purchase Price'].map('${:,.2f}'.format)
purchasing_gender_analysis_df['Normalized Total'] = purchasing_gender_analysis_df['Normalized Total'].map('${:,.2f}'.format)
purchasing_gender_analysis_df['Percent Purchase (Count)'] = purchasing_gender_analysis_df['Percent Purchase (Count)'].map('{0:.2f}%'.format)
# rearrange columns
purchasing_gender_analysis_df = purchasing_gender_analysis_df[['Purchase Count', 'Percent Purchase (Count)',\
'Total Purchases','Average Purchase Price',\
'Normalized Total']]
display(purchasing_gender_analysis_df)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Purchase Count | Percent Purchase (Count) | Total Purchases | Average Purchase Price | Normalized Total | |
---|---|---|---|---|---|
Gender | |||||
Female | 136 | 17.44% | $382.91 | $2.82 | $3.83 |
Male | 633 | 81.15% | $1,867.68 | $2.95 | $4.02 |
Other / Non-Disclosed | 11 | 1.41% | $35.74 | $3.25 | $4.47 |
Just as the majority of users are male, the majority of purchases are made by male users. They accounted for 81.15% of purchases by count and 81.69% by purchase amount.
# ----------------------------------------------------------------------------
# Question 4 - Purchasing Analysis (Age) (partially double-checked with Excel)
# ----------------------------------------------------------------------------
ages_df = purchases_df.sort_values('Age')
# create bins: <10, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-45, 45+
bins = [0, 10, 15, 20, 25, 30, 35, 40, 45, 100]
# name the bins
bin_names = ['<10', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45+']
# create a usable version of the dataframe for analysis
purchases_df['Age Range'] = pd.cut(purchases_df['Age'], bins, labels=bin_names, right=False)
# generate dataframes with the data we need
# purchase count
age_purchase_count_df = pd.DataFrame.from_dict(dict(purchases_df.groupby('Age Range').count()))
age_purchase_count_df = age_purchase_count_df.drop(['Age', 'Gender', 'Item ID', 'Item Name', 'Price'], 1)
age_purchase_count_df.rename(columns={'SN' : 'Purchase Count'}, inplace=True)
# purchase total
age_purchase_total_df = pd.DataFrame.from_dict(dict(purchases_df.groupby('Age Range').sum()))
age_purchase_total_df = age_purchase_total_df.drop(['Age', 'Item ID'], 1)
age_purchase_total_df.rename(columns={'Price' : 'Total Purchases'}, inplace=True)
# average purchase price
age_purchase_avg_df = pd.DataFrame.from_dict(dict(purchases_df.groupby('Age Range').mean()))
age_purchase_avg_df = age_purchase_avg_df.drop(['Age', 'Item ID'], 1)
age_purchase_avg_df.rename(columns={'Price' : 'Average Purchase Price'}, inplace=True)
# create age demographics for funzies and for later use
purchases_noDupes_round2_df = purchases_df.set_index('SN')
purchases_noDupes_round2_df = purchases_noDupes_round2_df[~purchases_noDupes_round2_df.index.duplicated(keep='first')]
age_demographics_df = purchases_noDupes_round2_df.groupby('Age Range').count()
age_demographics_df = age_demographics_df.drop(['Age', 'Gender', 'Item Name', 'Price'], 1)
age_demographics_df.rename(columns={'Item ID' : 'Number of Users'}, inplace=True)
# make a dataframe to store the results
purchasing_age_analysis_df = age_demographics_df.merge(age_purchase_count_df, how='outer',\
left_index=True, right_index=True)
purchasing_age_analysis_df = purchasing_age_analysis_df.merge(age_purchase_total_df, how='outer',\
left_index=True, right_index=True)
purchasing_age_analysis_df = purchasing_age_analysis_df.merge(age_purchase_avg_df, how='outer',\
left_index=True, right_index=True)
# normalized totals
purchasing_age_analysis_df["Normalized Total"] = purchasing_age_analysis_df['Total Purchases']/age_demographics_df['Number of Users']
# format
purchasing_age_analysis_df['Total Purchases'] = purchasing_age_analysis_df['Total Purchases'].map('${:,.2f}'.format)
purchasing_age_analysis_df['Average Purchase Price'] = purchasing_age_analysis_df['Average Purchase Price'].map('${:,.2f}'.format)
purchasing_age_analysis_df['Normalized Total'] = purchasing_age_analysis_df['Normalized Total'].map('${:,.2f}'.format)
# print results
display(purchasing_age_analysis_df)
# --------------------------------------
# End Step 3
# --------------------------------------
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Number of Users | Purchase Count | Total Purchases | Average Purchase Price | Normalized Total | |
---|---|---|---|---|---|
Age Range | |||||
<10 | 19 | 28 | $83.46 | $2.98 | $4.39 |
10-14 | 23 | 35 | $96.95 | $2.77 | $4.22 |
15-19 | 100 | 133 | $386.42 | $2.91 | $3.86 |
20-24 | 259 | 336 | $978.77 | $2.91 | $3.78 |
25-29 | 87 | 125 | $370.33 | $2.96 | $4.26 |
30-34 | 47 | 64 | $197.25 | $3.08 | $4.20 |
35-39 | 27 | 42 | $119.40 | $2.84 | $4.42 |
40-44 | 10 | 16 | $51.03 | $3.19 | $5.10 |
45+ | 1 | 1 | $2.72 | $2.72 | $2.72 |
# --------------------------------------
# Step 4: Items analysis
# 1) Most Popular Items
# 2) Most Profitable Items
# --------------------------------------
# most popular items
items_byPopularity_df = purchases_df.groupby('Item ID').count().sort_values('Price', ascending=False)
top_five_item_byPopularity = list(items_byPopularity_df[0:5].reset_index()['Item ID'])
# create purchase count dataframe
top_five_item_popular_count_df = pd.DataFrame.from_dict(dict(purchases_df[purchases_df['Item ID'].isin(top_five_item_byPopularity)].\
groupby('Item ID').count()))
top_five_item_popular_count_df = top_five_item_popular_count_df.drop(['Age', 'SN', 'Gender', 'Price'], 1)
top_five_item_popular_count_df.rename(columns={'Item Name' : 'Purchase Count'}, inplace=True)
# create item price dataframe
top_five_popular_price_df = purchases_df[purchases_df['Item ID'].isin(top_five_item_byPopularity)]
top_five_popular_price_df = top_five_popular_price_df.set_index('Item ID')
top_five_popular_price_df = top_five_popular_price_df[~top_five_popular_price_df.\
index.duplicated(keep='first')].drop(['Age', 'Gender', 'Item Name',\
'SN','Age Range'], 1)
# create item name dataframe
top_five_popular_names_df = purchases_df[purchases_df['Item ID'].isin(top_five_item_byPopularity)]
top_five_popular_names_df = top_five_popular_names_df.set_index('Item ID')
top_five_popular_names_df = top_five_popular_names_df[~top_five_popular_names_df.\
index.duplicated(keep='first')].drop(['Age', 'Gender', 'Price',\
'SN','Age Range'], 1)
# create total purchases dataframe
top_five_item_popular_total_df = pd.DataFrame.from_dict(dict(purchases_df[purchases_df['Item ID'].isin(top_five_item_byPopularity)].\
groupby('Item ID').sum()))
top_five_item_popular_total_df = top_five_item_popular_total_df.drop('Age', 1)
top_five_item_popular_total_df.rename(columns={'Price' : 'Total Purchases'}, inplace=True)
# make a dataframe to store the results
top_five_items_byPurchaseCount_df = top_five_popular_names_df.merge(top_five_item_popular_count_df, how='outer',\
left_index=True, right_index=True)
top_five_items_byPurchaseCount_df = top_five_items_byPurchaseCount_df.merge(top_five_popular_price_df, how='outer',\
left_index=True, right_index=True)
top_five_items_byPurchaseCount_df = top_five_items_byPurchaseCount_df.merge(top_five_item_popular_total_df,\
how='outer', left_index=True,\
right_index=True)
# format the data
top_five_items_byPurchaseCount_df['Price'] = top_five_items_byPurchaseCount_df['Price'].map('${:,.2f}'.format)
top_five_items_byPurchaseCount_df['Total Purchases'] = top_five_items_byPurchaseCount_df['Total Purchases'].map('${:,.2f}'.format)
top_five_items_byPurchaseCount_df = top_five_items_byPurchaseCount_df.sort_values('Purchase Count',\
ascending=False)
display(top_five_items_byPurchaseCount_df)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Item Name | Age Range | Purchase Count | Price | Total Purchases | |
---|---|---|---|---|---|
Item ID | |||||
39 | Betrayal, Whisper of Grieving Widows | 11 | 11 | $2.35 | $25.85 |
84 | Arcane Gem | 11 | 11 | $2.23 | $24.53 |
13 | Serenity | 9 | 9 | $1.49 | $13.41 |
31 | Trickster | 9 | 9 | $2.07 | $18.63 |
175 | Woeful Adamantite Claymore | 9 | 9 | $1.24 | $11.16 |
# most profitable items
items_byTotalPurchases_df = purchases_df.groupby('Item ID').sum().sort_values('Price',ascending=False)
top_five_item_profit = list(items_byTotalPurchases_df[0:5].reset_index()['Item ID'])
# create purchase count dataframe
top_five_item_purchase_count_df = pd.DataFrame.from_dict(dict(purchases_df[purchases_df['Item ID'].isin(top_five_item_profit)].\
groupby('Item ID').count()))
top_five_item_purchase_count_df = top_five_item_purchase_count_df.drop(['Age', 'Age Range', 'SN', 'Gender', 'Price'], 1)
top_five_item_purchase_count_df.rename(columns={'Item Name' : 'Purchase Count'}, inplace=True)
# create item price dataframe
top_five_item_price_df = purchases_df[purchases_df['Item ID'].isin(top_five_item_profit)]
top_five_item_price_df = top_five_item_price_df.set_index('Item ID')
top_five_item_price_df = top_five_item_price_df[~top_five_item_price_df.\
index.duplicated(keep='first')].drop(['Age', 'Gender', 'Item Name',\
'SN', 'Age Range'], 1)
# create item name dataframe
top_five_item_names_df = purchases_df[purchases_df['Item ID'].isin(top_five_item_profit)]
top_five_item_names_df = top_five_item_names_df.set_index('Item ID')
top_five_item_names_df = top_five_item_names_df[~top_five_item_names_df.\
index.duplicated(keep='first')].drop(['Age', 'Gender', 'Price',\
'SN', 'Age Range'], 1)
# create total purchases dataframe
top_five_item_purchase_total_df = pd.DataFrame.from_dict(dict(purchases_df[purchases_df['Item ID'].isin(top_five_item_profit)].\
groupby('Item ID').sum()))
top_five_item_purchase_total_df = top_five_item_purchase_total_df.drop('Age', 1)
top_five_item_purchase_total_df.rename(columns={'Price' : 'Total Purchases'}, inplace=True)
# make a dataframe to store the results
top_five_items_byTotalPurchases_df = top_five_item_names_df.merge(top_five_item_purchase_count_df, how='outer',\
left_index=True, right_index=True)
top_five_items_byTotalPurchases_df = top_five_items_byTotalPurchases_df.merge(top_five_item_price_df, how='outer',\
left_index=True, right_index=True)
top_five_items_byTotalPurchases_df = top_five_items_byTotalPurchases_df.merge(top_five_item_purchase_total_df,\
how='outer', left_index=True,\
right_index=True)
# format the data
top_five_items_byTotalPurchases_df['Price'] = top_five_items_byTotalPurchases_df['Price'].map('${:,.2f}'.format)
top_five_items_byTotalPurchases_df['Total Purchases'] = top_five_items_byTotalPurchases_df['Total Purchases'].map('${:,.2f}'.format)
top_five_items_byTotalPurchases_df = top_five_items_byTotalPurchases_df.sort_values('Total Purchases',\
ascending=False)
display(top_five_items_byTotalPurchases_df)
# ----------------------------------
# End Step 4
# ----------------------------------
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Item Name | Purchase Count | Price | Total Purchases | |
---|---|---|---|---|
Item ID | ||||
34 | Retribution Axe | 9 | $4.14 | $37.26 |
115 | Spectral Diamond Doomblade | 7 | $4.25 | $29.75 |
32 | Orenmir | 6 | $4.95 | $29.70 |
103 | Singed Scalpel | 6 | $4.87 | $29.22 |
107 | Splitter, Foe Of Subtlety | 8 | $3.61 | $28.88 |