## Metrics calculation (mini-project)

**Importing libraries**

In [98]:
import pandas as pd

**Importing the dataset from a zipped file**

In [99]:
df = pd.read_csv('august_data.zip', compression='zip')

df

| | ga:date | ga:clientid | userID | ga:transaction_id | ga:revenue | Unnamed: 5 | ga:user |
|---|---|---|---|---|---|---|---|
| 0 | 28-08-2019 12:29:24 | 2.802509e+08 | 7186054 | 383919 | 28103 | NaN | 141000.0 |
| 1 | 28-08-2019 12:27:12 | 8.196637e+08 | 7186010 | 97225 | 177697 | NaN | NaN |
| 2 | 28-08-2019 11:43:24 | 1.751156e+09 | 7184859 | 385087 | 64892 | NaN | NaN |
| 3 | 28-08-2019 11:40:50 | 5.515333e+08 | 7186029 | 385392 | 38816 | NaN | NaN |
| 4 | 28-08-2019 11:25:31 | 4.527935e+08 | 7183548 | 385871 | 3112 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1006 | 01-08-2019 01:33:53 | 5.085028e+08 | 7186781 | 358692 | 9280 | NaN | NaN |
| 1007 | 01-08-2019 01:27:45 | 4.152444e+08 | 7186780 | 359792 | 2899 | NaN | NaN |
| 1008 | 01-08-2019 01:23:40 | 6.964930e+08 | 7186782 | 377751 | 8900 | NaN | NaN |
| 1009 | 01-08-2019 01:18:14 | 4.152444e+08 | 7186780 | 377721 | 8204 | NaN | NaN |


**Getting basic info about data types and missing values**

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1011 entries, 0 to 1010
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ga:date            1011 non-null   object 
 1   ga:clientid        1011 non-null   float64
 2   userID             1011 non-null   object 
 3   ga:transaction_id  1011 non-null   object 
 4   ga:revenue         1011 non-null   int64  
 5   Unnamed: 5         0 non-null      float64
 6   ga:user            1 non-null      float64
dtypes: float64(3), int64(1), object(3)
memory usage: 55.4+ KB


###### Column descriptions
* **ga:date** – date and time of the record (`DD-MM-YYYY HH:MM:SS`)
* **ga:clientid** – client id from Google Analytics
* **userID** – client id from another analytical system
* **ga:transaction_id** – transaction id
* **ga:revenue** – revenue
* **ga:user** – the overall number of unique users (filled in only in the first row)

**Show the number of unique users**

In [101]:
# 'ga:user' is the 7th column; its only non-null value sits in the first row
uniques = df.iloc[0, 6]

uniques

141000.0

**Cleaning dataframe**

In [102]:
# grabbing the first five column names (this leaves out 'Unnamed: 5' and 'ga:user')
column_names = df.columns[0:5]
column_names

Index(['ga:date', 'ga:clientid', 'userID', 'ga:transaction_id', 'ga:revenue'], dtype='object')

In [103]:
# slicing our dataset
df = df[column_names]

In [104]:
# dropping 'userID' column (because we need data only from Google Analytics)
df = df.drop(['userID'], axis=1)

In [105]:
# renaming columns for convenience: strip the 3-character 'ga:' prefix
df = df.rename(columns=lambda x: x[3:])

df

| | date | clientid | transaction_id | revenue |
|---|---|---|---|---|
| 0 | 28-08-2019 12:29:24 | 2.802509e+08 | 383919 | 28103 |
| 1 | 28-08-2019 12:27:12 | 8.196637e+08 | 97225 | 177697 |
| 2 | 28-08-2019 11:43:24 | 1.751156e+09 | 385087 | 64892 |
| 3 | 28-08-2019 11:40:50 | 5.515333e+08 | 385392 | 38816 |
| 4 | 28-08-2019 11:25:31 | 4.527935e+08 | 385871 | 3112 |
| ... | ... | ... | ... | ... |
| 1006 | 01-08-2019 01:33:53 | 5.085028e+08 | 358692 | 9280 |
| 1007 | 01-08-2019 01:27:45 | 4.152444e+08 | 359792 | 2899 |
| 1008 | 01-08-2019 01:23:40 | 6.964930e+08 | 377751 | 8900 |
| 1009 | 01-08-2019 01:18:14 | 4.152444e+08 | 377721 | 8204 |
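
The same cleaning can also be done by column name rather than by position, which may hold up better if the column order of the export changes. A minimal sketch on a made-up two-row frame; the names `raw`, `ga_columns` and `clean` and all values are illustrative, not taken from the notebook:

```python
import pandas as pd

# Toy frame with the same column layout as the export (values are made up).
raw = pd.DataFrame({
    "ga:date": ["28-08-2019 12:29:24", "01-08-2019 01:18:14"],
    "ga:clientid": [2.802509e08, 4.152444e08],
    "userID": ["7186054", "7186780"],
    "ga:transaction_id": ["383919", "377721"],
    "ga:revenue": [28103, 8204],
    "Unnamed: 5": [float("nan"), float("nan")],
    "ga:user": [141000.0, float("nan")],
})

# Keep only the Google Analytics columns, except the single-value 'ga:user'.
ga_columns = [c for c in raw.columns if c.startswith("ga:") and c != "ga:user"]

# Select them and strip the 'ga:' prefix in one pass.
clean = raw[ga_columns].rename(columns=lambda c: c[len("ga:"):])

print(list(clean.columns))
```

Selecting by the `ga:` prefix drops `userID` and `Unnamed: 5` without a separate `drop` call.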


**Find the number of paying clients**

In [119]:
clients = df[df['revenue'] > 0].clientid.nunique()

clients

685

**Calculate the conversion rate**

In [154]:
cr = round((clients / uniques), 4)

cr

0.0049
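
Written out with the numbers from the cells above, the conversion rate here is the share of unique users who made at least one transaction with positive revenue:

$$
\mathrm{CR} = \frac{\text{paying clients}}{\text{unique users}} = \frac{685}{141000} \approx 0.0049
$$

That is roughly 0.49%.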

**Find the average check**

In [125]:
round(df.revenue.sum() / df[df['revenue'] > 0].shape[0], 0)

34458.0

**Calculate the average number of purchases per user**

In [142]:
round(df[df['revenue'] > 0].groupby(['clientid'], as_index=False) \
    .agg({'transaction_id': 'count'}) \
    .transaction_id.mean(), 2)

1.36
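
The per-client average above should equal the number of paid transactions divided by the number of paying clients, as long as `transaction_id` has no missing values. A small sketch on made-up data checks this; `toy`, `paid`, `via_groupby` and `via_ratio` are hypothetical names introduced for the example:

```python
import pandas as pd

# Made-up transactions: client 1 buys twice, client 2 only has a
# zero-revenue row, client 3 buys three times.
toy = pd.DataFrame({
    "clientid": [1, 1, 2, 3, 3, 3],
    "transaction_id": ["a", "b", "c", "d", "e", "f"],
    "revenue": [10, 20, 0, 5, 5, 5],
})
paid = toy[toy["revenue"] > 0]

# Same idea as the cell above: count transactions per client, then average.
via_groupby = paid.groupby("clientid")["transaction_id"].count().mean()

# Shortcut: paid transactions divided by paying clients.
via_ratio = len(paid) / paid["clientid"].nunique()

print(via_groupby, via_ratio)
```

Both expressions give the same value on this toy frame; the shortcut avoids the intermediate grouped table.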

**Find ARPPU**

In [150]:
arppu = round(df.revenue.sum() / clients, 0)

arppu

46883.0
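
ARPPU also factors into the two metrics computed earlier, which gives a quick sanity check. The symbols are introduced here for illustration: $R$ is total revenue, $T$ the number of transactions with positive revenue, and $C$ the number of paying clients.

$$
\mathrm{ARPPU} = \frac{R}{C} = \frac{R}{T} \cdot \frac{T}{C} = \text{average check} \times \text{purchases per user} \approx 34458 \times 1.36 \approx 46863
$$

The small gap from 46883 comes from multiplying two already rounded factors.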

**Calculate ARPU**

In [156]:
arpu = round(arppu * cr, 0)

arpu

230.0
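
Because `cr` was rounded to four decimals before being multiplied, the ARPU above carries a visible rounding error. The sketch below reuses the values printed in the earlier cells to compare the two results; note that `arppu` is itself a rounded figure, so the second value is still an approximation:

```python
# Values copied from the cells above.
arppu = 46883.0    # already rounded to whole units
clients = 685
uniques = 141000.0

cr_rounded = round(clients / uniques, 4)  # 0.0049, as in the notebook
cr_exact = clients / uniques              # unrounded conversion rate

arpu_from_rounded_cr = round(arppu * cr_rounded, 0)
arpu_from_exact_cr = round(arppu * cr_exact, 0)

print(arpu_from_rounded_cr, arpu_from_exact_cr)  # 230.0 228.0
```

In the notebook itself, `round(df.revenue.sum() / uniques, 0)` would avoid both intermediate rounding steps, assuming `uniques` is the intended denominator.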