In [1]:
import pandas as pd

import json
import os

## Analyzing the payment data

Before developing an ETL script, I will analyze how the income data is structured so that it can be processed more reliably.

I know the column names and their meanings beforehand.

They are:
    - ClientId: Identifies the client that acquired the plan
    - AcquisitionDate: The date on which the plan was acquired
    - PaymentValue: The amount that was paid
    - Plan: The name of the plan along with the number of months contracted
    

In [2]:
columns_payment = ['ClientId', 'AcquisitionDate', 'PaymentValue', 'Plan']
df_payment = pd.read_csv('../data/payments.csv', names=columns_payment)

In [3]:
df_payment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95476 entries, 0 to 95475
Data columns (total 4 columns):
ClientId           95476 non-null int64
AcquisitionDate    95476 non-null object
PaymentValue       95476 non-null object
Plan               95476 non-null object
dtypes: int64(1), object(3)
memory usage: 2.9+ MB


In [4]:
df_payment.head()

Unnamed: 0,ClientId,AcquisitionDate,PaymentValue,Plan
0,4049,05/03/2017,"R$ 300,00",Bronze/3
1,1711,12/08/2018,"R$ 750,00",Ouro/3
2,3643,01/01/2017,"R$ 399,00",Platina/1
3,4683,09/06/2017,"R$ 2394,00",Platina/6
4,4645,25/04/2018,"R$ 250,00",Ouro/1


In [5]:
# Splitting the Plan Column into two: "Plan" and "AcquiredMonths"
df_payment[['Plan', 'AcquiredMonths']] = df_payment.Plan.str.split('/', expand=True)

In [6]:
df_payment.head()

Unnamed: 0,ClientId,AcquisitionDate,PaymentValue,Plan,AcquiredMonths
0,4049,05/03/2017,"R$ 300,00",Bronze,3
1,1711,12/08/2018,"R$ 750,00",Ouro,3
2,3643,01/01/2017,"R$ 399,00",Platina,1
3,4683,09/06/2017,"R$ 2394,00",Platina,6
4,4645,25/04/2018,"R$ 250,00",Ouro,1


In [7]:
# Cleaning and Transforming the PaymentValue Column
df_payment['PaymentValue'].apply(type).unique()

array([<class 'str'>], dtype=object)

In [8]:
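# Raw string avoids an invalid-escape warning for '\$'; strip the currency prefix and use '.' as the decimal mark
df_payment['PaymentValue'] = df_payment['PaymentValue'].replace({r'R\$': '', ',': '.'}, regex=True).astype(float)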

In [9]:
df_payment.head()

Unnamed: 0,ClientId,AcquisitionDate,PaymentValue,Plan,AcquiredMonths
0,4049,05/03/2017,300.0,Bronze,3
1,1711,12/08/2018,750.0,Ouro,3
2,3643,01/01/2017,399.0,Platina,1
3,4683,09/06/2017,2394.0,Platina,6
4,4645,25/04/2018,250.0,Ouro,1
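
The conversion in In [8] works because none of the values shown use a thousands separator (for example `R$ 2394,00`). If a future file contained values such as `R$ 1.234,56`, replacing only the comma would fail. A minimal sketch of a more defensive conversion follows; `parse_brl` is a hypothetical helper and the third sample value is an assumption, not something observed in the data:

```python
import pandas as pd

def parse_brl(values: pd.Series) -> pd.Series:
    """Convert strings like 'R$ 1.234,56' to floats (hypothetical helper)."""
    cleaned = (
        values.str.replace('R$', '', regex=False)   # drop the currency prefix
              .str.strip()                          # drop the space left behind
              .str.replace('.', '', regex=False)    # drop thousands separators, if any
              .str.replace(',', '.', regex=False)   # decimal comma -> decimal point
    )
    return cleaned.astype(float)

# Two values in the format seen above plus one assumed value with a thousands separator
sample = pd.Series(['R$ 300,00', 'R$ 2394,00', 'R$ 1.234,56'])
parsed = parse_brl(sample)
```

Using `regex=False` keeps `$` and `.` literal, so no escaping is needed.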


In [10]:
df_payment['PlanValue'] = df_payment['PaymentValue']/df_payment['AcquiredMonths'].astype(int)

In [11]:
df_payment.head()

Unnamed: 0,ClientId,AcquisitionDate,PaymentValue,Plan,AcquiredMonths,PlanValue
0,4049,05/03/2017,300.0,Bronze,3,100.0
1,1711,12/08/2018,750.0,Ouro,3,250.0
2,3643,01/01/2017,399.0,Platina,1,399.0
3,4683,09/06/2017,2394.0,Platina,6,399.0
4,4645,25/04/2018,250.0,Ouro,1,250.0


In [12]:
df_plan = df_payment[['Plan', 'PlanValue']].drop_duplicates()

In [13]:
df_plan.head()

Unnamed: 0,Plan,PlanValue
0,Bronze,100.0
1,Ouro,250.0
2,Platina,399.0
14,Prata,185.0
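
The plan table is only a valid lookup if every plan has a single monthly value, that is, if longer contracts carry no discount. The output above suggests this holds, and it could be checked before saving. A minimal sketch on a made-up sample (the `sample` frame and the `has_unique_prices` name are illustrative assumptions):

```python
import pandas as pd

# Hypothetical rows mimicking the Plan / PlanValue columns computed above
sample = pd.DataFrame({
    'Plan': ['Bronze', 'Ouro', 'Platina', 'Platina', 'Prata'],
    'PlanValue': [100.0, 250.0, 399.0, 399.0, 185.0],
})

df_plan_check = sample[['Plan', 'PlanValue']].drop_duplicates()

# If a plan appeared with two different monthly values, it would show up twice here
has_unique_prices = df_plan_check['Plan'].is_unique
```

If `has_unique_prices` were `False`, the plan table would need a duration column as well.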


### Saving the Plan/Plan Value table

In [14]:
df_plan.to_csv('../data/planValue.csv', index=False)

In [17]:
# Transforming AcquisitionDate from string to Datetime object
df_payment['AcquisitionDate'] = pd.to_datetime(df_payment.AcquisitionDate, format='%d/%m/%Y')

In [20]:
df_payment.AcquisitionDate.describe()

count                   95476
unique                    980
top       2016-11-27 00:00:00
freq                      288
first     2016-09-01 00:00:00
last      2019-07-28 00:00:00
Name: AcquisitionDate, dtype: object

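Here we have our first insight into the `payments data`: the single most frequent acquisition date is 27 November 2016, with 288 payments, and the earliest record is from 1 September 2016. This summary does not show which month had the most acquisitions, so that may be a good point to analyze later. For now I will keep exploring the data.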
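
`describe()` reports only the most frequent single date, so finding the busiest month needs a separate aggregation. A minimal sketch, using a small made-up sample in place of `df_payment` (the `sample`, `per_month` and `busiest_month` names are illustrative, not part of the notebook):

```python
import pandas as pd

# Hypothetical sample standing in for df_payment['AcquisitionDate']
sample = pd.DataFrame({
    'AcquisitionDate': pd.to_datetime(
        ['01/09/2016', '15/09/2016', '27/11/2016', '27/11/2016', '28/11/2016'],
        format='%d/%m/%Y',
    )
})

# Collapse each date to its calendar month and count the rows per month
per_month = sample['AcquisitionDate'].dt.to_period('M').value_counts().sort_index()

# Month with the highest number of acquisitions in the sample
busiest_month = per_month.idxmax()
```

The same two lines could be applied to the real `AcquisitionDate` column once it has been converted to datetime.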

Moreover, it shows that the data spans **about 35 months, from September/2016 to July/2019**.
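
The length of the period can be checked from the `first` and `last` values printed by `describe()`. A minimal sketch, with the two dates copied from that output (the variable names are illustrative):

```python
import pandas as pd

# First and last acquisition dates, as reported by describe() above
first = pd.Timestamp('2016-09-01')
last = pd.Timestamp('2019-07-28')

# Number of calendar months touched by the data, counting both end months
calendar_months = (last.year - first.year) * 12 + (last.month - first.month) + 1

# Elapsed time in days between the first and last record
span_days = (last - first).days
```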

In [28]:
df_payment.nunique()

ClientId           5000
AcquisitionDate     980
PaymentValue         12
Plan                  4
AcquiredMonths        3
PlanValue             4
dtype: int64

The table above gives us more interesting information:
    1. There are 5000 distinct clients acquiring a plan
    2. There are 12 distinct values paid, in accordance with the number of plans (4) multiplied by the number of contract lengths in months (3) that are available at this time
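
The second point assumes that each combination of plan and contract length maps to exactly one paid value. A minimal sketch of how that could be verified, on a made-up sample with two plans and two contract lengths (the sample rows and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows mimicking the Plan, AcquiredMonths and PaymentValue columns
sample = pd.DataFrame({
    'Plan': ['Bronze', 'Bronze', 'Ouro', 'Ouro', 'Bronze'],
    'AcquiredMonths': ['1', '3', '1', '3', '3'],
    'PaymentValue': [100.0, 300.0, 250.0, 750.0, 300.0],
})

# Count the distinct paid values for each (Plan, AcquiredMonths) pair
values_per_combo = sample.groupby(['Plan', 'AcquiredMonths'])['PaymentValue'].nunique()

# True when every pair has a single paid value
consistent = bool((values_per_combo == 1).all())

# Number of distinct pairs present in the sample
n_combos = len(values_per_combo)
```

On the real data, `n_combos` would be expected to equal 12 if all plans are sold for all three contract lengths.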

### Saving the clients list to be analyzed in another notebook

In [32]:
df_payment[['ClientId']].drop_duplicates().to_csv('../data/clients.csv', index=False)

### Saving the processed payment table

In [35]:
df_payment.to_csv('../data/payments_v2.csv', index=False)