In this inferential statistics analysis, I test whether the following variables differ significantly between users who churned and users who did not:
* Age
* Auto renew
* Number of transactions made
* Number of seconds listened
* Number of unique songs listened

# Load Libraries

In [30]:
import pandas as pd
import time
from datetime import datetime
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt

In [7]:
train = pd.read_csv('./EDA_Data/train_modified.csv')
train_churn = train[train.is_churn == 1].msno
train_nochurn = train[train.is_churn == 0].msno

# Inferential Statistics

### Age
$H_0$: There's no difference in mean age between users who churned and users who did not. <br>
$H_A$: There's a difference.

In [19]:
# Load Data
members = pd.read_csv('./EDA_Data/members_modified.csv')

In [21]:
# Merge Data
members = pd.merge(left = members, right = train,how = 'left',on=['msno'])
members_churn = members[members.is_churn == 1]
members_nochurn = members[members.is_churn == 0]

In [40]:
# Calculating Age statistics
age_churn = members_churn.bd.dropna()
age_nochurn = members_nochurn.bd.dropna()

# Computing statistics for group churn
churn_mean = np.mean(age_churn)
churn_std = np.std(age_churn)

# Computing statistics for group no churn
nochurn_mean = np.mean(age_nochurn)
nochurn_std = np.std(age_nochurn)

# Difference Stats
diff_mean = nochurn_mean - churn_mean
diff_se = np.sqrt(churn_std**2/len(age_churn) + nochurn_std**2/len(age_nochurn))

# Margin of error
moe = 1.96 * diff_se
lower_bd = diff_mean - moe
upper_bd = diff_mean + moe

print(f'Mean Difference: {diff_mean}', 
      f'\nMargin of error: {moe}', 
      f'\n95% Confidence Interval: [{lower_bd}, {upper_bd}]')

Mean Difference: 2.5739036785245375 
Margin of error: 0.13734545113421098 
95% Confidence Interval: [2.4365582273903263, 2.7112491296587486]


In [45]:
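# Permutation test: shuffle the pooled ages and re-split them into groups of the original sizes
perm_replicates = np.empty(10000)
pooled = np.concatenate((age_churn, age_nochurn))

for i in range(10000):
    perm = np.random.permutation(pooled)
    churn_perm = perm[:len(age_churn)]
    nochurn_perm = perm[len(age_churn):]
    perm_replicates[i] = np.mean(nochurn_perm) - np.mean(churn_perm)

# Two-sided p-value (matches H_A): share of shuffles at least as extreme as the observed difference
perm_pval = np.sum(np.abs(perm_replicates) >= abs(diff_mean))/10000
print(f'p-value: {perm_pval}')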

p-value: 0.0


There is a mean difference of 2.574 years in age between the two groups: users who did not churn are older on average. The 95% confidence interval [2.437, 2.711] does not contain 0, which is consistent with a real difference. Because the test shuffles the group labels, it is a permutation test rather than a bootstrap. None of the 10,000 shuffles produced a difference as large as the observed one, so the p-value is below 1/10,000 rather than exactly 0. We reject the null hypothesis: mean age differs between the two groups, and users who churned are younger on average.
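The permutation test used in this notebook can be wrapped in a small helper. The sketch below is one possible version, not the code that produced the results above: the function name `permutation_pvalue`, the seed, and the synthetic ages are assumptions made for illustration. It reports a two-sided p-value and adds one to both the count and the number of permutations, so the estimate is never printed as exactly 0.

```python
import numpy as np

def permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the difference in means, mean(b) - mean(a)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = b.mean() - a.mean()
    pooled = np.concatenate((a, b))
    count = 0
    for _ in range(n_perm):
        # Shuffle the pooled values and re-split into groups of the original sizes
        perm = rng.permutation(pooled)
        diff = perm[len(a):].mean() - perm[:len(a)].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction: the smallest reportable p-value is 1 / (n_perm + 1)
    return observed, (count + 1) / (n_perm + 1)

# Synthetic stand-ins for age_churn / age_nochurn (NOT the real data)
rng = np.random.default_rng(42)
demo_churn = rng.normal(28.0, 8.0, size=500)
demo_nochurn = rng.normal(30.5, 8.0, size=2000)

obs, pval = permutation_pvalue(demo_churn, demo_nochurn, n_perm=2000)
print(f'Observed difference: {obs:.3f}, p-value: {pval:.4f}')
```

With the real data, the call would be `permutation_pvalue(age_churn, age_nochurn)`.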

### Auto-Renew
$H_0$ : There's no difference in auto-renewal rate between users who churned and users who did not. <br>
$H_A$ : There's a difference.

In [48]:
# Load data
transactions = pd.read_csv('./EDA_Data/transactions_modified.csv')

In [63]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1909677 entries, 0 to 1909676
Data columns (total 9 columns):
msno                      object
payment_method_id         int64
payment_plan_days         int64
plan_list_price           int64
actual_amount_paid        int64
is_auto_renew             int64
transaction_date          object
membership_expire_date    object
is_cancel                 int64
dtypes: int64(6), object(3)
memory usage: 131.1+ MB


In [54]:
# Merge Data
auto_renew = transactions[['msno', 'is_auto_renew']].groupby('msno').mean()
auto_renew = pd.merge(left = auto_renew, right = train, how = 'left', on = ['msno'])

In [58]:
# Calculating auto-renew statistics
ar_churn = auto_renew[auto_renew.is_churn == 1].is_auto_renew
ar_nochurn = auto_renew[auto_renew.is_churn == 0].is_auto_renew

# Computing statistics for group churn
churn_mean = np.mean(ar_churn)
churn_std = np.std(ar_churn)

# Computing statistics for group no churn
nochurn_mean = np.mean(ar_nochurn)
nochurn_std = np.std(ar_nochurn)

# Difference Stats
diff_mean = nochurn_mean - churn_mean
diff_se = np.sqrt(churn_std**2/len(ar_churn) + nochurn_std**2/len(ar_nochurn))

# Margin of error
moe = 1.96 * diff_se
lower_bd = diff_mean - moe
upper_bd = diff_mean + moe

print(f'Mean Difference: {diff_mean}', 
      f'\nMargin of error: {moe}', 
      f'\n95% Confidence Interval: [{lower_bd}, {upper_bd}]')

Mean Difference: 0.44100533658159546 
Margin of error: 0.004210435379622165 
95% Confidence Interval: [0.4367949012019733, 0.44521577196121764]


In [59]:
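# Permutation test: shuffle the pooled auto-renew rates and re-split them into groups of the original sizes
perm_replicates = np.empty(10000)
pooled = np.concatenate((ar_churn, ar_nochurn))

for i in range(10000):
    perm = np.random.permutation(pooled)
    churn_perm = perm[:len(ar_churn)]
    nochurn_perm = perm[len(ar_churn):]
    perm_replicates[i] = np.mean(nochurn_perm) - np.mean(churn_perm)

# Two-sided p-value (matches H_A): share of shuffles at least as extreme as the observed difference
perm_pval = np.sum(np.abs(perm_replicates) >= abs(diff_mean))/10000
print(f'p-value: {perm_pval}')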

p-value: 0.0


The mean auto-renewal rate differs by 0.441 between the two groups. The 95% confidence interval [0.437, 0.445] is narrow and far from 0, so a true difference of zero is implausible. None of the 10,000 permutations produced a difference as large as the observed one, so the p-value is below 1/10,000 and the null hypothesis is rejected. There is a statistically significant difference in auto-renewal rate between the churn and no-churn groups: on average, the share of auto-renewed transactions was 44 percentage points higher among users who did not churn.
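The confidence-interval calculation is repeated for every variable, so it could be factored into a helper. The sketch below is an illustration under stated assumptions: `mean_diff_ci` is a hypothetical name, the per-user rates are made up, and the sample variance uses `ddof=1` where the notebook uses NumPy's default `ddof=0` (with groups this large the two barely differ). It also prints the result in percentage points, since the difference is between two rates.

```python
import numpy as np

def mean_diff_ci(churn, nochurn, z=1.96):
    """Difference in means (nochurn - churn) with a normal-approximation interval."""
    churn = np.asarray(churn, dtype=float)
    nochurn = np.asarray(nochurn, dtype=float)
    diff = nochurn.mean() - churn.mean()
    # Unpooled standard error of the difference between two independent means
    se = np.sqrt(churn.var(ddof=1) / len(churn) + nochurn.var(ddof=1) / len(nochurn))
    return diff, diff - z * se, diff + z * se

# Made-up per-user auto-renew rates (NOT the real data); groups this small
# are only for demonstration, the z = 1.96 approximation needs large samples
demo_churn = np.array([0.0, 0.5, 0.0, 1.0, 0.25, 0.0])
demo_nochurn = np.array([1.0, 1.0, 0.75, 1.0, 0.5, 1.0, 1.0, 0.9])

diff, lower, upper = mean_diff_ci(demo_churn, demo_nochurn)
print(f'Difference: {100 * diff:.1f} percentage points')
print(f'95% CI: [{100 * lower:.1f}, {100 * upper:.1f}] percentage points')
```

With the real data, the call would be `mean_diff_ci(ar_churn, ar_nochurn)`.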

### Number of Transactions
$H_0$: There's no difference in the number of transactions between users who churned and users who did not. <br>
$H_A$: Those who did not churn made more transactions than those who churned.

In [67]:
# Merge data
transaction_made = transactions[['msno', 'transaction_date']].groupby('msno').count()
transaction_made = pd.merge(left = transaction_made, right = train, how = 'left', on = ['msno'])

In [69]:
# Calculating transaction-count statistics
trans_churn = transaction_made[transaction_made.is_churn == 1].transaction_date
trans_nochurn = transaction_made[transaction_made.is_churn == 0].transaction_date

# Computing statistics for group churn
churn_mean = np.mean(trans_churn)
churn_std = np.std(trans_churn)

# Computing statistics for group no churn
nochurn_mean = np.mean(trans_nochurn)
nochurn_std = np.std(trans_nochurn)

# Difference Stats
diff_mean = nochurn_mean - churn_mean
diff_se = np.sqrt(churn_std**2/len(trans_churn) + nochurn_std**2/len(trans_nochurn))

# Margin of error
moe = 1.96 * diff_se
lower_bd = diff_mean - moe
upper_bd = diff_mean + moe

print(f'Mean Difference: {diff_mean}', 
      f'\nMargin of error: {moe}', 
      f'\n95% Confidence Interval: [{lower_bd}, {upper_bd}]')

Mean Difference: 5.5200867432370675 
Margin of error: 0.08865758164061524 
95% Confidence Interval: [5.431429161596452, 5.608744324877683]


In [71]:
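# Permutation test: shuffle the pooled transaction counts and re-split them into groups of the original sizes
perm_replicates = np.empty(10000)
pooled = np.concatenate((trans_churn, trans_nochurn))

for i in range(10000):
    perm = np.random.permutation(pooled)
    churn_perm = perm[:len(trans_churn)]
    nochurn_perm = perm[len(trans_churn):]
    perm_replicates[i] = np.mean(nochurn_perm) - np.mean(churn_perm)

# One-sided p-value (H_A is directional): share of shuffles with a difference at least as large as observed
perm_pval = np.sum(perm_replicates >= diff_mean)/10000
print(f'p-value: {perm_pval}')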

p-value: 0.0


On average, users who did not churn made 5.52 more transactions than users who churned, with a 95% confidence interval of [5.431, 5.609]. None of the 10,000 permutations produced a difference as large as the observed one (p < 1/10,000), so the null hypothesis is rejected in favor of the one-sided alternative: users who did not churn made more transactions.
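To see why no permutation came close to the observed difference, it helps to express that difference in standard errors. The short calculation below uses only the two numbers printed above; it recovers the standard error from the margin of error and divides. Treating the permutation spread as roughly equal to this standard error is an approximation.

```python
# Values copied from the printed output for the number of transactions
diff_mean = 5.5200867432370675
moe = 0.08865758164061524

# The margin of error was computed as 1.96 * standard error
diff_se = moe / 1.96
z_score = diff_mean / diff_se

print(f'Standard error: {diff_se:.4f}')
print(f'z-score: {z_score:.1f}')
```

A difference more than a hundred standard errors from zero is far outside anything 10,000 random shuffles would be expected to produce.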

### Number of Seconds Listened

$H_0$: There's no difference in the total number of seconds listened between users who churned and users who did not. <br>
$H_A$: There's a difference between the groups.

In [72]:
user = pd.read_csv('./EDA_Data/user_logs_modified.csv')

In [110]:
user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10927698 entries, 0 to 10927697
Data columns (total 9 columns):
msno          object
date          object
num_25        float64
num_50        float64
num_75        float64
num_985       float64
num_100       float64
num_unq       float64
total_secs    float64
dtypes: float64(7), object(2)
memory usage: 750.3+ MB


In [100]:
# Merge data
seconds = user[['msno', 'total_secs']].groupby('msno').sum()
seconds = pd.merge(left = seconds, right = train, how = 'left', on = ['msno'])

In [108]:
# Calculating total-seconds statistics
secs_churn = seconds[seconds.is_churn == 1].total_secs
secs_nochurn = seconds[seconds.is_churn == 0].total_secs

# Computing statistics for group churn
churn_mean = np.mean(secs_churn)
churn_std = np.std(secs_churn)

# Computing statistics for group no churn
nochurn_mean = np.mean(secs_nochurn)
nochurn_std = np.std(secs_nochurn)

# Difference Stats
diff_mean = abs(nochurn_mean - churn_mean)
diff_se = np.sqrt(churn_std**2/len(secs_churn) + nochurn_std**2/len(secs_nochurn))

# Margin of error
moe = 1.96 * diff_se
lower_bd = diff_mean - moe
upper_bd = diff_mean + moe

print(f'Mean Difference: {diff_mean}', 
      f'\nMargin of error: {moe}', 
      f'\n95% Confidence Interval: [{lower_bd}, {upper_bd}]')

Mean Difference: 21891093981546.016 
Margin of error: 18958175174543.344 
95% Confidence Interval: [2932918807002.672, 40849269156089.36]


In [109]:
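# Permutation test: shuffle the pooled totals and re-split them into groups of the original sizes
perm_replicates = np.empty(10000)
pooled = np.concatenate((secs_churn, secs_nochurn))

for i in range(10000):
    perm = np.random.permutation(pooled)
    churn_perm = perm[:len(secs_churn)]
    nochurn_perm = perm[len(secs_churn):]
    # Absolute difference, because diff_mean was also computed as an absolute value
    perm_replicates[i] = abs(np.mean(nochurn_perm) - np.mean(churn_perm))

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
perm_pval = np.sum(perm_replicates >= diff_mean)/10000
print(f'p-value: {perm_pval}')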

p-value: 0.0245


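The observed difference in mean total seconds listened is about 2.19e13 seconds, with a 95% confidence interval of [2.93e12, 4.08e13]. A difference of that size is far beyond any plausible listening time, which suggests that `total_secs` contains extreme values that dominate the group means. The permutation test gives a p-value of 0.0245, so we fail to reject the null hypothesis at the 0.01 significance level, although the result would be significant at the 0.05 level implied by the 95% interval. The evidence that total seconds listened differs between users who churned and users who did not is therefore weak.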
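One way to check whether extreme values drive this result is to drop impossible log rows before summing. The sketch below assumes that each row of the user logs covers one user on one day (the `date` column suggests this, but it is not confirmed here), so a valid `total_secs` must lie between 0 and 86,400. The miniature DataFrame and its values are made up, and whether the real file contains such rows has not been verified.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Hypothetical miniature of the user logs (NOT the real data)
demo_logs = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'b', 'c'],
    'total_secs': [1200.0, 3600.0, 900.0, -9.2e15, 5400.0],
})

# Keep only rows whose daily total is physically possible
valid = demo_logs['total_secs'].between(0, SECONDS_PER_DAY)
clean_logs = demo_logs[valid]

# Same aggregation as in the notebook, applied to the filtered rows
demo_seconds = clean_logs.groupby('msno')['total_secs'].sum()

print(f'Dropped {(~valid).sum()} of {len(demo_logs)} rows')
print(demo_seconds)
```

If the filter removes rows from the real logs, the confidence interval and permutation test for this variable would need to be re-run on the cleaned totals before drawing a conclusion.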

### Number of Unique Songs Listened

$H_0$: There's no difference in the number of unique songs listened per session between users who churned and users who did not. <br>
$H_A$: There's a difference between the groups.

In [116]:
# Merge data
unique = user[['msno', 'num_unq']].groupby('msno').mean()
unique = pd.merge(left = unique, right = train, how = 'left', on = ['msno'])

In [122]:
# Calculating unique-song statistics
unq_churn = unique[unique.is_churn == 1].num_unq
unq_nochurn = unique[unique.is_churn == 0].num_unq

# Computing statistics for group churn
churn_mean = np.mean(unq_churn)
churn_std = np.std(unq_churn)

# Computing statistics for group no churn
nochurn_mean = np.mean(unq_nochurn)
nochurn_std = np.std(unq_nochurn)

# Difference Stats
diff_mean = nochurn_mean - churn_mean
diff_se = np.sqrt(churn_std**2/len(unq_churn) + nochurn_std**2/len(unq_nochurn))

# Margin of error
moe = 1.96 * diff_se
lower_bd = diff_mean - moe
upper_bd = diff_mean + moe

print(f'Mean Difference: {diff_mean}', 
      f'\nMargin of error: {moe}', 
      f'\n95% Confidence Interval: [{lower_bd}, {upper_bd}]')

Mean Difference: -1.093376972354637 
Margin of error: 0.2055256639861163 
95% Confidence Interval: [-1.2989026363407532, -0.8878513083685207]


In [123]:
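# Permutation test: shuffle the pooled per-user means and re-split them into groups of the original sizes
perm_replicates = np.empty(10000)
pooled = np.concatenate((unq_churn, unq_nochurn))

for i in range(10000):
    perm = np.random.permutation(pooled)
    # Split on the size of the unique-song groups, not the seconds groups
    churn_perm = perm[:len(unq_churn)]
    nochurn_perm = perm[len(unq_churn):]
    perm_replicates[i] = np.mean(nochurn_perm) - np.mean(churn_perm)

# Two-sided p-value (matches H_A): share of shuffles at least as extreme as the observed difference
perm_pval = np.sum(np.abs(perm_replicates) >= abs(diff_mean))/10000
print(f'p-value: {perm_pval}')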

p-value: 0.0


On average, users who churned listened to about one more unique song per session (1.09) than users who did not churn. The 95% confidence interval [-1.299, -0.888] does not contain 0, and none of the 10,000 permutations produced a difference as extreme as the observed one (p < 1/10,000), so we reject the null hypothesis.

## Conclusion

Of the 5 variables tested, 4 showed statistically significant differences in group means at the 0.01 level. These variables should be included in the predictive model:
* Age
* Auto-Renew
* Number of Transactions
* Number of Unique Songs