## Inferential Statistics: Telecom Churn Data

### Overview

The purpose of this analysis is to test whether the **proportions of male and female customers who churn** differ, and whether any difference is statistically **AND** practically significant enough to justify gender-specific action.

In [1]:
# Import modules/packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import math
import sys

In [10]:
df = pd.read_csv('churn.csv') # Import csv
churn = df[['Churn','gender']].copy() # Keep a copy of only the necessary columns

In [11]:
churn.head()

Unnamed: 0,Churn,gender
0,No,Female
1,No,Male
2,Yes,Male
3,No,Male
4,Yes,Female


### Sample Proportion Calculation for Churn / No Churn

In [24]:
# Combine both conditions in one boolean mask (chained indexing triggers a pandas UserWarning)
churn_m = ((churn.gender == 'Male') & (churn.Churn == 'Yes')).sum() # Male churn count

churn_f = ((churn.gender == 'Female') & (churn.Churn == 'Yes')).sum() # Female churn count


In [20]:
n = len(churn) # total number of customers (both genders combined)

In [25]:
prop_m = churn_m/n # male churners as a share of all n customers

prop_f = churn_f/n # female churners as a share of all n customers

print("The sample proportion of churn for men is "+str(round(prop_m,3)))
print("The sample proportion of churn for female is "+str(round(prop_f,3)))

The sample proportion of churn for men is 0.132
The sample proportion of churn for female is 0.133
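The proportions above divide each gender's churn count by the total number of customers, so they measure what share of all customers are male (or female) churners. A per-gender churn rate would instead divide by the number of customers of that gender. The following is a minimal sketch of that alternative, using hypothetical toy records rather than `churn.csv`; the helper name `churn_rate_by_group` is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical toy records (gender, churn); NOT taken from churn.csv
records = [
    ("Male", "Yes"), ("Male", "No"), ("Male", "No"), ("Male", "No"),
    ("Female", "Yes"), ("Female", "Yes"), ("Female", "No"), ("Female", "No"),
]

def churn_rate_by_group(rows):
    """Return {group: (churners, group_size, churn_rate)} with per-group denominators."""
    sizes = Counter(group for group, _ in rows)
    churners = Counter(group for group, churned in rows if churned == "Yes")
    return {g: (churners[g], sizes[g], churners[g] / sizes[g]) for g in sizes}

rates = churn_rate_by_group(records)
for group, (x, n_g, rate) in rates.items():
    print(f"{group}: {x}/{n_g} churned, rate = {rate:.3f}")
```

With the notebook's DataFrame, the same per-gender rates could be obtained with `(churn['Churn'] == 'Yes').groupby(churn['gender']).mean()`.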


### Central Limit Theorem Conditions

**Random Condition:** Each customer is assumed to be randomly sampled and recorded, so our sample **meets** the random condition of the Central Limit Theorem.

**Normal Condition:** For both male and female customers, the counts of churners and non-churners (the sample proportion, and one minus it, multiplied by the sample size) are **at least 10**. Both proportions lie well **between 0 and 1** and are based on a **large number of records**. Thus, the sampling distributions of both sample proportions **meet** the normal condition.

**Independence Condition:** Both samples are **less than 10%** of the population of customers they are drawn from. Thus, both samples **meet** the independence condition.
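The normal and independence conditions can be checked mechanically. The sketch below is illustrative only: the function names, the threshold default, and all counts are assumptions, not values from `churn.csv`.

```python
def large_counts_ok(churners, n, threshold=10):
    """Normal condition: both churners and non-churners must reach the threshold."""
    return churners >= threshold and (n - churners) >= threshold

def ten_percent_ok(n, population_size):
    """Independence condition when sampling without replacement (10% rule)."""
    return n <= 0.10 * population_size

# Hypothetical counts, NOT taken from churn.csv
male_ok = large_counts_ok(churners=120, n=500)               # 120 and 380 both >= 10
female_ok = large_counts_ok(churners=8, n=500)               # only 8 churners, fails
independent = ten_percent_ok(n=500, population_size=20_000)  # 500 <= 2000

print(male_ok, female_ok, independent)
```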


### Null & Alternative Hypothesis

*Null Hypothesis:* In terms of churn, there is **no difference** for female and male customers.

*Alternative Hypothesis:* In terms of churn, there is **a difference** for female and male customers.
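Written symbolically, with $p_m$ and $p_f$ denoting the true churn proportions for male and female customers (notation introduced here for illustration), the two-sided hypotheses are:

$$
H_0: p_m - p_f = 0 \qquad \text{versus} \qquad H_a: p_m - p_f \neq 0
$$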

### Significance Level & Power

**Significance Level**: α = 0.01 

**Power**: We are worried about making a **Type I error**: if there is truly no difference between the male and female churn proportions and we reject the null hypothesis anyway, telecom companies will act on a difference that does not exist. As a result they will be **wasting time** and **losing money on action items**. We therefore choose a strict α, accepting the lower power (a higher chance of missing a real difference) that comes with it.

### Margin of Error & Confidence Interval

In [28]:
prop_diff = prop_m - prop_f # sample statistic

std_error = math.sqrt((prop_m * (1-prop_m) / n) + (prop_f * (1-prop_f)/n)) # standard error

z_score = round(stats.norm.ppf(.995),3) # Z-score of 99% confidence interval

print("The difference between the sample proportions of male and female is", round(prop_diff,3))

The difference between the sample proportions of male and female is -0.001


In [29]:
moe = z_score*std_error # margin of error

lower = prop_diff - moe # lower level of confidence interval
upper = prop_diff + moe # upper level of confidence interval

print("Margin of Error:", round(moe,3))
print("Confidence Interval:", [round(lower,3),round(upper,3)])

Margin of Error: 0.015
Confidence Interval: [-0.016, 0.013]


We are **99% confident** that the true difference between male and female churn proportions lies between *-0.016 and 0.013*.

Because this interval **contains 0**, the data are consistent with **no difference** between male and female churn proportions.
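The interval calculation can be wrapped in a small helper that keeps a separate sample size for each group. This is a sketch under stated assumptions: the function name `two_prop_ci` and the example counts are hypothetical, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` only to stay dependency-free.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ci(x1, n1, x2, n2, confidence=0.99):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Critical value, e.g. the 0.995 quantile for a 99% interval
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

# Hypothetical counts, NOT taken from churn.csv
lower, upper = two_prop_ci(x1=120, n1=500, x2=135, n2=500)
contains_zero = lower <= 0 <= upper
print(round(lower, 3), round(upper, 3), contains_zero)
```

An interval that contains 0 means the data do not rule out equal proportions at the chosen confidence level.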

### Z-Score & P-Value

In [30]:
# Under the null hypothesis, prop_m and prop_f share one common value, estimated by p_hat
# (the original chained assignment `p = prop_m = prop_f` would have overwritten prop_m)

p_hat = (churn_m + churn_f) / (n + n) # pooled proportion assuming the null hypothesis is true

std_error = math.sqrt((2 * p_hat * (1-p_hat))/n) # standard error of the difference assuming the null hypothesis is true

z_score = (prop_diff - 0)/std_error # calculate z score

p_value = stats.norm.sf(abs(z_score)) # one-tailed p-value (double it for a two-sided test)

print('Z-score:', round(z_score,2))
print('P-Value:', round(p_value,4))

Z-score: -0.22
P-Value: 0.4116


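The probability of getting a Z-score **as extreme or more extreme** than -0.22 in one tail is 0.4116 (about 41.16%), *assuming the null hypothesis is true*. For the two-sided alternative hypothesis stated earlier, the p-value is twice that, about 0.82.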

Since our p-value is **greater than** our predetermined significance level of 0.01, we **fail to reject** the null hypothesis: the difference between the male and female churn proportions is **not statistically significant**.
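Because the alternative hypothesis is two-sided, the test can be written so that it returns a two-tailed p-value directly. The sketch below is a pooled two-proportion z-test with a separate sample size per group; the function name `two_prop_ztest` and the example counts are hypothetical, and `statistics.NormalDist` stands in for `scipy.stats.norm` to keep the example dependency-free.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_two_sided

# Hypothetical counts, NOT taken from churn.csv
z, p = two_prop_ztest(x1=120, n1=500, x2=135, n2=500)
alpha = 0.01
print(round(z, 2), round(p, 3), "reject H0" if p < alpha else "fail to reject H0")
```

Swapping the two groups flips the sign of `z` but leaves the two-sided p-value unchanged, which is the behaviour expected for a "difference in either direction" hypothesis.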

### Conclusion

Firstly, we are *99% confident* that the true difference between male and female churn proportions is between -0.016 and 0.013. This interval includes 0, and even at its extremes the difference is **AT MOST** about 1.6 percentage points, so it is neither statistically nor practically significant.

Furthermore, our analysis **ONLY** looked at the relationship between gender and churn. Based on this analysis alone, gender should not be treated as a predictor of churn.

Thus, when advising a telecom company on whether to act on this analysis, the recommendation is clear: do not focus on the customer's gender, although it was a factor I was curious to delve into.