# Predict Customer Churn for Subscription-based Media Company  

Data provided by [Kaggle](https://www.kaggle.com/datasets/safrin03/predictive-analytics-for-customer-churn-dataset/data)  

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#training">Training</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="intro"></a>
## Introduction

The data provided is synthetic and describes a fictional subscription-based media company. The goal of this effort is to predict when a customer is likely to cancel their subscription.

<a id='eda'></a>
## Exploratory Data Analysis

### Import necessary libraries

> These libraries are used to explore the data

In [195]:
import os  
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

pd.set_option('display.max_colwidth', None)

%matplotlib inline

### Load the data

In [115]:
print('These are the files that were provided by Kaggle:\n')
for root, dirs, files in os.walk('./data/'):
    for file in files:
        if file != '.DS_Store':
            print(file)

These are the files that were provided by Kaggle:

test.csv
train.csv
data_descriptions.csv


> Load the three files into separate data frames

In [116]:
train_raw_df = pd.read_csv('data/train.csv')
test_raw_df = pd.read_csv('data/test.csv')
desc_df = pd.read_csv('data/data_descriptions.csv')

### Automated EDA 

> Using YData's profiling library to generate an automated EDA report and get a feel for the data before inspecting it manually. The training data is profiled because it is the larger dataset.

In [117]:
# TODO: uncomment for submission
# profile = ProfileReport(train_raw_df, title="Train Data Profiling Report")
# profile.to_widgets()

The main things I captured from the generated report:
1. No missing data
2. No duplicated rows
3. The highest correlation was between account age and total charges (which is expected)

### Manual EDA

In this section, I will manually explore the data. Some steps will be repeated to ensure the accuracy of the report.

> Inspect the Data Descriptions for the columns

In [118]:
desc_df

|   | Column_name | Column_type | Data_type | Description |
|---|---|---|---|---|
| 0 | AccountAge | Feature | integer | The age of the user's account in months. |
| 1 | MonthlyCharges | Feature | float | The amount charged to the user on a monthly basis. |
| 2 | TotalCharges | Feature | float | The total charges incurred by the user over the account's lifetime. |
| 3 | SubscriptionType | Feature | object | The type of subscription chosen by the user (Basic, Standard, or Premium). |
| 4 | PaymentMethod | Feature | string | The method of payment used by the user. |
| 5 | PaperlessBilling | Feature | string | Indicates whether the user has opted for paperless billing (Yes or No). |
| 6 | ContentType | Feature | string | The type of content preferred by the user (Movies, TV Shows, or Both). |
| 7 | MultiDeviceAccess | Feature | string | Indicates whether the user has access to the service on multiple devices (Yes or No). |
| 8 | DeviceRegistered | Feature | string | The type of device registered by the user (TV, Mobile, Tablet, or Computer). |
| 9 | ViewingHoursPerWeek | Feature | float | The number of hours the user spends watching content per week. |


> Now peeking at the train and test dataframes

In [119]:
train_raw_df.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,10,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,18,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.39516,...,23,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,30,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,20,Comedy,3.61617,4,Female,0,No,No,4LGYPK7VOL,0


In [120]:
test_raw_df.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID
0,38,17.869374,679.036195,Premium,Mailed check,No,TV Shows,No,TV,29.126308,122.274031,42,Comedy,3.522724,2,Male,23,No,No,O1W6BHP6RM
1,77,9.912854,763.289768,Basic,Electronic check,Yes,TV Shows,No,TV,36.873729,57.093319,43,Action,2.021545,2,Female,22,Yes,No,LFR4X92X8H
2,5,15.019011,75.095057,Standard,Bank transfer,No,TV Shows,Yes,Computer,7.601729,140.414001,14,Sci-Fi,4.806126,2,Female,22,No,Yes,QM5GBIYODA
3,88,15.357406,1351.451692,Standard,Electronic check,No,Both,Yes,Tablet,35.58643,177.002419,14,Comedy,4.9439,0,Female,23,Yes,Yes,D9RXTK2K9F
4,91,12.406033,1128.949004,Standard,Credit card,Yes,TV Shows,Yes,Tablet,23.503651,70.308376,6,Drama,2.84688,6,Female,0,No,No,ENTCCHR1LR


> Noting that the precision of the float values is unnecessary. A monthly charge of $17.869374 does not make sense and should be $17.87.

In [121]:
train_raw_df.sample(5)

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
163502,57,17.449082,994.597701,Premium,Bank transfer,No,TV Shows,No,Computer,15.389308,...,17,Drama,2.328344,3,Male,11,No,No,F9EMHD95NT,1
162251,17,10.233929,173.976795,Premium,Mailed check,No,Movies,Yes,TV,22.560299,...,26,Comedy,4.481471,9,Male,7,Yes,Yes,2MQEND3JCQ,1
4870,35,10.101147,353.540149,Basic,Electronic check,No,Movies,No,Computer,32.323717,...,15,Sci-Fi,2.332089,8,Female,16,No,Yes,LKMM7815ID,0
234772,14,19.141657,267.983203,Standard,Mailed check,Yes,Both,No,Mobile,17.447951,...,15,Drama,3.220327,2,Female,15,No,No,GC6NH2U9C2,1
53730,16,12.029515,192.472246,Standard,Bank transfer,Yes,Movies,No,Computer,11.714178,...,10,Action,1.411203,4,Male,12,Yes,Yes,0DAUY4N7LF,0


In [122]:
test_raw_df.sample(5)

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID
58525,94,10.725389,1008.186536,Basic,Electronic check,Yes,Both,No,Computer,10.609954,176.470696,5,Fantasy,1.130909,0,Female,12,No,Yes,WPIQIDKDT6
21597,114,11.470068,1307.587799,Basic,Electronic check,Yes,Both,No,Computer,8.666098,92.526814,45,Sci-Fi,3.087956,3,Female,23,No,Yes,779SJS3OQU
1755,32,10.38988,332.476157,Standard,Electronic check,No,Both,No,TV,37.77811,108.684118,38,Fantasy,3.233877,7,Male,17,No,No,XIAVL3TT71
34725,14,12.60738,176.503316,Basic,Bank transfer,Yes,TV Shows,Yes,TV,28.807173,15.850715,48,Sci-Fi,4.998671,8,Female,2,Yes,Yes,OEU41JYALJ
75303,43,10.432553,448.599772,Basic,Credit card,No,Both,Yes,TV,26.831444,104.263156,17,Action,2.732252,2,Female,12,Yes,Yes,CG9HWRK8ZV


> Noting that the test data does not contain values for churn.

In [123]:
train_raw_df.shape

(243787, 21)

In [124]:
test_raw_df.shape

(104480, 20)
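
The one-column difference between the two shapes can also be confirmed by comparing column names directly. The snippet below is a minimal sketch on tiny stand-in frames (`train_demo` and `test_demo` are hypothetical names, not part of the notebook); in the notebook the same set difference would be applied to `train_raw_df` and `test_raw_df`.

```python
import pandas as pd

# Tiny stand-in frames (assumed for illustration); the notebook would use
# train_raw_df and test_raw_df instead.
train_demo = pd.DataFrame(
    {"AccountAge": [20, 57], "CustomerID": ["A", "B"], "Churn": [0, 1]}
)
test_demo = pd.DataFrame({"AccountAge": [38, 77], "CustomerID": ["C", "D"]})

# Columns present in one frame but missing from the other.
only_in_train = sorted(set(train_demo.columns) - set(test_demo.columns))
only_in_test = sorted(set(test_demo.columns) - set(train_demo.columns))

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```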

> Looking at some basic statistics for the quantitative data

In [125]:
train_raw_df.describe()

|   | AccountAge | MonthlyCharges | TotalCharges | ViewingHoursPerWeek | AverageViewingDuration | ContentDownloadsPerMonth | UserRating | SupportTicketsPerMonth | WatchlistSize | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 243787.0 | 243787.0 | 243787.0 | 243787.0 | 243787.0 | 243787.0 | 243787.0 | 243787.0 | 243787.0 | 243787.0 |
| mean | 60.083758 | 12.490695 | 750.741017 | 20.502179 | 92.264061 | 24.503513 | 3.002713 | 4.504186 | 12.018508 | 0.181232 |
| std | 34.285143 | 4.327615 | 523.073273 | 11.243753 | 50.505243 | 14.421174 | 1.155259 | 2.872548 | 7.193034 | 0.385211 |
| min | 1.0 | 4.990062 | 4.991154 | 1.000065 | 5.000547 | 0.0 | 1.000007 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 8.738543 | 329.147027 | 10.763953 | 48.382395 | 12.0 | 2.000853 | 2.0 | 6.0 | 0.0 |
| 50% | 60.0 | 12.495555 | 649.878487 | 20.523116 | 92.249992 | 24.0 | 3.002261 | 4.0 | 12.0 | 0.0 |
| 75% | 90.0 | 16.23816 | 1089.317362 | 30.219396 | 135.908048 | 37.0 | 4.002157 | 7.0 | 18.0 | 0.0 |
| max | 119.0 | 19.989957 | 2378.723844 | 39.999723 | 179.999275 | 49.0 | 4.999989 | 9.0 | 24.0 | 1.0 |


In [126]:
test_raw_df.describe()

|   | AccountAge | MonthlyCharges | TotalCharges | ViewingHoursPerWeek | AverageViewingDuration | ContentDownloadsPerMonth | UserRating | SupportTicketsPerMonth | WatchlistSize |
|---|---|---|---|---|---|---|---|---|---|
| count | 104480.0 | 104480.0 | 104480.0 | 104480.0 | 104480.0 | 104480.0 | 104480.0 | 104480.0 | 104480.0 |
| mean | 60.064692 | 12.474347 | 748.167669 | 20.489914 | 92.646128 | 24.4509 | 3.000958 | 4.507705 | 12.0404 |
| std | 34.285025 | 4.331734 | 520.782838 | 11.243173 | 50.631406 | 14.451309 | 1.154689 | 2.8767 | 7.204115 |
| min | 1.0 | 4.990051 | 5.019144 | 1.000528 | 5.000985 | 0.0 | 1.000016 | 0.0 | 0.0 |
| 25% | 30.0 | 8.725621 | 328.961543 | 10.767551 | 48.554662 | 12.0 | 2.000577 | 2.0 | 6.0 |
| 50% | 60.0 | 12.453073 | 649.385029 | 20.472305 | 92.533168 | 25.0 | 2.997293 | 5.0 | 12.0 |
| 75% | 90.0 | 16.214247 | 1081.266991 | 30.196107 | 136.622615 | 37.0 | 4.000671 | 7.0 | 18.0 |
| max | 119.0 | 19.989797 | 2376.235183 | 39.999296 | 179.999785 | 49.0 | 4.99993 | 9.0 | 24.0 |


> Looking at the data types for each column

In [127]:
train_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243787 entries, 0 to 243786
Data columns (total 21 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   AccountAge                243787 non-null  int64  
 1   MonthlyCharges            243787 non-null  float64
 2   TotalCharges              243787 non-null  float64
 3   SubscriptionType          243787 non-null  object 
 4   PaymentMethod             243787 non-null  object 
 5   PaperlessBilling          243787 non-null  object 
 6   ContentType               243787 non-null  object 
 7   MultiDeviceAccess         243787 non-null  object 
 8   DeviceRegistered          243787 non-null  object 
 9   ViewingHoursPerWeek       243787 non-null  float64
 10  AverageViewingDuration    243787 non-null  float64
 11  ContentDownloadsPerMonth  243787 non-null  int64  
 12  GenrePreference           243787 non-null  object 
 13  UserRating                243787 non-null  f

In [128]:
test_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104480 entries, 0 to 104479
Data columns (total 20 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   AccountAge                104480 non-null  int64  
 1   MonthlyCharges            104480 non-null  float64
 2   TotalCharges              104480 non-null  float64
 3   SubscriptionType          104480 non-null  object 
 4   PaymentMethod             104480 non-null  object 
 5   PaperlessBilling          104480 non-null  object 
 6   ContentType               104480 non-null  object 
 7   MultiDeviceAccess         104480 non-null  object 
 8   DeviceRegistered          104480 non-null  object 
 9   ViewingHoursPerWeek       104480 non-null  float64
 10  AverageViewingDuration    104480 non-null  float64
 11  ContentDownloadsPerMonth  104480 non-null  int64  
 12  GenrePreference           104480 non-null  object 
 13  UserRating                104480 non-null  f

> Validate that there are no missing values

In [129]:
train_raw_df.isnull().sum()

AccountAge                  0
MonthlyCharges              0
TotalCharges                0
SubscriptionType            0
PaymentMethod               0
PaperlessBilling            0
ContentType                 0
MultiDeviceAccess           0
DeviceRegistered            0
ViewingHoursPerWeek         0
AverageViewingDuration      0
ContentDownloadsPerMonth    0
GenrePreference             0
UserRating                  0
SupportTicketsPerMonth      0
Gender                      0
WatchlistSize               0
ParentalControl             0
SubtitlesEnabled            0
CustomerID                  0
Churn                       0
dtype: int64

In [130]:
test_raw_df.isnull().sum()

AccountAge                  0
MonthlyCharges              0
TotalCharges                0
SubscriptionType            0
PaymentMethod               0
PaperlessBilling            0
ContentType                 0
MultiDeviceAccess           0
DeviceRegistered            0
ViewingHoursPerWeek         0
AverageViewingDuration      0
ContentDownloadsPerMonth    0
GenrePreference             0
UserRating                  0
SupportTicketsPerMonth      0
Gender                      0
WatchlistSize               0
ParentalControl             0
SubtitlesEnabled            0
CustomerID                  0
dtype: int64

> Look for duplicated data

In [131]:
train_raw_df.duplicated().sum()

0

In [132]:
test_raw_df.duplicated().sum()

0
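
The missing-value and duplicate checks above can be bundled into one helper so both frames are validated the same way. This is a minimal sketch: `quality_summary` and the `demo` frame are hypothetical names introduced for illustration only.

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame) -> dict:
    """Return the number of missing cells and of fully duplicated rows."""
    return {
        "missing_cells": int(frame.isnull().sum().sum()),
        "duplicated_rows": int(frame.duplicated().sum()),
    }


# Small demo frame (assumed data): one missing cell and one repeated row.
demo = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
summary = quality_summary(demo)
print(summary)
```

In the notebook, `quality_summary(train_raw_df)` and `quality_summary(test_raw_df)` would both be expected to report zeros, matching the outputs shown above.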

<a id='cleaning'></a>
## Data Cleaning  

In this section, the data will be cleaned and prepared for machine learning.

> From here on, the provided test data is ignored because it does not contain a value for churn. A test set will instead be split off from the training data, which contains 240k+ rows. The test dataframe is cleared to prevent accidentally using it later.

In [133]:
test_raw_df = pd.DataFrame()
test_raw_df.shape

(0, 0)

In [134]:
df = train_raw_df.copy()  # work on a copy so the raw frame is not modified in place

> Dropping the `CustomerID` column: it keeps the data anonymized and the identifier carries no predictive information. The `AccountAge` column, which tracks the age of the customer's account, is the more appropriate feature.

In [135]:
df.drop(columns=['CustomerID'], inplace=True)

In [136]:
categorical_data_cols = []
numerical_data_cols = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_data_cols.append(column)
    else:
        numerical_data_cols.append(column)

In [137]:
categorical_data_cols

['SubscriptionType',
 'PaymentMethod',
 'PaperlessBilling',
 'ContentType',
 'MultiDeviceAccess',
 'DeviceRegistered',
 'GenrePreference',
 'Gender',
 'ParentalControl',
 'SubtitlesEnabled']

In [138]:
numerical_data_cols

['AccountAge',
 'MonthlyCharges',
 'TotalCharges',
 'ViewingHoursPerWeek',
 'AverageViewingDuration',
 'ContentDownloadsPerMonth',
 'UserRating',
 'SupportTicketsPerMonth',
 'WatchlistSize',
 'Churn']
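
The same split can be written without an explicit loop by using `DataFrame.select_dtypes`. The sketch below runs on a small stand-in frame (`demo` is a hypothetical name); it treats every non-numeric column as categorical, which is an assumption that holds for this dataset's column types.

```python
import pandas as pd

# Stand-in frame (assumed data) with two numeric columns and one text column.
demo = pd.DataFrame(
    {
        "AccountAge": [20, 57],
        "MonthlyCharges": [11.06, 5.18],
        "Gender": ["Male", "Female"],
    }
)

# Numeric columns (ints and floats) versus everything else.
numerical_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```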

### Categorical Data  
Look at value counts for all categorical data

In [139]:
df.SubscriptionType.value_counts()

SubscriptionType
Standard    81920
Basic       81050
Premium     80817
Name: count, dtype: int64

In [140]:
df.PaymentMethod.value_counts()

PaymentMethod
Electronic check    61313
Credit card         60924
Bank transfer       60797
Mailed check        60753
Name: count, dtype: int64

In [141]:
df.PaperlessBilling.value_counts()

PaperlessBilling
No     121980
Yes    121807
Name: count, dtype: int64

In [142]:
df.ContentType.value_counts()

ContentType
Both        81737
TV Shows    81145
Movies      80905
Name: count, dtype: int64

In [143]:
df.MultiDeviceAccess.value_counts()

MultiDeviceAccess
No     122035
Yes    121752
Name: count, dtype: int64

In [144]:
df.DeviceRegistered.value_counts()

DeviceRegistered
Computer    61147
Tablet      61143
Mobile      60914
TV          60583
Name: count, dtype: int64

In [145]:
df.GenrePreference.value_counts()

GenrePreference
Comedy     49060
Fantasy    48955
Drama      48744
Action     48690
Sci-Fi     48338
Name: count, dtype: int64

In [146]:
df.Gender.value_counts()

Gender
Female    121930
Male      121857
Name: count, dtype: int64

In [147]:
df.ParentalControl.value_counts()

ParentalControl
Yes    122085
No     121702
Name: count, dtype: int64

In [148]:
df.SubtitlesEnabled.value_counts()

SubtitlesEnabled
Yes    122180
No     121607
Name: count, dtype: int64

> Noting the following columns could be converted to binary values:  
`PaperlessBilling`  
`MultiDeviceAccess`  
`ParentalControl`  
`SubtitlesEnabled`  
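
The per-column cells above could be condensed into a single loop that also reports each category's share, which makes the near-even balance easier to see. This is a sketch on assumed stand-in data; in the notebook the loop would run over `categorical_data_cols` on `df`.

```python
import pandas as pd

# Stand-in frame (assumed data) with two categorical columns.
demo = pd.DataFrame(
    {
        "SubscriptionType": ["Basic", "Premium", "Basic", "Standard"],
        "PaperlessBilling": ["Yes", "No", "Yes", "No"],
    }
)

# Relative frequency of every category, keyed by column name.
shares = {col: demo[col].value_counts(normalize=True) for col in demo.columns}

for col, share in shares.items():
    print(f"{col}:\n{share}\n")
```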

### Binary Values
> Using sklearn's `LabelBinarizer` to convert "No" to 0 and "Yes" to 1

In [149]:
lb = LabelBinarizer()

In [150]:
cleaned_pb_vcs = df.PaperlessBilling.value_counts()
df.PaperlessBilling = lb.fit_transform(df.PaperlessBilling)
binarized_pb_vcs = df.PaperlessBilling.value_counts()
# Validate data changes
assert(cleaned_pb_vcs.No == binarized_pb_vcs[0])
assert(cleaned_pb_vcs.Yes == binarized_pb_vcs[1])

In [151]:
cleaned_mda_vcs = df.MultiDeviceAccess.value_counts()
df.MultiDeviceAccess = lb.fit_transform(df.MultiDeviceAccess)
binarized_mda_vcs = df.MultiDeviceAccess.value_counts()
# Validate data changes
assert(cleaned_mda_vcs.No == binarized_mda_vcs[0])
assert(cleaned_mda_vcs.Yes == binarized_mda_vcs[1])

In [152]:
cleaned_pc_vcs = df.ParentalControl.value_counts()
df.ParentalControl = lb.fit_transform(df.ParentalControl)
binarized_pc_vcs = df.ParentalControl.value_counts()
# Validate data changes
assert(cleaned_pc_vcs.No == binarized_pc_vcs[0])
assert(cleaned_pc_vcs.Yes == binarized_pc_vcs[1])

In [153]:
cleaned_se_vcs = df.SubtitlesEnabled.value_counts()
df.SubtitlesEnabled = lb.fit_transform(df.SubtitlesEnabled)
binarized_se_vcs = df.SubtitlesEnabled.value_counts()
# Validate data changes
assert(cleaned_se_vcs.No == binarized_se_vcs[0])
assert(cleaned_se_vcs.Yes == binarized_se_vcs[1])
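
Since the four cells above repeat the same steps, they could be folded into one loop. The sketch below swaps `LabelBinarizer` for an explicit `map`, which is an alternative rather than the notebook's method; `YES_NO`, `binary_cols`, and `demo` are hypothetical names.

```python
import pandas as pd

# Explicit mapping makes the 0/1 meaning visible in the code.
YES_NO = {"No": 0, "Yes": 1}

# Stand-in frame (assumed data); the notebook would use df.
demo = pd.DataFrame(
    {
        "PaperlessBilling": ["No", "Yes", "Yes"],
        "ParentalControl": ["Yes", "No", "No"],
    }
)
binary_cols = ["PaperlessBilling", "ParentalControl"]

for col in binary_cols:
    before = demo[col].value_counts()
    demo[col] = demo[col].map(YES_NO)
    after = demo[col].value_counts()
    # Validate that the counts survived the conversion.
    assert before["No"] == after[0]
    assert before["Yes"] == after[1]

print(demo)
```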

### Convert Categorical Data
`SubscriptionType`  
`PaymentMethod`  
`ContentType`   
`DeviceRegistered`  
`GenrePreference`  
`Gender`  

In [154]:
le = LabelEncoder()

In [155]:
cleaned_st_vcs = df.SubscriptionType.value_counts()
df.SubscriptionType = le.fit_transform(df.SubscriptionType)
binarized_st_vcs = df.SubscriptionType.value_counts()

In [156]:
df.PaymentMethod = le.fit_transform(df.PaymentMethod)

In [157]:
df.ContentType = le.fit_transform(df.ContentType)

In [158]:
df.DeviceRegistered = le.fit_transform(df.DeviceRegistered)

In [159]:
df.GenrePreference = le.fit_transform(df.GenrePreference)

In [160]:
df.Gender = le.fit_transform(df.Gender)

In [161]:
df[categorical_data_cols].sample(5)

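|   | SubscriptionType | PaymentMethod | PaperlessBilling | ContentType | MultiDeviceAccess | DeviceRegistered | GenrePreference | Gender | ParentalControl | SubtitlesEnabled |
|---|---|---|---|---|---|---|---|---|---|---|
| 165493 | 0 | 2 | 0 | 2 | 1 | 2 | 4 | 1 | 1 | 0 |
| 108420 | 1 | 0 | 0 | 2 | 0 | 2 | 3 | 1 | 0 | 1 |
| 239286 | 2 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 1 |
| 178995 | 2 | 3 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 74116 | 0 | 1 | 1 | 1 | 0 | 2 | 4 | 0 | 1 | 1 |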
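
One caveat with integer codes: `LabelEncoder` numbers categories in sorted order, and a linear model such as logistic regression will treat those numbers as magnitudes. For nominal columns like `PaymentMethod`, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on assumed stand-in data; it is not what the notebook does, only an option to compare against.

```python
import pandas as pd

# Stand-in frame (assumed data) with one nominal column.
demo = pd.DataFrame(
    {
        "PaymentMethod": ["Mailed check", "Credit card", "Bank transfer"],
        "AccountAge": [20, 57, 5],
    }
)

# One indicator column per category; no ordering is implied between them.
encoded = pd.get_dummies(demo, columns=["PaymentMethod"], dtype=int)

print(encoded)
```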


### Quantitative Data

> Rounding the quantitative data to two decimal places

In [162]:
df[numerical_data_cols].sample(5)

|   | AccountAge | MonthlyCharges | TotalCharges | ViewingHoursPerWeek | AverageViewingDuration | ContentDownloadsPerMonth | UserRating | SupportTicketsPerMonth | WatchlistSize | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 183958 | 36 | 18.395221 | 662.227947 | 27.845914 | 78.542125 | 34 | 1.141833 | 7 | 12 | 0 |
| 83422 | 115 | 7.527436 | 865.65509 | 4.772418 | 144.680869 | 5 | 2.404584 | 4 | 6 | 0 |
| 123451 | 114 | 8.630043 | 983.82492 | 37.910304 | 67.330269 | 32 | 3.590707 | 4 | 22 | 0 |
| 106863 | 40 | 8.94457 | 357.782789 | 6.870905 | 138.50304 | 48 | 2.453349 | 2 | 24 | 0 |
| 224956 | 2 | 13.726103 | 27.452206 | 3.444698 | 83.057255 | 21 | 1.779996 | 0 | 8 | 1 |


In [163]:
for col in numerical_data_cols:
    if pd.api.types.is_float_dtype(df[col]):
        df[col] = df[col].round(2)

In [164]:
df.sample(5)

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,Churn
122217,66,5.52,364.58,0,2,0,2,0,2,7.39,110.0,0,2,1.62,7,0,12,0,1,0
52809,95,14.88,1413.66,0,0,0,1,0,0,28.3,38.39,33,1,3.69,5,0,14,1,1,0
184046,95,12.78,1214.01,1,3,0,1,1,1,34.52,53.75,3,4,3.4,5,0,24,0,0,0
188285,101,8.43,851.23,0,1,1,1,0,0,19.88,75.28,43,2,2.51,1,0,6,1,0,0
82430,102,15.98,1629.84,1,2,1,2,0,1,22.36,54.44,16,2,3.05,0,0,17,0,1,0


### Look for outliers

In [165]:
df.AccountAge.describe()

count    243787.000000
mean         60.083758
std          34.285143
min           1.000000
25%          30.000000
50%          60.000000
75%          90.000000
max         119.000000
Name: AccountAge, dtype: float64

> Any extreme values in `AccountAge` (1 to 119 months) would occur naturally, since accounts simply differ in age, so no rows are excluded.

In [166]:
df.TotalCharges.describe()

count    243787.000000
mean        750.741017
std         523.073274
min           4.990000
25%         329.150000
50%         649.880000
75%        1089.320000
max        2378.720000
Name: TotalCharges, dtype: float64

> As with `AccountAge`, extreme values in `TotalCharges` would occur naturally, so no rows are excluded.

In [167]:
df.ViewingHoursPerWeek.describe()

count    243787.000000
mean         20.502183
std          11.243755
min           1.000000
25%          10.760000
50%          20.520000
75%          30.220000
max          40.000000
Name: ViewingHoursPerWeek, dtype: float64

In [168]:
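# NOTE: 20.502183 is the column mean reported by describe() above,
# not the standard deviation (11.243755)
vhpw_stddev = 20.502183
mean = df.ViewingHoursPerWeek.mean()

vhpw_outlier_count = df[(df['ViewingHoursPerWeek'] > mean + 1.5 * vhpw_stddev) | (df['ViewingHoursPerWeek'] < mean - 1.5 * vhpw_stddev)]['ViewingHoursPerWeek'].count()
print(f'Count greater than 1.5 stddev away from the mean ({round(mean, 1)}): {vhpw_outlier_count}')

Count greater than 1.5 stddev away from the mean (20.5): 0


> No values of `ViewingHoursPerWeek` fall outside this band. Because the hard-coded `vhpw_stddev` equals the column mean rather than the standard deviation, the band is wider than a true 1.5 stddev window.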

In [169]:
mean = df['AverageViewingDuration'].mean()
avd_stddev = df['AverageViewingDuration'].describe()['std']
df['AverageViewingDuration'].describe()

count    243787.000000
mean         92.264052
std          50.505245
min           5.000000
25%          48.380000
50%          92.250000
75%         135.910000
max         180.000000
Name: AverageViewingDuration, dtype: float64

In [170]:
avd_outliers = df[(df['AverageViewingDuration'] > mean + 1.5 * avd_stddev) | (df['AverageViewingDuration'] < mean - 1.5 * avd_stddev)]['AverageViewingDuration']
print(f'Count greater than 1.5 stddev away from the mean ({round(mean, 1)}): {avd_outliers.count()}')

Count greater than 1.5 stddev away from the mean (92.3): 32657


> There are 32,657 rows with an `AverageViewingDuration` more than 1.5 stddev away from the mean (92.3). However, this seems natural: most movies run about 1.5 hours and some run 3 hours. It is also plausible that a user would start a show and stop a few minutes in.

> Just to double check, checking to see if any data is > 2 stddev away

In [171]:
avd_outliers = df[(df['AverageViewingDuration'] > mean + 2 * avd_stddev) | (df['AverageViewingDuration'] < mean - 2 * avd_stddev)]['AverageViewingDuration']
print(f'Count greater than 2 stddev away from the mean ({round(mean, 1)}): {avd_outliers.count()}')

Count greater than 2 stddev away from the mean (92.3): 0


> Will not exclude any data because the values seem acceptable. If there had been data > 2 stddev away from the mean (above roughly 193 minutes), it would have been excluded.
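
A count of zero at 2 stddev is what the summary statistics would predict if `AverageViewingDuration` is roughly uniformly distributed. That is an assumption suggested by the evenly spaced quartiles, not something the dataset documents. For a uniform distribution on $[a, b]$:

$$
\mu = \frac{a + b}{2}, \qquad
\sigma = \frac{b - a}{\sqrt{12}}, \qquad
\max_x |x - \mu| = \frac{b - a}{2} = \sqrt{3}\,\sigma \approx 1.73\,\sigma
$$

With $a = 5$ and $b = 180$, $\sigma \approx 175 / \sqrt{12} \approx 50.5$, close to the observed 50.51. Under that assumption no value can lie 2 stddev from the mean, while a sizeable share lies beyond 1.5 stddev.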

In [172]:
mean = df['ContentDownloadsPerMonth'].mean()
cdpm_stddev = df['ContentDownloadsPerMonth'].describe()['std']
df['ContentDownloadsPerMonth'].describe()

count    243787.000000
mean         24.503513
std          14.421174
min           0.000000
25%          12.000000
50%          24.000000
75%          37.000000
max          49.000000
Name: ContentDownloadsPerMonth, dtype: float64

In [173]:
cdpm_outliers = df[(df['ContentDownloadsPerMonth'] > mean + 2 * cdpm_stddev) | (df['ContentDownloadsPerMonth'] < mean - 2 * cdpm_stddev)]['ContentDownloadsPerMonth']
print(f'Count greater than 2 stddev away from the mean ({round(mean, 1)}): {cdpm_outliers.count()}')

Count greater than 2 stddev away from the mean (24.5): 0


> Chose to go 2 stddev away since it seems likely some users would make a lot more use of the service than others.
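
The repeated mean and stddev checks in this section could be wrapped in a small helper so the threshold is always derived from the column itself. This is a minimal sketch: `count_outliers` and the `demo` series are hypothetical names, and the data is made up for illustration.

```python
import pandas as pd


def count_outliers(series: pd.Series, k: float = 2.0) -> int:
    """Count values more than k sample standard deviations from the mean."""
    mean, std = series.mean(), series.std()
    mask = (series > mean + k * std) | (series < mean - k * std)
    return int(mask.sum())


# Assumed demo data: seven similar values and one extreme value.
demo = pd.Series([10, 11, 12, 11, 10, 12, 11, 100])

print(count_outliers(demo, k=2))
print(count_outliers(demo, k=3))
```

In the notebook, a call such as `count_outliers(df['ContentDownloadsPerMonth'], k=2)` would replace the hand-written filter.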

> Data is fully cleaned

<a id="training"></a>
## Training

> Split the data into training, test, and validation sets

In [201]:
# Separate the features from the target column
X = df.drop(columns='Churn')
y = df['Churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

> Train, test, and validation data are set
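
Two successive 25% splits leave 56.25% of the rows for training, 18.75% for validation, and 25% for testing. Because only about 18% of customers churn, passing `stratify` keeps that rate the same in every subset. The sketch below shows this on synthetic data (the `demo` frame and its churn rate are assumptions for illustration); it is a variation on the split above, not the notebook's exact call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed): 1,000 rows with roughly 18% churn.
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    {
        "AccountAge": rng.integers(1, 120, size=1000),
        "Churn": (rng.random(1000) < 0.18).astype(int),
    }
)
X = demo.drop(columns="Churn")
y = demo["Churn"]

# Hold out 25% for testing, preserving the churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# Hold out 25% of the remainder for validation, again stratified.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)

print(len(X_train), len(X_val), len(X_test))
print(round(y.mean(), 3), round(y_train.mean(), 3), round(y_val.mean(), 3))
```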

### Logistic Regression

In [199]:
log_reg_model = LogisticRegression(max_iter=150_000)

In [200]:
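# NOTE: this call re-splits the full dataset without a fixed random_state and replaces
# the earlier X_train/X_test; rows in X_val may therefore also appear in this X_train
X_train, X_test, y_train, y_test = train_test_split(df.drop('Churn', axis=1), df['Churn'], test_size=0.25)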
log_reg_model.fit(X_train, y_train)

y_pred = log_reg_model.predict(X_test)
accuracy = log_reg_model.score(X_test, y_test)

accuracy * 100

82.28624870789373

In [203]:
y_val_pred = log_reg_model.predict(X_val)
accuracy = log_reg_model.score(X_val, y_val)
accuracy * 100

82.53336250273463
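
Accuracy alone can be misleading here because the classes are imbalanced: the `Churn` mean from `describe()` is 0.181232, so always predicting "no churn" would already score about 81.9%. The sketch below uses synthetic labels with a similar churn rate (the labels and variable names are illustrative assumptions, not the notebook's data) to show how that baseline and a confusion matrix can be computed with the metrics imported at the top.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels (assumed): roughly 18% positives, mimicking the churn rate.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.18).astype(int)

# Majority-class baseline: predict "no churn" (0) for every customer.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
cm = confusion_matrix(y_true, y_majority)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(cm)  # rows are true classes, columns are predicted classes
```

In the notebook, `confusion_matrix(y_test, y_pred)` would show how many churners the logistic regression actually identifies, which the accuracy score does not reveal.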