# Streaming OTT Subscription

This analysis examines customer churn data from a subscription-based streaming service with a large library of movies, TV shows, and original content. Understanding why customers cancel their subscriptions helps improve the user experience, reduce churn, and increase customer lifetime value.

## About the Data

| Feature                  | Description                                            |
|--------------------------|--------------------------------------------------------|
| CustomerID               | Unique identifier for each customer                    |
| SubscriptionType         | Type of subscription plan chosen by the customer      |
| PaymentMethod            | Method used for payment                                |
| PaperlessBilling         | Whether the customer uses paperless billing            |
| ContentType              | Type of content accessed by the customer               |
| MultiDeviceAccess        | Whether the customer has access on multiple devices   |
| DeviceRegistered         | Device registered by the customer                      |
| GenrePreference          | Genre preference of the customer                       |
| Gender                   | Gender of the customer                                 |
| ParentalControl          | Whether parental control is enabled                    |
| SubtitlesEnabled         | Whether subtitles are enabled                          |
| AccountAge               | Age of the customer's subscription account (in months) |
| MonthlyCharges           | Monthly subscription charges                           |
| TotalCharges             | Total charges incurred by the customer                 |
| ViewingHoursPerWeek      | Average number of viewing hours per week               |
| SupportTicketsPerMonth   | Number of customer support tickets raised per month    |
| AverageViewingDuration   | Average duration of each viewing session               |
| ContentDownloadsPerMonth | Number of content downloads per month                  |
| UserRating               | Customer satisfaction rating (1 to 5)                  |
| WatchlistSize            | Size of the customer's content watchlist               |
| Churn                    | Whether the customer churned (target variable)         |


# Objective

- Build a machine learning model to predict customer churn.

# Importing Dependencies

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler

# Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Metrics
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

## Load the Data

In [2]:
df = pd.read_csv('/content/drive/MyDrive/ott_ml/ott_subscription.csv')

In [3]:
df.head()

|   | AccountAge | MonthlyCharges | TotalCharges | SubscriptionType | PaymentMethod | PaperlessBilling | ContentType | MultiDeviceAccess | DeviceRegistered | ViewingHoursPerWeek | ... | ContentDownloadsPerMonth | GenrePreference | UserRating | SupportTicketsPerMonth | Gender | WatchlistSize | ParentalControl | SubtitlesEnabled | CustomerID | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 11.055215 | 221.104302 | Premium | Mailed check | No | Both | No | Mobile | 36.758104 | ... | 10 | Sci-Fi | 2.176498 | 4 | Male | 3 | No | No | CB6SXPNVZA | 0.0 |
| 1 | 57 | 5.175208 | 294.986882 | Basic | Credit card | Yes | Movies | No | Tablet | 32.450568 | ... | 18 | Action | 3.478632 | 8 | Male | 23 | No | Yes | S7R2G87O09 | 0.0 |
| 2 | 73 | 12.106657 | 883.785952 | Basic | Mailed check | Yes | Movies | No | Computer | 7.39516 | ... | 23 | Fantasy | 4.238824 | 6 | Male | 1 | Yes | Yes | EASDC20BDT | 0.0 |
| 3 | 32 | 7.263743 | 232.439774 | Basic | Electronic check | No | TV Shows | No | Tablet | 27.960389 | ... | 30 | Drama | 4.276013 | 2 | Male | 24 | Yes | Yes | NPF69NT69N | 0.0 |
| 4 | 57 | 16.953078 | 966.325422 | Premium | Electronic check | Yes | TV Shows | No | TV | 20.083397 | ... | 20 | Comedy | 3.61617 | 4 | Female | 0 | No | No | 4LGYPK7VOL | 0.0 |


In [4]:
# Drop rows with missing values and the identifier column; reassigning
# (rather than using inplace=True) avoids pandas' SettingWithCopyWarning.
df = df.dropna().drop(columns='CustomerID')
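Dropping rows and columns by reassignment, rather than in place on the result of `dropna()`, keeps pandas from raising a `SettingWithCopyWarning`. The following is a minimal sketch of that pattern; the `toy` frame and its values are hypothetical stand-ins, not rows from the actual CSV.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3"],
    "MonthlyCharges": [11.05, None, 12.10],
    "Churn": [0.0, 1.0, 0.0],
})

# Each method returns a new frame, so nothing is mutated in place.
clean = toy.dropna().drop(columns="CustomerID")

print(clean.shape)          # (2, 2)
print(list(clean.columns))  # ['MonthlyCharges', 'Churn']
```

The original `toy` frame is left untouched, which also makes the cell safe to re-run.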


# Train Test Split

In [5]:
x = df.drop('Churn', axis = 1)
y = df['Churn']

In [6]:
# Hold out 25% for testing; stratify so both splits keep the same churn rate.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, shuffle=True, random_state=42, stratify=y
)
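To illustrate what `stratify=y` does, here is a small sketch on synthetic data. The array sizes and the 18% positive rate are assumptions chosen to roughly mirror the class balance that LightGBM reports later in this notebook; none of the values come from the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))        # synthetic features
y = np.array([1] * 180 + [0] * 820)   # 18% positives (assumed, for illustration)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42, stratify=y
)

# With stratification, both splits keep close to the overall positive rate.
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate in a split can drift from the overall rate by chance, which matters more when the minority class is small.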

# Column Transformer

In [7]:
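# Columns are selected by position in x (CustomerID and Churn already dropped).
column_transform = ColumnTransformer([
    # Nominal categoricals: one-hot encode, dropping the first level of each
    ('One_hot', OneHotEncoder(drop='first', sparse_output=False), [4, 5, 6, 7, 8, 12, 15, 17, 18]),
    # SubscriptionType (position 3): ordered Basic < Standard < Premium
    ('Ordinal_encoding', OrdinalEncoder(categories=[['Basic', 'Standard', 'Premium']]), [3]),
    # Numeric features: standardize to zero mean and unit variance
    ('standard_scaler', StandardScaler(), [0, 1, 2, 9, 10, 11, 13, 14, 16])
], remainder='passthrough')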

In [8]:
transformed_x_train = column_transform.fit_transform(x_train)

# Cross Validation

In [9]:
cv_df = pd.DataFrame(transformed_x_train,columns = column_transform.get_feature_names_out())

In [10]:
cv_x = cv_df.copy()
cv_y = y_train.copy()

In [15]:
model_ = [DecisionTreeClassifier(),
          ExtraTreeClassifier(),
          RandomForestClassifier(),
          LGBMClassifier(),
          XGBClassifier(learning_rate = 0.01, booster = 'gbtree'),
          GradientBoostingClassifier(),
          SGDClassifier(loss = 'hinge', penalty='l1', alpha = 0.01, learning_rate='adaptive',max_iter=1080, eta0 = 0.01)
]

In [16]:
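def cross_validation():
    """Print and return the mean 5-fold CV accuracy of each candidate model."""
    scores = {}
    for model in model_:
        name = model.__class__.__name__
        cvscore = cross_val_score(model, cv_x, cv_y, cv=5, scoring='accuracy')
        scores[name] = np.mean(cvscore)
        print(f'{name}, {scores[name]}')
    return scores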

In [17]:
cross_validation()

DecisionTreeClassifier, 0.7251367315685846
ExtraTreeClassifier, 0.7207777291621089
RandomForestClassifier, 0.8221341063224678
[LightGBM] [Info] Number of positive: 26509, number of negative: 119763
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.017790 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1520
[LightGBM] [Info] Number of data points in the train set: 146272, number of used features: 27
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.181231 -> initscore=-1.508030
[LightGBM] [Info] Start training from score -1.508030
[... remaining cross-validation output truncated ...]
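The cross-validation above scores models on features that were transformed once on the full training set, so the scaler has already seen every validation fold. One way to avoid this is to wrap preprocessing and model in a single `Pipeline` and pass it to `cross_validate`, which refits the preprocessing inside each fold and can report several metrics at once. The sketch below uses synthetic data from `make_classification` as a stand-in; the sample size, class weights, and estimator settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 18% positives); not the OTT dataset.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.82, 0.18], random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Preprocessing is refit on the training part of every fold.
results = cross_validate(pipe, X, y, cv=5, scoring=["accuracy", "precision", "recall"])

for metric in ("accuracy", "precision", "recall"):
    print(metric, results[f"test_{metric}"].mean())
```

Reporting precision and recall next to accuracy is useful here because accuracy alone can look strong on imbalanced data even when few churners are found.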

# Pipeline

In [19]:
column_transform = ColumnTransformer([
    ('One_hot', OneHotEncoder(drop='first', sparse_output=False), [4, 5, 6, 7, 8, 12, 15, 17, 18]),
    ('Ordinal_encoding', OrdinalEncoder(categories=[['Basic', 'Standard', 'Premium']]), [3]),
    ('standard_scaler', StandardScaler(), [0, 1, 2, 9, 10, 11, 13, 14, 16])
], remainder='passthrough')

esti_ = [
    ('Random_forest', RandomForestClassifier()),
    ('Lightgbm',LGBMClassifier()),
    ('Xgboost',XGBClassifier(learning_rate = 0.01, booster = 'gbtree')),
    ('Sgd',SGDClassifier(loss = 'hinge', penalty='l1', alpha = 0.01, learning_rate='adaptive',max_iter=1080, eta0 = 0.01)),
]

stacking = StackingClassifier(estimators=esti_, final_estimator=RandomForestClassifier())

pipes = Pipeline([
    ('column_transform', column_transform),
    ('model', stacking)
])
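`StackingClassifier` fits each base estimator on the full training set and, separately, uses cross-validated predictions (5 folds by default) from those estimators as input features for the final estimator. This is consistent with LightGBM reporting both 182840 and 146272 training rows in the fit output. The sketch below shows the mechanism on synthetic data; the base estimators, the `LogisticRegression` meta-learner, and all settings are assumptions for illustration and differ from the stack defined above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Synthetic stand-in data; not the OTT dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=30, random_state=42)),
        ("sgd", SGDClassifier(loss="hinge", random_state=42)),
    ],
    # Assumption: a simple meta-learner, not the notebook's RandomForestClassifier.
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)

# One meta-feature per base estimator in the binary case:
# rf contributes its positive-class probability, sgd its decision score.
meta_features = stack.transform(X)
print(meta_features.shape)  # (500, 2)
```

A linear meta-learner is a common choice because the final estimator sees only a handful of meta-features.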

In [21]:
pipes.fit(x_train,y_train)

[LightGBM] [Info] Number of positive: 33136, number of negative: 149704
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011852 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1520
[LightGBM] [Info] Number of data points in the train set: 182840, number of used features: 27
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.181229 -> initscore=-1.508040
[LightGBM] [Info] Start training from score -1.508040
[LightGBM] [Info] Number of positive: 26509, number of negative: 119763
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.041142 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1520
[LightGBM] [Info] Number of data points in the train set: 146272, number of used features: 27
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.181231 -> initscore=-1.50803

In [22]:
pred = pipes.predict(x_test)

In [24]:
accuracy_score(y_test,pred)

0.8127225294107996

In [25]:
confusion_matrix(y_test,pred)

array([[47831,  2070],
       [ 9344,  1702]])
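Accuracy alone hides how the model treats the minority class. The sketch below derives precision, recall, and a majority-class baseline from the confusion matrix reported above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, class 0 first).

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[47831, 2070],
               [9344, 1702]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)       # share of predicted churners who really churned
recall = tp / (tp + fn)          # share of actual churners the model caught
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "no churn"

print(f"accuracy  {accuracy:.4f}")   # 0.8127
print(f"precision {precision:.4f}")  # 0.4512
print(f"recall    {recall:.4f}")     # 0.1541
print(f"baseline  {baseline:.4f}")   # 0.8188
```

On this test set the stacked model identifies about 15% of actual churners, and its accuracy falls slightly below the baseline of always predicting "no churn".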