To predict user churn for the HD television service, I considered the following factors when selecting a machine learning model:
-	Problem Type:
Churn prediction is a classification problem.
Models that work well for classification include Decision Trees, Random Forests, and Gradient Boosting.
-	Data Complexity:
The data mixes categorical and numerical types, and the relationships between features are unknown.
-	Scalability and Efficiency:
The dataset is large (~900,000 rows).
Models such as Random Forests and Gradient Boosting are preferred over more computationally intensive ones such as SVMs.
-	Interpretability:
Random Forests balance accuracy and interpretability, unlike more complex but less interpretable models.

I'll use a RandomForestClassifier to predict user churn. A Random Forest builds multiple decision trees and combines their predictions to get a more accurate and stable result.
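
To make the idea of combining trees concrete, here is a minimal sketch of a hard majority vote in plain Python. The five per-tree predictions are hypothetical values chosen for illustration, and `majority_vote` is a helper name I introduce here. Note that scikit-learn's RandomForestClassifier does not count hard votes like this; it averages the class probabilities predicted by each tree, so treat this only as a simplified picture.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    # most_common(1) yields a list with one (label, count) pair.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single row.
votes = [12, 12, 3, 12, 1]
print(majority_vote(votes))
```

A single tree that overfits (the one voting 3 or 1 here) is outvoted by the rest, which is the intuition behind the more stable result.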

In [1]:
# Import the necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import time

In [2]:
# Load the dataset
data = pd.read_csv('HDdata.csv')

In [3]:
# Define the function used to categorize '# of days' and create the 'Subscription' column.
def categorize_subscription(days):
    if days <= 31:
        return 1  # Monthly
    elif days <= 91:
        return 3  # Quarterly
    elif days <= 182:
        return 6  # Half-yearly
    else:
        return 12  # Yearly
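
As a cross-check of the thresholds above, the same mapping can be written as a table lookup. This is only a sketch: `UPPER_BOUNDS`, `LABELS`, and `categorize_subscription_table` are names I made up for illustration, and the block relies on the standard-library `bisect` module rather than anything in the notebook.

```python
from bisect import bisect_left

# Inclusive upper bounds (in days) for monthly, quarterly, and half-yearly plans.
UPPER_BOUNDS = [31, 91, 182]
# Subscription length in months; the last entry covers everything above 182 days.
LABELS = [1, 3, 6, 12]


def categorize_subscription_table(days):
    """Map a number of days to a subscription length using a lookup table."""
    # bisect_left returns the index of the first bound that is >= days,
    # which reproduces the chain of `days <= bound` checks.
    return LABELS[bisect_left(UPPER_BOUNDS, days)]


for days in (31, 32, 91, 92, 182, 183, 375):
    print(days, categorize_subscription_table(days))
```

Keeping the bounds in one list makes the boundary cases (31, 91, 182) easy to inspect and to change.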

In [4]:
# Apply the function to create 'Subscription' column
data['Subscription'] = data['# of days'].apply(categorize_subscription)

In [5]:
data.head()

Unnamed: 0,MAC,SessionMainMenu,AppName,LogID,Event,ItemID,RealTimePlaying,ItemName,BoxTime,Contract,Session,AppID,# of days,Subscription
0,B046FCAC0DC1,2016-02-12 12:35:13.437,VOD,52,StopVOD,100052388.0,570.3,Trường Học Moorim (20 Tập),2016:02:12:12:46:23:663,SGFD81389,B046FCAC0DC1:2016:02:12:12:34:51:923,VOD,375,12
1,B046FCAC0DC1,2016-02-11 01:01:56.838,IPTV,40,EnterIPTV,,0.0,,2016:02:11:01:02:09:375,SGFD81389,B046FCAC0DC1:2016:02:11:01:01:56:813,IPTV,375,12
2,B046FCAC0DC1,2016-02-11 01:02:29.258,VOD,55,NextVOD,100052388.0,0.0,Trường Học Moorim (20 Tập),2016:02:11:03:06:16:654,SGFD81389,B046FCAC0DC1:2016:02:11:01:01:56:813,VOD,375,12
3,B046FCAC0DC1,2016-02-12 04:44:59.143,IPTV,18,ChangeModule,,0.0,,2016:02:12:04:45:40:984,SGFD81389,B046FCAC0DC1:2016:02:12:04:44:54:688,VOD,375,12
4,B046FCAC0DC1,2016-02-12 12:35:13.437,VOD,54,PlayVOD,100052388.0,0.0,Trường Học Moorim (20 Tập),2016:02:12:12:47:37:208,SGFD81389,B046FCAC0DC1:2016:02:12:12:34:51:923,VOD,375,12


In [6]:
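# Encode the categorical columns 'MAC', 'AppID', and 'Event' as integers.
# Each column keeps its own fitted encoder so the mappings can be reused later.
encoders = {}
for col in ['MAC', 'AppID', 'Event']:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

# Define the features and the target, then split them into a training set and a test set.
# Note: 'Subscription' is derived directly from '# of days', so keeping '# of days'
# as a feature lets the model recover the target from that column alone.
X = data[['MAC', 'Event', 'RealTimePlaying', 'AppID', '# of days']]
y = data['Subscription']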

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
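
The split above is made row by row, and the preview of the data shows several log rows sharing one MAC, so rows from the same box can end up in both the training and the test set. Below is a minimal sketch of a group-aware split in plain Python, assuming the MAC is the grouping key. The helper `split_by_group` and the MAC values are hypothetical; scikit-learn offers `GroupShuffleSplit` in `sklearn.model_selection` for the same purpose.

```python
import random


def split_by_group(groups, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) such that no group appears in both."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out a fraction of the groups (at least one), not a fraction of the rows.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical MAC values, one entry per log row.
macs = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "E"]
train_idx, test_idx = split_by_group(macs)
train_macs = {macs[i] for i in train_idx}
test_macs = {macs[i] for i in test_idx}
print(sorted(train_macs), sorted(test_macs))
```

With this kind of split, the test accuracy reflects how the model behaves on boxes it has never seen.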


In [7]:
# Train the RandomForest classifier.
rf = RandomForestClassifier()
start_time = time.time()
rf.fit(X_train, y_train)
end_time = time.time()
training_time = end_time - start_time

# Evaluate the model's accuracy on the test set.
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

In [8]:
# Inspect the predictions on a random sample of 50 rows from the test set.
subset_X_test = X_test.sample(50, random_state=42)
subset_y_pred = rf.predict(subset_X_test)
subset_y_test = y_test.loc[subset_X_test.index]
subset_predictions_df = pd.DataFrame({'Actual Subscription': subset_y_test, 'Predicted Subscription': subset_y_pred})

subset_predictions_df

Unnamed: 0,Actual Subscription,Predicted Subscription
520478,1,1
795376,12,12
482839,3,3
27628,12,12
764807,12,12
138911,1,1
712748,12,12
221820,12,12
3820,12,12
63042,1,1
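
Accuracy is easier to judge next to a simple baseline. The sketch below uses only the ten rows displayed above and compares the model's accuracy on them with a baseline that always predicts the most frequent class. The `accuracy` helper is a name I introduce here, and ten rows are far too few to draw conclusions from; this is just an illustration of the comparison.

```python
from collections import Counter

# Actual and predicted subscriptions copied from the ten rows shown above.
actual = [1, 12, 3, 12, 12, 1, 12, 12, 12, 1]
predicted = [1, 12, 3, 12, 12, 1, 12, 12, 12, 1]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Baseline: always predict the most frequent label in the sample.
majority_label, _ = Counter(actual).most_common(1)[0]
baseline_accuracy = accuracy(actual, [majority_label] * len(actual))
model_accuracy = accuracy(actual, predicted)

print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"model accuracy: {model_accuracy:.2f}")
```

When one class dominates, a high accuracy can mostly reflect the class balance, so the gap between the two numbers is more informative than the model's accuracy alone.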
