# **Machine Learning Model Building**

**Objective:** Build an ML model that accurately predicts whether a video contains a claim or offers an opinion, in order to reduce the backlog of user reports. <br>
**Action:** Build XGBoost and Random Forest models, evaluate their performance, and select the best-performing model for the next steps. <br>
Note: A Logistic Regression model was initially built but has been excluded from this project summary due to insufficient performance.

In [1]:
import json
import joblib

# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Import packages for data modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import xgboost

In [2]:
data = pd.read_csv("../datasets/tiktok_final_dataset.csv")
# Keep an untouched copy of the loaded data instead of reading the file twice
raw_data = data.copy()

In [3]:
X = data.drop(columns=["claim_status"])
y = data[["claim_status"]]

In [4]:
X

Unnamed: 0,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
...,...,...,...,...,...,...,...
19059,not verified,active,6067.0,423.0,81.0,8.0,2.0
19060,not verified,active,2973.0,820.0,70.0,3.0,0.0
19061,not verified,active,734.0,102.0,7.0,2.0,1.0
19062,not verified,active,3394.0,655.0,123.0,11.0,4.0


In [5]:
y

Unnamed: 0,claim_status
0,claim
1,claim
2,claim
3,claim
4,claim
...,...
19059,opinion
19060,opinion
19061,opinion
19062,opinion


Check class balance.

In [6]:
y.value_counts()

claim_status
claim           9606
opinion         9458
Name: count, dtype: int64
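
The counts above can also be read as proportions, which makes the balance easier to judge at a glance. A minimal sketch, using only the two counts printed above (the names `counts` and `shares` are illustrative and not part of the notebook):

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"claim": 9606, "opinion": 9458})

# Share of each class in the full dataset
shares = counts / counts.sum()
print(shares.round(3))
```

Both classes sit close to one half, so no resampling or class weighting appears necessary for this split.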

Encode all categorical variables.

In [7]:
# Map each category label to a numeric code (applied to every column of X)
X_enc = X.replace({"verified": 1.0, "not verified": 0.0,
                   "active": 1.0, "banned": 0.0, "under review": 2.0})
X_enc

Unnamed: 0,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,0.0,2.0,343296.0,19425.0,241.0,1.0,0.0
1,0.0,1.0,140877.0,77355.0,19034.0,1161.0,684.0
2,0.0,1.0,902185.0,97690.0,2858.0,833.0,329.0
3,0.0,1.0,437506.0,239954.0,34812.0,1234.0,584.0
4,0.0,1.0,56167.0,34987.0,4110.0,547.0,152.0
...,...,...,...,...,...,...,...
19059,0.0,1.0,6067.0,423.0,81.0,8.0,2.0
19060,0.0,1.0,2973.0,820.0,70.0,3.0,0.0
19061,0.0,1.0,734.0,102.0,7.0,2.0,1.0
19062,0.0,1.0,3394.0,655.0,123.0,11.0,4.0
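
Because `DataFrame.replace` with a flat dictionary is applied to every column, a per-column mapping can be a safer alternative when labels might collide across columns. The following is a sketch on a small toy frame, not the notebook's actual code; `X_demo`, `verified_map`, and `ban_map` are hypothetical names, and the codes mirror the ones used above.

```python
import pandas as pd

# Toy stand-in for X (illustrative rows only)
X_demo = pd.DataFrame({
    "verified_status": ["not verified", "verified", "not verified"],
    "author_ban_status": ["under review", "active", "banned"],
    "video_view_count": [343296.0, 140877.0, 902185.0],
})

# One mapping per categorical column, using the same codes as above
verified_map = {"verified": 1.0, "not verified": 0.0}
ban_map = {"active": 1.0, "banned": 0.0, "under review": 2.0}

X_demo_enc = X_demo.copy()
X_demo_enc["verified_status"] = X_demo_enc["verified_status"].map(verified_map)
X_demo_enc["author_ban_status"] = X_demo_enc["author_ban_status"].map(ban_map)

# Series.map returns NaN for labels missing from the mapping,
# so counting NaNs is a quick check that every category was covered
n_unmapped = int(X_demo_enc[["verified_status", "author_ban_status"]].isna().sum().sum())
print(n_unmapped)
```

Note that the codes for `author_ban_status` impose an arbitrary numeric order on the categories; tree-based models such as the ones built below can usually cope with this, but it is an assumption worth keeping in mind.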


In [8]:
y_enc = y.replace({"claim":1.0,"opinion":0.0})
y_enc

Unnamed: 0,claim_status
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
...,...
19059,0.0
19060,0.0
19061,0.0
19062,0.0


In [9]:
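# Hold out 20% of the data as the test set
X_train_all, X_test, y_train_all, y_test = train_test_split(X_enc, y_enc, test_size=0.2)
# Split the remaining 80% into train (75%) and validation (25%): 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(X_train_all, y_train_all, test_size=0.25)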

In [43]:
# Dict of objects to be saved to a file for model evaluation in another Jupyter notebook
objects_to_save = {}
objects_to_save["X_val"] = X_val
objects_to_save["y_val"] = y_val
objects_to_save["X_test"] = X_test
objects_to_save["y_test"] = y_test

In [10]:
X_train.shape , X_val.shape, y_train.shape, y_val.shape, X_test.shape, y_test.shape

((11438, 7), (3813, 7), (11438, 1), (3813, 1), (3813, 7), (3813, 1))
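
The shapes confirm a 60/20/20 split: taking 25% of the remaining 80% yields 20% of the full dataset for validation. The split above uses no fixed seed, so it changes on every run. A possible variant, sketched here on toy data, fixes `random_state` and stratifies on the target so that class proportions are preserved in each subset; `X_demo`, `y_demo`, and `RANDOM_STATE` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for X_enc / y_enc: 100 rows, perfectly balanced target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"video_view_count": rng.integers(0, 1000, size=100)})
y_demo = pd.Series([1] * 50 + [0] * 50, name="claim_status")

RANDOM_STATE = 42  # hypothetical seed, chosen only for illustration

# First split: 80% train+validation, 20% test
X_tr_all, X_te, y_tr_all, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=RANDOM_STATE
)
# Second split: 25% of the 80% becomes validation (20% overall)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr_all, y_tr_all, test_size=0.25, stratify=y_tr_all, random_state=RANDOM_STATE
)

print(len(X_tr), len(X_va), len(X_te))
```

Fixing the seed would make the saved validation and test sets reproducible across notebook runs, at the cost of tying the reported scores to one particular partition.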

## **GridSearchCV**
We build Random Forest and XGBoost models with different hyperparameters to find the model that performs best on the chosen performance metrics. <br>
The selection metric must reflect the business need of this project: finding claims has priority over finding opinions. <br>
The number of false negatives, which are actual claims predicted as opinions, should be as low as possible. <br>
This is measured by **recall**, which we use to determine the best model once the grid search is complete.
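
To make the metric concrete: recall is the share of actual claims that the model identifies, $\text{recall} = \frac{TP}{TP + FN}$. The sketch below computes it by hand and with scikit-learn on a few made-up labels (1 = claim, 0 = opinion); the label lists are purely illustrative and are not taken from the dataset.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only: 1 = claim, 0 = opinion
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# One claim was predicted as an opinion (a false negative), so recall is 3 / 4
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print(recall_manual, recall_sklearn)
```

The false positive in this example (an opinion flagged as a claim) does not affect recall at all, which is why precision and F1 are tracked alongside it during the grid search.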

#### **Random Forest model**

In [11]:
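# Instantiate the Random Forest classifier
rf_clf = RandomForestClassifier()

# Create a dictionary of hyperparameters to tune
hyperparams = {"n_estimators": [100, 200, 400],
               "max_depth": [5, 10, 20],
               "min_samples_leaf": [1, 2],
               "min_samples_split": [2, 4],
               "max_features": ["sqrt", None],
               "max_samples": [0.5, 0.7]
               }

# Define a list of scoring metrics to capture
scoring = ["accuracy", "precision", "recall", "f1"]

# Instantiate the GridSearchCV object; refit="recall" selects and refits the model with the best mean recall
rf_cv = GridSearchCV(rf_clf, hyperparams, scoring=scoring, cv=5, refit="recall")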

In [12]:
%%time

# Flatten the single-column target DataFrame to a 1-D array, as scikit-learn expects
rf_cv.fit(X_train, np.array(y_train).reshape(-1))

CPU times: total: 7min 24s
Wall time: 7min 49s


In [13]:
rf_cv.best_score_

0.9913174835891494

In [15]:
rf_cv.best_params_

{'max_depth': 20,
 'max_features': 'sqrt',
 'max_samples': 0.7,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 200}
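
With several scoring metrics and `refit="recall"`, `best_score_` reports only the mean cross-validated recall. The other metrics for the same hyperparameter combination can be read from `cv_results_` at `best_index_`. A minimal sketch on synthetic data with a deliberately tiny grid so that it runs quickly; `demo_cv` and the synthetic dataset are stand-ins, not the notebook's `rf_cv`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

metrics = ["accuracy", "precision", "recall", "f1"]
demo_cv = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [3, 5]},  # tiny grid for speed
    scoring=metrics,
    cv=3,
    refit="recall",
)
demo_cv.fit(X_demo, y_demo)

# One row per hyperparameter combination, one mean_test_<metric> column per metric
results = pd.DataFrame(demo_cv.cv_results_)
metric_cols = [f"mean_test_{m}" for m in metrics]

# Mean cross-validated scores of the combination chosen by refit="recall"
best_row = results.loc[demo_cv.best_index_, metric_cols]
print(best_row)
```

The same lookup applied to `rf_cv` would show whether the high recall of the selected model comes with a noticeable drop in precision.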

In [17]:
rf_best = rf_cv.best_estimator_

In [45]:
objects_to_save["rf_best"] = rf_best

#### **XGBoost model**

In [18]:
# Instantiate the XGBoost classifier
xgb_clf = xgboost.XGBClassifier()

# Create a dictionary of hyperparameters to tune
hyperparams_xgb = {'max_depth': [4,8,12],
                   'min_child_weight': [3, 5],
                   'learning_rate': [0.01, 0.1, 0.2],
                   'n_estimators': [200, 350, 500]
                  }

# Define a list of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Instantiate the GridSearchCV object; refit="recall" selects and refits the model with the best mean recall
xgb_cv = GridSearchCV(xgb_clf, hyperparams_xgb, scoring=scoring, cv=5, refit="recall")


In [23]:
%%time
xgb_cv.fit(X_train, y_train)

CPU times: total: 9min 39s
Wall time: 1min 7s


In [26]:
xgb_cv.best_score_

0.9911441741480838

In [27]:
xgb_cv.best_params_

{'learning_rate': 0.2,
 'max_depth': 12,
 'min_child_weight': 3,
 'n_estimators': 500}

In [29]:
xgb_best = xgb_cv.best_estimator_

In [46]:
objects_to_save["xgb_best"] = xgb_best

In [48]:
joblib.dump(objects_to_save, "../datasets/for_evaluation.joblib")

['../datasets/for_evaluation.joblib']
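
In the evaluation notebook, the saved dictionary can be restored with `joblib.load` and unpacked by key. The sketch below demonstrates the round trip on a small placeholder dictionary written to a temporary directory, so it does not touch the real file; `demo_objects` and `for_evaluation_demo.joblib` are hypothetical names used only here.

```python
import tempfile
from pathlib import Path

import joblib

# Placeholder dictionary standing in for objects_to_save
demo_objects = {"X_val": [[0.0, 1.0]], "y_val": [1.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "for_evaluation_demo.joblib"

    # Save, then load again, as the evaluation notebook would do with the real file
    joblib.dump(demo_objects, demo_path)
    loaded = joblib.load(demo_path)

# Objects are retrieved by the same keys they were stored under
X_val_loaded = loaded["X_val"]
y_val_loaded = loaded["y_val"]
print(sorted(loaded.keys()))
```

Loading fitted estimators this way generally requires the same (or compatible) versions of scikit-learn and xgboost as were used when saving.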

**To be continued at ../ML model evalution/**  