# **TikTok Claims Classification Machine Learning Model Project**
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/jfqeLHzXSmiWDsf44aBzcg_49d87872c4e54600b22020ca45545df1_image.png?expiry=1724198400000&hmac=FrrujYjU4bnsRleCWUHnmsbBw5M2E5XTUaFwu1s5ZGs)

Welcome to the TikTok Claims Classification Model Project!

This project is part of the Google Advanced Data Analytics Certificate program, which offers 3 different projects to choose from in order to finish the course.

I've chosen the TikTok Claims Classification Model Project, and this notebook shows my thought process. It goes through EDA (exploring and cleaning the dataset), regression analysis, and building a machine learning model.

Let's get started by stating some important things first and then proceed with the actual project and what it stands for.

***Note:*** *The TikTok dataset was made available as part of the course and it was created in partnership with the short-form video hosting company, TikTok. The story, all names, characters, and incidents portrayed in this project are fictitious. No identification with actual persons (living or deceased) is intended or should be inferred. And, the data shared in this project has been created for pedagogical purposes.*

**Background**:

TikTok is the leading destination for short-form mobile video. The platform is built to help imaginations thrive. TikTok's mission is to create a place for inclusive, joyful, and authentic content–where people can safely discover, create, and connect.

**Scenario**:

TikTok users have the ability to report videos and comments that contain user claims. These reports identify content that needs to be reviewed by moderators. This process generates a large number of user reports that are difficult to address quickly. 

TikTok is working on the development of a predictive model that can determine whether a video contains a claim or offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

**Project Goal**:

The TikTok data team is developing a machine learning model for classifying claims made in videos submitted to the platform.


# **1. Project Proposal**

Before we start coding, it’s important to make a project proposal for the data science team over at TikTok, so that they can approve all the steps we are about to take.
The project proposal that I'll create will have milestones for the tasks within the claims classification project.

You can view the project proposal over here, in my GitHub profile.

# **2. Exploratory Data Analysis (EDA)**

The purpose of this exploratory data analysis is for me to understand the impact that videos have on TikTok users. To do so, I'm going to see some descriptive statistics and analyze variables that will showcase user engagement: views, likes, and comments counts.

## **2.1. Importing the Necessary Libraries and Data Frame for EDA**

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Loading the dataset into a data frame

data = pd.read_csv("/kaggle/input/tiktok-dataset/tiktok_dataset.csv")

## **2.2. Descriptive Statistics**

In [None]:
# Let's display and examine the first 10 rows of the data

data.head(10)

**These are some observations that I can make seeing these 10 rows of data:**

* Each row represents a different video posted by a TikTok user. For each video, we know whether it is flagged as an opinion or a claim, the number of views, shares, comments, likes and downloads, its duration in seconds, and a transcription of what users said in that specific video.
* We also have information about the status of the video author, for example, whether the author is banned, under review or active.
* Furthermore, we have information about the user's verified status, that is, whether the user is verified or not verified.
* Basically, we have a variety of metadata about the videos and users in addition to having the videos tagged as being claims or opinions.

In [None]:
# Summary info

data.info()

**Let's explain the information we get from this output:**

* When checking the different variables, I've noticed that there are 3 data types - int64, object and float64.
* In total, we have 12 columns with metadata about the videos and 19382 rows (videos).
* I can see that some rows have missing data in some columns, namely: `claim_status`, `video_transcription_text`, `video_view_count`, `video_like_count`, `video_share_count`, `video_download_count`, `video_comment_count`, and all of them are missing the same number of rows: 298. This could mean that these videos have not received any claim status, comments or views, for example, or that these rows are in the dataset by mistake.
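
An identical missing-value count across columns suggests, but does not prove, that the same rows are affected. A minimal sketch of how that could be checked, using a hypothetical toy frame (`toy` and its values are invented for illustration; in this notebook the same logic would apply to `data`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, None],
    "video_view_count": [100.0, 20.0, np.nan, np.nan],
    "video_like_count": [10.0, 2.0, np.nan, np.nan],
    "video_duration_sec": [30, 45, 12, 50],
})

# Columns that contain at least one missing value
cols_with_nulls = toy.columns[toy.isna().any()].tolist()

# Rows where every one of those columns is missing at once
all_missing = toy[cols_with_nulls].isna().all(axis=1)
# Rows where at least one of those columns is missing
any_missing = toy[cols_with_nulls].isna().any(axis=1)

print(cols_with_nulls)
print(all_missing.sum(), any_missing.sum())
```

If the two counts match, the missing values sit in exactly the same rows, which would make dropping those rows a simpler decision.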

In [None]:
# Summary statistics

data.describe()

In [None]:
data.size

**What can we see from these descriptive statistics?**

* Looking at the table, there appear to be some outliers: the maximum values sit far above the quartiles, and the standard deviations are extremely large!

Since we know from our project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions, a good first step towards understanding the data might therefore be examining the `claim_status` variable. Let's begin by determining how many videos there are for each different claim status.

In [None]:
data['claim_status'].value_counts()

We can see that the values are very well balanced, with an almost perfect ratio between the number of videos with opinions and videos with claims, which will be optimal for our model later.

Now that we understand the distribution of the claim status variable, it’s important to extract some information about the engagement levels of the videos. In particular, let's examine the engagement trends associated with each different claim status.

I'm going to start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [None]:
# What is the average view count of videos with "claim" status?

claims = data[data['claim_status']=='claim']
print('Mean visualizations in videos with Claims:', claims['video_view_count'].mean())
print('Median visualizations in videos with Claims:', claims['video_view_count'].median())

In [None]:
# What is the average view count of videos with "opinion" status?

opinions = data[data['claim_status']=='opinion']  # separate name so the `claims` frame is not overwritten
print('Mean visualizations in videos with Opinions:', opinions['video_view_count'].mean())
print('Median visualizations in videos with Opinions:', opinions['video_view_count'].median())

**What can we notice about the mean and median within each claim category?**

* Within each claim status, the mean and the median are very close to each other, which suggests that view counts are not dominated by outliers once they are grouped by claim status.
* We can see that videos with claims have many more views than videos with opinions. On average, videos with claims have more than 500 thousand views, while videos with opinions do not even reach 5 thousand views.
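
The same per-status comparison can also be produced in one step with `groupby`. A minimal sketch on a hypothetical toy frame (the numbers are invented, not taken from the TikTok dataset):

```python
import pandas as pd

# Hypothetical toy data; the real notebook would group `data` instead
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [400_000, 500_000, 600_000, 3_000, 5_000, 7_000],
})

# One row per claim status, one column per statistic
view_stats = toy.groupby("claim_status")["video_view_count"].agg(["count", "mean", "median"])
print(view_stats)
```

This avoids keeping one filtered frame per category and scales to any number of categories.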

Now, let's examine trends associated with the ban status of the author.

In [None]:
data.groupby(by=['author_ban_status', 'claim_status']).count()[['#']]

**What conclusions can we draw from here?**

* I can see a strong association between making claims and author ban status: authors who are banned or under review have more videos with claims.
* This may happen because these types of videos are subject to stricter restrictions, and authors have to comply with the platform's policies.
* However, it should be noted that there is no way to know whether claim videos themselves lead to their authors being banned, or whether authors who post claim videos are simply more likely to post videos that break the terms of service.
* Finally, while we can use this data to draw conclusions about banned/active authors, we cannot draw conclusions about banned videos. There is no way to know whether a particular video caused its author to be banned, and banned authors may also have posted videos that complied with the terms of service.
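
Raw counts can be hard to compare when the groups differ in size. A minimal sketch of turning them into within-group proportions with `pd.crosstab`, using a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the two columns of `data`
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "active", "active",
                          "banned", "banned", "under review", "under review"],
    "claim_status": ["claim", "opinion", "opinion", "opinion",
                     "claim", "claim", "claim", "opinion"],
})

# normalize="index" makes each row (ban status) sum to 1
claim_share = pd.crosstab(toy["author_ban_status"], toy["claim_status"], normalize="index")
print(claim_share)
```

Each row then reads as "share of claims vs. opinions among authors with this ban status", which describes an association and still says nothing about causation.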

Let's continue investigating engagement levels, now focusing on `author_ban_status`.

In [None]:
data.groupby(by=['author_ban_status']).agg({'video_view_count': ['count', 'mean', 'median'],
                                        'video_like_count': ['count', 'mean', 'median'],
                                        'video_share_count': ['count', 'mean', 'median']})

**Let’s discuss the output of this code:**

* As you can see, videos by users who are banned or under review can have double or more the engagement of videos by active users, in terms of views, likes and shares.
* Furthermore, we can see that for likes and shares the mean is quite different from the median, which indicates that there are outliers, that is, videos with a lot of engagement!

Now, let's create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [None]:
# Creating 3 new columns

data['likes_per_view'] = data['video_like_count'] / data['video_view_count']

data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']

data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

In [None]:
# Compile the information

data.groupby(by=['author_ban_status', 'claim_status']).agg({'likes_per_view': ['count', 'mean', 'median'],
                                                          'comments_per_view': ['count', 'mean', 'median'],
                                                          'shares_per_view': ['count', 'mean', 'median']})

* Analyzing the output of this code, we can see that videos with claims have more likes, comments and shares per view across all author ban statuses.
* Furthermore, these rates are all very similar: videos with claims are always higher in likes, comments and shares per view, in very similar proportions, across the 3 author ban statuses!
* Therefore, seen from this perspective, the author's ban status does not matter, as long as the video contains claims.

## **2.3. Checking Outliers**

I’m going to plot boxplots to check for outliers and get a visual representation of the values of every engagement variable.

In [None]:
plt.figure(figsize=(5,1))
sns.boxplot(x = data['video_duration_sec'])

In [None]:
plt.figure(figsize=(5,1))
sns.boxplot(x = data['video_view_count'])

In [None]:
plt.figure(figsize=(5,1))
sns.boxplot(x = data['video_like_count'])

In [None]:
plt.figure(figsize=(5,1))
sns.boxplot(x = data['video_comment_count'])

In [None]:
plt.figure(figsize=(5,1))
sns.boxplot(x = data['video_share_count'])

In [None]:
plt.figure(figsize=(5,1))
sns.boxplot(x = data['video_download_count'])

As you can see in these graphs, there are some variables with outliers, namely `video_download_count`, `video_share_count`, `video_comment_count` and `video_like_count`, which confirms my suspicions previously raised when I analyzed the mean and median in 2.2. Descriptive Statistics. I can also confirm that there are videos with engagement levels much higher than the average.
Now that I’ve confirmed that we have outliers in several variables, I will later decide what to do with them when we choose the machine learning model I’ll use, as there are models that are sensitive to outliers and others that are not.

When building predictive models, the presence of outliers can be problematic. For example, if we were trying to predict the view count of a particular video, videos with extremely high view counts might introduce bias to a model. Also, some outliers might indicate problems with how data was captured or recorded.

The ultimate objective of the TikTok project is to build a model that predicts whether a video is a claim or opinion. The analysis we've performed so far indicates that a video's engagement level is strongly correlated with its claim status. There's no reason to believe that any of the values in the TikTok data were captured erroneously, and they align with expectations of how social media works: a very small proportion of videos get super high engagement levels. That's the nature of viral content.

Nonetheless, it's good practice to get a sense of just how many of our data points could be considered outliers. The definition of an outlier can change based on the details of any project, and it helps to have domain expertise to decide a threshold. I've learned that a common way to determine outliers in a normal distribution is to calculate the interquartile range (IQR) and set a threshold that is 1.5 * IQR above the 3rd quartile.

In this TikTok dataset, the values for the count variables are not normally distributed, as you will see later. They are heavily skewed to the right. One way of modifying the outlier threshold is by calculating the median value for each variable and then adding 1.5 * IQR. This results in a threshold that is, in this case, much lower than it would be if we used the 3rd quartile.
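
To make the difference between the two thresholds concrete, here is a minimal sketch on a hypothetical right-skewed toy series (the function name `outlier_thresholds` and the values are invented for illustration). Because the median is never larger than the 3rd quartile, the median-based threshold is never higher than the quartile-based one.

```python
import pandas as pd

def outlier_thresholds(values: pd.Series) -> dict:
    """Return both upper outlier thresholds discussed above."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return {
        "q3_based": q3 + 1.5 * iqr,                   # conventional rule
        "median_based": values.median() + 1.5 * iqr,  # stricter rule for skewed data
    }

# Hypothetical right-skewed toy values
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 12, 100])
thresholds = outlier_thresholds(toy)

# Number of values flagged under each rule
flagged = {name: int((toy > limit).sum()) for name, limit in thresholds.items()}
print(thresholds)
print(flagged)
```

On this toy series the median-based rule flags one more value than the quartile-based rule, which is the kind of difference to expect on skewed count data.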

Let's write a for loop that iterates over the column names of each count variable and then check the distribution of these same variables, creating some histograms.

In [None]:
count_cols = ['video_view_count',
              'video_like_count',
              'video_share_count',
              'video_download_count',
              'video_comment_count',
              ]

for column in count_cols:
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    median = data[column].median()
    outlier_threshold = median + 1.5*iqr

    # Count the number of values that exceed the outlier threshold
    outlier_count = (data[column] > outlier_threshold).sum()
    print(f'Number of outliers, {column}:', outlier_count)

## **2.4. Visualizations - Outliers and Variable Distributions**

In [None]:
plt.figure(figsize=(5,2))
plt.hist(data['video_duration_sec'], bins=range(0,61,5))
plt.title('Video Duration Histogram')
plt.show()

In [None]:
plt.figure(figsize=(7,2))
plt.hist(data['video_view_count'], bins=range(0,(10**6+1),10**5))
labels = [0] + [str(i) + 'k' for i in range(100, 1001, 100)]
plt.xticks(range(0, 1*10**6+1, 10**5), labels=labels)
plt.title('Video View Count Histogram')
plt.show()

In [None]:
plt.figure(figsize=(5, 2))
n, bins, patches = plt.hist(data['video_like_count'], bins=range(0, (7*10**5+1), 10**5))
labels = [0] + [str(i) + 'k' for i in range(100, 701, 100)]
plt.xticks(range(0, 7*10**5+1, 10**5), labels=labels)
plt.title('Video Like Count Histogram')
plt.show()

In [None]:
plt.figure(figsize=(5,2))
plt.hist(data['video_comment_count'])
plt.title('Video Comment Count Histogram')
plt.show()

In [None]:
plt.figure(figsize=(5,2))
plt.hist(data['video_share_count'])
plt.title('Video Share Count Histogram')
plt.show()

In [None]:
plt.figure(figsize=(5,2))
plt.hist(data['video_download_count'])
plt.title('Video Download Count Histogram')
plt.show()

As we can see from the graphs, all variables have a distribution skewed to the right, except the `video_duration_sec` variable which has a uniform distribution. This way, we can understand which variables we can use in what type of models.
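
The visual impression of skew can be backed by a number. A minimal sketch using `DataFrame.skew()` on simulated data (the column names and distributions are assumptions chosen to mimic a uniform duration and a right-skewed count, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-ins: a uniform variable and a right-skewed one
toy = pd.DataFrame({
    "duration_like": rng.uniform(5, 60, size=5_000),
    "count_like": rng.exponential(scale=1_000, size=5_000),
})

# Sample skewness per column: near 0 for symmetric data, positive for right skew
skewness = toy.skew()
print(skewness)
```

Applied to the notebook's own columns, values near zero would be consistent with the uniform-looking duration histogram, and clearly positive values with the right-skewed count histograms.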

Now let's create a histogram to see how many videos we have of claims and opinions in terms of verified and unverified users.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

plt.figure(figsize=(7,3))
sns.histplot(
    data = data,
    x = 'claim_status',
    hue = 'verified_status',
    multiple='dodge',              # This is so that four separate bars appear, two blues and two oranges.
    shrink=0.9)
plt.title('Claims by Verification Status Histogram');

As you can see, users that are not verified tend to post videos with claims instead of opinions.

Next, I'm going to create a histogram of the number of videos with claims and opinions by author ban status.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

plt.figure(figsize=(7,3))
sns.histplot(
    data = data,
    x = 'claim_status',
    hue = 'author_ban_status',
    multiple='dodge',              
    shrink=0.9)
plt.title('Claims by Author Ban Status Histogram')
plt.show()

There is a higher number of users who post videos with claims who are either banned or under review. In other words, users who post videos with claims appear to have a greater likelihood of being banned.

Let's check the median number of views that videos receive for each author ban status.

In [None]:
ban_status_median = data.groupby(['author_ban_status']).median(
    numeric_only=True).reset_index()

fig = plt.figure(figsize=(5,3))
sns.barplot(data=ban_status_median,
            x='author_ban_status',
            y='video_view_count',
            order=['active', 'under review', 'banned'],
            palette={'active':'green', 'under review':'orange', 'banned':'red'},
            alpha=0.5)
plt.title('Median Visualization Counts by Ban Status');  # the plotted values are medians

As you can clearly see, videos posted by users that are banned received far more views than those that are active.

**Therefore, the views variable could be a good indicator of videos with claims.**

Let's check this indicator further by seeing the median of views by claim status and plotting a graph with the total views by claim status.

In [None]:
data.groupby('claim_status')['video_view_count'].median()

In [None]:
fig = plt.figure(figsize=(3,3))
plt.pie(data.groupby('claim_status')['video_view_count'].sum(), labels=['Claims', 'Opinions'])
plt.title('Total Visualizations by Claim Status');

**We can then confirm that videos with claims clearly have many more views than videos with opinions.**

## **2.5. Checking Missing Values**

In [None]:
data.isna().sum()

## **2.6. Summary and Key Insights of EDA**

According to the findings from the exploratory data analysis, the future claim classification model will need to account for null values and imbalance in opinion video engagement counts by incorporating them into the model parameters.

A key component of this project’s exploratory data analysis involves visualizing the data. As illustrated before with the histograms, the vast majority of videos are grouped at the bottom of the range of values, while a small number of videos have extreme counts of views, likes and comments, for example.

298 null values were found in each of 7 different columns. As a result, future modeling should consider the null values to avoid making insights that would assume complete data. Further analysis is necessary to investigate the reason for these null values, and their impact on future statistical analysis or model building.

The `video_view_count` variable could be a good indicator of videos with claims.

# **3. Statistical Analysis and Hypothesis Testing**

At this stage, the number 1 question we should ask is, what is the research question we want answered? Later on, I will need to formulate the null and alternative hypotheses as the first step of my hypothesis test. Let's consider our research questions now, at the start of this task:

1) Do videos from verified accounts and videos from unverified accounts have different average view counts?

2) Is there a relationship between the account being verified and the associated videos' view counts?

## **3.1. Importing Packages for Statistical Analysis/Hypothesis Testing**

In [None]:
from scipy import stats

## **3.2. Data Cleaning**

We know that we have some null values so, in this part of the analysis, it's important to eliminate them before doing the hypothesis test. Since we have a big dataset, dropping a few rows with null values won’t make much of a difference.

After that, I'm going to see if there are duplicated rows of data.

In [None]:
# Dropping rows with missing values

data_clean = data.dropna(axis=0).copy()  # explicit copy so later .loc assignments modify an independent frame

In [None]:
# Checking duplicated rows

data_clean.duplicated().sum()

In [None]:
# Computing the mean `video_view_count` for each group in `verified_status`

data_clean.groupby(['verified_status']).agg({'video_view_count':'mean'})

## **3.3. Hypothesis Testing**

*   **Null hypothesis $H_0$**: There is no difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).
*   **Alternative hypothesis $H_A$**: There is a difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).

I'm going to choose 5% as the significance level and proceed with a two-sample t-test.

In [None]:
# Conducting a two-sample t-test to compare means

not_verified = data_clean[data_clean['verified_status'] == 'not verified']

verified = data_clean[data_clean['verified_status'] == 'verified']

stats.ttest_ind(a = not_verified['video_view_count'], b = verified['video_view_count'], equal_var=False)

**Based on the p-value we got above, do we reject or fail to reject the null hypothesis?**

* We concluded that our p-value is very small, therefore:

* **P-value < 5%, then we reject the Null Hypothesis and conclude that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts.**
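
With `equal_var=False`, `stats.ttest_ind` performs Welch's t-test, which does not assume equal variances in the two groups. As a rough sketch of what that statistic is, here is a NumPy-only version computed on two hypothetical toy groups (`welch_t` and the values are invented for illustration; the p-value step, which needs the t distribution, is left to SciPy):

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group mean (sample variance / n)
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1))
    return t_stat, dof

# Hypothetical toy groups with different means and variances
group_a = [10, 12, 14, 16, 18]
group_b = [1, 2, 3, 4, 5]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

A large absolute t statistic relative to the degrees of freedom is what produces the very small p-value reported by the test.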

## **3.4. Summary and Key Insights of Hypothesis Testing**

I've considered the relationship between `verified_status` and `video_view_count`. 
One approach conducted was to examine the mean values of `video_view_count` for each group of `verified_status` in the sample data. The findings showed that unverified accounts have a mean of 265,663 views vs. 91,439 views for verified accounts.

The second approach was a two-sample hypothesis test. Aligned with preliminary findings from the mean values, this statistical analysis indicates that the observed difference in the sample data is unlikely to be due to chance alone and reflects an actual difference in the corresponding population means.

The analysis shows that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts. This suggests there might be fundamental behavioral differences between these two groups of accounts.

It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?

The next step will be to build a regression model on `verified_status`. A regression model is the natural next step because the end goal is to make predictions on `claim_status`. A regression model for `verified_status` can help analyze user behavior in this group of verified users. Technical note to prepare regression model: because the data is skewed, and there is a significant difference in account types, it will be key to build a **logistic regression model**.

# **4. Regression Analysis**

Logistic regression helps me estimate the probability of an outcome. For data science professionals, this is a useful skill because it allows us to consider more than one variable against the variable we're measuring against. This opens the door for much more thorough and flexible analysis to be completed.

I'm interested in how different variables are associated with whether a user is verified. Earlier, I've observed that if a user is verified, they are much more likely to post opinions. Now, I've decided to explore how to predict verified status to help me understand how video characteristics relate to verified users. Therefore, I'm going to conduct a **logistic regression** using `verified_status` as the outcome variable. The results may be used to inform the final model related to predicting whether a video is a claim or an opinion.

## **4.1. Importing Packages for Data Preprocessing and Data Modeling**

In [None]:
# Importing packages for data preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import resample

# Importing packages for data modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

## **4.2. EDA & Checking Model Assumptions**

Before performing a logistic regression analysis, I must check whether the data meets 4 assumptions: **linearity**, **independent observations**, **no outliers**, and **non-multicollinearity**.

### 4.2.1. Independent Observations
I know that each row represents a different video, so the **independent observations** assumption is already met.

### 4.2.2. Outliers
As you know, we have outliers in several variables, so we have to deal with them for this part of the project.

In [None]:
# Handling outliers for 'video_like_count'

percentile25 = data_clean["video_like_count"].quantile(0.25)
percentile75 = data_clean["video_like_count"].quantile(0.75)

iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr

data_clean.loc[data_clean["video_like_count"] > upper_limit, "video_like_count"] = upper_limit # replaces values that are greater than the upper limit with the upper limit itself

In [None]:
# Handling outliers for 'video_comment_count'

percentile25 = data_clean["video_comment_count"].quantile(0.25)
percentile75 = data_clean["video_comment_count"].quantile(0.75)

iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr

data_clean.loc[data_clean["video_comment_count"] > upper_limit, "video_comment_count"] = upper_limit # replaces values that are greater than the upper limit with the upper limit itself

In [None]:
# Handle outliers for 'video_share_count'

percentile25 = data_clean["video_share_count"].quantile(0.25)
percentile75 = data_clean["video_share_count"].quantile(0.75)

iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr

data_clean.loc[data_clean["video_share_count"] > upper_limit, "video_share_count"] = upper_limit # replaces values that are greater than the upper limit with the upper limit itself

In [None]:
# Handle outliers for 'video_download_count'

percentile25 = data_clean["video_download_count"].quantile(0.25)
percentile75 = data_clean["video_download_count"].quantile(0.75)

iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr

data_clean.loc[data_clean["video_download_count"] > upper_limit, "video_download_count"] = upper_limit # replaces values that are greater than the upper limit with the upper limit itself

In [None]:
# Checking class balance for 'verified_status'

data_clean["verified_status"].value_counts(normalize=True)

Approximately 93.7% of the dataset represents videos posted by unverified accounts and 6.2% represents videos posted by verified accounts. So, the outcome variable is highly imbalanced.

I'm going to use resampling to create class balance in the outcome variable.

In [None]:
# Identifying data points from majority and minority classes
data_majority = data_clean[data_clean["verified_status"] == "not verified"]
data_minority = data_clean[data_clean["verified_status"] == "verified"]

# Upsampling the minority class (which is "verified")
data_minority_upsampled = resample(data_minority,
                                 replace=True,                 # to sample with replacement
                                 n_samples=len(data_majority), # to match majority class
                                 random_state=42)               # to create reproducible results

# Combining majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled]).reset_index(drop=True)

# Checking class balance
data_upsampled["verified_status"].value_counts()
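
One caveat worth keeping in mind: when upsampling happens before the train/test split, copies of the same verified row can land in both sets, which may make test scores look better than they are. A minimal sketch of the alternative ordering (split first, then upsample only the training rows) on a hypothetical toy frame; `DataFrame.sample` stands in for `train_test_split` and `resample` here only to keep the sketch dependency-light:

```python
import pandas as pd

# Hypothetical imbalanced toy data: 16 unverified rows, 4 verified rows
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 16 + ["verified"] * 4,
    "video_view_count": range(20),
})

# 1) Hold out a stratified 25% test set first
test = toy.groupby("verified_status").sample(frac=0.25, random_state=42)
train = toy.drop(test.index)

# 2) Upsample the minority class inside the training set only
majority = train[train["verified_status"] == "not verified"]
minority = train[train["verified_status"] == "verified"]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
train_balanced = pd.concat([majority, minority_upsampled])

# No original row appears in both the balanced training set and the test set
overlap = train_balanced.index.intersection(test.index)
print(train_balanced["verified_status"].value_counts())
print(len(overlap))
```

The test set keeps the original class proportions, so metrics computed on it reflect how the model would behave on unbalanced data.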

Now I'm going to get the average `video_transcription_text` length for videos posted by verified accounts and the average `video_transcription_text` length for videos posted by unverified accounts.

Then I'm going to extract the length of each `video_transcription_text` and add this as a column to the data frame, so that it can be used as a potential feature in the model.

In [None]:
data_upsampled[["verified_status", "video_transcription_text"]].groupby(by="verified_status")[["video_transcription_text"]].agg(func=lambda array: np.mean([len(text) for text in array]))

In [None]:
# Extracting the length of each `video_transcription_text` and adding this as a column to the dataframe

data_upsampled['text_length'] = data_upsampled["video_transcription_text"].apply(func=lambda text: len(text))

In [None]:
data_upsampled.head()

**I want to visualize the distribution of `video_transcription_text` length for videos posted by verified accounts and videos posted by unverified accounts.**

In [None]:
verified = data_upsampled['text_length'][data_upsampled['verified_status']=='verified']
unverified = data_upsampled['text_length'][data_upsampled['verified_status']=='not verified']

fig, axes = plt.subplots(1, 2, figsize = (10,4))

sns.histplot(verified, ax = axes[0])
axes[0].set_xlabel('Text Length')
axes[0].set_title('Text length on Verified Users Histogram')

sns.histplot(unverified, ax = axes[1])
axes[1].set_xlabel('Text Length')
axes[1].set_title('Text length on Unverified Users Histogram')

In [None]:
# Same information, different plot: both groups side by side, split by verified status

sns.histplot(data=data_upsampled, stat="count", multiple="dodge", x="text_length",
             kde=False, palette="pastel", hue="verified_status",
             element="bars", legend=True)
plt.xlabel("video_transcription_text length (number of characters)")
plt.ylabel("Count")
plt.title("Distribution of video_transcription_text length by verified status")
plt.show()

### 4.2.3. Non-multicollinearity - Examining Correlations

Next, I'm going to code a correlation matrix to help determine most correlated variables and create a heatmap to visualize these correlations.

In [None]:
data_upsampled.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(
    data_upsampled[["video_duration_sec", "claim_status", "author_ban_status", "video_view_count", 
                    "video_like_count", "video_share_count", "video_download_count", "video_comment_count", "text_length"]]
    .corr(numeric_only=True), 
    annot=True, 
    cmap="crest")
plt.title("Heatmap of the dataset")
plt.show()

The above heatmap shows that the following pair of variables are strongly correlated: `video_view_count` and `video_like_count` (0.86 correlation coefficient).

One of the model assumptions for logistic regression is **no severe multicollinearity** among the features. To build a logistic regression model that meets this assumption, we could exclude `video_like_count`. And among the variables that quantify video metrics, we could keep `video_view_count`, `video_share_count`, `video_download_count`, and `video_comment_count` as features.
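
Reading strongly correlated pairs off a heatmap gets harder as the number of features grows. A minimal sketch of a helper that lists them programmatically (`highly_correlated_pairs`, the 0.8 cutoff, and the toy values are all assumptions made for illustration):

```python
from itertools import combinations

import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List (column, column, r) for numeric pairs with |r| above the threshold."""
    corr = df.corr(numeric_only=True)
    pairs = []
    # combinations() visits each unordered pair once and skips the diagonal
    for left, right in combinations(corr.columns, 2):
        r = corr.loc[left, right]
        if abs(r) > threshold:
            pairs.append((left, right, float(r)))
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)

# Hypothetical toy data: views and likes move together, duration does not
toy = pd.DataFrame({
    "video_view_count": [1, 2, 3, 4, 5, 6],
    "video_like_count": [2, 4, 6, 8, 10, 13],
    "video_duration_sec": [30, 5, 45, 10, 60, 20],
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

For each pair returned, one of the two features is a candidate for exclusion; which one to drop remains a judgment call.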

## **4.3. Constructing the Logistic Regression Model**

In [None]:
# Selecting the outcome variable

y = data_upsampled["verified_status"]

In [None]:
# Selecting features
X = data_upsampled[["video_duration_sec", "claim_status", "author_ban_status", "video_view_count", "video_share_count", "video_download_count", "video_comment_count"]]

# Display first few rows of features dataframe
X.head()

*Note: The `#` and `video_id` columns are not selected as features here, because they do not seem to be helpful for predicting whether a video was posted by a verified account. Also, `video_like_count` is not selected as a feature, because it is strongly correlated with `video_view_count`, as discussed earlier, and logistic regression assumes no severe multicollinearity among the features.*

In [None]:
# Splitting the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# Getting shape of each training and testing set to confirm that the dimensions are in alignment.

X_train.shape, X_test.shape, y_train.shape, y_test.shape

* The number of features (7) aligns between the training and testing sets.
* The number of rows aligns between the features and the outcome variable for training (26826) and testing (8942).

In [None]:
# Checking data types

X_train.dtypes

As shown above, the `claim_status` and `author_ban_status` features are each of data type object currently. In order to work with the implementations of models through sklearn, these categorical features will need to be made numeric. One way to do this is through one-hot encoding.
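
To make the idea concrete, here is a small pure-Python sketch of what one-hot encoding with the first category dropped does. `DropFirstOneHot` is a hypothetical stand-in written for this example (it is not scikit-learn's `OneHotEncoder`), and the category values are illustrative. It assumes categories are sorted and the first one becomes the all-zeros baseline, which is my understanding of how `drop='first'` behaves.

```python
class DropFirstOneHot:
    """Tiny stand-in for one-hot encoding with the first category dropped."""

    def fit(self, values):
        # Categories are sorted; the first one becomes the implicit baseline
        self.categories_ = sorted(set(values))
        self.kept_ = self.categories_[1:]
        return self

    def transform(self, values):
        # One 0/1 column per kept category; the baseline is encoded as all zeros
        return [[1 if v == cat else 0 for cat in self.kept_] for v in values]


# Hypothetical category values, used only for illustration
train = ["active", "banned", "under review", "active"]
test = ["banned", "active"]

encoder = DropFirstOneHot().fit(train)   # learn the categories from training data only
train_encoded = encoder.transform(train)
test_encoded = encoder.transform(test)   # reuse the same columns for the test data

print(encoder.kept_)
print(train_encoded)
print(test_encoded)
```

The design point worth noting is that the categories are learned once, from the training rows, and then reused unchanged on any later data so the encoded columns always line up.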

In [None]:
# Selecting the training features that need to be encoded
X_train_to_encode = X_train[["claim_status", "author_ban_status"]]

# Display first few rows
X_train_to_encode.head()

In [None]:
# Setting up an encoder for one-hot encoding the categorical features
X_encoder = OneHotEncoder(drop='first', sparse_output=False)

In [None]:
# Fitting and transforming the training features using the encoder
X_train_encoded = X_encoder.fit_transform(X_train_to_encode)

In [None]:
# Getting feature names from encoder
X_encoder.get_feature_names_out()

In [None]:
# Placing encoded training features (which are currently an array) into a dataframe
X_train_encoded_df = pd.DataFrame(data=X_train_encoded, columns=X_encoder.get_feature_names_out())

# Display first few rows
X_train_encoded_df.head()

In [None]:
# Display first few rows of `X_train` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
X_train.drop(columns=["claim_status", "author_ban_status"]).head()

In [None]:
# Concatenating `X_train` and `X_train_encoded_df` to form the final dataframe for training data (`X_train_final`)
# Note: Using `.reset_index(drop=True)` to reset the index in X_train after dropping `claim_status` and `author_ban_status`,
# so that the indices align with those in `X_train_encoded_df`
X_train_final = pd.concat([X_train.drop(columns=["claim_status", "author_ban_status"]).reset_index(drop=True), X_train_encoded_df], axis=1)

# Display first few rows
X_train_final.head()

**Now, on to checking the outcome variable**

In [None]:
# Checking data type of outcome variable
y_train.dtype

In [None]:
# Getting unique values of outcome variable
y_train.unique()

**As shown above, the outcome variable is currently of data type object. One-hot encoding can be used to make this variable numeric.**

In [None]:
# Setting up an encoder for one-hot encoding the categorical outcome variable
y_encoder = OneHotEncoder(drop='first', sparse_output=False)

In [None]:
# Encoding the training outcome variable
# Notes:
#   - Adjusting the shape of `y_train` before passing into `.fit_transform()`, since it takes in 2D array
#   - Using `.ravel()` to flatten the array returned by `.fit_transform()`, so that it can be used later to train the model
y_train_final = y_encoder.fit_transform(y_train.values.reshape(-1, 1)).ravel()

# Display the encoded training outcome variable
y_train_final

**Now, onto constructing the actual model.**

In [None]:
# Constructing a logistic regression model and fitting it to the training set
clf = LogisticRegression(random_state=42, max_iter=800).fit(X_train_final, y_train_final)

# max_iter = 800 is the maximum number of iterations that the optimization algorithm should perform.
# This is useful to ensure that the model has time to converge.

## **4.4. Results and Evaluation**

### 4.4.1. Encoding categorical features in the testing set

In [None]:
# Selecting the testing features that need to be encoded
X_test_to_encode = X_test[['claim_status', 'author_ban_status']]

# Display first few rows
X_test_to_encode.head()

In [None]:
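# Transforming the testing features using the encoder that was fitted on the training set
# (using `.transform()`, not `.fit_transform()`, so the encoder is not refitted on test data)
X_test_encoded = X_encoder.transform(X_test_to_encode)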

# Display first few rows of encoded testing features
X_test_encoded

In [None]:
# Placing encoded testing features (which is currently an array) into a dataframe
X_test_encoded_df = pd.DataFrame(data=X_test_encoded, columns=X_encoder.get_feature_names_out())

# Display first few rows
X_test_encoded_df.head()

In [None]:
# Display first few rows of `X_test` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
X_test.drop(columns=["claim_status", "author_ban_status"]).head()

In [None]:
# Concatenating `X_test` and `X_test_encoded_df` to form the final dataframe for testing data (`X_test_final`)
# Note: Using `.reset_index(drop=True)` to reset the index in X_test after dropping `claim_status` and `author_ban_status`,
# so that the indices align with those in `X_test_encoded_df`
X_test_final = pd.concat([X_test.drop(columns=["claim_status", "author_ban_status"]).reset_index(drop=True), X_test_encoded_df], axis=1)

# Display first few rows
X_test_final.head()

### 4.4.2. Results and Evaluation

In [None]:
# Use the logistic regression model to get predictions on the encoded testing set
y_pred = clf.predict(X_test_final)

In [None]:
# Encode the testing outcome variable
# Notes:
#   - Adjusting the shape of `y_test` before passing into `.transform()`, since it takes in 2D array
#   - Using `.ravel()` to flatten the array returned by `.transform()`, so that it can be used later to compare with predictions
y_test_final = y_encoder.transform(y_test.values.reshape(-1, 1)).ravel()

# Display the encoded testing outcome variable
y_test_final

In [None]:
# Get shape of each training and testing set
X_train_final.shape, X_test_final.shape, y_train_final.shape, y_test_final.shape

In [None]:
# Compute values for confusion matrix
cm = confusion_matrix(y_test_final, y_pred, labels = clf.classes_)

# Create display of confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = clf.classes_)

# Plot confusion matrix
disp.plot()

# Display plot
plt.show()

In [None]:
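# Accuracy computed by hand from the confusion matrix counts: correct predictions / all predictions
(3856 + 1991) / (3856 + 649 + 2446 + 1991)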

* The upper-left quadrant displays the number of **true negatives**: the number of videos posted by unverified accounts that the model accurately classified as so.

* The upper-right quadrant displays the number of **false positives**: the number of videos posted by unverified accounts that the model misclassified as posted by verified accounts.

* The lower-left quadrant displays the number of **false negatives**: the number of videos posted by verified accounts that the model misclassified as posted by unverified accounts.

* The lower-right quadrant displays the number of **true positives**: the number of videos posted by verified accounts that the model accurately classified as so.

* A perfect model would yield all true negatives and true positives, and no false negatives or false positives.
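
The four quadrants above are all that is needed to compute the usual classification metrics. Here is a minimal sketch using made-up counts (not the counts from this model); the `metrics_from_counts` helper is a hypothetical name introduced for this example.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes at least one positive prediction and one actual positive,
    so that no denominator is zero.
    """
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total          # share of all predictions that are correct
    precision = tp / (tp + fp)            # share of predicted positives that are real
    recall = tp / (tp + fn)               # share of real positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
m = metrics_from_counts(tn=50, fp=10, fn=5, tp=35)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```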

In [None]:
# Creating a classification report for logistic regression model
# Note: the encoder sorts the categories and drops the first one, so 0 = "not verified" and 1 = "verified"
target_labels = ["not verified", "verified"]
print(classification_report(y_test_final, y_pred, target_names=target_labels))

The classification report above shows that the logistic regression model achieved a precision of 62% and a recall of 8%, and it achieved an accuracy of 51%. Note that the precision and recall scores are taken from the "verified" row of the output (class 1), because that is the target class that we are most interested in predicting. The "not verified" class has its own precision/recall metrics, and the weighted average represents the combined metrics for both classes of the target variable.

### 4.4.3. Interpreting model coefficients

In [None]:
# Get the feature names from the model and the model coefficients (which represent log-odds ratios)
# Place into a DataFrame for readability

pd.DataFrame(data={"Feature Name":clf.feature_names_in_, "Model Coefficient":clf.coef_[0]})

## **4.5. Summary and Key Insights**

The variable of `verified_status` was selected for this regression model because of the relationship seen between the verified account type and the video content. A logistic regression model was selected because of the data type and distribution.

The logistic regression model achieved a precision of 56% and a recall of 51% (weighted averages), with an accuracy of 51%. These results inform the key insights on video features listed below.

Key takeaways:

- The dataset has a few strongly correlated variables, which might lead to multicollinearity issues when fitting a logistic regression model. I decided to drop `video_like_count` from the model building.
- Based on the logistic regression model, each additional second of video duration is associated with a 0.00004% increase in the log-odds of the user having a verified status.
- The logistic regression model had limited predictive power: a precision of 56% is less than ideal, and a recall of 51% is weak. Overall accuracy is toward the lower end of what would typically be considered acceptable.
- Based on the estimated model coefficients from the logistic regression, longer videos tend to be associated with higher odds of the user being verified.
- Other video features have small estimated coefficients in the model, so their association with verified status seems to be small. As a result, other video features besides video length do not seem to be associated with verified status.
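
Because the coefficients are on the log-odds scale, it can help to convert them into odds multipliers: a one-unit increase in a feature multiplies the odds by `exp(coefficient)`. Here is a minimal sketch of that conversion. The coefficient value below is hypothetical and chosen only for illustration; it is not one of this model's estimates.

```python
import math


def odds_multiplier(coefficient, delta=1.0):
    """Factor by which the odds change when the feature increases by `delta` units."""
    return math.exp(coefficient * delta)


beta = 0.01  # hypothetical log-odds coefficient, not taken from the fitted model

per_unit = odds_multiplier(beta)            # effect of +1 unit
per_ten = odds_multiplier(beta, delta=10)   # effect of +10 units

print(round(per_unit, 4))
print(round(per_ten, 4))
```

A coefficient close to zero gives a multiplier close to 1, which is why features with small coefficients are read as having little association with the outcome.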

I developed a logistic regression model for verified status based on video features. The model had modest predictive power. Based on the estimated model coefficients from the logistic regression, longer videos tend to be associated with higher odds of the user being verified. Other video features have small estimated coefficients in the model, so their association with verified status seems to be small.

The next step is to construct a classification model that will predict the status of claims made by users. That is the final project and original expectation from the TikTok team. Now, there is enough information to analyze the results of that model with helpful context around user behavior.

# **5. Building a Machine Learning Model**

Let's recall some things:

**Business need and modeling objective**

TikTok users can report videos that they believe violate the platform's terms of service. Because there are millions of TikTok videos created and viewed every day, this means that many videos get reported, too many to be individually reviewed by a human moderator.

Analysis indicates that when authors do violate the terms of service, they're much more likely to be presenting a claim than an opinion. Therefore, it is useful to be able to determine which videos make claims and which videos are opinions.

TikTok wants to build a machine learning model to help identify claims and opinions. Videos that are labeled opinions will be less likely to go on to be reviewed by a human moderator. Videos that are labeled as claims will be further sorted by a downstream process to determine whether they should get prioritized for review. For example, perhaps videos that are classified as claims would then be ranked by how many times they were reported, then the top x% would be reviewed by a human each day.

A machine learning model would greatly assist in the effort to present human moderators with videos that are most likely to be in violation of TikTok's terms of service.

**Modeling design and target variable**

The data dictionary shows that there is a column called `claim_status`. This is a binary value that indicates whether a video is a claim or an opinion. This will be the target variable. In other words, for each video, the model should predict whether the video is a claim or an opinion.

This is a classification task because the model is predicting a binary class.

**Selecting an evaluation metric**

To determine which evaluation metric might be best, we must consider how the model might be wrong. There are two possibilities for bad predictions:

  - **False positives:** When the model predicts a video is a claim when in fact it is an opinion
  - **False negatives:** When the model predicts a video is an opinion when in fact it is a claim

**Ethical implications of building the model**

In the given scenario, it's better for the model to predict false positives when it makes a mistake, and worse for it to predict false negatives. It's very important to identify videos that break the terms of service, even if that means some opinion videos are misclassified as claims. The worst case for an opinion misclassified as a claim is that the video goes to human review. The worst case for a claim that's misclassified as an opinion is that the video does not get reviewed _and_ it violates the terms of service. A video that violates the terms of service would be considered posted from a "banned" author, as referenced in the data dictionary.

Because it's more important to minimize false negatives, the model evaluation metric will be **recall**.

**Modeling workflow and model selection process**

Previous work with this data has revealed that there are ~20,000 videos in the sample. This is sufficient to conduct a rigorous model validation workflow, broken into the following steps:

1. Split the data into train/validation/test sets (60/20/20)
2. Fit models and tune hyperparameters on the training set
3. Perform final model selection on the validation set
4. Assess the champion model's performance on the test set

![](https://raw.githubusercontent.com/adacert/tiktok/main/optimal_model_flow_numbered.svg)

## **5.1. Importing Packages**

In [None]:
# Import packages for data preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# Import packages for data modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance

I've already performed some EDA on this data and cleaned the null values. I've also checked for duplicates and there are no duplicates in this dataset.

As for the outliers, tree-based models are robust to outliers, so there is no need to impute or drop any values based on where they fall in their distribution. Plus, having these is very helpful in terms of knowing if a video is a claim or opinion.

I've already checked for class balance: the numbers of claim and opinion videos are very similar, so there's no need to adjust anything.

## **5.2. Feature Engineering**

We've already done some feature engineering when we extracted the length of the text from the videos and created 3 new features: `likes_per_view`, `comments_per_view`, `shares_per_view`.

### 5.2.1. Feature Selection and Transformation

Encoding target and categorical variables.

In [None]:
X = data_clean.copy()

# Dropping unnecessary columns
X = X.drop(['#', 'video_id', 'video_transcription_text'], axis=1)

# Encoding target variable
X['claim_status'] = X['claim_status'].replace({'opinion': 0, 'claim': 1})

# Dummy encoding remaining categorical values
X = pd.get_dummies(X,
                   columns=['verified_status', 'author_ban_status'],
                   drop_first=True)
X.head()

## **5.3. Splitting the data**

In this case, the target variable is `claim_status`.

0 represents an opinion
1 represents a claim

In [None]:
# Isolating target variable
y = X['claim_status']

In [None]:
# Isolating features
X = X.drop(['claim_status'], axis=1)

# Display first few rows of features dataframe
X.head()

## **5.4. Creating train/validate/test sets**

I'm going to split the data into training and testing sets, 80/20.

In [None]:
# Splitting the data into training and testing sets
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Splitting the training set into training and validation sets, 75/25, to result in a final ratio of 60/20/20 for train/validate/test sets.

In [None]:
# Splitting the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=42)

In [None]:
# Getting shape of each training, validation, and testing set to confirm that they are in alignment.
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape
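
The arithmetic behind the 60/20/20 ratio can be checked with a short sketch. It assumes the test portion of each split is rounded up, which is my understanding of how `train_test_split` handles a fractional `test_size`. The total of 19,084 rows is an assumption on my part, chosen because it is consistent with the 3,817 test samples reported in the conclusion; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_held_out) for a fractional test_size, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 19084  # assumed number of rows after cleaning

# First split: 80% kept for training/validation, 20% held out for testing
n_tr, n_test = split_sizes(n_total, 0.20)

# Second split: 25% of the remaining 80% becomes the validation set (0.25 * 0.80 = 0.20)
n_train, n_val = split_sizes(n_tr, 0.25)

print(n_train, n_val, n_test)
print([round(n / n_total, 3) for n in (n_train, n_val, n_test)])
```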

## **5.5. Building the Model**

### **5.5.1. Random Forest Model**

In [None]:
# Instantiating the random forest classifier
rf = RandomForestClassifier(random_state=42)

# Creating a dictionary of hyperparameters to tune
cv_params = {'max_depth': [5, 7, None],
             'max_features': [0.3, 0.6],
            #  'max_features': 'auto'
             'max_samples': [0.7],
             'min_samples_leaf': [1,2],
             'min_samples_split': [2,3],
             'n_estimators': [75,100,200],
             }

# Defining a list of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Instantiating the GridSearchCV object
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='recall')

In [None]:
%%time

# This may take up to 10 min to run
rf_cv.fit(X_train, y_train)
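
To see why the search takes a while, it helps to count how many models get trained. The sketch below enumerates the random forest grid defined above and multiplies by the number of folds. The extra single fit assumes the best configuration is refitted once on the full training set, which is what I understand the `refit` option to do.

```python
from itertools import product

# Same grid as the random forest search above
rf_grid = {
    'max_depth': [5, 7, None],
    'max_features': [0.3, 0.6],
    'max_samples': [0.7],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 3],
    'n_estimators': [75, 100, 200],
}

# Every combination of hyperparameter values is one candidate model
candidates = list(product(*rf_grid.values()))
n_candidates = len(candidates)

n_folds = 5
# Each candidate is trained once per fold, plus one final refit of the best candidate
n_fits = n_candidates * n_folds + 1

print(n_candidates, n_fits)
```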

In [None]:
# Examining best recall score
rf_cv.best_score_

In [None]:
# Examine best parameters
rf_cv.best_params_

In [None]:
# Creating a table of results

def make_results(model_name, model_object):
    """
    Return a one-row DataFrame with the mean CV scores of the best model.

    model_name: label to use for the model in the output table
    model_object: a fitted GridSearchCV object
    """
    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(mean recall score)
    best_estimator_results = cv_results.iloc[cv_results['mean_test_recall'].idxmax(), :]

    # Extract accuracy, precision, recall and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'Model': [model_name],
                          'Recall': [recall],
                          'F1': [f1],
                          'Precision': [precision],
                          'Accuracy': [accuracy]
                          })
    return table

In [None]:
rf_cv_results = make_results('Random Forest CV', rf_cv)
rf_cv_results

This model performs exceptionally well, with an average recall score of 0.993 across the five cross-validation folds. After checking the precision score to be sure the model is not classifying all samples as claims, it is clear that this model is making almost perfect classifications.
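
The precision check matters because recall on its own can be gamed. As a minimal sketch with made-up labels, a "model" that labels every video as a claim reaches perfect recall, while its precision falls to the share of claims in the data. The `precision_recall` helper is hypothetical and written only for this example.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1) from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up balanced sample: 50 claims (1) and 50 opinions (0)
y_true = [1] * 50 + [0] * 50

# Degenerate "model" that predicts claim for every video
always_claim = [1] * 100

precision, recall = precision_recall(y_true, always_claim)
print(precision, recall)
```

High recall together with high precision rules this failure mode out.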

### **5.5.2. XGBoost Model**

In [None]:
# Instantiating the XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=0)

# Creating a dictionary of hyperparameters to tune
cv_params = {'max_depth': [4,8,12],
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]
             }

# Defining a list of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Instantiating the GridSearchCV object
xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=5, refit='recall')

In [None]:
%%time

# This may take up to 10 min to run
xgb_cv.fit(X_train, y_train)

In [None]:
xgb_cv.best_score_

In [None]:
xgb_cv.best_params_

In [None]:
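# Storing the XGBoost results under their own name so the random forest results are not overwritten
xgb_cv_results = make_results('XGBoost CV', xgb_cv)
xgb_cv_results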

This model also performs exceptionally well. Although its recall score is very slightly lower than the random forest model's at 0.992, its precision score is perfect.

# **6. Evaluate Models**

## **6.1. Random forest**

In [None]:
# Using the random forest "best estimator" model to get predictions on the validation set
y_pred = rf_cv.best_estimator_.predict(X_val)

In [None]:
# Creating a confusion matrix to visualize the results of the classification model

# Compute values for confusion matrix
log_cm = confusion_matrix(y_val, y_pred)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=None)

# Plot confusion matrix
log_disp.plot()

# Display plot
plt.show()

* The upper-left quadrant displays the number of **true negatives**: the number of opinions that the model accurately classified as so.

* The upper-right quadrant displays the number of **false positives**: the number of opinions that the model misclassified as claims.

* The lower-left quadrant displays the number of **false negatives**: the number of claims that the model misclassified as opinions.

* The lower-right quadrant displays the number of **true positives**: the number of claims that the model accurately classified as so.

A perfect model would yield all true negatives and true positives, and no false negatives or false positives.

As the above confusion matrix shows, this model does not produce any false positives.

Now, I'm going to create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the model.

In [None]:
# Creating a classification report for the random forest model
target_labels = ['opinion', 'claim']
print(classification_report(y_val, y_pred, target_names=target_labels))

## **6.2. XGBoost**

In [None]:
# Using the XGBoost "best estimator" model to get predictions on the validation set
y_pred = xgb_cv.best_estimator_.predict(X_val)

In [None]:
# Compute values for confusion matrix
log_cm = confusion_matrix(y_val, y_pred)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=None)

# Plot confusion matrix
log_disp.plot()

# Display plot
plt.title('XGBoost - validation set');
plt.show()

In [None]:
# Create a classification report
target_labels = ['opinion', 'claim']
print(classification_report(y_val, y_pred, target_names=target_labels))

**The results of the XGBoost model were also nearly perfect. However, its errors tended to be more false negatives. Identifying claims was the priority, so it's important for the model to be good at capturing all actual claim videos. The random forest model has a better recall score and is therefore the champion model.**

# **7. Using the Champion Model to Predict on Test Data**

In [None]:
# Use champion model to predict on test data
y_pred = rf_cv.best_estimator_.predict(X_test)

In [None]:
# Compute values for confusion matrix
log_cm = confusion_matrix(y_test, y_pred)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=None)

# Plot confusion matrix
log_disp.plot()

# Display plot
plt.title('Random forest - test set');
plt.show()

# **8. Feature Importance**

In [None]:
importances = rf_cv.best_estimator_.feature_importances_
rf_importances = pd.Series(importances, index=X_test.columns)

fig, ax = plt.subplots()
rf_importances.plot.bar(ax=ax)
ax.set_title('Feature importances')
ax.set_ylabel('Mean decrease in impurity')
fig.tight_layout()

**The most predictive features are all related to engagement levels generated by the video. This is not unexpected, as analysis from prior EDA pointed to this conclusion.**

# **9. Conclusion**

Both model architectures—random forest (RF) and XGBoost—performed exceptionally well. The RF model had a better recall score (0.993) and was selected as champion.

Performance on the test holdout data yielded near perfect scores, with only 13 misclassified samples out of 3,817.

Subsequent analysis indicated that, as expected, the primary predictors were all related to video engagement levels, with video view count, like count, share count, and download count accounting for nearly all predictive signal in the data. With these results, we can conclude that videos with higher user engagement levels were much more likely to be claims. In fact, no opinion video had more than 10,000 views.

As noted, the model performed exceptionally well on the test holdout data. Before deploying the model, I recommend further evaluation using additional subsets of user data. Furthermore, I recommend monitoring the distributions of video engagement levels to ensure that the model remains robust to fluctuations in its most predictive features.
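
One possible way to monitor an engagement feature is the Population Stability Index (PSI), which compares how a reference sample and a newer sample are spread across the same bins. The sketch below is only an illustration of that idea under my own assumptions: the `psi` helper, the bin edges, and the sample values are all hypothetical, and the 0.25 alert level is a commonly quoted rule of thumb rather than anything derived from this project.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Number of edges at or below v gives the index of the bin v falls into
            idx = sum(v >= edge for edge in bin_edges)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical view counts: a reference sample and a shifted newer sample
reference = list(range(0, 1000, 10))
shifted = [v + 300 for v in reference]
edges = [250, 500, 750]

no_drift = psi(reference, reference, edges)
drift = psi(reference, shifted, edges)

print(round(no_drift, 4), round(drift, 4))
if drift > 0.25:  # rule-of-thumb alert level (assumption)
    print("Distribution shift detected: consider re-evaluating the model")
```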