<a href="https://www.kaggle.com/code/marekbajdk/classifying-claims-and-opinions-in-tiktok-content?scriptVersionId=205248385" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

![](https://media.licdn.com/dms/image/D4D12AQGU_0vXNy2KcA/article-cover_image-shrink_720_1280/0/1691093307457?e=1726099200&v=beta&t=b9S9B1pijG6vw6ULCrb7JSOH9eqLXsKmIDDWQMPgBvM)

# **TikTok Claim Classification Project**



## **PACE: Plan**

**Business Need and Modeling Objective**

TikTok users can report videos they believe violate the platform's terms of service. Given the sheer volume of videos created and viewed daily, it's impractical for human moderators to review every reported video.

Analysis shows that videos violating the terms of service are more likely to present claims rather than opinions. Thus, distinguishing between videos that make claims and those that express opinions is crucial.

TikTok aims to develop a machine learning model to identify claims and opinions in videos. Videos identified as opinions will be less likely to require human review. In contrast, videos identified as claims will undergo a further sorting process to prioritize those for human moderation. For instance, claim videos could be ranked by the number of reports they receive, with the top x% reviewed daily by human moderators.

This machine learning model will significantly enhance the efficiency of human moderators by presenting them with videos that are most likely to violate TikTok's terms of service.

**Modeling Design and Target Variable**

The dataset includes a column named claim_status, a binary indicator of whether a video is a claim or an opinion. This binary value will serve as the target variable for the model, which will predict whether each video is a claim or an opinion.

This task involves binary classification, as the model will classify videos into one of two categories.

**Selecting an Evaluation Metric**

To choose an appropriate evaluation metric, consider the types of prediction errors:

- **False positives:** The model predicts a video is a claim when it is actually an opinion.
- **False negatives:** The model predicts a video is an opinion when it is actually a claim.

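Because videos classified as claims go on to further review, a false negative is the costlier error: a claim misclassified as an opinion may never reach a human moderator, whereas a false positive only adds one more video to the review queue. This makes recall for the claim class a sensible metric to prioritize.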
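
To make the two error types concrete, the sketch below counts them on a small set of made-up labels (`y_true` and `y_pred` are illustrative and not drawn from the TikTok data) and computes precision and recall with `claim` as the positive class. Recall is the share of actual claims the model catches, so it falls as false negatives rise.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only (not from the dataset)
y_true = ["claim", "claim", "claim", "claim", "opinion", "opinion", "opinion", "opinion"]
y_pred = ["claim", "claim", "claim", "opinion", "opinion", "opinion", "opinion", "opinion"]

# Order the labels so that "claim" is the positive (second) class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["opinion", "claim"]).ravel()
print(f"False positives: {fp}, false negatives: {fn}")

# Precision: share of predicted claims that are real claims
# Recall: share of real claims that the model identified
precision = precision_score(y_true, y_pred, pos_label="claim")
recall = recall_score(y_true, y_pred, pos_label="claim")
print(f"Precision: {precision:.2f}, recall: {recall:.2f}")
```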

**Modeling Workflow and Model Selection Process**

   1. Split the data into training, validation, and test sets (60/20/20).
   2. Fit models and tune hyperparameters using the training set.
   3. Perform final model selection using the validation set.
   4. Assess the performance of the chosen model on the test set.
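
A 60/20/20 split as described in step 1 can be obtained with two calls to `train_test_split`. The sketch below is a minimal illustration on hypothetical toy data (`X_demo` and `y_demo` are placeholders, not the project's features): first hold out 20% for testing, then take 25% of the remaining 80% for validation, which equals 20% of the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real features and target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series(rng.integers(0, 2, size=100))

# Step 1: hold out 20% of all rows as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

# Step 2: 25% of the remaining 80% becomes the validation set (20% overall)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Stratifying on the target keeps the class proportions similar across the three sets; whether that is needed depends on how balanced the target is.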
    

![](https://raw.githubusercontent.com/adacert/tiktok/main/optimal_model_flow_numbered.svg)

### **Task 1. Imports and data loading**

Start by importing packages needed to build machine learning models to achieve the goal of this project.

In [None]:
import numpy as np
import pandas as pd

# Suppress the specific FutureWarning in Kaggle notebook
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, message="use_inf_as_na option is deprecated")
warnings.filterwarnings("ignore", category=FutureWarning, message="When grouping with a length-1 list-like")


# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)

# Data modeling
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Data preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import resample

# Statistical analysis/hypothesis testing
from scipy import stats

# Metrics and helpful functions
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.tree import plot_tree

# For saving models
import pickle

In [None]:
# Load dataset into dataframe
data = pd.read_csv('/kaggle/input/tiktok-content/tiktok_video_metrics.csv')

## **PACE: Analyze**

### **Understanding and inspecting data**

In [None]:
# Display and examine the first ten rows of the dataframe
data.head(10)

Each row represents a single video and records whether it presents a claim or an opinion, along with details such as its duration, views, likes, shares, downloads, and comments. The dataset includes a mix of categorical, text, and numerical data.

In [None]:
# Get summary info
data.info()

The variables have different types: three integer columns, five float columns, and four object columns. Notably, several variables contain null values.

In [None]:
description = data.describe().T

description.style.format('{:,.2f}')

View counts reach almost a million, and the engagement variables contain outliers. There are peaks in shares, likes, downloads, and comments in certain entries. These variables have very large standard deviations and extremely high maximum values compared to their quartile values.

### **Variables investigation**

A good initial step in understanding the data is to examine the 'claim_status' variable. Start by determining the number of videos corresponding to each claim status.

**Claim status**

In [None]:
# What are the different values for claim status and how many of each are in the data?
claim_status_counts = (data['claim_status'].value_counts())
print(claim_status_counts)
print()
claim_status_percentage = (claim_status_counts/claim_status_counts.sum())*100
print(claim_status_percentage)
print()

# Specify the two classes we want to compare
class_1 = 'claim'
class_2 = 'opinion'

# Calculate the difference between the counts of the two classes
difference = claim_status_counts[class_1] - claim_status_counts[class_2]

total_counts = claim_status_counts.sum()
    
# Calculate the percentage difference
percentage_difference = (difference / total_counts) * 100
    
print(f"The absolute difference between {class_1} and {class_2} is: {difference}")
print(f"The percentage difference between {class_1} and {class_2} is: {percentage_difference:.2f}%")


The counts for each claim status are well balanced, as reflected in the minimal percentage difference of 0.69%.

Next, examine the engagement trends associated with each different claim status.

We can start with Boolean masking to filter the data based on claim status, followed by calculating the mean and median view counts for each category.

**Claim**

In [None]:
# What is the average view count of videos with "claim" status?
# Boolean mask that is True for rows whose claim_status is "claim"
data_claim = data['claim_status'] == 'claim'

# Apply the mask to keep only claim videos
claims = data[data_claim]
print('Mean view count claims:', round(claims['video_view_count'].mean(), 2))
print('Median view count claims:', claims['video_view_count'].median())

**Opinion**

In [None]:
# What is the average view count of videos with "opinion" status?
# Boolean mask that is True for rows whose claim_status is "opinion"
data_opinion = data['claim_status'] == 'opinion'

# Apply the mask to keep only opinion videos
opinions = data[data_opinion]
print('Mean view count opinions:', round(opinions['video_view_count'].mean(), 2))
print('Median view count opinions:', opinions['video_view_count'].median())

The mean and median values within each claim category are similar, but there is a significant difference in view counts between videos labeled as claims and those labeled as opinions.

Now, examine trends associated with the ban status of the author.

Calculate the number of videos for each combination of claim status and author ban status categories.

In [None]:
data.groupby(['claim_status', 'author_ban_status']).count()[['#']]

Claim videos are more often associated with banned or under-review authors than opinion videos are. One possible explanation is that claim videos are policed more strictly, so authors must follow a stricter set of rules when posting a claim than when posting an opinion. However, we cannot determine whether claim videos are inherently more likely than opinion videos to result in author bans or whether authors who post claim videos are simply more prone to violating the terms of service.

Additionally, while this data allows us to draw conclusions about banned versus active authors, it does not provide insights into banned videos. We cannot ascertain if a specific video led to the ban, as banned authors may have posted other videos that complied with the terms of service.

In [None]:
# What's the median video share count of each author ban status?
data.groupby(['author_ban_status']).median(numeric_only=True)[['video_share_count']]

Although banned accounts are fewer in number compared to active ones, they have higher median and average views, likes, and shares. Banned authors, in particular, have a median share count that is 33 times greater than that of active authors.

In [None]:
data.groupby(['author_ban_status']).agg(
    {'video_view_count': ['count', 'mean', 'median'],
     'video_like_count': ['count', 'mean', 'median'],
     'video_share_count': ['count', 'mean', 'median']
     })

Banned authors and those under review receive significantly more views, likes, and shares compared to active authors. In most groups, the mean is substantially higher than the median, suggesting the presence of videos with exceptionally high engagement counts or large upper outliers.

Let's create engagement columns to gain a clearer understanding of their actual rates.

In [None]:
# Likes_per_view column
data['likes_per_view'] = data['video_like_count']/data['video_view_count']

# Comments_per_view column
data['comments_per_view'] = data['video_comment_count']/data['video_view_count']

# Shares_per_view column
data['shares_per_view'] = data['video_share_count']/ data['video_view_count']

We can analyze engagement metrics grouped by `claim_status` and `author_ban_status`.

In [None]:
# Engagement metrics by claim_status & author_ban_status
data.groupby(['claim_status','author_ban_status']).agg(
    {'likes_per_view': ['count','mean', 'median'],
    'comments_per_view':['count','mean', 'median'],
    'shares_per_view':['count','mean', 'median']})

We observe that videos by banned authors and those under review generally attract significantly more views, likes, and shares compared to videos by non-banned authors. However, once a video is viewed, its engagement rate is more strongly influenced by its claim status than by the author's ban status.

Additionally, claim videos have a higher view rate compared to opinion videos, and they also receive a higher average rate of likes, indicating they are more favorably received. Moreover, claim videos garner more engagement through comments and shares than opinion videos.

For claim videos, banned authors achieve slightly higher likes-to-view and shares-to-view rates than active authors or those under review. In contrast, for opinion videos, active authors and those under review achieve higher engagement rates across all categories compared to banned authors.

### **Build visualizations**

#### **Video duration**

Create a box plot to analyze the range of values in video duration and a histogram to further investigate the distribution of these values.

In [None]:
# Colors from the Color Universal Design (CUD) palette
cud_box_color = '#E69F00'  # Orange
cud_hist_color = '#56B4E9'  # Sky Blue

# Figure with two subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(15, 4))

# First subplot: Boxplot
axes[0].set_title('video_duration_sec')
sns.boxplot(x=data['video_duration_sec'], ax=axes[0], color=cud_box_color)

# Second subplot: Histogram 
sns.histplot(data['video_duration_sec'], bins=range(0, 61, 5), ax=axes[1], color=cud_hist_color)
axes[1].set_title('Video duration histogram')

# Adjust layout to prevent overlap
plt.tight_layout()



Video durations range from 5 to 60 seconds and are roughly uniformly distributed.

#### **Video views**

Create a box plot to examine the spread of values in video views and a histogram to explore the distribution.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 4))

# Define labels for view count categories from 0 to 1000k in increments of 100k
labels = [0] + [str(i) + 'k' for i in range(100, 1001, 100)]

axes[0].set_title('video_view_count')
sns.boxplot(x=data['video_view_count'], ax=axes[0], color=cud_box_color)
axes[0].set_xticks(range(0, 10*10**5 + 1, 10**5)) # Set x-axis ticks from 0 to 1,000,000 in increments of 100,000
axes[0].set_xticklabels(labels)

sns.histplot(data['video_view_count'], bins=range(0,(10**6+1),10**5), ax=axes[1], color=cud_hist_color)
axes[1].set_title('video_view_count histogram')
axes[1].set_xticks(range(0, 10*10**5 + 1, 10**5))
axes[1].set_xticklabels(labels)

plt.tight_layout()

This variable exhibits a highly uneven distribution, with more than half of the videos receiving fewer than 100,000 views. For view counts greater than 100,000, the distribution is uniform.

#### **Video likes**

Create a box plot to examine the spread of values in video likes and a histogram to explore the distribution.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 4))

labels = [0] + [str(i) + 'k' for i in range(100, 701, 100)]

axes[0].set_title('video_like_count')
sns.boxplot(x=data['video_like_count'], ax=axes[0], color=cud_box_color)
axes[0].set_xticks(range(0, 7*10**5 + 1, 10**5))
axes[0].set_xticklabels(labels)

sns.histplot(data['video_like_count'], bins=range(0, 7*10**5 + 1, 10**5), ax=axes[1], color=cud_hist_color)
axes[1].set_title('Video like count histogram')
axes[1].set_xticks(range(0, 7*10**5 + 1, 10**5))
axes[1].set_xticklabels(labels)

plt.tight_layout();

Similar to view counts, there are significantly more videos with fewer than 100,000 likes than with more. However, the distribution tapers off gradually and skews to the right, with a long tail of videos that have very high like counts.

#### **Video comments**

Create a box plot to examine the spread of values in video comments and a histogram to further explore the distribution.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 4), constrained_layout=True)

hist_labels = [str(i) for i in range(0, 3001, 500)]

ax1.set_title('video_comment_count')
sns.boxplot(x=data['video_comment_count'], ax=ax1, color=cud_box_color)

sns.histplot(data['video_comment_count'], bins=range(0, 3001, 100), ax=ax2, color=cud_hist_color)
ax2.set_title('Video comment count histogram')
ax2.set_xticks(range(0, 3001, 500))
ax2.set_xticklabels(hist_labels);

The majority of videos have comment counts clustered at the lower end of the range. Most videos have fewer than 100 comments, resulting in a highly right-skewed distribution.

#### **Video shares**

Create a box plot to examine the spread of values in video shares and a histogram to further explore the distribution.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 4), constrained_layout=True)

# Labeling values in a more compact form, discards the remainder, giving the number of thousands.
hist_labels = [f'{i//1000}k' for i in range(0, 270001, 20000)]

axes[0].set_title('video_share_count')
sns.boxplot(x=data['video_share_count'], ax=axes[0], color=cud_box_color)
axes[0].set_xlabel('')  # Optionally remove x-axis label if not needed

sns.histplot(data['video_share_count'], bins=range(0, 270001, 20000), ax=axes[1], color=cud_hist_color)
axes[1].set_title('Video Share Count Histogram')

axes[1].set_xticks(range(0, 270001, 20000))
axes[1].set_xticklabels(hist_labels)

# Optional: Rotate x-axis labels for better readability
axes[1].tick_params(axis='x', rotation=45);

Most videos had fewer than 10,000 shares, resulting in a distribution that is highly right-skewed.

#### **Video downloads**

Create a box plot to examine the spread of values in video downloads and a histogram to further explore the distribution.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 4), constrained_layout=True)

hist_labels =  [str(i) for i in range(0,(15001),1000)]

axes[0].set_title('video_download_count')
sns.boxplot(x=data['video_download_count'], ax=axes[0], color=cud_box_color)
axes[0].set_xlabel('')  # Optionally remove x-axis label if not needed

sns.histplot(data['video_download_count'], bins=range(0,(15001),1000), ax=axes[1], color=cud_hist_color)
axes[1].set_title('video_download_count Histogram')

axes[1].set_xticks(range(0,(15001),1000))
axes[1].set_xticklabels(hist_labels)

axes[1].tick_params(axis='x', rotation=45);

Most videos were downloaded fewer than 500 times, though some received over 12,000 downloads. This indicates a significant rightward skew in the data.

#### **Claim status by verification status**

Create a histogram featuring four bars, with each bar representing a unique combination of claim status and verification status.

In [None]:
# Color Universal Design (CUD) palette
cud_palette = ['#E69F00', '#56B4E9']

plt.figure(figsize=(7, 4))

sns.histplot(data=data,
             x='claim_status',
             hue='verified_status',
             multiple='dodge',
             shrink=0.9,
             palette=cud_palette)

plt.title('Claims by verification status histogram');

There are significantly fewer verified users compared to unverified ones; however, verified users are substantially more likely to post opinions.

#### **Claim status by author ban status**

Create a histogram to examine the count of each claim status for each author ban status.

In [None]:
cud_palette = {
    'active': '#56B4E9',       # Sky Blue
    'under review': '#E69F00', # Orange
    'banned': '#E94E77'        # Fiery Red
}

fig = plt.figure(figsize=(8, 4))

sns.histplot(data, x='claim_status', hue='author_ban_status',
             multiple='dodge',
             hue_order=['active', 'under review', 'banned'],
             shrink=0.9,
             palette=cud_palette,
             alpha=0.5)

plt.title('Claim status by author ban status');

For both claim and opinion videos, there are significantly more active authors than banned authors or those under review. However, the proportion of active authors is notably higher for opinion videos compared to claim videos. This suggests that authors who post claim videos are more likely to face review or banning.

#### **Median view counts by ban status**

Generate a bar plot with three bars, each representing a different author ban status. The height of each bar should reflect the median number of views for videos associated with that particular author ban status.

In [None]:
color_palette = {
    'active': '#009E73',      # Green
    'under review': '#F0E442', # Yellow
    'banned': '#D55E00'       # Red
}

# Group and calculate the median values
ban_status_counts = data.groupby(['author_ban_status']).median(numeric_only=True).reset_index()
fig = plt.figure(figsize=(6, 4))
sns.barplot(
    data=ban_status_counts,
    x='author_ban_status',
    y='video_view_count',
    order=['active', 'under review', 'banned'],
    palette=color_palette,
    alpha=0.7
)
plt.title('Median view count by ban status');

The median view counts for non-active authors (banned or under review) are substantially higher than those for active authors. Because non-active authors are more likely to post claims and their videos accumulate far more views than those of active authors, `video_view_count` may be an effective indicator of claim status.

In [None]:
data.groupby('claim_status')['video_view_count'].median()

In fact, a quick examination of the median view count by claim status supports this conclusion.

#### **Total views by claim status**

Create a pie graph that depicts the proportions of total views for claim videos and total views for opinion videos.

In [None]:
fig = plt.figure(figsize=(3, 3))
plt.pie(data.groupby('claim_status')['video_view_count'].sum(),labels=['claim', 'opinion'],
    autopct='%1.1f%%',  # Show percentages on the pie chart
    startangle=140  # Start angle to ensure consistent orientation
)
plt.title('Total Views by Video Claim Status');

The overall view count is dominated by claim videos even though there are roughly the same number of each video in the dataset.

**Importance of Handling Outliers in Predictive Models**

Outliers can pose significant challenges when constructing predictive models. For instance, if the goal is to forecast the view count of a video, those with exceptionally high view counts might skew the model's predictions. Additionally, outliers may signal issues in data collection or recording processes.

**Objective of the TikTok Project**

The primary goal of the TikTok project is to develop a model that can predict whether a video is classified as a claim or an opinion. Our analysis has shown a strong correlation between a video's engagement level and its claim status. There is no evidence suggesting that the TikTok data contains inaccurately recorded values, and the distribution aligns with expected social media patterns: a small fraction of videos achieve extremely high engagement, reflecting the nature of viral content.

**Importance of Identifying Outliers**

The approach to handling outliers should be tailored to the project's requirements, using domain knowledge to set appropriate thresholds. A common method for identifying outliers in a normally distributed dataset uses the interquartile range (IQR): values more than 1.5 times the IQR above the third quartile (or below the first quartile) are treated as outliers.

**Adapting Outlier Detection for the TikTok Dataset**

In the TikTok dataset, the count variables are heavily right-skewed rather than normally distributed. To adjust the outlier threshold accordingly, one can calculate the median value for each variable and add 1.5 times the IQR. This results in a lower threshold compared to using the third quartile, better reflecting the dataset's skewed nature.
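
As a rough illustration of why the median-based threshold is lower, the sketch below compares it with the conventional upper fence (Q3 + 1.5 × IQR) on a handful of made-up, right-skewed values. The `values` series is purely illustrative and is not taken from the dataset.

```python
import pandas as pd

# Made-up, right-skewed values for illustration only
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
median = values.median()

# Conventional upper fence: 1.5 * IQR above the third quartile
standard_threshold = q3 + 1.5 * iqr
# Threshold described in the text: 1.5 * IQR above the median
median_threshold = median + 1.5 * iqr

print("Standard threshold:", standard_threshold)
print("Median-based threshold:", median_threshold)
print("Values above median-based threshold:", (values > median_threshold).sum())
```

Since the median never exceeds the third quartile, the median-based threshold is never higher than the conventional fence, so it flags at least as many values.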

**Steps:**

1. Calculate the IQR of the column
2. Calculate the median of the column
3. Determine the outlier threshold using the formula: median + 1.5 * IQR.
4. Count the number of videos with values in that column exceeding the outlier threshold.

In [None]:
# List of column names related to video metrics
count_cols = [
    'video_view_count',
    'video_like_count',
    'video_share_count',
    'video_download_count',
    'video_comment_count',
]

# Iterate through each column in the list 'count_cols'
for column in count_cols:
    # Calculate the first quartile (25th percentile) of the column data
    q1 = data[column].quantile(0.25)
    # Calculate the third quartile (75th percentile) of the column data
    q3 = data[column].quantile(0.75)
    # Calculate the interquartile range (IQR) as the difference between Q3 and Q1
    iqr = q3 - q1
    # Find the median (50th percentile) of the column data
    median = data[column].median()
    # Define an outlier threshold as 1.5 times the IQR above the median
    outlier_threshold = median + 1.5 * iqr

    # Count the number of values in the column that exceed the outlier threshold
    outlier_count = (data[column] > outlier_threshold).sum()
    # Print the count of outliers for the current column
    print(f'Outliers of {column}:', outlier_count)
 

#### **Scatterplot**

Create a scatterplot of `video_view_count` versus `video_like_count` according to `claim_status`.

In [None]:
custom_palette = {'claim':'#56B4E9', 'opinion':'#E69F00'}

palette = sns.color_palette("colorblind", n_colors=2)

sns.scatterplot(x=data["video_view_count"], y=data["video_like_count"],
                hue=data["claim_status"], s=10, alpha=.3, palette=palette)
plt.title('Video_view_count vs. video_like_count by claim status')
plt.show()

Create a scatterplot of `video_view_count` versus `video_like_count` for opinions only.

In [None]:
opinion = data[data['claim_status']=='opinion']
sns.scatterplot(x=opinion["video_view_count"],y=opinion["video_like_count"],s=10,alpha=.3)
plt.show()

## Hypothesis Testing

Checking for and handling missing values.

In [None]:
data.isna().sum()

In [None]:
# Drop rows with missing values
data = data.dropna(axis=0)

In [None]:
# Rows after handling missing values
data.head()

We can explore the relationship between verified_status and video_view_count. One way to do this is by analyzing the average video_view_count for each verified_status group in the sample data.

In [None]:
data.groupby("verified_status")["video_view_count"].mean()

Now let's recall the distinction between the null hypothesis and the alternative hypothesis, and state our hypotheses for this project.

**Null Hypothesis (H0)**: There is no difference in the number of views between posts made by verified and unverified accounts on TikTok.

Any observed difference in the sample data is attributed to chance or sampling variability.

**Alternative Hypothesis (H1)**: There is a difference in the number of views between posts made by verified and unverified accounts on TikTok.

Any observed difference in the sample data reflects a real difference in the population means. 

Conduct a two-sample t-test:

- State the null hypothesis and the alternative hypothesis.
- Choose a significance level.
- Find the p-value.
- Reject or fail to reject the null hypothesis.

We choose a 5% significance level.

In [None]:
# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Apply a t-test using the two samples
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

Given that the p-value is far smaller than the 5% significance level, we reject the null hypothesis. This indicates that there is a statistically significant difference in the mean video view counts between verified and unverified TikTok accounts.
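
The decision rule can be written out explicitly. The following is a minimal sketch on synthetic samples (`group_a` and `group_b` are randomly generated placeholders with arbitrary sizes and means, not the view counts from the dataset) showing how the p-value from Welch's t-test is compared with the chosen significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples for illustration only; sizes and means are arbitrary
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=50)

alpha = 0.05  # 5% significance level

# equal_var=False requests Welch's t-test, which does not assume equal variances
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Reject H0 only when the p-value falls below the significance level
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}: {decision}")
```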

### **Regression Analysis**


Check for and handle duplicates.

In [None]:
data.duplicated().sum()

Previously, we identified outliers in likes and comments using boxplots and histograms.


**Video like count outliers**

In [None]:
# Calculate the 25th and 75th percentiles
q25 = data['video_like_count'].quantile(0.25)
q75 = data['video_like_count'].quantile(0.75)

# Calculate the Interquartile Range (IQR)
iqr = q75 - q25

# Determine the lower and upper limits for outliers
lower_bound = q25 - 1.5 * iqr
upper_bound = q75 + 1.5 * iqr

# Cap the outliers at the upper limit
data.loc[data['video_like_count'] > upper_bound, 'video_like_count'] = upper_bound

**Video comment count outliers**

In [None]:
q25 = data['video_comment_count'].quantile(0.25)
q75 = data['video_comment_count'].quantile(0.75)

iqr = q75 - q25

lower_bound = q25 - 1.5 * iqr
upper_bound = q75 + 1.5 * iqr

data.loc[data['video_comment_count'] > upper_bound, 'video_comment_count'] = upper_bound
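
The two capping cells above repeat the same steps, so they could be folded into a small helper. The sketch below is one possible way to do that: `cap_upper_outliers` is a hypothetical function name, and the demo values are made up rather than taken from the dataset.

```python
import pandas as pd

def cap_upper_outliers(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Cap values above Q3 + factor * IQR at that upper bound (hypothetical helper)."""
    q25 = series.quantile(0.25)
    q75 = series.quantile(0.75)
    upper_bound = q75 + factor * (q75 - q25)
    # clip(upper=...) replaces anything above the bound with the bound itself
    return series.clip(upper=upper_bound)

# Made-up values for illustration only
demo_counts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
capped = cap_upper_outliers(demo_counts)
print(capped.tolist())
```

With the notebook's dataframe, this could be applied as `data['video_like_count'] = cap_upper_outliers(data['video_like_count'])`, assuming the column has no missing values at this point.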

**Class balance check**

In [None]:
# normalize=True returns proportions rather than raw counts
data['verified_status'].value_counts(normalize=True)

Approximately 94.2% of the dataset comprises videos from unverified accounts, whereas only 5.8% are from verified accounts. This indicates a significant imbalance in the outcome variable.

If necessary, use resampling to achieve class balance in the outcome variable.

In [None]:
# Identify data points from majority and minority classes
data_majority = data[data["verified_status"] == "not verified"]
data_minority = data[data["verified_status"] == "verified"]

# Upsample the minority class (which is "verified")
data_minority_upsampled = resample(
    data_minority,
    replace=True,                 # to sample with replacement
    n_samples=len(data_majority), # to match majority class
    random_state=0                # to create reproducible results
)

# Combine majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled]).reset_index(drop=True)

# Display new class counts
print(data_upsampled['verified_status'].value_counts())

Determine the average length of video transcription texts for videos posted by verified accounts, and compare it to the average length of video transcription texts for videos posted by unverified accounts.

In [None]:

data_upsampled.groupby('verified_status')['video_transcription_text'].apply(lambda texts: np.mean([len(text) for text in texts]))

Extract the length of each video_transcription_text and add it as a new column to the dataframe. This will allow the length to be used as a potential feature in the model.

In [None]:
# The list comprehension iterates over each text in the column and calculates its length; the results are assigned to the new text_length column
data_upsampled['text_length'] = [len(text) for text in data_upsampled['video_transcription_text']]

In [None]:
data_upsampled.head()

Let's visualize the distribution of `video_transcription_text` length for videos posted by verified accounts compared to those posted by unverified accounts.

In [None]:
# Plot histogram of text lengths, differentiated by verification status

c_palette = {'not verified':'#E69F00', 'verified':'#56B4E9'}
sns.histplot(
    data=data_upsampled, 
    x='text_length', 
    hue='verified_status', 
    multiple='stack', 
    stat='count', 
    kde=False, 
    element='bars', 
    palette=c_palette,
    legend=True
)

# Customize plot
plt.title('Transcription Text Length by account type')
plt.xlabel('Text Length (Characters)');

### **Correlations Analysis**

Now, let's create a correlation matrix to identify the variables with the highest correlations.

In [None]:
data_upsampled.corr(numeric_only=True)

**Correlation heatmap**

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(
    data_upsampled[[
        "video_duration_sec", "claim_status", "author_ban_status", "video_view_count", 
        "video_like_count", "video_share_count", "video_download_count", "video_comment_count", "text_length"
    ]].corr(numeric_only=True), 
    annot=True, 
    cmap="coolwarm"  # Using 'coolwarm' for colorblind-friendly colors
)
plt.title("Correlation heatmap")
plt.xticks(rotation=25)
plt.show()

One of the key assumptions for logistic regression is the absence of severe multicollinearity among the features. We need to keep this in mind as we analyze the heatmap and decide which features to include in our model.


The heatmap reveals a strong correlation between `video_view_count` and `video_like_count`, with a correlation coefficient of 0.86.

To meet this assumption, we can exclude `video_like_count` from the model.

We keep variables that measure video metrics: `video_view_count`, `video_share_count`, `video_download_count`, and `video_comment_count` as features.
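
Reading a heatmap by eye can miss pairs, so it may help to list strongly correlated feature pairs programmatically. The sketch below does this on synthetic data: the `demo` dataframe, its column names, and the 0.8 cutoff are all illustrative assumptions, not values from the project.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data for illustration only: likes are constructed to track views closely
views = rng.exponential(scale=1000, size=500)
demo = pd.DataFrame({
    "views": views,
    "likes": views * 0.3 + rng.normal(scale=50, size=500),
    "duration": rng.uniform(5, 60, size=500),
})

corr = demo.corr()
threshold = 0.8  # illustrative cutoff for a "strong" correlation

# Check each unordered pair of columns once and keep those above the cutoff
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

For each pair reported, one of the two features would be a candidate to drop before fitting a logistic regression.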


**Variables selection**

Y and X variables

In [None]:
# target variable - Y
y = data_upsampled["verified_status"]

In [None]:
# Features - X
X = data_upsampled[["video_duration_sec", "claim_status", "author_ban_status", "video_view_count", "video_share_count", "video_download_count", "video_comment_count"]]

X.head()

The `#` and `video_id` columns are not chosen as features in this case, as they do not appear to aid in predicting whether a video was posted by a verified account, which is the target variable here.

Train-test split

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Encode variables

In [None]:
# Check data types
X_train.dtypes

In [None]:
# Get unique `claim_status` values
X_train["claim_status"].unique()

In [None]:
# Get unique `author_ban_status` values
X_train["author_ban_status"].unique()

As demonstrated earlier, the `claim_status` and `author_ban_status` features are currently of object data type. To utilize these categorical features with sklearn model implementations, they need to be converted to numeric values. One approach to achieve this is through one-hot encoding.

Encode categorical features in the training set

In [None]:
# Select training features to be encoded
X_train_to_encode = X_train[["claim_status", "author_ban_status"]]

# Display first few rows of the features to be encoded
X_train_to_encode.head()

In [None]:
# Initialize one-hot encoder for categorical features
X_encoder = OneHotEncoder(drop='first', sparse_output=False)

In [None]:
# Fit and transform the training features using the encoder
X_train_encoded = X_encoder.fit_transform(X_train_to_encode)

In [None]:
# Get feature names from encoder
X_encoder.get_feature_names_out()

In [None]:
# Display the encoded training features
X_train_encoded

In [None]:
# Place encoded training features (which is currently an array) into a dataframe
X_train_encoded_df = pd.DataFrame(data=X_train_encoded, columns=X_encoder.get_feature_names_out())

# Display first few rows
X_train_encoded_df.head()

In [None]:
# Display first few rows of `X_train` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
X_train.drop(columns=["claim_status", "author_ban_status"]).head()

In [None]:
# Concatenate `X_train` and `X_train_encoded_df` to form the final dataframe for training data (`X_train_final`)
# Remember to use `.reset_index(drop=True)` on `X_train` after removing `claim_status` and `author_ban_status`,
# to ensure that the indices match those in `X_train_encoded_df`.

X_train_final = pd.concat([X_train.drop(columns=["claim_status", "author_ban_status"]).reset_index(drop=True), X_train_encoded_df], axis=1)

X_train_final.head()

Verify the data type of the outcome variable.

In [None]:
y_train.dtypes

Get unique values of outcome variable

In [None]:
y_train.unique()

As demonstrated above, the outcome variable is currently of object data type. To convert it to a numeric format, one-hot encoding can be applied.

In [None]:
# Set up an encoder for one-hot encoding the categorical outcome variable
y_encoder = OneHotEncoder(drop='first', sparse_output=False)

In [None]:
# Encode the training outcome variable

#  Adjusting the shape of `y_train` before passing into `.fit_transform()`, since it takes in 2D array
#  `flatten()` - flattens the array returned by `.fit_transform()`, so that it can be used later to train the model
y_train_final = y_encoder.fit_transform(y_train.values.reshape(-1, 1)).flatten()

y_train_final
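It can help to check which class ends up as 1 after encoding. The sketch below is a small illustration on a hand-made array, assuming the label strings are `verified` and `not verified`; it is not run against the project data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hand-made labels (assumed label strings), reshaped to the 2D input the encoder expects
labels = np.array(["verified", "not verified", "not verified", "verified"]).reshape(-1, 1)

enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(labels).toarray().flatten()

# Categories are sorted alphabetically; the first one is dropped and becomes 0
categories = list(enc.categories_[0])
print(categories)
print(encoded)
```

Under these assumptions `not verified` is the dropped category, so it is encoded as 0 and `verified` as 1.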

**Model building**

In [None]:
# Build a logistic regression model and fit it to the training dataset.
log_clf = LogisticRegression(random_state=0, max_iter=800).fit(X_train_final,y_train_final)

**Results and evaluation**

In [None]:
# Select the testing features that need to be encoded
X_test_to_encode = X_test[["claim_status", "author_ban_status"]]

X_test_to_encode.head()

In [None]:
# Transform the testing features using the encoder
X_test_encoded = X_encoder.transform(X_test_to_encode)

X_test_encoded

In [None]:
# Place encoded testing features (which is currently an array) into a dataframe
X_test_encoded_df = pd.DataFrame(data=X_test_encoded, columns=X_encoder.get_feature_names_out())

X_test_encoded_df.head()

In [None]:
# Display first few rows of `X_test` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
X_test.drop(columns=["claim_status", "author_ban_status"]).head()

In [None]:
# Concatenate `X_test` and `X_test_encoded_df` to form the final dataframe for testing data (`X_test_final`)
# `.reset_index(drop=True)` resets the index in `X_test` after dropping `claim_status` and `author_ban_status`,
# so that the indices align with those in `X_test_encoded_df`
X_test_final = pd.concat([X_test.drop(columns=["claim_status", "author_ban_status"]).reset_index(drop=True), X_test_encoded_df], axis=1)

X_test_final.head()

Evaluate the logistic regression model by using it to generate predictions for the encoded testing set.

In [None]:
# Get predictions on the encoded testing set
y_pred= log_clf.predict(X_test_final)

Show the predictions for the encoded testing set.

In [None]:
# Display predictions
y_pred

Show the actual labels for the testing set.

In [None]:
# Display the true labels of the testing set
y_test

Encode the true labels of the testing set into a format that allows for comparison with the predictions.

In [None]:
# Encode the testing outcome variable

#   Adjust the shape of `y_test` before passing it into `.transform()`, since it takes in a 2D array
#   `flatten()` flattens the array returned by `.transform()`, so that it can be used later to compare with predictions

y_test_final = y_encoder.transform(y_test.values.reshape(-1, 1)).flatten()

y_test_final

Let's verify once more that the dimensions of the training and testing sets match, considering that additional features have been included.

In [None]:
# Get shape of each training and testing set
X_train_final.shape, y_train_final.shape, X_test_final.shape, y_test_final.shape

**Visualize the outcomes of the model**

In [None]:
# Calculate values for confusion matrix
log_cm = confusion_matrix(y_test_final, y_pred, labels=log_clf.classes_)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=log_clf.classes_)

# Plot confusion matrix
log_disp.plot(cmap='cividis');

Generate a classification report that encompasses precision, recall, F1-score, and accuracy metrics to assess the performance of the logistic regression model.

**Classification report**

In [None]:
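# Define the target labels in the order of the encoded classes
# (0 = not verified, 1 = verified, because `drop='first'` drops the alphabetically first category)
target_labels = ['not verified', 'verified']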

# Generate the classification report as a dictionary
report_dict = classification_report(y_test_final, y_pred, target_names=target_labels, output_dict=True)

# Convert the dictionary into a pandas DataFrame
report_df = pd.DataFrame(report_dict).transpose()

# Print the DataFrame
print(report_df)

Display the feature names and corresponding model coefficients from the trained logistic regression model.

In [None]:
# Create a DataFrame with feature names and model coefficients
# log_clf is an instance of a logistic regression classifier that has been trained on a dataset.
pd.DataFrame(data={'Feature Name':log_clf.feature_names_in_,'Model Coefficients':log_clf.coef_[0]})

# **Machine Learning for Video Classification**

## **PACE: Construct**

### **Feature engineering**

Extract the length of each `video_transcription_text` and add this as a column to the dataframe, so that it can be used as a potential feature in the model.

In [None]:
# Extract the length of each `video_transcription_text` and add this as a column to the dataframe
data['text_length'] = data['video_transcription_text'].str.len()

# Alternatively, use `.apply()` with a lambda function
# data['video_transcription_text'].apply(lambda x: len(x))

Calculate the average text_length for claims and opinions.

In [None]:
# Calculate the average text_length for claims and opinions
data[['text_length','claim_status']].groupby('claim_status').mean()

Visualize the distribution of `text_length` for claims and opinions.

In [None]:
# Visualize the distribution of `text_length` for claims and opinions
# Create two histograms in one plot
c_palette = {'opinion':'#E69F00', 'claim':'#56B4E9'}

sns.histplot(data=data, stat='count', hue='claim_status', multiple='dodge', x='text_length', 
              kde=False, legend=True, element='bars', palette=c_palette
             )
plt.title('Distribution of text_length for claims and opinions')
plt.xlabel('Number of characters (text_length)')
plt.show()

The character count distributions for both claims and opinions are approximately normal with a slight right skew. Claim videos generally have a higher character count, averaging about 13 more characters than opinion videos, as noted in a previous analysis.

### **Feature selection and transformation**

Encode target and categorical variables.

In [None]:
# Create a copy of the data so the original dataframe is left untouched
X = data.copy()

# Drop unnecessary columns
X = X.drop(['#', 'video_id'], axis=1)

# Encode target variable
X['claim_status'] = X['claim_status'].map({'opinion':0, 'claim':1})

# Dummy encode remaining categorical values
X = pd.get_dummies(X, columns=['verified_status', 'author_ban_status'], 
                   drop_first=True)
X.head(10)

### **Split the data**

Assign 'claim_status' as a target variable.

In [None]:
# Isolate target variable
y=X['claim_status']

Isolate the features.

In [None]:
# Isolate features
X=X.drop(['claim_status'], axis=1)

X.head(10)

### **Create training, validation, and test sets**

Split data into training and testing sets, 80/20.

In [None]:
# Split the data into training and testing sets
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Split the training set into training and validation sets, 75/25, to result in a final ratio of 60/20/20 for train/validate/test sets.

In [None]:
# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)
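The 60/20/20 ratio follows from the arithmetic 0.8 × 0.75 = 0.6 for training and 0.8 × 0.25 = 0.2 for validation. A minimal sketch on a synthetic 1,000-row array (an assumption for illustration, not the project data) confirms the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 1,000 rows
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# First split: 80% train+validation, 20% test
X_tr_demo, X_test_demo, y_tr_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Second split: 75% of the remaining 80% for training, 25% for validation
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_tr_demo, y_tr_demo, test_size=0.25, random_state=0
)

sizes = (len(X_train_demo), len(X_val_demo), len(X_test_demo))
print(sizes)
```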

Ensure that the dimensions of the training, validation, and testing sets are consistent.

In [None]:
# Get shape of each training, validation, and testing set
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape

The number of rows in each feature set should match the number of labels in its corresponding target set, and all three feature sets should have the same number of columns.

### **Tokenize text column**

The feature video_transcription_text is a text-based variable and is not categorical because it does not have a predetermined set of possible values. To convert this text into numerical features, a common method is to apply a bag-of-words approach, such as using the CountVectorizer.

CountVectorizer operates by dividing text into n-grams, which are sequences of n consecutive words. Each n-gram represents a sequence of adjacent words from the original text, capturing relationships between consecutive words.

In [None]:
# Initialize a `CountVectorizer` object, which converts a collection of text to a matrix of token counts
count_vec = CountVectorizer(
                ngram_range=(2, 3),    # Generates 2-grams and 3-grams (i.e., sequences of 2 or 3 consecutive words)
                max_features=15,       # Limits the vocabulary to the top 15 most frequent n-grams.
                stop_words='english'   # Excludes common English stop words (e.g., 'the', 'and') from the vocabulary.
                )  
# Display CountVectorizer
count_vec

Fit the vectorizer to the training data to generate the n-grams and transform the data to count their occurrences. Ensure that the vectorizer is fitted only on the training data and not on the validation or test data.

In [None]:
# Fit the vectorizer to the training data to learn the vocabulary and generate n-grams,
# then transform the training text data into a numerical format with occurrence counts.
# Convert the resulting sparse matrix to a dense array.
count_data = count_vec.fit_transform(X_train['video_transcription_text']).toarray()

count_data

In [None]:
# Place the numerical representation of `video_transcription_text` from training set into a dataframe
count_df = pd.DataFrame(data=count_data, columns=count_vec.get_feature_names_out())

# Display first few rows
count_df.head()

In [None]:
# Combine `X_train` and `count_df` to form the final dataframe for training data (`X_train_final`)
# `.reset_index(drop=True)` resets the index in `X_train` after dropping `video_transcription_text`,
# so that the indices align with those in `count_df`
X_train_final = pd.concat([X_train.drop(columns=['video_transcription_text']).reset_index(drop=True), count_df], axis=1)

# Display first few rows
X_train_final.head()

In [None]:
# Extract numerical features from `video_transcription_text` in the validation set
validation_count_data = count_vec.transform(X_val['video_transcription_text']).toarray()
validation_count_data

In [None]:
# Place the numerical representation of `video_transcription_text` from validation set into a dataframe
validation_count_df = pd.DataFrame(data=validation_count_data, columns=count_vec.get_feature_names_out())
validation_count_df.head()

In [None]:
# Combine `X_val` and `validation_count_df` to create the final dataframe for validation data (`X_val_final`)
# `.reset_index(drop=True)` ensures that the index is reset in `X_val` after dropping the `video_transcription_text` column,
# aligning the indices with those in `validation_count_df`.
X_val_final = pd.concat([X_val.drop(columns=['video_transcription_text']).reset_index(drop=True), validation_count_df], axis=1)

# Display first few rows
X_val_final.head()

Apply the same vectorizer to the test data to obtain n-gram counts. Ensure that the vectorizer is only used for transformation and not refitted.

In [None]:
# Convert `video_transcription_text` from the test set into numerical features using the fitted vectorizer
test_count_data = count_vec.transform(X_test['video_transcription_text']).toarray()

# Create a DataFrame from the numerical features, using the feature names provided by the vectorizer
test_count_df = pd.DataFrame(data=test_count_data, columns=count_vec.get_feature_names_out())

# Combine the transformed numerical features with the remaining columns from `X_test` to create the final testing dataset
X_test_final = pd.concat([X_test.drop(columns=['video_transcription_text']
                                      ).reset_index(drop=True), test_count_df], axis=1)
X_test_final.head()

### **Random Forest model**

Fit a random forest model to the training set. Use cross-validation to tune the hyperparameters and select the model that performs best on recall.

In [None]:
# Instantiate the random forest classifier with a fixed random seed for reproducibility
rf = RandomForestClassifier(random_state=0)

# Define a dictionary with hyperparameters to tune for cross-validation
cv_params = {
    'max_depth': [5, 7, None],              # Options for the maximum depth of the tree
    'max_features': [0.3, 0.6],             # Proportion of features to consider for splits
    # 'max_features': 'auto'                # Alternative option for feature selection
    'max_samples': [0.7],                   # Proportion of samples to use for fitting each base estimator
    'min_samples_leaf': [1, 2],             # Minimum number of samples required to be at a leaf node
    'min_samples_split': [2, 3],            # Minimum number of samples required to split an internal node
    'n_estimators': [75, 100, 200],         # Number of base estimators in the ensemble
}

# Define the scoring metrics to capture (a list, which `GridSearchCV` accepts for multi-metric scoring)
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Instantiate the GridSearchCV object
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='recall')

In [None]:
%%time
# Fit the model
rf_cv.fit(X_train_final, y_train)

In [None]:
# Output the best recall score achieved during cross-validation
rf_cv.best_score_

In [None]:
# Output the best hyperparameters found during cross-validation
rf_cv.best_params_

This model demonstrates exceptional performance, achieving an average recall score of 0.995 across five cross-validation folds. Upon verifying the precision score to ensure the model is not classifying all samples as claims, it is evident that the model is making nearly perfect classifications.
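One way to perform this precision check is to read the other metrics from `cv_results_` at the index of the best model. The sketch below is a self-contained illustration on synthetic data with a deliberately tiny grid; the data and hyperparameter values are assumptions and do not reproduce the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for the project data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [2, None], "n_estimators": [10]},
    scoring=["precision", "recall"],
    cv=3,
    refit="recall",  # the best model is chosen by mean recall
)
search.fit(X_demo, y_demo)

# With multi-metric scoring, every metric is stored as `mean_test_<name>`
best = search.best_index_
mean_recall = search.cv_results_["mean_test_recall"][best]
mean_precision = search.cv_results_["mean_test_precision"][best]
print(f"recall={mean_recall:.3f}, precision={mean_precision:.3f}")
```

A model that labels everything as the positive class would show high recall but noticeably lower precision, so looking at both values together guards against that failure mode.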

### **XGBoost model**

In [None]:
# Instantiate the XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=0)

# Create a dictionary of hyperparameters to tune
cv_params = {'max_depth': [4,8,12],
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]
             }

# Define the scoring metrics to capture (a list, which `GridSearchCV` accepts for multi-metric scoring)
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Instantiate the GridSearchCV object
xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=5, refit='recall')

In [None]:
%%time
xgb_cv.fit(X_train_final, y_train)

In [None]:
# Examine best recall score
xgb_cv.best_score_

In [None]:
xgb_cv.best_params_

This model delivers exceptional performance. Despite its recall score being slightly lower than the random forest model's, its precision score is flawless.

## **PACE: Execute**

### **Model evaluation**

Evaluate the models on the validation set.

#### **Random forest model**

In [None]:
# Use the random forest "best estimator" model to get predictions on the encoded validation set
y_pred = rf_cv.best_estimator_.predict(X_val_final)

Display the predictions on the encoded validation set.

In [None]:
# Display the predictions on the encoded validation set
y_pred

Display the true labels of the validation set.

In [None]:
# Display the true labels of the validation set
y_val

Create a confusion matrix to visualize the results of the classification model.

In [None]:
# Compute values for confusion matrix
log_cm = confusion_matrix(y_val, y_pred)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=None)

# Plot confusion matrix
log_disp.plot(cmap='cividis')

plt.title('Random Forest - validation set');

In the confusion matrix:

- The upper-left quadrant represents the count of true negatives, indicating the number of opinions correctly classified as opinions by the model.
- The upper-right quadrant shows the number of false positives, which are opinions incorrectly classified as claims.
- The lower-left quadrant reflects the number of false negatives, representing claims incorrectly classified as opinions.
- The lower-right quadrant displays the count of true positives, representing claims correctly classified as claims.

An ideal model would result in only true negatives and true positives, with no false negatives or false positives.
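The four quadrants map directly onto precision and recall. As a small worked example on hand-made labels (0 = opinion, 1 = claim; the values are invented for illustration), the counts can be unpacked from the matrix and the metrics computed by hand:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made true labels and predictions (0 = opinion, 1 = claim)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the quadrants in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted claims that are real claims
recall = tp / (tp + fn)     # share of real claims that the model caught

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```

Because missing a claim is the costlier error in this project, the false negative count (lower-left quadrant) and the recall derived from it are the figures to watch most closely.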

As illustrated by the confusion matrix, this model performs nearly perfectly and produces no false negatives.

Create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the model.

In [None]:
# Create classification report for random forest model
target_labels =['opinion', 'claim']
print(classification_report(y_val, y_pred, target_names=target_labels))

The classification report above demonstrates that the random forest model achieved nearly perfect scores. According to the confusion matrix, there were 4 misclassifications in total, all of them false positives.

#### **XGBoost model**

In [None]:
# Use the XGBoost "best estimator" model to get predictions on the encoded validation set
y_pred = xgb_cv.best_estimator_.predict(X_val_final)

In [None]:
y_pred

In [None]:
# Compute values for confusion matrix
log_cm = confusion_matrix(y_val, y_pred)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=None)

# Plot confusion matrix
log_disp.plot()

# Display plot
plt.title('XGBoost - validation set');
plt.show()

In [None]:
# Create classification report for XGBoost model
target_labels =['opinion', 'claim']
print(classification_report(y_val, y_pred, target_names=target_labels))

The results from the XGBoost model were almost flawless, but it produced more false negatives than the random forest model. Since identifying claims is crucial, it’s essential for the model to capture as many actual claim videos as possible. The random forest model, with its superior recall score, is therefore considered the best-performing model.

### **Use champion model to predict on test data**

Both the Random Forest and XGBoost model architectures yielded near-perfect results. However, in this instance, the Random Forest model performed slightly better, making it the preferred model.

In [None]:
# Use champion model to predict on test data
y_pred = rf_cv.best_estimator_.predict(X_test_final)

In [None]:
# Compute values for confusion matrix
log_cm = confusion_matrix(y_test, y_pred)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=None)

# Plot confusion matrix
log_disp.plot(cmap='cividis')

# Display plot
plt.title('Random forest - test set');
plt.show()

#### **Feature importances of champion model**


In [None]:
# Extract the feature importances from the best estimator of the random forest cross-validation model
importances = rf_cv.best_estimator_.feature_importances_

# Create a Pandas Series to hold the feature importances, using the column names from the test dataset as the index
rf_importances = pd.Series(importances, index=X_test_final.columns)

# Select the 10 largest importances, then re-sort ascending so the largest bar is plotted at the top
top_10_rf_importances = rf_importances.sort_values(ascending=False).head(10).sort_values(ascending=True)

# Colorblind-friendly color
colorblind_color = "#E69F00"

# Create the horizontal bar plot
fig, ax = plt.subplots()
top_10_rf_importances.plot.barh(ax=ax, color=colorblind_color)  # Create a horizontal bar plot

# Set the plot title and labels
ax.set_title('Top 10 Feature Importances of champion model', fontsize=12)
ax.set_xlabel('Mean Decrease in Impurity')
ax.set_ylabel('Feature')

# Set y-ticks font size
ax.tick_params(axis='y', labelsize=9)
ax.tick_params(axis='x', labelsize=9)

# Adjust layout for better fit
fig.tight_layout()

# Display the plot
plt.show()

The most predictive features were all related to the engagement levels generated by the video. This aligns with prior exploratory data analysis, which indicated a strong relationship between engagement metrics and claim status.

### **Final Project Conclusion**

**Recommendation on Model Usage:**

We recommend using the current model due to its exceptional performance across both validation and test datasets. It consistently achieved high precision and F1 scores, reflecting its strong capability in accurately classifying both claims and opinions.

**Model Functionality and Prediction Mechanism:**

The model's effectiveness stems from its reliance on user engagement metrics. It primarily classifies videos based on features such as the number of views, likes, shares, and downloads. These engagement indicators were found to be the most predictive features for determining the video classifications.

**Future Feature Engineering:**

Given the model's near-perfect performance, there is no immediate need for additional feature engineering. However, for potential future improvements, incorporating features such as the number of times a video was reported and the total number of user reports across all videos by each author could be beneficial. These additional metrics might provide further insights and enhance the model’s predictive capabilities.
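If report data became available, the per-author total could be derived with a grouped transform. The sketch below is purely hypothetical: the columns `author_id` and `report_count` do not exist in the current dataset and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical per-video report data (columns and values are assumptions)
reports = pd.DataFrame({
    "author_id": ["a", "a", "b", "c", "c", "c"],
    "report_count": [0, 3, 1, 2, 0, 5],
})

# Total reports across all videos by the same author, repeated on every row of that author
reports["author_total_reports"] = (
    reports.groupby("author_id")["report_count"].transform("sum")
)
print(reports)
```

Using `transform` keeps one row per video, so the new column can be added alongside the existing engagement features without changing the shape of the dataframe.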

It is important to acknowledge that there are instances when the data we have may not effectively predict the target variable. This is a natural occurrence in data science. While machine learning is a robust and valuable tool, it is not infallible. If the data lacks a meaningful predictive signal, even the most advanced algorithms will struggle to produce reliable and accurate predictions. Embracing this reality and drawing appropriate conclusions is an essential aspect of our analytical process.