# Midterm Project for Machine Learning Zoomcamp 2023

# TikTok User Engagement Data

# Table of Contents
- [Problem description](#problem_descriptiont)
    - [About the dataset](#about_dataset)
- [Exploratory Data Analysis (EDA)](#eda)
    - [Balancing categorical features](#balancing_cat_features)
    - [Outliers in numerical features](#outliers_num_features)
    - [Feature importance analysis](#fia)
        - [For numerical](#num)
        - [For categorical](#cat)
        - [For text](#text)
- [Model training](#model_training)
    - [One hot encoding](#one_hot_encoding)
    - [Logistic Regression](#log_regression)
    - [Decision Tree](#decision_tree)
    - [Random Forest Model](#random_forest)
- [Chosen model](#model)

# Problem description
<a id='problem_descriptiont'></a>

Dataset source: https://www.kaggle.com/datasets/yakhyojon/tiktok/data

## About the dataset
<a id='about_dataset'></a>

TikTok is the leading destination for short-form mobile video. The platform is built to help imaginations thrive. TikTok's mission is to create a place for inclusive, joyful, and authentic content–where people can safely discover, create, and connect.
		

|  Column name  |             Type             |  Description  |
|:--------:|:-----------------------------------:|:-----------------------------------:|
|    **#**   |  int  |   TikTok-assigned number for a video with a claim or opinion.    |
|    **claim_status**   |  obj |    Whether the published video has been identified as an “opinion” or a “claim.” In this dataset, an “opinion” refers to an individual’s or group’s personal belief or thought. A “claim” refers to information that is either unsourced or from an unverified source.  |
|    **video_id**   |  int |    Random identifying number assigned to video upon publication on TikTok.  |
|   **video_duration_sec**   | int | Duration of the published video, in seconds. |
| **video_transcription_text** |    obj  | Transcribed text of the words spoken in the published video. |
| **verified_status** |  obj | Indicates the status of the TikTok user who published the video in terms of their verification, either “verified” or “not verified.”| 
|  **author_ban_status**  | obj  |Indicates the status of the TikTok user who published the video in terms of their permissions: “active,” “under scrutiny,” or “banned.” | 
| **video_view_count**  |  float | The total number of times the published video has been viewed. |
| **video_like_count**  |  float | The total number of times the published video has been liked by other users. |
| **video_share_count**  |  float | The total number of times the published video has been shared by other users. |
| **video_download_count**  |  float | The total number of times the published video has been downloaded by other users. |
| **video_comment_count**  |  float |  The total number of comments on the published video. |


# Exploratory Data Analysis (EDA)
<a id='eda'></a>

In [146]:
import pandas as pd
import numpy as np
import plotly.express as px

from sklearn.metrics import mutual_info_score

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /Users/olga/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/olga/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [156]:
df = pd.read_csv('tiktok_dataset.csv')

In [157]:
df.head()

|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [159]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [160]:
df.isnull().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [162]:
df = df.dropna(axis=0)

In [163]:
df = df.drop_duplicates()

In [164]:
df.nunique()

#                           19084
claim_status                    2
video_id                    19084
video_duration_sec             56
video_transcription_text    19012
verified_status                 2
author_ban_status               3
video_view_count            15632
video_like_count            12224
video_share_count            9231
video_download_count         4336
video_comment_count          2424
dtype: int64

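Drop the features 'video_id' and '#'. Both are unique identifiers (19084 distinct values for 19084 rows), so they carry no information about the target.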
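A minimal sketch of this step, using a small made-up frame in place of the real data. The `toy` frame and the `id_columns` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the TikTok data (a few rows only).
toy = pd.DataFrame({
    "#": [1, 2, 3],
    "video_id": [7017666017, 4014381136, 9859838091],
    "claim_status": ["claim", "claim", "opinion"],
    "video_duration_sec": [59, 32, 31],
})

# Identifier columns are unique per row and are dropped before modeling.
id_columns = ["#", "video_id"]
toy = toy.drop(columns=id_columns)

print(toy.columns.tolist())
```

On the real data the same call would be applied to `df`.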

## Balancing categorical features
<a id='balancing_cat_features'></a>

The categorical features 'claim_status', 'verified_status', and 'author_ban_status' must be converted to numerical values before modeling.
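One possible way to do this conversion is sketched below. It assumes a 0/1 mapping for the two binary features and pandas' `get_dummies` for the three-level 'author_ban_status'; the `toy` frame is made up, and the notebook's own encoding step may use a different tool.

```python
import pandas as pd

# Toy frame with the three categorical columns (values are illustrative).
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", "claim", "opinion"],
    "verified_status": ["not verified", "verified", "not verified", "not verified"],
    "author_ban_status": ["under review", "active", "active", "banned"],
})

# Binary features map directly to 0/1.
toy["claim_status"] = (toy["claim_status"] == "claim").astype(int)
toy["verified_status"] = (toy["verified_status"] == "verified").astype(int)

# A feature with three levels gets one indicator column per level.
encoded = pd.get_dummies(toy, columns=["author_ban_status"], dtype=int)

print(encoded)
```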

In [203]:
px.histogram(df, x='claim_status').show()

In [188]:
px.histogram(df, x='verified_status').show()

In [187]:
px.histogram(df, x='author_ban_status').show()

The feature 'claim_status' is balanced, while 'verified_status' and 'author_ban_status' are imbalanced.

We chose 'claim_status' as the target. Since its classes are balanced, no resampling of the target is needed.
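The balance check read off the histograms can also be done numerically. The sketch below is an illustration on made-up data; the helper names and the `tolerance` of 0.1 are assumptions chosen for the example, not values from the notebook.

```python
import pandas as pd

def class_shares(series: pd.Series) -> pd.Series:
    """Return the fraction of rows in each class, largest first."""
    return series.value_counts(normalize=True)

def is_balanced(series: pd.Series, tolerance: float = 0.1) -> bool:
    """Treat a feature as balanced if the class shares differ by at most `tolerance`."""
    shares = class_shares(series)
    return bool(shares.max() - shares.min() <= tolerance)

# Toy data: an evenly split target and a heavily skewed feature.
toy = pd.DataFrame({
    "claim_status": ["claim"] * 5 + ["opinion"] * 5,
    "verified_status": ["not verified"] * 9 + ["verified"],
})

for column in toy.columns:
    print(column, class_shares(toy[column]).to_dict(), is_balanced(toy[column]))
```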

## Outliers in numerical features
<a id='outliers_num_features'></a>

Value ranges of the numerical features, split by 'claim_status':

In [191]:
data = pd.melt(df, id_vars='claim_status', value_vars=['video_duration_sec',
                                                       'video_view_count', 
                                                       'video_like_count', 
                                                       'video_share_count',
                                                       'video_download_count', 
                                                       'video_comment_count'])
px.box(data, x='claim_status', y='value', color='variable').show()

The features 'video_like_count', 'video_share_count', 'video_download_count', and 'video_comment_count' contain outliers.

Outliers in quantitative variables can significantly affect regression models. In linear and logistic regression they can bias the coefficients, which leads to incorrect conclusions and predictions. Decision trees split on thresholds and are usually less sensitive to extreme feature values, but outliers can still trigger extra splits that make the tree deeper and more prone to overfitting.
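To put a number on what the box plot shows, outliers can be counted with the common 1.5 × IQR rule. The sketch below uses made-up values; the function name, the factor `k=1.5`, and the `log1p` transform at the end are assumptions for illustration, not steps taken in the notebook.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy count feature with one extreme value.
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0],
                name="video_comment_count")

mask = iqr_outlier_mask(toy)
n_outliers = int(mask.sum())
print(f"{toy.name}: {n_outliers} outlier(s)")

# One possible mitigation for skewed counts: compress the scale with log(1 + x).
log_scaled = np.log1p(toy)
```

Whether to transform, clip, or keep such values depends on the model. The sketch only shows how to locate them.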

## Feature importance analysis
<a id='fia'></a>

### For numerical
<a id='num'></a>

In [195]:
numerical_columns = ['video_duration_sec',
                     'video_view_count', 
                     'video_like_count', 
                     'video_share_count',
                     'video_download_count', 
                     'video_comment_count']

In [199]:
px.imshow(df[numerical_columns].corr(), text_auto=True, aspect="auto", color_continuous_scale='Blues')

Pearson correlation is defined only for pairs of numeric variables, so the correlation matrix covers the numerical features only. The categorical features are assessed separately with mutual information.

For logistic regression, multicollinearity can be a problem because it can lead to unstable estimates of regression coefficients and reduced predictive accuracy.

Multicollinearity is much less of a problem for decision trees, because they split on one feature at a time rather than fitting a linear combination of the features.

The feature 'video_like_count' is correlated with most of the other numerical features, so we remove it.
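The same conclusion can be read from the correlation matrix programmatically. The sketch below lists feature pairs whose absolute correlation exceeds a cutoff; the function name, the `threshold` of 0.8, and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: likes track views closely, duration does not.
toy = pd.DataFrame({
    "video_view_count": [100, 200, 300, 400, 500],
    "video_like_count": [10, 21, 29, 41, 50],
    "video_duration_sec": [30, 5, 60, 12, 45],
})

pairs = correlated_pairs(toy)
print(pairs)

# Dropping one feature of a highly correlated pair reduces multicollinearity.
reduced = toy.drop(columns=["video_like_count"])
```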

### For categorical
<a id='cat'></a>

In [170]:
categorical_columns = ['verified_status', 'author_ban_status']

In [171]:
def mutual_info_success_score(series):
    return mutual_info_score(series, df.claim_status)

In [172]:
mutual_info = df[categorical_columns].apply(mutual_info_success_score)
mutual_info.sort_values(ascending=False)

author_ban_status    0.054046
verified_status      0.015708
dtype: float64

The feature 'author_ban_status' has the highest mutual information with the target 'claim_status'.
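To make these scores easier to interpret, the sketch below computes mutual information by hand from the joint distribution of two categorical series, using the natural logarithm as `mutual_info_score` does. The function name and the toy series are assumptions for illustration. A score of 0 means the feature says nothing about the target, and the upper bound for a balanced binary target is ln 2 ≈ 0.693.

```python
import numpy as np
import pandas as pd

def mutual_information(x: pd.Series, y: pd.Series) -> float:
    """Mutual information (in nats) between two categorical series."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()  # p(x, y)
    p_x = joint.sum(axis=1, keepdims=True)                # p(x), column vector
    p_y = joint.sum(axis=0, keepdims=True)                # p(y), row vector
    expected = p_x @ p_y                                  # p(x) * p(y)
    nonzero = joint > 0                                   # skip empty cells
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / expected[nonzero])))

target = pd.Series(["claim", "claim", "opinion", "opinion"])

# A feature identical to the target carries all of its information.
mi_same = mutual_information(target, target)

# A feature independent of the target carries none.
independent = pd.Series(["active", "banned", "active", "banned"])
mi_indep = mutual_information(independent, target)

print(mi_same, mi_indep)
```

Against that scale, the values reported above (about 0.054 and 0.016) indicate a weak dependence between each categorical feature and the target.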

### For text
<a id='text'></a>

In [173]:
all_text = ' '.join(df['video_transcription_text'])
all_text = all_text.lower()

In [174]:
words = word_tokenize(all_text)

In [175]:
stop_words = set(stopwords.words('english'))

In [176]:
filtered_words = [word for word in words if word not in stop_words and word.isalpha()]

In [177]:
word_counts = Counter(filtered_words)
most_common_words = word_counts.most_common(20)

In [178]:
most_common_words

[('claim', 3501),
 ('read', 3302),
 ('learned', 2950),
 ('someone', 2866),
 ('media', 2492),
 ('friend', 2490),
 ('colleague', 2419),
 ('discovered', 2264),
 ('friends', 2207),
 ('world', 2190),
 ('colleagues', 1958),
 ('family', 1936),
 ('news', 1551),
 ('earth', 1237),
 ('willing', 1136),
 ('internet', 1071),
 ('view', 992),
 ('first', 920),
 ('around', 771),
 ('online', 743)]

In [179]:
words, frequencies = zip(*most_common_words)


In [180]:
px.bar(x=words, y=frequencies)

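# Model training
<a id='model_training'></a>

## One hot encoding
<a id='one_hot_encoding'></a>

## Logistic Regression
<a id='log_regression'></a>

## Decision Tree
<a id='decision_tree'></a>

# Chosen model
<a id='model'></a>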