# Sentiment Analysis on TikTok Video Comments

## Introduction
This notebook walks through a sentiment analysis of text associated with TikTok videos using Python. Sentiment is scored on the `video_transcription_text` column, and the goal is to identify trends and patterns in sentiment across different video characteristics.

## Setup Environment

Ensure Python and the necessary libraries are installed. We will be using the following libraries:
- `pandas` for data manipulation
- `matplotlib` and `seaborn` for visualization
- `VADER` from `nltk` for sentiment analysis
- `scikit-learn` for building the predictive model

In [None]:
%pip install pandas nltk matplotlib seaborn scikit-learn

## Load and Prepare Your Dataset

Import the necessary libraries and load the dataset using pandas. We focus on columns relevant to our sentiment analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

# Load the dataset
df = pd.read_csv("tiktok_dataset.csv")

# Display the first few rows of the dataframe
print(df.head())
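
Before exploring, it may help to confirm that the file contains the columns the later cells rely on. The following is a minimal sketch, not part of the original workflow: the helper name `missing_columns` is hypothetical, the expected column list is collected from the cells in this notebook, and the demo frame uses made-up values.

```python
import pandas as pd

# Columns referenced by later cells in this notebook.
EXPECTED_COLUMNS = [
    'video_duration_sec', 'video_view_count', 'video_like_count',
    'video_share_count', 'video_download_count', 'video_comment_count',
    'verified_status', 'author_ban_status', 'video_transcription_text',
]

def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame` (hypothetical helper)."""
    return [col for col in expected if col not in frame.columns]

# Made-up stand-in frame for illustration; in the notebook, pass `df` instead.
demo = pd.DataFrame({'video_view_count': [10], 'verified_status': ['verified']})
print(missing_columns(demo))
```

An empty list means every later cell should find the columns it needs.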

## Data Exploration


### Histograms for Numeric Features

Visualize the distribution of numeric variables such as video duration, view count, like count, etc. These insights can help identify any data cleaning needs.

In [None]:
# List of numeric columns
numeric_columns = ['video_duration_sec', 'video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']

# Plotting histograms
fig, ax = plt.subplots(len(numeric_columns), 1, figsize=(8, 20))
for i, col in enumerate(numeric_columns):
    ax[i].hist(df[col].dropna(), bins=30, color='skyblue', edgecolor='black')
    ax[i].set_title(f'Distribution of {col}', fontsize=12)
    ax[i].set_xlabel(col)
    ax[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
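
If the count histograms turn out to be heavily right-skewed, which is common for engagement counts, a `log1p` transform can make the shape easier to read. The sketch below uses made-up numbers rather than the dataset and compares skewness before and after the transform. Applying it to columns such as `video_view_count` is an assumption about what the histograms will show.

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed counts standing in for a column like video_view_count.
demo_views = pd.Series([1, 10, 100, 1_000, 10_000, 100_000])

raw_skew = demo_views.skew()            # sample skewness of the raw counts
log_skew = np.log1p(demo_views).skew()  # skewness after log(1 + x), which is safe for zeros

print(f'skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}')
```

A skewness near zero after the transform suggests the histogram is better plotted on the transformed values.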

### Box Plots to Identify Outliers

Draw a box plot for each numeric variable to reveal potential outliers and inform further data cleaning.

In [None]:
# Plotting box plots
fig, ax = plt.subplots(len(numeric_columns), 1, figsize=(8, 20))
for i, col in enumerate(numeric_columns):
    sns.boxplot(x=df[col], ax=ax[i])
    ax[i].set_title(f'Box Plot of {col}', fontsize=12)
    ax[i].set_xlabel(col)

plt.tight_layout()
plt.show()
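
To put numbers on what the box plots show, the usual whisker rule can be applied directly: a value counts as an outlier when it lies more than 1.5 times the interquartile range beyond the first or third quartile. This is a minimal sketch on made-up data. The helper name `iqr_outlier_mask` is hypothetical, and quartile interpolation may differ slightly from the plotting library's.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up values; in the notebook, pass a column such as df['video_like_count'].
demo = pd.Series([10, 12, 11, 13, 12, 500])
mask = iqr_outlier_mask(demo)
print(int(mask.sum()), 'outlier(s):', demo[mask].tolist())
```

Looping this over `numeric_columns` would give a per-column outlier count to set beside the plots.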

### Count Plots for Categorical Variables

Visualize the distribution of categorical variables such as `verified_status` and `author_ban_status`.

In [None]:
# Plotting count plots for categorical columns
categorical_columns = ['verified_status', 'author_ban_status']

fig, ax = plt.subplots(len(categorical_columns), 1, figsize=(8, 10))
for i, col in enumerate(categorical_columns):
    sns.countplot(x=df[col], ax=ax[i], palette='viridis', hue=df[col])
    ax[i].set_title(f'Count of {col}', fontsize=12)
    ax[i].set_xlabel(col)
    ax[i].set_ylabel('Count')

plt.tight_layout()
plt.show()


## Data Cleaning

Ensure data types are appropriate, handle missing values, and remove duplicates.

In [None]:
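# Fill missing values: 0 for numeric columns, 'Unknown' for categorical ones.
# Note that filling counts with 0 shifts their averages downward.
df[numeric_columns] = df[numeric_columns].fillna(0)
df[categorical_columns] = df[categorical_columns].fillna('Unknown')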

# Convert numerical fields to integers
for col in numeric_columns:
    df[col] = df[col].astype(int)

# Remove duplicate entries
df.drop_duplicates(inplace=True)

# Replace missing transcriptions with a placeholder and make sure every entry is a string
df['video_transcription_text'] = df['video_transcription_text'].fillna('No transcription available').astype(str)

# Display cleaned data stats
print(df.describe())
print(df.isnull().sum())
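
It can be useful to know how much the cleaning step changes the data. The following sketch summarizes row count, duplicate rows, and missing values without modifying anything. Running it before and after the cell above shows what was altered. The helper name `cleaning_report` and the demo values are assumptions for illustration.

```python
import pandas as pd

def cleaning_report(frame: pd.DataFrame) -> dict:
    """Summarize what a cleaning pass would touch, without modifying `frame` (hypothetical helper)."""
    return {
        'rows': len(frame),
        'duplicate_rows': int(frame.duplicated().sum()),
        'missing_by_column': frame.isna().sum().to_dict(),
    }

# Made-up frame: one exact duplicate row and one row of missing values.
demo = pd.DataFrame({
    'video_view_count': [100.0, 100.0, None],
    'verified_status': ['verified', 'verified', None],
})
report = cleaning_report(demo)
print(report)
```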

## Perform Sentiment Analysis

Set up and apply the VADER sentiment analysis tool from nltk to the transcription text of the videos.

In [3]:
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# Initialize VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Defensive repeat of the cleaning step so this cell also works when run on its own
df['video_transcription_text'] = df['video_transcription_text'].fillna('No transcription available').astype(str)

# Applying sentiment analysis
df['sentiment_score'] = df['video_transcription_text'].apply(lambda x: sia.polarity_scores(x)['compound'])
df['sentiment_category'] = df['sentiment_score'].apply(lambda x: 'positive' if x > 0.05 else 'negative' if x < -0.05 else 'neutral')

# Preview the text alongside its compound score and category
print(df[['video_transcription_text', 'sentiment_score', 'sentiment_category']].head())

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/estifanos/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


                            video_transcription_text  sentiment_score  \
0  someone shared with me that drone deliveries a...           0.3400   
1  someone shared with me that there are more mic...           0.3818   
2  someone shared with me that american industria...           0.6369   
3  someone shared with me that the metro of st. p...           0.3400   
4  someone shared with me that the number of busi...           0.4019   

  sentiment_category  
0           positive  
1           positive  
2           positive  
3           positive  
4           positive  
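
The categorization rule in the cell above is easier to read and test as a named function. This sketch reproduces the same cut-offs used there: strictly greater than 0.05 is positive, strictly less than -0.05 is negative, and everything else is neutral. The function name `categorize_compound` is hypothetical.

```python
def categorize_compound(score: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a label using the notebook's cut-offs."""
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'

# Boundary values fall into 'neutral' because the comparisons are strict.
for score in (0.34, -0.6, 0.05, 0.0):
    print(score, categorize_compound(score))
```

It could replace the lambda via `df['sentiment_score'].apply(categorize_compound)`.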


## Predictive Modeling: Sentiment Category Prediction

Here we train a Random Forest classifier to predict the sentiment category from each video's engagement metrics (view, like, share, download, and comment counts).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Selecting features and target variable
features = df[['video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']]
target = df['sentiment_category']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Initializing the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Training the model
rf_classifier.fit(X_train, y_train)

# Making predictions
y_pred = rf_classifier.predict(X_test)

# Evaluating the model
print("Accuracy of the model: ", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
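
An accuracy figure is hard to interpret without a baseline, especially if one sentiment category dominates. As a rough sketch, the accuracy of always predicting the most common training label can be computed as below. The helper name `majority_baseline_accuracy` is hypothetical and the labels are made up. In the notebook it would be called with `y_train` and `y_test`.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test) -> float:
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)

# Made-up labels for illustration only.
demo_train = ['positive'] * 6 + ['neutral'] * 3 + ['negative']
demo_test = ['positive', 'positive', 'neutral', 'positive', 'negative']
print(majority_baseline_accuracy(demo_train, demo_test))
```

A classifier that does not clearly beat this number is learning little from the engagement metrics.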

## Visualization of Sentiment Distribution

### Sentiment Distribution per Verification Status

Visualize how sentiment distribution varies between verified and unverified videos.

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='sentiment_category', hue='verified_status', data=df)
plt.title('Sentiment Distribution by Verification Status')
plt.xlabel('Sentiment Category')
plt.ylabel('Count')
plt.legend(title='Verified Status')
plt.show()
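
Raw counts can mislead when one verification group is much larger than the other, so row-normalized shares may be a useful companion to the plot. A minimal sketch with made-up rows, using `pd.crosstab` with `normalize='index'` so that each verification status sums to 1:

```python
import pandas as pd

# Made-up rows; in the notebook, use df['verified_status'] and df['sentiment_category'].
demo = pd.DataFrame({
    'verified_status': ['verified', 'not verified', 'not verified', 'not verified', 'verified'],
    'sentiment_category': ['positive', 'positive', 'neutral', 'negative', 'positive'],
})

# Each row holds the share of sentiment categories within one verification status.
shares = pd.crosstab(demo['verified_status'], demo['sentiment_category'], normalize='index')
print(shares.round(2))
```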

### Average Video Metrics by Sentiment Category

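Compare the average view, like, share, and download counts across sentiment categories. Each bar shows the group mean, which is seaborn's default estimator for `barplot`.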

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Metrics plots
metrics = ['video_view_count', 'video_like_count', 'video_share_count', 'video_download_count']
titles = ['Average Views', 'Average Likes', 'Average Shares', 'Average Downloads']
for ax, metric, title in zip(axes.flatten(), metrics, titles):
    # barplot aggregates each sentiment category with the mean by default
    sns.barplot(ax=ax, x='sentiment_category', y=metric, data=df)
    ax.set_title(title + ' by Sentiment')
    ax.set_xlabel('Sentiment Category')
    ax.set_ylabel(title)

plt.tight_layout()
plt.show()
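
The same comparison can be read as a table, which also makes it easy to put the median next to the mean in case a few viral videos dominate the averages. This is a sketch on made-up numbers. In the notebook, the `metrics` list and `df` would take the place of the demo names.

```python
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({
    'sentiment_category': ['positive', 'positive', 'neutral', 'neutral'],
    'video_view_count': [100, 300, 50, 150],
    'video_like_count': [10, 30, 5, 15],
})
demo_metrics = ['video_view_count', 'video_like_count']

# One row per sentiment category; columns are (metric, statistic) pairs.
summary = demo.groupby('sentiment_category')[demo_metrics].agg(['mean', 'median', 'count'])
print(summary)
```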