# DATA UNDERSTANDING

## Data Source

Data was obtained from [X](https://x.com/). We used a tweet scraping tool called [TWsearchexporter](https://twsearchexporter.toolmagic.app/) and the tweepy library to collect tweets and save them as CSV files. The raw data did not include the target variable (the hate speech category), so we labelled the tweets manually.

## Data overview

The data contains tweets about the Kenyan political class, each labelled as Hate, Offensive or Neutral.
The columns in the data are:

1. Tweet ID: A unique numerical identifier assigned to each tweet.

2. Tweet Text: The content/text of the tweet.

3. Type: Could indicate the nature of the tweet — e.g., "tweet", "retweet", "reply", or "quote tweet".

4. Author Name: The full display name of the person who posted the tweet.

5. Author Username: The Twitter handle (e.g., @someone) of the tweet's author.

6. Creation Time: The exact timestamp when the tweet was posted.

7. Reply Count: Number of replies the tweet received.

8. Retweet Count: Number of times the tweet was retweeted.

9. Quote Count: Number of quote tweets (tweets that quoted this tweet).

10. Like Count: Number of likes (formerly favorites) the tweet received.

11. View Count: Number of views the tweet has received (may not always be available via API).

12. Bookmark Count: Number of times users bookmarked this tweet (rarely exposed via API).

13. Language: The detected language of the tweet (e.g., en, sw, fr).

14. Possibly Sensitive: Boolean flag indicating if the tweet might contain sensitive content (e.g., violence, hate speech, nudity).

15. Source: The platform the tweet was posted from (e.g., “Twitter for iPhone”, “Web App”).

16. Hashtags: List of hashtags used in the tweet.

17. Tweet URL: The full URL link to the tweet (e.g., https://twitter.com/...).

18. Media Type: Type of media in the tweet — could be image, video, gif, or none.

19. Media URLs: URLs to any images, videos, or gifs attached to the tweet.

20. External URLs: URLs shared in the tweet that lead to external websites (e.g., news articles).

21. SourceFile: File or dataset source this tweet was extracted from — useful when combining multiple datasets.

## Data Exploration

Looking into the dataset to understand its structure before cleaning.

### Loading the datasets and consolidating them to one Dataframe.

#### Import required libraries.

This step is required for the rest of the notebook to run. The Python libraries imported here support mathematical operations, plotting visualizations, text preprocessing and modeling.

In [7]:
import pandas as pd

import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import wordcloud
from wordcloud import WordCloud
import os
import glob
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder , FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from sklearn.utils.class_weight import compute_class_weight
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Bidirectional, Conv1D, GlobalMaxPooling1D, MaxPooling1D, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.regularizers import l2, l1, l1_l2
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.callbacks import Callback
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


#### Load datasets

##### **Data from tweepy extraction**

Load the dataset extracted by tweepy.

In [5]:
df1 = pd.read_csv('Kenyan_Politics_Hatespeech.csv')

##### **Data from TWSearchExporter**

Load the datasets extracted by TWSearchExporter.

In [8]:
# Path to the folder containing your files
folder_path = "data"

# Match all TwSearchExporter CSV files
csv_files = glob.glob(os.path.join(folder_path, "TwSearchExporter-*.csv"))
print(f"Found {len(csv_files)} files.")

# Read and combine
dfs = []
for file in csv_files:
    try:
        df = pd.read_csv(file)
        df['SourceFile'] = os.path.basename(file)
        dfs.append(df)
    except Exception as e:
        print(f"❌ Error reading {file}: {e}")

# Combine into a single DataFrame
if dfs:
    combined_df = pd.concat(dfs, ignore_index=True)
    print("✅ Combined shape:", combined_df.shape)
    print(combined_df[['SourceFile', 'Author Name', 'Tweet Text', 'Creation Time']].head())

    # Optional: Save to new CSV
    combined_df.to_csv("combined_tweets_all.csv", index=False)
else:
    print("⚠️ No files were loaded. Please double-check the folder and file names.")

combined_df

Found 51 files.
✅ Combined shape: (10421, 21)
                                          SourceFile           Author Name  \
0  TwSearchExporter-abdulswamad since2024-07-01 u...      Pepe Danson 🇺🇲🇰🇪   
1  TwSearchExporter-abdulswamad since2024-07-01 u...  Cyprian, Is Nyakundi   
2  TwSearchExporter-abdulswamad since2024-07-01 u...            Udaku girl   
3  TwSearchExporter-abdulswamad since2024-07-01 u...  Cyprian, Is Nyakundi   
4  TwSearchExporter-abdulswamad since2024-07-01 u...        Son Of Anarchy   

                                          Tweet Text           Creation Time  
0  A founder &amp; funder of gang-rape gangsters:...   10/3/2024, 3:12:09 PM  
1  Executive Director Peter of Haki Yetu, a human...  10/16/2024, 1:36:47 PM  
2  Amekutana Ruto akitoka kwa Rigathi Gachagua \n...  10/3/2024, 10:35:45 PM  
3  Shanzu Court grants bail to Omar Ali Mohamed a...  11/17/2024, 8:56:01 PM  
4  This swahili woman mp who is going hard at Rig...  10/8/2024, 11:50:51 AM  


Unnamed: 0,Tweet ID,Tweet Text,Type,Author Name,Author Username,Creation Time,Reply Count,Retweet Count,Quote Count,Like Count,...,Bookmark Count,Language,Possibly Sensitive,Source,Hashtags,Tweet URL,Media Type,Media URLs,External URLs,SourceFile
0,"=""1841813490077405206""",A founder &amp; funder of gang-rape gangsters:...,Tweet,Pepe Danson 🇺🇲🇰🇪,PepeDanson,"10/3/2024, 3:12:09 PM",2,10,0,22,...,1,en,No,Twitter for Android,,https://x.com/PepeDanson/status/18418134900774...,photo,https://pbs.twimg.com/media/GY9vmFKXQAA3wmU.jpg,,TwSearchExporter-abdulswamad since2024-07-01 u...
1,"=""1846500529322811562""","Executive Director Peter of Haki Yetu, a human...",Tweet,"Cyprian, Is Nyakundi",C_NyaKundiH,"10/16/2024, 1:36:47 PM",51,377,12,698,...,24,en,No,Twitter Web App,,https://x.com/C_NyaKundiH/status/1846500529322...,video,https://video.twimg.com/ext_tw_video/184650012...,,TwSearchExporter-abdulswamad since2024-07-01 u...
2,"=""1841925125756879322""",Amekutana Ruto akitoka kwa Rigathi Gachagua \n...,Tweet,Udaku girl,kanyaa_diana,"10/3/2024, 10:35:45 PM",101,342,11,1274,...,106,in,No,Twitter for Android,,https://x.com/kanyaa_diana/status/184192512575...,video,https://video.twimg.com/ext_tw_video/184192491...,,TwSearchExporter-abdulswamad since2024-07-01 u...
3,"=""1858207477302432056""",Shanzu Court grants bail to Omar Ali Mohamed a...,Tweet,"Cyprian, Is Nyakundi",C_NyaKundiH,"11/17/2024, 8:56:01 PM",62,117,6,272,...,9,en,No,Twitter Web App,,https://x.com/C_NyaKundiH/status/1858207477302...,photo,https://pbs.twimg.com/media/GcmtyMVWIAA98WL.png,,TwSearchExporter-abdulswamad since2024-07-01 u...
4,"=""1843574767141695787""",This swahili woman mp who is going hard at Rig...,Tweet,Son Of Anarchy,__QlintDwayne,"10/8/2024, 11:50:51 AM",6,320,3,538,...,3,en,No,Twitter for iPhone,,https://x.com/__QlintDwayne/status/18435747671...,,,,TwSearchExporter-abdulswamad since2024-07-01 u...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10416,"=""1910035670208250296""",Female MPs Fight:\n \nEALA MP Falhada Iman ban...,Tweet,Citizen TV Kenya,citizentvkenya,"4/9/2025, 9:22:45 PM",1,3,1,9,...,3,en,No,Twitter Media Studio - LiveCut,#JKLive,https://x.com/citizentvkenya/status/1910035670...,video,https://video.twimg.com/amplify_video/19100352...,,TwSearchExporter-wetangula-217-Top.csv
10417,"=""1729588348509909271""",Is this the most important and urgent matter f...,Tweet,Dr. Miguna Miguna,MigunaMiguna,"11/28/2023, 10:49:31 PM",17,13,3,190,...,0,en,No,Twitter for iPhone,,https://x.com/MigunaMiguna/status/172958834850...,,,https://nation.africa/kenya/news/speaker-wetan...,TwSearchExporter-wetangula-217-Top.csv
10418,"=""1912614159889428889""",Onto the next ⏭️⏭️⏭️ \n#ARSRMA \n#ChampionsLea...,Tweet,Rt.Hon.Dr.Moses Wetang'ula,HonWetangula,"4/17/2025, 12:08:45 AM",57,52,5,543,...,3,en,No,Twitter for iPhone,"#ARSRMA, #ChampionsLeague",https://x.com/HonWetangula/status/191261415988...,photo,https://pbs.twimg.com/media/Gor4b1HXwAASp3Q.jpg,,TwSearchExporter-wetangula-217-Top.csv
10419,"=""1910269597674066337""",Musalia Mudavadi &amp; Wetangula are busy drin...,Tweet,Alinur Mohamed,AlinurMohamed_,"4/10/2025, 12:52:17 PM",17,136,1,516,...,3,en,No,Twitter for Android,,https://x.com/AlinurMohamed_/status/1910269597...,,,,TwSearchExporter-wetangula-217-Top.csv


##### **Merge the above.**

In [10]:
# Rename columns to the names used in the final dataframe
combined_df = combined_df.rename(columns={
    'Like Count': 'Likes',
    'Retweet Count': 'Retweets',
    'Reply Count': 'Total Replies',
    'Tweet Text': 'Texts',
    'Creation Time': 'Created At'
})

# Select only the columns we want; .copy() avoids SettingWithCopyWarning
# on the assignments below
df2 = combined_df[['Tweet ID', 'Likes', 'Retweets', 'Total Replies', 'Texts', 'Created At']].copy()

# Strip the ="..." wrapper that the export puts around each Tweet ID
df2['Tweet ID'] = df2['Tweet ID'].astype(str).str.replace(r'^="|"$', '', regex=True)

# Convert to datetime object
df2['Created At'] = pd.to_datetime(df2['Created At'], format='%m/%d/%Y, %I:%M:%S %p')

# Localize to UTC (or your preferred timezone)
df2['Created At'] = df2['Created At'].dt.tz_localize('UTC')

# Stack the tweepy data (df1) and the TWSearchExporter data (df2)
final_df = pd.concat([df1, df2], ignore_index=True)
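The two cleaning steps in the merge cell (unwrapping the Tweet ID and parsing the 12-hour timestamp) can be checked in isolation. The sketch below uses two rows copied from the export preview above; the frame names `raw` and `clean` are illustrative, and treating the timestamps as UTC is an assumption carried over from the notebook, not something the export confirms.

```python
import pandas as pd

# Two rows in the format shown in the export preview
raw = pd.DataFrame({
    "Tweet ID": ['="1841813490077405206"', '="1846500529322811562"'],
    "Creation Time": ["10/3/2024, 3:12:09 PM", "10/16/2024, 1:36:47 PM"],
})

# .copy() gives an independent frame, so the assignments below
# do not raise SettingWithCopyWarning
clean = raw[["Tweet ID", "Creation Time"]].copy()

# Remove the leading =" and the trailing " around each ID
clean["Tweet ID"] = clean["Tweet ID"].str.replace(r'^="|"$', "", regex=True)

# Parse the timestamps, then mark them as UTC (assumption: the export
# may in fact use the local time zone of the machine that scraped it)
clean["Creation Time"] = pd.to_datetime(
    clean["Creation Time"], format="%m/%d/%Y, %I:%M:%S %p"
).dt.tz_localize("UTC")

print(clean)
```

Keeping the ID as a string preserves all 19 digits, which matters if the IDs are later used to detect duplicates.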

##### Save the dataframe as a csv file for labelling

In [None]:
final_df.to_csv("Kenyan_politicians_hatespeech.csv", index=False)

##### Make a copy of the saved df

In [22]:
df_copy = final_df.copy(deep=True)

##### **Check the first 20 rows of the dataframe**

In [13]:
df_copy.head(20)

Unnamed: 0,Tweet ID,Likes,Retweets,Total Replies,Texts,Created At
0,1.91229e+18,2.0,0.0,1.0,Rigathi Gachagua ni snitches Ile ya ukweli. La...,2025-04-15 23:23:00+00:00
1,1.91229e+18,0.0,0.0,0.0,@NoCountryHere it’s different for Gachagua man...,2025-04-15 23:21:43+00:00
2,1.91228e+18,0.0,0.0,0.0,@gitaus794 @Mithika_Denno @bonifacemwangi Wher...,2025-04-15 22:52:56+00:00
3,1.91227e+18,0.0,0.0,0.0,@hermexinvesting @MwangiHub The common enemy o...,2025-04-15 22:40:45+00:00
4,1.91227e+18,0.0,0.0,0.0,SHOCK as Gachagua addresses MILLIONS of Mlolon...,2025-04-15 22:30:19+00:00
5,1.91227e+18,0.0,0.0,0.0,@Br1anKE @EtalePhilip Gachagua doesn't feed me...,2025-04-15 22:19:35+00:00
6,1.91227e+18,0.0,0.0,0.0,@RobertAlai @NdindiNyoro Let these assholes se...,2025-04-15 22:19:21+00:00
7,1.91227e+18,0.0,0.0,0.0,"@rigathi Gachagua tiga ubaby mani, kiria utahi...",2025-04-15 22:18:41+00:00
8,1.91227e+18,0.0,0.0,0.0,@NelsonHavi Imagine their are dimwits who even...,2025-04-15 22:04:29+00:00
9,1.91226e+18,0.0,0.0,0.0,@rigathi 😂😂Akamba nation thinking Gachagua wil...,2025-04-15 21:56:23+00:00


1. We have 6 columns in the dataframe.
2. The column layout is consistent across the first 20 rows, although Tweet ID is displayed in scientific notation (e.g. 1.91229e+18) in these rows.
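Tweet IDs shown as `1.91229e+18` have probably been through a floating-point conversion at some stage, which cannot hold all 19 digits. One possible cause (an assumption, since the tweepy extraction step is not shown) is a CSV column with a missing value, which pandas reads as `float64`. A minimal sketch with an in-memory CSV:

```python
import io
import pandas as pd

# In-memory CSV with one real Tweet ID and one missing ID
csv_text = "Tweet ID,Likes\n1841813490077405206,22\n,5\n"

# Default parsing: the missing value forces the column to float64
as_numbers = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every digit exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Tweet ID": str})

print(as_numbers["Tweet ID"].dtype)   # float dtype
print(as_text["Tweet ID"].iloc[0])    # full 19-digit ID as text
```

Reading the ID column with `dtype=str` at load time is one way to avoid the problem; it cannot recover digits that were already lost in a saved file.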

##### **Check the last 20 rows**

In [14]:
df_copy.tail(20)

Unnamed: 0,Tweet ID,Likes,Retweets,Total Replies,Texts,Created At
11695,1912132552120217758,33.0,10.0,13.0,"Hannah Cheptumo – If a woman is educated, chan...",2025-04-15 16:15:00+00:00
11696,1463829441805701120,0.0,0.0,0.0,@RobertAlai I recognized @wamunyinyi as the ch...,2021-11-25 14:18:30+00:00
11697,1912207006238200251,2.0,0.0,0.0,"""CDF is here to stay""\n\nSpeaker Wetangula says",2025-04-15 21:10:52+00:00
11698,1912392815960609003,5.0,1.0,3.0,"Spkr Wetangula, Wake Up.\n\nIt's Salasya's tim...",2025-04-16 09:29:12+00:00
11699,1912046061834666468,9.0,1.0,0.0,It is time for Western Kenya to ditch Mudavadi...,2025-04-15 10:31:19+00:00
11700,1909938459466223912,10.0,4.0,1.0,"This afternoon, I hosted the Kenya National Co...",2025-04-09 14:56:28+00:00
11701,1911411543872274913,2.0,0.0,0.0,@jumaf3 Wetangula and Ford K’s stranglehold sh...,2025-04-13 16:29:59+00:00
11702,1909925455551238475,45.0,7.0,4.0,Kenya continues to deepen and strengthen its l...,2025-04-09 14:04:47+00:00
11703,486127068930129920,0.0,0.0,0.0,RT @SokoAnalyst SONKO heads to the Court of Ap...,2014-07-07 15:38:19+00:00
11704,1806229642895872477,0.0,0.0,0.0,"@C_NyaKundiH @borise18 Sell out or not, point ...",2024-06-27 10:34:39+00:00


The data seems to be uniform and consistent to the end.

##### **Check the details of the data in each column**

In [15]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11715 entries, 0 to 11714
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Tweet ID       11715 non-null  object 
 1   Likes          11715 non-null  float64
 2   Retweets       11715 non-null  float64
 3   Total Replies  11715 non-null  float64
 4   Texts          11715 non-null  object 
 5   Created At     11715 non-null  object 
dtypes: float64(3), object(3)
memory usage: 549.3+ KB


The data has two data types:
1. object
2. float64

Note that `Created At` is stored as object rather than as a datetime type.

##### **Check statistical measures of the numerical columns**

In [17]:
df_copy.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Likes,11715.0,919.255484,14244.939449,0.0,1.0,14.0,309.0,1060202.0
Retweets,11715.0,197.000427,2248.038787,0.0,0.0,4.0,54.0,150446.0
Total Replies,11715.0,68.525907,1063.107464,0.0,0.0,1.0,20.0,87879.0


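The table shows summary statistics for the numerical columns: count, mean, standard deviation (std), minimum, 1st quartile (25%), median (50%), 3rd quartile (75%) and maximum. In every column the mean is far above the median (for example, about 919 vs 14 likes), so engagement is heavily right-skewed.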

##### **Check the statistical measures of the Categorical columns**

In [19]:
df_copy.describe(include=['object']).T


Unnamed: 0,count,unique,top,freq
Tweet ID,11715,10926,1.9124e+18,200
Texts,11715,11241,"We must protect our children from paedophiles,...",7
Created At,11715,11228,2025-04-15 18:34:50+00:00,7


The table shows, for each column, the number of rows (count), the number of unique values (unique), the most frequent value (top) and the number of times that value occurs (freq).
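Since `unique` is lower than `count` for `Texts`, some tweets appear more than once. A hedged sketch of how repeated texts could be flagged and dropped, using a made-up frame named `toy`; whether duplicates should be removed at all depends on the labelling plan, which is not stated here.

```python
import pandas as pd

# Made-up rows: the first two share the same text
toy = pd.DataFrame({
    "Tweet ID": ["1", "2", "3", "4"],
    "Texts": ["same text", "same text", "another tweet", "third tweet"],
})

# True for every repeat after the first occurrence of a text
dup_mask = toy.duplicated(subset="Texts", keep="first")
print(f"Duplicate texts: {dup_mask.sum()}")

# Keep the first occurrence of each text and renumber the rows
deduped = toy.drop_duplicates(subset="Texts", keep="first").reset_index(drop=True)
print(deduped.shape)
```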

##### **Check the shape of the df**

In [20]:
df_copy.shape


(11715, 6)

The dataframe has 11715 rows and 6 columns.

# Feature engineering

### Check the manually added column (label)

In [21]:
# Load the dataset
df = pd.read_csv("final_kenyan_hatespeech.csv")
df.sample(50)

Unnamed: 0.1,Unnamed: 0,likes,retweets,total_replies,created_at,cleaned_text,label
174,176,893,297,15,2024-08-09 15:46:33+00:00,confirmed plans were finalized yesterday to el...,neutral
5729,5746,5,3,1,2025-03-20 07:05:40+00:00,president william ruto and first lady mama rac...,neutral
9431,10592,238,56,12,2021-02-14 22:25:51+00:00,why is david ndii trying so hard to prove to i...,Offensive
8023,8720,5,2,0,2024-07-15 01:28:09+00:00,jimmy wanjigi ladies and gentlemen believe the...,neutral
3306,3321,0,0,0,2025-04-16 20:31:20+00:00,njugush ameuliza kama kindiki anaoga nje,neutral
4127,4142,738,230,28,2023-03-19 07:36:33+00:00,i had soooo much respect for this man now im c...,neutral
2784,2798,0,0,0,2025-04-12 11:50:05+00:00,junet mohamed is reminding you that in as much...,neutral
1645,1656,396,87,619,2024-08-21 17:18:18+00:00,rate chief justice martha koome out of,offensive
9587,10799,0,0,0,2025-04-19 07:34:25+00:00,you are already a member stop with the theatri...,offensive
5794,5811,1,1,0,2024-04-06 20:29:59+00:00,while rachel ruto is busy contacting the best ...,hate


From the sample we can see that every row has a label, but the capitalisation is not uniform: both `Offensive` and `offensive` appear.
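Mixed capitalisation matters because `Offensive` and `offensive` would be treated as separate classes in group-bys and plots. A minimal sketch of normalising label strings, using a made-up Series named `labels`:

```python
import pandas as pd

# Made-up labels with mixed case and stray whitespace
labels = pd.Series(["neutral", "Offensive", "offensive", " hate", "Hate"])

# Trim whitespace and lower-case so each class has one spelling
normalised = labels.str.strip().str.lower()

print(normalised.value_counts())
```

On the real data the same two string methods could be applied to the `label` column before any analysis.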

### Add a total engagement column

We do this by summing the likes, retweets and replies.

In [23]:
# Calculate combined engagement
df['engagement_score'] = df['likes'] + df['retweets'] + df['total_replies']

### Add engagement bins for likes, retweets, replies and the total engagement                        

This lets us group engagement counts into ranges such as 0, 1-10 and 11-100.

In [None]:
# Binning individual engagement types
# The last edge is open-ended so counts above 100,000 still land in '10k+'
bin_edges = [-1, 0, 10, 100, 1000, 10000, np.inf]
bin_labels = ['0', '1-10', '11-100', '101-1k', '1001-10k', '10k+']
df['likes_bin'] = pd.cut(df['likes'], bins=bin_edges, labels=bin_labels)
df['retweets_bin'] = pd.cut(df['retweets'], bins=bin_edges, labels=bin_labels)
df['replies_bin'] = pd.cut(df['total_replies'], bins=bin_edges, labels=bin_labels)
df['engagement_bin'] = pd.cut(df['engagement_score'], bins=bin_edges, labels=bin_labels)
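The choice of the top bin edge matters because `pd.cut` returns `NaN` for values outside the outermost edges, and the `describe()` output earlier reported a maximum of 1,060,202 likes. The sketch below compares a finite top edge with an open-ended one on made-up values; the names `capped` and `open_ended` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up like counts, including one above 100,000
likes = pd.Series([0, 7, 250, 99_999, 1_060_202])
labels = ["0", "1-10", "11-100", "101-1k", "1001-10k", "10k+"]

# Finite top edge: anything above 100,000 falls outside every bin
capped = pd.cut(likes, bins=[-1, 0, 10, 100, 1000, 10000, 100000], labels=labels)

# Open-ended top edge: every non-negative count gets a bin
open_ended = pd.cut(likes, bins=[-1, 0, 10, 100, 1000, 10000, np.inf], labels=labels)

print(capped.isna().sum(), open_ended.isna().sum())
```

Rows left as `NaN` by the binning would not appear under any bin label in a histogram built on the bin columns.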

### Add a wordcount and text length column

In [None]:
# Add word count column (the text column is named 'cleaned_text' in the loaded file)
df['word_count'] = df['cleaned_text'].apply(lambda x: len(str(x).split()))

# Add text length (character count) column
df['text_length'] = df['cleaned_text'].apply(lambda x: len(str(x)))
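One subtlety of wrapping values in `str(x)`: a missing text becomes the string `"nan"`, which counts as one word and three characters. The sketch below shows a vectorised alternative on a made-up Series named `texts` that treats missing texts as empty; whether the real column contains missing values is not known from the output shown.

```python
import pandas as pd

# Made-up texts, including one missing value
texts = pd.Series(["ruto is here", "hello", None])

# Replace missing values with an empty string before counting
as_str = texts.fillna("")

# Split on whitespace and count the pieces; count characters directly
word_count = as_str.str.split().str.len()
text_length = as_str.str.len()

print(word_count.tolist(), text_length.tolist())

# For comparison: str() turns a missing value into the word "nan"
print(len(str(float("nan")).split()), len(str(float("nan"))))
```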



# Exploratory Data Analysis

## Multivariate analysis


### Engagement by Label categories

In [None]:
# Melt individual engagement types
long_df = pd.melt(
    df,
    id_vars='label',
    value_vars=['likes_bin', 'retweets_bin', 'replies_bin'],
    var_name='engagement_type',
    value_name='bin_value'
)

# Clean up engagement type names
long_df['engagement_type'] = long_df['engagement_type'].str.replace('_bin', '').str.capitalize()

# Append the combined engagement data
comb_engagement_df = df[['label', 'engagement_bin']].rename(columns={'engagement_bin': 'bin_value'})
comb_engagement_df['engagement_type'] = 'Total'
long_df = pd.concat([long_df, comb_engagement_df], ignore_index=True)

# Plotly faceted bar chart with combined engagement
fig = px.histogram(
    long_df,
    x='bin_value',
    color='label',
    facet_col='engagement_type',
    category_orders={
        'bin_value': ['0', '1-10', '11-100', '101-1k', '1001-10k', '10k+'],
        'engagement_type': ['Likes', 'Retweets', 'Replies', 'Total']
    },
    color_discrete_sequence=px.colors.qualitative.Set2,
    barmode='group',
    title="Tweet Labels by Engagement Type and Bin (Including Total)",
    labels={'bin_value': 'Engagement Bin', 'count': 'Tweet Count'},
    height=1000,
    width=2400
)

fig.update_layout(
    showlegend=True,
    legend_title_text='Label'
)

fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))  # Clean up facet titles

fig.show()


#### General insights:

1. Neutral tweets dominate all engagement types and bins.

2. Offensive and hate tweets tend to cluster more in lower engagement bins.

3. The number of tweets sharply drops in higher engagement bins across all categories.

#### Insights from Likes
1. Most tweets with no likes (bin 0) are neutral, but a significant number are also offensive or hate.

2. As engagement increases, neutral tweets become even more dominant.

3. Very few hate or offensive tweets receive 10k+ likes.

#### Insights from Retweets
1. Similar pattern to likes: neutral tweets are most common, even more so in higher bins.

2. Very few offensive or hate tweets make it past the 101-1k retweet range.

3. No hate or offensive tweets in the 10k+ retweet bin.

#### Insights from Replies
1. Replies show a higher proportion of offensive and hate tweets, especially at 0 and 1–10 reply bins.

2. Still, neutral tweets dominate in total count, especially in the 1–100 reply range.

3. A steep drop-off in all categories beyond the 101–1k bin.

#### Insights from Total Engagement
The total view mirrors the trends from the individual engagement types:

1. Neutral tweets are consistently the most common in every bin, including the high-engagement bins.

2. Offensive and hate tweets are far less likely to receive high engagement.
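Raw counts are dominated by whichever label has the most tweets overall, so comparing each label's share within a bin can make these insights easier to verify. A minimal sketch using `pd.crosstab` on made-up rows (the frame `toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up labelled tweets with their likes bin
toy = pd.DataFrame({
    "label": ["neutral", "neutral", "neutral", "offensive", "hate", "neutral"],
    "likes_bin": ["0", "0", "1-10", "0", "1-10", "1-10"],
})

# normalize="index" turns each row (bin) into proportions that sum to 1
share = pd.crosstab(toy["likes_bin"], toy["label"], normalize="index")

print(share.round(2))
```

A label whose share shrinks as the bins move upward is under-represented among highly engaged tweets, independent of how many tweets carry that label in total.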


### Word count vs text length per label category

In [53]:
# Calculate mean word count and mean text length per label
mean_stats = df.groupby('label')[['word_count', 'text_length']].mean().reset_index()

# Melt to long format for grouped bar plot
mean_stats_melted = mean_stats.melt(id_vars='label', var_name='Metric', value_name='Mean Value')

# Create grouped bar chart
fig = px.bar(
    mean_stats_melted,
    x='label',
    y='Mean Value',
    color='Metric',
    barmode='group',
    text='Mean Value',
    color_discrete_sequence=px.colors.qualitative.Set2,
    title='Mean Word Count and Text Length per Tweet Label',
    labels={'label': 'Tweet Label', 'Mean Value': 'Mean'}
)

# Update layout for aesthetics
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(
    yaxis_title='Mean Value',
    xaxis_title='Label',
    uniformtext_minsize=8,
    uniformtext_mode='hide',
    height=500,
    width=800
)

fig.show()

1. Text length (in characters):

Hate tweets: 137.87

Neutral tweets: 135.69

Offensive tweets: 133.42

Hate tweets are slightly longer on average, possibly because of more elaborate or emotionally charged language, but the gap between the longest and shortest category is under 5 characters.

The average tweet length falls between 130 and 140 characters for all categories.

2. Word count:

Hate tweets: 23.82

Offensive tweets: 23.28

Neutral tweets: 23.10

The variation is small (less than one word) but may indicate that hate and offensive tweets use slightly more words, possibly to express more complex or heated messages.

The average tweet word count falls between 23 and 24 words for all categories.
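Differences this small are hard to interpret from means alone. One way to put them in context is to report the spread and group size next to each mean, sketched here on made-up values (the frame `toy` is illustrative):

```python
import pandas as pd

# Made-up word counts for two labels
toy = pd.DataFrame({
    "label": ["hate", "hate", "neutral", "neutral"],
    "word_count": [20, 28, 22, 24],
})

# Mean, sample standard deviation and group size per label
summary = toy.groupby("label")["word_count"].agg(["mean", "std", "count"])

print(summary)
```

If the standard deviation within each label is much larger than the gap between the means, the labels are unlikely to be separable on length alone.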



### Distribution of Word count vs text length per label category

In [None]:
# Scatter plot with trendlines by label using Plotly
fig = px.scatter(
    df,
    x='word_count',
    y='text_length',
    color='label',
    trendline='ols',
    opacity=0.6,
    size_max=10,
    title="Word Count vs. Text Length with Trendlines by Label",
    width=1000,
    height=700
)

# Update layout for labels and better appearance
fig.update_layout(
    xaxis_title="Word Count",
    yaxis_title="Text Length (Characters)",
    legend_title="Label"
)

fig.show()


### Insights

1. Positive correlation:

A positive trend line across most or all categories shows that as the word count increases, the text length also increases.

This makes intuitive sense: a tweet with more words needs more characters.

2. Category differences:

A steeper trend line means more characters per word, i.e. longer or more complex words (lower word count for the same text length).

A flatter trend line means shorter, more concise words (higher word count for the same text length).

3. Outliers:

Points far from the trend line are outliers: unusually wordy or unusually concise tweets, possibly spam content.

4. Compactness of categories:

Tightly clustered points imply a similar writing style or length.
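The slope of each trend line can be read as the average number of characters added per extra word, which makes the category comparison concrete. Plotly's `trendline='ols'` relies on statsmodels being installed; the sketch below instead fits a straight line with `np.polyfit` on made-up values, so it is a stand-in for the plotted fit rather than the same computation.

```python
import numpy as np
import pandas as pd

# Made-up tweets: 'hate' uses 6 characters per word, 'neutral' uses 5
toy = pd.DataFrame({
    "label": ["hate"] * 3 + ["neutral"] * 3,
    "word_count": [5, 10, 20, 5, 10, 20],
    "text_length": [30, 60, 120, 25, 50, 100],
})

# Fit text_length = slope * word_count + intercept for each label
slopes = {}
for label, grp in toy.groupby("label"):
    slope, intercept = np.polyfit(grp["word_count"], grp["text_length"], deg=1)
    slopes[label] = slope

print(slopes)
```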

### Correlation between likes, retweets and replies

In [None]:
# Create the correlation matrix
corr_matrix = df[['likes', 'retweets', 'total_replies']].corr()

# Create the heatmap using Plotly
fig = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.index,
    colorscale='Viridis',  # You can choose other color scales like 'Cividis', 'Plasma', etc.
    colorbar=dict(title='Correlation'),
))

# Update layout to customize the dimensions
fig.update_layout(
    title="Correlation Heatmap of Likes, Retweets & Replies",
    xaxis_title="Features",
    yaxis_title="Features",
    xaxis=dict(tickmode='array'),
    yaxis=dict(tickmode='array'),
    width=800,  # Adjust the width of the heatmap
    height=600  # Adjust the height of the heatmap
)

fig.show()


### Insights
1. Strong positive correlation between likes and retweets

The correlation is about 0.95 or higher (very close to 1).

This indicates that tweets getting more likes are also highly likely to be retweeted, and vice versa.

These two features likely reflect similar audience engagement behaviour.

2. Moderate correlation between retweets and replies

The value is lower (around 0.76).

This suggests that while retweeted tweets may get replies, the link is not as tight as between likes and retweets.

Retweets may spread content without always sparking conversation.

3. Lowest correlation between likes and replies

The correlation here is the weakest among the three pairs.

This implies that a tweet being liked does not mean it generates replies or discussion.
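Pearson correlation is sensitive to a handful of extreme values, and the engagement counts here are heavily skewed. A rank-based (Spearman) correlation is a common robustness check; it equals the Pearson correlation of the ranks, which is how the sketch below computes it. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up engagement counts with one viral tweet
toy = pd.DataFrame({
    "likes": [0, 3, 10, 50, 100000],
    "retweets": [0, 1, 4, 20, 9000],
    "total_replies": [1, 0, 2, 5, 40],
})

# Pearson on raw counts (what DataFrame.corr() computes by default)
pearson = toy.corr()

# Spearman computed as Pearson on the ranks of each column
spearman = toy.rank().corr()

print(pearson.round(2))
print(spearman.round(2))
```

If the ordering of the three pairs stays the same under both measures, the insights above are less likely to be driven by a few viral tweets.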


