# Data Exploration and Cleaning of Kaggle Cyberbullying Dataset

**Dataset Columns:**  
- `tweet_text`: Text of the tweet  
- `cyberbullying_type`: Type of cyberbullying (`age`, `ethnicity`, `gender`, `religion`, `other_cyberbullying`, `not_cyberbullying`)

**Dataset Source / Credit:**  
This dataset was obtained from [Kaggle: Cyberbullying Classification](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification/data).  


In [None]:
# install third-party packages before importing them
%pip install emoji wordsegment

import re
from collections import Counter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import emoji
from wordsegment import load, segment

df = pd.read_csv("cyberbullying_tweets.csv")

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting wordsegment
  Downloading wordsegment-1.3.1-py2.py3-none-any.whl.metadata (7.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Downloading wordsegment-1.3.1-py2.py3-none-any.whl (4.8 MB)
Installing collected packages: wordsegment, emoji
Successfully installed emoji-2.14.1 wordsegment-1.3.1


## Data Exploration
We first inspect the dataset's structure and look for quality issues such as duplicates, missing values, and outliers.


In [None]:
print("First rows:")
display(df.head())

First rows:


index,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying


In [None]:
print("\nDataset info:")
df.info()


Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47692 entries, 0 to 47691
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_text          47692 non-null  object
 1   cyberbullying_type  47692 non-null  object
dtypes: object(2)
memory usage: 745.3+ KB


In [None]:
# number of duplicates
print("\nNumber of duplicate rows:", df.duplicated().sum())


Number of duplicate rows: 36


In [None]:
# missing values
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
tweet_text            0
cyberbullying_type    0
dtype: int64


In [None]:
# unique values
print("\nUnique values per column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")


Unique values per column:
tweet_text: 46017 unique values
cyberbullying_type: 6 unique values
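
The counts above show 47692 rows but only 46017 unique tweet texts, so 1675 rows repeat an earlier text, far more than the 36 exact duplicate rows. One plausible explanation is that some texts appear under more than one label. The sketch below shows how the two kinds of repetition could be told apart; the `toy` frame and its rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (rows are invented for illustration).
toy = pd.DataFrame({
    "tweet_text": ["same tweet", "same tweet", "exact dup", "exact dup", "unique tweet"],
    "cyberbullying_type": ["gender", "other_cyberbullying", "age", "age", "religion"],
})

# Exact duplicates: text and label both match an earlier row.
exact_dups = int(toy.duplicated().sum())

# Repeated texts: the text matches an earlier row, whatever the label.
repeated_texts = int(toy.duplicated(subset="tweet_text").sum())

# Texts that appear under more than one label.
labels_per_text = toy.groupby("tweet_text")["cyberbullying_type"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print("exact duplicate rows:", exact_dups)
print("rows repeating an earlier text:", repeated_texts)
print("texts with conflicting labels:", conflicting)
```

Running the same three checks on `df` would show whether `drop_duplicates()` alone is enough or whether texts with conflicting labels need a separate decision.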


In [73]:
# remove empty rows
df = df.dropna(subset=['tweet_text'])
df = df[df['tweet_text'].str.strip() != '']

In [None]:
# label distribution
print("\nRows per cyberbullying type:")
print(df["cyberbullying_type"].value_counts())


Rows per cyberbullying type:
cyberbullying_type
religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: count, dtype: int64


In [50]:
# text length
df["text_length"] = df["tweet_text"].apply(lambda x: len(str(x).split()))
print("\nTweet length statistics (in words):")
print(df["text_length"].describe())


Tweet length statistics (in words):
count    47656.000000
mean        23.708347
std         15.438910
min          1.000000
25%         13.000000
50%         20.000000
75%         32.000000
max        790.000000
Name: text_length, dtype: float64


In [None]:
df[df["text_length"] > 100][["tweet_text"]].head(10)

index,tweet_text
1317,@EurekAlertAAAS: Researchers push to import to...
3030,He embellished the afternoon with moustachioed...
4846,@andrea_gcav: @viviaanajim recuerdas como noso...
10922,don't make rape jokes!!! don't make gay jokes!...
14168,IT CALLS THE FUNCTION TO THE PROCESS OR IT GE…...
15621,"@ufcpride40: : Terry Bean, prominent gay activ..."
24516,@NICKIMINAJ: #WutKinda\r\nAt this rate the MKR...
25411,@Sweetie_Niesha: So Im getting bullied via twi...
25939,If cats looked like frogs we'd realize what na...
29205,is feminazi an actual word with a denot…\r\n@N...
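
Before committing to a cut-off such as the 100 words used in the filter above, it can help to check how many rows each candidate threshold would remove. The following is a minimal sketch on a made-up `lengths` series that stands in for `df["text_length"]`; the candidate thresholds are arbitrary choices for illustration.

```python
import pandas as pd

# Toy stand-in for df["text_length"]; the values are made up for illustration.
lengths = pd.Series([5, 12, 20, 25, 31, 40, 58, 101, 250, 790], name="text_length")

# Count the rows that each candidate cut-off would drop.
removed = {}
for threshold in (50, 100, 200):
    n_over = int((lengths > threshold).sum())
    removed[threshold] = n_over
    print(f"> {threshold} words: {n_over} rows ({n_over / len(lengths):.1%})")
```

On the real data, a threshold that removes only a very small share of rows is less likely to discard legitimate tweets.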


## Data Cleaning

The exploration step showed 36 duplicate rows and a handful of entries far longer than a normal tweet (up to 790 words, against a median of 20). In this step we drop tweets longer than 100 words, remove the duplicates, and then apply text-level cleaning.

In [52]:
# remove very long tweets
df = df[df["text_length"] <= 100]

# drop duplicates
df = df.drop_duplicates()

In [54]:
print("\nNumber of duplicate rows:", df.duplicated().sum())


Number of duplicate rows: 0


In [55]:
df[df["text_length"] > 100][["tweet_text"]].head(10)

index,tweet_text


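Next, we clean each tweet: lowercase the text, remove URLs, replace mentions with a user placeholder, and split hashtags into words so their meaning is preserved. Emojis are converted to their text names, and the punctuation marks `!`, `?`, and `.` are kept because they carry tone and sentiment; all other non-letter characters are removed, and runs of three or more repeated characters are shortened to two. The goal is to reduce noise while keeping the cues that signal bullying.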

In [45]:
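load()  # wordsegment must load its corpus before segment() can be used

def split_hashtag(match):
    # "#stopbullying" -> "stop bullying"; fall back to the bare tag if nothing is returned
    tag = match.group(1)
    words = " ".join(segment(tag))
    return words if words else tag

def clean_text(text):
    text = str(text).lower()
    # remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # replace mentions with a lowercase placeholder; "@USER" would be
    # stripped entirely by the character filter below
    text = re.sub(r"@\w+", " user ", text)
    # split each hashtag into words in a single pass
    text = re.sub(r"#(\w+)", split_hashtag, text)
    # convert emojis to their names and turn underscores into spaces
    text = emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")
    # keep only letters, whitespace, and ! ? .
    text = re.sub(r"[^a-z\s!?.]", "", text)
    # shorten runs of 3+ identical characters to 2 ("soooo" -> "soo")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # normalise whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text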

# apply cleaning
df["clean_text"] = df["tweet_text"].apply(clean_text)
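
To see what the regex-based steps do in isolation, here is a standard-library-only sketch. `regex_steps` is a hypothetical helper written for this illustration: it leaves out hashtag segmentation and emoji conversion, and it uses a lowercase `user` placeholder (an assumption of this sketch) so that the placeholder survives the character filter. The sample tweet is invented.

```python
import re

def regex_steps(text):
    """Hypothetical helper: only the regex-based cleaning steps."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", " user ", text)       # mention placeholder
    text = re.sub(r"[^a-z\s!?.]", "", text)      # keep letters, whitespace, ! ? .
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # "soooo" -> "soo", "!!!" -> "!!"
    return re.sub(r"\s+", " ", text).strip()     # normalise whitespace

sample = "@someone This is SOOOO bad!!! see http://example.com"
result = regex_steps(sample)
print(result)  # user this is soo bad!! see
```

Note that the character filter also removes digits and apostrophes, so a word like "don't" becomes "dont".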

In [59]:
%pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=a051d5a97a47b6cae5ab4e1ca20ed51d79d6c24ee493c80235ce4792395ecb56
  Stored in directory: /root/.cache/pip/wheels/c1/67/88/e844b5b022812e15a52e4eaa38a1e709e99f06f6639d7e3ba7
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [64]:
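# remove non-English tweets
from langdetect import DetectorFactory, LangDetectException, detect

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0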

def is_english(text):
    try:
        if text.strip():
            return detect(text) == 'en'
        else:
            return False
    except LangDetectException:
        return False

df = df[df['tweet_text'].apply(is_english)]

In [65]:
df = df.reset_index(drop=True)

In [68]:
# keep only the cleaned text and the label, under the original column name
df = df.drop(columns=['tweet_text', 'text_length'])
df = df.rename(columns={'clean_text': 'tweet_text'})

In [70]:
df.to_csv("cleaned_cyberbullying_tweets.csv", index=False)
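
Cleaning can leave some rows with an empty string (for example, a tweet that consisted only of a URL) and can make previously distinct tweets identical. The sketch below shows a check one might run on the cleaned frame; the `cleaned` frame and its rows are invented for illustration, and whether such rows should be dropped is a judgment call rather than something the notebook above prescribes.

```python
import pandas as pd

# Toy cleaned frame; the rows are invented for illustration.
cleaned = pd.DataFrame({
    "tweet_text": ["user you are awful", "", "user you are awful", "nice day"],
    "cyberbullying_type": ["gender", "age", "gender", "not_cyberbullying"],
})

# Rows whose text became empty during cleaning.
n_empty = int((cleaned["tweet_text"].str.strip() == "").sum())

# Rows that became exact duplicates after cleaning.
n_dups = int(cleaned.duplicated().sum())
print("empty:", n_empty, "duplicates:", n_dups)

# Drop both kinds of rows and rebuild the index.
cleaned = cleaned[cleaned["tweet_text"].str.strip() != ""]
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print("rows kept:", len(cleaned))
```

Empty strings matter for the saved file as well: by default `pd.read_csv` reads empty fields back as `NaN`.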