# Data Exploration and Cleaning of Kaggle Cyberbullying Dataset

**Dataset Columns:**  
- `tweet_text`: Text of the tweet  
- `cyberbullying_type`: Type of cyberbullying, one of `age`, `ethnicity`, `gender`, `religion`, `other_cyberbullying`, or `not_cyberbullying`

**Dataset Source / Credit:**  
This dataset was obtained from [Kaggle: Cyberbullying Classification](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification/data).


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
df = pd.read_csv("cyberbullying_tweets.csv")

## Data Exploration
First, I will take a look at the dataset to understand its structure.  
The dataset contains two columns: the tweet text itself and the label indicating the type of cyberbullying.  
I will check for duplicate rows and missing values, then look at the unique labels, their counts, and basic statistics on tweet length measured in words.


In [6]:
print("First rows:")
display(df.head())

First rows:


| | tweet_text | cyberbullying_type |
|---|---|---|
| 0 | In other words #katandandre, your food was cra... | not_cyberbullying |
| 1 | Why is #aussietv so white? #MKR #theblock #ImA... | not_cyberbullying |
| 2 | @XochitlSuckkks a classy whore? Or more red ve... | not_cyberbullying |
| 3 | @Jason_Gio meh. :P thanks for the heads up, b... | not_cyberbullying |
| 4 | @RudhoeEnglish This is an ISIS account pretend... | not_cyberbullying |


In [7]:
print("\nDataset info:")
df.info()


Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47692 entries, 0 to 47691
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_text          47692 non-null  object
 1   cyberbullying_type  47692 non-null  object
dtypes: object(2)
memory usage: 745.3+ KB


In [11]:
# number of duplicates
print("\nNumber of duplicate rows:", df.duplicated().sum())


Number of duplicate rows: 36


In [9]:
# missing values
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
tweet_text            0
cyberbullying_type    0
dtype: int64


In [10]:
# unique values
print("\nUnique values per column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")


Unique values per column:
tweet_text: 46017 unique values
cyberbullying_type: 6 unique values
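
The counts above are worth a second look: 47,692 rows but only 46,017 unique texts means 1,675 rows repeat an earlier text, while only 36 rows are exact duplicates. That suggests most repeated texts appear under more than one label. The following is a minimal sketch of how this could be checked; `sample` is a hypothetical toy frame standing in for `df`, not data from the real file.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (the real data is in cyberbullying_tweets.csv).
sample = pd.DataFrame({
    "tweet_text": ["a", "a", "b", "b", "c"],
    "cyberbullying_type": ["age", "age", "gender", "other_cyberbullying", "religion"],
})

# Exact duplicate rows: same text and same label as an earlier row.
exact_dupes = int(sample.duplicated().sum())

# Number of distinct labels attached to each text.
labels_per_text = sample.groupby("tweet_text")["cyberbullying_type"].nunique()

# Texts that occur with more than one distinct label.
conflicting_texts = labels_per_text[labels_per_text > 1].index.tolist()

print("Exact duplicate rows:", exact_dupes)
print("Texts with conflicting labels:", conflicting_texts)
```

Running the same `groupby` on `df` would show how many tweets carry conflicting labels. Note that `drop_duplicates()` without a `subset` argument leaves those rows in place, so whether to keep them is a separate decision.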


In [16]:
# label distribution
print("\nRows per cyberbullying type:")
print(df["cyberbullying_type"].value_counts())


Rows per cyberbullying type:
cyberbullying_type
religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: count, dtype: int64
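
To make the class balance explicit, the counts printed above can be turned into shares and a simple max/min ratio. This sketch hard-codes the printed counts in a Series named `counts` (an illustrative name) so that it runs on its own.

```python
import pandas as pd

# Label counts copied from the output above.
counts = pd.Series({
    "religion": 7998,
    "age": 7992,
    "gender": 7973,
    "ethnicity": 7961,
    "not_cyberbullying": 7945,
    "other_cyberbullying": 7823,
})

# Share of each label in the whole dataset.
shares = (counts / counts.sum()).round(4)

# Ratio between the largest and smallest class; values near 1 indicate balance.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("Largest / smallest class:", round(imbalance_ratio, 3))
```

On the loaded frame, `df["cyberbullying_type"].value_counts(normalize=True)` gives the same shares directly. The classes are close to balanced, so accuracy should be a reasonable first metric for a classifier trained on these labels.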


In [17]:
# text length
df["text_length"] = df["tweet_text"].apply(lambda x: len(str(x).split()))
print("\nTweet length statistics (in words):")
print(df["text_length"].describe())


Tweet length statistics (in words):
count    47692.000000
mean        23.704835
std         15.434881
min          1.000000
25%         13.000000
50%         20.000000
75%         32.000000
max        790.000000
Name: text_length, dtype: float64
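
The maximum of 790 words is far above the 75th percentile of 32, so it helps to count how many tweets exceed a candidate cutoff before filtering. Below is a minimal sketch on a hypothetical toy frame; `sample` and `threshold` are illustrative names, and the cutoff of 100 matches the one used in the cleaning step.

```python
import pandas as pd

# Hypothetical toy frame standing in for df.
sample = pd.DataFrame({
    "tweet_text": ["short tweet", "word " * 150, "a few more words here"],
})

# Vectorized word count; for non-null strings this matches len(str(x).split()).
sample["text_length"] = sample["tweet_text"].str.split().str.len()

threshold = 100  # cutoff in words

# Number and share of tweets longer than the cutoff.
n_long = int((sample["text_length"] > threshold).sum())
share_long = n_long / len(sample)

print(f"{n_long} of {len(sample)} tweets exceed {threshold} words ({share_long:.1%})")
```

Applied to `df`, this reports how many rows the length filter will remove, which makes it easy to confirm that only a small fraction of the data is affected.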


In [19]:
df[df["text_length"] > 100][["tweet_text"]].head(10)

| | tweet_text |
|---|---|
| 1317 | @EurekAlertAAAS: Researchers push to import to... |
| 3030 | He embellished the afternoon with moustachioed... |
| 4846 | @andrea_gcav: @viviaanajim recuerdas como noso... |
| 10922 | don't make rape jokes!!! don't make gay jokes!... |
| 14168 | IT CALLS THE FUNCTION TO THE PROCESS OR IT GE…... |
| 15621 | @ufcpride40: : Terry Bean, prominent gay activ... |
| 24516 | @NICKIMINAJ: #WutKinda\r\nAt this rate the MKR... |
| 25411 | @Sweetie_Niesha: So Im getting bullied via twi... |
| 25939 | If cats looked like frogs we'd realize what na... |
| 29205 | is feminazi an actual word with a denot…\r\n@N... |


## Data Cleaning

The exploration above showed 36 duplicate rows and a small number of tweets far longer than the rest: the longest has 790 words, while 75% of tweets have 32 words or fewer. In this step I will drop the duplicate rows, remove tweets longer than 100 words, and normalize the text by lowercasing it, removing URLs, mentions, hashtags, and non-letter characters, and collapsing repeated whitespace.

In [27]:
# drop duplicates
df = df.drop_duplicates()

In [28]:
# remove very long tweets
df = df[df["text_length"] <= 100].copy()  # copy so later column assignments do not act on a slice

In [26]:
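def clean_text(text):
    """Normalize a tweet: lowercase, strip URLs, mentions, hashtags, and non-letters."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)            # remove @mentions
    text = re.sub(r"#\w+", "", text)            # remove hashtags, including the tag word
    text = re.sub(r"[^a-z\s]", "", text)        # keep only lowercase ASCII letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace runs
    return text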

# apply cleaning
df["clean_text"] = df["tweet_text"].apply(clean_text)
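
To see what the cleaning function does in practice, it can be run on a couple of strings. This sketch repeats the same regex steps so that it runs on its own; the two sample tweets are hypothetical and not taken from the dataset.

```python
import re

def clean_text(text):
    # Same steps as the notebook's function, repeated so this block is self-contained.
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical sample tweets, made up for illustration.
samples = [
    "@user Check THIS out!! https://t.co/abc #MKR it's 100% wrong",
    "@user #tag https://t.co/xyz 123",
]

cleaned = [clean_text(s) for s in samples]

# Tweets made only of mentions, hashtags, URLs, or digits end up empty.
empty_after_cleaning = [s for s, c in zip(samples, cleaned) if c == ""]

print(cleaned)  # ['check this out its wrong', '']
```

Two side effects are visible here: dropping apostrophes turns "it's" into "its", and a tweet with no plain words becomes an empty string. Counting rows where `df["clean_text"] == ""` after cleaning shows how often the second case occurs in the real data.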

In [25]:
df = df.reset_index(drop=True)

In [29]:
df.to_csv("cleaned_cyberbullying_tweets.csv", index=False)
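
One thing to keep in mind when reloading the saved file: with default settings, pandas reads an empty CSV field back as `NaN`, so any tweet whose `clean_text` became an empty string shows up as a missing value. The sketch below shows this round trip on a hypothetical two-row frame, using an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical frame with one tweet whose cleaned text is empty.
sample = pd.DataFrame({
    "tweet_text": ["@user #tag", "hello world"],
    "clean_text": ["", "hello world"],
})

# Write to an in-memory buffer and read it back, mimicking to_csv / read_csv on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as NaN.
n_missing = int(reloaded["clean_text"].isna().sum())

# One option: drop rows with empty cleaned text before saving.
non_empty = sample[sample["clean_text"].str.len() > 0].reset_index(drop=True)

print("Missing after reload:", n_missing)
print("Rows kept after dropping empty text:", len(non_empty))
```

Whether to drop such rows or keep them is a modeling choice. If they are kept, passing `keep_default_na=False` to `pd.read_csv` is one way to read the empty fields back as empty strings.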