<h2><center>NLP Text Classification</center></h2>

## I. Introduction

### 1.1 Domain-specific area
This project analyses textual data from Twitter to detect and classify threatening or harmful content using sentiment analysis techniques. The aim is to give the cybersecurity industry a tool that is trained on a corpus of text to build a strong detection system.

### 1.2 Objectives
Popular detection algorithms centre on cyberbullying on social media (Van Hee et al., 2018), so this project aims to widen the scope of detection. Existing approaches focus mainly on terrorism and cyberbullying, but cybersecurity encompasses more than these two areas (Khairy, Mahmoud and Abd-El-Hafeez, 2021). While the full security and safety of users cannot be ensured, a broader detector would contribute valuable insights for future development.

### 1.3 Dataset
To begin this project, a large corpus of text is required. After reviewing large datasets of tweets, the Sentiment140 dataset on Kaggle was chosen as the best fit for this project. It contains 1.6 million tweets extracted using the Twitter API, each labelled by the authors as having positive, neutral or negative sentiment, which helps the algorithm categorise harmful texts.

The dataset consists of the target (defined as the sentiment of the text), the tweet IDs, date, flags (possible queries, which would be removed in the initialisation phase of extracting the data), the username, and the text of the tweet.

### 1.4 Evaluation methodology

## II. Implementation

### 2.1 Pre-processing
(writeup not needed)
<br>Convert/store the dataset locally and preprocess the data. Describe the text representation
(e.g., bag of words, word embedding, etc.) and any pre-processing steps you have applied
and why they were needed (e.g. tokenization, lemmatization). Describe the vocabulary and
file type/format, e.g. CSV file.

#### Acquiring dataset
The dataset of tweets was acquired from Kaggle by downloading the CSV file. The author of this dataset is Μαριος Μιχαηλιδης KazAnova. The code for importing the dataset is shown below:

#### 2.1.1 Importing libraries
- <b>pandas</b> is used to load, clean and store tabular data, including reading from and writing to CSV files

- <b>numpy</b> provides arrays and numerical routines for statistical calculations

- <b>matplotlib</b> is used to plot the data and represent it graphically

- <b>os</b> gives access to operating system dependent functionality; here it is used to check whether the saved CSV file already exists

- <b>word_tokenize</b> (a function from nltk.tokenize) divides a string into a list of word and punctuation tokens

- <b>stopwords</b> (a corpus from nltk.corpus) provides lists of the most common words in a language to aid in stop word removal

- <b>string</b> provides constants such as string.punctuation, the ASCII punctuation characters, which are used to remove tokens made up only of symbols

- <b>re</b> provides regular expressions, which are used to find and replace text that matches a given pattern

In [1]:
# dataframes
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
import os
from io import StringIO

# text processing and analysis
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sbgka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### 2.1.2 Creating helper functions
- <b>displaysetsH</b> takes in a list of datasets and an optional number of rows to display the head of each dataset

- <b>displaysetsT</b> takes in a list of datasets and an optional number of rows to display the tail of each dataset

- <b>resetidx</b> takes in a list of datasets to reset the indexes of each dataset

In [2]:
def displaysetsH(datasets, amt = 5):
    for dataset in datasets:
        display(dataset.head(amt))
        
def displaysetsT(datasets, amt = 5):
    for dataset in datasets:
        # tail(), not head(), so that the last rows are shown
        display(dataset.tail(amt))
        
def resetidx(datasets):
    for dataset in datasets:
        dataset.reset_index(drop = True, inplace = True)
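As a quick illustration of why `resetidx` is useful: filtering a dataframe keeps the original row labels, so the index has gaps until it is reset. The sketch below repeats the helper so that it runs on its own; the `toy` dataframe is made up for the example and is not taken from Sentiment140.

```python
import pandas as pd

def resetidx(datasets):
    # Same helper as above: renumber each dataframe's rows from 0, in place
    for dataset in datasets:
        dataset.reset_index(drop=True, inplace=True)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'sentiment': [0, 4, 0, 4], 'tweet': ['a', 'b', 'c', 'd']})
positives_only = toy[toy['sentiment'] == 4].copy()

index_before = positives_only.index.tolist()  # labels carried over from toy
resetidx([positives_only])
index_after = positives_only.index.tolist()   # renumbered from 0
```

Because the helper works in place, it returns nothing; the dataframes in the list are modified directly.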

#### 2.1.3 Importing dataset
Because the full dataset is too large to analyse here, a sample of 2000 tweets is used: the first 1000 negative and the first 1000 positive tweets found in the file, so that both classes are equally represented. We first look at the first entry to check whether the file has headers.

Since there are no headers, we add the condition 'header = None' when creating the dataframe. Column headers are then assigned to aid in future analysis. The columns 'tweet_id', 'date', 'flag' and 'user' are removed as the project focuses on sentiment analysis. This modified dataset is then stored as a new CSV file.

(<i>the headers are modified based on looking at the contents from Kaggle</i>)
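The effect of 'header = None' can be seen on a small in-memory example. This is a minimal sketch: the two rows in `raw` are made up to mimic the six-column layout, and `names=` is used here as one way of assigning the column headers at read time.

```python
from io import StringIO
import pandas as pd

# Two made-up rows in the same six-column layout as the dataset
raw = (
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","first tweet"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","second tweet"\n'
)
headers = ['sentiment', 'tweet_id', 'date', 'flag', 'user', 'tweet']

# Default behaviour: the first data row is consumed as the header
default_df = pd.read_csv(StringIO(raw))

# header=None keeps every row as data; names= assigns the column labels
named_df = pd.read_csv(StringIO(raw), header=None, names=headers)
```

With the default settings one tweet is silently lost to the header row, which is why the check on the first entry matters.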

In [7]:
df = pd.read_csv('datasets/sentiment140_dataset.csv', nrows = 1)
df

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...


In [9]:
filename = 'datasets/sentiment140_dataset.csv'
headers = ['sentiment', 'tweet_id', 'date', 'flag', 'user', 'tweet']

random_entries = []
positives = 1000  # positive tweets still to collect
negatives = 1000  # negative tweets still to collect
neutrals = 0      # no neutral tweets are collected

with open(filename, 'r', encoding='latin-1') as file:
    # Skip the first line (shown as the header row in the preview above)
    next(file, None)
    
    for line in file:
        sentiment = int(line.split(",")[0].strip("\""))  # Extract and strip the sentiment value
        
        if (sentiment == 0 and negatives > 0) or (sentiment == 2 and neutrals > 0) or (sentiment == 4 and positives > 0):
            random_entries.append(line)
            
            if sentiment == 0:
                negatives -= 1
            elif sentiment == 2:
                neutrals -= 1
            elif sentiment == 4:
                positives -= 1
                
            if negatives == 0 and neutrals == 0 and positives == 0:
                break

tweets_df = pd.read_csv(StringIO("".join(random_entries)), header = None)
headers = ['sentiment', 'tweet_id', 'date', 'flag', 'user', 'tweet']
tweets_df.columns = headers
columns_to_drop = ['tweet_id', 'date', 'flag', 'user']
tweets_df.drop(columns = columns_to_drop, inplace = True)
tweets_df

Unnamed: 0,sentiment,tweet
0,0,is upset that he can't update his Facebook by ...
1,0,@Kenichan I dived many times for the ball. Man...
2,0,my whole body feels itchy and like its on fire
3,0,"@nationwideclass no, it's not behaving at all...."
4,0,@Kwesidei not the whole crew
...,...,...
1995,4,I have this strange desire to go to confession...
1996,4,@i_reporter answer sent in dm. try it
1997,4,@brooklynunion cuz ur 3pm is my 9am and Id be ...
1998,4,@littrellfans Its all good. Just figured you w...
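For comparison, the same "first n entries per class" selection can be sketched with pandas alone. This is an alternative under stated assumptions, not the method used above: it needs the whole dataframe in memory, whereas the loop above stops reading the file once enough tweets are collected. The function name `first_n_per_class` and the `toy` data are hypothetical.

```python
import pandas as pd

def first_n_per_class(df, label_col, n):
    # groupby(...).head(n) keeps the first n rows of each label, in file order
    return df.groupby(label_col).head(n).reset_index(drop=True)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'sentiment': [0, 0, 0, 4, 4, 4],
    'tweet': ['n1', 'n2', 'n3', 'p1', 'p2', 'p3'],
})
balanced = first_n_per_class(toy, 'sentiment', 2)
```

Both approaches take the earliest tweets of each class rather than a random draw, so the sample reflects the order of the file.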


#### 2.1.4 Saving the dataset
The tweets_df dataframe is now saved as a new CSV file for easier access. A simple path check ensures that an existing file is not overwritten.

In [10]:
file_path = 'datasets/selected_sentiment140_dataset.csv'

if not os.path.exists(file_path):
    tweets_df.to_csv(file_path, index = False)
    print('File saved successfully.')
else:
    print('File already exists.')
   
tweets_df = pd.read_csv(file_path)

File saved successfully.
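The path check can be wrapped in a small function and tried out in a temporary directory. This is a minimal sketch: `save_if_missing` is a hypothetical helper name and the `toy` dataframe is made up.

```python
import os
import tempfile
import pandas as pd

def save_if_missing(df, file_path):
    """Write df to file_path only when no file exists there; return True if written."""
    if os.path.exists(file_path):
        return False
    df.to_csv(file_path, index=False)
    return True

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'sentiment': [0, 4], 'tweet': ['bad day', 'good day']})
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    first_call = save_if_missing(toy, path)    # file is created
    second_call = save_if_missing(toy, path)   # existing file is left untouched
    reloaded = pd.read_csv(path)
```

One consequence of this design is that a file saved from an earlier selection is reused as-is, so it has to be deleted manually if the sampling step changes.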


#### 2.1.5 Checking for duplicates and null entries
To ensure that the analysis is sound, all 2000 selected entries should be unique. To check this, a 'duplicated' column holding the output of df.duplicated() is added to a temporary copy of the dataset, and only the rows where 'duplicated' is True are printed.

Based on the output, it is seen that there are no duplicated Tweets.

After confirming that all entries are unique, we check that no entry contains NULL values, because sentiment analysis cannot be done on empty texts.

Based on the output, it is seen that there are no NULL entries.

In [14]:
dupe_checker = tweets_df.copy()
duplicates = dupe_checker.duplicated()
dupe_checker['duplicated'] = duplicates
duplicated = dupe_checker[dupe_checker['duplicated']]
duplicated

Unnamed: 0,sentiment,tweet,duplicated


In [13]:
null_checker = tweets_df.copy()
null_checker.isnull().sum()

sentiment    0
tweet        0
dtype: int64
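Since the real sample contains neither duplicates nor nulls, the two checks are easier to see on data that does. The sketch below uses a made-up `toy` dataframe with one repeated row and one missing tweet, and also shows how such rows could be dropped if they did occur.

```python
import pandas as pd

# Hypothetical toy data with one duplicate row and one missing tweet
toy = pd.DataFrame({
    'sentiment': [0, 0, 4, 4],
    'tweet': ['so tired', 'so tired', 'great day', None],
})

duplicate_rows = toy[toy.duplicated()]   # later copies of a fully repeated row
null_counts = toy.isnull().sum()         # number of nulls in each column
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)
```

Note that duplicated() compares whole rows here, so the same tweet text with two different sentiment labels would not be flagged; passing `subset=['tweet']` would compare the text only.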

#### 2.1.6 Basic statistics
To ensure that the data is suitable for analysis, the numbers of positive, neutral and negative sentiments should be as balanced as possible (sentiment values: 0 = negative, 2 = neutral, 4 = positive).

Based on the output, the sample is balanced: it contains 1000 positive and 1000 negative tweets.

In [17]:
sentiment_checker = tweets_df.copy()
print('Number of positive sentiments:', sentiment_checker[sentiment_checker['sentiment'] == 4].count())
print('Number of negative sentiments:', sentiment_checker[sentiment_checker['sentiment'] == 0].count())

Number of positive sentiments: sentiment    1000
tweet        1000
dtype: int64
Number of negative sentiments: sentiment    1000
tweet        1000
dtype: int64
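A more compact way to read off the class balance is to map the sentiment codes to their names and count them in one step. This is a sketch on made-up data; the `LABELS` dictionary simply restates the sentiment values given above.

```python
import pandas as pd

# Sentiment codes as described for the dataset
LABELS = {0: 'negative', 2: 'neutral', 4: 'positive'}

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'sentiment': [0, 0, 4, 4, 4]})
counts = toy['sentiment'].map(LABELS).value_counts()
counts_dict = counts.to_dict()
```

Unlike calling count() on the filtered dataframe, this gives one number per class rather than one number per column.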


#### 2.1.7 Basic text processing
To begin analysing the text, some basic text processing is required.

TO DO LIST:
- Describe the text representation (e.g., bag of words, word embedding, etc.) **[not done]**
- Describe the vocabulary and file type/format, e.g. CSV file. [**not done**]
- any pre-processing steps you have applied and why they were needed (e.g. tokenization [**done**], lemmatization [**did regex**]).

- <b>Removing stop words</b>: Stop words are very common in human language. These words, including **determiners** (e.g. the, a, this, my), **conjunctions** (e.g. and, or, nor, but, whereas) and **prepositions** (e.g. against, along, at, before), connect thoughts and speech to form grammatically accurate sentences. While important for communication, they carry little of the sentiment that is valuable to this project and therefore introduce noise. Removing them lets the process focus on the words that contribute more to the sentiment of a tweet.

* Tweets are converted to lowercase before tokenisation, as all stop words in the list are in lowercase.

In [None]:
# Downloading stopword corpus
nltk.download('stopwords')
# Get stopword list
stop_words = set(stopwords.words('english'))

In [None]:
# Checking removal works on a test text
test_tweet = ['This is a test that Stopword removal works.']

def filter_text(tweets):
    for tweet in tweets:
        tokens = tweet.lower().split()
        # Removing each token if part of stop_words
        filtered_tokens = [token for token in tokens if token not in stop_words]
        print('filtered_tokens', filtered_tokens)
        
filter_text(test_tweet)

In [None]:
# Before improvement of stopwords
tweets = tweets_df['tweet'].tolist()
filter_text(tweets)
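The function above prints its result, which is convenient for inspection. For later steps it may be handier to return the tokens instead, as in the sketch below. To keep the example runnable without downloading the NLTK corpus, `STOP_WORDS` is a small hand-written stand-in and not the NLTK list; `remove_stop_words` is a hypothetical name.

```python
# Small stand-in for NLTK's English stop word list, for illustration only
STOP_WORDS = {'this', 'is', 'a', 'that', 'the', 'and'}

def remove_stop_words(tweet, stop_words=STOP_WORDS):
    # Lowercase first so that 'This' matches the lowercase entry 'this'
    tokens = tweet.lower().split()
    return [token for token in tokens if token not in stop_words]

filtered = remove_stop_words('This is a test that Stopword removal works.')
```

The last token keeps its full stop ('works.') because splitting on whitespace does not separate punctuation from words.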

Removing stop words has shortened the texts. However, the filtered output shows that symbols are still treated as tokens. The filter is therefore extended to also **remove symbols**. Because splitting on whitespace keeps a word and its trailing symbol together as a single token (e.g. 'goodbye!'), only tokens made up entirely of symbols (e.g. '!') are removed.

In [None]:
stop_words = set(stopwords.words('english'))
symbols = set(string.punctuation)

# Checking removal works on a test text
test_tweet2 = ['modified test that StopwOrd removal works. It will remove @testtry']

def filter_text2(tweets):
    for tweet in tweets:
        tokens = tweet.lower().split()
        
        filtered_tokens = [token for token in tokens if not (
            # Removing each token if part of symbols
            all(char in symbols for char in token)
            # Removing each token if part of stop_words
            or token in stop_words
        )]
        
        print('filtered_tokens', filtered_tokens)
        
filter_text2(test_tweet2)

In [None]:
# After improvement of stopwords
tweets = tweets_df['tweet'].tolist()
filter_text2(tweets)
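The symbol test used in the filter can be looked at in isolation. In this sketch `is_symbol_only` is a hypothetical helper name and the example sentence is made up.

```python
import string

SYMBOLS = set(string.punctuation)

def is_symbol_only(token):
    # True when every character of the token is ASCII punctuation
    return all(char in SYMBOLS for char in token)

tokens = 'goodbye! - see you :) ...'.split()
kept = [token for token in tokens if not is_symbol_only(token)]
```

As the example shows, emoticons such as ':)' consist only of punctuation and are dropped as well, which may be worth revisiting since they can carry sentiment.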

- <b>Regular expressions (Regex)</b>: A regular expression describes a pattern of text rather than a fixed string, which lets us handle many variations of the same kind of content with a single rule. We use this to remove **links and tagging** (user mentions).

In [None]:
stop_words = set(stopwords.words('english'))
symbols = set(string.punctuation)

# Checking removal works on a test text
test_tweet3 = ['modified test that stopword removal works. It will remove @testtry', 
               'http not removed but https://www.youtube.com/ and http://www.youtube.com/ removed']

def filter_text3(tweets):
    for tweet in tweets:
        # Removing user mentions and URLs using a regular expression
        tweet = re.sub(r'@\S+|https?://\S+', '', tweet)
        tokens = tweet.lower().split()
        
        filtered_tokens = [token for token in tokens if not (
            # Removing each token if part of symbols
            all(char in symbols for char in token)
            # Removing each token if part of stop_words
            or token in stop_words
        )]
        
        print('filtered_tokens', filtered_tokens)
        
filter_text3(test_tweet3)

In [None]:
# After removal of links and mentions
tweets = tweets_df['tweet'].tolist()
filter_text3(tweets)
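The pattern itself can be checked on a single string. This is a minimal sketch: `strip_mentions_and_links` is a hypothetical name and the example tweet is made up.

```python
import re

# Mentions start with '@'; links start with http:// or https://
MENTION_OR_URL = re.compile(r'@\S+|https?://\S+')

def strip_mentions_and_links(tweet):
    cleaned = MENTION_OR_URL.sub('', tweet)
    # Collapse the gaps left behind by the removed pieces
    return ' '.join(cleaned.split())

example = strip_mentions_and_links('@someone http is kept but https://example.com/ is not')
```

Because `@\S+` matches anywhere in the text, it would also cut the part of an email address from the '@' onwards; whether that matters depends on how often addresses appear in the tweets.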

Users often add **repeated trailing symbols** or letters to express emotion. We use Regex to reduce repeated symbols to a single occurrence. When shortening repeated letters, we have to consider whether the word has been stretched or genuinely contains repeated letters (e.g. kangaroo).

As common English words do not appear to contain three consecutive identical letters, repeated letters are capped at a maximum of two. While this is not a foolproof way of resolving stretched words, it is a conservative way to improve classification.

In [None]:
stop_words = set(stopwords.words('english'))
symbols = set(string.punctuation)

# Checking removal works on a test text
test_tweet4 = ['modified test that stopword removal works. It will remove @testtry', 
               'http not removed but https://www.youtube.com/ and http://www.youtube.com/ removed',
               'The cleaning will show this!!!!!!!!?!!!?!!!!!! with 1 exclaimation',
               'kaaaaaaangarooooooooooooooooooooooooooo will be cleaned up']

def filter_text4(tweets):
    for tweet in tweets:
        # Removing user mentions and URLs using Regex
        tweet = re.sub(r'@\S+|https?://\S+', '', tweet)
        # Collapsing runs of symbols into a single symbol using Regex
        tweet = re.sub(r'([!@#$%^&*.]){2,}', r'\1', tweet)
        # Capping repeated letters at two occurrences
        tweet = re.sub(r'([A-Za-z])\1+', r'\1\1', tweet)
        tokens = tweet.lower().split()
        
        filtered_tokens = [token for token in tokens if not (
            # Removing each token if part of symbols
            all(char in symbols for char in token)
            # Removing each token if part of stop_words
            or token in stop_words
        )]
        
        print('filtered_tokens', filtered_tokens)
        
filter_text4(test_tweet4)

In [None]:
# After collapsing repeated symbols and letters
tweets = tweets_df['tweet'].tolist()
filter_text4(tweets)
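The two collapsing rules can also be written with a backreference for the symbols, as in the sketch below. This is a variant, not the code used above: the pattern in filter_text4 matches a run of any of the listed symbols and keeps the last one, while this version only merges a mark that is repeated directly after itself, and it covers every punctuation character rather than a fixed list. `collapse_repeats` is a hypothetical name.

```python
import re

def collapse_repeats(text):
    # Runs of the same punctuation mark become a single mark
    text = re.sub(r'([^\w\s])\1+', r'\1', text)
    # Runs of three or more of the same letter are cut down to two
    return re.sub(r'([A-Za-z])\1{2,}', r'\1\1', text)

stretched = collapse_repeats('sooooo good!!!!')
unchanged = collapse_repeats('kangaroo')
mixed = collapse_repeats('this!!!!!!!!?!!!?!!!!!!')
```

A stretched word is shortened but not restored ('sooooo' becomes 'soo', not 'so'), and a mixed run such as '!!!?!!!' keeps one mark per change of symbol, so some variation remains after cleaning.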

### 2.2 Baseline performance
(writeup not needed)

### 2.3 Classification approach
(writeup not needed)

### 2.4 Coding style
(writeup not needed)

## III. Conclusions

### 3.1 Evaluation

### 3.2 Summary and conclusions

## Temporary reference list
* to use citation generator

- Van Hee, C., Jacobs, G., Emmery, C., Desmet, B., Lefever, E., Verhoeven, B., De Pauw, G., Daelemans, W. and Hoste, V. (2018). Automatic detection of cyberbullying in social media text. PLOS ONE, [online] 13(10), p.e0203794. doi:https://doi.org/10.1371/journal.pone.0203794.
- Khairy, M., Mahmoud, T.M. and Abd-El-Hafeez, T. (2021). Automatic Detection of Cyberbullying and Abusive Language in Arabic Content on Social Networks: A Survey. Procedia Computer Science, [online] 189, pp.156–166. doi:https://doi.org/10.1016/j.procs.2021.05.080.