<h2><center>NLP Text Classification</center></h2>

## I. Introduction

### 1.1 Domain-specific area
This project analyses textual data from Twitter to detect and classify threatening or harmful content using sentiment analysis techniques. The aim is to give the cybersecurity industry a tool that is trained on a corpus of text and can serve as the basis for a strong detection system.

### 1.2 Objectives
Popular detection algorithms centre on cyberbullying on social media (Van Hee et al., 2018), so this project aims to widen the scope of detection. Existing approaches focus mainly on terrorism and cyberbullying, yet cybersecurity encompasses more than these two areas (Khairy, Mahmoud and Abd-El-Hafeez, 2021). While the full security and safety of users cannot be guaranteed, a broader detector would contribute valuable insights for future development.

### 1.3 Dataset
To begin this project, a large corpus of textual data is required. After reviewing several large datasets of tweets, the Sentiment140 dataset, available on Kaggle, was chosen as the best fit for this project. It contains 1.6 million tweets extracted using the Twitter API, each labelled by the dataset's authors as having positive, neutral or negative sentiment. These labels are useful for training the algorithm to categorise harmful texts.

Each record consists of six fields: the target (the sentiment of the text), the tweet ID, the date, the flag (the query used, if any; this column is removed when the data is first loaded), the username, and the text of the tweet.
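
As a rough illustration of how the target field can be read, the sketch below maps numeric codes to sentiment names. It assumes the coding given in the dataset's Kaggle description (0 = negative, 2 = neutral, 4 = positive); the names `SENTIMENT_LABELS` and `label_for` are hypothetical and not part of the project code.

```python
# Assumed coding, taken from the dataset description on Kaggle:
# 0 = negative, 2 = neutral, 4 = positive.
SENTIMENT_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def label_for(target):
    """Return the sentiment name for a numeric target, or 'unknown'."""
    return SENTIMENT_LABELS.get(int(target), "unknown")


print([label_for(code) for code in (0, 2, 4)])
```

Any code outside the assumed mapping falls back to 'unknown' rather than raising an error, which makes unexpected values easy to spot during exploration.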

### 1.4 Evaluation methodology

## II. Implementation

### 2.1 Pre-processing
(writeup not needed)
<br>Convert/store the dataframe locally and preprocess the data. Describe the text representation
(e.g., bag of words, word embedding, etc.) and any pre-processing steps you have applied
and why they were needed (e.g. tokenization, lemmatization). Describe the vocabulary and
file type/format, e.g. CSV file.
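
The project has not yet committed to a text representation, so the following is only a minimal sketch of one option named above: tokenization followed by a bag-of-words count. It uses the standard library only, does not perform lemmatization, and the names `tokenize` and `bag_of_words` as well as the regular expressions are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed cleaning rules: drop URLs and @mentions, keep lowercase words.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(tweet):
    """Lowercase a tweet, remove URLs and @mentions, and split it into word tokens."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return TOKEN_PATTERN.findall(text)


def bag_of_words(tweet):
    """Count how often each token occurs in a single tweet."""
    return Counter(tokenize(tweet))


sample = "@Kwesidei not the whole crew"
print(tokenize(sample))
```

URLs and mentions are removed here because they identify a link or a user rather than carry sentiment; whether that holds for harmful-content detection is a design choice the project would need to confirm.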

#### Acquiring the dataset
The collection of tweets was acquired from Kaggle by downloading the CSV file. The author of this dataset is Μαριος Μιχαηλιδης KazAnova. The code for importing it is shown in section 2.1.3.

#### 2.1.1 Importing libraries
- <b>pandas library</b> was imported to load and manipulate dataframes in Python. It reads and writes CSV files and helps clean messy real-world data into a consistent format

- <b>numpy library</b> was imported for numerical and statistical calculations on arrays

- <b>matplotlib library</b> was imported to plot the data and represent it graphically

- <b>os library</b> was imported for operating-system-dependent functionality, specifically to check whether a file already exists before saving the dataframe as a CSV

In [67]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

#### 2.1.2 Creating helper functions
- <b>displaysetsH</b> takes a list of dataframes and an optional number of rows, and displays the head of each dataframe

- <b>displaysetsT</b> takes a list of dataframes and an optional number of rows, and displays the tail of each dataframe

- <b>resetidx</b> takes a list of dataframes and resets the index of each dataframe in place

In [2]:
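from IPython.display import display  # explicit import so the helpers also work outside a notebook

def displaysetsH(dataframes, amt = 5):
    # Show the first `amt` rows of each dataframe
    for dataframe in dataframes:
        display(dataframe.head(amt))

def displaysetsT(dataframes, amt = 5):
    # Show the last `amt` rows of each dataframe
    for dataframe in dataframes:
        display(dataframe.tail(amt))

def resetidx(dataframes):
    # Renumber each dataframe's index from 0, in place
    for dataframe in dataframes:
        dataframe.reset_index(drop = True, inplace = True)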
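
To illustrate why an index reset is useful, the sketch below filters a small toy dataframe and then applies `resetidx`. The toy data and the variable names `toy_df` and `positive_df` are made up for this example.

```python
import pandas as pd


def resetidx(dataframes):
    # Same helper as above: renumber each index from 0, in place.
    for dataframe in dataframes:
        dataframe.reset_index(drop=True, inplace=True)


# Toy data; the values are illustrative only.
toy_df = pd.DataFrame({"sentiment": [0, 4, 0, 4], "tweet": ["a", "b", "c", "d"]})

# Filtering keeps the original row labels, so the index now has gaps.
positive_df = toy_df[toy_df["sentiment"] == 4].copy()
print(list(positive_df.index))

# resetidx works in place, so the caller's dataframe is modified directly.
resetidx([positive_df])
print(list(positive_df.index))
```

Because the reset happens in place, the helper returns nothing and the original row labels are discarded.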

#### 2.1.3 Importing dataframe
Because the full dataset is too large to analyse comfortably, only the first 1,000 rows are loaded.

In [72]:
df = pd.read_csv("datasets/sentiment140_dataset.csv", nrows = 1000)
df

| | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | 1467811372 | Mon Apr 06 22:20:00 PDT 2009 | NO_QUERY | joy_wolf | @Kwesidei not the whole crew |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 0 | 1468055266 | Mon Apr 06 23:28:41 PDT 2009 | NO_QUERY | ohmigosh_dusti | @t_wolfe i miss u too. i'm totally comin back... |
| 996 | 0 | 1468055472 | Mon Apr 06 23:28:43 PDT 2009 | NO_QUERY | tiphaniebrooke | @sniffinglue ohhh. I love it. ps I'm sad we di... |
| 997 | 0 | 1468055604 | Mon Apr 06 23:28:45 PDT 2009 | NO_QUERY | rinahannah | And somehow I still end up in this place |
| 998 | 0 | 1468055791 | Mon Apr 06 23:28:49 PDT 2009 | NO_QUERY | ecjc | @kisluvkis oh that is very sad, poor boy. |
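
The second column of the output above, which holds the sentiment target, is 0 in every row shown, so the first 1,000 rows may not cover every class. One alternative, sketched here as an assumption rather than as part of the project, is to draw a uniform random sample of lines in a single pass using reservoir sampling. The function name `reservoir_sample`, the seed, and the stand-in data are hypothetical.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir


# Stand-in for iterating over the lines of a large CSV file.
fake_lines = [f"row {i}" for i in range(10_000)]
sample = reservoir_sample(fake_lines, k=5)
print(len(sample))
```

With a real file, `lines` could be an open file handle, and the sampled lines could then be parsed, for example by passing them to pandas through `io.StringIO`. Only `k` lines are held in memory at any time.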


#### 2.1.4 Adding headers
The output of df above shows that the CSV file has no header row: pandas has used the first tweet as the column names. The first step is therefore to supply the headers when reading the file, which aids later analysis. The columns 'tweet_id', 'date', 'flag' and 'user' are then removed, as the project focuses on sentiment analysis. The modified dataframe is then stored as a new CSV file.

(<i>the header names are based on the column descriptions on Kaggle</i>)

In [65]:
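# Column names taken from the dataset description on Kaggle
headers = ['sentiment', 'tweet_id', 'date', 'flag', 'user', 'tweet']
# Same file as in 2.1.3; passing names tells pandas the file has no header row
tweets_df = pd.read_csv("datasets/sentiment140_dataset.csv", nrows = 1000, names = headers)
# Keep only the sentiment label and the tweet text
columns_to_drop = ['tweet_id', 'date', 'flag', 'user']
tweets_df.drop(columns = columns_to_drop, inplace = True)
tweets_df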

| | sentiment | tweet |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |
| ... | ... | ... |
| 995 | 0 | @dkoenigs thanks man. I'm so very grateful. ... |
| 996 | 0 | @t_wolfe i miss u too. i'm totally comin back... |
| 997 | 0 | @sniffinglue ohhh. I love it. ps I'm sad we di... |
| 998 | 0 | And somehow I still end up in this place |
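
To see why supplying the column names matters, the sketch below reads a small in-memory sample both ways. The two rows are made up and only mimic the six-column layout; `raw_csv`, `without_names` and `with_names` are hypothetical names.

```python
import io

import pandas as pd

# Two made-up rows in the same six-column layout; not real data.
raw_csv = (
    '0,1,Mon Apr 06 2009,NO_QUERY,user_a,"first tweet"\n'
    '4,2,Mon Apr 06 2009,NO_QUERY,user_b,"second tweet"\n'
)
headers = ['sentiment', 'tweet_id', 'date', 'flag', 'user', 'tweet']

# Default behaviour: the first line is consumed as the header row.
without_names = pd.read_csv(io.StringIO(raw_csv))
print(len(without_names))

# With names=..., every line is treated as data.
with_names = pd.read_csv(io.StringIO(raw_csv), names=headers)
print(len(with_names))
```

The first read loses one record to the header, which matches the 999 indexed rows seen in section 2.1.3 when 1,000 rows were requested.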


In [69]:
file_path = 'datasets/expanded_sentiment140_dataframe.csv'

if not os.path.exists(file_path):
    tweets_df.to_csv(file_path, index = False)
    print('File saved successfully.')
else:
    print('File already exists.')

File saved successfully.
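
The guard above skips the write whenever the file already exists, so a re-run with changed data leaves the old file in place. A minimal sketch of the same pattern with an explicit opt-in to overwrite is given below, using only the standard library; the helper name `save_text_once` and its `overwrite` parameter are hypothetical.

```python
import os
import tempfile


def save_text_once(text, file_path, overwrite=False):
    """Write text to file_path; skip if the file exists unless overwrite is True."""
    if os.path.exists(file_path) and not overwrite:
        return False
    with open(file_path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return True


# Demonstrate the three cases inside a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "example.csv")
    first = save_text_once("sentiment,tweet\n0,hello\n", target)
    second = save_text_once("sentiment,tweet\n4,changed\n", target)
    third = save_text_once("sentiment,tweet\n4,changed\n", target, overwrite=True)
    print(first, second, third)
```

In the notebook, the text to save could come from `tweets_df.to_csv(index = False)`, which returns the CSV as a string when no path is given.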


### 2.2 Baseline performance
(writeup not needed)

### 2.3 Classification approach
(writeup not needed)

### 2.4 Coding style
(writeup not needed)

## III. Conclusions

### 3.1 Evaluation

### 3.2 Summary and conclusions

## Temporary reference list
* to use citation generator

- Van Hee, C., Jacobs, G., Emmery, C., Desmet, B., Lefever, E., Verhoeven, B., De Pauw, G., Daelemans, W. and Hoste, V. (2018). Automatic detection of cyberbullying in social media text. PLOS ONE, [online] 13(10), p.e0203794. doi:https://doi.org/10.1371/journal.pone.0203794.
- Khairy, M., Mahmoud, T.M. and Abd-El-Hafeez, T. (2021). Automatic Detection of Cyberbullying and Abusive Language in Arabic Content on Social Networks: A Survey. Procedia Computer Science, [online] 189, pp.156–166. doi:https://doi.org/10.1016/j.procs.2021.05.080.