<h2><center>NLP Text Classification</center></h2>

## I. Introduction

### 1.1 Domain-specific area

### 1.2 Objectives

### 1.3 Dataset

### 1.4 Evaluation methodology

## II. Implementation

### 2.1 Pre-processing
(writeup not needed)
<br>**to be removed**: Convert/store the dataset locally and preprocess the data. Describe the text representation
(e.g., bag of words, word embedding, etc.) and any pre-processing steps you have applied
and why they were needed (e.g. tokenization, lemmatization). Describe the vocabulary and
file type/format, e.g. CSV file.

#### Acquiring dataset
The dataset, a collection of Tweets, was acquired from Kaggle by downloading the CSV file. The author of this dataset is Saurabh Shahane. The code for importing the dataset is shown below:

#### Importing libraries
- <b>pandas library</b> was imported to load and handle the dataset in Python. It is used to read from and write to CSV files and to process messy real-world data into a proper format

- <b>numpy library</b> was imported for numerical operations on arrays, such as detecting and replacing infinite values

- <b>matplotlib library</b> was imported to plot the data and represent it graphically [not used]

- <b>os library</b> was imported to use operating-system-dependent functionality, more specifically to check whether the sampled CSV file already exists before saving it

- <b>stopwords corpus</b> (from nltk) was imported to provide a list of the most common English words to aid in stop word removal

In [46]:
# dataframes
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
import os

# text processing and analysis
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords

# analysing text corpora
from collections import Counter

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sbgka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Importing dataset
To check that the dataset is ready for cleaning and analysis, we will look at the first entry to see whether the file has headers. Since it does, and the dataset contains only the relevant information (the text and its sentiment), the headers will simply be renamed to "tweet" and "sentiment" for better comprehension.

In [47]:
# The following block of code was self-written
df = pd.read_csv('datasets/Twitter_Data_Sentiments.csv', nrows = 1)
df

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1


In [48]:
# The following block of code was self-written
tweets_df = pd.read_csv('datasets/Twitter_Data_Sentiments.csv')
tweets_df.columns = ['tweet', 'sentiment']
tweets_df.head()

Unnamed: 0,tweet,sentiment
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0


#### Removing NaN or infinite entries
To ensure that the dataset contains only required information, we will check and remove any entries that contain missing or infinite values.
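
As a minimal sketch of this check on a toy DataFrame (the values below are made up for illustration, not taken from the dataset), `np.isfinite` is False for both NaN and infinite values, so a single boolean mask keeps only the usable sentiment rows:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one NaN and one infinite sentiment
toy_df = pd.DataFrame({
    'tweet': ['good day', 'bad day', 'no label', 'broken label'],
    'sentiment': [1.0, -1.0, np.nan, np.inf],
})

# np.isfinite is False for NaN, +inf and -inf, so one mask covers both cases
finite_mask = np.isfinite(toy_df['sentiment'])
clean_df = toy_df[finite_mask].reset_index(drop=True)

# Rows with missing text would still need a separate dropna on 'tweet'
clean_df = clean_df.dropna(subset=['tweet'])
print(clean_df)
```

This is an alternative to the two-step approach of dropping NaN values and then replacing infinite values with NaN; both should leave the same rows.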

#### Removing duplicated entries
To ensure that the analysis is beneficial, all entries should be unique. A 'duplicated' column, holding the output of the duplicated() function, will be added to a temporary copy of the dataset, and only the rows where 'duplicated' is True will be displayed. Based on the output, there are no duplicated Tweets.
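
A minimal sketch of the duplicate check on toy data (hypothetical rows): `duplicated()` compares whole rows by default, while passing `subset=['tweet']` also flags the same text appearing with a different sentiment label. Whether that stricter check is wanted here is an assumption, not something this analysis performs.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'tweet': ['same text', 'same text', 'same text', 'other text'],
    'sentiment': [1.0, 1.0, -1.0, 0.0],
})

# Whole-row comparison: only the second row repeats an earlier row exactly
full_row_dupes = toy_df[toy_df.duplicated()]

# Text-only comparison: rows 1 and 2 both repeat the text of row 0
text_dupes = toy_df[toy_df.duplicated(subset=['tweet'])]

print(len(full_row_dupes), len(text_dupes))
```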

#### Reducing sample size
To address computational constraints and the imbalanced nature of the dataset (positives to negatives having an approximate ratio of 2:1), the analysis will be limited to a subset of 1000 instances for each sentiment category.
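
The per-category sampling can also be sketched with `groupby(...).sample(...)`, available in pandas 1.1 and later. The toy frame and the sample size of 2 below are illustrative assumptions; this shows an equivalent pattern and will not necessarily select the same rows as filtering each category separately.

```python
import pandas as pd

# Toy imbalanced data (hypothetical): 6 positive, 4 neutral, 3 negative
toy_df = pd.DataFrame({
    'tweet': [f'tweet {i}' for i in range(13)],
    'sentiment': [1.0] * 6 + [0.0] * 4 + [-1.0] * 3,
})

# Draw the same number of rows from every sentiment category
balanced_df = (
    toy_df.groupby('sentiment')
          .sample(n=2, random_state=10)
          .reset_index(drop=True)
)

print(balanced_df['sentiment'].value_counts())
```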

<b>The following blocks of code were self-written with references</b>
<br>replace() function: https://sparkbyexamples.com/pandas/pandas-drop-infinite-values-from-dataframe/?expand_article=1
<br>checking for duplicates: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
<br>creation of sample from large dataframe: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html

In [49]:
sentiments = tweets_df['sentiment'].value_counts(dropna = False)
infinites = np.isinf(tweets_df['sentiment']).sum()
sentiments['Infinite'] = infinites
print("sentiments before cleaning:\n\n", sentiments)

# Removing NaN values
tweets_df.dropna(inplace = True)
# Replacing and removing infinite values
tweets_df.replace([np.inf, -np.inf], np.nan, inplace = True)
tweets_df.dropna(inplace = True, axis = 0)

sentiments before cleaning:

 sentiment
1.0         72250
0.0         55213
-1.0        35510
NaN             7
Infinite        0
Name: count, dtype: int64


In [50]:
dupe_checker = tweets_df.copy()
duplicates = dupe_checker.duplicated()
dupe_checker['duplicated'] = duplicates
duplicated = dupe_checker[dupe_checker['duplicated'] == True]
duplicated

Unnamed: 0,tweet,sentiment,duplicated


In [51]:
# Storing 1000 entries of each sentiment
positives = tweets_df[tweets_df['sentiment'] == 1].sample(n = 1000, random_state = 10)
neutrals = tweets_df[tweets_df['sentiment'] == 0].sample(n = 1000, random_state = 10)
negatives = tweets_df[tweets_df['sentiment'] == -1].sample(n = 1000, random_state = 10)

# Concatenating the 3 sentiments together
sampled_tweets_df = pd.concat([positives, neutrals, negatives])
# Resetting index
sampled_tweets_df.reset_index(drop = True, inplace = True)
print('positive sentiments:', sampled_tweets_df[sampled_tweets_df['sentiment'] == 1].shape[0])
print('neutral sentiments:', sampled_tweets_df[sampled_tweets_df['sentiment'] == 0].shape[0])
print('negative sentiments:', sampled_tweets_df[sampled_tweets_df['sentiment'] == -1].shape[0])

positive sentiments: 1000
neutral sentiments: 1000
negative sentiments: 1000


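The sampled dataset (sampled_tweets_df) will now be saved to a new CSV file for easier access. To avoid overwriting a previously saved sample, the file is only written if it does not already exist at the given path. The saved file is then read back into tweets_df.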

In [52]:

file_path = 'datasets/sample3000_Twitter_Data_Sentiments.csv'

if not os.path.exists(file_path):
    sampled_tweets_df.to_csv(file_path, index = False)
    print('File saved successfully.')
else:
    print('File already exists.')
   
tweets_df = pd.read_csv(file_path)

File already exists.


#### Basic text processing
To begin analysing the text, some basic text processing is required.

TO DO LIST:
- Describe the text representation (e.g., bag of words, word embedding, etc.) **[not done]**
- Describe the vocabulary and file type/format, e.g. CSV file. [**not done**]
- any pre-processing steps you have applied and why they were needed (e.g. tokenization [**done**], lemmatization [**did regex**]).

- <b>Removing stop words</b>: Stop words are very common in human language. These words, including **determiners** (e.g. the, a, this, my), **conjunctions** (e.g. and, or, nor, but, whereas) and **prepositions** (e.g. against, along, at, before), connect thoughts and speech to form grammatically correct, cohesive sentences. While important for communication, they carry little sentiment of value to this project and therefore introduce noise. Removing them helps the analysis focus on the words that contribute more to the sentiment of a Tweet.

* Tokens will be converted to lowercase before filtering, as all entries in the stop word list are lowercase.
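
A small sketch of why the tokenization method matters: `str.split()` keeps punctuation attached to words, so a token such as `it.` does not match the stop word `it`. The regular expression and the five-word stop list below are illustrative assumptions standing in for the NLTK list. The Tweets in this dataset appear to have punctuation already stripped, so the effect may be small here.

```python
import re

# Tiny stand-in for the NLTK English stop word list (assumption for this sketch)
demo_stop_words = {'this', 'is', 'a', 'that', 'it'}

text = 'This is a test. That is it.'

# Whitespace split: punctuation stays attached, so 'it.' slips through
split_tokens = text.lower().split()
split_filtered = [t for t in split_tokens if t not in demo_stop_words]

# Regex tokenization: keep only runs of letters, digits and apostrophes
regex_tokens = re.findall(r"[a-z0-9']+", text.lower())
regex_filtered = [t for t in regex_tokens if t not in demo_stop_words]

print(split_filtered)
print(regex_filtered)
```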

<b>The following blocks of code were self-written with reference</b>
<br>stopwords: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

In [53]:
# Downloading stopword corpus
nltk.download('stopwords')
# Get stopword list
stop_words = set(stopwords.words('english'))

# Checking removal works on a test text
test_tweet = 'This is a test that Stopword removal works.'

tokens = test_tweet.lower().split()
# Removing each token if part of stop_words
filtered_tokens = [token for token in tokens if token not in stop_words]
print("Filtered from", len(tokens), "to", len(filtered_tokens))

Filtered from 8 to 4


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sbgka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [54]:
tweets = tweets_df['tweet'].tolist()

filtered_tweets = []
for tweet in tweets:
    tokens = tweet.lower().split()
    # Removing each token if part of stop_words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    filtered_tweets.append(filtered_tokens)

tweets_df['filtered_tweet'] = filtered_tweets
tweets_df.head()

Unnamed: 0,tweet,sentiment,filtered_tweet
0,infinity cant chased two nations theory has be...,1.0,"[infinity, cant, chased, two, nations, theory,..."
1,years dynamic modi rule and fear losing electi...,1.0,"[years, dynamic, modi, rule, fear, losing, ele..."
2,terrorists pakistan want lose opposition win m...,1.0,"[terrorists, pakistan, want, lose, opposition,..."
3,the entire panel opposition panel what can exp...,1.0,"[entire, panel, opposition, panel, expect, deb..."
4,theres one disgustinggundaghaleezghatyascum yo...,1.0,"[theres, one, disgustinggundaghaleezghatyascum..."


- <b>Regular expressions (Regex)</b>: consider if i need this

Removing stop words has shortened the texts. Because the source column was named "clean_text", and because the output above shows highly unusual tokens (e.g. "disgustinggundaghaleezghatyascum"), the text will now be examined more closely.

#### Evaluation of words
1. Using Counter from the collections module, the most frequently used words can be displayed. The word "modi" occurs 2831 times across the 3000 sampled Tweets, so it carries little discriminating information and is treated as a stop word. It will be removed from all entries.
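
Since `Counter` over all tokens counts occurrences rather than Tweets, a word repeated within one Tweet is counted more than once. A minimal sketch on toy tokens (hypothetical data) shows how to count the number of Tweets containing each word instead, by de-duplicating the tokens of each Tweet first:

```python
from collections import Counter

toy_tweets = [
    ['modi', 'wins', 'modi', 'again'],
    ['india', 'votes'],
    ['modi', 'india'],
]

# Term frequency: every occurrence counts
term_counts = Counter(word for tweet in toy_tweets for word in tweet)

# Document frequency: each Tweet contributes at most once per word
doc_counts = Counter(word for tweet in toy_tweets for word in set(tweet))

print(term_counts['modi'], doc_counts['modi'])
```

The document frequency is the figure that supports a statement of the form "the word appears in N of 3000 Tweets".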

<b>The following blocks of code were self-written with reference</b>
<br>collections library: https://www.digitalocean.com/community/tutorials/python-counter-python-collections-counter

In [55]:
# Concatenate all tokenized words together
tokenized_words = [word for tweet in tweets_df['filtered_tweet'] for word in tweet]
# Count word frequency
word_counts = Counter(tokenized_words)

# Displays the top 5 most used words and their frequencies
most_used_words = word_counts.most_common(5)
for word, count in most_used_words:
    print(f"{word}: {count}")

modi: 2831
india: 499
bjp: 253
congress: 231
people: 230


In [61]:
# Iterating through rows to check for target word
for index, row in tweets_df.iterrows():
    tweet_words = row['filtered_tweet']
    tweet_words = [word for word in tweet_words if word != 'modi']
    tweets_df.at[index, 'filtered_tweet'] = tweet_words
    
# Check that stopword "modi" is not part of frequently used words
tokenized_words = [word for tweet in tweets_df['filtered_tweet'] for word in tweet]
word_counts = Counter(tokenized_words)
most_used_words = word_counts.most_common(5)
for word, count in most_used_words:
    print(f"{word}: {count}")

india: 499
bjp: 253
congress: 231
people: 230
like: 222
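
An alternative sketch: a corpus-specific stop word can be added to the stop word set up front, so that a single filtering pass removes it together with the standard stop words. The small set below is a stand-in for the NLTK list and the Tweets are toy data, both assumptions for illustration.

```python
# Stand-in for the NLTK English stop words (assumption for this sketch)
base_stop_words = {'the', 'and', 'is', 'for'}

# Extend the set with corpus-specific words without mutating the original
custom_stop_words = base_stop_words | {'modi'}

toy_tweets = [
    'vote for modi and the bjp',
    'modi is the topic',
]

# One pass removes both standard and corpus-specific stop words
filtered = [
    [tok for tok in tweet.lower().split() if tok not in custom_stop_words]
    for tweet in toy_tweets
]
print(filtered)
```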


### 2.2 Baseline performance
(writeup not needed)

### 2.3 Classification approach
(writeup not needed)

### 2.4 Coding style
(writeup not needed)

## III. Conclusions

### 3.1 Evaluation

### 3.2 Summary and conclusions

## Temporary reference list
* to use citation generator

- 