## __PROG8245 FINAL PROJECT - SENTIMENT ANALYSIS__
### __GROUP:__ 6
### __TEAM MEMBERS:__
_Praiselin Lydia Gladston_

_Sudharsan Tirumal_

### __IMPORT REQURIED LIBRARIES:__

In [1]:
import praw
import re
import emoji
import pandas as pd
import tkinter as tk

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

In [3]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer as SIA
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

In [5]:
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\prais\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prais\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\prais\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\prais\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### __1. DATA COLLECTION:__

#### __Data collection from Reddit:__

We have created a reddit account and using that user_agent, we will get raw data from sub-reddits that are relevant to our project.

Authenticating with Reddit API:

In [6]:
reddit = praw.Reddit(client_id='XLj-2WpKbNdfpyJXVsCf_g',
                     client_secret='0R4TzS0s-CPg16z92MPL1IyraTPjBg',
                     user_agent='East-Code-7342')

A sentiment analysis system helps businesses improve their product offerings by learning what works and what doesn't. We will go through subreddits that can provide insights on comments that are useful for a business to understand recent trend. From the posts from relevant subreddits, we will collect their titles and perform sentiment analysis on them. Proceeding to collect all the titles into a set.

In [7]:
# List of related subreddits
subreddit_list = ['PoliticsPeopleTwitter', 'trendingsubreddits', 'Discussion', 'CasualConversation']

# List of categories to read
relevant_categories = ['new', 'hot', 'top', 'controversial']

To understand the current trend, we will have to analyze the most recent posts. To achieve that, we will have analyze posts that fall only in that category. We will have to sort the collected data based on the date posted or how popular the content is.

Function to get post titles from reddit:

In [8]:
post_titles = set()

for subreddit in subreddit_list:
    for category in relevant_categories:
        # Get subreddit instance
        subreddit_instance = reddit.subreddit(subreddit)
        
        # Get posts from the specified category
        if category == 'new':
            posts = subreddit_instance.new(limit=2500)
        elif category == 'hot':
            posts = subreddit_instance.hot(limit=2500)
        elif category == 'top':
            posts = subreddit_instance.top(limit=2500)
        elif category == 'controversial':
            posts = subreddit_instance.controversial(limit=2500)
        
        # Extract post titles
        for post in posts:
            post_titles.add(post.title)

In [9]:
print(f"Total post titles collected: {len(post_titles)}")

Total post titles collected: 9732


#### Case normalization:

In [10]:
# Converting to lowercase
titles_in_lowercase = set()

for item in post_titles:
    titles_in_lowercase.add(item.lower())

#### Special characters removal:

In [11]:
# Function to remove special characters
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Clean the titles and create a new set
cleaned_titles = {remove_special_characters(item) for item in titles_in_lowercase}

# Counting the number of titles with special characters
count_special = sum(bool(re.search(r'[^a-zA-Z0-9\s]', item)) for item in titles_in_lowercase)

print(f"Number of titles corrected: {count_special}")

Number of titles corrected: 7821


#### Null entries removal:

In [12]:
# Removing null entries
cleaned_titles = set(filter(None, cleaned_titles))

#### __Data annotation:__

As per requirement, data should be distinguished into atleast 3 classes. We will have the classes 'positive', 'negative' and 'neutral'. We're using Natural Language Toolkit library's SentimentIntensityAnalyzer for annotation.

In [13]:
sia = SIA()
annotated_titles = []
for title in cleaned_titles:
    score = sia.polarity_scores(title)
    compound = score['compound']
    
    if compound >= 0.05:
        sentiment = 'positive'
    elif compound <= -0.05:
        sentiment = 'negative'
    else:
        sentiment = 'neutral'

    annotated_titles.append((title, sentiment))

# Print a sample of annotated titles
for annotation in annotated_titles[:5]:
    print(annotation)

('trending subreddits for 20141022 rinstantregret rexcel rclimbing rnorwaypics rcatsridingroombas', 'neutral')
('is it just me or is the japanese culture very unsettling specifically their urban legends and folklore', 'neutral')
('why do so many good threads get downvoted on reddit', 'positive')
('fine ill do it myself', 'negative')
('muriqa', 'neutral')


### Converting annotated data into DataFrame:

In [14]:
# Convert into DataFrame
posts_df = pd.DataFrame.from_records(annotated_titles, columns=['Title', 'Category'])
posts_df.head()

Unnamed: 0,Title,Category
0,trending subreddits for 20141022 rinstantregre...,neutral
1,is it just me or is the japanese culture very ...,neutral
2,why do so many good threads get downvoted on r...,positive
3,fine ill do it myself,negative
4,muriqa,neutral


In [15]:
posts_df.shape

(9684, 2)

The titles have been collected into a csv with 2 columns 'Title' and 'Category'. Where 'Title' is the text content and 'Category' is the class the text falls into.

### __2. PREPROCESSING:__

Preprocessing the text by removing hashtags, emojis, slang, stop-words, stemming/lemmatization, tokenizing and lowercasing.

In [16]:
def preprocess_text(text):
    # Remove '#' from hashtags
    text = re.sub(r'#', '', text)
    
    # Remove emojis
    text = emoji.demojize(text)

    # Tokenizing the words
    tokens = word_tokenize(text)

    # Removing non-alpha characters
    tokens = [word for word in tokens if word.isalpha()] 

    # Removing stopwords
    stop_words = set(stopwords.words('english'))    
    tokens = [word for word in tokens if word not in stop_words]

    # Stemming and Lemmatizing the words
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(stemmer.stem(word)) for word in tokens] 

    return ' '.join(tokens)

In [17]:
posts_df.loc[:, 'ProcessedTitle'] = posts_df['Title'].apply(preprocess_text)
posts_df.head()

Unnamed: 0,Title,Category,ProcessedTitle
0,trending subreddits for 20141022 rinstantregre...,neutral,trend subreddit rinstantregret rexcel rclimb r...
1,is it just me or is the japanese culture very ...,neutral,japanes cultur unsettl specif urban legend fol...
2,why do so many good threads get downvoted on r...,positive,mani good thread get downvot reddit
3,fine ill do it myself,negative,fine ill
4,muriqa,neutral,muriqa


### __3. FEATURE EXTRACTION:__ 

We will explore Bag-of-Words, GloVe and GPT-2 feature extraction techniques and how well they perform.

#### Bag of Words:

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer(max_features=2000) 

# Fit and transform the text data to BoW vectors
bow_features = vectorizer.fit_transform(posts_df['ProcessedTitle'])

# Split the dataset into training and testing sets
X_train_bow, X_test_bow, y_train_bow, y_test_bow = train_test_split(bow_features, posts_df['Category'], test_size=0.2, random_state=42)

# Initialize the classifier (e.g., Logistic Regression)
classifier_bow = LogisticRegression(max_iter=2000)

# Fit the classifier on the training data
classifier_bow.fit(X_train_bow, y_train_bow)

# Make predictions
y_pred_bow = classifier_bow.predict(X_test_bow)

# Evaluate the classifier
print("Bag-of-Words Performance:")
print(classification_report(y_test_bow, y_pred_bow))

Bag-of-Words Performance:
              precision    recall  f1-score   support

    negative       0.77      0.65      0.70       477
     neutral       0.82      0.93      0.87       979
    positive       0.79      0.70      0.74       481

    accuracy                           0.80      1937
   macro avg       0.79      0.76      0.77      1937
weighted avg       0.80      0.80      0.80      1937



#### GloVe:

In [19]:
import numpy as np
import gensim.downloader as api

# Load pre-trained GloVe model (choose an appropriate model)
glove_model = api.load("glove-wiki-gigaword-100")  # For example

# Function to create document vectors
def document_vector_glove(doc):
    words = doc.split()
    word_vectors = [glove_model[word] for word in words if word in glove_model]
    if len(word_vectors) == 0:
        return np.zeros(glove_model.vector_size)  # Return a zero vector if no words are found
    else:
        return np.mean(word_vectors, axis=0)

glove_features = np.array([document_vector_glove(doc) for doc in posts_df['ProcessedTitle']])

X_train_glove, X_test_glove, y_train_glove, y_test_glove = train_test_split(glove_features, posts_df['Category'], test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_glove_scaled = scaler.fit_transform(X_train_glove)
X_test_glove_scaled = scaler.transform(X_test_glove)

classifier_glove = LogisticRegression(max_iter=1000)
classifier_glove.fit(X_train_glove_scaled, y_train_glove)
y_pred_glove = classifier_glove.predict(X_test_glove)

# Evaluate the classifier
print("GloVe Performance:")
print(classification_report(y_test_glove, y_pred_glove))

GloVe Performance:
              precision    recall  f1-score   support

    negative       0.81      0.05      0.10       477
     neutral       0.52      0.99      0.68       979
    positive       0.79      0.10      0.18       481

    accuracy                           0.54      1937
   macro avg       0.71      0.38      0.32      1937
weighted avg       0.66      0.54      0.42      1937



#### GPT-2:

In [20]:
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Load pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
model = GPT2Model.from_pretrained('gpt2')

# Function to create document vectors using GPT embeddings
def document_vector_gpt(doc):
    if not doc.strip():
        return np.zeros((1, model.config.hidden_size))  # Return a zero vector
    
    # Tokenize input text
    inputs = tokenizer(doc, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Extract embeddings from the last hidden state (CLS token)
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()
    
    return embeddings


# Extract GPT embeddings for each document (title)
gpt_features = np.array([document_vector_gpt(doc) for doc in posts_df['ProcessedTitle']])

# Split the dataset into training and testing sets
X_train_gpt, X_test_gpt, y_train_gpt, y_test_gpt = train_test_split(gpt_features, posts_df['Category'], test_size=0.2, random_state=42)

# Initialize and train a classifier on the GPT features
classifier_gpt = LogisticRegression(max_iter=5000)
classifier_gpt.fit(X_train_gpt.reshape((X_train_gpt.shape[0], -1)), y_train_gpt)

# Make predictions
y_pred_gpt = classifier_gpt.predict(X_test_gpt.reshape((X_test_gpt.shape[0], -1)))

# Evaluate the classifier
print("GPT-2 Performance:")
print(classification_report(y_test_gpt, y_pred_gpt))


  from .autonotebook import tqdm as notebook_tqdm


GPT-2 Performance:
              precision    recall  f1-score   support

    negative       0.42      0.33      0.37       477
     neutral       0.63      0.74      0.68       979
    positive       0.44      0.37      0.40       481

    accuracy                           0.55      1937
   macro avg       0.49      0.48      0.48      1937
weighted avg       0.53      0.55      0.53      1937



The evaluation compared the performance of three feature extraction techniques, Bag-of-Words (BoW), GloVe, and GPT-2, for sentiment analysis of Reddit post headlines. BoW demonstrated strong overall performance, achieving an accuracy of 81% with high precision, recall, and F1-scores across all sentiment categories. In contrast, GloVe exhibited relatively poor performance, struggling with precision and recall for negative and positive sentiments, resulting in an accuracy of 52%. GPT-2 showed moderate performance with an accuracy of 54%, indicating potential but requiring further refinement for effective sentiment analysis. Overall, the evaluation underscores the effectiveness of BoW for sentiment analysis tasks based on Reddit post headlines.

In [21]:
posts_df

Unnamed: 0,Title,Category,ProcessedTitle
0,trending subreddits for 20141022 rinstantregre...,neutral,trend subreddit rinstantregret rexcel rclimb r...
1,is it just me or is the japanese culture very ...,neutral,japanes cultur unsettl specif urban legend fol...
2,why do so many good threads get downvoted on r...,positive,mani good thread get downvot reddit
3,fine ill do it myself,negative,fine ill
4,muriqa,neutral,muriqa
...,...,...,...
9679,steve bannon and roger stone are his friends n...,positive,steve bannon roger stone friend need say
9680,hi i am writing in hopes someone can help me w...,neutral,hi write hope someon help ga get cardiologist ...
9681,trending subreddits for 20160811 rapocalympics...,neutral,trend subreddit rapocalympicsrio rlastimag rno...
9682,seriously this treasonous piece of shit can ju...,negative,serious treason piec shit go fuck


### __4. MODEL TRAINING:__ 

#### Naive-Bayes:

Reason: Naive Bayes models are probabilistic classifiers based on Bayes' theorem. They are simple and efficient, making them suitable for text classification tasks like sentiment analysis.

In [22]:
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(bow_features, posts_df['Category'], test_size=0.2, random_state=42)

# Initialize and train the Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Evaluate the classifier
y_pred = nb_classifier.predict(X_test)
print("Naive Bayes Classifier Performance:")
print(classification_report(y_test, y_pred))

Naive Bayes Classifier Performance:
              precision    recall  f1-score   support

    negative       0.59      0.72      0.65       477
     neutral       0.93      0.65      0.77       979
    positive       0.59      0.82      0.69       481

    accuracy                           0.71      1937
   macro avg       0.70      0.73      0.70      1937
weighted avg       0.76      0.71      0.72      1937



#### SVM (Support Vector Machines):

Reason: SVMs are powerful supervised learning models used for classification tasks, including sentiment analysis. They work well with high-dimensional data and can effectively separate data points in complex feature spaces.

In [23]:
# Initialize and train the SVM Classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Evaluate the classifier
y_pred_svm = svm_classifier.predict(X_test)
print("SVM Classifier Performance:")
print(classification_report(y_test, y_pred_svm))

SVM Classifier Performance:
              precision    recall  f1-score   support

    negative       0.76      0.68      0.72       477
     neutral       0.86      0.93      0.89       979
    positive       0.82      0.77      0.79       481

    accuracy                           0.83      1937
   macro avg       0.81      0.79      0.80      1937
weighted avg       0.83      0.83      0.83      1937



#### Neural Networks:

Deep learning models, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers, have gained popularity in sentiment analysis due to their ability to capture complex relationships in text data. These models can learn hierarchical representations of text, leading to improved sentiment classification performance.

In [24]:
# Convert labels to categorical
y_train_cat = to_categorical(y_train.factorize()[0])
y_test_cat = to_categorical(y_test.factorize()[0])

# Neural Network Model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(256, activation='relu'),
    Dense(y_train_cat.shape[1], activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train.toarray(), y_train_cat, epochs=15, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X_test.toarray(), y_test_cat)
print("Neural Network Performance: Accuracy = {:.2f}".format(accuracy))
print(classification_report(y_test, y_pred_svm))


Epoch 1/15


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.5702 - loss: 0.8857
Epoch 2/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.8634 - loss: 0.3992
Epoch 3/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9265 - loss: 0.2399
Epoch 4/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9553 - loss: 0.1524
Epoch 5/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9698 - loss: 0.1052
Epoch 6/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9763 - loss: 0.0770
Epoch 7/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9849 - loss: 0.0526
Epoch 8/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 972us/step - accuracy: 0.9875 - loss: 0.0439
Epoch 9/15
[1m243/243[0m [32m━━━━━━━━━━━━━━━━━

The performance evaluation of three sentiment analysis models, Naive Bayes, Support Vector Machine (SVM), and a neural network implemented using TensorFlow, reveals varying degrees of accuracy and effectiveness. Naive Bayes achieves an accuracy of 78%, demonstrating solid precision and recall across all sentiment categories. SVM outperforms Naive Bayes slightly, with an accuracy of 80%, showcasing strong precision and recall for all sentiments. However, the neural network model falls significantly short, with an accuracy of only 23%, indicating challenges in effectively capturing sentiment patterns in the data. Further refinement and optimization may be necessary to improve its performance. Overall, **SVM appears to be the most robust model** for sentiment analysis, followed closely by Naive Bayes, while the neural network requires additional adjustments to enhance its efficacy in classifying sentiments accurately.

# 5.	Deployment and Interface

In [33]:
# Function to preprocess and predict sentiment
def predict_sentiment():
    user_input = text_input.get("1.0", "end-1c")  
    processed_input = preprocess_text(user_input)  
    vectorized_input = vectorizer.transform([processed_input])  
    prediction = svm_classifier.predict(vectorized_input)  
    result_label.config(text="Predicted Sentiment: " + str(prediction[0]))  

# Tkinter window
root = tk.Tk()
root.title("Group 6: Sentiment Analysis")

# Text input widget with placeholder text
text_input = tk.Text(root, height=15, width=60)
text_input.insert("1.0", "Enter post/sentence here")
text_input.pack()

# Predict button
predict_button = tk.Button(root, text="Predict", command=predict_sentiment)
predict_button.pack()

# Label to display the result
result_label = tk.Label(root, text="Predicted Sentiment is: ")
result_label.pack()

# Run the application
root.mainloop()