## Project Description

The "Sentiment Analysis using SGD Classifier" project is a comprehensive data analysis and natural language processing (NLP) endeavor aimed at extracting valuable insights from textual data. Sentiment analysis is a branch of NLP that involves determining the emotional tone or sentiment expressed in text, which can range from positive and neutral to negative.

Project Objectives:

* Sentiment Classification: The primary goal of this project is to develop a robust sentiment analysis system using the Stochastic Gradient Descent (SGD) Classifier. This machine learning algorithm is well-suited for large-scale text classification tasks.

* Data Collection and Preparation: The project will involve gathering a dataset of short texts, here social media posts from Twitter. This data will be cleaned, preprocessed, and annotated with sentiment labels.

* Feature Engineering: To effectively train the SGD Classifier, the project will explore various text preprocessing techniques, including tokenization, stop-word removal, and vectorization methods like TF-IDF (Term Frequency-Inverse Document Frequency).

* Model Training and Evaluation: The SGD Classifier will be trained on the preprocessed dataset to classify text into sentiment categories (e.g., positive, neutral, negative). The project will employ appropriate evaluation metrics such as accuracy to assess the model's performance.

* Hyperparameter Tuning: Fine-tuning the SGD Classifier's hyperparameters will be essential to optimize its performance for sentiment analysis. Grid search or randomized search techniques may be employed.

* Deployment: Once the model achieves satisfactory results, it can be deployed as a practical sentiment analysis tool. This could involve creating a web application, API, or integration with existing systems to analyze and classify text in real-time.

* Future Improvements: As part of continuous improvement, the project may explore advanced NLP models, transfer learning techniques, or domain-specific sentiment analysis to enhance accuracy and applicability.

### Importing Libraries

In [1]:
import csv
import re
import os
import json
import pandas as pd

from sklearn.model_selection import train_test_split
from tqdm import tqdm # shows a progress meter for a loop when the iterable is wrapped as tqdm(iterable)
from sklearn.feature_extraction.text import CountVectorizer # to convert a collection of text to a matrix of token counts
from sklearn.feature_extraction.text import TfidfVectorizer # to convert a collection of text to a matrix of TF-IDF features
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn import pipeline
from sklearn.kernel_approximation import (RBFSampler, Nystroem)
from sklearn import preprocessing
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from joblib import Parallel, delayed

import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fortune\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fortune\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fortune\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Reading the data

In [2]:
input_csv_path = "kaggle_twitter_analysis\\"
input_csv_filename = "MLUnige2021_train.csv"
test_csv_filename = "MLUnige2021_test.csv"

In [21]:
twitter_df = pd.read_csv(input_csv_path + input_csv_filename)
test_df = pd.read_csv(input_csv_path + test_csv_filename)
twitter_df["original_text"] = twitter_df["text"]
test_df["original_text"] = test_df["text"]
twitter_df.head()

| Unnamed: 0 | Id | emotion | tweet_id | date | lyx_query | user | text | original_text |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2063391019 | Sun Jun 07 02:28:13 PDT 2009 | NO_QUERY | BerryGurus | @BreeMe more time to play with you BlackBerry ... | @BreeMe more time to play with you BlackBerry ... |
| 1 | 1 | 0 | 2000525676 | Mon Jun 01 22:18:53 PDT 2009 | NO_QUERY | peterlanoie | Failed attempt at booting to a flash drive. Th... | Failed attempt at booting to a flash drive. Th... |
| 2 | 2 | 0 | 2218180611 | Wed Jun 17 22:01:38 PDT 2009 | NO_QUERY | will_tooker | @msproductions Well ain't that the truth. Wher... | @msproductions Well ain't that the truth. Wher... |
| 3 | 3 | 1 | 2190269101 | Tue Jun 16 02:14:47 PDT 2009 | NO_QUERY | sammutimer | @Meaghery cheers Craig - that was really sweet... | @Meaghery cheers Craig - that was really sweet... |
| 4 | 4 | 0 | 2069249490 | Sun Jun 07 15:31:58 PDT 2009 | NO_QUERY | ohaijustin | I was reading the tweets that got send to me w... | I was reading the tweets that got send to me w... |


### Pre-processing

In [22]:
# Load the English stop words, the stemmer and the lemmatizer.
with open("stop_words_english.json", encoding="utf-8") as stop_words_file:
    stop_words = set(json.load(stop_words_file))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

* Stop Words Removal: Removing stop words drops very common, low-information words such as "the" and "is" so that later steps focus on content-bearing terms.

* Stemming: Stemming strips suffixes by rule to reduce a word to a stem, for example "running" to "run". The stem need not be a dictionary word, but related forms collapse to the same token.

* Lemmatization: Lemmatization maps a word to its dictionary form (lemma) using vocabulary and part-of-speech information, so the result stays a readable word.
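
The difference between the three steps can be sketched in plain Python. Everything here is a toy stand-in: `TOY_STOP_WORDS`, `TOY_LEMMAS`, `crude_stem` and `toy_lemmatize` are hypothetical names invented for illustration, and the suffix stripper is far simpler than the Porter stemmer loaded above.

```python
# Toy illustration only: a tiny stop list, a crude suffix stripper, and a
# hand-written lemma table stand in for real NLP resources.
TOY_STOP_WORDS = {"the", "is", "a", "to", "and", "was"}
TOY_LEMMAS = {"studies": "study", "better": "good", "running": "run"}


def remove_stop_words(tokens):
    """Keep only tokens that are not in the toy stop list."""
    return [t for t in tokens if t not in TOY_STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix; real stemmers apply many ordered rules."""
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token):
    """Look the token up in the hand-written lemma table."""
    return TOY_LEMMAS.get(token, token)


tokens = "the weather is better and she studies running".split()
content = remove_stop_words(tokens)
stems = [crude_stem(t) for t in content]      # may produce non-words
lemmas = [toy_lemmatize(t) for t in content]  # stays with dictionary forms

print(content)
print(stems)
print(lemmas)
```

The stems "stud" and "runn" are not words, while the lemma table returns "study" and "run"; that trade-off between speed and readability is the usual reason to pick one technique over the other.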

In [24]:
# Dictionary to convert negations to simpler terms.
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

# Function to strip mentions, HTML tags, stop words, punctuation and short words from a tweet.
# The stemmer and lemmatizer loaded above are not applied here.
def preprocess(strs):
    # NOTE: the lower-cased, negation-expanded text is computed here but is not
    # used by the steps below, which work on the original-cased string.
    lower_case = strs.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    strs = re.sub(r'@[\w]+', '', strs)   # drop @mentions
    strs = re.sub(r'<.*?>', '', strs)    # drop HTML tags
    # Keep letters, whitespace and basic punctuation only.
    text = re.sub(r'[^a-zA-Z.,!?/:;\"\'\s]', '', strs)
    text = word_tokenize(text)
    text = (" ".join([word for word in text if word not in stop_words and word not in string.punctuation and len(word)>1]))
    text = re.sub(r'\b\w{1,3}\b', '', text)   # drop words of three characters or fewer
    return text

# Function to count the number of words in the tweets.
def word_counter(text):
    return len(text.split())

twitter_df["text"] = twitter_df["text"].apply(preprocess)
test_df["text"] = test_df["text"].apply(preprocess)
twitter_df["wc"] = twitter_df["text"].apply(word_counter)
test_df["wc"] = test_df["text"].apply(word_counter)
twitter_df.head()

| Unnamed: 0 | Id | emotion | tweet_id | date | lyx_query | user | text | original_text | wc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2063391019 | Sun Jun 07 02:28:13 PDT 2009 | NO_QUERY | BerryGurus | time play BlackBerry | @BreeMe more time to play with you BlackBerry ... | 3 |
| 1 | 1 | 0 | 2000525676 | Mon Jun 01 22:18:53 PDT 2009 | NO_QUERY | peterlanoie | Failed attempt booting flash drive Then failed... | Failed attempt at booting to a flash drive. Th... | 13 |
| 2 | 2 | 0 | 2218180611 | Wed Jun 17 22:01:38 PDT 2009 | NO_QUERY | will_tooker | Well ' truth Where ' damn autolock disable Co... | @msproductions Well ain't that the truth. Wher... | 12 |
| 3 | 3 | 1 | 2190269101 | Tue Jun 16 02:14:47 PDT 2009 | NO_QUERY | sammutimer | cheers Craig sweet reply ' pumped | @Meaghery cheers Craig - that was really sweet... | 6 |
| 4 | 4 | 0 | 2069249490 | Sun Jun 07 15:31:58 PDT 2009 | NO_QUERY | ohaijustin | reading tweets send lying phone face dropped ... | I was reading the tweets that got send to me w... | 8 |
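
The negation dictionary defined in the cell above can be exercised on its own. The following is a minimal sketch that assumes a three-entry subset of that dictionary; `expand_negations` is a hypothetical helper name, and `re.escape` is an extra safeguard that the notebook's pattern does not use.

```python
import re

# Three-entry subset of the notebook's negation dictionary (illustrative).
negations = {"isn't": "is not", "don't": "do not", "can't": "can not"}

# One alternation pattern matching any key as a whole word.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, negations)) + r")\b")


def expand_negations(text):
    """Lower-case the text, then replace each contraction with its expansion."""
    return pattern.sub(lambda m: negations[m.group()], text.lower())


expanded = expand_negations("I don't think this ISN'T fun")
print(expanded)
```

Lower-casing first matters because the dictionary keys are lower-case; without it, "ISN'T" would pass through unchanged.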


In [25]:
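# Split the labelled data into training and held-out sets (80-20 split), stratified by label.
# val_df keeps the held-out split separate from the test_df loaded from MLUnige2021_test.csv.
train_df, val_df = train_test_split(twitter_df, test_size=0.2, stratify=twitter_df["emotion"])
x_train = train_df["text"]
x_test = val_df["text"]
y_train = train_df["emotion"]
y_true = val_df["emotion"]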
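
To make the effect of `stratify` concrete, here is a rough pure-Python sketch of a stratified 80-20 split. It is not scikit-learn's implementation; `stratified_split`, the fixed seed and the toy 60/40 labels are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_size from each label group."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # hypothetical fixed seed for repeatability
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 60 + [1] * 40  # toy data with a 60/40 class balance
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Because each label group is sampled separately, the held-out set keeps the same 60/40 balance as the full data, which is what makes accuracy on it comparable to accuracy on the training distribution.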

In [26]:
tfidfvec = TfidfVectorizer(stop_words="english",
                           analyzer="word",
                           lowercase=False,
                           use_idf=False,
                           ngram_range=(1, 2))

X_train = tfidfvec.fit_transform(x_train)
X_test = tfidfvec.transform(x_test)
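
The `ngram_range = (1,2)` setting in the cell above asks for single words and adjacent word pairs as features. A small sketch of that idea, assuming whitespace tokenization and a hypothetical helper `word_ngrams` (the vectorizer's own tokenizer and stop-word handling are more involved):

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """List every contiguous n-gram for n in the inclusive ngram_range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


features = word_ngrams("failed attempt booting".split())
print(features)
```

Bigrams such as "failed attempt" let a linear model weight short phrases separately from their individual words, at the cost of a much larger vocabulary.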

The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer converts a collection of text documents into numerical feature vectors and is a common preprocessing step before applying machine learning algorithms. Note that the cell above sets use_idf = False, so the features it produces are L2-normalized term frequencies over unigrams and bigrams; the IDF weighting described below applies only when use_idf is True. Here's how TF-IDF vectorization works:

* Term Frequency (TF):

Term Frequency (TF) measures the frequency of a term (word) within a specific document. It is calculated as the number of times a term appears in a document divided by the total number of terms in that document.

TF is useful because it helps capture the importance of terms within individual documents. Terms that appear frequently in a document are often more important in representing the content of that document.

The formula for TF is: TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

* Inverse Document Frequency (IDF):

Inverse Document Frequency (IDF) measures the importance of a term across a collection of documents (corpus). It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term, plus 1 so that a term occurring in every document still receives a nonzero weight.

IDF is used to identify terms that are relatively rare and, therefore, potentially more significant in distinguishing documents. Terms that appear in many documents have lower IDF scores, while terms that appear in few documents have higher IDF scores.

The formula for IDF is: IDF(t, D) = log((Total number of documents in the corpus D) / (Number of documents containing term t in corpus D)) + 1

* TF-IDF Weighting:

The TF-IDF weight of a term in a document combines the TF and IDF values to represent the term's importance both in the document and across the entire corpus. It is calculated as the product of TF and IDF.

The TF-IDF weight emphasizes terms that are frequent within a document (high TF) and relatively rare in the corpus (high IDF).

The formula for TF-IDF is: TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

* Vectorization:

Once TF-IDF scores are calculated for each term in each document, they form a matrix where rows represent documents, columns represent terms, and each cell contains the TF-IDF score of a term in a document.

This TF-IDF matrix is often used as input for machine learning algorithms, where each document is represented as a vector of TF-IDF scores, and terms are the features.
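
The formulas above can be worked through on a toy corpus. This is a minimal sketch that follows them as written, assuming the natural logarithm; the three documents are invented for illustration. scikit-learn's `TfidfVectorizer` differs in detail: by default it smooths the IDF and L2-normalizes each row.

```python
import math
from collections import Counter

# Toy corpus of three already-tokenized documents (illustrative only).
docs = [
    "good phone good battery".split(),
    "bad battery".split(),
    "good screen".split(),
]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))


def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)


def idf(term):
    """log(N / df) + 1, as in the formula above."""
    return math.log(n_docs / df[term]) + 1


# Rows are documents, columns follow the sorted vocabulary.
tfidf_matrix = [[tf(t, d) * idf(t) for t in vocab] for d in docs]

for row in tfidf_matrix:
    print([round(value, 3) for value in row])
```

In the first document, "phone" has the higher IDF because it occurs in only one document, yet "good" still gets the larger TF-IDF weight because it makes up half of that document's tokens.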

### Model training, evaluation and hyperparameter tuning

In [27]:
param_grid = {
    'penalty': ['elasticnet'],
    'loss': ['modified_huber'],
    'learning_rate': ['adaptive'],
    'eta0': [0.01],
    'alpha': [0.0000021727548793319498],  # You can specify multiple values to tune here
}


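clf = SGDClassifier()
# Each candidate in param_grid is scored by 5-fold cross-validated accuracy.
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
# With the default refit=True, best_estimator_ is already fitted on all of X_train;
# the explicit fit below trains it once more on the same data.
best_classifier = grid_search.best_estimator_
best_classifier.fit(X_train, y_train)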

SGDClassifier(alpha=2.1727548793319498e-06, eta0=0.01, learning_rate='adaptive',
              loss='modified_huber', penalty='elasticnet')
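
A grid search evaluates every combination of the listed values, so the number of candidates is the product of the list lengths. The sketch below enumerates candidates with `itertools.product`; `expand_grid` is a hypothetical helper and the values in `wider_grid` are illustrative assumptions, not tuned settings.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the listed values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


# The grid from the cell above: one value per parameter.
single_point_grid = {
    "penalty": ["elasticnet"],
    "loss": ["modified_huber"],
    "learning_rate": ["adaptive"],
    "eta0": [0.01],
    "alpha": [0.0000021727548793319498],
}

# Hypothetical wider grid; these values are illustrative, not tuned.
wider_grid = {
    "penalty": ["l2", "elasticnet"],
    "loss": ["modified_huber"],
    "alpha": [1e-6, 1e-5, 1e-4],
}

n_single = len(list(expand_grid(single_point_grid)))
n_wider = len(list(expand_grid(wider_grid)))
print(n_single, n_wider)
```

With one value per parameter the grid holds a single candidate, so the search above cross-validates that one configuration rather than comparing alternatives; with cv=5, each additional candidate costs five more fits.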

In [28]:
best_classifier.score(X_test, y_true)

0.74016796875

* The Stochastic Gradient Descent (SGD) Classifier is an efficient and versatile machine learning algorithm well-suited for large-scale classification tasks, capable of handling high-dimensional datasets and online learning scenarios.
* The core of the SGD Classifier is the Stochastic Gradient Descent optimization algorithm. Unlike traditional Gradient Descent, which computes the gradient of the cost function using the entire dataset, SGD calculates the gradient based on a single randomly selected data point at each iteration. This makes it highly efficient for large datasets.
* It supports L2, L1 and elastic-net regularization, selected with the penalty parameter, which helps prevent overfitting and improves generalization. The alpha hyperparameter sets the strength of the penalty, and l1_ratio sets the L1/L2 mix when penalty is "elasticnet".
* The eta0 parameter is the initial learning rate used by the "constant", "invscaling" and "adaptive" schedules; the default "optimal" schedule does not use it. With "adaptive", the rate stays at eta0 while the training loss keeps decreasing and is divided by 5 each time n_iter_no_change consecutive epochs fail to improve the loss by tol.
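
The single-sample update described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not scikit-learn's implementation: it assumes log loss, an L2 penalty and a constant learning rate rather than the modified Huber loss, elastic net and adaptive schedule used in this notebook, and `sgd_logistic` and the toy data are hypothetical.

```python
import math
import random


def sgd_logistic(X, y, eta0=0.1, alpha=1e-4, epochs=50, seed=0):
    """Fit weights with one-sample-at-a-time updates (log loss + L2 penalty)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(y = 1)
            error = p - y[i]                # gradient of the log loss w.r.t. z
            # Gradient step on this single sample, with L2 shrinkage on w.
            w = [wj - eta0 * (error * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= eta0 * error
    return w, b


# Toy linearly separable data: class 1 in the positive quadrant, class 0 opposite.
X = [[2.0, 1.0], [3.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-3.0, -2.0], [-2.5, -1.5]]
y = [1, 1, 1, 0, 0, 0]

w, b = sgd_logistic(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0 for x in X]
accuracy = sum(int(p == t) for p, t in zip(predictions, y)) / len(y)
print(accuracy)
```

Each update touches only one sample, so the cost per step is independent of the dataset size; `alpha` plays the same role as in `SGDClassifier`, pulling the weights toward zero at every step.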