

---


## Welcome to the Sentiment Analysis Course!

In this course we explore how to understand emotions expressed in text. Sentiment analysis, also known as opinion mining, lets us identify the sentiments, attitudes, and opinions expressed in written communication. Whether you're interested in social media analysis, customer feedback evaluation, or deeper insight into human behavior, this course covers the essential tools and techniques for analyzing sentiment. Let's dive in!



---



## Installing relevant packages

In [65]:
! pip install tensorflow



## Importing relevant packages

In [1]:
import sklearn  # Import scikit-learn for machine learning and data analysis
import nltk  # Import NLTK (Natural Language Toolkit) for natural language processing tasks
import matplotlib.pyplot as plt  # Import Matplotlib for data visualization
import pandas as pd  # Import Pandas for data manipulation and analysis
import gensim  # Import Gensim for topic modeling and word embeddings
# from google.colab import drive  # For importing files from Google Drive
import os  # For navigating directories
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kmedr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kmedr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Data loading and Preprocessing

### Data loading

In [3]:
# Mount Google Drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Changing the directory to the Sentiment analysis folder that will be used for this work
# os.chdir('drive/My Drive/Sentiment_analysis')

In [2]:
# Load the tweet dataset into a DataFrame
df = pd.read_csv('Tweets.csv')

In [4]:
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [5]:
df.shape

(27481, 4)

### Data Preprocessing

#### Creating the train and test sets

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
df_train, df_test = train_test_split(df, test_size=0.3, stratify=df['sentiment'], shuffle=True, random_state=20)

In [8]:
df_train.shape

(19236, 4)

In [9]:
df_test.shape

(8245, 4)

#### Removing redundant columns and creating maps

In [10]:
df_train.head()

Unnamed: 0,textID,text,selected_text,sentiment
12928,0f8195c613,"But now talking about today, Oh my GODNESS! Pr...","problems,",negative
21259,1072debb10,Movie is pretty interesting actually. Gonna fi...,Movie is pretty interesting actually.,positive
23157,235da13ecf,@_lightmare There are like six that hang aroun...,@_lightmare There are like six that hang aroun...,neutral
3667,e0ca22d6d5,not really ;D nice pic . no but could u imag...,nice,positive
15464,4381295572,Off to dinner with & his fam.,is fam.,positive


In [11]:
df_train.drop(['textID', 'selected_text'], axis=1, inplace=True)
df_test.drop(['textID', 'selected_text'], axis=1, inplace=True)

In [12]:
df_train.head(2)

Unnamed: 0,text,sentiment
12928,"But now talking about today, Oh my GODNESS! Pr...",negative
21259,Movie is pretty interesting actually. Gonna fi...,positive


In [13]:
df_test.head(2)

Unnamed: 0,text,sentiment
20236,I love how simple my Safari toolbar is! http:...,positive
11644,thanks love ) btw happy mother`s day to your mom,positive


In [14]:
# Map the sentiment labels to integers
df_train['sentiment'] = df_train['sentiment'].map(
    {
        'positive': 1,
        'negative': 0,
        'neutral': 2
    }
)

In [15]:
# Map the sentiment labels to integers (same mapping as for the training set)
df_test['sentiment'] = df_test['sentiment'].map(
    {
        'neutral':2,
        'negative':0,
        'positive':1
    }
)

In [16]:
df_test.head(3)

Unnamed: 0,text,sentiment
20236,I love how simple my Safari toolbar is! http:...,1
11644,thanks love ) btw happy mother`s day to your mom,1
7552,Can`t sleep rite now because of havin` so much...,2


In [17]:
df_train.head(3)

Unnamed: 0,text,sentiment
12928,"But now talking about today, Oh my GODNESS! Pr...",0
21259,Movie is pretty interesting actually. Gonna fi...,1
23157,@_lightmare There are like six that hang aroun...,2


In [18]:
# Drop the neutral class (label 2) so the task becomes binary classification
df_train = df_train[df_train['sentiment'] != 2].copy()
df_test = df_test[df_test['sentiment'] != 2].copy()

In [19]:
df_train.head(5)

Unnamed: 0,text,sentiment
12928,"But now talking about today, Oh my GODNESS! Pr...",0
21259,Movie is pretty interesting actually. Gonna fi...,1
3667,not really ;D nice pic . no but could u imag...,1
15464,Off to dinner with & his fam.,1
21966,LOL we`re such twitter addicts,1


In [20]:
df_test.head(5)

Unnamed: 0,text,sentiment
20236,I love how simple my Safari toolbar is! http:...,1
11644,thanks love ) btw happy mother`s day to your mom,1
7996,Found out that a schoolmate died of an heart a...,0
18041,Somebody accidentely sleep for 3 hours instead...,0
7628,wow! I`ve joined the photography scene pretty...,1


In [21]:
df_test['sentiment'].unique()

array([1, 0], dtype=int64)

In [22]:
df_train['sentiment'].unique()

array([0, 1], dtype=int64)

#### StopWord Removal

Stopwords are commonly used words that contribute little to the overall meaning of a text. They are typically filtered out during text analysis tasks, such as sentiment analysis, so that the analysis can focus on more meaningful words. Examples of stopwords in English include articles (e.g., "a", "an", "the"), pronouns (e.g., "I", "you", "he", "she"), prepositions (e.g., "in", "on", "at"), and conjunctions (e.g., "and", "but", "or"). Removing stopwords often improves efficiency and accuracy by eliminating noise and reducing the dimensionality of the data.
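To make the idea concrete before using a library, here is a minimal sketch of stopword filtering in plain Python. The `TOY_STOPWORDS` set and the `drop_stopwords` helper are hypothetical names introduced for illustration only; real stopword lists, such as the one Gensim uses, are much longer.

```python
# Tiny hand-picked stopword set, for illustration only (an assumption;
# real stopword lists are much longer)
TOY_STOPWORDS = {"a", "an", "the", "i", "you", "he", "she",
                 "in", "on", "at", "and", "but", "or", "is", "to"}

def drop_stopwords(text, stopwords=TOY_STOPWORDS):
    """Lowercase the text, then keep only tokens not in the stopword set."""
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in stopwords)

print(drop_stopwords("The movie is great and I loved it"))
# prints: movie great loved it
```

Lowercasing first matters: without it, "The" and "I" would not match the lowercase entries in the set.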

In [23]:
from gensim.parsing.preprocessing import remove_stopwords

#### Lowercasing

In [24]:
sample_text = df_train.text[1020]
sample_text = sample_text.lower()
print(sample_text)

 i think we have it pretty much figured out.  added a box in the helsinki group where you can see the tweets


In [25]:
## NB: Lowercase the text first so that, for example, "I" is treated exactly the same as "i"
new_words = remove_stopwords(sample_text)
print(new_words)

think pretty figured out. added box helsinki group tweets


#### Removing special characters, mentions, and hashtags

In [26]:

text = "Great article! Check it out at https://example.com #technology @username"

In [27]:
splitted_text = text.split()

In [28]:
print(splitted_text)

['Great', 'article!', 'Check', 'it', 'out', 'at', 'https://example.com', '#technology', '@username']


In [29]:
new_words = [word for word in splitted_text if not word.startswith('http')]

In [30]:
print(new_words)

['Great', 'article!', 'Check', 'it', 'out', 'at', '#technology', '@username']


In [31]:
final_words = ' '.join(new_words)

In [32]:
print(final_words)

Great article! Check it out at #technology @username


In [33]:
cleaned_text = ' '.join(word for word in text.split() if not word.startswith('http'))

In [34]:
def preprocess_text(text):
    # Remove URLs and shortened links (any word containing '.ly' or '.co')
    text = ' '.join(word for word in text.split() if not word.startswith('http'))
    text = ' '.join(word for word in text.split() if not word.startswith('www'))
    text = ' '.join(word for word in text.split() if '.ly' not in word)
    text = ' '.join(word for word in text.split() if '.co' not in word)

    # Remove special characters and punctuation (this also strips '@' and '#')
    text = ''.join(char for char in text if char.isalnum() or char.isspace())

    # Remove mentions (@username)
    # NOTE: '@' has already been stripped above, so this filter never matches
    # and the username text is kept (see the output below).
    text = ' '.join(word for word in text.split() if not word.startswith('@'))

    # Strip the leading '#' from hashtags (#technology)
    # NOTE: '#' has already been stripped above, so this step is a no-op here.
    text = ' '.join(word[1:] if word.startswith('#') else word for word in text.split())

    return text

# Example text
text = "Great article! Check it out at https://example.com #technology @username"

# Preprocess the text
preprocessed_text = preprocess_text(text)

# Print the preprocessed text
print(preprocessed_text)

Great article Check it out at technology username


In [35]:
print(text)

Great article! Check it out at https://example.com #technology @username


In [36]:
new_words = preprocess_text(text)
new_words

'Great article Check it out at technology username'
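Note that the output above still contains `username`: the punctuation step strips the `@` before the mention filter runs, so the filter never sees a word starting with `@`. As a rough sketch of one alternative ordering (not the version used in the rest of this notebook), the hypothetical `clean_tweet` helper below removes URLs and mentions with regular expressions before stripping punctuation.

```python
import re

def clean_tweet(text):
    """Hypothetical variant of preprocess_text with mentions handled first."""
    # Drop URLs (http..., https..., www...) while the punctuation is still intact
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)
    # Drop whole mentions such as @username before '@' is stripped
    text = re.sub(r"@\w+", " ", text)
    # Keep the hashtag word but drop the leading '#'
    text = re.sub(r"#(\w+)", r"\1", text)
    # Remove remaining special characters and punctuation
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # Collapse repeated whitespace
    return " ".join(text.split())

print(clean_tweet("Great article! Check it out at https://example.com #technology @username"))
# prints: Great article Check it out at technology
```

Whether to drop mentions entirely or keep the username text is a modelling choice; the results later in this notebook were produced with the original function.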

#### Tokenization (Tweets)

Tokenization is the process of breaking a text or sentence into smaller units called tokens. Tokens can be individual words, phrases, or even characters, depending on the granularity of the tokenization technique. Tokenization prepares text for analysis by splitting it into meaningful, manageable components, and it is a foundational step in many natural language processing (NLP) tasks, such as text classification, language modeling, and information retrieval.
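The difference in granularity can be illustrated with a small sketch using only the standard library. The helpers `simple_word_tokens` and `char_tokens` are hypothetical names for this illustration; they are much cruder than NLTK's `word_tokenize`, which handles contractions and other edge cases.

```python
import re

def simple_word_tokens(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    """Character-level tokens, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

sentence = "I will be going to morocco next week!"
print(simple_word_tokens(sentence))
# prints: ['I', 'will', 'be', 'going', 'to', 'morocco', 'next', 'week', '!']
print(char_tokens("next week"))
# prints: ['n', 'e', 'x', 't', 'w', 'e', 'e', 'k']
```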

In [37]:
from nltk import word_tokenize

In [38]:
new_word = "I will be going to morocco next week"

In [39]:
sample_text_tokens = word_tokenize(new_word)
print(sample_text_tokens)

['I', 'will', 'be', 'going', 'to', 'morocco', 'next', 'week']


#### Lemmatization / Stemming (Tweets)





---

Lemmatization and stemming are techniques used in natural language processing to reduce words to their base or canonical forms, but they have different approaches and outcomes.

Lemmatization:
- Lemmatization aims to obtain the lemma or base form of a word.
- It considers the word's morphological analysis and applies language rules to determine the base form.
- Lemmatization typically produces valid words that are present in the language's dictionary.
- For example, the lemmatization of "running" would be "run", and the lemmatization of "better" (treated as an adjective) would be "good".

Stemming:
- Stemming is a simpler and more heuristic-based approach.
- It reduces words to their stem or root form by removing suffixes or prefixes.
- Stemming does not guarantee that the resulting stem is a valid word.
- For example, the stemming of "running" would be "run", but a suffix-stripping stemmer such as Porter leaves "better" as "better", because no suffix rule can map it to "good".

In summary, lemmatization provides linguistically accurate base forms, while stemming focuses on heuristics to derive word stems. Lemmatization tends to yield better results in terms of semantic accuracy, but it can be computationally more expensive than stemming. The choice between lemmatization and stemming depends on the specific requirements and objectives of your application or analysis.
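The contrast can be sketched with two toy functions in plain Python. Everything here is hypothetical and deliberately simplified: `TOY_LEMMAS` stands in for the dictionary a real lemmatizer consults, and `toy_stem` applies only a handful of suffix rules, far fewer than a real stemmer.

```python
# Toy lemma lookup table (hypothetical; a real lemmatizer consults a full
# dictionary and part-of-speech information)
TOY_LEMMAS = {"running": "run", "better": "good", "studies": "study"}

# Suffixes the toy stemmer strips, tried in this order
TOY_SUFFIXES = ("ing", "ies", "es", "s")

def toy_lemmatize(word):
    """Dictionary lookup: returns a valid word, or the input if unknown."""
    return TOY_LEMMAS.get(word, word)

def toy_stem(word):
    """Heuristic suffix stripping: the result need not be a valid word."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ("running", "better", "studies"):
    print(w, "->", toy_lemmatize(w), "|", toy_stem(w))
# prints:
# running -> run | run
# better -> good | better
# studies -> study | stud
```

The last line shows the trade-off: the lookup returns the valid word "study", while the suffix rule produces the non-word "stud".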

In [40]:
from nltk.stem import WordNetLemmatizer, LancasterStemmer
from nltk.stem.snowball import EnglishStemmer

In [41]:
# Note: despite the variable name, this is a Snowball stemmer, not a lemmatizer
lemma = EnglishStemmer()

In [42]:
word_lemma = [lemma.stem(word) for word in sample_text_tokens]
print("Original Text")
print(sample_text_tokens)
print("-"*60)
print("Preprocessed Texts")
print("-"*60)
print(word_lemma)

Original Text
['I', 'will', 'be', 'going', 'to', 'morocco', 'next', 'week']
------------------------------------------------------------
Preprocessed Texts
------------------------------------------------------------
['i', 'will', 'be', 'go', 'to', 'morocco', 'next', 'week']


#### Creating a single preprocessing function and applying it to the dataset


In [43]:
def preprocess_text(text):
    # Remove URLs
    text = ' '.join(word for word in text.split() if not word.startswith('http'))
    text = ' '.join(word for word in text.split() if not word.startswith('www'))

    # Remove special characters and punctuation (this also strips '@' and '#')
    text = ''.join(char for char in text if char.isalnum() or char.isspace())

    # Remove mentions (@username)
    # NOTE: '@' has already been stripped above, so this filter never matches
    text = ' '.join(word for word in text.split() if not word.startswith('@'))

    # Strip the leading '#' from hashtags (#technology)
    # NOTE: '#' has already been stripped above, so this step is a no-op here
    text = ' '.join(word[1:] if word.startswith('#') else word for word in text.split())

    # Remove stopwords
    ## NB: Lowercase the text first so that, for example, "I" is treated exactly the same as "i"
    text = remove_stopwords(text.lower())

    # Tokenization
    text = word_tokenize(text)

    # Stemming (the `lemma` object is a Snowball EnglishStemmer)
    text = ' '.join([lemma.stem(word) for word in text])
    return text

In [44]:
new = preprocess_text(sample_text)

In [45]:
new

'think pretti figur ad box helsinki group tweet'

In [46]:
# Apply the preprocess_text function to create a new column 'new_text'
df_train['new_text'] = df_train['text'].apply(preprocess_text)
df_train.head(5)

Unnamed: 0,text,sentiment,new_text
12928,"But now talking about today, Oh my GODNESS! Pr...",0,talk today oh god problem problem problem love...
21259,Movie is pretty interesting actually. Gonna fi...,1,movi pretti interest actual gon na finish watc...
3667,not really ;D nice pic . no but could u imag...,1,d nice pic u imagin 2 thought
15464,Off to dinner with & his fam.,1,dinner fam
21966,LOL we`re such twitter addicts,1,lol twitter addict


In [47]:
# Apply the preprocess_text function to create a new column 'new_text'
df_test['new_text'] = df_test['text'].apply(preprocess_text)
df_test.head(5)

Unnamed: 0,text,sentiment,new_text
20236,I love how simple my Safari toolbar is! http:...,1,love simpl safari toolbar
11644,thanks love ) btw happy mother`s day to your mom,1,thank love btw happi mother day mom
7996,Found out that a schoolmate died of an heart a...,0,schoolmat die heart attack morn bare 35 miss u...
18041,Somebody accidentely sleep for 3 hours instead...,0,somebodi accident sleep 3 hour instead 2 hang ...
7628,wow! I`ve joined the photography scene pretty...,1,wow ive join photographi scene pretti recent l...


In [48]:
## Drop the original text column because it is now redundant.
df_test.drop('text', axis=1, inplace=True)
df_train.drop('text', axis=1, inplace=True)

## Feature Extraction

In [49]:
# Create the independent variables (x: preprocessed text) and dependent variables (y: sentiment labels)
x_train = df_train['new_text']
y_train = df_train['sentiment']


x_test =  df_test['new_text']
y_test = df_test['sentiment']

#### Vectorization (Tweets)

In [50]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [51]:
vectorizer = CountVectorizer(max_features=3000, stop_words='english', lowercase=True)

x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)

In [52]:
x_test_vec = x_test_vec.toarray()

In [53]:
x_train_vec = x_train_vec.toarray()
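As a rough illustration of what a count-based vectorizer produces, the sketch below builds a bag-of-words matrix in plain Python. The helpers `fit_vocabulary` and `transform` are hypothetical names for this illustration and only mimic the general idea; scikit-learn's `CountVectorizer` additionally applies its own tokenization and returns a sparse matrix.

```python
from collections import Counter

def fit_vocabulary(docs, max_features=None):
    """Return a word -> column index map, keeping the most frequent words."""
    counts = Counter(word for doc in docs for word in doc.split())
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(kept))}

def transform(docs, vocab):
    """Count how often each vocabulary word occurs in each document."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:  # words unseen during fitting are ignored
                row[vocab[word]] += 1
        rows.append(row)
    return rows

# Two preprocessed tweets used as toy training documents
train_docs = ["love simpl safari toolbar", "thank love btw happi mother day mom"]
vocab = fit_vocabulary(train_docs)
print(transform(["love love mom unseen"], vocab))
# prints: [[0, 0, 0, 2, 1, 0, 0, 0, 0, 0]]
```

The vocabulary is learned from the training documents only and then reused for new text, which mirrors calling `fit_transform` on the training set and `transform` on the test set.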

## Modelling

In [54]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

In [55]:
model = BernoulliNB()

In [56]:
# Train a Naive Bayes model
model.fit(x_train_vec, y_train)

In [57]:
model.score(x_train_vec, y_train)

0.8985507246376812

In [58]:
y_train_pred = model.predict(x_train_vec)
accuracy_train = accuracy_score(y_train, y_train_pred)
print("\nTraining Set Metrics:")
print("-" * 54)
print("Train Accuracy:", accuracy_train)
print("-" * 54)
print(classification_report(y_train, y_train_pred))
print("-" * 54)


Training Set Metrics:
------------------------------------------------------
Train Accuracy: 0.8985507246376812
------------------------------------------------------
              precision    recall  f1-score   support

           0       0.89      0.90      0.89      5447
           1       0.91      0.90      0.90      6007

    accuracy                           0.90     11454
   macro avg       0.90      0.90      0.90     11454
weighted avg       0.90      0.90      0.90     11454

------------------------------------------------------


In [59]:
y_test_pred = model.predict(x_test_vec)
accuracy_test = accuracy_score(y_test, y_test_pred)
print("\nTest Set Metrics:")
print("-" * 54)
print("Test Accuracy:", accuracy_test)
print("-" * 54)
print(classification_report(y_test, y_test_pred))
print("-" * 54)


Test Set Metrics:
------------------------------------------------------
Test Accuracy: 0.8506824200448156
------------------------------------------------------
              precision    recall  f1-score   support

           0       0.84      0.85      0.84      2334
           1       0.86      0.85      0.86      2575

    accuracy                           0.85      4909
   macro avg       0.85      0.85      0.85      4909
weighted avg       0.85      0.85      0.85      4909

------------------------------------------------------


### DEEP LEARNING MODEL

In [60]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout


# Create a sequential model
model2 = Sequential()
model2.add(Dense(200, activation='relu', input_dim=x_train_vec.shape[1]))
model2.add(Dropout(0.2))
model2.add(Dense(100, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(50, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(1, activation='sigmoid'))

# Compile the model
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model (note: the test set doubles as the validation data here)
model2.fit(x_train_vec, y_train, epochs=20, batch_size=100, validation_data=(x_test_vec, y_test))

# Evaluate the model
loss, accuracy = model2.evaluate(x_test_vec, y_test)
print('Test loss:', loss)
print('Test accuracy:', accuracy)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test loss: 1.2163729667663574
Test accuracy: 0.82644122838974


#### Saving our models to be used later

In [None]:
import joblib
# Save the Naive Bayes model to a file (the ./Models directory must already exist)
joblib.dump(model, './Models/naive_model.pkl')

['./Models/naive_model.pkl']

In [None]:
# Save the Keras model to disk
# (despite the file name, model2 is a dense feed-forward network, not a CNN)
model2.save('./Models/sentiment_CNN.h5')