
# Twitter Sentiment Analysis

### gensim:

Purpose: Gensim is a Python library designed for topic modeling, document indexing, and similarity retrieval with large corpora. It is widely used for text processing and natural language processing tasks.

Features:

*   Topic modeling algorithms like Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).
*   Similarity queries, including cosine similarity.
*   Document indexing and retrieval.
*   Word vector models like Word2Vec, FastText, and more.
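
To make the "cosine similarity" feature concrete, here is a minimal pure-Python sketch of the measure itself. It does not use gensim; the three-dimensional vectors and the function name `cosine_similarity` are made up for illustration, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "word vectors", for illustration only
love = [0.9, 0.1, 0.3]
adore = [0.8, 0.2, 0.4]
rain = [-0.2, 0.9, 0.1]

print(cosine_similarity(love, adore))  # close to 1: similar direction
print(cosine_similarity(love, rain))   # much lower: different direction
```

Real word vectors have hundreds of dimensions (300 in this notebook), but the computation is the same.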


### keras:

Purpose: Keras is a high-level neural networks API written in Python. Older multi-backend releases ran on top of TensorFlow, CNTK, or Theano; Keras 3 supports TensorFlow, JAX, and PyTorch backends. It simplifies the process of building and training deep learning models.

Features:

*   Provides a simple and intuitive interface for building neural networks.
*   Supports convolutional networks, recurrent networks, and combinations of both.
*   Enables fast experimentation through a user-friendly API.
*   Integrates seamlessly with TensorFlow, which allows for easy scaling to distributed systems.


### pandas:

Purpose: Pandas is a powerful data manipulation and analysis library for Python. It provides data structures like DataFrame and Series that are flexible and easy to work with, making it a popular choice for data cleaning, transformation, and analysis tasks.

Features:

*   DataFrame object for data manipulation with integrated indexing.
*   Tools for reading and writing data in various formats (CSV, Excel, SQL databases, etc.).
*   Time series functionality for working with date and time data.
*   Powerful data aggregation and transformation capabilities.
*   Integration with other libraries like NumPy and Matplotlib for scientific computing and data visualization.









In [None]:
!pip install gensim --upgrade
!pip install keras --upgrade
!pip install pandas --upgrade

In [None]:
# Importing pandas for data manipulation
import pandas as pd

# Importing matplotlib for plotting
import matplotlib.pyplot as plt
# Magic command to display plots inline in Jupyter Notebook
# (kept on its own line: a trailing comment is parsed as arguments to the magic)
%matplotlib inline

# Importing scikit-learn modules for machine learning tasks
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import LabelEncoder       # For label encoding categorical variables
from sklearn.metrics import (                         # For evaluating model performance
    confusion_matrix,
    classification_report,
    accuracy_score
)
from sklearn.manifold import TSNE                     # For data visualization and dimensionality reduction
from sklearn.feature_extraction.text import TfidfVectorizer  # For text vectorization using TF-IDF

# Importing Keras modules for deep learning
from keras.preprocessing.text import Tokenizer       # For text tokenization
from keras.preprocessing.sequence import pad_sequences  # For padding sequences to a fixed length
from keras.models import Sequential                  # For building sequential neural network models
from keras.layers import (                            # Different types of neural network layers
    Activation,
    Dense,
    Dropout,
    Embedding,
    Flatten,
    Conv1D,
    MaxPooling1D,
    LSTM
)
from keras import utils                              # Utilities for data manipulation in Keras
from keras.callbacks import (                        # Callbacks for model training
    ReduceLROnPlateau,
    EarlyStopping
)

# Importing nltk for natural language processing tasks
import nltk
from nltk.corpus import stopwords                    # Stopwords list for text preprocessing
from nltk.stem import SnowballStemmer                # Stemmer for text preprocessing

# Importing gensim for word embedding models
import gensim

# Importing utility modules
import re                                            # Regular expression operations
import numpy as np                                   # Numerical operations
import os                                            # Operating system dependent functionality
from collections import Counter                      # Counter for counting occurrences
import logging                                       # Logging for tracking progress
import time                                          # Time-related functions
import pickle                                        # Serialization and deserialization of Python objects
import itertools                                     # Functions for efficient looping

# Setting up logging format
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


Keras provides different types of layers for constructing neural network architectures. Here's a brief explanation of each layer imported above:

1. **Activation**:
   - **Purpose**: Specifies the activation function to be used in a neural network layer.
   - **Usage**: Typically used after a layer to introduce non-linearity into the model. Common activation functions include ReLU, sigmoid, and tanh.

2. **Dense**:
   - **Purpose**: A fully connected layer where each neuron is connected to every neuron in the previous layer.
   - **Usage**: Often used as the output layer or in hidden layers of a neural network.

3. **Dropout**:
   - **Purpose**: A regularization technique to prevent overfitting in neural networks.
   - **Usage**: Randomly sets a fraction of input units to 0 during training to reduce the dependency on certain neurons.

4. **Embedding**:
   - **Purpose**: Represents categorical data or discrete entities as continuous vectors.
   - **Usage**: Commonly used in natural language processing tasks to convert word indices into dense vectors.

5. **Flatten**:
   - **Purpose**: Reshapes the input tensor into a 1D tensor.
   - **Usage**: Typically used to flatten the output of convolutional layers before feeding it into fully connected layers.

6. **Conv1D**:
   - **Purpose**: 1D convolutional layer for processing sequential data.
   - **Usage**: Used in neural networks for tasks like text classification and time series forecasting.

7. **MaxPooling1D**:
   - **Purpose**: Downsamples the input along the temporal dimension.
   - **Usage**: Used to shorten the sequence (temporal) dimension of the input, making the network more computationally efficient and helping to reduce overfitting.

8. **LSTM**:
   - **Purpose**: Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture.
   - **Usage**: Particularly effective for handling sequential data due to its ability to capture long-term dependencies.
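
As a rough illustration of three of the layers above, the following pure-Python sketch mimics what MaxPooling1D, Dense, and Activation compute on a tiny input. It is not Keras code: the function names, weights, and inputs are hypothetical, and the pooling assumes non-overlapping windows that drop any leftover step at the end.

```python
import math


def relu(values):
    """ReLU activation: negative values become 0."""
    return [max(0.0, v) for v in values]


def sigmoid(values):
    """Sigmoid activation: squashes each value into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def dense(inputs, weights, bias):
    """Fully connected layer: weights[j][i] connects input i to output unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def max_pool_1d(sequence, pool_size=2):
    """Max over non-overlapping windows; a leftover trailing step is dropped."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]


# Hypothetical sequence of 5 time steps with one feature each
features = max_pool_1d([1.0, 3.0, 2.0, 5.0, 4.0])            # 5 steps -> 2 steps
hidden = relu(dense(features, [[0.5, -1.0], [1.0, 0.5]], [0.0, -1.0]))
output = sigmoid(dense(hidden, [[1.0, 0.2]], [-0.4]))

print(features)  # pooled sequence
print(hidden)    # hidden activations after ReLU
print(output)    # single score between 0 and 1
```

The final single-unit sigmoid output plays the same role as the `Dense(1, activation='sigmoid')` layer used later in the notebook.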


In [None]:
# Downloading the stopwords corpus from NLTK for text preprocessing
nltk.download('stopwords')

### Settings

In [None]:
# DATASET
# Define the column names for the dataset
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
# Define the encoding of the dataset file
DATASET_ENCODING = "ISO-8859-1"
# Define the size of the training set as a fraction of the total dataset
TRAIN_SIZE = 0.8

# TEXT CLEANING
# Define the regular expression pattern for text cleaning:
# user mentions, URLs, and any run of non-alphanumeric characters
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# WORD2VEC
# Define the dimensionality of the word vectors
W2V_SIZE = 300
# Define the size of the context window for word2vec
W2V_WINDOW = 7
# Define the number of training epochs for word2vec
W2V_EPOCH = 32
# Define the minimum count threshold for words in word2vec
W2V_MIN_COUNT = 10

# KERAS
# Define the maximum length of input sequences for Keras
SEQUENCE_LENGTH = 300
# Define the number of training epochs for Keras
EPOCHS = 8
# Define the batch size for training in Keras
BATCH_SIZE = 1024

# SENTIMENT
# Define labels for sentiment analysis
POSITIVE = "POSITIVE"
NEGATIVE = "NEGATIVE"
NEUTRAL = "NEUTRAL"
# Define thresholds for classifying sentiments
SENTIMENT_THRESHOLDS = (0.4, 0.7)

# EXPORT
# Define filenames for saving trained models and tokenizer
KERAS_MODEL = "model.h5"
WORD2VEC_MODEL = "model.w2v"
TOKENIZER_MODEL = "tokenizer.pkl"
ENCODER_MODEL = "encoder.pkl"


KERAS_MODEL = "model.h5":

Purpose: This filename is used to save the trained Keras model.
File Extension: .h5 is a common extension used to save Keras models in the Hierarchical Data Format (HDF5), which is a data model, library, and file format for storing and managing large amounts of data.

WORD2VEC_MODEL = "model.w2v":

Purpose: This filename is used to save the trained Word2Vec model.
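File Extension: .w2v is a custom extension chosen here to signal that the file holds a Word2Vec model. gensim's `save()` writes its own pickle-based format regardless of the extension; the .bin and .txt extensions are conventionally used for vectors exported in the original word2vec format (for example with `wv.save_word2vec_format`).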

TOKENIZER_MODEL = "tokenizer.pkl":

Purpose: This filename is used to save the tokenizer object.
File Extension: .pkl stands for "pickle", which is a module in Python used for serializing and deserializing Python object structures. The tokenizer object contains the vocabulary and other configurations used to preprocess text data.

ENCODER_MODEL = "encoder.pkl":

Purpose: This filename is used to save the label encoder object.
File Extension: Like the tokenizer, the encoder is also saved with a .pkl extension using the pickle module. The encoder is used to convert categorical labels into numerical format, which is necessary for training machine learning models.
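
The pickle round trip described above can be sketched as follows. This is only an illustration: a plain dictionary named `word_index` stands in for the real tokenizer object, and the file is written to a temporary directory rather than to the notebook's working directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the tokenizer's vocabulary
word_index = {"love": 1, "rain": 2, "music": 3}

# Write to a temporary directory so the sketch leaves no files behind in the project
path = os.path.join(tempfile.mkdtemp(), "tokenizer.pkl")

# Serialize the object to disk
with open(path, "wb") as handle:
    pickle.dump(word_index, handle, protocol=0)

# Deserialize it again, e.g. in a later session
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == word_index)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.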

### Read Dataset

### Dataset details
* **target**: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
* **ids**: the id of the tweet (2087)
* **date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
* **flag**: the query (lyx). If there is no query, then this value is NO_QUERY.
* **user**: the user that tweeted (robotickilldozr)
* **text**: the text of the tweet (Lyx is cool)
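
To show how a row with these six columns maps onto the column names, here is a small sketch using only the standard library. The row is assembled from the example values listed above; the target value `4` is an assumption made for illustration, since the list does not state the target of that example tweet.

```python
import csv
import io

DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# One illustrative row; the target "4" is assumed, the other fields come from the list above
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'

# The file has no header row, so the column names are supplied explicitly
reader = csv.DictReader(io.StringIO(sample), fieldnames=DATASET_COLUMNS)
row = next(reader)

print(row["user"], "->", row["text"])
print(int(row["target"]))  # csv yields strings, so convert before comparing with 0/2/4
```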

In [None]:
import os  # Importing the os module for interacting with the operating system
import pandas as pd  # Importing pandas for data manipulation

# Get the list of filenames in the "../input" directory and select the first one
dataset_filename = os.listdir("../input")[0]

# Construct the full path to the selected dataset file
dataset_path = os.path.join("..", "input", dataset_filename)

# Print the path of the file that will be opened
print("Open file:", dataset_path)

# Read the dataset file into a DataFrame using pandas
# - `encoding=DATASET_ENCODING`: Specify the encoding of the file
# - `names=DATASET_COLUMNS`: Specify the column names for the DataFrame
df = pd.read_csv(dataset_path, encoding=DATASET_ENCODING, names=DATASET_COLUMNS)


In [None]:
print("Dataset size:", len(df))

In [None]:
df.head(5)

### Map target label to String
* **0** -> **NEGATIVE**
* **2** -> **NEUTRAL**
* **4** -> **POSITIVE**

In [None]:
# Define a mapping dictionary to map numerical labels to sentiment categories
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}

# Define a function to decode numerical labels to sentiment categories
def decode_sentiment(label):
    # Convert the label to an integer and use it to look up the corresponding sentiment category in the decode_map
    return decode_map[int(label)]


In [None]:
%%time
# Measure the time taken to execute the following code block

# Use the decode_sentiment function to decode numerical sentiment labels in the 'target' column
df.target = df.target.apply(lambda x: decode_sentiment(x))


In [None]:
# Count the frequency of each unique target label in the DataFrame
target_cnt = Counter(df.target)

# Create a new figure for plotting
plt.figure(figsize=(16,8))

# Create a bar plot to visualize the distribution of target labels
plt.bar(target_cnt.keys(), target_cnt.values())

# Set the title of the plot
plt.title("Dataset labels distribution")


### Pre-Process dataset

In [None]:
# Get the list of stopwords for the English language from nltk
# Stopwords are common words (such as "and", "the", "is", etc.) that are often filtered out during text preprocessing in natural language processing tasks. The resulting list is stored in the stop_words variable.

stop_words = stopwords.words("english")

# Create a Snowball stemmer object for English
# Stemming is the process of reducing words to their root or base form. The resulting stemmer object is stored in the stemmer variable.
stemmer = SnowballStemmer("english")


In [None]:
def preprocess(text, stem=False):
    # Remove links, user mentions, and special characters from the text
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()

    # Initialize an empty list to store the tokens after preprocessing
    tokens = []

    # Iterate through each token in the text
    for token in text.split():
        # Check if the token is not in the list of stopwords
        if token not in stop_words:
            # Check if stemming is required
            if stem:
                # Stem the token using the Snowball stemmer
                tokens.append(stemmer.stem(token))
            else:
                # Otherwise, add the token to the list of tokens
                tokens.append(token)

    # Join the list of tokens into a single string and return it
    return " ".join(tokens)
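
To see what this kind of cleaning does to a single tweet, the following self-contained sketch applies a pattern of the same shape as `TEXT_CLEANING_RE`. The stop-word set is a tiny hypothetical stand-in for the NLTK list, the function name `clean` is made up, and the tweet is invented for illustration.

```python
import re

# Same shape as the notebook's cleaning pattern: mentions, URLs, non-alphanumerics
CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch)
STOP_WORDS = {"i", "the", "is", "a", "to"}


def clean(text):
    """Lowercase, strip mentions/links/punctuation, then drop stop words."""
    text = re.sub(CLEANING_RE, " ", str(text).lower()).strip()
    return " ".join(token for token in text.split() if token not in STOP_WORDS)


print(clean("@user I love the new phone!! http://t.co/abc #happy"))
# prints: love new phone happy
```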


In [None]:
%%time
# Measure the time taken to execute the following code block

# Apply the preprocess function to each text entry in the 'text' column of the DataFrame
df.text = df.text.apply(lambda x: preprocess(x))

### Split train and test

In [None]:
# Split the DataFrame into training and testing sets
df_train, df_test = train_test_split(df, test_size=1-TRAIN_SIZE, random_state=42)

# Print the sizes of the training and testing sets
print("TRAIN size:", len(df_train))
print("TEST size:", len(df_test))


### Word2Vec

In [None]:
%%time
# Measure the time taken to execute the following code block

# Split each text entry in the 'text' column of the training DataFrame into a list of tokens
documents = [_text.split() for _text in df_train.text]


In [None]:
# Initialize a Word2Vec model using the parameters specified
# (gensim 4.x names the dimensionality argument `vector_size`; gensim 3.x called it `size`)
w2v_model = gensim.models.word2vec.Word2Vec(
    vector_size=W2V_SIZE,   # Dimensionality of the word vectors
    window=W2V_WINDOW,      # Maximum distance between the current and predicted word within a sentence
    min_count=W2V_MIN_COUNT,  # Minimum frequency count of words to consider
    workers=8               # Number of worker threads to use for training the model
)


The next cell builds the vocabulary for the Word2Vec model from the tokenized text data. Building the vocabulary means identifying the unique words (tokens) in the documents, discarding those that occur fewer than `min_count` times, and assigning an index to each remaining word; these indices are used during training to learn word embeddings.

In [None]:
# Build the vocabulary of the Word2Vec model using the tokenized documents
w2v_model.build_vocab(documents)
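
As a rough pure-Python analogue of the vocabulary step, the sketch below counts tokens, drops the rare ones, and assigns an index to each survivor. It is not gensim's implementation; the function name `build_vocab_sketch`, the toy documents, and the choice to order words by frequency are assumptions made for this illustration.

```python
from collections import Counter


def build_vocab_sketch(documents, min_count=1):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(token for doc in documents for token in doc)
    # Keep tokens seen at least min_count times, most frequent first
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index for index, word in enumerate(kept)}


# Hypothetical tokenized documents
docs = [["love", "music"], ["love", "rain"], ["hate", "rain"], ["love"]]

vocab = build_vocab_sketch(docs, min_count=2)
print(vocab)  # "music" and "hate" occur once and are dropped
```

With `W2V_MIN_COUNT = 10`, any word appearing fewer than ten times in the training tweets is excluded in the same way.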


In [None]:
# Retrieve the words in the Word2Vec model's vocabulary
# (gensim 4.x exposes them as `wv.key_to_index`; gensim 3.x used `wv.vocab`)
words = w2v_model.wv.key_to_index.keys()

# Calculate the size of the vocabulary
vocab_size = len(words)

# Print the size of the vocabulary
print("Vocab size", vocab_size)


In [None]:
%%time
# Measure the time taken to execute the following code block

# Train the Word2Vec model on the tokenized documents
w2v_model.train(
    documents,                    # Tokenized documents
    total_examples=len(documents), # Total number of documents
    epochs=W2V_EPOCH               # Number of training epochs
)

In [None]:
w2v_model.wv.most_similar("love")

### Tokenize Text

In [None]:
%%time
# Measure the time taken to execute the following code block

# Initialize a Tokenizer
tokenizer = Tokenizer()

# Fit the Tokenizer on the text data in the training set
tokenizer.fit_on_texts(df_train.text)

# Calculate the total number of unique words in the vocabulary
vocab_size = len(tokenizer.word_index) + 1

# Print the total number of unique words in the vocabulary
print("Total words", vocab_size)
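
The tokenizer step can be pictured with a small pure-Python sketch. It follows the same idea (more frequent words get smaller indices, index 0 is reserved for padding, unknown words are skipped), but it is a simplification: the function names are hypothetical, and it only lowercases and splits on whitespace rather than applying Keras' punctuation filters.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign indices starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences_sketch(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


# Hypothetical training texts
word_index = fit_word_index(["love music", "love rain", "hate rain"])
sequences = texts_to_sequences_sketch(["love snow rain"], word_index)

print(word_index)
print(sequences)            # "snow" was never seen, so it is skipped
print(len(word_index) + 1)  # vocabulary size including the reserved index 0
```

This is why the cell above adds 1 to `len(tokenizer.word_index)`: the embedding matrix needs a row for index 0 as well.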


Converting text data into integers is a necessary step in preparing the data for training machine learning models, particularly neural networks. Here are a few reasons why this conversion is important:

*   **Numerical Representation**: Machine learning models, including neural networks, require numerical input. Text data, being categorical in nature, needs to be converted into a numerical format that can be processed by these models.
*   **Fixed Length Input**: Neural networks require fixed-size input vectors. Tokenizing and converting text into integers allows you to represent variable-length text sequences as fixed-length integer sequences by padding or truncating as needed.
*   **Efficient Storage and Computation**: Integer representations are more memory-efficient and faster to process compared to raw text data. This is crucial when dealing with large datasets and complex models.
*   **Embedding Layer in Neural Networks**: In natural language processing tasks, such as text classification or sentiment analysis, the tokenized integer sequences are often passed through an embedding layer in a neural network. This layer converts the integer indices into dense vectors (embeddings) of fixed size, where semantically similar words are mapped to nearby points in the vector space.
*   **Consistency**: Converting text into a consistent numerical format ensures that the model receives standardized input, which is essential for achieving reliable and consistent performance.

In [None]:
%%time
# Measure the time taken to execute the following code block

# Convert the tokenized text sequences into padded sequences of integers for training and testing sets
x_train = pad_sequences(tokenizer.texts_to_sequences(df_train.text), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(df_test.text), maxlen=SEQUENCE_LENGTH)


The pad_sequences function is used to ensure that all sequences in a list have the same length by either padding or truncating them. This is particularly important when working with sequences of variable length, such as sentences or paragraphs of text, which need to be converted into fixed-length vectors for input to machine learning models like neural networks.
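
The padding and truncating behaviour can be sketched in a few lines of plain Python. This mirrors the Keras defaults as I understand them (zeros are added at the front, and overly long sequences lose their earliest items); the function name `pad_sequences_sketch` is made up, and `maxlen` is assumed to be a positive integer.

```python
def pad_sequences_sketch(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                     # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded


print(pad_sequences_sketch([[1, 2], [3, 4, 5, 6]], maxlen=3))
# prints: [[0, 1, 2], [4, 5, 6]]
```

In this notebook `SEQUENCE_LENGTH` is 300, far longer than a typical tweet, so most rows consist mainly of leading zeros.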

### Label Encoder

In [None]:
# Get the unique labels from the 'target' column of the training DataFrame
labels = df_train.target.unique().tolist()

# Append the 'NEUTRAL' label to the list of unique labels
labels.append(NEUTRAL)

# Print the updated list of labels
print(labels)


In [None]:
# Initialize a LabelEncoder
encoder = LabelEncoder()

# Fit the encoder on the 'target' column of the training DataFrame and transform the labels to integers
y_train = encoder.fit_transform(df_train.target.tolist())
y_test = encoder.transform(df_test.target.tolist())

# Reshape the encoded labels to be 2D arrays
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# Print the shapes of the encoded label arrays
print("y_train", y_train.shape)
print("y_test", y_test.shape)
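
To make the encoding step concrete, here is a minimal sketch of what a label encoder does: it sorts the distinct labels and maps each one to its position. The function name `fit_label_encoder` and the sample labels are hypothetical; the real `LabelEncoder` offers more (for example `inverse_transform`).

```python
def fit_label_encoder(labels):
    """Return the sorted class list and a label -> integer mapping."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


# Hypothetical labels in the order they might appear in the data
labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

classes, mapping = fit_label_encoder(labels)
encoded = [mapping[label] for label in labels]

print(classes)  # alphabetical order decides the integer codes
print(encoded)
```

Because the classes are sorted, NEGATIVE becomes 0 and POSITIVE becomes 1, which is why a sigmoid output close to 1 is later read as positive sentiment.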


In [None]:
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print()
print("x_test", x_test.shape)
print("y_test", y_test.shape)

In [None]:
y_train[:10]

### Embedding layer

In [None]:
# Initialize an embedding matrix with zeros
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))

# Populate the embedding matrix with word vectors from the trained Word2Vec model
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]

# Print the shape of the embedding matrix
print(embedding_matrix.shape)



An embedding matrix is a 2D matrix where each row corresponds to a word in the vocabulary, and each column to a feature of the word vectors, typically from a pre-trained model like Word2Vec. It provides dense, fixed-size representations of words, aiding neural networks in processing text data.
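
A toy version of this construction, using plain lists instead of NumPy, may help. The two-dimensional vectors, the word `"zzz"`, and the function name `build_embedding_matrix` are assumptions for illustration; the logic follows the cell above, where row `i` holds the vector of the word with index `i` and words without a vector keep a row of zeros.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i is the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix


# Hypothetical tokenizer vocabulary and 2-dimensional word vectors
word_index = {"love": 1, "rain": 2, "zzz": 3}
vectors = {"love": [0.9, 0.1], "rain": [-0.2, 0.9]}

matrix = build_embedding_matrix(word_index, vectors, dim=2)

# An embedding lookup is just row selection by index
sequence = [0, 1, 3]
print([matrix[i] for i in sequence])
```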

In [None]:
# Define an Embedding layer for the neural network model
# This layer uses pre-trained Word2Vec embeddings as initial weights

embedding_layer = Embedding(
    vocab_size,                   # Size of the vocabulary
    W2V_SIZE,                     # Dimensionality of the Word2Vec word vectors
    weights=[embedding_matrix],   # Pre-trained embedding matrix as initial weights
    input_length=SEQUENCE_LENGTH, # Length of input sequences
    trainable=False               # Freeze the embedding layer during training
)


### Build Model

In [None]:
# Initialize a Sequential model
model = Sequential()

# Add the pre-trained embedding layer to the model
model.add(embedding_layer)

# Add a Dropout layer to prevent overfitting
model.add(Dropout(0.5))

# Add an LSTM layer with 100 units
# The dropout and recurrent dropout are used for regularization to prevent overfitting
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))

# Add a Dense output layer with a sigmoid activation function
# This layer outputs a single value between 0 and 1, representing the predicted sentiment score
model.add(Dense(1, activation='sigmoid'))

# Print a summary of the model architecture
model.summary()


### Compile model

In [None]:
# Compile the model with binary cross-entropy loss, Adam optimizer, and accuracy metric
model.compile(
    loss='binary_crossentropy', # Loss function for binary classification
    optimizer="adam",            # Optimizer algorithm for model training
    metrics=['accuracy']         # Evaluation metric to monitor during training
)


model.compile: This method configures the model for training.

loss='binary_crossentropy': Specifies the loss function to use for binary classification tasks. Binary cross-entropy is commonly used for binary classification problems, like sentiment analysis.

optimizer="adam": Specifies the optimizer algorithm to use during training. Adam is an adaptive learning rate optimization algorithm that is well-suited for training deep learning models.

metrics=['accuracy']: Specifies the evaluation metric to monitor during training. In this case, we are monitoring the classification accuracy of the model on the training and validation data.
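
For intuition about the loss, binary cross-entropy for a single example with true label y and predicted probability p is -(y·log(p) + (1 - y)·log(1 - p)), averaged over the batch. The sketch below computes it in plain Python; the function name and the clipping constant `eps` are assumptions of this sketch, not values taken from the notebook.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of (label, probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A confident correct prediction is cheap, a confident wrong one is expensive
print(binary_crossentropy([1], [0.9]))
print(binary_crossentropy([1], [0.1]))
```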

### Callbacks

In [None]:
# Define callbacks to enhance training performance
callbacks = [
    # Reduce learning rate when a metric has stopped improving
    ReduceLROnPlateau(
        monitor='val_loss',    # Monitor validation loss
        patience=5,             # Number of epochs with no improvement after which learning rate will be reduced
        cooldown=0              # Number of epochs to wait before resuming normal operation after reducing the learning rate
    ),
    # Stop training when a monitored metric has stopped improving
    EarlyStopping(
        monitor='val_accuracy', # Monitor validation accuracy (logged as 'val_accuracy' with metrics=['accuracy'] in Keras 2.3+)
        min_delta=1e-4,         # Minimum change in the monitored quantity to qualify as an improvement
        patience=5              # Number of epochs with no improvement after which training will be stopped
    )
]
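
The patience logic behind early stopping can be sketched as follows. This is a simplified model of the idea for a metric that should increase (such as accuracy), not the Keras implementation; the function name and the metric values are made up.

```python
def early_stopping_epoch(values, patience, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(values):
        if value - min_delta > best:
            # Improvement larger than min_delta: remember it and reset the counter
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracies per epoch
history = [0.70, 0.75, 0.75, 0.74, 0.75]
print(early_stopping_epoch(history, patience=2, min_delta=1e-4))  # prints: 3
```

In this simplified model, with `patience=5` and `EPOCHS = 8`, stopping can only trigger if the metric fails to improve for five consecutive epochs, so it would cut short at most the last few epochs.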


### Train

In [None]:
%%time
# Train the model and measure the training time
history = model.fit(
    x_train,                         # Input features (training data)
    y_train,                         # Target labels (training data)
    batch_size=BATCH_SIZE,           # Number of samples per gradient update
    epochs=EPOCHS,                   # Number of training epochs
    validation_split=0.1,            # Fraction of training data to be used as validation data
    verbose=1,                       # Verbosity mode (0 = silent, 1 = progress bar, 2 = one line per epoch)
    callbacks=callbacks              # Callbacks for enhancing training performance
)


### Evaluate

In [None]:
%%time
# Evaluate the model on the test data and measure the evaluation time
score = model.evaluate(
    x_test,                         # Input features (test data)
    y_test,                         # Target labels (test data)
    batch_size=BATCH_SIZE           # Number of samples per evaluation batch
)

# Print the evaluation results
print()
print("ACCURACY:", score[1])        # Print the accuracy score
print("LOSS:", score[0])             # Print the loss value


In [None]:
# Extract training history data
acc = history.history['accuracy']           # Training accuracy
val_acc = history.history['val_accuracy']   # Validation accuracy
loss = history.history['loss']         # Training loss
val_loss = history.history['val_loss'] # Validation loss

# Create range of epochs
epochs = range(len(acc))

# Plot Training and Validation Accuracy
plt.plot(epochs, acc, 'b', label='Training acc')      # Training accuracy curve
plt.plot(epochs, val_acc, 'r', label='Validation acc') # Validation accuracy curve
plt.title('Training and validation accuracy')         # Plot title
plt.legend()                                           # Add legend
plt.xlabel('Epochs')                                  # Label for x-axis
plt.ylabel('Accuracy')                                # Label for y-axis
plt.show()                                             # Show plot

# Create new figure for Loss curves
plt.figure()

# Plot Training and Validation Loss
plt.plot(epochs, loss, 'b', label='Training loss')      # Training loss curve
plt.plot(epochs, val_loss, 'r', label='Validation loss') # Validation loss curve
plt.title('Training and validation loss')               # Plot title
plt.legend()                                             # Add legend
plt.xlabel('Epochs')                                    # Label for x-axis
plt.ylabel('Loss')                                      # Label for y-axis
plt.show()                                               # Show plot


### Predict

In [None]:
def decode_sentiment(score, include_neutral=True):
    """
    Decode sentiment score to sentiment label
    :param score: float, sentiment score between 0 and 1
    :param include_neutral: bool, whether to include neutral sentiment or not
    :return: str, sentiment label (POSITIVE, NEGATIVE, NEUTRAL)
    """
    if include_neutral:
        # Set default label to NEUTRAL
        label = NEUTRAL
        # Check if score is below negative threshold
        if score <= SENTIMENT_THRESHOLDS[0]:
            label = NEGATIVE
        # Check if score is above positive threshold
        elif score >= SENTIMENT_THRESHOLDS[1]:
            label = POSITIVE

        return label
    else:
        # Return NEGATIVE for scores below 0.5, otherwise POSITIVE
        return NEGATIVE if score < 0.5 else POSITIVE


In [None]:
def predict(text, include_neutral=True):
    """
    Predict sentiment for a given text
    :param text: str, input text for sentiment prediction
    :param include_neutral: bool, whether to include neutral sentiment or not
    :return: dict, dictionary containing predicted label, score, and elapsed time
    """
    # Record the start time
    start_at = time.time()

    # Tokenize and pad the input text
    x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=SEQUENCE_LENGTH)

    # Predict sentiment score (the prediction has shape (1, 1); take the scalar)
    score = model.predict(x_test)[0][0]

    # Decode the sentiment score to label
    label = decode_sentiment(score, include_neutral=include_neutral)

    # Calculate elapsed time for prediction
    elapsed_time = time.time() - start_at

    return {
        "label": label,            # Predicted sentiment label
        "score": float(score),     # Predicted sentiment score
        "elapsed_time": elapsed_time  # Elapsed time for prediction
    }


In [None]:
predict("I love the music")

In [None]:
predict("I hate the rain")

In [None]:
predict("i don't know what i'm doing")

### Confusion Matrix

In [None]:
%%time
# Initialize lists to store predicted and true labels
y_pred_1d = []
y_test_1d = list(df_test.target)

# Predict sentiment scores for the test data
scores = model.predict(x_test, verbose=1, batch_size=8000)

# Decode sentiment scores to labels
y_pred_1d = [decode_sentiment(score, include_neutral=False) for score in scores]


In [None]:
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Each row is normalized to sum to 1, so cells show per-class fractions.

    :param cm: ndarray, confusion matrix
    :param classes: list, class labels
    :param title: str, plot title
    :param cmap: matplotlib colormap, color scheme for the plot
    """

    # Normalize the confusion matrix
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    # Plot the confusion matrix
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()

    # Define the tick marks for the plot
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90, fontsize=22)
    plt.yticks(tick_marks, classes, fontsize=22)

    # Add text annotations for each cell in the matrix
    fmt = '.2f'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    # Add axis labels and set their font sizes
    plt.ylabel('True label', fontsize=25)
    plt.xlabel('Predicted label', fontsize=25)
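
The row normalization used in the function above (`cm / cm.sum(axis=1)[:, np.newaxis]`) divides every cell by the total of its row, so each row shows how the examples of one true class were distributed over the predicted classes. A plain-Python sketch with a made-up 2x2 matrix:

```python
def normalize_rows(cm):
    """Divide each cell by its row total (rows = true labels)."""
    return [[count / sum(row) for count in row] for row in cm]


# Hypothetical counts: rows are true classes, columns are predicted classes
cm = [[8, 2],
      [1, 9]]

print(normalize_rows(cm))  # prints: [[0.8, 0.2], [0.1, 0.9]]
```

A row whose total is zero (a class with no true examples) would cause a division by zero in this sketch, so it assumes every class appears at least once.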


In [None]:
%%time

# Fix the label order explicitly so the axis ticks match the matrix rows and columns
cm_labels = sorted(set(y_test_1d) | set(y_pred_1d))

# Calculate the confusion matrix
cnf_matrix = confusion_matrix(y_test_1d, y_pred_1d, labels=cm_labels)

# Create a new figure with specified size
plt.figure(figsize=(12,12))

# Plot the confusion matrix using the defined function
plot_confusion_matrix(cnf_matrix, classes=cm_labels, title="Confusion matrix")

# Display the plot
plt.show()


### Classification Report

In [None]:
# Print the classification report
print(classification_report(y_test_1d, y_pred_1d))
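
The per-class numbers in such a report come from three counts: true positives, false positives, and false negatives. The sketch below derives precision, recall, and F1 for one class in plain Python; the function name and the label lists are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical true and predicted labels
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]

print(precision_recall_f1(y_true, y_pred, positive="POSITIVE"))
```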


### Accuracy Score

In [None]:
accuracy_score(y_test_1d, y_pred_1d)

### Save model

In [None]:
# Save the Keras model
model.save(KERAS_MODEL)

# Save the Word2Vec model
w2v_model.save(WORD2VEC_MODEL)

# Save the Tokenizer using pickle (the with-block closes the file afterwards)
with open(TOKENIZER_MODEL, "wb") as f:
    pickle.dump(tokenizer, f, protocol=0)

# Save the LabelEncoder using pickle
with open(ENCODER_MODEL, "wb") as f:
    pickle.dump(encoder, f, protocol=0)
