
In [1]:
import numpy as np
import collections
import math
import itertools
from collections import Counter
from scipy.sparse import lil_matrix
import tqdm
import matplotlib.pyplot as plt


In [37]:
# corpus is a string containing the text of a Sherlock Holmes story

def preprocess_corpus(corpus):
    # Lowercase the text and split it on whitespace
    tokens = corpus.lower().split()
    return tokens

tokens = preprocess_corpus(corpus)[:1000]
print(len(tokens))

1000


In [38]:
def build_cooccurrence_matrix(tokens, window_size=5):
    vocab = Counter(tokens)
    vocab_size = len(vocab)
    index_dict = {word: i for i, word in enumerate(vocab)}
    cooccurrence_matrix = lil_matrix((vocab_size, vocab_size), dtype=np.float64)

    for idx, word in enumerate(tokens):
        word_idx = index_dict[word]
        start = max(0, idx - window_size)
        end = min(len(tokens), idx + window_size + 1)

        for i in range(start, end):
            if i != idx:
                context_word = tokens[i]
                context_word_idx = index_dict[context_word]
                cooccurrence_matrix[word_idx, context_word_idx] += 1 / abs(i - idx)

    return cooccurrence_matrix, index_dict

cooccurrence_matrix, index_dict = build_cooccurrence_matrix(tokens)

Step-by-Step Explanation


Define the Function:
Start by defining the build_cooccurrence_matrix function. This function takes two parameters:
tokens: A list of words to process.
window_size: An integer specifying the context window size around each word (default is 5).


Count Word Frequencies:
Use the Counter from Python's collections module to count how often each word appears in the tokens list. This helps in building a vocabulary of unique words and their frequencies.
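
As a small illustration of this counting step, the sketch below runs Counter on a made-up token list. The name sample_tokens and its contents are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# Hypothetical toy input; the notebook counts the first 1000 corpus tokens instead.
sample_tokens = ["the", "woman", "the", "man", "the"]

vocab = Counter(sample_tokens)

print(vocab["the"])  # how many times "the" appears
print(len(vocab))    # number of unique words, i.e. the vocabulary size
```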


Create Word Indices:
Create a dictionary (index_dict) that maps each unique word to a numerical index. This index will be used to place words in the co-occurrence matrix.
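
A minimal sketch of the index mapping, again on a hypothetical toy token list (sample_tokens is an assumption for illustration). Because Counter keeps keys in first-seen order on Python 3.7+, indices follow the order in which words first appear.

```python
from collections import Counter

# Hypothetical toy input, not taken from the corpus.
sample_tokens = ["the", "woman", "the", "man", "the"]
vocab = Counter(sample_tokens)

# Same comprehension as in build_cooccurrence_matrix: word -> row/column index.
index_dict = {word: i for i, word in enumerate(vocab)}

# The reverse mapping is handy when reading rows of the matrix back as words.
reverse_index = {i: word for word, i in index_dict.items()}

print(index_dict)
print(reverse_index[1])
```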


Initialize a Co-occurrence Matrix:
Create an empty vocab_size-by-vocab_size matrix (cooccurrence_matrix) using lil_matrix from the scipy.sparse module. The list-of-lists sparse format suits incremental updates, and the matrix will hold the distance-weighted co-occurrence counts for each pair of words.



Iterate Over Tokens:
Loop through the tokens list to process each word (word) and its position index (idx).



Determine Context Window:
For each word in tokens, determine the range of surrounding positions that form its context. The range is given by the start and end indices:
start is max(0, idx - window_size), so the window never begins before the first token.
end is min(len(tokens), idx + window_size + 1), so the window never runs past the last token; the + 1 is needed because range(start, end) excludes end.
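
The clipping described above can be checked with a few positions. The helper context_bounds below is a hypothetical name introduced only for this sketch; it repeats the two expressions used in build_cooccurrence_matrix.

```python
def context_bounds(idx, n_tokens, window_size=5):
    """Return (start, end) so that range(start, end) stays inside the token list."""
    start = max(0, idx - window_size)
    end = min(n_tokens, idx + window_size + 1)
    return start, end

# With 1000 tokens and the default window of 5:
left_edge = context_bounds(0, 1000)     # window clipped on the left
middle = context_bounds(500, 1000)      # full window on both sides
right_edge = context_bounds(998, 1000)  # window clipped on the right

print(left_edge, middle, right_edge)
```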



Iterate Over Context Words:
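Within the context window (start up to, but not including, end), loop over every position i except idx itself. The token at each such position is the context_word; a repeated occurrence of the same word elsewhere in the window still counts.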



Update Co-occurrence Matrix:
For each pair of word and context_word, update the corresponding entry in the cooccurrence_matrix. Increase the value by 1 / abs(i - idx), where i is the index of context_word and idx is the index of word. This step captures the strength of association between word and context_word based on their distance within the window.
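
To make the 1 / distance weighting concrete, here is a minimal sketch on a four-token toy sequence. The tokens, the window size of 2, and the plain dictionary used in place of the sparse matrix are all assumptions chosen to keep the example small.

```python
from collections import defaultdict

# Hypothetical toy input and window size.
sample_tokens = ["a", "b", "c", "a"]
window_size = 2

# (word, context_word) -> accumulated weight; stands in for the sparse matrix.
weights = defaultdict(float)

for idx, word in enumerate(sample_tokens):
    start = max(0, idx - window_size)
    end = min(len(sample_tokens), idx + window_size + 1)
    for i in range(start, end):
        if i != idx:
            # Adjacent tokens add 1, tokens two positions apart add 0.5.
            weights[(word, sample_tokens[i])] += 1 / abs(i - idx)

for pair, value in sorted(weights.items()):
    print(pair, value)
```

Since every position is visited both as the center word and as a context word, the accumulated weights are symmetric: the entry for (a, b) equals the entry for (b, a).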



Return Results:
After iterating through all words and updating the matrix, return the populated cooccurrence_matrix and the index_dict.

In [39]:
vocab = Counter(tokens)

In [40]:
index_dict = {word: i for i, word in enumerate(vocab)}

In [41]:
class GloVe:
    def __init__(self, vocab_size, embedding_dim=50, x_max=100, alpha=0.75):
        self.embedding_dim = embedding_dim
        self.x_max = x_max
        self.alpha = alpha

        self.W = np.random.rand(vocab_size, embedding_dim)
        self.W_context = np.random.rand(vocab_size, embedding_dim)
        self.b = np.random.rand(vocab_size)
        self.b_context = np.random.rand(vocab_size)

        self.gradsq_W = np.ones((vocab_size, embedding_dim))
        self.gradsq_W_context = np.ones((vocab_size, embedding_dim))
        self.gradsq_b = np.ones(vocab_size)
        self.gradsq_b_context = np.ones(vocab_size)

    def weight_func(self, x):
        return (x / self.x_max) ** self.alpha if x < self.x_max else 1

    def fit(self, cooccurrence_matrix, index_dict, epochs=100, learning_rate=0.05):
        vocab_size = len(index_dict)
        for epoch in tqdm.tqdm(range(epochs)):
            for i in range(vocab_size):
                for j in range(vocab_size):
                    x_ij = cooccurrence_matrix[i, j]
                    if x_ij == 0:
                        continue
                    weight = self.weight_func(x_ij)
                    inner_prod = np.dot(self.W[i], self.W_context[j]) + self.b[i] + self.b_context[j]
                    cost = inner_prod - np.log(x_ij)

                    # Compute every gradient from the pre-update parameters, so the
                    # context update does not use an already-modified self.W[i]
                    grad_W = weight * cost * self.W_context[j]
                    grad_W_context = weight * cost * self.W[i]
                    grad_b = weight * cost

                    # AdaGrad-style step: scale by the accumulated squared gradients
                    self.W[i] -= learning_rate * grad_W / np.sqrt(self.gradsq_W[i])
                    self.W_context[j] -= learning_rate * grad_W_context / np.sqrt(self.gradsq_W_context[j])
                    self.b[i] -= learning_rate * grad_b / np.sqrt(self.gradsq_b[i])
                    self.b_context[j] -= learning_rate * grad_b / np.sqrt(self.gradsq_b_context[j])

                    # Accumulate the squared gradients for later steps
                    self.gradsq_W[i] += grad_W ** 2
                    self.gradsq_W_context[j] += grad_W_context ** 2
                    self.gradsq_b[i] += grad_b ** 2
                    self.gradsq_b_context[j] += grad_b ** 2

    def get_embedding(self, word, index_dict):
        word_idx = index_dict[word]
        return self.W[word_idx] + self.W_context[word_idx]


In [42]:
# Initialize the GloVe model
embedding_dim = 50
glove_model = GloVe(vocab_size=len(index_dict), embedding_dim=embedding_dim)

# Train the model
epochs = 100
learning_rate = 0.05
glove_model.fit(cooccurrence_matrix, index_dict, epochs=epochs, learning_rate=learning_rate)

100%|██████████| 100/100 [01:43<00:00,  1.04s/it]


In [43]:
# Get the embedding vector for a sample word
word = 'deduction'
embedding_vector = glove_model.get_embedding(word, index_dict)
print(f"Embedding vector for '{word}': {embedding_vector}")

Embedding vector for 'deduction': [ 0.58709191  0.23773688 -0.41814824  0.42605251  0.73630965  0.51938128
  1.02113996  0.83698164  0.61483124 -0.0607342   1.11388758  0.22387023
  1.05113765  0.82584514  0.17191324  0.47326527  0.74657886  0.96291849
  0.74271006  1.05681575  0.25061816  0.59595141  0.06347256  0.627788
  0.93272076  0.19034006  0.67797183  0.29230095  0.67717746 -0.11653874
  0.99198813  1.11758164 -0.29649179  0.30757772 -0.28760836  0.63913467
  0.118504    0.50221188  0.55269492  0.58940624  0.8036267  -0.11463196
  0.39476471  0.29422352 -0.22400735  0.59716211  0.46941938  0.06097593
  1.25990476  0.34266969]


In [46]:
index_dict.keys()

dict_keys(['adventure', 'i.', 'a', 'scandal', 'in', 'bohemia', 'to', 'sherlock', 'holmes', 'she', 'is', 'always', 'the', 'woman.', 'i', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name.', 'his', 'eyes', 'eclipses', 'and', 'predominates', 'whole', 'of', 'sex.', 'it', 'was', 'not', 'that', 'he', 'felt', 'emotion', 'akin', 'love', 'for', 'irene', 'adler.', 'all', 'emotions,', 'one', 'particularly,', 'were', 'abhorrent', 'cold,', 'precise', 'but', 'admirably', 'balanced', 'mind.', 'was,', 'take', 'it,', 'most', 'perfect', 'reasoning', 'observing', 'machine', 'world', 'has', 'seen,', 'as', 'lover', 'would', 'placed', 'himself', 'false', 'position.', 'never', 'spoke', 'softer', 'passions,', 'save', 'with', 'gibe', 'sneer.', 'they', 'admirable', 'things', 'observer--excellent', 'drawing', 'veil', 'from', "men's", 'motives', 'actions.', 'trained', 'reasoner', 'admit', 'such', 'intrusions', 'into', 'own', 'delicate', 'finely', 'adjusted', 'temperament', 'introd

In [51]:
from sklearn.linear_model import LogisticRegression
import numpy as np

# Sentiment analysis model
class SentimentAnalysisModel:
    def __init__(self, glove_model, index_dict):
        self.glove_model = glove_model
        self.index_dict = index_dict
        self.logreg = LogisticRegression()

    def get_features(self, sentence):
        tokens = sentence.lower().split()
        sentence_embedding = np.zeros(self.glove_model.embedding_dim)
        for token in tokens:
            # Skip out-of-vocabulary tokens; known tokens add their embedding to the sum
            if token in self.index_dict:
                sentence_embedding += self.glove_model.get_embedding(token, self.index_dict)
        return sentence_embedding.reshape(1, -1)

    def train(self, labeled_sentences):
        X_train = []
        y_train = []

        for sentence, label in labeled_sentences:
            sentence_embedding = self.get_features(sentence)
            X_train.append(sentence_embedding)
            y_train.append(label)

        X_train = np.vstack(X_train)
        self.logreg.fit(X_train, y_train)

    def predict(self, sentences):
        X_pred = []
        for sentence in sentences:
            sentence_embedding = self.get_features(sentence)
            X_pred.append(sentence_embedding)

        X_pred = np.vstack(X_pred)
        predictions = self.logreg.predict(X_pred)
        return predictions


In [52]:
labeled_sentences = [
    ("Holmes' methods draw a veil from men's motives.", 0),
    ("His finely adjusted temperament is disturbed by emotional intrusions.", 1),
    ("He loathed every form of society.", 1),
    ("Holmes' abhorrence of emotions was evident in his sneer.", 0),
    ("He saw the signs of activity in the reign of Holland.", 1),
    ("Holmes' observations were particularly acute.", 1),
    ("The mysterious case of Trepoff's murder baffled official police.", 1),
    ("Holmes' balanced mind admired Irene Adler's courage.", 1),
    ("His deep emotions were abhorrent to his cold, precise mind.", 0),
    ("Sherlock Holmes is deeply absorbed in the study of crime.", 1),
    ("He has a deep aversion to softer passions.", 0),
    ("His complete happiness was home-centered.", 1),
    ("The crack in the lens could throw doubt on his mental results.", 0),
    ("Holmes remained immersed in his work.", 1),
    ("The woman, Irene Adler, was a singular and tragic figure.", 0),
    ("He remained buried among old books in his lodgings.", 1),
    ("Holmes' loathing of society contrasted sharply with Watson's habits.", 1),
    ("His balanced mind admired Irene Adler's courage.", 1),
    ("Holmes' abhorrence of emotions was evident in his sneer.", 0),
    ("He has a deep aversion to softer passions.", 0),
    ("His complete happiness was home-centered.", 1),
    ("The crack in the lens could throw doubt on his mental results.", 0),
    ("Holmes remained immersed in his work.", 1),
    ("The woman, Irene Adler, was a singular and tragic figure.", 0),
    ("He remained buried among old books in his lodgings.", 1),
    ("Holmes' loathing of society contrasted sharply with Watson's habits.", 1),
    ("His balanced mind admired Irene Adler's courage.", 1),
    ("Holmes' abhorrence of emotions was evident in his sneer.", 0),
    ("He has a deep aversion to softer passions.", 0),
    ("His complete happiness was home-centered.", 1)
]

In [53]:
# Create the sentiment analysis model and train it
sentiment_model = SentimentAnalysisModel(glove_model, index_dict)
sentiment_model.train(labeled_sentences[:20])



In [58]:
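# Hold out the last 10 sentences for evaluation. Each of them also appears
# among the first 20 training sentences, so the score below measures fit to
# seen data rather than generalization.
y_test = [label for sentence, label in labeled_sentences[20:]]
y_test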

[1, 0, 1, 0, 1, 1, 1, 0, 0, 1]

In [59]:
x_test =[sentence for sentence,label in labeled_sentences[20:]]
x_test

['His complete happiness was home-centered.',
 'The crack in the lens could throw doubt on his mental results.',
 'Holmes remained immersed in his work.',
 'The woman, Irene Adler, was a singular and tragic figure.',
 'He remained buried among old books in his lodgings.',
 "Holmes' loathing of society contrasted sharply with Watson's habits.",
 "His balanced mind admired Irene Adler's courage.",
 "Holmes' abhorrence of emotions was evident in his sneer.",
 'He has a deep aversion to softer passions.',
 'His complete happiness was home-centered.']

In [61]:
predictions=sentiment_model.predict(x_test)

In [62]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 1.0
