This phase is about moving from frequency-based features (TF-IDF) to semantic word representations.

text → TF-IDF → model

text → word embeddings → sentence vector → model

This helps the model understand meaning, not just word counts.

In [1]:
import pandas as pd
import nltk
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Install gensim first if it is missing, e.g. run `%pip install gensim` in its own notebook cell.
from gensim.models import Word2Vec

In [18]:
df = pd.read_csv("../data/sentimentdataset.csv")
df.head()


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,0,0,Enjoying a beautiful day at the park! ...,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,1,1,Traffic was terrible this morning. ...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,2,2,Just finished an amazing workout! 💪 ...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,3,3,Excited about the upcoming weekend getaway! ...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,4,4,Trying out a new recipe for dinner tonight. ...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19


## Step 3: Recreate preprocessing function

In [19]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    
    # Tokenize
    tokens = word_tokenize(text)

    # Remove stop words and lemmatize the remaining tokens
    processed_tokens = [
        lemmatizer.lemmatize(token) for token in tokens if token not in stop_words
    ]
    
    return processed_tokens

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Nishi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nishi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Nishi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Step 4: Create tokenized column

In [20]:
df["tokens"] = df["Text"].apply(preprocess_text)
df["tokens"].head()

0          [enjoying, beautiful, day, park]
1              [traffic, terrible, morning]
2           [finished, amazing, workout, 💪]
3     [excited, upcoming, weekend, getaway]
4    [trying, new, recipe, dinner, tonight]
Name: tokens, dtype: object

## Step 5: Train Word2Vec model

In [21]:
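# Train Word2Vec on the tokenized corpus
w2v_model = Word2Vec(
    sentences=df["tokens"],  # iterable of token lists, one per post
    vector_size=100,         # dimensionality of each word vector
    window=5,                # max distance between the current and the predicted word
    min_count=1,             # keep every word, even those seen only once
    workers=4                # number of training threads
)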
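
To make the effect of `min_count` concrete, here is a minimal sketch in plain Python. It is not gensim's implementation. It counts token frequencies over a made-up toy corpus and keeps only the words that appear at least `min_count` times, which mirrors the documented rule that Word2Vec ignores words with a total frequency below `min_count`. The corpus and the helper name `surviving_vocab` are assumptions for illustration.

```python
from collections import Counter

# Toy tokenized corpus (made up for illustration, not taken from the dataset).
toy_corpus = [
    ["beautiful", "day", "park"],
    ["terrible", "traffic", "morning"],
    ["beautiful", "morning", "walk"],
]

def surviving_vocab(sentences, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}

vocab_all = surviving_vocab(toy_corpus, min_count=1)
vocab_frequent = surviving_vocab(toy_corpus, min_count=2)

print(len(vocab_all))          # every distinct word is kept
print(sorted(vocab_frequent))  # only words seen at least twice remain
```

With `min_count=1`, as in the cell above, rare words do get a vector, but a vector learned from a single occurrence carries little information.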


## Step 6: Test a word embedding

In [22]:
print(list(w2v_model.wv.index_to_key)[:20])


['new', 'life', 'day', 'dream', 'joy', 'like', 'moment', 'feeling', 'heart', 'friend', 'laughter', 'night', 'world', 'echo', 'challenge', 'art', 'emotion', 'journey', 'beauty', 'time']


In [23]:
w2v_model.wv['day']


array([-0.00066215,  0.00377116, -0.00677739, -0.00132649,  0.00779554,
        0.00594807, -0.0033047 ,  0.00465612, -0.00893954,  0.00582766,
       -0.00543612, -0.00440635,  0.00934196,  0.0009625 ,  0.00785572,
       -0.00669005,  0.00533652,  0.00871947, -0.00877117, -0.00664146,
       -0.00706335, -0.00432352, -0.00317984, -0.00915159,  0.00781268,
       -0.00464146,  0.00802077,  0.00448031, -0.00741729,  0.00432184,
        0.00648023, -0.00725832, -0.00761548, -0.00299009, -0.00860094,
       -0.00083822, -0.00053311,  0.0024854 ,  0.00057321, -0.00292337,
       -0.00560153,  0.00078216, -0.00115441,  0.00677333,  0.00472358,
        0.00429136,  0.00034917, -0.00311225, -0.00409285, -0.0001362 ,
        0.0019635 , -0.00353652, -0.00748153, -0.00798584, -0.01018488,
       -0.00505917, -0.00130586, -0.00432386, -0.00749807, -0.00319127,
        0.00457594, -0.00335303,  0.0079576 ,  0.00188658, -0.00878155,
        0.01076969,  0.00787549,  0.00611534, -0.00806541,  0.00

## “Why does Word2Vec sometimes throw a KeyError?”

“Word2Vec only learns embeddings for words present in the training data. If a word doesn’t appear in the corpus, the model won’t have a vector for it.”
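
The out-of-vocabulary behaviour can be sketched without a trained model. In the example below a plain dict stands in for the trained vectors. This is an assumption for illustration, chosen because gensim's `model.wv` supports the same `in` check and also raises `KeyError` when an unseen word is indexed directly. The helper name `lookup` and the toy values are hypothetical.

```python
# A plain dict stands in for the trained vectors (an assumption for illustration).
toy_vectors = {
    "day": [0.1, 0.2],
    "park": [0.3, -0.1],
}

def lookup(word, vectors):
    """Return the vector for word, or None if the word is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return None

known = lookup("day", toy_vectors)
unknown = lookup("spaceship", toy_vectors)

try:
    toy_vectors["spaceship"]  # direct indexing of an unseen word
    raised = False
except KeyError:
    raised = True

print(known, unknown, raised)
```

Checking membership before indexing is the same guard that the sentence vector function later in this notebook uses.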


## “How is TF-IDF different from Word2Vec?”

“TF-IDF represents words based on frequency, while Word2Vec represents words as dense vectors that capture semantic meaning, so similar words have similar representations.”
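
A small sketch of that difference, using hand-picked toy numbers rather than real TF-IDF weights or learned embeddings. In a frequency-based representation every vocabulary word has its own dimension, so two different words never overlap. In a dense representation, related words can point in similar directions. The vocabulary order and all vector values below are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Frequency-style representation: one dimension per vocabulary word.
# Assumed vocabulary order: ["good", "great", "terrible"]
sparse_good = [1.0, 0.0, 0.0]
sparse_great = [0.0, 1.0, 0.0]

# Dense embeddings with hand-picked toy values (not learned from data).
dense_good = [0.9, 0.1]
dense_great = [0.8, 0.2]
dense_terrible = [-0.7, 0.3]

sparse_sim = cosine(sparse_good, sparse_great)
dense_sim_close = cosine(dense_good, dense_great)
dense_sim_far = cosine(dense_good, dense_terrible)

print(sparse_sim)                       # distinct words share no dimension
print(dense_sim_close > dense_sim_far)  # related words can sit closer together
```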

## Phase 2 Step: Sentence Embeddings

We convert each sentence into a single vector by averaging the word vectors of all its tokens.

Example: "beautiful sunny day"

beautiful → [0.2, 0.1, ...]
sunny     → [0.4, -0.3, ...]
day       → [0.1, 0.2, ...]

sentence  → [0.23, 0.0, ...]   (element-wise mean of the three vectors)

## Step 1: Create sentence vector function

In [24]:
import numpy as np

def get_sentence_vector(tokens, model, vector_size):
    vectors = []
    for token in tokens:
        if token in model.wv:
            vectors.append(model.wv[token])
    
    if len(vectors) == 0:
        return np.zeros(vector_size)
    
    return np.mean(vectors, axis=0)
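
The two edge cases, unknown tokens and sentences with no known tokens, can be checked without a trained model. The sketch below repeats the function in compact form so that it runs on its own, and uses a `SimpleNamespace` with a dict as a stand-in for the model. The stand-in and the 2-dimensional toy vectors are assumptions for illustration.

```python
import numpy as np
from types import SimpleNamespace

def get_sentence_vector(tokens, model, vector_size):
    # Same logic as the function above: average the vectors of known tokens.
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if len(vectors) == 0:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# Stand-in for a trained model: only the `.wv` lookup is needed here.
# The 2-dimensional vectors are made up for illustration.
toy_model = SimpleNamespace(wv={
    "beautiful": np.array([0.2, 0.1]),
    "day": np.array([0.1, 0.2]),
})

mixed = get_sentence_vector(["beautiful", "unseenword", "day"], toy_model, 2)
empty = get_sentence_vector(["unseenword"], toy_model, 2)

print(mixed)  # mean of the two known vectors; the unknown token is skipped
print(empty)  # all-zero fallback when no token is known
```

The zero vector keeps the feature matrix rectangular, at the cost of giving every fully unknown sentence the same representation.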

## Step 2: Convert all sentences into vectors

In [26]:
X = np.array([
    get_sentence_vector(tokens, w2v_model, 100) 
    for tokens in df["tokens"]
])

In [27]:
print(X.shape)

(732, 100)


## Step 3: Prepare labels

In [28]:
y = df["Sentiment"]
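
The classification report further down lists some labels twice (for example `Admiration` and `Awe`), which suggests that the raw `Sentiment` strings differ only in surrounding whitespace. The sketch below shows the effect in plain Python on made-up label strings. On the real column, pandas' `Series.str.strip()` would be the likely equivalent, but whether that is the right cleanup for this dataset is an assumption.

```python
from collections import Counter

# Made-up raw labels that differ only in surrounding whitespace.
raw_labels = [" Positive  ", "Positive", " Awe ", "Awe      ", " Negative "]

raw_counts = Counter(raw_labels)
clean_counts = Counter(label.strip() for label in raw_labels)

print(len(raw_counts))     # each whitespace variant counts as its own class
print(dict(clean_counts))  # variants collapse after stripping
```

Fewer, better-populated classes give the classifier more examples per label.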

## Step 4: Train–test split

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)   

## Step 5: Train classifier

In [30]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)



## Step 6: Evaluate model

In [31]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification_report:\n")
print(classification_report(y_test, y_pred))


Accuracy: 0.061224489795918366

Classification_report:

                        precision    recall  f1-score   support

         Acceptance          0.00      0.00      0.00         2
           Admiration        0.00      0.00      0.00         1
        Admiration           0.00      0.00      0.00         1
         Affection           0.00      0.00      0.00         1
      Ambivalence            0.00      0.00      0.00         1
         Anger               0.00      0.00      0.00         1
        Anticipation         0.00      0.00      0.00         1
        Arousal              0.00      0.00      0.00         3
                  Awe        0.00      0.00      0.00         1
         Awe                 0.00      0.00      0.00         1
                  Bad        0.00      0.00      0.00         1
             Betrayal        0.00      0.00      0.00         2
        Betrayal             0.00      0.00      0.00         1
         Bitter              0.00      0.00    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
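
An accuracy figure is easier to judge next to a trivial baseline. The sketch below, in plain Python with made-up labels, computes the accuracy of always predicting the most frequent training class. The helper name `majority_baseline_accuracy` is hypothetical, and the toy numbers say nothing about the real dataset.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Made-up labels for illustration only.
toy_train = ["Positive", "Positive", "Positive", "Negative", "Neutral", "Joy"]
toy_test = ["Positive", "Negative", "Joy", "Positive"]

baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)
```

If the classifier does not clearly beat this number on the real split, the features are adding little. scikit-learn's `DummyClassifier(strategy="most_frequent")` provides the same baseline.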


## What this completes

Text
 → preprocessing
 → Word2Vec embeddings
 → sentence vectors
 → Logistic Regression
 → sentiment prediction

## “How did you use Word2Vec in your project?”

“I trained a Word2Vec model on the corpus, converted each sentence into a vector by averaging its word embeddings, and then trained a classifier on those sentence-level vectors.”

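## Why accuracy dropped

There are three main reasons:

1) Very small dataset

The dataset has only about 700 samples (732 rows), which is a few thousand words in total. Word2Vec is normally trained on corpora of millions or billions of words, so the embeddings learned here are weak.

2) Too many sentiment classes

The classification report lists dozens of emotion labels, and many of them have only 1–3 samples. Some labels even appear twice (for example `Admiration` and `Awe`), apparently because of stray whitespace in the label strings. With so few examples per class, the model cannot learn a pattern for most labels.

3) Averaging embeddings loses context

The sentence "I am not happy" becomes the mean of its word vectors, and a mean ignores word order, so "not" no longer modifies "happy". In this pipeline the negation disappears even earlier: NLTK's English stop word list includes "i", "am", and "not", so preprocessing leaves only the token "happy".

That’s why sequence-aware deep models exist.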
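
The loss of word order can be shown directly. The sketch below uses hand-picked toy embeddings (not learned from data) and a small pure-Python mean. Two sentences that contain the same words in a different order end up with the same sentence vector. The helper name `mean_vector` and all values are assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hand-picked toy embeddings (not learned from data).
toy_embeddings = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [-1.0, 2.0],
}

sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]  # same words, opposite meaning

vec_a = mean_vector([toy_embeddings[t] for t in sentence_a])
vec_b = mean_vector([toy_embeddings[t] for t in sentence_b])

print(vec_a == vec_b)  # the two sentences get the same vector
```

A classifier that only sees the averaged vector has no way to tell these two sentences apart.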

## “What differences did you observe between TF-IDF and Word2Vec?”

“On the small multi-class dataset, TF-IDF performed slightly better than Word2Vec because Word2Vec requires larger corpora to learn meaningful embeddings. This highlighted the limitations of shallow embeddings and motivated the use of deep learning models like LSTMs and transformers.”