Gensim GloVe 300D - 201a_gg300_sent.ipynb

1) Download Gensim Model

Download a pre-trained word embedding model. This notebook uses the Stanford GloVe 6B vectors; the archive includes the 300-dimensional file loaded in step 3:

In [None]:
# wget -c "http://nlp.stanford.edu/data/glove.6B.zip"
# unzip glove.6B.zip

2) Download Dataset

Download a sentiment analysis dataset. For example, the airline sentiment dataset:

In [None]:
# wget -c "https://archive.org/download/misc-dataset/airline-tweets.csv"
# The file is a plain CSV, so no extraction step is needed.

3) Load Gensim Model
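Install Gensim, then load the pre-trained GloVe vectors. The loader below parses the GloVe text file directly into a dictionary of NumPy arrays rather than going through Gensim: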

In [2]:
%pip install gensim

Note: you may need to restart the kernel to use updated packages.





In [4]:
import numpy as np

def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vector
    return embeddings_index

# Load GloVe model
glove_path = r"C:/ZL/model/glove.6B.300d.txt"
glove_embeddings = load_glove_embeddings(glove_path)
print("GloVe model loaded successfully!")

GloVe model loaded successfully!


4) Load Airline Sentiment Dataset
Load and explore the dataset:

In [5]:
import pandas as pd

# Load dataset
dataset_path = r"C:\ZL\dataset\_airsent\airline-tweets.csv"
data = pd.read_csv(dataset_path)

# Explore data
print(data.head())
print(data['airline_sentiment'].value_counts())

             tweet_id airline_sentiment  airline_sentiment_confidence  \
0  570306133677760513           neutral                        1.0000   
1  570301130888122368          positive                        0.3486   
2  570301083672813571           neutral                        0.6837   
3  570301031407624196          negative                        1.0000   
4  570300817074462722          negative                        1.0000   

  negativereason  negativereason_confidence         airline  \
0            NaN                        NaN  Virgin America   
1            NaN                     0.0000  Virgin America   
2            NaN                        NaN  Virgin America   
3     Bad Flight                     0.7033  Virgin America   
4     Can't Tell                     1.0000  Virgin America   

  airline_sentiment_gold        name negativereason_gold  retweet_count  \
0                    NaN     cairdin                 NaN              0   
1                    NaN    jnar

5) Preprocess
Preprocess the text and convert sentiment labels:

In [6]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download NLTK data
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once. Calling stopwords.words('english') for every
# token re-reads the list each time and is the main cost of this cell.
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    tokens = word_tokenize(text)  # Tokenize
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords
    return tokens

# Apply preprocessing
data['tokens'] = data['text'].apply(preprocess_text)

# Convert sentiment to numerical labels
sentiment_map = {'positive': 1, 'neutral': 0, 'negative': -1}
data['label'] = data['airline_sentiment'].map(sentiment_map)
print("Preprocessing completed!")
# 1m 9s

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocessing completed!
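To make the cleaning steps concrete, the sketch below traces one tweet through a simplified version of the pipeline. It is not the notebook's function: it uses `str.split()` in place of `word_tokenize` and a small hypothetical stopword set in place of NLTK's list, so it runs without any NLTK downloads. The sample tweet is invented for illustration.

```python
import re

# Hypothetical, tiny stopword set standing in for stopwords.words('english')
STOP_WORDS_DEMO = {"the", "a", "is", "to", "and", "my", "was"}

def preprocess_text_simple(text):
    text = text.lower()                         # Lowercase
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r"[^a-z\s]", "", text)        # Keep letters and whitespace only
    tokens = text.split()                       # Whitespace tokenization (simplification)
    return [t for t in tokens if t not in STOP_WORDS_DEMO]

# Invented example tweet
sample = "@VirginAmerica my flight was delayed 2 hours!! http://t.co/abc123"
demo_tokens = preprocess_text_simple(sample)
print(demo_tokens)
# ['virginamerica', 'flight', 'delayed', 'hours']
```

Note that the mention survives as the single token `virginamerica`, while the digit and punctuation are dropped. Tokens like this are unlikely to carry sentiment and may not appear in the embedding vocabulary at all.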


6) Vectorize
Convert text into word vectors:

In [7]:
import numpy as np

def text_to_vector_glove(tokens, embeddings_index, vector_size=300):
    vectors = [embeddings_index[word] for word in tokens if word in embeddings_index]
    if vectors:
        return np.mean(vectors, axis=0)  # Average of word vectors
    else:
        return np.zeros(vector_size)

# Apply vectorization
data['vector'] = data['tokens'].apply(lambda x: text_to_vector_glove(x, glove_embeddings))
X = np.vstack(data['vector'].values)
y = data['label'].values
# 0.2s
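The averaging step can be checked by hand on a toy vocabulary. The sketch below uses hypothetical 2-dimensional embeddings and adds a small `coverage` helper (not part of the notebook) that reports what fraction of tokens were found in the embedding index. Low coverage would mean many tweets are represented by only a few words, or by the all-zeros fallback.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, for illustration only
toy_embeddings = {
    "good": np.array([1.0, 0.0], dtype="float32"),
    "flight": np.array([0.0, 1.0], dtype="float32"),
}

def mean_vector(tokens, embeddings_index, vector_size):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vectors = [embeddings_index[t] for t in tokens if t in embeddings_index]
    if not vectors:
        return np.zeros(vector_size, dtype="float32")
    return np.mean(vectors, axis=0)

def coverage(tokens, embeddings_index):
    """Fraction of tokens present in the embedding index (hypothetical helper)."""
    if not tokens:
        return 0.0
    return sum(t in embeddings_index for t in tokens) / len(tokens)

tokens_demo = ["good", "flight", "virginamerica"]
vec = mean_vector(tokens_demo, toy_embeddings, vector_size=2)
cov = coverage(tokens_demo, toy_embeddings)

print(vec)            # [0.5 0.5]  (out-of-vocabulary token is skipped)
print(round(cov, 2))  # 0.67
print(mean_vector(["virginamerica"], toy_embeddings, vector_size=2))  # [0. 0.]
```

Because the mean ignores word order, "not good" and "good not" map to the same vector; this is one known limitation of averaged embeddings for sentiment tasks.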

7) Train
Split the dataset into training and test sets, and train a Logistic Regression model:

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print("Model training completed!")


Model training completed!
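The split above is purely random, so class proportions in the training and test sets can drift slightly from the full dataset. One option, sketched below on synthetic data, is to pass `stratify=y` to `train_test_split`. The arrays here are stand-ins with roughly the class mix seen in this dataset (about 64% negative, 20% neutral, 16% positive); they are not the airline tweets, and the 5-feature width is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in labels and features; not the airline data
y_toy = np.array([-1] * 640 + [0] * 200 + [1] * 160)
X_toy = rng.normal(size=(len(y_toy), 5)).astype("float32")

# stratify=y_toy keeps each class's share equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

for label in (-1, 0, 1):
    train_share = float(np.mean(y_tr == label))
    test_share = float(np.mean(y_te == label))
    print(label, round(train_share, 3), round(test_share, 3))
```

Whether stratification changes the reported metrics for this dataset is untested here; with about 14,600 rows the random split is probably already close to proportional.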


8) Predict
Predict sentiment on the test set:

In [9]:
# Predict
y_pred = clf.predict(X_test)
print("Prediction completed!")


Prediction completed!


9) Evaluate
Evaluate the model's performance using common metrics:

In [10]:
from sklearn.metrics import classification_report, accuracy_score

# Classification report
print(classification_report(y_test, y_pred))

# Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")


              precision    recall  f1-score   support

          -1       0.82      0.92      0.87      1889
           0       0.59      0.44      0.50       580
           1       0.75      0.62      0.68       459

    accuracy                           0.78      2928
   macro avg       0.72      0.66      0.68      2928
weighted avg       0.76      0.78      0.76      2928

Accuracy: 0.78


The GloVe-based model achieved an **accuracy of 78%**, slightly lower than the Word2Vec model (79%).

The results break down as follows:

---

### **Key Observations from GloVe Results**
1. **Class-wise Performance**:
   - **Negative Sentiment (-1)**: 
     - **Precision (82%)** and **Recall (92%)** remain strong, similar to Word2Vec.
     - The **F1-Score (87%)** reflects balanced performance for this dominant class.
   - **Neutral Sentiment (0)**: 
     - Performance for this class remains challenging:
       - **Precision (59%)** and **Recall (44%)** are both slightly lower than the Word2Vec results.
       - **F1-Score (50%)** indicates room for improvement in identifying neutral sentiments.
   - **Positive Sentiment (1)**:
     - Performance is comparable to Word2Vec, with **Precision (75%)** and **Recall (62%)**.
     - **F1-Score (68%)** is slightly lower than Word2Vec (70%).

2. **Macro Average**:
   - **F1-Score (68%)** is lower than Word2Vec's 70%, showing a slight decrease in balanced performance across all classes.

3. **Weighted Average**:
   - **F1-Score (76%)** is two points below Word2Vec's 78%. Both scores are driven mainly by the model's strength on the majority class (negative sentiment).

4. **Class Imbalance**:
   - As with Word2Vec, the GloVe model struggles with the underrepresented classes (neutral and positive), likely because of class imbalance: the test set has 1,889 negative examples against 580 neutral and 459 positive.
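
The macro and weighted averages discussed above can be reproduced from the per-class rows of the classification report in step 9. The sketch below copies the rounded F1 scores and supports from that report, so the results agree with the printed averages only to within rounding. The majority-class baseline (always predicting negative) is an added reference point, not something the notebook computes.

```python
# Per-class F1 and support, copied from the classification report in step 9
f1 = {-1: 0.87, 0: 0.50, 1: 0.68}
support = {-1: 1889, 0: 580, 1: 459}

total = sum(support.values())  # 2928 test examples

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# Accuracy of a classifier that always predicts the largest class
majority_baseline = max(support.values()) / total

print(f"macro F1:          {macro_f1:.3f}")           # 0.683
print(f"weighted F1:       {weighted_f1:.3f}")        # 0.767
print(f"majority baseline: {majority_baseline:.3f}")  # 0.645
```

The gap between the macro and weighted figures is the imbalance effect in numerical form: the negative class supplies about 65% of the weight in the weighted average, so its high F1 lifts the result.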

---

### **Comparison: GloVe vs. Word2Vec**
| Metric                | Word2Vec | GloVe |
|-----------------------|----------|-------|
| Accuracy              | 79%      | 78%   |
| Negative F1-Score (-1)| 87%      | 87%   |
| Neutral F1-Score (0)  | 53%      | 50%   |
| Positive F1-Score (1) | 70%      | 68%   |
| Macro F1-Score        | 70%      | 68%   |
| Weighted F1-Score     | 78%      | 76%   |

---

### **Suggestions for Further Improvement**
1. **Fine-tune GloVe Embeddings**:
   - Use domain-specific data (e.g., airline reviews) to fine-tune the GloVe embeddings.

2. **Hybrid Approach**:
   - Combine GloVe and Word2Vec embeddings for richer word representations.
   - Use embedding concatenation or weighted averaging.

3. **Advanced Models**:
   - Experiment with FastText, which may better handle out-of-vocabulary words and rare words.
   - Use contextual embeddings like **BERT** or **RoBERTa** for improved performance.

4. **Handling Class Imbalance**:
   - Apply **oversampling** for minority classes or **class-weighting** in the Logistic Regression loss function.
   - Use data augmentation techniques to create synthetic samples for neutral and positive sentiments.

5. **Explore Deep Learning Models**:
   - Train LSTMs, CNNs, or hybrid models with pre-trained GloVe embeddings for potentially better results.
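
As a starting point for suggestion 4, the sketch below shows class weighting with scikit-learn's `class_weight="balanced"` option, which scales each class's contribution to the loss by `n_samples / (n_classes * class_count)`. The data is synthetic, with a class mix similar to this dataset's; the feature construction is a hypothetical stand-in for the GloVe sentence vectors, and no claim is made about how much this would change the scores reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic stand-in data with a 64/20/16 class mix; not the airline tweets
y_toy = np.array([-1] * 640 + [0] * 200 + [1] * 160)
# Shift each class's features slightly so there is something to learn
X_toy = rng.normal(size=(len(y_toy), 5)) + y_toy[:, None] * 0.5

# Inspect the weights that "balanced" implies for this label distribution
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weights = {int(c): round(float(w), 2) for c, w in zip(classes, weights)}
print(class_weights)  # {-1: 0.52, 0: 1.67, 1: 2.08}

# Same model as in step 7, with class weighting switched on
clf_balanced = LogisticRegression(max_iter=200, class_weight="balanced")
clf_balanced.fit(X_toy, y_toy)
print(clf_balanced.classes_)
```

Class weighting typically trades some majority-class precision for better minority-class recall, so macro F1 is the more informative metric to compare before and after.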

GloVe performs well here, but Word2Vec edges it out by one to three points on every metric in the table except negative-class F1, where the two tie. The gap is largest for the neutral and positive classes. Adopting some of the suggestions above could help refine the results further.