Gensim FastText 300D - 201a_gg300_sent.ipynb

1) Download Gensim Model

Download a pre-trained word embedding model that Gensim can load. This notebook uses the English FastText Common Crawl model (300 dimensions) in its binary (.bin) format, which is the file loaded in step 3:

In [None]:
# wget -c https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
# gunzip cc.en.300.bin.gz

2) Download Dataset

Download a sentiment analysis dataset. For example, the airline sentiment dataset:

In [None]:
# wget -c "https://archive.org/download/misc-dataset/airline-tweets.csv"

3) Load Gensim Model
Load the pre-trained model using Gensim:

In [1]:
%pip install gensim

Note: you may need to restart the kernel to use updated packages.





In [7]:
from gensim.models.fasttext import load_facebook_model

# Load the FastText binary model (Facebook .bin format).
# load_facebook_model replaces the deprecated FastText.load_fasttext_format.
fasttext_path = r"C:/ZL/model/cc.en.300.bin"
fasttext_model = load_facebook_model(fasttext_path)
print("FastText binary model loaded successfully!")
# 4m 28s


FastText binary model loaded successfully!


4) Load Airline Sentiment Dataset
Load and explore the dataset:

In [8]:
import pandas as pd

# Load dataset
dataset_path = r"C:\ZL\dataset\_airsent\airline-tweets.csv"
data = pd.read_csv(dataset_path)

# Explore data
print(data.head())
print(data['airline_sentiment'].value_counts())

             tweet_id airline_sentiment  airline_sentiment_confidence  \
0  570306133677760513           neutral                        1.0000   
1  570301130888122368          positive                        0.3486   
2  570301083672813571           neutral                        0.6837   
3  570301031407624196          negative                        1.0000   
4  570300817074462722          negative                        1.0000   

  negativereason  negativereason_confidence         airline  \
0            NaN                        NaN  Virgin America   
1            NaN                     0.0000  Virgin America   
2            NaN                        NaN  Virgin America   
3     Bad Flight                     0.7033  Virgin America   
4     Can't Tell                     1.0000  Virgin America   

  airline_sentiment_gold        name negativereason_gold  retweet_count  \
0                    NaN     cairdin                 NaN              0   
1                    NaN    jnar

5) Preprocess
Preprocess the text and convert sentiment labels:

In [9]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download NLTK data
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words('english') for every token is slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs (http\S+ also matches https)
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters (text is already lowercase)
    tokens = word_tokenize(text)  # Tokenize
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords
    return tokens

# Apply preprocessing
data['tokens'] = data['text'].apply(preprocess_text)

# Convert sentiment to numerical labels
sentiment_map = {'positive': 1, 'neutral': 0, 'negative': -1}
data['label'] = data['airline_sentiment'].map(sentiment_map)
print("Preprocessing completed!")
# 1m 9s

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocessing completed!
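
To make the cleaning steps concrete, here is a minimal sketch of the same pipeline on one made-up tweet. It is a stand-in, not the notebook's function: `DEMO_STOPWORDS` is a small hypothetical stopword set replacing the NLTK list, and `str.split` replaces `word_tokenize` so the example runs without any NLTK downloads.

```python
import re

# Hypothetical mini stopword set, standing in for stopwords.words('english')
DEMO_STOPWORDS = {"the", "is", "a", "to", "and", "my", "was"}

def demo_preprocess(text):
    text = text.lower()                          # Lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # Remove URLs
    text = re.sub(r"[^a-z\s]", "", text)         # Drop digits, punctuation, '@', '#'
    tokens = text.split()                        # Whitespace split (stand-in for word_tokenize)
    return [t for t in tokens if t not in DEMO_STOPWORDS]

# Made-up tweet for illustration only
sample = "@VirginAmerica my flight was delayed 3 hours!! http://t.co/abc123"
demo_tokens = demo_preprocess(sample)
print(demo_tokens)  # ['virginamerica', 'flight', 'delayed', 'hours']
```

Note that the `@` is stripped but the handle itself survives as a token, so airline names end up in the token list and contribute to the sentence vector.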


6) Vectorize
Convert text into word vectors:

In [12]:
import numpy as np

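def get_fasttext_vector(word, model, vector_size=300):
    # FastText builds vectors for unseen words from character n-grams,
    # so the KeyError fallback should rarely trigger.
    try:
        return model.wv[word]
    except KeyError:
        return np.zeros(vector_size)

def text_to_vector_fasttext(tokens, model, vector_size=300):
    # Represent a tweet as the mean of its word vectors
    vectors = [get_fasttext_vector(word, model, vector_size) for word in tokens]
    if vectors:
        return np.mean(vectors, axis=0)  # Average of word vectors
    else:
        return np.zeros(vector_size)  # Empty token list: fall back to a zero vector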

# Apply vectorization
data['vector'] = data['tokens'].apply(lambda x: text_to_vector_fasttext(x, fasttext_model))
X = np.vstack(data['vector'].values)
y = data['label'].values
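
The averaging step can be illustrated on a toy scale. The sketch below uses hypothetical 3-dimensional vectors in a plain dictionary (`toy_vectors`) instead of the 300-dimensional FastText model, and shows how a zero-vector fallback for an unknown token pulls the sentence vector toward the origin.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the notebook uses 300 dimensions)
toy_vectors = {
    "flight": np.array([1.0, 0.0, 2.0]),
    "delayed": np.array([3.0, 2.0, 0.0]),
}

def toy_lookup(word, dim=3):
    # Unknown words fall back to a zero vector, as in get_fasttext_vector
    return toy_vectors.get(word, np.zeros(dim))

def toy_sentence_vector(tokens, dim=3):
    if not tokens:
        return np.zeros(dim)
    return np.mean([toy_lookup(t, dim) for t in tokens], axis=0)

known = toy_sentence_vector(["flight", "delayed"])            # [2, 1, 1]
with_oov = toy_sentence_vector(["flight", "delayed", "zzz"])  # [4/3, 2/3, 2/3]
empty = toy_sentence_vector([])                               # [0, 0, 0]
print(known, with_oov, empty)
```

Because the unknown token still counts in the denominator, every component of `with_oov` is scaled down by a factor of 2/3 relative to `known`.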

7) Train
Split the dataset into training and test sets, and train a Logistic Regression model:

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print("Model training completed!")


Model training completed!
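
Since the labels are imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below is an assumption-laden illustration on synthetic data (`X_demo`, `y_demo` are made up); it is not the split used for the results reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels and random features for illustration only
rng = np.random.default_rng(42)
y_demo = np.repeat([-1, 0, 1], [630, 210, 160])
X_demo = rng.normal(size=(len(y_demo), 5))

# stratify=y_demo preserves the label proportions in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

test_counts = {int(c): int((y_te == c).sum()) for c in (-1, 0, 1)}
print(test_counts)
```

With 20% held out, each class contributes about 20% of its examples to the test set.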


8) Predict
Predict sentiment on the test set:

In [14]:
# Predict
y_pred = clf.predict(X_test)
print("Prediction completed!")


Prediction completed!


9) Evaluate
Evaluate the model's performance using common metrics:

In [15]:
from sklearn.metrics import classification_report, accuracy_score

# Classification report
print(classification_report(y_test, y_pred))

# Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")


              precision    recall  f1-score   support

          -1       0.81      0.94      0.87      1889
           0       0.65      0.44      0.52       580
           1       0.81      0.58      0.67       459

    accuracy                           0.79      2928
   macro avg       0.75      0.65      0.69      2928
weighted avg       0.78      0.79      0.77      2928

Accuracy: 0.79


The FastText-based model achieved an **accuracy of 79%**, matching the Word2Vec model.
The sections below break down the results and compare them with Word2Vec and GloVe:

---

### **Key Observations from FastText Results**
1. **Class-wise Performance**:
   - **Negative Sentiment (-1)**: 
     - **Precision (81%)** and **Recall (94%)** remain strong, similar to Word2Vec and GloVe.
     - **F1-Score (87%)** indicates the model's strength in predicting the majority class.
   - **Neutral Sentiment (0)**: 
     - **Precision (65%)** is higher than both Word2Vec (61%) and GloVe (59%).
     - **Recall (44%)** remains comparable to the other models, reflecting challenges in capturing neutral sentiment.
     - **F1-Score (52%)** is slightly better than GloVe (50%) and just below Word2Vec (53%).
   - **Positive Sentiment (1)**:
     - **Precision (81%)** is the highest among the models (Word2Vec: 76%, GloVe: 75%).
     - **Recall (58%)** lags behind both Word2Vec (65%) and GloVe (62%).
     - **F1-Score (67%)** is slightly lower than both GloVe (68%) and Word2Vec (70%).

2. **Macro Average**:
   - **F1-Score (69%)** is higher than GloVe (68%) but slightly lower than Word2Vec (70%).

3. **Weighted Average**:
   - **F1-Score (77%)** is comparable to Word2Vec (78%) and slightly better than GloVe (76%).

4. **Class Imbalance**:
   - FastText scores higher than GloVe on neutral F1 (52% vs. 50%) but slightly lower on positive F1 (67% vs. 68%); all three models remain much weaker on the minority classes than on the negative class.
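
The macro and weighted averages in the classification report can be reproduced from the per-class F1 scores and supports shown in the output above. This is a small worked check; because the per-class values are already rounded to two decimals, the recomputed averages are approximations.

```python
# Per-class F1 and support, copied from the classification report above
f1 = {-1: 0.87, 0: 0.52, 1: 0.67}
support = {-1: 1889, 0: 580, 1: 459}

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"Macro F1: {macro_f1:.2f}")        # Macro F1: 0.69
print(f"Weighted F1: {weighted_f1:.2f}")  # Weighted F1: 0.77
```

The gap between the two averages reflects the imbalance: the negative class holds 1889 of the 2928 test examples, so its strong F1 dominates the weighted figure.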

---

### **Comparison: FastText vs. Word2Vec vs. GloVe**
| Metric                | Word2Vec | GloVe | FastText |
|-----------------------|----------|-------|----------|
| Accuracy              | 79%      | 78%   | 79%      |
| Negative F1-Score (-1)| 87%      | 87%   | 87%      |
| Neutral F1-Score (0)  | 53%      | 50%   | 52%      |
| Positive F1-Score (1) | 70%      | 68%   | 67%      |
| Macro F1-Score        | 70%      | 68%   | 69%      |
| Weighted F1-Score     | 78%      | 76%   | 77%      |

---

### **Observations**
1. **Strengths of FastText**:
   - Improved precision for the neutral and positive classes.
   - Handles rare and OOV words better due to subword representations.

2. **Trade-Offs**:
   - Recall for the positive class (58%) is lower than both Word2Vec (65%) and GloVe (62%).
   - Longer loading times compared to GloVe and Word2Vec.

3. **Overall**:
   - FastText provides comparable accuracy and F1-scores to Word2Vec while outperforming GloVe in most metrics.

---

### **Suggestions for Further Improvement**
1. **Fine-Tune FastText**:
   - Fine-tune the embeddings on domain-specific data (e.g., airline tweets).

2. **Hybrid Approach**:
   - Combine embeddings (e.g., concatenate FastText and Word2Vec) for richer representations.

3. **Advanced Models**:
   - Use contextual embeddings (e.g., BERT, RoBERTa) for further improvement.

4. **Class Imbalance**:
   - Apply **oversampling**, **undersampling**, or **class-weighting** to address the imbalance in neutral and positive sentiments.
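
As a sketch of the class-weighting option, scikit-learn's "balanced" mode weights each class by `n_samples / (n_classes * class_count)`. The example below applies that formula to the test-set supports from the report above, used here only as illustrative class proportions; `y_counts_demo` and `balanced_clf` are hypothetical names, and no claim is made about how much this would change the scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

classes = np.array([-1, 0, 1])
# Labels repeated according to the supports in the report (illustrative proportions)
y_counts_demo = np.repeat(classes, [1889, 580, 459])

# "balanced" weights: n_samples / (n_classes * count of each class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts_demo)
weight_by_class = dict(zip(classes.tolist(), weights.round(2).tolist()))
print(weight_by_class)  # {-1: 0.52, 0: 1.68, 1: 2.13}

# The same weighting can be requested directly when building the classifier
balanced_clf = LogisticRegression(max_iter=200, class_weight="balanced")
```

Errors on positive examples would then cost roughly four times as much as errors on negative ones during training, which typically trades some majority-class precision for minority-class recall.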

FastText has shown strong performance, particularly for precision in minority classes. With its ability to handle rare words effectively, it remains a competitive option for sentiment analysis.
