Gensim Word2vec 300D - 201a_gw300_sent.ipynb

1) Download Gensim Model

Download a pre-trained word embedding model compatible with Gensim. For example, the Google News Word2Vec model:

In [None]:
#wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
#gunzip GoogleNews-vectors-negative300.bin.gz

2) Download Dataset

Download a sentiment analysis dataset. For example, the airline sentiment dataset:

In [None]:
#wget -c "https://archive.org/download/misc-dataset/airline-tweets.csv"
# The file is a plain CSV, so no extraction step is needed.

3) Load Gensim Model
Load the pre-trained Word2Vec model using Gensim:

In [1]:
%pip install gensim

Note: you may need to restart the kernel to use updated packages.





In [3]:
from gensim.models import KeyedVectors

# Load Word2Vec model
model_path = r"C:/ZL/model/GoogleNews-vectors-negative300.bin"
word2vec = KeyedVectors.load_word2vec_format(model_path, binary=True)
print("Word2Vec model loaded successfully!")
# 16.6m s

Word2Vec model loaded successfully!


4) Load Airline Sentiment Dataset
Load and explore the dataset:

In [4]:
import pandas as pd

# Load dataset
dataset_path = r"C:\ZL\dataset\_airsent\airline-tweets.csv"
data = pd.read_csv(dataset_path)

# Explore data
print(data.head())
print(data['airline_sentiment'].value_counts())

             tweet_id airline_sentiment  airline_sentiment_confidence  \
0  570306133677760513           neutral                        1.0000   
1  570301130888122368          positive                        0.3486   
2  570301083672813571           neutral                        0.6837   
3  570301031407624196          negative                        1.0000   
4  570300817074462722          negative                        1.0000   

  negativereason  negativereason_confidence         airline  \
0            NaN                        NaN  Virgin America   
1            NaN                     0.0000  Virgin America   
2            NaN                        NaN  Virgin America   
3     Bad Flight                     0.7033  Virgin America   
4     Can't Tell                     1.0000  Virgin America   

  airline_sentiment_gold        name negativereason_gold  retweet_count  \
0                    NaN     cairdin                 NaN              0   
1                    NaN    jnar

5) Preprocess
Preprocess the text and convert sentiment labels:

In [5]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download NLTK data
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs (http\S+ also covers https)
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    tokens = word_tokenize(text)  # Tokenize
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords
    return tokens

# Apply preprocessing
data['tokens'] = data['text'].apply(preprocess_text)

# Convert sentiment to numerical labels
sentiment_map = {'positive': 1, 'neutral': 0, 'negative': -1}
data['label'] = data['airline_sentiment'].map(sentiment_map)
print("Preprocessing completed!")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocessing completed!
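
To see what the cleaning steps do to a single tweet, the sketch below re-implements them with only the standard library. It is a simplified illustration, not the notebook's function: `str.split` stands in for `word_tokenize`, `TOY_STOPWORDS` is a hypothetical three-word stand-in for the NLTK list, and the sample tweet is invented.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption).
TOY_STOPWORDS = {"i", "the", "a"}

def toy_preprocess(text):
    """Lowercase, strip URLs and non-letters, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)        # keep letters and whitespace only
    return [w for w in text.split() if w not in TOY_STOPWORDS]

sample = "@VirginAmerica I didn't like the flight! http://t.co/abc123"
print(toy_preprocess(sample))
# ['virginamerica', 'didnt', 'like', 'flight']
```

Two side effects are worth noticing: the @-mention collapses into a single token, and stripping the apostrophe turns "didn't" into "didnt", which a stopword list or an embedding vocabulary may not contain.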


6) Vectorize
Convert text into word vectors:

In [6]:
import numpy as np

def text_to_vector(tokens, model):
    vectors = [model[word] for word in tokens if word in model]
    if vectors:
        return np.mean(vectors, axis=0)  # Average of word vectors
    else:
        return np.zeros(model.vector_size)

# Vectorize dataset
data['vector'] = data['tokens'].apply(lambda x: text_to_vector(x, word2vec))
X = np.vstack(data['vector'].values)
y = data['label'].values
print("Vectorization completed!")


Vectorization completed!
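
The averaging step can be checked on a toy vocabulary. The sketch below uses a plain dictionary of 2-dimensional vectors (`toy_model`, invented for illustration) in place of the 300-dimensional `KeyedVectors` object, plus a hypothetical helper `oov_rate` that is not part of the notebook.

```python
import numpy as np

# Hypothetical 2-D "embeddings"; the real model has 300 dimensions.
toy_model = {
    "good": np.array([1.0, 0.0]),
    "flight": np.array([0.0, 1.0]),
}

def average_vector(tokens, model, dim):
    """Mean of the vectors of in-vocabulary tokens; zeros if none are known."""
    vectors = [model[t] for t in tokens if t in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def oov_rate(tokens, model):
    """Fraction of tokens missing from the vocabulary (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(t not in model for t in tokens) / len(tokens)

print(average_vector(["good", "flight", "zzz"], toy_model, 2))  # [0.5 0.5]
print(average_vector(["zzz"], toy_model, 2))                    # [0. 0.]
print(oov_rate(["good", "flight", "zzz"], toy_model))
```

Tweets made up entirely of unknown tokens all map to the same zero vector, so the classifier cannot tell them apart. Checking the out-of-vocabulary rate on the real tokens would show how often that happens.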


7) Train
Split the dataset into training and test sets, and train a Logistic Regression model:

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print("Model training completed!")


Model training completed!
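
Because the classes are imbalanced, one option worth trying is a stratified split, which keeps the class proportions the same in both partitions. The sketch below demonstrates scikit-learn's `stratify` argument on synthetic labels. `toy_X` and `toy_y` are made up for illustration, and this is a variation on the split above, not what the notebook ran.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 60 negative, 20 neutral, 20 positive (assumption).
toy_y = np.array([-1] * 60 + [0] * 20 + [1] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Each class keeps its 60/20/20 share in both partitions.
print(sorted(Counter(y_tr.tolist()).items()))
print(sorted(Counter(y_te.tolist()).items()))
```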


8) Predict
Predict sentiment on the test set:

In [8]:
# Predict
y_pred = clf.predict(X_test)
print("Prediction completed!")


Prediction completed!


9) Evaluate
Evaluate the model's performance using common metrics:

In [9]:
from sklearn.metrics import classification_report, accuracy_score

# Classification report
print(classification_report(y_test, y_pred))

# Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")


              precision    recall  f1-score   support

          -1       0.83      0.92      0.87      1889
           0       0.61      0.46      0.53       580
           1       0.76      0.65      0.70       459

    accuracy                           0.79      2928
   macro avg       0.73      0.68      0.70      2928
weighted avg       0.78      0.79      0.78      2928

Accuracy: 0.79
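
For reference, the per-class numbers in the report follow from simple counts of true positives, false positives, and false negatives. The sketch below computes precision, recall, and F1 for one class on a tiny invented label list. `per_class_scores` is a hypothetical helper; in practice `classification_report` does this for you.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels for illustration (not taken from the dataset).
toy_true = [-1, -1, -1, 0, 0, 1]
toy_pred = [-1, -1, 0, 0, -1, 1]

for label in (-1, 0, 1):
    p, r, f = per_class_scores(toy_true, toy_pred, label)
    print(f"{label:>2}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```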


The model achieved an **accuracy of 79%** on the held-out test set (2,928 tweets, 20% of the Airline Sentiment Dataset).

Breakdown of the results and observations:

---

### **Key Observations from the Report**
1. **Class-wise Performance**:
   - **Negative Sentiment (-1)**:
     - **Precision**: 83% (83% of the tweets predicted negative were actually negative).
     - **Recall**: 92% (most negative tweets were correctly identified).
     - **F1-Score**: 87% (strong on both precision and recall).
   - **Neutral Sentiment (0)**:
     - **Precision**: 61% (roughly 4 in 10 neutral predictions were wrong).
     - **Recall**: 46% (more than half of the neutral tweets were missed).
     - **F1-Score**: 53% (the weakest of the three classes).
   - **Positive Sentiment (1)**:
     - **Precision**: 76% (about three quarters of positive predictions were correct).
     - **Recall**: 65% (about a third of the positive tweets were missed).
     - **F1-Score**: 70% (moderate, between the other two classes).

2. **Macro Average**:
   - Provides an unweighted mean of precision, recall, and F1-score across all classes.
   - **F1-Score (70%)**: Lower than the weighted F1 (78%) because the weaker neutral and positive classes count as much as the dominant negative class.

3. **Weighted Average**:
   - Weights the scores by the number of samples (support) in each class.
   - Is dominated by the negative class, which makes up about 65% of the test set (1889 of 2928).

4. **Class Imbalance**:
   - The test set is imbalanced, with significantly more negative samples (1889) than neutral (580) or positive (459); with a random 80/20 split, the training set should show a similar skew.
   - This likely explains the model's better performance on the negative class and its struggles with the neutral class.
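
The two averages can be reproduced by hand from the per-class F1 scores and supports printed in the classification report. This is only arithmetic on the rounded values shown there, so the results may differ from scikit-learn's unrounded figures in the last digit.

```python
# Per-class F1 and support, copied from the classification report (rounded values).
f1 = {-1: 0.87, 0: 0.53, 1: 0.70}
support = {-1: 1889, 0: 580, 1: 459}

total = sum(support.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1    = {macro_f1:.2f}")     # 0.70
print(f"weighted F1 = {weighted_f1:.2f}")  # 0.78
```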

---

### **Suggestions for Improvement**
1. **Address Class Imbalance**:
   - Use **oversampling** (e.g., SMOTE) for minority classes, applied to the training split only so that the test set stays representative.
   - Apply **class weights** in the Logistic Regression model to balance the loss function.

2. **Feature Enhancements**:
   - Use **FastText embeddings** to handle out-of-vocabulary (OOV) words better.
   - Experiment with contextual embeddings like **BERT** or **RoBERTa** for richer representations.

3. **Model Tuning**:
   - Experiment with more complex classifiers like Support Vector Machines (SVM) or Random Forests.
   - Fine-tune hyperparameters of Logistic Regression (e.g., regularization strength).

4. **Add Linguistic Features**:
   - Include features like sentiment lexicon scores (e.g., AFINN, SentiWordNet).
   - Use bigrams or trigrams in addition to word embeddings.

5. **Advanced Preprocessing**:
   - Consider stemming or lemmatization.
   - Include domain-specific stopwords to clean the data better.
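
As one concrete instance of the class-weight suggestion, scikit-learn's "balanced" mode weights each class by `n_samples / (n_classes * class_count)`. The sketch below computes those weights for labels with the same class counts as the test-set supports in the report. Those counts are used purely for illustration, since the real weights would come from the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels with the report's class counts (assumption: the
# training set has a similar skew; use y_train in the real notebook).
toy_labels = np.array([-1] * 1889 + [0] * 580 + [1] * 459)
classes = np.array([-1, 0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_labels)
for cls, w in zip(classes, weights):
    print(f"class {cls:>2}: weight {w:.2f}")
```

Passing `class_weight="balanced"` to `LogisticRegression` applies the same weighting during training, which typically trades some precision on the majority class for better recall on the minority classes.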

By implementing these improvements, you can likely boost the model's performance, especially for the underperforming neutral and positive sentiment classes.
