## Motivation

Use word embeddings to train a classifier

## Goal
- Based on content of an article, predict its newstype using logistic regression

## Intuition
Doc2Vec is an extension of word2vec. It helps get context from overall document. 

## Notebook

In [1]:
## Import Libraries
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

from tqdm.notebook import tqdm
from gensim.models import Word2Vec

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

import matplotlib.pyplot as plt

In [2]:
seed = 42

In [3]:
## Import Data

#Source: https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles?select=Articles.csv
df = pd.read_csv('./data/Articles.csv', encoding="ISO-8859-1") 


In [4]:
## Pre-process

# tokenize while removing stop words because they hold no value
def token_helper(doc):
    global stops
    payload = [word.lower() for word in word_tokenize(doc) if word not in stops and word.isalpha()]    
    return payload

nltk.download('stopwords')
stops = set(stopwords.words('english'))

# Drop empty rows
df.dropna(inplace = True)

# Encode Target
df['NewsType'] = df['NewsType'].replace({'business':0, 'sports':1})



[nltk_data] Downloading package stopwords to /home/studio-lab-
[nltk_data]     user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# nltk_tokensized_bios[:3]

In [6]:
df.head()

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,0
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,0
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,0
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,0
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,0


In [7]:
## Split
X_train, X_test, y_train, y_test = train_test_split(df['Article'], df['NewsType'] , test_size = .2, random_state = seed)

X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(df['Article'], df['NewsType'] , test_size = .2, random_state = seed)

In [8]:
## Train Baseline Model
vectorizer =  TfidfVectorizer(min_df = 500, stop_words = 'english')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [9]:
lr_clf = LogisticRegression(random_state = seed)
lr_clf.fit(X_train, y_train)

In [10]:
# Word2Vec Approach
nltk_tokensized_bios = [token_helper(doc) for doc in tqdm(X_train_w2v.values)]
w2v_model = Word2Vec(nltk_tokensized_bios, vector_size=100, window=5, min_count=10)

  0%|          | 0/2153 [00:00<?, ?it/s]

After training the Word2Vec model, you can represent each article as a vector by taking the average of the Word2Vec embeddings of the words in the article. This makes a vector representation of the article that shows how the words fit together and what they mean.

In [12]:
# Source: https://spotintelligence.com/2023/02/15/word2vec-for-text-classification/

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if len(words_vecs) == 0:
        return np.ze ros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

In [13]:
X_train_w2v = np.array([vectorize(sentence) for sentence in X_train_w2v])
X_test_w2v = np.array([vectorize(sentence) for sentence in X_test_w2v])

w2v_lr_clf = LogisticRegression(random_state = seed)
w2v_lr_clf.fit(X_train_w2v, y_train_w2v)

In [14]:
## Make Predictions
lr_pred = lr_clf.predict(X_test)
w2v_lr_pred = w2v_lr_clf.predict(X_test_w2v)

In [15]:
## Evaluate Model

lr_f1 = f1_score(y_test, lr_pred, average='macro')
w2v_lr_f1 = f1_score(y_test_w2v, w2v_lr_pred , average='macro') 

In [16]:
print(lr_f1)
print(w2v_lr_f1)

0.8886731658955718
0.9721518987341772


## Conclusion

## Looking ahead:
- use a pre-trained embedding as opposed to making my own. 
- consider using neural network

## Research/References:
- https://towardsdatascience.com/text-classification-using-word-embeddings-and-deep-learning-in-python-classifying-tweets-from-6fe644fcfc81
- https://www.kaggle.com/code/kstathou/word-embeddings-logistic-regression
  - for baseline model
- https://radimrehurek.com/gensim/models/doc2vec.html
- https://stats.stackexchange.com/questions/299446/word-embeddings-with-logistic-regression
- https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles?select=Articles.csv
  - Data Source
- https://medium.com/@dilip.voleti/classification-using-word2vec-b1d79d375381
  - for word2vec classification
- https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794
  - deep learning reference
- https://thinkingneuron.com/how-to-classify-text-using-word2vec/
- https://spotintelligence.com/2023/02/15/word2vec-for-text-classification/
- early stopping:
    - https://pythonguides.com/pytorch-early-stopping/