*Prepared for the course "TSTS22: Natural Language Processing and Text Mining" at Jönköping University, Teacher: [Marcel Bollmann](mailto:marcel.bollmann@ju.se)*

# Assignment 1: News Topic Classification

In this assignment, you will train classifiers to predict the topic area of a news article based on its headline.

The dataset was prepared from a public domain (CC-0) dataset of news headlines on Kaggle, and contains 10,000 articles for training – consisting of a headline (_title_) and one of 8 different topic areas (_topic_) – and 2,000 articles for validation.  You can see some stats about the dataset in the code cells below.

### Instructions

This assignment consists of **three parts.**  In each part, you'll find a <span style="background-color:#008148; padding:4px 8px; border-radius:4px; color:#F8F0E3">green box</span> that indicates where your own solution should begin, and an <span style="background-color:#EF8A17; padding:4px 8px; border-radius:4px; color:#F8F0E3">orange box</span> that is followed by some evaluation code which you should **not** modify.

### Grading

- This assignment is graded Pass/Fail.

- To _pass_ this assignment, you must provide a working solution for _all parts_ of the assignment. This means that:
    - Your notebook should run from start to finish without errors.
    - Your solutions should fulfill the requirements described in the parts below.
    - The provided evaluation code must not be modified.

- - - 

In [1]:
#!python -m pip install rich

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Mount Google Drive.
# This notebook was written in Google Colab; comment out this cell when running in a local Jupyter environment.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Imports and loading the dataset
import os
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import rich
import rich.progress
import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

# These variables are used later -- DO NOT MODIFY THEM.
# Assign new variable names if you need to modify the data
# in some way.
df_train = pd.read_csv("newstopics_train.csv")
#df_train = pd.read_csv("drive/MyDrive/TSTS22_Assignment1/newstopics_train.csv")
df_val = pd.read_csv("newstopics_val.csv")
#df_val = pd.read_csv("drive/MyDrive/TSTS22_Assignment1/newstopics_val.csv")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [4]:
df_train.head()

Unnamed: 0,title,topic
0,Former AMP chairman says culture long an issue...,business
1,Nagelsmann defends Guardiola after Man City's ...,sports
2,"Tennessee bans on-campus tailgating, expecting...",sports
3,Currys PC World owner slashes 800 jobs as coro...,business
4,Royal National Park: Police release descriptio...,nation


In [5]:
df_train["topic"].value_counts()

business         1379
sports           1379
nation           1379
health           1379
entertainment    1379
technology       1379
world            1379
science           347
Name: topic, dtype: int64

- - -

## Part 1: Linear bag-of-words

In this part, your task is to build a simple **linear bag-of-words classifier.**

The classifier should take a text string as input and predict a topic label as output. All of the functionality you need can be found in Scikit-learn, though you may also use other libraries such as NLTK if you like.
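
As a point of reference, a linear bag-of-words classifier can be sketched as a Scikit-learn pipeline. The toy headlines and labels below are invented for illustration and are not taken from the assignment data; `CountVectorizer` combined with `LogisticRegression` is only one possible choice of components.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data (assumption: NOT from newstopics_train.csv)
toy_titles = [
    "stocks fall as markets tumble",
    "bank profits rise on strong markets",
    "company shares jump after earnings",
    "team wins cup final",
    "striker scores twice in league match",
    "coach praises team after match win",
]
toy_topics = ["business", "business", "business", "sports", "sports", "sports"]

# Bag-of-words counts feed directly into a linear model
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(toy_titles, toy_topics)

predicted = clf.predict(["markets rally as bank shares rise"])
print(predicted[0])
```

Because the pipeline exposes `fit()` and `predict()` on raw strings, an object like this has the interface that the evaluation code in this part expects.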

<div style="background-color:#008148; padding:4px 8px; border-radius:4px; color:#F8F0E3">
    <strong>Modify the cell(s) below with your implementation.</strong>
</div>
   

In [6]:
# Convert to lowercase, strip whitespace, and remove markup, punctuation and digits
def preprocess(text):
    text = text.lower().strip()
    text = re.sub(r'<.*?>', '', text)                                  # drop HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)   # ASCII punctuation -> space
    text = re.sub(r'[^\w\s]', '', text)                                # drop remaining non-word characters
    text = re.sub(r'\d', ' ', text)                                    # digits -> space
    text = re.sub(r'\s+', ' ', text)                                   # collapse repeated whitespace
    return text


# Remove stop words
# Build the stop word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def stopword(text):
    kept = [word for word in text.split() if word not in STOP_WORDS]
    return ' '.join(kept)

# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()

# Helper function that maps NLTK part-of-speech tags to WordNet tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Tokenize the sentence, tag each token, and lemmatize it using its tag
def lemmatizer(text):
    word_pos_tags = nltk.pos_tag(word_tokenize(text))  # (token, POS tag) pairs
    lemmas = [wl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]
    return " ".join(lemmas)

def finalpreprocess(text):
    return lemmatizer(stopword(preprocess(text)))

In [7]:
from sklearn.linear_model import LogisticRegression

def make_linear_classifier():
    class MulticlassClassification:
        """One-vs-rest logistic regression over bag-of-words counts."""

        def __init__(self):
            self.models = []
            # Keep the vectorizer on the instance so fit() and predict() share it
            self.count_vectorizer = CountVectorizer()

        def fit(self, X, y):
            """
            Fits one binary model per class (one-vs-rest)
            """
            X = X.apply(finalpreprocess)
            X = self.count_vectorizer.fit_transform(X).toarray()

            for y_i in np.unique(y):
                # y_i - positive class for now
                # All other classes except y_i are negative

                # Choose x where y is positive class
                x_true = X[y == y_i]
                # Choose x where y is negative class
                x_false = X[y != y_i]
                # Concatenate
                x_true_false = np.vstack((x_true, x_false))

                # Set y to 1 where it is positive class
                y_true = np.ones(x_true.shape[0])
                # Set y to 0 where it is negative class
                y_false = np.zeros(x_false.shape[0])
                # Concatenate
                y_true_false = np.hstack((y_true, y_false))

                # Fit model and append to models list
                model = LogisticRegression()
                model.fit(x_true_false, y_true_false)
                self.models.append([y_i, model])

        def predict(self, X):
            X = X.apply(finalpreprocess)
            X = self.count_vectorizer.transform(X).toarray()
            # Note: model.predict() returns hard 0/1 labels, not probabilities,
            # so ties are resolved in favour of the first label in sorted order.
            y_pred = [[label, model.predict(X)] for label, model in self.models]

            output = []

            for i in range(X.shape[0]):
                max_label = None
                max_prob = -10**5
                for j in range(len(y_pred)):
                    prob = y_pred[j][1][i]
                    if prob > max_prob:
                        max_label = y_pred[j][0]
                        max_prob = prob
                output.append(max_label)

            return output

    return MulticlassClassification()
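
One property of a one-vs-rest setup like the one above is worth illustrating: when the per-class models only contribute hard 0/1 decisions, a headline that is accepted by several models (or by none) falls back to the first label, whereas comparing graded scores picks the most confident model. The sketch below uses invented scores for three labels; with Scikit-learn's `LogisticRegression`, such scores could come from `predict_proba(X)[:, 1]`, which is an alternative and not what the cell above does.

```python
import numpy as np

labels = np.array(["business", "sports", "world"])

# Hypothetical positive-class probabilities from three one-vs-rest models
# (rows = headlines, columns = labels); the values are made up.
scores = np.array([
    [0.20, 0.45, 0.30],   # no model is above 0.5
    [0.60, 0.90, 0.10],   # two models are above 0.5
    [0.10, 0.20, 0.80],   # exactly one model is above 0.5
])

# Hard decisions, as returned by a binary classifier's predict()
hard_votes = (scores > 0.5).astype(int)

# argmax returns the FIRST maximum, so ties go to the first label
from_votes = labels[hard_votes.argmax(axis=1)]
from_scores = labels[scores.argmax(axis=1)]

print(from_votes)
print(from_scores)
```

In the first two rows the hard votes are tied, so the vote-based choice is the first label, while the score-based choice follows the largest probability.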

<div style="background-color:#EF8A17; padding:4px 8px; border-radius:4px; color:#F8F0E3; margin-bottom:1em;">
  <strong>Do NOT modify the code cell below.</strong>
</div>

Run the cell below to fit and evaluate your model. Your goal is to obtain **a weighted F1-score of 0.6 or more**.

In [8]:
def fit_and_evaluate(clf):
    clf.fit(df_train["title"], df_train["topic"])
    y_pred = clf.predict(df_val["title"])
    print(metrics.classification_report(df_val["topic"], y_pred))
    wf1 = metrics.f1_score(df_val["topic"], y_pred, average="weighted")
    color = "green" if wf1 >= .6 else "red"
    rich.print(f"Weighted F1-score: [bold {color}]{wf1:.3f}[/]")

fit_and_evaluate(make_linear_classifier())

               precision    recall  f1-score   support

     business       0.27      0.92      0.42       276
entertainment       0.92      0.54      0.68       276
       health       0.78      0.60      0.68       276
       nation       0.68      0.30      0.42       275
      science       0.94      0.42      0.58        69
       sports       0.96      0.70      0.81       276
   technology       0.95      0.67      0.79       276
        world       0.73      0.38      0.50       276

     accuracy                           0.58      2000
    macro avg       0.78      0.56      0.61      2000
 weighted avg       0.76      0.58      0.61      2000
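
To make the target metric concrete: the weighted F1-score is the average of the per-class F1-scores, with each class weighted by its support, while the macro average weights all classes equally. The sketch below recomputes both from the rounded per-class values in the report above, so it only reproduces the reported figures up to rounding.

```python
# (f1-score, support) per class, copied from the classification report above
report = {
    "business":      (0.42, 276),
    "entertainment": (0.68, 276),
    "health":        (0.68, 276),
    "nation":        (0.42, 275),
    "science":       (0.58, 69),
    "sports":        (0.81, 276),
    "technology":    (0.79, 276),
    "world":         (0.50, 276),
}

total_support = sum(support for _, support in report.values())

# Weighted average: classes with more validation examples count more
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Macro average: every class counts the same
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

print(f"weighted F1 = {weighted_f1:.2f}, macro F1 = {macro_f1:.2f}")
```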



- - - 

## Part 2: Neural text classification

In this part, your task is to build a simple **neural network** that implements a text classification model.  Concretely, your network needs to consist of:

1. An **embedding layer** that maps tokens to an embedding space;
2. Any number of **intermediate layers** that ultimately result in a 1-dimensional vector representation;
3. A **final linear layer** that outputs a softmax over the number of class labels.
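
The three components listed above can be pictured as a single forward pass. The NumPy sketch below is only an illustration of the data flow, not a trainable model: the sizes and random weights are arbitrary assumptions, and global max pooling stands in for the intermediate layers.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, num_classes = 20, 4, 8   # illustrative sizes only

E = rng.normal(size=(vocab_size, embedding_dim))    # 1. embedding table
W = rng.normal(size=(embedding_dim, num_classes))   # 3. final linear layer (weights)
b = np.zeros(num_classes)                           # 3. final linear layer (bias)

def forward(token_ids):
    embedded = E[token_ids]            # (seq_len, embedding_dim)
    pooled = embedded.max(axis=0)      # 2. pool over positions -> (embedding_dim,)
    logits = pooled @ W + b            # (num_classes,)
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = forward(np.array([3, 7, 7, 1]))
print(probs.shape, probs.sum())
```

The pooling step is what turns a variable-length sequence of token vectors into the single 1-dimensional vector that the final layer needs.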

You **may _not_ use any recurrent layers or transformer architectures (e.g. multi-head attention)**, but you are free to otherwise experiment with any combination of layers to improve your model, such as dropout, extra linear layers, convolutional layers, etc.

The model that you build should **not need a GPU** to train; make sure to keep it small enough so that it doesn't take longer than 10 minutes to train on your CPU.

<div style="background-color:#008148; padding:4px 8px; border-radius:4px; color:#F8F0E3; margin-bottom:1em;">
    <strong>Modify the cell(s) below with your implementation.</strong>
</div>

The following cell already imports some names from TensorFlow and Keras for convenience; you are free to change this to another library, such as PyTorch, if you prefer that.  Just make sure to _use the same function names_ as below, so that the evaluation code in the cell below still works.

In [9]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [10]:
rich.print(f"Using TensorFlow {tf.__version__}")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # suppress low-level TF warnings

In [11]:
# Preprocess the training and validation headlines
sentences = df_train['title'].apply(finalpreprocess)
y = df_train['topic']
y_sentences = df_val['title'].apply(finalpreprocess)  # validation headlines, despite the name
y_val = df_val['topic']

In [12]:
# One-hot encode the topic labels
encoder = LabelBinarizer()
transfomed_label = encoder.fit_transform(y)

In [13]:
# Hold out 25% of the training data; it is used to monitor the loss during training
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, transfomed_label, test_size=0.25, random_state=1000)

In [14]:
# Tokenization
# Map each word to an integer index; the embedding layer later turns these indices into vectors
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)
X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)
X_val = tokenizer.texts_to_sequences(y_sentences)
vocab_size = len(tokenizer.word_index) + 1

In [15]:
# Pad sequence with Keras
maxlen = 100
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
X_val = pad_sequences(X_val, padding='post', maxlen=maxlen)
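
The tokenization and padding steps above can be pictured with a small pure-Python stand-in. This is a simplified sketch, not Keras's implementation: `build_index` and `to_padded` are hypothetical helpers, the splitting is plain whitespace splitting, and the toy headlines are invented. It mirrors three behaviours of the Keras utilities: index 0 is reserved for padding, only words with an index below `num_words` are kept, and over-long sequences are truncated from the front by default.

```python
from collections import Counter

def build_index(texts, num_words):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # Keep only indices below num_words; index 0 stays free for padding
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_padded(texts, index, maxlen):
    """Convert texts to index lists and pad with zeros at the end ('post')."""
    rows = []
    for text in texts:
        ids = [index[word] for word in text.split() if word in index]
        ids = ids[-maxlen:]                       # keep the last maxlen tokens
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows

toy_texts = ["stocks fall", "team wins cup", "stocks rise"]
toy_index = build_index(toy_texts, num_words=6)
padded = to_padded(toy_texts, toy_index, maxlen=4)
print(padded)
```

Words that fall outside the index (here the least frequent one) are silently dropped from the sequences, which is also what happens to unseen words in the validation headlines.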

In [29]:
def make_neural_classifier():
    """This function should instantiate and return your
       neural text classification model."""
    #raise NotImplementedError()
    #model = tf.keras.Sequential([...])
    #model.compile(...)
    #return model
    embedding_dim = 50

    model = Sequential()
    model.add(layers.Embedding(input_dim=vocab_size, 
                              output_dim=embedding_dim, 
                              input_length=maxlen))
    model.add(layers.Conv1D(50, 5, activation="relu"))
    model.add(layers.GlobalMaxPool1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(8, activation='softmax'))
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model


def fit_and_predict(model):
    """This function should take a model object, 
       fit the model on the training set,
       and return predictions on the validation set."""
    #raise NotImplementedError()

    model.fit(X_train, y_train,
                    epochs=10,
                    verbose=True,
                    validation_data=(X_test, y_test),
                    batch_size=10)
    
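    # Pass the class probabilities directly: for multiclass labels,
    # LabelBinarizer.inverse_transform picks the highest-scoring class per row.
    # (Thresholding at 0.5 first would map rows with no score above 0.5 to the first class.)
    y_prediction = model.predict(X_val)
    y_pred = encoder.inverse_transform(y_prediction)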

    return y_pred
    


<div style="background-color:#EF8A17; padding:4px 8px; border-radius:4px; color:#F8F0E3; margin-bottom:1em;">
  <strong>Do NOT modify the code cell below.</strong>
</div>

Run the cell below to fit and evaluate your model. Your goal is to implement a working neural classifier that **trains in 10 minutes or less on CPU** and obtains **a weighted F1-score of 0.6 or more**.

In [30]:
def fit_and_evaluate(clf):
    y_pred = fit_and_predict(clf)
    wf1 = metrics.f1_score(df_val["topic"], y_pred, average="weighted")
    color = "green" if wf1 >= .6 else "red"
    rich.print(f"Weighted F1-score: [bold {color}]{wf1:.3f}[/]")

fit_and_evaluate(make_neural_classifier())

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


- - - 

## Part 3: Pre-trained word embeddings

In this part, your task is to **integrate pre-trained word embeddings** into your neural classification model.

In other words, use the exact same model that you made for Part 2, but change it so that it uses pre-trained word embeddings instead of randomly initialized embeddings. *(You might need to change the dimensionality of your embedding layer for this.)*

Download and extract any pre-trained vectors from [GloVe](https://nlp.stanford.edu/projects/glove/) or [fastText](https://fasttext.cc/docs/en/english-vectors.html), then use the following function to load them into a dictionary object:

In [20]:
def load_vectors(filename):
    """Loads word vectors in word2vec/Glove/fastText format."""
    vectors = {}
    with rich.progress.open(filename, "r", description="Loading vectors...") as f:
        for line in f:
            token, coefs = line.rstrip().split(maxsplit=1)
            coefs = np.fromstring(coefs, "f", sep=" ")
            if len(coefs) < 2:  # probably a header
                continue
            vectors[token] = coefs
    rich.print(f"Found {len(vectors)} vectors.")
    return vectors

#vectors = load_vectors("drive/MyDrive/glove/glove.6B.100d.txt")
vectors = load_vectors("glove/glove.6B.100d.txt")


<div style="background-color:#008148; padding:4px 8px; border-radius:4px; color:#F8F0E3; margin-bottom:1em">
    <strong>Modify the cell(s) below with your implementation.</strong>
</div>

Since you only need to redefine the model, the `fit_and_predict()` function from Part 2 should still work!

In [21]:
# word_index maps each word in the dataset to its integer index
word_index = tokenizer.word_index

In [22]:
# Build an embedding matrix with one row per word index in the dataset.
# Words found in GloVe get their pre-trained vector;
# words not found keep an all-zero row.
embedding_dim = len(next(iter(vectors.values())))  # vector size of the loaded embeddings
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = vectors.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
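
Since words without a pre-trained vector end up as all-zero rows, it can be useful to check how much of the vocabulary is actually covered. The sketch below shows one way to do this on invented toy data; `toy_vectors` and `toy_word_index` are hypothetical stand-ins for `vectors` and `word_index`.

```python
import numpy as np

# Hypothetical stand-ins for the loaded vectors and the tokenizer's word index
toy_vectors = {
    "stocks": np.array([0.1, 0.2, 0.3]),
    "team":   np.array([0.4, 0.5, 0.6]),
}
toy_word_index = {"stocks": 1, "team": 2, "nagelsmann": 3}

dim = len(next(iter(toy_vectors.values())))
matrix = np.zeros((len(toy_word_index) + 1, dim))   # row 0 is the padding index
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[i] = vec

# A row counts as covered if it contains at least one non-zero value
covered = int(np.count_nonzero(matrix.any(axis=1)))
coverage = covered / len(toy_word_index)
print(f"{covered} of {len(toy_word_index)} words covered ({coverage:.0%})")
```

A low coverage value would suggest that the preprocessing (for example lemmatization) produces tokens that do not match the vocabulary of the pre-trained vectors.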

In [31]:
def make_neural_classifier_from_pretrained():
    """This function should instantiate and return your
       neural text classification model, just like in Part 2,
       except that this time you should use the `vectors`
       loaded above."""
    #raise NotImplementedError()
    model = Sequential()
    model.add(layers.Embedding(input_dim=len(word_index) + 1,
                                output_dim=embedding_matrix.shape[1],  # embedding size, not sequence length
                                weights=[embedding_matrix],
                                input_length=maxlen,
                                trainable=False))
    model.add(layers.Conv1D(50, 5, activation="relu"))
    model.add(layers.GlobalMaxPool1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(8, activation='softmax'))
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    
    return model

<div style="background-color:#EF8A17; padding:4px 8px; border-radius:4px; color:#F8F0E3; margin-bottom:1em;">
  <strong>Do NOT modify the code cell below.</strong>
</div>

Run the cell below to fit and evaluate your model. Your goal is to implement a working neural classifier that **uses pre-trained embeddings**, still **trains in 10 minutes or less on CPU** and obtains **a weighted F1-score of 0.6 or more**.

In [32]:
fit_and_evaluate(make_neural_classifier_from_pretrained())

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
