# Text classification

This week we are moving from classifying characteristics of single words to classifying whole texts. However, instead of trying to classify the sentiment of a text, we will be classifying whether texts are toxic or not. We are using the toxi-text dataset from Hugging Face. You can find more information about the dataset [here](https://huggingface.co/datasets/FredZhang7/toxi-text-3M). Try to get an overview of:
- what kind of data it contains
- where the data comes from
- what the labels mean

If you prefer not to read toxic text, you can use [this](https://huggingface.co/datasets/stanfordnlp/imdb) dataset instead, which contains IMDb reviews and sentiment classification labels - or any other dataset you prefer :-)

## Install packages

In [None]:
#in terminal:
# pip install nltk pandas numpy gensim scikit-learn fsspec huggingface-hub

## Import packages

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.linear_model import LogisticRegression
import gensim.downloader
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /home/ucloud/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Load data

The dataset is very large and multilingual, so for efficiency's sake we will only use a smaller, English subset of the data. We don't have to split the data into training and test sets because the dataset already has a test set which is saved in a separate file.

In [2]:
df = pd.read_csv("hf://datasets/FredZhang7/toxi-text-3M/train/multilingual-train-deduplicated.csv", nrows=100000)

In [3]:
df = df[df.lang == 'en']
df[:10]

Unnamed: 0,text,is_toxic,lang
0,"Saved lives, and spent for all of their childr...",0,en
1,"I agree with what you say, but for those worke...",0,en
2,My observation is there exists unequal share o...,0,en
3,Animal based fats are not what causes cardiova...,0,en
4,@GOPBlackChick @barrackobama just said u.s.was...,0,en
5,"I bet you supported the war on Iraq, or bombin...",0,en
6,"""You are seriously comparing pregnancy with th...",0,en
7,I like Rachel Notley but regardless of her in...,0,en
8,One's biological sex - male and female - is a ...,0,en
11,The irony is delicious. A party that is single...,0,en


## Preprocessing

The sklearn bag-of-words model expects the data to be a sequence of strings:

In [4]:
texts = df["text"].tolist()
texts[:10]

["Saved lives, and spent for all of their children's lives.  \nLIberal Madness, playing at a theatre near you.",
 'I agree with what you say, but for those workers it must also become expensive to live in Vancouver, so maybe even they would be happier moving slightly further from downtown.  Maybe not as extreme as Toronto...',
 'My observation is there exists unequal share of State monies with its residents, before all the Urban residents get defensive please hear me out. Presently no one except Corporations pay State income taxes. No individual pays state taxes. I noticed state funded bicycle paths, road maintenance, defunct Docks, powerful politicians pet projects such as office buildings, state troopers etc, etc. all these fundings and more are not necessary within City limits, I was amazed at how much our state provides city functions in the bigger cities thus growing the state budget, I saw on tv last night how adg&g was showing the little ones how to ice fish, couldn\'t the paren

## Bag-of-words 

One of the simplest ways to represent a document is a bag-of-words model. This model represents a document as a collection of word counts, ignoring the order of the words. The model is implemented in the `CountVectorizer` class in sklearn.

In [6]:
vectorizer = CountVectorizer(stop_words = 'english')
features = vectorizer.fit_transform(texts)

features.shape

(86996, 155423)

The shape of the matrix should correspond to the number of documents and the number of unique words in the dataset. The value of each cell should correspond to the number of times the word appears in the document.

In [11]:
# commented out to avoid printing an impossibly long list of key-value pairs :)
#vectorizer.vocabulary_

In [7]:
print(len(vectorizer.vocabulary_))
print(len(texts))

155423
86996


Lastly, we need to create a list of the labels:

In [8]:
y = df.is_toxic.tolist()
y[0:10] #print first 10 labels

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

## Training a model

Now we can train a model to classify the toxicity of the texts. I will use a simple logistic regression model, but feel free to swap it out for any other model you prefer.

In [None]:
# Model 0 (base model)
clf = LogisticRegression(random_state=42)
clf.fit(features, y)

In [None]:
clf.score(features, y)

In [None]:
# Model 1: raise max_iter so the solver gets more iterations to converge
clf1 = LogisticRegression(random_state=42, 
                         max_iter = 1000, 
                         verbose = True)
clf1.fit(features, y)

In [11]:
clf1.score(features, y) #clf1 score: 0.9741137523564302

0.9741137523564302
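
Note that the score above is computed on the same texts the model was trained on, so it measures fit rather than generalisation. Below is a minimal sketch of scoring on held-out data instead. It uses a made-up toy corpus with made-up labels, and `train_test_split` as a stand-in for the separate test file; all names and sentences are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up toy corpus with made-up labels (1 = toxic, 0 = not toxic)
toy_texts = [
    "you are an idiot",
    "what a stupid idiot",
    "shut up you fool",
    "stupid fool go away",
    "thanks for the helpful reply",
    "I agree with your point",
    "what a lovely day",
    "have a nice day and thanks",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a quarter of the texts, keeping both classes in each part
train_texts, test_texts, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

toy_vectorizer = CountVectorizer()
X_train = toy_vectorizer.fit_transform(train_texts)  # learn the vocabulary on training texts only
X_test = toy_vectorizer.transform(test_texts)        # reuse that vocabulary for the held-out texts

toy_clf = LogisticRegression(random_state=42, max_iter=1000)
toy_clf.fit(X_train, y_train)

train_accuracy = toy_clf.score(X_train, y_train)
test_accuracy = toy_clf.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The important detail is that the vectorizer is fitted on the training texts only and the test texts are passed through `transform`, so both matrices share the same columns.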


- Removing lowercasing doesn't affect the score.
- Setting `stop_words = 'english'` improves the score: 0.9741137523564302


Now try to take a look at the documentation for the [CountVectorizer](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Try to change the parameters of the model and see how it affects the performance of the model:
- try to remove lowercasing and see how it affects performance
- try to add stopwords to the model
- try to see if you can find a parameter that can be used as an alternative to stopword removal
- try to change the ngram_range parameter
- try to change how the model tokenises the text by changing the token_pattern parameter (hint: use a regex generator)
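
One way to get a feel for these parameters before retraining is to compare the vocabulary size they produce. The sketch below does this on a made-up toy corpus; the sentences, the `settings` dictionary and the chosen values are assumptions for illustration, and `max_df` is only one candidate for the stop word alternative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration
toy_texts = [
    "You are the WORST, the absolute worst!",
    "I don't agree, but thanks for the reply.",
    "Thanks, you are the best :-)",
]

# Each entry is a set of keyword arguments passed to CountVectorizer
settings = {
    "default": {},
    "no lowercasing": {"lowercase": False},
    "english stop words": {"stop_words": "english"},
    "max_df=0.9": {"max_df": 0.9},                 # drop words found in more than 90% of documents
    "unigrams + bigrams": {"ngram_range": (1, 2)},
    "keep 1-char tokens": {"token_pattern": r"(?u)\b\w+\b"},
}

vocab_sizes = {}
for name, params in settings.items():
    vec = CountVectorizer(**params)
    vec.fit(toy_texts)
    vocab_sizes[name] = len(vec.vocabulary_)
    print(f"{name:>20}: {vocab_sizes[name]} features")
```

A different vocabulary size does not by itself tell you whether the classifier gets better, so each setting still has to be scored with the model.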

## tf-idf

Another simple, yet slightly more advanced model is the tf-idf model. It weights each word count by how rare the word is across the documents, and it is implemented in the `TfidfVectorizer` class in sklearn.

- try to create tfidf features from our texts and run the classifier again
- take a look at the [documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and try to change the parameters of the model and see how it affects the performance of the model
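
To see what the idf part does, the sketch below fits a `TfidfVectorizer` with default settings on a made-up toy corpus and compares its learned `idf_` values with the smoothed formula sklearn uses by default, `ln((1 + n) / (1 + df)) + 1`. The corpus and the `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration
toy_texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

# Defaults: smooth_idf=True and L2-normalised rows
toy_tfidf = TfidfVectorizer()
toy_features = toy_tfidf.fit_transform(toy_texts)

n_docs = len(toy_texts)
learned_idf = {}
expected_idf = {}
for word in ["the", "cat", "bird"]:
    column = toy_tfidf.vocabulary_[word]
    # Document frequency: number of documents that contain the word
    doc_freq = sum(word in text.split() for text in toy_texts)
    learned_idf[word] = toy_tfidf.idf_[column]
    expected_idf[word] = np.log((1 + n_docs) / (1 + doc_freq)) + 1
    print(word, doc_freq, round(learned_idf[word], 3), round(expected_idf[word], 3))
```

Words that occur in every document, like "the" here, get the lowest weight, which is why tf-idf can partly stand in for stop word removal.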

In [15]:
tfidf_vectorizer = TfidfVectorizer(stop_words = "english") # create vectorizer that removes English stop words
features_tfidf = tfidf_vectorizer.fit_transform(texts) # use vectorizer to extract tf-idf features

In [None]:
# refit LogReg model 1 on the tf-idf features (this overwrites the bag-of-words fit)
clf1.fit(features_tfidf, y)

In [17]:
clf1.score(features_tfidf, y) 
# with stop_words = english 0.9331118672122857

0.9331118672122857

## Document embeddings

A much more nuanced way to represent text is through embeddings. However, most machine learning models require a fixed-size input, so we need to find a way to represent the whole document as a fixed-size vector. One way to do this is to use the average of the word embeddings of the words in the document. We will use the pre-trained word embeddings from the GloVe model. However, using word embeddings requires us to split the documents into individual words. We will use the nltk library to do this, but there are both simpler and more advanced ways to do this. The simplest method would be to split the documents by spaces, while a more advanced method would be to use a tokenizer that is aware of the structure of the language, like the one in the [spacy](https://spacy.io/api/tokenizer) library.

If we try to tokenise the first of the texts, we get:

In [19]:
word_tokenize(texts[0], 
              language='english', 
              preserve_line=True)

['Saved',
 'lives',
 ',',
 'and',
 'spent',
 'for',
 'all',
 'of',
 'their',
 'children',
 "'s",
 'lives.',
 'LIberal',
 'Madness',
 ',',
 'playing',
 'at',
 'a',
 'theatre',
 'near',
 'you',
 '.']
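
For comparison, here is a minimal sketch of the simplest approach mentioned above, splitting on whitespace, next to a small regular expression that separates punctuation. The example sentence is a shortened version of the first text, and the regex is an illustrative assumption rather than what nltk does internally.

```python
import re

# Shortened version of the first text in the dataset
text = "Saved lives, and spent for all of their children's lives."

# Simplest option: split on whitespace, punctuation stays attached to words
space_tokens = text.split()
print(space_tokens)

# Slightly better: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```

With whitespace splitting, tokens such as `lives,` keep their punctuation and are therefore unlikely to match an entry in a pre-trained embedding vocabulary.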

Now we can load the embeddings and match our tokenised words to the embeddings:

In [20]:
embeddings = gensim.downloader.load("glove-wiki-gigaword-300")



In [21]:
def get_embeddings(text):
    return [embeddings[word] for word in word_tokenize(text, language='english', preserve_line=True) if word in embeddings.key_to_index]

In [22]:
text_embeddings = [get_embeddings(text) for text in texts]

In [None]:
print(len(text_embeddings[0]))
print(len(text_embeddings[0][0]))

18
300


In [None]:
print(len(text_embeddings[1]))
print(len(text_embeddings[1][0]))

35
300


We see that though the individual word embeddings have the same number of dimensions (300), each document yields a different number of word vectors (18 and 35 above), so the document representations have different sizes. We can fix this by taking the average of the word embeddings:

In [85]:
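# documents with no in-vocabulary tokens have nothing to average, so fall back to a zero vector
mean_embeddings = [np.mean(embedding, axis=0) if len(embedding) > 0 else np.zeros(embeddings.vector_size) for embedding in text_embeddings]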

In [None]:
mean_embeddings[0].shape
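
The averaging step can be checked by hand on a tiny example. The sketch below uses hypothetical 3-dimensional vectors in place of the 300-dimensional GloVe vectors; `toy_embeddings` and `toy_mean_embedding` are made-up names, and the zero-vector fallback for documents without any known words is an assumption, not the only possible choice.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for the GloVe vectors
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "bad": np.array([-1.0, 0.0, -2.0]),
    "movie": np.array([0.0, 3.0, 1.0]),
}

def toy_mean_embedding(tokens, vector_size=3):
    """Average the vectors of all known tokens; zero vector if none are known."""
    vectors = [toy_embeddings[token] for token in tokens if token in toy_embeddings]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

doc_a = toy_mean_embedding(["good", "movie"])         # two known tokens
doc_b = toy_mean_embedding(["bad", "bad", "movie"])   # three known tokens
doc_c = toy_mean_embedding(["unknown"])               # no known tokens

# All documents end up with the same fixed size, whatever their length
print(doc_a.shape, doc_b.shape, doc_c.shape)
print(doc_a)
```

Because every document becomes a vector of the same length, the results can be stacked into a matrix and passed to a classifier.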

Now you have mean document embeddings that you can use to classify the texts!

- try to classify the texts using the average of the word embeddings of the words in the text
- try lowercasing the words before creating the embeddings
- try removing stopwords or punctuation before creating the embeddings
- try using another classifier
- try to use all the languages in the dataset and see how it affects the performance of the model

In [None]:
clf = LogisticRegression(random_state=42, 
                         max_iter = 1000, 
                         verbose = True)
clf.fit(mean_embeddings, y)