# Text classification

This week we move from classifying characteristics of single words to classifying whole texts. Instead of classifying the sentiment of a text, however, we will classify whether texts are toxic or not. We use the toxi-text dataset from Hugging Face. You can find more information about the dataset [here](https://huggingface.co/datasets/FredZhang7/toxi-text-3M). Try to get an overview of:
- what kind of data it contains
- where the data comes from
- what the labels mean

If you prefer not to read toxic text, you can use [this](https://huggingface.co/datasets/stanfordnlp/imdb) dataset instead, which contains IMDb reviews with sentiment classification labels, or any other dataset you prefer :-)

## Install packages

In [1]:
!pip install nltk
!pip install pandas
!pip install numpy
!pip install gensim
!pip install scikit-learn
!pip install fsspec
!pip install huggingface-hub


## Import packages

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.linear_model import LogisticRegression
import gensim.downloader
import numpy as np

## Load data

The dataset is very large and multilingual, so for efficiency's sake we will only use a smaller, English subset of the data. We don't have to split the data into training and test sets because the dataset already has a test set which is saved in a separate file.

In [3]:
df = pd.read_csv("hf://datasets/FredZhang7/toxi-text-3M/train/multilingual-train-deduplicated.csv", nrows=100000)

In [38]:
df = df[df.lang == 'en']
df_tox = df[df.is_toxic == 1]
df_tox['text'].tolist()

["Reminds me of the old adage to not mud wrestle with a pig as you'll both get dirty and the pig will like it.",
 'Do you always act like a jackass when someone asks a simple question? Perhaps you should learn to "deal with" your attitude.',
 "Fall Guy RENG: sivir\n\nQursor: boosted piece of shit\n\nFall Guy RENG: what did i do\n\nFall Guy RENG: take me back\n\nJETAIMETAMMY69 left the room.\n\nFall Guy RENG: pls\n\nPelipper AnFam: why would i do that\n\nFall Guy RENG: the kitten\n\nPelipper AnFam: when you jumped on someone else\n\nFall Guy RENG: just wants to hug you\n\nQursor left the room.\n\nURonmylist left the room.\n\nFall Guy RENG: it wasnt like that\n\nPelipper AnFam: YOU CHEATED ON ME\n\nFall Guy RENG: i swear\n\nFall Guy RENG: veigars just a friend\n\nPelipper AnFam: YOU SAW SOMEONE ELSE\n\nIm Doraemon left the room.\n\nPelipper AnFam: AND WANTED TO GO\n\nfeeI like Pablo left the room.\n\nPelipper AnFam: I SHOULD HAVE BEEN YOUR ONLY ONE\n\nPelipper AnFam: YOUR ONE AND ONLY\n\

## Preprocessing

The sklearn bag-of-words model expects the data to be a sequence of strings:

In [5]:
texts = df["text"].tolist()
texts

["Saved lives, and spent for all of their children's lives.  \nLIberal Madness, playing at a theatre near you.",
 'I agree with what you say, but for those workers it must also become expensive to live in Vancouver, so maybe even they would be happier moving slightly further from downtown.  Maybe not as extreme as Toronto...',
 'My observation is there exists unequal share of State monies with its residents, before all the Urban residents get defensive please hear me out. Presently no one except Corporations pay State income taxes. No individual pays state taxes. I noticed state funded bicycle paths, road maintenance, defunct Docks, powerful politicians pet projects such as office buildings, state troopers etc, etc. all these fundings and more are not necessary within City limits, I was amazed at how much our state provides city functions in the bigger cities thus growing the state budget, I saw on tv last night how adg&g was showing the little ones how to ice fish, couldn\'t the paren

## Bag-of-words 

One of the simplest ways to represent a document is a bag-of-words model. This model represents a document as an unordered collection of word counts, ignoring the order of the words. The model is implemented in the `CountVectorizer` class in sklearn.

In [6]:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

In [39]:
features.shape

(86996, 155736)

The shape of the matrix should correspond to the number of documents and the number of unique words in the dataset. The value of each cell should correspond to the number of times the word appears in the document.

In [40]:
vectorizer.vocabulary_

{'saved': 121345,
 'lives': 83794,
 'and': 15708,
 'spent': 129436,
 'for': 56781,
 'all': 14497,
 'of': 99553,
 'their': 137101,
 'children': 32188,
 'liberal': 82843,
 'madness': 85981,
 'playing': 107103,
 'at': 18833,
 'theatre': 137005,
 'near': 95748,
 'you': 153340,
 'agree': 13530,
 'with': 150973,
 'what': 149711,
 'say': 121422,
 'but': 28209,
 'those': 137708,
 'workers': 151472,
 'it': 74595,
 'must': 94220,
 'also': 14848,
 'become': 22280,
 'expensive': 52745,
 'to': 138694,
 'live': 83763,
 'in': 71137,
 'vancouver': 145582,
 'so': 128074,
 'maybe': 88338,
 'even': 52016,
 'they': 137448,
 'would': 151625,
 'be': 22069,
 'happier': 64861,
 'moving': 93242,
 'slightly': 127228,
 'further': 58568,
 'from': 57963,
 'downtown': 46658,
 'not': 97999,
 'as': 18187,
 'extreme': 53074,
 'toronto': 139248,
 'my': 94389,
 'observation': 99220,
 'is': 74182,
 'there': 137292,
 'exists': 52619,
 'unequal': 143244,
 'share': 124672,
 'state': 130797,
 'monies': 92397,
 'its': 74716,


In [41]:
len(vectorizer.vocabulary_)

155736

In [42]:
len(texts)

86996

Lastly, we need to create a list of the labels:

In [43]:
y = df.is_toxic.tolist()

In [44]:
y

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,


## Training a model

Now we can train a model to classify the toxicity of the texts. I will use a simple logistic regression model, but feel free to swap it out for any other model you prefer.

In [48]:
clf = LogisticRegression(random_state=85)

In [49]:
clf.fit(features, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
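
The warning says that the solver stopped before it converged. One of the two remedies it suggests is raising `max_iter`. The sketch below shows this on a made-up toy corpus. The value 1000 is an arbitrary choice and not a recommendation for the full dataset, where a larger limit also means longer training time.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data (1 = toxic, 0 = not toxic), only for illustration
toy_texts = ["you are awful", "have a nice day", "awful awful person", "what a nice person"]
toy_labels = [1, 0, 1, 0]

toy_features = CountVectorizer().fit_transform(toy_texts)

# max_iter defaults to 100; 1000 is an arbitrary larger limit
toy_clf = LogisticRegression(random_state=85, max_iter=1000)
toy_clf.fit(toy_features, toy_labels)

# n_iter_ reports how many iterations the solver actually used
print(toy_clf.n_iter_)
```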


In [51]:
clf.score(features, y)

0.9545611292473217
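
Note that the score above is computed on the same texts the model was trained on, so it is likely to be optimistic. The sketch below shows the general pattern for scoring on held-out texts. It is an illustration under assumptions: the separate test file is not loaded in this notebook, so a made-up toy corpus is split with `train_test_split` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus (1 = toxic, 0 = not toxic), for illustration only
toy_texts = [
    "you are an idiot", "shut up you fool",
    "what a stupid idiot", "you stupid fool",
    "have a great day", "thanks for the help",
    "what a great idea", "thanks and have fun",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_texts, test_texts, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=85
)

# Fit the vocabulary on the training texts only, then reuse it for the test texts
toy_vectorizer = CountVectorizer()
X_train = toy_vectorizer.fit_transform(train_texts)
X_test = toy_vectorizer.transform(test_texts)

toy_clf = LogisticRegression(random_state=85)
toy_clf.fit(X_train, y_train)

train_score = toy_clf.score(X_train, y_train)
test_score = toy_clf.score(X_test, y_test)
print(train_score, test_score)
```

The important detail is that the test texts go through `transform`, not `fit_transform`, so that they are mapped onto the vocabulary learned from the training texts.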

Now take a look at the documentation for the [CountVectorizer](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Try changing the parameters of the vectorizer and see how this affects the performance of the classifier:
- try to remove lowercasing and see how it affects performance
- try to add stopwords to the model
- try to see if you can find a parameter that can be used as an alternative to stopword removal
- try to change the ngram_range parameter
- try to change how the model tokenises the text by changing the token_pattern parameter (hint: use a regex generator)
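
As a starting point for these experiments, the sketch below fits several `CountVectorizer` variants on a made-up toy corpus and prints the resulting vocabularies. The corpus and the specific parameter values (for example `max_df=0.7`) are arbitrary illustrations. `max_df` is shown as one candidate for the "alternative to stopword removal" bullet, since it drops terms that occur in too large a share of the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_texts = ["You are NOT nice", "you are a very nice person", "Not nice at all"]

variants = {
    "default": CountVectorizer(),
    "keep case": CountVectorizer(lowercase=False),
    "English stop words": CountVectorizer(stop_words="english"),
    "drop frequent terms": CountVectorizer(max_df=0.7),
    "unigrams + bigrams": CountVectorizer(ngram_range=(1, 2)),
    # The default token_pattern skips one-character tokens such as "a"
    "keep 1-char tokens": CountVectorizer(token_pattern=r"(?u)\b\w+\b"),
}

for name, vec in variants.items():
    vec.fit(toy_texts)
    print(f"{name}: {sorted(vec.vocabulary_)}")
```

To compare the effect on the classifier, each variant can be plugged into the same fit-and-score steps used above.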


In [55]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 2))
features = vectorizer.fit_transform(texts)
clf = LogisticRegression(random_state=85)
clf.fit(features, y)
clf.score(features, y)


0.993195089429399

## tf-idf

Another simple, yet slightly more advanced, representation is tf-idf, which weights each word count by how rare the word is across the documents. It is implemented in the `TfidfVectorizer` class in sklearn.

- try to create tfidf features from our texts and run the classifier again
- take a look at the [documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and try to change the parameters of the model and see how it affects the performance of the model
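
To see what `TfidfVectorizer` does with its default settings, the sketch below recomputes the weights by hand on a made-up toy corpus and compares them with the sklearn output. It assumes the defaults described in the sklearn documentation: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by scaling each row to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, only for illustration
toy_texts = ["the cat sat", "the cat saw the dog", "the dog barked"]

counts = CountVectorizer().fit_transform(toy_texts).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents does each token occur?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, matching the sklearn default smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale every row to unit L2 length (norm="l2")
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

tfidf = TfidfVectorizer().fit_transform(toy_texts).toarray()
print(np.allclose(manual, tfidf))
```

A word like "the", which occurs in every document, gets the lowest idf, so its counts contribute less than those of rarer words.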

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
clf = LogisticRegression(random_state=85)
clf.fit(features, y)
clf.score(features, y)

0.9326980550829923

## Document embeddings

A much more nuanced way to represent text is through embeddings. Most machine learning models require a fixed-size input, however, so we need a way to represent the whole document as a fixed-size vector. One way to do this is to average the word embeddings of the words in the document. We will use the pre-trained word embeddings from the GloVe model. Using word embeddings requires us to split the documents into individual words first. We will use the nltk library for this, but there are both simpler and more advanced options: the simplest is to split the documents on spaces, while a more advanced one is a tokenizer that is aware of the structure of the language, like the one in the [spaCy](https://spacy.io/api/tokenizer) library.
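
To illustrate why the choice of tokeniser matters, the sketch below compares splitting on whitespace with a simple regular expression that separates words from punctuation. The regular expression is only a rough stand-in for a real tokeniser such as the one in nltk, not a reproduction of it.

```python
import re

text = "Saved lives, and spent for all of their children's lives."

# Simplest approach: split on whitespace; punctuation stays attached to words
whitespace_tokens = text.split()
print(whitespace_tokens)

# Slightly better: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```

With whitespace splitting, tokens such as `lives,` keep their punctuation and are therefore unlikely to be found in the vocabulary of a pre-trained embedding model.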

In [57]:
import nltk

nltk.download('punkt')

from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /home/ucloud/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


If we try to tokenise the first of the texts, we get:

In [58]:
word_tokenize(texts[0], language='english', preserve_line=True)

['Saved',
 'lives',
 ',',
 'and',
 'spent',
 'for',
 'all',
 'of',
 'their',
 'children',
 "'s",
 'lives.',
 'LIberal',
 'Madness',
 ',',
 'playing',
 'at',
 'a',
 'theatre',
 'near',
 'you',
 '.']

Now we can load the embeddings and match our tokenised words to the embeddings:

In [59]:
embeddings = gensim.downloader.load("glove-wiki-gigaword-300")



In [60]:
def get_embeddings(text):
    return [embeddings[word] for word in word_tokenize(text, language='english', preserve_line=True) if word in embeddings.key_to_index]

In [61]:
text_embeddings = [get_embeddings(text) for text in texts]

In [73]:
print(len(text_embeddings[0]))
print(len(text_embeddings[0][0]))

[array([-0.22043  ,  0.16274  ,  0.15177  ,  0.42928  ,  0.16024  ,
       -0.040054 , -0.63559  , -0.10002  ,  0.25705  , -1.1344   ,
        0.24988  , -0.17257  , -0.20881  , -0.11443  , -0.068622 ,
       -0.16539  ,  0.23555  , -0.016286 , -0.13415  ,  0.22829  ,
        0.12582  ,  0.77688  ,  0.21637  ,  0.032827 , -0.18908  ,
        0.33627  ,  0.13523  ,  0.31116  , -0.21042  ,  0.026727 ,
        0.39201  ,  0.78035  , -0.73581  , -0.32101  ,  0.17006  ,
       -0.18398  ,  0.14787  , -0.2192   ,  0.10622  , -0.48877  ,
        0.38319  , -0.47464  , -0.3945   , -0.01071  , -0.1239   ,
       -0.050564 ,  0.10848  ,  0.38878  ,  0.42203  ,  0.1123   ,
       -0.067602 , -0.0032424,  0.48752  , -0.36077  ,  0.29425  ,
        0.25991  ,  0.35477  ,  0.15294  ,  0.055423 , -0.6191   ,
       -0.31757  , -0.23056  ,  0.33172  ,  0.26369  ,  0.40347  ,
       -0.36125  , -0.17894  ,  0.4292   ,  0.084683 , -0.043601 ,
        0.13005  , -0.60037  ,  0.15368  ,  0.39019  , -0.174

In [63]:
print(len(text_embeddings[1]))
print(len(text_embeddings[1][0]))

35
300


We see that although the individual word embeddings all have the same number of dimensions, each document yields a different number of word embeddings, so the document representations differ in size. We can fix this by taking the average of the word embeddings:

In [64]:
mean_embeddings = [np.mean(embedding, axis=0) for embedding in text_embeddings]

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
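
The two lines above look like a fragment of the NumPy warning raised when taking the mean of an empty list. This is an assumption, since the full message is not shown, but it would happen for any document in which no token is found in the embedding vocabulary. The mean of such a document is a single NaN instead of a vector. The sketch below shows one possible way to handle this, using made-up 3-dimensional embeddings as a stand-in for GloVe and a zero vector as fallback. Dropping such documents together with their labels would be another option.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for GloVe
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}
dim = 3


def mean_embedding(tokens):
    """Average the vectors of all known tokens in one document."""
    vectors = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
    if not vectors:
        # No known token: fall back to a zero vector instead of NaN
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# The last document contains only an unknown token
docs = [["good", "day"], ["good"], ["xyzzy"]]
X = np.vstack([mean_embedding(doc) for doc in docs])
print(X.shape)  # (3, 3): one fixed-size row per document
```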


In [72]:
mean_embeddings[0].shape


array([-1.24004550e-01,  7.84505606e-02, -2.61933748e-02, -1.82684585e-01,
       -3.20278145e-02,  2.45756745e-01, -1.40170634e-01,  1.18276477e-01,
        7.65452459e-02, -1.56712043e+00,  2.68017769e-01, -6.35541677e-02,
       -8.40957090e-02,  9.35974196e-02,  4.68820296e-02,  5.97041585e-02,
       -4.73498479e-02, -5.41515499e-02, -4.16396372e-02, -9.53569710e-02,
        3.89119424e-02,  1.72128037e-01,  2.63564229e-01,  5.06921597e-02,
       -2.00384542e-01,  1.34841567e-02,  7.32441097e-02, -1.85213670e-01,
       -1.22078963e-01,  2.34292932e-02, -6.37208372e-02,  2.15985358e-01,
       -1.20407715e-01, -1.25277400e-01, -7.29041874e-01,  9.28728059e-02,
        3.88088822e-02, -5.82907756e-04, -1.11949779e-02, -2.94781588e-02,
        6.86390325e-02, -3.33170034e-02, -1.25275105e-01,  5.84676377e-02,
        1.87997110e-02,  3.08020832e-03,  1.14236988e-01,  2.20613852e-01,
       -1.61646411e-03,  8.43942910e-02,  9.07795578e-02, -1.92791268e-01,
       -4.34668101e-02, -

Now you have mean document embeddings that you can use to classify the texts!

- try to classify the texts using the average of the word embeddings of the words in the text
- try lowercasing the words before creating the embeddings
- try removing stopwords or punctuation before creating the embeddings
- try using another classifier
- try to use all the languages in the dataset and see how it affects the performance of the model

In [92]:
print(len(texts))
print(len(y))
print(len(mean_embeddings))
# Indices of documents whose mean is NaN because no token was in the embedding vocabulary
nan_docs = [i for i, emb in enumerate(mean_embeddings) if np.isnan(emb).any()]

86996
86996
86996

In [84]:
# Documents with no known token have a NaN scalar as their mean; use a zero vector for them
X = np.vstack([emb if np.ndim(emb) == 1 else np.zeros(embeddings.vector_size) for emb in mean_embeddings])
clf = LogisticRegression(random_state=85)
clf.fit(X, y)
clf.score(X, y)