### Loading our data

In [1]:
import csv
import pandas as pd
from typing import List, Set, Tuple

# English data
classes_en = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
train_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/train.csv", 
                       names = ["Label", "Title", "Article"],
                       encoding = "utf-8")
test_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/test.csv", 
                      names = ["Label", "Title", "Article"],
                      encoding = "utf-8")

# German data
train_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/train.csv", 
                       sep = ";", names = ["Label", "Article"], 
                       quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")
test_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/test.csv", 
                      sep = ";", names = ["Label", "Article"],
                      quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")

By pulling the relevant columns out of each dataframe, we can construct plain ("vanilla") Python lists of labels and documents to work on:

In [2]:
# Map the numeric AG News labels to class names; read the columns directly instead of using iterrows()
labels_en = [classes_en[int(label)] for label in train_en["Label"]]
articles_en = train_en["Article"].tolist()
labels_de = train_de["Label"].tolist()
articles_de = train_de["Article"].tolist()

In [3]:
articles_en[:5]

["Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
 'Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.',
 'Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.',
 'AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.']

# **NLTK**

[https://www.nltk.org/](https://www.nltk.org/)

NLTK - short for Natural Language Toolkit - is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

I mostly use NLTK for preprocessing tasks because, in my opinion, it is more lightweight and straightforward than spaCy.

In [4]:
import nltk
from nltk.corpus import stopwords as nltkStopwords
from nltk.stem.snowball import SnowballStemmer

### NLTK tokenizes documents, which can be any string

In [5]:
nltk.download("punkt")

articles_en_tokenized = [nltk.word_tokenize(doc) for doc in articles_en]
articles_de_tokenized = [nltk.word_tokenize(doc) for doc in articles_de]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\P42587\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
articles_en_tokenized[0]

['Reuters',
 '-',
 'Short-sellers',
 ',',
 'Wall',
 'Street',
 "'s",
 'dwindling\\band',
 'of',
 'ultra-cynics',
 ',',
 'are',
 'seeing',
 'green',
 'again',
 '.']
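
The token `'dwindling\\band'` shows that the backslash in the raw text glues two words together, so the tokenizer cannot split them. Below is a minimal sketch of a cleanup step that could run before tokenization. It assumes the backslashes stand in for line breaks in the original articles, which the notebook does not confirm, and `clean_backslashes` is a hypothetical helper, not part of NLTK.

```python
import re

def clean_backslashes(text: str) -> str:
    """Replace backslashes with spaces and collapse repeated whitespace."""
    # Assumption: a backslash marks a former line break, so a space is a safe substitute.
    text = text.replace("\\", " ")
    return re.sub(r"\s+", " ", text).strip()

sample = "Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."
print(clean_backslashes(sample))
```

If this assumption holds for the data, the helper could be applied to each document right before `nltk.word_tokenize` is called.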

### Stemming can be done with NLTK's Snowball Stemmer

[https://www.nltk.org/api/nltk.stem.snowball.html](https://www.nltk.org/api/nltk.stem.snowball.html)

In [7]:
def stem(tokenized_document: List[str], language: str = "english") -> List[str]:
    # Snowball stemmer for the given language; stopwords are stemmed like any other token
    stemmer = SnowballStemmer(language, ignore_stopwords = False)
    return [stemmer.stem(word) for word in tokenized_document]
    
articles_en_stemmed = [stem(doc, "english") for doc in articles_en_tokenized]
articles_de_stemmed = [stem(doc, "german") for doc in articles_de_tokenized]

In [8]:
articles_en_stemmed[0]

['reuter',
 '-',
 'short-sel',
 ',',
 'wall',
 'street',
 "'s",
 'dwindling\\band',
 'of',
 'ultra-cyn',
 ',',
 'are',
 'see',
 'green',
 'again',
 '.']

### NLTK also offers built-in stopword sets for different languages

In [9]:
nltk.download("stopwords")
stopwords_en = set(nltkStopwords.words("english"))
stopwords_de = set(nltkStopwords.words("german"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\P42587\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### The English stopwords are:

In [10]:
",".join(stopwords_en)

"the,that,have,now,will,my,this,d,aren't,itself,t,yours,while,below,where,its,ain,which,o,haven't,wasn,each,before,her,himself,you'll,didn't,can,should,theirs,mightn't,don,mustn't,doesn't,she's,s,until,wouldn,is,out,does,because,had,too,shouldn't,being,at,isn't,against,no,ve,don't,did,haven,shan't,wouldn't,has,you're,do,of,yourself,whom,you've,some,isn,we,couldn't,won't,a,than,aren,it's,i,off,herself,m,why,weren't,needn,nor,to,hadn,an,with,how,should've,couldn,so,won,are,what,mustn,that'll,when,myself,over,mightn,our,she,as,been,not,between,it,just,above,ours,all,after,shouldn,these,here,very,ma,those,yourselves,own,any,doing,needn't,by,he,ourselves,such,re,you'd,hers,then,wasn't,for,through,from,into,in,am,under,ll,on,once,and,few,down,hadn't,your,who,hasn't,during,didn,up,same,or,more,them,other,shan,if,be,was,further,his,they,both,about,having,him,y,only,most,but,doesn,me,you,again,their,hasn,there,were,weren,themselves"

### And the German ones are:

In [11]:
",".join(stopwords_de)

'durch,anderr,deines,diesen,ihr,jede,keinem,haben,gewesen,manchem,würde,unserem,jeder,will,zwischen,dasselbe,viel,ihm,demselben,keinen,diese,oder,unser,zum,ihren,im,einigen,sein,derselbe,sondern,euren,über,jedes,nach,welchen,dieselbe,seines,selbst,das,ander,desselben,wirst,denselben,euch,seiner,ihrem,die,dies,etwas,allen,ihrer,soll,bin,einig,ihnen,dem,meiner,wenn,meinen,mir,kann,dann,eine,diesem,mein,anderem,derer,ihn,unseren,welchem,wo,er,machen,andere,können,auf,eurer,eurem,unsere,wollte,für,sollte,jeden,manches,seinem,dazu,eures,jener,also,nur,unter,damit,der,dessen,weiter,würden,dir,zwar,seine,bist,keine,an,noch,um,hab,uns,hinter,so,eines,ins,solche,solches,anders,auch,hin,wird,sie,von,eure,deinen,sonst,daß,unseres,wir,doch,während,welche,einiger,denn,anderen,war,nun,jenes,dich,musste,werden,nichts,als,meinem,welcher,ohne,dass,deine,sind,solchen,solcher,zur,waren,einer,ich,und,aber,deiner,hier,hatten,jenen,derselben,meines,ein,aller,jetzt,dein,jedem,jene,meine,einige,vom,warst,alle

### Removing stopwords from our stemmed documents

In [12]:
def remove_stopwords(stemmed_document: List[str], stopwords: Set[str]) -> List[str]:
    # Keep every token that is not in the stopword set
    return [word for word in stemmed_document if word not in stopwords]
    
articles_en_final = [remove_stopwords(doc, stopwords_en) for doc in articles_en_stemmed]
articles_de_final = [remove_stopwords(doc, stopwords_de) for doc in articles_de_stemmed]

In [13]:
articles_en_final[0]

['reuter',
 '-',
 'short-sel',
 ',',
 'wall',
 'street',
 "'s",
 'dwindling\\band',
 'ultra-cyn',
 ',',
 'see',
 'green',
 '.']
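
The final token list still contains pure punctuation tokens such as `'-'`, `','` and `'.'`, which the stopword sets do not cover. The following is a minimal sketch of one way to drop them; `drop_punctuation` is a hypothetical helper and the token list is a shortened copy of the output above. Note that a token like `"'s"` survives this filter because it contains a letter.

```python
from typing import List

def drop_punctuation(tokens: List[str]) -> List[str]:
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

tokens = ["reuter", "-", "short-sel", ",", "wall", "street", "'s", "see", "green", "."]
print(drop_punctuation(tokens))
```

Whether such a filter helps depends on the task; for TF-IDF queries, punctuation tokens mostly add noise to the dictionary.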

# **Gensim**

[https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/)

Gensim describes itself as "Topic Modelling for Humans" and is the third and final NLP library that we will look at. I have mainly used Gensim to build TF-IDF models and run text queries on datasets. We will use our NLTK-preprocessed documents as input to build a dictionary, a corpus, and an index with Gensim, and compute the TF-IDF matrix so that we can run text queries on our data.

In [14]:
from gensim import corpora
from gensim import models
from gensim import similarities

### Building the TF-IDF model

In [15]:
size = 500 # adjust if model too big
corpus_dictionary_en = corpora.Dictionary(articles_en_final[:size])
corpus_en = [corpus_dictionary_en.doc2bow(document) for document in articles_en_final[:size]]
model_en = models.TfidfModel(corpus_en)
index_en = similarities.MatrixSimilarity(model_en[corpus_en])
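
To make the weighting less of a black box, here is a minimal pure-Python sketch of TF-IDF. It assumes what I understand to be Gensim's default scheme (raw term counts, a base-2 logarithm for the inverse document frequency, and L2 normalization per document); the `tfidf` function and the toy documents are hypothetical and not part of Gensim.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by count * log2(N / df), then L2-normalize per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in counts.items()}
        # Terms that occur in every document get weight 0 and are dropped.
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy corpus, invented for illustration only.
docs = [["oil", "price", "rise"], ["oil", "export", "halt"], ["oil", "price", "record"]]
for vec in tfidf(docs):
    print(vec)
```

In this toy corpus, "oil" appears in every document and therefore carries no weight, while rarer terms such as "rise" are weighted more heavily than "price".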

To calculate the similarity between a query and our documents, the query has to be preprocessed the same way our data was:

In [16]:
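def query_en(query_string: str) -> List[Tuple[int, float]]:
    # Preprocess the query exactly like the documents: tokenize, stem, remove stopwords
    tokens = nltk.word_tokenize(query_string)
    tokens = remove_stopwords(stem(tokens, language = "english"), stopwords_en)
    q = corpus_dictionary_en.doc2bow(tokens)
    # Similarity of the TF-IDF-weighted query against every indexed document
    result = index_en[model_en[q]]
    # Sort (document index, similarity) pairs by descending similarity
    result = sorted(enumerate(result), key = lambda item: -item[1])
    # Print the three best hits together with their article text
    for hit in result[:3]:
        print(hit, articles_en[hit[0]])
    return result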

### Gensim returns the index of each matching document and its similarity

In [17]:
query_en("Scientists United States");

(237, 0.3952839) Scientists in the United States find a way to turn lazy monkeys into workaholics using gene therapy.
(450, 0.21606433) AFP - National Basketball Association players trying to win a fourth consecutive Olympic gold medal for the United States have gotten the wake-up call that the "Dream Team" days are done even if supporters have not.
(462, 0.20889904)  ATHENS (Reuters) - The United States beat Canada in a world  best time to qualify for the final of the men's Olympic eights  race Sunday, as the two crews renewed their fierce rivalry in  front of a raucous crowd at Schinias.
