### Loading our data

In [1]:
import csv
import pandas as pd

# English data
classes_en = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
train_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/train.csv", 
                       names = ["Label", "Title", "Article"],
                       encoding = "utf-8")
test_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/test.csv", 
                      names = ["Label", "Title", "Article"],
                      encoding = "utf-8")

# German data
train_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/train.csv", 
                       sep = ";", names = ["Label", "Article"], 
                       quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")
test_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/test.csv", 
                       sep = ";", names = ["Label", "Article"], 
                       quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")

By iterating over the dataframe rows we can construct "vanilla" lists of labels and documents that we can work on:

In [2]:
labels_en = [classes_en[int(row["Label"])] for i, row in train_en.iterrows()]
articles_en = [row["Article"] for i, row in train_en.iterrows()]
labels_de = [row["Label"] for i, row in train_de.iterrows()]
articles_de = [row["Article"] for i, row in train_de.iterrows()]
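
`iterrows()` gets slow on large dataframes. As a rough sketch, the same lists can be built directly from the columns. The two-row dataframe `df` below is a made-up stand-in for `train_en`, not the real data:

```python
import pandas as pd

classes = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Hypothetical mini dataframe with the same column names as train_en.
df = pd.DataFrame({"Label": [3, 2],
                   "Title": ["t1", "t2"],
                   "Article": ["a1", "a2"]})

# Map the numeric labels to class names and convert the columns to plain lists.
labels = df["Label"].map(classes).tolist()
articles = df["Article"].tolist()

print(labels)
print(articles)
```

For the German data the label column already holds the class name, so `train_de["Label"].tolist()` should be enough there.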

In [3]:
articles_en[:5]

["Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
 'Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.',
 'Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.',
 'AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.']

# **NLTK**

[https://www.nltk.org/](https://www.nltk.org/)

NLTK - short for Natural Language Toolkit - is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

I mostly use NLTK for preprocessing tasks because, in my opinion, it is more lightweight and straightforward than spaCy.

In [4]:
import nltk
from nltk.corpus import stopwords as nltkStopwords
from nltk.stem.snowball import SnowballStemmer

### NLTK tokenizes documents, which can be any string

In [5]:
nltk.download("punkt")  # newer NLTK releases may additionally need nltk.download("punkt_tab")

def tokenize(doc):
    return nltk.word_tokenize(doc)

articles_en_tokenized = [tokenize(doc) for doc in articles_en]
articles_de_tokenized = [tokenize(doc) for doc in articles_de]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Micha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
articles_en_tokenized[0]

['Reuters',
 '-',
 'Short-sellers',
 ',',
 'Wall',
 'Street',
 "'s",
 'dwindling\\band',
 'of',
 'ultra-cynics',
 ',',
 'are',
 'seeing',
 'green',
 'again',
 '.']
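
Note that `'dwindling\\band'` comes out as a single token. My guess is that the backslashes in the AG News articles stand in for line breaks, which is an assumption I have not verified against the dataset description. If that holds, a small cleaning step before tokenizing could look like this (the helper name `replace_backslashes` is made up for this sketch):

```python
def replace_backslashes(doc):
    """Replace every backslash with a space and collapse repeated whitespace."""
    return " ".join(doc.replace("\\", " ").split())

sample = "Wall Street's dwindling\\band of ultra-cynics"
cleaned = replace_backslashes(sample)
print(cleaned)
```

The cleaned string could then be passed to `tokenize()` as before.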

### Stemming can be done with NLTK's Snowball Stemmer

[https://www.nltk.org/api/nltk.stem.snowball.html](https://www.nltk.org/api/nltk.stem.snowball.html)

In [7]:
def stem(tokenized_document, language = "english"):
    # language must be one of the names listed in SnowballStemmer.languages
    stemmer = SnowballStemmer(language, ignore_stopwords = False)
    return [stemmer.stem(word) for word in tokenized_document]

articles_en_stemmed = [stem(doc, "english") for doc in articles_en_tokenized]
articles_de_stemmed = [stem(doc, "german") for doc in articles_de_tokenized]

In [8]:
articles_en_stemmed[0]

['reuter',
 '-',
 'short-sel',
 ',',
 'wall',
 'street',
 "'s",
 'dwindling\\band',
 'of',
 'ultra-cyn',
 ',',
 'are',
 'see',
 'green',
 'again',
 '.']
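
As the output shows, the stemmer also lowercases every token. To check which language names `SnowballStemmer` accepts and to try it on single words, a quick sketch like the following can help (the word list is just an illustration taken from the first article):

```python
from nltk.stem.snowball import SnowballStemmer

# Tuple of language names that can be passed to SnowballStemmer(...)
print(SnowballStemmer.languages)

english_stemmer = SnowballStemmer("english")
for word in ["Reuters", "seeing", "ultra-cynics"]:
    print(word, "->", english_stemmer.stem(word))
```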

### NLTK also offers built-in stopword sets for different languages

In [9]:
stopwords_en = set(nltkStopwords.words("english"))
stopwords_de = set(nltkStopwords.words("german"))

### The English stopwords are:

In [10]:
",".join(stopwords_en)

"me,that'll,myself,this,hasn,nor,below,what,during,to,in,again,s,mightn,is,haven't,any,didn't,other,about,mustn,her,being,couldn,won,hasn't,an,by,shan't,for,until,doesn,before,so,ain,hers,just,few,can,hadn't,while,who,your,yourself,it's,been,herself,hadn,them,both,you've,wasn't,i,my,you'd,his,ourselves,on,not,that,between,t,does,we,out,themselves,because,off,have,and,y,same,whom,haven,couldn't,above,wasn,was,up,having,the,at,itself,theirs,after,most,isn't,here,m,no,they,doesn't,needn't,she's,ve,which,am,you,you'll,their,ours,a,into,re,doing,how,wouldn,against,why,should,there,were,our,don,where,he,if,of,under,wouldn't,those,its,be,these,it,or,you're,ma,shan,more,ll,yourselves,o,once,aren't,himself,too,then,as,mustn't,had,now,won't,shouldn,some,needn,through,but,don't,didn,did,all,will,mightn't,each,weren,when,only,isn,very,d,down,weren't,with,should've,aren,such,shouldn't,yours,from,than,further,are,over,him,do,has,she,own"

### And the German ones are:

In [11]:
",".join(stopwords_de)

'das,einig,ich,solches,deinem,einen,dieses,manche,aber,unsere,in,dich,hatte,manchem,zum,ihm,vor,manches,seiner,oder,war,solche,seinen,wirst,haben,mancher,bei,doch,weiter,an,habe,damit,des,hatten,dies,welchen,andern,dieser,euch,einigen,so,unseren,er,kann,jeden,dann,anderes,dazu,hat,zur,sonst,wenn,dasselbe,jetzt,mein,um,solchen,zwischen,derselben,ihres,sondern,mit,allen,welche,demselben,mich,ihren,nichts,ihn,keiner,der,ander,wie,seines,keine,derselbe,ohne,meiner,nur,meinen,hin,deiner,manchen,ihnen,jenes,über,deinen,ihrer,eines,welchem,bis,diesen,durch,ein,anders,während,eure,deine,uns,unter,es,werde,eine,was,mir,wieder,dir,ins,muss,ihrem,sich,zwar,seine,wollen,bist,die,jenem,vom,weg,euer,jeder,ihr,anderm,also,zu,sind,am,alles,einigem,wir,denselben,seinem,einiger,anderer,meines,im,nun,diese,als,nach,unserem,dem,einmal,warst,aus,keinen,sehr,allem,hab,auch,solcher,hinter,soll,derer,bin,meinem,musste,anderen,einem,eurem,ist,wird,sollte,werden,gewesen,solchem,desselben,keinem,jede,alle,dein,k

### Removing stopwords from our stemmed documents

In [12]:
def remove_stopwords(stemmed_document, stopwords):
    # keep only the tokens that are not in the stopword set
    return [word for word in stemmed_document if word not in stopwords]

articles_en_final = [remove_stopwords(doc, stopwords_en) for doc in articles_en_stemmed]
articles_de_final = [remove_stopwords(doc, stopwords_de) for doc in articles_de_stemmed]
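
One caveat: because we stem before removing stopwords, a token is only dropped if its stemmed form is still identical to an entry in the stopword list. A stemmer can shorten a word such as `because`, and the shortened form would then slip through. A possible workaround, sketched here with a made-up three-word stopword list and a hypothetical helper `stem_stopwords`, is to add the stemmed form of every stopword to the set:

```python
from nltk.stem.snowball import SnowballStemmer

def stem_stopwords(stopwords, language):
    """Return the stopword set extended with the stemmed form of every entry."""
    stemmer = SnowballStemmer(language)
    return set(stopwords) | {stemmer.stem(word) for word in stopwords}

# Hypothetical mini stopword list; in the notebook this would be stopwords_en.
toy_stopwords = {"because", "the", "of"}
toy_stopwords_stemmed = stem_stopwords(toy_stopwords, "english")

stemmer = SnowballStemmer("english")
stemmed_doc = [stemmer.stem(w) for w in ["because", "of", "the", "markets"]]
filtered = [w for w in stemmed_doc if w not in toy_stopwords_stemmed]
print(filtered)
```

Removing stopwords before stemming would be another option; which order works better probably depends on the data.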

In [13]:
articles_en_final[0]

['reuter',
 '-',
 'short-sel',
 ',',
 'wall',
 'street',
 "'s",
 'dwindling\\band',
 'ultra-cyn',
 ',',
 'see',
 'green',
 '.']
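
The stopword sets do not cover punctuation, so tokens like `'-'`, `','` and `'.'` are still in the result. If they are not wanted, one simple option (a sketch, with `drop_punctuation_tokens` being a made-up helper name) is to keep only tokens that contain at least one letter or digit:

```python
def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

doc = ['reuter', '-', 'short-sel', ',', 'wall', 'street', "'s",
       'dwindling\\band', 'ultra-cyn', ',', 'see', 'green', '.']
without_punctuation = drop_punctuation_tokens(doc)
print(without_punctuation)
```

Tokens such as `"'s"` survive this filter because they contain a letter.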

# **Gensim**

[https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/)

Gensim bills itself as "Topic Modelling for Humans" and is the third and final NLP library that we will have a look at. I have mainly used Gensim to build TF-IDF models and run text queries on datasets. We are going to use our NLTK-preprocessed documents as input to build a dictionary, a corpus and an index with Gensim, and calculate the TF-IDF matrix to run text queries on our data.

In [14]:
from gensim import corpora
from gensim import models
from gensim import similarities



### Building the TF-IDF model

In [15]:
size = 500 # number of articles to index, lower this if the model gets too big
corpus_dictionary_en = corpora.Dictionary(articles_en_final[:size])
corpus_en = [corpus_dictionary_en.doc2bow(document) for document in articles_en_final[:size]]
model_en = models.TfidfModel(corpus_en)
index_en = similarities.MatrixSimilarity(model_en[corpus_en])
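
To get a feeling for what the TF-IDF model computes, here is a pure-Python sketch on three tiny made-up documents. It assumes Gensim's default settings as I understand them (raw term count, base-2 logarithm of `N / df` as inverse document frequency, vectors normalized to unit length); it is an illustration and not Gensim's actual implementation:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        # terms that occur in every document get weight 0 and are dropped
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical preprocessed documents.
docs = [["oil", "price", "rise"],
        ["oil", "export", "halt"],
        ["olymp", "gold", "medal"]]
vectors = tfidf(docs)
for vec in vectors:
    print(vec)
```

Terms that appear in fewer documents (`price`) end up with a higher weight than terms shared across documents (`oil`).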

To calculate the similarity between a query and our documents, the query has to be preprocessed the same way our data was:

In [16]:
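def query_en(query_string):
    # preprocess the query exactly like the indexed documents
    tokens = remove_stopwords(stem(tokenize(query_string), language = "english"), stopwords_en)
    q = corpus_dictionary_en.doc2bow(tokens)
    q_model = model_en[q]
    # similarity of the query to every indexed document
    result = index_en[q_model]
    # sort (document index, similarity) pairs by descending similarity
    result = sorted(enumerate(result), key = lambda item: -item[1])
    # print the three best matches
    for match in result[:3]:
        print(match, articles_en[match[0]])
    return result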
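
`query_en` sorts every similarity score, which is fine for 500 documents. If only the best few matches are needed on a larger index, `heapq.nlargest` avoids the full sort. A small sketch with made-up scores (`top_k` is a hypothetical helper, not part of Gensim):

```python
import heapq

def top_k(similarities, k=3):
    """Return the k (document index, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, enumerate(similarities), key=lambda item: item[1])

# Hypothetical similarity scores for five documents.
scores = [0.1, 0.0, 0.4, 0.2, 0.3]
best = top_k(scores)
print(best)
```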

### Gensim returns the matching documents and their similarity scores

In [17]:
query_en("Scientists United States");

(237, 0.3952839) Scientists in the United States find a way to turn lazy monkeys into workaholics using gene therapy.
(450, 0.21606433) AFP - National Basketball Association players trying to win a fourth consecutive Olympic gold medal for the United States have gotten the wake-up call that the "Dream Team" days are done even if supporters have not.
(462, 0.20889904)  ATHENS (Reuters) - The United States beat Canada in a world  best time to qualify for the final of the men's Olympic eights  race Sunday, as the two crews renewed their fierce rivalry in  front of a raucous crowd at Schinias.
