# PART 1: Text preprocessing and classic TF-IDF-based classification

We use the following packages:
- NLTK
- SpaCy
- Voikko
- Polyglot
- TurkuNLP neural parser
- Scikit-learn

We use the toy dataset "wikipedia_toydata_FIN", which contains 416 Finnish Wikipedia articles related to health (212 texts) and economy (204 texts).
There are two versions of the data:
- Standard version: "wikipedia_toydata_FIN_simple.txt"
- Version suitable for the neural parser only: "wikipedia_toydata_FIN_commented.txt"

Material and tutorials:
- https://github.com/TurkuNLP/Text_Mining_Course/blob/master/Elementary%20text%20processing.ipynb
- https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
- https://data.solita.fi/finnish-stemming-and-lemmatization-in-python/
- https://polyglot.readthedocs.io/en/latest/

In [1]:
# root folder of data
DATA_ROOT = r'D:\Downloads\NLP_introduction-20191023T064334Z-001\NLP_introduction' + r'\\'

Typical steps for processing text are:
1. Tokenizing (separate words)
2. Spellchecking/correcting
3. Part of speech (POS) tagging
4. Lemmatizing (getting basewords)
5. Named entity recognition (NER tagging)

The specific order of steps 2-5 is somewhat of a gray zone and depends on which tool or pipeline is used.
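
As a minimal illustration of step 1 only, tokenizing can be approximated with a regular expression. This is a sketch under assumptions: the function name `simple_tokenize` and the pattern are made up for illustration and are not what NLTK, SpaCy, Polyglot or Voikko do internally.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches runs of letters/digits (Unicode-aware in Python 3, so "ö" is a letter);
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Kuuleva on henkilö, jonka kuulo on normaali.")
print(tokens)
```

Real tokenizers handle abbreviations, numbers and hyphenated compounds more carefully, which is one reason to prefer the library tools used in this notebook.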

In [2]:
'''
Simple text preprocessing using NLTK, SpaCy, Polyglot and Voikko.

We create a list of text samples where each element is a dictionary with fields:
 label: either TERVEYS or TALOUS
 tokens_raw: list of raw tokens
 tokens_lemma: list of lemmatized tokens
 tokens_pos: list of part-of-speech tags, one per token
 tokens_ner: list of named-entity tags, one per token (empty string if none)

NOTE: NLTK and SpaCy do not directly support Finnish, hence they are typically only useful for tokenization.

'''
import pandas

# read labels and texts
A = pandas.read_csv(DATA_ROOT + r'wikipedia_toydata_FIN_simple.txt', delimiter="\t", encoding="utf-8", names=['label', 'text'])
labels = list(A.label) # labels
samples = list(A.text) # texts

# use NLTK
from nltk.tokenize import word_tokenize
print('processing with NLTK')
data_nltk = [{'label':labels[k],'tokens_raw':word_tokenize(x)} for k,x in enumerate(samples)] # only one model
# NOTE: No lemmatization or other features for Finnish

# use SpaCy
import spacy
# You need to download "xx_ent_wiki_sm" with "python -m spacy download xx_ent_wiki_sm"
nlp = spacy.load("xx_ent_wiki_sm") # load multilingual model, no model for Finnish
print('processing with SpaCy')
# nlp.pipe parses each sample once (instead of calling nlp(sample) three times per sample)
data_spacy= [{'label':labels[k],
            'tokens_raw':[x.text for x in doc],
            'tokens_lemma':[x.lemma_ for x in doc],
            'tokens_pos':[x.pos_ for x in doc]} for k,doc in enumerate(nlp.pipe(samples))
            ]
# NOTE: Lemmas are just raw tokens and pos are empty, not useful for Finnish

# use Polyglot
from polyglot.text import Text
print('processing with Polyglot')
data_polyglot=[]
for k,sample in enumerate(samples):
    text = Text(sample,hint_language_code='fi')
    # populate tags and entities
    text.pos_tags
    text.entities
    # get tokens
    tokens_raw,tokens_pos = zip(*[[x[0],x[1]] for x in text.pos_tags])
    # initialize entities (empty if none)
    tokens_ner = ['' for x in tokens_raw]     
    # populate found entities
    for ent in text.entities:
        start=ent.start
        stop =ent.end
        for i in range(start,stop):
            tokens_ner[i] = ent.tag 
    # add sample
    data_polyglot.append({'label':labels[k],'tokens_raw':tokens_raw,'tokens_pos':tokens_pos,'tokens_ner':tokens_ner})
# NOTE: No lemmatization available

# use Voikko
from voikko import libvoikko
voikko_object = libvoikko.Voikko(u"fi") # Voikko object for Finnish
print('processing with Voikko')
data_voikko=[{'label':labels[k],'tokens_raw':[x.tokenText for x in voikko_object.tokens(sample) if x.tokenTypeName != 'WHITESPACE']} for k,sample in enumerate(samples)]

print('..proofreading and lemmatizing')
for sample in data_voikko:
    lemmas=[]
    poss = []
    for k,token_raw in enumerate(sample['tokens_raw']):
        suggested_words = voikko_object.suggest(token_raw) # get suggestions
        if len(suggested_words)>0:
            token_raw = suggested_words[0] # replace raw token with first suggestion
            sample['tokens_raw'][k] = token_raw # update with new word

        # Analyze the word with voikko
        voikko_dict = voikko_object.analyze(token_raw)
        # Extract the base form, if the word is recognized
        if voikko_dict:
            token_lemma = voikko_dict[0]['BASEFORM']
            pos = voikko_dict[0]['CLASS']
        # If word is not recognized, add the original word
        else:
            token_lemma = token_raw
            pos = ''
        lemmas.append(token_lemma)
        poss.append(pos)
    sample['tokens_lemma']=lemmas
    sample['tokens_pos']=poss

print('all done')

processing with NLTK
processing with SpaCy
processing with Polyglot
processing with Voikko
..proofreading and lemmatizing
all done
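
The Polyglot loop above turns entity spans into one tag per token. A minimal, library-free sketch of that alignment is given here; the function name `spans_to_tags`, the example sentence and the example spans are hypothetical, and the `(start, end, tag)` tuples stand in for Polyglot's entity objects with an exclusive end index.

```python
def spans_to_tags(n_tokens, spans):
    """Return one tag per token; tokens outside every span get an empty string."""
    tags = [''] * n_tokens
    for start, end, tag in spans:
        # end is exclusive, matching range(start, stop) in the loop above
        for i in range(start, end):
            tags[i] = tag
    return tags

# hypothetical example sentence and entity spans
tokens = ['Kela', 'sijaitsee', 'Helsingissä', '.']
spans = [(0, 1, 'I-ORG'), (2, 3, 'I-LOC')]
for token, tag in zip(tokens, spans_to_tags(len(tokens), spans)):
    print(token, repr(tag))
```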


Next we'll use the advanced all-in-one preprocessing pipeline, the "Turku NLP neural parser". For best results, all steps should ideally be done jointly because they depend on each other, especially when finding lemmas. This is the recommended method, especially for Finnish!

Running the neural parser is a bit tricky and is done outside Python. The software is distributed as a Docker image.

A. First install Docker, which is available for free for Windows, Mac and Linux.

B. Start a console/terminal with Docker available and start processing with these commands:

In Linux:  
 'cat wikipedia_toydata_FIN_commented.txt | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream fi_tdt parse_plaintext > wikipedia_toydata_FIN_commented_parsed.txt'  

In Windows PowerShell (2 commands):  
 '$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8'  
 'Get-Content -Encoding UTF8 wikipedia_toydata_FIN_commented.txt | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream fi_tdt parse_plaintext > wikipedia_toydata_FIN_commented_parsed.txt'  

The first run takes longer because the image is first downloaded and installed on the system. After that, it's faster.

C. After parsing has finished, the result is in CoNLL-U format, which needs to be parsed.
Example output (first lines) looks like this:

\# newdoc  
\# newpar  
\# sent_id = 1  
\# text = ﻿Dummy  
1	﻿Dummy	﻿Dummy	VERB	Symb	_	0	root	_	SpacesAfter=\r\n

\# sample001  
\# newdoc  
\# newpar  
\# sent_id = 1  
\# text = TERVEYS Kuuleva Kuuleva on henkilö, jonka molemmissa korvissa on normaali kuulo tai jonka kuulonalenema on niin lievä, ettei se haittaa hänen jokapäiväistä elämäänsä.  
1	TERVEYS	terveys	NOUN	N	Case=Nom|Derivation=Vs|Number=Sing	2	obj	_	SpacesAfter=\t\s  
2	Kuuleva	kuulla	VERB	V	Case=Nom|Degree=Pos|Number=Sing|PartForm=Pres|VerbForm=Part|Voice=Act	3	acl	_	_  
3	Kuuleva	kuuleva	NOUN	A	Case=Nom|Degree=Pos|Number=Sing	5	nsubj:cop	_	_  
4	on	olla	AUX	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	5	cop	_	_  
5	henkilö	henkilö	NOUN	N	Case=Nom|Number=Sing	0	root	_	SpaceAfter=No  
6	,	,	PUNCT	Punct	_	9	punct	_	_  
7	jonka	joka	PRON	Pron	Case=Gen|Number=Sing|PronType=Rel	9	nmod:poss	_	_  
8	molemmissa	molemmat	PRON	Pron	Case=Ine|Number=Plur|PronType=Ind	9	det	_	_  
9	korvissa	korva	NOUN	N	Case=Ine|Number=Plur	5	acl:relcl	_	_  
10	on	olla	AUX	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	9	cop	_	_  
11	normaali	normaali	ADJ	A	Case=Nom|Degree=Pos|Number=Sing	12	amod	_	_  
12	kuulo	kuulo	NOUN	N	Case=Nom|Number=Sing	9	nsubj:cop	_	_  
13	tai	tai	CCONJ	C	_	18	cc	_	_  
14	jonka	joka	PRON	Pron	Case=Gen|Number=Sing|PronType=Rel	15	nmod:poss	_	_  
15	kuulonalenema	kuulon#alenema	NOUN	N	Case=Nom|Number=Sing	18	nsubj:cop	_	_  
16	on	olla	AUX	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	18	cop	_	_  
17	niin	niin	ADV	Adv	_	18	advmod	_	_  
18	lievä	lievä	ADJ	A	Case=Nom|Degree=Pos|Number=Sing	9	conj	_	SpaceAfter=No  
19	,	,	PUNCT	Punct	_	22	punct	_	_  
20	ettei	että#ei	VERB	V	Number=Sing|Person=3|Polarity=Neg|VerbForm=Fin|Voice=Act	22	mark	_	_  
21	se	se	PRON	Pron	Case=Nom|Number=Sing|PronType=Dem	22	nsubj	_	_  
22	haittaa	haitata	VERB	V	Connegative=Yes|Mood=Ind|Tense=Pres|VerbForm=Fin	18	ccomp	_	_  
23	hänen	hän	PRON	Pron	Case=Gen|Number=Sing|Person=3|PronType=Prs	25	nmod:poss	_	_  
24	jokapäiväistä	jokapäiväinen	ADJ	A	Case=Par|Degree=Pos|Derivation=Inen|Number=Sing	25	amod	_	_  
25	elämäänsä	elämä	NOUN	N	Case=Par|Number=Sing|Person[psor]=3	22	obj	_	SpaceAfter=No  
26	.	.	PUNCT	Punct	_	5	punct	_	_  

\# sent_id = 2

For an illustrated example, see:  
'http://bionlp-www.utu.fi/parser_demo'
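
Each token line in the output above has ten tab-separated CoNLL-U columns. The following sketch splits one such line by hand to show which column is which; the helper name `parse_conllu_line` is made up for illustration, and the notebook itself relies on the conllu package rather than on this function.

```python
# the ten CoNLL-U columns, in order
CONLLU_FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos',
                 'feats', 'head', 'deprel', 'deps', 'misc']

def parse_conllu_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip('\n').split('\t')
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError('expected %i columns, got %i' % (len(CONLLU_FIELDS), len(values)))
    token = dict(zip(CONLLU_FIELDS, values))
    # morphological features look like "Case=Ine|Number=Plur"; "_" means none
    feats = token['feats']
    token['feats'] = {} if feats == '_' else dict(item.split('=', 1) for item in feats.split('|'))
    return token

# token 9 from the example output above
line = "9\tkorvissa\tkorva\tNOUN\tN\tCase=Ine|Number=Plur\t5\tacl:relcl\t_\t_"
token = parse_conllu_line(line)
print(token['form'], token['lemma'], token['upos'], token['feats'])
```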

We'll use the conllu package and custom code to parse the data. Here the "# sample001" lines are comments that are needed to separate the samples. For the same reason, we needed to add dummy text at the beginning.

NOTE: The output file is in UTF-16 format, most likely because the '>' redirection in Windows PowerShell writes UTF-16 by default.

In [3]:
'''
Parsing CONLL-U format file coming from Turku NLP neural parser with assumed file name "wikipedia_toydata_FIN_commented_parsed.txt"
'''

from conllu import parse # parser
import re

sample_separator = re.compile('# sample...') # separator of samples
data_turkuNLP = [] # collect samples here

# parse one sample and append it to the list given in "data"
def get_sample(content,data):
    if len(content)==0:
        return
    sentences = parse(content)
    tokens_raw = []
    tokens_lemma = []
    tokens_pos = []
    label = None
    for sentence in sentences:
        for token in sentence:
            if label is None:
                # the first token of each sample is its label (TERVEYS or TALOUS)
                label = token['form']
            else:
                tokens_raw.append(token['form'])
                tokens_lemma.append(token['lemma'])
                tokens_pos.append(token['upostag'])
    data.append({'label': label, 'tokens_raw': tokens_raw, 'tokens_lemma': tokens_lemma,'tokens_pos':tokens_pos})

# read parsed dataset and make samples
print('parsing textfile')
with open(DATA_ROOT + 'wikipedia_toydata_FIN_commented_parsed.txt','r',encoding="utf-16") as f:
    is_reading=False
    content = ''
    for line_num, line in enumerate(f):
        if len(sample_separator.findall(line))>0: # sample separator present
            get_sample(content,data_turkuNLP)
            content = ''
            is_reading = True
        elif is_reading:
            content += line
    get_sample(content,data_turkuNLP)

import numpy as np
print('Done. Total %i samples after parsing (labels %s)' % (len(data_turkuNLP),np.unique([x['label'] for x in data_turkuNLP])))

parsing textfile
Done. Total 416 samples after parsing (labels ['TALOUS' 'TERVEYS'])


Note on Named Entity Recognition (NER): apart from Polyglot, there are no Python tools for NER tagging in Finnish (as far as I know). Such a tool is also not available from the TurkuNLP group.

There is one Linux binary:
https://korp.csc.fi/download/finnish-tagtools/v1.3/
Using it requires some effort: the text has to be fed in as files and the results parsed from other text files. In principle it's doable, and one could then add yet another field "tokens_ner" to the data above. We'll skip this tool for now.

In [4]:
# save data_turkuNLP as pickle. We'll use it later.
import pickle
with open(DATA_ROOT + 'turkuNLP_preprocessed_data.pickle','wb') as f: # "with" closes the file after writing
    pickle.dump(data_turkuNLP,f)

In [5]:
# topic classification with sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold,cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import NuSVC
from sklearn.pipeline import make_pipeline
import pandas

from nltk.corpus import stopwords # NLTK contains stopwords for Finnish
fin_stop = set(stopwords.words('finnish')) # stopwords for finnish

# define custom tokenizer when bypassing built-in preprocessing, we'll also skip stopwords
my_tokenizer = lambda x:[y.lower() for y in x if y.lower() not in fin_stop] 

# k-fold cross validation
cv = StratifiedKFold(n_splits=10) # 10 folds

# Note: FunctionTransformer "hack" is needed to convert sparse features to dense (GaussianNB does not accept sparse input)
# Note: toarray() returns a plain ndarray; todense() returns numpy.matrix, which recent scikit-learn versions may reject
# Note: the key 'MultinomialNB' is only a label, the classifier in those pipelines is GaussianNB

# pipeline with own custom preprocessing
text_clf_custom = {'MultinomialNB':make_pipeline(
    TfidfVectorizer(analyzer='word',tokenizer=my_tokenizer,ngram_range=(1,2),lowercase=False,max_features=15000),FunctionTransformer(lambda x: x.toarray(), accept_sparse=True,validate=True),GaussianNB()),
    'NuSVM':make_pipeline(
    TfidfVectorizer(analyzer='word',tokenizer=my_tokenizer,ngram_range=(1,2),lowercase=False,max_features=15000),FunctionTransformer(lambda x: x.toarray(), accept_sparse=True,validate=True),NuSVC(nu=0.5,gamma='scale'))}

# pipeline with sklearn built-in preprocessing (stop words passed as a list, which recent scikit-learn versions expect)
text_clf_simple = {'MultinomialNB':make_pipeline(
    TfidfVectorizer(stop_words = list(fin_stop),ngram_range=(1,2),max_features=15000),FunctionTransformer(lambda x: x.toarray(), accept_sparse=True,validate=True),GaussianNB()),
    'NuSVM':make_pipeline(
    TfidfVectorizer(stop_words = list(fin_stop),ngram_range=(1,2),max_features=15000),FunctionTransformer(lambda x: x.toarray(), accept_sparse=True,validate=True),NuSVC(nu=0.5,gamma='scale'))}

print('Classifying')
for preprocessor in ['TurkuNLP','Sklearn']:
    for k,data_type in enumerate(['tokens_raw', 'tokens_lemma']):
        if preprocessor == 'TurkuNLP':
            text_clf = text_clf_custom # use custom pipeline            
            data=pickle.load(open(DATA_ROOT + 'turkuNLP_preprocessed_data.pickle','rb')) # our preprocessed data
            X = [x[data_type] for x in data]
            y = [x['label'] for x in data]
        else:
            if k==1:
                continue # don't have lemmas, skip this iteration
            text_clf = text_clf_simple # use standard pipeline
            A = pandas.read_csv(DATA_ROOT + 'wikipedia_toydata_FIN_simple.txt',delimiter="\t",encoding="utf-8",names = ['label','text'])
            y = list(A.label)
            X = list(A.text)
        for pipe in text_clf.keys():
            scores = cross_validate(text_clf[pipe],X=X,y=y,cv=cv,scoring=['f1_macro','accuracy'])
            print('..Preprocessor "%s", data type "%s", classifier "%s": Mean accuracy %f, mean F1 %f' % (preprocessor,data_type,pipe,scores['test_accuracy'].mean(),scores['test_f1_macro'].mean()))
print('all done')

Classifying
..Preprocessor "TurkuNLP", data type "tokens_raw", classifier "MultinomialNB": Mean accuracy 0.944763, mean F1 0.944643
..Preprocessor "TurkuNLP", data type "tokens_raw", classifier "NuSVM": Mean accuracy 0.907890, mean F1 0.907159
..Preprocessor "TurkuNLP", data type "tokens_lemma", classifier "MultinomialNB": Mean accuracy 0.959223, mean F1 0.959113
..Preprocessor "TurkuNLP", data type "tokens_lemma", classifier "NuSVM": Mean accuracy 0.953946, mean F1 0.953866
..Preprocessor "Sklearn", data type "tokens_raw", classifier "MultinomialNB": Mean accuracy 0.934949, mean F1 0.934758
..Preprocessor "Sklearn", data type "tokens_raw", classifier "NuSVM": Mean accuracy 0.929672, mean F1 0.929201
all done
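
For intuition about the features the classifiers above receive, the TF-IDF weighting can be sketched in plain Python. This is a simplified illustration, not the scikit-learn implementation: it covers unigrams only (no bigrams, no max_features), the function name `tfidf` and the toy lemma lists are hypothetical, and the formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization is assumed to match TfidfVectorizer's default settings.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (L2-normalized TF-IDF)."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within the document
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# hypothetical toy "documents", already tokenized and lemmatized
docs = [['kuulo', 'korva', 'kuulo'], ['talous', 'raha'], ['korva', 'raha', 'talous']]
for vector in tfidf(docs):
    print({t: round(w, 3) for t, w in vector.items()})
```

In the first toy document, "kuulo" gets a higher weight than "korva" because it occurs twice there and in no other document, which is the kind of topic-specific signal the classifiers can pick up.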
