# Introduction to NLP: Assignment 2 

## Assignment on Text Classification and Sequence Labeling

### Description of Assignment 2

This assignment relates to the Text classification and Sequence labeling themes of the introduction to NLP (courses Deskriptiv analytik / Machine learning for descriptive problems), and will focus on gaining some practical, hands-on experience in building and training simple models for these tasks.

The assignment is handed in as a Jupyternotebook (or a PDF render thereof) containing the code used to solve the problem, output presenting the results, and, most importantly, notes that present the students' conclusions and answer questions posed in the assignment.

**Assignment steps/Questions:**

1. Test sklearn’s TfidfVectorizer in place of CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value?

2. Test different lengths of n-grams in the CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value ? Do these n-grams show up also in the list of most significant positive/negative features?

3. In the data package for the course [http://dl.turkunlp.org/intro-to-nlp.tar.gz](http://dl.turkunlp.org/intro-to-nlp.tar.gz), the directory language_identification contains data for 5 languages. Based on this data, train an SVM classifier for language recognition between these 5 languages.

4. If you completed (3), toy around with features, especially the ngram_range and analyzer parameters, which allow you to test classification based on character ngrams of various lengths (not only word n-grams). Gain some insight in to the accuracy of the classifier with different features, and try to identify misclassified documents -why do you think they were misclassified?

5. **BONUS** On the address universaldependencies.org, you will find datasets for a bunch of languages. These come in an easy-to-parse, well-documented format. Pick one language that interests you, and one treebank for that language, and try to builda POS tagger for this language. You can use the 4th column “UPOS” [https://universaldependencies.org/format.html](https://universaldependencies.org/format.html) Report on your findings. If you have extra time, try to experiment with various features and see if you can make your accuracy go up. You can check here [https://universaldependencies.org/conll18/results-upos.html](https://universaldependencies.org/conll18/results-upos.html) what the state of the art roughly is for your selected language and treebank. Did you come close?




### Import libraries

In [58]:
import json
import random
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import sklearn.svm
import sklearn.metrics
import numpy

### Read in IMDB data

In [59]:
with open("imdb_train.json") as f:
    data = json.load(f)
    
random.seed(10) # Seed to replicate same scenario for development
random.shuffle(data) # Shuffle data 

# Preview of data
print("class label:", data[0]["class"])
print("text:", data[0]["text"])

class label: neg
text: the single worst film i've ever seen in a theater. i saw this film at the austin film festival in 2004, and it blew my mind that this film was accepted to a festival. it was an interesting premise, and seemed like it could go somewhere, but just fell apart every time it tried to do anything. first of all, if you're going to do a musical, find someone with musical talent. the music consisted of cheesy piano playing that sounded like they were playing it on a stereo in the room they were filming. the lyrics were terribly written, and when they weren't obvious rhymes, they were groan-inducing rhymes that showed how far they were stretching to try to make this movie work. and you'd think you'd find people who could sing when making a musical, right? not in this case. luckily they were half talking/half singing in rhyme most of the time, but when they did sing it made me cringe. especially when they attempted to sing in harmony. and that just addresses the music. some

### Separate texts and labels 

In [60]:
# We need to gather the texts and labels into separate lists
texts=[d["text"] for d in data]
labels=[d["class"] for d in data]
print("Amount of texts:", len(texts))
print("Amount of labels", len(labels))
print()
for label, text in list(zip(labels, texts))[:10]:
    print(label, text[:50] + "...")

Amount of texts: 25000
Amount of labels 25000

neg the single worst film i've ever seen in a theater....
pos I think the reason for all the opinionated diarrhe...
neg This movie is horrible! It rivals \Ishtar\" in the...
neg This may not be the worst comedy of all time, but ...
pos I found this film to funny from the start. John Wa...
pos The problem is that the movie rode in on the coatt...
neg I was so looking forward to seeing this when it wa...
neg I actually saw this movie in the theater back in i...
neg blows my mind how this movie got made. i watched i...
neg Amateurish in the extreme. Camera work especially ...


## 1. Test sklearn’s TfidfVectorizer

**Test sklearn’s TfidfVectorizer in place of CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value?**

### Datasplit

In [61]:
train_texts, dev_texts, train_labels, dev_labels = train_test_split(texts, labels, test_size = 0.2)

### Create method for easier showing results in changes of n-grams and C-value

In [62]:
def createClassifier(ngramRange, maxFeatures, cValue, vectorizerType):
    # Change vectorizer type based on variable
    if(vectorizerType == "Count"):
        vectorizer = CountVectorizer(max_features = maxFeatures, binary=True, ngram_range = ngramRange)
    if(vectorizerType == "Idf"):
        vectorizer = TfidfVectorizer(max_features = maxFeatures, binary=True, ngram_range = ngramRange)
    
    feature_matrix_train = vectorizer.fit_transform(train_texts)
    feature_matrix_dev = vectorizer.transform(dev_texts)
    
    classifier = sklearn.svm.LinearSVC(C =  cValue, verbose = 0)
    classifier.fit(feature_matrix_train, train_labels)
    
    print("Vectorizer type={0}, C={1}, n-gram={2}".format(vectorizerType, cValue, ngramRange))
    print("DEV", classifier.score(feature_matrix_dev, dev_labels))
    print("TRAIN", classifier.score(feature_matrix_train, train_labels))
    print()

### CountVectorizer

In [63]:
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.0005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.05, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.5, 
    vectorizerType = "Count"
)

Vectorizer type=Count, C=0.0005, n-gram=(1, 1)
DEV 0.8688
TRAIN 0.89385

Vectorizer type=Count, C=0.005, n-gram=(1, 1)
DEV 0.8812
TRAIN 0.95645

Vectorizer type=Count, C=0.05, n-gram=(1, 1)
DEV 0.8728
TRAIN 0.9959

Vectorizer type=Count, C=0.5, n-gram=(1, 1)
DEV 0.8546
TRAIN 1.0



### TfidfVectorizer

In [64]:
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.0005, 
    vectorizerType = "Idf"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.005, 
    vectorizerType = "Idf"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.05, 
    vectorizerType = "Idf"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.5, 
    vectorizerType = "Idf"
)

Vectorizer type=Idf, C=0.0005, n-gram=(1, 1)
DEV 0.84
TRAIN 0.84545

Vectorizer type=Idf, C=0.005, n-gram=(1, 1)
DEV 0.856
TRAIN 0.8677

Vectorizer type=Idf, C=0.05, n-gram=(1, 1)
DEV 0.884
TRAIN 0.9218

Vectorizer type=Idf, C=0.5, n-gram=(1, 1)
DEV 0.8936
TRAIN 0.98515



#### Comparing TfidfVectorizer and CountVectorizer when changing C value 

First thing I noticed was that the classification results where very similar. The Count Vectorizers dev and train results are higher on a lower C value like 0.0005. When the C value is increased the Count Vectorizers Train rises to 0.95 very fast and even to 1.0 meanwhile the dev results go up to 0.88 and down to 0.86 when C value is 0.5. The optimal C value for the count vectorizer is probably around 0.005 where the data has not been overfitted to the train data. 

As mentioned above the Count vectorizers results where whigher on a lower C value. The Tfidf Vectorizer with a C value of 0.0005 has results around 0.84 on dev and train. When the C Value is increased, both the dev and train results increase steadily. Only after a C value as high as 0.5 is where the dev and train start to separate eachother. I'd say that the optimal C value for the Tfidf Vectorizer is around 0.05 because of diminishing returns on the dev result. This method seems to do a better job with not overfitting with the data. 

The Tfidf Vectorizer when it is run with an optimal C value has better dev results than Count Vectorizer.

## 2. Test different lengths of n-grams in the CountVectorizer on the IMDB data

**Test different lengths of n-grams in the CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value ? Do these n-grams show up also in the list of most significant positive/negative features?**


Lets start by looking att the results of different n-grams and try to find their optimal C values

### CountVectorizer n-gram (1-1)

In [65]:
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.0005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.05, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,1), 
    maxFeatures = 100000, 
    cValue = 0.5, 
    vectorizerType = "Count"
)

Vectorizer type=Count, C=0.0005, n-gram=(1, 1)
DEV 0.8688
TRAIN 0.89385

Vectorizer type=Count, C=0.005, n-gram=(1, 1)
DEV 0.8812
TRAIN 0.95645

Vectorizer type=Count, C=0.05, n-gram=(1, 1)
DEV 0.8728
TRAIN 0.9959

Vectorizer type=Count, C=0.5, n-gram=(1, 1)
DEV 0.8546
TRAIN 1.0



### CountVectorizer n-gram (1-2)

In [66]:
createClassifier(
    ngramRange = (1,2), 
    maxFeatures = 100000, 
    cValue = 0.0005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,2), 
    maxFeatures = 100000, 
    cValue = 0.005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,2), 
    maxFeatures = 100000, 
    cValue = 0.05, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,2), 
    maxFeatures = 100000, 
    cValue = 0.5, 
    vectorizerType = "Count"
)

Vectorizer type=Count, C=0.0005, n-gram=(1, 2)
DEV 0.8868
TRAIN 0.93355

Vectorizer type=Count, C=0.005, n-gram=(1, 2)
DEV 0.8994
TRAIN 0.99515

Vectorizer type=Count, C=0.05, n-gram=(1, 2)
DEV 0.8916
TRAIN 1.0

Vectorizer type=Count, C=0.5, n-gram=(1, 2)
DEV 0.8898
TRAIN 1.0



### CountVectorizer n-gram (1-3)

In [67]:
createClassifier(
    ngramRange = (1,3), 
    maxFeatures = 100000, 
    cValue = 0.0005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,3), 
    maxFeatures = 100000, 
    cValue = 0.005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,3), 
    maxFeatures = 100000, 
    cValue = 0.05, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (1,3), 
    maxFeatures = 100000, 
    cValue = 0.5, 
    vectorizerType = "Count"
)

Vectorizer type=Count, C=0.0005, n-gram=(1, 3)
DEV 0.888
TRAIN 0.9396

Vectorizer type=Count, C=0.005, n-gram=(1, 3)
DEV 0.8998
TRAIN 0.997

Vectorizer type=Count, C=0.05, n-gram=(1, 3)
DEV 0.8912
TRAIN 1.0

Vectorizer type=Count, C=0.5, n-gram=(1, 3)
DEV 0.8876
TRAIN 1.0



### CountVectorizer n-gram (2-3)

In [68]:
createClassifier(
    ngramRange = (2,3), 
    maxFeatures = 100000, 
    cValue = 0.0005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (2,3), 
    maxFeatures = 100000, 
    cValue = 0.005, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (2,3), 
    maxFeatures = 100000, 
    cValue = 0.05, 
    vectorizerType = "Count"
)
createClassifier(
    ngramRange = (2,3), 
    maxFeatures = 100000, 
    cValue = 0.5, 
    vectorizerType = "Count"
)

Vectorizer type=Count, C=0.0005, n-gram=(2, 3)
DEV 0.8574
TRAIN 0.9192

Vectorizer type=Count, C=0.005, n-gram=(2, 3)
DEV 0.8752
TRAIN 0.9945

Vectorizer type=Count, C=0.05, n-gram=(2, 3)
DEV 0.8688
TRAIN 1.0

Vectorizer type=Count, C=0.5, n-gram=(2, 3)
DEV 0.8622
TRAIN 1.0



### Results of different n-grams

When the n-grams are increased the C Value does not need to be that big. The Classifiers results are almost optimal at a value of 0.0005. The n-grams become very fast overfitted to the  train data if the C value is too high. I'd say the most optimal was an n-gram range of (1,2) and the C value as 0.005. 

## Do these n-grams show up also in the list of most significant positive/negative features?

Lets create a mehtod for showing the most significant features. After that we can look at the n-gram ranges of (1,2) and (2,3) since (1,1) will only consist of 1 lenght features.

### Create method for showing significant Features based on the classifier

In [111]:
def showSignificantFeatures(maxFeatures, ngramRange, cValue):
    
    vectorizer = CountVectorizer(max_features = maxFeatures, binary=True, ngram_range = ngramRange)
    feature_matrix_train = vectorizer.fit_transform(train_texts)
    feature_matrix_dev = vectorizer.transform(dev_texts)
    
    classifier = sklearn.svm.LinearSVC(C =  cValue, verbose = 0)
    classifier.fit(feature_matrix_train, train_labels)
    
    index2feature={}
    for feature,idx in vectorizer.vocabulary_.items():
        assert idx not in index2feature #This really should hold
        index2feature[idx]=feature
    
    indices=numpy.argsort(classifier.coef_[0])
    print(indices)
    for idx in indices[:30]:
        print(index2feature[idx])
    print("-------------------------------")
    for idx in indices[::-1][:30]:
        print(index2feature[idx])


### N-gram range (1,2)

In [113]:
print("N-gram (1,2)")
showSignificantFeatures(
    ngramRange = (1,2), 
    maxFeatures = 100000,
    cValue = 0.005
)

N-gram (1,2)
[98580  9852 13424 ... 34446 63258 27178]
worst
awful
boring
waste
terrible
disappointing
dull
disappointment
bad
poorly
the worst
poor
unfortunately
lacks
horrible
fails
worse
mess
stupid
avoid
ridiculous
not good
lame
badly
oh
save
unfunny
waste of
than this
laughable
-------------------------------
excellent
perfect
great
amazing
enjoyable
superb
wonderful
loved
today
must see
rare
bit
fun
incredible
very good
refreshing
fantastic
gem
better than
wonderfully
the best
well worth
liked
highly
subtle
enjoyed
beautiful
is great
pretty good
fascinating


### N-gram range (2,3)

In [114]:
print("N-gram (2,3)")
showSignificantFeatures(
    ngramRange = (2,3), 
    maxFeatures = 100000,
    cValue = 0.005
)

N-gram (2,3)
[81564 93017 53601 ... 75151 39589 52018]
the worst
waste of
not even
than this
of the worst
at best
not good
bad acting
fails to
boring and
unless you
at all
not worth
bad movie
so bad
your time
none of
supposed to
is awful
is terrible
worst movie
might have
very bad
better to
bad and
how bad
avoid this
not funny
is bad
looks like
-------------------------------
must see
is great
the best
loved it
well worth
was great
highly recommended
my favorite
great movie
loved this
is amazing
is excellent
very good
highly recommend
enjoyed this
very well
10 10
great job
enjoyed it
love this
an excellent
on dvd
definitely worth
it great
of the best
was excellent
fun and
is perfect
is wonderful
love it


### The most significant n-grams

It seems to be that the longer the featue, the more unique and more rare it is. If we look at the n-gram range of (1,2), then the most significant features have some words that are 2 words long. The majority of the features are still of one length. 

If we look at the n-gram range (1,3), then this proves also our point. There are some features that consist of 3 words, but most of the features are 2 words. The most common 3 words consists of the words "of the worst/best". 

It is rare that longer sentences would occur several times in normal text or literature. 

## 3. Train an SVM classifier for language recognition

**In the data package for the course [http://dl.turkunlp.org/intro-to-nlp.tar.gz](http://dl.turkunlp.org/intro-to-nlp.tar.gz), the directory language_identification contains data for 5 languages. Based on this data, train an SVM classifier for language recognition between these 5 languages.**

### Create methods for reading in the data

In [129]:
def openTextFile(lang, section):
    path = "./language-identification/{0}_{1}.txt".format(lang, section)
    textList = []
    
    with open(path) as f:
        for line in f:
            obj = {'lang': lang,
               'text': line.strip()
            }
            textList.append(obj)
            
    return textList
            

In [158]:
def separateLabelText(data):
    texts=[d["text"] for d in data]
    labels=[d["lang"] for d in data]
    
    return texts, labels

In [159]:
def loadLangFiles(languages):
    trainData = []
    develData = []
    testData = []
    
    for lang in languages:
        trainData.extend(openTextFile(lang, 'train'))
        develData.extend(openTextFile(lang, 'devel'))
        testData.extend(openTextFile(lang, 'test'))

    # Shuffle Data
    random.seed(10) # Seed to replicate same scenario for development
    random.shuffle(trainData) # Shuffle data 
    random.shuffle(develData) # Shuffle data 
    random.shuffle(testData) # Shuffle data 
    
    # Separate labels from text
    trainTexts, trainLabels = separateLabelText(trainData)
    develTexts, develLabels = separateLabelText(develData)
    testTexts, testLabels = separateLabelText(testData)

        
    return trainTexts, trainLabels, develTexts, develLabels, testTexts, testLabels 

### Create feature matrix and classifier

In [164]:
languages = ['en', 'es', 'et', 'fi', 'pt']
# Read in the texts and labels
lang_text_train, lang_label_train, lang_text_dev, lang_label_dev, lang_text_test, lang_label_test = loadLangFiles(languages)

In [169]:
lang_vectorizer = TfidfVectorizer(max_features = 100000, binary=True, ngram_range = (1,1))
lang_feature_matrix_train = lang_vectorizer.fit_transform(lang_text_train)
lang_feature_matrix_dev = lang_vectorizer.transform(lang_text_dev)

lang_classifier = sklearn.svm.LinearSVC(C = 0.05, verbose = 0)
lang_classifier.fit(lang_feature_matrix_train, lang_label_train)

print("DEV", lang_classifier.score(lang_feature_matrix_dev, lang_label_dev))
print("TRAIN", lang_classifier.score(lang_feature_matrix_train, lang_label_train))

DEV 0.934
TRAIN 0.9878


### Try predicting languages

In [182]:
def predictLang(text, vectorizer, classifier):
    data = vectorizer.transform([text])
    
    prediction = classifier.predict(data)
    
    print(prediction)

In [183]:
predictLang("Today is a nice day.", lang_vectorizer, lang_classifier)
predictLang("Estas obras son realizadas con libros, álbumes de música o periódicos como soporte.", lang_vectorizer, lang_classifier)
predictLang("Publik teeb nii mitu korda ja on juba üles köetud.", lang_vectorizer, lang_classifier)
predictLang("Tänään on mahti päivä mennä kävelylle.", lang_vectorizer, lang_classifier)
predictLang("Conseguir um bom exclusivo pode significar a entrada de milhões de dólares em publicidade.", lang_vectorizer, lang_classifier)

['en']
['es']
['et']
['fi']
['pt']


## 4. Toy around with features, especially the ngram_range and analyzer parameters

**If you completed (3), toy around with features, especially the ngram_range and analyzer parameters, which allow you to test classification based on character ngrams of various lengths (not only word n-grams). Gain some insight in to the accuracy of the classifier with different features, and try to identify misclassified documents -why do you think they were misclassified?**