# Introduction to NLP: Assignment 2 

## Assignment on Text Classification and Sequence Labeling

### Description of Assignment 2

This assignment relates to the text classification and sequence labeling themes of the Introduction to NLP (courses Deskriptiv analytik / Machine learning for descriptive problems), and focuses on gaining practical, hands-on experience in building and training simple models for these tasks.

The assignment is handed in as a Jupyter notebook (or a PDF render thereof) containing the code used to solve the problem, output presenting the results, and, most importantly, notes that present the student's conclusions and answer the questions posed in the assignment.

**Assignment steps/Questions:**

1. Test sklearn’s TfidfVectorizer in place of CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value?

2. Test different lengths of n-grams in the CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value? Do these n-grams also show up in the list of most significant positive/negative features?

3. In the data package for the course [http://dl.turkunlp.org/intro-to-nlp.tar.gz](http://dl.turkunlp.org/intro-to-nlp.tar.gz), the directory language_identification contains data for 5 languages. Based on this data, train an SVM classifier for language recognition between these 5 languages.

4. If you completed (3), toy around with features, especially the ngram_range and analyzer parameters, which allow you to test classification based on character n-grams of various lengths (not only word n-grams). Gain some insight into the accuracy of the classifier with different features, and try to identify misclassified documents. Why do you think they were misclassified?

5. **BONUS** At universaldependencies.org, you will find datasets for a bunch of languages. These come in an easy-to-parse, well-documented format. Pick one language that interests you, and one treebank for that language, and try to build a POS tagger for this language. You can use the 4th column, “UPOS” (see [https://universaldependencies.org/format.html](https://universaldependencies.org/format.html)). Report on your findings. If you have extra time, try to experiment with various features and see if you can make your accuracy go up. You can check here [https://universaldependencies.org/conll18/results-upos.html](https://universaldependencies.org/conll18/results-upos.html) what the state of the art roughly is for your selected language and treebank. Did you come close?
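
For the bonus task, the first step is getting (word, tag) pairs out of a treebank file. Below is a minimal sketch, assuming the tab-separated ten-column CoNLL-U layout described on the UD format page. The function name `read_conllu` and the two-token sample are illustrative only and not part of the course material.

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, upos) pairs from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):           # comment / metadata line
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (ID like "1-2") and empty nodes (ID like "1.1")
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append((cols[1], cols[3]))  # FORM is column 2, UPOS is column 4
    if sentence:                           # input may not end with a blank line
        yield sentence


# Hypothetical sample in CoNLL-U layout (not taken from a real treebank)
sample = [
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
sentences = list(read_conllu(sample))
print(sentences)
```

With a real treebank, the same function could be fed an open file object instead of the `sample` list.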




### Import libraries

In [205]:
import json
import random
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import sklearn.svm
import sklearn.metrics
import numpy

### Read in IMDB data

In [64]:
with open("imdb_train.json") as f:
    data = json.load(f)
random.seed(10) # Seed to replicate same scenario for development
random.shuffle(data) # Shuffle data 

# Preview of data
print("class label:", data[0]["class"])
print("text:", data[0]["text"])

class label: neg
text: the single worst film i've ever seen in a theater. i saw this film at the austin film festival in 2004, and it blew my mind that this film was accepted to a festival. it was an interesting premise, and seemed like it could go somewhere, but just fell apart every time it tried to do anything. first of all, if you're going to do a musical, find someone with musical talent. the music consisted of cheesy piano playing that sounded like they were playing it on a stereo in the room they were filming. the lyrics were terribly written, and when they weren't obvious rhymes, they were groan-inducing rhymes that showed how far they were stretching to try to make this movie work. and you'd think you'd find people who could sing when making a musical, right? not in this case. luckily they were half talking/half singing in rhyme most of the time, but when they did sing it made me cringe. especially when they attempted to sing in harmony. and that just addresses the music. some

### Separate texts and labels 

In [65]:
# We need to gather the texts and labels into separate lists
texts=[d["text"] for d in data]
labels=[d["class"] for d in data]
print("Amount of texts:", len(texts))
print("Amount of labels", len(labels))
print()
for label, text in list(zip(labels, texts))[:10]:
    print(label, text[:50] + "...")

Amount of texts: 25000
Amount of labels 25000

neg the single worst film i've ever seen in a theater....
pos I think the reason for all the opinionated diarrhe...
neg This movie is horrible! It rivals \Ishtar\" in the...
neg This may not be the worst comedy of all time, but ...
pos I found this film to funny from the start. John Wa...
pos The problem is that the movie rode in on the coatt...
neg I was so looking forward to seeing this when it wa...
neg I actually saw this movie in the theater back in i...
neg blows my mind how this movie got made. i watched i...
neg Amateurish in the extreme. Camera work especially ...


## 1. Test sklearn’s TfidfVectorizer

**Test sklearn’s TfidfVectorizer in place of CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value?**

### Datasplit

In [130]:
train_texts, dev_texts, train_labels, dev_labels = train_test_split(texts, labels, test_size = 0.2)
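
One caveat about the split: `random.seed(10)` only seeds Python's `random` module, while `train_test_split` draws from NumPy's random state unless `random_state` is passed, so the split can differ between runs. A minimal sketch of a reproducible, stratified split on toy data follows. The toy lists, the helper name `split`, and the seed value are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the IMDB texts and labels (illustrative only)
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = ["pos", "neg"] * 5


def split(seed):
    """Split the toy data 80/20, keeping the class balance in both parts."""
    return train_test_split(
        toy_texts, toy_labels,
        test_size=0.2,
        random_state=seed,     # fixes the shuffle so reruns give the same split
        stratify=toy_labels,   # keeps the pos/neg ratio in train and dev
    )


first = split(10)
second = split(10)
print(first[1], first[3])      # dev texts and dev labels
print(first == second)         # same seed, same split
```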

### 1.1 CountVectorizer

In [131]:
vectorizer = CountVectorizer(max_features = 100000, binary=True, ngram_range = (1,1))
feature_matrix_train = vectorizer.fit_transform(train_texts)
feature_matrix_dev = vectorizer.transform(dev_texts)

In [132]:
print(feature_matrix_train.shape)
print(feature_matrix_dev.shape)

(20000, 68271)
(5000, 68271)


In [133]:
classifier = sklearn.svm.LinearSVC(C = 0.0005, verbose = 1)
classifier.fit(feature_matrix_train, train_labels)

[LibLinear]

LinearSVC(C=0.0005, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=1)

In [134]:
print("DEV", classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN", classifier.score(feature_matrix_train, train_labels))

DEV 0.8682
TRAIN 0.8942


### 1.2 TfidfVectorizer

In [135]:
tfidf_vectorizer = TfidfVectorizer(max_features = 100000, binary=True, ngram_range = (1,1))
tfidf_feature_matrix_train = tfidf_vectorizer.fit_transform(train_texts)
tfidf_feature_matrix_dev = tfidf_vectorizer.transform(dev_texts)

In [136]:
print(tfidf_feature_matrix_train.shape)
print(tfidf_feature_matrix_dev.shape)

(20000, 68271)
(5000, 68271)


In [173]:
tfidf_classifier = sklearn.svm.LinearSVC(C = 0.0005, verbose = 1)
tfidf_classifier.fit(tfidf_feature_matrix_train, train_labels)

[LibLinear]

LinearSVC(C=0.0005, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=1)

In [174]:
print("DEV", tfidf_classifier.score(tfidf_feature_matrix_dev, dev_labels))
print("TRAIN", tfidf_classifier.score(tfidf_feature_matrix_train, train_labels))

DEV 0.8344
TRAIN 0.84555


**When comparing the CountVectorizer and TfidfVectorizer,**

the first thing I noticed was that the classification results were fairly similar. With C = 0.0005, the TfidfVectorizer results were lower from the start (dev 0.8344 vs. 0.8682 for the CountVectorizer). When increasing the C value on both, the TfidfVectorizer results improved on both dev and train: dev accuracy increased from 0.835 to 0.89 and train accuracy from 0.846 to 0.986. The optimal C value is not settled yet (around ???); it is probably where dev accuracy peaks while dev and train accuracy are still close.
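
One plausible reason the TF-IDF model wants a larger C: `TfidfVectorizer` L2-normalizes each document vector by default, so its feature values are much smaller than the 0/1 values of a binary `CountVectorizer`, and the same C regularizes more strongly. The sweep below is a minimal sketch of how both vectorizers could be compared over several C values. It runs on a toy corpus rather than the IMDB data, and `sweep_c` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the IMDB train/dev data (illustrative only)
toy_train = ["great film loved it", "wonderful acting great story",
             "terrible film hated it", "awful acting boring story"] * 5
toy_train_labels = ["pos", "pos", "neg", "neg"] * 5
toy_dev = ["great story", "boring film awful"]
toy_dev_labels = ["pos", "neg"]


def sweep_c(vectorizer, c_values):
    """Return {C: (dev_accuracy, train_accuracy)} for one vectorizer."""
    x_train = vectorizer.fit_transform(toy_train)
    x_dev = vectorizer.transform(toy_dev)
    results = {}
    for c in c_values:
        clf = LinearSVC(C=c).fit(x_train, toy_train_labels)
        results[c] = (clf.score(x_dev, toy_dev_labels),
                      clf.score(x_train, toy_train_labels))
    return results


c_values = [0.0005, 0.005, 0.05, 0.5]
all_results = {
    "count": sweep_c(CountVectorizer(binary=True), c_values),
    "tfidf": sweep_c(TfidfVectorizer(binary=True), c_values),
}
for name, results in all_results.items():
    for c, (dev_acc, train_acc) in results.items():
        print(f"{name:5s} C={c:<6} DEV {dev_acc:.3f} TRAIN {train_acc:.3f}")
```

On the real data, the toy lists would be replaced by `train_texts`, `train_labels`, `dev_texts` and `dev_labels`.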

## 2. Test different lengths of n-grams in the CountVectorizer on the IMDB data

**Test different lengths of n-grams in the CountVectorizer on the IMDB data. Do you see any difference in the classification results or the optimal C value? Do these n-grams also show up in the list of most significant positive/negative features?**


Working hypothesis (not yet verified): the useful limit seems to be n-grams of length 2-3; anything longer becomes too unique to be helpful.
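
The idea that longer n-grams become too unique can be checked by measuring how many distinct n-grams occur only once in a corpus. Below is a minimal pure-Python sketch on a toy corpus with whitespace tokenization. `ngrams` and `singleton_share` are hypothetical helper names, and the three toy sentences are not from the IMDB data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all word n-grams of length n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def singleton_share(docs, n):
    """Share of distinct n-grams that occur exactly once across all docs."""
    counts = Counter(g for doc in docs for g in ngrams(doc.lower().split(), n))
    if not counts:
        return 0.0
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)


toy_docs = [
    "the worst film ever",
    "the worst acting ever",
    "one of the worst films",
]
for n in (1, 2, 3):
    print(n, singleton_share(toy_docs, n))
```

On this toy corpus the share of one-off n-grams grows with n (0.625, 0.875, 1.0), which is the sparsity effect the hypothesis describes. Whether the same holds on the IMDB data would need to be checked there.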


In [239]:
vectorizer = CountVectorizer(max_features = 10000000, binary=True, ngram_range = (1,2))
feature_matrix_train = vectorizer.fit_transform(train_texts)
feature_matrix_dev = vectorizer.transform(dev_texts)

In [240]:
classifier = sklearn.svm.LinearSVC(C = 0.5, verbose = 1)
classifier.fit(feature_matrix_train, train_labels)

[LibLinear]



LinearSVC(C=0.5, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=1)

In [241]:
print("DEV", classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN", classifier.score(feature_matrix_train, train_labels))

DEV 0.893
TRAIN 1.0


Start values:

| Ngram range | C | DEV | TRAIN |
|---|---|---|---|
| (1,1) | 0.0005 | 0.8682 | 0.8942 |
| (1,1) | 0.005 | 0.8816 | 0.95685 |
| (1,1) | 0.05 | 0.873 | 0.9962 |
| (1,1) | 0.5 | 0.8582 | 1.0 |
| (1,2) | 0.0005 | 0.8854 | 0.95205 |
| (1,2) | 0.005 | 0.8978 | 0.9989 |
| (1,2) | 0.05 | 0.8944 | 1.0 |
| (1,2) | 0.5 | 0.893 | 1.0 |
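
Reading the recorded numbers programmatically: for both n-gram ranges the dev accuracy peaks at C = 0.005 and drops again for larger C while train accuracy approaches 1.0, which suggests overfitting at higher C. The sketch below just picks the C with the highest dev accuracy from the values recorded above. `recorded` and `best_c` are illustrative names.

```python
# (ngram_range, C) -> (dev accuracy, train accuracy), copied from the recorded runs
recorded = {
    ((1, 1), 0.0005): (0.8682, 0.8942),
    ((1, 1), 0.005): (0.8816, 0.95685),
    ((1, 1), 0.05): (0.873, 0.9962),
    ((1, 1), 0.5): (0.8582, 1.0),
    ((1, 2), 0.0005): (0.8854, 0.95205),
    ((1, 2), 0.005): (0.8978, 0.9989),
    ((1, 2), 0.05): (0.8944, 1.0),
    ((1, 2), 0.5): (0.893, 1.0),
}


def best_c(ngram_range):
    """Return the C value with the highest dev accuracy for one n-gram range."""
    dev_by_c = {c: dev for (rng, c), (dev, _) in recorded.items() if rng == ngram_range}
    return max(dev_by_c, key=dev_by_c.get)


for rng in [(1, 1), (1, 2)]:
    c = best_c(rng)
    print(rng, "best C:", c, "DEV:", recorded[(rng, c)][0])
```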

In [242]:
print(list(vectorizer.vocabulary_.items())[:30],"...")

[('corridors', 260929), ('of', 779454), ('time', 1144551), ('the', 1098818), ('movie', 731701), ('you', 1287220), ('can', 195974), ('watch', 1228084), ('if', 548648), ('re', 904983), ('looking', 665997), ('for', 423303), ('sophisticated', 1026469), ('way', 1231077), ('suicide', 1064988), ('some', 1018977), ('use', 1198707), ('guns', 483478), ('ropes', 942565), ('or', 809802), ('gas', 450621), ('but', 182754), ('want', 1219764), ('to', 1148742), ('ruin', 945156), ('your', 1289974), ('brains', 169550), ('do', 318558), ('not', 767443), ('wait', 1217905)] ...


In [243]:
#Reverse the dictionary
index2feature={}
for feature,idx in vectorizer.vocabulary_.items():
    assert idx not in index2feature #This really should hold
    index2feature[idx]=feature
#Now we can query index2feature to get the feature names as we need

In [244]:
indices=numpy.argsort(classifier.coef_[0])
print(indices)
for idx in indices[:30]:
    print(index2feature[idx])
print("-------------------------------")
for idx in indices[::-1][:30]: #you can also do it the other way round, reverse, then pick
    print(index2feature[idx])

[1277371  113994  164164 ... 1066919  843665  375235]
worst
awful
boring
terrible
waste
poor
dull
poorly
the worst
bad
disappointment
horrible
disappointing
not worth
weak
mess
than this
oh
laughable
save
unfortunately
lacks
badly
worse
lame
avoid
not good
not even
nothing
ridiculous
-------------------------------
excellent
perfect
superb
great
enjoyable
amazing
wonderful
well worth
gem
better than
today
incredible
rare
definitely worth
brilliant
must see
fantastic
enjoyed
job
masterpiece
fun
the best
refreshing
10 10
fascinating
enjoyed it
so well
to all
tears
not bad


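Some bigram features showed up among the top 30 features for ngram range (1,2): 10 of the 30 positive features and 5 of the 30 negative features.

With range (1,3), no trigrams showed up among the top features; the chance of the same three words appearing consecutively in many reviews is low.

With range (2,3), one trigram, "of the worst", appeared among the negative features and none among the positive features.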
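
The bigram counts for range (1,2) can be reproduced from the printed feature lists instead of counting by hand. This is a small sketch; `count_multiword` is a hypothetical helper, and the two lists are copied from the output above. In the live notebook they could be built from `index2feature` and `indices` instead.

```python
def count_multiword(features):
    """Count features that consist of more than one word (contain a space)."""
    return sum(1 for f in features if " " in f)


# Top-30 lists copied from the ngram_range=(1,2) output above
top_negative = [
    "worst", "awful", "boring", "terrible", "waste", "poor", "dull", "poorly",
    "the worst", "bad", "disappointment", "horrible", "disappointing",
    "not worth", "weak", "mess", "than this", "oh", "laughable", "save",
    "unfortunately", "lacks", "badly", "worse", "lame", "avoid", "not good",
    "not even", "nothing", "ridiculous",
]
top_positive = [
    "excellent", "perfect", "superb", "great", "enjoyable", "amazing",
    "wonderful", "well worth", "gem", "better than", "today", "incredible",
    "rare", "definitely worth", "brilliant", "must see", "fantastic",
    "enjoyed", "job", "masterpiece", "fun", "the best", "refreshing", "10 10",
    "fascinating", "enjoyed it", "so well", "to all", "tears", "not bad",
]

print("negative bigrams:", count_multiword(top_negative), "of", len(top_negative))
print("positive bigrams:", count_multiword(top_positive), "of", len(top_positive))
```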


worst
waste
poorly
lousy
laughable
awful
disappointment
refer
etta
boring
unfunny
terrible
disappointing
lacks
avoid
dreadful
mess
wooden
stupidity
programming
save
miscast
dysfunctional
tunnel
guilty
outer
fails
skip
hearts
extremelly
-------------------------------
scariest
refreshing
unexpecting
unpretentious
hooked
carrey
waited
relax
superb
delightful
excellent
perfect
jolie
units
enjoyable
fez
steals
eliminate
tears
freedom
goof
definitive
sublime
slide
underrated
thingy
judged
shines
angelina
dirty

## 3. Train an SVM classifier for language recognition

In the data package for the course [http://dl.turkunlp.org/intro-to-nlp.tar.gz](http://dl.turkunlp.org/intro-to-nlp.tar.gz), the directory language_identification contains data for 5 languages. Based on this data, train an SVM classifier for language recognition between these 5 languages.

In [272]:
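def openTextFile(path):
    """Read one language file and return its lines in shuffled order."""
    # Assumption: the data files are UTF-8 encoded
    with open(path, encoding="utf-8") as f:
        languageList = list(f)  # one document per line (trailing newline kept)

    random.seed(10) # Seed to replicate same scenario for development
    random.shuffle(languageList) # Shuffle data 

    return languageList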

In [295]:
languageLabels = ['en', 'es', 'et', 'fi', 'pt']

def loadSplit(split):
    """Read all five language files for one split ("train", "devel" or "test")."""
    texts = []
    labels = []
    for lang in languageLabels:
        lines = openTextFile(f"./language_identification/{lang}_{split}.txt")
        texts.extend(lines)
        labels.extend([lang] * len(lines))  # one label per line of text
    return texts, labels

# Read in train, dev and test txt
lang_text_train, lang_label_train = loadSplit("train")
lang_text_dev, lang_label_dev = loadSplit("devel")
lang_text_test, lang_label_test = loadSplit("test")



#### Feature matrix and SVM training

In [304]:
lang_vectorizer = TfidfVectorizer(max_features = 100000, binary=True, ngram_range = (1,1))
lang_feature_matrix_train = lang_vectorizer.fit_transform(lang_text_train)
lang_feature_matrix_dev = lang_vectorizer.transform(lang_text_dev)

In [305]:
print(lang_feature_matrix_train.shape)
print(lang_feature_matrix_dev.shape)

(5000, 28620)
(5000, 28620)


In [318]:
lang_classifier = sklearn.svm.LinearSVC(C = 0.5, verbose = 1)
lang_classifier.fit(lang_feature_matrix_train, lang_label_train)

[LibLinear]

LinearSVC(C=0.5, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=1)

In [319]:
print("DEV", lang_classifier.score(lang_feature_matrix_dev, lang_label_dev))
print("TRAIN", lang_classifier.score(lang_feature_matrix_train, lang_label_train))

DEV 0.9448
TRAIN 0.9996
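
As a possible next step toward question 4, the sketch below switches to character n-grams (`analyzer="char_wb"`) and uses a confusion matrix to find misclassified documents. It runs on a tiny made-up three-language corpus rather than the course data, so its scores say nothing about the real task. All `toy_*` names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC

# Tiny made-up corpus (illustrative only, not the course data)
toy_train_texts = [
    "the cat sat on the mat", "this is a good movie",
    "kissa istuu matolla", "tämä on hyvä elokuva",
    "el gato está en la alfombra", "esta es una buena película",
]
toy_train_labels = ["en", "en", "fi", "fi", "es", "es"]
toy_dev_texts = ["the movie is good", "elokuva on hyvä", "la película es buena"]
toy_dev_labels = ["en", "fi", "es"]
label_order = ["en", "es", "fi"]

# Character n-grams of length 1-3, built only from text inside word boundaries
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
x_train = char_vectorizer.fit_transform(toy_train_texts)
x_dev = char_vectorizer.transform(toy_dev_texts)

clf = LinearSVC(C=0.5).fit(x_train, toy_train_labels)
predicted = clf.predict(x_dev)

# Rows are gold labels, columns are predicted labels, both in label_order
cm = confusion_matrix(toy_dev_labels, predicted, labels=label_order)
print(cm)

# Collect the documents the classifier got wrong for manual inspection
misclassified = [
    (gold, pred, text)
    for gold, pred, text in zip(toy_dev_labels, predicted, toy_dev_texts)
    if gold != pred
]
for gold, pred, text in misclassified:
    print(f"gold={gold} predicted={pred}: {text}")
```

With the course data, the same pattern would apply to `lang_text_dev` and `lang_label_dev`, and the final check could use the so far unused `lang_text_test` split.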
