Nov 19 2019

# Classifiers with scikit

In this homework, the task is to build a classifier that learns the conjugation/declension types of verbs and nouns from examples. The focus is on designing a good set of linguistically informed features. Which classifier you use is up to you; any of the standard ones will work: Naive Bayes, perceptron, SVM, logistic regression (MaxEnt), k-NN, or even decision lists. You are free to use the classifiers in scikit-learn or the ones you've already built for other homeworks, and you may choose either a kernelized or a linear classifier. A simple, straightforward choice is the linear SVM, available as LinearSVC in scikit-learn.

Classification

The task consists of three parts:

(1) Split the data into at least train/test, or train/dev/test if you need a dev set, in some reasonable way (e.g. 90/10 or 80/10/10). You are not given a split and should design and implement this random split yourself. Randomizing matters because the first classes (0, 1, 2, ...) are very large and the last ones very small.

In [1]:
# Data consists of:
# * German nouns (de_n.txt), German verbs (de_v.txt)
# * Finnish nouns/adjectives (fi_na.txt), Finnish verbs (fi_v.txt)
# * Spanish verbs (es_v.txt)

# I will divide each into a training set (80%) and a test set (20%)
# X = features, y = classes

from sklearn.feature_extraction import DictVectorizer
from sklearn import svm 
import random

# Read in data 'y\tword' and convert to [(y, word1),(y, word2),...]
def makeLists(fileName):
    list1 = []
    with open(fileName) as f: # the file is closed automatically after reading
        for line in f:
            values = line.strip().split('\t')
            list1.append((values[0],values[1]))
    return list1

values_de_n = makeLists('de_n.txt') # German nouns (de_n.txt) 
values_de_v = makeLists('de_v.txt') # German verbs (de_v.txt) 
values_fi_na = makeLists('fi_na.txt') # Finnish nouns/adjectives (fi_na.txt) 
values_fi_v = makeLists('fi_v.txt') # Finnish verbs (fi_v.txt) 
values_es_v = makeLists('es_v.txt') # Spanish verbs (es_v.txt) 
print(values_es_v[0:3]) 

# The number of conjugation/declension classes is as follows:
#de_n: 70
#de_v: 140
#fi_na: 258
#fi_v: 282
#es_v: 97

# split data randomly 
random.shuffle(values_de_n)
random.shuffle(values_de_v)
random.shuffle(values_fi_na)
random.shuffle(values_fi_v)
random.shuffle(values_es_v)
print(values_es_v[0:3]) # check if shuffled

def splitData(inputList, percentTest):
    testNum = round(len(inputList) * percentTest)
    trainNum = len(inputList) - testNum
    trainingData = inputList[0:trainNum]
    testData = inputList[trainNum:]
    return trainingData, testData

de_n_train, de_n_test = splitData(values_de_n, 0.2)
de_v_train, de_v_test = splitData(values_de_v, 0.2)
fi_na_train, fi_na_test = splitData(values_fi_na, 0.2)
fi_v_train, fi_v_test = splitData(values_fi_v, 0.2)
es_v_train, es_v_test = splitData(values_es_v, 0.2)

print('Training set German nouns:', len(de_n_train))
print('Test set German nouns:',len(de_n_test))
print('Proportion test/training German nouns:', len(de_n_test) / len(de_n_train))

[('0', 'vomitar'), ('0', 'eructar'), ('0', 'tantear')]
[('0', 'empachar'), ('4', 'redimir'), ('12', 'acentuar')]
Training set German nouns: 2051
Test set German nouns: 513
Proportion test/training German nouns: 0.250121891760117
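
The assignment also allows a train/dev/test split. Below is a minimal sketch of a reproducible 80/10/10 split, assuming the same [(class, word)] lists as above. The function name `split_train_dev_test` and its `seed` parameter are illustrative and not part of the original notebook.

```python
import random

def split_train_dev_test(pairs, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of `pairs` and cut it into train/dev/test lists."""
    shuffled = list(pairs)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, so the split is repeatable
    n_test = round(len(shuffled) * test_frac)
    n_dev = round(len(shuffled) * dev_frac)
    n_train = len(shuffled) - n_dev - n_test
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test

# Toy (class, word) pairs standing in for the contents of one data file.
toy = [(str(i % 3), 'word%d' % i) for i in range(50)]
train, dev, test = split_train_dev_test(toy)
print(len(train), len(dev), len(test))
```

Fixing the seed makes accuracies comparable between runs; the notebook's own `random.shuffle` calls produce a different split each time.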


(2) Convert each word into a feature representation of your design.

In [12]:
# Features: prefixes (lengths 1-4), suffixes (lengths 1-4), vowel presence, character bigrams and trigrams

def hasVowel(inputVowel, inputWord):
    if inputVowel in inputWord:
        return 'yes'
    else:
        return 'no'

def extractFeatures(inputWord):
    featuresVec = []
    bigrams = [b[0]+b[1] for b in zip(inputWord,inputWord[1:])]
    trigrams = [t[0]+t[1]+t[2] for t in zip(inputWord,inputWord[1:], inputWord[2:])]
    f1 = 'pfx1=' + inputWord[0] #prefix
    f2 = 'pfx2=' + inputWord[0:2]
    f3 = 'pfx3=' + inputWord[0:3]
    f4 = 'pfx4=' + inputWord[0:4]
    f5 = 'sfx1=' + inputWord[-1] #suffix
    f6 = 'sfx2=' + inputWord[-2:]
    f7 = 'sfx3=' + inputWord[-3:]
    f8 = 'sfx4=' + inputWord[-4:]
    f9 = 'hasa=' + hasVowel('a', inputWord) # check vowel inventory !!
    f10 = 'hase=' + hasVowel('e', inputWord)
    f11 = 'hasi=' + hasVowel('i', inputWord)
    f12 = 'haso=' + hasVowel('o', inputWord)
    f13 = 'hasu=' + hasVowel('u', inputWord)
    f14 = 'hasy=' + hasVowel('y', inputWord)
    f15 = 'hasä=' + hasVowel('ä', inputWord)
    f16 = 'hasö=' + hasVowel('ö', inputWord)
    f17 = 'hasü=' + hasVowel('ü', inputWord)
    featuresVec = [f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, f15, f16, f17] + bigrams + trigrams
    return featuresVec

#test1 = extractFeatures('oír')
#print(test1)

# I now have a training set (~80%) and a test set (~20%) for each of the 5 groups in format [(class, word)]
# AND a function for generating features: extractFeatures(inputWord)
# de_n_train, de_n_test 
# de_v_train, de_v_test
# fi_na_train, fi_na_test 
# fi_v_train, fi_v_test 
# es_v_train, es_v_test 

def expandFeaturesList(inputList):
    newList = []
    for item in inputList: # [(class, word)]
        dict1 = {}
        class1 = item[0]
        newFeatures = extractFeatures(item[1])
        for feature in newFeatures:
            dict1[feature] = 1
        newList.append((class1, dict1)) # [(class1, {feature: 1}), (class2, {feature: 1})]
    return newList # list of tuples (class, {feature: 1})

de_n_train_expanded = expandFeaturesList(de_n_train) # German nouns
print('Training set:', de_n_train_expanded[0:3])
print('Number tuples:', len(de_n_train_expanded))
print('Size each dict:', len(de_n_train_expanded[0][1]))
print('Total feature entries:', sum(len(d) for _, d in de_n_train_expanded)) # dicts differ in size, so sum them

de_v_train_expanded = expandFeaturesList(de_v_train) # German verbs
fi_na_train_expanded = expandFeaturesList(fi_na_train) # Finnish nouns + adj
fi_v_train_expanded = expandFeaturesList(fi_v_train) # Finnish verbs
es_v_train_expanded = expandFeaturesList(es_v_train) # Spanish verbs

Training set: [('7', {'pfx1=K': 1, 'pfx2=Kl': 1, 'pfx3=Kle': 1, 'pfx4=Klem': 1, 'sfx1=n': 1, 'sfx2=in': 1, 'sfx3=rin': 1, 'sfx4=erin': 1, 'hasa=no': 1, 'hase=yes': 1, 'hasi=yes': 1, 'haso=no': 1, 'hasu=no': 1, 'hasy=no': 1, 'hasä=no': 1, 'hasö=no': 1, 'hasü=no': 1, 'Kl': 1, 'le': 1, 'em': 1, 'mp': 1, 'pn': 1, 'ne': 1, 'er': 1, 'ri': 1, 'in': 1, 'Kle': 1, 'lem': 1, 'emp': 1, 'mpn': 1, 'pne': 1, 'ner': 1, 'eri': 1, 'rin': 1}), ('1', {'pfx1=B': 1, 'pfx2=Bu': 1, 'pfx3=Buc': 1, 'pfx4=Buch': 1, 'sfx1=e': 1, 'sfx2=he': 1, 'sfx3=che': 1, 'sfx4=uche': 1, 'hasa=no': 1, 'hase=yes': 1, 'hasi=no': 1, 'haso=no': 1, 'hasu=yes': 1, 'hasy=no': 1, 'hasä=no': 1, 'hasö=no': 1, 'hasü=no': 1, 'Bu': 1, 'uc': 1, 'ch': 1, 'he': 1, 'Buc': 1, 'uch': 1, 'che': 1}), ('8', {'pfx1=S': 1, 'pfx2=So': 1, 'pfx3=Soz': 1, 'pfx4=Sozi': 1, 'sfx1=e': 1, 'sfx2=ge': 1, 'sfx3=oge': 1, 'sfx4=loge': 1, 'hasa=no': 1, 'hase=yes': 1, 'hasi=yes': 1, 'haso=yes': 1, 'hasu=no': 1, 'hasy=no': 1, 'hasä=no': 1, 'hasö=no': 1, 'hasü=no': 1, 

All features have a value of 1 in the dictionary.
The classes are kept in a separate vector.

For evaluation, the argument should be a word (a string) and inside the function, that word can be expanded with the features I defined above. Then I can predict the class.

In [16]:
# Now to create X and y vectors for training for *German nouns*
de_n_y_train = [item[0] for item in de_n_train_expanded] #classes
de_n_X_train = [item[1] for item in de_n_train_expanded] #features present [{f1:1,f2:1},{f1:1,f2:1}]
print('Classes:', de_n_y_train[30:40])
print('Len classes vector:', len(de_n_y_train))
print('Len features vector:', len(de_n_X_train))
print('Features[0]:', de_n_X_train[0])
print('Features[1]:', de_n_X_train[1])

# Each feature gets an index --> I should fit() and transform() the X vector separately, 
# so I can use the vectorizer later for prediction (otherwise it'll say I don't have enough features for one instance)
de_n_vectorizer = DictVectorizer(sparse = True).fit(de_n_X_train)
de_n_X = de_n_vectorizer.transform(de_n_X_train)
#print('After vector transformation for de_n:\n', de_n_X[0])

# German verbs
de_v_y_train = [item[0] for item in de_v_train_expanded] #classes
de_v_X_train = [item[1] for item in de_v_train_expanded]
de_v_vectorizer = DictVectorizer(sparse = True).fit(de_v_X_train)
de_v_X = de_v_vectorizer.transform(de_v_X_train) # de_v_X and de_v_y_train

# Finnish nouns + adj
fi_na_y_train = [item[0] for item in fi_na_train_expanded] #classes
fi_na_X_train = [item[1] for item in fi_na_train_expanded]
fi_na_vectorizer = DictVectorizer(sparse = True).fit(fi_na_X_train)
fi_na_X = fi_na_vectorizer.transform(fi_na_X_train) # fi_na_X and fi_na_y_train

# Finnish verbs
fi_v_y_train = [item[0] for item in fi_v_train_expanded] #classes
fi_v_X_train = [item[1] for item in fi_v_train_expanded]
fi_v_vectorizer = DictVectorizer(sparse = True).fit(fi_v_X_train)
fi_v_X = fi_v_vectorizer.transform(fi_v_X_train) # fi_v_X and fi_v_y_train

# Spanish verbs
es_v_y_train = [item[0] for item in es_v_train_expanded] #classes
es_v_X_train = [item[1] for item in es_v_train_expanded]
es_v_vectorizer = DictVectorizer(sparse = True).fit(es_v_X_train)
es_v_X = es_v_vectorizer.transform(es_v_X_train) # es_v_X and es_v_y_train

Classes: ['1', '0', '37', '0', '7', '0', '4', '4', '0', '4']
Len classes vector: 2051
Len features vector: 2051
Features[0]: {'pfx1=K': 1, 'pfx2=Kl': 1, 'pfx3=Kle': 1, 'pfx4=Klem': 1, 'sfx1=n': 1, 'sfx2=in': 1, 'sfx3=rin': 1, 'sfx4=erin': 1, 'hasa=no': 1, 'hase=yes': 1, 'hasi=yes': 1, 'haso=no': 1, 'hasu=no': 1, 'hasy=no': 1, 'hasä=no': 1, 'hasö=no': 1, 'hasü=no': 1, 'Kl': 1, 'le': 1, 'em': 1, 'mp': 1, 'pn': 1, 'ne': 1, 'er': 1, 'ri': 1, 'in': 1, 'Kle': 1, 'lem': 1, 'emp': 1, 'mpn': 1, 'pne': 1, 'ner': 1, 'eri': 1, 'rin': 1}
Features[1]: {'pfx1=B': 1, 'pfx2=Bu': 1, 'pfx3=Buc': 1, 'pfx4=Buch': 1, 'sfx1=e': 1, 'sfx2=he': 1, 'sfx3=che': 1, 'sfx4=uche': 1, 'hasa=no': 1, 'hase=yes': 1, 'hasi=no': 1, 'haso=no': 1, 'hasu=yes': 1, 'hasy=no': 1, 'hasä=no': 1, 'hasö=no': 1, 'hasü=no': 1, 'Bu': 1, 'uc': 1, 'ch': 1, 'he': 1, 'Buc': 1, 'uch': 1, 'che': 1}


(3) Train a classifier and evaluate its performance (accuracy) using this feature representation. Report accuracies for each data set.

In [17]:
# train classifier --> clf.fit(X, y) *German nouns*
de_n_clf = svm.LinearSVC() # use Linear SVM
de_n_clf.fit(de_n_X, de_n_y_train)

# German verbs
de_v_clf = svm.LinearSVC() # use Linear SVM
de_v_clf.fit(de_v_X, de_v_y_train)

# Finnish nouns + adj
fi_na_clf = svm.LinearSVC() # use Linear SVM
fi_na_clf.fit(fi_na_X, fi_na_y_train)

# Finnish verbs
fi_v_clf = svm.LinearSVC() # use Linear SVM
fi_v_clf.fit(fi_v_X, fi_v_y_train)

# Spanish verbs
es_v_clf = svm.LinearSVC() # use Linear SVM
es_v_clf.fit(es_v_X, es_v_y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

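# Now to evaluate:
- This function accepts a list of tuples (class, word), a trained classifier, and the vectorizer fitted on the training data
1. Take the word and pass it to extractFeatures(word)
2. Create y (the gold class) and X (a {feature: 1} dictionary) for the word
3. Pass X through the vectorizer that was already fitted on the training data
   - X = vectorizer1.transform(featuresDict)
   - do not call fit_transform() here: refitting on test data would assign feature indices that differ from the ones the classifier was trained with
4. Then pass the vectorized X to the classifier: inputTrainedClassifier.predict(X)
5. Check if prediction matches gold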
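
The steps above can also be sketched as a batch version that vectorizes all test words at once and calls predict() a single time. This is a hedged sketch, not the notebook's own function: `evaluate_batch` is a hypothetical name, and it assumes only that the vectorizer has a `transform` method accepting a list of dictionaries and the classifier has a `predict` method, as DictVectorizer and LinearSVC do.

```python
def evaluate_batch(test_pairs, extract, vectorizer, classifier):
    """Return accuracy on [(gold_class, word), ...] using one predict() call."""
    gold = [label for label, _ in test_pairs]
    # Steps 1-2: one {feature: 1} dict per word.
    feature_dicts = [{f: 1 for f in extract(word)} for _, word in test_pairs]
    # Step 3: transform only; the vectorizer was already fitted on training data.
    X = vectorizer.transform(feature_dicts)
    # Step 4: one prediction per row of X.
    predicted = classifier.predict(X)
    # Step 5: compare each prediction with its gold class.
    num_correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return num_correct / len(gold) if gold else 0.0
```

With the notebook's variables, a call would look like `evaluate_batch(de_n_test, extractFeatures, de_n_vectorizer, de_n_clf)`, which should give the same accuracy as a per-word loop.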

In [22]:
# test set looks like: [(class, word), (class, word),...]
# I now have a trained classifier for each category
# de_n_test --> de_n_clf
# de_v_test --> de_v_clf
# fi_na_test --> fi_na_clf
# fi_v_test --> fi_v_clf
# es_v_test --> es_v_clf

def evaluate(inputListTuples, inputTrainedClassifier, vectorizer1): # input [(class, word), (class, word),...]
    numIncorrect = 0
    numTotal = len(inputListTuples)
    
    for element in inputListTuples: #(class, word)
        y = element[0]
        featuresList = extractFeatures(element[1]) #returns ['feature1','feature2', ...]
        featuresDict = {string:1 for string in featuresList} #returns {feature1:1, feature2:1, ...}
        # print('Features Dict:', featuresDict) #{'pfx1=S': 1, 'pfx2=Sc': 1,...}
        
        # transform features vec with vectorizer that has been previously fitted
        X = vectorizer1.transform([featuresDict]) # one-row matrix for this word
        #print(X) #This seems to work
        
        #prediction
        guess = inputTrainedClassifier.predict(X)[0] # predict() returns an array; take its single label
        
        if guess != y:
            numIncorrect += 1
    
    totalCorrect = numTotal - numIncorrect
    print("Number correct:", totalCorrect)
    print("Total guesses:", numTotal)
    accuracy = totalCorrect/numTotal
    return accuracy

#print('InputXVec:',de_n_test[0:3]) #[('0', 'Schmierfett'), ('1', 'Agoraphobie'), ('2', 'Mechaniker')]
#print('InputClassifier:', de_n_clf) #this works

de_n_accuracy = evaluate(de_n_test, de_n_clf, de_n_vectorizer)
print('German noun accuracy:', de_n_accuracy, '\n')

de_v_accuracy = evaluate(de_v_test, de_v_clf, de_v_vectorizer)
print('German verb accuracy:', de_v_accuracy, '\n')

es_v_accuracy = evaluate(es_v_test, es_v_clf, es_v_vectorizer)
print('Spanish verb accuracy:', es_v_accuracy, '\n')

fi_na_accuracy = evaluate(fi_na_test, fi_na_clf, fi_na_vectorizer)
print('Finnish noun and adjective accuracy:', fi_na_accuracy, '\n')

fi_v_accuracy = evaluate(fi_v_test, fi_v_clf, fi_v_vectorizer)
print('Finnish verb accuracy:', fi_v_accuracy, '\n')

Number correct: 375
Total guesses: 513
German noun accuracy: 0.7309941520467836 

Number correct: 294
Total guesses: 365
German verb accuracy: 0.8054794520547945 

Number correct: 723
Total guesses: 771
Spanish verb accuracy: 0.9377431906614786 

Number correct: 972
Total guesses: 1200
Finnish noun and adjective accuracy: 0.81 

Number correct: 1343
Total guesses: 1410
Finnish verb accuracy: 0.9524822695035461 
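
Because the first classes are much larger than the last ones, it can help to compare each accuracy above against a majority-class baseline. The following is a minimal sketch that is not part of the original notebook; `majority_baseline_accuracy` and the toy data are hypothetical and introduced here for illustration only.

```python
from collections import Counter

def majority_baseline_accuracy(train_pairs, test_pairs):
    """Accuracy of always guessing the most frequent training class."""
    class_counts = Counter(label for label, _ in train_pairs)
    majority_class, _ = class_counts.most_common(1)[0]
    hits = sum(1 for label, _ in test_pairs if label == majority_class)
    return hits / len(test_pairs)

# Toy data only; with the notebook's variables the call would be
# majority_baseline_accuracy(de_n_train, de_n_test).
toy_train = [('0', 'a'), ('0', 'b'), ('0', 'c'), ('1', 'd')]
toy_test = [('0', 'e'), ('1', 'f')]
print(majority_baseline_accuracy(toy_train, toy_test))
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies.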



What to hand in

You should hand in a Python/Jupyter file that works directly on the files you are given and reports the accuracy on each data set. I will assume I can run your code if a folder hw4data/ is in the same location. Your code should automatically split the files into train/test or train/dev/test. You should also include clear comments or a separate file that explains what features you decided to use, and what accuracies you obtained for each of the five data sets.

(NOTE 1: state-of-the-art for this task ranges between 80% for German nouns and 99% for Spanish verbs)

(NOTE 2: if you decide to use a perceptron, be prepared for the possibility that the data set isn't linearly separable. This will depend somewhat on what features you decide to use. For this reason it's a good idea to set a maximum number of iterations for perceptron learning, or use an averaged perceptron with early stopping.)
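
To illustrate NOTE 2, here is a minimal sketch of a multiclass averaged perceptron with a cap on the number of epochs. It is an assumption-laden toy, not a reference implementation: the names `train_averaged_perceptron` and `predict` are hypothetical, the averaging is done naively for clarity, and training stops early only when an epoch makes no mistakes (a dev-set criterion is another option).

```python
def train_averaged_perceptron(examples, max_epochs=10):
    """examples: [(label, [feature, ...]), ...]. Returns (averaged weights, labels)."""
    weights = {}   # (label, feature) -> current weight
    totals = {}    # (label, feature) -> sum of that weight over all steps
    labels = sorted({label for label, _ in examples})
    steps = 0

    def score(label, feats):
        return sum(weights.get((label, f), 0.0) for f in feats)

    for _ in range(max_epochs):        # hard cap: terminates even if not separable
        mistakes = 0
        for gold, feats in examples:
            guess = max(labels, key=lambda l: score(l, feats))
            if guess != gold:          # standard perceptron update on a mistake
                mistakes += 1
                for f in feats:
                    weights[(gold, f)] = weights.get((gold, f), 0.0) + 1.0
                    weights[(guess, f)] = weights.get((guess, f), 0.0) - 1.0
            steps += 1
            for key, value in weights.items():   # naive running sum for averaging
                totals[key] = totals.get(key, 0.0) + value
        if mistakes == 0:              # no training errors in this epoch: stop early
            break
    averaged = {key: total / steps for key, total in totals.items()}
    return averaged, labels

def predict(model, feats):
    averaged, labels = model
    return max(labels, key=lambda l: sum(averaged.get((l, f), 0.0) for f in feats))

# Toy examples with made-up labels and features, for illustration only.
toy = [('ar', ['sfx1=r', 'sfx2=ar']),
       ('ir', ['sfx1=r', 'sfx2=ir']),
       ('ar', ['sfx1=r', 'sfx2=ar', 'pfx1=c'])]
model = train_averaged_perceptron(toy, max_epochs=5)
print(predict(model, ['sfx1=r', 'sfx2=ir']))
```

The `max_epochs` cap is what guarantees termination when the features do not make the classes linearly separable; averaging the weights tends to make the final model less sensitive to the last few updates.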