# TueSNLP 2019 - Assignment 2

## Classifying languages

The assignment and data are available at https://snlp2019.github.io/a2/

The data is a subset of WALS (https://wals.info/). It is a tab-separated file with a header in the first row and one language per row. The `family` column contains the family of the language, and the other columns contain typological features of the language; the first two columns (`lcode` and `lname`) are not essential.

The goal of the assignment is to learn language classification based on typological features.

As per the assignment instructions, our code roughly follows the provided template(s).


### Exercise 1. Encoding the data

We write a function `encode()` which reads the data file and returns two numpy arrays, `labels` and `features`. The former is the array of language families; the latter is a 2D array with as many rows as there are labels, where each row is the concatenation of the one-hot encodings of the features of the corresponding language. `NA`s and unknown values are mapped to vectors of `0`s.

We do this step-by-step, then we put everything together in a function definition.

In [140]:
import pandas as pd
import numpy as np

In [141]:
# read data and take a look
df = pd.read_csv("data/wals-train.tsv", sep = "\t").fillna("NA")

In [142]:
df.head()

Unnamed: 0,lcode,lname,family,143A Order of Negative Morpheme and Verb,143F Postverbal Negative Morphemes,143G Minor morphological means of signaling negation,82A Order of Subject and Verb,143E Preverbal Negative Morphemes,83A Order of Object and Verb,85A Order of Adposition and Noun Phrase,...,19A Presence of Uncommon Consonants,3A Consonant-Vowel Ratio,8A Lateral Consonants,6A Uvular Consonants,18A Absence of Common Consonants,69A Position of Tense-Aspect Affixes,112A Negative Morphemes,107A Passive Constructions,48A Person Marking on Adpositions,1A Consonant Inventories
0,amh,Amharic,Afro-Asiatic,14 ObligDoubleNeg,2 [V-Neg],4 None,1 SV,2 [Neg-V],1 OV,4 No dominant order,...,1 None,4 Moderately high,"2 /l/, no obstruent laterals",1 None,1 All present,4 Mixed type,6 Double negation,1 Present,3 Pronouns only,4 Moderately large
1,arz,Arabic (Egyptian),Afro-Asiatic,15 OptDoubleNeg,2 [V-Neg],4 None,1 SV,3 NegV&[Neg-V],2 VO,2 Prepositions,...,4 Pharyngeals,4 Moderately high,"2 /l/, no obstruent laterals",4 Uvular stops and continuants,1 All present,4 Mixed type,2 Negative particle,1 Present,3 Pronouns only,4 Moderately large
2,bej,Beja,Afro-Asiatic,3 [Neg-V],4 None,4 None,1 SV,2 [Neg-V],1 OV,1 Postpositions,...,1 None,3 Average,"2 /l/, no obstruent laterals",1 None,1 All present,2 Tense-aspect suffixes,1 Negative affix,1 Present,2 No person marking,3 Average
3,heb,Hebrew (Modern),Afro-Asiatic,1 NegV,4 None,4 None,1 SV,1 NegV,2 VO,2 Prepositions,...,,,,,,4 Mixed type,2 Negative particle,1 Present,3 Pronouns only,
4,irk,Iraqw,Afro-Asiatic,4 [V-Neg],2 [V-Neg],4 None,1 SV,4 None,1 OV,2 Prepositions,...,4 Pharyngeals,5 High,4 /l/ and lateral obstruent,2 Uvular stops only,1 All present,5 No tense-aspect inflection,1 Negative affix,1 Present,2 No person marking,4 Moderately large


The easiest part is to extract the labels:

In [143]:
labels = df["family"].to_numpy()  # call the method; a bare `to_numpy` would store the bound method instead of an array

In [144]:
print(labels.shape)

(89,)


There are 89 labels, one per language (row). Several languages share a family, so the number of distinct families is smaller than 89.
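The label array has one entry per language, so counting classes means counting distinct values. A minimal sketch of how this could be checked with `np.unique`, using a small made-up label array (the values are illustrative only, not the real data):

```python
import numpy as np

# Illustrative labels only; the real array would come from the "family" column
toy_labels = np.array(["Afro-Asiatic", "Afro-Asiatic", "Uto-Aztecan",
                       "Trans-New Guinea", "Uto-Aztecan"], dtype=object)

# distinct families and the number of languages in each
families, counts = np.unique(toy_labels, return_counts=True)
family_counts = dict(zip(families.tolist(), counts.tolist()))

print(len(toy_labels), "labels,", len(families), "distinct families")
print(family_counts)
```

Running the same two lines on the real label array would give the number of classes the classifier has to distinguish.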

Next, we want to one-hot encode the 30 columns of features. We need a procedure to encode an array of values, which also saves the coding dictionary for that array, so, for example, we can encode the training data first, and then use the same coding scheme to encode the testing data.

In [145]:
# a one-hot encoding function; it also returns the coding dictionary it used
def onehot_encoder(input_array):
    encoding_dict = {}
    unique_elements = list(set(input_array))  # the "vocabulary" of the array (set order is arbitrary)

    # first, put together the encoding dictionary:
    for i, current_element in enumerate(unique_elements):  # for each unique value of the input array
        encoded_element = [0] * len(unique_elements)  # initialize a vector of 0s
        if current_element != "NA":  # "NA" keeps its all-zero vector
            encoded_element[i] = 1  # set the position of the current value to 1
        encoding_dict[current_element] = encoded_element  # store the value together with its encoding

    # next, use the dictionary to encode the input array
    output_array = [encoding_dict[element] for element in input_array]

    return encoding_dict, output_array

For example:

In [146]:
input_a = ["a", "b", "NA", "c", "NA", "d", "a", "b", "f", "NA", "c", "f", "a", "z", "z", "z"]

In [147]:
enc_dict, out_arr = onehot_encoder(input_a)

In [148]:
enc_dict

{'b': [1, 0, 0, 0, 0, 0, 0],
 'c': [0, 1, 0, 0, 0, 0, 0],
 'd': [0, 0, 1, 0, 0, 0, 0],
 'z': [0, 0, 0, 1, 0, 0, 0],
 'a': [0, 0, 0, 0, 1, 0, 0],
 'f': [0, 0, 0, 0, 0, 1, 0],
 'NA': [0, 0, 0, 0, 0, 0, 0]}

In [149]:
out_arr

[[0, 0, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0]]
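The reason for returning the dictionary is to reuse the same coding scheme on other data, such as the test set. A minimal sketch of how that reuse might look; `apply_encoding` is a hypothetical helper that is not part of the assignment template, and the toy dictionary only mimics the format returned by `onehot_encoder()`:

```python
def apply_encoding(encoding_dict, input_array):
    """Encode values with an existing dictionary (hypothetical helper).

    Values that are missing from the dictionary, e.g. categories that only
    occur in the test data, are mapped to an all-zero vector, like "NA".
    """
    vector_length = len(next(iter(encoding_dict.values())))
    zeros = [0] * vector_length
    return [encoding_dict.get(element, zeros) for element in input_array]


# toy dictionary in the same format that onehot_encoder() returns
toy_dict = {"a": [1, 0, 0], "b": [0, 1, 0], "NA": [0, 0, 0]}
encoded = apply_encoding(toy_dict, ["b", "a", "unseen", "NA"])
print(encoded)
```

Because unseen values fall back to zeros, the encoded test vectors keep the same length as the training vectors.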

Let's try it on a column of the df:

In [150]:
enc_dict, enc_col = onehot_encoder(df.iloc[:, 3])

In [151]:
print(df.iloc[0:10, 3])

0    14 ObligDoubleNeg
1      15 OptDoubleNeg
2            3 [Neg-V]
3               1 NegV
4            4 [V-Neg]
5               2 VNeg
6               1 NegV
7               1 NegV
8    14 ObligDoubleNeg
9            4 [V-Neg]
Name: 143A Order of Negative Morpheme and Verb, dtype: object


In [152]:
enc_dict

{'6 Type 1 / Type 2': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 '11 Type 3 / Type 4': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 '1 NegV': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 '2 VNeg': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 '10 Type 2 / Type 4': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
 '4 [V-Neg]': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 '3 [Neg-V]': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 '14 ObligDoubleNeg': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 '15 OptDoubleNeg': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 'NA': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 '8 Type 1 / Type 4': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}

In [153]:
enc_col[0:10]

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]

It appears to be working; let's apply it to every column except the first three (`lcode`, `lname` and `family`):

In [154]:
encoded_features = df.iloc[:, 3:].apply(onehot_encoder)

In [155]:
encoded_features.head()

143A Order of Negative Morpheme and Verb                ({'6 Type 1 / Type 2': [1, 0, 0, 0, 0, 0, 0, 0...
143F Postverbal Negative Morphemes                      ({'2 [V-Neg]': [1, 0, 0, 0, 0], '1 VNeg': [0, ...
143G Minor morphological means of signaling negation    ({'1 NegTone': [1, 0, 0], 'NA': [0, 0, 0], '4 ...
82A Order of Subject and Verb                           ({'2 VS': [1, 0, 0], '1 SV': [0, 1, 0], '3 No ...
143E Preverbal Negative Morphemes                       ({'2 [Neg-V]': [1, 0, 0, 0, 0], '1 NegV': [0, ...
dtype: object

The object `encoded_features` is a pandas Series of pairs `(encoding_dictionary, encoded_feature)`, one for each feature: the first element of each pair is the dictionary used to encode the corresponding feature, and the second element is the resulting encoded feature. For example:

In [156]:
encoded_features.iloc[0]  # positional access; plain [0] on a label-indexed Series is deprecated in recent pandas

({'6 Type 1 / Type 2': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  '11 Type 3 / Type 4': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  '1 NegV': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  '2 VNeg': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  '10 Type 2 / Type 4': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
  '4 [V-Neg]': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  '3 [Neg-V]': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  '14 ObligDoubleNeg': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
  '15 OptDoubleNeg': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  'NA': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  '8 Type 1 / Type 4': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]},
 [[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 1, 0, 

Now, for each row `i` of the original dataset, we take the `i`-th element of the second element of each pair in `encoded_features`, obtaining a vector of (encoded) features for the row, and then we flatten it into a vector of 0s and 1s. For example:

In [157]:
i = 0
ith_row = [feature[1][i] for feature in encoded_features]

In [158]:
ith_row

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 1],
 [0, 1, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 1, 0],
 [0, 1, 0, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 1, 0],
 [0, 0, 0, 0, 0, 1],
 [1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [1, 0, 0, 0],
 [1, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1],
 [0, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 1, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 1],
 [0, 0, 0, 0, 1],
 [0, 1, 0, 0, 0, 0]]

In [159]:
ith_row_flat = np.array([item for element in ith_row for item in element])

In [160]:
ith_row_flat

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0])
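The flattening step joins vectors of different lengths end to end. As a sketch of an equivalent formulation, `np.concatenate` does the same job as the nested comprehension; the toy row below is made up for illustration:

```python
import numpy as np

# toy stand-in for ith_row: one one-hot vector per feature, of varying length
toy_row = [[0, 1, 0], [1, 0], [0, 0, 0, 1]]

# the nested comprehension used in the notebook
flat_comprehension = np.array([item for element in toy_row for item in element])

# np.concatenate joins the per-feature vectors end to end
flat_concatenate = np.concatenate(toy_row)

print(flat_concatenate)
print(np.array_equal(flat_comprehension, flat_concatenate))
```

Both give a single vector whose length is the sum of the per-feature vector lengths.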

Finally we do this for every row:

In [161]:
features = np.empty([len(df), sum(len(feature[1][0]) for feature in encoded_features)]) # 89 rows x 168 columns here, derived from the data instead of hard-coded

In [162]:
for row in range(0, features.shape[0]):
    ith_row = [feature[1][row] for feature in encoded_features]
    features[row] = np.array([item for element in ith_row for item in element])

In [163]:
features

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 1.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.]])
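The matrix dimensions (89 rows, 168 columns) follow from the data: the number of languages and the total length of all one-hot vectors. A sketch of an alternative that never needs them explicitly stacks one block per feature side by side with `np.hstack`; the toy columns below are invented for illustration:

```python
import numpy as np

# toy stand-in for the encoded columns: for each feature, a list with one
# one-hot vector per row (3 rows here; vector lengths 2 and 3)
toy_encoded_columns = [
    [[1, 0], [0, 1], [0, 0]],
    [[0, 0, 1], [1, 0, 0], [0, 1, 0]],
]

# each feature becomes an (n_rows, n_values) block; stack the blocks side by side
toy_features = np.hstack([np.array(column) for column in toy_encoded_columns])

print(toy_features.shape)
print(toy_features)
```

With the real data, each block would be built from the second element of a pair in `encoded_features`.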

We can put everything together in a function called `encode()` as follows:

In [165]:
def encode(path_to_file, return_dictionaries = False): # added an option to return the set of encoding dictionaries as well
    # read the data
    df = pd.read_csv(path_to_file, sep = "\t").fillna("NA")

    # labels (note the call: a bare `to_numpy` would return the bound method, not an array)
    labels = df["family"].to_numpy()

    # features: one-hot encode every column after lcode, lname and family
    encoded_features = df.iloc[:, 3:].apply(onehot_encoder)
    n_rows = len(df)
    n_columns = sum(len(feature[1][0]) for feature in encoded_features)
    features = np.empty([n_rows, n_columns]) # 89 x 168 for the training file
    for row in range(n_rows):
        ith_row = [feature[1][row] for feature in encoded_features]
        features[row] = np.array([item for element in ith_row for item in element])

    if return_dictionaries:
        dictionaries = [feature[0] for feature in encoded_features]
        return features, labels, dictionaries
    else:
        return features, labels

In [166]:
features, labels, dictionaries = encode("data/wals-train.tsv", True)

In [174]:
features[0:3]

array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0.,

In [176]:
labels.shape

(89,)

In [175]:
dictionaries[0:3]

[{'6 Type 1 / Type 2': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  '11 Type 3 / Type 4': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  '1 NegV': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  '2 VNeg': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  '10 Type 2 / Type 4': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
  '4 [V-Neg]': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  '3 [Neg-V]': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  '14 ObligDoubleNeg': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
  '15 OptDoubleNeg': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  'NA': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  '8 Type 1 / Type 4': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]},
 {'2 [V-Neg]': [1, 0, 0, 0, 0],
  '1 VNeg': [0, 1, 0, 0, 0],
  '3 VNeg&[V-Neg]': [0, 0, 1, 0, 0],
  'NA': [0, 0, 0, 0, 0],
  '4 None': [0, 0, 0, 0, 1]},
 {'1 NegTone': [1, 0, 0], 'NA': [0, 0, 0], '4 None': [0, 0, 1]}]

It works!
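As a cross-check on the hand-written encoder, pandas ships a built-in one-hot encoder, `pd.get_dummies`. A small sketch on a made-up frame (the column names and values are illustrative assumptions, not WALS data): with the default `dummy_na=False`, missing values get no column of their own, so they end up as all-zero blocks, which matches the treatment of `NA` here.

```python
import numpy as np
import pandas as pd

# toy frame with missing values; names and values are made up for illustration
toy_df = pd.DataFrame({
    "order": ["SV", "VS", np.nan, "SV"],
    "negation": ["affix", np.nan, "particle", "affix"],
})

# with the default dummy_na=False, a missing value yields zeros in every
# column that belongs to that feature
dummies = pd.get_dummies(toy_df, dtype=int)

print(dummies)
```

One difference from `onehot_encoder()` is that `get_dummies` does not reserve a position for `NA`, so the resulting matrix would be narrower than the 168 columns obtained above.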

### Exercise 2. Training a simple classifier