# TueSNLP 2019 - Assignment 2

## Classifying languages

The assignment and data are available at https://snlp2019.github.io/a2/

The data is a subset of WALS (https://wals.info/). It is a tab-separated file with a header in the first row and one language per row. The `family` column contains the family of the language, and the columns after it contain typological features of the language. The first two columns (`lcode` and `lname`) identify the language and are not used as features.

The goal of the assignment is to learn to classify languages by family based on their typological features.

As per the assignment instructions, our code roughly follows the provided templates.


### Exercise 1. Encoding the data

We write a function `encode()` which reads the data file and returns two numpy arrays, `features` and `labels`. `labels` is the array of language families, one entry per language. `features` is a 2d array with as many rows as there are labels, where each row is the concatenation of the one-hot encodings of the features of the corresponding language. `NA`s and unknown values will be mapped to vectors of `0`s.

We do this step-by-step, then we put everything together in a function definition.

In [8]:
import pandas as pd
import numpy as np

In [9]:
# read data and take a look
df = pd.read_csv("data/wals-train.tsv", sep = "\t")

In [10]:
df.head()

Unnamed: 0,lcode,lname,family,143A Order of Negative Morpheme and Verb,143F Postverbal Negative Morphemes,143G Minor morphological means of signaling negation,82A Order of Subject and Verb,143E Preverbal Negative Morphemes,83A Order of Object and Verb,85A Order of Adposition and Noun Phrase,...,19A Presence of Uncommon Consonants,3A Consonant-Vowel Ratio,8A Lateral Consonants,6A Uvular Consonants,18A Absence of Common Consonants,69A Position of Tense-Aspect Affixes,112A Negative Morphemes,107A Passive Constructions,48A Person Marking on Adpositions,1A Consonant Inventories
0,amh,Amharic,Afro-Asiatic,14 ObligDoubleNeg,2 [V-Neg],4 None,1 SV,2 [Neg-V],1 OV,4 No dominant order,...,1 None,4 Moderately high,"2 /l/, no obstruent laterals",1 None,1 All present,4 Mixed type,6 Double negation,1 Present,3 Pronouns only,4 Moderately large
1,arz,Arabic (Egyptian),Afro-Asiatic,15 OptDoubleNeg,2 [V-Neg],4 None,1 SV,3 NegV&[Neg-V],2 VO,2 Prepositions,...,4 Pharyngeals,4 Moderately high,"2 /l/, no obstruent laterals",4 Uvular stops and continuants,1 All present,4 Mixed type,2 Negative particle,1 Present,3 Pronouns only,4 Moderately large
2,bej,Beja,Afro-Asiatic,3 [Neg-V],4 None,4 None,1 SV,2 [Neg-V],1 OV,1 Postpositions,...,1 None,3 Average,"2 /l/, no obstruent laterals",1 None,1 All present,2 Tense-aspect suffixes,1 Negative affix,1 Present,2 No person marking,3 Average
3,heb,Hebrew (Modern),Afro-Asiatic,1 NegV,4 None,4 None,1 SV,1 NegV,2 VO,2 Prepositions,...,,,,,,4 Mixed type,2 Negative particle,1 Present,3 Pronouns only,
4,irk,Iraqw,Afro-Asiatic,4 [V-Neg],2 [V-Neg],4 None,1 SV,4 None,1 OV,2 Prepositions,...,4 Pharyngeals,5 High,4 /l/ and lateral obstruent,2 Uvular stops only,1 All present,5 No tense-aspect inflection,1 Negative affix,1 Present,2 No person marking,4 Moderately large


The easiest part is to extract the labels:

In [11]:
labels = df["family"].to_numpy()  # to_numpy is a method, so it must be called

In [12]:
print(labels[:3])
print(labels.shape)

['Afro-Asiatic' 'Afro-Asiatic' 'Afro-Asiatic']
(89,)


There are 89 labels, one per language; languages from the same family share a label.
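The number of rows and the number of distinct families are different quantities. A minimal sketch of how the two can be told apart with `np.unique`, using a small made-up label array (the values in `toy_labels` are illustrative, not the real data):

```python
import numpy as np

# Toy label array (hypothetical values standing in for the real family column)
toy_labels = np.array(["Afro-Asiatic", "Afro-Asiatic", "Uto-Aztecan",
                       "Trans-New Guinea", "Uto-Aztecan"], dtype=object)

# Distinct families (sorted) and the number of languages in each one
families, counts = np.unique(toy_labels, return_counts=True)
n_languages = len(toy_labels)
n_families = len(families)

print(n_languages, n_families)
print(dict(zip(families, counts)))
```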

Next, we want to one-hot encode the 30 columns of features:

In [66]:
# a one-hot encoding function
def onehot_encoder(input_array):
    unique_elements = list(set(input_array))  # the "vocabulary" of the array (unordered)
    output_array = [None] * len(input_array)  # one output row per input element
    for i in range(len(input_array)):
        current_element = input_array[i]
        output_array[i] = [0] * len(unique_elements)  # start from a vector of 0s
        if current_element in ["NA", "NaN"]:
            continue  # the strings "NA" and "NaN" stay all 0s
        for j in range(len(unique_elements)):
            # a real NaN never compares equal to anything, so it also stays all 0s
            if current_element == unique_elements[j]:
                output_array[i][j] = 1  # put a 1 at the element's position in the vocabulary
    return output_array

For example:

In [67]:
input_a = ["a", "b", "NaN", "c", "NA", "d", "a", "b", "f", "NA", "c", "NaN"]

In [69]:
onehot_encoder(input_a)

[[0, 0, 0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0]]
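Because the vocabulary above comes from a `set`, the column order is not guaranteed to be the same from one Python session to the next, and missing values still occupy a column that is always 0. The following is a sketch of one possible variant, not the approach used in the rest of this notebook: it sorts the vocabulary and leaves missing values out of it. The name `onehot_encoder_sorted` is hypothetical.

```python
import pandas as pd

def onehot_encoder_sorted(values):
    """Hypothetical variant of onehot_encoder with a stable column order."""
    series = pd.Series(values, dtype=object)
    # Treat real missing values and the strings "NA"/"NaN" alike
    missing = series.isna() | series.isin(["NA", "NaN"])
    # Sorted vocabulary: same column order on every run, no column for missing values
    vocabulary = sorted(series[~missing].unique())
    index = {value: position for position, value in enumerate(vocabulary)}
    encoded = []
    for value, is_missing in zip(series, missing):
        row = [0] * len(vocabulary)
        if not is_missing:
            row[index[value]] = 1
        encoded.append(row)
    return vocabulary, encoded

vocabulary, encoded = onehot_encoder_sorted(["a", "b", float("nan"), "c", "NA", "a"])
print(vocabulary)
print(encoded)
```

With this variant the vectors are narrower than in the example above, since `NA` and `NaN` no longer take up positions of their own.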

Let's try it on a column of the df:

In [70]:
test_encoded_column = onehot_encoder(df.iloc[:, 3])

In [71]:
print(df.iloc[0:10, 3])

0    14 ObligDoubleNeg
1      15 OptDoubleNeg
2            3 [Neg-V]
3               1 NegV
4            4 [V-Neg]
5               2 VNeg
6               1 NegV
7               1 NegV
8    14 ObligDoubleNeg
9            4 [V-Neg]
Name: 143A Order of Negative Morpheme and Verb, dtype: object


In [72]:
test_encoded_column[0:10]

[[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]

It appears to be working; let's apply it to every column except the first three, which are not features:

In [73]:
encoded_features_df = df.iloc[:, 3:].apply(onehot_encoder)

In [74]:
encoded_features_df.head()

Unnamed: 0,143A Order of Negative Morpheme and Verb,143F Postverbal Negative Morphemes,143G Minor morphological means of signaling negation,82A Order of Subject and Verb,143E Preverbal Negative Morphemes,83A Order of Object and Verb,85A Order of Adposition and Noun Phrase,86A Order of Genitive and Noun,88A Order of Demonstrative and Noun,87A Order of Adjective and Noun,...,19A Presence of Uncommon Consonants,3A Consonant-Vowel Ratio,8A Lateral Consonants,6A Uvular Consonants,18A Absence of Common Consonants,69A Position of Tense-Aspect Affixes,112A Negative Morphemes,107A Passive Constructions,48A Person Marking on Adpositions,1A Consonant Inventories
0,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0]","[0, 1, 0]","[0, 1, 0]","[0, 0, 0, 1, 0]","[0, 0, 1, 0]","[0, 0, 0, 1, 0]","[0, 0, 1, 0]","[0, 1, 0, 0, 0, 0]","[0, 0, 0, 1]",...,"[0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 1, 0]","[0, 0, 0, 1, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 1]","[0, 1, 0, 0, 0, 0]","[0, 0, 1]","[0, 0, 0, 1, 0]","[0, 0, 0, 0, 1, 0]"
1,"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0]","[0, 1, 0]","[0, 1, 0]","[0, 0, 1, 0, 0]","[0, 1, 0, 0]","[0, 0, 1, 0, 0]","[0, 1, 0, 0]","[0, 0, 1, 0, 0, 0]","[0, 1, 0, 0]",...,"[0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 1, 0]","[0, 0, 1, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 1]","[0, 0, 1, 0, 0, 0]","[0, 0, 1]","[0, 0, 0, 1, 0]","[0, 0, 0, 0, 1, 0]"
2,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0]","[0, 1, 0]","[0, 1, 0]","[0, 0, 0, 1, 0]","[0, 0, 1, 0]","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 1, 0]","[0, 0, 1, 0]",...,"[0, 1, 0, 0, 0, 0]","[0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 1, 0]","[0, 0, 0, 1, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 1, 0]","[0, 0, 1]","[0, 0, 1, 0, 0]","[0, 0, 1, 0, 0, 0]"
3,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]","[0, 1, 0, 0, 0]","[0, 1, 0]","[0, 1, 0]","[0, 0, 0, 0, 1]","[0, 1, 0, 0]","[0, 0, 1, 0, 0]","[0, 1, 0, 0]","[0, 0, 1, 0, 0, 0]","[0, 1, 0, 0]",...,"[0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0]","[0, 0, 0, 0]","[0, 0, 0, 0, 0, 1]","[0, 0, 1, 0, 0, 0]","[0, 0, 1]","[0, 0, 0, 1, 0]","[0, 0, 0, 0, 0, 0]"
4,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[0, 0, 1, 0, 0]","[0, 1, 0]","[0, 1, 0]","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 1, 0, 0]","[0, 1, 0, 0]","[0, 0, 1, 0, 0, 0]","[0, 1, 0, 0]",...,"[0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 1, 0]","[0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 1]","[0, 0, 1, 0]","[0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0]","[0, 0, 1]","[0, 0, 1, 0, 0]","[0, 0, 0, 0, 1, 0]"


This is how we flatten each row into a numpy array which is the concatenation of all features of the row:

In [79]:
flat_features = np.array([element for feature in encoded_features_df.iloc[0, :] for element in feature])

In [80]:
print(flat_features)
print(len(flat_features))

[0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0
 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0]
168
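The flattening step is a plain concatenation of vectors of different widths. As an illustration on made-up data (the values in `toy_row` are hypothetical), `np.concatenate` gives the same result as the nested comprehension:

```python
import numpy as np

# Toy row of per-feature one-hot vectors (hypothetical values, different widths)
toy_row = [[0, 1, 0], [0, 0], [1, 0, 0, 0]]

# Nested comprehension, as used above
flat_comprehension = np.array([element for feature in toy_row for element in feature])

# np.concatenate joins the vectors end to end
flat = np.concatenate([np.asarray(vector) for vector in toy_row])

print(flat)        # [0 1 0 0 0 1 0 0 0]
print(flat.shape)  # (9,)
```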


Finally we do this for every row:

In [94]:
features = np.empty([89, 168]) # initialize empty matrix with appropriate dimensions

In [95]:
for row in range(0, features.shape[0]):
    features[row] = np.array([element for feature in encoded_features_df.iloc[row, :] for element in feature])

In [96]:
features

array([[0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])
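As a cross-check for a hand-written encoder, pandas offers `pd.get_dummies`, which creates one indicator column per observed value and, by default, gives missing values no column of their own, so their rows are all 0s for that feature. A small sketch on a made-up frame (the column names and values in `toy` are hypothetical):

```python
import pandas as pd

# Toy frame with two categorical features and one missing value (hypothetical data)
toy = pd.DataFrame({
    "order": ["SV", "VS", "SV"],
    "negation": ["affix", None, "particle"],
})

# One indicator column per observed value; missing values get no column by default
dummies = pd.get_dummies(toy, dtype=int)
toy_features = dummies.to_numpy()

print(list(dummies.columns))
print(toy_features)
```

The resulting matrix has fewer columns than the hand-written encoding would produce on the same data, because no position is reserved for missing values.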

We can put everything together in a function called `encode()` as follows:

In [98]:
def encode(path_to_file):
    # read the data
    df = pd.read_csv(path_to_file, sep="\t")

    # labels: call to_numpy() to get an array rather than the bound method
    labels = df["family"].to_numpy()

    # features: one-hot encode each feature column, then concatenate per row
    encoded_features_df = df.iloc[:, 3:].apply(onehot_encoder)
    features = np.array([
        [element for feature in encoded_features_df.iloc[row, :] for element in feature]
        for row in range(encoded_features_df.shape[0])
    ], dtype=float)  # shape is derived from the data instead of hard-coded as 89 x 168

    return features, labels

In [100]:
features, labels = encode("data/wals-train.tsv")

In [101]:
features

array([[0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [102]:
labels[:3]

array(['Afro-Asiatic', 'Afro-Asiatic', 'Afro-Asiatic'], dtype=object)

It works!
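
If the same encoding is later applied to a second file, the columns have to line up with those of the training data, and values that were never seen in training should map to vectors of `0`s, as stated at the start of this exercise. The following is a minimal sketch under that assumption, on made-up frames; the helper names `fit_vocabularies` and `encode_with_vocabularies` are hypothetical and not part of the assignment template.

```python
import numpy as np
import pandas as pd

def fit_vocabularies(feature_df):
    """Hypothetical helper: sorted list of observed values for each feature column."""
    return {column: sorted(feature_df[column].dropna().unique())
            for column in feature_df.columns}

def encode_with_vocabularies(feature_df, vocabularies):
    """Hypothetical helper: one-hot encode with a fixed, previously fitted vocabulary."""
    blocks = []
    for column, vocabulary in vocabularies.items():
        block = np.zeros((len(feature_df), len(vocabulary)))
        for position, value in enumerate(vocabulary):
            # Missing and unseen values never match, so their rows stay all zeros
            block[:, position] = (feature_df[column] == value).to_numpy()
        blocks.append(block)
    return np.hstack(blocks)

# Toy data (hypothetical): "free" never occurs in the training frame
train = pd.DataFrame({"order": ["SV", "VS", None],
                      "negation": ["affix", "particle", "affix"]})
test = pd.DataFrame({"order": ["VS", "free"],
                     "negation": [None, "affix"]})

vocabularies = fit_vocabularies(train)
train_features = encode_with_vocabularies(train, vocabularies)
test_features = encode_with_vocabularies(test, vocabularies)
print(train_features.shape, test_features.shape)
```

Both matrices end up with the same number of columns, because the vocabulary is fitted once and then reused.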