# Clustering Quality of WALS Subareas

## Phonology
- Consonants
- Vowels
- Prosody (including 1 tone feature)

The following tables summarize the best silhouette score of every feature group, restricted to these subfields and to groups with at least 30 languages.

As you can see, there were only 4 groups in the vowels set, so it can't really be compared with the consonant groups (883 of them), which are only slightly better than the prosody groups (mean 0.200 vs. 0.142; there were 37 of those).
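For reference, here is a minimal sketch of what a silhouette score measures. It is a plain-Python illustration with Euclidean distance, not the code that produced the CSV files; the function name and the toy points are made up for the example. For each point, `a` is the mean distance to the rest of its own cluster and `b` is the smallest mean distance to any other cluster; the point's score is `(b - a) / max(a, b)`.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so dividing by len(own) - 1 excludes the point itself)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well separated pairs of points
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # grouping that splits the pairs
```

Scores near 1 mean tight, well separated clusters, scores near 0 mean overlapping clusters, and negative scores mean points sit closer to another cluster than to their own. The values in the tables below should be read on that scale.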

In [3]:
import pandas as pd
consonants = pd.read_csv('../miscsv/consonants-silhouettes.csv')
vowels = pd.read_csv('../miscsv/vowels-silhouettes.csv')
prosody = pd.read_csv('../miscsv/prosody-silhouettes.csv')

In [2]:
consonants.describe()

Unnamed: 0,silhouette score
count,883.0
mean,0.199526
std,0.066717
min,0.006535
25%,0.165422
50%,0.197799
75%,0.239928
max,0.375377


In [3]:
vowels.describe()

Unnamed: 0,silhouette score
count,4.0
mean,0.275609
std,0.108548
min,0.150418
25%,0.238057
50%,0.268266
75%,0.305817
max,0.415485


In [4]:
prosody.describe()

Unnamed: 0,silhouette score
count,37.0
mean,0.142081
std,0.052859
min,0.028417
25%,0.112963
50%,0.134855
75%,0.188024
max,0.230584
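One rough way to see why the vowels set is hard to compare with the other two is the standard error of each mean, `std / sqrt(count)`. The sketch below uses only the counts and standard deviations reported above, and it assumes the scores within a set can be treated as independent samples, which may not hold for overlapping feature groups.

```python
from math import sqrt

# (count, std) copied from the describe() outputs above
summary = {
    "consonants": (883, 0.066717),
    "vowels": (4, 0.108548),
    "prosody": (37, 0.052859),
}

# Standard error of the mean: std / sqrt(count)
standard_error = {name: std / sqrt(n) for name, (n, std) in summary.items()}

for name, se in standard_error.items():
    print(f"{name}: {se:.4f}")
```

With only 4 groups, the vowels mean is uncertain by roughly 0.05, which is about the size of the differences between the sets.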


## Word order
- Negation
- Other

In [7]:
negation = pd.read_csv('../miscsv/negation-silhouettes-500.csv')
other = pd.read_csv('../miscsv/notnegation-silhouettes-1000.csv')
negation.describe()

Unnamed: 0,silhouette_score
count,27.0
mean,0.15517
std,0.093598
min,0.012679
25%,0.06471
50%,0.215416
75%,0.233651
max,0.264603


In [8]:
other.describe()

Unnamed: 0,silhouette_score
count,132.0
mean,0.083749
std,0.053972
min,-0.008737
25%,0.05002
50%,0.075191
75%,0.120748
max,0.224549


## Python MICE

In [2]:
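# logging and np are used below; import them explicitly
# rather than relying on the star imports
import logging
import random

import numpy as np
from fancyimpute import *
from locator import *
from extractors import *
logging.basicConfig(level=logging.ERROR)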

In [24]:
test = chunk_wals(['81A','90A','143A'],True,True)
for c in test.columns:
    test[c] = test[c].apply(lambda x: float(numerize(x)))
test[:5]
np.count_nonzero(test.isnull())

0

### Randomly Remove Some Values, say 5

In [57]:
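removed = list()
while len(removed) < 5:
    r = test.sample(1).index[0]
    c = random.sample(list(test.columns), 1)[0]
    v = test.loc[r, c]
    if pd.isnull(v):
        continue  # cell already removed; pick another
    test.at[r, c] = np.nan  # DataFrame.set_value was removed in pandas 1.0
    removed.append((r, c, v))
removed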

[(194, '143A', 4.0),
 (265, '81A', 7.0),
 (591, '81A', 2.0),
 (1449, '90A', 2.0),
 (1250, '90A', 2.0)]

In [58]:
np.count_nonzero(test.isnull())

5
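The removal step can also be wrapped in a small reusable function. This is only a sketch: `mask_random_cells` is a hypothetical helper name, it assumes the values are already numeric (as they are here after `numerize`), and it differs from the loop above in that it works on a copy and takes a seed so the same cells can be removed again later.

```python
import numpy as np
import pandas as pd


def mask_random_cells(df, n, seed=None):
    """Return (masked copy, record of removed cells) with n distinct cells set to NaN."""
    rng = np.random.default_rng(seed)
    # Candidate cells: (row position, column position) of every observed value
    rows, cols = np.nonzero(df.notna().to_numpy())
    if n > len(rows):
        raise ValueError("not enough observed cells to remove")
    chosen = rng.choice(len(rows), size=n, replace=False)  # no cell picked twice

    masked = df.astype(float)  # astype returns a copy; float columns can hold NaN
    removed = []
    for k in chosen:
        i, j = int(rows[k]), int(cols[k])
        removed.append({
            "language_index": df.index[i],
            "feature": df.columns[j],
            "original_value": df.iat[i, j],
        })
        masked.iat[i, j] = np.nan
    return masked, pd.DataFrame(removed)


# Toy frame standing in for the WALS chunk
toy = pd.DataFrame({"81A": [1.0, 2.0, 7.0], "90A": [2.0, 2.0, 1.0]},
                   index=[10, 20, 30])
toy_masked, toy_removed = mask_random_cells(toy, 3, seed=0)
```

Because the original frame is left intact, the removed values can always be checked against it afterwards.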

In [63]:
# Use 3 nearest rows which have a feature to fill in each row's missing features
filled_knn = KNN(k=3).complete(test)

# matrix completion using convex optimization to find low-rank solution
# that still matches observed values. Slow!
#filled_nnm = NuclearNormMinimization().complete(test)

# Rather than solving the nuclear norm objective directly,
# induce sparsity using singular value thresholding
#filled_softimpute = SoftImpute().complete(test)

#filled_mice = MICE().complete(test)

Imputing row 1/703 with 0 missing, elapsed time: 0.076
Imputing row 101/703 with 0 missing, elapsed time: 0.077
Imputing row 201/703 with 0 missing, elapsed time: 0.077
Imputing row 301/703 with 0 missing, elapsed time: 0.077
Imputing row 401/703 with 0 missing, elapsed time: 0.078
Imputing row 501/703 with 0 missing, elapsed time: 0.078
Imputing row 601/703 with 0 missing, elapsed time: 0.078
Imputing row 701/703 with 0 missing, elapsed time: 0.078
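
To make the first comment in the cell above concrete, here is a simplified sketch of k-nearest-neighbour imputation. It is an illustration only, not fancyimpute's actual implementation: the function name `knn_impute`, the distance (root mean squared difference over the columns both rows observe) and the plain unweighted mean are all assumptions made for the example.

```python
import numpy as np


def knn_impute(X, k=3):
    """Fill NaNs with the mean of the k nearest rows that observe that column."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.flatnonzero(missing.any(axis=1)):
        # Distance from row i to every row, over the columns both rows observe
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[n_shared == 0] = np.inf  # rows with no shared columns
        dist[i] = np.inf              # never use the row itself
        for j in np.flatnonzero(missing[i]):
            # Only rows that actually observe column j can donate a value
            candidates = np.flatnonzero(~missing[:, j] & np.isfinite(dist))
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            if len(nearest):
                filled[i, j] = X[nearest, j].mean()
    return filled


toy = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 1.0, np.nan],  # the missing cell to fill
    [1.0, 1.0, 3.0],
    [9.0, 9.0, 9.0],
    [1.0, 2.0, 1.0],
])
toy_filled = knn_impute(toy, k=2)
```

Note that averaging neighbours treats the WALS values as numbers, although they are category codes, so an imputed value may not correspond to any actual category.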


In [67]:
for i,(row,column,value) in enumerate(removed):
    print(value,filled_knn[i][list(test.columns).index(column)])

4.0 1.0
7.0 1.0
2.0 2.0
2.0 1.0
2.0 1.0


Not very good. I don't think there is any point in trying anything larger.
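
One caveat about the comparison in `In [67]`: `filled_knn[i]` indexes the filled matrix with the loop counter `i` (0 to 4), not with the position of the row that was emptied, so the printed imputed values appear to come from the first five rows. A sketch of a lookup that goes from the index label to the row position is below; `compare_imputed` is a hypothetical helper, and it assumes the index labels are unique and that the filled matrix keeps the row and column order of the frame.

```python
import numpy as np
import pandas as pd


def compare_imputed(frame, filled, removed):
    """Pair each removed original value with the imputed value at the same cell."""
    filled = np.asarray(filled)
    pairs = []
    for row_label, column, original in removed:
        i = frame.index.get_loc(row_label)  # index label -> row position
        j = frame.columns.get_loc(column)   # column name -> column position
        pairs.append((original, filled[i, j]))
    return pairs


# Toy data: labels are deliberately not 0, 1, 2
toy = pd.DataFrame({"81A": [1.0, np.nan, 2.0], "90A": [2.0, 2.0, np.nan]},
                   index=[194, 265, 591])
toy_filled = np.array([[1.0, 2.0], [7.0, 2.0], [2.0, 5.0]])
toy_removed = [(265, "81A", 7.0), (591, "90A", 2.0)]
pairs = compare_imputed(toy, toy_filled, toy_removed)
```

Rerunning the check this way on `test`, `filled_knn` and `removed` would be needed before drawing a firm conclusion about the KNN results.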

In [68]:
test.to_csv('703-5-emptied.csv')

In [25]:
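test2 = test.copy()  # copy so that test itself is left untouched
removed2 = list()
while len(removed2) < 50:
    r = test2.sample(1).index[0]
    c = random.sample(list(test2.columns), 1)[0]
    v = test2.loc[r, c]
    if pd.isnull(v):
        continue  # cell already removed; pick another
    test2.at[r, c] = np.nan  # DataFrame.set_value was removed in pandas 1.0
    removed2.append({'language_index': r, 'feature': c, 'original_value': v})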

r2df = pd.DataFrame(removed2)
r2df
np.count_nonzero(test2.isnull())

50

In [29]:
r2df.to_csv('removed50-original.csv')
test2.to_csv('703-removed-50.csv')

In [30]:
r2df[r2df['feature'] == '81A']

Unnamed: 0,feature,language_index,original_value
0,81A,425,1.0
5,81A,843,1.0
13,81A,1196,1.0
16,81A,2373,1.0
17,81A,645,1.0
18,81A,158,7.0
20,81A,2106,2.0
23,81A,283,2.0
27,81A,1666,3.0
34,81A,2641,7.0


In [31]:
r2df[r2df['feature'] == '143A']

Unnamed: 0,feature,language_index,original_value
1,143A,655,3.0
6,143A,2318,1.0
7,143A,1859,14.0
9,143A,1198,14.0
11,143A,1257,1.0
22,143A,2063,4.0
24,143A,866,1.0
29,143A,2561,3.0
32,143A,1524,4.0
33,143A,1295,6.0
