### WALS Project mini-Test ###
We set up the basic functionality and tested the two features:
 - number of cases
 - word order


In [1]:
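import logging
import numpy
import yaml
from feature import Feature

# safe_load avoids PyYAML's unsafe default loader; the file is closed after reading
with open('features.yml') as f:
    features = yaml.safe_load(f)
nlangs = 2679  # number of languages (rows) in the WALS table
logging.basicConfig(level=logging.ERROR)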

# build a (features x languages) matrix, one row per feature
base_matrix = numpy.zeros((0,nlangs))
for name,data in features.items():
    feat = Feature(name,data)
    base_matrix = numpy.vstack((base_matrix,feat.get_languages()))

print(numpy.cov(base_matrix))

[[ 0.9875766   0.2157909 ]
 [ 0.2157909   2.32882366]]


The resulting covariance matrix doesn't look very promising: we expected a negative off-diagonal value.
Note, though, that only 261 of the 2679 languages have a value for the number of cases feature.
Now, if we take just those 261:

In [2]:
cases = Feature('number_of_cases', features['number_of_cases']).get_languages()
orders = Feature('word_order1', features['word_order1']).get_languages()
new_base = numpy.zeros((0,2))
for o,c in zip(orders,cases):
    if c != 0:
        new_base = numpy.vstack((new_base,numpy.array([o,c])))
print(new_base.shape, new_base[:5])

(261, 2) [[ 1.  1.]
 [ 1.  2.]
 [-1.  1.]
 [ 1.  1.]
 [ 1.  9.]]
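
The row-by-row `vstack` loop above can also be expressed with a boolean mask. The following is only a minimal sketch: the two arrays are small made-up stand-ins for the real `orders` and `cases` vectors (not WALS data), and it assumes, as the cell above does, that 0 encodes a missing value.

```python
import numpy

# Hypothetical stand-ins for the real feature vectors; 0 marks "no value".
orders = numpy.array([1.0, 1.0, 0.0, -1.0, 1.0])
cases = numpy.array([1.0, 0.0, 0.0, 2.0, 9.0])

# Keep only the languages that have a value for number of cases,
# mirroring the `if c != 0` test in the loop.
mask = cases != 0
new_base = numpy.column_stack((orders[mask], cases[mask]))
print(new_base.shape)
```

Like the loop, this filters on the cases column only, so a language with a case count but no word order value would still be kept.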


In [3]:
print(numpy.cov(numpy.transpose(new_base)))

[[ 0.70250516 -0.47217801]
 [-0.47217801  9.04238137]]
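
Because the two variances differ a lot, the raw covariance is hard to read on its own. As a rough sketch, the covariance matrix printed above can be rescaled to a Pearson correlation; the numbers are copied from that output, and nothing else is assumed about the data.

```python
import numpy

# Covariance matrix copied from the output above.
cov = numpy.array([[0.70250516, -0.47217801],
                   [-0.47217801, 9.04238137]])

# Pearson r = cov(x, y) / (std(x) * std(y)),
# with the standard deviations taken from the diagonal.
std = numpy.sqrt(numpy.diag(cov))
r = cov[0, 1] / (std[0] * std[1])
print(round(float(r), 3))
```

This gives a correlation of roughly -0.19: negative, as expected, but weak. `numpy.corrcoef(numpy.transpose(new_base))` should give the same off-diagonal value directly from the data.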


Does this make more sense? The off-diagonal value is now negative, as we expected.
If it does, I guess we first have to see what kind of coverage WALS offers.

In [4]:
from feature import languages
coverage = languages.iloc[:,10:]  # .ix has been removed from pandas; .iloc selects by position
coverage = coverage.replace(to_replace=".+",regex=True,value=1)
coverage = coverage.replace(to_replace='',value=0)
coverage.describe()

Unnamed: 0,1A Consonant Inventories,2A Vowel Quality Inventories,3A Consonant-Vowel Ratio,4A Voicing in Plosives and Fricatives,5A Voicing and Gaps in Plosive Systems,6A Uvular Consonants,7A Glottalized Consonants,8A Lateral Consonants,9A The Velar Nasal,10A Vowel Nasalization,...,137B M in Second Person Singular,136B M in First Person Singular,109B Other Roles of Applied Objects,10B Nasal Vowels in West Africa,25B Zero Marking of A and P Arguments,21B Exponence of Tense-Aspect-Mood Inflection,108B Productivity of the Antipassive Construction,130B Cultural Categories of Languages with Identity of 'Finger' and 'Hand',58B Number of Possessive Nouns,79B Suppletion in Imperatives and Hortatives
count,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,...,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0,2679.0
mean,0.210153,0.210526,0.210526,0.211646,0.211646,0.211646,0.211646,0.211646,0.175065,0.091079,...,0.085853,0.085853,0.068309,0.014931,0.087719,0.059724,0.069429,0.026876,0.090705,0.072042
std,0.407493,0.407759,0.407759,0.408552,0.408552,0.408552,0.408552,0.408552,0.380094,0.287775,...,0.280199,0.280199,0.252323,0.121299,0.282939,0.237019,0.25423,0.16175,0.287243,0.258605
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
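
The two `replace` calls above turn every non-empty cell into 1 and every empty cell into 0, so each column mean is the fraction of languages covered. A minimal sketch of the same idea on a toy frame follows; the three rows are invented for illustration, and it assumes missing values are stored as empty strings, as the cell above suggests.

```python
import pandas as pd

# Toy frame with three hypothetical languages; '' marks a missing value.
toy = pd.DataFrame({
    '49A Number of Cases': ['2 cases', '', ''],
    '81A Order of Subject, Object and Verb': ['SOV', 'SVO', ''],
})

# True where a value is present, then cast to 0/1.
coverage = toy.ne('').astype(int)

# Column sums give the number of languages covered per feature.
print(coverage.sum())
```

If the real table stored missing values as NaN instead, `toy.notna()` would be the analogous test.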


In [5]:
coverage = coverage.apply(lambda f: sum(f))

and a quick sanity check:

In [6]:
coverage[['49A Number of Cases', '81A Order of Subject, Object and Verb']]

49A Number of Cases                       261
81A Order of Subject, Object and Verb    1377
dtype: int64

(Correct: 261 matches the number of rows kept in In [2].)

In [7]:
coverage.describe()

count     192.000000
mean      398.255208
std       349.743066
min         5.000000
25%       171.500000
50%       257.000000
75%       508.250000
max      1519.000000
dtype: float64


### So 75% of the features cover fewer than 509 languages ###
How many features cover more than 1000 languages?

In [8]:
from scipy.stats import percentileofscore as pcor
round(len(coverage)*(1 - pcor(coverage.values,1000)/100))  # len(coverage) is 192, not 193

19.0
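
The percentile-based estimate above can be cross-checked by counting directly. This is only a sketch: the series below is a small invented stand-in for `coverage` (the two named WALS features carry the counts from the sanity check, the other two entries are hypothetical).

```python
import pandas as pd

# Stand-in for the real coverage series: feature name -> languages covered.
coverage = pd.Series({
    '49A Number of Cases': 261,
    '81A Order of Subject, Object and Verb': 1377,
    'hypothetical feature X': 1519,
    'hypothetical feature Y': 5,
})

# Boolean comparison, then sum the True values to count them.
n_large = int((coverage > 1000).sum())

# The same mask also yields the feature names.
large = coverage[coverage > 1000].index.tolist()
print(n_large, large)
```

On the real series, the same two lines would return both the count and the list of features in one step, without the rounding involved in going through a percentile.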

But are these good features?

In [12]:
for i in coverage.keys():
    if coverage[i] > 1000:
        print(i)

33A Coding of Nominal Plurality
51A Position of Case Affixes
69A Position of Tense-Aspect Affixes
81A Order of Subject, Object and Verb
82A Order of Subject and Verb
83A Order of Object and Verb
85A Order of Adposition and Noun Phrase
86A Order of Genitive and Noun
87A Order of Adjective and Noun
88A Order of Demonstrative and Noun
89A Order of Numeral and Noun
95A Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase
97A Relationship between the Order of Object and Verb and the Order of Adjective and Noun
112A Negative Morphemes
143F Postverbal Negative Morphemes
144A Position of Negative Word With Respect to Subject, Object, and Verb
143E Preverbal Negative Morphemes
143A Order of Negative Morpheme and Verb
143G Minor morphological means of signaling negation
