# Data description practice

Today we'll compute some descriptions of subsets of the Peterson & Barney data.

In [30]:
# Data IO and basic statistics.
import pandas
# For some higher-level statistics (like skew) not present in Pandas.
import scipy.stats

In [31]:
# Available from:
# http://wellformedness.com/courses/LING83800/Data/pb52.tsv
# Believe it or not, we can just put in the URL (with `http://` at the beginning) and
# it works.
url = "http://wellformedness.com/courses/LING83800/Data/pb52.tsv"
pb52 = pandas.read_csv(url, sep="\t")
pb52.head()

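|   | Type | Sex | Speaker | Vowel | Repetition | F0 | F1 | F2 | F3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | m | m | 1 | i | 1 | 160 | 240 | 2280 | 2850 |
| 1 | m | m | 1 | i | 2 | 186 | 280 | 2400 | 2790 |
| 2 | m | m | 1 | I | 1 | 203 | 390 | 2030 | 2640 |
| 3 | m | m | 1 | I | 2 | 192 | 310 | 1980 | 2550 |
| 4 | m | m | 1 | E | 1 | 161 | 490 | 1870 | 2420 |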


## Subsetting

In Pandas, we subset the data by placing a true/false "predicate" in square brackets. For instance, 
if we want to subset so as to only have the /i/ vowel, we can do:

In [32]:
# The vowel codes here are in X-SAMPA:
# https://en.wikipedia.org/wiki/SAMPA_chart_for_English

i = pb52[pb52.Vowel == "i"]
# This prints out a summary of the numerical values.
# Note that it doesn't really understand that "Speaker" and "Repetition"
# aren't *really* numerical; the former is nominal and the latter is,
# at best, ordinal.
i.describe()

|   | Speaker | Repetition | F0 | F1 | F2 | F3 |
|---|---|---|---|---|---|---|
| count | 152.0 | 152.0 | 152.0 | 152.0 | 152.0 | 152.0 |
| mean | 38.5 | 1.5 | 197.342105 | 301.263158 | 2648.355263 | 3238.289474 |
| std | 22.009932 | 0.501653 | 61.380056 | 59.823774 | 375.365507 | 379.263012 |
| min | 1.0 | 1.0 | 105.0 | 190.0 | 2000.0 | 2600.0 |
| 25% | 19.75 | 1.0 | 135.0 | 260.0 | 2305.0 | 2917.5 |
| 50% | 38.5 | 1.5 | 201.5 | 300.0 | 2700.0 | 3190.0 |
| 75% | 57.25 | 2.0 | 245.0 | 337.25 | 2872.5 | 3485.0 |
| max | 76.0 | 2.0 | 344.0 | 590.0 | 3610.0 | 4320.0 |
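
To make the role of the predicate concrete, here is a minimal sketch on a small made-up table (the name `toy` and its values are illustrative assumptions, not P&B measurements). The comparison produces a boolean Series, and indexing with that Series keeps only the rows marked `True`.

```python
import pandas

# Toy data for illustration only; these are not P&B measurements.
toy = pandas.DataFrame(
    {
        "Vowel": ["i", "I", "i", "u"],
        "F1": [250, 400, 270, 310],
    }
)

# The comparison is evaluated element-wise and yields a boolean Series.
mask = toy.Vowel == "i"
print(mask.tolist())  # [True, False, True, False]

# Indexing with the boolean Series keeps only the rows marked True.
toy_i = toy[mask]
print(toy_i.F1.mean())  # 260.0
```

The same two steps are usually written inline, as in `pb52[pb52.Vowel == "i"]`.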


**Exercise**: Build a subset of the P&B data that excludes the high vowels /i, ɪ, u, ʊ/. (You may have to look at the X-SAMPA chart.)

In [37]:
high_vowels = frozenset(["i", "I", "u", "U"])
# Makes a list of True and False labels, then applies that "predicate" to the
# data frame. Unfortunately `pb52[pb52.Vowel in high_vowels]` doesn't work
# the way we'd like it to.
# Note that this predicate *selects* the high vowels; to exclude them, as the
# exercise asks, negate it with `vowel not in high_vowels`.
is_high = [vowel in high_vowels for vowel in pb52.Vowel]
high = pb52[is_high]
# High vowels have a low F1, so the subset that excludes them will have a
# higher mean F1 than the one shown here.
high.describe()

|   | Speaker | Repetition | F0 | F1 | F2 | F3 |
|---|---|---|---|---|---|---|
| count | 608.0 | 608.0 | 608.0 | 608.0 | 608.0 | 608.0 |
| mean | 38.5 | 1.5 | 199.074013 | 392.633224 | 1771.416118 | 2852.810855 |
| std | 21.955474 | 0.500412 | 62.375275 | 97.606337 | 781.325832 | 503.31734 |
| min | 1.0 | 1.0 | 100.0 | 190.0 | 570.0 | 1850.0 |
| 25% | 19.75 | 1.0 | 140.0 | 320.0 | 1040.0 | 2450.0 |
| 50% | 38.5 | 1.5 | 205.0 | 391.0 | 1705.0 | 2802.5 |
| 75% | 57.25 | 2.0 | 250.0 | 460.0 | 2450.0 | 3200.0 |
| max | 76.0 | 2.0 | 350.0 | 730.0 | 3610.0 | 4380.0 |
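
The exercise asks for the rows that are *not* high vowels. One way to express that exclusion, sketched here on made-up rows rather than the real data, is to build the membership predicate with `Series.isin` and negate it with `~`. The names `toy` and `non_high` and all the values are hypothetical.

```python
import pandas

# Toy data for illustration only; these are not P&B measurements.
toy = pandas.DataFrame(
    {
        "Vowel": ["i", "I", "E", "A", "u", "U"],
        "F1": [250, 400, 550, 750, 300, 450],
    }
)

high_vowels = ["i", "I", "u", "U"]
# `isin` tests membership element-wise; `~` flips True and False.
non_high = toy[~toy.Vowel.isin(high_vowels)]
print(non_high.Vowel.tolist())  # ['E', 'A']
```

Python's `in` operator tests the Series as a single object rather than element by element, which is why the membership test has to go through `isin` or a list comprehension.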


## Description

**Exercise**: Compute the mean, median, and skew for each vowel.
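
One possible approach, sketched on made-up values, is to group the rows by vowel and aggregate each group. This assumes F1 is the measurement of interest. The names `toy` and `summary` are hypothetical, and the same call should apply to `pb52` for any of the formant columns. Note that `scipy.stats.skew` returns the biased sample skewness by default.

```python
import pandas
import scipy.stats

# Toy data for illustration only; these are not P&B measurements.
toy = pandas.DataFrame(
    {
        "Vowel": ["i", "i", "i", "u", "u", "u"],
        "F1": [200, 300, 400, 300, 300, 600],
    }
)

# Group the rows by vowel, then compute one statistic per output column.
summary = toy.groupby("Vowel").F1.agg(
    mean="mean",
    median="median",
    skew=scipy.stats.skew,
)
print(summary)
```

In this toy table the /i/ values are symmetric about their mean, so their skew is zero, while the single large /u/ value pulls the mean above the median and gives a positive skew.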