# Week 3: Advanced Python Pandas

# Scales

How do we capture the scale of the data?

Consider grades: is the difference between an A and an A- the same as the difference between an A- and a B+?

In data science there are generally four types of scales:

* Ratio scale
    * Units equally spaced
    * Mathematical operations are valid
    * Example: height and weight
* Interval scale
    * Units equally spaced, but no true zero
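    * Addition and subtraction are meaningful, but multiplication and division are not
    * Example: temperature measured in Celsius or Fahrenheit, where 0 degrees does not mean "no temperature"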
* Ordinal scale
    * The order of units is important, but not evenly spaced
    * Equal steps between grades do not imply equal steps between the underlying scores
    * Example: Letter grades for a class
* Nominal scale
    * Categories of data, but categories have no order with respect to one another
    * Example: Teams of a sport
    * Called "categorical data" in `pandas`
    
`pandas` has a number of methods for dealing with scales. There is a built-in type for categorical data, and you can convert existing data to it with `astype`.
    
Play around with a dataframe of grades.

In [1]:
import numpy as np
import pandas as pd

In [7]:
df = pd.DataFrame(['A+','A','A-','B+','B','B-','C+','C','C-','D+','D'],index=['excellent','excellent','excellent','good','good','good','ok','ok','ok','poor','poor'])
df.rename(columns={0:'Grades'}, inplace=True)

df

          Grades
excellent     A+
excellent      A
excellent     A-
good          B+
good           B
good          B-
ok            C+
ok             C
ok            C-
poor          D+
poor           D


Convert the `Grades` column to the `category` dtype. Selecting a single column returns a series, so the result is a categorical series rather than a dataframe.

In [8]:
df['Grades'].astype('category')

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
ok            C
ok           C-
poor         D+
poor          D
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

Notice that right now, there is no "ordering" of the grades. They are just presented in a list.
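As a quick check of that claim, the sketch below rebuilds the grades as a standalone series (the name `grades_unordered` is only illustrative) and inspects the category metadata through the `.cat` accessor. It also shows that an order comparison is rejected while the categories are unordered.

```python
import pandas as pd

# Same letter grades as above, rebuilt here so the snippet runs on its own.
grades_unordered = pd.Series(
    ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D']
).astype('category')

# The .cat accessor exposes the category metadata.
print(grades_unordered.cat.ordered)            # False: no order has been declared
print(list(grades_unordered.cat.categories))   # categories sorted as plain strings

# Order comparisons are rejected for unordered categoricals.
try:
    grades_unordered > 'C'
    comparison_allowed = True
except TypeError:
    comparison_allowed = False

print(comparison_allowed)
```

Equality tests such as `grades_unordered == 'C'` still work; only `<` and `>` style comparisons need an order.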

If we want to indicate that there is an order, we set `ordered=True` when we create the categorical type, and this is reflected in the `dtype`. List the categories from lowest to highest.

In the output, the categories are now separated by `<` signs to show their order.

In [12]:
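# Recent pandas no longer accepts categories=/ordered= in astype; pass a CategoricalDtype instead
grade_order = ['D','D+','C-','C','C+','B-','B','B+','A-','A','A+']
grades = df['Grades'].astype(pd.CategoricalDtype(categories=grade_order, ordered=True))
grades.head()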

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

Because the categories are now ordered, comparison operators broadcast across the whole series.

Ordinal data has an ordering, which helps with Boolean masking. Instead of coding the letter grades as numbers so that we can pick everything greater than a C, we can compare against the category directly.

This is useful in feature extraction, in which we can quickly find which elements are in (or above or below) a certain category.

In [13]:
grades > 'C'

excellent     True
excellent     True
excellent     True
good          True
good          True
good          True
ok            True
ok           False
ok           False
poor         False
poor         False
Name: Grades, dtype: bool
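The Boolean series can be used directly as a mask. The following is a minimal, self-contained sketch; the names `letter_order`, `grade_dtype`, `grades_ordered`, and `above_c` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Categories listed from lowest to highest.
letter_order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
grade_dtype = pd.CategoricalDtype(categories=letter_order, ordered=True)

grades_ordered = pd.Series(
    ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
    index=['excellent'] * 3 + ['good'] * 3 + ['ok'] * 3 + ['poor'] * 2,
).astype(grade_dtype)

# Keep only the rows whose grade is strictly above a C.
above_c = grades_ordered[grades_ordered > 'C']
print(above_c.tolist())

# Ordered categoricals also support min and max.
print(grades_ordered.min(), grades_ordered.max())
```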

Sometimes it is useful to represent each category as its own column of `True` or `False` values indicating whether that category applies. This is common in feature extraction. Variables with Boolean values are typically called dummy variables, and `pandas` has a built-in function called `get_dummies` that converts a categorical column into such indicator columns of 0s and 1s.
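A small sketch of dummy variables on made-up nominal data follows. The `teams` series and its values are invented for illustration, and `dtype=int` is passed so the indicators come out as 0s and 1s regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical nominal data: team colours have no natural order.
teams = pd.Series(['red', 'blue', 'green', 'blue'], name='team')

# One indicator column per category; each row has exactly one 1.
dummies = pd.get_dummies(teams, dtype=int)
print(dummies)
```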

# Collapsing ratio data into categorical data

Sometimes we want to reduce a value on a ratio scale to a categorical scale. This loses information about the exact value, but it is useful when visualizing values, such as in a histogram.

## Cut based on size

`pandas` has a function called `cut`, which takes an array-like structure and a number of bins to be used. 

Go back to the census data for an example.

In [17]:
df = pd.read_csv(r'../data/census.csv')

In [19]:
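df = df[df['SUMLEV'] == 50]  # Keep county-level rows only
# Group by state, then average the 2010 census population of its counties.
# Named aggregation is used because recent pandas rejects the dict-renaming form agg({'avg': ...}).
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(avg='mean')
df.head()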

                      avg
STNAME
Alabama      71339.343284
Alaska       24490.724138
Arizona     426134.466667
Arkansas     38878.906667
California  642309.586207


Now cut the data into 10 equal-width bins.

In [21]:
pd.cut(df['avg'],10).head()

STNAME
Alabama        (11706.0871, 75333.413]
Alaska         (11706.0871, 75333.413]
Arizona       (390320.176, 453317.529]
Arkansas       (11706.0871, 75333.413]
California    (579312.234, 642309.586]
Name: avg, dtype: category
Categories (10, object): [(11706.0871, 75333.413] < (75333.413, 138330.766] < (138330.766, 201328.118] < (201328.118, 264325.471] ... (390320.176, 453317.529] < (453317.529, 516314.881] < (516314.881, 579312.234] < (579312.234, 642309.586]]

Try cutting height data into three bins. Each bin covers an interval of the same width, so the number of values per bin can differ.

In [9]:
heights = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

In [10]:
pd.cut(heights,3)

0     (167.978, 175.333]
1     (175.333, 182.667]
2     (167.978, 175.333]
3         (182.667, 190]
4     (167.978, 175.333]
5         (182.667, 190]
6     (175.333, 182.667]
7     (175.333, 182.667]
8     (167.978, 175.333]
9     (167.978, 175.333]
10    (175.333, 182.667]
11    (175.333, 182.667]
12    (175.333, 182.667]
13    (167.978, 175.333]
dtype: category
Categories (3, object): [(167.978, 175.333] < (175.333, 182.667] < (182.667, 190]]

You can also add labels for the three categories.

In [16]:
pd.cut(heights,3, labels=['Small','Medium','Large'])

0      Small
1     Medium
2      Small
3      Large
4      Small
5      Large
6     Medium
7     Medium
8      Small
9      Small
10    Medium
11    Medium
12    Medium
13     Small
dtype: category
Categories (3, object): [Small < Medium < Large]

Nice! Passing `retbins=True` makes `cut` return a tuple whose second element is the array of bin edges.

In [12]:
pd.cut(heights,3, labels=['Small','Medium','Large'], retbins=True)[1]

array([ 167.978     ,  175.33333333,  182.66666667,  190.        ])
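Instead of a bin count, `cut` also accepts an explicit list of bin edges. The sketch below uses hand-picked edges (160, 175, 185, 200), chosen only for illustration; each interval is closed on the right by default, so a height of exactly 175 lands in the first bin.

```python
import pandas as pd

heights = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# Hypothetical edges: (160, 175], (175, 185], (185, 200]
edges = [160, 175, 185, 200]
sized = pd.cut(heights, bins=edges, labels=['Small', 'Medium', 'Large'])

# Count how many heights fall into each labelled bin.
counts = sized.value_counts()
print(counts)
```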

## Cut based on frequency

What if you want to cut based on frequency? In that case you want each bin to hold the same number of items, rather than to span the same width. That is another challenge.

We will look at this in the graphing and charting course (later in the specialization). 
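In the meantime, one option in `pandas` is `qcut`, which places bin edges at quantiles so that each bin receives roughly the same number of values. This is a minimal sketch on the same heights, not necessarily the approach the later course takes.

```python
import pandas as pd

heights = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# Split at the 1/3 and 2/3 quantiles instead of at equal-width boundaries.
by_frequency = pd.qcut(heights, 3, labels=['Small', 'Medium', 'Large'])

# Bin sizes should be as close to equal as the data allows.
counts = by_frequency.value_counts()
print(counts)
```

With 14 values and three bins the counts cannot be identical, but they should differ by at most one.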