## Scales:
### 4 different types of scales of importance to data scientists
* Ratio:
    - values are equally spaced, and zero means a true absence of the quantity
    - mathematical operations (addition, subtraction, multiplication, division) are all valid
* Interval scale:
    - values are equally spaced, but zero does not mean an 'absence of value'
    - example: directions on a compass, where 0 is a meaningful value
    - mathematical operations such as multiplication and division are not valid on interval scales
* Ordinal scale:
    - order of the scale is significant
    - units are not (necessarily) evenly spaced
    - grades are a good example of this
* Nominal scale (categorical data):
    - categories have no order with respect to one another
    - applying mathematical functions to them is meaningless
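
The difference between interval and ratio scales is easier to see with numbers. The sketch below uses temperature, which is an illustrative example not taken from the list above: Celsius is an interval scale, while Kelvin has a true zero and is a ratio scale. The variable names are hypothetical.

```python
# Temperatures in Celsius sit on an interval scale: 0 °C is not "no temperature".
celsius_a, celsius_b = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = celsius_b - celsius_a

# Ratios are not: the apparent "factor of 2" disappears once the zero point moves.
ratio_celsius = celsius_b / celsius_a

# Kelvin is a ratio scale, so its ratio is the physically meaningful one.
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15
ratio_kelvin = kelvin_b / kelvin_a

print(difference, ratio_celsius, round(ratio_kelvin, 3))
```

The difference of 10 degrees is the same on both scales, but only the Kelvin ratio says anything about how much warmer one value is than the other.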

Pandas has ways of dealing with each of these scale types. For example, pandas has a built-in type for categorical data. You can set a column of your data as categorical by using the 'astype' method. Categorical data can be converted to ordinal data by passing in an 'ordered' flag set to True.

In [4]:
import pandas as pd

df = pd.DataFrame(['5','4','3','2','1','0'],
                  index = ['excellent', 'good', 'good', 'fair', 'fair', 'poor'],
                  columns = ['ratings'])
df

Unnamed: 0,ratings
excellent,5
good,4
good,3
fair,2
fair,1
poor,0


In [11]:
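# check the current column types (not displayed: only the last expression in a cell is shown)
df.dtypes
# convert the ratings column to the categorical type
df['ratings'].astype("category").head()
# note: head() shows only the first five rows, so the 'poor' row is not displayed;
# all six categories, including '0', are still listed in the dtype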

excellent    5
good         4
good         3
fair         2
fair         1
Name: ratings, dtype: category
Categories (6, object): ['0', '1', '2', '3', '4', '5']

In [13]:
# these categories can be explicitly ordered
ratings_categories = pd.CategoricalDtype(categories=['0', '1', '2', '3', '4', '5'],
                                         ordered=True)
ratings = df['ratings'].astype(ratings_categories)
ratings.head()


excellent    5
good         4
good         3
fair         2
fair         1
Name: ratings, dtype: category
Categories (6, object): ['0' < '1' < '2' < '3' < '4' < '5']

In [None]:
# on an ordered categorical, you can use comparison operators such as '>' and '<'
# aggregations such as min and max are also valid
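
A minimal sketch of those comparisons follows. It rebuilds the ratings data under the hypothetical name `ratings_df` so that it runs on its own; `above_three` is also a name introduced here for illustration.

```python
import pandas as pd

# Rebuild the small ratings frame so this example is self-contained.
ratings_df = pd.DataFrame(['5', '4', '3', '2', '1', '0'],
                          index=['excellent', 'good', 'good', 'fair', 'fair', 'poor'],
                          columns=['ratings'])
ratings_categories = pd.CategoricalDtype(categories=['0', '1', '2', '3', '4', '5'],
                                         ordered=True)
ratings = ratings_df['ratings'].astype(ratings_categories)

# Comparing against a category label yields a boolean Series,
# which can then be used as a mask.
above_three = ratings > '3'
print(ratings[above_three])

# min and max follow the declared category order, not the order of the rows.
print(ratings.min(), ratings.max())
```

The comparison works because the dtype was declared with `ordered=True`; on an unordered categorical, pandas raises a `TypeError` for `>` and `<`.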

## Converting interval or ratio scales to categorical data
A common practice is to assign 'cuts' or 'bins' to numerical data. Although this discards granular information, it can help in visualizing the overall shape of the data. This type of reduction is often represented visually in a histogram. Here's an example of that process using census data.

In [16]:
import numpy as np

df = pd.read_csv('../resources/week-3/datasets/census.csv')

# reduce to county-level data (SUMLEV == 50) using a boolean mask:
# keep only the rows where the value in the df['SUMLEV'] column is 50
df = df[df['SUMLEV'] == 50]

df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [17]:
# group the counties by state and average their 2010 census population (df becomes a Series)
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)
df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [19]:
# These values can be assigned to bins with the cut method (10 equal-width bins in this example)
pd.cut(df, 10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

In [None]:
# cut gives you bins of equal width. Sometimes you want to base the bins on other things, such as frequency
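
One way to get frequency-based bins is `pd.qcut`, which picks bin edges from quantiles so that each bin holds roughly the same number of values. The sketch below compares it with `pd.cut` on a small made-up Series; the values and the names `values`, `equal_width`, and `equal_count` are hypothetical and are not taken from the census data.

```python
import pandas as pd

# Made-up values with one large outlier, standing in for real data.
values = pd.Series([10, 20, 30, 40, 50, 60, 70, 1000],
                   index=list('ABCDEFGH'))

# pd.cut: four bins of equal width, so the outlier leaves most bins nearly empty.
equal_width = pd.cut(values, 4)

# pd.qcut: four bins with edges at the quartiles, so each bin holds a similar count.
equal_count = pd.qcut(values, 4)

# Count how many values land in each bin, listed in bin order.
print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

Equal-width bins preserve the numeric spacing of the data, while quantile bins make each category comparably populated. Which one fits depends on whether the analysis cares about the values themselves or about their rank.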