# Measures of Spread in Pandas and NumPy
Common measures of spread include quantiles, standard deviation, and variance. They describe how much the values in your data vary, and where. We can call `.quantile()`, `.std()`, and `.var()` on a DataFrame, Series, or GroupBy object to calculate these measurements.
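
As a quick illustration of the three methods, here is a minimal sketch on a small made-up Series (the values are hypothetical and not taken from the census dataset). It also shows that the variance is simply the standard deviation squared.

```python
import pandas as pd

# Hypothetical toy data, not taken from census_income_data.csv.
hours = pd.Series([20, 35, 40, 40, 40, 45, 60], name="hours-per-week")

median = hours.quantile(0.5)               # 50% quantile (the median)
quartiles = hours.quantile([0.25, 0.75])   # several quantiles in one call
std = hours.std()                          # sample standard deviation (ddof=1)
var = hours.var()                          # sample variance (ddof=1)

print(median)
print(quartiles)
print(std, var, std ** 2)                  # var and std ** 2 agree up to rounding
```

Passing a list to `.quantile()` returns a Series indexed by the requested quantiles, which is the same shape of result we get from a DataFrame later in this notebook.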

Explore this notebook using the `census_income_data.csv` dataset to answer questions with these methods. We'll use the `groupby` method again to compare groups.

In [8]:
import pandas as pd

In [29]:
# Load the dataset
df = pd.read_csv('census_income_data.csv')

In [31]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Using quantile on a DataFrame

#### What was the 25%, 50%, 75%, 90%, and 95% quantile of capital gained and lost in our dataset?
Let's use the `.quantile()` method on our DataFrame to aggregate these totals at a high level.

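Wow! We had to go all the way to the 95% quantile to see any nonzero `capital-gain`, and `capital-loss` is still 0 even there. This means at least 90% of the rows in our dataset report 0 `capital-gain`, and at least 95% report 0 `capital-loss`. Presumably most people in the dataset did not sell any assets.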

In [24]:
df[['capital-gain','capital-loss']].quantile([.25,.5,.75,.9,.95])

Unnamed: 0,capital-gain,capital-loss
0.25,0.0,0.0
0.5,0.0,0.0
0.75,0.0,0.0
0.9,0.0,0.0
0.95,5013.0,0.0
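
Quantiles tell us indirectly that most values are zero. If we want the share of zeros directly, one option is to compare against 0 and take the mean of the resulting booleans. The sketch below uses a hypothetical ten-row stand-in called `toy`; in the notebook the same expression could be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the two columns of interest.
toy = pd.DataFrame({
    "capital-gain": [0, 0, 0, 0, 0, 0, 0, 0, 0, 5013],
    "capital-loss": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# True counts as 1 and False as 0, so the mean is the fraction of zero rows.
zero_share = (toy[["capital-gain", "capital-loss"]] == 0).mean()
print(zero_share)
```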


## Using Quantile, Standard Deviation, and Variance on a Group

#### What are the different workclass types?

In [36]:
df.workclass.value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

#### If we group by 'workclass', what are some interesting questions and answers?
We'll use `.quantile()`, `.std()`, and `.var()` to see how each metric tells a different story for each 'workclass'. For quantiles, since we already know that most of the capital data is 0, we will only look at the 50%, 90%, and 95% quantiles. Let's focus only on "age", "capital-gain", "capital-loss", and "hours-per-week".

We finally see some differences between work classes for `capital-gain`. Notably, `Self-emp-inc`, `Self-emp-not-inc`, and `Without-pay` have a larger share of their group selling assets: their `capital-gain` is already above 0 at the 90% quantile, whereas the other groups stay at 0 until the 95% quantile (and `Never-worked` stays at 0 throughout).

What can you say about the `age` and `hours-per-week` quantiles for the different groups?

In [44]:
# Quantiles of the selected columns for each workclass group
df.groupby(by='workclass')[["age", "capital-gain", "capital-loss", "hours-per-week"]].quantile([.5,.9,.95])

workclass,,age,capital-gain,capital-loss,hours-per-week
Federal-gov,0.5,43.0,0.0,0.0,40.0
Federal-gov,0.9,58.0,0.0,0.0,50.0
Federal-gov,0.95,61.0,7298.0,1579.55,60.0
Local-gov,0.5,41.0,0.0,0.0,40.0
Local-gov,0.9,58.0,0.0,0.0,52.0
Local-gov,0.95,63.0,5178.0,1583.4,60.0
Never-worked,0.5,18.0,0.0,0.0,35.0
Never-worked,0.9,25.8,0.0,0.0,40.0
Never-worked,0.95,27.9,0.0,0.0,40.0
Private,0.5,35.0,0.0,0.0,40.0
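
The grouped quantile result has a two-level index (workclass, then quantile), which can be long to scan. One way to compare groups side by side is to move the quantile level into the columns with `.unstack()`. This is a minimal sketch on a hypothetical `toy` frame, not the census data:

```python
import pandas as pd

# Hypothetical toy data with two work classes.
toy = pd.DataFrame({
    "workclass": ["Private"] * 4 + ["State-gov"] * 4,
    "age": [25, 35, 45, 55, 30, 40, 50, 60],
    "hours-per-week": [40, 40, 45, 50, 35, 40, 40, 40],
})

# Index is (workclass, quantile); columns are the selected measures.
q = toy.groupby("workclass")[["age", "hours-per-week"]].quantile([0.5, 0.9])

# Pivot the quantile level into the columns: one row per workclass.
wide = q.unstack()
print(wide["age"])
```

After unstacking, each row is one workclass and the columns are (measure, quantile) pairs, so `wide["age"]` shows the age quantiles for every group at a glance.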


In [38]:
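# Standard deviation of the numeric columns for each workclass
# (numeric_only=True is needed on pandas 2.0+, where text columns otherwise raise a TypeError)
df.groupby(by='workclass').std(numeric_only=True)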

workclass,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Federal-gov,11.509171,117502.359524,2.11365,4101.966767,453.504623,8.838605
Local-gov,12.272856,100254.775314,2.552536,5775.043442,439.513203,10.771559
Never-worked,4.613644,108135.748347,2.299068,0.0,0.0,15.186147
Private,12.827721,105789.668252,2.4955,6424.267599,384.157003,11.256298
Self-emp-inc,12.553194,96436.282913,2.60321,17976.548086,549.488497,13.900417
Self-emp-not-inc,13.338162,100735.75773,2.768132,10986.233506,467.611687,16.674958
State-gov,12.431065,111512.980926,2.538604,3777.749185,394.469789,11.697014
Without-pay,21.07561,85536.385921,1.685426,1300.780467,0.0,17.3579
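
One detail worth knowing when comparing pandas with NumPy: pandas `.std()` and `.var()` default to the sample formulas (`ddof=1`), while `np.std` and `np.var` default to the population formulas (`ddof=0`). A minimal sketch on made-up numbers (not from the census data):

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the population standard deviation is exactly 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                     # pandas default: divides by n - 1
population_std = np.std(values.to_numpy())    # NumPy default: divides by n
matched_std = values.std(ddof=0)              # pandas asked to divide by n

print(sample_std, population_std, matched_std)
```

The two libraries only agree once `ddof` is set to the same value, so small groups such as `Never-worked` and `Without-pay` are where the difference would be most noticeable.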


In [39]:
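# Variance of the numeric columns for each workclass
# (numeric_only=True is needed on pandas 2.0+, where text columns otherwise raise a TypeError)
df.groupby(by='workclass').var(numeric_only=True)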

workclass,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Federal-gov,132.461017,13806800000.0,4.467517,16826130.0,205666.442818,78.120942
Local-gov,150.622997,10051020000.0,6.51544,33351130.0,193171.855905,116.026473
Never-worked,21.285714,11693340000.0,5.285714,0.0,0.0,230.619048
Private,164.550434,11191450000.0,6.227522,41271210.0,147576.60314,126.704246
Self-emp-inc,157.58267,9299957000.0,6.776703,323156300.0,301937.608495,193.221591
Self-emp-not-inc,177.906562,10147690000.0,7.662553,120697300.0,218660.689455,278.05423
State-gov,154.531374,12435140000.0,6.44451,14271390.0,155606.414471,136.820127
Without-pay,444.181319,7316473000.0,2.840659,1692030.0,0.0,301.296703


Because variance is the square of the standard deviation, large deviations from the mean are amplified, which can help spot distributions with outliers. Look again at the three government groups. For `age` and `hours-per-week`, the state government group is more widely spread than the other two, while `Local-gov` has roughly double the `capital-gain` variance of the other two. Maybe the outliers in these groups are more obvious?
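
To see how squaring amplifies an outlier, here is a minimal sketch with two hypothetical groups, `A` and `B`, that differ only in the size of one extreme value (made-up numbers, not from the census data):

```python
import pandas as pd

# Hypothetical groups: identical except the single nonzero gain is 100x larger in B.
toy = pd.DataFrame({
    "workclass": ["A"] * 5 + ["B"] * 5,
    "capital-gain": [0, 0, 0, 0, 100, 0, 0, 0, 0, 10000],
})

stats = toy.groupby("workclass")["capital-gain"].agg(["std", "var"])
stats["std_squared"] = stats["std"] ** 2   # matches the "var" column

print(stats)
print(stats.loc["B", "std"] / stats.loc["A", "std"])   # ratio of standard deviations
print(stats.loc["B", "var"] / stats.loc["A", "var"])   # ratio of variances
```

The outlier in `B` is 100 times larger, so the standard deviation grows by a factor of about 100 while the variance grows by a factor of about 10,000.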

Some additional notes about variance:
 - Variance can be used to spot outliers, but better methods are available, such as the interquartile range (IQR) or the median absolute deviation (MAD).
 - Variance is important in other areas, like calculating confidence intervals, data normalization, testing hypotheses, and assessing the quality of statistical models.
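
As a rough sketch of the IQR and MAD ideas mentioned above, the example below flags outliers in a small made-up Series. The 1.5 × IQR fences and the cutoff of 3 MADs are common conventions assumed here for illustration, not rules from this notebook.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
gains = pd.Series([0, 0, 0, 10, 20, 30, 40, 50, 60, 5000])

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = gains.quantile(0.25), gains.quantile(0.75)
iqr = q3 - q1
iqr_outliers = gains[(gains < q1 - 1.5 * iqr) | (gains > q3 + 1.5 * iqr)]

# MAD rule: flag values more than 3 MADs away from the median (assumed cutoff).
median = gains.median()
mad = (gains - median).abs().median()
mad_outliers = gains[(gains - median).abs() > 3 * mad]

print(iqr, mad)
print(iqr_outliers.tolist(), mad_outliers.tolist())
```

Note that both rules break down when more than half of the values are identical, as with `capital-gain` in our dataset: the IQR and MAD are then 0, and every nonzero value would be flagged.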

In [47]:
df.groupby(by='workclass')[['capital-gain','capital-loss']].describe(percentiles=[.50,.90,.95])

,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss
workclass,count,mean,std,min,50%,90%,95%,max,count,mean,std,min,50%,90%,95%,max
Federal-gov,960.0,833.232292,4101.966767,0.0,0.0,0.0,7298.0,99999.0,960.0,112.26875,453.504623,0.0,0.0,0.0,1579.55,3683.0
Local-gov,2093.0,880.20258,5775.043442,0.0,0.0,0.0,5178.0,99999.0,2093.0,109.854276,439.513203,0.0,0.0,0.0,1583.4,2444.0
Never-worked,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Private,22696.0,889.217792,6424.267599,0.0,0.0,0.0,4386.0,99999.0,22696.0,80.008724,384.157003,0.0,0.0,0.0,0.0,4356.0
Self-emp-inc,1116.0,4875.693548,17976.548086,0.0,0.0,14084.0,15024.0,99999.0,1116.0,155.138889,549.488497,0.0,0.0,0.0,1902.0,2559.0
Self-emp-not-inc,2541.0,1886.061787,10986.233506,0.0,0.0,1409.0,7430.0,99999.0,2541.0,116.631641,467.611687,0.0,0.0,0.0,1672.0,2824.0
State-gov,1298.0,701.699538,3777.749185,0.0,0.0,0.0,5178.0,99999.0,1298.0,83.256549,394.469789,0.0,0.0,0.0,0.0,3683.0
Without-pay,14.0,487.857143,1300.780467,0.0,0.0,1689.8,3114.7,4416.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Remember that we can use the `.describe()` method on groups to produce many of these measurements. We can also use the `percentiles` parameter to customize which quantiles we want to target.
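
The grouped `.describe()` output has two-level columns (original column, then statistic), so it can help to pull out only the statistics we care about. A minimal sketch on a hypothetical `toy` frame, not the census data:

```python
import pandas as pd

# Hypothetical toy data with two work classes.
toy = pd.DataFrame({
    "workclass": ["Private"] * 4 + ["State-gov"] * 4,
    "capital-gain": [0, 0, 0, 1000, 0, 0, 500, 500],
})

desc = toy.groupby("workclass")[["capital-gain"]].describe(percentiles=[0.5, 0.9])

# Select one original column, then just the statistics of interest.
subset = desc["capital-gain"][["std", "50%", "90%"]]
print(subset)
```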

In [48]:
df.describe(include='all')

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
count,32561.0,30725,32561.0,32561,32561.0,32561,30718,32561,32561,32561,32561.0,32561.0,32561.0,31978,32561
unique,,8,,16,,7,14,6,5,2,,,,41,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,22696,,10501,,14976,4140,13193,27816,21790,,,,29170,24720
mean,38.581647,,189778.4,,10.080679,,,,,,1077.648844,87.30383,40.437456,,
std,13.640433,,105550.0,,2.57272,,,,,,7385.292085,402.960219,12.347429,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117827.0,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178356.0,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237051.0,,12.0,,,,,,0.0,0.0,45.0,,
