In [78]:
!python -V
import StringIO
import pandas as pd
print "pandas version:", pd.__version__

Python 2.7.10 :: Anaconda 2.2.0 (x86_64)
pandas version: 0.16.2


# 1. Introduction

This notebook corresponds to the [Introduction to the features of SAS](http://www.ats.ucla.edu/stat/sas/modules/intsas.htm) page. That tutorial covers basic features of SAS and applies them to the cars dataset. The dataset records each car's **make, price, miles per gallon, repair rating (in 1978), weight in pounds, length in inches,** and whether the car was **foreign** or **domestic**.

Note: If you haven't read the **Main Differences Between SAS and Python** file, please read it before continuing. It covers important topics like importing packages and declaring objects.

---
## Importing the data
A common way of importing small datasets within SAS is the `datalines` [statement](https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000188182.htm), which enters data directly into a program. In Python, equivalent functionality is possible by wrapping the data in an object of the `StringIO` [class](https://docs.python.org/2/library/stringio.html). This object holds the string in an in-memory buffer that behaves like a file, so `pandas` can read it with `read_table`.

We will store this object in a variable named `datalines`.



<div class="pynote">
**Python Note**: Multiline strings start and end with triple quotes (`'''` or `"""`).
</div>

In [79]:
datalines = StringIO.StringIO('''
AMC     4099 22  3     2930   186    0
AMC     4749 17  3     3350   173    0
AMC     3799 22  3     2640   168    0
Audi    9690 17  5     2830   189    1
Audi    6295 23  3     2070   174    1
BMW     9735 25  4     2650   177    1
Buick   4816 20  3     3250   196    0
Buick   7827 15  4     4080   222    0
Buick   5788 18  3     3670   218    0
Buick   4453 26  3     2230   170    0
Buick   5189 20  3     3280   200    0
Buick  10372 16  3     3880   207    0
Buick   4082 19  3     3400   200    0
Cad.   11385 14  3     4330   221    0
Cad.   14500 14  2     3900   204    0
Cad.   15906 21  3     4290   204    0
Chev.   3299 29  3     2110   163    0
Chev.   5705 16  4     3690   212    0
Chev.   4504 22  3     3180   193    0
Chev.   5104 22  2     3220   200    0
Chev.   3667 24  2     2750   179    0
Chev.   3955 19  3     3430   197    0
Datsun  6229 23  4     2370   170    1
Datsun  4589 35  5     2020   165    1
Datsun  5079 24  4     2280   170    1
Datsun  8129 21  4     2750   184    1
''')

### SAS Code
    DATA auto ;
        INPUT make $ price mpg rep78 weight length foreign ;
        datalines;
        [data]
        run;


#### Python Code


In [80]:
auto = pd.read_table(datalines,
                     delim_whitespace=True,
                     names=['make', 'price', 'mpg', 'rep78', 'weight', 'length', 'foreign'])

In [81]:
datalines.close()

<div class="pynote">

**Python Notes**
`pd.read_table` calls the `read_table()` function from the `pandas` package. The first argument is the data we want to read, in this instance our `datalines` variable. Often you'll replace this first argument with the path of the file you want to read into a <em>DataFrame</em>. The `delim_whitespace=True` argument tells the function to treat any run of whitespace as the delimiter. The `names` argument is similar to the SAS `INPUT` statement: it provides names for all the variables. One difference is that each variable's type is inferred from the data, whereas SAS requires the `$` indicator to mark character variables.  
<br>
`datalines.close()` frees up the memory buffer we created to store our string object. We won't need it anymore since the data is now stored in `auto`.
</div>
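
The cells above target Python 2 and `pandas` 0.16.2. As a minimal sketch for readers on Python 3 with a recent `pandas`, the same import can be written with `io.StringIO` and a whitespace regular expression as the separator. The three-row excerpt and the names `sample`, `columns` and `sample_auto` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# A three-row excerpt of the cars data, enough to illustrate the call.
sample = io.StringIO('''
AMC     4099 22  3     2930   186    0
Audi    9690 17  5     2830   189    1
BMW     9735 25  4     2650   177    1
''')

columns = ['make', 'price', 'mpg', 'rep78', 'weight', 'length', 'foreign']

# sep=r'\s+' splits on runs of whitespace, like delim_whitespace=True above.
sample_auto = pd.read_table(sample, sep=r'\s+', names=columns)
sample.close()

# Column types are inferred: make is text, the remaining columns are integers.
print(sample_auto.dtypes)
```

No type declarations are needed here, which is the counterpart of leaving out the `$` indicator in the SAS `INPUT` statement.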

## Viewing a sample of the data
`PROC PRINT` prints the data to ODS, and the `obs=10` dataset option limits the output to the first 10 observations. In Python, the `.head()` method does the same for a `DataFrame` object. Instead of `obs=10`, pass `n=10` or simply `10` inside the parentheses.

### SAS Code
    PROC PRINT DATA=auto(obs=10);
    RUN;

#### Python Code

In [176]:
auto.head(10)

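| | make | price | mpg | rep78 | weight | length | foreign |
|---|---|---|---|---|---|---|---|
| 0 | AMC | 4099 | 22 | 3 | 2930 | 186 | 0 |
| 1 | AMC | 4749 | 17 | 3 | 3350 | 173 | 0 |
| 2 | AMC | 3799 | 22 | 3 | 2640 | 168 | 0 |
| 3 | Audi | 9690 | 17 | 5 | 2830 | 189 | 1 |
| 4 | Audi | 6295 | 23 | 3 | 2070 | 174 | 1 |
| 5 | BMW | 9735 | 25 | 4 | 2650 | 177 | 1 |
| 6 | Buick | 4816 | 20 | 3 | 3250 | 196 | 0 |
| 7 | Buick | 7827 | 15 | 4 | 4080 | 222 | 0 |
| 8 | Buick | 5788 | 18 | 3 | 3670 | 218 | 0 |
| 9 | Buick | 4453 | 26 | 3 | 2230 | 170 | 0 |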


<div class="pynote">
**Python Note**: Python sequences and the default `pandas` row labels are [zero indexed](https://en.wikipedia.org/wiki/Zero-based_numbering), so the first observation is row 0 rather than row 1.
</div>
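
As a small illustration of `.head()` and zero-based indexing, the sketch below uses a hypothetical three-row frame named `df` that stands in for `auto`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for `auto`.
df = pd.DataFrame({'make': ['AMC', 'Audi', 'BMW'],
                   'price': [4099, 9690, 9735]})

first_two = df.head(2)          # identical to df.head(n=2)
print(first_two)

# The default row labels start at 0, so the first observation has label 0.
print(df.loc[0, 'make'])

# Positional slicing also starts at 0 and excludes the end position.
print(df.iloc[0:2])
```

Where SAS would report these as observations 1 and 2, `pandas` labels them 0 and 1.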

## 2. Descriptive Statistics
The counterpart of SAS's `PROC MEANS` is the `.describe()` method called on a `DataFrame`. By default it returns basic descriptive statistics for all numeric variables, and because it includes quartiles it overlaps somewhat with the distributional details of `PROC UNIVARIATE`.

### SAS Code
    PROC MEANS DATA=auto;
    RUN;

#### Python Code

In [154]:
auto.describe()

| | price | mpg | rep78 | weight | length | foreign |
|---|---|---|---|---|---|---|
| count | 26.0 | 26.0 | 26.0 | 26.0 | 26.0 | 26.0 |
| mean | 6651.730769 | 20.923077 | 3.269231 | 3099.230769 | 190.076923 | 0.269231 |
| std | 3371.119809 | 4.757504 | 0.77757 | 695.079409 | 18.170136 | 0.452344 |
| min | 3299.0 | 14.0 | 2.0 | 2020.0 | 163.0 | 0.0 |
| 25% | 4465.75 | 17.25 | 3.0 | 2642.5 | 173.25 | 0.0 |
| 50% | 5146.5 | 21.0 | 3.0 | 3200.0 | 191.0 | 0.0 |
| 75% | 8053.5 | 23.0 | 4.0 | 3610.0 | 203.0 | 0.75 |
| max | 15906.0 | 35.0 | 5.0 | 4330.0 | 222.0 | 1.0 |
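
Note that **make** is missing from the summary because it is not numeric. The sketch below, on a hypothetical four-row frame named `df`, shows the default behaviour next to `include='all'`, which also summarizes text columns. It assumes a `pandas` version in which `.describe()` accepts the `include=` argument.

```python
import pandas as pd

# Hypothetical four-row frame with one text column and two numeric columns.
df = pd.DataFrame({
    'make': ['AMC', 'AMC', 'Audi', 'BMW'],
    'price': [4099, 4749, 9690, 9735],
    'mpg': [22, 17, 17, 25],
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include='all')   # adds unique/top/freq for text columns

print(numeric_summary.columns.tolist())
print(full_summary.loc[['count', 'unique', 'top', 'freq'], 'make'])
```

For a text column, `top` is the most frequent value and `freq` is how often it occurs.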


You can get descriptive statistics separately for foreign and domestic cars (i.e. broken down by *foreign*) as shown below. For additional reference, see the `pandas` [groupby](http://pandas.pydata.org/pandas-docs/stable/groupby.html) documentation.

In [89]:
auto.groupby('foreign').describe()

| foreign | | length | mpg | price | rep78 | weight |
|---|---|---|---|---|---|---|
| 0 | count | 19.0 | 19.0 | 19.0 | 19.0 | 19.0 |
| 0 | mean | 195.421053 | 19.789474 | 6484.157895 | 2.947368 | 3347.894737 |
| 0 | std | 17.963901 | 4.03566 | 3768.461479 | 0.524265 | 627.176911 |
| 0 | min | 163.0 | 14.0 | 3299.0 | 2.0 | 2110.0 |
| 0 | 25% | 182.5 | 16.5 | 4090.5 | 3.0 | 3055.0 |
| 0 | 50% | 200.0 | 20.0 | 4816.0 | 3.0 | 3350.0 |
| 0 | 75% | 205.5 | 22.0 | 6807.5 | 3.0 | 3785.0 |
| 0 | max | 222.0 | 29.0 | 15906.0 | 4.0 | 4330.0 |
| 1 | count | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 |
| 1 | mean | 175.571429 | 24.0 | 7106.571429 | 4.142857 | 2424.285714 |
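
When only a few statistics for one variable are needed, a grouped `.agg()` call gives a more compact table, much like adding a `CLASS foreign;` statement to `PROC MEANS`. The sketch below rebuilds just the **price** and **foreign** columns from the data above; the names `auto_subset` and `price_by_origin` are illustrative.

```python
import pandas as pd

# price and foreign columns, copied from the datalines block above.
auto_subset = pd.DataFrame({
    'price': [4099, 4749, 3799, 9690, 6295, 9735, 4816, 7827, 5788,
              4453, 5189, 10372, 4082, 11385, 14500, 15906, 3299, 5705,
              4504, 5104, 3667, 3955, 6229, 4589, 5079, 8129],
    'foreign': [0] * 3 + [1] * 3 + [0] * 16 + [1] * 4,
})

# One row per value of `foreign`, one column per requested statistic.
price_by_origin = (auto_subset.groupby('foreign')['price']
                   .agg(['count', 'mean', 'min', 'max']))
print(price_by_origin)
```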


### Detailed Univariate Statistics
The next point in the original tutorial covers `PROC UNIVARIATE` on the **price** variable. `PROC UNIVARIATE` produces a large amount of output. We'll try to replicate its main pieces but leave others (like t-tests) out. We'll cover:
- Skew / Kurtosis
- Extreme observations (top / bottom 5)
- Additional quantiles beyond 0/25/50/75/100%

**Skew / Kurtosis**  
`pandas` has built-in [Skew](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.skew.html) / [Kurtosis](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.kurtosis.html) methods that operate on a data frame or on a single column. We select the column/variable to apply them to with `pandas`' convenient [dictionary notation](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion).

In [94]:
auto['price'].skew()

1.4707269994573371

In [95]:
auto['price'].kurtosis()

1.5346717037845785
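
To make these two numbers less of a black box, the sketch below recomputes them by hand. It assumes that `pandas` reports the bias-adjusted sample skewness and the bias-adjusted excess kurtosis (0 for a normal distribution), which is our reading of its documentation. The names `manual_skew` and `manual_kurt` are illustrative.

```python
import pandas as pd

# The price column, copied from the datalines block above.
prices = pd.Series([4099, 4749, 3799, 9690, 6295, 9735, 4816, 7827, 5788,
                    4453, 5189, 10372, 4082, 11385, 14500, 15906, 3299, 5705,
                    4504, 5104, 3667, 3955, 6229, 4589, 5079, 8129],
                   name='price')

n = len(prices)
deviations = prices - prices.mean()
m2 = (deviations ** 2).mean()   # second central moment
m3 = (deviations ** 3).mean()   # third central moment
m4 = (deviations ** 4).mean()   # fourth central moment

# Bias-adjusted sample skewness.
manual_skew = (n * (n - 1)) ** 0.5 / (n - 2) * m3 / m2 ** 1.5

# Bias-adjusted excess kurtosis.
manual_kurt = ((n - 1) / ((n - 2) * (n - 3))
               * ((n + 1) * m4 / m2 ** 2 - 3 * (n - 1)))

print(manual_skew, prices.skew())
print(manual_kurt, prices.kurtosis())
```

Both values are positive, meaning the price distribution has a long right tail and is more heavy-tailed than a normal distribution.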

**Extreme Observations (Top & Bottom 5)**  
To do this, we first sort the **price** series with the `.order()` method (renamed `.sort_values()` in `pandas` 0.17 and later), then call `.head()` and `.tail()` on the sorted series.

In [104]:
auto['price'].order().head() # Bottom 5 (Minimums) and observation numbers

16    3299
20    3667
2     3799
21    3955
12    4082
Name: price, dtype: int64

In [105]:
auto['price'].order().tail() # Top 5 (Maximums) and observation numbers

5      9735
11    10372
13    11385
14    14500
15    15906
Name: price, dtype: int64

**Bonus**: The `.order()` method has an optional `ascending=` argument that changes the default sort order. We can use it to display the 5 greatest values from highest to lowest.

In [107]:
auto['price'].order(ascending=False).head() # Top 5 (Maximums), greatest to least

15    15906
14    14500
13    11385
11    10372
5      9735
Name: price, dtype: int64
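
In later `pandas` releases `.order()` is no longer available. As a sketch for those versions, `.sort_values()` takes its place, and `.nsmallest()` / `.nlargest()` return the extreme observations directly. The variable names are illustrative.

```python
import pandas as pd

# The price column, copied from the datalines block above.
prices = pd.Series([4099, 4749, 3799, 9690, 6295, 9735, 4816, 7827, 5788,
                    4453, 5189, 10372, 4082, 11385, 14500, 15906, 3299, 5705,
                    4504, 5104, 3667, 3955, 6229, 4589, 5079, 8129],
                   name='price')

# Drop-in replacements for .order().head() and .order(ascending=False).head()
lowest = prices.sort_values().head()
highest = prices.sort_values(ascending=False).head()

# Shortcuts that return the 5 smallest / largest values with their row labels.
print(prices.nsmallest(5))
print(prices.nlargest(5))
```

The row labels (observation numbers) are kept in every case, so the results line up with the output shown above.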

**Additional quantiles beyond 0/25/50/75/100%**  
The `pandas` `.quantile()` method returns the value at each requested quantile, given as a list of fractions between 0 and 1.

In [111]:
auto['price'].quantile([.99, .95, .90, .10, .05, .01])

0.99    15554.50
0.95    13721.25
0.90    10878.50
0.10     3877.00
0.05     3700.00
0.01     3391.00
dtype: float64

**Bonus**: You can also supply a `percentiles=` argument to `.describe()` to get similar output. Note, however, that it does not keep the values in the order given.

In [114]:
auto['price'].describe(percentiles=[.99, .95, .90, .10, .05, .01])

count       26.000000
mean      6651.730769
std       3371.119809
min       3299.000000
10%       3877.000000
5%        3700.000000
1%        3391.000000
50%       5146.500000
99%      15554.500000
95%      13721.250000
90%      10878.500000
max      15906.000000
Name: price, dtype: float64
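
Values such as 10878.50 do not appear in the data because `.quantile()` interpolates between neighbouring observations. The sketch below reproduces that with a hypothetical helper, `linear_quantile`, assuming the default linear interpolation.

```python
import math
import pandas as pd

# The price column, copied from the datalines block above.
prices = pd.Series([4099, 4749, 3799, 9690, 6295, 9735, 4816, 7827, 5788,
                    4453, 5189, 10372, 4082, 11385, 14500, 15906, 3299, 5705,
                    4504, 5104, 3667, 3955, 6229, 4589, 5079, 8129],
                   name='price')


def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q    # fractional zero-based rank
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower            # distance past the lower neighbour
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


print(linear_quantile(prices, 0.90))
print(prices.quantile(0.90))
```

For the 90th percentile the rank is 25 × 0.9 = 22.5, halfway between the sorted values 10372 and 11385, which gives 10878.5.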

## Frequency Distributions
`PROC FREQ` is one of the most commonly used procedures in SAS. Similar, stripped-down functionality is available via the `.value_counts()` method applied to a series.

In [119]:
auto['rep78'].value_counts()

3    15
4     6
2     3
5     2
dtype: int64

`.value_counts()` returns the values sorted from most to least frequent. To turn this off, supply the `sort=False` argument.

In [120]:
auto['rep78'].value_counts(sort=False)

2     3
3    15
4     6
5     2
dtype: int64

To view proportions instead of counts, supply the `normalize=True` argument (multiply by 100 for percentages). Arguments can be combined, for example with `sort=False` if you don't want the elements sorted by frequency.

In [134]:
auto['rep78'].value_counts(normalize=True, sort=False)

2    0.115385
3    0.576923
4    0.230769
5    0.076923
dtype: float64
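
The pieces above can be combined into a table that resembles the one-way `PROC FREQ` layout (frequency, percent, cumulative frequency, cumulative percent). This is only a sketch; the column names and the `freq_table` variable are our own choices.

```python
import pandas as pd

# The rep78 column, copied from the datalines block above.
rep78 = pd.Series([3, 3, 3, 5, 3, 4, 3, 4, 3, 3, 3, 3, 3,
                   3, 2, 3, 3, 4, 3, 2, 2, 3, 4, 5, 4, 4],
                  name='rep78')

# Sort by the rep78 value rather than by frequency.
counts = rep78.value_counts().sort_index()

freq_table = pd.DataFrame({
    'frequency': counts,
    'percent': 100 * counts / counts.sum(),
    'cum_frequency': counts.cumsum(),
    'cum_percent': 100 * counts.cumsum() / counts.sum(),
})
print(freq_table.round(2))
```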

In [147]:
# Two-way frequency table of rep78 by foreign; margins=True adds the row and column totals ("All")
pd.crosstab(auto['rep78'], auto['foreign'], margins=True)

| rep78 / foreign | 0 | 1 | All |
|---|---|---|---|
| 2 | 3 | 0 | 3 |
| 3 | 14 | 1 | 15 |
| 4 | 2 | 4 | 6 |
| 5 | 0 | 2 | 2 |
| All | 19 | 7 | 26 |


In [151]:
# Same two-way table of counts, without the totals
crosstab = pd.crosstab(auto['rep78'], auto['foreign'])

In [153]:
# Row proportions: divide each row by its row total
crosstab.div(crosstab.sum(axis=1), axis=0)

| rep78 / foreign | 0 | 1 |
|---|---|---|
| 2 | 1.0 | 0.0 |
| 3 | 0.933333 | 0.066667 |
| 4 | 0.333333 | 0.666667 |
| 5 | 0.0 | 1.0 |
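
The division by row totals can also be left to `pd.crosstab()` itself. The sketch below assumes a `pandas` version newer than the 0.16.2 used in this notebook, in which `pd.crosstab()` accepts a `normalize=` argument. The frame `auto_subset` rebuilds only the two columns needed.

```python
import pandas as pd

# rep78 and foreign columns, copied from the datalines block above.
auto_subset = pd.DataFrame({
    'rep78': [3, 3, 3, 5, 3, 4, 3, 4, 3, 3, 3, 3, 3,
              3, 2, 3, 3, 4, 3, 2, 2, 3, 4, 5, 4, 4],
    'foreign': [0] * 3 + [1] * 3 + [0] * 16 + [1] * 4,
})

# Row proportions: each rep78 row sums to 1.
row_pct = pd.crosstab(auto_subset['rep78'], auto_subset['foreign'],
                      normalize='index')

# Column proportions: each foreign column sums to 1.
col_pct = pd.crosstab(auto_subset['rep78'], auto_subset['foreign'],
                      normalize='columns')

print(row_pct)
print(col_pct)
```

The row and column proportions correspond to the row and column percentages that `PROC FREQ` prints in each cell of a two-way table.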


In [180]:
# This cell imports the styling for this notebook. You can safely ignore it.

from IPython.display import HTML

def css_styling():
    # Read the stylesheet and close the file handle when done
    with open("../../_styles/custom.css", "r") as style_file:
        styles = style_file.read()
    return HTML(styles)
css_styling()