#DATA Step - Examples Collection
There aren't many high-quality, comprehensive resources that cover all of the features of the DATA step with good examples. This notebook collects examples of DATA step features alongside their translations in Python. The list is sorted roughly in order of complexity.

Examples:

- [Reading in a SAS Dataset (.sas7bdat)](#Reading-in-a-SAS-Dataset)
- **From [Reading Raw Data with the INPUT Statement](http://support.sas.com/documentation/cdl/en/lrcon/68089/HTML/default/viewer.htm#n1w749t788cgi2n1txpuccsuqtro.htm)**
    - [List Input](#List-Input)
    - [Column Input](#Column-Input-/-Fixed-Width-Format) (Fixed Width Format)
    - [Formatted Input](#Formatted-Input)

##Reading in a SAS Dataset

Reading a SAS dataset (`.sas7bdat`) requires the `sas7bdat` [package](https://pypi.python.org/pypi/sas7bdat). You can install the package from the terminal/command prompt by typing:

    pip install sas7bdat
    
Alternatively, you can do it straight from an IPython environment by prepending an `!`, so a cell containing `!pip install sas7bdat` will run that as a terminal command.

We'll use this package to import the [DJIA](http://support.sas.com/documentation/cdl/en/proc/67916/HTML/default/viewer.htm#p16xfzs6h77uu2n1idsb76t4wzap.htm) dataset.

In [20]:
from sas7bdat import SAS7BDAT
import pandas as pd

We'll open a stream to the file by pointing it to the dataset location, then convert it to a `pandas` *DataFrame*.

In [11]:
with SAS7BDAT('../data/djia.sas7bdat') as dija_sas:
    dija = dija_sas.to_data_frame()

The `with` statement can be confusing to beginners, but it's the standard way to work with file streams in Python. Essentially, `SAS7BDAT` opens a connection to the file, and `with` makes sure that the connection is closed once the indented block finishes: in this case, after we convert the data to a `pandas` *DataFrame* and store it as `dija`.
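To make the role of `with` concrete, here is a minimal sketch that uses a plain temporary text file instead of a SAS dataset. The file name and its contents are made up for illustration; the point is only that the file object is closed automatically once the block ends.

```python
import os
import tempfile

# Create a small throwaway file to read back (hypothetical contents).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')
with open(path, 'w') as out_file:
    out_file.write('Riley 1132 1187\n')

# Open the file inside a `with` block and read its first line.
with open(path) as in_file:
    first_line = in_file.readline()
    open_inside_block = not in_file.closed

# Once the block exits, the file has been closed for us.
closed_after_block = in_file.closed
```

The same pattern applies to `SAS7BDAT` above: everything that needs the open file goes inside the indented block.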

We'll verify the contents of the *DataFrame* below. You'll notice that date formats from SAS are automatically converted to `datetime.date` objects in Python.

In [12]:
dija.head()

      High    HighDate     Low     LowDate  Year
0   985.21  1968-12-03  825.13  1968-03-21  1968
1   968.85  1969-05-14  769.93  1969-12-17  1969
2   842.00  1970-12-29  631.16  1970-05-06  1970
3   950.82  1971-04-28  797.97  1971-11-23  1971
4  1036.27  1972-12-11  889.15  1972-01-26  1972


<div class="pynote">
In general, I would suggest exporting SAS datasets into an intermediate format like a `.csv` file to avoid any potential errors in the data import.
</div>
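As a rough sketch of that CSV route, assuming Python 3 and a file that has already been exported from SAS: the inline text below stands in for the exported file (its two rows are copied from the DJIA output above), and `djia_csv` is just an illustrative name.

```python
from io import StringIO  # Python 3 location of StringIO

import pandas as pd

# Stand-in for a CSV exported from SAS (assumed layout, two sample rows).
csv_text = StringIO(
    'Year,High,HighDate,Low,LowDate\n'
    '1968,985.21,1968-12-03,825.13,1968-03-21\n'
    '1969,968.85,1969-05-14,769.93,1969-12-17\n'
)

# parse_dates asks pandas to convert the listed columns to datetime values.
djia_csv = pd.read_csv(csv_text, parse_dates=['HighDate', 'LowDate'])
print(djia_csv.dtypes)
```

With a real export you would pass the file path to `read_csv()` in place of the `StringIO` object.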

##List Input
List input reads values that are separated by spaces. SAS uses the following [code example](http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a003209907.htm):


###SAS Code
    data scores;
        length name $ 12;
        input name $ score1 score2;
        datalines;
    Riley 1132 1187
    Henderson 1015 1102
    ;
    
###Python Code
A common way of importing small amounts of data within SAS is the `datalines` [statement](https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000188182.htm), which enters data directly into a program. In Python, equivalent functionality is possible by creating an object of the `StringIO` [class](https://docs.python.org/2/library/stringio.html) with the data. This object stores a string in memory, which we can then read with the `pandas` function `read_table()`.

We will store this string in a variable named `datalines`.

In [14]:
import StringIO

In [16]:
datalines = StringIO.StringIO('''
Riley 1132 1187
Henderson 1015 1102
''')

<div class="pynote">
<b>Python Note</b>: Multiline strings start and end with triple quotes <code>'''</code>.
</div>

In [22]:
scores = pd.read_table(datalines,   
                       delim_whitespace=True,
                       names=['name', 'score1', 'score2'])

In [118]:
print scores

        name  score1  score2
0      Riley    1132    1187
1  Henderson    1015    1102
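The cells above target Python 2. As a hedged sketch of the same list-input read under Python 3: `StringIO` lives in the `io` module there, and `sep=r'\s+'` is used as the whitespace delimiter in place of `delim_whitespace=True`, which newer `pandas` releases deprecate. The name `scores_py3` is only for illustration.

```python
from io import StringIO

import pandas as pd

# Same two records as the SAS datalines, without a leading blank line.
datalines = StringIO('Riley 1132 1187\nHenderson 1015 1102\n')

# One or more whitespace characters separate the fields.
scores_py3 = pd.read_table(datalines,
                           sep=r'\s+',
                           header=None,
                           names=['name', 'score1', 'score2'])
print(scores_py3)
```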


`pandas` will do its best to infer object types. We can check basic *DataFrame* information and column types with the `.info()` method.

In [119]:
scores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
name      2 non-null object
score1    2 non-null int64
score2    2 non-null int64
dtypes: int64(2), object(1)
memory usage: 64.0+ bytes


<div class="resources">
Read more about column `dtypes` <a href="http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes">here</a>.
</div>

##Column Input / Fixed Width Format
Column input reads data in which each value occupies a fixed range of character positions on the line, so the fields line up in columns of specific widths.

###SAS Code
    data scores;
        infile datalines truncover;
        input name $ 1-12 score2 17-20 score1 27-30;
    datalines;
    Riley           1132       987
    Henderson       1015      1102
    ;

###Python Code
We'll again use a `StringIO` object for similar functionality to `datalines`.


In [100]:
datalines = StringIO.StringIO('''
Riley           1132       987
Henderson       1015      1102
''')

`pandas` uses the `read_fwf()` [function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html) to read fixed-width format (fwf) files. Instead of having both the variable names and column widths in an `INPUT` statement like SAS, this function splits them into the `colspecs=` and `names=` arguments. This is consistent with other data IO functions in `pandas`.

A few things to note:

- The `colspecs=` argument is a list of tuples describing the column widths
- The intervals are half open, ex: (0, 12) is starting at 0 to, but not including, position 12
- Python is zero indexed and the intervals are half open, so only the start of each SAS column range shifts down by one: SAS columns 17-20 become `(16, 20)`.
- We need to explicitly pass `header=0` since we have an empty first row
    - This is not intuitive... I found out through trial and error, and the behavior may differ between `pandas` versions.
- There's a cleaner solution presented below this one.
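The interval rule in the notes above can be checked with plain string slicing, which uses the same half-open convention. This is a small sketch; `sas_columns_to_colspec` is a hypothetical helper written for this illustration, not part of `pandas`.

```python
def sas_columns_to_colspec(start, end):
    """Convert a 1-based, inclusive SAS column range to a 0-based, half-open interval."""
    return (start - 1, end)


# One record laid out like the datalines above:
# name in columns 1-12, scores in columns 17-20 and 27-30.
line = 'Henderson'.ljust(16) + '1015' + '1102'.rjust(10)

colspecs = [sas_columns_to_colspec(1, 12),
            sas_columns_to_colspec(17, 20),
            sas_columns_to_colspec(27, 30)]

# Slice the record with each interval to preview what read_fwf() would pick up.
fields = [line[begin:end].strip() for begin, end in colspecs]
print(colspecs)
print(fields)
```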

In [101]:
scores = pd.read_fwf(datalines,
                     # SAS columns 1-12, 17-20, 27-30 as zero-based, half-open intervals
                     colspecs=[(0, 12), (16, 20), (26, 30)],
                     names=['name', 'score1', 'score2'],
                     header=0)

In [102]:
print scores

        name  score1  score2
0      Riley    1132     987
1  Henderson    1015    1102


One of the nice things about the `read_fwf()` function is that the default value for the `colspecs=` argument is `'infer'`, which will infer the column widths and positions based on whitespace. Knowing this, we can drop the `colspecs=` argument entirely and pass only the data and the `names=` argument.

In [106]:
datalines = StringIO.StringIO('''
Riley           1132       987
Henderson       1015      1102
''')

In [107]:
scores_infer = pd.read_fwf(datalines, 
                           names=['name', 'score1', 'score2'],
                           header=0)

In [108]:
print scores_infer

        name  score1  score2
0      Riley    1132     987
1  Henderson    1015    1102
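As a quick cross-check, the sketch below (assuming Python 3 and a recent `pandas`) reads the same text once with inferred column positions and once with explicit `colspecs=`, then compares the two results. The text has no leading blank line, so `header=None` is used here rather than `header=0`; the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

fwf_text = ('Riley           1132       987\n'
            'Henderson       1015      1102\n')
names = ['name', 'score1', 'score2']

# Column positions inferred from whitespace (the default colspecs='infer').
scores_infer = pd.read_fwf(StringIO(fwf_text), names=names, header=None)

# Column positions given explicitly as zero-based, half-open intervals.
scores_explicit = pd.read_fwf(StringIO(fwf_text),
                              colspecs=[(0, 12), (16, 20), (26, 30)],
                              names=names,
                              header=None)

# True when both reads produced the same values, dtypes, and labels.
same_result = scores_infer.equals(scores_explicit)
print(same_result)
```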


##Formatted Input
Formatted input refers to data whose values are written in a format that makes reading them directly challenging, such as numbers containing thousands separators.

###SAS Code
    data scores;
       input name $12. +4 score1 comma5. +6 score2 comma5.;
       datalines;
    Riley           1,132      1,187
    Henderson       1,015      1,102
    ;
    
###Python Code
Structurally, the data is similar to the fixed-width format data, with the addition of the thousands separator.

Thankfully, `pandas` can check whether strings containing numbers with commas should be parsed as numbers. This means no messing around with pointer controls like `+4` and `+6` in the SAS `INPUT` statement. Instead, we'll just add the `thousands=','` argument to the `read_fwf()` function.

In [113]:
datalines = StringIO.StringIO('''
Riley           1,132      1,187
Henderson       1,015      1,102
''')

In [114]:
scores_formatted = pd.read_fwf(datalines,    
                               names=['name', 'score1', 'score2'],
                               header=0,
                              thousands=',')

In [117]:
print scores_formatted

        name  score1  score2
0      Riley    1132    1187
1  Henderson    1015    1102
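To see what the `thousands=','` argument changes, the following sketch (assuming Python 3 and a recent `pandas`) reads the same text with and without it. Without the argument, the score columns should come back as text such as `'1,132'` instead of numbers. The variable names are illustrative, and `header=None` is used because the text has no leading blank line.

```python
from io import StringIO

import pandas as pd

formatted_text = ('Riley           1,132      1,187\n'
                  'Henderson       1,015      1,102\n')
names = ['name', 'score1', 'score2']

# Without thousands=',' the comma-separated values are kept as strings.
scores_raw = pd.read_fwf(StringIO(formatted_text), names=names, header=None)

# With thousands=',' the commas are removed and the values parsed as numbers.
scores_parsed = pd.read_fwf(StringIO(formatted_text),
                            names=names,
                            header=None,
                            thousands=',')

print(scores_raw['score1'].tolist())
print(scores_parsed['score1'].tolist())
```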

