# Supplemental Notes on Pandas
## Using Dataframes and indexing Part 1

Oct. 2, 2016

In [1]:
import pandas as pd

# Creating dataframes with row names and column indices

Say you have a pair of lists you want to make a Dataframe from ...

In [2]:
row_names = ['a','b','c','d','e']
col_names = ['a','b','c','d','e']

#### Reading the documentation about [DataFrames](), we can create an __empty__ `DataFrame` with row and column names taken from the two lists above like this ...

In [3]:
df = pd.DataFrame(columns=col_names, index=row_names)

In [4]:
df

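|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | NaN | NaN | NaN | NaN | NaN |
| b | NaN | NaN | NaN | NaN | NaN |
| c | NaN | NaN | NaN | NaN | NaN |
| d | NaN | NaN | NaN | NaN | NaN |
| e | NaN | NaN | NaN | NaN | NaN |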
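As a quick check on what the constructor produced, the sketch below rebuilds the same empty frame and inspects its labels. It is written for Python 3 and assumes a reasonably recent pandas (`isna` is not available in very old releases).

```python
import pandas as pd

row_names = ['a', 'b', 'c', 'd', 'e']
col_names = ['a', 'b', 'c', 'd', 'e']

# No data is passed, so every cell starts out missing (NaN).
df = pd.DataFrame(columns=col_names, index=row_names)

print(list(df.index))         # the row labels
print(list(df.columns))       # the column labels
print(df.isna().all().all())  # True when every cell is missing
```

The `index` argument supplies the row labels and `columns` supplies the column labels; the two lists do not need to be the same length or share any names.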


Pandas gives you the ability to access the data in the dataframe by _row_ and _column_ name.  If you read the documentation you see that the first argument to `loc` is the _row indexer_ and the second argument is the _column indexer_, and that the data can be set directly for a given row and column:

In [5]:
df.loc['a','a'] = 1
df

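|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 1.0 | NaN | NaN | NaN | NaN |
| b | NaN | NaN | NaN | NaN | NaN |
| c | NaN | NaN | NaN | NaN | NaN |
| d | NaN | NaN | NaN | NaN | NaN |
| e | NaN | NaN | NaN | NaN | NaN |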
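To make the row-first, column-second order concrete, here is a small sketch (Python 3) that sets and reads a cell where the two labels differ. The frame is created with float `NaN` values, which is an assumption made here to keep the dtypes simple; the value `2` and the names `value`, `row_b`, and `col_d` are illustrative only.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(np.nan, index=labels, columns=labels)

df.loc['a', 'a'] = 1   # row 'a', column 'a'
df.loc['b', 'd'] = 2   # row 'b', column 'd' (row label first, then column label)

value = df.loc['b', 'd']   # read a single cell back
row_b = df.loc['b']        # a whole row, returned as a Series
col_d = df.loc[:, 'd']     # a whole column, returned as a Series

print(value)
print(row_b)
print(col_d)
```

Swapping the arguments, as in `df.loc['d', 'b']`, addresses a different cell, which here is still `NaN`.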


### Using iloc to reference data by index position ...

Labels are not the only way in.  If you want to access things by numeric position, that is by row number $i$ and column number $j$, you can do that using the `iloc` indexer like this:

In [6]:
df.iloc[0,0] # notice zero-based indexing

1
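The relationship between `loc` and `iloc` can be sketched as follows (Python 3). The float-initialised frame and the value `7` are assumptions for illustration; `df.index[i]` and `df.columns[j]` translate a position into a label, and `get_loc` goes the other way.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(np.nan, index=labels, columns=labels)
df.loc['a', 'a'] = 1

# Position (0, 0) and labels ('a', 'a') refer to the same cell.
same_cell = df.iloc[0, 0] == df.loc['a', 'a']

# Write by position, then read back by the matching labels.
df.iloc[1, 3] = 7
row_label = df.index[1]      # 'b'
col_label = df.columns[3]    # 'd'
by_label = df.loc[row_label, col_label]

# Going from a label back to its position.
pos_d = df.index.get_loc('d')

# Negative positions count from the end, as with Python lists.
last = df.iloc[-1, -1]

print(same_cell, by_label, pos_d, last)
```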

### Getting dataframe dimensions ...

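To iterate over all the rows and columns by position, you first need to know how many of each there are.  A really useful attribute to know about is `shape` (it is an attribute, not a function, so it takes no parentheses).  It gives you the dimensions of the dataframe as (rows, columns), like this: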

In [7]:
df.shape

(5, 5)

Notice that since this is a [tuple](), we can unpack it into $n$ (the number of rows) and $m$ (the number of columns) so the dimensions are available separately ...

In [8]:
n, m = df.shape
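Because the frame in these notes is square, it is easy to lose track of which dimension is which. The sketch below (Python 3) uses a deliberately non-square frame with made-up labels to show that the first element of `shape` counts rows and the second counts columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row, 2-column frame used only to tell the dimensions apart.
df = pd.DataFrame(np.nan, index=['r0', 'r1', 'r2'], columns=['x', 'y'])

n, m = df.shape   # tuple unpacking: n rows, m columns

print(n, m)
print(n == len(df.index), m == len(df.columns))
```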

We can see we have a $5 \times 5$ dataframe, so $n = 5$ rows and $m = 5$ columns.  While not exactly the most Pythonic way to do things, we can iterate numerically using `xrange` over $n$ and $m$.  We'll do that first, since coming from other languages it might seem the most comfortable, but it is not the best way to do it in Python.  (These cells are Python 2; in Python 3, `xrange` becomes `range` and `print` becomes a function.)

In [9]:
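for i in xrange(0, n): # the rows
    print "row [{}] ==>".format(i),  # trailing comma keeps the cursor on this line

    for j in xrange(0, m): # the columns (m, not n, so non-square frames work too)
        print df.iloc[i, j],
    print  # end the line for this row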

row [0] ==> 1 nan nan nan nan
row [1] ==> nan nan nan nan nan
row [2] ==> nan nan nan nan nan
row [3] ==> nan nan nan nan nan
row [4] ==> nan nan nan nan nan
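For comparison, one of several ways to walk the frame without index arithmetic is `iterrows`, which yields each row label together with the row as a Series. This is only a sketch, written for Python 3 and assuming a float-initialised frame; it prints the row label instead of the row number, and the name `lines` is illustrative.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(np.nan, index=labels, columns=labels)
df.loc['a', 'a'] = 1

lines = []
for row_label, row in df.iterrows():
    # Iterating over a Series yields its values in column order.
    cells = " ".join(str(value) for value in row)
    lines.append("row [{}] ==> {}".format(row_label, cells))

for line in lines:
    print(line)
```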


Now let's say we want to set just the diagonal entries to `0`.  We can use the same loop ... except only setting values when $i=j$.

In [10]:
for i in xrange(0, n): # the rows
    for j in xrange(0, m): # the columns
        if i == j: # only the diagonal cells
            print "setting row [{}] diagonal ==> 0".format(i),
            df.iloc[i, j] = 0
    print

df

setting row [0] diagonal ==> 0
setting row [1] diagonal ==> 0
setting row [2] diagonal ==> 0
setting row [3] diagonal ==> 0
setting row [4] diagonal ==> 0


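|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0.0 | NaN | NaN | NaN | NaN |
| b | NaN | 0.0 | NaN | NaN | NaN |
| c | NaN | NaN | 0.0 | NaN | NaN |
| d | NaN | NaN | NaN | 0.0 | NaN |
| e | NaN | NaN | NaN | NaN | 0.0 |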
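Since only the cells with $i=j$ are touched, the inner loop is not strictly needed. A minimal sketch (Python 3, float-initialised frame assumed) that visits each diagonal cell once:

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(np.nan, index=labels, columns=labels)

n, m = df.shape
# The diagonal has min(n, m) cells, which also covers non-square frames.
for i in range(min(n, m)):
    df.iloc[i, i] = 0

print(df)
```

This does $n$ assignments instead of $n \times m$ comparisons, which matters little at $5 \times 5$ but adds up on larger frames.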


Now the finale: let's say we want to fill every cell on or below the diagonal with `1`.  Simple enough, we'll use the same tools as above, this time setting values when $i \geq j$:

In [11]:
for i in xrange(0, n): # the rows
    for j in xrange(0, m): # the columns
        if i >= j: # on or below the diagonal
            df.iloc[i, j] = 1

df

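|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 1 | NaN | NaN | NaN | NaN |
| b | 1 | 1.0 | NaN | NaN | NaN |
| c | 1 | 1.0 | 1.0 | NaN | NaN |
| d | 1 | 1.0 | 1.0 | 1.0 | NaN |
| e | 1 | 1.0 | 1.0 | 1.0 | 1.0 |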
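The same lower-triangular pattern can also be built without explicit loops. The sketch below (Python 3) is an alternative to the loop above, not the approach these notes take: it uses NumPy's `tril` to build a boolean mask and constructs a fresh float frame from it, then compares the result with the loop version. The names `mask`, `lower`, and `looped` are illustrative.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
n = m = len(labels)

# True on and below the diagonal, False above it.
mask = np.tril(np.ones((n, m), dtype=bool))

# 1.0 where the mask is True, NaN elsewhere.
lower = pd.DataFrame(np.where(mask, 1.0, np.nan), index=labels, columns=labels)

# Loop version for comparison, using the same i >= j condition.
looped = pd.DataFrame(np.nan, index=labels, columns=labels)
for i in range(n):
    for j in range(m):
        if i >= j:
            looped.iloc[i, j] = 1

print(lower)
print(lower.equals(looped))  # equals() treats NaNs in matching positions as equal
```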


Great!  While this works and can be used, we could also get some good mileage out of using the names of the rows and columns. See part 2 for how to do that!