# Stronger data interaction

## Pandas
**Pan**el **Da**ta **S**ystem 


- Python data analysis library 
- Built on top of Numpy 
- Open-sourced by AQR Capital Management, LLC in late 2009
- 30,000 lines of tested Python/Cython code 
- Used in production by many companies

In [1]:
from IPython.display import HTML
HTML('<iframe src="http://pandas.pydata.org" width="800" height="450"></iframe>')

Level up

In [2]:
import pandas as pd

Now we can do **#kungfupandas**

<img src='http://j.mp/1Ixu8eH' width=400>


## Kung fu get

<img src='http://www.scicbeijing.com/Upfile/20091130015525774.jpg' width='400'>

### Read a file

Define the files and the number of columns to read

In [3]:
datafile = 'data/num.csv.gz'
infofile = 'data/num.csv.info'
cols_num = 3

Read the column names from the information file

In [4]:
cols_name = []
with open(infofile) as f:
    for row in f.read().splitlines():
        cols_name.append(row.split()[0])


Read the file with pandas!

In [5]:
# if you need help 
?pd.read_csv

In [6]:
A = pd.read_csv(datafile, header=None, names=cols_name[0:cols_num], usecols=range(0, cols_num))
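The same call can be tried without the course data files. A minimal sketch, assuming a tiny in-memory CSV (the `raw` string and its values are made up for illustration) in place of `data/num.csv.gz`:

```python
import io

import pandas as pd

# Hypothetical stand-in for the header-less data file: 3 rows, 4 columns.
raw = "2596,51,3,258\n2590,56,2,212\n2804,139,9,268\n"
names = ["Elevation", "Aspect", "Slope"]

# header=None: the first line is data, not column names.
# usecols keeps only the first three columns; names labels them.
df = pd.read_csv(io.StringIO(raw), header=None, names=names, usecols=[0, 1, 2])

print(df.shape)          # (3, 3)
print(list(df.columns))  # ['Elevation', 'Aspect', 'Slope']
```

The fourth column is never loaded, which is the point of `usecols` when only part of a wide file is needed.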

### Now you have a DataFrame

In [7]:
type(A)

pandas.core.frame.DataFrame

In [8]:
A.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71436 entries, 0 to 71435
Data columns (total 3 columns):
Elevation    71436 non-null int64
Aspect       71436 non-null int64
Slope        71436 non-null int64
dtypes: int64(3)
memory usage: 2.2 MB


<small>Note: we get memory usage!</small>

In [9]:
# First five lines
A.head()

   Elevation  Aspect  Slope
0       2596      51      3
1       2590      56      2
2       2804     139      9
3       2785     155     18
4       2595      45      2


In [10]:
# Last five lines
A.tail()

       Elevation  Aspect  Slope
71431       2919      78      8
71432       2912      97      6
71433       2911     207      1
71434       2912      74      3
71435       2910      72      5


# DATA FRAMES

* A table-like data structure
* Commands that let the user interact with and modify the structure

* Comparable to data frames in the *R language* and to its **data.table** class
    - Basically a Pythonic *data frame*
    - but with automatic data alignment!
    - Arithmetic operations align on row and column labels

You may read more on the [pandas website](http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html)
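Automatic alignment is easiest to see on two objects whose labels only partly overlap. A small sketch with made-up values (the names `s1`, `s2` and `total` are illustrative):

```python
import pandas as pd

# Two Series whose labels only partly overlap.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition matches values by label, not by position.
total = s1 + s2
print(total)
# Labels present in only one operand ('a' and 'd') come out as NaN.
```

The result covers the union of the labels, so no value is silently paired with the wrong row.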

## Kung fu columns
Selecting and examining columns

### Axis Indexing

- Every axis has an index
- Highly optimized data structure
- Hierarchical indexing
- group by and join-type operations

<small>You will understand this by examples</small>
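As a first taste of hierarchical indexing, here is a minimal sketch on a made-up table (the column names `area`, `year` and `value` are assumptions, not part of the course data):

```python
import pandas as pd

df = pd.DataFrame({
    "area": ["north", "north", "south", "south"],
    "year": [2014, 2015, 2014, 2015],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Two columns become a two-level (hierarchical) row index.
h = df.set_index(["area", "year"])

north = h.loc["north"]                 # all rows of the outer label 'north'
one = h.loc[("south", 2015), "value"]  # a single cell, addressed by both levels

# Group-by operations can work directly on an index level.
totals = h.groupby(level="area")["value"].sum()
print(totals)
```

Selecting by the outer label alone returns a smaller frame indexed by the remaining level.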

### Summarize all columns

In [11]:
A.describe()
# Only for numeric columns

         Elevation      Aspect      Slope
count      71436.0     71436.0    71436.0
mean   2862.476678   138.01845  12.245129
std     256.034333  105.152815   6.925558
min         1863.0         0.0        0.0
25%         2717.0        53.0        7.0
50%         2903.0       108.0       11.0
75%         3014.0       206.0       16.0
max         3849.0       360.0       61.0


In [12]:
# Select one column using the DataFrame attribute
A.Aspect[:4]

0     51
1     56
2    139
3    155
Name: Aspect, dtype: int64

In [13]:
# Select one column in a better way
index = 'Aspect'
A[index].head(4)

0     51
1     56
2    139
3    155
Name: Aspect, dtype: int64
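Why is bracket selection the "better way"? One reason, sketched on a made-up frame: attribute access breaks when a column name collides with a DataFrame method or is not a valid Python identifier.

```python
import pandas as pd

# Illustrative column names chosen to cause trouble.
df = pd.DataFrame({"count": [3, 5], "max slope": [10, 20]})

# df.count is the DataFrame.count method, not the 'count' column.
print(callable(df.count))        # True

# Bracket access always returns the column, whatever its name.
print(df["count"].tolist())      # [3, 5]
print(df["max slope"].tolist())  # [10, 20]
```

Brackets also accept a variable holding the column name, as the `index` variable above does.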

# Series

> One-dimensional ndarray with axis labels (including time series).

In [14]:
# Wait... What type is a column?
type(A[index])

pandas.core.series.Series

In [15]:
# Describe the Series (numeric here, so we get summary statistics)
A[index].describe()

count    71436.000000
mean       138.018450
std        105.152815
min          0.000000
25%         53.000000
50%        108.000000
75%        206.000000
max        360.000000
Name: Aspect, dtype: float64

In [16]:
# For each value, count number of occurrences
occ = A[index].value_counts()

## GOOD EXERCISE

# Produce the top three most used values
occ.sort_values(ascending=False, kind='mergesort').head(3)

45     1366
90      931
135     917
dtype: int64
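Note that `value_counts()` already returns the counts in descending order, so the explicit sort is mostly for practice. A small sketch on made-up values, showing two other ways to get the most frequent entries:

```python
import pandas as pd

# Made-up values standing in for the Aspect column.
s = pd.Series([45, 90, 45, 135, 45, 90, 0])

occ = s.value_counts()   # counts per value, most frequent first

top2 = occ.head(2)       # rely on the default ordering
same = occ.nlargest(2)   # or ask for the largest counts explicitly

print(top2)
```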

## EXERCISE?

Find the name of the column with the highest 75% (third quartile) value in the whole file
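One possible approach, sketched on a made-up frame rather than on `A` (the columns `x`, `y`, `z` are hypothetical), is to compute the quantile per column and ask for the label of the maximum:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 20, 30, 40, 50],
    "z": [0, 0, 0, 0, 100],
})

# 75th percentile of every column, as a Series indexed by column name.
q75 = df.quantile(0.75)

# idxmax returns the label (here: the column name) of the largest value.
best = q75.idxmax()
print(best)  # y

# The same row is available from describe().
also_best = df.describe().loc["75%"].idxmax()
```

Column `z` has the largest maximum but not the largest third quartile, which is why the two questions give different answers.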

In [39]:
# A list of columns
my_cols = ['Elevation', 'Slope']
print(A[my_cols].head(2))
print()

# Types?
print("t1", type(A[index]))
print("t2", type(A[my_cols]))

   Elevation  Slope
0       2596      3
1       2590      2

t1 <class 'pandas.core.series.Series'>
t2 <class 'pandas.core.frame.DataFrame'>


## Kungfu approach 1
Focus on one opponent at a time

<img src='http://i.ytimg.com/vi/9DW-ZDT2CwY/maxresdefault.jpg' width=500>

For each unique value in one column (here `Aspect`), calculate the mean of ALL other numeric columns

In [18]:
A.groupby(index).mean().head(3)

          Elevation      Slope
Aspect
0        2931.37643   8.051487
1       2764.614815  15.296296
2           2793.15  13.872222


In [19]:
# For each unique Aspect value, take the mean of the other numeric columns, then the minimum of the Elevation means
A.groupby(index).mean().Elevation.min()

2460.8000000000002

<img src='http://themamareport.com/wp-content/uploads/2013/06/kung-fu-image-still.jpg' width='450'>
<small> You may have noticed: *chainability*</small>

In [20]:
A.groupby(index).Elevation.mean().head(3)

Aspect
0    2931.376430
1    2764.614815
2    2793.150000
Name: Elevation, dtype: float64
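The two chains used above give the same numbers. A small sketch on made-up data: selecting the column before aggregating avoids computing means that are thrown away.

```python
import pandas as pd

df = pd.DataFrame({
    "Aspect": [0, 0, 1, 1],
    "Elevation": [2900, 3000, 2700, 2800],
    "Slope": [8, 10, 15, 17],
})

# Aggregate every column, then pick one...
by_frame = df.groupby("Aspect").mean()["Elevation"]

# ...or pick the column first and aggregate only that one.
by_column = df.groupby("Aspect")["Elevation"].mean()

print(by_column)
```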

## Kungfu approach 2
Transform the strength of one opponent into your strength

<img src='http://www.shaolin.org/images/taijiquan/com_22c.jpg' width=400>

Add a new column as a function of existing columns

In [21]:
A['new_col1'] = A.Elevation * 10
A['new_col2'] = A['Slope'] + A['Aspect'] - 1
# Check it
A.head(3)

   Elevation  Aspect  Slope  new_col1  new_col2
0       2596      51      3     25960        53
1       2590      56      2     25900        57
2       2804     139      9     28040       147


<small>Note: you (usually) can't create a new column by assigning to an attribute (e.g., `A.new_col = ...`)</small>
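A minimal sketch of what happens with attribute assignment, on a made-up frame. The warning filter is only there because recent pandas versions emit a warning in this situation:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"Elevation": [2596, 2590]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # This sets a plain Python attribute on the object, not a column.
    df.new_col = df["Elevation"] * 10

before = "new_col" in df.columns   # False: no column was created

# Bracket assignment is the way to add a column.
df["new_col"] = df["Elevation"] * 10
after = "new_col" in df.columns    # True

print(before, after)
```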

In [22]:
# rename a column
A.rename(columns={'new_col2':'a_sum'}, inplace=True)
A.head(3)

   Elevation  Aspect  Slope  new_col1  a_sum
0       2596      51      3     25960     53
1       2590      56      2     25900     57
2       2804     139      9     28040    147


## Kungfu approach 3
Make your opponent see what you want him to see

<img src='http://funnystatus.socialzoidllc.netdna-cdn.com/wp-content/uploads/2013/06/hiding-panda.png' width=400>

In [23]:
# Hide a column (temporarily)
A.drop(['new_col1'], axis=1).head(3)

   Elevation  Aspect  Slope  a_sum
0       2596      51      3     53
1       2590      56      2     57
2       2804     139      9    147


<small>hint: use 'axis=0' to drop rows instead</small>
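A short sketch of both directions on a made-up frame. In both cases `drop` returns a new object and leaves the original untouched, which is why the column is only hidden "temporarily":

```python
import pandas as pd

df = pd.DataFrame({"Elevation": [2596, 2590, 2804], "Slope": [3, 2, 9]})

fewer_rows = df.drop([0, 2], axis=0)     # drop rows by index label
fewer_cols = df.drop(["Slope"], axis=1)  # drop columns by name

print(fewer_rows)
print(fewer_cols)
print(df.shape)  # (3, 2): the original is unchanged
```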

Alternative way: **slice** the `columns` attribute like a <u>list</u>!

In [24]:
A[A.columns[:-2]].head(3)

   Elevation  Aspect  Slope
0       2596      51      3
1       2590      56      2
2       2804     139      9


A more violent approach: delete a column **permanently**

In [25]:
# E.g. if you need memory
del A['new_col1']

<hr>

## More information

In [None]:
A.index                # "the index" (aka "the labels")
A.columns              # column names (which is "an index")
A.dtypes               # data types of each column
A.shape                # number of rows and columns
A.values               # underlying numpy array
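To see what these attributes return, here is a small sketch on a made-up frame with one integer and one float column:

```python
import pandas as pd

df = pd.DataFrame({"Elevation": [2596, 2590], "Slope": [3.0, 2.5]})

print(df.index.tolist())    # [0, 1]
print(df.columns.tolist())  # ['Elevation', 'Slope']
print(df.dtypes)            # one dtype per column
print(df.shape)             # (2, 2)

# A single numpy array needs a single dtype, so mixed int/float
# columns are upcast to float64.
arr = df.values
print(arr.dtype)            # float64
```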

<hr>

## Snapshot
Quickly write the current status of a DataFrame to a CSV

In [27]:
csvfile = 'my_file.csv'

In [28]:
A.to_csv(csvfile)
# Warning: index is used as first column
!head {csvfile}

,Elevation,Aspect,Slope,a_sum
0,2596,51,3,53
1,2590,56,2,57
2,2804,139,9,147
3,2785,155,18,172
4,2595,45,2,46
5,2579,132,6,137
6,2606,45,7,51
7,2605,49,4,52
8,2617,45,9,53


In [29]:
# Better to ignore the index column!
A.to_csv(csvfile, index=False)
!head {csvfile}

Elevation,Aspect,Slope,a_sum
2596,51,3,53
2590,56,2,57
2804,139,9,147
2785,155,18,172
2595,45,2,46
2579,132,6,137
2606,45,7,51
2605,49,4,52
2617,45,9,53
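The effect of `index=False` can also be checked without touching the disk, since `to_csv` returns a string when no path is given. A sketch on a made-up frame:

```python
import io

import pandas as pd

df = pd.DataFrame({"Elevation": [2596, 2590], "Aspect": [51, 56]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # data columns only

print(with_index.splitlines()[0])       # ,Elevation,Aspect
print(without_index.splitlines()[0])    # Elevation,Aspect

# Reading the index-free text back gives an identical frame.
back = pd.read_csv(io.StringIO(without_index))
```

Reading back the version with the index would instead produce an extra `Unnamed: 0` column unless `index_col=0` is passed.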


## Reading from a remote source

In [30]:
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
TV           200 non-null float64
Radio        200 non-null float64
Newspaper    200 non-null float64
Sales        200 non-null float64
dtypes: float64(4)
memory usage: 7.8 KB


## Reading from Excel

> Since so much financial and scientific data ends up in Excel spreadsheets (*regrettably*), Pandas' ability to directly import Excel spreadsheets is **valuable**. 

<small>Warning: This support is contingent on having dependencies installed: `xlrd` and `openpyxl`</small>

<small>Note: Don't forget to use standards and open formats</small>

Our file

In [31]:
xlsfile = 'data/output.xlsx'

Write from a DataFrame

In [32]:
# The context manager saves and closes the file on exit
with pd.ExcelWriter(xlsfile) as writer:
    A[:2500].to_excel(writer, sheet_name='Sheet1')

Read the xlsx file

In [33]:
B = pd.read_excel(xlsfile, sheet_name='Sheet1')
B.head()

   Elevation  Aspect  Slope  a_sum
0       2596      51      3     53
1       2590      56      2     57
2       2804     139      9    147
3       2785     155     18    172
4       2595      45      2     46


## Kungfu requires continuous training
...and meditation

<img src='http://j.mp/1KWt1pb' width=600>

- Pandas is a very powerful and sometimes complex framework
    - The [full documentation](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf) is > 1600 pages!
- Many operations
    - Some operations can be obtained in different but equivalent ways

> Be still, like the water