# Data interaction

## Pandas
**Pan**el **Da**ta **S**ystem 


- Python data analysis library 
- Built on top of NumPy 
- Open-sourced by AQR Capital Management, LLC in late 2009
- 30,000 lines of tested Python/Cython code 
- Used in production by many companies

In [8]:
from IPython.display import HTML
HTML("<iframe src=http://pandas.pydata.org width=800 height=450></iframe>")

Level up

In [1]:
import pandas as pd

Now we can do **#kungfupandas**

<img src='http://j.mp/1Ixu8eH' width=400>

## Kungfu get

### Read a file

Define the files and the number of columns to read

In [2]:
datafile = 'data/num.csv.gz'
infofile = 'data/num.csv.info'
cols_num = 3

Read the column names from the information file

In [3]:
cols_name = []
with open(infofile) as f:
    for row in f.read().splitlines():
        cols_name.append(row.split()[0])
#len(cols_name)

Read the file with pandas!

In [4]:
# if you need help 
?pd.read_csv

In [5]:
A = pd.read_csv(datafile, header=None, names=cols_name[0:cols_num], usecols=range(0, cols_num))
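The same pattern can be tried without the data files. The sketch below is only an illustration: `info_text` and `csv_text` are made-up stand-ins for the info file and the headerless CSV, and the fourth column is invented to show that `usecols` leaves it out.

```python
import io
import pandas as pd

# Hypothetical stand-in for the info file: the first token on each line is a column name.
info_text = "Elevation meters\nAspect degrees\nSlope degrees\nCover class\n"
# Hypothetical stand-in for the headerless data file (four columns).
csv_text = "2596,51,3,5\n2590,56,2,5\n2804,139,9,2\n"

cols_num = 3
cols_name = [line.split()[0] for line in info_text.splitlines()]

sample = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                     # the data has no header row
    names=cols_name[:cols_num],      # names for the columns we keep
    usecols=list(range(cols_num)),   # read only the first three columns
)
print(sample)
```

Passing `header=None` matters here: without it, pandas would treat the first data row as the header.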

### Now you have a DataFrame

In [6]:
type(A)

pandas.core.frame.DataFrame

In [7]:
A.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71436 entries, 0 to 71435
Data columns (total 3 columns):
Elevation    71436 non-null int64
Aspect       71436 non-null int64
Slope        71436 non-null int64
dtypes: int64(3)
memory usage: 2.2 MB


<small>Note: we get memory usage!</small>

In [8]:
# First five lines
A.head()

|  | Elevation | Aspect | Slope |
|---|---|---|---|
| 0 | 2596 | 51 | 3 |
| 1 | 2590 | 56 | 2 |
| 2 | 2804 | 139 | 9 |
| 3 | 2785 | 155 | 18 |
| 4 | 2595 | 45 | 2 |


In [11]:
# Last five lines
A.tail()

|  | Elevation | Aspect | Slope |
|---|---|---|---|
| 71431 | 2919 | 78 | 8 |
| 71432 | 2912 | 97 | 6 |
| 71433 | 2911 | 207 | 1 |
| 71434 | 2912 | 74 | 3 |
| 71435 | 2910 | 72 | 5 |


# DATA FRAMES

* A table-like data structure
* Commands that let the user interact with and modify the structure

* They can be compared with the **data.frame** and **data.table** classes of the *R language*
    - Basically a pythonic data.frame
    - but with automatic data alignment!
    - Arithmetic operations align on row and column labels

You may read more on the [pandas website](http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html)
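Automatic alignment is easiest to see with two small objects whose labels only partly overlap. This is a minimal sketch with made-up values, not data from the file above:

```python
import pandas as pd

# Two Series with partly overlapping row labels (made-up values).
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition matches values by label, not by position.
total = s1 + s2
print(total)
# Labels present in only one operand ('a' and 'd') come out as NaN.
```

The result carries the union of both indexes, so no row is silently paired with the wrong partner.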

<hr>

## Kungfu columns

In [None]:
# selecting and examining columns
df['col_name1']                 # select one column
df.col_name1                    # select one column using the DataFrame attribute
type(df['col_name1'])           # Series
df.col_name1.describe()         # describe the Series (non-numeric)
df.col_name1.value_counts()     # for each value, count number of occurrences

# summarize all columns (new in pandas 0.15.0)
df.describe(include='all')       # describe all Series
df.describe(include=['object'])  # limit to one (or more) types

# select multiple columns
df[['col_name1', 'col_name2']]          # select two columns
my_cols = ['col_name1', 'col_name2']    # or, create a list of column names
df[my_cols]                             # use a list to select columns
type(df[my_cols])                       # DataFrame

# for each unique value in col_1, calculate mean of col_2 values
df.groupby('col_1').col_2.mean()

# for each unique value in col_1, calculate mean of ALL numeric columns
df.groupby('col_1').mean(numeric_only=True)   # recent pandas raises on non-numeric columns otherwise

# for each unique value in col_1, count number of occurrences
df.groupby('col_1').col_1.count()
df.col_1.value_counts()

# add a new column as a function of existing columns
# note: can't (usually) assign to an attribute (e.g., 'df.new_col')
df['new_col'] = df.col_name1 + df.col_name2
df['new_col'] = df.col_name1 * 10
df.head()

# alternative method: default is column sums, 'axis=1' does row sums instead
df['new_col'] = df.loc[:, 'col_name1':'col_name2'].sum(axis=1)

# rename a column
df.rename(columns={'col_name1':'col_1'}, inplace=True)

# hide a column (temporarily)
df.drop(['col_name1'], axis=1)     # use 'axis=0' to drop rows instead
df[df.columns[:-1]]                # slice 'columns' attribute like a list

# delete a column (permanently)
del df['col_name1']
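The commands above use placeholder names. A small runnable sketch, assuming a made-up DataFrame with columns `group`, `a` and `b`, shows a few of them together:

```python
import pandas as pd

# Made-up data; these column names are placeholders, not from the dataset above.
df = pd.DataFrame({
    'group': ['x', 'y', 'x', 'y'],
    'a': [1, 2, 3, 4],
    'b': [10, 20, 30, 40],
})

# Mean of column 'a' for each unique value of 'group'.
mean_a = df.groupby('group')['a'].mean()

# New column computed from existing ones (bracket syntax is required for assignment).
df['total'] = df['a'] + df['b']

# drop() returns a new DataFrame; df itself still contains column 'b'.
without_b = df.drop(['b'], axis=1)

print(mean_a)
print(without_b.columns.tolist())
```

Because `drop` does not modify `df` unless asked to, it behaves as the "temporary hide" described above, while `del` changes the DataFrame in place.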

## More information

In [None]:
df.describe()           # summarize all numeric columns
df.index                # "the index" (aka "the labels")
df.columns              # column names (which is "an index")
df.dtypes               # data types of each column
df.shape                # number of rows and columns
df.values               # underlying numpy array
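As a quick illustration of these attributes, here is a sketch on a made-up two-column DataFrame (the names `name` and `value` are assumptions for the example):

```python
import pandas as pd

# Made-up data with one text column and one float column.
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1.5, 2.5, 3.5]})

n_rows, n_cols = df.shape            # shape is a (rows, columns) tuple
column_names = df.columns.tolist()   # columns is an Index; tolist() gives plain strings
value_dtype = df.dtypes['value']     # dtypes is a Series keyed by column name

print(n_rows, n_cols, column_names, value_dtype)
```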

### Axis Indexing 

- Every axis has an index
- Highly optimized data structure
- Hierarchical indexing
- group by and join-type operations
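Hierarchical indexing and group-by on an index level can be sketched as follows. The site names, years and values are invented for the example:

```python
import pandas as pd

# Made-up measurements indexed by two levels: site and year.
index = pd.MultiIndex.from_tuples(
    [('north', 2014), ('north', 2015), ('south', 2014), ('south', 2015)],
    names=['site', 'year'],
)
rain = pd.Series([100, 120, 80, 90], index=index)

# Select every row under the outer label 'north'; the result is indexed by year.
north = rain.loc['north']

# Aggregate over one level of the index.
per_site = rain.groupby(level='site').sum()

print(north)
print(per_site)
```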

<hr>

## Checkpoint?

In [None]:
# write a DataFrame to a CSV
df.to_csv('my_file.csv')                 # index is used as first column
df.to_csv('my_file.csv', index=False)    # ignore index
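To see what the `index` argument changes without touching the disk, one option is to let `to_csv` return a string (it does so when no path is given). A minimal sketch on made-up data:

```python
import io
import pandas as pd

# Made-up data with a non-default index so the difference is visible.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[10, 20])

# Default: the index is written as the first (unnamed) column.
with_index = df.to_csv()
# index=False: only the data columns are written.
without_index = df.to_csv(index=False)

# Reading back with index_col=0 restores the saved index.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(with_index)
print(without_index)
```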

## Reading from a remote source

In [3]:
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
TV           200 non-null float64
Radio        200 non-null float64
Newspaper    200 non-null float64
Sales        200 non-null float64
dtypes: float64(4)
memory usage: 7.8 KB


## Reading from Excel

Since so much financial and scientific data ends up in Excel spreadsheets (regrettably), Pandas' ability to directly import Excel spreadsheets is valuable. 

<small>Warning: This support is contingent on having dependencies installed: `xlrd` and `openpyxl`</small>

<small>Note: Don't forget to prefer standard, open formats whenever you can</small>

Our file

In [25]:
xlsfile = 'data/output.xlsx'

Write from a DataFrame

In [28]:
# the context manager saves and closes the file on exit
with pd.ExcelWriter(xlsfile) as writer:
    A[:2500].to_excel(writer, sheet_name='Sheet1')

Read the xlsx file

In [29]:
B = pd.read_excel(xlsfile, sheet_name='Sheet1')
B.head()

|  | Elevation | Aspect | Slope |
|---|---|---|---|
| 0 | 2596 | 51 | 3 |
| 1 | 2590 | 56 | 2 |
| 2 | 2804 | 139 | 9 |
| 3 | 2785 | 155 | 18 |
| 4 | 2595 | 45 | 2 |


## Kungfu requires continuous training
...and meditation

<img src='http://j.mp/1KWt1pb' width=600>

- Pandas is a very powerful and sometimes complex framework
    - The [full documentation](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf) is > 1600 pages!
- Many operations
    - Some operations can be obtained in different but equivalent ways
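
One example of two equivalent routes is counting occurrences, either with `groupby(...).count()` or with `value_counts()`. The sketch below uses a made-up column named `kind`:

```python
import pandas as pd

# Made-up categorical data.
df = pd.DataFrame({'kind': ['a', 'b', 'a', 'c', 'a']})

# Route 1: group by the column and count the rows in each group.
by_groupby = df.groupby('kind')['kind'].count()

# Route 2: let value_counts do the grouping and counting.
by_value_counts = df['kind'].value_counts()

# The ordering differs (groupby sorts by key, value_counts by frequency),
# so compare the label -> count mappings rather than the raw Series.
same_counts = by_groupby.to_dict() == by_value_counts.to_dict()
print(same_counts)
```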