# Pandas

![](http://pandas.pydata.org/_static/pandas_logo.png)
[Pandas](http://pandas.pydata.org/) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license. The name is derived from the term _panel data_, an econometrics term for multidimensional structured data sets.

#### Contents
* [Importing Pandas](#Importing-Pandas) and other libraries.
* [Creating Data](#Creating-Data) using lists and tuples
* [Viewing Data](#Viewing-Data)
* [Saving Data](#Saving-Data) to_csv and to_excel
* [Loading Data](#Loading-Data) read_csv, read_table, read_excel, read_html
    * [Unix and os](#Unix-and-os)
    * [csvs and Excel](#CSVs-and-Excel)
* [Selecting Data](#Selecting-Data) loc,iloc,isin
    * [Masks](#Masks) or boolean arrays


NB: This notebook does not cover joining, concatenating, and merging data in depth. Those methods are useful in quite specific situations, so we'll see some examples but won't have a reference section for them in this notebook.
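
For orientation only, here is a minimal sketch of two of the operations mentioned in the note: `pd.concat` stacks frames on top of each other, and `pd.merge` joins them on a shared key column. The frames `left` and `right` and their columns are made-up illustration data, not part of this notebook.

```python
import pandas as pd

# Hypothetical illustration data (not from this notebook)
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Concatenating stacks the rows; columns missing from one frame become NaN
stacked = pd.concat([left, right], ignore_index=True)

# Merging joins rows that share a value in the key column (inner join by default)
merged = pd.merge(left, right, on='key')

print(stacked)
print(merged)
```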

#### Resources:  
* [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/index.html), especially
[10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)  
* [The Data Incubator](https://www.thedataincubator.com/)  
* [Hernan Rojas' learn-pandas](https://bitbucket.org/hrojas/learn-pandas)  
* [Harvard CS109 lab1 content](https://github.com/cs109/2015lab1)

## Importing Pandas

The general way to import libraries is to write
```python
import library #import the library directly
import library as alias 

# This just aliases the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().

from library import function # import specific functions or types in a library
%jupyter magic # jupyter only functions
```

In [2]:
# Some imports - for style reasons, try and put in alphabetical order, unless there are subgroupings of imports
# that you want.
import matplotlib #we'll only use this to determine the matplotlib version number
import matplotlib.pyplot as plt  # the graphing library
import numpy as np # scientific computing library
import pandas as pd # the data structure and analysis library
from pandas import DataFrame, read_csv, Series # specific classes and functions from pandas
import seaborn as sns # Makes graphs look pretty
import sys #we'll only use this to determine the python version number

# Enable inline plotting.  The % is an IPython thing, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline


In [None]:
# All the imports are listed as modules, including pyplot.  But there are several other types
%whos

In [3]:
# How to check your version numbers
print('Python version: ' + sys.version)
print()
print('Pandas version: ' + pd.__version__)
print('Matplotlib version: ' + matplotlib.__version__)

Python version: 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

Pandas version: 0.20.1
Matplotlib version: 2.0.2


## Creating Data

There are many ways to get data into pandas. In each case, the goal is to end up with the data in a DataFrame.
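
As a minimal sketch of two further input routes, a DataFrame can also be built from a list of dictionaries (one per row) or from a dictionary of Series (one per column). The variable names and values below are illustrative only.

```python
import pandas as pd

# One dictionary per row; the keys become the column names
rows = [
    {'animal': 'cat', 'number': 1},
    {'animal': 'dog', 'number': 2},
]
from_rows = pd.DataFrame(rows)

# One Series per column; rows are aligned on the Series index
from_series = pd.DataFrame({
    'animal': pd.Series(['cat', 'dog']),
    'number': pd.Series([1, 2]),
})

# Both routes should describe the same table
print(from_rows.equals(from_series))
```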

In [4]:
# A data frame, using a dictionary with ordered 
# lists for columns

df1 = pd.DataFrame({
    'number': [1, 2, 3],
    'animal': ['cat', 'dog', 'mouse']
})

# The same data frame, using tuples for each row
# We need to give the column names separately!
df2 = pd.DataFrame([
    ('cat', 1),
    ('dog', 2),
    ('mouse', 3),
], columns=['animal', 'number'])

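# Are they the same? Align the column order first, then call .all() twice:
# once to reduce each column, and once more to reduce across the columns.
assert (df1[df2.columns] == df2).all().all()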

df1

Unnamed: 0,animal,number
0,cat,1
1,dog,2
2,mouse,3


In [5]:
dates = pd.date_range('20161101', periods=6)
dates

DatetimeIndex(['2016-11-01', '2016-11-02', '2016-11-03', '2016-11-04',
               '2016-11-05', '2016-11-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:
df3 = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df3

Unnamed: 0,A,B,C,D
2016-11-01,0.404045,0.980022,0.469176,-0.613923
2016-11-02,-0.417024,0.181067,0.72738,-0.226986
2016-11-03,-0.356929,-0.523339,0.423304,-1.040002
2016-11-04,-0.574108,-1.038673,1.047897,1.14801
2016-11-05,0.583823,-3.231146,0.318799,0.348182
2016-11-06,-0.373152,-1.787792,-0.248619,0.932639


Let's create another DataFrame, this time with different data types.  
As a side note, you can paste examples from the internet, prompts included, 
into a Jupyter notebook cell like this and they still work, but only if there isn't anything else in the cell!

From: http://pandas.pydata.org/pandas-docs/stable/10min.html

In [8]:
In [10]: df2 = pd.DataFrame({ 'A' : 1.,
   ....:                      'B' : pd.Timestamp('20130102'),
   ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   ....:                      'D' : np.array([3] * 4,dtype='int32'),
   ....:                      'E' : pd.Categorical(["test","train","test","train"]),
   ....:                      'F' : 'foo' })
   ....: 

df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [9]:
# Dataframe Columns have specific data types
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## Viewing Data

Selecting a single column, which yields a Series.

A Series, like a numpy array, must be of homogeneous type.
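
To illustrate, here is a minimal sketch with made-up values: when the elements do not share a type, pandas falls back to the generic `object` dtype, so the Series still has a single dtype.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed = pd.Series([1, 'two', 3.0])

print(ints.dtype)   # an integer dtype shared by every element
print(mixed.dtype)  # object: the catch-all dtype for mixed Python objects
```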

In [10]:
# Like Dictionaries! DataFrame is a dictionary of Series!
df1['animal']

0      cat
1      dog
2    mouse
Name: animal, dtype: object

In [11]:
# Access a column as a property of df1
df1.animal

0      cat
1      dog
2    mouse
Name: animal, dtype: object

In [12]:
# Selecting rows
df3[3:5]

Unnamed: 0,A,B,C,D
2016-11-04,-0.574108,-1.038673,1.047897,1.14801
2016-11-05,0.583823,-3.231146,0.318799,0.348182


## Exercise

In [13]:
# Select all the data from columns C and D of df3

## Saving Data

#### Exporting to CSVs and Excel files

Export our df to a ***csv*** file. We can name the file ***malaysia_states.csv***, but we could also write a txt file! The function ***to_csv*** does the export. The file is saved in the same directory as the notebook unless a different path is specified.

In [18]:
states = ['Johor','Kedah','Kelantan','Melaka', 
          'Negeri Sembilan','Pahang','Perak','Perlis',
          'Penang','Sabah', 'Sarawak','Selangor','Terengganu']
area = [19210,9500,15099,1664,6686,36137,21035,
        821,1048,73631,124450,8104,13035]
state_area = list(zip(states, area))
state_area
df = pd.DataFrame(data = state_area, columns=['State', 'Area'])

In [19]:
df.to_csv?

The only parameters we will use are ***index*** and ***header***. Setting these parameters to False prevents the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.
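
A minimal sketch of what these two parameters do, using a small stand-in frame named `demo` (an illustrative name). When no path is passed, `to_csv` returns the CSV text as a string, which makes the difference easy to inspect without writing a file.

```python
import pandas as pd

demo = pd.DataFrame({'State': ['Johor', 'Kedah'], 'Area': [19210, 9500]})

# Defaults: index=True and header=True, so row labels and column names are written
with_labels = demo.to_csv()

# Turning both off leaves only the data rows
without_labels = demo.to_csv(index=False, header=False)

print(with_labels)
print(without_labels)
```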

In [20]:
df.to_csv('malaysia_states.csv',index=False,header=False)

In [21]:
# Let's also try a text file
# CSV actually stands for comma separated values.
df.to_csv('malaysia_states.txt',index=False,header=False)

In [54]:
df.to_excel?

In [53]:
# And to Excel files
df.to_excel('malaysia_states.xlsx',index=False)

In [24]:
# Reset our namespace; delete all variables
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? n
Nothing done.


In [25]:
%whos

Variable     Type             Data/Info
---------------------------------------
DataFrame    type             <class 'pandas.core.frame.DataFrame'>
Series       type             <class 'pandas.core.series.Series'>
area         list             n=13
dates        DatetimeIndex    DatetimeIndex(['2016-11-0<...>atetime64[ns]', freq='D')
df           DataFrame                      State    Ar<...>       Terengganu   13035
df1          DataFrame          animal  number\n0    ca<...>      2\n2  mouse       3
df2          DataFrame             A          B    C  D<...>01-02  1.0  3  train  foo
df3          DataFrame                           A     <...>87792 -0.248619  0.932639
matplotlib   module           <module 'matplotlib' from<...>/matplotlib/__init__.py'>
np           module           <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
pd           module           <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
plt          module           <module 'matplotlib.pyplo<...>es/

## Loading Data

Let's now try accessing that csv that we just saved.  Let us take a look at this function and what inputs it takes.

In [26]:
pd.read_csv?

Even though this function has many parameters, we will simply pass it the location of the text file. We know that we saved it in the same directory as the notebook.  

### CSVs and Excel

In [27]:
path = 'malaysia_states.csv'
df = pd.read_csv(path)
df

Unnamed: 0,Johor,19210
0,Kedah,9500
1,Kelantan,15099
2,Melaka,1664
3,Negeri Sembilan,6686
4,Pahang,36137
5,Perak,21035
6,Perlis,821
7,Penang,1048
8,Sabah,73631
9,Sarawak,124450


The read_csv function treated the first record in the csv file as the header names. This is not correct, since the file we saved has no header row.
To correct this, we pass the header parameter to the read_csv function and set it to None (Python's null value).

In [28]:
df = pd.read_csv(path, header=None)
df

Unnamed: 0,0,1
0,Johor,19210
1,Kedah,9500
2,Kelantan,15099
3,Melaka,1664
4,Negeri Sembilan,6686
5,Pahang,36137
6,Perak,21035
7,Perlis,821
8,Penang,1048
9,Sabah,73631


If we want to give the columns specific names, we pass another parameter called names. When names is given, we can omit the header parameter.
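
The same behaviour can be sketched without touching the file system by reading from an in-memory string. Here `text` is a two-line stand-in for a headerless file like the one saved earlier.

```python
import io
import pandas as pd

# Stand-in for a headerless csv file (illustrative, two rows only)
text = "Johor,19210\nKedah,9500\n"

# Default behaviour: the first row is consumed as the header
default = pd.read_csv(io.StringIO(text))

# Passing names keeps every row as data and labels the columns
named = pd.read_csv(io.StringIO(text), names=['State', 'Area'])

print(default)
print(named)
```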

In [29]:
area_df = pd.read_csv(path, names=['State','Area'])
area_df

Unnamed: 0,State,Area
0,Johor,19210
1,Kedah,9500
2,Kelantan,15099
3,Melaka,1664
4,Negeri Sembilan,6686
5,Pahang,36137
6,Perak,21035
7,Perlis,821
8,Penang,1048
9,Sabah,73631


You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas these are part of the ***index*** of the dataframe. You can think of the index as the primary key of a SQL table, with the exception that an index is allowed to have duplicates.  

***[State, Area]*** can be thought of as column headers similar to the ones found in an Excel spreadsheet or SQL database.
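
A minimal sketch of the point about duplicates, using made-up labels and values: an index may repeat a label, and a label lookup then returns every matching row.

```python
import pandas as pd

# Made-up data; the label 'a' appears twice on purpose
demo = pd.DataFrame({'value': [10, 20, 30]}, index=['a', 'b', 'a'])

print(demo.index.is_unique)   # reports whether all labels are distinct
print(demo.loc['a'])          # returns both rows labelled 'a'
```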

> Delete the csv file now that we are done using it.

In [30]:
# Using a Python Library - you can also use the unix command directly!
import os
os.remove(path)

In [31]:
# Note that we do the same with xls files, only use read_excel.
# Try it!
pd.read_excel?

## Selecting Data

We can select data both by their labels and by their position. 

In [32]:
dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(dates)
df

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')


Unnamed: 0,A,B,C,D
2016-01-01,0.596835,0.981417,1.753297,1.380331
2016-01-02,-0.588705,-0.836795,-0.851339,1.643648
2016-01-03,0.072855,0.263879,1.28931,1.981564
2016-01-04,1.304253,-0.340662,1.024972,0.960653
2016-01-05,-1.632018,-2.651721,1.826417,-0.056957
2016-01-06,-0.472266,-1.8542,0.657522,0.736938


In [33]:
# Try head, tail, index, columns, values, describe, T, sort_index
# sort_values and see for yourself what they do!

In [34]:
# Getting a cross section on a label
print(dates[0])
print(df.loc[dates[0]])

2016-01-01 00:00:00
A    0.596835
B    0.981417
C    1.753297
D    1.380331
Name: 2016-01-01 00:00:00, dtype: float64


In [35]:
# Selecting on a multi-axis by label
df.loc[:,['A','B']]


Unnamed: 0,A,B
2016-01-01,0.596835,0.981417
2016-01-02,-0.588705,-0.836795
2016-01-03,0.072855,0.263879
2016-01-04,1.304253,-0.340662
2016-01-05,-1.632018,-2.651721
2016-01-06,-0.472266,-1.8542


In [36]:
# Showing label slicing, both endpoints are included 
# unlike normal slicing

df.loc['20160103':'20160105',['B','C']]

Unnamed: 0,B,C
2016-01-03,0.263879,1.28931
2016-01-04,-0.340662,1.024972
2016-01-05,-2.651721,1.826417


In [37]:
# To get a scalar value... both work!
df.loc[dates[2],'D']
# df.at[dates[2],'D']

1.9815642026074607

In [38]:
df.at[dates[2],'D']

1.9815642026074607

```iloc``` works like ```loc```, except that it selects by integer position rather than by label, and its slices exclude the stop position, as in ordinary Python slicing.
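
A minimal sketch of the difference in slicing, using a stand-in frame `demo` filled with consecutive integers (illustrative data): a label slice includes both endpoints, while a position slice excludes the stop.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20160101', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list('ABCD'))

by_label = demo.loc['20160103':'20160105']   # label slice: both endpoints included
by_position = demo.iloc[2:5]                 # position slice: position 5 is excluded

# Both select the rows for 2016-01-03 through 2016-01-05
print(by_label.equals(by_position))
```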


In [39]:
# Select via the position of the passed integers
df.iloc[2,3]

1.9815642026074607

In [40]:
# By integer slices, acting similar to numpy/python
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2016-01-04,1.304253,-0.340662
2016-01-05,-1.632018,-2.651721


In [41]:
# By lists of integer position locations, 
# similar to the numpy/python style
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2016-01-02,-0.588705,-0.851339
2016-01-03,0.072855,1.28931
2016-01-05,-1.632018,1.826417


In [42]:
# iloc is used to slice rows and columns explicitly

df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2016-01-02,-0.588705,-0.836795,-0.851339,1.643648
2016-01-03,0.072855,0.263879,1.28931,1.981564


In [43]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2016-01-01,0.981417,1.753297
2016-01-02,-0.836795,-0.851339
2016-01-03,0.263879,1.28931
2016-01-04,-0.340662,1.024972
2016-01-05,-2.651721,1.826417
2016-01-06,-1.8542,0.657522


In [44]:
# For getting a value explicitly
df.iloc[1,1]
# df.iat[1,1]

-0.83679484358672862

## Exercise

In [None]:
# retrieve the value of Column C on 2016-01-05

In [None]:
# sort the states by area in descending order

In [None]:
# which is the smallest state?

### Masks

We can use _boolean arrays_ to select data

In [45]:
df.A>0

2016-01-01     True
2016-01-02    False
2016-01-03     True
2016-01-04     True
2016-01-05    False
2016-01-06    False
Freq: D, Name: A, dtype: bool

In [46]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2016-01-01,0.596835,0.981417,1.753297,1.380331
2016-01-03,0.072855,0.263879,1.28931,1.981564
2016-01-04,1.304253,-0.340662,1.024972,0.960653


In [47]:
df > 0

Unnamed: 0,A,B,C,D
2016-01-01,True,True,True,True
2016-01-02,False,False,False,True
2016-01-03,True,True,True,True
2016-01-04,True,False,True,True
2016-01-05,False,False,True,False
2016-01-06,False,False,True,True


In [48]:
df[df > 0]

Unnamed: 0,A,B,C,D
2016-01-01,0.596835,0.981417,1.753297,1.380331
2016-01-02,,,,1.643648
2016-01-03,0.072855,0.263879,1.28931,1.981564
2016-01-04,1.304253,,1.024972,0.960653
2016-01-05,,,1.826417,
2016-01-06,,,0.657522,0.736938
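
Masks like the ones above can also be combined. As a minimal sketch with made-up values, `&`, `|` and `~` act element-wise as "and", "or" and "not". Each comparison needs its own parentheses because these operators bind more tightly than `>`.

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [-1, -2, 3, 4]})

both_positive = demo[(demo.A > 0) & (demo.B > 0)]     # A and B positive
either_positive = demo[(demo.A > 0) | (demo.B > 0)]   # A or B positive
a_not_positive = demo[~(demo.A > 0)]                  # A not positive

print(both_positive)
print(either_positive)
print(a_not_positive)
```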


### Setting Data

In [49]:
df['E'] = ['one', 'one','two','three','four','three']
df

Unnamed: 0,A,B,C,D,E
2016-01-01,0.596835,0.981417,1.753297,1.380331,one
2016-01-02,-0.588705,-0.836795,-0.851339,1.643648,one
2016-01-03,0.072855,0.263879,1.28931,1.981564,two
2016-01-04,1.304253,-0.340662,1.024972,0.960653,three
2016-01-05,-1.632018,-2.651721,1.826417,-0.056957,four
2016-01-06,-0.472266,-1.8542,0.657522,0.736938,three


In [50]:
df['E'].isin(['two','four'])

2016-01-01    False
2016-01-02    False
2016-01-03     True
2016-01-04    False
2016-01-05     True
2016-01-06    False
Freq: D, Name: E, dtype: bool

In [51]:
df[df['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D,E
2016-01-03,0.072855,0.263879,1.28931,1.981564,two
2016-01-05,-1.632018,-2.651721,1.826417,-0.056957,four


In [52]:
# We can set data in a variety of ways
print(df)
df.at[dates[0],'A'] = 0
df.iat[0,1] = 0
df.loc[:,'D'] = np.array([5] * len(df))
df

                   A         B         C         D      E
2016-01-01  0.596835  0.981417  1.753297  1.380331    one
2016-01-02 -0.588705 -0.836795 -0.851339  1.643648    one
2016-01-03  0.072855  0.263879  1.289310  1.981564    two
2016-01-04  1.304253 -0.340662  1.024972  0.960653  three
2016-01-05 -1.632018 -2.651721  1.826417 -0.056957   four
2016-01-06 -0.472266 -1.854200  0.657522  0.736938  three


Unnamed: 0,A,B,C,D,E
2016-01-01,0.0,0.0,1.753297,5,one
2016-01-02,-0.588705,-0.836795,-0.851339,5,one
2016-01-03,0.072855,0.263879,1.28931,5,two
2016-01-04,1.304253,-0.340662,1.024972,5,three
2016-01-05,-1.632018,-2.651721,1.826417,5,four
2016-01-06,-0.472266,-1.8542,0.657522,5,three
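
One further way to set data, sketched here with made-up values: combine a boolean mask with `loc` to assign only to the rows that match.

```python
import pandas as pd

demo = pd.DataFrame({'A': [1.5, -0.5, 2.0], 'B': ['x', 'y', 'z']})

# Replace negative values in column A with 0, leaving the other rows untouched
demo.loc[demo['A'] < 0, 'A'] = 0.0

print(demo)
```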


## Exercise

In [None]:
# Rename the column E to 'values'

In [None]:
# Add another new column to df (name of the column is 'total')

In [None]:
# Add the values of A, B and C and state the total values in 'total' column

In [None]:
# Export the data in df to a csv file.