## Introduction to Pandas ##
We will learn:

a) What is Pandas?

b) What is a Series?

c) What is a DataFrame?

d) How to read data from a file into a DataFrame?

e) How to explore the DataFrame.

In [1]:
import numpy as np

from pandas import Series,DataFrame
import pandas as pd

In [2]:
# Series
# A Series is a labelled, one-dimensional collection of values, similar to a NumPy vector.

series = Series([3, 6, 9, 12])


In [6]:
# Let's print the values
print(series.values)

# Let's print the index
print(series.index.values)



[ 3  6  9 12]
[0 1 2 3]


In [9]:
# The main advantage of Series objects is the ability to use non-integer labels as the index.

w = Series([8700000, 4300000, 3000000, 2100000, 400000],
           index=['USSR', 'Germany', 'China', 'Japan', 'USA'])

print(w)

USSR       8700000
Germany    4300000
China      3000000
Japan      2100000
USA         400000
dtype: int64


In [11]:
# Now we can use index values to select values in Series
# Check the value of index 'USA'
w['USA']



400000

In [14]:
# Check if USSR is in Series index
('USSR' in w)

True

In [36]:
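# Sort the index labels alphabetically and print the Series in that order.
# The comparison w > 4000000 (4 million) returns a boolean Series, which can be
# used as a mask to find who had more than 4 million casualties.
a = sorted(w.index)
print(a)
print(w[a])
type(w > 4000000)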

['China', 'Germany', 'Japan', 'USA', 'USSR']
Countries
China      3000000
Germany    4300000
Japan      2100000
USA         400000
USSR       8700000
Name: Casualties in World War 2, dtype: int64


pandas.core.series.Series

In [37]:
# Naming the series and index
w.name = "Casualties in World War 2"
w.index.name = "Countries"
w

Countries
USSR       8700000
Germany    4300000
China      3000000
Japan      2100000
USA         400000
Name: Casualties in World War 2, dtype: int64
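
To pick out the countries with more than 4 million casualties, a boolean comparison such as `w > 4000000` can be used as a mask. The following is a minimal sketch that rebuilds `w` so it runs on its own; `mask` and `over_4_million` are names chosen here for illustration only.

```python
from pandas import Series

# Same casualty figures as the Series w above
w = Series([8700000, 4300000, 3000000, 2100000, 400000],
           index=['USSR', 'Germany', 'China', 'Japan', 'USA'])

# Comparing a Series with a number gives a boolean Series (the mask)
mask = w > 4000000

# Indexing with the mask keeps only the entries where the mask is True
over_4_million = w[mask]
print(over_4_million)
```

Only the USSR and Germany entries pass the mask, since they are the only values above 4000000.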

### Now we'll learn DataFrame ###

A DataFrame is a 2-dimensional, labelled data structure (rows and columns) that provides a suite of methods and attributes to quickly explore, analyze, and visualize data.

In [39]:
# Read data from csv file
df = pd.read_csv("ship_data.csv")
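
Since `ship_data.csv` may not be at hand, here is a hedged sketch of what `read_csv` does, using an in-memory text buffer instead of a file. The layout of `csv_text` is an assumption: it combines the column names and the first three rows displayed in this notebook, and `sample` is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical stand-in for ship_data.csv (assumed layout: header line, then one row per passenger)
csv_text = u"""Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0
2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1
3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1
"""

# read_csv accepts a path or a file-like object; the first line becomes the column names
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # the type inferred for each column
```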

In [40]:
# Now that we've read the dataset into a DataFrame, we can start using the DataFrame methods to explore the data
# a) Print the first 10 rows of the df

df.head(10)

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0
5,6,3,Megan Clarkson,male,,0,0,8458.3,Chicago,0
6,7,1,Molly Bower,male,54.0,0,0,51862.5,New York,0
7,8,3,Steven Jones,male,2.0,3,1,21075.0,New York,0
8,9,3,Bernadette Vance,female,27.0,0,2,11133.3,New York,1
9,10,2,Irene Chapman,female,-20.0,1,0,30070.8,Los Angeles,1


In [41]:
# b) Print the last rows of the df. With no argument, tail() returns the last 5 rows;
#    try df.tail(15) to print the last 15.
df.tail()

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
886,887,2,Bella Tucker,male,27.0,0,0,13000.0,New York,0
887,888,1,Boris Howard,female,19.0,0,0,30000.0,New York,1
888,889,3,Cameron Lambert,female,,1,2,23450.0,New York,0
889,890,1,Theresa Hill,male,26.0,0,0,30000.0,Los Angeles,1
890,891,3,Caroline Fraser,male,32.0,0,0,7750.0,Chicago,0


In [48]:
# Summary statistics of the numeric columns; the 'count' row gives the number of non-null values
df.describe()

Unnamed: 0,Passenger ID,Class,Age,Siblings Count,Parents Count,Fare,Survived
count,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,446.0,2.308642,30.014244,0.523008,0.381594,32204.207969,0.383838
std,257.353842,0.836071,16.633418,1.102743,0.806057,49693.428597,0.486592
min,1.0,1.0,-20.0,0.0,0.0,0.0,0.0
25%,223.5,2.0,20.0,0.0,0.0,7910.4,0.0
50%,446.0,3.0,28.0,0.0,0.0,14454.2,0.0
75%,668.5,3.0,38.75,1.0,0.0,31000.0,1.0
max,891.0,3.0,200.0,8.0,6.0,512329.2,1.0
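
In the summary above, Age has a count of 714 while the other columns have 891, because `count` skips missing values. To get the length of a DataFrame itself, `len()` or `.shape` can be used. A small sketch on a stand-in DataFrame (`sample` is an illustrative name, built from three rows displayed earlier, one of which has a missing Age):

```python
import numpy as np
import pandas as pd

# Small stand-in for df; the second row has no Age
sample = pd.DataFrame({
    "Passenger ID": [5, 6, 7],
    "Age": [35.0, np.nan, 54.0],
    "Fare": [8050.0, 8458.3, 51862.5],
})

print(len(sample))            # number of rows
print(sample.shape)           # (rows, columns)
print(sample["Age"].count())  # non-null values only, like the 'count' row of describe()
```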


In [52]:
# To access the full list of column names, use the columns attribute
column_names = df.columns
column_names

Index([u'Passenger ID', u'Class', u'Name', u'Gender', u'Age',
       u'Siblings Count', u'Parents Count', u'Fare', u'Embarked', u'Survived'],
      dtype='object')

In [53]:
# Selecting an individual column returns a Series
series = df["Passenger ID"]
print(series.head(10))

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
Name: Passenger ID, dtype: int64
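
The difference between selecting one column and selecting several can be sketched as follows. The stand-in DataFrame `sample` and the names `one_column` and `two_columns` are illustrative; the values come from the first three rows displayed earlier.

```python
import pandas as pd

# Stand-in for df with three of its columns
sample = pd.DataFrame({
    "Passenger ID": [1, 2, 3],
    "Class": [3, 1, 3],
    "Name": ["Alexander Harris", "Frank Parsons", "Anthony Churchill"],
})

one_column = sample["Passenger ID"]             # single label -> Series
two_columns = sample[["Passenger ID", "Name"]]  # list of labels -> DataFrame

print(type(one_column))
print(type(two_columns))
print(two_columns.shape)
```

The double brackets are simply a Python list of column names placed inside the indexing brackets.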


In [57]:
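# Selecting multiple columns - say 'Passenger ID' and 'Name'.
# Pass a list of column names; the result is a DataFrame, not a Series.
subset = df[["Passenger ID", "Name"]]
subset.head(10)

Unnamed: 0,Passenger ID,Name
0,1,Alexander Harris
1,2,Frank Parsons
2,3,Anthony Churchill
3,4,Alexandra Hughes
4,5,Joan Fraser
5,6,Megan Clarkson
6,7,Molly Bower
7,8,Steven Jones
8,9,Bernadette Vance
9,10,Irene Chapman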

In [58]:
# Selecting rows with index labels 3 to 6 (a .loc slice includes both end labels)
df.loc[3:6]

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0
5,6,3,Megan Clarkson,male,,0,0,8458.3,Chicago,0
6,7,1,Molly Bower,male,54.0,0,0,51862.5,New York,0


In [None]:
# Selecting rows with index 1, 5, 10
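
One possible way to select rows 1, 5 and 10 is to pass a list of labels to `.loc`. The sketch below uses a stand-in DataFrame with only a 'Passenger ID' column (`sample` and `picked` are illustrative names) and also shows how `.loc` and `.iloc` slices differ at the end point.

```python
import pandas as pd

# Stand-in for df: 11 rows with the default 0..10 index
sample = pd.DataFrame({"Passenger ID": list(range(1, 12))})

# A list of labels inside .loc selects exactly those rows
picked = sample.loc[[1, 5, 10]]
print(picked)

# .loc slices include the end label, .iloc slices exclude the end position
print(len(sample.loc[3:6]))   # labels 3, 4, 5, 6
print(len(sample.iloc[3:6]))  # positions 3, 4, 5
```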


In [59]:
# Try adding a column say Alive/Dead to the DataFrame with a default value 'Dead'
df["Alive/Dead"] = 'Dead'
df.head()

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived,Alive/Dead
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0,Dead
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1,Dead
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1,Dead
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1,Dead
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0,Dead


In [62]:
# Assign the value 'Alive' to rows with index 0 and 4.
# Assignment aligns on the index, so every row not listed in the Series becomes NaN.
alive_or_dead = Series(["Alive", "Alive"], index=[4, 0])

print(alive_or_dead)
df["Alive/Dead"] = alive_or_dead
df.head()

4    Alive
0    Alive
dtype: object


Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived,Alive/Dead
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0,Alive
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1,
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1,
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1,
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0,Alive


In [67]:
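# Rows that were not covered by the assigned Series are NaN; fill them with 'Dead' again
df["Alive/Dead"] = df["Alive/Dead"].fillna("Dead")

# Also check what these return:
#     1) type(df)   - the datatype
#     2) df.info()  - the index range, column types and non-null counts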


In [68]:

df.head()

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived,Alive/Dead
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0,Alive
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1,Dead
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1,Dead
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1,Dead
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0,Alive
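
The assign-then-`fillna` steps above can also be written as a single targeted assignment with `.loc`. This is an alternative sketch, not the approach used in the cells above; `sample` and `aligned` are illustrative stand-ins for `df`.

```python
import pandas as pd

# Stand-in for df with the default value 'Dead' in every row
sample = pd.DataFrame({"Passenger ID": [1, 2, 3, 4, 5]})
sample["Alive/Dead"] = "Dead"

# Assigning a Series aligns on index labels; labels missing from the Series become NaN
aligned = sample.copy()
aligned["Alive/Dead"] = pd.Series(["Alive", "Alive"], index=[4, 0])
print(aligned["Alive/Dead"].isnull().sum())  # number of rows that lost their value

# .loc with a list of row labels changes only those rows and leaves the rest untouched
sample.loc[[0, 4], "Alive/Dead"] = "Alive"
print(sample)
```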



In [None]:
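# Try out:
#         1) matrix = df.values   (df.as_matrix() in old pandas versions; removed in pandas 1.0)
#         2) Check the datatype of matrix
#         3) Print matrix[0,0] and matrix[0]   (df[0] would look for a column labelled 0)
#         4) Check the datatype of matrix[0]
# Try out:
#         1) Print df.iloc[0]
#         2) Print df.loc[0]   (df.ix[0] is deprecated)
#         3) Check the datatype of df.loc[0]

In [69]:
matrix = df.values
type(matrix)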

numpy.ndarray

In [70]:
matrix

array([[1, 3, 'Alexander Harris', ..., 'New York', 0, 'Alive'],
       [2, 1, 'Frank Parsons', ..., 'Los Angeles', 1, 'Dead'],
       [3, 3, 'Anthony Churchill', ..., 'New York', 1, 'Dead'],
       ..., 
       [889, 3, 'Cameron Lambert', ..., 'New York', 0, 'Dead'],
       [890, 1, 'Theresa Hill', ..., 'Los Angeles', 1, 'Dead'],
       [891, 3, 'Caroline Fraser', ..., 'Chicago', 0, 'Dead']], dtype=object)

In [71]:
# Note: .ix is deprecated (and was removed in pandas 1.0); this call only returns an indexer object
df.ix(axis=1)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


<pandas.core.indexing._IXIndexer at 0x7f08e55f1610>
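
As the deprecation warning suggests, `.loc` and `.iloc` cover what `.ix` did. The sketch below uses a stand-in DataFrame whose row labels (10, 20, 30) are chosen on purpose so that labels and positions differ; all names are illustrative.

```python
import pandas as pd

# Stand-in for df with non-default row labels
sample = pd.DataFrame(
    {"Name": ["Alexander Harris", "Frank Parsons", "Anthony Churchill"],
     "Class": [3, 1, 3]},
    index=[10, 20, 30],
)

by_label = sample.loc[20]      # row whose index label is 20
by_position = sample.iloc[0]   # first row, whatever its label is
print(by_label["Name"])
print(by_position["Name"])
print(type(by_position))       # a single row comes back as a Series
```

With the default 0, 1, 2, ... index used by `df`, labels and positions coincide, which is why the two indexers can look interchangeable there.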

In [72]:
# Print the rows where the Passenger ID is less than 10
# Hint: a DataFrame can be indexed with a condition such as: Passenger ID < 10.
passenger_id_less_than_10 = (df["Passenger ID"] < 10)

print(passenger_id_less_than_10.head())

df[passenger_id_less_than_10]

0    True
1    True
2    True
3    True
4    True
Name: Passenger ID, dtype: bool


Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived,Alive/Dead
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0,Alive
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1,Dead
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1,Dead
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1,Dead
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0,Alive
5,6,3,Megan Clarkson,male,,0,0,8458.3,Chicago,0,Dead
6,7,1,Molly Bower,male,54.0,0,0,51862.5,New York,0,Dead
7,8,3,Steven Jones,male,2.0,3,1,21075.0,New York,0,Dead
8,9,3,Bernadette Vance,female,27.0,0,2,11133.3,New York,1,Dead


In [76]:
# Print the passengers in first Class (Class=1)
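
One possible solution filters with the condition `Class == 1`. The sketch runs on a stand-in DataFrame built from the first four rows displayed earlier (`sample` and `first_class` are illustrative names); on the real data the same expression would use `df`.

```python
import pandas as pd

# Stand-in for df, using the first four rows displayed earlier
sample = pd.DataFrame({
    "Passenger ID": [1, 2, 3, 4],
    "Class": [3, 1, 3, 1],
    "Name": ["Alexander Harris", "Frank Parsons", "Anthony Churchill", "Alexandra Hughes"],
})

# The comparison builds a boolean mask; indexing with it keeps the matching rows
first_class = sample[sample["Class"] == 1]
print(first_class)
```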


In [77]:
# You can also combine multiple conditions using the logical operators & (and) and | (or)
passengers_in_first_class_with_id_less_than_10 = (df["Class"] == 1) & (df["Passenger ID"] < 10)
df[passengers_in_first_class_with_id_less_than_10]

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived,Alive/Dead
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1,Dead
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1,Dead
6,7,1,Molly Bower,male,54.0,0,0,51862.5,New York,0,Dead


In [78]:
# Print passengers with age > 50 and Class = 1
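
A possible solution combines two conditions with `&`. The sketch uses a stand-in DataFrame with four rows taken from the output shown earlier, including one passenger with a missing Age; `sample` and `older_first_class` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for df; Megan Clarkson has no recorded Age
sample = pd.DataFrame({
    "Name": ["Frank Parsons", "Megan Clarkson", "Molly Bower", "Steven Jones"],
    "Class": [1, 3, 1, 3],
    "Age": [38.0, np.nan, 54.0, 2.0],
})

# Each condition needs its own parentheses because & binds tighter than > and ==
older_first_class = sample[(sample["Age"] > 50) & (sample["Class"] == 1)]
print(older_first_class)
```

A missing Age compares as False, so rows without an age are left out of the result.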


## Check point ##

Pandas can load data from a file (for example a CSV file) into a DataFrame.

In Pandas, a Series can be thought of as a single column or row (1D) and a DataFrame as a table (2D).

A Pandas Series or DataFrame can easily be converted to a NumPy array.

REMEMBER: DataFrame[0] returns all values of the column labelled 0, while
NumpyArray[0] returns all values in row 0.
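
The indexing difference in the reminder can be checked on a tiny example. This is a sketch with illustrative names (`sample`, `matrix`); the stand-in DataFrame has named columns, so it has no column labelled 0.

```python
import pandas as pd

# Tiny stand-in DataFrame and its NumPy counterpart
sample = pd.DataFrame({"Passenger ID": [1, 2, 3], "Class": [3, 1, 3]})
matrix = sample.values

print(matrix[0])                  # NumPy: index 0 is the first row
print(sample["Class"].tolist())   # DataFrame: brackets select a column by its label
print(sample.iloc[0].tolist())    # DataFrame: use .iloc to get the first row

# sample[0] looks for a column labelled 0, which does not exist here
try:
    sample[0]
except KeyError:
    print("no column labelled 0")
```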