

Pandas Basics

Pandas is a powerful library for reading and handling large datasets, especially data in tabular (row and column) form. It is particularly useful because much of the data available is either in .csv format or imported from Excel. Pandas data types support many operations, such as searching, grouping, handling missing values, and retrieving values by a specific measure.

In this notebook we will try to explore some of these data types and operations.
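
As a quick preview of the operations named above (searching, grouping, handling missing values), here is a minimal sketch. The column names and values are made up purely for illustration and are not part of any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for this illustration only.
preview = pd.DataFrame({
    'city':  ['Oslo', 'Oslo', 'Rome', 'Rome'],
    'sales': [10.0, np.nan, 7.0, 5.0],
})

# Searching: keep only the rows whose 'city' equals 'Rome'.
romeRows = preview[preview['city'] == 'Rome']

# Handling missing values: replace NaN in 'sales' with 0.
filled = preview.fillna({'sales': 0.0})

# Grouping: total sales per city (sum() skips NaN by default).
totals = preview.groupby('city')['sales'].sum()
```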

In [2]:
import pandas as pd
import numpy as np

Data type Series: A Series represents one-dimensional, array-like data with an index. A series can be created from many
sources, such as a numpy array, a list, or a dictionary.

In [3]:
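#Simple, empty series. The dtype is given explicitly because recent pandas versions no longer default an empty series to float64.
simpleSeries = pd.Series(dtype='float64')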
simpleSeries

Series([], dtype: float64)

In [4]:
#Create series from list. Note that the index is automatically created if not explicitly specified.
simpleList = ['a','b','c','d','e']
listSeries = pd.Series(simpleList)
listSeries

0    a
1    b
2    c
3    d
4    e
dtype: object
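
For contrast with the automatically created index, the labels can also be passed explicitly through the `index` argument of `pd.Series`. The sketch below assumes arbitrary integer labels chosen only for illustration.

```python
import pandas as pd

simpleList = ['a', 'b', 'c', 'd', 'e']

# Pass the labels explicitly instead of relying on the default 0..n-1 index.
# The label values are arbitrary and used only for illustration.
labelledSeries = pd.Series(simpleList, index=[10, 20, 30, 40, 50])

# .loc looks a value up by its label rather than by its position.
firstValue = labelledSeries.loc[10]
```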

In [5]:
#Create a series from numpy array.
arraySeries = pd.Series(np.array(['a','b','c','d','e']))
arraySeries

0    a
1    b
2    c
3    d
4    e
dtype: object

In [6]:
#Create a series from dictionary.
dictSeries1 = pd.Series({0:'a', 
                        1:'b',
                        2:'c',
                        3:'d',
                        4:'e'})
dictSeries1

0    a
1    b
2    c
3    d
4    e
dtype: object

In [7]:
#The same series can be created from a dictionary that was defined beforehand:
dict1 = {0:'a', 
         1:'b',
         2:'c',
         3:'d',
         4:'e'}
dictSeries2 = pd.Series(dict1)
dictSeries2

0    a
1    b
2    c
3    d
4    e
dtype: object
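
When a dictionary is combined with an explicit `index`, pandas pulls out the values for the requested labels in the requested order. A small sketch, assuming a label (7) that is deliberately absent from the dictionary:

```python
import pandas as pd

dict1 = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}

# Ask for labels 3, 1 and 7, in that order. Label 7 is not a key of dict1,
# so its value is filled with NaN.
selectedSeries = pd.Series(dict1, index=[3, 1, 7])
```

This is one place where missing values can appear without any being present in the input data.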

In [8]:
#Create a series of 6 random integers with numpy: low=0 (inclusive), high=9 (exclusive).
numpyRandSeries = pd.Series(np.random.randint(0,9,6))
numpyRandSeries

0    8
1    0
2    6
3    7
4    3
5    2
dtype: int32
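
Because the values are random, the output above changes on every run. If repeatable values are wanted, one option is a seeded numpy generator. This is a sketch using `np.random.default_rng`, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the "random" values repeatable between runs.
# The seed value 42 is arbitrary.
rng = np.random.default_rng(42)

# Like np.random.randint, integers() excludes the upper bound by default,
# so the values fall in the range 0..8.
seededSeries = pd.Series(rng.integers(0, 9, size=6))
```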

In [9]:
#Modifying index in series data.
numpyRandSeries.index = ['a','b','c','d','e','f']
numpyRandSeries

a    8
b    0
c    6
d    7
e    3
f    2
dtype: int32

In [10]:
#Accessing series with index and modifying it.
numpyRandSeries['d'] = 10
numpyRandSeries

a     8
b     0
c     6
d    10
e     3
f     2
dtype: int32

In [11]:
#Retrieve series data by slicing with index labels. Unlike positional slicing, the end label 'd' is included.
numpyRandSeries[:'d']

a     8
b     0
c     6
d    10
dtype: int32
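
The difference between label-based and position-based slicing can be made explicit with `.loc` and `.iloc`. The sketch below rebuilds a series with the same values as the output above so that it runs on its own.

```python
import pandas as pd

labelled = pd.Series([8, 0, 6, 10, 3, 2], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Label-based slice: the end label 'd' is included, giving 4 elements.
byLabel = labelled.loc[:'d']

# Position-based slice: the end position 3 is excluded, giving 3 elements.
byPosition = labelled.iloc[:3]
```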

In [12]:
#Modify a group of series values at once by assigning a single scalar to a slice.
numpyRandSeries[:'c'] = 89
numpyRandSeries

a    89
b    89
c    89
d    10
e     3
f     2
dtype: int32

In [13]:
#Another way of modifying: assign a list with one value per selected element.
numpyRandSeries[:'c'] = [9,19,29]
numpyRandSeries

a     9
b    19
c    29
d    10
e     3
f     2
dtype: int32
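
Besides slices, a group of values can also be selected by a condition. The following is a minimal sketch of assignment through a boolean mask; the threshold of 10 and the replacement value 0 are arbitrary choices for illustration.

```python
import pandas as pd

maskSeries = pd.Series([9, 19, 29, 10, 3, 2], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Build a boolean mask and assign through .loc: every value below 10 becomes 0.
maskSeries.loc[maskSeries < 10] = 0
```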

Dataframe: By far the most useful data type in Pandas. It is used for handling data in tabular format. Imagine it as a horizontal stack of pandas series (one per column) or a stack of dictionaries (one per row).

Dataframes can be created independently, but most commonly they are used to hold data read from Excel worksheets or similar sources. The ability to perform operations on the loaded data makes the dataframe a powerful tool. We are going to explore this ability in this notebook.

Dataframes can be created from any of the following:

1. List
2. Dictionary
3. Numpy Array
4. Series
5. Dataframe
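
The cells that follow cover lists, dictionaries, numpy arrays and series. The last option, building a dataframe from another dataframe, can be sketched as below; the variable names are hypothetical and the data is a small stand-in.

```python
import pandas as pd

sourceDataframe = pd.DataFrame({'tens': [10, 11], 'twenties': [20, 21]})

# Build a new dataframe from an existing one, keeping only the 'tens' column.
# copy=True requests an independent copy of the underlying data.
copiedDataframe = pd.DataFrame(sourceDataframe, columns=['tens'], copy=True)

# Changing the copy leaves the source untouched.
copiedDataframe.loc[0, 'tens'] = 99
```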

In [15]:
#Create a dataframe with list.
simpleList = [10,20,30,40,50]
listDataframe = pd.DataFrame(simpleList)
listDataframe

Unnamed: 0,0
0,10
1,20
2,30
3,40
4,50


In [21]:
#The above example is very simple. Dataframes are much more powerful. Let us create some more columns and name them.
threeColList = [[10,20,30], [11,21,31], [12,22,32], [13,23,33], [14,24,34], [15,25,35]]
listDataframe = pd.DataFrame(threeColList, columns=['tens', 'twenties', 'thirties'], index=['a','b','c','d','e','f'])
listDataframe

Unnamed: 0,tens,twenties,thirties
a,10,20,30
b,11,21,31
c,12,22,32
d,13,23,33
e,14,24,34
f,15,25,35


In [25]:
#Create dataframe from a dictionary.
simpleDict = {'tens':     [10,11,12,13,14,15], 
              'twenties': [20,21,22,23,24,25],
              'thirties': [30,31,32,33,34,35],}
dictDataframe = pd.DataFrame(simpleDict, columns=['tens', 'twenties', 'thirties'])
dictDataframe

Unnamed: 0,tens,twenties,thirties
0,10,20,30
1,11,21,31
2,12,22,32
3,13,23,33
4,14,24,34
5,15,25,35


In [26]:
#Create dataframes with list of dictionaries.
simpleDict1 = [{'tens': 10, 'twenties': 20, 'thirties': 30}, 
               {'tens': 11, 'twenties': 21, 'thirties': 31},
               {'tens': 12, 'twenties': 22, 'thirties': 32},
               {'tens': 13, 'twenties': 23, 'thirties': 33},
               {'tens': 14, 'twenties': 24, 'thirties': 34},
               {'tens': 15, 'twenties': 25, 'thirties': 35}]
dictDataframe = pd.DataFrame(simpleDict1, columns=['tens', 'twenties', 'thirties'])
dictDataframe

Unnamed: 0,tens,twenties,thirties
0,10,20,30
1,11,21,31
2,12,22,32
3,13,23,33
4,14,24,34
5,15,25,35


In [30]:
#Create a dataframe with multiple numpy arrays arranged in a dictionary.
simpleArr1 = np.random.randint(11, 19, 6)
simpleArr2 = np.random.randint(21, 29, 6)
simpleArr3 = np.random.randint(31, 39, 6)

arrDataframe = pd.DataFrame({'tens': simpleArr1, 
                             'twenties': simpleArr2,
                             'thirties': simpleArr3}, columns=['tens', 'twenties', 'thirties'])
arrDataframe

Unnamed: 0,tens,twenties,thirties
0,13,28,32
1,16,21,37
2,16,27,34
3,14,24,35
4,18,28,32
5,18,22,33


In [39]:
#Create a dataframe from a dictionary of pandas series. The series are aligned on their index, so the shorter ones are padded with NaN.
simpleSeries = {'tens': pd.Series([11, 15, 16, 17], index=[0,1,2, 3]),
                'twenties': pd.Series([22, 21, 24], index=[0,1,2]),
                'thirties': pd.Series([33, 35, 37], index=[0,1,2])}
seriesDataframe = pd.DataFrame(simpleSeries, columns=['tens', 'twenties', 'thirties'])
seriesDataframe

Unnamed: 0,tens,twenties,thirties
0,11,22.0,33.0
1,15,21.0,35.0
2,16,24.0,37.0
3,17,,


One of the most powerful features of pandas is the ability to load an existing dataset. Here we will load the
"titanic" dataset into a pandas Dataframe object.

In [42]:
#Let us read the titanic dataset excel file and print the first five rows.
#Read the excel file.
titanicDataframe = pd.read_excel("titanic.xls")

#Display only first 5 rows.
titanicDataframe.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
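
Data in .csv format is loaded in much the same way with `pd.read_csv`. The sketch below assumes no file on disk: it reads from an in-memory string holding a few columns of the first three rows shown above.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a .csv file; the rows are a hand-copied
# subset of the columns shown above, not the full dataset.
csvText = io.StringIO(
    "pclass,survived,sex,age\n"
    "1,1,female,29.0\n"
    "1,1,male,0.9167\n"
    "1,0,female,2.0\n"
)

# read_csv accepts a path or any file-like object.
csvDataframe = pd.read_csv(csvText)

# head(n) returns the first n rows; n defaults to 5.
firstTwo = csvDataframe.head(2)
```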


Let us take the simple array-based dataframe created earlier and explore some of the operations possible on it.

In [43]:
#Let's print the array dataframe.
arrDataframe

Unnamed: 0,tens,twenties,thirties
0,13,28,32
1,16,21,37
2,16,27,34
3,14,24,35
4,18,28,32
5,18,22,33


In [44]:
#Print any one column.
arrDataframe['tens']

0    13
1    16
2    16
3    14
4    18
5    18
Name: tens, dtype: int32
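
Selecting a single column returns a series, as shown above. Selecting several columns or a single row works similarly; the sketch below uses a small stand-in dataframe so that it runs on its own.

```python
import pandas as pd

demoDataframe = pd.DataFrame({'tens': [13, 16, 16],
                              'twenties': [28, 21, 27],
                              'thirties': [32, 37, 34]})

# A list of names returns a dataframe with just those columns.
twoColumns = demoDataframe[['tens', 'thirties']]

# .loc selects a row by index label; the result is a series keyed by column name.
secondRow = demoDataframe.loc[1]
```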

In [45]:
#Add a column called 'forties'
arrDataframe['forties'] = [41,41,43,48,47,45]
arrDataframe

Unnamed: 0,tens,twenties,thirties,forties
0,13,28,32,41
1,16,21,37,41
2,16,27,34,43
3,14,24,35,48
4,18,28,32,47
5,18,22,33,45


In [46]:
#Delete a column.
del arrDataframe['tens']
arrDataframe

Unnamed: 0,twenties,thirties,forties
0,28,32,41
1,21,37,41
2,27,34,43
3,24,35,48
4,28,32,47
5,22,33,45
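
Note that `del` changes the dataframe in place. When the original should be kept, `drop` is an alternative that returns a new dataframe. A minimal sketch on a stand-in dataframe:

```python
import pandas as pd

dropDemo = pd.DataFrame({'twenties': [28, 21], 'thirties': [32, 37], 'forties': [41, 41]})

# drop() returns a new dataframe without the named column;
# dropDemo itself still has all three columns.
withoutForties = dropDemo.drop(columns=['forties'])
```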


In [51]:
#Add a column computed element-wise from two existing columns.
arrDataframe['multiplied'] = arrDataframe['twenties'] * arrDataframe['thirties']
arrDataframe

Unnamed: 0,twenties,thirties,forties,multiplied
0,28,32,41,896
1,21,37,41,777
2,27,34,43,918
3,24,35,48,840
4,28,32,47,896
5,22,33,45,726


In [52]:
#Add a column with single value.
arrDataframe['string'] = 'bar'
arrDataframe

Unnamed: 0,twenties,thirties,forties,multiplied,string
0,28,32,41,896,bar
1,21,37,41,777,bar
2,27,34,43,918,bar
3,24,35,48,840,bar
4,28,32,47,896,bar
5,22,33,45,726,bar
