#### Getting started with pandas

###### Introduction to pandas Data Structures

To get started with pandas, we will need to get comfortable with its two workhorse
data structures: Series and DataFrame. 

###### Series

A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.

In [37]:
import pandas as pd
import numpy as np

In [5]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [6]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [7]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [25]:
obj[1]

7

In [10]:
obj2 = pd.Series([4,7,-5, 3], index=['d', 'b', 'a','c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [15]:
obj3 = pd.Series([4,5,6,7,8], index=['a','b','c','d','e'])
obj3

a    4
b    5
c    6
d    7
e    8
dtype: int64

In [23]:
obj2.values

array([ 4,  7, -5,  3], dtype=int64)

In [24]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [26]:
obj2['a']

-5

In [28]:
obj2['d']=6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [30]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link.

In [33]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [34]:
obj2[obj2>4]

d    6
b    7
dtype: int64

In [35]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [38]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [40]:
'b' in obj2

True

In [42]:
'e' in obj2

False

In [47]:
# A dict can be converted into a Series; the dict keys become the index
sdata = {'Ohio':35000, 'Texas': 71000, 'Oregon':16000, 'Utah': 5000}
obj4 = pd.Series(sdata)
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [48]:
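# By default the index follows the dict's key order. We can override this by passing the keys in the order we want them to appear.
# A label with no matching key ('California') gets NaN, and a key that is not listed ('Utah') is excluded.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj5 = pd.Series(sdata, index=states)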
obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [50]:
# The isnull and notnull functions in pandas should be used to detect missing data
pd.isnull(obj5)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [51]:
pd.notnull(obj5)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [52]:
# Series also has these as instance methods
obj5.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
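
Once missing values are detected, the usual next step is to count, drop, or fill them. The following is a minimal sketch of that, rebuilding the same Series under the hypothetical name `pops`; `isna` is the alias of `isnull`, and filling with `0` is only an illustrative choice, not a recommendation.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
pops = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# isna() is an alias of isnull(); summing the boolean mask counts the NaNs.
n_missing = pops.isna().sum()

# dropna() returns a new Series without the missing entries.
without_missing = pops.dropna()

# fillna() returns a new Series with NaN replaced by the given value.
zero_filled = pops.fillna(0)

print(n_missing)
print(without_missing)
print(zero_filled)
```

None of these calls modifies `pops` itself; each returns a new object.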

In [53]:
# A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [54]:
obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [55]:
obj4 + obj5

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
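
The NaN entries appear because a label that exists in only one operand has no partner to add to. If a missing partner should count as zero instead, the `add` method with `fill_value` is one option. This is a sketch under that assumption; the names `left` and `right` are hypothetical stand-ins for `obj4` and `obj5`.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
left = pd.Series(sdata)
right = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# The + operator aligns on the union of labels; a label missing
# from either side produces NaN.
plain_sum = left + right

# add() with fill_value substitutes 0 when a value is missing on ONE side.
# 'Utah' is only in `left`, so it becomes 5000 + 0.
# 'California' is absent from `left` and NaN in `right`, so both sides
# are missing and the result stays NaN.
filled_sum = left.add(right, fill_value=0)

print(plain_sum)
print(filled_sum)
```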

In [57]:
# Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality
obj5.name = "Population"
obj5.index.name = "State"
obj5

State
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: Population, dtype: float64

###### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index. Under the hood, the data is stored as one
or more two-dimensional blocks rather than a list, dict, or some other collection of
one-dimensional arrays.
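
To make the "dict of Series sharing the same index" picture concrete, here is a small sketch that builds a DataFrame from two Series with different labels. The names `col_a`, `col_b`, and `df` and the numbers are made up purely for illustration.

```python
import pandas as pd

# Two Series with partly different labels (values are arbitrary).
col_a = pd.Series({'Ohio': 1.0, 'Nevada': 2.0})
col_b = pd.Series({'Ohio': 10.0, 'Utah': 30.0})

# The row index of the result is the union of the Series' labels.
df = pd.DataFrame({'a': col_a, 'b': col_b})

# Every column comes back as a Series carrying the DataFrame's row index.
same_index = df['a'].index.equals(df.index)

print(df)
print(same_index)
```

Labels that a given Series does not have (for example `'Utah'` in column `a`) show up as NaN in that column.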

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:


In [60]:
data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada','Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [61]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [62]:
# If we specify a sequence of columns, the DataFrame’s columns will be arranged in that order:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [63]:
# If we pass a column that isn’t contained in the dict, it will appear with missing values in the result:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [64]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [65]:
# A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [69]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [77]:
# Rows can also be retrieved by label with the special loc attribute (use iloc to retrieve by integer position)
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [79]:
# Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [86]:
frame2.debt = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


When we are assigning lists or arrays to a column, the value’s length must match the
length of the DataFrame.
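
The length rule applies to plain lists and arrays. A Series behaves differently: it is aligned to the DataFrame's index by label, and rows without a match get NaN. A minimal sketch of both cases, using a small hypothetical frame rather than `frame2`:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Utah']},
                     index=['one', 'two', 'three'])

# A list whose length differs from the number of rows is rejected.
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is matched on index labels instead of position;
# row 'one' has no matching label and becomes NaN.
frame['debt'] = pd.Series([-1.2, -1.5], index=['two', 'three'])

print(length_error)
print(frame)
```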

In [88]:
# Assigning a column that doesn’t exist will create a new column
frame2['eastern'] = frame2['state'] == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,0.0,True
two,2001,Ohio,1.7,1.0,True
three,2002,Ohio,3.6,2.0,True
four,2001,Nevada,2.4,3.0,False
five,2002,Nevada,2.9,4.0,False
six,2003,Nevada,3.2,5.0,False


In [89]:
# The del keyword will delete columns as with a dict
del frame2['eastern']

In [90]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [97]:
# Another common form of data is a nested dict of dicts
pop = {
    'Neveda' : {2001 :2.4, 2002: 2.9},
    'Ohio' : {2000 : 1.5, 2001 : 1.7, 2002 : 3.6}
}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Neveda,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [101]:
# You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array
frame3.T

Unnamed: 0,2001,2002,2000
Neveda,2.4,2.9,
Ohio,1.7,3.6,1.5


In [104]:
# If a DataFrame’s index and columns have their name attributes set, these will also be displayed:
frame3.index.name = 'year'; frame3.columns.name='state'
frame3

state,Neveda,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


##### pandas’s Index Objects


pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index.

In [105]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [107]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [108]:
index[1:]

Index(['b', 'c'], dtype='object')

In [111]:
# Index objects are immutable and thus can’t be modified by the user:
# index[1] = 'd'
     # Uncommenting the line above raises a TypeError

In [113]:
# Immutability makes it safer to share Index objects among data structures
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [115]:
obj2 = pd.DataFrame([1.5, -2.5, 0], index = labels)
obj2

Unnamed: 0,0
0,1.5
1,-2.5
2,0.0


In [119]:
obj2.index is labels

True

In [120]:
# In addition to being array-like, an Index also behaves like a fixed-size set
frame3

state,Neveda,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [121]:
frame3.columns

Index(['Neveda', 'Ohio'], dtype='object', name='state')

In [122]:
frame3.index

Int64Index([2001, 2002, 2000], dtype='int64', name='year')

In [123]:
'Ohio' in frame3.columns

True

In [124]:
2003 in frame3.index

False

In [126]:
# Unlike Python sets, a pandas Index can contain duplicate labels
dup_labels = pd.Index(['foo','foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
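
The set-like behavior goes beyond membership tests with `in`. The sketch below shows a few of the set-style methods on Index, plus a way to check for the duplicates mentioned above. The names `left` and `right` are hypothetical.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

common = left.intersection(right)    # labels present in both
combined = left.union(right)         # labels present in either
only_left = left.difference(right)   # labels in left but not in right

# Duplicates are allowed, so uniqueness has to be checked explicitly.
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
has_dups = not dup_labels.is_unique
unique_labels = dup_labels.unique()

print(common, combined, only_left)
print(has_dups, unique_labels)
```

Each method returns a new Index, which is consistent with Index objects being immutable.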

##### Reindexing

In [129]:
# reindex means to create a new object with the data conformed to a new index
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [131]:
# Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values.

In [133]:
obj3 = pd.Series(['blue', 'Purple', 'Yellow'], index=[0,2,4])
obj3

0      blue
2    Purple
4    Yellow
dtype: object

In [134]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    Purple
3    Purple
4    Yellow
5    Yellow
dtype: object
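
Forward-filling is one of several ways to handle the new labels. As a rough comparison, the sketch below reindexes the same kind of Series with `ffill`, with `bfill` (backward fill), and with a constant `fill_value`. The name `colors` and the placeholder string `'unknown'` are assumptions made for this example.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill: each new label takes the last earlier value.
forward = colors.reindex(range(6), method='ffill')

# bfill: each new label takes the next later value; label 5 has
# nothing after it, so it stays NaN.
backward = colors.reindex(range(6), method='bfill')

# fill_value: every new label gets the same constant.
constant = colors.reindex(range(6), fill_value='unknown')

print(forward)
print(backward)
print(constant)
```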

In [137]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)), index = ['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [139]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [169]:
# The columns can be reindexed with the columns keyword
states = ['Ohio', 'Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Ohio,Texas,Utah,California
a,0,1,,2
c,3,4,,5
d,6,7,,8


###### Dropping Entries from an Axis

In [166]:
obj =pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [152]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [155]:
new_obj = obj.drop(['c','d'])
new_obj

a    0.0
b    1.0
e    4.0
dtype: float64

In [157]:
# With DataFrame, index values can be deleted from either axis
data = pd.DataFrame(np.arange(16).reshape((4,4)), index= ['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [158]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [161]:
# We can drop values from the columns by passing axis=1 or axis='columns':
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [163]:
data.drop(['two', 'four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


Many functions, like drop, which modify the size or shape of a Series or DataFrame,
can manipulate an object in place (with inplace=True) instead of returning a new object.
Be careful with inplace, as it destroys any data that is dropped.

In [167]:
obj.drop('c', inplace=True)

In [168]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
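
The difference between the two calling styles is easy to miss, because an in-place call returns `None`. A short sketch contrasting them (the variable names `trimmed` and `result` are only for illustration):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: drop returns a new Series and leaves obj untouched.
trimmed = obj.drop('c')
still_there = 'c' in obj

# inplace=True: obj itself is modified and the call returns None.
result = obj.drop('c', inplace=True)

print(still_there)   # 'c' was still in obj after the first call
print(result)        # None
print(obj)
```

Writing `obj = obj.drop('c', inplace=True)` would therefore replace `obj` with `None`, which is a common mistake.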

##### Indexing, Selection, and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers.

In [171]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [176]:
obj['b']

1.0

In [177]:
obj[1]

1.0

In [181]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [183]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [184]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [185]:
obj[obj <2]

a    0.0
b    1.0
dtype: float64

In [187]:
# Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [191]:
# Setting using these methods modifies the corresponding section of the Series
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
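
To see the inclusive endpoint side by side with ordinary positional slicing, the sketch below selects with `loc` (labels) and `iloc` (positions) explicitly, so there is no doubt about which rule applies. The names `by_label` and `by_position` are hypothetical.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints are included.
by_label = obj.loc['b':'c']

# Position slice: the stop position is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label)      # rows 'b' and 'c'
print(by_position)   # row 'b' only
```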

Indexing into a DataFrame with a single value or a sequence retrieves one or more columns.

In [193]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one','two', 'three', 'four']) 
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [194]:
data.two

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [196]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [197]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [198]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [200]:
# Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [204]:
data[data <5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


###### Selection with loc and iloc

For label-indexing on the rows of a DataFrame, pandas provides the special indexing
operators loc and iloc. They enable us to select a subset of the rows and columns from a
DataFrame with NumPy-like notation using either axis labels (loc) or integers
(iloc).

In [205]:
# select a single row and multiple columns by label
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

In [206]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [207]:
# perform some similar selections with integers using iloc
data.iloc[2, [3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [208]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [210]:
data.iloc[[1,2], [3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [211]:
# Both indexing functions work with slices in addition to single labels or lists of labels
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [215]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


###### Integer Indexes

In [218]:
ser = pd.Series(np.arange(3.))
ser
# ser[-1]  # raises KeyError: with an integer index, -1 is looked up as a label, not as a position

0    0.0
1    1.0
2    2.0
dtype: float64

In [223]:
# With a non-integer index, there is no potential for ambiguity
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [224]:
ser2[-1]

2.0

To keep things consistent, if we have an axis index containing integers, data selection
will always be label-oriented. For more precise handling, use loc (for labels) or iloc
(for integers).

In [226]:
ser[:1]

0    0.0
dtype: float64

In [229]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64