#### Getting started with pandas

###### Introduction to pandas Data Structures

To get started with pandas, we will need to get comfortable with its two workhorse
data structures: Series and DataFrame. 

###### Series

A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.

In [8]:
import pandas as pd
import numpy as np

In [9]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [10]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [11]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [12]:
obj[1]

7

In [13]:
obj2 = pd.Series([4,7,-5, 3], index=['d', 'b', 'a','c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [14]:
obj3 = pd.Series([4,5,6,7,8], index=['a','b','c','d','e'])
obj3

a    4
b    5
c    6
d    7
e    8
dtype: int64

In [15]:
obj2.values

array([ 4,  7, -5,  3], dtype=int64)

In [16]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [17]:
obj2['a']

-5

In [18]:
obj2['d']=6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [19]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link:

In [20]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [21]:
obj2[obj2>4]

d    6
b    7
dtype: int64

In [22]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [23]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [24]:
'b' in obj2

True

In [25]:
'e'in obj2

False

In [26]:
# A Series can be created directly from a Python dict
sdata = {'Ohio':35000, 'Texas': 71000, 'Oregon':16000, 'Utah': 5000}
obj4 = pd.Series(sdata)
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [27]:
# By default the index follows the dict's key order. We can override this by passing the labels in the order we want them to appear in the resulting Series
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj5 = pd.Series(sdata, index=states)
obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [28]:
# The isnull and notnull functions in pandas should be used to detect missing data
pd.isnull(obj5)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [29]:
pd.notnull(obj5)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [30]:
# Series also has these as instance methods
obj5.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [31]:
# A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [32]:
obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [33]:
obj4 + obj5

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [34]:
# Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality
obj5.name = "Population"
obj5.index.name = "State"
obj5

State
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: Population, dtype: float64

###### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index. Under the hood, the data is stored as one
or more two-dimensional blocks rather than a list, dict, or some other collection of
one-dimensional arrays.
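
The "dict of Series" view can be made concrete with a small sketch. The names `pop_s`, `year_s`, and `table` below are made up for illustration and are not part of the notebook; the point is only that each Series becomes a column and all columns share one row index.

```python
import pandas as pd

# Two hypothetical Series whose indexes overlap but are not identical
pop_s = pd.Series({'Ohio': 1.5, 'Nevada': 2.4})
year_s = pd.Series({'Ohio': 2000, 'Nevada': 2001, 'Utah': 2002})

# Each dict key becomes a column; the row index is the union of the Series indexes
table = pd.DataFrame({'pop': pop_s, 'year': year_s})

# Selecting a column gives back a Series that carries the shared row index
col = table['pop']
print(table)
print(col.index.equals(table.index))
```

Labels missing from one of the input Series (here `'Utah'` in `pop_s`) show up as NaN in that column.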

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:


In [35]:
data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada','Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [36]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [37]:
# If we specify a sequence of columns, the DataFrame’s columns will be arranged in that order:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [38]:
# If we pass a column that isn’t contained in the dict, it will appear with missing values in the result:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [39]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [40]:
# A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [41]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [42]:
# Rows can be retrieved by label with the special loc attribute (iloc does the same by integer position)
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [43]:
# Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [44]:
frame2.debt = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


When we are assigning lists or arrays to a column, the value’s length must match the
length of the DataFrame.
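
A short sketch of the length rule, under the assumption of a small three-row frame invented for this example. It also shows that assigning a Series behaves differently: the values are aligned on the index, and labels that are missing from the Series become NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, smaller than frame2 above
frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'pop': [1.5, 1.7, 2.4]},
                     index=['one', 'two', 'three'])

# A list with the wrong number of elements is rejected
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on the index instead; row 'one' has no match and gets NaN
frame['debt'] = pd.Series([-1.2, -1.5], index=['two', 'three'])
print(length_error)
print(frame)
```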

In [45]:
# Assigning a column that doesn’t exist will create a new column
frame2['eastern'] = frame2['state'] == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,0.0,True
two,2001,Ohio,1.7,1.0,True
three,2002,Ohio,3.6,2.0,True
four,2001,Nevada,2.4,3.0,False
five,2002,Nevada,2.9,4.0,False
six,2003,Nevada,3.2,5.0,False


In [46]:
# The del keyword will delete columns as with a dict
del frame2['eastern']

In [47]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [48]:
# Another common form of data is a nested dict of dicts
pop = {
    'Neveda' : {2001 :2.4, 2002: 2.9},
    'Ohio' : {2000 : 1.5, 2001 : 1.7, 2002 : 3.6}
}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Neveda,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [49]:
# You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array
frame3.T

Unnamed: 0,2001,2002,2000
Neveda,2.4,2.9,
Ohio,1.7,3.6,1.5


In [50]:
# If a DataFrame’s index and columns have their name attributes set, these will also be displayed:
frame3.index.name = 'year'; frame3.columns.name='state'
frame3

state,Neveda,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


##### pandas’s Index Objects


pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index:

In [51]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [52]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [53]:
index[1:]

Index(['b', 'c'], dtype='object')

In [54]:
# Index objects are immutable and thus can’t be modified by the user:
# index[1] = 'd'
# Running the line above would raise a TypeError
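
To see the immutability without crashing the notebook, the assignment can be wrapped in a try/except. This is only a sketch; `new_index` is a hypothetical name, and building a fresh Index is one of several ways to get changed labels.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment on an Index is not supported
try:
    index[1] = 'd'
    message = None
except TypeError as exc:
    message = str(exc)

# To "change" a label, build a new Index and leave the original untouched
new_index = pd.Index(['d' if label == 'b' else label for label in index])
print(message)
print(new_index)
```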

In [55]:
# Immutability makes it safer to share Index objects among data structures
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [56]:
obj2 = pd.DataFrame([1.5, -2.5, 0], index = labels)
obj2

Unnamed: 0,0
0,1.5
1,-2.5
2,0.0


In [57]:
obj2.index is labels

True

In [58]:
# In addition to being array-like, an Index also behaves like a fixed-size set
frame3

state,Neveda,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [59]:
frame3.columns

Index(['Neveda', 'Ohio'], dtype='object', name='state')

In [60]:
frame3.index

Int64Index([2001, 2002, 2000], dtype='int64', name='year')

In [61]:
'Ohio' in frame3.columns

True

In [62]:
2003 in frame3.index

False

In [63]:
# Unlike Python sets, a pandas Index can contain duplicate labels
dup_labels = pd.Index(['foo','foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
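
The set-like behavior goes beyond the `in` checks above. A minimal sketch, using two made-up Index objects `left` and `right`, of the set operations that Index provides:

```python
import pandas as pd

# Two hypothetical label collections
left = pd.Index(['Ohio', 'Texas', 'Utah'])
right = pd.Index(['Texas', 'Utah', 'Oregon'])

common = left.intersection(right)   # labels present in both
combined = left.union(right)        # labels present in either
only_left = left.difference(right)  # labels in left but not in right

print(common)
print(combined)
print(only_left)
```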

##### Reindexing

In [64]:
# reindex means to create a new object with the data conformed to a new index
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [65]:
# Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or
filling of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values:

In [66]:
obj3 = pd.Series(['blue', 'Purple', 'Yellow'], index=[0,2,4])
obj3

0      blue
2    Purple
4    Yellow
dtype: object

In [67]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    Purple
3    Purple
4    Yellow
5    Yellow
dtype: object

In [68]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)), index = ['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [69]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [70]:
# The columns can be reindexed with the columns keyword
states = ['Ohio', 'Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Ohio,Texas,Utah,California
a,0,1,,2
c,3,4,,5
d,6,7,,8
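
When NaN is not wanted for the newly introduced labels, reindex also accepts a `fill_value` argument. The sketch below rebuilds the small frame from above and reindexes rows and columns in one call; `filled` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# fill_value replaces the NaN that reindex would otherwise introduce
filled = frame.reindex(index=['a', 'b', 'c', 'd'],
                       columns=['Ohio', 'Texas', 'Utah', 'California'],
                       fill_value=0)
print(filled)
```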


###### Dropping Entries from an Axis

In [71]:
obj =pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [72]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [73]:
new_obj = obj.drop(['c','d'])
new_obj

a    0.0
b    1.0
e    4.0
dtype: float64

In [74]:
# With DataFrame, index values can be deleted from either axis
data = pd.DataFrame(np.arange(16).reshape((4,4)), index= ['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [75]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [76]:
# We can drop values from the columns by passing axis=1 or axis='columns':
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [77]:
data.drop(['two', 'four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


Many functions, like drop, which modify the size or shape of a Series or DataFrame,
can manipulate an object in-place without returning a new object:

In [78]:
obj.drop('c', inplace=True)

In [79]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
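
The difference between the two calling styles is easy to miss, so here is a small sketch of it. The variable names `trimmed` and `result` are only for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: a new Series is returned and obj is left untouched
trimmed = obj.drop('c')

# inplace=True: obj itself is modified and the call returns None
result = obj.drop('d', inplace=True)

print(trimmed.index.tolist())
print(obj.index.tolist())
print(result)
```

Because the in-place form returns None, writing `obj = obj.drop('d', inplace=True)` would throw the data away.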

##### Indexing, Selection, and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers:

In [80]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [81]:
obj['b']

1.0

In [82]:
obj[1]

1.0

In [83]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [84]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [85]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [86]:
obj[obj <2]

a    0.0
b    1.0
dtype: float64

In [87]:
# Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [88]:
# Setting using these methods modifies the corresponding section of the Series
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing into a DataFrame retrieves one or more columns, using either a single value or a sequence:

In [89]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one','two', 'three', 'four']) 
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [90]:
data.two

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [91]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [92]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [93]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [94]:
# Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [95]:
data[data <5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


###### Selection with loc and iloc

For DataFrame label-indexing on the rows, pandas provides the special indexing operators
loc and iloc. They enable us to select a subset of the rows and columns from a
DataFrame with NumPy-like notation using either axis labels (loc) or integers
(iloc).

In [96]:
# select a single row and multiple columns by label
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

In [97]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [98]:
# perform some similar selections with integers using iloc
data.iloc[2, [3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [99]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [100]:
data.iloc[[1,2], [3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [101]:
# Both indexing functions work with slices in addition to single labels or lists of labels
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [102]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14
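
The chained selection above can also be written as a single loc call. This is a sketch on a fresh copy of the 4x4 frame (before any values were set to 0); `subset` and `same` are hypothetical names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Rows chosen by a boolean mask, columns by label, in one loc call
subset = data.loc[data['three'] > 5, ['one', 'two']]

# The same block selected purely by integer position
same = data.iloc[1:, :2]

print(subset)
```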


###### Integer Indexes

In [103]:
ser = pd.Series(np.arange(3.))
ser
# ser[-1] would raise a KeyError here: with an integer index, -1 is treated as a label, not a position

0    0.0
1    1.0
2    2.0
dtype: float64

In [104]:
# With a non-integer index, there is no potential for ambiguity
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [105]:
ser2[-1]

2.0

To keep things consistent, if we have an axis index containing integers, data selection
will always be label-oriented. For more precise handling, use loc (for labels) or iloc
(for integers):

In [106]:
ser[:1]

0    0.0
dtype: float64

In [107]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [108]:
ser.iloc[:1]

0    0.0
dtype: float64
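
The ambiguity can be checked directly. The following sketch catches the error from `ser[-1]` and contrasts it with the explicit accessors; the helper variable names are made up for this example.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# With an integer index, -1 is looked up as a label, and no such label exists
try:
    ser[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

last_by_position = ser.iloc[-1]  # explicit positional access works
by_label = ser.loc[:1]           # label slice, endpoint included
by_position = ser.iloc[:1]       # positional slice, endpoint excluded

print(lookup_failed, last_by_position, len(by_label), len(by_position))
```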

##### Arithmetic and Data Alignment

In [109]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [110]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

An important pandas feature for some applications is the behavior of arithmetic
between objects with different indexes. When you are adding together objects, if any
index pairs are not the same, the respective index in the result will be the union of the
index pairs. For users with database experience, this is similar to an automatic outer
join on the index labels. Let’s look at an example:

In [111]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [112]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [113]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [114]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [115]:
# Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
df1+df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [116]:
# If you add DataFrame objects with no column or row labels in common, the result will contain all nulls
df1 = pd.DataFrame({'A' : [1, 2]})
df2 = pd.DataFrame({'B' : [3, 4]})

In [117]:
df1

Unnamed: 0,A
0,1
1,2


In [118]:
df2

Unnamed: 0,B
0,3
1,4


In [119]:
df1-df2

Unnamed: 0,A,B
0,,
1,,


###### Arithmetic methods with fill values

In arithmetic operations between differently indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other:

In [120]:
df1 = pd.DataFrame(np.arange(12.).reshape(3,4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4,5), columns=list('abcde'))

df2.loc[1, 'b'] = np.nan

In [121]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [122]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [123]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [124]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [125]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [126]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [128]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909
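
The `fill_value` idea is not limited to add. As a sketch (with fresh copies of the two frames and without the NaN that was set in df2 above), reindex accepts it too, and so do the other arithmetic methods such as mul:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

# Conform df1 to df2's columns, filling the new column 'e' with 0 instead of NaN
conformed = df1.reindex(columns=df2.columns, fill_value=0)

# For multiplication, 1 is the natural neutral fill value
product = df1.mul(df2, fill_value=1)

print(conformed)
print(product)
```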


###### Operations between DataFrame and Series

In [129]:
arr = np.arange(12.).reshape(3,4)
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [130]:
arr[0]

array([0., 1., 2., 3.])

In [131]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

Operations between a DataFrame and a Series are similar:

In [132]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

In [133]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [134]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [135]:
frame-series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0
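
By default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows. To go the other way (match on the row index and broadcast across the columns), one of the arithmetic methods can be used with an explicit axis. A minimal sketch, where `column_d` and `result` are names chosen here:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
column_d = frame['d']

# Match on the row labels and subtract column 'd' from every column
result = frame.sub(column_d, axis='index')
print(result)
```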


###### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [136]:
frame = pd.DataFrame(np.random.randn(4,3), columns = list('bde'), index= ['Utah', 'Ohio', 'Texas', 'Oregon'])

In [137]:
frame

Unnamed: 0,b,d,e
Utah,1.337529,-1.100417,0.1748
Ohio,-0.823718,-0.317398,-0.293862
Texas,-1.121225,-0.463837,-0.534291
Oregon,-1.004012,0.375351,-0.272698


In [138]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.337529,1.100417,0.1748
Ohio,0.823718,0.317398,0.293862
Texas,1.121225,0.463837,0.534291
Oregon,1.004012,0.375351,0.272698


In [139]:
# Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this
f = lambda x: x.max() - x.min()
frame.apply(f)

b    2.458754
d    1.475769
e    0.709090
dtype: float64

In [140]:
# If we pass axis='columns' (equivalently axis=1) to apply, the function will be invoked once per row instead
frame.apply(f, axis=1)

Utah      2.437946
Ohio      0.529857
Texas     0.657388
Oregon    1.379363
dtype: float64

In [141]:
frame

Unnamed: 0,b,d,e
Utah,1.337529,-1.100417,0.1748
Ohio,-0.823718,-0.317398,-0.293862
Texas,-1.121225,-0.463837,-0.534291
Oregon,-1.004012,0.375351,-0.272698


In [142]:
# The function passed to apply need not return a scalar value; it can also return a Series with multiple values
def f(x):
    return pd.Series([x.min(), x.max()], index = ['min', 'max'])

In [143]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.121225,-1.100417,-0.534291
max,1.337529,0.375351,0.1748


Element-wise Python functions can be used, too. Suppose we wanted to compute a
formatted string from each floating-point value in frame. We can do this with
applymap:


In [144]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,1.34,-1.1,0.17
Ohio,-0.82,-0.32,-0.29
Texas,-1.12,-0.46,-0.53
Oregon,-1.0,0.38,-0.27


In [145]:
# The reason for the name applymap is that Series has a map method for applying an element-wise function
frame['e'].map(format)

Utah       0.17
Ohio      -0.29
Texas     -0.53
Oregon    -0.27
Name: e, dtype: object
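
One caveat for newer installations: pandas 2.1 added DataFrame.map as the element-wise method and deprecated applymap. The sketch below picks whichever one is available, so it should run on either side of that change. The frame values and the helper `two_decimals` are invented for the example.

```python
import pandas as pd

# Small hypothetical frame with fixed values so the result is reproducible
frame = pd.DataFrame([[1.337529, -1.100417], [-0.823718, 0.375351]],
                     columns=['b', 'd'], index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a float with two digits after the decimal point."""
    return f'{x:.2f}'

# Use DataFrame.map when it exists (pandas 2.1+), otherwise fall back to applymap
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(two_decimals)
print(formatted)
```

Using a named function here also avoids rebinding the built-in name `format`.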

##### Sorting and Ranking


Sorting a dataset by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:

In [146]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [147]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [148]:
# With a DataFrame, we can sort by index on either axis
frame = pd.DataFrame(np.arange(8).reshape((2,4)), index = ['Three','one'], columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
Three,0,1,2,3
one,4,5,6,7


In [149]:
frame.sort_index()

Unnamed: 0,d,a,b,c
Three,0,1,2,3
one,4,5,6,7


In [150]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
Three,1,2,3,0
one,5,6,7,4


In [151]:
# The data is sorted in ascending order by default, but can be sorted in descending order
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
Three,0,3,2,1
one,4,7,6,5


In [152]:
# To sort a Series by its values, use its sort_values method
obj = pd.Series([4, 7, -3, 2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [153]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [154]:
# Any missing values are sorted to the end of the Series by default
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [155]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

When sorting a DataFrame, we can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

In [156]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a':[0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [157]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [158]:
# To sort by multiple columns, pass a list of names
frame.sort_values(by = ['a','b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
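
When sorting by several columns, the direction can be set per column by passing a list to `ascending`. A small sketch on the same frame, with `ordered` as an illustrative name:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'
ordered = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(ordered)
```

sort_values returns a new object, so `frame` keeps its original row order.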


###### Ranking

Ranking assigns ranks from one through the number of valid data points in an array.
The rank methods for Series and DataFrame are the place to look; by default rank
breaks ties by assigning each group the mean rank:

In [161]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [162]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [163]:
# Ranks can also be assigned according to the order in which they’re observed in the data
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Here, instead of using the average rank 6.5 for the entries 0 and 2, they have
been set to 6 and 7 because label 0 precedes label 2 in the data.

In [164]:
# We can rank in descending order, too
obj.rank(method='max', ascending=False)

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [165]:
# DataFrame can compute ranks over the rows or the columns
frame = pd.DataFrame({'b' : [4.3, 7, -3, 2], 'a' : [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [166]:
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


##### Axis Indexes with Duplicate Labels

While many pandas functions (like reindex) require that the labels be
unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [167]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [168]:
# The index’s is_unique property can tell us whether its labels are unique or not
obj.index.is_unique

False

Data selection is one of the main things that behaves differently with duplicates.
Indexing a label with multiple entries returns a Series, while single entries return a
scalar value:

In [169]:
obj['a']

a    0
a    1
dtype: int64

In [170]:
obj['c']

4

In [171]:
# The same logic extends to indexing rows in a DataFrame
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,-1.022159,-1.621619,-1.379246
a,-0.159992,-1.880508,1.415295
b,-0.995503,0.976217,0.301566
b,1.027449,-2.335668,1.484949


In [172]:
df.index.is_unique

False

In [173]:
df.loc['b']

Unnamed: 0,0,1,2
b,-0.995503,0.976217,0.301566
b,1.027449,-2.335668,1.484949
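
As a sketch of what "require that the labels be unique" means in practice, the example below tries to reindex a Series with duplicate labels, then removes the duplicates first. Keeping the first entry per label is an arbitrary choice made for this illustration.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# reindex refuses to work on an axis with duplicate labels
try:
    obj.reindex(['a', 'b', 'c'])
    reindex_error = None
except ValueError as exc:
    reindex_error = exc

# Keep only the first entry for each label, then reindex normally
deduped = obj[~obj.index.duplicated(keep='first')]
reindexed = deduped.reindex(['c', 'b', 'a'])

print(reindex_error)
print(reindexed)
```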


#####  Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:

In [174]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index = list('abcd'), 
                 columns = ['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [175]:
# Calling DataFrame’s sum method returns a Series containing column sums
df.sum()

one    9.25
two   -5.80
dtype: float64

In [176]:
# Passing axis='columns' or axis=1 sums across the columns instead
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [177]:
# NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled with the skipna option
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
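
Note that sum and mean treat an all-NA row differently: in the outputs above, row 'c' sums to 0.00, while its mean is NaN. If NaN is preferred for the sum as well, the `min_count` argument can be used. A small sketch on the same data, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['one', 'two'])

row_sums = df.sum(axis='columns')                   # all-NA row 'c' sums to 0.0
strict_sums = df.sum(axis='columns', min_count=1)   # needs at least one valid value
row_means = df.mean(axis='columns')                 # mean of an all-NA row is NaN

print(row_sums)
print(strict_sums)
```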

In [178]:
# idxmin and idxmax return indirect statistics, like the index value where the minimum or maximum value is attained
df.idxmax()

one    b
two    d
dtype: object

In [179]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [180]:
# describe produces multiple summary statistics in one shot
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [181]:
# On non-numeric data, describe produces alternative summary statistics
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [182]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

###### Correlation and Covariance
For this concept, see "hate kolome data science".

##### Unique Values, Value Counts, and Membership


In [195]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [204]:
# The unique method gives us an array of the unique values in a Series
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [213]:
# value_counts computes a Series containing value frequencies
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

The Series is sorted by value in descending order as a convenience. value_counts is
also available as a top-level pandas method that can be used with any array or
sequence:

In [211]:
pd.value_counts(obj.values, sort=False)

d    1
b    2
a    3
c    3
dtype: int64

In [214]:
# isin performs a membership check
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [216]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [217]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
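
Related to isin is the Index.get_indexer method, which returns an array of integer positions that map possibly non-distinct values onto an array of distinct values. The sketch below uses made-up data (`to_match`, `unique_vals`); labels that cannot be found are reported as -1.

```python
import pandas as pd

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])

# Position of each value of to_match inside unique_vals
positions = pd.Index(unique_vals).get_indexer(to_match)

# A label that is not present maps to -1
missing = pd.Index(unique_vals).get_indexer(['z'])

print(positions)
print(missing)
```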