# CH.5 - Getting Started with pandas

* While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

In [114]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm

## 5.1 Introduction to pandas Data Structures

* A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

In [5]:
obj = pd.Series([45, 7, 5, 6])
obj

0    45
1     7
2     5
3     6
dtype: int64

* You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [6]:
# values and index
print(obj.values, '\n')
print(obj.index)

[45  7  5  6] 

RangeIndex(start=0, stop=4, step=1)


In [7]:
obj2 = pd.Series([4, 5, 6, 8], index=['d', 'f', 'a', 'g'])
print(obj2, '\n')
print(obj2.index, '\n')

d    4
f    5
a    6
g    8
dtype: int64 

Index(['d', 'f', 'a', 'g'], dtype='object') 



In [8]:
# Selecting
print(obj2['a'], '\n')
print(obj2[['a','f','d']], '\n')

6 

a    6
f    5
d    4
dtype: int64 



* Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [10]:
obj2[obj2 > 4]

f    5
a    6
g    8
dtype: int64

In [11]:
obj2 * 2

d     8
f    10
a    12
g    16
dtype: int64

In [12]:
np.exp(obj2)

d      54.598150
f     148.413159
a     403.428793
g    2980.957987
dtype: float64

In [15]:
# Conditional expressions with pd Series
'b' in obj2, 'd' in obj2

(False, True)

In [17]:
# Creating a series from a dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

* When you pass only a dict, the index of the resulting Series takes the dict’s keys. Older pandas versions sorted the keys; recent versions keep the dict’s insertion order, as in the output above. You can override this by passing the labels in the order you want them to appear in the resulting Series:

In [18]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
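
A minimal sketch of how key order and an explicit index interact, reusing the `sdata` dict from above. The names `by_insertion` and `by_custom` are illustrative only, and the insertion-order behavior assumes a reasonably recent pandas release:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# Without an explicit index, the Series follows the dict's insertion order.
by_insertion = pd.Series(sdata)
print(list(by_insertion.index))

# With an explicit index, labels missing from the dict become NaN and
# dict keys missing from the index (here 'Utah') are dropped.
by_custom = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])
print(by_custom)

# sort_index() gives alphabetical order when that is what you want.
print(list(by_insertion.sort_index().index))
```

If alphabetical order matters, calling `sort_index()` explicitly is safer than relying on version-specific behavior.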

In [19]:
# Checking for null data (NaN in pandas)
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

* A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [21]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

### DataFrame

In [24]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data, 
                      columns=['year','state','pop','debt'],
                      index=['one','two','three','four','five','six'])

frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [25]:
# delete column using del:

del frame2['debt']

frame2

Unnamed: 0,year,state,pop
one,2000,Ohio,1.5
two,2001,Ohio,1.7
three,2002,Ohio,3.6
four,2001,Nevada,2.4
five,2002,Nevada,2.9
six,2003,Nevada,3.2


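* In the pandas version used here, the column returned from indexing a DataFrame is a view on the underlying data, not a copy, so in-place modifications to the Series are reflected in the DataFrame. With Copy-on-Write enabled (opt-in in pandas 2.x, the default from pandas 3.0), such modifications no longer propagate. Either way, the column can be explicitly copied with the Series’s `.copy()` method.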

In [32]:
year_col = frame2['year'].copy()

year_col

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [33]:
# Rebinding the name year_col to a new array does not modify frame2
year_col = np.arange(2010, 2016)
year_col

array([2010, 2011, 2012, 2013, 2014, 2015])

In [34]:
frame2

Unnamed: 0,year,state,pop
one,2000,Ohio,1.5
two,2001,Ohio,1.7
three,2002,Ohio,3.6
four,2001,Nevada,2.4
five,2002,Nevada,2.9
six,2003,Nevada,3.2


In [35]:
frame2['year'] = year_col

frame2

Unnamed: 0,year,state,pop
one,2010,Ohio,1.5
two,2011,Ohio,1.7
three,2012,Ohio,3.6
four,2013,Nevada,2.4
five,2014,Nevada,2.9
six,2015,Nevada,3.2
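
The cells above rebind `year_col` rather than modify it in place, so a small sketch of what `.copy()` guarantees may help. The frame and the names `year_copy` and `unchanged` are made up for illustration:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'state': ['Ohio', 'Ohio', 'Nevada']},
                     index=['one', 'two', 'three'])

# An explicit copy owns its data, so editing it leaves the frame untouched.
year_copy = frame['year'].copy()
year_copy.loc['one'] = 1999

unchanged = int(frame.loc['one', 'year'])
print(unchanged)            # the frame still holds the original value

# Assigning back to the column is the unambiguous way to update the frame.
frame['year'] = year_copy
print(frame)
```

Copying and then assigning back behaves the same whether or not Copy-on-Write is active, which makes it a safe default.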


In [36]:
# Transpose
frame2.T

Unnamed: 0,one,two,three,four,five,six
year,2010,2011,2012,2013,2014,2015
state,Ohio,Ohio,Ohio,Nevada,Nevada,Nevada
pop,1.5,1.7,3.6,2.4,2.9,3.2


In [38]:
# Column and index name
frame2.columns.name, frame2.index.name

(None, None)

In [39]:
frame2.columns.name = 'column'
frame2.index.name = 'index'

frame2.columns.name, frame2.index.name

('column', 'index')

In [40]:
frame2

column  year   state  pop
index
one     2010    Ohio  1.5
two     2011    Ohio  1.7
three   2012    Ohio  3.6
four    2013  Nevada  2.4
five    2014  Nevada  2.9
six     2015  Nevada  3.2


In [41]:
# Values, as a two-dimensional ndarray.
frame2.values

array([[2010, 'Ohio', 1.5],
       [2011, 'Ohio', 1.7],
       [2012, 'Ohio', 3.6],
       [2013, 'Nevada', 2.4],
       [2014, 'Nevada', 2.9],
       [2015, 'Nevada', 3.2]], dtype=object)

In [42]:
# Index objects
labels = pd.Index(np.arange(3))

labels

Int64Index([0, 1, 2], dtype='int64')

In [43]:
obj2 = pd.Series([1.5, 2.5, 0], index=labels)
obj2

0    1.5
1    2.5
2    0.0
dtype: float64

In [44]:
labels[:2]

Int64Index([0, 1], dtype='int64')

In [46]:
# Index objects are immutable
labels[1] = 4

TypeError: Index does not support mutable operations
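
Because an Index cannot be modified in place, "changing" labels means building a new Index. The following sketch, with illustrative names `renamed`, `extended`, and `common`, shows that and one set-like operation:

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

# In-place assignment is rejected, which makes an Index safe to share.
try:
    labels[1] = 'z'
except TypeError as exc:
    print('TypeError:', exc)

# To change labels, build a new Index; the original is untouched.
renamed = labels.map(str.upper)
extended = labels.append(pd.Index(['d']))
print(list(renamed), list(extended), list(labels))

# An Index also supports set-like operations.
common = labels.intersection(pd.Index(['b', 'c', 'x']))
print(list(common))
```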

<br><br>
## 5.2 Essential Functionality

In [54]:
# Reindexing - returns a new object conformed to the new index; labels that were not present get NaN

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['b', 'c', 'a', 'd'])
print(obj, '\n')

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2, '\n')

b    4.5
c    7.2
a   -5.3
d    3.6
dtype: float64 

a   -5.3
b    4.5
c    7.2
d    3.6
e    NaN
dtype: float64 



In [55]:
# forward-fill reindex
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3, '\n')

obj3.reindex(range(6), method='ffill')

0      blue
2    purple
4    yellow
dtype: object 



0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [56]:
# Column reindex

frame = pd.DataFrame(np.arange(9).reshape(3,3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [59]:
states = ['Texas', 'Utah', 'California']
frame2 = frame.reindex(columns=states)
frame2

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
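
As a small sketch of two points from this section, the example below checks that `reindex` leaves the original object alone and uses the `fill_value` argument to replace the default `NaN`. The names `reordered` and `filled` are illustrative:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['b', 'c', 'a', 'd'])

# reindex returns a new Series; the original keeps its index and order.
reordered = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(list(obj.index))
print(reordered)

# fill_value supplies a default for labels that were not present.
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)
print(filled)
```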


### Dropping Entries from an Axis

In [60]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [63]:
new_obj = obj.drop(['d','c'])
new_obj

a    0.0
b    1.0
e    4.0
dtype: float64

In [69]:
# Dropping from either axis
# You can drop values from the columns by passing axis=1 or axis='columns'
frame3 = frame.drop('Ohio', axis=1)
frame3

Unnamed: 0,Texas,California
a,1,2
c,4,5
d,7,8


In [70]:
frame3.drop('d', inplace=True)
frame3

Unnamed: 0,Texas,California
a,1,2
c,4,5


### Indexing, Selection, and Filtering

* Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

In [72]:
# Indexing, Selection
obj['a':'c']

a    0.0
b    1.0
c    2.0
dtype: float64

In [73]:
obj['a':'c'] = 100.0

obj

a    100.0
b    100.0
c    100.0
d      3.0
e      4.0
dtype: float64
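
To make the endpoint difference concrete, this sketch compares a label slice with a positional slice on the same Series, using the explicit `loc` and `iloc` accessors:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: both endpoints 'a' and 'c' are included.
by_label = obj.loc['a':'c']

# Position-based slice: the stop position 2 is excluded, as in plain Python.
by_position = obj.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```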

In [75]:
# Filtering
frame < 5

Unnamed: 0,Ohio,Texas,California
a,True,True,True
c,True,True,False
d,False,False,False


### Selection with loc and iloc

* For label-indexing on the rows of a DataFrame, pandas provides the special indexing operators `loc` and `iloc`. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).

In [76]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [77]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [78]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [79]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [80]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9


In [82]:
# loc with slicing
data.loc[:'Utah', 'two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
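
`loc` also accepts a boolean mask for the rows. The sketch below combines a mask with a column list, checks that the equivalent positional selection matches, and shows the scalar accessors `at` and `iat`. It rebuilds the `data` frame from above; `subset` and `same_by_position` are illustrative names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Rows where column 'three' exceeds 5, restricted to two columns by label.
subset = data.loc[data['three'] > 5, ['one', 'two']]
print(subset)

# The same rows and columns selected by integer position.
same_by_position = data.iloc[1:, :2]
print(same_by_position.equals(subset))

# at / iat are scalar-only accessors for a single cell.
print(data.at['Utah', 'four'], data.iat[2, 3])
```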

### Arithmetic and Data Alignment

* When you add objects together, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Labels found in only one of the objects produce `NaN` in the result.

In [87]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

print('s1:\n', s1, '\n\ns2:\n', s2, '\n\ns1 + s2:\n', s1 + s2)

s1:
 a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64 

s2:
 a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64 

s1 + s2:
 a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64


In [88]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), 
                   columns=list('bcd'), 
                   index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print('df1:\n', df1, '\n\ndf2:\n', df2, '\n\ndf1 + df2:\n', df1 + df2)

df1:
             b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0 

df2:
           b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0 

df1 + df2:
             b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


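* To avoid these `NaN` values, use the `add` method on `df1`, passing `df2` and a `fill_value` argument. The fill value stands in for a label that is missing from one of the two frames; a cell that is missing from both is still `NaN`: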

In [90]:
df1.add(df2, fill_value=0)

            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0
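
The remaining gaps in this output are cells that exist in neither frame. A short sketch of that distinction, rebuilding `df1` and `df2` from above (`total` is an illustrative name):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

total = df1.add(df2, fill_value=0)

# Present in df1 only: computed as 6.0 + 0.
print(total.loc['Colorado', 'b'])

# Present in neither frame: nothing to fill, so the result stays NaN.
print(total.loc['Utah', 'c'])

# Chain fillna if every remaining gap should also become 0.
print(total.fillna(0).loc['Utah', 'c'])
```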


In [91]:
# Division
1 / df1

Unnamed: 0,b,c,d
Ohio,inf,1.0,0.5
Texas,0.333333,0.25,0.2
Colorado,0.166667,0.142857,0.125


In [92]:
df1.rdiv(1)

Unnamed: 0,b,c,d
Ohio,inf,1.0,0.5
Texas,0.333333,0.25,0.2
Colorado,0.166667,0.142857,0.125


In [99]:
df1.rdiv(1) == 1 / df1

Unnamed: 0,b,c,d
Ohio,True,True,True
Texas,True,True,True
Colorado,True,True,True


In [105]:
# Subtraction
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[1]

print('frame:\n', frame, '\n\nseries:\n', series, '\n\nframe - series:\n', frame - series)

frame:
           b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0 

series:
 b    3.0
d    4.0
e    5.0
Name: Ohio, dtype: float64 

frame - series:
           b    d    e
Utah   -3.0 -3.0 -3.0
Ohio    0.0  0.0  0.0
Texas   3.0  3.0  3.0
Oregon  6.0  6.0  6.0


* If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:

In [102]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

frame - series2

          b   d     e   f
Utah    0.0 NaN   1.0 NaN
Ohio    3.0 NaN   4.0 NaN
Texas   6.0 NaN   7.0 NaN
Oregon  9.0 NaN  10.0 NaN
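
By default the Series index is matched against the DataFrame's columns. To match on the row labels and broadcast across the columns instead, the arithmetic methods take an `axis` argument. A minimal sketch, with illustrative names `by_row` and `by_col`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched against the frame's columns,
# and the operation is broadcast down the rows.
row = frame.loc['Ohio']
by_row = frame - row

# axis='index' matches on the row labels and broadcasts across the columns.
col = frame['d']
by_col = frame.sub(col, axis='index')

print(by_row)
print(by_col)
```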


### Function Application and Mapping

* NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [104]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


* Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s `apply` method does exactly this:

In [106]:
f = lambda x: x.max() - x.min()

frame.apply(f)

b    9.0
d    9.0
e    9.0
dtype: float64

In [107]:
frame.apply(f, axis=1) # or axis='columns'

Utah      2.0
Ohio      2.0
Texas     2.0
Oregon    2.0
dtype: float64

In [108]:
f = lambda x: pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)

Unnamed: 0,b,d,e
min,0.0,1.0,2.0
max,9.0,10.0,11.0


* **Element-wise Python functions** can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in `frame`. You can do this with `applymap` (deprecated since pandas 2.1 in favor of `DataFrame.map`, which does the same thing):

In [110]:
# applymap for element-wise formatting
format_numbers = lambda x: '{:.4f}'.format(x)

frame.applymap(format_numbers)

             b        d        e
Utah    0.0000   1.0000   2.0000
Ohio    3.0000   4.0000   5.0000
Texas   6.0000   7.0000   8.0000
Oregon  9.0000  10.0000  11.0000
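
The Series counterpart of this element-wise operation is `map`. The sketch below formats one column and then the whole frame by mapping column by column, a route that should work on both older and newer pandas releases. The helper `format_number` is hypothetical:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def format_number(x):
    """Format a float with two decimal places."""
    return '{:.2f}'.format(x)

# Series.map applies the function to each element of a single column.
formatted_col = frame['e'].map(format_number)
print(formatted_col)

# For a whole frame, apply Series.map to each column in turn.
formatted_frame = frame.apply(lambda col: col.map(format_number))
print(formatted_frame)
```

Note that the results hold strings, so they are meant for display rather than further arithmetic.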


### Sorting and Ranking

* Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the `sort_index` method, which returns a new, sorted object:

In [115]:
# Series
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

print(obj.sort_index())

# DataFrame
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

print(frame.sort_index(), '\n\n', frame.sort_index(axis=1))

a    1
b    2
c    3
d    0
dtype: int64
       d  a  b  c
one    4  5  6  7
three  0  1  2  3 

        a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [117]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [124]:
frame.sort_index(axis=1, inplace=True)
frame

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [122]:
frame.sort_values(by='b')

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7


In [125]:
frame.sort_values(by=['a','b'])

Unnamed: 0,a,b
2,0,-3
0,0,4
3,1,2
1,1,7
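
Two further `sort_values` options are sketched here: where missing values end up, and mixed sort directions across several keys. The Series with `NaN` entries is made up for illustration:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Missing values are placed at the end by default.
nan_last = obj.sort_values()
print(nan_last)

# na_position='first' moves them to the front instead.
nan_first = obj.sort_values(na_position='first')
print(nan_first)

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Mixed directions: 'a' ascending, then 'b' descending within each group.
ordered = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(ordered)
```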


#### Ranking

In [127]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [129]:
# Break ties according to the order in which they’re observed in the data
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [130]:
# Rank in descending order, assigning tied values the maximum rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

* DataFrame can compute ranks over the rows or the columns:

In [131]:
frame = pd.DataFrame({'a': [0, 1, 0, 1], 
                      'b': [4.3, 7, -3, 2],
                      'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


In [132]:
frame.rank(axis='columns')

Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0
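
To compare the tie-breaking methods side by side, this sketch ranks the Series from the start of this subsection with each `method` option and collects the results in one frame (`ranks` is an illustrative name):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, for side-by-side comparison.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
ranks.insert(0, 'value', obj)
print(ranks)
```

The `dense` method behaves like `min` but always increases the rank by 1 between groups, so it tops out at the number of distinct values.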


<br><br>
## 5.3 Summarizing and Computing Descriptive Statistics

In [134]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [135]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [138]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [139]:
df.sum(axis=1, skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

In [140]:
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


In [141]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3
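
Beyond `sum`, `cumsum`, and `describe`, the same `NaN`-handling rules apply to the other reductions. The following sketch, rebuilding `df` from above, shows `mean` with and without `skipna`, plus `idxmax` and `idxmin`, which return index labels rather than values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Row means skip NaN by default; a row with no valid values gives NaN.
row_means = df.mean(axis='columns')
print(row_means)

# skipna=False propagates NaN as soon as one value is missing.
strict_means = df.mean(axis='columns', skipna=False)
print(strict_means)

# idxmax / idxmin return the index label of the extreme value, not the value.
print(df.idxmax())
print(df.idxmin())
```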


### Correlation and Covariance

* Using DataFrame’s corrwith method, you can compute pairwise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

```python
In [249]: returns.corrwith(returns.IBM)
Out[249]:
AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64
```
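
The `returns` frame in the excerpt above is not constructed in these notes. As a runnable stand-in, the sketch below builds a frame of synthetic random data (the columns `X`, `Y`, `Z` and all values are made up) and shows `corr`, `cov`, and `corrwith`:

```python
import numpy as np
import pandas as pd

# Synthetic "returns"; the column names and the data are invented.
rng = np.random.default_rng(0)
base = rng.normal(size=250)
returns = pd.DataFrame({
    'X': base,
    'Y': 0.5 * base + rng.normal(size=250),
    'Z': rng.normal(size=250),
})

# Correlation and covariance between two columns.
print(returns['X'].corr(returns['Y']))
print(returns['X'].cov(returns['Y']))

# Full pairwise correlation matrix.
corr_matrix = returns.corr()
print(corr_matrix)

# Correlation of every column with a single Series.
with_x = returns.corrwith(returns['X'])
print(with_x)
```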

### Unique Values, Value Counts, and Membership

In [149]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [150]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

* `isin` performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame:

In [151]:
obj.isin(['b', 'c'])

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [152]:
obj[obj.isin(['b', 'c'])]

0    c
5    b
6    b
7    c
8    c
dtype: object
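
The same pattern filters DataFrame rows by a column's membership in a set of values. A small sketch with a made-up `orders` frame:

```python
import pandas as pd

orders = pd.DataFrame({'state': ['Ohio', 'Texas', 'Utah', 'Ohio', 'Nevada'],
                       'amount': [10, 25, 7, 3, 12]})

# Keep rows whose state is in the wanted set.
wanted = ['Ohio', 'Nevada']
mask = orders['state'].isin(wanted)
print(orders[mask])

# ~ negates the mask to keep everything else.
print(orders[~mask])
```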

* In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here’s an example:

In [153]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [155]:
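# Histograms: count each value per column; values absent from a column become 0
# (pd.Series.value_counts replaces the top-level pd.value_counts, deprecated in newer pandas)
data.apply(pd.Series.value_counts).fillna(0)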

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
