# CH.5 - Getting Started with pandas

* While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm
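
The homogeneous-versus-heterogeneous contrast described above can be seen in a small sketch. The sample values and the names `arr` and `df` are made up for illustration; they are not part of the original notes.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixed inputs are coerced
# to one common type (here every element becomes a string).
arr = np.array([1, 'a', 2.5])
print(arr.dtype.kind)  # 'U' stands for a Unicode string dtype

# A DataFrame tracks a separate dtype for each column.
df = pd.DataFrame({'n': [1, 2], 'label': ['a', 'b'], 'ratio': [1.5, 2.5]})
print(df.dtypes)
```

The integer and float columns keep their numeric dtypes even though a string column sits beside them.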

## 5.1 Introduction to pandas Data Structures

* A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

In [5]:
obj = pd.Series([45, 7, 5, 6])
obj

0    45
1     7
2     5
3     6
dtype: int64

* You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [6]:
# Underlying array (values) and index object
print(obj.values, '\n')
print(obj.index)

[45  7  5  6] 

RangeIndex(start=0, stop=4, step=1)
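
As a complement to the `values` attribute, recent pandas documentation recommends `Series.to_numpy()` when a NumPy array is what you need. The following is a minimal sketch that rebuilds `obj` so it runs on its own; the name `arr` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series([45, 7, 5, 6])

# to_numpy() returns the data as a plain NumPy array
arr = obj.to_numpy()
print(type(arr), arr)

# The default RangeIndex holds the labels 0..3; list() materializes them
print(list(obj.index))
```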


In [7]:
obj2 = pd.Series([4, 5, 6, 8], index=['d', 'f', 'a', 'g'])
print(obj2, '\n')
print(obj2.index, '\n')

d    4
f    5
a    6
g    8
dtype: int64 

Index(['d', 'f', 'a', 'g'], dtype='object') 



In [8]:
# Selecting
print(obj2['a'], '\n')
print(obj2[['a','f','d']], '\n')

6 

a    6
f    5
d    4
dtype: int64 



* Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [10]:
obj2[obj2 > 4]

f    5
a    6
g    8
dtype: int64

In [11]:
obj2 * 2

d     8
f    10
a    12
g    16
dtype: int64

In [12]:
np.exp(obj2)

d      54.598150
f     148.413159
a     403.428793
g    2980.957987
dtype: float64

In [15]:
# Membership tests with `in` check the index labels (as with dict keys), not the values
'b' in obj2, 'd' in obj2

(False, True)
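
Because `in` looks at the index labels, testing for a value needs a different approach. The sketch below rebuilds `obj2` from the earlier cell and shows two options, assuming you want either a single yes/no answer or an element-wise mask.

```python
import pandas as pd

obj2 = pd.Series([4, 5, 6, 8], index=['d', 'f', 'a', 'g'])

# `in` on a Series searches the labels
print('d' in obj2)   # 'd' is a label
print(6 in obj2)     # 6 is a value, not a label

# Search the values themselves through the underlying array
print(6 in obj2.to_numpy())

# isin() gives an element-wise boolean mask over the values
mask = obj2.isin([6, 8])
print(mask)
```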

In [17]:
# Creating a series from a dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

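* When you pass only a dict, the index of the resulting Series follows the dict's key order (insertion order in current pandas, as the output above shows; older versions sorted the keys). You can override this by passing the keys as `index` in the order you want them to appear in the resulting Series. Labels with no matching key get `NaN`, and keys not listed are dropped: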

In [18]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [19]:
# Checking for null data (NaN in pandas)
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
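
The same checks are also available as Series methods (`isna` and `notna`, with `isnull` and `notnull` as aliases). A minimal sketch, rebuilding `obj4` so it is self-contained; the names `n_missing` and `present` are illustrative only:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Count missing entries: True counts as 1 when summed
n_missing = int(obj4.isna().sum())
print(n_missing)

# Keep only the rows that have data, using the boolean mask as a filter
present = obj4[obj4.notna()]
print(present)
```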

* A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [21]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
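
Labels present in only one operand produce `NaN` under plain `+`. If a missing label should be treated as zero instead, one option is the `add` method with `fill_value`. This is a sketch of that alternative, not something the notes above use:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# fill_value=0 substitutes 0 when a value is missing on one side only;
# where both sides are missing the result stays NaN
total = obj3.add(obj4, fill_value=0)
print(total)
```

Here `Utah` keeps its value from `obj3`, while `California` remains `NaN` because neither Series has data for it.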

### DataFrame

In [24]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data, 
                      columns=['year','state','pop','debt'],
                      index=['one','two','three','four','five','six'])

frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [25]:
# delete column using del:

del frame2['debt']

frame2

       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9
six    2003  Nevada  3.2


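* In the pandas version used here, the column returned from indexing a DataFrame is a view on the underlying data, not a copy, so in-place modifications to the Series are reflected in the DataFrame. (With Copy-on-Write, the default from pandas 3.0, such modifications no longer propagate.) To get an independent column in any version, copy it explicitly with the Series's `.copy()` method.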

In [32]:
year_col = frame2['year'].copy()

year_col

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
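
To illustrate what `.copy()` buys you, the sketch below modifies a copied column and checks that the DataFrame is untouched. It uses a smaller, made-up `frame` rather than `frame2` so that it runs on its own.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# An explicit copy is independent of the DataFrame
year_copy = frame['year'].copy()
year_copy.loc['one'] = 1999   # changes only the copy

print(year_copy.loc['one'])
print(frame.loc['one', 'year'])   # the original value is still in the DataFrame
```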

In [33]:
# Rebinding the name year_col to a new array leaves frame2 untouched
year_col = np.arange(2010, 2016)
year_col

array([2010, 2011, 2012, 2013, 2014, 2015])

In [34]:
frame2

       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9
six    2003  Nevada  3.2


In [35]:
# Assigning an array replaces the column; its length must match the DataFrame's
frame2['year'] = year_col

frame2

       year   state  pop
one    2010    Ohio  1.5
two    2011    Ohio  1.7
three  2012    Ohio  3.6
four   2013  Nevada  2.4
five   2014  Nevada  2.9
six    2015  Nevada  3.2


In [36]:
# Transpose
frame2.T

        one   two  three    four    five     six
year   2010  2011   2012    2013    2014    2015
state  Ohio  Ohio   Ohio  Nevada  Nevada  Nevada
pop     1.5   1.7    3.6     2.4     2.9     3.2


In [38]:
# Column and index name
frame2.columns.name, frame2.index.name

(None, None)

In [39]:
frame2.columns.name = 'column'
frame2.index.name = 'index'

frame2.columns.name, frame2.index.name

('column', 'index')

In [40]:
frame2

column  year   state  pop
index
one     2010    Ohio  1.5
two     2011    Ohio  1.7
three   2012    Ohio  3.6
four    2013  Nevada  2.4
five    2014  Nevada  2.9
six     2015  Nevada  3.2


In [41]:
# Values, as a two-dimensional ndarray.
frame2.values

array([[2010, 'Ohio', 1.5],
       [2011, 'Ohio', 1.7],
       [2012, 'Ohio', 3.6],
       [2013, 'Nevada', 2.4],
       [2014, 'Nevada', 2.9],
       [2015, 'Nevada', 3.2]], dtype=object)
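
The `dtype=object` in the output above comes from mixing numeric and string columns: the array needs one dtype that can hold every value. A short sketch of the difference, using `to_numpy()` and a made-up two-row `frame` (the names `mixed` and `numeric` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'year': [2010, 2011],
                      'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

# All columns together: numbers and strings only fit in an object array
mixed = frame.to_numpy()
print(mixed.dtype)

# Numeric columns only: integers and floats share a float dtype
numeric = frame[['year', 'pop']].to_numpy()
print(numeric.dtype)
```

Selecting only the numeric columns before converting keeps the result usable for vectorized NumPy arithmetic.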

In [42]:
# Index objects
labels = pd.Index(np.arange(3))

labels

Int64Index([0, 1, 2], dtype='int64')

In [43]:
obj2 = pd.Series([1.5, 2.5, 0], index=labels)
obj2

0    1.5
1    2.5
2    0.0
dtype: float64

In [44]:
labels[:2]

Int64Index([0, 1], dtype='int64')

In [46]:
# Labels are immutable
labels[1] = 4

TypeError: Index does not support mutable operations
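
Since an Index cannot be modified in place, changing labels means building a new Index and assigning it as a whole. The sketch below shows one way to do that; `new_labels` and `obj` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

labels = pd.Index(np.arange(3))

# Item assignment on an Index raises TypeError
try:
    labels[1] = 4
except TypeError as exc:
    print('TypeError:', exc)

# Build a new Index with the desired labels instead
new_labels = pd.Index([4 if x == 1 else x for x in labels])

# Replacing a Series's entire index is allowed
obj = pd.Series([1.5, 2.5, 0], index=labels)
obj.index = new_labels
print(obj)
```

Immutability is what makes it safe for several Series or DataFrames to share one Index object.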


## 5.2 Essential Functionality