# Data Manipulation with Pandas

While NumPy's ``ndarray`` works very well for clean, well-organized numerical data, its limitations become clear 
- when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) 
- when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.)

To address these limitations, we introduce three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

Summary:
``Series``: generalized ``ndarray`` + specialized dictionary
- Series.index
- Series.values

``DataFrame``: a set of aligned ``Series`` objects
- DataFrame.index
- DataFrame.columns
- DataFrame.values

``Index``: **immutable** array + ordered set
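
The attributes listed in this summary can be inspected directly. The following is a minimal sketch, assuming a small made-up ``Series`` ``s`` and a ``DataFrame`` ``df`` built from it (both names and values are illustrative only):

```python
import pandas as pd

# Hypothetical example data, used only to show the attributes
s = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])
df = pd.DataFrame({'x': s, 'y': 2 * s})

print(s.index)      # labels of the Series
print(s.values)     # underlying NumPy array
print(df.index)     # row labels, shared by both columns
print(df.columns)   # column labels
print(df.values)    # two-dimensional NumPy array holding the data
```

Both columns of ``df`` carry the same row labels as ``s``, which is the sense in which a ``DataFrame`` is a set of aligned ``Series``.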

## Installing and Using Pandas

Installation details can be found in the [Pandas documentation](http://pandas.pydata.org/).
Following the usual convention, we will import Pandas under the alias ``pd``:

In [1]:
import numpy as np
import pandas as pd
pd.__version__

'0.23.4'

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [89]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data


0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

A ``Series`` wraps both a sequence of values and a sequence of indices, which we can access through the ``values`` and ``index`` attributes.

In [6]:
print(type(data.values)) 
data.values

<class 'numpy.ndarray'>


array([0.25, 0.5 , 0.75, 1.  ])

In [7]:
print(type(data.index))
data.index

<class 'pandas.core.indexes.range.RangeIndex'>


RangeIndex(start=0, stop=4, step=1)

## ``Series`` as generalized NumPy array

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [12]:
data['b'] # access data by its explicit label

0.5

In [93]:
data[1:3] # slice by implicit integer position

b    0.50
c    0.75
dtype: float64

We can even use non-contiguous or non-sequential indices:

In [106]:
data = pd.Series([0, 0.25, 0.5, 0.75, 1.0],
                 index=[5, 4, 1, 2, 3])
data

5    0.00
4    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [110]:
# data[0]  # this raises a KeyError: 0 is not one of the explicit labels
data[5]

0.0

In [115]:
# slicing with [] is interpreted by implicit integer position, not by label;
# this mix of conventions is confusing, so avoid indexing this way
data[1:3] 

4    0.25
1    0.50
dtype: float64

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

To make the intent explicit, it is recommended to use ``.loc`` (label-based) and ``.iloc`` (position-based) instead.
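
As a minimal sketch of the difference, the following reuses the integer-labelled ``Series`` from above and compares label-based and position-based access (the variable names are illustrative):

```python
import pandas as pd

data = pd.Series([0, 0.25, 0.5, 0.75, 1.0],
                 index=[5, 4, 1, 2, 3])

by_label = data.loc[1]           # the element whose label is 1
by_position = data.iloc[1]       # the element at position 1 (second element)

label_slice = data.loc[1:3]      # from label 1 through label 3, end included
position_slice = data.iloc[1:3]  # positions 1 and 2, end excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Note that a ``.loc`` slice includes its end label, while an ``.iloc`` slice follows the usual Python convention and excludes the end position.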

## ``Series`` as specialized dictionary

In [18]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [25]:
population['California']

38332521

In [26]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [27]:
population[0:3]

California    38332521
Texas         26448193
New York      19651127
dtype: int64
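
The dictionary analogy extends to a few other familiar operations. The sketch below assumes a shortened copy of the population data and shows membership tests, key/item iteration, and adding an entry by assignment:

```python
import pandas as pd

# Shortened copy of the population data, for illustration only
population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127})

has_texas = 'Texas' in population   # membership is tested against the index labels
labels = list(population.keys())    # keys() returns the index
pairs = list(population.items())    # (label, value) pairs

# Assigning to a new label extends the Series, like adding a dict key
population['Florida'] = 19552860

# Unlike a plain dict, a Series also supports slicing by label (end included)
first_two = population.loc['California':'Texas']
print(first_two)
```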

### Constructing Series objects

The ways of constructing a Pandas ``Series`` that we have seen so far are all some version of the following:

```python
>>> pd.Series(data, index=index)
```

In [30]:
# ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence
pd.Series([2, 4, 6]) 

0    2
1    4
2    6
dtype: int64

In [31]:
# ``data`` can be a scalar, which is repeated to fill the specified index:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [33]:
# ``data`` can be a dictionary; note that pandas does not sort the keys automatically
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [35]:
# the index can be set explicitly if a different result is preferred.
# Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
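
A related case is an explicit index containing a label that is not a key of the dictionary. The sketch below assumes an extra label ``4`` (not part of the example above) to show that such an entry is filled with a missing value rather than raising an error:

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only the requested keys are kept, in the requested order
subset = pd.Series(letters, index=[3, 2])

# Label 4 is a hypothetical key that does not exist in the dictionary,
# so its entry is marked as missing (NaN)
with_missing = pd.Series(letters, index=[3, 2, 4])

print(subset)
print(with_missing)
print(with_missing.isna())
```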

## The Pandas DataFrame Object

### ``DataFrame`` as a generalized NumPy array
You can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.

``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

In [37]:
# prepare another series
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

In [39]:
# construct a dataframe
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
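
To see what "aligned" means when the indices do not match exactly, here is a small sketch assuming two trimmed-down ``Series`` whose labels only partly overlap (the variable names and the reduced data are illustrative assumptions):

```python
import pandas as pd

# Hypothetical reduced data: only 'Texas' appears in both Series
pop_small = pd.Series({'California': 38332521, 'Texas': 26448193})
area_small = pd.Series({'Texas': 695662, 'Florida': 170312})

aligned = pd.DataFrame({'population': pop_small, 'area': area_small})
print(aligned)

# Rows are matched by label; combinations with no data are marked as missing
print(aligned.isna())
```

The resulting row index contains every label found in either ``Series``, and cells without data are filled with ``NaN``.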


In [40]:
print(type(states.index))
states.index

<class 'pandas.core.indexes.base.Index'>


Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [41]:
# Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels
states.columns

Index(['population', 'area'], dtype='object')

In [116]:
# a DataFrame also has a values attribute
states.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]], dtype=int64)

### ``DataFrame`` as specialized dictionary  - only works for columns

In [42]:
print(type(states['area']))
states['area']

<class 'pandas.core.series.Series'>


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [None]:
# trying to access a row with plain [] does not work:
states['California']     # raises KeyError: [] looks up column labels, not row labels
states[0]                # raises KeyError: 0 is not a column label either
states['California', :]  # also fails: [] does not accept a (row, column) pair

In [54]:
states.loc['California',:]

population    38332521
area            423967
Name: California, dtype: int64

In [55]:
states.iloc[0] # with a single argument, iloc selects a row

population    38332521
area            423967
Name: California, dtype: int64

In [56]:
states.iloc[0, :]

population    38332521
area            423967
Name: California, dtype: int64
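
The row, column, and single-cell access patterns can be put side by side. This is a minimal sketch, assuming a two-row version of the ``states`` table rebuilt inline so that the block runs on its own:

```python
import pandas as pd

# Reduced copy of the states table, for illustration only
states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

row_by_label = states.loc['Texas']       # a row, selected by label
row_by_position = states.iloc[1]         # the same row, selected by position
cell = states.loc['Texas', 'area']       # a single value: (row label, column label)
column = states.loc[:, 'area']           # a column; same data as states['area']

print(row_by_label)
print(cell)
print(column)
```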

### Constructing DataFrame objects

#### From a single Series object

In [57]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### From a list of dicts

In [58]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [59]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


#### From a dictionary of Series objects

In [60]:
pd.DataFrame({'population': population,
              'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


#### From a two-dimensional NumPy array

In [61]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.185526  0.360083
b  0.836638  0.588168
c  0.232415  0.039308


#### From a NumPy structured array 

In [62]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [63]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## The Pandas Index Object  - immutable
This immutability makes it safer to share indices between multiple ``DataFrame``s and arrays, without the potential for side effects from inadvertent index modification.

In [64]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

### Index as immutable array

In [67]:
ind[1] = 7  # this doesn't work, index is immutable

TypeError: Index does not support mutable operations
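
The immutability can be checked without stopping the program by catching the error. The sketch below also shows, as an illustrative addition, that a method such as ``insert`` returns a new ``Index`` and leaves the original untouched:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 7                      # item assignment is rejected
except TypeError as err:
    print("TypeError:", err)

# Methods that appear to modify an Index return a new object instead
longer = ind.insert(1, 4)

print(ind.tolist())                 # the original Index is unchanged
print(longer.tolist())
```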

In [71]:
ind[::2] # we can access index by array operation.

Int64Index([2, 5, 11], dtype='int64')

In [72]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


### Index as ordered set - there is ordering in index

In [73]:
indA = pd.Index([5, 3, 1, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [84]:
print( indA & indB )  # intersection (operator form; later pandas versions no longer treat & as a set operation)
print( indA.intersection(indB) )  # method form, which also works

Int64Index([5, 3, 7], dtype='int64')
Int64Index([5, 3, 7], dtype='int64')


In [87]:
print( indA | indB )  # union (operator form; later pandas versions no longer treat | as a set operation)
print( indA.union(indB)) # method form, which also works

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')


In [86]:
print( indA ^ indB )  # symmetric difference (operator form; later pandas versions no longer treat ^ as a set operation)
print( indA.symmetric_difference(indB) ) # method form, which also works

Int64Index([1, 2, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
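
Because the operator forms behave differently across pandas versions, a version-independent way to write these set operations is to use the methods only. The following is a minimal sketch with the same two indices; ``difference`` is an additional method not used in the cells above:

```python
import pandas as pd

indA = pd.Index([5, 3, 1, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels present in both
either = indA.union(indB)                   # labels present in at least one
only_one = indA.symmetric_difference(indB)  # labels present in exactly one
only_a = indA.difference(indB)              # labels in indA but not in indB

print(common)
print(either)
print(only_one)
print(only_a)
```

The ordering of the results (and the displayed index type) may differ between pandas versions, so it is safer not to rely on it.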
