# Chapter 3 - Data Manipulation with Pandas

In the previous chapter, we dove into detail on NumPy and its ndarray object, which
provides efficient storage and manipulation of dense typed arrays in Python.

Pandas is a newer package built on top of NumPy that provides an
efficient implementation of a DataFrame. DataFrames are essentially multidimensional
arrays with attached row and column labels, often with heterogeneous
types and/or missing data. As well as offering a convenient storage interface for
labeled data, Pandas implements a number of powerful data operations familiar to
users of both database frameworks and spreadsheet programs.

## Installing and Using Pandas

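Pandas depends on NumPy, and parts of the library are written in C and Cython, so building it from source requires the appropriate compilers.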

In [2]:
# Importing pandas

import pandas

In [4]:
pandas.__version__

'0.24.2'

In [6]:
import numpy as np
import pandas as pd

## Introducing Pandas Objects

At a very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

The three fundamental Pandas data structures are:
- Series
- DataFrame
- Index

### The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:

In [7]:
data = pd.Series([0.25, 0.5, 0.7, 1.0])

In [8]:
data

0    0.25
1    0.50
2    0.70
3    1.00
dtype: float64

The Series wraps both a **sequence of values and a
sequence of indices**, which we can access with the **values and index** attributes. The
values are simply a familiar NumPy array:

In [9]:
data.values

array([0.25, 0.5 , 0.7 , 1.  ])

In [10]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [11]:
data[1]

0.5

In [12]:
data[1:3]

1    0.5
2    0.7
dtype: float64

#### Series as generalized NumPy array

The essential difference between a Series and a one-dimensional NumPy array is the presence
of the index: while **the NumPy array** has an implicitly defined integer index used
to access the values, **the Pandas Series** has an explicitly defined index associated with
the values.

In [13]:
# Using strings as the index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

In [14]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [15]:
data['b']

0.5
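Because the index is explicit, it does not have to be contiguous or sorted. The following is a minimal sketch, assuming an arbitrary set of integer labels chosen only for illustration:

```python
import pandas as pd

# The explicit index can be any sequence of labels, in any order.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# With an integer index, data[5] looks up the label 5, not position 5.
value_at_label_5 = data[5]
print(value_at_label_5)
```

Here `5` is treated as a label, which is why the lookup succeeds even though the Series has only four elements.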

#### Series as specialized dictionary

A Pandas Series is a bit like a specialization of a Python
dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary
values, and a Series is a structure that maps typed keys to a set of typed values.

In [16]:
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                   'Florida': 19552860, 'Illinois': 12882135}

In [17]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [19]:
population['Texas': 'Illinois']

Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64
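The dictionary analogy can be checked directly. The sketch below uses a shortened version of the population data and compares a plain dict with the Series built from it; the variable name `subset` is only illustrative:

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127}
population = pd.Series(population_dict)

# Dictionary-style lookup by key works on both objects.
print(population['Texas'] == population_dict['Texas'])

# The Series keeps all of its values in one typed array.
print(population.dtype)

# Unlike a dict, the Series also supports label slices,
# and the final label is included in the result.
subset = population['California':'Texas']
print(subset)
```

A plain dict has no notion of key order for slicing, so the last operation has no dictionary equivalent.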

#### Constructing Series Objects

**Note:** The data can be a list or NumPy array, or it can be a scalar, which is repeated to fill the specified index:

In [20]:
# Data as a scalar
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [21]:
# Data as a dictionary
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object
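When the data is a dictionary, an explicit index can also be passed to pick out and order a subset of the keys. A small sketch, reusing the dictionary from the cell above:

```python
import pandas as pd

# Only the keys listed in `index` are kept, in the order given.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(s)
```

In this case the Series is populated only with the explicitly requested keys, so the entry for key `1` is dropped.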

### The Pandas DataFrame Object
Like the Series object
discussed in the previous section, the DataFrame can be thought of either as a generalization
of a NumPy array, or as a specialization of a Python dictionary.

#### DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame
is an analog of a two-dimensional array with both flexible row indices and flexible
column names.

Just as you might think of a two-dimensional array as an ordered
sequence of aligned one-dimensional columns, you can think of a DataFrame as a
sequence of aligned Series objects. Here, by “aligned” we mean that they share the
same index.

In [22]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

area = pd.Series(area_dict)

In [23]:
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [24]:
states = pd.DataFrame({'population': population, 'area': area})

In [25]:
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [26]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [27]:
states.columns

Index(['population', 'area'], dtype='object')
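To see what "aligned" means in practice, it can help to build a DataFrame from two Series whose indices only partly overlap. This is a sketch with deliberately mismatched, shortened data, not the full example above:

```python
import pandas as pd

# Two Series that share only the 'Texas' label.
population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'New York': 141297})

# The DataFrame index is the union of both indices; entries with no
# matching label are filled with NaN.
states = pd.DataFrame({'population': population, 'area': area})
print(states)

# Count the missing entries in each column.
print(states.isna().sum())
```

Each column ends up with one missing value, and because NaN is a floating-point value the affected integer columns are upcast to a float dtype.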

#### DataFrame as specialized dictionary

In [28]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [29]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
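Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. The sketch below, using a two-row version of `states`, shows that dictionary-style operations act on column names rather than row labels:

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

# Membership tests and keys() refer to column names.
has_area_column = 'area' in states
has_california_column = 'California' in states
column_names = list(states.keys())
print(has_area_column, has_california_column, column_names)

# Indexing with a column name returns that column as a Series.
area_column = states['area']
print(type(area_column).__name__)
```

This is a potential point of confusion: in a two-dimensional NumPy array, `data[0]` returns the first row, whereas for a DataFrame `data['col0']` returns a column.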

In [33]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [35]:
# DataFrame from a list of dicts, built with a list comprehension
data = [{'a': i, 'b': 2 * i} for i in range(3)]

In [37]:
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4
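The dictionaries in the list do not all need the same keys. As a small sketch with made-up records, Pandas fills in NaN wherever a key is missing:

```python
import pandas as pd

# The first record has no 'c' key and the second has no 'a' key.
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)
print(df)

# True marks the positions that were filled with NaN.
print(df.isna())
```

Columns `a` and `c` each contain one missing value, while `b` is complete.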


In [48]:
pd.DataFrame(np.random.rand(3,2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

        foo       bar
a  0.641689  0.018169
b  0.528723  0.897171
c  0.363985  0.192001


In [52]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

In [53]:
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [54]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


### The Pandas Index Object

**Note:** An Index is immutable, i.e. it cannot be modified after it is created.

This immutability makes it safer to share indices between multiple DataFrames and
arrays, without the potential for side effects from inadvertent index modification.

In [55]:
ind = pd.Index([2,3,5,7,11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [56]:
ind[1]

3

In [62]:
ind[::3]

Int64Index([2, 7], dtype='int64')

In [63]:
ind.shape

(5,)

In [66]:
ind.ndim

1
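The immutability mentioned in the note above can be observed by attempting an item assignment. A minimal sketch; the `raised` flag is only there to record what happened:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0  # item assignment is not supported on an Index
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# The Index is unchanged.
print(list(ind))
```

Apart from this, the Index behaves much like an array for read-only operations such as indexing and slicing.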

In [67]:
indA = pd.Index([1,3,5,7,9])
indB = pd.Index([2,3,5,7,11])

In [69]:
# Intersection
indA & indB

Int64Index([3, 5, 7], dtype='int64')

In [70]:
# Union
indA | indB

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [71]:
# symmetric difference
indA ^ indB

Int64Index([1, 2, 9, 11], dtype='int64')
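The operator forms above match the Pandas version used in these notes (0.24.2). To my knowledge, later releases deprecated `&`, `|` and `^` as set operations on Index objects, so the equivalent methods are a safer choice for code that has to run across versions. A sketch using the same two indices:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Method-based equivalents of the set operators.
common = indA.intersection(indB)            # labels present in both
either = indA.union(indB)                   # labels present in at least one
only_one = indA.symmetric_difference(indB)  # labels present in exactly one

print(list(common))
print(list(either))
print(list(only_one))
```

The exact class name shown when printing the Index itself depends on the installed Pandas version, which is why the sketch prints plain lists.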

## Data Indexing and Selection

### Data Selection in Series

In [90]:
# Series as dictionary
data = pd.Series([0.25,0.5,0.75,1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [91]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [92]:
'a' in data

True

In [93]:
data['b']

0.5

In [94]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [95]:
data['e'] = 1.25

In [96]:
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [98]:
# masking
data[(data>0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [99]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

**Note:**
Among these operations, slicing may be the source of the most confusion. Notice that when you
are slicing with an explicit index (e.g., data['a':'c']), the final index is included in
the slice, while when you’re slicing with an implicit index (e.g., data[0:2]), the final
index is excluded from the slice.

In [100]:
data['a': 'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [101]:
data[0:2]

a    0.25
b    0.50
dtype: float64

In [102]:
data = pd.Series(['a', 'b', 'c'], index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [103]:
data[1]

'a'

In [104]:
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides
some special indexer attributes that explicitly expose certain indexing schemes.

First, the loc attribute allows indexing and slicing that always references the explicit
index:

In [105]:
data.loc[1]

'a'

In [106]:
data.loc[1:3]

1    a
3    b
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit
Python-style index:

In [109]:
data.iloc[1]

'b'

In [110]:
data.iloc[0]

'a'

In [111]:
data.iloc[1:3]

3    b
5    c
dtype: object

One guiding principle of Python code is that **“explicit is better than implicit.”** The
explicit nature of loc and iloc makes them very useful in maintaining clean and readable
code; especially in the case of integer indexes, I recommend using both to
make code easier to read and understand, and to prevent subtle bugs due to the
mixed indexing/slicing convention.
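Putting the two indexers side by side on the same integer-indexed Series makes the difference in slicing conventions easier to see. A short sketch; `by_label` and `by_position` are illustrative names:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc slices by label and includes the final label.
by_label = data.loc[1:3]

# iloc slices by position and excludes the final position.
by_position = data.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```

The same `1:3` slice selects different rows depending on the indexer, which is the kind of subtle bug that spelling out loc or iloc helps to avoid.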

### Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array,
and in other ways like a dictionary of Series structures sharing the same index.
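Both views can be sketched on a small frame. The example below assumes a two-row version of the `states` data used earlier:

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

# Dictionary view: a column name maps to a Series.
area = states['area']
print(area)

# Array view: the values form a two-dimensional NumPy array,
# and the frame can be transposed like one.
print(states.values.shape)
transposed = states.T
print(transposed)
```

After transposing, the former column names become the row labels and the state names become the columns.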

