## Introducing Pandas Objects
____________________________________

* Pandas provides a host of useful tools built on top of three basic data structures:
     * ``Series``
     * ``DataFrame`` 
     * ``Index``

* At the **very basic** level, Pandas objects can be thought of as **enhanced versions of NumPy arrays** 

In [2]:
import numpy as np
import pandas as pd

##  1. ``Series`` 
_______________________________

* ``Series`` is a one-dimensional array of indexed data
* It can be created from a list or array 

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

In [None]:
type(data)

In [None]:
dir(data)

#### 1.1. ``Series`` as a wrapper
______________________________

* ``Series`` wraps a sequence of values (the ``values`` attribute)
* and a sequence of indices (the ``index`` attribute)

In [None]:
data.values

* ``index`` is an object of type ``pd.Index``

In [None]:
data.index

In [None]:
type(data.index)
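As a quick check of the two attributes above, the following sketch inspects their types. It assumes a recent pandas version, where a ``Series`` built without an explicit index gets a ``RangeIndex`` (a subclass of ``pd.Index``); the variable name ``s`` is only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0])

# The values are exposed as a plain NumPy array
print(type(s.values))

# The default index is a RangeIndex, which is a subclass of pd.Index
print(type(s.index))
print(isinstance(s.index, pd.Index))
```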

* As with a NumPy array, `data` can be accessed by the associated index using square brackets:

In [None]:
data[1]

In [None]:
data[1:3]

#### 1.2. ``Series`` as generalized NumPy array
_________________________

* A ``Series`` can be used in much the same way as a one-dimensional NumPy array
* The essential difference between a ``Series`` and a NumPy array is the ``index``:
     * a ``Series`` has an *explicitly defined* index associated with its values
     * an *implicitly defined* integer index is used by default, as with a NumPy array
* The explicit index definition gives ``Series`` additional capabilities:
    * the index need not be an integer
    * the index can consist of values of any desired type

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data.index

* Item access works as expected:

In [None]:
data['b']

* Indices can even be non-contiguous or non-sequential :

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

In [None]:
data[5]
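With integer labels it is easy to confuse a label with a position. A minimal sketch of the distinction, using the standard ``loc`` (label-based) and ``iloc`` (position-based) accessors; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = s.loc[5]      # the element whose label is 5
by_position = s.iloc[1]  # the second element, whatever its label
third = s.iloc[2]        # the third element, which carries label 3

print(by_label, by_position, third)
```

Writing ``loc`` or ``iloc`` explicitly avoids relying on how plain square brackets interpret an integer.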

#### 1.3. ``Series`` as specialized dictionary
_____________________

* ``Series`` is a structure which maps typed keys to a set of typed values
* ``Series`` can be constructed directly from a dictionary:

In [5]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

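* By default, the index is drawn from the dictionary keys in insertion order, as the output above shows (older pandas versions sorted the keys); call ``sort_index()`` if a sorted index is needed
* The type information makes certain ``Series`` operations much more efficient than the same operations on Python dictionaries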
* Typical dictionary-style item access can be performed:

In [None]:
population['California']

* Unlike a dictionary, ``Series`` also supports array-style operations such as slicing:

In [None]:
population['California':'Illinois']  # label-based slice: both endpoints are included, i.e. [a, b]
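The difference between label-based and position-based slicing can be sketched as follows. The data repeats the population figures above; the names ``pop``, ``label_slice`` and ``position_slice`` are illustrative.

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

# Slicing by label includes the final label
label_slice = pop.loc['California':'New York']

# Slicing by position excludes the final position, as in plain Python
position_slice = pop.iloc[0:2]

print(len(label_slice), len(position_slice))
```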

#### 1.4. Constructing `Series` objects
____________________________

* The ways of constructing a ``Series`` from scratch seen so far are all some version of the following:
```python
pd.Series(data, index=index)
```
     * ``data`` can be one of many entities, for example a list or a NumPy array
     * ``index`` is an optional argument that defaults to an integer sequence

In [None]:
pd.Series([2, 4, 6])

* ``data`` can be a scalar, which is repeated to fill the specified index:

In [None]:
pd.Series(5, index=[100, 200, 300])

* ``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys (in insertion order in current pandas versions):

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'})

* The index can be set explicitly if a different result is preferred; only the keys listed in ``index`` are kept:

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
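A related case worth knowing: when ``index`` contains a label that is not a key of the dictionary, the corresponding value is filled with ``NaN``. A small sketch, where the extra label ``4`` is a hypothetical addition for illustration:

```python
import pandas as pd

# Label 4 is not a key of the dictionary, so its value is missing (NaN)
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s)
print(s.isna().tolist())
```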

##  2. ``DataFrame``
__________________________________

``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary

#### 2.1. ``DataFrame`` as a generalized NumPy array
______________________________

* an analog of a **two**-dimensional array with both flexible row indices and flexible column names
* a sequence of aligned ``Series`` objects (**aligned** means that they share the same index)

In [3]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

* A dictionary of ``Series`` objects can be used to construct a single two-dimensional object:

In [6]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [None]:
dir(states)

* Like ``Series``, ``DataFrame`` has an ``index`` attribute that gives access to the index labels
* Additionally, ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels

In [None]:
states.columns

#### 2.2. DataFrame as specialized dictionary
________________________________

* ``DataFrame`` maps a column name to a ``Series`` of column data

In [None]:
states['area']

* Because ``states['area']`` returns a *column* (not a *row*, as indexing a two-dimensional NumPy array would), it is often better to think of ``DataFrame`` as a generalized dictionary rather than a generalized array.
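The contrast between dictionary-style and array-style access can be sketched as follows. The two-state frame is a cut-down copy of ``states`` built only for this illustration, and the variable names are assumptions.

```python
import pandas as pd

small = pd.DataFrame({'population': {'California': 38332521, 'Texas': 26448193},
                      'area': {'California': 423967, 'Texas': 695662}})

column = small['area']        # dictionary-style access returns a column (a Series)
row = small.loc['Texas']      # rows are reached by label through .loc
first_row = small.values[0]   # the underlying NumPy array indexes rows first

print(column.name, list(row.index), first_row)
```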

#### 2.3. Constructing ``DataFrame`` objects 
____________________________________

#### 1) From a single Series object
* ``DataFrame`` is a collection of ``Series`` objects
* A single-column ``DataFrame`` can be constructed from a single ``Series``

In [None]:
pd.DataFrame(population, columns=['population'])
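An alternative that some may find more explicit is ``Series.to_frame``, which converts a ``Series`` into a one-column ``DataFrame`` and takes the column name as its ``name`` argument. A minimal sketch with a shortened, illustrative copy of the population data:

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193})

# to_frame keeps the index and uses `name` as the column label
frame = pop.to_frame(name='population')

print(frame)
```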

#### 2) From a list of dicts

In [None]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

* If some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` ("not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
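The filled-in gaps can be located with ``isna()``. The sketch below also shows a side effect that is easy to miss: because ``NaN`` is a floating-point value, integer columns that receive it are stored as ``float64``. The variable name ``df`` is illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# True marks the positions pandas filled in
print(df.isna())

# Columns 'a' and 'c' contain NaN and are upcast to float; 'b' stays integer
print(df.dtypes)
```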

#### 3) From a dictionary of ``Series``

In [None]:
pd.DataFrame({'population': population,
              'area': area})
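When the ``Series`` do not share exactly the same index, the constructor aligns them on the union of their labels and fills the gaps with ``NaN``. A sketch with two deliberately mismatched, illustrative ``Series``:

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# Only 'Texas' appears in both; the other rows get NaN in one column
df = pd.DataFrame({'population': pop, 'area': area})

print(df)
```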

#### 4) From a 2D NumPy array

* Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names
* If omitted, an integer index will be used for each

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

In [None]:
pd.DataFrame(np.random.rand(3, 2))

#### 5) From a NumPy structured array

In [None]:
S = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
S

In [None]:
pd.DataFrame(S)

##  3. ``Index`` 
____________________________


* Both the ``Series`` and ``DataFrame`` objects contain an *index* attribute
* An ``Index`` object can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as ``Index`` objects may contain repeated values)

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

#### 3.1. ``Index`` as immutable array
______________________________

* ``Index`` operates in many ways like an array; standard indexing notation retrieves values or slices:

In [None]:
ind[1]

In [None]:
ind[::2]

* ``Index`` objects also have many of the attributes familiar from NumPy arrays:

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

* One difference between ``Index`` objects and NumPy arrays is that indices are immutable

In [None]:
ind[1] = 0  # raises TypeError: Index does not support mutable operations
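In a script, the failed assignment can be caught, and a changed index can be obtained by building a new ``Index`` from a modified copy of the values. This is one possible approach, not the only one; the names ``values`` and ``new_ind`` are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0  # item assignment is not supported
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Copy the values, change the copy, and wrap it in a new Index
values = ind.to_numpy().copy()
values[1] = 0
new_ind = pd.Index(values)

print(ind.tolist(), new_ind.tolist())
```

The original ``ind`` is left untouched, which is what makes it safe to share one index between several ``Series`` and ``DataFrame`` objects.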

#### 3.2. ``Index`` as ordered set
____________________________

* ``Index`` objects follow many of the conventions of Python's built-in ``set``, so unions, intersections, differences, and other combinations can be computed in a familiar way
* These set operations are available as object methods

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB) 

In [None]:
indA.union(indB)

In [None]:
indA.symmetric_difference(indB) 
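For reference, the sketch below collects the results of these calls on the same two indices and adds ``difference``, which is also a standard ``Index`` method. The result variable names are illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels present in both
either = indA.union(indB)                   # labels present in at least one
only_one = indA.symmetric_difference(indB)  # labels present in exactly one
only_a = indA.difference(indB)              # labels in indA but not in indB

print(common.tolist())    # [3, 5, 7]
print(either.tolist())    # [1, 2, 3, 5, 7, 9, 11]
print(only_one.tolist())  # [1, 2, 9, 11]
print(only_a.tolist())    # [1, 9]
```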

* Pandas objects facilitate operations such as ``join`` across datasets, which depend on many aspects of set arithmetic
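One everyday place where this set arithmetic shows up is element-wise arithmetic between two ``Series``: the result is indexed by the union of the two indices, and labels found in only one operand yield ``NaN``. A minimal sketch with made-up data:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label, not by position
total = a + b

print(total)
```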