Let's introduce the three fundamental Pandas data structures: the `Series`, `DataFrame`, and `Index`.

# Question

Why is the `Index` immutable?

In [1]:
import numpy as np
import pandas as pd

# The Pandas Series Object

A Pandas `Series` is a one-dimensional array of indexed data.

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])

In [3]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The `Series` wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes:

In [4]:
# Question: what type is `values`?
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [8]:
# this is of type pd.Index; partly set, partly array 
data.index

RangeIndex(start=0, stop=4, step=1)
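To answer the question in the cell above, here is a small sketch that inspects the types of the two attributes. The variable names `values_type` and `index_is_pandas_index` are made up for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# `values` exposes the underlying data as a NumPy array
values_type = type(data.values)
print(values_type)

# `index` is a pandas Index; the default index is the RangeIndex subclass
index_is_pandas_index = isinstance(data.index, pd.Index)
print(type(data.index).__name__, index_is_pandas_index)
```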

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [9]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [10]:
data[1]

0.5

In [11]:
data[1:3]

1    0.50
2    0.75
dtype: float64

As we will see, the Pandas `Series` is much more general and flexible than the 1d NumPy array that it emulates

## `Series` as generalized NumPy array

The essential difference is the presence of the index: a NumPy array has an implicitly defined integer index, while a Pandas `Series` has an explicitly defined index.

This explicit index gives the `Series` object additional capabilities. For example, the index does not need to be an integer

In [14]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

In [15]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [16]:
data['b']

0.5

In [17]:
# non-contiguous or non-sequential indices
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2,5,3,7])

In [18]:
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
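With an integer index like this one, labels and positions no longer coincide, so it can help to be explicit about which one is meant. The sketch below uses the `loc` (label-based) and `iloc` (position-based) accessors, which the notes above have not introduced yet; the names `by_label` and `by_position` are just for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[3]      # the entry whose label is 3
by_position = data.iloc[3]  # the fourth entry, whatever its label is

print(by_label, by_position)  # 0.75 1.0
```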

## Series as specialized dictionary

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a `Series` is a structure which maps **typed** keys to a set of typed values

In [21]:
# constructing a Series object directly from a Python dictionary 
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [26]:
# QUESTION: How do you access the population of California?
population['California']

38332521

Unlike a dictionary, though, the `Series` also supports array-style operations such as slicing:

In [27]:
population['California':'Illinois'] 

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
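One detail worth noticing in the slice above is that the end label is included in the result. The following sketch contrasts a label slice with a positional slice via `iloc`; the variable names are hypothetical.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Slicing by label includes the end label ('New York')
label_slice = population['California':'New York']

# Slicing by position follows the usual Python rule and excludes the end
position_slice = population.iloc[0:2]

print(list(label_slice.index))     # ['California', 'Texas', 'New York']
print(list(position_slice.index))  # ['California', 'Texas']
```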

## Constructing Series Objects

In [None]:
pd.Series(data, index=index)

Where `index` is an optional argument, and `data` can be one of many entities:

In [10]:
pd.Series([2,4,6]) # data is a list
pd.Series(np.array([2,4,6])) # data is a np array

0    2
1    4
2    6
dtype: int32

In [13]:
pd.Series(5, index=[100,200,300]) # data is a scalar, which is repeated to fill the specified index

100    5
200    5
300    5
dtype: int64

In [16]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3,2]) 

3    c
2    a
dtype: object

Notice that in this case, the `Series` is populated only with the explicitly identified keys
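The reverse case may also be worth a quick look: if the requested index contains a label that is not a key of the dictionary, that entry is filled with a missing value. A minimal sketch, where the label `9` and the name `letters` are chosen purely for illustration:

```python
import pandas as pd

# 9 is not a key of the dictionary, so its entry is missing (NaN)
letters = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 9])
print(letters)

missing = letters.isna()
print(missing.loc[9])  # True
```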

# The Pandas DataFrame Object

Like the `Series`, the `DataFrame` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

## DataFrame as a generalized NumPy array

If a `Series` is an analog of a 1d array with flexible indices, a `DataFrame` is an analog of a 2d array with both flexible row indices and flexible column names. Just as you might think of a 2d array as an ordered sequence of aligned 1d columns, you can think of a `DataFrame` as a sequence of aligned `Series` objects. Here, by "aligned", we mean that they share the same index.

In [17]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [22]:
states = pd.DataFrame({'population':population, 'area':area})

In [23]:
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


Like the `Series` object, the `DataFrame` has an `index` attribute that gives access to the index labels:

In [24]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the `DataFrame` has a `columns` attribute, which is an `Index` object holding the column labels:

In [25]:
states.columns

Index(['population', 'area'], dtype='object')

Thus the `DataFrame` can be thought of as a generalization of a 2d NumPy array, where both the row and columns have a generalized index for accessing the data.

## DataFrame as specialized dictionary

Similarly, we can also think of a `DataFrame` as a specialization of a dictionary. Where a dictionary maps a key to a value, a `DataFrame` maps a column name to a `Series` of column data. For example, asking for the `'area'` column returns the `Series` object containing the areas we saw earlier:

In [26]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Notice the potential point of confusion here: in a 2d NumPy array, `data[0]` returns the first row. For a `DataFrame`, `data['col0']` returns the first *column*. Because of this, it is probably better to think about `DataFrame`s as generalized dictionaries rather than generalized arrays.
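The difference can be seen side by side in a small sketch. The array contents and the column names `col0` and `col1` are made up for this example, and `iloc` is used here as one way to get a row by position.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row_of_array = arr[0]          # NumPy: first row -> [1, 2]
first_column_of_frame = df['col0']   # DataFrame: first column -> 1, 3, 5
first_row_of_frame = df.iloc[0]      # DataFrame: first row, by position

print(first_row_of_array.tolist())     # [1, 2]
print(first_column_of_frame.tolist())  # [1, 3, 5]
print(first_row_of_frame.tolist())     # [1, 2]
```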

## Constructing DataFrame objects

A `DataFrame` is a collection of `Series` objects, and a single-column `DataFrame` can be constructed from a single Series:

In [None]:
pd.DataFrame(population, columns=['population'])
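As an aside, an alternative spelling that these notes do not use is `Series.to_frame`, which names the single column explicitly. A minimal sketch on a shortened version of the population data:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})

# Convert the Series into a one-column DataFrame named 'population'
frame = population.to_frame(name='population')
print(frame)
```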

### From a list of dicts

Any list of dictionaries can be made into a `DataFrame`. We will use a simple list comprehension to create some data

In [27]:
data = [{'a':i, 'b':2 * i} for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


Even if some keys in the dictionary are missing, Pandas will fill them in with `NaN` (i.e. "not a number") values:

In [30]:
x = pd.DataFrame([{'a':1, 'b':2}, {'b':3, "c":4}])
x['a']

0    1.0
1    NaN
Name: a, dtype: float64

In [31]:
x

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0

### From a dictionary of Series Objects

As we saw before, a `DataFrame` can be constructed from a dictionary of `Series` objects as well:

In [32]:
pd.DataFrame({'population':population, 'area':area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
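In the example above both `Series` share the same index. As a sketch of what alignment means when they do not, the shortened, deliberately mismatched data below (chosen only for illustration) produces rows for every label found in either `Series`, with `NaN` where a value is absent:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# Rows are matched by label, not by position
states = pd.DataFrame({'population': population, 'area': area})
print(states)
```

Only Texas appears in both inputs, so it is the only row with both columns filled in.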


### From a two-dimensional NumPy array

In [33]:
pd.DataFrame(np.random.rand(3,2), 
            columns=['foo', 'bar'], 
            index=['a', 'b', 'c'])

        foo       bar
a  0.488973  0.741736
b  0.335679  0.008142
c  0.266165  0.215895


# The Pandas Index Object

The `Index` object can be thought of either as an immutable array or as an ordered set.

In [34]:
idx = pd.Index([2,3,5,7,11])
idx

Int64Index([2, 3, 5, 7, 11], dtype='int64')

## Index as immutable array

In [37]:
idx[1]

3

In [36]:
idx[::2]

Int64Index([2, 5, 11], dtype='int64')

In [38]:
# familiar attribute from NumPy
print(idx.size, idx.shape, idx.ndim, idx.dtype)

5 (5,) 1 int64


One difference between `Index` objects and NumPy arrays is that indices are **immutable**: they cannot be modified through ordinary item assignment.

In [39]:
idx[1] = 0

TypeError: Index does not support mutable operations
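This connects to the question at the top of these notes. One commonly cited reason for the immutability is that it makes it safer to share a single index between several objects, since none of them can change the labels under the others. The sketch below catches the error and then builds two `Series` on the same index; the names `a`, `b`, and `mutated` are hypothetical.

```python
import pandas as pd

idx = pd.Index([2, 3, 5, 7, 11])

# Item assignment on an Index raises TypeError
try:
    idx[1] = 0
    mutated = True
except TypeError as exc:
    mutated = False
    print(exc)

# Both Series use the same labels; neither can alter them in place
a = pd.Series([1, 2, 3, 4, 5], index=idx)
b = pd.Series([10, 20, 30, 40, 50], index=idx)
print(a.index.equals(b.index))  # True
```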

## Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The `Index` object follows many of the conventions used by Python's built-in `set` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

1. Check membership
2. Union, intersection, and (symmetric) difference

In [40]:
idxA = pd.Index([1,3,5,7,9])
idxB = pd.Index([2,3,5,7,11])


In [None]:
idxA.intersection(idxB)

In [None]:
idxA.union(idxB)

In [None]:
idxA.symmetric_difference(idxB) # symmetric difference
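Since the cells above were left unexecuted, here is a self-contained sketch of the same operations with their results, plus a membership check and a plain `difference` for comparison. The result variable names are made up for this example.

```python
import pandas as pd

idxA = pd.Index([1, 3, 5, 7, 9])
idxB = pd.Index([2, 3, 5, 7, 11])

has_three = 3 in idxA                           # membership test
common = idxA.intersection(idxB)                # labels in both
either = idxA.union(idxB)                       # labels in at least one
only_one = idxA.symmetric_difference(idxB)      # labels in exactly one
only_a = idxA.difference(idxB)                  # labels in idxA but not idxB

print(has_three)       # True
print(list(common))    # [3, 5, 7]
print(list(either))    # [1, 2, 3, 5, 7, 9, 11]
print(list(only_one))  # [1, 2, 9, 11]
print(list(only_a))    # [1, 9]
```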