# Data Manipulation with Pandas

We've been looking at NumPy and its `ndarray` object

- This provides efficient storage and manipulation of **dense typed arrays** (every element is stored explicitly, in contrast to sparse formats that store only the non-zero entries)

Pandas is built on NumPy and provides an efficient implementation of a `DataFrame`

- Convenient storage interface for labelled data.

- Provides powerful data operations familiar to users of database and spreadsheet programs.

NumPy is useful for providing the essential features for data organization.

It is, however, limited where flexibility is required:

- attaching labels to data

- working with missing data

- attempting operations that do not map well to element-wise broadcasting, e.g. grouping

Pandas helps with these "data munging" tasks that occupy much of a data scientist's time.
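As a rough illustration of the three points above, the sketch below attaches labels to data, counts missing values and groups by a key. The names `scores` and `teams` and all the values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical exam scores; np.nan marks a missing value.
names = ['ana', 'ben', 'cho', 'dev']
scores = pd.Series([71.0, np.nan, 88.0, 64.0], index=names)   # labels attached to data
teams = pd.Series(['red', 'red', 'blue', 'blue'], index=names)

by_label = scores['cho']                     # look up by label rather than by position
n_missing = scores.isna().sum()              # count the missing entries
team_means = scores.groupby(teams).mean()    # grouping; NaN is skipped by default

print(by_label, n_missing)
print(team_means)
```

Each of these would need extra bookkeeping with a plain `ndarray`, for example a separate array of labels kept in step with the data.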

In [27]:
# Importing pandas
import pandas as pd

In [28]:
pd.__version__

'1.5.3'

In [29]:
import numpy as np

In [30]:
np.__version__

'1.23.5'

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [31]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes.

In [32]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [33]:
# gives an array-like object of type pd.Index
data.index

RangeIndex(start=0, stop=4, step=1)

In [34]:
# Like a NumPy array, data can be accessed via associated index
data[1]

0.5

In [35]:
data[1:3]

1    0.50
2    0.75
dtype: float64

## `Series` as a generalised NumPy array

A `Series` object is basically a 1-D NumPy array

Difference in the index:

- for NumPy it is implicitly defined

- for the Pandas `Series` it is explicitly defined

This means the index doesn't have to be an integer...

In [36]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [37]:
data['b']

0.5

In [38]:
# implicit (positional) index still works here; newer pandas versions prefer data.iloc[1]
data[1]

0.5

In [39]:
# a non-sequential indexing
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [40]:
data[5]

0.5
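With an integer index like this one, `data[5]` is read as a label, which can be confusing. A minimal sketch of how to make the intent explicit, using the `loc` (label-based) and `iloc` (position-based) accessors on the same data:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]       # explicit index: the entry labelled 5
by_position = data.iloc[1]   # implicit index: the second entry
third = data.iloc[2]         # position 2 holds the entry labelled 3

print(by_label, by_position, third)
```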

## `Series` as specialised dictionary

A Pandas `Series` is a bit like a Python dictionary

Dictionaries map arbitrary keys to a set of arbitrary values

A `Series` maps *typed* keys to a set of *typed* values

This typing is what makes a `Series` more efficient than a dictionary for certain operations, just as typed NumPy arrays are more efficient than Python lists

In [41]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [42]:
# index follows the order of the dictionary keys; items are looked up by key
population['California']

38332521

In [43]:
# slicing supported
# last one included
population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
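The dictionary analogy extends to a few other familiar operations. The sketch below uses a shortened copy of the population data; it is an illustration, not an exhaustive list of the dictionary-style methods.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127})

has_texas = 'Texas' in population     # membership tests the index labels
labels = list(population.keys())      # keys() returns the index
pairs = list(population.items())      # (label, value) pairs

# Assigning to a new label extends the Series, as with a dictionary
population['Florida'] = 19552860

print(has_texas, labels)
print(population)
```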

## Constructing Series objects

To construct, it's generally a variation of

`>>> pd.Series(data, index=index)`

- index is an optional argument
- data can be one of many entities...

In [44]:
# data can be a list
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [45]:
# data can be a scalar, which gets repeated
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [46]:
# data can be a dictionary
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [47]:
# can set different indexes if wanted!
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

In [49]:
# shortcut for showing docstring
# cmd + shift + space
pd.Series()

Series([], dtype: float64)

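N.B. when an explicit index is passed together with a dictionary, as in `pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])`, the Series is populated only with the explicitly identified keys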
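A short sketch of how an explicit index interacts with dictionary data. The variable names are illustrative; the second case shows what happens when a requested key is absent from the dictionary.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

subset = pd.Series(letters, index=[3, 2])     # only keys 3 and 2 are kept, in that order
with_gap = pd.Series(letters, index=[3, 4])   # key 4 is not in the dictionary, so it becomes NaN

print(subset)
print(with_gap)
```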

# The Pandas `DataFrame` Object

Like the `Series` object discussed in the previous section, the `DataFrame` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

## DataFrame as a generalised NumPy array

Recall: a `Series` is an analog of a one-dimensional array with flexible indices

Given this, a `DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names

In [50]:
# first let's construct a new Series
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995, 'Washington': 666}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Washington       666
dtype: int64

Let's use that population data from before...

In [51]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California  38332521.0  423967
Florida     19552860.0  170312
Illinois    12882135.0  149995
New York    19651127.0  141297
Texas       26448193.0  695662
Washington         NaN     666


In [52]:
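# a DataFrame has no attribute `Index`; the row labels live in lowercase `index`
states.Index[0]

AttributeError: 'DataFrame' object has no attribute 'Index'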

In [53]:
# square brackets look up a column name, and there is no column called 0
states[0]

KeyError: 0

In [54]:
states['population']

California    38332521.0
Florida       19552860.0
Illinois      12882135.0
New York      19651127.0
Texas         26448193.0
Washington           NaN
Name: population, dtype: float64

In [55]:
# let's check the index labels
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas', 'Washington'], dtype='object')

In [56]:
# data frames also have columns attribute
states.columns

Index(['population', 'area'], dtype='object')

In [57]:
states.values

array([[3.8332521e+07, 4.2396700e+05],
       [1.9552860e+07, 1.7031200e+05],
       [1.2882135e+07, 1.4999500e+05],
       [1.9651127e+07, 1.4129700e+05],
       [2.6448193e+07, 6.9566200e+05],
       [          nan, 6.6600000e+02]])

So we have a generalisation of a 2D NumPy array, with generalised row and column indices
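The array analogy can be checked with a few array-style attributes. This sketch uses a shortened copy of the data; it also shows why `population` was displayed as a float above: the missing entry is stored as `NaN`, which forces the column to `float64`.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662, 'Washington': 666})

# The two indices are aligned; Washington has no population entry
states = pd.DataFrame({'population': population, 'area': area})

print(states.shape)    # (rows, columns), as for a 2D array
print(states.dtypes)   # population is float64 because it holds a NaN
print(states.T)        # transposing swaps rows and columns
```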

## DataFrame as specialised dictionary

Recall: dictionaries map keys to values

A `DataFrame` maps a column name to a `Series` of column data

In [58]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Washington       666
Name: area, dtype: int64

In [59]:
states['population']

California    38332521.0
Florida       19552860.0
Illinois      12882135.0
New York      19651127.0
Texas         26448193.0
Washington           NaN
Name: population, dtype: float64
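A small sketch of the dictionary-style behaviour, using a two-row copy of the data. Note that the "keys" of a `DataFrame` are its column names, not its row labels.

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

has_area = 'area' in states            # membership tests the column names
has_texas = 'Texas' in states          # False: 'Texas' is a row label, not a column
column_names = list(states.keys())     # keys() returns the columns

for name, column in states.items():    # (column name, Series) pairs
    print(name, type(column).__name__)
```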

## Constructing a DataFrame object

There are a number of ways to construct a `DataFrame`.

1. From a collection of `Series` objects.

2. A single-column `DataFrame` can be constructed from a single `Series`:

In [60]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [61]:
pd.DataFrame({"population": population})

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


3. It can also be constructed from a __list of dictionaries__

In [62]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [63]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]) # , index = np.array([5, 6])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [64]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}], index=np.array([5, 6]))

     a  b    c
5  1.0  2  NaN
6  NaN  3  4.0


In [65]:
ind = pd.Index([2, 3])

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}], index=ind)

     a  b    c
2  1.0  2  NaN
3  NaN  3  4.0


If some keys are missing, Pandas fills these with `NaN`
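A minimal sketch of how the filled-in values can be located afterwards, using `isna()`. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

missing = df.isna()           # Boolean mask, True where a key was absent
per_column = missing.sum()    # number of missing entries in each column

print(missing)
print(per_column)
print(df.dtypes)              # columns holding NaN are stored as float64
```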

4. A `DataFrame` can be constructed from a __dictionary of Series objects__

In [66]:
pd.DataFrame({'population': population,
              'area': area})

            population    area
California  38332521.0  423967
Florida     19552860.0  170312
Illinois    12882135.0  149995
New York    19651127.0  141297
Texas       26448193.0  695662
Washington         NaN     666


5. You can create a `DataFrame` from a __two-dimensional array__, with specified column and index names. (If omitted, an integer index will be used for each.)

In [67]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.193617  0.198640
b  0.948232  0.416385
c  0.621240  0.941098


In [69]:
np.random.rand(3,2)

array([[0.00410002, 0.84390773],
       [0.30776911, 0.62900565],
       [0.36153045, 0.63887609]])

In [70]:
pd.DataFrame(np.random.rand(3, 2))

          0         1
0  0.415285  0.535594
1  0.298676  0.359202
2  0.888049  0.528361


6. From a __NumPy structured array__

In [71]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [72]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


In [73]:
B = np.ones((2, 4), dtype=np.int32)
C = pd.DataFrame(B)
C

   0  1  2  3
0  1  1  1  1
1  1  1  1  1


In [74]:
C.values

array([[1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

# The Pandas Index Object

Both the `Series` and `DataFrame` objects contain an explicit *index* that lets you reference and modify data.

The `Index` object is an immutable array

In [75]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [76]:
ind_2d = pd.Index([[2, 3, 5, 7, 11],
                   ['nmb', 'a', 'b', 'c', 'd']
                   ])
print(ind_2d)
print(ind_2d[1][0])

Index([[2, 3, 5, 7, 11], ['nmb', 'a', 'b', 'c', 'd']], dtype='object')
nmb


In [77]:
row_ind = pd.DataFrame([2, 3, 5, 7, 11])

print(row_ind[0][1])

row_ind

3


    0
0   2
1   3
2   5
3   7
4  11


`Index` operates like an array: we can index an index, and also slice:

In [78]:
ind[1]

3

In [79]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [80]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [81]:
# what happens when attempting to change a value of an Index?
ind[1] = 0

TypeError: Index does not support mutable operations

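Why is this immutable? It makes it safer to share an index between several `Series` and `DataFrame` objects, without side effects from inadvertent index modification.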
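A short sketch of what immutability means in practice. Item assignment fails, while methods such as `insert` return a new `Index` and leave the original untouched. The names `s1`, `s2` and `modified` are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Two Series labelled by the same index
s1 = pd.Series(range(5), index=ind)
s2 = pd.Series(range(5, 10), index=ind)

try:
    ind[1] = 0                    # in-place modification is not allowed
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

modified = ind.insert(1, 0)       # returns a new Index; `ind` itself is unchanged
print(ind)
print(modified)
```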

## Index as an ordered set

Pandas is designed to facilitate operations such as joins across datasets, which rely on set arithmetic.

`Index` objects can be combined using the union, intersection and (symmetric) difference operations

In [82]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [83]:
indA.intersection(indB)  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [84]:
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [85]:
indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')
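Two related operations, sketched under the same `indA` and `indB`: the one-sided `difference`, and the way arithmetic between two `Series` aligns on the union of their indices. The `Series` values here are made up for illustration.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

only_in_a = indA.difference(indB)   # labels in indA that are not in indB

a = pd.Series(1, index=indA)
b = pd.Series(10, index=indB)
total = a + b                       # aligned on the union; unmatched labels give NaN

print(only_in_a)
print(total)
```

Only the labels present in both indices (3, 5 and 7) receive a numeric result; the rest are `NaN`.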

# Summary

We've looked at the following during this lecture:

- The `Series` object
- The `DataFrame` object
- The `Index` object