## Pandas 
Pandas is one of the main tools we use in data engineering. It organises data into labelled structures and provides tools for working with them; for example, it can clean data and compute summary statistics. In this section, we will learn how to use Pandas.

All the materials were taken from the book.

In [103]:
import numpy as np
import pandas as pd

`Series` and `DataFrame` are the two main data structures in `Pandas`.

In [104]:
from pandas import Series, DataFrame

In [105]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
pd.options.display.max_columns = 20
pd.options.display.max_colwidth = 80
np.set_printoptions(precision=4, suppress=True)

### Series
A `Series` is a one-dimensional array-like object containing a sequence of values with an associated array of data labels, which is called its `index`.

In [106]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

We can print the data values and the index separately:

In [107]:
print(obj.array)
print(obj.index)

<NumpyExtensionArray>
[np.int64(4), np.int64(7), np.int64(-5), np.int64(3)]
Length: 4, dtype: int64
RangeIndex(start=0, stop=4, step=1)


We can also specify the index label of each data value:

In [108]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

You can use index labels to select and set values:

In [109]:
obj2["a"]
obj2["d"] = 6
obj2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

Filters and math functions can be applied to `Series`.

In [110]:
obj2[obj2 > 0]
obj2 * 2
import numpy as np
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Python dictionary can be passed to the `Series` constructor. In the following example, the dictionary keys become the index of the Series.

In [111]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

A Series can be converted back to a dictionary using its `to_dict` method:

In [112]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

When a dictionary is passed to `Series`, an explicit `index` determines which keys are kept and in what order. In the following example, three of the requested labels are found in `sdata`; no value is found for "California", so it is marked as missing (`NaN`), and "Utah" is excluded because it is not in `states`.

In [113]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
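
The label matching above can be pictured as two explicit steps: build the Series from the dictionary, then reindex it. The following is a minimal sketch of that behaviour using the same `sdata` and `states`; it illustrates the result and is not a description of how pandas works internally.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

# Passing index= keeps only the listed labels, in the listed order.
direct = pd.Series(sdata, index=states)

# Two-step version: build from the dict, then reindex to the same labels.
two_step = pd.Series(sdata).reindex(states)

print(direct.equals(two_step))
print(direct.index.tolist())
```

Both routes give the same Series, with `NaN` for "California" and no entry for "Utah".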

`isna` and `notna` methods can be used to find missing values:

In [114]:
pd.isna(obj4)
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [115]:
obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

A nice feature of Series is that it automatically aligns by index label in math operations:

In [116]:
obj3
obj4
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a `name` attribute:

In [117]:
obj4.name = "population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

The index of a Series can be altered in place by assignment:

In [118]:
obj
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame
A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type, e.g., numeric, string, Boolean, etc. The DataFrame has both a row and column index. It can be thought of as a dictionary of Series all sharing the same index.

The following example shows how to create a DataFrame from a dictionary of equal-length lists:

In [119]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [120]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


The `head()` method displays only the first five rows, and similarly the `tail()` method displays the last five:

In [121]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [122]:
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


The column order can be set by passing a `columns` argument:

In [123]:
pd.DataFrame(data, columns=["year", "state", "pop"])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


If a requested column is not contained in the dictionary, it will appear with missing values in the result:

In [124]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [125]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


A column can be retrieved from the DataFrame as a Series:

In [126]:
frame2["state"]
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Rows can be retrieved by position or name with the attributes `iloc` and `loc`:

In [127]:
frame2.loc[1]
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

Columns can be modified by assignment. For example, the values of column `debt` can be assigned.

In [128]:
frame2["debt"] = 16.5
frame2
frame2["debt"] = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,0.0
1,2001,Ohio,1.7,1.0
2,2002,Ohio,3.6,2.0
3,2001,Nevada,2.4,3.0
4,2002,Nevada,2.9,4.0
5,2003,Nevada,3.2,5.0


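When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. When assigning a Series, its labels are instead aligned to the DataFrame's index, and any index value without a matching label gets a missing value. In the following example none of the labels match, so the whole column becomes `NaN`: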

In [129]:
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,
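
To see label alignment on assignment with partially matching labels, here is a minimal sketch. The small `frame` and the values in `val` are made up for illustration; only labels that exist in the DataFrame's index receive a value.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"],
                      "pop": [1.5, 1.7, 2.4]})

# Labels 0 and 2 exist in frame's index; label 5 does not and is ignored.
val = pd.Series([-1.2, -1.7, 9.9], index=[0, 2, 5])
frame["debt"] = val

print(frame)
```

Row 1 has no matching label in `val`, so its `debt` entry is `NaN`, and the DataFrame keeps its original three rows.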


The `del` keyword can delete columns. First, a new column of Boolean values is added:

In [130]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,,False
5,2003,Nevada,3.2,,False


Secondly, the column can be deleted from the DataFrame:

In [131]:
del frame2["eastern"]
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another form of the data is a nested dictionary of dictionaries:

In [132]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

If the nested dictionary is passed to the DataFrame, the outer dictionary keys are interpreted as the columns and the inner keys as the row indices:

In [133]:
frame3 = pd.DataFrame(populations)
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


You can transpose the DataFrame:

In [134]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,,2.4,2.9


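Be careful when specifying the index explicitly: only the listed labels are kept, and labels that do not appear in the inner dictionaries (here 2003) are filled with missing values: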

In [135]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
2003,,


Dictionaries of Series are treated in much the same way:

In [136]:
pdata = {"Ohio": frame3["Ohio"][:-1],
         "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


If a DataFrame's index and columns have their name attributes set, these will be displayed:

In [137]:
frame3.index.name = "year"
frame3.columns.name = "state"
frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


DataFrame's `to_numpy` method returns the data contained in the DataFrame as a two-dimensional ndarray:

In [138]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

If the DataFrame's columns are different data types, the data type of the returned array will be chosen to accommodate all of the columns:

In [139]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

pandas’s Index objects are responsible for holding the axis labels (including a DataFrame’s column names) and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [140]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index
index[0:]

Index(['a', 'b', 'c'], dtype='object')

You can also create an Index explicitly and pass it to a Series:

In [141]:
labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
obj2.index is labels

True

In [142]:
frame3
frame3.columns
"Ohio" in frame3.columns
2003 in frame3.index

False

In [143]:
frame3.index

Index([2000, 2001, 2002], dtype='int64', name='year')

Pandas Index can contain duplicate labels:

In [144]:
pd.Index(["foo", "foo", "bar", "bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

### Reindexing
An important method on pandas objects is reindex, which means to create a new
object with the values rearranged to align with the new index. Consider an example:

In [145]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index,
introducing missing values if any index values were not already present:

In [146]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, you may want to do some interpolation or filling of
values when reindexing. The method option allows us to do this, using a method such
as ffill, which forward-fills the values:

In [147]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3
obj3.reindex(np.arange(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
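
The different ways of handling the gaps introduced by reindexing can be compared side by side. A minimal sketch using the same Series; the constant `"n/a"` is an arbitrary placeholder chosen for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
new_index = np.arange(6)

plain = obj.reindex(new_index)                    # gaps become missing values
filled = obj.reindex(new_index, method="ffill")   # gaps copy the previous value
const = obj.reindex(new_index, fill_value="n/a")  # gaps get a constant

print(plain.isna().tolist())
print(filled.tolist())
print(const.tolist())
```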

With DataFrame, reindex can alter the (row) index, columns, or both. When passed
only a sequence, it reindexes the rows in the result:

In [148]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed with the columns keyword:

In [149]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


Another way to reindex a particular axis is to pass the new axis labels as a positional
argument and then specify the axis to reindex with the axis keyword:

In [150]:
frame.reindex(states, axis="columns")

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


You can also reindex by using the loc operator, and many users prefer to always do it this
way. This works only if all of the new index labels already exist in the DataFrame
(whereas reindex will insert missing data for new labels):

In [151]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4
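
The difference between `loc` and `reindex` for labels that do not exist can be demonstrated directly. A minimal sketch with the same `frame`; the label "b" is deliberately absent from its index.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# reindex tolerates the unknown label "b" and fills its row with NaN.
reindexed = frame.reindex(["a", "b", "c"])

# loc refuses: every requested label must already exist.
try:
    frame.loc[["a", "b", "c"]]
    loc_failed = False
except KeyError:
    loc_failed = True

print(reindexed)
print(loc_failed)
```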


### Dropping Entries from an Axis

Dropping one or more entries from an axis is simple if you already have an index
array or list without those entries, since you can use the reindex method or .loc-based
indexing. As that can require a bit of munging and set logic, the drop method
will return a new object with the indicated value or values deleted from an axis:

In [152]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj
new_obj = obj.drop("c")
new_obj
obj.drop(["d", "c"])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we
first create an example DataFrame:

In [153]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Calling drop with a sequence of labels will drop values from the row labels (axis 0):

In [154]:
data.drop(index=["Colorado", "Ohio"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


To drop labels from the columns, instead use the columns keyword:

In [155]:
data.drop(columns=["two"])

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


You can also drop values from the columns by passing axis=1 (which is like NumPy)
or axis="columns":

In [156]:
data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
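
Two properties of `drop` are worth checking: it returns a new object rather than changing the original, and it raises an error for unknown labels unless told otherwise. A minimal sketch on the same `data`; the column name "five" is hypothetical and chosen because it does not exist.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Rows and columns can be dropped in one call.
trimmed = data.drop(index=["Ohio"], columns=["two"])

# The original is untouched; drop returned a new object.
print(data.shape, trimmed.shape)

# A missing label raises KeyError unless errors="ignore" is passed.
same = data.drop(columns=["five"], errors="ignore")
print(same.equals(data))
```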


### Indexing, Selection, and Filtering

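Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers. Note that recent pandas
releases deprecate using integer positions with [] on a non-integer index (as in
obj[1] and obj[[1, 3]] below), so prefer iloc for positional access. Here are some
examples of this: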

In [157]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj
obj["b"]
obj[1]
obj[2:4]
obj[["b", "a", "d"]]
obj[[1, 3]]
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

While you can select data by label this way, the preferred way to select index values is
with the special loc operator:

In [158]:
obj.loc[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

The reason to prefer loc is because of the different treatment of integers when
indexing with []. Regular []-based indexing will treat integers as labels if the index
contains integers, so the behavior differs depending on the data type of the index. For
example:

In [159]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
obj1
obj2
obj1[[0, 1, 2]]
obj2[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

Since loc operator indexes exclusively with labels, there is also an iloc operator
that indexes exclusively with integers to work consistently whether or not the index
contains integers:

In [160]:
obj1.iloc[[0, 1, 2]]
obj2.iloc[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

You can also slice with labels, but it works differently from normal
Python slicing in that the endpoint is inclusive:

In [161]:
obj2.loc["b":"c"]

b    2
c    3
dtype: int64

Assigning values using these methods modifies the corresponding section of the
Series:

In [162]:
obj2.loc["b":"c"] = 5
obj2

a    1
b    5
c    5
dtype: int64
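
The inclusive endpoint of label slices is easiest to see next to a positional slice, which excludes its endpoint. A minimal sketch on a fresh copy of the same small Series:

```python
import pandas as pd

obj = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = obj.loc["b":"c"]   # label slice: "c" is included
by_position = obj.iloc[1:2]   # positional slice: position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
```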

Indexing into a DataFrame retrieves one or more columns either with a single value
or sequence:

In [163]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data
data["two"]
data[["three", "one"]]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


Indexing like this has a few special cases. The first is slicing or selecting data with a
Boolean array:

In [164]:
data[:2]
data[data["three"] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


The row selection syntax data[:2] is provided as a convenience. Passing a single
element or a list to the [] operator selects columns.

Another use case is indexing with a Boolean DataFrame, such as one produced by
a scalar comparison. Consider a DataFrame with all Boolean values produced by
comparing with a scalar value:

In [165]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


We can use this DataFrame to assign the value 0 to each location with the value True,
like so:

In [166]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
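
Assigning through a Boolean DataFrame modifies `data` in place. When the original should be preserved, the `mask` method is one non-mutating alternative. The following is a minimal sketch starting from the unmodified example frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# mask replaces values where the condition is True and returns a new frame.
zeroed = data.mask(data < 5, 0)

print(zeroed)
print(data.loc["Ohio", "two"])  # the original still holds 1
```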


#### Selection on DataFrame with loc and iloc

Like Series, DataFrame has special attributes loc and iloc for label-based and
integer-based indexing, respectively. Since DataFrame is two-dimensional, you can
select a subset of the rows and columns with NumPy-like notation using either axis
labels (loc) or integers (iloc).

As a first example, let’s select a single row by label:

In [167]:
data
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

The result of selecting a single row is a Series with an index that contains the
DataFrame’s column labels. To select multiple rows, creating a new DataFrame, pass a
sequence of labels:

In [168]:
data.loc[["Colorado", "New York"]]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
New York,12,13,14,15


You can combine both row and column selection in loc by separating the selections
with a comma:

In [169]:
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

We’ll then perform some similar selections with integers using iloc:

In [170]:
data.iloc[2]
data.iloc[[2, 1]]
data.iloc[2, [3, 0, 1]]
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


Both indexing functions work with slices in addition to single labels or lists of labels:

In [171]:
data.loc[:"Utah", "two"]
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


A Boolean Series such as `data.three >= 2` can be used with loc but not directly with iloc:

In [172]:
data.loc[data.three >= 2]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
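
If a positional selection with a Boolean condition is needed, one option is to convert the Boolean Series to a plain NumPy array first. A minimal sketch, starting from the unmodified example frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

mask = data["three"] > 5               # Boolean Series, carries row labels

by_loc = data.loc[mask]                # loc accepts the Series directly
by_iloc = data.iloc[mask.to_numpy()]   # iloc is given a plain Boolean array

print(by_loc.equals(by_iloc))
```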


#### Integer indexing pitfalls

Working with pandas objects indexed by integers can be a stumbling block for new
users since they work differently from built-in Python data structures like lists and
tuples. For example, you might not expect the following code to generate an error:

In [173]:
ser = pd.Series(np.arange(3.))
ser
ser[-1]

KeyError: -1

In this case, pandas could “fall back” on integer indexing, but it is difficult to do
this in general without introducing subtle bugs into the user code. Here we have an
index containing 0, 1, and 2, but pandas does not want to guess what the user wants
(label-based indexing or position-based):

In [None]:
ser

On the other hand, with a noninteger index, there is no such ambiguity (although recent pandas versions deprecate this positional fallback in favor of iloc):

In [None]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
ser2[-1]

If you have an axis index containing integers, data selection will always be label
oriented. As I said above, if you use loc (for labels) or iloc (for integers) you will get
exactly what you want:

In [None]:
ser.iloc[-1]

On the other hand, slicing with integers is always integer oriented:

In [None]:
ser[:2]

As a result of these pitfalls, it is best to always prefer indexing with loc and iloc to
avoid ambiguity.
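
The distinction is clearest when the integer labels are not in positional order. A minimal sketch with made-up values:

```python
import pandas as pd

ser = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

print(ser.loc[0])    # label 0 -> 20.0
print(ser.iloc[0])   # first position -> 10.0
print(ser.iloc[-1])  # last position -> 30.0
```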

### Arithmetic and data alignment

pandas can make it much simpler to work with objects that have different indexes.
For example, when you add objects, if any index pairs are not the same, the respective
index in the result will be the union of the index pairs. Let’s look at an example:

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=["a", "c", "e", "f", "g"])
s1
s2

Adding these yields:

In [None]:
s1 + s2

The internal data alignment introduces missing values in the label locations that don’t
overlap. Missing values will then propagate in further arithmetic computations.
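
The union of the indexes and the positions of the missing values can be inspected directly. A minimal sketch repeating the two Series from the example:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

total = s1 + s2

# Labels present in only one operand ("d", "f", "g") come out as NaN.
print(total.index.tolist())
print(total.isna().tolist())

# NaN propagates through later arithmetic as well.
print((total * 2).isna().tolist())
```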

In the case of DataFrame, alignment is performed on both rows and columns:

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
df1
df2

Adding these returns a DataFrame with index and columns that are the unions of the
ones in each DataFrame:

In [None]:
df1 + df2

Since the "c" and "e" columns are not found in both DataFrame objects, they appear
as missing in the result. The same holds for the rows with labels that are not common
to both objects.


If you add DataFrame objects with no column or row labels in common, the result
will contain all nulls:

In [None]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
df1
df2
df1 + df2

#### Arithmetic methods with fill values

In arithmetic operations between differently indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other.
Here is an example where we set a particular value to NA (null) by assigning np.nan
to it:

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))
df2.loc[1, "b"] = np.nan
df1
df2

Adding these results in missing values in the locations that don’t overlap:

In [None]:
df1 + df2

Using the add method on df1, I pass df2 and an argument to fill_value, which
substitutes the passed value for any missing values in the operation:

In [None]:
df1.add(df2, fill_value=0)
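
The effect of `fill_value` can be checked on a few individual cells. A minimal sketch that rebuilds `df1` and `df2` as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))
df2.loc[1, "b"] = np.nan

plain = df1 + df2
filled = df1.add(df2, fill_value=0)

# Column "e" and row 3 exist only in df2, so plain addition gives NaN there.
print(plain["e"].isna().all(), plain.loc[3].isna().all())

# With fill_value=0 the missing side counts as 0.
print(filled.loc[3, "e"])   # 0 + 19.0
print(filled.loc[1, "b"])   # 5.0 + 0, because df2's NaN is filled as well
```

Note that `fill_value` also replaces the `NaN` that was assigned explicitly in `df2`, not only the gaps created by alignment.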

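Each arithmetic method has a counterpart starting with the letter r that has its arguments reversed, so the following two statements are equivalent: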

In [None]:
1 / df1
df1.rdiv(1)

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

#### Operations between DataFrame and Series

As with NumPy arrays of different dimensions, arithmetic between DataFrame and
Series is also defined. First, as a motivating example, consider the difference between
a two-dimensional array and one of its rows:

In [None]:
arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]

When we subtract arr[0] from arr, the subtraction is performed once for each row.
This is referred to as broadcasting and is explained in more detail as it relates to
general NumPy arrays in Appendix A. Operations between a DataFrame and a Series
are similar:

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]
frame
series

By default, arithmetic between DataFrame and Series matches the index of the Series
on the columns of the DataFrame, broadcasting down the rows:

In [None]:
frame - series

If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union:

In [None]:
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
series2
frame + series2

If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods and specify to match over the index. For example:

In [None]:
series3 = frame["d"]
frame
series3
frame.sub(series3, axis="index")

The axis that you pass is the axis to match on. In this case we mean to match on the
DataFrame’s row index (axis="index") and broadcast across the columns.
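
The two matching directions can be compared on the same frame. A minimal sketch; the variable names `by_row` and `by_col` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Default: the Series index is matched against the columns,
# so the first row is subtracted from every row.
by_row = frame - frame.iloc[0]

# axis="index": the Series index is matched against the row labels,
# so column "d" is subtracted from every column.
by_col = frame.sub(frame["d"], axis="index")

print(by_row)
print(by_col)
```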

### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [None]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
frame
np.abs(frame)

Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s apply method does exactly this:

In [None]:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)

Here the function f1, which computes the difference between the maximum and
minimum of a Series, is invoked once on each column in frame. The result is a Series
having the columns of frame as its index.

If you pass axis="columns" to apply, the function will be invoked once per row
instead. A helpful way to think about this is as “apply across the columns”:

In [None]:
frame.apply(f1, axis="columns")

Many of the most common array statistics (like sum and mean) are DataFrame methods,
so using apply is not necessary.
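
As an illustration of that point, the range computed with `apply` can also be obtained from the built-in `max` and `min` methods. A minimal sketch on a small deterministic frame (not the random one above); `value_range` is a hypothetical name for the same function as `f1`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def value_range(x):
    return x.max() - x.min()

via_apply = frame.apply(value_range)       # one call per column
via_methods = frame.max() - frame.min()    # same result from built-in methods

print(via_apply.equals(via_methods))
print(frame.apply(value_range, axis="columns").tolist())
```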


The function passed to apply need not return a scalar value; it can also return a Series
with multiple values:

In [None]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])
frame.apply(f2)

Element-wise Python functions can be used, too. Suppose you wanted to compute
a formatted string from each floating-point value in frame. You can do this with
applymap (pandas 2.1 and later rename this method to DataFrame.map and deprecate
applymap):

In [None]:
def my_format(x):
    return f"{x:.2f}"

frame.applymap(my_format)

The reason for the name applymap is that Series has a map method for applying an
element-wise function:

In [None]:
frame["e"].map(my_format)
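
Because the element-wise DataFrame method has different names across pandas releases, code that must run on several versions can choose the method at run time. This is a minimal sketch under the assumption that `DataFrame.map` is present from pandas 2.1 onward; the small frame is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"b": [0.5, -1.25], "e": [2.0, 3.14159]})

def my_format(x):
    return f"{x:.2f}"

# Assumption: newer pandas exposes DataFrame.map; older releases only have applymap.
if hasattr(pd.DataFrame, "map"):
    formatted = frame.map(my_format)
else:
    formatted = frame.applymap(my_format)

print(formatted)
```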

### Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort
lexicographically by row or column label, use the sort_index method, which returns
a new, sorted object:

In [None]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])
frame
frame.sort_index()
frame.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


The data is sorted in ascending order by default but can be sorted in descending
order, too:

In [None]:
frame.sort_index(axis="columns", ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its sort_values method:

In [None]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default:

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

Missing values can be sorted to the start instead by using the na_position option:

In [None]:
obj.sort_values(na_position="first")

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort
keys. To do so, pass one or more column names to sort_values:

In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame
frame.sort_values("b")

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


To sort by multiple columns, pass a list of names:

In [None]:
frame.sort_values(["a", "b"])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
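
When sorting by several columns, each key can have its own direction by passing a list to `ascending`. A minimal sketch on the same frame:

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" ascending, then by "b" descending within each value of "a".
result = frame.sort_values(["a", "b"], ascending=[True, False])

print(result)
```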


Ranking assigns ranks from one through the number of valid data points in an array,
starting from the lowest value. The rank methods for Series and DataFrame are the
place to look; by default, rank breaks ties by assigning each group the mean rank:

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which they’re observed in the
data:

In [None]:
obj.rank(method="first")

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Here, instead of using the average rank 6.5 for the entries 0 and 2, they instead have
been set to 6 and 7 because label 0 precedes label 2 in the data.

You can rank in descending order, too:

In [None]:
obj.rank(ascending=False)

0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64

DataFrame can compute ranks over the rows or the columns:

In [None]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
                      "c": [-2, 5, 8, -2.5]})
frame
frame.rank(axis="columns")

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


### Axis Indexes with Duplicate Labels

Up until now almost all of the examples we have looked at have unique axis labels
(index values). While many pandas functions (like reindex) require that the labels be
unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [None]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The is_unique property of the index can tell you whether or not its labels are unique:

In [None]:
obj.index.is_unique

False

Data selection is one of the main things that behaves differently with duplicates.
Indexing a label with multiple entries returns a Series, while single entries return a
scalar value:

In [None]:
obj["a"]
obj["c"]

np.int64(4)

This can make your code more complicated, as the output type from indexing can
vary based on whether or not a label is repeated.


The same logic extends to indexing rows (or columns) in a DataFrame:

In [None]:
df = pd.DataFrame(np.random.standard_normal((5, 3)),
                  index=["a", "a", "b", "b", "c"])
df
df.loc["b"]
df.loc["c"]

0    0.274992
1    0.228913
2    1.352917
Name: c, dtype: float64
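
One way to keep the output type predictable when labels may repeat is to pass a list of labels, which returns a Series whether or not the label is duplicated. A minimal sketch on the small Series from above:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# A scalar label returns a Series or a scalar depending on duplication...
print(isinstance(obj["a"], pd.Series), isinstance(obj["c"], pd.Series))

# ...whereas a list of labels always returns a Series.
print(obj.loc[["a"]].tolist(), obj.loc[["c"]].tolist())
```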

## Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series, or a Series of values
from the rows or columns of a DataFrame. Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data. Consider a
small DataFrame:

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame’s sum method returns a Series containing column sums:

In [None]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing axis="columns" or axis=1 sums across the columns instead:

In [None]:
df.sum(axis="columns")

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

By default, NA values are skipped: when an entire row or column contains only NA
values, the sum is 0, and otherwise the sum is taken over the non-NA values. This can
be disabled with the skipna option, in which case any NA value in a row or column
makes the corresponding result NA:

In [None]:
df.sum(axis="index", skipna=False)
df.sum(axis="columns", skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
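
A third behaviour sits between these two: the `min_count` argument of `sum` keeps skipping NA values but returns NA when fewer than the given number of valid values are present. A minimal sketch on the same `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

default = df.sum(axis="columns")                    # row "c" sums to 0.0
strict = df.sum(axis="columns", skipna=False)       # any NaN makes the row NaN
at_least_one = df.sum(axis="columns", min_count=1)  # only the all-NaN row is NaN

print(default["c"], strict["a"], at_least_one["c"], at_least_one["a"])
```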

Some aggregations, like mean, require at least one non-NA value to yield a value
result, so here we have:

In [None]:
df.mean(axis="columns")

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Some methods, like idxmin and idxmax, return indirect statistics, like the index value
where the minimum or maximum values are attained:

In [None]:
df.idxmax()

one    b
two    d
dtype: object

Other methods are accumulations:

In [None]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


Some methods are neither reductions nor accumulations. describe is one such
example, producing multiple summary statistics in one shot:

In [None]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


On nonnumeric data, describe produces alternative summary statistics:

In [None]:
obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

### Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs
of arguments. Let’s consider some DataFrames of stock prices and volumes originally
obtained from Yahoo! Finance and available in binary Python pickle files you can
find in the accompanying datasets for the book:

In [None]:
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")

FileNotFoundError: [Errno 2] No such file or directory: 'examples/yahoo_price.pkl'

I now compute percent changes of the prices:

In [None]:
returns = price.pct_change()
returns.tail()

The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [None]:
returns["MSFT"].corr(returns["IBM"])
returns["MSFT"].cov(returns["IBM"])
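
Because the cells above depend on the pickle files, here is a self-contained sketch that uses synthetic data instead. The series `x` and `y` are made-up random numbers, not stock returns. It checks that corr is the covariance divided by the product of the standard deviations, all computed on the overlapping non-NA values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two return series (assumption: not real market data).
rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(100))
y = 0.5 * x + pd.Series(rng.standard_normal(100))
y.iloc[:5] = np.nan  # introduce missing values in one series

corr = x.corr(y)  # Pearson correlation on overlapping non-NA pairs
cov = x.cov(y)    # covariance on the same pairs

# Recompute the correlation by hand from the covariance.
valid = x.notna() & y.notna()
manual_corr = cov / (x[valid].std() * y[valid].std())

print(corr, manual_corr)
```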

DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:

In [None]:
returns.corr()
returns.cov()

Using DataFrame’s corrwith method, you can compute pair-wise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:

In [None]:
returns.corrwith(returns["IBM"])

Passing a DataFrame computes the correlations of matching column names. Here, I
compute correlations of percent changes with volume:

In [None]:
returns.corrwith(volume)
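
The two forms of corrwith can be sketched on synthetic data. The column names "AAA", "BBB", and "CCC" and the frames below are hypothetical and stand in for the returns and volume data. Each entry of the result should match the corresponding Series-level corr call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cols = ["AAA", "BBB", "CCC"]  # hypothetical tickers
frame = pd.DataFrame(rng.standard_normal((50, 3)), columns=cols)
other = frame + 0.5 * pd.DataFrame(rng.standard_normal((50, 3)), columns=cols)

# Series argument: correlate every column of frame with that one Series.
by_column = frame.corrwith(frame["AAA"])

# DataFrame argument: correlate columns that share the same name.
matched = frame.corrwith(other)

print(by_column)
print(matched)
```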

### Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a
one-dimensional Series. To illustrate these, consider this example:

In [None]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

The first function is unique, which gives you an array of the unique values in a Series:

In [None]:
uniques = obj.unique()
uniques

The unique values are returned in the order in which they first appear, not in
sorted order, but they can be sorted after the fact if needed (for example, with
uniques.sort() when the result is a NumPy array). Relatedly, value_counts computes
a Series containing value frequencies:

In [None]:
obj.value_counts()

The Series is sorted by count in descending order as a convenience. To count the
values in a NumPy array or another Python sequence, wrap it in a Series first; the
top-level pandas.value_counts function that used to serve this purpose is
deprecated as of pandas 2.1:

In [None]:
pd.Series(obj.to_numpy()).value_counts(sort=False)
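
As a brief sketch, value_counts can also report relative frequencies rather than raw counts through its `normalize` option. That option is not covered in the text and is shown here only for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = obj.value_counts()                # absolute frequencies
shares = obj.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```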

isin performs a vectorized set membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame:

In [None]:
obj
mask = obj.isin(["b", "c"])
mask
obj[mask]
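
The same idea carries over to a DataFrame column. The sketch below uses a hypothetical two-column frame and also shows the complement of the mask, obtained with the `~` operator.

```python
import pandas as pd

# Hypothetical frame: a label column plus some values to filter.
frame = pd.DataFrame({
    "key": ["c", "a", "d", "a", "a", "b", "b", "c", "c"],
    "value": range(9),
})

mask = frame["key"].isin(["b", "c"])
subset = frame[mask]   # rows whose key is "b" or "c"
rest = frame[~mask]    # all remaining rows

print(subset)
print(rest)
```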

Related to isin is the Index.get_indexer method, which gives you an index array
from an array of possibly nondistinct values into another array of distinct values:

In [None]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)
indices
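
One detail worth knowing when using get_indexer this way: values that are not present in the index are reported as -1. A minimal sketch, where "z" is a made-up value that is absent from the index:

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])

# "z" does not occur in unique_vals, so its position is reported as -1.
positions = unique_vals.get_indexer(["c", "a", "b", "b", "z"])

print(positions)
```

Because -1 is also a valid position when indexing from the end, it is safer to test for it explicitly (for example, `positions == -1`) before using the result to index into another array.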

In some cases, you may want to compute a histogram on multiple related columns in
a DataFrame. Here’s an example:

In [None]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
data

We can compute the value counts for a single column, like so:

In [None]:
data["Qu1"].value_counts().sort_index()

To compute this for all columns, pass pandas.Series.value_counts (the top-level
pandas.value_counts is deprecated) to the DataFrame’s apply method:

In [None]:
result = data.apply(pd.Series.value_counts).fillna(0)
result

Here, the row labels in the result are the distinct values occurring in all of the
columns. The values are the respective counts of these values in each column.

There is also a DataFrame.value_counts method, but it computes counts considering
each row of the DataFrame as a tuple to determine the number of occurrences of each
distinct row:

In [None]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data
data.value_counts()
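
The result of DataFrame.value_counts is a Series whose index holds the distinct rows as tuples, one index level per column. As a small sketch, individual counts can be looked up by tuple, and `reset_index` turns the result back into a flat table. The column name "n" is chosen here only for illustration.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

row_counts = data.value_counts()

# Look up how often the row (a=1, b=0) occurs.
n_1_0 = row_counts.loc[(1, 0)]

# Flatten the hierarchical index into ordinary columns.
flat = row_counts.reset_index(name="n")

print(row_counts)
print(flat)
```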