# CHAPTER 4
# Pandas Basics

## 4.2 Essential Functionality 

This section walks you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas.

### 4.2.1 Reindexing 

An important method on pandas objects is **reindex**, which creates a new object with the data conformed to a new index. Consider an example:

In [1]:
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling **reindex** on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [2]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [6]:
obj.reindex?

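For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The **method** option allows us to do this, using a method such as **ffill**, which forward-fills the values, or **bfill**, which backward-fills them. The following example uses **bfill**, so each new label takes the next available value and the final label, which has no later value, is left missing: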

In [8]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [9]:
obj3.reindex(range(6), method='bfill') 

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
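As a minimal sketch of how the two fill methods differ, the following reindexes the same Series both ways. The variable names (`colors`, `forward`, `backward`) are illustrative only and do not appear elsewhere in this chapter.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the last value seen before it.
forward = colors.reindex(range(6), method='ffill')

# Backward fill: each new label takes the next value seen after it;
# label 5 has no later value, so it stays missing.
backward = colors.reindex(range(6), method='bfill')

print(forward.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
print(backward.tolist()[:5])
# ['blue', 'purple', 'purple', 'yellow', 'yellow']
```

With forward filling nothing is left missing here, because every new label has an earlier value to copy.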

With DataFrame, **reindex** can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [3]:
import numpy as np

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [4]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


The columns can be reindexed with the **columns** keyword:

In [5]:
states = ['Texas', 'ohio', 'California']
frame.reindex(columns=states)

   Texas  ohio  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
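Since **reindex** can alter rows, columns, or both, a short sketch of doing both in one call may help. It also illustrates that labels are matched exactly, which is why the lowercase 'ohio' column above comes back entirely missing. The name `both` is an illustrative choice.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex rows and columns in a single call.
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Ohio', 'California'])
print(both)

# Label matching is case-sensitive: 'ohio' does not match 'Ohio',
# so the resulting column holds only missing values.
print(frame.reindex(columns=['ohio'])['ohio'].isna().all())  # True
```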


Table 4.3 shows more about the arguments to **reindex**.

<br>
<center>Table 4.3: reindex function arguments  </center>
<img src="Table4.3.jpg">

### 4.2.2 Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries, but building one can require a bit of munging and set logic. The **drop** method instead returns a new object with the indicated value or values deleted from an axis:

In [None]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

In [None]:
new_obj = obj.drop('c')
new_obj

In [None]:
obj.drop(['d', 'c']) 

The **drop** method does not change the original object:

In [None]:
obj

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])
data

Calling **drop** with a sequence of labels will drop values from the row labels (axis 0):

In [None]:
data.drop(['Colorado', 'Ohio']) 

You can drop values from the columns by passing *axis=1* or *axis='columns'*:

In [None]:
data.drop('two', axis=1)

In [None]:
data.drop(['two', 'four'], axis='columns') 
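As a small sketch, **drop** also accepts `index=` and `columns=` keywords, which can read more clearly than an `axis` argument. The names `by_axis`, `by_keyword`, and `trimmed` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# These two calls remove the same columns.
by_axis = data.drop(['two', 'four'], axis='columns')
by_keyword = data.drop(columns=['two', 'four'])
print(by_axis.equals(by_keyword))  # True

# Rows and columns can be dropped together with the keyword form.
trimmed = data.drop(index=['Ohio'], columns=['two'])
print(trimmed.shape)  # (3, 3)
```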

Many functions, like **drop**, which modify the size or shape of a Series or DataFrame, can manipulate an object *in-place* without returning a new object:

In [None]:
obj.drop('c', inplace=True)
obj

Be careful with *inplace*, as it destroys any data that is dropped.
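The difference between the two modes can be sketched as follows, assuming a fresh Series. The names `trimmed` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default behavior: a new Series is returned and obj is untouched.
trimmed = obj.drop('c')
print(len(obj), len(trimmed))  # 5 4

# With inplace=True, obj itself is modified and the call returns None.
result = obj.drop('c', inplace=True)
print(result)           # None
print(list(obj.index))  # ['a', 'b', 'd', 'e']
```

Because the in-place call returns None, assigning its result back to the variable would lose the data entirely.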

### 4.2.3 Indexing, Selection, and Filtering

Series indexing **(obj[...])** works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [10]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [None]:
obj['b']

In [None]:
obj[1] 

In [None]:
obj[2:4] 

In [11]:
obj[['b', 'a', 'd']] 

b    1.0
a    0.0
d    3.0
dtype: float64

In [12]:
obj[[1, 3]] 

b    1.0
d    3.0
dtype: float64

In [None]:
obj[obj < 2] 

Slicing with labels behaves differently than normal Python slicing in that the endpoint is **inclusive**:

In [None]:
obj['b':'c'] 

*Setting* using these methods modifies the corresponding section of the Series:

In [None]:
obj['b':'c'] = 5
obj
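To make the endpoint behavior concrete, here is a minimal sketch that contrasts a label slice with a positional slice. It uses the explicit `loc` and `iloc` accessors to keep labels and positions unambiguous. The names `by_label` and `by_position` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: the endpoint 'c' is included.
by_label = obj.loc['b':'c']
print(list(by_label.index))     # ['b', 'c']

# Positional slice: the endpoint (position 2) is excluded.
by_position = obj.iloc[1:2]
print(list(by_position.index))  # ['b']

# Assigning through a label slice updates both 'b' and 'c'.
obj.loc['b':'c'] = 5
print(obj.tolist())             # [0.0, 5.0, 5.0, 3.0]
```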

Indexing into a DataFrame retrieves one or more columns, either with a single value or with a sequence:

In [13]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [23]:
data[['one']] 

          one
Ohio        0
Colorado    4
Utah        8
New York   12


In [15]:
data[['three', 'one']] 

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


Indexing like this has a few special cases. The first is slicing rows or selecting them with a boolean array:

In [16]:
data[:2] 

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [17]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


The row selection syntax *data[:2]* is provided as a convenience. Passing a single element or a list to the [] operator selects columns. 

Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [18]:
data == 5

            one    two  three   four
Ohio      False  False  False  False
Colorado  False   True  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [None]:
data[data < 5] = 0
data
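Because assignment through a boolean DataFrame modifies the object in place, one option is to work on a copy when the original values are still needed. The following is a sketch under that assumption. The name `masked` is illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Work on a copy so the original DataFrame keeps its values.
masked = data.copy()
masked[masked < 5] = 0   # every element below 5 becomes 0

print(masked.loc['Ohio'].tolist())      # [0, 0, 0, 0]
print(masked.loc['Colorado'].tolist())  # [0, 5, 6, 7]
print(data.loc['Ohio'].tolist())        # [0, 1, 2, 3]
```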

#### 4.2.3.1 Selection with loc and iloc

For label-based indexing on DataFrame rows, pandas provides the special indexing operators **loc** and **iloc**. They enable you to select a subset of the *rows* and *columns* from a DataFrame with NumPy-like notation using either axis labels (**loc**) or integers (**iloc**). As a preliminary example, let’s select a single row and multiple columns by label:

In [19]:
data.loc['Colorado', ['two', 'three']] 

two      5
three    6
Name: Colorado, dtype: int32

We’ll then perform some similar selections with integers using **iloc**:

In [20]:
data.iloc[2, [0, 1, 2]]

one       8
two       9
three    10
Name: Utah, dtype: int32

In [21]:
data.iloc[2] 

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [22]:
data.iloc[[1, 2], [3, 0, 1]] 

          four  one  two
Colorado     7    4    5
Utah        11    8    9


Both indexing functions work with slices in addition to single labels or lists of labels:

In [24]:
data.loc[:'Utah', 'two'] 

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [1]:
data.iloc[:, :3][data.three > 5]

So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, Table 4.4 provides a short summary of many of them. As you’ll see later, there are a number of additional options for working with hierarchical indexes.

<br>
<center>Table 4.4: Indexing options with DataFrame </center>
<img src="Table4.4.jpg">
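As a brief sketch of slicing with both operators, the following rebuilds the example DataFrame and contrasts a label slice, a positional slice, and a positional slice combined with a boolean filter. The names `by_label`, `by_position`, and `filtered` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc slices by label and includes the endpoint 'Utah'.
by_label = data.loc[:'Utah', 'two']
print(by_label.tolist())     # [1, 5, 9]

# iloc slices by position and excludes the endpoint (position 2).
by_position = data.iloc[:2, 1]
print(by_position.tolist())  # [1, 5]

# Take the first three columns, then keep rows where 'three' exceeds 5.
filtered = data.iloc[:, :3][data['three'] > 5]
print(list(filtered.index))  # ['Colorado', 'Utah', 'New York']
```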



### 4.2.4 Arithmetic and Data Alignment 

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you add objects together, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example:

In [25]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [26]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [27]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1 

In [None]:
df2

Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:

In [None]:
df1 + df2

Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects. 
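The alignment described above can be checked with a short sketch that rebuilds the two DataFrames and inspects where the sum is defined. The name `total` is illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

total = df1 + df2

# The result is labeled by the union of both row and column labels.
print(sorted(total.index))    # ['Colorado', 'Ohio', 'Oregon', 'Texas', 'Utah']
print(sorted(total.columns))  # ['b', 'c', 'd', 'e']

# Only labels present in both inputs produce a number.
print(total.loc['Ohio', 'b'])       # 3.0
print(total.loc['Texas', 'd'])      # 12.0
print(int(total.notna().sum().sum()))  # 4
```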

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:

In [None]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1 

In [None]:
df2

In [None]:
df1 + df2

#### 4.2.4.1 Arithmetic methods with fill values 

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),columns=list('abcd'))
df1 

In [None]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df2

Adding these together results in NA values in the locations that don’t overlap:

In [None]:
df1 + df2

Using the **add** method on *df1*, we pass *df2* and an argument to **fill_value**, which substitutes the given value wherever a label is missing from one of the two objects:

In [None]:
df1.add(df2, fill_value=0) 
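A minimal sketch comparing the plain `+` operator with **add** and `fill_value=0` on the same two DataFrames follows. The names `plain` and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

plain = df1 + df2                  # missing on either side gives NaN
filled = df1.add(df2, fill_value=0)  # missing on one side is treated as 0

print(plain.loc[1, 'b'])    # nan
print(filled.loc[1, 'b'])   # 5.0  (5.0 from df1 plus 0 for the missing df2 value)
print(filled.loc[0, 'e'])   # 4.0  (0 for the missing df1 column plus 4.0)
print(filled.loc[3, 'a'])   # 15.0 (0 for the missing df1 row plus 15.0)
```

In this example every location has a value in at least one of the two objects, so the filled result contains no missing values.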

Table 4.5 shows a listing of Series and DataFrame methods for arithmetic.

<br>
<center>Table 4.5: Flexible arithmetic methods  </center>
<img src="Table4.5.jpg">

As shown in Table 4.5, each of them has a counterpart, starting with the letter *r*, that has arguments flipped. So these two statements are equivalent:

In [None]:
1 / df1

In [None]:
df1.rdiv(1)
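The equivalence of an operator and its reversed method can be verified with a small sketch. The DataFrame `df` below is a hypothetical example chosen to avoid division by zero.

```python
import pandas as pd

# Hypothetical example data with no zeros.
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [4.0, 8.0]})

# rdiv(1) computes 1 / df; div(1) would compute df / 1.
print((1 / df).equals(df.rdiv(1)))    # True
print(df.rdiv(1)['b'].tolist())       # [0.25, 0.125]

# The same pattern holds for the other reversed methods, e.g. rsub.
print((10 - df).equals(df.rsub(10)))  # True
```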

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

In [None]:
df2.columns

In [None]:
df1.reindex(columns=df2.columns, fill_value=-1) 

### 4.2.5 Sorting 

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the **sort_index** method, which returns a new, sorted object:

In [28]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [29]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

In [30]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

       d  a  b  c
three  0  1  2  3
one    4  5  6  7


In [31]:
frame.sort_index()

       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [32]:
frame.sort_index(axis=1)

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


The data is sorted in ascending order by default, but can be sorted in descending order, too:

In [33]:
frame.sort_index(axis=1, ascending=False) 

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


To sort a Series by its values, use its **sort_values** method:

In [34]:
obj = pd.Series([4, 7, -3, 2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [35]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default:

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

In [36]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
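As a sketch of how missing values are placed, the following sorts the same Series with the default settings, with `na_position='first'`, and in descending order. The names `default`, `na_first`, and `descending` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Default: ascending order, missing values at the end.
default = obj.sort_values()
print(default.index[:4].tolist())     # [4, 5, 0, 2]

# na_position='first' moves the missing values to the front.
na_first = obj.sort_values(na_position='first')
print(na_first.index[2:].tolist())    # [4, 5, 0, 2]

# Descending order still keeps missing values at the end by default.
descending = obj.sort_values(ascending=False)
print(descending.index[:4].tolist())  # [2, 0, 5, 4]
```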

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of **sort_values**:

In [37]:
frame = pd.DataFrame({'b': [8, 7, -3, 2], 'a': [1, 1, 3, 4]})
frame

   b  a
0  8  1
1  7  1
2 -3  3
3  2  4


In [39]:
frame.sort_values(by='b') 

   b  a
2 -3  3
3  2  4
1  7  1
0  8  1


To sort by multiple columns, pass a list of names:

In [38]:
frame.sort_values(by=['a', 'b']) 

   b  a
1  7  1
0  8  1
2 -3  3
3  2  4
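When sorting by several columns, the `ascending` argument can also take a list with one flag per key. The following is a minimal sketch of that usage. The names `both_ascending` and `mixed` are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({'b': [8, 7, -3, 2], 'a': [1, 1, 3, 4]})

# Sort by 'a', breaking ties with 'b', both ascending.
both_ascending = frame.sort_values(by=['a', 'b'])
print(both_ascending.index.tolist())  # [1, 0, 2, 3]

# Sort 'a' in descending order and break ties with 'b' ascending.
mixed = frame.sort_values(by=['a', 'b'], ascending=[False, True])
print(mixed.index.tolist())           # [3, 2, 1, 0]
```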


## 4.3 Summarizing and Computing Descriptive Statistics 

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:


In [40]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


Calling DataFrame’s sum method returns a Series containing column sums:

In [41]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing *axis='columns'* or *axis=1* sums across the columns instead:

In [None]:
df.sum(axis='columns') 

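NA values are excluded by default. When an entire row or column contains only NA values, the sum is 0, while a statistic such as the mean is NaN. The exclusion can be disabled with the **skipna** option, in which case any NA value in a row or column makes the corresponding result NA: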

In [None]:
df.mean(axis='columns', skipna=False)
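The effect of **skipna** on row-wise reductions can be sketched as follows, reusing the small DataFrame from above. The names `row_sum`, `row_mean`, and `row_mean_strict` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default (skipna=True): missing values are ignored within each row.
row_sum = df.sum(axis='columns')
row_mean = df.mean(axis='columns')

# skipna=False: any missing value makes the row's result missing.
row_mean_strict = df.mean(axis='columns', skipna=False)

print(row_sum['c'])          # 0.0 (all values missing, so the sum is 0)
print(row_mean['a'])         # 1.4 (only the non-missing value is averaged)
print(row_mean_strict['a'])  # nan
```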

Some methods, like **idxmin** and **idxmax**, return indirect statistics like the index value where the minimum or maximum values are attained:

In [42]:
df.idxmax() 

one    b
two    d
dtype: object

Other methods are *accumulations*:

In [None]:
df.cumsum() 

Another type of method is neither a reduction nor an accumulation. **describe** is one such example, producing multiple summary statistics in one shot:

In [None]:
df.describe() 
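To illustrate how an accumulation and **describe** treat missing data, here is a short sketch on the same DataFrame. The names `running` and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# cumsum keeps a running total per column; missing entries stay missing
# but do not reset the total.
running = df.cumsum()
print(running['one'].round(2).tolist())  # [1.4, 8.5, nan, 9.25]

# describe reports count, mean, std, min, quartiles, and max per column;
# count only includes the non-missing values.
summary = df.describe()
print(summary.loc['count'].tolist())     # [3.0, 2.0]
```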

See Table 4.6 for a full list of summary statistics and related methods.

<br>
<center>Table 4.6: Descriptive and summary statistics   </center>
<img src="Table4.6.jpg">