# 5  Getting Started with pandas

Throughout the remaining chapters of the book, pandas will be a central focus. It offers data structures and tools for manipulating data in Python, streamlining the processes of data cleaning and analysis. pandas is commonly utilized alongside numerical computing tools like NumPy and SciPy, analytical libraries such as statsmodels and scikit-learn, and data visualization libraries like matplotlib. While pandas borrows coding styles from NumPy, its primary distinction lies in its specialization for working with tabular or diverse data, unlike NumPy, which excels with homogeneously typed numerical array data.

Since its transition to an open-source project in 2010, pandas has evolved into a substantial library applicable to a wide range of real-world scenarios. With a developer community exceeding 2,500 contributors, the project has grown significantly, benefitting from the collective expertise of individuals who have actively used it to address their day-to-day data challenges. The thriving pandas developer and user communities have played a pivotal role in the success of the library.

For the remainder of the book, we'll stick to the following import conventions for NumPy and pandas:

```python
In [1]: import numpy as np

In [2]: import pandas as pd
```

So, whenever you encounter `pd.` in the code, it's essentially shorthand for pandas. To simplify things further, you might find it convenient to bring Series and DataFrame directly into the local namespace, considering their frequent usage:

```python
In [3]: from pandas import Series, DataFrame
```

In [1]:
import numpy as np
import pandas as pd

In [2]:
from pandas import Series, DataFrame

## 5.1 Introduction to pandas Data Structures

Getting started with pandas involves becoming acquainted with its two primary data structures: Series and DataFrame. While they may not be a one-size-fits-all solution, they serve as a robust foundation for a diverse array of data tasks.

### Series
A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy's) and an associated array of data labels, called its index. The simplest Series is created from just an array of data:

```python
In [14]: obj = pd.Series([4, 7, -5, 3])

In [15]: obj
Out[15]: 
0    4
1    7
2   -5
3    3
dtype: int64
```

The index is automatically generated as integers when not specified. You can access the array representation and index using `obj.array` and `obj.index` respectively.

```python
In [16]: obj.array
Out[16]: 
<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [17]: obj.index
Out[17]: RangeIndex(start=0, stop=4, step=1)
```

You can create a Series with a labeled index:

```python
In [18]: obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
```

Accessing values using labels is possible, and operations maintain the link between index and values:

```python
In [21]: obj2["a"]
Out[21]: -5

In [22]: obj2["d"] = 6

In [24]: obj2[obj2 > 0]
Out[24]: 
d    6
b    7
c    3
dtype: int64
```

A Series can be viewed as a fixed-length, ordered dictionary. You can create one from a dictionary or convert it back using `to_dict()`.

```python
In [32]: obj3 = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000})
```
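As a small sketch of the dictionary analogy, the following round-trips a dictionary through a Series. The label `"Iowa"` is only an illustrative key that is absent from the data.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Membership tests check the index labels, much like dictionary keys
has_ohio = "Ohio" in obj3
has_iowa = "Iowa" in obj3

# to_dict() converts the Series back into a plain Python dictionary
round_trip = obj3.to_dict()
print(has_ohio, has_iowa, round_trip == sdata)
```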

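Beyond dictionary-style access, this introduction to Series covers missing data, label-aligned arithmetic, and names for the Series and its index. Arithmetic aligns data by index label, much like a join operation in a database. In the example below, `obj4` is built from the same dictionary as `obj3` but with the index `["California", "Ohio", "Oregon", "Texas"]`, so `"California"` has no value and `"Utah"` is left out; any label missing from either operand yields `NaN` in the sum: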

```python
In [42]: obj3 + obj4
Out[42]: 
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
```
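The alignment shown above can be reproduced end to end with the sketch below, which also uses `isna` to flag the labels that had no match. It assumes the same `sdata` dictionary and state list used elsewhere in this chapter.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Passing an index picks keys by label; "California" has no value, so it becomes NaN
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Addition matches values by label, not by position
total = obj3 + obj4

# isna() marks labels that were missing from at least one operand
missing = total.isna()
print(total)
print(missing)
```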

Both the Series object and its index have a `name` attribute, which integrates with other areas of pandas functionality:

```python
In [43]: obj4.name = "population"

In [44]: obj4.index.name = "state"

In [45]: obj4
Out[45]: 
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
```

Assigning a name to the Series object as a whole (`obj4.name`) and naming its index (`obj4.index.name`) enhances clarity, especially when dealing with multiple Series in a broader analysis or when combining data from different sources.

Additionally, you can modify a Series's index directly through assignment:

```python
In [46]: obj
Out[46]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [47]: obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [48]: obj
Out[48]: 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
```

This flexibility allows you to personalize the index to better reflect the context of your data, making it more meaningful and interpretable in your analysis.

In [3]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
pd.options.display.max_columns = 20
pd.options.display.max_colwidth = 80
np.set_printoptions(precision=4, suppress=True)

In [8]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [9]:
obj.array


<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [10]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [14]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [15]:
obj2["a"]


-5

In [17]:
obj2["d"] = 6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [18]:
obj2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

In [19]:
obj2[obj2 > 0]


d    6
b    7
c    3
dtype: int64

In [20]:
obj2 * 2


d    12
b    14
a   -10
c     6
dtype: int64

In [21]:
import numpy as np
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [9]:
"b" in obj2
"e" in obj2

In [10]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

In [11]:
obj3.to_dict()

In [12]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

In [13]:
pd.isna(obj4)
pd.notna(obj4)

In [14]:
obj4.isna()

In [15]:
obj3
obj4
obj3 + obj4

In [16]:
obj4.name = "population"
obj4.index.name = "state"
obj4

In [17]:
obj
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

### Exercise:

1. **Series with Labels:**
   - Create a new Series named `obj2` with the values [4, 7, -5, 3] with labels ["d", "b", "a", "c"]. Display the Series and its index.

2. **Accessing and Modifying Elements:**
   - Retrieve the value associated with the label "a" from `obj2`.
   - Change the value associated with the label "d" in `obj2` to 6.
   - Display the subset of `obj2` containing values with labels ["c", "a", "d"].

3. **NumPy-like Operations on Series:**
   - Display the elements of `obj2` that are greater than 0.
   - Multiply all elements of `obj2` by 2.
   - Calculate the exponential of each element in `obj2` using NumPy's `exp` function.

4. **Series as a Dictionary:**
   - Check if the label "b" is present in `obj2`.
   - Check if the label "e" is present in `obj2`.
   - Create a Series named `obj3` from the dictionary `sdata` provided in the material. Display the resulting Series.

5. **Handling Missing Data:**
   - Display a boolean Series indicating whether each element in `obj4` is missing (NaN).
   - Display a boolean Series indicating whether each element in `obj4` is not missing.

6. **Series Attributes:**
   - Give the Series `obj4` the name "population" and name its index "state". Display the updated Series.

7. **Changing Index Labels:**
   - Change the index of the `obj` Series to ["Bob", "Steve", "Jeff", "Ryan"]. Display the updated Series.

### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can hold a different value type (numeric, string, Boolean, and so on). It has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same index.

The DataFrame can also be utilized for organizing data beyond two dimensions by employing hierarchical indexing, which we'll delve into in Chapter 8: Data Wrangling: Join, Combine, and Reshape, and is foundational for more advanced data manipulation functionalities in pandas.

Constructing a DataFrame can be achieved through various methods, but one common approach is using a dictionary comprising lists or NumPy arrays of equal length:

```python
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
```

The resulting DataFrame is automatically assigned an index, akin to Series, and the columns are arranged based on the order of keys in the data dictionary, preserving their insertion order:

```
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
```

In [3]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

The `head` method returns the first five rows of a DataFrame by default (pass an integer to request a different number). This gives a concise preview when working with large datasets. For the `frame` defined above, `frame.head()` returns:

```
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
```

Conversely, the `tail` method returns the last five rows by default, which is helpful for checking the end of a dataset. For the same `frame`, `frame.tail()` returns:

```
    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
```

When creating a DataFrame, you can specify the order of columns using the `columns` parameter. This allows you to arrange the columns as desired. For example, if `data` is a dictionary containing the data, you can create a DataFrame with specific column order like this:

```
pd.DataFrame(data, columns=["year", "state", "pop"])
```

This will result in a DataFrame where the columns are arranged as specified.

If you provide a column name that is not present in the dictionary used to create the DataFrame, it will appear with missing values in the resulting DataFrame. For instance, if you create a DataFrame `frame2` with an additional column named "debt", but it's not present in the original data, it will show up with NaN values:

```
   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN
```

Columns in a DataFrame can be accessed as Series using either dictionary-like notation (`frame2["state"]`) or dot attribute notation (`frame2.state`). However, the latter method (`frame2.column`) only works when the column name is a valid Python variable name and does not conflict with any DataFrame method names. It's worth noting that both methods return Series objects with the same index as the DataFrame, and their name attributes are appropriately set.
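The example data happens to illustrate the method-name conflict: `pop` is also the name of a DataFrame method, so attribute access does not return that column. The sketch below uses a shortened, two-row version of the data as an assumption for brevity.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

# Both notations return the same Series for an ordinary column name
by_key = frame["year"]
by_attr = frame.year
same = by_key.equals(by_attr)

# "pop" collides with the DataFrame.pop method, so frame.pop is the method
pop_is_method = callable(frame.pop)

# Bracket notation always returns the column
pop_column = frame["pop"]
print(same, pop_is_method, pop_column.tolist())
```

Bracket notation is the safer default because it works for any column name.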

In [20]:
frame.head()

In [21]:
frame.tail()

In [22]:
pd.DataFrame(data, columns=["year", "state", "pop"])

In [23]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2
frame2.columns

In [24]:
frame2["state"]
frame2.year

Rows can be retrieved either by position or by label using the special `iloc` and `loc` attributes, respectively. These indexers allow for more flexible and precise row selection.

For example, the `loc` attribute retrieves a row of `frame2` by its label. Here `frame2.loc[1]` returns the row whose label is the integer `1`:

```
year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object
```

Similarly, the `iloc` attribute allows for row retrieval by position. In the given code, `frame2.iloc[2]` retrieves the row at position 2 (zero-indexed):

```
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object
```

Both `loc` and `iloc` provide powerful methods for accessing specific rows in a DataFrame, enabling tasks such as filtering and manipulation based on either labels or positions.
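With the default integer index, labels and positions coincide, which hides the difference between the two indexers. A minimal sketch with a hypothetical index of `10, 20, 30` makes the distinction visible:

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Nevada", "Utah"], "pop": [1.5, 2.4, 0.9]},
    index=[10, 20, 30],  # hypothetical labels, chosen to differ from positions
)

by_label = frame.loc[20]     # the row whose label is 20
by_position = frame.iloc[0]  # the first row, whatever its label
print(by_label["state"], by_position["state"])
```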

In [25]:
frame2.loc[1]
frame2.iloc[2]

Columns in a pandas DataFrame can be easily modified by assignment. For instance, you can assign a scalar value or an array of values to a column. Here are a few examples:

```python
# Assigning a scalar value to the 'debt' column
frame2["debt"] = 16.5

# Assigning an array of values to the 'debt' column
frame2["debt"] = np.arange(6.)

# Assigning a Series to the 'debt' column
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
frame2["debt"] = val
```

In each case, the DataFrame `frame2` is updated accordingly. When you assign a list or array to a column, its length must match the length of the DataFrame. When you assign a Series, its labels are instead realigned to the DataFrame's index, and missing values are inserted for any index labels not present in the Series.
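The difference between assigning a Series and assigning a plain list can be sketched as follows. The small `frame` here is a stand-in for illustration, not the chapter's `frame2`.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002, 2003]})

# A shorter Series is accepted: values are matched by label, not by position
val = pd.Series([-1.2, -1.5], index=[1, 3])
frame["debt"] = val
debt_missing = frame["debt"].isna().tolist()
print(frame)

# A list has no labels to align on, so a wrong length is rejected
try:
    frame["debt"] = [1.0, 2.0]
    length_error = False
except ValueError:
    length_error = True
print(length_error)
```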

Additionally, you can create new columns by assigning values to a column that doesn't exist. For instance, in the code snippet below, a new column named 'eastern' is created based on a condition:

```python
frame2["eastern"] = frame2["state"] == "Ohio"
```

Note, however, that new columns cannot be created with the dot attribute notation (for example, `frame2.eastern = ...`).

If you need to delete columns, you can use the `del` keyword, similar to deleting keys in a dictionary. For example, to remove the 'eastern' column:

```python
del frame2["eastern"]
```

After deletion, you can verify the columns of the DataFrame using the `columns` attribute (`frame2.columns`). 

Be careful when modifying columns: the column returned by indexing a DataFrame may be a view on the underlying data rather than a copy, so in-place modifications to that Series can be reflected in the DataFrame itself. To modify a Series without affecting the original DataFrame, copy it explicitly with the `copy()` method.
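A minimal sketch of the explicit-copy approach, using a small stand-in DataFrame: changes made to the copy stay in the copy.

```python
import pandas as pd

frame = pd.DataFrame({"debt": [0.0, 1.0, 2.0]})

# copy() gives an independent Series with its own data
debt_copy = frame["debt"].copy()
debt_copy.iloc[0] = 99.0

print(frame["debt"].tolist())  # the original column is unchanged
print(debt_copy.tolist())
```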

In [26]:
frame2["debt"] = 16.5
frame2
frame2["debt"] = np.arange(6.)
frame2

In [27]:
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
frame2["debt"] = val
frame2

In [28]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

In [29]:
del frame2["eastern"]
frame2.columns

Another common data structure is a nested dictionary of dictionaries. Here, the outer dictionary keys typically represent the column names, while the inner dictionary keys serve as row indices, with their corresponding values being the data points.

For example, consider the following nested dictionary named `populations`:

In [30]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

If you pass this nested dictionary to a pandas DataFrame, pandas will interpret the outer dictionary keys as the columns and the inner keys as the row indices. The resulting DataFrame, `frame3`, will look like this:


In [31]:
frame3 = pd.DataFrame(populations)
frame3

Here, the columns represent the states ("Ohio" and "Nevada"), and the rows correspond to the years (2000, 2001, and 2002). The values in the DataFrame are filled according to the data provided in the nested dictionary. For instance, in the year 2000, the population of Ohio is 1.5, while no population data is available for Nevada in that year (hence, it's represented as NaN, meaning "Not a Number").

You can transpose a DataFrame, swapping its rows and columns, using syntax similar to that of a NumPy array. For instance, calling `.T` on the DataFrame `frame3` will transpose it:

```python
frame3.T
```

This operation results in the following transposed DataFrame:

```
        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9
```

It's worth noting a caveat: transposing a DataFrame discards column data types if the columns do not all have the same data type. Therefore, transposing and then transposing back may result in the loss of the previous type information, making the columns become arrays of pure Python objects.
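The caveat can be observed directly by transposing a mixed-type DataFrame twice. The two-column `mixed` frame below is an assumption made for illustration.

```python
import pandas as pd

mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

# Each transposed column holds an integer and a string, so it becomes object
round_trip = mixed.T.T

print(mixed.dtypes)       # "year" starts out as an integer column
print(round_trip.dtypes)  # after the round trip, "year" is object
```

The values themselves survive the round trip; only the column type information is lost.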

The keys in the inner dictionaries are combined to form the index in the result. This is not the case when an explicit index is specified; the result then contains exactly the labels you pass. For example:

```python
pd.DataFrame(populations, index=[2001, 2002, 2003])
```

This will produce a DataFrame with the specified index:

```
      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
2003   NaN     NaN
```

Dictionaries of Series are handled similarly. For instance, if you have a dictionary `pdata` where the values are Series, you can create a DataFrame from it:

```python
pdata = {"Ohio": frame3["Ohio"][:-1],
         "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)
```

This results in the following DataFrame:

```
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
```

For the kinds of data inputs that you can pass to the DataFrame constructor, see Table 5.1 below, which outlines each input type and how it is interpreted when creating a DataFrame.



| Type                                    | Notes                                                                                                           |
|-----------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| 2D ndarray                              | A matrix of data, passing optional row and column labels                                                       |
| Dictionary of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length                         |
| NumPy structured/record array           | Treated as the “dictionary of arrays” case                                                                     |
| Dictionary of Series                   | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed |
| Dictionary of dictionaries             | Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case |
| List of dictionaries or Series         | Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels |
| List of lists or tuples                | Treated as the “2D ndarray” case                                                                               |
| Another DataFrame                      | The DataFrame’s indexes are used unless different ones are passed                                              |
| NumPy MaskedArray                      | Like the “2D ndarray” case except masked values are missing in the DataFrame result                            |


In [32]:
frame3.T

In [33]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

In [34]:
pdata = {"Ohio": frame3["Ohio"][:-1],
         "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

In [35]:
frame3.index.name = "year"
frame3.columns.name = "state"
frame3

Unlike Series objects, DataFrames do not have a `name` attribute. However, DataFrames provide a method called `to_numpy()` that allows you to retrieve the data stored in the DataFrame as a two-dimensional NumPy ndarray.

For example, calling `frame3.to_numpy()` on a DataFrame named `frame3` will return a two-dimensional ndarray containing the DataFrame's data. Here's an illustration:

```python
frame3.to_numpy()
```

The output will resemble the following ndarray:

```
array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])
```

If the DataFrame's columns have different data types, the data type of the resulting ndarray is chosen to accommodate all of the columns. Integer and floating-point columns together produce a floating-point array, but once a non-numeric column such as strings is involved, the result has `dtype=object`, which stores every element as a Python object.

For instance, calling `frame2.to_numpy()` on a DataFrame named `frame2` with mixed data types will produce an ndarray with `dtype=object`:

```python
frame2.to_numpy()
```

The output will be similar to the following:

```
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, -1.2],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -1.5],
       [2003, 'Nevada', 3.2, -1.7]], dtype=object)
```

In this example, the elements in the ndarray are treated as Python objects due to the presence of mixed data types in the DataFrame columns.
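How the resulting dtype is chosen can be checked with three small DataFrames. These are illustrative stand-ins, not the chapter's `frame2` and `frame3`.

```python
import pandas as pd

floats = pd.DataFrame({"a": [1.5, 2.5], "b": [3.0, 4.0]})
numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

float_dtype = floats.to_numpy().dtype     # all float columns stay float
numeric_dtype = numeric.to_numpy().dtype  # int + float is upcast to float
mixed_dtype = mixed.to_numpy().dtype      # numbers + strings fall back to object
print(float_dtype, numeric_dtype, mixed_dtype)
```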

In [36]:
frame3.to_numpy()

In [37]:
frame2.to_numpy()

### Index Object

pandas's Index objects hold the axis labels (including a DataFrame's column names) and other metadata such as the axis name or names. When you construct a Series or DataFrame, any array or sequence of labels you use is internally converted to an Index object.

For instance, consider creating a Series `obj` with labels "a", "b", and "c":

```python
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
```

The `index` attribute of this Series holds an Index object:

```python
index = obj.index
```

The `index` object would look like this:

```
Index(['a', 'b', 'c'], dtype='object')
```

It's important to note that Index objects are immutable, meaning they cannot be modified by the user. This immutability ensures the safety of sharing Index objects among different data structures.

```python
index[1] = "d"  # This will raise a TypeError
```

Despite immutability, Index objects offer capabilities similar to a fixed-size set. For example, you can check for the presence of a label within an Index:

```python
frame3.columns  # This returns Index(['Ohio', 'Nevada'], dtype='object', name='state')
"Ohio" in frame3.columns  # This returns True
```

However, unlike Python sets, Index objects can contain duplicate labels. When selecting with duplicate labels, all occurrences of that label will be selected.
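Both behaviors, immutability and duplicate labels, can be seen in a short sketch. The Series `dup` and its values are made up for illustration.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

# Item assignment on an Index is rejected
try:
    index[1] = "d"
    raised = False
except TypeError:
    raised = True

# Duplicate labels are allowed; selecting one returns every match
dup = pd.Series([1, 2, 3, 4], index=["foo", "foo", "bar", "bar"])
foo_values = dup["foo"]
print(raised, foo_values.tolist(), dup.index.is_unique)
```

Note that selecting a duplicated label returns a Series, whereas selecting a unique label returns a scalar, so the result type depends on the data.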

Index objects provide a variety of methods and properties for set logic operations. Here are some useful ones:

- `append()`: Concatenate with additional Index objects, producing a new Index
- `difference()`: Compute set difference as an Index
- `intersection()`: Compute set intersection
- `union()`: Compute set union
- `isin()`: Compute a Boolean array indicating whether each value is contained in the passed collection
- `delete()`: Compute a new Index with the element at position i deleted
- `drop()`: Compute a new Index by deleting passed values
- `insert()`: Compute a new Index by inserting an element at position i
- `is_monotonic_increasing`: Property that is True if each element is greater than or equal to the previous element (older pandas versions also exposed this as `is_monotonic`)
- `is_unique`: Property that is True if the Index has no duplicate values
- `unique()`: Compute the array of unique values in the Index

These methods and properties offer valuable tools for handling and analyzing data contained within Index objects.
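A few of these operations are sketched below on two small, made-up Index objects. Each call returns a new object and leaves `left` unchanged.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

union = left.union(right)                # labels found in either Index
intersection = left.intersection(right)  # labels found in both
difference = left.difference(right)      # labels only in left
membership = left.isin(["a", "c"])       # Boolean array, one entry per label
inserted = left.insert(1, "z")           # new Index with "z" at position 1

print(union, intersection, difference, membership, inserted, sep="\n")
print(left.is_unique, left.is_monotonic_increasing)
```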

In [38]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index
index[1:]

In [39]:
labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
obj2.index is labels

In [40]:
frame3
frame3.columns
"Ohio" in frame3.columns
2003 in frame3.index

In [41]:
pd.Index(["foo", "foo", "bar", "bar"])

## 5.2 Essential Functionality

This section walks through the fundamental mechanics of working with the data held in a Series or DataFrame, beginning with reindexing, a basic operation for conforming data to a new set of labels. Later chapters delve into more advanced analysis and manipulation techniques, and they build on the operations introduced here.

**Reindexing:**

Reindexing, a vital method in pandas, involves creating a new object with values rearranged to align with a new index. Let's illustrate this with examples:

Consider a Series `obj`:

```python
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
```

Calling `reindex` on this Series rearranges the data based on the new index, introducing missing values if necessary:

```python
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
```

For ordered data like time series, you might need to interpolate or fill values when reindexing. The `method` option allows this, with methods like `ffill` for forward-filling:

```python
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3.reindex(np.arange(6), method="ffill")
```

In DataFrames, reindexing can alter row index, column names, or both. When given a sequence, it reindexes the rows:

```python
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame2 = frame.reindex(index=["a", "b", "c", "d"])
```

Columns can be reindexed using the `columns` keyword:

```python
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)
```

To reindex a particular axis, you can pass the new labels as a positional argument and specify the axis using the `axis` keyword:

```python
frame.reindex(states, axis="columns")
```

**Table 5.3: reindex function arguments:**

This table summarizes the arguments to the `reindex` function, providing a detailed explanation for each:

- `labels`: New sequence to use as an index. Can be an Index instance or any other sequence-like Python data structure.
- `index`: Use the passed sequence as the new index labels.
- `columns`: Use the passed sequence as the new column labels.
- `axis`: Specifies the axis to reindex, whether "index" (rows) or "columns".
- `method`: Interpolation (fill) method; "ffill" fills forward, while "bfill" fills backward.
- `fill_value`: Substitute value to use when introducing missing data by reindexing.
- `limit`, `tolerance`, `level`, `copy`: Additional parameters for specifying behavior during reindexing.
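As a brief sketch of the `fill_value` argument, the following reindexes the Series from the earlier example and substitutes `0.0` for the label that has no data. The choice of `0.0` is arbitrary.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# "e" is not in the original index; fill_value replaces the NaN it would get
filled = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0.0)
print(filled)

# reindex returns a new object; the original is unchanged
print(len(obj))
```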

Lastly, while reindexing can be done using the `reindex` method, some users prefer using the `loc` operator, especially when all new index labels already exist in the DataFrame:

```python
frame.loc[["a", "d", "c"], ["California", "Texas"]]
```

Here `loc` selects and reorders existing rows and columns. Unlike `reindex`, which inserts missing data for new labels, `loc` works only if every requested label already exists in the DataFrame; otherwise it raises a `KeyError`.
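The contrast between the two approaches shows up when a requested label does not exist. The sketch below reuses the `frame` from this section and asks for a row labeled `"b"`, which is absent.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# reindex adds the unknown label "b" as a row of missing values
reindexed = frame.reindex(index=["a", "b", "c"])

# loc refuses to select a label that is not in the index
try:
    frame.loc[["a", "b", "c"]]
    loc_failed = False
except KeyError:
    loc_failed = True

print(reindexed)
print(loc_failed)
```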

In [42]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

In [43]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

In [44]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3
obj3.reindex(np.arange(6), method="ffill")

In [45]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

In [46]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

In [47]:
frame.reindex(states, axis="columns")

In [48]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]

#### Dropping Entries from an Axis

The process of dropping one or more entries from an axis in pandas is straightforward, whether you're dealing with a Series or a DataFrame. If you already possess an index array or list without the entries you want to drop, you can utilize either the `reindex` method or `.loc`-based indexing. However, to streamline this process and avoid the complexities of set logic and manipulation, pandas offers the `drop` method, which conveniently returns a new object with the specified value or values removed from an axis.

**Dropping Entries from a Series:**

Let's begin with dropping entries from a Series. Consider the following example:

```python
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
```

You can drop a single entry by providing its label to the `drop` method:

```python
new_obj = obj.drop("c")
```

Or drop multiple entries by passing a list of labels:

```python
obj.drop(["d", "c"])
```

**Dropping Entries from a DataFrame:**

In a DataFrame, you can drop index values from either axis. First, let's create an example DataFrame:

```python
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
```

To drop values from the row labels (axis 0), use the `drop` method with the `index` keyword:

```python
data.drop(index=["Colorado", "Ohio"])
```

To drop labels from the columns, utilize the `columns` keyword:

```python
data.drop(columns=["two"])
```

Alternatively, you can drop values from the columns by specifying `axis=1` or `axis="columns"`:

```python
data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")
```

In summary, the `drop` method provides a convenient way to remove specified values from either the row or column axis in pandas Series and DataFrames, making data manipulation tasks more efficient and intuitive.
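Two details are worth checking in practice: `drop` leaves the original object untouched, and asking it to drop a label that does not exist raises an error unless `errors="ignore"` is passed. The label `"z"` below is a made-up missing label.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])

# drop returns a new Series; obj keeps all five entries
trimmed = obj.drop(["d", "c"])

# Dropping an unknown label raises KeyError by default
try:
    obj.drop("z")
    missing_raised = False
except KeyError:
    missing_raised = True

# errors="ignore" skips labels that are not present
lenient = obj.drop("z", errors="ignore")
print(list(trimmed.index), len(obj), missing_raised, len(lenient))
```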

In [49]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj
new_obj = obj.drop("c")
new_obj
obj.drop(["d", "c"])

In [50]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

In [51]:
data.drop(index=["Colorado", "Ohio"])

In [52]:
data.drop(columns=["two"])

In [53]:
data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")

#### Indexing, Selection, and Filtering

In pandas, Series indexing (`obj[...]`) operates similarly to NumPy array indexing. However, it introduces the ability to utilize the Series's index values instead of just integers. Here are some examples illustrating this:

```python
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

obj["b"]        # Selecting by label
obj[1]          # By integer position; recent pandas warns here, prefer obj.iloc[1]
obj[2:4]        # Slicing by integer positions
obj[["b", "a", "d"]]   # Selecting multiple labels
obj[[1, 3]]     # Multiple integer positions; prefer obj.iloc[[1, 3]]
obj[obj < 2]    # Selecting by boolean indexing
```

While using label-based selection with `[]` is possible, the preferred method is using the `loc` operator:

```python
obj.loc[["b", "a", "d"]]
```

The distinction lies in the treatment of integers when indexing with `[]`. Regular `[]` indexing treats integers as labels if the index contains integers. To address this inconsistency, pandas offers the `iloc` operator for integer-based indexing:

```python
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj1.iloc[[0, 1, 2]]
```

When using `loc`, it indexes exclusively with labels, ensuring consistency irrespective of the index's data type.

**Caution:**
Using regular `[]`-based indexing with integers may give unexpected results, because integers are treated as labels whenever the index itself contains integers. Prefer `loc` and `iloc` to make the intent explicit.
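The ambiguity is easiest to see on a Series whose integer labels are deliberately out of order, as in the `obj1` example above:

```python
import pandas as pd

obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

by_label = obj1.loc[0]      # the value whose label is 0
by_position = obj1.iloc[0]  # the first value in the Series
plain = obj1[0]             # integer index, so [] treats 0 as a label

print(by_label, by_position, plain)
```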

For DataFrame indexing, you can retrieve one or more columns using a single value or sequence:

```python
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

data["two"]    # Retrieving a single column
data[["three", "one"]]   # Retrieving multiple columns
```

Indexing like this has a few special cases. The first is slicing or selecting data with a Boolean array, both of which select rows rather than columns:

```python
data[:2]    # Selecting rows with slicing
data[data["three"] > 5]    # Selecting rows based on a condition
```

Another use case is indexing with a Boolean DataFrame, such as one produced by a scalar comparison like `data < 5`. Assigning through it sets every location where the condition is true:

```python
data[data < 5] = 0    # Assigning values based on a condition
```
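The two kinds of Boolean mask can be sketched side by side. This example assumes the same `data` frame and works on a copy (the names `row_mask`, `cell_mask`, `filtered`, and `zeroed` are illustrative) so that the original values remain available for comparison:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

row_mask = data["three"] > 5   # Boolean Series: one True/False per row
filtered = data[row_mask]      # keeps only the rows where the mask is True

cell_mask = data < 5           # Boolean DataFrame: same shape as data
zeroed = data.copy()           # copy so that `data` itself is left intact
zeroed[cell_mask] = 0          # overwrite every cell where the mask is True
```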

In summary, pandas offers versatile methods for indexing, selection, and filtering, allowing for efficient and intuitive data manipulation in Series and DataFrames.

In [54]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj
obj["b"]
obj[1]
obj[2:4]
obj[["b", "a", "d"]]
obj[[1, 3]]
obj[obj < 2]

In [55]:
obj.loc[["b", "a", "d"]]

In [56]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
obj1
obj2
obj1[[0, 1, 2]]
obj2[[0, 1, 2]]

In [57]:
obj1.iloc[[0, 1, 2]]
obj2.iloc[[0, 1, 2]]

In [58]:
obj2.loc["b":"c"]

In [59]:
obj2.loc["b":"c"] = 5
obj2

In [60]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data
data["two"]
data[["three", "one"]]

In [61]:
data[:2]
data[data["three"] > 5]

In [62]:
data < 5

In [63]:
data[data < 5] = 0
data

#### Selection on DataFrame with loc and iloc

Like Series, DataFrame objects have the special attributes `loc` and `iloc` for label-based and integer-based indexing, respectively. Since a DataFrame is two-dimensional, you can select a subset of its rows and columns with NumPy-like notation, using either axis labels (`loc`) or integer positions (`iloc`).

**Using loc:**

To illustrate, let's start by selecting a single row by its label:

```python
data.loc["Colorado"]
```

This returns a Series object with the column labels of the DataFrame as its index. To select multiple rows and create a new DataFrame, you can pass a sequence of labels:

```python
data.loc[["Colorado", "New York"]]
```

You can combine row and column selection using `loc` by separating them with a comma:

```python
data.loc["Colorado", ["two", "three"]]
```

**Using iloc:**

For integer-based indexing, you can use `iloc`. For instance, to select the third row:

```python
data.iloc[2]
```

To select multiple rows or columns, provide a list of integers:

```python
data.iloc[[2, 1]]
data.iloc[2, [3, 0, 1]]
```
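As a rough illustration of how the two attributes relate, the sketch below selects the same values once by label and once by position. It assumes the original, unmodified `data` frame; `by_label` and `by_position` are hypothetical names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# "Utah" is row position 2; "four", "one", "two" are column positions 3, 0, 1
by_label = data.loc["Utah", ["four", "one", "two"]]
by_position = data.iloc[2, [3, 0, 1]]

# A list of row labels keeps the result two-dimensional
two_rows = data.loc[["Colorado", "New York"]]
```

Both selections return the same Series, indexed by the chosen column labels.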

**Slicing:**

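Both `loc` and `iloc` accept slices in addition to single labels or lists of labels. With `loc`, a label slice includes its endpoint. For example, to select the `two` column for all rows up to and including "Utah":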

```python
data.loc[:"Utah", "two"]
```
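The endpoint behavior is worth seeing next to its positional counterpart. In this sketch (same `data` frame; variable names are illustrative), the label slice includes "Utah", whereas the `iloc` slice follows the usual Python convention and excludes its stop position:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

label_slice = data.loc[:"Utah", "two"]   # Ohio, Colorado, Utah (endpoint included)
position_slice = data.iloc[:3, 1]        # positions 0, 1, 2 (stop excluded)
shorter = data.iloc[:2, 1]               # positions 0, 1 only, so no Utah
```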

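Boolean indexing differs between the two. `iloc` rejects a Boolean Series (it accepts only a plain Boolean array, such as one obtained with `to_numpy()`), but `loc` accepts one, which makes it the natural choice for filtering rows by a condition: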

```python
data.loc[data.three >= 2]
```
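The following sketch shows this difference under the assumption of a string-labeled frame like `data`. The threshold `6` and the names `mask`, `by_loc`, `by_iloc`, and `iloc_accepts_series` are chosen only for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

mask = data["three"] >= 6      # Boolean Series labeled by row
by_loc = data.loc[mask]        # loc aligns the mask on the row labels

try:
    data.iloc[mask]            # iloc refuses a labeled Boolean mask
    iloc_accepts_series = True
except (ValueError, NotImplementedError):
    iloc_accepts_series = False

# Dropping the labels gives a positional mask that iloc accepts
by_iloc = data.iloc[mask.to_numpy()]
```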

**Summary of DataFrame Indexing Options:**

There are many ways to select and rearrange data in a pandas DataFrame. Table 5.4 provides a concise summary of these options:

- `df[column]`: Selects a single column or sequence of columns from the DataFrame. Special-case conveniences: a Boolean array (filter rows), a slice (slice rows), or a Boolean DataFrame (set values based on some criterion).
- `df.loc[rows]`: Selects a single row or subset of rows from the DataFrame by label.
- `df.loc[:, cols]`: Selects a single column or subset of columns by label.
- `df.loc[rows, cols]`: Selects both rows and columns by label.
- `df.iloc[rows]`: Selects a single row or subset of rows from the DataFrame by integer position.
- `df.iloc[:, cols]`: Selects a single column or subset of columns by integer position.
- `df.iloc[rows, cols]`: Selects both rows and columns by integer position.
- `df.at[row, col]`: Selects a single scalar value by row and column label.
- `df.iat[row, col]`: Selects a single scalar value by row and column position (integers).
- `reindex` method: Selects either rows or columns by labels.

Understanding these indexing options allows for versatile and efficient data manipulation in pandas DataFrames.
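The last three entries in the list do not appear in the earlier examples, so here is a brief sketch of them on the same `data` frame. The label "Texas" is deliberately absent from the index to show how `reindex` treats a missing label; all variable names are hypothetical:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

scalar_by_label = data.at["Utah", "two"]   # single value by row/column label
scalar_by_position = data.iat[2, 1]        # the same value by integer position

# reindex selects and reorders by label on either axis
reordered = data.reindex(index=["Utah", "Ohio"], columns=["two", "one"])

# A label that does not exist yields a row of missing values, not an error
with_missing = data.reindex(index=["Ohio", "Texas"])
```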

In [64]:
data
data.loc["Colorado"]

In [65]:
data.loc[["Colorado", "New York"]]

In [66]:
data.loc["Colorado", ["two", "three"]]

In [67]:
data.iloc[2]
data.iloc[[2, 1]]
data.iloc[2, [3, 0, 1]]
data.iloc[[1, 2], [3, 0, 1]]

In [68]:
data.loc[:"Utah", "two"]
data.iloc[:, :3][data.three > 5]

In [69]:
data.loc[data.three >= 2]

#### Potential Issues with Integer Indexing

When working with pandas objects that are indexed by integers, it's important to be aware of potential pitfalls, especially for new users. Unlike built-in Python data structures such as lists and tuples, integer indexing in pandas operates differently and can lead to unexpected errors.

Let's consider an example with a pandas Series:

```python
ser = pd.Series(np.arange(3.))
```

Printing `ser` shows:

```
0    0.0
1    1.0
2    2.0
dtype: float64
```

Now, if we try to access an element using a negative integer index, like `-1`, which would typically retrieve the last element in a list or tuple, pandas raises an error:

```python
ser[-1]
```

The error message indicates a `KeyError`:

```
KeyError: -1
```

pandas could "fall back" on positional indexing in this case, but doing so in general is difficult without introducing subtle bugs into user code. The index of this Series contains `0`, `1`, and `2`, so `-1` is looked up as a label and is not found; pandas does not guess whether the user intends label-based or position-based indexing.

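However, with a non-integer index, such as strings, there's no ambiguity, because an integer key cannot be mistaken for a label. Note that pandas 2.1 and later emit a `FutureWarning` for this positional fallback, since treating integer keys as positions with `[]` is deprecated in favor of `iloc`: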

```python
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
ser2[-1]  # This returns 2.0 without error
```

When the index contains integers, it's best to use `loc` (for labels) or `iloc` (for integer positions) so that the selection is unambiguous:

```python
ser.iloc[-1]  # This retrieves the last element
```

Slicing with integers, by contrast, is always position-oriented:

```python
ser[:2]  # This returns the first two elements
```

To avoid potential issues and ensure clarity in your code, it's advisable to consistently use `loc` and `iloc` for indexing pandas objects.
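A short sketch makes the ambiguity visible. It uses a hypothetical Series whose integer labels are deliberately out of positional order, so that the label `0` and the position `0` refer to different elements:

```python
import pandas as pd

# Integer labels 2, 0, 1 stored at positions 0, 1, 2
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

value_at_label_0 = ser.loc[0]       # label 0 is the second element
value_at_position_0 = ser.iloc[0]   # position 0 is the first element
last_value = ser.iloc[-1]           # negative positions work with iloc

# -1 is not a label in this index, so a label lookup cannot find it
has_minus_one_label = -1 in ser.index
```

With `loc` and `iloc` the intent is explicit in the code, so the result does not depend on what kind of index the object happens to have.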

In [70]:
ser = pd.Series(np.arange(3.))
ser
ser[-1]

In [71]:
ser

In [72]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
ser2[-1]

In [73]:
ser.iloc[-1]

In [74]:
ser[:2]

#### Pitfalls of Chained Indexing

In the previous section, we explored how `loc` and `iloc` can be used for flexible selections on a DataFrame. While these indexing attributes are powerful, using them to modify DataFrame objects in place requires careful attention to avoid potential pitfalls.

For instance, consider these assignments to the example DataFrame:

```python
data.loc[:, "one"] = 1
data.iloc[2] = 5
data.loc[data["four"] > 5] = 3
```

Here, we are assigning values to columns or rows by label or integer position. These operations modify the DataFrame as expected.

However, a common mistake among new pandas users is to chain selections when performing assignments:

```python
data.loc[data.three == 5]["three"] = 6
```

Depending on the data contents, this may emit a `SettingWithCopyWarning`, which warns that you are trying to set a value on a copy of a slice from the DataFrame rather than on the original DataFrame itself. In other words, the assignment modifies the temporary object returned by `data.loc[data.three == 5]`, and `data` is left unchanged.

To resolve this issue and ensure that modifications are made to the original DataFrame, it's recommended to rewrite the assignment using a single `loc` operation:

```python
data.loc[data.three == 5, "three"] = 6
```

This approach ensures that the assignment is performed directly on the original DataFrame.

As a general rule, it's advisable to avoid chained indexing when performing assignments in pandas. Chained indexing can lead to unexpected behavior and potentially generate warnings like `SettingWithCopyWarning`. For more information and examples, you can refer to the relevant topic in the online pandas documentation.
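The difference between modifying a copy and modifying the original can be sketched without relying on the warning itself. The small frame below is made up for illustration, and the explicit `.copy()` stands in for the temporary object that a chained selection produces:

```python
import pandas as pd

# Hypothetical data with two rows where "three" equals 5
data = pd.DataFrame({"three": [5, 6, 5], "four": [7, 8, 9]},
                    index=["Ohio", "Colorado", "Utah"])
mask = data["three"] == 5

# Step 1 of a chained assignment: the row selection is a separate object
subset = data.loc[mask].copy()
subset["three"] = 6                       # changes only the copy
after_copy_edit = data["three"].tolist()  # the original is untouched

# A single loc call with both row and column selectors edits `data` itself
data.loc[mask, "three"] = 6
after_loc_edit = data["three"].tolist()
```

If you do want an independent subset to modify, making the copy explicit as above keeps the intent clear.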

In [75]:
data.loc[:, "one"] = 1
data
data.iloc[2] = 5
data
data.loc[data["four"] > 5] = 3
data

In [76]:
data.loc[data.three == 5]["three"] = 6

In [77]:
data

In [78]:
data.loc[data.three == 5, "three"] = 6
data

### Arithmetic and Data Alignment

In [5]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=["a", "c", "e", "f", "g"])
s1
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [80]:
s1 + s2

In [81]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
df1
df2

In [82]:
df1 + df2

In [83]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
df1
df2
df1 + df2

In [84]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))
df2.loc[1, "b"] = np.nan
df1
df2

In [85]:
df1 + df2

In [86]:
df1.add(df2, fill_value=0)

In [87]:
1 / df1
df1.rdiv(1)

In [88]:
df1.reindex(columns=df2.columns, fill_value=0)

In [89]:
arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]

In [90]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]
frame
series

In [91]:
frame - series

In [92]:
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
series2
frame + series2

In [93]:
series3 = frame["d"]
frame
series3
frame.sub(series3, axis="index")

In [94]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
frame
np.abs(frame)

In [95]:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)

In [96]:
frame.apply(f1, axis="columns")

In [97]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])
frame.apply(f2)

In [98]:
def my_format(x):
    return f"{x:.2f}"

frame.applymap(my_format)

In [99]:
frame["e"].map(my_format)

In [100]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj
obj.sort_index()

In [101]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])
frame
frame.sort_index()
frame.sort_index(axis="columns")

In [102]:
frame.sort_index(axis="columns", ascending=False)

In [103]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

In [104]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

In [105]:
obj.sort_values(na_position="first")

In [106]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame
frame.sort_values("b")

In [107]:
frame.sort_values(["a", "b"])

In [108]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

In [109]:
obj.rank(method="first")

In [110]:
obj.rank(ascending=False)

In [111]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
                      "c": [-2, 5, 8, -2.5]})
frame
frame.rank(axis="columns")

In [112]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

In [113]:
obj.index.is_unique

In [114]:
obj["a"]
obj["c"]

In [115]:
df = pd.DataFrame(np.random.standard_normal((5, 3)),
                  index=["a", "a", "b", "b", "c"])
df
df.loc["b"]
df.loc["c"]

In [116]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

In [117]:
df.sum()

In [118]:
df.sum(axis="columns")

In [119]:
df.sum(axis="index", skipna=False)
df.sum(axis="columns", skipna=False)

In [120]:
df.mean(axis="columns")

In [121]:
df.idxmax()

In [122]:
df.cumsum()

In [123]:
df.describe()

In [124]:
obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()

In [125]:
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")

In [126]:
returns = price.pct_change()
returns.tail()

In [127]:
returns["MSFT"].corr(returns["IBM"])
returns["MSFT"].cov(returns["IBM"])

In [128]:
returns.corr()
returns.cov()

In [129]:
returns.corrwith(returns["IBM"])

In [130]:
returns.corrwith(volume)

In [131]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

In [132]:
uniques = obj.unique()
uniques

In [133]:
obj.value_counts()

In [134]:
pd.value_counts(obj.to_numpy(), sort=False)

In [135]:
obj
mask = obj.isin(["b", "c"])
mask
obj[mask]

In [136]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)
indices

In [137]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
data

In [138]:
data["Qu1"].value_counts().sort_index()

In [139]:
result = data.apply(pd.value_counts).fillna(0)
result

In [140]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data
data.value_counts()

In [142]:
pd.options.display.max_rows = PREVIOUS_MAX_ROWS