# Pandas - Part 1
> "Python Data Science Handbook" - *Jake Vanderplas (2016)*

---

# Introducing Pandas Objects

At the very basic level, Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
rather than simple integer indices. As we will see during the course of this chapter,
Pandas provides a host of useful tools, methods, and functionality on top of the basic
data structures, but nearly everything that follows will require an understanding of
what these structures are. Thus, before we go any further, let’s introduce these three
fundamental Pandas data structures: the Series, DataFrame, and Index.
We will start our code sessions with the standard NumPy and Pandas imports:

```python
In[1]: import numpy as np
 import pandas as pd
```

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:
```python
In[2]: data = pd.Series([0.25, 0.5, 0.75, 1.0])
 data
Out[2]: 0 0.25
 1 0.50
 2 0.75
 3 1.00
 dtype: float64
```
As we see in the preceding output, the Series wraps both a sequence of values and a
sequence of indices, which we can access with the values and index attributes. The
values are simply a familiar NumPy array:
```python
In[3]: data.values
Out[3]: array([ 0.25, 0.5 , 0.75, 1. ])
```
The index is an array-like object of type pd.Index, which we’ll discuss in more detail
momentarily:
```python
In[4]: data.index
Out[4]: RangeIndex(start=0, stop=4, step=1)
```
As with a NumPy array, data can be accessed by the associated index via the familiar
Python square-bracket notation:
```python
In[5]: data[1]
Out[5]: 0.5
In[6]: data[1:3]
Out[6]: 1 0.50
 2 0.75
 dtype: float64
 ```
As we will see, though, the Pandas Series is much more general and flexible than the
one-dimensional NumPy array that it emulates.

### Series as specialized dictionary
Because a Series associates each value with an index label, you can think of it as a bit like a specialization of a Python
dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary
values, and a Series is a structure that maps typed keys to a set of typed values. This
typing is important: just as the type-specific compiled code behind a NumPy array
makes it more efficient than a Python list for certain operations, the type information
of a Pandas Series makes it much more efficient than Python dictionaries for certain
operations.
We can make the Series-as-dictionary analogy even more clear by constructing a
Series object directly from a Python dictionary:
```python
In[11]: population_dict = {'California': 38332521,
                            'Texas': 26448193,
                            'New York': 19651127,
                            'Florida': 19552860,
                            'Illinois': 12882135}
 population = pd.Series(population_dict)
 population
Out[11]:    California  38332521
            Florida     19552860
            Illinois    12882135
            New York    19651127
            Texas       26448193
            dtype: int64
 ```
In the Pandas version that produced this output, the index is drawn from the sorted keys; recent Pandas releases keep the insertion order of the dictionary instead.
From here, typical dictionary-style item access can be performed:
```python
In[12]: population['California']
Out[12]: 38332521
```

Unlike a dictionary, though, the Series also supports array-style operations such as
slicing:

```python
In[13]:     population['California':'Illinois']
Out[13]:    California 38332521
            Florida 19552860
            Illinois 12882135
            dtype: int64
 ```
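The slice above relies on the index being in sorted order, as in the printed output. Recent Pandas releases keep a dictionary's insertion order, so the following is a minimal sketch, assuming such a release, of one way to reproduce the same slice by sorting the index first. The name `subset` is only illustrative.

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# sort_index() orders the labels alphabetically, whatever the dict order was
population = pd.Series(population_dict).sort_index()

# label-based slices include both endpoints
subset = population['California':'Illinois']
print(subset)
```

Note that, unlike a positional slice, the label-based slice includes the `'Illinois'` endpoint.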



### Try it yourself: Create a Pandas Series object of milk tea ratings (the rating is up to you)

In [None]:
import pandas as pd

#your code here


---

# The Pandas DataFrame Object
The next fundamental structure in Pandas is the DataFrame. Like the Series object
discussed in the previous section, the DataFrame can be thought of either as a
generalization of a NumPy array, or as a specialization of a Python dictionary. We’ll now
take a look at each of these perspectives.

### DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame
is an analog of a two-dimensional array with both flexible row indices and flexible
column names. Just as you might think of a two-dimensional array as an ordered
sequence of aligned one-dimensional columns, you can think of a DataFrame as a
sequence of aligned Series objects. Here, by “aligned” we mean that they share the
same index.
To demonstrate this, let’s first construct a new Series listing the area of each of the
five states discussed in the previous section:
```python
In[18]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
Out[18]: California 423967
 Florida 170312
 Illinois 149995
 New York 141297
 Texas 695662
 dtype: int64
 ```
 Now that we have this along with the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:
```python
In[19]: states = pd.DataFrame({'population': population,
 'area': area})
 states
Out[19]: area population
 California     423967 38332521
 Florida        170312 19552860
 Illinois       149995 12882135
 New York       141297 19651127
 Texas          695662 26448193
```
Like the Series object, the DataFrame has an index attribute that gives access to the
index labels:
```python
In[20]: states.index
Out[20]:
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
```

Additionally, the DataFrame has a columns attribute, which is an Index object holding
the column labels:

```python
In[21]: states.columns
Out[21]: Index(['area', 'population'], dtype='object')
```
Thus the DataFrame can be thought of as a generalization of a two-dimensional
NumPy array, where both the rows and columns have a generalized index for accessing the data.
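To illustrate what "aligned" means in practice, here is a small sketch of what happens when the Series passed to the constructor do not share exactly the same index. The two shortened Series are made up for this illustration (using figures quoted above); the point is only that Pandas matches rows by label and fills the gaps with NaN.

```python
import pandas as pd

# two Series whose indexes only partly overlap (illustrative subsets)
area = pd.Series({'California': 423967, 'Texas': 695662, 'Florida': 170312})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})

# rows are matched by label; labels missing from one Series get NaN
states = pd.DataFrame({'population': population, 'area': area})
print(states)
```

Because NaN is a floating-point value, any column that receives one is stored as floats rather than integers.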



### Try it yourself: Create another Pandas Series with the same index as your milk tea ratings and create a Pandas DataFrame from the two.

In [1]:
### your code here

---

# Data Indexing and Selection
In Chapter 2, we looked in detail at methods and tools to access, set, and modify
values in NumPy arrays. These included indexing (e.g., `arr[2, 1]`), slicing (e.g., `arr[:, 1:5]`), masking (e.g., `arr[arr > 0]`), fancy indexing (e.g., `arr[0, [1, 5]]`), and
combinations thereof (e.g., `arr[:, [1, 5]]`). Here we’ll look at similar means of
accessing and modifying values in Pandas Series and DataFrame objects. If you have
used the NumPy patterns, the corresponding patterns in Pandas will feel very
familiar, though there are a few quirks to be aware of.
We’ll start with the simple case of the one-dimensional Series object, and then move
on to the more complicated two-dimensional DataFrame object.

## Data Selection in Series
As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we
keep these two overlapping analogies in mind, it will help us to understand the
patterns of data indexing and selection in these arrays.

### Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a
collection of values:
```python
In[1]:  import pandas as pd
        data = pd.Series([0.25, 0.5, 0.75, 1.0],
        index=['a', 'b', 'c', 'd'])
        data
Out[1]: a 0.25
        b 0.50
        c 0.75
        d 1.00
        dtype: float64
In[2]:  data['b']
Out[2]: 0.5
```
We can also use dictionary-like Python expressions and methods to examine the
keys/indices and values:
```python
In[3]   : 'a' in data
Out[3]  : True

In[4]   : data.keys()
Out[4]  : Index(['a', 'b', 'c', 'd'], dtype='object')

In[5]   : list(data.items())
Out[5]  : [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
```
Series objects can even be modified with a dictionary-like syntax. Just as you can
extend a dictionary by assigning to a new key, you can extend a Series by assigning
to a new index value:
```python
In[6]:  data['e'] = 1.25
        data
Out[6]: a 0.25
        b 0.50
        c 0.75
        d 1.00
        e 1.25
        dtype: float64
 ```
This easy mutability of the objects is a convenient feature: under the hood, Pandas is
making decisions about memory layout and data copying that might need to take
place; the user generally does not need to worry about these issues.

### Indexers: loc, iloc, and ix
The slicing and indexing conventions of a Series can be a source of confusion. For example, if
your Series has an explicit integer index, an indexing operation such as `data[1]` will
use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit
Python-style index.
```python
In[11]: data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
        data
Out[11]:        1 a
                3 b
                5 c
                dtype: object

In[12]: # explicit index when indexing
        data[1]
Out[12]: 'a'

In[13]: # implicit index when slicing
        data[1:3]
Out[13]:        3 b
                5 c
                dtype: object
 ```
Because of this potential confusion in the case of integer indexes, Pandas provides
some special indexer attributes that explicitly expose certain indexing schemes. These
are not functional methods, but attributes that expose a particular slicing interface to
the data in the Series.
First, the loc attribute allows indexing and slicing that always references the explicit
index:
```python
In[14]:         data.loc[1]
Out[14]:        'a'

In[15]:         data.loc[1:3]
Out[15]:        1 a
                3 b
                dtype: object
```
The iloc attribute allows indexing and slicing that always references the implicit
Python-style index:
```python
In[16]:         data.iloc[1]
Out[16]:        'b'

In[17]:         data.iloc[1:3]
Out[17]:        3 b
                5 c
                dtype: object
 ```
A third indexing attribute, `ix`, was a hybrid of the two, and for Series objects was
equivalent to standard []-based indexing. It was deprecated in Pandas 0.20 and removed in
Pandas 1.0, so it is described here only because it appears in older code; its role is easiest
to see in the context of DataFrame objects, which we will discuss in a moment.
One guiding principle of Python code is that “explicit is better than implicit.” The
explicit nature of loc and iloc makes them very useful in maintaining clean and
readable code; especially in the case of integer indexes, I recommend using these both to
make code easier to read and understand, and to prevent subtle bugs due to the
mixed indexing/slicing convention.
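
One detail worth spelling out is that the two indexers also treat the end of a slice differently: a `loc` slice includes its final label, while an `iloc` slice excludes its final position. A minimal sketch, reusing the integer-indexed Series from above (the names `by_label` and `by_position` are only illustrative):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc slices by label and includes the end label: labels 1 and 3
by_label = data.loc[1:3]

# iloc slices by position and excludes the end position: positions 1 and 2
by_position = data.iloc[1:3]

print(by_label)
print(by_position)
```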


### Try it yourself: Using the milk tea rating Series, select any data using `loc` and `iloc`

In [2]:
### your code here

---


## Data Selection in DataFrame
Recall that a DataFrame acts in many ways like a two-dimensional or structured array,
and in other ways like a dictionary of Series structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within
this structure.
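
The examples below operate on a DataFrame named `data` whose construction is not shown in this excerpt. Judging from the printed outputs, it holds the five states with `area`, `pop`, and `density` columns. A minimal sketch of how it could be built, assuming density is simply population divided by area:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Florida': 170312,
                  'Illinois': 149995, 'New York': 141297,
                  'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Florida': 19552860,
                 'Illinois': 12882135, 'New York': 19651127,
                 'Texas': 26448193})

# combine the two aligned Series into one DataFrame
data = pd.DataFrame({'area': area, 'pop': pop})

# derived column (assumption: density = population / area)
data['density'] = data['pop'] / data['area']
print(data)
```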

### DataFrame as two-dimensional array
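As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. In the examples that follow, `data` is a DataFrame of the five states with `area`, `pop`, and `density` columns, where density is population divided by area. We can examine the raw underlying data array using the values
attribute: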
```python
In[24]: data.values
Out[24]: array( [[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
                [ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
                [ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
                [ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
                [ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])
```
With this picture in mind, we can do many familiar array-like operations on the
DataFrame itself. For example, we can transpose the full DataFrame to swap rows and
columns:
```python
In[25]: data.T
Out[25]:
            California Florida Illinois New York Texas
    area    4.239670e+05 1.703120e+05 1.499950e+05 1.412970e+05 6.956620e+05
    pop     3.833252e+07 1.955286e+07 1.288214e+07 1.965113e+07 2.644819e+07
    density 9.041393e+01 1.148061e+02 8.588376e+01 1.390767e+02 3.801874e+01
```
When it comes to indexing of DataFrame objects, however, it is clear that the
dictionary-style indexing of columns precludes our ability to simply treat it as a
NumPy array. In particular, passing a single index to an array accesses a row:
```python
In[26]  : data.values[0]
Out[26] : array([ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
```
and passing a single “index” to a DataFrame accesses a column:
```python
In[27]: data['area']
Out[27]:    California 423967
            Florida 170312
            Illinois 149995
            New York 141297
            Texas 695662
            Name: area, dtype: int64
```
Thus for array-style indexing, we need another convention. Here Pandas again uses
the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can
index the underlying array as if it is a simple NumPy array (using the implicit
Python-style index), but the DataFrame index and column labels are maintained in
the result. Similarly, using the loc indexer we can index the data in an array-like
style using the explicit index and column names:
```python
In[28]:     data.iloc[:3, :2]
Out[28]:                area pop
            California  423967 38332521
            Florida     170312 19552860
            Illinois    149995 12882135


In[29]  : data.loc[:'Illinois', :'pop']
Out[29] :           area pop
        California  423967 38332521
        Florida     170312 19552860
        Illinois    149995 12882135
```
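In Pandas versions before 1.0, the ix indexer allowed a hybrid of these two approaches (it has since been removed, so the following raises an AttributeError on current releases):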
```python
In[30]  : data.ix[:3, :'pop']
Out[30] :               area pop
        California      423967 38332521
        Florida         170312 19552860
        Illinois        149995 12882135
 ```
Keep in mind that for integer indices, the ix indexer is subject to the same potential
sources of confusion as discussed for integer-indexed Series objects.
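
On Pandas releases without `ix`, the same mixed selection (rows by position, columns by label) can be expressed with `loc` and `iloc` alone. The following is a sketch of two possible ways, not an official replacement recipe; the DataFrame is rebuilt here so the block is self-contained, and `subset`, `subset_alt`, and `stop` are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 170312, 149995, 141297, 695662],
                     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
                    index=['California', 'Florida', 'Illinois',
                           'New York', 'Texas'])
data['density'] = data['pop'] / data['area']

# Option 1: chain the indexers, positions for rows, then labels for columns
subset = data.iloc[:3].loc[:, :'pop']

# Option 2: translate the column label into a position and use iloc only
stop = data.columns.get_loc('pop') + 1   # +1 because iloc excludes the end
subset_alt = data.iloc[:3, :stop]

print(subset)
```

Chaining is fine for reading data; for assignment, a single `loc` or `iloc` call is generally the safer choice because a chained selection may operate on a copy.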
Any of the familiar NumPy-style data access patterns can be used within these
indexers. For example, in the loc indexer we can combine masking and fancy indexing as
in the following:
```python
In[31]  : data.loc[data.density > 100, ['pop', 'density']]
Out[31] :               pop density
        Florida        19552860 114.806121
        New York       19651127 139.076746
```
Any of these indexing conventions may also be used to set or modify values; this is
done in the standard way that you might be accustomed to from working with
NumPy:
```python
In[32]: data.iloc[0, 2] = 90
        data
Out[32]:        area pop density
 California     423967 38332521 90.000000
 Florida        170312 19552860 114.806121
 Illinois       149995 12882135 85.883763
 New York       141297 19651127 139.076746
 Texas          695662 26448193 38.018740
 ```
To build up your fluency in Pandas data manipulation, I suggest spending some time
with a simple DataFrame and exploring the types of indexing, slicing, masking, and
fancy indexing that are allowed by these various indexing approaches.
### Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the
preceding discussion, but nevertheless can be very useful in practice. First, while
indexing refers to columns, slicing refers to rows:
```python
In[33]: data['Florida':'Illinois']
Out[33]: area pop density
 Florida 170312 19552860 114.806121
 Illinois 149995 12882135 85.883763
```
Such slices can also refer to rows by number rather than by index:
```python
In[34]: data[1:3]
Out[34]:        area pop density
 Florida        170312 19552860 114.806121
 Illinois       149995 12882135 85.883763
```
Similarly, direct masking operations are also interpreted row-wise rather than
column-wise:
```python
In[35]  : data[data.density > 100]
Out[35] : area pop density
 Florida        170312 19552860 114.806121
 New York       141297 19651127 139.076746
 ```
These two conventions are syntactically similar to those on a NumPy array, and while
these may not precisely fit the mold of the Pandas conventions, they are nevertheless
quite useful in practice.

### Try it yourself: Using your DataFrame, use `loc` to find the milk teas with ratings above 3

In [3]:
# your code here

---

# Handling Missing Data
The difference between data found in many tutorials and data in the real world is that
real-world data is rarely clean and homogeneous. In particular, many interesting
datasets will have some amount of data missing. To make matters even more
complicated, different data sources may indicate missing data in different ways.
In this section, we will discuss some general considerations for missing data, discuss
how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for
handling missing data in Python. Here and throughout the book, we’ll refer to
missing data in general as null, NaN, or NA values.

### NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the two of them
nearly interchangeably, converting between them where appropriate:
```python
In[10]: pd.Series([1, np.nan, 2, None])
Out[10]: 0 1.0
 1 NaN
 2 2.0
 3 NaN
 dtype: float64
```
For types that don’t have an available sentinel value, Pandas automatically type-casts
when NA values are present. For example, if we set a value in an integer array to
np.nan, it will automatically be upcast to a floating-point type to accommodate the
NA:
```python
In[11]: x = pd.Series(range(2), dtype=int)
 x
Out[11]: 0 0
 1 1
 dtype: int64

In[12]: x[0] = None
 x
Out[12]: 0 NaN
 1 1.0
 dtype: float64
 ```
Notice that in addition to casting the integer array to floating point, Pandas
automatically converts the None to a NaN value. (When this was written, a native integer NA
was only a proposal; Pandas 0.24 and later provide nullable integer dtypes such as `Int64`,
which must be requested explicitly.)
While this type of magic may feel a bit hackish compared to the more unified
approach to NA values in domain-specific languages like R, the Pandas sentinel/casting
approach works quite well in practice and in my experience only rarely causes
issues.
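
For comparison, here is a minimal sketch of the same assignment on a Series that opts in to the nullable `Int64` dtype (note the capital "I"), assuming a Pandas release that provides it. The missing entry is stored as `pd.NA` and the remaining values stay integers instead of being upcast to floats.

```python
import pandas as pd

# request the nullable integer dtype explicitly
x = pd.Series(range(2), dtype='Int64')

# assigning None marks the entry as missing without changing the dtype
x.iloc[0] = None
print(x)
```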

## Operating on Null Values
As we have seen, Pandas treats None and NaN as essentially interchangeable for
indicating missing or null values. To facilitate this convention, there are several useful
methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

* `isnull()`: Generate a Boolean mask indicating missing values
* `notnull()`: Opposite of `isnull()`
* `dropna()`: Return a filtered version of the data
* `fillna()`: Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration of these
routines.

### Detecting null values
Pandas data structures have two useful methods for detecting null data: `isnull()` and
`notnull()`. Either one will return a Boolean mask over the data. For example:

```python
In[13]: data = pd.Series([1, np.nan, 'hello', None])
In[14]: data.isnull()
Out[14]:    0 False
            1 True
            2 False
            3 True
            dtype: bool
```
As mentioned in “Data Indexing and Selection”, Boolean masks can be
used directly as a Series or DataFrame index:

```python
In[15]: data[data.notnull()]
Out[15]:    0 1
            2 hello
            dtype: object
```
The `isnull()` and `notnull()` methods produce similar Boolean results for DataFrames.
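
As a small sketch of the DataFrame case, the mask returned by `isnull()` has the same shape as the frame, and summing it counts the missing entries in each column. The frame below is the same small example used in the next section; `mask` and `nulls_per_column` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# element-wise Boolean mask with the same index and columns as df
mask = df.isnull()

# True counts as 1, so sum() gives the number of nulls per column
nulls_per_column = mask.sum()

print(mask)
print(nulls_per_column)
```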
### Dropping null values
In addition to the masking used before, there are the convenience methods, `dropna()`
(which removes NA values) and `fillna()` (which fills in NA values). For a Series,
the result is straightforward:

```python
In[16]: data.dropna()
Out[16]:    0 1
            2 hello
            dtype: object
```
For a DataFrame, there are more options. Consider the following DataFrame:
```python
In[17]: df = pd.DataFrame([ [1, np.nan, 2],
                            [2, 3, 5],
                            [np.nan, 4, 6]])
        df
Out[17]:    0 1 2
            0 1.0 NaN 2
            1 2.0 3.0 5
            2 NaN 4.0 6
```
We cannot drop single values from a DataFrame; we can only drop full rows or full
columns. Depending on the application, you might want one or the other, so
`dropna()` gives a number of options for a DataFrame.
By default, `dropna()` will drop all rows in which any null value is present:

```python
In[18]  : df.dropna()
Out[18] :     0 1 2
            1 2.0 3.0 5
```
Alternatively, you can drop NA values along a different axis; `axis=1` (equivalently, `axis='columns'`) drops all
columns containing a null value:
```python
In[19]: df.dropna(axis='columns')
Out[19]:      2
            0 2
            1 5
            2 6
```
But this drops some good data as well; you might rather be interested in dropping
rows or columns with all NA values, or a majority of NA values. This can be specified
through the how or thresh parameters, which allow fine control of the number of
nulls to allow through.
The default is `how='any'`, such that any row or column (depending on the axis
keyword) containing a null value will be dropped. You can also specify `how='all'`, which
will only drop rows/columns that are all null values:
```python
In[20]: df[3] = np.nan
        df
Out[20]:  0 1 2 3
        0 1.0 NaN 2 NaN
        1 2.0 3.0 5 NaN
        2 NaN 4.0 6 NaN


In[21]:     df.dropna(axis='columns', how='all')
Out[21]:      0 1 2
            0 1.0 NaN 2
            1 2.0 3.0 5
            2 NaN 4.0 6
 ```
For finer-grained control, the thresh parameter lets you specify a minimum number
of non-null values for the row/column to be kept:
```python
In[22]:     df.dropna(axis='rows', thresh=3)
Out[22]:      0 1 2 3
            1 2.0 3.0 5 NaN
 ```
Here the first and last rows have been dropped, because they contain only two non-null values.

### Try it yourself: 
1. Copy the original dataframe and drop the `rows` with nan values
2. Copy the original dataframe and drop the `columns` with nan values 

In [7]:
import numpy as np
import pandas as pd
df = pd.DataFrame([ [1, np.nan, 2]  ,
                    [2, 3, 5]       ,
                    [np.nan, 4, 6]] )
df

#your code here

|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 0 | 1.0 | NaN | 2 |
| 1 | 2.0 | 3.0 | 5 |
| 2 | NaN | 4.0 | 6 |


### Filling null values

Sometimes rather than dropping NA values, you’d rather replace them with a valid
value. This value might be a single number like zero, or it might be some sort of
imputation or interpolation from the good values. You could do this in-place using
the `isnull()` method as a mask, but because it is such a common operation Pandas
provides the `fillna()` method, which returns a copy of the array with the null values
replaced.
Consider the following Series:
```python
In[23]: data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
 data
Out[23]: a 1.0
 b NaN
 c 2.0
 d NaN
 e 3.0
 dtype: float64
```
We can fill NA entries with a single value, such as zero:
```python
In[24]:     data.fillna(0)
Out[24]:    a 1.0
            b 0.0
            c 2.0
            d 0.0
            e 3.0
            dtype: float64
```
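
The mask-based alternative mentioned at the start of this section can be sketched as follows. This is only an illustration of the equivalence; the copy is made so that the original Series keeps its missing values, and `masked` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# mask-based approach: assign through the Boolean mask, modifying in place
masked = data.copy()
masked[masked.isnull()] = 0

# fillna() returns a new Series and leaves `data` untouched
filled = data.fillna(0)

print(masked.equals(filled))
```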

We can specify a forward-fill to propagate the previous value forward:

```python
In[25]: # forward-fill
        data.fillna(method='ffill')
Out[25]:    a 1.0
            b 1.0
            c 2.0
            d 2.0
            e 3.0
            dtype: float64
```
Or we can specify a back-fill to propagate the next values backward:

```python
In[26]: # back-fill
 data.fillna(method='bfill')
Out[26]:    a 1.0
            b 2.0
            c 2.0
            d 3.0
            e 3.0
            dtype: float64
```
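
Newer Pandas releases flag the `method` argument of `fillna()` as deprecated in favor of the dedicated `ffill()` and `bfill()` methods. A minimal sketch of the same two fills written that way, assuming a release where these methods are available (`forward` and `backward` are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# propagate the previous valid value forward
forward = data.ffill()

# propagate the next valid value backward
backward = data.bfill()

print(forward)
print(backward)
```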
For DataFrames, the options are similar, but we can also specify an axis along which
the fills take place:
```python
In[27]: df
Out[27]:      0 1 2 3
            0 1.0 NaN 2 NaN
            1 2.0 3.0 5 NaN
            2 NaN 4.0 6 NaN

In[28]: df.fillna(method='ffill', axis=1)
Out[28]:      0 1 2 3
            0 1.0 1.0 2.0 2.0
            1 2.0 3.0 5.0 5.0
            2 NaN 4.0 6.0 6.0
 ```
Notice that if a previous value is not available during a forward fill, the NA value
remains.
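
To make the role of the axis concrete, here is a sketch comparing a forward fill down the rows with one across the columns, using the three-column frame from the exercises and the `ffill()` method (an assumption about the Pandas release in use, as an alternative to `fillna(method='ffill')`). The names `down` and `across` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# axis=0: each null takes the value from the row above, within its column
down = df.ffill(axis=0)

# axis=1: each null takes the value from the column to its left, within its row
across = df.ffill(axis=1)

print(down)
print(across)
```

In both cases a null with no earlier value along the chosen axis is left as it is.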

### Try it yourself: Using `fillna()`
1. Copy the original dataframe and fill the nan values with 0.
2. Copy the original dataframe and use forward fill.
3. Copy the original dataframe and use backward fill.

In [8]:
import numpy as np
import pandas as pd
df = pd.DataFrame([ [1, np.nan, 2]  ,
                    [2, 3, 5]       ,
                    [np.nan, 4, 6]] )
df

#your code here


|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 0 | 1.0 | NaN | 2 |
| 1 | 2.0 | 3.0 | 5 |
| 2 | NaN | 4.0 | 6 |
