# Introduction to Pandas

Having explored NumPy, it is time to get to know the other workhorse of data science in Python: pandas. The pandas library makes importing, cleaning, and organizing data so much easier that it is hard to imagine doing data science in Python without it.

But it was not always this way. Wes McKinney developed the library out of necessity in 2008 while at AQR Capital Management in order to have a better tool for dealing with data analysis. The library has since taken off as an open-source software project that has become a mature and integral part of the data science ecosystem. 

The name 'pandas' actually has nothing to do with Chinese bears but rather comes from the term *panel data*, a form of multi-dimensional data involving measurements over time that comes out of the econometrics and statistics community. Ironically, the dedicated `Panel` data structure was deprecated and then removed from pandas (as of version 0.25), so we will not examine it in this course. Instead, we will focus on the two most widely used data structures in pandas: `Series` and `DataFrame`.

In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
pd?

## Fundamental pandas data structures

Both `Series` and `DataFrame` are a lot like the `ndarray` you encountered in the last section. They provide clean, efficient data storage and handling at the scale necessary for data science.

What both of them provide that `ndarray`s lack, however, are essential data-science features like flexibility when dealing with missing data and the ability to label data. 

These capabilities (along with others) help make `Series` and `DataFrame` essential to the "data munging" that makes up so much of data science.
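As a small illustration of those two features, the sketch below builds a labeled `Series` with one missing value. The readings and day labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings; the middle value is missing.
readings = pd.Series([21.5, np.nan, 19.0], index=['mon', 'tue', 'wed'])

# Labels let us ask for a value by name rather than by position.
monday = readings['mon']

# The pandas mean skips NaN by default; a plain NumPy mean propagates it.
pandas_mean = readings.mean()
numpy_mean = np.mean(readings.to_numpy())

print(monday)
print(pandas_mean)
print(numpy_mean)
```

The point is only that pandas reductions treat `NaN` as "missing" unless told otherwise, while the raw array does not.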

### `Series` objects in pandas

A pandas `Series` is a lot like an `ndarray` in NumPy: a one-dimensional array of indexed data.

In [None]:
series_example = pd.Series([-0.5, 0.75, 1.0, -2])
series_example

Again, we see that `Series` has upcast the values to all be of the same type.

The difference is that `Series` has also wrapped the values into a sequence with indices:

In [None]:
series_example.values

In [None]:
series_example.index

Again, it's easy to access the values via indexing:

In [None]:
series_example[1]

In [None]:
series_example[1:3]

### Explicit Indices

Despite a lot of similarities, pandas `Series` have an important distinction from NumPy `ndarray`s: whereas `ndarray`s have *implicitly defined* integer indices (as do Python lists), pandas `Series` have *explicitly defined* indices. The best part is that you can set the index:

In [None]:
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2

**Share**

In [None]:
series_example2['b']

### Series vs Dictionary

Thus, a `Series` is basically a fixed-length, ordered dictionary in that it maps arbitrarily typed index values to arbitrarily typed data values. But like an `ndarray`, the data in a `Series` are all of the same type, which is important. Just as the type-specific compiled code behind `ndarray` makes it more efficient than a Python list for certain operations, the type information of a pandas ``Series`` makes it much more efficient than a Python dictionary for certain operations.

But the connection between `Series` and dictionaries is nevertheless very real: you can construct a ``Series`` object directly from a Python dictionary:

**Think, Pair, Share** 

In [None]:
population_dict = {'France': 65429495,
                   'Germany': 82408706,
                   'Russia': 143910127,
                   'Japan': 126922333}
population_dict

In [None]:
population = pd.Series(population_dict)
population

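Notice the order of the key/value pairs. In current versions of pandas (0.23 and later, running on Python 3.6 or later), the `Series` keeps the insertion order of the dictionary. Older versions sorted the keys when building the `Series`, so you may see a different order in an older environment.

Either way, a `Series` has a well-defined order, which is what makes operations like slicing possible.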
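Whatever order the labels start in, you can always request a specific order explicitly. The following is a minimal sketch using `sort_index` and `sort_values` on the same population data; both return a new `Series` and leave the original untouched.

```python
import pandas as pd

population_dict = {'France': 65429495,
                   'Germany': 82408706,
                   'Russia': 143910127,
                   'Japan': 126922333}
population = pd.Series(population_dict)

# Order by label (alphabetical for string labels).
by_label = population.sort_index()

# Order by the data instead, largest first.
by_size = population.sort_values(ascending=False)

print(by_label)
print(by_size)
```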

### Interacting with Series

In [None]:
population['Russia']

### Exercise

In [None]:
# Try slicing on the population Series on your own.
# Would slicing be possible if Series keys were not ordered?
population['Germany':'Russia']

In [None]:
# Try running population['Albania'] = 2937590 (or another country of your choice)
# What order do the keys appear in when you run population? Is it what you expected?

population['Albania'] = 2937590
population

Another useful `Series` feature (and definitely a difference from dictionaries) is that `Series` automatically aligns differently indexed data in arithmetic operations:

In [None]:
pop2 = pd.Series({'Spain': 46432074, 'France': 102321, 'Albania': 50532})
population + pop2
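To make the alignment easier to inspect, here is a self-contained sketch with a smaller, hand-picked subset of the same figures. Labels present in both `Series` are added; labels present in only one come back as `NaN`.

```python
import pandas as pd

population = pd.Series({'France': 65429495, 'Germany': 82408706, 'Albania': 2937590})
pop2 = pd.Series({'Spain': 46432074, 'France': 102321, 'Albania': 50532})

# Values are matched by label, not by position.
total = population + pop2

# Labels that appear in only one of the inputs have no partner to add to.
unmatched = total[total.isna()].index

print(total)
print(list(unmatched))
```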

### `DataFrame` object in pandas

The other crucial data structure in pandas to get to know for data science is the `DataFrame`.
Like the ``Series`` object, a ``DataFrame`` can be thought of either as a generalization of an `ndarray` or as a specialization of a Python dictionary.

Just as a ``Series`` is like a one-dimensional array with flexible indices, a ``DataFrame`` is like a two-dimensional array with both flexible row indices and flexible column names. Essentially, a `DataFrame` represents a rectangular table of data and contains an ordered collection of labeled columns, each of which can be a different value type (`string`, `int`, `float`, etc.).
The `DataFrame` has both a row and column index; in this way you can think of it as a dictionary of `Series`, all of which share the same index.

Let's take a look at how this works in practice. We will start by creating a `Series` called `area`:

In [None]:
area_dict = {'Albania': 28748,
             'France': 643801,
             'Germany': 357386,
             'Japan': 377972,
             'Russia': 17125200}
area = pd.Series(area_dict)
area

Now, let's combine that with the `population` `Series` we created earlier to create a `DataFrame` with two columns:

In [None]:
countries = pd.DataFrame({'Population': population, 'Area': area})
countries

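Notice that pandas aligned the two `Series` on their index labels rather than on position, even though they listed the countries in different orders. Depending on your pandas version, the columns appear either in the order given in the dictionary (recent versions) or sorted alphabetically, with Area before Population (older versions).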

We can also add columns by defining a list and adding it:

In [None]:
countries['Capital'] = ['Tirana', 'Paris', 'Berlin', 'Tokyo', 'Moscow']
countries

We can also change the order of the columns:

In [None]:
countries = countries[['Capital', 'Area', 'Population']]
countries

And add new columns of data:

In [None]:
countries['Population Density'] = countries['Population'] / countries['Area']
countries

Note: don't worry if pandas gives you a warning over this (typically a `SettingWithCopyWarning`). The warning is pandas trying to be a little too helpful. The new column you created is an actual part of the `DataFrame` and not a copy of a slice.

In [None]:
countries['Area']

All of these operations are very similar to what we can do to a `Series` because `DataFrame` columns behave like `Series`.

### Exercise

In [None]:
# Now try accessing row data with a command like countries['Japan']
countries['Japan']

What happens?

Remember that `DataFrame`s are dictionaries of `Series` (which are the columns). 

`DataFrame` rows often have heterogeneous data types, so different methods are necessary to access row data. 

For that, we use the `.loc` indexer:

**Think, Pair, Share**

In [None]:
countries.loc['Japan']

**Share**

In [None]:
countries.loc['Japan']['Area']
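The chained form above performs two lookups in a row. `.loc` also accepts a row label and a column label in a single call, which reads more directly and is the dependable form when assigning to a cell. A minimal sketch, using a two-row stand-in for `countries` (the replacement area figure is hypothetical):

```python
import pandas as pd

countries = pd.DataFrame(
    {'Area': [377972, 17125200], 'Population': [126922333, 143910127]},
    index=['Japan', 'Russia'],
)

# Two steps: select the row as a Series, then pick one entry from it.
chained = countries.loc['Japan']['Area']

# One step: row label and column label together.
direct = countries.loc['Japan', 'Area']

# Assigning through the single-call form updates the DataFrame itself.
countries.loc['Japan', 'Area'] = 377975  # hypothetical corrected figure

print(chained, direct)
print(countries)
```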

### DataFrame Operations

Sometimes we would like to add a column to a `DataFrame` without assigning values just yet.

In [None]:
countries['Debt-to-GDP Ratio'] = np.nan
countries

It's also possible to add columns to a `DataFrame` that have a different number of rows than the `DataFrame`. pandas aligns the new values on the index and fills any missing rows with `NaN`:

In [None]:
debt = pd.Series([0.19, 2.36], index=['Russia', 'Japan'])
countries['Debt-to-GDP Ratio'] = debt
countries

You can use the `del` command to delete a column from a `DataFrame`:

In [None]:
del countries['Capital']
countries

In addition to their dictionary-like behavior, `DataFrames` also behave like two-dimensional arrays. For example, it can be useful at times when working with a `DataFrame` to transpose it:

In [None]:
countries.T

Notice we've upcast to floating-point numbers.

**If there had been strings in this `DataFrame`, everything would have been upcast to the generic `object` dtype.**

Use caution when transposing `DataFrame`s!
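To see what mixed column types do under a transpose, the sketch below uses a tiny stand-in frame with one text column and one integer column. After transposing, each new column has to hold both kinds of value, so pandas falls back to the generic `object` dtype.

```python
import pandas as pd

mixed = pd.DataFrame(
    {'Capital': ['Tokyo', 'Moscow'], 'Area': [377972, 17125200]},
    index=['Japan', 'Russia'],
)

# Before: each column has its own dtype.
print(mixed.dtypes)

# After: every column mixes text and numbers, so all of them become object.
transposed = mixed.T
print(transposed.dtypes)
```

The values themselves survive, but numeric operations on `object` columns are slower and less predictable, which is one reason to transpose with care.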

We can also easily create a `DataFrame` from a 2D NumPy array with any specified column and index names.
If omitted, an integer index will be used for each:

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

## Manipulating data in pandas

A huge part of data science is manipulating data in order to analyze it. (One rule of thumb is that 80% of any data science project will be concerned with cleaning and organizing the data for the project.) So it makes sense to learn the tools that pandas provides for handling data in `Series` and especially `DataFrame`s. 

Because both of those data structures are ordered, let's first start by taking a closer look at what gives them their structure: the `Index`.

### Index objects in pandas

The indices of `DataFrame`s and `Series` are actually objects themselves. The ``Index`` object can be thought of as an immutable array. 

In [None]:
series_example = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
ind = series_example.index
ind

`Index` works a lot like an array:

**Share**

In [None]:
ind[1]

**Share**

In [None]:
ind[::2]

**Think, Pair, Share**

In [None]:
ind[1] = 0

Did you know what would happen?

`Index` immutability is a good thing: it makes it safer to share indices between multiple `Series` or `DataFrame`s without the potential for problems arising from inadvertent index modification.
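A short sketch of what immutability means in practice: in-place assignment raises a `TypeError`, while methods such as `insert` and `drop` return a new `Index` and leave the original alone.

```python
import pandas as pd

ind = pd.Index(['a', 'b', 'c', 'd'])

# In-place assignment is rejected.
try:
    ind[1] = 'z'
    mutated = True
except TypeError as err:
    mutated = False
    print('TypeError:', err)

# "Modifying" methods build a new Index instead.
renamed = ind.insert(1, 'z').drop('b')

print(list(ind))
print(list(renamed))
```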

### Set Properties

In addition to being array-like, an `Index` also behaves like a fixed-size set, including following many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way. Let's play around with this to see it in action.

In [None]:
ind_odd = pd.Index([1, 3, 5, 7, 9])
ind_prime = pd.Index([2, 3, 5, 7, 11])

**Think, Pair, Share**  
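In the code cell below, try out the intersection (`ind_odd.intersection(ind_prime)`), union (`ind_odd.union(ind_prime)`), and symmetric difference (`ind_odd.symmetric_difference(ind_prime)`) of `ind_odd` and `ind_prime`. (Older versions of pandas also accepted the operators `&`, `|`, and `^` for these set operations, but that usage was deprecated and no longer performs set operations in pandas 2.0 and later.)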

`Index` objects also provide methods for these and related operations:

| **Method**     | **Description**                                                                           |
|:---------------|:------------------------------------------------------------------------------------------|
| [`append`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.append.html)       | Concatenate with additional `Index` objects, producing a new `Index`                      |
| [`difference`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.difference.html)   | Compute set difference as an `Index`                                                      |
| [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.drop.html)         | Compute new `Index` by deleting passed values                                             |
| [`insert`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.insert.html)       | Compute new `Index` by inserting element at index `i`                                     |
| [`is_monotonic_increasing`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_increasing.html) | Returns `True` if each element is greater than or equal to the previous element           |
| [`is_unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_unique.html)    | Returns `True` if the `Index` has no duplicate values                                     |
| [`isin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.isin.html)         | Compute boolean array indicating whether each value is contained in the passed collection |
| [`unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.unique.html)       | Compute the array of unique values in order of appearance                                 |
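Here is a short sketch of the set-style methods on `Index`, reusing the odd and prime indices from above. It uses the method spellings (`intersection`, `union`, `symmetric_difference`, `difference`) rather than operators, since the methods behave the same way across pandas versions.

```python
import pandas as pd

ind_odd = pd.Index([1, 3, 5, 7, 9])
ind_prime = pd.Index([2, 3, 5, 7, 11])

common = ind_odd.intersection(ind_prime)            # in both
either = ind_odd.union(ind_prime)                   # in at least one
only_one = ind_odd.symmetric_difference(ind_prime)  # in exactly one
odd_not_prime = ind_odd.difference(ind_prime)       # in the first only

# Element-wise membership test, returned as a boolean array.
membership = ind_odd.isin([3, 5])

print(list(common), list(either), list(only_one), list(odd_not_prime))
print(membership)
print(ind_odd.is_unique, ind_odd.is_monotonic_increasing)
```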

### Data Selection in Series

As a refresher, a ``Series`` object acts in many ways like both a one-dimensional `ndarray` and a standard Python dictionary.

Like a dictionary, the ``Series`` object provides a mapping from a collection of arbitrary keys to a collection of arbitrary values.

In [None]:
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2

**Share**

In [None]:
series_example2['b']

**Share**

In [None]:
'a' in series_example2

**Share**

In [None]:
series_example2.keys()

**Share**

In [None]:
list(series_example2.items())

**Share**

In [None]:
series_example2['e'] = 1.25
series_example2

#### Series as one-dimensional array

Because a ``Series`` also provides array-style functionality, you can use the NumPy techniques we looked at in Section 3, like slices and masking:

In [None]:
# Slicing using the explicit index
series_example2['a':'c']

In [None]:
# Slicing using the implicit integer index
series_example2[0:2]

In [None]:
# Masking
series_example2[(series_example2 > -1) & (series_example2 < 0.8)]

One confusing thing:

When slicing with an explicit index (e.g., ``series_example2['a':'c']``), the final index is **included** in the slice; when slicing with an implicit index (e.g., ``series_example2[0:2]``), the final index is **excluded** from the slice.
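The difference is easy to confirm by looking at which labels each slice returns. This sketch rebuilds the four-element `Series` so that it stands alone:

```python
import pandas as pd

s = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])

# Label slice: the end label 'c' is part of the result.
by_label = s['a':'c']

# Positional slice: position 2 is not part of the result.
by_position = s[0:2]

print(list(by_label.index))
print(list(by_position.index))
```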

### Indexers: `loc` and `iloc`

**Think, Pair, Share**

There are many ways to refer to your indices in pandas!

If you've created integer *explicit* indices, it's really easy to get mixed up with the *implicit* integer indices that pandas automatically creates for you.

To help avoid this confusion, pandas provides special *indexer* attributes that explicitly expose certain indexing schemes:

The ``loc`` attribute allows indexing and slicing that *always* references the **explicit** index:

In [None]:
series_example2

In [None]:
series_example2.loc['a']

In [None]:
series_example2.loc['a':'c']

The ``iloc`` attribute enables indexing and slicing using the **implicit**, Python-style index:

In [None]:
series_example2

**Share**

In [None]:
series_example2.iloc[0]

In [None]:
series_example2.iloc[0:2]

A guiding principle of the Python language is the idea that "explicit is better than implicit." Professional code will generally use explicit indexing with ``loc`` and ``iloc`` and you should as well in order to make your code clean and readable.
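The case where this matters most is an explicit integer index that does not start at zero. In the sketch below (the labels `1, 3, 5` are chosen purely for illustration), the same number means different things to `loc` and `iloc`:

```python
import pandas as pd

s = pd.Series(['x', 'y', 'z'], index=[1, 3, 5])

label_lookup = s.loc[1]        # the entry labeled 1
position_lookup = s.iloc[1]    # the entry at position 1 (the second one)

label_slice = s.loc[1:3]       # labels 1 through 3, end label included
position_slice = s.iloc[1:3]   # positions 1 and 2, end position excluded

print(label_lookup, position_lookup)
print(label_slice.tolist(), position_slice.tolist())
```

Writing `loc` or `iloc` makes it clear to a reader which of the two meanings was intended.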

### Data Selection in DataFrames

``DataFrame``s also exhibit dual behavior, acting both like a two-dimensional `ndarray` and like a dictionary of ``Series``  sharing the same index.

In [None]:
area = pd.Series({'Albania': 28748,
                  'France': 643801,
                  'Germany': 357386,
                  'Japan': 377972,
                  'Russia': 17125200})
population = pd.Series({'Albania': 2937590,
                        'France': 65429495,
                        'Germany': 82408706,
                        'Russia': 143910127,
                        'Japan': 126922333})
countries = pd.DataFrame({'Area': area, 'Population': population})
countries

We already know that you can access the individual ``Series`` that make up the columns of a ``DataFrame`` via dictionary-style indexing of the column name:

**Share**

In [None]:
countries['Area']

In [None]:
countries['Population Density'] = countries['Population'] / countries['Area']
countries

### DataFrame as two-dimensional array

But you can also think of ``DataFrame``s as two-dimensional arrays. You can examine the raw data in the `DataFrame` data array using the ``values`` attribute:

In [None]:
countries.values

Viewed this way it makes sense that we can transpose the rows and columns of a `DataFrame` the same way we would an array:

In [None]:
countries.T

`DataFrame` also uses the ``loc`` and ``iloc`` indexers. With ``iloc``, you can index the underlying array as if it were an `ndarray` but with the ``DataFrame`` index and column labels maintained in the result:

In [None]:
countries.iloc[:3, :2]

In [None]:
countries.loc[:'Germany', :'Population']

### Exercise

In [None]:
# Can you think of how to use masking for a DataFrame?
# Your masking could be something like countries['Population Density'] > 200




# Operating on Data in Pandas

pandas is widely used because it can perform efficient element-wise operations on data. pandas builds on ufuncs from NumPy to supply these capabilities and then extends them to provide additional power for data manipulation.

All of the NumPy ufuncs that we learned about in the last section also work on pandas `Series` and `DataFrame` objects.
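As a quick check of that claim, the sketch below applies `np.sqrt` to a small `Series` and a small `DataFrame` (both made up for the example). The result is still a pandas object, and the labels carry through unchanged.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])
df = pd.DataFrame({'x': [0.0, 1.0], 'y': [4.0, 16.0]}, index=['r1', 'r2'])

# The ufunc is applied element-wise; index and column labels are preserved.
roots = np.sqrt(s)
df_roots = np.sqrt(df)

print(roots)
print(df_roots)
```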

**Think, Pair, Share** For each of these Sections.

## Index alignment with Series

For our first example, suppose we are combining two different data sources and find only the top five countries by *area* and the top five countries by *population*:

In [None]:
area = pd.Series({'Russia': 17075400, 'Canada':  9984670,
                  'USA': 9826675, 'China': 9598094, 
                  'Brazil': 8514877}, name='area')
population = pd.Series({'China': 1409517397, 'India': 1339180127,
                        'USA': 324459463, 'Indonesia': 322179605, 
                        'Brazil': 207652865}, name='population')

In [None]:
# Now divide these to compute the population density
#pop_density = # fill me in
pop_density = population / area
pop_density

Your resulting array contains the **union** of indices of the two input arrays: seven countries in total. All of the countries in the array without an entry (because they lacked either area data or population data) are marked with the now familiar ``NaN``, or "Not a Number," designation.

Index matching works the same way for any of Python's built-in arithmetic expressions, and any missing values are filled in with `NaN`. You can see this clearly by adding two `Series` that are slightly misaligned in their indices:

In [None]:
series1 = pd.Series([2, 4, 6], index=[0, 1, 2])
series2 = pd.Series([3, 5, 7], index=[1, 2, 3])
series1 + series2

`NaN` values are not always convenient to work with; `NaN` combined with any other value results in `NaN`, which can be a pain, particularly if you are combining multiple data sources with missing values. To help with this, pandas allows you to specify a default value to use for missing values in the operation. For example, calling `series1.add(series2)` is equivalent to calling `series1 + series2`, but you can supply the fill value:

In [None]:
series1.add(series2, fill_value=0)

Much better!

## Index alignment with DataFrames

The same kind of alignment takes place in both dimensions (columns and indices) when you perform operations on ``DataFrame``s.

In [None]:
rng = np.random.RandomState(42)
df1 = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                   columns=list('AB'))
df1

In [None]:
df2 = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                   columns=list('BAC'))
df2

**Think, Pair, Share**

In [None]:
# Add df1 and df2. Is the output what you expected?
df1 + df2

Even though we passed the columns in a different order in `df2` than in `df1`, the indices were aligned correctly and sorted in the resulting union of columns.

This table lists Python operators and their equivalent pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |


## Operations between DataFrames and Series

Index and column alignment is maintained in operations between a `DataFrame` and a `Series` as well. To see this, consider a common operation in data science, wherein we find the difference of a `DataFrame` and one of its rows. Following NumPy's broadcasting rules, pandas will compute the difference row-wise by default:

In [None]:
df3 = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('WXYZ'))
df3

In [None]:
df3 - df3.iloc[0]

But what if you need to operate column-wise? You can do this by using object methods and specifying the ``axis`` keyword.

In [None]:
df3.subtract(df3['X'], axis=0)

And when you do operations between `DataFrame`s and `Series`, you still get automatic index alignment:

In [None]:
halfrow = df3.iloc[0, ::2]
halfrow

Note that `halfrow` is a `Series` whose index holds the column labels `W` and `Y`, which is why it displays vertically. When we subtract it from the `DataFrame`, those labels are aligned with the matching columns:

In [None]:
df3 - halfrow
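Because `df3` is filled with random numbers, the pattern can be hard to spot. The following sketch repeats the same subtraction on a small frame with fixed values (chosen only for illustration), so the alignment is easy to read off: columns with a matching label are subtracted, and columns without one become `NaN`.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [5, 6, 7, 8]], columns=list('WXYZ'))

# Every other entry of the first row: a Series indexed by 'W' and 'Y'.
halfrow = df.iloc[0, ::2]

# The Series labels are matched against the DataFrame's column labels.
diff = df - halfrow

print(halfrow)
print(diff)
```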