## Pandas Review

### Introduction

Pandas is a Python library that plays a pivotal role in data science. It is used for both data wrangling and calculations, and it integrates well with machine learning libraries. Its wide range of applications has closed much of the gap that Python once had with R. Install it with the following command in the terminal.

```
python3 -m pip install pandas
```

Pandas uses _numpy_ under the hood. _numpy_ is a numerical library that enhances Python's computational capabilities. Pandas is designed so that we rarely need to invoke _numpy_ explicitly, but it will nevertheless appear every now and then.

We will mention _arrays_ sometimes, and by that we mean numpy's _ndarray_ object. You may think of it loosely as a homogeneous list of numbers.
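As a small illustration of what "homogeneous" means here, the sketch below builds two arrays directly with _numpy_. The variable names `mixed` and `ints` are only for this example.

```python
import numpy as np

# Mixing ints and floats still yields a single dtype for the whole array:
# the integers are upcast to floats.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Arithmetic is applied element by element, without an explicit loop.
ints = np.array([1, 10, 5, 2])
doubled = ints * 2
print(doubled)
```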

In [4]:
from pprint import pprint
import pandas as pd

### Series

_Series_ and _DataFrame_ are the two workhorses of pandas. _Series_ is a one-dimensional object containing a sequence of values and an associated array of data labels called _index_.

Let's define our first _series_ (the _pprint_ is not necessary).

In [6]:
obj = pd.Series([1, 10, 5, 2])
pprint(obj)

0     1
1    10
2     5
3     2
dtype: int64


We can access the _array_ and _index_ attributes easily.

In [7]:
obj.array
obj.index

RangeIndex(start=0, stop=4, step=1)
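Only the last expression of a cell is displayed, so the array itself is not shown. The following sketch prints both attributes and also converts them to plain numpy and Python objects; `values` and `labels` are names chosen for this example.

```python
import numpy as np
import pandas as pd

obj = pd.Series([1, 10, 5, 2])

print(obj.array)  # pandas' own array wrapper around the stored values
print(obj.index)  # RangeIndex(start=0, stop=4, step=1)

values = obj.to_numpy()   # a plain numpy ndarray
labels = list(obj.index)  # a plain Python list of the index labels
print(values, labels)
```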

Sometimes we want the _index_ to consist of labels instead of integers. The labels can then be used to access single values or sets of values.

In [8]:
obj_ = pd.Series([1, 10, 5, 2], index=['a', 'b', 'c', 'd'])
obj_
obj_['a']
obj_[['a', 'c']]

a    1
c    5
dtype: int64

Note that, to select several values at once, we need to pass the labels as a list. We can also filter a _Series_ and apply numerical operations to it.

In [9]:
obj_[obj_ > 3]
obj_ * 3
import numpy as np
np.log(obj_)

a    0.000000
b    2.302585
c    1.609438
d    0.693147
dtype: float64
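It may help to see that a filter such as `obj_[obj_ > 3]` works in two steps: the comparison produces a boolean _Series_, and that _Series_ then selects the matching entries. A minimal sketch, where `mask`, `filtered` and `between` are names introduced for the example:

```python
import pandas as pd

obj_ = pd.Series([1, 10, 5, 2], index=['a', 'b', 'c', 'd'])

# Step 1: the comparison returns a boolean Series with the same index.
mask = obj_ > 3
print(mask)

# Step 2: the boolean Series keeps only the entries where it is True.
filtered = obj_[mask]
print(filtered)

# Conditions are combined with & (and) and | (or); each one needs parentheses.
between = obj_[(obj_ > 1) & (obj_ < 10)]
print(between)
```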

You may think of a _Series_ as an ordered dictionary that maps index labels to values, so some dictionary syntax, such as the `in` operator, works on it too.

In [10]:
'a' in obj_
'other_index' in obj_

False
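The dictionary analogy goes a bit further. The sketch below shows a few dictionary-style methods that a _Series_ offers, and stresses that `in` looks at the labels, not at the values.

```python
import pandas as pd

obj_ = pd.Series([1, 10, 5, 2], index=['a', 'b', 'c', 'd'])

# `in` checks the index labels, like the keys of a dictionary.
print('a' in obj_)  # True
print(10 in obj_)   # False: 10 is a value, not a label

# get() returns a default instead of raising KeyError for a missing label.
missing = obj_.get('z', 0)

# keys() and items() mirror their dictionary counterparts.
labels = list(obj_.keys())
pairs = list(obj_.items())

# to_dict() converts the Series back into a regular dictionary.
as_dict = obj_.to_dict()
print(missing, labels, pairs, as_dict)
```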

Indeed, we can easily create a Series from a dictionary.

In [11]:
savings = {'Ann': 10, 'Bob': 20, 'Charlie': 15, 'Diane': 5}
obj_s = pd.Series(savings)

We can enforce a particular order for the index labels, but if we include one that does not have a value, a `NaN` will appear. `NaN` marks a missing value and can be detected with the pandas functions `pd.isna()` and `pd.notna()`, or with the equivalent _Series_ methods.

In [12]:
people = ('Ann', 'Bob', 'Charlie', 'Diane', 'Eddie')
obj_n = pd.Series(savings, index=people)
obj_n

Ann        10.0
Bob        20.0
Charlie    15.0
Diane       5.0
Eddie       NaN
dtype: float64

In [13]:
pd.isna(obj_n)
pd.notna(obj_n)
obj_n.isna()

Ann        False
Bob        False
Charlie    False
Diane      False
Eddie       True
dtype: bool
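Once missing values are detected, a common next step is to count, replace or drop them. The following is a minimal sketch using `fillna()` and `dropna()`; whether replacing with 0 makes sense depends on the data, so treat that choice as an assumption of the example.

```python
import pandas as pd

savings = {'Ann': 10, 'Bob': 20, 'Charlie': 15, 'Diane': 5}
people = ['Ann', 'Bob', 'Charlie', 'Diane', 'Eddie']
obj_n = pd.Series(savings, index=people)

# True counts as 1, so summing the boolean mask counts the missing values.
n_missing = obj_n.isna().sum()
print(n_missing)

# Replace missing values with a constant (0 is just an illustrative choice).
filled = obj_n.fillna(0)
print(filled)

# Or discard the entries that have no value.
dropped = obj_n.dropna()
print(dropped)
```

Both methods return a new _Series_ and leave `obj_n` unchanged.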

Indexes are useful because pandas uses them to align Series by label automatically when we operate on them.

In [14]:
lump = {'Ann': 2, 'Bob': 3, 'Charlie': 0, 'Eddie': 2}
obj_l = pd.Series(lump)
obj_n + obj_l

Ann        12.0
Bob        23.0
Charlie    15.0
Diane       NaN
Eddie       NaN
dtype: float64

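Note that `NaN` is "contagious": any operation that involves it yields `NaN`. This is why Eddie, who has no savings value, stays `NaN`. Diane is `NaN` as well because she has no entry in `obj_l`, so the alignment introduces a missing value for her.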
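If a missing entry should count as zero instead of propagating, one option is the `add()` method with its `fill_value` parameter. A short sketch, assuming that treating a missing amount as 0 is acceptable for this data:

```python
import pandas as pd

savings = {'Ann': 10, 'Bob': 20, 'Charlie': 15, 'Diane': 5}
people = ['Ann', 'Bob', 'Charlie', 'Diane', 'Eddie']
obj_n = pd.Series(savings, index=people)
obj_l = pd.Series({'Ann': 2, 'Bob': 3, 'Charlie': 0, 'Eddie': 2})

# fill_value replaces a value missing on one side before adding.
# Diane: 5 + 0 (absent from obj_l); Eddie: 0 + 2 (NaN in obj_n).
total = obj_n.add(obj_l, fill_value=0)
print(total)
```

If a label were missing on both sides, the result for it would still be `NaN`.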

Both the Series object and its index have a name that can be modified. The index can also be altered by assignment.

In [15]:
obj_n.name = 'savings'
obj_n.index.name = 'people'
obj_n

people
Ann        10.0
Bob        20.0
Charlie    15.0
Diane       5.0
Eddie       NaN
Name: savings, dtype: float64
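Altering the index by assignment can be sketched as follows. The new labels are arbitrary, and the only requirement shown here is that their number matches the length of the _Series_.

```python
import pandas as pd

obj = pd.Series([1, 10, 5, 2])

# Replace the default integer index with labels of the same length.
obj.index = ['w', 'x', 'y', 'z']
print(obj)

# A list of the wrong length is rejected with a ValueError.
try:
    obj.index = ['only', 'two']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```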

### Dataframes

A DataFrame is a rectangular table of data that contains an ordered, named collection of columns. One common way to build a dataframe is with a dictionary of lists or arrays.

In [22]:
companies = {'name': ['Tesla', 'Berkshire', 'Nvidia', 'Tencent'],
             'country': ['US', 'US', 'US', 'China'],
             'market_cap': [350, 400, 600, 550],
}
df = pd.DataFrame(companies)
print(df)

        name country  market_cap
0      Tesla      US         350
1  Berkshire      US         400
2     Nvidia      US         600
3    Tencent   China         550
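After building a dataframe it is often useful to check its dimensions, column names and column types. A minimal sketch with the same `companies` dictionary:

```python
import pandas as pd

companies = {'name': ['Tesla', 'Berkshire', 'Nvidia', 'Tencent'],
             'country': ['US', 'US', 'US', 'China'],
             'market_cap': [350, 400, 600, 550]}
df = pd.DataFrame(companies)

print(df.shape)          # (rows, columns) -> (4, 3)
print(list(df.columns))  # column names, in dictionary order
print(df.dtypes)         # one dtype per column
print(df.head(2))        # first two rows
```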


The order of the columns can be specified with the `columns` argument.

In [17]:
df = pd.DataFrame(companies, columns=['country', 'name', 'market_cap'])
print(df)

  country       name  market_cap
0      US      Tesla         350
1      US  Berkshire         400
2      US     Nvidia         600
3   China    Tencent         550


Columns requested in `columns` that are not present in the data will be filled with `NaN`s.

In [18]:
df = pd.DataFrame(companies, columns=['country', 'name', 'market_cap', 'revenues'])
print(df)

  country       name  market_cap revenues
0      US      Tesla         350      NaN
1      US  Berkshire         400      NaN
2      US     Nvidia         600      NaN
3   China    Tencent         550      NaN
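An empty column like this one can be filled later by assignment, and new columns can be created the same way. In the sketch below the `0.0` placeholder and the `large_cap` column (with its threshold of 500) are assumptions made for illustration, not real figures.

```python
import pandas as pd

companies = {'name': ['Tesla', 'Berkshire', 'Nvidia', 'Tencent'],
             'country': ['US', 'US', 'US', 'China'],
             'market_cap': [350, 400, 600, 550]}
df = pd.DataFrame(companies,
                  columns=['country', 'name', 'market_cap', 'revenues'])

# The requested but absent column starts out entirely missing.
all_missing = df['revenues'].isna().all()

# Assigning a scalar fills every row (0.0 is only a placeholder).
df['revenues'] = 0.0

# Assigning to a new key creates a column, here derived from another one.
df['large_cap'] = df['market_cap'] > 500
print(df)
```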


Columns can be accessed either as keys or as attributes.

In [19]:
df.name
df['name']

0        Tesla
1    Berkshire
2       Nvidia
3      Tencent
Name: name, dtype: object
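Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below uses two hypothetical column names, `count` and `market cap`, to show where key access is the safer choice.

```python
import pandas as pd

# 'count' and 'market cap' are made-up column names for this illustration.
df_demo = pd.DataFrame({'name': ['Tesla', 'Nvidia'],
                        'count': [1, 2],
                        'market cap': [350, 600]})

# df_demo.count is the DataFrame.count method, not the column.
is_method = callable(df_demo.count)
print(is_method)  # True

# Key access always returns the column.
counts = df_demo['count'].tolist()
caps = df_demo['market cap'].tolist()  # df_demo.market cap is a syntax error
print(counts, caps)
```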

Rows can be accessed with the `loc` attribute and their index.

In [25]:
df.loc[1]

name          Berkshire
country              US
market_cap          400
Name: 1, dtype: object
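`loc` selects by label and also accepts a column selection, while `iloc` selects by integer position. A brief sketch of both, including a hypothetical `by_name` dataframe that uses the company name as the index:

```python
import pandas as pd

companies = {'name': ['Tesla', 'Berkshire', 'Nvidia', 'Tencent'],
             'country': ['US', 'US', 'US', 'China'],
             'market_cap': [350, 400, 600, 550]}
df = pd.DataFrame(companies)

# A single cell: row label 1, column 'name'.
cell = df.loc[1, 'name']

# Several rows and columns at once, passed as lists of labels.
subset = df.loc[[0, 2], ['name', 'market_cap']]

# iloc works by position, so -1 is the last row.
last = df.iloc[-1]

# With a labelled index, loc takes those labels instead of integers.
by_name = df.set_index('name')
nvidia_cap = by_name.loc['Nvidia', 'market_cap']
print(cell, nvidia_cap)
print(subset)
```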