# NumPy and Pandas

- NumPy documentation can be found [here](https://numpy.org/)
- Pandas documentation can be found [here](https://pandas.pydata.org/pandas-docs/stable/index.html)

# Outline

- [NumPy](#NumPy)
    - [Basic Overview](#Basic-Overview)
    - [Create NumPy array from scratch](#Create-NumPy-array-from-scratch)
    - [Operations with NumPy](#Operations-with-NumPy)
- [Pandas](#Pandas)
    - [Introducing Series](#Introducing-Series)
    - [Introducing DataFrame](#Introducing-Data-Frame)
    
- [Reading Real Data](#Reading-Real-Data)

## NumPy

**What is NumPy?**

Data can come in a wide range of formats, but the first step in making it analyzable is to transform it into arrays of numbers.

NumPy is the fundamental package for working with numerical arrays. It provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python's `list` type, but they are more efficient in terms of storage and operations.

As we saw previously, a `list` in Python can contain objects of different types (int, float, string, etc.). As you can imagine, that flexibility increases storage usage, since each item carries its own type information.
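
As a small illustration of the point above, the following sketch contrasts a mixed-type list with a single-type array. The variable names (`mixed`, `arr`, `ints`) are illustrative and not part of the original notebook.

```python
import numpy as np

# A list can mix types; each item is a separate Python object.
mixed = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed])

# An array stores one type for all elements; mixed numbers are upcast.
arr = np.array([1, 2.5, 3])
print(arr.dtype)      # float64

# Fixed-size elements make the storage cost easy to compute.
ints = np.arange(1000, dtype=np.int64)
print(ints.itemsize)  # bytes per element: 8
print(ints.nbytes)    # total bytes of data: 8000
```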


### Basic Overview

In [None]:
# Setup
import numpy as np
import pandas as pd

In [None]:
# Built-in documentation:
#np.<TAB>
np.__version__

Python has a built-in `array` module. While Python's `array` object provides efficient storage of array-based data, NumPy adds efficient operations on that data.

See the `array` module (`import array`) for more about the built-in type.
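
To make the difference concrete, here is a minimal sketch comparing the two types. The names `py_arr` and `np_arr` are only illustrative.

```python
from array import array
import numpy as np

# 'i' is the type code for a signed C int.
py_arr = array('i', [1, 2, 3])
np_arr = np.array([1, 2, 3])

# On array.array, * repeats the sequence (like a list).
repeated = py_arr * 2
print(repeated)  # array('i', [1, 2, 3, 1, 2, 3])

# On a NumPy array, * multiplies element by element.
doubled = np_arr * 2
print(doubled)   # [2 4 6]
```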

In [None]:
# integer array:
# remember: NumPy is constrained to arrays that all contain the same type.

# np.array() creates a numpy array: 
a = np.array([1, 4, 2, 5, 3])
print(a)

In [None]:
type(a)

In [None]:
# dtype returns the data type of the array's elements
a.dtype

In [None]:
b = np.array([1.3, 4.4, 2.1, 5.2, 3.7])
print(b)

In [None]:
b.dtype

In [None]:
# Note that np.array() takes the data as a single argument (e.g. a list).
# For example, the following code will fail:

#np.array(1,2,4)

In [None]:
np.array([1,2,3])

In [None]:
# array transforms sequences of sequences into two-dimensional arrays, 
# sequences of sequences of sequences into three-dimensional arrays, and so on..

np.array([(1,2,3),(5,6,2)])

In [None]:
# we can specify the dtype in advance (float32 is NumPy's 32-bit floating-point type):
np.array([1, 2, 3, 4], dtype='float32')


In [None]:
# Just an example of the differences between float16, float32 and float64:
a = np.array([0.123456789121212,2,3], dtype=np.float16)
print("16bit: ", a[0])

a = np.array([0.123456789121212,2,3], dtype=np.float32)
print("32bit: ", a[0])

b = np.array([0.123456789121212121212,2,3], dtype=np.float64)
print("64bit: ", b[0])
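
The precision differences printed above can also be queried directly. The sketch below uses `np.finfo`, which reports the properties of a floating-point type; the `digits` dictionary is only an illustrative helper.

```python
import numpy as np

digits = {}
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # bits: storage size; precision: approximate number of reliable decimal digits
    digits[dtype.__name__] = (info.bits, info.precision)
    print(dtype.__name__, info.bits, info.precision)
```

This suggests why the same literal prints with roughly 3, 6 and 15 trustworthy decimal digits, respectively.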

###  Create NumPy array from scratch


- `np.zeros((rows, columns))`: creates an array with all values set to 0
- `np.ones((rows, columns))`: creates an array with all values set to 1
- `np.arange(start, stop, step)`: creates a vector of values from `start` up to, but not including, `stop`, with step width `step`.
- `np.empty((rows, columns))`: creates an array without initializing its entries; the initial content is arbitrary and depends on the state of the memory.

In [None]:
np.zeros((3, 4))

In [None]:
np.ones((2, 3, 4), dtype=int)   # We can add a third dimension: 2 blocks of 3 rows x 4 columns

In [None]:
c = np.arange(1,12,2)
print(c)

In [None]:
# we can reshape it (2d array)
c = np.arange(12).reshape(4,3)
print(c)

In [None]:
# We can reshape it in a 3d array:
c = np.arange(12).reshape(2,3,2)
print(c)

In [None]:
d = np.empty((3,4))
print(d)

###  Operations with NumPy

#### Element-wise operations and functions
Arithmetic operators on arrays apply *element-wise*. A new array is created and filled with the result.

For example, we have a list and a NumPy array holding the same values:

In [None]:
vec1= [1,2,3,4,5,6]
vec2 = np.array(vec1)

In [None]:
# we can sum element by element
vec1 + vec2 

Vectorized operations like this save us from writing explicit loops. We can get the same values with a loop over the list, for example:

In [None]:
for i in range(len(vec1)):
    vec1[i]+=vec1[i]
vec1
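
One detail worth spelling out: `+` means different things for lists and arrays. The sketch below uses an illustrative list called `values`.

```python
import numpy as np

values = [1, 2, 3]

# list + list concatenates the two lists
joined = values + values
print(joined)   # [1, 2, 3, 1, 2, 3]

# array + array adds element by element
summed = np.array(values) + np.array(values)
print(summed)   # [2 4 6]

# list + array: the list is converted to an array, so the sum is element-wise too
mixed = values + np.array(values)
print(mixed)    # [2 4 6]
```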

We can also represent a *matrix* with a NumPy array:

In [None]:
a = np.array([(1,2,3),(5,6,2)]) 
print(a)

`ndim` returns the number of dimensions of the array

In [None]:
a.ndim

`shape` returns the shape of the array as a tuple

In [None]:
a.shape

`np.arange(start, stop, step)`: creates a vector of values from `start` up to, but not including, `stop`, with step width `step`.

In [None]:
np.arange(0,12,3)

In [None]:
a**2

In [None]:
# Note that these operations aren't changing the original object. 
a

In [None]:
b = np.array([(5,1,2),(10,1,3)]) 
print(b)

In [None]:
a-b

In [None]:
a*b  # element wise product

In [None]:
c = b.reshape(3,2)

In [None]:
a @ c  # matrix product
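
Note that `reshape` is not the same as a transpose. The following sketch repeats the arrays `a` and `b` from the cells above so that it can run on its own, and shows the shape rule for the matrix product.

```python
import numpy as np

a = np.array([(1, 2, 3), (5, 6, 2)])     # shape (2, 3)
b = np.array([(5, 1, 2), (10, 1, 3)])    # shape (2, 3)

# reshape keeps the six values in row-by-row order and regroups them
c = b.reshape(3, 2)
print(c.tolist())     # [[5, 1], [2, 10], [1, 3]]

# the transpose swaps rows and columns instead
print(b.T.tolist())   # [[5, 10], [1, 1], [2, 3]]

# (2, 3) @ (3, 2) -> (2, 2): the inner dimensions must match
product = a @ c
print(product.shape)  # (2, 2)
```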

In [None]:
a.sum()

In [None]:
a.std()

In [None]:
a.mean()

In [None]:
a.argmin()

In [None]:
a.argmax()
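
The reductions above work on the whole array. As a hedged sketch (the `axis` argument and `np.unravel_index` are not used in the original cells), they can also be applied per row or per column, and the flat position returned by `argmax` can be converted back to a row and column:

```python
import numpy as np

a = np.array([(1, 2, 3), (5, 6, 2)])

col_sums = a.sum(axis=0)   # collapse the rows: one value per column
row_sums = a.sum(axis=1)   # collapse the columns: one value per row
print(col_sums)            # [6 8 5]
print(row_sums)            # [ 6 13]

# argmax returns a position in the flattened array
flat_pos = a.argmax()
row, col = np.unravel_index(flat_pos, a.shape)
print(flat_pos, row, col)  # 4 1 1
```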

In [None]:
# Transpose
a.T

In [None]:
np.sqrt(a)

In [None]:
np.exp(a)

### Indexing with an integer: 
`array[index]`: select the value at position `index` from the data.

In [None]:
a = np.arange(3,10)
a

In [None]:
a[0]

###  Slicing
`array[start:stop:step]`: selects a subset of the data.

In [None]:
ar = np.arange(10)
ar

Slicing with `[:]` means to take every element from the first to the last.

In [None]:
ar[:]

In [None]:
ar[3:7]

In [None]:
ar[1:]

In [None]:
ar[:5]

In [None]:
# Let's see how it looks when we have a matrix-form:
ar2 = np.arange(12).reshape(3,4)
ar2

In [None]:
# Get the third row
ar2[2]

In [None]:
# get from first to second row: (index 2 is not included)
ar2[0:2]

Slicing can be used separately for rows and columns:

In [None]:
ar2[:,2]  # get third column

In [None]:
ar2[-2:,-1]  # Last 2 rows AND last 1 column

Indexing with integers and slices is called basic indexing in NumPy. It returns a *view* on the existing data rather than a copy, so the data does not have to be copied again and again in memory. As a consequence, modifying the view also modifies the original array.
Let's see an example:

In [None]:
# we have:
ar

In [None]:
sub = ar[:3]
sub

In [None]:
sub.base

In [None]:
sub[1]=10

In [None]:
ar
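
When an independent array is needed, `.copy()` can be used. The sketch below uses a fresh `ar` and the illustrative names `view` and `copy`; `np.shares_memory` checks whether two arrays share the same data.

```python
import numpy as np

ar = np.arange(10)

view = ar[:3]          # basic slicing returns a view
copy = ar[:3].copy()   # an explicit, independent copy

view[0] = 100          # also changes ar
copy[1] = -1           # leaves ar untouched

print(ar[:3])                      # [100   1   2]
print(np.shares_memory(ar, view))  # True
print(np.shares_memory(ar, copy))  # False
```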

### Boolean Arrays
We can create them by applying a comparison operator to an array:

In [None]:
ar

In [None]:
boolar = (ar==5)

In [None]:
boolar

In [None]:
boolar = (ar<=5)

In [None]:
boolar

### Indexing with Boolean Arrays

In [None]:
ar[boolar]
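
Boolean arrays can also be combined. This is a minimal sketch with a fresh `ar`; the result names are illustrative.

```python
import numpy as np

ar = np.arange(10)

# & is element-wise AND, | is OR, ~ is NOT.
# The parentheses are required: these operators bind tighter than comparisons.
middle = ar[(ar > 2) & (ar < 7)]
edges = ar[(ar < 2) | (ar > 7)]
odd = ar[~(ar % 2 == 0)]
print(middle)  # [3 4 5 6]
print(edges)   # [0 1 8 9]
print(odd)     # [1 3 5 7 9]

# True counts as 1, so sum() counts the matching elements
count = (ar <= 5).sum()
print(count)   # 6
```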

### Adding and removing elements of Arrays:
- `np.append(array, value)`: returns a new array with `value` appended to the end
- `np.insert(array, index, value)`: returns a new array with `value` inserted before `index`
- `np.delete(array, index, axis)`: returns a new array with the element, row, or column at `index` removed

In [None]:
a = np.arange(5)
a

In [None]:
np.append(a,5)

In [None]:
np.insert(a,3,10)

In [None]:
np.delete(a,2)
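
These functions return new arrays and leave the original untouched. The sketch below also shows the `axis` argument of `np.delete` on a small two-dimensional array; the name `m` is illustrative.

```python
import numpy as np

a = np.arange(5)
appended = np.append(a, 5)
print(a)         # [0 1 2 3 4]  (unchanged)
print(appended)  # [0 1 2 3 4 5]

m = np.arange(6).reshape(2, 3)           # [[0 1 2], [3 4 5]]
no_first_row = np.delete(m, 0, axis=0)   # axis=0 removes a row
no_first_col = np.delete(m, 0, axis=1)   # axis=1 removes a column
print(no_first_row.tolist())  # [[3, 4, 5]]
print(no_first_col.tolist())  # [[1, 2], [4, 5]]
```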

## Pandas

The `pandas` package provides high-performance data structures and data analysis tools. It is built on top of NumPy and provides an efficient implementation of a `DataFrame` (i.e., multidimensional arrays with column and row labels). Its main features include:
- Data manipulation and analysis
- DataFrame objects and Series
- Export and import data from files and web
- Handling of missing data.


Two primary data structures in Pandas are **Series** and **DataFrame.**

In [None]:
# pd.<TAB>
# pd?

### Introducing Series

A Series in Pandas is a one-dimensional labeled array capable of holding a wide variety of
data. This array also has an index, which is like a label for each of the values in the array.

The general expression to construct a Pandas `Series` is the following:

`pd.Series(data, index=index)`. 

Here `index` is optional. If it is not specified, Pandas creates a default integer index (0, 1, 2, ...).


In [None]:
# create a basic series
data = pd.Series([0,1,2], index=['a', 'b', 'c'])
data

In [None]:
# We can access values and index attributes:
data.values

In [None]:
data.index

In [None]:
# Data (i.e., the value) can be accessed by the associated index:

In [None]:
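# We can use the implicit index (position).
# Plain data[0] relies on a positional fallback that recent Pandas versions deprecate, so iloc is safer:
data.iloc[0]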

In [None]:
# Array-like behavior
data[0:2]

In [None]:
# We can use the explicit index:
data['a']

In [None]:
# What if the explicit index is a number? 
data = pd.Series([0,1,2], index=[2,1,0])
data

In [None]:
data[0]

**What is the difference between a NumPy array and a Pandas Series?**
- Explicit vs. implicit index.
- The index of a NumPy array is always made of integers and is implicitly defined (we don't need to define it manually).
- The index of a Pandas Series is immutable, but its values are not!
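
The last point can be checked with a short sketch. The Series `s` and the flag `index_changed` are illustrative names, not part of the original notebook.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

# Values can be changed in place
s['a'] = 100

# A single index entry cannot be changed in place
try:
    s.index[0] = 'z'
    index_changed = True
except TypeError as err:
    index_changed = False
    print("TypeError:", err)

# Replacing the whole index with a new one is allowed
s.index = ['x', 'y', 'z']
print(s)
```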

In [None]:
# We can also use dictionaries:
age_dict = {'Mark':15, 'Alice':20, 'Kevin':12, 'Ana':18, 'Rob':33 , 'Joe':43 , 'Tom':12}
ages = pd.Series(age_dict)

In [None]:
ages.values

In [None]:
ages.index

In [None]:
ages.keys()  # keys() is a method; it returns the same labels as ages.index

In [None]:
ages

In [None]:
ages['Alice']

In [None]:
ages.iloc[1]  # positional access; plain ages[1] is deprecated for label-indexed Series in recent Pandas

In [None]:
ages[2:4]

In [None]:
# We can repeat a scalar using the length of the index
data = pd.Series(4, index = [1,2,3,4,5])
data

In [None]:
# what if we have more data than indexes?
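
A minimal sketch of that case: when the number of values and the number of index labels differ, Pandas raises a `ValueError`. The flag `raised` is only there to make the outcome explicit.

```python
import pandas as pd

try:
    # three values but only two labels
    pd.Series([1, 2, 3], index=['a', 'b'])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```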


In [None]:
# There are also some useful operations for a Pandas Series:
ages.median()

In [None]:
ages.max()

In [None]:
ages.min()

In [None]:
# Operations can be performed element-wise for the series
ages > ages.median()

In [None]:
# Use any or all to check if a comparison is true for the series as a whole.
any(ages > 40)

In [None]:
any(ages > 60)

In [None]:
all(ages >20)
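
The built-in `any` and `all` work here, but a `Series` also has its own `.any()` and `.all()` methods. The sketch below rebuilds `ages` so that it can run on its own; the result names are illustrative.

```python
import pandas as pd

ages = pd.Series({'Mark': 15, 'Alice': 20, 'Kevin': 12, 'Ana': 18,
                  'Rob': 33, 'Joe': 43, 'Tom': 12})

over_40 = (ages > 40).any()      # True: at least one person is older than 40
over_60 = (ages > 60).any()      # False: nobody is older than 60
all_over_20 = (ages > 20).all()  # False: not everyone is older than 20

# True counts as 1, so sum() counts how many rows satisfy the condition
n_adults = (ages >= 18).sum()
print(over_40, over_60, all_over_20, n_adults)
```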

In [None]:
# Note, all of the above results are Boolean expressions. So we can use them as index to select rows:
ages[ages > ages.median()]

In [None]:
ages

In [None]:
#Check index in series:
'Alex' in ages

In [None]:
# other element-wise operations:
ages + ages

In [None]:
ages**2

**Notes about indexes: loc, iloc**

For example, if your Series has an explicit integer index, an indexing operation such as `data[1]` will
use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit Python-style index.

- `loc` attribute allows indexing and slicing that always references the explicit index.
    - For example: `data.loc[1]` returns the value whose explicit index label is `1`, wherever it sits in the Series.
- `iloc` attribute allows indexing and slicing that always references the implicit index.
    - For example: `data.iloc[1]` returns the value at position `1` (the second element), regardless of its explicit label.

Let's see an example with our `Series`:

In [None]:
ages

In [None]:
ages.loc['Mark']

In [None]:
ages.iloc[0]
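
The difference is easiest to see when the explicit index is made of integers that do not match the positions. This is a small sketch with an illustrative Series:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]       # label 1 -> 'a'
by_position = data.iloc[1]   # position 1 (second element) -> 'b'
print(by_label, by_position)

# loc slices by label and includes the end label
label_slice = data.loc[1:3]
print(label_slice.tolist())     # ['a', 'b']

# iloc slices by position and excludes the end position
position_slice = data.iloc[1:3]
print(position_slice.tolist())  # ['b', 'c']
```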

### Introducing Data Frame

What happens when a Series gains a dimension? It becomes a DataFrame!

We can think of a `DataFrame` as a sequence of aligned `Series` objects. By **aligned** we mean that all the `Series` share the **same index**.

When we are not reading a file directly, we can construct a `DataFrame` in different ways. Here we will see how to construct a `DataFrame` from a set of dictionaries, each converted to a `Series` first.

In [None]:
# We have:
ages

In [None]:
height_dic = {'Mark':5.5, 'Alice':5.2, 'Kevin':6.1, 'Ana':5.0, 'Rob':5.2 , 'Joe':6.5 , 'Tom':4.9}
height = pd.Series(height_dic)
height

In [None]:
wage =  np.random.random(size=(7,))*100
names = ['Mark','Alice','Kevin','Ana','Rob','Joe','Tom']
wage_dic=dict(zip(names,wage))
wage = pd.Series(wage_dic)

In [None]:
wage

In [None]:
df = pd.DataFrame({'age':ages,'height':height,'wage':wage})

In [None]:
df
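
Alignment also matters when the `Series` do not share exactly the same labels. In the sketch below (made-up toy values, illustrative names), the resulting index is the union of the labels and missing entries become `NaN`:

```python
import pandas as pd

age = pd.Series({'Mark': 15, 'Alice': 20})
height = pd.Series({'Alice': 5.2, 'Kevin': 6.1})

# Rows are matched by label, not by position
people = pd.DataFrame({'age': age, 'height': height})
print(people)

print(people.shape)                      # (3, 2)
print(pd.isna(people.loc['Kevin', 'age']))  # True: Kevin has no age
```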

As with a `Series`, we can access the index labels and the values:

In [None]:
df.values

In [None]:
df.index

But we can also check the columns (since we have a multidimensional object):

In [None]:
df.columns

### Indexing in DataFrame:

In [None]:
df.loc['Kevin']

In [None]:
df.iloc[2]

In [None]:
df['wage']

In [None]:
df.wage

In [None]:
df.wage['Alice']

In [None]:
df[df['wage']>30]

In [None]:
df[(df['wage']>30) & (df['age']<30)]

In [None]:
df[(df['height']==5.2) & (df['wage']>30)]

In [None]:
df[~(df['height']==5.2) & (df['wage']>30)]

In [None]:
df[~((df['height']==5.2) & (df['wage']>30))]
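
The last two cells differ only in their parentheses, but they select different rows because `~` binds more tightly than `&`. The sketch below uses a small made-up table (not the `df` from above) so that the result is predictable:

```python
import pandas as pd

# Made-up toy data, for illustration only
toy = pd.DataFrame({'height': [5.2, 5.2, 6.0, 6.0],
                    'wage': [50, 10, 50, 10]})

short = toy['height'] == 5.2
rich = toy['wage'] > 30

# (NOT short) AND rich
not_short_and_rich = toy[~short & rich]
print(list(not_short_and_rich.index))  # [2]

# NOT (short AND rich)
not_both = toy[~(short & rich)]
print(list(not_both.index))            # [1, 2, 3]
```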

### Some additional functions:

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.sort_values(by='wage')

In [None]:
df.sort_values(by='wage').head(3)

In [None]:
# Basic plot from pandas:
df.plot()

In [None]:
df.wage.sort_values(ascending=False).plot(kind='bar', figsize=(7,5))

### Create new columns: 

In [None]:
# Create new column based on columns already in the data
df['wage2'] = df['wage']*2

In [None]:
df

In [None]:
# From comparisons (creates boolean expression)
df['wagehigh'] = df['wage']>30
df

In [None]:
# delete column:
del df['wage2']

In [None]:
df

In [None]:
# delete rows:
df=df.drop(['Mark'])

In [None]:
df

In [None]:
# add scalar:
df['status']=3
df

### Aggregations 

In [None]:
df.groupby(by='wagehigh').mean()

In [None]:
df.groupby(by='wagehigh').sum()

In [None]:
df.groupby(by='wagehigh').mean()[['wage','age']]

In [None]:
# The original df didn't change! To keep the result of an aggregation, we need to assign it to a new object:
df

In [None]:
df_agg = df.groupby(by='wagehigh').mean()[['wage']]

In [None]:
df_agg
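
Several statistics can be computed in one step with `agg`. This is a sketch on a small made-up table; the output column names (`mean_wage`, `max_age`, `n`) are chosen here for illustration.

```python
import pandas as pd

# Made-up toy data, for illustration only
toy = pd.DataFrame({'wagehigh': [True, True, False, False],
                    'wage': [40.0, 60.0, 10.0, 20.0],
                    'age': [30, 40, 20, 24]})

# Named aggregation: new_column=(source_column, function)
summary = toy.groupby('wagehigh').agg(
    mean_wage=('wage', 'mean'),
    max_age=('age', 'max'),
    n=('wage', 'count'),
)
print(summary)
```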

### Joins

Different types of joins:


- left: use only keys from left frame; preserve key order.
- right: use only keys from right frame; preserve key order.
- outer: use union of keys from both frames.
- inner: use intersection of keys from both frames; preserve the order of the left keys.
- cross: creates the Cartesian product of both frames; preserves the order of the left keys.


Find more about `Pandas` merge [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
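
The join types listed above can be compared on two small frames. The frames `left` and `right` and the dictionary `keys` are made up for illustration; `how='cross'` assumes Pandas 1.2 or newer.

```python
import pandas as pd

# Made-up toy data: keys 'b' and 'c' appear in both frames
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

keys = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = left.merge(right, how=how, on='key')
    keys[how] = list(merged['key'])
    print(how, keys[how])

# A cross join pairs every left row with every right row (no 'on' argument)
cross = left.merge(right, how='cross')
print(len(cross))  # 9
```

Rows without a match in the other frame get `NaN` in the columns that come from it.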

In [None]:
df_agg.index

In [None]:
df.merge(df_agg, how='left', on='wagehigh',suffixes=['_master','_using'])

In [None]:
df.merge(df_agg, how='right', on='wagehigh',suffixes=['_master','_using'])

## Reading Real Data

We will read some data from [FiveThirtyEight](https://github.com/fivethirtyeight/data)


The folder of Airline Safety contains the data behind the story [Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?](http://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/)

Header | Definition
---|---------
`airline` | Airline (asterisk indicates that regional subsidiaries are included)
`avail_seat_km_per_week` | Available seat kilometers flown every week
`incidents_85_99` | Total number of incidents, 1985–1999
`fatal_accidents_85_99` | Total number of fatal accidents, 1985–1999
`fatalities_85_99` | Total number of fatalities, 1985–1999
`incidents_00_14` | Total number of incidents, 2000–2014
`fatal_accidents_00_14` | Total number of fatal accidents, 2000–2014
`fatalities_00_14` | Total number of fatalities, 2000–2014

Source: [Aviation Safety Network](http://aviation-safety.net)


In [None]:
data = pd.read_csv('pyintro_resources/fivethirtyeight/airline-safety/airline-safety.csv')

In [None]:
data.head()

In [None]:
data.tail(5)

In [None]:
data.columns

In [None]:
data.describe()

In [None]:
data.shape

In [None]:
data.index

In [None]:
# top 10 with highest fatalities_00_14
data.sort_values(by='fatalities_00_14',ascending=False).head(10)