# NumPy

*   Instructor: Victor Fuentes Campos
*   Course: Fundamentos de Programación en Python para Macroeconomía y Finanzas (Fundamentals of Python Programming for Macroeconomics and Finance)
*   Adapted from the classes by [Carla Solís](https://github.com/ccsuehara/python_para_las_ccss/blob/main/Clase%202/tipos_operadores.ipynb) and [Alexander Quispe](https://github.com/alexanderquispe/QLAB_Summer_Python/blob/main/Lecture_2/Lecture_2.ipynb)


Let's start with a [simple video](https://www.youtube.com/watch?v=Tkv45wgxlEU).

> From here on, we start using Python libraries, which are like superpowers that boost our Python programming skills. They work like plugins or extensions that make our code more efficient.

[NumPy](https://numpy.org/doc/stable/user/numpy-for-matlab-users.html) (Numerical Python) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. If you are already familiar with MATLAB, you may find this tutorial useful for getting started with NumPy.

<img src="https://s3.amazonaws.com/dq-content/289/1.2-m289.gif" width="1000">

### Arrays
A NumPy array is a grid of values, all of the same type, indexed by a tuple of non-negative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

Don't forget to install NumPy: ```conda install numpy```

In [1]:
import numpy as np

In [2]:
a = np.array( [1, 2, 3, 4, 5] )

In [3]:
a   

array([1, 2, 3, 4, 5])

In [4]:
type(a)

numpy.ndarray

In [5]:
# 1D array
a = np.array( [1, 2, 3, 4, 5] )
print(a)

[1 2 3 4 5]


In [6]:
# 2D array
M = np.array( [ [1, 2, 3], [4, 5, 6] ] )

print(M)

[[1 2 3]
 [4 5 6]]


In [7]:
X = np.array( [ [1, 2, 3, 4], [4, 5, 6, 7] ] )
X

array([[1, 2, 3, 4],
       [4, 5, 6, 7]])
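The rank, shape, and tuple indexing described at the start of this section can be inspected directly on an array. The snippet below is a minimal sketch that reuses the same small arrays as the cells above.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])            # 1D array
M = np.array([[1, 2, 3], [4, 5, 6]])     # 2D array

print(a.ndim, a.shape)   # 1 (5,)
print(M.ndim, M.shape)   # 2 (2, 3)
print(M.dtype)           # common element type (integer width depends on the platform)

# Indexing with a tuple of non-negative integers: row 1, column 2
print(M[1, 2])           # 6
```

Note that the shape of a 1D array is a one-element tuple, `(5,)`, not the integer `5`.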

|Function|	Description|
| --- |--- |
|np.array(a) |	Create an N-dimensional NumPy array from sequence a|
|np.linspace(a,b,N) |	Create 1D NumPy array with N equally spaced values <br> from a to b (inclusive)|
|np.arange(a,b,step) |	Create 1D NumPy array with values from a to b (exclusive) <br> incremented by step|
|np.zeros(N)	| Create 1D NumPy array of zeros of length N|
|np.zeros((n,m)) |	Create 2D NumPy array of zeros with n rows and m columns|
|np.ones(N) |	Create 1D NumPy array of ones of length N|
|np.ones((n,m))|	Create 2D NumPy array of ones with n rows and m columns|
|np.eye(N)	| Create 2D NumPy array with N rows and N columns  <br> with ones on the diagonal  (i.e. the identity matrix of size N)|
|np.concatenate( )|Join a sequence of arrays along an existing axis|
|np.hstack( )|Stack arrays in sequence horizontally(column wise)|
|np.vstack( )|Stack arrays in sequence vertically(row wise)|
|np.column_stack( )|Stack 1-D arrays as columns into a 2-D array|
|np.random.normal() | Draw random samples from a normal (Gaussian) distribution. |
|np.linalg.inv() | Compute the (multiplicative) inverse of a matrix. |
|np.dot() / @  | Matrix Multiplication. |
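The last three rows of the table are not demonstrated in the cells that follow, so here is a minimal sketch of them. The matrix `A` is an arbitrary invertible example chosen for illustration.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])   # small invertible matrix (illustrative values)
A_inv = np.linalg.inv(A)                 # multiplicative inverse of A

# A @ A_inv should equal the identity matrix up to rounding error
print(np.allclose(A @ A_inv, np.eye(2)))

# For 2D arrays, @ and np.dot compute the same matrix product
print(np.allclose(A @ A_inv, np.dot(A, A_inv)))

# Five draws from a normal distribution with mean 0 and standard deviation 1
draws = np.random.normal(0, 1, 5)
print(draws.shape)
```

Comparisons use `np.allclose` rather than `==` because floating-point results are rarely exactly equal.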

In [8]:
# Create a 1D NumPy array with 11 equally spaced values from 0 to 1:
x = np.linspace( 0, 1, 11 )
print(x)

[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]


In [9]:
# Create a 1D NumPy array with values from 0 to 20 (exclusive) incremented by 3:
y = np.arange( 0, 20, 3 )
print(y)

[ 0  3  6  9 12 15 18]


In [None]:
# Create a 1D NumPy array of zeros of length 10:
z = np.zeros(10)
print(z)

In [None]:
# Create a 2D NumPy array of zeros of shape (10, 10):
M = np.zeros( (10, 10) )
print(M)

In [None]:
# Create a 1D NumPy array of ones of length 7:
w = np.ones(7)
print(w)

In [None]:
# Create a 2D NumPy array of ones with 5 rows and 5 columns:
N = np.ones( (5, 5) )
print(N)

In [None]:
np.eye(5)

In [None]:
# Create the identity matrix of size 10:
I = np.eye(10)
print(I)

In [None]:
# Shape
I.shape

In [None]:
# Size
I.size

In [None]:
# Concatenate
g = np.array([[5,6],[7,8]])
g

In [None]:
h = np.array([[1,2]])
h

In [None]:
np.concatenate( (g, h) , axis = 0)

In [None]:
h_2 = h.reshape(2, 1)
h_2

In [None]:
g

In [None]:
g_h = np.concatenate((g, h_2), axis = 1)
g_h

In [None]:
sett = np.hstack((g,h_2))
sett

In [None]:
# vstack 
x = np.array([1,1,1])
y = np.array([2,2,2])
z = np.array([3,3,3])

In [None]:
z

In [None]:
vstacked = np.vstack( (x, y, z) )
vstacked

In [None]:
vstacked.shape

In [None]:
vstacked = np.vstack((x,y,z))
print(vstacked)

In [None]:
# hstack 
hstacked = np.hstack((x,y,z))
print(hstacked)
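To make the difference between the stacking functions concrete, the sketch below compares the shapes they produce for the same three 1D arrays, and adds `np.column_stack` from the table above.

```python
import numpy as np

x = np.array([1, 1, 1])
y = np.array([2, 2, 2])
z = np.array([3, 3, 3])

print(np.vstack((x, y, z)).shape)        # (3, 3): each array becomes a row
print(np.hstack((x, y, z)).shape)        # (9,): arrays are joined end to end
print(np.column_stack((x, y, z)).shape)  # (3, 3): each array becomes a column

# For 1D inputs, column_stack is the transpose of vstack
print(np.array_equal(np.column_stack((x, y, z)), np.vstack((x, y, z)).T))  # True
```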

### OLS with NumPy

In [None]:
# X data generation
n_data = 200
x1 = np.linspace(200, 500, n_data)
x0 = np.ones(n_data)
X = np.hstack(( x0.reshape(-1, 1 ) , x1.reshape(-1, 1 ) ))
X.shape

In [None]:
# For more on reshape(-1, 1):
# https://stackoverflow.com/questions/18691084/what-does-1-mean-in-numpy-reshape
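As a quick illustration of what `-1` does in `reshape`, the sketch below uses a small toy vector: NumPy infers the dimension marked `-1` from the array's total size.

```python
import numpy as np

v = np.arange(6)           # 1D array with shape (6,)
col = v.reshape(-1, 1)     # -1 lets NumPy infer the row count: shape (6, 1)
row = v.reshape(1, -1)     # -1 lets NumPy infer the column count: shape (1, 6)
grid = v.reshape(2, -1)    # 2 rows, columns inferred: shape (2, 3)

print(v.shape, col.shape, row.shape, grid.shape)
```

Only one dimension may be `-1`, and the remaining dimensions must divide the total number of elements evenly.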

In [None]:
# select parameters
beta = np.array([5, -2]).reshape(-1, 1 )
beta.shape

In [None]:
# y true
y_true = X @ beta
y_true.shape

In [None]:
y_true

In [None]:
y_true + (np.random.normal(0, 1, n_data) * 20).reshape(-1, 1)

In [None]:
#   add random normal noise
sigma = 20
y_actual = y_true + (np.random.normal(0, 1, n_data) * sigma).reshape(-1, 1)
print(y_actual[0:4, :])
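Because the noise is random, the numbers printed above change on every run. If reproducible results are wanted, one option is a seeded generator from `np.random.default_rng`. This is a sketch and not the notebook's own approach; the seed value 42 is an arbitrary assumption.

```python
import numpy as np

n_data = 200
sigma = 20

rng = np.random.default_rng(seed=42)   # seed 42 is an arbitrary choice

# Noise with mean 0 and standard deviation sigma, as a column vector
noise = rng.normal(0, sigma, n_data).reshape(-1, 1)

# A new generator with the same seed reproduces exactly the same draws
rng_again = np.random.default_rng(seed=42)
noise_again = rng_again.normal(0, sigma, n_data).reshape(-1, 1)

print(noise.shape)                          # (200, 1)
print(np.array_equal(noise, noise_again))   # True
```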

The OLS estimator of the linear parameters is given by the matrix equation:
$${\hat {\beta }}=(X^{T}X)^{-1}X^{T}y.$$

In [None]:
# estimations
beta_estimated = np.linalg.inv(X.T @ X) @ X.T @ y_actual
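The explicit inverse mirrors the textbook formula, but the same estimate can be obtained without forming $(X^{T}X)^{-1}$. The sketch below regenerates comparable data (with a seeded generator, which is an assumption made here for reproducibility) and compares three routes: the inverse, `np.linalg.solve` on the normal equations, and `np.linalg.lstsq`.

```python
import numpy as np

# Regenerate data like the cells above (seeded here for reproducibility)
rng = np.random.default_rng(0)
n_data = 200
x1 = np.linspace(200, 500, n_data)
X = np.column_stack((np.ones(n_data), x1))
beta = np.array([5, -2]).reshape(-1, 1)
y_actual = X @ beta + rng.normal(0, 20, n_data).reshape(-1, 1)

# 1) Textbook formula with an explicit inverse
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y_actual

# 2) Solve the normal equations (X'X) b = X'y without inverting
beta_solve = np.linalg.solve(X.T @ X, X.T @ y_actual)

# 3) Least-squares solver applied directly to X and y
beta_lstsq, *_ = np.linalg.lstsq(X, y_actual, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

All three should agree up to rounding. Solving a linear system is generally preferred over computing an inverse because it is cheaper and less sensitive to rounding error.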

In [None]:
import matplotlib.pyplot as plt

plt.plot(x1, y_actual, 'o')
plt.plot(x1, y_true, '-', c = 'black')  # line style only; the color is set once via c

Calculate the residual sum of squares (RSS):
$$
RSS=y^{T}y-y^{T}X(X^{T}X)^{{-1}}X^{T}y
$$

In [None]:
y_actual

In [None]:
RSS = ( y_actual.T @ y_actual - y_actual.T @ X @ np.linalg.inv(X.T @ X) @ X.T @ y_actual )
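The matrix formula for RSS is algebraically the same as summing the squared residuals $y - X\hat{\beta}$. The sketch below checks this on regenerated data; the seeded generator and the variable names `RSS_formula` and `RSS_resid` are assumptions introduced for illustration.

```python
import numpy as np

# Regenerate data like the cells above (seeded here for reproducibility)
rng = np.random.default_rng(0)
n_data = 200
x1 = np.linspace(200, 500, n_data)
X = np.column_stack((np.ones(n_data), x1))
beta = np.array([5, -2]).reshape(-1, 1)
y_actual = X @ beta + rng.normal(0, 20, n_data).reshape(-1, 1)

# RSS from the matrix formula y'y - y'X (X'X)^{-1} X'y
RSS_formula = (y_actual.T @ y_actual
               - y_actual.T @ X @ np.linalg.inv(X.T @ X) @ X.T @ y_actual)

# RSS as the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y_actual, rcond=None)
residuals = y_actual - X @ beta_hat
RSS_resid = residuals.T @ residuals

# Both are 1x1 arrays; .item() extracts the scalar
print(RSS_formula.item(), RSS_resid.item())
```

The two values should match closely. The residual form is usually the safer one numerically, since the matrix formula subtracts two large, nearly equal numbers.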

Calculate the total sum of squares (TSS), the spread of the actual (noisy) values around their mean:
$$
TSS=(y-{\bar  y})^{T}(y-{\bar  y})=y^{T}y-2y^{T}{\bar  y}+{\bar  y}^{T}{\bar  y}
$$

In [None]:
y_mean = ( np.ones(n_data) * np.mean(y_actual) ).reshape( -1 , 1 )
TSS = (y_actual - y_mean).T @ (y_actual - y_mean)
TSS

In [None]:
# get predictions
y_pred = X @ beta_estimated

Calculate the explained sum of squares (ESS), the spread of the predictions around their mean:
$$
ESS=({\hat  y}-{\bar  y})^{T}({\hat  y}-{\bar  y})={\hat  y}^{T}{\hat  y}-2{\hat  y}^{T}{\bar  y}+{\bar  y}^{T}{\bar  y}
$$

In [None]:
ESS = (y_pred - y_mean).T @ (y_pred - y_mean)

ESS

In [None]:
TSS, ESS + RSS

Get $R^2$:
$$
R^{2} = 1 - \frac{RSS}{TSS}
$$

In [None]:
1 - RSS / TSS
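When the model includes an intercept, TSS = ESS + RSS, so $R^2$ can equivalently be written as ESS / TSS. The sketch below verifies both identities on regenerated data. It uses scalar sums instead of the 1x1 matrix products above, and the seeded generator is an assumption.

```python
import numpy as np

# Regenerate data like the cells above (seeded here for reproducibility)
rng = np.random.default_rng(0)
n_data = 200
x1 = np.linspace(200, 500, n_data)
X = np.column_stack((np.ones(n_data), x1))
beta = np.array([5, -2]).reshape(-1, 1)
y_actual = X @ beta + rng.normal(0, 20, n_data).reshape(-1, 1)

beta_hat, *_ = np.linalg.lstsq(X, y_actual, rcond=None)
y_pred = X @ beta_hat
y_bar = y_actual.mean()

RSS = ((y_actual - y_pred) ** 2).sum()   # unexplained variation
TSS = ((y_actual - y_bar) ** 2).sum()    # total variation
ESS = ((y_pred - y_bar) ** 2).sum()      # explained variation

r2 = 1 - RSS / TSS
print(np.isclose(TSS, ESS + RSS))   # decomposition holds with an intercept
print(np.isclose(r2, ESS / TSS))    # two equivalent definitions of R^2
```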

### SE
Calculate the standard error of the regression. We divide by `(n-2)` because two parameters (intercept and slope) are estimated, so the expectation of the residual sum of squares is `(n-2)*sigma^2`.

In [None]:
sr2 = ( (1 / (n_data - 2)) * (y_pred - y_actual).T  @ (y_pred - y_actual))
sr = np.sqrt(sr2)
sr

### Var-Cov
In order to get the standard errors for our linear parameters, we use the matrix formula below:
$$
\operatorname{Var}({\hat {\beta }})=\sigma ^{2}(X^{T}X)^{-1}
$$

In [None]:
var_beta = sr2 * np.linalg.inv(X.T @ X)
var_beta

In [None]:
print(
    f'Std Error for b0 {np.sqrt(var_beta[0, 0])}, \nStd Error for b1 {np.sqrt(var_beta[1, 1])}'
)
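The standard errors are the square roots of the diagonal of the variance-covariance matrix, so they can be extracted in one step with `np.diag`. The sketch below regenerates comparable data with a seeded generator (an assumption) and keeps the residual variance as a scalar rather than a 1x1 array.

```python
import numpy as np

# Regenerate data like the cells above (seeded here for reproducibility)
rng = np.random.default_rng(0)
n_data = 200
x1 = np.linspace(200, 500, n_data)
X = np.column_stack((np.ones(n_data), x1))
beta = np.array([5, -2]).reshape(-1, 1)
y_actual = X @ beta + rng.normal(0, 20, n_data).reshape(-1, 1)

beta_hat, *_ = np.linalg.lstsq(X, y_actual, rcond=None)
residuals = y_actual - X @ beta_hat

# Estimated residual variance with n - 2 degrees of freedom
sr2 = (residuals ** 2).sum() / (n_data - 2)

# Variance-covariance matrix of the estimated parameters
var_beta = sr2 * np.linalg.inv(X.T @ X)

# Standard errors: square roots of the diagonal entries
std_errors = np.sqrt(np.diag(var_beta))
print(std_errors)
```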

## Exercise

*Example extracted from Dataquest course "Data Analyst in Python"*

Let's work with a subset of the New York City taxi trip data released by the city. We'll focus on about 90,000 yellow taxi trips to and from various NYC airports between January and June 2016. Here are some selected columns from the dataset:

- `pickup_month`: the month of the trip (January is 1, December is 12)
- `pickup_day`: the day of the month of the trip
- `pickup_location_code`: the airport or borough where the trip started
- `dropoff_location_code`: the airport or borough where the trip ended
- `trip_distance`: the distance of the trip in miles
- `trip_length`: the length of the trip in seconds
- `fare_amount`: the base fare of the trip, in dollars
- `total_amount`: the total amount charged to the passenger, including all fees, tolls, and tips


Review the data dictionary [here](https://s3.amazonaws.com/dq-content/289/nyc_taxi_data_dictionary.md).

Our data is stored in a CSV file called `nyc_taxis.csv`.


In [None]:
pwd

In [None]:
import csv
# import nyc_taxis.csv as a list of lists
# (the with block closes the file automatically)
with open("data/nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for element in row:
        converted_row.append(float(element))
    converted_taxi_list.append(converted_row)
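The loop above shows the conversion step by step. As an aside, NumPy can also parse numeric CSV text directly with `np.genfromtxt`. The sketch below uses a tiny in-memory CSV with made-up values as a stand-in for the real file, so the column names and numbers are purely hypothetical.

```python
import io
import numpy as np

# Hypothetical miniature CSV standing in for data/nyc_taxis.csv
csv_text = "pickup_month,trip_distance,fare_amount\n1,21.0,52.0\n1,16.29,45.0\n"

# skip_header=1 drops the header row; values are parsed as floats by default
taxi_small = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(taxi_small.shape)   # (2, 3)
print(taxi_small.dtype)   # float64
```

With a real file, the path string would be passed in place of the `io.StringIO` object.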

Add a single line of code using the `numpy.array()` constructor to convert the `converted_taxi_list` variable to a NumPy ndarray, and assign the result to the variable name `taxi`

In [None]:
# Solution

Let's explore our dataset

In [None]:
taxi

How many columns and rows?

In [None]:
# Solution

Select the rows at indices 100 to 200 inclusive for the columns at indices 9 to 13 inclusive (fare amount, fees amount, tolls amount, tip amount, total amount). Assign the result to `rows_100_to_200_column_9_to_13`

In [None]:
# Solution
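A reminder that may help with inclusive ranges: in NumPy slicing the start index is included but the stop index is excluded. The sketch below shows this on a small toy array, not on the taxi data.

```python
import numpy as np

grid = np.arange(30).reshape(5, 6)   # toy 5x6 array holding the values 0..29

# Rows at indices 1 to 3 inclusive, columns at indices 2 to 4 inclusive:
# the stop index is exclusive, so add 1 to each inclusive upper bound
block = grid[1:4, 2:5]

print(block.shape)   # (3, 3)
print(block)
```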

Using the following variables, calculate a new variable `trip_mph`

In [None]:
trip_distance_miles = taxi[:, 7]
trip_length_seconds = taxi[:, 8]
trip_length_hours = trip_length_seconds / 3600 # there are 3600 seconds in one hour

In [None]:
# Solution

Calculate the average speed of `trip_mph`

In [None]:
# Solution
trip_mph.mean()

Calculate the sum of each row in `fare_components`

In [None]:
# extract the first 5 rows only
taxi_first_five = taxi[:5]
# select columns: fare_amount, fees_amount, tolls_amount, and tip_amount
# (indices 9 to 12; index 13 is total_amount, so the stop index is 13)
fare_components = taxi_first_five[:, 9:13]
fare_components

In [None]:
# Solution
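As a hint on the `axis` argument, the sketch below sums a small toy array (not the taxi data) along each axis.

```python
import numpy as np

toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

row_sums = toy.sum(axis=1)   # one sum per row: 6 and 15
col_sums = toy.sum(axis=0)   # one sum per column: 5, 7 and 9
total = toy.sum()            # sum of all elements: 21

print(row_sums, col_sums, total)
```

`axis=1` collapses the columns, leaving one value per row; `axis=0` collapses the rows, leaving one value per column.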
