# Data Science Numpy & CSV's

## Tasks Today:

1) <b>Numpy</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Python List Comparison <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) In-Class Exercise #1 <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Creating an NDArray <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.array() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.zeros() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.ones() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.arange() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Making Lists into NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Performing Calculations on NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Summation <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Difference <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Multiplication <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Division <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Numpy Subsetting <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) Multi-dimensional Arrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) Indexing NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; i) Checking NDArray Type <br>
 &nbsp;&nbsp;&nbsp;&nbsp; j) Altering NDArray Type <br>
 &nbsp;&nbsp;&nbsp;&nbsp; k) Checking the Shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp; l) Altering the Shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp; m) In-Class Exercise #2 <br>
 &nbsp;&nbsp;&nbsp;&nbsp; n) Complex Indexing & Assigning <br>
 &nbsp;&nbsp;&nbsp;&nbsp; o) Elementwise Multiplication <br>
 &nbsp;&nbsp;&nbsp;&nbsp; p) np.where() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; q) Random Sampling <br>

2) <b>Working With CSV's</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Imports <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Reading a CSV <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) Loading a CSV's Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Checking Number of Records <br>
 
3) <b>Exercises</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) #1 - Calculate BMI with NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) #2 - Find the Average Sum of Marathon Runners <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) #3 - Random Matrix Function <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) #4 - Comparing Boston Red Sox Hitting Numbers <br>

## Numpy <br>

<p>NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.</p>
<ul>
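    <li>Shape = Number of rows & columns (the length of each dimension)</li>
    <li>Matrix = Two-dimensional array</li>
    <li>Vector = One-dimensional array, e.g. the variables a matrix is applied to (same vector as the one used in physics)</li>
    <li>Array = Similar to lists, but every element shares one data type and math is applied elementwise</li>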
</ul>

#### Python List Comparison

<p>Lists are flexible, dynamic Python objects that do their job quite well, but they do not support some mathematical operations in an intuitive way. Consider the summation of two lists, $l_1$ and $l_2$:</p>

In [1]:
# run to sum both lists (may not work the way you believe)
l1 = [1,2,3]
l2 = [3,4,5]
# '+' concatenates Python lists instead of adding them elementwise
print(l1 + l2)

[1, 2, 3, 3, 4, 5]


<p>If we wanted to sum lists elementwise, we could write our own function that does the job entirely within the framework of Python.</p>

#### In-Class Exercise #1 - Write a function that sums two lists element by element <br>
<p>Ex: [2, 3, 4] + [1, 5, 2] = [3, 8, 6]</p>

In [3]:
def array_sum(l1, l2):
    n = len(l1)
    out = []
    for i in range(n):
        out.append(l1[i] + l2[i])
    return out
array_sum(l1,l2)

[4, 6, 8]

We would have to write a similar function for every operator we might want for list arithmetic. This is time-consuming and inefficient. Moreover, once the lists in question become nested, mimicking the behavior of true matrices, the problem gets worse. Complicated indexing is necessary just to allow for the most basic matrix operations common throughout science and engineering. Imagine writing a matrix multiplication function using Python syntax in a general way, such that it returns a matrix-matrix or matrix-vector product:

\begin{align}
(n \times x) \times (x \times m) \rightarrow (n \times m)
\end{align}

\begin{align}
\begin{bmatrix}
c_{0,0} & ... & c_{0,m} \\
\vdots & \ddots & \vdots \\
c_{n,0} & ... & c_{n,m}
\end{bmatrix}
=
\begin{bmatrix}
a_{0,0} & ... & a_{0,x} \\
\vdots & \ddots & \vdots \\
a_{n,0} & ... & a_{n,x}
\end{bmatrix}
\begin{bmatrix}
b_{0,0} & ... & b_{0,m} \\
\vdots & \ddots & \vdots \\
b_{x,0} & ... & b_{x,m}
\end{bmatrix}
\end{align}

Let us instantiate a matrix $\mathcal{M}$ and a vector $\vec{v}$ and write a function that does the multiplication ourselves.

In [4]:
# think of a vector as the variables to be applied in a mathematical process
# it is the same vector as we know in physics with x, y, z coordinates

def matrix_multiply(A, B):
    # create the result matrix of 0's: one row per row of A, one column per column of B
    ret = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]
    
    # setting lengths
    inner_dim = len(A[0])  # columns of A, which must equal the rows of B
    n_dim = len(ret)
    m_dim = len(ret[0])
    
    # each result element is the dot product of a row of A and a column of B
    for i in range(n_dim):
        for j in range(m_dim):
            element = 0
            for x in range(inner_dim):
                element += A[i][x] * B[x][j]
            ret[i][j] = element
    
    return ret

M = [[0,1,0],[0,2,0],[0,3,0]] # matrix
v = [[1],[2],[3]] # variables being applied to matrix

print(matrix_multiply(M, v))

[[2], [4], [6]]
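
For comparison, here is a minimal sketch of the same product using NumPy. It assumes NumPy is installed and uses the `@` operator for matrix multiplication; the variable name `product` is only for illustration.

```python
import numpy as np

M = np.array([[0, 1, 0], [0, 2, 0], [0, 3, 0]])  # 3x3 matrix
v = np.array([[1], [2], [3]])                    # 3x1 column vector

# '@' performs matrix multiplication; np.matmul(M, v) is equivalent
product = M @ v
print(product)
```

The triple loop from the hand-written function collapses into a single operator, and the result keeps the (3, 1) shape of the column vector.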


#### Importing

In [5]:
import numpy as np

# always import as np, standard across all of data science

#### Creating an NDArray <br>
<p>NumPy is based around a class called the $\textit{NDArray}$, which is a flexible vector / matrix class that implements the intuitive matrix and vector arithmetic lacking in basic Python. Let's start by creating some NDArrays:</p>

###### - np.array()

In [7]:
arr1 = np.array([1,2,3,4])
arr1

array([1, 2, 3, 4])

In [8]:
type(arr1)

numpy.ndarray

###### - np.zeros()

In [13]:
# define a shape tuple for reuse; np.zeros() accepts any tuple or list of dimensions
shape = (3,2)
arr1 = np.zeros([2,3])
arr1

array([[0., 0., 0.],
       [0., 0., 0.]])

###### - np.ones()

In [16]:
# notice I specified the datatype
arr1 = np.ones(shape, int)
arr1

array([[1, 1],
       [1, 1],
       [1, 1]])

###### - np.arange()

In [17]:
arr1 = np.arange(4)
arr1

array([0, 1, 2, 3])
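
Like Python's built-in `range`, `np.arange()` also accepts a start, a stop, and a step. A small sketch follows; the names `evens` and `halves` are made up for this example.

```python
import numpy as np

evens = np.arange(2, 10, 2)     # start=2, stop=10 (exclusive), step=2
halves = np.arange(0, 2, 0.5)   # unlike range(), the step may be a float

print(evens)
print(halves)
```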

###### - Making Lists into NDArrays

In [20]:
# l1 + l2 on plain lists concatenates; converting to NDArrays enables elementwise math
arr1 = np.array(l1)
arr2 = np.array(l2)
print(arr1)

[1 2 3]


#### Performing Calculations on NDArrays

###### - Summation

In [21]:
# using '+'
arr1 + arr2

array([4, 6, 8])

###### - Difference

In [22]:
# using '-'
arr1 - arr2

array([-2, -2, -2])

###### - Multiplication

In [23]:
# using '*'
arr1 * arr2

array([ 3,  8, 15])

###### - Division

In [24]:
# using '/'
arr1 / arr2

array([0.33333333, 0.5       , 0.6       ])

#### Numpy Subsetting

In [28]:
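# conditional check, returns an array of booleans, uses '>', '<', '<=', '>=', '==', etc.
arr1 > 4          # array([False, False, False])

# subsetting - use the boolean array as an index to get values instead of booleans
arr1[arr1 > 2]    # array([3])
arr1              # the original array is unchanged; only this last expression is displayed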

array([1, 2, 3])
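
Conditions can also be combined before subsetting. The sketch below assumes a fresh example array `arr` (not one of the arrays above) and uses `&` for "and" and `|` for "or".

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# each comparison needs its own parentheses when combined with & or |
mask = (arr > 2) & (arr < 6)
middle = arr[mask]                         # values strictly between 2 and 6
extremes = arr[(arr <= 1) | (arr >= 6)]    # values at either end

print(middle)
print(extremes)
```

Python's `and` / `or` keywords do not work elementwise on NDArrays, which is why the bitwise operators are used here.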

#### Multi-dimensional Arrays <br>
<p>NumPy seamlessly supports multidimensional arrays and matrices of arbitrary dimension without nesting NDArrays. NDArrays themselves are flexible and extensible and may be defined with such dimensions, with a rich API of common functions to facilitate their use. Let's start by building a two dimensional 3x3 matrix by conversion from a nested group of core Python lists $M = [l_1, l_2, l_3]$:</p>

In [31]:
l1 = [1,2,3]
l2 = [4,5,6]
l3 = [7,8,9]

m = [l1, l2 ,l3]
m
M = np.array(m)
M

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

#### Indexing NDArrays <br>
<p>Indexing is similar to lists within lists, but both indices go inside a single pair of brackets, separated by a comma. For example, [1, 2] accesses the third element of the second row.</p>

In [40]:
# uses [row, column] notation; M[1][0] also works, but M[1,0] is the idiomatic form
M[1,0] = 9
M

array([[1, 2, 3],
       [9, 5, 6],
       [7, 8, 9]])
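
The same bracket syntax accepts slices, so whole rows, columns, or blocks can be selected at once. This is a minimal sketch on a separate example matrix named `grid` (an assumption, so the matrix `M` above is left alone).

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

second_row = grid[1]          # same as grid[1, :]
second_col = grid[:, 1]       # every row, column index 1
top_right = grid[0:2, 1:3]    # rows 0-1, columns 1-2
same_element = grid[1, 2] == grid[1][2]   # both spellings reach the same value

print(second_row, second_col)
print(top_right)
```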

#### Assigning Values in NDArrays

In [41]:
# np.array() can also copy an existing array into a new data type
M_f = np.array(M, float)
M_f

# notice the trailing dots: every element of M_f is a float, while M keeps its integer type


array([[1., 2., 3.],
       [9., 5., 6.],
       [7., 8., 9.]])

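<p>Notice above that M_f holds floats, while M itself still holds integers. This matters when assigning: a fractional value written into an integer NDArray is truncated, so 1.5 would end up as a 1 in the target element's place. This is a data type issue. The .dtype attribute is available on all NDArrays, as is the .astype() method for casting between data types:</p>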
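
A short sketch of that truncation, using two throwaway arrays named `ints` and `floats` (hypothetical names for this example only):

```python
import numpy as np

ints = np.array([1, 2, 3])         # integer dtype is inferred from the values
ints[0] = 1.9                      # the fractional part is dropped on assignment

floats = ints.astype('float64')    # cast to float first...
floats[0] = 1.9                    # ...and the value is stored as given

print(ints, ints.dtype)
print(floats, floats.dtype)
```

The exact integer dtype (`int32` or `int64`) depends on the platform, so only the float case is guaranteed to print the same dtype everywhere.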

#### Checking NDArray Type

In [43]:
# .dtype
t = M.dtype
t

dtype('int32')

#### Altering NDArray Type

In [49]:
# different types are 'int32', 'float64', 'bool', etc.
# .astype(type)
M = M.astype('float64')
t = M.dtype
t

dtype('float64')

#### Checking the Shape <br>
<p>The behavior and properties of an NDArray are often sensitively dependent on the $\textit{shape}$ of the NDArray itself. The shape of an array can be found through the .shape attribute, which returns a tuple containing the array's dimensions:</p>

In [50]:
# .shape
M.shape

(3, 3)

#### Altering the Shape <br>
<p>As long as the number of elements remains fixed, we can reshape NDArrays at will:</p>

In [52]:
# .reshape(tuple)
M.reshape((9,1))

array([[1.],
       [2.],
       [3.],
       [9.],
       [5.],
       [6.],
       [7.],
       [8.],
       [9.]])

In [51]:
# Notice that (9,1) is not the same as (1,9) !!!
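
To make the difference concrete, here is a small sketch that reshapes the same nine values three ways. The names `column`, `row`, and `square` are assumptions for illustration; `-1` asks NumPy to infer that dimension.

```python
import numpy as np

values = np.arange(9)

column = values.reshape((9, 1))    # 9 rows, 1 column
row = values.reshape((1, 9))       # 1 row, 9 columns
square = values.reshape((3, -1))   # -1 is inferred as 3, since 9 / 3 = 3

print(column.shape, row.shape, square.shape)
```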


#### In-Class Exercise #2 - Create an array of the range 0 through 15 and reshape it into a 4x4 matrix

In [53]:
arr1 = np.array(np.arange(16))
arr1.reshape((4,4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

#### Complex Indexing & Assigning

In [61]:
# reset matrix with zeros
M = np.zeros((4,4))

# set the first element in each row (the first column) to 1
M[:,0] = 1
M

# set all elements in the third row (index 2) to 5
M[2, :] = 5
M
# Reset the matrix 
M = np.zeros((4,4))

# We can even do more complicated stuff

# set the second and third columns to 2
M[:, 1:3] = 2
M
# create an NDArray of range 4


# set the first row to that range


array([[0., 2., 2., 0.],
       [0., 2., 2., 0.],
       [0., 2., 2., 0.],
       [0., 2., 2., 0.]])
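
One possible way to carry out the last two steps (create an NDArray of range 4, then set the first row to it) is sketched below. It starts from a fresh zero matrix, so it is an illustration and not the output shown above.

```python
import numpy as np

M = np.zeros((4, 4))

# the right-hand side has 4 elements, matching the 4 columns of the row
M[0, :] = np.arange(4)

print(M)
```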

#### Elementwise Multiplication

<p>As long as the shapes of NDArrays are 'compatible', they can be multiplied elementwise, broadcast, used in inner products, and much more. 'Compatible' in this context can mean compatible in the linear algebraic sense, i.e. for inner products and other matrix multiplication, or simply sharing a dimension in such a manner that broadcasting 'makes sense'. Here are some examples of this:</p>

In [63]:
# reset matrix back to zero
M = np.zeros((4,4))
# create vector of range 4 and reshape to (4, 1), i.e. a column
V = np.arange(4).reshape((4,1))
# the column V is broadcast across every column of M
M = M + V
M

# print before


# print after summation


# notice that 1, 4 shape is different than 4, 1 shape


array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.]])
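
The cell above shows broadcasting with addition. As a complementary sketch, the example below contrasts elementwise multiplication (`*`) with the matrix product (`@`), and then multiplies a (3, 1) column by a (1, 3) row. All array names here are assumptions for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20], [30, 40]])

elementwise = A * B       # multiplies matching positions
matrix_product = A @ B    # rows of A times columns of B

col = np.arange(3).reshape((3, 1))   # shape (3, 1)
row = np.arange(3).reshape((1, 3))   # shape (1, 3)
outer = col * row                    # broadcasts both operands to shape (3, 3)

print(elementwise)
print(matrix_product)
print(outer)
```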

#### np.where() <br>
<p>np.where() acts like an if statement applied to an entire NDArray: given a condition, it returns the indices of every element that satisfies it.</p>

In [66]:
# find the indices of the elements that meet a condition
np.where(M == 0)

# convert type to float
M = M.astype('float64')

# change all elements within condition to value
M[np.where(M==1)] = 0
M

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.]])
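
`np.where()` also has a three-argument form, `np.where(condition, x, y)`, which picks from `x` where the condition holds and from `y` elsewhere. A minimal sketch, rebuilding a matrix like the one above under the assumed name `M`:

```python
import numpy as np

M = np.zeros((4, 4)) + np.arange(4).reshape((4, 1))   # rows of 0s, 1s, 2s, 3s

rows, cols = np.where(M == 2)       # one index array per dimension
capped = np.where(M > 1, 1.0, M)    # values above 1 become 1, the rest are kept

print(rows, cols)
print(capped)
```

Unlike the in-place assignment above, the three-argument form returns a new array and leaves `M` untouched.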

#### Random Sampling <br>
<p>NumPy provides machinery to work with random numbers - something often needed in a broad spectrum of data science applications.</p>

In [71]:
# np.random.uniform()

# A single call generates a single random number


# You can also pass some bounds


# You can also generate a bunch of random numbers all at once


# Even matrices with weird shapes
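
One way the steps outlined in the comments above could be filled in is sketched below. The seed, bounds, and shapes are arbitrary choices made for this example, not values from the original notebook.

```python
import numpy as np

np.random.seed(0)   # fixed seed so reruns give the same numbers (an assumption)

single = np.random.uniform()                        # one float in [0, 1)
bounded = np.random.uniform(5, 10)                  # one float in [5, 10)
many = np.random.uniform(0, 1, size=5)              # five numbers at once
matrix = np.random.uniform(-1, 1, size=(2, 3, 4))   # any shape works

print(single, bounded)
print(many.shape, matrix.shape)
```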


## Working With CSV's

#### Imports

In [96]:
import csv
import numpy as np
from datetime import datetime
from datetime import timedelta

#### Reading a CSV

In [103]:
def open_csv(filename, d=','):
    data = []
    # newline='' lets the csv module handle line endings itself
    with open(filename, encoding='utf-8', newline='') as bm_data:
        info = csv.reader(bm_data, delimiter=d)
        # append each row
        for row in info:
            data.append(row)
    return data

csv_data = open_csv('boston_marathon2017.csv')
print(csv_data[2:4])

# by convention, capitalized variable names mark constants in Python
FIELDNAMES = ['Bib', 'First Name', 'Last Name', 'Age', 'M/F', 'City', 'State', 'Country', 'Citizen', '5K', '10K', '15K', '20K', 
               'Half', '25K', '30K', '35K', '40K', 'Pace', 'Official Time', 'Overall', 'Gender', 'Division']
# define with tuples
DATATYPES = [('bib', 'i'), ('first_name', '|S25'), ('last_name', '|S25'), ('age', 'i'), ('sex', '|S25'), ('city', '|S25'), 
             ('state', '|S25'), 
             ('country', '|S25'), ('citizen', '|S25'), ('5k', 'object'), ('10k', 'object'), ('15k', 'object'), 
             ('20k', 'object'), ('half', 'object'), ('25k', 'object'), ('30k', 'object'), ('35k', 'object'), 
             ('40k', 'object'), ('pace', 'object'), ('overall_time', 'object'), ('overall', 'i'), ('gender', 'i'),
             ('division', 'i')]

[['17', 'Rupp, Galen', '30', 'M', 'Portland', 'OR', 'USA', '', '0:15:24', '0:30:27', '0:45:44', '1:01:15', '1:04:35', '1:16:59', '1:33:01', '1:48:19', '2:03:14', '0:04:58', '2:09:58', '2', '2', '2'], ['23', 'Osako, Suguru', '25', 'M', 'Machida-City', '', 'JPN', '', '0:15:25', '0:30:29', '0:45:44', '1:01:16', '1:04:36', '1:17:00', '1:33:01', '1:48:31', '2:03:38', '0:04:59', '2:10:28', '3', '3', '3']]


#### Loading a CSV's Data 

In [72]:
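def load_csv(filename, d=','):
    # skip the header row and read the first 23 comma-separated columns;
    # invalid_raise=False skips rows with the wrong number of columns instead of raising
    my_csv = np.genfromtxt(filename, delimiter=d, skip_header=1, usecols=np.arange(0, 23),
                           invalid_raise=False, names=FIELDNAMES, dtype=DATATYPES)
    return my_csv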

#### Checking Number of Records

In [74]:
my_csv = load_csv('boston_marathon2017.csv')
print(my_csv[1])

(17, b'"Rupp', b' Galen"', 30, b'M', b'Portland', b'OR', b'USA', b'', b'0:15:24', b'0:30:27', b'0:45:44', b'1:01:15', b'1:04:35', b'1:16:59', b'1:33:01', b'1:48:19', b'2:03:14', b'0:04:58', b'2:09:58', 2, 2, 2)
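
To count the records themselves, `len()` or `.shape` can be used on the loaded array. The sketch below is self-contained: it reads a hypothetical three-row sample from memory instead of `boston_marathon2017.csv`, and it lets `np.genfromtxt` infer names and types, which differs from the `load_csv` function above.

```python
import io
import numpy as np

# hypothetical sample standing in for the real file
sample = io.StringIO(
    "bib,age,pace\n"
    "17,30,0:04:58\n"
    "23,25,0:04:59\n"
    "11,31,0:04:57\n"
)
records = np.genfromtxt(sample, delimiter=',', names=True, dtype=None, encoding='utf-8')

n_records = len(records)   # number of data rows, header excluded
print(n_records, records.shape, records.dtype.names)
```

A structured array loaded this way is one-dimensional, so its shape is `(n_records,)` and each column is reached by name, e.g. `records['age']`.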


In [75]:
my_csv[1][2]

b' Galen"'

In [76]:
type(my_csv[1][2])

numpy.bytes_

In [80]:
csv_data[2]

['17',
 'Rupp, Galen',
 '30',
 'M',
 'Portland',
 'OR',
 'USA',
 '',
 '0:15:24',
 '0:30:27',
 '0:45:44',
 '1:01:15',
 '1:04:35',
 '1:16:59',
 '1:33:01',
 '1:48:19',
 '2:03:14',
 '0:04:58',
 '2:09:58',
 '2',
 '2',
 '2']

In [81]:
datetime_object = datetime.strptime('0:15:24', '%H:%M:%S')

In [88]:
datetime_object

datetime.datetime(1900, 1, 1, 0, 15, 24)

In [89]:
datetime_object.hour

0

In [91]:
datetime_object.minute

15

In [93]:
datetime_object.second

24

In [98]:
duration = timedelta(hours=datetime_object.hour, minutes=datetime_object.minute, seconds=datetime_object.second)
print(duration)

0:15:24
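
The parsing steps above can be wrapped into a small helper so that split times can be added and averaged. `parse_duration` is a hypothetical name introduced for this sketch, and the two sample strings are taken from the 5K and 10K splits printed earlier.

```python
from datetime import datetime, timedelta

def parse_duration(text):
    """Convert an 'H:MM:SS' string into a timedelta (hypothetical helper)."""
    parsed = datetime.strptime(text, '%H:%M:%S')
    return timedelta(hours=parsed.hour, minutes=parsed.minute, seconds=parsed.second)

splits = [parse_duration(s) for s in ('0:15:24', '0:30:27')]

# timedelta objects support addition and division, unlike the raw strings
average = sum(splits, timedelta()) / len(splits)
print(average, average.total_seconds())
```

Because `%H` only accepts hours from 0 to 23, this approach assumes every time is under 24 hours.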


In [102]:
print(type(duration))

<class 'datetime.timedelta'>
