# Data Science Numpy & CSV's

## Tasks Today:

1) <b>Numpy</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Python List Comparison <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) In-Class Exercise #1 <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Creating an NDArray <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.array() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.zeros() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.ones() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.arange() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Making Lists into NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Performing Calculations on NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Summation <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Difference <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Multiplication <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Division <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Numpy Subsetting <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) Multi-dimensional Arrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) Indexing NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; i) Checking NDArray Type <br>
 &nbsp;&nbsp;&nbsp;&nbsp; j) Altering NDArray Type <br>
 &nbsp;&nbsp;&nbsp;&nbsp; k) Checking the Shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp; l) Altering the Shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp; m) In-Class Exercise #2 <br>
 &nbsp;&nbsp;&nbsp;&nbsp; n) Complex Indexing & Assigning <br>
 &nbsp;&nbsp;&nbsp;&nbsp; o) Elementwise Multiplication <br>
 &nbsp;&nbsp;&nbsp;&nbsp; p) np.where() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; q) Random Sampling <br>

2) <b>Working With CSV's</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Imports <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Reading a CSV <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) Loading a CSV's Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Checking Number of Records <br>
 
3) <b>Exercises</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) #1 - Calculate BMI with NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) #2 - Find the Average Sum of Marathon Runners <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) #3 - Random Matrix Function <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) #4 - Comparing Boston Red Sox Hitting Numbers <br>

## Numpy <br>

<p>NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.</p>
<ul>
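    <li>Shape = Number of rows & columns (the size of each dimension)</li>
    <li>Matrix = Two-dimensional array of rows and columns</li>
    <li>Vector = One-dimensional array of values to be applied (same vector as the one used in physics)</li>
    <li>Array = Similar to a list, but every element shares one data type</li>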
</ul>

#### Python List Comparison

<p>Lists are flexible, dynamic Python objects that do their job quite well. But they do not support some mathematical operations in an intuitive way. Consider the sum and difference of two lists, $l_1$ and $l_2$:</p>

In [1]:
# create two lists and sum both of them together (results may not be what you expect)

aList = [2, 3, 4]
bList = [4, 4, 4]

result = aList + bList

print(result)

result = aList - bList

print(result)

[2, 3, 4, 4, 4, 4]


TypeError: unsupported operand type(s) for -: 'list' and 'list'

<p>If we wanted to sum lists elementwise, we could write our own function that does the job entirely within the framework of Python.</p>

#### In-Class Exercise #1 - Write a function that sums two lists elementwise <br>
<p>Ex: [2, 3, 4] + [1, 5, 2] = [3, 8, 6]</p>

In [2]:
l1 = [2, 3, 4]
l2 = [5, 6, 7]

# assume lists are same length

def sumLists(aList, bList):
    if len(aList) != len(bList):
        return None
    
    results = []
    
    for i in range(len(aList)):
        results.append(aList[i] + bList[i])
        
    return results


sumLists(l1, l2)

[7, 9, 11]
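The same function can be written more compactly. The sketch below is one possible variant, not the required solution: `sum_lists` is a hypothetical name, and it keeps the behavior of returning `None` when the lengths differ.

```python
def sum_lists(a_list, b_list):
    """Return the elementwise sum of two equal-length lists."""
    if len(a_list) != len(b_list):
        return None
    # zip pairs up the elements that share an index
    return [a + b for a, b in zip(a_list, b_list)]


print(sum_lists([2, 3, 4], [1, 5, 2]))   # [3, 8, 6]
print(sum_lists([2, 3, 4], [1, 5]))      # None
```
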

We would have to write a similar function for every operator that we could consider for list arithmetic. This is time consuming and inefficient. Moreover, once the lists in question become nested, mimicking the behavior of true matrices, the problem gets worse. Complicated indexing is necessary just to allow for the most basic matrix operations common throughout science and engineering. Imagine writing a matrix multiplication function using Python syntax in a general way, such that it returns a matrix-matrix or matrix-vector product:

\begin{align}
(n \times x) \times (x \times m) \rightarrow (n \times m)
\end{align}

\begin{align}
\begin{bmatrix}
c_{0,0} & ... & c_{0,m} \\
\vdots & \ddots & \vdots \\
c_{n,0} & ... & c_{n,m}
\end{bmatrix}
=
\begin{bmatrix}
a_{0,0} & ... & a_{0,x} \\
\vdots & \ddots & \vdots \\
a_{n,0} & ... & a_{n,x}
\end{bmatrix}
\begin{bmatrix}
b_{0,0} & ... & b_{0,m} \\
\vdots & \ddots & \vdots \\
b_{x,0} & ... & b_{x,m}
\end{bmatrix}
\end{align}

Let us instantiate a matrix $\mathcal{M}$ and a vector $\vec{v}$ and write a function that does the multiplication ourselves.

In [6]:
# think of a vector as the variables to be applied in a mathematical process

def matrix_multiply(A, B):
    # the result has one row per row of A and one column per column of B
    ret = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]

    inner_dim = len(A[0])    # columns of A, which must equal the rows of B
    n_dim = len(ret)
    m_dim = len(ret[0])

    for i in range(n_dim):
        for j in range(m_dim):
            element = 0
            for x in range(inner_dim):
                element += A[i][x] * B[x][j]
            ret[i][j] = element

    return ret


M = [[0,1,0],[0,2,0],[0,3,0]]
v = [[1],[2],[3]]

print(matrix_multiply(M, v))

[[2], [4], [6]]
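For comparison, here is a minimal sketch of the same product using NumPy, which the following sections introduce. It reuses the values of `M` and `v` from the cell above and relies on the `@` operator, which performs matrix multiplication on NDArrays.

```python
import numpy as np

M = np.array([[0, 1, 0], [0, 2, 0], [0, 3, 0]])
v = np.array([[1], [2], [3]])

# the @ operator performs matrix multiplication on NDArrays
product = M @ v

print(product.tolist())   # [[2], [4], [6]]
print(product.shape)      # (3, 1)
```
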


#### Importing

In [3]:
# always import as np, standard across all of data science
import numpy as np

#### Creating an NDArray <br>
<p>NumPy is based around a class called the $\textit{NDArray}$, which is a flexible vector / matrix class that implements the intuitive matrix and vector arithmetic lacking in basic Python. Let's start by creating some NDArrays:</p>

###### - np.array()

In [4]:
# can be created from a variable or a list declared in place

arr1 = np.array([1, 2232039, 3])

print(arr1)
print(type(arr1))

[      1 2232039       3]
<class 'numpy.ndarray'>


###### - np.zeros()

In [5]:
# takes in a shape

arr1 = np.zeros((3, 3))

print(arr1)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


###### - np.ones()

In [6]:
# can define the data type

ones = np.ones((3, 3), int)

print(ones)

[[1 1 1]
 [1 1 1]
 [1 1 1]]


###### - np.arange()

In [7]:
# creates an NDArray from start up to (but not including) stop, works the same as range()

arr1 = np.arange(0, 10, 1)

print(arr1)

[0 1 2 3 4 5 6 7 8 9]


###### - Making Lists into NDArrays

In [8]:
aList = [2, 4, 6]

arr1 = np.array(aList)

print(aList)
print(type(aList))
print(arr1)
print(type(arr1))

[2, 4, 6]
<class 'list'>
[2 4 6]
<class 'numpy.ndarray'>


#### Performing Calculations on NDArrays <br>
<p>Arithmetic operators applied to two NDArrays of the same shape perform elementwise calculations</p>

###### - Summation

In [9]:
arr1 = np.array([2, 4, 6])
arr2 = np.array([1, 2, 3])

result = arr1 + arr2

print(result)

[3 6 9]


###### - Difference

In [10]:
result = arr1 - arr2

print(result)

[1 2 3]


###### - Multiplication

In [11]:
result = arr1 * arr2

print(result)

[ 2  8 18]


###### - Division

In [12]:
result = arr1 / arr2

print(result)

[2. 2. 2.]
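The operators also accept a single number on one side. The short sketch below contrasts this with the list behavior from the start of the notebook; the variable names are only for illustration.

```python
import numpy as np

a_list = [2, 4, 6]
arr = np.array(a_list)

# multiplying a list by an integer repeats the list
print(a_list * 2)    # [2, 4, 6, 2, 4, 6]

# multiplying an NDArray by a number scales every element
print(arr * 2)       # [ 4  8 12]

# the same holds for other operators, e.g. exponentiation
print(arr ** 2)      # [ 4 16 36]
```
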


#### Numpy Subsetting

In [17]:
# return true/false matrix, or only elements that meet condition

print(arr1)
print(arr1 >= 3)
print(arr1)

print('\n\n')

print(arr1)
print(arr1[arr1 >= 3])    # acts like the built-in filter() function
print(arr1)

[2 4 6]
[False  True  True]
[2 4 6]



[2 4 6]
[4 6]
[2 4 6]
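Conditions can also be combined. The sketch below uses a hypothetical array `arr`; note that each comparison needs its own parentheses, and that `&`, `|` and `~` are used in place of `and`, `or` and `not`.

```python
import numpy as np

arr = np.array([2, 4, 6, 8, 10])

# each comparison needs its own parentheses; & means "and", | means "or"
between = arr[(arr >= 4) & (arr < 10)]
outside = arr[(arr < 4) | (arr >= 10)]

# ~ negates a boolean mask
not_four = arr[~(arr == 4)]

print(between)    # [4 6 8]
print(outside)    # [ 2 10]
print(not_four)   # [ 2  6  8 10]
```
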


#### Multi-dimensional Arrays <br>
<p>NumPy seamlessly supports multidimensional arrays and matrices of arbitrary dimension without nesting NDArrays. NDArrays themselves are flexible and extensible and may be defined with such dimensions, with a rich API of common functions to facilitate their use. Let's start by building a two dimensional 3x3 matrix by conversion from a nested group of core Python lists $M = [l_0, l_1, l_2]$:</p>

In [20]:
aList = [0, 1, 2]
bList = [3, 4, 5]
cList = [6, 7, 8]

# first step: convert lists into Matrix to work with
M = [aList, bList, cList]

print(f'Nested List Structure: \n {M}')
print(f'Type: {type(M)}')

print('\n\n')

# second step: convert into Numpy array
M = np.array(M)    # dimension inferred

print(f'Nested Array Structure: \n {M}')
print(f'Type: {type(M)}')

Nested List Structure: 
 [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
Type: <class 'list'>



Nested Array Structure: 
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
Type: <class 'numpy.ndarray'>


#### Indexing NDArrays <br>
<p>Similar to indexing lists within lists; however, NumPy takes the row and column together in one pair of brackets, separated by a comma. Use M[1, 2] to access the second row, third element.</p>

In [24]:
print(f'Middle element of Matrix is: {M[1, 1]} \n Type: {type(M)}')     # row, column

Middle element of Matrix is: 4 
 Type: <class 'numpy.ndarray'>


#### Assigning Values in NDArrays

In [28]:
M[1, 1] = 10

print(M)

print('\n\n')

# change the value to 1.5
M[1, 1] = 1.5

print(f'Nested Array Structure After 1.5 Assignment: \n{M}')

# notice that the element we changed did not become 1.5, but rather 1.
# This is because every NDArray has a single dtype, which must be consistent throughout the entire array

[[ 0  1  2]
 [ 3 10  5]
 [ 6  7  8]]



Nested Array Structure After 1.5 Assignment: 
[[0 1 2]
 [3 1 5]
 [6 7 8]]


<p>Notice above how we ended up with a 1 in the target element's place. This is a data type issue: the array holds integers, so 1.5 was truncated to 1. Every NDArray has a .dtype attribute that reports its data type, as well as an .astype() method for casting between data types:</p>

#### Checking NDArray Type

In [29]:
# .dtype

print(f'Type of Matrix is: {M.dtype}')

Type of Matrix is: int32


#### Altering NDArray Type

In [31]:
# .astype() 

M = M.astype(np.float64)

t = M.dtype

print(f'Data type of Matrix is: {t}')

print(f'Matrix: \n{M}')

M[1, 1] = 1.5

print('\n\n')

print(f'Matrix: \n{M}')

Data type of Matrix is: float64
Matrix: 
[[0. 1. 2.]
 [3. 1. 5.]
 [6. 7. 8.]]



Matrix: 
[[0.  1.  2. ]
 [3.  1.5 5. ]
 [6.  7.  8. ]]
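One detail worth seeing in isolation is that `.astype()` returns a new array rather than converting in place, which is why the cell above reassigns `M`. The sketch below uses hypothetical arrays `ints`, `floats` and `direct`, and also shows that the dtype can be set when the array is created.

```python
import numpy as np

ints = np.array([[0, 1, 2], [3, 4, 5]])

# astype() returns a new array; the original keeps its integer dtype
floats = ints.astype(np.float64)
floats[0, 0] = 1.5

print(floats.dtype)   # float64
print(floats[0, 0])   # 1.5
print(ints[0, 0])     # 0

# the dtype can also be chosen when the array is created
direct = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.float64)
print(direct.dtype)   # float64
```
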


#### Checking the Shape <br>
<p>The behavior and properties of an NDArray are often sensitively dependent on the $\textit{shape}$ of the NDArray itself. The shape of an array can be found through the .shape attribute, which is a tuple containing the array's dimensions:</p>

In [32]:
# .shape

print(f'The shape of our 2D Matrix is: {M.shape}')
print(f'The type of the shape returned is: {type(M.shape)}')

The shape of our 2D Matrix is: (3, 3)
The type of the shape returned is: <class 'tuple'>


#### Altering the Shape <br>
<p>As long as the number of elements remains fixed, we can reshape NDArrays at will:</p>

In [33]:
# .reshape()

M = M.reshape(9, 1)

print(M)

[[0. ]
 [1. ]
 [2. ]
 [3. ]
 [1.5]
 [5. ]
 [6. ]
 [7. ]
 [8. ]]


In [35]:
# keep in mind that the shape numbers do matter, (9, 1) is different than (1, 9)

M = M.reshape(1,9)

print(M)

M = M.reshape(3, 3)

print(M)

[[0.  1.  2.  3.  1.5 5.  6.  7.  8. ]]
[[0.  1.  2. ]
 [3.  1.5 5. ]
 [6.  7.  8. ]]
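A short sketch of the "number of elements remains fixed" rule, using a hypothetical 12-element array: passing `-1` lets NumPy work out one dimension, and a shape that needs a different number of elements raises a `ValueError`.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to work out that dimension from the element count
grid = arr.reshape(3, -1)
print(grid.shape)    # (3, 4)

# a shape that needs a different number of elements is rejected
try:
    arr.reshape(5, 3)
except ValueError as err:
    print(f'Cannot reshape: {err}')
```
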


#### In-Class Exercise #2 - Create the following matrix<br>
<p>[[&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3]<br>
&nbsp;[&nbsp;&nbsp;&nbsp;4&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;6&nbsp;&nbsp;&nbsp;7]<br>
&nbsp;[&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;9&nbsp;10&nbsp;11]<br>
&nbsp;[&nbsp;12&nbsp;13&nbsp;14&nbsp;15]]</p>

In [37]:
arr1 = np.arange(16).reshape(4,4)

print(arr1)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


#### Complex Indexing & Assigning

In [45]:
#####################
# M[rows, columns]
####################


# create an array of 4, 4 zeros
M = np.zeros((4, 4))
print(f'Instantiated Matrix: \n{M}')

# Assign the number 1 to every row's first column
M[:, 0] = 1
print(f'\nMatrix after assigning the value of 1 to first column: \n{M}')

# Assign the number 5 to every column in the third row
M[2, :] = 5
print(f'\nMatrix after assigning the value of 5 to third row: \n{M}')

# Reset the Matrix back to 0
M = M * 0
print(f'\nMatrix after setting back to 0: \n{M}')

# Set every row's second and third column to the number 2
M[:, 1:3] = 2
print(f'\nMatrix after assigning the value of 2 to second and third column: \n{M}')

# Initialize a vector with np.arange(4), reset the matrix to zero and add the vector to every row of the matrix
v = np.arange(4)
M = M * 0

print(f'Vector: {v}')

M = M + v

print(f'\nMatrix after adding vector: \n{M}')

Instantiated Matrix: 
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Matrix after assigning the value of 1 to first column: 
[[1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]

Matrix after assigning the value of 5 to third row: 
[[1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [5. 5. 5. 5.]
 [1. 0. 0. 0.]]

Matrix after setting back to 0: 
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Matrix after assigning the value of 2 to second and third column: 
[[0. 2. 2. 0.]
 [0. 2. 2. 0.]
 [0. 2. 2. 0.]
 [0. 2. 2. 0.]]
Vector: [0 1 2 3]

Matrix after adding vector: 
[[0. 1. 2. 3.]
 [0. 1. 2. 3.]
 [0. 1. 2. 3.]
 [0. 1. 2. 3.]]


#### Elementwise Multiplication

<p>As long as the shapes of NDArrays are 'compatible', they can be multiplied elementwise, broadcast, used in inner products, and much more. 'Compatible' in this context can mean compatible in the linear algebraic sense, i.e. for inner products and other matrix multiplication, or simply sharing a dimension in such a manner that broadcasting 'makes sense'. Here are some examples of this:</p>

In [51]:
M = np.ones((4, 4))

v = np.arange(4).reshape(4, 1)

print(f'Matrix: \n{M}')
print(f'\nVector: \n{v}')

M = M * v

print('\n\n')

print(f'Matrix after vector multiplication: \n{M}')

Matrix: 
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

Vector: 
[[0]
 [1]
 [2]
 [3]]



Matrix after vector multiplication: 
[[0. 0. 0. 0.]
 [1. 1. 1. 1.]
 [2. 2. 2. 2.]
 [3. 3. 3. 3.]]
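To make 'compatible' a little more concrete, the sketch below multiplies the same matrix by a vector of shape (4,) and by one of shape (4, 1), then tries a vector whose length does not match. The names `row`, `col`, `by_row` and `by_col` are only for illustration.

```python
import numpy as np

M = np.ones((4, 4))

row = np.arange(4)                  # shape (4,)
col = np.arange(4).reshape(4, 1)    # shape (4, 1)

# a (4,) vector lines up with the columns, so every row is scaled the same way
by_row = M * row

# a (4, 1) vector lines up with the rows, so each row gets its own factor
by_col = M * col

print(by_row[1])    # [0. 1. 2. 3.]
print(by_col[1])    # [1. 1. 1. 1.]

# shapes that do not line up raise an error
try:
    M * np.arange(3)
except ValueError as err:
    print(f'Incompatible shapes: {err}')
```
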


#### np.where() <br>
<p>Works like an if statement applied to the entire NDArray: given a condition, np.where() returns the indices of every element where the condition is True</p>

In [56]:
print(M)

print('\n')

# the indices (row indices, column indices) where the condition is met
print(np.where(M == 2))

# store the coordinates into variables
y, x = np.where(M == 2)

print(f'\n X Coordinates: {x} \t Y Coordinates: {y}')


# use the where method to assign values at the proper location where the condition is met
M[np.where(M == 2)] = 100

print(f'\nMatrix: \n{M}')

[[0. 0. 0. 0.]
 [1. 1. 1. 1.]
 [2. 2. 2. 2.]
 [3. 3. 3. 3.]]


(array([2, 2, 2, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))

 X Coordinates: [0 1 2 3] 	 Y Coordinates: [2 2 2 2]

Matrix: 
[[  0.   0.   0.   0.]
 [  1.   1.   1.   1.]
 [100. 100. 100. 100.]
 [  3.   3.   3.   3.]]
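np.where() also has a three-argument form that reads like an if/else over the whole array. The sketch below rebuilds the matrix from the previous section and returns a new array instead of assigning in place; `replaced` is a hypothetical name.

```python
import numpy as np

M = np.ones((4, 4)) * np.arange(4).reshape(4, 1)

# where the condition is True take 100, otherwise keep the value from M
replaced = np.where(M == 2, 100, M)

print(replaced[2])   # [100. 100. 100. 100.]
print(M[2])          # [2. 2. 2. 2.]  (M itself is unchanged)
```
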


#### Random Sampling <br>
<p>NumPy provides machinery to work with random numbers - something often needed in a broad spectrum of data science applications.</p>

In [70]:
# np.random.uniform()
# np.random.seed()

np.random.seed(321)      # seeds the generator so that all of us get the same random numbers

# a single call generates a single random number between 0 and 1
num = np.random.uniform()
print('Here is a random number between 0 and 1: %s' % num)

# generating a number between 0 and 1 million
num = np.random.uniform(0, 1e6)
print('Here is a random number between 0 and 1 million: %s' % num)

# generating a bunch of numbers at once
nums = np.random.uniform(0, 10, 3)
print('Here are 3 random numbers between 0 and 10: %s' % nums)

# generating a random matrix
M = np.random.uniform(0, 10, (3, 3))
print('3x3 Random Matrix: \n%s' % M)

Here is a random number between 0 and 1: 0.8859479412560747
Here is a random number between 0 and 1 million: 77912.35871612435
Here are 3 random numbers between 0 and 10: [9.79646157 2.47671458 7.52884718]
3x3 Random Matrix: 
[[5.26675636 9.07553753 8.84070297]
 [0.8926896  5.17344596 3.43621292]
 [2.12293694 3.60673442 2.70775173]]
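The effect of the seed can be checked directly. In the sketch below, seeding twice with the same value produces the same numbers. The last lines show `np.random.default_rng()`, a Generator-based alternative that is not used elsewhere in this notebook and is included only as a pointer.

```python
import numpy as np

np.random.seed(321)
first = np.random.uniform(0, 10, 3)

# seeding again with the same value restarts the same sequence
np.random.seed(321)
second = np.random.uniform(0, 10, 3)

print(np.array_equal(first, second))   # True

# newer NumPy code often uses a Generator object instead of the global seed
rng = np.random.default_rng(321)
sample = rng.uniform(0, 10, (3, 3))
print(sample.shape)                    # (3, 3)
```
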


## Working With CSV's

#### Imports

In [145]:
import csv
import numpy as np
from datetime import datetime

#### Reading a CSV

In [147]:
def open_csv(f, d=','):
    # define an empty list to store the data
    data = []
    
    # use the 'with' keyword to open and read the information
    with open(f, encoding='utf-8') as mData:
        info = csv.reader(mData, delimiter=d)
        
        # loop over info and append to data, necessary because info is a reader object
        for row in info:
            data.append(row)
            
    return data

csv_data = open_csv('files/redsox_2017_hitting.txt')

print(csv_data[0])

['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB']
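Every value that `csv.reader` returns is a string, and columns have to be looked up by position. As a possible alternative, the sketch below uses `csv.DictReader`, which keys each row by the header names. The sample text is made up for illustration and stands in for a file on disk.

```python
import csv
import io

# hypothetical sample data, standing in for a file on disk
sample = 'Name,Age,HR\nPlayer A,25,10\nPlayer B,30,20\n'

with io.StringIO(sample) as f:
    reader = csv.DictReader(f)
    rows = list(reader)

# every value read by the csv module is a string, so convert before doing math
total_hr = sum(int(row['HR']) for row in rows)

print(rows[0]['Name'])   # Player A
print(total_hr)          # 30
```
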


#### Loading a CSV's Data 

In [151]:
FIELDS = ['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 
          'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB']

DATATYPES = [('rk', 'i'), ('pos', '|S25'), ('name', '|S25'), ('age', 'i'), ('g', 'i'), ('pa', 'i'), ('ab', 'i'), ('r', 'i'),
                ('h', 'i'), ('2b', 'i'), ('3b', 'i'), ('hr', 'i'), ('rbi', 'i'), ('sb', 'i'), ('cs', 'i'), ('bb', 'i'),
                 ('so', 'i'), ('ba', 'f'), ('obp', 'f'), ('slg', 'f'), ('ops', 'f'), ('ops+', 'f'), ('tb', 'i'), ('gdp', 'i'),
                 ('hbp', 'i'), ('sh', 'i'), ('sf', 'i'), ('ibb', 'i')
            ]

def load_data(f, d=','):
    # instead of loading the csv normally, we'll load it into a numpy array to calculate results on the data
    data = np.genfromtxt(f,delimiter=d,skip_header=1, usecols=np.arange(0,28), invalid_raise=False, 
                         names=FIELDS, dtype=DATATYPES)
    
    return data

bos_17 = load_data('files/redsox_2017_hitting.txt')

print(len(bos_17))
print(bos_17['HR'])

23
[ 5 22  7 10 10 20 17 24 23  7  7  4  8  0  0  4  0  0  0  0  0  0  0]
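For smaller files, `np.genfromtxt` can also take the field names from the header row and infer the column types, instead of being given the FIELDS and DATATYPES lists. The sketch below is an alternative under that assumption, not the approach used above, and the sample text is made up for illustration.

```python
import io
import numpy as np

# hypothetical sample data, standing in for a file on disk
sample = 'Name,Age,HR,BA\nPlayer A,25,10,0.250\nPlayer B,30,20,0.300\n'

# names=True reads the field names from the first row,
# dtype=None lets NumPy infer a type for each column
data = np.genfromtxt(io.StringIO(sample), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.dtype.names)    # ('Name', 'Age', 'HR', 'BA')
print(data['HR'].sum())    # 30
print(data['BA'].mean())
```
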


#### Summing the top 5 hitters for HR's

In [160]:
# small in-class exercise

hrs = bos_17['HR']

print(hrs)

sorted_hrs = sorted(hrs)

fifth = sorted_hrs[-5]

top_hitters = bos_17[np.where(bos_17['HR'] >=fifth)]

print(top_hitters['Name'])
# print(top_hitters['HR'].sum())

[ 5 22  7 10 10 20 17 24 23  7  7  4  8  0  0  4  0  0  0  0  0  0  0]
[b'Mitch Moreland' b'Andrew Benintendi' b'Jackie Bradley Jr.'
 b'Mookie Betts' b'Hanley Ramirez']
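The sum itself can be sketched with `np.sort`. To keep the example self-contained, the home run totals below are copied from the output above rather than read from the file.

```python
import numpy as np

# home run totals copied from the output above
hrs = np.array([5, 22, 7, 10, 10, 20, 17, 24, 23, 7, 7, 4, 8,
                0, 0, 4, 0, 0, 0, 0, 0, 0, 0])

# np.sort returns ascending order, so the last five are the largest
top_five = np.sort(hrs)[-5:]

print(top_five)          # [17 20 22 23 24]
print(top_five.sum())    # 106
```
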


# Exercises To Complete.... <br>
<p>Given in separate file after completion of this file</p>