# Numpy

In describing numpy arrays, let's start with Python lists and why a data scientist might need something better.  As a reminder, a list is a container object that can hold different data types or objects.

The numpy package should come with the Anaconda distribution.  If it is missing, install it using the **pip install** command (use whichever of the two commands matches your Python setup).
```
$ pip install numpy
$ pip3 install numpy
```

In [1]:
h = [67, 70, 65, 63, 72]
w = [160, 150, 165, 120, 205]

Body Mass Index (BMI)
Formula: weight (lb) / [height (in)]<sup>2</sup> × 703

In [2]:
bmi = w / (h**2) * 703

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

Lists do not support element-by-element arithmetic, which is why the expression above raises a TypeError.  To calculate the BMI for each person, we would have to write a loop.  That is slow for large data sets and tedious to write.  A more elegant solution is to use Numpy arrays.
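For comparison, here is a minimal sketch of the loop-based approach using plain lists.  The name `bmi_list` is chosen for illustration only.

```python
# Heights in inches and weights in pounds, as in the lists above
h = [67, 70, 65, 63, 72]
w = [160, 150, 165, 120, 205]

# Pair each weight with its height and apply the BMI formula one item at a time
bmi_list = [wt / ht**2 * 703 for wt, ht in zip(w, h)]
print(bmi_list)
```

This works, but every calculation needs its own loop or comprehension.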

Numpy can perform calculations over entire arrays at once, which makes them quick to run and easy to write.

In [3]:
import numpy as np

In [4]:
np_ht = np.array([67, 70, 65, 63, 72])

In [5]:
np_ht

array([67, 70, 65, 63, 72])

In [7]:
np_wt = np.array(w)

In [8]:
np_wt

array([160, 150, 165, 120, 205])

In [9]:
bmi = np_wt / (np_ht**2) * 703

In [10]:
bmi

array([25.05680552, 21.52040816, 27.45443787, 21.25472411, 27.79996142])

Numpy calculations are fast and efficient because every item in an array has the same data type.  If you mix types, numpy converts all items to one common type.

In [11]:
same = np.array([1.2, "is", True])

In [12]:
same

array(['1.2', 'is', 'True'], dtype='<U32')
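The same conversion happens with numeric types.  The sketch below inspects the `dtype` attribute of a few small arrays; the variable names are illustrative, and the exact integer width depends on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])     # integers mixed with a float: promoted to float
flags = np.array([True, 2, 3])    # a boolean mixed with integers: promoted to integer

print(ints.dtype)
print(mixed.dtype)
print(flags.dtype)
print(flags)  # True is stored as 1
```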

## List vs np.array

In [13]:
python_list = [1, 2, 3, 4]

In [14]:
np_array = np.array([1, 2, 3, 4])

In [15]:
python_list + python_list

[1, 2, 3, 4, 1, 2, 3, 4]

In [16]:
np_array + np_array

array([2, 4, 6, 8])

Different objects have different behaviors!  The `+` operator concatenates lists but adds numpy arrays element by element.
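Multiplication shows the same contrast.  In this short sketch, `repeated` and `doubled` are names chosen for illustration.

```python
import numpy as np

python_list = [1, 2, 3, 4]
np_array = np.array(python_list)

repeated = python_list * 2   # repeats the list
doubled = np_array * 2       # multiplies every item by 2

print(repeated)
print(doubled)
```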

## Numpy subsetting

In [17]:
bmi

array([25.05680552, 21.52040816, 27.45443787, 21.25472411, 27.79996142])

In [18]:
bmi[1]

21.520408163265305

In [19]:
bmi > 25

array([ True, False,  True, False,  True])

In [20]:
bmi[bmi > 25]

array([25.05680552, 27.45443787, 27.79996142])
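Boolean arrays can also be combined.  The sketch below assumes we want BMI values between 22 and 27.5, thresholds picked only for illustration.  Note that numpy uses `&` rather than `and`, and each condition needs its own parentheses.

```python
import numpy as np

bmi = np.array([25.05680552, 21.52040816, 27.45443787, 21.25472411, 27.79996142])

# Combine two conditions element by element
in_range = (bmi > 22) & (bmi < 27.5)

print(in_range)
print(bmi[in_range])
print(in_range.sum())  # True counts as 1, so the sum is the number of matches
```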

## Baseball example

Suppose you are researching the BMI of MLB players and obtain a list of players' heights and weights (Source: [stat.ucla.edu](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights)).  For this exercise, you can find the stats in the baseball.csv file.

In [21]:
import csv

In [22]:
height = []
weight = []
with open('baseball.csv', newline='') as file:
    player_reader = csv.reader(file)
    next(player_reader)  # skip the header row
    for row in player_reader:
        height.append(int(row[3]))  # column 3: height in inches
        weight.append(int(row[4]))  # column 4: weight in pounds


In [23]:
len(height)

1015

In [24]:
len(weight)

1015

In [25]:
np_ht = np.array(height)
np_wt = np.array(weight)

In [26]:
bmi = np_wt / (np_ht**2) * 703

In [27]:
bmi

array([23.10810811, 27.60135135, 28.47800926, ..., 25.62044444,
       23.74577778, 25.72433853])

### Lightweight players
Let's find the players with a smaller build, defined here as a BMI below 21.

In [28]:
light = bmi < 21

In [29]:
light

array([False, False, False, ..., False, False, False])

In [30]:
print(bmi[light])

[20.54054054 20.54054054 20.69078947 20.69078947 20.34143519 20.34143519
 20.69078947 20.15685613 19.49653333 20.69078947 20.91846856]
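A boolean mask can also tell us how many players match and where they sit in the array.  The sketch below uses a small made-up array, `sample_bmi`, as a stand-in for the real data so that it runs without baseball.csv.

```python
import numpy as np

# Hypothetical stand-in values; the real array is built from baseball.csv
sample_bmi = np.array([23.1, 20.5, 27.6, 19.5, 25.7])

light = sample_bmi < 21
count = np.count_nonzero(light)    # number of True entries
positions = np.flatnonzero(light)  # indexes of the True entries

print(count)
print(positions)
```

The indexes in `positions` could then be used to look up other details for the same players, such as height or weight.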


## Multi-dimensional Arrays

In [31]:
h = [67, 70, 65, 63, 72]
w = [160, 150, 165, 120, 205]

In [32]:
np_ht = np.array(h)
np_wt = np.array(w)

In [33]:
type(np_ht)

numpy.ndarray

In [34]:
type(np_wt)

numpy.ndarray

In [35]:
np_2d = np.array([ [67, 70, 65, 63, 72], [160, 150, 165, 120, 205] ])

In [36]:
np_2d

array([[ 67,  70,  65,  63,  72],
       [160, 150, 165, 120, 205]])

In [37]:
np_2d.shape

(2, 5)

The shape attribute tells us that the np_2d array has 2 rows and 5 columns.
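A few related attributes describe the same array.  This is a minimal sketch using `ndim`, `size`, and the transpose `T`.

```python
import numpy as np

np_2d = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, 205]])

print(np_2d.ndim)     # number of dimensions
print(np_2d.size)     # total number of items
print(np_2d.T.shape)  # the transpose swaps rows and columns
```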

In [38]:
np_2d = np.array([ [67, 70, 65, 63, 72], [160, 150, 165, 120, "205"] ])

In [39]:
np_2d

array([['67', '70', '65', '63', '72'],
       ['160', '150', '165', '120', '205']], dtype='<U11')

Changing one item to a string causes numpy to convert all the items to strings.
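If the strings all hold valid numbers, one way to recover a numeric array is `astype`.  In this sketch, `as_text` and `as_numbers` are illustrative names.

```python
import numpy as np

as_text = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, "205"]])
as_numbers = as_text.astype(int)  # parse every string as an integer

print(as_text.dtype)          # a unicode string dtype
print(as_numbers.dtype)       # an integer dtype
print(as_numbers[1, 4] + 1)   # arithmetic works again
```

`astype` returns a new array and raises a ValueError if any item cannot be parsed as an integer.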

In [40]:
np_2d = np.array([ [67, 70, 65, 63, 72], [160, 150, 165, 120, 205] ])

In [41]:
np_2d

array([[ 67,  70,  65,  63,  72],
       [160, 150, 165, 120, 205]])

### Subsetting a 2-D array

In [42]:
np_2d[0]

array([67, 70, 65, 63, 72])

In [43]:
np_2d[0][2]

65

In [44]:
np_2d[0, 2]

65
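The comma syntax also accepts slices, which makes it possible to select whole rows or columns.  A short sketch, with variable names chosen for illustration:

```python
import numpy as np

np_2d = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, 205]])

weights = np_2d[1, :]         # the whole second row
first_three = np_2d[:, 0:3]   # all rows, first three columns
third_person = np_2d[:, 2]    # height and weight of the third person

print(weights)
print(first_three)
print(third_person)
```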

### 2-D array using the baseball data

In [45]:
np_baseball = np.array([height, weight])

In [46]:
np_baseball.shape

(2, 1015)

In [47]:
bmi = np_baseball[1] / np_baseball[0]**2 * 703

In [48]:
bmi

array([23.10810811, 27.60135135, 28.47800926, ..., 25.62044444,
       23.74577778, 25.72433853])

## Using numpy for basic statistics

In [49]:
np.mean(np_baseball[0])

73.6896551724138

In [50]:
np.mean(np_baseball[1])

201.34876847290641

In [51]:
np.corrcoef(np_baseball[0], np_baseball[1])

array([[1.        , 0.53153932],
       [0.53153932, 1.        ]])

In [52]:
np.std(np_baseball[0])

2.312791881046546

In [53]:
np.std(np_baseball[1])

20.812854496114763
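Two options are worth knowing when computing these statistics.  The `axis` argument summarizes each row or column of a 2-D array in one call, and `np.std` computes the population standard deviation by default, so `ddof=1` is needed for the sample version.  The sketch below uses the small five-person array rather than the baseball data.

```python
import numpy as np

np_2d = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, 205]])

# axis=1 computes one mean per row: mean height, then mean weight
row_means = np.mean(np_2d, axis=1)

pop_std = np.std(np_2d[0])             # divides by n (the default, ddof=0)
sample_std = np.std(np_2d[0], ddof=1)  # divides by n - 1

print(row_means)
print(pop_std, sample_std)
```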