# Introduction to Scientific Computing Lecture 5.2

## Introduction to masked arrays

Masking arrays is very useful for analysing subsets of data. NumPy has a masked array subpackage (`numpy.ma`) with many associated functions.

Examples from: http://scipy-lectures.org/intro/numpy/elaborate_arrays.html#maskedarray-dealing-with-propagation-of-missing-data
and
http://scipy-lectures.org/advanced/advanced_numpy/index.html#masked-array-missing-data

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# this changes the default plotting for matplotlib
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [6.0, 4.0]
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['savefig.dpi'] = 200

mpl.rcParams['font.size'] = 14
mpl.rcParams['legend.fontsize'] = 'large'
mpl.rcParams['figure.titlesize'] = 'medium'
mpl.rcParams['lines.linewidth']= 2.0

In [None]:
# suppose you have some data and one of the values is bad
x = np.array([1, 2, 3, -99, 5])

In [None]:
# if we take the average we get something very wrong
x.mean()

In [None]:
# you can mask the missing data
mx = np.ma.masked_array(x, mask=[0, 0, 0, 1, 0])
mx

In [None]:
# then the mean works
mx.mean()
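
To make the difference concrete, here is a small sketch using the same five numbers. The names `plain_mean`, `masked_mean` and `doubled` are only for illustration. It also shows that the mask carries through arithmetic, so the bad entry stays hidden in derived arrays.

```python
import numpy as np

x = np.array([1, 2, 3, -99, 5])
mx = np.ma.masked_array(x, mask=[0, 0, 0, 1, 0])

plain_mean = x.mean()      # includes the bad value -99
masked_mean = mx.mean()    # averages only the 4 unmasked values

# arithmetic keeps the mask: the bad entry stays masked in the result
doubled = mx * 2

print(plain_mean, masked_mean)
print(doubled)
print(mx.count())          # number of unmasked values
```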

There are many useful functions for working with masks.

In [None]:
# mask the data where a certain logical statement is true
mx2 = np.ma.masked_where(x < 0, x)
mx2

In [None]:
# note np.mean(mx2) gives the same result as mx.mean() above
np.mean(mx2)
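
A few of the other helpers in `numpy.ma`, sketched on the same array. The names `by_value` and `by_range` are made up for this example; `masked_equal`, `masked_less`, `compressed` and `filled` are standard `numpy.ma` tools.

```python
import numpy as np

x = np.array([1, 2, 3, -99, 5])

by_value = np.ma.masked_equal(x, -99)   # mask entries equal to a sentinel value
by_range = np.ma.masked_less(x, 0)      # mask entries below a threshold

print(by_value.compressed())   # only the unmasked values, as a plain ndarray
print(by_range.filled(0))      # replace masked entries with a fill value
```

Which one to use is mostly a matter of how the bad values are marked in the data: a sentinel such as -99 suits `masked_equal`, while a physical limit suits `masked_less` or `masked_where`.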

## Example from tutorial: Masked statistics

This uses the same data that was introduced previously.

### Getting and playing with the data

In [None]:
data = np.loadtxt('populations.txt')

In [None]:
# note the data is year, pop1, pop2, pop3
data

In [None]:
data.shape
# this is time x variables

In [None]:
# extract the year column: all rows (":" in the first dim), column 0 (first index in the second dim)
data[:,0]
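
The same indexing on a tiny made-up array with the same column layout (the numbers in `toy` are invented, not taken from `populations.txt`):

```python
import numpy as np

# made-up stand-in with the same layout: year, pop1, pop2, pop3
toy = np.array([[2000, 10.0, 1.0, 5.0],
                [2001, 20.0, 2.0, 6.0],
                [2002, 30.0, 3.0, 7.0]])

print(toy.shape)   # (rows, columns) = (times, variables)
print(toy[:, 0])   # first column: every row, column index 0
print(toy[0, :])   # first row: one time, all variables
```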

In [None]:
# time vs population plots
#plt.figure(figsize=(8, 6))
plt.plot(data[:,0], data[:,1])
plt.plot(data[:,0], data[:,2])
plt.plot(data[:,0], data[:,3])
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend(['Hares', 'Lynxes', 'Carrots'], loc = 'upper right')

In [None]:
help(plt.legend)

We may want to give each column its own variable name to make the data easier to deal with.


In [None]:
data.T
# now we have years as a row

In [None]:
# make vectors of the variables
year, hares, lynxes, carrots = data.T  # trick: columns to variables
# .T transposes the data, so each row of data.T is one variable
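
An alternative sketch: `np.loadtxt` can do the transpose itself with `unpack=True`. The text below is a made-up stand-in for the file, and the `_t` names are hypothetical so they do not clash with the real variables.

```python
import io
import numpy as np

# made-up numbers in the same four-column layout (not the real file)
text = io.StringIO("2000 10 1 5\n2001 20 2 6\n2002 30 3 7\n")

# unpack=True transposes the result, so each column becomes one variable
year_t, hares_t, lynxes_t, carrots_t = np.loadtxt(text, unpack=True)

print(year_t)
print(hares_t)
```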

In [None]:
year

In [None]:
hares

In [None]:
lynxes

In [None]:
# note this is the same plot as above, now using the named variables
plt.plot(year, hares)
plt.plot(year, lynxes)
plt.plot(year, carrots)
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend(['Hares', 'Lynxes', 'Carrots'], loc = 'upper right')

In [None]:
# how about a boxplot?
# see:
# https://matplotlib.org/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py

plt.boxplot([hares, lynxes, carrots])
plt.xlabel('hares, lynxes, carrots')
plt.ylabel('Populations (thousands)');  # the trailing ; hides the text output of the cell
# the orange line is the median

### New problem with this data (1.3.5.3)

Canadian rangers were distracted when counting hares and lynxes in 1903-1910 and 1917-1918, and got the numbers wrong. (Carrot farmers stayed alert, though.) Compute the mean populations over time, ignoring the invalid numbers.

The problem is that the hare and lynx data are bad in these years, so we need to mask those years before we take the averages.

Let's take the averages of the data first, before we mask. We either need to use the separate data vectors, or we need to take the average of the matrix along the correct axis.

In [None]:
data.mean()
# this just gives us the mean of the whole array, which is not helpful
# remember that the data is time x variables, so we want to average over the time index

In [None]:
np.mean(data, axis = 0)

In [None]:
data.mean(axis = 0)

In [None]:
# similarly
print(hares.mean())
print(lynxes.mean())
print(carrots.mean())
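
A small sketch of what the `axis` argument does, on a made-up array (`toy` is invented for illustration). Averaging along axis 0 collapses the rows (time) and leaves one value per column.

```python
import numpy as np

# made-up stand-in: columns are year, pop1, pop2
toy = np.array([[2000, 10.0, 1.0],
                [2001, 20.0, 2.0],
                [2002, 30.0, 3.0]])

print(toy.mean())                # one number: mean over every element
print(toy.mean(axis=0))          # collapse the rows: one mean per column
print(toy[:, 1:].mean(axis=0))   # same, but skipping the year column
```

Note that the first entry of `toy.mean(axis=0)` is the mean of the years, which is rarely what we want, hence the slice in the last line.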

Now let's mask out the bad years and recalculate the averages

We want to mask 1903-1910 and 1917-1918

In [None]:
year

In [None]:
(year > 1903)
# is this right?

In [None]:
(year >= 1903)
# is this right? Is 1903 true?

In [None]:
(year >= 1903) & (year <= 1910)

In [None]:
# note this doesn't work: Python's "and" needs a single True/False,
# not an array of them, so this raises a ValueError
(year >= 1903) and (year <= 1910)
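
A sketch of why `and` fails and what to use instead. `year_demo` is a stand-in array, assuming the years run from 1900 to 1920.

```python
import numpy as np

year_demo = np.arange(1900, 1921)   # stand-in years, assumed 1900 to 1920

try:
    (year_demo >= 1903) and (year_demo <= 1910)
except ValueError as err:
    # "and" asks the whole array for a single True/False, which is ambiguous
    print("ValueError:", err)

# element-wise alternatives: the & operator or np.logical_and
in_range = (year_demo >= 1903) & (year_demo <= 1910)
same = np.logical_and(year_demo >= 1903, year_demo <= 1910)

print(year_demo[in_range])
```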

In [None]:
# we need to add on the other range as well: 1917-1918
# is this an "and" or an "or" question?
# note "or" is |
((year >= 1903) & (year <= 1910)) | ((year >= 1917) & (year <= 1918))

In [None]:
mask = ((year >= 1903) & (year <= 1910)) | ((year >= 1917) & (year <= 1918))
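
A quick way to check a mask like this is to index the years with it and count the flagged entries. This is a sketch on a stand-in array (`year_demo`, assumed to run from 1900 to 1920; `bad_years` is a made-up name).

```python
import numpy as np

year_demo = np.arange(1900, 1921)   # stand-in years, assumed 1900 to 1920

# parentheses matter: & and | bind more tightly than >= and <=
bad_years = (((year_demo >= 1903) & (year_demo <= 1910)) |
             ((year_demo >= 1917) & (year_demo <= 1918)))

print(year_demo[bad_years])   # the years that will be masked
print(bad_years.sum())        # how many entries are flagged
```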

In [None]:
# now we are ready to mask
lynxes_masked = np.ma.masked_where(mask,lynxes)

In [None]:
lynxes_masked

In [None]:
hares_masked = np.ma.masked_where(mask,hares)
# note carrots are fine
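
Another option, sketched here on made-up numbers, is to mask the whole 2-D array at once and then average along the time axis. The names `toy`, `bad_rows`, `mask2d` and `col_means` are invented for this example.

```python
import numpy as np

# made-up stand-in: columns are year, hares, lynxes, carrots
toy = np.array([[1902.0, 10.0, 1.0, 5.0],
                [1903.0, 99.0, 9.0, 6.0],   # pretend this year was miscounted
                [1904.0, 30.0, 3.0, 7.0]])

bad_rows = toy[:, 0] == 1903

# 2-D mask: flag hares and lynxes (columns 1 and 2) in the bad rows only
mask2d = np.zeros(toy.shape, dtype=bool)
mask2d[bad_rows, 1:3] = True

toy_masked = np.ma.masked_array(toy, mask=mask2d)
col_means = toy_masked.mean(axis=0)   # one mean per column, skipping masked cells
print(col_means)
```

The carrot column is left unmasked, so its mean still uses every year.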

In [None]:
# note this is the same plot as above, but with the masked hares and lynxes
plt.plot(year, hares_masked)
plt.plot(year, lynxes_masked)
plt.plot(year, carrots)
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend(['Hares', 'Lynxes', 'Carrots'], loc = 'upper right')

In [None]:
print(hares.mean())
print(hares_masked.mean())

In [None]:
print(lynxes.mean())
print(lynxes_masked.mean())

In [None]:
plt.boxplot([hares,hares_masked, lynxes, lynxes_masked, carrots])
plt.xlabel('hares, hares masked, lynxes, lynxes masked, carrots')
plt.ylabel('Populations (thousands)');  # the trailing ; hides the text output of the cell
# didn't change the statistics significantly
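
Whether a plotting or statistics function honours the mask can depend on the function and the library version, so that is worth checking rather than assuming. One explicit approach, sketched here on made-up numbers, is to call `.compressed()` to get a plain array containing only the valid values, which can then be passed to something like `plt.boxplot`. For the median, `np.ma.median` is the mask-aware version.

```python
import numpy as np

values = np.array([10.0, 20.0, 500.0, 30.0, 40.0])   # made-up numbers
bad = np.array([False, False, True, False, False])   # pretend entry 2 is invalid
values_masked = np.ma.masked_where(bad, values)

valid_only = values_masked.compressed()   # plain ndarray without the masked entries

print(np.median(values))                         # median including the bad entry
print(np.ma.median(values_masked))               # median of the unmasked entries
print(np.percentile(valid_only, [25, 50, 75]))   # quartiles of the valid data
```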