<div id="container" style="position:relative;">
<div style="float:left"><h1>  Data in Python </h1></div>
<div style="float:right"><img style="height:65px" src ="https://drive.google.com/uc?export=view&id=1EnB0x-fdqMp6I5iMoEBBEuxB_s7AmE2k" />
</div>
</div>

This unit will focus on:

* Working with NumPy arrays
* Pandas Series and Dataframes
* Reading and Writing Data


### NumPy

One of the best reasons for using Python is the plethora of packages that have been created and shared online. NumPy, whose name comes from "Numerical Python", is one of the most useful packages we have.

NumPy carries out many array-based operations by storing the data with a single, fixed type and handing the work to compiled C routines. This gives us close to the speed of C, but with the simplicity of Python.

To use NumPy, we have to import it:

In [1]:
import numpy

Because we are lazy, it is usual to import NumPy using the shortcut name `np`. This saves vital keystrokes:

In [2]:
import numpy as np

### Motivation

We have a batch of data files: 

`sales-00.csv` to `sales-09.csv`

You can download these as a zip file [here](https://api.brainstation.io/content/link/1d9kdc-X7azzhtCRnTylZR38YoZtnnuAJ).

Each of these files represents a different product we have just developed. Within a file, each column is a day and each row is a different store.

We want to be able to tell which product is selling the best and at which stores. Today, we will focus on basic summary stats and plotting of the data.

We have to be able to walk before we can run, so let's try out some basic Python.

NumPy is a module, or package. We access its contents by writing the object name `np` followed by a full stop. NumPy contains a lot of functionality; for now, let's load a single one of our data files. It is often easiest and fastest to get a program or script working on a single file or subsample of data and then loop over the rest, rather than working on everything at once.

In [7]:
np.loadtxt('dataUnit02/data(1)/sales-00.csv', delimiter = ',')

array([[ 23.,  20.,  21.,  27.,  39.,  30.,  40.,  35.,  67.,  57.,  74.,
         44.,  60.,  81.,  79.,  91.,  78.,  94.,  93.,  94.,  88.,  95.,
        111.,  98., 130., 109., 121., 116., 147., 135.],
       [  3.,  23.,  10.,  42.,   0.,  14.,  12.,  13.,   0.,   4.,  10.,
         29.,  21.,   2.,  37.,  26.,   0.,  19.,   8.,  25.,   0.,   0.,
         23.,  18.,  23.,  21.,  27.,  17.,  22.,   7.],
       [ 26.,  28.,  37.,  37.,  37.,  64.,  74.,  67.,  83.,  83.,  79.,
        102., 104.,  97., 115., 141., 137., 149., 136., 160., 172., 167.,
        168., 165., 179., 207., 204., 216., 226., 217.],
       [  9.,  16.,  12.,   6.,  14.,  11.,  10.,  19.,   3.,  23.,  25.,
          0.,  32.,  17.,  19.,  10.,   8.,  19.,  13.,   7.,  13.,  26.,
          6.,  21.,   0.,  15.,  13.,  19.,   7.,  15.],
       [ 23.,  22.,  22.,  21.,  44.,  29.,  47.,  37.,  45.,  54.,  43.,
         54.,  56.,  74.,  67.,  70.,  83.,  70., 106.,  93., 107., 106.,
        111., 117., 118., 138., 

We can see we ran our function, loaded our data, and printed it to the screen. 

So, we called the `loadtxt` function from the `np` module, giving it two arguments. The first argument was the path to the file `sales-00.csv`; the second was `delimiter = ','`.

Python has both positional and named (keyword) arguments. We can pass an argument by name, or provide the arguments in the order that the function expects. In this case, we told NumPy that our data uses a comma as the delimiter (i.e. a CSV file).
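As a minimal sketch of the difference between the two calling styles, the example below loads the same comma-separated text twice. The text in `csv_text` is made up for illustration and is read from memory with `io.StringIO` instead of from a file on disk.

```python
import io

import numpy as np

# Made-up CSV text standing in for a file on disk.
csv_text = "1,2,3\n4,5,6"

# Keyword (named) argument: we name the parameter we want to set.
by_name = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Positional arguments: to reach delimiter (the fourth parameter) we
# must also supply the parameters before it (dtype and comments).
by_position = np.loadtxt(io.StringIO(csv_text), float, "#", ",")

print(by_name)
print(np.array_equal(by_name, by_position))
```

Naming the argument is usually easier to read, since the call does not depend on remembering the order of the parameters.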

How do we access the data we just loaded? We need to assign it to a variable:

In [11]:
my_data = np.loadtxt('./dataUnit02/data(1)/sales-00.csv', delimiter = ',')

What exactly did we just load? Previously we only had a single value loaded into a variable:

In [12]:
my_val = 22.5
print(type(my_val))

<class 'float'>


In [7]:
print(type(my_data))

<class 'numpy.ndarray'>


So, we had a float previously, now we have a NumPy ndarray - we can see that the NumPy module contains data types as well as functions. For now we can think of an array as a way of holding a bunch of data.

How many values do we have?

In [13]:
print(my_data.shape)

(30, 30)


We have a 30*30 matrix of data. The way we accessed the shape of the data shows a consistent way of accessing things from Python objects - using `object.attribute`, similar to the way we got the `loadtxt` function.

Now we have the data, how do we investigate it?

We can take subsets and slices to get an idea of what we have. Looking at an entire 30*30 matrix at once is probably not useful, but we can look at a column, row, or cell individually to get an idea of what we have.

In [14]:
my_data[0,0]

23.0

NumPy enables us to do math on our data, all at once:

In [15]:
2 * my_data[1:3,4:9]

array([[  0.,  28.,  24.,  26.,   0.],
       [ 74., 128., 148., 134., 166.]])

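We can use just a `:` to denote the whole dimension: `my_data[:,3]` selects the whole of column 3. Be careful of dimension dropping: indexing with a single integer removes that axis, so `my_data[:,3]` is a 1-dimensional array of shape `(30,)`, whereas the slice `my_data[:,3:4]` keeps two dimensions, with shape `(30, 1)`.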
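To see dimension dropping on something small, here is a sketch using a made-up 3-by-4 array named `demo` (not one of the sales files):

```python
import numpy as np

# A small made-up array: 3 rows by 4 columns.
demo = np.arange(12).reshape(3, 4)

column_1d = demo[:, 3]      # integer index: the column axis is dropped
column_2d = demo[:, 3:4]    # slice of length one: the column axis is kept

print(column_1d.shape)  # (3,)
print(column_2d.shape)  # (3, 1)
```

The values are the same in both cases; only the shape differs, which matters when a later step expects a 2-dimensional array.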

NumPy provides universal functions, or [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html#math-operations), which operate element-wise over whole arrays, along with aggregation functions such as `np.mean` and `np.std` that summarize an array:

In [16]:
mymean = np.mean(my_data)
mystd = np.std(my_data)
mymin = np.min(my_data)
mymax = np.max(my_data)

print(f'Mean: {mymean:.2f}, Std: {mystd:.3f}, Min: {mymin}, Max: {mymax}')
# f-strings were introduced in Python 3.6

Mean: 85.23, Std: 81.590, Min: 0.0, Max: 442.0


To find out whether a function exists in NumPy, the easiest way is almost always to search for '[whatever you want to do] NumPy'. If you don't find the exact function, you will usually find an implementation of it.

This is one of the largest benefits of working in a common language like Python.


The structure of our data sets had each row representing a store, and each column indicating a day. Each new file is a new item.

So, if we find the sum across each row, this gives us the total sales across days of our item in each store. If we average down the columns, we can see the overall day-to-day trend in our sales for this item, across stores.

How can we get these values?

In [27]:
#help(np.sum)
#take off the #

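`np.sum` takes an argument, `axis`, which chooses the axis along which the function is applied.

Setting `axis=0` aggregates down the rows, giving one value per column (here, per day). Setting `axis=1` aggregates across the columns, giving one value per row (here, per store).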

In [12]:
print(my_data[:,0])
print(np.sum(my_data, axis = 0))
print(my_data[0,:])
print(np.sum(my_data, axis = 1))

[23.  3. 26.  9. 23. 13. 23.  0. 25. 21. 15. 18. 24. 21.  9. 33.  6.  3.
 18. 13. 20. 23. 13. 22. 24. 14. 26.  2. 13. 15.]
[ 498.  632.  713.  975. 1030. 1224. 1305. 1391. 1519. 1705. 1915. 2027.
 2191. 2330. 2530. 2598. 2693. 2952. 3074. 3191. 3415. 3465. 3673. 3779.
 3984. 4004. 4192. 4392. 4627. 4680.]
[ 23.  20.  21.  27.  39.  30.  40.  35.  67.  57.  74.  44.  60.  81.
  79.  91.  78.  94.  93.  94.  88.  95. 111.  98. 130. 109. 121. 116.
 147. 135.]
[2297.  456. 3677.  408. 2303. 2749. 6782. 3436. 5070.  480. 2832. 3219.
 1255. 2192. 4524. 4272. 1685.  488. 2865.  392. 3036.  625. 6214.  490.
 2156. 5298. 1201. 3715. 2231.  356.]


We could assign these outputs to variables and continue on using these arrays if we wanted them.
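As a small sketch of assigning the results to variables, the example below uses a made-up 2-by-3 array named `sales`, where rows stand for stores and columns for days, so the effect of `axis` is easy to check by hand.

```python
import numpy as np

# Made-up sales: 2 stores (rows) by 3 days (columns).
sales = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

store_totals = np.sum(sales, axis=1)   # one value per row (store)
daily_means = np.mean(sales, axis=0)   # one value per column (day)

print(store_totals)  # [ 6. 15.]
print(daily_means)   # [2.5 3.5 4.5]
```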

### Why Use NumPy?

Given that we already have lists, which look a lot like arrays, why do we need NumPy?

Python is dynamically typed (duck typed), so the interpreter doesn't know in advance what an operator should do - for each item it has to check the type, find out how the operator is implemented for that type, and then carry it out.

A NumPy array holds values of a single data type, so that check only has to happen once for the whole array.

There are also benefits to the memory layout and some other more advanced reasons for why we use NumPy, but you can see some of the benefits here:

In [13]:
mylist = [1,2,3,4,5,6,7,8,9,10]

print(2 * mylist) # huh. Probably not what we wanted.

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In [14]:
%%timeit

out = []
for i in mylist:
    out.append(i*2)

875 ns ± 55.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [15]:
myarray = np.array([1,2,3,4,5,6,7,8,9,10])

In [17]:
%%timeit

out = 2 * myarray

730 ns ± 20.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
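With only ten elements the two timings above are close. The gap typically grows with the size of the data, so here is a sketch of the same comparison on a larger input, using the standard-library `timeit` module so it also runs outside a notebook. The size `n` and the helper function names are choices made for this example, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000
big_list = list(range(n))
big_array = np.arange(n)


def double_with_loop():
    # Pure-Python loop: each element is handled one at a time.
    out = []
    for i in big_list:
        out.append(i * 2)
    return out


def double_with_numpy():
    # One vectorized operation over the whole array.
    return 2 * big_array


# Both approaches should produce the same values.
same = double_with_loop() == double_with_numpy().tolist()

# Time 20 runs of each approach.
loop_seconds = timeit.timeit(double_with_loop, number=20)
numpy_seconds = timeit.timeit(double_with_numpy, number=20)
print(f"same result: {same}")
print(f"loop: {loop_seconds:.4f} s, NumPy: {numpy_seconds:.4f} s")
```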


### Exercises

Complete the following exercises with a partner:

1. Run the following code:
```
x = np.arange(10)
```

What is the type of x?
What does the `np.arange` function do? (Use `help(function)`, or Google.)
Type in `x.` then hit tab. Can you find the maximum value in this array, using a method?
Can you find the shape of the array?

2. Import the module `time`. Use `time.t` and tab completion to find a function that gives us the current time. What arguments does this function take?

3. We still have the `my_data` array in memory - can you take the standard deviation for each store? How about for each day? Can you find the maximum amount of product sold each day, regardless of store?

## Pandas:  Series and Dataframes

Pandas is a package for Python that comes with its own unique set of functions, objects, etc. It is designed for working with tabular data. We're going to import Pandas with this command:

In [18]:
import pandas as pd

### Pandas Series objects

Pandas Series are built on 1-dimensional NumPy arrays (the basic object of NumPy), but with the addition of an index.  As a result, a Series is much like a list in that it is an ordered sequence of values whose elements can be accessed by their index. The difference here, though, is that we can specify an index of our choosing (much like picking keys for a dictionary). If we do not specify an index, it will default to standard integer indexing.

In [19]:
s = pd.Series([2,3,5,7,11,13])
s

0     2
1     3
2     5
3     7
4    11
5    13
dtype: int64

Here, the left-hand column is the index, and the right-hand column is the set of values.  We can specify the index when we create the series:

In [20]:
s = pd.Series([2,3,5,7,11,13],index=['a', 'b', 'c', 'd', 'e','f'])
s

a     2
b     3
c     5
d     7
e    11
f    13
dtype: int64

Or, we can assign the index afterwards:

In [21]:
s = pd.Series([2,3,5,7,11,13])
s

0     2
1     3
2     5
3     7
4    11
5    13
dtype: int64

In [84]:
s.index

RangeIndex(start=0, stop=6, step=1)

In [24]:
# Setting a new index
s.index = ['a', 'b', 'c', 'd', 'e','f']
s

a     2
b     3
c     5
d     7
e    11
f    13
dtype: int64

We can take a dictionary, and create a Series out of it.  This will automatically use the keys as the index:

In [25]:
midterm_marks = {'Patrick':86,'Lindsay':95,'Ivan':92,'Emily':97,'Iva':89}

marks = pd.Series(midterm_marks)
marks

Patrick    86
Lindsay    95
Ivan       92
Emily      97
Iva        89
dtype: int64

### Slicing & Referencing

Both Series and Dataframe items can be called and sliced using syntax similar to lists and dictionaries:

In [26]:
marks['Emily']

97

In [27]:
s['b':'d']

b    3
c    5
d    7
dtype: int64

You will notice that, unlike in regular Python, slicing a Series by index *labels* will *include* the last item listed. Slicing by integer position still excludes the endpoint, as with lists.<br>
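A short sketch of the two slicing behaviours, using the explicit `.loc` (label-based) and `.iloc` (position-based) accessors on the same Series as above:

```python
import pandas as pd

s = pd.Series([2, 3, 5, 7, 11, 13], index=['a', 'b', 'c', 'd', 'e', 'f'])

by_label = s.loc['b':'d']     # label slice: 'd' is included
by_position = s.iloc[1:3]     # position slice: position 3 is excluded

print(by_label.tolist())     # [3, 5, 7]
print(by_position.tolist())  # [3, 5]
```

Spelling out `.loc` or `.iloc` makes it clear to the reader which kind of slicing is intended.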

One of the most useful aspects of Series (and Dataframes, as we'll see) is *vectorized* operations. If we apply an operation to a Series, it is carried out on every element, effectively simultaneously. For example, if we wanted to add two Series together, component-wise, we would use:

In [28]:
s1 = pd.Series([4,10,7], index=['a','b','c'])
s2 = pd.Series([1,1,1], index=['a','b','c'])

s3 = s1+s2
s3

a     5
b    11
c     8
dtype: int64

And if we wanted to multiply everything in ```s3``` by 2 and then add 5:

In [29]:
s3*2 + 5

a    15
b    27
c    21
dtype: int64

One of the great abilities of vectorized operations is handling mismatched indices. If we have two Series that overlap in some index labels but not others, Pandas aligns them by label and fills in the gaps for us:

In [30]:
s4 = pd.Series([2,-1,6], index=['b','c','d'])
s5 = pd.Series([3,1,0], index=['a','b','c'])

s4+s5

a    NaN
b    3.0
c   -1.0
d    NaN
dtype: float64

The index of the resulting Series is the union of the two original indices.  Pandas fills in any label that is missing from either Series with `np.nan`. Because `NaN` is a floating-point value, the result has dtype `float64`. <br>

**Note**: `NaN` ("not a number") is the marker used to denote missing data in Pandas and NumPy.
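If `NaN` is not the result we want for unmatched labels, one option is the `Series.add` method with its `fill_value` parameter. The sketch below reuses `s4` and `s5` from above; treating a missing entry as `0` is an assumption made for this example, and whether it is appropriate depends on the data.

```python
import pandas as pd

s4 = pd.Series([2, -1, 6], index=['b', 'c', 'd'])
s5 = pd.Series([3, 1, 0], index=['a', 'b', 'c'])

# Treat a label missing from one Series as 0 instead of producing NaN.
filled = s4.add(s5, fill_value=0)
print(filled)

# Alternatively, keep only the labels present in both Series.
overlap_only = (s4 + s5).dropna()
print(overlap_only)
```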

### Vectorization: Why we care

Why is vectorization important?  Why are we learning about Series (and soon Dataframes) that look kind of like other objects we've already seen, instead of just trying to use those objects?  There are two main reasons:

1. **Speed**. Performing vectorized operations will go much faster than iterative ones. 
2. **Code length**. The amount of code you write will be dramatically reduced by not having to write complicated for-loops.

Instead of having to iterate through all of the indices, we can give Python instructions for the entire Series. 
Vectorized operations allow us to very quickly perform operations on entire groups of data.  As we'll see later on, we can even create vectorized versions of custom functions.  Pandas and NumPy come packaged with a lot of very handy and powerful tools for dealing with very large datasets that only require a line or two of code, which reduces the amount of work we have to do.
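To make the code-length point concrete, here is a sketch that applies the same rule with a loop and with vectorized operations. The rule itself (a 5-mark bonus capped at 100) and the A/B cut-off of 90 are made up for illustration.

```python
import numpy as np
import pandas as pd

marks = pd.Series({'Patrick': 86, 'Lindsay': 95, 'Ivan': 92, 'Emily': 97, 'Iva': 89})

# Loop version: visit each label and build the result by hand.
curved_loop = {}
for name in marks.index:
    curved_loop[name] = min(marks[name] + 5, 100)
curved_loop = pd.Series(curved_loop)

# Vectorized version: one expression over the whole Series.
curved_vec = (marks + 5).clip(upper=100)

# np.where applies an element-wise "if/else" without a loop.
grade = pd.Series(np.where(marks >= 90, 'A', 'B'), index=marks.index)

print(curved_vec)
print(grade)
```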



### Pandas Dataframes

Dataframes are tables of indexed columns, containing potentially different types of data.  Each column is a `pd.Series` object.

We can create a dataframe from scratch using a dictionary of Series:




In [31]:
d = {'one' : pd.Series([2, 4, 6], index=['a', 'b', 'c']),
     'two' : pd.Series(['alpha','beta','gamma','delta'], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
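The two Series above do not share the same index: `'one'` has no entry for the label `'d'`. The sketch below rebuilds the same dataframe under the hypothetical name `df_small` and inspects how Pandas handles the gap.

```python
import pandas as pd

d = {'one': pd.Series([2, 4, 6], index=['a', 'b', 'c']),
     'two': pd.Series(['alpha', 'beta', 'gamma', 'delta'], index=['a', 'b', 'c', 'd'])}
df_small = pd.DataFrame(d)

print(df_small.shape)          # (4, 2)
print(df_small['one'].isna())  # True only for label 'd'
print(df_small.dtypes)         # 'one' becomes float64 because NaN is a float
```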

Let's create a larger dataframe so we can get a better view of accessing data.  The actual code to produce this dataframe is not overly important for right now.

In [32]:
df = pd.DataFrame(np.random.randn(8,6), columns=['A','B','C','D','E','F'])
df

Unnamed: 0,A,B,C,D,E,F
0,-0.00612,0.995303,1.29559,-1.315822,0.680437,0.336321
1,0.617469,-0.092009,0.65754,0.769186,-0.413372,-1.038546
2,0.91102,0.236724,0.646495,0.794445,0.699801,-0.655905
3,-1.63444,-0.457585,0.41522,1.930824,1.41339,-1.406453
4,0.884871,-0.565394,-0.033965,0.329885,0.849439,-0.064877
5,0.452536,-1.149487,-0.527991,-0.481837,0.646394,0.233137
6,-0.212473,1.171074,1.709416,-0.118912,-0.433531,1.279646
7,-0.857506,1.412557,0.400593,0.652157,0.406363,-0.414815


We can access the top n rows by calling `df.head(n)`, and the bottom `m` rows by calling `df.tail(m)`.  If we leave `n` or `m` blank, it will default to 5.

In [33]:
df.head()

Unnamed: 0,A,B,C,D,E,F
0,-0.00612,0.995303,1.29559,-1.315822,0.680437,0.336321
1,0.617469,-0.092009,0.65754,0.769186,-0.413372,-1.038546
2,0.91102,0.236724,0.646495,0.794445,0.699801,-0.655905
3,-1.63444,-0.457585,0.41522,1.930824,1.41339,-1.406453
4,0.884871,-0.565394,-0.033965,0.329885,0.849439,-0.064877


In [34]:
df.tail(3)

Unnamed: 0,A,B,C,D,E,F
5,0.452536,-1.149487,-0.527991,-0.481837,0.646394,0.233137
6,-0.212473,1.171074,1.709416,-0.118912,-0.433531,1.279646
7,-0.857506,1.412557,0.400593,0.652157,0.406363,-0.414815


We can slice rows with the same syntax as lists and Series:

In [35]:
df[2:4]

Unnamed: 0,A,B,C,D,E,F
2,0.91102,0.236724,0.646495,0.794445,0.699801,-0.655905
3,-1.63444,-0.457585,0.41522,1.930824,1.41339,-1.406453


We can call an individual column with the syntax we use for list/dictionary items, and we can call a set of columns by passing a list of column names:

In [36]:
df['A']

0   -0.006120
1    0.617469
2    0.911020
3   -1.634440
4    0.884871
5    0.452536
6   -0.212473
7   -0.857506
Name: A, dtype: float64

(Note:  when we call only a single column, we will be given a Series object)

In [37]:
df[['A','D']]

Unnamed: 0,A,D
0,-0.00612,-1.315822
1,0.617469,0.769186
2,0.91102,0.794445
3,-1.63444,1.930824
4,0.884871,0.329885
5,0.452536,-0.481837
6,-0.212473,-0.118912
7,-0.857506,0.652157


We can define a new column as we would add a new key/value pair to a dictionary:

In [38]:
df['G'] = [1,1,3,7,5,5,3,9]

In [39]:
df

Unnamed: 0,A,B,C,D,E,F,G
0,-0.00612,0.995303,1.29559,-1.315822,0.680437,0.336321,1
1,0.617469,-0.092009,0.65754,0.769186,-0.413372,-1.038546,1
2,0.91102,0.236724,0.646495,0.794445,0.699801,-0.655905,3
3,-1.63444,-0.457585,0.41522,1.930824,1.41339,-1.406453,7
4,0.884871,-0.565394,-0.033965,0.329885,0.849439,-0.064877,5
5,0.452536,-1.149487,-0.527991,-0.481837,0.646394,0.233137,5
6,-0.212473,1.171074,1.709416,-0.118912,-0.433531,1.279646,3
7,-0.857506,1.412557,0.400593,0.652157,0.406363,-0.414815,9


One important thing to note is that arithmetic on a slice of rows or columns returns a *new* object, so the original dataframe is left unchanged unless we assign the result back.  If we slice out rows 2:4 and multiply by 0, we get:

In [40]:
df[2:4]*0

Unnamed: 0,A,B,C,D,E,F,G
2,0.0,0.0,0.0,0.0,0.0,-0.0,0
3,-0.0,-0.0,0.0,0.0,0.0,-0.0,0


In [41]:
df

Unnamed: 0,A,B,C,D,E,F,G
0,-0.00612,0.995303,1.29559,-1.315822,0.680437,0.336321,1
1,0.617469,-0.092009,0.65754,0.769186,-0.413372,-1.038546,1
2,0.91102,0.236724,0.646495,0.794445,0.699801,-0.655905,3
3,-1.63444,-0.457585,0.41522,1.930824,1.41339,-1.406453,7
4,0.884871,-0.565394,-0.033965,0.329885,0.849439,-0.064877,5
5,0.452536,-1.149487,-0.527991,-0.481837,0.646394,0.233137,5
6,-0.212473,1.171074,1.709416,-0.118912,-0.433531,1.279646,3
7,-0.857506,1.412557,0.400593,0.652157,0.406363,-0.414815,9


The rows remain unchanged in df.  Similarly for columns:

In [42]:
df['A']*0

0   -0.0
1    0.0
2    0.0
3   -0.0
4    0.0
5    0.0
6   -0.0
7   -0.0
Name: A, dtype: float64

In [43]:
df

Unnamed: 0,A,B,C,D,E,F,G
0,-0.00612,0.995303,1.29559,-1.315822,0.680437,0.336321,1
1,0.617469,-0.092009,0.65754,0.769186,-0.413372,-1.038546,1
2,0.91102,0.236724,0.646495,0.794445,0.699801,-0.655905,3
3,-1.63444,-0.457585,0.41522,1.930824,1.41339,-1.406453,7
4,0.884871,-0.565394,-0.033965,0.329885,0.849439,-0.064877,5
5,0.452536,-1.149487,-0.527991,-0.481837,0.646394,0.233137,5
6,-0.212473,1.171074,1.709416,-0.118912,-0.433531,1.279646,3
7,-0.857506,1.412557,0.400593,0.652157,0.406363,-0.414815,9


To keep the result of the computation, we have to assign it back to the rows/columns:

In [44]:
df['A'] = df['A']*0
df

Unnamed: 0,A,B,C,D,E,F,G
0,-0.0,0.995303,1.29559,-1.315822,0.680437,0.336321,1
1,0.0,-0.092009,0.65754,0.769186,-0.413372,-1.038546,1
2,0.0,0.236724,0.646495,0.794445,0.699801,-0.655905,3
3,-0.0,-0.457585,0.41522,1.930824,1.41339,-1.406453,7
4,0.0,-0.565394,-0.033965,0.329885,0.849439,-0.064877,5
5,0.0,-1.149487,-0.527991,-0.481837,0.646394,0.233137,5
6,-0.0,1.171074,1.709416,-0.118912,-0.433531,1.279646,3
7,-0.0,1.412557,0.400593,0.652157,0.406363,-0.414815,9
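The same idea works for rows. One way to do it, sketched here on a small made-up dataframe named `demo`, is to assign through `.loc`, which selects rows by label:

```python
import numpy as np
import pandas as pd

# A small made-up dataframe with a default integer index.
demo = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# Assign the result back to the same rows. .loc slices by label,
# so 1:2 covers the rows labelled 1 and 2 (both ends included).
demo.loc[1:2, :] = demo.loc[1:2, :] * 0

print(demo)
```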


### Reading and Writing Data

Usually when we work with data, we will do so by loading a file into Pandas rather than creating the dataframe from scratch as we did above. We will work with files in the `.csv` format as well as in the `.h5` (HDF5) format. On the class resources page there is a link to the file `canada_cpi.csv`. Make sure you have that file saved into your ```.../notebooks/data``` directory, and then run the code block below.

**Tip:** The notebook is pretty smart at autocompleting file paths. If you begin typing `dm = pd.read_csv('data/` and hit the `TAB` key, it should come up with a list of the contents of the directory and you can scroll to the file you want and hit `ENTER` to select it.

In [46]:
dm = pd.read_csv('data/canada_cpi.csv')

This file contains a time series of the Canadian Consumer Price Index (CPI). The CPI is a way of measuring the rate of inflation over time. The dataset was obtained from the [Bank of Canada Statistical Database at Quandl](https://www.quandl.com/data/BOC-Bank-of-Canada-Statistical-Database).

When we first load a new dataset into a notebook, it is useful to have a look at its head and tail to make sure it has loaded up as expected.

In [47]:
dm.head()

Unnamed: 0,Month,Total CPI,Total CPI S.A.,Core CPI,% Change 1 Yr: Total CPI,% Change 1 Yr: Core CPI,% Change 1 Yr: CPI-XFET,% Change 1 Yr: CPIW
0,2018-06-30,133.6,133.0,2.5,2.0,2.0,1.9,1.3
1,2018-05-31,133.4,132.9,2.2,1.9,2.0,1.9,1.3
2,2018-04-30,133.3,132.8,2.2,2.1,2.0,1.9,1.5
3,2018-03-31,132.9,132.7,2.3,2.0,2.0,1.9,1.4
4,2018-02-28,132.5,132.6,2.2,2.1,2.0,1.9,1.5


In [48]:
dm.tail()

Unnamed: 0,Month,Total CPI,Total CPI S.A.,Core CPI,% Change 1 Yr: Total CPI,% Change 1 Yr: Core CPI,% Change 1 Yr: CPI-XFET,% Change 1 Yr: CPIW
217,2000-05-31,94.9,94.7,2.4,1.6,1.4,1.6,1.1
218,2000-04-30,94.5,94.4,2.2,1.5,1.3,1.4,1.2
219,2000-03-31,94.8,94.8,3.0,1.6,1.4,1.5,1.3
220,2000-02-29,94.1,94.3,2.7,1.6,1.4,1.5,1.3
221,2000-01-31,93.5,94.0,2.2,1.5,1.3,1.5,1.2


It is also very useful to look at the data types of the columns. We want to make sure that the columns are the types that we expect -- sometimes the pretty-printing of DataFrames in the notebook hides issues with Pandas properly inferring data types.

In [49]:
dm.dtypes

Month                        object
Total CPI                   float64
Total CPI S.A.              float64
Core CPI                    float64
% Change 1 Yr: Total CPI    float64
% Change 1 Yr: Core CPI     float64
% Change 1 Yr: CPI-XFET     float64
% Change 1 Yr: CPIW         float64
dtype: object

Pandas is pretty good at inferring the format of .csv files, which is quite a trick since .csv files are actually quite ambiguous. A .csv file does not (usually) contain any _metadata_ to describe to Pandas what type of data is contained in each column. Calling `dm.dtypes` shows us what it guessed. Here it guessed that everything except for `Month` was of type `float64`, which is `numpy`'s representation of a `float`. 

That's mostly good and correct behavior; however, the `Month` column has been read in as an `object` column. `object` is the dtype that Pandas (by way of NumPy under the hood) gives to a column when it doesn't recognize a more specific type; here, the dates have been kept as plain strings. But Pandas also has a representation for `datetime` objects, and we would get a lot more use out of this dataset if we used it! Thankfully, it's easy to fix this:

In [50]:
dm = pd.read_csv('data/canada_cpi.csv', parse_dates=[0])
dm.dtypes

Month                       datetime64[ns]
Total CPI                          float64
Total CPI S.A.                     float64
Core CPI                           float64
% Change 1 Yr: Total CPI           float64
% Change 1 Yr: Core CPI            float64
% Change 1 Yr: CPI-XFET            float64
% Change 1 Yr: CPIW                float64
dtype: object

The `parse_dates` argument accepts a list of column indices (or column names) which we would like Pandas to try to interpret as dates. As long as the dates are represented consistently and unambiguously, Pandas is good at figuring them out. Here it's worked!

#### Reading in Time Series

A time series is a dataset where each row represents an observation in time. Not all datasets are time series, but Pandas does have special support for time series, so when we are working with a time series it is useful to let Pandas know. We do this by setting the data type of the index to datetime. In our CPI time series, we can do this on read by specifying that the first column is the index when we read it in.

In [51]:
dm = pd.read_csv('data/canada_cpi.csv', parse_dates=[0], index_col=0)

In [52]:
dm.head()

Month,Total CPI,Total CPI S.A.,Core CPI,% Change 1 Yr: Total CPI,% Change 1 Yr: Core CPI,% Change 1 Yr: CPI-XFET,% Change 1 Yr: CPIW
2018-06-30,133.6,133.0,2.5,2.0,2.0,1.9,1.3
2018-05-31,133.4,132.9,2.2,1.9,2.0,1.9,1.3
2018-04-30,133.3,132.8,2.2,2.1,2.0,1.9,1.5
2018-03-31,132.9,132.7,2.3,2.0,2.0,1.9,1.4
2018-02-28,132.5,132.6,2.2,2.1,2.0,1.9,1.5
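One thing a datetime index makes possible is selecting rows by date. The sketch below reads a few rows in the same layout from an in-memory string (the dates and `Total CPI` values are copied from the output above; the variable names are chosen for this example) and slices them by partial date strings.

```python
import io

import pandas as pd

# A few rows in the same layout as the CPI file, held in memory.
csv_text = """Month,Total CPI
2018-06-30,133.6
2018-05-31,133.4
2018-04-30,133.3
2018-03-31,132.9
"""

ts = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0)

# Sort the dates into ascending order before slicing by date range.
ts = ts.sort_index()

# With a datetime index we can slice by partial date strings.
spring = ts.loc['2018-04':'2018-05']

print(type(ts.index).__name__)  # DatetimeIndex
print(spring)
```

The slice covers every row dated in April or May 2018, including the end-of-month dates.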


### Reading and Writing HDF5 and Pickle

We'll do a lot more of this kind of stuff in the section on data cleaning, but note that we've already done quite a bit of data cleaning: we read in this CSV file, told Pandas that the first column should be a date, as well as told it that the first column is an index.

But if we save this DataFrame as a CSV, we'll lose all of this work! CSVs don't have any way to express metadata about the columns. There are other so-called "self-describing" file formats that do keep metadata along with the columns. One is called `pickle`. It is Python's _object serialization_ tool, and you can [read about it in the help if you want](https://docs.python.org/3/library/pickle.html). The other is called HDF5, and it is a cross-language data format that is used in many environments (writing it from Pandas requires the optional PyTables package). HDF5 files are more complicated than `pickle` files: you can store multiple data objects in an `HDF5` file, each indexed by a `key`. Either one is generally fine to use, but `HDF5` does tend to be faster to read and write when your data is large.

In [53]:
dm.to_csv("out.csv")  # loses metadata
dm.to_hdf("out.h5", key='data') # keeps metadata
dm.to_pickle('out.pickle') # also keeps metadata

In [54]:
pd.read_csv("out.csv").head()  # see, we lost all our hard work!

Unnamed: 0,Month,Total CPI,Total CPI S.A.,Core CPI,% Change 1 Yr: Total CPI,% Change 1 Yr: Core CPI,% Change 1 Yr: CPI-XFET,% Change 1 Yr: CPIW
0,2018-06-30,133.6,133.0,2.5,2.0,2.0,1.9,1.3
1,2018-05-31,133.4,132.9,2.2,1.9,2.0,1.9,1.3
2,2018-04-30,133.3,132.8,2.2,2.1,2.0,1.9,1.5
3,2018-03-31,132.9,132.7,2.3,2.0,2.0,1.9,1.4
4,2018-02-28,132.5,132.6,2.2,2.1,2.0,1.9,1.5


In [55]:
pd.read_hdf("out.h5", key="data").head()  # hard work intact!

Month,Total CPI,Total CPI S.A.,Core CPI,% Change 1 Yr: Total CPI,% Change 1 Yr: Core CPI,% Change 1 Yr: CPI-XFET,% Change 1 Yr: CPIW
2018-06-30,133.6,133.0,2.5,2.0,2.0,1.9,1.3
2018-05-31,133.4,132.9,2.2,1.9,2.0,1.9,1.3
2018-04-30,133.3,132.8,2.2,2.1,2.0,1.9,1.5
2018-03-31,132.9,132.7,2.3,2.0,2.0,1.9,1.4
2018-02-28,132.5,132.6,2.2,2.1,2.0,1.9,1.5


In [5]:
import pandas as pd  # re-import in case the kernel was restarted

pd.read_pickle("out.pickle").head()
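The difference between the formats can also be checked in code rather than by eye. This is a minimal sketch using a tiny made-up time series and a temporary directory, comparing a CSV round trip with a pickle round trip. HDF5 is left out here only because it needs the optional PyTables package.

```python
import os
import tempfile

import pandas as pd

# A tiny made-up time series with a datetime index.
original = pd.DataFrame(
    {'value': [1.5, 2.5]},
    index=pd.to_datetime(['2020-01-31', '2020-02-29']),
)
original.index.name = 'Month'

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'out.csv')
    pickle_path = os.path.join(tmp, 'out.pickle')

    original.to_csv(csv_path)
    original.to_pickle(pickle_path)

    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pickle_path)

# The CSV round trip turns the index into an ordinary text column.
csv_kept_dates = pd.api.types.is_datetime64_any_dtype(from_csv['Month'])

# The pickle round trip restores the index and dtypes unchanged.
pickle_matches = from_pickle.equals(original)

print(csv_kept_dates, pickle_matches)  # False True
```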

# Assignment 2

1. Adapt your string-reversing function from the prep course assignment to take in an arbitrary list, and return a new list which is the original list in reverse order. (Again, ```return listname[::-1]``` is easy, but not the point of the exercise.)
2. Suppose we're out grocery shopping, and currently have the following produce in our basket:<br>
`shopping_basket = ['cherry','lemon','celery','grapefruit','apricot'] `.<br>
We have the following dictionary of some fruit by type: <br>
`f_dict = {'citrus':['lemon','lime','grapefruit','orange','pomelo'],'stone fruit':['cherry','apricot','peach'],'pome':['apple','pear','quince']}`. <br>
Write a script which will give a list of all of the items in your basket which are citrus.
3. Continuing from the prep course, write a function which takes in an arbitrary list of numbers, and returns the mean.  (Note:  while `return np.mean(listname)` is easy, it's not really the point of the exercise).  Use this list of numbers to test it: `[25, 54, 27, 54, 23, 47, 23,  4, 27, 36, 26, 12, 25, 29, 41]`. How would you adjust your code to deal with lists that include non-numeric entries?
4. Consider this dictionary of class marks:<br>
`ixlist = ['Patrick','Lindsay','Ivan','Emily','Iva']
class_marks = {'Assignment 1':pd.Series([72,85,87,94,77],index = ixlist),'Assignment 2':pd.Series([82,89,92,92,84],index = ixlist), 'Assignment 3':pd.Series([80,94,90,99,85],index = ixlist), 'Midterm':pd.Series([86,95,92,97,89],index = ixlist), 
'Final Exam':pd.Series([84,92,90,91,92],index = ixlist)}`<br>
Turn this into a Dataframe, and create a new column to compute the 'Final Grade' with the following weighting of marks:  30% for Assignments, 30% for the Midterm, and 40% for the Final Exam.
5.  1. Create a list using a comprehension which contains all of the odd numbers from $-10$ to $10$ (inclusive).
    2. Use a list comprehension to determine how many times letters from the first half of the alphabet (capital and lowercase!) appear in the following sentence:<br>
    'To construct the notion of a Lie group in Dirac geometry, extending the definition of Poisson Lie groups, the Courant algebroids A must themselves carry a multiplicative structure.'
    3. Write a list comprehension whose elements are lists `[x,y]`, where `x` can take on the values `[1,2,3,4]`, where `y` can take on values `[2,4,6]`, and where the `x` and `y` values are not equal.  (Hint: you can chain the `for` statements together in a comprehension).
6. (Bonus) In financial futures trading, each instrument (e.g.: wheat, oil, USD/CAD, S&P500) has its own symbol.  For example, the S&P500 e-mini is called ES.  Each instrument also has a specific number of months it trades, each of which has its own single-letter symbol; ES trades March (H), June (M), September (U), and December (Z).  To refer to a specific contract, you need the instrument name, month code, and year.  The December 2015 S&P500 contract would be ESZ15.  Consider the following list of contracts:<br>
`contracts = ['ESM15','ESZ14','ESU15','ESH15','ESZ15','ESU14']`<br>
If we want them in chronological order (e.g.: 'ESU14' is first, then 'ESZ14', etc.), we can't use the sorted() function, because this will sort them alphanumerically.  Write a function which sorts this list of contracts into chronological order.
7. (Bonus) As we saw, we can approximate a *matrix* by creating a list of lists, e.g.: `[[1,2],[0,1]]` can be seen as representing the matrix $\begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}$ by considering each of the internal lists as the rows.  We can *transpose* a matrix by switching the rows and the columns (i.e.: the first row becomes the first column, etc.).  For example, the transpose of the matrix `[[1,2],[0,1]]` is `[[1,0],[2,1]]`, i.e.: $\begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix}$.<br>
Write a list comprehension that gives us the transpose of a matrix.

<div id="container" style="position:relative;">
<div style="position:relative; float:right"><img style="height:25px; width:50px" src ="https://drive.google.com/uc?export=view&id=14VoXUJftgptWtdNhtNYVm6cjVmEWpki1" />
</div>