# Week 2: Dealing with data

TODO
- Finish linear algebra stock market example
- your turn: implement matrix multiplication
- Redo linear algebra stock market example in numpy
- Demonstrate speed-up with numpy matrix multiplication
- Homework solutions
- spell checking, proof reading

Welcome back everyone!

The agenda for today:

- Review homework assignments, answer questions
- How do computers store data?
- Common data storage abstractions you should know
- A lightning intro to 2 very important libraries: numpy and pandas

## Homework review



## How is data stored?

Actually, there are lots of different ways to answer this question...

Moving from abstract to concrete:

- A brief intro to computer architecture
- Data storage paradigms
- Different types of files, the advantages and disadvantages of each

The motivation:

- Important to be informed about the tradeoffs between storage paradigms
- Writing performant code requires familiarity with *why* some code is faster than other code

### Stepping wayyyy back - what is a computer?

During WWII, mathematical foundations of modern computing were invented out of necessity. In 1945, John von Neumann drafted a design for a *stored program computer*:

![](images/von_neumann.png)

- A processing device
    - Control unit for storing instruction sets
    - Logic unit for executing those instructions
- Memory for storing data
- External mass storage
- Input/output lines for communicating with the world

The advance over previous designs: instead of "hard-wiring" (literally) a program, design a device that takes in a program just like data. The logic unit is able to execute a limited number of operations. By composing those operations together, we can prove *mathematically* that it is possible to solve any problem that is solvable by computation (ignoring resource usage...). *This is still fundamentally the way computers work today.*

Fast-forward 70 years, how do modern computers store and process data?

![](images/ComputerMemoryHierarchy.png)

What does that mean for us:

- For large organizations, real tape backup is a thing! https://aws.amazon.com/glacier/
- Most of us store data long-term on hard drives
- When actively working on a project, our data lives in RAM
- When in the middle of a computation, data is shifting from RAM to CPU cache
- CPU actually does work on bits that are in its registers

The typical computing workflow:

- Want to process some data stored on a hard drive (either physically connected to our local machine or accessible over a network connection)
- Provide the address of that data and some information for how to load it into memory
- Provide a set of instructions for what to do with the data that's in memory
- Write intermediate results to memory
- Store final results on a hard drive

Note - modern computers handle the RAM <-> Cache <-> CPU pipeline for us. But understanding how it works allows us to write faster code (will return to this later today).

### Storing data (on disk)

There are lots of ways that data can be stored on disk, and different formats can have drastic performance differences!

#### Text vs binary data

Fundamentally, the data contained in a file is represented in a computer as a sequence of bits (0's and 1's), which are grouped into chunks of 8 called bytes. Broadly speaking, there are two categories of files - binary and text.

Binary files:
- Can be any sequence of bytes (though they should adhere to some pattern)
- Designed to be machine readable only (i.e. don't make sense to a human eye)
- Examples: images (.png, .jpg), audio and video (.wav, .mp4), documents (.doc, .pdf), archives (.zip, .tar), executables (.exe, .dll)

Text files:
- The sequence of bytes follows an *encoding* that can be rendered into text
- Examining in a text editor, the files are human-readable
- Examples: documents (.txt, .md), web data (.html, .json), source code (.py, .java), tabular data (.csv)

In other words, text formatted files have particular structure to their bytes that can be rendered into characters that are displayed on a screen. Binary files don't adhere to the notion that sequences of bytes should correspond to characters, so are free to implement other protocols.
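
As a minimal sketch of this distinction, the snippet below tries to interpret two short byte sequences as UTF-8 text. The first holds the byte values 104 and 105; the second is the 8-byte signature that starts every PNG image. The variable names are made up for illustration.

```python
# Bytes that follow a text encoding decode cleanly into characters.
text_bytes = bytes([104, 105])
print(text_bytes.decode('utf-8'))

# The first 8 bytes of every PNG image (its "signature").
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Binary formats are free to use byte sequences that are not valid text.
try:
    png_signature.decode('utf-8')
    is_valid_text = True
except UnicodeDecodeError:
    is_valid_text = False

print(is_valid_text)
```

The PNG bytes are perfectly meaningful to an image viewer; they simply do not follow the rules of a text encoding.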

Let's use Python to open and read some files:

TODO - add section here opening hi.txt like in http://www.pgbovine.net/unicode-python.htm to show bytes in action (and maybe hi2.txt with a unicode character thrown in on the end?)

In [29]:
with open('data/hi.txt', 'r') as file:    # open the file data/hi.txt in read mode, refer to it as `file`
    text_data = file.read()               # read the contents of `file` into a variable called `text_data`

What's in `text_data`?

In [30]:
type(text_data)

str

In [31]:
len(text_data)

2

In [32]:
text_data

'hi'

So `text_data` is a string with two characters containing the text 'hi'.

**Question**: What is the physical size of `hi.txt` on disk?

In [47]:
import os
os.path.getsize('data/hi.txt')

2

In [33]:
# `bytes` converts a Python character to a representation of its bytes
# `ord()` converts a Python character into an integer representation
for char in text_data:
    print(bytes(char, 'utf-8'), ord(char))

b'h' 104
b'i' 105


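Python 3 strings are sequences of Unicode characters, and on most systems text files are read and written as UTF-8 by default (the default comes from the platform's locale, so pass `encoding='utf-8'` to `open` if you want to be explicit). As a sanity check, we can look up the values of 104 and 105 in a Unicode table to double check that they correspond to the characters 'h' and 'i': http://www.ssec.wisc.edu/~tomw/java/unicode.html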

In the days before Unicode became the *de facto* standard of internet communication, it was common to use [ASCII](http://www.asciitable.com/) to encode characters. In ASCII, each character was stored in 1 byte but only 7 of its 8 bits were used, so there were only 2^7 = 128 characters. To the early computer pioneers in the 60s and 70s, the majority of whom lived in English-speaking countries, 128 characters was plenty - there were 26 upper case characters, 26 lower case characters, 10 digits, and some special symbols like "(" and "&". Extended 8-bit encodings later used the remaining bit to add another 128 characters, including a few accented letters.

The rise of the internet, however, meant that many non-English speakers wanted to communicate digitally. But with ASCII, there was no way for people to write in Cyrillic or Chinese characters. This led to a proliferation of incompatible character encodings, which were eventually unified under the Unicode standard; UTF-8 is its most widely used encoding.
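
To see the limitation in practice, here is a small sketch using Python's built-in `str.encode`. The Russian word used as test input is just an illustrative choice.

```python
# Plain English text fits in ASCII: one byte per character.
ascii_bytes = 'hi'.encode('ascii')
print(ascii_bytes, len(ascii_bytes))

# Characters outside ASCII cannot be encoded with it at all...
try:
    'привет'.encode('ascii')
    encoded_ok = True
except UnicodeEncodeError as err:
    encoded_ok = False
    print('ASCII failed:', err.reason)

# ...but UTF-8 can represent them, using more than one byte per character.
utf8_bytes = 'привет'.encode('utf-8')
print(len('привет'), 'characters ->', len(utf8_bytes), 'bytes')
```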

Let's read a different file with Python, this time with some characters outside of the standard 26-letter English alphabet:

In [34]:
my_str = 'hi猫😺'
with open('data/hi2.txt', 'w', encoding='utf-8') as f:    # be explicit about the encoding
    f.write(my_str)

In [35]:
with open('data/hi2.txt', 'r', encoding='utf-8') as file:
    text_data = file.read()

In [36]:
type(text_data)

str

In [37]:
len(text_data)

4

In [38]:
text_data

'hi猫😺'

We can see that `text_data` is a string with 4 characters this time - the same 2 English characters "h" and "i", as well as a Chinese character and a cat emoji.

**Question**: What is the size of `hi2.txt` on disk?

In [45]:
os.path.getsize('data/hi2.txt')

9

In [39]:
for char in text_data:
    print(bytes(char, 'utf-8'), ord(char))

b'h' 104
b'i' 105
b'\xe7\x8c\xab' 29483
b'\xf0\x9f\x98\xba' 128570


This gives us a better sense of where the file size comes from. The integer values of "h" and "i" are small enough that they can each be represented by a single byte, but several bytes are necessary to represent each of the other 2 characters. Printing the byte representation of the characters tells us that "猫" requires 3 bytes to store on disk, and "😺" requires 4 bytes, so there are a combined 1 + 1 + 3 + 4 = 9 bytes in `hi2.txt`. In other words, characters encoded as UTF-8 correspond to a *variable* number of bytes, as opposed to ASCII, where characters *always* correspond to a single byte.
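
The byte counts can also be computed directly, without touching the disk. The sketch below does so and, for comparison, encodes the same string as UTF-32, a fixed-width Unicode encoding that is not otherwise used in these notes.

```python
my_str = 'hi猫😺'

# Bytes needed per character in UTF-8 (variable width).
utf8_sizes = [len(ch.encode('utf-8')) for ch in my_str]
print(utf8_sizes, sum(utf8_sizes))

# For comparison, UTF-32 spends 4 bytes on every character (fixed width).
# The '-le' suffix picks little-endian byte order and omits the byte-order mark.
utf32_size = len(my_str.encode('utf-32-le'))
print(utf32_size)
```

Variable width is what lets UTF-8 stay compact for mostly-English text while still covering every Unicode character.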

Now, let's read some bigger files into memory:

In [14]:
%%timeit         # an IPython "magic" function for profiling blocks of code
with open('data/sherlock_holmes.txt', 'r') as file:    # open the file in read mode
    file.read()

1000 loops, best of 3: 1.46 ms per loop


An aside: Python provides a way of *serializing* data into a binary format for storing on disk called *pickling*. Many different types of Python objects can be pickled, so it's a useful step for checkpointing your work on long-running calculations or freezing the state of your code for later use.

For example, we can dump the text data to a pickle...

In [9]:
import pickle
with open('data/sherlock_holmes.txt', 'r') as file:
    sherlock_text = file.read()
with open('data/sherlock_holmes.pickle', 'wb') as file:    # open a file in binary write mode
    pickle.dump(sherlock_text, file)

... and when we go to read the file, it loads a bit faster (even though it contains the same data).

In [11]:
%%timeit
with open('data/sherlock_holmes.pickle', 'rb') as file:  # note the 'rb' - for "read binary"
    pickle.load(file)

1000 loops, best of 3: 391 µs per loop


In [12]:
with open('data/sherlock_holmes.pickle', 'rb') as file:
    sherlock_pickle = pickle.load(file)

In [13]:
sherlock_text == sherlock_pickle

True

What happened here? Recall that there are no primitive data types in Python: everything is an object! So to read data from disk, Python must create an object to store the data in.

When reading a file from disk into memory, Python:

- Pulls raw bytes from disk into memory
- *Decodes* the raw bytes into their character representations
- Builds the Python objects that store those characters

That second step, the *decoding* (bytes to text; *encoding* goes the other way), can actually be fairly slow. If you're dealing with large text files, decoding them once and then pickling (or using another serialization method) can be a much more efficient way to read them in the future. We'll use pickling later in the course to serialize some intermediate results.
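
The steps above can be carried out by hand to see what text mode does for us. This is only a sketch: the scratch file `sample.txt` is a hypothetical stand-in for a real text file, and converting bytes to a string is done with `bytes.decode`.

```python
import os
import tempfile

# Hypothetical scratch file standing in for a larger text file.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('hi猫😺')

# Step 1: pull the raw bytes from disk into memory.
with open(path, 'rb') as f:
    raw_bytes = f.read()

# Steps 2 and 3: turn the bytes into characters and build a str object.
decoded = raw_bytes.decode('utf-8')

# Text mode performs the same steps for us behind the scenes.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(len(raw_bytes), len(decoded), decoded == text)
```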

**IMPORTANT:** Pickling is NOT SAFE. Anyone can pickle arbitrary code objects. In other words, it is possible to use pickles to distribute malicious code. Never unpickle data from someone you don't trust. Pickling really should only be used as a convenience for yourself, not as a way of distributing code.
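
Used as a personal convenience, pickling works for more than strings. Here is a minimal sketch of checkpointing intermediate results; the `checkpoint` dictionary and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical intermediate results from a long-running calculation.
checkpoint = {
    'step': 3,
    'word_counts': {'holmes': 2, 'watson': 1},
    'scores': [0.5, 0.25],
}

path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pickle')
with open(path, 'wb') as f:
    pickle.dump(checkpoint, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)   # only safe because we wrote this file ourselves

print(restored == checkpoint)
```

The restored object is an equal copy of the original, nested structure included.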

**Your turn**
- Read `data/alice_in_wonderland.txt` into memory. How many characters does it contain? How does this compare to its size on disk?
- Print out the unique non-ASCII characters in Alice in Wonderland (hint: a non-ASCII character takes more than 1 byte when encoded as UTF-8).
- Write the first 10,000 characters of Alice in Wonderland as text and as a pickle. What are the sizes of each file on disk?

### Some other common file types

We've already seen text and binary formats, but let's take a look at a couple others.

#### JSON

**J**ava**S**cript **O**bject **N**otation - a nested structure of lists and dictionaries (which JSON itself calls "arrays" and "objects"). A very common way of transmitting data on the web because it's simple for both humans and computers to parse.
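
To make the lists-and-dictionaries structure concrete, here is a small round trip through the standard `json` module. The record is invented for illustration and is not taken from the course data.

```python
import json

# A hypothetical record, made up for illustration (not from good_movies.json).
record = {'title': 'Example Movie', 'year': 2000, 'stars': ['Actor A', 'Actor B']}

as_text = json.dumps(record)    # Python object -> JSON text
print(as_text)

parsed = json.loads(as_text)    # JSON text -> Python object
print(parsed['stars'][0])
```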

In [40]:
import json
with open('data/good_movies.json', 'r') as file:
    good_movies = json.load(file)    # parse JSON directly from the open file

In [41]:
from pprint import pprint    # pprint for pretty-printing nested objects
pprint(good_movies)

[{'oscar_nominations': 14,
  'short_summary': 'A jazz pianist falls for an apsiring actres in Los '
                   'Angeles.',
  'stars': ['Ryan Gosling', 'Emma Stone', 'Rosemarie DeWitt'],
  'title': 'La La Land',
  'year': 2016},
 {'oscar_nominations': 8,
  'short_summary': 'A timeless story of human self-discovery and connection, '
                   'Moonlight chronicles the life of a young black man from '
                   'childhood to adulthood as he struggles to find his place '
                   'in the world while growing up in a rough neighborhood of '
                   'Miami.',
  'stars': ['Mahershala Ali', 'Shariff Earp', 'Duan Sanderson'],
  'title': 'Moonlight',
  'year': 2016},
 {'oscar_nominations': 3,
  'short_summary': 'Acting under the cover of a Hollywood producer scouting a '
                   'location for a science fiction film, a CIA agent launches '
                   'a dangerous operation to rescue six Americans in Tehran '
                   'duri

**Your turn**

- Iterating over `good_movies`, print the name of the movies that Ben Affleck stars in.
- Find the total number of Oscar nominations for 2016 movies in the dataset.

#### CSV

**C**omma-**s**eparated **v**alue data is another very common, easy-to-use way of storing data. In particular, CSVs are used when you have *tabular* data - think, data that fits in a spreadsheet. The most common format is for *columns* to correspond to *categories* and *rows* to correspond to *examples*.

In [53]:
import csv
good_movies = []
with open('data/good_movies.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        good_movies.append(row)

In [54]:
pprint(good_movies)

[{'oscar_nominations': '14',
  'short_summary': 'A jazz pianist falls for an apsiring actres in Los '
                   'Angeles.',
  'star_1': 'Ryan Gosling',
  'star_2': 'Emma Stone',
  'star_3': 'Rosemarie DeWitt',
  'title': 'La La Land',
  'year': '2016'},
 {'oscar_nominations': '8',
  'short_summary': 'A timeless story of human self-discovery and connection, '
                   'Moonlight chronicles the life of a young black man from '
                   'childhood to adulthood as he struggles to find his place '
                   'in the world while growing up in a rough neighborhood of '
                   'Miami.',
  'star_1': 'Mahershala Ali',
  'star_2': 'Sheriff Earp',
  'star_3': 'Duan Sanderson',
  'title': 'Moonlight',
  'year': '2016'},
 {'oscar_nominations': '3',
  'short_summary': 'Acting under the cover of a Hollywood producer scouting a '
                   'location for a science fiction film, a CIA agent launches '
                   'a dangerous operation to r

Look familiar? `csv.DictReader` is actually parsing the CSV row-by-row into a JSON-like structure!

In [55]:
good_movies[0]['title']  # value of cell in first row, column called "title"

'La La Land'
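
One difference from the JSON version is visible in the output above: every value, including `year` and `oscar_nominations`, comes back as a string. A minimal sketch of converting numeric columns, using a small in-memory CSV with made-up rows rather than the course file:

```python
import csv
import io

# A small in-memory stand-in for a CSV file; the rows are made up for illustration.
csv_text = (
    "title,year,oscar_nominations\n"
    "Example Movie,2000,2\n"
    "Another Movie,2001,5\n"
)

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Every value arrives as a string, so convert numeric columns explicitly.
    row['year'] = int(row['year'])
    row['oscar_nominations'] = int(row['oscar_nominations'])
    rows.append(row)

total_nominations = sum(row['oscar_nominations'] for row in rows)
print(rows[0]['year'], total_nominations)
```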

For doing simple things like iterating over data structures, these built-in methods and objects are sufficient. But more complicated tasks will require better tooling.

## A bit of linear algebra

Linear algebra, particularly concepts like scalars, vectors, and matrices, is a fundamental way of describing relationships between data. Next week, we'll be using vectors and matrices to represent and understand some real data. But before we get there, let's review the basics of linear algebra.

### Building blocks

At a basic level, the study of algebra is the study of relationships between numbers. Linear algebra is the study of numbers that are related to each other by lines. Exactly what I mean by this will become clearer as we go. In the meantime, let me introduce a few basic concepts in linear algebra.

#### Scalars

A scalar is a single quantity, which in Python we might represent with an integer or float. For example, the number of times Carolina has beaten Duke, the number of feet in a mile, and the time it takes you to drive from your house to downtown Durham are all scalars.

In [48]:
a_scalar = 10
another_scalar = 493.092

Relationships between scalars can be described using the sorts of functions you learned about in high-school algebra class:

In [49]:
def double_me(x):
    return 2*x

print(double_me(a_scalar))
print(double_me(another_scalar))

20
986.184


In [50]:
def multiply_them(x, y):
    return x * y

print(multiply_them(a_scalar, another_scalar))

4930.92


So far, so good.

#### Vectors

A vector is just a way of grouping a collection of scalars together into a single, coherent unit. Geographical coordinates like (35.991708, -78.902830) are an example of a vector - the first number corresponds to how far north of the equator we are right now, and the second number corresponds to how far west of the prime meridian (running through Greenwich, UK) we are (the negative sign indicates west). Together, latitude and longitude uniquely identify every place on the Earth's surface.

In pure Python, one way to represent a vector is with lists:

In [51]:
our_location = [35.991708, -78.902830]

#### Matrices

Just like a vector is a way of grouping scalars together into a single entity, a matrix is a way of grouping vectors together into a single entity.

A common reason for using matrices is to describe how data stored in vectors changes with external conditions. For example, imagine a portfolio of stocks in 3 companies: Apple, Google, and Microsoft, with \$1,000, \$250, and \$3,200 worth of stock in each company, respectively. You could represent your portfolio with a vector:

In [53]:
my_portfolio = [1000, 250, 3200]    # = [$ of Apple stock, $ of Google stock, $ of Microsoft stock]

Now, imagine the stock market has a perfectly flat day, meaning all of your portfolio values stay constant:

In [56]:
for i in range(len(my_portfolio)):
    my_portfolio[i] = 1 * my_portfolio[i]

my_portfolio

[1000, 250, 3200]

This is equivalent to saying that for each stock $i$ with value $v_i$, the new value is:

$v_i^{\text{new}} = 1 \cdot v_i + \sum_{j \neq i} 0 \cdot v_j$
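
One way to read the formula above: each new value is a weighted sum over all of the stocks, with weight 1 on the stock itself and weight 0 on every other stock. Arranging those weights in a grid, one row per stock, gives a matrix. The sketch below is one possible way to continue the example, not necessarily how the course will develop it; `mat_vec` and `flat_day` are hypothetical names.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of numbers)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]

my_portfolio = [1000, 250, 3200]

# A "flat day": each stock keeps 1x its own value and gains 0x every other stock's.
flat_day = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

print(mat_vec(flat_day, my_portfolio))
```

Changing a diagonal entry, say from 1 to 2, would double the value of that one stock and leave the others untouched.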

## NumPy: an intro to doing fast linear algebra

Now that we have an introduction to using data in Python, let's introduce some more ways of manipulating that data.

NumPy (**Num**erical **Py**thon) and Pandas are two fundamental libraries in the Python data science ecosystem. They provide extensive tooling for performing calculations with tabular/multi-dimensional data. But most importantly - all of the "heavy lifting" is done by algorithms written in "fast" languages like C and Fortran. By making appropriate use of these libraries, we can make our Python code run *much much* faster than it would if it were written in pure Python.

### NumPy

The fundamental object that NumPy provides is an array:

In [67]:
import numpy as np

list_of_numbers = [1, 2, 3, 4, 5]
array_1d = np.array(list_of_numbers)
array_1d.shape

(5,)

In [68]:
print(array_1d)

[1 2 3 4 5]


In [65]:
another_list_of_numbers = [6, 7, 8, 9, 10]
array_2d = np.array([list_of_numbers, another_list_of_numbers])
array_2d.shape

(2, 5)

In [66]:
print(array_2d)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


In addition to defining arrays by hand, we can produce them programmatically:

In [70]:
# create a 1D array with numbers 0-9
x = np.arange(10)
print(x)

[0 1 2 3 4 5 6 7 8 9]


In [75]:
# create an array with 4 evenly-spaced numbers starting at 1 and ending at 13
y = np.linspace(1, 13, 4)
print(y)

[  1.   5.   9.  13.]


In [76]:
# create some other common types of arrays
x = np.zeros((3, 5))
print(x)

[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]


In [78]:
x = np.ones((3, 5))
print(x)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]


In [79]:
x = np.eye(5)    # why this name?
print(x)

[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]


In [82]:
x = np.random.rand(4)
print(x)

[ 0.02081749  0.55420267  0.29518262  0.69192557]


To access the elements of NumPy arrays, we use the same notation as accessing elements of a Python list. Remember that in Python, indexing always starts at 0!

In [88]:
x[0]

0.02081749054190285

In [92]:
print(array_2d)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


In [94]:
array_2d[1][4]

10

In [96]:
array_2d[1, 4]

10

**Your turn**

Create a NumPy array with 100,000 random numbers. Then, write two functions (in pure Python, not using built-in NumPy functions):

- Compute the average
- Compute the standard deviation

In [87]:
%%timeit
np.std(np.random.rand(100000))

1000 loops, best of 3: 1.23 ms per loop


Now, talk about performance difference w/ NumPy built-ins
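
As a rough sketch of how such a comparison might be set up, the snippet below times a hand-written summation loop against `np.sum` using the standard library's `timeit` module. It sums rather than averages so as not to give away the exercise; `sum_pure_python` is a hypothetical helper, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

def sum_pure_python(values):
    """Add up the values with an explicit Python loop."""
    total = 0.0
    for value in values:
        total += value
    return total

data = np.random.rand(100000)
data_list = data.tolist()    # plain Python floats for the pure-Python version

# Time 10 runs of each approach; exact numbers depend on the machine.
t_python = timeit.timeit(lambda: sum_pure_python(data_list), number=10)
t_numpy = timeit.timeit(lambda: np.sum(data), number=10)

print('pure Python:', t_python, 'seconds')
print('NumPy:      ', t_numpy, 'seconds')
```

Both approaches return the same answer up to floating-point rounding; the difference is in how long they take to get there.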

We'll be using more NumPy functionality for the rest of the course, and we'll introduce it as we go along.

**Your turn**: some stuff with calculating averages, weighted averages as a lead-in to linear algebra