# Week 2: Dealing with data

Welcome back everyone!

The agenda for today:

- Review homework assignments, answer questions
- How do computers store data?
- Common data storage abstractions you should know
- A lightning intro to 2 very important libraries: numpy and pandas

## Homework solutions



## How is data stored?

Actually, there are lots of different ways to answer this question...

Moving from abstract to concrete:

- A brief intro to computer architecture
- Data storage paradigms
- Different types of files, the advantages and disadvantages of each

The motivation:

- Important to be informed about the tradeoffs between storage paradigms
- Writing performant code requires familiarity with *why* some code is faster than other code

### Stepping wayyyy back - what is a computer?

During WWII, the mathematical foundations of modern computing were developed out of necessity. In 1945, John von Neumann drafted a design for a *stored program computer*:

![](images/von_Neumann.png)

- A processing device
    - Control unit for fetching and decoding instructions
    - Arithmetic/logic unit for executing those instructions
- Memory for storing data
- External mass storage
- Input/output lines for communicating with the world

The advance over previous designs: instead of "hard-wiring" (literally) a program, design a device that takes in a program just like data. The logic unit is able to execute a limited number of operations. By composing those operations together, we can prove *mathematically* that any problem that is solvable can be solved (ignoring resource usage...). *This is still fundamentally the way computers work today.*

Fast-forward 70 years, how do modern computers store and process data?

![](images/ComputerMemoryHierarchy.png)

What does that mean for us:

- For large organizations, real tape backup is a thing! https://aws.amazon.com/glacier/
- Most of us store data long-term on hard drives
- When actively working on a project, our data lives in RAM
- When in the middle of a computation, data is shifting from RAM to CPU cache
- CPU actually does work on bits that are in its registers

The typical computing workflow:

- Want to process some data stored on a hard drive (either physically connected to our local machine or accessible over a network connection)
- Provide the address of that data and some information for how to load it into memory
- Provide a set of instructions for what to do with the data that's in memory
- Write intermediate results to memory
- Store final results on a hard drive
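
The workflow above can be sketched in a few lines of Python. This is only an illustration: the file names, the sample text, and the word-counting task are all made up so that the example runs anywhere.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical setup: create a small text file so the example is self-contained.
workdir = Path(tempfile.mkdtemp())
input_path = workdir / "notes.txt"
input_path.write_text("the cat saw the dog\nthe dog ran\n", encoding="utf-8")

# 1. Give the address (path) of the data and say how to load it (as UTF-8 text).
with open(input_path, "r", encoding="utf-8") as file:
    text = file.read()                # the data now lives in memory (RAM)

# 2. Give instructions for what to do with the in-memory data.
words = text.split()                  # intermediate result, also held in memory
word_counts = Counter(words)

# 3. Store the final result back on disk.
output_path = workdir / "word_counts.txt"
with open(output_path, "w", encoding="utf-8") as file:
    for word, count in word_counts.most_common():
        file.write(f"{word}\t{count}\n")

print(output_path.read_text(encoding="utf-8"))
```

Nothing here mentions caches or registers: that part of the pipeline is handled by the hardware and the Python interpreter.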

Note - modern computers handle the RAM <-> Cache <-> CPU pipeline for us. But understanding how it works allows us to write faster code (will return to this later today).

### Storing data (on disk)

There are lots of ways that data can be stored on disk, and different formats can have drastic performance differences!

#### Text vs binary data

In [23]:
%%timeit         # an IPython "magic" function for profiling blocks of code
with open('data/sherlock_holmes.txt', 'r') as file:
    file.read()

1000 loops, best of 3: 1.4 ms per loop


In [18]:
import pickle         # library for loading and storing binary data

In [24]:
%%timeit
with open('data/sherlock_holmes.pickle', 'rb') as file:  # note the 'rb' - for "read binary"
    pickle.load(file)

1000 loops, best of 3: 394 µs per loop


In [27]:
with open('data/sherlock_holmes.txt', 'r') as file:
    sherlock_text = file.read()

with open('data/sherlock_holmes.pickle', 'rb') as file:
    sherlock_binary = pickle.load(file)

In [28]:
sherlock_text == sherlock_binary

True

What happened here? Recall - there are no primitive data types in Python, everything is an object! So to read data from disk, Python must convert that data into an object.

When reading a file from disk into memory, Python:

- Takes the raw bytes from disk
- *Decodes* the raw bytes into their character representations
- Builds the Python object (a `str`) that stores those characters
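
These steps can be made explicit by opening a file in binary mode and turning the bytes into characters by hand. A minimal sketch, assuming UTF-8 text; the sample sentence and the temporary file are invented for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical sample text containing two accented (non-ASCII) characters.
original = "Élémentaire, mon cher Watson"

path = Path(tempfile.mkdtemp()) / "sample.txt"
path.write_bytes(original.encode("utf-8"))   # characters -> bytes on disk

# Step 1: take the raw bytes from disk ('rb' skips any decoding).
with open(path, "rb") as file:
    raw_bytes = file.read()

# Steps 2 and 3: decode the bytes into characters, which builds a str object.
decoded = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__, len(raw_bytes))   # a bytes object, length in bytes
print(type(decoded).__name__, len(decoded))       # a str object, length in characters
```

The byte count is larger than the character count because each accented letter takes more than one byte in UTF-8. Opening the file with `'r'` performs the same decoding step behind the scenes.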

TODO - fact checking here. 

Elaborate on why pickling is faster. Show other Python objects being pickled. Your turn example. Explain that a pickle is Python-only. Pickle is extensible - many libraries have built-in functionality for checkpointing work. Security warning.
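
Pickle is not limited to strings. As a minimal sketch, here is a nested Python object making the round trip through `pickle.dumps` and `pickle.loads`; the record itself is made up for illustration.

```python
import pickle

# Hypothetical record, invented for illustration.
record = {"title": "A Study in Scarlet", "year": 1887, "tags": ["novel", "mystery"]}

payload = pickle.dumps(record)        # serialize the object to a bytes object
restored = pickle.loads(payload)      # rebuild an equivalent Python object

print(type(payload).__name__)         # the serialized form is raw bytes
print(restored == record)             # same contents...
print(restored is record)             # ...but a brand-new object
```

Two caveats. The pickle format is specific to Python, so other languages cannot easily read these files. Unpickling can also execute arbitrary code, so only load pickles that come from a source you trust.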

### Some other common file types

We've already seen text and binary formats, but let's take a look at a couple others.

#### JSON

**J**ava**S**cript **O**bject **N**otation - a nested structure of lists and dictionaries (called "arrays" and "objects" in JSON terminology). A very common way of transmitting data on the web because it's simple for both humans and computers to parse.
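
To see how JSON values map onto Python objects, here is a small sketch using the standard `json` module. The JSON string is made up for illustration and is not taken from the course data.

```python
import json

# Hypothetical JSON document, invented for illustration.
raw = '{"title": "Example Film", "year": 2016, "stars": ["A", "B"], "sequel": null, "released": true}'

movie = json.loads(raw)               # JSON text -> Python objects

print(type(movie).__name__)           # JSON object -> dict
print(type(movie["stars"]).__name__)  # JSON array  -> list
print(type(movie["year"]).__name__)   # JSON number -> int
print(movie["sequel"], movie["released"])  # null -> None, true -> True

# Going the other way: Python objects -> JSON text.
print(json.dumps(movie, indent=2))
```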

In [40]:
import json
with open('data/good_movies.json', 'r') as file:
    good_movies = json.load(file)    # json.load reads and parses directly from the open file

In [41]:
from pprint import pprint    # pprint for pretty-printing nested objects
pprint(good_movies)

[{'oscar_nominations': 14,
  'short_summary': 'A jazz pianist falls for an apsiring actres in Los '
                   'Angeles.',
  'stars': ['Ryan Gosling', 'Emma Stone', 'Rosemarie DeWitt'],
  'title': 'La La Land',
  'year': 2016},
 {'oscar_nominations': 8,
  'short_summary': 'A timeless story of human self-discovery and connection, '
                   'Moonlight chronicles the life of a young black man from '
                   'childhood to adulthood as he struggles to find his place '
                   'in the world while growing up in a rough neighborhood of '
                   'Miami.',
  'stars': ['Mahershala Ali', 'Shariff Earp', 'Duan Sanderson'],
  'title': 'Moonlight',
  'year': 2016},
 {'oscar_nominations': 3,
  'short_summary': 'Acting under the cover of a Hollywood producer scouting a '
                   'location for a science fiction film, a CIA agent launches '
                   'a dangerous operation to rescue six Americans in Tehran '
                   'duri

**Your turn**

- Iterating over `good_movies`, print the titles of the movies that Ben Affleck stars in.
- Find the total number of Oscar nominations for 2016 movies in the dataset.

#### CSV

**C**omma-**s**eparated **v**alue data is another very common, easy-to-use way of storing data. In particular, CSVs are used when you have *tabular* data - think, data that fits in a spreadsheet. The most common format is for *columns* to correspond to *categories* and *rows* to correspond to *examples*.
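
To make the layout concrete, here is a sketch that writes a tiny table with `csv.writer` and prints the raw text. The rows are made up, and `io.StringIO` stands in for a file on disk.

```python
import csv
import io

# Hypothetical rows, invented for illustration.
header = ["title", "year", "rating"]
rows = [
    ["Example Film", 2016, 7.5],
    ["Another, Longer Title", 2012, 8.1],   # note the comma inside the title
]

buffer = io.StringIO()                # an in-memory stand-in for a file on disk
writer = csv.writer(buffer)
writer.writerow(header)               # first line: the column names
writer.writerows(rows)                # then one line per example

csv_text = buffer.getvalue()
print(csv_text)
```

The title that contains a comma is wrapped in double quotes so that it still counts as one cell. Handling details like this is a good reason to use the `csv` module rather than splitting lines on commas by hand.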

In [53]:
import csv
good_movies = []
with open('data/good_movies.csv', 'r', newline='') as file:  # newline='' is recommended by the csv docs
    reader = csv.DictReader(file)
    for row in reader:
        good_movies.append(row)

In [54]:
pprint(good_movies)

[{'oscar_nominations': '14',
  'short_summary': 'A jazz pianist falls for an apsiring actres in Los '
                   'Angeles.',
  'star_1': 'Ryan Gosling',
  'star_2': 'Emma Stone',
  'star_3': 'Rosemarie DeWitt',
  'title': 'La La Land',
  'year': '2016'},
 {'oscar_nominations': '8',
  'short_summary': 'A timeless story of human self-discovery and connection, '
                   'Moonlight chronicles the life of a young black man from '
                   'childhood to adulthood as he struggles to find his place '
                   'in the world while growing up in a rough neighborhood of '
                   'Miami.',
  'star_1': 'Mahershala Ali',
  'star_2': 'Sheriff Earp',
  'star_3': 'Duan Sanderson',
  'title': 'Moonlight',
  'year': '2016'},
 {'oscar_nominations': '3',
  'short_summary': 'Acting under the cover of a Hollywood producer scouting a '
                   'location for a science fiction film, a CIA agent launches '
                   'a dangerous operation to r

Looks a bit familiar? `csv.DictReader` is actually parsing the CSV row-by-row into a JSON-like structure!

In [55]:
good_movies[0]['title']  # value of cell in first row, column called "title"

'La La Land'
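
One difference from the JSON version is visible in the printout: every value comes back as a string (`'14'` rather than `14`), because a CSV file carries no type information. A minimal sketch of converting the numeric columns by hand, using made-up CSV text in place of the course file:

```python
import csv
import io

# Hypothetical CSV text standing in for a file such as data/good_movies.csv.
csv_text = "title,year,oscar_nominations\nExample Film,2016,14\nOther Film,2012,3\n"

movies = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Every value arrives as a str, so numeric columns are converted explicitly.
    row["year"] = int(row["year"])
    row["oscar_nominations"] = int(row["oscar_nominations"])
    movies.append(row)

total_nominations = sum(movie["oscar_nominations"] for movie in movies)
print(total_nominations)
```

Without the `int(...)` calls, `sum` would fail with a `TypeError`, since it cannot add strings to its starting value of 0.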

For doing simple things like iterating over data structures, these built-in methods and objects are sufficient. But more complicated tasks will require better tooling.

## NumPy and Pandas: an intro to the data scientist's core toolbox