# First Python Notebook: Scripting your way to the story

By Ben Welsh

A step-by-step guide to analyzing data with Python and the Jupyter Notebook.

This tutorial will teach you how to use computer programming tools to analyze data by exploring contributors to campaigns for and against Proposition 64, a ballot measure asking California voters to decide if recreational marijuana should be legalized.

This guide was developed by Ben Welsh for an [Oct. 2, 2016, "watchdog workshop" organized by Investigative Reporters and Editors](http://ire.org/events-and-training/event/2819/2841/) at San Diego State University's school of journalism.



## Prelude: Prerequisites

Before you can begin, your computer needs the following tools installed and working.

1. A [command-line interface](https://en.wikipedia.org/wiki/Command-line_interface) to interact with your computer
2. Version 2.7 of the [Python](http://python.org/download/releases/2.7.6/) programming language
3. The [pip](https://pip.pypa.io/en/latest/installing.html) package manager and [virtualenv](http://www.virtualenv.org/en/latest/) environment manager for Python

### Command-line interface

Unless something is wrong with your computer, there should be a way to open a window that lets you type in commands. Different operating systems give this tool slightly different names, but they all have some form of it, and there are alternative programs you can install as well.

On Windows you can find the command-line interface by opening the "command prompt." Here are instructions for [Windows 8](http://windows.microsoft.com/en-us/windows/command-prompt-faq#1TC=windows-8) and [earlier versions](http://windows.microsoft.com/en-us/windows-vista/open-a-command-prompt-window). On Apple computers, you open the ["Terminal" application](http://blog.teamtreehouse.com/introduction-to-the-mac-os-x-command-line). Ubuntu Linux comes with a program of the [same name](http://askubuntu.com/questions/38162/what-is-a-terminal-and-how-do-i-open-and-use-it).

### Python

If you are using Mac OSX or a common flavor of Linux, Python is probably already installed. You can check which version, if any, is waiting for you by typing the following into your terminal.

```bash
python -V
```

If you don't have Python installed (a more likely fate for Windows users) try downloading and installing it from
[here](https://www.python.org/downloads/release/python-2712/).

In Windows, it's also crucial to make sure that the Python program is available on your system's ``PATH`` so it can be called from anywhere on the command line. [This screencast](http://showmedo.com/videotutorials/video?name=960000&fromSeriesID=96) can guide you through that process.

Python 2.7 is preferred but you can probably find a way to make most of this tutorial work with other versions if you futz a little.

### pip and virtualenv

The [pip package manager](https://pip.pypa.io/en/latest/) makes it easy to install open-source libraries that expand what you're able to do with Python. Later, we will use it to install everything needed to analyze data in the Jupyter Notebook.

If you don't have it already, you can get pip by following [these instructions](https://pip.pypa.io/en/latest/installing.html). In Windows, it's necessary to make sure that the Python ``Scripts`` directory is available on your system's ``PATH`` so it can be called from anywhere on the command line. [This screencast](http://showmedo.com/videotutorials/video?name=960000&fromSeriesID=96) can help.

Verify pip is installed with the following.

```bash
pip -V
```

The [virtualenv environment manager](http://www.virtualenv.org/en/latest/) makes it possible to create an isolated corner of your computer where all the different tools you use to build an application are sealed off.

It might not be obvious why you need this, but it quickly becomes important when you need to juggle different tools
for different projects on one computer. By developing your applications inside separate virtualenv environments, you can use different versions of the same third-party Python libraries without a conflict. You can also more easily recreate your project on another machine, handy when you want to copy your code to a server that publishes pages on the Internet.

You can check if virtualenv is installed with the following.

```bash
virtualenv --version
```

If you don't have it, install it with pip.

```bash
pip install virtualenv
# If you're on a Mac or Linux and get an error saying you lack permissions, try again as a superuser.
sudo pip install virtualenv
```

If that doesn't work, [try following this advice](http://virtualenv.readthedocs.org/en/latest/installation.html).

## Act 1: Hello Jupyter Notebook

Start by creating a new development environment with virtualenv in your terminal. Name it after our application.

```bash
virtualenv first-python-notebook
```

Jump into the directory it created.

```bash
cd first-python-notebook
```

Turn on the new virtualenv, which will instruct your terminal to only use those libraries installed
inside its sealed space. You only need to create the virtualenv once, but you'll need to repeat these
"activation" steps each time you return to working on this project.

```bash
# In Linux or Mac OSX try this...
. bin/activate
# In Windows it might take something more like...
cd Scripts
activate
cd ..
```

Use ``pip`` on the command line to install [Jupyter Notebook](http://jupyter.org/), an open-source tool for writing and sharing Python scripts.

```bash
pip install jupyter
```

Start up the notebook from your terminal.

```bash
jupyter notebook
```

That will open up a new tab in your default web browser that looks something like this:

![](http://jupyter.readthedocs.io/en/latest/_images/tryjupyter_file.png)

Click the "New" button in the upper right and create a new Python 2 notebook. Now you're all set up and ready to start writing code.

## Act 2: Hello Python

In [None]:
# 2+2
# basic variable assignment foo+bar
# download the CSV of Prop. 64 data
# Read it in with open()
# for loop to print it out line by line

You are now ready to roll within the Jupyter Notebook's framework for writing Python. Don't stress. There's nothing too fancy about it. You can start by just doing a little simple math. Type the following into the first box, then hit the play button in the toolbar (or hit SHIFT+ENTER on your keyboard).

In [9]:
2+2

4

There. You've just written your first Python code. You've entered two integers (the 2's) and added them together using the plus sign operator. 

Next, let's introduce one of the basics of computer programming, a variable.

Variables are like containers that hold different types of data so you can go back and refer to them later. They’re fundamental to programming in any language, and you’ll use them all the time.

Move down to the next box. Now let's put that number two into our first variable. 

In [5]:
san = 2

In this case, we’ve created a variable called ``san`` and assigned it the integer value 2.

In Python, variable assignment is done with the = sign. On the left is the name of the variable you want to create (it can be anything) and on the right is the value that you want to assign to that variable.

If we use the ``print`` command on the variable, Python will output the value stored in it. In the notebook, that output appears just below the box. Let's try it.

In [6]:
print san

2


We can do the same thing again with a different variable name.

In [7]:
diego = 2

Then add those two together the same way we added the numbers at the top.

In [8]:
san + diego

4
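
Variables keep whatever value they were last given, and a result stored in a variable doesn't change when its ingredients change later. Here's a small sketch of that idea. The ``total`` name is made up for this example, and ``print`` is written with parentheses so the code runs the same on Python 2.7 and newer versions.

```python
san = 2
diego = 2

# Store the result of the addition in a third variable.
total = san + diego
print(total)

# Assigning again replaces what san holds...
san = 10
print(san + diego)

# ...but total keeps the value it was given earlier.
print(total)
```

Run this in a notebook box and you should see 4, then 12, then 4 again.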

Variables can contain many different kinds of data types. There are integers, strings, floating point numbers (decimals), lists and dictionaries.

In [10]:
string = "Hello"

In [11]:
decimal = 1.2

In [12]:
list_of_strings = ["a", "b", "c", "d"]
list_of_integers = [1, 2, 3, 4]
list_of_whatever = ["a", 2, "c", 4]

In [13]:
my_phonebook = {'Mom': '713-555-5555', 'Chinese Takeout': '573-555-5555'}
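
Once data is stored in a list or a dictionary, you'll usually want to pull a piece back out. The sketch below reuses two of the variables above. The new names (``first_letter``, ``how_many``, ``moms_number``) are only for illustration, and ``print`` is written with parentheses so it works on Python 2.7 and newer versions alike.

```python
list_of_strings = ["a", "b", "c", "d"]
my_phonebook = {'Mom': '713-555-5555', 'Chinese Takeout': '573-555-5555'}

# Lists keep their order. Positions start counting at zero.
first_letter = list_of_strings[0]

# len() reports how many items a list holds.
how_many = len(list_of_strings)

# Dictionaries look up a value by its key rather than by position.
moms_number = my_phonebook['Mom']

print(first_letter)
print(how_many)
print(moms_number)
```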

Playing with data we invent can be fun, but it's a long way from investigative journalism. Now's the time for us to get our hands on some real data and get some work done.

Your assignment: Proposition 64. 

The use and sale of marijuana for recreational purposes is illegal in California. [Proposition 64](http://www.oag.ca.gov/system/files/initiatives/pdfs/15-0103%20%28Marijuana%29_1.pdf), scheduled to appear on the November 8 ballot, asks voters if it ought to be legalized. A "yes" vote would support legalization. A "no" vote would oppose it. A similar measure, [Proposition 19](http://articles.latimes.com/print/2010/nov/03/local/la-me-pot-20101103-1), was defeated in 2010.

[According to California's Secretary of State](http://www.sos.ca.gov/campaign-lobbying/cal-access-resources/measure-contributions/marijuana-legalization-initiative-statute/), more than 16 million dollars have been raised to campaign in support of Prop. 64 as of September 20. Just over 2 million has been raised to oppose it.

Your mission, should you choose to accept it, is to download a list of campaign contributors and figure out the biggest donors both for and against the measure.

[Click here](https://docs.google.com/spreadsheets/d/1Zsxlq01Wqu9D1qLLesjA7aGwclYTGeNsPY4Ax_jwtYI/pub?gid=0&single=true&output=csv) to download the file as a list of comma-separated values. This is known as a CSV file, and it is the most common way you will find data published online. Save the file with the name ``first-python-notebook.csv`` in the same directory where you made this notebook.

Python can read in files using the built-in ``open`` function. You feed two things into it: 1) The path to the file; 2) What type of operation you'd like it to execute on the file. "r" stands for read.

In [17]:
data_file = open("./first-python-notebook.csv", "r")

Print that variable and you see that ``open`` has created a file "object" that offers a number of different ways to interact with the contents of the file.

In [20]:
print data_file

<open file './first-python-notebook.csv', mode 'r' at 0x7f3f1c593390>
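
For a taste of what a file object offers, here's a sketch using ``readline``, which hands back one line at a time. So that it runs on its own, it first writes a tiny stand-in file to a temporary folder. That file, its two rows (names and amounts borrowed from the Prop. 64 data) and the variable names are all assumptions made for this example. It also uses ``with``, a common Python pattern that closes the file for you when the indented block ends.

```python
import os
import tempfile

# Build a tiny stand-in file so this sketch doesn't depend on the download.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w") as sample_file:
    sample_file.write("LAST_NAME,AMOUNT\nSinquefield,15000\nPodolsky,100\n")

# Open it for reading and pull out the first two lines, one at a time.
with open(sample_path, "r") as sample_file:
    first_line = sample_file.readline()
    second_line = sample_file.readline()

print(first_line)
print(second_line)

# The file was closed automatically when the "with" block ended.
print(sample_file.closed)
```

Each call to ``readline`` picks up where the last one stopped, so the first call returns the header and the second returns the first row of data.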


One thing a file object can do is read in all of the data from the file. Let's do that next and store the contents in a new variable.

In [21]:
data = data_file.read()

In [22]:
print data

FILING_ID,COMMITTEE_ID,COMMITTEE_NAME,COMMITTEE_POSITION,AMEND_ID,FIRST_NAME,LAST_NAME,CITY,STATE,ZIPCODE,EMPLOYER,OCCUPATION,DATE,AMOUNT
1680042,1343793,"Californians for Responsible Marijuana Reform, Sponsored by Drug Policy Action, Yes on Prop. 64",SUPPORT,0,,Drug Policy Action Committee to Tax and Regulate Marijuana,Sacramento,CA,95815,,,1/3/2012 12:00:00 AM,10975.5
1937009,1343793,"Californians for Responsible Marijuana Reform, Sponsored by Drug Policy Action, Yes on Prop. 64",SUPPORT,0,Luke,Sinquefield,Pacific Palisades,CA,90272,Luke Sinquefield,Realtor,7/25/2014 12:00:00 AM,15000
2038717,1343793,"Californians for Responsible Marijuana Reform, Sponsored by Drug Policy Action, Yes on Prop. 64",SUPPORT,1,Susan,Podolsky,Modesto,CA,95350,Ontario Base Hospital Group,Physician,3/21/2016 12:00:00 AM,100
2038717,1343793,"Californians for Responsible Marijuana Reform, Sponsored by Drug Policy Action, Yes on Prop. 64",SUPPORT,1,Philip,Davis,Salinas,CA,93901,Monterey County Office of Ed

That's all good, but the data is printing out as one big long string. If we're going to do some real analysis, we need Python to recognize and respect the structure of our data, in the way an Excel spreadsheet would.
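
To see why that matters, here's a rough sketch that treats the data the way plain Python does: as text to be chopped up at every comma. It uses the header and one row copied from the output above. The variable names are invented for this example.

```python
header = (
    "FILING_ID,COMMITTEE_ID,COMMITTEE_NAME,COMMITTEE_POSITION,AMEND_ID,"
    "FIRST_NAME,LAST_NAME,CITY,STATE,ZIPCODE,EMPLOYER,OCCUPATION,DATE,AMOUNT"
)
row = (
    '1937009,1343793,"Californians for Responsible Marijuana Reform, '
    'Sponsored by Drug Policy Action, Yes on Prop. 64",SUPPORT,0,Luke,'
    'Sinquefield,Pacific Palisades,CA,90272,Luke Sinquefield,Realtor,'
    '7/25/2014 12:00:00 AM,15000'
)

# Chop each line into pieces wherever a comma appears.
header_pieces = header.split(",")
row_pieces = row.split(",")

# If the split respected the columns, these two counts would match.
print(len(header_pieces))
print(len(row_pieces))
```

The header splits into 14 pieces, but the row splits into 16, because the committee name has commas of its own tucked inside quotation marks. A naive split can't tell the difference. A tool that understands CSV can.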

To do that, we're going to need something smarter than ``open``. We're going to need something like agate.

## Act 3: Hello agate 

Lucky for us, Python already has tools filled with functions to do pretty much anything you’d ever want to do with a programming language: navigate the web, parse data, interact with a database, run fancy statistics, build a pretty website and so much more.

Some of those tools are included in a toolbox that comes with the language, known as the standard library. Others have been built by members of Python's developer community and need to be downloaded and installed from the web.

For this exercise, we're going to install and use [agate](http://agate.readthedocs.io/en/latest/), a tool for accessing and analyzing data maintained by IRE/NICAR members Christopher Groskopf and Neil Bedi.

There are several others we could use instead ([pandas](http://pandas.pydata.org/) is the most popular alternative) but we're picking agate here because I think it's easier for beginners.

We'll install agate the same way we installed the Jupyter Notebook earlier: Our friend ``pip``. Save your notebook, open up your terminal and hit CTRL-C. That will kill your notebook and return you to the command line. There we'll install agate.

```bash
pip install agate
```

Now let's restart our notebook and get back to work.

```bash
jupyter notebook
```

Now let's use the next open box to import agate, so we can use all its fancy methods here in our script.

In [25]:
import agate

Opening our CSV with agate isn't any harder than with ``open``. You just need to know the right trick to make it work.

In [27]:
agate.Table.from_csv("./first-python-notebook.csv")

<agate.table.Table at 0x7f3f0cfba610>

Great. Now let's do it again and assign it to a variable this time.

In [28]:
table = agate.Table.from_csv("./first-python-notebook.csv")

In [33]:
print table

|---------------------+------------|
|  column             | data_type  |
|---------------------+------------|
|  FILING_ID          | Number     |
|  COMMITTEE_ID       | Number     |
|  COMMITTEE_NAME     | Text       |
|  COMMITTEE_POSITION | Text       |
|  AMEND_ID           | Boolean    |
|  FIRST_NAME         | Text       |
|  LAST_NAME          | Text       |
|  CITY               | Text       |
|  STATE              | Text       |
|  ZIPCODE            | Text       |
|  EMPLOYER           | Text       |
|  OCCUPATION         | Text       |
|  DATE               | DateTime   |
|  AMOUNT             | Number     |
|---------------------+------------|



In [47]:
table.print_table()

|------------+--------------+----------------------+--------------------+----------+------------+------|
|  FILING_ID | COMMITTEE_ID | COMMITTEE_NAME       | COMMITTEE_POSITION | AMEND_ID | FIRST_NAME | ...  |
|------------+--------------+----------------------+--------------------+----------+------------+------|
|  1,680,042 |    1,343,793 | Californians for ... | SUPPORT            |    False |            | ...  |
|  1,937,009 |    1,343,793 | Californians for ... | SUPPORT            |    False | Luke       | ...  |
|  2,038,717 |    1,343,793 | Californians for ... | SUPPORT            |     True | Susan      | ...  |
|  2,038,717 |    1,343,793 | Californians for ... | SUPPORT            |     True | Philip     | ...  |
|  2,038,717 |    1,343,793 | Californians for ... | SUPPORT            |     True | Kevin      | ...  |
|  2,063,970 |    1,343,793 | Californians for ... | SUPPORT            |    False | Jessica    | ...  |
|  2,063,970 |    1,343,793 | Californians for ... | SU

In [37]:
print len(table.rows)

318


## Act 4: Hello analysis

In [None]:
# Sort by amount descending and print the top 10
# Sum up the total contribution amount
# Filter to support/oppose
# Sort and print top 10 for each
# Sum up the total contribution amount for each
# Group and count/sum by committee name
# Group and count/sum by the last_name field
# SHOULD WE EXPAND THE SOURCE FILE WE START WITH TO BE MORE THAN JUST PROP 64?

In [45]:
table.order_by("AMOUNT").print_table()

|------------+--------------+----------------------+--------------------+----------+------------+------|
|  FILING_ID | COMMITTEE_ID | COMMITTEE_NAME       | COMMITTEE_POSITION | AMEND_ID | FIRST_NAME | ...  |
|------------+--------------+----------------------+--------------------+----------+------------+------|
|  2,064,423 |    1,381,808 | Yes on 64, Califo... | SUPPORT            |     True | Larry      | ...  |
|  2,063,970 |    1,343,793 | Californians for ... | SUPPORT            |    False | Bruce      | ...  |
|  2,064,423 |    1,381,808 | Yes on 64, Califo... | SUPPORT            |     True | Rich       | ...  |
|  2,064,423 |    1,381,808 | Yes on 64, Califo... | SUPPORT            |     True | Marc       | ...  |
|  2,064,423 |    1,381,808 | Yes on 64, Califo... | SUPPORT            |     True | Dan        | ...  |
|  2,064,423 |    1,381,808 | Yes on 64, Califo... | SUPPORT            |     True | Dan        | ...  |
|  2,063,970 |    1,343,793 | Californians for ... | SU
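
By default ``order_by`` puts the smallest values first, while the plan for this act is to see the biggest donations at the top. As a plain-Python illustration of what a descending sort and a total involve, here's a sketch that skips agate entirely. The three donations are copied from the raw output in Act 2. The list-of-dictionaries layout, the ``get_amount`` helper and the other variable names are assumptions made for this example, not how agate stores its rows.

```python
# Three rows from the raw data, trimmed to two columns.
donations = [
    {"LAST_NAME": "Drug Policy Action Committee to Tax and Regulate Marijuana",
     "AMOUNT": 10975.5},
    {"LAST_NAME": "Sinquefield", "AMOUNT": 15000},
    {"LAST_NAME": "Podolsky", "AMOUNT": 100},
]


def get_amount(row):
    """Return the value to sort on for one donation (hypothetical helper)."""
    return row["AMOUNT"]


# reverse=True flips the sort so the largest amount comes first.
biggest_first = sorted(donations, key=get_amount, reverse=True)
for row in biggest_first:
    print(row["LAST_NAME"])
    print(row["AMOUNT"])

# Add up every amount to get the total raised in this tiny sample.
total = sum(get_amount(row) for row in donations)
print(total)
```

agate's documentation describes a ``reverse`` option for ``order_by`` that serves the same purpose, so the notebook version can stay a one-liner.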

## Act 5: Hello viz

In [None]:
# Install a charting library
# Bar chart of the top 10 contributors for each side
# Export the data to a CSV for your graphics department