# First Python Notebook data preparation

By Ben Welsh

This notebook prepares a sample dataset for a forthcoming class entitled "First Python Notebook: Scripting your way to the story." The class is scheduled to be taught as part of [an October 2016 "watchdog workshop" organized by Investigative Reporters and Editors](http://ire.org/events-and-training/event/2819/2841/) at San Diego State University's journalism school.

The class will focus on analyzing the contributors to one of this November's statewide ballot measures. Because the raw data published at [californiacivicdata.org](http://www.californiacivicdata.org) is still difficult for beginners to navigate, a simplified version for students will be prepared below.

In [1]:
import os
import requests
from datetime import datetime
from clint.textui import progress

In [2]:
import pandas
pandas.set_option('display.float_format', lambda x: '%.2f' % x)
pandas.set_option('display.max_columns', None)

In [3]:
def download_csv_to_dataframe(name):
    """
    Accepts the name of a calaccess.download CSV and returns it as a pandas dataframe.
    """
    path = os.path.join(os.getcwd(), '{}.csv'.format(name))
    if not os.path.exists(path):
        url = "http://calaccess.download/latest/{}.csv".format(name)
        r = requests.get(url, stream=True)
        with open(path, 'wb') as f:
            total_length = int(r.headers.get('content-length'))
            for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1): 
                if chunk:
                    f.write(chunk)
                    f.flush()
    return pandas.read_csv(path)
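
The function above only downloads a file when no local copy exists, and it writes the response to disk one chunk at a time. As a minimal sketch of that pattern without touching the network, the example below uses a hypothetical helper, `save_chunks_once`, and a hand-made list of byte chunks standing in for `r.iter_content()`. Neither name comes from the original notebook.

```python
import os
import tempfile


def save_chunks_once(path, chunks):
    """
    Hypothetical helper: writes an iterable of byte chunks to `path`
    unless the file is already there. Returns True if it wrote the file.
    """
    if os.path.exists(path):
        return False
    # Binary mode, because the chunks are raw bytes rather than text
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
    return True


# Stand-in for the chunks a streamed HTTP response would yield
fake_chunks = [b"FILING_ID,AMEND_ID\n", b"", b"100,0\n"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")

first_call = save_chunks_once(demo_path, fake_chunks)   # writes the file
second_call = save_chunks_once(demo_path, fake_chunks)  # finds it and skips

with open(demo_path, "rb") as f:
    saved = f.read()
```

Checking for the file first means rerunning the notebook does not download the large CAL-ACCESS tables again.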

In [4]:
def remove_amended_filings(df):
    """
    Accepts a dataframe with FILING_ID and AMEND_ID fields.
    
    Returns only the highest amendment for each unique filing id.
    """
    max_amendments = df.groupby('FILING_ID')['AMEND_ID'].agg("max").reset_index()
    merged_df = pandas.merge(df, max_amendments, how='inner', on=['FILING_ID', 'AMEND_ID'])
    print("Removed {} amendments".format(len(df)-len(merged_df)))
    print("DataFrame now contains {} rows".format(len(merged_df)))
    return merged_df
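
To see what the group-and-merge step does, here is a small sketch on made-up data. The filing numbers and the `AMOUNT` column are hypothetical and chosen only for illustration. Filing 100 has two amendments after the original, while filing 200 was never amended.

```python
import pandas

# Hypothetical toy data: filing 100 was amended twice, filing 200 never.
toy_df = pandas.DataFrame([
    {"FILING_ID": 100, "AMEND_ID": 0, "AMOUNT": 50.0},
    {"FILING_ID": 100, "AMEND_ID": 1, "AMOUNT": 75.0},
    {"FILING_ID": 100, "AMEND_ID": 2, "AMOUNT": 80.0},
    {"FILING_ID": 200, "AMEND_ID": 0, "AMOUNT": 20.0},
])

# Find the highest amendment number for each filing ...
max_amendments = toy_df.groupby("FILING_ID")["AMEND_ID"].agg("max").reset_index()

# ... then keep only the rows matching one of those (FILING_ID, AMEND_ID) pairs.
latest_df = pandas.merge(
    toy_df,
    max_amendments,
    how="inner",
    on=["FILING_ID", "AMEND_ID"]
)

print(latest_df)
```

Only two rows survive: amendment 2 of filing 100 and the original version of filing 200. The earlier versions of filing 100 are dropped, so their amounts are not counted more than once.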

Download the raw CAL-ACCESS tables that contain itemized receipts and links to the filers who reported them.

In [None]:
itemized_receipts_df = download_csv_to_dataframe("rcpt_cd")

In [None]:
filer_filings_df = download_csv_to_dataframe("filer_filings_cd")

California's Proposition 64 asks voters if the growth and sale of marijuana should be legalized in the state. As of September 20, [California's Secretary of State reports](http://www.sos.ca.gov/campaign-lobbying/cal-access-resources/measure-contributions/marijuana-legalization-initiative-statute/) that $16 million has been raised to campaign in support of the measure, and $2 million to oppose it.

Here are the committees the state lists as supporting the measure.

| Committee ID | Committee Name                                                                                                                                                                                    |
|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1343793        | Californians for Responsible Marijuana Reform, Sponsored by Drug Policy Action, Yes on Prop. 64                                                                                                   |
| 1376077        | Californians for Sensible Reform, Sponsored by Ghost Management Group, LLC dba Weedmaps                                                                                                           |
| 1385506        | Drug Policy Action - Non Profit 501c4, Yes on Prop. 64                                                                                                                                            |
| 1385745        | Fund for Policy Reform (Nonprofit 501(C)(4))                                                                                                                                                      |
| 1371855        | Marijuana Policy Project of California                                                                                                                                                            |
| 1382525        | New Approach PAC (MPO)                                                                                                                                                                            |
| 1386560        | The Adult Use Campaign for Proposition 64                                                                                                                                                         |
| 1381808        | Yes on 64, Californians to Control, Regulate and Tax Adult Use of Marijuana While Protecting Children, Sponsored by Business, Physicians, Environmental and Social-Justice Advocate Organizations |


Here are the committees the state lists as opposing the measure. 

| Committee ID   | Committee Name                                                                                    |
|----------------|----------------------------------------------------------------------------------------------------|
| 1382568        | No on Prop. 64, Sponsored by California Public Safety Institute                                    |
| 1387789        | Sam Action, Inc., a Committee Against Proposition 64 with Help from Citizens (NonProfit 501(C)(4)) |

In [None]:
supporting_committees = pandas.DataFrame([
    {"Committee ID":1343793,"Committee Name":"Californians for Responsible Marijuana Reform, Sponsored by Drug Policy Action, Yes on Prop. 64"},
    {"Committee ID":1376077,"Committee Name":"Californians for Sensible Reform, Sponsored by Ghost Management Group, LLC dba Weedmaps"},
    {"Committee ID":1385506,"Committee Name":"Drug Policy Action - Non Profit 501c4, Yes on Prop. 64"},
    {"Committee ID":1385745,"Committee Name":"Fund for Policy Reform (Nonprofit 501(C)(4))"},
    {"Committee ID":1371855,"Committee Name":"Marijuana Policy Project of California"},
    {"Committee ID":1382525,"Committee Name":"New Approach PAC (MPO)"},
    {"Committee ID":1386560,"Committee Name":"The Adult Use Campaign for Proposition 64"},
    {"Committee ID":1381808,"Committee Name":"Yes on 64, Californians to Control, Regulate and Tax Adult Use of Marijuana While Protecting Children, Sponsored by Business, Physicians, Environmental and Social-Justice Advocate Organizations"}
])

In [None]:
opposing_committees = pandas.DataFrame([
    {"Committee ID":1382568,"Committee Name":"No on Prop. 64, Sponsored by California Public Safety Institute"},
    {"Committee ID":1387789,"Committee Name":"Sam Action, Inc., a Committee Against Proposition 64 with Help from Citizens (NonProfit 501(C)(4))"}
])