# Goal of this notebook:

1. High-Level Checks of the dataset
    - Check nulls, uniques, and take a look at first and last 5 rows
2. Look column by column:
    - Go through each column and do a quick check for integrity


In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('../transactions_first_batch.csv', low_memory = False, dtype='O')

## 1. Data Overview
- Check number of records:

In [2]:
print("Rows: %i Columns: %i before null drop" % data.shape)
print("Rows: %i Columns: %i after null drop" % data.dropna().shape)

Rows: 88310 Columns: 8 before null drop
Rows: 87530 Columns: 8 after null drop


- Check the number of unique Items across all columns.
    - **Note:** committee_id has 104 unique values; I expected 100

In [3]:
data.apply(lambda x: len(set(x)))

Tran ID              88310
Tran Date             4101
Status                   2
Filer/Committee        142
Contributor/Payee    19501
Sub Type                29
Amount               14639
committee_id           104
dtype: int64

- Check the number of Nulls:
    - Note: The only column with null values is Contributor/Payee, with 780 nulls. Additional exploration later in this notebook.

In [4]:
len(data) - data.count()

Tran ID                0
Tran Date              0
Status                 0
Filer/Committee        0
Contributor/Payee    780
Sub Type               0
Amount                 0
committee_id           0
dtype: int64

- Look at the first and last five rows:

In [6]:
data = pd.read_csv('../transactions_first_batch.csv', low_memory = False, dtype='O')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([data.head(), data.tail()])

Unnamed: 0,Tran ID,Tran Date,Status,Filer/Committee,Contributor/Payee,Sub Type,Amount,committee_id
0,1454151,01/24/2013,Original,Speech Hearing Action Committee,Sara Gelser for State Representative (4680),Lost or Returned Check,$200.00,255
1,968825,11/30/2010,Original,Speech Hearing Action Committee,Oregon Speech Language Hearing Association,Cash Contribution,$53.00,255
2,934239,10/18/2010,Original,Speech Hearing Action Committee,Committee to Elect Dr. Alan Bates (3604),Cash Expenditure,$250.00,255
3,934242,10/18/2010,Original,Speech Hearing Action Committee,Frank Morse for State Senate (4335),Cash Expenditure,$250.00,255
4,934247,10/18/2010,Original,Speech Hearing Action Committee,Friends of Suzanne Bonamici (5254),Cash Expenditure,$200.00,255
88305,26967,12/21/2006,Original,"Lee, Charles E., for State Representative",Miscellaneous Personal Expenditures $100 and u...,Personal Expenditure for Reimbursement,$31.48,5328
88306,29612,12/21/2006,Amended,"Lee, Charles E., for State Representative","Mary E Lee, Treasurer",Cash Expenditure,$31.48,5328
88307,16136,12/05/2006,Original,"Lee, Charles E., for State Representative",Miscellaneous Cash Expenditures $100 and under,Cash Expenditure,$82.82,5328
88308,16135,12/01/2006,Original,"Lee, Charles E., for State Representative",Miscellaneous Cash Expenditures $100 and under,Cash Expenditure,$80.34,5328
88309,15872,11/29/2006,Original,"Lee, Charles E., for State Representative","Mary E Lee, Treasurer",Cash Expenditure,$700.00,5328


- Clean the Amount column so that currency symbols are removed:
    - **Note**: this might be something we would want to do while pulling data or transforming data for storage

In [7]:
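## Amount
# Strip '$', ',' and ')' as literal characters (regex=False keeps '$' and ')' from being read as regex syntax)
for i in ['$', ',', ')']:
    data.Amount = data.Amount.str.replace(i, '', regex=False)

# A leading '(' marks a negative amount, so turn it into a minus sign before converting
data.Amount = data.Amount.str.replace('(', '-', regex=False).astype(float)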
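
Following the note above about doing this while pulling or transforming data, the cleaning step could be wrapped in a small reusable helper. This is a minimal sketch, not the method used in this notebook: `parse_amount` is a hypothetical name, the sample strings are made up, and it assumes negative amounts are written in parentheses.

```python
import pandas as pd

def parse_amount(amounts: pd.Series) -> pd.Series:
    """Convert strings like '$1,250.75' or '($8.81)' to floats (hypothetical helper)."""
    cleaned = amounts.str.strip()
    # Assumption: a value wrapped in parentheses is negative
    negative = cleaned.str.startswith('(') & cleaned.str.endswith(')')
    # Drop every character that is not a digit or a decimal point
    digits = cleaned.str.replace(r'[^0-9.]', '', regex=True)
    values = digits.astype(float)
    # Keep positive values as they are, flip the sign where the mask is set
    return values.where(~negative, -values)

sample = pd.Series(['$200.00', '$1,250.75', '($8.81)'])  # made-up examples
parsed = parse_amount(sample)
print(parsed.tolist())
```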


## 2. Columns In Detail
- **Status Column:** There are two unique values. What are the proportions?

In [8]:
data.groupby('Status').size()

Status
Amended      1979
Original    86331
dtype: int64
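
The counts above do not directly show the proportions. A minimal sketch of one way to get them, reusing the two counts from the output; on the full frame, `data['Status'].value_counts(normalize=True)` should give the same shares. Amended records come out at roughly 2.2% of the total.

```python
import pandas as pd

# Counts copied from the groupby output above
status_counts = pd.Series({'Amended': 1979, 'Original': 86331})
# Divide each count by the total to get its share
status_share = status_counts / status_counts.sum()
print(status_share.round(4))
```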

- **Sub Type Column:** 
    - Question: What does each of these relate to? Are some of these categories more important than others?
    - [2018 Campaign Finance Manual](http://sos.oregon.gov/elections/Documents/campaign-finance.pdf) : Contains definitions for each of these codes:

In [9]:
data.groupby('Sub Type').size().sort_values()

Sub Type
Unexpended Agent Balance                       1
Pledge of In-Kind                              3
Nonpartisan Activity                           3
Loan Received (Exempt)                         3
Uncollectible Pledge of Cash                   4
Loan Payment (Exempt)                          5
In-Kind/Forgiven Account Payable               8
Miscellaneous Account Receivable              10
Account Payable Rescinded                     11
Personal Expenditure Balance Adjustment       11
Loan Forgiven (Non-Exempt)                    13
Expenditure Made by an Agent                  31
Loan Payment (Non-Exempt)                     63
Cash Balance Adjustment                       66
In-Kind/Forgiven Personal Expenditures        68
Miscellaneous Other Disbursement              74
Loan Received (Non-Exempt)                    79
Pledge of Cash                                96
Miscellaneous Other Receipt                  108
Return or Refund of Contribution             125
Lost or Ret

- **Contributor/Payee Column:**
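    - Show the first 10 after sorting by count (ascending, so these are the least frequent)
    - Show the first 10 after sorting by name
        - What does the ** notation mean?
        - Question: Is there a way to easily standardize names? Or do we need to fuzzy match?
    - Look at the 1&1 Internet variants.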

In [10]:
data.groupby('Contributor/Payee').size().sort_values().head(10)

Contributor/Payee
yvonne tamayo            1
Joseph Matarazzo         1
Sharon M. Ungerleider    1
Sharon Lenz              1
Joseph Safirstein        1
Sharon L Roy             1
Sharon Javna             1
Joseph Tennant           1
Joseph Weston            1
Joseph Young             1
dtype: int64
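
Note that `sort_values()` sorts ascending by default, so the listing above contains names with a single record. A sketch of how the most frequent names could be listed instead, on a made-up toy Series (the names are placeholders, not taken from the dataset):

```python
import pandas as pd

names = pd.Series(['Acme PAC', 'Acme PAC', 'Acme PAC',
                   'Jane Doe', 'Jane Doe', 'John Roe'])  # placeholder names
# value_counts() sorts by count in descending order by default
top = names.value_counts().head(2)
print(top)
```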

In [11]:
data.groupby('Contributor/Payee').size().sort_index().head(10)

Contributor/Payee
1 & 1 Interent                               3
1 and 1 Internet, Inc.                       1
1&1 Internet Inc                             1
1&1 Internet Inc.                           19
1-800 Contacts, Inc. **                      3
111 Investments                              2
111th Square, LLC - Sanchez Family Trust     1
1430 KYKN                                    1
1st Screen Mobile.com                        1
200 Market Building                          7
dtype: int64

In [12]:
# bring up 1&1 to see if anything interesting shows up
data[data['Contributor/Payee'].isin(['1 & 1 Interent','1 and 1 Internet, Inc.'])]

Unnamed: 0,Tran ID,Tran Date,Status,Filer/Committee,Contributor/Payee,Sub Type,Amount,committee_id
32374,2467683,12/01/2016,Original,People for Libraries,1 & 1 Interent,Cash Expenditure,52.38,6104
32479,2042190,07/06/2015,Amended,People for Libraries,1 & 1 Interent,Cash Expenditure,52.38,6104
32541,1765178,07/01/2014,Original,People for Libraries,1 & 1 Interent,Personal Expenditure for Reimbursement,52.38,6104
55683,665243,12/27/2009,Original,Concerned Oregonians PAC,"1 and 1 Internet, Inc.",Miscellaneous Other Receipt,29.97,12512
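
On the standardization question above, one possible starting point is to normalize the strings first and then compare them with a similarity ratio. The sketch below uses only the standard library's `difflib`. `normalize_name` is a hypothetical helper and the 0.8 cutoff is an arbitrary assumption; a dedicated fuzzy-matching library may work better on the full column.

```python
import difflib
import re

def normalize_name(name: str) -> str:
    """Lowercase, spell out '&', drop punctuation and extra spaces (hypothetical helper)."""
    name = name.lower().replace('&', ' and ')
    name = re.sub(r'[^a-z0-9 ]', ' ', name)
    return ' '.join(name.split())

# Name variants taken from the sorted listing above
variants = ['1 & 1 Interent', '1 and 1 Internet, Inc.', '1&1 Internet Inc',
            '1&1 Internet Inc.', '111 Investments']
reference = normalize_name('1&1 Internet Inc.')

# Similarity of each normalized variant to the reference spelling (1.0 = identical)
scores = {raw: difflib.SequenceMatcher(None, normalize_name(raw), reference).ratio()
          for raw in variants}

for raw, score in scores.items():
    label = 'candidate match' if score >= 0.8 else 'no match'  # assumed cutoff
    print(f'{raw!r}: {score:.2f} ({label})')
```

Normalization alone already collapses three of the four 1&1 spellings into the same string; the similarity ratio is only needed for the misspelled "Interent".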


- **Contributor/Payee Column (cont):**
    - Look at Null values across records with and without nulls:
        - Note: There are 4 unique Sub Types in the NULL records, while the non-null records contain all 29 (the total number of unique values in the unfiltered dataset)

In [13]:
null_mask = data['Contributor/Payee'].isnull()
data[null_mask].describe(include = 'all').head(2)

Unnamed: 0,Tran ID,Tran Date,Status,Filer/Committee,Contributor/Payee,Sub Type,Amount,committee_id
count,780,780,780,780,0.0,780,780.0,780
unique,780,534,2,51,0.0,4,,47


In [14]:
data[~null_mask].describe(include = 'all').head(2)

Unnamed: 0,Tran ID,Tran Date,Status,Filer/Committee,Contributor/Payee,Sub Type,Amount,committee_id
count,87530,87530,87530,87530,87530,87530,87530.0,87530
unique,87530,4100,2,140,19500,29,,103


- **Contributor/Payee Column (cont):**
    - Look at the Sub Types of records with a NULL Contributor/Payee (n = 780), then of records with a non-null value (n = 87530).
        - Note: The 4 that show up in Null records also show up in non-null records.

In [15]:
non_null_subs = set(data[~null_mask]['Sub Type'].unique())
print("Sub Types found in non-null rows:\n\n",non_null_subs)

Sub Types found in non-null rows:

 {'Miscellaneous Account Receivable', 'In-Kind/Forgiven Account Payable', 'In-Kind Contribution', 'Interest/Investment Income', 'Unexpended Agent Balance', 'Account Payable Rescinded', 'Loan Payment (Exempt)', 'Loan Received (Exempt)', 'Return or Refund of Contribution', 'Items Sold at Fair Market Value', 'Pledge of Cash', 'Nonpartisan Activity', 'Miscellaneous Other Receipt', 'Personal Expenditure for Reimbursement', 'Cash Expenditure', 'Expenditure Made by an Agent', 'Loan Received (Non-Exempt)', 'Refunds and Rebates', 'Loan Payment (Non-Exempt)', 'Account Payable', 'Miscellaneous Other Disbursement', 'Pledge of In-Kind', 'Loan Forgiven (Non-Exempt)', 'Uncollectible Pledge of Cash', 'Cash Contribution', 'Personal Expenditure Balance Adjustment', 'In-Kind/Forgiven Personal Expenditures', 'Cash Balance Adjustment', 'Lost or Returned Check'}


In [16]:
null_subs = set(data[null_mask]['Sub Type'].unique())
print("Sub Types found in records with Null: \n\n",null_subs)

Sub Types found in records with Null: 

 {'Interest/Investment Income', 'Items Sold at Fair Market Value', 'Personal Expenditure Balance Adjustment', 'Cash Balance Adjustment'}


In [17]:
both_subs = non_null_subs.intersection(null_subs)
print("The following are in both: \n\n",both_subs)

The following are in both: 

 {'Interest/Investment Income', 'Items Sold at Fair Market Value', 'Personal Expenditure Balance Adjustment', 'Cash Balance Adjustment'}
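
A complementary view is the share of null Contributor/Payee values within each Sub Type, which shows whether a Sub Type is always or only sometimes missing a name. A minimal sketch on a made-up toy frame (the rows are placeholders; only the column names come from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'Sub Type': ['Cash Contribution', 'Cash Contribution',
                 'Interest/Investment Income', 'Interest/Investment Income'],
    'Contributor/Payee': ['Jane Doe', 'John Roe', None, 'Example Bank'],
})
# The mean of a boolean mask per group is the fraction of nulls in that Sub Type
null_share = toy['Contributor/Payee'].isnull().groupby(toy['Sub Type']).mean()
print(null_share)
```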


- **Amount Column:**
    - Look at descriptives of "Amount"

In [18]:
data['Amount'].describe()

count     88310.000000
mean       1116.298991
std        7272.454859
min      -29525.510000
25%          30.000000
50%         130.000000
75%         500.000000
max      592753.650000
Name: Amount, dtype: float64

- **Amount Column (cont):**
    - Check how many records have a negative amount: 37 show up.
    - Pull up descriptives on those 37 records with a negative amount value:

In [19]:
amt_negative = data[data['Amount'] < 0]
amt_positive = data[data['Amount'] > 0]


print("# of negative records:", len(amt_negative))
amt_negative.describe(include = 'all').head(2)

# of negative records: 37


Unnamed: 0,Tran ID,Tran Date,Status,Filer/Committee,Contributor/Payee,Sub Type,Amount,committee_id
count,37,37,37,37,5,37,37.0,37
unique,37,32,2,29,4,2,,28


- **Amount Column (cont):**
    - What Sub Types are there for negative values?
        - Note: There are two, and both are some sort of adjustment. I find it strange, though, that adjustments aren't all "Amended".

In [20]:
amt_negative['Sub Type'].unique()

array(['Personal Expenditure Balance Adjustment', 'Cash Balance Adjustment'], dtype=object)

In [21]:
amt_negative.head()

Unnamed: 0,Tran ID,Tran Date,Status,Filer/Committee,Contributor/Payee,Sub Type,Amount,committee_id
213,2551881,05/10/2017,Amended,Friends of Clackamas Community College,Jeanne Magmer,Personal Expenditure Balance Adjustment,-2500.0,11247
215,2551883,05/10/2017,Amended,Friends of Clackamas Community College,JE Dunn Construction,Cash Balance Adjustment,-6969.93,11247
216,2551884,05/10/2017,Amended,Friends of Clackamas Community College,JE Dunn Construction,Personal Expenditure Balance Adjustment,-713.34,11247
489,1440832,12/31/2012,Original,Friends of Clackamas Community College,Clackamas Federal Credit Union,Cash Balance Adjustment,-198.75,11247
1269,5469,01/25/2007,Original,Citizens for Schools,,Cash Balance Adjustment,-8.81,5592


In [22]:
amt_negative.pivot_table(index = 'Status', columns = 'Sub Type', values='Tran ID',aggfunc=[len]).T

Sub Type,Amended,Original
Cash Balance Adjustment,3,30
Personal Expenditure Balance Adjustment,2,2


In [23]:
amt_positive.pivot_table(index = 'Status', columns = 'Sub Type', values='Tran ID',aggfunc=[len]).fillna(0)

Status,Account Payable,Account Payable Rescinded,Cash Balance Adjustment,Cash Contribution,Cash Expenditure,Expenditure Made by an Agent,In-Kind Contribution,In-Kind/Forgiven Account Payable,In-Kind/Forgiven Personal Expenditures,Interest/Investment Income,...,Miscellaneous Other Receipt,Nonpartisan Activity,Personal Expenditure Balance Adjustment,Personal Expenditure for Reimbursement,Pledge of Cash,Pledge of In-Kind,Refunds and Rebates,Return or Refund of Contribution,Uncollectible Pledge of Cash,Unexpended Agent Balance
Amended,31.0,0.0,3.0,821.0,792.0,0.0,125.0,1.0,0.0,7.0,...,9.0,0.0,1.0,103.0,6.0,0.0,3.0,3.0,0.0,0.0
Original,628.0,11.0,29.0,49029.0,24864.0,31.0,2589.0,7.0,68.0,1032.0,...,99.0,3.0,6.0,6254.0,90.0,3.0,180.0,122.0,4.0,1.0


- **Date Column:**
    - I had an issue converting "Tran Date" to datetime, so check for integrity:
        - Write a test: take all unique values as a Series, loop through them, and attempt to convert each via to_datetime
            - Note: **5 date strings (6 records) are associated with errors:**

In [24]:
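# Try to parse each unique date string and collect the ones that fail
test = pd.Series(data['Tran Date'].unique())
errors = []
for number, value in test.items():  # Series.iteritems() was removed in pandas 2.0
    try:
        pd.to_datetime(value)
    except (ValueError, OverflowError):  # ValueError covers pandas' OutOfBoundsDatetime
        errors.append([number, value])

erroneous_date_strings = [i[1] for i in errors]
print(errors)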

[[4057, '05/03/0007'], [4084, '11/03/0209'], [4085, '02/23/0009'], [4086, '03/14/0008'], [4087, '02/19/0007']]
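
All five failing strings have a year far outside the range seen in the rest of the data, which suggests typos in the year (an interpretation, not something the data confirms). As a sketch, implausible years could also be flagged directly from the string, without relying on parser errors. The 2000 to 2030 bounds are an assumption chosen for illustration.

```python
import pandas as pd

# Two valid and two failing date strings taken from the outputs above
dates = pd.Series(['01/24/2013', '05/03/0007', '11/03/0209', '12/21/2006'])
# Dates are MM/DD/YYYY strings, so the last four characters are the year
years = dates.str[-4:].astype(int)
implausible = ~years.between(2000, 2030)  # assumed plausible range
print(dates[implausible].tolist())
```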


- **Date Column (cont) :**
    - The erroneous records are shown below:

In [25]:
data[data['Tran Date'].isin(erroneous_date_strings)]

Unnamed: 0,Tran ID,Tran Date,Status,Filer/Committee,Contributor/Payee,Sub Type,Amount,committee_id
68927,57262,05/03/0007,Original,Committee To Elect DeShazer,Miscellaneous Cash Expenditures $100 and under,Cash Expenditure,20.0,8667
81289,640760,11/03/0209,Original,Josephine County Republican Central Committee,Miscellaneous In-Kind Contributions $100 and u...,In-Kind Contribution,71.2,319
81290,541960,02/23/0009,Original,Josephine County Republican Central Committee,Miscellaneous Cash Contributions $100 and under,Cash Contribution,75.0,319
81291,526699,03/14/0008,Original,Josephine County Republican Central Committee,Miscellaneous Cash Contributions $100 and under,Cash Contribution,80.0,319
81292,39429,02/19/0007,Original,Josephine County Republican Central Committee,Keith Heck,Items Sold at Fair Market Value,62.64,319
81293,39458,02/19/0007,Original,Josephine County Republican Central Committee,Doneta Thomason,Items Sold at Fair Market Value,62.64,319
