# **Project 1**

# Learning Objectives

- Explore and glean insights from a real dataset using pandas
- Practice using pandas for exploratory analysis, information gathering, and discovery
- Practice cleaning data and answering questions

# Dataset

You are to analyze campaign contributions to the 2016 U.S. presidential primary races made in California. Use the csv file located here: https://drive.google.com/file/d/1Lgg-PwXQ6TQLDowd6XyBxZw5g1NGWPjB/view?usp=sharing. You should download and save this file in the same folder as this notebook is stored.  This file originally came from the U.S. Federal Election Commission (https://www.fec.gov/).

Documentation for this data can be found here: https://drive.google.com/file/d/11o_SByceenv0NgNMstM-dxC1jL7I9fHL/view?usp=sharing

When you upload the dataset to Colab, it will take approximately 2-3 minutes to load. **Do not run the code cell under "Setup" until the data has fully uploaded.**

# Data Questions

You are working for a California state-wide election campaign. Your boss wants you to examine historic 2016 election contribution data to see what zipcodes are more supportive of fundraising for your candidate.

Your boss asks you to filter out some of the records:
- Only use primary 2016 contribution data.
- Concentrate on Bernie Sanders as a candidate.

The questions your boss wants answered are:
- Which zipcode (5-digit zipcode) had the highest count of contributions and the most dollar amount?
- What day(s) of the month do most people donate?

# Setup

Run the cell below as it will load the data into a pandas dataframe named `contrib`. Note that a custom date parser is defined to speed up loading. If Python were to guess the date format, it would take even longer to load.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

# These commands set some display options for pandas
pd.set_option('display.max_rows', 1000)
pd.options.display.float_format = '{:,.2f}'.format

# Define a date parser to pass to read_csv
d = lambda x: datetime.strptime(x, '%d-%b-%y')

# Load the data (chunk_size is defined below but is not passed to read_csv)
chunk_size = 100000
contrib = pd.read_csv('P00000001-CA.csv', index_col=False, parse_dates=['contb_receipt_dt'], date_parser=d)

# Note - for now, it is okay to ignore the warning about mixed types.
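
The idea behind the custom date parser is to state the date format explicitly instead of letting it be guessed. The following is a minimal sketch of that idea on a few invented rows, not the notebook's loading code: the contents of `csv_text` are made up, and it assumes dates follow the `%d-%b-%y` pattern used by the parser above. It parses the column after loading with `pd.to_datetime` and an explicit `format`.

```python
import io

import pandas as pd

# Invented rows for illustration only; not taken from the real FEC file.
csv_text = """contb_receipt_amt,contb_receipt_dt
50.0,26-Apr-16
200.0,20-Apr-16
35.0,05-Mar-16
"""

toy = pd.read_csv(io.StringIO(csv_text))

# An explicit format means pandas does not have to infer the layout of each date.
toy["contb_receipt_dt"] = pd.to_datetime(toy["contb_receipt_dt"], format="%d-%b-%y")

print(toy.dtypes)
print(toy)
```

The same `format` string could be applied to the real column if the dates were loaded as plain text first.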



**For all questions, please do not alter any cells that already exist. However, you can add as many code or text cells as you need to answer the questions.**

***
# **Part 1: Initial Data Checks (50 points)**

First, we will take a preliminary look at the data to check that it was loaded correctly and contains the info we need.

In [2]:
# checking out the table 
contrib.head(5)

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
0,C00575795,P00003392,"Clinton, Hillary Rodham","AULL, ANNE",LARKSPUR,CA,949391913.0,,RETIRED,50.0,2016-04-26,,X,* HILLARY VICTORY FUND,SA18,1091718,C4768722,P2016
1,C00575795,P00003392,"Clinton, Hillary Rodham","CARROLL, MARYJEAN",CAMBRIA,CA,934284638.0,,RETIRED,200.0,2016-04-20,,X,* HILLARY VICTORY FUND,SA18,1091718,C4747242,P2016
2,C00575795,P00003392,"Clinton, Hillary Rodham","GANDARA, DESIREE",FONTANA,CA,923371507.0,,RETIRED,5.0,2016-04-02,,X,* HILLARY VICTORY FUND,SA18,1091718,C4666603,P2016
3,C00577130,P60007168,"Sanders, Bernard","LEE, ALAN",CAMARILLO,CA,930111214.0,AT&T GOVERNMENT SOLUTIONS,SOFTWARE ENGINEER,40.0,2016-03-04,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKWA097,P2016
4,C00577130,P60007168,"Sanders, Bernard","LEONELLI, ODETTE",REDONDO BEACH,CA,902784310.0,VERICOR ENTERPRISES INC.,PHARMACIST,35.0,2016-03-05,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKX3MB3,P2016


**Question 1:**

Print the *shape* of the data. Does this match the expectation? (2 points)

**Answer 1**

Yes. The data has 1,125,659 rows and 18 columns, which matches the expectation.

In [5]:
# Answer for the first question
contrib.shape

(1125659, 18)

**Question 2:**

Print a list of column names. Are all the columns included that are in the documentation? (2 points)

**Answer 2**

Yes, all of the columns are included, as the list below shows.

In [50]:
# all of the following are the columns in the dataset
contrib.columns.to_list()

['cmte_id',
 'cand_id',
 'cand_nm',
 'contbr_nm',
 'contbr_city',
 'contbr_st',
 'contbr_zip',
 'contbr_employer',
 'contbr_occupation',
 'contb_receipt_amt',
 'contb_receipt_dt',
 'receipt_desc',
 'memo_cd',
 'memo_text',
 'form_tp',
 'file_num',
 'tran_id',
 'election_tp']

**Question 3:**

Print out the first five rows of the dataset. How do the columns `cand_id`, `cand_nm` and `contbr_st` look? (3 points)

**Answer 3**

`cand_id`: a unique ID for each candidate, such as Clinton or Sanders.

`cand_nm`: the candidate's name, written as the last name followed by the first and middle names.

`contbr_st`: I believe this is the contributor's state abbreviation, for example CA for California.

In [51]:
# Exploring these 3 columns
contrib.head(5)

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
0,C00575795,P00003392,"Clinton, Hillary Rodham","AULL, ANNE",LARKSPUR,CA,949391913.0,,RETIRED,50.0,2016-04-26,,X,* HILLARY VICTORY FUND,SA18,1091718,C4768722,P2016
1,C00575795,P00003392,"Clinton, Hillary Rodham","CARROLL, MARYJEAN",CAMBRIA,CA,934284638.0,,RETIRED,200.0,2016-04-20,,X,* HILLARY VICTORY FUND,SA18,1091718,C4747242,P2016
2,C00575795,P00003392,"Clinton, Hillary Rodham","GANDARA, DESIREE",FONTANA,CA,923371507.0,,RETIRED,5.0,2016-04-02,,X,* HILLARY VICTORY FUND,SA18,1091718,C4666603,P2016
3,C00577130,P60007168,"Sanders, Bernard","LEE, ALAN",CAMARILLO,CA,930111214.0,AT&T GOVERNMENT SOLUTIONS,SOFTWARE ENGINEER,40.0,2016-03-04,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKWA097,P2016
4,C00577130,P60007168,"Sanders, Bernard","LEONELLI, ODETTE",REDONDO BEACH,CA,902784310.0,VERICOR ENTERPRISES INC.,PHARMACIST,35.0,2016-03-05,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKX3MB3,P2016


**Question 4:**

Print out the values for the column `election_tp`. In your own words, based on the documentation, what information does the `election_tp` variable contain? Do the values in the column match the documentation? (3 points)

**Answer 4**

I believe this holds the election type and year that the contribution was made for. For example, P2016 is the 2016 primary election and G2016 is the 2016 general election.
Yes, the values in the column match the documentation.

In [52]:
# Demonstration of my answer from the dataset
contrib['election_tp'].head(5)

0    P2016
1    P2016
2    P2016
3    P2016
4    P2016
Name: election_tp, dtype: object
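
Because `head(5)` only shows the first five rows, it cannot reveal every value in the column. A minimal sketch of listing all distinct values with their counts, shown on invented rows (the real counts are not reproduced here):

```python
import pandas as pd

# Invented rows for illustration; the real column has over a million entries.
toy = pd.DataFrame({"election_tp": ["P2016", "P2016", "G2016", None, "P2016"]})

# dropna=False keeps missing values in the tally so they are not silently hidden.
counts = toy["election_tp"].value_counts(dropna=False)
print(counts)
```

Running the same call on `contrib['election_tp']` would show every election code present, including missing values.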

**Question 5:**

Print out the datatypes for all of the columns. What are the datatypes for the `contbr_zip`, `contb_receipt_amt`, `contb_receipt_dt`? (5 points)

**Answer 5**

`contbr_zip` : object

`contb_receipt_amt`: float64

`contb_receipt_dt`: datetime64[ns]


In [53]:
#showing the data types
contrib.dtypes

cmte_id                      object
cand_id                      object
cand_nm                      object
contbr_nm                    object
contbr_city                  object
contbr_st                    object
contbr_zip                   object
contbr_employer              object
contbr_occupation            object
contb_receipt_amt           float64
contb_receipt_dt     datetime64[ns]
receipt_desc                 object
memo_cd                      object
memo_text                    object
form_tp                      object
file_num                      int64
tran_id                      object
election_tp                  object
dtype: object

**Question 6:**

What columns have the most nulls?  Would you recommend to drop any columns based on the number of nulls? (5 points)

*Hint:* Use the .isna() and .sum() functions together

**Answer 6**

The columns with the most nulls are:

- `receipt_desc`: 1,110,614
- `memo_cd`: 981,391
- `memo_text`: 624,511
- `contbr_employer`: 157,902
- `contbr_occupation`: 10,399
- `election_tp`: 1,425

The columns with the fewest nulls (other than zero) are:

- `contbr_zip`: 95
- `contbr_city`: 26

I would be cautious about dropping the columns with the most nulls, because removing them could affect the dataset. If anything is dropped, I would start with the columns that have the fewest nulls. The decision also depends on what the data is needed for and what question is being asked, so the circumstances matter.



In [54]:
contrib.isnull().sum()

cmte_id                    0
cand_id                    0
cand_nm                    0
contbr_nm                  0
contbr_city               26
contbr_st                  0
contbr_zip                95
contbr_employer       157902
contbr_occupation      10399
contb_receipt_amt          0
contb_receipt_dt           0
receipt_desc         1110614
memo_cd               981391
memo_text             624511
form_tp                    0
file_num                   0
tran_id                    0
election_tp             1425
dtype: int64
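
Raw null counts are easier to judge next to the share of rows they represent. A minimal sketch on invented rows, combining `.isna().sum()` with `.isna().mean()` and sorting so the emptiest columns come first (the name `summary` is only illustrative):

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "cand_nm": ["A", "B", "C", "D"],
    "receipt_desc": [None, None, None, "Refund"],
    "memo_cd": [None, "X", None, "X"],
})

null_counts = toy.isna().sum()    # number of nulls per column
null_share = toy.isna().mean()    # fraction of rows that are null per column

summary = (
    pd.DataFrame({"nulls": null_counts, "share": null_share})
    .sort_values("nulls", ascending=False)
)
print(summary)
```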

**Question 7:**

A column we know that we want to use is the `cand_nm` column. According to the documentation, each candidate also has a unique candidate ID. Check the data quality of the `cand_id` column to see if it matches the `cand_nm` column. Specifically, check that our targeted candidate 'Bernard Sanders' always has the same `cand_id` throughout. Are there any issues with `cand_nm` matching `cand_id`? (5 points)

*Hint:* Look at the value counts for candidate ID and name pairs.

**Answer 7**

I showed that there are equal numbers of unique IDs and unique names (25 each), and in the rows shown Sanders always has the same ID (`P60007168`).

In [55]:
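# Compare the number of unique candidate IDs with the number of unique candidate names
n_cand_ids = contrib['cand_id'].nunique()
n_cand_names = contrib["cand_nm"].nunique()

print(n_cand_ids, n_cand_names)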

25 25


In [56]:
contrib[contrib['cand_nm'].str.contains('Sanders')].head(5)

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
3,C00577130,P60007168,"Sanders, Bernard","LEE, ALAN",CAMARILLO,CA,930111214.0,AT&T GOVERNMENT SOLUTIONS,SOFTWARE ENGINEER,40.0,2016-03-04,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKWA097,P2016
4,C00577130,P60007168,"Sanders, Bernard","LEONELLI, ODETTE",REDONDO BEACH,CA,902784310.0,VERICOR ENTERPRISES INC.,PHARMACIST,35.0,2016-03-05,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKX3MB3,P2016
5,C00577130,P60007168,"Sanders, Bernard","LEONELLI, ODETTE",REDONDO BEACH,CA,902784310.0,VERICOR ENTERPRISES INC.,PHARMACIST,100.0,2016-03-06,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKYBXV4,P2016
6,C00577130,P60007168,"Sanders, Bernard","LEOPARD, PATTI",VISTA,CA,920842849.0,ONSITE ENERGY CORPORATION,PROJECT MANAGER,25.0,2016-03-04,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKW04C1,P2016
8,C00577130,P60007168,"Sanders, Bernard","LEPKE, KELLY",WESTMINSTER,CA,926833846.0,NONE,NOT EMPLOYED,10.0,2016-03-05,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKX3H59,P2016
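
Equal unique counts and five sample rows are suggestive but do not prove a one-to-one match. A minimal sketch of the pair check the hint describes, on invented rows that reuse the IDs and names visible in the tables above:

```python
import pandas as pd

# Invented rows for illustration; IDs and names mirror those shown in the outputs above.
toy = pd.DataFrame({
    "cand_id": ["P60007168", "P60007168", "P00003392", "P00003392"],
    "cand_nm": ["Sanders, Bernard", "Sanders, Bernard",
                "Clinton, Hillary Rodham", "Clinton, Hillary Rodham"],
})

# One row per (ID, name) pair; a clean mapping has exactly one pair per candidate.
pairs = toy.groupby(["cand_id", "cand_nm"]).size()

# Each name should map to one ID, and each ID to one name.
ids_per_name = toy.groupby("cand_nm")["cand_id"].nunique()
names_per_id = toy.groupby("cand_id")["cand_nm"].nunique()

# Every distinct ID recorded for the target candidate.
sanders_ids = toy.loc[toy["cand_nm"] == "Sanders, Bernard", "cand_id"].unique()

print(pairs)
print(sanders_ids)
```

If `ids_per_name` or `names_per_id` contained a value above 1 on the real data, that would point to a mismatch.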


**Question 8:**

Another area to check is to make sure all of the records are from California. Check the `contbr_st` column - are there any records outside of California based on `contbr_st`? (5 points)

**Answer 8**

There is only one unique value in `contbr_st`, which shows that all records are from CA.

In [58]:
contrib['contbr_st'].nunique()

1

**Question 9:**

The next column to check for the analysis is the `tran_id` column. This column could be the primary key so look for duplicates. How many duplicate entries are there? (5 points)

*Hint:* Look at where the `tran_id` value counts are greater than 1

**Answer 9**

I found the duplicates by comparing the total number of `tran_id` values with the number of unique `tran_id` values.

The difference is 3,454, which means there are duplicate entries.

In [11]:
# check a few duplicated tran_id examples to see if there is a pattern for why there are duplicate entries
examples = ['ADB49CB248C174E298F0', 'A5602AD777C8C4632B5A', 'SA17.1131188', 'SA17.959311', 'C10357933', 'C9499151']
contrib[contrib['tran_id'].isin(examples)]

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
40217,C00575449,P40003576,"Paul, Rand","BREED, PAUL MR.",SOLANA BEACH,CA,92075.0,NET BURNER,ENGINEER,500.0,2015-09-30,,X,,SA17A,1057795,ADB49CB248C174E298F0,P2016
40218,C00575449,P40003576,"Paul, Rand","PARKS, ROBERT MR.",HUGHSON,CA,953260412.0,SELF-EMPLOYED,FARMER,200.0,2015-09-30,,X,,SA17A,1057795,A5602AD777C8C4632B5A,G2016
41393,C00575449,P40003576,"Paul, Rand","PARKS, ROBERT MR.",HUGHSON,CA,953260412.0,SELF-EMPLOYED,FARMER,200.0,2015-09-30,,X,,SA17A,1057796,A5602AD777C8C4632B5A,G2016
42430,C00575449,P40003576,"Paul, Rand","BREED, PAUL MR.",SOLANA BEACH,CA,92075.0,NET BURNER,ENGINEER,500.0,2015-09-30,,X,,SA17A,1057796,ADB49CB248C174E298F0,P2016
42822,C00575449,P40003576,"Paul, Rand","BREED, PAUL MR.",SOLANA BEACH,CA,92075.0,NET BURNER,ENGINEER,500.0,2015-09-30,,X,,SA17A,1057799,ADB49CB248C174E298F0,P2016
42836,C00575449,P40003576,"Paul, Rand","PARKS, ROBERT MR.",HUGHSON,CA,953260412.0,SELF-EMPLOYED,FARMER,200.0,2015-09-30,,X,,SA17A,1057799,A5602AD777C8C4632B5A,G2016
44681,C00575449,P40003576,"Paul, Rand","BREED, PAUL MR.",SOLANA BEACH,CA,92075.0,NET BURNER,ENGINEER,500.0,2015-09-30,,X,,SA17A,1057798,ADB49CB248C174E298F0,P2016
44685,C00575449,P40003576,"Paul, Rand","PARKS, ROBERT MR.",HUGHSON,CA,953260412.0,SELF-EMPLOYED,FARMER,200.0,2015-09-30,,X,,SA17A,1057798,A5602AD777C8C4632B5A,G2016
141644,C00458844,P60006723,"Rubio, Marco","GUERRA, MARIO A. MR.",DOWNEY,CA,902422240.0,GALLAGHER INSURANCE BROKERS,INSURANCE BROKER,2700.0,2016-01-29,REATTRIBUTION/REDESIGNATION REQUESTED,,REATTRIBUTION/REDESIGNATION REQUESTED,SA17A,1051592,SA17.959311,P2016
252440,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","TRULIO, JOHN GEORGE",LOS ANGELES,CA,900491226.0,RETIRED,RETIRED,25.0,2016-02-14,,,,SA17A,1071498,SA17.1131188,P2016


After examining a few of these duplicated tran_id's, there do not appear to be any patterns for why there are duplicate entries.

In [62]:
# Number of non-null tran_id values minus the number of distinct tran_id values
unique_id = contrib['tran_id'].nunique()
total_id = contrib['tran_id'].value_counts().sum()

total_id - unique_id

np.int64(3454)
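
The difference computed above counts the rows beyond the first occurrence of each ID. A minimal sketch on invented IDs showing how that figure relates to the number of repeated IDs (the hint's approach) and to the number of rows involved in any duplication:

```python
import pandas as pd

# Invented transaction IDs for illustration only.
toy = pd.DataFrame({"tran_id": ["A1", "A1", "A1", "B2", "C3", "C3"]})

# Hint approach: IDs whose value count is greater than 1.
id_counts = toy["tran_id"].value_counts()
repeated_ids = id_counts[id_counts > 1]

# Rows beyond the first occurrence of each ID (same as total minus unique when there are no nulls).
extra_rows = toy["tran_id"].duplicated().sum()

# Every row whose ID appears more than once, including the first occurrence.
rows_involved = toy["tran_id"].duplicated(keep=False).sum()

print(len(repeated_ids), extra_rows, rows_involved)
```

These three numbers answer slightly different questions, so it helps to state which one is being reported.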

**Question 10:**

Another column to check is the `contb_receipt_amt` that shows the donation amounts. How many negative donations are included? What do negative donations mean? Please pull at least a few rows to look at the records with negative donations. Do these records match with the expectation of why a negative donation would happen? (5 points)

To assist you, please refer to the following:

https://www.fec.gov/help-candidates-and-committees/filing-reports/redesignating-and-reattributing-contributions/

Also, it may be useful to examine the 'receipt_desc' and 'memo_text' fields.

**Answer 10**

Negative donations can occur due to:

Refunds or Returned Contributions: If a candidate's campaign realizes they received more money than allowed or from an ineligible donor, they might refund the excess amount or the entire contribution. This refund is recorded as a negative amount in the campaign's financial records, essentially "taking back" a portion or all of the original donation.

Adjustment Errors: Occasionally, errors in the initial reporting of contributions may require adjustments. For example, if a contribution was initially overstated or reported twice, a negative entry would correct the campaign finance records.

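In the rows shown below, most of the negative records are labeled "REDESIGNATION TO GENERAL" in `receipt_desc` and `memo_text`, meaning the amount was moved from the primary to the general election. Others are labeled "REATTRIBUTION TO SPOUSE" or "CHARGED BACK". These records match the expectation of why a negative donation would happen.

There are 11,896 negative donations.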

In [77]:
# Count the rows with a negative contribution amount
(contrib['contb_receipt_amt'] < 0).sum()

np.int64(11896)

In [79]:
contrib.loc[contrib['contb_receipt_amt'] < 0]

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
19,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","JOLLIFF, RICHARD",CHICO,CA,959289507.00,SELF EMPLOYED,RANCHER,-25.00,2016-04-29,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1826482B,P2016
23,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","JOLLIFF, RICHARD",CHICO,CA,959289507.00,SELF EMPLOYED,RANCHER,-150.00,2016-04-29,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1826483B,P2016
81,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","JOLLIFF, RICHARD",CHICO,CA,959289507.00,SELF EMPLOYED,RANCHER,-60.00,2016-04-14,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1827494,P2016
190,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","NOWELL, DIANA L.",RANCHO SANTA MARGARITA,CA,926884928.00,CAPISTRAND UNIFIED SCHOOL DISTRICT,LIBRARIAN TECHNICIAN,-100.00,2016-04-11,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1639830B,P2016
213,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","LICHTY, ANDREW MR.",SAN DIEGO,CA,921096720.00,SELF EMPLOYED,REAL ESTATE,-25.00,2016-04-30,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1826888B,P2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1125008,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","NELSON, PETER C. MR.",SAN LUIS OBISPO,CA,934018000,SELF EMPLOYED,DENTIST,-2700.00,2015-07-31,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1053893,SA17.466666B,P2016
1125317,C00580399,P60008521,"Christie, Christopher J.","ASCHER, STEPHEN",PASADENA,CA,911013113,MIRAMAR,EXECUTIVE,-2700.00,2016-01-30,REATTRIBUTION TO SPOUSE,X,REATTRIBUTION TO SPOUSE,SA17A,1051204,SA17.A40065,P2016
1125427,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","HANSEN, WILLIAM",GLENDALE,CA,912081507,RETIRED,RETIRED,-5400.00,2015-07-01,,,CHARGED BACK,SA17A,1053893,SA17.440961,P2016
1125446,C00573519,P60005915,"Carson, Benjamin S.","PECK, JOHN JR.",RANCHO SANTA FE,CA,920670829,PECK ENTERPRISES,PRESIDENT,-2700.00,2016-01-01,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1073637,SA17.817713B,P2016
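
Scanning rows by eye only covers a handful of records. A minimal sketch of tallying the stated reason for each negative amount, on invented rows that reuse the labels visible above. Falling back to `memo_text` when `receipt_desc` is missing is an assumption made for this illustration, not something the documentation prescribes.

```python
import pandas as pd

# Invented rows for illustration; the labels mirror those in the output above.
toy = pd.DataFrame({
    "contb_receipt_amt": [-25.0, -150.0, 40.0, -2700.0, -5400.0],
    "receipt_desc": ["REDESIGNATION TO GENERAL", "REDESIGNATION TO GENERAL",
                     None, "REATTRIBUTION TO SPOUSE", None],
    "memo_text": ["REDESIGNATION TO GENERAL", "REDESIGNATION TO GENERAL",
                  None, "REATTRIBUTION TO SPOUSE", "CHARGED BACK"],
})

negative = toy[toy["contb_receipt_amt"] < 0]

# Use receipt_desc where present, otherwise fall back to memo_text (assumption).
reason = negative["receipt_desc"].fillna(negative["memo_text"])
reason_counts = reason.value_counts(dropna=False)
print(reason_counts)
```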


**Question 11:**

Another column to look at is the date of donation column. Are there any dates outside of the primary period (defined as '2014-01-01' to '2016-06-07')? Are the dates well-formatted for our analysis? (5 points)

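**Answer 11**

Yes, there are dates outside of the primary period: two records before 2014-01-01 (both dated 2013-11-05) and many records after 2016-06-07. The column is stored as `datetime64[ns]`, so the dates are well formatted for analysis.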

In [88]:
#contrib.loc[(contrib["contb_receipt_dt"] < '2016-06-07') & (contrib["contb_receipt_dt"] > '2014-01-01')]

In [86]:
contrib.loc[contrib["contb_receipt_dt"] < '2014-01-01']

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
9932,C00458844,P60006723,"Rubio, Marco","WHEELER, MARY MS.",ATHERTON,CA,940273415.0,SELF-EMPLOYED,INTERIOR DESIGNER,-20.0,2013-11-05,,X,TRANSFER FROM RUBIO VICTORY,SA18,1029436,SA18.631526.2.0615,P2016
9994,C00458844,P60006723,"Rubio, Marco","WHEELER, MARY MS.",ATHERTON,CA,940273415.0,SELF-EMPLOYED,INTERIOR DESIGNER,20.0,2013-11-05,,X,TRANSFER FROM RUBIO VICTORY,SA18,1029436,SA18.631526.3.0615,G2016


In [87]:
contrib.loc[contrib["contb_receipt_dt"] > '2016-06-07']

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
14673,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","RANDALL, DICK J",CUPERTINO,CA,95014.00,,,-5400.00,2016-06-30,Refund,,,SB28A,1096256,SB28A.90932,
14682,C00605568,P20002671,"Johnson, Gary","BOAL, ROB",OAK PARK,CA,91377.00,VENDAVO,ENGINEER,30.00,2016-07-20,,,,SA17A,1096305,SA17A.50793,G2016
14697,C00605568,P20002671,"Johnson, Gary","LEE, JASCHA",SANTA CRUZ,CA,95065.00,YAHOO,S/W ENG,88.45,2016-07-14,,,,SA17A,1096305,SA17A.51414,G2016
14698,C00605568,P20002671,"Johnson, Gary","LEE, JASCHA",SANTA CRUZ,CA,95065.00,YAHOO,S/W ENG,100.00,2016-07-24,,,,SA17A,1096305,SA17A.51415,G2016
14700,C00575795,P00003392,"Clinton, Hillary Rodham","BAKER, JUDY",VACAVILLE,CA,956873433.00,,RETIRED,5.00,2016-07-25,,X,* HILLARY VICTORY FUND,SA18,1109498,C8807328,P2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1123992,C00575795,P00003392,"Clinton, Hillary Rodham","RANDALL, NANCY ENA",BAKERSFIELD,CA,933124485,RETIRED,REGISTERED NURSE,5.00,2016-08-02,,,,SA17A,1100718,C8235857,G2016
1123993,C00575795,P00003392,"Clinton, Hillary Rodham","WRIGHT, LORNA",BENICIA,CA,945101743,,RETIRED,50.00,2016-08-09,,,,SA17A,1100718,C8997947,G2016
1123994,C00575795,P00003392,"Clinton, Hillary Rodham","BALES, REBECCA",PASADENA,CA,911051700,PASADENA HEALTHCARE CONSULTING,HOSPITAL CFO,100.00,2016-08-31,,,,SA17A,1100718,C9829807,G2016
1123995,C00575795,P00003392,"Clinton, Hillary Rodham","BELLINGER, MARTHA",CLAREMONT,CA,917112219,,RETIRED,75.00,2016-08-31,,,,SA17A,1100718,C9862797,G2016
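
Both ends of the primary period can be checked in one step. A minimal sketch on invented dates using `Series.between`, which includes both endpoints by default:

```python
import pandas as pd

# Invented dates for illustration only.
toy = pd.DataFrame({
    "contb_receipt_dt": pd.to_datetime(
        ["2013-11-05", "2016-03-04", "2016-06-07", "2016-07-20"]
    )
})

start = pd.Timestamp("2014-01-01")
end = pd.Timestamp("2016-06-07")

# True for dates inside the primary period, endpoints included.
in_primary = toy["contb_receipt_dt"].between(start, end)
outside = toy[~in_primary]

print(in_primary.sum(), len(outside))
print(outside)
```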


**Question 12:**

Let's examine the ```contbr_zip``` field. The zip codes should be 5 digits and not contain any decimals.

Look at the distribution of zip codes. Do they appear to be properly formatted? If not, give two examples of incorrectly formatted zip codes.

(5 points)

**Answer 12**

No, the zip codes are not well formatted. Some have a decimal part, and many have nine digits instead of five, as shown below.

Examples:

- `949391913.0`
- `934284638.0`
- `923371507.0`

In [93]:
contrib[['contbr_zip']]

Unnamed: 0,contbr_zip
0,949391913.00
1,934284638.00
2,923371507.00
3,930111214.00
4,902784310.00
...,...
1125654,902781750
1125655,960270030
1125656,960270030
1125657,960270030
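
Looking at the distribution of zip code lengths makes the formatting problem easier to quantify. A minimal sketch on invented values, assuming the column holds a mix of floats and strings as the output above suggests:

```python
import pandas as pd

# Invented values for illustration: floats with a trailing .0 and plain strings.
toy = pd.DataFrame({
    "contbr_zip": [949391913.0, 92075.0, "902781750", "960270030", None]
})

# Convert to text and strip a trailing ".0" left over from float storage.
zip_text = (
    toy["contbr_zip"]
    .dropna()
    .astype(str)
    .str.replace(r"\.0$", "", regex=True)
)

# How many zip codes have each length (5 would be the expected format).
length_counts = zip_text.str.len().value_counts()
print(length_counts)
```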


**At this stage, here is a list of columns that appear important to answer the boss's questions:**

"tran_id", "cand_id", "cand_nm", "election_tp", "contbr_zip", "contbr_nm", "contb_receipt_amt", and "contb_receipt_dt"

**Here is a list of columns that do not appear important to answer the boss's questions:**

"cmte_id", "contbr_city", "contbr_st", "contbr_employer", "contbr_occupation", "receipt_desc", "memo_cd", "memo_text", "form_tp", "file_num"

***
# **Part 2: Data Filtering and Data Quality Fixes (25 points)**

Now that we have a basic understanding of the data, let's filter out the records we don't need and fix the data.

**Question 13:**

From the dataset filter out (remove) any election_tp not in the primary election. Print/show the shape of the dataframe after the filtering is complete. (5 points)

In [114]:
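# Drop rows with a missing election_tp
tp = contrib[~contrib['election_tp'].isna()]
# Select rows whose election_tp contains 'P' (note: this result is not assigned back to tp)
tp[tp["election_tp"].str.contains('P')]
tp.shape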

(1124234, 18)
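
In the cell above, the result of the `str.contains('P')` line is not assigned back to `tp`, so the printed shape only reflects the removal of the 1,425 rows with a missing `election_tp` (1,125,659 - 1,425 = 1,124,234). A minimal sketch of a filter that keeps only primary records, shown on invented rows and assuming `P2016` is the code for the 2016 primary:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "election_tp": ["P2016", "G2016", None, "P2016"],
    "contb_receipt_amt": [50.0, 30.0, -5400.0, 40.0],
})

# Comparing against the exact code also drops missing values, since None != "P2016".
primary = toy[toy["election_tp"] == "P2016"]
print(primary.shape)
```

The shape of the real filtered data is not shown here because it would have to be computed on the actual file.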

**Question 14:**

From the dataset filter out (remove) any candidate that is not Bernie Sanders. Print/show the shape of the dataframe after the filtering is complete. (5 points)

**Answer 14**

The shape is (407171, 18).

In [124]:
tp = tp.loc[tp["cand_nm"] == "Sanders, Bernard"]
tp.shape

(407171, 18)

**Question 15:**

The `contbr_zip` column is not formatted well for our analysis.

Make a new zipcode column that is the five-digit zipcodes. Filter out any records outside of California based on the zipcode. Print/show the shape of the dataframe after the filtering is complete. (10 points).

**Note:**

If you were conducting this analysis in the real world, you would have to research what the valid 5-digit zipcodes for California are!

For ease of the assignment, I have done this research for you.

**Valid CA zip codes ranges from 90001 to 96162.**

I used the following source for this information:

https://www.structnet.com/instructions/zip_min_max_by_state.html

**Answer 15**

The shape is (21995, 18).

In [128]:
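# Convert every contbr_zip in the full dataset to a string and collect the values in a list
# (no five-digit column or 90001-96162 range check is applied here)
contrib['contbr_zip'] = contrib['contbr_zip'].astype("string")
ca_zip_code_list = contrib['contbr_zip'].tolist()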

In [129]:
# Keep the rows of tp whose contbr_zip value appears in that list
tp = tp[tp["contbr_zip"].isin(ca_zip_code_list)]
tp.shape

(21995, 18)
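
The question asks for a new five-digit column and a filter on the valid California range, which the `isin` check above does not do. A minimal sketch of one way to approach it, on invented rows: the column name `zip5` is hypothetical, and the code assumes zip values arrive as a mix of floats (such as `930111214.0`) and strings, as the earlier outputs suggest.

```python
import pandas as pd

# Invented values for illustration: CA zips, a non-CA zip, a missing value, an out-of-range zip.
toy = pd.DataFrame({
    "contbr_zip": [930111214.0, 92075.0, "902781750", "10001", None, "99999"]
})

# Rows without a zip code cannot be placed in California, so drop them first.
clean = toy.dropna(subset=["contbr_zip"]).copy()

# Text form without a trailing ".0", then the first five characters.
clean["zip5"] = (
    clean["contbr_zip"]
    .astype(str)
    .str.replace(r"\.0$", "", regex=True)
    .str[:5]
)

# Compare numerically against the valid CA range given in the assignment.
zip_num = pd.to_numeric(clean["zip5"], errors="coerce")
ca_only = clean[zip_num.between(90001, 96162)]
print(ca_only)
```

Later steps (negative-amount filtering and the per-zip grouping) would then operate on `zip5` rather than on the raw `contbr_zip`.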

**Question 16:**

The receipt amount column has negative donations. After talking with your team, a decision was made that the best course of action is to remove these negative values so that the donation count and amount is more accurate. Print/show the shape of the dataframe after the filtering is complete. (5 points)

**Answer 16**

The shape is (21975, 8).

Next, we will drop columns that were not needed for the filtering and will not be needed to answer the boss's questions. Please take a moment to see which columns can be dropped in the code cell below.

In [130]:
# drop columns that won't be used in the analysis

drop_cols = ["cmte_id", "contbr_city", "contbr_st", "contbr_employer", "contbr_occupation", "receipt_desc",
             "memo_cd", "memo_text", "form_tp", "file_num"]


#pre-drop dataframe shape
print("Pre-drop row count: ", contrib.shape[0])
print("Pre-drop column count: ", contrib.shape[1])

contrib = contrib.drop(drop_cols, axis=1)

#post-drop dataframe shape
print("Post-drop row count: ", contrib.shape[0])
print("Post-drop column count: ", contrib.shape[1])

Pre-drop row count:  1125659
Pre-drop column count:  18
Post-drop row count:  1125659
Post-drop column count:  8


In [138]:
tp = tp.drop(drop_cols, axis=1)

In [139]:
tp = tp[tp["contb_receipt_amt"] >= 0]
tp.shape

(21975, 8)

***
# **Part 3: Answering the Questions (20 points)**

Now that the data is cleaned and filtered, let's answer the two questions from your boss!

**Question 17:**

Which zipcode had the highest count of contributions and the most dollar amount? (10 points)

**Answer 17**

The `contbr_zip` value with the highest count of contributions is 921034727, with 44 contributions.

The `contbr_zip` value with the highest dollar amount is 941151105, with a total of 10,000.00.

In [143]:
tp.groupby("contbr_zip")["contb_receipt_amt"].count().sort_values(ascending=False)

contbr_zip
921034727    44
956286916    40
949202136    23
945561545    21
910241229    19
             ..
961611625     1
961611178     1
961611168     1
961603052     1
900043704     1
Name: contb_receipt_amt, Length: 12737, dtype: int64

In [144]:
tp.groupby("contbr_zip")["contb_receipt_amt"].sum().sort_values(ascending=False)

contbr_zip
941151105   10,000.00
900642333    5,500.00
904056216    5,000.00
941222751    4,000.00
941171713    3,050.00
               ...   
917703108        1.00
940051622        1.00
934361484        1.00
916062828        1.00
926833846        1.00
Name: contb_receipt_amt, Length: 12737, dtype: float64
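
The outputs above group on the raw `contbr_zip` values, which are nine digits long, while the question asks about five-digit zip codes. A minimal sketch of computing both the count and the total per five-digit zip in one pass, on invented rows and assuming a hypothetical `zip5` column already exists:

```python
import pandas as pd

# Invented rows for illustration; zip5 is a hypothetical five-digit zip column.
toy = pd.DataFrame({
    "zip5": ["92103", "92103", "94115", "92103", "94115"],
    "contb_receipt_amt": [10.0, 25.0, 2700.0, 5.0, 100.0],
})

# Number of contributions and total dollars per zip code.
by_zip = toy.groupby("zip5")["contb_receipt_amt"].agg(["count", "sum"])

top_count_zip = by_zip["count"].idxmax()   # zip with the most contributions
top_amount_zip = by_zip["sum"].idxmax()    # zip with the most dollars

print(by_zip)
print(top_count_zip, top_amount_zip)
```

The two winners need not be the same zip code, as the invented rows show.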

**Question 18:**

What day(s) of the month do most people donate? (10 points)

**Answer 18**

The dates with the highest total dollar amounts are:

- 2016-03-09: 155,276.00
- 2016-03-31: 109,452.04
- 2016-03-14: 81,398.03
- 2016-03-27: 60,130.39
- 2015-09-30: 53,730.37

In [146]:
tp.groupby("contb_receipt_dt")["contb_receipt_amt"].sum().sort_values(ascending = False).head(5)

contb_receipt_dt
2016-03-09   155,276.00
2016-03-31   109,452.04
2016-03-14    81,398.03
2016-03-27    60,130.39
2015-09-30    53,730.37
Name: contb_receipt_amt, dtype: float64
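
The output above ranks full calendar dates by dollars, whereas the question asks which day(s) of the month attract the most donations. A minimal sketch of counting donations by day of the month with the `.dt.day` accessor, on invented dates:

```python
import pandas as pd

# Invented dates for illustration only.
toy = pd.DataFrame({
    "contb_receipt_dt": pd.to_datetime(
        ["2016-03-09", "2016-04-09", "2016-03-31", "2015-09-30", "2016-02-09"]
    )
})

# Extract the day of the month (1-31) and count donations per day.
day_counts = toy["contb_receipt_dt"].dt.day.value_counts()
top_day = day_counts.idxmax()

print(day_counts)
print(top_day)
```

Counting rows answers "when do most people donate"; summing `contb_receipt_amt` per day would instead answer "when is the most money donated".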