### Data Wrangling Steps
#### By: Irene Yao
#### Date: 8/20/2018
The data was downloaded as CSV files from DonorsChoose.org: https://research.donorschoose.org/t/download-opendata/33  
For this project, I will use two of the datasets: Project Data and Donation Data.

#### Step 1: Import relevant packages

In [34]:
import pandas as pd
import numpy as np

#### Step 2: Inspect columns from both data tables

In [3]:
donation_file = 'donations.csv'
with open(donation_file) as dfile:
    columns = dfile.readline() ## read the first line, which is the columns
    print(columns)

,_donationid,_projectid,_donor_acctid,_cartid,donor_city,donor_state,donor_zip,is_teacher_acct,donation_timestamp,donation_to_project,donation_optional_support,donation_total,donation_included_optional_support,payment_method,payment_included_acct_credit,payment_included_campaign_gift_card,payment_included_web_purchased_gift_card,payment_was_promo_matched,is_teacher_referred,giving_page_id,giving_page_type,for_honoree,thank_you_packet_mailed



In [4]:
projects_file = 'projects.csv'
with open(projects_file) as pfile:
    columns = pfile.readline() ## read the first line, which is the columns
    print(columns)

,_projectid,_teacher_acctid,_schoolid,school_ncesid,school_latitude,school_longitude,school_city,school_state,school_zip,school_metro,school_district,school_county,school_charter,school_magnet,school_year_round,school_nlns,school_kipp,school_charter_ready_promise,teacher_prefix,teacher_teach_for_america,teacher_ny_teaching_fellow,primary_focus_subject,primary_focus_area,secondary_focus_subject,secondary_focus_area,resource_type,poverty_level,grade_level,vendor_shipping_charges,sales_tax,payment_processing_charges,fulfillment_labor_materials,total_price_excluding_optional_support,total_price_including_optional_support,students_reached,total_donations,num_donors,eligible_double_your_impact_match,eligible_almost_home_match,funding_status,date_posted,date_completed,date_thank_you_packet_mailed,date_expiration
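As an alternative to reading the first line by hand, pandas can parse just the header row. The sketch below is only an illustration: `sample_csv` is a made-up in-memory stand-in for the real files, assumed to share their layout (including the unnamed leading index column).

```python
import io
import pandas as pd

# Hypothetical stand-in for donations.csv with an unnamed leading index column.
sample_csv = io.StringIO(
    ",_donationid,_projectid,donation_total\n"
    "0,d1,p1,10.0\n"
    "1,d2,p1,75.0\n"
)

# nrows=0 parses only the header row, so no data rows are loaded.
header = pd.read_csv(sample_csv, nrows=0)
column_names = header.columns.tolist()
print(column_names)
```

With the real files, passing the file name (for example `donation_file`) in place of `sample_csv` should give the column list without loading the data rows.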



#### Step 3: Merge donation and project tables and remove the irrelevant columns
Some of the columns exist only to link to other tables, which I'm not going to cover in this study. Those columns will be removed.  
After inspecting the two data tables, I determined that the following columns are necessary.
* From donation table: '_donationid','_projectid','_donor_acctid','donor_city','donor_state','is_teacher_acct','donation_timestamp','donation_to_project',
'donation_optional_support','donation_total','payment_method','is_teacher_referred','thank_you_packet_mailed' 

* From projects table: '_projectid','_teacher_acctid','school_ncesid','school_latitude','school_longitude','school_city','school_state','school_zip','school_metro',
'school_district','school_county','teacher_prefix','primary_focus_subject','primary_focus_area','secondary_focus_subject','secondary_focus_area',
'resource_type','poverty_level','grade_level','total_price_excluding_optional_support','total_price_including_optional_support','students_reached',
'total_donations','num_donors','eligible_double_your_impact_match','eligible_almost_home_match','funding_status','date_posted','date_completed',
'date_thank_you_packet_mailed','date_expiration'

In [5]:
## import csv for donation
df_donation = pd.read_csv(donation_file)

In [7]:
## import csv for projects
df_projects = pd.read_csv(projects_file)
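Because the relevant columns are already known at this point, one option is to load only those columns with the `usecols` argument of `pd.read_csv`, which can reduce memory use on large files. This is a minimal sketch on a made-up in-memory CSV (`sample_csv` and `wanted_cols` are illustrative names, not part of the original notebook).

```python
import io
import pandas as pd

# Hypothetical miniature of the donations file.
sample_csv = io.StringIO(
    ",_donationid,_projectid,donor_zip,donation_total\n"
    "0,d1,p1,945,10.0\n"
    "1,d2,p2,070,75.0\n"
)

# Only the listed columns are parsed; the index column and donor_zip are skipped.
wanted_cols = ['_donationid', '_projectid', 'donation_total']
df_small = pd.read_csv(sample_csv, usecols=wanted_cols)
print(df_small.columns.tolist())
```

The notebook itself loads the full tables first and selects columns afterwards, which gives the same result at a higher memory cost.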

In [9]:
## select only the relevant columns
## donor_zip is dropped because its last two digits are masked, so it adds little information
new_cols_d = ['_donationid','_projectid','_donor_acctid','donor_city','donor_state','is_teacher_acct'
            ,'donation_timestamp','donation_to_project','donation_optional_support','donation_total','payment_method','is_teacher_referred'
            ,'thank_you_packet_mailed']
df_donation_short = df_donation[new_cols_d]
df_donation_short.head(2)

Unnamed: 0,_donationid,_projectid,_donor_acctid,donor_city,donor_state,is_teacher_acct,donation_timestamp,donation_to_project,donation_optional_support,donation_total,payment_method,is_teacher_referred,thank_you_packet_mailed
0,0000023f507999464aa2b78875b7e5d6,69bf3a609bb4673818e0eebd004ea504,22c50856b0824db76daf527da6af9abf,,,f,2011-02-13 11:07:19.349,8.5,1.5,10.0,creditcard,f,f
1,000009891526c0ade7180f8423792063,26f02742185eb1f73f3bc5be4655fae2,c91489d7b6b89943a28555e6add72509,,NJ,t,2013-05-26 11:28:31.30,63.75,11.25,75.0,creditcard,f,f


In [11]:
## do the same for projects dataset
new_cols_p = ['_projectid','_teacher_acctid','school_ncesid','school_latitude','school_longitude'
              ,'school_city','school_state','school_zip','school_metro','school_district','school_county'
              ,'teacher_prefix','primary_focus_subject','primary_focus_area','secondary_focus_subject','secondary_focus_area'
              ,'resource_type','poverty_level','grade_level','total_price_excluding_optional_support'
              ,'total_price_including_optional_support','students_reached','total_donations','num_donors'
              ,'eligible_double_your_impact_match','eligible_almost_home_match','funding_status','date_posted'
              ,'date_completed','date_thank_you_packet_mailed','date_expiration']
df_projects_short = df_projects[new_cols_p]
df_projects_short.head(2)

Unnamed: 0,_projectid,_teacher_acctid,school_ncesid,school_latitude,school_longitude,school_city,school_state,school_zip,school_metro,school_district,...,students_reached,total_donations,num_donors,eligible_double_your_impact_match,eligible_almost_home_match,funding_status,date_posted,date_completed,date_thank_you_packet_mailed,date_expiration
0,7342bd01a2a7725ce033a179d22e382d,5c43ef5eac0f5857c266baa1ccfa3d3f,360009700000.0,40.688454,-73.910432,New York City,NY,11207.0,urban,New York City Dept Of Ed,...,0.0,251.9,1,f,f,completed,2002-09-13 00:00:00,2002-09-23 00:00:00,2003-01-27 00:00:00,2003-12-31 00:00:00
1,ed87d61cef7fda668ae70be7e0c6cebf,1f4493b3d3fe4a611f3f4d21a249376a,360007700000.0,40.765517,-73.96009,New York City,NY,10065.0,,New York City Dept Of Ed,...,0.0,137.0,1,f,f,completed,2002-09-13 00:00:00,2002-09-23 00:00:00,2003-01-03 00:00:00,2003-12-31 00:00:00


In [12]:
## merge projects table together with donation table using _projectid
df = df_donation_short.merge(df_projects_short, how='inner', on='_projectid')
df.head()

Unnamed: 0,_donationid,_projectid,_donor_acctid,donor_city,donor_state,is_teacher_acct,donation_timestamp,donation_to_project,donation_optional_support,donation_total,...,students_reached,total_donations,num_donors,eligible_double_your_impact_match,eligible_almost_home_match,funding_status,date_posted,date_completed,date_thank_you_packet_mailed,date_expiration
0,0000023f507999464aa2b78875b7e5d6,69bf3a609bb4673818e0eebd004ea504,22c50856b0824db76daf527da6af9abf,,,f,2011-02-13 11:07:19.349,8.5,1.5,10.0,...,23.0,510.9,4,t,f,completed,2011-01-23 00:00:00,2011-03-18 00:00:00,2011-05-10 00:00:00,2011-06-21 00:00:00
1,53ec9a692cd770d6e4f0c6673451ff60,69bf3a609bb4673818e0eebd004ea504,ba7d4afdfc182c4c5fde1d57980697bc,,CA,f,2011-03-18 01:54:04.96,217.13,38.32,255.45,...,23.0,510.9,4,t,f,completed,2011-01-23 00:00:00,2011-03-18 00:00:00,2011-05-10 00:00:00,2011-06-21 00:00:00
2,798dad82b651ff0371e4a655e56bbca5,69bf3a609bb4673818e0eebd004ea504,9b29654e7ea1241e6fa1ec4805b7429e,Wilton,CA,f,2011-03-18 01:54:04.882,166.13,29.32,195.45,...,23.0,510.9,4,t,f,completed,2011-01-23 00:00:00,2011-03-18 00:00:00,2011-05-10 00:00:00,2011-06-21 00:00:00
3,87c43e67b49398b4e0d54d31e2ae95ca,69bf3a609bb4673818e0eebd004ea504,b8f54362e335b81171ebbe36c657ea4b,Orangevale,CA,f,2011-01-31 00:41:24.833,42.5,7.5,50.0,...,23.0,510.9,4,t,f,completed,2011-01-23 00:00:00,2011-03-18 00:00:00,2011-05-10 00:00:00,2011-06-21 00:00:00
4,000009891526c0ade7180f8423792063,26f02742185eb1f73f3bc5be4655fae2,c91489d7b6b89943a28555e6add72509,,NJ,t,2013-05-26 11:28:31.30,63.75,11.25,75.0,...,78.0,335.96,4,f,f,completed,2013-02-11 00:00:00,2013-05-26 00:00:00,2013-09-09 00:00:00,2013-06-11 00:00:00


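If every donation matches exactly one project, the inner join of the donation and project tables should return the same number of rows as the donation table itself. Check this by viewing the shapes of the three tables. The merged table should also have 13 + 31 - 1 = 43 columns, since `_projectid` is shared.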

In [13]:
print(df_donation_short.shape)
print(df_projects_short.shape)
print(df.shape)

(6211956, 13)
(1203287, 31)
(6211956, 43)
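The row-count check above can also be made explicit inside the merge call. The sketch below uses two tiny made-up tables (`donations` and `projects` are illustrative, not the real data) and a left join instead of the inner join, so that unmatched donations would remain visible.

```python
import pandas as pd

# Hypothetical miniature tables keyed on _projectid.
donations = pd.DataFrame({
    '_donationid': ['d1', 'd2', 'd3'],
    '_projectid': ['p1', 'p1', 'p2'],
})
projects = pd.DataFrame({
    '_projectid': ['p1', 'p2', 'p3'],
    'funding_status': ['completed', 'completed', 'live'],
})

# validate='many_to_one' raises pandas.errors.MergeError if _projectid is
# duplicated on the projects side, which would inflate the row count.
# indicator=True adds a _merge column recording where each row came from.
merged = donations.merge(projects, how='left', on='_projectid',
                         validate='many_to_one', indicator=True)

# Donations without a matching project would be tagged 'left_only'.
unmatched = int((merged['_merge'] == 'left_only').sum())
print(len(merged), unmatched)
```

If `unmatched` is zero and the key is unique on the projects side, an inner join and a left join return the same rows.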


### Data Wrangling
Now we can put the donations and projects tables aside for a while and work on the merged table for data wrangling. 

#### Step 1: Inspect the dataframe

In [15]:
## inspect the dataframe
## (show_counts replaces the null_counts argument, which was deprecated in pandas 1.2)
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6211956 entries, 0 to 6211955
Data columns (total 43 columns):
_donationid                               6211956 non-null object
_projectid                                6211956 non-null object
_donor_acctid                             6211956 non-null object
donor_city                                1970233 non-null object
donor_state                               5207104 non-null object
is_teacher_acct                           6211956 non-null object
donation_timestamp                        6211956 non-null object
donation_to_project                       6211956 non-null float64
donation_optional_support                 6211956 non-null float64
donation_total                            6211956 non-null float64
payment_method                            6211956 non-null object
is_teacher_referred                       6211956 non-null object
thank_you_packet_mailed                   6211956 non-null object
_teacher_acctid           

The dataset has 6,211,956 rows, and some of the columns contain null values. We also notice that the date columns are stored as type object, and that some fields that should be strings (school_ncesid and school_zip) are stored as float.

In [18]:
df.head()

Unnamed: 0,_donationid,_projectid,_donor_acctid,donor_city,donor_state,is_teacher_acct,donation_timestamp,donation_to_project,donation_optional_support,donation_total,...,students_reached,total_donations,num_donors,eligible_double_your_impact_match,eligible_almost_home_match,funding_status,date_posted,date_completed,date_thank_you_packet_mailed,date_expiration
0,0000023f507999464aa2b78875b7e5d6,69bf3a609bb4673818e0eebd004ea504,22c50856b0824db76daf527da6af9abf,,,f,2011-02-13 11:07:19.349,8.5,1.5,10.0,...,23.0,510.9,4,t,f,completed,2011-01-23,2011-03-18,2011-05-10,2011-06-21
1,53ec9a692cd770d6e4f0c6673451ff60,69bf3a609bb4673818e0eebd004ea504,ba7d4afdfc182c4c5fde1d57980697bc,,CA,f,2011-03-18 01:54:04.960,217.13,38.32,255.45,...,23.0,510.9,4,t,f,completed,2011-01-23,2011-03-18,2011-05-10,2011-06-21
2,798dad82b651ff0371e4a655e56bbca5,69bf3a609bb4673818e0eebd004ea504,9b29654e7ea1241e6fa1ec4805b7429e,Wilton,CA,f,2011-03-18 01:54:04.882,166.13,29.32,195.45,...,23.0,510.9,4,t,f,completed,2011-01-23,2011-03-18,2011-05-10,2011-06-21
3,87c43e67b49398b4e0d54d31e2ae95ca,69bf3a609bb4673818e0eebd004ea504,b8f54362e335b81171ebbe36c657ea4b,Orangevale,CA,f,2011-01-31 00:41:24.833,42.5,7.5,50.0,...,23.0,510.9,4,t,f,completed,2011-01-23,2011-03-18,2011-05-10,2011-06-21
4,000009891526c0ade7180f8423792063,26f02742185eb1f73f3bc5be4655fae2,c91489d7b6b89943a28555e6add72509,,NJ,t,2013-05-26 11:28:31.300,63.75,11.25,75.0,...,78.0,335.96,4,f,f,completed,2013-02-11,2013-05-26,2013-09-09,2013-06-11


#### Step 2: Change the columns to the correct data type

In [16]:
df['school_ncesid'] = df['school_ncesid'].astype(str)
df['donation_timestamp'] = pd.to_datetime(df['donation_timestamp'])
df['school_zip'] = df['school_zip'].astype(str)
df['date_posted'] = pd.to_datetime(df['date_posted'])
df['date_completed'] = pd.to_datetime(df['date_completed'])
df['date_thank_you_packet_mailed'] = pd.to_datetime(df['date_thank_you_packet_mailed'])
df['date_expiration'] = pd.to_datetime(df['date_expiration'])
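One caveat with `astype(str)` on a float column is that the decimal point is kept, so `11207.0` becomes the text `'11207.0'`, and, at least in older pandas releases, missing values become the literal text `'nan'`. The following is a hedged sketch of an alternative, not the method used in this notebook: it goes through the nullable `Int64` and `string` dtypes. `df_demo` is made-up data, and zero-padding to five digits is an assumption about US ZIP codes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: ZIP codes that were read in as floats, one missing.
df_demo = pd.DataFrame({'school_zip': [11207.0, 7030.0, np.nan]})

zip_str = (
    df_demo['school_zip']
    .astype('Int64')    # nullable integer: drops the ".0", keeps missing as <NA>
    .astype('string')   # nullable string dtype: missing values stay missing
    .str.zfill(5)       # restore leading zeros lost in the float representation
)
print(zip_str.tolist())
```

Keeping missing values as real nulls matters for the null check in the next step, which cannot detect a value once it has been turned into text.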

#### Step 3: Deal with null values (pending)

In [17]:
## print out the columns that have at least one null value
nan_cols = []
for col in df.columns:
    if df[col].count() < len(df):  ## count() excludes nulls; len(df) avoids hardcoding the row count
        nan_cols.append(col)
print(nan_cols)

['donor_city', 'donor_state', 'school_city', 'school_metro', 'school_district', 'school_county', 'teacher_prefix', 'primary_focus_subject', 'primary_focus_area', 'secondary_focus_subject', 'secondary_focus_area', 'resource_type', 'grade_level', 'students_reached', 'date_completed', 'date_thank_you_packet_mailed', 'date_expiration']
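Beyond listing which columns contain nulls, it can help to see how many values are missing in each before deciding how to handle them. This is a minimal sketch on made-up data (`df_demo` and `null_summary` are illustrative names); it only counts nulls and does not decide how to treat them.

```python
import pandas as pd

# Hypothetical sample with missing donor locations.
df_demo = pd.DataFrame({
    'donor_city': ['Wilton', None, None, 'Orangevale'],
    'donor_state': ['CA', 'CA', None, 'CA'],
    'donation_total': [195.45, 255.45, 10.0, 50.0],
})

# Count nulls per column, keep only columns that have any, and add a percentage.
null_counts = df_demo.isnull().sum()
null_summary = (
    null_counts[null_counts > 0]
    .to_frame('n_null')
    .assign(pct_null=lambda t: 100 * t['n_null'] / len(df_demo))
    .sort_values('n_null', ascending=False)
)
print(null_summary)
```

Applied to the merged table, the same summary would show, for example, how sparse `donor_city` is relative to `donor_state`.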


#### Step 4: Create additional columns to help with analysis (pending)