# Data Collection and Cleaning

According to its website, Kickstarter is a service that "helps artists, musicians, filmmakers, designers, and other creators find the resources and support they need to make their ideas a reality." In other words, it is a crowdfunding platform. Here we examine how to predict whether a Kickstarter project will meet its fundraising goal or fail.

In [1]:
import pandas as pd
import numpy as np
import os
import ast
from datetime import datetime
from random import randint
from pandas import json_normalize  # top-level since pandas 1.0; the pandas.io.json import path is deprecated

pd.options.display.max_rows = 10

dir_main = "/Users/Kellie/Documents/data301/proj/"
data_dir = dir_main + "datasets/"

cols = ["backers_count", "blurb", "category", "country", "created_at", "currency",
        "deadline", "goal", "id", "launched_at", "name", "pledged", "slug", "spotlight",
        "staff_pick", "state", "state_changed_at", "urls", "usd_pledged"]

Kickstarter has privatized its API, so we downloaded our data from [Web Robots](https://webrobots.io/kickstarter-datasets/), a web-scraping service. There are numerous files per month (with the same headers), so we need to begin by collecting those into a single `DataFrame`.

In [2]:
date = "2019-01-17"
folder_name = "Kickstarter_" + date + "/"

files_for_month = os.listdir(data_dir + folder_name)
files_for_month[:3]

['Kickstarter040.csv', 'Kickstarter054.csv', 'Kickstarter041.csv']

In [3]:
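# Read each file, keep only the columns of interest, then concatenate once
# (cheaper than growing the DataFrame inside the loop).
baby_dfs = []
for ks in files_for_month:
    path = data_dir + folder_name + ks
    baby_dfs.append(pd.read_csv(path)[cols])
month_df = pd.concat(baby_dfs, ignore_index=True)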

month_df.head()

Unnamed: 0,backers_count,blurb,category,country,created_at,currency,deadline,goal,id,launched_at,name,pledged,slug,spotlight,staff_pick,state,state_changed_at,urls,usd_pledged
0,1,An eco-friendly coffee table that is both func...,"{""id"":356,""name"":""Woodworking"",""slug"":""crafts/...",US,1473631551,USD,1480966943,5000.0,1504859185,1478371343,Industrial Bamboo Table,240.0,industrial-bamboo-table,False,False,failed,1480966943,"{""web"":{""project"":""https://www.kickstarter.com...",240.0
1,3,We take digital uploads and make them handpain...,"{""id"":23,""name"":""Painting"",""slug"":""art/paintin...",CA,1436540372,CAD,1440417634,1000.0,49266114,1437480034,"Custom Pet Portraits on Canvas- ""From Pixels t...",322.0,custom-pet-portraits-on-canvas-from-pixels-to-...,False,False,failed,1440417634,"{""web"":{""project"":""https://www.kickstarter.com...",247.950175
2,243,We are a team of restaurant pros looking to fu...,"{""id"":311,""name"":""Food Trucks"",""slug"":""food/fo...",US,1427218874,USD,1431706954,35000.0,1228074690,1429114954,The Barmobile: Boston's Mobile Cocktail Cateri...,41738.0,the-barmobile-bostons-mobile-cocktail-catering...,True,True,successful,1431706954,"{""web"":{""project"":""https://www.kickstarter.com...",41738.0
3,27,"Loosely-based on a Lakota legend, Grandfather ...","{""id"":46,""name"":""Children's Books"",""slug"":""pub...",US,1495110632,USD,1500217383,3000.0,330962986,1496329383,Grandfather Thunder & The Night Horses,3115.0,grandfather-thunder-and-the-night-horses,True,False,successful,1500217384,"{""web"":{""project"":""https://www.kickstarter.com...",3115.0
4,3,Save me is a feature film about a depression s...,"{""id"":298,""name"":""Movie Theaters"",""slug"":""film...",IE,1450121921,EUR,1455328590,15000.0,1657821447,1450144590,Save Me-A film to hightlight depression (Cance...,601.0,save-me-1,False,False,canceled,1455065666,"{""web"":{""project"":""https://www.kickstarter.com...",660.680598
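
The collection step can also be sketched on throwaway files, which makes it easy to check the logic without the full download. This is an illustration only: the two tiny CSVs, the `wanted` column list, and the `Kickstarter*.csv` pattern are assumptions made for the demo, not part of the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

wanted = ["id", "state"]  # hypothetical subset of columns for the demo

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Write two small files with identical headers to mimic one month's dump.
    (folder / "Kickstarter000.csv").write_text("id,state,extra\n1,failed,a\n2,successful,b\n")
    (folder / "Kickstarter001.csv").write_text("id,state,extra\n3,canceled,c\n")

    # sorted() makes the row order reproducible; directory listing order is arbitrary.
    paths = sorted(folder.glob("Kickstarter*.csv"))

    # usecols skips unwanted columns at read time instead of selecting them afterwards.
    combined = pd.concat(
        (pd.read_csv(p, usecols=wanted) for p in paths),
        ignore_index=True,
    )

print(combined)
```

Passing `usecols` avoids loading columns that are discarded immediately, which may matter when a month spans dozens of files.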


According to the Web Robots website, their scraper crawls through the Kickstarter website by exploring categories and subcategories. As a result, some projects are scraped twice. We need to find and remove the duplicates.

In [4]:
print("original shape:", month_df.shape)
# Keep the first occurrence of each project id and renumber the rows from 0.
month_df = month_df.drop_duplicates(subset=["id"]).reset_index(drop=True)
print("shape after dropping duplicates:", month_df.shape)
month_df.head()

original shape: (207848, 19)
shape after dropping duplicates: (180934, 19)


Unnamed: 0,backers_count,blurb,category,country,created_at,currency,deadline,goal,id,launched_at,name,pledged,slug,spotlight,staff_pick,state,state_changed_at,urls,usd_pledged
0,1,An eco-friendly coffee table that is both func...,"{""id"":356,""name"":""Woodworking"",""slug"":""crafts/...",US,1473631551,USD,1480966943,5000.0,1504859185,1478371343,Industrial Bamboo Table,240.0,industrial-bamboo-table,False,False,failed,1480966943,"{""web"":{""project"":""https://www.kickstarter.com...",240.0
1,3,We take digital uploads and make them handpain...,"{""id"":23,""name"":""Painting"",""slug"":""art/paintin...",CA,1436540372,CAD,1440417634,1000.0,49266114,1437480034,"Custom Pet Portraits on Canvas- ""From Pixels t...",322.0,custom-pet-portraits-on-canvas-from-pixels-to-...,False,False,failed,1440417634,"{""web"":{""project"":""https://www.kickstarter.com...",247.950175
2,243,We are a team of restaurant pros looking to fu...,"{""id"":311,""name"":""Food Trucks"",""slug"":""food/fo...",US,1427218874,USD,1431706954,35000.0,1228074690,1429114954,The Barmobile: Boston's Mobile Cocktail Cateri...,41738.0,the-barmobile-bostons-mobile-cocktail-catering...,True,True,successful,1431706954,"{""web"":{""project"":""https://www.kickstarter.com...",41738.0
3,27,"Loosely-based on a Lakota legend, Grandfather ...","{""id"":46,""name"":""Children's Books"",""slug"":""pub...",US,1495110632,USD,1500217383,3000.0,330962986,1496329383,Grandfather Thunder & The Night Horses,3115.0,grandfather-thunder-and-the-night-horses,True,False,successful,1500217384,"{""web"":{""project"":""https://www.kickstarter.com...",3115.0
4,3,Save me is a feature film about a depression s...,"{""id"":298,""name"":""Movie Theaters"",""slug"":""film...",IE,1450121921,EUR,1455328590,15000.0,1657821447,1450144590,Save Me-A film to hightlight depression (Cance...,601.0,save-me-1,False,False,canceled,1455065666,"{""web"":{""project"":""https://www.kickstarter.com...",660.680598
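
Before dropping rows, it can be worth confirming that repeated `id`s really are the same project scraped more than once. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy data: project 101 appears twice, as if scraped under two category pages.
toy = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "name": ["Bamboo Table", "Pet Portraits", "Bamboo Table", "Barmobile"],
    "backers_count": [1, 3, 1, 243],
})

# keep=False marks every member of a duplicated group, not just the repeats,
# so the copies can be compared side by side.
dupes = toy[toy.duplicated(subset=["id"], keep=False)].sort_values("id")

# drop_duplicates keeps the first occurrence of each id by default.
deduped = toy.drop_duplicates(subset=["id"]).reset_index(drop=True)

print(dupes)
print(deduped)
```

If the copies of a project differ in fields such as `backers_count`, keeping the first occurrence is an arbitrary choice, and that would be worth knowing before modeling.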


## Clean JSON
Next, we need to clean up the JSON so we can expand it with `json_normalize`.

_Note:_ We tried to parse several JSON fields, but ran into problems extracting the dictionary from the string. For some fields `ast.literal_eval` worked; other fields threw errors under that technique but worked with `json.loads`. Because we could not get reliable output for many of these variables, we dropped them from our analysis. `urls` parsed correctly earlier this week, but somewhere along the way a cosmic flare struck Jupyter and our code to parse the URLs stopped working. We have kept that variable in our `DataFrame`, though, as a way to view the Kickstarter webpage of any project under study.
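
One likely reason the two parsers disagree is that JSON and Python spell some literals differently: JSON uses `true`, `false`, and `null`, while `ast.literal_eval` only accepts `True`, `False`, and `None`. The sketch below tries `json.loads` first and falls back to `ast.literal_eval`. The helper name `parse_json_field` and the `is_root` key are hypothetical, chosen only for this illustration, and this is not the parsing code used in the notebook.

```python
import ast
import json


def parse_json_field(text):
    """Parse a JSON-like string; return {} when it is empty or unparseable."""
    if not isinstance(text, str) or not text.strip():
        return {}
    try:
        # json.loads understands JSON literals such as true, false and null.
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    try:
        # Fall back to Python literal syntax (single quotes, True/False/None).
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return {}


json_style = '{"id": 356, "name": "Woodworking", "is_root": false}'
python_style = "{'id': 23, 'name': 'Painting', 'is_root': False}"

print(parse_json_field(json_style))
print(parse_json_field(python_style))
```

A string that has been cut off partway through, such as a URL field shortened for display, fails both parsers and comes back as `{}`. That can hide data problems, so counting the empty results after parsing is a sensible check.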

In [5]:
json_cols = ["category"]

for col in json_cols:
    # Fill missing values with "{}" rather than "": ast.literal_eval("") raises a SyntaxError.
    month_df[col] = month_df[col].fillna("{}")
    month_df[col] = month_df[col].apply(ast.literal_eval)
    print(col)
    
month_df.head()

category


Unnamed: 0,backers_count,blurb,category,country,created_at,currency,deadline,goal,id,launched_at,name,pledged,slug,spotlight,staff_pick,state,state_changed_at,urls,usd_pledged
0,1,An eco-friendly coffee table that is both func...,"{'id': 356, 'name': 'Woodworking', 'slug': 'cr...",US,1473631551,USD,1480966943,5000.0,1504859185,1478371343,Industrial Bamboo Table,240.0,industrial-bamboo-table,False,False,failed,1480966943,"{""web"":{""project"":""https://www.kickstarter.com...",240.0
1,3,We take digital uploads and make them handpain...,"{'id': 23, 'name': 'Painting', 'slug': 'art/pa...",CA,1436540372,CAD,1440417634,1000.0,49266114,1437480034,"Custom Pet Portraits on Canvas- ""From Pixels t...",322.0,custom-pet-portraits-on-canvas-from-pixels-to-...,False,False,failed,1440417634,"{""web"":{""project"":""https://www.kickstarter.com...",247.950175
2,243,We are a team of restaurant pros looking to fu...,"{'id': 311, 'name': 'Food Trucks', 'slug': 'fo...",US,1427218874,USD,1431706954,35000.0,1228074690,1429114954,The Barmobile: Boston's Mobile Cocktail Cateri...,41738.0,the-barmobile-bostons-mobile-cocktail-catering...,True,True,successful,1431706954,"{""web"":{""project"":""https://www.kickstarter.com...",41738.0
3,27,"Loosely-based on a Lakota legend, Grandfather ...","{'id': 46, 'name': 'Children's Books', 'slug':...",US,1495110632,USD,1500217383,3000.0,330962986,1496329383,Grandfather Thunder & The Night Horses,3115.0,grandfather-thunder-and-the-night-horses,True,False,successful,1500217384,"{""web"":{""project"":""https://www.kickstarter.com...",3115.0
4,3,Save me is a feature film about a depression s...,"{'id': 298, 'name': 'Movie Theaters', 'slug': ...",IE,1450121921,EUR,1455328590,15000.0,1657821447,1450144590,Save Me-A film to hightlight depression (Cance...,601.0,save-me-1,False,False,canceled,1455065666,"{""web"":{""project"":""https://www.kickstarter.com...",660.680598


In [6]:
category_exp = json_normalize(month_df["category"])
month_df = month_df.drop("category", axis=1)
# Split "parent/child" slugs into two columns; expand=True returns a DataFrame directly.
category_tups = category_exp["slug"].str.split("/", expand=True)
category_tups.columns = ["parent_category", "category"]
month_df = pd.concat([month_df, category_tups], axis=1)
month_df.head()

Unnamed: 0,backers_count,blurb,country,created_at,currency,deadline,goal,id,launched_at,name,pledged,slug,spotlight,staff_pick,state,state_changed_at,urls,usd_pledged,parent_category,category
0,1,An eco-friendly coffee table that is both func...,US,1473631551,USD,1480966943,5000.0,1504859185,1478371343,Industrial Bamboo Table,240.0,industrial-bamboo-table,False,False,failed,1480966943,"{""web"":{""project"":""https://www.kickstarter.com...",240.0,crafts,woodworking
1,3,We take digital uploads and make them handpain...,CA,1436540372,CAD,1440417634,1000.0,49266114,1437480034,"Custom Pet Portraits on Canvas- ""From Pixels t...",322.0,custom-pet-portraits-on-canvas-from-pixels-to-...,False,False,failed,1440417634,"{""web"":{""project"":""https://www.kickstarter.com...",247.950175,art,painting
2,243,We are a team of restaurant pros looking to fu...,US,1427218874,USD,1431706954,35000.0,1228074690,1429114954,The Barmobile: Boston's Mobile Cocktail Cateri...,41738.0,the-barmobile-bostons-mobile-cocktail-catering...,True,True,successful,1431706954,"{""web"":{""project"":""https://www.kickstarter.com...",41738.0,food,food trucks
3,27,"Loosely-based on a Lakota legend, Grandfather ...",US,1495110632,USD,1500217383,3000.0,330962986,1496329383,Grandfather Thunder & The Night Horses,3115.0,grandfather-thunder-and-the-night-horses,True,False,successful,1500217384,"{""web"":{""project"":""https://www.kickstarter.com...",3115.0,publishing,children's books
4,3,Save me is a feature film about a depression s...,IE,1450121921,EUR,1455328590,15000.0,1657821447,1450144590,Save Me-A film to hightlight depression (Cance...,601.0,save-me-1,False,False,canceled,1455065666,"{""web"":{""project"":""https://www.kickstarter.com...",660.680598,film & video,movie theaters
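
The slug split can be tried in isolation on a few hand-written category dictionaries. This is a sketch under assumptions: the third entry is a hypothetical top-level category whose slug has no `/`, included to show what happens when there is no child part.

```python
import pandas as pd

# Toy category dicts shaped like the parsed `category` column (values illustrative).
categories = pd.Series([
    {"id": 356, "name": "Woodworking", "slug": "crafts/woodworking"},
    {"id": 23, "name": "Painting", "slug": "art/painting"},
    {"id": 1, "name": "Art", "slug": "art"},  # hypothetical slug with no "/"
])

# Flatten the dicts into one column per key.
expanded = pd.json_normalize(categories.tolist())

# n=1 splits on the first "/" only, so no row can produce more than two parts.
parts = expanded["slug"].str.split("/", n=1, expand=True)
parts.columns = ["parent_category", "category"]

print(parts)
```

Rows without a `/` end up with a missing `category`, so it is worth deciding whether those should be filled with the parent name or left empty.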


## Clean Timestamps

Convert Unix time (seconds since the epoch, in UTC) into a human-readable format.

In [7]:
timestamps = ["created_at", "state_changed_at", "launched_at", "deadline"]
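for ts in timestamps:
    # Interpret the values as seconds since the Unix epoch (UTC). Missing values
    # become NaT, whereas filling with "" would make the conversion raise a TypeError.
    month_df[ts + "_clean"] = pd.to_datetime(month_df[ts], unit="s")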
month_df.head()

Unnamed: 0,backers_count,blurb,country,created_at,currency,deadline,goal,id,launched_at,name,...,state,state_changed_at,urls,usd_pledged,parent_category,category,created_at_clean,state_changed_at_clean,launched_at_clean,deadline_clean
0,1,An eco-friendly coffee table that is both func...,US,1473631551,USD,1480966943,5000.0,1504859185,1478371343,Industrial Bamboo Table,...,failed,1480966943,"{""web"":{""project"":""https://www.kickstarter.com...",240.0,crafts,woodworking,2016-09-11 22:05:51,2016-12-05 19:42:23,2016-11-05 18:42:23,2016-12-05 19:42:23
1,3,We take digital uploads and make them handpain...,CA,1436540372,CAD,1440417634,1000.0,49266114,1437480034,"Custom Pet Portraits on Canvas- ""From Pixels t...",...,failed,1440417634,"{""web"":{""project"":""https://www.kickstarter.com...",247.950175,art,painting,2015-07-10 14:59:32,2015-08-24 12:00:34,2015-07-21 12:00:34,2015-08-24 12:00:34
2,243,We are a team of restaurant pros looking to fu...,US,1427218874,USD,1431706954,35000.0,1228074690,1429114954,The Barmobile: Boston's Mobile Cocktail Cateri...,...,successful,1431706954,"{""web"":{""project"":""https://www.kickstarter.com...",41738.0,food,food trucks,2015-03-24 17:41:14,2015-05-15 16:22:34,2015-04-15 16:22:34,2015-05-15 16:22:34
3,27,"Loosely-based on a Lakota legend, Grandfather ...",US,1495110632,USD,1500217383,3000.0,330962986,1496329383,Grandfather Thunder & The Night Horses,...,successful,1500217384,"{""web"":{""project"":""https://www.kickstarter.com...",3115.0,publishing,children's books,2017-05-18 12:30:32,2017-07-16 15:03:04,2017-06-01 15:03:03,2017-07-16 15:03:03
4,3,Save me is a feature film about a depression s...,IE,1450121921,EUR,1455328590,15000.0,1657821447,1450144590,Save Me-A film to hightlight depression (Cance...,...,canceled,1455065666,"{""web"":{""project"":""https://www.kickstarter.com...",660.680598,film & video,movie theaters,2015-12-14 19:38:41,2016-02-10 00:54:26,2015-12-15 01:56:30,2016-02-13 01:56:30
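
The conversion can be checked on a couple of raw values, including a missing one. A minimal sketch using `pd.to_datetime` with `unit="s"`; the two timestamps are the `created_at` values of the first two projects above, and the `None` is added only to show how gaps behave.

```python
import pandas as pd

raw = pd.Series([1473631551, 1436540372, None], name="created_at")

# unit="s" treats the numbers as seconds since 1970-01-01 00:00:00 UTC.
# Missing values become NaT instead of raising an error.
clean = pd.to_datetime(raw, unit="s")

print(clean)
```

The result is timezone-naive but expressed in UTC, so any comparison against local times needs an explicit conversion.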


In [8]:
month_df["project_duration"] = (month_df["deadline_clean"] -
                                month_df["created_at_clean"])
month_df["time_til_state_changed"] = (month_df["state_changed_at_clean"] -
                                      month_df["created_at_clean"])
month_df.head()

Unnamed: 0,backers_count,blurb,country,created_at,currency,deadline,goal,id,launched_at,name,...,urls,usd_pledged,parent_category,category,created_at_clean,state_changed_at_clean,launched_at_clean,deadline_clean,project_duration,time_til_state_changed
0,1,An eco-friendly coffee table that is both func...,US,1473631551,USD,1480966943,5000.0,1504859185,1478371343,Industrial Bamboo Table,...,"{""web"":{""project"":""https://www.kickstarter.com...",240.0,crafts,woodworking,2016-09-11 22:05:51,2016-12-05 19:42:23,2016-11-05 18:42:23,2016-12-05 19:42:23,84 days 21:36:32,84 days 21:36:32
1,3,We take digital uploads and make them handpain...,CA,1436540372,CAD,1440417634,1000.0,49266114,1437480034,"Custom Pet Portraits on Canvas- ""From Pixels t...",...,"{""web"":{""project"":""https://www.kickstarter.com...",247.950175,art,painting,2015-07-10 14:59:32,2015-08-24 12:00:34,2015-07-21 12:00:34,2015-08-24 12:00:34,44 days 21:01:02,44 days 21:01:02
2,243,We are a team of restaurant pros looking to fu...,US,1427218874,USD,1431706954,35000.0,1228074690,1429114954,The Barmobile: Boston's Mobile Cocktail Cateri...,...,"{""web"":{""project"":""https://www.kickstarter.com...",41738.0,food,food trucks,2015-03-24 17:41:14,2015-05-15 16:22:34,2015-04-15 16:22:34,2015-05-15 16:22:34,51 days 22:41:20,51 days 22:41:20
3,27,"Loosely-based on a Lakota legend, Grandfather ...",US,1495110632,USD,1500217383,3000.0,330962986,1496329383,Grandfather Thunder & The Night Horses,...,"{""web"":{""project"":""https://www.kickstarter.com...",3115.0,publishing,children's books,2017-05-18 12:30:32,2017-07-16 15:03:04,2017-06-01 15:03:03,2017-07-16 15:03:03,59 days 02:32:31,59 days 02:32:32
4,3,Save me is a feature film about a depression s...,IE,1450121921,EUR,1455328590,15000.0,1657821447,1450144590,Save Me-A film to hightlight depression (Cance...,...,"{""web"":{""project"":""https://www.kickstarter.com...",660.680598,film & video,movie theaters,2015-12-14 19:38:41,2016-02-10 00:54:26,2015-12-15 01:56:30,2016-02-13 01:56:30,60 days 06:17:49,57 days 05:15:45
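
Timedelta columns read well in a table, but most models expect plain numbers. The sketch below shows two ways to get numeric durations, assuming that days are a convenient unit for later analysis; the dates are the `created_at_clean` and `deadline_clean` values of the first and last projects above.

```python
import pandas as pd

created = pd.to_datetime(pd.Series(["2016-09-11 22:05:51", "2015-12-14 19:38:41"]))
deadline = pd.to_datetime(pd.Series(["2016-12-05 19:42:23", "2016-02-13 01:56:30"]))

# Subtracting two datetime columns yields a timedelta64 column.
duration = deadline - created

# Fractional days as a float, suitable as a numeric feature.
duration_days = duration.dt.total_seconds() / 86400

# Whole-day component only (hours, minutes and seconds are discarded).
whole_days = duration.dt.days

print(duration_days.round(2).tolist())
print(whole_days.tolist())
```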


## Export
The `DataFrame` is now in its final, clean form. We index it by project `id` and export it to CSV.

In [9]:
month_df = month_df.set_index("id")
month_df.head()

id,backers_count,blurb,country,created_at,currency,deadline,goal,launched_at,name,pledged,...,urls,usd_pledged,parent_category,category,created_at_clean,state_changed_at_clean,launched_at_clean,deadline_clean,project_duration,time_til_state_changed
1504859185,1,An eco-friendly coffee table that is both func...,US,1473631551,USD,1480966943,5000.0,1478371343,Industrial Bamboo Table,240.0,...,"{""web"":{""project"":""https://www.kickstarter.com...",240.0,crafts,woodworking,2016-09-11 22:05:51,2016-12-05 19:42:23,2016-11-05 18:42:23,2016-12-05 19:42:23,84 days 21:36:32,84 days 21:36:32
49266114,3,We take digital uploads and make them handpain...,CA,1436540372,CAD,1440417634,1000.0,1437480034,"Custom Pet Portraits on Canvas- ""From Pixels t...",322.0,...,"{""web"":{""project"":""https://www.kickstarter.com...",247.950175,art,painting,2015-07-10 14:59:32,2015-08-24 12:00:34,2015-07-21 12:00:34,2015-08-24 12:00:34,44 days 21:01:02,44 days 21:01:02
1228074690,243,We are a team of restaurant pros looking to fu...,US,1427218874,USD,1431706954,35000.0,1429114954,The Barmobile: Boston's Mobile Cocktail Cateri...,41738.0,...,"{""web"":{""project"":""https://www.kickstarter.com...",41738.0,food,food trucks,2015-03-24 17:41:14,2015-05-15 16:22:34,2015-04-15 16:22:34,2015-05-15 16:22:34,51 days 22:41:20,51 days 22:41:20
330962986,27,"Loosely-based on a Lakota legend, Grandfather ...",US,1495110632,USD,1500217383,3000.0,1496329383,Grandfather Thunder & The Night Horses,3115.0,...,"{""web"":{""project"":""https://www.kickstarter.com...",3115.0,publishing,children's books,2017-05-18 12:30:32,2017-07-16 15:03:04,2017-06-01 15:03:03,2017-07-16 15:03:03,59 days 02:32:31,59 days 02:32:32
1657821447,3,Save me is a feature film about a depression s...,IE,1450121921,EUR,1455328590,15000.0,1450144590,Save Me-A film to hightlight depression (Cance...,601.0,...,"{""web"":{""project"":""https://www.kickstarter.com...",660.680598,film & video,movie theaters,2015-12-14 19:38:41,2016-02-10 00:54:26,2015-12-15 01:56:30,2016-02-13 01:56:30,60 days 06:17:49,57 days 05:15:45


In [10]:
date_short = date.replace("-", "")[:6]
month_df.to_csv(data_dir + "final/ks" + date_short + ".csv")
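
CSV stores everything as text, so the datetime and timedelta columns do not come back with their types when the file is reloaded. The sketch below reads a stand-in for the exported file from an in-memory string. The single row mirrors the first project above, and only a few columns are included to keep the example short.

```python
import io

import pandas as pd

# Stand-in for the exported file, trimmed to four columns for illustration.
csv_text = (
    "id,created_at_clean,deadline_clean,project_duration\n"
    "1504859185,2016-09-11 22:05:51,2016-12-05 19:42:23,84 days 21:36:32\n"
)

reloaded = pd.read_csv(
    io.StringIO(csv_text),
    index_col="id",                                      # restore the id index
    parse_dates=["created_at_clean", "deadline_clean"],  # restore datetime columns
)

# Durations are stored as text, so they need an explicit conversion on reload.
reloaded["project_duration"] = pd.to_timedelta(reloaded["project_duration"])

print(reloaded.dtypes)
```

If preserving column types matters more than readability, a binary format such as Parquet or pickle would avoid this step, at the cost of a file that cannot be opened in a text editor.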