# Overview

This project analyzes Kickstarter projects to find out which ones tend to succeed. The goal is to understand the trends first, then build a classification model that predicts which projects will succeed.

Data source: https://webrobots.io/kickstarter-datasets/

## Development Roadmap

<blockquote>Phase 1: Combine all Kickstarter datasets and clean data <br>
Phase 2: Create visualizations and understand the current state <br>
Phase 3: Create a classification model based on available factors <br>
Phase 4: Create a model that predicts how much funding a project receives beyond its goal <br>
Phase 5: Add features derived from NLP on project descriptions and from other textual data such as location</blockquote>

## Questions that can be answered

What campaigns are most successful by category? Location?
What factors contribute to successful campaigns? Spotlights? Number of contributors?

## Other Potential Factors Affecting Success
Shared on other outlets? Social Media?

In [15]:
import pandas as pd

import glob
import json

# import the classes directly; a separate `import datetime` would be shadowed by this line
from datetime import datetime, timedelta

In [16]:
data_files = sorted(glob.glob('data/Kickstarter*.csv'))
df = pd.concat((pd.read_csv(file) for file in data_files))
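
The loading step can be tried on a small scale before running it on the full download. The sketch below writes two made-up CSV files to a temporary directory (the file names and the `id` values are illustrative, not from the dataset) and combines them the same way. It also passes `ignore_index=True`, which is an optional variation that gives the combined frame a fresh 0..n-1 index right away.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Two tiny stand-in files; the real ones come from the webrobots download
    pd.DataFrame({'id': [1, 2]}).to_csv(os.path.join(tmp, 'Kickstarter001.csv'), index=False)
    pd.DataFrame({'id': [3]}).to_csv(os.path.join(tmp, 'Kickstarter002.csv'), index=False)

    # sorted() makes the load order deterministic across runs
    files = sorted(glob.glob(os.path.join(tmp, 'Kickstarter*.csv')))

    # ignore_index=True discards each file's own row numbers
    combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(combined)
```

Without `ignore_index=True`, each file keeps its own row numbers, so the combined index contains repeated labels until it is reset.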

In [17]:
pd.set_option('display.max_columns', None)

In [18]:
def clean_date(column):
    """Convert Unix timestamps (seconds since 1970-01-01) to datetime objects."""
    start_date = datetime(year=1970, month=1, day=1)
    # returns a plain list so it can be assigned back to a column by position
    return [start_date + timedelta(seconds=int(r)) for r in column]
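
The same conversion can also be done without an explicit Python loop. The sketch below assumes the columns hold Unix timestamps in whole seconds, as the `timedelta(seconds=...)` call implies. `clean_date_vectorized` is a hypothetical name that is not part of this notebook, and the sample values are made up.

```python
from datetime import datetime, timedelta

import pandas as pd

def clean_date_vectorized(column):
    """Convert Unix timestamps (seconds) to timestamps using pandas."""
    # unit='s' tells pandas the integers count seconds from the Unix epoch
    converted = pd.to_datetime(pd.Series(list(column)).astype('int64'), unit='s')
    # Return a plain list so assignment to a column ignores the DataFrame index
    return converted.tolist()

sample = [0, 86400, 1500000000]
result = clean_date_vectorized(sample)

# Reference values computed the same way as the loop-based clean_date
epoch = datetime(year=1970, month=1, day=1)
expected = [epoch + timedelta(seconds=s) for s in sample]
print(result == expected)
```

The elements are `pandas.Timestamp` objects, which subclass `datetime`, so they compare equal to the values produced by the loop-based version.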

In [19]:
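# Extract one key from each JSON string in a column.
# Rows that cannot be parsed, or that lack the key, become None.

def parse_json(column, keyword):
    new_column = []
    for r in column:
        try:
            new_column.append(json.loads(r)[keyword])
        # TypeError: value is not a string (e.g. NaN)
        # ValueError: malformed JSON (json.JSONDecodeError is a subclass)
        # KeyError: keyword is missing from the parsed object
        except (TypeError, ValueError, KeyError):
            new_column.append(None)
    return new_column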
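
To make the fallback behavior concrete, the sketch below runs a version of `parse_json` that catches specific exceptions on a few made-up values: a valid JSON object, a truncated string, a missing value, and an object without the requested key. The sample strings are illustrative and not taken from the dataset.

```python
import json

def parse_json(column, keyword):
    new_column = []
    for r in column:
        try:
            new_column.append(json.loads(r)[keyword])
        except (TypeError, ValueError, KeyError):
            # unparseable input or missing key: record a missing value
            new_column.append(None)
    return new_column

samples = [
    '{"name": "Art"}',   # valid JSON containing the key
    '{"name": ',         # truncated JSON -> json.JSONDecodeError (a ValueError)
    float('nan'),        # missing value, not a string -> TypeError
    '{"id": 1}',         # valid JSON without the key -> KeyError
]
names = parse_json(samples, 'name')
print(names)  # ['Art', None, None, None]
```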

In [20]:
df.created_at = clean_date(df.created_at)
df.state_changed_at = clean_date(df.state_changed_at)
df.deadline = clean_date(df.deadline)
df.launched_at = clean_date(df.launched_at)

In [21]:
df.category = parse_json(df.category, 'name')
df.creator = parse_json(df.creator, 'name')
df['location_city'] = parse_json(df.location, 'localized_name')
df['location_state'] = parse_json(df.location, 'state')

In [22]:
# drop duplicate rows that match on id, deadline, and creator
df = df.drop_duplicates(subset=['id', 'deadline', 'creator'])
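
A toy illustration of how the `subset` argument behaves, using made-up rows. By default `drop_duplicates` keeps the first row for each key combination, so which copy survives depends on the order in which the files were loaded.

```python
import pandas as pd

# Made-up rows: the first two share id, deadline, and creator
toy = pd.DataFrame({
    'id': [1, 1, 2],
    'deadline': ['2020-01-01', '2020-01-01', '2020-02-01'],
    'creator': ['Ann', 'Ann', 'Bob'],
    'pledged': [100, 150, 30],
})

# Only the subset columns are compared; 'pledged' may differ between copies
deduped = toy.drop_duplicates(subset=['id', 'deadline', 'creator'])
print(deduped)
```

Here the row with `pledged == 100` is kept and the one with `150` is dropped. Passing `keep='last'` would keep the later copy instead, which may be preferable if later files hold more recent data.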

In [23]:
# drop the remaining raw JSON columns
df = df.drop(columns=['location','photo','profile','urls'])

In [24]:
# renumber rows 0..n-1 and discard the old index
df = df.reset_index(drop=True)

In [25]:
# save to pickle
df.to_pickle('Tables/loaded_data.pkl')

In [26]:
# also save a separate CSV copy
df.to_csv('Tables/loaded_data.csv')
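
The two output formats behave differently when read back. The sketch below uses a toy frame and a temporary directory rather than `Tables/`, and the column values are made up. It shows that the pickle restores dtypes as saved, while the CSV needs `index_col=0` (because `to_csv` writes the index as the first column by default) and `parse_dates` to recover the datetime column.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2],
    'launched_at': pd.to_datetime(['2020-01-01', '2020-02-01']),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'loaded_data.pkl')
    csv_path = os.path.join(tmp, 'loaded_data.csv')
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path)

    # Pickle round-trips the frame, including the datetime dtype
    from_pickle = pd.read_pickle(pkl_path)

    # CSV stores text only: restore the index and re-parse the dates
    from_csv = pd.read_csv(csv_path, index_col=0, parse_dates=['launched_at'])

print(from_pickle.equals(toy))
print(from_csv.dtypes)
```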