# Kickstarter: Exploratory Data Analysis

## Project Overview

#### This notebook presents an exploratory data analysis __(EDA)__ of Kickstarter project data collected from 2009 to 2012. It aims to identify the trends and correlations associated with successful Kickstarter projects and, from that analysis, to derive recommendations for how a project hosted on Kickstarter can succeed.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
kickstarter_df = pd.read_csv('kickstarter_raw.csv', encoding='latin1')

## 1. Introduction

In [3]:
kickstarter_df.shape

(45957, 17)

In [4]:
kickstarter_df.columns

Index(['project id', 'name', 'url', 'category', 'subcategory', 'location',
       'status', 'goal', 'pledged', 'funded percentage', 'backers',
       'funded date', 'levels', 'reward levels', 'updates', 'comments',
       'duration'],
      dtype='object')

In [5]:
# datetime_is_numeric was removed in pandas 2.0; no column is a datetime at this point anyway
kickstarter_df.describe(include='all')

|  | project id | name | url | category | subcategory | location | status | goal | pledged | funded percentage | backers | funded date | levels | reward levels | updates | comments | duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 45957.0 | 45957 | 45957 | 45957 | 45957 | 44635 | 45957 | 45957.0 | 45945.0 | 45957.0 | 45957.0 | 45957 | 45957.0 | 45898 | 45957.0 | 45957.0 | 45957.0 |
| unique |  | 45754 | 45814 | 14 | 51 | 4849 | 5 |  |  |  |  | 41068 |  | 28378 |  |  |  |
| top |  | Black Storm | http://www.kickstarter.com/projects/34840787/t... | `Film &amp; Video` | Documentary | Los Angeles, CA | successful |  |  |  |  | Sun, 01 Jan 2012 04:59:00 -0000 |  | `$10,$25,$50,$100,$250,$500,$1,000` |  |  |  |
| freq |  | 3 | 2 | 13053 | 4012 | 3927 | 22969 |  |  |  |  | 44 |  | 369 |  |  |  |
| mean | 1080800000.0 |  |  |  |  |  |  | 11942.71 | 4980.75 | 1.850129 | 69.973192 |  | 8.004939 |  | 4.08508 | 8.379529 | 39.995547 |
| std | 621805700.0 |  |  |  |  |  |  | 188758.3 | 56741.62 | 88.492706 | 688.628479 |  | 4.233907 |  | 6.43922 | 174.015737 | 17.414458 |
| min | 39409.0 |  |  |  |  |  |  | 0.01 | 0.0 | 0.0 | 0.0 |  | 0.0 |  | 0.0 | 0.0 | 1.0 |
| 25% | 543896200.0 |  |  |  |  |  |  | 1800.0 | 196.0 | 0.044 | 5.0 |  | 5.0 |  | 0.0 | 0.0 | 30.0 |
| 50% | 1078345000.0 |  |  |  |  |  |  | 4000.0 | 1310.0 | 1.0 | 23.0 |  | 7.0 |  | 2.0 | 0.0 | 32.0 |
| 75% | 1621596000.0 |  |  |  |  |  |  | 9862.0 | 4165.0 | 1.11564 | 59.0 |  | 10.0 |  | 6.0 | 3.0 | 48.39 |


## 2. Data Cleaning

### a. Renaming Columns

In [6]:
kickstarter_df = kickstarter_df.rename(columns={'project id': 'project_id', 'funded percentage': 'funded_percentage', 'funded date': 'funded_date', 'reward levels': 'reward_levels'})
kickstarter_df.columns

Index(['project_id', 'name', 'url', 'category', 'subcategory', 'location',
       'status', 'goal', 'pledged', 'funded_percentage', 'backers',
       'funded_date', 'levels', 'reward_levels', 'updates', 'comments',
       'duration'],
      dtype='object')
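
The explicit mapping above only replaces spaces with underscores, so the same result can be reached with a vectorized string operation on the column index. The following is a minimal sketch on a toy frame (the empty frame is an assumption for illustration, not the notebook's dataset):

```python
import pandas as pd

# Toy frame reusing a few of the original column names; it holds no data.
df = pd.DataFrame(columns=["project id", "name", "funded percentage",
                           "funded date", "reward levels"])

# Strip stray whitespace, then replace every inner space with an underscore.
df.columns = df.columns.str.strip().str.replace(" ", "_", regex=False)

print(list(df.columns))
```

This approach scales to any number of columns, while the explicit `rename` mapping makes each change visible at a glance; which one reads better depends on how many columns need renaming.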

### b. Validating Data Types

In [7]:
kickstarter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45957 entries, 0 to 45956
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   project_id         45957 non-null  int64  
 1   name               45957 non-null  object 
 2   url                45957 non-null  object 
 3   category           45957 non-null  object 
 4   subcategory        45957 non-null  object 
 5   location           44635 non-null  object 
 6   status             45957 non-null  object 
 7   goal               45957 non-null  float64
 8   pledged            45945 non-null  float64
 9   funded_percentage  45957 non-null  float64
 10  backers            45957 non-null  int64  
 11  funded_date        45957 non-null  object 
 12  levels             45957 non-null  int64  
 13  reward_levels      45898 non-null  object 
 14  updates            45957 non-null  int64  
 15  comments           45957 non-null  int64  
 16  duration           45957 non-null  float64

In [8]:
# utc=True turns the "-0000" offsets into a timezone-aware UTC column
kickstarter_df.funded_date = pd.to_datetime(kickstarter_df.funded_date, utc=True)
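
The date strings in this column follow a layout such as `Sun, 01 Jan 2012 04:59:00 -0000`. As a minimal sketch, assuming every value follows that layout, passing an explicit `format` to `pd.to_datetime` avoids per-value guessing. The second sample timestamp below is made up for illustration.

```python
import pandas as pd

# Two sample strings in the same layout as the date column; the second is invented.
raw = pd.Series([
    "Sun, 01 Jan 2012 04:59:00 -0000",
    "Fri, 19 Aug 2011 19:28:17 -0000",
])

# %z consumes the numeric offset; utc=True normalizes the result to UTC.
parsed = pd.to_datetime(raw, format="%a, %d %b %Y %H:%M:%S %z", utc=True)

print(parsed.dt.year.tolist())
print(parsed.dt.tz)
```

If some rows deviate from the layout, adding `errors="coerce"` would turn them into `NaT` instead of raising, which makes them easy to inspect afterwards.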

In [9]:
kickstarter_df.project_id = kickstarter_df.project_id.astype(object)

In [10]:
kickstarter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45957 entries, 0 to 45956
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   project_id         45957 non-null  object             
 1   name               45957 non-null  object             
 2   url                45957 non-null  object             
 3   category           45957 non-null  object             
 4   subcategory        45957 non-null  object             
 5   location           44635 non-null  object             
 6   status             45957 non-null  object             
 7   goal               45957 non-null  float64            
 8   pledged            45945 non-null  float64            
 9   funded_percentage  45957 non-null  float64            
 10  backers            45957 non-null  int64              
 11  funded_date        45957 non-null  datetime64[ns, UTC]
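 12  levels             45957 non-null  int64              
 13  reward_levels      45898 non-null  object             
 14  updates            45957 non-null  int64              
 15  comments           45957 non-null  int64              
 16  duration           45957 non-null  float64            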

### c. Determining and Handling Null Values

In [11]:
kickstarter_df.isna().sum()

project_id              0
name                    0
url                     0
category                0
subcategory             0
location             1322
status                  0
goal                    0
pledged                12
funded_percentage       0
backers                 0
funded_date             0
levels                  0
reward_levels          59
updates                 0
comments                0
duration                0
dtype: int64
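
Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up four-row frame (the values are illustrative only) that lists just the columns containing nulls, with both the count and the percentage:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are invented.
df = pd.DataFrame({
    "location": ["Los Angeles, CA", None, "Chicago, IL", None],
    "pledged": [100.0, 250.0, np.nan, 40.0],
    "backers": [3, 10, 0, 2],
})

null_counts = df.isna().sum()
summary = pd.DataFrame({
    "null_count": null_counts,
    "null_pct": (null_counts / len(df) * 100).round(2),
})

# Keep only the columns that actually contain nulls, largest first.
summary = summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)
print(summary)
```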

In [12]:
kickstarter_df.location = kickstarter_df.location.fillna('Not Available')

In [13]:
kickstarter_df.isna().sum()

project_id            0
name                  0
url                   0
category              0
subcategory           0
location              0
status                0
goal                  0
pledged              12
funded_percentage     0
backers               0
funded_date           0
levels                0
reward_levels        59
updates               0
comments              0
duration              0
dtype: int64

In [14]:
kickstarter_df.reward_levels = kickstarter_df.reward_levels.fillna(0)

In [15]:
kickstarter_df.isna().sum()

project_id            0
name                  0
url                   0
category              0
subcategory           0
location              0
status                0
goal                  0
pledged              12
funded_percentage     0
backers               0
funded_date           0
levels                0
reward_levels         0
updates               0
comments              0
duration              0
dtype: int64

In [16]:
null_pledges = kickstarter_df.pledged.isna()
kickstarter_df[null_pledges].url

1187     http://www.kickstarter.com/projects/69341191/x...
4502     http://www.kickstarter.com/projects/twokinds/t...
13381    http://www.kickstarter.com/projects/hickies/hi...
13802    http://www.kickstarter.com/projects/syrp/genie...
25239    http://www.kickstarter.com/projects/b9creation...
29412    http://www.kickstarter.com/projects/madgod/phi...
31164    http://www.kickstarter.com/projects/incident/g...
34274    http://www.kickstarter.com/projects/58936338/s...
35032    http://www.kickstarter.com/projects/257527888/...
40759    http://www.kickstarter.com/projects/spaceventu...
40872    http://www.kickstarter.com/projects/382469225/...
44132    http://www.kickstarter.com/projects/waysidecre...
Name: url, dtype: object

In [17]:
url_list = kickstarter_df[null_pledges].url.tolist()

In [18]:
pledges_list = [154715, 197512, 159167, 636766, 513422, 124156, 353392, 221267, 322022, 539767, 212265, 130746]

In [19]:
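# Pair each URL with its hard-coded total; pledges_list must follow the order of url_list
pledges_dict = dict(zip(url_list, pledges_list))
# Fill only the missing pledges; rows that already have a value stay unchanged
kickstarter_df['pledged'] = kickstarter_df['pledged'].fillna(kickstarter_df['url'].map(pledges_dict))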

In [20]:
kickstarter_df.isna().sum()

project_id           0
name                 0
url                  0
category             0
subcategory          0
location             0
status               0
goal                 0
pledged              0
funded_percentage    0
backers              0
funded_date          0
levels               0
reward_levels        0
updates              0
comments             0
duration             0
dtype: int64
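
The pattern used for `pledged`, filling nulls from a lookup keyed on another column, can be sketched in isolation. The URLs and amounts below are hypothetical placeholders, not values from the dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "pledged": [120.0, np.nan, np.nan],
})

# Hypothetical replacement values keyed by URL.
manual_pledges = {"http://example.com/b": 500.0, "http://example.com/c": 75.0}

# map() builds a Series aligned with df; fillna() only touches the missing entries.
df["pledged"] = df["pledged"].fillna(df["url"].map(manual_pledges))

print(df)
```

Keying the replacements by URL rather than by row position keeps the fill correct even if the frame is later re-sorted or filtered.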

### d. Fixing Column Values

In [21]:
kickstarter_df.category.unique()

array(['Film & Video', 'Games', 'Fashion', 'Music', 'Art', 'Technology',
       'Dance', 'Publishing', 'Theater', 'Comics', 'Design',
       'Photography', 'Food', 'Film &amp; Video'], dtype=object)

In [22]:
kickstarter_df.category = kickstarter_df.category.replace('Film &amp; Video','Film & Video')

In [23]:
kickstarter_df.category.unique()

array(['Film & Video', 'Games', 'Fashion', 'Music', 'Art', 'Technology',
       'Dance', 'Publishing', 'Theater', 'Comics', 'Design',
       'Photography', 'Food'], dtype=object)

In [24]:
kickstarter_df.subcategory.unique()

array(['Short Film', 'Board & Card Games', 'Animation', 'Documentary',
       'Fashion', 'Music', 'Illustration', 'Film &amp; Video',
       'Open Software', 'Indie Rock', 'Dance', 'Fiction', 'Nonfiction',
       'Theater', 'Games', 'Art Book', 'Country & Folk', 'Comics',
       'Webseries', 'Technology', 'Performance Art', 'Narrative Film',
       'Video Games', 'Product Design', 'Rock', 'Painting', 'Photography',
       'Conceptual Art', 'Jazz', 'Open Hardware', 'Classical Music',
       'Food', 'Art', 'Pop', 'Journalism', 'Poetry', 'Electronic Music',
       'World Music', 'Sculpture', 'Publishing', "Children's Book",
       'Public Art', 'Mixed Media', 'Graphic Design', 'Hip-Hop',
       'Periodical', 'Crafts', 'Design', 'Digital Art',
       'Board &amp; Card Games', 'Country &amp; Folk'], dtype=object)

In [25]:
kickstarter_df.subcategory = kickstarter_df.subcategory.replace('Film &amp; Video','Film & Video')
kickstarter_df.subcategory = kickstarter_df.subcategory.replace('Board &amp; Card Games','Board & Card Games')
kickstarter_df.subcategory = kickstarter_df.subcategory.replace('Country &amp; Folk','Country & Folk')

In [26]:
kickstarter_df.subcategory.unique()

array(['Short Film', 'Board & Card Games', 'Animation', 'Documentary',
       'Fashion', 'Music', 'Illustration', 'Film & Video',
       'Open Software', 'Indie Rock', 'Dance', 'Fiction', 'Nonfiction',
       'Theater', 'Games', 'Art Book', 'Country & Folk', 'Comics',
       'Webseries', 'Technology', 'Performance Art', 'Narrative Film',
       'Video Games', 'Product Design', 'Rock', 'Painting', 'Photography',
       'Conceptual Art', 'Jazz', 'Open Hardware', 'Classical Music',
       'Food', 'Art', 'Pop', 'Journalism', 'Poetry', 'Electronic Music',
       'World Music', 'Sculpture', 'Publishing', "Children's Book",
       'Public Art', 'Mixed Media', 'Graphic Design', 'Hip-Hop',
       'Periodical', 'Crafts', 'Design', 'Digital Art'], dtype=object)
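
The values fixed above all contain the HTML entity `&amp;`. As a minimal sketch, assuming leftover HTML entities are the only problem in these labels, the standard library's `html.unescape` can decode every entity in a column in one pass instead of one `replace` call per value:

```python
import html

import pandas as pd

# Sample labels taken from the category and subcategory values shown above.
labels = pd.Series(["Film &amp; Video", "Board &amp; Card Games", "Documentary"])

# html.unescape decodes entities such as &amp; and leaves other text untouched.
cleaned = labels.map(html.unescape)

print(cleaned.tolist())
```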

### e. Saving Clean Kickstarter Data

In [28]:
kickstarter_df.to_csv('kickstarter_clean.csv', index=False)

### f. Creating Boardgames Dataset

In [29]:
boardgames_df = kickstarter_df

In [30]:
boardgames_filter = boardgames_df.subcategory == 'Board & Card Games'

In [31]:
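# .copy() detaches the subset from kickstarter_df, so later edits cannot affect the full dataset
boardgames_df = boardgames_df[boardgames_filter].copy()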

In [32]:
boardgames_df.shape

(553, 17)
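
Filtering with a boolean mask and taking an explicit copy keeps the subset independent of the full dataset. The following is a minimal sketch on a made-up three-row frame; the `funded_ratio` column is a hypothetical addition used only to show that the original frame stays untouched:

```python
import pandas as pd

df = pd.DataFrame({
    "subcategory": ["Board & Card Games", "Documentary", "Board & Card Games"],
    "goal": [1000.0, 5000.0, 2000.0],
    "pledged": [1500.0, 2500.0, 1000.0],
})

# Select the matching rows and copy them into a new, independent frame.
boardgames = df[df["subcategory"] == "Board & Card Games"].copy()

# Adding a column to the copy leaves df unchanged.
boardgames["funded_ratio"] = boardgames["pledged"] / boardgames["goal"]

print(boardgames)
```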

In [33]:
boardgames_df.to_csv('boardgames_clean.csv', index=False)

In [34]:
# to_csv writes UTF-8 by default, so read the file back as UTF-8 rather than latin1
boardgames_df = pd.read_csv('boardgames_clean.csv', encoding='utf-8')
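
Writing a frame to CSV and reading it back drops dtype information, and a mismatched encoding can garble non-ASCII text. The following is a minimal sketch of a round trip that keeps both intact, using a temporary file and made-up rows (the file name and project names are hypothetical):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "name": ["Café Dice", "Meeple Quest"],
    "funded_date": pd.to_datetime(
        ["2012-01-01 04:59:00", "2011-08-19 19:28:17"], utc=True
    ),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "boardgames_demo.csv")
    # to_csv uses UTF-8 unless told otherwise.
    df.to_csv(path, index=False)
    # Read with the same encoding and ask pandas to parse the date column again.
    restored = pd.read_csv(path, encoding="utf-8", parse_dates=["funded_date"])

print(restored.dtypes)
```

Without `parse_dates`, the `funded_date` column would come back as plain strings and would need another `pd.to_datetime` call.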