# Exploration of Kickstarter Data (2010-2017) (Work in progress)

[Mickaël Mouillé](https://www.kaggle.com/kemical) posted this dataset online. 

You can check out the dataset [here](https://www.kaggle.com/kemical/kickstarter-projects).

![kickstarter](https://webby-gallery-production.s3.amazonaws.com/uploads/asset/image/15962/3018000000130981_large.jpg)

## Features

We have 15 initial features:
* ID: internal Kickstarter ID
* name: name of project - A project is a finite work with a clear goal that you’d like to bring to life. Think albums, books, or films.
* category: subcategory of the campaign
* main_category: main category of the campaign
* currency: currency used to support
* deadline: deadline for crowdfunding
* goal: fundraising goal - The funding goal is the amount of money that a creator needs to complete their project.
* launched: date launched
* pledged: amount pledged by the "crowd"
* state: current condition the project is in
* backers: number of backers
* country: country pledged from
* usd pledged: pledged amount in USD (conversion made by Kickstarter)
* usd_pledged_real: pledged amount in USD (conversion made by the fixer.io API)
* usd_goal_real: goal amount in USD (conversion made by the fixer.io API)

## Initial Exploration

In [None]:
# Import dependencies
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load in dataset
KAGGLE_DIR = '../input/'
# There were some issues with the encoding so we manually set it to 'latin1'
df = pd.read_csv(KAGGLE_DIR + 'ks-projects-201801.csv', encoding='latin1', low_memory=False)

In [None]:
display(df.shape)

In [None]:
print('First 5 rows:')
display(df.head())

print('Last 5 rows:')
display(df.tail())

In [None]:
df.info()

## Missing Values

Let's see how many missing values we have in our DataFrame.

In [None]:
percent_missing = (df.isnull().sum() * 100 / len(df)).round(2)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', ascending=False, inplace=True)
missing_value_df

Fortunately, there are not many missing values. Let's see which projects contain them.

In [None]:
display(df[df['usd pledged'].isnull()].shape)
# First 20 rows
df[df['usd pledged'].isnull()].head(20)

We can fill the missing values with the values from 'usd_pledged_real'. This should still give us reliable results.

In [None]:
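# Assign the result back; in-place fillna on a selected column may not update df in recent pandas versions
df['usd pledged'] = df['usd pledged'].fillna(df['usd_pledged_real'])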

It looks like 'usd pledged' and 'usd_pledged_real' are the same. Let's check if this is so.

In [None]:
# Compare the two columns row by row (with a small float tolerance)
same_value = np.isclose(df['usd pledged'], df['usd_pledged_real'])

# Percentage of rows where both columns hold the same amount
print('Rows where USD Pledged equals USD Pledged Real: {} %'.format(round(same_value.mean() * 100, 2)))

So the two features have a lot of overlap, but are not exactly the same. Let's keep them in our dataset for now.

## EDA

### Most popular categories

The 'category' and 'main_category' features are distinct but have a lot of overlap. However, the 'category' feature is much more diverse. Both can be useful for our analysis.

In [None]:
print('Categories in category: ', df['category'].nunique())
df['category'].value_counts()[:20].plot(kind='barh')

In [None]:
print('Categories in main_category: ', df['main_category'].nunique())
df['main_category'].value_counts().plot(kind='barh')
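
The overlap between 'category' and 'main_category' can also be checked directly. The cell below is a minimal sketch on a few made-up rows (the `toy` frame is an assumption for illustration, not part of the dataset); in the notebook you would run the same `groupby` calls on `df`.

```python
import pandas as pd

# Toy rows invented for illustration; replace `toy` with the real `df`
toy = pd.DataFrame({
    'category': ['Poetry', 'Rock', 'Jazz', 'Tabletop Games', 'Video Games', 'Rock'],
    'main_category': ['Publishing', 'Music', 'Music', 'Games', 'Games', 'Music'],
})

# Number of distinct subcategories under each main category
subcats_per_main = toy.groupby('main_category')['category'].nunique()
print(subcats_per_main.sort_values(ascending=False))

# Number of main categories each subcategory appears under
# (a value of 1 everywhere would mean a clean two-level hierarchy)
mains_per_subcat = toy.groupby('category')['main_category'].nunique()
n_ambiguous = int((mains_per_subcat > 1).sum())
print('Subcategories under more than one main category:', n_ambiguous)
```

If `n_ambiguous` turns out to be zero on the real data, 'category' is simply a finer-grained version of 'main_category'.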

### Countries

![](https://www.nationsonline.org/gallery/Flags/Flags-of-the-World.jpg)

Most Kickstarter campaigns in our dataset come from the United States or Great Britain.

In [None]:
print('Number of unique countries: ', df['country'].nunique())
df['country'].value_counts()[:10].plot(kind='barh')

### Money!

In [None]:
# Use the USD-converted goal so that all reported amounts are in the same currency
print('The average Kickstarter campaign has {} USD pledged, {} backers and a goal of {} USD.'.format(round(df['usd_pledged_real'].mean(), 2),
                                                                                               int(df['backers'].mean()),
                                                                                               round(df['usd_goal_real'].mean(), 2)))

In [None]:
# Compare goal and pledges in the same currency (USD)
df['discrepancy'] = df['usd_goal_real'] - df['usd_pledged_real']
df['target_reached'] = df['discrepancy'] <= 0

target_reached = df[df['target_reached']]
target_not_reached = df[~df['target_reached']]
target_reached_perc = round(len(target_reached) / len(df) * 100, 2)
target_not_reached_perc = round(len(target_not_reached) / len(df) * 100, 2)

print('Out of {} Kickstarter campaigns:\n\n{} % reached their target.'.format(len(df), 
                                                                            target_reached_perc))
# A negative discrepancy means the pledges exceeded the goal, so flip the sign
print('For the {} campaigns that reached their target,\n\
there was on average {} USD pledged more than the target.\n'.format(len(target_reached), 
                                                                  round(-target_reached['discrepancy'].mean(), 2)))

# A positive discrepancy is the amount by which the pledges fell short
print('{} % of the campaigns did not reach their target.\n\
For the {} campaigns that did not reach their target,\n\
there was on average {} USD pledged less than the target.'.format(target_not_reached_perc, 
                                                                  len(target_not_reached), 
                                                                  round(target_not_reached['discrepancy'].mean(), 2)))

In [None]:
df['currency'].value_counts().plot(kind='barh')

Unsurprisingly, the currencies used closely track the countries the campaigns come from.
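
One way to make the link between countries and currencies concrete is a cross-tabulation. The sketch below uses a handful of invented rows (the `toy` frame is an assumption for illustration); on the real data, the same `pd.crosstab` call would be applied to `df['country']` and `df['currency']`.

```python
import pandas as pd

# Toy rows invented for illustration; replace `toy` with the real `df`
toy = pd.DataFrame({
    'country': ['US', 'US', 'GB', 'GB', 'CA', 'DE', 'FR'],
    'currency': ['USD', 'USD', 'GBP', 'GBP', 'CAD', 'EUR', 'EUR'],
})

# Rows are countries, columns are currencies, cells are campaign counts
country_currency = pd.crosstab(toy['country'], toy['currency'])
print(country_currency)

# Share of each country's campaigns that use its most common currency
dominant_share = country_currency.max(axis=1) / country_currency.sum(axis=1)
print(dominant_share)
```

A `dominant_share` close to 1 for every country would mean that knowing the country almost fully determines the currency.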

### States

In [None]:
df['state'].value_counts().plot(kind='barh')

perc_successful = len(df[df['state'] == 'successful']) / len(df) * 100
perc_failed = len(df[df['state'] == 'failed']) / len(df) * 100
perc_canceled = len(df[df['state'] == 'canceled']) / len(df) * 100
perc_other = 100 - (perc_successful + perc_failed + perc_canceled)

print('{} % of campaigns were successful\n\
{} % of campaigns failed\n\
{} % of campaigns were canceled\n\
{} % of campaigns belong to other categories'.format(round(perc_successful, 2), 
                                                     round(perc_failed, 2), 
                                                     round(perc_canceled, 2), 
                                                     round(perc_other, 2)))

Note: There is a slight difference between campaigns that reached their target and what the data calls 'successful' campaigns. For example, a campaign can have reached its target but still be live. A campaign can also fall short of its target and be in the 'canceled' state.
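
The gap between 'state' and reaching the target can be inspected with a cross-tabulation. This is a minimal sketch on invented rows (the `toy` frame and the `target` column name are assumptions for illustration); on the real data, the same comparison would be applied to `df`.

```python
import pandas as pd

# Toy rows invented for illustration; replace `toy` with the real `df`
toy = pd.DataFrame({
    'state': ['successful', 'failed', 'live', 'canceled', 'successful', 'failed'],
    'usd_goal_real': [1000.0, 5000.0, 2000.0, 3000.0, 500.0, 10000.0],
    'usd_pledged_real': [1500.0, 200.0, 2500.0, 100.0, 500.0, 9000.0],
})

# Label each campaign by whether its pledges met the goal (both in USD)
reached = toy['usd_pledged_real'] >= toy['usd_goal_real']
toy['target'] = reached.map({True: 'reached', False: 'not reached'})

# Rows are campaign states, columns say whether the target was reached
state_vs_target = pd.crosstab(toy['state'], toy['target'])
print(state_vs_target)
```

Cells such as ('live', 'reached') or ('canceled', 'not reached') are the cases where the two notions of success disagree.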

# Work in progress

In [None]:
##### Ideas for this Kernel ######
# Correlations
# Top Kickstarter campaigns (What do they have in common)
# Categories that raise the most money
# Outlier analysis
# Preparation for machine learning (one-hot encoding)
# Predicting usd_pledged
# Feature importance
# Tree interpreter
# Partial Dependence
# Extrapolation
# Confidence based on tree variance
###################################
