# Kickstarter Deduplicate Data

* **Data Source**: https://webrobots.io/kickstarter-datasets/

**NOTE 1**: We need to ensure that the variables we incorporate into the model do not cause data leakage. For example, we would need to leave out the staff pick variable, since staff are potentially picking campaigns that they believe are going to succeed.
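As a minimal sketch of the leakage point in NOTE 1, columns that could leak the outcome can be dropped before modelling. The toy frame, the `LEAKY_COLUMNS` list, and the `drop_leaky_columns` helper are hypothetical names used for illustration; only `staff_pick` is named in the note itself.

```python
import pandas as pd

# Hypothetical exclusion list; only 'staff_pick' comes from NOTE 1.
LEAKY_COLUMNS = ['staff_pick']

def drop_leaky_columns(df, leaky=LEAKY_COLUMNS):
    """Return a copy of df without the columns in `leaky` (absent ones are ignored)."""
    return df.drop(columns=[c for c in leaky if c in df.columns])

# Toy data standing in for the real dataset (made-up values)
toy = pd.DataFrame({
    'goal': [1000, 5000],
    'staff_pick': [True, False],
    'state': ['successful', 'failed'],
})

features = drop_leaky_columns(toy)
print(list(features.columns))
```

Keeping the exclusion list in one place makes it easy to extend if other columns turn out to leak the outcome.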

**NOTE 2**: There is a data dictionary for the Kickstarter dataset in the references folder.

In [2]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import glob
import functools

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

In [3]:
df_csv = pd.read_csv('../../data/02_intermediate/kickstarter_concat.csv')



There are over 7 million rows in the original dataset. According to the website that the data was pulled from, many of the campaigns are duplicated across the dataset, so we need to find a way to deduplicate it.

In [29]:
print('Number of Rows in the Original Dataset',len(df_csv))

7769058

In [5]:
print('Number of Unique Blurbs: ',df_csv.blurb.nunique())

331113

In [9]:
print('Number of Unique IDs',df_csv.id.nunique())

328099

In [13]:
print('Number of unique Names: ',df_csv.name.nunique())

332898

**Strangely, there are more unique names in the dataset than unique IDs. Do some campaign IDs have multiple names? Let's find out!**

In [None]:
df_csv.groupby(['id', 'name']).size()

**Yes, as we can see, a number of campaigns (each campaign has a unique ID) have several names attached to them. Let's deduplicate the dataset by unique name instead.**
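One way to make this check explicit is to count the distinct names per ID and keep only the IDs with more than one. The sketch below runs on a made-up toy frame rather than `df_csv`; the variable names `names_per_id` and `multi_name_ids` are illustrative.

```python
import pandas as pd

# Toy data: id 1 appears under two different names (made-up values)
toy = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'name': ['Alpha', 'Alpha v2', 'Beta', 'Gamma', 'Gamma'],
})

# Number of distinct names attached to each campaign ID
names_per_id = toy.groupby('id')['name'].nunique()

# Keep only the IDs that have more than one name
multi_name_ids = names_per_id[names_per_id > 1]
print(multi_name_ids)
```

The same two lines should work on the full dataset by swapping `toy` for `df_csv`, assuming the `id` and `name` columns are as listed in the column index.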

In [4]:
df_csv.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'friends', 'fx_rate', 'goal', 'id',
       'is_backing', 'is_starrable', 'is_starred', 'last_update_published_at',
       'launched_at', 'location', 'name', 'permissions', 'pledged', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'unread_messages_count', 'unseen_activity_count',
       'urls', 'usd_pledged', 'usd_type'],
      dtype='object')

To deduplicate this dataset, we first need to sort it so that we keep the most recent appearance of each campaign. We accomplish this by sorting by the `state_changed_at` column in descending order. The state changes when a campaign fails, succeeds, is cancelled, or is suspended. Since we are modelling whether a campaign succeeds or fails, we need as many campaigns as possible that have reached one of those states.

In [30]:
df_csv.sort_values(by='state_changed_at', ascending=False, inplace=True)

In [31]:
# drop=True discards the old index instead of adding it as an extra 'index' column
df_csv.reset_index(drop=True, inplace=True)

In [32]:
df_csv.drop_duplicates(subset='name', keep='first', inplace=True)

In [33]:
print('Length of Deduplicated Dataset: ',len(df_csv))

332899
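The sort-then-drop pattern used above can be sanity-checked on a small frame. This is a sketch on made-up data: the integer `state_changed_at` values are placeholders, and `toy` and `latest` are illustrative names.

```python
import pandas as pd

# Toy data: 'Alpha' appears twice, once while live and once after succeeding
toy = pd.DataFrame({
    'name': ['Alpha', 'Alpha', 'Beta'],
    'state': ['live', 'successful', 'failed'],
    'state_changed_at': [100, 200, 150],  # made-up timestamps
})

latest = (
    toy.sort_values(by='state_changed_at', ascending=False)  # newest state change first
       .drop_duplicates(subset='name', keep='first')         # keep the newest row per name
       .reset_index(drop=True)
)
print(latest)
```

Because the frame is sorted newest-first, `keep='first'` retains the row with the latest state change for each name, so 'Alpha' is kept in its `successful` state rather than `live`.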

In [35]:
print('Number of unique IDs in the deduped dataset: ',df_csv.id.nunique())

326266

Our updated dataset contains 276,734 campaigns that have either succeeded or failed (147,268 successful plus 129,466 failed). This is the number of data points that we will have in our final model.

In [36]:
df_csv.state.value_counts()

successful    147268
failed        129466
live           38382
canceled       16728
suspended       1055
Name: state, dtype: int64
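A minimal sketch of how the modelling subset could be selected from these states, shown on a made-up toy frame. The `FINAL_STATES` list and the `is_successful` target column are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data covering finished and unfinished campaigns (made-up values)
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'state': ['successful', 'failed', 'live', 'canceled'],
})

# Keep only campaigns that reached a final success/failure outcome
FINAL_STATES = ['successful', 'failed']
modelling = toy[toy['state'].isin(FINAL_STATES)].copy()

# Hypothetical binary target: 1 = successful, 0 = failed
modelling['is_successful'] = (modelling['state'] == 'successful').astype(int)
print(len(modelling))
```

Live, canceled, and suspended campaigns are excluded here because they have no success-or-failure label to learn from.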

In [34]:
df_csv.to_csv('../../data/02_intermediate/kick_deduped.csv', index=False)