# Walkthrough

First step: Run the combin_csv.py script on a data folder containing ONLY the kickstarter000 to kickstarter055 csv-files! This gives you the dataset to import for this notebook.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import classification_report, confusion_matrix

In [3]:
# load the combined dataset
df = pd.read_csv("data/combined_csv.csv")

# check the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 37 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             209222 non-null  int64  
 1   blurb                     209214 non-null  object 
 2   category                  209222 non-null  object 
 3   converted_pledged_amount  209222 non-null  int64  
 4   country                   209222 non-null  object 
 5   created_at                209222 non-null  int64  
 6   creator                   209222 non-null  object 
 7   currency                  209222 non-null  object 
 8   currency_symbol           209222 non-null  object 
 9   currency_trailing_code    209222 non-null  bool   
 10  current_currency          209222 non-null  object 
 11  deadline                  209222 non-null  int64  
 12  disable_communication     209222 non-null  bool   
 13  friends                   300 non-null     o

## list of features we keep

 -  blurb                     short description 
 -  category                  Kickstarter categories  
 -  country                   country 
 -  deadline                  deadline date/time?
 -  fx_rate                   currency conversion rate 
 -  goal                      fixed amount required for funding (convert with fx_rate)
 -  launched_at               launch date/time?  
 -  location                  location
 -  name                      project name 
 -  state                     !!!!!target!!!!! 

 <br>

features to create:
- goal_usd                    goal*fx_rate
- name_len                    number of characters in name
- blurb_len                   number of characters in blurb
- time_online                 deadline - launched at
- launch_weekday              day of the week of the launch
- launch_time                 time of day of the launch

## stakeholder

* who? - people/creators who are considering launching a project on Kickstarter
* why? - to find out if it's worth investing the time/money in creating materials/launching a project and which criteria to consider in order to make it successful
* metric? - f_beta (probably imbalanced data)
* model: classifier (binary)

In [4]:
# keep only columns we will be using

df = df[['blurb', 'category', 'country', 'deadline',
        'fx_rate', 'goal', 'launched_at', 'location',
       'name', 'state']]

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   blurb        209214 non-null  object 
 1   category     209222 non-null  object 
 2   country      209222 non-null  object 
 3   deadline     209222 non-null  int64  
 4   fx_rate      209222 non-null  float64
 5   goal         209222 non-null  float64
 6   launched_at  209222 non-null  int64  
 7   location     208996 non-null  object 
 8   name         209222 non-null  object 
 9   state        209222 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 16.0+ MB


## Create the new features

### Convert timestamps & add time delta features

In [8]:
launched = []
deadline = []

In [9]:
# launched_at
for label, content in df['launched_at'].iteritems():
    launched.append(datetime.datetime.fromtimestamp(content))

In [10]:
# deadline
for label, content in df['deadline'].iteritems():
    deadline.append(datetime.datetime.fromtimestamp(content))

In [11]:
for i in range(len(df)):
    df['launched_at'].iloc[[i]] = launched[i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [12]:
for i in range(len(df)):
    df['deadline'].iloc[[i]] = deadline[i]

In [13]:
df['delta_dead_laun'] = (df['deadline'] - df['launched_at']).astype('timedelta64[h]')

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   blurb            209214 non-null  object 
 1   country          209222 non-null  object 
 2   deadline         209222 non-null  object 
 3   fx_rate          209222 non-null  float64
 4   goal             209222 non-null  float64
 5   launched_at      209222 non-null  object 
 6   location         208996 non-null  object 
 7   name             209222 non-null  object 
 8   state            209222 non-null  object 
 9   category         209222 non-null  object 
 10  delta_dead_laun  209222 non-null  float64
dtypes: float64(3), object(8)
memory usage: 17.6+ MB


### change categories to main categories

In [15]:
cats = df['category']

In [16]:
cats = cats.str.strip('"slug')

In [17]:
cats = cats.str.strip(':"')

In [18]:
cats

0               fashion/footwear
1             games/playing card
2                     music/rock
3             games/playing card
4          publishing/nonfiction
                   ...          
209217       games/tabletop game
209218    music/electronic music
209219       technology/hardware
209220      film & video/festiva
209221                journalism
Name: category, Length: 209222, dtype: object

In [19]:
cats = cats.str.split("/").str[0]

In [20]:
cats.unique()

array(['fashion', 'games', 'music', 'publishing', 'theater', 'food',
       'art', 'photography', 'technology', 'dance', 'design',
       'film & video', 'crafts', 'comics', 'comic', 'craft', 'journalism',
       'publishin', 'game'], dtype=object)

In [22]:
# add new categories to dataframe

df['category'] = cats
df['category'].unique()

array(['fashion', 'games', 'music', 'publishing', 'theater', 'food',
       'art', 'photography', 'technology', 'dance', 'design',
       'film & video', 'crafts', 'comics', 'comic', 'craft', 'journalism',
       'publishin', 'game'], dtype=object)

### Include weekday and hour of launch

In [23]:
launched_day = pd.to_datetime(df['launched_at']).dt.day_of_week

In [24]:
launched_hour = pd.to_datetime(df['launched_at']).dt.hour

In [26]:
df['launch_day'] = launched_day
df['launch_hour'] = launched_hour
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   blurb            209214 non-null  object 
 1   country          209222 non-null  object 
 2   deadline         209222 non-null  object 
 3   fx_rate          209222 non-null  float64
 4   goal             209222 non-null  float64
 5   launched_at      209222 non-null  object 
 6   location         208996 non-null  object 
 7   name             209222 non-null  object 
 8   state            209222 non-null  object 
 9   category         209222 non-null  object 
 10  delta_dead_laun  209222 non-null  float64
 11  launch_day       209222 non-null  int64  
 12  launch_hour      209222 non-null  int64  
dtypes: float64(3), int64(2), object(8)
memory usage: 20.8+ MB


Unnamed: 0,blurb,country,deadline,fx_rate,goal,launched_at,location,name,state,category,delta_dead_laun,launch_day,launch_hour
0,Babalus Shoes,US,2019-03-14 06:02:55,1.0,28000.0,2019-01-23 07:02:55,"{""id"":2462429,""name"":""Novato"",""slug"":""novato-c...",Babalus Children's Shoes,live,fashion,1199.0,2,7
1,A colorful Dia de los Muertos themed oracle de...,US,2017-09-09 19:00:59,1.0,1000.0,2017-08-10 19:00:59,"{""id"":2400549,""name"":""Euless"",""slug"":""euless-t...",The Ofrenda Oracle Deck,successful,games,720.0,3,19
2,"Electra's long awaited, eclectic Debut Pop/Roc...",US,2013-06-12 07:03:15,1.0,15000.0,2013-05-13 07:03:15,"{""id"":2423474,""name"":""Hollywood"",""slug"":""holly...","Record Electra's Debut Album (Pop, Rock, Class...",successful,music,720.0,0,7
3,The Mist of Tribunal is a turn-based card game...,GB,2017-03-13 18:22:56,1.308394,10000.0,2017-01-12 19:22:56,"{""id"":475457,""name"":""Kaunas"",""slug"":""kaunas-ka...",The Mist of Tribunal - A Card Game,failed,games,1439.0,3,19
4,"Livng with a brain impairment, what its like t...",US,2013-01-09 21:32:07,1.0,2800.0,2012-12-10 21:32:07,"{""id"":2507703,""name"":""Traverse City"",""slug"":""t...",Help change the face of Brain Impairment,successful,publishing,720.0,0,21


### blurb length and name length

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   blurb            209214 non-null  object 
 1   country          209222 non-null  object 
 2   deadline         209222 non-null  object 
 3   fx_rate          209222 non-null  float64
 4   goal             209222 non-null  float64
 5   launched_at      209222 non-null  object 
 6   location         208996 non-null  object 
 7   name             209222 non-null  object 
 8   state            209222 non-null  object 
 9   category         209222 non-null  object 
 10  delta_dead_laun  209222 non-null  float64
 11  launch_day       209222 non-null  int64  
 12  launch_hour      209222 non-null  int64  
 13  name_len         209222 non-null  int64  
dtypes: float64(3), int64(3), object(8)
memory usage: 22.3+ MB


In [None]:
# drop missing values for blurb


In [27]:
# calculate length of name
name_len = []
for label, content in df['name'].iteritems():
    name_len.append(len(content))

df['name_len'] = name_len


# calculate length of blurb
blurb_len = []
for label, content in df['blurb'].iteritems():
    blurb_len.append(len(content))

df['blurb_len'] = blurb_len

TypeError: object of type 'float' has no len()

### Include only projects that were successful or failed

In [None]:
# convert 'state' to numerical
# successful: 1
# failed: 0
# drop: live, suspended, canceled

df = df.query('state != "live"')
df = df.query('state != "suspended"')
df = df.query('state != "canceled"')
print(df['state'].unique()) # check that 'state' only contains failed and successful

df['state'].replace({'failed':0, 'successful':1}, inplace=True)
print(df['state'].unique()) # check that 'state' only contains 1 and 0

In [None]:
# plot frequency of success and failure

sns.countplot(x='state', data=df)