# Data cleaning

In this notebook we clean the DataFrame loaded from 'Kickstarter_merged.csv'. We create new columns that we consider important and drop the columns that are not needed.

#### Overview of new columns
    * column 'blurb' will be replaced with -> 'blurb_len_w'
    * column 'slug' will be replaced with -> 'slug_len_w'
    * column 'category' will be replaced with -> 'parent_name'
    * column 'launched_at' will be replaced with -> 'launched_month'
    * new column: duration of the crowdfunding campaign ('duration_days')
    * new column: preparation time, from 'created_at' until 'launched_at' ('preparation')
    * column 'state_changed_at' will be replaced with -> 'state_changed_year', 'state_changed_month'
    * new column: pledged per backer ('pledged_per_backer')
    * column 'goal' will be converted to USD

#### Overview dropped columns
    * converted_pledged_amount
    * creator
    * currency 
    * currency_symbol
    * currency_trailing_code
    * current_currency
    * disable_communication
    * friends
    * fx_rate
    * id after using it for other transformations
    * is_backing
    * is_starrable
    * is_starred
    * location
    * name 
    * permissions
    * photo
    * pledged
    * profile 
    * slug
    * source_url
    * spotlight
    * state_changed_at 
    * static_usd_rate 
    * urls
    * usd_type
   
    
    
#### Overview dropped rows
    * 8 rows with missing values in column 'blurb'
    * duplicate rows (same 'id'; the last occurrence is kept)
    * rows with the values 'suspended', 'live' or 'canceled' in column 'state'
    * rows with a 'goal' of 0


In [1]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import time
import datetime as dt
import json

Read in file.

In [2]:
df = pd.read_csv('data/Kickstarter_merged.csv', index_col=0)

## Create new columns
    
Let's start by creating new columns. We will replace the columns 'blurb', 'slug', 'category', 'created_at', 'deadline', 'launched_at' and 'state_changed_at'.

### Length of Blurb in words:

In [3]:
def string_len_w(string):
    '''Returns the length of a string (number of words, separated by whitespace).'''
    string_str = str(string)
    string_list = string_str.split()
    string_len = len(string_list)
    return string_len

In [4]:
def add_blurb_len_w(df):
    '''Adds a column that contains the length of the blurb (in words) and returns the updated dataframe'''
    # apply the word counter to the 'blurb' column only instead of to whole rows
    df['blurb_len_w'] = df['blurb'].apply(string_len_w)
    return df

In [7]:
df = add_blurb_len_w(df)
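
The word count can also be computed with pandas' vectorized string methods. The following is a minimal sketch on a toy frame (the sample texts are made up). Note that it is not a drop-in replacement: a missing blurb counts as 0 words here, whereas `string_len_w` returns 1, because `str(NaN)` is the one-word string `'nan'`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the 'blurb' column (values are illustrative only)
toy = pd.DataFrame({"blurb": ["A new board game", "Help us record our album!", np.nan]})

# Split on whitespace and count the tokens; missing blurbs become 0 words
toy["blurb_len_w"] = toy["blurb"].str.split().str.len().fillna(0).astype(int)
```

Since the rows with missing blurbs are dropped later anyway, the difference only matters if the missing rows are kept.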

### Length of Slug in words:

In [10]:
def string_len_slug_w(string):
    '''Returns the length of a string (number of words, separated by "-").'''
    string_str = str(string)
    string_list = string_str.split("-")
    string_len = len(string_list)
    return string_len

In [11]:
def add_slug_len_w(df):
    '''Adds a column that contains the length of the slug (in words) and returns the updated dataframe'''
    # apply the word counter to the 'slug' column only instead of to whole rows
    df['slug_len_w'] = df['slug'].apply(string_len_slug_w)
    return df

In [13]:
df = add_slug_len_w(df)

0         3
1         4
2         7
3         7
4         7
         ..
209217    7
209218    4
209219    9
209220    5
209221    5
Name: slug_len_w, Length: 209222, dtype: int64
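
Because the words of a slug are separated by hyphens, the same count can be obtained by counting hyphens and adding one. A small sketch with made-up slugs, assuming no slug is missing:

```python
import pandas as pd

# Toy slugs (illustrative only)
toy = pd.DataFrame({"slug": ["my-first-game", "album", "a-b-c-d"]})

# Number of words = number of "-" separators + 1
toy["slug_len_w"] = toy["slug"].str.count("-") + 1
```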

### Column category aka parent_name

In [24]:
def add_parent_id(df):
    '''Extracts the parent ID out of the category JSON and adds it as a column. Returns the updated dataframe'''
    # .get() returns None for top-level categories, which have no parent_id
    df['category_parent_id'] = df['category'].apply(lambda c: json.loads(c).get('parent_id'))
    return df

In [15]:
def add_category_id(df):
    '''Extracts the category ID out of the category JSON and adds it as a column. Returns the updated dataframe'''
    df['category_id'] = df['category'].apply(lambda c: json.loads(c).get('id'))
    return df

In [16]:
def add_category_name(df):
    '''Extracts the category name out of the category JSON and adds it as a column. Returns the updated dataframe'''
    df['category_name'] = df['category'].apply(lambda c: json.loads(c).get('name'))
    return df

In [18]:
def fill_na(df, column_name):
    '''Fills missing values with 0 and converts the column to integer, e.g. for the parent ID. Returns the updated dataframe.'''
    df[column_name] = df[column_name].fillna(0).astype("int")
    return df

In [20]:
def helper_list(df):
    '''Returns a list with the parent ID of each row; where the parent ID
    is missing (stored as 0), the category ID is used instead.'''
    filled = []
    for i in range(df.shape[0]):
        if df["category_parent_id"][i] != 0:
            filled.append(df["category_parent_id"][i])
        else:
            filled.append(df["category_id"][i])
    return filled

In [21]:
def add_list_as_column(df, column_name, list_name):
    '''Adds the helper list as a column to the dataframe and returns the updated dataframe'''
    df[column_name] = pd.DataFrame(list_name)
    return df

In [34]:
def add_parent_name(df, column_name1, column_name2, dictionary):
    '''Looks up each key in column_name2 in the dictionary, adds the values as
    column column_name1 and returns the updated dataframe.
    Example:
        parents_dict = {1: "Art", 3: "Comics", 6: "Dance", 7: "Design", 9: "Fashion", 10: "Food",
                11: "Film & Video", 12: "Games", 13: "Journalism", 14: "Music", 15: "Photography", 16: "Technology",
               17: "Theater", 18: "Publishing", 26: "Crafts"}
        df["parent_name"] = df["filled_parent"].apply(lambda x: parents_dict.get(x))'''
    df[column_name1] = df[column_name2].apply(lambda x: dictionary.get(x))
    return df

In [28]:
df = add_parent_id(df)
df = add_category_id(df)
df = add_category_name(df)
df = fill_na(df, 'category_parent_id')

In [29]:
empty = []
for i in range(df.shape[0]):
    if df["category_parent_id"][i] != 0:
        empty.append(df["category_parent_id"][i])
    else:
        empty.append(df["category_id"][i])

In [32]:
df = add_list_as_column(df, "filled_parent", empty)

In [35]:
df = add_parent_name(df, "parent_name", "filled_parent", {1: "Art", 3: "Comics", 6: "Dance", 7: "Design", 9: "Fashion", 10: "Food",
                11: "Film & Video", 12: "Games", 13: "Journalism", 14: "Music", 15: "Photography", 16: "Technology",
               17: "Theater", 18: "Publishing", 26: "Crafts"})

In [36]:
df["parent_name"]

0              Fashion
1                Games
2                Music
3                Games
4           Publishing
              ...     
209217           Games
209218           Music
209219      Technology
209220    Film & Video
209221      Journalism
Name: parent_name, Length: 209222, dtype: object
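
The whole category step (parse the JSON, fall back to the category's own ID when there is no parent, look up the name) can be sketched more compactly. This is an illustration on toy data, not the code that produced the output above: the JSON strings are hypothetical examples of the 'category' format and `parents_dict` is only an excerpt of the mapping.

```python
import json
import pandas as pd

# Hypothetical category strings mimicking the JSON stored in 'category'
toy = pd.DataFrame({"category": [
    '{"id": 12, "name": "Games"}',
    '{"id": 34, "name": "Tabletop Games", "parent_id": 12}',
    '{"id": 43, "name": "Rock", "parent_id": 14}',
]})

parents_dict = {12: "Games", 14: "Music"}  # excerpt of the full mapping

# Parse each JSON string only once
parsed = toy["category"].apply(json.loads)

# Missing parent IDs become NaN because of the float dtype
parent_id = pd.Series([c.get("parent_id") for c in parsed],
                      index=toy.index, dtype="float64")
category_id = parsed.apply(lambda c: c["id"])

# Use the parent ID where present, otherwise the category's own ID
toy["filled_parent"] = parent_id.fillna(category_id).astype(int)
toy["parent_name"] = toy["filled_parent"].map(parents_dict)
```

Parsing once and using `fillna` avoids the explicit loop over row positions, so it also works when the index is not 0, 1, 2, ...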

### Month launched

In [37]:
# function to extract the month (1-12) out of a Unix timestamp in seconds
def extract_month(number):
    '''Extracts the month out of the number and returns the month'''
    gmtime = time.gmtime(number)
    return gmtime[1]

In [38]:
# Adding column with month the project was launched
def adding_month_launched(df):  
    '''Adding column with month the project was launched and returns the updated dataframe'''
    df["launched_month"] = df.apply(lambda x: extract_month(x["launched_at"]), axis=1)
    return df

In [40]:
df = adding_month_launched(df)
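
As an alternative to `time.gmtime`, pandas can convert the whole column of Unix timestamps at once. A minimal sketch, assuming 'launched_at' holds seconds since the epoch in UTC (the two timestamps are example values):

```python
import pandas as pd

# 2015-01-01 and 2015-08-01, both 00:00:00 UTC
toy = pd.DataFrame({"launched_at": [1420070400, 1438387200]})

# Interpret the integers as seconds since the epoch and take the month
toy["launched_month"] = pd.to_datetime(toy["launched_at"], unit="s").dt.month
```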

### Duration

In [42]:
def duration(deadline, launched_at):
    '''Calculating difference between two timepoints and returns it in days'''
    duration = deadline - launched_at
    duration_complete = dt.timedelta(seconds=duration)
    return duration_complete.days

In [43]:
# Adding column with duration in days
def adding_duration(df):
    '''Adding column with duration in days and returns updated dataframe'''
    df["duration_days"] = df.apply(lambda x: duration(x["deadline"], x["launched_at"]), axis=1)
    return df

In [44]:
df = adding_duration(df)
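
Since both columns are plain numbers of seconds, the duration in whole days can also be obtained with integer division. A sketch on toy timestamps (made up for illustration); like `timedelta.days`, it discards partial days:

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

start = 1420070400  # example launch time (2015-01-01 UTC)
toy = pd.DataFrame({
    "launched_at": [start, start],
    # 30 days later, and 59 days plus one hour later
    "deadline": [start + 30 * SECONDS_PER_DAY, start + 59 * SECONDS_PER_DAY + 3600],
})

# Whole days between launch and deadline; the extra hour is dropped
toy["duration_days"] = (toy["deadline"] - toy["launched_at"]) // SECONDS_PER_DAY
```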

### Preparation time

In [50]:
def adding_preparation(df):
    '''Adding column with preparation in days and returns updated dataframe'''
    df["preparation"] = df.apply(lambda x: duration(x["launched_at"], x["created_at"]), axis=1)
    return df

In [51]:
df = adding_preparation(df)

### pledged/backer as "Reward Amount"

In [54]:
def adding_pledged_per_backer(df):
    '''Adding column that is the averaged amount pledged per backer, returns updated dataframe'''
    df['pledged_per_backer'] = (df['usd_pledged'] / df['backers_count']).round(2)
    return df

In [55]:
df = adding_pledged_per_backer(df)
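
Projects without backers need some care in this division: pandas does not raise an error but returns `NaN` for 0/0 and `inf` for a positive amount divided by 0. The sketch below (toy numbers) shows this behaviour and one possible way to turn infinite values into missing values; whether such rows exist in our data is an assumption to check.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"usd_pledged": [100.0, 0.0, 25.0],
                    "backers_count": [4, 0, 0]})

# Row 0 is a regular ratio, row 1 is 0/0 (NaN), row 2 is 25/0 (inf)
ratio = (toy["usd_pledged"] / toy["backers_count"]).round(2)

# Treat infinite ratios as missing as well
toy["pledged_per_backer"] = ratio.replace([np.inf, -np.inf], np.nan)
```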

### Converting Goal to USD

In [57]:
def usd_convert_goal(df, column_name, exchange_rate): 
    '''Converts a column based on the exchange rate in another column, rounds it to two decimal places
    and returns the updated dataframe, e.g.
    df['goal'] = (df['goal'] * df['static_usd_rate']).round(2)'''
    df[column_name] = (df[column_name] * df[exchange_rate]).round(2)
    return df

In [58]:
df = usd_convert_goal(df, 'goal', 'static_usd_rate')

In [59]:
df['goal']

0         28000.00
1          1000.00
2         15000.00
3         12160.66
4          2800.00
            ...   
209217     1500.00
209218     5466.50
209219     2500.00
209220     5500.00
209221     1000.00
Name: goal, Length: 209222, dtype: float64
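
One thing to keep in mind is that the conversion overwrites 'goal' in place, so running the cell twice would apply the exchange rate twice. A sketch of a variant that writes to a separate column instead; the column name `goal_usd` and the rates are made up for illustration:

```python
import pandas as pd

# Toy goals in the original currency and illustrative exchange rates
toy = pd.DataFrame({"goal": [1000.0, 10000.0],
                    "static_usd_rate": [1.0, 1.216066]})

# Writing to a new column keeps the original goal untouched,
# so the cell can be re-run without converting twice
toy["goal_usd"] = (toy["goal"] * toy["static_usd_rate"]).round(2)
```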

## Drop rows

In [60]:
def drop_rows_missings(df, column_name):
    '''Drops rows with missing values in a column, e.g. blurb. Returns the updated dataframe.'''
    df.dropna(subset = [column_name], inplace=True)
    return df

In [63]:
def drop_duplicates(df, column_name):
    '''Drops all duplicates based on column_name (e.g. 'id'), keeps the last
    ("newest") duplicate and returns the updated dataframe'''
    # use the column passed in instead of a hard-coded 'id'
    df = df.drop_duplicates(subset=[column_name], keep='last')
    return df

In [68]:
def drop_rows_value(df, column_name, value):
    '''Drops rows with a certain value in a column (e.g. 'suspended' or 'live' in column 'state') and returns the updated dataframe'''
    df = df.drop(df[df[column_name] == value ].index)
    return df

In [62]:
# drop 8 rows with missing values in column 'blurb'
df = drop_rows_missings(df, 'blurb')

In [69]:
# creating dataframe and dropping all duplicates and keep the last ("newest") duplicate 
df = drop_duplicates(df, 'id')

In [70]:
df = drop_rows_value(df, 'state', 'suspended')

In [71]:
df = drop_rows_value(df, 'state', 'live')

In [72]:
df = drop_rows_value(df, 'state', 'canceled')

In [73]:
df = drop_rows_value(df, 'goal', 0)
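
The four calls above can be expressed as a single boolean mask. A minimal sketch on toy data (states and goals are made up), shown as an alternative rather than the code used here:

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["successful", "failed", "live", "canceled", "suspended", "failed"],
    "goal": [500.0, 0.0, 100.0, 200.0, 300.0, 1000.0],
})

unwanted_states = ["suspended", "live", "canceled"]

# Keep rows whose state is not unwanted and whose goal is not 0
mask = ~toy["state"].isin(unwanted_states) & (toy["goal"] != 0)
cleaned = toy[mask]
```

Only the rows with index 0 and 5 remain in `cleaned`; the original index labels are kept, just as with `drop`.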

## Drop Columns

In [76]:
def drop_columns(df, list_columns):
    '''Drops the columns in the list and returns the updated dataframe'''
    df.drop(list_columns, axis=1, inplace=True)
    return df

In [74]:
df.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'friends', 'fx_rate', 'goal', 'id',
       'is_backing', 'is_starrable', 'is_starred', 'launched_at', 'location',
       'name', 'permissions', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_pledged', 'usd_type', 'blurb_len_w',
       'slug_len_w', 'category_parent_id', 'category_id', 'category_name',
       'filled_parent', 'parent_name', 'launched_month', 'duration_days',
       'preparation', 'pledged_per_backer'],
      dtype='object')

In [85]:
# 'staff_pick' is kept: it still appears in the df.info() and df.head() outputs below
df = drop_columns(df, ['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'friends', 'fx_rate', 'id',
       'is_backing', 'is_starrable', 'is_starred', 'launched_at', 'location',
       'name', 'permissions', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_type', 'category_parent_id', 'category_id', 'category_name',
       'filled_parent'])
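
Running a drop cell a second time raises a `KeyError`, because the columns are already gone. If the cell should be safe to re-run, `DataFrame.drop` accepts `errors="ignore"`. A small sketch on a toy frame (column names taken from the list above, values made up):

```python
import pandas as pd

toy = pd.DataFrame({"goal": [1.0], "urls": ["x"], "photo": ["y"]})
to_drop = ["urls", "photo"]

# First run removes the columns
toy = toy.drop(columns=to_drop, errors="ignore")
# Second run is a no-op instead of raising KeyError
toy = toy.drop(columns=to_drop, errors="ignore")
```

The downside is that a misspelled column name is silently ignored too, so this is a trade-off rather than a strict improvement.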

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168976 entries, 1 to 209221
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   goal                168976 non-null  float64
 1   staff_pick          168976 non-null  bool   
 2   usd_pledged         168976 non-null  float64
 3   blurb_len_w         168976 non-null  int64  
 4   slug_len_w          168976 non-null  int64  
 5   parent_name         168976 non-null  object 
 6   launched_month      168976 non-null  int64  
 7   duration_days       168976 non-null  int64  
 8   preparation         168976 non-null  int64  
 9   pledged_per_backer  154156 non-null  float64
dtypes: bool(1), float64(3), int64(5), object(1)
memory usage: 18.1+ MB


## Conversion of Data Types

In [80]:
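def convert_to_int(df, column_name):
    '''Converts the column type to integer and returns the updated dataframe'''
    df[column_name] = df[column_name].astype("int")
    return df

# convert the boolean 'staff_pick' to 0/1 (it is int64 in the df.info() output below)
df = convert_to_int(df, 'staff_pick')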

In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168976 entries, 1 to 209221
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   goal                168976 non-null  float64
 1   staff_pick          168976 non-null  int64  
 2   usd_pledged         168976 non-null  float64
 3   blurb_len_w         168976 non-null  int64  
 4   slug_len_w          168976 non-null  int64  
 5   parent_name         168976 non-null  object 
 6   launched_month      168976 non-null  int64  
 7   duration_days       168976 non-null  int64  
 8   preparation         168976 non-null  int64  
 9   pledged_per_backer  154156 non-null  float64
dtypes: float64(3), int64(6), object(1)
memory usage: 19.2+ MB


In [84]:
df.head()

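|    | goal | staff_pick | usd_pledged | blurb_len_w | slug_len_w | parent_name | launched_month | duration_days | preparation | pledged_per_backer |
|----|------|------------|-------------|-------------|------------|-------------|----------------|---------------|-------------|--------------------|
| 1 | 1000.0 | 0 | 1950.0 | 22 | 4 | Games | 8 | 30 | 8 | 41.49 |
| 2 | 15000.0 | 0 | 22404.0 | 15 | 7 | Music | 5 | 30 | 224 | 82.67 |
| 3 | 12160.66 | 0 | 165.384934 | 23 | 7 | Games | 1 | 59 | 5 | 55.13 |
| 4 | 2800.0 | 0 | 2820.0 | 24 | 7 | Publishing | 12 | 30 | 4 | 940.0 |
| 5 | 3500.0 | 0 | 3725.0 | 18 | 4 | Music | 4 | 30 | 159 | 106.43 |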


## Drop Rows and only keep relevant categories

In [87]:
categories = ["Games", "Art", "Photography", "Film & Video", "Design", "Technology"]
df = df[df.parent_name.isin(categories)]

## get Dummies

In [88]:
# convert the categorical variable parent_name into dummy/indicator variables
df_dum2 = pd.get_dummies(df.parent_name, prefix='parent_name')
df = df.drop(['parent_name'], axis=1)
df = pd.concat([df, df_dum2], axis=1)

In [89]:
# making a categorical variable for launched_month q1, q2, q3, q4 
df.loc[df['launched_month'] <  4, 'time_yr'] = 'q1'
df.loc[(df['launched_month'] >=  4) & (df['launched_month'] <  7), 'time_yr'] = 'q2'
df.loc[(df['launched_month'] >=  7) & (df['launched_month'] <  10), 'time_yr'] = 'q3'
df.loc[df['launched_month'] >  9, 'time_yr'] = 'q4'
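
The quarter can also be derived arithmetically from the month, which avoids the four separate conditions. A minimal sketch on toy months:

```python
import pandas as pd

toy = pd.DataFrame({"launched_month": [1, 3, 4, 6, 7, 9, 10, 12]})

# Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4, then prefix with "q"
quarter = (toy["launched_month"] - 1) // 3 + 1
toy["time_yr"] = "q" + quarter.astype(str)
```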

In [90]:
df_dum3 = pd.get_dummies(df.time_yr, prefix='time_yr')
df = df.drop(['time_yr'], axis=1)
df = df.drop(['launched_month'], axis=1)
df = pd.concat([df, df_dum3], axis=1)
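
A note on the dummy columns: depending on the pandas version, `get_dummies` returns either 0/1 integers or booleans by default. Passing `dtype=int` makes the 0/1 encoding explicit. A short sketch on toy data (values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"goal": [1000.0, 2500.0, 500.0],
                    "time_yr": ["q1", "q3", "q1"]})

# One indicator column per category, explicitly encoded as 0/1 integers
dummies = pd.get_dummies(toy["time_yr"], prefix="time_yr", dtype=int)

# Replace the categorical column with its indicator columns
toy = pd.concat([toy.drop(columns="time_yr"), dummies], axis=1)
```

Only categories that occur in the data get a column, so `toy` ends up with 'time_yr_q1' and 'time_yr_q3' but no columns for q2 or q4.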

In [91]:
df.head()

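|    | goal | staff_pick | usd_pledged | blurb_len_w | slug_len_w | duration_days | preparation | pledged_per_backer | parent_name_Art | parent_name_Design | parent_name_Film & Video | parent_name_Games | parent_name_Photography | parent_name_Technology | time_yr_q1 | time_yr_q2 | time_yr_q3 | time_yr_q4 |
|----|------|------------|-------------|-------------|------------|---------------|-------------|--------------------|-----------------|--------------------|--------------------------|-------------------|-------------------------|------------------------|------------|------------|------------|------------|
| 1 | 1000.0 | 0 | 1950.0 | 22 | 4 | 30 | 8 | 41.49 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 12160.66 | 0 | 165.384934 | 23 | 7 | 59 | 5 | 55.13 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 24 | 54737.83 | 0 | 5.473783 | 3 | 2 | 20 | 3 | 5.47 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 25 | 2602.33 | 0 | 2861.258251 | 19 | 3 | 21 | 2 | 38.67 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 30 | 5000.0 | 0 | 5466.0 | 25 | 7 | 29 | 0 | 84.09 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |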


In [92]:
df = drop_columns(df, ['usd_pledged'])

In [93]:
df.head()

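|    | goal | staff_pick | blurb_len_w | slug_len_w | duration_days | preparation | pledged_per_backer | parent_name_Art | parent_name_Design | parent_name_Film & Video | parent_name_Games | parent_name_Photography | parent_name_Technology | time_yr_q1 | time_yr_q2 | time_yr_q3 | time_yr_q4 |
|----|------|------------|-------------|------------|---------------|-------------|--------------------|-----------------|--------------------|--------------------------|-------------------|-------------------------|------------------------|------------|------------|------------|------------|
| 1 | 1000.0 | 0 | 22 | 4 | 30 | 8 | 41.49 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 12160.66 | 0 | 23 | 7 | 59 | 5 | 55.13 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 24 | 54737.83 | 0 | 3 | 2 | 20 | 3 | 5.47 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 25 | 2602.33 | 0 | 19 | 3 | 21 | 2 | 38.67 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 30 | 5000.0 | 0 | 25 | 7 | 29 | 0 | 84.09 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |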
