# Intro

**Objective**: Create an organized features dataset by merging metadata from the web-scraped web_robots_data dataset with my engineered structural image features.

**Data**

Web_robots_data metadata for kickstarter.com campaigns - source: 

https://webrobots.io/kickstarter-datasets/

Low-level structural image features:

Two files of structural image features to be combined. These were generated using functions from the image-feature-engineering folder in this repository and then saved as pandas DataFrames in:

    features_full_w_dom_color_df1p1.pkl
    features_full_w_dom_color_df1p2.pkl


Author: Nicholas Mostovych

# Imports

In [2]:
# Load required libraries
import numpy as np
import pandas as pd
import joblib
import time

## Interactive env
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output

In [3]:
# Web-scraped metadata
datafile = '/home/mosto/Documents/insight/kickstarter-project/web_robots_data_to_08-2020_processed.pkl'

In [9]:
# My engineered image features
features_table_df1p1 = '/home/mosto/Documents/insight/kickstarter-project/features_full_w_dom_color_df1p1.pkl'
features_table_df1p2 = '/home/mosto/Documents/insight/kickstarter-project/features_full_w_dom_color_df1p2.pkl'

In [10]:
# Load table containing processed image features data
features_df1p1 = joblib.load(features_table_df1p1)
features_df1p2 = joblib.load(features_table_df1p2)

In [6]:
# Load table containing Web Robots data
df = joblib.load(datafile)

# Data Review

In [8]:
# Look at Metadata
df.head()

Unnamed: 0,name,id,blurb,category,url,currency,pledged,goal,state,location,creator_id,launched_at,deadline,backers,staff_pick,creator_name,created_at
0,Print 1000 Copies of The DENA Magazine Issue #...,551250,Be a part of history as we become the living d...,Periodicals,https://www.kickstarter.com/projects/105349506...,USD,0,2900,failed,"Pasadena, CA",1053495067,1368125818,1372640400,0,False,Jason Hardin,1367192000.0
1,Hopskeller Brewing Company -- Community Beer G...,1284652,Be a part of the Hopskeller family and let's b...,Small Batch,https://www.kickstarter.com/projects/hopskelle...,USD,21832,20000,successful,"Waterloo, IL",1799737223,1452897383,1455489383,117,False,Matthew Schweizer,1407877000.0
2,Southern Fusion Bbq Food Truck,3003776,I'm a looking to start my own food truck. I ha...,Food Trucks,https://www.kickstarter.com/projects/622491845...,USD,1,10000,failed,"Tampa, FL",622491845,1495109582,1498881540,1,False,Russel Short,1495024000.0
3,Help the Green Boys release their second album...,532135,The Green Boys are finished recording their se...,Country & Folk,https://www.kickstarter.com/projects/112702461...,USD,4213,3000,successful,"Richmond, VA",112702461,1366404388,1368244740,105,False,The Green Boys,1365637000.0
4,Creative Lighting with Yongnuo SpeedLights,1062669,I push the limits into what can be achieved wi...,Photobooks,https://www.kickstarter.com/projects/152187118...,USD,1,2000,failed,"Minneapolis, MN",1521871187,1402605560,1405197560,1,False,Peter Chang,1402425000.0


In [11]:
# look at structural features data
features_df1p1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4929 entries, 0 to 4956
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   image                4929 non-null   object 
 1   dullness             4917 non-null   float64
 2   brightness           4917 non-null   float64
 3   average_pixel_width  4929 non-null   float64
 4   average_color        4929 non-null   object 
 5   average_red          4929 non-null   float64
 6   average_green        4929 non-null   float64
 7   average_blue         4929 non-null   float64
 8   image_size           4929 non-null   int64  
 9   temp_size            4929 non-null   object 
 10  width                4929 non-null   int64  
 11  height               4929 non-null   int64  
 12  blurrness            4929 non-null   float64
 13  dominant_color       4929 non-null   object 
 14  dominant_red         4929 non-null   float64
 15  dominant_green       4929 non-null   f

# Data Organization

In [13]:
# Merge structural features data
# (DataFrame.append was removed in pandas 2.0; pd.concat gives the same result)
features = pd.concat([features_df1p1, features_df1p2], ignore_index=True)
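
As a small illustration of the row-wise merge above, the sketch below stacks two toy tables. The frames `part1` and `part2` and their values are made up for the example; they only stand in for the two feature files.

```python
import pandas as pd

# Toy stand-ins for the two feature tables (values are invented for illustration)
part1 = pd.DataFrame({'image': [10, 11], 'brightness': [0.4, 0.7]}, index=[0, 3])
part2 = pd.DataFrame({'image': [12, 13], 'brightness': [0.1, 0.9]}, index=[0, 1])

# Stack the rows; ignore_index=True discards the old labels and renumbers from 0
combined = pd.concat([part1, part2], ignore_index=True)
print(combined)
```

Renumbering the index avoids repeated row labels when both parts start counting from 0.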

In [15]:
# Obtain the subset of images that were scraped from the large Web Robots dataset
'''
A subset of projects was randomly sampled from the large Web Robots dataset and
their images were downloaded from Kickstarter.com. The sample is drawn after seeding
NumPy's global random state with 74, and this notebook uses the first 10000 rows
of that sample.
'''

# Select projects in USD
df_USD = df[df['currency'] == 'USD']
# Take a random sample of the Web Robots data using a seed value to ensure repeatability
# (np.random.seed returns None, so its result is not stored)
np.random.seed(74)
df_sample = df_USD.sample(50000)
# Copy the slice so the in-place edits below do not act on a view of df_sample
df1_sample = df_sample.iloc[:10000].copy()
df1_sample.shape

(10000, 17)
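
A possible alternative to seeding the global NumPy state is to pass the seed straight to `DataFrame.sample` through its `random_state` argument. The sketch below uses a made-up table `toy`. It is not claimed to select the same rows as the cell above; it only shows that two calls with the same `random_state` return the same sample.

```python
import pandas as pd

# Made-up table standing in for the Web Robots metadata
toy = pd.DataFrame({'id': range(100), 'currency': ['USD', 'EUR'] * 50})
usd = toy[toy['currency'] == 'USD']

# The seed travels with the call, so other code that draws random numbers cannot disturb it
sample_a = usd.sample(20, random_state=74)
sample_b = usd.sample(20, random_state=74)

# Take the first rows of the sample as an independent copy
first_ten = sample_a.iloc[:10].copy()
print(sample_a.index.equals(sample_b.index))
```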

In [16]:
# Label encode the success and failure states (canceled projects are counted as failures)
cleanup_nums = {'state':{'successful':1, 'failed':0, 'live':2, 'canceled':0}, 'staff_pick':{False:0, True:1}}
df1_sample.replace(cleanup_nums, inplace=True)
df1_sample.head()

Unnamed: 0,name,id,blurb,category,url,currency,pledged,goal,state,location,creator_id,launched_at,deadline,backers,staff_pick,creator_name,created_at
3990,Glow Girls start up fund,2896853,We are raising funds for Marketing purposes su...,DIY,https://www.kickstarter.com/projects/165542940...,USD,541,500,1,"Seattle, WA",1655429407,1488242943,1490831343,12,0.0,Claudia,1488210000.0
124723,Studio for Artists,2738647,Unexpected health issues has taken it's toll o...,Painting,https://www.kickstarter.com/projects/117655044...,USD,50,5000,0,"Naples, FL",1176550444,1477960965,1480528140,1,0.0,Arthur Morehead,1477769000.0
52182,Quickstarter: Mechanized Fabrications-What is ...,3736756,"A look into a world lost to progress, and the ...",Comic Books,https://www.kickstarter.com/projects/mechanize...,USD,317,300,1,"Alexandria, VA",385588855,1559061039,1560789039,11,0.0,Sebastian J,1558452000.0
43468,Photography Business launch for Joseph in Nair...,3778504,Joseph Koya is launching his photography busin...,Places,https://www.kickstarter.com/projects/kingkoya/...,USD,111,1000,0,"Nairobi, Kenya",1760691886,1564473990,1569657990,3,0.0,Jospeph (posted by Doug),1563865000.0
41219,Bokeh Fire: Lenses for Everyone,1465799,A simple monthly lens rental service for photo...,Camera Equipment,https://www.kickstarter.com/projects/panok/bok...,USD,21219,20000,1,"Philadelphia, PA",131591379,1415854680,1418508000,124,0.0,Pano K,1414168000.0
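
One thing to watch with `replace` is that labels missing from the dictionary are left untouched as strings. The sketch below, on a made-up column, uses `Series.map` instead so that any unmapped label shows up as NaN and can be listed. The `'suspended'` label is a hypothetical example of a state not covered by the mapping.

```python
import pandas as pd

# Made-up states; 'suspended' is a hypothetical label absent from the mapping
toy = pd.DataFrame({'state': ['successful', 'failed', 'canceled', 'live', 'suspended']})
state_codes = {'successful': 1, 'failed': 0, 'live': 2, 'canceled': 0}

# map returns NaN for labels that are not keys of the dictionary
toy['state_code'] = toy['state'].map(state_codes)

# List the labels that did not receive a code
unmapped = toy.loc[toy['state_code'].isna(), 'state'].unique().tolist()
print(unmapped)
```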


In [17]:
# Change objects to numbers in dataframes for merging
df1_sample['state'] = pd.to_numeric(df1_sample['state'])
df1_sample['pledged'] = pd.to_numeric(df1_sample['pledged'])
df1_sample['goal'] = pd.to_numeric(df1_sample['goal'])
df1_sample['backers'] = pd.to_numeric(df1_sample['backers'])
df1_sample['staff_pick'] = pd.to_numeric(df1_sample['staff_pick'])
df1_sample['launched_at'] = pd.to_numeric(df1_sample['launched_at'])
df1_sample['deadline'] = pd.to_numeric(df1_sample['deadline'])


features['image'] = pd.to_numeric(features['image'])
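
The repeated conversions above could also be written as a loop over column names. The sketch below runs on a made-up frame and adds `errors='coerce'`, which is an assumption on my part rather than something the notebook uses: it turns unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

# Made-up string columns; 'n/a' stands in for an unparseable entry
toy = pd.DataFrame({'goal': ['500', '1000', 'n/a'], 'backers': ['12', '3', '7']})

# Convert each listed column in place; bad values become NaN rather than raising
for col in ['goal', 'backers']:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
```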

In [18]:
# Make a new dataframe with selected metadata and do a left merge with the features dataframe
# Collect the metadata columns; the sample's row index is used as the image ID
keys = df1_sample.index
state = df1_sample.state
pledged = df1_sample.pledged
goal = df1_sample.goal
backers = df1_sample.backers
staff_pick = df1_sample.staff_pick
launched_at = df1_sample.launched_at
deadline = df1_sample.deadline
category = df1_sample.category

d = {'image': keys, 'state': state, 'pledged': pledged, 'goal': goal, 'backers': backers,
     'staff_pick': staff_pick, 'launched_at': launched_at, 'deadline': deadline, 'category': category}

# Create dataframe
df2 = pd.DataFrame(data=d)
df2.head()

# Left merge to keep every row of the features table and attach metadata where the image ID matches
left_merged = pd.merge(features, df2,
                        how="left", on=["image"])
# Name new dataframe with metadata, state and features
features_with_state=left_merged
features_with_state.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10893 entries, 0 to 10892
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   image                10893 non-null  int64  
 1   dullness             10865 non-null  float64
 2   brightness           10865 non-null  float64
 3   average_pixel_width  10893 non-null  float64
 4   average_color        10893 non-null  object 
 5   average_red          10893 non-null  float64
 6   average_green        10893 non-null  float64
 7   average_blue         10893 non-null  float64
 8   image_size           10893 non-null  int64  
 9   temp_size            10893 non-null  object 
 10  width                10893 non-null  int64  
 11  height               10893 non-null  int64  
 12  blurrness            10893 non-null  float64
 13  dominant_color       10893 non-null  object 
 14  dominant_red         10893 non-null  float64
 15  dominant_green       10893 non-null 
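
To check how many feature rows actually found matching metadata in a left merge like the one above, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two made-up frames, `features_toy` and `meta_toy`, not the notebook's data.

```python
import pandas as pd

# Made-up tables: image 2 has no metadata, image 4 has no features
features_toy = pd.DataFrame({'image': [1, 2, 3], 'brightness': [0.2, 0.5, 0.9]})
meta_toy = pd.DataFrame({'image': [1, 3, 4], 'state': [1, 0, 1]})

# A left merge keeps all rows of features_toy; _merge records where each row came from
merged = pd.merge(features_toy, meta_toy, how='left', on='image', indicator=True)
counts = merged['_merge'].value_counts()
print(counts)
```

Rows marked `left_only` have features but no metadata, so their metadata columns are NaN.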

In [19]:
# Are there duplicates?
boolean = not features_with_state['image'].is_unique
boolean

True
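
Beyond the yes/no check, it can help to see which image IDs repeat. A minimal sketch on a made-up frame `toy`, using `duplicated(keep=False)` to flag every copy of a repeated ID:

```python
import pandas as pd

# Made-up table in which image 2 appears twice
toy = pd.DataFrame({'image': [1, 2, 2, 3], 'brightness': [0.2, 0.5, 0.5, 0.9]})

# keep=False marks all occurrences of a repeated ID, not only the later ones
dupes = toy[toy['image'].duplicated(keep=False)]

# Keep the first occurrence of each image ID
unique_rows = toy.drop_duplicates(subset=['image'], keep='first')
print(dupes)
```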

In [20]:
# Some images were repeated by accident; let's remove the duplicate rows
df1_unique = features_with_state.drop_duplicates(subset=['image'])

In [22]:
df1_unique.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9941 entries, 0 to 10892
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   image                9941 non-null   int64  
 1   dullness             9914 non-null   float64
 2   brightness           9914 non-null   float64
 3   average_pixel_width  9941 non-null   float64
 4   average_color        9941 non-null   object 
 5   average_red          9941 non-null   float64
 6   average_green        9941 non-null   float64
 7   average_blue         9941 non-null   float64
 8   image_size           9941 non-null   int64  
 9   temp_size            9941 non-null   object 
 10  width                9941 non-null   int64  
 11  height               9941 non-null   int64  
 12  blurrness            9941 non-null   float64
 13  dominant_color       9941 non-null   object 
 14  dominant_red         9941 non-null   float64
 15  dominant_green       9941 non-null   

In [23]:
# Rename some columns to better reflect features
df1_unique.rename(columns={'average_pixel_width': 'uniformity', 'blurrness': 'blurriness',
                           'image_size': 'compression_size'}, inplace=True)

In [24]:
df1_unique.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9941 entries, 0 to 10892
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   image             9941 non-null   int64  
 1   dullness          9914 non-null   float64
 2   brightness        9914 non-null   float64
 3   uniformity        9941 non-null   float64
 4   average_color     9941 non-null   object 
 5   average_red       9941 non-null   float64
 6   average_green     9941 non-null   float64
 7   average_blue      9941 non-null   float64
 8   compression_size  9941 non-null   int64  
 9   temp_size         9941 non-null   object 
 10  width             9941 non-null   int64  
 11  height            9941 non-null   int64  
 12  blurriness        9941 non-null   float64
 13  dominant_color    9941 non-null   object 
 14  dominant_red      9941 non-null   float64
 15  dominant_green    9941 non-null   float64
 16  dominant_blue     9941 non-null   float64

# Save Data

In [25]:
# Serialize the final table of image features and project metadata
joblib.dump(df1_unique, '/home/mosto/Documents/insight/kickstarter-project/final_features_df1.pkl')

['/home/mosto/Documents/insight/kickstarter-project/final_features_df1.pkl']