## Crowdfunding Analysis - Final Project

#### Dataset: https://www.kaggle.com/yashkantharia/kickstarter-campaigns-dataset-20


### Importing dependencies and data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # plotting

df = pd.read_csv('data.csv')

In [3]:
# Describing the Data
# df.describe()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217245 entries, 0 to 217244
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   index          217245 non-null  int64  
 1   id             217245 non-null  int64  
 2   name           217245 non-null  object 
 3   currency       217245 non-null  object 
 4   launched_at    217245 non-null  object 
 5   backers_count  217245 non-null  int64  
 6   blurb          217245 non-null  object 
 7   country        217245 non-null  object 
 8   deadline       217245 non-null  object 
 9   slug           217245 non-null  object 
 10  status         217245 non-null  object 
 11  usd_pledged    217245 non-null  float64
 12  sub_category   217245 non-null  object 
 13  main_category  217245 non-null  object 
 14  creator_id     217245 non-null  int64  
 15  blurb_length   217245 non-null  int64  
 16  goal_usd       217245 non-null  float64
 17  city           217245 non-nul

### Data Cleaning

* Duplicates
* Removing useless columns
* Converting variables
* Normalizing values
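
The list above names normalization as a cleaning step, but the cells that follow do not show it. As a rough sketch only, one common approach for a skewed monetary column such as `goal_usd` is a log transform followed by min-max scaling. The `toy` frame, its values, and the `goal_log` / `goal_scaled` column names are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({"goal_usd": [500.0, 5000.0, 50000.0, 1000000.0]})

# log1p compresses the long right tail typical of monetary amounts.
toy["goal_log"] = np.log1p(toy["goal_usd"])

# Min-max scaling maps the log values onto the [0, 1] interval.
lo, hi = toy["goal_log"].min(), toy["goal_log"].max()
toy["goal_scaled"] = (toy["goal_log"] - lo) / (hi - lo)

print(toy)
```

Whether scaling is needed at all depends on the model: tree ensembles such as Random Forest are largely insensitive to it, while Logistic Regression usually benefits.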

In [4]:
# id, creator_id and index are unique identifiers / serial values
ul_cols = ['id', 'creator_id', 'index']
df = df.drop(columns=ul_cols)
df.info()

In [5]:
# Removing Duplicates
print("# of Duplicates: ", df.duplicated().sum())
df.drop_duplicates(inplace=True, ignore_index=True)

# Length after duplicate removal
print("Length of dataset after removal of duplicates is ", len(df))

# of Duplicates:  19527
Length of dataset after removal of duplicates is  197718


In [6]:
# Converting fields to binary
df['status'].describe()

count         197718
unique             4
top       successful
freq          109205
Name: status, dtype: object

In [9]:
# Status to binary: 1 = successful, 0 = any other status (the column has 4 unique values)
df['status'] = (df['status'] == 'successful').astype('int64')
df['status'].describe()

count    197718.000000
mean          0.552327
std           0.497256
min           0.000000
25%           0.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: status, dtype: float64
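
The split output further down lists an integer `duration` column whose derivation is not shown in these cells. A plausible reading is that it is the campaign length in days, computed from `launched_at` and `deadline`. The sketch below works under that assumption; the date strings and their format are invented for the example and may not match `data.csv`.

```python
import pandas as pd

# Toy rows; the date strings and their format are assumptions, not taken from data.csv.
toy = pd.DataFrame({
    "launched_at": ["2019-01-01 00:00:00", "2019-03-15 12:00:00"],
    "deadline": ["2019-01-31 00:00:00", "2019-05-14 12:00:00"],
})

# Parse the object (string) columns into datetimes.
launched = pd.to_datetime(toy["launched_at"])
deadline = pd.to_datetime(toy["deadline"])

# Whole days between launch and deadline, stored as an integer column.
toy["duration"] = (deadline - launched).dt.days

print(toy["duration"].tolist())
```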

### Importing models for the first iteration
* Logistic Regression
* Random Forest


In [15]:
!pip install scikit-learn

In [16]:
# General dependencies + scikit-learn classifiers (Logistic Regression, Random Forest)
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
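
The imports above suggest a repeated stratified k-fold evaluation. As a hedged sketch of how these pieces typically fit together, the example below scores both listed models on synthetic data from `make_classification`. The sample size, feature counts, fold settings, and hyperparameters are arbitrary assumptions, and the scores say nothing about the Kickstarter data.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the Kickstarter features; sizes are arbitrary.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# 5 folds repeated twice, each fold preserving the class ratio.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # One accuracy score per fold and repeat (10 scores per model here).
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    results[name] = (mean(scores), std(scores))
    print(f"{name}: accuracy {results[name][0]:.3f} (+/- {results[name][1]:.3f})")
```

On the real data, the object columns (for example `country` or `main_category`) would need to be encoded numerically before either model could be fitted.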

In [18]:
# Splitting data into train and test datasets (random mask, roughly 80/20)
part = np.random.rand(len(df)) < 0.8
train = df[part]
test = df[~part]
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39357 entries, 2 to 197696
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           39357 non-null  object 
 1   currency       39357 non-null  object 
 2   launched_at    39357 non-null  object 
 3   backers_count  39357 non-null  int64  
 4   blurb          39357 non-null  object 
 5   country        39357 non-null  object 
 6   deadline       39357 non-null  object 
 7   slug           39357 non-null  object 
 8   status         39357 non-null  int64  
 9   usd_pledged    39357 non-null  float64
 10  sub_category   39357 non-null  object 
 11  main_category  39357 non-null  object 
 12  blurb_length   39357 non-null  int64  
 13  goal_usd       39357 non-null  float64
 14  city           39357 non-null  object 
 15  duration       39357 non-null  int64  
dtypes: float64(2), int64(4), object(10)
memory usage: 5.1+ MB
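
The random mask used for the split gives only an approximate 80/20 ratio and changes on every run because no seed is set. One possible alternative, sketched here on made-up data, is scikit-learn's `train_test_split` with a fixed `random_state` and stratification on the target. The `toy` frame and the seed value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a balanced binary target; values are made up for illustration.
toy = pd.DataFrame({
    "goal_usd": [float(g) for g in range(100, 2100, 100)],
    "status": [1, 0] * 10,
})

# stratify keeps the share of successful campaigns equal in both parts;
# random_state makes the split reproducible across runs.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["status"], random_state=42
)

print(len(train), len(test))
print(test["status"].mean())
```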
