<a href="https://colab.research.google.com/github/monicasoria/finance-k8s/blob/master/KICKSTARTER_ML_(RANDOM_FOREST).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About this notebook

+ First read how to mount the google drive to google colab here: 

https://medium.com/@master_yi/importing-datasets-in-google-colab-c816fc654f97

https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166


+ These pages will help you to understand better the Random Forest : 

https://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics


+ These pages contain how to implement Random Forest with Python: 

https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/


In [0]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Look for the file inside your google drive folder & see what is inside in order to build the route: 
! ls "/content/drive/My Drive/Colab Notebooks/KICKSTARTER"

'KICKSTARTER -ML (RANDOM FOREST).ipynb'   ks-projects-201801-clean2.csv


In [0]:
# Import libraries 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib 
import numpy as np 
from sklearn.model_selection import train_test_split 
from sklearn import preprocessing 
from sklearn.preprocessing import StandardScaler

%matplotlib inline 


In [0]:
# Read the file with pandas 
kickstarter = pd.read_csv("/content/drive/My Drive/Colab Notebooks/KICKSTARTER/ks-projects-201801-clean2.csv")

# Data base cleaning & transformation

In [0]:
kickstarter.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,launched_year,period,funded_ratio
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 00:00:00,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95,2015,58 days 11:47:32.000000000,0.0
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01 00:00:00,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0,2017,59 days 19:16:03.000000000,0.0807
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:00:00,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0,2013,44 days 23:39:10.000000000,0.004889
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 00:00:00,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0,2012,29 days 20:35:49.000000000,0.0002
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 00:00:00,50000.0,2016-02-26 13:38:27,52375.0,successful,224,US,52375.0,52375.0,50000.0,2016,34 days 10:21:33.000000000,1.0475


**Data selection **

Columns we are not using and why :

ID : we have already an index in python 

name : we already have name lenght 

category : there are many field so we instead use main_category 

deadline : still have period 

launched : we have period 

backers : it is the same as state 

goal : it is not in USD 

usd_pledged : we are using 

usd_pledged_real

pledged : usd_pledged : we are using

usd_pledged_real

funded_ratio : was useful for EDA but not anymore 


In [0]:
# New variables 
kickstarter['name_lenght'] = kickstarter['name'].apply(len)


#Data transformation into numeric or integer
kickstarter['launched'] = pd.to_datetime(kickstarter.launched)
kickstarter['launched_month'] = kickstarter.launched.dt.month

kickstarter['deadline'] = pd.to_datetime(kickstarter.deadline)

kickstarter['period'] = kickstarter.deadline - kickstarter.launched
kickstarter['period'] = kickstarter.period.dt.days
kickstarter['period'] = kickstarter.period.astype(int)


# Convert categorical values to dummies 
kickstarter['main_category'] = pd.get_dummies(kickstarter.main_category)
kickstarter['currency'] = pd.get_dummies(kickstarter.currency)
kickstarter['country'] = pd.get_dummies(kickstarter.country)    
kickstarter['state'] = pd.get_dummies(kickstarter.state)   


In [0]:
kickstarter = kickstarter.drop(['ID', 'name', 'category','deadline','pledged','launched','launched_year','backers','goal','usd pledged','funded_ratio'], axis=1)

## Random Forest Implementation

In [0]:
# Prepare data for training 
X_unscaled = pd.DataFrame(kickstarter)
X_unscaled.drop ('state', axis = 1 , inplace = True)
        
y = pd.Series(kickstarter.state, name = "STATE")

# 2.- Split dataset : 80% training set & 20% test set 
X_train, X_test, y_train, y_test = train_test_split(X_unscaled,y, test_size=0.2, random_state=0)


In [0]:
# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [0]:
# Train the algorithm  / Fitting Random Forest Regression to the dataset 

# 1) Import Regressor 
from sklearn.ensemble import RandomForestRegressor

# 2) Create regressor object ( n_estimators is number of trees)
regressor = RandomForestRegressor(n_estimators=10, random_state=0)

# 3) Fit regressor with x and y data 
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

In [0]:
# Evaluate the algorithm
from sklearn import metrics

# MAE
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

MAPE = round(metrics.mean_absolute_error(y_test, y_pred) *100,2)
print('Mean Absolute Error Percentage:',MAPE,"%")

# Rule of thumb for RMSE could be between 0.2 and 0.5
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

ACCURACY = 100- np.mean(MAPE)
print('Accuracy:', round(ACCURACY,2),'%')


Mean Absolute Error: 0.0007408425747716372
Mean Squared Error: 0.000315125408609049
Mean Absolute Error Percentage: 0.07 %
Root Mean Squared Error: 0.01775177198504558
Accuracy: 99.93 %


In [0]:
Evaluate the algorithm vs David's Neuronal network 

1) Mean high CTR 
2)  Mean precision of clasifier 
3) Mean improvement 

SyntaxError: ignored