# Datathon 2: Beginners: Cancer Death Rate
### Predict the cancer death rate for the given year

#### Content  
  As per WHO,

 - Cancer is the second leading cause of death globally, and was responsible for an estimated 9.6 million deaths in 2018. Globally, about 1 in 6 deaths is due to cancer.
 - Approximately 70% of deaths from cancer occur in low- and middle-income countries.
 - Around one third of deaths from cancer are due to the 5 leading behavioral and dietary risks: high body mass index, low fruit and vegetable intake, lack of physical activity, tobacco use, and alcohol use.

#### Problem Statement

Many aspects of how cancer behaves are highly unpredictable. Even with the huge number of studies that have been done on the DNA mutations responsible for the disease, we are still unable to use this information at the clinical level. However, it is important that we learn as much as we can about the effects and impact of this disease from past data.

#### Objective of Datathon 2

The goal is to build a machine learning model that predicts the cancer death rate for the given year.



### Import Data

In [1]:
import numpy as np
import pandas as pd
import math
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from boruta import BorutaPy
%matplotlib inline

### Loading data and displaying first rows

In [2]:
df  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/cancer_death_rate/Training_set_label.csv" )

In [None]:
df.head()

### Basic Exploratory Analysis 

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()
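Raw missing counts are easier to act on when they are shown as a share of the rows. The helper below is a minimal sketch of that idea; `missing_summary` and the toy frame are hypothetical names and made-up values, not part of the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually contain missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the call would be `missing_summary(df)`. A column that is mostly empty is a candidate for dropping, while a column with only a few gaps is a candidate for imputation.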

In [5]:
df.shape

(3051, 34)

### Cleaning Data

In [None]:
# Number of rows with a non-null target value
df['TARGET_deathRate'].value_counts().sum()

In [None]:
df = df.drop(columns=['binnedInc','Geography','PctSomeCol18_24'])

In [None]:
mean_16 = df['PctEmployed16_Over'].mean()
df['PctEmployed16_Over'] = df['PctEmployed16_Over'].fillna(mean_16)

In [None]:
mean_alone = df['PctPrivateCoverageAlone'].mean()
df['PctPrivateCoverageAlone'] = df['PctPrivateCoverageAlone'].fillna(mean_alone)

### Separate the Input and Target Features of the data

In [None]:
X = df.drop(columns=['TARGET_deathRate'])
y = df['TARGET_deathRate']

### Split the data into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Scaling data

I tried working with normalized and standardized data, but the results weren't good, probably because some features contain many outliers, so I decided not to scale the data.

In [None]:
'''from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = (X_train.columns))
X_test = pd.DataFrame(scaler.transform(X_test), columns = (X_test.columns))'''

In [None]:
'''from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = (X_train.columns))
X_test = pd.DataFrame(scaler.transform(X_test), columns = (X_test.columns))'''
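A common pitfall when scaling is fitting the scaler a second time on the test split, which puts train and test features on different scales. The sketch below fits the scaler on the training split only and reuses it for the test split. It runs on synthetic data, so the names `X_demo`, `y_demo` and the split sizes are assumptions. It also illustrates that ordinary least squares gives the same predictions with or without consistent standardization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Learn the scaling parameters from the training split only ...
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
# ... and apply the same parameters to the test split.
X_te_s = scaler.transform(X_te)

pred_plain = LinearRegression().fit(X_tr, y_tr).predict(X_te)
pred_scaled = LinearRegression().fit(X_tr_s, y_tr).predict(X_te_s)

# Largest absolute difference between the two sets of predictions.
max_diff = float(np.max(np.abs(pred_plain - pred_scaled)))
print(max_diff)
```

Tree ensembles such as random forests split on feature thresholds, so they are also largely insensitive to this kind of rescaling. If scaled and unscaled results differ a lot, it is worth checking how the scaler was fitted before blaming outliers.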

# Machine Learning

### Building Linear Regression Model

In [None]:
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred_reg = reg.predict(X_test)

### Evaluate the Linear Regression model

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_reg))
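Mean squared error is expressed in squared units of the target, which makes it hard to read on its own. The import cell already brings in `r2_score` and `mean_absolute_error`, so one option is a small helper that reports several metrics together. `regression_report` is a hypothetical helper, and the numbers below are toy values, not results from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Collect common regression metrics in one dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),  # same units as the target
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Toy values, only to show the call.
report = regression_report([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In the notebook the same helper could be called as `regression_report(y_test, y_pred_reg)` for each model, which makes the models easier to compare side by side.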

### Building Random Forest Model

In [None]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

### Evaluate the Random Forest model

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_rf))

# Hyperparameter Tuning

In [None]:
parameters = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [2,3,4,5,8,16,None],
    'min_samples_leaf': [2,3,4,5,8,16]
}
cv = GridSearchCV(rf, parameters, cv = 3, n_jobs = -1, verbose = 2)
cv.fit(X_train, y_train)

print(cv.best_estimator_)
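Instead of reading the printed estimator and retyping its settings, the fitted search object can be queried directly. The sketch below uses a deliberately small grid on synthetic data so that it runs quickly; the data, the grid values and the explicit `scoring` choice are assumptions for illustration. By default the search scores a regressor with R², so `scoring="neg_mean_squared_error"` is set here to match the metric printed elsewhere in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=120)

# Small grid so the example finishes quickly.
param_grid = {"n_estimators": [20, 50], "min_samples_leaf": [2, 8]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_   # dictionary of the winning settings
best_cv_mse = -search.best_score_   # the score is negated MSE, so flip the sign
tuned_model = search.best_estimator_  # already refit on all the data passed to fit
print(best_params, round(best_cv_mse, 3))
```

Because `refit=True` is the default, `best_estimator_` can be used for prediction right away, without building a new model by hand.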

In [None]:
rf_rs = RandomForestRegressor(min_samples_leaf=2, n_estimators=500)
rf_rs.fit(X_train, y_train)
y_pred_rf_rs = rf_rs.predict(X_test)

### Evaluate the Random Forest model with tuned hyperparameters

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_rf_rs))

# Feature Selection

### Using Boruta Feature Selection Algorithm

In [None]:
boruta_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)   
boruta_selector.fit(np.array(X_train), np.array(y_train)) 

In [None]:
print("No. of significant features: ", boruta_selector.n_features_)

In [None]:
X_train_boruta = boruta_selector.transform(np.array(X_train))
X_test_boruta = boruta_selector.transform(np.array(X_test))
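`transform` returns plain NumPy arrays, so the column names of the selected features are lost. A fitted `BorutaPy` selector exposes a boolean mask in its `support_` attribute, and that mask can be mapped back to the column names. The sketch below shows the mapping with a hypothetical mask and hypothetical column names, so that it runs without fitting Boruta; `selected_columns` is an illustrative helper.

```python
import numpy as np
import pandas as pd


def selected_columns(columns, support_mask):
    """Return the column names where the boolean mask is True."""
    mask = np.asarray(support_mask, dtype=bool)
    if mask.shape[0] != len(columns):
        raise ValueError("mask length must match the number of columns")
    return list(pd.Index(columns)[mask])


# Hypothetical column names and mask, standing in for
# X_train.columns and boruta_selector.support_.
demo_columns = ["feature_a", "feature_b", "feature_c", "feature_d"]
demo_mask = [True, False, True, False]
kept = selected_columns(demo_columns, demo_mask)
print(kept)
```

With the fitted selector the call would be `selected_columns(X_train.columns, boruta_selector.support_)`.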

# Machine learning with Boruta selected features

### Linear Regression

In [None]:
reg = LinearRegression()
reg.fit(X_train_boruta, y_train)
y_pred_reg_boruta = reg.predict(X_test_boruta)

### Evaluating the model

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_reg_boruta))

### Random Forest

In [None]:
rf = RandomForestRegressor()
rf.fit(X_train_boruta, y_train)
y_pred_rf_boruta = rf.predict(X_test_boruta)

### Evaluating the model

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_rf_boruta))

# Hyperparameter Tuning with Boruta selected features

In [None]:
parameters = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [2,3,4,5,8,16,None],
    'min_samples_leaf': [2,3,4,5,8,16]
}
cv = RandomizedSearchCV(rf, parameters, cv = 3, n_jobs = -1, verbose = 2)
cv.fit(X_train_boruta, y_train)

print(cv.best_estimator_)

In [None]:
rf_gb_boruta = RandomForestRegressor(min_samples_leaf=3, n_estimators=200)
rf_gb_boruta.fit(X_train_boruta, y_train)
y_pred_rf_gb_boruta = rf_gb_boruta.predict(X_test_boruta)

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_rf_gb_boruta))

# Machine Learning with other Regressor models

### Extra Trees Regressor

In [None]:
ext_boruta = ExtraTreesRegressor()
ext_boruta.fit(X_train_boruta, y_train)
y_pred_ext_boruta = ext_boruta.predict(X_test_boruta)

In [None]:
parameters = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [2,3,4,5,8,16,None],
    'min_samples_leaf': [2,3,4,5,8,16]
}
cv = RandomizedSearchCV(ext_boruta, parameters, cv = 3, n_jobs = -1, verbose = 2)
cv.fit(X_train_boruta, y_train)

print(cv.best_estimator_)

In [None]:
ext_boruta = ExtraTreesRegressor(min_samples_leaf=2, n_estimators=500)
ext_boruta.fit(X_train_boruta, y_train)
y_pred_ext_boruta = ext_boruta.predict(X_test_boruta)

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_ext_boruta))

### Gradient Boosting Regressor

In [None]:
gbr_boruta = GradientBoostingRegressor()
gbr_boruta.fit(X_train_boruta, y_train)
y_pred_gbr_boruta = gbr_boruta.predict(X_test_boruta)

In [None]:
parameters = {
    'learning_rate': [0.001, 0.1, 0.3, 1],
    'n_estimators': [200, 500, 1000],
    'min_samples_leaf': [2,3,4,5,8,16]
}
cv = RandomizedSearchCV(gbr_boruta, parameters, cv = 3, n_jobs = -1, verbose = 2)
cv.fit(X_train_boruta, y_train)

print(cv.best_estimator_)

In [None]:
gbr_boruta = GradientBoostingRegressor(min_samples_leaf=2, n_estimators=1000, learning_rate=0.3)
gbr_boruta.fit(X_train_boruta, y_train)
y_pred_gbr_boruta = gbr_boruta.predict(X_test_boruta)

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred_gbr_boruta))

# Cleaning Test data and Output Predictions

In [None]:
test_df = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/cancer_death_rate/Testing_set_label.csv')

In [None]:
test_df = test_df.drop(columns=['binnedInc','Geography','PctSomeCol18_24'])
# Impute with the training-set means (df is the training frame) so the test data
# is treated exactly like the data the models were trained on
mean_16 = df['PctEmployed16_Over'].mean()
test_df['PctEmployed16_Over'] = test_df['PctEmployed16_Over'].fillna(mean_16)
mean_alone = df['PctPrivateCoverageAlone'].mean()
test_df['PctPrivateCoverageAlone'] = test_df['PctPrivateCoverageAlone'].fillna(mean_alone)
X_test_df_boruta = boruta_selector.transform(np.array(test_df))

# Missing Values visualization with Missingno

The missingno package makes it possible to visualize missing data more intuitively.

In [None]:
import matplotlib.pyplot as plt  # needed for plt.show() below
import missingno as msno

df_check  = pd.read_csv(
    "https://raw.githubusercontent.com/dphi-official/Datasets/master/cancer_death_rate/Training_set_label.csv" )

msno.matrix(df_check, figsize = (30,5))
plt.show()

### Output for Random Forest with Boruta selected features

In [None]:
y_pred_final_rf = rf_gb_boruta.predict(X_test_df_boruta)

In [None]:
output = pd.DataFrame({'prediction': y_pred_final_rf})
output.to_csv('RF_Boruta_solution.csv', index=False)

### Output for Gradient Boosting with Boruta selected features

In [None]:
y_pred_final_gbr = gbr_boruta.predict(X_test_df_boruta)

In [None]:
output2 = pd.DataFrame({'prediction': y_pred_final_gbr})
output2.to_csv('GBR_Boruta_solution.csv', index=False)