<a href="https://colab.research.google.com/github/hsaripalli/Pump-It-Up/blob/main/model/Pump_it_up_Optuna_Tuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Notes**

**Hyperparameters tuned using the Optuna library. Best accuracy of 82.4% on submission.**

Load Data:

- Loaded and combined train and test csv files
- Dropped columns that obviously did not have any significance (mostly 0s or same value all across)
- Parsed date and created two new columns- month and year

Numeric Columns:

- List all numerical columns
- Impute gps height, lat/long using grouped means
- Impute construction year and population using grouped mean
- Created new column, age, as year recorded - construction year. Imputed negative values of age
- Created new column, 'season', using the month column.
- Used KMeans to create cluster features for lat/long. Didn't do anything for accuracy

Categorical Columns

- Converted all strings to lower case 
- Split into columns that have too many unique values vs not too many unique values
- Replaced 0s and 'none's with most frequent values
- Cleaned up some values that are mostly similar but have typos or entered as different versions. For example: community vs commu. 
- Dropped some columns that are mostly similar to others

Split Train and Test

- Separated train and test csv files after cleaning
- Held out only 1% of the training rows (test_size = 0.01) to maximize the training data. Used 3 fold cross validation for tuning.
- Label encoded

Pipeline:

- MyCategoryCoalescer- Custom transformer (Uncle Steve's) to retain the top 25 levels per column and replace the rest with `__OTHER__`
- Ordinal Encoder for all category columns
- Scaler for numeric columns. Scaler didn't really boost accuracy, IMO.

Models: 

- Trained random forest, xgboost, adaboost, bagging (with base as decision trees), extra trees, LightGBM, CatBoost
- All models have mostly similar accuracies except adaboost. adaboost lower by a few points
- Stacking all five models gave the best accuracy

# **Load Data**

In [None]:
# Merged train and test for preprocessing

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
values = pd.read_csv("https://raw.githubusercontent.com/hsaripalli/Pump-It-Up/main/Training_set_values.csv")
labels = pd.read_csv("https://raw.githubusercontent.com/hsaripalli/Pump-It-Up/main/Training_set_labels.csv")
test = pd.read_csv("https://raw.githubusercontent.com/hsaripalli/Pump-It-Up/main/Test_set_values.csv")

In [None]:
# Merge train and test

values['train'] = True
test['test'] = True
data = pd.concat([values, test], ignore_index = True)

In [None]:
#Drop columns

columns_to_drop = ['num_private', 'recorded_by']
data = data.drop(columns_to_drop, axis = 1)

In [None]:
#Parse dates

data['date_recorded' + '_year'] = pd.to_datetime(data['date_recorded']).dt.year 
data['date_recorded' + '_month'] = pd.to_datetime(data['date_recorded']).dt.month
data = data.drop('date_recorded', axis = 1)

# **Data Cleaning - Numerical Features**

In [None]:
numeric_columns = data.select_dtypes(exclude = 'object').columns.tolist()

In [None]:
# Latitudes above -0.5 (effectively zero) are placeholders: set them to 0 so they are imputed below
data.loc[data['latitude'] > -0.5, 'latitude'] = 0

In [None]:
# gps height, longitude and latitude: impute 0 and nan with grouped mean (finest group first)
col1 = ['gps_height', 'longitude', 'latitude']
data[col1] = data[col1].replace(0, np.nan)
for i in col1:
    data[i] = data[i].fillna(data.groupby('subvillage')[i].transform('mean'))
    data[i] = data[i].fillna(data.groupby('ward')[i].transform('mean'))
    data[i] = data[i].fillna(data.groupby('lga')[i].transform('mean'))
    data[i] = data[i].fillna(data.groupby('region')[i].transform('mean'))
    data[i] = data[i].fillna(data.groupby('basin')[i].transform('mean'))
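
The fallback order above (subvillage, then ward, lga, region, basin) can be checked on a toy frame. A minimal sketch, assuming made-up values and only two grouping levels:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 0 marks a missing gps_height
toy = pd.DataFrame({
    "region": ["a", "a", "a", "b", "b"],
    "ward":   ["w1", "w1", "w2", "w3", "w3"],
    "gps_height": [100.0, 0.0, 0.0, 50.0, 70.0],
})

toy["gps_height"] = toy["gps_height"].replace(0, np.nan)

# Fill from the finest grouping first, then fall back to the coarser one
for level in ["ward", "region"]:
    group_mean = toy.groupby(level)["gps_height"].transform("mean")
    toy["gps_height"] = toy["gps_height"].fillna(group_mean)

print(toy["gps_height"].tolist())
# [100.0, 100.0, 100.0, 50.0, 70.0]
```

Row 1 is filled from its ward (w1), while row 2 has no known value in ward w2 and falls back to the region mean.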

In [None]:
# construction year and population: impute 0 and nan with rounded grouped mean
col2 = ['construction_year', 'population']
data[col2] = data[col2].replace(0, np.nan)
for i in col2:
    data[i] = round(data[i].fillna(data.groupby('subvillage')[i].transform('mean')))
    data[i] = round(data[i].fillna(data.groupby('ward')[i].transform('mean')))
    data[i] = round(data[i].fillna(data.groupby('lga')[i].transform('mean')))
    data[i] = round(data[i].fillna(data.groupby('region')[i].transform('mean')))
    data[i] = round(data[i].fillna(data.groupby('basin')[i].transform('mean')))

In [None]:
# Add age = date recorded - construction year
# Impute negative age with 1
data['age'] = data['date_recorded_year'] - data['construction_year']
data.loc[data['age'] < 0, 'age'] = 1

In [None]:
# Jan and Feb short dry season
# long rains lasts during about March, April and May 
# long dry season lasts throughout June, July, August, September and October 
# During November and December there's another rainy season: the 'short rains'

data.loc[(data['date_recorded_month'] >= 1) & (data['date_recorded_month'] <= 2), 'season'] = 1
data.loc[(data['date_recorded_month'] >= 3) & (data['date_recorded_month'] <= 5), 'season'] = 2
data.loc[(data['date_recorded_month'] >= 6) & (data['date_recorded_month'] <= 10), 'season'] = 3
data.loc[(data['date_recorded_month'] >= 11) & (data['date_recorded_month'] <= 12), 'season'] = 4

data['season']

0        2.0
1        2.0
2        1.0
3        1.0
4        3.0
        ... 
74245    1.0
74246    2.0
74247    2.0
74248    1.0
74249    1.0
Name: season, Length: 74250, dtype: float64
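
The `.loc` assignments produce a float column (see the dtype above). A possible alternative, sketched here with made-up sample months, is a dictionary lookup, which keeps the season codes as integers:

```python
import pandas as pd

# Month -> season code, following the comments above:
# 1 = short dry, 2 = long rains, 3 = long dry, 4 = short rains
month_to_season = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2,
                   6: 3, 7: 3, 8: 3, 9: 3, 10: 3, 11: 4, 12: 4}

months = pd.Series([3, 2, 7, 12])   # hypothetical sample months
season = months.map(month_to_season)
print(season.tolist())
# [2, 1, 3, 4]
```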

In [None]:
from sklearn.cluster import KMeans

clusters = 15
coords = data[['latitude', 'longitude']].values
kmeans = KMeans(n_clusters=clusters, random_state=0)
# fit_transform fits once and returns each point's distance to every cluster center
kmean_feats = pd.DataFrame(kmeans.fit_transform(coords), columns=['gspatial_' + str(i) for i in range(clusters)])


In [None]:
data = pd.concat([data, kmean_feats], axis = 1)
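
The `gspatial_*` columns are distances to the cluster centers, not cluster labels. A small sketch on made-up coordinates (two clusters instead of 15, `n_init` set explicitly as an assumption) shows the shape of what `fit_transform` returns:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates: two tight groups of points
coords_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
dists = km.fit_transform(coords_demo)   # shape (n_samples, n_clusters)
nearest = dists.argmin(axis=1)          # index of the closest center per point

print(dists.shape)
```

Taking `argmin` over the distance columns recovers a single cluster id per row, if a label feature is wanted instead of (or in addition to) the distances.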

# **Data Cleaning - Categorical Features**

In [None]:
categorical_columns = data.select_dtypes(include = 'object').columns.tolist()

*Dealing with columns that contain too many unique values*




In [None]:
# TOO MANY UNIQUE VALUES
#funder                    2140
#installer                 2410
#wpt_name                 45684
#subvillage               21425
#lga                        125
#ward                      2098
#scheme name

In [None]:
# convert to lowercase
col3 = ['funder', 'installer','wpt_name', 'basin', 'subvillage', 'region',
                 'lga', 'ward','scheme_management', 'extraction_type','extraction_type_group',
                 'extraction_type_class','management','management_group','payment','payment_type',
                 'water_quality', 'quality_group','quantity','quantity_group','source','source_type', 
                 'source_class','waterpoint_type','waterpoint_type_group', 'scheme_name']
for i in col3:
  data[i] = data[i].str.lower()

In [None]:
# Mark placeholder values ('0', 'none') as missing; they are filled with the most frequent value further down
col4 = ['funder', 'installer', 'wpt_name', 'subvillage', 'lga', 'ward', 'scheme_name']
data[col4] = data[col4].replace(to_replace = ('0', 'none'), value = np.nan)

In [None]:
data['installer'] = data['installer'].replace(to_replace = ('gover'), value = 'government')
data['installer'] = data['installer'].replace(to_replace = ('commu'), value = 'community')

In [None]:
for i in col4:
    data[i] = data[i].fillna(data[i].mode()[0])
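
The two steps (placeholders to NaN, then NaN to the mode) on a toy column. The values are made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(["government", "0", "none", "government", "community"])  # hypothetical values

s = s.replace(["0", "none"], np.nan)   # placeholders -> missing
s = s.fillna(s.mode()[0])              # missing -> most frequent value

print(s.tolist())
# ['government', 'government', 'government', 'government', 'community']
```

Note that for columns with many missing entries, mode filling makes the most frequent level look even more dominant.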

Dealing with columns containing **not** too many unique values




In [None]:
# Not too many unique values
#basin                        9
#region                      21
#region_code                 27
#district_code               20
#public_meeting               2
#scheme_management           12
#permit                       2
#construction_year           55
#extraction_type             18
#extraction_type_group       13
#extraction_type_class        7
#management                  12
#management_group             5
#payment                      7
#payment_type                 7
#water_quality                8
#quality_group                6
#quantity                     5
#quantity_group               5
#source                      10
#source_type                  7
#source_class                 3
#waterpoint_type              7
#waterpoint_type_group        6
#train                        1
#test                         1
#date_recorded_year           6
#date_recorded_month         12

In [None]:
#public_meeting               2
#scheme_management           12
#permit                       2

In [None]:
col5 = ['public_meeting', 'permit']
for i in col5:
    data[i] = data[i].fillna(data[i].mode()[0])
    data[i] = data[i].astype(str)

In [None]:
# scheme management: replace nan and 'none' with 'other'

data['scheme_management'] = data['scheme_management'].replace(to_replace = (np.nan, 'none'), value = 'other')

In [None]:
#extraction_type             18
#extraction_type_group       13
#extraction_type_class        7

In [None]:
# clean/ replace some values in extraction_type column

data = data.replace({'extraction_type': 
                     {'cemo': 'other motorpump',
                      'climax': 'other motorpump',
                      'india mark ii': 'india mark',
                      'india mark iii': 'india mark',
                      'other - mkulima/shinyanga': 'other handpump',
                      'other - play pump': 'other handpump',
                      'other - rope pump': 'rope pump',
                      'other - swn 81': 'swn',
                      'swn 80': 'swn'
                      }})


In [None]:
# describe columns (run one at a time)

#data[['extraction_type', 'extraction_type_group', 'extraction_type_class']].groupby('extraction_type_group').describe()
#data[['payment', 'payment_type']].groupby('payment').describe()
#data[['water_quality', 'quality_group']].groupby('water_quality').describe()
#data[['quantity', 'quantity_group']].groupby('quantity').describe()
#data[['source', 'source_type', 'source_class']].groupby('source').describe()
#data[['waterpoint_type', 'waterpoint_type_group']].groupby('waterpoint_type').describe()

In [None]:
col6 = ['extraction_type_group', 'payment_type', 'quality_group', 'quantity_group', 'source_type', 'waterpoint_type_group']
data = data.drop(col6, axis = 1)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74250 entries, 0 to 74249
Data columns (total 52 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     74250 non-null  int64  
 1   amount_tsh             74250 non-null  float64
 2   funder                 74250 non-null  object 
 3   gps_height             74250 non-null  float64
 4   installer              74250 non-null  object 
 5   longitude              74250 non-null  float64
 6   latitude               74250 non-null  float64
 7   wpt_name               74250 non-null  object 
 8   basin                  74250 non-null  object 
 9   subvillage             74250 non-null  object 
 10  region                 74250 non-null  object 
 11  region_code            74250 non-null  int64  
 12  district_code          74250 non-null  int64  
 13  lga                    74250 non-null  object 
 14  ward                   74250 non-null  object 
 15  po

# **Split Train and Test**

In [None]:
# Reverse split merged and clean data into train and test
train_values = data[data['train'] == True]
test = data[data['test'] == True]
train_values = train_values.drop(['train', 'test'], axis = 1)
test = test.drop(['train', 'test'], axis = 1)
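
With two flag columns, each one holds NaN for the rows of the other set. A possible simplification, sketched on made-up frames with a hypothetical `is_train` column, is a single boolean flag:

```python
import pandas as pd

train_df = pd.DataFrame({"id": [1, 2], "x": [10, 20]})   # hypothetical frames
test_df = pd.DataFrame({"id": [3], "x": [30]})

combined = pd.concat(
    [train_df.assign(is_train=True), test_df.assign(is_train=False)],
    ignore_index=True,
)

# ... shared cleaning would happen here ...

train_clean = combined[combined["is_train"]].drop(columns="is_train")
test_clean = combined[~combined["is_train"]].drop(columns="is_train")
print(len(train_clean), len(test_clean))
# 2 1
```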

In [None]:
test_set = test.drop('id', axis = 1)
x = train_values.drop('id', axis = 1)

In [None]:
X = x.copy()
y = pd.DataFrame(labels['status_group'])

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y.values.ravel())

In [None]:
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_name_mapping)

{'functional': 0, 'functional needs repair': 1, 'non functional': 2}
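
Predictions come back as the integer codes above, so they need to be mapped back to the original strings before writing a submission file. A minimal sketch with a separate encoder (`le_demo`) and made-up predictions:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
le_demo.fit(["functional", "functional needs repair", "non functional"])

pred_codes = np.array([0, 2, 1])   # hypothetical model output
pred_labels = le_demo.inverse_transform(pred_codes)
print(pred_labels.tolist())
```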


In [None]:
# Split train and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.01, random_state = 123)

In [None]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14850 entries, 36801 to 56271
Data columns (total 49 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   amount_tsh             14850 non-null  float64
 1   funder                 14850 non-null  object 
 2   gps_height             14850 non-null  float64
 3   installer              14850 non-null  object 
 4   longitude              14850 non-null  float64
 5   latitude               14850 non-null  float64
 6   wpt_name               14850 non-null  object 
 7   basin                  14850 non-null  object 
 8   subvillage             14850 non-null  object 
 9   region                 14850 non-null  object 
 10  region_code            14850 non-null  int64  
 11  district_code          14850 non-null  int64  
 12  lga                    14850 non-null  object 
 13  ward                   14850 non-null  object 
 14  population             14850 non-null  float64
 15

# **Pipeline**

In [None]:
#Uncle Steve's Custom Transformer for Category Coalescing

from sklearn.base import BaseEstimator, TransformerMixin

class MyCategoryCoalescer(BaseEstimator, TransformerMixin):
    # Coalesces (smushes/condenses) rare levels of a categorical 
    # feature into "__OTHER__".
    #
    # Will leave the `keep_top` most frequent levels unchanged; the rest
    # will be changed to `"__OTHER__"`.
    #
    # Note that there was a design choice: either have the user
    # pass in the names of the columns to operate on (which I've done here), 
    # or just operate on all the columns (and have the user be responsible for
    # passing in a subset of the dataframe). Pros and cons to each and there's
    # not a single best answer.
    
    def __init__(self, cat_cols=[], keep_top=25):
        self.cat_cols = cat_cols
        self.keep_top = keep_top
        
        # For each cat_col, this dict will hold a list of the most-frequent 
        # levels
        self.top_n_values = {}
            
    def get_top_n_values(self, X, col, n=25):
        # A helper function to do the actual work.

        # Get the sorted value counts for the column
        vc = X[col].value_counts(sort=True, ascending=False)

        # Get the actual values
        vals = list(vc.index)
        if len(vals) > n:
            top_values = vals[0:n]
        else:
            top_values =  vals

        # Debug printing.
        #print("Top n={} values for column={}:".format(n, col))
        #print(top_values)
        return top_values
    
    def fit(self, X, y=None):

        # Find the top n values for each categorical column
        for col in self.cat_cols:
            self.top_n_values[col] = self.get_top_n_values(X, col, n=self.keep_top)
        return self
    
    def transform(self, X, y=None):
        _X = X.copy()
        _X[self.cat_cols] = _X[self.cat_cols].astype('category')
        for c in self.cat_cols:
            _X[c] = _X[c].cat.add_categories('__OTHER__')
            _X.loc[~_X[c].isin(self.top_n_values[c]), c] = "__OTHER__"
        return _X
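
The core idea of the transformer, stripped down to one column: the top levels are learned from the training data and then applied to new data. This is a pandas-only sketch with a hypothetical helper name, not a replacement for the class above:

```python
import pandas as pd

def coalesce_top_n(train_col, new_col, n):
    """Keep the n most frequent levels of train_col; map everything else to '__OTHER__'."""
    top = train_col.value_counts().index[:n]
    return new_col.where(new_col.isin(top), "__OTHER__")

train_col = pd.Series(["a", "a", "a", "b", "b", "c"])   # hypothetical levels
new_col = pd.Series(["a", "b", "c", "d"])

print(coalesce_top_n(train_col, new_col, n=2).tolist())
# ['a', 'b', '__OTHER__', '__OTHER__']
```

Level `d` never appears in the training column, so it lands in `__OTHER__` as well, which is what keeps the downstream encoder from seeing unknown levels in the coalesced columns.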

In [None]:
# Model fit

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder
from sklearn.compose import make_column_selector
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score

categorical_cols = ['funder', 'installer', 'wpt_name', 'basin', 'subvillage', 'region', 'lga', 'ward', 'scheme_management',
                    'extraction_type','extraction_type_class', 'management', 'management_group', 'payment', 'water_quality',
                    'quantity','source','source_class','waterpoint_type', 'permit', 'public_meeting', 'scheme_name']

columns_to_coal = ['funder','installer', 'subvillage', 'lga', 'ward', 'wpt_name', 'scheme_name']

columns_to_scale = ['population', 'gps_height', 'latitude', 'longitude']

coalescer = MyCategoryCoalescer(cat_cols=columns_to_coal, keep_top=25)
encoder = OrdinalEncoder()
scaler = StandardScaler()


cat_transformer = Pipeline([
                            ('coalescer', coalescer),                      
                            ('encoder', encoder)
                           ])

preprocessor = Pipeline(steps = [
                                 ('ct', ColumnTransformer(
                                     transformers=[
                                                   ('categorical', cat_transformer, categorical_cols),
                                                   ('scale', scaler, columns_to_scale)
                                                   ], 
                                                   remainder = 'passthrough', 
                                                   sparse_threshold =0)),
                                 ])
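
`OrdinalEncoder()` with default settings raises an error on categories it did not see during `fit`. For the columns that are not coalesced, one option (an assumption here, it needs scikit-learn 0.24 or later) is to send unknown levels to a sentinel code. A toy sketch of the same ColumnTransformer layout:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy frames; 'z' is a basin level that never appears in training
train_toy = pd.DataFrame({"basin": ["x", "y", "x"],
                          "population": [10.0, 20.0, 30.0],
                          "age": [1, 2, 3]})
new_toy = pd.DataFrame({"basin": ["z"], "population": [20.0], "age": [5]})

ct = ColumnTransformer(
    transformers=[
        ("categorical",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ["basin"]),
        ("scale", StandardScaler(), ["population"]),
    ],
    remainder="passthrough",
)

ct.fit(train_toy)
out = ct.transform(new_toy)
print(out)
```

The output columns follow the transformer order (encoded categoricals, scaled numerics, then the passthrough remainder), so the unseen basin becomes -1, the population at the training mean becomes 0, and `age` is passed through unchanged.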

# *Random Forest (optuna tuned)*

In [None]:
# Random Forest Tuned
# Previously recorded hyperparameters (they differ from the values used below):
# {'rf__max_depth': 20, 'rf__min_samples_split': 5, 'rf__n_estimators': 1000}

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(criterion = 'gini',
                            n_estimators = 536,
                            min_samples_split = 8,
                            max_depth = 20,
                            random_state = 42)

rf_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('rf', rf)])
rf_pipeline.fit(X_train,y_train)
y_pred_rf_pipeline = rf_pipeline.predict(X_test)

In [None]:
print("Accuracy of RF = {:.4f}".format(accuracy_score(y_test, y_pred_rf_pipeline)))

Accuracy of RF = 0.8418


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_rf_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_rf_pipeline))

In [None]:
import optuna
def objective(trial):
    

    param = {
        "criterion": trial.suggest_categorical("criterion", ['gini', 'entropy']),
        "min_samples_split": trial.suggest_int("min_samples_split", 2,10),
        "n_estimators": trial.suggest_int("n_estimators", 200,1500),
        "max_depth": trial.suggest_int("max_depth", 5, 50)
    }

    rf = RandomForestClassifier(**param)
    rf_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('rf', rf)])
    
    return cross_val_score(rf_pipeline, X, y, cv = 3).mean()
    
if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=10)

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

[I 2021-11-16 10:40:02,033] A new study created in memory with name: no-name-4e962812-b08c-43ac-916e-97406778cbe3
[I 2021-11-16 10:42:50,490] Trial 0 finished with value: 0.7937205387205388 and parameters: {'criterion': 'gini', 'min_samples_split': 10, 'n_estimators': 470, 'max_depth': 14}. Best is trial 0 with value: 0.7937205387205388.
[I 2021-11-16 11:01:18,854] Trial 1 finished with value: 0.806969696969697 and parameters: {'criterion': 'entropy', 'min_samples_split': 4, 'n_estimators': 1430, 'max_depth': 49}. Best is trial 1 with value: 0.806969696969697.
[I 2021-11-16 11:07:59,682] Trial 2 finished with value: 0.8093771043771044 and parameters: {'criterion': 'entropy', 'min_samples_split': 7, 'n_estimators': 516, 'max_depth': 26}. Best is trial 2 with value: 0.8093771043771044.
[I 2021-11-16 11:13:46,526] Trial 3 finished with value: 0.8093602693602694 and parameters: {'criterion': 'gini', 'min_samples_split': 9, 'n_est

Number of finished trials: 10
Best trial:
  Value: 0.8093771043771044
  Params: 
    criterion: entropy
    min_samples_split: 7
    n_estimators: 516
    max_depth: 26


In [None]:
#Number of finished trials: 10
#Best trial:
#  Value: 0.8093771043771044
#  Params: 
#    criterion: entropy
#    min_samples_split: 7
#    n_estimators: 516
#    max_depth: 26
        
from sklearn.ensemble import RandomForestClassifier

# Note: the best trial above used criterion = 'entropy'; this cell keeps 'gini'
rf = RandomForestClassifier(criterion = 'gini',
                            n_estimators = 516,
                            min_samples_split = 7,
                            max_depth = 26,
                            random_state = 42)

rf_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('rf', rf)])
rf_pipeline.fit(X_train,y_train)
y_pred_rf_pipeline = rf_pipeline.predict(X_test)

In [None]:
print("Accuracy of RF = {:.4f}".format(accuracy_score(y_test, y_pred_rf_pipeline)))

Accuracy of RF = 0.8468


# *LGBM (optuna tuned)*

In [None]:
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(boosting_type = 'gbdt',
                      objective = 'multiclass',
                      num_class = 3,
                      metric = 'multi_error',
                      num_iterations = 200,
                      lambda_l1 =  2.2899315163770417e-06,
                      lambda_l2 =  2.6273452242794607e-06,
                      num_leaves = 239,
                      feature_fraction = 0.5633644014015632,
                      learning_rate = 0.06012805964180289,
                      bagging_fraction = 0.6953776886469089,
                      bagging_freq = 6,
                      min_child_samples = 47,
                      min_data_in_leaf = 17,
                      max_depth = 46
                      )

lgbm_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('lgbm', lgbm)])
lgbm_pipeline.fit(X_train,y_train)
y_pred_lgbm_pipeline = lgbm_pipeline.predict(X_test)





In [None]:
print("Accuracy of LGBM   = {:.4f}".format(accuracy_score(y_test, y_pred_lgbm_pipeline)))

Accuracy of LGBM   = 0.8333


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_lgbm_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_lgbm_pipeline))

# *Catboost (hyperparameter tuned)*

In [None]:
%pip install catboost

In [None]:
from catboost import CatBoostClassifier

cat = CatBoostClassifier(depth = 10,
                         iterations = 500,
                         learning_rate = 0.05,
                         random_state = 42)

cat_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('catboost', cat)])
cat_pipeline.fit(X_train,y_train)
y_pred_cat_pipeline = cat_pipeline.predict(X_test)

In [None]:
print("Accuracy of Catboost   = {:.4f}".format(accuracy_score(y_test, y_pred_cat_pipeline)))

In [None]:
import optuna
from catboost import CatBoostClassifier

def objective(trial):
    

    param = {
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical(
            "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]
        ),
        
        "used_ram_limit": "2gb",
    }

    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)

    cat = CatBoostClassifier(**param)
    cat_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('catboost', cat)])
    
    return cross_val_score(cat_pipeline, X, y, cv = 3).mean()
    
if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=10)

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
#Number of finished trials: 10
#Best trial:
#  Value: 0.8049831649831649
#  Params: 
#    colsample_bylevel: 0.07369920952387737
#    depth: 12
#    boosting_type: Plain
#    bootstrap_type: MVS

from catboost import CatBoostClassifier

cat = CatBoostClassifier(colsample_bylevel = 0.073699209523,
                         depth = 12,
                         boosting_type = 'Plain',
                         bootstrap_type = 'MVS',
                         random_state = 42)

cat_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('catboost', cat)])
cat_pipeline.fit(X_train,y_train)
y_pred_cat_pipeline = cat_pipeline.predict(X_test)

In [None]:
print("Accuracy of Catboost   = {:.4f}".format(accuracy_score(y_test, y_pred_cat_pipeline)))

Accuracy of Catboost   = 0.8316


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_cat_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_cat_pipeline))

# *XG Boost*

In [None]:
from xgboost import XGBClassifier

#Number of finished trials: 30
#Best trial:
#  Value: 0.8013131313131314
#  Params: 
#    booster: dart
#    lambda: 4.572637572518502e-07
#    alpha: 6.037662427475617e-05
#    subsample: 0.7162353406216146
#    colsample_bytree: 0.8486248682584188
#    max_depth: 7
#    min_child_weight: 9
#    eta: 0.3563123559925298
#    gamma: 5.017895421049517e-05
#    grow_policy: depthwise
#    sample_type: uniform
#    normalize_type: forest
#    rate_drop: 0.012104590680294654
#    skip_drop: 0.00036189755567904127

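# Note: `lambda` from the best trial is not passed below (it is a reserved word in Python;
# XGBClassifier exposes it as reg_lambda), so L2 regularization stays at its default.
xg = XGBClassifier(booster = 'dart',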
                   alpha =  6.037662427475617e-05,
                   subsample = 0.7162353406216146,
                   colsample_bytree = 0.8486248682584188,
                   max_depth = 7,
                   min_child_weight = 9,
                   eta = 0.3563123559925298,
                   gamma = 5.017895421049517e-05,
                   grow_policy = 'depthwise',
                   sample_type = 'uniform',
                   normalize_type = 'forest',
                   rate_drop = 0.012104590680294654,
                   skip_drop = 0.00036189755567904127,
                   objective='multi:softmax',
                   use_label_encoder = False)

xg_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('xgboost', xg)])
xg_pipeline.fit(X_train,y_train)
y_pred_xg_pipeline = xg_pipeline.predict(X_test)

In [None]:
print("Accuracy of XGB   = {:.4f}".format(accuracy_score(y_test, y_pred_xg_pipeline)))

In [None]:
import optuna

def objective(trial):
    
  
    param = {
        "verbosity": 0,
        # the target has three classes, so use a multiclass objective
        "objective": "multi:softprob",
        # use exact for small dataset.
        "tree_method": "exact",
        # defines booster, gblinear for linear functions.
        "booster": trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"]),
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    if param["booster"] in ["gbtree", "dart"]:
        # maximum depth of the tree, signifies complexity of the tree.
        param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
        # minimum child weight, larger the term more conservative the tree.
        param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
        param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
        # defines how selective algorithm is.
        param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
        param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    if param["booster"] == "dart":
        param["sample_type"] = trial.suggest_categorical("sample_type", ["uniform", "weighted"])
        param["normalize_type"] = trial.suggest_categorical("normalize_type", ["tree", "forest"])
        param["rate_drop"] = trial.suggest_float("rate_drop", 1e-8, 1.0, log=True)
        param["skip_drop"] = trial.suggest_float("skip_drop", 1e-8, 1.0, log=True)

    xg = XGBClassifier(**param)
    xg_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('xgboost', xg)])
    
    return cross_val_score(xg_pipeline, X, y, cv = 3).mean()


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=30)

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_xg_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_xg_pipeline))

# *Extra Trees*

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

xt = ExtraTreesClassifier(n_estimators=200,
                          random_state=42)

xt_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('extra trees', xt)])
xt_pipeline.fit(X_train,y_train)
y_pred_xt_pipeline = xt_pipeline.predict(X_test)

In [None]:
print("Accuracy of EXTRA TREES   = {:.4f}".format(accuracy_score(y_test, y_pred_xt_pipeline)))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_xt_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_xt_pipeline))

# *Bagging*

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bag =  BaggingClassifier(n_estimators=100,
                         max_features = 0.5,
                         random_state=42)

bag_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('bagging', bag)])
bag_pipeline.fit(X_train,y_train)
y_pred_bag_pipeline = bag_pipeline.predict(X_test)

In [None]:
print("Accuracy of BAGGING   = {:.4f}".format(accuracy_score(y_test, y_pred_bag_pipeline)))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_bag_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_bag_pipeline))

# *Voting Classifier*

In [None]:
from sklearn.ensemble import VotingClassifier

est_list = [('rf', rf), ('xgboost', xg), ('lgbm', lgbm)]


vclf = VotingClassifier(estimators = est_list, voting='soft')


vote_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('voting', vclf)])

vote_pipeline.fit(X_train,y_train)
y_pred_vote_pipeline = vote_pipeline.predict(X_test)

In [None]:
print("Accuracy of VOTING = {:.4f}".format(accuracy_score(y_test, y_pred_vote_pipeline)))
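
With `voting='soft'` the ensemble averages the class probabilities of its members and picks the class with the highest average. A numpy-only sketch with made-up probabilities shows how that can differ from a majority vote:

```python
import numpy as np

# Hypothetical predict_proba outputs from three models: two samples, three classes
p_rf   = np.array([[0.9, 0.05, 0.05], [0.2, 0.2, 0.6]])
p_xgb  = np.array([[0.4, 0.10, 0.50], [0.1, 0.3, 0.6]])
p_lgbm = np.array([[0.4, 0.10, 0.50], [0.3, 0.3, 0.4]])

avg = np.mean([p_rf, p_xgb, p_lgbm], axis=0)   # unweighted average of probabilities
soft_vote = avg.argmax(axis=1)
print(soft_vote.tolist())
# [0, 2]
```

For the first sample two of the three models prefer class 2, so a majority vote would pick class 2, but the one very confident model pulls the averaged probability towards class 0.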

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier


rf = RandomForestClassifier(criterion = 'gini',
                            n_estimators = 536,
                            min_samples_split = 8,
                            max_depth = 20,
                            random_state = 42)

xg = XGBClassifier(booster = 'dart',
                   alpha =  6.037662427475617e-05,
                   subsample = 0.7162353406216146,
                   colsample_bytree = 0.8486248682584188,
                   max_depth = 7,
                   min_child_weight = 9,
                   eta = 0.3563123559925298,
                   gamma = 5.017895421049517e-05,
                   grow_policy = 'depthwise',
                   sample_type = 'uniform',
                   normalize_type = 'forest',
                   rate_drop = 0.012104590680294654,
                   skip_drop = 0.00036189755567904127,
                   objective='multi:softmax',
                   use_label_encoder = False)

xt = ExtraTreesClassifier(n_estimators=200,
                          random_state=42)


bag =  BaggingClassifier(n_estimators=100,
                         max_features = 0.5,
                         random_state=42)

cat = CatBoostClassifier(colsample_bylevel = 0.073699209523,
                         depth = 12,
                         boosting_type = 'Plain',
                         bootstrap_type = 'MVS',
                         random_state = 42)

lgbm = LGBMClassifier(boosting_type = 'gbdt',
                      objective = 'multiclass',
                      num_class = 3,
                      metric = 'multi_error',
                      num_iterations = 200,
                      lambda_l1 =  2.2899315163770417e-06,
                      lambda_l2 =  2.6273452242794607e-06,
                      num_leaves = 239,
                      feature_fraction = 0.5633644014015632,
                      learning_rate = 0.06012805964180289,
                      bagging_fraction = 0.6953776886469089,
                      bagging_freq = 6,
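                      # NOTE: min_child_samples and min_data_in_leaf are aliases of the
                      # same LightGBM parameter, so only one of these values takes effect.
                      min_child_samples = 47,
                      min_data_in_leaf = 17,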
                      max_depth = 46
                      )

rf_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('rf', rf)])
xg_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('xgboost', xg)])
xt_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('extra trees', xt)])
bag_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('bagging', bag)])
cat_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('catboost', cat)])
lgbm_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('lgbm', lgbm)])


est_list = [('rf', rf), ('xgboost', xg), ('extra trees', xt), ('bagging', bag), ('catboost', cat), ('lgbm', lgbm)]
vclf = VotingClassifier(estimators = est_list, voting='soft')


vote_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('voting', vclf)])

vote_pipeline.fit(X,y)

In [None]:
accuracy = cross_val_score(vote_pipeline, X, y, cv = 5).mean()
accuracy

In [None]:
accuracy

0.8176430976430977

In [None]:
print("Accuracy of VOTING = {:.4f}".format(accuracy_score(y_test, y_pred_vote_pipeline)))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_vote_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_vote_pipeline))

# *Stacking*

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

#est_list = [('rf', rf), ('xgboost', xg), ('extra trees', xt), ('bagging', bag), ('catboost', cat), ('lgbm', lgbm)]
est_list = [('rf', rf), ('xgboost', xg), ('lgbm', lgbm)]

sclf = StackingClassifier(estimators = est_list,
                          final_estimator = LogisticRegression())

stacking_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('stacking', sclf)])

stacking_pipeline.fit(X_train,y_train)
y_pred_stacking_pipeline = stacking_pipeline.predict(X_test)

In [None]:
print("Accuracy of STACKING = {:.4f}".format(accuracy_score(y_test, y_pred_stacking_pipeline)))

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

est_list = [('rf', rf), ('xgboost', xg), ('extra trees', xt), ('bagging', bag), ('catboost', cat), ('lgbm', lgbm)]
#est_list = [('rf', rf), ('xgboost', xg), ('lgbm', lgbm)]

sclf = StackingClassifier(estimators = est_list,
                          final_estimator = LogisticRegression())

stacking_pipeline = Pipeline(steps = [('preprocess', preprocessor), ('stacking', sclf)])

stacking_pipeline.fit(X_train,y_train)
y_pred_stacking_pipeline = stacking_pipeline.predict(X_test)

In [None]:
print("Accuracy of STACKING = {:.4f}".format(accuracy_score(y_test, y_pred_stacking_pipeline)))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_stacking_pipeline))
pd.DataFrame(confusion_matrix(y_test, y_pred_stacking_pipeline))
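
The stacking cells above score the model on a single hold-out split. As a minimal sketch, the same kind of stack can also be scored with `cross_val_score`, as was done for the voting pipeline. The example below uses a synthetic dataset and two small base models so that it runs in seconds. `X_toy`, `y_toy`, `toy_stack` and the estimator settings are assumptions for illustration, not the tuned models from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem standing in for the pump data (illustration only).
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=42)

toy_estimators = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('xt', ExtraTreesClassifier(n_estimators=20, random_state=42)),
]

# The final estimator is trained on cross-validated predictions of the base models.
toy_stack = StackingClassifier(estimators=toy_estimators,
                               final_estimator=LogisticRegression(max_iter=1000),
                               cv=3)

# 3-fold cross-validated accuracy of the whole stack.
toy_scores = cross_val_score(toy_stack, X_toy, y_toy, cv=3)
print("Mean CV accuracy = {:.4f}".format(toy_scores.mean()))

# After fitting, transform() exposes the meta-features fed to the final
# estimator: one probability column per class per base model.
toy_stack.fit(X_toy, y_toy)
meta_features = toy_stack.transform(X_toy)
print(meta_features.shape)
```

With two base models and three classes, the logistic regression sees six meta-features per row. Adding more base models, as in the six-model stack above, widens that input accordingly.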

# **Predictions to CSV**

In [None]:
# Predictions
# Uncomment whichever model's prediction is desired

#RF
#y_pred_test = rf_pipeline.predict(test_set)

#XGBoost
#y_pred_test = xg_pipeline.predict(test_set)

#Extra Trees
#y_pred_test = xt_pipeline.predict(test_set)

#Stacking
#y_pred_test = stacking_pipeline.predict(test_set)

#Voting
y_pred_test = vote_pipeline.predict(test_set)

#{'functional': 0, 'functional needs repair': 1, 'non functional': 2}

In [None]:
predictions = pd.DataFrame({'id': test.id,
                            'status_group': y_pred_test})
predictions

Unnamed: 0,id,status_group
59400,50785,2
59401,51630,0
59402,17168,0
59403,45559,2
59404,49871,0
...,...,...
74245,39307,2
74246,18990,0
74247,28749,0
74248,33492,0


In [None]:
# Decode the integer labels back to the original class names.
# Mapping the whole column at once avoids writing strings into an integer column.
label_names = {0: 'functional', 1: 'functional needs repair', 2: 'non functional'}
predictions['status_group'] = predictions['status_group'].map(label_names)

In [None]:
predictions

Unnamed: 0,id,status_group
59400,50785,non functional
59401,51630,functional
59402,17168,functional
59403,45559,non functional
59404,49871,functional
...,...,...
74245,39307,non functional
74246,18990,functional
74247,28749,functional
74248,33492,functional


In [None]:
# Saving file
predictions.to_csv('my_submission.csv', header=True, index=False)

#from google.colab import files
#files.download('my_submission.csv')