# Learn With Other Kaggle Users

*Author: Christian Camilo Urcuqui López*

*Date: 27 August 2019*

*GitHub: <https://github.com/urcuqui/>*

In this notebook I'm going to use a data science approach in order to evaluate machine learning classifiers for a friendly competition (*Classify forest types based on information about the area*).

This work is divided into the following sections:

+ [Data Description](#Data-Description)
+ [Packages](#Packages)
+ [Explore](#Explore)
    + [Dimensions](#Dimensions)
    + [Treatments](#Treatments)
        + [NaN](#NaN)
        + [Duplicated Data](#Duplicated-Data)    
    + [Exploration Quantitative Features](#Exploration-Quantitative-Features) 
    + [Exploration Qualitative Features](#Exploration-Qualitative-Features)
+ [Modeling](#Modeling)

In [1]:
from datetime import datetime

print("last update: {}".format(datetime.now())) 

last update: 2019-10-25 23:31:03.333121


# Data Description

**Classify forest types based on information about the area**

**About this Competition**

The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations).

## Data Fields

+ **Elevation** - Elevation in meters
+ **Aspect** - Aspect in degrees azimuth
+ **Slope** - Slope in degrees
+ **Horizontal_Distance_To_Hydrology** - Horz Dist to nearest surface water features
+ **Vertical_Distance_To_Hydrology** - Vert Dist to nearest surface water features
+ **Horizontal_Distance_To_Roadways** - Horz Dist to nearest roadway
+ **Hillshade_9am** (0 to 255 index) - Hillshade index at 9am, summer solstice
+ **Hillshade_Noon** (0 to 255 index) - Hillshade index at noon, summer solstice
+ **Hillshade_3pm** (0 to 255 index) - Hillshade index at 3pm, summer solstice
+ **Horizontal_Distance_To_Fire_Points** - Horz Dist to nearest wildfire ignition points
+ **Wilderness_Area** (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
+ **Soil_Type** (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
+ **Cover_Type** (7 types, integers 1 to 7) - Forest Cover Type designation

The wilderness areas are:

1. Rawah Wilderness Area
2. Neota Wilderness Area
3. Comanche Peak Wilderness Area
4. Cache la Poudre Wilderness Area

The soil types are:

1. Cathedral family - Rock outcrop complex, extremely stony.
2. Vanet - Ratake families complex, very stony.
3. Haploborolis - Rock outcrop complex, rubbly.
4. Ratake family - Rock outcrop complex, rubbly.
5. Vanet family - Rock outcrop complex complex, rubbly.
6. Vanet - Wetmore families - Rock outcrop complex, stony.
7. Gothic family.
8. Supervisor - Limber families complex.
9. Troutville family, very stony.
10. Bullwark - Catamount families - Rock outcrop complex, rubbly.
11. Bullwark - Catamount families - Rock land complex, rubbly.
12. Legault family - Rock land complex, stony.
13. Catamount family - Rock land - Bullwark family complex, rubbly.
14. Pachic Argiborolis - Aquolis complex.
15. unspecified in the USFS Soil and ELU Survey.
16. Cryaquolis - Cryoborolis complex.
17. Gateview family - Cryaquolis complex.
18. Rogert family, very stony.
19. Typic Cryaquolis - Borohemists complex.
20. Typic Cryaquepts - Typic Cryaquolls complex.
21. Typic Cryaquolls - Leighcan family, till substratum complex.
22. Leighcan family, till substratum, extremely bouldery.
23. Leighcan family, till substratum - Typic Cryaquolls complex.
24. Leighcan family, extremely stony.
25. Leighcan family, warm, extremely stony.
26. Granile - Catamount families complex, very stony.
27. Leighcan family, warm - Rock outcrop complex, extremely stony.
28. Leighcan family - Rock outcrop complex, extremely stony.
29. Como - Legault families complex, extremely stony.
30. Como family - Rock land - Legault family complex, extremely stony.
31. Leighcan - Catamount families complex, extremely stony.
32. Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
33. Leighcan - Catamount families - Rock outcrop complex, extremely stony.
34. Cryorthents - Rock land complex, extremely stony.
35. Cryumbrepts - Rock outcrop - Cryaquepts complex.
36. Bross family - Rock land - Cryumbrepts complex, extremely stony.
37. Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
38. Leighcan - Moran families - Cryaquolls complex, extremely stony.
39. Moran family - Cryorthents - Leighcan family complex, extremely stony.
40. Moran family - Cryorthents - Rock land complex, extremely stony.

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Packages

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import FeatureHasher
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.preprocessing import PowerTransformer

In [4]:
# seed
np.random.seed(1231)

## Load Datasets

In [5]:
df_train = pd.read_csv("/kaggle/input/learn-together/train.csv" , index_col=['Id'])
df_test = pd.read_csv("/kaggle/input/learn-together/test.csv" , index_col=['Id'])

FileNotFoundError: [Errno 2] File b'/kaggle/input/learn-together/train.csv' does not exist: b'/kaggle/input/learn-together/train.csv'

# Explore

In this phase I'm going to take a first look at the data.

## Dimensions

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
print("shape training csv: %s" % str(df_train.shape)) 
print("shape test csv: %s" % str(df_test.shape)) 

Well... the test set is much larger than the training set.

Next, let's see what the feature types are.

In [None]:
df_train.dtypes.value_counts()

In [None]:
df_test.dtypes.value_counts()

The last two cells show that we need to convert the variables to their correct representation: the columns from position 10 onward are categorical indicators, not quantities.

In [None]:
df_train.iloc[:,10:].columns

In [None]:
df_test.iloc[:,10:].columns

## Treatments

In [None]:
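# Use astype with a column mapping: with recent pandas, assigning through .iloc can keep the old integer dtype
df_train = df_train.astype({col: "category" for col in df_train.columns[10:]})
df_test = df_test.astype({col: "category" for col in df_test.columns[10:]})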

### NaN

Let's see how many NaN we have in our datasets

In [None]:
df_train.isna().sum().sum()

In [None]:
df_test.isna().sum().sum()

From the last two cells we can conclude that neither dataset contains NaN values.

### Duplicated Data

Let's see if we have duplicated data in our datasets

In [None]:
df_train[df_train.duplicated()].shape

## Exploration Quantitative Features

In [None]:
df_train.describe()

Is it possible to have a negative value in *vertical distance to hydrology*?

In [None]:
df_test.describe()

In [None]:
# How many negative values of Vertical_Distance_To_Hydrology are there?

print("percent of negative values (training): " + '%.3f' % ((df_train.Vertical_Distance_To_Hydrology < 0).mean() * 100))
print("percent of negative values (testing): " + '%.3f' % ((df_test.Vertical_Distance_To_Hydrology < 0).mean() * 100))

In [None]:
sns.boxplot(df_train.Vertical_Distance_To_Hydrology)

In [None]:
sns.boxplot(df_test.Vertical_Distance_To_Hydrology)

**Wow, we definitely have something here: a lot of outliers for this feature.**

What about the other features? What are their distributions?

In [None]:
columns_t_analyze = df_train.select_dtypes(["float64", "int64"]).columns.tolist()
columns_t_analyze.append("Cover_Type")
plot = sns.pairplot(df_train.loc[:,columns_t_analyze], hue="Cover_Type")
plot.savefig("pairplot.png")

From the last plot we can see that some features can be used to separate the forest types. Some of them might be:
+ Elevation
+ Horizontal_Distance_To_Hydrology

As we saw, the quantitative variables are on very different scales, and the boxplots and distribution plots show that they contain outliers (too many to simply delete them). We will need to standardize them.

Because our quantitative features have outliers, the idea is to use a scaling method that tolerates them. A *RobustScaler* is one candidate because it is robust to outliers (as is a quantile transformer); the cells below apply a Yeo-Johnson *PowerTransformer* and keep the *RobustScaler* call commented out as an alternative.
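To make the difference between the two options concrete, here is a minimal sketch on synthetic, right-skewed data (not the competition data). The `skewness` helper and the log-normal sample are assumptions introduced only for this illustration: `RobustScaler` recenters and rescales but leaves the shape of the distribution unchanged, while `PowerTransformer` also reshapes it.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, RobustScaler

# Synthetic, right-skewed feature with some extreme values (illustrative only).
rng = np.random.RandomState(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# RobustScaler: subtract the median and divide by the interquartile range.
robust = RobustScaler(quantile_range=(25, 75)).fit_transform(x)

# PowerTransformer (Yeo-Johnson): monotonic transform towards a Gaussian shape,
# followed by zero-mean / unit-variance standardization (the default).
power = PowerTransformer(method="yeo-johnson").fit_transform(x)


def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


print("median after RobustScaler:", float(np.median(robust)))
print("skew raw / robust / power:", skewness(x), skewness(robust), skewness(power))
```

Because the robust scaling is a linear map, the skewness of `robust` matches the raw data; only the power transform reduces it.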

In [None]:
columns_t_analyze = df_train.select_dtypes(["float64", "int64"])
transformer =  PowerTransformer(method='yeo-johnson').fit(columns_t_analyze)

In [None]:
columns_t_analyze = df_train.select_dtypes(["float64", "int64"])
#columns_transformed =  RobustScaler(quantile_range=(25, 75)).fit_transform(columns_t_analyze)
columns_transformed =  PowerTransformer(method='yeo-johnson').fit_transform(columns_t_analyze)

In [None]:
# Keep the original Id index so the inner join with Cover_Type aligns row by row
columns_transformed = pd.DataFrame(columns_transformed, index=columns_t_analyze.index, columns=columns_t_analyze.columns)
columns_transformed = pd.concat([columns_transformed, df_train.loc[:,"Cover_Type"]], axis=1, join='inner')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(columns_transformed.loc[:,columns_transformed.columns].drop("Cover_Type", axis=1), columns_transformed.loc[:,'Cover_Type'], test_size=0.33, random_state=42)

In [None]:
X_train.shape

In [None]:
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)
lsvc.fit(X_train, y_train)
pred = lsvc.predict(X_test)
print("LinearSVC")
print(classification_report(y_test,pred, labels=None))

In [None]:
from sklearn.linear_model import SGDClassifier

sgdc= SGDClassifier()
sgdc.fit(X_train, y_train)
pred = sgdc.predict(X_test)
print("SGDC")
print(classification_report(y_test,pred, labels=None))

In [None]:
from sklearn.ensemble import RandomForestClassifier
randomfr= RandomForestClassifier()
randomfr.fit(X_train, y_train)
pred = randomfr.predict(X_test)
print("randomfr")
print(classification_report(y_test,pred, labels=None))

In [None]:
model = SelectFromModel(randomfr, prefit=True)
X_new = model.transform(X_train)
X_new.shape

Using the last random forest, we can filter some of the quantitative features.
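As a hedged sketch of how the names of the selected columns could be recovered instead of reading them off by hand: `SelectFromModel.get_support()` returns a boolean mask over the input columns. The data below is synthetic and the column names `f0` to `f5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data; with shuffle=False the informative features come first.
X_arr, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                               n_redundant=0, shuffle=False, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f{}".format(i) for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y)

# prefit=True reuses the fitted forest; the default threshold is the mean importance.
selector = SelectFromModel(forest, prefit=True)
mask = selector.get_support()            # boolean mask, one entry per input column
kept_columns = X_demo.columns[mask].tolist()
print(kept_columns)
```

The same mask applied to `X_train.columns` in the notebook would give the names of the features kept by the selector.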

In [None]:
# elevation, horizontal_distance_to_roadways, horizontal_distance_to_fire_points

pd.DataFrame(X_new).describe()

In [None]:
# Reuse the index of X_train so each selected row is paired with its own label
sns.pairplot(pd.concat([pd.DataFrame(X_new, index=X_train.index), y_train], axis=1), hue="Cover_Type")

Nice!!! Here we are looking at homoscedasticity 🤖

In [None]:
sns.pairplot(columns_transformed.drop(columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points'], axis=1), hue="Cover_Type")

From the last plot, it seems that some quantitative features need further treatment. Let's work on that.

***
What do the distributions of our quantitative features look like in the test set?

In [None]:
df_train.columns

In [None]:
fig, axs = plt.subplots(nrows=2)
sns.boxplot(df_train.Hillshade_3pm, ax=axs[0])
sns.boxplot(df_test.Hillshade_3pm, ax=axs[1], color="green")

In [None]:
#training 
quan = df_train.select_dtypes(["int", "float64"])
Q1 = quan.quantile(0.25)
Q3 = quan.quantile(0.75)
IQR =  Q3 - Q1

(((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).sum() / quan.shape[0]) * 100

In [None]:
#testing 
quan = df_test.select_dtypes(["int", "float"])
Q1 = quan.quantile(0.25)
Q3 = quan.quantile(0.75)
IQR =  Q3 - Q1

(((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).sum() / quan.shape[0]) * 100

The last two tables show that the percentage of outliers in each quantitative feature is not high on its own, so the next idea is to see which features are the most important and how much their outliers, taken together, reduce the number of records.
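The IQR rule used in the two cells above can be wrapped in a small helper so that the mask expression is not repeated. This is only a sketch: `iqr_outlier_mask` and the `demo` frame are hypothetical names introduced here, and the values are made up.

```python
import pandas as pd


def iqr_outlier_mask(frame, k=1.5):
    """Return a boolean DataFrame that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR] for its column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < (q1 - k * iqr)) | (frame > (q3 + k * iqr))


# Tiny made-up example (values are illustrative, not taken from the dataset).
demo = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
mask = iqr_outlier_mask(demo)

print(mask.sum() / len(demo) * 100)     # percent of outliers per column
print(demo.loc[~mask.any(axis=1)])      # rows with no outlier in any column
```

Selecting a single column of the mask, for example `mask["a"]`, corresponds to the per-feature filters used in the following cells.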

Let's see the distribution of some training features without outliers.

In [None]:
quan = df_train.select_dtypes(["int", "float"]).copy()
Q1 = quan.quantile(0.25)
Q3 = quan.quantile(0.75)
IQR =  Q3 - Q1
sns.boxplot(quan.loc[~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_3pm].Hillshade_3pm)

In [None]:
for i in list(range(1,8)):
    sns.distplot(df_train.loc[(~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_3pm) & (df_train.Cover_Type == i), 'Hillshade_3pm'])


Nice, it looks pretty 

In [None]:
print("Normal shape {}".format(quan.shape[0]))
print("Without outliers {}".format(quan.loc[~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_3pm].shape[0]))

In [None]:
sns.boxplot(quan.loc[~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Slope].Slope)

In [None]:
for i in list(range(1,8)):
    sns.distplot(df_train.loc[(~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Slope) & (df_train.Cover_Type == i), 'Slope'])


In [None]:
print("Normal shape {}".format(quan.shape[0]))
print("Without outliers {}".format(quan.loc[~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Slope].shape[0]))

In [None]:
sns.boxplot(quan.loc[~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_Noon].Hillshade_Noon)

In [None]:
for i in list(range(1,8)):
    sns.distplot(df_train.loc[(~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_Noon) & (df_train.Cover_Type == i), 'Hillshade_Noon'])


In [None]:
print("Normal shape {}".format(quan.shape[0]))
print("Without outliers {}".format(quan.loc[~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_Noon].shape[0]))

How many rows will we lose if we remove the outliers of all three features?

In [None]:
quan.shape [0] - quan.loc[(~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_Noon) & (~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Slope)
        & (~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_3pm)].shape[0]

So... 453 rows would be lost. These rows may be important to analyze later, so I keep the outlier-free data in a separate DataFrame instead of modifying `df_train`.

In [None]:
df_train_copy = df_train[(~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_Noon) & (~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Slope)
        & (~((quan < (Q1 - 1.5 * IQR)) | (quan > (Q3 + 1.5 * IQR))).Hillshade_3pm)].copy()

In [None]:
df_train_copy.shape

In [None]:
columns_t_analyze = df_train_copy.select_dtypes(["float64", "int64"])
#transformer =  PowerTransformer(method='yeo-johnson').fit(columns_t_analyze)
#columns_transformed = transformer.transform(columns_t_analyze)
#columns_transformed = pd.DataFrame(columns_transformed)
#columns_transformed.columns = columns_t_analyze.columns
#columns_transformed = pd.concat([columns_transformed, df_train_copy.loc[:,"Cover_Type"]], axis=1, join='inner')
X_train, X_test, y_train, y_test = train_test_split(df_train_copy.loc[:,columns_t_analyze.columns], df_train_copy.loc[:,'Cover_Type'], test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
randomfr= RandomForestClassifier()
randomfr.fit(X_train, y_train)
pred = randomfr.predict(X_test)
print("randomfr")
print(classification_report(y_test,pred, labels=None))

In [None]:
model = SelectFromModel(randomfr, prefit=True)
X_new = model.transform(X_train)
X_new.shape

In [None]:
pd.DataFrame(X_new).head()

In [None]:
X_train.head()

The selected features are Horizontal_Distance_To_Roadways and Elevation.

## Exploration Qualitative Features

Our target variable is Cover_Type, a categorical feature with 7 values... what is its distribution?

In [None]:
df_train.Cover_Type.value_counts()

The last output shows that the target classes are balanced, so we will not have to deal with class imbalance.

We have a lot of categorical features, but, how many of them are really representative for our objective variable?

We can use statistical tests, convert them to dummy variables, or use feature hashing in order to filter and select the categorical features. In this case I'm going to label-encode them with a `LabelEncoder` and analyze the feature importances using a decision tree.
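For reference, feature hashing itself could be sketched as follows with scikit-learn's `FeatureHasher` (imported in the Packages section). The rows and the choice of `n_features=8` are made up for illustration; the cells that follow use a `LabelEncoder`.

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is the list of indicator names that are "on" for that row.
# The rows below are invented for illustration.
samples = [
    ["Wilderness_Area1", "Soil_Type10"],
    ["Wilderness_Area4", "Soil_Type38"],
    ["Wilderness_Area1", "Soil_Type38"],
]

# n_features=8 is an arbitrary size for this sketch; hash collisions are possible.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform(samples)      # scipy sparse matrix with one row per sample

print(hashed.shape)
print(hashed.toarray())
```

Hashing fixes the output width up front, which helps with very high-cardinality categories; with only 44 binary columns here, the simpler encoding is a reasonable choice.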

In [None]:
X = df_train.select_dtypes("category").drop(columns=["Cover_Type"])

In [None]:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
d = defaultdict(LabelEncoder)
fit = X.apply(lambda x: d[x.name].fit_transform(x))

In [None]:
fit.columns

In [None]:
Y_train = df_train.loc[:,'Cover_Type']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(fit, Y_train, test_size=0.33, random_state=42)

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(clf)
# y_true goes first, otherwise precision and recall are swapped in the report
print(classification_report(y_test, pred, labels=None))

In [None]:
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances

The last step shows that some features are more important than others, so we can now run experiments.

# Modeling

In [None]:
qualitative = df_train.select_dtypes("category").drop(columns=["Cover_Type"])
columns_t_analyze = df_train.select_dtypes(["float64", "int64"])
#columns_transformed =  RobustScaler(quantile_range=(25, 75)).fit_transform(columns_t_analyze)

columns_transformed =  PowerTransformer(method='yeo-johnson').fit_transform(columns_t_analyze)
# Keep the original Id index so the inner join with Cover_Type aligns row by row
columns_transformed = pd.DataFrame(columns_transformed, index=columns_t_analyze.index, columns=columns_t_analyze.columns)
columns_transformed = pd.concat([columns_transformed, df_train.loc[:,"Cover_Type"]], axis=1, join='inner')
d = defaultdict(LabelEncoder)
fit = X.apply(lambda x: d[x.name].fit_transform(x))
fit.reset_index(drop=True, inplace=True)
columns_transformed.reset_index(drop=True, inplace=True)
features_preprocessing = pd.concat([fit, columns_transformed], axis=1, join='inner')

In [None]:
features_preprocessing.columns

In [None]:
selected_columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area4', 'Soil_Type10', 
                  'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Soil_Type4', 'Soil_Type3', 'Soil_Type17', 'Soil_Type2']



***
*Holdout approach*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_preprocessing.loc[:,selected_columns], features_preprocessing.loc[:,'Cover_Type'], test_size=0.33, random_state=42)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

for i in range(3, 21, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    print("KNeighborsClassifier {}".format(i))
    print(classification_report(y_test, pred, labels=None))

In [None]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
pred = gnb.predict(X_test)
## accuracy
accuracy = accuracy_score(y_test,pred)
print("naive_bayes")
print(classification_report(y_test,pred, labels=None))

In [None]:
from sklearn import svm
Sv=svm.SVC(gamma='scale',kernel='rbf')
Sv.fit(X_train, y_train)

pred = Sv.predict(X_test)
# accuracy
accuracy = accuracy_score(y_test,pred)
print(classification_report(y_test,pred, labels=None))

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(clf)
print(classification_report(y_test, pred, labels=None))

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier(max_depth=10, subsample=0.8, colsample_bytree=0.7,missing=-999)

xgb.fit(X_train, y_train)
pred = xgb.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(xgb)
print(classification_report(y_test, pred, labels=None))

***
*K fold approach*

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score
params = {
        'min_child_weight': [1, 5, 10, 13, 15],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.2, 0.4, 0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5, 10, 20]
        }

xgb = XGBClassifier(silent=True, nthread=1)
folds = 3
param_comb = 5

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring='accuracy', n_jobs=4, cv=skf.split(X_train, y_train), verbose=3, random_state=1001 )

random_search.fit(X_train, y_train)

In [None]:
print('\n All results:')
print(random_search.cv_results_)
print('\n Best estimator:')
print(random_search.best_estimator_)
print('\n Best hyperparameters:')
print(random_search.best_params_)
results = pd.DataFrame(random_search.cv_results_)
results.to_csv('xgb-random-grid-search-results-01.csv', index=False)

In [None]:
xgb = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=5,
              learning_rate=0.1, max_delta_step=0, max_depth=10,
              min_child_weight=10, missing=None, n_estimators=100, n_jobs=1,
              nthread=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=True, subsample=0.8, verbosity=1)

xgb.fit(X_train, y_train)
pred = xgb.predict(X_test)
print(classification_report(y_test, pred, labels=None))

In [None]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC(gamma="scale")
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print('\n All results:')
print(clf.cv_results_)
print('\n Best estimator:')
print(clf.best_estimator_)
print('\n Best hyperparameters:')
print(clf.best_params_)

In [None]:
clf = svm.SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(classification_report(y_test, pred, labels=None))

***
*Another approach*

In [None]:
# selected_columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area4', 'Soil_Type10', 
#                   'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Soil_Type4', 'Soil_Type3', 'Soil_Type17', 'Soil_Type2', 'Soil_Type30', 'Soil_Type13',
#                  'Soil_Type22', 'Soil_Type12', 'Soil_Type35', 'Soil_Type11', 'Wilderness_Area1', 'Soil_Type14']



selected_columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area4', 'Soil_Type10', 
                  'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Soil_Type4', 'Soil_Type3', 'Soil_Type17', 'Soil_Type2', 'Soil_Type30', 'Soil_Type13',
                 'Soil_Type22', 'Soil_Type12', 'Soil_Type35', 'Soil_Type11', 'Wilderness_Area1', 'Soil_Type14',
                 'Wilderness_Area3', 'Soil_Type37', 'Soil_Type23', 'Soil_Type16', 'Soil_Type20', 'Soil_Type24', 'Soil_Type18', 'Wilderness_Area2']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train_copy.loc[:,selected_columns], df_train_copy.loc[:,'Cover_Type'], test_size=0.33, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
randomfr= RandomForestClassifier()
randomfr.fit(X_train, y_train)
pred = randomfr.predict(X_test)
print("randomfr")
print(classification_report(y_test,pred, labels=None))

In [None]:
from sklearn.neighbors import KNeighborsClassifier

for i in range(3, 21, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    print("KNeighborsClassifier {}".format(i))
    print(classification_report(y_test, pred, labels=None))

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
# from sklearn.metrics import roc_auc_score
random_grid = {'bootstrap': [True, False],
               'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
               'max_features': ['auto', 'sqrt'],
               'min_samples_leaf': [1, 2, 4],
               'min_samples_split': [2, 5, 10],
               'n_estimators': [130, 180, 230]}

# skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)
rf = RandomForestClassifier()
random_search = RandomizedSearchCV(rf, param_distributions=random_grid, n_iter=param_comb, scoring='accuracy', n_jobs=4, cv=5, verbose=3, random_state=1001 )

random_search.fit(X_train, y_train)

In [None]:
random_search.best_estimator_.fit(X_train, y_train)
pred = random_search.best_estimator_.predict(X_test)
print("randomfr")
print(classification_report(y_test,pred, labels=None))

In [None]:
# qualitative = df_train_copy.select_dtypes("category").drop(columns=["Cover_Type"])
# columns_t_analyze = df_train.select_dtypes(["float64", "int64"])
# #columns_transformed =  RobustScaler(quantile_range=(25, 75)).fit_transform(columns_t_analyze)

# columns_transformed = pd.concat([columns_t_analyze, df_train_copy.loc[:,"Cover_Type"]], axis=1, join='inner')
# d = defaultdict(LabelEncoder)
# fit = X.apply(lambda x: d[x.name].fit_transform(x))
# fit.reset_index(drop=True, inplace=True)
# columns_transformed.reset_index(drop=True, inplace=True)
# features_preprocessing = pd.concat([fit, columns_transformed], axis=1, join='inner')

In [None]:
# from sklearn.model_selection import StratifiedKFold
# from sklearn.model_selection import RandomizedSearchCV
# from sklearn.metrics import roc_auc_score
# params = {
#         'min_child_weight': [1, 5, 10, 13, 15],
#         'gamma': [0.5, 1, 1.5, 2, 5],
#         'subsample': [0.2, 0.4, 0.6, 0.8, 1.0],
#         'colsample_bytree': [0.6, 0.8, 1.0],
#         'max_depth': [3, 4, 5, 10, 20]
#         }

# xgb = XGBClassifier(silent=True, nthread=1)
# folds = 3
# param_comb = 5

# #skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

# random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring='accuracy', n_jobs=4, cv=4, verbose=3, random_state=1001 )

# random_search.fit(X_train, y_train)

In [None]:
# xgb = random_search.best_estimator_

# xgb.fit(X_train, y_train)
# pred = xgb.predict(X_test)
# print(classification_report(pred, y_test, labels=None))

**Training with all data.**

In [None]:
# 1 experiment
#neigh = KNeighborsClassifier(n_neighbors=3)
#neigh.fit(features_preprocessing.loc[:,selected_columns], features_preprocessing.loc[:,'Cover_Type'])
# 2 experiment
# xgb = XGBClassifier(max_depth=10, subsample=0.8, colsample_bytree=0.7,missing=-999)
# xgb.fit(features_preprocessing.loc[:,selected_columns], features_preprocessing.loc[:,'Cover_Type'])
# 3 experiment 

selected_columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area4', 'Soil_Type10', 
                  'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Soil_Type4', 'Soil_Type3', 'Soil_Type17', 'Soil_Type2', 'Soil_Type30', 'Soil_Type13',
                 'Soil_Type22', 'Soil_Type12', 'Soil_Type35', 'Soil_Type11', 'Wilderness_Area1', 'Soil_Type14',
                 'Wilderness_Area3', 'Soil_Type37', 'Soil_Type23', 'Soil_Type16', 'Soil_Type20', 'Soil_Type24', 'Soil_Type18', 'Wilderness_Area2']


# 4 experiment

random_search.best_estimator_.fit(df_train_copy.loc[:,selected_columns], df_train_copy.loc[:,'Cover_Type'])

# from sklearn.ensemble import RandomForestClassifier
# randomfr= RandomForestClassifier()
# randomfr.fit(df_train_copy.loc[:,selected_columns], df_train_copy.loc[:,'Cover_Type'])


# Make Predictions

In [None]:
columns_t_analyze = df_test.select_dtypes(["float64", "int64"])
columns_transformed =  transformer.transform(columns_t_analyze)
columns_transformed = pd.DataFrame(columns_transformed)
columns_transformed.columns = columns_t_analyze.columns

In [None]:
columns_transformed.shape

In [None]:
columns_transformed.head()

In [None]:
columns_transformed.columns

In [None]:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
d = defaultdict(LabelEncoder)
X = df_test.select_dtypes("category")
fit = X.apply(lambda x: d[x.name].fit_transform(x))

In [None]:
fit.columns

In [None]:
fit.reset_index(drop=True, inplace=True)
columns_transformed.reset_index(drop=True, inplace=True)
features_test_preprocessing = pd.concat([columns_transformed, fit], axis=1, join='inner')
features_test_preprocessing.shape

In [None]:
features_test_preprocessing.isna().sum().sum()

Experiment 1

In [None]:
#selected_columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area4', 'Soil_Type10', 
#                  'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Soil_Type4']

results = xgb.predict(features_test_preprocessing.loc[:,selected_columns])

Experiment 2

In [None]:
selected_columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area4', 'Soil_Type10', 
                  'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Soil_Type4', 'Soil_Type3', 'Soil_Type17', 'Soil_Type2']

results = randomfr.predict(df_test.loc[:,selected_columns])

Experiment 3

In [None]:
selected_columns=["Elevation", 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area4', 'Soil_Type10', 
                  'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Soil_Type4', 'Soil_Type3', 'Soil_Type17', 'Soil_Type2', 'Soil_Type30', 'Soil_Type13',
                 'Soil_Type22', 'Soil_Type12', 'Soil_Type35', 'Soil_Type11', 'Wilderness_Area1', 'Soil_Type14',
                 'Wilderness_Area3', 'Soil_Type37', 'Soil_Type23', 'Soil_Type16', 'Soil_Type20', 'Soil_Type24', 'Soil_Type18', 'Wilderness_Area2']

results = random_search.best_estimator_.predict(df_test.loc[:,selected_columns])


In [None]:
output = pd.DataFrame({'Id': df_test.index,
                       'Cover_Type': results})
output.to_csv('submission_all.csv', index=False)

# References

+ https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63
+ https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
+ https://pbpython.com/categorical-encoding.html