
Sprint Challenge: using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

This predictive modeling challenge comes from DrivenData, an organization that helps non-profits by hosting data science competitions for social impact. The competition has open licensing: "The data is available for use outside of DrivenData." We are reusing the data on Kaggle's InClass platform so we can run a weeklong challenge just for our Lambda School DS1 cohort.

The data comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. In their own words:

> Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.


**Review: fast.ai/2017/11/13/validation-sets**

In [0]:
# Import all packages

import numpy as np
np.set_printoptions(threshold=np.inf)
np.set_printoptions(suppress=True)

import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
import matplotlib.pyplot as plt
import seaborn as sns

# Encoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn import preprocessing

# For scaling
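from sklearn.preprocessing import scale, StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer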

# Regression models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, cross_val_predict, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


# For ridge regression
from sklearn.linear_model import Ridge

In [0]:
url_testf = 'https://raw.githubusercontent.com/hughjafro/DS1-Unit-2-Sprint-5---Project-Week/master/test_features.csv'
url_trainf = 'https://raw.githubusercontent.com/hughjafro/DS1-Unit-2-Sprint-5---Project-Week/master/train_features.csv'
url_trainl = 'https://raw.githubusercontent.com/hughjafro/DS1-Unit-2-Sprint-5---Project-Week/master/train_labels.csv'
df_trainf = pd.read_csv(url_trainf)
df_labels = pd.read_csv(url_trainl)
df_test = pd.read_csv(url_testf)

####Check the initial datasets

In [4]:
print(df_trainf.shape)
df_trainf.head()

(59400, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [5]:
# This will be our target - status_group
print(df_labels.shape)
df_labels.head()

(59400, 2)


Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [6]:
print(df_test.shape)
df_test.head()

(14358, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321,True,GeoData Consultants Ltd,Parastatal,,True,2012,other,other,other,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300,True,GeoData Consultants Ltd,VWC,TPRI pipe line,True,2000,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500,True,GeoData Consultants Ltd,VWC,P,,2010,other,other,other,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250,,GeoData Consultants Ltd,VWC,,True,1987,other,other,other,vwc,user-group,unknown,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60,,GeoData Consultants Ltd,Water Board,BRUDER,True,2000,gravity,gravity,gravity,water board,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


####ID the target

In [0]:
#Define target feature
target1 = df_labels.status_group.replace({'functional':0,
                                          'functional needs repair':1,
                                          'non functional':2})

In [8]:
# Check target values
target1.value_counts()

0    32259
2    22824
1     4317
Name: status_group, dtype: int64

In [9]:
# Percent functional
print('Percent functional (baseline accuracy): ', 32259 / (32259+22824+4317))

# Percent non-functional
print('\nPercent non-functional: ', 22824 / (32259+22824+4317))

# Percent need repair
print('\nPercent need repair: ', 4317 / (32259+22824+4317))

Percent functional (baseline accuracy):  0.543080808080808

Percent non-functional:  0.3842424242424242

Percent need repair:  0.07267676767676767


In [10]:
# Cross-check the manual percentages with value_counts(normalize=True)
target1.value_counts(normalize=True)

0    0.543081
2    0.384242
1    0.072677
Name: status_group, dtype: float64

####Majority Class

In [0]:
majority_class = df_labels['status_group'].mode()[0]

y_pred = np.full(shape=df_labels['status_group'].shape, fill_value=majority_class)

In [12]:
accuracy_score(df_labels['status_group'], y_pred)

0.543080808080808
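
The majority-class baseline above can also be expressed with scikit-learn's `DummyClassifier`. The sketch below rebuilds a label vector from the class counts reported earlier instead of loading the CSV, so `counts`, `y_demo`, `X_demo`, and `baseline` are illustrative names that do not appear in the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts copied from the value_counts() output above
counts = {'functional': 32259, 'non functional': 22824, 'functional needs repair': 4317}

# Rebuild a label vector with the same class distribution (stand-in for df_labels)
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# strategy='most_frequent' always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)
```

Because the estimator follows the usual `fit`/`predict` interface, it can be dropped into the same evaluation code as the real models, which keeps the baseline comparison like for like.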

###Initial Classification Report

In [13]:
print(classification_report(df_labels['status_group'], y_pred) )


                         precision    recall  f1-score   support

             functional       0.54      1.00      0.70     32259
functional needs repair       0.00      0.00      0.00      4317
         non functional       0.00      0.00      0.00     22824

              micro avg       0.54      0.54      0.54     59400
              macro avg       0.18      0.33      0.23     59400
           weighted avg       0.29      0.54      0.38     59400
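
The averaged rows of this report follow directly from the class counts. As a rough check, the sketch below recomputes the macro and weighted averages for a model that always predicts `functional`; precision for the two never-predicted classes is undefined and is shown as 0.00 in the report, which the sketch mirrors. All variable names are illustrative.

```python
# Support (row count) per class, taken from the report above
support = {'functional': 32259, 'functional needs repair': 4317, 'non functional': 22824}
total = sum(support.values())

# Always predicting 'functional': only that class has non-zero precision and recall
precision = {'functional': support['functional'] / total,
             'functional needs repair': 0.0,
             'non functional': 0.0}
recall = {'functional': 1.0, 'functional needs repair': 0.0, 'non functional': 0.0}

# Macro average: unweighted mean over the three classes
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean weighted by each class's support
weighted_precision = sum(precision[c] * support[c] for c in support) / total
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(round(macro_precision, 2), round(macro_recall, 2))
print(round(weighted_precision, 2), round(weighted_recall, 2))
```

The gap between the weighted recall (0.54) and the macro recall (0.33) is a quick signal that accuracy alone hides how poorly the two minority classes are served.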



####Combine test/train dataframes into one dataframe so they can be cleaned at one time

In [0]:
# # Convert target to numerical values
# df_labels = pd.get_dummies(df_labels)

In [15]:
# Use concat to append the test rows below the training rows
df_all = pd.concat([df_trainf, df_test])
print(df_all.shape)
df_all.head()

(73758, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [16]:
# Check NaN's
df_all.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    4418
gps_height                   0
installer                 4443
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 465
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            4119
recorded_by                  0
scheme_management         4816
scheme_name              35005
permit                    3719
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [0]:
# Drop scheme_name: 35,005 of its 73,758 values are missing
df_all=df_all.drop(columns='scheme_name')
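
`scheme_name` is missing in 35,005 of 73,758 rows, roughly 47%. A drop like this can be driven by a threshold instead of a hand-picked column name. The sketch below uses a toy frame with hypothetical column names and an assumed 40% cutoff; neither comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_all; column names are hypothetical
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, 'a', np.nan],
    'few_missing': ['x', np.nan, 'y', 'z'],
    'complete': [1, 2, 3, 4],
})

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean()

# Drop columns whose missing share exceeds an assumed cutoff of 40%
threshold = 0.4
to_drop = missing_share[missing_share > threshold].index.tolist()
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```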

In [18]:
df_all.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'permit', 'construction_year', 'extraction_type',
       'extraction_type_group', 'extraction_type_class', 'management',
       'management_group', 'payment', 'payment_type', 'water_quality',
       'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
       'source_class', 'waterpoint_type', 'waterpoint_type_group'],
      dtype='object')

###Fill missing values with the mode of that column

The mode is a natural fill value for categorical columns, where a mean or median is not defined.

In [0]:
df_all['funder'] = df_all['funder'].fillna(df_all['funder'].mode()[0])

In [0]:


cols = ['installer', 'subvillage', 'public_meeting',
        'scheme_management', 'permit']

df_all[cols] = df_all[cols].fillna(df_all.mode().iloc[0])
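
To make the multi-column fill above easier to follow, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two categorical columns (values are illustrative)
toy = pd.DataFrame({
    'installer': ['DWE', np.nan, 'DWE', 'RWE'],
    'scheme_management': ['VWC', 'VWC', np.nan, 'WUG'],
})
cols = ['installer', 'scheme_management']

# mode() returns one row per tied mode; iloc[0] picks the first for each column
modes = toy[cols].mode().iloc[0]

# fillna with a Series fills each column with the value stored under its own label
filled = toy.copy()
filled[cols] = toy[cols].fillna(modes)
print(filled)
```

Passing a Series to `fillna` aligns on column labels, which is why a single call can fill several columns with different values.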

In [21]:
# Check for nans
df_all.isna().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
subvillage               0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
w

In [59]:
target1.shape

(59400,)

###Dummy encode data

In [0]:
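# Label-encode every object/category column in place (integer codes, not one-hot dummies)
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            # Refit the encoder per column; codes follow the sorted unique values
            df[feature] = le.fit_transform(df[feature])
        except Exception as err:
            # e.g. a column with mixed types that cannot be sorted
            print('Error encoding ' + feature + ': ' + str(err))
    return df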
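
One caveat with label encoding: `LabelEncoder` assigns codes from the sorted unique values of whatever it is fit on, so encoding the training and test rows in separate calls can give the same category different integers. The sketch below shows the effect on toy data and one possible remedy, fitting a single encoder on the combined values. The frame and variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy train/test frames; the 'basin' values are borrowed from the data preview above
train_demo = pd.DataFrame({'basin': ['Pangani', 'Lake Victoria', 'Internal']})
test_demo = pd.DataFrame({'basin': ['Pangani', 'Internal']})

# Separate fits: 'Pangani' gets a different code in each frame
separate_train = LabelEncoder().fit_transform(train_demo['basin'])
separate_test = LabelEncoder().fit_transform(test_demo['basin'])

# Shared fit: learn the categories once on train + test, then transform each part
le = LabelEncoder().fit(pd.concat([train_demo['basin'], test_demo['basin']]))
shared_train = le.transform(train_demo['basin'])
shared_test = le.transform(test_demo['basin'])

print(separate_train, separate_test)
print(shared_train, shared_test)
```

In this notebook, a comparable effect could likely be had by encoding the combined frame before it is split back into training and test rows, though that would change the reported scores.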

#Separate into train and test again

In [0]:
# Target is still target1
X = df_all[:-14358] # drop last 14358 rows
test = df_all[-14358:] # keep the last 14358 rows
y = target1

In [68]:
test.shape

(14358, 39)

In [70]:
print(test.shape)
test.head()

(14358, 39)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321,True,GeoData Consultants Ltd,Parastatal,True,2012,other,other,other,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300,True,GeoData Consultants Ltd,VWC,True,2000,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,Government Of Tanzania,1567,DWE,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500,True,GeoData Consultants Ltd,VWC,True,2010,other,other,other,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250,True,GeoData Consultants Ltd,VWC,True,1987,other,other,other,vwc,user-group,unknown,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60,True,GeoData Consultants Ltd,Water Board,True,2000,gravity,gravity,gravity,water board,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


In [71]:
# Encode the training rows so that df_ce lines up with y (59,400 rows each)
df_ce = dummyEncode(X)
df_ce.shape

(59400, 39)

In [0]:
test_ce = dummyEncode(test)

In [74]:
test_ce.shape

(14358, 39)

In [75]:
X_train, X_test, y_train, y_test = train_test_split(df_ce, y, test_size=0.25, random_state=42, shuffle=True)

In [76]:
# Logistic Regression again
log_reg = LogisticRegression(solver='newton-cg', multi_class='multinomial')
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)



In [31]:
accuracy_score(y_test, y_pred)

0.6411447811447811

In [32]:
y_test.shape, y_pred.shape, X_test.shape

((14850,), (14850,), (14850, 39))

###Make Pipeline

In [33]:
pipeline = make_pipeline(StandardScaler(),
                        LogisticRegression(solver='newton-cg',
                                           multi_class='multinomial'))

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)



In [34]:
accuracy_score(y_test, y_pred)

0.6405387205387205
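
The score above comes from a single 75/25 split. As a complement, a pipeline can be passed to `cross_val_score` to get several validation estimates. The sketch below runs on synthetic data from `make_classification` because the real features are not loaded here; the data, the `max_iter` setting, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for X_train / y_train (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=42)

# Scaling lives inside the pipeline, so each fold is scaled using only its training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Three-fold cross-validated accuracy
scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring='accuracy')
print(scores, scores.mean())
```

Keeping the scaler inside the pipeline matters here: fitting it on all rows before splitting would leak validation statistics into training.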

In [35]:
X_test.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
2980,37098,0.0,194,1407,0,390,31.985658,-3.59636,5344,0,3,521,17,17,5,9,453,0,True,0,9,True,0,9,6,3,11,4,6,6,6,2,0,0,7,5,0,6,5
5246,14530,0.0,219,485,0,622,32.832815,-4.944937,28341,0,3,8791,19,14,6,115,2052,0,True,0,7,True,0,4,2,1,7,4,0,2,3,3,2,2,7,5,0,4,3
22659,62607,10.0,300,1529,1675,390,35.488289,-4.242048,13832,0,0,16810,8,21,1,2,213,148,True,0,10,True,2008,3,1,0,9,4,4,5,6,2,2,2,8,6,0,1,1
39888,46053,0.0,135,727,0,807,33.140828,-9.059386,4273,0,2,1502,10,12,6,62,455,0,False,0,7,False,0,8,5,1,7,4,0,2,6,2,3,3,7,5,0,4,3
13361,47083,50.0,283,1842,1109,1567,34.217077,-4.430529,29036,0,0,11362,18,13,1,23,1403,235,True,0,8,True,2011,7,4,2,10,4,4,5,6,2,1,1,3,0,0,2,1


In [36]:
test_ce.shape

(14358, 39)

In [0]:
test_pred = log_reg.predict(test_ce)
print(test_pred)

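In [0]:
# The competition test set has no labels, so test_pred cannot be scored here.
# y_test (14,850 rows) belongs to the validation split and does not match test_pred (14,358 rows).
# accuracy_score(y_test, test_pred)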

In [0]:
sub4 = pd.DataFrame({'id':test_ce.id,'status_group':test_pred})

In [0]:
sub4.head()
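
`test_pred` holds the integer codes 0, 1, and 2 defined for `target1`. If the submission is expected to carry the original string labels (an assumption; the notebook does not show the required format), the codes can be mapped back before writing the file. The ids come from the test preview above, while the predicted codes and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Inverse of the mapping used to build target1
code_to_label = {0: 'functional', 1: 'functional needs repair', 2: 'non functional'}

# Hypothetical ids and predicted codes standing in for test_ce.id and test_pred
ids_demo = pd.Series([50785, 51630, 17168])
pred_demo = np.array([0, 2, 1])

# Build the submission frame with string labels instead of integer codes
submission_demo = pd.DataFrame({
    'id': ids_demo,
    'status_group': pd.Series(pred_demo).map(code_to_label),
})
print(submission_demo)
```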

#Submit file

In [0]:
# from google.colab import files

# sub4.to_csv('submission4_CT.csv', index=False)
# files.download('submission4_CT.csv')

In [0]:
X_train.head()

###XGBoost

In [0]:
# !pip install xgboost
import xgboost as xgb

In [0]:
# Reuse the train/validation split created above
label_train = y_train
label_test = y_test

dtrain = xgb.DMatrix(X_train, label=label_train)
dtest = xgb.DMatrix(X_test, label=label_test)

# Save each DMatrix to its own binary buffer file
dtrain.save_binary('dtrain.buffer')
dtest.save_binary('dtest.buffer')

# specify parameters via map
# param = {
#     'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'
# } - Doesn't work because my target is not binary - 0,1,2

param = {
    'num_class':3, 'eta':1, 'silent':1, 'objective':'multi:softmax'
}

In [46]:
# Specify the validation sets to watch during training
watchlist = [(dtest, 'test'), (dtrain, 'train')]
num_round = 10

print('merror shows wrong cases / all cases')
bst = xgb.train(param, dtrain, num_round, watchlist)


merror shows wrong cases / all cases
[0]	test-merror:0.276835	train-merror:0.273692
[1]	test-merror:0.267205	train-merror:0.26018
[2]	test-merror:0.25266	train-merror:0.24092
[3]	test-merror:0.244108	train-merror:0.232121
[4]	test-merror:0.239259	train-merror:0.224714
[5]	test-merror:0.233333	train-merror:0.214658
[6]	test-merror:0.232458	train-merror:0.210258
[7]	test-merror:0.227811	train-merror:0.202222
[8]	test-merror:0.224377	train-merror:0.198025
[9]	test-merror:0.222896	train-merror:0.194456


In [51]:
# Predict class labels for the training and validation sets
xg_ypred1 = bst.predict(dtrain)
xg_ypred2 = bst.predict(dtest)

print('Train accuracy score:', accuracy_score(y_train, xg_ypred1))
print('Test accuracy score:', accuracy_score(y_test, xg_ypred2))


# print('error=%f' % (sum(1 for i in range(len(xg_ypred1)) if int(xg_ypred1[i] > 0.5) != labels[i]) / float(len(preds))))
# bst.save_model('0001.model')

Train accuracy score: 0.8055443322109989
Test accuracy score: 0.7771043771043771
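
These accuracy scores tie back to the training log: `merror` is the fraction of misclassified rows, so accuracy is its complement. A quick check with the final-round values copied from the log (variable names are illustrative):

```python
# Final-round values copied from the training log above
test_merror = 0.222896
train_merror = 0.194456

# merror is the misclassified fraction, so accuracy is 1 - merror
test_accuracy = 1 - test_merror
train_accuracy = 1 - train_merror

print(round(test_accuracy, 4), round(train_accuracy, 4))
```

Both values agree with the `accuracy_score` output to the precision printed in the log.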


In [57]:
xg_ypred2.shape, xg_ypred1.shape, y_train.shape, y_test.shape

((14850,), (44550,), (44550,), (14850,))

In [0]:
from google.colab import files

sub4.to_csv('submission4_CT.csv', index=False)
files.download('submission4_CT.csv')

##OLD CODE SNIPPETS

###Grid Search

In [0]:
# # Polynomial Regression
# def PolynomialRegression(degree=2, **kwargs):
#     return make_pipeline(PolynomialFeatures(degree),
#                          LinearRegression(**kwargs))

In [0]:
# y = target
# X = df_num

# X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

In [0]:
# y_train.head()

In [0]:
# param_grid = {
#     'polynomialfeatures__degree': [0,1,2,3]
# }

# gridsearch = GridSearchCV(PolynomialRegression(), param_grid=param_grid,
#                           scoring='neg_mean_absolute_error', cv=3,
#                           return_train_score=True, verbose=10)
# gridsearch.fit(X_train, y_train)

In [0]:
# pd.DataFrame(gridsearch.cv_results_).sort_values(by='rank_test_score')

###Random Forest

In [0]:
# model = RandomForestRegressor(n_estimators=100, max_depth=20)

# scores = cross_validate(model, X_train, y_train,
#                         scoring='neg_mean_absolute_error',
#                         cv=3, return_train_score=True,
#                        return_estimator=True)

# pd.DataFrame(scores)