# Tanzanian Water Wells

Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many water points already established in the country, but some are in need of repair while others have failed altogether.

Build a classifier to predict the condition of a water well, using information about the sort of pump, when it was installed, etc. Your audience could be an NGO focused on locating wells needing repair, or the Government of Tanzania looking to find patterns in non-functional wells to influence how new wells are built. Note that this is a ternary classification problem by default, but can be engineered to be binary.
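
A minimal sketch of the binary option, assuming the goal is to flag any well that is not fully functional. That grouping is an assumption, not something fixed by the problem statement, and the toy series and the `needs_attention` name are made up for illustration.

```python
import pandas as pd

# Toy labels using the three status_group values from this dataset.
status = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

# Assumption: "non functional" and "functional needs repair" are merged
# into one positive class (1), and "functional" becomes the negative class (0).
needs_attention = (status != "functional").astype(int)

print(needs_attention.tolist())
```

Merging the two problem classes also softens the class imbalance, since "functional needs repair" on its own is only about 7% of the rows.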

Will use:

- logreg
- knn
- decision trees
- svm
- random forests
- adaboost
- xgboost
- ensemble methods

Preliminary business problem (haven't looked at the data yet): identify wells in need of repair to reduce the resources spent on updates.

In [116]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer, normalize, PolynomialFeatures, LabelEncoder

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, LassoCV, RidgeCV, ElasticNetCV

from sklearn.model_selection import train_test_split, cross_validate, KFold, cross_val_score, ShuffleSplit, RandomizedSearchCV, GridSearchCV

from sklearn.metrics import mean_squared_error, make_scorer, log_loss, confusion_matrix, plot_confusion_matrix, precision_score, recall_score, accuracy_score, f1_score, roc_curve, roc_auc_score, classification_report, auc, plot_roc_curve

from sklearn.dummy import DummyRegressor

from sklearn.utils import resample

from sklearn.impute import MissingIndicator, SimpleImputer

from sklearn.feature_selection import SelectFromModel

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

from sklearn import tree

from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

from scipy import stats
from sklearn.naive_bayes import MultinomialNB, GaussianNB

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

from sklearn.ensemble import (
    BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier,
    VotingClassifier, StackingRegressor, StackingClassifier,
)

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

import xgboost

%matplotlib inline


Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh` - Total static head (amount water available to waterpoint)
- `date_recorded` - The date the row was entered
- `funder` - Who funded the well
- `gps_height` - Altitude of the well
- `installer` - Organization that installed the well
- `longitude` - GPS coordinate
- `latitude` - GPS coordinate
- `wpt_name` - Name of the waterpoint if there is one
- `num_private` -
- `basin` - Geographic water basin
- `subvillage` - Geographic location
- `region` - Geographic location
- `region_code` - Geographic location (coded)
- `district_code` - Geographic location (coded)
- `lga` - Geographic location
- `ward` - Geographic location
- `population` - Population around the well
- `public_meeting` - True/False
- `recorded_by` - Group entering this row of data
- `scheme_management` - Who operates the waterpoint
- `scheme_name` - Who operates the waterpoint
- `permit` - If the waterpoint is permitted
- `construction_year` - Year the waterpoint was constructed
- `extraction_type` - The kind of extraction the waterpoint uses
- `extraction_type_group` - The kind of extraction the waterpoint uses
- `extraction_type_class` - The kind of extraction the waterpoint uses
- `management` - How the waterpoint is managed
- `management_group` - How the waterpoint is managed
- `payment` - What the water costs
- `payment_type` - What the water costs
- `water_quality` - The quality of the water
- `quality_group` - The quality of the water
- `quantity` - The quantity of water
- `quantity_group` - The quantity of water
- `source` - The source of the water
- `source_type` - The source of the water
- `source_class` - The source of the water
- `waterpoint_type` - The kind of waterpoint
- `waterpoint_type_group` - The kind of waterpoint

## Import Data and Baseline model

In [14]:
X_data_df = pd.read_csv('data/x_data.csv')
y_data_df = pd.read_csv('data/target_data.csv')

In [15]:
y_data_df['status_group'].value_counts(normalize=True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

### Encode the target as integers using LabelEncoder

In [19]:
y_encoded = pd.DataFrame(LabelEncoder().fit_transform(y_data_df['status_group']))

In [21]:
y_encoded.value_counts(normalize=True)

0    0.543081
2    0.384242
1    0.072677
dtype: float64

- 0 = functional
- 1 = functional needs repair
- 2 = non functional
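
The integer codes follow alphabetical order because `LabelEncoder` sorts the class labels. A small sketch on toy labels (not the real file) that keeps the fitted encoder around and recovers the mapping from it instead of writing it out by hand. The names `le` and `mapping` are just for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels with the same three classes as status_group.
labels = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"]
)

# Keep a reference to the encoder so the mapping can be inspected later.
le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so position i holds the label that was encoded as i.
mapping = {code: str(label) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns predictions back into readable labels.
print(le.inverse_transform([2, 0]))
```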

### Let's explore the predictor data

In [22]:
X_data_df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [23]:
X_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

In [24]:
X_data_df.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

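So many object dtypes. One idea: impute missing values with the most frequent value and add an indicator column marking which rows were imputed.

Potentially drop the `scheme_name` column, since nearly half of it (28,166 of 59,400 rows) is missing?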
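
A rough sketch of the impute-plus-indicator idea using `SimpleImputer` with `add_indicator=True`. The toy frame and the `_was_missing` column suffix are assumptions for illustration. Only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical frame with a couple of gaps (object dtype, NaN for missing).
toy = pd.DataFrame(
    {
        "funder": ["Roman", np.nan, "Unicef", "Roman"],
        "installer": ["DWE", "DWE", np.nan, "CES"],
    },
    dtype=object,
)

# most_frequent fills each gap with the column mode;
# add_indicator appends one boolean column per feature that had missing values.
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
filled = imputer.fit_transform(toy)

# indicator_.features_ lists the indices of the columns that got an indicator.
indicator_cols = [f"{c}_was_missing" for c in toy.columns[imputer.indicator_.features_]]
result = pd.DataFrame(filled, columns=list(toy.columns) + indicator_cols)
print(result)
```

The indicator columns let a model learn from the fact that a value was missing, which the imputed mode alone would hide.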

In [25]:
X_data_df.describe()

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year
count,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0
mean,37115.131768,317.650385,668.297239,34.077427,-5.706033,0.474141,15.297003,5.629747,179.909983,1300.652475
std,21453.128371,2997.574558,693.11635,6.567432,2.946019,12.23623,17.587406,9.633649,471.482176,951.620547
min,0.0,0.0,-90.0,0.0,-11.64944,0.0,1.0,0.0,0.0,0.0
25%,18519.75,0.0,0.0,33.090347,-8.540621,0.0,5.0,2.0,0.0,0.0
50%,37061.5,0.0,369.0,34.908743,-5.021597,0.0,12.0,3.0,25.0,1986.0
75%,55656.5,20.0,1319.25,37.178387,-3.326156,0.0,17.0,5.0,215.0,2004.0
max,74247.0,350000.0,2770.0,40.345193,-2e-08,1776.0,99.0,80.0,30500.0,2013.0


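No NaNs in the numeric columns, but zeros are likely standing in for missing values in amount_tsh, gps_height, longitude/latitude, num_private, district_code, population and definitely construction_year (its minimum and 25th percentile are both 0).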
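
One way to size up the suspected placeholder zeros, sketched on a toy frame rather than the real data. Which columns to treat as "zero means missing" is an assumption here. `suspect_cols` is a hypothetical choice that would need checking per column.

```python
import numpy as np
import pandas as pd

# Toy numeric frame with some zero entries.
toy = pd.DataFrame(
    {
        "construction_year": [1999, 0, 2009, 0],
        "population": [109, 0, 250, 58],
        "longitude": [34.94, 0.0, 37.46, 38.49],
    }
)

# Share of rows in each column that are exactly zero.
zero_share = (toy == 0).mean()
print(zero_share)

# Assumption: a zero year or zero longitude cannot be a real value for these
# wells, so convert those zeros to NaN and let an imputer handle them later.
suspect_cols = ["construction_year", "longitude"]
cleaned = toy.copy()
cleaned[suspect_cols] = cleaned[suspect_cols].replace(0, np.nan)
print(cleaned.isna().sum())
```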

## Start with lazy approach

- merge X and y
- drop nulls
- run ohe
- run model and see how it goes

In [29]:
lazy_df = pd.concat([X_data_df, y_encoded], axis=1)

In [33]:
lazy_df.rename(mapper={0:'target'}, axis=1, inplace=True)

I now have a good starting dataframe for any further analysis. No data has been changed or removed, and no tweaking has occurred.
There is no data leakage, and LabelEncoder has been applied to the target variable.
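
The `pd.concat` call lines rows up by index, which only works if both files share the same row order. A quick sanity check, sketched on three ids copied from the `head()` outputs in this notebook. The toy frames and variable names are for illustration only.

```python
import pandas as pd

# First three rows of the predictor and target files, as displayed by head().
X_toy = pd.DataFrame({"id": [69572, 8776, 34310], "amount_tsh": [6000.0, 0.0, 25.0]})
y_toy = pd.DataFrame(
    {"id": [69572, 8776, 34310], "status_group": ["functional", "functional", "functional"]}
)

# Check 1: ids match row by row, so a positional concat is safe.
ids_aligned = bool((X_toy["id"].to_numpy() == y_toy["id"].to_numpy()).all())
print(ids_aligned)

# Check 2 (alternative): merge on id, which does not depend on row order.
# validate="one_to_one" raises if an id is duplicated on either side.
merged = X_toy.merge(y_toy, on="id", how="inner", validate="one_to_one")
print(merged.shape)
```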

### Now to get real lazy

In [37]:
lazy_df.head(-3)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,target
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,0
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,0
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,0
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,2
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59392,40607,0.0,2011-04-15,Government Of Tanzania,0,Government,33.009440,-8.520888,Benard Charles,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,2
59393,48348,0.0,2012-10-27,Private,0,Private,33.866852,-4.287410,Kwa Peter,0,...,soft,good,insufficient,insufficient,dam,dam,surface,other,other,0
59394,11164,500.0,2011-03-09,World Bank,351,ML appro,37.634053,-6.124830,Chimeredya,0,...,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,2
59395,60739,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,0


In [38]:
y_data_df.head(-3)

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional
...,...,...
59392,40607,non functional
59393,48348,functional
59394,11164,non functional
59395,60739,functional


In [39]:
lazy_df.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [40]:
# drop scheme_name: 28,166 of 59,400 values (about 47%) are missing
lazy_df_clean = lazy_df.drop(['scheme_name'], axis=1)

In [41]:
lazy_df_clean.dropna(inplace=True)

In [42]:
lazy_df_clean.isna().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
subvillage               0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
w

In [43]:
lazy_df_clean.shape

(48288, 40)

In [44]:
len(lazy_df_clean)/len(lazy_df)

0.812929292929293

After lazy cleaning, we lost about 19% of the data (48,288 of 59,400 rows kept).

Now to drop the id column and separate into numeric and categorical columns

In [45]:
lazy_df_clean = lazy_df_clean.drop(['id'], axis=1)

In [48]:
y_lazy = lazy_df_clean.pop('target')

In [49]:
lazy_nums = lazy_df_clean.select_dtypes('number')

In [50]:
lazy_nums

Unnamed: 0,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year
0,6000.0,1390,34.938093,-9.856322,0,11,5,109,1999
2,25.0,686,37.460664,-3.821329,0,21,4,250,2009
3,0.0,263,38.486161,-11.155298,0,90,63,58,1986
5,20.0,0,39.172796,-4.765587,0,4,8,1,2009
6,0.0,0,33.362410,-3.766365,0,17,3,0,0
...,...,...,...,...,...,...,...,...,...
59394,500.0,351,37.634053,-6.124830,0,5,6,89,2007
59395,10.0,1210,37.169807,-3.253847,0,3,5,125,1999
59396,4700.0,1212,35.249991,-9.070629,0,11,4,56,1996
59398,0.0,0,35.861315,-6.378573,0,1,4,0,0


In [51]:
lazy_cats = lazy_df_clean.select_dtypes('object')

In [52]:
lazy_cats

Unnamed: 0,date_recorded,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,2011-03-14,Roman,Roman,none,Lake Nyasa,Mnyusi B,Iringa,Ludewa,Mundindi,True,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
2,2013-02-25,Lottery Club,World vision,Kwa Mahundi,Pangani,Majengo,Manyara,Simanjiro,Ngorika,True,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,2013-01-28,Unicef,UNICEF,Zahanati Ya Nanyumbu,Ruvuma / Southern Coast,Mahakamani,Mtwara,Nanyumbu,Nanyumbu,True,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
5,2011-03-13,Mkinga Distric Coun,DWE,Tajiri,Pangani,Moa/Mwereme,Tanga,Mkinga,Moa,True,...,per bucket,salty,salty,enough,enough,other,other,unknown,communal standpipe multiple,communal standpipe
6,2012-10-01,Dwsp,DWSP,Kwa Ngomho,Internal,Ishinabulandi,Shinyanga,Shinyanga Rural,Samuye,True,...,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59394,2011-03-09,World Bank,ML appro,Chimeredya,Wami / Ruvu,Komstari,Morogoro,Mvomero,Diongoya,True,...,monthly,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe
59395,2013-05-03,Germany Republi,CES,Area Three Namba 27,Pangani,Kiduruni,Kilimanjaro,Hai,Masama Magharibi,True,...,per bucket,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
59396,2011-05-07,Cefa-njombe,Cefa,Kwa Yahona Kuvala,Rufiji,Igumbilo,Iringa,Njombe,Ikondo,True,...,annually,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
59398,2011-03-08,Malec,Musa,Mshoro,Rufiji,Mwinyi,Dodoma,Chamwino,Mvumi Makulu,True,...,never pay,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump


OK, I've got lazy nums and cats.
Let's train/test split the nums first:

Let's scale it, run a model and see how it goes. When I come back to this, I need to try mapping the wells by their coordinates and see if that gives me any insight. I think there was a lecture in phase 1 that showed how.

NOTE: for cats, let's try converting date_recorded to a datetime type later on, but not now
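
For when that date conversion happens, a small sketch with `pd.to_datetime` on three `date_recorded` values from the table above. The derived column names (`year_recorded`, `month_recorded`) are hypothetical.

```python
import pandas as pd

# Three date_recorded strings in the YYYY-MM-DD format used by the dataset.
toy = pd.DataFrame({"date_recorded": ["2011-03-14", "2013-03-06", "2013-02-25"]})

# Parse the strings once, then pull out numeric parts a model can use.
recorded = pd.to_datetime(toy["date_recorded"], format="%Y-%m-%d")
toy["year_recorded"] = recorded.dt.year
toy["month_recorded"] = recorded.dt.month

print(toy)
```

Numeric parts like these avoid one-hot encoding every distinct date, which would create a column per day.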

In [54]:
X_train, X_test, y_train, y_test = train_test_split(lazy_nums, y_lazy, random_state=42)

In [57]:
ss = StandardScaler().fit(X_train)

In [59]:
X_train_scaled = ss.transform(X_train)

In [60]:
X_test_scaled = ss.transform(X_test)

First test: logistic regression

In [77]:
lr = LogisticRegression(random_state=42)

In [78]:
lr.fit(X_train_scaled, y_train)

LogisticRegression(random_state=42)

In [79]:
lr.score(X_train_scaled, y_train)

0.5614093218466976

In [80]:
lr.score(X_test_scaled, y_test)

0.5672630881378397

For ease, let's set up a KNN just to see how things look

In [69]:
knn = KNeighborsClassifier()

In [70]:
knn.fit(X_train_scaled, y_train)

KNeighborsClassifier()

In [71]:
knn.score(X_train_scaled, y_train)

0.7642478462557986

In [72]:
knn.score(X_test_scaled, y_test)

0.6751988071570576

Not horrible. Definitely overfit (0.764 train vs. 0.675 test), but it's a start and still beats the majority-class baseline (about 0.54 on the full dataset)
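
To make the baseline explicit, a sketch with `DummyClassifier` that always predicts the most frequent class. The toy target below only mimics the class shares of the full dataset (roughly 54% / 7% / 39%). It is not the real data, and the cleaned subset may have slightly different shares.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy target: 54 functional (0), 7 needs repair (1), 39 non functional (2).
y_toy = np.array([0] * 54 + [1] * 7 + [2] * 39)

# The dummy model ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

# Accuracy equals the share of the majority class.
print(baseline_acc)
```

Any real model should beat this number on the test set. If it does not, it has learned nothing beyond the class distribution.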

Let's try a very basic decision tree (definitely expecting overfitting)

In [73]:
dtc = DecisionTreeClassifier(random_state=42)

In [74]:
dtc.fit(X_train_scaled, y_train)

DecisionTreeClassifier(random_state=42)

In [75]:
dtc.score(X_train_scaled, y_train)

0.986939474265518

In [76]:
dtc.score(X_test_scaled, y_test)

0.6670808482438702

This one is super duper overfit (0.987 train vs. 0.667 test), but that's to be expected.

What if we try a random forest now?

In [81]:
rfc = RandomForestClassifier(random_state=42)

In [82]:
rfc.fit(X_train_scaled, y_train)

RandomForestClassifier(random_state=42)

In [83]:
rfc.score(X_train_scaled, y_train)

0.986939474265518

In [84]:
rfc.score(X_test_scaled, y_test)

0.7224983432736912

Still very overfit, but I'm getting much better predictions with random forest than with the others. Let's try a boosting ensemble?

In [86]:
ada = AdaBoostClassifier(random_state=42)

In [87]:
ada.fit(X_train_scaled, y_train)

AdaBoostClassifier(random_state=42)

In [88]:
ada.score(X_train_scaled, y_train)

0.6293351005080627

In [89]:
ada.score(X_test_scaled, y_test)

0.6209410205434063

Not really overfit (0.629 train vs. 0.621 test), but test accuracy is lower than KNN, the decision tree and the random forest. Could do better with some tweaks

#### Let's try the XGBoost classifier

In [91]:
xgb = xgboost.XGBClassifier(random_state=42)

In [92]:
xgb.fit(X_train_scaled, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [93]:
xgb.score(X_train_scaled, y_train)

0.7780262867240998

In [94]:
xgb.score(X_test_scaled, y_test)

0.7040258449304175

Not bad. Let's try the XGBoost random forest

In [95]:
xgbrf = xgboost.XGBRFClassifier(random_state=42)

In [96]:
xgbrf.fit(X_train_scaled, y_train)

XGBRFClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain',
                interaction_constraints='', max_delta_step=0, max_depth=6,
                min_child_weight=1, missing=nan, monotone_constraints='()',
                n_estimators=100, n_jobs=0, num_parallel_tree=100,
                objective='multi:softprob', random_state=42, reg_alpha=0,
                scale_pos_weight=None, tree_method='exact',
                validate_parameters=1, verbosity=None)

In [97]:
xgbrf.score(X_train_scaled,y_train)

0.6578584051248068

In [98]:
xgbrf.score(X_test_scaled, y_test)

0.6435553346587144

Stacking classifier?

In [117]:
estimators = [
    ('lr', LogisticRegression(random_state=42)), 
    ('knn', KNeighborsClassifier()),
    ('dtc', DecisionTreeClassifier(random_state=42))
]

In [119]:
stack = StackingClassifier(estimators)

In [120]:
stack.fit(X_train_scaled,y_train)

StackingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                               ('knn', KNeighborsClassifier()),
                               ('dtc',
                                DecisionTreeClassifier(random_state=42))])

In [121]:
stack.score(X_train_scaled, y_train)

0.8850784183786172

In [122]:
stack.score(X_test_scaled,y_test)

0.6998011928429424

It's overfit, but it gives me hope. Since the decision tree and KNN are both overfit on their own, that probably explains why.

**Another idea to try is reducing max_depth in the decision tree and tuning KNN as well**
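
A sketch of how that tuning could look with `GridSearchCV`. The data comes from `make_classification` as a stand-in for the scaled numeric features, and the grids are arbitrary starting points rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the real features (assumption).
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Shallower trees and more neighbours both reduce variance.
searches = {
    "dtc": GridSearchCV(
        DecisionTreeClassifier(random_state=42), {"max_depth": [3, 5, 10, None]}, cv=5
    ),
    "knn": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [5, 15, 31]}, cv=5),
}

results = {}
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    results[name] = {
        "best_params": search.best_params_,
        "cv": search.best_score_,            # mean cross-validated accuracy
        "train": search.score(X_tr, y_tr),   # accuracy of the refit best model
        "test": search.score(X_te, y_te),
    }
    print(name, results[name])
```

Comparing the train and test entries per model shows whether the chosen setting actually narrowed the gap.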

Gonna try one last stack that also includes the models that are less overfit.

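Setting the objective for xgb based on https://stackoverflow.com/questions/57986259/multiclass-classification-with-xgboost-classifier

Looks like the default objective is for binary problems? (The fitted XGBClassifier above reports objective='multi:softprob', though, so it seems to switch on its own for a multiclass target.)

The explicit objective can be removed to get back to the default. Current scores without the adjustment:
- train: 0.7814777998674619
- test: 0.7038601722995361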

In [158]:
estimators2 = [
    ('lr', LogisticRegression(random_state=42, max_iter=1000)), 
    ('knn', KNeighborsClassifier()),
    ('dtc', DecisionTreeClassifier(random_state=42)),
    ('rfc', RandomForestClassifier(random_state=42)),
    ('ada', AdaBoostClassifier(random_state=42)),
    ('xgb', xgboost.XGBClassifier(random_state=42, objective='multi:softmax')),
    ('xgbrf', xgboost.XGBRFClassifier(random_state=42))
]

In [159]:
stack2 = StackingClassifier(estimators2)

In [160]:
stack2.fit(X_train_scaled,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


StackingClassifier(estimators=[('lr',
                                LogisticRegression(max_iter=1000,
                                                   random_state=42)),
                               ('knn', KNeighborsClassifier()),
                               ('dtc', DecisionTreeClassifier(random_state=42)),
                               ('rfc', RandomForestClassifier(random_state=42)),
                               ('ada', AdaBoostClassifier(random_state=42)),
                               ('xgb',
                                XGBClassifier(base_score=None, booster=None,
                                              colsample_bylevel=None,
                                              colsample_bynode=N...
                                                gamma=None, gpu_id=None,
                                                importance_type='gain',
                                                interaction_constraints=None,
                                              

In [161]:
stack2.score(X_train_scaled, y_train)

0.9211950519107577

In [162]:
stack2.score(X_test_scaled,y_test)

0.7301192842942346
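
The convergence warning printed during the fit may be coming from the stack's final estimator rather than from the `lr` base model. When `final_estimator` is not given, `StackingClassifier` uses a default `LogisticRegression`, which keeps the default iteration limit. That source of the warning is an assumption, not something the log confirms. A sketch on synthetic data that passes the final estimator explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the scaled features (assumption).
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("dtc", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ],
    # Explicit meta-model with a higher iteration limit than the default.
    final_estimator=LogisticRegression(max_iter=1000, random_state=42),
    cv=5,
)
stack.fit(X, y)

# The fitted meta-model is exposed as final_estimator_.
print(stack.final_estimator_.max_iter)
```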

NOW I NEED TO LOOK AT ALL SCORES DATA AND SEE HOW IT ACTUALLY COMPARES WHEN LOOKING AT MORE THAN JUST ACCURACY
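
A small sketch of what looking beyond accuracy could involve, using a confusion matrix and per-class recall. The `y_true` and `y_pred` arrays are made up so the numbers are easy to follow. They are not model output.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Made-up labels: 0 = functional, 1 = functional needs repair, 2 = non functional.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 2, 0, 1, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall per class: the share of each true class that the model found.
per_class_recall = recall_score(y_true, y_pred, average=None)
print(per_class_recall)

# Precision, recall and F1 for every class in one table.
print(
    classification_report(
        y_true,
        y_pred,
        target_names=["functional", "functional needs repair", "non functional"],
    )
)
```

With the "functional needs repair" class at about 7% of the rows, a model can score well on accuracy while missing most of those wells, and per-class recall makes that visible.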

Also need to add in cat data and ohe it, then concat and run again
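
One possible shape for that step, sketched with `ColumnTransformer` so the scaling and one-hot encoding are fit on the training data only and no manual concat is needed. The toy frame, its values and the column choice are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric and two categorical columns (made-up values).
toy = pd.DataFrame(
    {
        "gps_height": [1390, 1399, 686, 263, 0, 351],
        "population": [109, 280, 250, 58, 0, 89],
        "basin": ["Pangani", "Rufiji", "Pangani", "Internal", "Rufiji", "Internal"],
        "quantity": ["enough", "insufficient", "enough", "dry", "seasonal", "enough"],
    }
)
y_toy = [0, 0, 0, 2, 0, 2]

num_cols = ["gps_height", "population"]
cat_cols = ["basin", "quantity"]

# Scale numeric columns, one-hot encode categorical ones.
# handle_unknown="ignore" keeps predict from failing on unseen categories.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

model = Pipeline(
    [
        ("prep", preprocess),
        ("rfc", RandomForestClassifier(n_estimators=10, random_state=42)),
    ]
)
model.fit(toy, y_toy)

# 2 scaled numeric columns + 3 basin categories + 4 quantity categories.
n_features_out = model.named_steps["prep"].transform(toy).shape[1]
print(n_features_out)
```

High-cardinality columns such as `funder` or `wpt_name` would produce thousands of one-hot columns, so they probably need grouping or dropping first.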

# Note to self: don't delete stuff until the very end. Keep an iterative approach regardless of things getting a little wild so I can keep tabs on performance for it all. THIS IS V IMPORTANT