# Overview
This notebook is a tutorial on the basic functionality of the automated feature engineering library Featuretools.

### Imports

In [211]:
#The basics
import pandas as pd
import numpy as np 

### Utility Functions and Lists (skip to "Understanding feature tools")
I'm defining a few utility functions to make cleaning our data faster. The purpose of this tutorial is understanding featuretools, so we won't spend too much time worrying about picture-perfect cleaning.

In [164]:
#Fix dates
def fix_dates(x):

    #Import
    import datetime
    now = datetime.datetime.today()

    #Turn string to datetime
    x['date_recorded'] = pd.to_datetime(x['date_recorded'], format='%Y-%m-%d')

    #Time between the recording date and now (negative for dates in the past)
    x['age'] = x['date_recorded'] - now

    #sklearn doesn't like time. Turn it into an int
    x['age'] = x['age'].dt.days
    
    return x

#remove the columns that we don't want
def drop_stuff(x):
    x = x.drop(to_drop, axis=1)

    return x

#label NaNs
def label_nans(x):
    #columns whose missing values get the placeholder 'unknown'
    cols = ['funder', 'permit', 'installer', 'subvillage',
            'scheme_name', 'public_meeting', 'scheme_management']

    #assign the result back instead of using inplace=True on a column attribute
    x[cols] = x[cols].fillna('unknown')

    return x

#Clean Data
def clean_data(x):
    x = label_nans(x)
    x = fix_dates(x)
    x = drop_stuff(x)
    
    return x

In [166]:
#drop categories that are excessive, or drop redundant
to_drop = ['funder', 'installer', 'wpt_name', 'subvillage','region_code',
          'ward', 'scheme_name','payment', 'quantity_group', 'recorded_by',]

#reserved in case I need to drop gps coords
angry_model = ['gps_height','longitude','latitude']

# Understanding feature tools
---
---

## Organizational Structure 
There are 3 key pieces of terminology you need in order to understand the structure that featuretools uses.

### Entities
An entity is just a table. Each pandas dataframe is an entity.

### Entity Sets
This is just a group of entities. If you have three pandas dataframes that you want to use with featuretools, then they will all be contained in the same Entity Set.

### Relationships
The most important thing to understand before getting started with featuretools is how relationships function.

Relationships are organized as "parent" and "child" relationships. The "parent" and "child" are both dataframes. The relationship between parent and child is a shared feature (column). The parent can only have unique values in the shared feature, while the child can have repeats of values in the shared feature. 

This was hard for me to wrap my head around, so I drew a picture.

'Dog Breeds Dataframe' (Entity 1) and 'Dogs at Park Today Dataframe' (Entity 2) are both entities that are contained in the entity set called 'Entity Set'. 

Entity 1 has a list of dog breeds and their (fake) attributes. Entity 2 is a list of observations about dogs at the dog park today.

Each breed is only listed once under the column "Breed" in Entity 1, but breeds may appear multiple times in the "breed" column of Entity 2. That makes Entity 1 the parent and Entity 2 the child for this particular relationship.
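To make the parent/child rule concrete, here is a small pandas sketch of the two dog tables. The column names and values are made up for illustration (they are not taken from the diagram). The only point is that the shared column is unique in the parent and may repeat in the child.

```python
import pandas as pd

# Hypothetical parent table: one row per breed, so 'breed' is unique
breeds = pd.DataFrame({
    'breed': ['corgi', 'poodle', 'husky'],
    'avg_weight_kg': [12, 25, 23],  # made-up attribute
})

# Hypothetical child table: one row per dog seen today, so 'breed' may repeat
dogs_at_park = pd.DataFrame({
    'dog_name': ['Biscuit', 'Waffles', 'Luna', 'Rex'],
    'breed': ['corgi', 'corgi', 'husky', 'poodle'],
})

# The parent side of a relationship has unique values in the shared column
parent_is_unique = breeds['breed'].is_unique

# The child side is allowed to repeat them ('corgi' appears twice)
child_is_unique = dogs_at_park['breed'].is_unique

# Every value in the child should also exist in the parent
child_values_known = dogs_at_park['breed'].isin(breeds['breed']).all()

print(parent_is_unique, child_is_unique, child_values_known)
```

If the first check fails for a column, that dataframe can't act as the parent on that column.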

<img src="relationship_diagram_updated.png" />

# Prepare our data
---
---
The data that we're using is from a previous Kaggle competition. The goal of the competition was to take the data provided and predict which wells would be in need of repair. The data is fairly dirty, has lots of different data types, and is great to learn on.

Our first step is to take our dataframe and cut it into a more digestible chunk so that it's easier for us to understand exactly what featuretools is doing.

In [212]:
# Allow us to view up to 500 columns so that we don't deal with the '...'
pd.set_option('display.max_columns', 500)

# Load in our full set of train data
X = pd.read_csv('train_features.csv')

# Use cleaning function
X = clean_data(X)

# features that we want included for instructional purposes
subset = ['id', 'date_recorded', 'num_private', 'basin', 'population', 'public_meeting']

# a dataframe with a random sample of 1000 rows, including only our selected features
practice = X[subset].sample(1000)

In [213]:
practice.head()

Unnamed: 0,id,date_recorded,num_private,basin,population,public_meeting
25129,22688,2011-03-16,0,Lake Nyasa,148,True
33559,60443,2011-07-13,0,Lake Victoria,0,True
7316,62013,2013-01-29,0,Ruvuma / Southern Coast,400,True
32428,32228,2013-03-04,0,Pangani,120,True
57685,51201,2011-02-26,102,Wami / Ruvu,20,True


In [184]:
# What does the data in our sub-sample look like?
for col in practice.columns:
    print(col, '\n', practice[col].dtype, '\nUnique Values: ', practice[col].nunique(), '\n')

id 
 int64 
Unique Values:  1000 

date_recorded 
 datetime64[ns] 
Unique Values:  241 

num_private 
 int64 
Unique Values:  10 

basin 
 object 
Unique Values:  9 

population 
 int64 
Unique Values:  148 

public_meeting 
 object 
Unique Values:  3 



# Automated feature engineering time!
---
---

In [None]:
# Step 1, import featuretools
# (the calls below use the pre-1.0 API: entity_from_dataframe, normalize_entity, target_entity)
import featuretools as ft


In [182]:
# Step 2, create a new entity set
es = ft.EntitySet('Entity Set')

In [185]:
# Step 3, add our entities
es.entity_from_dataframe(dataframe=practice, # the dataframe that you want to use to construct the entity
                        entity_id='entity_1', # the reference name for this entity when using featuretools
                        index='id' # the feature with unique values to identify each row by
                        )


Entityset: Entity Set
  Entities:
    entity_1 [Rows: 1000, Columns: 6]
  Relationships:
    No relationships

# Houston, we have a problem.
---
---
We have only 1 entity under Entities, and no relationships. In our example with the dog breeds, we had two dataframes so that we could create a relationship for featuretools to use. Here we have a single dataframe to work with, which means only 1 entity. 

Luckily, featuretools allows us to create new entities from existing entities using "normalize_entity". 

If you look back at our "basin" column, you'll see that it had 9 unique values. Let's see what happens if we turn "basin" into its own entity using the featuretools "normalize_entity" function.

In [186]:
# Create a new entity from an existing entity's features

es.normalize_entity(base_entity_id='entity_1', # the entity that has the feature/column that you want to turn into an entity
                   new_entity_id='basins', # the reference name for this new entity when using featuretools
                    index='basin' # the name of the feature that you want to use to construct your new entity
                   )

Entityset: Entity Set
  Entities:
    entity_1 [Rows: 1000, Columns: 6]
    basins [Rows: 9, Columns: 1]
  Relationships:
    entity_1.basin -> basins.basin
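In plain pandas terms, the result looks a lot like pulling the unique values of one column out into their own table. The sketch below mimics that on a tiny made-up dataframe. It's an analogy for what we just saw, not how featuretools implements it.

```python
import pandas as pd

# Tiny stand-in for the 'practice' dataframe (values are made up)
wells = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'basin': ['Pangani', 'Rufiji', 'Pangani', 'Internal', 'Rufiji'],
    'population': [120, 0, 400, 50, 20],
})

# New "parent" table: one row per unique basin
basins = wells[['basin']].drop_duplicates().reset_index(drop=True)

# 'basin' is unique in the parent and repeated in the child
print(basins['basin'].is_unique)
print(len(wells), len(basins))
```

That gives us exactly the shape a relationship needs: `basins` is the parent and `wells` is the child.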

# Actually time to do some automated feature engineering!
---
---
It will be easier to run the code and look at what featuretools generated before I explain exactly what featuretools is doing. After that, we'll discuss the featuretools approach to feature generation. The only thing that you need to know for this next step is that the target_entity is the entity that we are adding the new features to. It will usually be the dataframe that your model will use to generate predictions.

In [188]:
# this is standard syntax for featuretools.
# fm is your dataframe with both the old and newly engineered features
# features is your list of features in fm 
fm, features = ft.dfs(entityset=es,
                     target_entity='entity_1') 

In [194]:
fm.head()

id,num_private,basin,population,public_meeting,DAY(date_recorded),YEAR(date_recorded),MONTH(date_recorded),WEEKDAY(date_recorded),basins.SUM(entity_1.num_private),basins.SUM(entity_1.population),basins.STD(entity_1.num_private),basins.STD(entity_1.population),basins.MAX(entity_1.num_private),basins.MAX(entity_1.population),basins.SKEW(entity_1.num_private),basins.SKEW(entity_1.population),basins.MIN(entity_1.num_private),basins.MIN(entity_1.population),basins.MEAN(entity_1.num_private),basins.MEAN(entity_1.population),basins.COUNT(entity_1),basins.NUM_UNIQUE(entity_1.public_meeting),basins.MODE(entity_1.public_meeting)
5,0,Wami / Ruvu,6922,True,26,2011,2,5,0,29007,0.0,1017.679601,0,6922,0.0,6.147406,0,0,0.0,315.293478,92,3,True
102,0,Lake Tanganyika,145,True,20,2013,1,6,0,30669,0.0,451.701544,0,3200,0.0,4.115083,0,0,0.0,262.128205,117,3,True
119,0,Rufiji,0,True,15,2011,4,4,2,23220,0.120818,341.077796,1,2800,8.153387,4.927333,0,0,0.014706,170.735294,136,3,True
214,0,Ruvuma / Southern Coast,320,True,18,2013,1,4,30,17369,3.692745,484.819922,30,3000,8.124038,4.444413,0,0,0.454545,263.166667,66,3,True
337,0,Lake Tanganyika,450,True,26,2013,1,5,0,30669,0.0,451.701544,0,3200,0.0,4.115083,0,0,0.0,262.128205,117,3,True


In [190]:
features

[<Feature: num_private>,
 <Feature: basin>,
 <Feature: population>,
 <Feature: public_meeting>,
 <Feature: DAY(date_recorded)>,
 <Feature: YEAR(date_recorded)>,
 <Feature: MONTH(date_recorded)>,
 <Feature: WEEKDAY(date_recorded)>,
 <Feature: basins.SUM(entity_1.num_private)>,
 <Feature: basins.SUM(entity_1.population)>,
 <Feature: basins.STD(entity_1.num_private)>,
 <Feature: basins.STD(entity_1.population)>,
 <Feature: basins.MAX(entity_1.num_private)>,
 <Feature: basins.MAX(entity_1.population)>,
 <Feature: basins.SKEW(entity_1.num_private)>,
 <Feature: basins.SKEW(entity_1.population)>,
 <Feature: basins.MIN(entity_1.num_private)>,
 <Feature: basins.MIN(entity_1.population)>,
 <Feature: basins.MEAN(entity_1.num_private)>,
 <Feature: basins.MEAN(entity_1.population)>,
 <Feature: basins.COUNT(entity_1)>,
 <Feature: basins.NUM_UNIQUE(entity_1.public_meeting)>,
 <Feature: basins.MODE(entity_1.public_meeting)>]

# So what just happened? 
---
---
The first thing that we can see is that ft separated our date_recorded feature into DAY, YEAR, MONTH, and WEEKDAY, and then dropped date_recorded. We can tell which feature a new feature was derived from by looking inside the parentheses of the new feature name, for example DAY(date_recorded). 

### Transformations
This act of taking one feature and transforming it into a new feature is called, simply enough, a "transformation". It is one of the two categories of operations that featuretools refers to as "primitives". 
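For comparison, here is a rough pandas equivalent of those four date transformations. The two dates are chosen to match the first two rows of the feature matrix above (26 Feb 2011 and 20 Jan 2013). This is a sketch of the idea using the pandas `.dt` accessor, not the featuretools implementation.

```python
import pandas as pd

dates = pd.DataFrame({
    'date_recorded': pd.to_datetime(['2011-02-26', '2013-01-20']),
})

# Each new column is computed from a single existing column, row by row
transformed = pd.DataFrame({
    'DAY(date_recorded)': dates['date_recorded'].dt.day,
    'YEAR(date_recorded)': dates['date_recorded'].dt.year,
    'MONTH(date_recorded)': dates['date_recorded'].dt.month,
    # pandas counts weekdays from Monday=0, so Saturday=5 and Sunday=6
    'WEEKDAY(date_recorded)': dates['date_recorded'].dt.weekday,
})

print(transformed)
```

Notice that a transformation never needs a second table. It only looks at the row it is working on.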

### Aggregations
The other kind of primitive is called an aggregation. An example of an aggregation is basins.SUM(entity_1.num_private): for each basin, the sum of num_private over all rows of entity_1 that belong to that basin. We'll explore this in more depth in the next cells. 

If you look below, you'll see 2 pandas groupby aggregations: one over our numeric columns and one over a categorical column. They reproduce the per-basin values that featuretools generated.  

In [208]:
# group by basin and compute the sum, stddev, max, skew, min, and mean of our numeric columns
# (note the double brackets: selecting several columns from a groupby requires a list)
numeric_aggregation = practice.groupby('basin')[['num_private','population']].agg(['sum', 'std', 'max', 'skew', 'min', 'mean'])

# group by basin and count the number of unique values in our categorical column
# mode for categorical data takes some extra work in pandas, so we're skipping it here 
categorical_aggregation = practice.groupby('basin')['public_meeting'].agg(['nunique']) 

In [209]:
numeric_aggregation

,num_private,num_private,num_private,num_private,num_private,num_private,population,population,population,population,population,population
basin,sum,std,max,skew,min,mean,sum,std,max,skew,min,mean
Internal,0,0.0,0,0.0,0,0.0,14067,196.647206,1200,2.794165,0,109.898438
Lake Nyasa,141,15.666667,141,9.0,0,1.740741,3434,100.849849,500,3.125453,0,42.395062
Lake Rukwa,0,0.0,0,0.0,0,0.0,7033,326.131709,1500,3.329905,0,156.288889
Lake Tanganyika,0,0.0,0,0.0,0,0.0,30669,451.701544,3200,4.115083,0,262.128205
Lake Victoria,14,0.654158,6,8.714305,0,0.079545,42382,862.092403,7500,6.009413,0,240.806818
Pangani,153,5.658965,45,6.59666,0,0.962264,26694,331.313018,3500,6.959303,1,167.886792
Rufiji,2,0.120818,1,8.153387,0,0.014706,23220,341.077796,2800,4.927333,0,170.735294
Ruvuma / Southern Coast,30,3.692745,30,8.124038,0,0.454545,17369,484.819922,3000,4.444413,0,263.166667
Wami / Ruvu,0,0.0,0,0.0,0,0.0,29007,1017.679601,6922,6.147406,0,315.293478


In [210]:
categorical_aggregation

basin,nunique
Internal,3
Lake Nyasa,3
Lake Rukwa,2
Lake Tanganyika,3
Lake Victoria,3
Pangani,2
Rufiji,3
Ruvuma / Southern Coast,3
Wami / Ruvu,3
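For completeness, one way to get a per-group mode in pandas is to aggregate with a small helper function. This sketch uses a made-up dataframe, and taking the first value when there is a tie is an assumption of the sketch, not necessarily what the featuretools MODE primitive does.

```python
import pandas as pd

# Made-up data; public_meeting is kept as strings to keep the example simple
df = pd.DataFrame({
    'basin': ['Pangani', 'Pangani', 'Pangani', 'Rufiji', 'Rufiji'],
    'public_meeting': ['True', 'True', 'unknown', 'False', 'False'],
})

def first_mode(values):
    # Series.mode() returns every most-frequent value; keep the first one
    return values.mode().iloc[0]

mode_by_basin = df.groupby('basin')['public_meeting'].agg(first_mode)

print(mode_by_basin)
```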


---
---
Aggregations are bread-and-butter feature engineering operations. Groupby, aggregation, join to the original dataframe. It can get tedious. Featuretools was able to handle it for us.
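Here is what that groupby, aggregate, join routine looks like by hand for two of the features. The dataframe is made up, and the column names are simply copied from the featuretools output so the result is easy to compare.

```python
import pandas as pd

# Tiny stand-in for our wells data (values are made up)
wells = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'basin': ['Pangani', 'Rufiji', 'Pangani', 'Rufiji'],
    'population': [120, 0, 400, 20],
})

# 1. Group the child rows by the parent key and aggregate them
basin_stats = (
    wells.groupby('basin')['population']
    .agg(['sum', 'mean'])
    .rename(columns={
        'sum': 'basins.SUM(entity_1.population)',
        'mean': 'basins.MEAN(entity_1.population)',
    })
)

# 2. Join the per-basin values back onto every row of the original table
wells_with_features = wells.merge(
    basin_stats, how='left', left_on='basin', right_index=True
)

print(wells_with_features)
```

Every well in the same basin ends up with the same aggregated values, which is exactly the pattern in the feature matrix above.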

However, the functions that we've demonstrated so far have only scratched the surface of what featuretools can do. 

In [178]:


# Primary Entity

# Additional Entities Made From Primary
es.normalize_entity(base_entity_id='test',
                   new_entity_id='basins',
                    index='basin'
                   )

es.normalize_entity(base_entity_id='test',
                   new_entity_id='lgas',
                    index='lga'
                   )

es.normalize_entity(base_entity_id='test',
                   new_entity_id='extraction_types',
                    index='extraction_type'
                   )

Entityset: Learn Stuff
  Entities:
    test [Rows: 59400, Columns: 31]
    basins [Rows: 9, Columns: 1]
    lgas [Rows: 125, Columns: 1]
    extraction_types [Rows: 18, Columns: 1]
  Relationships:
    test.basin -> basins.basin
    test.lga -> lgas.lga
    test.extraction_type -> extraction_types.extraction_type

### fm is our new dataframe

In [107]:
fm, features = ft.dfs(entityset=es,
                     target_entity='test')

# Modeling Prep

### Prep y

In [85]:
from sklearn.preprocessing import LabelEncoder

y = pd.read_csv('train_labels.csv')

#make y values match x values, which are sorted by id
y = y.sort_values('id').reset_index(drop=True)

#status_group is target
y = y.status_group

#encode y to numbers for model
le = LabelEncoder()
y = le.fit_transform(y)
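Sorting works as long as both tables contain exactly the same ids. A slightly more defensive sketch is to index the labels by id and reindex them to the feature matrix's index, so that a mismatch shows up as a missing value instead of silently shifted rows. The data below is made up, and `fm_index` is a hypothetical stand-in for `fm.index`.

```python
import pandas as pd

# Made-up labels, deliberately out of order
labels = pd.DataFrame({
    'id': [30, 10, 20],
    'status_group': ['functional', 'non functional', 'functional'],
})

# Hypothetical stand-in for fm.index (the ids of the feature matrix rows)
fm_index = pd.Index([10, 20, 30], name='id')

# Look each id up in the labels; ids without a label would become NaN
aligned = labels.set_index('id')['status_group'].reindex(fm_index)

# Fail loudly if any row of the feature matrix has no label
assert not aligned.isna().any()

print(aligned.tolist())
```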


### Prep X

In [129]:
X = fm

### Split data

In [120]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X,y)

### Separate our data into categorical and numeric
This allows us to perform PCA on our numeric data and MCA on our categorical data. This is an important distinction, because PCA works with variances and distances between numeric points, which aren't meaningful for categorical data.
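A small sketch of the split on a made-up dataframe, plus a quick look at why category codes make poor inputs for distance-based methods. The names `X_demo`, `nums_demo`, and `cats_demo` are hypothetical, and using `exclude=np.number` for the categorical side is my choice here so that non-numeric columns of any dtype are caught.

```python
import numpy as np
import pandas as pd

X_demo = pd.DataFrame({
    'population': [120, 0, 400],
    'amount_tsh': [6000.0, 0.0, 25.0],
    'basin': ['Pangani', 'Rufiji', 'Internal'],
})

# Numeric columns go one way, everything else goes the other
nums_demo = X_demo.select_dtypes(include=np.number)
cats_demo = X_demo.select_dtypes(exclude=np.number)

# Integer codes for categories follow alphabetical order, so the "distance"
# between two basins would depend on spelling, not on anything real
basin_codes = cats_demo['basin'].astype('category').cat.codes

print(list(nums_demo.columns), list(cats_demo.columns))
print(basin_codes.tolist())
```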

In [137]:
from sklearn.decomposition import PCA

#separate into numeric and categorical
nums = X_train.select_dtypes(include=np.number)
cats = X_train.select_dtypes(include='object')

In [139]:
#numeric data has a few nans. Add arbitrary value because I'm tired
nums = nums.fillna(1)

In [140]:
pca = PCA(n_components=50)
pca.fit(nums)

PCA(copy=True, iterated_power='auto', n_components=50, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [141]:
num_vals = pca.transform(nums)

In [142]:
cats.isna().sum().sum()

0

In [144]:
cats = cats.astype('category')

In [161]:
#cats = cats.drop('lgas.MODE(test.permit)', axis=1)
#cats = cats.drop('permit',axis=1)
#cats = cats.drop('lgas.MODE(test.public_meeting)', axis=1)
cats = cats.drop('public_meeting', axis=1)
cats.head()

id,basin,region,lga,scheme_management,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,basins.MODE(test.region),basins.MODE(test.lga),basins.MODE(test.scheme_management),basins.MODE(test.extraction_type),basins.MODE(test.extraction_type_group),basins.MODE(test.extraction_type_class),basins.MODE(test.management),basins.MODE(test.management_group),basins.MODE(test.payment_type),basins.MODE(test.water_quality),basins.MODE(test.quality_group),basins.MODE(test.quantity),basins.MODE(test.source),basins.MODE(test.source_type),basins.MODE(test.source_class),basins.MODE(test.waterpoint_type),basins.MODE(test.waterpoint_type_group),lgas.MODE(test.basin),lgas.MODE(test.region),lgas.MODE(test.scheme_management),lgas.MODE(test.extraction_type),lgas.MODE(test.extraction_type_group),lgas.MODE(test.extraction_type_class),lgas.MODE(test.management),lgas.MODE(test.management_group),lgas.MODE(test.payment_type),lgas.MODE(test.water_quality),lgas.MODE(test.quality_group),lgas.MODE(test.quantity),lgas.MODE(test.source),lgas.MODE(test.source_type),lgas.MODE(test.source_class),lgas.MODE(test.waterpoint_type),lgas.MODE(test.waterpoint_type_group),extraction_types.MODE(test.basin),extraction_types.MODE(test.region),extraction_types.MODE(test.lga),extraction_types.MODE(test.scheme_management),extraction_types.MODE(test.extraction_type_group),extraction_types.MODE(test.extraction_type_class),extraction_types.MODE(test.management),extraction_types.MODE(test.management_group),extraction_types.MODE(test.payment_type),extraction_types.MODE(test.water_quality),extraction_types.MODE(test.quality_group),extraction_types.MODE(test.quantity),extraction_types.MODE(test.source),extraction_types.MODE(test.source_type),extraction_types.MODE(test.source_class),extraction_types.MODE(test.waterpoint_type),extraction_types.MODE(test.waterpoint_type_group)
73006,Lake Victoria,Shinyanga,Kahama,unknown,nira/tanira,nira/tanira,handpump,unknown,unknown,unknown,unknown,unknown,unknown,shallow well,shallow well,groundwater,hand pump,hand pump,Mwanza,Bariadi,VWC,nira/tanira,nira/tanira,handpump,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump,Lake Tanganyika,Shinyanga,unknown,other,other,handpump,wug,user-group,unknown,milky,milky,enough,shallow well,shallow well,groundwater,other,other,Lake Victoria,Shinyanga,Bariadi,VWC,nira/tanira,handpump,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump
28772,Rufiji,Singida,Manyoni,unknown,afridev,afridev,handpump,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump,Iringa,Njombe,VWC,gravity,gravity,gravity,vwc,user-group,never pay,soft,good,enough,river,river/lake,groundwater,communal standpipe,communal standpipe,Internal,Singida,VWC,mono,mono,motorpump,vwc,user-group,never pay,soft,good,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,Lake Victoria,Tabora,Iramba,VWC,afridev,handpump,vwc,user-group,never pay,soft,good,enough,machine dbh,borehole,groundwater,hand pump,hand pump
11898,Lake Victoria,Shinyanga,Kahama,unknown,nira/tanira,nira/tanira,handpump,unknown,unknown,unknown,unknown,unknown,unknown,shallow well,shallow well,groundwater,hand pump,hand pump,Mwanza,Bariadi,VWC,nira/tanira,nira/tanira,handpump,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump,Lake Tanganyika,Shinyanga,unknown,other,other,handpump,wug,user-group,unknown,milky,milky,enough,shallow well,shallow well,groundwater,other,other,Lake Victoria,Shinyanga,Bariadi,VWC,nira/tanira,handpump,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump
65177,Lake Victoria,Kagera,Karagwe,VWC,gravity,gravity,gravity,vwc,user-group,never pay,soft,good,dry,river,river/lake,surface,communal standpipe,communal standpipe,Mwanza,Bariadi,VWC,nira/tanira,nira/tanira,handpump,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump,Lake Victoria,Kagera,unknown,gravity,gravity,gravity,vwc,user-group,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,Pangani,Iringa,Njombe,VWC,gravity,gravity,vwc,user-group,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe
54808,Lake Tanganyika,Tabora,Urambo,VWC,afridev,afridev,handpump,vwc,user-group,never pay,soft,good,enough,machine dbh,borehole,groundwater,hand pump,hand pump,Kigoma,Kasulu,VWC,gravity,gravity,gravity,vwc,user-group,unknown,soft,good,enough,shallow well,shallow well,groundwater,hand pump,communal standpipe,Lake Tanganyika,Tabora,VWC,afridev,afridev,handpump,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump,Lake Victoria,Tabora,Iramba,VWC,afridev,handpump,vwc,user-group,never pay,soft,good,enough,machine dbh,borehole,groundwater,hand pump,hand pump


In [153]:
import prince
mca = prince.MCA(
    n_components=20,
    n_iter=3,
    check_input=True,
    engine='auto',
    random_state=42
)

In [162]:
mca.fit(cats)

AttributeError: 'SparseDataFrame' object has no attribute 'to_numpy'

# Model

In [87]:
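from catboost import CatBoostClassifier

#identify categories (every column that is not a float)
#np.float was an alias for the builtin float and has been removed from NumPy
categorical_features_indices = np.where(X.dtypes != float)[0]

#set model
model = CatBoostClassifier(
    random_seed=42
)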
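It's worth seeing what that index selection actually returns. On the made-up dataframe below (the name `X_demo` is hypothetical), the integer column gets flagged along with the text column, because the test is "not a float" rather than "is categorical".

```python
import numpy as np
import pandas as pd

X_demo = pd.DataFrame({
    'amount_tsh': [6000.0, 0.0],                 # float: treated as numeric
    'population': [109, 280],                    # int: flagged as categorical
    'basin': ['Lake Nyasa', 'Lake Victoria'],    # text: flagged as categorical
})

# Positions of every column whose dtype is not float
demo_indices = np.where(X_demo.dtypes != float)[0]

print(demo_indices.tolist())
```

If you'd rather keep integer counts like population numeric, cast them to float before computing the indices.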

In [None]:
#fit model
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_valid, y_valid),
    logging_level='Verbose'
)

In [59]:
X.head()

Unnamed: 0,id,amount_tsh,num_private,basin,region,district_code,lga,population,public_meeting,scheme_management,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,age,age_since_built
0,69572,6000.0,0,Lake Nyasa,Iringa,5,Ludewa,109,True,VWC,False,1970-01-01 00:00:00.000001999,gravity,gravity,gravity,vwc,user-group,annually,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,-2887,15046 days 23:59:59.999998
1,8776,0.0,0,Lake Victoria,Mara,2,Serengeti,280,unknown,Other,True,1970-01-01 00:00:00.000002010,gravity,gravity,gravity,wug,user-group,never pay,soft,good,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,-2164,15769 days 23:59:59.999997
2,34310,25.0,0,Pangani,Manyara,4,Simanjiro,250,True,VWC,True,1970-01-01 00:00:00.000002009,gravity,gravity,gravity,vwc,user-group,per bucket,soft,good,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,-2173,15760 days 23:59:59.999997
3,67743,0.0,0,Ruvuma / Southern Coast,Mtwara,63,Nanyumbu,58,True,VWC,True,1970-01-01 00:00:00.000001986,submersible,submersible,submersible,vwc,user-group,never pay,soft,good,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,-2201,15732 days 23:59:59.999998
4,19728,0.0,0,Lake Victoria,Kagera,1,Karagwe,0,True,unknown,True,1970-01-01 00:00:00.000000000,gravity,gravity,gravity,other,other,never pay,soft,good,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,-2766,15168 days 00:00:00


# Scratch pad

In [None]:
#construct a dictionary of data types
f = ft.variable_types

var_types = {'amount_tsh':f.Numeric, 'num_private':f.Numeric, 'basin':f.Categorical, 'region':f.Categorical,
'district_code':f.Categorical, 'lga':f.Categorical, 'population':f.Numeric, 'public_meeting':f.Categorical, 
'scheme_management':f.Categorical, 'permit':f.Categorical, 'construction_year':f.Datetime, 'extraction_type':f.Categorical,
'extraction_type_group':f.Categorical, 'extraction_type_class':f.Categorical,'management':f.Categorical, 
'management_group':f.Categorical, 'payment_type':f.Categorical, 'water_quality':f.Categorical, 'quality_group':f.Categorical,
'quantity':f.Categorical, 'source':f.Categorical, 'source_type': f.Categorical, 'source_class':f.Categorical, 
'waterpoint_type':f.Categorical, 'waterpoint_type_group':f.Categorical, 'age':f.Numeric
}