# Identify DataTypes

### Introduction

We now have the task of working with the rest of our training data.  With so many columns, it's often difficult to know where to begin.  Our general technique will be to capture as much of our data as possible, and then feed it into our model to get feedback on model performance and feature importances.

To capture as much data as possible, we effectively have to coerce as many columns as possible into numbers.  We can accomplish this in stages, working with the data that is easiest to coerce first.  Let's get started.

### Sizing it Up

In [None]:
import pandas as pd 

url = "https://raw.githubusercontent.com/jigsawlabs-student/engineering-large-datasets/master/x_train.csv"
X_train_df = pd.read_csv(url)

In [18]:
X_train_df.shape

(18035, 96)

Let's start by reducing our problem.  We have 96 columns to work through, but some of them are unnecessary.  The only columns that we can be confident add no benefit are those where every value in the column is identical.
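A minimal sketch of that filtering step, assuming a column counts as constant when it holds at most one distinct value (missing values included). The helper name `drop_constant_columns` and the toy frame are illustrative, not part of the original notebook; the later cells use `X_train_df_var`, which presumably holds the result of a step like this.

```python
import pandas as pd

def drop_constant_columns(df):
    """Return a copy of df without columns that hold a single distinct value."""
    # nunique(dropna=False) counts NaN as its own value, so a column that is
    # entirely missing is treated as constant too.
    distinct_counts = df.nunique(dropna=False)
    varying_cols = distinct_counts[distinct_counts > 1].index
    return df[varying_cols].copy()

# Hypothetical toy data, only to show the behaviour.
toy_df = pd.DataFrame({
    "country": ["US", "US", "US"],        # constant, dropped
    "price": ["$10.00", "$25.00", None],  # varies, kept
    "notes": [None, None, None],          # all missing, dropped
})
toy_df_var = drop_constant_columns(toy_df)
print(list(toy_df_var.columns))
```

Counting missing values as a distinct value is a design choice: a column that is mostly empty but has one real value still varies, so it is kept for later inspection.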

### Coercing Numbers

In [193]:
X_train_object_df = X_train_df_var.select_dtypes(include = 'object')

Let's see which of these 51 object columns we can make usable.

In [194]:
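def contains_numbers(column):
    # True only when every non-missing value contains a digit and has no
    # letters, spaces, '-', '/' or 'www' (so prices and percentages pass,
    # while dates, URLs and free text do not).
    regex_string = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'
    # Earlier pattern, kept for reference:
    # regex_string = r'\$\d+.*|\d+.*\%$|^\d+.*$'
    return column.str.contains(regex_string).all()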
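To get a feel for what the pattern accepts, here is a small sketch that applies it to single strings with the standard library `re` module (`Series.str.contains` performs the same kind of search on each element). The sample strings are made up for illustration.

```python
import re

# Same pattern as in contains_numbers: reject strings containing "www", "-",
# "/", a letter, or a space, then require at least one digit.
pattern = re.compile(r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*')

samples = ["$1,000.00", "80%", "2019-03-14", "12/25/2019", "2 beds", "t"]
results = {s: bool(pattern.search(s)) for s in samples}
for value, matched in results.items():
    print(f"{value!r}: {matched}")
```

Only the price-like and percent-like strings match; the dates fail on `-` or `/`, and the text values fail on letters or spaces.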

In [195]:
potential_num_cols = [col for col in X_train_object_df.columns if contains_numbers(X_train_object_df[col])]

In [196]:
potential_num_cols

['host_response_rate',
 'weekly_price',
 'monthly_price',
 'security_deposit',
 'cleaning_fee',
 'extra_people']

In [197]:
almost_num_df = X_train_object_df[potential_num_cols]
almost_num_df[:6]

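| | host_response_rate | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | $216.00 | $43.00 | $8.00 |
| 1 | NaN | NaN | NaN | $1,000.00 | $160.00 | $0.00 |
| 2 | NaN | $300.00 | $1,000.00 | NaN | NaN | $30.00 |
| 3 | NaN | NaN | NaN | NaN | NaN | $0.00 |
| 4 | NaN | NaN | NaN | NaN | $20.00 | $25.00 |
| 5 | 80% | $350.00 | $990.00 | $400.00 | $100.00 | $0.00 |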


OK, now let's work on coercing these numeric columns.  We already have a `remove_price` method, which seems like it will work on our `security_deposit`, `cleaning_fee` and `extra_people` columns.

> One thing to consider is whether these columns really belong in our features.  They are all related to how much our host is charging for the apartment, which is what we are trying to predict.  Let's include them for now, but they may be something to remove later on.

In [30]:
from sklearn_pandas import FunctionTransformer, DataFrameMapper

In [31]:
col = almost_num_df['host_response_rate']

In [90]:
import pandas as pd

def convert_percent(val):
    # Leave missing values as they are.
    if pd.isnull(val):
        return pd.to_numeric(val)
    else:
        # Drop the trailing '%' before parsing, e.g. '80%' -> 80
        without_percent = val[:-1]
        return pd.to_numeric(without_percent)

In [85]:
from sklearn.impute import SimpleImputer
mapper = DataFrameMapper([
    (['host_response_rate'], [FunctionTransformer(convert_percent)])
], df_out = True)

In [101]:
transformed_host_rate = mapper.fit_transform(almost_num_df)

In [87]:
transformed_host_rate.dtypes

host_response_rate    float64
dtype: object

Now let's move on to the rest of the columns.

In [89]:
almost_num_df[:6]

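| | host_response_rate | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | $216.00 | $43.00 | $8.00 |
| 1 | NaN | NaN | NaN | $1,000.00 | $160.00 | $0.00 |
| 2 | NaN | $300.00 | $1,000.00 | NaN | NaN | $30.00 |
| 3 | NaN | NaN | NaN | NaN | NaN | $0.00 |
| 4 | NaN | NaN | NaN | NaN | $20.00 | $25.00 |
| 5 | 80% | $350.00 | $990.00 | $400.00 | $100.00 | $0.00 |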


We will convert the price columns in the same way, this time stripping the leading `$` and any commas before parsing.

In [124]:
def convert_price(val):
    # Leave missing values as they are.
    if pd.isnull(val):
        return pd.to_numeric(val)
    else:
        # '$1,000.00' -> '$1000.00' -> '1000.00' -> 1000.0
        without_commas = val.replace(',', '')
        without_dollar = without_commas[1:]
        return pd.to_numeric(without_dollar)
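For comparison, the same cleanup can be sketched on a whole column at once with pandas string methods instead of an element-wise function. This is an alternative, not the approach the notebook takes; the helper names `prices_to_numeric` and `percents_to_numeric` are hypothetical, and the sample values are taken from the preview table.

```python
import pandas as pd

def prices_to_numeric(column):
    """Strip '$' and ',' from a string column and parse it as numbers."""
    cleaned = column.str.replace(r'[$,]', '', regex=True)
    return pd.to_numeric(cleaned)

def percents_to_numeric(column):
    """Strip a trailing '%' from a string column and parse it as numbers."""
    return pd.to_numeric(column.str.rstrip('%'))

security_deposit = pd.Series(["$216.00", "$1,000.00", None, "$400.00"])
host_response_rate = pd.Series([None, None, "80%"])

deposit_nums = prices_to_numeric(security_deposit)
rate_nums = percents_to_numeric(host_response_rate)
print(deposit_nums.tolist())
print(rate_nums.tolist())
```

Missing entries stay missing in both helpers, so no explicit null check is needed in this version.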

In [125]:
(['security_deposit'], [FunctionTransformer(convert_price)])

(['security_deposit'], [FunctionTransformer(func=None)])

Remember that we have a list of the columns we wish to convert.

In [126]:
potential_num_cols[3:]

['security_deposit', 'cleaning_fee', 'extra_people']

So we can loop through that list and build the mapper steps in code.

In [127]:
convert_dolls_to_nums = [([col], [FunctionTransformer(convert_price)]) 
                         for col in potential_num_cols[3:]]

In [128]:
convert_dolls_to_nums

[(['security_deposit'], [FunctionTransformer(func=None)]),
 (['cleaning_fee'], [FunctionTransformer(func=None)]),
 (['extra_people'], [FunctionTransformer(func=None)])]

In [246]:
from sklearn.impute import SimpleImputer
convert_percent_to_nums = [(['host_response_rate'], [FunctionTransformer(convert_percent)])]
# Combine the percent step with the dollar steps built above.
convert_to_nums_steps = convert_percent_to_nums + convert_dolls_to_nums
# default=None passes the columns not listed in the steps through untouched.
to_number_mapper = DataFrameMapper(convert_to_nums_steps, df_out=True, default=None)

In [247]:
converted_nums_df = to_number_mapper.fit_transform(X_train_df_var)

In [230]:
dtypes = dict(X_train_df_var.dtypes)

In [244]:
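# Keep only the non-object dtypes; compare with != rather than `is not`,
# since identity checks against a string literal are unreliable.
dtypes = {k: v for k, v in dtypes.items() if v.kind != 'O'}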

In [248]:
updated_df = converted_nums_df.astype(dtypes)
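The cells above collect the original dtypes of the non-object columns and re-apply them with `astype`. Here is a small sketch of that idea on a toy frame, assuming the transformed frame came back with every column typed as object. The data and variable names are hypothetical.

```python
import pandas as pd

original_df = pd.DataFrame({
    "beds": [1, 2, 3],
    "rating": [4.5, 3.0, 5.0],
    "name": ["a", "b", "c"],
})

# Simulate a transformation step that hands back every column as object.
as_object_df = original_df.astype(object)

# Keep only the dtypes that were not object to begin with, then re-apply them.
dtypes_to_restore = {col: dtype for col, dtype in original_df.dtypes.items()
                     if dtype.kind != 'O'}
restored_df = as_object_df.astype(dtypes_to_restore)
print(restored_df.dtypes)
```

Passing a dictionary to `astype` casts only the listed columns, so the text column is left alone.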

In [249]:
updated_df.select_dtypes(include = 'object').shape

(18035, 47)

In [250]:
updated_df.select_dtypes(exclude = 'object').shape

(18035, 38)