
# Numerical Data Lab

## Introduction
In this lab, we will use `sklearn_pandas` to transform the numerical data in the `car_data.csv` dataset. Let's get started.

In [0]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/engineering-large-datasets/master/car_data.csv"
df = pd.read_csv(url)

In [0]:
df[:2]

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,6 years ago,$3.35,$5.59,27000,Petrol,Dealer,Manual,0
1,sx4,7 years ago,$4.75,$9.54,43000,Diesel,Dealer,Manual,0


We can see that two of the columns, `Selling_Price` and `Present_Price`, hold prices stored as strings with a leading `$`. The dataset is small enough to spot them by eye, so let's select them directly.

In [0]:
almost_nums_df = df[['Selling_Price', 'Present_Price']]

Ok, now let's use a list comprehension to create our steps, and then coerce the data with a DataFrameMapper. 

> Try not to reference the previous reading at first.  Only look to it if you get stuck.

In [0]:
import pandas as pd
import numpy as np

def price_to_num(val):
    price = val[1:]
    return float(price)
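As a quick sanity check, the sketch below runs `price_to_num` on two strings shaped like the values in `Selling_Price`. The `sample_prices` list is a hypothetical stand-in that copies values from the first rows shown above; it is not part of the lab itself.

```python
def price_to_num(val):
    # Drop the leading "$" and convert the remainder to a float.
    price = val[1:]
    return float(price)

# Hypothetical sample values, copied from the first rows of the dataset.
sample_prices = ["$3.35", "$5.59"]
converted = [price_to_num(p) for p in sample_prices]
print(converted)
```

Note that the function assumes every value starts with exactly one currency character; a value without the `$` prefix would lose its first digit.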

In [0]:
from sklearn_pandas import DataFrameMapper, FunctionTransformer

steps = [(column_name, FunctionTransformer(price_to_num)) for column_name in almost_nums_df.columns]
steps



[('Selling_Price', FunctionTransformer(func=None)),
 ('Present_Price', FunctionTransformer(func=None))]

In [0]:
mapper = DataFrameMapper(steps, df_out=True)

In [0]:
transformed_cols = mapper.fit_transform(df)
transformed_cols[:5]

# 	Selling_Price	Present_Price
# 0	3.35	5.59
# 1	4.75	9.54
# 2	7.25	9.85
# 3	2.85	4.15
# 4	4.60	6.87

Unnamed: 0,Selling_Price,Present_Price
0,3.35,5.59
1,4.75,9.54
2,7.25,9.85
3,2.85,4.15
4,4.6,6.87
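For comparison, here is a minimal plain-pandas sketch of the same coercion, without a mapper. It assumes a small hand-built `sample_df` (a hypothetical name) with the same two columns, and applies `price_to_num` to each value with `Series.map`. It is not the lab's intended solution, only a way to check what the mapper output should look like.

```python
import pandas as pd

def price_to_num(val):
    # Drop the leading "$" and convert the remainder to a float.
    return float(val[1:])

# Hypothetical two-row sample mirroring the price columns of car_data.csv.
sample_df = pd.DataFrame({
    "Selling_Price": ["$3.35", "$4.75"],
    "Present_Price": ["$5.59", "$9.54"],
})

# Apply the coercion column by column; each column becomes float64.
converted_df = sample_df.apply(lambda col: col.map(price_to_num))
print(converted_df.dtypes)
```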


### Working with Year

Ok, now let's add the year in there.  We can write a function that coerces the year, and use it with a transformer to add to our mapper.  Let's get going.

In [0]:
df['Year']

0       6 years ago
1       7 years ago
2       3 years ago
3       9 years ago
4       6 years ago
           ...     
296     4 years ago
297     5 years ago
298    11 years ago
299     3 years ago
300     4 years ago
Name: Year, Length: 301, dtype: object

In [0]:
def coerce_to_year(val):
    years_ago = val.split(" ")[0]
    return int(years_ago)
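The sketch below checks `coerce_to_year` against a few strings in the `"N years ago"` shape shown in the `Year` column, including a two-digit value. The `sample_years` list is a hypothetical sample, not data loaded from the CSV.

```python
def coerce_to_year(val):
    # Take the text before the first space ("6" in "6 years ago") and make it an int.
    years_ago = val.split(" ")[0]
    return int(years_ago)

# Hypothetical sample values in the same format as df['Year'].
sample_years = ["6 years ago", "11 years ago", "3 years ago"]
ages = [coerce_to_year(s) for s in sample_years]
print(ages)
```

Despite the name, the function returns the number of years elapsed rather than a calendar year.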

Store the step in `coerce_step`.

In [0]:
coerce_step = (['Year'], FunctionTransformer(coerce_to_year))
coerce_step



(['Year'], FunctionTransformer(func=None))

Then create a list of steps that convert both the prices and the year.

In [0]:
comb_steps = steps + [coerce_step]
comb_steps



[('Selling_Price', FunctionTransformer(func=None)),
 ('Present_Price', FunctionTransformer(func=None)),
 (['Year'], FunctionTransformer(func=None))]

And add the list of steps to the mapper.

In [0]:
mapper_with_num_converter = DataFrameMapper(comb_steps, df_out = True)

In [0]:
price_year_df = mapper_with_num_converter.fit_transform(df)

price_year_df[:3]

# 	Selling_Price	Present_Price	Year
# 0	3.35	5.59	6
# 1	4.75	9.54	7
# 2	7.25	9.85	3

Unnamed: 0,Selling_Price,Present_Price,Year
0,3.35,5.59,6
1,4.75,9.54,7
2,7.25,9.85,3


In [0]:
df[:2]

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,6 years ago,$3.35,$5.59,27000,Petrol,Dealer,Manual,0
1,sx4,7 years ago,$4.75,$9.54,43000,Diesel,Dealer,Manual,0


In [0]:
coerce_df = df.copy()
coerce_df['Selling_Price'] = price_year_df['Selling_Price']
coerce_df['Present_Price'] = price_year_df['Present_Price']
coerce_df['Year'] = price_year_df['Year']
coerce_df

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,6,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,7,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,3,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,9,2.85,4.15,5200,Petrol,,Manual,0
4,swift,6,4.60,6.87,42450,Diesel,,Manual,0
...,...,...,...,...,...,...,...,...,...
296,city,4,9.50,11.60,33988,Diesel,Dealer,Manual,0
297,brio,5,4.00,5.90,60000,Petrol,Dealer,Manual,0
298,city,11,3.35,11.00,87934,Petrol,Dealer,Manual,0
299,city,3,11.50,12.50,9000,Diesel,Dealer,Manual,0


In [0]:
test_mapper_with_num_converter = DataFrameMapper(comb_steps, df_out = True, default = None)
test_df = test_mapper_with_num_converter.fit_transform(df)
test_df

Unnamed: 0,Selling_Price,Present_Price,Year,Car_Name,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,3.35,5.59,6,ritz,27000,Petrol,Dealer,Manual,0
1,4.75,9.54,7,sx4,43000,Diesel,Dealer,Manual,0
2,7.25,9.85,3,ciaz,6900,Petrol,Dealer,Manual,0
3,2.85,4.15,9,wagon r,5200,Petrol,,Manual,0
4,4.60,6.87,6,swift,42450,Diesel,,Manual,0
...,...,...,...,...,...,...,...,...,...
296,9.50,11.60,4,city,33988,Diesel,Dealer,Manual,0
297,4.00,5.90,5,brio,60000,Petrol,Dealer,Manual,0
298,3.35,11.00,11,city,87934,Petrol,Dealer,Manual,0
299,11.50,12.50,3,city,9000,Diesel,Dealer,Manual,0


In [0]:
df[:2]

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,6 years ago,$3.35,$5.59,27000,Petrol,Dealer,Manual,0
1,sx4,7 years ago,$4.75,$9.54,43000,Diesel,Dealer,Manual,0


In [0]:
comb_steps



[('Selling_Price', FunctionTransformer(func=None)),
 ('Present_Price', FunctionTransformer(func=None)),
 (['Year'], FunctionTransformer(func=None))]

### Keeping the rest

In [0]:
test_df.dtypes

Selling_Price    float64
Present_Price    float64
Year               int64
Car_Name          object
Kms_Driven        object
Fuel_Type         object
Seller_Type       object
Transmission      object
Owner             object
dtype: object

We can see that we lost the original int datatypes from our starting dataframe: `Kms_Driven` and `Owner` came back as `object`.
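One plausible explanation, stated here as an assumption rather than something the lab confirms, is that the passthrough columns are stacked into a single NumPy array alongside string columns, which forces a shared `object` dtype. The sketch below reproduces that effect with plain NumPy and pandas on a hypothetical two-column `sample_df`; it does not call `DataFrameMapper` at all.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one string column and one integer column.
sample_df = pd.DataFrame({"Car_Name": ["ritz", "sx4"], "Kms_Driven": [27000, 43000]})

# Stacking an object array with an int array yields a single object array.
stacked = np.hstack([
    sample_df[["Car_Name"]].to_numpy(dtype=object),
    sample_df[["Kms_Driven"]].to_numpy(),
])

# Rebuilding a frame from that array does not recover the integer dtype.
rebuilt = pd.DataFrame(stacked, columns=["Car_Name", "Kms_Driven"])
print(stacked.dtype)
print(rebuilt.dtypes)
```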

In [0]:
df.dtypes

Car_Name         object
Year             object
Selling_Price    object
Present_Price    object
Kms_Driven        int64
Fuel_Type        object
Seller_Type      object
Transmission     object
Owner             int64
dtype: object

So below, we'll collect the original datatypes from `df` into a dictionary.

In [0]:
df_types = df.dtypes.to_dict()
df_types
# df_types = {'Car_Name': dtype('O'),
#  'Year': dtype('O'),
#  'Selling_Price': dtype('O'),
#  'Present_Price': dtype('O'),
#  'Kms_Driven': dtype('int64'),
#  'Fuel_Type': dtype('O'),
#  'Seller_Type': dtype('O'),
#  'Transmission': dtype('O'),
#  'Owner': dtype('int64')}

{'Car_Name': dtype('O'),
 'Fuel_Type': dtype('O'),
 'Kms_Driven': dtype('int64'),
 'Owner': dtype('int64'),
 'Present_Price': dtype('O'),
 'Seller_Type': dtype('O'),
 'Selling_Price': dtype('O'),
 'Transmission': dtype('O'),
 'Year': dtype('O')}

*Then* use a dictionary comprehension to select the columns that are not of type `object`.

In [0]:
import numpy as np
non_obj_dtypes = { column_name : column_type for column_name, column_type in df_types.items() if np.dtype('object') != column_type }

non_obj_dtypes
# {'Kms_Driven': dtype('int64'), 'Owner': dtype('int64')}

{'Kms_Driven': dtype('int64'), 'Owner': dtype('int64')}
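As an alternative sketch, pandas can do the filtering itself with `DataFrame.select_dtypes`. The example below assumes a small hypothetical `sample_df` and selects numeric columns with `include="number"`, which is slightly narrower than "not object" (it would skip datetime or boolean columns) but gives the same result for a frame like ours.

```python
import pandas as pd

# Hypothetical sample with one text column and two integer columns.
sample_df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4"],
    "Kms_Driven": [27000, 43000],
    "Owner": [0, 0],
})

# Keep only numeric columns, then turn their dtypes into a {name: dtype} dict.
numeric_dtypes = sample_df.select_dtypes(include="number").dtypes.to_dict()
print(numeric_dtypes)
```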

Then use this dictionary to update the datatypes of `test_df`.

In [0]:
updated_df = test_df.astype(non_obj_dtypes)

In [0]:
updated_df.dtypes

# Selling_Price    float64
# Present_Price    float64
# Year               int64
# Car_Name          object
# Kms_Driven         int64
# Fuel_Type         object
# Seller_Type       object
# Transmission      object
# Owner              int64
# dtype: object

Selling_Price    float64
Present_Price    float64
Year               int64
Car_Name          object
Kms_Driven         int64
Fuel_Type         object
Seller_Type       object
Transmission      object
Owner              int64
dtype: object
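To see the dictionary form of `astype` in isolation, here is a minimal sketch on a hypothetical frame whose columns all start out as `object`. Only the columns named in the dictionary are converted; the rest keep their dtype.

```python
import pandas as pd

# Hypothetical sample where every column is stored as object.
sample_df = pd.DataFrame(
    {"Car_Name": ["ritz", "sx4"], "Kms_Driven": [27000, 43000]},
    dtype=object,
)

# Passing a {column: dtype} dict converts just the listed columns.
restored = sample_df.astype({"Kms_Driven": "int64"})
print(restored.dtypes)
```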

### Summary

In this lesson, we practiced coercing our numeric data.  We used a list comprehension to create several steps at once, and restored the original datatypes of the passthrough columns with a dtypes dictionary.