# Lawnmower Data Preparation

In this notebook we focus on loading, exploring, and preparing the data.

This notebook closely follows last week's data cleaning tutorial and uses the same workflow, with one change: instead of a continuous target (price in that tutorial), the target here is the categorical variable Ownership. We will therefore use a logistic regression model instead of a linear regression model to identify where owner is true.


## 1.0 Setup


In [173]:
# import numpy and pandas libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
# set random seed to ensure that results are repeatable
np.random.seed(1)

## 2.0 Load data 

In [174]:
# load data
lawn = pd.read_csv("./RidingMowers.csv")

lawn.head(3)

,Income,Lot_Size,Ownership
0,60.0,18.4,Owner
1,85.5,16.8,Owner
2,64.8,21.6,Owner


## 3.0 Conduct initial exploration of the data

We have two input variables (Income and Lot_Size) and one target variable. For this analysis, the target variable is Ownership.

First, our initial exploration of the data should answer the following questions:
1. How many rows and columns are there?
2. How much of a problem do we have with NAs?
3. What types of data are there?
4. What types of data are stored in each column?
    1. Identify which variables are numeric and may need to be standardized later.
    2. Identify which variables are categorical and may need to be transformed using an encoder such as a one-hot encoder.
5. Are there errors in the data? This is a common problem with categorical variables, where a category is misspelled or spelled differently in some instances.
 

In [175]:
# look at the data
lawn.head(3) # note that we don't want to dump all the data to the screen

,Income,Lot_Size,Ownership
0,60.0,18.4,Owner
1,85.5,16.8,Owner
2,64.8,21.6,Owner


In [176]:
# generate a basic summary of the data
lawn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     24 non-null     float64
 1   Lot_Size   24 non-null     float64
 2   Ownership  24 non-null     object 
dtypes: float64(2), object(1)
memory usage: 704.0+ bytes


In [177]:
# generate a statistical summary of the numeric values in the data
lawn.describe()

,Income,Lot_Size
count,24.0,24.0
mean,68.4375,18.95
std,19.793144,2.428275
min,33.0,14.0
25%,52.35,17.5
50%,64.8,19.0
75%,83.1,20.8
max,110.1,23.6


In [178]:
# There are many ways we could explore our data. A relatively new library called
# jupyter-summarytools provides functions that give very thorough summaries
# of your data. Such detail is not always required, but there are times when you want a
# thorough summary.

# jupyter-summarytools is not part of the standard Anaconda distribution of Python, nor
# is it in any conda channels. To install this library, install it from the
# terminal/command line using pip (it is published on PyPI as summarytools): pip install summarytools

# once installed, you can import this library and use dfSummary to provide a more thorough 
# summary of your data
import summarytools
from summarytools import dfSummary
dfSummary(lawn)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Income [float64],Mean (sd) : 68.4 (19.8) min < med < max: 33.0 < 64.8 < 110.1 IQR (CV) : 30.8 (3.5),22 distinct values,,0 (0.0%)
2,Lot_Size [float64],Mean (sd) : 19.0 (2.4) min < med < max: 14.0 < 19.0 < 23.6 IQR (CV) : 3.3 (7.8),18 distinct values,,0 (0.0%)
3,Ownership [object],1. Owner 2. Nonowner,12 (50.0%) 12 (50.0%),,0 (0.0%)


In [179]:
# Check the missing values by summing the total na's for each variable
lawn.isna().sum()

Income       0
Lot_Size     0
Ownership    0
dtype: int64

In [180]:
# create a list of the categorical variables
category_var_list = list(lawn.select_dtypes(include='object').columns)
category_var_list

['Ownership']

In [181]:
# explore the categorical variable values - often there are typos here that need to be fixed.
for cat in category_var_list: # generally, we want to avoid for loops and use a functional style (i.e. list comprehension)
    print(f"Category: {cat} Values: {lawn[cat].unique()}")

Category: Ownership Values: ['Owner' 'Nonowner']
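
The Ownership column here is clean, but the kind of spelling problem described earlier can be sketched on a small made-up example. The `toy` frame, its deliberate typos, and the replacement mapping below are all hypothetical and are not part of the RidingMowers data.

```python
import pandas as pd

# Hypothetical toy data: 'owner ' and 'Nonwner' are deliberate typos for illustration.
toy = pd.DataFrame(
    {"Ownership": ["Owner", "owner ", "Nonowner", "Nonwner", "Owner"]}
)

# value_counts() ranks the spellings, so rare variants stand out.
print(toy["Ownership"].value_counts())

# Normalize whitespace and case, then map known misspellings to the canonical label.
cleaned = (
    toy["Ownership"]
    .str.strip()
    .str.capitalize()
    .replace({"Nonwner": "Nonowner"})
)
print(sorted(cleaned.unique()))
```

Whitespace and case can be normalized mechanically, while true misspellings usually need an explicit mapping that you build after inspecting the counts.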


### Summary of the findings from our initial evaluation of the data

* We have 1 categorical variable
* We have 0 variables that have missing values
* There doesn't seem to be a problem with the categorical class names.

## 4.0 Process the data

* Conduct any data preparation that should be done *BEFORE* the data split.
* Split the data.
* Conduct any data preparation that should be done *AFTER* the data split.

### 4.1 Conduct any data preparation that should be done *BEFORE* the data split

Tasks at this stage include:
1. Drop any columns/features. Nothing needs to be dropped here.
2. Decide if you wish to exclude any observations (rows) due to missing values.
3. Conduct proper encoding of categorical variables.
    1. You can transform them using dummy variable encoding, one-hot encoding, or label encoding.

#### Encode our categorical variables

Categorical variables usually have strings for their values. Many machine learning algorithms do not support string values for the input variables. Therefore, we need to replace these string values with numbers. This process is called categorical variable encoding.

In a previous step we identified 1 categorical variable (Ownership) and found no indication of typos in the class names. Our focus is now on encoding this variable.

We have three main approaches to encoding variables (these will be discussed in greater detail in class)
* One-Hot-Encoding
* Dummy Encoding
* Label Encoding

In this exercise, we will encode Ownership using dummy encoding (we will have more discussion on this choice in class).
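
To make the difference between the three approaches concrete, here is a minimal sketch on a made-up four-row column. The `toy` frame is hypothetical. The calls are the standard `pd.get_dummies` and scikit-learn `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with the same two classes as the notebook's data.
toy = pd.DataFrame({"Ownership": ["Owner", "Nonowner", "Owner", "Nonowner"]})

# One-hot encoding: one indicator column per class (k columns for k classes).
one_hot = pd.get_dummies(toy["Ownership"], prefix="Ownership")

# Dummy encoding: drop the first class, leaving k - 1 indicator columns.
dummy = pd.get_dummies(toy["Ownership"], prefix="Ownership", drop_first=True)

# Label encoding: map each class to an integer code in a single column.
label = LabelEncoder().fit_transform(toy["Ownership"])

print(list(one_hot.columns))
print(list(dummy.columns))
print(label.tolist())
```

With only two classes, dummy encoding and label encoding carry the same information: a single 0/1 column indicating whether the row is an owner.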

Before we do our encoding, we must check whether any of our categorical variables have missing values. We would replace any missing values with the term 'unknown'.
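
If a categorical column did contain missing values, the replacement could look like the following sketch. The `toy` series is hypothetical; the notebook's Ownership column has no missing values, so this step is a no-op there.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with two missing entries.
toy = pd.Series(["Owner", np.nan, "Nonowner", None], name="Ownership")

# Count the missing values first, then replace them with a placeholder class.
n_missing = toy.isna().sum()
filled = toy.fillna("unknown")

print(n_missing)
print(filled.tolist())
```

Filling before encoding means 'unknown' becomes its own class, rather than producing a row whose indicator columns are all zero.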

In [182]:
lawn['Ownership'].isna().sum() # check for missing values in this variable/column - we can see there are none for this variable

0

Now, let's encode Ownership as a dummy variable.

In [183]:
dummies_df = pd.get_dummies(lawn['Ownership'], prefix='Ownership', drop_first=True)

In [184]:
lawn = lawn.join(dummies_df)
lawn.drop('Ownership', axis=1, inplace = True)

In [185]:
# explore the dataframe columns to verify encoding and dropped columns
lawn.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Income           24 non-null     float64
 1   Lot_Size         24 non-null     float64
 2   Ownership_Owner  24 non-null     uint8  
dtypes: float64(2), uint8(1)
memory usage: 536.0 bytes


### 4.2 Split data (train/test)

In [186]:
# split the data into training and test sets
train_df, test_df = train_test_split(lawn, test_size=0.35, random_state=105, shuffle=True)

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = 'Ownership_Owner'
predictors = list(lawn.columns)
predictors.remove(target)
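
After a split it is worth checking how many rows landed in each set and whether both classes are represented. The sketch below does this on a made-up 24-row frame with the same column names and split parameters. The `toy` data is randomly generated and hypothetical, so the class counts will not match the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded data: 24 rows, balanced target.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "Income": rng.uniform(30, 110, size=24),
        "Lot_Size": rng.uniform(14, 24, size=24),
        "Ownership_Owner": [1] * 12 + [0] * 12,
    }
)

train_toy, test_toy = train_test_split(
    toy, test_size=0.35, random_state=105, shuffle=True
)

# Row counts per split, and class balance within each split.
print(len(train_toy), len(test_toy))
print(train_toy["Ownership_Owner"].value_counts())
print(test_toy["Ownership_Owner"].value_counts())

# Every row ends up in exactly one of the two sets.
no_overlap = set(train_toy.index).isdisjoint(test_toy.index)
print(no_overlap)
```

With a dataset this small, a random split can leave the classes noticeably unbalanced in the test set, which is one reason to inspect the counts.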

Now, let's create a common scale between the numeric columns by standardizing each numeric column.
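
One common way to standardize is scikit-learn's `StandardScaler`, fitted on the training rows only and then applied to the test rows. The sketch below uses small made-up frames (`train_toy`, `test_toy`) and is an illustration of the technique. It is an assumption about how this step could be done, not a record of what this notebook saved.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with the notebook's numeric columns.
train_toy = pd.DataFrame(
    {"Income": [60.0, 85.5, 64.8, 33.0], "Lot_Size": [18.4, 16.8, 21.6, 14.0]}
)
test_toy = pd.DataFrame({"Income": [110.1, 52.35], "Lot_Size": [23.6, 17.5]})
numeric_cols = ["Income", "Lot_Size"]

# Fit on the training data only, so no information from the test set leaks in.
scaler = StandardScaler()
train_scaled = train_toy.copy()
test_scaled = test_toy.copy()
train_scaled[numeric_cols] = scaler.fit_transform(train_toy[numeric_cols])

# Reuse the training mean and standard deviation for the test data.
test_scaled[numeric_cols] = scaler.transform(test_toy[numeric_cols])

print(train_scaled[numeric_cols].mean().round(6))
print(train_scaled[numeric_cols].std(ddof=0).round(6))
```

The scaled training columns have mean 0 and (population) standard deviation 1. The test columns generally will not, because they are scaled with the training statistics.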

## 5.0 Save the data

In [187]:
import os

train_X = train_df[predictors]
train_y = train_df[target] # train_y is now a Series object
test_X = test_df[predictors]
test_y = test_df[target] # test_y is now a Series object

# make sure the output folder exists before writing (to_csv does not create it)
os.makedirs('./data', exist_ok=True)
train_df.to_csv('./data/lawn_train_df_ownership.csv', index=False)
train_X.to_csv('./data/lawn_train_X_ownership.csv', index=False)
train_y.to_csv('./data/lawn_train_y_ownership.csv', index=False)
test_df.to_csv('./data/lawn_test_df_ownership.csv', index=False)
test_X.to_csv('./data/lawn_test_X_ownership.csv', index=False)
test_y.to_csv('./data/lawn_test_y_ownership.csv', index=False)
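
When these files are loaded in a later notebook, note that a target saved with `to_csv` comes back from `pd.read_csv` as a one-column DataFrame, not a Series. A minimal round-trip sketch follows, using a temporary folder and a made-up `toy_y` series instead of the real files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical target series standing in for train_y.
toy_y = pd.Series([1, 0, 1], name="Ownership_Owner")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_y.csv")
    toy_y.to_csv(path, index=False)

    # read_csv always returns a DataFrame, here with a single column.
    loaded = pd.read_csv(path)

    # squeeze the single column back into a Series.
    loaded_y = loaded.squeeze("columns")

print(type(loaded).__name__, type(loaded_y).__name__)
print(loaded_y.tolist())
```

Saving with `index=False` keeps the row index out of the file, so the reloaded data gets a fresh 0-based index.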