## Pre-processing Data

In this notebook we will be looking at creating features for our machine learning model. 

We will be doing the following:
1. Creating dummy variables for our categorical variables
2. Scaling our numerical data to a common range
3. Creating our training and test data

On top of that, we may tweak some of the data, such as placeholder values, so that it does not get in the way of modeling.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
# Loading our data!

laptop_data = pd.read_csv('../dataset/tim_laptop_cleaned.csv', index_col = 0)

In [3]:
# Checking our data types and the size of our dataframe
laptop_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 896 entries, 0 to 895
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   brand            896 non-null    object 
 1   model            896 non-null    object 
 2   processor_brand  896 non-null    object 
 3   processor_name   896 non-null    object 
 4   processor_gnrtn  896 non-null    object 
 5   ram_gb           896 non-null    int64  
 6   ram_type         896 non-null    object 
 7   ssd_gb           896 non-null    int64  
 8   hdd_gb           896 non-null    int64  
 9   os               896 non-null    object 
 10  os_bit           896 non-null    int64  
 11  graphic_card_gb  896 non-null    int64  
 12  weight           896 non-null    object 
 13  display_size     896 non-null    object 
 14  warranty         896 non-null    int64  
 15  msoffice         896 non-null    object 
 16  latest_price     896 non-null    float64
dtypes: float64(1), i

In [4]:
# Missing values are stored as the string 'Missing', so count its occurrences in each column

laptop_columns = list(laptop_data.columns)
missing = [sum(laptop_data[column] == 'Missing') for column in laptop_data]

laptop_missing = pd.DataFrame(data = missing, index = laptop_columns)
laptop_missing

Unnamed: 0,0
brand,0
model,95
processor_brand,0
processor_name,0
processor_gnrtn,239
ram_gb,0
ram_type,0
ssd_gb,0
hdd_gb,0
os,0
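
The same per-column count can be written without an explicit loop. This is only a sketch on a made-up toy frame (the `toy` data below is hypothetical and is not taken from `laptop_data`); it assumes the placeholder is exactly the string `'Missing'`.

```python
import pandas as pd

# Hypothetical toy frame standing in for laptop_data (values are made up)
toy = pd.DataFrame({
    "brand": ["ASUS", "APPLE", "HP"],
    "model": ["Missing", "MacBook", "Missing"],
    "ram_gb": [4, 8, 16],
})

# Element-wise comparison gives a boolean frame; summing counts the True values per column
missing_counts = toy.eq("Missing").sum()
print(missing_counts)
```

The result is a Series indexed by column name, so it can be inspected or sorted like any other pandas object.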


Because so many rows contain "Missing" values, I am not comfortable deleting those rows outright. We may instead have to remove the affected features from our final data set.

I will drop the `model` column from my current data set, as its wide categorical spread ultimately makes it not very useful as a feature.

I will impute the missing `display_size` values by converting the column to a numeric type and then filling the gaps with the mode. I feel safe imputing the most common display size across our data because the display sizes fall in such a narrow range (13-17).

Additionally, I will do the same for processor generation, as it has so few distinct values and is quite an important column for us. Remember that most of our numerical data is ordinal and discrete, so a summary such as the mean would not make sense.
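
To illustrate why the mode suits this kind of data, here is a small sketch on made-up display sizes (the `sizes` values are hypothetical). It uses `pd.to_numeric(..., errors="coerce")` as one possible way to turn the placeholder into `NaN`; the notebook itself replaces `'Missing'` with `np.nan` first.

```python
import pandas as pd

# Hypothetical display sizes stored as strings, with a 'Missing' placeholder
sizes = pd.Series(["15.6", "14", "Missing", "15.6", "13.3", "Missing"])

# errors="coerce" turns anything that is not a number (here 'Missing') into NaN
sizes = pd.to_numeric(sizes, errors="coerce")

mode_value = sizes.mode()[0]  # most frequent observed value
mean_value = sizes.mean()     # average of observed values; need not be a real screen size

# Fill the gaps with the mode so every imputed value is a size that actually occurs
imputed = sizes.fillna(mode_value)
print(mode_value, mean_value)
```

In this toy data the mean falls between the sizes that actually occur, while the mode is always one of the observed values.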

In [5]:
# Dropping the model column

laptop_df = laptop_data.drop(['model'], axis = 1)
laptop_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 896 entries, 0 to 895
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   brand            896 non-null    object 
 1   processor_brand  896 non-null    object 
 2   processor_name   896 non-null    object 
 3   processor_gnrtn  896 non-null    object 
 4   ram_gb           896 non-null    int64  
 5   ram_type         896 non-null    object 
 6   ssd_gb           896 non-null    int64  
 7   hdd_gb           896 non-null    int64  
 8   os               896 non-null    object 
 9   os_bit           896 non-null    int64  
 10  graphic_card_gb  896 non-null    int64  
 11  weight           896 non-null    object 
 12  display_size     896 non-null    object 
 13  warranty         896 non-null    int64  
 14  msoffice         896 non-null    object 
 15  latest_price     896 non-null    float64
dtypes: float64(1), int64(6), object(9)
memory usage: 119.0+ KB


In [6]:
# Replace the string 'Missing' with type NaN so we can properly replace the data

laptop_df = laptop_df.replace('Missing', np.nan)
laptop_df.isna().sum()

brand                0
processor_brand      0
processor_name       0
processor_gnrtn    239
ram_gb               0
ram_type             0
ssd_gb               0
hdd_gb               0
os                   0
os_bit               0
graphic_card_gb      0
weight               0
display_size       332
warranty             0
msoffice             0
latest_price         0
dtype: int64

In [7]:
# Imputing `processor_gnrtn` with the mode of the column
laptop_df['processor_gnrtn'] = laptop_df['processor_gnrtn'].fillna(laptop_df['processor_gnrtn'].mode()[0])

In [8]:
# Convert the `display_size` to numeric quantity then impute with the mode
laptop_df['display_size'] = pd.to_numeric(laptop_df['display_size'])
laptop_df['display_size'] = laptop_df['display_size'].fillna(laptop_df['display_size'].mode()[0])

In [9]:
# Check if there are any NaN values
# if there are none we are good to go!
laptop_df.isna().sum()

brand              0
processor_brand    0
processor_name     0
processor_gnrtn    0
ram_gb             0
ram_type           0
ssd_gb             0
hdd_gb             0
os                 0
os_bit             0
graphic_card_gb    0
weight             0
display_size       0
warranty           0
msoffice           0
latest_price       0
dtype: int64

## Creating dummy variables

First we will create dummy variables for our categorical data using `pandas.get_dummies()`. We will select the columns that hold strings and also include `os_bit` because, although it is an integer, it is more of a category than a numerical value.

We will also binarize our `warranty` data because we only care if it exists or not. 

In [10]:
laptop_cat = laptop_df.select_dtypes(include = 'object').columns
laptop_cat = laptop_cat.insert(0, 'os_bit')
laptop_cat

Index(['os_bit', 'brand', 'processor_brand', 'processor_name',
       'processor_gnrtn', 'ram_type', 'os', 'weight', 'msoffice'],
      dtype='object')

In [11]:
laptop_dummies = pd.get_dummies(laptop_df, columns = laptop_cat, drop_first = True)
laptop_dummies.head()

Unnamed: 0,ram_gb,ssd_gb,hdd_gb,graphic_card_gb,display_size,warranty,latest_price,os_bit_64,brand_APPLE,brand_ASUS,...,ram_type_DDR4,ram_type_DDR5,ram_type_LPDDR3,ram_type_LPDDR4,ram_type_LPDDR4X,os_Mac,os_Windows,weight_Gaming,weight_ThinNlight,msoffice_Yes
0,4,0,1024,0,15.6,0,324.87,1,0,0,...,1,0,0,0,0,0,1,0,1,0
1,4,0,512,0,15.6,0,254.67,1,0,0,...,1,0,0,0,0,0,1,0,0,0
2,4,128,0,0,15.6,0,259.87,1,0,0,...,1,0,0,0,0,0,1,0,1,0
3,4,128,0,0,15.6,0,279.37,1,0,0,...,1,0,0,0,0,0,1,0,1,0
4,4,256,0,0,15.6,0,324.87,1,0,0,...,1,0,0,0,0,0,1,0,1,0
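
As a minimal sketch of what `drop_first = True` does, consider a made-up frame (the `toy` data is hypothetical). Each encoded column loses its first level, which becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "os": ["Windows", "Mac", "DOS", "Windows"],
    "os_bit": [64, 64, 32, 64],
    "ram_gb": [4, 8, 4, 16],
})

# Encode both the string column and the integer column that is really a category
encoded = pd.get_dummies(toy, columns=["os", "os_bit"], drop_first=True)
print(list(encoded.columns))
```

Here `os_DOS` and `os_bit_32` are dropped, so a row with zeros in `os_Mac` and `os_Windows` represents the baseline level. Note that recent pandas versions return boolean dummy columns by default; passing `dtype=int` gives 0/1 integers instead.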


## Binarize warranty

Since we are simply interested in whether or not a warranty exists, we will binarize this data. 

Then we will get into the discussion of how to treat our numerical data that is discrete but still somewhat categorical.

In [12]:
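# Work on a copy so that laptop_dummies itself is not modified
df = laptop_dummies.copy()
# Start at 0 (no warranty), then flag rows with any warranty as 1
df['Binarize_Warranty'] = 0
df.loc[df['warranty'] > 0, 'Binarize_Warranty'] = 1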

In [13]:
df = df.drop('warranty', axis = 1)
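
The binarization can also be expressed as a single boolean comparison. This is an equivalent sketch on made-up warranty values (the `toy` frame is hypothetical), not a change to the cells above.

```python
import pandas as pd

# Hypothetical warranty lengths; values are made up
toy = pd.DataFrame({"warranty": [0, 1, 3, 0, 2]})

# The comparison yields True/False, which astype(int) converts to 1/0
toy["Binarize_Warranty"] = (toy["warranty"] > 0).astype(int)

# The original column is no longer needed once the flag exists
toy = toy.drop(columns="warranty")
print(toy)
```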

## Scaling Data

Since we are dealing with very sparse but discrete ratio data, we will scale it using the `MinMaxScaler` from scikit-learn. This maps each scaled column to a range of 0-1 so that everything is on the same scale.
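
For reference, min-max scaling computes `(x - min) / (max - min)` for each column. The sketch below checks that formula against `MinMaxScaler` on made-up RAM sizes (the `ram` array is hypothetical); it fits on the whole toy array only to show the arithmetic.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical RAM sizes in GB, as a single-column 2D array
ram = np.array([[4.0], [8.0], [16.0], [32.0]])

# Manual min-max scaling, column by column
col_min = ram.min(axis=0)
col_max = ram.max(axis=0)
manual = (ram - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(ram)

print(np.allclose(manual, scaled))
```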

We will first split the data into training and test sets with a 75/25 split, fit the scaler on the training data only, and then transform both sets with that fitted scaler. Fitting on only the training data avoids data leakage, where information from the test set influences the preprocessing and makes performance on the test set look better than it should.

In [14]:
# X is all of the data that is not our target
# y is our target, latest_price

X = df.drop('latest_price', axis = 1)
y = df[['latest_price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1234)

In [15]:
X_to_scale = ['ram_gb', 'ssd_gb', 'hdd_gb', 'graphic_card_gb']

In [16]:
# Initialize the MinMaxScaler
X_scaled = MinMaxScaler()

# Fit the scaler on the training data only, using just the numerical columns
X_scaled.fit(X_train[X_to_scale])

# Transform both X_train and X_test with the scaler fitted on the training data
# Note: transform returns NumPy arrays that hold only the X_to_scale columns
X_train_scaled = X_scaled.transform(X_train[X_to_scale])
X_test_scaled = X_scaled.transform(X_test[X_to_scale])
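
Because `transform` returns arrays with only the scaled columns, one possible follow-up is to write the scaled values back into copies of the feature frames so that the dummy columns stay alongside them. The sketch below does this on a made-up frame; the `toy` data and the names `scaler`, `cols`, `X_train_full`, and `X_test_full` are hypothetical and do not appear in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "ram_gb": [4, 8, 16, 4, 8, 32, 16, 8],
    "ssd_gb": [0, 256, 512, 128, 256, 1024, 512, 0],
    "os_Windows": [1, 1, 0, 1, 1, 0, 1, 1],
    "latest_price": [300.0, 550.0, 1200.0, 350.0, 600.0, 2100.0, 900.0, 400.0],
})
X = toy.drop("latest_price", axis=1)
y = toy["latest_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234
)

cols = ["ram_gb", "ssd_gb"]

# Fit on the training rows only so the test rows do not influence the scaling
scaler = MinMaxScaler().fit(X_train[cols])

# Copy the frames, then overwrite just the numerical columns with scaled values
X_train_full = X_train.copy()
X_test_full = X_test.copy()
X_train_full[cols] = scaler.transform(X_train[cols])
X_test_full[cols] = scaler.transform(X_test[cols])

print(X_train_full.head())
```

Test values that fall outside the range seen in training will land outside 0-1, which is expected when the scaler is fitted on the training data alone.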

## Final Summary

In these steps, we have done a bit more to prepare our data for analysis.

We removed the `model` column because it had missing values with no good way to impute them, and its wide categorical spread made it not very useful. We also imputed `processor_gnrtn` and `display_size` with their modes. We want to stay true to the nature of the data as much as possible, and using the mode makes more sense because these variables are not continuous.


After imputing the data, we created dummy variables for our categorical data using `pd.get_dummies`, dropping the first level of each variable to make the data less redundant. We also binarized `warranty` because we are more interested in whether the warranty exists than in its length.

Finally, we split our data into train/test sets and then used `MinMaxScaler` to put the numerical variables `ram_gb`, `ssd_gb`, `hdd_gb`, and `graphic_card_gb` on the same scale. This way we can be more confident in our analysis, without large-valued features having an outsized impact on the model.

We are pretty much ready to begin modeling and testing different models on the data.

In [17]:
# Store df so that we can use it in our next notebook for modeling

%store df

Stored 'df' (DataFrame)
