# Lesson One

This is the notebook I would have wanted access to when I started my Data Science journey. It is a concoction of my own learnings and those I picked up when I completed the Fast.ai [Introduction to Machine Learning for Coders](https://course18.fast.ai/ml.html) course back in 2018.

This notebook has been set up for both VS Code users and Google Colab users. If you're a beginner I would recommend going down the Colab route.

## 0 - Imports

In [None]:
import pandas as pd
import numpy as np
import os
from pandas.api.types import is_string_dtype, is_object_dtype, is_numeric_dtype, is_datetime64_any_dtype
import yaml

In [None]:
# determines whether to import from colab or vscode
editor = 'colab'

# repository location on Google Drive
drive_path = '/content/gdrive/MyDrive/learning'

## 1 - VSCode Import

The code below does the following:
1. Installs Kaggle
2. Creates a Kaggle folder in our home directory (it'll be hidden)
3. Takes our API credentials file (downloaded beforehand from the Kaggle 'Settings' page)
4. Copies the credentials (.json file) into the Kaggle folder from step 2
5. Downloads the Kaggle dataset

In [None]:
if editor == 'vscode':
  # change current working directory
  os.chdir('..')
  print(f'cwd: {os.getcwd()}')

  # install Kaggle
  !pip install -q kaggle

  # create a kaggle directory (named kaggle_dir to avoid shadowing the built-in dir)
  kaggle_dir = os.path.expanduser('~/.kaggle')
  os.makedirs(kaggle_dir, exist_ok=True)

  # copy credentials to kaggle folder
  creds = '/Users/chelseatucker/credentials/kaggle.json'
  !cp $creds ~/.kaggle

  # change permissions so only I have read & write access to the credentials file
  !chmod 600 ~/.kaggle/kaggle.json

  # create a bulldozers directory
  os.makedirs('data/bbfb', exist_ok=True)

  # downloading the bulldozers dataset to the 'data' folder
  !kaggle competitions download -c bluebook-for-bulldozers -p 'data/bbfb'

  # unzip the data
  !unzip -q data/bbfb/bluebook-for-bulldozers.zip -d 'data/bbfb'

  # unzip train data
  !unzip -q data/bbfb/Train.zip -d 'data/bbfb'
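
If you would rather not rely on shell commands for the credentials part (steps 2 to 4), here is a minimal pure-Python sketch of the same idea. The function name `install_kaggle_credentials` is my own invention for illustration, and the demo uses a throwaway folder with a fake credentials file rather than your real `~/.kaggle` directory.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_credentials(src, kaggle_dir):
    """Hypothetical helper: copy kaggle.json into kaggle_dir, owner read/write only."""
    kaggle_dir = Path(kaggle_dir)
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    dest = kaggle_dir / 'kaggle.json'
    shutil.copy(src, dest)                         # Python equivalent of `cp`
    os.chmod(dest, 0o600)                          # Python equivalent of `chmod 600`
    return dest


# demo in a temporary directory with a dummy (fake) credentials file
with tempfile.TemporaryDirectory() as tmp:
    fake_creds = Path(tmp) / 'downloaded' / 'kaggle.json'
    fake_creds.parent.mkdir()
    fake_creds.write_text('{"username": "example", "key": "not-a-real-key"}')

    dest = install_kaggle_credentials(fake_creds, Path(tmp) / '.kaggle')
    mode = stat.S_IMODE(dest.stat().st_mode)
    print(dest.name, oct(mode))
```

The permission bits behave this way on macOS and Linux; on Windows `os.chmod` only controls the read-only flag.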

## 2 - Colab Import

The first section of code mounts Google Drive and navigates to where the cloned repository sits. The second section downloads the data from Kaggle in a similar way to the '1 - VSCode Import' section.

In [None]:
if editor == 'colab':
  from google.colab import drive
  drive.mount('/content/gdrive')

  # navigate to the repository
  %cd $drive_path

In [None]:
if editor == 'colab':
  if os.path.exists('data/bbfb'):
    print('Bulldozers data already present on Google Drive')
  else:
    # install kaggle
    !pip install -q kaggle

    # upload the 'kaggle.json' file
    from google.colab import files
    files.upload()

    # make a kaggle directory and move the json file there
    !mkdir -p ~/.kaggle
    !mv kaggle.json ~/.kaggle

    # change permissions so only I have read & write access to the credentials file
    !chmod 600 ~/.kaggle/kaggle.json

    # download dataset from Kaggle
    !kaggle competitions download -c 'bluebook-for-bulldozers'

    # move dataset
    !mkdir -p data
    !mv bluebook-for-bulldozers.zip data

    # unzip bulldozers data
    !unzip data/bluebook-for-bulldozers.zip -d data/bbfb

    # unzip train data
    !unzip data/bbfb/Train.zip -d data/bbfb

## 2 - Exploring the Data

In [None]:
df_raw = pd.read_csv('data/bbfb/Train.csv',
                     low_memory=False,
                     parse_dates=['saledate'])

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# quick look to see if the data has imported correctly
df_raw.head(3)

In [None]:
from utils.eda import df_look

# high level overview of the data
df_look(df_raw)

In [None]:
# checking individual columns
df_raw.Tire_Size.value_counts(dropna=False).sort_index()

## 3 - Data Cleansing

## 3.1 Corrections

In [None]:
# replacing certain values
df0 = df_raw.copy()

# different naming conventions for the same level
# e.g. '10 inch' and '10'
# editing so the levels have consistent naming conventions
# (the conversion to float happens in section 3.2)
df0.Tire_Size = df0.Tire_Size.str.replace('"','')
df0.Tire_Size = df0.Tire_Size.str.replace(' inch','')

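# tidying up the '31.5 inch' level by dropping the '.5'
# (the conversion to int happens in section 3.2)
# regex=False treats '.5' literally; as a regex, '.' would match any character
df0.Undercarriage_Pad_Width = df0.Undercarriage_Pad_Width.str.replace('.5', '', regex=False)
df0.Undercarriage_Pad_Width = df0.Undercarriage_Pad_Width.str.replace(' inch', '', regex=False)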

# different naming conventions for the same level
# editing so the levels have consistent naming conventions
df0.Transmission = df0.Transmission.str.replace('AutoShift', 'Autoshift')
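
One thing to watch with `str.replace` is whether the pattern is read as a regular expression. Before pandas 2.0 the default was `regex=True`, and in a regex a `.` matches any character. The sketch below uses made-up pad widths (not taken from the dataset) to show how differently a pattern like `'.5'` behaves in the two modes.

```python
import pandas as pd

# toy values for illustration only
widths = pd.Series(['31.5 inch', '15 inch', '25 inch'])

# regex=True: '.' matches any character, so '15' and '25' are removed as well
as_regex = widths.str.replace('.5', '', regex=True)

# regex=False: only the literal substring '.5' is removed
as_literal = widths.str.replace('.5', '', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Passing `regex=False` explicitly keeps the result the same regardless of which pandas version is installed.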

## 3.2 Converting Data Types

In [None]:
# manually changing these columns to numerical
# NaN is encoded as 0 and 'None or Unspecified' as -1
# (assigning the result back instead of using inplace=True on a column,
#  which newer pandas versions warn about and may silently ignore)
df0['Tire_Size'] = df0['Tire_Size'].fillna('0')
df0.Tire_Size = df0.Tire_Size.str.replace('None or Unspecified', '-1')
df0.Tire_Size = df0.Tire_Size.astype(float)

df0['Undercarriage_Pad_Width'] = df0['Undercarriage_Pad_Width'].fillna('0')
df0.Undercarriage_Pad_Width = df0.Undercarriage_Pad_Width.str.replace('None or Unspecified', '-1')
df0.Undercarriage_Pad_Width = df0.Undercarriage_Pad_Width.astype(int)

df0.Blade_Width = df0.Blade_Width.str.replace("'","")
df0.Blade_Width = df0.Blade_Width.str.replace("<12","11")
df0['Blade_Width'] = df0['Blade_Width'].fillna('0')
df0.Blade_Width = df0.Blade_Width.str.replace('None or Unspecified', '-1')
df0.Blade_Width = df0.Blade_Width.astype(int)
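
The three conversions above follow the same pattern: fill missing values with a sentinel, swap 'None or Unspecified' for another sentinel, then cast. If you find yourself repeating this, it could be wrapped in a small function. This is only a sketch; `to_number` and its parameter names are hypothetical and not part of the notebook's `utils` package.

```python
import pandas as pd


def to_number(col, dtype=float, missing='0', unspecified='-1'):
    """Hypothetical helper: encode NaN and 'None or Unspecified' as sentinels, then cast."""
    col = col.fillna(missing)
    col = col.str.replace('None or Unspecified', unspecified, regex=False)
    return col.astype(dtype)


# toy column for illustration only
toy = pd.Series(['20.5', None, 'None or Unspecified', '23.5'])
converted = to_number(toy)
print(converted.tolist())
```

Keeping 0 for missing and -1 for unspecified means the two cases stay distinguishable after the cast.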

## 3.3 Missing Values

In [None]:
# check the fraction of missing values in each column
df0.isnull().sum().sort_values(ascending=False)/len(df0)

In [None]:
# categorical missings
for c in df0.columns:
    if is_string_dtype(df0[c]) or is_object_dtype(df0[c]):
        df0[c] = df0[c].fillna('Missing')

In [None]:
# numerical missings
mean = df0.MachineHoursCurrentMeter.mean()
df0['MachineHoursCurrentMeter'] = df0['MachineHoursCurrentMeter'].fillna(mean)

# taking the most common level
common = df0.auctioneerID.value_counts().sort_values(ascending=False)
df0['auctioneerID'] = df0['auctioneerID'].fillna(common.index[0])
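
The two numerical columns are treated differently for a reason: a meter reading is a continuous quantity, so the mean is a sensible stand-in, whereas an auctioneer ID is really a label, so averaging IDs would be meaningless and the most frequent level is used instead. A toy sketch with invented numbers:

```python
import numpy as np
import pandas as pd

# made-up data for illustration only
toy = pd.DataFrame({
    'hours': [100.0, np.nan, 300.0, 200.0],
    'auctioneer': [1.0, 2.0, np.nan, 1.0],
})

# continuous column: fill with the mean of the observed values
toy['hours'] = toy['hours'].fillna(toy['hours'].mean())

# ID-like column: fill with the most frequent level
most_common = toy['auctioneer'].value_counts().index[0]
toy['auctioneer'] = toy['auctioneer'].fillna(most_common)

print(toy)
```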

## 3.4 Categorical Variables

In [None]:
# banding levels with less data together
threshold = 50

for c in ['fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor']:
  counts = df0[c].value_counts(dropna=False)
  flag = df0[c].isin(counts.index[counts < threshold])
  df0.loc[flag, c] = 'Grouped'
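
To see what the banding does, here is the same logic on a tiny made-up column with a threshold of 3 instead of 50. Levels that appear fewer than `threshold` times are merged into a single 'Grouped' level.

```python
import pandas as pd

# toy column for illustration only
toy = pd.Series(['A'] * 5 + ['B'] * 3 + ['C'] + ['D'])
threshold = 3

counts = toy.value_counts(dropna=False)
rare = counts.index[counts < threshold]   # levels seen fewer than `threshold` times
banded = toy.where(~toy.isin(rare), 'Grouped')

print(banded.value_counts().to_dict())
```

Note that a level with exactly `threshold` rows (like 'B' here) is kept, because the comparison is strictly less than.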

In [None]:
# list object/string columns
cats = []
for c in df0.columns:
    if is_string_dtype(df0[c]) or is_object_dtype(df0[c]):
        cats.append(c)

cats

In [None]:
from utils.preprocessing import conv_to_cat

# converting all string/object columns to categories
conv_to_cat(df0)

# checking category orders
for c in cats:
    print(c,':',df0[c].cat.categories)
    print()

In [None]:
# reordering categories
df0.UsageBand = df0.UsageBand.cat.reorder_categories(
    ['Missing',
     'Low',
     'Medium',
     'High'], ordered=True)

df0.ProductSize = df0.ProductSize.cat.reorder_categories(
    ['Missing',
     'Mini',
     'Small',
     'Compact',
     'Medium',
     'Large / Medium',
     'Large'], ordered=True)

df0.Drive_System = df0.Drive_System.cat.reorder_categories(
    ['Missing',
     'No',
     'Two Wheel Drive',
     'Four Wheel Drive',
     'All Wheel Drive'], ordered=True)

df0.Grouser_Type = df0.Grouser_Type.cat.reorder_categories(
    ['Missing',
     'Single',
     'Double',
     'Triple'], ordered=True)

In [None]:
# checking after reordering
df0.UsageBand.cat.categories
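
Why bother reordering? By default pandas sorts category levels alphabetically, so the integer codes behind them carry no meaning. Once the levels are put in a sensible order and flagged as ordered, the codes increase with the level and comparisons work. A small sketch on made-up values:

```python
import pandas as pd

# toy column for illustration only
usage = pd.Series(['Low', 'High', 'Missing', 'Medium', 'Low'], dtype='category')

# default order is alphabetical
default_order = list(usage.cat.categories)

usage = usage.cat.reorder_categories(
    ['Missing', 'Low', 'Medium', 'High'], ordered=True)

codes = usage.cat.codes.tolist()          # position of each value in the new order
above_low = (usage > 'Low').tolist()      # comparisons are allowed on ordered categories

print(default_order)
print(codes)
print(above_low)
```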

## 3.5 Defining Feature Spec

A 'feature spec' can be thought of as a blueprint of your cleansed data. I create these so that I can easily apply the same set of edits and rules to the training, validation and test datasets.

In [None]:
features = [c for c in df0.columns if c != 'SalePrice']
feat_dict = {}
max_cat = 100

# defining a default feature spec
for col in features:
  number = is_numeric_dtype(df0[col])
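  # cats lists the columns converted to category dtype in 3.4; dtype-based
  # string checks on category columns can differ between pandas versions
  string = col in cats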
  date   = is_datetime64_any_dtype(df0[col])
  cat_no = df0[col].nunique(dropna=False)

  # default imputation
  if number:
    imputation = 0
  elif string:
    imputation = 'Missing'
  else:
    imputation = pd.to_datetime(
        '2024-06-10 00:00:00.00',
        format='%Y-%m-%d %H:%M:%S.%f')

  # default categories
  if string and cat_no < max_cat:
    categories = list(df0[col].cat.categories)
  else:
    categories = []

  feat_dict.update({col :
   {'drop'      : 'N',
    'imputation' : imputation,
    'categories': categories,
    'monotonic' : 'N'}})

In [None]:
# over-riding default values

# drop
for c in ['SalesID', 'MachineID', 'ModelID', 'saledate', 'fiModelDesc', 'fiBaseModel']:
  feat_dict[c]['drop'] = 'Y'

# imputation
# float() turns NumPy scalars into plain Python floats so that they are
# written to the YAML file as ordinary numbers
feat_dict['auctioneerID']['imputation'] = float(df0.auctioneerID.value_counts().sort_values(ascending=False).index[0])
feat_dict['YearMade']['imputation'] = 1000
feat_dict['MachineHoursCurrentMeter']['imputation'] = float(df0.MachineHoursCurrentMeter.mean())
feat_dict['saledate']['imputation'] = df0.saledate.value_counts().sort_values(ascending=False).index[0]

# monotonic
for c in ['YearMade', 'MachineHoursCurrentMeter']:
  feat_dict[c]['monotonic'] = 'Y'

In [None]:
# exporting the feature spec
with open('feature_spec.yaml', 'w') as outfile:
    yaml.dump(feat_dict, outfile, default_flow_style=False)
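
To make the purpose of the feature spec concrete, here is a rough sketch of how it might be applied to another dataset, such as a validation set. The function `apply_feature_spec` is hypothetical (it is not part of the notebook's `utils` package), the spec is passed in as a plain dictionary rather than read back from the YAML file, and the 'monotonic' flag is ignored because it is only relevant at modelling time.

```python
import pandas as pd


def apply_feature_spec(df, spec):
    """Hypothetical helper: apply the drop / imputation / categories rules of a spec."""
    out = df.copy()
    for col, rules in spec.items():
        if col not in out.columns:
            continue
        if rules['drop'] == 'Y':
            out = out.drop(columns=col)
            continue
        out[col] = out[col].fillna(rules['imputation'])
        if rules['categories']:
            # values not listed in the spec become NaN
            out[col] = pd.Categorical(out[col], categories=rules['categories'])
    return out


# toy spec and data for illustration only
toy_spec = {
    'SalesID':   {'drop': 'Y', 'imputation': 0, 'categories': [], 'monotonic': 'N'},
    'YearMade':  {'drop': 'N', 'imputation': 1000, 'categories': [], 'monotonic': 'Y'},
    'UsageBand': {'drop': 'N', 'imputation': 'Missing',
                  'categories': ['Missing', 'Low', 'Medium', 'High'], 'monotonic': 'N'},
}
toy_valid = pd.DataFrame({
    'SalesID': [1, 2, 3],
    'YearMade': [2004.0, None, 1998.0],
    'UsageBand': ['Low', None, 'High'],
})

cleaned = apply_feature_spec(toy_valid, toy_spec)
print(cleaned)
```

Because the category levels come from the spec rather than from the new data, the same level always maps to the same code across training, validation and test sets.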

## 4 - Feature Engineering

## 4.1 Numericalise

In [None]:
from utils.preprocessing import numericalise

# converting all categorical columns to their code equivalents
df1 = df0.copy()

for c in cats:
    numericalise(df1, df1[c], c, max_n_cat=max_cat)

## 4.2 Date Attributes

In [None]:
from utils.preprocessing import add_dateattr

# extracting more information from the date field
add_dateattr(df1, 'saledate', drop=False)
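
The source of `add_dateattr` is not shown in this notebook, so the sketch below is only my guess at the general idea: pulling parts such as year, month and day of week out of a datetime column using the pandas `.dt` accessor. The function name `add_date_parts` and the new column names are hypothetical.

```python
import pandas as pd


def add_date_parts(df, col):
    """Hypothetical sketch: add a few date-derived columns in place."""
    dates = df[col].dt
    df[col + '_year'] = dates.year
    df[col + '_month'] = dates.month
    df[col + '_day'] = dates.day
    df[col + '_dayofweek'] = dates.dayofweek       # Monday = 0
    df[col + '_is_month_end'] = dates.is_month_end
    return df


# toy dates for illustration only
toy = pd.DataFrame({'saledate': pd.to_datetime(['2006-11-16', '2004-03-31'])})
add_date_parts(toy, 'saledate')
print(toy)
```

Tree-based models cannot read a raw timestamp, so splitting it up like this gives them a chance to pick up on seasonality or end-of-month effects.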

## 4.3 Dependent Variable

In [None]:
# The competition scores predictions with RMSLE, so we take the log of the
# dependent variable; RMSE on the logged values is then equivalent to RMSLE.
df1.SalePrice = np.log(df1.SalePrice)
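
A quick check of why logging the target is convenient: the RMSE between logged predictions and logged actuals is the same number as the RMSLE between the raw prices. The prices below are made up, and `rmse` and `rmsle` are small helper functions written for this sketch. I use a plain `np.log` to match the cell above; some definitions of RMSLE use `log(1 + x)` instead.

```python
import numpy as np


def rmse(pred, actual):
    """Root mean squared error."""
    return np.sqrt(np.mean((pred - actual) ** 2))


def rmsle(pred, actual):
    """Root mean squared log error, using a plain log (no +1)."""
    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))


# made-up prices for illustration only
actual_price = np.array([66000.0, 57000.0, 10000.0])
predicted_price = np.array([60000.0, 59000.0, 12000.0])

on_log_scale = rmse(np.log(predicted_price), np.log(actual_price))
on_price_scale = rmsle(predicted_price, actual_price)

print(on_log_scale, on_price_scale)
```

Remember that a model trained on the logged target predicts log prices, so `np.exp` is needed to get back to the original scale.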

## 4.4 Final Edits

In [None]:
for c in features:
  if feat_dict[c]['drop'] == 'Y':
    df1.drop(c, axis=1, inplace=True)