# Getting Started with Data Science

If you want to get started on or reignite your Data Science journey then look no further than this notebook. In here you will do the following:

1. Use a free, cloud based platform to run Python code.
2. Download a Data Science dataset.
3. Train your first machine learning model.

This can all be done in less than 10 minutes. Let's get cracking!

## 1 - Google Colab

The free, cloud based platform that I speak about above is Google Colaboratory - Colab for short.

☁️ Colab provides a convenient and free platform for running Python code, particularly suited for those not interested in local installation or powerful hardware. It's a hosted Jupyter Notebook service, inheriting all the pros of Jupyter, such as interactivity, visualizations, and documentation capabilities.

You can open this notebook in Colab by clicking the button below.

<a target="_blank" href="https://colab.research.google.com/github/mathschelsea/learning/blob/main/notebooks/getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This then becomes your notebook to edit as you wish and save to your Google Drive. If you don't have a Google account you may need to open one (don't worry, that is also free and quick to set up)

## 2 - Kaggle Dataset

The Data Science dataset is one provided by Kaggle.

🔭 Kaggle is a Data Science competition platform and online community. It provides a space to delve into diverse datasets, work on modelling projects, collaborate with others, and learn from shared insights.

The below code downloads the Kaggle dataset [Blue Book for Bulldozers](https://www.kaggle.com/c/bluebook-for-bulldozers/overview). If you'd prefer a different dataset then feel free to edit the code below as required. If you don't have a [Kaggle](https://www.kaggle.com/) account then you will need to open one (again, don't worry, it's free and quick to open).

### 2.1 - Kaggle Credentials

Go the the settings page on Kaggle and on the account tab scroll down to API. Select the 'Create New Token' button (see image below). This will download a 'kaggle.json' file to your PC. Remember where this file is downloaded to as we'll need it later.

![Kaggle Crednetials](https://raw.githubusercontent.com/mathschelsea/learning/main/media/kaggle_api.png)

### 2.2 - Download Data

Run the code below to download the Blue Book for Bulldozers data into your Colab runtime session. When you run the code you will be prompted to uploaded your Kaggle credentials to the .kaggle directory on your local machine. This is the 'kaggle.json' file that you downloaded in the previous section.

In [1]:
# install kaggle
!pip install -q kaggle

# upload the 'Kaggle.json' file
from google.colab import files
files.upload()

# make a kaggle directory and move the json file there
!mkdir ~/.kaggle
!mv kaggle.json ~/.kaggle

# change permissions so only you have read & write access to the credentials
!chmod 600 ~/.kaggle/kaggle.json

# download dataset from Kaggle
!kaggle competitions download -c 'bluebook-for-bulldozers'

# make a 'data' directory and move the dataset there
!mkdir data
!mv bluebook-for-bulldozers.zip data

# unzip bulldozers data
!unzip data/bluebook-for-bulldozers.zip -d data/bbfb

# unzip train data
!unzip data/bbfb/Train.zip -d data/bbfb

Saving kaggle.json to kaggle.json
Downloading bluebook-for-bulldozers.zip to /content
 99% 48.0M/48.4M [00:02<00:00, 32.9MB/s]
100% 48.4M/48.4M [00:02<00:00, 21.9MB/s]
Archive:  data/bluebook-for-bulldozers.zip
  inflating: data/bbfb/Data Dictionary.xlsx  
  inflating: data/bbfb/Machine_Appendix.csv  
  inflating: data/bbfb/Test.csv      
  inflating: data/bbfb/Train.7z      
  inflating: data/bbfb/Train.zip     
  inflating: data/bbfb/TrainAndValid.7z  
  inflating: data/bbfb/TrainAndValid.csv  
  inflating: data/bbfb/TrainAndValid.zip  
  inflating: data/bbfb/Valid.7z      
  inflating: data/bbfb/Valid.csv     
  inflating: data/bbfb/Valid.zip     
  inflating: data/bbfb/ValidSolution.csv  
  inflating: data/bbfb/median_benchmark.csv  
  inflating: data/bbfb/random_forest_benchmark_test.csv  
Archive:  data/bbfb/Train.zip
  inflating: data/bbfb/Train.csv     


If you open the left-hand-side panel on Colab, you should be able to see the new 'data' folder and the Blue Book for Bulldozers zipped dataset within it (see screenshot below).

<img src='https://raw.githubusercontent.com/mathschelsea/learning/main/media/colab_files.png' width='300'>

## 3 - Machine Learning

### 3.1 - Looking at the data

First up, let's have a look at the data. We'll open the 'Train' dataset as it's good practice to never use or look at that the validation or test datasets.

In [2]:
# importing the data
import pandas as pd

df = pd.read_csv('data/bbfb/Train.csv', low_memory=False, parse_dates=['saledate'])

In [3]:
# resetting the display options to max rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# looking at the first 3 rows of data
df.head(3)

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,950FII,950,F,II,,Medium,Wheel Loader - 150.0 to 175.0 Horsepower,North Carolina,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,23.5,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,226,226,,,,,Skid Steer Loader - 1351.0 to 1601.0 Lb Operat...,New York,SSL,Skid Steer Loaders,,OROPS,None or Unspecified,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,None or Unspecified,None or Unspecified,Standard,,,,,,,,,,,


In [4]:
# looking at some key data characteristics
print(f'No. of rows in dataset: {df.shape[0]}')
print(f'No. of cols in dataset: {df.shape[1]}')
print('')
df.info(verbose=True)

No. of rows in dataset: 401125
No. of cols in dataset: 53

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401125 entries, 0 to 401124
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalesID                   401125 non-null  int64         
 1   SalePrice                 401125 non-null  int64         
 2   MachineID                 401125 non-null  int64         
 3   ModelID                   401125 non-null  int64         
 4   datasource                401125 non-null  int64         
 5   auctioneerID              380989 non-null  float64       
 6   YearMade                  401125 non-null  int64         
 7   MachineHoursCurrentMeter  142765 non-null  float64       
 8   UsageBand                 69639 non-null   object        
 9   saledate                  401125 non-null  datetime64[ns]
 10  fiModelDesc               401125 non-null  object        
 11  fiBase

### 3.2 - Editing the data

Our data contains numerical, object and datatime datatypes. Before training a random forest (a type of machine learning model) we need to make sure all the features (independent variables) are numerical. We then need to address any columns that have missing values.

#### 3.2.1 - Missing Values

In [5]:
# check percentage missing for each column
df.isnull().sum().sort_values(ascending=False)/len(df)

Engine_Horsepower           0.937129
Pushblock                   0.937129
Enclosure_Type              0.937129
Blade_Width                 0.937129
Blade_Extension             0.937129
Tip_Control                 0.937129
Scarifier                   0.937102
Grouser_Tracks              0.891899
Hydraulics_Flow             0.891899
Coupler_System              0.891660
fiModelSeries               0.858129
Steering_Controls           0.827064
Differential_Type           0.826959
UsageBand                   0.826391
fiModelDescriptor           0.820707
Backhoe_Mounting            0.803872
Stick                       0.802720
Turbocharged                0.802720
Pad_Type                    0.802720
Blade_Type                  0.800977
Travel_Controls             0.800975
Tire_Size                   0.763869
Track_Type                  0.752813
Grouser_Type                0.752813
Pattern_Changer             0.752651
Stick_Length                0.752651
Thumb                       0.752476
U

In [6]:
# for object features, impute missing values with 'missing'
from pandas.api.types import is_string_dtype, is_object_dtype

for c in df.columns:
    if is_string_dtype(df[c]) or is_object_dtype(df[c]):
        df[c].fillna('missing', inplace=True)

In [7]:
# re-run the cell above to check the missing list again
# only two more columns with missing data
# both are numerical

# machinehourscurrentmeter - impute with the mean
mean = df.MachineHoursCurrentMeter.mean()
df.MachineHoursCurrentMeter.fillna(mean, inplace=True)

# auctioneerid - impute with most common level values
common = df.auctioneerID.value_counts().sort_values(ascending=False)
df.auctioneerID.fillna(common.index[0], inplace=True)

#### 3.2.2 - Datetime features

In [8]:
# extracting numerical information from the datetime features
attr = ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear', 'Quarter']

for n in attr:
  df['saledate_' + n.lower()] = getattr(df['saledate'].dt, n.lower())

In [9]:
# quick check
df[['saledate_year', 'saledate_month', 'saledate_day',
    'saledate_dayofweek', 'saledate_dayofyear', 'saledate_quarter']].head(5)

Unnamed: 0,saledate_year,saledate_month,saledate_day,saledate_dayofweek,saledate_dayofyear,saledate_quarter
0,2006,11,16,3,320,4
1,2004,3,26,4,86,1
2,2004,2,26,3,57,1
3,2011,5,19,3,139,2
4,2009,7,23,3,204,3


#### 3.2.2 - Object Features

In [10]:
# converting all object features to categorical features

for n,c in df.items():
  if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()
  if is_object_dtype(c): df[n] = c.astype('category').cat.as_ordered()

In [11]:
# quick spot check
print('Categories:')
print(df.Coupler_System.cat.categories)
print()
print('Value Counts:')
print(df.Coupler_System.value_counts(dropna=False))

Categories:
Index(['None or Unspecified', 'Yes', 'missing'], dtype='object')

Value Counts:
Coupler_System
missing                357667
None or Unspecified     40430
Yes                      3028
Name: count, dtype: int64


In [12]:
# now converting these categories to their equivalent code values
from pandas.api.types import is_numeric_dtype

for c in df.columns:
  if not is_numeric_dtype(c):
      df[c] = pd.Categorical(df[c]).codes

In [13]:
# quick spot check
print('Value Counts:')
df.Coupler_System.value_counts(dropna=False).sort_index()

Value Counts:


Coupler_System
0     40430
1      3028
2    357667
Name: count, dtype: int64

#### 3.2.3 - Drop Columns

Time to drop any columns that we don't want or need.

In [21]:
# looking for columns with nearly all unique levels (will drop these)
print('Percentage Unique')
for c in df.columns:
  print(f'{c}:', '{:.1%}'.format(df[c].nunique()/len(df)))

Percentage Unique
SalesID: 100.0%
SalePrice: 0.2%
MachineID: 85.0%
ModelID: 1.3%
datasource: 0.0%
auctioneerID: 0.0%
YearMade: 0.0%
MachineHoursCurrentMeter: 3.8%
UsageBand: 0.0%
saledate: 1.0%
fiModelDesc: 1.2%
fiBaseModel: 0.5%
fiSecondaryDesc: 0.0%
fiModelSeries: 0.0%
fiModelDescriptor: 0.0%
ProductSize: 0.0%
fiProductClassDesc: 0.0%
state: 0.0%
ProductGroup: 0.0%
ProductGroupDesc: 0.0%
Drive_System: 0.0%
Enclosure: 0.0%
Forks: 0.0%
Pad_Type: 0.0%
Ride_Control: 0.0%
Stick: 0.0%
Transmission: 0.0%
Turbocharged: 0.0%
Blade_Extension: 0.0%
Blade_Width: 0.0%
Enclosure_Type: 0.0%
Engine_Horsepower: 0.0%
Hydraulics: 0.0%
Pushblock: 0.0%
Ripper: 0.0%
Scarifier: 0.0%
Tip_Control: 0.0%
Tire_Size: 0.0%
Coupler: 0.0%
Coupler_System: 0.0%
Grouser_Tracks: 0.0%
Hydraulics_Flow: 0.0%
Track_Type: 0.0%
Undercarriage_Pad_Width: 0.0%
Stick_Length: 0.0%
Thumb: 0.0%
Pattern_Changer: 0.0%
Grouser_Type: 0.0%
Backhoe_Mounting: 0.0%
Blade_Type: 0.0%
Travel_Controls: 0.0%
Differential_Type: 0.0%
Steering_Con

In [22]:
df.drop(['SalesID', 'saledate', 'MachineID'], axis=1, inplace=True)

### 3.3 - Model Training

#### 3.3.1 - Data Split

We need to split the data into a train and test split so we can train the model on the train split and then measure it's performance on the test split.

In [None]:
# defining sizes
test_size = 12000
train_size = len(df) - test_size

# splitting data
def split_vals(a,n):
  return a[:n].copy(), a[n:].copy()

train_df, test_df = split_vals(df, train_size)

# checking sizes
train_df.shape, test_df.shape

#### 3.3.2 - Training

In [23]:
# train a random forest
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1)
m.fit(df_train.drop('SalePrice', axis=1), df_train.SalePrice)

### 3.4 - Evaluating

In [24]:
m.score(df_test.drop('SalePrice', axis=1), df_test.SalePrice)

0.9875599980704429

## Next Steps

The code above showcased a quick way to train a machine learning model. The data exploration was very minimal and the engineering of the data was quite basic. A better model could be trained if more time and thought was put into these steps as well as others. Why don't you have a go at trying to improe the model performance value above by re-working this code. Here are some hints of things to do:

1. Are there any errors in the data that could be corrected or filtered? Take a look a bit more at the features and their unique levels.
2. Are any of the features highly correlated? Could some of these be dropped from the training data?
3. Are there other ways we can convert the object features to numerical features? How do these affect the model's performance?
4. Can we handle missing data better? Can more information be pulled out from the missing data?
5. Can more information be pulled out for the datetime fields? How do these extra features fair in a feature importance plot?
6. The model chosen is a random forest but others will do the job. Try out another model and see which is the champion.
7. Are there other performance metrics that you can assess the model's performance on?
8. Look into the model's hyperparameters. Learning what the main ones are. Can these be improved in any way?
9. What model explainability metrics and charts can you create that help tell the model's performance story? e.g. lift plots.
10. Can you create your own notebook on this work and share it with others? Where would you host it? GitHub? Kaggle? Medium?