# Getting Started with Data Science

If you want to start, or reignite, your Data Science journey, look no further than this notebook. In it you will:

1. Use a free, cloud-based platform to run Python code.
2. Download a Data Science dataset.
3. Train your first machine learning model.

This can all be done in less than 10 minutes. Let's get cracking!

## 1 - Google Colab

The free, cloud-based platform mentioned above is Google Colaboratory, or Colab for short.

☁️ Colab provides a convenient and free platform for running Python code, particularly suited for those not interested in local installation or powerful hardware. It's a hosted Jupyter Notebook service, inheriting all the pros of Jupyter, such as interactivity, visualizations, and documentation capabilities.

You can open this notebook in Colab by clicking the button below.

<a target="_blank" href="https://colab.research.google.com/github/mathschelsea/learning/blob/main/notebooks/getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This then becomes your notebook to edit as you wish and save to your Google Drive. If you don't have a Google account you may need to create one (don't worry, that is also free and quick to set up).

## 2 - Kaggle Dataset

The dataset we'll use is provided by Kaggle.

🔭 Kaggle is a Data Science competition platform and online community. It provides a space to delve into diverse datasets, work on modelling projects, collaborate with others, and learn from shared insights.

The code below downloads the Kaggle dataset [Blue Book for Bulldozers](https://www.kaggle.com/c/bluebook-for-bulldozers/overview). If you'd prefer a different dataset, feel free to edit the code as required. If you don't have a [Kaggle](https://www.kaggle.com/) account then you will need to create one (again, don't worry, it's free and quick to set up). Because this is a competition dataset, you may also need to accept the competition rules on its Kaggle page before the download will work.

### 2.1 - Kaggle Credentials

Go to the 'Settings' page on Kaggle and, on the 'Account' tab, scroll down to 'API'. Select the 'Create New Token' button (see image below). This will download a 'kaggle.json' file to your PC. Remember where this file is saved as we'll need it later.

![Kaggle Credentials](https://raw.githubusercontent.com/mathschelsea/learning/main/media/kaggle_api.png)

### 2.2 - Download Data

Run the code below to download the Blue Book for Bulldozers data into your Colab runtime session. When you run the code you will be prompted to upload your Kaggle credentials. This is the 'kaggle.json' file that you downloaded in the previous section. The code then moves the file into a '.kaggle' directory that it creates in the Colab runtime's home directory (not on your local machine). So effectively we're just moving your Kaggle credentials to where the Kaggle API expects to find them.

In [None]:
# install kaggle
!pip install -q kaggle

# upload the 'kaggle.json' file
from google.colab import files
files.upload()

# make a kaggle directory and move the json file there
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle

# change permissions so only you have read & write access to the credentials
!chmod 600 ~/.kaggle/kaggle.json

# download dataset from Kaggle
!kaggle competitions download -c 'bluebook-for-bulldozers'

# make a 'data' directory and move the dataset there
!mkdir -p data
!mv bluebook-for-bulldozers.zip data

# unzip bulldozers data
!unzip data/bluebook-for-bulldozers.zip -d data/bbfb

# unzip train data
!unzip data/bbfb/Train.zip -d data/bbfb
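
If you'd like to see what the credential steps are doing without the shell commands, here is a minimal Python sketch of the same idea: put the token in a '.kaggle' directory and restrict its permissions to the owner. The function name `install_kaggle_token` and its `home` parameter are hypothetical, made up for this illustration, and the sketch copies the file rather than moving it. The demo uses a dummy token in a temporary directory, so it never touches your real credentials.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home):
    """Copy a kaggle.json token into <home>/.kaggle with owner-only access.

    Hypothetical helper that mirrors the mkdir / mv / chmod 600 shell steps.
    """
    kaggle_dir = Path(home) / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # same idea as `mkdir -p`
    target = kaggle_dir / "kaggle.json"
    target.write_bytes(Path(token_path).read_bytes())
    os.chmod(target, 0o600)  # owner can read and write, nobody else can
    return target


# Demo with a dummy token inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    dummy = Path(tmp) / "kaggle.json"
    dummy.write_text(json.dumps({"username": "example", "key": "not-a-real-key"}))
    placed = install_kaggle_token(dummy, Path(tmp) / "fake_home")
    mode = stat.S_IMODE(placed.stat().st_mode)
    token_exists = placed.exists()
    print(placed.name, oct(mode))
```

The `0o600` mode is the Python spelling of `chmod 600`: read and write for you, nothing for anyone else.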

If you open the left-hand-side panel on Colab, you should be able to see the new 'data' folder and the Blue Book for Bulldozers zipped dataset within it (see screenshot below).

<img src='https://raw.githubusercontent.com/mathschelsea/learning/main/media/colab_files.png' width='300'>

## 3 - Machine Learning

### 3.1 - Looking at the data

First up, let's have a look at the data. We'll open only the 'Train' dataset, as it's good practice not to look at the validation or test datasets while exploring the data and building the model.

In [None]:
# importing the data
import pandas as pd

df = pd.read_csv('data/bbfb/Train.csv', low_memory=False, parse_dates=['saledate'])

In [None]:
# resetting the display options to max rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# looking at the first 3 rows of data
df.head(3)

In [None]:
# looking at some key data characteristics
print(f'No. of rows in dataset: {df.shape[0]}')
print(f'No. of cols in dataset: {df.shape[1]}')
print('')
df.info(verbose=True)

### 3.2 - Editing the data

At this stage we would cleanse, edit, and engineer the data.

#### 3.2.1 - Missing Values

In [None]:
# check the fraction of values missing in each column (multiply by 100 for a percentage)
df.isnull().sum().sort_values(ascending=False)/len(df)

In [None]:
# for object features, impute missing values with 'missing'
from pandas.api.types import is_string_dtype, is_object_dtype

for c in df.columns:
    if is_string_dtype(df[c]) or is_object_dtype(df[c]):
        df[c] = df[c].fillna('missing')

In [None]:
# re-run the cell above to check the missing list again
# only two more columns with missing data
# both are numerical

# MachineHoursCurrentMeter - impute with the mean
mean = df.MachineHoursCurrentMeter.mean()
df['MachineHoursCurrentMeter'] = df['MachineHoursCurrentMeter'].fillna(mean)

# auctioneerID - impute with the most common value
# (value_counts() already sorts by frequency, most common first)
common = df.auctioneerID.value_counts()
df['auctioneerID'] = df['auctioneerID'].fillna(common.index[0])
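
Filling a gap with the mean hides the fact that the value was ever missing, and that fact can itself be informative. One common option, sketched here on a made-up toy column rather than the bulldozer data, is to add a flag column before imputing. The column names and the choice of the median are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a numeric column with gaps (values are made up).
toy = pd.DataFrame({"hours": [10.0, np.nan, 30.0, np.nan, 50.0]})

# Record which rows were missing *before* imputing.
toy["hours_was_missing"] = toy["hours"].isnull()

# The median is less sensitive to extreme values than the mean.
toy["hours"] = toy["hours"].fillna(toy["hours"].median())

print(toy)
```

The model can then learn from both the filled value and the flag, so "was missing" is not silently thrown away.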

#### 3.2.2 - Datetime features

In [None]:
# extracting numerical information from the datetime features
attr = ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear', 'Quarter']

for n in attr:
  df['saledate_' + n.lower()] = getattr(df['saledate'].dt, n.lower())

In [None]:
# quick check
df[['saledate_year', 'saledate_month', 'saledate_day',
    'saledate_dayofweek', 'saledate_dayofyear', 'saledate_quarter']].head(5)
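
To see what the `.dt` accessor returns, here is a small sketch on three made-up dates (not taken from the dataset). It also shows one extra attribute, `is_month_end`, as an example of further information that could be pulled out; whether it helps the model is something you'd have to test.

```python
import pandas as pd

# Made-up dates; 2024-01-01 fell on a Monday.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-03-31", "2024-12-25"]))

features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "dayofweek": dates.dt.dayofweek,        # Monday = 0, Sunday = 6
    "quarter": dates.dt.quarter,
    "is_month_end": dates.dt.is_month_end,  # an extra feature not used above
})
print(features)
```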

#### 3.2.3 - Object Features

In [None]:
# converting all object features to ordered categorical features
for n, c in df.items():
  if is_string_dtype(c) or is_object_dtype(c):
    df[n] = c.astype('category').cat.as_ordered()

In [None]:
# quick spot check
print('Categories:')
print(df.Coupler_System.cat.categories)
print()
print('Value Counts:')
print(df.Coupler_System.value_counts(dropna=False))

In [None]:
# now converting these categories to their equivalent code values
from pandas.api.types import is_numeric_dtype

for c in df.columns:
  # test the column's dtype, not its name, so numeric columns are left untouched
  if not is_numeric_dtype(df[c]):
    df[c] = pd.Categorical(df[c]).codes

In [None]:
# quick spot check
print('Value Counts:')
df.Coupler_System.value_counts(dropna=False).sort_index()

#### 3.2.4 - Drop Columns

Time to drop any columns that we don't want or need.

In [None]:
# looking for columns with nearly all unique levels (will drop these)
print('Percentage Unique')
for c in df.columns:
  print(f'{c}:', '{:.1%}'.format(df[c].nunique()/len(df)))

In [None]:
df.drop(['SalesID', 'saledate', 'MachineID'], axis=1, inplace=True)
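
The "percentage unique" check can also be used to shortlist candidate columns programmatically. Here is a rough sketch on a made-up toy frame; the 0.9 threshold is an arbitrary assumption, and a high ratio is only a hint that a column behaves like an ID, so the final decision to drop is still a judgment call.

```python
import pandas as pd

# Toy frame: 'id' is unique per row, the other columns are not.
toy = pd.DataFrame({
    "id": range(6),
    "group": ["a", "a", "b", "b", "c", "c"],
    "flag": [0, 1, 0, 1, 0, 1],
})

# Share of distinct values per column (1.0 means every row is different).
unique_ratio = toy.nunique() / len(toy)

threshold = 0.9  # hypothetical cut-off, tune to taste
near_unique = unique_ratio[unique_ratio > threshold].index.tolist()

print(unique_ratio)
print(near_unique)
```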

### 3.3 - Model Training

#### 3.3.1 - Data Split

We need to split the data into a train split and a test split so we can train the model on the train split and then measure its performance on the test split.

In [None]:
# defining sizes
test_size = 12000
train_size = len(df) - test_size

# splitting data
def split_vals(a,n):
  return a[:n].copy(), a[n:].copy()

train_df, test_df = split_vals(df, train_size)

# checking sizes
train_df.shape, test_df.shape
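
It's worth being clear about what `split_vals` does: it slices by position, so the test split is simply the last rows of the DataFrame in whatever order they happen to be in, not a random sample. The sketch below shows this on a ten-row toy frame (made-up values).

```python
import pandas as pd


def split_vals(a, n):
    # the first n rows go to the first piece, the remainder to the second
    return a[:n].copy(), a[n:].copy()


toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})
toy_train, toy_test = split_vals(toy, len(toy) - 3)

print(toy_train.shape, toy_test.shape)
print(toy_test["x"].tolist())
```

If you would rather have a random split, scikit-learn's `train_test_split` (in `sklearn.model_selection`) is one alternative to try; which approach is more appropriate depends on how the rows are ordered and on how the model will be used.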

#### 3.3.2 - Training

In [None]:
# train a random forest
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1)
m.fit(train_df.drop('SalePrice', axis=1), train_df.SalePrice)

### 3.4 - Evaluating

In [None]:
m.score(test_df.drop('SalePrice', axis=1), test_df.SalePrice)
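
For scikit-learn regressors, `score` returns the coefficient of determination, R². The sketch below computes R² by hand on a few made-up prices so you can see what the number means, alongside RMSLE (root mean squared log error), which is the kind of metric the Kaggle competition page describes for this dataset. The helper names `r_squared` and `rmsle` are my own, and the `log1p` form of RMSLE is one common convention, so treat it as an assumption rather than the competition's exact formula.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) on both sides."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Made-up prices and predictions, purely for illustration.
actual = [10000, 20000, 30000, 40000]
predicted = [12000, 18000, 33000, 39000]

print(f"R^2:   {r_squared(actual, predicted):.3f}")
print(f"RMSLE: {rmsle(actual, predicted):.3f}")
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean. RMSLE works on relative rather than absolute errors, so being $1,000 out on a cheap machine counts for more than being $1,000 out on an expensive one.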

## 4 - Next Steps

The code above showcased a quick way to train a machine learning model. The data exploration was minimal and the feature engineering was quite basic. A better model could be trained if more time and thought were put into these steps, as well as others. Why not have a go at improving the model's performance by reworking this code? Here are some questions that you might want to answer and suggestions of things to do next.

* Are there any errors in the data that could be corrected or filtered? Take a closer look at the features and their unique levels.
* Are any of the features highly correlated? Is it reasonable to drop one of the correlated features from the training data?
* Can we handle missing data better? Can more information be pulled out from the missing data?
* Can more information be pulled out from the datetime fields? How do these extra features fare in a feature importance plot?
* What is the 'score' metric in the section '3.4 - Evaluating'?
* What other metrics could we use to evaluate a model's performance?
* What is a feature importance plot? What feature has the highest importance in your model?
* Why did we split the data in section '3.3.1 - Data Split'?
* Can you find out what your model's hyperparameters are? Do you know what each hyperparameter does?
* Can you produce a lift plot? What does it show? What are other model explainability visuals you can use?
* The model chosen is a random forest but others will do the job. Try out another model and see which performs better.
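
As a starting point for the feature importance and hyperparameter questions, here is a minimal sketch on synthetic data (not the bulldozer dataset). The column names 'signal' and 'noise', the sample size, and the `n_estimators=50` setting are all assumptions chosen to keep the example small and fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on 'signal' and not at all on 'noise'.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = 10 * X["signal"] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=-1)
model.fit(X, y)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# The hyperparameters the model was built with.
params = model.get_params()
print(params["n_estimators"], params["max_depth"])
```

With the model from this notebook, the same two attributes are available as `m.feature_importances_` and `m.get_params()`. Plotting the importances as a bar chart (for example with `importances.plot.barh()`, which needs matplotlib) gives you a simple feature importance plot.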


