* Link to the course module: https://course18.fast.ai/lessonsml1/lesson1.html
* Link to forum discussing location of ml1 (fastai-v1) course: https://forums.fast.ai/t/where-is-machine-learning-repo/83869
* Command to install the `fastai-v1` library: `conda install -c pytorch -c fastai fastai=1.0.61`
* To solve the problem with installing `bcolz`, create a new Python 3.8 environment and then run `conda install bcolz`

# 01 | Basics - Introduction to Random Forests

## 1.1. About autoreload

In [None]:
!pip uninstall opencv-contrib-python
!pip uninstall seaborn
!pip uninstall bcolz
!pip uninstall graphviz
!pip uninstall sklearn_pandas
!pip uninstall isoweek
!pip uninstall ipywidgets
!pip uninstall tqdm
!pip uninstall autopep8
!pip uninstall pyarrow

In [2]:
# !pip install opencv-contrib-python
# !pip install seaborn
# !pip install bcolz
# !pip install graphviz
# !pip install sklearn_pandas
# !pip install isoweek
# !pip install ipywidgets
# !pip install tqdm
# !pip install -U autopep8   
# !pip install pyarrow

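* `%load_ext autoreload` and `%autoreload 2` reload modules automatically whenever their source changes, so edits are picked up on demand
* `%matplotlib inline` renders plots within the notebook itself
* If you want to use a Python variable inside a shell command (a line starting with `!`) in Jupyter, wrap it in CURLIES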

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [4]:
import os

currDir = os.getcwd()
os.chdir("../fastai/")
from structured import *
from imports import *

os.chdir(currDir)
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #

ModuleNotFoundError: No module named 'bcolz'

## 1.2. About Software development practices in Data Science

* `PEP 8` is the official style guide for Python source code
* Most of DS/ML is `prototyping models`
* The software development component is largely taken away (although only loosely speaking)
* Best practices for prototyping models are barely standardised, so you are on your own to set up your own style and standards

## 1.3. About Blue Book for Bulldozers

* It is about predicting the `sale price` of heavy equipment
* Keep a separate `PATH` variable to store the location of the dataset
* Also, the *structured* module from fastai has since been moved to *tabular*
* This is an example of `structured` data --> this is where pandas comes in

In [None]:
PATH2DATA = '../datasets/kaggle/bluebook_bulldozers/'


## 1.4. Basics on Visualizing the Data

* You can use the `head` shell command to peek at the first few lines of the file

In [None]:
!head '../datasets/kaggle/bluebook_bulldozers/Train.csv'

* Or use the pandas library
* `f'{PATH2DATA}Train.csv'` is an f-string (formatted string literal), available since Python 3.6
* e.g. if you have the variable `NAME = 'deedo'` and you write `f'My name is {NAME}'`, the result is `My name is deedo`
* But keep in mind that if you remove the `f` it WON'T WORK; the `f` prefix enables `interpolation` of variables into the string
* It also works for non-string values such as `integers`, and you can put any expression inside the curlies, for example `{NAME.upper()}`
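
* A minimal sketch of the f-string behaviour described above; `NAME` is only the illustrative variable from the bullets, not something from the course code

```python
# NAME is just an illustrative variable, as in the notes above.
NAME = 'deedo'

greeting = f'My name is {NAME}'          # the variable is interpolated
shouted = f'My name is {NAME.upper()}'   # any expression can go inside the braces
plain = 'My name is {NAME}'              # no f prefix: the braces stay literal
total = f'{2 + 3} items'                 # non-string values are converted to text

print(greeting)
print(shouted)
print(plain)
print(total)
```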

In [None]:
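# low_memory=True (the default) parses the file in chunks, which saves RAM but can give mixed dtypes.
# low_memory=False reads the whole file at once so that column dtypes are inferred consistently.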
df_raw = pd.read_csv(f'{PATH2DATA}Train.csv',
                     low_memory=False, parse_dates=["saledate"])


* You will find that a pandas `DataFrame` is very similar to an R data frame

In [None]:
df_raw

## 1.5. Visualizing your Data to Inspect Types

* It is important to look at your data and understand `the format`, `how it is stored`, and `what type of values` it holds
* Even if you read the description of the data, it will never be enough
* `.tail()` will show us the last few rows
* Keep in mind: don't look at the data for too long, it risks `overfitting`
* You should generally keep the `EDA` (exploratory data analysis) right towards the end

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000):
        with pd.option_context("display.max_columns", 1000):
            display(df)

In [None]:
display_all(df_raw.tail().T)

## 1.6. Inspecting the Loss-Function

* Quite often it is important to check which loss function (evaluation metric) the project uses
* Handling it may simply fall into the pre-processing stage
* On Kaggle, you can check this in the `Evaluation` section
* E.g. Bluebook for Bulldozers is evaluated using `Root Mean Squared Log Error (RMSLE)`
* Hence we take the log of our `dependent variable`, which is the sale price, so that plain RMSE on the logged values gives us RMSLE

In [None]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
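
* A small sketch of why logging the target is enough: RMSLE on raw prices is the same number as RMSE on log prices
* Assumptions: the metric uses plain logs (some definitions use `log(1 + x)` instead), and the prices below are made up for illustration

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

def rmsle(pred, actual):
    """Root mean squared log error, using plain logs (an assumption, see above)."""
    return float(np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2)))

# Made-up prices, for illustration only.
actual = np.array([10000.0, 25000.0, 60000.0])
pred = np.array([12000.0, 20000.0, 66000.0])

# Both numbers agree: once the target is logged, ordinary RMSE is the metric.
print(rmsle(pred, actual), rmse(np.log(pred), np.log(actual)))
```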

* You can inspect each column of the DataFrame as a separate pandas `Series`
* A pandas Series can easily be converted to a `numpy array`
* Hence numpy and pandas work together smoothly
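
* A quick sketch of moving between a DataFrame column, a Series and a numpy array; the toy frame and its column names are made up for illustration

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are made up for illustration.
df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'year': [2004, 2006, 2011]})

col = df['price']        # a single column is a pandas Series
arr = col.to_numpy()     # the same values as a NumPy array
logged = np.log(col)     # NumPy functions accept a Series and return a Series

print(type(col).__name__, type(arr).__name__, type(logged).__name__)
```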

## 1.7. Curse of Dimensionality and No Free Lunch Theorem

* In the 90s, ML theoreticians essentially believed that ...
* ... data points get `stretched` to the edges of the space with each dimension you add to the data
* Hence the distance between them, which is essentially a `feature`, becomes meaningless
* Modern empirical ML disagrees: Jeremy says that adding more columns generally improves the model's performance
* Just because the points sit on the edges does not mean the distances between them stop varying
* Even kNN, which relies on the distance between points, works well in practice on high-dimensional data
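
* A rough sketch to see the effect for yourself: distances from one random point to the others bunch together as dimensions are added, but they do not become identical
* Assumptions: uniformly random points, which is exactly the case real data is NOT; the function name and sizes are made up for illustration

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n_dims, n_points=200):
    """Return (min, max) Euclidean distance from one random point to the rest."""
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return dists.min(), dists.max()

for d in (2, 100, 1000):
    lo, hi = distance_spread(d)
    print(f'{d:>4} dims: min={lo:.2f} max={hi:.2f} max/min={hi / lo:.2f}')
```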

---

* Now the No Free Lunch theorem --> "there is NO TYPE of model that WORKS WELL for EVERY TYPE of dataset"
* This makes sense mathematically: for a truly random dataset, no method can be expected to make sense of it
* Why? Because the models we use rely on `features` to make sense of data
* A random dataset has no `features`, unless you already know the process that generated the data well, but then it makes no sense to use ML since you already know the answer
* In real applications, data is `NOT RANDOM`
* Mathematically, we believe that real data sits on a `lower dimensional manifold`
* And it was created by some `Causal Structure`, albeit one of a stochastic nature

## 1.8. Initialise the Random Forest Algorithm

* A regressor is used when the dependent variable is `continuous`
* A classifier is used when the dependent variable is `discrete` or `categorical`
* Random Forest is a good place to begin: if your data doesn't work with RF to begin with, there is possibly something already wrong with it
* The first argument of the `.fit` method is the `independent variables`, i.e. the things you are going to use to predict; in our case we drop 'SalePrice', meaning we want to use EVERYTHING EXCEPT SALEPRICE
* `DataFrame.drop` returns a new DataFrame with the `SalePrice` column REMOVED (with `axis=0` it works on rows as well, FYI)
* The second argument is the `dependent variable`, i.e. the thing you want to predict; for us this is `df_raw.SalePrice`

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)
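
* A self-contained sketch of the same `drop` + `fit` pattern on purely numeric data, where the call goes through
* Assumptions: the columns `age` and `hours`, the price formula and the hyperparameters are all made up for illustration

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic, purely numeric data; names and formula are made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({'age': rng.integers(1, 20, 200),
                   'hours': rng.integers(0, 10000, 200)})
df['SalePrice'] = 50000 - 1500 * df['age'] - 2 * df['hours']

X = df.drop('SalePrice', axis=1)   # independent variables; df itself is unchanged
y = df['SalePrice']                # dependent variable

m = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0)
m.fit(X, y)
print(m.score(X, y))               # R^2 on the training data
```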

## 1.9. How to Pre-Process (Convert all Colms to Numbers) Before Using RF Algorithm as Part of Feature Engineering

* You can tell from running the previous cell (it fails) that our data contains `NON-NUMBERS`
* The bluebook dataset contains both `continuous` (numeric) and `categorical` columns (non-numeric, or numeric only in form, like a ZipCode)
* One example of non-numeric data in bluebook is the `saledate` column

In [None]:
df_raw.saledate

* As you can see, the **dtype** of this column is **datetime64[ns]**
* That is not useful as-is for the RF algorithm
* Now we are going to do our very first `feature engineering`
* You need to **transform** the information that you have into something that is **useful** to you
* For example: was it Christmas on that day, or, if you are analysing sport, was there a football match on that date
* No ML algorithm can tell you that; you have to use **feature engineering** to make use of data that is useless in its raw form
* Using `add_datepart` will **split** the date into Year, Month, Day and more

In [None]:
add_datepart(df_raw, 'saledate')

* Now if you inspect all the columns of your data you will see that `saledate` is gone
* and additional columns have appeared, such as **saleYear**, **saleMonth**, **saleWeek**, **saleDay** and many more ...

In [None]:
df_raw.columns

* This is the most basic type of `feature engineering`
* `[BONUS]` In your spare time, try to do this step manually so you can understand the pre-processing properly
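
* One possible way to do the bonus step by hand with the pandas `.dt` accessor; this is a sketch, not the actual `add_datepart` implementation
* Assumptions: the toy dates are made up, and the new column names only imitate the ones listed above

```python
import pandas as pd

# Toy frame with a datetime column; the dates are made up for illustration.
df = pd.DataFrame({'saledate': pd.to_datetime(
    ['2006-11-16', '2004-03-26', '2011-12-31'])})

dates = df['saledate'].dt
df['saleYear'] = dates.year
df['saleMonth'] = dates.month
df['saleDay'] = dates.day
df['saleDayofweek'] = dates.dayofweek        # Monday=0 ... Sunday=6
df['saleIs_month_end'] = dates.is_month_end

df = df.drop('saledate', axis=1)             # the raw date column is no longer needed
print(df)
```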

## 1.10. Feature Engineering :2 -- Take care of the Strings

* As you can see below, you still have certain columns that contain **Strings**
* How do you get rid of those ?

In [None]:
df_raw.head()

* You can use the `train_cats` function in the fastai library
* It converts string columns to pandas categoricals, which map each string to an integer code in a defined order
* But `WARNING`: make sure to use the same defined order for the **TRAIN**, **VAL** and **TEST** data
* Otherwise your model will be **non-predictive**
* For that purpose, use `apply_cats` on the validation and test data

In [None]:
train_cats(df_raw)
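
* A plain-pandas sketch of the idea behind `train_cats` and `apply_cats`: build the categories on the training data, then reuse exactly those categories elsewhere
* Assumptions: this only approximates what the fastai helpers do, and the toy frames are made up for illustration

```python
import pandas as pd

# Toy train/validation frames; the values are made up for illustration.
train = pd.DataFrame({'UsageBand': ['Low', 'High', 'Medium', 'Low']})
valid = pd.DataFrame({'UsageBand': ['Medium', 'Low', 'High']})

# Training set: turn the string column into an ordered categorical.
train['UsageBand'] = train['UsageBand'].astype('category').cat.as_ordered()

# Validation set: reuse the training categories so the integer codes line up.
valid['UsageBand'] = pd.Categorical(valid['UsageBand'],
                                    categories=train['UsageBand'].cat.categories,
                                    ordered=True)

print(train['UsageBand'].cat.codes.tolist())
print(valid['UsageBand'].cat.codes.tolist())
```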

* After running `train_cats` the DataFrame looks unchanged, because the conversion happens behind the scenes
* But the UsageBand column now has additional attributes
* In particular, `UsageBand.cat.categories` is now available
* It lists all the categories that get *numerified*

In [None]:
df_raw.UsageBand.cat.categories

* Also notice from above that the list is in a weird order: `High, Low, Medium`
* Truth is it doesn't matter **TOO MUCH** when you build decision trees
* Because decision trees split on a threshold at certain points ...
* ... so with this order a tree might compare *high* vs *low* and *medium*
* ... or *medium* vs *high* and *low*
* That is weird from an ordering point of view
* So instead we define the order manually in the following way; it won't improve the performance much, but even a tad of a boost helps

In [None]:
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(
    ['High', 'Medium', 'Low'], ordered=True)

* You can also access the integer codes for UsageBand through the `cat.codes` attribute
* With the default alphabetical order the codes are `High` -> 0, `Low` -> 1, `Medium` -> 2; with the order set above they become `High` -> 0, `Medium` -> 1, `Low` -> 2; `NaN` is always -1
* Keep in mind that the Random Forest algorithm consists of trees whose nodes each make a ...
* ... `single split` based on a `threshold`, e.g. with the order set above, splitting on whether the code is less than 1 compares `HIGH VS MEDIUM n LOW`
* And splitting on whether the code is less than 2 compares `LOW VS HIGH n MEDIUM`


In [None]:
df_raw.UsageBand

In [None]:
df_raw.UsageBand.cat.codes
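
* A toy sketch of how the category order changes the integer codes; the Series below is made up and stands in for the real UsageBand column

```python
import pandas as pd

# Toy stand-in for the UsageBand column, including one missing value.
s = pd.Series(['High', 'Medium', 'Low', None]).astype('category')
default_codes = s.cat.codes.tolist()     # categories sorted alphabetically

s = s.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
ordered_codes = s.cat.codes.tolist()     # categories in the order we chose

print(default_codes)
print(ordered_codes)

# A threshold on the codes now follows the natural order
# (the missing value has code -1, so it lands on the same side as High).
print((s.cat.codes >= 1).tolist())
```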

* The UsageBand column is what is known as `ordinal` `categorical data`
* That is, its values follow a natural order, for example *high*, *medium*, *low*

## 1.11. FeatureEng. :3 -- Take Care of Missing Data

Continue from `1:03:52`