<h1>Machine Learning Models 101</h1>

Machine learning models are algorithms that learn patterns from data and then make predictions or decisions on previously unseen data (for example, predicting stock prices or housing prices).

In this tutorial, we are going to cover how machine learning models work. The lesson is broken into the following sections.

<b>1. Data Preparation</b>: The data is cleaned, transformed, and divided into training and test sets.

<b>2. Model Type Selection</b>: In Python, we can create a variety of models (Decision Tree, Random Forest, etc.) using the scikit-learn library.

<b>3. Model Training</b>: After we choose and create a model, we should fit/train the model, so it can capture and learn patterns from the training data.

<b>4. Model Evaluation</b>: Once model training is complete, we can evaluate the model on the test set to measure its performance.

Once a model is trained and evaluated, it can be used to make predictions on new, unseen data.

<h3>1. Data Preparation</h3>

Data preparation for machine learning is the process of cleaning, transforming, and organizing the data so that it can be used to train a model. It is an important step as the quality of the data can greatly impact the performance of the model.

In [1]:
# modules
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer



In [2]:
# Useful Functions & Tools

# pd.DataFrame.head()
# pd.DataFrame.describe()
# pd.DataFrame.columns


# Creating a new DataFrame containing only feature columns
# features = ['col1', 'col3', 'col5', 'col7']
# X = dataset[features]
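
The commented pattern above can be made concrete with a small sketch. The dataset below is a made-up toy table; the column names (`rooms`, `area`, `age`, `price`) are assumptions for illustration only and do not come from any real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset; the column names and values are invented for illustration.
dataset = pd.DataFrame({
    "rooms": [2, 3, 4, 3, 5, 2, 4, 3],
    "area": [50.0, 70.0, 90.0, 65.0, 120.0, 45.0, 95.0, 75.0],
    "age": [30, 12, 5, 20, 2, 40, 8, 15],
    "price": [150, 210, 300, 200, 410, 130, 320, 230],
})

# Quick look at the data before modelling.
print(dataset.head())
print(dataset.describe())
print(dataset.columns)

# Target: the column we want to predict.
y = dataset["price"]

# Features: a new DataFrame containing only the chosen feature columns.
features = ["rooms", "area"]
X = dataset[features]

# Hold out 25% of the rows for validation; random_state makes the split repeatable.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_val.shape)
```

Fixing `random_state` is a design choice that keeps the split reproducible between runs, which makes model comparisons fair.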

<h2>Intro to Machine Learning At a Glance</h2>

1. Read the training dataset into a pandas DataFrame.
2. Select the target and features (X, y).
3. Split the dataset into training and validation sets (X_train, X_val, y_train, y_val).
4. Handle missing data (drop the affected columns, or impute both X_train and X_val).
5. Define, fit, predict, and evaluate models (we can use Mean Absolute Error to evaluate; we want the lowest MAE). Choose the best model (create a function for this part).
6. Re-fit the best model using the entire training dataset (X & y, instead of just X_train & y_train).
7. Use the newly fitted model to predict the target with the test dataset.
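
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the data is synthetic (generated with NumPy instead of being read from a file), the column names `x1`, `x2`, and `target` are invented, and `score_model` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Step 1: a synthetic stand-in for a dataset that would normally be read from disk.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
data["target"] = 3.0 * data["x1"] - 2.0 * data["x2"] + rng.normal(scale=0.1, size=n)
data.loc[data.index[::10], "x2"] = np.nan  # inject some missing feature values

# Step 2: select target and features.
y = data["target"]
X = data[["x1", "x2"]]

# Step 3: split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Step 4: impute missing values; fit the imputer on training data only,
# then apply the same transformation to the validation data.
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_val_imp = pd.DataFrame(
    imputer.transform(X_val), columns=X_val.columns, index=X_val.index
)

# Step 5: fit each candidate on the training set and score it on the validation set.
def score_model(model):
    """Hypothetical helper: return the validation MAE of a fitted model."""
    model.fit(X_train_imp, y_train)
    preds = model.predict(X_val_imp)
    return mean_absolute_error(y_val, preds)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {name: score_model(model) for name, model in models.items()}
best_name = min(scores, key=scores.get)  # lowest MAE wins
print(scores, best_name)

# Step 6: re-fit the best model on the entire training dataset (X and y).
final_imputer = SimpleImputer(strategy="mean")
X_full = final_imputer.fit_transform(X)
best_model = models[best_name]
best_model.fit(X_full, y)

# Step 7: predict on test data (two made-up rows, one with a missing value).
X_test = pd.DataFrame({"x1": [0.5, -1.0], "x2": [np.nan, 2.0]})
test_preds = best_model.predict(final_imputer.transform(X_test))
print(test_preds)
```

Fitting the imputer only on the training split keeps information from the validation rows out of the preprocessing step, so the validation MAE stays an honest estimate.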

- fitting == training
- evaluate == score
- "In-sample" scores are a poor indication of model accuracy: to measure accuracy reliably, we need to evaluate on previously unseen data. Therefore, we must split the dataset into a training set and a validation set.
- Underfitting vs. overfitting: underfitting refers to failing to capture relevant patterns, leading to less accurate predictions. Overfitting refers to capturing spurious patterns that won't recur in the future, which also leads to less accurate predictions.
- Just like a decision tree, a random forest model has parameters we can adjust, but random forests generally work reasonably well even without this tuning.
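
The notes on in-sample scores and on underfitting vs. overfitting can be illustrated with a small experiment. This is a sketch on synthetic data (a noisy sine curve, chosen here purely as an assumption for illustration); `get_mae` is a hypothetical helper name, and `max_leaf_nodes` is used as one way to control tree size.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise (invented for this illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def get_mae(model):
    """Hypothetical helper: return (in-sample MAE, validation MAE) for a model."""
    model.fit(X_train, y_train)
    in_sample = mean_absolute_error(y_train, model.predict(X_train))
    validation = mean_absolute_error(y_val, model.predict(X_val))
    return in_sample, validation

# Very few leaves tends to underfit; very many leaves tends to overfit.
results = {}
for leaves in [2, 10, 50, 200]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    results[leaves] = get_mae(tree)
    print(f"max_leaf_nodes={leaves}: in-sample={results[leaves][0]:.3f}, "
          f"validation={results[leaves][1]:.3f}")

# A random forest with default parameters, for comparison.
forest_in, forest_val = get_mae(RandomForestRegressor(random_state=0))
print(f"random forest: in-sample={forest_in:.3f}, validation={forest_val:.3f}")
```

The in-sample MAE keeps shrinking as the tree grows, while the validation MAE does not follow it down, which is why model selection should rely on the validation score.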