<a href="https://colab.research.google.com/github/jeremychege/MachineLearning/blob/main/KaggleIntroToML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Pandas to get Familiar with Data
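- pandas is conventionally imported under the abbreviation `pd`.
- The DataFrame is the most important part of the pandas library; it holds tabular data.
- `describe()` summarizes each numeric column of a DataFrame:
  - count: how many rows have non-missing values
  - mean: the average
  - std (standard deviation): how numerically spread out the values are
  - min, 25%, 50%, 75%, max: the smallest value, the 25th/50th/75th percentiles, and the largest value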
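
As a minimal sketch of how those statistics read, here is `describe()` on a tiny made-up table. The `toy` DataFrame and its values are invented for illustration and are not taken from the Melbourne data. Note how `count` skips the missing price.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one price is missing on purpose
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500000.0, 750000.0, np.nan, 650000.0],
})

summary = toy.describe()
print(summary)

# count only includes rows where the value is present
print(summary.loc["count", "Price"])
# 50% is the median of the non-missing values
print(summary.loc["50%", "Price"])
```

Because statistics such as `mean` and `std` are computed over the non-missing values only, a low `count` is a hint to check how much of a column is actually filled in.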



In [None]:
import pandas as pd
import kagglehub

melbourne_path = kagglehub.dataset_download("dansbecker/melbourne-housing-snapshot")
print("Path to dataset files:", melbourne_path)

# Build the CSV path from the folder returned above instead of hardcoding it
melbourne_data = pd.read_csv(f"{melbourne_path}/melb_data.csv")
#melbourne_data.describe()

melbourne_data.columns

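# Drop rows with missing values (the row labels 1, 2, 4, 6, 7 in the outputs below show this step was applied)
melbourne_data = melbourne_data.dropna(axis=0)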


Path to dataset files: /root/.cache/kagglehub/datasets/dansbecker/melbourne-housing-snapshot/versions/5


# Selecting a subset of data:

- dot notation -> selects the "prediction target", the column you want to predict. A single column is stored in a Series, which is broadly like a DataFrame with only one column of data. Called "y" by convention.

- column list -> selects the "features", the columns that are fed into our model to make predictions. Called "X" by convention.
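
A small sketch of the two selection styles, using a made-up `homes` table (the names `homes`, `y_toy` and `X_toy` and all values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the housing table
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [156.0, 134.0, 120.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y_toy = homes.Price                    # dot notation -> pandas Series
X_toy = homes[["Rooms", "Landsize"]]   # column list  -> pandas DataFrame

print(type(y_toy).__name__, y_toy.shape)
print(type(X_toy).__name__, X_toy.shape)

# Bracket notation with a single name selects the same Series as dot notation
print(homes["Price"].equals(homes.Price))
```

Dot notation only works when the column name is a valid Python identifier; bracket notation works for any column name.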



In [None]:
y = melbourne_data.Price

melbourne_features = [ 'Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

X = melbourne_data[melbourne_features]

#X.describe()

X.head()


   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954


# **Building The Model**

- **scikit-learn** -> library to create the model. Written as **sklearn**. Most popular library for modelling the types of data stored in DataFrames.


*Steps to building the model:*
 1. Define: choose the type of model, e.g. a decision tree, and its parameters.
 2. Fit: capture patterns from the provided data. This is the heart of modelling.
 3. Predict: use the fitted model to generate predictions.
 4. Evaluate: determine how accurate the model's predictions are.


---


 Many ML models allow some randomness in model training. Specifying a number for ***random_state*** ensures you get the same results in each run. Any number works; model quality does not depend meaningfully on which value you choose.
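
To make the reproducibility point concrete, here is a minimal sketch on synthetic data. The arrays `X_demo`, `y_demo` and `X_new` are generated for illustration and are not the Melbourne dataset. Two trees built with the same `random_state` should give identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)
X_new = rng.normal(size=(20, 3))

# Two models defined with the same random_state and fitted on the same data
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print("Identical predictions:", same)
```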









In [None]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

#Fit model
melbourne_model.fit(X, y)

print("Making prediction for the following 5 houses:")
print(X.head())
print("The predictions are:")
print(melbourne_model.predict(X.head()))


Making prediction for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are:
[1035000. 1465000. 1600000. 1876000. 1636000.]


# Model Validation

- Measure the quality of your model. This is the key to iteratively improving models. In most applications, the relevant measure of model quality is predictive accuracy.

- First, we need to summarize the model quality in an understandable way. Looking through a long list of predicted and actual values one by one would be impractical, so we summarize it into a single metric.

- Let's start with ***Mean Absolute Error (MAE)***. The prediction error for each house is `error=actual-predicted`.

- With MAE, we take the absolute value of each error, then take the average of those absolute errors. It can be said simply: "On average, our predictions are off by about X".
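
The arithmetic can be sketched by hand on three made-up houses and compared with scikit-learn's `mean_absolute_error`. The prices below are illustrative values only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for three houses (illustrative values only)
actual = np.array([500000.0, 750000.0, 620000.0])
predicted = np.array([520000.0, 700000.0, 620000.0])

errors = actual - predicted       # [-20000, 50000, 0]
abs_errors = np.abs(errors)       # [ 20000, 50000, 0]
mae_by_hand = abs_errors.mean()   # (20000 + 50000 + 0) / 3

print(mae_by_hand)
print(mean_absolute_error(actual, predicted))
```

Taking the absolute value first matters: without it, the -20000 and +50000 errors would partly cancel and the average would understate how far off the predictions are.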






In [None]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

# The Problem with "In-Sample" Scores
- The measure computed above can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it.

- The model might pick up patterns in our training data that don't hold in the real world, leaving the model inaccurate when tested on new data.

- Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model.

- The most straightforward way to do this is to exclude some data from the model-building process, and then use it to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.

- The scikit-learn library has a function, `train_test_split`, to break the data into two pieces.
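
A minimal sketch of what the split produces, on 100 hypothetical rows (the `X_demo` and `y_demo` arrays are placeholders, not the housing data). When no `test_size` is given, scikit-learn holds out 25% of the rows by default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical rows with 2 features each; y_demo doubles as a row id
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

# Default split: 75% of rows for training, 25% for validation
print(X_tr.shape, X_val.shape)

# No row appears in both pieces
overlap = set(y_tr) & set(y_val)
print("Rows shared between the two pieces:", len(overlap))
```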

In [None]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

melbourne_model = DecisionTreeRegressor()

melbourne_model.fit(train_X, train_y)

val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))




272765.02991177107


# Underfitting and Overfitting

- **Overfitting** -> when a model matches the training data almost perfectly but does poorly on validation and other new data. e.g. a tree with 10 levels of splits can have up to 2^10 = 1024 leaf nodes. On new data it can make unreliable predictions, because each leaf's prediction is based on only a few houses.


- **Underfitting** -> when a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data. e.g. a tree with only a few splits may not divide the data into very distinct groups.

- Since we care about accuracy on new data, which we estimate from validation data, we want to find the sweet spot between underfitting and overfitting.
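
Both failure modes can be sketched on synthetic noisy data. Everything below (the data, the names `deep_tree` and `shallow_tree`, the noise level) is an assumption chosen for illustration, not part of the Melbourne analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption; not the Melbourne dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

# No size limit: the tree keeps splitting until it memorizes the training rows
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Only two leaves: the tree can make just one split
shallow_tree = DecisionTreeRegressor(max_leaf_nodes=2, random_state=0).fit(X_tr, y_tr)

deep_train_mae = mean_absolute_error(y_tr, deep_tree.predict(X_tr))
deep_val_mae = mean_absolute_error(y_val, deep_tree.predict(X_val))
shallow_train_mae = mean_absolute_error(y_tr, shallow_tree.predict(X_tr))

print("Deep tree    - training MAE:  ", deep_train_mae)
print("Deep tree    - validation MAE:", deep_val_mae)
print("Shallow tree - training MAE:  ", shallow_train_mae)
```

The unrestricted tree scores almost perfectly on the rows it was trained on but noticeably worse on held-out rows (overfitting), while the two-leaf tree is already inaccurate on its own training data (underfitting).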

EXAMPLE
- There are several alternatives for controlling tree depth, and many allow some routes through the tree to have greater depth than others.

- `max_leaf_nodes` provides a very sensible way to control overfitting vs. underfitting. The more leaves we allow the model to make, the more we move from underfitting towards overfitting.

- We can use a utility function to compute the MAE for a given value of `max_leaf_nodes`.

- We can then use a for-loop to compare the accuracy of models built with different values of `max_leaf_nodes`.

- Of the options listed, 500 is the optimal number of leaves, since it gives the lowest validation MAE.





In [None]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))


Max leaf nodes: 5 		 Mean Absolute Error: 385696
Max leaf nodes: 50 		 Mean Absolute Error: 279794
Max leaf nodes: 500 		 Mean Absolute Error: 261718
Max leaf nodes: 5000 		 Mean Absolute Error: 271320
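
Rather than reading the best value off the printed list, the choice can be made programmatically. The sketch below runs on synthetic stand-in data, and the helper name `validation_mae` is hypothetical, so the winning value it prints need not match the 500 found above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the Melbourne dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

def validation_mae(max_leaf_nodes):
    """Hypothetical helper: fit a size-limited tree, return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_val, model.predict(X_val))

candidates = [5, 50, 500, 5000]
# Map each candidate size to its validation MAE
scores = {n: validation_mae(n) for n in candidates}
# The best size is the key with the smallest MAE
best_size = min(scores, key=scores.get)

print(scores)
print("Best max_leaf_nodes:", best_size)
```

Once the best size is known, a common next step is to refit the model with that setting on all of the data, since the held-out rows are no longer needed for comparison.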


# Random Forests

- Even sophisticated modelling techniques face this tension between overfitting and underfitting, but many models use clever ideas that lead to better performance. We look at the **random forest** as an example:

- The random forest uses many trees and makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters.

- If you keep modelling, you can learn about more models with even better performance, but many of those are sensitive to getting the right parameters.
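
The averaging can be checked directly. This sketch uses synthetic data and a small forest of 20 trees; the data and the choice of `n_estimators=20` are assumptions for illustration. It compares the forest's prediction with the mean of its individual trees, which scikit-learn exposes through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption; not the Melbourne dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=1.0, size=300)
X_new = rng.uniform(0, 10, size=(5, 3))

forest = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Predictions of each individual tree, shape (n_trees, n_rows)
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest.predict(X_new))
```

Individual trees can disagree considerably on a given row; averaging them smooths out the errors of any single tree, which is a large part of why the forest tends to generalize better.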




In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

207190.6873773146
