# Machine learning

Python is the _de facto_ standard for machine learning, and many common machine learning algorithms are available through the `scikit-learn` package. While a full introduction to machine learning is beyond the scope of this course, in this notebook we will estimate a boosting model using the same data we used in the parametric statistics notebook.

## Import libraries

As always, we need to import `pandas`. Scikit-learn is a huge package, so it is typical to only import the parts you need. Here, we will import `GradientBoostingRegressor` for doing regression, and the `cross_val_score` function for performing cross-validation.

In [None]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

### Read data 

In [None]:
data = pd.read_excel("../data/recent_sfh_sales.xlsx")

### Exercise

Create the water, gas, and sewer dummies we used before, fill in the city field for properties in unincorporated areas, and create a `reg_data` data frame containing all properties that sold for more than $15,000, with the following columns:

- Total_sale_Price
- Year_Built
- Deeded_Acreage
- HEATED_AREA
- CITY
- water
- gas
- sewer

Create dummy variables for cities. There is no need to drop a base category.

## Building the boosting model

Building the model in `scikit-learn` is similar to `statsmodels`: we create a model object and then "fit" it. The main difference is that the data are passed to the `fit` call rather than supplied when the model is created.

Machine learning requires extra care to make sure that the model doesn't just memorize your specific dataset without finding any generalizable patterns. One common approach is "cross-validation", where the data are split into multiple parts, the model is repeatedly fit on all parts except one, and its predictions are evaluated on the left-out part. This provides an estimate of how well the model will perform on new data. `sklearn` provides built-in cross-validation functionality.
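
To make the procedure described above concrete, here is a minimal sketch of a five-part cross-validation written out as an explicit loop. It uses synthetic data from `make_regression` as a stand-in (an assumption; it is not the home sales data), and the names `kfold`, `fold_model`, and `manual_scores` are only illustrative. Shuffling the rows before splitting is a choice made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the home sales data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Split the rows into 5 parts; each part is held out exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    # Fit a fresh model on the four parts that are not held out
    fold_model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    fold_model.fit(X[train_idx], y[train_idx])
    # Evaluate (R-squared) on the part the model has not seen
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores)
print(manual_scores.mean())
```

In practice you would rarely write this loop yourself, but seeing it spelled out shows that each score comes from data the model was not fit on.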

When constructing the model, there are many options you can use to control how the algorithm functions; these will vary depending on which machine learning algorithm you're using.

In [None]:
mod = GradientBoostingRegressor(n_estimators=50)
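
As an illustration of the kinds of options available, the sketch below sets a few other `GradientBoostingRegressor` parameters. The specific values are arbitrary choices for demonstration, not recommendations, and `tuned_mod` is a hypothetical name used so that `mod` above is left unchanged.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative values only; good settings depend on the data
tuned_mod = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting stages (trees)
    learning_rate=0.05,  # how much each tree contributes to the prediction
    max_depth=3,         # maximum depth of each individual tree
    random_state=42,     # fixes the random seed so results are repeatable
)

# get_params() shows every option, including those left at their defaults
print(tuned_mod.get_params())
```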

### Cross-validating the model

The `cross_val_score` function returns an array of "scores" (measures of model performance) from fitting the model multiple times on different subsets of the data.

Note that unlike `statsmodels`, `sklearn` convention is for the independent variables to come first in the function call, before the dependent variable.

By default, `cross_val_score` uses whatever the `.score` method of the `sklearn` model you're using returns. For a gradient-boosted regressor, this is the R-squared.

In [None]:
scores = cross_val_score(mod, reg_data.drop(columns="Total_sale_Price"), reg_data.Total_sale_Price, cv=5)
scores
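
If you want a metric other than the model's default, `cross_val_score` accepts a `scoring` argument. The sketch below requests the mean absolute error on synthetic stand-in data (an assumption; not the home sales data), with `demo_mod` as an illustrative name. Note that `sklearn` reports error metrics as negative numbers so that higher is always better, so we flip the sign before reading the result.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the home sales data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_mod = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Default scoring: the model's own .score method (R-squared for regressors)
r2_scores = cross_val_score(demo_mod, X, y, cv=5)

# Alternative scoring: negated mean absolute error
mae_scores = cross_val_score(demo_mod, X, y, cv=5, scoring="neg_mean_absolute_error")

print(r2_scores.mean())
print(-mae_scores.mean())  # flip the sign to get the error in the units of y
```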

It is common to take the mean of the cross-validation scores as a measure of predictive power.

In [None]:
scores.mean()

The mean cross-validated R-squared is about 0.57, marginally better than the linear regression estimated previously. Once a model is cross-validated, it is common to move on to fitting it on the entire dataset.

Here we fit the model on the full dataset and call its `score` method on that same data to see how well it performs in-sample. This isn't a metric you should use for evaluating the quality of the model, but if it's much higher than the cross-validated scores, it probably indicates overfitting.

In [None]:
mod = GradientBoostingRegressor(n_estimators=50)
mod.fit(reg_data.drop(columns="Total_sale_Price"), reg_data.Total_sale_Price)
mod.score(reg_data.drop(columns="Total_sale_Price"), reg_data.Total_sale_Price)
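
The comparison described above can be made explicit by computing both numbers side by side. This is a minimal sketch on deliberately noisy synthetic data (an assumption; not the home sales data), and `gap_mod` is an illustrative name. How large a gap counts as "much higher" is a judgment call.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic, deliberately noisy stand-in data (an assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=50.0, random_state=0)

gap_mod = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Out-of-sample estimate: mean R-squared across the held-out parts
cv_r2 = cross_val_score(gap_mod, X, y, cv=5).mean()

# In-sample score: fit on all rows, then score on those same rows
in_sample_r2 = gap_mod.fit(X, y).score(X, y)

print(f"cross-validated R-squared: {cv_r2:.2f}")
print(f"in-sample R-squared:       {in_sample_r2:.2f}")
print(f"gap:                       {in_sample_r2 - cv_r2:.2f}")
```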

### Prediction

Machine learning is generally used for prediction; there aren't easily interpreted coefficients like there are in regression. For instance, a real estate developer might use a model like this one to forecast the sale price of homes in two potential developments. Here, we create a new data frame that has the same predictor columns as `reg_data` and two rows: one for a 2,000 square foot urban home in Raleigh with all utilities, and one for a larger 2,700 square foot home in Zebulon with only electric service. We create the data frame manually here, but for a larger dataset you might want to create the data in Excel and read it with `pd.read_excel`.

In [None]:
pred_data = pd.DataFrame([
    {"Year_Built": 2022, "Deeded_Acreage": 0.2, "HEATED_AREA": 2000, "CITY": "RAL", "water": True,
         "sewer": True, "gas": True},
    {"Year_Built": 2022, "Deeded_Acreage": 3.0, "HEATED_AREA": 2700, "CITY": "ZEB", "water": False,
         "sewer": False, "gas": False}
])

# because we don't have every city in our prediction dataset, we convert CITY to a categorical variable and
# set the categories to be all the cities, so that we get dummies for every city.
pred_data["CITY"] = pred_data.CITY.astype("category").cat.set_categories(data.CITY.unique())

# scikit-learn expects columns to be in the same order when predicting as when fitting.
# selecting using the original column names gets them in this order
pred_data = pd.get_dummies(pred_data, columns=["CITY"])[reg_data.drop(columns="Total_sale_Price").columns]

In [None]:
mod.predict(pred_data)
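
An alternative way to line up prediction columns with the training columns is `DataFrame.reindex`, which adds any missing dummy columns and puts everything in the requested order in one step. The sketch below uses a small made-up example: `train_columns`, `new_homes`, and the city code `APE` are hypothetical and stand in for the real column list from `reg_data`.

```python
import pandas as pd

# Hypothetical list of predictor columns the model was fit on
train_columns = ["HEATED_AREA", "CITY_APE", "CITY_RAL", "CITY_ZEB"]

# New rows to predict; note that no home is in the hypothetical city APE
new_homes = pd.DataFrame({"HEATED_AREA": [2000, 2700], "CITY": ["RAL", "ZEB"]})

# get_dummies only creates columns for cities present in new_homes;
# reindex adds the missing ones (filled with False) and fixes the order
aligned = pd.get_dummies(new_homes, columns=["CITY"]).reindex(
    columns=train_columns, fill_value=False
)
print(aligned)
```

One caveat with this approach: `reindex` also silently drops any column that is not in the list, so a misspelled city code would disappear rather than raise an error.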