# Learning Curve

Every classifier applied to a dataset makes a fundamental trade-off between `bias`, the systematic error of the model (`underfit`), and `variance`, the error that comes from fitting the training sample too closely (`overfit`). Per the bias-variance tradeoff, these two sources of error are related: reducing one tends to increase the other, and together with the irreducible noise in the data they account for the model's expected error. Hence, at a basic level, the task of a machine learner is to pick a model that keeps both as low as possible.
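For squared-error loss, the standard decomposition makes this relationship concrete. The notation is introduced here for illustration and is not from the original notebook: assume observations are generated as $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and let $\hat{f}$ be the model fitted on a random training sample.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Only the first two terms depend on the choice of model; the noise term is a floor that no model can go below.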

A **learning curve** is one of the most direct tools for seeing where a model sits on the bias-variance tradeoff. The x-axis on a learning curve is the number of observations provided to the model (i.e. the size of the training set). The y-axis is the amount of error in the model, according to a metric of your choosing. You then plot two curves on this graph: one for training scores, and one for cross-validation scores.
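As a minimal sketch of how the two curves can be computed, the snippet below uses scikit-learn's `learning_curve` helper. The synthetic dataset, the `Ridge` model, and the mean squared error metric are assumptions chosen for illustration; they are not taken from the Zillow analysis.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic regression data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=400)

# Fit the model on 5 increasing training-set sizes, with 5-fold
# cross-validation at each size.
train_sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_mean_squared_error",
)

# scikit-learn reports negated MSE (higher is better); flip the sign
# and average over folds to get one error value per training size.
train_error = -train_scores.mean(axis=1)
val_error = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_error, val_error):
    print(f"n={n:4d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```

Plotting `train_error` and `val_error` against `train_sizes` gives the learning curve itself: training-set size on the x-axis, error on the y-axis.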

Learning curves are useful because the way the model's error changes as it sees more and more samples is a visual marker of how much bias and/or variance is inherent in the model:

<img src="images/learningcurve1.png" alt="Learning curves for an underfitted, a well-fitted, and an overfitted model" style="width: 800px;"/>

From: [Learning curves with Zillow Economics Data](https://www.kaggle.com/residentmario/learning-curves-with-zillow-economics-data/)

In the first case, the model is systematically bad: it performs poorly on the metric on both the training and the validation split. This is an indication of an underfitted model, i.e. one that is not capturing the underlying pattern in the data. Note that it is up to you to determine what a "bad" metric score is!

The second case is the best-case scenario. The model performs adequately according to the metric, and adding more samples pushes the validation error asymptotically towards the training error (for at least part of the curve).

The third case is one of high variance: the model fits the training set very well, but fits the validation set(s) poorly. This means that the model is overfitted, and needs regularization or hyperparameter tuning to find a better fit.
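The three cases can be summarized as a rough rule of thumb applied to the right-hand end of the curve. The sketch below is only illustrative: the function name `diagnose` and both thresholds are hypothetical, and, as noted above, what counts as a "bad" score is a judgment call for your own metric and problem.

```python
def diagnose(train_error, val_error, max_acceptable_error, max_gap):
    """Classify the end of a learning curve into one of the three cases.

    Both thresholds are caller-supplied assumptions, not standard cutoffs.
    """
    gap = val_error - train_error
    if train_error > max_acceptable_error:
        # The model is poor even on the data it was fitted on.
        return "high bias (underfit)"
    if gap > max_gap:
        # Good on the training data, much worse on held-out data.
        return "high variance (overfit)"
    return "reasonable fit"


# Illustrative error values, one per case.
case_1 = diagnose(0.40, 0.42, max_acceptable_error=0.20, max_gap=0.05)
case_2 = diagnose(0.10, 0.12, max_acceptable_error=0.20, max_gap=0.05)
case_3 = diagnose(0.02, 0.30, max_acceptable_error=0.20, max_gap=0.05)

print(case_1)
print(case_2)
print(case_3)
```

Checking the training error first reflects the description above: an underfitted model scores badly on every split, so the gap between the curves is only informative once the training error is acceptable.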

In [2]:
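import pandas as pd
import numpy as np

# Load the metro-level time series, parsing the Date column as datetimes.
ts = pd.read_csv("data/Metro_time_series.csv", parse_dates=['Date'])
# Keep only the rows where ZriPerSqft_AllHomes is present.
ts = ts[ts.ZriPerSqft_AllHomes.notnull()]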

In [3]:
ts.head()

| Unnamed: 0 | Date | RegionName | AgeOfInventory | DaysOnZillow_AllHomes | HomesSoldAsForeclosuresRatio_AllHomes | InventorySeasonallyAdjusted_AllHomes | InventoryRaw_AllHomes | InventorySeasonallyAdjusted_BottomTier | InventorySeasonallyAdjusted_MiddleTier | InventorySeasonallyAdjusted_TopTier | ... | ZHVI_BottomTier | ZHVI_CondoCoop | ZHVI_MiddleTier | ZHVI_SingleFamilyResidence | ZHVI_TopTier | ZRI_AllHomes | ZRI_AllHomesPlusMultifamily | ZriPerSqft_AllHomes | Zri_MultiFamilyResidenceRental | Zri_SingleFamilyResidenceRental |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 114747 | 2010-11-30 | 10140 | | | | 735.0 | 732.0 | | | | ... | | | 128300.0 | 132400.0 | 201900.0 | 1015.0 | 1013.0 | 0.736 | 896.0 | 1045.0 |
| 114748 | 2010-11-30 | 10180 | | 103.875 | | 1060.0 | 1062.0 | | | | ... | 44400.0 | | | | | 989.0 | 989.0 | 0.632 | | 992.0 |
| 114749 | 2010-11-30 | 10220 | | | | 305.0 | 304.0 | | | | ... | | | 76000.0 | 76000.0 | 147600.0 | 716.0 | 727.0 | 0.558 | 649.0 | 729.0 |
| 114750 | 2010-11-30 | 10300 | | | | 958.0 | 943.0 | 260.0 | 267.0 | 434.0 | ... | 55200.0 | 88200.0 | 92600.0 | 92600.0 | 152500.0 | 1106.0 | 1101.0 | 0.782 | 1109.0 | 1101.0 |
| 114751 | 2010-11-30 | 10420 | | 138.25 | 10.7539 | 4930.0 | 4940.0 | 1516.0 | 1523.0 | 1886.0 | ... | 64700.0 | 118400.0 | 119500.0 | 119600.0 | 206600.0 | 1174.0 | 1170.0 | 0.724 | 1099.0 | 1173.0 |


## Resources

- [Learning curves with Zillow Economics Data](https://www.kaggle.com/residentmario/learning-curves-with-zillow-economics-data/)