# Classification Models

OnlineStats can fit a variety of classification models, including:

- Naive Bayes Classifiers (`NBClassifier`)
- Logistic Regression and SVMs (`StatLearn`)
- Decision Trees (`FastTree` and `NBTree`)
- Random Forests (`FastForest`)

In [1]:
using OnlineStats, JuliaDB

In [2]:
t = loadtable("diamonds.csv"; indexcols = [:carat])

Table with 53940 rows, 10 columns:
carat  cut          color  clarity  depth  table  price  x      y      z
───────────────────────────────────────────────────────────────────────────
0.2    "Premium"    "E"    "SI2"    60.2   62.0   345    3.79   3.75   2.27
0.2    "Premium"    "E"    "VS2"    59.8   62.0   367    3.79   3.77   2.26
0.2    "Premium"    "E"    "VS2"    59.0   60.0   367    3.81   3.78   2.24
0.2    "Premium"    "E"    "VS2"    61.1   59.0   367    3.81   3.78   2.32
0.2    "Premium"    "E"    "VS2"    59.7   62.0   367    3.84   3.8    2.28
0.2    "Ideal"      "E"    "VS2"    59.7   55.0   367    3.86   3.84   2.3
0.2    "Premium"    "F"    "VS2"    62.6   59.0   367    3.73   3.71   2.33
0.2    "Ideal"      "D"    "VS2"    61.5   57.0   367    3.81   3.77   2.33
0.2    "Very Good"  "E"    "VS2"    63.4   59.0   367    3.74   3.71   2.36
0.2    "Ideal"      "E"    "VS2"    62.2   57.0   367    3.76   3.73   2.33
⋮

## `NBClassifier`

A naive Bayes classifier uses conditional distributions to estimate the probability of each class, given the values of the predictor variables.  The "naive" part is the assumption that the predictor variables are independent of one another within each class.  `NBClassifier` is cheap to run and serves as a good baseline model.
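
In symbols, the idea can be sketched as follows. This is the standard textbook formulation, not something taken from the OnlineStats source; the notation is introduced here for illustration: $k$ is a class, $x_1, \dots, x_p$ are the predictor values, and $f_j(\cdot \mid y = k)$ is the estimated distribution of predictor $j$ within class $k$.

```latex
P(y = k \mid x_1, \dots, x_p) \;\propto\; P(y = k) \prod_{j=1}^{p} f_j(x_j \mid y = k),
\qquad
\hat{y} = \arg\max_{k} \; P(y = k) \prod_{j=1}^{p} f_j(x_j \mid y = k)
```

The product over $j$ is where the independence assumption enters: each predictor contributes its own factor, so only one-dimensional distributions ever need to be estimated.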

Since **OnlineStats** updates the model one observation at a time, it must keep summaries of the data in order to estimate the PDF (probability density function) and CDF (cumulative distribution function) of each predictor.

- For continuous predictors, the summary can be either
  1. `Hist(nbins)`: approximate the distribution with a histogram of `nbins` bins
  2. `FitNormal()`: assume the data are normally distributed
- For categorical predictors, the summary should be `CountMap(data_type)`
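
As a rough sketch, each kind of summary can turn into an estimate of a predictor's distribution within one class as shown below. These are textbook forms, and it is an assumption that OnlineStats computes them this way internally. The symbols are introduced here for illustration: $n$ is the number of observations seen in the class, $\hat{\mu}$ and $\hat{\sigma}^2$ are the running mean and variance, $n_c$ is the count of category $c$, and $n_b$ and $w_b$ are the count and width of the histogram bin containing $x$.

```latex
\text{FitNormal:}\quad \hat{f}(x) = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}}
    \exp\!\left(-\frac{(x - \hat{\mu})^2}{2\hat{\sigma}^2}\right)
\qquad
\text{CountMap:}\quad \hat{P}(x = c) = \frac{n_c}{n}
\qquad
\text{Hist:}\quad \hat{f}(x) \approx \frac{n_b}{n\, w_b}
```

All of these quantities can be updated one observation at a time. The histogram form assumes simple fixed bins; an adaptive-bin histogram differs in detail.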

For the Diamonds data, we'll use all three summarizers: `CountMap(String)` for the `String` columns, and either `Hist(20)` or `FitNormal()` for the numeric columns.

In [11]:
g = Group(
    Hist(20),          # :carat
    CountMap(String),  # :color
    CountMap(String),  # :clarity
    Hist(20),          # :depth
    Hist(20),          # :table
    Hist(20),          # :price
    FitNormal(),       # :x
    FitNormal(),       # :y
    FitNormal()        # :z
)

# Fit in a single pass: predictors are all columns except :cut, and :cut is the class label
@time s = reduce(NBClassifier(String, g), t, select = (Not(:cut), :cut))

  0.062384 seconds (1.58 M allocations: 28.346 MiB, 6.60% gc time)

▦ Series{(1, 0)}
│ EqualWeight | nobs=53940
└── NBClassifier{String,Group{Tuple{Hist{0,AdaptiveBins{Float64}},CountMap{String},CountMap{String},Hist{0,AdaptiveBins{Float64}},Hist{0,AdaptiveBins{Float64}},Hist{0,AdaptiveBins{Float64}},FitNormal,FitNormal,FitNormal}}}
    > Fair (0.0298)
    > Good (0.091)
    > Ideal (0.3995)
    > Premium (0.2557)
    > Very Good (0.224)

In [7]:
yhat = map(r -> classify(s.stats[1], r), t; select = Not(:cut))

mean(yhat .== select(t, :cut))

0.5826103077493512