### Introduction

* Bespoke reconstruction algorithms & simulations

* Branching out to ML.

* Classification as versatile problem.

* Dataset of [Fix My Street](https://www.fixmystreet.com/) reports.

* Algorithm: **Naive Bayes**

### Naive Bayes
Why Naive Bayes?

* It works well without a large amount of training data

* Computationally efficient & scaleable

* Can be easily updated with new data

* Tends to work well despite the independence assumption

* Model is interpretable

### Bayes Theorem
Naive Bayes relies on Bayes' Theorem:

$$p(y|x) = \frac{p(y)p(x|y)}{p(x)}.$$

### Bayes Theorem - Applied to Classification

In classification problems we are given data $x$ and we want to assign it to a class $c_{k} \in C$. To do this we can calculate and find the class that maximises

$$p(c_{k}|x) = \frac{p(c_{k})p(x|c_{k})}{p(x)},$$

where $x = (x_{1},\dots,x_{n})$ are our features - in this case words in a Fix My Street (FMS) report, 

Reference: _Speech and Language Processing.  Daniel Jurafsky & James H. Martin, Pearson Prentice Hall, 2009._

### Naive Bayes for Classifiying FMS Reports

**<span style="color:black">1. Obtain a set of labelled documents.</span>**

<span style="color:gray">2. Preprocess the data.</span>

<span style="color:gray">3. Generate the feature matrix $X$, either binary term occurence, document term matrix, or tf-idf.</span>

<span style="color:gray">4.  Train model: calculate the values of the conditional probability $p(x_{i}|c_{k})$ & the prior probability $p(c_{k})$.</span>

<span style="color:gray">5. Classify new documents using $\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})):$</span>
<span style="color:gray">
\begin{align}
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik})+ \sum_{i} (1-x_{i})\log(1-\theta_{ik}).\quad \text{(Bernouilli)} \\
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik}).\quad \text{(Multinomial)}
\end{align}</span>
<span style="color:gray">6. Evaluate model performance.</span>

In [1]:
using CSV, DataFrames, Queryverse,TextAnalysis,Printf,SparseArrays
csv_name = "FMS.csv";
# Load in the data
df = CSV.read(csv_name);
# Print a frequency table of the classes
println(df |>
    @groupby(_.category_coded) |>
    @map({Key=key(_), Count=length(_)}) |> @orderby_descending(_.Count)|>
    DataFrame)
# Get the unique labels
Classes = unique(df[!,:category_coded]);
# Convert the report description to a String Document to build the Document Term Matrices.
df = df |> @mutate(description = StringDocument(_.description)) |> DataFrame;


5×2 DataFrame
│ Row │ Key                  │ Count │
│     │ [90mString[39m               │ [90mInt64[39m │
├─────┼──────────────────────┼───────┤
│ 1   │ Car Parking          │ 679   │
│ 2   │ Potholes             │ 606   │
│ 3   │ Pavements/footpaths  │ 500   │
│ 4   │ Flytipping           │ 425   │
│ 5   │ Parks & Green Spaces │ 236   │


### Naive Bayes for Classifiying FMS Reports

<span style="color:black">1. Obtain a set of labelled documents.</span>

**<span style="color:black">2. Preprocess the data.</span>**

<span style="color:gray">3. Generate the feature matrix $X$, either binary term occurence, document term matrix, or tf-idf.</span>

<span style="color:gray">4.  Train model: calculate the values of the conditional probability $p(x_{i}|c_{k})$ & the prior probability $p(c_{k})$.</span>

<span style="color:gray">5. Classify new documents using $\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})):$</span>
<span style="color:gray">
\begin{align}
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik})+ \sum_{i} (1-x_{i})\log(1-\theta_{ik}).\quad \text{(Bernouilli)} \\
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik}).\quad \text{(Multinomial)}
\end{align}</span>
<span style="color:gray">6. Evaluate model performance.</span>

In [2]:
# Grab the report descriptions and generate a corpus
desc = deepcopy(df[!,:description]);
crps = Corpus(desc);
# Remove all of the words that we're not interested in.
remove_corrupt_utf8!(crps);
remove_case!(crps);
remove_words!(crps,["amp","quot"]);
prepare!(crps,strip_articles | strip_numbers | strip_non_letters | strip_stopwords | strip_pronouns | strip_frequent_terms | strip_definite_articles);
# Generate the lexicon
update_lexicon!(crps);

### Naive Bayes for Classifiying FMS Reports

<span style="color:black">1. Obtain a set of labelled documents.</span>

<span style="color:black">2. Preprocess the data.</span>

**<span style="color:black">3. Generate the feature matrix $X$, either binary term occurence, document term matrix.</span>**

<span style="color:gray">4. Train model: calculate the values of the conditional probability $p(x_{i}|c_{k})$ & the prior probability $p(c_{k})$.</span>

<span style="color:gray">5. Classify new documents using $\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})):$</span>
<span style="color:gray">
\begin{align}
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik})+ \sum_{i} (1-x_{i})\log(1-\theta_{ik}).\quad \text{(Bernouilli)} \\
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik}).\quad \text{(Multinomial)}
\end{align}</span>
<span style="color:gray">6. Evaluate model performance.</span>

In [3]:
# Now we can grab the Document Term Matrix
X = DocumentTermMatrix(crps);
# Get the matrix only - this will be used for Multinomial Naive Bayes
X_mat = dtm(X);
# Get the binary occurence matrix - this will be used for Bernouilli Naive Bayes
X_bin = spzeros(size(X_mat)[1],size(X_mat)[2]);
X_bin[X_mat .>0] .= 1;
# Generate the labels
y = zeros(Int64,length(df[!,:category_coded]));
[y[df[:,:category_coded] .== Classes[i]] .= (i-1) for i in range(2,stop=length(Classes))];
using MLJ
# Split into training & test sets
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234);
y_train = y[train];
y_test = y[test];
# Sets for Bernouilli NB
X_train_bnb = X_bin[train,:];
X_test_bnb = X_bin[test,:];
# Sets for Multinomial NB
X_train_mnb = X_mat[train,:];
X_test_mnb = X_mat[test,:];

### Naive Bayes for Classifiying FMS Reports

<span style="color:black">1. Obtain a set of labelled documents.</span>

<span style="color:black">2. Preprocess the data.</span>

<span style="color:black">3. Generate the feature matrix $X$, either binary term occurence, document term matrix.</span>

**<span style="color:black">4.  Train model: calculate the values of the conditional probability $p(x_{i}|c_{k})$ & the prior probability $p(c_{k})$.</span>**

<span style="color:gray">5. Classify new documents using $\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})):$</span>
<span style="color:gray">
\begin{align}
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik})+ \sum_{i} (1-x_{i})\log(1-\theta_{ik}).\quad \text{(Bernouilli)} \\
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik}).\quad \text{(Multinomial)}
\end{align}</span>
<span style="color:gray">6. Evaluate model performance.</span>

### Classification Method

The numerator is equivalent to the joint probability distribution, written $p(x_{1},\dots,x_{n},c_{k})$ which can be expanded using the chain rule of probabilities
\begin{align}
p(x_{1},\dots,x_{n},c_{k}) &= p(x_{1},\dots,x_{n}|c_{k})p(x_{2},\dots,x_{n},c_{k}), \\
 &= p(x_{1},\dots,x_{n}|c_{k})p(x_{2},\dots,x_{n}|c_{k})\dots p(x_{n}|c_{k})p(c_{k}),
\end{align}
If we assume **all of the features are independent**, then we have
$$
p(x_{1},\dots,x_{n},c_{k}) = p(c_{k})\prod_{i} p(x_{i}|c_{k}),
$$
$$p(c_{k}|x) = \frac{p(c_{k})\prod_{i} p(x_{i}|c_{k})}{p(x)},$$
and our classification rule is
$$
\hat{y} = \text{argmax}_{k} p(c_{k})\prod_{i} p(x_{i}|c_{k}).
$$
To avoid numerical problems when $p(x_{i}|c_{k})$ is very small, so take logs
$$
\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})).
$$

### Event Models - The Distribution of $p(x_{i}|c_{k})$

* The prior $p(c_{k})$ can be calculated assuming either equiprobable classes or estimated using the training set. 
* We have to assume a distribution for the features $p(x_{i}|c_{k})$ :
    * Continous features - Gaussian Naive Bayes $p(x_{i}|c_{k}) = \frac{1}{\sqrt{2\pi \sigma_{k}^{2}}}\exp{\left(-\frac{\left(x_{i} - \mu_{k}\right)^{2}}{2 \sigma_{k}^{2}}\right)}$. $\sigma_{k},\mu_{k}$ are the mean & sd of $x_{i}$ in class $c_{k}$.
    * Discrete features - Bernouilli Naive Bayes $p(x|c_{k}) = \prod_{i}\theta_{ik}^{x_{i}}\left(1 - \theta_{ik}\right)^{\left(1 - x_{i}\right)}$. $\theta_{ik}$ is the probability of class $c_{k}$ generating $x_{i}$. This assumes a binary model for the features.
    * Discrete features - Multinomial Naive Bayes $p(x|c_{k}) \propto \prod_{i} \theta_{ik}^{x_{i}}$. The distribution is parameterised by multinomials $\theta_{k} = (\theta_{1k},\dots,\theta_{nk})$, where $\theta_{ik}$ is the probability of feature $i$ occuring in a sample belonging to class $c_{k}$.
* Bernouilli & Multinomial Naive Bayes are popular for document classification. The feature matrix consists of word occurence (Bernouilli), word frequencies (Multinomial), or term frequency-inverse term document frequency (tf-idf).

* Use smoothing to ensure there are no probability values $\theta_{ik} = 1$ or $0$.


In [4]:
function train_MultinomialNB(X,y,alpha=1)
    """
        train_MultinomialNB(X=Array[n_features,n_samples],y=Array[n_samples,1],alpha=1);
    A basic function to train a Multinomial Naive Bayes classifier. 
    
    It takes as input:
    X - a training set with feature matrix of size [n_feautres,n_samples],
    y - labels of length n_samples,
    alpha - smoothing factor, default = 1.
    
    It outputs:
    log_prior - the log prior probability of each class, array size [n_classes,]
    log_prob_cond - the conditional probability of each word given each class, array size [n_classes,n_features].
    
    The distribution is parameterised by vectors (theta_k1,..., theta_kn) for each class y_k, where n is the number of features.
    theta_ki is the probability p(x_i|y_k) of feature i appearing in a sample belonging to class y_k.
    
    theta_ki is estimated by a smoothed version of the maximum likelihood estimator:
    theta_ki = (N_ki + alpha)/(N_k + alpha*n)
    
    where N_ki is the number of times feature x_i appears in a sample of class y_k and N_k is the total count of all features in class y_k.
    n is the number of terms in the vocabulary.
    alpha is the smoothing parameter & is set to alpha = 1 as default.
    This means that Laplace smoothing is applied as default.
    
    Classification is performed by maximising the log likelihood, so we'll apply the log transform to the parameter matrix in this function.
    """
    
    # Calculate the number of features and classes
    n_class = Int64(maximum(y)+1);
    n_words = size(X)[2];
    # Calculate n_ik = the number of occurences of word x_i in class y_k
    n_i = zeros(n_class,n_words);
    for k in 0:(n_class-1)
        n_i[k+1,:] = sum(X[y.==k,:],dims = 1);
    end
    # Calculate the number of occurences in each class, used for the prior p(y_k)
    n_k = sum(n_i,dims = 2);
    # Calculate the total number of samples
    n = size(X)[1];
    # Now we can calculate the prior p(y_k) & the log prior 
    prior = (n_k.+alpha)./(alpha.*n_class.+n);
    log_prior = log.(prior);
    # Finally calculate the conditional probability p(i|y_k)
    cond_prob = (n_i.+alpha)./(n_k.+alpha*n_words);
    log_cond_prob = log.(cond_prob);
    return log_prior,log_cond_prob
end

train_MultinomialNB (generic function with 2 methods)

In [5]:
function train_BernouilliNB(X,y,alpha=1,beta=2)
    """
        train_BernouilliNB(X=Array[n_features,n_samples],y=Array[n_samples,1],alpha=1,beta=2);
    A basic function to train a Bernouilli Naive Bayes classifier. 
    
    It takes as input:
    X - a training set with feature matrix of size [n_feautres,n_samples],
    y - labels of length n_samples,
    alpha, beta - Laplace smoothing factors, default = 1 & 2, respectively.
    
    It outputs:
    log_prior - the log prior probability of each class, array size [n_classes,]
    prob_cond - the conditional probability of each word given each class, array size [n_classes,n_features].
    
    The distribution p(x_i|y_k) is calculated as p(i|y_k)^x_i - (1 - p(i|y_k))^{1-x_i} = theta_ik^x_i - (1-theta_ik)^(1-x_i),
    where:
        theta_ik = (n_ik+alpha)/(n_k + beta),
    n_ik is the number of occurences of x_i in class y_k, and n_k is the total number of occurences of x_i in the training set.
    alpha & beta are Laplace smoothing parameters & are set as alpha = 1, beta = 2 as default.
    This means that Laplace smoothing is applied as default.
    
    Classification is performed by maximising the log likelihood, so we'll apply the log transform to the parameter matrix in this function.
    """
    
    max_X = maximum(X);
    if (max_X > 1)
        ArgumentError("The feature matrix, X, must be binary")
    end
    # Calculate the number of features and classes
    n_class = Int64(maximum(y)+1);
    n_words = size(X)[2];
    # Calculate n_ik = the number of occurences of word x_i in class y_k
    n_i = zeros(n_class,n_words);
    for k in 0:(n_class-1)
        n_i[k+1,:] = sum(X[y.==k,:],dims = 1);
    end
    # Calculate the number of occurences in each class, used for the prior p(y_k)
    n_k = [sum(y.==k) for k in 0:(n_class-1)];
    # Calculate the total number of samples
    n = size(X)[1];
    # Now we can calculate the prior p(y_k) & the log prior 
    prior = (alpha.+n_k)./(beta.+n);
    log_prior = log.(prior);
    # Finally calculate the conditional probability p(i|y_k)
    cond_prob = (n_i.+alpha)./(n_k.+beta);
    log_cond_prob = log.(cond_prob);
    return log_prior,cond_prob
end

train_BernouilliNB (generic function with 3 methods)

In [6]:
log_prior_mnb,log_cond_prob_mnb = train_MultinomialNB(X_train_mnb,y_train);
log_prior_bnb,cond_prob_bnb = train_BernouilliNB(X_train_bnb,y_train);

### Naive Bayes for Classifiying FMS Reports

<span style="color:black">1. Obtain a set of labelled documents.</span>

<span style="color:black">2. Preprocess the data.</span>

<span style="color:black">3. Generate the feature matrix $X$, either binary term occurence, document term matrix.</span>

<span style="color:black">4.  Train model: calculate the values of the conditional probability $p(x_{i}|c_{k})$ & the prior probability $p(c_{k})$.</span>

**<span style="color:black">5. Classify new documents using $\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})):$</span>**

\begin{align}
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik})+ \sum_{i} (1-x_{i})\log(1-\theta_{ik}).\quad \text{(Bernouilli)} \\
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik}).\quad \text{(Multinomial)}
\end{align}

<span style="color:gray">6. Evaluate model performance.</span>

In [7]:
function predict_MultinomialNB(X_test,log_prior,log_cond_prob)
    """
    predict_MultinomialNB(X_test=Array[n_features,n_samples],log_prior = Array[n_classes,],log_cond_prob=Array[n_classes,n_features]);
    A basic function to predict a set of reports using a pre-trained Multinomial Naive Bayes Classifier.
    It takes as input:
    X_test - a test set with feature matrix of size [n_feautres,n_samples],
    log_prior - the log prior probability of each class, array size [n_classes,]
    log_prob_cond - the log conditional probability of each word given each class, array size [n_classes,n_features].
    
    The output is the predicted class, an integer. It assumes classes are 0:(n_classes-1)
    
    The output is calculated using argmax_k [log(p(y_k)) + sum(x_i*log(p(i|y_k)))]
    """
    pred = zeros(size(X_test)[1]);
    for i in 1:nrows(X_test)
        score = log_prior[:] .+ sum(X_test[i,:].*transpose(log_cond_prob),dims=1)[:];
        pred[i] = argmax(score[:])-1;
    end
    return pred
end

predict_MultinomialNB (generic function with 1 method)

In [8]:
function predict_BernouilliNB(X_test,log_prior,cond_prob)
    """
    predict_BernouilliNB(X_test=Array[n_features,n_samples],log_prior = Array[n_classes,],cond_prob=Array[n_classes,n_features]);
    A basic function to predict a set of reports using a pre-trained Bernouilli Naive Bayes Classifier.
    It takes as input:
    X_test - a test set with feature matrix of size [n_feautres,n_samples],
    log_prior - the log prior probability of each class, array size [n_classes,]
    prob_cond - the conditional probability of each word given each class, array size [n_classes,n_features].
    
    The output is the predicted class, an integer. It assumes classes are 0:(n_classes-1)
    
    The output is calculated using argmax_k [log(p(y_k)) + sum(x_i*log(p(i|y_k)) + (1-x_i)*log(1-p(i|y-k)))]
    """
    max_X = maximum(X_test);
    if (max_X > 1)
        ArgumentError("The feature matrix, X_test, must be binary")
    end
    pred = zeros(size(X_test)[1]);
    for i in 1:nrows(X_test)
        score = log_prior[:] .+ sum(X_test[i,:].*transpose(log.(cond_prob)),dims=1)[:] .+ sum((1 .- X_test[i,:]).*transpose(log.(1 .- cond_prob)),dims=1)[:];
        pred[i] = argmax(score[:])-1;
    end
    return pred
end

predict_BernouilliNB (generic function with 1 method)

In [9]:
pred_mnb = predict_MultinomialNB(X_test_mnb,log_prior_mnb,log_cond_prob_mnb);
pred_bnb = predict_BernouilliNB(X_test_bnb,log_prior_bnb,cond_prob_bnb);

### Naive Bayes for Classifiying FMS Reports

<span style="color:black">1. Obtain a set of labelled documents.</span>

<span style="color:black">2. Preprocess the data.</span>

<span style="color:black">3. Generate the feature matrix $X$, either binary term occurence, document term matrix.</span>

<span style="color:black">4.  Train model: calculate the values of the conditional probability $p(x_{i}|c_{k})$ & the prior probability $p(c_{k})$.</span>

<span style="color:black">5. Classify new documents using $\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})):$</span>

\begin{align}
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik})+ \sum_{i} (1-x_{i})\log(1-\theta_{ik}).\quad \text{(Bernouilli)} \\
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik}).\quad \text{(Multinomial)}
\end{align}

**<span style="color:black">6. Evaluate model performance.</span>**

In [10]:
# Calculate Accuracy & Weighted Accuracy
n_class = length(Classes);
accuracy_mnb = mean(y_test .== pred_mnb);
w_accuracy_mnb = sum([(1 ./n_class).*mean(y_test[y_test.== i] .== pred_mnb[y_test.== i]) for i in 0:(n_class-1)]);
accuracy_bnb = mean(y_test .== pred_bnb);
w_accuracy_bnb = sum([(1 ./n_class).*mean(y_test[y_test.== i] .== pred_bnb[y_test.== i]) for i in 0:(n_class-1)]);
println(string("Bernouilli NB Accuracy = ",@sprintf("%.1f",accuracy_bnb*100), "%"));
println(string("Bernouilli NB Weighted Accuracy = ",@sprintf("%.1f",w_accuracy_bnb*100), "%"));
println(string("Multinomial NB Accuracy = ",@sprintf("%.1f",accuracy_mnb*100), "%"))
println(string("Multinomial NB Weighted Accuracy = ",@sprintf("%.1f",w_accuracy_mnb*100), "%"));

Bernouilli NB Accuracy = 78.9%
Bernouilli NB Weighted Accuracy = 68.6%
Multinomial NB Accuracy = 86.9%
Multinomial NB Weighted Accuracy = 81.9%


In [125]:
cf_mat_mnb_norm = [sum((y_test .==(i-1)) .& (pred_mnb .==(j-1)))./sum(y_test .==(i-1)) for i in 1:n_class,  j in 1:n_class];
cf_mat_mnb = [sum((y_test .==(i-1)) .& (pred_mnb .==(j-1))) for i in 1:n_class,  j in 1:n_class];
cf_mat_bnb_norm = [sum((y_test .==(i-1)) .& (pred_bnb .==(j-1)))./sum(y_test .==(i-1)) for i in 1:n_class,  j in 1:n_class];
cf_mat_bnb = [sum((y_test .==(i-1)) .& (pred_bnb .==(j-1))) for i in 1:n_class,  j in 1:n_class];


In [68]:
TP_mnb = [cf_mat_mnb[i,i] for i in 1:n_class];
FP_mnb = sum(cf_mat_mnb,dims=1)[:] - TP_mnb[:];
FN_mnb = sum(cf_mat_mnb,dims=2)[:] - TP_mnb[:];
precis_mnb = TP_mnb./ (TP_mnb .+ FP_mnb);
recall_mnb = TP_mnb./ (TP_mnb .+ FN_mnb);
precis_micro_mnb  =sum(TP_mnb)./(sum(TP_mnb .+ FP_mnb));
recall_micro_mnb  =sum(TP_mnb)./(sum(TP_mnb .+ FN_mnb));
precis_macro_mnb = mean(precis_mnb);
recall_macro_mnb = mean(recall_mnb);
f1_micro_mnb = 2*precis_micro_mnb.*recall_micro_mnb./(precis_micro_mnb+recall_micro_mnb);
f1_macro_mnb = 2*precis_macro_mnb.*recall_macro_mnb./(precis_macro_mnb+recall_macro_mnb);
TP_bnb = [cf_mat_bnb[i,i] for i in 1:n_class];
FP_bnb = sum(cf_mat_bnb,dims=1)[:] - TP_bnb[:];
FN_bnb = sum(cf_mat_bnb,dims=2)[:] - TP_bnb[:];
precis_bnb = TP_bnb./ (TP_bnb .+ FP_bnb);
recall_bnb = TP_bnb./ (TP_bnb .+ FN_bnb);
precis_micro_bnb  =sum(TP_bnb)./(sum(TP_bnb .+ FP_bnb));
recall_micro_bnb  =sum(TP_bnb)./(sum(TP_bnb .+ FN_bnb));
precis_macro_bnb = mean(precis_bnb);
recall_macro_bnb = mean(recall_bnb);
f1_micro_bnb = 2*precis_micro_bnb.*recall_micro_bnb./(precis_micro_bnb+recall_micro_bnb);
f1_macro_bnb = 2*precis_macro_bnb.*recall_macro_bnb./(precis_macro_bnb+recall_macro_bnb);
println("Bernouilli NB:")

for i = 1:n_class
println(string(Classes[i],", Precision: ",@sprintf("%.1f",100 .*precis_bnb[i]),"%, Recall: ",@sprintf("%.1f",100 .*recall_bnb[i]),"%"));
end
println()
println("Micro Average F1 Score: ", @sprintf("%.2f",f1_micro_bnb))
println("Macro Average F1 Score: ", @sprintf("%.2f",f1_macro_bnb))
println()
println("---------------")
println("Multinomial NB:")

for i = 1:n_class
println(string(Classes[i],", Precision: ",@sprintf("%.1f",100 .*precis_mnb[i]),"%, Recall: ",@sprintf("%.1f",100 .*recall_mnb[i]),"%"));
end
println()
println("Micro Average F1 Score: ", @sprintf("%.2f",f1_micro_mnb))
println("Macro Average F1 Score: ", @sprintf("%.2f",f1_macro_mnb))
println()

Bernouilli NB:
Potholes, Precision: 83.7%, Recall: 98.3%
Flytipping, Precision: 89.0%, Recall: 88.3%
Parks & Green Spaces, Precision: NaN%, Recall: 0.0%
Car Parking, Precision: 75.9%, Recall: 97.1%
Pavements/footpaths, Precision: 68.1%, Recall: 59.5%

Micro Average F1 Score: 0.79
Macro Average F1 Score: NaN

---------------
Multinomial NB:
Potholes, Precision: 90.2%, Recall: 98.9%
Flytipping, Precision: 88.7%, Recall: 97.7%
Parks & Green Spaces, Precision: 88.9%, Recall: 47.8%
Car Parking, Precision: 86.5%, Recall: 97.1%
Pavements/footpaths, Precision: 80.6%, Recall: 68.4%

Micro Average F1 Score: 0.87
Macro Average F1 Score: 0.84



In [12]:
using Plots
plotly()
Plots.PlotlyBackend()
p1 = Plots.plot( Classes, Classes, cf_mat_mnb_norm,seriestype = :heatmap, xrotation = 45,yrotation = 45,aspect_ratio = 1,size =[400,400],yflip=true,xaxis = "Predicted", yaxis = "True", c = cgrad.(:blues),colorbar =:none,title = " MNB Confusion Matrix")
[annotate!((j-0.5,i-0.5,Plots.text(string(cf_mat_mnb[i,j]),8,:white))) for i in 1:n_class, j in 1:n_class];
p2 = Plots.plot( Classes, Classes, cf_mat_bnb_norm,seriestype = :heatmap, xrotation = 45,yrotation = 45,aspect_ratio = 1,size =[400,400],yflip=true,xaxis = "Predicted", yaxis = "True", c = cgrad.(:blues),colorbar =:none,title = " BNB Confusion Matrix");
[annotate!((j-0.5,i-0.5,Plots.text(string(cf_mat_bnb[i,j]),8,:white))) for i in 1:n_class, j in 1:n_class];
display(p1)
display(p2)

In [13]:
df_h_w = DataFrame(Dict(c=>Array{String}(undef,10) for c in Classes));
for i in 1:n_class
    inds = sortperm(log_cond_prob_mnb[i,:],rev=true);
    df_h_w[:,Symbol(Classes[i])] = X.terms[inds[1:10]];
end
df_h_w

Unnamed: 0_level_0,Car Parking,Flytipping,Parks & Green Spaces,Pavements/footpaths,Potholes
Unnamed: 0_level_1,String,String,String,String,String
1,road,rubbish,road,road,road
2,parking,road,park,path,potholes
3,cars,council,cut,pavement,pothole
4,park,fly,council,footpath,car
5,parked,waste,grass,walk,damage
6,car,dumped,children,council,hole
7,people,tipping,left,people,lane
8,vehicles,bags,please,pedestrians,deep
9,street,street,trees,children,surface
10,residents,bins,path,dangerous,holes


### Conclusions

* Naive Bayes is an intepretable classification model that can perform reasonably well with a small amount of training data and scales efficiently.

* Feature importance can be obtained directly from the model.

* Performance can be improved by for example weighting features or introducing a correlation factor.

* There are a number of alternatives, some lose the direct interpretability that Naive Bayes has.

In [76]:
# The 12th value of y_test is Parks & Green Spaces, but is misclassified by both algorithms
ind = 12;

In [77]:
df[test[ind],:]

Unnamed: 0_level_0,category_coded,description
Unnamed: 0_level_1,String,StringDo…
1913,Parks & Green Spaces,"StringDocument{String}(""The pathways between adderbury village, from twyford to bodicote. over grown and too difficult to walk my dog to keep fitter. my head is in perfect target for a vans wing mirror. I have reported this before, with the support of the village of adderbury, many if not all are single file only due to over growth, mud up the sides of the road, may flood due to this. You did very small bit, via oxford road. Weeds growing in the sides of roads. Many sign posts in banbury you cant see because of over growth, or so dirty you just cant see what they are supposed to say"", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"


In [78]:
pred_mnb[ind]

4.0

In [79]:
pred_bnb[ind]

4.0

In [83]:
score_bnb = log_prior_bnb[:] .+ sum(X_t_bnb.*transpose(log.(cond_prob_bnb)),dims=1)[:] .+ sum((1 .- X_t_bnb).*transpose(log.(1 .- cond_prob_bnb)),dims=1)[:]

5-element Array{Float64,1}:
 -238.20231626160592
 -248.66155241370905
 -256.58152389512406
 -236.54328571032437
 -224.30227861948606

In [86]:
score_mnb = log_prior_mnb[:] .+ sum(X_t_mnb.*transpose(log_cond_prob_mnb),dims=1)[:]

5-element Array{Float64,1}:
 -369.0424682201781
 -374.1888869965132
 -359.38465903445706
 -369.5458353610259
 -350.3363315229725

In [95]:
X.terms[findall(x-> x>0,X_t_bnb)]

37-element Array{String,1}:
 "adderbury"
 "banbury"
 "bit"
 "bodicote"
 "cant"
 "difficult"
 "dirty"
 "dog"
 "due"
 "file"
 "fitter"
 "flood"
 "growing"
 ⋮
 "sign"
 "single"
 "support"
 "supposed"
 "target"
 "twyford"
 "vans"
 "via"
 "village"
 "walk"
 "weeds"
 "wing"

In [101]:
Plots.plot(transpose(cond_prob_bnb[:,findall(x-> x>0,X_t_bnb)]))

In [104]:
X.terms[findall(x-> x>0,X_t_bnb)][13:15]

3-element Array{String,1}:
 "growing"
 "grown"
 "growth"

In [108]:
X.terms[findall(x-> x>0,X_t_bnb)][[6,9,18,24,32,35]]

6-element Array{String,1}:
 "difficult"
 "due"
 "mud"
 "road"
 "vans"
 "walk"

In [115]:
Plots.plot(transpose(transpose(X_t_mnb[findall(x-> x>0,X_t_mnb)]).*log_cond_prob_mnb[:,findall(x-> x>0,X_t_mnb)]))

In [120]:
l_c_p_mnb = transpose(X_t_mnb[findall(x-> x>0,X_t_mnb)]).*log_cond_prob_mnb[:,findall(x-> x>0,X_t_mnb)];
l_c_p_mnb_1 = l_c_p_mnb[1,:];
l_c_p_mnb_2 = l_c_p_mnb[2,:];
l_c_p_mnb_3 = l_c_p_mnb[3,:];
l_c_p_mnb_4 = l_c_p_mnb[4,:];
l_c_p_mnb_5 = l_c_p_mnb[5,:];

p_and_g_max = (l_c_p_mnb_3.>l_c_p_mnb_1).&(l_c_p_mnb_3.>l_c_p_mnb_2).&(l_c_p_mnb_3.>l_c_p_mnb_4).&(l_c_p_mnb_3.>l_c_p_mnb_5);
p_and_f_max = (l_c_p_mnb_5.>l_c_p_mnb_1).&(l_c_p_mnb_5.>l_c_p_mnb_2).&(l_c_p_mnb_5.>l_c_p_mnb_4).&(l_c_p_mnb_5.>l_c_p_mnb_3);
X.terms[findall(x-> x>0,X_t_bnb)][p_and_g_max]

10-element Array{String,1}:
 "adderbury"
 "dog"
 "fitter"
 "growing"
 "grown"
 "growth"
 "mirror"
 "pathways"
 "perfect"
 "weeds"

In [121]:
X.terms[findall(x-> x>0,X_t_bnb)][p_and_f_max]

11-element Array{String,1}:
 "banbury"
 "file"
 "flood"
 "mud"
 "posts"
 "sign"
 "supposed"
 "twyford"
 "via"
 "village"
 "walk"

In [122]:
pred_train_bnb = predict_BernouilliNB(X_train_bnb,log_prior_bnb,cond_prob_bnb);

In [124]:
findall(x-> x.==2,pred_train_bnb)

25-element Array{Int64,1}:
   81
  168
  296
  349
  368
  427
  490
  500
  516
  554
  593
  632
  690
  813
  923
 1015
 1043
 1053
 1057
 1136
 1272
 1278
 1284
 1553
 1588