### Introduction

* My experience: mixture of implementing bespoke reconstruction algorithms for specific purposes & experimental physics.

* More recently, I have begun working with more traditional ML algorithms.

* One area I have experience in is classification (classifying objects in a metal detector as threat or benign).

* I have a dataset of reports from [Fix My Street](https://www.fixmystreet.com/), an online portal for reporting local problems.

* My presentation will be on document classification of this dataset using **Naive Bayes**.

### Naive Bayes
Why Naive Bayes?

* It works well without a large amount of training data

* Computationally efficient & scalable

* Can be easily updated with new data

* Tends to work well despite the independence assumption

* Model is interpretable

### Bayes' Theorem
Naive Bayes relies on Bayes' Theorem:

$$p(y|x) = \frac{p(y)p(x|y)}{p(x)},$$

assuming $p(x) \neq 0$. Or, in other words

$$p(\text{event}|\text{evidence}) = \frac{\text{prior} \times \text{likelihood}}{p(\text{evidence})}.$$

In classification problems we are given data $x$ and we want to assign it to a class $c_{k} \in C$. To do this, we find the class that maximises the posterior probability

$$p(c_{k}|x) = \frac{p(c_{k})p(x|c_{k})}{p(x)},$$

where $x = (x_{1},\dots,x_{n})$ are our features - in this case words in a report.
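
As a minimal sketch of this rule, the Julia snippet below applies Bayes' theorem to a single observed word and two classes. The priors and likelihoods are made-up numbers chosen purely for illustration; they are not estimated from the Fix My Street data.

```julia
# Hypothetical two-class problem with one observed word x (illustrative numbers only).
classes    = ["Potholes", "Flytipping"]
prior      = [0.6, 0.4]     # p(c_k)
likelihood = [0.05, 0.20]   # p(x | c_k)

# Evidence p(x) = sum_k p(c_k) p(x | c_k); it is the same for every class.
evidence = sum(prior .* likelihood)

# Posterior p(c_k | x) for each class via Bayes' theorem.
posterior = prior .* likelihood ./ evidence

# The predicted class maximises the posterior.
best = classes[argmax(posterior)]
println("posterior = ", posterior, ", predicted class = ", best)
```

Because $p(x)$ does not depend on the class, dropping the division by `evidence` would leave the predicted class unchanged.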

### Classification Method

The numerator is equal to the joint probability $p(x_{1},\dots,x_{n},c_{k})$, which can be expanded using the chain rule of probability:
\begin{align}
p(x_{1},\dots,x_{n},c_{k}) &= p(x_{1}|x_{2},\dots,x_{n},c_{k})p(x_{2},\dots,x_{n},c_{k}) \\
 &= p(x_{1}|x_{2},\dots,x_{n},c_{k})p(x_{2}|x_{3},\dots,x_{n},c_{k})\dots p(x_{n}|c_{k})p(c_{k}).
\end{align}
If we assume **all of the features are conditionally independent given the class**, then we have
$$
p(x_{1},\dots,x_{n},c_{k}) = p(c_{k})\prod_{i} p(x_{i}|c_{k}),
$$
$$p(c_{k}|x) = \frac{p(c_{k})\prod_{i} p(x_{i}|c_{k})}{p(x)},$$
and our classification rule is
$$
\hat{y} = \text{argmax}_{k} p(c_{k})\prod_{i} p(x_{i}|c_{k}).
$$
To avoid numerical underflow when the $p(x_{i}|c_{k})$ are very small, we take logs (the logarithm is monotonic, so the maximising class is unchanged):
$$
\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k})).
$$
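
The sketch below illustrates why the log form matters. It assumes two hypothetical classes with equal priors and 2000 per-word likelihoods each; all values are invented for the demonstration. In `Float64` arithmetic both direct products underflow to zero, whereas the log scores remain finite and comparable.

```julia
# Hypothetical per-word likelihoods p(x_i | c_k) for two classes (illustrative values).
p_a = fill(1e-3, 2000)
p_b = fill(2e-3, 2000)
log_prior = log(0.5)   # assumed equal priors for the two classes

# Direct products underflow to 0.0 in Float64, so the classes cannot be compared.
prod_a = prod(p_a)
prod_b = prod(p_b)

# Log scores stay finite and preserve the ordering of the classes.
score_a = log_prior + sum(log.(p_a))
score_b = log_prior + sum(log.(p_b))

println("products:   ", prod_a, " vs ", prod_b)
println("log scores: ", score_a, " vs ", score_b)
```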

### Event Models - The Distribution of $p(x_{i}|c_{k})$

* The prior $p(c_{k})$ can either be set by assuming equiprobable classes or be estimated from the class frequencies in the training set.
* We have to assume a distribution for the features $p(x_{i}|c_{k})$ :
    * Continuous features - Gaussian Naive Bayes $p(x_{i}|c_{k}) = \frac{1}{\sqrt{2\pi \sigma_{ik}^{2}}}\exp{\left(-\frac{\left(x_{i} - \mu_{ik}\right)^{2}}{2 \sigma_{ik}^{2}}\right)}$. $\mu_{ik},\sigma_{ik}$ are the mean & sd of $x_{i}$ in class $c_{k}$.
    * Discrete features - Bernoulli Naive Bayes $p(x|c_{k}) = \prod_{i}\theta_{ik}^{x_{i}}\left(1 - \theta_{ik}\right)^{\left(1 - x_{i}\right)}$. $\theta_{ik}$ is the probability that feature $i$ is present in a sample from class $c_{k}$. This assumes binary (present/absent) features.
    * Discrete features - Multinomial Naive Bayes $p(x|c_{k}) \propto \prod_{i} \theta_{ik}^{x_{i}}$. The distribution is parameterised by vectors $\theta_{k} = (\theta_{1k},\dots,\theta_{nk})$, where $\theta_{ik}$ is the probability of feature $i$ occurring in a sample belonging to class $c_{k}$.
* Bernoulli & Multinomial Naive Bayes are popular for document classification. The feature matrix consists of word occurrence (Bernoulli), word frequencies (Multinomial), or term frequency-inverse document frequency (tf-idf).

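* Use smoothing (e.g. additive/Laplace smoothing) to ensure that no estimated $\theta_{ik}$ is exactly $0$ or $1$; otherwise a single unseen word would force the whole product to zero.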
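
A minimal sketch of additive (Laplace) smoothing for the multinomial event model is given below. The count matrix and the smoothing strength `alpha = 1.0` are assumptions made for illustration, and the variable names are hypothetical.

```julia
# Hypothetical word counts for the documents of ONE class c_k:
# rows are documents, columns are vocabulary terms.
counts = [2 0 1;
          1 0 3]

alpha = 1.0   # assumed additive (Laplace) smoothing strength

# Total count of each term across the documents of the class.
term_totals = vec(sum(counts, dims=1))

# Unsmoothed estimate: a term never seen in the class gets probability 0.
theta_unsmoothed = term_totals ./ sum(term_totals)

# Smoothed estimate: add alpha to every term count before normalising.
theta_smoothed = (term_totals .+ alpha) ./ (sum(term_totals) + alpha * length(term_totals))

println("unsmoothed: ", theta_unsmoothed)
println("smoothed:   ", theta_smoothed)
```

With smoothing, the second term (never observed in this class) receives a small non-zero probability, so its logarithm stays finite.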


### Naive Bayes for Document Classification - Outline
* Obtain a set of labelled documents.

* Generate the feature matrix $X$: either binary term occurrence, a document-term matrix of word counts, or tf-idf.

* Calculate the values of $\theta_{ik}$ and the conditional probability $p(x_{i}|c_{k})$.

* Calculate the prior probability $p(c_{k})$.

* Classify new documents using $\hat{y} = \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} \log(p(x_{i}|c_{k}))$:

\begin{align}
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik})+ \sum_{i} (1-x_{i})\log(1-\theta_{ik}).\quad \text{(Bernoulli)} \\
\hat{y} &= \text{argmax}_{k} \log(p(c_{k})) + \sum_{i} x_{i}\log(\theta_{ik}).\quad \text{(Multinomial)}
\end{align}
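
The outline above can be sketched end to end in plain Julia for the multinomial case. Everything here is a toy assumption: the four-word vocabulary, the count matrix `X`, the labels, the smoothing value `alpha`, and the helper names `log_theta` and `classify` are hypothetical and are not taken from the Fix My Street analysis.

```julia
# Toy document-term matrix: rows are documents, columns are vocabulary terms.
vocab  = ["pothole", "road", "rubbish", "dumped"]
X      = [2 1 0 0;
          1 2 0 0;
          0 0 2 1;
          0 1 1 2]
labels  = ["Potholes", "Potholes", "Flytipping", "Flytipping"]
classes = unique(labels)

alpha = 1.0   # assumed Laplace smoothing parameter

# Log prior log p(c_k), estimated from class frequencies in the training set.
log_prior = [log(count(l -> l == c, labels) / length(labels)) for c in classes]

# log(theta_ik): smoothed log-probability of each term under class c.
function log_theta(c)
    totals = vec(sum(X[labels .== c, :], dims=1)) .+ alpha
    return log.(totals ./ sum(totals))
end
log_thetas = [log_theta(c) for c in classes]

# Multinomial rule: argmax_k log p(c_k) + sum_i x_i log(theta_ik).
function classify(x)
    scores = [log_prior[k] + sum(x .* log_thetas[k]) for k in eachindex(classes)]
    return classes[argmax(scores)]
end

# Term counts for two new (hypothetical) reports.
println(classify([1, 1, 0, 0]))   # mentions "pothole" and "road"
println(classify([0, 0, 1, 1]))   # mentions "rubbish" and "dumped"
```

The Bernoulli variant would differ only in using a binary `X` and adding the $\sum_{i} (1-x_{i})\log(1-\theta_{ik})$ term to each score.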

### Naive Bayes Example in Julia - Fix My Street Data
* The Fix My Street data contains reports of problems submitted by members of the public.
* Reports have been pre-classified as one of: "Car Parking", "Potholes", "Pavements/footpaths", "Flytipping", "Parks & Green Spaces".
* An example is: 
>Every weekday, especially between the hours of 7.30-8.30am a succession of vans completely obstruct the pavement while loading or unloading. Despite myself and other members of the public registering our concerns and pointing out the obvious danger to the public, the drivers continue this practice regardless of the very real threat to the public. We have witnessed several occasions where there have been very close misses and we feel it is only a matter of time before somebody is killed or seriously injured here. We also feel it is the councils responsibility to act now before a serious accident occurs, we have seen some (limited) parking regulation in the area but it is ineffective and does not appear to operate at all during the early morning periods when this problem is most dangerous

>Label: Car Parking

* I have recently started to learn Julia, which I will use for this example.

In [1]:
using CSV, DataFrames, Queryverse, TextAnalysis
csv_name = "FMS.csv";
# Load in the data (recent CSV.jl versions require the sink type as the second argument)
df = CSV.read(csv_name, DataFrame);
# Print a frequency table of the classes
println(df |>
    @groupby(_.category_coded) |>
    @map({Key=key(_), Count=length(_)}) |> @orderby_descending(_.Count)|>
    DataFrame)
# Get the unique labels
Classes = unique(df[!,:category_coded]);
# Convert the report description to a String Document to build the Document Term Matrices.
df = df |> @mutate(description = StringDocument(_.description)) |> DataFrame;
# Grab the report descriptions and generate a corpus
desc = deepcopy(df[!,:description]);
crps = Corpus(desc);
# Remove all of the words that we're not interested in.
remove_corrupt_utf8!(crps);
remove_case!(crps);
remove_words!(crps,["amp","quot"]);
prepare!(crps,strip_articles | strip_numbers | strip_non_letters | strip_stopwords | strip_pronouns | strip_frequent_terms | strip_definite_articles);
# Generate the lexicon
update_lexicon!(crps);

5×2 DataFrame
│ Row │ Key                  │ Count │
│     │ String               │ Int64 │
├─────┼──────────────────────┼───────┤
│ 1   │ Car Parking          │ 679   │
│ 2   │ Potholes             │ 606   │
│ 3   │ Pavements/footpaths  │ 500   │
│ 4   │ Flytipping           │ 425   │
│ 5   │ Parks & Green Spaces │ 236   │
