### Introduction
The Naive Bayes exercise in the Doing Data Science book asks you to test out the Naive Bayes algorithm for classifying New York Times articles based on the text in the body of the article. Unfortunately, the [NYT Article Search API](https://developer.nytimes.com/apis) does not return the article body anymore. So I've decided to look at [Fix My Street](https://www.fixmystreet.com/) data, which was provided to me by [Reka Solymosi](https://www.rekadata.net).

The data consists of the reports that had at least 500 characters in the description, and we'll take a subset of these to consider only five of the more common report categories.

First let's load in the data.

In [1]:
using CSV, DataFrames, Queryverse

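csv_name = "longtext_fms.csv";
# Recent versions of CSV.jl require the sink type (here DataFrame) as the second argument
df = CSV.read(csv_name, DataFrame);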

Let's take a look at the data. We have 7,287 reports and each report has 4 features: ```ID``` (which is superfluous in this case), ```category```, ```category_coded```, and ```description```. We are interested in ```category_coded``` as our label, and the ```description```, which is the report text, will be our input. Some of the ```category``` values have been combined into single ```category_coded``` values. For example, reports that have a ```category``` of ```Carriageway Defect``` or ```General Highways Enquiry``` have been given a ```category_coded``` value of ```Roads/Highways```.

After taking a look at the data, let's make frequency tables of category and category_coded values.

In [2]:
df

Row,Column1,category,category_coded,description
,Int64,String,String,String
1,1,Pothole,Potholes,There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caused by heavy vehicles a heavy camber exists pushing cyclists and other traffic off the roads and into the verge. PLEASE NOTE THIS IS HIGHLY DANGEROUS AND SHOULD BE CLASSED AS A PRIORITY JOB
2,2,Pothole,Potholes,The enclosed potholes have damaged my alloy wheels due to recent poor works occurred on this stretch of route. My low profile alloys have been damaged hard. I had my daughter on board to witness the incident . I have informed Thames Valley police because of safety concerns especially for motor cyclists if they hits these potholes. They have authorised for the highway agencies to be informed immediately to avoid any accidents or potential death especially to motor cyclists and have advised to alert yourselves. I have obtained a reference number from the police and I will be contacting my insurance company regards to the damage caused to my alloy wheels which I intend to have repaired. My main concern is safety. I would like someone in your department to contact me please.
3,3,Street cleaning,Street Cleaning,"Over the past week, I have noticed that a deposit of domestic waste had formed in front of Blue Building, Denford Street. This deposit seems to have grown bigger over time, without any apparent attempt by the Borough’s waste collection services or the building’s managing agent to address the problem. I and all residents of that building would be grateful if the Borough’s waste management service could therefore take all necessary action as soon as possible, in order to prevent health hazards that could affect us and the local community."
4,4,Road traffic signs,Road Sign,Great to see this new crossing to help crossing very busy road - however finding it dangerous due to distance (I believe 60-80 feet) between controlling lights and pedestrian crossing. As traffic is nose to tail at peak times and fair distance between traffic lights and crossing the traffic has passed through green light with no other traffic light to stop just before the pedestrian crossing. Pedestrians get the signal to cross but cars thinking that they have right of way. Accident waiting to happen in my opinion and have already seen several near misses.
5,5,,,"There are two sets of no-parking lines each side of Briscoe road adjacent to the junction with Ware road. These strips are too short. When turning out of the high speed tail-gating traffic in Ware road and into Briscoe road I am very often faced with a 4x4 in the middle of the road -due to the parked cars - and a very short space to avoid a collision. If a second car is also turning into Briscoe behind me, the situation is critical, especially on a dark wet night. There is no room for two/three cars ! Request: that the no-parking lines are extended to allow two cars to seek refuge as described, in safety."
6,6,,,"The &#39;town centre bypass&#39; was put in place to direct &#39;through&#39; traffic from the town centre primarily for pedestrian/shoppers convenience and safety. The &#39;1170 bypass&#39; had a speed limit of 40mph to encourage motorists to use the bypass. Inexplicably the speed limit has now been reduced to 30mph. As a result, motorists now find it quicker to go past Barclays - past the Library, and down Burford street ! Shoppers/pedestrians are now faced with a continuous stream of impatient fast traffic all day long ! It must be now quicker this way or they wouldn&#39;t do it. Cause and effect. Please put the bypass back to its designed speed of 40mph, and &#39;traffic calming&#39; ridges outside the library, so us pedestrians can shop with safety...please."
7,7,Road traffic signs,Road Sign,"Unlit roundabout warning sign in Heath Rd. Whilst on the subject a few day ago i notified an unlit on the same roundabout area. I sent a picture of its location at that time. I have added it again. Why.? Your Street llighting tean visited Fiday 9th and replaced a Belisha Beacon bulb in the very same location. They fitted new bulbs at the Ambrose/Straight Rd roundabout as well. BUT, the Straight rd/Heath didn&#39;t get a bulb. Whilst the public can be helpful in reporting these, there is a duty to check by paid operatives when they are at locations. It would be prudent to drive Straight Rd and do a visual check. It gets dark before 5pm ( knocking off time). This would improve safety, prevent duplicate visits and reduce expenditure on the public purse. Just a thought. Austerity."
8,8,Rubbish (refuse and recycling),Litter,Centred more around opposite where Taylor Wimpy are building new houses on Naishes lane but also generally along the length of Naishes lane lots of litter.Where the builders cars and vans park on the verge opposite the new site lots of plastic bags full of food wrappers and cans of soft drink etc are just chucked on the verge or into the ditches. Please can this be picked up and also please can the council contact Taylor Wimpy re the upper part of the road they are responsible for as it needs a litter pick along the whole road as this is looking a real mess.
9,9,,,Above &quot;One Stop Interiors&quot; (ng7 7eu) there is a group of people that come every Sunday between 8am-2pm. They have large speakers that they use for the duration and it echo&#39;s around the streets. They also do this on friday/Saturday through the night !!! Thus goes on from 9pm all the way through to the early hours. I have been over a number of times asking for them to turn the volume down but i am just sniggered at and intimidated. I was told it was within there rights and to go away. This has been going on for a very long time now and i am finding this very draining now and need something done. PLEASE! Thanks Sarah Wealthall
10,10,Carriageway Defect,Roads/Highways,"Further to previous report, this area of sunken road fills up with water after rainfall, as shown in the uploaded photos. Contrary to the surveyors view in dry weather, this patch of road is significantly sunken and the water persists for many days after rainfall. Debris is shot onto adjacent parked cars on the property drives by passing traffic. The photos were taken after moderate rainfall overnight. Surface water seems to drain to this low area and goes nowhere, remaining for up to a week or more even in the absence of further rainfall."


In [3]:
println(df |>
    @groupby(_.category) |>
    @map({Key=key(_), Count=length(_)}) |>
    DataFrame)

163×2 DataFrame
│ Row │ Key                                                           │ Count │
│     │ String                                                        │ Int64 │
├─────┼───────────────────────────────────────────────────────────────┼───────┤
│ 1   │ Pothole                                                       │ 217   │
│ 2   │ Street cleaning                                               │ 257   │
│ 3   │ Road traffic signs                                            │ 158   │
│ 4   │ NA                                                            │ 693   │
│ 5   │ Rubbish (refuse and recycling)                                │ 233   │
│ 6   │ Carriageway Defect                                            │ 71    │
│ 7   │ Overflowing litter bin                                        │ 22    │
│ 8   │ Drainage                                                      │ 67    │
│ 9   │ Flytipping                                                    │ 281   │
⋮

In [4]:
println(df |>
    @groupby(_.category_coded) |>
    @map({Key=key(_), Count=length(_)}) |> @orderby_descending(_.Count)|>
    DataFrame)

43×2 DataFrame
│ Row │ Key                            │ Count │
│     │ String                         │ Int64 │
├─────┼────────────────────────────────┼───────┤
│ 1   │ Roads/Highways                 │ 1334  │
│ 2   │ NA                             │ 693   │
│ 3   │ Car Parking                    │ 679   │
│ 4   │ Potholes                       │ 606   │
│ 5   │ Pavements/footpaths            │ 500   │
│ 6   │ Flytipping                     │ 425   │
│ 7   │ Road Sign                      │ 321   │
│ 8   │ Street Cleaning                │ 273   │
│ 9   │ Trees                          │ 267   │
│ 10  │ Litter                         │ 252   │
│ 11  │ Parks & Green Spaces           │ 236   │
│ 12  │ Faulty street light            │ 229   │
│ 13  │ Obstructions (skips, A boards) │ 229   │
│ 14  │ Abandoned vehicles             │ 218   │
│ 15  │ Pavement Damaged/Cracked       │ 217   │
│ 16  │ Traffic lights                 │ 169   │
│ 17  │ Dog Fouling               

Let's subset our data and drop the columns that we're not interested in. For now we're not going to consider ```Roads/Highways```. Although it has the most cases, it appears to be quite broad in meaning, and when building a classifier this can lead to poor accuracy.

In [426]:
Classes = ["Car Parking","Potholes","Pavements/footpaths","Flytipping","Dog Fouling"]
df_sub = df |> @filter(_.category_coded in Classes) |> @map({category_coded = _.category_coded, description = _.description}) |> DataFrame;
first(df_sub,2)

Row,category_coded,description
,String,String
1,Potholes,There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caused by heavy vehicles a heavy camber exists pushing cyclists and other traffic off the roads and into the verge. PLEASE NOTE THIS IS HIGHLY DANGEROUS AND SHOULD BE CLASSED AS A PRIORITY JOB
2,Potholes,The enclosed potholes have damaged my alloy wheels due to recent poor works occurred on this stretch of route. My low profile alloys have been damaged hard. I had my daughter on board to witness the incident . I have informed Thames Valley police because of safety concerns especially for motor cyclists if they hits these potholes. They have authorised for the highway agencies to be informed immediately to avoid any accidents or potential death especially to motor cyclists and have advised to alert yourselves. I have obtained a reference number from the police and I will be contacting my insurance company regards to the damage caused to my alloy wheels which I intend to have repaired. My main concern is safety. I would like someone in your department to contact me please.


We'll need to perform some preprocessing before we perform any analysis. In particular, we want to remove punctuation, make everything lowercase, etc. To do this I'll use the [TextAnalysis.jl](https://github.com/JuliaText/TextAnalysis.jl) package. First we'll convert the description from a string to a ```StringDocument``` object.

In [427]:
using TextAnalysis
df_sub = df_sub |> @mutate(description = StringDocument(_.description)) |> DataFrame

Row,category_coded,description
,String,StringDo…
1,Potholes,"StringDocument{String}(""There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caused by heavy vehicles a heavy camber exists pushing cyclists and other traffic off the roads and into the verge. PLEASE NOTE THIS IS HIGHLY DANGEROUS AND SHOULD BE CLASSED AS A PRIORITY JOB"", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
2,Potholes,"StringDocument{String}(""The enclosed potholes have damaged my alloy wheels due to recent poor works occurred on this stretch of route. My low profile alloys have been damaged hard. I had my daughter on board to witness the incident . I have informed Thames Valley police because of safety concerns especially for motor cyclists if they hits these potholes. They have authorised for the highway agencies to be informed immediately to avoid any accidents or potential death especially to motor cyclists and have advised to alert yourselves. I have obtained a reference number from the police and I will be contacting my insurance company regards to the damage caused to my alloy wheels which I intend to have repaired. My main concern is safety. I would like someone in your department to contact me please."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
3,Flytipping,"StringDocument{String}(""White vans with waste are being turned away from the council tip and refused to offload their rubbish. This is due to the council imposing a free to obtain license for vans. These vans then flytip alongside other irresponsible flytippers in this recycle area. This needs to be actioned and the names of those flytipping published weekly in the newspaper. This flytipping is unchallenged. More needs to be done as council taxes are being wasted on this problem created due to a license imposed on vans by the council. The council refuses vans to dump but has to then pay for removal of resulting flytipping. This is a crazy misuse of resources."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
4,Car Parking,"StringDocument{String}(""We have an ongoing issue with neigbours who have two cars parking infront of other neighbours drives. Does the double yellow lines not get inforced on weekends ? People parking on double yellow lines during weekends and evenings. We have neighbours from hell and they are continuely parking infront of our drive even when asked not to on several occasions. We face abuse if we say anything. There is also a double yellow line infront of the drive. I have a photo but dont want to be victimised. As i have been in the past. Seems like the council is doing less and less inforcement checks at weekends with neighbours even parking on pavements. There needs to be strict fines. Also no proper checks to see how many people are living in these houses. Alot of littering and dumping haringey council is doing nothing ? Weekend infircement checjs is required please to stop abusers and on the spot fine please is what i would like to see. Our area is becoming run down."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
5,Car Parking,"StringDocument{String}(""I live in a cul-de-sac with my house being towards the end. A few houses have drives and those with drives infront of the house park their car on them. The problem is others park their car or car(s) in such a way that it blocks my access into and out of my driveway and causes problems to get out of the street. There have been a number of occasions that I have had to go knocking on doors to find out whose car it is and politely ask them to move it. Aside from it being an inconvenience and also a problem for anyone turning the car around, I wouldn&#39;t want to be responsible for scratching anyone&#39;s car either. This has been going on for some time now (12 months or more)."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
6,Flytipping,"StringDocument{String}(""Photo a bit small, but shows a bed frame and a shopping trolley in undergrowth beside canal. At junction of Foxglove Path and Longmarsh Lane by small footbridge. Generally, litter picking along this stretch on both sides of canal is almost non existent (see pic) and a lot of plastic ends up actually in the canal....blown or thrown there. I pick it up myself at least twice a week. Again, as always in West Thamesmead, there are hardly any bins...just an old oil drum (see pic) which is rusting at the bottom so the foxes can drag out what rubbish people DO decide to dispose of. Perhaps also Tesco in Thamesmead (one of the dirtiest stores I&#39;ve ever been in) could be encouraged to get involved in the clean up as most of the rubbish along the canal is their packaging. Or the Thamesmead Fishing Club. But BINS BINS BINS in West Thamesmead please..lots of them...LOADS on streets leading from Woolwich Arsenal, but as soon as you&#39;re into West Thamesmead....nothing. And the ones that are there hardly ever get emptied."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
7,Potholes,"StringDocument{String}(""The footpath outside my house and several others in close proximity has lots of potholes that are dangerous to local users. the dropped curves are worse with large holes appearing. several elderly people live in this area and have difficulty getting past in their scooters. the majority of Camp Hill Road has had all its paths and access points tarmacked and repaired over the last year , but for some unknown reason this end of the road was omitted. Could this be given priority before the winter sets in and makes the matter worse."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
8,Dog Fouling,"StringDocument{String}(""There is a large amount of disgustin dog diarrhoea all over the pavement outside chantry school. It has clearly been gone through by a child&#39;s pushchair and school children were also treading in it this morning. An absolute disgrace and this is an ongoing problem around the school. Either CCTV needs putting up to catch the dog owner who keeps allowing this or the council needs to check and pick it up regularly. It is a hygiene risk for the school children who are then treading the faeces into the school."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
9,Flytipping,"StringDocument{String}(""Yet more fly-tipping on the walkway between the railway bridge and Brintons Road. The bags of rubbish by the railway bridge have not been cleared even after months of reporting them! Recently there were items of chipboard dumped at the end of Wolverton Road, a sofa on the pavement and other assorted items dumped behind the wall at the seating area. Litter everywhere you look in this area between the two roads. When is someone going to clean it up? Or, better still, will those responsible stop dumping the stuff? Photos are available but for some reason they were not upload-able on this site today."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
10,Car Parking,"StringDocument{String}(""Some residents have decided to park their cars on Cupar Road opposite the housing scheme rather than within designated off-street parking areas within Springbank. This is blocking one lane of the main road and causes dangerous queing of traffic as the cars are parked right next to a zebra crossing and waiting cars restrict other drivers vision of people waiting to cross the zebra crossing. This is a busy area for School Children crossing. Last night there was a serious collision where a car travelling on the main road collided with one of the parked cars causing serious damage. There need to be double yellow lines painted along this stretch of road as car parking here is dangerous and the parked cars are causing serious problems."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"


Next, we'll try to drop all of the things we don't want: articles, numbers, non-letters, stop words, pronouns, case, corrupt characters, and the word "amp", which is used instead of the ampersand symbol. There may be other things we should remove; these are just the things I found in my initial look at the data.

First, we'll create a corpus. Note that if we just create a Corpus from the DataFrame column, then the operations we apply to the Corpus will also be applied to the DataFrame. Sharing the data would save memory, but I wasn't comfortable making changes to the underlying DataFrame, so I take a deep copy first.

In [428]:
desc = deepcopy(df_sub[!,:description])

2345-element Array{StringDocument{String},1}:
 StringDocument{String}("There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caused by heavy vehicles a heavy camber exists pushing cyclists and other traffic off the roads and into the verge. PLEASE NOTE THIS IS HIGHLY DANGEROUS AND SHOULD BE CLASSED AS A PRIORITY JOB", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
 StringDocument{String}("The enclosed potholes have damaged my alloy wheels due to recent poor works occurred on this stretch of route. My low profile alloys have been damage

In [429]:
crps = Corpus(desc);
remove_corrupt_utf8!(crps);
remove_case!(crps);
remove_words!(crps,["amp"]);
prepare!(crps,strip_articles | strip_numbers | strip_non_letters | strip_stopwords | strip_pronouns | strip_frequent_terms | strip_definite_articles);
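
To give a feel for what this kind of cleaning does, here is a minimal sketch in plain Julia. It is not how TextAnalysis.jl works internally: the function `clean_text` and the tiny `stop_words` list are made up for illustration, and unlike the package output it collapses the leftover whitespace.

```julia
# Hypothetical, tiny stop-word list; TextAnalysis.jl uses a much fuller one.
stop_words = Set(["the", "is", "a", "of", "and", "to"])

# Hypothetical helper: lowercase, strip non-letters, drop stop words.
function clean_text(s::AbstractString)
    s = lowercase(s)
    s = replace(s, r"[^a-z\s]" => " ")                        # drop digits and punctuation
    words = filter(w -> !(w in stop_words), String.(split(s))) # drop stop words
    return join(words, " ")
end

cleaned = clean_text("The pothole is 0.3 meters deep, and DANGEROUS!")
println(cleaned)
```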

Let's compare a before and after for the first description text. The processed text is printed first, followed by the original.

In [430]:
println(text(crps[1]))
println(text(df_sub[1,:description]))

     stretch   public highway       severely damaged   constant contractors heavy traffic         compound adjacent   fulscot railway bridge      severe damage   situated southbound roughly halfway     didcot road junction     fulscot railway bridge  damage consists     stretch   approximately   meters       meter wide     importantly       meters deep      damage     primarily caused   heavy vehicles   heavy camber exists pushing cyclists     traffic     roads       verge  please note     highly dangerous       classed     priority job
There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caus

You can see how many of the words have been removed in the above example. Now we can build up a Lexicon from the Corpus, which can take a bit of time.

In [431]:
update_lexicon!(crps);
println(lexicon(crps))



Just for interest, let's see what word occurs the most times.

In [432]:
lex = lexicon(crps);
# Word with the largest count in the lexicon
top_word = collect(keys(lex))[argmax(collect(values(lex)))];
print("Key = ", top_word, ", frequency = ", lexical_frequency(crps, top_word))

Key = road, frequency = 0.03053313686748602

It turns out that the word "road" occurs the most times. Now that we have our lexicon, we can create a Document Term Matrix. This contains the number of times each word occurs in each report. Note that this is slightly different to what they created in the book, where the matrix had values $X_{i,j} = 1$ if the word occurs and $0$ if not.

In [433]:
X = DocumentTermMatrix(crps)

DocumentTermMatrix(
  [13  ,     1]  =  1
  [103 ,     1]  =  2
  [136 ,     1]  =  1
  [148 ,     1]  =  1
  [151 ,     1]  =  1
  [154 ,     1]  =  2
  [216 ,     1]  =  2
  [232 ,     1]  =  2
  [247 ,     1]  =  1
  [252 ,     1]  =  1
  [261 ,     1]  =  1
  [267 ,     1]  =  1
  ⋮
  [1196, 12004]  =  1
  [1374, 12004]  =  1
  [1543, 12004]  =  1
  [1654, 12004]  =  1
  [1826, 12004]  =  1
  [1955, 12004]  =  1
  [1960, 12004]  =  1
  [2080, 12004]  =  1
  [2199, 12004]  =  1
  [2331, 12004]  =  2
  [2345, 12004]  =  1
  [163 , 12005]  =  1
  [1432, 12006]  =  1, ["a", "aa", "aaxlcphbdqaffa", "ab", "abandon", "abandoned", "abandoning", "abandonned", "abandons", "abbey"  …  "zig", "zigzag", "zigzags", "zimmer", "zne", "zome", "zona", "zone", "zoned", "zoom"], Dict("piecemeal" => 7765,"chicanes" => 1817,"bidder" => 1032,"abrest" => 30,"rises" => 8890,"hampshire" => 4745,"lyminton" => 6321,"gathered" => 4438,"underground" => 11072,"canal" => 1569…))

We can extract the matrix and set the values to either 1 or 0.

In [434]:
X_mat = dtm(X);
X_mat[X_mat .>0] .= 1;

In [435]:
X_mat

2345×12006 SparseArrays.SparseMatrixCSC{Int64,Int64} with 97866 stored entries:
  [13  ,     1]  =  1
  [103 ,     1]  =  1
  [136 ,     1]  =  1
  [148 ,     1]  =  1
  [151 ,     1]  =  1
  [154 ,     1]  =  1
  [216 ,     1]  =  1
  [232 ,     1]  =  1
  [247 ,     1]  =  1
  [252 ,     1]  =  1
  [261 ,     1]  =  1
  [267 ,     1]  =  1
  ⋮
  [1196, 12004]  =  1
  [1374, 12004]  =  1
  [1543, 12004]  =  1
  [1654, 12004]  =  1
  [1826, 12004]  =  1
  [1955, 12004]  =  1
  [1960, 12004]  =  1
  [2080, 12004]  =  1
  [2199, 12004]  =  1
  [2331, 12004]  =  1
  [2345, 12004]  =  1
  [163 , 12005]  =  1
  [1432, 12006]  =  1
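
To make the difference between the count matrix and the presence/absence matrix concrete, here is a toy sketch in plain Julia. The three tokenised "reports" are invented for illustration, and the construction is a hand-rolled stand-in, not the TextAnalysis.jl implementation.

```julia
# Toy corpus: each document is already tokenised (hypothetical data).
docs = [["pothole", "road", "road"], ["dog", "mess", "road"], ["dog", "dog"]]

# Vocabulary in sorted order, with a word => column lookup.
vocab = sort(unique(vcat(docs...)))
col = Dict(w => j for (j, w) in enumerate(vocab))

# Count matrix: counts[i, j] is the number of times word j occurs in document i.
counts = zeros(Int, length(docs), length(vocab))
for (i, doc) in enumerate(docs), w in doc
    counts[i, col[w]] += 1
end

# Presence/absence matrix, as used in the book.
binary = Int.(counts .> 0)

println(vocab)
println(counts)
println(binary)
```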

### Naive Bayes Classifier

Now that we have our matrix we can implement a Naive Bayes Classifier. Recall Bayes' theorem

$p(c|x) = \frac{p(x|c)p(c)}{p(x)}$.

Let's say we have a report $x$ which has category $c$. We assume that the words $x_{j}$ are independent of each other given the category (this assumption is what makes the classifier "naive"). To calculate $p(x|c)$ we have the product

$p(x|c) = \prod_{j}\theta_{jc}^{x_{j}}(1 - \theta_{jc})^{(1 - x_{j})}$,

where $\theta_{jc}$ is the probability that an individual word, $x_{j}$ is present in a report, $x$ with category $c$. We can then take the logs

$\log(p(x|c)) = \sum_{j}x_{j}\log\left(\frac{\theta_{j}}{1-\theta_{j}}\right) + \sum_{j}\log(1 - \theta_{j})$,

where we have dropped the class index on $\theta_{jc}$ for brevity. Note that $\log\left(\frac{\theta_{j}}{1-\theta_{j}}\right)$ does not depend on any particular report, just the word, so we can rename this $w_{j}$ and let's say $\sum_{j}\log(1 - \theta_{j}) = w_{0}$, giving

$\log(p(x|c)) = \sum_{j} x_{j}w_{j} + w_{0}$.
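
As a quick numerical check, the product form and the linear form can be compared on a toy example. This is only a sketch: the values in `theta` and `x` are made up, and $w_{j}$ is taken to be the log of the odds $\theta_{j}/(1-\theta_{j})$.

```julia
# Hypothetical per-word presence probabilities for one category (not fitted values).
theta = [0.8, 0.1, 0.5]
# A report encoded as presence/absence of each of the three words.
x = [1, 0, 1]

# Direct evaluation of the product form.
p_direct = prod(theta .^ x .* (1 .- theta) .^ (1 .- x))

# Linear form: the weights w and intercept w0 depend only on theta.
w = log.(theta ./ (1 .- theta))
w0 = sum(log.(1 .- theta))
log_p_linear = sum(x .* w) + w0

println(log(p_direct))
println(log_p_linear)
```

Both printed values should agree up to floating point error.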

Now that we have $p(x|c)$ we can compute $p(c|x)$ which is what we're interested in - the probability of a report being category $c$.
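
The step from $p(x|c)$ to $p(c|x)$ can be sketched as follows. The log-likelihoods and priors here are invented numbers, and normalising via the maximum is simply one common way to avoid underflow when the probabilities are very small.

```julia
# Hypothetical log-likelihoods log p(x|c) for three categories and their priors p(c).
log_lik = [-12.3, -10.1, -15.7]
prior = [0.5, 0.3, 0.2]

# Unnormalised log posterior: log p(x|c) + log p(c).
log_joint = log_lik .+ log.(prior)

# Subtract the maximum before exponentiating so small probabilities do not underflow.
m = maximum(log_joint)
posterior = exp.(log_joint .- m) ./ sum(exp.(log_joint .- m))

println(posterior)
println(argmax(posterior))
```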

This algorithm is cheap to train: given a large number of reports, all we do is count the words that are in each category. One downside of this is that we can have cases where the estimated probabilities are either $1$ or $0$, which may not be desirable for generalisability. Let's say that the word "pavement" only occurs in the Pavements/footpaths reports in our training set. The estimate for "pavement" will be $1$ for this category and $0$ for the rest. However, if a new report comes in about a pothole and mentions that the pothole is right next to the pavement, then it might be misclassified. We therefore introduce $2$ smoothing parameters, $\alpha$ and $\beta$, like so

$\theta_{j} = \frac{n_{jc}+\alpha}{n_{j} + \beta}$,

where $n_{jc}$ is the number of times that $x_{j}$ occurs in a category $c$ report and $n_{j}$ is the number of times it occurs in any report. Setting $\alpha$ and $\beta$ to $0$ is the equivalent of the ML estimator

$\theta_{\text{ML}} = \text{argmax}_{\theta}\, p(D|\theta) = \frac{n_{jc}}{n_{j}}$.

For the MAP estimator, assuming that the prior distribution of $\theta$ is of the form $\theta^{\alpha}(1 - \theta)^{\beta}$, we have

$\theta_{\text{MAP}} = \text{argmax}_{\theta}\, p(\theta|D) = \frac{n_{jc}+\alpha}{n_{j}+\beta}$.
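
To see the effect of the smoothing, here is a small sketch using the formula above. The helper `smoothed`, the counts, and the values $\alpha = 1$, $\beta = 2$ are all chosen purely for illustration.

```julia
# Smoothed estimate following the formula above; alpha = beta = 0 gives the ML estimate.
# `smoothed` is a hypothetical helper, not part of any package.
smoothed(n_jc, n_j; alpha = 0.0, beta = 0.0) = (n_jc + alpha) / (n_j + beta)

# Hypothetical counts: a word seen 12 times overall, always in the same category.
println(smoothed(12, 12))                           # ML estimate for that category
println(smoothed(0, 12))                            # ML estimate for any other category
println(smoothed(12, 12; alpha = 1.0, beta = 2.0))  # pulled away from 1
println(smoothed(0, 12; alpha = 1.0, beta = 2.0))   # pulled away from 0
```

With smoothing, a word that was never seen in a category no longer rules that category out entirely.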

### Implementing the classifier

Now we can define a few variables we will need to implement the classifier. First we will define our labels $y$: Car Parking = 0, Potholes = 1, Pavements/footpaths = 2, Flytipping = 3, Dog Fouling = 4.

We already have our matrix $X$, which we'll need to split into training and testing prior to training our model. We can define our variables

$\hat{\theta}_{jc} = \frac{n_{jc} + \alpha -1}{n_{c} + \alpha + \beta - 2}$

$\hat{\theta}_{c} = \frac{n_{c}}{n}$,

where $n_{jc}$ is the number of reports of class $c$ that contain the $j$'th word, $n_{c}$ is the number of reports of class $c$, $n$ is the total number of reports. Given these estimates and an unclassified report $x$ then we can calculate the log odds ratio for each class relative to a base class $c = 0$

$\log \left(\frac{p(y = c | x)}{p(y = 0 | x)}\right) = \sum_{j}\hat{w}_{jc}x_{j} + \hat{w}_{0c}$,

where

$\hat{w}_{jc} = \log\frac{\hat{\theta}_{jc}\left(1 - \hat{\theta}_{j0}\right)}{\hat{\theta}_{j0}\left(1- \hat{\theta}_{jc}\right)}$

and 

$\hat{w}_{0c} = \sum_{j} \log\frac{1 - \hat{\theta}_{jc}}{1 - \hat{\theta}_{j0}} + \log \frac{\hat{\theta}_{c}}{\hat{\theta}_{0}}$
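
To see where these weights come from, here is my own filling-in of the intermediate steps, assuming the Bernoulli model for $p(x|c)$ from the previous section. Since $p(x)$ cancels in the ratio, Bayes' theorem gives

$\log \left(\frac{p(y = c | x)}{p(y = 0 | x)}\right) = \log\frac{p(x|c)}{p(x|0)} + \log\frac{\hat{\theta}_{c}}{\hat{\theta}_{0}}$.

Substituting the product form for each class gives

$\log\frac{p(x|c)}{p(x|0)} = \sum_{j}\left[x_{j}\log\frac{\hat{\theta}_{jc}}{\hat{\theta}_{j0}} + (1 - x_{j})\log\frac{1 - \hat{\theta}_{jc}}{1 - \hat{\theta}_{j0}}\right]$,

and collecting the terms that multiply $x_{j}$ gives

$\log\frac{p(x|c)}{p(x|0)} = \sum_{j}x_{j}\log\frac{\hat{\theta}_{jc}\left(1 - \hat{\theta}_{j0}\right)}{\hat{\theta}_{j0}\left(1 - \hat{\theta}_{jc}\right)} + \sum_{j}\log\frac{1 - \hat{\theta}_{jc}}{1 - \hat{\theta}_{j0}}$.

The first sum is $\sum_{j}\hat{w}_{jc}x_{j}$, and the second sum together with the log ratio of the class priors is $\hat{w}_{0c}$.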

In [436]:
y = zeros(length(df_sub[!,:category_coded]));
y[df_sub[:,:category_coded] .== Classes[2]] .= 1;
y[df_sub[:,:category_coded] .== Classes[3]] .= 2;
y[df_sub[:,:category_coded] .== Classes[4]] .= 3;
y[df_sub[:,:category_coded] .== Classes[5]] .= 4;

In [437]:
using MLJ
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234);

In [438]:
y_train = Int64.(y[train]);
y_test = Int64.(y[test]);
X_train = X_mat[train,:];
X_test = X_mat[test,:];
println(string("Size of X_train = ", size(X_train), ", size of y_train = ", size(y_train)));
println(string("Size of X_test = ", size(X_test), ", size of y_test = ", size(y_test)));

Size of X_train = (1642, 12006), size of y_train = (1642,)
Size of X_test = (703, 12006), size of y_test = (703,)
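
For readers who don't want to load MLJ just for the split, a shuffled 70/30 partition can be sketched with the standard library. The helper `split_indices` is hypothetical, and because it uses a different random number stream it will not reproduce the exact indices that `partition` returned above.

```julia
using Random

# Sketch of a shuffled train/test split; `split_indices` is a hypothetical helper, not part of MLJ.
function split_indices(n::Integer, fraction::Real; seed::Integer = 1234)
    idx = shuffle(MersenneTwister(seed), 1:n)   # random permutation of 1:n
    n_train = round(Int, fraction * n)
    return idx[1:n_train], idx[n_train+1:end]
end

train_idx, test_idx = split_indices(10, 0.7)
println(length(train_idx), " ", length(test_idx))
```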


In [439]:
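n_class = Int64(maximum(y)+1);
n_words = size(X_train)[2];
# n_j[c+1, j] = number of class-c training reports that contain word j
n_j = zeros(n_class,n_words);
for x in 0:(n_class-1)
    n_j[x+1,:] = sum(X_train[y_train.==x,:],dims = 1);
end
# n_c[c+1] = number of training reports of class c, n = total number of training reports
n_c = [sum(y_train.==x) for x in 0:(n_class-1)]
n = size(y_train)[1];
# Smoothing parameters
alpha = 1.5;
beta = 5;
# Smoothed per-class word probabilities and class priors
theta_j = (n_j.+alpha .-1)./(n_c .+ alpha.+beta .- 2);
theta_c = n_c./n;
# Weights and intercepts for classes 1 to 4 relative to the base class 0 (row 1 of theta_j)
w_j = log.((theta_j[2:n_class,:].*(1 .- transpose(theta_j[1,:])))./(transpose(theta_j[1,:]).*(1 .- theta_j[2:n_class,:])))
w_0 = sum(log.((1. .-theta_j[2:n_class,:])./(1. .-transpose(theta_j[1,:]))),dims=2) .+ log.((theta_c[2:n_class])./(theta_c[1]));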

In [440]:
using Plots
plotly()
Plots.PlotlyBackend()

We'll write a short prediction function.

In [441]:
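function NB_pred(x,w_0,w_j)
  # Log odds of each non-base class relative to class 0 for a single report x
  log_odds_ratio = sum(x.*transpose(w_j),dims=1) .+ transpose(w_0[:]);
  pred = argmax(log_odds_ratio[:]);
  # If every log odds ratio is negative, the base class 0 is the most probable
  if (maximum(log_odds_ratio) <0)
    pred = 0;
  end
  return pred;
end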

NB_pred (generic function with 1 method)
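
As a sanity check of the decision rule, the sketch below applies the same logic as ```NB_pred``` to a toy problem with two words and three classes. The weights `w_toy` and `w0_toy` are made up, and `predict_toy` is a hypothetical helper that writes the sum as a matrix-vector product.

```julia
# Hypothetical weights relative to base class 0: row 1 is class 1, row 2 is class 2.
w_toy = [2.0 -1.0; -1.0 3.0]
w0_toy = [-0.5, -1.0]

# Same rule as NB_pred: pick the class with the largest log odds,
# falling back to the base class 0 when all log odds are negative.
function predict_toy(x, w0, w)
    log_odds = w * x .+ w0
    return maximum(log_odds) < 0 ? 0 : argmax(log_odds)
end

println(predict_toy([1, 0], w0_toy, w_toy))  # word 1 only
println(predict_toy([0, 1], w0_toy, w_toy))  # word 2 only
println(predict_toy([0, 0], w0_toy, w_toy))  # neither word
```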

Now we can predict all of the test data and generate a confusion matrix

In [460]:
pred = [NB_pred(X_test[i,:],w_0,w_j) for i in 1:size(X_test,1)];
confusion_matrix_norm = [sum((y_test .==(i-1)) .& (pred .==(j-1)))./sum(y_test .==(i-1)) for i in 1:n_class,  j in 1:n_class]

5×5 Array{Float64,2}:
 0.983051   0.0        0.0112994  0.00564972  0.0
 0.0110497  0.972376   0.0110497  0.00552486  0.0
 0.195402   0.091954   0.672414   0.0344828   0.00574713
 0.0225564  0.0075188  0.0526316  0.909774    0.0075188
 0.105263   0.0263158  0.0789474  0.657895    0.131579
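
Each row of this matrix is normalised by the number of test reports in that true class, so the diagonal gives the per-class recall. The toy sketch below, with invented labels, shows the same construction alongside the raw counts and the overall accuracy.

```julia
# Hypothetical true and predicted labels for a three-class problem (labels start at 0).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_hat  = [0, 1, 1, 1, 2, 0, 2]
k = 3

# Raw counts: rows are true classes, columns are predicted classes.
counts = [sum((y_true .== (i - 1)) .& (y_hat .== (j - 1))) for i in 1:k, j in 1:k]

# Row-normalised version: each row sums to 1.
row_norm = counts ./ sum(counts, dims = 2)

# Overall accuracy is the sum of the diagonal counts over the number of samples.
accuracy = sum(counts[i, i] for i in 1:k) / length(y_true)

println(counts)
println(row_norm)
println(accuracy)
```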

In [525]:
p1 = Plots.plot(Classes, Classes, confusion_matrix_norm, seriestype = :heatmap,
    xrotation = 45, yrotation = 45, aspect_ratio = 1, size = (400, 400), yflip = true,
    xaxis = "Predicted", yaxis = "True", c = cgrad(:blues), colorbar = :none, title = "Confusion Matrix")
# Annotate each cell with the row-normalised proportion
[annotate!((j-0.5,i-0.5,Plots.text(round(confusion_matrix_norm[i,j],digits=3),8,:white))) for i in 1:n_class, j in 1:n_class]
display(p1)

In [549]:
Plots.plot(theta_j[5,:],size =[800,300])
plot!(theta_j[4,:])

In [552]:
inds = sortperm(theta_j[5,:],rev=true)
X.terms[inds[1:10]]

10-element Array{String,1}:
 "dog"
 "road"
 "people"
 "fouling"
 "walk"
 "children"
 "mess"
 "please"
 "dogs"
 "owners"

In [567]:
Plots.plot(theta_j[:,inds[1:10]],lab = permutedims(X.terms[inds[1:10]]),size = [300,300])

In [572]:
argmax(theta_j[:,inds[1:10]],dims =1)[1,:]

10-element Array{CartesianIndex{2},1}:
 CartesianIndex(5, 1)
 CartesianIndex(2, 2)
 CartesianIndex(1, 3)
 CartesianIndex(5, 4)
 CartesianIndex(5, 5)
 CartesianIndex(5, 6)
 CartesianIndex(5, 7)
 CartesianIndex(5, 8)
 CartesianIndex(5, 9)
 CartesianIndex(5, 10)
