### Introduction
The Naive Bayes exercise in the Doing Data Science book asks you to test out the Naive Bayes algorithm for classifying New York Times articles based on the text in the body of the article. Unfortunately, the [NYT Article Search API](https://developer.nytimes.com/apis) does not return the article body anymore. So I've decided to look at [Fix My Street](https://www.fixmystreet.com/) data, which was provided to me by [Reka Solymosi](https://www.rekadata.net).

From the data, we took the reports that had at least 500 characters in the description. We'll take a subset of these and consider only 5 of the most common report categories.

First let's load in the data.

In [2]:
using CSV, DataFrames, Queryverse

csv_name = "longtext_fms.csv";
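# Recent CSV.jl releases need an explicit sink; going via CSV.File also works on older ones.
df = DataFrame(CSV.File(csv_name));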

Let's take a look at the data. We have 7,287 reports and each report has 4 features: an ```ID``` (shown as ```Column1``` in the output below, and superfluous in this case), ```category```, ```category_coded```, and ```description```. We'll use ```category_coded``` as our label and ```description```, which is the report text, as our input. Some of the ```category``` values have been combined into a single ```category_coded``` value. For example, reports that have a ```category``` of ```Carriageway Defect``` or ```General Highways Enquiry``` have been given a ```category_coded``` value of ```Roads/Highways```.

After taking a look at the data, let's make frequency tables of the ```category``` and ```category_coded``` values.

In [7]:
first(df,3)

Row,Column1,category,category_coded,description
,Int64,String,String,String
1,1,Pothole,Potholes,There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caused by heavy vehicles a heavy camber exists pushing cyclists and other traffic off the roads and into the verge. PLEASE NOTE THIS IS HIGHLY DANGEROUS AND SHOULD BE CLASSED AS A PRIORITY JOB
2,2,Pothole,Potholes,The enclosed potholes have damaged my alloy wheels due to recent poor works occurred on this stretch of route. My low profile alloys have been damaged hard. I had my daughter on board to witness the incident . I have informed Thames Valley police because of safety concerns especially for motor cyclists if they hits these potholes. They have authorised for the highway agencies to be informed immediately to avoid any accidents or potential death especially to motor cyclists and have advised to alert yourselves. I have obtained a reference number from the police and I will be contacting my insurance company regards to the damage caused to my alloy wheels which I intend to have repaired. My main concern is safety. I would like someone in your department to contact me please.
3,3,Street cleaning,Street Cleaning,"Over the past week, I have noticed that a deposit of domestic waste had formed in front of Blue Building, Denford Street. This deposit seems to have grown bigger over time, without any apparent attempt by the Borough’s waste collection services or the building’s managing agent to address the problem. I and all residents of that building would be grateful if the Borough’s waste management service could therefore take all necessary action as soon as possible, in order to prevent health hazards that could affect us and the local community."


In [4]:
println(df |>
    @groupby(_.category) |>
    @map({Key=key(_), Count=length(_)}) |>
    DataFrame)

163×2 DataFrame
│ Row │ Key                                                           │ Count │
│     │ [90mString[39m                                                        │ [90mInt64[39m │
├─────┼───────────────────────────────────────────────────────────────┼───────┤
│ 1   │ Pothole                                                       │ 217   │
│ 2   │ Street cleaning                                               │ 257   │
│ 3   │ Road traffic signs                                            │ 158   │
│ 4   │ NA                                                            │ 693   │
│ 5   │ Rubbish (refuse and recycling)                                │ 233   │
│ 6   │ Carriageway Defect                                            │ 71    │
│ 7   │ Overflowing litter bin                                        │ 22    │
│ 8   │ Drainage                                                      │ 67    │
│ 9   │ Flytipping                                                    │ 281   │
│ 10

│ 122 │ Japanese knotweed / ragwort                                   │ 1     │
│ 123 │ Trees &amp; hedges                                            │ 4     │
│ 124 │ Overgrown vegetation                                          │ 2     │
│ 125 │ Fly Posting                                                   │ 2     │
│ 126 │ Trunkroad or motorway defect                                  │ 1     │
│ 127 │ Trees and Woodland Maintenance                                │ 1     │
│ 128 │ Parks and playing fields                                      │ 1     │
│ 129 │ Car Parks                                                     │ 1     │
│ 130 │ Fences                                                        │ 6     │
│ 131 │ Overhanging vegetation                                        │ 1     │
│ 132 │ Tree                                                          │ 4     │
│ 133 │ Utilities                                                     │ 2     │
│ 134 │ Benches / bicycle racks         

In [8]:
println(df |>
    @groupby(_.category_coded) |>
    @map({Key=key(_), Count=length(_)}) |> @orderby_descending(_.Count)|>
    DataFrame)

43×2 DataFrame
│ Row │ Key                            │ Count │
│     │ [90mString[39m                         │ [90mInt64[39m │
├─────┼────────────────────────────────┼───────┤
│ 1   │ Roads/Highways                 │ 1334  │
│ 2   │ NA                             │ 693   │
│ 3   │ Car Parking                    │ 679   │
│ 4   │ Potholes                       │ 606   │
│ 5   │ Pavements/footpaths            │ 500   │
│ 6   │ Flytipping                     │ 425   │
│ 7   │ Road Sign                      │ 321   │
│ 8   │ Street Cleaning                │ 273   │
│ 9   │ Trees                          │ 267   │
│ 10  │ Litter                         │ 252   │
│ 11  │ Parks & Green Spaces           │ 236   │
│ 12  │ Faulty street light            │ 229   │
│ 13  │ Obstructions (skips, A boards) │ 229   │
│ 14  │ Abandoned vehicles             │ 218   │
│ 15  │ Pavement Damaged/Cracked       │ 217   │
│ 16  │ Traffic lights                 │ 169   │
│ 17  │ Dog Fouling               
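
For readers who don't use Queryverse, the group-and-count step above can be sketched roughly in plain Julia. This is only an illustration of the idea, not what the Query.jl pipeline does internally; `count_labels` and the `labels` sample are made up for the example.

```julia
# A small made-up sample of labels, not the real data.
labels = ["Potholes", "Car Parking", "Potholes", "Flytipping", "Potholes", "Car Parking"]

# Count how often each label occurs, then sort by count (largest first).
function count_labels(labels)
    counts = Dict{String,Int}()
    for label in labels
        counts[label] = get(counts, label, 0) + 1
    end
    # Turn the Dict into a vector of label => count pairs, sorted by descending count.
    return sort(collect(counts); by = last, rev = true)
end

freq = count_labels(labels)
for (label, n) in freq
    println(rpad(label, 12), " ", n)
end
```

Sorting the pairs by their count mirrors the `@orderby_descending(_.Count)` step in the cell above.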

Let's subset our data and drop the columns that we're not interested in. For now we're not going to consider ```Roads/Highways```. Although it has the most cases, it appears to be quite broad in meaning, and such a catch-all class can lead to poor accuracy when building a classifier.

In [321]:
# The five report categories we'll try to classify.
Classes = ["Car Parking", "Potholes", "Pavements/footpaths", "Flytipping", "Parks & Green Spaces"]
# Keep only the rows in those categories, and only the label and text columns.
df_sub = df |>
    @filter(_.category_coded in Classes) |>
    @map({category_coded = _.category_coded, description = _.description}) |>
    DataFrame;
# Save the subset for later use (CSV was already loaded in the first cell).
CSV.write("FMS.csv", df_sub);

In [322]:
df_sub

Row,category_coded,description
,String,String
1,Potholes,There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caused by heavy vehicles a heavy camber exists pushing cyclists and other traffic off the roads and into the verge. PLEASE NOTE THIS IS HIGHLY DANGEROUS AND SHOULD BE CLASSED AS A PRIORITY JOB
2,Potholes,The enclosed potholes have damaged my alloy wheels due to recent poor works occurred on this stretch of route. My low profile alloys have been damaged hard. I had my daughter on board to witness the incident . I have informed Thames Valley police because of safety concerns especially for motor cyclists if they hits these potholes. They have authorised for the highway agencies to be informed immediately to avoid any accidents or potential death especially to motor cyclists and have advised to alert yourselves. I have obtained a reference number from the police and I will be contacting my insurance company regards to the damage caused to my alloy wheels which I intend to have repaired. My main concern is safety. I would like someone in your department to contact me please.
3,Flytipping,White vans with waste are being turned away from the council tip and refused to offload their rubbish. This is due to the council imposing a free to obtain license for vans. These vans then flytip alongside other irresponsible flytippers in this recycle area. This needs to be actioned and the names of those flytipping published weekly in the newspaper. This flytipping is unchallenged. More needs to be done as council taxes are being wasted on this problem created due to a license imposed on vans by the council. The council refuses vans to dump but has to then pay for removal of resulting flytipping. This is a crazy misuse of resources.
4,Parks & Green Spaces,"Hi, I have just visited &quot;Penparcau Playing Field&quot; and it is in a VERY BAD state of repair. I thought that I would report it as it appears to be heavily used and is in a residential area. You might already know this but, - The wooden park gate does not shut, so children could run out, near a busy road - The dog bag dispenser is rusty, sharp and dangerous. - The tarmac path is not level near the entrance to the play area, pot hole. - There is graffiti on the play equipment. - Some of the play equipment is in poor condition and the plywood is splintering away. - There is a fallen tree in the area. - I noticed that the fence around the play area needs repairing. - The play area sign is damaged, bent and sharp. - Seats are rotten and need repairing and painting - There appears to be a pool of water in the children&#39;s play area!? Under a piece of rusting disused play equipment. As you walk in from the under pass it is all on the right hand side of the park. It is such a shame to see this in this condition in touch a prominent area. Hope it sorted soon :) Please pass on to the right people county council or a community council. Thanks"
5,Car Parking,We have an ongoing issue with neigbours who have two cars parking infront of other neighbours drives. Does the double yellow lines not get inforced on weekends ? People parking on double yellow lines during weekends and evenings. We have neighbours from hell and they are continuely parking infront of our drive even when asked not to on several occasions. We face abuse if we say anything. There is also a double yellow line infront of the drive. I have a photo but dont want to be victimised. As i have been in the past. Seems like the council is doing less and less inforcement checks at weekends with neighbours even parking on pavements. There needs to be strict fines. Also no proper checks to see how many people are living in these houses. Alot of littering and dumping haringey council is doing nothing ? Weekend infircement checjs is required please to stop abusers and on the spot fine please is what i would like to see. Our area is becoming run down.
6,Car Parking,"I live in a cul-de-sac with my house being towards the end. A few houses have drives and those with drives infront of the house park their car on them. The problem is others park their car or car(s) in such a way that it blocks my access into and out of my driveway and causes problems to get out of the street. There have been a number of occasions that I have had to go knocking on doors to find out whose car it is and politely ask them to move it. Aside from it being an inconvenience and also a problem for anyone turning the car around, I wouldn&#39;t want to be responsible for scratching anyone&#39;s car either. This has been going on for some time now (12 months or more)."
7,Flytipping,"Photo a bit small, but shows a bed frame and a shopping trolley in undergrowth beside canal. At junction of Foxglove Path and Longmarsh Lane by small footbridge. Generally, litter picking along this stretch on both sides of canal is almost non existent (see pic) and a lot of plastic ends up actually in the canal....blown or thrown there. I pick it up myself at least twice a week. Again, as always in West Thamesmead, there are hardly any bins...just an old oil drum (see pic) which is rusting at the bottom so the foxes can drag out what rubbish people DO decide to dispose of. Perhaps also Tesco in Thamesmead (one of the dirtiest stores I&#39;ve ever been in) could be encouraged to get involved in the clean up as most of the rubbish along the canal is their packaging. Or the Thamesmead Fishing Club. But BINS BINS BINS in West Thamesmead please..lots of them...LOADS on streets leading from Woolwich Arsenal, but as soon as you&#39;re into West Thamesmead....nothing. And the ones that are there hardly ever get emptied."
8,Potholes,"The footpath outside my house and several others in close proximity has lots of potholes that are dangerous to local users. the dropped curves are worse with large holes appearing. several elderly people live in this area and have difficulty getting past in their scooters. the majority of Camp Hill Road has had all its paths and access points tarmacked and repaired over the last year , but for some unknown reason this end of the road was omitted. Could this be given priority before the winter sets in and makes the matter worse."
9,Flytipping,"Yet more fly-tipping on the walkway between the railway bridge and Brintons Road. The bags of rubbish by the railway bridge have not been cleared even after months of reporting them! Recently there were items of chipboard dumped at the end of Wolverton Road, a sofa on the pavement and other assorted items dumped behind the wall at the seating area. Litter everywhere you look in this area between the two roads. When is someone going to clean it up? Or, better still, will those responsible stop dumping the stuff? Photos are available but for some reason they were not upload-able on this site today."
10,Parks & Green Spaces,"This bin doesn&#39;t have a metal liner and people put bags of waste including food scraps, vegetable peelings, food containers etc in it every week. The local dogs/ foxes/ gulls rip the side of the bin and pull the waste through onto the grass and path, and then it is distributed about the park. Not sanitary, safe for dogs to eat as they walk by or pleasant to look at. Please can the council put a metal liner in the bin and this will resolve the issue of the food waste being pulled out all over the park at least."


We'll need to do some preprocessing before any analysis. In particular, we want to remove punctuation, make everything lowercase, and so on. To do this I'll use the [TextAnalysis.jl](https://github.com/JuliaText/TextAnalysis.jl) package. First we'll convert each description from a string to a StringDocument object.

In [165]:
using TextAnalysis
df_sub = df_sub |> @mutate(description = StringDocument(_.description)) |> DataFrame

Row,category_coded,description
,String,StringDo…
1,Potholes,"StringDocument{String}(""There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caused by heavy vehicles a heavy camber exists pushing cyclists and other traffic off the roads and into the verge. PLEASE NOTE THIS IS HIGHLY DANGEROUS AND SHOULD BE CLASSED AS A PRIORITY JOB"", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
2,Potholes,"StringDocument{String}(""The enclosed potholes have damaged my alloy wheels due to recent poor works occurred on this stretch of route. My low profile alloys have been damaged hard. I had my daughter on board to witness the incident . I have informed Thames Valley police because of safety concerns especially for motor cyclists if they hits these potholes. They have authorised for the highway agencies to be informed immediately to avoid any accidents or potential death especially to motor cyclists and have advised to alert yourselves. I have obtained a reference number from the police and I will be contacting my insurance company regards to the damage caused to my alloy wheels which I intend to have repaired. My main concern is safety. I would like someone in your department to contact me please."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
3,Flytipping,"StringDocument{String}(""White vans with waste are being turned away from the council tip and refused to offload their rubbish. This is due to the council imposing a free to obtain license for vans. These vans then flytip alongside other irresponsible flytippers in this recycle area. This needs to be actioned and the names of those flytipping published weekly in the newspaper. This flytipping is unchallenged. More needs to be done as council taxes are being wasted on this problem created due to a license imposed on vans by the council. The council refuses vans to dump but has to then pay for removal of resulting flytipping. This is a crazy misuse of resources."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
4,Parks & Green Spaces,"StringDocument{String}(""Hi, I have just visited &quot;Penparcau Playing Field&quot; and it is in a VERY BAD state of repair. I thought that I would report it as it appears to be heavily used and is in a residential area. You might already know this but, - The wooden park gate does not shut, so children could run out, near a busy road - The dog bag dispenser is rusty, sharp and dangerous. - The tarmac path is not level near the entrance to the play area, pot hole. - There is graffiti on the play equipment. - Some of the play equipment is in poor condition and the plywood is splintering away. - There is a fallen tree in the area. - I noticed that the fence around the play area needs repairing. - The play area sign is damaged, bent and sharp. - Seats are rotten and need repairing and painting - There appears to be a pool of water in the children&#39;s play area!? Under a piece of rusting disused play equipment. As you walk in from the under pass it is all on the right hand side of the park. It is such a shame to see this in this condition in touch a prominent area. Hope it sorted soon :) Please pass on to the right people county council or a community council. Thanks"", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
5,Car Parking,"StringDocument{String}(""We have an ongoing issue with neigbours who have two cars parking infront of other neighbours drives. Does the double yellow lines not get inforced on weekends ? People parking on double yellow lines during weekends and evenings. We have neighbours from hell and they are continuely parking infront of our drive even when asked not to on several occasions. We face abuse if we say anything. There is also a double yellow line infront of the drive. I have a photo but dont want to be victimised. As i have been in the past. Seems like the council is doing less and less inforcement checks at weekends with neighbours even parking on pavements. There needs to be strict fines. Also no proper checks to see how many people are living in these houses. Alot of littering and dumping haringey council is doing nothing ? Weekend infircement checjs is required please to stop abusers and on the spot fine please is what i would like to see. Our area is becoming run down."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
6,Car Parking,"StringDocument{String}(""I live in a cul-de-sac with my house being towards the end. A few houses have drives and those with drives infront of the house park their car on them. The problem is others park their car or car(s) in such a way that it blocks my access into and out of my driveway and causes problems to get out of the street. There have been a number of occasions that I have had to go knocking on doors to find out whose car it is and politely ask them to move it. Aside from it being an inconvenience and also a problem for anyone turning the car around, I wouldn&#39;t want to be responsible for scratching anyone&#39;s car either. This has been going on for some time now (12 months or more)."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
7,Flytipping,"StringDocument{String}(""Photo a bit small, but shows a bed frame and a shopping trolley in undergrowth beside canal. At junction of Foxglove Path and Longmarsh Lane by small footbridge. Generally, litter picking along this stretch on both sides of canal is almost non existent (see pic) and a lot of plastic ends up actually in the canal....blown or thrown there. I pick it up myself at least twice a week. Again, as always in West Thamesmead, there are hardly any bins...just an old oil drum (see pic) which is rusting at the bottom so the foxes can drag out what rubbish people DO decide to dispose of. Perhaps also Tesco in Thamesmead (one of the dirtiest stores I&#39;ve ever been in) could be encouraged to get involved in the clean up as most of the rubbish along the canal is their packaging. Or the Thamesmead Fishing Club. But BINS BINS BINS in West Thamesmead please..lots of them...LOADS on streets leading from Woolwich Arsenal, but as soon as you&#39;re into West Thamesmead....nothing. And the ones that are there hardly ever get emptied."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
8,Potholes,"StringDocument{String}(""The footpath outside my house and several others in close proximity has lots of potholes that are dangerous to local users. the dropped curves are worse with large holes appearing. several elderly people live in this area and have difficulty getting past in their scooters. the majority of Camp Hill Road has had all its paths and access points tarmacked and repaired over the last year , but for some unknown reason this end of the road was omitted. Could this be given priority before the winter sets in and makes the matter worse."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
9,Flytipping,"StringDocument{String}(""Yet more fly-tipping on the walkway between the railway bridge and Brintons Road. The bags of rubbish by the railway bridge have not been cleared even after months of reporting them! Recently there were items of chipboard dumped at the end of Wolverton Road, a sofa on the pavement and other assorted items dumped behind the wall at the seating area. Litter everywhere you look in this area between the two roads. When is someone going to clean it up? Or, better still, will those responsible stop dumping the stuff? Photos are available but for some reason they were not upload-able on this site today."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"
10,Parks & Green Spaces,"StringDocument{String}(""This bin doesn&#39;t have a metal liner and people put bags of waste including food scraps, vegetable peelings, food containers etc in it every week. The local dogs/ foxes/ gulls rip the side of the bin and pull the waste through onto the grass and path, and then it is distributed about the park. Not sanitary, safe for dogs to eat as they walk by or pleasant to look at. Please can the council put a metal liner in the bin and this will resolve the issue of the food waste being pulled out all over the park at least."", DocumentMetadata(English(), ""Untitled Document"", ""Unknown Author"", ""Unknown Time""))"


Next, we'll try to drop all of the things we don't want: articles, numbers, non-letters, stop words, pronouns, case, corrupt characters, and the word "amp", which is what remains of the HTML entity that appears in place of the ampersand symbol. There may be other things we should remove, but these are the ones I found in my initial look at the data.
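
To make these steps a bit more concrete, here's a rough sketch of the same kind of cleaning on a single string, using only Base Julia. It is not how TextAnalysis.jl implements things; `simple_clean` and the tiny stop-word list are invented for the example, and the real package uses much longer word lists.

```julia
# A tiny hand-picked stop-word list, for illustration only.
stop_words = Set(["the", "a", "is", "of", "and", "amp"])

function simple_clean(text::AbstractString, stop_words)
    lowered = lowercase(text)
    # Replace anything that is not a lowercase ASCII letter or a space with a space.
    letters_only = replace(lowered, r"[^a-z ]" => " ")
    # Split on whitespace and drop the stop words.
    words = filter(w -> !(w in stop_words), split(letters_only))
    return join(words, " ")
end

cleaned = simple_clean("Trees &amp; hedges: 3 potholes near THE road.", stop_words)
println(cleaned)   # trees hedges potholes near road
```

Note how stripping the non-letters from `&amp;` leaves the stray word "amp" behind, which is why it has to be removed explicitly.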

First, we'll create a corpus. Note that if we just create a Corpus directly from the DataFrame column, then the operations we apply to the Corpus will also be applied to the DataFrame. Sharing the documents would save memory, but I wasn't happy making changes to the underlying DataFrame, so I work on a deep copy instead.

In [166]:
desc = deepcopy(df_sub[!,:description]);
crps = Corpus(desc);
remove_corrupt_utf8!(crps);
remove_case!(crps);
remove_words!(crps,["amp"]);
prepare!(crps,strip_articles | strip_numbers | strip_non_letters | strip_stopwords | strip_pronouns | strip_frequent_terms | strip_definite_articles);
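
The reason for the `deepcopy` can be sketched with plain Julia values. This is an analogy rather than TextAnalysis.jl code: `Ref` is used here as a stand-in for a mutable document object, and the strings are made up.

```julia
# Ref is a small mutable box; here it stands in for a mutable document object.
original = [Ref("Pot Hole"), Ref("Fly Tipping")]

shallow = copy(original)      # new array, but it still holds the same boxes
deep    = deepcopy(original)  # new array holding new boxes

# Lowercase in place through the deep copy: the originals are untouched.
for d in deep
    d[] = lowercase(d[])
end
println(original[1][])   # still "Pot Hole"

# Lowercase in place through the shallow copy: the originals change too.
for d in shallow
    d[] = lowercase(d[])
end
println(original[1][])   # now "pot hole"
```

A plain `copy` would not have been enough here, since it only copies the outer array and not the objects inside it.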

Let's compare the processed text (printed first) with the original for the first description.

In [167]:
println(text(crps[1]))
println(text(df_sub[1,:description]))

     stretch   public highway       severely damaged   constant contractors heavy traffic         compound adjacent   fulscot railway bridge      severe damage   situated southbound roughly halfway     didcot road junction     fulscot railway bridge  damage consists     stretch   approximately   meters       meter wide     importantly       meters deep      damage     primarily caused   heavy vehicles   heavy camber exists pushing cyclists     traffic     roads       verge  please note     highly dangerous       classed     priority job
There is a stretch of public highway that has been severely damaged by constant contractors heavy traffic to and from a compound adjacent to Fulscot railway bridge. The most severe damage is situated southbound roughly halfway between the Didcot road junction and the Fulscot railway bridge. Damage consists of a stretch of approximately 3 meters long by 0.5 meter wide and more importantly up to 0.3 meters deep. Because this damage has been primarily caus

You can see how many of the words have been removed in the above example. Now we can build up a Lexicon from the Corpus, which can take a bit of time.

In [168]:
update_lexicon!(crps);
println(lexicon(crps))

Dict("piecemeal" => 1,"deventary" => 1,"chicanes" => 4,"bidder" => 1,"abrest" => 1,"rises" => 3,"hampshire" => 2,"lyminton" => 1,"gathered" => 2,"underground" => 6,"canal" => 14,"november" => 11,"caught" => 38,"stress" => 12,"rectified" => 20,"chav" => 1,"methods" => 2,"buckinghamshire" => 1,"ferrymeads" => 2,"obsessed" => 1,"lowfield" => 1,"fountain" => 1,"crib" => 1,"premature" => 1,"infrequency" => 2,"eighteen" => 1,"morons" => 1,"recessed" => 2,"pot" => 146,"replacement" => 23,"a" => 177,"vibration" => 12,"advice" => 7,"sanity" => 1,"particular" => 58,"shouting" => 6,"selection" => 3,"tooth" => 1,"brighten" => 1,"insertion" => 1,"folded" => 1,"bogger" => 1,"hume" => 1,"fishing" => 3,"turnpike" => 1,"pickford" => 1,"domonic" => 1,"answers" => 1,"brookhouse" => 1,"suggestion" => 4,"rosemary" => 2,"hissing" => 2,"dustcarts" => 2,"canning" => 1,"royston" => 1,"uphold" => 2,"unload" => 3,"arguably" => 1,"developers" => 6,"consulted" => 2,"pestrian" => 1,"instinct" => 1,"suitability" => 

d" => 2,"aylesford" => 1,"greenflag" => 1,"emerald" => 4,"adjoins" => 2,"band" => 3,"downright" => 3,"ramblers" => 2,"partly" => 11,"scarce" => 2,"landowner" => 5,"safekeeping" => 1,"backside" => 1,"spey" => 1,"tilmore" => 1,"lichfield" => 2,"returned" => 14,"maida" => 1,"sports" => 5,"listen" => 2,"farley" => 4,"games" => 9,"various" => 41,"occasions" => 91,"contracted" => 4,"polluted" => 1,"rhiwbina" => 1,"scumbags" => 2,"egertons" => 3,"exisiting" => 1,"england" => 3,"lapping" => 1,"pleasure" => 1,"regardless" => 10,"evaluated" => 1,"heve" => 1,"suggests" => 5,"educated" => 2,"budweiser" => 1,"penalty" => 8,"golden" => 4,"thorpe" => 2,"resolved" => 28,"beddingham" => 1,"bike" => 61,"santapod" => 1,"marlborough" => 2,"campaign" => 1,"tippling" => 1,"meeting" => 11,"customers" => 26,"dangering" => 1,"tailbacks" => 1,"struck" => 7,"b" => 54,"chertsey" => 3,"reappear" => 5,"somercotes" => 2,"professional" => 3,"admits" => 1,"windrush" => 2,"chevet" => 3,"nativity" => 1,"buckingham" => 3

"textured" => 2,"freedom" => 2,"sump" => 4,"assortment" => 1,"glendennings" => 4,"sideways" => 3,"medstead" => 1,"hardens" => 1,"dislodged" => 1,"keynes" => 2,"senwick" => 4,"crested" => 1,"powder" => 4,"oval" => 3,"hits" => 8,"porch" => 2,"discipline" => 1,"opposite" => 273,"question" => 28,"orchard" => 14,"unwanted" => 10,"serviced" => 3,"belvedere" => 2,"lubbock" => 1,"visible" => 43,"le" => 1,"leaks" => 2,"anf" => 1,"summerheath" => 2,"preserve" => 1,"presume" => 7,"aldi" => 4,"trails" => 1,"ability" => 4,"amisss" => 1,"initially" => 5,"henville" => 1,"thursdays" => 1,"eighties" => 1,"blake" => 1,"leziate" => 5,"fumes" => 4,"rippolson" => 5,"markyate" => 1,"promised" => 6,"geographically" => 1,"emerging" => 6,"player" => 1,"leptospirosis" => 1,"planings" => 1,"foliage" => 11,"arises" => 1,"uncommon" => 1,"shakes" => 9,"contrary" => 2,"iam" => 3,"buy" => 14,"damaging" => 39,"cherry" => 7,"narrowness" => 5,"commons" => 2,"innacurate" => 1,"grit" => 5,"wages" => 1,"manouvering" => 2,"

an" => 1,"silver" => 16,"swooping" => 1,"pluckley" => 1,"riangular" => 1,"kellaway" => 1,"stressful" => 2,"amberton" => 9,"constanly" => 1,"foul" => 5,"childish" => 1,"recognise" => 3,"taxis" => 6,"transpires" => 1,"trial" => 2,"categorically" => 2,"collision" => 25,"biffa" => 2,"disturbed" => 5,"psco" => 1,"rope" => 2,"renewed" => 1,"badgemore" => 1,"tidying" => 1,"unroadworthy" => 2,"footsteps" => 2,"unfriendlyness" => 1,"explore" => 1,"peering" => 1,"nappy" => 2,"uplifted" => 1,"temporarily" => 7,"prohibitive" => 1,"pursell" => 1,"grinder" => 1,"mabe" => 1,"dumping" => 99,"tramps" => 4,"polish" => 1,"million" => 5,"juke" => 1,"furthest" => 2,"sector" => 2,"parkway" => 2,"inspections" => 3,"spotting" => 1,"wantage" => 1,"fortune" => 6,"morals" => 1,"frank" => 1,"closeness" => 1,"carterton" => 5,"ar" => 3,"circa" => 3,"jump" => 5,"stapnalls" => 2,"intervene" => 3,"underbanks" => 1,"cooker" => 2,"cabinet" => 3,"eldest" => 1,"liason" => 1,"undertaken" => 3,"yesterday" => 60,"pidio" => 1

"cab" => 2,"ignore" => 18,"successful" => 1,"unsettled" => 1,"precinct" => 1,"promenade" => 3,"failure" => 7,"uptight" => 1,"photographs" => 19,"traps" => 2,"ryde" => 1,"hesitate" => 3,"adrian" => 1,"stump" => 2,"disembark" => 1,"captain" => 1,"chandos" => 1,"chancellors" => 3,"disused" => 3,"asap" => 63,"tarmack" => 2,"passage" => 10,"dissuade" => 1,"hint" => 1,"bowl" => 1,"ashleigh" => 3,"inconvenient" => 4,"valley" => 14,"decipher" => 1,"loch" => 1,"alignment" => 4,"highways" => 65,"gaining" => 3,"key" => 4,"reaching" => 2,"dalry" => 3,"hospice" => 3,"substantially" => 3,"frequency" => 3,"induce" => 1,"stringent" => 1,"blackberry" => 2,"objects" => 4,"holds" => 2,"widening" => 5,"saq" => 9,"surveilance" => 1,"hiding" => 1,"blob" => 1,"session" => 1,"unwelcome" => 1,"southbank" => 1,"pipe" => 12,"permits" => 14,"afterwards" => 7,"village" => 65,"wonder" => 19,"insignia" => 1,"supplier" => 1,"moments" => 1,"identifies" => 2,"nefarious" => 1,"respond" => 8,"definition" => 1,"inforce" =

"alot" => 20,"practise" => 1,"tiles" => 5,"shouldn" => 22,"lanarkshire" => 1,"stepagates" => 1,"surveyed" => 1,"sleepless" => 1,"drawers" => 6,"rejoin" => 3,"clue" => 2,"bangs" => 1,"grassland" => 1,"benson" => 2,"revolting" => 1,"periodically" => 2,"fri" => 3,"cylists" => 4,"exasperated" => 1,"cracked" => 18,"advertised" => 2,"flooded" => 9,"badgers" => 1,"fertile" => 1,"checks" => 7,"shakespeare" => 1,"cottages" => 3,"ears" => 2,"quiet" => 11,"regarding" => 37,"desastre" => 1,"customer" => 13,"bins" => 208,"barry" => 2,"tirrington" => 1,"vehilces" => 2,"bodies" => 1,"colnbrook" => 1,"entirety" => 1,"regrettable" => 1,"precision" => 1,"markers" => 7,"faversham" => 1,"jelf" => 1,"fir" => 1,"ventured" => 1,"mot" => 14,"legitimate" => 4,"nut" => 1,"slammed" => 1,"kicked" => 2,"tues" => 1,"teams" => 4,"milton" => 10,"prevailing" => 1,"achieved" => 1,"noisily" => 1,"normally" => 16,"inhibiting" => 1,"sever" => 1,"percieved" => 1,"fails" => 3,"sep" => 9,"excessive" => 8,"yearsley" => 1,"fee

1,"moans" => 1,"worryingly" => 4,"edwardian" => 1,"received" => 23,"neighbourhoods" => 2,"seeking" => 8,"enourmess" => 1,"scheduled" => 5,"clockhouse" => 1,"frozen" => 3,"tin" => 1,"erma" => 1,"kencot" => 5,"toxic" => 4,"blood" => 2,"easterly" => 2,"g" => 2,"unchallenged" => 1,"ironworks" => 3,"slight" => 8,"deanston" => 2,"large" => 1,"andhit" => 1,"tomorrow" => 8,"speeding" => 26,"bartley" => 1,"fracturing" => 1,"conjunction" => 1,"semi" => 2,"treehouse" => 1,"stupidity" => 1,"trafficked" => 1,"questioned" => 1,"ploughed" => 6,"toll" => 3,"succumbed" => 1,"skin" => 2,"copious" => 3,"peoples" => 17,"straight" => 34,"reported" => 396,"items" => 111,"farmer" => 5,"megan" => 1,"logs" => 2,"abbotsford" => 3,"approaches" => 6,"barnstaples" => 1,"warmingham" => 2,"task" => 5,"lovely" => 16,"supposed" => 27,"wear" => 8,"scunthorpe" => 1,"betweeen" => 1,"ready" => 9,"eathline" => 1,"bankside" => 3,"immediately" => 41,"inland" => 1,"contacting" => 7,"thrapston" => 2,"riley" => 2,"cowboys" => 1

nerdale" => 1,"trunks" => 2,"heating" => 2,"beware" => 4,"narrower" => 9,"mayfair" => 1,"dentists" => 2,"experiencing" => 5,"tractor" => 7,"adhesion" => 1,"swanbridge" => 1,"selfish" => 24,"dave" => 2,"deanwater" => 1,"pruned" => 2,"snap" => 1,"shilcott" => 1,"readiness" => 1,"darkness" => 8,"prolongs" => 1,"teary" => 1,"jb" => 1,"greasy" => 1,"towpath" => 1,"seacourt" => 1,"smeared" => 1,"violated" => 1,"landmarks" => 1,"dip" => 19,"page" => 11,"swale" => 1,"staying" => 6,"inchinnan" => 1,"breakout" => 1,"hurting" => 1,"controls" => 2,"gutters" => 3,"ditch" => 18,"rotted" => 7,"attests" => 1,"size" => 38,"signage" => 35,"de" => 80,"presence" => 4,"firstaider" => 1,"reade" => 1,"hatch" => 10,"walkways" => 4,"load" => 11,"chappel" => 1,"nature" => 16,"objection" => 1,"saturday" => 41,"hopefully" => 22,"roaders" => 1,"standard" => 21,"wyke" => 1,"saturated" => 2,"tiny" => 4,"dg" => 1,"awoken" => 3,"flag" => 3,"irresponsible" => 6,"gyc" => 2,"distinctly" => 1,"renter" => 1,"offa" => 2,"de

d" => 22,"moxon" => 5,"bringing" => 11,"contains" => 2,"wheat" => 2,"flush" => 7,"malwood" => 1,"brz" => 1,"wharf" => 5,"gulls" => 4,"looking" => 58,"bont" => 2,"properties" => 73,"interrupted" => 1,"ing" => 2,"scrap" => 3,"bushey" => 2,"sliproad" => 1,"lechlade" => 1,"waist" => 1,"amnesty" => 1,"miserable" => 3,"facilities" => 14,"bod" => 1,"highfield" => 1,"indigenous" => 1,"fly" => 280,"unwilling" => 3,"workshop" => 1,"www" => 7,"dealers" => 3,"undeclared" => 1,"shooting" => 2,"garage" => 94,"southerly" => 1,"hood" => 1,"selling" => 3,"camping" => 2,"incoming" => 7,"phoenix" => 1,"responses" => 1,"grocer" => 1,"trashes" => 1,"determining" => 1,"librando" => 1,"bridgewater" => 8,"occurred" => 23,"meon" => 2,"fuse" => 1,"banbury" => 15,"lingerie" => 1,"cynics" => 1,"twerking" => 1,"carries" => 4,"cracking" => 2,"pear" => 1,"perennial" => 2,"parvis" => 2,"sound" => 10,"widnes" => 1,"charged" => 4,"representation" => 1,"wellingtons" => 2,"magor" => 2,"dome" => 2,"due" => 430,"fitted" =>



" => 2,"bagnall" => 1,"nips" => 1,"kiln" => 5,"taxpayers" => 5,"godstone" => 3,"papa" => 1,"musical" => 1,"wrecked" => 1,"ninety" => 2,"smoothed" => 1,"epilepsy" => 1,"planters" => 3,"alleway" => 2,"shook" => 2,"dropped" => 68,"shaw" => 2,"unenforced" => 1,"developments" => 4,"protrude" => 1,"sunken" => 15,"creek" => 1,"plowden" => 1,"coned" => 1,"distracting" => 2,"executive" => 2,"erddig" => 2,"prejudice" => 1,"background" => 3,"makes" => 145,"tarmaked" => 1,"hb" => 1,"authorised" => 24,"lleg" => 1,"confirming" => 1,"registry" => 1,"bitch" => 1,"zebra" => 9,"blasting" => 2,"unclear" => 4,"art" => 2,"virgin" => 8,"reg" => 11,"strewn" => 32,"refers" => 3,"wolverton" => 3,"dvsa" => 1,"requested" => 16,"alien" => 3,"roberts" => 4,"quarry" => 7,"passersby" => 1,"drunkards" => 1,"parade" => 14,"f" => 3,"ings" => 1,"utility" => 18,"employ" => 2,"checking" => 9,"payer" => 1,"ru" => 2,"yard" => 22,"defecate" => 1,"asda" => 4,"tolerated" => 2,"curbs" => 13,"raiding" => 1,"chapter" => 1,"bentle

ed" => 6,"trucks" => 32,"fingers" => 2,"delaminated" => 1,"adopted" => 8,"riddled" => 3,"grows" => 8,"fi" => 2,"portishead" => 2,"dutifully" => 1,"panicked" => 1,"strict" => 2,"wolverhampton" => 2,"squatting" => 1,"millionspf" => 1,"warmer" => 1,"riparian" => 2,"crevasse" => 1,"clainm" => 1,"disabilty" => 1,"horseboxes" => 1,"junctionof" => 1,"zcz" => 1,"hargood" => 1,"account" => 7,"cowley" => 4,"impunity" => 6,"stranded" => 1,"introduce" => 2,"enhanced" => 1,"almighty" => 1,"screws" => 3,"boundary" => 44,"unusable" => 7,"sands" => 2,"chances" => 2,"stung" => 10,"keen" => 5,"millcroft" => 3,"inspector" => 16,"disturbances" => 1,"realy" => 1,"quilt" => 1,"winchester" => 5,"afew" => 2,"drying" => 1,"earliest" => 6,"retired" => 1,"unloved" => 1,"jammed" => 2,"branch" => 7,"summit" => 1,"ferndale" => 2,"mtb" => 1,"halved" => 2,"extensions" => 3,"male" => 6,"kennington" => 8,"thousand" => 1,"zachery" => 1,"answered" => 1,"beginning" => 23,"pass" => 141,"erected" => 18,"cormac" => 1,"outten

ght" => 32,"trolly" => 1,"tipped" => 49,"allowed" => 65,"confidentiality" => 1,"night" => 177,"examples" => 6,"lot" => 112,"brompton" => 1,"pumping" => 1,"suspicions" => 1,"hindsight" => 1,"cavix" => 2,"urc" => 1,"comprises" => 1,"ub" => 1,"reddyshore" => 1,"opportunities" => 1,"scratching" => 8,"tired" => 9,"bracknell" => 1,"hertfordshire" => 1,"thirlmere" => 1,"protection" => 7,"channel" => 3,"forgotten" => 4,"unsustainable" => 1,"verbal" => 4,"alison" => 1,"cobble" => 1,"spare" => 14,"dew" => 1,"frustrated" => 7,"lister" => 1,"sate" => 1,"attractive" => 5,"unfortunately" => 32,"atrociously" => 1,"impairmenet" => 1,"diseases" => 1,"absolute" => 22,"lawn" => 11,"saints" => 2,"steady" => 1,"plastic" => 39,"monday" => 39,"apparatus" => 8,"ageing" => 1,"hid" => 1,"probem" => 1,"quedaba" => 1,"ironicly" => 1,"pancakes" => 1,"halfords" => 1,"loudly" => 1,"everyman" => 1,"poffley" => 1,"lea" => 5,"dowd" => 1,"displaced" => 3,"steps" => 51,"fullest" => 1,"copy" => 9,"please" => 593,"adhere" 



s" => 2,"dispersed" => 1,"deli" => 1,"roudn" => 1,"suffolk" => 1,"puddles" => 7,"teresa" => 1,"burford" => 4,"cultural" => 2,"summerhill" => 1,"gutter" => 14,"png" => 1,"kettering" => 2,"woman" => 19,"ages" => 6,"ingram" => 1,"ashcroft" => 1,"maxstoke" => 2,"facing" => 21,"affect" => 9,"mass" => 5,"fulsam" => 1,"bj" => 2,"lawyer" => 1,"larch" => 1,"party" => 3,"thorngrove" => 1,"poorly" => 24,"range" => 6,"barnwood" => 1,"nasty" => 19,"workmen" => 18,"sliding" => 1,"express" => 8,"iyou" => 1,"thm" => 1,"lkke" => 1,"stickers" => 6,"al" => 5,"stronger" => 1,"asb" => 4,"barnardos" => 1,"fake" => 1,"baynes" => 1,"crescent" => 46,"cleansed" => 1,"copse" => 5,"disgusting" => 41,"sheldons" => 1,"demarcate" => 1,"nat" => 2,"affected" => 21,"holland" => 3,"covering" => 22,"authorities" => 12,"neglegted" => 1,"visitor" => 6,"porthcawl" => 1,"intoxicated" => 1,"lbi" => 1,"blossom" => 1,"hauled" => 1,"omitted" => 1,"assuming" => 6,"idmiston" => 1,"wednesday" => 19,"texts" => 1,"cuttings" => 3,"lcc

"build" => 28,"anyhow" => 1,"ermont" => 1,"swallowfield" => 1,"penny" => 1,"brinnington" => 1,"foregoing" => 5,"organize" => 1,"splay" => 1,"latitude" => 1,"illegal" => 87,"hobart" => 1,"deddington" => 1,"abetted" => 1,"ascott" => 2,"smash" => 2,"container" => 5,"urban" => 2,"sidewalks" => 1,"upstairs" => 3,"terminal" => 1,"succession" => 2,"degradation" => 3,"automatically" => 1,"laminate" => 1,"chevolet" => 1,"wanna" => 2,"mmm" => 1,"wasn" => 30,"gg" => 1,"ultimately" => 5,"medway" => 2,"containing" => 4,"santa" => 1,"elms" => 5,"expense" => 12,"basildon" => 1,"umber" => 1,"uxmore" => 1,"suprized" => 1,"gather" => 4,"tubes" => 1,"inwas" => 1,"wi" => 2,"sensibly" => 2,"dealership" => 2,"appropriately" => 2,"wxr" => 1,"passable" => 5,"trades" => 2,"penarth" => 1,"gouged" => 1,"headington" => 2,"shambles" => 3,"reimbursement" => 5,"co" => 18,"winsters" => 1,"dp" => 1,"transient" => 1,"fenceing" => 1,"police" => 131,"contraption" => 1,"syndrome" => 1,"stinging" => 6,"epileptic" => 1,"ban

ific" => 8,"distracted" => 3,"eight" => 9,"materials" => 25,"hope" => 43,"establishment" => 2,"richardson" => 1,"sanitary" => 1,"motts" => 1,"naturally" => 1,"nav" => 1,"commandeering" => 1,"giant" => 3,"premises" => 14,"ornamental" => 1,"flux" => 1,"timely" => 1,"relentless" => 1,"unseat" => 2,"alledges" => 2,"bromwich" => 4,"richfield" => 1,"removal" => 14,"minsterlovell" => 1,"beneath" => 5,"pearce" => 1,"privet" => 1,"sloped" => 4,"reached" => 9,"resent" => 3,"reference" => 22,"pismere" => 1,"lbh" => 1,"occurence" => 1,"galsworthy" => 3,"marston" => 5,"insurers" => 1,"barnfield" => 3,"routes" => 8,"stops" => 12,"tiger" => 1,"rtm" => 1,"comprising" => 1,"towing" => 2,"allotment" => 6,"rabbish" => 1,"laziest" => 1,"frosty" => 1,"appropitr" => 1,"smelly" => 3,"frangible" => 2,"restocked" => 1,"tennants" => 1,"belonging" => 18,"maintainance" => 1,"unauthorised" => 3,"roof" => 6,"personally" => 9,"maintenance" => 43,"painter" => 1,"reporting" => 52,"ct" => 4,"culpa" => 1,"stone" => 15,"

ott" => 2,"winterbourne" => 1,"atal" => 1,"choose" => 12,"mattresses" => 14,"capture" => 1,"queing" => 2,"month" => 69,"saving" => 1,"compemsation" => 1,"barby" => 1,"brows" => 2,"gouges" => 1,"liek" => 1,"devoid" => 1,"paint" => 32,"mended" => 5,"progessively" => 1,"endure" => 3,"titchy" => 4,"fires" => 1,"witnessed" => 52,"continualled" => 1,"imbecile" => 1,"arthritis" => 2,"restricted" => 25,"glorious" => 2,"august" => 17,"seek" => 3,"neer" => 1,"dig" => 7,"inevitably" => 4,"fines" => 16,"peregrine" => 5,"examination" => 1,"grosvenor" => 1,"respectful" => 1,"rudeness" => 1,"h" => 1,"dope" => 1,"attractions" => 1,"waited" => 4,"reason" => 43,"tight" => 23,"hrcs" => 1,"carriage" => 5,"peter" => 4,"haphazardly" => 2,"vital" => 1,"copses" => 1,"harlow" => 1,"git" => 1,"flowers" => 6,"milford" => 3,"tadmarton" => 1,"helpful" => 6,"winter" => 47,"residue" => 2,"principle" => 2,"write" => 10,"proactive" => 3,"lay" => 19,"dislocating" => 1,"sell" => 9,"wildlife" => 13,"druid" => 1,"norman" 

1,"residences" => 2,"priory" => 3,"sacred" => 1,"sizeable" => 1,"frontages" => 1,"clio" => 1,"shorter" => 2,"ref" => 34,"sun" => 6,"unhealthy" => 4,"set" => 42,"albion" => 1,"davenport" => 1,"tampered" => 1,"selsdon" => 3,"ovoid" => 1,"slat" => 1,"northway" => 3,"containers" => 7,"zigzags" => 1,"turnaround" => 3,"intolerable" => 3,"internal" => 2,"caldermill" => 1,"tram" => 4,"snag" => 1,"shining" => 1,"lawrence" => 5,"model" => 1,"wates" => 1,"atracting" => 1,"deadly" => 2,"truing" => 1,"surprise" => 3,"screamed" => 2,"sutton" => 5,"rquest" => 1,"transact" => 1,"plantation" => 11,"challenge" => 1,"flow" => 40,"mobile" => 12,"ones" => 35,"access" => 386,"papers" => 3,"intimidation" => 1,"despite" => 70,"tandem" => 1,"realigned" => 2,"legality" => 1,"dartmouth" => 1,"polluting" => 2,"operation" => 4,"chase" => 8,"wisbech" => 1,"growth" => 16,"heald" => 1,"clearer" => 1,"bedstead" => 1,"kim" => 2,"submitting" => 3,"handler" => 2,"smoking" => 7,"spots" => 10,"pavers" => 2,"allocating" => 

buys" => 1,"portwood" => 2,"zero" => 7,"flytippers" => 4,"attarted" => 1,"operates" => 2,"downloaded" => 1,"heaved" => 1,"su" => 2,"hong" => 1,"true" => 11,"memory" => 1,"fernclough" => 1,"unlit" => 8,"considerable" => 33,"saftey" => 1,"promoting" => 2,"chemist" => 1,"pace" => 1,"attacked" => 2,"wintney" => 1,"ownership" => 5,"witch" => 1,"unstable" => 3,"transferred" => 1,"blue" => 37,"woking" => 4,"shrug" => 1,"water" => 227,"poholes" => 1,"burnthouse" => 1,"fro" => 1,"endless" => 4,"jason" => 3,"cats" => 8,"whisper" => 1,"mallard" => 1,"clise" => 1,"dragged" => 5,"stroller" => 1,"servant" => 1,"bentmead" => 1,"sedgemoor" => 1,"peraired" => 1,"banning" => 2,"abbot" => 1,"unsecured" => 3,"sedgley" => 2,"able" => 135,"rubber" => 2,"recognize" => 1,"giles" => 5,"asegurarme" => 1,"widen" => 1,"cap" => 1,"veared" => 1,"edgewood" => 1,"raging" => 1,"flowed" => 1,"hermit" => 1,"neighbouring" => 13,"fuller" => 2,"imagine" => 11,"leverstock" => 1,"sparrows" => 1,"officers" => 20,"stunted" => 

erfield" => 2,"told" => 104,"attempting" => 12,"pagham" => 1,"constriction" => 1,"dried" => 6,"footage" => 6,"belonged" => 4,"clip" => 3,"spells" => 3,"bloom" => 1,"enhance" => 3,"radiator" => 1,"accesss" => 1,"prebooked" => 1,"chelsea" => 1,"land" => 125,"twenty" => 1,"visit" => 36,"meetings" => 4,"scatter" => 2,"confrontations" => 1,"manchine" => 1,"twisting" => 1,"patrolled" => 2,"shaped" => 3,"wickham" => 4,"sclorosis" => 1,"whelechair" => 1,"calder" => 4,"anytime" => 4,"byngs" => 1,"norm" => 2,"lip" => 4,"referring" => 2,"warping" => 1,"happier" => 1,"vnf" => 1,"shrivenham" => 2,"vans" => 130,"hgv" => 13,"lidded" => 1,"councillor" => 11,"inception" => 1,"classification" => 1,"mcclaren" => 1,"grief" => 1,"chain" => 4,"seats" => 5,"esk" => 3,"switch" => 1,"deleted" => 1,"abandonned" => 1,"fascia" => 1,"ferrers" => 1,"entitled" => 2,"rugs" => 1,"shropshire" => 1,"streetscene" => 3,"parallel" => 6,"comining" => 1,"westwood" => 1,"mondeo" => 1,"shortcut" => 4,"thoight" => 1,"intentiona

Just for interest, let's see which word occurs most often.

In [169]:
# Look up the lexicon once, then find the word with the highest count
lex = lexicon(crps);
top_word = collect(keys(lex))[argmax(collect(values(lex)))];
print("Key = ", top_word, ", frequency = ", lexical_frequency(crps, top_word))

Key = road, frequency = 0.030107630754864313

It turns out that the word "road" occurs most often, making up about 3% of all word occurrences in the corpus. Now that we have our lexicon, we can create a Document Term Matrix. This contains the number of times each word occurs in each report. Note that this is slightly different from what they created in the book, where the matrix had values $X_{i,j} = 1$ if word $j$ occurs in document $i$ and $0$ if not.
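To make the difference between the two matrices concrete, here is a minimal sketch in plain Julia. It does not use TextAnalysis; the three tokenised "reports" and the names `docs`, `vocab`, `counts` and `binary` are made up for illustration.

```julia
# Toy corpus: three short, already-tokenised reports (hypothetical data).
docs = [["pothole", "road", "road"],
        ["bin", "road"],
        ["bin", "bin", "bin"]]

# Sorted vocabulary, one column per term.
vocab = sort(unique(vcat(docs...)))

# Count matrix: entry [i, j] is how often term j occurs in document i.
counts = [count(isequal(term), doc) for doc in docs, term in vocab]

# Binary matrix: entry [i, j] is 1 if term j occurs in document i at all.
binary = Int.(counts .> 0)
```

The count matrix is the kind of input a multinomial model works with, while the binary matrix matches the book's definition and is what a Bernoulli model expects.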

In [170]:
X = DocumentTermMatrix(crps)

DocumentTermMatrix(
  [14  ,     1]  =  1
  [93  ,     1]  =  2
  [123 ,     1]  =  1
  [134 ,     1]  =  1
  [137 ,     1]  =  1
  [140 ,     1]  =  2
  [183 ,     1]  =  1
  [198 ,     1]  =  2
  [214 ,     1]  =  2
  [230 ,     1]  =  1
  [235 ,     1]  =  1
  [239 ,     1]  =  2
  ⋮
  [1220, 12358]  =  1
  [1414, 12358]  =  1
  [1597, 12358]  =  1
  [1725, 12358]  =  1
  [1795, 12358]  =  1
  [1916, 12358]  =  1
  [2050, 12358]  =  1
  [2055, 12358]  =  1
  [2183, 12358]  =  1
  [2298, 12358]  =  1
  [2430, 12358]  =  2
  [2446, 12358]  =  1
  [148 , 12359]  =  1, ["a", "aa", "aaxlcphbdqaffa", "ab", "abandon", "abandoned", "abandoning", "abandonned", "abandons", "abbey"  …  "zero", "zig", "zigzag", "zigzags", "zimmer", "zne", "zome", "zona", "zone", "zoned"], Dict("piecemeal" => 7992,"deventary" => 3066,"chicanes" => 1879,"bidder" => 1057,"abrest" => 29,"rises" => 9152,"hampshire" => 4875,"lyminton" => 6495,"gathered" => 4549,"underground" => 11404…))

We can extract the matrix and generate a new binary matrix by setting each value to 1 if the word occurs in the report and 0 otherwise. This will be fed to our Bernoulli Naive Bayes classifier. We'll also create our label vector, $y$, which codes the five categories as the integers 0 to 4.


In [171]:
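using SparseArrays  # provides spzeros
X_mat = dtm(X);
# Binary indicator matrix: 1 where a word occurs in a report, 0 otherwise
X_bin = spzeros(size(X_mat)[1],size(X_mat)[2]);
X_bin[X_mat .>0] .= 1;
# Integer labels: Classes[1] is coded as 0, Classes[2] as 1, and so on
y = zeros(length(df_sub[!,:category_coded]));
y[df_sub[:,:category_coded] .== Classes[2]] .= 1;
y[df_sub[:,:category_coded] .== Classes[3]] .= 2;
y[df_sub[:,:category_coded] .== Classes[4]] .= 3;
y[df_sub[:,:category_coded] .== Classes[5]] .= 4;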

### Bernoulli Naive Bayes Classifier

Now that we have our matrix, we can implement a Naive Bayes classifier. Recall Bayes' theorem:

$$p(c_{k}|x) = \frac{p(x|c_{k})p(c_{k})}{p(x)}.$$

Let's say we have a report $x$ which has category $c_{k}$. We assume that the words $x_{j}$ are independent of each other given the category (this is why the classifier is called naive) and that the data are distributed according to a multivariate Bernoulli distribution (the features are binary). To calculate $p(x|c_{k})$ we take the product

$$p(x|c_{k}) = \prod_{j}\theta_{jk}^{x_{j}}(1 - \theta_{jk})^{(1 - x_{j})},$$

where $\theta_{jk}$ is the probability that an individual word, $x_{j}$, is present in a report, $x$, with category $c_{k}$. Taking logs turns the product into a sum:

$$\log(p(x|c_{k})) = \sum_{j}x_{j}\log\left(\theta_{jk}\right) + \left(1-x_{j}\right)\log(1 - \theta_{jk}).$$

Now that we have $p(x|c_{k})$ we can compute $p(c_{k}|x)$, which is what we're interested in: the probability of a report being category $c_{k}$. Since $p(x)$ is the same for every category, it can be dropped when comparing them, so the prediction is computed using

$$\hat{y} = \text{argmax}_{k}\left[\log(p(x|c_{k})) + \log(p(c_{k}))\right].$$
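As a quick sanity check of the formulas above, here is a small sketch that scores one binary report against two classes. The values in `theta` and `prior` are invented for illustration, and the classes are indexed from 1 here rather than from 0 as in our label vector.

```julia
# Hypothetical parameters for 2 classes and 3 words.
# theta[k, j] = probability that word j appears in a report of class k.
theta = [0.8 0.1 0.5;
         0.2 0.7 0.5]
prior = [0.5, 0.5]

# A single binary report: words 1 and 3 present, word 2 absent.
x = [1, 0, 1]

# Bernoulli log-likelihood per class: present words contribute log(theta),
# absent words contribute log(1 - theta).
log_lik = log.(theta) * x .+ log.(1 .- theta) * (1 .- x)

# Add the log prior and take the argmax over classes.
score = log.(prior) .+ log_lik
y_hat = argmax(score)
```

Class 1 has likelihood $0.8 \times 0.9 \times 0.5 = 0.36$ and class 2 has $0.2 \times 0.3 \times 0.5 = 0.03$, so with equal priors the report is assigned to class 1. Note that the absent word matters too: it contributes $1 - \theta_{jk}$ to each product.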

This algorithm is cheap to train: given a large number of reports, all we do is count how many reports in each category contain each word. One downside is that the unsmoothed estimates can be exactly $0$ or $1$, which is not desirable for generalisability. Let's say that the word "pavement" only occurs in the Pavements/footpaths reports in our training set. Its estimated probability will be $0$ for every other category, so if a new report about a pothole mentions that the pothole is right next to the pavement, the likelihood for the Potholes category becomes $0$ and the report will be misclassified. We therefore introduce $2$ smoothing parameters, $\alpha$ and $\beta$, like so

$$\theta_{jk} = \frac{n_{jk}+\alpha}{n_{k} + \beta},$$

where $n_{jk}$ is the number of category $c_{k}$ reports that contain the word $x_{j}$ and $n_{k}$ is the number of reports in class $c_{k}$. Setting $\alpha = 1$ and $\beta = 2$ gives Laplace smoothing, while setting $\alpha$ and $\beta$ to $0$ is equivalent to the maximum likelihood (ML) estimator

$$\theta_{\text{ML}} = \text{argmax}_{\theta}\, p(D|\theta) = \frac{n_{jk}}{n_{k}}.$$
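To see what the smoothing does numerically, here is a small sketch using made-up counts. The helper `smoothed_theta` is a hypothetical name introduced just for this illustration; it simply evaluates the formula for $\theta_{jk}$ above.

```julia
# Smoothed Bernoulli estimate, theta = (n_jk + alpha) / (n_k + beta).
smoothed_theta(n_jk, n_k; alpha=1, beta=2) = (n_jk + alpha) / (n_k + beta)

# Hypothetical counts: a word seen in 0 of 50 reports of one class
# and in all 50 of 50 reports of another.
theta_ml_absent = smoothed_theta(0, 50; alpha=0, beta=0)    # ML estimate
theta_ml_always = smoothed_theta(50, 50; alpha=0, beta=0)   # ML estimate
theta_absent = smoothed_theta(0, 50)     # smoothed: 1/52
theta_always = smoothed_theta(50, 50)    # smoothed: 51/52

# An unsmoothed zero gives log(0) = -Inf, which rules the class out entirely,
# whereas the smoothed estimate stays finite.
log_ml = log(theta_ml_absent)
log_smoothed = log(theta_absent)
```

With smoothing, a word that was never seen in a class makes that class less likely but no longer impossible, and the same holds for $1 - \theta_{jk}$ when a word appeared in every report of a class.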

In [172]:
"""
    train_BernouilliNB(X, y, alpha=1, beta=2)

A basic function to train a Bernoulli Naive Bayes classifier.

It takes as input:
X - a binary training feature matrix of size [n_samples,n_features],
y - labels of length n_samples, coded as 0:(n_classes-1),
alpha, beta - Laplace smoothing factors, default = 1 & 2, respectively.

It outputs:
log_prior - the log prior probability of each class, array size [n_classes,]
cond_prob - the conditional probability of each word given each class, array size [n_classes,n_features].

The distribution p(x_i|y_k) is calculated as p(i|y_k)^x_i * (1 - p(i|y_k))^(1-x_i) = theta_ik^x_i * (1-theta_ik)^(1-x_i),
where:
    theta_ik = (n_ik+alpha)/(n_k + beta),
n_ik is the number of class y_k samples that contain x_i, and n_k is the number of samples in class y_k.
alpha & beta are smoothing parameters & are set as alpha = 1, beta = 2 as default.
This means that Laplace smoothing is applied as default.

Classification is performed by maximising the log likelihood; the log transform of cond_prob is applied in predict_BernouilliNB.
"""
function train_BernouilliNB(X,y,alpha=1,beta=2)
    if maximum(X) > 1
        throw(ArgumentError("The feature matrix, X, must be binary"))
    end
    # Calculate the number of features and classes
    n_class = Int64(maximum(y)+1);
    n_words = size(X)[2];
    # Calculate n_ik = the number of class y_k samples that contain word x_i
    n_i = zeros(n_class,n_words);
    for k in 0:(n_class-1)
        n_i[k+1,:] = sum(X[y.==k,:],dims = 1);
    end
    # Calculate the number of samples in each class, used for the prior p(y_k)
    n_k = [sum(y.==k) for k in 0:(n_class-1)];
    # Calculate the total number of samples
    n = size(X)[1];
    # Now we can calculate the smoothed prior p(y_k) & the log prior
    prior = (n_k .+ alpha)./(n + beta);
    log_prior = log.(prior);
    # Finally calculate the conditional probability p(i|y_k)
    cond_prob = (n_i.+alpha)./(n_k.+beta);
    return log_prior,cond_prob
end

train_BernouilliNB (generic function with 3 methods)

In [173]:
"""
    predict_BernouilliNB(X_test, log_prior, cond_prob)

A basic function to predict a set of reports using a pre-trained Bernoulli Naive Bayes classifier.
It takes as input:
X_test - a binary test feature matrix of size [n_samples,n_features],
log_prior - the log prior probability of each class, array size [n_classes,]
cond_prob - the conditional probability of each word given each class, array size [n_classes,n_features].

The output is a vector with the predicted class of each sample. It assumes classes are 0:(n_classes-1)

The output is calculated using argmax_k [log(p(y_k)) + sum(x_i*log(p(i|y_k)) + (1-x_i)*log(1-p(i|y_k)))]
"""
function predict_BernouilliNB(X_test,log_prior,cond_prob)
    if maximum(X_test) > 1
        throw(ArgumentError("The feature matrix, X_test, must be binary"))
    end
    # Take the logs once, transposed to size [n_features,n_classes]
    log_theta = transpose(log.(cond_prob));
    log_not_theta = transpose(log.(1 .- cond_prob));
    pred = zeros(size(X_test)[1]);
    for i in 1:size(X_test)[1]
        x = X_test[i,:];
        # Present words contribute log(theta), absent words log(1 - theta)
        score = log_prior[:] .+ sum(x.*log_theta,dims=1)[:] .+ sum((1 .- x).*log_not_theta,dims=1)[:];
        pred[i] = argmax(score)-1;
    end
    return pred
end

predict_BernouilliNB (generic function with 1 method)

In [174]:
using MLJ
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234);
y_train = Int64.(y[train]);
y_test = Int64.(y[test]);
X_train = X_bin[train,:];
X_test = X_bin[test,:];
println(string("Size of X_train = ", size(X_train), ", size of y_train = ", size(y_train)));
println(string("Size of X_test = ", size(X_test), ", size of y_test = ", size(y_test)));

Size of X_train = (1712, 12359), size of y_train = (1712,)
Size of X_test = (734, 12359), size of y_test = (734,)


In [175]:
log_prior,cond_prob = train_BernouilliNB(X_train,y_train);

In [176]:
pred = predict_BernouilliNB(X_test,log_prior,cond_prob);
accuracy = mean(y_test .== pred)

0.7888283378746594

In [177]:
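n_class = Int64(maximum(y))+1; # number of classes, labels are coded 0:(n_class-1)
# Row i, column j: fraction of reports with true class i-1 that were predicted as class j-1
confusion_matrix_norm = [sum((y_test .==(i-1)) .& (pred .==(j-1)))./sum(y_test .==(i-1)) for i in 1:n_class,  j in 1:n_class]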

5×5 Array{Float64,2}:
 0.970588   0.0147059  0.00980392  0.00490196  0.0
 0.0112994  0.983051   0.00564972  0.0         0.0
 0.227848   0.14557    0.594937    0.0316456   0.0
 0.0859375  0.0        0.03125     0.882812    0.0
 0.208955   0.119403   0.552239    0.119403    0.0

In [215]:
"""
    train_MultinomialNB(X, y, alpha=1)

A basic function to train a Multinomial Naive Bayes classifier.

It takes as input:
X - a training feature matrix of word counts, size [n_samples,n_features],
y - labels of length n_samples, coded as 0:(n_classes-1),
alpha - smoothing factor, default = 1.

It outputs:
log_prior - the log of the class weights used as the prior, array size [n_classes,1]
log_cond_prob - the log conditional probability of each word given each class, array size [n_classes,n_features].

The distribution is parameterised by vectors (theta_k1,..., theta_kn) for each class y_k, where n is the number of features.
theta_ki is the probability p(x_i|y_k) of feature i appearing in a sample belonging to class y_k.

theta_ki is estimated by a smoothed version of the maximum likelihood estimator:
theta_ki = (N_ki + alpha)/(N_k + alpha*n)

where N_ki is the number of times feature x_i appears in samples of class y_k and N_k is the total count of all features in class y_k.
n is the number of terms in the vocabulary.
alpha is the smoothing parameter & is set to alpha = 1 as default.
This means that Laplace smoothing is applied as default.

Classification is performed by maximising the log likelihood, so we apply the log transform to the parameter matrix in this function.
"""
function train_MultinomialNB(X,y,alpha=1)
    # Calculate the number of features and classes
    n_class = Int64(maximum(y)+1);
    n_words = size(X)[2];
    # Calculate n_ik = the number of occurrences of word x_i in class y_k
    n_i = zeros(n_class,n_words);
    for k in 0:(n_class-1)
        n_i[k+1,:] = sum(X[y.==k,:],dims = 1);
    end
    # Calculate the total word count in each class
    n_k = sum(n_i,dims = 2);
    # Calculate the total number of samples
    n = size(X)[1];
    # Class weights used as the prior. Note that n_k holds word totals, so these
    # are proportional to the word mass of each class rather than its share of documents.
    prior = (n_k .+ alpha)./(n_class+n);
    log_prior = log.(prior);
    # Finally calculate the smoothed conditional probability p(i|y_k)
    cond_prob = (n_i.+alpha)./(n_k.+alpha*n_words);
    log_cond_prob = log.(cond_prob);
    return log_prior,log_cond_prob
end

train_MultinomialNB (generic function with 2 methods)

In [216]:
"""
    predict_MultinomialNB(X_test, log_prior, log_cond_prob)

A basic function to predict a set of reports using a pre-trained Multinomial Naive Bayes classifier.
It takes as input:
X_test - a test feature matrix of size [n_samples,n_features],
log_prior - the log prior of each class, array size [n_classes,1]
log_cond_prob - the log conditional probability of each word given each class, array size [n_classes,n_features].

The output is a vector with the predicted class of each sample. It assumes classes are 0:(n_classes-1)

The output is calculated using argmax_k [log(p(y_k)) + sum(x_i*log(p(i|y_k)))].
Note that the test counts are binarised before scoring, so x_i is 1 if word i is present and 0 otherwise.
"""
function predict_MultinomialNB(X_test,log_prior,log_cond_prob)
    # Binarise the test counts: each word present contributes once to the score
    X_test_bin = zeros(size(X_test));
    X_test_bin[X_test .>0] .= 1;
    # Transpose once to size [n_features,n_classes]
    log_theta = transpose(log_cond_prob);
    pred = zeros(size(X_test)[1]);
    for i in 1:size(X_test)[1]
        score = log_prior[:] .+ sum(X_test_bin[i,:].*log_theta,dims=1)[:];
        pred[i] = argmax(score)-1;
    end
    return pred
end

predict_MultinomialNB (generic function with 1 method)

In [218]:
X_train_mnb = X_mat[train,:];
X_test_mnb = X_mat[test,:];
log_prior_mnb,log_cond_prob_mnb = train_MultinomialNB(X_train_mnb,y_train);
pred_mnb = predict_MultinomialNB(X_test_mnb,log_prior_mnb,log_cond_prob_mnb);
accuracy = mean(y_test .== pred_mnb)

0.8569482288828338

In [219]:
confusion_matrix_norm = [sum((y_test .==(i-1)) .& (pred_mnb .==(j-1)))./sum(y_test .==(i-1)) for i in 1:n_class,  j in 1:n_class]

5×5 Array{Float64,2}:
 0.965686   0.0147059  0.0147059   0.00490196  0.0
 0.0        0.99435    0.00564972  0.0         0.0
 0.158228   0.101266   0.664557    0.0443038   0.0316456
 0.0078125  0.0        0.0234375   0.96875     0.0
 0.0447761  0.0597015  0.343284    0.149254    0.402985

In [314]:
"""
    train_ComplementNB(X, y, alpha=ones(size(X)[2]))

A basic function to train a Complement Naive Bayes classifier.

It takes as input:
X - a training feature matrix of word counts, size [n_samples,n_features],
y - labels of length n_samples, coded as 0:(n_classes-1),
alpha - smoothing factor for each word, default = ones(size(X)[2]).

It outputs:
log_prior - the log of the class weights used as the prior, array size [n_classes,1]
w_ki - weight matrix calculated from the complement of each class, array size [n_classes,n_features].

The weight matrix is calculated using the following:
theta_ki = (alpha_i + sum_{j not in c_k} d_ij)/(alpha + sum_{j not in c_k} sum_{l} d_lj)

w_ki = log(theta_ki)

The summations are performed over all documents j not in class c_k, d_ij are the counts of term i in document j.
alpha = sum_{i} alpha_i

alpha is the vector of smoothing parameters & is set to alpha_{i} = 1 for all i as default.
This means that Laplace smoothing is applied as default.
"""
function train_ComplementNB(X,y,alpha=ones(size(X)[2]))
    # Sum all of the alpha_i
    alpha_sum = sum(alpha);
    # Calculate the number of features and classes
    n_class = Int64(maximum(y)+1);
    n_words = size(X)[2];
    # Calculate n_ik = the number of occurrences of word x_i in class y_k
    n_i = zeros(n_class,n_words);
    for k in 0:(n_class-1)
        n_i[k+1,:] = sum(X[y.==k,:],dims = 1);
    end
    # Now we can calculate the theta_ki values
    theta_ki = zeros(n_class,n_words);
    for k in 1:(n_class)
        # Consider the complement of class k
        rows = setdiff(1:n_class,k);
        # Calculate theta_ki from the word counts of all other classes
        theta_ki[k,:] = (alpha[:] .+ sum(n_i[rows,:],dims=1)[:])./(alpha_sum .+ sum(n_i[rows,:]));
    end
    # Take the log to get the weight values
    w_ki = log.(theta_ki);
    # Calculate the total word count in each class
    n_k = sum(n_i,dims = 2);
    # Calculate the total number of samples
    n = size(X)[1];
    # Class weights used as the prior. Note that n_k holds word totals, so these
    # are proportional to the word mass of each class rather than its share of documents.
    prior = n_k./n;
    log_prior = log.(prior);
    return log_prior,w_ki;
end

train_ComplementNB (generic function with 2 methods)

In [315]:
"""
    predict_ComplementNB(X_test, log_prior, w_ki)

A basic function to predict a set of reports using a pre-trained Complement Naive Bayes classifier.
It takes as input:
X_test - a test feature matrix of word counts, size [n_samples,n_features],
log_prior - the log prior of each class, array size [n_classes,1]
w_ki - the weight matrix, array size [n_classes,n_features].

The output is a vector with the predicted class of each sample. It assumes classes are 0:(n_classes-1)

The output is calculated using argmax_k [log(p(y_k)) - sum(x_i*w_ki)],
i.e. we pick the class whose complement fits the document worst.
"""
function predict_ComplementNB(X_test,log_prior,w_ki)
    # Transpose once to size [n_features,n_classes]
    w = transpose(w_ki);
    pred = zeros(size(X_test)[1]);
    for i in 1:size(X_test)[1]
        score = log_prior[:] .- sum(X_test[i,:].*w,dims=1)[:];
        pred[i] = argmax(score)-1;
    end
    return pred
end

predict_ComplementNB (generic function with 1 method)
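
To illustrate how the complement weights behave, here is a small sketch on made-up per-class term counts. The numbers in `n_i` and the document `x` are hypothetical, the prior term is left out for simplicity, and classes are indexed from 1.

```julia
# Hypothetical per-class term counts: n_i[k, i] = count of term i in class k.
n_i = [4 0 1;
       0 3 1;
       1 1 5]
alpha = ones(3)   # one smoothing value per term

n_class, n_words = size(n_i)
w = zeros(n_class, n_words)
for k in 1:n_class
    others = setdiff(1:n_class, k)              # every class except k
    comp = vec(sum(n_i[others, :], dims=1))     # term counts in the complement of k
    w[k, :] = log.((alpha .+ comp) ./ (sum(alpha) + sum(comp)))
end

# A document that contains term 1 twice. Ignoring the prior, the predicted
# class is the one with the smallest sum(x .* w[k, :]).
x = [2, 0, 0]
scores = w * x
best = argmin(scores)
```

Term 1 is rare outside class 1, so the complement of class 1 gives it the lowest weight ($\log(2/14)$) and the document is assigned to class 1. Each weight describes how well a word fits everything except class $k$, which is why the smallest score wins.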

In [316]:
log_prior_cnb,w_ki = train_ComplementNB(X_train_mnb,y_train);

In [317]:
pred_cnb = predict_ComplementNB(X_test_mnb,log_prior_cnb,w_ki);
accuracy = mean(y_test .== pred_cnb)

0.829700272479564

In [307]:
i = 2;
score =sum(X_test_mnb[i,:].*transpose(w_ki),dims=1)[:]

5-element Array{Float64,1}:
 -342.5339898079385
 -340.97053538026876
 -345.87434820299666
 -358.1644184737997
 -344.6486514265483

In [298]:
argmin(sum(X_test_mnb[i,:].*transpose(w_ki),dims=1)[:])

2
