# ham_or_spam    

Julia example of Naive Bayes spam mail classification.   

## Bayes Theorem

Although it is arguably one of the simplest ideas used in machine learning, [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) is powerful and practical in applications such as spam mail classification, recommendation systems, and real-time prediction.


$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$


In the equation, $P$ stands for probability. $P(A)$ reads "the probability of Event A occurring" while $P(A|B)$ reads "the probability of Event A occurring *provided* Event B happened."  

Here is a practical Bayes Theorem application to solve any parent's problem:

> What is the chance that my kid will get sick if another kid she plays with has a sniffle?

Let `P(Sick)` stand for how often kids of the same age get sick in a given month, and `P(Sniffle)` for how often they have sniffles.

**Example**: 20% of kids aged 3-5 in the California Bay Area are sick and contagious in any given November (`P(Sick) = 0.2`), and even more of them (30%) have sniffles (`P(Sniffle) = 0.3`). It is known that 80% of all sick kids have sniffles (`P(Sniffle|Sick) = 0.8`).

```
P(Sick|Sniffle) = (P(Sniffle|Sick) * P(Sick)) / P(Sniffle)
                = 0.8 * 0.2 / 0.3
                = 0.16 / 0.3
                = 0.53333...
```

Provided the kid has a sniffle, there is about a **53%** chance that she is sick.

In other words, the formula gives us the “forward” probability, `P(Sick|Sniffle)`, when we know the “backward” one, `P(Sniffle|Sick)` (or vice versa).
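
As a quick check, the arithmetic above can be reproduced in a few lines of Julia. The helper name `bayes` is made up for this illustration; it is not part of any package.

```julia
# Hypothetical helper: posterior P(A|B) from the likelihood P(B|A),
# the prior P(A), and the evidence P(B)
bayes(p_b_given_a, p_a, p_b) = p_b_given_a * p_a / p_b

p_sick = 0.2                 # P(Sick)
p_sniffle = 0.3              # P(Sniffle)
p_sniffle_given_sick = 0.8   # P(Sniffle|Sick)

p_sick_given_sniffle = bayes(p_sniffle_given_sick, p_sick, p_sniffle)
println(round(p_sick_given_sniffle, digits=2))   # prints 0.53
```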

## Naive Bayes Classifier

A Naive Bayes classifier is essentially a repeated application of Bayes' Theorem.
It calculates the probability of every input...
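
To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over word-count features. It is an illustration under assumptions, not this notebook's own implementation: the names `train_nb` and `predict_nb`, the add-one (Laplace) smoothing, and the toy data are all invented for the example. The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities can simply be added up.

```julia
# Hypothetical helper: train on a (documents x words) count matrix X and a label vector y.
# Returns the log priors and the per-class log word probabilities.
function train_nb(X::AbstractMatrix, y::AbstractVector)
    classes = sort(unique(y))
    log_prior = zeros(length(classes))
    log_likelihood = zeros(length(classes), size(X, 2))
    for (k, c) in enumerate(classes)
        rows = findall(==(c), y)
        log_prior[k] = log(length(rows) / length(y))
        # Add-one smoothing so a word unseen in a class never gets probability zero
        word_counts = vec(sum(X[rows, :], dims=1)) .+ 1
        log_likelihood[k, :] = log.(word_counts ./ sum(word_counts))
    end
    (classes=classes, log_prior=log_prior, log_likelihood=log_likelihood)
end

# Hypothetical helper: pick the class with the highest log posterior for one count vector x
function predict_nb(model, x::AbstractVector)
    scores = model.log_prior .+ model.log_likelihood * x
    model.classes[argmax(scores)]
end

# Toy data: columns are counts of the words "free", "money", "meeting"
X = [3 2 0;   # spam
     2 3 0;   # spam
     0 0 3;   # ham
     0 1 2]   # ham
y = [1, 1, 0, 0]   # 1 = spam, 0 = ham

model = train_nb(X, y)
println(predict_nb(model, [2, 1, 0]))   # prints 1 (spam)
println(predict_nb(model, [0, 0, 2]))   # prints 0 (ham)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.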

In [29]:
# Activate current env
using Pkg
Pkg.activate(".")

"/Users/pieohpah/Code/julia/ham_spam/Project.toml"

In [None]:
using DataStructures
using Formatting

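# Return all (key => count) pairs, most frequent first
most_common(c::Accumulator) = most_common(c, length(c))
# Return at most the `k` most frequent pairs (fewer if the accumulator is smaller)
most_common(c::Accumulator, k) = sort(collect(c), by=kv->kv[2], rev=true)[1:min(k, length(c))]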

"Return `true` if `str` is not a plain decimal number such as `42`, `-3.5`, or `.5`."
function isalpha(str)
    re = r"^[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)$"
    !occursin(re, str)
end

"Make a bag of words from email text."
function make_dictionary(root_dir)
    all_words = String[]
    emails = [joinpath(root_dir, f) for f in readdir(root_dir)]
    for mail in emails
        # eachline opens the file, iterates over its lines, and closes it afterwards
        for line in eachline(mail)
            append!(all_words, split(line))
        end
    end

    bag = counter(all_words)

    # Collect the keys first so the accumulator is not modified while iterating over it
    for item in collect(keys(bag))
        # Remove numeric tokens and single-character tokens
        if !isalpha(item) || length(item) == 1
            reset!(bag, item)
        end
    end
    # Keep only the 3000 most common words
    most_common(bag, 3000)
end

dict = make_dictionary("machine-learning-101/chapter1/train-mails")
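
The same filtering and ranking can be tried without any mail files on disk. The sketch below uses only Base Julia and a plain `Dict` instead of `DataStructures.counter`. The function name `top_words` and the sample lines are invented for illustration; the regular expression is the one used by `isalpha` above.

```julia
# Hypothetical helper: count words across lines, drop numbers and
# single-character tokens, and return up to k (word => count) pairs,
# most frequent first
function top_words(lines, k)
    number_re = r"^[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)$"
    counts = Dict{String,Int}()
    for line in lines, word in split(line)
        if !occursin(number_re, word) && length(word) > 1
            counts[word] = get(counts, word, 0) + 1
        end
    end
    ranked = sort(collect(counts), by=last, rev=true)
    ranked[1:min(k, length(ranked))]
end

lines = ["win money now", "money for you 100 %", "meeting at 10 now"]
for (word, n) in top_words(lines, 2)
    println(word, " => ", n)
end
```

With this sample, `money` and `now` each occur twice and every other kept word once, so those two make up the top 2. Their relative order is not guaranteed, because it depends on the iteration order of the `Dict`.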


In [165]:
# Playing with count
words = ["foo", "fum", "fast", "foo", "fum"]
for word in words
    c = count(x->x==word, words)
    println(c)
end

2
2
1
2
2


In [None]:
# Scratch version: list the mail files in a directory
function A(mail_dir)
    [joinpath(mail_dir, f) for f in readdir(mail_dir)]
end


# Scratch version: build the word-count matrix using the global `dict`
function B()
    files = A("machine-learning-101/chapter1/train-mails/")
    features_matrix = zeros(length(files), length(dict))
    # Do not name a local variable `count`: it would shadow Base.count, and the
    # call below would fail with "objects of type Int64 are not callable"
    for (doc_id, file) in enumerate(files)
        for (i, line) in enumerate(eachline(file))
            # Only the third line (the mail body) is used
            i == 3 || continue
            words = split(line)
            for word in words
                # Go through the bag of words
                for (word_id, d) in enumerate(dict)
                    if d[1] == word
                        features_matrix[doc_id, word_id] = count(x -> x == word, words)
                    end
                end
            end
        end
    end
    features_matrix
end

fm = B()
println(fm)


In [119]:
"Make features matrix and label vector"
function extract_features(mail_dir, dict)
    files = [joinpath(mail_dir, f) for f in readdir(mail_dir)]
    features_matrix = zeros(length(files), length(dict))
    train_labels = zeros(length(files))
    # Map each dictionary word to its column index for fast lookup
    word_ids = Dict(String(d[1]) => i for (i, d) in enumerate(dict))
    for (doc_id, file) in enumerate(files)
        for (i, line) in enumerate(eachline(file))
            # Skip the subject line and the empty second line; the body is on line 3
            i == 3 || continue
            words = split(line)
            for word in unique(words)
                word_id = get(word_ids, word, 0)
                # Ignore words that are not in the bag
                word_id == 0 && continue
                features_matrix[doc_id, word_id] = count(==(word), words)
            end
        end

        # File names starting with "spmsg" are spam (label 1); the rest are ham (label 0)
        if startswith(basename(file), "spmsg")
            train_labels[doc_id] = 1
        end
    end
    (features_matrix, train_labels)
end

extract_features("./machine-learning-101/chapter1/train-mails", dict)