<a href="https://colab.research.google.com/github/jerge/DAT405-DSC/blob/main/Lab4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DAT405 Introduction to Data Science and AI

Daniel Willim: 13h

Erik Jergéus: 13h

## 2020-2021, Reading Period 2
## Assignment 4: Spam classification using Naïve Bayes 
There will be an overall grade for this assignment. To get a pass grade (grade 5), you need to pass items 1-3 below. To receive higher grades, finish items 4 and 5 as well. 

The exercise takes place in a notebook environment where you can choose to use Jupyter or Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical. 
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells. Use the former to document and explain your results, and the latter to write code snippets that execute the required tasks.  

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you choose to use Jupyter notebooks you will have to run the commands in the cell below on your local computer. On Windows you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.



In [None]:
#Download and extract data
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
!tar -xjf 20021010_easy_ham.tar.bz2
!tar -xjf 20021010_hard_ham.tar.bz2
!tar -xjf 20021010_spam.tar.bz2

--2020-12-02 09:19:32--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 95.216.26.30, 95.216.24.32, 40.79.78.1, ...
Connecting to spamassassin.apache.org (spamassassin.apache.org)|95.216.26.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: ‘20021010_easy_ham.tar.bz2’


2020-12-02 09:19:33 (1.84 MB/s) - ‘20021010_easy_ham.tar.bz2’ saved [1677144/1677144]

--2020-12-02 09:19:33--  https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 95.216.26.30, 95.216.24.32, 40.79.78.1, ...
Connecting to spamassassin.apache.org (spamassassin.apache.org)|95.216.26.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: ‘20021010_hard_ham.tar.bz2’


2020-12-02 09:19:34 (1.35 MB/s) - ‘200210

The data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.
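For a local Jupyter setup without `wget` and `tar`, the same download and extraction could be sketched with the Python standard library. This is a minimal sketch, not part of the assignment: the helper names `extract_archive` and `download_and_extract` are hypothetical, and the archive names are the ones used in the cell above.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = [
    "20021010_easy_ham.tar.bz2",
    "20021010_hard_ham.tar.bz2",
    "20021010_spam.tar.bz2",
]

def extract_archive(archive_path, dest="."):
    """Unpack a .tar.bz2 archive into dest (the job of `tar -xjf`)."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest)

def download_and_extract(dest="."):
    """Download each archive unless it is already present, then unpack it."""
    for name in ARCHIVES:
        target = Path(dest) / name
        if not target.exists():
            urllib.request.urlretrieve(BASE_URL + name, target)
        extract_archive(target, dest)
```

Calling `download_and_extract()` once in a code cell should leave the same three folders in the working directory.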

In [None]:
!ls -lah

total 4.0M
drwxr-xr-x 1 root root 4.0K Dec  2 09:08 .
drwxr-xr-x 1 root root 4.0K Dec  2 09:05 ..
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2
-rw-r--r-- 1 root root 1.2M Jun 29  2004 20021010_spam.tar.bz2
drwxr-xr-x 1 root root 4.0K Nov 20 17:15 .config
drwx--x--x 2  500  500 168K Oct 10  2002 easy_ham
drwx--x--x 2 1000 1000  20K Dec 16  2004 hard_ham
drwxr-xr-x 1 root root 4.0K Nov 13 17:33 sample_data
drwxr-xr-x 2  500  500  36K Oct 10  2002 spam


### 1. Preprocessing: 
1.	Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set. (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`)


In [None]:
#pre-processing code here
import os
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

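def get_file_list_from_dir(datadir):
    """Read every file in datadir and return the contents as a list of strings."""
    all_content = []
    for file_name in os.listdir(os.path.abspath(datadir)):
        # Close each file after reading; undecodable bytes are ignored
        with open(os.path.join(datadir, file_name), "r", errors="ignore") as f:
            all_content.append(f.read())
    return all_content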

# List of emails (one email = one string)
easy_ham = get_file_list_from_dir("easy_ham/")
hard_ham = get_file_list_from_dir("hard_ham/")
spam = get_file_list_from_dir("spam/")

# Split data
hamtrain,hamtest = train_test_split(easy_ham + hard_ham, 
                                    test_size=0.3, random_state=17)
spamtrain,spamtest = train_test_split(spam, 
                                      test_size=0.3, random_state=17)

hardhamtrain,hardhamtest = train_test_split(hard_ham, 
                                            test_size=0.3, random_state=17)
easyhamtrain,easyhamtest = train_test_split(easy_ham, 
                                            test_size=0.3, random_state=17)


### 2. Write a Python program that: 
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. Sklearn) on `hamtrain` and `spamtrain`, that classifies the test sets and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes classifier in Sklearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem:
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers. 





In [None]:
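def run_predictions(hamtrain, spamtrain, hamtest, spamtest, 
                    multi=None, berno=None, vectorizer=None):
  # Build fresh default objects per call instead of sharing instances
  # that are created once at function definition time
  multi = MultinomialNB() if multi is None else multi
  berno = BernoulliNB() if berno is None else berno
  vectorizer = CountVectorizer() if vectorizer is None else vectorizer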
  
  # Transform data into word feature vector
  train = vectorizer.fit_transform(hamtrain + spamtrain)

  labels = ['ham']*len(hamtrain) + ['spam']*len(spamtrain)

  test_ham = vectorizer.transform(hamtest)
  test_spam = vectorizer.transform(spamtest)
  test_set = vectorizer.transform(hamtest + spamtest)
  test_labels = ['ham']*len(hamtest) + ['spam']*len(spamtest)

  # Fit both classifiers on the training vectors
  multi.fit(train, labels)
  berno.fit(train, labels)

  prediction_multi_ham = multi.predict(test_ham)
  prediction_berno_ham = berno.predict(test_ham)

  prediction_multi_spam = multi.predict(test_spam)
  prediction_berno_spam = berno.predict(test_spam)

  # Show predicted label counts per test set, then overall accuracy
  unique, counts = np.unique(prediction_multi_ham, return_counts=True)
  print(f"Ham prediction for multinomial: {dict(zip(unique, counts))}")
  unique, counts = np.unique(prediction_berno_ham, return_counts=True)
  print(f"Ham prediction for bernoulli:   {dict(zip(unique, counts))}")

  print()

  unique, counts = np.unique(prediction_multi_spam, return_counts=True)
  print(f"Spam prediction for multinomial: {dict(zip(unique, counts))}")
  unique, counts = np.unique(prediction_berno_spam, return_counts=True)
  print(f"Spam prediction for bernoulli:   {dict(zip(unique, counts))}")
  
  print()

  print(f"Accuracy multinomial: {multi.score(test_set,test_labels):,.2f}")
  print(f"Accuracy bernoulli: {berno.score(test_set, test_labels):,.2f}")

  print()

In [None]:
run_predictions(hamtrain, spamtrain, hamtest, spamtest)

Ham prediction for multinomial: {'ham': 836, 'spam': 5}
Ham prediction for bernoulli:   {'ham': 839, 'spam': 2}

Spam prediction for multinomial: {'ham': 12, 'spam': 139}
Spam prediction for bernoulli:   {'ham': 110, 'spam': 41}

Accuracy multinomial: 0.98
Accuracy bernoulli: 0.89
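
The assignment asks for true positive and false negative rates, while the output above lists counts. A minimal sketch of converting the predictions for a single-class test set into rates follows; the helper name `tp_fn_rates` is hypothetical, and the example list only re-creates the recorded multinomial counts on `spamtest` (139 `spam`, 12 `ham`).

```python
import numpy as np

def tp_fn_rates(predictions, positive):
    """Rates for a test set in which every sample truly has the label `positive`."""
    predictions = np.asarray(predictions)
    tpr = float(np.mean(predictions == positive))  # correctly recognised share
    return tpr, 1.0 - tpr                          # the rest are false negatives

# Re-create the recorded multinomial predictions on spamtest
recorded = ["spam"] * 139 + ["ham"] * 12
tpr, fnr = tp_fn_rates(recorded, positive="spam")
print(f"TPR: {tpr:.3f}, FNR: {fnr:.3f}")
```

Inside `run_predictions` the same helper could be applied to `prediction_multi_spam` with `positive="spam"` and to `prediction_multi_ham` with `positive="ham"`.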



#### Discuss the differences between these two classifiers.
The two classifiers that we use are Multinomial Naive Bayes and Bernoulli Naive Bayes. Both are commonly used for text classification, but as we will see in this report, they perform differently on the same input. 

This is because a Multinomial Naive Bayes classifier considers all features (in our case, a feature is a single word) and how many times each feature appears, while a Bernoulli Naive Bayes classifier only considers whether a mail has a feature or not, and not how frequent that feature is. A Bernoulli classifier also explicitly penalises a mail that does not have a feature that is otherwise common for a specific label.

These differences lead to Bernoulli having worse accuracy than the multinomial classifier here (0.89 versus 0.98). As we see in question 4, removing uncommon words increases Bernoulli’s accuracy significantly.
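
The difference described above can be illustrated on a toy corpus. This is a sketch with made-up documents, not the assignment data: `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it gives the same probabilities for raw counts and for presence/absence vectors, while `MultinomialNB` reacts to the repeated word.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up training mails (assumption for illustration only)
train_docs = ["free free free money", "meeting tomorrow",
              "free money now", "lunch meeting today"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# The same test mail as word counts and as presence/absence
test_counts = vectorizer.transform(["free free free meeting"])
test_binary = (test_counts > 0).astype(int)

multi = MultinomialNB().fit(X_train, train_labels)
berno = BernoulliNB().fit(X_train, train_labels)

multi_counts = multi.predict_proba(test_counts)
multi_binary = multi.predict_proba(test_binary)
berno_counts = berno.predict_proba(test_counts)
berno_binary = berno.predict_proba(test_binary)

print("multinomial changes with counts:", not np.allclose(multi_counts, multi_binary))
print("bernoulli ignores counts:", np.allclose(berno_counts, berno_binary))
```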


### 3. Run your program on 
-	Spam versus easy-ham 
-	Spam versus hard-ham.

In [None]:
print("Predictions on easy-ham vs spam")
run_predictions(easyhamtrain, spamtrain, easyhamtest, spamtest)

Predictions on easy-ham vs spam
Ham prediction for multinomial: {'ham': 766}
Ham prediction for bernoulli:   {'ham': 764, 'spam': 2}

Spam prediction for multinomial: {'ham': 15, 'spam': 136}
Spam prediction for bernoulli:   {'ham': 72, 'spam': 79}

Accuracy multinomial: 0.98
Accuracy bernoulli: 0.92



In [None]:
print("Predictions on hard-ham vs spam")
run_predictions(hardhamtrain, spamtrain, hardhamtest, spamtest)

Predictions on hard-ham vs spam
Ham prediction for multinomial: {'ham': 55, 'spam': 20}
Ham prediction for bernoulli:   {'ham': 48, 'spam': 27}

Spam prediction for multinomial: {'ham': 8, 'spam': 143}
Spam prediction for bernoulli:   {'ham': 4, 'spam': 147}

Accuracy multinomial: 0.88
Accuracy bernoulli: 0.86



Both classifiers predict spam better on the hard-ham data, to the point that Bernoulli performs about as well as multinomial (147 versus 143 of 151 spam mails correct). As expected, however, hard ham is more difficult to classify correctly, so both get worse ham results.

On the easy-ham data the ham prediction is still almost flawless for both methods. Compared with the run on all ham data, the spam prediction is slightly worse for multinomial (136 versus 139 correct) but better for Bernoulli (79 versus 41 correct). Bernoulli therefore seems to perform a lot better when the ham is split by difficulty. Note, however, that its spam prediction on easy ham is still poor, and we have not yet seen any benefit of using Bernoulli over multinomial for this task.

**Our guesses as to why**

Hard ham differs from spam in more intricate and perhaps smaller ways, so a classifier trained only on hard ham and spam may pick up these small differences better. In the combined case the training data is flooded with easy ham, which makes those minute differences look less useful.

For ham, it seems reasonable that hard ham is harder to identify since there are fewer samples. Furthermore, the prior for spam is a lot higher now that there are fewer ham samples, so spam is more likely to be over-predicted, just as ham is over-represented in both the easy-ham and combined samples.

Why Bernoulli does better on easy ham than on the combined samples does not make sense at first. However, since Bernoulli distinguishes features by whether they occur and not by how often they occur, it likely performs better because some features that are present in both the hard ham and the spam are now removed from the ham class. Previously it would probably predict those samples to be ham, since ham had the higher prior.
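
The effect on the prior mentioned above can be made concrete with a little arithmetic. This is a sketch under an assumption: the training-set sizes below are inferred from the recorded test-set sizes and `test_size=0.3`, they are not printed anywhere in the notebook, and the easy-ham and hard-ham figures in particular are approximate.

```python
def spam_prior(n_ham_train, n_spam_train):
    """Empirical class prior for spam, as learned when fit_prior=True."""
    return n_spam_train / (n_ham_train + n_spam_train)

# Training sizes inferred from the recorded 30 % test sizes (assumption)
combined = spam_prior(1960, 350)   # all ham vs spam
easy_only = spam_prior(1785, 350)  # easy ham vs spam
hard_only = spam_prior(175, 350)   # hard ham vs spam

print(f"combined: {combined:.2f}, easy: {easy_only:.2f}, hard: {hard_only:.2f}")
```

Under these assumed sizes the spam prior is far higher in the hard-ham run than in the other two, which is the imbalance the guess above relies on.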

### 4.	To avoid classification based on common and uninformative words it is common to filter these out. 

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. 

In [None]:
def uncommon_words(freq, cutoff):
  # Number of words whose total count is at most cutoff
  return sum(1 for _, n in freq if n <= cutoff)

def count_uncommon_common(data, return_res=False):
  # Transform data into feature vector
  vectorizer = CountVectorizer()
  count = vectorizer.fit(data)
  clump = count.transform(data)

  # Count occurrences of each word over all documents
  sumw = clump.sum(axis=0)
  frequency = [(word, sumw[0,index]) for word, index in count.vocabulary_.items()]
  # Sort from most to least frequent
  frequency = sorted(frequency, key=lambda x: x[1], reverse=True)
  
  if return_res:
    return frequency
  else: 
    print(f"The data had {len(data)} entries with {len(frequency)} different words")
    print(f"The 10 most common words are:")
    print(*frequency[0:10])

    print(f"There are {uncommon_words(frequency, 1)} words, that only occur once in the data")
    print(f"There are {uncommon_words(frequency, 5)} words that occur five times or less in the data")
    print()

count_uncommon_common(hamtrain + spamtrain)

The data had 2310 entries with 82402 different words
The 10 most common words are:
('com', 46578) ('the', 27745) ('to', 26485) ('http', 20742) ('from', 20092) ('2002', 19752) ('td', 17555) ('for', 16455) ('net', 15600) ('with', 15512)
There are 48189 words, that only occur once in the data
There are 69873 words that occur five times or less in the data



Removing common and uncommon words might lead to a better prediction since we, in loose terms, are trying to classify a mail depending on what words the mail contains. For example, a word like `Chalmers` or `Unambiguous` might appear only a few times in ham mails and never in spam mails throughout the whole training set. The classifier would therefore "think" that a mail containing `Chalmers` is very unlikely to be spam, since no spam mail in the training set contained `Chalmers`. 

This is a larger problem for the Bernoulli classifier, since it only considers whether a certain mail has a specific feature or not. This will be discussed in more detail in the latter part of the question.

For common words it is the opposite case: if a word appears in almost every mail, it carries little information for classifying a mail and might mislead the calculation. For example, `com` appears on average about 20 times per email (46578 occurrences in 2310 mails).

Above we have printed the 10 most common words and the count of uncommon words.

**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report your results.

You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.

In [None]:
print("1. No removal of common or uncommon words")
run_predictions(hamtrain, spamtrain, hamtest, spamtest)

print("---------------------------")
print("2. Remove both common and uncommon words. Using sklearn, min_df = 5, max_df=0.8 ")
print("Removed words that appeared less than 5 times or word that appeared in more than 80% of all emails")
run_predictions(hamtrain, spamtrain, hamtest, spamtest, 
                vectorizer=CountVectorizer(min_df=5, max_df=0.8))

# Create list of uncommon words (total count below 5)
freq = count_uncommon_common(hamtrain + spamtrain, return_res=True)
stopwords = [word for word, n in freq if n < 5]

print("---------------------------")

print(f"3. Remove uncommonwords, using own stopwords list, {(len(stopwords)/len(freq)*100):,.2f} % of data was removed")
print("Removed words that appeared less than 5 times.")
run_predictions(hamtrain, spamtrain, hamtest, spamtest, 
                vectorizer=CountVectorizer(stop_words=stopwords))

print("---------------------------")
print('4. Remove common english words, using sklearns stopword set, "english"')
run_predictions(hamtrain, spamtrain, hamtest, spamtest, 
                vectorizer=CountVectorizer(stop_words="english"))

print("---------------------------")

1. No removal of common or uncommon words
Ham prediction for multinomial: {'ham': 836, 'spam': 5}
Ham prediction for bernoulli:   {'ham': 839, 'spam': 2}

Spam prediction for multinomial: {'ham': 12, 'spam': 139}
Spam prediction for bernoulli:   {'ham': 110, 'spam': 41}

Accuracy multinomial: 0.98
Accuracy bernoulli: 0.89

---------------------------
2. Remove both common and uncommon words. Using sklearn, min_df = 5, max_df=0.8 
Removed words that appeared less than 5 times or word that appeared in more than 80% of all emails
Ham prediction for multinomial: {'ham': 825, 'spam': 16}
Ham prediction for bernoulli:   {'ham': 814, 'spam': 27}

Spam prediction for multinomial: {'ham': 1, 'spam': 150}
Spam prediction for bernoulli:   {'spam': 151}

Accuracy multinomial: 0.98
Accuracy bernoulli: 0.97

---------------------------
3. Remove uncommonwords, using own stopwords list, 82.85 % of data was removed
Removed words that appeared less than 5 times.
Ham prediction for multinomial: {'ham': 

*Above we've tried different techniques and combinations of removing common/uncommon words*

From tests 2 and 4 we see that removing common words has no significant effect on the result. This might be because very few words appear in more than 80 % of the mails, so filtering them out removes few features and barely influences the classification.

In tests 2 and 3, where uncommon words are removed, multinomial is not affected in a significant way, but the Bernoulli classifier's accuracy increases by about 8 percentage points (from 0.89 to 0.97 in test 2). This is because of the differences between the classifiers discussed in question 2: Bernoulli only "looks" at whether a certain mail contains a feature or not. Hence, if an uncommon word from the training set appeared in a test mail, that mail might be misclassified.
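
To make the filtering parameters concrete, here is a small sketch on made-up documents (an illustration, not the assignment data). In `CountVectorizer`, `min_df` and `max_df` are thresholds on the number (or fraction) of documents that contain a word, not on the total number of occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents (assumption for illustration only)
docs = [
    "offer offer offer offer offer",  # 'offer' occurs five times, in one document
    "meeting today",
    "meeting tomorrow",
    "the meeting notes",
]

all_words = CountVectorizer().fit(docs)
in_two_or_more_docs = CountVectorizer(min_df=2).fit(docs)
not_too_common = CountVectorizer(max_df=0.7).fit(docs)

print(sorted(all_words.vocabulary_))
print(sorted(in_two_or_more_docs.vocabulary_))  # 'offer' is dropped despite 5 occurrences
print(sorted(not_too_common.vocabulary_))       # 'meeting' (3 of 4 documents) is dropped
```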

In [None]:
# Check for uncommon words in testdata
freq_test = dict(count_uncommon_common(hamtest + spamtest, return_res=True))

# stopwords holds plain strings, so test the whole word (not its first character x[0])
uncommon_in_test = len(list(filter(lambda x: x in freq_test, stopwords)))

print(f"There are {uncommon_in_test} uncommon words from the traing set that appear in the test set")

There are 0 uncommon words from the traing set that appear in the test set


The recorded count of 0 came from a version of this check that tested only the first character of each word (`x[0]`) against the test vocabulary. Since `CountVectorizer` drops single-character tokens by default, that version always reports 0, so it does not show that the uncommon training words are absent from the test set. Bernoulli’s increase in accuracy might also be because Bernoulli penalises a mail that does not have a feature associated with a type of mail. For example, if `Chalmers` appears only once in all training mails and that mail is a ham mail, Bernoulli will think it slightly less likely that a mail not containing `Chalmers` is a ham mail. 

This is probably the reason that most of the emails that Bernoulli misclassifies are spam mails, since these often contain random strings of characters like the examples below.

In [None]:
print(f"Example of words that appear once: {stopwords[-5:]}")
# stopwords only holds words with a count below 5, sorted from most to least frequent
print(f"Example of words that appear four times: {stopwords[:5]}")

Example of words that appear once: ['s12pd7lmwm', 'muatiounbwshcouk8l2zvbnq', 'pc9wpg0kdqo8l2jvzhk', 'dqoncjwvahrtbd4', '200209140414']
Example of words that appear four times: ['01t00', 'footage', 'terrific', 'monkeys', 'blacklist']


Here we chose to define an uncommon word as one that appeared fewer than 5 times and a common word as one that appeared in more than 80 % of all documents. Note that `min_df=5` in test 2 applies the threshold to the number of documents containing the word, while our own stop-word list in test 3 uses the total count. We chose these parameters since they gave a high accuracy on the test set while removing as little data as possible. But here the test set is used to set a parameter, which can lead to overfitting to the test set. A better way to choose parameters for a certain application would be to use a separate validation set, where the parameters are optimized on the validation set and the model is then evaluated on the test set.

*In some of our runs (before specifying random_state in our train/test split) the accuracy of the multinomial classifier increased by a few percentage points, but this depends on the exact split of train and test data, which will be discussed in question 5*
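
The validation-set idea described above could be sketched as follows. This is an illustrative sketch, not what the notebook ran: the helper `pick_min_df`, the candidate values, and the split proportions are assumptions, and the toy texts stand in for the real mails.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

def pick_min_df(texts, labels, candidates=(1, 2, 5), seed=17):
    """Choose min_df on a validation split, then score once on the test split."""
    x_rest, x_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed)

    best = None
    for min_df in candidates:
        vec = CountVectorizer(min_df=min_df)
        model = BernoulliNB().fit(vec.fit_transform(x_train), y_train)
        val_acc = model.score(vec.transform(x_val), y_val)
        # Keep the first candidate with the highest validation accuracy
        if best is None or val_acc > best[0]:
            best = (val_acc, min_df, vec, model)

    val_acc, min_df, vec, model = best
    # The test split is touched only once, after min_df has been chosen
    test_acc = model.score(vec.transform(x_test), y_test)
    return min_df, val_acc, test_acc

# Toy data standing in for the mails (assumption for illustration only)
texts = ["free money offer"] * 20 + ["project meeting notes"] * 20
labels = ["spam"] * 20 + ["ham"] * 20
chosen_min_df, val_acc, test_acc = pick_min_df(texts, labels)
print(chosen_min_df, val_acc, test_acc)
```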

### 5. Eking out further performance
Filter out the headers and footers of the emails before you run on them. The format may vary somewhat between emails, which can make this a bit tricky, so perfect filtering is not required. Run your program again and answer the following questions: 
-	Does the result improve from 3 and 4? 
- The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies? 
- What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages? 

Re-estimate your classifier using the `fit_prior` parameter set to `False`, and answer the following questions:
- What does this parameter mean?
- How does this alter the predictions? Discuss why or why not.

In [None]:
import email

# preprocess

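def get_body_from_msg(msg):
  # Return the message body without headers; multipart bodies are joined
  output = ""
  if msg.is_multipart():
      for payload in msg.get_payload():
          if payload.is_multipart():
            # Use += so the text is kept; a bare `output + ...` discards its result
            output += " " + get_body_from_msg(payload)
            continue 

          output += " " + payload.get_payload()
      return output
  else:
      return msg.get_payload()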

# List of emails (one email = one string)
easy_ham = get_file_list_from_dir("easy_ham/")
hard_ham = get_file_list_from_dir("hard_ham/")
spam = get_file_list_from_dir("spam/")

no_head_easy_ham = list(map(get_body_from_msg, 
                            map(email.message_from_string, easy_ham)))
no_head_hard_ham = list(map(get_body_from_msg, 
                            map(email.message_from_string, hard_ham)))
no_head_spam = list(map(get_body_from_msg, 
                        map(email.message_from_string, spam)))

# Fixed random_state so the splits are reproducible, as in section 1
no_head_hamtrain, no_head_hamtest = train_test_split(no_head_easy_ham +
                                                     no_head_hard_ham, 
                                                     test_size=0.3,
                                                     random_state=17)

no_head_spamtrain, no_head_spamtest = train_test_split(no_head_spam, 
                                                       test_size=0.3,
                                                       random_state=17)

no_head_easyhamtrain, no_head_easyhamtest = train_test_split(no_head_easy_ham, 
                                                             test_size=0.3,
                                                             random_state=17)

no_head_hardhamtrain, no_head_hardhamtest = train_test_split(no_head_hard_ham,
                                                             test_size=0.3,
                                                             random_state=17)

count_uncommon_common(no_head_hamtrain + no_head_spamtrain)

run_predictions(no_head_hamtrain, no_head_spamtrain, no_head_hamtest, no_head_spamtest)

The data had 2310 entries with 48317 different words
The 10 most common words are:
('the', 25686) ('http', 17665) ('com', 16738) ('to', 16089) ('td', 12932) ('of', 12581) ('and', 12555) ('font', 10141) ('width', 9981) ('www', 9579)
There are 24961 words, that only occur once in the data
There are 38607 words that occur five times or less in the data

Ham prediction for multinomial: {'ham': 841}
Ham prediction for bernoulli:   {'ham': 837, 'spam': 4}

Spam prediction for multinomial: {'ham': 42, 'spam': 109}
Spam prediction for bernoulli:   {'ham': 107, 'spam': 44}

Accuracy multinomial: 0.96
Accuracy bernoulli: 0.89



- Does the result improve from 3 and 4?

The result is worse when we remove the boilerplate and headers (multinomial accuracy 0.96 versus 0.98, with 42 instead of 12 spam mails classified as ham). This suggests that the information there has some value. It could for example be the subject of the mail, who it is sent from (some types of mail addresses could be more common in spam, for example a different top-level domain than ".com" or spoofed senders such as "xxx@whitehouse.xxx"), or simply the amount of information that is present.
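
If the subject line is one of the valuable pieces, one option would be to keep it while dropping the other headers. The following is a minimal sketch of that idea using the standard `email` package; the helper name `subject_and_body` and the toy message are hypothetical and were not part of our runs.

```python
import email

def subject_and_body(raw):
    """Return the Subject header followed by all text parts, without other headers."""
    msg = email.message_from_string(raw)
    parts = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        payload = part.get_payload()
        if isinstance(payload, str):  # multipart containers hold a list instead
            parts.append(payload)
    return "\n".join(parts)

# Toy message (assumption for illustration only)
raw = "From: someone@example.com\nSubject: Cheap offer\n\nBuy now\n"
text = subject_and_body(raw)
print(text)
```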

- The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies?

A single random split can be skewed because, with limited data, mails that carry key features may all end up in one of the two sets. In our case the differences between splits have never been large, since we use sklearn's built-in split method and split the spam and ham separately, so that both sets get a proportional amount of each class. The most notable difference is that in some splits "easy ham vs spam" and "hard ham vs spam" score differently enough to change which one is better. It is never by a large margin, and in the other cases the difference is 1-2 % of the score at most. To remedy problems caused by the split, you can run on several different splits and average the results. We can also use a separate validation set to tune the hyper-parameters. Another option is probably to prune the input to remove outliers that might cause the skew. Lastly, we can try to get more data points, so that it is less likely that any key features are left out of the training set.
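
Running on several splits and averaging, as suggested above, is what stratified k-fold cross-validation does. Here is a minimal sketch, assuming a pipeline of `CountVectorizer` and `MultinomialNB`; the helper `cv_accuracy` and the toy data are hypothetical and only illustrate the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def cv_accuracy(texts, labels, n_splits=5, seed=17):
    """Accuracy per fold; each fold keeps the spam/ham proportions."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, texts, labels, cv=folds)

# Toy data standing in for the mails (assumption for illustration only)
texts = ["free money offer"] * 10 + ["project meeting notes"] * 10
labels = ["spam"] * 10 + ["ham"] * 10
scores = cv_accuracy(texts, labels)
print(f"mean accuracy: {scores.mean():.2f}, std: {scores.std():.2f}")
```

Because the vectorizer sits inside the pipeline, it is refitted on the training part of each fold, so no vocabulary leaks from the held-out part.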

 - What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages?

The prior would be skewed in the opposite direction, which would lead to bad predictions. Furthermore, it would probably lead to an effect similar to what we get in #3 for Bernoulli on hard ham: the classifier would become better at identifying spam, but the ham prediction would go way down.

In [None]:
print("Fit_prior = False")
run_predictions(hamtrain, spamtrain, hamtest, spamtest, 
                multi = MultinomialNB(fit_prior=False),
                berno = BernoulliNB(fit_prior=False))

print("Fit_prior = True")
run_predictions(hamtrain, spamtrain, hamtest, spamtest, 
                multi = MultinomialNB(fit_prior=True),
                berno = BernoulliNB(fit_prior=True))

Fit_prior = False
Ham prediction for multinomial: {'ham': 836, 'spam': 5}
Ham prediction for bernoulli:   {'ham': 839, 'spam': 2}

Spam prediction for multinomial: {'ham': 12, 'spam': 139}
Spam prediction for bernoulli:   {'ham': 108, 'spam': 43}

Accuracy multinomial: 0.98
Accuracy bernoulli: 0.89

Fit_prior = True
Ham prediction for multinomial: {'ham': 836, 'spam': 5}
Ham prediction for bernoulli:   {'ham': 839, 'spam': 2}

Spam prediction for multinomial: {'ham': 12, 'spam': 139}
Spam prediction for bernoulli:   {'ham': 110, 'spam': 41}

Accuracy multinomial: 0.98
Accuracy bernoulli: 0.89



#### What does this parameter mean?

From the documentation we learn that `fit_prior` indicates "*Whether to learn class prior probabilities or not. If false, a uniform prior will be used.*", where the prior for a given class is the fraction of training samples that belong to that class. A uniform prior in this case means that the prior is 1/2 for both classes, which could be a sensible choice if we assume the training samples are unnaturally skewed in the wrong direction.
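
The effect of the parameter can be seen directly in the fitted attribute `class_log_prior_`. The sketch below uses made-up documents with three ham mails and one spam mail (an illustration, not the assignment data).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, imbalanced training mails (assumption for illustration only)
docs = ["meeting notes", "project meeting", "lunch today", "free money"]
labels = ["ham", "ham", "ham", "spam"]
X = CountVectorizer().fit_transform(docs)

learned = MultinomialNB(fit_prior=True).fit(X, labels)
uniform = MultinomialNB(fit_prior=False).fit(X, labels)

# classes_ is sorted, so the order is ['ham', 'spam']
learned_prior = np.exp(learned.class_log_prior_)  # class proportions in the data
uniform_prior = np.exp(uniform.class_log_prior_)  # 1 / number of classes
print(learned.classes_, learned_prior, uniform_prior)
```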

#### How does this alter the predictions? Discuss why or why not.

Our results indicate that `fit_prior` does not affect the results significantly in either direction: only two Bernoulli spam predictions change (43 instead of 41 correct). We also tried it on easy and hard ham, which gave the same result. This seems to indicate that a lot of the previous assumptions we have made regarding the prior might be incorrect, especially in regard to #3. It might still have an impact on, for example, the case where "*your training set were mostly spam messages while your test set were mostly ham messages*", but it also means that we could simply turn off `fit_prior` to test whether the predictions would be better.

### What to report and how to hand in.

- You will need to report all results in the notebook in a clear and appropriate way, either using plots or code output (e.g. "print statements"). 
- The notebook must be reproducible. That means we must be able to use the `Run all` function from the `Runtime` menu and reproduce all your results. **Please check this before handing in.** 
- Save the notebook and share a link to the notebook (press share in upper left corner, and use the `Get link` option). **Please make sure to allow all with the link to open and edit.**
- Edits made after submission deadline will be ignored, graders will recover the last saved version before deadline from the revisions history.
- **Please make sure all cells are executed and all the output is clearly readable/visible to anybody opening the notebook.**