# M550 Python Assessment

## Overview

This is the assessment for the M550 Python Module. It counts as 30% of your overall mark for M550.

The assessment is to be undertaken in small groups and your work should be submitted in the form of a single Jupyter notebook.

The submission date for this assessment is 16/11/2021. Your Jupyter notebook should be emailed directly to [Daniel Grose](mailto:dan.grose@lancaster.ac.uk) by 17:00 on this date (one copy per group).

Before your results and feedback are returned you might be asked to have a short (approximately 5 minutes) individual online "interview" to discuss some aspects of your work. The outcome of this interview might impact on your overall individual score.

### <u>Task 1</u>

Confirm the names of all of the members of your group, along with their e-mail addresses, to [Daniel Grose](mailto:dan.grose@lancaster.ac.uk) no later than 17:00 on 04/11/2021. Only one member of the group has to send the e-mail. You need to complete this task to enable the other tasks to be assessed.

__No marks for this task__

The assessment asks you to study some sections of a short chapter from the book [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) by Daniel Jurafsky & James H. Martin. A pdf version of the chapter can be downloaded <a href="./3rd-party/speech-and-language-processing-c4.pdf" download>here</a>. The assessment will guide you through this reading.

The assessment is subdivided into a series of tasks, most of which have a mark associated with them. You should look over all of the tasks before you start working through them so that you can decide how to share the work amongst your group. For many of the tasks you may want to attempt the task yourself (at least in part) and then compare your work with that of your other group members. You can then combine the best elements of your individual solutions and ideas into your group Jupyter notebook. Make sure you use plenty of markdown to enhance your answers and provide code examples (that work) to support your findings.

Task 12 is slightly different from the other tasks in that it provides a list of suggestions as to possible ways you could extend and enhance your work. Bear in mind these are only suggestions, and you might have your own ideas as to how to improve the methods covered in the assessment. Note that you should not feel that you have to pursue all of the suggestions.

To help with the tasks, and in particular Task 12, a number of code examples have been provided in the appendix. These should help you gain some insight into how to complete the tasks using Python.

## Naive Bayes and Sentiment Classification


### <u>Task 2</u>

Read chapter 4 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) (pdf version available <a href="./3rd-party/speech-and-language-processing-c4.pdf" download>here</a>) up to the end of section 4.1.

__No marks for this task__

### <u>Task 3</u>

A Python dictionary can be used to organise the information summarised in figure 4.1 in [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/). For example:

In [1]:
document = {"it" : 6,"I" : 5, "the" : 4}
print(document)

{'it': 6, 'I': 5, 'the': 4}


However, a document is often provided as a list of words, not a dictionary. Write a function that takes a list of words and returns a dictionary with the words as keys and the word frequencies as values. Test your code using the following list of strings. 

In [2]:
content = ["fish","dog","cat","fish","cat","fish"]

Explain what each line of your function does.

Explain why a Python dictionary is a good choice of data structure to use for a document. If you are not sure, do some online research. Note that a Python dictionary is an example of a more general data structure called a __hash map__.

You may find the examples and notes in the __dictionaries__ chapter helpful.
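As a small illustration of the kind of access a hash map supports, the sketch below looks words up by key in the example dictionary shown earlier. It is only meant to show the basic operations, not a solution to the task; the missing word `"fish"` is an arbitrary choice.

```python
# The example document from earlier in this task.
document = {"it": 6, "I": 5, "the": 4}

# Membership tests and lookups go straight to the key;
# they do not need to scan every entry in the dictionary.
print("the" in document)
print(document["the"])

# get() returns a default value instead of raising KeyError
# when a word is not present.
print(document.get("fish", 0))
```

The `get` method with a default is often convenient when a word may or may not already be present in a document.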

__Marks__ [10]

## Training the Naive Bayes Classifier

### <u>Task 4</u>

Read section 4.2 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) up to the end of the paragraph after equation 4.14.

__No marks for this task__

### <u>Task 5</u>

Use your solution to __Task 3__ to create documents out of the following strings:

In [3]:
s1 = "just plain boring"
s2 = "entirely predictable and lacks energy"
s3 = "no surprises and very few laughs"
s4 = "very powerful"
s5 = "the most fun film of the summer"

You can use the __split__ method to create a list of strings from each of the strings given above. For example:

In [4]:
s1.split(" ")

['just', 'plain', 'boring']
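Text from real sources often contains tabs, newlines, or repeated spaces. The sketch below, using a made-up string `text`, compares splitting on a single space with calling `split()` with no argument, which splits on any run of whitespace.

```python
# A made-up string with a double space, a tab, and a newline.
text = "no  surprises\tand very\nfew laughs"

# Splitting on a single space keeps empty strings and leaves
# the tab and newline inside the words.
on_space = text.split(" ")
print(on_space)

# With no argument, split() treats any run of whitespace as one separator.
on_whitespace = text.split()
print(on_whitespace)
```

For the five short strings in this task both forms give the same result; the difference only matters for less tidy input.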

__Marks__ [5]

### <u>Task 6</u>

A category, as described in [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/), has the same structure as a document, and can be constructed by combining documents together. This is also true of a vocabulary.

Write a function to create a document (or category, or vocabulary) from a list of other documents. Explain how your function works.

You might find the chapter on __lists__ useful.

__Marks__ [10]

### <u>Task 7</u>

Using the strings introduced in __Task 5__ and your solution to __Task 6__,  create a category __negative__ from strings s1, s2, and s3, and a category __positive__ from s4 and s5. Create a __vocabulary__ from the __negative__ and __positive__ categories. 

Display the categories and vocabulary you have created.

__Marks__ [5]


### <u>Task 8</u>

Verify that the following function implements equation 4.14 in [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/).

In [5]:
def prob_word_given_category(vocab,category,word) :
    denom = 0
    for entry in vocab:
        if category.get(entry) != None :
            denom +=  category[entry]
    denom += len(vocab)
    num = 1
    if category.get(word) != None :
        num += category[word]
    prob = num/denom
    return(prob)

Describe what each line of the function does with reference to equation 4.14 and add some suitable comments to the function.
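One quick sanity check, sketched here with made-up data, is that the smoothed estimates for a category should sum to 1 over the vocabulary, since each estimate has the form (count + 1) divided by (total count in the category + vocabulary size). The function body is copied from the cell above; `toy_vocab` and `toy_category` are hypothetical and are not taken from the book.

```python
def prob_word_given_category(vocab,category,word) :
    denom = 0
    for entry in vocab:
        if category.get(entry) != None :
            denom +=  category[entry]
    denom += len(vocab)
    num = 1
    if category.get(word) != None :
        num += category[word]
    prob = num/denom
    return(prob)

# Hypothetical toy data: "dog" is in the vocabulary but not in the category.
toy_vocab = {"fish": 3, "dog": 1, "cat": 2}
toy_category = {"fish": 2, "cat": 1}

# One smoothed probability per vocabulary word.
probs = {w: prob_word_given_category(toy_vocab, toy_category, w) for w in toy_vocab}
print(probs)

# Up to floating point rounding, this total should be 1.
print(sum(probs.values()))
```

Note that `"dog"` receives a non-zero probability even though it never appears in `toy_category`, which is the purpose of the smoothing.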

__Marks__ [5]

### <u>Task 9</u>

Use the function in __Task 8__ and your solution to __Task 7__ to verify the worked example in section 4.3 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/).

__Marks__ [10]

### <u>Task 10</u>

Use your solutions to the preceding tasks to solve exercise 4.2 in [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/).

__Marks__ [10]

## Document classification

The methods introduced in sections 4.1 and 4.2 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) can be used to classify documents according to their content. To do this, several documents of a known type (e.g. science fiction novel, medical journal) are used to create categories and a vocabulary, and then individual documents are classified accordingly.

### <u>Task 11</u>

Download at least 20 examples of work (chapters, acts, sections, etc.) from at least 4 different authors (of your choosing) from [Project Gutenberg](https://www.gutenberg.org/). Use the examples to create documents, categories, and a vocabulary for the purposes of classification.

You will probably want to modify the functions you used in the earlier tasks to process the text you obtain from Project Gutenberg more effectively. Have a look at the code examples in the appendix to help you decide how this might be done.
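Plain text files from Project Gutenberg usually wrap the work itself in licence text, separated by lines beginning `*** START OF` and `*** END OF`. The sketch below shows one possible way to keep only the text between those lines. The exact marker wording is an assumption and differs between files, so check it against the files you download; `strip_gutenberg_boilerplate` and `sample` are hypothetical names used only for illustration.

```python
import re

# Assumed marker patterns; the wording varies between files.
START = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*")
END = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*")

def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if both are found."""
    start = START.search(text)
    end = END.search(text)
    if start is None or end is None or end.start() < start.end():
        # Markers missing or in an unexpected order: leave the text unchanged.
        return text
    return text[start.end():end.start()].strip()

# A made-up miniature file for demonstration.
sample = (
    "Licence header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PERSUASION ***\n"
    "Sir Walter Elliot, of Kellynch Hall\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PERSUASION ***\n"
    "Licence footer\n"
)

body = strip_gutenberg_boilerplate(sample)
print(body)
```

Returning the input unchanged when the markers are not found is a deliberately cautious choice, so that an unexpected file format does not silently produce an empty document.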

__Marks__ [25]

### <u>Task 12</u>

Use your results from previous tasks and equation 4.10 in [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) to try to categorise excerpts of text from works by the authors you chose in __Task 11__. The example code in the appendix should help you achieve this. Note: what probability do you assign to a word that is in an excerpt but not in the vocabulary? What effect (if any) does the length of the excerpt have?

Summarise your findings.
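Equation 4.10 works with sums of log probabilities rather than products of probabilities. The sketch below hints at one practical reason, using made-up numbers: `word_probs` is a hypothetical list of per-word probabilities for a 400-word excerpt and is not derived from any real category.

```python
import math

# Hypothetical per-word probabilities for a 400-word excerpt.
word_probs = [0.001] * 400

# Multiplying many small numbers eventually underflows
# to exactly 0.0 in floating point arithmetic.
product = 1.0
for p in word_probs:
    product *= p

# Summing logarithms stays comfortably within range.
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Once every category's score has underflowed to zero the categories can no longer be compared, whereas the log scores remain distinct.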

Try and improve your classification method by considering some of the following.

- Process your documents using a stemmer
- Remove words that are not in an English dictionary
- Find a way of ranking the words in your vocabulary and categories and use this to remove some of the most common and least common words.
- Explore the __multinomial naive Bayes__ method suggested in section 4.4 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/).
- Any good idea you have (or have found out about) that might improve the ability to classify your excerpts.

__Marks__ [20]

## Appendix

### <u>nltk package</u>

This is a useful package for helping with document classification. You can install it using 

In [6]:
!python -m pip install nltk


### using nltk to check for English words

In [7]:
import nltk
nltk.download('words')
english = set(nltk.corpus.words.words())

some_words = ["hello","fish","go","vis","bonjour","gedaan"]
english_words = [word for word in some_words if word in english]
english_words

[nltk_data] Downloading package words to /home/grosedj/nltk_data...
[nltk_data]   Package words is already up-to-date!


['hello', 'fish', 'go', 'vis']

### reading a file of text as a string

In [8]:
file_name = "./data/j-austen-persuasion-c1.txt"
# the with statement closes the file once the block has finished
with open(file_name, "r") as file:
    content = file.read()

### using nltk to find the stems of words

In [10]:
import nltk

stemmer = nltk.SnowballStemmer(language='english')

words = ["love","lovely","loved","quick","quicker","quickly"]
stems = [stemmer.stem(word) for word in words]

print(stems)

['love', 'love', 'love', 'quick', 'quicker', 'quick']


### using re to split a string on more than one delimiter

In [11]:
import re

s = "a string to, test, some \n of re's : delimiters"
words = re.split(' |\n|,|\"',s)

print(words)

['a', 'string', 'to', '', 'test', '', 'some', '', '', 'of', "re's", ':', 'delimiters']


Note - how would you get rid of those '' entries in words?