# Homework 1: Preprocessing and Text Classification

Student Name: Minh Trung Nguyen

Student ID: 1016151

# General Info

<b>Due date</b>: Sunday, 5 Apr 2020 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day (both week and weekend days counted)

<b>Marks</b>: 10% of mark for class (with 9% on correctness + 1% on quality and efficiency of your code)

<b>Materials</b>: See [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/17601/pages/using-jupyter-notebook-and-python?module_item_id=1678430) on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

To familiarize yourself with NLTK, here is a free online book:  Steven Bird, Ewan Klein, and Edward Loper (2009). <a href=http://nltk.org/book>Natural Language Processing with Python</a>. O'Reilly Media Inc. You may also consult the <a href=https://www.nltk.org/api/nltk.html>NLTK API</a>.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parenthesis after the instructions. 

You will be marked not only on the correctness of your methods, but also the quality and efficency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourge you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.

# Overview

In this homework, you'll be working with a collection tweets. The task is to classify whether a tweet constitutes a rumour event. This homework involves writing code to preprocess data and perform text classification.

# 1. Preprocessing (5 marks)

**Instructions**: Run the code below to download the tweet corpus for the assignment. Note: the download may take some time. **No implementation is needed.**

In [4]:
import requests
import os
from pathlib import Path

fname = 'rumour-data.tgz'
data_dir = os.path.splitext(fname)[0] #'rumour-data'

my_file = Path(fname)
if not my_file.is_file():
    url = "https://github.com/jhlau/jhlau.github.io/blob/master/files/rumour-data.tgz?raw=true"
    r = requests.get(url)

    #Save to the current directory
    with open(fname, 'wb') as f:
        f.write(r.content)
        
print("Done. File downloaded:", my_file)


**Instructions**: Run the code to extract the zip file. Note: the extraction may take a minute or two. **No implementation is needed.**

In [5]:
import tarfile

#decompress rumour-data.tgz
tar = tarfile.open(fname, "r:gz")
tar.extractall()
tar.close()

#remove superfluous files (e.g. .DS_store)
extra_files = []
for r, d, f in os.walk(data_dir):
    for file in f:
        if (file.startswith(".")):
            extra_files.append(os.path.join(r, file))
for f in extra_files:
    os.remove(f)

print("Extraction done.")

Extraction done.


### Question 1 (1.0 mark)

**Instructions**: The corpus data is in the *rumour-data* folder. It contains 2 sub-folders: *non-rumours* and *rumours*. As the names suggest, *rumours* contains all rumour-propagating tweets, while *non-rumours* has normal tweets. Within  *rumours* and *non-rumours*, you'll find some sub-folders, each named with an ID. Each of these IDs constitutes an 'event', where an event is defined as consisting a **source tweet** and its **reactions**.

An illustration of the folder structure is given below:

    rumour-data
        - rumours
            - 498254340310966273
                - reactions
                    - 498254340310966273.json
                    - 498260814487642112.json
                - source-tweet
                    - 498254340310966273.json
        - non-rumours

Now we need to gather the tweet messages for rumours and non-rumour events. As the individual tweets are stored in json format, we need to use a json parser to parse and collect the actual tweet message. The function `get_tweet_text_from_json(file_path)` is provided to do that.

**Task**: Complete the `get_events(event_dir)` function. The function should return **a list of events** for a particular class of tweets (e.g. rumours), and each event should contain the source tweet message and all reaction tweet messages.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.

In [5]:
import json

def get_tweet_text_from_json(file_path):
    with open(file_path) as json_file:
        data = json.load(json_file)
        return data["text"]
    
def get_events(event_dir):
    event_list = []
    
    # each event is the ID e.g. '498254340310966273'
    for event in sorted(os.listdir(event_dir)):
        event_temp_list = []
        
        # get the source tweet
        source = get_tweet_text_from_json(os.path.join(event_dir, event, "source-tweet", event + ".json"))
        event_temp_list.append(source)
        
        # get the reaction tweets
        # each reaction is the ID.json e.g. '498254340310966273.json'
        for reaction in sorted(os.listdir(os.path.join(event_dir, event, "reactions"))):
            react = get_tweet_text_from_json(os.path.join(event_dir, event, "reactions", reaction))
            event_temp_list.append(react)
        
        event_list.append(event_temp_list)
        
    return event_list
    
#a list of events, and each event is a list of tweets (source tweet + reactions)    
rumour_events = get_events(os.path.join(data_dir, "rumours"))
nonrumour_events = get_events(os.path.join(data_dir, "non-rumours"))

print("Number of rumour events =", len(rumour_events))
print("Number of non-rumour events =", len(nonrumour_events))

KeyboardInterrupt: 

**For your testing:**

In [5]:
assert(len(rumour_events) == 500)
assert(len(nonrumour_events) == 1000)

### Question 2 (1.0 mark)

**Instructions**: Next we need to preprocess the collected tweets to create a bag-of-words representation. The preprocessing steps required here are: (1) tokenize each tweet into individual word tokens (using NLTK `TweetTokenizer`); and (2) remove stopwords (based on NLTK `stopwords`).

**Task**: Complete the `preprocess_events(event)` function. The function takes **a list of events** as input, and returns **a list of preprocessed events**. Each preprocessed event should have a dictionary of words and frequencies.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.

In [6]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')
from collections import defaultdict, Counter

tt = TweetTokenizer()
stopwords = set(stopwords.words('english'))

def preprocess_events(events):
    preprocessed_events = []
    for event in events:
        bow = Counter()
        for tweet in event:
            tokens = tt.tokenize(tweet)
            for word in tokens:
                if word not in stopwords:
                    bow[word] = bow.get(word, 0) + 1
        preprocessed_events.append(bow)
                
    return preprocessed_events


preprocessed_rumour_events = preprocess_events(rumour_events)
preprocessed_nonrumour_events = preprocess_events(nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))

[nltk_data] Downloading package stopwords to /Users/MAC/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Number of preprocessed rumour events = 500
Number of preprocessed non-rumour events = 1000


In [7]:
preprocessed_rumour_events

[Counter({'Michael': 2,
          'Brown': 2,
          '17': 4,
          'yr': 2,
          'old': 4,
          'boy': 5,
          'shot': 7,
          '10x': 2,
          '&': 4,
          'killed': 2,
          'police': 10,
          '#Ferguson': 3,
          'today': 2,
          '.': 47,
          'Media': 2,
          'reports': 2,
          '"': 22,
          'shoot': 2,
          'man': 3,
          '#blackboysonly': 2,
          '@AmeenaGK': 38,
          '@jaythenerdkid': 3,
          'And': 1,
          'long': 1,
          'conservative': 1,
          'pundit': 1,
          'finds': 1,
          'pic': 1,
          'messing': 1,
          'around': 2,
          'friends': 1,
          'ID': 1,
          '@d_m_elms': 2,
          "they'll": 1,
          'drag': 1,
          'entire': 1,
          'history': 1,
          'every': 2,
          'stolen': 1,
          'sip': 1,
          "dad's": 1,
          'beer': 1,
          ',': 17,
          'second-hand': 1,
         

**For your testing**:

In [8]:
assert(len(preprocessed_rumour_events) == 500)
assert(len(preprocessed_nonrumour_events) == 1000)

**Instructions**: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. Run the code below to collect all unique hashtags in the preprocessed data. **No implementation is needed.**



In [9]:
def get_all_hashtags(events):
    hashtags = set([])
    for event in events:
        for word, frequency in event.items():
            if word.startswith("#"):
                hashtags.add(word)
    return hashtags

hashtags = get_all_hashtags(preprocessed_rumour_events + preprocessed_nonrumour_events)
print("Number of hashtags =", len(hashtags))

Number of hashtags = 1829


In [10]:
hashtags

{'#EatADick',
 '#Dudelove',
 '#furgeson',
 '#Fascism',
 '#катасрофа',
 '#energy',
 '#terrorizing',
 '#info4',
 '#NESARA',
 '#LiarsLie',
 '#WE',
 '#Islamic_State',
 '#NewsCorpse',
 '#pnpcbc',
 '#unitedbywings',
 '#frustrating',
 '#VictimHood',
 '#Facepalm',
 '#OneBastardBitesTheDust',
 '#punintended',
 '#thewholeworldiswatching',
 '#Yeah',
 '#Armour',
 "#Ferguson'da",
 '#FascistCops',
 '#dumbasses',
 '#AirbusCrash',
 '#BundyRanch',
 '#CNN',
 '#thinkforyourself',
 '#butthurt',
 '#givefacts',
 '#Flightradar24',
 '#Hindi',
 '#Democracy',
 '#COVERUP',
 '#PKdebate',
 '#Australians',
 '#commonsense',
 '#fuckinghilarious',
 '#KentState',
 '#MiraclesDoHappen',
 '#Harnof',
 '#uber',
 '#Prayers',
 '#GoodGuysWithGuns',
 '#MiliarizedPolice',
 '#PeaceHasToRule',
 '#US',
 '#PostMedia',
 '#Justic4all',
 '#Ukrainians',
 '#cancerofcorruption',
 '#USURY',
 '#OttowaShooting',
 '#sysdneysiege',
 '#kidnapper',
 '#Police',
 '#AvsFam',
 '#framing',
 '#Shot',
 '#Gulen',
 '#nojustice',
 '#Nazi',
 '#pegida',
 '#

### Question 3 (2.0 mark)

**Instructions**: Our task here to tokenize the hashtags, by implementing a reversed version of the MaxMatch algorithm discussed in class, where matching begins at the end of the hashtag and progresses backwards. NLTK has a list of words that you can use for matching, see starter code below. Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatizer before matching. When lemmatising a word, you also need to provide the part-of-speech tag of the word. You should use `nltk.tag.pos_tag` for doing part-of-speech tagging.

Note that the list of words is incomplete, and, if you are unable to make any longer match, your code should default to matching a single letter. Create a new list of tokenized hashtags (this should be a list of lists of strings) and use slicing to print out the last 20 hashtags in the list.

For example, given "#speakup", the algorithm should produce: \["#", "speak", "up"\]. And note that you do not need to delete the hashtag symbol ("#") from the tokenised outputs.

**Task**: Complete the `tokenize_hashtags(hashtags)` function by implementing a reversed MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of word tokens".

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.

In [11]:
from nltk.tag import pos_tag

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [12]:
from nltk.corpus import wordnet
nltk.download('words')

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = set(nltk.corpus.words.words()) #a list of words provided by NLTK
words_lower = map(lambda x:x.lower(), words) # lowercase the NLTK word list
words_lower = set(words_lower)


def maxMatch(hashtag, tokens):
    """Match the words in hashtag to a list of words in backward direction (from right to left)"""
    if hashtag == '':
        return tokens
    for i in range(len(hashtag), 0, -1):
        firstword = hashtag[-i:]
        remainder = hashtag[:len(hashtag)-i]
        firstword_lower = firstword.lower() # lowercase prior to matching
        # Lemmatize the firstword before matching
        lemma = lemmatizer.lemmatize(firstword_lower, get_wordnet_pos(firstword_lower))
        
        if lemma in words_lower:
            tokens.insert(0, firstword) # append at the front of a list
            return maxMatch(remainder, tokens) # recursive call
        # keep the #
        elif len(lemma) == 1: 
            tokens.insert(0, firstword)
            return maxMatch(remainder, tokens)
        
        
def tokenize_hashtags(hashtags):
    tokenized_hashtags = {}
    for hashtag in hashtags:
        tokens = []
        tokenized_hashtags[hashtag] = maxMatch(hashtag, tokens)
    return tokenized_hashtags


tokenized_hashtags = tokenize_hashtags(hashtags)

print(list(tokenized_hashtags.items())[:20])

[nltk_data] Downloading package words to /Users/MAC/nltk_data...
[nltk_data]   Package words is already up-to-date!
[('#EatADick', ['#', 'E', 'atA', 'Dick']), ('#Dudelove', ['#', 'Dude', 'love']), ('#furgeson', ['#', 'f', 'urge', 'son']), ('#Fascism', ['#', 'Fascism']), ('#катасрофа', ['#', 'к', 'а', 'т', 'а', 'с', 'р', 'о', 'ф', 'а']), ('#energy', ['#', 'energy']), ('#terrorizing', ['#', 'terrorizing']), ('#info4', ['#', 'in', 'fo', '4']), ('#NESARA', ['#', 'NE', 'SARA']), ('#LiarsLie', ['#', 'Liars', 'Lie']), ('#WE', ['#', 'WE']), ('#Islamic_State', ['#', 'Islamic', '_', 'State']), ('#NewsCorpse', ['#', 'News', 'Corpse']), ('#pnpcbc', ['#', 'p', 'n', 'p', 'c', 'b', 'c']), ('#unitedbywings', ['#', 'united', 'by', 'wings']), ('#frustrating', ['#', 'frustrating']), ('#VictimHood', ['#', 'VictimHood']), ('#Facepalm', ['#', 'Face', 'palm']), ('#OneBastardBitesTheDust', ['#', 'One', 'Bastard', 'Bites', 'The', 'Dust']), ('#punintended', ['#', 'p', 'unintended'])]


In [19]:
tokenized_hashtags

{'#EatADick': ['#', 'E', 'atA', 'Dick'],
 '#Dudelove': ['#', 'Dude', 'love'],
 '#furgeson': ['#', 'f', 'urge', 'son'],
 '#Fascism': ['#', 'Fascism'],
 '#катасрофа': ['#', 'к', 'а', 'т', 'а', 'с', 'р', 'о', 'ф', 'а'],
 '#energy': ['#', 'energy'],
 '#terrorizing': ['#', 'terrorizing'],
 '#info4': ['#', 'in', 'fo', '4'],
 '#NESARA': ['#', 'NE', 'SARA'],
 '#LiarsLie': ['#', 'Liars', 'Lie'],
 '#WE': ['#', 'WE'],
 '#Islamic_State': ['#', 'Islamic', '_', 'State'],
 '#NewsCorpse': ['#', 'News', 'Corpse'],
 '#pnpcbc': ['#', 'p', 'n', 'p', 'c', 'b', 'c'],
 '#unitedbywings': ['#', 'united', 'by', 'wings'],
 '#frustrating': ['#', 'frustrating'],
 '#VictimHood': ['#', 'VictimHood'],
 '#Facepalm': ['#', 'Face', 'palm'],
 '#OneBastardBitesTheDust': ['#', 'One', 'Bastard', 'Bites', 'The', 'Dust'],
 '#punintended': ['#', 'p', 'unintended'],
 '#thewholeworldiswatching': ['#',
  'the',
  'who',
  'lew',
  'or',
  'l',
  'dis',
  'watching'],
 '#Yeah': ['#', 'Yeah'],
 '#Armour': ['#', 'Arm', 'our'],
 "#Fe

**For your testing:**

In [14]:
assert(len(tokenized_hashtags) == len(hashtags))

### Question 4 (1.0 mark)

**Instructions**: Now that we have the tokenized hashtags, we need to go back and update the bag-of-words representation for each event.

**Task**: Complete the ``update_event_bow(events)`` function. The function takes **a list of preprocessed events**, and for each event, it looks for every hashtag it has and updates the bag-of-words dictionary with the tokenized hashtag tokens. Note: you do not need to delete the counts of the original hashtags when updating the bag-of-words (e.g., if a document has "#speakup":2 in its bag-of-words representation, you do not need to delete this hashtag and its counts).

In [15]:
def update_event_bow(events):
    ###
    # Your answer BEGINS HERE
    ###
    for index, event in enumerate(events):
        for word in list(event):
            if word.startswith("#"):
                for token in tokenized_hashtags[word]:
                    event[token] += event[word] # update the list with new tokens
        events[index] = event
    ###
    # Your answer ENDS HERE
    ###
            
update_event_bow(preprocessed_rumour_events)
update_event_bow(preprocessed_nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))

Number of preprocessed rumour events = 500
Number of preprocessed non-rumour events = 1000


In [16]:
preprocessed_rumour_events

[Counter({'Michael': 2,
          'Brown': 2,
          '17': 4,
          'yr': 2,
          'old': 4,
          'boy': 5,
          'shot': 7,
          '10x': 2,
          '&': 4,
          'killed': 2,
          'police': 10,
          '#Ferguson': 3,
          'today': 2,
          '.': 47,
          'Media': 2,
          'reports': 2,
          '"': 22,
          'shoot': 2,
          'man': 3,
          '#blackboysonly': 2,
          '@AmeenaGK': 38,
          '@jaythenerdkid': 3,
          'And': 1,
          'long': 1,
          'conservative': 1,
          'pundit': 1,
          'finds': 1,
          'pic': 1,
          'messing': 1,
          'around': 2,
          'friends': 1,
          'ID': 1,
          '@d_m_elms': 2,
          "they'll": 1,
          'drag': 1,
          'entire': 1,
          'history': 1,
          'every': 2,
          'stolen': 1,
          'sip': 1,
          "dad's": 1,
          'beer': 1,
          ',': 17,
          'second-hand': 1,
         

# Text Classification (4 marks)

### Question 5 (1.0 mark)

**Instructions**: Here we are interested to do text classification, to predict, given a tweet and its reactions, whether it is a rumour or not. The task here is to create training, development and test partitions from the preprocessed events and convert the bag-of-words representation into feature vectors.

**Task**: Using scikit-learn, create training, development and test partitions with a 60%/20%/20% ratio. Remember to preserve the ratio of rumour/non-rumour events for all your partitions. Next, turn the bag-of-words dictionary of each event into a feature vector, using scikit-learn `DictVectorizer`.

In [43]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

vectorizer = DictVectorizer()

###
# Your answer BEGINS HERE
###
X = preprocessed_rumour_events + preprocessed_nonrumour_events
y = []

for i in range(len(preprocessed_rumour_events)):
    y.append('rumour')
for j in range(len(preprocessed_nonrumour_events)):
    y.append('not-rumour')

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=345)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2, random_state=345)

training_data = vectorizer.fit_transform(X_train)
dev_data = vectorizer.transform(X_dev)
test_data = vectorizer.transform(X_test)

###
# Your answer ENDS HERE
###

print("Vocabulary size =", len(vectorizer.vocabulary_))

Vocabulary size = 28724


In [35]:
dev_data

<240x30350 sparse matrix of type '<class 'numpy.float64'>'
	with 22534 stored elements in Compressed Sparse Row format>

### Question 6 (2.0 mark)

**Instructions**: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation (hyper)parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do **not** use cross-validation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the accuracy with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output.

**Task**: Implement two text classifiers: Naive Bayes and Logistic Regression. Tune the hyper-parameters of these classifiers and print the task performance for different hyper-parameter settings.

In [44]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

###
# Your answer BEGINS HERE
###

alpha_to_test = [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
nbcs = [MultinomialNB(alpha=a) for a in alpha_to_test]

c_to_test = [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 1, 10, 100, 1000, 5000]
lrcs = [LogisticRegression(C=c, random_state=345) for c in c_to_test]

def fit_multiple_classifiers(clfs, X_train, y_train, X_dev, y_dev, method):
    for clf in clfs:
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_dev)
        if method == 'nb':
            print (f"accuracy for alpha = {clf.alpha}: {accuracy_score(y_dev, predictions)}")
        elif method == 'lr':
            print (f"accuracy for c = {clf.C}: {accuracy_score(y_dev, predictions)}")


###
# Your answer ENDS HERE
###

In [45]:
print("Tuning Naive Bayes:\n")
fit_multiple_classifiers(nbcs, training_data, y_train, dev_data, y_dev, 'nb')
print("\n")
print("Tuning Logistic Regression:\n")
fit_multiple_classifiers(lrcs, training_data, y_train, dev_data, y_dev, 'lr')

Tuning Naive Bayes:

accuracy for alpha = 0.001: 0.8041666666666667
accuracy for alpha = 0.01: 0.8333333333333334
accuracy for alpha = 0.05: 0.8583333333333333
accuracy for alpha = 0.1: 0.8458333333333333
accuracy for alpha = 0.2: 0.8416666666666667
accuracy for alpha = 0.3: 0.8416666666666667
accuracy for alpha = 0.5: 0.8416666666666667
accuracy for alpha = 0.6: 0.85
accuracy for alpha = 0.7: 0.8541666666666666
accuracy for alpha = 0.8: 0.8583333333333333
accuracy for alpha = 0.9: 0.8583333333333333
accuracy for alpha = 1: 0.8625


Tuning Logistic Regression:

accuracy for c = 0.001: 0.7666666666666667
accuracy for c = 0.01: 0.825
accuracy for c = 0.05: 0.8625
accuracy for c = 0.1: 0.8583333333333333
accuracy for c = 0.3: 0.8541666666666666
accuracy for c = 0.5: 0.8583333333333333
accuracy for c = 0.7: 0.8541666666666666
accuracy for c = 1: 0.8583333333333333
accuracy for c = 10: 0.8583333333333333
accuracy for c = 100: 0.8583333333333333
accuracy for c = 1000: 0.8625
accuracy for c =

### Question 7 (1.0 mark)

**Instructions**: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both accuracy and macro-averaged F-score for each classifier. Be sure to label your output.

**Task**: Compute test performance in terms of accuracy and macro-averaged F-score for both Naive Bayes and Logistic Regression, using optimal hyper-parameter settings.

In [46]:
# We will use alpha = 1 and C = 1000
nb_best = MultinomialNB(alpha=1)
lr_best = LogisticRegression(C=1000)

nb_best.fit(training_data, y_train)
lr_best.fit(training_data, y_train)

nb_prediction = nb_best.predict(test_data)
lr_prediction = lr_best.predict(test_data)

print("Report for Naive Bayes:\n")
print(classification_report(y_test, nb_prediction))
print(f"Accurary: {accuracy_score(y_test, nb_prediction)}")
print("\n")

print("Report for Logistic Regression:\n")
print(classification_report(y_test, lr_prediction))
print(f"Accurary: {accuracy_score(y_test, lr_prediction)}")

Report for Naive Bayes:

             precision    recall  f1-score   support

 not-rumour       0.85      0.91      0.88       200
     rumour       0.79      0.67      0.72       100

avg / total       0.83      0.83      0.83       300

Accurary: 0.83


Report for Logistic Regression:

             precision    recall  f1-score   support

 not-rumour       0.82      0.91      0.86       200
     rumour       0.76      0.60      0.67       100

avg / total       0.80      0.80      0.80       300

Accurary: 0.8033333333333333
