# Assignment 1: Preprocessing and Text Classification

# Overview

In this homework, you'll be working with a collection of tweets. The task is to predict the geolocation (country) where the tweet comes from. This homework involves writing code to preprocess data and perform text classification.

# Preprocessing (4 marks)

**Instructions**: Download the data (as1-data.json) from Canvas and put it in the same directory as this iPython notebook. Run the code below to load the JSON data. This produces two objects, `x` and `y`, which contain a list of tweets and the corresponding country labels (using the standard [2 letter country code](https://www.iban.com/country-codes)) respectively. **No implementation is needed.**

In [1]:
import json

x = []
y = []
with open("as1-data.json") as f:
    data = json.load(f)
for k, v in data.items():
    x.append(k)
    y.append(v)
    
print("Number of tweets =", len(x))
print("Number of labels =", len(y))
print("\nSamples of data:")
for i in range(10):
    print("Country =", y[i], "\tTweet =", x[i])
    
assert(len(x) == 943)
assert(len(y) == 943)

Number of tweets = 943
Number of labels = 943

Samples of data:
Country = us 	Tweet = @Addictd2Success thx u for following
Country = us 	Tweet = Let's just say, if I were to ever switch teams, Khalesi would be top of the list. #girlcrush
Country = ph 	Tweet = Taemin jonghyun!!! Your birits make me go~ http://t.co/le8z3dntlA
Country = id 	Tweet = depart.senior 👻 rapat perdana (with Nyayu, Anita, and 8 others at Ruang Aescullap FK Unsri Madang) — https://t.co/swRALlNkrQ
Country = ph 	Tweet = Done with internship with this pretty little lady!  (@ Metropolitan Medical Center w/ 3 others) [pic]: http://t.co/1qH61R1t5r
Country = gb 	Tweet = Wow just Boruc's clanger! Haha Sunday League stuff that, Giroud couldn't believe his luck! #clown
Country = my 	Tweet = I'm at Sushi Zanmai (Petaling Jaya, Selangor) w/ 5 others http://t.co/bcNobykZ
Country = us 	Tweet = Mega Fest!!!! Its going down🙏🙌  @BishopJakes
Country = gb 	Tweet = @EllexxxPharrell wow love the pic babe xx
Country = us 	Tweet = You 

### Question 1 (1.0 mark)

**Instructions**: Next we need to preprocess the collected tweets to create a bag-of-words representation (based on <font color=pink>frequency</font>). The preprocessing steps required here are: 

(1) tokenize each tweet into individual word tokens (using NLTK `TweetTokenizer`); 

(2) lowercase all words; 

(3) remove any word that does not contain at least one letter of the English alphabet (e.g. {_hello_, _#okay_, _abc123_} would be kept, but not {_123_, _!!_}); and

(4) remove stopwords (based on NLTK `stopwords`). An empty tweet (after preprocessing) and its country label should be **excluded** from the output (`x_processed` and `y_processed`).

**Task**: Complete the `preprocess_data(data, labels)` function. The function takes **a list of tweets** and **a corresponding list of country labels** as input, and returns **two lists**. For the first list, each element is a <font color=pink>bag-of-words representation</font> of a tweet (represented using a python dictionary). For the second list, each element is a corresponding country label. Note that while we do not need to preprocess the country labels (`y`), we need to have a new output <font color=pink>list</font> (`y_processed`) because some tweets may be removed after the preprocessing (due to having an empty bag-of-words).
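To make the four steps concrete, here is a minimal sketch on pre-tokenised input. It is only an illustration: `TOY_STOPWORDS` and `bag_of_words` are hypothetical names, the tiny stopword set stands in for NLTK's list, and the tokens are supplied by hand rather than produced by `TweetTokenizer`.

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list (assumption).
TOY_STOPWORDS = {"the", "is", "a", "for", "you"}

def bag_of_words(tokens, stopwords=TOY_STOPWORDS):
    """Lowercase tokens, keep those with at least one letter a-z,
    drop stopwords, and count what is left."""
    lowered = [t.lower() for t in tokens]
    with_letters = [t for t in lowered if re.search(r"[a-z]", t)]
    kept = [t for t in with_letters if t not in stopwords]
    return dict(Counter(kept))

# Tokens are given directly here; in the assignment they come from the tokenizer.
example = bag_of_words(["Thanks", "for", "the", "follow", "!!", "123", "#Okay", "follow"])
empty = bag_of_words(["!!", "123", "the"])

print(example)
print(empty)  # an empty bag means the tweet and its label would be dropped
```

A tweet whose bag comes back empty, like the second call above, is the case where both the tweet and its label are left out of the output lists.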

**Check**: Use the assertion statements in <b>"For your testing"</b> below for the expected output.

In [2]:
import nltk
nltk.download('stopwords')
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

tt = TweetTokenizer()
stopwords = set(stopwords.words('english')) 
#note: stopwords are all in lowercase

def preprocess_data(data, labels):
    
    ###
    # Your answer BEGINS HERE
    ###
    from collections import Counter
    import re
    
    x_processed = []
    y_processed = []
    for i in range(len(data)):
        # 1. tokenization
        words = tt.tokenize(data[i])

        # 2. lowercase the words
        words = [word.lower() for word in words]

        # 3. keep only words that contain at least one English letter
        words = [word for word in words if re.search('[a-z]', word)]
        
        # 4. remove stop words and drop instances with empty tweet
        words = [word for word in words if word not in stopwords]
        if words: 
            count = Counter(words)
            x_processed.append(count)
            y_processed.append(labels[i])

    return x_processed, y_processed
    ###
    # Your answer ENDS HERE
    ###

x_processed, y_processed = preprocess_data(x, y)

print("Number of preprocessed tweets =", len(x_processed))
print("Number of preprocessed labels =", len(y_processed))
print("\nSamples of preprocessed data:")
x_processed_ = [dict(c) for c in x_processed]
for i in range(10):
    print("Country =", y_processed[i], "\tTweet =", x_processed_[i])

Number of preprocessed tweets = 943
Number of preprocessed labels = 943

Samples of preprocessed data:
Country = us 	Tweet = {'@addictd2success': 1, 'thx': 1, 'u': 1, 'following': 1}
Country = us 	Tweet = {"let's": 1, 'say': 1, 'ever': 1, 'switch': 1, 'teams': 1, 'khalesi': 1, 'would': 1, 'top': 1, 'list': 1, '#girlcrush': 1}
Country = ph 	Tweet = {'taemin': 1, 'jonghyun': 1, 'birits': 1, 'make': 1, 'go': 1, 'http://t.co/le8z3dntla': 1}
Country = id 	Tweet = {'depart.senior': 1, 'rapat': 1, 'perdana': 1, 'nyayu': 1, 'anita': 1, 'others': 1, 'ruang': 1, 'aescullap': 1, 'fk': 1, 'unsri': 1, 'madang': 1, 'https://t.co/swrallnkrq': 1}
Country = ph 	Tweet = {'done': 1, 'internship': 1, 'pretty': 1, 'little': 1, 'lady': 1, 'metropolitan': 1, 'medical': 1, 'center': 1, 'w': 1, 'others': 1, 'pic': 1, 'http://t.co/1qh61r1t5r': 1}
Country = gb 	Tweet = {'wow': 1, "boruc's": 1, 'clanger': 1, 'haha': 1, 'sunday': 1, 'league': 1, 'stuff': 1, 'giroud': 1, 'believe': 1, 'luck': 1, '#clown': 1}

[nltk_data] Downloading package stopwords to D:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**For your testing**:

In [3]:
assert(len(x_processed) == len(y_processed))
assert(len(x_processed) > 800)

**Instructions**: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. Run the code below to collect all unique hashtags in the preprocessed data. **No implementation is needed.**



In [4]:
def get_all_hashtags(data):
    hashtags = set([])
    for d in data:
        for word, frequency in d.items():
            if word.startswith("#") and len(word) > 1:
                hashtags.add(word)
    return hashtags

hashtags = get_all_hashtags(x_processed)
print("Number of hashtags =", len(hashtags))
print(sorted(hashtags))

Number of hashtags = 425
['#100percentpay', '#1stsundayofoctober', '#1yearofalmostisneverenough', '#2011prdctn', '#2015eebritishfilmacademyawards', '#2k16', '#2littlebirds', '#365picture', '#5sosacousticatlanta', '#5sosfam', '#8thannualpubcrawl', '#affsuzukicup', '#aflpowertigers', '#ahimacon14', '#aim20', '#airasia', '#allcity', '#alliswell', '#allwedoiscurls', '#amazing', '#anferneehardaway', '#ariona', '#art', '#arte', '#artwork', '#ashes', '#asian', '#asiangirl', '#askcrawford', '#askherforfback', '#askolly', '#asksteven', '#at', '#australia', '#awesome', '#awesomepict', '#barcelona', '#bart', '#bayofislands', '#beautiful', '#bedimages', '#bell', '#beringmy', '#bettybooppose', '#bff', '#big', '#bigbertha', '#bigbreakfast', '#blackhat', '#blessedmorethanicanimagine', '#blessedsunday', '#blogtourambiente', '#bluemountains', '#bonekachika', '#boomtaob', '#booyaa', '#bored', '#boredom', '#bradersisterhood', '#breaktime', '#breedingground', '#bringithomemy', '#brooksengland', '#burgers'

### Question 2 (1.0 mark)

**Instructions**: Our task here is to tokenize the hashtags by implementing the **MaxMatch algorithm** discussed in class.

NLTK has a list of words that you can use for matching; see the starter code below (`words`). Be careful about <font color=pink>efficiency with respect to doing word lookups</font>. One extra challenge you have to deal with is that the provided list of words (`words`) includes only lemmas: your MaxMatch algorithm should match inflected forms by <font color=pink>converting them into lemmas using the NLTK lemmatizer</font> before matching (provided by the function `lemmatize(word)`). Note that the list of words (`words`) is the only source that you'll use for matching (i.e. you do not need to find other external word lists). If you are unable to make any longer match, your code should default to <font color=pink>matching a single letter</font>.

For example, given "#newrecords", the algorithm should produce: \["#", "new", "records"\].
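The greedy left-to-right matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference solution: `TOY_VOCAB` is a made-up word set standing in for `words`, and `toy_lemmatize` is a hypothetical suffix-stripping stand-in for the NLTK-based `lemmatize(word)`.

```python
def max_match(text, vocab, lemmatize=lambda w: w):
    """Greedy left-to-right longest match; falls back to a single character."""
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first and shrink it from the right.
        for end in range(len(text), start, -1):
            candidate = text[start:end]
            # Match on the lemma, but keep the surface form as the token.
            if lemmatize(candidate) in vocab or end == start + 1:
                tokens.append(candidate)
                start = end
                break
    return tokens

# Hypothetical toy vocabulary (a set, so each lookup is O(1) on average).
TOY_VOCAB = {"new", "record", "we", "wear", "are", "social"}

def toy_lemmatize(word):
    """Hypothetical stand-in for a lemmatizer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word

print(["#"] + max_match("newrecords", TOY_VOCAB, toy_lemmatize))
print(["#"] + max_match("wearesocial", TOY_VOCAB))
```

Keeping the vocabulary in a `set` rather than a `list` is what makes the repeated lookups cheap. The second call also shows the typical failure of the greedy strategy: it commits to the long match "wear" and is then left with a stray single letter.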

**Task**: Complete the `tokenize_hashtags(hashtags)` function by implementing the MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of tokenised words".

**Check**: Use the assertion statements in <b>"For your testing"</b> below for the expected output.

In [5]:
from nltk.corpus import wordnet
nltk.download('words')
nltk.download('wordnet')

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = set(nltk.corpus.words.words()) #a list of words provided by NLTK
words = set([ word.lower() for word in words ]) #lowercase all the words for better matching


def lemmatize(word):
    lemma = lemmatizer.lemmatize(word,'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word,'n')
    return lemma

def tokenize_hashtags(hashtags):
    ###
    # Your answer BEGINS HERE
    ###
    import re

    tokenized_hash = {}
    for ahashtag in hashtags:
        tokenized_hash[ahashtag] = [ahashtag[0]]
        hashtag_words = re.findall(r'[a-z]+',ahashtag[1:])
        for hashtag in hashtag_words:

            temp_word = lemmatize(hashtag[0:])
            if temp_word in words:
                tokenized_hash[ahashtag].append(hashtag[0:])
                continue
            start_i, end_i = 0,-1
            while start_i <= 0:  
                temp_word = lemmatize(hashtag[start_i:end_i])
                while not temp_word in words:
                    end_i -= 1
                    temp_word = lemmatize(hashtag[start_i:end_i])
                tokenized_hash[ahashtag].append(hashtag[start_i:end_i])
                start_i = end_i
                temp_word = lemmatize(hashtag[start_i:])
                if temp_word in words: 
                    tokenized_hash[ahashtag].append(hashtag[start_i:])
                    start_i = 2

                end_i = -1

    return tokenized_hash
    ###
    # Your answer ENDS HERE
    ###

#tokenise hashtags with MaxMatch
tokenized_hashtags = tokenize_hashtags(hashtags)

#print results
for k, v in sorted(tokenized_hashtags.items())[-30:]:
    print(k, v)

[nltk_data] Downloading package words to D:\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to D:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#vanilla ['#', 'vanilla']
#vca ['#', 'v', 'ca']
#vegan ['#', 'vega', 'n']
#veganfood ['#', 'vega', 'n', 'food']
#vegetables ['#', 'vegetables']
#vegetarian ['#', 'vegetarian']
#video ['#', 'video']
#vma ['#', 'v', 'ma']
#voteonedirection ['#', 'vote', 'one', 'direction']
#vsco ['#', 'vs', 'c', 'o']
#vscocam ['#', 'vs', 'coca', 'm']
#walking ['#', 'walking']
#watch ['#', 'watch']
#weare90s ['#', 'wear', 'e', 's']
#wearesocial ['#', 'weares', 'o', 'c', 'i', 'al']
#white ['#', 'white']
#wings ['#', 'wings']
#wok ['#', 'wo', 'k']
#wood ['#', 'wood']
#work ['#', 'work']
#workmates ['#', 'work', 'mates']
#world ['#', 'world']
#worldcup2014 ['#', 'world', 'cup']
#yellow ['#', 'yellow']
#yiamas ['#', 'y', 'i', 'ama', 's']
#ynwa ['#', 'yn', 'wa']
#youtube ['#', 'you', 'tube']
#yummy ['#', 'yummy']
#yws13 ['#', 'y', 'ws']
#zweihandvollfarm ['#', 'z', 'wei', 'hand', 'vol', 'l', 'farm']


**For your testing:**

In [6]:
assert(len(tokenized_hashtags) == len(hashtags))
assert(tokenized_hashtags["#newrecord"] == ["#", "new", "record"])

### Question 3 (1.0 mark)

**Instructions**: Our next task is to tokenize the hashtags again, but this time using a **reversed version of the MaxMatch algorithm**, where matching begins at the end of the hashtag and progresses backwards (e.g. for <i>#helloworld</i>, we would process it right to left, starting from the last character <i>d</i>). Just like before, you should use the provided word list (`words`) for word matching.
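As a rough sketch of the right-to-left variant, the function below anchors each match at the end of the remaining string and shrinks the candidate from the left. `TOY_VOCAB` and `max_match_reversed` are hypothetical names used only for illustration, and lemmatization is left out to keep the example short.

```python
def max_match_reversed(text, vocab):
    """Greedy right-to-left longest match; falls back to a single character."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, shrinking it from the left.
        for start in range(0, end):
            candidate = text[start:end]
            if candidate in vocab or start == end - 1:
                tokens.append(candidate)
                end = start
                break
    # Tokens were collected from last to first, so restore reading order.
    tokens.reverse()
    return tokens

# Hypothetical toy vocabulary standing in for the NLTK word list.
TOY_VOCAB = {"new", "record", "we", "wear", "are", "social"}

print(["#"] + max_match_reversed("wearesocial", TOY_VOCAB))
print(["#"] + max_match_reversed("newrecord", TOY_VOCAB))
```

On this toy vocabulary, starting from the right finds "social" and "are" first, so "we" is left intact at the front of "wearesocial" instead of being absorbed into "wear".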

**Task**: Complete the `tokenize_hashtags_rev(hashtags)` function by implementing the reversed MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of tokenised words".

**Check**: Use the assertion statements in <b>"For your testing"</b> below for the expected output.

In [7]:
def tokenize_hashtags_rev(hashtags):
    ###
    # Your answer BEGINS HERE
    ###
    import re

    tokenized_hash = {}
    for ahashtag in hashtags:
        tokenized_hash[ahashtag] = []
        hashtag_words = re.findall(r'[a-z]+',ahashtag[1:])
        for hashtag in hashtag_words:

            temp_word = lemmatize(hashtag)
            if temp_word in words:
                tokenized_hash[ahashtag].append(hashtag)
                continue
            start_i, end_i = 1,len(hashtag)
            while end_i > 0:
                temp_word = lemmatize(hashtag[start_i:end_i])
                while not temp_word in words:
                    start_i += 1
                    temp_word = lemmatize(hashtag[start_i:end_i])
                tokenized_hash[ahashtag].append(hashtag[start_i:end_i])
                end_i = start_i
                temp_word = lemmatize(hashtag[:end_i])
                if temp_word in words: 
                    tokenized_hash[ahashtag].append(hashtag[:end_i])
                    end_i = 0

                start_i = 1
                
        tokenized_hash[ahashtag].append(ahashtag[0])
        tokenized_hash[ahashtag].reverse()

    return tokenized_hash
    ###
    # Your answer ENDS HERE
    ###

    
#tokenise hashtags with the reversed version of MaxMatch
tokenized_hashtags_rev = tokenize_hashtags_rev(hashtags)

#print results
for k, v in sorted(tokenized_hashtags_rev.items())[-30:]:
    print(k, v)

#vanilla ['#', 'vanilla']
#vca ['#', 'v', 'ca']
#vegan ['#', 'v', 'e', 'gan']
#veganfood ['#', 'v', 'e', 'gan', 'food']
#vegetables ['#', 'vegetables']
#vegetarian ['#', 'vegetarian']
#video ['#', 'video']
#vma ['#', 'v', 'ma']
#voteonedirection ['#', 'vote', 'one', 'direction']
#vsco ['#', 'vs', 'c', 'o']
#vscocam ['#', 'vs', 'c', 'o', 'cam']
#walking ['#', 'walking']
#watch ['#', 'watch']
#weare90s ['#', 's', 'we', 'are']
#wearesocial ['#', 'we', 'are', 'social']
#white ['#', 'white']
#wings ['#', 'wings']
#wok ['#', 'w', 'ok']
#wood ['#', 'wood']
#work ['#', 'work']
#workmates ['#', 'work', 'mates']
#world ['#', 'world']
#worldcup2014 ['#', 'world', 'cup']
#yellow ['#', 'yellow']
#yiamas ['#', 'y', 'i', 'a', 'mas']
#ynwa ['#', 'yn', 'wa']
#youtube ['#', 'you', 'tube']
#yummy ['#', 'yummy']
#yws13 ['#', 'y', 'ws']
#zweihandvollfarm ['#', 'z', 'wei', 'hand', 'vol', 'l', 'farm']


**For your testing:**

In [8]:
assert(len(tokenized_hashtags_rev) == len(hashtags))
assert(tokenized_hashtags_rev["#newrecord"] == ["#", "new", "record"])

### Question 4 (1.0 mark)

**Instructions**: The two versions of MaxMatch will produce different results for some of the hashtags. For a hashtag that has different results, our task here is to use a **unigram language model** (lecture 3) to score them to see which is better. Recall that in a unigram language model we compute <font color=pink>P(<i>#</i>, <i>hello</i>, <i>world</i>) = P(<i>#</i>)\*P(<i>hello</i>)\*P(<i>world</i>)</font>.

You should: 

(1) use the NLTK's Brown corpus (`brown_words`) for collecting word frequencies (note: the words are already tokenised so no further tokenisation is needed); 

(2) lowercase all words in the corpus; 

(3) use add-one smoothing when computing the unigram probabilities; and 

(4) work in the log space to prevent numerical underflow.
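The four points above can be sketched on a toy corpus as follows. This is only an illustration: `toy_corpus` and `build_unigram_scorer` are hypothetical names, and the smoothing uses the common form (count + 1) / (M + V), where M is the number of tokens and V the number of distinct words in the corpus.

```python
import math
from collections import Counter

def build_unigram_scorer(corpus_tokens):
    """Return a function that scores a token list with an add-one smoothed
    unigram model, summing log probabilities to avoid underflow."""
    counts = Counter(t.lower() for t in corpus_tokens)  # lowercase the corpus
    total = sum(counts.values())   # M: number of tokens
    vocab_size = len(counts)       # V: number of distinct words

    def log_prob(tokens):
        # Unseen words have count 0 and still receive probability 1 / (M + V).
        return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

    return log_prob

# Hypothetical toy corpus standing in for the Brown corpus.
toy_corpus = ["We", "are", "social", "we", "wear", "hats"]
log_prob = build_unigram_scorer(toy_corpus)

forward = ["wear", "e", "social"]
backward = ["we", "are", "social"]
print("forward  = {:.1f}".format(log_prob(forward)))
print("backward = {:.1f}".format(log_prob(backward)))
```

Summing logs replaces the product of probabilities, so the segmentation with the less negative total is the one the model prefers. In this toy case that is the backward segmentation, because "e" never occurs in the corpus.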

**Task**: Build a unigram language model with add-one smoothing using the word counts from the Brown corpus. Iterate through the hashtags, and for each hashtag where MaxMatch and reversed MaxMatch produce different results, print the following: (1) the hashtag; (2) the results produced by MaxMatch and reversed MaxMatch; and (3) the log probability of each result as given by the unigram language model. Note: you **do not** need to print the hashtags where MaxMatch and reversed MaxMatch produce the same results.

An example output:
```
1. #abcd
MaxMatch = [#, a, bc, d]; LogProb = -2.3
Reversed MaxMatch = [#, a, b, cd]; LogProb = -3.5

2. #efgh
MaxMatch = [#, ef, g, h]; LogProb = -4.2
Reversed MaxMatch = [#, e, fgh]; LogProb = -3.1

```

Have a look at the output, and see if the sequences with better language model scores (i.e. less negative) are generally more coherent.

In [9]:
from nltk.corpus import brown

#words from brown corpus
brown_words = brown.words()

###
# Your answer BEGINS HERE
###
from collections import Counter
import numpy as np

hashtag_map = {}
freqs = Counter(brown_words)
M = len(brown_words)  # M: total number of tokens in the corpus
V = len(freqs)        # V: vocabulary size, i.e. number of unique tokens

def get_prob(words):
    '''input: words = ['#', 'new', 'record']'''
    return np.sum([np.log((freqs[word] + 1) / (M + V)) for word in words])

count = 0
for hashtag in hashtags:
    if tokenized_hashtags_rev[hashtag] != tokenized_hashtags[hashtag]:
        count += 1
        p1 = get_prob(tokenized_hashtags[hashtag])
        pv = get_prob(tokenized_hashtags_rev[hashtag])
        hashtag_map[hashtag] = tokenized_hashtags[hashtag] if p1 > pv \
                                else tokenized_hashtags_rev[hashtag]
        print('{}.'.format(count),hashtag)
        print('MaxMatch =', tokenized_hashtags[hashtag],
              '; LogProb = {:.1f}'.format(p1))
        print('Reversed MaxMatch =', tokenized_hashtags_rev[hashtag],
              '; LogProb = {:.1f}\n'.format( pv ))
    else:
        hashtag_map[hashtag] = tokenized_hashtags[hashtag]
    
###
# Your answer ENDS HERE
###

1. #seniorbabysenior
MaxMatch = ['#', 'senior', 'babys', 'en', 'io', 'r'] ; LogProb = -78.1
Reversed MaxMatch = ['#', 'senior', 'baby', 'senior'] ; LogProb = -45.1

2. #potd
MaxMatch = ['#', 'pot', 'd'] ; LogProb = -38.7
Reversed MaxMatch = ['#', 'po', 'td'] ; LogProb = -42.0

3. #foodporn
MaxMatch = ['#', 'food', 'po', 'r', 'n'] ; LogProb = -62.4
Reversed MaxMatch = ['#', 'food', 'p', 'or', 'n'] ; LogProb = -51.4

4. #flambees
MaxMatch = ['#', 'flamb', 'e', 'es'] ; LogProb = -54.7
Reversed MaxMatch = ['#', 'flam', 'bees'] ; LogProb = -39.3

5. #singapore
MaxMatch = ['#', 'sing', 'a', 'pore'] ; LogProb = -41.6
Reversed MaxMatch = ['#', 's', 'inga', 'pore'] ; LogProb = -54.9

6. #instalook
MaxMatch = ['#', 'ins', 'tal', 'o', 'ok'] ; LogProb = -69.4
Reversed MaxMatch = ['#', 'ins', 'ta', 'look'] ; LogProb = -50.1

7. #thankyoulord
MaxMatch = ['#', 'thank', 'youl', 'or', 'd'] ; LogProb = -58.6
Reversed MaxMatch = ['#', 'thank', 'you', 'lord'] ; LogProb = -43.0

8. #anferneehardaway
MaxMat

# Text Classification (4 marks)

### Question 5 (1.0 mark)

**Instructions**: Here we are interested in text classification: predicting the country of origin of a given tweet. The task here is to create training, development and test partitions from the preprocessed data (`x_processed`) and convert the bag-of-words representation into feature vectors.

**Task**: Create training, development and test partitions with a 70%/15%/15% ratio. <font color=red>Remember to preserve the ratio of the classes for all your partitions</font>. That is, say we have only 2 classes and 70% of instances are labelled class A and 30% of instances are labelled class B, then the instances in training, development and test partitions should also preserve this 7:3 ratio. You may use sklearn's builtin functions for doing data partitioning.
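To illustrate what preserving the class ratio means, here is a pure-Python sketch that splits the indices of each class separately. `stratified_split` and `toy_labels` are hypothetical names used only for this example. In practice, sklearn's `train_test_split` accepts a `stratify` argument that serves the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Return three index lists (train, dev, test) whose class proportions
    roughly match those of `labels`."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_train = round(len(indices) * ratios[0])
        n_dev = round(len(indices) * ratios[1])
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]  # whatever remains goes to test
    return train, dev, test

# Toy data with a 7:3 class ratio, as in the example in the text.
toy_labels = ["a"] * 140 + ["b"] * 60
train_idx, dev_idx, test_idx = stratified_split(toy_labels)

for name, part in [("train", train_idx), ("dev", dev_idx), ("test", test_idx)]:
    n_a = sum(1 for i in part if toy_labels[i] == "a")
    print(name, "size =", len(part), "class a =", n_a, "class b =", len(part) - n_a)
```

Because every class is divided on its own, each partition keeps the 7:3 ratio up to rounding. Very small classes may end up with no instances in the development or test partition.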

Next, turn the bag-of-words dictionary of each tweet into a feature vector. You may also use sklearn's builtin functions for doing this (but if you don't want to use sklearn that's fine).
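If you prefer not to use sklearn for this step, the conversion can be sketched by hand as below. This is a simplified illustration of the idea behind a dictionary vectoriser, not sklearn's implementation. `fit_vocabulary` and `to_vector` are hypothetical helper names.

```python
def fit_vocabulary(bags):
    """Assign a column index to every word seen in the training bags."""
    all_words = sorted({word for bag in bags for word in bag})
    return {word: index for index, word in enumerate(all_words)}

def to_vector(bag, vocabulary):
    """Turn one bag-of-words dictionary into a dense count vector."""
    vector = [0] * len(vocabulary)
    for word, count in bag.items():
        if word in vocabulary:  # words unseen during fitting are ignored
            vector[vocabulary[word]] = count
    return vector

train_bags = [{"wow": 1, "pic": 2}, {"pic": 1, "love": 1}]
vocabulary = fit_vocabulary(train_bags)

train_vectors = [to_vector(bag, vocabulary) for bag in train_bags]
dev_vector = to_vector({"pic": 1, "new": 3}, vocabulary)

print(vocabulary)
print(train_vectors)
print(dev_vector)
```

The vocabulary is built from the training bags only, so development and test tweets are mapped into the same columns and any word that was never seen in training is dropped.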

You should produce 6 objects: `x_train`, `x_dev`, `x_test` which contain the input feature vectors, and `y_train`, `y_dev` and `y_test` which contain the labels.

In [17]:
from sklearn.feature_extraction import DictVectorizer

x_train, x_dev, x_test = None, None, None
y_train, y_dev, y_test = None, None, None

###
# Your answer BEGINS HERE
###

# continue preprocessing x: replace each hashtag with its best-scoring segmentation
for counter in x_processed:
    for hashtag in hashtag_map:
        if hashtag in counter:
            counter.pop(hashtag)
            for word in hashtag_map[hashtag]:
                counter[word] += 1

# split it into 3 parts
from sklearn.model_selection import train_test_split

x_train, X_test, y_train, Y_test = train_test_split(x_processed, y_processed, 
                                                    test_size=0.3, random_state=42)
x_dev, x_test, y_dev, y_test = train_test_split(X_test,Y_test, test_size=0.5, 
                                                random_state=42)
# train:dev:test = 660, 141, 142


# get features
vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(x_train)  # fit the vocabulary on the training set only
x_dev = vec.transform(x_dev)          # reuse the training vocabulary
x_test = vec.transform(x_test)        # reuse the training vocabulary


###
# Your answer ENDS HERE
###

### Question 6 (1.0 mark)

**Instructions**: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their <font color=pink>main regularisation hyper-parameters</font>, which you should identify using the <font color=pink>scikit-learn docs or other resources</font>. Use the development set you created for this tuning process; do **not** use cross-validation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the **accuracy** with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output.

**Task**: Implement two text classifiers: Naive Bayes and Logistic Regression. Tune the hyper-parameters of these classifiers and print the task performance (accuracy) for different hyper-parameter settings.

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

###
# Your answer BEGINS HERE
###
import warnings
warnings.filterwarnings("ignore")

# # try to find best other settings
# solvers = ['liblinear','liblinear','lbfgs','lbfgs','newton-cg',
#             'newton-cg','sag', 'sag','saga','saga','saga']
# penalties = [  'l1',       'l2',     'l2',  'none',   'l2',     
#                'none',   'l2', 'none', 'l1',  'l2', 'none']
# multi_class = ['ovr','multinomial']
# C = 1
# random_state = 42
# LR_results = np.zeros((2,len(penalties)))

# for i in range(len(solvers)):
#     if solvers[i] != 'liblinear':
#         for j, cur_class in enumerate(multi_class):
#             logRegModel = LogisticRegression(
#                               solver=solvers[i],penalty=penalties[i],
#                               random_state=random_state, C=C,
#                               multi_class=cur_class)
#             logRegModel.fit(x_train, y_train)
#             score = logRegModel.score(x_dev, y_dev)
#             LR_results[j][i] = score
#             print("Logistic | Solver: {}, penalty: {}, class: {}, \
#                    Acc={:.3f} ".format(solvers[i],penalties[i],cur_class,score*100))
#     else:
#         logRegModel = LogisticRegression(
#                           solver=solvers[i],penalty=penalties[i],
#                           random_state=random_state, C=C,
#                           multi_class='ovr')
#         logRegModel.fit(x_train, y_train)
#         score = logRegModel.score(x_dev, y_dev)
#         LR_results[0][i] = score * 100
#         print("Logistic | Solver: {}, penalty: {}, class: {}, \
#                Acc={:.3f}% ".format(solvers[i],penalties[i],'ovr',score*100))
# print()

# to find best regularization parameter
param = np.array(range(1,11))*0.1
tuning_results_LR = [0] * len(param)
for j in range(len(param)):
    v =  LogisticRegression(solver='saga',penalty='l2',random_state=42, 
                            C=param[j],multi_class='auto')
    v.fit(x_train, y_train)
    score = v.score(x_dev, y_dev)
    tuning_results_LR[j] = score*100
    print("Logistic Regression | C={}, Acc={:.3f}% ".format(param[j],
          score*100))

print()
param = [0.01, 0.1, 0.5,0.6, 0.7,0.8, 1, 3, 5, 10]
tuning_results_NB = [0] * len(param)
for j in range(len(param)):
    v = MultinomialNB(alpha=param[j])
    v.fit(x_train, y_train)
    score = v.score(x_dev, y_dev)
    tuning_results_NB[j] = score*100
    print("Multinomial NB | alpha={}, Acc={:.3f}% ".format(param[j],score*100))

###
# Your answer ENDS HERE
###

Logistic Regression | C=0.1, Acc=29.078% 
Logistic Regression | C=0.2, Acc=30.496% 
Logistic Regression | C=0.30000000000000004, Acc=31.915% 
Logistic Regression | C=0.4, Acc=34.043% 
Logistic Regression | C=0.5, Acc=32.624% 
Logistic Regression | C=0.6000000000000001, Acc=31.915% 
Logistic Regression | C=0.7000000000000001, Acc=32.624% 
Logistic Regression | C=0.8, Acc=31.915% 
Logistic Regression | C=0.9, Acc=31.915% 
Logistic Regression | C=1.0, Acc=31.915% 

Multinomial NB | alpha=0.01, Acc=26.241% 
Multinomial NB | alpha=0.1, Acc=26.950% 
Multinomial NB | alpha=0.5, Acc=29.787% 
Multinomial NB | alpha=0.6, Acc=29.078% 
Multinomial NB | alpha=0.7, Acc=29.787% 
Multinomial NB | alpha=0.8, Acc=29.078% 
Multinomial NB | alpha=1, Acc=29.078% 
Multinomial NB | alpha=3, Acc=27.660% 
Multinomial NB | alpha=5, Acc=28.369% 
Multinomial NB | alpha=10, Acc=26.950% 


### Question 7 (1.0 mark)

**Instructions**: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both **accuracy** and **macro-averaged F-score** for each classifier. Be sure to label your output. You may use sklearn's inbuilt functions.

**Task**: Compute test performance in terms of accuracy and macro-averaged F-score for both Naive Bayes and Logistic Regression, using their optimal hyper-parameter settings based on their development performance.
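To see why accuracy and macro-averaged F-score can differ, the sketch below computes both by hand on toy labels. `accuracy` and `macro_f1` are hypothetical helper names. The macro average here treats every class seen in the true or predicted labels equally, and a class with no true positives contributes an F1 of 0.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always predicts the majority class.
toy_true = ["us", "us", "us", "gb"]
toy_pred = ["us", "us", "us", "us"]

print("Acc = {:.3f}".format(accuracy(toy_true, toy_pred)))
print("Macro-F1 = {:.3f}".format(macro_f1(toy_true, toy_pred)))
```

The majority-class predictor scores well on accuracy but is penalised by the macro average, because the minority class counts as much as the majority class.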

In [13]:
###
# Your answer BEGINS HERE
###
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

NBModel = MultinomialNB(alpha=0.7)
NBModel.fit(x_train, y_train)
NBAcc = NBModel.score(x_test, y_test)
y_pred = NBModel.predict(x_test)
NBF = f1_score(y_test, y_pred, average='macro')

logRegModel = LogisticRegression(solver='saga',penalty='l2',
                                 random_state=42, C=0.4,
                                 multi_class='auto')
logRegModel.fit(x_train, y_train)
LRAcc = logRegModel.score(x_test, y_test)
y_pred = logRegModel.predict(x_test)
LRF = f1_score(y_test, y_pred, average='macro')

print("LogisticRegression | Acc={:.3f}, Macro-F1 score={:.3f}".format(LRAcc,LRF))
print("MultinomialNB | Acc={:.3f}, Macro-F1 score={:.3f}".format(NBAcc,NBF))

###
# Your answer ENDS HERE
###

LogisticRegression | Acc=0.289, Macro-F1 score=0.274
MultinomialNB | Acc=0.232, Macro-F1 score=0.240


### Question 8 (1.0 mark)

**Instructions**: Print the most important features and their weights for each class for the two classifiers.


**Task**: For each of the classifiers (Logistic Regression and Naive Bayes) you've built in the previous question, print out the <font color=pink>top-20</font> features (words) with the highest weight for each class (countries).

An example output:
```
Classifier = Logistic Regression

Country = au
aaa (0.999) bbb (0.888) ccc (0.777) ...

Country = ca
aaa (0.999) bbb (0.888) ccc (0.777) ...

Classifier = Naive Bayes

Country = au
aaa (-1.0) bbb (-2.0) ccc (-3.0) ...

Country = ca
aaa (-1.0) bbb (-2.0) ccc (-3.0) ...
```

Have a look at the output, and see if you notice any trend/pattern in the words for each country.
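Selecting the highest-weighted features for one class amounts to sorting (feature, weight) pairs, as in the sketch below. The names and weights are made up for illustration. With a fitted model, the weights for a class would come from one row of the classifier's weight matrix.

```python
def top_k_features(weights, feature_names, k=20):
    """Return the k (feature, weight) pairs with the largest weights."""
    ranked = sorted(zip(feature_names, weights),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical weights for a single class.
toy_names = ["melbourne", "the", "sydney", "snow"]
toy_weights = [0.74, 0.01, 0.69, -0.30]

top2 = top_k_features(toy_weights, toy_names, k=2)
print(" ".join("{} ({:.3f})".format(name, weight) for name, weight in top2))
```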

In [18]:
###
# Your answer BEGINS HERE
###
ind2feature = dict(zip(vec.vocabulary_.values(), vec.vocabulary_.keys()))

print('Classifier = Logistic Regression\n')
for i, cur_class in enumerate(logRegModel.classes_):
    print("Country = {}".format(cur_class))

    ind2coef = dict(zip(range(len(logRegModel.coef_[i])), 
                        logRegModel.coef_[i]))
    top20_ind = sorted(ind2coef,key=lambda x:ind2coef[x], reverse=True)[:20]
    top20 = zip( [ind2feature[ind] for ind in top20_ind], 
                 ['({:.3f})'.format(ind2coef[ind]) for ind in top20_ind])

    print(' '.join([x[0]+' '+x[1] for x in top20]),'\n')


print('Classifier = Naive Bayes\n')
for i, cur_class in enumerate(NBModel.classes_):
    print("Country = {}".format(cur_class))

    ind2coef = dict(zip(range(len(NBModel.feature_log_prob_[i])), 
                        NBModel.feature_log_prob_[i]))
    top20_ind = sorted(ind2coef,key=lambda x:ind2coef[x], reverse=True)[:20]
    top20 = zip( [ind2feature[ind] for ind in top20_ind], 
                 ['({:.3f})'.format(ind2coef[ind]) for ind in top20_ind])

    print(' '.join([x[0]+' '+x[1] for x in top20]),'\n')
###
# Your answer ENDS HERE
###

Classifier = Logistic Regression

Country = au
melbourne (0.743) australia (0.690) v (0.626) little (0.597) amazing (0.513) @micksunnyg (0.507) australian (0.499) though (0.409) bourn (0.395) forward (0.392) cry (0.386) green (0.382) mel (0.371) r (0.369) song (0.361) lovely (0.350) sledging (0.346) e (0.344) something (0.329) enjoy (0.325) 

Country = ca
thing (0.721) new (0.664) great (0.607) bed (0.586) one (0.570) l (0.540) think (0.502) s (0.479) first (0.471) let's (0.452) finally (0.447) got (0.430) gonna (0.420) chill (0.412) follow (0.409) learning (0.406) actually (0.398) carlyle (0.385) manor (0.385) f (0.384) 

Country = de
germany (0.553) posted (0.498) done (0.469) # (0.440) painting (0.435) could (0.433) enough (0.424) never (0.393) @fabiomarabini (0.357) please (0.350) https://t.co/df7ficsci3 (0.349) roseninsel (0.349) almost (0.345) happened (0.343) @kelsxmclaughlin (0.337) cannot (0.337) https://t.co/i7j5pmd3mx (0.335) workout (0.335) anne (0.332) hahah (0.332) 


From the results above, the coefficients fitted by the Logistic Regression model appear to reflect how informative a word is for a given country, whereas the feature log probabilities learned by Naive Bayes only reflect how often a word occurs within a class and carry no information about how well it separates one class from the others. For example, "melbourne" is the highest-weighted feature for `au` in Logistic Regression, while Naive Bayes simply ranks the most frequent token, "#", first. This may partly explain the weaker performance of Naive Bayes relative to Logistic Regression on this task.