# HW 3: Irish or Aussie?

## Date Out: Thursday, March 5
## Due Date: Tuesday, March 24

This programming assignment aims to help you prepare for the final project in this course. It tasks you to:

* download two large text datasets
* pre-process them into a common format
* divide them into appropriate training and testing sets
* learn a Naive Bayes classifier on the training set
* iterate and refine the model using a dev-test set
* give a final evaluation of the model using a test set

<u>You may work in teams of two or three (2-tuples or 3-tuples?) for this assignment.</u>

<hr>

This _never-seen-before_ assignment was motivated by our recent discussion of the NLTK Names Corpus and Jurafsky and Martin's example of a toy classifier to distinguish between "China" and "Japan".

It was a bit difficult finding datasets of the size and detail that I wanted to assign.

Before I settled on the classification task in this assignment, __"Irish or Aussie?" (You will train a model that distinguishes headlines from an Australian newspaper from headlines from an Irish newspaper.)__, I attempted to find good datasets on:
* Jeopardy puzzle clues vs. Who Wants to Be a Millionaire clues
* Peanuts comic strip text vs. Garfield
* Seinfeld sitcom text vs. Friends
* Wine reviews vs. Beer reviews
* ...

If you want, feel free to use different but comparable datasets for this assignment.

<hr>

In [1]:
import nltk

In [2]:
import csv

### Task #1

<u>Download the Aussie and Irish headline datasets.</u>

I found the datasets here:
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

Look at the "text data" section. There, you'll see citation hyperlinks 168 and 170, which will take you to a Kaggle page where you can download the data:
1. A Million News Headlines (these are the headlines for Australia -- Aussie)
2. The Irish Times (Irish)

In [3]:
# Data sets downloaded and in same directory as notebook.  Renamed aussie.csv and irish.csv

### Task #2

<u>Process the downloaded <code>csv</code> files in python.</u>

Note that the formats of the two datasets are different. Throw out any columns that you don't need.

Perhaps normalize the data and lowercase everything. Observe that one dataset has capitalization differences. 

Transform the data into some python data structure that also captures whether the headline is Irish or Aussie. I recommend reviewing how we used the tuple `(instance, class)` representation when analyzing the Names Corpus.
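As a minimal sketch of the `(instance, class)` representation, the snippet below pairs each headline with its label. The sample headlines and the helper name `label_headlines` are made up for illustration and are not taken from either dataset.

```python
# Hypothetical sample headlines, invented for illustration only.
irish_sample = ["Taoiseach Visits Cork", "Galway Win Hurling Final"]
aussie_sample = ["qantas announces new routes", "sydney ferry delayed"]


def label_headlines(headlines, label):
    """Lowercase each headline and pair it with its class label."""
    return [(headline.lower(), label) for headline in headlines]


# One combined list of (instance, class) tuples.
labeled = label_headlines(irish_sample, "irish") + label_headlines(aussie_sample, "aussie")

for instance, cls in labeled:
    print(cls, "->", instance)
```

Keeping the label next to the text means the data can be shuffled and split later without losing track of which newspaper a headline came from.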

In [4]:
# irish contains punctuation and capital letters
# aussie is all lowercase with no punctuation
# aussie columns - date, headline
# irish columns - date, news label, headline
# Use the head.csv file in each folder for testing
# Both csv files are comma-delimited

# Raise the csv field size limit so that very long fields can be read
import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by a factor of 10
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

def readInCSV(filename, rawlist, colname):
    # Append the value of column `colname` from every row to rawlist
    with open(filename, "r", newline="") as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter=",")
        for row in csv_reader:
            rawlist.append(row[colname])


irish = []
aussie = []

readInCSV("irish.csv", irish, "headline_text")
readInCSV("aussie.csv", aussie, "headline_text")


In [5]:
# normalize the text: drop the characters in badchars and lowercase everything
badchars = {"'", "’", "”", "“", ";", ".", "!", "?", "-", ":", "%", ")", "(", "_"}
def normalizeText(lines):
    for i, line in enumerate(lines):
        lines[i] = ''.join(ch for ch in line if ch not in badchars).lower()

normalizeText(aussie)
normalizeText(irish)

In [6]:
# There are an unequal number of headlines.
# Irish - 1425460
# Aussie - 1186018
# Randomly select 1186018 Irish headlines to keep so we can create equally weighted training sets

import random

irish = random.sample(irish, len(aussie))

# Shuffle both lists
random.shuffle(irish)
random.shuffle(aussie)

# Create a master csv that we can split into the training, dev-test and test sets
with open("masterset.csv", "w", newline="") as master_file:
    w = csv.writer(master_file)
    for i in range(len(irish)):
        w.writerow([irish[i], "irish"])
        w.writerow([aussie[i], "aussie"])


### Task #3

<u>Divide your data into a training set and testing set.</u>

No peeking at the test set from here on out!

Perhaps even set aside part of your training set as a dev-test set.

In [7]:
# Split masterset.csv into 3 equal sets: training, dev-test and test

In [8]:
import math

with open("masterset.csv", "r") as master_file:
    master = master_file.readlines()

# Write three consecutive, equally sized slices of the master set
splitInc = math.floor(len(master) / 3)
filename = ["train", "dev", "test"]
for fc, name in enumerate(filename):
    start = fc * splitInc
    with open(name + '.csv', 'w') as out_file:
        out_file.writelines(master[start:start + splitInc])


### Task #4

<u>Learn a Naive Bayes classifier on the training set.</u>

(For this step you may find it easier to deviate from the Names corpus classifier example, and instead manually compute probabilities following the intuition presented in the "China"/"Japan" J&M example.)

All we really need here are: (1) prior probabilities and (2) conditional probabilities.
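To make those two quantities concrete, here is a minimal sketch that computes log priors and add-one (Laplace) smoothed conditional probabilities by hand. The four training headlines and the function names are hypothetical, and the use of add-one smoothing and of skipping unseen words at prediction time are assumptions of this sketch rather than requirements of the assignment.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled):
    """Return log priors, smoothed log likelihoods, and the vocabulary."""
    class_counts = Counter(label for _, label in labeled)
    word_counts = defaultdict(Counter)
    for text, label in labeled:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}

    # Prior: fraction of training headlines in each class.
    log_prior = {c: math.log(n / len(labeled)) for c, n in class_counts.items()}

    # Conditional: P(word | class) with add-one smoothing.
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab


def classify(text, log_prior, log_likelihood, vocab):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in text.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy training set, invented for illustration only.
toy_train = [
    ("dublin bus strike continues", "irish"),
    ("galway wins hurling final", "irish"),
    ("sydney ferry strike continues", "aussie"),
    ("melbourne wins cricket final", "aussie"),
]

log_prior, log_likelihood, vocab = train_naive_bayes(toy_train)
print(classify("dublin hurling final", log_prior, log_likelihood, vocab))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together, which matters once the vocabulary and headline counts grow to the size of the real datasets.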

In [3]:
import pandas as pd
import nltk
import csv
from nltk.corpus import stopwords

# train.csv has no header row, so name the columns explicitly
df = pd.read_csv('train.csv', delimiter=',', header=None, names=['headline', 'label'])
# The book says to use the entire set as the basis for the all_words list,
# but that sounds like cheating (it would leak dev and test headlines into
# the features), so only the training set is used here.
#df = pd.read_csv('masterset.csv', delimiter=',')
list_of_tuples = [tuple(row) for row in df.values]

words = []

stop_words = set(stopwords.words('english'))
for headline in list_of_tuples:
    words += nltk.word_tokenize(headline[0])

filtered_words = [w for w in words if w not in stop_words]

all_words = nltk.FreqDist(filtered_words)



In [4]:
word_features = [w for (w, count) in all_words.most_common(10000)]

In [5]:
def document_features(document):
    document_words = set(nltk.word_tokenize(document))
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [6]:
# Testing has shown that the more words we include in the word_features list, the more accurate the model is.
# Increasing the number of words in the word_features list also increases RAM utilization.

import pickle
featuresets = [(document_features(d), c) for (d,c) in list_of_tuples[:20000]]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Save the trained classifier so it can be reloaded without retraining
with open("naivebayes.pickle", "wb") as save_classifier:
    pickle.dump(classifier, save_classifier)


In [12]:
import pickle

with open("naivebayes.pickle", "rb") as classifier_f:
    classifier = pickle.load(classifier_f)

In [16]:
import pandas as pd
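# dev.csv has no header row, so name the columns explicitly
df = pd.read_csv('dev.csv', delimiter=',', header=None, names=['headline', 'label'])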
list_of_tuples = [tuple(row) for row in df.values]
featuresets = [(document_features(d), c) for (d,c) in list_of_tuples[3000:3500]]
print(nltk.classify.accuracy(classifier, featuresets))

0.772


### Task #5

<u>Iterate and refine the model using a dev-test set.</u>
                        
Let's try to make our model better. Which instances is your model misclassifying? Report lessons learned either here, or at the bottom of this notebook.
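One possible starting point for this error analysis is sketched below: collect every dev-test headline whose predicted label differs from its gold label, then count which direction the confusions go. The helper `find_errors`, the rule-based `keyword_predict`, and the sample headlines are all hypothetical stand-ins so that the snippet runs on its own.

```python
from collections import Counter


def find_errors(labeled, predict):
    """Return (headline, gold, guess) for every misclassified instance."""
    errors = []
    for headline, gold in labeled:
        guess = predict(headline)
        if guess != gold:
            errors.append((headline, gold, guess))
    return errors


def keyword_predict(headline):
    """Hypothetical stand-in for a trained model: a crude keyword rule."""
    if "sydney" in headline or "melbourne" in headline:
        return "aussie"
    return "irish"


# Hypothetical dev-test sample, invented for illustration only.
dev_sample = [
    ("dublin bus strike continues", "irish"),
    ("sydney ferry delayed", "aussie"),
    ("brisbane floods close roads", "aussie"),
    ("irish band tours melbourne", "irish"),
]

errors = find_errors(dev_sample, keyword_predict)
for headline, gold, guess in errors:
    print(f"gold={gold:<7} guess={guess:<7} {headline}")

# Count confusions by (gold, guess) pair to see which direction dominates.
confusions = Counter((gold, guess) for _, gold, guess in errors)
print(confusions)
```

With the NLTK model from Task #4, `predict` could be something like `lambda h: classifier.classify(document_features(h))`, and `classifier.show_most_informative_features()` can help explain why particular headlines are being confused.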

In [None]:
# PYTHON CODE HERE

### Task #6

<u>Evaluate your model on the held out test set.</u>
                        
Which metric is most appropriate? Accuracy?

Is there anything else that could be useful to calculate? 

What is the classifier's baseline?
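As a rough illustration of these questions, the sketch below computes accuracy, per-class precision, recall and F1, and a majority-class baseline from two parallel lists of labels. The function names and the toy label lists are hypothetical. Because the classes were balanced earlier in this notebook, the majority-class baseline for the real data should sit near 50%.

```python
from collections import Counter


def evaluate(gold, predicted, positive):
    """Accuracy plus precision, recall and F1 for the class `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def majority_baseline(gold):
    """Accuracy of always guessing the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


# Hypothetical gold labels and predictions, invented for illustration only.
gold = ["irish", "irish", "irish", "aussie", "aussie", "aussie"]
predicted = ["irish", "irish", "aussie", "aussie", "aussie", "irish"]

metrics = evaluate(gold, predicted, positive="irish")
baseline = majority_baseline(gold)
print(metrics)
print("majority baseline:", baseline)
```

Reporting precision and recall per class alongside accuracy shows whether the model is systematically better at one newspaper than the other, which a single accuracy number hides.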

In [None]:
# PYTHON CODE HERE

### Report

Write a technical report (in this Jupyter Notebook, with good Markdown formatting) that documents your findings, "lessons learned", any areas where you ran into difficulty, and any other interesting details. Include in your report the following details.

The format of the report is up to you, and you may include this information either above, or right here. Use any appropriate layout that best conveys your narrative.

Also submit this Python notebook `.ipynb` to D2L.

In [None]:
# PYTHON CODE AND REPORT HERE