## Week 1 HW


1. This assignment is a group effort.
2. Upload your submission to the `week1` folder of your group repository.
3. The deadline is 27 April, 5:00 PM.
4. Please follow Google's [Python style guide](https://google.github.io/styleguide/pyguide.html) for your code. Pay attention to the guidelines on naming conventions, comments and main.
5. Code will be checked for plagiarism. Compelling signs of duplicated effort will lead to rejection of the submission and a 100\% grade penalty.

Use the template provided as a starting point. Extend the classes as you see fit. Be careful to place new attributes and methods in the appropriate class.

### 1

Extend the classes to include the following methods:

1. document_term_matrix - returns a D by V array of frequency counts.
2. tf_idf - returns a D by V array of tf-idf scores.
3. dict_rank - returns the top `n` documents based on a given dictionary and representation of tokens (e.g. doc-term matrix or tf-idf matrix).

Include subroutines as and when necessary
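
For orientation, the weighting that the solution code in this notebook uses, tf = 1 + log(x) for positive counts and idf = log(D / df), can be sketched in vectorised NumPy. This is a minimal sketch on a made-up 3 by 4 count matrix, not the required implementation; the variable names are illustrative only.

```python
import numpy as np

# Toy document-term matrix (hypothetical data):
# 3 documents (rows) by 4 vocabulary terms (columns).
counts = np.array([[2, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 0, 1, 3]], dtype=float)

# Term frequency: 1 + log(x) where x > 0, and 0 elsewhere.
nonzero = counts > 0
tf = np.zeros_like(counts)
tf[nonzero] = 1.0 + np.log(counts[nonzero])

# Document frequency: number of documents that contain each term.
df = nonzero.sum(axis=0)

# Inverse document frequency: log(D / df), forcing float division.
idf = np.log(counts.shape[0] / df.astype(float))

# Broadcasting multiplies every row of tf by the idf vector.
tf_idf = tf * idf
print(tf_idf.shape)
```

Note that the third term occurs in every document, so its idf is log(1) = 0 and its whole tf-idf column is zero.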


In [2]:
import numpy as np

# In my environment, a harmless exception is thrown from the following
# import.  Just ignore it.
try:
    from nltk.tokenize import wordpunct_tokenize
except Exception:
    pass

import codecs
import nltk
import re
from nltk.tokenize import wordpunct_tokenize
from nltk import PorterStemmer
from math import log
from collections import Counter

In [3]:
class Document():    
    """
    The Document class represents an individual document.
    """    
    def __init__(self, speech_year, speech_pres, speech_text):
        self.year = speech_year
        self.pres = speech_pres
        self.text = speech_text.lower()
        self.tokens = np.array(wordpunct_tokenize(self.text))
        
    def token_clean(self,length):
        """ 
        Description: strip out non-alpha tokens and tokens of length <= 'length'
        input: length: cut off length 
        """
        self.tokens = np.array([t for t in self.tokens if (t.isalpha() and len(t) > length)])

    def stopword_remove(self, stopwords):
        """
        Description: remove stopwords from tokens.
        input: stopwords: a suitable list of stopwords
        """
        self.tokens = np.array([t for t in self.tokens if t not in stopwords])

    def stem(self):
        """
        Description: stem tokens with Porter Stemmer.
        """
        self.tokens = np.array([PorterStemmer().stem(t) for t in self.tokens])

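    def term_vector(self, doc_token_list):
        """
        Description: count how often each token of doc_token_list occurs
        in this document.
        input: doc_token_list: ordered list of vocabulary tokens
        """
        counter = Counter(self.tokens)
        # Counter returns 0 for tokens that do not occur in the document
        return [counter[token] for token in doc_token_list]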

In [4]:
class Corpus():
    """
    The Corpus class represents a document collection.
    """
    def __init__(self, doc_data, stopword_file, clean_length):
        """
        Notice that the __init__ method is invoked every time an object of the
        class is instantiated.
        """
        # Initialise documents by invoking the appropriate class
        self.docs = [Document(doc[0], doc[1], doc[2]) for doc in doc_data]         
        self.N = len(self.docs)
        self.clean_length = clean_length
        
        # Get a list of stopwords
        self.create_stopwords(stopword_file, clean_length)
        
        # Stopword removal, token cleaning and stemming to docs
        self.clean_docs(clean_length)
        
        # Create vocabulary
        self.corpus_tokens()
        
    def clean_docs(self, length):
        """ 
        Applies stopword removal, token cleaning and stemming to docs.
        """
        for doc in self.docs:
            doc.token_clean(length)
            doc.stopword_remove(self.stopwords)
            doc.stem()        
    
    def create_stopwords(self, stopword_file, length):
        """
        Description: parses a file of stopwords, drops words of length
        <= 'length' and stems the rest.
        input: length: cutoff length for words
               stopword_file: stopwords file to parse
        """        
        with codecs.open(stopword_file, 'r', 'utf-8') as f:
            raw = f.read()
        self.stopwords = np.array([PorterStemmer().stem(word)
                                   for word in raw.splitlines()
                                   if len(word) > length])
             
    def corpus_tokens(self):
        """
        Description: create a set of all tokens, in other words a
        vocabulary
        """       
        # Initialise an empty set
        self.token_set = set()
        for doc in self.docs:
            self.token_set = self.token_set.union(doc.tokens) 
    
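    def document_term_matrix(self):
        """
        Description: build the D by V matrix of token counts, with one row
        per document and one column per vocabulary token.
        """
        # Fix the column order once so every row uses the same vocabulary
        vocab = list(self.token_set)
        return [doc.term_vector(vocab) for doc in self.docs]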

    def tf_idf(self):
        dt_matrix = self.document_term_matrix()
        tf_matrix = []
        idf_matrix = []
        tf_idf_matrix = []

        # Build a term frequency matrix from the document term matrix.
        # tf(d,v) = { 0 if x(d,v) = 0, 1 + log(x(d,v)) otherwise }
        for dt_doc_vector in dt_matrix:
            tf_doc_vector = [(0 if x == 0 else 1 + log(x)) for x in dt_doc_vector]
            tf_matrix.append(tf_doc_vector)

        # Build a document frequency vector with one entry per term.
        # Initialize with zeros.
        df_vector = [0] * len(self.token_set)
        for dt_doc_vector in dt_matrix:
            # Increment the counters based on an indicator function which
            # is 1 if there is at least one instance of the term in the doc.
            df_vector = np.add(df_vector, [int(x > 0) for x in dt_doc_vector])

        # Build an inverse document frequency vector.
        # float() avoids integer (floor) division under Python 2.
        idf_doc_vector = [log(float(len(self.docs)) / x) for x in df_vector]

        # Build the TF-IDF weighting matrix.
        for tf_doc_vector in tf_matrix:
            tf_idf_vector = np.multiply(tf_doc_vector, idf_doc_vector)
            tf_idf_matrix.append(tf_idf_vector)

        return tf_idf_matrix

    def dict_rank(self, dictionary, n):
        """
        Description: rank documents by the number of their tokens that
        appear in 'dictionary' and return the n highest-scoring ones.
        input: dictionary: collection of tokens to score against
               n: number of documents to return
        """
        dtm = self.document_term_matrix()
        all_tokens = list(self.token_set)

        # Mark the vocabulary positions that appear in the dictionary
        vec_positions = [int(token in dictionary) for token in all_tokens]

        # Get the score of each document
        sums = [sum(a * b for a, b in zip(row, vec_positions)) for row in dtm]

        # Order documents from highest to lowest score
        order = sorted(range(len(sums)), key = lambda k: sums[k], reverse = True)

        # Return the Document objects of the n top documents
        return [self.docs[k] for k in order[0:n]]
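
Task 1 asks for a ranking that works on either representation of the tokens. One way to do that, sketched here under assumptions, is a standalone helper that takes the matrix as an argument, so the same code scores raw counts or tf-idf weights. The name `rank_by_dictionary` and the toy vocabulary are hypothetical and not part of the template.

```python
import numpy as np

def rank_by_dictionary(matrix, vocab, dictionary, n):
    """Return (indices of the n highest-scoring rows, all row scores).

    matrix: D by V array-like (counts or tf-idf weights).
    vocab: list of V tokens, in the same order as the matrix columns.
    dictionary: iterable of tokens to score against.
    """
    matrix = np.asarray(matrix, dtype=float)
    dictionary = set(dictionary)
    # Boolean mask over the columns whose token is in the dictionary.
    mask = np.array([token in dictionary for token in vocab])
    # Score of each document: sum of its weights over dictionary terms.
    scores = matrix[:, mask].sum(axis=1)
    # argsort is ascending, so negate the scores for descending order;
    # mergesort is stable, so ties keep their original document order.
    order = np.argsort(-scores, kind='mergesort')
    return order[:n], scores

# Hypothetical toy data: 3 documents, 4 vocabulary tokens.
vocab = ['war', 'peace', 'tax', 'trade']
dtm = [[3, 0, 1, 0],
       [0, 2, 0, 1],
       [1, 1, 4, 0]]
top, scores = rank_by_dictionary(dtm, vocab, ['war', 'peace'], 2)
print(top)
print(scores)
```

Passing the output of `tf_idf()` instead of the count matrix would rank the same documents under tf-idf weighting, provided `vocab` lists the tokens in the column order used to build the matrix.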

In [5]:
def parse_text(textraw, regex):
    """
    Takes raw string and performs two operations:
      1. Breaks text into a list of year, president and speech
      2. Strips newline characters from each speech
    """
    prs_yr_spch_reg = re.compile(regex, re.MULTILINE|re.DOTALL)
    
    # Each tuple contains the year, name of the president and the speech text
    prs_yr_spch = prs_yr_spch_reg.findall(textraw)
    
    # Convert immutable tuples to mutable lists
    prs_yr_spch = [list(tup) for tup in prs_yr_spch]
    for i in range(len(prs_yr_spch)):
        prs_yr_spch[i][2] = prs_yr_spch[i][2].replace('\n', '')
    
    # Sort
    prs_yr_spch.sort()
    
    return prs_yr_spch

In [6]:
text = open('./../data/pres_speech/sou_all.txt', 'r').read()
regex = r'_(\d{4}).*?_[a-zA-Z]+.*?_[a-zA-Z]+.*?_([a-zA-Z]+)_\*+(\n{2}.*?)\n{3}'
pres_speech_list = parse_text(text, regex)

# Instantiate the corpus class
corpus = Corpus(pres_speech_list, './../data/stopwords/stopwords.txt', 2)
tf_idf = corpus.tf_idf()

print(corpus)
print(corpus.docs[0])
print(tf_idf)

<__main__.Corpus instance at 0x1077f95f0>
<__main__.Document instance at 0x1077f9878>
[array([ 1.09861229,  0.        ,  0.        , ...,  0.        ,
        0.        ,  0.        ]), array([ 0.        ,  0.        ,  1.09861229, ...,  1.09861229,
        1.09861229,  0.        ]), array([ 0.        ,  1.09861229,  0.        , ...,  0.        ,
        0.        ,  0.        ])]


### 2

Pick a dictionary (or dictionaries) of your choice from the Harvard IV set, the Loughran-McDonald set, or some other of your choosing that you think may be relevant for the data you collected. Then conduct the following exercise:
1. Use the two methods above to score each document in your data.
2. Explore whether the scores differ according to the metadata fields you gathered: for example, do different speakers/sources/etc. tend to receive a higher score than others?
3. Do the answers to the previous question depend on whether tf-idf weighting is applied or not? Why do you think there is (or is not) a difference in your answers?
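
One simple way to compare scores across a metadata field is to average them per group. The sketch below assumes you already have one label (for example the speaker) and one score per document; the function name `mean_score_by_group` and the data are hypothetical.

```python
from collections import defaultdict

def mean_score_by_group(labels, scores):
    """Return {label: mean score} for parallel lists of labels and scores."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, score in zip(labels, scores):
        totals[label] += score
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

# Hypothetical speakers and dictionary scores, one entry per document.
speakers = ['A', 'B', 'A', 'B']
doc_scores = [2.0, 1.0, 4.0, 3.0]
means = mean_score_by_group(speakers, doc_scores)
print(means)
```

Running it once on scores from the count matrix and once on scores from the tf-idf matrix gives two sets of group means to compare for question 3.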


In [29]:
# Harvard IV set
file_handler = './../data/dictionary/inquirerbasic2.csv'
dictionary = np.loadtxt(open(file_handler, 'rb'), dtype = 'str',
                        delimiter = ';', skiprows = 1, comments = None)

type(dictionary)
#len(dictionary)

numpy.ndarray

### 3

We will now do a sentiment analysis using the AFINN list of words. AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labeled by Finn Årup Nielsen in 2009-2011. A positive valence score can be interpreted as the word conveying a positive emotion, and a negative score as conveying a negative one.

Load _AFINN-111.txt_ from ./data/AFINN. Inspect the contents of the file and write a function that converts it into a dictionary where the keys are words and values are the valence scores attributed to them. You may use the readme file for hints. 
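
A possible starting point is sketched below. It assumes each line holds an entry and its integer score separated by a tab, and that some entries are multi-word phrases; check both assumptions against the file and its readme. The name `parse_afinn` and the sample lines are illustrative.

```python
def parse_afinn(lines):
    """Build a {word: valence} dictionary from tab-separated lines."""
    valence = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last tab so multi-word entries stay intact.
        word, score = line.rsplit('\t', 1)
        valence[word] = int(score)
    return valence

# Illustrative lines in the assumed layout, not read from the real file.
sample = ['abandon\t-2', "can't stand\t-3", 'good\t3', '']
afinn = parse_afinn(sample)
print(afinn)
```

Because the function only iterates over lines, an open file handle can be passed in place of the sample list.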

### 4
Now, use the presidential speeches from last week's HW and calculate their sentiment scores. Match every word against the dictionary and come up with a metric that captures the sentiment value. If a word is not present, mark its score as 0. Write a function that takes in a list of words and returns their sentiment score. What is the score of the speech you have been assigned? Which year and president gave the least and most positive speeches?
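
As one possible metric (an assumption, not the required answer), the sketch below sums the valence of every word and also reports the mean per word, so that long speeches are not favoured merely for their length. Words missing from the dictionary score 0. The function name and the toy valence values are hypothetical.

```python
def sentiment_score(words, valence):
    """Return (total valence, mean valence per word) for a list of words."""
    # dict.get supplies the default score of 0 for unknown words.
    total = sum(valence.get(word, 0) for word in words)
    mean = float(total) / len(words) if words else 0.0
    return total, mean

# Hypothetical valence dictionary and token list.
toy_valence = {'good': 3, 'bad': -3, 'great': 3}
tokens = ['a', 'good', 'and', 'great', 'bad', 'day']
total, mean = sentiment_score(tokens, toy_valence)
print(total)
print(mean)
```

Applying the function to each speech and sorting the results by the chosen metric would identify the least and most positive year and president.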