#Practice Exercise 1.

**Time estimate:** 20 minutes

Study the `make_tf_vector()` function in Section 1 (reproduced below). Answer the three questions below. You do not need to write code... just explain in words.

* What does `doc.split()` precisely do? 
* What limitations does using `split()` like this introduce to `make_tf_vector`? 
* How might you improve it? 



In [1]:
docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball'
]

from collections import defaultdict
from pprint import pprint

def make_tf_vector(doc):
    """
        Calculates the term frequency of a document.
        Given a string, splits it into words on whitespace.
        Returns a dictionary mapping words to their frequency.
    """
    v = defaultdict(int)
    for term in doc.split():
        v[term] += 1
    return v

tf_matrix = []
for doc in docs:    
    v = make_tf_vector(doc)    
    tf_matrix.append(v)

pprint(tf_matrix)

[defaultdict(<type 'int'>, {'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}),
 defaultdict(<type 'int'>, {'me': 2, 'Julie': 1, 'likes': 1, 'loves': 1, 'Jane': 1, 'than': 1, 'more': 1}),
 defaultdict(<type 'int'>, {'basketball': 1, 'baseball': 1, 'likes': 1, 'He': 1, 'than': 1, 'more': 1})]


####Answer to practice exercise 1:

`doc.split()` breaks up the document (a String) based on whitespace (newlines, spaces, tabs, etc.). The function returns a list containing the individual tokens (i.e. words) in the document string. More details, including the complete list of whitespace characters, can be found in the online [string.split Python documentation](http://docs.python.org/2/library/stdtypes.html#str.split).

The `split()` method introduces several limitations. In particular, it will become confused by things like punctuation. For example:

In [2]:
print('Hello, world!'.split())

['Hello,', 'world!']


Notice that the first token ("`Hello,`") includes the trailing comma, and the last token ("`world!`") includes the exclamation point. We could work around this by writing a loop that more carefully analyzes the letters in a string against a predefined list of characters. Another option is to use the version of split that takes a regular expression:

In [3]:
import re

print(re.split('[^a-zA-Z0-9]+', 'Hello, world!'))

['Hello', 'world', '']


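The regular expression above splits on non-alphanumeric characters. The `[^...]` represents "characters that are not any of the following...". The `+` at the end means multiple consecutive non-alphanumeric characters should be collapsed together. Notice the empty string at the end of the result: the input ends with a delimiter, so `re.split` returns an empty final token that you would need to filter out. However, **the best choice is to use nltk.** (scikit-learn does not use nltk internally; its `CountVectorizer` tokenizes with its own regular expression by default.)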

In [4]:
from nltk.tokenize import wordpunct_tokenize

def has_letter(s):
    """Returns true iff the string s contains at least one letter."""
    return any(c.isalpha() for c in s)

tokens_and_punc = wordpunct_tokenize('Hello, world!')
tokens = [t for t in tokens_and_punc if has_letter(t)]   # filter out tokens that are just punctuation

print(tokens_and_punc)
print(tokens)

['Hello', ',', 'world', '!']
['Hello', 'world']
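
As one possible answer to "How might you improve it?", the sketch below combines the regular-expression idea with lowercasing, so that `Hello,` and `hello` count as the same term. The name `make_tf_vector_regex` is hypothetical and not part of the course code, and treating only ASCII letters and digits as word characters is an assumption that would drop accented letters.

```python
import re
from collections import defaultdict

def make_tf_vector_regex(doc):
    """Hypothetical variant of make_tf_vector: lowercases the text and
    counts runs of ASCII letters and digits, ignoring punctuation."""
    v = defaultdict(int)
    for term in re.findall(r'[a-z0-9]+', doc.lower()):
        v[term] += 1
    return v

counts = dict(make_tf_vector_regex('Hello, world! Hello?'))
print(counts)
```

Using `re.findall` rather than `re.split` avoids the empty trailing token, because it collects the matches instead of the gaps between them.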


#Practice Exercise 2.

**Time estimate:** 20 minutes.

The Python code for the `l2norm()` function in Section 2 (shown below) uses a [list comprehension](http://carlgroner.me/Python/2011/11/09/An-Introduction-to-List-Comprehensions-in-Python.html). Rewrite the l2norm function using a standard "for-each" loop.

In [5]:
def l2norm(tf_vector):
    """
        Returns the l2-norm (i.e. Euclidean length) of a vector.
    """
    return sum([x*x for x in tf_vector.values()]) ** 0.5

for tf_vector in tf_matrix:
    norm = l2norm(tf_vector)
    print('l2norm of' , tf_vector, 'is', norm)

('l2norm of', defaultdict(<type 'int'>, {'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}), 'is', 3.4641016151377544)
('l2norm of', defaultdict(<type 'int'>, {'me': 2, 'Julie': 1, 'likes': 1, 'loves': 1, 'Jane': 1, 'than': 1, 'more': 1}), 'is', 3.1622776601683795)
('l2norm of', defaultdict(<type 'int'>, {'basketball': 1, 'baseball': 1, 'likes': 1, 'He': 1, 'than': 1, 'more': 1}), 'is', 2.449489742783178)


#### Answer to Practice Exercise 2:

In [6]:
def l2norm(tf_vector):
    """
        Returns the l2-norm (i.e. Euclidean length) of a vector.
    """
    sum_sq = 0.0
    for x in tf_vector.values():
        sum_sq += x * x
    return sum_sq ** 0.5

for tf_vector in tf_matrix:
    norm = l2norm(tf_vector)
    print('l2norm of' , tf_vector, 'is', norm)

('l2norm of', defaultdict(<type 'int'>, {'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}), 'is', 3.4641016151377544)
('l2norm of', defaultdict(<type 'int'>, {'me': 2, 'Julie': 1, 'likes': 1, 'loves': 1, 'Jane': 1, 'than': 1, 'more': 1}), 'is', 3.1622776601683795)
('l2norm of', defaultdict(<type 'int'>, {'basketball': 1, 'baseball': 1, 'likes': 1, 'He': 1, 'than': 1, 'more': 1}), 'is', 2.449489742783178)
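
To check the first result by hand, the sketch below spells out the arithmetic for the first document: the squared term frequencies sum to 12, and the square root of 12 is the reported norm. It uses only the standard library and the tf values printed above.

```python
import math

# Term frequencies of the first document, copied from the output above.
v = {'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}

sum_sq = sum(x * x for x in v.values())   # 4 + 1 + 4 + 1 + 1 + 1 = 12
norm = math.sqrt(sum_sq)
print(sum_sq, norm)
```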


#Assignment Question 1: 

**Time estimate:** 20 minutes

The Python code for the normalize function (below) uses a list comprehension. Rewrite the normalize function using a standard "for-each" loop. Check your answer against the results posted below.

In [7]:
def normalize(vector):
    """
        Given a dictionary representing a sparse vector, returns a new rescaled vector.
        The new vector contains values from the original vector rescaled by a constant.
        The new vector will have an l2-norm of 1.0.
    """
    norm = l2norm(vector)
    if norm == 0.0:
        return dict(vector)
    
    return dict([(term, tf/norm) for (term, tf) in vector.items()])

v = {'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}
print(v)
print(l2norm(v))
print(normalize(v))
print(l2norm(normalize(v)))

{'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}
3.46410161514
{'me': 0.5773502691896258, 'Julie': 0.2886751345948129, 'loves': 0.5773502691896258, 'Linda': 0.2886751345948129, 'than': 0.2886751345948129, 'more': 0.2886751345948129}
1.0


####Answer to Question 1:

In [8]:
def normalize(vector):
    """
        Given a dictionary representing a sparse vector, returns a new rescaled vector.
        The new vector contains values from the original vector rescaled by a constant.
        The new vector will have an l2-norm of 1.0.
    """
    l2 = l2norm(vector)
    if l2 == 0.0:
        return dict(vector)

    norm = {}
    for term in vector:
        norm[term] = vector[term] / l2
    return norm

v = {'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}
print(v)
print(l2norm(v))
print(normalize(v))
print(l2norm(normalize(v)))

{'me': 2, 'Julie': 1, 'loves': 2, 'Linda': 1, 'than': 1, 'more': 1}
3.46410161514
{'me': 0.5773502691896258, 'Julie': 0.2886751345948129, 'loves': 0.5773502691896258, 'Linda': 0.2886751345948129, 'than': 0.2886751345948129, 'more': 0.2886751345948129}
1.0


#Assignment Question 2:

**Time estimate:** 30 minutes.

Answer two questions about the `make_doc_frequency` function (below):

1. Notice the use of `set()` in the inner-most for loop. Why is this necessary? How (specifically) would the result change if we didn't use it?
2. The performance of `make_doc_frequency` is linear, but linear *in what?*  Does the performance of this method scale with the number of documents, the number of unique terms, or total word count across all documents? Explain your answer. Hint: All set operations require constant time, as do dictionary sets and gets.

In [9]:
from math import log
from collections import defaultdict

def make_doc_frequency(docs):
    """
        Given a collection of documents (Strings), 
        returns a dictionary mapping each word to the number of documents it appears in.
        Words are split on whitespace.
    """
    df = defaultdict(int)
    for d in docs:
        for term in set(d.split()):
            df[term] += 1
    return df

docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball'
]
print(make_doc_frequency(docs))

defaultdict(<type 'int'>, {'me': 2, 'basketball': 1, 'Julie': 2, 'baseball': 1, 'likes': 2, 'loves': 2, 'Jane': 1, 'Linda': 1, 'He': 1, 'than': 3, 'more': 3})


####Answer for question 2:

**Part 1:** If the function did not include the call to `set()`, a token (i.e. word) that appears more than once in a single document would be overcounted. For example, the term `me` appears twice in both the first and second sentences. Thus, the document frequency dict would have a value of 4 for `me` (2 + 2 + 0) instead of 2. `loves` would change to have a value of 3 (2 + 1 + 0) instead of 2.

**Part 2:** The innermost for loop above runs once for every **distinct word** in every document. If "the" appears six times in a single document, the innermost "for loop" will only run once because a set keeps track of unique items. Since the amount of work in each loop is constant (a dictionary lookup and store, and an arithmetic addition), this suggests the function's complexity should scale with:

    sum across all documents d of (number of unique words in d)

However, this is not the whole story. Notice the `set()` function takes as a parameter all the words in the document, with words possibly appearing multiple times. Each of these words (including duplicates) must be added to the set. Therefore, the correct complexity is:

    sum across all documents d of (length in words in d)
    
Or "the total word count across all documents."
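
The overcounting described in Part 1 can be reproduced with a deliberately broken variant. The function name `make_doc_frequency_no_set` is hypothetical; it is the same loop with the `set()` call removed, so it counts occurrences rather than documents.

```python
from collections import defaultdict

def make_doc_frequency_no_set(docs):
    """Hypothetical broken variant: without set(), a word is counted
    once per occurrence instead of once per document."""
    df = defaultdict(int)
    for d in docs:
        for term in d.split():
            df[term] += 1
    return df

docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball'
]
counts = make_doc_frequency_no_set(docs)
print(counts['me'], counts['loves'], counts['than'])
```

Words that never repeat within a document, such as `than`, keep their correct value of 3.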

#Assignment Question 3:

**Time estimate:** 30 minutes.

Complete the `make_tf_idf_vector` function below. The steps you need to complete are outlined in the comments for the method. Once you correctly complete the function, and run the test that appears below it, you should see the following output:

        Julie loves me more than Linda loves me
                  Linda: tf=1 tf-idf=+0.706
                     me: tf=2 tf-idf=+0.000
                  Julie: tf=1 tf-idf=+0.000
                  loves: tf=2 tf-idf=+0.000
                   than: tf=1 tf-idf=-0.501
                   more: tf=1 tf-idf=-0.501
        
        Jane likes me more than Julie loves me
                   Jane: tf=1 tf-idf=+0.706
                     me: tf=2 tf-idf=+0.000
                  Julie: tf=1 tf-idf=+0.000
                  likes: tf=1 tf-idf=+0.000
                  loves: tf=1 tf-idf=+0.000
                   than: tf=1 tf-idf=-0.501
                   more: tf=1 tf-idf=-0.501
        
        He likes basketball more than baseball
             basketball: tf=1 tf-idf=+0.500
               baseball: tf=1 tf-idf=+0.500
                     He: tf=1 tf-idf=+0.500
                  likes: tf=1 tf-idf=+0.000
                   than: tf=1 tf-idf=-0.354
                   more: tf=1 tf-idf=-0.354

In [10]:
def make_tf_idf_vector(td, df, doc):
    """
        Given a total document count (td), document frequency dictionary (words -> # docs), and a document (a string)
        Returns a tf_idf vector.
        The returned vector is normalized so that it has an l2-norm of 1.0.
    """
    # step 1: Calculate the tf vector
    # step 2: Translate the tf vector into a tf-idf vector using the formula above
    # step 3: Normalize the tf-idf vector so it has an l2 norm of 1.0
    # step 4: return the normalized tf-idf vector.
    return {}

docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball'
]
td = len(docs)
df = make_doc_frequency(docs)

for d in docs:
    tf = make_tf_vector(d)
    tf_idf = make_tf_idf_vector(td, df, d)
    print(d)
    for term in sorted(tf_idf, key=tf_idf.get, reverse=True):
        print('%15s: tf=%d tf-idf=%+.3f' % (term, tf[term], tf_idf[term]))
    print('')

Julie loves me more than Linda loves me

Jane likes me more than Julie loves me

He likes basketball more than baseball



####Answer for question 3:

In [11]:
def make_tf_idf_vector(td, df, doc):
    """
        Given a total document count (td), document frequency dictionary (words -> # docs), and a document (a string)
        Returns a tf_idf vector.
        The returned vector is normalized so that it has an l2-norm of 1.0.
    """
    # step 1: Calculate the tf vector
    v = make_tf_vector(doc)

    # step 2: Translate the tf vector into a tf-idf vector using the formula above
    for term in v:
        v[term] = v[term] * log(1.0 * td / (1 + df[term]))
        
    # step 3: Normalize the tf-idf vector so it has an l2 norm of 1.0
    # step 4: return the normalized tf-idf vector.    
    return normalize(v)

docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball'
]
td = len(docs)
df = make_doc_frequency(docs)

for d in docs:
    tf = make_tf_vector(d)
    tf_idf = make_tf_idf_vector(td, df, d)
    print(d)
    for term in sorted(tf_idf, key=tf_idf.get, reverse=True):
        print('%15s: tf=%d tf-idf=%+.3f' % (term, tf[term], tf_idf[term]))
    print('')

Julie loves me more than Linda loves me
          Linda: tf=1 tf-idf=+0.706
             me: tf=2 tf-idf=+0.000
          Julie: tf=1 tf-idf=+0.000
          loves: tf=2 tf-idf=+0.000
           than: tf=1 tf-idf=-0.501
           more: tf=1 tf-idf=-0.501

Jane likes me more than Julie loves me
           Jane: tf=1 tf-idf=+0.706
             me: tf=2 tf-idf=+0.000
          Julie: tf=1 tf-idf=+0.000
          likes: tf=1 tf-idf=+0.000
          loves: tf=1 tf-idf=+0.000
           than: tf=1 tf-idf=-0.501
           more: tf=1 tf-idf=-0.501

He likes basketball more than baseball
     basketball: tf=1 tf-idf=+0.500
       baseball: tf=1 tf-idf=+0.500
             He: tf=1 tf-idf=+0.500
          likes: tf=1 tf-idf=+0.000
           than: tf=1 tf-idf=-0.354
           more: tf=1 tf-idf=-0.354
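
To see where these numbers come from, the sketch below works through the first document by hand, using the same formula as the solution, `tf * log(td / (1 + df))`, followed by l2 normalization. The tf and df values are copied from the outputs earlier in this notebook; the variable names are only for illustration.

```python
from math import log, sqrt

td = 3  # total number of documents
tf = {'Linda': 1, 'me': 2, 'Julie': 1, 'loves': 2, 'than': 1, 'more': 1}
df = {'Linda': 1, 'me': 2, 'Julie': 2, 'loves': 2, 'than': 3, 'more': 3}

# Unnormalized weights: tf * log(td / (1 + df))
raw = {term: tf[term] * log(1.0 * td / (1 + df[term])) for term in tf}

# Rescale so the vector has an l2-norm of 1.0
norm = sqrt(sum(w * w for w in raw.values()))
scaled = {term: w / norm for term, w in raw.items()}

print('%+.3f %+.3f %+.3f' % (scaled['Linda'], scaled['me'], scaled['than']))
```

With this formula, a word that appears in every document gets `log(3 / 4)`, which is negative. That is why `than` and `more` have negative scores. A word that appears in two of the three documents gets `log(3 / 3) = 0`.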



#Assignment Question 4:

**Time estimate:** 30 minutes.

Write a function called `sk_vector_to_simple_row` that returns a traditional (native) sparse vector representation for a row in a scikit-learn matrix. Recall that the traditional Python representation is a dict whose keys are terms and whose values are frequencies. Pattern your function after the example code in Section 4.2.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

def sk_vector_to_simple_row(lexicon, row):
    """
    Given a sci-kit learn row vector, return a native Python sparse vector.
    The result will be a dictionary whose keys are terms and values are term frequencies.
    Pattern this function after the code in Section 4.2.
    """

docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball'
]
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(docs)
lexicon =  count_vectorizer.get_feature_names()
sk_tf_matrix = count_vectorizer.transform(docs)

for row in sk_tf_matrix:
    print(sk_vector_to_simple_row(lexicon, row))

None
None
None


####Answer for Question 4:

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(docs)
lexicon =  count_vectorizer.get_feature_names()
sk_tf_matrix = count_vectorizer.transform(docs)

def sk_vector_to_simple_row(lexicon, row):
    """
    Given a sci-kit learn row vector, return a native Python sparse vector.
    The result will be a dictionary whose keys are terms and values are term frequencies.
    Pattern this function after the code in Section 4.2.
    """
    simple = {}
    for i in range(len(row.data)):
        index = row.indices[i]    
        term = lexicon[index]
        val = row.data[i]
        simple[term] = val
    return simple

for row in sk_tf_matrix:
    print(sk_vector_to_simple_row(lexicon, row))

{u'me': 2, u'julie': 1, u'loves': 2, u'linda': 1, u'than': 1, u'more': 1}
{u'me': 2, u'julie': 1, u'likes': 1, u'loves': 1, u'jane': 1, u'than': 1, u'more': 1}
{u'basketball': 1, u'baseball': 1, u'likes': 1, u'he': 1, u'than': 1, u'more': 1}


#Assignment Question 5:

**Time estimate:** 30 minutes.

Study the code at the end of section 5 (measuring vector similarity). Notice that "by hand" we calculate the result of the `sk_dot` function between each pair of documents.

Write a function that calculates the similarity between each pair of documents using for loops. The output of this function should look approximately like:

    Similarity between document 0 and 1 is 0.75360
    Similarity between document 0 and 2 is 0.12841
    Similarity between document 1 and 2 is 0.25742
    
Use the following bit of Python code to set up the scikit-learn tf-idf matrix you need.

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

# Train a tf-idf transformer on the dataset
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(sk_tf_matrix)
sk_tf_idf_matrix = tfidf.transform(sk_tf_matrix)

***Hint:*** The `shape` attribute of a matrix returns a two-tuple of the number of rows and columns:

In [15]:
print(sk_tf_idf_matrix.shape)

(3, 11)


####Answer for Question 5.

In [17]:
def sk_dot(v1, v2):
    """Dot product of two sparse row vectors, returned as a scalar."""
    return v1.dot(v2.transpose())[0,0]

def pairwise(matrix):
    nrows = matrix.shape[0]
    for i in range(nrows):
        for j in range(0, i):
            vi = matrix[i]
            vj = matrix[j]
            sim = sk_dot(vi, vj)
            print('Similarity between document %d and %d is %.5f' % (i, j, sim))

pairwise(sk_tf_idf_matrix)

Similarity between document 1 and 0 is 0.75360
Similarity between document 2 and 0 is 0.12841
Similarity between document 2 and 1 is 0.25742
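
The same pairwise pattern can be sketched without scikit-learn, using plain dicts as sparse vectors. The helper `dict_dot` and the toy vectors are hypothetical and chosen only to make the arithmetic easy to follow. `itertools.combinations` yields each pair once with `i < j`, which matches the ordering in the question's sample output.

```python
from itertools import combinations

def dict_dot(v1, v2):
    """Dot product of two sparse vectors stored as dicts (hypothetical helper)."""
    return sum(w * v2[t] for t, w in v1.items() if t in v2)

# Toy unit-length vectors, not taken from the dataset.
toy = [{'a': 0.6, 'b': 0.8}, {'a': 1.0}, {'b': 1.0}]

sims = {(i, j): dict_dot(toy[i], toy[j])
        for i, j in combinations(range(len(toy)), 2)}

for (i, j), sim in sorted(sims.items()):
    print('Similarity between document %d and %d is %.5f' % (i, j, sim))
```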


#Assignment Question 6.

**Time estimate:** One hour.

For questions 6, 7, and 8 you will use a dataset I collected from Wikipedia that contains the article text for all the [Academy Award Winning Films on Wikipedia](http://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films). After next week, you'll know how to collect this data yourself!

The dataset is available on the course website as award_winners.zip. If you extract this file, you'll get a file called "award_winners.txt." The dataset is tab-delimited, and the "text" field holds the extracted text of the Wikipedia article associated with the movie.

    movie_title1       url1           text1
    movie_title2       url2           text2
    movie_title3       url3           text3    
    .......
    
First, complete the generator function called `readMovies` below (we learned about generators last week). Each record it yields should be a dictionary with four keys: 'title', 'url', 'id', and 'text'. The ids should be assigned consecutively, starting with 0.

You can use the `testReadMovies` function below to test your readMovies code. You should **not need to alter it**.

In [18]:
def readMovies(path):
    """
    Returns a generator that yields a record for each movie in the specified movies.txt.
    The format of a single record is {
        'id' : 0,
        'title' : '12 Years a Slave',
        'url' : 'http://en.wikipedia.org/wiki/12_Years_a_Slave_(film)',
        'text' : '12 Years a Slave is a 2013 British-Amer...'        
    }
    """

def testReadMovies(path):
    """Prints debugging information about the movies.txt"""
    for movie in readMovies(path):
        if movie['id'] <= 3 or movie['id'] >= 1193:                
            print("============================ Movie %d ============================" % movie['id'])
            print("'%s'" % movie['title'])
            print(movie['url'])
            print(movie['text'][:80] + '...')

#testReadMovies('movies.txt')

Once you successfully complete the function, the output from running `testReadMovies` should be **exactly**:

    ============================ Movie 0 ============================
    '12 Years a Slave'
    http://en.wikipedia.org/wiki/12_Years_a_Slave_(film)
    12 Years a Slave is a 2013 British-American historical drama film and an adaptat...
    ============================ Movie 1 ============================
    '20 Feet from Stardom'
    http://en.wikipedia.org/wiki/20_Feet_from_Stardom
    20 Feet from Stardom is an Oscar-winning 2013 American documentary film directed...
    ============================ Movie 2 ============================
    '20,000 Leagues Under the Sea'
    http://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1954_film)
    20,000 Leagues Under the Sea is a 1954 American adventure film starring Kirk Dou...
    ============================ Movie 3 ============================
    '2001: A Space Odyssey'
    http://en.wikipedia.org/wiki/2001:_A_Space_Odyssey_(film)
    2001: A Space Odyssey is a 1968 British-American science fiction film produced a...
    ============================ Movie 1193 ============================
    'Zorba the Greek (Alexis Zorbas)'
    http://en.wikipedia.org/wiki/Zorba_the_Greek_(film)
    Zorba the Greek (Greek title: Αλέξης Ζορμπάς, Alexis Zorba(s))is a ...
    ============================ Movie 1194 ============================
    'tom thumb'
    http://en.wikipedia.org/wiki/Tom_thumb_(film)
    Tom Thumb (stylised as tom thumb) is a 1958 fantasy-musical film directed by Geo...

####Solution for Question 6.

In [19]:
def readMovies(path):
    # The with-block closes the file once the generator is exhausted.
    with open(path) as f:
        for i, line in enumerate(f):
            tokens = line.split('\t')
            yield { 'title' : tokens[0], 'url' : tokens[1], 'text' : tokens[2].strip(), 'id' : i }

testReadMovies('movies.txt')

IOError: [Errno 2] No such file or directory: 'movies.txt'
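
Because the dataset file is not available here, the parsing logic can be tried on an in-memory stand-in. The sketch below is an assumption-laden illustration: `read_records` is a hypothetical helper that takes any iterable of lines, and the two sample rows and their `example.org` URLs are made up.

```python
import io

def read_records(lines):
    """Yield one dict per tab-delimited line; ids are assigned
    consecutively, starting with 0 (hypothetical helper)."""
    for i, line in enumerate(lines):
        # Split into at most three fields so tabs inside the text survive.
        title, url, text = line.rstrip('\n').split('\t', 2)
        yield {'id': i, 'title': title, 'url': url, 'text': text.strip()}

# Made-up sample data in the same three-column layout as the dataset.
sample = io.StringIO(
    u'Movie A\thttp://example.org/a\tText of the first article\n'
    u'Movie B\thttp://example.org/b\tText of the second article\n'
)
records = list(read_records(sample))
print(records[1]['id'], records[1]['title'])
```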

#Assignment Question 7.

**Time estimate:** 30 minutes.

Next, complete the `readMovieDocs()` function below, which returns a list of all the movie documents, in order. Also complete the `readMovieTitles()` function, which does the same thing for titles. You can use the `testReadDocsAndTitles` function below to make sure they are working properly.

***Hint:*** You should make good use of your `readMovies` function.

In [189]:
def readMovieDocs(path):
    """Returns a list of strings representing the movie articles text in the specified movies.txt"""

def readMovieTitles(path):
    """Returns a list of strings representing the movie titles in the specified movies.txt"""
    
def testReadDocsAndTitles(path):
    docs = readMovieDocs(path)
    for i, d in enumerate(docs):
        if i <= 3 or i >= 1193:          
            print('doc %d is: %s' % (i, d[:50]))
    print('\n')
            
    titles = readMovieTitles(path)
    for i, t in enumerate(titles):
        if i <= 3 or i >= 1193:          
            print('title %d is: %s' % (i, t))

#testReadDocsAndTitles(path)

The function should output exactly:

        doc 0 is: 12 Years a Slave is a 2013 British-American histor
        doc 1 is: 20 Feet from Stardom is an Oscar-winning 2013 Amer
        doc 2 is: 20,000 Leagues Under the Sea is a 1954 American ad
        doc 3 is: 2001: A Space Odyssey is a 1968 British-American s
        doc 1193 is: Zorba the Greek (Greek title: Αλέξης Ζορ�
        doc 1194 is: Tom Thumb (stylised as tom thumb) is a 1958 fantas
        
        
        title 0 is: 12 Years a Slave
        title 1 is: 20 Feet from Stardom
        title 2 is: 20,000 Leagues Under the Sea
        title 3 is: 2001: A Space Odyssey
        title 1193 is: Zorba the Greek (Alexis Zorbas)
        title 1194 is: tom thumb

####Solution to Question 7.

In [190]:
def readMovieDocs(path):
    return [m['text'] for m in readMovies(path)]

def readMovieTitles(path):
    return [m['title'] for m in readMovies(path)]

testReadDocsAndTitles(path)

doc 0 is: 12 Years a Slave is a 2013 British-American histor
doc 1 is: 20 Feet from Stardom is an Oscar-winning 2013 Amer
doc 2 is: 20,000 Leagues Under the Sea is a 1954 American ad
doc 3 is: 2001: A Space Odyssey is a 1968 British-American s
doc 1193 is: Zorba the Greek (Greek title: Αλέξης Ζορ�
doc 1194 is: Tom Thumb (stylised as tom thumb) is a 1958 fantas


title 0 is: 12 Years a Slave
title 1 is: 20 Feet from Stardom
title 2 is: 20,000 Leagues Under the Sea
title 3 is: 2001: A Space Odyssey
title 1193 is: Zorba the Greek (Alexis Zorbas)
title 1194 is: tom thumb


#Assignment Question 8

**Time estimate:** One hour

Next, we need to create the tf-idf matrix for the movie articles. Complete the `create_movie_matrix` function below. This code should closely match the code in Section 4. 

Based on my tests, I would like you to change one detail of the code in section 4. I found it better to reduce the magnitude of tf-idf scores for very popular words. You can do this by telling scikit-learn to use sublinear tf scaling, which replaces $tf$ with $1 + \log(tf)$, by specifying:

    TfidfTransformer(norm="l2", sublinear_tf=True)
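
To get a feel for the effect, here is a small standard-library sketch of the scaling itself. scikit-learn documents `sublinear_tf=True` as replacing tf with 1 + log(tf); the function name `sublinear` is only for illustration.

```python
from math import log

def sublinear(tf):
    """Sublinear tf scaling, 1 + log(tf), defined for tf > 0."""
    return 1.0 + log(tf)

for tf in (1, 10, 100):
    print('tf=%3d -> %.2f' % (tf, sublinear(tf)))
```

A word that occurs 100 times ends up weighted only about 5.6 times as much as a word that occurs once, rather than 100 times as much.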
    
You can use the `test_movie_matrix` function below to test your code:

In [199]:
from sklearn.feature_extraction.text import CountVectorizer

def test_movie_matrix(path):
    M = create_movie_matrix(path)
    print('Shape is' + str(M.shape))
    print('Selected entry is ' + str(M[0,60634]))
    
def create_movie_matrix(path):
    """Returns the tf-idf transformed movie feature matrix for the specified movies.txt."""
    
#test_movie_matrix(path)

It should output exactly:

        Shape is(1195, 60920)
        Selected entry is 0.0524889070596

####Solution to Question 8.

In [233]:
def create_movie_matrix(path):
    
    docs = readMovieDocs(path)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit_transform(docs)
    lexicon =  count_vectorizer.get_feature_names()
    sk_tf_matrix = count_vectorizer.transform(docs)
    
    tfidf = TfidfTransformer(norm="l2", sublinear_tf=True)
    tfidf.fit(sk_tf_matrix)        
    return tfidf.transform(sk_tf_matrix)

test_movie_matrix(path)

Shape is(1195, 60920)
Selected entry is 0.0524889070596


#Assignment Question 9.

**Time estimate:** Two hours.

Imagine you are designing a system that collects text documents from users and you would like to provide a visual "gist" for a document. One simple NLP technique for describing a text document is displaying the highest scoring terms in a document's tf-idf vector.

Complete the top_terms function below. It should return the top n terms that have highest values in your tf-idf vector. ***Hint:*** You may find it easier to convert the sci-kit vector to a native python vector using your `sk_vector_to_simple_row` method.

You can test your function using the `test_top_terms` method:

In [248]:
def top_terms(lexicon, vector, n):
    """
    Given a sci-kit sparse tf-idf vector, returns a list of the highest-scoring n terms.
    """    

def create_lexicon(path):
    """Utility function that returns a list that can be used to map from feature indexes to terms."""
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit_transform(readMovieDocs(path))
    return count_vectorizer.get_feature_names()
    
def test_top_terms(path):
    lexicon = create_lexicon(path)
    titles = readMovieTitles(path)
    matrix = create_movie_matrix(path) 
    for (title, row) in list(zip(titles, matrix))[:20]:
        terms = top_terms(lexicon, row, 10)
        print('top terms for "%s" are: %s\n' % (title, terms))

#test_top_terms(path)

The first two lines of test_top_terms should display:

        top terms for "12 Years a Slave" are: [u'northup', u'ejiofor', u'epps', u'nyong', u'slave', u'chiwetel', u'mcqueen', u'slavery', u'patsey', u'fassbender']
        
        top terms for "20 Feet from Stardom" are: [u'stardom', u'lawry', u't\xe1ta', u'darlene', u'vega', u'fischer', u'86th', u'feet', u'clayton', u'merry']
        
You'll notice that these terms are extremely specific. Create a second version of the top_terms function that filters out any terms that appear in too few documents:

In [242]:

def top_terms2(lexicon, vector, n, df, min_docs):
    """
    Given a sci-kit sparse tf-idf vector, returns a list of the highest-scoring n terms.
    Any terms that appear in less than min_docs documents will be removed.
    df is a dictionary whose keys are terms and values are the number of documents it appears in.
    """
    
def sk_document_freq(lexicon, matrix):
    """Utility method that returns a dict whose keys are terms and values are document frequencies."""
    df = defaultdict(int)
    for row in matrix:
        for i in row.indices:
            term = lexicon[i]
            df[term] += 1
    return df
    
    
def test_top_terms2(path):
    lexicon = create_lexicon(path)
    titles = readMovieTitles(path)
    matrix = create_movie_matrix(path)    
    doc_freq = sk_document_freq(lexicon, matrix)
    for (title, row) in list(zip(titles, matrix))[:20]:
        terms = top_terms2(lexicon, row, 10, doc_freq, 20)
        print('top terms for "%s" are: %s\n' % (title, terms))

#test_top_terms2(path)

After completing this enhanced method, the terms for the first movie should be:

        top terms for "12 Years a Slave" are: [u'slave', u'mcqueen', u'bass', u'12', u'ford', u'sailor', u'cotton', u'christian', u'historic', u'twelve']

####Solution for Question 9.

In [236]:
def top_terms(lexicon, vector, n):
    """
    Given a sci-kit sparse tf-idf vector, returns a list of the highest-scoring n terms.
    """
    simple = sk_vector_to_simple_row(lexicon, vector)    
    terms = sorted(simple, key=simple.get, reverse=True)
    return terms[:n]

test_top_terms(path)

top terms for "12 Years a Slave" are: [u'northup', u'ejiofor', u'epps', u'nyong', u'slave', u'chiwetel', u'mcqueen', u'slavery', u'patsey', u'fassbender']

top terms for "20 Feet from Stardom" are: [u'stardom', u'lawry', u't\xe1ta', u'darlene', u'vega', u'fischer', u'86th', u'feet', u'clayton', u'merry']

top terms for "20,000 Leagues Under the Sea" are: [u'aronnax', u'nautilus', u'nemo', u'conseil', u'leagues', u'mcg', u'ned', u'squid', u'vulcania', u'submarine']

top terms for "2001: A Space Odyssey" are: [u'monolith', u'kubrick', u'bowman', u'poole', u'odyssey', u'jupiter', u'clarke', u'9000', u'millimetre', u'weidner']

top terms for "7 Faces of Dr. Lao" are: [u'lao', u'randall', u'stark', u'lindquist', u'cassan', u'woldercan', u'circus', u'cunningham', u'serpent', u'medusa']

top terms for "7th Heaven" are: [u'7th', u'borzage', u'heaven', u'gobin', u'gaynor', u'zhou', u'yuan', u'glazer', u'chico', u'1937']

top terms for "8 Mile" are: [u'eminem', u'mile', u'basinger', u'wink', u'r

In [241]:
def top_terms2(lexicon, vector, n, df, min_docs):
    """
    Given a sci-kit sparse tf-idf vector, returns a list of the highest-scoring n terms.
    Any terms that appear in less than min_docs documents will be removed.
    df is a dictionary whose keys are terms and values are the number of documents it appears in.
    """
    simple = sk_vector_to_simple_row(lexicon, vector) 
    terms = sorted(simple, key=simple.get, reverse=True)
    terms = [t for t in terms if df[t] >= min_docs]
    return terms[:n]

test_top_terms2(path)

top terms for "12 Years a Slave" are: [u'slave', u'mcqueen', u'bass', u'12', u'ford', u'sailor', u'cotton', u'christian', u'historic', u'twelve']

top terms for "20 Feet from Stardom" are: [u'feet', u'merry', u'judith', u'waters', u'sundance', u'singers', u'2013', u'jo', u'festival', u'20']

top terms for "20,000 Leagues Under the Sea" are: [u'ned', u'sea', u'mason', u'monster', u'disney', u'albums', u'disneyland', u'kirk', u'1954', u'captain']

top terms for "2001: A Space Odyssey" are: [u'kubrick', u'odyssey', u'clarke', u'hal', u'floyd', u'space', u'discovery', u'science', u'stanley', u'2001']

top terms for "7 Faces of Dr. Lao" are: [u'stark', u'circus', u'angela', u'faces', u'mike', u'dr', u'henchmen', u'tony', u'cowboy', u'monster']

top terms for "7th Heaven" are: [u'heaven', u'1937', u'diane', u'chinese', u'1927', u'angel', u'seventh', u'remake', u'china', u'janet']

top terms for "8 Mile" are: [u'jimmy', u'rabbit', u'doc', u'yourself', u'alex', u'gang', u'stephanie', u'leaders
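
Both versions sort every term in the document and then keep the first n. When n is small, `heapq.nlargest` is an alternative that avoids a full sort. The sketch below works on a plain dict, such as the output of `sk_vector_to_simple_row`. The function name `top_terms_from_dict`, the scores, and the document frequencies are all made up for illustration.

```python
import heapq

def top_terms_from_dict(simple, n, df=None, min_docs=0):
    """Return the n highest-scoring terms of a {term: score} dict.
    If df is given, terms appearing in fewer than min_docs documents are dropped."""
    candidates = [t for t in simple if df is None or df.get(t, 0) >= min_docs]
    return heapq.nlargest(n, candidates, key=simple.get)

# Made-up scores and document frequencies.
scores = {'northup': 0.50, 'slave': 0.30, 'film': 0.05, 'patsey': 0.40}
doc_freq = {'northup': 1, 'slave': 25, 'film': 900, 'patsey': 1}

unfiltered = top_terms_from_dict(scores, 2)
filtered = top_terms_from_dict(scores, 2, doc_freq, 20)
print(unfiltered)
print(filtered)
```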

#Assignment Question 10.

**Time estimate:** Two hours.

For your final task, you'll calculate the most similar neighbor articles for a particular movie article. You should use the `sk_dot` function from Section 5 as your measure of similarity.

Complete the neighbors function below. It should calculate the similarity to every other movie, and print out the top 10 similarity scores and movie titles in descending order.

In [247]:
def neighbors(titles, matrix, target):
    """
        Given a target movie's sk vector, finds the 10 most similar other rows (i.e. movies).
        Prints out the scores and titles for each of the most similar movies.
    """

def test_neighbors(path):
    titles = readMovieTitles(path)
    matrix = create_movie_matrix(path)
    target = matrix[9]   # the abyss
    neighbors(titles, matrix, target)

test_neighbors(path)

####Solution to Question 10.

In [246]:
def neighbors(titles, matrix, target):
    """
        Given a target movie's sk vector, finds the 10 most similar rows (i.e. movies).
        Prints out the scores and titles for each of the most similar movies.
        Note that the target itself is included, with a similarity of about 1.0.
    """
    scored = []
    for (i, candidate) in enumerate(matrix):
        scored.append((sk_dot(target, candidate), titles[i]))
    scored.sort(reverse=True)   # highest similarity first
    for n in scored[:10]:
        print(n)

test_neighbors(path)

(0.99999999999999822, 'Abyss, TheThe Abyss')
(0.24877942071530276, 'Titanic')
(0.23415443073467382, 'Alien')
(0.23348650601736651, 'Avatar')
(0.23063808100837763, 'Jaws')
(0.22794608039107156, 'Terminator 2: Judgment Day')
(0.22553156006579073, '2001: A Space Odyssey')
(0.21666064780549232, 'Gravity')
(0.21602804890256305, 'Aliens')
(0.21470730303397395, 'Independence Day')
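
The first line of the output is the target movie itself, with a similarity of about 1.0. If only *other* movies are wanted, one option is to skip the target's own row. The sketch below shows that idea on toy dict vectors rather than the scikit-learn matrix. The name `nearest_other`, the titles, and the vectors are hypothetical.

```python
def nearest_other(titles, vectors, target_index, k):
    """Return the k most similar (score, title) pairs, excluding the target itself."""
    target = vectors[target_index]
    scored = []
    for i, candidate in enumerate(vectors):
        if i == target_index:
            continue  # skip the target's own row
        sim = sum(w * candidate.get(t, 0.0) for t, w in target.items())
        scored.append((sim, titles[i]))
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Toy unit-length vectors, not taken from the dataset.
titles = ['A', 'B', 'C']
vectors = [{'sea': 0.8, 'ship': 0.6}, {'sea': 1.0}, {'ship': 1.0}]
result = nearest_other(titles, vectors, 0, 2)
print(result)
```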
