# CPS600 - Python Programming for Finance 
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# More Practice & Scripts

###  September 27, 2018

### Here is one quick way to download a text from inside our Python session.

In [1]:
from urllib import request
request.urlretrieve("https://www.gutenberg.org/files/158/158-0.txt", "Emma.txt")

('Emma.txt', <http.client.HTTPMessage at 0x7f8b4bf7ccc0>)

### While we're at it, let's download `words.txt` from *Think Python*.

In [12]:
fileURL = "https://raw.githubusercontent.com/AllenDowney/ThinkPython2/master/code/words.txt"
request.urlretrieve(fileURL, "words.txt")

('words.txt', <http.client.HTTPMessage at 0x7f8b4b723518>)

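Re-running the two cells above downloads both files again each time. Below is a minimal sketch of a guard that skips the download when the file is already on disk. The helper `download_if_missing` and its `fetch` parameter are hypothetical names introduced here for illustration; they are not part of `urllib`.

```python
import os
from urllib import request


def download_if_missing(url, filename, fetch=request.urlretrieve):
    """Download url to filename unless the file already exists.

    fetch is a hypothetical hook (it defaults to urlretrieve) so the
    function can be tried out without network access.
    Returns True if a download happened, False otherwise.
    """
    if os.path.exists(filename):
        return False
    fetch(url, filename)
    return True
```

With this in place, a call such as `download_if_missing(fileURL, "words.txt")` would only hit the network the first time it runs.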
### Let's import `time` so that we can make a comparison.

In [2]:
import time

### Here are two different ways to write the desired function.

In [5]:
def make_word_list1():
    """Reads lines from a file and builds a list using append."""
    t = []
    with open('words.txt') as fin:  # the file is closed when the block ends
        for line in fin:
            word = line.strip()
            t.append(word)
    return t


# The next one is very slow; we keep it only for the timing comparison.
def make_word_list2():
    """Reads lines from a file and builds a list using list +."""
    t = []
    with open('words.txt') as fin:
        for line in fin:
            word = line.strip()
            t = t + [word]  # builds a new list on every iteration
    return t


### Below, we compare these two functions. Which one is the faster of the two?

In [None]:
# perf_counter is a high-resolution clock intended for measuring short durations
start_time = time.perf_counter()
t = make_word_list1()
elapsed_time = time.perf_counter() - start_time

print(len(t))
print(t[:10])
print(elapsed_time, 'seconds')

start_time = time.perf_counter()
t = make_word_list2()
elapsed_time = time.perf_counter() - start_time

print(len(t))
print(t[:10])
print(elapsed_time, 'seconds')

### Whoa! `make_word_list2` is dramatically slower than `make_word_list1`.
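The difference comes down to how each function grows its list. `append` adds to the existing list, while `t + [word]` builds a new list and copies every earlier element on each pass through the loop. The sketch below shows the same effect without the file, using `timeit` on an in-memory list. The helper names and the list size of 10000 are assumptions chosen for illustration.

```python
import timeit


def build_with_append(words):
    """Grow one list in place; each append is amortized constant time."""
    t = []
    for word in words:
        t.append(word)
    return t


def build_with_concat(words):
    """Build a brand-new list on every iteration, copying all earlier items."""
    t = []
    for word in words:
        t = t + [word]
    return t


# Hypothetical stand-in for the contents of words.txt
words = [str(i) for i in range(10000)]

append_time = timeit.timeit(lambda: build_with_append(words), number=1)
concat_time = timeit.timeit(lambda: build_with_concat(words), number=1)

print('append:', append_time, 'seconds')
print('concat:', concat_time, 'seconds')
```

Both functions return the same list; only the amount of copying differs, and the gap widens as the input grows.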

### Next, we want to clean up the book and compute frequency statistics: which words appear in the book, and how many times is each one used?

In [6]:
import string # Used to get punctuation

In [8]:
def process_file(filename, skip_header):
    """Makes a histogram that contains the words from a file.

    filename: string
    skip_header: boolean, whether to skip the Gutenberg header
   
    returns: map from each word to the number of times it appears.
    """
    hist = {}  # empty dictionary: word -> count

    # The with-block closes the file for us; the downloaded text is UTF-8
    with open(filename, encoding='utf-8') as fp:
        if skip_header:
            skip_gutenberg_header(fp)

        for line in fp:
            process_line(line, hist)

    return hist


def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.
    
    RMK: The only way to know which line ends the header is to
    look at the Gutenberg file itself. The prefix tested below
    had to be changed to match the format of the current files.

    fp: open file object
    """
    for line in fp:
        if line.startswith('*** START OF TH'):
            break


def process_line(line, hist):
    """Adds the words in the line to the histogram.

    Modifies hist.
    
    RMK: This is not a *pure* function. It modifies
    one of its arguments. This is frowned upon
    in many circles, but it is one way to do things.

    line: string
    hist: histogram (map from word to frequency)
    """
    # replace hyphens with spaces before splitting
    line = line.replace('-', ' ')
    strippables = string.punctuation + string.whitespace

    for word in line.split():
        # remove punctuation and convert to lowercase
        word = word.strip(strippables)
        word = word.lower()

        # update the histogram
        hist[word] = hist.get(word, 0) + 1

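The header-skipping loop works because a file object is an iterator: once the first `for` loop breaks, a second loop over the same object resumes at the next line. Here is a small sketch with `io.StringIO` standing in for the downloaded file; the sample text is made up.

```python
import io

# Made-up text laid out like a Gutenberg file: header, marker line, body
fake_book = io.StringIO(
    'Some header text\n'
    '*** START OF THE PROJECT GUTENBERG EBOOK ***\n'
    'first line of the book\n'
    'second line of the book\n'
)

# First loop: consume lines up to and including the marker
for line in fake_book:
    if line.startswith('*** START OF TH'):
        break

# Second loop: picks up where the first one stopped
body = [line.strip() for line in fake_book]
print(body)
```

Only the two lines after the marker end up in `body`.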

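For comparison, here is a sketch of a variant that does not modify its argument: it returns a new `collections.Counter` instead of updating a dictionary passed in by the caller. The names `words_in_line` and `count_words` are hypothetical; the cleaning steps mirror those in `process_line`.

```python
import string
from collections import Counter


def words_in_line(line):
    """Return the cleaned, lowercased words in a line."""
    strippables = string.punctuation + string.whitespace
    line = line.replace('-', ' ')
    return [word.strip(strippables).lower() for word in line.split()]


def count_words(lines):
    """Return a new Counter mapping each word to its frequency."""
    counts = Counter()
    for line in lines:
        counts.update(words_in_line(line))
    return counts


sample = ['Emma Woodhouse, handsome, clever, and rich,',
          'with a comfortable home and happy disposition,']
counts = count_words(sample)
print(counts['and'])  # 2
print(len(counts))    # 12 different words
```

The caller gets a fresh object back each time, so nothing outside the function changes as a side effect.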

### We want to compute word statistics for a document.

### The `hist` dictionary object contains all the information about our word stats. It is easy to write functions that compute word count and *unique* word count.

In [9]:
def total_words(hist):
    """Returns the total of the frequencies in a histogram."""
    return sum(hist.values())


def different_words(hist):
    """Returns the number of different words in a histogram."""
    return len(hist)

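To see what these two functions measure, here is a tiny hand-built histogram (the words and counts are made up) with the same two computations applied directly.

```python
# Made-up histogram: word -> number of occurrences
small_hist = {'the': 3, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

total = sum(small_hist.values())  # every occurrence counts
different = len(small_hist)       # each key counts once

print('Total number of words:', total)          # 7
print('Number of different words:', different)  # 5
```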
### Finally, here are our functions for finding the most frequently occurring words.

In [10]:
def most_common(hist):
    """Makes a list of (frequency, word) pairs in descending order of frequency.

    hist: map from word to frequency

    returns: list of (frequency, word) pairs
    """
    t = []
    for key, value in hist.items():
        t.append((value, key))

    t.sort(reverse=True)  # highest frequency first
    return t


def print_most_common(hist, num=10):
    """Prints the most common words in a histogram and their frequencies.
    
    hist: histogram (map from word to frequency)
    num: number of words to print
    """
    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[:num]:
        print(word, '\t', freq)

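An alternative sketch skips the swap into (frequency, word) tuples by sorting the dictionary items with a key function. The name `top_words` and the sample histogram are hypothetical. Note one difference: `sorted` is stable, so words with equal counts stay in insertion order here, whereas `most_common` orders ties by the word itself.

```python
# Made-up histogram: word -> number of occurrences
demo_hist = {'the': 3, 'cat': 1, 'sat': 2}


def top_words(hist, num=10):
    """Return up to num (word, frequency) pairs, most frequent first."""
    pairs = sorted(hist.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:num]


print(top_words(demo_hist, num=2))  # [('the', 3), ('sat', 2)]
```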

### OK, now it is time to use all of our functions. Let's try it on *Emma* first.

In [11]:
hist = process_file('Emma.txt', skip_header=True)
print('Total number of words:', total_words(hist))
print('Number of different words:', different_words(hist))

print_most_common(hist, num=20)

Total number of words: 164029
Number of different words: 8893
The most common words are:
the 	 5353
to 	 5303
and 	 4889
of 	 4392
a 	 3170
i 	 2861
her 	 2445
it 	 2407
was 	 2394
she 	 2334
in 	 2233
not 	 2146
be 	 1984
you 	 1900
he 	 1770
that 	 1766
had 	 1622
as 	 1435
but 	 1364
for 	 1353
