<h1>Problem 1: Word counts</h1>
Write a function <span style="color:red">word_count</span> that takes a text string as an argument and returns a dictionary mapping each distinct word in that string to the number of times it occurs.

For example: 

For the string "It was the best of times it was the worst of times", your function should return the following dictionary:

{'It': 1,
 'best': 1,
 'it': 1,
 'of': 2,
 'the': 2,
 'times': 2,
 'was': 2,
 'worst': 1}
 
 Notes:
 
 1. Assume that there is no punctuation, not even the end of sentence period, in the string, only words separated by spaces. 
 
 2. The function <span style="color:red">split</span> splits a string on spaces. An example call of the function is: <span style="color:red">"hello fellow".split()</span> which will return the list <span style="color:red">['hello', 'fellow']</span>
 
 3. Treat words with different cases as different words ("hello" and "Hello" are not the same word)
 
 4. You might find the <a href="http://book.pythontips.com/en/latest/for_-_else.html">for ... else ...</a> structure useful for this problem 
 
 5. If the string is empty, the function should return an empty dictionary
 
 6. Depending on your version of Python, the ordering of words in your dictionary may differ from the ordering shown above. The counts matter, not the order!
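A minimal sketch of the two building blocks mentioned in the notes, `split` and `for ... else`. The helper name `contains` is hypothetical and only serves to illustrate the `else` branch; it is not part of the required solution.

```python
# str.split() with no arguments splits on any run of whitespace,
# so repeated or trailing spaces do not produce empty strings.
print("hello fellow".split())     # ['hello', 'fellow']
print("hello   fellow ".split())  # ['hello', 'fellow']
print("".split())                 # [] (an empty string yields an empty list)


def contains(words, target):
    """Hypothetical helper: report whether target appears in words."""
    for w in words:
        if w == target:
            found = True
            break
    else:
        # The else branch runs only when the loop finished without a break.
        found = False
    return found


print(contains(["hello", "fellow"], "fellow"))  # True
print(contains(["hello", "fellow"], "Hello"))   # False (case matters)
```

Because `"".split()` already returns an empty list, a loop over the words of an empty string simply never executes.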

In [1]:
def word_count(text):
    '''
    goal: return a dictionary with keys as distinct words in the text and values as the number of appearances
          of that word
    
    questions 
    - how do we isolate each word
        use the split function and then add NEW words to dictionary, increase count when old
    
    efficiency 
    time complexity: O(n) on average; each dictionary lookup inside the for loop is O(1) on average
    space complexity: O(n) for the dict plus O(n) for the words list; n is the number of words
    '''
    counts = dict()

    #Your code goes here
    
    # use split to get each individual word 
    words = text.split()
    
    # catch the null case 
    if len(words) == 0: return counts
    
    for word in words: 
        if word in counts:
            counts[word]+=1
        else:
            counts[word]=1
    
    return counts


In [2]:
#test your function with the following sample data

#Should return {'It': 1, 'best': 1, 'it': 1, 'of': 2, 'the': 2, 'times': 2, 'was': 2, 'worst': 1}
text1 = "It was the best of times it was the worst of times"
print(word_count(text1))

#Should return {}
text1 = ""
print(word_count(text1))

{'It': 1, 'was': 2, 'the': 2, 'best': 1, 'of': 2, 'times': 2, 'it': 1, 'worst': 1}
{}
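One way to cross-check the output above is to compare it with the standard library's `collections.Counter`. This is a sketch of an alternative, not the approach the exercise asks for, and the name `word_count_with_counter` is hypothetical.

```python
from collections import Counter


def word_count_with_counter(text):
    """Hypothetical cross-check: count words using collections.Counter."""
    # Counter tallies each item of the list; dict() converts it back to a
    # plain dictionary so it compares and prints like the expected output.
    return dict(Counter(text.split()))


text1 = "It was the best of times it was the worst of times"
counts = word_count_with_counter(text1)
print(counts)
print(word_count_with_counter(""))  # {}
```

Dictionaries compare equal regardless of key order, so `counts` can be compared directly against the expected dictionary.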


<h1>Problem 2: word encodings and vocabulary</h1>
Many text mining problems use word encodings as an input to the analytic process. The idea behind word encodings is very simple: a corpus of documents (corpus = "many documents" in simple English!) contains a vocabulary (the set of words used across all documents). The vocabulary is textual ("green", "people", "carrots") but data analysis needs numerical data. The solution is to replace each word with a numeric code. For example, if the corpus contains two documents:

doc1 = "it was the best of times it was the worst of times"<br>
doc2 = "the good times of today are the sad thoughts of tomorrow"

Then we can represent word encodings by the following dictionary (note that your numbering may be different):

{'are': 9,
 'best': 3,
 'good': 7,
 'it': 0,
 'of': 4,
 'sad': 10,
 'the': 2,
 'thoughts': 11,
 'times': 5,
 'today': 8,
 'tomorrow': 12,
 'was': 1,
 'worst': 6}
 
 If you look at the dictionary carefully, the encoding process should be very clear. "it" was the first word in the first document and it was encoded as a 0. "was" was the second word and it was encoded as a 1. And so on. Each distinct word is associated with an integer and the size of the vocabulary is the number of distinct words.
 
 Write a function <span style="color:blue">vocabulary</span> that takes a list of documents as an argument and returns a dictionary containing the encoded vocabulary.
 
 Notes:
 
 1. Assume that each document is a single text string containing words separated by spaces and with absolutely no punctuation (not even periods at the end)

2. If the corpus is empty, the function should return an empty dictionary
 

In [3]:
def vocabulary(corpus):
    '''
    goal: make a dictionary where the keys are distinct words and the values are the IDs associated with each word;
    start your IDs such that your first word has ID 0 and your second has ID 1 (if it is a new word); return 
    the dictionary after probing all documents in the corpus
    
    questions 
    - method add words to dictionary 
        + linearly probe the 1st document, check if the word is in the dictionary if not add it
        + need to keep track of a cntr that will allow us to make new IDs according to the number of new words
        
    efficiency: 
    time: O(m*n) on average, where m is the number of docs and n is the avg number of words per doc
          (each dictionary lookup is O(1) on average)
    space: O(v) where v is the number of distinct words across all documents
    '''
    vocab = dict()
    
    #YOUR CODE GOES HERE
    
    # check for the null case:
    if len(corpus) == 0: return vocab
    
    ID = 0
    # loop over our documents
    for document in corpus:
        
        words_in_doc = document.split()
        for word in words_in_doc:
            # only new words get an ID; words already seen keep their existing ID
            if word not in vocab:
                vocab[word] = ID
                ID+=1
    
    return vocab

In [4]:
#Test your function with the following example. 
#Should return: 
#{'it': 0,'was': 1,'the': 2,'best': 3,'of': 4,'times': 5,'worst': 6,'good': 7,'today': 8,'are': 9,'sad': 10,'thoughts': 11,'tomorrow': 12}
doc1 = "it was the best of times it was the worst of times"
doc2 = "the good times of today are the sad thoughts of tomorrow"

print(vocabulary([doc1,doc2]))

doc1 = ""
doc2 = ""
print(vocabulary([doc1,doc2]))

{'it': 0, 'was': 1, 'the': 2, 'best': 3, 'of': 4, 'times': 5, 'worst': 6, 'good': 7, 'today': 8, 'are': 9, 'sad': 10, 'thoughts': 11, 'tomorrow': 12}
{}
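The same encoding can be sketched more compactly with `dict.setdefault`, using the current size of the dictionary as the next free ID. This is an alternative illustration under the same assumptions as the problem (space-separated words, no punctuation); the name `vocabulary_setdefault` is hypothetical.

```python
def vocabulary_setdefault(corpus):
    """Hypothetical variant: encode words using dict.setdefault."""
    vocab = {}
    for document in corpus:
        for word in document.split():
            # len(vocab) is evaluated before the insert, so it is the next
            # unused ID; setdefault leaves already-encoded words unchanged.
            vocab.setdefault(word, len(vocab))
    return vocab


doc1 = "it was the best of times it was the worst of times"
doc2 = "the good times of today are the sad thoughts of tomorrow"
encoded = vocabulary_setdefault([doc1, doc2])
print(encoded)
print(len(encoded))  # 13 distinct words
```

Because IDs are assigned in order of first appearance, the largest ID is always one less than the vocabulary size.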


<h1>Problem 3: word_vectors</h1>
The <span style="color:red">vocabulary</span> function returns a dictionary containing the word-encoded vocabulary associated with the corpus. Once the encoding is done, each document can be replaced by a <span style="color:blue">word vector</span> that indicates which words (from the vocabulary) are present in the document and with what frequency. For example, given the corpus:

doc1 = "it was the best of times it was the worst of times"

doc2 = "the good times of today are the sad thoughts of tomorrow"

the word vector corresponding to doc1 is:

[2, 2, 2, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0]

Note that the length of the vector is equal to the length of the entire vocabulary. Each location in the word vector corresponds to the code for the corresponding word in the vocabulary. The value at each location is the frequency of the word in the document. Thus, location 0 corresponds to the word "it", which occurs twice in doc1. Location 3 corresponds to "best", which occurs once in doc1. 
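To make the indexing concrete, here is a minimal sketch that builds the vector for doc1 by hand. It hard-codes the encoding from Problem 2 and increments a slot per occurrence, which is a simplification for illustration rather than the word_count-based approach described in the notes.

```python
# Encoding copied from the Problem 2 example.
vocab = {'it': 0, 'was': 1, 'the': 2, 'best': 3, 'of': 4, 'times': 5,
         'worst': 6, 'good': 7, 'today': 8, 'are': 9, 'sad': 10,
         'thoughts': 11, 'tomorrow': 12}

doc1 = "it was the best of times it was the worst of times"

# One slot per vocabulary word, all starting at zero.
vector = [0] * len(vocab)
for word in doc1.split():
    # vocab[word] is the slot for this word; bump its count.
    vector[vocab[word]] += 1

print(vector)  # [2, 2, 2, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0]
```

Slots 7 through 12 stay at zero because those words appear only in doc2.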

Write a function <span style="color:red">word_vectors</span> that takes a list of texts as an argument and returns a list of word vectors. 

Notes:

1. Use the word_count function to get word frequencies

2. Use the vocabulary function to get the encoded vocabulary for the corpus

3. You can construct a list of zeros of a given length using <span style="color:blue">[0]*n</span> where n is an integer. <span style="color:blue">[0] * len(vocabulary)</span> will return a list of zeros of the length of the vocabulary. Create this list for each document and update individual locations by their corresponding frequencies in the document

In [5]:
def word_vectors(corpus):
    '''
    goal: create a word vector for each document
    
    questions
    - what is a word vector:
        a word vector is a list where each index is the ID of a word in the vocabulary and each value is that
        word's frequency in the document; for example, position one in the word vector corresponds to the word 
        with ID 1 in the vocabulary and holds the number of times that word appears in the document
        
    - how do we want to efficiently construct the word vector
        + loop over the documents 
        + make a freq dict 
        + make a vector of size n, where n is the number of distinct words across all documents (the vocabulary size)
        + loop over the split text in the CURRENT document mapping the word to a index and a freq and update 
        the vector 
        '''
    vocab = vocabulary(corpus)
    word_vectors = list()

    #YOUR CODE GOES HERE
    
    # check the null case:
    if len(corpus) == 0: return word_vectors
    
    for document in corpus:
        # create the word vector
        curr_word_vector = [0]*len(vocab)
        # get the frequencies 
        freqs = word_count(document)
        # loop over all of the words in the document
        for word in freqs:
            # update the word vector
            idx_word = vocab[word]
            freq_word = freqs[word]
            curr_word_vector[idx_word] = freq_word
            
        # add the word vector to the result
        word_vectors.append(curr_word_vector)
        
    return word_vectors
            
           

In [6]:
"""
The function should return two lists:
[[2, 2, 2, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 2, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1]]
"""
doc1 = "it was the best of times it was the worst of times"
doc2 = "the good times of today are the sad thoughts of tomorrow"
word_vectors([doc1,doc2])




[[2, 2, 2, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 2, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1]]

<h1>Problem 4: Moving averages</h1>
Moving averages are often used in trend analysis in timeseries data. A simple way of figuring out whether a timeseries is trending (i.e., moving consistently in either the upward or downward direction) or mean reverting (i.e., fluctuating around a mean) is to see if a shorter term moving average is consistently below or above a longer term moving average. 

Write a function <span style="color:blue">getMovingAverage(series,duration)</span> that takes a list of numbers as input (the series) and returns an n-period (the duration) moving average series of the same length as the original list. For each of the first n-1 positions, where a full n-element window is not yet available, return the average of all elements up to and including that position.

For example, if:

x = [1,2,3,4,5,6,7,8,9,10]

and

duration = 4

then, getMovingAverage(x,duration) should return:

[1.0, 1.5, 2.0, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]

The average at k=0 is 1/1<br>
The average at k=1 is (1 + 2)/2<br>
The average at k=2 is (1 + 2 + 3)/3<br>
The average at k=3 is (1 + 2 + 3 + 4)/4<br>
The average at k=4 is (2 + 3 + 4 + 5)/4<br>
etc.
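The pattern in the averages above can be sketched by listing the window used at each position k. The slice bounds below are one way to express "the last `duration` elements, or fewer near the start"; the variable names are illustrative only.

```python
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
duration = 4

windows = []
for k in range(len(x)):
    # The window ends at k and reaches back at most `duration` elements;
    # max(0, ...) keeps the start inside the list for the first positions.
    start = max(0, k - duration + 1)
    window = x[start:k + 1]
    windows.append(window)
    print(k, window, sum(window) / len(window))
```

For k=3 the window is `[1, 2, 3, 4]` and for k=4 it slides to `[2, 3, 4, 5]`, matching the averages 2.5 and 3.5 in the expected result.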



In [7]:
def getMovingAverage(series,duration):
    '''
    goal: get the moving average over a maximum history of length duration
    
    questions
    how does the duration affect the average
    - we must take the average of no more than duration consecutive numbers 
    
    how do we want to efficiently do this problem 
    - we employ a sliding window approach 
    - keep taking the averages of numbers between left and right idxs (inclusive) so long as the window 
      size, right - left + 1, is <= duration 
    - once this inequality is no longer true we need to increase left until it holds true
    '''
    mvg_avg = list()

    #YOUR CODE GOES HERE
    
    # null case 
    if len(series) == 0: return mvg_avg
    
    left, right = 0,0 
    while right < len(series):
        
        if right - left+1 <= duration:
            # take the average 
            avg = sum(series[left:right+1])/(right-left+1)
            mvg_avg.append(avg)
            
            # update ptr
            right+=1
        else:
            # if our window is too large then we need to shrink until we are within the range of duration 
            left+=1 
    return mvg_avg
    
x = [1,2,3,4,5,6,7,8,9,10]
getMovingAverage(x,4)

[1.0, 1.5, 2.0, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]

In [8]:
series = [1,2,3,4,5,6,7,8,9,10]
print(getMovingAverage(series,2)) #[1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
print(getMovingAverage(series,4)) #[1.0, 1.5, 2.0, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
print(getMovingAverage([],4)) #[] (empty list)
print(getMovingAverage([1],7)) #[1.0]

[1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
[1.0, 1.5, 2.0, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
[]
[1.0]
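As a possible refinement, the window sum can be maintained incrementally instead of being recomputed from a slice at every step, which makes the work per element constant. The sketch below assumes `duration` is at least 1; the name `moving_average_running_sum` is hypothetical.

```python
def moving_average_running_sum(series, duration):
    """Hypothetical variant: moving average using a running window sum.

    Assumes duration >= 1.
    """
    averages = []
    running_sum = 0
    for k, value in enumerate(series):
        # Add the newest element to the window.
        running_sum += value
        # Once more than `duration` elements have been seen, drop the
        # element that just fell out of the left side of the window.
        if k >= duration:
            running_sum -= series[k - duration]
        # The window holds k + 1 elements until it reaches full size.
        window_len = min(k + 1, duration)
        averages.append(running_sum / window_len)
    return averages


series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(moving_average_running_sum(series, 2))
print(moving_average_running_sum(series, 4))
print(moving_average_running_sum([], 4))
print(moving_average_running_sum([1], 7))
```

For integer input the results match the slice-based version exactly. With floating-point input, repeated addition and subtraction can accumulate small rounding differences.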
