<h2>Problem 1: Simple string manipulation</h2>

Equity data providers often add exchange information to a ticker symbol when reporting pricing or other data (primarily to resolve cross-exchange ambiguities). Reuters uses a coding scheme known as the Reuters Instrument Code (RIC), in which the exchange identifier follows a dot at the end of the ticker symbol.

For example, the RIC IBM.N refers to IBM on the NYSE, IBM.L refers to IBM on the London Stock Exchange, and VOD.OQ refers to Vodafone on the NASDAQ stock exchange.

Write a program that takes as input a Reuters RIC and separates out the ticker and the exchange. You may assume:

1. that the only "non-letter" character in the ticker is the dot and there will be only one dot

2. that every ticker has an exchange symbol (i.e., IBM.N and not IBM)

3. that the tickers and exchanges are valid (you don't need to check if the exchange identifier is valid)



<h3>Test your code with the following test examples</h3>

<h4>Example 1</h4>
Please enter a Reuters RIC symbol: VOD.OQ
<br>
The traded exchange for VOD is OQ
<p>
<h4>Example 2</h4>
Please enter a Reuters RIC symbol: IBM.L
<br>
The traded exchange for IBM is L
<h4>Example 3</h4>
Please enter a Reuters RIC symbol: GOOG.OQ
<br>
The traded exchange for GOOG is OQ

In [1]:
#PROBLEM 1 SOLUTION
ric = input('Please enter a Reuters RIC symbol: ')

# find function finds the location of the '.' in the input string
print('The traded exchange for ' + ric[:ric.find('.')] + ' is ' + ric[ric.find('.') + 1:])

Please enter a Reuters RIC symbol: GOOG.OQ
The traded exchange for GOOG is OQ
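
As an alternative to calling `find` twice, the same split could be sketched with `str.partition`, which returns the text before the separator, the separator itself, and the text after it. The helper name `split_ric` is hypothetical and not part of the assignment. Like the solution above, it assumes the RIC contains exactly one dot.

```python
def split_ric(ric):
    """Split a RIC such as 'VOD.OQ' into (ticker, exchange).

    Hypothetical helper: assumes exactly one dot, as the problem statement allows.
    """
    # partition splits on the first '.', returning (before, separator, after)
    ticker, _, exchange = ric.partition('.')
    return ticker, exchange


ticker, exchange = split_ric('VOD.OQ')
print('The traded exchange for ' + ticker + ' is ' + exchange)
```

Returning a tuple keeps the parsing separate from the printing, so the helper can be reused without `input`.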


<h2>Problem 2: Simple Arithmetic and string formatting</h2>
Write a program that converts an amount given in a foreign currency into USD. Your program should input the foreign currency symbol, the amount in foreign currency being converted, and the exchange rate to USD. Your program should print out the equivalent amount in USD (including the $ symbol). 

<b>You will need to research string formatting on your own!</b>

Example:

Please enter the foreign currency symbol: EUR<br>
Please enter the amount in EUR: 1.11<br>
Please enter the exchange rate (1 EUR to USD): 100.00<br>
The equivalent USD amount is: $111.00 <br>

Notes:

1. Your output should be formatted correctly with a dollar sign and with the amount rounded to two decimal places

2. Assume that the input is in the right format. I.e., when a number is requested, the user will enter a number (in other words, no error checking is necessary)

In [2]:
#PROBLEM 2 SOLUTION
currency = input('Please enter the foreign currency symbol: ')
amount = input('Please enter the amount in ' + currency + ': ')
exchange = input('Please enter the exchange rate (1 ' + currency +  ' to USD): ')

# format with '.2f' formats the output string so that it has 2 decimal places for the cents
print('The equivalent USD amount is: ' + '$' + format(float(amount) * float(exchange), '.2f'))

Please enter the foreign currency symbol: EUR
Please enter the amount in EUR: 1.11
Please enter the exchange rate (1 EUR to USD): 100.00
The equivalent USD amount is: $111.00
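
The same formatting could also be sketched with an f-string, which accepts the `.2f` format specification directly inside the braces. The function name `to_usd` is hypothetical. It takes numbers rather than reading from `input`, which is an assumption made so the formatting step can be shown on its own.

```python
def to_usd(amount, rate):
    """Return the USD equivalent formatted with a dollar sign and two decimals."""
    # ':.2f' formats the product as a fixed-point number with two decimal places
    return f'${amount * rate:.2f}'


print('The equivalent USD amount is: ' + to_usd(1.11, 100.00))
```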


<h1>Problem 3:</h1>
Often, when dealing with data, continuous features are converted into categorical ones. Write a function <i><span style="color:red">encode_array</span></i> that converts continuous data into categorical data using a conversion scheme.

For example:
if:<br>
input_array = (17,5,36,22,54,34,19,65,102)

and the categorical scheme is:<br>
category_limits = (10,20,30,40,50,60,70,80)<br>

* Values less than 10 are in category 0, values from 10 (inclusive) up to but not including 20 are in category 1, and so on. Values greater than or equal to 80 are in category 8.

and your function call is:<br>
encode_array(input_array,category_limits)

the output should be:<br>
[1, 0, 3, 2, 5, 3, 1, 6, 8]

Notes:

1. Assume that the category limits are a list (or tuple) of increasing values, as in the example above

2. There are many ways to write this function but you must encapsulate the encode function inside the encode_array function. Use the template below as a guideline

3. If category_limits is an empty list (or tuple), your function should return a list of all zeros 

4. If input_array is an empty list (or tuple), your function should return an empty list

5. You might find the <a href="http://book.pythontips.com/en/latest/for_-_else.html">for ... else ...</a> structure useful for this problem 
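
For readers who have not met `for ... else` before, here is a minimal sketch that is unrelated to the encoding problem itself: the `else` branch runs only when the loop finishes without hitting `break`. The function name `first_negative` is hypothetical.

```python
def first_negative(values):
    """Return the first negative number in values, or None if there is none."""
    for value in values:
        if value < 0:
            found = value
            break  # leaving the loop early skips the else branch
    else:
        # reached only when the loop ran to completion without a break
        found = None
    return found


print(first_negative([3, -1, -5]))
print(first_negative([1, 2, 3]))
```

An empty sequence also reaches the `else` branch, because the loop completes without ever breaking.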

In [3]:
#PROBLEM 3 SOLUTION
def encode_array(input_array,category_limits):
    
    # if category_limits is empty, returns a list of length of the input_array with all 0's
    if not category_limits:
        return [0] * len(input_array)
    
    # if input_array is empty, returns an empty list
    if not input_array:
        return []
    
    # encodes a single value: its category is the index of the first limit it falls below
    def encode(input_value, category_limits):
        for i in range(len(category_limits)):
            if input_value < category_limits[i]:
                return i
        else:
            # the value is >= every limit, so it belongs to the last category
            return len(category_limits)
     
    # instantiates empty output list
    encoding = []
    for value in input_array:
        encoding.append(encode(value, category_limits))
        
    return encoding

In [4]:
input_array = (17,5,36,22,54,34,19,65,102)
category_limits = (10,20,30,40,50,60,70,80)
encode_array(input_array,category_limits) #should return [1, 0, 3, 2, 5, 3, 1, 6, 8]

[1, 0, 3, 2, 5, 3, 1, 6, 8]

In [5]:
input_array = (17,5,36,22,54,34,19,65,102)
category_limits = ()
encode_array(input_array,category_limits) #should return [0, 0, 0, 0, 0, 0, 0, 0, 0]

[0, 0, 0, 0, 0, 0, 0, 0, 0]

In [7]:
input_array = ()
category_limits = (10,20,30,40,50,60,70,80)
encode_array(input_array,category_limits) #should return []

[]
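
If the nested `encode` function were not a requirement, the same encoding could be sketched with the standard library's `bisect_right`, which returns how many limits are less than or equal to a value. The name `encode_array_bisect` is hypothetical, and the sketch assumes that category_limits is sorted in ascending order.

```python
from bisect import bisect_right


def encode_array_bisect(input_array, category_limits):
    """Encode each value as the number of limits that are <= the value.

    Hypothetical alternative; assumes category_limits is sorted ascending.
    """
    return [bisect_right(category_limits, value) for value in input_array]


print(encode_array_bisect((17, 5, 36, 22, 54, 34, 19, 65, 102),
                          (10, 20, 30, 40, 50, 60, 70, 80)))
```

Both edge cases fall out without special handling: empty limits give category 0 for every value, and an empty input gives an empty list.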

<h1>Problem 4: Word counts</h1>
Write a function <span style="color:red">word_count</span> that takes a text string as an argument and returns a dictionary containing the count of each word in that string.

For example: 

For the string "It was the best of times it was the worst of times", your function should return the following dictionary:

{'It': 1,
 'best': 1,
 'it': 1,
 'of': 2,
 'the': 2,
 'times': 2,
 'was': 2,
 'worst': 1}
 
 Notes:
 
 1. Assume that there is no punctuation, not even the end of sentence period, in the string, only words separated by spaces. 
 
 2. The function <span style="color:red">split</span> splits a string on spaces. An example call of the function is: <span style="color:red">"hello fellow".split()</span> which will return the list <span style="color:red">['hello', 'fellow']</span>
 
 3. Treat words with different cases as different words ("hello" and "Hello" are not the same word)
 
 4. You might find the <a href="http://book.pythontips.com/en/latest/for_-_else.html">for ... else ...</a> structure useful for this problem 
 
 5. If the string is empty, the function should return an empty dictionary

In [8]:
#PROBLEM 4 SOLUTION
def word_count(text):
    
    # if the string is empty, returns an empty dictionary
    if not text:
        return {}
    
    # splits the string on spaces
    else:
        # instantiates output dictionary
        word_counts = {}
        words = text.split()
        
        # counts the number of appearances of each word in the string
        for word in words:
            if word not in word_counts:
                word_counts[word] = 1
            else:
                word_counts[word] = word_counts[word] + 1
                
    return word_counts

In [9]:
#Should return {'It': 1, 'best': 1, 'it': 1, 'of': 2, 'the': 2, 'times': 2, 'was': 2, 'worst': 1}
text1 = "It was the best of times it was the worst of times"
word_count(text1)

{'It': 1,
 'was': 2,
 'the': 2,
 'best': 1,
 'of': 2,
 'times': 2,
 'it': 1,
 'worst': 1}

In [10]:
#Should return {}
text1 = ""
word_count(text1)

{}
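
For comparison, the counting loop could also be sketched with `collections.Counter` from the standard library. The name `word_count_counter` is hypothetical; the assignment itself asks for the explicit dictionary version above.

```python
from collections import Counter


def word_count_counter(text):
    """Count whitespace-separated words, treating different cases as different words."""
    # Counter is a dict subclass; dict() converts it back to a plain dictionary
    return dict(Counter(text.split()))


print(word_count_counter("It was the best of times it was the worst of times"))
```

An empty string splits into an empty list, so the function returns an empty dictionary without a separate check.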

<h1>Problem 5: word encodings and vocabulary</h1>
Many text mining problems use word encodings as an input to the analytic process. The idea behind word encodings is very simple: a corpus of documents (corpus = "many documents" in simple English!) contains a vocabulary (the set of words used across all documents). The vocabulary is textual ("green", "people", "carrots") but data analysis works better with numeric data. The solution is to replace each word with a numeric code. For example, if the corpus contains two documents:

doc1 = "it was the best of times it was the worst of times"<br>
doc2 = "The good times of today are the sad thoughts of tomorrow"

Then we can represent word encodings by the following dictionary:

{'are': 9,
 'best': 3,
 'good': 7,
 'it': 0,
 'of': 4,
 'sad': 10,
 'the': 2,
 'thoughts': 11,
 'times': 5,
 'today': 8,
 'tomorrow': 12,
 'was': 1,
 'worst': 6}
 
 If you look at the dictionary carefully, the encoding process should be very clear. "it" was the first word in the first document and it was encoded as a 0. "was" was the second word and it was encoded as a 1. And so on. Words are lowercased before encoding, which is why "The" in doc2 shares the code 2 with "the".
 
 Write a function <span style="color:blue">vocabulary</span> that takes a list of documents as an argument and returns a dictionary containing the encoded vocabulary.
 
 Notes:
 
 1. Assume that each document is a single text string containing words separated by spaces and with absolutely no punctuation

2. If the corpus is empty, the function should return an empty dictionary
 

In [11]:
#PROBLEM 5 SOLUTION
def vocabulary(corpus):
    
    # if the corpus is empty, returns an empty dictionary
    if not corpus:
        return {}
    
    else:
        
        # instantiates output dictionary
        encoding = {}
        
        # instantiates counter for word encoding
        counter = 0
        
        # individually splits each document in the corpus and encodes
        for doc in corpus:
            doc = doc.lower()
            words = doc.split()
            for word in words:
                if word not in encoding:
                    encoding[word] = counter
                    counter += 1

    return encoding

In [12]:
#Test your function with the following example. 
#Should return: 
#{'are': 9, 'best': 3, 'good': 7, 'it': 0, 'of': 4, 'sad': 10, 'the': 2, 'thoughts': 11, 'times': 5, 'today': 8, 'tomorrow': 12, 'was': 1, 'worst': 6}
doc1 = "it was the best of times it was the worst of times"
doc2 = "the good times of today are the sad thoughts of tomorrow"

vocabulary([doc1,doc2])


{'it': 0,
 'was': 1,
 'the': 2,
 'best': 3,
 'of': 4,
 'times': 5,
 'worst': 6,
 'good': 7,
 'today': 8,
 'are': 9,
 'sad': 10,
 'thoughts': 11,
 'tomorrow': 12}
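
The explicit counter could also be avoided. In this sketch, `dict.setdefault` assigns `len(encoding)` as the code for a word only when the word is not yet in the dictionary. The name `vocabulary_setdefault` is hypothetical, and the lowercasing mirrors the solution above.

```python
def vocabulary_setdefault(corpus):
    """Assign each new lowercased word the next unused integer code."""
    encoding = {}
    for doc in corpus:
        for word in doc.lower().split():
            # len(encoding) is the next free code; existing words are left untouched
            encoding.setdefault(word, len(encoding))
    return encoding


doc1 = "it was the best of times it was the worst of times"
doc2 = "The good times of today are the sad thoughts of tomorrow"
print(vocabulary_setdefault([doc1, doc2]))
```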

<h1>Problem 6: word_vectors</h1>
The <span style="color:red">vocabulary</span> function returns a dictionary containing the word-encoded vocabulary associated with the corpus. Once the encoding is done, each document can be replaced by a <span style="color:blue">word vector</span> that indicates which words (from the vocabulary) are present in the document and with what frequency. For example, given the corpus:

doc1 = "it was the best of times it was the worst of times"<br>
doc2 = "The good times of today are the sad thoughts of tomorrow"

the word vector corresponding to doc1 is:

[2, 2, 2, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0]

Note that the length of the vector is equal to the length of the entire vocabulary. Each location in the word vector corresponds to the code for the corresponding word in the vocabulary. The value at each location is the frequency of the word in the document. Thus, location 0 corresponds to the word "it", which occurs twice in doc1. Location 3 corresponds to "best", which occurs once in doc1.
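
To make the mapping from codes to locations concrete, here is a minimal sketch that builds the word vector for doc1 by hand. The vocabulary dictionary is copied from the Problem 5 example. Counting directly in the loop is a simplification for illustration and does not use the word_count function that the notes below ask for.

```python
# Encoded vocabulary copied from the Problem 5 example
vocab = {'it': 0, 'was': 1, 'the': 2, 'best': 3, 'of': 4, 'times': 5, 'worst': 6,
         'good': 7, 'today': 8, 'are': 9, 'sad': 10, 'thoughts': 11, 'tomorrow': 12}

doc1 = "it was the best of times it was the worst of times"

# Start from all zeros, then add 1 at the word's code for every occurrence
vector = [0] * len(vocab)
for word in doc1.split():
    vector[vocab[word]] += 1

print(vector)
```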

Write a function <span style="color:red">word_vectors</span> that takes a list of texts as an argument and returns a list of word vectors. 

Notes:

1. Use the word_count function to get word frequencies

2. Use the vocabulary function to get the encoded vocabulary for the corpus

3. You can construct a list of zeros of a given length using <span style="color:blue">[0]*n</span> where n is an integer. <span style="color:blue">[0] * len(vocab)</span>, where vocab is the dictionary returned by the vocabulary function, will return a list of zeros of the length of the vocabulary. Create this list for each document and update individual locations by their corresponding frequencies in the document

In [13]:
[0]*6

[0, 0, 0, 0, 0, 0]

In [14]:
#PROBLEM 6 SOLUTION
def word_vectors(corpus):
    vocab = vocabulary(corpus)
    
    # instantiates output vector
    vector_list = []
    
    # iterates through each document in the corpus and creates each document's word frequency list
    for doc in corpus:
        vector = [0] * len(vocab)
        # lowercases the document first so the counts match the lowercased vocabulary keys
        count = word_count(doc.lower())
        
        # looks at each word in the vocabulary dictionary and sets its location in the document vector to the number of times
        # it appears in the corresponding document
        for word, index in vocab.items():
            vector[index] = count.get(word, 0)
        vector_list.append(vector)
    return vector_list

In [15]:
#Test your code with the following example
"""
The function should return a list of two lists:
[[2, 2, 2, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 2, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1]]
"""
doc1 = "it was the best of times it was the worst of times"
doc2 = "the good times of today are the sad thoughts of tomorrow"
word_vectors([doc1,doc2])

[[2, 2, 2, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 2, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1]]