# Week 6 Problem 2

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/assignments/README.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [1]:
import re
from nose.tools import assert_equal

In this problem set, we will use regular expressions (regex) to process real Twitter data. Specifically, using a sample of real tweets that contain the hashtag #informatics, we will use regex to clean up the text, since many tweets contain non-alphabetical characters as well as special tokens such as hashtags, @ signs, and HTTP links.

For simplicity, we will use only five tweets in this problem, but it's straightforward to scale to a data set with a large number of tweets after we write and test our functions.

In [2]:
tweets = '''
New #job opening at The Ottawa Hospital in #Ottawa - #Clinical #Informatics Specialist #jobs http://t.co/3SlUy11dro
Looking for a #Clinical #Informatics Pharmacist Park Plaza Hospital #jobs http://t.co/4Qw8i6YaJI
Info Session 10/7: MSc in Biomedical Informatics, University of Chicago https://t.co/65G8dJmhdR #HIT #UChicago #informatics #healthcare
Here's THE best #Books I've read on #EHR #HIE #HIPAA and #Health #Informatics http://t.co/meFE0dMSPe
@RMayNurseDir @FNightingaleF Scholars talking passionately about what they believe in. #informatics &amp; #skincare  https://t.co/m8qiUSxk0h
'''.strip().split('\n')

print(tweets)

['New #job opening at The Ottawa Hospital in #Ottawa - #Clinical #Informatics Specialist #jobs http://t.co/3SlUy11dro', 'Looking for a #Clinical #Informatics Pharmacist Park Plaza Hospital #jobs http://t.co/4Qw8i6YaJI', 'Info Session 10/7: MSc in Biomedical Informatics, University of Chicago https://t.co/65G8dJmhdR #HIT #UChicago #informatics #healthcare', "Here's THE best #Books I've read on #EHR #HIE #HIPAA and #Health #Informatics http://t.co/meFE0dMSPe", '@RMayNurseDir @FNightingaleF Scholars talking passionately about what they believe in. #informatics &amp; #skincare  https://t.co/m8qiUSxk0h']


Later in this course, we will learn how to use the [Twitter API](https://dev.twitter.com/overview/documentation) to monitor or process tweets in real-time. If you can't wait, see [Mining the Social Web 2nd Edition](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition) by Matthew A. Russell.

## 1. Split the text into words.

Words in each tweet are separated by one or more whitespace characters. Use regex to create a list of all words in `tweets`. Note that `tweets` is a list of five strings.
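
As a small illustration of splitting on runs of whitespace, here is a sketch that uses a made-up string (not taken from the data set) and the standard `re.split` function:

```python
import re

# A hypothetical string with a tab, a double space, and a newline between words.
sample = 'Health\tInformatics  is #fun\nright'

# r'\s+' matches one or more whitespace characters,
# so each run of whitespace counts as a single separator.
pieces = re.split(r'\s+', sample)
print(pieces)  # ['Health', 'Informatics', 'is', '#fun', 'right']
```

If a string begins or ends with whitespace, `re.split` returns an empty string at that end, so it can be worth stripping each string first.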

In [3]:
def split_into_words(tweets):
    '''
    Takes a list of tweets and returns a list of the words in all tweets.
    Since words are separated by one or more whitespace characters,
    the return value is a list of strings with no whitespace.
    
    Parameters
    ----------
    tweets: a list of strings. Strings have whitespaces.
    
    Returns
    -------
    A list of strings. Strings have no whitespace.
    Results from splitting each tweet in tweets by whitespace.
    '''
    # YOUR CODE HERE
    result = []
    # Split each tweet on runs of whitespace and collect the words.
    # The loop variable is named tweet so that it does not shadow the built-in str.
    for tweet in tweets:
        result.extend(re.split(r'\s+', tweet))
    return result

Let's see if it passes the following test:

In [4]:
words = split_into_words(tweets)
words_answer = [
    'New', '#job', 'opening', 'at', 'The', 'Ottawa', 'Hospital', 'in', '#Ottawa', '-',
    '#Clinical', '#Informatics', 'Specialist', '#jobs', 'http://t.co/3SlUy11dro',
    
    'Looking', 'for', 'a', '#Clinical', '#Informatics', 'Pharmacist', 'Park', 'Plaza', 'Hospital',
    '#jobs', 'http://t.co/4Qw8i6YaJI',
    
    'Info', 'Session', '10/7:', 'MSc', 'in', 'Biomedical', 'Informatics,', 'University', 'of', 'Chicago',
    'https://t.co/65G8dJmhdR', '#HIT', '#UChicago', '#informatics', '#healthcare',
    
    "Here's", 'THE', 'best', '#Books', "I've", 'read', 'on', '#EHR', '#HIE', '#HIPAA',
    'and', '#Health', '#Informatics', 'http://t.co/meFE0dMSPe',
    
    '@RMayNurseDir', '@FNightingaleF', 'Scholars', 'talking', 'passionately', 'about', 'what', 'they', 'believe', 'in.',
    '#informatics', '&amp;', '#skincare', 'https://t.co/m8qiUSxk0h'
]

assert_equal(words, words_answer)

## 2. Remove all words that contain hashtags (#).
The easiest way to do this (that I can think of) is to use `re.sub()` to replace every hashtag word with an empty string `''`. At the end, we can use a for loop or list comprehension to remove all empty strings from the list.
I'll even write the first part for you. You can replace every hashtag word with an empty string with

`words = [re.sub(r'\#.*', '', word) for word in words]`

where I iterated through words using list comprehension. This is equivalent to

```
a_list = []
for word in words:
    a_list += [re.sub(r'\#.*', '', word)]
words = a_list
```

The `r` prefix makes the pattern a raw string, so Python passes the backslash to the regex engine unchanged. The `\` before the # is optional here: # is special only in verbose mode (`re.VERBOSE`), where it starts a comment, so escaping it is harmless but not required. The `.` matches any character (except newline), and `*` means zero or more repetitions. Thus, the pattern matches from the first # in a word to the end of that word, and this line replaces every word in words that starts with a # with an empty string `''`. (A word with a # in the middle would lose only the part from the # onward.)

Now, if we remove all empty strings from this list, we have removed all words that are hashtags.
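
The two steps can be sketched on a few made-up words (the list `sample_words` is hypothetical and only for illustration). The second entry shows what happens when the # is not the first character:

```python
import re

# Hypothetical sample words; 'R&D#team' has a '#' in the middle.
sample_words = ['New', '#job', 'R&D#team', 'opening']

# The pattern matches from the first '#' to the end of the word.
substituted = [re.sub(r'\#.*', '', word) for word in sample_words]
print(substituted)  # ['New', '', 'R&D', 'opening']

# Drop the empty strings left behind by the substitution.
kept = [word for word in substituted if word != '']
print(kept)  # ['New', 'R&D', 'opening']
```

`list(filter(None, substituted))` gives the same result, because an empty string is falsy.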

In [5]:
def remove_hashtags(words):
    '''
    Take a list of strings.
    Returns a list of strings, where we discard all strings that are hashtags.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of strings. None of the strings in the return list has a hashtag.
    '''
    # YOUR CODE HERE
    # Replace everything from the first # in each word with an empty string '' using list comprehension
    words = [re.sub(r'\#.*', '', word) for word in words]
    # Use filter to remove all empty strings
    words = list(filter(None, words))
    return words

In [6]:
no_hashtags = remove_hashtags(words)
no_hashtags_answer = [
    'New', 'opening', 'at', 'The', 'Ottawa', 'Hospital', 'in', '-',
    'Specialist', 'http://t.co/3SlUy11dro',
    
    'Looking', 'for', 'a', 'Pharmacist', 'Park', 'Plaza', 'Hospital',
    'http://t.co/4Qw8i6YaJI',
    
    'Info', 'Session', '10/7:', 'MSc', 'in', 'Biomedical', 'Informatics,', 'University', 'of', 'Chicago',
    'https://t.co/65G8dJmhdR',
    
    "Here's", 'THE', 'best', "I've", 'read', 'on',
    'and', 'http://t.co/meFE0dMSPe',
    
    '@RMayNurseDir', '@FNightingaleF', 'Scholars', 'talking', 'passionately', 'about', 'what', 'they', 'believe', 'in.',
    '&amp;', 'https://t.co/m8qiUSxk0h'
]

assert_equal(no_hashtags, no_hashtags_answer)

## 3. Remove all words that contain users (@).
Similarly, remove all words that indicate users (words that begin with the @ character).
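
One possible way to think about "begins with the @ character" is sketched below on made-up words (`sample_words` and the e-mail address are hypothetical). `re.match` only matches at the start of a string, so it behaves differently from a substitution when the @ appears mid-word:

```python
import re

# Hypothetical sample words; the last one has '@' in the middle.
sample_words = ['@RMayNurseDir', 'Scholars', 'info@example.org']

# Option A: discard a word only if it begins with '@'.
no_mentions = [word for word in sample_words if not re.match(r'@', word)]
print(no_mentions)  # ['Scholars', 'info@example.org']

# Option B: blank out everything from the first '@', then drop empty strings.
truncated = list(filter(None, [re.sub(r'@.*', '', word) for word in sample_words]))
print(truncated)  # ['Scholars', 'info']
```

For the five tweets in this problem both options give the same list, since every @ appears at the start of a word.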

In [7]:
def remove_users(words):
    '''
    Take a list of strings.
    Returns a list of strings, where we discard all strings that represent users.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of strings. None of the strings in the return list has user tags.
    '''
    # YOUR CODE HERE
    # Replace everything from the first @ in each word with an empty string ''
    words = [re.sub(r'@.*', '', word) for word in words]
    # Use filter to remove all empty strings
    words = list(filter(None, words))
    return words

In [8]:
no_users = remove_users(no_hashtags)
no_users_answer = [
    'New', 'opening', 'at', 'The', 'Ottawa', 'Hospital', 'in', '-',
    'Specialist', 'http://t.co/3SlUy11dro',
    
    'Looking', 'for', 'a', 'Pharmacist', 'Park', 'Plaza', 'Hospital',
    'http://t.co/4Qw8i6YaJI',
    
    'Info', 'Session', '10/7:', 'MSc', 'in', 'Biomedical', 'Informatics,', 'University', 'of', 'Chicago',
    'https://t.co/65G8dJmhdR',
    
    "Here's", 'THE', 'best', "I've", 'read', 'on',
    'and', 'http://t.co/meFE0dMSPe',
    
    'Scholars', 'talking', 'passionately', 'about', 'what', 'they', 'believe', 'in.',
    '&amp;', 'https://t.co/m8qiUSxk0h'
]

assert_equal(no_users, no_users_answer)

## 4. Remove all words that contain HTTP links.
We also want to remove all hyperlinks.
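
A minimal sketch of recognizing links, assuming every link starts with `http://` or `https://` (the word list is made up for illustration). In the pattern, `s?` makes the `s` optional, so one expression covers both schemes:

```python
import re

# Hypothetical sample words; 'httpd' is not a link and should survive.
sample_words = ['read', 'http://t.co/meFE0dMSPe', 'https://t.co/m8qiUSxk0h', 'httpd']

# 'http', an optional 's', then '://'.
link_pattern = re.compile(r'https?://')

# pattern.match() only succeeds at the start of the word.
without_links = [word for word in sample_words if not link_pattern.match(word)]
print(without_links)  # ['read', 'httpd']
```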

In [9]:
def remove_links(words):
    '''
    Take a list of strings.
    Returns a list of strings, where we discard all strings that are http links.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of strings. None of the strings in the return list is an http link.
    '''
    # YOUR CODE HERE
    # Replace everything from the first 'http:' or 'https:' in each word with an empty string ''
    words = [re.sub(r'https?:.*', '', word) for word in words]
    # Use filter to remove all empty strings
    words = list(filter(None, words))
    return words

In [10]:
no_links = remove_links(no_users)
no_links_answer = [
    'New', 'opening', 'at', 'The', 'Ottawa', 'Hospital', 'in', '-',
    'Specialist',
    
    'Looking', 'for', 'a', 'Pharmacist', 'Park', 'Plaza', 'Hospital',
    
    'Info', 'Session', '10/7:', 'MSc', 'in', 'Biomedical', 'Informatics,', 'University', 'of', 'Chicago',
    
    "Here's", 'THE', 'best', "I've", 'read', 'on',
    'and',
    
    'Scholars', 'talking', 'passionately', 'about', 'what', 'they', 'believe', 'in.',
    '&amp;',
]

assert_equal(no_links, no_links_answer)

## 5. Remove all non-alphabetical characters.
A tweet may contain foreign characters, punctuation marks, or numbers. In this case, however, we don't want to remove a word just because it contains a punctuation mark. For example, we want to keep "Informatics" and "in" in "Informatics," (a comma at the end) and "in." (a period) while getting rid of the punctuation marks.

So, simply go through each word and remove every character that is not an alphabetical character.
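
As an illustration, a negated character class such as `[^A-Za-z]` matches any single character that is not an ASCII letter. The sketch below applies it to a few sample words; `'café'` is a made-up example that is not in the data set:

```python
import re

# Sample words: trailing comma, digits with punctuation, an apostrophe, an accented letter.
sample_words = ['Informatics,', '10/7:', "Here's", 'café']

# Delete every character that is not in A-Z or a-z.
cleaned = [re.sub(r'[^A-Za-z]', '', word) for word in sample_words]
print(cleaned)  # ['Informatics', '', 'Heres', 'caf']
```

Note that `'10/7:'` becomes an empty string, which still has to be removed from the list, and that accented letters are dropped because the class covers ASCII letters only.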

In [11]:
def keep_letters(words):
    '''
    Take a list of strings.
    Returns a list of strings, where all strings have only alphabetical characters.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of strings. None of the strings in the return list has any non-alphabetical characters.
    '''
    # YOUR CODE HERE
    # Remove every non-alphabetical character from each word
    words = [re.sub(r'[^A-Za-z]', '', word) for word in words]
    # Use filter to remove all empty strings
    words = list(filter(None, words))
    return words

In [12]:
only_letters = keep_letters(no_links)
only_letters_answer = [
    'New', 'opening', 'at', 'The', 'Ottawa', 'Hospital', 'in',
    'Specialist',
    
    'Looking', 'for', 'a', 'Pharmacist', 'Park', 'Plaza', 'Hospital',
    
    'Info', 'Session', 'MSc', 'in', 'Biomedical', 'Informatics', 'University', 'of', 'Chicago',
    
    "Heres", 'THE', 'best', "Ive", 'read', 'on',
    'and',
    
    'Scholars', 'talking', 'passionately', 'about', 'what', 'they', 'believe', 'in',
    'amp'
]

assert_equal(only_letters, only_letters_answer)

## 6. Convert everything to lowercase.
Convert all strings to lowercase.

In [13]:
def to_lower(words):
    '''
    Take a list of strings.
    Returns a list of strings, where all strings are lowercase.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of strings. None of the strings in the return list has any capital letters.
    '''
    # YOUR CODE HERE
    # Convert all strings to lowercase with string method .lower()
    words = [word.lower() for word in words]
    return words

In [14]:
all_lower = to_lower(only_letters)
all_lower_answer = [
    'new', 'opening', 'at', 'the', 'ottawa', 'hospital', 'in',
    'specialist',
    
    'looking', 'for', 'a', 'pharmacist', 'park', 'plaza', 'hospital',
    
    'info', 'session', 'msc', 'in', 'biomedical', 'informatics', 'university', 'of', 'chicago',
    
    "heres", 'the', 'best', "ive", 'read', 'on',
    'and',
    
    'scholars', 'talking', 'passionately', 'about', 'what', 'they', 'believe', 'in',
    'amp'
]

assert_equal(all_lower, all_lower_answer)

Note that words that had #'s, @'s, numbers, links, etc. in them are all gone now, and we have a list of clean, lowercase words. If you are confused about how to do some of the operations, you can simply google, e.g., "python convert string to lowercase", or ask us questions.
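
To see how the six steps fit together, here is a rough sketch that chains them for a single tweet. The helper name `clean_tweet` is hypothetical and not part of the assignment, and the three removal steps are folded into one pattern for brevity:

```python
import re

def clean_tweet(tweet):
    '''Hypothetical helper: apply steps 1-6 to one tweet and return the cleaned words.'''
    # Step 1: split on runs of whitespace.
    words = re.split(r'\s+', tweet.strip())
    # Steps 2-4: blank out everything from the first '#', '@', or 'http(s):' in a word.
    words = [re.sub(r'[#@].*|https?:.*', '', word) for word in words]
    # Step 5: keep ASCII letters only.
    words = [re.sub(r'[^A-Za-z]', '', word) for word in words]
    # Step 6: lowercase, discarding words that became empty along the way.
    return [word.lower() for word in words if word]

tweet = ('Looking for a #Clinical #Informatics Pharmacist Park Plaza Hospital '
         '#jobs http://t.co/4Qw8i6YaJI')
print(clean_tweet(tweet))
# ['looking', 'for', 'a', 'pharmacist', 'park', 'plaza', 'hospital']
```

Keeping the steps as separate functions, as in this problem, makes each one easier to test; the combined version is only meant to show the overall flow.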