# <span style="color:blue">Course Plan 12/02/2019</span>
## <span style="color:blue">(Last Updated 12/02/2019)</span>

## Updated schedule through the rest of the semester


|  Wk   |  M    |  W     | Topic   | Notebooks | Due |
| :---: | :---: | :----: | :------ | :----- | :---: |
|  8  |  10/21  | 23  | **Numpy:** Data Abstraction, **Numpy:** Multi-dimensional arrays | Midterm, 03-01, 03-02 | 10/30 |
|  9  |  28  | 30  | **Numpy:** Reading into multi-dimensional arrays, **Pandas:** Dataframes and reading into them;  Merging and matching Dataframes| 03-03, 03-04, 03-05 | 10/30 |
|  10  |  11/4  | 6  | **Pandas:** Series and Views; Wrap Up Unit 3| 03-06, 03-07 | 11/10 |
|  11 |  &mdash; | 13   | Classification and Clustering, **Case Study:** Iris Data Set | 04-02, 04-03  | 11/17 |
|   |    |    | Notebooks under development&dagger;  | <del>04-04, 04-06, 04-07</del>  |  |
|  12 |  18  | 20  | _k_-means Clustering, **Case Study:** [World Happiness Report](https://worldhappiness.report/ed/2019/), Recommendations  | 04-04, 05-01, 04-05 | 11/24 | 
|  13 |  25   | &mdash;  | <span style="color:blue"> **Case Study:** Movie Recommendations</span> | 05-02 | 12/01 | 
|  14 |  12/2 | 4 |  **Case Studies:** World Happiness Map using [Geopandas](http://geopandas.org/) &Dagger;, Twitter Stream Analysis | 05-03, 05-04 | 12/08 |
|  16 |  | 12/13 | **(Take Home) Final Exam**  |  |  |

&dagger; We will not be covering these notebooks this semester. Feel free to peruse them if interested.

&Dagger; Installing Geopandas broke my Jupyter environment! We will not be covering Geopandas. Wednesday's class will be used for previewing the final exam. The final will pick from a dataset in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). If you wish to practice, pick one of those datasets. Many (most?) of them have a "Papers That Cite This Data Set" section, which could serve as a source of example questions. However, some of those analyses can get complex &mdash; if it looks too complex for the final, it probably is!


# Twitter Stream Analysis

<img align="right" style="padding-left:10px; height: 50%; width: 50%" src="figures/twitterbirds-980x589.png" >

This case study seeks to read incoming tweets and process them.

## Tweepy

Modern data sources typically deliver their data over an API. In this exercise, we will first acquire some tweets and then work with them in various ways.

We'll be working with [`tweepy`](https://github.com/tweepy/tweepy), a Python library for accessing the Twitter API. The best place to start is the [documentation](http://docs.tweepy.org/en/latest/). Take a look at the Getting Started page.

To begin, uncomment the next cell and run it to install `tweepy` using pip. You need to do this only once (but doing it more than once won't hurt anything).



In [None]:
# !pip install tweepy

The next step is to obtain Twitter credentials for accessing the API. Visit the [Twitter Developer Site for Apps](https://developer.twitter.com/en/apps) to create a Twitter "application" and give it a name. Once it is created, go to the keys and tokens tab and copy-paste the Consumer API keys as well as the Access token & access token secret into the cell below. 

**Careful!** `consumer_secret` and `access_token_secret` should be protected like passwords because they can be used by the API to _send out tweets or direct messages_ on your behalf. Immediately invalidate the token if you accidentally expose it &mdash; regenerating a new set of token values on the keys and tokens page will invalidate the old one. 

To protect Twitter credentials from being made public I suggest that you keep the lines of the next cell in a separate file on your laptop. Running the next cell will get the variable `auth` into the IPython environment and you may immediately delete the cell after that!

In [None]:
consumer_key = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'
consumer_secret = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'
access_token = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'
access_token_secret = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'

# Once auth has been initialized, it is best to delete this cell but keep a copy in a separate file, just in case!
import tweepy  # needed here because this cell runs before the main import cell

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
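
One way to follow the suggestion above is to keep the four values in a small JSON file outside the notebook and read them in. The sketch below is only an illustration: the helper `load_credentials`, the file name `twitter_creds.json`, and the JSON key names are assumptions (the key names simply mirror the variables used in this notebook).

```python
import json
from pathlib import Path

# Key names are an assumption; they mirror the variable names used in this notebook.
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")

def load_credentials(path):
    """Read Twitter credentials from a JSON file and check that all keys are present."""
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError("missing credential(s): " + ", ".join(missing))
    return creds

# Hypothetical usage (the file name is an assumption):
# _creds = load_credentials("twitter_creds.json")
# auth = tweepy.OAuthHandler(_creds["consumer_key"], _creds["consumer_secret"])
# auth.set_access_token(_creds["access_token"], _creds["access_token_secret"])
```

Keeping the file outside any folder that is shared or committed is what actually protects the secrets; the loader only saves retyping.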

### Errors and Exceptions

When dealing with live feeds, the program needs to be able to react to adverse conditions. We define a [user-defined exception](https://docs.python.org/3.8/tutorial/errors.html#user-defined-exceptions) for this purpose.

In [None]:
class StreamAnalysisError(Exception):
    """Base class for exceptions in this module."""
    pass

class TweepyError(StreamAnalysisError):
    """Exception raised in Tweepy.

    Attributes:
        expression -- input expression in which the error occurred
        message -- explanation of the error
    """

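    def __init__(self, expression, message = ''):
        # Pass a readable message to Exception so that str(error) is informative
        super().__init__(message or expression)
        self.expression = expression
        self.message = message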

In [None]:
import time, string, re
from datetime import datetime
import tweepy

def analyze_tweet(stripped_text):
    return []

def show_tweet(tweet):
    try:
        return ' '.join([tweet.created_at.strftime("%m/%d/%Y %H:%M:%S"), 
                         tweet.user.screen_name, 
                         tweet.text])
    except AttributeError:  # the tweet is missing one of the expected fields
        return ''

    
class TwitterStream:
    def __init__(self, auth):
        self.api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
        # We will use this table to process the tweet text.
        # Ref: https://stackoverflow.com/a/34294398/653651 
        #      Remove all punctuation marks from a text file
        #      Also remove ellipses but don't remove '@'
        self.table = str.maketrans(dict.fromkeys((string.punctuation + "…").replace('@','')))
    
    def test(self):
        public_tweets = self.api.home_timeline()
        for tweet in public_tweets[:1]:
            x = show_tweet(tweet)
            assert (x) # Unable to get tweets, reason unknown
        now = int(datetime.timestamp(datetime.now()))
        # when the rate limit will reset
        self.tweet_rate_reset = int(self.api.last_response.headers['x-rate-limit-reset'])
        self.tweets_remaining = int(self.api.last_response.headers['x-rate-limit-remaining'])
        self.tweet_rate_limit = int(self.api.last_response.headers['x-rate-limit-limit'])
        
    def rate_limit_delay(self):
        '''
        Status of the API vis-a-vis rate limitation
        Based on https://developer.twitter.com/en/docs/basics/rate-limiting
        '''
        now = int(datetime.timestamp(datetime.now()))

        self.tweet_rate_reset = int(self.api.last_response.headers['x-rate-limit-reset'])
        self.tweets_remaining = int(self.api.last_response.headers['x-rate-limit-remaining'])
        self.tweet_rate_limit = int(self.api.last_response.headers['x-rate-limit-limit'])

        # Resume one second after the rate-limit window resets
        trigger_after = 1 + self.tweet_rate_reset
        if self.tweets_remaining <= 1:
            if now < trigger_after:
                print ('pause till', datetime.fromtimestamp(trigger_after).strftime("%H:%M:%S"))
                time.sleep (trigger_after - now)
        else:
            pass
            
        print ('ready')
        return
    
    def on_rate_limit(self, cursor):
        while True:
            try:
                yield cursor.next()
            except tweepy.RateLimitError:
                print ('tweepy.RateLimitError')
                self.rate_limit_delay()
            except StopIteration:
                raise TweepyError('Ending tweets')

    def get_more_tweets(self, user_id, count = 10, show = False):
        # Reference: http://docs.tweepy.org/en/v3.8.0/code_snippet.html#handling-the-rate-limit-using-cursors
        tweets_cursor = tweepy.Cursor(self.api.user_timeline, id=user_id) 
        try:
            for tweet in self.on_rate_limit(tweets_cursor.items(count)):
                tweet_text = tweet.text
                text_no_url = re.sub(r'\shttps?:\/\/.*[\r\n]*', '', tweet_text, flags=re.MULTILINE)
                stripped_text = text_no_url.translate(self.table)
                if show:
                    print (tweet_text)
                categories = analyze_tweet(stripped_text)
                if show and categories:
                    print ('-------------tweet-------------\n' + tweet_text)
                    print (categories)            

        except TweepyError as tweepy_error:
            now = datetime.now()
            print (tweepy_error.expression, 'at', now.strftime("%m/%d/%Y %H:%M:%S"))
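
The waiting rule used in `rate_limit_delay` can be tried out without contacting Twitter. The sketch below is a stand-alone illustration, not part of the class: the function name `seconds_until_reset` and the sample header values are assumptions, while the header names are the ones read in the code above.

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next request (0 if no wait is needed)."""
    now = int(time.time()) if now is None else int(now)
    remaining = int(headers["x-rate-limit-remaining"])
    reset_at = int(headers["x-rate-limit-reset"])  # epoch seconds
    if remaining > 1:
        return 0
    # Wait until one second past the reset time, never a negative amount.
    return max(0, reset_at + 1 - now)

# Hypothetical header values, for illustration only.
sample_headers = {"x-rate-limit-remaining": "0", "x-rate-limit-reset": "1000"}
print(seconds_until_reset(sample_headers, now=940))   # 61
```

Passing `now` explicitly keeps the function easy to check by hand; in the notebook the current time would be used instead.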

In [None]:
# Uncomment as appropriate

# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_token_secret)
# auth = tweepy.OAuthHandler(_creds["consumer_key"], _creds["consumer_secret"])
# auth.set_access_token(_creds["access_token"], _creds["access_token_secret"])
twitter_stream = TwitterStream(auth)
twitter_stream.test()

### Obtaining Tweets

In the next cell, if everything is working, we should see some tweets!

In [204]:
twitter_stream.get_more_tweets('nytimes', count = 10, show = True)

New Yorkers preparing to lumber back to work after an extended Thanksgiving weekend are waking up to a city soaked… https://t.co/UvVjkVT080
Breaking News: Gov. Steve Bullock of Montana, a Democrat who won 2 terms as governor in a red state, dropped out of… https://t.co/ppQXt3EqPN
This is no ordinary airport. It’s Singapore’s Changi: part theme park, part futuristic pleasure dome. And while an… https://t.co/A3fYfeHH9V
Many undocumented immigrants in the United States did not cross the southern border. They arrived by plane with vis… https://t.co/cuE8vYcaol
The Islamic State's resilience in Afghanistan, even in light of its recent defeats, raises the grim prospect of an… https://t.co/7tUTmb0Gwe
In Opinion

Natasha Kassam writes, "It seems clear by now that even Beijing-friendly candidates cannot deliver Taiw… https://t.co/BcQzDEgVL7
The craft of building a story on publicly available data was part of journalism in the analog era, but it has come… https://t.co/eSN4mjiSUR
Typhoon Kammuri i

## Moral Foundations Theory

<img align="right" style="padding-left:10px; height: 55%; width: 55%" src="figures/political-camps-moral-foundations.png" >

The second part of this case study is Moral Foundations Theory. Google the phrase or visit [MoralFoundations.org](https://moralfoundations.org/) to learn more about the theory (or watch [Jonathan Haidt's TED Talk](https://www.ted.com/talks/jonathan_haidt_the_moral_roots_of_liberals_and_conservatives/)).

For the purpose of this exercise, you needn't learn much about the theory &mdash; we will be using the Moral Foundations team's list of words that connote different dimensions of morality. 

* **Care** connotes safety, peace, compassion, etc.
* **Harm**, Care's opposite, connotes war, fight, hurt, kill, suffer, etc.

In the coding, the above two moral opposites are called 'HarmVirtue' and 'HarmVice' respectively. Similarly for other moral dimensions. The categories and the words that belong to each are available on-line in a [Moral Foundations Dictionary](https://moralfoundations.org/wp-content/uploads/files/downloads/moral%20foundations%20dictionary.dic). The dictionary words often include &ast;s, for example `peace`&ast;, which could be peace, peaceful, peacefully, even peacenik!

To retrieve the Moral Foundations Dictionary we shall use the `requests` library, documentation available [here](https://requests.readthedocs.io/en/master/).

In [None]:
import requests
url = "https://moralfoundations.org/wp-content/uploads/files/downloads/moral%20foundations%20dictionary.dic"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
content = r.text 
lines = content.split('\n')
groups = {}
codes = {}
reading_groups = False

for line in lines:
    line_ = ' '.join(line.split())
    if not line_: continue
    if line_.startswith('%'):
        reading_groups = not reading_groups
    else:
        line__ = line.split()
        if reading_groups:
            groups[line__[0]] = line__[1]
        else:
            codes[line__[0]] = ','.join(line__[1:])
# print (groups)

# Prints:
# {'01': 'HarmVirtue', '02': 'HarmVice', '03': 'FairnessVirtue', '04': 'FairnessVice', 
#  '05': 'IngroupVirtue', '06': 'IngroupVice', '07': 'AuthorityVirtue', '08': 'AuthorityVice', 
#  '09': 'PurityVirtue', '10': 'PurityVice', '11': 'MoralityGeneral'}

# print (codes)

# Partial print of codes:
# {'safe*': '01', 'peace*': '01', 'compassion*': '01', 'empath*': '01', 'sympath*': '01', 'care': '01', 
#  …
#  'deserter*': '06,08', 
#  …
#  'transgress*': '11'}
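
To make the file layout easier to see, here is the same parsing logic applied to a tiny inline sample instead of the downloaded file. The sample text is a made-up excerpt whose layout is inferred from the parsing loop above (group definitions between two `%` lines, then one word per line followed by its codes); the function name `parse_dic` and the `_demo` variable names are assumptions.

```python
# A made-up excerpt in the layout the parsing loop expects.
sample = """%
01 HarmVirtue
06 IngroupVice
08 AuthorityVice
%
safe* 01
deserter* 06 08
"""

def parse_dic(text):
    """Split a .dic-style text into (groups, codes) dictionaries."""
    groups, codes = {}, {}
    reading_groups = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                              # skip blank lines
        if fields[0].startswith('%'):
            reading_groups = not reading_groups   # '%' opens and closes the group block
        elif reading_groups:
            groups[fields[0]] = fields[1]         # code -> category name
        else:
            codes[fields[0]] = ','.join(fields[1:])  # word -> comma-separated codes
    return groups, codes

groups_demo, codes_demo = parse_dic(sample)
print(groups_demo)  # {'01': 'HarmVirtue', '06': 'IngroupVice', '08': 'AuthorityVice'}
print(codes_demo)   # {'safe*': '01', 'deserter*': '06,08'}
```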

### Code Design

The obvious design is to create a dictionary that maps each word &rArr; category. Then we could take each word in an incoming tweet and categorize it. _But the &ast;s throw a wrench into this idea!_ How would it categorize "He came peacefully" when a dictionary whose key is peace&ast; wouldn't match the word "peacefully"?

And if a dictionary won't solve the problem, won't looking up each tweet word without the benefit of a dictionary be too slow?

Our design will be based on a combination of strategies: using a dictionary with 3-letter keys, and having each key map to a (relatively short) list that can be scanned quickly. For example, one of the longest lists will be "dis": [discriminat&ast;, disproportion&ast;, dishonest, dissociate, disloyal&ast;, dissent&ast;, disrespect&ast;, disobe&ast;, dissident, disgust&ast;, disease&ast;]. A convenient Python class for this is `defaultdict`, shown [here by example](https://docs.python.org/3.8/library/collections.html#defaultdict-examples).
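
The idea can be sketched on a handful of entries. This is an illustration under stated assumptions rather than the notebook's solution: the names `entries`, `buckets`, and `lookup` are hypothetical, the entries are copied from examples that appear in this notebook, and the bucket layout (a list of tuples) differs from the one used in the cell below.

```python
from collections import defaultdict

# Entries copied from examples in this notebook: pattern -> comma-separated codes.
entries = {"peace*": "01", "care": "01", "deserter*": "06,08", "disease*": "10"}

# Bucket the patterns by their first three letters.
buckets = defaultdict(list)
for pattern, code_string in entries.items():
    buckets[pattern[:3]].append((pattern, code_string.split(",")))

def lookup(word):
    """Return the sorted category codes of every pattern that matches `word`."""
    found = set()
    # .get avoids creating an empty bucket for every unknown word
    for pattern, pattern_codes in buckets.get(word[:3], []):
        if pattern.endswith("*"):
            matched = word.startswith(pattern[:-1])   # 'peace*' matches 'peacefully'
        else:
            matched = (word == pattern)               # 'care' matches only 'care'
        if matched:
            found.update(pattern_codes)
    return sorted(found)

print(lookup("peacefully"))  # ['01']
print(lookup("careful"))     # []
print(lookup("deserters"))   # ['06', '08']
```

Only the short list stored under the word's first three letters is scanned, so the cost per word stays small. A pattern whose stem is shorter than three letters would land in a bucket that no word reaches; the sketch ignores that case.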

In [None]:
from collections import defaultdict
groups = {}
codes = defaultdict(list)
reading_groups = False

for line in lines:
    line_ = ' '.join(line.split())
    if not line_: continue
    if line_.startswith('%'):
        # fill in
    else:
        line__ = line.split()
        # fill in
        else:
            '''
            We have to iterate through the words in a tweet 
            To allow for fast lookups, we make a dictionary with 3-letter keys
            Ref: https://docs.python.org/3.3/library/collections.html#defaultdict-examples
            '''
            codes[line__[0][0:3]].append({line__[0]:(','.join(line__[1:]))})
# print (groups)
# print (codes)

## Write a function for finding categories of a word

We want a function `find_word_categories(word)` such that the **assert** statements at the end of the next cell all pass.

In [None]:
def find_word_categories(word):
    # fill in
    return []

assert (find_word_categories('hurt') == ['02'])
assert (find_word_categories('tradition') == ['07'])
assert (find_word_categories('disease') == ['10'])
assert (find_word_categories('@nytnational') == [])
assert (find_word_categories('preserve') == ['01', '07', '09'])

### A new `analyze_tweet` function

Now the `analyze_tweet` function from before can be replaced by the real thing! Changing the definition of a function during the session is a powerful technique, but can result in confusion &mdash; which version of the program is being called at any moment? 

**The last function definition that was executed by IPython wins!**
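
A small stand-alone demonstration of that rule, using hypothetical stand-in names (`analyze` and `process` are not part of the notebook's code): a caller looks up the function name each time it runs, so it picks up whichever definition was executed most recently.

```python
def analyze(text):           # first version: a stub
    return []

def process(text):           # looks up `analyze` by name each time it is called
    return analyze(text)

before = process("peace")    # uses the stub

def analyze(text):           # redefinition replaces the stub
    return [text.upper()]

after = process("peace")     # same caller, new behaviour
print(before, after)         # [] ['PEACE']
```

This is why `get_more_tweets` does not need to be redefined after `analyze_tweet` is replaced.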

In [None]:
def analyze_tweet(tweet_text):
    #print (tweet_text)
    text_no_url = re.sub(r'\shttps?:\/\/.*[\r\n]*', '', tweet_text, flags=re.MULTILINE)
    stripped_text = text_no_url.translate(twitter_stream.table)
    tw_words = stripped_text.lower().split(' ')
    categories = []
    for tw_word in tw_words:
        cats = find_word_categories(tw_word)
        for cat in cats:
            categories.append({groups[cat]: tw_word})
    return categories

cats = analyze_tweet("A tradition can be many things — for some, it's food. For others, it's faith. For many, family. ")
assert ({'AuthorityVirtue': 'tradition'} in cats)
assert ({'IngroupVirtue': 'family'} in cats)
assert (2 == len(cats))
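
To see what the cleaning steps inside `analyze_tweet` do to a tweet before any categorization happens, they can be run on their own. The sketch below rebuilds the same translation table and regular expression outside the class; the function name `clean` and the sample tweet are made up for illustration, and it uses `split()` rather than `split(' ')` so that empty strings are dropped.

```python
import re
import string

# Same translation table as in TwitterStream: drop punctuation and '…' but keep '@'.
table = str.maketrans(dict.fromkeys((string.punctuation + "…").replace("@", "")))

def clean(tweet_text):
    """Remove a URL and what follows it on the line, strip punctuation, lowercase, split."""
    text_no_url = re.sub(r'\shttps?:\/\/.*[\r\n]*', '', tweet_text, flags=re.MULTILINE)
    return text_no_url.translate(table).lower().split()

# Made-up tweet text, for illustration only.
print(clean("Peace talks resume, says @nytimes… https://t.co/xxxxxxxxxx"))
# ['peace', 'talks', 'resume', 'says', '@nytimes']
```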

## And finally, categorized tweets!

In the following, only the tweets that mentioned a moral foundation will be printed.

In [196]:
twitter_stream.get_more_tweets('nytimes', count = 500, show = False)

-------------tweet-------------
In Opinion

Emma Goldberg writes, "The injustice of somebody murdered while organizing for criminal justice feels i… https://t.co/6EmPcvsrXA
[{'FairnessVirtue': 'justice'}]
-------------tweet-------------
At least 14 people were shot dead in an attack on a church in eastern Burkina Faso on Sunday morning, the governmen… https://t.co/eKBoPEZZUP
[{'HarmVice': 'attack'}, {'PurityVirtue': 'church'}]
-------------tweet-------------
RT @farnazfassihi: Our exclusive story on the mass killing in Mahshar. Revolutionary Guards forces surrounded, shot and killed 40 to 100 de…
[{'HarmVice': 'killing'}, {'HarmVice': 'killed'}]
-------------tweet-------------
Prince Charles has long pushed for a more streamlined British royal family, with fewer members carrying out officia… https://t.co/5oH3SJ21Es
[{'IngroupVirtue': 'family'}]
-------------tweet-------------
More than 2 years after a car bomb killed Malta’s best-known investigative journalist, prosecutors on Saturday 

## Last Question:

In <em>your opinion</em>, how well does the categorization track your sense of the moral values expressed in the tweets? Do you see any blatant biases?

_There is no right or wrong answer to this question. Your response will be judged by how well-reasoned it is._

## Follow Ups?

It would be cool to plot a graph of the (Virtue - Vice) values of the five foundations expressed in these tweets. 
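
As a possible starting point, the tally behind such a graph could look like the sketch below. It assumes the categories come in the list-of-dictionaries form returned by `analyze_tweet`; the function name `net_scores` is hypothetical, and the sample list is copied from the categorized output shown earlier.

```python
from collections import Counter

def net_scores(categories):
    """Tally (Virtue - Vice) per foundation from a list of {group: word} dicts."""
    scores = Counter()
    for item in categories:
        for group in item:
            if group.endswith("Virtue"):
                scores[group[:-len("Virtue")]] += 1
            elif group.endswith("Vice"):
                scores[group[:-len("Vice")]] -= 1
            # 'MoralityGeneral' has no Virtue/Vice side and is skipped.
    return dict(scores)

# Categories copied from the sample output above.
sample = [{'FairnessVirtue': 'justice'}, {'HarmVice': 'attack'}, {'PurityVirtue': 'church'},
          {'HarmVice': 'killing'}, {'HarmVice': 'killed'}, {'IngroupVirtue': 'family'}]
print(net_scores(sample))
# {'Fairness': 1, 'Harm': -3, 'Purity': 1, 'Ingroup': 1}
```

The resulting dictionary could then be handed to a plotting library to draw a bar chart, one bar per foundation.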

The New York Times is reputed to be a center-left news source. How do the foundations expressed by this source compare with the foundations expressed by other sources? 

In case you were wondering, there are [some serious objections](https://behavioralscientist.org/whats-wrong-with-moral-foundations-theory-and-how-to-get-moral-psychology-right/) to Moral Foundations Theory, which you may or may not find persuasive.