 ## Twitter Surveillance ##


In the past 10 to 15 years, research connecting disease surveillance with internet data has grown rapidly.  Early work focussed exclusively on communicable disease surveillance (e.g. influenza, dengue), but the field has since extended into chronic diseases and health behavior (e.g. alcohol use, mental health).  Twitter in particular is widely used because of the relative ease with which the data can be accessed and the volume of data available.  However, a key issue in using Twitter for health surveillance is that tweets are text based, and hence require natural language processing (NLP) techniques to extract relevant content.

Examples of work that uses Twitter data for public health surveillance applications include Signorini et al. (2011) and Myslin et al. (2013).  Signorini et al. used simple NLP methods to track rapidly evolving public sentiment regarding H1N1 (the swine flu pandemic of 2009), and also the prevalence of flu in the community.  Myslin et al. used Twitter data to investigate Twitter users' levels of informedness and attitudes towards both hookah and e-cigarettes, finding that Twitter users were much more favourably disposed towards hookah use than towards traditional combustible tobacco consumption.

In this lab session we will use various methods to investigate a corpus of 30,000 randomly selected tweets, starting with characterising the dataset, then exploring language distribution on Twitter, before homing in on specific health-related questions concerning tobacco use.  

Some of the material covered will require you to consult the appropriate online documentation.



### References

- Myslin M, Zhu S, Chapman W, Conway M.  Using Twitter to examine smoking behavior and perceptions of emerging tobacco products.  *J Med Internet Res*.  2013 15(8):e174

- Signorini A, Segre A, Polgreen P.  The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic.  *PLOS One*.  2011 6(5):e19467



### Loading Data ###

Our first task is to load the data from the file "00000001___Sat_May_18_11:48:09_PDT_2013.twitter".  Each line in the file consists of a tweet and its associated metadata in JSON format.  JSON (JavaScript Object Notation) is a widely used, human-readable, open standard format commonly used to transfer and store data.  While the standard was derived from JavaScript, it is a language-independent format, frequently used in place of XML.  Compared with CSV data, JSON has a number of interesting characteristics, including the ability to represent hierarchical data structures.  This is an example of a tweet JSON object:

```JSON
{"created_at":"Sat May 18 18:48:09 +0000 2013","id":335829176160489472,"id_str":"335829176160489472","text":"What's up with this whole coke thing?","source":"\u003ca href=\"http:\/\/twitter.com\/#!\/download\/ipad\" rel=\"nofollow\"\u003eTwitter for iPad\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1256946714,"id_str":"1256946714","name":"maddie","screen_name":"maddiesnellxo","location":"brentwood yea","url":null,"description":"brooklyn beckham and justin bieber x","protected":false,"followers_count":139,"friends_count":153,"listed_count":0,"created_at":"Sun Mar 10 12:57:22 +0000 2013","favourites_count":59,"utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":743,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"010405","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/866567612\/aa0dd7d8ecdab90c8c2199de3a8b71cf.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/866567612\/aa0dd7d8ecdab90c8c2199de3a8b71cf.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3638837356\/55538b8753a2b6f4765315d56c723a88_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3638837356\/55538b8753a2b6f4765315d56c723a88_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1256946714\/1368559459","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":
{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
```


As you can see, this is very hard to read.  JSON objects can be pretty printed from the Unix command line with `python -m json.tool`.  The pretty printed JSON object looks like this:

```JSON
{
    "contributors": null,
    "coordinates": null,
    "created_at": "Sat May 18 18:48:09 +0000 2013",
    "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [],
        "user_mentions": []
    },
    "favorite_count": 0,
    "favorited": false,
    "filter_level": "medium",
    "geo": null,
    "id": 335829176160489472,
    "id_str": "335829176160489472",
    "in_reply_to_screen_name": null,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "lang": "en",
    "place": null,
    "retweet_count": 0,
    "retweeted": false,
    "source": "<a href=\"http://twitter.com/#!/download/ipad\" rel=\"nofollow\">Twitter for iPad</a>",
    "text": "What's up with this whole coke thing?",
    "truncated": false,
    "user": {
        "contributors_enabled": false,
        "created_at": "Sun Mar 10 12:57:22 +0000 2013",
        "default_profile": false,
        "default_profile_image": false,
        "description": "brooklyn beckham and justin bieber x",
        "favourites_count": 59,
        "follow_request_sent": null,
        "followers_count": 139,
        "following": null,
        "friends_count": 153,
        "geo_enabled": false,
        "id": 1256946714,
        "id_str": "1256946714",
        "is_translator": false,
        "lang": "en",
        "listed_count": 0,
        "location": "brentwood yea",
        "name": "maddie",
        "notifications": null,
        "profile_background_color": "010405",
        "profile_background_image_url": "http://a0.twimg.com/profile_background_images/866567612/aa0dd7d8ecdab90c8c2199de3a8b71cf.jpeg",
        "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/866567612/aa0dd7d8ecdab90c8c2199de3a8b71cf.jpeg",
        "profile_background_tile": true,
        "profile_banner_url": "https://pbs.twimg.com/profile_banners/1256946714/1368559459",
        "profile_image_url": "http://a0.twimg.com/profile_images/3638837356/55538b8753a2b6f4765315d56c723a88_normal.jpeg",
        "profile_image_url_https": "https://si0.twimg.com/profile_images/3638837356/55538b8753a2b6f4765315d56c723a88_normal.jpeg",
        "profile_link_color": "0084B4",
        "profile_sidebar_border_color": "FFFFFF",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "protected": false,
        "screen_name": "maddiesnellxo",
        "statuses_count": 743,
        "time_zone": null,
        "url": null,
        "utc_offset": null,
        "verified": false
    }
}
```
You can see that there is a lot of information here regarding the tweet (text, date, time, location, retweet status, use of hashtags, etc.) and also the tweeter (i.e. the user).  You can see when the user joined, their number of followers, number of friends, preferred language, and listed location.  Every time a person tweets, a publicly accessible JSON object like this is created.

The first thing you need to do is load the Twitter data into memory.  Please execute the code below to load the data.

In [None]:
from pypop.utils import *
from pypop.view import *

In [None]:

data = get_twitter_data()

Let's create a Python dictionary from the first tweet in the file.  The Python JSON module (`json`) essentially converts a string that is syntactically correct JSON into a Python dictionary that can be easily manipulated.

**Q1**:  Extract the language of this first tweet (e.g. Spanish, Chinese, French, English).  You can do this using the "lang" field.  Currently, the first line of text is a string.  We need to transform this into a Python dictionary for manipulation.  This process involves using the `loads` function from the Python `json` module, reading the "lang" field from the result, and assigning that value to a variable (`language`).
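As a minimal sketch of this parsing step, using only the standard `json` module: the `raw` string below is a short hand-made stand-in for one line of the file, not real lab data, and the variable names other than `language` are illustrative.  `json.dumps` with `indent` gives the same kind of pretty printed view as `python -m json.tool`.

```python
import json

# Hand-made stand-in for one line of the tweet file (illustrative only).
raw = '{"text": "hello world", "lang": "en", "user": {"name": "maddie", "lang": "en"}}'

tweet = json.loads(raw)                # JSON string -> Python dictionary
language = tweet["lang"]               # top-level field of the tweet
user_language = tweet["user"]["lang"]  # nested field: chain the keys

print(language, user_language)

# Pretty print the dictionary from inside Python.
pretty = json.dumps(tweet, indent=4, sort_keys=True)
print(pretty)
```

Note that the tweet-level "lang" and the user-level "lang" are separate fields, so it is worth being explicit about which one an analysis uses.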

## What do the tweets look like?

In [None]:
ipw.interact(view_tweet_n, data=ipw.fixed(data), n=ipw.IntSlider(min=0, max=len(data)-1))

### There are many different kinds of tweet data
#### We will split the data into records that have a `user` field and records that do not

In [None]:
nouser = [t for t in data if "user" not in t]
data = [t for t in data if "user" in t]
print(len(nouser), len(data))

### What does the nouser data look like?



In [None]:
ipw.interact(view_tweet_n, data=ipw.fixed(nouser), n=ipw.IntSlider(min=0, max=len(nouser)-1))

## What are all the possible key values in our tweets?

In [None]:
set.union(*([set(t.keys()) for t in data]))

## What key values are in all of our tweets?

In [None]:
set.intersection(*([set(t.keys()) for t in data]))
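To make the difference between the two set operations concrete, here is a small sketch on three made-up records (the `toy` list is illustrative and far simpler than real tweets).  The union collects every key seen anywhere; the intersection keeps only the keys that every record shares.

```python
# Three toy "tweets" with partly overlapping keys (illustrative only).
toy = [
    {"id": 1, "text": "a", "user": {}},
    {"id": 2, "text": "b", "user": {}, "geo": None},
    {"id": 3, "text": "c", "user": {}, "place": None},
]

key_sets = [set(t.keys()) for t in toy]

all_keys = set.union(*key_sets)            # keys that appear in at least one record
common_keys = set.intersection(*key_sets)  # keys that appear in every record

print(sorted(all_keys))
print(sorted(common_keys))
print(sorted(all_keys - common_keys))      # optional keys: present in some records only
```

Both calls need at least one set to work with, so they raise a `TypeError` if the list of tweets is empty.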

## I'm going to consider the text of the tweet as the data and everything else as meta-data

### Q2: What meta-data might we be interested in?

In [None]:
names = [t["user"]["name"] for t in data]
languages = [t["user"]["lang"] for t in data]
geo = [t["geo"] for t in data if t["geo"]]  # only tweets that carry geo information
uids = [t["user"]["id"] for t in data]      # assigned on its own so that geo is not overwritten

### Others?

### How many unique values are there for these?

In [None]:
len(set(uids)), len(set(names))

### What are the languages of the tweets?

In [None]:
count_to_df(languages).plot.bar()

In [None]:
from collections import Counter  # explicit import in case the star imports above do not provide it

c = Counter(languages)
counter_to_df({get_lang(k):v for k,v in c.items()}).plot.bar()
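The plotting helpers above come from the course's `pypop` package, whose internals are not shown here.  As a rough sketch of the counting step alone, the standard library's `collections.Counter` can tally language codes directly; the `toy_languages` list is made up and stands in for the `languages` list.

```python
from collections import Counter

# Made-up language codes standing in for the `languages` list.
toy_languages = ["en", "es", "en", "ja", "en", "es"]

counts = Counter(toy_languages)

# most_common() returns (value, count) pairs sorted from most to least frequent.
for code, n in counts.most_common():
    share = n / len(toy_languages)
    print(f"{code}: {n} ({share:.0%})")
```

Printing the share alongside the raw count makes it easier to compare language distributions across samples of different sizes.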

## Analyzing the text

The next section consists of content focussed on the application of various methods for manipulating tweets.  Topics covered will be:

* Sentiment analysis
* Attitudes towards tobacco products  
* Some simple operations with Pandas


Now, we will iterate over the lines of the file.  For each tweet, we will convert the JSON string to a Python dictionary (`json.loads(tweet)`) and then append that dictionary to a list called `list`.  Note that the `list` variable has already been used for storing language categories, and therefore must be redefined (`list = []`).

### Grab the text

In [None]:
text_list = [t["text"] for t in data if t["text"] and t["user"]["lang"]=="en"]
print(len(text_list))

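Note that the number differs from the total number of tweets.  The comprehension above keeps only tweets with a non-empty "text" field whose user language is "en".  When the raw file is parsed line by line, a proportion of the lines are also lost because they either contain no "text" field (KeyError) or are not encoded in a standard way (ValueError).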
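The two failure cases mentioned here (KeyError and ValueError) can be handled explicitly when parsing raw lines.  The following is a sketch under stated assumptions, not the loader used by `get_twitter_data()`: the function name `parse_tweet_lines` and the sample lines are hypothetical.

```python
import json

def parse_tweet_lines(lines):
    """Return (tweets, n_skipped) for an iterable of JSON strings."""
    tweets, skipped = [], 0
    for line in lines:
        try:
            tweet = json.loads(line)  # raises ValueError on malformed JSON
            tweet["text"]             # raises KeyError if there is no "text" field
        except (ValueError, KeyError):
            skipped += 1
            continue
        tweets.append(tweet)
    return tweets, skipped

# Hand-made lines: one usable tweet, one record without text, one broken line.
sample = [
    '{"text": "hello", "user": {"lang": "en"}}',
    '{"delete": {"status": {"id": 1}}}',
    '{"text": "trunca',
]
tweets, skipped = parse_tweet_lines(sample)
print(len(tweets), skipped)
```

Counting the skipped lines, rather than silently dropping them, makes it possible to report how much of the corpus was actually analysed.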

Now we will focus on performing sentiment analysis for the Twitter data set as a whole.  We will use the TextBlob Python module for this task (https://textblob.readthedocs.io/).  Please see the examples below.  The polarity (sentiment) score is a float within the range [-1.0, 1.0].  Sentiment analysis has been used extensively for, among other things, understanding public attitudes towards vaccination.

[Note that the Python language is named after Monty Python, and it has become a tradition to use Monty Python quotations as examples in Python tutorial material.  The quotations below are from "Monty Python and the Holy Grail"] 

In [None]:
### EXAMPLE TEXTBLOB SENTIMENT ANALYSIS CODE ###

### PLEASE RUN THIS CODE ###

from textblob import TextBlob

example1 = TextBlob(""" The Lady of the Lake, her arm clad in the purest shimmering
                    samite, held aloft Excalibur from the bosom of the water,
                    signifying by divine providence that I, Arthur,
                    was to carry Excalibur. THAT is why I am your king.""")

example2 = TextBlob("""Listen. Strange women lying in ponds distributing swords
                     is no basis for a system of government. Supreme executive power 
                     derives from a mandate from the masses, not from some farcical
                     aquatic ceremony.""")

print(example1.sentiment.polarity)
print(example2.sentiment.polarity)


**Q9:** Now that you've seen how the TextBlob sentiment analysis module works, I'd like you to perform sentiment analysis on all tweet texts from the variable "text_list" and identify (a) the number of positive tweets (i.e. polarity > 0.0), (b) the number of negative tweets (i.e. polarity < 0.0), and (c) the number of neutral tweets (i.e. polarity == 0.0).

# Q9 Answer

In [None]:
sentiment_values = []
for text in text_list:
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    sentiment_values.append(sentiment)
    
positive_sentiment = [i for i in sentiment_values if i > 0]
negative_sentiment = [i for i in sentiment_values if i < 0]
neutral_sentiment  = [i for i in sentiment_values if i == 0]

print("positive tweets:" + str(len(positive_sentiment)))
print("negative tweets:" + str(len(negative_sentiment)))
print("neutral tweets:"  + str(len(neutral_sentiment)))
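The tallying step can also be written with a small labelling helper and `collections.Counter`, which avoids building three separate lists.  This is only an alternative sketch: `label_polarity` is a hypothetical helper, and the `scores` list holds made-up numbers standing in for TextBlob polarity values.

```python
from collections import Counter

def label_polarity(polarity):
    """Map a polarity score in [-1.0, 1.0] to a coarse label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Made-up polarity scores standing in for TextBlob output.
scores = [0.5, -0.2, 0.0, 0.8, 0.0, -0.6]

tally = Counter(label_polarity(s) for s in scores)
for label in ("positive", "negative", "neutral"):
    print(f"{label} tweets: {tally[label]}")
```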

**Q10**:  Write some code that prints out the first 10 tweets that have positive sentiment and the first 10 tweets that have negative sentiment.

# Q10 Answer

In [None]:
from pprint import pprint

positive_sentiment_only = []
negative_sentiment_only = []

for text in text_list:
    sentiment = TextBlob(text).sentiment.polarity
    if sentiment > 0:
        positive_sentiment_only.append(text)
    elif sentiment < 0:
        negative_sentiment_only.append(text)

# pprint wraps long lists so the tweets are easier to read
pprint(positive_sentiment_only[:10])
pprint(negative_sentiment_only[:10])

**Q11**:  Identify all those tweets that contain the tobacco-related keywords "cig", "tobacco", "nicotine", and then use TextBlob to calculate the sentiment for each of the tweets.  How many are positive, negative, or neutral? 

In [None]:
###########################
#### ADD Q11 CODE HERE ####
###########################

# Q11 Answer

In [None]:
# positive, negative, or neutral? 

tobacco_keywords = ("cig", "tobacco", "nicotine")

# Keep each tweet at most once, even if it contains several keywords.
# Lower-casing the text makes the match case-insensitive.
tobacco_related_tweets_list = [
    text for text in text_list
    if any(keyword in text.lower() for keyword in tobacco_keywords)
]


sentiment_values = []
for text in tobacco_related_tweets_list:
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    sentiment_values.append(sentiment)
    
positive_sentiment = [i for i in sentiment_values if i > 0]
negative_sentiment = [i for i in sentiment_values if i < 0]
neutral_sentiment  = [i for i in sentiment_values if i == 0]

print("positive tweets:" + str(len(positive_sentiment)))
print("negative tweets:" + str(len(negative_sentiment)))
print("neutral tweets:"  + str(len(neutral_sentiment)))
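Plain substring tests are a blunt instrument for keyword matching.  One possible refinement, sketched here with the standard `re` module, is to match the keywords only at the start of a word and to ignore case.  The pattern, the helper name `is_tobacco_related`, and the sample texts are illustrative assumptions rather than part of the original lab.

```python
import re

# \b is a word boundary, so the keywords must begin a word:
# "cigs" and "Cigarette" match, and letter case is ignored.
TOBACCO_PATTERN = re.compile(r"\b(cig|tobacco|nicotine)", re.IGNORECASE)

def is_tobacco_related(text):
    """Return True if the text contains any tobacco keyword."""
    return TOBACCO_PATTERN.search(text) is not None

samples = [
    "Out of cigs again",
    "Nicotine and TOBACCO in one tweet",
    "What's up with this whole coke thing?",
]
matches = [s for s in samples if is_tobacco_related(s)]
print(matches)
```

Because each text is tested once, a tweet that mentions several keywords is still counted a single time.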