#Using Social Media to Study the Link between Health and Happiness

###Based on Twitter Data


##A Data Science CS109 Class Project

##Team: Hackers for Human Health
Contributors
-----
In alphabetical order  

Alejandro Covarrubias | Jacob Lurye | Eliud Oloo | Qiu-Yue Zhong


#  
<div style="float: right; margin-left: 30px;"><img title="created by Stef Gibson at StefGibson.com"style="float: right;margin-left: 30px;" src="http://www.massage1.com/wp-content/uploads/healthhappiness.jpg" align=right height = 350 /><figcaption>Created by Stef Gibson</figcaption></div>


#  



#01_Motivation and Overview

##Motivation
------
We are a team with diverse backgrounds in statistics, computer science, public health and biomedical research. The motivation for this project is our common interest in applying social media and data science approaches to studying and promoting human health. The scientific goal of this work was to measure average happiness state by state using sentiment analysis of Twitter data, and then to determine how well the happiness of a state correlates with public health statistics such as morbidity, mortality, healthcare quality and other criteria.



##Overview
------

It is often stated that health and happiness are closely linked. Quantifiable evidence to that effect is, however, hard to come by. One of many reasons for this is that happiness is an ambiguous concept: easy to recognize in oneself but often harder to detect, much less measure, in others. Yet if there was ever an opportune time to make a reasonable attempt at quantifying happiness in the population, it is now. The rapid worldwide adoption of social media platforms in recent years has tremendously increased the amount, spontaneity and frequency of human communication. Twitter usage alone [grew five orders of magnitude](http://www.internetlivestats.com/twitter-statistics/) from 5,000 tweets per day in 2007 to 500,000,000 tweets per day in 2013. And because social media communication is recorded electronically, it has persistence, a very attractive edge over verbal communication for a data scientist. It remains accessible to be parsed and analyzed as new methods and expertise become available, even long after the data was generated. Studies have shown that self-disclosure in online communication is [more frequent and revealing](https://en.wikipedia.org/wiki/Self-disclosure) than in face-to-face communication, presumably due to anonymity and physical distance. Together, these factors translate into the availability of huge volumes of data to work with in trying to gauge a phenomenon as nebulous as happiness.

A number of researchers have taken advantage of these characteristics of modern online communication and sought to measure happiness using Twitter data. A notable example is the [hedonometer project by Dodds *et al.*](http://hedonometer.org/index.html), which attempts to measure the happiness of populations in real time. Recent hedonometer results reveal a clear spike in average happiness on Thanksgiving Day in the United States and sharp dips that coincide with the recent terrorist shootings in Paris and San Bernardino. Christmas Day consistently ranks as the happiest day of the year. Amusingly, the arrest of Justin Bieber in January 2014 was a pretty sad day: a testament to the demographics of Twitter users, or perhaps an indication of the power of "beliebers" and the pandemic nature of "bieber fever". A similar study by [Alex Davies](http://alex-davies-4lq6.squarespace.com/twitter-emoticon-meanings/gauged) sought to gauge the happiness of populations by performing sentiment analysis based on emoticons embedded in Twitter messages. The study concluded that [Germans are the happiest](http://www.cam.ac.uk/research/news/germans-top-table-of-happiest-tweets) people on earth, followed by Mexicans and residents of the USA.

We leveraged the experiences published by these and other researchers to inform our approach in this project. Being able to reliably measure happiness opens the door to examining what the determinants of happiness in the human population are. For the current study, we chose to focus on the United States because of the ready availability of both tweet and health data for the US, as well as the limited time and computational resources available for the project. In particular, we were interested in investigating whether public health indicators released in various government and non-profit organization reports have any relationship with happiness as measured from tweets. The health data we relied on was obtained from the [America's Health Rankings](http://www.americashealthrankings.org) annual report for 2014 produced by the [United Health Foundation](http://www.unitedhealthfoundation.org). The report presents a state-by-state health analysis and considers infant mortality, cancer incidence and cardiovascular deaths, amongst many other factors.

Visit our project's website [here](http://hackersforhumanhealth.me).



#02_Question and Data Acquisition

##The question
------
The main question we wanted to answer is: "Do healthier states experience more happiness?" To do this, we assumed that happiness as expressed in Twitter communication is a surrogate for genuine happiness in the human population. That, of course, is a major assumption, but we felt that it is a reasonable one and that it offers as good a starting point as one can get in an effort to investigate this challenging question. Over the course of the project, we found that, in addition to health statistics related to disease incidence, the America's Health Rankings annual report also considers various interesting socioeconomic and environmental parameters for each state. These factors include median household income, violent crime rates, air pollution and high school graduation rates. We decided to extend our question to include an analysis of these factors' correlation with happiness.

##The data

We had two avenues for obtaining Twitter data. The first was to collect our own data from Twitter using the Twitter API, which we did. As a backup, we were also very fortunate to have access to a dataset of 3.5 million tweets generously provided by Sébastien Gruhier of http://onemilliontweetmap.com/, to whom we are very grateful. Due to Twitter's usage policy, we are unable to provide a publicly accessible link to this data set.

For both data sources, we wrote separate scripts for parsing and reformatting the data appropriately for analysis. This involved selecting only US tweets, adding a US state of origin label to each tweet and cleaning out unwanted characters in the tweets.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import json
import csv
import re
from geopy.geocoders import Nominatim
import ast
from time import sleep
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

###Data collection from Twitter

Our first task was to obtain Twitter data for analysis. Since one of our intentions was to gain experience in harvesting tweets using the Twitter API, we wrote a Python script to do just that. Please note that, for security reasons, user authorization and authentication credentials have been deleted from the data collection code displayed in the cell below.

In [None]:
from twitter import OAuth, TwitterStream

# Authorization tokens (deleted for security reasons)
ckey = ''
csecret = ''
atoken = ''
asecret = ''

oauth = OAuth(atoken, asecret, ckey, csecret)

# Initiate the connection to the Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)

# Stream English-language public tweets located inside a bounding box
# (sw_longitude,sw_latitude,ne_longitude,ne_latitude); non-US tweets are
# filtered out later, during processing
iterator = twitter_stream.statuses.filter(locations='-126,-58,26,50', lang='en')

print("\nHow many tweets would you like to collect?")
tweet_count = int(input())

final_dict = {'uid':[], 'tid':[], 'text':[], 'timestamp':[], 'city':[], 'country':[], 'bounding_box':[]}
for tweet in iterator:
    tweet_count -= 1
    # Each message arrives as a dict-like object; copy the fields
    # of interest into the corresponding column lists
    for k, v in tweet.items():
        if k == 'text':
            final_dict['text'].append(v)
        elif k == 'user':
            final_dict['uid'].append(v['id'])
        elif k == 'id':
            final_dict['tid'].append(v)
        elif k == 'timestamp_ms':
            final_dict['timestamp'].append(int(v))
        elif k == 'place':
            try:
                city = v['full_name'].split(',')[0]
                country = v['country']
                bounding_box = v['bounding_box']
            except (TypeError, KeyError, AttributeError):
                # 'place' is null or incomplete for this tweet
                city = country = bounding_box = ''
            # append once per tweet so the three columns stay aligned
            final_dict['city'].append(city)
            final_dict['country'].append(country)
            final_dict['bounding_box'].append(bounding_box)

    if tweet_count <= 0:
        break

# Fix the column order explicitly: the parsing code further down
# addresses the CSV fields by position
column_order = ['bounding_box', 'city', 'country', 'text', 'tid', 'timestamp', 'uid']
tweet_df = pd.DataFrame(final_dict, columns=column_order)
tweet_df.to_csv('ourdata.csv', encoding='utf-8')

A sample file containing tweets harvested using the above script is available [here](test.ourdata.csv).
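
The collection loop above fills one list per column, which only works if every message contributes to every list. As a hedged alternative, the sketch below builds one complete row per tweet instead. The function name `flatten_tweet` and the sample message are made up for illustration; only the field names (`text`, `user`, `id`, `timestamp_ms`, `place`) come from the script above.

```python
def flatten_tweet(tweet):
    """Turn one stream message into a flat row, or None for non-tweet messages."""
    if 'text' not in tweet:
        # e.g. a rate-limit notice rather than a tweet
        return None
    place = tweet.get('place') or {}   # 'place' can be null
    return {
        'uid': tweet['user']['id'],
        'tid': tweet['id'],
        'text': tweet['text'],
        'timestamp': int(tweet['timestamp_ms']),
        'city': (place.get('full_name') or '').split(',')[0],
        'country': place.get('country', ''),
        'bounding_box': place.get('bounding_box', ''),
    }


# Made-up message with the same shape as the fields used above
sample_tweet = {
    'id': 123,
    'text': 'hello world',
    'timestamp_ms': '1449028922213',
    'user': {'id': 42},
    'place': {
        'full_name': 'Mebane, NC',
        'country': 'United States',
        'bounding_box': {'type': 'Polygon', 'coordinates': [[[-79.3, 36.0]]]},
    },
}

row = flatten_tweet(sample_tweet)
row_without_place = flatten_tweet(dict(sample_tweet, place=None))
notice = flatten_tweet({'limit': {'track': 5}})
print(row['city'], row['timestamp'])
```

A list of such rows can be passed directly to `pd.DataFrame`, so a missing field yields an empty cell rather than columns of unequal length.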

###Processing of collected tweets

With the tweets collected, we went about processing the data to transform it into a format which is easier to work with. Several functions were coded to handle individual steps of the data cleaning and processing task. 

Here, we write a function named find_tweet_address to derive location information from a tweet's coordinates, taken as the first vertex of the bounding-box polygon of the tweet's place. For this project we were primarily interested in state information, but the function can also be used to get higher-resolution information such as counties or zip codes. The function depends on the Python [geopy](https://geopy.readthedocs.org/en/1.10.0/) library, which offers a client for several popular geocoding web services.

In [2]:
geolocator = Nominatim()
def find_tweet_address(gps_polygon_text):
    """
    Get details about the location of origin of a tweet
    based on GPS coordinates
    """
    # the bounding box is stored in the CSV file as the text of a dict
    gps_polygon_dict = ast.literal_eval(gps_polygon_text)
    # get longitude and latitude of the first vertex of the polygon
    longitude = gps_polygon_dict['coordinates'][0][0][0]
    latitude = gps_polygon_dict['coordinates'][0][0][1]
    # grab location info from the Nominatim geocoding web service
    tweetlocation = geolocator.reverse((latitude, longitude))
    tweetaddress_fields = tweetlocation.raw
    try:
        county = tweetaddress_fields['address']['county']
        state = tweetaddress_fields['address']['state']
        zipcode = tweetaddress_fields['address']['postcode']
    except KeyError:
        # if any one of the three fields is missing, all three are left blank
        county = ''
        state = ''
        zipcode = ''
    # load location info into dict
    location_dict = dict(county=county, state=state, zipcode=zipcode)
    return location_dict
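
Two design choices in find_tweet_address are worth illustrating: it geocodes the first corner of the bounding box, and it blanks all three address fields when any one of them is missing. The sketch below shows one possible variation, not the method used in this project: it takes the center of the box and reads each field independently. The function names, the sample polygon and the sample response are assumptions made for illustration. No web service is called.

```python
import ast


def bounding_box_center(gps_polygon_text):
    """Return (latitude, longitude) of the center of a bounding-box polygon.

    Assumes each corner of the box is listed exactly once.
    """
    polygon = ast.literal_eval(gps_polygon_text)
    corners = polygon['coordinates'][0]   # list of [longitude, latitude] pairs
    longitude = sum(c[0] for c in corners) / float(len(corners))
    latitude = sum(c[1] for c in corners) / float(len(corners))
    return latitude, longitude


def extract_address_fields(raw_response):
    """Read county, state and postcode independently, defaulting to ''."""
    address = raw_response.get('address', {})
    return dict(county=address.get('county', ''),
                state=address.get('state', ''),
                zipcode=address.get('postcode', ''))


# Made-up inputs shaped like the data handled above
sample_box = ("{'type': 'Polygon', 'coordinates': "
              "[[[-80.0, 35.0], [-80.0, 36.0], [-79.0, 36.0], [-79.0, 35.0]]]}")
sample_raw = {'address': {'state': 'North Carolina'}}

center = bounding_box_center(sample_box)
fields = extract_address_fields(sample_raw)
print(center, fields)
```

With per-field defaults, a response that lacks a county or postcode still yields its state, which is the only field the state-level analysis needs.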


Now let's use a borrowed function to clean up the text of the tweets.

In [3]:
def tweet_cleaner(tweet):
    """
    tweet cleaning function
    adapted from http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/
    """
    # Convert to lower case
    tweet = tweet.lower()
    # Replace www.* or https?://* links with the placeholder URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Replace @username mentions with the placeholder AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # Collapse runs of whitespace into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    # Strip leading and trailing quote characters
    cleanedtweet = tweet.strip('\'"')
    return cleanedtweet
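
To make the effect of these substitutions concrete, the snippet below repeats the same steps in a standalone function and applies them to a made-up tweet. The name `clean_tweet_demo` and the sample text are illustrative only.

```python
import re


def clean_tweet_demo(tweet):
    """Apply the same cleaning steps as tweet_cleaner, in the same order."""
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    tweet = re.sub(r'[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet.strip('\'"')


sample_text = '"Check THIS out @Some_User: http://t.co/abc123   #Happy   day"'
cleaned = clean_tweet_demo(sample_text)
print(cleaned)   # check this out AT_USER URL happy day
```

Note that the mention pattern consumes everything up to the next whitespace, so the colon after the username disappears along with it.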


The tweets we collected are in a CSV file. It is the job of the parsecsv function below to take one row of that file at a time, unpack the tweet's fields and apply the text cleaning and location finder functions defined above.

In [4]:
def parsecsv(tweet_data):
    """
    parse each tweet and extract values of interest
    """
    tweet_dict = None
    # select only US tweets (the header row is skipped by the same test)
    if tweet_data[3] == "United States":
        # extract data from CSV file fields, addressed by column position
        tweetid = tweet_data[-3]
        userid = tweet_data[-1]
        place = tweet_data[2]
        coords = tweet_data[1]
        country = tweet_data[3]
        # the collected CSV file does not record the tweet language
        lang = ''
        timestamp = tweet_data[-2]
        ttext = tweet_data[4]
        # clean the tweet text
        ttext_cleand = tweet_cleaner(ttext)
        # get location information
        location_data = find_tweet_address(coords)
        state = location_data['state']
        # load extracted information into a dictionary
        tweet_dict = dict(tweetid=tweetid, userid=userid, place=place, coords=coords, country=country, state=state, lang=lang,
                         timestamp=timestamp, ttext=ttext, ttext_cleand=ttext_cleand)
    return tweet_dict

With our workhorse functions ready, we now define a main function to execute the workflow and write out the processed output file.

In [5]:
def ourdata_main():
    """
    Load, reformat and clean the collected tweets
    """
    line_count = 0
    # read input csv file containing tweets
    with open(incsvfile) as data_file:
        data = csv.reader(data_file)
        # write output json file of processed tweet data
        with open(outjsonfile, 'w') as fp:
            fp.write('[' + '\n')
            for tweet_data in data:
                tweet_dict = parsecsv(tweet_data)
                out_put = json.dumps(tweet_dict)
                # rows that are not US tweets come back as None, i.e. 'null'
                if out_put != 'null':
                    if line_count == 0:
                        fp.write(out_put + '\n')
                    else:
                        fp.write("," + out_put + '\n')
                    # flag that the first record has been written
                    line_count = 1
            fp.write(']' + '\n')
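
ourdata_main assembles the JSON array by hand, writing the brackets and separating commas itself so that records can be written one at a time. When the processed records fit in memory, a simpler option is to collect them in a list and let the `json` module do the formatting. The sketch below shows that alternative under that assumption. The function name `write_json_array`, the sample records and the temporary output path are made up for illustration.

```python
import json
import os
import tempfile


def write_json_array(records, out_path):
    """Write the records that are not None to out_path as one JSON array."""
    kept = [record for record in records if record is not None]
    with open(out_path, 'w') as fp:
        json.dump(kept, fp)
    return len(kept)


# Made-up records; None stands for a row that was filtered out
sample_records = [
    {"tweetid": 1, "state": "Ohio"},
    None,
    {"tweetid": 2, "state": "Texas"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.usa.json")
n_written = write_json_array(sample_records, sample_path)

with open(sample_path) as fp:
    reloaded = json.load(fp)
print(n_written, reloaded)
```

The trade-off is memory: the hand-written approach streams each record to disk as soon as it is processed, which matters for a file of several million tweets.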


In this notebook, we run our code on a small subset of the data for convenience. In practice, we ran the code using a command line version of the data cleaning and processing script shown [here](./processTwitterCsvfile.py).

In [6]:
%%time
incsvfile = ('./test.ourdata.csv')
outjsonfile = ('test.ourdata.usa.json')
ourdata_main()

CPU times: user 309 ms, sys: 21.2 ms, total: 330 ms
Wall time: 13.2 s


The resulting dataframe, with each row describing a single tweet and each column describing a property of that tweet, is displayed below.

In [7]:
usdf = pd.read_json('test.ourdata.usa.json')
usdf.head(3)

| | coords | country | lang | place | state | timestamp | ttext | ttext_cleand | tweetid | userid |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {u'type': u'Polygon', u'coordinates': [[[-86.3... | United States | | North Muskegon | | 1449028922213 | @westbrook_chloe @cc6163 @anniiikkkkaaa @Jenna... | AT_USER AT_USER AT_USER AT_USER AT_USER he won... | 671902115011383296 | 531579492 |
| 1 | {u'type': u'Polygon', u'coordinates': [[[-79.3... | United States | | Mebane | North Carolina | 1449028922366 | My timehop is the most embarrassing thing ever... | my timehop is the most embarrassing thing ever... | 671902115653120000 | 517114449 |
| 2 | {u'type': u'Polygon', u'coordinates': [[[-93.2... | United States | | Prien | Louisiana | 1449028922422 | I swea 🌚 he was going so fast I thought it wa... | i swea 🌚 he was going so fast i thought it wa... | 671902115887898624 | 2765379648 |


###Processing of tweets obtained from Sébastien Gruhier of http://onemilliontweetmap.com/

Our second data set of tweets was in a JSON formatted file, a sample of which is provided [here](./test.omtmdata.json).

The next piece of code reads in the Twitter data file and outputs another JSON file with United States tweets only, state labels added and tweet text cleaned. The online JSON viewer at http://jsonviewer.stack.hu/ proved to be a very useful tool for easily identifying the fields to extract using our script. The viewer's rendering of a tweet in our data file looks like [this](omtm_screenshot.png).

In [8]:
def parsejson(tweet_data):
    '''
    function to parse data file and extract fields of interest
    '''
    # initialize dictionary for storing output
    tweet_dict = None
    # filter out tweets from countries other than the United States
    if tweet_data["_source"]["place"]["country_code"] == "US":
        # extract required fields
        tweetid = tweet_data["_id"]
        userid = tweet_data["_source"]["user"]["id"]
        place = tweet_data["_source"]["place"]["full_name"]
        coords = tweet_data["_source"]["coordinates"]
        country = tweet_data["_source"]["place"]["country"]
        lang = tweet_data["_source"]["lang"]
        timestamp = tweet_data["_source"]["timestamp_ms"]
        # clean the text message of tweets
        ttext = tweet_data["_source"]["text"]
        ttext_cleand = tweet_cleaner(ttext)
        # derive the state from the place name: city-level places end in a
        # two-letter state code ("Seattle, WA"), while places ending in
        # "USA" start with the state name
        state = place.strip()[-3:].strip()
        if state == 'USA':
            state = place.split(",")[0]

        tweet_dict = dict(tweetid=tweetid, userid=userid, place=place, coords=coords, country=country, state=state, lang=lang,
                         timestamp=timestamp, ttext=ttext, ttext_cleand=ttext_cleand)
    return tweet_dict


def omtmdata_main():
    """
    read input tweet data, process and write output file
    """
    line_count = 0
    # load input json file for reading
    with open(injsonfile) as data_file:
        data = json.load(data_file)
    # open output json file for writing
    with open(outjsonfile, 'w') as fp:
        fp.write('[' + '\n')
        # loop over tweets one by one
        for tweet_data in data:
            tweet_dict = parsejson(tweet_data)
            out_put = json.dumps(tweet_dict)
            # write parsed output, skipping non-US tweets ('null')
            if out_put != 'null':
                if line_count == 0:
                    fp.write(out_put + '\n')
                else:
                    fp.write("," + out_put + '\n')
                # flag that the first record has been written
                line_count = 1
        fp.write(']' + '\n')
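
The state label in parsejson comes from the place name alone, with no geocoding. The snippet below isolates that rule in a small helper so its behaviour on the two place-name shapes can be checked directly. The helper name `state_from_place` and the example place names are chosen for illustration.

```python
def state_from_place(place):
    """Derive a state label from a place name such as 'Seattle, WA'."""
    suffix = place.strip()[-3:].strip()
    if suffix == 'USA':
        # place names such as 'Florida, USA' carry the state name first
        return place.split(",")[0]
    # city-level place names end in a two-letter state code
    return suffix


example_places = ['Seattle, WA', 'Florida, USA', ' San Leandro, CA ']
labels = [state_from_place(place) for place in example_places]
print(labels)   # ['WA', 'Florida', 'CA']
```

As the output shows, the rule yields a mix of two-letter codes and full state names, so a later step has to map both forms onto one spelling before tweets are grouped by state.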


In [9]:
%%time
injsonfile = ('./test.omtmdata.json')
outjsonfile = ('test.omtmdata.usa.json')
omtmdata_main()

CPU times: user 3.88 ms, sys: 1.1 ms, total: 4.98 ms
Wall time: 45.9 ms


Again, we run our code in this notebook on a small subset of the data for convenience. In practice, we ran it using a command line version of the data cleaning and processing script linked to [here](processTwitterJsonfile.py).

In [10]:
usdf_omtm = pd.read_json('test.omtmdata.usa.json')
usdf_omtm.head(3)

| | coords | country | lang | place | state | timestamp | ttext | ttext_cleand | tweetid | userid |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47.614937999999995,-122.3306025 | United States | en | Seattle, WA | WA | 1447266708839 | After taking public transit in DC and Seattle,... | after taking public transit in dc and seattle,... | 664510856407834624 | 537328079 |
| 1 | 37.7050435,-122.162294 | United States | en | San Leandro, CA | CA | 1447266710015 | Thankful for all the veterans out there, I lov... | thankful for all the veterans out there, i lov... | 664510861340340224 | 3896359752 |
| 2 | 37.7706565,-122.4359785 | United States | en | San Francisco, CA | CA | 1447266716814 | @Priz I've been watching, but not really enjoy... | AT_USER i've been watching, but not really enj... | 664510889857433600 | 15532647 |


##Health data

The America's Health Rankings annual report data for 2014 was downloaded as a CSV formatted file from a dedicated website run by the United Health Foundation. A copy of the report is posted [here](AmericasHealthRankings-Annual-2014.csv). The downloaded dataset was quite clean to begin with. Using tools available in Microsoft Excel, we transformed the data from a long format to a wide format and deleted the value column. The resulting file, shown [here](health_cleaned.xlsx), was used for analysis.
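
The long-to-wide reshaping was done in Excel, but the same step can be scripted. The following is only a sketch of how it might look in pandas. The column names `state`, `measure` and `score` are assumptions rather than the report's actual headers, and the numbers are placeholders, not values from the report.

```python
import pandas as pd

# Long format: one row per (state, measure) pair.
# Column names and values are placeholders for illustration only.
long_df = pd.DataFrame({
    'state':   ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'measure': ['Smoking', 'Obesity', 'Smoking', 'Obesity'],
    'score':   [1.0, 2.0, 3.0, 4.0],
})

# Wide format: one row per state, one column per measure
wide_df = long_df.pivot(index='state', columns='measure', values='score')
print(wide_df)
```

Scripting the reshaping keeps the step reproducible alongside the rest of the notebook.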

##Experiences in data acquisition and processing

One challenge we faced in data acquisition and processing is that Twitter currently has a [policy](https://dev.twitter.com/rest/public/rate-limiting) in place that restricts access to historical data to paying customers via third-party vendors. The public streaming API that we used is [rate-limited](https://dev.twitter.com/rest/public/rate-limiting) and capped to a small, randomly sampled fraction of the total number of tweets available at any given moment. Consequently, our data acquisition script needed to be run multiple times over several days to acquire sufficient data for analysis. Because we did not have much time to collect data in this way, we had to resort to our backup data set from onemilliontweetmap.com.
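
Running the collection script repeatedly produces several CSV files that have to be combined before analysis. The sketch below shows one way this could be done, keeping each tweet id only once in case two runs overlap. It is not the procedure used in this project. The function name `merge_collection_runs` and the sample rows are made up, the `tid` column name follows the collection script, and in-memory strings stand in for files to keep the example self-contained.

```python
import csv
import io


def merge_collection_runs(csv_texts, id_column='tid'):
    """Combine rows from several collection runs, keeping each tweet id once."""
    seen_ids = set()
    merged_rows = []
    for csv_text in csv_texts:
        for row in csv.DictReader(io.StringIO(csv_text)):
            if row[id_column] not in seen_ids:
                seen_ids.add(row[id_column])
                merged_rows.append(row)
    return merged_rows


# Made-up output of two runs that share one tweet
run_1 = "tid,text\n1,good morning\n2,so happy today\n"
run_2 = "tid,text\n2,so happy today\n3,long day\n"

merged = merge_collection_runs([run_1, run_2])
merged_ids = [row['tid'] for row in merged]
print(merged_ids)   # ['1', '2', '3']
```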
