# Measuring Opioid Stigma Using the Twitter API and NLP

## Introduction
More than 100 Americans each day die from [opioid overdoses](https://www.cdc.gov/drugoverdose/epidemic/index.html). Expanding access to [medication for addiction treatment (MAT)](http://adai.uw.edu/pubs/infobriefs/MAT.pdf) has the potential to help reverse the epidemic. However, [efforts to expand MAT have been blocked due to a stigmatized view of opioid addiction](https://www.vox.com/science-and-health/2017/7/20/15937896/medication-assisted-treatment-methadone-buprenorphine-naltrexone) as a moral failing rather than a medical condition.

Despite its importance, very little data is available on opioid-related stigma. This is in part because stigma is difficult to measure with traditional tools such as surveys, which may underestimate the pervasiveness of stigma due to [social desirability bias](https://en.wikipedia.org/wiki/Social_desirability_bias).

A potentially valuable source of data to fill this gap is unstructured text data from Twitter, which is [less likely to suffer from social desirability bias](http://journals.sagepub.com/doi/abs/10.1177/0049124115605339) than traditional surveys.

In order to tap into this data source, we can set up a listener that uses Twitter's [Streaming API](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview) to track opioid-related tweets.

## Data Aquisition

### Scraping Twitter

```python
# Import libraries necessary for scraping
import logging
import tweepy
import dataset
from sqlalchemy.exc import ProgrammingError
from requests.packages.urllib3.exceptions import ReadTimeoutError
from textblob import TextBlob
import settings

# Set up logger for debugging
logging.basicConfig(
    filename=f"logs/{__name__}.log",
    level=logging.DEBUG,
    format="%(name)s - %(asctime)s - %(levelname)s - %(message)s",
    filemode='w')
logger = logging.getLogger()
logger.info('Starting log...')

# Define database connection
db = dataset.connect(settings.CONNECTION_STRING)

class StreamListener(tweepy.StreamListener):

    def on_status(self, status):
        # Exclude retweets
        if hasattr(status, 'retweeted_status'):
            return

        else:
            # Capture text of tweet
            try:
                text = status.extended_tweet['full_text']
            except AttributeError:
                text = status.text
            # Capture tweet metadata
            tweet_created_utc = status.created_at
            user_followers = status.user.followers_count
            user = status.user.screen_name
            user_location = status.user.location
            tweet_id_str = status.id_str
            user_id_str = status.user.id_str

            # Analyze polarity and subjectivity
            blob = TextBlob(text)
            tweet_polarity = blob.sentiment.polarity
            tweet_subjectivity = blob.sentiment.subjectivity

            # Write tweet and metadata to database
            table = db[settings.TABLE_NAME]
            try:
                table.insert(dict(
                    tweet_id=tweet_id_str,
                    user_id=user_id_str,
                    user=user,
                    user_location=user_location,
                    user_followers=user_followersfollowers,
                    tweet_text=text,
                    tweet_created_utc=tweet_created_utc,
                    tweet_polarity=tweet_polarity,
                    tweet_subjectivity=tweet_subjectivity,
                ))
            except ProgrammingError as err:
                logging.warning(err)

    def on_error(self, status_code):
        if status_code == 420:  # rate limiting
            return False


if __name__ == '__main__':
    # Authenticate using tokens defined in settings.py
    auth = tweepy.OAuthHandler(settings.TWITTER_APP_KEY,
                               settings.TWITTER_APP_SECRET)
    auth.set_access_token(settings.TWITTER_KEY, settings.TWITTER_SECRET)
    api = tweepy.API(auth)

    while True:
        try:
            stream_listener = StreamListener()
            stream = tweepy.Stream(auth=api.auth, listener=stream_listener,
                                   tweet_mode='extended')
            stream.filter(track=settings.TRACK_TERMS, languages=['en'],
                          stall_warnings=True)
        except ReadTimeoutError as err:
            logging.warning(err)
            continue
```

## Data Cleaning

### Load dataset

In [1]:
# Import libraries necessary for data cleaning
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
import settings

# Load raw data
DATA_PATH = settings.DATA_PATH
raw_data_file = '{}raw_data.csv'.format(DATA_PATH)
raw_data = pd.read_csv(raw_data_file, parse_dates=['tweet_created_utc'])

# Change maximum display width of columns to see full text of tweets
pd.options.display.max_colwidth = 300

# View first five rows of raw data
rows = raw_data.shape[0]
first_tweet = raw_data['tweet_created_utc'].dt.date.min()
last_tweet = raw_data['tweet_created_utc'].dt.date.max()
print("Raw data contains {:,} tweets scraped between {} and {}.".format(rows, first_tweet, last_tweet))
print("\nFirst five rows of raw data:")
display(raw_data.head())

Raw data contains 713,256 tweets scraped between 2018-03-25 and 2018-09-22.

First five rows of raw data:


Unnamed: 0,user,user_location,user_followers,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
0,BrookeM_Feldman,"Philadelphia, PA",1803,"1)I heard a heartbreaking story from a person using drugs in Kensington on Friday. 20 years ago and long before today's ""opioid epidemic"" hysteria, his mother was abruptly denied prescription opioids she had been taking for a medical condition...",2018-03-25 15:21:00,-0.058333,0.466667
1,NewLeaf_Service,"Trenton, NJ",171,24-hour opioid hotline goes live | WBFO https://t.co/FY00sRgrSZ,2018-03-25 15:21:00,0.136364,0.5
2,RamonaEid,Denver,4724,"$1M in fentanyl seized from Texas trio plotting to mail drugs back from Ohio, authorities say https://t.co/dkhS5AcqjS",2018-03-25 15:21:00,0.0,0.0
3,jimj3125,,35,Trump don't care about anyone or anything except his ego and his money. https://t.co/8o6JvVfWjL,2018-03-25 15:21:00,0.0,0.0
4,theGrudgeRetort,United States,2095,Let’s ban theft while we’re at it. https://t.co/hXBENAnKrT,2018-03-25 15:21:00,0.0,0.0


### Check missings

In [2]:
# Confirm that only user_location has missing values
for col in list(raw_data):
    if col != 'user_location':
        assert (raw_data[col].isna().sum() == 0)

### Initial text cleaning
To make the tweet text easier to read for coding, this section will:
1. Remove URLs
2. Remove mentions of other Twitter users
3. Remove new line characters
4. Remove extra whitespace
5. Decode HTML

After tweets have been coding, a secondary text cleaning function will be applied to prepare the text for input into natural language processing algorithms.

Hat tips: [Ricky Kim](https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-2-333514854913) and [Dipanjan Sarkar](https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html)

In [3]:
def initial_tweet_cleaner(tweet):
    """Remove URLS, mentions, new line characers, whitespace, and HTML"""
    # Remove URLs
    url_pattern = '(http|www\.)[^\s]+'
    # Remove mentions
    mention_pattern = '@[A-Za-z0-9_]+'
    # Remove new line characters
    new_line_pattern = '\\\\n'
    # Combine patterns
    combined_pattern = '|'.join([url_pattern, mention_pattern, new_line_pattern])
    tweet = re.sub(combined_pattern, '', tweet)
    # Remove extra whitespace
    tweet = re.sub(' +', ' ', tweet)
    # Strip HTML
    tweet = BeautifulSoup(tweet, 'html.parser').get_text()
        
    return tweet

In [4]:
# Remove URLs
url_example = raw_data['tweet_text'][14]
print("Tweet containing URL: {}".format(url_example))
url_clean = initial_tweet_cleaner(url_example)
print ("\nTweet cleaned of URL: {}".format(url_clean))

Tweet containing URL: The Big Pharma business plan was to get our people hooked on opioid prescription painkillers. As a reward for this, the CEO of McKesson Pharmaceutical Co. makes $265,000 per day. These Big Pharma execs are the real drug pushers and... https://t.co/a9CI1fJvdS

Tweet cleaned of URL: The Big Pharma business plan was to get our people hooked on opioid prescription painkillers. As a reward for this, the CEO of McKesson Pharmaceutical Co. makes $265,000 per day. These Big Pharma execs are the real drug pushers and... 


In [5]:
# Strip mentions of other Twitter users
mention_example = raw_data['tweet_text'][18]
print("Tweet containing @mention: {}".format(mention_example))
mention_clean = initial_tweet_cleaner(mention_example)
print ("\nTweet cleaned of @mention: {}".format(mention_clean))

Tweet containing @mention: @F00LofPotential I've noticed that states which have legalized marijuana have fewer opiate deaths.

Tweet cleaned of @mention:  I've noticed that states which have legalized marijuana have fewer opiate deaths.


In [6]:
# Remove new line characters
new_line_example = raw_data['tweet_text'][52]
print("Tweet containing URL: {}".format(new_line_example))
new_line_clean = initial_tweet_cleaner(new_line_example)
print ("\nTweet cleaned of URL: {}".format(new_line_clean))

Tweet containing URL: @KrazeeCatLaydee @Wiininiskwe @CandiceMalcolm Meth/Fentanyl drug houses not being ignored. Being replaced by Safe Injection Sites. \nFN individuals way over represented in criminal Justice System but closer to 30% (at least in Youth system).

Tweet cleaned of URL:  Meth/Fentanyl drug houses not being ignored. Being replaced by Safe Injection Sites. FN individuals way over represented in criminal Justice System but closer to 30% (at least in Youth system).


In [7]:
# Decode html
html_example = raw_data['tweet_text'][8]
print("Tweet containing HTML: {}".format(html_example))
html_clean = initial_tweet_cleaner(html_example)
print ("\nTweet cleaned of HTML: {}".format(html_clean))

Tweet containing HTML: Methadone &amp; Suboxone Clinic - Indianapolis, Indiana https://t.co/p5FdAFKFuH

Tweet cleaned of HTML: Methadone & Suboxone Clinic - Indianapolis, Indiana 


In [8]:
stripped_tweets = raw_data.copy()
stripped_tweets['tweet_text'] = stripped_tweets['tweet_text'].map(initial_tweet_cleaner)
stripped_tweets.head()

Unnamed: 0,user,user_location,user_followers,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
0,BrookeM_Feldman,"Philadelphia, PA",1803,"1)I heard a heartbreaking story from a person using drugs in Kensington on Friday. 20 years ago and long before today's ""opioid epidemic"" hysteria, his mother was abruptly denied prescription opioids she had been taking for a medical condition...",2018-03-25 15:21:00,-0.058333,0.466667
1,NewLeaf_Service,"Trenton, NJ",171,24-hour opioid hotline goes live | WBFO,2018-03-25 15:21:00,0.136364,0.5
2,RamonaEid,Denver,4724,"$1M in fentanyl seized from Texas trio plotting to mail drugs back from Ohio, authorities say",2018-03-25 15:21:00,0.0,0.0
3,jimj3125,,35,Trump don't care about anyone or anything except his ego and his money.,2018-03-25 15:21:00,0.0,0.0
4,theGrudgeRetort,United States,2095,Let’s ban theft while we’re at it.,2018-03-25 15:21:00,0.0,0.0


### Drop quote tweets without keywords

On Twitter, it is possible to quote another user's tweet and add a new Tweet message. Some quote tweets were only captured by the stream listener because the _quoted tweet_ included an opioid-related keyword.  Since the stream listener only captured the URL of the quoted tweet but not its text, the `tweet_text` field for these quote tweets does not contain any opioid-related keywords.

For example, here is how the tweet in row 17 of our data originally appeared on Twitter:
![](../figures/quote_example.png)

But `stripped_tweets['tweet_text'][17]` only contains `'ThankYou!❤️ '`

Since this will not help with our analysis, we will drop these tweets.

In [10]:
# Count tweets that do not contain opioid keywords
# Note: Some tweets were included only because they quoted tweets that contained keywords
track_terms = '|'.join(settings.TRACK_TERMS)
contain_keywords = stripped_tweets['tweet_text'].str.contains(track_terms)
missing_keywords = np.size(contain_keywords) - np.count_nonzero(contain_keywords)

# Drop tweets that do not contain opioid keywords
print("Dropping {:,} tweets missing opioid keywords.".format(missing_keywords))
opioid_tweets = stripped_tweets[contain_keywords].copy()
print("\nData now contains {:,} tweets with opioid keywords".format(opioid_tweets.shape[0]))
opioid_tweets.head()

Dropping 324,776 tweets missing opioid keywords.

Data now contains 388,480 tweets with opioid keywords


Unnamed: 0,user,user_location,user_followers,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
0,BrookeM_Feldman,"Philadelphia, PA",1803,"1)I heard a heartbreaking story from a person using drugs in Kensington on Friday. 20 years ago and long before today's ""opioid epidemic"" hysteria, his mother was abruptly denied prescription opioids she had been taking for a medical condition...",2018-03-25 15:21:00,-0.058333,0.466667
1,NewLeaf_Service,"Trenton, NJ",171,24-hour opioid hotline goes live | WBFO,2018-03-25 15:21:00,0.136364,0.5
2,RamonaEid,Denver,4724,"$1M in fentanyl seized from Texas trio plotting to mail drugs back from Ohio, authorities say",2018-03-25 15:21:00,0.0,0.0
5,ON_YourFeet,"Red Hill, SC",718,"3 Toms River residents allegedly possessed $6,700 in heroin",2018-03-25 15:21:00,-0.1,0.1
9,DrKhouryCDC,"Atlanta, GA",9557,Read a CDC paper that used genomics to help map an HIV outbreak associated with opioid use.,2018-03-25 15:21:00,0.0,0.0


### Tag users with duplicate tweets and keep first
Automated Twitter accounts, also known as 'bots', are more likely to tweet the same exact same text multiple times than accounts where human users send each message. This section tags accounts that have duplicate tweets and keeps only the first of each duplicate tweet _by user_.

Duplicate tweets _across users_ will be kept for two reasons:
1. The exact same tweet text can be sent by multiple users who share a news story via a share button built in to a website or mobile app without changing the default text.
2. If the same news story with a stigmatizing headline is shared via tweet by multiple users, we would want to count each tweet in our analysis.

In [11]:
# Count duplicates in terms of tweet text
duplicates = opioid_tweets.loc[
    opioid_tweets.duplicated(subset=['user','tweet_text'], keep='first')]

# Identify users with duplicate tweets
users_with_duplicates = set(duplicates['user'])
opioid_tweets['user_bot'] = opioid_tweets['user'].isin(users_with_duplicates)
print("{:,} users tagged as potential bots due to duplicate tweets.".format(len(users_with_duplicates)))

# Move 'user_bot' next to 'user_followers'
cols = opioid_tweets.columns.tolist()
cols = cols[0:3] + cols[-1:] + cols[3:-1]
opioid_tweets = opioid_tweets[cols]

# Drop duplicate tweets within user
print("\nDropping {:,} tweets shared multiple times by the same user.".format(duplicates.shape[0]))
opioid_tweets_unique = opioid_tweets.drop_duplicates(subset=['user','tweet_text'], keep='first').copy()
print("\nData now contains {:,} tweets with opioid keywords. "
      "Tweets are now unique within user.".format(opioid_tweets_unique.shape[0]))
opioid_tweets_unique.head()

3,830 users tagged as potential bots due to duplicate tweets.

Dropping 17,596 tweets shared multiple times by the same user.

Data now contains 370,884 tweets with opioid keywords. Tweets are now unique within user.


Unnamed: 0,user,user_location,user_followers,user_bot,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
0,BrookeM_Feldman,"Philadelphia, PA",1803,False,"1)I heard a heartbreaking story from a person using drugs in Kensington on Friday. 20 years ago and long before today's ""opioid epidemic"" hysteria, his mother was abruptly denied prescription opioids she had been taking for a medical condition...",2018-03-25 15:21:00,-0.058333,0.466667
1,NewLeaf_Service,"Trenton, NJ",171,True,24-hour opioid hotline goes live | WBFO,2018-03-25 15:21:00,0.136364,0.5
2,RamonaEid,Denver,4724,False,"$1M in fentanyl seized from Texas trio plotting to mail drugs back from Ohio, authorities say",2018-03-25 15:21:00,0.0,0.0
5,ON_YourFeet,"Red Hill, SC",718,True,"3 Toms River residents allegedly possessed $6,700 in heroin",2018-03-25 15:21:00,-0.1,0.1
9,DrKhouryCDC,"Atlanta, GA",9557,False,Read a CDC paper that used genomics to help map an HIV outbreak associated with opioid use.,2018-03-25 15:21:00,0.0,0.0


### Restrict to tweets with identifiable U.S. location
Since analysis will focus on geographic variation in stigma, this section will restrict to tweets from users with an identifiable U.S. location. Because `user_location` is user-entered, this requires significant cleaning.

This section removes tweets with no `user_location`, converts all user locations to lowercase, strips non-letter characters except for commas, and then keeps tweets from users with a location that meets at least one of the following criteria:
1. Ends in a U.S. state postal abbreviation
2. Contains a U.S. state name
3. Contains the name of a large U.S. city

In [12]:
# Count tweets with no location
missing_location = opioid_tweets_unique['user_location'].isna().sum()

# Drop tweets missing user location
print("Dropping {:,} tweets missing user location.".format(missing_location))
opioid_tweets_location = opioid_tweets_unique.loc[
    opioid_tweets_unique['user_location'].notna()].copy()
print("\nData now contains {:,} unique tweets "
      "with opioid keywords and user location.".format(opioid_tweets_location.shape[0]))

Dropping 87,352 tweets missing user location.

Data now contains 283,532 unique tweets with opioid keywords and user location.


In [13]:
# Convert user_location to lower case
opioid_tweets_location['user_location'] = opioid_tweets_location['user_location'].str.lower()

# Remove non-letter characters except for commas
pattern = r'[^a-zA-z\s\,]'
opioid_tweets_location['user_location'] = opioid_tweets_location['user_location'].str.replace(pattern, '')

In [14]:
# Create dictionary to identify user_locations with U.S. state names or abbreviations
# Hat tip: https://gist.github.com/rogerallen/1583593

usa_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

usa_state_abbrev_lower = dict((k.lower(), v.lower()) for k,v in usa_state_abbrev.items())

In [15]:
# Load list U.S. cities with population greater than 100,000
# Note 1: Cities that share a name with populous foreign cities (e.g. Birmingham) are excluded
# Note 2: Duplicates (e.g. Kansas City, KS and Kansas City, MO) are removed
# Source: https://simple.wikipedia.org/wiki/List_of_United_States_cities_by_population
usa_cities_file = '{}usa_cities.csv'.format(DATA_PATH)
usa_cities = pd.read_csv(usa_cities_file)

In [16]:
# Filter 1: Ends with state abbreviation
# Create list of state abbreviations and convert to lowercase
state_abbrevs_list = list(usa_state_abbrev_lower.values())

# Create regex that captures separate state abbreviations at the end of strings or followed by ", usa"
state_abbrevs_regex = '\\b' + '(\,\susa)*$|\\b'.join(state_abbrevs_list) + '(\,\susa)*$'

# Apply filter
opioid_tweets_state_abbrevs = opioid_tweets_location.loc[
    opioid_tweets_location['user_location'].str.contains(state_abbrevs_regex)].copy()
print("\n{:,} unique tweets with opioid keywords have a user location "
      "that ends in a U.S. state abbreviation.".format(opioid_tweets_state_abbrevs.shape[0]))
opioid_tweets_state_abbrevs.head()

  # Remove the CWD from sys.path while we load stuff.



91,018 unique tweets with opioid keywords have a user location that ends in a U.S. state abbreviation.


Unnamed: 0,user,user_location,user_followers,user_bot,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
0,BrookeM_Feldman,"philadelphia, pa",1803,False,"1)I heard a heartbreaking story from a person using drugs in Kensington on Friday. 20 years ago and long before today's ""opioid epidemic"" hysteria, his mother was abruptly denied prescription opioids she had been taking for a medical condition...",2018-03-25 15:21:00,-0.058333,0.466667
1,NewLeaf_Service,"trenton, nj",171,True,24-hour opioid hotline goes live | WBFO,2018-03-25 15:21:00,0.136364,0.5
5,ON_YourFeet,"red hill, sc",718,True,"3 Toms River residents allegedly possessed $6,700 in heroin",2018-03-25 15:21:00,-0.1,0.1
9,DrKhouryCDC,"atlanta, ga",9557,False,Read a CDC paper that used genomics to help map an HIV outbreak associated with opioid use.,2018-03-25 15:21:00,0.0,0.0
11,deepcow,"tampa, fl",10835,False,"$1M in fentanyl seized from Texas trio plotting to mail drugs back from Ohio, authorities say.",2018-03-25 15:22:00,0.0,0.0


In [18]:
# Filter 2: Contain state name
# Create list of state names and convert to lowercase
states_list = list(usa_state_abbrev_lower.keys())

# Convert list to regex that ensures that matches are not part of longer strings
states_regex = '\\b' + '\\b|\\b'.join(states_list) + '\\b'

# Only search tweets not captured by filter 1
no_abbrev = opioid_tweets_location.loc[
    opioid_tweets_location['user_location'].str.contains(state_abbrevs_regex)==False].copy()

# Apply filter
opioid_tweets_states = no_abbrev.loc[
    no_abbrev['user_location'].str.contains(states_regex)].copy()
print("\n{:,} addtional unique tweets with opioid keywords have a user location "
      "that contains a U.S. state.".format(opioid_tweets_states.shape[0]))
opioid_tweets_states.head()

  # Remove the CWD from sys.path while we load stuff.



47,493 addtional unique tweets with opioid keywords have a user location that contains a U.S. state.


Unnamed: 0,user,user_location,user_followers,user_bot,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
18,kimlockhartga,"georgia, usa",12750,False,I've noticed that states which have legalized marijuana have fewer opiate deaths.,2018-03-25 15:22:00,0.0,0.0
20,mach229,"pennsylvania, usa",392,False,"Military was always rich. You're just continuing, not creating funding you hateful moron. Nice try linking drugs with Mexicans but the opioid crisis is homegrown so build a wall around pharmaceutical companies unless, of course, the st",2018-03-25 15:23:00,-0.145,0.77
22,BradmFROSTBITE,"ohio born, brooklyn raised",62,False,"So let me get this straight...... gun deaths outnumber opioid overdose deaths but only one is labeled an ""epidemic"" that GOP lawmakers care about. #MarchForOurLives #GunControlNow #DumpTrump",2018-03-25 15:23:00,0.1,0.7
41,Nevada,"nevada, usa",4082,False,This is an issue of rural prosperity': Nevada roundtable spotlights opioid epidemic in small communities.,2018-03-25 15:24:00,-0.125,0.2
58,rileecoyote,"pennsylvania, usa",15186,False,"This should be headlining the news in light of the opioid crisis that we have been inflicted with. Sadly it won’t be, and the process of straightening this mess out in the quickest manor is continuously slowed down by #sidelined science. #inflictedaddiction",2018-03-25 15:26:00,-0.107639,0.540972


In [19]:
# Filter 3: Contain city name
# Create list of city names and convert to lowercase
usa_cities_list = [city.lower() for city in list(usa_cities['City'])]

# Convert list to regex that ensures that matches are not part of longer strings
usa_cities_regex = '\\b' + '\\b|\\b'.join(usa_cities_list) + '\\b'

# Only search tweets not captured by filters 1 or 2
no_state = no_abbrev.loc[
    no_abbrev['user_location'].str.contains(states_regex)==False].copy()

# Apply filter
opioid_tweets_cities = no_state.loc[
    no_state['user_location'].str.contains(usa_cities_regex)].copy()
print("\n{:,} additional unique tweets with opioid keywords have a user location "
      "that contains a U.S. city.".format(opioid_tweets_cities.shape[0]))
opioid_tweets_cities.head()


10,545 additional unique tweets with opioid keywords have a user location that contains a U.S. city.


Unnamed: 0,user,user_location,user_followers,user_bot,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
2,RamonaEid,denver,4724,False,"$1M in fentanyl seized from Texas trio plotting to mail drugs back from Ohio, authorities say",2018-03-25 15:21:00,0.0,0.0
136,abc7breaking,el paso,88269,False,"Three Texas residents have been arrested in a Toledo, Ohio, drug operation after they tried to mail a 2.2 pounds of heat-sealed fentanyl that authorities said was enough to kill the population of Toledo ""several times over.""",2018-03-25 15:34:00,0.0,0.25
202,alwilsonlsj,lansing,671,False,Prosecutors are increasingly charging people in connection with opioid overdose deaths. But is the system working? Who killed Jody? It depends who you ask via,2018-03-25 15:41:00,-0.2,0.0
217,jenkers_en,san francisco,1331,False,States: Federal money for opioid crisis a small step forward,2018-03-25 15:42:00,-0.25,0.4
289,ShelBelle4,phoenix,100,False,": We just need to tell them not to do drugs. : Eat the ice cream, not the fentanyl. This is a stellar approach to ending the opioid epidemic. Truly, groundbreaking.",2018-03-25 15:49:00,0.25,0.25


### Extract state

In [20]:
# Remove ', usa' ending so that last two characters are state abbreviation
opioid_tweets_state_abbrevs['user_location'] = opioid_tweets_state_abbrevs['user_location'].str.replace(', usa', '')

# Create columns for 'user_state' set equal to last two characters from user_location
opioid_tweets_state_abbrevs['user_state'] = opioid_tweets_state_abbrevs['user_location'].astype(str).str[-2:]

# Check
opioid_tweets_state_abbrevs[['user_location', 'user_state']].head(n=6)

Unnamed: 0,user_location,user_state
0,"philadelphia, pa",pa
1,"trenton, nj",nj
5,"red hill, sc",sc
9,"atlanta, ga",ga
11,"tampa, fl",fl
14,"tampa, fl",fl


In [21]:
# Create regex expression that contains each state name as match group
states_regex_match_groups = '\\b(' + ')\\b|\\b('.join(states_list) + ')\\b'

# Extract state names into new data frame with one column for each state
# Each row will have missing values for all columns except one
extracted_states = opioid_tweets_states['user_location'].str.extract(r'{}'.format(states_regex_match_groups))

# Replace missing values with empty strings
extracted_states.fillna('', inplace=True)

# Concatenate to get single column with state
opioid_tweets_states['user_state'] = extracted_states.values.sum(axis=1)

# Remap state names to state abbreviations
opioid_tweets_states['user_state'] = opioid_tweets_states['user_state'].replace(usa_state_abbrev_lower)

# Check
opioid_tweets_states[['user_location', 'user_state']].head()

Unnamed: 0,user_location,user_state
18,"georgia, usa",ga
20,"pennsylvania, usa",pa
22,"ohio born, brooklyn raised",oh
41,"nevada, usa",nv
58,"pennsylvania, usa",pa


In [22]:
# Create regex expression that contains each city name as match group
usa_cities_regex_match_groups = '\\b(' + ')\\b|\\b('.join(usa_cities_list) + ')\\b'
# Extract city names into new data frame with one column for each city
# Each row will have missing values for all columns except one
extracted_cities = opioid_tweets_cities['user_location'].str.extract(r'{}'.format(usa_cities_regex_match_groups))
# Replace missing values with empty strings
extracted_cities.fillna('', inplace=True)
# Concatenate to get single column with city
opioid_tweets_cities['user_city'] = extracted_cities.values.sum(axis=1)

# Create a dictionary mapping cities to states
usa_cities_states_list = [state.lower() for state in list(usa_cities['State'])]
usa_cities_dict = dict(zip(usa_cities_list, usa_cities_states_list))

# Map cities to state names
opioid_tweets_cities['user_state'] = opioid_tweets_cities['user_city'].replace(usa_cities_dict)

# Remap state names to state abbreviations
opioid_tweets_cities['user_state'] = opioid_tweets_cities['user_state'].replace(usa_state_abbrev_lower)

# Drop user_city in advance of append
opioid_tweets_cities.drop(columns='user_city', inplace=True)

# Check
opioid_tweets_cities[['user_location', 'user_state']].head()

Unnamed: 0,user_location,user_state
2,denver,co
136,el paso,tx
202,lansing,mi
217,san francisco,ca
289,phoenix,az


### Append

In [23]:
# Append tweets captured by all three filters
opioid_tweets_usa = opioid_tweets_state_abbrevs.append([opioid_tweets_states,
                                                        opioid_tweets_cities],
                                                       ignore_index=True)

# Move 'user_state' next to 'user_location'
cols = opioid_tweets_usa.columns.tolist()
cols = cols[0:2] + cols[-1:] + cols[2:-1]
opioid_tweets_usa = opioid_tweets_usa[cols]

print("\nData now contains {:,} unique tweets with opioid keywords "
      "and user location mapped to U.S. state.".format(opioid_tweets_usa.shape[0]))
opioid_tweets_usa.head()


Data now contains 149,056 unique tweets with opioid keywords and user location mapped to U.S. state.


Unnamed: 0,user,user_location,user_state,user_followers,user_bot,tweet_text,tweet_created_utc,tweet_polarity,tweet_subjectivity
0,BrookeM_Feldman,"philadelphia, pa",pa,1803,False,"1)I heard a heartbreaking story from a person using drugs in Kensington on Friday. 20 years ago and long before today's ""opioid epidemic"" hysteria, his mother was abruptly denied prescription opioids she had been taking for a medical condition...",2018-03-25 15:21:00,-0.058333,0.466667
1,NewLeaf_Service,"trenton, nj",nj,171,True,24-hour opioid hotline goes live | WBFO,2018-03-25 15:21:00,0.136364,0.5
2,ON_YourFeet,"red hill, sc",sc,718,True,"3 Toms River residents allegedly possessed $6,700 in heroin",2018-03-25 15:21:00,-0.1,0.1
3,DrKhouryCDC,"atlanta, ga",ga,9557,False,Read a CDC paper that used genomics to help map an HIV outbreak associated with opioid use.,2018-03-25 15:21:00,0.0,0.0
4,deepcow,"tampa, fl",fl,10835,False,"$1M in fentanyl seized from Texas trio plotting to mail drugs back from Ohio, authorities say.",2018-03-25 15:22:00,0.0,0.0


### View sample of tweets for keyword groups

In [24]:
# Keyword groups
generic_opioid_keywords = 'opiate|opioid|opium'
rx_keywords = 'codine|hydrocodone|morphine|opana|oxycodone|oxycontin|percocet|vicodin'
synthetic_keywords = 'carfentanil|fentanyl'
treatment_keywords = 'burprenorphine|methadone|naltrexone|suboxone|vivitrol'
harm_reduction_keywords = 'injection site|needle exchange|safe injection|supervised injection'
overdose_keywords = 'overdose|naloxone|narcan'

def keyword_sample(df, keywords, label, size=5):
    """Select sample of tweets containing specified keywords."""
    df_temp = df.loc[df['tweet_text'].str.contains(keywords)]
    print("Sample of tweets referencing {}:".format(label))
    return df_temp[['user','user_location','tweet_text']].sample(n=size, random_state=40422)

display(keyword_sample(opioid_tweets_usa, generic_opioid_keywords, 'generic opioids'))

display(keyword_sample(opioid_tweets_usa, rx_keywords, 'prescription opioids'))

display(keyword_sample(opioid_tweets_usa, 'heroin', 'heroin'))

display(keyword_sample(opioid_tweets_usa, synthetic_keywords, 'synthetic opioids'))

display(keyword_sample(opioid_tweets_usa, treatment_keywords, 'opioid addiction treatments'))

display(keyword_sample(opioid_tweets_usa, harm_reduction_keywords, 'harm reduction approaches'))

display(keyword_sample(opioid_tweets_usa, overdose_keywords, 'overdose and overdose reversal'))

Sample of tweets referencing generic opioids:


Unnamed: 0,user,user_location,tweet_text
93382,hipEchik,"san diego, california",DO IT! Scream at the top of your lungs. I think one big civil liberties suit would stop all the madness. #opioidcrisis #FentanylAndHeroinCrisis
138158,Chill_Pill,"gville, texas","It should be decriminalized on the federal level. They're restricting pain meds for the elderly and many states still don't have this option. My mom has degenerate disk disease, they have her on Tylenol since the opiate crack down. She's 71 but won"
110515,Rutgers_Camden,"camden, new jersey, usa","Every week, I've encountered someone who was affected by an opioid overdose some way or somehow. Serena Natal SNC'17 created a guide on #opioid overdose management that's being used in some area treatment facilities #RutgersDelivers"
26933,AirWharton,"washington, dc",60 Minutes (w/ assist from WaPo and DEA whistleblower) wins another Peabody w/ explosive probe into Big Pharma ties to opioid crisis
57992,FlexionInc,"burlington, ma","Does your #osteoarthritis knee pain limit your mobility? If so, ask your doctor if our extended-release, non-opioid treatment option is right for you: #TipTuesday"


Sample of tweets referencing prescription opioids:


Unnamed: 0,user,user_location,tweet_text
72863,tnicholsmd,"portland, or","I only use morphine, and if I prescribe narcotics, I only write for a few MSIR. When people ask me why or mention the side effects, I tell them that’s the point - patients shouldn’t be enjoying them, they should o"
79687,kevinjayheldman,"queens, ny","According to the affidavit, police seized 120,000 pills, including hydrocodone, hydromorphone and oxycodone, as well as fentanyl and morphine. They also recovered stolen guns, electronics, power tools, clothing, car parts and copper pipes."
109384,Brookslei,texas,I take lyrica for nerve pain helps where I could cut back on hydrocodone. What relief😜😉
27470,not_too_shABBY6,"salisbury, md",dude this song on the radio about never being the same after meeting someone is so annoying because she tries to sing “heroin” to sound like “morphine” to make it rhyme and it bothers me so much😂
80006,MaximizeQOL,"bryn mawr, pa","Methadone better for #NAS tx c/w morphine w both statistically & clinically meaningful dec:-length of stay-length of treatmentNot a surprise to many, but research confirmation is important. Long-term studies in progress#SUD #OpioidCrisis"


Sample of tweets referencing heroin:


Unnamed: 0,user,user_location,tweet_text
80957,DiFantastico,"cleveland, oh",I am trying to get together a chapbook for Atlas Review's open submissions that includes an essay about dating a heroin-addicted Elvis impersonator who's dead now. You know. My usual cheerful stuff.
9438,KatyAndNews,"charleston, wv","Raleigh Co. deputies arrested 3 people after finding 620 grams, value of $62,000, of heroin at a home in McArthur on Sunday. Deputies also found meth, prescription pills, $2,000 in cash. Deputies arrested Carrie Jewel, Javon Lampkin and Brandon Veal. #59News"
20247,thefontsavant,"danvers, ma","My hometown is known for its abundance of Dunkin Donuts locations, Chinese food restaurants, nail salons, and heroin. Is that anti-hipster?"
42013,devilradio,"milwaukee, wi","Fiebrink entered the jail Aug. 24 suffering from heroin and alcohol withdrawal, but never got a medical screening. On the night of August 27th and..."
452,TonyaRoot,"myrtle beach, sc","ICYMI: RT : #Georgetown man pleads guilty to #heroin distribution, habitual traffic offender & sentenced to prison"


Sample of tweets referencing synthetic opioids:


Unnamed: 0,user,user_location,tweet_text
104963,DebbyHouse5,"kentucky, usa","Yes, at this point, with all the regs for pain patients & with the illicit fentanyl problem not as prominent as it should be, real patients are being treated horribly & their pain"
110933,WDPAnews,"pittsburgh, pennsylvania",Judge sentences Pittsburgh fentanyl dealer Khalifa Cochran to 10 years in federal prison
62502,KyJohnCGay,"flemingsburg, ky","9 new AUSA’s in WV. Much of effort will likely focus on #opioidcrisis, #heroin, #fentanyl, & #meth."
4615,SafeRxMendocino,"ukiah, ca","What is #fentanyl, the drug that reportedly killed Prince? - Yahoo Lifestyle involved and make a difference in our County with"
98320,reversechapter,florida,"Florida began ""battling"" legal drugs in 2010. It's funny how we apparently still have an ""epidemic"" today. The problem was never caused by ""over prescribing,"" but by the govt's FAILURE to interdict ILLICITLY MANUFACTURED fentanyl analogues. It's easier to attack innocent people."


Sample of tweets referencing opioid addiction treatments:


Unnamed: 0,user,user_location,tweet_text
70790,NDukich,"elgin, il",I don't follow... What the fuck is this tweet about? Subliminal message or something? I tried to follow it but it made my brain hurt. Is this chick on methadone or something or does she have no life and tagged everyone she could think of?
140071,Gemini_1901,"resevoir hill, baltimore",I live for itRide 4 itCry 4 itGet me highRun ur handsUp my thighI'm gonna cumWith lyricsReal shitLifes knocksHardDrop to kneesAnd call GodLiving in MurderlandGotta b quickOn ur feetSpit DopeNo methadoneHopeLeaningOn barzFake niggasFly cars
140159,katcald,san francisco,"1.Can’t get them 2.Cost prohibitive 3.Bups and methadone are enslaving,have street value,diverted4.Vivitrol needs KOR"
118017,ShatterproofHQ,"new york, usa","After a 23-year battle with opioid addiction, Patience Roberts says suboxone gave her a new, sober life. via"
101730,PharmacyWatson,"louisiana, usa","Buy pills here with a overnight delivery and package will be delivered in discrete,order pain meds or anxiety pills Xanax bars,ritaline ,Vicodin,methadone ,Oxycodone,Oxycontin,valium,Percocet,Roxicodon,Tramadol,txt or whatsapp"


Sample of tweets referencing harm reduction approaches:


Unnamed: 0,user,user_location,tweet_text
75026,NeedleExchange_,"portland, me","Hamilton County launches state's 1st needle exchange program - | Chattanooga News, Weather & Sports"
146644,GLIDEsf,"san francisco, ca usa","MYTH: Neighborhoods surrounding a safe injection site (SIS) will have more drug use & crime.FACT: SIS reduce public drug use & discarded syringes in areas surrounding SISs & increase public safety.Join us to tour a realistic model of an SIS, 8/28 – 31."
127783,maiasz,new york,"same argument was used against needle exchange: just a bandaid, we need treatment, so let's not save lives while we're trying to get that..."
80597,NeedleExchange_,"portland, me",County to offer needle exchange | Local news |
95213,52Degrees,"california, usa",Heroin injection site?


Sample of tweets referencing overdose and overdose reversal:


Unnamed: 0,user,user_location,tweet_text
38313,AceDawg11_,"greater soufeast, dc",Cause you not suppose to eat one Tylenol..you can only overdose on things that aren’t good for you
7658,OfficialHemp,"denver, co","Opioid overdose has risen dramatically over the past 15 years and has been implicated in over 500,000 deaths since 2000 -- more than the number of Americans killed in World War II. #OpiodCrisis #OpioidEpidemic #opioid #Medicines"
149044,Editor_JMiller,cleveland,Combatting Texas' durg overdose epidemic [Opinion]
5865,tomdinki,"buffalo, ny",Two Allegany County teenagers have pleaded guilty while a third appears determined to fight the charges in a fatal heroin overdose case:
135009,Lloyd_TV17,"fort worth, texas",He overdose


## Create samples for analysis

In [25]:
# Create training sample
analysis_manual_code = opioid_tweets_usa.sample(n=1000, random_state=40422).copy()
analysis_auto_code = opioid_tweets_usa[~opioid_tweets_usa.index.isin(analysis_manual_code.index)].copy()

# Create sub-sample to be coded by second rater for inter-rater reliability
analysis_irr_code = analysis_manual_code.sample(n=100, random_state=40422)

# Export to csv
analysis_manual_code_file = '{}analysis_manual_code.csv'.format(DATA_PATH)
analysis_manual_code.to_csv(analysis_manual_code_file, index_label='id')

analysis_irr_code_file = '{}analysis_irr_code.csv'.format(DATA_PATH)
analysis_irr_code.to_csv(analysis_irr_code_file, index_label='id')

## Secondary text cleaning

### Prepare text for natural language processing

## Analysis

### Count Tweets By State

In [None]:
def count_by_keyword_state(keywords,group_name):
    """Count tweets containing keyword groups by state"""
    df_temp = df_usa.loc[df_usa['text'].str.contains(keywords, na=False)]
    return df_temp.groupby('state_abbrev').size().reset_index(name='{}_tweets'.format(group_name))

generic_opioid_by_state = count_by_keyword_state(generic_opioid_keywords,'generic_opioid')
heroin_by_state = count_by_keyword_state('heroin','heroin')
other_opioid_by_state = count_by_keyword_state(other_opioid_keywords,'other_opioid')
methadone_by_state = count_by_keyword_state('methadone','methadone')
other_synthetic_by_state = count_by_keyword_state(other_synthetic_keywords,'other_synthetic')

# Merge
tweets_by_state = generic_opioid_by_state.merge(heroin_by_state,how='left',on='state_abbrev',validate='1:1')
tweets_by_state = tweets_by_state.merge(other_opioid_by_state,how='left',on='state_abbrev',validate='1:1')
tweets_by_state = tweets_by_state.merge(methadone_by_state,how='left',on='state_abbrev',validate='1:1')
tweets_by_state = tweets_by_state.merge(other_synthetic_by_state,how='left',on='state_abbrev',validate='1:1')

# Calculate total tweets
tweets_by_state['total_tweets'] = tweets_by_state.sum(axis=1)

tweets_by_state.head()

### Load and merge overdose data from CDC

In [None]:
# Import state opioid overdose totals from CDC
# https://wonder.cdc.gov/mcd-icd10.html
od_by_state_ucd = pd.read_csv('../data/cdc_wonder/state_opioid_od_total_2016.txt',
                              sep='\t', usecols=[*range(1,5),*range(7,9)],
                              skipfooter=62, engine='python')

# Clean column headers
od_by_state_ucd.rename(columns={'Deaths': 'deaths_total', 
                             'Age Adjusted Rate': 'adjusted_rate_total',
                             'Age Adjusted Rate Standard Error': 'adjusted_rate_se_total'}, inplace=True)

od_by_state_ucd.head()

In [None]:
# Import state opioid overdose by drug type from CDC
# Note: Some deaths involve multiple drug types which is why these sum to more than the total
od_by_state_mcd = pd.read_csv('../data/cdc_wonder/state_opioid_od_substance_2016.txt',
                              sep='\t', usecols=[*range(1,4),*range(5,7),*range(9,11)],
                              skipfooter=74, engine='python')

# Keep only heroin
od_by_state_mcd = od_by_state_mcd.loc[od_by_state_mcd['Multiple Cause of death']=='Heroin']
od_by_state_mcd = od_by_state_mcd[od_by_state_mcd.columns.drop(od_by_state_mcd.filter(regex='Cause').columns)]

# Clean column headers
od_by_state_mcd.rename(columns={'Deaths': 'deaths_heroin', 
                             'Age Adjusted Rate': 'adjusted_rate_heroin',
                             'Age Adjusted Rate Standard Error': 'adjusted_rate_se_heroin'}, inplace=True)

od_by_state_mcd.head()

In [None]:
# Left merge since heroin data is missing for Montana, Nebraska, South Dakota, and Wyoming
od_by_state = od_by_state_ucd.merge(od_by_state_mcd,how='left',on=['State','State Code','Population'],validate='1:1')

# Clean column headers
od_by_state.columns = map(str.lower, od_by_state.columns)
od_by_state.rename(columns={'state': 'state_name', 'state code': 'fips'}, inplace=True)
od_by_state['state_abbrev'] = od_by_state['state_name'].map(us_state_abbrev)
cols = od_by_state.columns.tolist()
cols= cols[-1:] + cols[0:-1]
od_by_state = od_by_state[cols]

od_by_state.head()

In [None]:
# Merge with tweets_by_state
od_tweets_by_state = od_by_state.merge(tweets_by_state,on='state_abbrev',validate='1:1')

od_tweets_by_state['total_tweets_per_100k'] = ((od_tweets_by_state['total_tweets'] 
                                               / od_tweets_by_state['population']) * 10**5)

def calc_pct(df,numerator,denominator):
    """Calculate contribution of each keyword group to total"""
    return (df[numerator].divide(df[denominator], fill_value=0) * 100).round(1)

od_tweets_by_state['pct_generic_tweets'] = calc_pct(od_tweets_by_state,'generic_opioid_tweets','total_tweets')
od_tweets_by_state['pct_heroin_tweets'] = calc_pct(od_tweets_by_state,'heroin_tweets','total_tweets')
od_tweets_by_state['pct_rx_tweets'] = calc_pct(od_tweets_by_state,'other_opioid_tweets','total_tweets')
od_tweets_by_state['pct_methadone_tweets'] = calc_pct(od_tweets_by_state,'methadone_tweets','total_tweets')
od_tweets_by_state['pct_fentanyl_tweets'] = calc_pct(od_tweets_by_state,'other_synthetic_tweets','total_tweets')

od_tweets_by_state.head()

### Figure 1: Visualization of Tweets per State with Breakdown by Drug Type

In [None]:
import plotly
import plotly.plotly as py

plotly.tools.set_credentials_file(username='mefryar', api_key='uORkhBM3WRj3Su4flGQm')

scl = [[0.0, 'rgb(242,240,247)'], [0.2, 'rgb(218,218,235)'], [0.4, 'rgb(188,189,220)'],
       [0.6, 'rgb(158,154,200)'], [0.8, 'rgb(117,107,177)'], [1.0, 'rgb(84,39,143)']]

od_tweets_by_state['text'] = '<br>' + od_tweets_by_state['state_name'] + '<br>' + '<br>' +\
            'Opioid/Opiate/Opium: ' + od_tweets_by_state['pct_generic_tweets'].astype(str) + '%' + '<br>' +\
            'Heroin: ' + od_tweets_by_state['pct_heroin_tweets'].astype(str) + '%' + '<br>' +\
            'Prescription Opioids: ' + od_tweets_by_state['pct_rx_tweets'].astype(str) + '%' + '<br>' +\
            'Fentanyl/Carfentanil: ' + od_tweets_by_state['pct_fentanyl_tweets'].astype(str) + '%' + '<br>' +\
            'Methadone: ' + od_tweets_by_state['pct_methadone_tweets'].astype(str) + '%'
            
data = [dict(
            type='choropleth',
            colorscale = scl,
            autocolorscale = False,
            locations = od_tweets_by_state['state_abbrev'],
            z = od_tweets_by_state['total_tweets_per_100k'].round(2),
            locationmode = 'USA-states',
            text = od_tweets_by_state['text'],
            marker = dict(
                line = dict(
                    color = 'rgb(255,255,255)',
                    width = 2
                )
            ),
            colorbar = dict(
                title = "Tweets per 100,000")
            )
        ]

layout = dict(
            title = '<b>Opioid-Related Tweets Per 100k</b><br>' + \
                    '(Hover for Breakdown of Opioid-Related Tweets)',
            geo = dict(
                scope='usa',
                projection=dict(type='albers usa'),
                showlakes = True,
                lakecolor = 'rgb(255, 255, 255)'
            )
        )
    
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='state-cloropleth-tweets')

### Figure 2: Relationship between Tweet Frequency and Overdose Deaths

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

x = od_tweets_by_state.adjusted_rate_total.astype(float)
y = od_tweets_by_state.total_tweets_per_100k

plt.style.use('fivethirtyeight')

fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(x,y, s=20, alpha=0.75)
for i, txt in enumerate(od_tweets_by_state['state_abbrev']):
    ax.annotate(txt, ((x[i]-0.2),(y[i]-0.5)))
    
ax.text(37, 21.5, 'Correlation: {}'.format(format(pearsonr(x, y)[0], '.3f')),
        bbox={'pad':10, 'alpha':0.75}, fontsize=15)
    
ax.set_xlabel('Age-Adjusted Opioid Overdose Rate (Per 100,000) - 2016', fontsize=18)
ax.set_ylabel('Opioid Related Tweets (Per 100,000) - Spring 2018', fontsize=18)
fig.suptitle('Relationship between Opioid Overdoses (2016) and Opioid Related Tweets (Spring 2018)', fontsize=20)

fig.savefig('../figures/scatter_tweets_overdoses.png', bbox_inches='tight')

## Conclusion

This first-stage analysis demonstrates that conversations taking place about opioids on Twitter are correlated with real-world public health outcomes, such as overdose deaths. A quick visual comparison of the choropleth of tweet frequency (Figure 1 above) with [choropleths of overdose death rates](https://www.cdc.gov/drugoverdose/data/statedeaths.html) reveals remarkable similarities. Plotting the relationship between tweet frequency and opioid overdose deaths (Figure 2) confirms that a positive correlation exists between these two indicators.

This is promising for subsequent analyses in which I will look not only at the frequency of these conversations but also their sentiment. Specifically, I plan to next look at measures of polarity and subjectivity before eventually developing a custom sentiment analysis algorithm that codes for stigma.

Other next steps include:
 - Analyzing the relationship between conversations about opioids and access to treatment
 - Analyzing the relationship between conversations that specifically mention treatment and access to treatment
 - Further refining data cleaning procedures to remove suspected bots