# Analyzing Twitter Sentiment 

## Part 2: Streaming and storing tweets with tweepy and pymongo

In social media, trends move at incredible speed.  A hashtag can start trending, become popular, and then die out in a matter of days, or even hours.  At the forefront of social media trends is Twitter, an online social media site that allows people to write short, 140-character comments on anything from politics to sports to video games.

The sheer volume of Twitter data makes analysis challenging, however.  Roughly 6,000 tweets are sent every second, which means that finding the latest trends is akin to looking for a needle in a haystack while getting sprayed by a firehose.

Fortunately, there are some good libraries for dealing with Twitter data that allow you to extract meaning from this information firehose.  In this blog post, I will show you how to set up a Twitter sentiment analyzer that lets you see the sentiment and location of the latest trends in the US and around the world.

## Table of Contents
  1. [Introduction](#intro)
    1. [Necessary Libraries](#nl)
    2. [Accessing Twitter](#at)
  2. [Using Tweepy and mongodb to get and store streaming data](#tm)
    1. getting trends
    2. streaming tweets
    3. storing tweets in mongodb
  3. Sentiment Analysis
    1. How to build a sentiment analyzer from scratch
  4.  Plotting location


## Necessary Libraries <a id="nl"></a>
For this example we will need to scrape Twitter data, connect to MongoDB, store and analyze the results in tables, train a sentiment classifier, and then plot the results.  To do this, we will use the following libraries:
  - <a href="http://tweepy.readthedocs.io/en/v3.5.0/">tweepy</a>: A library for interacting with Twitter's API
  - <a href="https://api.mongodb.com/python/current/">pymongo</a>: A library for interacting with MongoDB
  - <a href="http://pandas.pydata.org/">pandas</a>: One of the standard libraries for interacting with data in Python
  - <a href="http://scikit-learn.org/stable/">sklearn</a>: Python library for machine learning
  - <a href="http://geopandas.org/install.html">geopandas</a>: Library for dealing with geographic data
  - <a href="http://python-visualization.github.io/folium/">folium</a>: Library for creating leaflet.js maps


In [3]:
import tweepy
import pymongo
import sklearn
import geopandas
import folium

# a package for storing api keys
from src.apikeys import TWITTER

## Accessing Twitter <a id="at"></a>

Twitter allows anyone to access its API for free, subject to certain rate limits.  To do this, you first need to create a Twitter app.  Go to https://apps.twitter.com/ and click "Sign in" at the top left of the page.  If you don't have an account, you will need to create one.  Once you are signed in, click to create an app and fill in the information.  Although the website is a required field, you can fill it in with either a GitHub profile or a placeholder site (if you don't have a GitHub profile, I'd recommend creating one for free).

<img width="600" src="images/creating_twitter_app.png">

Once you have done that, go to the "Keys and Access Tokens" tab.  This contains your API key and your API secret key.  You will need both of these to access Twitter.  You will also need to create an application access token.  At the bottom of the screen, underneath "Your Access Token", click "Create my access token".  This will generate two more keys that you will need to access Twitter.

<img width="600" src="images/create_access_tokens.png">


Be careful with your access tokens.  Do not share them freely, and do not upload them to a public place such as GitHub (the keys shown here are not my real access keys).
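One way to keep keys out of source control is to read them from environment variables.  The sketch below shows what a module like `src/apikeys.py` might contain.  The environment variable names and the `load_twitter_keys` helper are assumptions made for illustration; they are not part of the original project.

```python
import os
from types import SimpleNamespace


def load_twitter_keys(environ=os.environ):
    """Build a TWITTER-style namespace from environment variables.

    The variable names below are hypothetical; use whatever names you like.
    """
    names = {
        "CONSUMER_KEY": "TWITTER_CONSUMER_KEY",
        "CONSUMER_SECRET": "TWITTER_CONSUMER_SECRET",
        "ACCESS_TOKEN": "TWITTER_ACCESS_TOKEN",
        "ACCESS_SECRET": "TWITTER_ACCESS_SECRET",
    }
    # fail early with a clear message if any key has not been exported
    missing = [var for var in names.values() if var not in environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return SimpleNamespace(**{attr: environ[var] for attr, var in names.items()})
```

After exporting the four variables in your shell, `TWITTER = load_twitter_keys()` gives an object with the same attribute names used in the rest of this post.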

Once we have created our app, we can begin to use it to get tweets via tweepy.  Substitute your consumer keys and access tokens for the TWITTER.CONS... values below.  Once we have done this, we can make requests as follows:

In [7]:
auth = tweepy.OAuthHandler(TWITTER.CONSUMER_KEY, TWITTER.CONSUMER_SECRET)
auth.set_access_token(TWITTER.ACCESS_TOKEN, TWITTER.ACCESS_SECRET)
api = tweepy.API(auth)

In [16]:
search = api.search(q='football',count=5)
print(type(search))
print(search)

<class 'tweepy.models.SearchResults'>
[Status(_api=<tweepy.api.API object at 0x7fb3bdde3c50>, _json={'created_at': 'Sat Oct 21 22:40:46 +0000 2017', 'id': 921868893639110657, 'id_str': '921868893639110657', 'text': "Kentucky Wildcats football is the fool's gold of athletics. #fb", 'truncated': False, 'entities': {'hashtags': [{'text': 'fb', 'indices': [60, 63]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 26724052, 'id_str': '26724052', 'name': 'BP ❌', 'screen_name': 'bpriggs', 'location': '', 'description': 'Life is good.', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 277, 'friends_count': 721, '

We can see that each search result contains a large number of fields, such as hashtags, text, user information, and the time it was created.  For this task we are only interested in three properties: the text, the time it was created, and the user-reported location.  These fields are stored as attributes on the tweepy results.

In [20]:
for search_result in search:
    print(search_result.text)
    print(search_result.created_at)
    print(search_result.user.location)

Kentucky Wildcats football is the fool's gold of athletics. #fb
2017-10-21 22:40:46

Ok a snack and college football....finally #collegefootball #fall #sayingyestothelandscapingplan… https://t.co/qlbhLnvn3y
2017-10-21 22:40:46
Las Vegas
Hazard showed mad football iq to gain the freekick for that goal
2017-10-21 22:40:46
London, England
RT @Get_Nooky: Ain't no way VSU not winning that title this year in football..Them boys balling
2017-10-21 22:40:45
Virginia, USA
RT @Timinole: Our football coach is Jeb Bush. “Please clap.” “Please keep cheering.”
2017-10-21 22:40:45
St Augustine, FL
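The same three fields can also be pulled out of the raw dictionary that tweepy keeps in `_json`.  The following is a minimal sketch: `extract_fields` is a hypothetical helper, the sample dictionary is trimmed from the output above, and the `created_at` format string is inferred from that output.

```python
from datetime import datetime

# Format inferred from values such as 'Sat Oct 21 22:40:46 +0000 2017'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_fields(tweet):
    """Return the text, creation time, and user location of a raw tweet dict."""
    user = tweet.get("user") or {}
    return {
        "text": tweet.get("text"),
        "created_at": datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT),
        # the user-reported location is often an empty string
        "location": user.get("location") or None,
    }


sample = {
    "created_at": "Sat Oct 21 22:40:46 +0000 2017",
    "text": "Kentucky Wildcats football is the fool's gold of athletics. #fb",
    "user": {"screen_name": "bpriggs", "location": ""},
}
record = extract_fields(sample)
print(record["created_at"].isoformat())
print(record["location"])
```

Mapping an empty location to `None` makes it easier to filter out tweets that cannot be placed on a map later on.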


## Accessing streaming data and storing it in mongodb <a id="tm"></a>

When we look at the Twitter data, we notice that it doesn't always have values for every field (the first tweet above has an empty location, for example).  In addition, some fields are missing entirely from some tweets.  If we wanted to store this data in a relational database, we would have to substantially process the tweets before storage.  Since we want to store the tweets as they arrive, a NoSQL database is a better fit.  One of the most popular NoSQL databases is MongoDB, which allows efficient storage of unstructured data.

To install MongoDB, follow the instructions at:
  - https://treehouse.github.io/installation-guides/mac/mongo-mac.html (mac)
  - https://www.howtoforge.com/tutorial/install-mongodb-on-ubuntu-16.04/#install-mongodb-on-ubuntu- (ubuntu)
  - https://docs.mongodb.com/manual/tutorial/install-mongodb-on-windows (windows)
  
Once you have installed MongoDB, you can use pymongo to access it.  Note that a fresh MongoDB install does not enable authentication by default, so the credentials below only exist if you created them during setup.  This example assumes a user "admin" with the password "admin123"; substitute these or your own credentials for MONGO.USER and MONGO.PASSWORD.

In [21]:
from pymongo import MongoClient

# assumption: src.apikeys also defines MONGO with USER and PASSWORD attributes
from src.apikeys import MONGO


def generate_mongo_table_connection(table_name):
    """Returns a connection to the twitter
    mongodb table in the twitter_database

    Parameters:
    -----------
    table_name: str
        name of table to connect to

    Returns:
    --------
    pymongo connection to the mongodb table
    """
    client = MongoClient('mongodb://{}:{}@localhost:27017'.format(
        MONGO.USER, MONGO.PASSWORD))
    # the database and table are created lazily on the first insert
    database = client['twitter_database']
    return database[table_name]
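If a username or password contains characters such as `@`, `:` or `/`, they need to be percent-escaped before being placed in the connection string.  Here is a hedged sketch using `urllib.parse.quote_plus` from the standard library.  `build_mongo_uri` is a hypothetical helper, and the host and port defaults simply mirror the values used above.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host="localhost", port=27017):
    """Return a mongodb:// URI with percent-escaped credentials."""
    return "mongodb://{}:{}@{}:{}".format(
        quote_plus(user), quote_plus(password), host, port)


# '@' and ':' in the password are escaped so they are not read as separators
print(build_mongo_uri("admin", "p@ss:word"))
```

The resulting string can be passed to `MongoClient` in place of the hand-formatted one.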

## Getting trends and accessing streaming data

To access trends, we can also use the tweepy API.  The api.trends_place(woe_id) method takes a "where on earth ID" (WOEID), which can refer to any location.  To get a WOEID for a specific location, you can use http://www.woeidlookup.com/.  For this example, we will look at tweets for the United States, which has a WOEID of 23424977.
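WOEIDs can also be looked up programmatically.  In tweepy 3.x, `api.trends_available()` returns the locations for which Twitter has trend data, each as a dictionary that includes `name` and `woeid` keys.  The sketch below searches such a list.  `find_woeid` is a hypothetical helper, and the sample list is hand-written for illustration rather than real API output.

```python
def find_woeid(locations, name):
    """Return the woeid of the first location whose name matches, else None."""
    wanted = name.strip().lower()
    for location in locations:
        if location.get("name", "").lower() == wanted:
            return location.get("woeid")
    return None


# Hand-written sample in the shape assumed above; a real list would come
# from api.trends_available()
sample_locations = [
    {"name": "Worldwide", "woeid": 1},
    {"name": "United States", "woeid": 23424977},
]
print(find_woeid(sample_locations, "united states"))
```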

The following function lets us access the current trends.  Again, put your own API tokens in for TWITTER.CONSUMER_KEY and the related arguments.  The first five trends are printed out below.  As we can see, the name of each trend is the hashtag or phrase itself.  The top five trends are:
  - \#TENNvsBAMA
  - \#NapsInFiveWords
  - \#FNCE
  - \#MyRelationshipWasOverWhen
  - Bill O'Reilly

In [25]:
US_WOEID = 23424977

def get_trends():
    """Returns a list of current trends in the US

    Parameters:
    -----------
    None (the credentials are read from TWITTER)

    Returns:
    --------
    list of trends (dictionaries), each trend has the keys:
        - name
        - promoted_content
        - query
        - tweet_volume
        - url
    """
    auth = tweepy.OAuthHandler(TWITTER.CONSUMER_KEY, TWITTER.CONSUMER_SECRET)
    auth.set_access_token(TWITTER.ACCESS_TOKEN, TWITTER.ACCESS_SECRET)
    api = tweepy.API(auth)
    # the response is a one-element list; the trends are under 'trends'
    response = api.trends_place(US_WOEID)
    return response[0]['trends']

print(get_trends()[:5])

[{'name': '#TENNvsBAMA', 'url': 'http://twitter.com/search?q=%23TENNvsBAMA', 'promoted_content': None, 'query': '%23TENNvsBAMA', 'tweet_volume': None}, {'name': '#NapsInFiveWords', 'url': 'http://twitter.com/search?q=%23NapsInFiveWords', 'promoted_content': None, 'query': '%23NapsInFiveWords', 'tweet_volume': None}, {'name': '#FNCE', 'url': 'http://twitter.com/search?q=%23FNCE', 'promoted_content': None, 'query': '%23FNCE', 'tweet_volume': None}, {'name': '#MyRelationshipWasOverWhen', 'url': 'http://twitter.com/search?q=%23MyRelationshipWasOverWhen', 'promoted_content': None, 'query': '%23MyRelationshipWasOverWhen', 'tweet_volume': None}, {'name': "Bill O'Reilly", 'url': 'http://twitter.com/search?q=%22Bill+O%27Reilly%22', 'promoted_content': None, 'query': '%22Bill+O%27Reilly%22', 'tweet_volume': 35056}]
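Note that `tweet_volume` is `None` for most of these trends, so any code that ranks trends by volume has to handle missing values.  Here is a minimal sketch, with `rank_by_volume` as a hypothetical helper and the sample trimmed from the output above:

```python
def rank_by_volume(trends):
    """Sort trends by tweet_volume, largest first; unknown volumes go last."""
    return sorted(
        trends,
        key=lambda trend: (trend.get("tweet_volume") is None,
                           -(trend.get("tweet_volume") or 0)),
    )


sample_trends = [
    {"name": "#TENNvsBAMA", "tweet_volume": None},
    {"name": "#FNCE", "tweet_volume": None},
    {"name": "Bill O'Reilly", "tweet_volume": 35056},
]
for trend in rank_by_volume(sample_trends):
    print(trend["name"], trend["tweet_volume"])
```

Because Python's sort is stable, trends without a volume keep the order in which Twitter returned them.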


## Accessing the twitter stream

The search we used before is rate limited, which means we can't retrieve many tweets with it.  For our analysis we will need thousands of tweets, which means that we will need to stream them.  In tweepy, you access the streaming API by first creating a subclass of tweepy's StreamListener.  The stream listener has three methods that need to be overridden in this case:
  1. \_\_init\_\_: the initialization method for creating an object
    - For this method we will need to add two arguments:
      - table, the table we want to store the tweets in
      - n_hours, the number of hours we want the stream to run before disconnecting
  2. on\_status: the code that runs when the stream gets a new tweet.
    - This method will need to return True or False.
      - True will continue streaming tweets
      - False will disconnect the stream
  3. on\_error: what to do when the stream encounters an error.
    - For this we will just raise an exception, which stops the stream

Both \_\_init\_\_ arguments are stored as instance variables, so that on\_status can later write to the mongodb table and check how long the stream has been running.



In [27]:
import datetime


class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, table, n_hours):
        super(CustomStreamListener, self).__init__()
        # store the time the stream starts
        self.begin_time = datetime.datetime.now()
        self.n_hours = n_hours
        self.table = table

    def on_status(self, status):
        # store the raw tweet as a mongodb document
        self.table.insert_one(status._json)
        # disconnect if the stream has run for more than n_hours;
        # total_seconds() is used because .seconds ignores whole days
        now = datetime.datetime.now()
        if (now - self.begin_time).total_seconds() > 60*60*self.n_hours:
            return False
        return True

    def on_error(self, status_code):
        # raising stops the stream and surfaces the HTTP status code
        raise Exception(status_code)
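The time limit in `on_status` comes down to `timedelta` arithmetic.  One detail worth knowing is that `timedelta.seconds` holds only the seconds component (days are stored separately), whereas `total_seconds()` covers the whole duration.  The sketch below isolates the check in a hypothetical `has_expired` helper so that it can be tried without connecting to Twitter.

```python
import datetime


def has_expired(begin_time, now, n_hours):
    """Return True once more than n_hours have passed since begin_time."""
    elapsed = now - begin_time
    return elapsed.total_seconds() > 60 * 60 * n_hours


start = datetime.datetime(2017, 10, 21, 12, 0, 0)
later = start + datetime.timedelta(hours=25)

print((later - start).seconds)          # seconds component only
print((later - start).total_seconds())  # full duration in seconds
print(has_expired(start, later, n_hours=24))
```

For streams shorter than a day the two attributes agree, so the difference only matters for long-running collection jobs.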


We can then use this listener to access the streaming data.  Once you have created a StreamListener object, you can pass it to a tweepy Stream, along with an OAuthHandler object like the one we created before.  This stream can then be used to stream tweets.

In [28]:
def create_twitter_stream(table, n_hours):
    """Creates a twitter stream object that will insert tweets into
    table and will terminate in n_hours

    Parameters:
    -----------
    table: connection to mongodb table
    n_hours: number of hours to run before termination

    Returns:
    --------
    tweepy Stream object
    """
    auth = tweepy.OAuthHandler(TWITTER.CONSUMER_KEY, TWITTER.CONSUMER_SECRET)
    auth.set_access_token(TWITTER.ACCESS_TOKEN, TWITTER.ACCESS_SECRET)

    # CustomStreamListener is the class defined above, not part of tweepy
    stream_listener = CustomStreamListener(table, n_hours=n_hours)
    twitter_stream = tweepy.Stream(auth, stream_listener)
    return twitter_stream


We will now put the previous functions together.  First we get the current trends via a call to get_trends() and keep the top five.  Then we create a Twitter stream and use it to pick up tweets related to those trends.  To get enough data for analysis, we will need to run this for several hours.

In [29]:
def stream_trends(n_hours=2):
    """Queries the list of trends and streams matching tweets into the
    'streaming_topics' table of the twitter database

    Parameters:
    -----------
    n_hours: number of hours to stream tweets for before exiting

    Returns:
    --------
    table: connection to mongodb table
    table_name: name of table in mongodb database,
    trend_names: list of names of trends
    """
    trend_list = get_trends()
    trend_names = [trend['name'] for trend in trend_list]
    # only take the top five trends
    trend_names = trend_names[:5]
    table_name = 'streaming_topics'
    table = generate_mongo_table_connection(table_name)

    twitter_stream = create_twitter_stream(table, n_hours)
    # blocks until the listener disconnects (after roughly n_hours)
    twitter_stream.filter(track=trend_names)
    return table, table_name, trend_names

table, name, trends = stream_trends(n_hours=3)
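Because all five trends are written to the same table, a later analysis step will likely need to know which trend each stored tweet belongs to.  The following sketch does a simple case-insensitive substring match on the tweet text.  `match_trends` is a hypothetical helper, and this matching is only a rough approximation, not Twitter's own `track` logic.

```python
def match_trends(text, trend_names):
    """Return the trend names that appear in text, ignoring case."""
    lowered = text.lower()
    return [name for name in trend_names if name.lower() in lowered]


trend_names = ["#TENNvsBAMA", "#NapsInFiveWords", "#FNCE"]
print(match_trends("Roll Tide! #tennvsbama", trend_names))
print(match_trends("nothing relevant here", trend_names))
```

With the stored data, the same helper could be applied to `tweet.get('text', '')` for each document returned by `table.find()`.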


Now that we have created a twitter stream and found topics, it's time to do some cool