# Brand sentiment tracking with Tweepy and vaderSentiment

In [5]:
import time
from IPython.display import display, clear_output
import tweepy
import csv
from geopy.geocoders import Nominatim
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = 30
pd.options.display.max_colwidth = 150


### Authentication
First, make sure you've set up access to the Twitter API, and enter your credentials below.

In [7]:
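ACCESS_TOKEN = ''     # paste your access token here
ACCESS_SECRET = ''    # paste your access token secret here
CONSUMER_KEY = ''     # paste your consumer (API) key here
CONSUMER_SECRET = ''  # paste your consumer (API) secret here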
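
If you'd rather not paste keys into the notebook, one option is to read them from environment variables. This is only a sketch: the variable names (`TWITTER_ACCESS_TOKEN` and friends) and the `load_credentials` helper are made up for illustration and are not part of Tweepy.

```python
import os

# Hypothetical environment variable names; use whatever names you export.
CREDENTIAL_VARS = [
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
]

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any are missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}
```

With the variables exported, `creds = load_credentials()` followed by `ACCESS_TOKEN = creds["TWITTER_ACCESS_TOKEN"]` (and so on) would fill in the four constants.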

### Brand name extraction
The streaming API lets you filter based on tracked terms (in this case, the Twitter handle(s) of the brands we're tracking). But sometimes Tweets pass through the filter that aren't actually related to the brand(s) we've specified.
<br>

The function below will let us double-check that we're only storing Tweets that explicitly mention the brand name(s) of interest. And since we're tracking five brands at once, this also will let us filter the dataset later on.

In [8]:
def get_brand(tweet):
    tweet = tweet.lower()
    # Use lowercase names here, since the tweet text is lowercased above
    brands = ['input brand names here (ex. if you have @companyNorthAmerica and @companyEU, just put "company" to aggregate these.)']
    for brand in brands:
        if brand in tweet:
            return brand
    # No brand matched
    return 'NOBRAND' # These will be filtered out
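
To see how the aggregation works, here's a small standalone sketch. The brand names (`acme`, `globex`) and the sample tweets are invented, and the brand list is passed in as a parameter just to keep the example self-contained.

```python
def get_brand(tweet, brands):
    """Return the first brand name found in the tweet text, else 'NOBRAND'."""
    text = tweet.lower()
    for brand in brands:
        if brand in text:
            return brand
    return 'NOBRAND'

# Hypothetical brand names, lowercase so they match the lowercased tweet.
brands = ['acme', 'globex']

print(get_brand('Thanks @AcmeEU, great support!', brands))        # acme
print(get_brand('@GlobexNorthAmerica my order is late', brands))  # globex
print(get_brand('Nothing to see here', brands))                   # NOBRAND
```

Because this is a plain substring match, a short brand name can also match inside unrelated words, so it's worth picking names that are reasonably distinctive.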

### Extract coordinates
The Tweepy API has a tweet.coordinates attribute; however, this is null about 99.99% of the time.
<br>

Instead, we'll use the location entry from user bios, and try to extract the coordinates using the Geopy Nominatim tool. Nominatim searches text-based locations, then tries to return the coordinates (and the degree of certainty).
<br>

The location in user bios is a free-text field. So it could be useful stuff like the following (each one turns into lat, long, confidence):
* Dallas, Texas, USA => (32.7762719, -96.7968559, 0.8841451879795001)
* Paris => (48.8566969, 2.3514616, 0.9417101715588673)
* 123 Main street, Anytown, USA => ( ###, ###, 0.9999999)
<br>

But this is Twitter. So there's also plenty of stuff like:
* 🌎 planet earth 🌎
* ~ Soundcloud link ~
* Coachella 2020 :(
<br>

Because we're trying to track sentiment by location, we will filter out records where no location could be determined. In addition, we'll filter out records where the location confidence is too low. In the function below, confidence means Nominatim's `importance` score, and I've arbitrarily chosen a cutoff of 0.70 (70%), but this could be adjusted as needed.
<br>

NOTE: If you need to extract other attributes besides just lat/long, you can do so in the function below. I ended up using 'Continent' and 'Country' as well.

In [11]:
# Create the geocoder once and reuse it across calls
locator = Nominatim(user_agent="myGeocoder")

def extract_coords(location):
    # Look up the free-text location; geocode() returns None if nothing matches
    result = locator.geocode(location)
    if result is None or result.raw['importance'] < 0.70:
        return 'NOTFOUND' # These will be filtered out.
    return result.latitude, result.longitude
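
The threshold rule is easy to try out without hitting Nominatim. The sketch below swaps the live geocoder for a canned lookup table: `FakeLocation`, `CANNED`, and `filter_coords` are illustrative names, the Dallas and Paris values are the ones quoted above, and the 'Springfield' entry is invented to show a below-threshold result.

```python
from collections import namedtuple

# Stand-in for the object geopy returns; only the fields used here.
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude', 'raw'])

# Canned lookups instead of live Nominatim calls.
CANNED = {
    'Dallas, Texas, USA': FakeLocation(32.7762719, -96.7968559,
                                       {'importance': 0.8841451879795001}),
    'Paris': FakeLocation(48.8566969, 2.3514616,
                          {'importance': 0.9417101715588673}),
    # Invented entry with a made-up score below the cutoff.
    'Springfield': FakeLocation(39.799, -89.644, {'importance': 0.55}),
}

def filter_coords(location, min_importance=0.70):
    """Apply the 'not found or below threshold' rule to a geocoding result."""
    if location is None or location.raw['importance'] < min_importance:
        return 'NOTFOUND'
    return location.latitude, location.longitude

for text in ['Dallas, Texas, USA', 'Paris', 'Springfield', '~ Soundcloud link ~']:
    # CANNED.get() returns None for unknown text, like a failed lookup
    print(text, '->', filter_coords(CANNED.get(text)))
```

Making the cutoff a parameter (`min_importance`) is one way to experiment with stricter or looser filtering without editing the function body.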

### Sentiment extraction
Now for the main event: extracting sentiment.
<br>

The first thing we need to do is clean up the text a bit. We'll drop URLs from the text, as well as retweet prefixes (ex. "RT @Account: tweet text"). We'll still keep retweets, the rationale being that a person in Canada retweeting a tweet originating in Europe is still reflective of the sentiment of that person in Canada.
<br>

Next, vaderSentiment. This is an open source sentiment analysis tool boasting a classification accuracy of 84%. It's specifically developed for social media use (which is why we won't drop things like emojis: vaderSentiment is designed to take those into account, too).
<br>

vaderSentiment returns positive, negative, neutral, and compound scores. We'll use the compound score, which ranges from -1 (most negative) to +1 (most positive) and is described on GitHub as a "normalized, weighted composite score".

To learn more about vaderSentiment, check out: https://github.com/cjhutto/vaderSentiment#about-the-scoring
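
If you want labels rather than raw numbers, the compound score can be bucketed. The sketch below uses the ±0.05 cutoffs that the vaderSentiment README gives as typical thresholds; `label_sentiment` is an illustrative helper, not part of the library, and the cutoff is a parameter in case you want a wider neutral band.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up compound scores, just to show the three buckets.
for score in [0.6, -0.3, 0.0]:
    print(score, '->', label_sentiment(score))
```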

In [12]:
def clean_text(text):
    # Remove URLs (raw string for the regex; re.sub drops every URL, not just the first)
    text = re.sub(r"https?://\S+", "", text)
    # Drop a leading retweet prefix, e.g. "RT @XYZ:"
    text = re.sub(r"^RT @\w+:?\s*", "", text)
    # Trim whitespace left behind by the removals
    return text.strip()

analyser = SentimentIntensityAnalyzer()
def vader_senti(sentence):
    score = analyser.polarity_scores(sentence)
    return score['compound']

### Brand Sentiment Streaming
Now we can set up the streaming class. For this project, I followed about 20 Twitter handles related to 5 companies (which I've redacted from this notebook, because while I highly doubt there's any grounds for a cease and desist here, I don't really want to find out). 
<br>
The Tweepy StreamListener will track content related to all of the Twitter handles mentioned. We'll filter to English (since that's the language our sentiment extracting tool uses). 
<br>

The on_status method is where we call the feature extractions using the functions above. This is also where we can establish a "filtration hierarchy"; in other words, where we decide what to write to our .txt file. As defined previously, we'll ignore (i.e. not write to the .txt file) Tweets that don't mention an identifiable brand, or whose coordinates can't be identified.
<br>

Last, the on_error and on_timeout methods: These are just a couple of methods to keep the stream running in the event that an error occurs, rather than killing the process.

In [14]:
# Define brands here
brand1 = ['brand1handle1','brand1handle2']
brand2 = ['brand2handle1','brand2handle2']
all_brands = brand1 + brand2

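# Note: tweepy.StreamListener is the Tweepy 3.x interface (Tweepy 4.0 merged it into tweepy.Stream)
class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        super().__init__(api)
        self.api = api
        self.me = api.me()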
        
    def on_status(self, tweet):
        
        # Ignore tweets w/no location data 
        if tweet.user.location:  # skips None and empty strings
            
            # Ignore tweets w/no brand clearly mentioned
            brand = get_brand(tweet.text)
            if brand != 'NOBRAND':
                
                # Ignore tweets with no identifiable coordinates
                coord_extract = extract_coords(tweet.user.location)
                time.sleep(1) # to prevent Nominatim timeout
                if coord_extract != 'NOTFOUND':
                    
                    # Extract sentiment
                    text_clean = clean_text(tweet.text)
                    senti = vader_senti(text_clean)
                    
                    # Write output to .txt
                    rec = [brand, tweet.created_at, tweet.user.location, 
                           tweet.text, coord_extract, senti]     
                    
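                    # newline='' is what the csv module expects; utf-8 keeps emojis intact
                    with open('data/stream_out.txt', 'a', newline='', encoding='utf-8') as f: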
                        writer = csv.writer(f)
                        writer.writerow(rec)
    
    def on_error(self, status_code):
        print('Status code error:', status_code)
        return True # Don't kill the stream

    def on_timeout(self):
        print('Timeout')
        return True # Don't kill the stream. Seriously, don't kill it. I've worked so hard :(

# Authenticate to Twitter
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create API object
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Setup listener
tweets_listener = MyStreamListener(api)
stream = tweepy.Stream(api.auth, tweets_listener)
stream.filter(track=all_brands, languages=["en"])
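
Once the stream has written some rows, you'll probably want to read them back. The sketch below uses only the standard library and a few invented sample rows in place of data/stream_out.txt. The column names are my own labels (the stream writes no header row), and the coordinates are parsed back from the string form that csv.writer stores for a tuple.

```python
import ast
import csv
import io
from collections import defaultdict

# Hypothetical column names; the stream writes rows without a header.
COLUMNS = ['brand', 'created_at', 'location', 'text', 'coords', 'sentiment']

# Invented sample rows standing in for data/stream_out.txt.
sample = io.StringIO(
    'acme,2020-04-01 12:00:00,Paris,"Love it, @acme!","(48.8566969, 2.3514616)",0.6\n'
    'acme,2020-04-01 12:05:00,"Dallas, Texas, USA",@acme meh,"(32.7762719, -96.7968559)",-0.2\n'
    'globex,2020-04-01 12:07:00,Paris,@globex ok,"(48.8566969, 2.3514616)",0.3\n'
)

rows = []
for row in csv.DictReader(sample, fieldnames=COLUMNS):
    row['coords'] = ast.literal_eval(row['coords'])  # "(lat, lon)" string -> tuple
    row['sentiment'] = float(row['sentiment'])
    rows.append(row)

# Average compound score per brand
by_brand = defaultdict(list)
for row in rows:
    by_brand[row['brand']].append(row['sentiment'])
mean_sentiment = {brand: round(sum(vals) / len(vals), 3)
                  for brand, vals in by_brand.items()}
print(mean_sentiment)
```

To run this against the real file, replace `sample` with `open('data/stream_out.txt', newline='', encoding='utf-8')`, assuming the file was written with that encoding.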

### Boom, you did it.

Have a drink. Make yourself a snack. Give your grandma a call.