### PART 1: Pulling tweet data from Titter API.
I separated this out so I could run it a few different ways. 

**Import libraries and connect to Twitter API:**

In [1]:
#Import libraries needed to pull tweets, and establish connection to twitter api
import json
import tweepy

%run ~/twitter_credentials.py

#Use tweepy.OAuthHandler to create an authentication using the given key and secret
auth = tweepy.OAuthHandler(consumer_key=con_key, consumer_secret=con_secret)
auth.set_access_token(acc_token, acc_secret)

#Connect to the Twitter API using the authentication. have it wait on rate limits and notify when it's waiting
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)

**Pull tweets using the Twitter REST API based on a specified hashtag**

The next set of code was used to collect tweets based on different hashtags.  I wanted the data separate by hashtags so that I could analyze it more and determine if I needed all of them, or if a subset would be enough.  

In [2]:
#Function to pull tweets based on a specific hastag
def get_tweets_by_hashtag(hashtag, num_tweets, filename):
    num_needed = num_tweets
    tweet_list = []
    last_id = -1 # id of last tweet seen
    
    while len(tweet_list) < num_needed:
        try:
            new_tweets = api.search(q = hashtag, count = 100, max_id = str(last_id - 1))
        except tweepy.TweepError as e:
            print("Error", e)
            break
        else:
            if not new_tweets:
                print("Could not find any more tweets!")
                break
            tweet_list.extend(new_tweets)
            last_id = new_tweets[-1].id
            print (len(tweet_list)) #to see that it's progressing and not failed
    
    #For this next part, I recognize I could have limited the data being pulled, 
    #but was concerned I would either miss a Sunday of data, which is likely a heavy tweet day for football teams
    #or that I wouldnt get the data I wanted while I was still learning the twitter api.  
    #So I opted to pull all data and filter it before saving.  
    limit_data = get_tweet_data(tweet_list)
    save_data = save_tweets(limit_data,filename)
    return(save_data)

In [3]:
#Function to pulling data I anticipate I'll need from the full set of data pulled from the api 
def get_tweet_data(tweepy_list):

    tweet_data=[]

    for tweet in tweepy_list:

        current_tweet=dict()
        current_tweet['text']=tweet.text
        current_tweet['created_at']=tweet.created_at.strftime("%Y-%m-%d %H:%M:%S")
        current_tweet['id_str']=tweet.id_str
        current_tweet['retweeted']=tweet.retweeted
        
        user_dict=tweet._json['user']
        current_tweet['user_id_str']=user_dict['id_str']
        current_tweet['user_location']=user_dict['location']
        
        entities_dict=tweet._json['entities']
        current_tweet['hashtags']=entities_dict['hashtags']
                
        if tweet.place: 
            place_dict = tweet._json['place']
            current_tweet['place_full_name']=place_dict['full_name']
            current_tweet['place_place_type']=place_dict['place_type']
            current_tweet['place_name']=place_dict['name']
            current_tweet['place_country_code']=place_dict['country_code']
            current_tweet['place_country']=place_dict['country']
        else:
            current_tweet['place_full_name'] = ''
            current_tweet['place_place_type']= ''
            current_tweet['place_name']= ''
            current_tweet['place_country_code']= ''
            current_tweet['place_country']= ''
        
        tweet_data.append(current_tweet)
    
    return(tweet_data)


In [4]:
#Function to save the tweets to a json file
def save_tweets(tweets,filename):
    try:
        file=open(filename,"w")
        json.dump(tweets,file)
        file.close() 
        return("Save Complete!")
    except:
        return("Something went wrong, file wasn't saved!")

**Running the functions to pull data**

The following were the runs that were completed using the get_tweets_by_hashtag function.  I wanted these separate in order to be able to choose later if I wanted to use them all or not. They are commented out so that they don't replace the files created for analysis while rerunning the entire notebook.  

In [5]:
#Sunday December 7 
# get_tweets_by_hashtag('#%23gopackgo', 100000,'tweets_gopackgo_all_12072019.json')
# get_tweets_by_hashtag('#%23packernation',100000,'tweets_packernation_all_12072019.json')
# get_tweets_by_hashtag('#%23greenandgold ', 100000,'tweets_greenandgold_all_12072019.json')
# get_tweets_by_hashtag('#%23greenbaypackers', 100000,'tweets_greenbaypackers_all_12072019.json')
# get_tweets_by_hashtag('#%23packers', 100000,'tweets_packers_all_12072019.json')


#Tuesday December 10
# get_tweets_by_hashtag('#%23gopackgo', 100000,'tweets_gopackgo_all_12102019.json')
# get_tweets_by_hashtag('#%23packernation',100000,'tweets_packernation_all_12102019.json')
# get_tweets_by_hashtag('#%23greenandgold ', 100000,'tweets_greenandgold_all_12102019.json')
# get_tweets_by_hashtag('#%23greenbaypackers', 100000,'tweets_greenbaypackers_all_12102019.json')
# get_tweets_by_hashtag('#%23packers', 100000,'tweets_packers_all_12102019.json')


### PART 2: Cleaning and modifying the data so it can be used in R
In this part, I work through merging data previously gathered, parsing for a state to use as a location, removing tweets from WI and duplicates, and performing sentiment analysis.  The last step saves a file to be used in R analysis.


In [6]:
import json
import pandas as pd
import numpy as np
import string
from textblob import TextBlob

In [7]:
#Function to merge json files created in part 1
def merge_json_files(file_name_list):

    tweets_list = []

    for ii in range(len(file_name_list)):
        with open(file_name_list[ii], 'r') as file:
            temp_list = json.load(file)
            tweets_list = tweets_list + temp_list
    return(tweets_list) 

In [8]:
#Funtion to pull only the fields I want to use from the Twitter data gathered
def get_specific_data (tweets_list):
    all_tweets_list = []
    id_list = []

    for ii in range(len(tweets_list)):
        text = tweets_list[ii]['text']
        id_str = int(tweets_list[ii]['id_str'])
        retweet = True if text[:4]=='RT @' else False #didn't pull retweet field, workaround for identifying them
        user_location  = tweets_list[ii]['user_location']
        place_full_name  = tweets_list[ii]['place_full_name']
        location = ''
        team_count = ''
        sentiment = ''
        
        #don't add duplicates or retweets to the list
        if id_str not in id_list and retweet == False:
            all_tweets_list.append([text,id_str,retweet,user_location,place_full_name,location,team_count,sentiment])
            id_list.append(id_str)

    return(all_tweets_list)

In [9]:
#Function to get location for each tweet based on data in either the user_place field or place_full_name field
def get_location_for_tweets (tweet_list):
  
    states = pd.read_csv("states.csv",na_values='*')
    statenames = states["State"].tolist()
    statenames = [x.upper() for x in statenames]
    stateabbrvs = states["Abbreviation"].tolist()
    
    #create these to use to do replacements/lookups based on 
    #state full name (to abbreviation) and abbreviation (for team count)
    state_dict = {}
    nflteams_dict = {}    
    for ii in range(len(states)):
        state_dict[states.iloc[ii]['State'].upper()] = states.iloc[ii]['Abbreviation']
        nflteams_dict[states.iloc[ii]['Abbreviation']] =states.iloc[ii]['Number of NFL Teams']

    #use either the user_location or place_full_name to find state info.  
    #Split the fields into word lists, and try to find the full state name or state abbreviation.
    for ii in range(len(tweet_list)):
        
        #create the "place" list depending on on which field can be used.
        if tweet_list[ii][3] != '':
            place = tweet_list[ii][3].split(",") 
        else:
            place = tweet_list[ii][4].split(", ")
            
        #strip spaces and make upper case for easier comparison   
        place = [word.strip() for word in place]
        place = [x.upper() for x in place]
        
        #if the place list contains a value in either the state or abbreviation list, set the location to that place
        for jj in range(len(place)):
            if place[jj] in statenames or place[jj] in stateabbrvs:
                tweet_list[ii][5] = place[jj]
                
    #make dataframe to simplify dropping data            
    tweetsdf = pd.DataFrame(tweet_list, columns = ['text', 'id_str','retweet','user_location',
                                                   'place_full_name','location','team_count','sentiment']) 

    #replace state with abbreviation
    tweetsdf ['location'] = tweetsdf['location'].map(state_dict).fillna(tweetsdf['location'])
    
    #drop tweets without location
    tweetsdf = tweetsdf.drop(tweetsdf[tweetsdf['location']==''].index)
    
    #drop tweets from WI
    tweetsdf = tweetsdf.drop(tweetsdf[tweetsdf['location']=='WI'].index)
       
    #add team count to each tweet
    tweetsdf ['team_count'] = tweetsdf['location'].map(nflteams_dict).fillna(tweetsdf['team_count'])
    
    #change back to list for additional processing
    tweets_list = tweetsdf.values.tolist()
    
    return(tweets_list)

In [10]:
#Function to remove special characters from tweet text
def clean_tweet(tweet_text):
    for p in string.punctuation:
        tweet_text=tweet_text.replace(p,"")
        return(tweet_text)

In [11]:
#Functions to get tweet sentiment using textblob
def get_tweet_sentiment(tweets): 
    for ii in range(len(tweets)):
        analysis = TextBlob(clean_tweet(tweets[ii][0])) 
        if analysis.sentiment.polarity > 0: 
            tweets[ii][7] = 'positive'
        elif analysis.sentiment.polarity == 0: 
            tweets[ii][7] = 'neutral'
        else: 
            tweets[ii][7] = 'negative'
    tweetsdf = pd.DataFrame(tweets, columns = ['text', 'id_str','retweet','user_location',
                                               'place_full_name','location','team_count','sentiment']) 
    
    #drop neutral sentiment tweets
    tweetsdf = tweetsdf.drop(tweetsdf[tweetsdf['sentiment']=='neutral'].index)
    return(tweetsdf)

In [12]:
#Function to run the various steps to clean and remove tweets
def get_unique_tweets(tweets_list):

    #1. Get fields needed for analysis and remove duplicates
    specific_tweets = get_specific_data (tweets_list)

    #2.Determine location based on other fields in each tweet, remove tweets without location or from WI
    location_tweets = get_location_for_tweets(specific_tweets)

    #3. Determine sentiment for each tweet and remove neutral tweets
    sentiment_tweets = get_tweet_sentiment(location_tweets)

    return(sentiment_tweets)    

**Running the functions to clean data**

This part runs the rest of the functions and save the results to a csv file.


In [13]:
#merge files together that have tweet data
file_list = ['tweets_gopackgo_all_12072019.json','tweets_greenbaypackers_all_12072019.json',
             'tweets_greenandgold_all_12072019.json','tweets_packernation_all_12072019.json',
             'tweets_packers_all_12072019.json','tweets_gopackgo_all_12102019.json',
             'tweets_greenbaypackers_all_12102019.json','tweets_greenandgold_all_12102019.json',
             'tweets_packernation_all_12102019.json','tweets_packers_all_12102019.json']

merged_list = merge_json_files(file_list)

#get list of unique tweets with desired information
unique_tweets = get_unique_tweets(merged_list)
print(len(unique_tweets)) #visual confirmation that process ran and returned expected number of tweets

#save lists to csv
unique_tweets.to_csv('unique_tweets.csv', index=False) 

2861
