# Introduction

Welcome! In this Jupyter notebook, you will find code to:

1. Scrape Twitter data and store it into an AWS DynamoDB
2. Clean Twitter data and calculate metrics from tweeted hashtags, like z scores
3. Train a SGD classifier to predict if a tweeted hashtag is trending or not
4. Analyze the output of SGD classifier's coefficients to perform feature engineering and find the most accurate model

To learn more about Twitter's trend algorithm, trend manipulation, and the methodology behind this work, check out the [GitHub repo](https://github.com/jesmith14/TwitterTrends)!

---

# 1. Scrape Twitter Data Into a Database

To collect all of the data you'll need for this workbook from Twitter, run the cell below as a separate Python script on your local machine. You will need to do three things:
1. Update the values in the TwitterAPI class to be your personal Twitter Developer API keys
2. Run the code in the same folder as your .aws folder with your personal AWS configuration details
    - you will need to create 2 AWS DynamoDBs (free tier eligible), name one 'TweetStream' with a numerical ID (tweet ID), name the other 'trendTable' with a string ID (time stamp)
3. To download all of the Twitter data into csvs, I used the DynamoDBtoCSV package. You can also skip this step and simply use the csvs I used in this workbook by downloading them from the GitHub repo

*For additional guidance, check out the [GitHub repo](https://github.com/jesmith14/TwitterTrends)*

In [None]:
#Treat this cell as its own python file, follow instructions to run on your own machine
#Or you can download the python file from the github repository
import datetime
import calendar
import tweepy
import json
import boto3
import sys
import threading

class TwitterAPI():
    def __init__(self):
        self.api_key = 'INSERT_TWITTER_API_KEY_HERE'
        self.api_secret = 'INSERT_TWITTER_API_SECRET_KEY_HERE'
        self.access_token = 'INSERT_TWITTER_API_ACCESS_TOKEN_HERE'
        self.access_secret = 'INSTER_TWITTER_SECRET_ACCESS_TOKEN_HERE'
        self.auth = tweepy.OAuthHandler(self.api_key, self.api_secret)
        self.auth.set_access_token(self.access_token, self.access_secret)
        self.api = tweepy.API(self.auth)
        self.trendTable = DataBase().trendTable

    def addTrendsToDB(self):
        api = self.api
        currentTrends = api.trends_place(1)[0]
        d1 = datetime.datetime.strptime(currentTrends['created_at'],"%Y-%m-%dT%H:%M:%SZ")
        calendar.month_name[d1.month][:3]
        new_format = calendar.month_name[d1.month][:3] + " %d %H:%M"
        time = d1.strftime(new_format)
        trends = set()
        for trend in currentTrends['trends']:
            trends.add(trend['name'])
        print('* time: ', time)
        self.trendTable.put_item(
            Item={
                'id':time,
                'trends':trends
            }
        )

class DataBase():
    def __init__(self):
        self.dynamodb = boto3.resource('dynamodb', region_name='us-east-2')
        self.client = boto3.client('dynamodb', region_name='us-east-2')
        self.tweetTable = self.dynamodb.Table('TweetStream')
        self.trendTable = self.dynamodb.Table('trendTable')

class MyStreamListener(tweepy.StreamListener):
    def __init__(self):
        self.tweet_count = 0
        self.word_counts = {}
        self.totals = {}

        self.tweetTable = DataBase().tweetTable

        
    def addNewTweetToDB(self, newData):
        #add the new tweet json to the DB (new tweet DB)
        print("# :" + newData['time'] + newData['text'][:5] + "...")
        self.tweetTable.put_item(
           Item={
                'id': newData['id'],
                'time': newData['time'],
                'text': newData['text']
            }
        )

    def on_error(self, status_code):
        if status_code == 420:
            #returning False in on_data disconnects the stream
            return False
    
    def on_data(self, data):
        self.tweet_count += 1
        if self.tweet_count % 10 == 0:
            data = json.loads(data)
            if'text' in data:
                newData = {}
                newData['text'] = data['text']
                newData['time'] = data['created_at'][4:16]
                newData['id'] = data['id']
                self.addNewTweetToDB(newData)

def gatherTweets():
    print('gathering tweets...')
    myListener = MyStreamListener()
    api = TwitterAPI()
    myStream = tweepy.Stream(auth=api.auth, listener=myListener)
    myStream.sample()

def gatherTrends():
    print('gathering trends...')
    threading.Timer(300.0, gatherTrends).start()
    api = TwitterAPI()
    api.addTrendsToDB()

def main():
    print('running')
    if(sys.argv[1] == 'Tweets'):
        gatherTweets()
    elif(sys.argv[1] == 'Trends'):
        gatherTrends()


main()

---

# 2. Clean Twitter Data

This step can be done in many different ways, here I chose to create different dataframes and dictionaries throughout the cleaning / data preprocessing so that it would be easier to see what I am calculating and changing at each step, but the final product could be obtained by taking many more shortcuts that I chose to not take.

**Be careful when running some of these cells, a few of them will take several minutes to run. Most cells have a time stamp that will print out to keep track of how long there is left to go**

In [1]:
import pandas as pd

#csvs created from my AWS DynamoDB using the DynamoDBtoCSV package
trends = pd.read_csv('trends.csv')
tweets = pd.read_csv('tweets.csv')

In [2]:
'''
Creating and formatting the trends dataframe:

|timeStamp [string] | trends [list of hashtags] |

'''
def getHashtags(trends):
    trendsWithoutQuotes = trends.replace('"', '')
    trendsStripped = trendsWithoutQuotes.strip('[]')
    trendList = trendsStripped.split()
    hashtags = []
    for item in trendList:
        if(item[0] == '#'):
            hashtags.append(item)


    return hashtags

import numpy as np
formattedTrends = trends[:2761].copy()
tester = []
for i in range(0, len(trends) - 2): 
    hashTagList = getHashtags(trends.iloc[i]['trends'])
    formattedTrends.iloc[i]['trends'] = hashTagList

In [3]:
formattedTrends.head()

Unnamed: 0,id,trends
0,Dec 06 11:53,"[#BuenViernes,#DiscoRajaTeaser,#Encounter,#Flo..."
1,Dec 06 15:30,"[#6DElParoSigue,#6Dic,#AdoreYou,#BroadwayinHaw..."
2,Dec 06 07:48,"[#AdoreYou,#AllAboutLuvForWonho,#AnxietyFeelsL..."
3,Dec 07 22:46,"[#370MilyonNerede,#7DEstoRecienEmpieza,#AJRuiz..."
4,Dec 08 03:24,"[#ACCFCG,#AChristmasLoveStory,#AltasHoras,#Ani..."


**Careful, the cell below can take around 5 minutes to run!**

It will stop after 800000 prints out below the cell

In [4]:
'''
Creating and formatting the tweets dataframe, only selecting hashtags for this dataset for simplicity:

| hashtags [string array] | tweetID [integer] | minute timeStamp [string] | RT (0 or 1) |
'''

#careful, this cell takes a few minutes to run - there are over 800000 tweets to go through!
newRows = []
#omitting last rows of tweets dataframe because they were query response codes
for i in range(0, len(tweets[:800158])):
    if((i % 100000) == 0):
        print(i)
    RT = 0
    hashtags = []
    currentTweet = tweets.iloc[i]
    if isinstance(currentTweet['text'], float) or isinstance(currentTweet['time'], float): continue
    for word in currentTweet['text'].split():
        if word == "RT":
            RT = 1
        if word[0] == '#':
            hashtags.append(word)
        else: continue
    if len(hashtags) > 0:
        thisTweet = np.array([hashtags, tweets.iloc[i]['id'], tweets.iloc[i]['time'], RT])
        newRows.append({'hashtags':hashtags, 'tweetID':thisTweet[1], 'time':thisTweet[2], 'RT':thisTweet[3]})

hashtags = pd.DataFrame(newRows)
hashtags.head()

0
100000
200000
300000
400000
500000
600000
700000
800000


Unnamed: 0,hashtags,tweetID,time,RT
0,"[#WangXian, #MoDaoZuShi, #魔道祖师]",1203236748953706500,Dec 07 08:56,1
1,"[#XiaoZhan, #เซียวจ้าน, #샤오잔, #シャオジャン]",1203667751438413800,Dec 08 13:28,1
2,"[#FNS歌謡祭, #ジェジュン, #チキンライス, #OH_MY_LITTLE_GIRL]",1202185203382247400,Dec 04 11:17,1
3,[#Live],1203929832494530600,Dec 09 06:50,1
4,"[#เป็นสเตจแรกของDoyouที่โคตรมัน, #BamBam]",1204299484915650600,Dec 10 07:19,1


### Turn all timestamps into Date Times, add time intervals for trends, sort dataframes by time

In [5]:
import datetime
def getDateTimeArray(df, columnName):
    datetimes = []
    for i in range(len(df)):
        date_time_str_test = df.iloc[i][columnName] + ' 2019'
        date_time_obj = datetime.datetime.strptime(date_time_str_test, '%b %d %H:%M %Y')
        datetimes.append(date_time_obj)

    #appending this to end to make it same length as trends df
    datetimes.append(None)
    return datetimes

In [6]:
#add dateTime column to both dataframes for sorting by time
formattedTrends['dateTime'] = getDateTimeArray(formattedTrends[:-1], 'id')
hashtags['dateTime'] = getDateTimeArray(hashtags[:-1], 'time')

In [7]:
#sort formattedTrends by time
formattedTrends.sort_values(by='dateTime', inplace=True)

In [8]:
#add endTime attribute column
formattedTrends['endTime'] = None

In [9]:
#fill in endTime attribute column
for index, row in formattedTrends.iterrows():
    startTime = row['dateTime']
    #end time is 10 minutes after this trend
    endTime = startTime + datetime.timedelta(minutes=10)
    #if there is a trend array for 10 minutes later, we add this end time to the dataframe
    if len(formattedTrends[formattedTrends['dateTime'] == endTime]) > 0:
        formattedTrends.at[index, 'endTime'] = endTime

In [10]:
#formated trends is now a dataframe that only contains trends in the specified 10 minute non-gap intervals
formattedTrends = formattedTrends[formattedTrends['endTime'].notna()]

In [11]:
#sort hashtags by time
hashtags.sort_values(by='dateTime', inplace=True)

In [12]:
#get dataframe of all the hashtags as their own row
totalHashtags = hashtags.explode('hashtags')

In [13]:
totalHashtags['trending'] = None
#creating unique index for each hashtag in the dataset
indexes = list(range(len(totalHashtags)))
totalHashtags['index'] = indexes
totalHashtags.set_index('index', inplace=True)

In [14]:
#here I am going through the hashtags dataframe and labeling each hashtag with a 0 if it was used 
#when it wasn't trending and with a 1 if it was used when it was trending
#this is if it was or wasn't trending in the specified 10 minute window of trends that the tweet occured during
#this cell will print a timestamp to check that it's running properly, and will stop after 500,000 prints out
timeTable = 0
for index1, row1 in formattedTrends.iterrows():
    startTime = row1['dateTime']
    trendList = list(formattedTrends[formattedTrends['dateTime'] == startTime]['trends'])[0][0]
    thisTime = startTime
    for i in range(1, 11):
        for index, row in (totalHashtags[totalHashtags['dateTime'] == thisTime]).iterrows():
            if row['hashtags'] in trendList:
                trending = 1
            else:
                trending = 0
            totalHashtags.at[index, 'trending'] = trending
            #here to keep track of running time
            timeTable += 1
            if(timeTable % 100000 == 0):
                print(timeTable)
        thisTime = startTime + datetime.timedelta(minutes=i)

100000
200000
300000
400000
500000


In [15]:
#getting rid of some formatting errors, omitting NaN values
totalHashtags = totalHashtags[totalHashtags['trending'].notnull()]

In [16]:
#reformatting dataframe to have unique index and sorted by time
totalHashtags.reset_index(inplace=True)
totalHashtags.sort_values(by='dateTime', inplace=True)
indexes = list(range(len(totalHashtags)))
totalHashtags['index'] = indexes
totalHashtags.set_index('index', inplace=True)

In [17]:
totalHashtags.head()

Unnamed: 0_level_0,hashtags,tweetID,time,RT,dateTime,trending
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,#笑う門には大澤駿弥,1200966846154858500,Dec 01 02:36,0,2019-12-01 02:36:00,0
1,#大澤駿弥,1200966846154858500,Dec 01 02:36,0,2019-12-01 02:36:00,0
2,#MamangamFestFromDec12,1200966930007412700,Dec 01 02:36,1,2019-12-01 02:36:00,0
3,#WorldAIDSDay,1200967081040081000,Dec 01 02:37,0,2019-12-01 02:37:00,0
4,#ただのいちごじゃない,1202058204026306600,Dec 04 02:53,1,2019-12-04 02:53:00,0


In [18]:
len(totalHashtags)

232134

In [19]:
totalHashtags[totalHashtags['dateTime'] == '2019-12-04 02:53']

Unnamed: 0_level_0,hashtags,tweetID,time,RT,dateTime,trending
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
4,#ただのいちごじゃない,1202058204026306600,Dec 04 02:53,1,2019-12-04 02:53:00,0
5,#三角チョコパイあまおう,1202058204026306600,Dec 04 02:53,1,2019-12-04 02:53:00,0


In [20]:
#getting information to add to the trending dataframe so I know how many trending hashtags there were in each
#10 minute window and how many total hashtags there were
numTrendsInTweets = []
numTotalHashtags = []
for index,row in formattedTrends.iterrows():
    
    startTime = row['dateTime']
    thisTime = startTime
    numTrends = 0
    numTotal = 0
    
    for i in range(1, 11):
        hashtagsAtMinute = totalHashtags[totalHashtags['dateTime'] == thisTime]
        for index2, row2, in hashtagsAtMinute.iterrows():
            if row2['trending'] == 1:
                numTrends += 1
            numTotal += 1
        thisTime = startTime + datetime.timedelta(minutes=i)
    numTrendsInTweets.append(numTrends)
    numTotalHashtags.append(numTotal)

In [21]:
formattedTrends['NumTrendsTweeted'] = numTrendsInTweets
formattedTrends['TotalHashtags'] = numTotalHashtags

In [22]:
#average number of times a trending hashtag was used in each 10 minute bucket in this dataset
np.mean(formattedTrends['NumTrendsTweeted'])

10.770048737261853

In [23]:
#average number of hashtags that were tweeted in each 10 minute bucket in this dataset
np.mean(formattedTrends['TotalHashtags'])

243.38901196278246

### Resampling the dataset so that there will be an equal amount of trends and non-trends

In [24]:
#creating dictionaries that for every 10 minute window:
#there is a list of the trends from that window,
#a list of the hashtags that were tweeted that were trending,
#and a list of randomly selected hashtags that were tweeted that were not trending (same amount as trending hashtags)
timeStamp = 0
z_score_dict = {}
for index,row in formattedTrends.iterrows():
    currentTrends = row['trends']
    if row['TotalHashtags'] < 10: continue
    if row['TotalHashtags'] < (2*row['NumTrendsTweeted']): continue
    elif row['TotalHashtags'] >= (2*row['NumTrendsTweeted']):
        #randomly sample sampleSize of hashtags that weren't trending from this 10 minute window
        startTime = row['dateTime']
        thisTime = startTime
        if startTime not in z_score_dict:
            z_score_dict[startTime] = {'trends': currentTrends, 'tweetedTrends': [], 'tweetedNonTrends':[]}
        for i in range(1, 11):
            hashtagsAtMinute = totalHashtags[totalHashtags['dateTime'] == thisTime]
            for index2, row2, in hashtagsAtMinute.iterrows():
                if row2['trending'] == 1:
                    z_score_dict[startTime]['tweetedTrends'].append(row2['hashtags'])
                if row2['trending'] == 0 and len(z_score_dict[startTime]['tweetedNonTrends']) < row['NumTrendsTweeted']:
                    z_score_dict[startTime]['tweetedNonTrends'].append(row2['hashtags'])
            thisTime = startTime + datetime.timedelta(minutes=i)
    timeStamp += 1
    if timeStamp % 500 == 0:
        print(timeStamp)

500
1000
1500


### Calculation Time! The next few cells are calculating all the metrics needed to find the z_score from the last hour for every hashtag

In [25]:
#coming up with a count for how many times trending hashtags were tweeted in the last hour
#and how many times non trending hashtags were tweeted in the last hour
#also creating a count for how many hashtags total were tweeted in the last hour for proportions later
#using a time stamp, this cell will stop after 3000 prints
for key in z_score_dict:
    numTrendTags = len(z_score_dict[key]['tweetedTrends'])
    z_score_dict[key]['trendCounts'] = [0]*numTrendTags
    z_score_dict[key]['nonTrendCounts'] = [0]*numTrendTags
    z_score_dict[key]['totalCounts'] = 0
    for i in range(numTrendTags):
        itemtrend = z_score_dict[key]['tweetedTrends'][i]
        itemnontrend = z_score_dict[key]['tweetedNonTrends'][i]
        for k in range(0, 61):
            minute = key - datetime.timedelta(minutes=k)
            if minute in z_score_dict: 
                if itemtrend in z_score_dict[minute]['tweetedTrends']:
                    z_score_dict[key]['trendCounts'][i] += 1
                    z_score_dict[key]['totalCounts'] += 1
                if itemnontrend in z_score_dict[minute]['tweetedNonTrends']:
                    z_score_dict[key]['nonTrendCounts'][i] += 1
                    z_score_dict[key]['totalCounts'] += 1
    timeStamp += 1
    if timeStamp % 500 == 0:
        print(timeStamp)

2000
2500
3000


In [26]:
#calculate proportion count for each item
#cell will stop running when time stamp prints out 4500
timeStamp = 0
for item in z_score_dict:
    numTags = len(z_score_dict[item]['tweetedTrends'])
    z_score_dict[item]['trendCountProportion'] = [0] * numTags
    z_score_dict[item]['nonTrendCountProportion'] = [0] * numTags
    for i in range(numTags):
        if(z_score_dict[item]['totalCounts'] == 0):
            print(item)
            break
        #z_score = (count proportion - mean) / standard deviation
        trendCountProportion = z_score_dict[item]['trendCounts'][i] / z_score_dict[item]['totalCounts']
        nonTrendCountProportion = z_score_dict[item]['nonTrendCounts'][i] / z_score_dict[item]['totalCounts']
        z_score_dict[item]['trendCountProportion'][i] = trendCountProportion
        z_score_dict[item]['nonTrendCountProportion'][i] = nonTrendCountProportion
        
    timeStamp += 1
    if timeStamp % 500 == 0:
        print(timeStamp)

500
1000
1500


In [27]:
#get total count proportions from past hour to calculate mean proportion count and standard deviation for each item
timeStamp = 0
for item in z_score_dict:
    numTags = len(z_score_dict[item]['tweetedTrends'])
    z_score_dict[item]['hourTrendCounts'] = [[]] * numTags
    z_score_dict[item]['hourNonTrendCounts'] = [[]] * numTags
    for i in range(numTags):
        trendingTag = z_score_dict[item]['tweetedTrends'][i]
        nonTrendingTag = z_score_dict[item]['tweetedNonTrends'][i]
        for k in range(0, 61):
            minute = item - datetime.timedelta(minutes=k)
            if minute in z_score_dict: 
                if trendingTag in z_score_dict[minute]['tweetedTrends']:
                    indexOfTag = z_score_dict[minute]['tweetedTrends'].index(trendingTag)
                    newTrendCount = z_score_dict[minute]['trendCountProportion'][indexOfTag]
                    z_score_dict[item]['hourTrendCounts'][i].append(newTrendCount)
                if nonTrendingTag in z_score_dict[minute]['tweetedNonTrends']:
                    indexOfTag = z_score_dict[minute]['tweetedNonTrends'].index(nonTrendingTag)
                    newTrendCount = z_score_dict[minute]['nonTrendCountProportion'][indexOfTag]
                    z_score_dict[item]['hourNonTrendCounts'][i].append(newTrendCount)
    timeStamp += 1
    if timeStamp % 500 == 0:
        print(timeStamp)
                

500
1000
1500


In [28]:
#add mean, standard deviation, and z_scores for each trending and nontrending hashtag in the dictionary
timeStamp = 0
for item in z_score_dict:
    numTags = len(z_score_dict[item]['tweetedTrends'])
    z_score_dict[item]['trend_mean'] = [0] * numTags
    z_score_dict[item]['trend_stddev'] = [0] * numTags
    z_score_dict[item]['trend_z_score'] = [0] * numTags
    z_score_dict[item]['non_trend_mean'] = [0] * numTags
    z_score_dict[item]['non_trend_stddev'] = [0] * numTags
    z_score_dict[item]['non_trend_z_score'] = [0] * numTags
    for i in range(numTags):
        
        #trending
        trend_x = z_score_dict[item]['trendCountProportion'][i]
        trend_mean = np.mean(z_score_dict[item]['hourTrendCounts'][i])
        trend_stddev = np.std(z_score_dict[item]['hourTrendCounts'][i])
        if trend_stddev == 0:
            trend_z_score = 0
        else:
            trend_z_score = (trend_x - trend_mean) / trend_stddev
        z_score_dict[item]['trend_mean'][i] = trend_mean
        z_score_dict[item]['trend_stddev'][i] = trend_stddev
        z_score_dict[item]['trend_z_score'][i] = trend_z_score
        
        #nontrending
        nontrend_x = z_score_dict[item]['nonTrendCountProportion'][i]
        nontrend_mean = np.mean(z_score_dict[item]['hourNonTrendCounts'][i])
        nontrend_stddev = np.std(z_score_dict[item]['hourNonTrendCounts'][i])
        if nontrend_stddev == 0:
            nontrend_z_score = 0
        else:
            nontrend_z_score = (nontrend_x - nontrend_mean) / nontrend_stddev
        z_score_dict[item]['non_trend_mean'][i] = nontrend_mean
        z_score_dict[item]['non_trend_stddev'][i] = nontrend_stddev
        z_score_dict[item]['non_trend_z_score'][i] = nontrend_z_score
        
    timeStamp += 1
    if timeStamp % 500 == 0:
        print(timeStamp)

500
1000
1500


### For Reference:

#### Explanation of every attribute calculated and stored in the z_score_dictionary:
- *key*: date time for this 10 minute interval
- trends: list of all the trends from this timeStamp
- tweetedTrends: list of all hashtags that were tweeted in this 10 minute window that were in trends list
- tweetedNonTrends: randomly selected list of hashtags that were tweeted in this 10 minute window that were not in the trends list
    - this list is the same length as the tweetedTrends list
- trendCounts: list of all counts of a hashtag tweeted while trending at index i in the last hour
- nonTrendCounts: list of all counts of a hashtag tweeted while not trending at index i in the last hour
- totalCounts: number of all hashtags in this dataset that were tweeted in the last hour
- trendCountProportion: list of all proportion of counts of a hashtag tweeted while trending at index i in the last hour
- nonTrendCountProportion: list of all proportion of counts of a hashtag tweeted while not trending at index i in the last hour
- hourTrendCounts: list of lists, where each inner list at index i holds all the proportion of counts for this trending hashtag in the last hour
- hourNonTrendCounts: list of lists, where each inner list at index i holds all the proportion of counts for this nontrending hashtag in the last hour
- trend_mean: list that contains all the means of occurences in the last hour for every hashtag in tweetedTrends
- non_trend_mean: list that contains all the means of occurences in the last hour for every hashtag in tweetedNonTrends
- trend_stddev: list that contains all the standard deviations of occurences in the last hour for every hashtag in tweetedTrends
- non_trend_stddev: list that contains all the standard deviations of occurences in the last hour for every hashtag in tweetedNonTrends
- trend_z_score: list that contains the z_score of the last hour for every hashtag in tweetedTrends
- non_trend_z_score: list that contains the z_score of the last hour for every hashtag in tweetedNonTrends

### Creating the final training dataframe and testing array for the classifier

In [29]:
#gathering all necessary data to create the final training data
#so each hashtag in the dataset has a row in the training dataframe
#the y values will be the trending label 
#(0 if the hashtag was not a trend when it was tweeted, and 1 if the hashtag was a trend when it was tweeted)
indexes = []
hashtags = []
startTimes = []
endTimes = []
means = []
stddevs = []
z_scores = []
trendings = []
counts = []
i = 0
timeStamp = 0
for index,row in formattedTrends.iterrows():
    startTime = row['dateTime']
    endTime = row['endTime']
    minute = startTime
    for m in range(0, 10):
        if minute in z_score_dict:
            numTags = len(z_score_dict[minute]['trendCounts'])
            for j in range(numTags):
                #trend
                hashtag = z_score_dict[minute]['tweetedTrends'][j]
                hashtags.append(hashtag)
                startTime = int(round(minute.timestamp() * 1000))
                startTimes.append(startTime)
                endTime = int(round((minute + datetime.timedelta(minutes=10)).timestamp() * 1000))
                endTimes.append(endTime)
                mean = z_score_dict[minute]['trend_mean'][j]
                means.append(mean)
                stddev = z_score_dict[minute]['trend_stddev'][j]
                stddevs.append(stddev)
                z_score = z_score_dict[minute]['trend_z_score'][j]
                z_scores.append(z_score)
                trendings.append(1)
                count = z_score_dict[minute]['trendCountProportion'][j]
                counts.append(count)
                indexes.append(i)
                i+=1
                
                #nontrend
                hashtag = z_score_dict[minute]['tweetedNonTrends'][j]
                hashtags.append(hashtag)
                startTime = int(round(minute.timestamp() * 1000))
                startTimes.append(startTime)
                endTime = int(round((minute + datetime.timedelta(minutes=10)).timestamp() * 1000))
                endTimes.append(endTime)
                mean = z_score_dict[minute]['non_trend_mean'][j]
                means.append(mean)
                stddev = z_score_dict[minute]['non_trend_stddev'][j]
                stddevs.append(stddev)
                z_score = z_score_dict[minute]['non_trend_z_score'][j]
                z_scores.append(z_score)
                trendings.append(0)
                count = z_score_dict[minute]['nonTrendCountProportion'][j]
                counts.append(count)
                indexes.append(i)
                i+=1
            
        
        minute = minute + datetime.timedelta(minutes=10)
        
    timeStamp += 1
    if timeStamp % 500 == 0: print(timeStamp)

500
1000
1500
2000


In [30]:
#creating training data dataframe with all available information
data_train_total = pd.DataFrame(columns=['index', 'hashtag','startTime', 'endTime', 'mean', 'stddev', 'z_score', 'count', 'trending'])
data_train_total['index'] = indexes
data_train_total['hashtag'] = hashtags
data_train_total['startTime'] = startTimes
data_train_total['endTime'] = endTimes
data_train_total['mean'] = means
data_train_total['stddev'] = stddevs
data_train_total['z_score'] = z_scores
data_train_total['count'] = counts
data_train_total['trending'] = trendings
data_train_total.set_index('index', inplace=True) 

In [31]:
data_train_total.head()

Unnamed: 0_level_0,hashtag,startTime,endTime,mean,stddev,z_score,count,trending
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,#Medevac,1575427980000,1575428580000,0.5,0.0,0.0,0.5,1
1,#ただのいちごじゃない,1575427980000,1575428580000,0.5,0.0,0.0,0.5,0
2,#Medevac,1575428280000,1575428880000,0.211111,0.150718,0.073721,0.222222,1
3,#PasapalabraCHV,1575428280000,1575428880000,0.111111,0.0,0.0,0.111111,0
4,#4,1575428280000,1575428880000,0.211111,0.150718,-0.663489,0.111111,1


In [32]:
#labels for this classifier are trends, 0 if was not trending, 1 if was trending
y = trendings
y

[1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,


---

# 3. Train a classifier to predict Trending hashtags on Twitter

### Attempt #1: Using All Features:
- startTime
- endTime
- mean
- standard deviation
- z_score
- count

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

In [34]:
#omitting labels and hashtag strings in classifier because too many to do word-to-vector conversion in this time frame
data_train = data_train_total[['startTime', 'endTime', 'mean', 'stddev', 'z_score', 'count']]

In [35]:
#split the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(data_train, y, test_size=0.2, random_state=0)

classifier = SGDClassifier()
classifier.fit(X_train, y_train)

predictions_training = classifier.predict(X_train)
predictions = classifier.predict(X_test)

In [36]:
print("Training accuracy: %0.6f" % accuracy_score(y_train, predictions_training))
print("Testing accuracy: %0.6f" % accuracy_score(y_test, predictions))

Training accuracy: 0.500009
Testing accuracy: 0.499964


In [37]:
print(classifier.coef_)

[[ 4.12359483e+07  4.12360367e+07  1.08150596e+02  6.54579915e+01
  -1.20296373e+02  9.38675711e+01]]


### Attempt #2: Using Only Positive Features:
- mean
- standard deviation
- count

In [38]:
data_train = data_train_total[['mean', 'stddev', 'count']]
X_train, X_test, y_train, y_test = train_test_split(data_train, y, test_size=0.2, random_state=0)

classifier = SGDClassifier()
classifier.fit(X_train, y_train)

predictions_training = classifier.predict(X_train)
predictions = classifier.predict(X_test)

In [39]:
print("Training accuracy: %0.6f" % accuracy_score(y_train, predictions_training))
print("Testing accuracy: %0.6f" % accuracy_score(y_test, predictions))

Training accuracy: 0.914482
Testing accuracy: 0.914902


In [40]:
print(classifier.coef_)

[[29.66358858 24.7547925  15.81587896]]


### Attempt #3: Only Numbers Included in Z_Score Calculation :
- mean
- standard deviation
- count
- z_score

In [41]:
data_train = data_train_total[['mean', 'stddev', 'count', 'z_score']]
X_train, X_test, y_train, y_test = train_test_split(data_train, y, test_size=0.2, random_state=0)

classifier = SGDClassifier()
classifier.fit(X_train, y_train)

predictions_training = classifier.predict(X_train)
predictions = classifier.predict(X_test)

In [42]:
print("Training accuracy: %0.6f" % accuracy_score(y_train, predictions_training))
print("Testing accuracy: %0.6f" % accuracy_score(y_test, predictions))

Training accuracy: 0.932666
Testing accuracy: 0.932880


In [43]:
print(classifier.coef_)

[[28.38535161 23.66534825 19.42241186 -0.17151434]]


### Final Attempt #4: Only Z_Score :
- z_score

In [44]:
data_train = data_train_total[['z_score']]
X_train, X_test, y_train, y_test = train_test_split(data_train, y, test_size=0.2, random_state=0)

classifier = SGDClassifier()
classifier.fit(X_train, y_train)

predictions_training = classifier.predict(X_train)
predictions = classifier.predict(X_test)

In [45]:
print("Training accuracy: %0.6f" % accuracy_score(y_train, predictions_training))
print("Testing accuracy: %0.6f" % accuracy_score(y_test, predictions))

Training accuracy: 0.454239
Testing accuracy: 0.453876


In [46]:
print(classifier.coef_)

[[-0.37620597]]


---

# 4. Analysis - How To Predict Twitter Trends

In this study, the performance metric was sk-learn’s classification_accuracy score, which is the same as a jaccard index score for classification problems. 

In order to perform feature engineering, I needed to understand which features provided the most weight for the classifier. To do this, I printed the .coef_ metric of the classifier, which provided an array of the final weights of each feature in the model that led to its current accuracy. This helped me decide which features were likely the most important to include in an accurate classifier.

I first trained the model using all of the features available, which led to an accuracy of about 50%. Then, using only metrics that were involved in the z score calcuation, the accuracy jumped to about 91%. Using only the z score as a feature, the accuracy dropped to 42%. Finally, the model with the best performance was the one that included only the features that contributed to the z score of a hashtag (including the z score itself), which led to an accuracy of about 93%.

*Find more information about this study, as well as a report, on the [GitHub repo](https://github.com/jesmith14/TwitterTrends)*