# Twitter Mining and Trend Analysis

This project marks the beginning of my learning how to mine data from social media platforms. I'm working with Matthew A. Russell's <u>Mining the Social Web</u>, 2e (2014). Many of the text's links, libraries, and screenshots are now outdated, but I enjoyed using an older edition for exactly that reason: I could build a general idea of what each program should do from the text and then figure out the specifics on my own.

For example, I replaced the __twitter__ library mentioned in the text with the __tweepy__ library and independently researched the latter's functionality to understand how my solution would differ from the text's. These explorations gave me valuable experience in building a project "in the wild" without direct guidance.

With that context established, and without further ado, let's begin walking through this project!

## Building a Connection to Twitter

Before we can delve into Twitter's available data, we'll need to establish a connection to the Twitter API. I've already created API credentials under a Twitter developer account, so we are good to proceed!

For our API connection, we're going to use the __tweepy__ module.

In [1]:
from tweepy import OAuthHandler
from tweepy import API
import pandas as pd
import config

In [2]:
auth = OAuthHandler(config.OAuth1, config.OAuth2)
auth.set_access_token(config.access_token1, config.access_token2)

In [3]:
api = API(auth)
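
The __config__ module imported above is a local file that isn't shown here. As a rough sketch only, assuming the credentials are kept in environment variables, it could be built around a helper like the one below. The environment variable names and the `load_credentials` helper are hypothetical; the dictionary keys match the attribute names used in cell 2, where the first pair is the consumer key and secret and the second pair is the access token and secret.

```python
import os

# Hypothetical environment variable names, mapped to the attribute names used in cell 2.
_ENV_NAMES = {
    "OAuth1": "TWITTER_CONSUMER_KEY",
    "OAuth2": "TWITTER_CONSUMER_SECRET",
    "access_token1": "TWITTER_ACCESS_TOKEN",
    "access_token2": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Return a dict of the four credentials, raising if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in _ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {attr: env[var] for attr, var in _ENV_NAMES.items()}
```

Inside `config.py`, each value could then be exposed as a module-level name, for example `OAuth1 = load_credentials()["OAuth1"]`. Whatever form the file takes, it should stay out of version control.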

To check that our API object was created, let's execute the __print__ function. We should get the address of the in-memory object as a result if the object was created successfully.

In [4]:
print (api)

<tweepy.api.API object at 0x000001DA2CC726D0>


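Success! The output shows that our API object was created and lives at the address above. Note that this only confirms the object exists; tweepy does not contact Twitter until we make our first request.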

## Exploring Trending Topics (Comparing Milwaukee, WI, US vs Worldwide)

We are now connected to a vast world of Twitter data. To put the richness of Twitter's data in context, the platform last estimated that it provided services to over 300 million monthly active accounts in Q1 2019. Because of Twitter's asymmetric following model (i.e. users can follow one another without followbacks), we are able to see patterns of social activity that read more like a global interest graph than a social networking structure.

In this section, let's get a better feel for what Twitter has to offer by exploring currently trending topics by geolocation. Before delving into trends by geolocation, however, we'll need to know which locations Twitter can report trend data for. Each geolocation is assigned a 32-bit identifier code (called a Where on Earth ID, or WOEID) that we'll need to retrieve.

We can use the __available_trends()__ function to call our Twitter API and ask for all the available locations. We'll store the response in the variable __woe_avail__.

In [5]:
woe_avail = api.available_trends()

Let's take a sneak peek at the first five items we've received from our call...

In [6]:
print(woe_avail[:5])

[{'name': 'Worldwide', 'placeType': {'code': 19, 'name': 'Supername'}, 'url': 'http://where.yahooapis.com/v1/place/1', 'parentid': 0, 'country': '', 'woeid': 1, 'countryCode': None}, {'name': 'Winnipeg', 'placeType': {'code': 7, 'name': 'Town'}, 'url': 'http://where.yahooapis.com/v1/place/2972', 'parentid': 23424775, 'country': 'Canada', 'woeid': 2972, 'countryCode': 'CA'}, {'name': 'Ottawa', 'placeType': {'code': 7, 'name': 'Town'}, 'url': 'http://where.yahooapis.com/v1/place/3369', 'parentid': 23424775, 'country': 'Canada', 'woeid': 3369, 'countryCode': 'CA'}, {'name': 'Quebec', 'placeType': {'code': 7, 'name': 'Town'}, 'url': 'http://where.yahooapis.com/v1/place/3444', 'parentid': 23424775, 'country': 'Canada', 'woeid': 3444, 'countryCode': 'CA'}, {'name': 'Montreal', 'placeType': {'code': 7, 'name': 'Town'}, 'url': 'http://where.yahooapis.com/v1/place/3534', 'parentid': 23424775, 'country': 'Canada', 'woeid': 3534, 'countryCode': 'CA'}]


It looks like our call was successful, but the output is difficult to read. Let's try looking at only a single element for some clarity...

In [7]:
woe_avail[0]

{'name': 'Worldwide',
 'placeType': {'code': 19, 'name': 'Supername'},
 'url': 'http://where.yahooapis.com/v1/place/1',
 'parentid': 0,
 'country': '',
 'woeid': 1,
 'countryCode': None}
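
To get the same readable layout for several items at once, one option is Python's standard __pprint__ module. The sketch below uses two entries copied from the output above as stand-in data, so it runs without an API call.

```python
from pprint import pformat

# Stand-in data: the first two entries from the response printed earlier.
sample = [
    {'name': 'Worldwide', 'placeType': {'code': 19, 'name': 'Supername'},
     'url': 'http://where.yahooapis.com/v1/place/1', 'parentid': 0,
     'country': '', 'woeid': 1, 'countryCode': None},
    {'name': 'Winnipeg', 'placeType': {'code': 7, 'name': 'Town'},
     'url': 'http://where.yahooapis.com/v1/place/2972', 'parentid': 23424775,
     'country': 'Canada', 'woeid': 2972, 'countryCode': 'CA'},
]

# sort_dicts=False keeps the keys in the order the API returned them.
formatted = pformat(sample, width=80, sort_dicts=False)
print(formatted)
```

With the real data, `pformat(woe_avail[:5], sort_dicts=False)` would do the same for the first five locations.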

Great! We can see that our call returned a list of dictionaries, each with several keys. For our purposes, we'll only need the 'name' and 'woeid' values from each one.

Let's create a dictionary containing only the values we need; this will help us easily look up geolocation identifier codes.

In [8]:
woeid_dict = dict()
for i in woe_avail:
    woeid_dict[i.get('name')] = i.get('woeid')

To test our result, we can try printing the first few key:value pairs from our __woeid_dict__ dictionary...

In [9]:
for x in list(woeid_dict)[0:3]:
    print ("key {}, value {} ".format(x,  woeid_dict[x]))

key Worldwide, value 1 
key Winnipeg, value 2972 
key Ottawa, value 3369 
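
The same mapping can also be written as a dictionary comprehension. The sketch below uses three entries from the output above as stand-in data and adds a hypothetical `find_woeid` helper for case-insensitive lookups. One caveat of keying by name: if two locations ever share a name, the later one silently replaces the earlier one.

```python
# Stand-in data: three locations taken from the output shown earlier.
sample_locations = [
    {"name": "Worldwide", "woeid": 1, "country": ""},
    {"name": "Winnipeg", "woeid": 2972, "country": "Canada"},
    {"name": "Ottawa", "woeid": 3369, "country": "Canada"},
]

# Same mapping as the loop in cell 8, written as a dict comprehension.
woeid_by_name = {place["name"]: place["woeid"] for place in sample_locations}


def find_woeid(name, table):
    """Look up a WOEID by place name, ignoring case; return None if absent."""
    lowered = {key.lower(): value for key, value in table.items()}
    return lowered.get(name.lower())


print(find_woeid("ottawa", woeid_by_name))
print(find_woeid("Atlantis", woeid_by_name))
```

Returning `None` for an unknown place avoids the `KeyError` that a plain `table[name]` lookup would raise.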


Looking great! Now that we have a dictionary we can use to look up geolocation identifiers, let's use it to find out what is currently on people's minds as posted from the Milwaukee, WI, USA geolocation (my home city)!

First, we can pass the "Milwaukee" key to our newly formed WOEID dictionary to learn Milwaukee's specific ID...

In [10]:
woeid_dict['Milwaukee']

2451822

Next, let's use that identifier and the __get_place_trends__ function of our Twitter API to tune into Milwaukee-based posts...

In [11]:
mketrends = api.get_place_trends(id = '2451822' )

Great, we've captured Milwaukee trends data in our local project variable __mketrends__. Let's see what our variable now contains!

First let's confirm the type of object we've captured by using Python's __type()__ function...

In [12]:
type(mketrends)

list

Looks like we have a list! We can learn more about our list by using the __len()__ function to determine the size and the __type()__ function again to learn what type of objects the list contains...

In [13]:
len(mketrends)

1

So we have a list of one element. Let's check out what type of element the list contains using __type()__ on the single member (found at index 0)...

In [14]:
type(mketrends[0])

dict

The single element is a dictionary. Its __"trends"__ key holds the data we're after, so let's check that value's type and length, then peek at its first entry...

In [15]:
type(mketrends[0]["trends"])

list

In [16]:
len(mketrends[0]["trends"])

50

In [17]:
mketrends[0]["trends"][0]

{'name': 'Broncos',
 'url': 'http://twitter.com/search?q=Broncos',
 'promoted_content': None,
 'query': 'Broncos',
 'tweet_volume': 78989}
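
Taken together, the last few cells show the shape of the response: a one-element list wrapping a dictionary, whose __"trends"__ key holds a list of per-trend dictionaries. The sketch below walks that structure using a small hand-made stand-in (two trends and the timestamp are copied from outputs in this notebook; it is not a live response).

```python
# Hand-made stand-in with the same nesting as the real response.
mock_response = [
    {
        "trends": [
            {"name": "Broncos", "tweet_volume": 78989},
            {"name": "Mario", "tweet_volume": 734303},
        ],
        "as_of": "2022-10-07T02:24:07Z",
    }
]

payload = mock_response[0]   # the only element of the outer list
trends = payload["trends"]   # list with one dict per trending topic
trend_names = [trend["name"] for trend in trends]

print(payload["as_of"], trend_names)
```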

To make things a little prettier, let's pass our __"trends"__ list of dictionaries to __pandas__ to generate a tabular data frame. We'll sort the result so the highest tweet volumes are listed first and, for brevity's sake, display only the top 10 Milwaukee trends below.

In [37]:
trends_data_frame_mke = pd.DataFrame.from_dict(mketrends[0]["trends"])
trends_data_frame_mke.sort_values("tweet_volume", inplace = True, ascending = False)
trends_data_frame_mke.index = range(1,len(trends_data_frame_mke)+1)

print(mketrends[0]["as_of"])
trends_data_frame_mke.iloc[:10]

2022-10-07T02:24:07Z


| | name | url | promoted_content | query | tweet_volume |
|---|---|---|---|---|---|
| 1 | Mario | http://twitter.com/search?q=Mario | | Mario | 734303.0 |
| 2 | Marijuana | http://twitter.com/search?q=Marijuana | | Marijuana | 459478.0 |
| 3 | Bowser | http://twitter.com/search?q=Bowser | | Bowser | 170388.0 |
| 4 | Hunter Biden | http://twitter.com/search?q=%22Hunter+Biden%22 | | %22Hunter+Biden%22 | 144269.0 |
| 5 | Chris Pratt | http://twitter.com/search?q=%22Chris+Pratt%22 | | %22Chris+Pratt%22 | 131136.0 |
| 6 | Jack Black | http://twitter.com/search?q=%22Jack+Black%22 | | %22Jack+Black%22 | 89511.0 |
| 7 | Broncos | http://twitter.com/search?q=Broncos | | Broncos | 78989.0 |
| 8 | Luigi | http://twitter.com/search?q=Luigi | | Luigi | 51282.0 |
| 9 | Primetime | http://twitter.com/search?q=Primetime | | Primetime | 30060.0 |
| 10 | Matt Ryan | http://twitter.com/search?q=%22Matt+Ryan%22 | | %22Matt+Ryan%22 | 24912.0 |


**Table attribute descriptions...**  
name: trending term  
url: Twitter search URL for the trend  
promoted_content: whether the trend is promoted (paid for by an organization)  
query: URL-encoded query string for the trend  
tweet_volume: number of tweets within a 24-hour period (missing when Twitter does not report it)  
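
A note on the numbers: in the raw response __tweet_volume__ is an integer or `None`. When pandas builds a numeric column that contains missing values, it stores the column as floats and shows the gaps as `NaN`, which is why the volumes above end in `.0`. As a rough illustration of the ranking step without pandas, here is a sketch of a hypothetical `top_trends` helper that sorts by volume and places trends with no reported volume last. The sample values are taken from outputs in this notebook.

```python
def top_trends(trends, n=10):
    """Return up to n trends, largest tweet_volume first, missing volumes last."""
    return sorted(
        trends,
        # Sort key: (is the volume missing?, negative volume), so real volumes
        # come first in descending order and None values fall to the end.
        key=lambda t: (t["tweet_volume"] is None, -(t["tweet_volume"] or 0)),
    )[:n]


sample_trends = [
    {"name": "Broncos", "tweet_volume": 78989},
    {"name": "Hackett", "tweet_volume": None},
    {"name": "Mario", "tweet_volume": 734303},
    {"name": "Luigi", "tweet_volume": 51282},
]

for trend in top_trends(sample_trends, n=3):
    print(trend["name"], trend["tweet_volume"])
```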

And now let's repeat these steps for worldwide trends...

In [38]:
worldtrends = api.get_place_trends(id = '1' )
trends_data_frame_world = pd.DataFrame.from_dict(worldtrends[0]["trends"])

trends_data_frame_world.sort_values("tweet_volume", inplace = True, ascending = False)
trends_data_frame_world.index = range(1,len(trends_data_frame_world)+1)

print(worldtrends[0]["as_of"])
trends_data_frame_world.iloc[:10]

2022-10-07T02:45:41Z


| | name | url | promoted_content | query | tweet_volume |
|---|---|---|---|---|---|
| 1 | Mario | http://twitter.com/search?q=Mario | | Mario | 755061.0 |
| 2 | Bowser | http://twitter.com/search?q=Bowser | | Bowser | 174058.0 |
| 3 | Hunter Biden | http://twitter.com/search?q=%22Hunter+Biden%22 | | %22Hunter+Biden%22 | 147895.0 |
| 4 | Chris Pratt | http://twitter.com/search?q=%22Chris+Pratt%22 | | %22Chris+Pratt%22 | 134123.0 |
| 5 | Broncos | http://twitter.com/search?q=Broncos | | Broncos | 93681.0 |
| 6 | Jack Black | http://twitter.com/search?q=%22Jack+Black%22 | | %22Jack+Black%22 | 91428.0 |
| 7 | La Plata | http://twitter.com/search?q=%22La+Plata%22 | | %22La+Plata%22 | 74891.0 |
| 8 | Gimnasia | http://twitter.com/search?q=Gimnasia | | Gimnasia | 63408.0 |
| 9 | Luigi | http://twitter.com/search?q=Luigi | | Luigi | 53060.0 |
| 10 | Endrick | http://twitter.com/search?q=Endrick | | Endrick | 33984.0 |



## Finding Common Trends

Now that we've separately extracted the trends for Milwaukee and worldwide, let's see if there are any overlapping topics. The code below takes the trend names from each data frame and keeps every Milwaukee trend that is also trending worldwide.

The final result is a list of trending topics associated with both geolocations!

In [45]:
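# Build the set of worldwide trend names once, so each membership test is fast.
world_names = set(trends_data_frame_world.name)

# Keep Milwaukee trends (in their sorted order) that also appear worldwide.
mke_world_common_trends = [name for name in trends_data_frame_mke.name if name in world_names]

print(mke_world_common_trends)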

['Mario', 'Bowser', 'Hunter Biden', 'Chris Pratt', 'Jack Black', 'Broncos', 'Luigi', 'Matt Ryan', 'Hines', 'Saweetie', 'Armageddon', 'Quavo', 'Blake Masters', 'Mark Kelly', 'Commanders', '#ThursdayNightFootball', '#INDvsDEN', 'Al Michaels', 'Hackett', 'Judy Tenuta', 'Matty Ice', 'Sutton', 'Frank Reich', 'Graphic B', 'Melvin Gordon', 'Nick Foles', 'Drew Lock']


In [40]:
common_trend_dict ={}

for trend in mke_world_common_trends:
    mke_volume = trends_data_frame_mke.loc[trends_data_frame_mke.name == trend,"tweet_volume"].tolist()[0]
    world_volume = trends_data_frame_world.loc[trends_data_frame_world.name == trend, "tweet_volume"].tolist()[0]
    common_trend_dict[trend] = [mke_volume, world_volume]
    
common_trend_dict

{'Mario': [734303.0, 755061.0],
 'Bowser': [170388.0, 174058.0],
 'Hunter Biden': [144269.0, 147895.0],
 'Chris Pratt': [131136.0, 134123.0],
 'Jack Black': [89511.0, 91428.0],
 'Broncos': [78989.0, 93681.0],
 'Luigi': [51282.0, 53060.0],
 'Matt Ryan': [24912.0, 33029.0],
 'Hines': [22555.0, 23229.0],
 'Saweetie': [22018.0, 23966.0],
 'Armageddon': [21099.0, 25406.0],
 'Quavo': [18614.0, 19959.0],
 'Blake Masters': [15312.0, 18365.0],
 'Mark Kelly': [14256.0, 16855.0],
 'Commanders': [12279.0, 13324.0],
 '#ThursdayNightFootball': [nan, nan],
 '#INDvsDEN': [nan, nan],
 'Al Michaels': [nan, nan],
 'Hackett': [nan, nan],
 'Judy Tenuta': [nan, nan],
 'Matty Ice': [nan, nan],
 'Sutton': [nan, 10402.0],
 'Frank Reich': [nan, nan],
 'Graphic B': [nan, nan],
 'Melvin Gordon': [nan, nan],
 'Nick Foles': [nan, nan],
 'Drew Lock': [nan, nan]}
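
Many of the entries above have no reported volume for one or both locations. As a sketch of one way to keep only the comparable pairs, the hypothetical `common_with_volumes` helper below intersects two name-to-volume mappings and skips any trend whose volume is `NaN` on either side. The sample values are copied from the outputs in this notebook.

```python
import math

# Stand-in data: a few name -> tweet_volume pairs from the tables and dictionary above.
mke_volumes = {
    "Mario": 734303.0,
    "Marijuana": 459478.0,
    "Broncos": 78989.0,
    "Sutton": float("nan"),
}
world_volumes = {
    "Mario": 755061.0,
    "Broncos": 93681.0,
    "Sutton": 10402.0,
    "Endrick": 33984.0,
}


def common_with_volumes(first, second):
    """Map each shared name to [first_volume, second_volume], skipping NaN volumes."""
    result = {}
    for name, first_volume in first.items():
        if name not in second:
            continue  # not trending in both places
        second_volume = second[name]
        if math.isnan(first_volume) or math.isnan(second_volume):
            continue  # a volume is missing, so the pair cannot be compared
        result[name] = [first_volume, second_volume]
    return result


print(common_with_volumes(mke_volumes, world_volumes))
```

Dropping the incomplete pairs is only one choice; keeping them and flagging the gap would suit the volume-estimation idea in the list of possible features.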

Very cool! We now have some neat insight into which topics are trending both in Milwaukee and worldwide. This pipeline can be rerun on demand.

Some other features to implement could include:  
* Manual estimations of tweet volumes where tweet_volume is not available
* Image scraping for specific trends