# Problem 1 - Twitter

This exercise introduces collection of Social Media data and working with semi-structured data in JSON format. 

Twitter is a large source of social media data including the Twitter feed and the social network structure. The data on each tweet is semi-structured (in JSON) and dominated by metadata. The total size of each tweet is around 5-6K, which is significantly more than 140 characters dedicated to the actual text.

<p>Twitter provides application developers with an API (Application Program Interface) that allows applications identify trending topics, access user info (for example, user timeline, lists of followers and friends), perform search queries, stream a subset of the entire Twitter feed in real time (which is sometimes referred to as “sipping from a fire hose”) and perform many other functions. Twitter API conforms to REST architecture. REST stands for Representational State Transfer and is typically associated with 4 operations of HTTP protocol: GET, PUT, POST, and DELETE. While it is possible to access each aspect of Twitter API in terms of these operations, it is usually more convenient to use libraries that implement “wrappers” around them. One such library is a Python package twitter. For more information on REST architecture, see</p>

<p><a href="http://www.infoq.com/articles/rest-introduction">A Brief Introduction to REST</a></p>
<p><a href="http://net.tutsplus.com/tutorials/other/a-beginners-introduction-to-http-and-rest/">A Beginner’s Guide to HTTP and REST</a></p>
<p>Twitter API documentation can be found at  <a href="https://dev.twitter.com/docs">https://dev.twitter.com/docs</a></p>


For accessing Twitter API, you need to

<b> Create Your Twitter App </b>
<p><font color='blue'><b>Step 1</b>&nbsp;</font> Go to https://dev.twitter.com</p>
<p><font color='blue'><b>Step 2</b>&nbsp;</font> Click <em>Sign In</em>. in the top-right corner, and log in.</p>

<p style="margin-left: 48px">You need to have a Twitter account.</p>
<p style="margin-left: 48px"><font color = 'red'>Please follow MMA865 on Twitter.</font></p>
<img style="position:absolute; top:500px; left:550px; width:120px; height:160px" src="mma865.png">
<p><font color='blue'><b>Step 3</b>&nbsp;</font> https://apps.twitter.com.</p>

<p><font color='blue'><b>Step 4</b>&nbsp;</font> Click Create New App:</p>

<ul>
<li> Name <font color='red'>\*</font>: juewangMMA865 (create your own name)  </li>
<li> Description <font color='red'>\*</font>: MMA865 Big Data </li>
<li> Website <font color='red'>\*</font>: https://mma865.ca </li>
<li> Callback URL <font color='red'>\*</font>: leave it blank </li>
</ul>
<img style="width:600px" align ="left" src="step4.png">

<p><font color='blue'><b>Step 5</b>&nbsp;</font> You can check your info in https://apps.twitter.com</p>
<p><font color='blue'><b>Step 6</b>&nbsp;</font> Click <mark>Keys and Access Tokens</mark></p>
<p style="margin-left: 48px">You will see your Consumer Key <span style="color: #ff0000">\*</span>, Consumer Secret <span style="color: #ff0000">\*</span></p>
<p><img style="width:600px" align ="left" src="step6.png"></p>

<p><font color='blue'><b>Step 7</b>&nbsp;</font> Click Create my access token near the bottom of the current page</p>
<p style="margin-left: 48px">You now have Access Token <span style="color: #ff0000">\*</span>, Access Token Secret <span style="color: #ff0000">\*</span></p>
<img style="width:600px" align ="left" src="step7.png">

Assuming that all these steps are successfully completed, we can write a Python program to extract information from Twitter. For this problem, we use twitter python package; see either of the following:
<p><a href= "http://mike.verdone.ca/twitter/">http://mike.verdone.ca/twitter/</a></p>
<a href= "https://github.com/sixohsix/twitter/tree/master">https://github.com/sixohsix/twitter/tree/master</a>

## Part I ##
Define a Twitter OAuth login function:

<blockquote>
OAuth stands for Open Authentication.
</blockquote>

Copy and paste into the fields of the code below your Consumer key, Consumer secret, Access token (into OAUTH_TOKEN), and Access token secret (into OAUTH_TOKEN_SECRET) that you get from Step 6 and 7 in **Create Your Twitter App**. 

In [1]:
### Code Block 1-1 ###

import twitter
import urllib2
import json

def oauth_login():
    CONSUMER_KEY = 'cHzkizO8oSQCpqzdrCg3cuSga'
    CONSUMER_SECRET = 'DMC0KQDUPEtO8QSUsdwWEgaL63V1YRUR3rNaoVWq9v0nnYaKw6'
    OAUTH_TOKEN = '4796563954-DKsPi0lSoR9Re1EUjsJ8VeJJSpJt5ajpJLNOYyA'
    OAUTH_TOKEN_SECRET = 'k9WDRzqTEtFqKGfhZHHALi10DqSIFn4fZtbdpgQ3eKNKh'
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

<p>Now, let's define a function to check trends on Twitter:</p>

In [2]:
### Code Block 1-2 ###

def twitter_trends(twitter_api, woe_id):
    # Prefix ID with the underscore for query string parameterization.
    # Without the underscore, the twitter package appends the ID value
    # to the URL itself as a special-case keyword argument.
    return twitter_api.trends.place(_id=woe_id)

Now, we can log in the Twitter api by running Code Block 1-3.
<blockquote>
<div style="background-color: rgba(0,0,0,0.7); display:inline"><font color='white'>&nbsp;Tip 1&nbsp;<font></div>&nbsp;&nbsp;In order to run Code Block 1-3, you need to run Code 1-1 first, because Code Block 1-3 calls function ```oauth_login()``` that is defined in Code Block 1-1.
</blockquote>

In [3]:
### Code Block 1-3 ###

twitter_api = oauth_login()
print twitter_api

<twitter.api.Twitter object at 0x0000000003AE5630>


Let's examine the trends around the world.

<blockquote>
<div style="background-color: rgba(0,0,0,0.7); display:inline"><font color='white'>&nbsp;Tip 2&nbsp;<font></div>&nbsp;&nbsp;In order to run Code 1-4, you need to first run Code Blocks 1-1, 1-2, and 1-3.
</blockquote>

<font color='purple'><b>Question 1-1:</b></font> What's the reason for Tip 2? Hint: it should be similar to Tip 1

<font color='purple'><b>Write down your answer for Question 1-1 in this cell:</b></font>

Tip 1 indicates that Code 1-1 is required for Code Block 1-3, as you cannot call the function oauth_login() unless you define it, which is done in Code Block 1-1. Similar to Tip 1, Tip 2 indicates that all 3 prior Code Blocks (1-1, 1-2, and 1-3) are required for Code Block 1-4. This is because the world_trends is defined as the function twitter_trends, which uses twitter_api, which is defined as oauth_login(). Therefore Code Block 1-4 is dependent on Code Blocks 1-1, 1-2 and 1-3 in order for the code to run correctly.

In [4]:
### Code Block 1-4 ###

try:
    WORLD_WOE_ID = 1
    world_trends = twitter_trends(twitter_api, WORLD_WOE_ID)
    print json.dumps(world_trends, indent=1)
except urllib2.URLError, e:
    print "Error", e
    #URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
    #This error may happen if you do not have a consistent internet connection.

[
 {
  "created_at": "2016-01-16T18:21:23Z", 
  "trends": [
   {
    "url": "http://twitter.com/search?q=%23%D8%B2%D8%AF_%D8%B1%D8%B5%D9%8A%D8%AF%D9%8364", 
    "query": "%23%D8%B2%D8%AF_%D8%B1%D8%B5%D9%8A%D8%AF%D9%8364", 
    "tweet_volume": 131704, 
    "name": "#\u0632\u062f_\u0631\u0635\u064a\u062f\u064364", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23Pel%C3%ADculasQueHeVistoMilVeces", 
    "query": "%23Pel%C3%ADculasQueHeVistoMilVeces", 
    "tweet_volume": 23325, 
    "name": "#Pel\u00edculasQueHeVistoMilVeces", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=Chelsea", 
    "query": "Chelsea", 
    "tweet_volume": 295282, 
    "name": "Chelsea", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23%D8%A7%D9%84%D8%B3%D8%B9%D9%88%D8%AF%D9%8A%D9%87_%D9%83%D9%88%D8%B1%D9%8A%D8%A7_%D8%A7%D9%84%D8%B4%D9%85%D8%A7%D9%84%D9%8A%D9%87", 
    "query": "%23%D8%A7%D9%84%D8%B3%D8%B9%D9%88%

<font color='purple'><b>Question 1-2:</b></font> what are the names of the fields for each element (object) in the “trends” list and what do these fields represent?

<font color='purple'><b>Write down your answer for Question 1-2 in this cell:</b></font>

There are five fields for each element in the "trends" list:

1. url - the search url of what topic is specifically trending.
2. query - the individual trending item that is being queried.
3. tweet_volume - the number of tweets with that specific term.
4. name - the term that is trending.
5. promoted_content - whether the specified term is flagged as promoted content.

<p>As you can see <q>trends</q> field contains a list of trending terms for the world. You can also find the Twitter trends in US:
<blockquote><p>```json.dumps()``` converts a Python object ```world_trends``` into a string containing its JSON-formatted representation. The keyword ```indent = 1``` forces formatting into multiple lines with indentation of 1 for sub-object fields (without this keyword, the output will not be broken into multiple lines).</p>
<p>Try ```indent = 0```</p>
</blockquote>

In [67]:
### Code Block 1-4-a ###

try:
    WORLD_WOE_ID = 1
    world_trends = twitter_trends(twitter_api, WORLD_WOE_ID)
    print json.dumps(world_trends, indent=0)
except urllib2.URLError, e:
    print "Error", e
    #URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
    #This error may happen if you do not have a consistent internet connection.

[
{
"created_at": "2016-01-15T16:47:53Z", 
"trends": [
{
"url": "http://twitter.com/search?q=%E3%83%90%E3%83%AB%E3%82%B9", 
"query": "%E3%83%90%E3%83%AB%E3%82%B9", 
"tweet_volume": 1831831, 
"name": "\u30d0\u30eb\u30b9", 
"promoted_content": null
}, 
{
"url": "http://twitter.com/search?q=%239YearsOfKidrauhl", 
"query": "%239YearsOfKidrauhl", 
"tweet_volume": 270360, 
"name": "#9YearsOfKidrauhl", 
"promoted_content": null
}, 
{
"url": "http://twitter.com/search?q=%23HoyTengoGanasDe", 
"query": "%23HoyTengoGanasDe", 
"tweet_volume": null, 
"name": "#HoyTengoGanasDe", 
"promoted_content": null
}, 
{
"url": "http://twitter.com/search?q=%23%D8%A7%D9%84%D9%87%D9%84%D8%A7%D9%84_%D8%B3%D9%8A%D9%86%D8%A7%D9%8A%D9%88%D9%83%D9%8A", 
"query": "%23%D8%A7%D9%84%D9%87%D9%84%D8%A7%D9%84_%D8%B3%D9%8A%D9%86%D8%A7%D9%8A%D9%88%D9%83%D9%8A", 
"tweet_volume": 20179, 
"name": "#\u0627\u0644\u0647\u0644\u0627\u0644_\u0633\u064a\u0646\u0627\u064a\u0648\u0643\u064a", 
"promoted_content": null
}, 
{
"url": "http

In [69]:
### Code Block 1-5 ###

try:
    US_WOE_ID = 23424977
    us_trends = twitter_trends(twitter_api, US_WOE_ID)
    print json.dumps(us_trends, indent=1)
except urllib2.URLError, e:
    print "Error", e
    #URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
    #This error may happen if you do not have a consistent internet connection.

[
 {
  "created_at": "2016-01-15T16:47:53Z", 
  "trends": [
   {
    "url": "http://twitter.com/search?q=%23NewYorkValues", 
    "query": "%23NewYorkValues", 
    "tweet_volume": 31649, 
    "name": "#NewYorkValues", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%22Grizzly+Adams%22", 
    "query": "%22Grizzly+Adams%22", 
    "tweet_volume": null, 
    "name": "Grizzly Adams", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%239YearsOfKidrauhl", 
    "query": "%239YearsOfKidrauhl", 
    "tweet_volume": 270360, 
    "name": "#9YearsOfKidrauhl", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23NationalHatDay", 
    "query": "%23NationalHatDay", 
    "tweet_volume": null, 
    "name": "#NationalHatDay", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23fridayreads", 
    "query": "%23fridayreads", 
    "tweet_volume": null, 
    "name": "#fridayreads"

<font color='purple'><b>Question 1-3:</b></font> Find the Twitter trends in Canada and Toronto.
<p>Hint<font color='red'>\*</font>: ```US_WOE_ID``` and ```us_trends``` are used to find the Twitter trends in <font color='blue'>US</font>. Use ```CA_WOE_ID``` and ```TRT_WOE_ID``` in place of ```US_WOE_ID```. Use ```ca_trends``` and ```trt_trends``` in place of ```us_trends```. ```WOE_IDs``` for Canada and Toronto can be found at <a href = "http://woeid.rosselliot.co.nz">http://woeid.rosselliot.co.nz</a>.</p>

The modified code can be found below:

In [5]:
### Write down your answer for Question 1-3 in this cell ###
### Your answer is python code that is similar to Code 1-5 with appropriate changes ###
### Code Block 1-5-a ###

try:
    TOR_WOE_ID = 4118
    tr_trends = twitter_trends(twitter_api, TOR_WOE_ID)
    print json.dumps(tr_trends, indent=1)
except urllib2.URLError, e:
    print "Error", e
    
try:
    CN_WOE_ID = 23424775
    ca_trends = twitter_trends(twitter_api, CN_WOE_ID)
    print json.dumps(ca_trends, indent=1)
except urllib2.URLError, e:
    print "Error", e



[
 {
  "created_at": "2016-01-16T18:21:24Z", 
  "trends": [
   {
    "url": "http://twitter.com/search?q=%23FilmFreeway", 
    "query": "%23FilmFreeway", 
    "tweet_volume": null, 
    "name": "#FilmFreeway", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%22Canada+Is+Suddenly%22", 
    "query": "%22Canada+Is+Suddenly%22", 
    "tweet_volume": null, 
    "name": "Canada Is Suddenly", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23ProjectWinterSurvival", 
    "query": "%23ProjectWinterSurvival", 
    "tweet_volume": null, 
    "name": "#ProjectWinterSurvival", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23PrideTO", 
    "query": "%23PrideTO", 
    "tweet_volume": null, 
    "name": "#PrideTO", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%22David+MacNaughton%22", 
    "query": "%22David+MacNaughton%22", 
    "tweet_volume": null, 
    "nam

<p>Results can also be saved into text files as follows:</p>

In [8]:
### Code Block 1-6 ###

fw = open('wt.json','w')
json.dump(tr_trends,fw,indent=1)
fw.close()
fc = open('ct.json','w')
json.dump(ca_trends,fc,indent=1)
fc.close()



<blockquote><p>The above code opens the text files ```wt.json``` and ```ct.json``` for writing (using ```open()``` function) and calls ```json.dump()``` method to print trend results (with indentation). In the tutorial for the next lecture, we will examine how to save results in a document-based DB (MongoDB). Results can also be manipulated for analysis.</p>
<p><font color='blue'>Check</font> your current folder for the files ```wt.json``` and ```ct.json```.<p>
</blockquote>

<p>Consider the lists of trending topics in the whole world and in Canada: </p>

<p>We now convert these lists into sets:</p>

<blockquote><p> You have to finish <font color='purple'><b>Question 1-3</b></font> to run Code Block 1-7. Otherwise, change ```ca_trends``` to ```us_trends``` in Code Block 1-7.
</p></blockquote>

In [9]:
### Code Block 1-7 ###

wn = [ t['name'] for t in world_trends[0]['trends'] ]
cn = [ t['name'] for t in ca_trends[0]['trends'] ]

w_set = set(wn)
c_set = set(cn)
common_trends = w_set.intersection(c_set)

<font color='purple'><b>Question 1-4:</b></font>
<ul>
<li>What topics do these lists include? Use print command to display wn and cn.</li>
<li>Print to check whether there are any topics in common. What is the output of this print statement?</li>
</ul>

In [11]:
## Your Code for Question 1-4 ##

print "\n".join(wn) 
print "\n".join(cn) 
print "\n".join(common_trends)

#زد_رصيدك64
#PelículasQueHeVistoMilVeces
Chelsea
#السعوديه_كوريا_الشماليه
#SabadoDetremuraSdv
#TFCPSG
Milagro Sala
Mahrez
Sinan Gümüş
I LOVE ZAYN
Jason Rezaian
Aaliyah
Guidetti
HOY JUEGA BOCA
AVRIL FEAT HARRY
#SomosFasDeBandinhaCoreana
#Bang100Milhões
#FelizSabado
#شنو_راح_يقولون_عنك_بالمستقبل
#تصدق_بتغريده_لوالديك
#CBLoL
#EuPrecisoDe
#小泉花陽生誕祭2016
#ÇokHoşOlurdu
#ViveLaPatria
#MagconTourToMexico
#PeorQueUnPeloEnLaSopa
#LucasLuccoNoCaldeirao
#شو_مزعلك
#DoYourJob
#알티당_실제성격
#ДирекшионерыЗахватятВселенную
#EsteFinde
#10milyonliramolsa
#HappyKanginDay
#ConRobertEnBatalla
#AtalantaInter
#voleinaredetv
#TheySayItsImpossibleBut
#BerxwedanSerhildanJiyane
#Amici15
#JustinMeetDalelyla
#SAvENG
#التويتراويه_وفولورز_يناير
#TorinoFrosinone
#ESPGER
#AVLLEI
#BazenAğlamak
#あなたが作りそうなジブリ作品のタイトル
#RIPHarmonizerSophia
#TEDxLangleyED
John Terry
Chelsea
#ProjectWinterSurvival
#FilmFreeway
Everton
Canada Is Suddenly
#CFCvEFC
#AWNChat
Ian Kennedy
David MacNaughton
Aaliyah
Costa
Orioles
Washington Post
Hagelin
Jon

Output

#زد_رصيدك64
#PelículasQueHeVistoMilVeces
Chelsea
#السعوديه_كوريا_الشماليه
#SabadoDetremuraSdv
#TFCPSG
Milagro Sala
Mahrez
Sinan Gümüş
I LOVE ZAYN
Jason Rezaian
Aaliyah
Guidetti
HOY JUEGA BOCA
AVRIL FEAT HARRY
#SomosFasDeBandinhaCoreana
#Bang100Milhões
#FelizSabado
#شنو_راح_يقولون_عنك_بالمستقبل
#تصدق_بتغريده_لوالديك
#CBLoL
#EuPrecisoDe
#小泉花陽生誕祭2016
#ÇokHoşOlurdu
#ViveLaPatria
#MagconTourToMexico
#PeorQueUnPeloEnLaSopa
#LucasLuccoNoCaldeirao
#شو_مزعلك
#DoYourJob
#알티당_실제성격
#ДирекшионерыЗахватятВселенную
#EsteFinde
#10milyonliramolsa
#HappyKanginDay
#ConRobertEnBatalla
#AtalantaInter
#voleinaredetv
#TheySayItsImpossibleBut
#BerxwedanSerhildanJiyane
#Amici15
#JustinMeetDalelyla
#SAvENG
#التويتراويه_وفولورز_يناير
#TorinoFrosinone
#ESPGER
#AVLLEI
#BazenAğlamak
#あなたが作りそうなジブリ作品のタイトル
#RIPHarmonizerSophia
#TEDxLangleyED
John Terry
Chelsea
#ProjectWinterSurvival
#FilmFreeway
Everton
Canada Is Suddenly
#CFCvEFC
#AWNChat
Ian Kennedy
David MacNaughton
Aaliyah
Costa
Orioles
Washington Post
Hagelin
Jonjo Shelvey
U.S. and UN
John Scott
René Angélil
Tinordi
Alan Rickman
Patrick Kane
Grizzly Adams
Pacific Centre
Victor Bartley
Ouagadougou
#JasonRezaian
#IfIDidntKnowBetterIdThink
#OTTHAC16
#TheySayItsImpossibleBut
#PrideTO
#Brunch
#IranDeal
#ldnbudget
#screenshotsaturday
#PatsNation
#MCFC
#edcampbarrie
#CHAN2016
#esktransformationweekend
#Caturday
#MyJasper
#ImABadCanadianBecause
#MakeMeThinkIn5Words
#ALDUB6thMonthsary
#9YearsOfKidrauhl
#BurkinaFaso
#NewYorkValues
#BlueBloods
Aaliyah
#TheySayItsImpossibleBut
Chelsea

To conclude this part, we briefly examine Twitter’s Search API. This aspect gets complicated, because Twitter limits the rate at which data can be obtained (see <a href = "https://dev.twitter.com/docs/rate-limiting/1.1">https://dev.twitter.com/docs/rate-limiting/1.1</a>), only recent tweets are indexed (a typical tweet remains indexed for 6-9 days) and network failures can prevent the application from getting a response to its query. Thus, we will have to make search queries robust to failures. For now, we consider a simplified scenario in which we assume that a response can be reliably obtained from Twitter API. Run the following search that requests up to 100 tweets with a keyword ‘databricks’:

The keywords for the query and the use Twitter Search are described in detail at <a href = "https://dev.twitter.com/docs/api/1.1/get/search/tweets">https://dev.twitter.com/docs/api/1.1/get/search/tweets</a> and <a href = "https://dev.twitter.com/docs/using-search">https://dev.twitter.com/docs/using-search</a>. 


In [92]:
### Code Block 1-8 ###
### You have to run Code Blocks 1-1, 1-2, and 1-3 in order to run Code Block 1-8 

search_results = twitter_api.search.tweets(q='Databricks', count=100)
#Specific tweets matching our query can be extracted as
statuses = search_results['statuses']
#Examine the first tweet among these results:
print json.dumps(statuses[0],indent=1)

{
 "contributors": null, 
 "truncated": false, 
 "text": "#Data: Spark 2015 Year In Review - Apache Spark went through a lot in 2015. Get a solid review from Databricks,... https://t.co/dhdYBc7GPO", 
 "is_quote_status": false, 
 "in_reply_to_status_id": null, 
 "id": 688039693607383040, 
 "favorite_count": 0, 
 "source": "<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">Hootsuite</a>", 
 "retweeted": false, 
 "coordinates": null, 
 "entities": {
  "symbols": [], 
  "user_mentions": [], 
  "hashtags": [
   {
    "indices": [
     0, 
     5
    ], 
    "text": "Data"
   }
  ], 
  "urls": [
   {
    "url": "https://t.co/dhdYBc7GPO", 
    "indices": [
     115, 
     138
    ], 
    "expanded_url": "http://ow.ly/3a8jau", 
    "display_url": "ow.ly/3a8jau"
   }
  ]
 }, 
 "in_reply_to_screen_name": null, 
 "in_reply_to_user_id": null, 
 "retweet_count": 0, 
 "id_str": "688039693607383040", 
 "favorited": false, 
 "user": {
  "follow_request_sent": false, 
  "has_extended_profile": fals

The keywords for the query and the use Twitter Search are described in detail at https://dev.twitter.com/docs/api/1.1/get/search/tweets and https://dev.twitter.com/docs/using-search

<font color='purple'><b>Question 1-5:</b></font> A complete description of the fields can be found at <a href = "https://dev.twitter.com/docs/platform-objects/tweets">https://dev.twitter.com/docs/platform-objects/tweets</a>, identify the id of the user who posted this tweet afer you run Code Block 1-8, his/her number of friends and followers and the posting date.

<font color='purple'><b>Write down your answer for Question 1-5 in this cell:</b></font>

<b>id:</b> 688039693607383040 <br>
<b>friends_count:</b> 777, <br>
<b>followers_count:</b> 970, <br>
<b>created_at:</b> Fri Jan 15 16:47:00 +0000 2016



<p>The results can be saved into a file for subsequent analysis</p>

In [93]:
### Code Block 1-9 ###

f = open('statuses.json','w') # Check you current folder for statuses.json
json.dump(statuses,f,indent=1)
f.close()

<p>Next, we look for the most popular tweets in our results. This can be approached through the number of retweets or the number of times the tweet was marked as favorite. Extract the respective counts from the results:</p>

In [94]:
### Code Block 1-10 ###

rc=[s['retweet_count'] for s in statuses]
fc=[s['favorite_count'] for s in statuses] 

<p>Find the indices of the most retweeted and the most favorite tweet in the results:</p>

In [95]:
### Code Block 1-11 ###

import numpy
rmax=numpy.argmax(rc)
fmax=numpy.argmax(fc)

### Print the corresponding max numbers and the text of the tweets:
print rc[rmax],statuses[rmax]['text']
print fc[fmax],statuses[fmax]['text']

63 RT @ApacheSpark: Two Spark MOOCs on edX: Introduction to Big Data with Apache Spark and Scalable Machine Learning https://t.co/OPR0eX8L4A
13 Databricks' leadership changes: path to acquisition/IPO? #data #in


<font color='purple'><b>Question 1-6:</b></font> Indicate these numbers and the texts of the tweets in your submission. 

<font color='purple'><b>Write down your answer for Question 1-6 in this cell:</b></font>

Index: 63 <br>
Text: RT @ApacheSpark: Two Spark MOOCs on edX: Introduction to Big Data with Apache Spark and Scalable Machine Learning https://t.co/OPR0eX8L4A <br>

Index: 13 <br>
Text: Databricks' leadership changes: path to acquisition/IPO? #data #in

<p>The tweets can be examined for the presence of links, user mentions and hashtags:<p>

In [139]:
### Code Block 1-12 ###

index63h = statuses[rmax]['entities']['hashtags']
index63l = statuses[rmax]['entities']['urls']
index63u = statuses[rmax]['entities']['user_mentions']

index13h = statuses[fmax]['entities']['hashtags']
index13l = statuses[fmax]['entities']['urls']
index13u = statuses[fmax]['entities']['user_mentions']

print "Indexed Tweet 63"
print index63h
print index63l
print index63u

print "Indexed Tweet 13"
print index13h
print index13l
print index13u

##print statuses[rmax]['entities']
##print statuses[fmax]['entities']



Indexed Tweet 63
[]
[{u'url': u'https://t.co/OPR0eX8L4A', u'indices': [114, 137], u'expanded_url': u'https://databricks.com/blog/2015/06/01/databricks-launches-mooc-data-science-on-spark.html', u'display_url': u'databricks.com/blog/2015/06/0\u2026'}]
[{u'id': 1551361069, u'indices': [3, 15], u'id_str': u'1551361069', u'screen_name': u'ApacheSpark', u'name': u'Apache Spark'}]
Indexed Tweet 13
[{u'indices': [57, 62], u'text': u'data'}, {u'indices': [63, 66], u'text': u'in'}]
[]
[]


<font color='purple'><b>Question 1-7:</font> In your submission, provide the resulting lists of hastags, links and user mentions with your response.

<font color='purple'><b>Write down your answer for Question 1-7 in this cell:</b></font>



<font color='red'><b>Indexed Tweet 63</b><br></font>
<b>Links</b><br><b>URL</b> https://t.co/OPR0eX8L4AM <b><br>Expanded_URL</b> https://databricks.com/blog/2015/06/01/databricks-launches-mooc-data-science-on-spark.html <b><br>Display URL</b> https://databricks.com/blog/2015/06/0\u2026<br><br><b>User Mentions</b><br>
<b>uid</b>: 1551361069 <br><b>indices</b>: [3, 15]<br><b>id_str</b>: 1551361069<br> <b>screen_name</b>: ApacheSpark<br> <b>name:</b> Apache Spark
<font color='red'><b>Indexed Tweet 13</b><br></font>
<b>Hashtag</b><br><b>Index: </b>[57, 62]<br> <b>text: </b>data<br><b>indices: </b>[63, 66]<br><b>text: </b> in

<font color='purple'><b>Question 1-8:</font> Using what you learned so far to check the popularity of Hadoop, Spark, Databricks, and Hortonworks On Twitter

In [190]:
## Provide your code for Question 1-8 if necessary ##

##Hadoop
search_results = twitter_api.search.tweets(q='Hadoop', count=100)
#Specific tweets matching our query can be extracted as
statusesh = search_results['statuses']
#Examine the first tweet among these results:
print json.dumps(statusesh[0],indent=1)

f = open('statusesh.json','w') # Check you current folder for statuses.json
json.dump(statusesh,f,indent=1)
f.close()

rch=[s['retweet_count'] for s in statusesh]
fch=[s['favorite_count'] for s in statusesh] 

import numpy
rhmax=numpy.argmax(rch)
fhmax=numpy.argmax(fch)

print rch[rhmax],statuses[rhmax]['text']
print fch[fhmax],statuses[fhmax]['text']

hindex1h = statuses[rhmax]['entities']['hashtags']
hindex1l = statuses[rhmax]['entities']['urls']
hindex1u = statuses[rhmax]['entities']['user_mentions']

hindex2h = statuses[fhmax]['entities']['hashtags']
hindex2l = statuses[fhmax]['entities']['urls']
hindex2u = statuses[fhmax]['entities']['user_mentions']

print "Indexed Tweet 1"
print hindex1h
print hindex1l
print hindex1u

print "Indexed Tweet 2"
print hindex2h
print hindex2l
print hindex2u

##Spark
search_results = twitter_api.search.tweets(q='Spark', count=100)
#Specific tweets matching our query can be extracted as
statusess = search_results['statuses']
#Examine the first tweet among these results:
print json.dumps(statusess[0],indent=1)

f = open('statusesh.json','w') # Check you current folder for statuses.json
json.dump(statusess,f,indent=1)
f.close()

rcs=[s['retweet_count'] for s in statusess]
fcs=[s['favorite_count'] for s in statusess] 

import numpy
rsmax=numpy.argmax(rcs)
fsmax=numpy.argmax(fcs)

print rcs[rsmax],statuses[rsmax]['text']
print fcs[fsmax],statuses[fsmax]['text']

sindex1h = statuses[rhmax]['entities']['hashtags']
sindex1l = statuses[rhmax]['entities']['urls']
sindex1u = statuses[rhmax]['entities']['user_mentions']

sindex2h = statuses[fhmax]['entities']['hashtags']
sindex2l = statuses[fhmax]['entities']['urls']
sindex2u = statuses[fhmax]['entities']['user_mentions']

print "Indexed Tweet 1"
print sindex1h
print sindex1l
print sindex1u

print "Indexed Tweet 2"
print sindex2h
print sindex2l
print sindex2u

## Hortonworks
search_results = twitter_api.search.tweets(q='Hortonworks', count=100)
#Specific tweets matching our query can be extracted as
statusesho = search_results['statuses']
#Examine the first tweet among these results:
print json.dumps(statusesho[0],indent=1)

f = open('statusesho.json','w') # Check you current folder for statuses.json
json.dump(statusesho,f,indent=1)
f.close()

rcho=[s['retweet_count'] for s in statusesho]
fcho=[s['favorite_count'] for s in statusesho] 

import numpy
rhomax=numpy.argmax(rcho)
fhomax=numpy.argmax(fcho)

print rcho[rhomax],statuses[rhomax]['text']
print fcho[fhomax],statuses[fhomax]['text']

hoindex1h = statuses[rhomax]['entities']['hashtags']
hoindex1l = statuses[rhomax]['entities']['urls']
hoindex1u = statuses[rhomax]['entities']['user_mentions']

hoindex2h = statuses[fhomax]['entities']['hashtags']
hoindex2l = statuses[fhomax]['entities']['urls']
hoindex2u = statuses[fhomax]['entities']['user_mentions']

print "Indexed Tweet 1"
print hoindex1h
print hoindex1l
print hoindex1u

print "Indexed Tweet 2"
print hoindex2h
print hoindex2l
print hoindex2u


{
 "contributors": null, 
 "truncated": false, 
 "text": "RT @MondeInformatiq: [PARTENAIRE] #Hadoop : pourquoi faut-il y croire ? #revuedeIT @DellFrance https://t.co/bB8jt8zRVA https://t.co/8VVDsbM\u2026", 
 "is_quote_status": false, 
 "in_reply_to_status_id": null, 
 "id": 688109342545973248, 
 "favorite_count": 0, 
 "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", 
 "retweeted": false, 
 "coordinates": null, 
 "entities": {
  "symbols": [], 
  "user_mentions": [
   {
    "id": 95868974, 
    "indices": [
     3, 
     19
    ], 
    "id_str": "95868974", 
    "screen_name": "MondeInformatiq", 
    "name": "LeMondeInformatique"
   }, 
   {
    "id": 150212556, 
    "indices": [
     83, 
     94
    ], 
    "id_str": "150212556", 
    "screen_name": "DELLFrance", 
    "name": "DELL France"
   }
  ], 
  "hashtags": [
   {
    "indices": [
     34, 
     41
    ], 
    "text": "Hadoop"
   }, 
   {
    "indices": [
     72, 
     82
  

<font color='purple'><b>Write down your analysis for Question 1-8 in this cell:</b></font>

Based on the output below, we can see that the terms "Hadoop", "Spark" and "Hortonworks" are fairly popular and frequently used terms. All 3 of the results were tweeted within the last 30 minutes, and those who tweeted them had a wide range of friends and followers from as low as 35 to as high as 6933. Usage of the terms ranged from tweeters posting for job opportunities, as well as well as just general news within the Big Data landscape. Unsurprisingly, these terms generally are found in the same kind of tweets, as one of the most popular tweets for Hadoop (the uber_recruiter tweet), was also found for Spark. Other big data terms are also used in conjunction with these terms. 

Hadoop Output

id: 2228328324
friends_count: 35
followers_count: 38
created_at: Fri Jan 15 21:23:46 +0000 201

Index: 40
Text: My team at Conductor is looking for a Software Engineer, ideally with some distributed computing experience (Hadoop…https://t.co/pnp8ItGkN0
Index: 2
Text: RT @uber_recruiter: Job Opportunity! Hadoop Developer in Minneapolis, MN https://t.co/NSzGUZK1Op #job #hadoop #bigdata

Links: [{u'url': u'https://t.co/pnp8ItGkN0', u'indices': [116, 139], u'expanded_url': u'https://lnkd.in/eFYQUfj', u'display_url': u'lnkd.in/eFYQUfj'}]
User Mentions: 
Hashtags: 

Links: [{u'url': u'https://t.co/NSzGUZK1Op', u'indices': [73, 96], u'expanded_url': u'http://bull.hn/l/2P6B9/15', u'display_url': u'bull.hn/l/2P6B9/15'}]
User Mentions: [{u'id': 22189427, u'indices': [3, 18], u'id_str': u'22189427', u'screen_name': u'uber_recruiter', u'name': u'Andrew Reams'}]
Hashtags: [{u'indices': [97, 101], u'text': u'job'}, {u'indices': [102, 109], u'text': u'hadoop'}, {u'indices': [110, 118], u'text': u'bigdata'}]

Spark Output

id: 688109705114161153
friends_count: 138
followers_count: 312 
created_at: Fri Jan 15 21:25:12 +0000 2016

Index: 546
Text: Sunrise Systems Inc: Hadoop Consultant (#BaskingRidge, NJ) https://t.co/0Ezk4hG5Y7 #BusinessMgmt #NettempsJobs #Job #Jobs #Hiring #CareerArc 
Index: 5 
Text: RT @uber_recruiter: Job Opportunity! Hadoop Developer in Minneapolis, MN https://t.co/NSzGUZK1Op #job #hadoop #bigdata

Links: [{u'url': u'https://t.co/pnp8ItGkN0', u'indices': [116, 139], u'expanded_url': u'https://lnkd.in/eFYQUfj', u'display_url': u'lnkd.in/eFYQUfj'}]
User Mentions:
Hashtags:

Links: [{u'url': u'https://t.co/NSzGUZK1Op', u'indices': [73, 96], u'expanded_url': u'http://bull.hn/l/2P6B9/15', u'display_url': u'bull.hn/l/2P6B9/15'}]
User Mentions: [{u'id': 22189427, u'indices': [3, 18], u'id_str': u'22189427', u'screen_name': u'uber_recruiter', u'name': u'Andrew Reams'}]
Hashtags: [{u'indices': [97, 101], u'text': u'job'}, {u'indices': [102, 109], u'text': u'hadoop'}, {u'indices': [110, 118], u'text': u'bigdata'}]

Hortonworks Output

id: 688108582433198081
friends_count: 243
followers_count: 6933 
created_at: Fri Jan 15 21:20:45 +0000 2016

Index: 23
Text: Forget the FUD! Get Data: Which do you choose for a #Data lake? #Hadoop or an in-memory database? https://t.co/hZ9fX0zBwU
Index: 5
Text: RT @ryanrod42: #bigdata Survey: Big Data Goes Mainstream: submitted by  piterpolk  [link] [comment] https://t.co/SAIf0fGTWu #hadoop

Links: [{u'url': u'https://t.co/pnp8ItGkN0', u'indices': [116, 139], u'expanded_url': u'https://lnkd.in/eFYQUfj', u'display_url': u'lnkd.in/eFYQUfj'}]
User Mentions:
Hashtags:

Links: [{u'url': u'https://t.co/NSzGUZK1Op', u'indices': [73, 96], u'expanded_url': u'http://bull.hn/l/2P6B9/15', u'display_url': u'bull.hn/l/2P6B9/15'}]
User Mentions:[{u'id': 22189427, u'indices': [3, 18], u'id_str': u'22189427', u'screen_name': u'uber_recruiter', u'name': u'Andrew Reams'}]
Hashtags:[{u'indices': [97, 101], u'text': u'job'}, {u'indices': [102, 109], u'text': u'hadoop'}, {u'indices': [110, 118], u'text': u'bigdata'}]
