# WEEK 6: Web Crawling & Twitter API

---

## Web Crawling
We will introduce two methods of collecting data: web crawling (this week) and calling an API (next week).<br>
Web crawling means building an automated bot that imitates human browsing behavior.

### Understanding HTML
- HTML stands for **HyperText Markup Language**, which is used to define the structure and content of a web page.
- All HTML content is hierarchical and structured.
    - Basic elements: `Tag` and `Text`
    - Text is the content shown on the screen. **A tag is not displayed but is used to render the text.**
    - Text is wrapped by a start tag and an end tag.
    - Tag: denoted by a pair of angle brackets `<>`
        - Start Tag
            - Tag Name
            - Attributes (optional): attributes provide additional information about the element
                - Attribute Name
                - Attribute Value
            - format: `<...>`
        - End Tag
            - format: `</...>`
        - Most tags are used in pairs, <font style="color:red">except void tags such as the line break tag <b>&lt;br&gt;</b> and the input box tag <b>&lt;input&gt;</b></font>, which have no end tag.
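
To make the tag/text/attribute breakdown above concrete, here is a minimal sketch that uses Python's standard `html.parser` module to list the pieces of a small snippet. The class name `TagLogger` and the snippet are illustrative assumptions, not part of the course material.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every start tag, end tag, and piece of text the parser meets."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute name, attribute value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(('text', data.strip()))

# Illustrative snippet: one paired tag (<p>) and two tags without an end tag
snippet = '<p>Please input your user name:</p><input type="text"><br>'
logger = TagLogger()
logger.feed(snippet)
logger.close()

for event in logger.events:
    print(event)
```

The `<p>` element produces a start event, a text event, and an end event, while `<input>` and `<br>` produce only a start event because they have no end tag.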

---

### Input Types

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <a href="https://juniorworld.github.io/python-workshop-2018/">Go to our Home Page</a>
    <p>Please input your user name:</p>
    <input type="text">
    <p>Please input your password:</p>
    <input type="password">
    <br>
    <input type='radio'> Do you like Python?
    <br>
    <input type='radio'> Do you like HTML?
    <br>
    <input type="submit">
    <input type="reset">
  </body>
</html>
```

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <a href="https://juniorworld.github.io/python-workshop-2018/">Go to our Home Page</a>
    <p>Please input your user name:</p>
    <input type="text">
    <p>Please input your password:</p>
    <input type="password">
    <br>
    <input type='radio'> Do you like Python?
    <br>
    <input type='radio'> Do you like HTML?
    <br>
    <input type="submit">
    <input type="reset">
  </body>
</html>

To assign a default value, use the `value` attribute.

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <a href="https://juniorworld.github.io/python-workshop-2018/">Go to our Home Page</a>
    <p>Please input your user name:</p>
    <input type="text" value="junior">
    <p>Please input your password:</p>
    <input type="password" value="123">
    <br>
    <input type="submit">
    <input type="reset">
  </body>
</html>
```

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <a href="https://juniorworld.github.io/python-workshop-2018/">Go to our Home Page</a>
    <p>Please input your user name:</p>
    <input type="text" value="junior">
    <p>Please input your password:</p>
    <input type="password" value="123">
    <br>
    <input type="submit">
    <input type="reset">
  </body>
</html>

### Publish HTML page
Save your HTML code as a file named "week5.html".<br>
Double-click the file to render the page locally in your browser.<br>
If you have a server, you can upload the file to it and publish it as an online web page.

#### <font style="color: blue">Practice:</font>
<font style="color: blue">Please create a page like the one shown below, save it as "week5_practice.html", and render it on your computer.</font>

<html>
    <head>
        <title>survey</title>
    </head>
    <body>
        <h1 style='text-align:center'>Online Survey by JMSC</h1>
        <p style='color:blue;text-align:center'>Applicable to HKU students only.</p>
        <h3>Q1: Is Common Core helpful for broadening your intellectual perspective?</h3>
        <input type='radio'>Strongly Agree
        <br>
        <input type='radio'>Agree
        <br>
        <input type='radio'>Neutral
        <br>
        <input type='radio'>Disagree
        <br>
        <input type='radio'>Strongly Disagree
        <br>
        <h3>Q2: Is Common Core helpful for building friendships across faculties for you?</h3>
        <input type='radio'>Strongly Agree
        <br>
        <input type='radio'>Agree
        <br>
        <input type='radio'>Neutral
        <br>
        <input type='radio'>Disagree
        <br>
        <input type='radio'>Strongly Disagree
        <br>
        <input type="submit">
        <input type="reset">
    </body>
</html>

## Using Selenium

We will use the `selenium` package to collect data; it works on both static and dynamic websites.<br>
Please download the Chrome driver (its version should match your installed Chrome) from this link: https://chromedriver.storage.googleapis.com/index.html?path=73.0.3683.20/

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [2]:
driver=webdriver.Chrome(executable_path='C:\\Python27\\selenium\\webdriver\\chrome\\chromedriver.exe') #load the browser

In [3]:
driver.get('file:///C:/Users/yuner/Desktop/week5.html') #use absolute path to open local html file

In [4]:
driver.title #print the title

'This is a title'

In [5]:
driver.current_url #get the url of the page

'file:///C:/Users/yuner/Desktop/week5.html'
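
The `file:///` address passed to `driver.get` above has to be an absolute path. As a small sketch, the standard `pathlib` module can build that address for you; it assumes `week5.html` sits in the current working directory, which may not match where you saved it.

```python
from pathlib import Path

# Assumption: week5.html is in the current working directory
html_file = Path.cwd() / 'week5.html'   # absolute path to the file
file_url = html_file.as_uri()           # converts the path into a file:// URL

print(file_url)
```

The resulting string can be passed to `driver.get(...)` in place of a hand-typed address.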

## Locate Element by Xpath

We can locate elements by their relative or absolute paths in the document, optionally narrowed by tag name, attribute name, and attribute value.<br>
- XPath is an expression that describes the path to an HTML element
    - `/` marks an **absolute path**:
        - at the beginning: the XPath starts from the root node
        - in the middle: it refers to an element **at the next level down**
            - e.g. the XPath of &lt;body&gt; can be written as "html/body" or "/html/body".
            - If you write "/body", no element matches (&lt;body&gt; is not the root node), so Selenium raises an error.
    - `//` marks a **relative path**: it matches any element that fits the pattern, wherever it is in the document.
        - e.g. the XPath of &lt;body&gt; can be written as "//body"
    - `[@attribute_name=attribute_value]` adds an attribute to the matching pattern
        - e.g. "//input[@type='reset']"
        - The most reliable attribute is `id`, because an `id` is meant to identify a single element uniquely.
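
The path rules above can be tried without a browser. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and is not what Selenium uses internally, so treat it as an illustration of the syntax. Two assumptions: the page is a hand-made, well-formed copy of week5.html with self-closed `<input/>` tags (ElementTree is an XML parser), and paths start with `.` because ElementTree searches relative to the `<html>` element.

```python
import xml.etree.ElementTree as ET

# Hand-made, well-formed copy of the page body (inputs are self-closed for XML)
page = """
<html>
  <head><title>This is a title</title></head>
  <body>
    <p>Please input your user name:</p>
    <input type="text" value="junior"/>
    <p>Please input your password:</p>
    <input type="password" value="123"/>
    <input type="submit"/>
    <input type="reset"/>
  </body>
</html>
"""
root = ET.fromstring(page)  # root is the <html> element

body = root.find('body')                       # next level down, like /html/body
inputs = root.findall('.//input')              # any depth, like //input
reset = root.find(".//input[@type='reset']")   # attribute filter
first_input = root.find('.//input[1]')         # position filter (positions start at 1)

print(body.tag, len(inputs), reset.get('type'), first_input.get('value'))
```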

In [6]:
#you can use find_element_by_xpath function to find the element by relative xpath
body=driver.find_element_by_xpath('//body')

In [7]:
body.text #get the text of the matched element

'Go to our Home Page\nPlease input your user name:\nPlease input your password:'

In [8]:
#or by absolute xpath
body=driver.find_element_by_xpath('/html/body')
print(body.text)

Go to our Home Page
Please input your user name:
Please input your password:


In [9]:
#use find_elements_by_xpath function to find a list of elements with shared pattern
inputs=driver.find_elements_by_xpath('//input')

In [10]:
len(inputs)

4

In [11]:
#1st way
first_input=inputs[0]
print(first_input.get_attribute('value'))

junior


In [12]:
#2nd way: [1] selects the first <input> under its parent (xpath positions start at 1)
first_input=driver.find_element_by_xpath('//input[1]')

In [13]:
print(first_input.get_attribute('value'))

junior


In [14]:
#3rd way
first_input=driver.find_element_by_xpath('//input[@type="text"]')
print(first_input.get_attribute('value'))

junior


In [17]:
print(first_input.get_attribute('type'))

text


In [15]:
ps=driver.find_elements_by_xpath('//p')

In [16]:
print(len(ps)) #count how many <p> are in the html
print(ps[0].text) #first element's text
print(ps[1].text) #second element's text

2
Please input your user name:
Please input your password:


## Imitate Browsing Behavior

Some frequently used behaviors:
1. Click: `element.click()`
2. Type: `element.send_keys('something')`
3. Clear existing content: `element.clear()`
4. Scroll: 
    - Scroll to bottom: `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`
    - Scroll to a specific position: e.g. scroll to 400px from the top, `driver.execute_script("window.scrollTo(0, 400);")` (use `window.scrollBy(0, 400)` to scroll down by 400px from the current position)

In [18]:
#clear the default name and fill in your name
name_box=inputs[0]
name_box.clear()
name_box.send_keys('your name')

In [19]:
#clear the default password and type any characters
password_box=inputs[1]
password_box.clear()
password_box.send_keys('abcd')


In [20]:
#click the "Go to our Home Page" link
link=driver.find_element_by_xpath('//a')
link.click()

In [21]:
#navigate to another online page and inspect the page
driver.get('https://juniorworld.github.io/python-workshop-2018/week5/1.html')

In [22]:
#copy the xpath from the browser's inspect window and paste it into the brackets
Q1=driver.find_element_by_xpath('//*[@id="1"]')
print(Q1.text)
Q2=driver.find_element_by_xpath('//*[@id="2"]')
print(Q2.text)

Q1: Is Common Core helpful for broadening your intellectual perspective?
Q2: Is Common Core helpful for building friendships across faculties for you?


In [23]:
#click the submit button
submit=driver.find_element_by_xpath('/html/body/input[11]') #an xpath copied from the inspect window relies on position; it uses no attribute other than id
submit=driver.find_element_by_xpath('//input[@type="submit"]') #or write the xpath yourself, using other attributes
submit.click()

#### <font style="color: blue">Practice:</font>
<font style="color: blue">Open Google page (https://www.google.com/), search for "JMSC" and click the "Google Search" button.</font>

In [24]:
#write your code here
driver.get('https://www.google.com/')

In [25]:
search_box=driver.find_element_by_xpath('//*[@id="tsf"]/div[2]/div/div[1]/div/div[1]/input')
search_box.send_keys('JMSC')

In [26]:
search_button=driver.find_element_by_xpath('//*[@id="tsf"]/div[2]/div/div[3]/center/input[1]')
search_button.click()

In [27]:
#collect all results on the first page
results=driver.find_elements_by_xpath('//div[@class="rc"]')

In [28]:
#how many results are listed on the first page
len(results)

6

In [29]:
#print every result
for result in results:
    result_link=result.find_element_by_xpath('div[@class="r"]/a') #we can also find an element under the current node
    result_link_text=result_link.find_element_by_xpath('h3').text
    result_link_href=result_link.get_attribute('href')
    result_description=result.find_element_by_xpath('div[@class="s"]').text
    print(result_link_text,result_link_href,result_description)

Journalism and Media Studies Centre, The University of Hong Kong https://jmsc.hku.hk/ http://www.handbook.hku.hk/ug/full-time-2018-19/arrangements-during-bad-weather. Copyright © 2019 Journalism and Media Studies Centre, The University of ...
Journalism and Media Studies Centre 香港大學新聞及傳媒研究 ... - HKU https://www4.hku.hk/hkumcd/index.php/eng/unit/111_Journalism_and_Media_Studies_Centre 2018年1月10日 - The Journalism and Media Studies Centre has brought professional journalism education to Hong Kong's premier university, creating an ...
JMSC (@JMSCHKU) | Twitter https://twitter.com/jmschku The latest Tweets from JMSC (@JMSCHKU). Founded in 1999, the Journalism and Media Studies Centre of The University of Hong Kong offers professional ...
Manuscript Submission - Editorial Manager https://www.editorialmanager.com/jmsc/ 沒有這個頁面的資訊。
瞭解原因
JMSC - 7th ATC http://www.7atc.army.mil/JMSC/ The Joint Multinational Simulation Center, headquartered in Grafenwoehr, Germany, trains the art and science of co

In [30]:
#save results
with open('week5_google.txt','w',encoding='utf-8') as output_file: #the file is closed automatically when the block ends
    for result in results:
        result_link=result.find_element_by_xpath('div[@class="r"]/a') #we can also find an element under the current node
        result_link_text=result_link.find_element_by_xpath('h3').text
        result_link_href=result_link.get_attribute('href')
        result_description=result.find_element_by_xpath('div[@class="s"]').text
        output_file.write(result_link_text+'\t'+result_link_href+'\t'+result_description+'\n')
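
One caveat with joining fields by `'\t'` by hand: a description that itself contains a tab or a line break (as one of the printed results above does) will split into extra columns or rows. A possible alternative, sketched here with invented rows and an in-memory buffer instead of a real file, is the standard `csv` module, which quotes such fields automatically.

```python
import csv
import io

# Invented rows standing in for (title, link, description) of each result
rows = [
    ('JMSC', 'https://jmsc.hku.hk/', 'Journalism and Media Studies Centre'),
    ('Example', 'https://example.com/', 'first line\nsecond line\twith a tab'),
]

# In-memory stand-in for open('week5_google.txt', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t')
writer.writerow(['title', 'link', 'description'])  # header row
writer.writerows(rows)

# Reading the data back recovers the original fields, line break and tab included
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter='\t'))
print(len(restored), restored[2][2] == rows[1][2])
```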

---
# Break
---

## Twitter API
API stands for Application Programming Interface. It is provided and maintained by an IT company as the official way to fetch data from its servers automatically. Almost all IT giants, such as Twitter, Facebook and Google, have their own APIs. Knowing how to use an API is therefore a critical skill for anyone who wants to do social media analytics.<br>
Please follow these instructions to apply for Twitter API access: https://juniorworld.github.io/python-workshop-2018/doc/Instructions_on_Twitter_API.pdf

In [31]:
import requests
import time
import base64
import pandas as pd

In [32]:
#Authorize your App

api_key = 'KyQ9A6AkM9fkopbKHu2eRQGxM' #replace with the API key of your own app
api_secret = 'M0mckxZVYIPXXsJSmXZWfWsnt0LJcesdKm1hn5UkQQW1lbGs0c' #replace with your own API secret and keep it private

key_secret = api_key+':'+api_secret
b64_encoded_key = base64.b64encode(key_secret.encode('ascii')).decode('ascii')

auth_url = 'https://api.twitter.com/oauth2/token'

auth_headers = {
    'Authorization': 'Basic '+b64_encoded_key,
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'
}

auth_data = {
    'grant_type': 'client_credentials'
}

auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)

In [33]:
auth_resp.status_code #status code "200" means authorization succeeded; "400" means bad request, "401" unauthorized, "403" forbidden

200

In [34]:
access_token=auth_resp.json()['access_token'] #get your bearer access token

In [35]:
headers = {'Authorization': 'Bearer '+access_token} #we will use this header throughout the course

In [36]:
access_token

'AAAAAAAAAAAAAAAAAAAAAM8M9gAAAAAA33GKu2zHP%2BCcelTcDGw%2FIK0KQGg%3DkLls0647xj9UYpmfgd0x8IduB3DdNurBTEYAYyFF43w84Ak8j9'
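
The header built in the authorization cell can be isolated into a small function to see what is actually sent. This is a sketch with placeholder credentials; the function name `build_basic_auth_header` is an illustrative assumption, and no request is made.

```python
import base64

def build_basic_auth_header(api_key, api_secret):
    """Return the Authorization header value for a key/secret pair."""
    key_secret = api_key + ':' + api_secret
    encoded = base64.b64encode(key_secret.encode('ascii')).decode('ascii')
    return 'Basic ' + encoded

# Placeholder credentials, not real ones
header_value = build_basic_auth_header('my_key', 'my_secret')
print(header_value)

# Base64 is an encoding, not encryption: anyone can reverse it
decoded = base64.b64decode(header_value.split(' ')[1]).decode('ascii')
print(decoded)
```

Because the encoding is reversible, the key, the secret, and the bearer token should all be treated as passwords and kept out of shared notebooks.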

## Search API
We can use the Search API to search for posts or users on Twitter.

### 1. Search for Posts

Since we are using the free version of the API, we can only collect posts from the past 7 days. You can work around this limit by scheduling a program that collects data at least once every 7 days.<br>
The Search API functions in a way similar to Twitter advanced search: https://twitter.com/search-advanced<br>
The key to search is creating a query url containing search parameters.
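
To see how search parameters become part of a query URL, the sketch below encodes a parameter dictionary with the standard `urllib.parse.urlencode`. It only builds the string and sends nothing. A library such as `requests` performs a similar encoding when given `params`, though its exact output may differ in small details.

```python
from urllib.parse import urlencode

search_url = 'https://api.twitter.com/1.1/search/tweets.json'
params = {'q': '"#hongkong"', 'result_type': 'recent', 'count': 100}

# Special characters are percent-encoded: " becomes %22 and # becomes %23
query_string = urlencode(params)
full_url = search_url + '?' + query_string

print(query_string)
print(full_url)
```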

In [38]:
search_url = 'https://api.twitter.com/1.1/search/tweets.json'

In [39]:
params = {
    'q': '"#hongkong"', #search string
    'result_type': 'recent', #mixed,recent,popular
    'count': 100 #up to 100
}

search_resp = requests.get(search_url, headers=headers, params=params)

In [40]:
type(search_resp.json())

dict

In [41]:
search_resp.json().keys()

dict_keys(['statuses', 'search_metadata'])

In [42]:
type(search_resp.json()['statuses'])

list

In [43]:
print(len(search_resp.json()['statuses']))  #a list of tweet objects

100


In [44]:
search_resp.json()['statuses'][0].keys()

dict_keys(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'extended_entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])

In [45]:
results=search_resp.json()['statuses'] #save first 100 results

For more information about tweet object, please refer to: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
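
As a small illustration of working with a tweet object, the sketch below reads a few fields from a hand-made, heavily shortened tweet (all values are invented) and parses the `created_at` string into a `datetime`, which is useful for sorting or filtering by time. It assumes the default English locale for day and month names.

```python
from datetime import datetime

# Hand-made, shortened tweet object; the values are invented for illustration
tweet = {
    'created_at': 'Sat Mar 02 08:30:00 +0000 2019',
    'text': 'Hello #HongKong',
    'user': {'screen_name': 'example_user'},
    'retweet_count': 3,
}

# created_at is a string such as "Sat Mar 02 08:30:00 +0000 2019"
posted = datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S %z %Y')

print(posted.isoformat())
print(tweet['user']['screen_name'], tweet['retweet_count'])
```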

### 2. Navigate to next page of results (step-by-step breakdown)

In [46]:
#have a look at the metadata
search_resp.json()['search_metadata'] #the link of next_results is the one we need

{'completed_in': 0.07,
 'max_id': 1101769423252123649,
 'max_id_str': '1101769423252123649',
 'next_results': '?max_id=1101750455875584000&q=%22%23hongkong%22&count=100&include_entities=1&result_type=recent',
 'query': '%22%23hongkong%22',
 'refresh_url': '?since_id=1101769423252123649&q=%22%23hongkong%22&result_type=recent&include_entities=1',
 'count': 100,
 'since_id': 0,
 'since_id_str': '0'}

In [47]:
#let's run the search for the next page
next_page=search_resp.json()['search_metadata']['next_results']     #extract the link from the dictionary and save it as the "next_page" variable
search_resp=requests.get(search_url+next_page,headers=headers)

In [48]:
len(search_resp.json()['statuses']) #another 100 posts are in place

100

In [49]:
#update the results
results.extend(search_resp.json()['statuses'])

In [50]:
len(results)

200
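
The `next_results` value is simply a query string. Decoding it, as in this sketch using the standard `urllib.parse.parse_qs` and the string printed in the metadata above, shows that the next page is the same search with a `max_id` added, so only tweets at or below that ID are returned.

```python
from urllib.parse import parse_qs

# The next_results string from the search metadata shown earlier
next_page = '?max_id=1101750455875584000&q=%22%23hongkong%22&count=100&include_entities=1&result_type=recent'

# Drop the leading "?" and decode; each value comes back as a list of strings
next_params = parse_qs(next_page.lstrip('?'))

print(next_params['max_id'][0])
print(next_params['q'][0])
```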

### 3. Navigate to next N pages of results (integrated)

In [51]:
#you can use a for loop to collect a specific number of pages
for page in range(5):
    metadata=search_resp.json()['search_metadata']
    if 'next_results' not in metadata: #stop early when there are no more pages
        break
    next_page=metadata['next_results']
    search_resp=requests.get(search_url+next_page,headers=headers)
    results.extend(search_resp.json()['statuses'])
    print(page+1,'pages have been collected')
    time.sleep(15) #pause to stay under the rate limit
print('DONE!')

1 pages have been collected
2 pages have been collected
3 pages have been collected
4 pages have been collected
5 pages have been collected
DONE!


In [52]:
#you can use a while loop to exhaust all posts
#Reminder: put some time delay so that you won't exceed the rate limit
page=0
while 'next_results' in search_resp.json()['search_metadata'].keys():
    page+=1
    next_page=search_resp.json()['search_metadata']['next_results']
    search_resp=requests.get(search_url+next_page,headers=headers)
    results.extend(search_resp.json()['statuses'])
    print(page,'pages have been collected')
    time.sleep(15)
print('DONE!')

1 pages have been collected
2 pages have been collected
3 pages have been collected
4 pages have been collected
5 pages have been collected
6 pages have been collected
7 pages have been collected
8 pages have been collected
9 pages have been collected
10 pages have been collected
11 pages have been collected
12 pages have been collected
13 pages have been collected
14 pages have been collected
15 pages have been collected


KeyboardInterrupt: 

In [53]:
len(results)

2127
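
The two loops above can be folded into one reusable helper with a page cap, so a run never has to be stopped by hand. This is a sketch under stated assumptions: `collect_pages` and its `fetch` argument are hypothetical names, and a fake three-page response replaces the real API so the logic can be checked offline.

```python
import time

def collect_pages(fetch, max_pages, delay=0):
    """Follow 'next_results' links until they run out or max_pages is reached.

    fetch(next_page) must return a dict shaped like a Search API response;
    next_page is '' for the first call.
    """
    results = []
    next_page = ''
    for _ in range(max_pages):
        payload = fetch(next_page)
        results.extend(payload['statuses'])
        next_page = payload['search_metadata'].get('next_results')
        if not next_page:  # no further pages
            break
        time.sleep(delay)  # pause between requests to respect the rate limit
    return results

# Fake three-page API used only to exercise the loop without network access
fake_pages = {
    '': {'statuses': [1, 2], 'search_metadata': {'next_results': '?page=2'}},
    '?page=2': {'statuses': [3, 4], 'search_metadata': {'next_results': '?page=3'}},
    '?page=3': {'statuses': [5], 'search_metadata': {}},
}

print(collect_pages(fake_pages.get, max_pages=10))
```

With the real API, `fetch` would wrap a `requests.get` call that appends `next_page` to the search URL, and `delay` would be set to a pause such as the 15 seconds used above.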

### 4. Preliminary Analysis

In [54]:
#turn results into a dataframe
table=pd.DataFrame.from_records(results)

In [55]:
table.columns

Index(['contributors', 'coordinates', 'created_at', 'entities',
       'extended_entities', 'favorite_count', 'favorited', 'geo', 'id',
       'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'metadata',
       'place', 'possibly_sensitive', 'quoted_status', 'quoted_status_id',
       'quoted_status_id_str', 'retweet_count', 'retweeted',
       'retweeted_status', 'source', 'text', 'truncated', 'user'],
      dtype='object')

### 4(a) Co-hashtag Analysis

In [56]:
table['entities'][0]

{'hashtags': [{'text': '晴天', 'indices': [61, 64]},
  {'text': 'HongKong', 'indices': [65, 74]},
  {'text': 'アットジャム', 'indices': [75, 82]}],
 'symbols': [],
 'user_mentions': [{'screen_name': 'sakuraebi_staff',
   'name': '桜エビ～ず',
   'id': 3254175722,
   'id_str': '3254175722',
   'indices': [3, 19]}],
 'urls': [],
 'media': [{'id': 1101675110212132864,
   'id_str': '1101675110212132864',
   'indices': [83, 106],
   'media_url': 'http://pbs.twimg.com/media/D0nvrQIVYAAyLgs.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/D0nvrQIVYAAyLgs.jpg',
   'url': 'https://t.co/sFDRhRnefM',
   'display_url': 'pic.twitter.com/sFDRhRnefM',
   'expanded_url': 'https://twitter.com/sakuraebi_staff/status/1101675118990782464/photo/1',
   'type': 'photo',
   'sizes': {'medium': {'w': 1024, 'h': 674, 'resize': 'fit'},
    'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
    'large': {'w': 1024, 'h': 674, 'resize': 'fit'},
    'small': {'w': 680, 'h': 448, 'resize': 'fit'}},
   'source_status_id': 11

In [57]:
table['entities'][0].keys() #entities is a dictionary about in-text connections

dict_keys(['hashtags', 'symbols', 'user_mentions', 'urls', 'media'])

In [64]:
table['entities'][1]['hashtags'][1]['text'] #first 1: tweet (row) index; second 1: hashtag index

'OpenDataDay'

For more information about entities, please refer to official documentation: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object

In [65]:
hashtags=[]
for entity in table['entities']:
    for hashtag in entity['hashtags']:
        hashtags.append(hashtag['text'])

In [66]:
len(hashtags)

5645

In [67]:
hashtag_freq=pd.Series(hashtags).value_counts() #frequency distribution of hashtags (pd.value_counts is deprecated in recent pandas)

In [68]:
hashtag_freq.head() #first 5 rows

HongKong       923
hongkong       332
香港             137
RaspberryPi     68
4IN1            68
dtype: int64

In [69]:
'HongKong'.lower()

'hongkong'

In [70]:
#convert uppercase to lowercase so that "HongKong" and "hongkong" are counted together
hashtags=[i.lower() for i in hashtags]
hashtag_freq=pd.Series(hashtags).value_counts() #data type: Series, index: hashtag
hashtag_freq.head()

hongkong       1315
香港              137
china            90
4in1             68
raspberrypi      68
dtype: int64
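
The same counting can be done with the standard library alone. The sketch below uses `collections.Counter` on a few invented entities shaped like `table['entities']`, and then drops the search term itself so that only the co-occurring hashtags remain. The names `sample_entities` and `co_counts` are illustrative.

```python
from collections import Counter

# Invented entities in the same shape as table['entities']
sample_entities = [
    {'hashtags': [{'text': 'HongKong'}, {'text': 'China'}]},
    {'hashtags': [{'text': 'hongkong'}, {'text': 'OpenDataDay'}]},
    {'hashtags': []},
]

# Count every hashtag after converting it to lowercase
counts = Counter(
    hashtag['text'].lower()
    for entity in sample_entities
    for hashtag in entity['hashtags']
)

# Remove the searched hashtag to keep only its co-hashtags
co_counts = Counter({tag: n for tag, n in counts.items() if tag != 'hongkong'})

print(counts.most_common(1))
print(sorted(co_counts))
```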

In [71]:
pd.DataFrame(hashtags).to_csv('hashtags.txt')

You can create a word cloud of the co-hashtags of #hongkong at https://wordcloud.timdream.org/

#### <font style="color: blue">Practice:</font>
---
<font style="color: blue">Please collect the 500 most recent tweets with the hashtag #FinishTheWall and visualize its co-hashtags with a word cloud.<br>
   Please use a variable name other than "table" to store your results, because we will use "table" again later.
</font>

In [72]:
#Write your code here
params = {
    'q': '"#FinishTheWall"', #search string
    'result_type': 'recent', #mixed,recent,popular
    'count': 100 #up to 100
}

search_resp = requests.get(search_url, headers=headers, params=params)
results=search_resp.json()['statuses']
for page in range(4):
    next_page=search_resp.json()['search_metadata']['next_results']
    search_resp=requests.get(search_url+next_page,headers=headers)
    results.extend(search_resp.json()['statuses'])
    print(page+1,'pages have been collected')
    time.sleep(15)
print('DONE!')

1 pages have been collected
2 pages have been collected
3 pages have been collected
4 pages have been collected
DONE!


In [73]:
table2=pd.DataFrame.from_records(results)

In [75]:
hashtags2=[]
for entity in table2['entities']:
    for hashtag in entity['hashtags']:
        hashtags2.append(hashtag['text'])

In [77]:
pd.DataFrame(hashtags2).to_csv('hashtags2.txt')