## Text Data Collection

Text analysis starts with textual data, and the sources of such data vary. Some datasets are prepared for academic purposes and come well labelled, but in practice you often need to collect data from the 'real world'. Typical Internet sources include RSS feeds, Google search pages, and social media.

These sources are an important avenue for collecting Internet data for sentiment analysis. For example, almost all news media provide RSS feeds. Note that RSS content is not user-generated content (UGC), so expect it to differ from social media or blogs: both the content and the way it is written are substantially different from 'short messages'. Most news content is also summarised by its headline.

In this notebook, we illustrate some examples of text data collection:
- RSS feeds
- Yelp (a popular website, collected by web scraping)
- Google search pages
- Twitter (as usual!)

## News from RSS feeds
We first illustrate collecting news from RSS feeds.
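
Before fetching a live feed, it can help to see what an RSS document contains. The sketch below parses a small, made-up RSS 2.0 snippet with the standard library's `xml.etree.ElementTree` (not `feedparser`, which this notebook uses for the real feeds) and pulls out the title, link and description of each item. The feed content and the `read_items` helper are illustrative assumptions, not taken from any real site.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; real feeds carry the same basic fields.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Markets rise on trade hopes</title>
      <link>https://example.com/news/1</link>
      <description>Shares climbed on Monday.</description>
    </item>
    <item>
      <title>Storm delays flights</title>
      <link>https://example.com/news/2</link>
      <description>Dozens of flights were delayed.</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return a list of dicts, one per <item>, with title, link and summary."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


items = read_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], "->", entry["link"])
```

`feedparser` exposes comparable fields on each entry (for example `entry['title']` and `entry['summary']`), so the same idea carries over to the live feeds.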

In [3]:
# Import packages
# Run this first before all code
from __future__ import unicode_literals
import os
import time
fpath = os.getcwd()
print (fpath)

import json
from feedparser import parse
from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

TIMEOUT = 30
jsonlist = []

C:\Users\isstyc\Documents\NUS\Teaching\Practical Language Processing\New Media and Sentiment Mining\Workshops\Notebooks


In [4]:
import re, sys

def removeIndent(phrase):
    phrase=re.sub("\n",' ',phrase)
    phrase=re.sub("\r",' ',phrase)
    phrase=re.sub("\t",' ',phrase)
    return phrase

def removeWS(phrase):
    phrase=re.sub(' ','',phrase)
    return phrase

def removePunc(phrase):
    # Replace symbols that can cause trouble downstream with plain text
    phrase=re.sub('&',' and ',phrase)
    phrase=re.sub('"',"'", phrase)       # double quotes -> single quotes
    phrase=re.sub('%','percent',phrase)  # '%' needs no escaping in a regex
    return phrase


### Examples of RSS sites are listed below
- http://www.channelnewsasia.com/rss/latest_cna_biz_rss.xml # business
- http://www.channelnewsasia.com/rss/latest_cna_sgbiz_rss.xml # sg biz
- http://www.channelnewsasia.com/rss/latest_cna_world_rss.xml # world
- http://www.channelnewsasia.com/rss/latest_cna_asiapac_rss.xml # asia pac
<br>

SCMP
- http://www.scmp.com/rss/2/feed  # HK
- http://www.scmp.com/rss/10/feed  # business
- http://www.scmp.com/rss/318421/feed # china feed


The code to retrieve RSS content is shown below.

In [5]:
if __name__ == "__main__":
    # Pick one feed; swap in the world feed if preferred
    # newsurl = "http://www.channelnewsasia.com/rss/latest_cna_world_rss.xml"
    newsurl = "http://www.channelnewsasia.com/rss/latest_cna_biz_rss.xml"

    rss = parse(newsurl)
    # Take at most 30 entries; note the feed format can change from time to time
    for rss_entry in rss['entries'][:30]:
        url_link = rss_entry['id']
        try:
            url_content = get(url_link, timeout=TIMEOUT)
        except Exception as e:
            print("Error site for " + url_link + ": " + str(e))
            continue
        if not url_content.ok:
            continue
        page = url_content.content.decode('utf-8', 'ignore')
        soup = BeautifulSoup(page, 'html.parser')
        # The article body sits in this div; the class name depends on the site layout
        article = soup.find("div", {"class": "c-rte--article"})
        if article is None:
            continue
        # Join paragraphs with a space so sentences do not run together
        content = " ".join(p.text.strip() for p in article.find_all('p'))
        url_label = removePunc(rss_entry['title'])
        jdata = {"url_id": url_link, "content": {"url_label": url_label, "text": content}}
        jsonlist.append(jdata)

    # Opening in "w" mode overwrites any existing file
    ffile = open(os.path.join("data", "cna.json"), "w")
    json.dump(jsonlist, ffile)
    ffile.close()
    
    

In [10]:
ffile.close()
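
The file written above holds a list of records, each with a `url_id` and a nested `content` dictionary. As a small sketch of how such a file could be read back for analysis, the code below writes two made-up records to a temporary file and flattens them into rows. The sample records and the `flatten_records` helper are assumptions for illustration only.

```python
import json
import os
import tempfile

# Made-up records in the same shape as the entries appended to jsonlist.
sample = [
    {"url_id": "https://example.com/news/1",
     "content": {"url_label": "Markets rise", "text": "Shares climbed on Monday."}},
    {"url_id": "https://example.com/news/2",
     "content": {"url_label": "Storm delays flights", "text": ""}},
]


def flatten_records(records):
    """Turn nested records into flat dicts, skipping articles with no text."""
    rows = []
    for rec in records:
        text = rec["content"]["text"].strip()
        if not text:
            continue
        rows.append({"url": rec["url_id"],
                     "title": rec["content"]["url_label"],
                     "text": text})
    return rows


# Round-trip through a temporary file to mimic reading the saved JSON.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "cna_sample.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh)
    with open(path, encoding="utf-8") as fh:
        rows = flatten_records(json.load(fh))

print(rows)
```

A list of flat dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` if a table is more convenient.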

## Data Collection from the Yelp sites

Another source of data is user-generated content. Here we look at Yelp, a popular website for reviews of restaurants and other services. The data can be obtained through Yelp's API, but that route has limitations, so here we use web scraping instead.

In [6]:
yelp_url = "https://www.yelp.com/biz/the-sushi-bar-singapore?osq=Restaurants"

url_content = get(yelp_url, timeout=TIMEOUT)
page = url_content.content.decode('utf-8','ignore')

soup = BeautifulSoup(page, 'html.parser')
# The structured data is embedded as JSON-LD in <script> tags.
# The page structure changes over time, so the index below may need adjusting.
data = soup.find_all("script", type="application/ld+json")[2].text.strip()
data = removeIndent(data)
#data = soup.find_all("content").text.lstrip().rstrip()

jsondata = json.loads(data)
with open(os.path.join("data", "yelp_1.json"), "w") as ffile:
    json.dump(jsondata, ffile)
#  <meta name="description" content="41 reviews of The Sushi Bar &#34;The service at Sushi Bar was great with an extremely polite and conscientious staff. Decor of the location was very intimate but also great for larger groups.  Came here on my birthday and ordered tuna sashimi (comes…"/>
# <meta property="og:title" content="Super Fun Event 1" />
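
Picking the JSON-LD block by position (index `[2]`) breaks whenever the page layout changes. A more defensive sketch collects every `application/ld+json` block and selects the one whose `@type` matches. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The HTML snippet, the `JsonLdCollector` class, the `find_block` helper and the `"Restaurant"` type are all made-up assumptions; a real Yelp page may use different types.

```python
import json
from html.parser import HTMLParser

# A made-up page with two JSON-LD blocks.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Example Sushi",
 "aggregateRating": {"reviewCount": 41}}
</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            self.blocks.append(json.loads("".join(self._buffer)))


def find_block(html, wanted_type):
    """Return the first JSON-LD block whose @type equals wanted_type, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        if isinstance(block, dict) and block.get("@type") == wanted_type:
            return block
    return None


restaurant = find_block(SAMPLE_HTML, "Restaurant")
print(restaurant["name"], restaurant["aggregateRating"]["reviewCount"])
```

Selecting by type rather than by position means the code fails visibly (returning `None`) when the expected block is missing, instead of silently parsing the wrong one.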

### Automation of web download
Downloading information from websites can be automated with Selenium. This package has recently been used heavily for robotic process automation (RPA), so it is worth knowing well.


In [14]:
from selenium.webdriver.chrome.service import Service

def getBS(data):
    # Parse the page source and return the embedded JSON-LD block as a dict
    soup = BeautifulSoup(data, "html.parser")
    data = soup.find_all("script", type="application/ld+json")[2].text.strip()
    data = removeIndent(data)
    return json.loads(data)

# Selenium 4 takes the driver path through a Service object
# (Selenium 3 accepted the path as the first positional argument)
drive = webdriver.Chrome(service=Service(fpath + "\\jar\\chromedriver.exe"))
drive.set_page_load_timeout(10)
yelp_url = "https://www.yelp.com/biz/the-sushi-bar-singapore?start="
i = 0

drive.get(yelp_url+str(i))
time.sleep(20)  # wait for the page to finish rendering
data0 = getBS(drive.page_source)  # in dict format
print ("first clicked :" + str(i) + " downloaded")
reviews = {i : data0}

NbReviews = data0['aggregateRating']['reviewCount']
print ("Total no of reviews: " +str(NbReviews))

while i < NbReviews-20:  # code can be improved to look for next button in Selenium
    i = i+20
    drive.get(yelp_url+str(i))
    print ("no of reviews :" + str(i) + " downloaded")
    time.sleep(10)
    reviews[i] = getBS(drive.page_source)

drive.quit()  # close the browser once all pages are fetched

# Integer keys are written as strings in the JSON file
ffile = open(os.path.join("data", "yelp_2.json"), "w")
json.dump(reviews, ffile)
ffile.close()

first clicked :0 downloaded
Total no of reviews: 41
no of reviews :20 downloaded
no of reviews :40 downloaded
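
The `while` loop above steps through the reviews 20 at a time. The same offsets can be generated up front with `range`, which makes the paging logic easy to check without opening a browser. In the sketch below, `page_offsets` and `page_urls` are hypothetical helpers, and the page size of 20 is taken from the loop above.

```python
def page_offsets(total_reviews, page_size=20):
    """Return the start offsets needed to cover total_reviews items."""
    return list(range(0, total_reviews, page_size))


def page_urls(base_url, total_reviews, page_size=20):
    """Build one URL per page by appending each offset to base_url."""
    return [base_url + str(offset)
            for offset in page_offsets(total_reviews, page_size)]


base = "https://www.yelp.com/biz/the-sushi-bar-singapore?start="
for url in page_urls(base, 41):
    print(url)
```

For 41 reviews this yields the offsets 0, 20 and 40, matching the three pages downloaded above.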


## Data collection from Google Search
It is also possible to extract search snippets from Google Search. From there, it is a simple task to use Selenium, as above, to extract the contents of the pages returned by the search. The example below searches for the term 'covid' using a custom search engine named 'Coffee'.

To run the code below, you need to obtain an API key and create a custom search engine ID from the site
https://developers.google.com/custom-search/v1/overview?csw=1

In [18]:
APIKEY = 'yourkey'

# https://developers.google.com/custom-search/v1/overview?csw=1

In [15]:
CSE_ID = 'yourID'  # the ID of your custom search engine
# https://developers.google.com/custom-search/v1/overview?csw=1
# Also enable "Search the entire web" for the search engine

![image.png](images/image.png)

In [None]:
# It looks something like this.........

![image.png](images/image2.png)

In [None]:
#!pip install google-api-python-client

In [19]:
from googleapiclient.discovery import build
my_api_key = APIKEY
my_cse_id = CSE_ID

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res

In [24]:
result = google_search("covid", my_api_key, my_cse_id)
from pprint import pprint
pprint(result)

{'context': {'title': 'Coffee'},
 'items': [{'cacheId': 'B_FPQc5umtEJ',
            'displayLink': 'www.cdc.gov',
            'formattedUrl': 'https://www.cdc.gov/coronavirus/2019-ncov/index.html',
            'htmlFormattedUrl': 'https://www.cdc.gov/coronavirus/2019-ncov/index.html',
            'htmlSnippet': 'Translations. Español &middot; 简体中文 &middot; Tiếng '
                           'Việt &middot; 한국어 &middot; Other Languages. '
                           '<b>COVID</b>-<br>\n'
                           '19 UPDATES. Get email&nbsp;...',
            'htmlTitle': 'Coronavirus Disease 2019 (<b>COVID</b>-19) | CDC',
            'kind': 'customsearch#result',
            'link': 'https://www.cdc.gov/coronavirus/2019-ncov/index.html',
            'pagemap': {'contactpoint': [{'url': 'U.S. Department of Health & '
                                                 'Human Services'},
                                         {'url': 'USA.gov'}],
                        'cse_image': [{'src

In [25]:
import json
# Dump the result dict directly; calling json.dumps first would store a quoted string
with open(os.path.join("data", "liverpool.json"), "w") as f:
    json.dump(result, f)
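
The response is a nested dictionary whose `items` list holds one entry per hit. A minimal sketch of pulling out the fields most useful for text analysis follows. The `sample_result` dictionary is a trimmed, partly made-up stand-in for a real response, and `extract_hits` is a hypothetical helper.

```python
# Trimmed, partly made-up stand-in for a search response.
sample_result = {
    "context": {"title": "Coffee"},
    "items": [
        {"title": "Coronavirus Disease 2019 (COVID-19) | CDC",
         "link": "https://www.cdc.gov/coronavirus/2019-ncov/index.html",
         "snippet": "19 UPDATES. Get email ..."},
        {"title": "Example page without a snippet",
         "link": "https://example.com/covid"},
    ],
}


def extract_hits(result):
    """Return a list of (title, link, snippet) tuples from a search response."""
    hits = []
    # .get() guards against a response that has no "items" key at all.
    for item in result.get("items", []):
        hits.append((item.get("title", ""),
                     item.get("link", ""),
                     item.get("snippet", "")))
    return hits


hits = extract_hits(sample_result)
for title, link, snippet in hits:
    print(title, "|", link)
```

Using `.get()` with defaults keeps the loop from failing when a hit lacks one of the fields.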

## Twitter data download

To download Twitter feeds using Python, consider the Tweepy library: https://tweepy.readthedocs.io/en/latest/getting_started.html

First create an application on Twitter. Follow the steps at https://developer.twitter.com/en/apps/ to obtain the keys listed below.

In [28]:
consumer_key = "xxx" 
consumer_secret = "xx"
access_token = "xx-xxx"
access_token_secret = "xxx"


In [29]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

Equity investors seem to be focusing on positive COVID-19 developments, but the fundamental backdrop remains weak.… https://t.co/oYQ1PCtJ3W
As momentum builds behind debt relief for Africa’s heavily indebted economies, a tussle is brewing between the West… https://t.co/owNBprDedv
The change will affect the movement of individuals and commercial vehicles such as cargo lorries.

https://t.co/NfWYB50ISM
Messages in response to the pandemic are getting repetitive, and risk appearing insincere, some in the industry say https://t.co/xHudiuchiD
In Dubai, citizens and expats can summon groceries and services – even a single chocolate bar – within minutes.

https://t.co/Pvkhqm9T7Z
BREAKING: Singapore preliminarily confirms 1,111 COVID-19 new cases, crosses 9,000 mark
https://t.co/gg5yWCURRo https://t.co/UhB4h6F7lq
Genting Bhd. and its units are planning the first group-wide salary cut since its founding in 1965.
#YahooFinance
https://t.co/PXmFdLjyvq
A series of purges has strengthened President

The code below obtains tweets matching a search query, in this case 'man utd'.

In [30]:
# Note: Tweepy 4.x renamed api.search to api.search_tweets
manutd = tweepy.Cursor(api.search, q='man utd').items(10)
for tweet in manutd:
    print (tweet.created_at, tweet.text, tweet.lang)

2020-04-21 07:44:59 This list is really interesting. Wolves has the 5th richest owner in the league but doesn't have a single world cla… https://t.co/7kQslzDiRj en
2020-04-21 07:44:46 RT @TheSaItIsHere: This quarantine's made me realise why everyone hates Man Utd fans en
2020-04-21 07:44:26 Man Utd identify Sancho transfer alternative as Solskjaer eyes new forward https://t.co/XzGvCUuq7L https://t.co/BW9nuYw0fw en
2020-04-21 07:43:59 Man Utd identify Sancho transfer alternative as Solskjaer eyes new forward https://t.co/1hfTxVdRBv en
2020-04-21 07:43:57 FA Cup Final 25th May 1963 Man Utd 3 - Leicester City 1 Wonderful (albeit silent) footage of the dressing room afte… https://t.co/mg4jKo7ooi en
2020-04-21 07:43:36 How Man Utd could line up with Joao Felix and two other signings next season #MUFC 
https://t.co/vxgJ9QHsik https://t.co/WgyPdc58Ho en
2020-04-21 07:43:35 @ManUtdInPidgin @ManUtd @andrinhopereira Chop bench abi make them dash am out, bcus he's not even worth to be on Man utd 
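
The tweets printed above are not saved anywhere. A minimal sketch of writing them to a JSON Lines file (one JSON object per line) follows. To keep it runnable without Twitter credentials, it uses a made-up `FakeTweet` stand-in with the same `created_at`, `text` and `lang` attributes printed above; `tweet_to_record` and `save_tweets` are hypothetical helpers, and the sample tweets are invented.

```python
import json
import os
import tempfile
from collections import namedtuple
from datetime import datetime

# Stand-in for the tweet objects returned by Tweepy (assumption for the demo).
FakeTweet = namedtuple("FakeTweet", ["created_at", "text", "lang"])


def tweet_to_record(tweet):
    """Keep only the fields of interest, in a JSON-friendly form."""
    return {
        "created_at": tweet.created_at.isoformat(),
        "text": tweet.text.replace("\n", " ").strip(),
        "lang": tweet.lang,
    }


def save_tweets(tweets, path):
    """Write one JSON object per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for tweet in tweets:
            fh.write(json.dumps(tweet_to_record(tweet), ensure_ascii=False) + "\n")
            count += 1
    return count


tweets = [
    FakeTweet(datetime(2020, 4, 21, 7, 44, 26),
              "Example tweet one\nhttps://example.com", "en"),
    FakeTweet(datetime(2020, 4, 21, 7, 43, 59), "Example tweet two", "en"),
]

# Write to a temporary file and read it back to check the round trip.
with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "tweets.jsonl")
    written = save_tweets(tweets, out_path)
    with open(out_path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh]

print(written, records[0]["text"])
```

With real data, the `tweets` list would be replaced by the iterator returned from the Tweepy cursor, and `out_path` by a file under `data`.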