## Text Data Collection

Text analysis starts with textual data, and the sources of such data are varied. Some datasets are prepared for academic purposes and come well labelled, but in practice you often need to collect data from the 'real world'. This includes collection from the Internet: RSS feeds, Google search pages, social media and so on.

These sources are an important avenue for collecting Internet data for sentiment analysis. For example, almost all news media provide RSS feeds. Note that RSS content is not user-generated content (UGC), so expect it to differ from social media or blogs: both the content and the way it is written are substantially different from 'short messages'. Most news items are also summarised by their headlines.

In this notebook, we illustrate some examples of text data collection:
- RSS feeds
- Yelp (a popular website, collected by web scraping)
- Google search pages
- Twitter (as usual!)

## RSS Feeds
We first illustrate with RSS Feeds.
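
Before fetching a live feed, it may help to see what a feed contains. The sketch below is only an illustration, not the notebook's method: it uses the standard library instead of feedparser, and the feed text, the URLs and the `read_items` helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up two-item feed, used only to show the shape of RSS 2.0.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Markets open higher</title>
      <link>https://example.com/news/1</link>
      <description>Shares rose in early trading.</description>
    </item>
    <item>
      <title>Rain expected this week</title>
      <link>https://example.com/news/2</link>
      <description>Forecasters expect showers.</description>
    </item>
  </channel>
</rss>"""


def read_items(xml_text):
    """Return one dict per <item> with its title, link and description."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


items = read_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], "->", entry["link"])
```

Each item carries a headline, a link to the full article and a short summary. Real feeds vary in which fields they fill in, which is one reason a dedicated parser is convenient.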

In [23]:
# Import packages
# Run this cell first, before any other code
from __future__ import unicode_literals
import os
import time
fpath = os.getcwd()
print (fpath)

import json
from feedparser import parse
from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

TIMEOUT = 30
jsonlist = []

C:\Users\isstyc\Documents\NUS\Teaching\Sentiment Analytics\Workshops\Notebooks


In [3]:
import re, sys

def removeIndent(phrase):
    # Replace each newline, carriage return and tab with a single space
    return re.sub(r"[\n\r\t]", " ", phrase)

def removeWS(phrase):
    # Remove every space character, including those between words
    return phrase.replace(" ", "")

def removePunc(phrase):
    # Spell out '&' and '%', and turn double quotes into single quotes
    phrase = phrase.replace("&", " and ")
    phrase = phrase.replace('"', "'")
    phrase = phrase.replace("%", "percent")
    return phrase
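
Note that `removeIndent` swaps each control character for its own space, so a sequence such as `\r\n\t` leaves three spaces behind, while `removeWS` deletes the spaces between words as well. If single-spaced text is wanted, one possible variant is sketched below; `normalise_ws` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re


def normalise_ws(phrase):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", phrase).strip()


sample = "  Line one\r\n\tline two   with   gaps \n"
print(repr(normalise_ws(sample)))
```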


### Examples of RSS feeds are listed below.
- http://www.channelnewsasia.com/rss/latest_cna_biz_rss.xml # business
- http://www.channelnewsasia.com/rss/latest_cna_sgbiz_rss.xml # sg biz
- http://www.channelnewsasia.com/rss/latest_cna_world_rss.xml # world
- http://www.channelnewsasia.com/rss/latest_cna_asiapac_rss.xml # asia pac
<br>

SCMP
- http://www.scmp.com/rss/2/feed  # HK
- http://www.scmp.com/rss/10/feed  # business
- http://www.scmp.com/rss/318421/feed # china feed


The code below reads the feed, follows the link of each entry, and saves the article text as JSON.

In [8]:
if __name__ == "__main__":
    newsurl = "http://www.channelnewsasia.com/rss/latest_cna_world_rss.xml"
    # Opening with "w" truncates any existing file, so no explicit delete is needed
    ffile = open("Data\\cna.json", "w")
    rss = parse(newsurl)
    # Only the first 3 entries are fetched; note the feed format can change from time to time
    for rss_entry in rss['entries'][:3]:
        url_link = rss_entry['id']
        try:
            url_content = get(url_link, timeout=TIMEOUT)
        except Exception as e:
            print("Error site for " + url_link)
            continue
        if url_content.ok:
            page = url_content.content.decode('utf-8', 'ignore')
            soup = BeautifulSoup(page, 'html.parser')
            # The article body is expected inside this div; skip the page if it is missing
            article = soup.find("div", {"class": "c-rte--article"})
            if article is None:
                continue
            # Join paragraphs with a space so words at paragraph boundaries do not fuse
            content = " ".join(p.text.strip() for p in article.find_all('p'))
            url_label = removePunc(rss_entry['title'])
            url_id = rss_entry['id']

            jdata = {"url_id": url_id, "content": {"url_label": url_label, "text": content}}
            jsonlist.append(jdata)
    json.dump(jsonlist, ffile)
    ffile.close()
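
Once the file is written, the records can be read back for analysis. The following is a minimal sketch that assumes the record layout used above (`url_id` plus a nested `content` dict); the two sample records, the temporary file and the `n_words` field are invented for illustration.

```python
import json
import os
import tempfile

# Two made-up records in the same shape the cell above appends to jsonlist.
records = [
    {"url_id": "https://example.com/a",
     "content": {"url_label": "First headline", "text": "Body of the first article."}},
    {"url_id": "https://example.com/b",
     "content": {"url_label": "Second headline", "text": "Body of the second article."}},
]

# Write to a temporary file so the sketch does not touch the real data folder.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(records, fh)

with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

# Flatten the nested "content" dict into one row per article.
rows = [
    {"url": rec["url_id"],
     "headline": rec["content"]["url_label"],
     "n_words": len(rec["content"]["text"].split())}
    for rec in loaded
]
for row in rows:
    print(row)
```

A flat list of rows like this is easy to hand to pandas or to a sentiment scorer later on.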

## Data Collection from the Yelp sites

Another source of data is user-generated content. Here we look at Yelp, a popular website for reviews of restaurants and other services. Yelp data can be obtained through their API, but that route has limitations, so here we use web scraping instead.

In [10]:
yelp_url = "https://www.yelp.com/biz/the-sushi-bar-singapore?osq=Restaurants"

url_content = get(yelp_url, timeout=TIMEOUT)
page = url_content.content.decode('utf-8','ignore')

# The page embeds its structured data as JSON-LD inside a <script> tag
soup = BeautifulSoup(page, 'html.parser')
data = soup.find("script", type="application/ld+json").text.strip()
data = removeIndent(data)

# Parse the JSON-LD text and save it; the with-block closes the file
jsondata = json.loads(data)
with open("Data\\yelp_1.json", "w") as ffile:
    json.dump(jsondata, ffile)
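
To make the JSON-LD step concrete, here is a small sketch that pulls the same kind of `<script type="application/ld+json">` block out of a page using only the standard library. The HTML fragment is made up, and the `review`, `reviewRating` and `description` field names are assumptions based on schema.org conventions; the actual Yelp payload may be laid out differently.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


# Made-up page fragment; the field names follow schema.org conventions.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Example Sushi",
 "aggregateRating": {"ratingValue": 4.5, "reviewCount": 2},
 "review": [
   {"author": "A.", "reviewRating": {"ratingValue": 5}, "description": "Fresh fish."},
   {"author": "B.", "reviewRating": {"ratingValue": 4}, "description": "Good value."}
 ]}
</script>
</head><body></body></html>
"""

parser = JsonLdExtractor()
parser.feed(SAMPLE_HTML)
business = json.loads(parser.blocks[0])

# Keep only the rating and the review text for each review.
reviews = [(r["reviewRating"]["ratingValue"], r["description"])
           for r in business.get("review", [])]
print(business["name"], business["aggregateRating"]["reviewCount"])
print(reviews)
```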


### Automation of web download
Automating download of information from websites using Selenium.


In [13]:
ffile = open("Data\\yelp_2.json","w")
def getBS(data):
    soup = BeautifulSoup(data, "html.parser")
    data = soup.find("script", type="application/ld+json").text.lstrip().rstrip()
    data = removeIndent(data)
    jsondata = json.loads(data)
    return jsondata

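# Selenium 3 style: the chromedriver path is passed directly.
# Selenium 4 expects the path to be wrapped in a Service object instead.
drive=webdriver.Chrome(fpath + "\\jar\\chromedriver.exe")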
drive.set_page_load_timeout(10)
yelp_url = "https://www.yelp.com/biz/the-sushi-bar-singapore?start="
i=0
drive.get(yelp_url+str(i))
time.sleep(10)
data = drive.page_source
data0 = getBS(data)  # in dict format
print ("first clicked :" + str(i) + " downloaded")
reviews = {i : data0}

NbReviews = data0['aggregateRating']['reviewCount']
print ("Total no of reviews: " +str(NbReviews))

while i< NbReviews-20:  # code can be improved to look for next button in Selenium
    i=i+20
    drive.get(yelp_url+str(i))
    print ("no of reviews :" + str(i) + " downloaded")
    time.sleep(10)
    data = drive.page_source
    data = getBS(data) 
    #data = pd.DataFrame.to_json(getBS(data))  # in json format
    reviews[i]= data 

json.dump(reviews, ffile)
ffile.close()
drive.quit()  # close the browser window that Selenium opened

first clicked :0 downloaded
Total no of reviews: 41
no of reviews :20 downloaded
no of reviews :40 downloaded
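
The `start=` values visited by the loop above can also be computed up front. This is a small sketch under the same assumption the loop makes, namely 20 reviews per page; `page_offsets` is a hypothetical helper name.

```python
def page_offsets(total_reviews, page_size=20):
    """Return the `start=` values needed to cover every review."""
    return list(range(0, total_reviews, page_size))


offsets = page_offsets(41)
print(offsets)  # [0, 20, 40]
```

For the 41 reviews reported above this gives the same three pages (0, 20 and 40) that the `while` loop downloads.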


## Data collection from Google Search
It is also possible to extract search snippets from Google Search. From there, it is a simple task to use Selenium, as above, to extract the content of the pages returned by the search. The example below uses the search term 'Coffee'.

To run the code below, you need to obtain an API key and create a custom search engine ID from the site
https://developers.google.com/custom-search/v1/overview?csw=1

In [1]:
APIKEY = 'place your API Key'

# https://developers.google.com/custom-search/v1/overview?csw=1

In [2]:
CSE_ID = 'place your CSE ID - custom search engine id'

# https://developers.google.com/custom-search/v1/overview?csw=1
# Also enable the "Search the entire web"

![image.png](images/image.png)

In [3]:
# It looks something like this.........

![image.png](images/image2.png)

In [None]:
#!pip install google-api-python-client

In [9]:
from googleapiclient.discovery import build
my_api_key = APIKEY
my_cse_id = CSE_ID

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res

In [10]:
result = google_search("Coffee", my_api_key, my_cse_id)
from pprint import pprint
pprint(result)

{'context': {'title': 'Coffee'},
 'items': [{'cacheId': 'U6oJMnF-eeUJ',
            'displayLink': 'en.wikipedia.org',
            'formattedUrl': 'https://en.wikipedia.org/wiki/Coffee',
            'htmlFormattedUrl': 'https://en.wikipedia.org/wiki/<b>Coffee</b>',
            'htmlSnippet': '<b>Coffee</b> is a brewed drink prepared from '
                           'roasted <b>coffee</b> beans, the seeds of <br>\n'
                           'berries from certain Coffea species. The genus '
                           'Coffea is native to tropical Africa <br>\n'
                           'and&nbsp;...',
            'htmlTitle': '<b>Coffee</b> - Wikipedia',
            'kind': 'customsearch#result',
            'link': 'https://en.wikipedia.org/wiki/Coffee',
            'pagemap': {'cse_image': [{'src': 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/A_small_cup_of_coffee.JPG/1200px-A_small_cup_of_coffee.JPG'}],
                        'cse_thumbnail': [{'height': '194',
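
The response is a nested dictionary, so the useful parts need to be picked out. The sketch below works on a trimmed, partly made-up response that reuses only keys visible in the output above (`items`, `link`, `htmlTitle`, `htmlSnippet`); `strip_html` and `summarise` are hypothetical helpers, not part of the Google client library.

```python
import html
import re

# A trimmed, partly made-up response shaped like the output above.
sample_result = {
    "items": [
        {"displayLink": "en.wikipedia.org",
         "link": "https://en.wikipedia.org/wiki/Coffee",
         "htmlTitle": "<b>Coffee</b> - Wikipedia",
         "htmlSnippet": "<b>Coffee</b> is a brewed drink prepared from "
                        "roasted <b>coffee</b> beans, the seeds of <br>\n"
                        "berries from certain Coffea species."},
    ]
}


def strip_html(text):
    """Drop tags, decode entities such as &nbsp;, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def summarise(result):
    """Return (link, title, snippet) tuples for each search hit."""
    return [(item["link"], strip_html(item["htmlTitle"]), strip_html(item["htmlSnippet"]))
            for item in result.get("items", [])]


hits = summarise(sample_result)
for link, title, snippet in hits:
    print(title, "|", link)
    print(snippet)
```

The links collected this way are the pages that could then be fetched with requests or Selenium, as in the earlier sections.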
   

### Twitter data download

To download Twitter feeds using Python, consider using the tweepy library: https://tweepy.readthedocs.io/en/latest/getting_started.html

First create an application on Twitter. Follow the steps at https://developer.twitter.com/en/apps/ to obtain the keys used below.

In [1]:
consumer_key = "Use your own key etc" 
consumer_secret = "consumer secret"
access_token = "access token"
access_token_secret = "access token secret"


In [37]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

Najib's lawyer said the court should not accept evidence given by prosecution witnesses of what Jho Low had alleged… https://t.co/gHC5Mn5cwi
Heard on the Street: The demise of 178-year old British tour operator Thomas Cook is a testament to the profound ch… https://t.co/MU1LXhfmWN
South Korea has culled around 15,000 pigs since the first case was reported on 17 Sept.

https://t.co/yLdSoL929g
“I’ve also got enough blood pressure medication to last me over two weeks.” A Scotsman trapped in Florida with his… https://t.co/SWIAwX6yWF
RT @PlattsOil: Refinery Margin Tracker: Asian refining margins for US crude higher on Saudi supply hitch | #crudeoil #OOTT #refiners | http…
RT @HumanProgress: On average commodities become 3.4% more affordable each year. That means that the time price of commodities halves every…
RT @V_of_Europe: Sweden: Racist migrant gang films ruthless beating of young Swedish schoolboy - Voice https://t.co/hsA7lRWpMd
Nuclear energy too slow, too expensive to save climate -

The next cell searches for tweets that match a query, in this case 'man utd', and prints the first 10 results.

In [38]:
# Note: api.search was renamed api.search_tweets in Tweepy 4
manutd = tweepy.Cursor(api.search, q='man utd').items(10)
for tweet in manutd:
    print (tweet.created_at, tweet.text, tweet.lang)
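
To keep the tweets rather than only print them, the three fields shown above can be turned into JSON-friendly records, in the same spirit as the other sections. This is a sketch only: `tweet_to_record` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real tweepy results so the snippet runs without credentials.

```python
import json
from datetime import datetime
from types import SimpleNamespace


def tweet_to_record(tweet):
    """Keep the three fields printed above in a JSON-friendly dict."""
    return {
        # datetime objects are not JSON serialisable, so store an ISO string
        "created_at": tweet.created_at.isoformat(),
        "text": tweet.text,
        "lang": tweet.lang,
    }


# Stand-ins for tweepy status objects, with invented values.
fake_tweets = [
    SimpleNamespace(created_at=datetime(2019, 9, 24, 8, 30, 0),
                    text="Example tweet about man utd", lang="en"),
    SimpleNamespace(created_at=datetime(2019, 9, 24, 8, 31, 5),
                    text="Otro tuit de ejemplo", lang="es"),
]

records = [tweet_to_record(t) for t in fake_tweets]
print(json.dumps(records, indent=2))
```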