## Text Data Collection

Text analysis needs textual data, and the sources of such data are varied. Some datasets are prepared for academic purposes and are well labelled. Often, however, you need to collect data from the 'real world', for example from the Internet: RSS feeds, Google search pages, social media and so on.

These sources are an important avenue for collecting Internet data for sentiment analysis. For example, almost all news media provide RSS feeds. Note that RSS content is not user-generated content (UGC), so expect it to differ from social media or blog posts: both the content and the way it is written differ substantially from 'short messages'. Most news items are also summarised by their headlines.

In this notebook, we illustrate some examples of text data collection:
- RSS feeds
- Yelp (a popular website, collected by web scraping)
- Google search pages
- Twitter (as usual!)

## RSS Feeds
We first illustrate with RSS feeds.

In [1]:
# Import packages
# Run this cell first, before any other code
from __future__ import unicode_literals
import os
import time
fpath = os.getcwd()
print (fpath)

import json
from feedparser import parse
from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

TIMEOUT = 30
jsonlist = []

/Users/liming/projects/sentiment/Day1


In [2]:
import re, sys

def removeIndent(phrase):
    # Replace newlines, carriage returns and tabs with a space
    phrase = re.sub(r"[\n\r\t]", ' ', phrase)
    return phrase

def removeWS(phrase):
    # Remove every space character
    phrase = re.sub(' ', '', phrase)
    return phrase

def removePunc(phrase):
    # Replace characters that cause trouble downstream with safer equivalents
    phrase = re.sub('&', ' and ', phrase)
    phrase = re.sub('"', "'", phrase)
    phrase = re.sub('%', 'percent', phrase)  # '%' needs no escaping in a regex
  #  phrase=re.sub(',','\,',phrase)
    return phrase


### Examples of RSS sites are listed below.
- http://www.channelnewsasia.com/rss/latest_cna_biz_rss.xml # business
- http://www.channelnewsasia.com/rss/latest_cna_sgbiz_rss.xml # sg biz
- http://www.channelnewsasia.com/rss/latest_cna_world_rss.xml # world
- http://www.channelnewsasia.com/rss/latest_cna_asiapac_rss.xml # asia pac
<br>

SCMP
- http://www.scmp.com/rss/2/feed  # HK
- http://www.scmp.com/rss/10/feed  # business
- http://www.scmp.com/rss/318421/feed # china feed


The code to retrieve RSS content is in the script rss_channelnewsasia.py, which the next cell runs.

In [4]:
%run rss_channelnewsasia.py

n_rss_entries: 100
[0] - https://www.channelnewsasia.com/news/world/covid-19-lockdown-averted-3-million-deaths-europe-12816142


[1] - https://www.channelnewsasia.com/news/world/george-floyd-protests-france-adama-traore-12815990
[3] - https://www.channelnewsasia.com/news/world/overfishing-ocean-sea-fish-un-report-12815960
[4] - https://www.channelnewsasia.com/news/asia/hong-kong-protests-national-security-law-china-target-12815868
[5] - https://www.channelnewsasia.com/news/world/british-uk-slave-trader-statue-race-imperialism-debate-12815586
[6] - https://www.channelnewsasia.com/news/world/australia-china-unresponsive-pleas-ease-tension-12814558
[7] - https://www.channelnewsasia.com/news/asia/philippines-students-away-school-covid-19-vaccine-available-12815698
[8] - https://www.channelnewsasia.com/news/world/dutch-mh17-trial-resumes-after-delay-due-to-covid-19-lockdown-12815536
[9] - https://www.channelnewsasia.com/news/world/covid-19-china-demands-proof-us-senator-rick-scott-12815428
[10] - https://www.channelnewsasia.com/news/world/tropical-storm-cristobal-to-weaken-into--depression--in-coming-hours-12815486
[

[74] - https://www.channelnewsasia.com/news/world/george-floyd-protests-australia-defy-bans-at-black-lives-matters-12810722
[75] - https://www.channelnewsasia.com/news/business/vt-san-antonio-aerospace-st-engineering-aerospace-data-breach-12810492
[76] - https://www.channelnewsasia.com/news/world/george-floyd-trump-sparks-controversy-saying-great-day-for-floyd-12810388
[77] - https://www.channelnewsasia.com/news/world/gargling-bleach-misuse-disinfectant-coronavirus-prevention-trump-12810206
[78] - https://www.channelnewsasia.com/news/world/g20-pledges-more-than-us-21-billion-to-fight-coronavirus-12810378
[79] - https://www.channelnewsasia.com/news/world/covid-19-bolsonaro-threatens-who-exit-brazil-record-deaths-12810048
[80] - https://www.channelnewsasia.com/news/world/trump-orders-big-us-troop-cut-in-germany--official-says-12810352
[81] - https://www.channelnewsasia.com/news/world/canada-george-floyd-protests-rally-trudeau-kneels-12810322
[82] - https://www.channelnewsasia.com/news/sp
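
The script `rss_channelnewsasia.py` is not reproduced in this notebook. As a rough sketch of the idea, and not the script's actual contents, the block below lists the title and link of each item in an RSS 2.0 document. It uses only the standard library (`xml.etree.ElementTree`) rather than `feedparser`, so it runs without a network connection. The function name `rss_items` and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the XML a real RSS URL would return
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/news/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>"""

def rss_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is absent
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items

# Print the links in the same "[n] - url" layout as the output above
for n, entry in enumerate(rss_items(SAMPLE_RSS)):
    print("[{}] - {}".format(n, entry["link"]))
```

With `feedparser`, the equivalent loop would iterate over `parse(url).entries` and read each entry's `title` and `link`.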

## Data Collection from the Yelp sites

Another source of data is user-generated content. Here we look at Yelp, a popular website for reviews of restaurants and other services. The data can be obtained through Yelp's API, but that route has limitations, so here we use web scraping instead.

In [6]:
%run yelp_reviews.py

{'aggregateRating': {'reviewCount': 41, '@type': 'AggregateRating', 'ratingValue': 4.5}, 'review': [{'reviewRating': {'ratingValue': 5}, 'datePublished': '2017-05-19', 'description': 'The service at Sushi Bar was great with an extremely polite and conscientious staff. Decor of the location was very intimate but also great for larger groups.\n\nCame here on my birthday and ordered tuna sashimi (comes with three large pieces of raw fish), crab stick roll, and the salmon aburi. All of the sushi tasted fresh and was of high quality for a fair price. They even put a little candle on the sushi, so I kinda had to give it 5 stars.\n\nWould definitely recommend!', 'author': 'Tanmay A.'}, {'reviewRating': {'ratingValue': 4}, 'datePublished': '2019-07-05', 'description': "Oh! This was a great choice to come here. It's not easy to find as it is inside the mall and the numbering of stores is hard to figure out. But looking for it is worth it. \nI had the California Roll (very nice), the Salmon Abur
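
The dictionary printed above nests each review under the `review` key. A minimal sketch of flattening it into one row per review follows. The function name `flatten_reviews` and the sample dictionary are illustrative assumptions; only the key names are taken from the output above.

```python
def flatten_reviews(jsonld):
    """Turn a Yelp-style JSON-LD dict into a list of flat review rows."""
    rows = []
    for review in jsonld.get("review", []):
        rows.append({
            "author": review.get("author"),
            "date": review.get("datePublished"),
            "rating": review.get("reviewRating", {}).get("ratingValue"),
            # Collapse runs of whitespace (including newlines) into single spaces
            "text": " ".join(review.get("description", "").split()),
        })
    return rows

# Made-up sample with the same shape as the scraped output, not real reviews
sample = {
    "aggregateRating": {"reviewCount": 2, "ratingValue": 4.5},
    "review": [
        {"reviewRating": {"ratingValue": 5}, "datePublished": "2020-01-01",
         "description": "Great food.\n\nWould return!", "author": "A. Reviewer"},
        {"reviewRating": {"ratingValue": 4}, "datePublished": "2020-02-02",
         "description": "Hard to find, but worth it.", "author": "B. Reviewer"},
    ],
}

rows = flatten_reviews(sample)
for row in rows:
    print(row)
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` for further analysis.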

### Automation of web download
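Selenium can automate the download of information from websites. The cell below opens each page of reviews in a browser, 20 reviews per page, and saves the JSON embedded in each page.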


In [None]:
def getBS(data):
    # Parse the page and return its JSON-LD block (ratings and reviews) as a dict
    soup = BeautifulSoup(data, "html.parser")
    data = soup.find("script", type="application/ld+json").text.strip()
    data = removeIndent(data)
    jsondata = json.loads(data)
    return jsondata

# Selenium 3 style call; newer Selenium 4 releases expect
# webdriver.Chrome(service=Service(path)) instead
drive=webdriver.Chrome(fpath + "/jar/chromedriver")
#drive=webdriver.Chrome(fpath + "/jar/chromedriver.exe")
drive.set_page_load_timeout(10)
yelp_url = "https://www.yelp.com/biz/the-sushi-bar-singapore?start="
i=0
drive.get(yelp_url+str(i))
time.sleep(10)  # wait for the page to finish rendering
data = drive.page_source
data0 = getBS(data)  # in dict format
print("first page, start=" + str(i) + ", downloaded")
reviews = {i : data0}

NbReviews = data0['aggregateRating']['reviewCount']
print("Total no of reviews: " + str(NbReviews))

while i < NbReviews-20:  # code can be improved to look for next button in Selenium
    i=i+20
    drive.get(yelp_url+str(i))
    time.sleep(10)
    data = drive.page_source
    data = getBS(data)
    #data = pd.DataFrame.to_json(getBS(data))  # in json format
    reviews[i]= data
    print("page starting at review " + str(i) + " downloaded")

drive.quit()  # close the browser once every page has been fetched

# Save all pages; json.dump writes the integer page offsets as string keys
with open("data/yelp_2.json", "w") as ffile:
    json.dump(reviews, ffile)
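
The `while` loop above steps through the review pages 20 at a time. The same sequence of `start` offsets can be generated up front with `range`, which makes the paging logic easy to check without opening a browser. This is only a sketch: `page_offsets` is a hypothetical helper, and the page size of 20 is taken from the loop above.

```python
def page_offsets(total_reviews, page_size=20):
    """Return the `start` offsets needed to cover `total_reviews` reviews."""
    return list(range(0, total_reviews, page_size))

yelp_url = "https://www.yelp.com/biz/the-sushi-bar-singapore?start="

# For 41 reviews this visits start=0, 20 and 40, the same pages as the loop above
for offset in page_offsets(41):
    print(yelp_url + str(offset))
```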

## Data collection from Google Search
It is also possible to extract search snippets from Google Search. From there, it is a simple task to use Selenium, as above, to extract the contents of the pages returned by the search. The example below uses the search term 'macbook'.

To run the code below, you need to obtain an API key and create a custom search engine ID at
https://developers.google.com/custom-search/v1/overview?csw=1

In [9]:
APIKEY = 'AIzaSyChHbxfZWIGzhk3ogu78S0R900ZVxrNDr8'

# https://developers.google.com/custom-search/v1/overview?csw=1

In [10]:
CSE_ID = 'mingsqtt'

# https://developers.google.com/custom-search/v1/overview?csw=1
# Also enable the "Search the entire web" option

![image.png](images/image.png)

In [11]:
# It looks something like this.........

![image.png](images/image2.png)

In [12]:
#!pip install google-api-python-client

In [13]:
from googleapiclient.discovery import build
my_api_key = APIKEY
my_cse_id = CSE_ID

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res

In [14]:
result = google_search("macbook", my_api_key, my_cse_id)
from pprint import pprint
pprint(result)

HttpError: <HttpError 400 when requesting https://customsearch.googleapis.com/customsearch/v1?q=macbook&cx=mingsqtt&key=AIzaSyChHbxfZWIGzhk3ogu78S0R900ZVxrNDr8&alt=json returned "Request contains an invalid argument.">
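
When the request succeeds, the Custom Search JSON API returns the hits in an `items` list, where each hit carries `title`, `link` and `snippet` fields. The sketch below pulls those fields out of a response dictionary. The helper name `extract_snippets` and the sample response are assumptions for illustration; the sample is not real search output.

```python
def extract_snippets(result):
    """Return (title, link, snippet) tuples from a Custom Search response dict."""
    hits = []
    # 'items' is absent when the search returns no results
    for item in result.get("items", []):
        hits.append((item.get("title"), item.get("link"), item.get("snippet")))
    return hits

# Hypothetical response, trimmed to the fields used here
sample_result = {
    "items": [
        {"title": "Example page one", "link": "https://example.com/1",
         "snippet": "First example snippet."},
        {"title": "Example page two", "link": "https://example.com/2",
         "snippet": "Second example snippet."},
    ]
}

for title, link, snippet in extract_snippets(sample_result):
    print(title, "|", link, "|", snippet)
```

The links collected this way are the pages that could then be fetched with Selenium, as in the Yelp example.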

## Twitter data download

To download Twitter feeds using Python, consider the tweepy library: https://tweepy.readthedocs.io/en/latest/getting_started.html

First create an application on Twitter. Follow the steps at https://developer.twitter.com/en/apps/ to obtain the keys used below.

In [15]:
consumer_key = "Use your own key etc" 
consumer_secret = "consumer secret"
access_token = "access token"
access_token_secret = "access token secret"
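
Hard-coding keys in a notebook makes them easy to leak. One common alternative, sketched here, is to read them from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions chosen for this example, not names that tweepy requires.

```python
import os

# Hypothetical environment variable names; use whatever names you exported
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}

def load_credentials(environ=os.environ):
    """Read the four credentials from a mapping, failing early if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {var: "dummy-" + var.lower() for var in CREDENTIAL_VARS.values()}
creds = load_credentials(demo_env)
print(sorted(creds))
```

Calling `load_credentials()` with no argument reads from the real environment, and the returned values can be assigned to `consumer_key`, `consumer_secret`, `access_token` and `access_token_secret`.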


In [17]:
# !pip install tweepy

Collecting tweepy
  Downloading tweepy-3.8.0-py2.py3-none-any.whl (28 kB)
Collecting requests-oauthlib>=0.7.0
  Downloading requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting oauthlib>=3.0.0
  Downloading oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.3.0 tweepy-3.8.0


In [18]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

TweepError: [{'code': 89, 'message': 'Invalid or expired token.'}]

The next cell searches for tweets matching a query, in this case 'man utd'.

In [19]:
manutd = tweepy.Cursor(api.search, q='man utd').items(10)
for tweet in manutd:
    print (tweet.created_at, tweet.text, tweet.lang)

TweepError: Twitter error response: status code = 401
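
Once authentication works, the printed fields can be collected into plain dictionaries for later analysis. The sketch below does this for any objects exposing `created_at`, `text` and `lang` attributes, as the tweets in the loop above do, and assumes `created_at` is a `datetime`. `tweets_to_records` is a hypothetical helper, and the sample objects are stand-ins built with `SimpleNamespace`, not real tweets.

```python
import json
from datetime import datetime
from types import SimpleNamespace

def tweets_to_records(tweets):
    """Convert tweet-like objects into JSON-serialisable dictionaries."""
    records = []
    for tweet in tweets:
        records.append({
            # datetime objects are not JSON-serialisable, so store an ISO string
            "created_at": tweet.created_at.isoformat(),
            "text": tweet.text,
            "lang": tweet.lang,
        })
    return records

# Stand-in objects with the same attributes as the tweets printed above
fake_tweets = [
    SimpleNamespace(created_at=datetime(2020, 6, 9, 12, 0, 0),
                    text="Example tweet one", lang="en"),
    SimpleNamespace(created_at=datetime(2020, 6, 9, 12, 5, 0),
                    text="Example tweet two", lang="en"),
]

records = tweets_to_records(fake_tweets)
print(json.dumps(records, indent=2))
```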