# Data Collection

This tutorial gives some examples of how to fetch and handle different types of data.

## Import required packages

In [1]:
import time
import datetime

## Files

### Simple text files

In [2]:
sample_file_name = 'data/sample-text-files/sample-text-file-1000kb.txt'

documents = []
with open(sample_file_name) as f:
    for line in f:
        line = line.strip()
        if line != '': # Ignore empty lines
            documents.append(line)
            
print("The file {} contains {} documents.".format(sample_file_name, len(documents)))
print()
print("This is the last document:")
print(documents[-1])

The file data/sample-text-files/sample-text-file-1000kb.txt contains 1656 documents.

This is the last document:
Nunc eget elit elit. Nulla ornare, orci non maximus gravida, quam mi mollis nunc, eu gravida dui diam at nibh. Donec vitae nibh libero. Donec in eleifend neque. In tellus sapien, eleifend in augue et, pellentesque dictum lacus. Etiam congue porttitor sapien eget egestas. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Fusce feugiat, ipsum ut aliquet fringilla, nibh metus ornare dolor, ut semper urna tellus nec orci. Vivamus eget urna ut nisi dapibus sollicitudin. Aliquam ligula ex, placerat nec nisi convallis, mattis suscipit nulla. Donec cursus nec sem consequat tempus. Aenean et ornare dolor, vel bibendum magna. Sed tempor tincidunt lorem.


### CSV/TSV files

In principle, CSV/TSV (comma-separated/tab-separated values) files are also just text files. As such, one can use the approach from above to read them. However, the structured nature of CSV/TSV files quickly leads to annoying issues:

In [3]:
reviews_file_name = 'data/reviews/yelp-reviews-mon-ami-gabi.csv'

with open(reviews_file_name) as f:
    for idx, line in enumerate(f):
        if idx == 0: # We want to ignore the header
            continue
        line = line.strip()
        if line != '':
            review_nr, review_text = line.split(',') # Oh, oh...can you spot the problem?
            print(review_text) # This will most likely throw an error

ValueError: too many values to unpack (expected 2)
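
The split fails because the review texts themselves contain commas, so a line yields more than two fields. As a minimal sketch of the difference, the snippet below compares naive splitting with Python's built-in `csv` module on a made-up two-line sample (the sample string is an assumption for illustration, not taken from the actual file):

```python
import csv
import io

# Hypothetical sample mimicking the structure of the reviews file
sample = 'review_number,review\n1,"Excellent food, great atmosphere"\n'

# Naive splitting cuts the quoted review at its inner comma
naive_fields = sample.splitlines()[1].split(',')
print(len(naive_fields), "fields with str.split")

# csv.reader honours the quote characters and keeps the review in one field
rows = list(csv.reader(io.StringIO(sample)))
print(len(rows[1]), "fields with csv.reader:", rows[1])
```

For TSV files, `csv.reader` accepts `delimiter='\t'`.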

Since handling CSV/TSV files is a very common task, powerful Python packages are already available that make life so much easier. `pandas` is a very popular package for handling structured files such as CSV/TSV files.

In [4]:
import pandas as pd

`pandas` uses the notion of data frames (`df`) to represent tabular data. See the `read_csv` documentation:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

In [5]:
#df = pd.read_csv(reviews_file_name, sep=',', quotechar='"', encoding = "ISO-8859-1")
df = pd.read_csv(reviews_file_name, encoding = "ISO-8859-1")

df.head(n = 10)

| Unnamed: 0 | review_number | review |
|---|---|---|
| 0 | 1 | Excellent food, great atmosphere, a bit noisy.... |
| 1 | 2 | If you enjoy a little people watching with you... |
| 2 | 3 | affordable, fairly classic french foodsit outs... |
| 3 | 4 | Though heartbroken and a bit aimless on my 22n... |
| 4 | 5 | The food and wine was amazing, but the super h... |
| 5 | 6 | Yippy! Make-your-own bloody mary bar! Chose t... |
| 6 | 7 | I went here for a team function on one of the ... |
| 7 | 8 | An affordable dining experience in Paris, I me... |
| 8 | 9 | Definitely one of my favorites on the Strip! I... |
| 9 | 10 | What great friends I have..... We ate at Mon A... |


In [6]:
# Extract list of reviews from data frame
reviews = df['review'].tolist()

print("The file {} contains {} reviews.".format(reviews_file_name, len(reviews)))
print()
print("This is the last review:")
print(reviews[-1])

The file data/reviews/yelp-reviews-mon-ami-gabi.csv contains 1000 reviews.

This is the last review:
I had a $25 lettuce entertain you gift card so I brought it with me to Las Vegas and had a very pricey breakfast, but ... no complaints, it was good.We got seated right away outside, it was still reasonably comfortable enough to eat outside in July in Las Vegas.  We were told the specials, I opted for one of them - a fancy eggs benedict and apple juice, which was a lighter color as it was REAL apple juice, not the kind from a machine or bottle.  We also started our meal with a blackberry bran muffin which was super yummy.


## Online news article

This example addresses online content. Handling "raw" websites is usually a bit annoying since the content is not plain text but HTML. For simplicity, we use the package `newspaper`, which helps to fetch the content of online news articles.
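
To get a feeling for what "not plain text but HTML" means in practice, here is a small sketch that strips tags using only the standard library's `html.parser`. The `TextExtractor` class and the sample HTML string are hypothetical and made up for this illustration; real news pages are far messier (navigation, ads, comments), which is the part a dedicated package takes care of.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

# Hypothetical miniature web page
html = ('<html><head><style>p {color: red;}</style></head>'
        '<body><h1>Storm nears</h1><p>Winds will <b>weaken</b> soon.</p></body></html>')

extractor = TextExtractor()
extractor.feed(html)
plain_text = ' '.join(extractor.chunks)
print(plain_text)
```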

In [11]:
from newspaper import Article

Feel free to copy and paste different news article URLs. Note that the package does not work with all news websites; however, it works just fine with straitstimes.com.

In [12]:
url = 'http://www.straitstimes.com/asia/east-asia/now-its-japans-turn-to-brace-for-a-monster-storm-as-typhoon-lan-nears'
article = Article(url)

The methods `download()` and `parse()` do the actual fetching and processing of the news article.

In [13]:
article.download()
article.parse()

In [14]:
print("Authors:", article.authors, "\n")
print("Publication date:", article.publish_date, "\n")
print("Title:", article.title, "\n")
print("Main text:", article.text, "\n")
print("Top image link:", article.top_image, "\n")
print("Video links:", article.movies, "\n")

Authors: [] 

Publication date: 2017-10-19 22:15:56+08:00 

Title: Now its Japan's turn to brace for a monster storm as Typhoon Lan nears 

Main text: NEW YORK (BLOOMBERG) - Seems like no one can escape nature's wrath these days.

Typhoon Lan is forecast to grow into a monster storm south of Japan before it weakens on its approach to the island nation next week. It come on the heels of Ophelia, which brought gale-force winds to southern Ireland Monday, Maria, which devastated Puerto Rico, and Irma, Harvey and Nate, which struck the US Gulf Coast or Florida.

That's not to mention two hurricanes that recently struck Mexico.


Lan isn't the first big storm for Japan this season. It was struck by typhoons Noru and Talim, in August in September. 

Top image link: https://static.straitstimes.com.sg/sites/default/files/styles/x_large/public/articles/2017/10/19/fa-lan-20171019.jpg?itok=tIVcW4az 

Video links: [] 



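The `newspaper` package comes with some additional functions to extract keywords and generate a summary for a news article. Note that `nlp()` relies on the NLTK `punkt` tokenizer data, which the `nltk.download('punkt')` cell further down installs.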

In [18]:
article.nlp()
print("Keywords:", article.keywords, "\n")
print("Summary:", article.summary, "\n")

Keywords: ['brace', 'japans', 'winds', 'monster', 'lan', 'york', 'storm', 'nears', 'week', 'weaken', 'japan', 'weakens', 'category', 'wrath', 'typhoon', 'turn', 'struck'] 

Summary: NEW YORK (BLOOMBERG) - Seems like no one can escape nature's wrath these days.
Typhoon Lan is forecast to grow into a monster storm south of Japan before it weakens on its approach to the island nation next week.
As it nears the Tokyo-Yokohama area, winds will probably weaken to about 109 mph, making it a Category 2 storm.
Lan isn't the first big storm for Japan this season.
It was struck by typhoons Noru and Talim, in August in September. 
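
To give a rough intuition for keyword extraction, the following sketch ranks words by frequency after removing a few stop words. This is explicitly not how `newspaper` implements `nlp()`; the function `top_keywords`, the tiny stop word set, and the sample text are assumptions made up for illustration.

```python
import re
from collections import Counter

# Deliberately tiny, hypothetical stop word list
STOPWORDS = {'the', 'for', 'and', 'its', 'into', 'before', 'was', 'this'}

def top_keywords(text, n=3):
    """Return the n most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    # Drop stop words and very short words such as "is" or "a"
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

text = "Typhoon Lan is a monster storm. The storm nears Japan. Japan braces for the storm."
keywords = top_keywords(text, n=2)
print(keywords)
```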



In [16]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/mannu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Tweets

Twitter provides an API (Application Programming Interface) that allows fetching public tweets. The `twython` package is a wrapper for this API that simplifies the task in Python.

In [17]:
from twython import Twython, TwythonError

Accessing the API requires credentials. This in turn requires a Twitter account and some further configuration. If you don't have or don't want a Twitter account, that's no problem. This tutorial is only supposed to show how simple the task of fetching tweets is. It won't be required for the other tutorials.

In [19]:
# Replace the placeholders with the credentials of your own Twitter app
APP_KEY = 'YOUR_APP_KEY'
APP_SECRET = 'YOUR_APP_SECRET'
OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'
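
One way to keep credentials out of the notebook itself is to read them from environment variables. The sketch below is only a suggestion; the variable names (`TWITTER_APP_KEY` and so on) and the helper `load_credentials` are assumptions, not something Twitter or `twython` prescribes.

```python
import os

# Assumed variable names; any consistent naming scheme works
CREDENTIAL_NAMES = ['TWITTER_APP_KEY', 'TWITTER_APP_SECRET',
                    'TWITTER_OAUTH_TOKEN', 'TWITTER_OAUTH_TOKEN_SECRET']

def load_credentials(env=None):
    """Return (credentials, missing) looked up in env (default: os.environ)."""
    env = os.environ if env is None else env
    credentials = {name: env[name] for name in CREDENTIAL_NAMES if name in env}
    missing = [name for name in CREDENTIAL_NAMES if name not in env]
    return credentials, missing

credentials, missing = load_credentials()
if missing:
    print("Please set these environment variables first:", ", ".join(missing))
else:
    APP_KEY = credentials['TWITTER_APP_KEY']
    APP_SECRET = credentials['TWITTER_APP_SECRET']
    OAUTH_TOKEN = credentials['TWITTER_OAUTH_TOKEN']
    OAUTH_TOKEN_SECRET = credentials['TWITTER_OAUTH_TOKEN_SECRET']
```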

In [13]:
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

Among other calls, the Twitter API allows searching for tweets using keywords. You can also specify that you're not interested in retweets.

In [14]:
try:
    search_results = twitter.search(q='"orchard road" -filter:retweets', count=20)
except TwythonError as e:
    print(e)

Each tweet comes with a plethora of information. In the following, we are only interested in the user name, the date and time the tweet was posted, and the tweet text itself.

In [18]:
for tweet in search_results['statuses']:
    # Ignore non-English tweets
    language = tweet['lang']
    if language != 'en':
        continue
    # Extract the basic information about the tweet
    screen_name = tweet['user']['screen_name']
    created_at = tweet['created_at']
    tweet_text = tweet['text']
    # Simple way to remove line breaks and tabs: string to list and back to string again
    tweet_text = ' '.join(tweet_text.split())
    # Twitter returns the time as string of the form "Wed Jan 24 10:37:57 +0000 2018"; let's simplify this
    created_at = time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(created_at,'%a %b %d %H:%M:%S +0000 %Y'))
    # Print each tweet with publication date, the screen name of the user, and the actual text of the tweet
    print('[{}] @{} wrote: {}'.format(created_at, screen_name, tweet_text))

[2018-02-25 04:24:14] @FareezShahNB wrote: Was walking along Orchard Road and I stumbled upon a super talented and down to earth musician… https://t.co/Q4cx05FrqM
[2018-02-25 04:19:40] @widiasmoro wrote: Having a session with @mindfico at Today at Apple Orchard Road https://t.co/EApVG1g2RY
[2018-02-25 00:17:03] @myfatpocket wrote: Teeth Whitening at La Vie Aesthetics - Just a few days before Chinese New Year, I went to make an appointment for t… https://t.co/1UdOyTx2d0
[2018-02-25 00:13:58] @staronline wrote: Making Orchard Road great again https://t.co/K7DriL6TBp https://t.co/y9EKgrYVdM
[2018-02-24 13:39:07] @Travelodium wrote: Orchard Road Singapore #travel https://t.co/rU8gyI4XG8 https://t.co/J5WkB2SrVo
[2018-02-24 12:46:32] @JenanJamal wrote: I'm at Orchard Road in Singapore https://t.co/aUrtzknqKn
[2018-02-24 11:31:50] @heytherew0man wrote: I'm at Apple Orchard Road in Singapore https://t.co/o1oNSfvBKw
[2018-02-24 09:19:45] @JenanJamal wrote: I'm at Orchard Road in Singapore https
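
The `time.strptime` call in the cell above treats `+0000` as literal text, so it assumes the timestamp is always given in UTC, as in the format shown in the code comment. As a hedged alternative sketch, the `datetime` module can parse the offset with `%z` and convert the result to a local time zone. The fixed UTC+8 offset for Singapore is an assumption chosen to match the "orchard road" query.

```python
from datetime import datetime, timedelta, timezone

# Example timestamp in the format quoted in the code comment above
raw = "Wed Jan 24 10:37:57 +0000 2018"

# %z parses the UTC offset and yields a timezone-aware datetime
parsed = datetime.strptime(raw, '%a %b %d %H:%M:%S %z %Y')
utc_string = parsed.strftime('%Y-%m-%d %H:%M:%S')
print("UTC:      ", utc_string)

# Assumed local zone: Singapore, fixed offset UTC+8
singapore = timezone(timedelta(hours=8))
local_string = parsed.astimezone(singapore).strftime('%Y-%m-%d %H:%M:%S')
print("Singapore:", local_string)
```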