# DSCI 511: Data acquisition and pre-processing<br>Chapter 9: Distribution, accessibility, and data sharing

## 9.0 Can we share?
What should we do with an exciting, completed dataset? Well, if you're asking this question you've probably already released the data to the group or individual downstream who motivated the dataset's development. But datasets often have unanticipated value beyond their original purpose and moreover, if you're in the business of developing datasets then you'd probably like to create some visibility for your work, much like a portfolio. However, before taking these final steps let's consider a few questions:
+ What are the reasons for sharing data?
+ Is the data in the correct form to be shared?
+ Who has the rights to share?
+ What mechanism can we use to share?

Of course, each group or individual is going to have wildly different answers to these questions. There won't really be any hard and fast rules to stick by (apart from not doing anything illegal!). Probably the best way to proceed is by looking at some examples.

## 9.1 A BuzzFeed dataset
Let's look at the dataset discussed in the following BuzzFeed article: 
- https://www.buzzfeednews.com/article/craigsilverman/partisan-fb-pages-analysis.

For this, a team of journalists analyzed every post made by 9 separate Facebook pages for news outlets during a 7 business day period for veracity. In other words, each of these posts was read by the journalists and it was determined how factual they were. The journalists assigned the following four veracity categories:
+ mostly true
+ mostly false
+ mixture of true and false
+ no factual content

The time period chosen was during September 2016 at the height of the U.S. Presidential Election. While the BuzzFeed article reported some interesting analyses of the posts and the attention they received on Facebook, there are likely more interesting analyses that could be performed by others! However, what data from Facebook can be shared? Even if Facebook's data could not be passed forward, it's essential that BuzzFeed's analysis be reproducible! What options exist?

### 9.1.1 Leaving a paper trail
Let's check out how they actually provided the data that backs up the story. Some searching of the article yields a link to a GitHub repository: 
- https://github.com/BuzzFeedNews/2016-10-facebook-fact-check 

which is one of the best places to put data you want the public to see! In addition to being a public-facing venue in which (relatively small) data can be placed, GitHub's focus on version control makes updates and tracking easy. Let's take a look at the first couple of rows of the data in the repository, appearing under the path: 
- `data/facebook-fact-check.csv`. 

Note: to access a single file from a GitHub repo, raw content can be found under URLs like
- https://raw.githubusercontent.com/USERNAME/REPOSITORY/BRANCH/PATH/TO/FILENAME
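
As a small sketch of that URL pattern, the helper below assembles a raw-content URL from its parts. The function name `raw_github_url` is hypothetical and introduced here only for illustration; the pattern itself is the one shown above.

```python
def raw_github_url(username, repository, branch, path):
    """Assemble a raw.githubusercontent.com URL from its components."""
    # strip stray slashes so the pieces join cleanly
    path = path.strip("/")
    return f"https://raw.githubusercontent.com/{username}/{repository}/{branch}/{path}"

# the BuzzFeed file discussed in this section
url = raw_github_url("BuzzFeedNews", "2016-10-facebook-fact-check",
                     "master", "data/facebook-fact-check.csv")
print(url)
```

Building the URL from named parts makes it easy to point the same download code at a different branch or file.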

In [1]:
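import os
import requests

# create the output directory portably (no shell call needed)
os.makedirs("data", exist_ok=True)

response = requests.get("https://raw.githubusercontent.com/BuzzFeedNews/2016-10-facebook-fact-check/master/data/facebook-fact-check.csv")
response.raise_for_status()  # stop here if the download did not succeed

with open("data/facebook-fact-check.csv", "w", encoding="utf-8") as f:
    f.write(response.text)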

Reviewing the header and first few lines, what content did BuzzFeed likely hold the rights to? Where's the Facebook data? 

Well, there are actually a few pieces of data coming from Facebook that we should discuss, but the most important (for tracking and reproducibility) are the ids: `'account_id'` and `'post_id'`. Together, these identify the account owners of the posts and the rated posts themselves.

In [2]:
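import csv

# a context manager closes the file handle once the rows are read
with open("data/facebook-fact-check.csv", "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # first row holds the column names
    data = list(reader)
print(header)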

['account_id', 'post_id', 'Category', 'Page', 'Post URL', 'Date Published', 'Post Type', 'Rating', 'Debate', 'share_count', 'reaction_count', 'comment_count']


In [3]:
data[0:2]

[['184096565021911',
  '1035057923259100',
  'mainstream',
  'ABC News Politics',
  'https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100',
  '2016-09-19',
  'video',
  'no factual content',
  '',
  '',
  '146',
  '15'],
 ['184096565021911',
  '1035269309904628',
  'mainstream',
  'ABC News Politics',
  'https://www.facebook.com/ABCNewsPolitics/posts/1035269309904628',
  '2016-09-19',
  'link',
  'mostly true',
  '',
  '1',
  '33',
  '34']]
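
With rows in this shape, tallying ratings per outlet category takes only a few lines. The sketch below repeats the header and the two rows displayed above so that it runs without the download; swapping in the full `header` and `data` from the earlier cells is the intended use. The variable names `sample_rows` and `rating_counts` are introduced here for illustration.

```python
from collections import Counter

header = ['account_id', 'post_id', 'Category', 'Page', 'Post URL', 'Date Published',
          'Post Type', 'Rating', 'Debate', 'share_count', 'reaction_count', 'comment_count']

# the two rows displayed above, repeated so the sketch is self-contained
sample_rows = [
    ['184096565021911', '1035057923259100', 'mainstream', 'ABC News Politics',
     'https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100',
     '2016-09-19', 'video', 'no factual content', '', '', '146', '15'],
    ['184096565021911', '1035269309904628', 'mainstream', 'ABC News Politics',
     'https://www.facebook.com/ABCNewsPolitics/posts/1035269309904628',
     '2016-09-19', 'link', 'mostly true', '', '1', '33', '34'],
]

# look up column positions by name rather than hard-coding indices
category_ix = header.index('Category')
rating_ix = header.index('Rating')

# count how often each (category, rating) pair occurs
rating_counts = Counter((row[category_ix], row[rating_ix]) for row in sample_rows)
for (category, rating), count in sorted(rating_counts.items()):
    print(category, rating, count)
```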

### 9.1.2 Why publish pointers to data with ids? 
As it turns out, BuzzFeed published little information from the Facebook platform or the content of the articles themselves. By attaching their (fully owned) veracity ratings to the posts' ids, they made their analysis reproducible and essentially enriched a body of Facebook data. Sharing the ids of content objects on online platforms has become something of a standard protocol for communicating about data. If an individual who wishes to reproduce an experiment, or otherwise use its data, has access rights, then shared ids along with a functional API make it possible to programmatically reconstitute the dataset. Of course, these protocols differ from platform to platform, but this is usually how things work. But what about the share, reaction, and comment counts?
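
Before turning to that question, the id-based protocol can be sketched without touching any real API. Below, `fetch_post` is a hypothetical stand-in for whatever authorized request a platform provides (it is not a real Facebook function), and the content it returns is a placeholder. The point is only that shared ids plus owned annotations are enough to rebuild a joined record.

```python
def fetch_post(post_id):
    """Hypothetical stand-in for an authorized platform API request."""
    # a real implementation would request the post by id; this returns a placeholder
    return {"post_id": post_id, "message": f"<content of post {post_id}>"}

# shared, fully owned annotations keyed on the platform's ids (values from the rows above)
shared_ratings = [
    {"account_id": "184096565021911", "post_id": "1035057923259100",
     "Rating": "no factual content"},
    {"account_id": "184096565021911", "post_id": "1035269309904628",
     "Rating": "mostly true"},
]

def reconstitute(ratings, fetch):
    """Join each shared annotation with the content retrieved for its id."""
    rebuilt = []
    for row in ratings:
        content = fetch(row["post_id"])
        rebuilt.append({**row, **content})  # annotation fields plus fetched fields
    return rebuilt

dataset = reconstitute(shared_ratings, fetch_post)
print(dataset[0]["Rating"], dataset[0]["message"])
```

Passing the fetch function in as an argument keeps the join logic independent of any one platform's API.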

### 9.1.3 Why not publish the whole dataset?
Well, there are a few reasons for this. First and foremost, many platforms are in the business of monetizing the data they generate. So if you are privileged enough to build an exciting dataset using proprietary data, this is usually the reason you aren't allowed to redistribute it. However, there are other good reasons! 

As it turns out, BuzzFeed arguably may have done better to leave off some data, in particular the last three columns: share, reaction, and comment counts. If these were accessed directly from Facebook's API, then redistributing them almost certainly runs into the monetization concern above, but there's an additional problem: when did BuzzFeed publish the data, and does its content still accurately reflect the state of the posts? The answer to the second question is almost certainly "no". Comments, likes, reactions, etc., are generally continually modifiable on a public post and now, several years later, the recorded counts are almost certainly out of date. 

So, do the comment and other counts have no value in the BuzzFeed dataset? Well, no! As faithful representations of a particular post's social attention these counts certainly are out of date, but their placement in the BuzzFeed dataset does fill a crucial role. BuzzFeed might hope that their record will have value in reuse, but most importantly they are _documenting_ their work for reproducibility. In other words, they have posted the minimal data necessary to reproduce their analysis. Sure, the out-of-date values might be misleading if they are not represented appropriately, but that's a different matter!

### 9.1.4 Identifiable, publicly visible, and reproducible, but accessible?
Well, the posts and other data on Facebook used to be. Prior to 2018, Facebook's Graph API allowed un-reviewed apps to collect _public_ data from their Pages API (the posts were made to Facebook Public Pages). Following the platform's media trouble early that year, Facebook restricted app data access to a need-to-know protocol, i.e., apps now have to apply for access to specific streams with intended uses documented. So while it was possible in the past for independent researchers to [access this data](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17825), the only way API users can access it now is through a reviewed app authorized by each of the (9) page owners!

This example provides a non-trivial view into data sharing across tightly controlled data; let's look at how this varies and some of the other challenges out there.

## 9.2 Variations in accessibility
Ok, so we've gone over an example which concerns data from a fairly restricted source. Of course, not all data is protected like this (not even all social media data is guarded). Some sources simply have no protections, while others go so far as to encourage anyone and everyone to download their data and play around with it. Let's first look at a key example on electronic texts. 

### 9.2.1 Project Gutenberg, Google Books, and N-grams
We've already looked at the Project Gutenberg data, but it's so useful that it's worth revisiting. Basically, it's a large project whose goal is to make books that are currently in the public domain easily accessible to the public at large. But what does it mean for a book to be in the public domain? It turns out that in many countries, copyrights on intellectual property only last for a certain number of years (around 70), after which the work in question loses its protections and becomes available for any conceivable use, without restriction. In 2019, for example, books published in the U.S. in 1923 entered the public domain, making them eligible to be added to Project Gutenberg. The curators at Project Gutenberg take these public domain works and make them available in convenient filetypes (including the appropriate filetypes for e-readers). If you enjoy reading classic literature, this is the place to be online! But even further than that, the project provides a huge reserve of textual data for researchers to play with. 

While Project Gutenberg only uses books which are in the public domain, Google Books certainly does not. Perhaps at some point you've searched online for a sample of a book to examine before actually purchasing it; chances are you found an entry for the book on Google Books. For copyrighted books, only a limited number of pages or sections are available for viewing on Google Books. This makes sense, but how can they share even this? Well, showing bits and pieces of a book can often be enough to entice someone to purchase it. So how did Google compile such a huge collection of (partial) books? Through close relationships with several of the nation's largest libraries, Google has transferred huge quantities of physical texts to a digital form in an effort to preserve records and allow for analysis. Of course, Google can't be allowed to openly share these reserves of data, as that could collapse the book industry. So, as a kind of happy medium (between not breaking copyright laws and still providing data to potential researchers), Google has used the data obtained from digitizing the massive collections of books to create a database of n-grams. 

_N-grams_ are basically just snippets of text that contain N words. So, for example, a 2-gram from the previous sentence might be 'text that'. N-gram features have become quite important in Natural Language Processing and linguistics. Google analyzed its massive trove of text from the books it digitized and published all n-grams and their occurrence rates in a huge dataset. This is interesting because the data technically comes from, and really is, copyrighted material, but is presented in a way that the book industry is fine with. This is a good example where context is extremely important. You can check out a neat tool that allows you to play with the dataset here: 

- https://books.google.com/ngrams.
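
To make the n-gram idea concrete, here is a minimal sketch that extracts and counts n-grams. It assumes plain whitespace tokenization, which is a simplification and not a description of how Google processed its corpus; the function name `ngrams` is our own.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-grams (as tuples of words) in a whitespace-tokenized text."""
    words = text.split()
    # slide a window of n words across the text, one position at a time
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "N-grams are basically just snippets of text that contain N words"
bigrams = ngrams(sentence, 2)
print(("text", "that") in bigrams)

# counting occurrences is the essence of an n-gram frequency table
counts = Counter(ngrams("the cat and the dog and the bird", 2))
print(counts.most_common(2))
```

Note that a frequency table like `counts` discards word order beyond the window of N words, which is part of why it can be shared when the full text cannot.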

### 9.2.2 Wikipedia
Wikipedia is famous for its openness. It began with the novel idea of an encyclopedia that anyone could edit, and has now ballooned into the largest information source of its kind on the internet. The vast majority of content is totally open and free on Wikipedia, and they actively encourage users to obtain it. If you dig deeper into Wikipedia's policies, you might think this is incorrect, since they seem to be strictly anti-scraping, as they explicitly state "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia". While this is the case, it's only so because Wikipedia provides links to download the entirety of the website all at once! Going to 

- https://dumps.wikimedia.org/enwiki/ 

will allow you to download all of the current English-language Wikipedia pages, which amounts to about 58 GB of data. There are also ways to download only articles from a specific category, or even a list of specific articles: 

- https://en.wikipedia.org/wiki/Special:Export
- https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
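
For a single article, the export address can be assembled programmatically. This sketch assumes the single-page form of the URL, in which the page title is appended to `Special:Export`; the parameters page linked above documents the other options. The helper name `export_url` is hypothetical, and no request is actually sent here.

```python
from urllib.parse import quote

def export_url(title, base="https://en.wikipedia.org/wiki/Special:Export"):
    """Build a Special:Export URL for a single page title."""
    # MediaWiki titles use underscores in place of spaces; quote() escapes the rest
    return f"{base}/{quote(title.replace(' ', '_'))}"

print(export_url("Project Gutenberg"))
```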

Wouldn't it be nice if all data sources had the same policies as Wikipedia?!

### 9.2.3 The Yelp Dataset(s)
One of the best single datasets for budding data scientists to both test their skills and possibly show off is the Yelp dataset. Every year, Yelp hosts challenges centered around its datasets geared towards students and beginning data scientists. It selects winners based on a whole assortment of criteria. There is much prestige to be had in winning a Yelp dataset challenge (and a potential $5000 prize)! The current challenge includes tasks for photo classification, natural language processing and sentiment analysis, and graph mining. Even if you don't win any of the contests, using this dataset for work is a great idea because it's massive and extremely clean (hardly any munging or preprocessing is required; Yelp does a fantastic job of doing that for you). This is a well-known dataset, and hundreds of academic papers have already been written using it as a basis. 

#### 9.2.3.1 Anonymous, but precise and complete
An interesting point regarding the dataset is that it's extremely complete for a dataset coming from a large social media platform. Really the only aspect of the data that has been altered to fit in with the challenge is that user-identifying information has been scrubbed. Rather than provide the real user identities, Yelp has created randomized unique identities for each user and business; we'll refer to these as _anonymized_ ids. This allows researchers to still perform user-level analyses, while avoiding some ethical and privacy concerns. Let's look at a sample of this data! Note: you can find the data at: 

- https://www.yelp.com/dataset/challenge. 
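
One way to produce anonymized ids of this kind is to hand each real identity a random token and reuse that token every time the identity reappears. The sketch below is only an illustration of the idea under that assumption; it is not a description of Yelp's actual procedure, and the example identities and names (`make_pseudonymizer`, `anonymize_user`) are made up.

```python
import secrets

def make_pseudonymizer():
    """Return a function mapping each real id to a stable random surrogate id."""
    mapping = {}

    def pseudonymize(real_id):
        # assign a fresh random token the first time an id is seen, then reuse it
        if real_id not in mapping:
            mapping[real_id] = secrets.token_urlsafe(16)
        return mapping[real_id]

    return pseudonymize

anonymize_user = make_pseudonymizer()
# made-up (user, stars) pairs; the first and third share a user
reviews = [("alice@example.com", 5), ("bob@example.com", 2), ("alice@example.com", 4)]
anonymized = [(anonymize_user(user), stars) for user, stars in reviews]
print(anonymized)
```

Because the mapping is kept private and the tokens are random, user-level analysis still works (the same user always receives the same token) while the token itself reveals nothing about the original identity.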

The dataset is enormous, so let's just look at a single review pasted into the notebook (tsk):

In [35]:
review_example = {"review_id": "x7mDIiDB3jEiPGPHOmDzyw",
                  "user_id": "msQe1u7Z_XuqjGoqhB0J5g",
                  "business_id": "iCQpiavjjPzJ5_3gPD5Ebg",
                  "stars": 2,
                  "date": "2011-02-25",        
                  "text": "The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo \/ Fort Apache. The chef there can make a MUCH better NY            style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if             youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria          in a casino. I dont care what you say...",
                  "useful": 0, "funny": 0, "cool": 0}

print(review_example["text"])

The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo \/ Fort Apache. The chef there can make a MUCH better NY            style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if             youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria          in a casino. I dont care what you say...


To fully integrate the reviews with the rest of the dataset, Yelp has actually had to anonymize three identifiers: `review_id` (a unique identifier for each review), `user_id` (an id for the user who left the review), and `business_id` (an id for the business being reviewed). Aside from these, the actual content consists of the `text` (body text of the review), `stars` (the number of stars the user gave the business, from 1 to 5), and the `useful`/`funny`/`cool` ratings (other users' reactions to the review).

This file naturally links out to information about the users, contained in `user.json`:

In [36]:
user_sample = {"user_id":"lzlZwIpuSWXEnNS91wxjHw",
               "name":"Susan","review_count":1,
               "yelping_since":"2015-09-28",
               "friends":"None","useful":0,"funny":0,"cool":0,
               "fans":0,"elite":"None","average_stars":2.0,
               "compliment_hot":0,"compliment_more":0,"compliment_profile":0,"compliment_cute":0, 
               "compliment_list":0,"compliment_note":0,"compliment_plain":0,"compliment_cool":0,
               "compliment_funny":0,"compliment_writer":0,"compliment_photos":0}
print(user_sample["name"])

Susan


which shows us just how personal some of the information is for each user! Note that while this detail exists about the users, and the anonymized ids under the `user_id` key match those in the `reviews.json` file, the user's Yelp identity is technically still unknown, as their `user_id` cannot be linked to any existing id on Yelp's API! 

So, this gives us some metadata about the users themselves, with information such as how many reviews they've made in total and the average star rating they give businesses. The last piece of the anonymized puzzle is the businesses themselves:

In [37]:
business_example = {"business_id":"Apn5Q_b6Nz61Tq4XzPdf9A",
                   "name":"Minhas Micro Brewery","neighborhood":"",
                   "address":"1314 44 Avenue NE","city":"Calgary",
                   "state": "AB","postal_code":"T2E 6L6",
                   "latitude": 51.0918130155,"longitude": -114.031674872,
                   "stars": 4.0,"review_count": 24, "is_open": 1,
                   "attributes": {"BikeParking": False,"BusinessAcceptsCreditCards":True,
                                  "BusinessParking":{'garage': False, 'street': True, 
                                                     'validated': False, 'lot': False, 'valet': False},
                                  "GoodForKids": True, "HasTV": True, "NoiseLevel":"average",
                                  "OutdoorSeating": False,"RestaurantsAttire":"casual",
                                  "RestaurantsDelivery": False,"RestaurantsGoodForGroups": True,
                                  "RestaurantsPriceRange2": "2","RestaurantsReservations": True,
                                  "RestaurantsTakeOut": True}, 
                   "categories": "Tours, Breweries, Pizza, Restaurants, Food, Hotels & Travel",
                   "hours": {"Monday": "8:30-17:0", "Tuesday": "11:0-21:0", "Wednesday": "11:0-21:0",
                            "Thursday": "11:0-21:0", "Friday": "11:0-21:0", "Saturday": "11:0-21:0"}}
print(business_example["latitude"], business_example["longitude"])

51.0918130155 -114.031674872


Here, we get the information regarding the businesses that have been reviewed throughout the dataset by the anonymous reviewers. Even though the `"business_id"` field is probably anonymized, too, we get a plethora of identifying information about the business&mdash;what would you use to identify this business in the real world? Does obfuscating the true `"business_id"` help support user anonymity in any way?
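
The way the three files link together through their ids can be sketched with a few trimmed records. The records and ids below are placeholders that mimic the layouts shown above (they are not taken from the dataset), and the dict-based lookup is just one reasonable way to do the join.

```python
# trimmed, made-up records that mimic the three file layouts (ids are placeholders)
reviews = [
    {"review_id": "r1", "user_id": "u1", "business_id": "b1", "stars": 2},
    {"review_id": "r2", "user_id": "u2", "business_id": "b1", "stars": 5},
]
users = [{"user_id": "u1", "name": "Susan"}, {"user_id": "u2", "name": "Alex"}]
businesses = [{"business_id": "b1", "name": "Minhas Micro Brewery", "city": "Calgary"}]

# index the lookup tables by id so each join is a single dict access
users_by_id = {u["user_id"]: u for u in users}
businesses_by_id = {b["business_id"]: b for b in businesses}

# attach the user and business names to every review
joined = [
    {
        "stars": r["stars"],
        "user_name": users_by_id[r["user_id"]]["name"],
        "business_name": businesses_by_id[r["business_id"]]["name"],
    }
    for r in reviews
]
print(joined)
```

Notice that the join works identically whether or not the ids are anonymized; all that matters is that they are consistent across files.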

In general, the relative non-anonymization of business features makes sense as that's the entire purpose of Yelp; potentially anonymous reviewers can leave honest and frank public reviews of known businesses. Plus, the businesses by nature are most likely interested in being identified (as long as their reviews are good)! It is interesting to note the large amount of information present in these JSONs. This is basically everything you can find out about a business by checking out its Yelp page! Considering these attempts at anonymization, and how challenging it is to really separate data from identity, segues us into some thoughts on how anonymization of personal data varies.

### 9.2.4 Levels of personal data
You may have noticed that the platform from which data comes makes a huge difference in the quantity of identifying information and features present. Some platforms are extremely easy to work with because there's a plethora of data provided, while others are difficult because of how cryptic they are. It's important when planning out a project to be acutely aware of just how much personalized information will be available in the data. Let's try to group this world into three general categories for personal data availability in harvested data:

+ Totally anonymized - Although completely anonymous data is becoming rarer with the rapid growth of social media platforms, it still exists. A good example would be the family of so-called _chan_ boards (e.g., 4chan, 8chan, etc.). 4chan in particular has a [public API](https://github.com/4chan/4chan-API) that makes it incredibly easy to download any kind of data you'd like from the site, but said data is completely lacking in identifying information. In case you're unfamiliar, 4chan is an imageboard where people post pictures, and then users can leave comments on them. There are no usernames on 4chan; everyone is simply known as _anonymous_. So, while 4chan data is plentiful and exceedingly easy to acquire, there's no way to identify who said what on the platform.
+ Semi-anonymized - This is where a lot of the data we've been talking about today would fall. Identifying information in this category is voluntary and potentially unreliable (from an identification point of view), as a given user can easily assume multiple platform identities. Examples here include Yelp and Twitter, where there may be personally-identifying information, but only if the user, e.g., chooses to use their real name in their username or on their profile. For Twitter, each Tweet obtained from the API comes with the username associated with the Tweet. While this is often just a made-up name, it's still enough to identify which Tweets were made by which users. However, since any user can easily make _many_ Twitter accounts it can be hard to know who's actually who, and helpful to think of tweeters as semi-anonymous.
+ De-anonymized - This kind of data is rare and usually hard to get! It's data with names and other information registered and closely controlled against official account identifiers. It's hard to find many examples outside of in-house corporate datasets, but in a sense this is exactly what Facebook strives for. Have you ever had Facebook demand a picture of your photo id for account verification? It happens!

## 9.3 Shared Tasks
Why else share data? Well, sometimes it's important for a community to come together around solving a problem on common data. It allows folks to compare results exactly and identify best solutions. Such organized events are referred to as _shared tasks_. 

Shared tasks are somewhat competitive endeavors in which a group of participants are all given the same dataset with a specific task in mind, which should have a precise mechanism for evaluation. The task lasts for a specific amount of time, after which the participants are evaluated on how well they achieved the task, and the winners are selected. Prizes vary depending on the context of the shared task, but sometimes they can be lucrative. Apart from the prizes, winners of shared tasks can claim (rightfully so) mastery of that specific subfield of machine learning or data analysis. So, people take shared tasks quite seriously, and they are often very popular. 

Having a shared task centered around a dataset provides a lot of top-notch exposure! One avenue for shared tasks is the academic conference system. This is great for academics, but what about others who are still interested in data science and showing off their skills? Probably the best place for that is Kaggle.

### 9.3.1 Kaggle
If you haven't yet, check out Kaggle: 

- https://www.kaggle.com/

It's a platform for practicing and learning about data science. Notably, Kaggle hosts a huge number of public datasets and encourages anyone and everyone to download them and play around with them. Generally, the rights for these have been secured by the dataset posters, who range from businesses (generating data) to excited ML researchers interested in exposure and engagement. When companies post their datasets/shared tasks there are often excellent prizes and opportunities for recognition available to the winners of these contests! Also, many companies pay close attention to the results. 

#### 9.3.1.1 Exercise: Kaggle 
Set up a Kaggle account (if you don't have one) and do a bit of browsing. Another handy feature of Kaggle is that it allows users to upload their own datasets (either privately or publicly). Thus, even if you're more interested in the data acquisition and consolidation aspects of data science, Kaggle is still quite a useful potential tool. If you upload a public dataset and it proves popular, just as much recognition awaits.

### 9.3.2 Extended example: A Twitter geolocation shared task and dataset
Should we expect researchers focused on an ML shared task to write their own data access software, even if we provide the identifiers necessary to build the requisite data? Well, even if the community of task takers includes some folks with data acquisition savvy, you're probably finding by now that this work is far from trivial. So, in addition to pointers to content, an essential ingredient for data distribution is the provision of reusable data access software.

#### 9.3.2.1 High level considerations
When a script which downloads a specific set of Tweets is distributed, the Tweets themselves are not being sent out. All that is being sent is a program which will automatically download the Tweets using Twitter's REST API (provided that you have signed up for API access and are willing to submit your API keys). This may seem like a silly trick to get around copyright laws, but instead of sending out copyrighted data these scripts require users to go through the official APIs created for use with social media platforms. This is the only way that most social media platforms would like for users to obtain large quantities of data. 
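
The core of such a downloader is a loop that turns a list of ids back into content, one batch at a time. The sketch below shows only that skeleton. The `fake_fetch_batch` function is a hypothetical stand-in for an authenticated API lookup and returns placeholder records, and the batch size of 100 is an assumption that should be checked against the current API documentation.

```python
def chunked(ids, size=100):
    """Yield successive batches of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def hydrate(tweet_ids, fetch_batch, size=100):
    """Rebuild tweets from ids by calling `fetch_batch` on one batch at a time."""
    tweets = []
    for batch in chunked(tweet_ids, size):
        tweets.extend(fetch_batch(batch))
    return tweets

def fake_fetch_batch(batch):
    """Hypothetical stand-in for an authenticated API lookup (placeholder records)."""
    return [{"id_str": tweet_id, "text": "<tweet text>"} for tweet_id in batch]

ids = [str(n) for n in range(250)]
tweets = hydrate(ids, fake_fetch_batch)
print(len(tweets), [len(batch) for batch in chunked(ids)])
```

In a real script, `fetch_batch` would wrap the API client and would also need to respect rate limits, which this sketch leaves out.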

#### 9.3.2.2 An exemplar shared task
Let's look at an example of exactly this distribution model. At the 2016 W-NUT (Workshop on Noisy User Generated Text) conference, there was a shared task where the competitors were given a large corpus of Tweets along with a list of potential urban centers, and were asked to determine which urban center each Tweet originated from. The task is detailed here: 

- https://noisy-text.github.io/2016/geo-shared-task.html. 
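
A precise evaluation mechanism is what makes results on a task like this comparable across teams. As a rough sketch, we assume here that predictions are scored by exact-match city accuracy and by great-circle error distance; the task page should be consulted for the official metrics. The function names are our own, and the gold coordinates are the two tweet locations from the validation sample shown later in this chapter.

```python
from math import radians, sin, cos, asin, sqrt
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def evaluate(predicted, actual):
    """Score predictions; each item is a (city_name, latitude, longitude) tuple."""
    correct = sum(p[0] == a[0] for p, a in zip(predicted, actual))
    errors = [haversine_km(p[1], p[2], a[1], a[2]) for p, a in zip(predicted, actual)]
    return {"accuracy": correct / len(actual), "median_error_km": median(errors)}

actual = [("mansfield-engj9-gb", 53.130483, -1.141419),
          ("omaha-ne055-us", 41.782947, -95.287009)]
# a made-up system output: the first guess is right, the second is not
predicted = [("mansfield-engj9-gb", 53.130483, -1.141419),
             ("mansfield-engj9-gb", 53.130483, -1.141419)]
scores = evaluate(predicted, actual)
print(scores)
```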

#### 9.3.2.3 Getting the data
What's important to us right now is how the training data was distributed. Upon inspection, you'll find the following link to a Google Drive which contains the training data downloader script: 

- https://drive.google.com/drive/folders/0B8bfAiuVjZ1Edk5hYkhnbEF3SDg. 

The files we'll focus on here are `README`, `validation.tweet.json.gz`, `cred.txt`, and `tweet_downloader.py`.

#### 9.3.2.4 README: understanding the shared task materials
Since the file format chosen was a Google doc, i.e., Microsoft Word (tsk), its contents are converted to markdown (yay) in the below cell. Let's review some highlights:
- From the first line: the data files won't contain "raw" content, i.e., text or metadata besides the tweet ids.
- Samples are provided for how the shared task data should look once downloaded.
- Reviewing the `train.user.json` sample, the target data (once tweets are downloaded) joins the tweets to the known city and tweet latitude and longitude values. __Important__: over 12 million tweets are covered by this file, which should make us second-guess passing this data around on GitHub. Checking the Google Drive folder shows that this file is over 600 MB compressed! That's over 6 times the file size limit on GitHub, and it'll be much bigger decompressed!
- To handle API access control, users are required to place tokens (credentials) in the `cred.txt` file. We'll see exactly how this works when reviewing `tweet_downloader.py`.
- The "test" data files contain "hashed" user and tweet ids. These allow individuals to work with the data according to the metadata available from Twitter without letting the user know how to get the original data from Twitter. __Extremely important__: this means the test data are distributed directly; the only reason this is allowed is because Twitter's policy allows sharing datasets informally (directly) so long as they contain fewer than 50k tweets. 

> - Due to Twitter term restrictions, we are not going to share the raw data but the tweet ids for this research purpose shared task.
> - The downloader script is to download tweet JSON data using the tweet ids in the data folder. 
> - You are also required to register a Twitter dev account to get your credentials and put them as specified in cred.txt. (https://apps.twitter.com/)
> - The folder also contains train/dev/test data for WNUT 2016 Geotagging Shared Task
>
> File: train.user.json <br>
> Description: Training data for user level tasks <br>
> Size: 1M users, 12,827,165 tweets <br>
> Format: Each line is a JSON dict which consists of tweet_id, user_id, city_name, city_latitude, city_longitude, tweet_latitude, tweet_longitude, tweet_text <br>
> Example:  <br>
> 
> ```
> {
>   "tweet_id": 12345678,
>   "user_id": 12345,
>   "city_name": "melbourne-07-au",
>   "city_latitude": -35.123,
>   "city_longitude": 108.123,
>   "tweet_latitude": -35.124,
>   "tweet_longitude": 108.124,
>   "tweet_text": "This is a demo text"
> }
> ```
> <br>
> File: validation.user.json <br>
> Description: validation data for user level tasks <br>
> Size: 10K users, 128,524 tweets <br>
> Format: same as train.user.json <br>
> Example: same as train.user.json <br>
>  <br>
> File: test.user.json <br>
> Description: Test data for user level tasks <br>
> Size: 10K users, 99,732 tweets <br>
> Format: Each line is a JSON dict which consists of hashed_tweet_id, hashed_user_id, tweet_text <br>
> Example: <br>
> 
> ```
> {
>   "hashed_tweet_id": 25D55AD283AA400AF464C76D713C07AD,
>   "hashed_user_id": 827CCB0EEA8A706C4C34A16891F84E7B,
>   "tweet_text": "This is a demo text"
> }
> ```
> <br>
> File: train.tweet.json <br>
> Description: Training data for tweet level tasks <br>
> Size: same as train.user.json <br>
> Format: same as train.user.json <br>
> Example: same as train.user.json <br>
> <br>
> File: validation.tweet.json <br>
> Description: validation data for tweet level tasks <br>
> Size: 128,435 tweets which are different from validation.user.json <br>
> Format: same as validation.user.json except the user related fields are omitted. <br>
> Example: same as validation.user.json except the user related fields are omitted. <br>
> <br>
> File: test.tweet.json <br>
> Description: Test data for tweet level tasks <br>
> Size: 10K tweets which are different from test.user.json <br>
> Format: Each line is a JSON dict which consists of hashed_tweet_id, hashed_user_id, tweet_text <br>
> Example: <br>
> 
> ```
> {
>  "hashed_tweet_id": 25D55AD283AA400AF464C76D713C07AD,
>  "hashed_user_id": 827CCB0EEA8A706C4C34A16891F84E7B,
>  "tweet_text": "This is a demo text"
> }
> ```
> 
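
The README doesn't say how the "hashed" ids were produced, but its example values are 32 hexadecimal characters and appear consistent with an uppercase MD5 digest of the example ids (`12345678` and `12345`) given for `train.user.json`. Treat that as an inference from the examples, not as documentation of the organizers' method. Under that assumption, the hashing step can be sketched as follows (the function name `hash_id` is our own):

```python
import hashlib

def hash_id(raw_id):
    """Return an uppercase MD5 hex digest of an id (an assumed scheme, see above)."""
    return hashlib.md5(str(raw_id).encode("utf-8")).hexdigest().upper()

# the example tweet and user ids from the README's train.user.json sample
print(hash_id(12345678))
print(hash_id(12345))
```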

#### 9.3.2.5 Understanding the data
While `README` provides some high-level understanding, there's nothing like some exploration using Python to really dig into what we've got. Note: to actually use the data we first have to decompress (unzip) it. This can be done using `gunzip filename.gz` from the command line, like:
- `gunzip validation.tweet.json.gz`
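
If you'd rather not decompress to disk, Python's standard `gzip` module can read the compressed file directly. The sketch below writes a tiny stand-in file (two made-up records in a temporary directory) so that it runs on its own; pointing `read_json_lines_gz`, a helper name of our own, at `validation.tweet.json.gz` is the intended use.

```python
import gzip
import json
import os
import tempfile

def read_json_lines_gz(path):
    """Yield one parsed record per line from a gzip-compressed JSON-lines file."""
    # "rt" opens the compressed stream in text mode, so lines arrive as str
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip any blank lines
                yield json.loads(line)

# write a small stand-in file with two made-up records
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write('{"tweet_id": "1"}\n{"tweet_id": "2"}\n')

records = list(read_json_lines_gz(demo_path))
print(records)
```

Because the function is a generator, records are parsed one line at a time and the whole file never has to sit in memory.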

Note: each of the supplied datasets is actually in line-by-line JSON format. This allows the data to be downloaded incrementally (or from separate Twitter apps/accounts) and processed without having to load the entire dataset into memory. Since the validation file is small enough, we'll just load it all into memory using a loop. 

Discussion: The task developers (in the `tweet_downloader.py` script) rely on the `tweet_id` field in their task data to access the referenced tweets. Once downloaded, we'll have to join the tweets with the task data.

In [4]:
import json

validation = []
with open("data/validation.tweet.json") as f:
    for line in f:
        validation.append(json.loads(line))
validation[:2]

[{'tweet_id': '360011809459146752',
  'tweet_city': 'mansfield-engj9-gb',
  'tweet_latitude': '53.130483',
  'tweet_longitude': '-1.141419'},
 {'tweet_id': '547240490022617088',
  'tweet_city': 'omaha-ne055-us',
  'tweet_latitude': '41.782947',
  'tweet_longitude': '-95.287009'}]
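
Note that the coordinates arrive as strings. As a small sketch of a first exploration step, the following converts them to floats and tallies records per city, using the two records printed above as a stand-in for the full `validation` list. The helper name `parse_location` and its output keys are hypothetical.

```python
from collections import Counter

# the two records shown in the output above, used here as a stand-in sample
sample = [
    {'tweet_id': '360011809459146752', 'tweet_city': 'mansfield-engj9-gb',
     'tweet_latitude': '53.130483', 'tweet_longitude': '-1.141419'},
    {'tweet_id': '547240490022617088', 'tweet_city': 'omaha-ne055-us',
     'tweet_latitude': '41.782947', 'tweet_longitude': '-95.287009'},
]

def parse_location(record):
    """Return the record's id and city with numeric coordinates."""
    return {
        "tweet_id": record["tweet_id"],
        "city": record["tweet_city"],
        "lat": float(record["tweet_latitude"]),
        "lon": float(record["tweet_longitude"]),
    }

parsed = [parse_location(r) for r in sample]
# count how many records fall in each city
city_counts = Counter(p["city"] for p in parsed)
```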

#### 9.3.2.6 Understanding the script
For review, the script is reproduced in the cell below under the `%%writefile tweet_downloader.py` IPython magic command. For the uninitiated, this magic command simply writes the contents of the cell to the specified file (the path is taken relative to the notebook).

Note: it may seem a bit confusing (because of the parsing they do), but it's really just taking a list of tweet IDs and downloading the tweets using the REST API, much like in __Chapter 3__. We'll take some aspects from their approach below and simplify their code into a more user-friendly Python version written as a module.

In [5]:
%%writefile tweet_downloader.py
#!/usr/bin/env python
"""
Description: 
    Download tweets using tweet ID, downloaded from https://noisy-text.github.io/files/tweet_downloader.py

Usage example (in linux):
    clear;python tweet_downloader.py --credentials ../data/credentials.txt --inputfile ../data/input.tids --outputtype IdTweet

Inputfile contains training/validation data whose first column is tweetID

credentials.txt stores the Twitter API keys and secrets in the following order:
consumer_key
consumer_secret
access_token
access_token_secret

Required Python library: 
    ujson, twython and twokenize (https://github.com/myleott/ark-twokenize-py)

An example output with whitespace tokenised text and tweet id in JSON format
    {"text":"@SupGirl whoaaaaa .... childhood flashback .","id_str":"470363741880463362"}
"""

try:
    import ujson as json
except ImportError:
    import json
import sys
import time
import argparse
from twython import Twython, TwythonError
from collections import OrderedDict

MAX_LOOKUP_NUMBER = 100
SLEEP_TIME = 15 + 1
twitter = None
arguments = None
tid_list = None

def init():
    global twitter, arguments, tid_list

    parser = argparse.ArgumentParser(description = "A simple tweet downloader for WNUT-NORM shared task.")
    parser.add_argument('--credentials', type=str, required = True, help = '''\
        Credential file which consists of four lines in the following order:
        consumer_key
        consumer_secret
        access_token
        access_token_secret
        ''')
    parser.add_argument('--inputfile', type=str, required = True, help = 'Input file one tweet id per line')
    parser.add_argument('--outputtype', type=str, default='IdTweet', choices = ['json', 'IdTweet'], help = '''\
        Output data type:
        (1) json: raw JSON data from Twitter API;
        (2) IdTweet: tweet ID and raw tweet messages (default)
        ''')
    arguments = parser.parse_args()

    credentials = []
    with open(arguments.credentials) as fr:
        for l in fr:
            credentials.append(l.strip())
    twitter = Twython(credentials[0], credentials[1], credentials[2], credentials[3])

    tid_list = []
    with open(arguments.inputfile) as fr:
        for l in fr:
            jobj = json.loads(l.strip())
            tid = jobj['tweet_id']
            tid_list.append(tid)

def download():
    global twitter, arguments, tid_list
    with open(arguments.inputfile + "." + arguments.outputtype, "w") as fw:
        tid_number = len(tid_list)
        max_round = tid_number // MAX_LOOKUP_NUMBER + 1
        for i in range(max_round):
            tids = tid_list[i * MAX_LOOKUP_NUMBER : (i + 1) * MAX_LOOKUP_NUMBER]
            time.sleep(SLEEP_TIME)
            jobjs = []
            jobjs = twitter.lookup_status(id = tids)
            for jobj in jobjs:
                if arguments.outputtype == "json":
                    fw.write(json.dumps(jobj))
                else:
                    tweet = jobj["text"]
                    tid = jobj["id_str"]
                    dic_tweet = (('tweet_id', tid), ('text', tweet))
                    fw.write(json.dumps(OrderedDict(dic_tweet)))
                fw.write("\n")

def main():
    init()
    download()

if __name__ == "__main__":
    main()


Overwriting tweet_downloader.py


#### 9.3.2.7 Rewriting the script
So, it's finally come back to haunt us. The `tweet_downloader.py` script was built using Python 2. To explore, let's spruce up the code and convert it to a Python _class_, i.e., a self-contained object. With this we'll be able to interact with the downloader in our notebook!

Cutting away the original tweet downloader's special modules (`ujson` for faster JSON handling, `OrderedDict` to keep the output fields in a fixed order for serialization, and `argparse` for command line argument parsing), we can see that there are only a few things we need for Pythonic interaction:

In [6]:
import json, os, re, time
from twython import Twython

You'll notice the scattering of `global` variable declarations in the original script. Rather than use these, Python's classes have the `self` argument, which manages class attributes, i.e., objects that the class should have access to across its methods.

Architecturally, the shared task's downloader maps onto two methods, which we'll call:
- `__init__`: The shared task's `init()` is ripe for conversion into Python's special class initialization method. In brief, a class's `__init__()` method runs at the time the class is instantiated into an object. Importantly, defining this method provides the opportunity to specify how arguments are input to the class.
- `download`: The main action of the shared task code converts into a `.download()` method that 'activates' the downloader to access batches of tweets. Note that this method implements our dance with Twitter's [rate limit](https://developer.twitter.com/en/docs/basics/rate-limiting.html), which should push the code to target roughly 1 request per minute (no more than 15 per 15-minute window will be allowed).

We've also made a few improvements to how the code functions:
- The original slept for a fixed time between requests, while the new version subtracts the time the downloading process itself takes between batches (API calls).
- The original downloaded all of the tweets specified in a target file. Since it's possible to run the download process in parallel if the user has multiple apps/credentials, we've modified the code so that a downloader only requests a specified subset (the `first` and `last` index arguments) from Twitter, i.e., each downloader can be targeted at a portion of the dataset. Note that since the original stored the data using Python's write mode (`'w'`), we've had to switch to append mode (`'a'`) to enable multiple downloaders.
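
The first improvement in the list above, sleeping only for the part of the interval that has not already been spent downloading, can be sketched as a small pure function. The name `remaining_sleep` and its parameters are hypothetical, chosen for illustration; clamping at zero is an assumption made so that a slow download never produces a negative sleep time.

```python
import time

def remaining_sleep(last_call, interval, now=None):
    """Seconds still to wait so that calls are at least `interval` apart.

    last_call: timestamp of the previous API call, or None if there was none
    interval:  minimum number of seconds between consecutive calls
    now:       current timestamp (defaults to time.time()), injectable for testing
    """
    if now is None:
        now = time.time()
    if last_call is None:
        return 0.0  # nothing to wait for before the very first call
    # never return a negative duration: time.sleep() rejects negative values
    return max(0.0, interval - (now - last_call))

# example: the last call was 10 seconds ago and we want 60 seconds between calls
wait = remaining_sleep(last_call=100.0, interval=60, now=110.0)
```

Passing `now` explicitly keeps the function easy to check without actually waiting.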

Note: Unlike `Twython`'s `.show_status(id)` method, the downloader has us using `.lookup_status(id_list)`, which has the benefit of accepting a list of tweet ids to download; each request can carry at most 100 tweet ids, as per Twitter's limits.
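
Since each lookup accepts at most 100 ids, the id list has to be cut into batches before any requests are made. A minimal sketch of that slicing, with a hypothetical helper name `batches` and made-up integer ids standing in for tweet ids:

```python
def batches(ids, size=100):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# 250 made-up ids split into batches of 100, 100 and 50
demo_ids = list(range(250))
batch_sizes = [len(b) for b in batches(demo_ids, size=100)]
```

Stepping through the list by `size` never produces an empty batch, even when the number of ids is an exact multiple of the batch size.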

In [30]:
class tweet_downloader(object):

    ## initialize the tweet downloader
    def __init__(self, inputfile, credfile = "cred.txt", first = 0, last = float('Inf'),
                 MAX_LOOKUP_NUMBER = 100, SLEEP_TIME = 60):
        
        ## accept user input
        self.MAX_LOOKUP_NUMBER = MAX_LOOKUP_NUMBER
        self.SLEEP_TIME = SLEEP_TIME
        self.inputfile = inputfile
        self.credfile = credfile
        self.first = first
        self.last = last

        ## load the specified Twitter app credentials (one value per line)
        self.credentials = []
        with open(os.path.join("cred", self.credfile)) as f:
            for line in f:
                self.credentials.append(line.strip())
        
        ## initialize the api client
        self.twitter = Twython(self.credentials[0], self.credentials[1], 
                               self.credentials[2], self.credentials[3])
        
        ## load a list of tweet IDs to download
        self.tid_list = []
        with open(self.inputfile) as f:
            for i, line in enumerate(f):
                ## stop reading once we are past the specified range
                if i > self.last: break
                ## skip lines before the range; first and last are both inclusive
                if i < self.first: continue
                data = json.loads(line.strip())
                self.tid_list.append(data['tweet_id'])

    ## download the tweets on the list 
    def download(self):
        ## the total number of tweets to download
        tid_number = len(self.tid_list)
        ## compute the number of batches (ceiling division avoids an empty final batch)
        max_round = (tid_number + self.MAX_LOOKUP_NUMBER - 1) // self.MAX_LOOKUP_NUMBER
        ## time of the previous API call (None until the first call is made)
        last_call = None
        ## name of the output file; only a trailing ".json" is replaced
        outputfile = re.sub(r"\.json$", ".downloaded.json", self.inputfile)
        ## loop over batches
        for i in range(max_round):
            ## slice out the tweet ids for this batch
            tids = self.tid_list[i * self.MAX_LOOKUP_NUMBER : 
                                 (i + 1) * self.MAX_LOOKUP_NUMBER]            
            ## only sleep if we have already made previous calls
            if last_call is not None:
                ## compute remaining time (plus one second of padding)
                remaining_sleep = self.SLEEP_TIME + 1 - (time.time() - last_call)
                ## a slow download may leave nothing to wait for
                if remaining_sleep > 0:
                    time.sleep(remaining_sleep)
            ## download the tweets, recording when the call was made
            last_call = time.time()
            data = self.twitter.lookup_status(id = tids)
            ## open the output file in append mode (in case of multiple downloaders)
            with open(outputfile, "a") as f:
                ## store results
                for d in data:
                    f.write(json.dumps(d)+"\n")

Now it's straightforward for our code to access the first tweets in the `validation.tweet.json` file (lines 0 through 10, since `last` is inclusive): 

In [31]:
## initialize downloader
downloader = tweet_downloader(inputfile = "data/validation.tweet.json",  
                              last = 10)
## run downloader
downloader.download()

Now that we've downloaded (some of) the data, we may want to use it. Is it ready? Well, since the geospatial data for the shared task is still in `validation.tweet.json`, and the newly downloaded full-content tweets are in a separate file, `validation.tweet.downloaded.json`, we'll have to join the two portions of data by tweet id. Building off of the loading code from __Section 9.3.2.5__, let's build a function that integrates the shared task's geospatial data into any of the tweets we've managed to download using our class.

In [32]:
def integrate_tweets(inputfile):
    ## initialize the integrated data and open downloaded tweets
    tweets = {}
    with open(re.sub(r"\.json$", ".downloaded.json", inputfile)) as f:
        for line in f: ## each line is a tweet
            tweet = json.loads(line)
            ## set a default value for missing task data
            tweet["task_data"] = None
            ## keying each tweet by its id lets us look it up directly below
            tweets[tweet["id_str"]] = tweet
    ## open the original (geospatial data) shared task file
    with open(inputfile) as f:
        for line in f: ## each line corresponds to a tweet
            task_data = json.loads(line)
            ## check if this task data's id matches a downloaded tweet
            if task_data["tweet_id"] in tweets:
                ## store the task data over the default
                tweets[task_data["tweet_id"]]["task_data"] = task_data
    return tweets
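
Because the join starts from the downloaded tweets, task records whose tweets could not be retrieved (e.g., tweets that are no longer available) simply do not appear in the result. A minimal sketch for measuring how much of the task data was actually recovered; the helper name `coverage` is hypothetical and the ids below are made up for illustration.

```python
def coverage(task_ids, downloaded_ids):
    """Return (fraction of task ids that were downloaded, set of missing ids)."""
    task_ids = set(task_ids)
    downloaded_ids = set(downloaded_ids)
    # ids listed in the task data but absent from the downloaded tweets
    missing = task_ids - downloaded_ids
    if not task_ids:
        return 0.0, missing
    return 1 - len(missing) / len(task_ids), missing

# made-up example: four task ids, two of which were retrieved
fraction, missing = coverage(["1", "2", "3", "4"], ["1", "3"])
```

With the real files, `task_ids` would come from the `tweet_id` fields of the task data and `downloaded_ids` from the `id_str` fields of the downloaded tweets.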

In [33]:
tweets = integrate_tweets(inputfile = "data/validation.tweet.json")
list(tweets.values())[0]

{'created_at': 'Tue Dec 31 16:45:56 +0000 2013',
 'id': 418060465164406785,
 'id_str': '418060465164406785',
 'text': 'Happy new year everybody :)))) http://t.co/jN7eokdWiX',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 418060465055354880,
    'id_str': '418060465055354880',
    'indices': [31, 53],
    'media_url': 'http://pbs.twimg.com/media/Bc0_xypCUAAwpKC.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/Bc0_xypCUAAwpKC.jpg',
    'url': 'http://t.co/jN7eokdWiX',
    'display_url': 'pic.twitter.com/jN7eokdWiX',
    'expanded_url': 'https://twitter.com/intanpitha/status/418060465164406785/photo/1',
    'type': 'photo',
    'sizes': {'large': {'w': 333, 'h': 333, 'resize': 'fit'},
     'medium': {'w': 333, 'h': 333, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'small': {'w': 333, 'h': 333, 'resize': 'fit'}}}]},
 'extended_entities': {'media': [{'id': 41806046505535488