# Overview

Given a `TweetFavorited_SummaryWithDecodedUrls` dataset, retrieve the raw tweet object (the full JSON record returned by the Twitter API) for each tweet it lists.

# Dependencies

## Debian jessie

In [24]:
%%bash
cat /etc/*-release

PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"


## sudo

Verify:

In [1]:
%%bash
sudo date -u

Mon Jun 19 17:11:33 UTC 2017


## jq
Install via binary download:

In [5]:
%%bash
wget --quiet -O jq https://github.com/stedolan/jq/releases/download/jq-1.5/jq-linux64
chmod +x ./jq
sudo mv jq /usr/bin

Verify:

In [6]:
%%bash
command -v jq || { echo >&2 "jq not found"; exit 1; }
jq --version

/usr/bin/jq
jq-1.5


## csvjson
`csvjson` is provided by csvkit.

Install via pip:

In [3]:
%%bash
pip install --quiet csvkit

Verify:

In [4]:
%%bash
command -v csvjson || { echo >&2 "csvjson not found"; exit 1; }
bash -c "csvjson --version 2>&1"

/opt/conda/envs/python2/bin/csvjson
csvjson 1.0.2


## twarc
Install via pip:

In [7]:
%%bash
pip install --quiet twarc

Verify:

In [8]:
%%bash
command -v twarc || { echo >&2 "twarc not found"; exit 1; }
twarc version

/opt/conda/envs/python2/bin/twarc
twarc v1.1.3
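
The per-tool checks in this section can also be folded into one helper that reports every missing command and returns a non-zero status. This is only a sketch; `require_cmds` is a hypothetical name introduced here, not something shipped with any of the tools:

```bash
# Hypothetical helper: report every missing command, then fail once at the end.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd not found" >&2
      missing=1
    fi
  done
  # 0 when every command was found, 1 otherwise.
  return "$missing"
}
```

In a notebook cell it could be called as `require_cmds jq csvjson twarc || exit 1`, so that a missing tool stops the cell.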


## Twitter API

A Twitter API application, registered at https://apps.twitter.com, configured with:
* Access Level: Read-only

# Input

## Datasets

### TweetFavorited_SummaryWithDecodedUrls

In [1]:
%%bash
ls -alph output/TweetFavorited_SummaryWithDecodedUrls.csv

-rw-r--r-- 1 jovyan users 603K Jun 19 20:32 output/TweetFavorited_SummaryWithDecodedUrls.csv


## Configuration

### `twarc` configuration file
`twarc` needs a configuration file containing Twitter API v1.1 OAuth credentials:
```bash
config/.twarc
```

It can be generated outside the notebook via twarc's interactive prompt:
```bash
twarc configure
```
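
Because the configuration file holds OAuth credentials, it may be worth restricting it to its owner before use. A minimal sketch, assuming the file lives at `config/.twarc` as above; `protect_config` is a hypothetical helper name and not part of twarc:

```bash
# Hypothetical helper: make a credentials file readable and writable by its owner only.
protect_config() {
  config_file=$1
  if [ ! -f "$config_file" ]; then
    echo "$config_file not found; run 'twarc configure' first" >&2
    return 1
  fi
  chmod 600 "$config_file"
}
```

For example, `protect_config config/.twarc` could be run once after the file has been generated.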

# Execute

## 1. Convert from CSV to JSON

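Convert the input from CSV to JSON so that later steps do not have to handle multi-line strings in the CSV. `--stream` writes one JSON object per line, and `--no-inference` disables type inference, so every value, including the tweet ID, stays a string. That matters for the IDs: they are larger than 2^53, and jq 1.5 would round them if they were emitted as JSON numbers.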

In [33]:
%%bash
csvjson --no-inference --stream output/TweetFavorited_SummaryWithDecodedUrls.csv > output/TweetFavorited_SummaryWithDecodedUrls.json

Verify:

In [35]:
%%bash
head output/TweetFavorited_SummaryWithDecodedUrls.json

{"ID": "876449181816238080", "Posted at": "2017-06-18 14:39:03 +0000", "Screen name": "InfoQ", "Text": "Management 3.0 is about understanding you need to change the environment. Manage the system, not the people.… https://twitter.com/i/web/status/876449181816238080"}
{"ID": "876446771525820416", "Posted at": "2017-06-18 14:29:28 +0000", "Screen name": "adymitruk", "Text": "Why I miss @hintjens so much. He was way ahead of most people around him. So many good things I learned about #0mq… https://twitter.com/i/web/status/876446771525820416"}
{"ID": "876439360001814528", "Posted at": "2017-06-18 14:00:01 +0000", "Screen name": "NinjaEconomics", "Text": "How Much Are People Making from the Sharing Economy? https://priceonomics.com/how-much-are-people-making-from-the-sharing/ https://t.co/te1VuUZveD"}
{"ID": "876376163257602048", "Posted at": "2017-06-18 09:48:54 +0000", "Screen name": "codepo8", "Text": "ID3 - a development environment in the browser for d3.js http://d3-id3.com/"}
{"ID": "

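Multi-line strings are also why `wc -l` cannot count records in the CSV directly: a quoted field may span several physical lines. As a rough cross-check of the conversion, the sketch below counts logical CSV records by tracking quote parity. It assumes RFC 4180 style quoting (quotes inside a field are doubled), and `csv_record_count` is a hypothetical helper name:

```bash
# Hypothetical helper: count logical CSV records, header included.
# A physical line ends a record only when the running number of
# double quotes seen so far is even (doubled quotes keep parity intact).
csv_record_count() {
  awk '{ n += gsub(/"/, "&") } n % 2 == 0 { records++ } END { print records + 0 }' "$1"
}
```

The result includes the header row, so `csv_record_count output/TweetFavorited_SummaryWithDecodedUrls.csv` minus one should equal `wc -l < output/TweetFavorited_SummaryWithDecodedUrls.json` if the conversion kept every record.
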
## 2. Get tweet IDs

In [39]:
%%bash
jq --raw-output '.ID' output/TweetFavorited_SummaryWithDecodedUrls.json > output/TweetFavorited_TweetIds.csv

Verify:

In [40]:
%%bash
wc -l output/TweetFavorited_TweetIds.csv

3200 output/TweetFavorited_TweetIds.csv


In [41]:
%%bash
head output/TweetFavorited_TweetIds.csv

876449181816238080
876446771525820416
876439360001814528
876376163257602048
876370053154959360
876094836683579393
875830301796188160
875827621971369985
875809693913948165
875805024311537664
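
Beyond eyeballing the first lines, the ID file can be checked for malformed and repeated entries. With `jq --raw-output '.ID'`, a record without an `ID` key would show up as the literal line `null`, which this sketch would flag. `check_ids` is a hypothetical helper name:

```bash
# Hypothetical helper: report malformed and duplicated tweet IDs in a file
# that is expected to hold one decimal ID per line.
check_ids() {
  ids_file=$1
  # Lines that are not purely decimal digits (blank lines and "null" included).
  bad=$(grep -c -v -E '^[0-9]+$' "$ids_file" || true)
  # Distinct IDs that occur more than once.
  dups=$(sort "$ids_file" | uniq -d | wc -l | tr -d ' ')
  echo "malformed=$bad duplicated=$dups"
  # Succeed only when both counts are zero.
  [ "$bad" -eq 0 ] && [ "$dups" -eq 0 ]
}
```

It could be run as `check_ids output/TweetFavorited_TweetIds.csv` before hydrating.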


## 3. Get raw tweet objects

In [46]:
%%bash
twarc hydrate output/TweetFavorited_TweetIds.csv --config config/.twarc --log output/twarc.log > output/TweetFavorited_RawTweetObject.json

Verify:

In [47]:
%%bash
wc -l output/TweetFavorited_RawTweetObject.json

3200 output/TweetFavorited_RawTweetObject.json


In [48]:
%%bash
head output/TweetFavorited_RawTweetObject.json

{"contributors": null, "truncated": false, "text": "Akka Cluster on K8 https://t.co/yHO6WBhoxW using Statefulset for seed nodes, and Deployment for worker nodes\nkubectl / akka-cluster overlap?", "is_quote_status": false, "in_reply_to_status_id": null, "id": 868131123972460544, "favorite_count": 1, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [], "urls": [{"url": "https://t.co/yHO6WBhoxW", "indices": [19, 42], "expanded_url": "https://github.com/saturnism/akka-kubernetes-example", "display_url": "github.com/saturnism/akka\u2026"}]}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 1, "id_str": "868131123972460544", "favorited": true, "user": {"follow_request_sent": false, "has_extended_profile": false, "profile_use_background_image": true, "default_profile_image": false, "id": 18364654, "profile_background_image_url_h

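Here the line count matches the number of requested IDs. In general, hydration can return fewer tweets than requested, for example when a tweet has since been deleted or protected. The sketch below lists requested IDs that are absent from a list of hydrated IDs; `missing_ids` is a hypothetical helper name, and both inputs are assumed to be plain files with one ID per line:

```bash
# Hypothetical helper: print IDs found in the requested list
# but not in the hydrated list.
missing_ids() {
  requested=$1
  hydrated=$2
  req_sorted=$(mktemp)
  hyd_sorted=$(mktemp)
  sort -u "$requested" > "$req_sorted"
  sort -u "$hydrated" > "$hyd_sorted"
  # comm -23 keeps only lines unique to the first (requested) file.
  comm -23 "$req_sorted" "$hyd_sorted"
  rm -f "$req_sorted" "$hyd_sorted"
}
```

The hydrated list could be produced with `jq --raw-output '.id_str' output/TweetFavorited_RawTweetObject.json > output/hydrated_ids.txt` (the output filename is an assumption), then compared with `missing_ids output/TweetFavorited_TweetIds.csv output/hydrated_ids.txt`. Using `id_str` rather than `id` avoids the rounding that jq 1.5 applies to large JSON numbers.
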
# Output

`TweetFavorited_RawTweetObject.json` should now exist in `output/`:

In [49]:
%%bash
ls -alph output/TweetFavorited_RawTweetObject.json

-rw-r--r-- 1 jovyan users 9.9M Jun 21 12:36 output/TweetFavorited_RawTweetObject.json
