# Introduction

**JavaScript Object Notation** or **JSON** is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). It is a very common data format used for asynchronous browser–server communication, including as a replacement for XML in some AJAX-style systems.

JSON is a language-independent data format. It was derived from JavaScript, but as of 2017 many programming languages include code to generate and parse JSON-format data. The official Internet media type for JSON is `application/json`. JSON filenames use the extension `.json`.
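
As a minimal sketch of what "generate and parse" means in practice, the snippet below round-trips a small record through Python's standard `json` module. The record is made up for illustration; only `json.dumps` and `json.loads` are library calls.

```python
import json

# A made-up record (illustrative only): attribute-value pairs, an array, and a null.
record = {"id_str": "123", "retweeted": False, "hashtags": ["a", "b"], "geo": None}

# Serialize the Python object to JSON text; sort_keys makes the key order deterministic.
text = json.dumps(record, sort_keys=True)

# Parse the JSON text back into Python objects.
parsed = json.loads(text)

print(text)              # {"geo": null, "hashtags": ["a", "b"], "id_str": "123", "retweeted": false}
print(parsed == record)  # True
```

Note how Python's `False` and `None` are written as JSON's `false` and `null`, and are mapped back on parsing.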

## Tools
We focus on two tools:
1. Python package `json` https://docs.python.org/2/library/json.html
2. UNIX command line tool `jq` https://stedolan.github.io/jq/

## References
- Tweet data dictionaries https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html
- Parsing JSON with jq http://www.compciv.org/recipes/cli/jq-for-parsing-json/

# Data Set
We use a small subset of a Twitter dataset. The full dataset contains over 170,000,000 tweets collected during the 3 months leading up to the 2012 presidential elections. The original data files are posted at https://old.datahub.io/dataset/twitter-2012-presidential-election



In [3]:
%%sh
head -1 twitter200.json | jq '.'

{
  "in_reply_to_user_id_str": null,
  "id_str": "246058156919238656",
  "text": "Anderson Silva vs Stephen bonner is like sending a man to murders row after he saved the president 5 years earlier",
  "geo": null,
  "retweeted": false,
  "in_reply_to_status_id": null,
  "created_at": "Thu Sep 13 01:30:10 +0000 2012",
  "source": "web",
  "entities": {
    "urls": [],
    "user_mentions": [],
    "hashtags": []
  },
  "contributors": null,
  "place": null,
  "favorited": false,
  "coordinates": null,
  "retweet_count": 0,
  "truncated": false,
  "in_reply_to_status_id_str": null,
  "user": {
    "id_str": "29324689",
    "is_translator": false,
    "verified": false,
    "favourites_count": 9,
    "geo_enabled": false,
    "profile_use_background_image": false,
    "profile_image_url": "http://a0.twimg.com/profile_images/2581517833/1y7d6c4gxsublrfhmium_normal.jpeg",
    "profile_text_color": "666666",
    "profile_background_image_url": "http://a0.twimg.com/profile_background_images/595

# Python


In [5]:
import json
import os, sys

In [9]:
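MAX_LINES = 10
with open('twitter200.json') as f:  # 'f' avoids shadowing the standard module 'io'
    line = f.readline()
    while line and MAX_LINES > 0:
        # each line of the file holds one tweet as a JSON object
        row = json.loads(line.strip())
        print(row.keys())  # the parenthesized form runs on Python 2 and 3

        # end of while-body
        MAX_LINES -= 1
        line = f.readline()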
        

[u'favorited', u'in_reply_to_user_id', u'contributors', u'entities', u'text', u'created_at', u'truncated', u'retweeted', u'in_reply_to_status_id_str', u'coordinates', u'id', u'source', u'in_reply_to_status_id', u'place', u'id_str', u'in_reply_to_screen_name', u'retweet_count', u'geo', u'in_reply_to_user_id_str', u'user']
[u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'created_at', u'possibly_sensitive_editable', u'in_reply_to_status_id_str', u'place']
[u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'create

In [10]:
row

{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Thu Sep 13 01:30:11 +0000 2012',
 u'entities': {u'hashtags': [],
  u'urls': [],
  u'user_mentions': [{u'id': 34993057,
    u'id_str': u'34993057',
    u'indices': [3, 15],
    u'name': u'JustMePammy',
    u'screen_name': u'JustMePammy'},
   {u'id': 376514554,
    u'id_str': u'376514554',
    u'indices': [24, 36],
    u'name': u'@bornAmerica',
    u'screen_name': u'BornAmerica'}]},
 u'favorited': False,
 u'geo': None,
 u'id': 246058158190129153,
 u'id_str': u'246058158190129153',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'place': None,
 u'retweet_count': 2,
 u'retweeted': False,
 u'retweeted_status': {u'contributors': None,
  u'coordinates': None,
  u'created_at': u'Thu Sep 13 01:25:48 +0000 2012',
  u'entities': {u'hashtags': [],
   u'urls': [],
   u'user_mentions': [{u'id': 376514554,
     u'
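
The nested structure in the output above can be navigated with ordinary dict and list indexing. The sketch below uses a hand-copied fragment of that tweet (named `sample` here so that it does not overwrite `row`); the use of `.get()` for optional keys is a suggestion, not something the original notebook does.

```python
# A hand-copied fragment of the tweet shown above (only the fields used here).
sample = {
    "id_str": "246058158190129153",
    "retweet_count": 2,
    "entities": {
        "hashtags": [],
        "urls": [],
        "user_mentions": [
            {"id_str": "34993057", "screen_name": "JustMePammy", "indices": [3, 15]},
            {"id_str": "376514554", "screen_name": "BornAmerica", "indices": [24, 36]},
        ],
    },
}

# JSON objects become dicts and arrays become lists, so fields are reached by chained indexing.
mentions = [m["screen_name"] for m in sample["entities"]["user_mentions"]]

# .get() avoids a KeyError for keys that only some tweets carry (e.g. 'possibly_sensitive').
sensitive = sample.get("possibly_sensitive", False)

print(mentions)
print(sensitive)
```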

## Table of users and web-sites

Let's create a table of users and the web-sites they tweet about. This may be useful for building a graph of web-sites in which two sites are linked by the number of users they share. Eventually, we could also include a sentiment analysis of each tweet.

In [None]:
### Let's start with the core loop
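MAX_LINES = 10
with open('twitter200.json') as f:  # 'f' avoids shadowing the standard module 'io'
    line = f.readline()
    while line and MAX_LINES > 0:
        # each line of the file holds one tweet as a JSON object
        row = json.loads(line.strip())
        print(row.keys())  # the parenthesized form runs on Python 2 and 3

        # end of while-body
        MAX_LINES -= 1
        line = f.readline()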
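
One possible way to extend the core loop into the user/web-site table is sketched below. It is not the notebook's own solution. It assumes that each tweet's `user` object carries a `screen_name` and that each item in `entities.urls` carries an `expanded_url` (with `url` as a fallback), as described in Twitter's data dictionary. The helper `user_site_pairs` and the two sample tweets are hypothetical and exist only for illustration.

```python
import json
from collections import Counter

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def user_site_pairs(lines):
    """Yield (screen_name, host) for every URL entity in line-delimited tweet JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        # Assumed field names: user.screen_name and entities.urls[].expanded_url
        user = row.get("user", {}).get("screen_name")
        for url in row.get("entities", {}).get("urls", []):
            # Prefer the expanded URL; fall back to the shortened one.
            target = url.get("expanded_url") or url.get("url")
            if user and target:
                yield user, urlparse(target).netloc.lower()


# Two made-up tweets standing in for lines of the real file.
sample_lines = [
    json.dumps({"user": {"screen_name": "alice"},
                "entities": {"urls": [{"expanded_url": "http://example.com/a"}]}}),
    json.dumps({"user": {"screen_name": "bob"},
                "entities": {"urls": [{"expanded_url": "http://Example.com/b"},
                                      {"url": "http://example.org/"}]}}),
]

# Count how often each (user, web-site) pair occurs.
table = Counter(user_site_pairs(sample_lines))
for (user, site), n in sorted(table.items()):
    print("%s\t%s\t%d" % (user, site, n))
```

For the real data, the same function could be fed the open file directly, e.g. `Counter(user_site_pairs(open('twitter200.json')))`. The resulting counts would be the raw material for the web-site graph.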

# JQ

- https://stedolan.github.io/jq/
- Parsing JSON with jq http://www.compciv.org/recipes/cli/jq-for-parsing-json/

jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.

jq is written in portable C, and it has zero runtime dependencies. You can download a single binary, scp it to a far away machine of the same type, and expect it to work.

jq can mangle the data format that you have into the one that you want with very little effort, and the program to do so is often shorter and simpler than you’d expect.