<a id='top'></a>

# Data Extraction

 
<br>

 <center> <img src=img/data_extraction.png  width=75%> </center>  
 
#### Goal: Learn how to convert raw JSON data from a platform API into the SocialSim package input format
 
### <a href='#intro'> 1) Intro To SocialSim data format </a>
 
### <a href='#data_conversion'> 2) Converting To SocialSim data format </a>
 
### <a href='#twitter_data'> 3) Example with Twitter data </a>


<a id='intro'></a>

### SocialSim Data Format
[Jump to top](#top)

The measurements package leverages a standardized data format across multiple social platforms. This format is designed to extract key properties of user actions that answer questions related to information spread.

<img src="img/data_format.png?1" width="800"/>

The information ID is a key quantity that allows us to distill the content of social media shares into a trackable identifier. Examples include hashtags, keywords, URLs, and detected topics, but the information ID can be any derived property of the content that can be tracked.

<img src="img/info_id.png?1" width="800"/>

The parent ID and root ID are what enable us to track sharing paths through the social network.

<img src="img/root_parent.png?1" width="800"/>

The actionType varies with the interaction mechanisms that are enabled on each platform, which may range from the very simple (only a single message interaction) to the highly complex and structured (e.g. GitHub).

<img src="img/actionTypes.png?1" width="700"/>

The input file contains one JSON object per line, each with these relevant fields:

```json
{"actionType": "post", "informationID": "CVE-2015-6620", "nodeID": "t3_c9uWs8FVfbko2qaPNgpUFA", "nodeTime": "2015-12-10T06:41:02Z", "nodeUserID": "vHwXTX4FohkDUqQMdjb3zg", "parentID": "t3_c9uWs8FVfbko2qaPNgpUFA", "platform": "reddit", "rootID": "t3_c9uWs8FVfbko2qaPNgpUFA"}
{"actionType": "post", "informationID": "CVE-2015-6620", "nodeID": "t3_UKRvYWT1sB1Gy98yZccoYQ", "nodeTime": "2015-12-10T07:40:38Z", "nodeUserID": "vHwXTX4FohkDUqQMdjb3zg", "parentID": "t3_UKRvYWT1sB1Gy98yZccoYQ", "platform": "reddit", "rootID": "t3_UKRvYWT1sB1Gy98yZccoYQ"}
{"actionType": "post", "informationID": "CVE-2015-6620", "nodeID": "t3_2dRXuDU88Z4K_c90e0hVBQ", "nodeTime": "2015-12-10T08:41:03Z", "nodeUserID": "vHwXTX4FohkDUqQMdjb3zg", "parentID": "t3_2dRXuDU88Z4K_c90e0hVBQ", "platform": "reddit", "rootID": "t3_2dRXuDU88Z4K_c90e0hVBQ"}
```
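
As a minimal sketch of working with this layout, the snippet below parses one JSON object per line and checks that each event carries the eight fields shown in the example. The `read_events` helper, the `EXPECTED_FIELDS` set, and the shortened sample events are illustrative assumptions and are not part of the SocialSim package.

```python
import io
import json

# Fields that appear in every example event above (assumed here to all be required).
EXPECTED_FIELDS = {
    "actionType", "informationID", "nodeID", "nodeTime",
    "nodeUserID", "parentID", "platform", "rootID",
}


def read_events(handle):
    """Parse one JSON object per line and check that the expected fields are present."""
    events = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        missing = EXPECTED_FIELDS - set(event)
        if missing:
            raise ValueError(f"line {line_number} is missing {sorted(missing)}")
        events.append(event)
    return events


# Made-up events with shortened IDs, standing in for a real input file.
sample_file = io.StringIO(
    '{"actionType": "tweet", "informationID": "CVE-2017-0143", "nodeID": "t1", '
    '"nodeTime": "2017-07-01T20:17:16Z", "nodeUserID": "userA", "parentID": "t1", '
    '"platform": "twitter", "rootID": "t1"}\n'
    '{"actionType": "retweet", "informationID": "CVE-2017-0143", "nodeID": "r1", '
    '"nodeTime": "2017-07-01T20:21:45Z", "nodeUserID": "userB", "parentID": "t1", '
    '"platform": "twitter", "rootID": "t1"}\n'
)

events = read_events(sample_file)
print(len(events), events[1]["parentID"])
```

With a real file, the same function can be called on the handle returned by `open(path)`.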

However, the raw JSON output for a social platform usually looks something like this...
```json
{"accessPlatform": "<a href=\"http://www.twitter.com\" rel=\"nofollow\">Twitter for Windows Phone</a>", "avatarURL_h": "http://pbs.twimg.com/7AMCeGmHQbasIrep4WfyNQ/seTQyFWFg8kpai_slU5Ijw/jOwULya3GuXhDu3Gv6_vrQ", "content_language": "English", "content_type": "Tweet", "contributors": null, "coordinates": null, "country": NaN, "country_code": NaN, "created_at": "Mon Jul 03 09:11:48 +0000 2017", "data_source": "twitter", "data_source_type": "socialmedia", "date_time": "Mon Jul 03 09:11:48 +0000 2017", "display_text_range": NaN, "entities": {"symbols": [], "urls": [{"indices": [71, 94], "display_url_h": "github.com/pYy_qhfUFxrEz2nkjQp3tg/CI_svJvKlObwqZZ1AHRHfQ", "expanded_url_h": "https://github.com/pYy_qhfUFxrEz2nkjQp3tg/WvbmxY-It8u75Uy0NglnVw/blob/Ov30rdpcpVPIyqgVK5Ppmg", "url_h": "https://t.co/uleSzVIa0mkn3t5ss4M6-A"}], "user_mentions": [], "hashtags": [{"text": "nmap", "indices": [6, 11]}, {"text": "EternalBlue", "indices": [55, 67]}, {"text": "Vapt", "indices": [95, 100]}, {"text": "InfoSec", "indices": [101, 109]}]}, "extended_entities": NaN, "extension": {"socialsim_search_source": "text", "socialsim_keywords": ["CVE-2017-0143"], "created_dow": "Monday", "socialsim_domain": "CVE", "created_hod": 9}, "favorite_count": 0, "favorited": false, "filter_level": "low", "full_url_h": ["https://github.com/pYy_qhfUFxrEz2nkjQp3tg/WvbmxY-It8u75Uy0NglnVw/blob/Ov30rdpcpVPIyqgVK5Ppmg"], "full_url_hash": NaN, "full_url_hash_h": ["HWejGDEWUPDRPHPg4RO5jQ"], "geo": null, "hashtags": ["nmap", "EternalBlue", "Vapt", "InfoSec"], "id_h": "7cDdUQ30lg-rwb_yOMsHDQ", "id_str_h": "7cDdUQ30lg-rwb_yOMsHDQ", "in_reply_to_screen_name_h": "", "in_reply_to_status_id_h": "", "in_reply_to_status_id_str_h": "", "in_reply_to_user_id_h": "", "in_reply_to_user_id_str_h": "", "is_quote_status": false, "lang": "en", "location": NaN, "place": null, "possibly_sensitive": true, "quoted_status": NaN, "quoted_status_id_h": NaN, "quoted_status_id_str_h": NaN, "retweet_count": 0, "retweeted": false, 
"retweeted_status": NaN, "source": "<a href=\"http://www.twitter.com\" rel=\"nofollow\">Twitter for Windows Phone</a>", "text_m": "Using #nmap script to scan for MS17-010 (CVE-2017-0143 #EternalBlue )   url: https://t.co/uleSzVIa0mkn3t5ss4M6-A \r#Vapt #InfoSec", "timestamp_ms": "1499073108296", "truncated": false, "uid_h": "7cDdUQ30lg-rwb_yOMsHDQ", "user": {"profile_link_color": "FF0000", "profile_background_image_url_h": "http://abs.twimg.com/WbUUF0v_5K5AKz1jqtef4A/_gIZQ9zahxUPWQs0da-t7Q/kkCnotdMJzBrHShFs8UdPg", "lang": "en", "profile_image_url_https_h": "https://pbs.twimg.com/7AMCeGmHQbasIrep4WfyNQ/seTQyFWFg8kpai_slU5Ijw/jOwULya3GuXhDu3Gv6_vrQ", "profile_background_tile": true, "profile_sidebar_fill_color": "7AC3EE", "verified": false, "listed_count": 223, "protected": false, "followers_count": 441, "screen_name_h": "cb9l7iFAbHzH29DJOulO7w", "profile_background_color": "642D8B", "profile_banner_url_h": "https://pbs.twimg.com/AQsv-nMYlZJ0nOByaUjNkw/gEziu13lmM_MMYCjWpQckQ/sGtVovoELzRSe_jYupt-wA", "location": "INDIA", "url_h": "", "time_zone": "New Delhi", "profile_sidebar_border_color": "65B0DA", "description_m": "PenTester . 
OSCP  #PenTest  #Exploit #CyberSecurity", "id_str_h": "gEziu13lmM_MMYCjWpQckQ", "id_h": "gEziu13lmM_MMYCjWpQckQ", "created_at": "Mon Jul 12 13:55:50 +0000 2010", "profile_text_color": "3D1957", "following": null, "profile_image_url_h": "http://pbs.twimg.com/7AMCeGmHQbasIrep4WfyNQ/seTQyFWFg8kpai_slU5Ijw/jOwULya3GuXhDu3Gv6_vrQ", "default_profile": false, "statuses_count": 16777, "name_h": "7881ThGM1pQGTcCzueAjug", "is_translator": false, "profile_background_image_url_https_h": "https://abs.twimg.com/WbUUF0v_5K5AKz1jqtef4A/_gIZQ9zahxUPWQs0da-t7Q/kkCnotdMJzBrHShFs8UdPg", "contributors_enabled": false, "follow_request_sent": null, "geo_enabled": false, "profile_use_background_image": true, "favourites_count": 1602, "default_profile_image": false, "notifications": null, "utc_offset": 19800, "friends_count": 264}, "user_id_h": "gEziu13lmM_MMYCjWpQckQ", "username_h": "cb9l7iFAbHzH29DJOulO7w"}
```

<a id='data_conversion'></a>

### Converting Data
[Jump to top](#top)

So, how do you get from one to the other?

For some fields it is relatively simple: we only need to identify the relevant field in the platform's JSON schema.

<img src="img/simple_fields.png?2" width="700"/>
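
A minimal sketch of this kind of direct mapping for a tweet is shown below. The choice of source fields (`id_h`, `user_id_h`, `created_at`) is an assumption based on the raw example earlier in this notebook, and the helper names are hypothetical; the package's own extraction functions may map fields differently.

```python
from datetime import datetime, timezone


def convert_time(created_at):
    """Convert Twitter's 'Mon Jul 03 09:11:48 +0000 2017' style into ISO 8601 UTC."""
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def extract_simple_fields(tweet):
    """Copy the fields that need no context beyond the tweet itself (assumed mapping)."""
    return {
        "nodeID": tweet["id_h"],
        "nodeUserID": tweet["user_id_h"],
        "nodeTime": convert_time(tweet["created_at"]),
        "platform": "twitter",
    }


# Values taken from the raw tweet shown earlier, trimmed to the fields used here.
raw_tweet = {
    "id_h": "7cDdUQ30lg-rwb_yOMsHDQ",
    "user_id_h": "gEziu13lmM_MMYCjWpQckQ",
    "created_at": "Mon Jul 03 09:11:48 +0000 2017",
}

simple = extract_simple_fields(raw_tweet)
print(simple["nodeTime"])
```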

However, other fields can be more difficult depending on the platform API and sometimes require looking at contextual information in addition to the individual events. 

**ParentID and RootID**

For example, many platforms will provide the direct parent of a particular reply-type post, but *not* the root post that started the thread. If we know each parentID, we can identify the rootID by iterating upwards through the parent tree until we reach a post with no further parent.

<img src="img/root_finding.png?1" width="800"/>
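
The upward walk can be sketched as follows. This is an illustrative helper, not the package's implementation: `find_root` and the `parent_of` dictionary are hypothetical names, and the sketch assumes that a root post lists itself as its own parent (as in the example events above). If a parent is missing from the data, it returns the highest ancestor it can reach.

```python
def find_root(node_id, parent_of):
    """Follow parent links upward from node_id until a root is reached."""
    seen = set()
    current = node_id
    while current not in seen:
        seen.add(current)
        parent = parent_of.get(current)
        # A root is its own parent; an unknown parent means we cannot climb further.
        if parent is None or parent == current:
            return current
        current = parent
    raise ValueError(f"cycle detected while resolving the root of {node_id}")


# Made-up thread: post1 <- reply1 <- reply2 <- reply3
parent_of = {
    "post1": "post1",
    "reply1": "post1",
    "reply2": "reply1",
    "reply3": "reply2",
}

roots = {node: find_root(node, parent_of) for node in parent_of}
print(roots)
```

For long threads, caching the root of each visited node would avoid repeating the same walk for every reply.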

However, we sometimes have the opposite problem, where we know the root but not the parent. For example, the Twitter API provides the root tweet for a retweet action, but *does not* provide the direct parent. If UserB retweets UserA and then UserC retweets UserB, the API output for UserC's retweet will list the parent as UserA. Because we are interested in tracking the pathways of information propagation, we would like to specify UserB as the parent (i.e. the immediate source of the information).

We can leverage the Twitter follower network as an additional source of information to estimate the most likely parent for a given retweet. To select the immediate parent, we:
1. Identify all possible parents by limiting to the tweets/retweets that occurred *prior* to the retweet in question
2. Identify which of the authors of those tweets/retweets the user in question follows
3. Assign the parentID to be the most recent tweet/retweet by a user that the user in question follows

<img src="img/retweet_parents.png?1" width="700"/>
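
The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's parent reconstruction code: `infer_retweet_parent`, the `follows` dictionary (user ID to the set of user IDs they follow), and the sample events are hypothetical, and falling back to the root tweet when no followed author is found is a choice made for this sketch.

```python
def infer_retweet_parent(retweet, cascade, follows):
    """Estimate the immediate parent of a retweet within its cascade.

    cascade: every tweet/retweet sharing the retweet's rootID, root included.
    follows: maps a user ID to the set of user IDs that user follows.
    """
    followed = follows.get(retweet["nodeUserID"], set())
    # Step 1: only events before this retweet can be its parent.
    # ISO 8601 UTC timestamps in a single format sort correctly as strings.
    earlier = [e for e in cascade if e["nodeTime"] < retweet["nodeTime"]]
    # Step 2: keep events authored by users the retweeting user follows.
    candidates = [e for e in earlier if e["nodeUserID"] in followed]
    # Step 3: choose the most recent of those events.
    if candidates:
        return max(candidates, key=lambda e: e["nodeTime"])["nodeID"]
    # Assumption for this sketch: with no followed author, keep the root as parent.
    return retweet["rootID"]


cascade = [
    {"nodeID": "t1", "nodeUserID": "UserA", "nodeTime": "2017-07-01T10:00:00Z", "rootID": "t1"},
    {"nodeID": "r1", "nodeUserID": "UserB", "nodeTime": "2017-07-01T10:05:00Z", "rootID": "t1"},
    {"nodeID": "r2", "nodeUserID": "UserC", "nodeTime": "2017-07-01T10:10:00Z", "rootID": "t1"},
]
follows = {"UserB": {"UserA"}, "UserC": {"UserA", "UserB"}}

parent_of_r1 = infer_retweet_parent(cascade[1], cascade, follows)
parent_of_r2 = infer_retweet_parent(cascade[2], cascade, follows)
print(parent_of_r1, parent_of_r2)
```

Here UserC follows both earlier authors, so the most recent event (UserB's retweet) is selected as the parent rather than the root tweet.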

**InformationID**

Extracting the information ID has two steps:
1. **Text processing**: Identify whether the information of interest is mentioned in a specific post. There are several possible approaches for this:
      1. Use pre-extracted fields: some platforms have their own entity extraction
      2. Search the text for specific keywords
      3. Perform more advanced text processing such as topic modeling, named entity extraction, or classification models
2. **Propagating to child events**: Responses to a post that are about a specified information ID should be considered to also be related to that information ID
    

<img src="img/info_id_extraction.png?1" width="800"/>
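
Both steps can be sketched with a simple keyword search followed by propagation from parents to children. The helper names, the `text` field, and the case-insensitive substring match are assumptions made for this illustration; the package may match keywords differently. The sketch also assumes events are ordered so that every parent appears before its children (for example, sorted by time).

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [k for k in keywords if k.lower() in lowered]


def assign_information_ids(events, keywords):
    """Map each nodeID to its own information IDs plus those inherited from its parent."""
    info_ids = {}
    for event in events:
        # Step 1: text processing on the event itself.
        ids = set(find_keywords(event.get("text", ""), keywords))
        # Step 2: a response inherits the information IDs of the post it responds to.
        parent = event["parentID"]
        if parent != event["nodeID"]:
            ids |= info_ids.get(parent, set())
        info_ids[event["nodeID"]] = ids
    return info_ids


# Made-up thread in time order; the root is its own parent.
events = [
    {"nodeID": "t1", "parentID": "t1", "text": "Scanning for MS17-010 #EternalBlue"},
    {"nodeID": "r1", "parentID": "t1", "text": "Thanks for sharing"},
    {"nodeID": "r2", "parentID": "r1", "text": "Is CVE-2017-0143 related?"},
]
keywords = ["EternalBlue", "MS17-010", "CVE-2017-0143"]

info_ids = assign_information_ids(events, keywords)
for node_id, ids in info_ids.items():
    print(node_id, sorted(ids))
```

The reply `r1` mentions no keyword itself but still inherits both information IDs from the root, and `r2` adds its own keyword to the inherited set.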

### Imports

In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
import socialsim as ss
import json
import pprint
import pandas as pd

<a id='twitter_data'></a>

### Let's look at some raw Twitter JSON data
[Jump to top](#top)

In [6]:
tweets = []
with open('../data/example_raw_twitter_data.json','r') as f:
    for line in f:
        tweets.append(json.loads(line))
pprint.pprint(tweets[0])

{'accessPlatform': '<a href="http://twitter.com/download/iphone" '
                   'rel="nofollow">Twitter for iPhone</a>',
 'avatarURL_h': 'http://pbs.twimg.com/7AMCeGmHQbasIrep4WfyNQ/SfevCuwm1ZqBBw7DwzdBwg/SORkjxK5jt5JAXPWl8WHug',
 'content_language': 'English',
 'content_type': 'Tweet',
 'contributors': None,
 'coordinates': None,
 'country': nan,
 'country_code': nan,
 'created_at': 'Sat Jul 01 20:25:27 +0000 2017',
 'data_source': 'twitter',
 'data_source_type': 'socialmedia',
 'date_time': 'Sat Jul 01 20:25:27 +0000 2017',
 'display_text_range': nan,
 'entities': {'hashtags': [],
              'symbols': [],
              'urls': [{'display_url_h': 'twib.in/LblejhqSZ7ehGIVWsgE7Mw/99bBIaXHd3hhze_0h4yw4w',
                        'expanded_url_h': 'http://twib.in/LblejhqSZ7ehGIVWsgE7Mw/99bBIaXHd3hhze_0h4yw4w',
                        'indices': [96, 119],
                        'url_h': 'https://t.co/HsjpQiLDBrfTiwDoGIl7Cw'},
                       {'expanded_url_h': '',
      

We have built-in functions to convert JSON data from Twitter, Reddit, GitHub, and Telegram into the required input format. The arguments for these functions are:
1. **data** - file path to the raw JSON data
2. **info_id_fields** - A specific field path in the JSON file which contains pre-extracted information IDs. One of two options for information ID extraction.
3. **keywords** - A list of keywords to search for in the text to use as information IDs. The other option for information ID extraction.
4. **anonymized** - A boolean to indicate whether the data is in the processed and anonymized format used by the SocialSim program. Use anonymized = False when working with typical platform API output data. The anonymized = True option is specific for SocialSim data only.

For the SocialSim data we are using in this tutorial, several specific keywords of interest have been pre-extracted into the "extension.socialsim_keywords" field.

In [7]:
tweets[0]['extension']

{'socialsim_search_source': 'text',
 'socialsim_keywords': ['CVE-2017-0143'],
 'created_dow': 'Saturday',
 'socialsim_domain': 'CVE',
 'created_hod': 20}

We can therefore specify this field as the source of the information IDs.

In [8]:
data = ss.extract_twitter_data('../data/example_raw_twitter_data.json',
                               info_id_fields = ["extension.socialsim_keywords"],
                               anonymized=True)
data.head()

Extracting fields...
Sorting...
Reconstructing cascades...
not running parent reconstruction...
Adding information IDs to children...
0/1
Expanding events...
Done!


| | actionType | nodeID | nodeTime | nodeUserID | parentID | platform | rootID | informationID |
|---|---|---|---|---|---|---|---|---|
| 0 | tweet | 09lfHHshDvxRn0aGoG446w | 2017-07-01T20:17:16Z | SO1ac7x6aKy5z-VZgL6l4w | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | CVE-2017-0143 |
| 1 | retweet | 56aepqqx2xLNmAxxvEeirw | 2017-07-01T20:21:45Z | h1spW_KFiKNKzGiAlr3YsA | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | CVE-2017-0143 |
| 2 | retweet | 5MmFhzTsyeoKeHdgmOBjnw | 2017-07-01T20:24:41Z | E6VQWiZUritPuU40-Jy0Kw | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | nmap |
| 3 | retweet | 5MmFhzTsyeoKeHdgmOBjnw | 2017-07-01T20:24:41Z | E6VQWiZUritPuU40-Jy0Kw | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | CVE-2017-0143 |
| 4 | retweet | TnEUnc301-QS8r7TJOO8CQ | 2017-07-01T20:25:27Z | l6f0yTrmfwWbkNcVyQteeQ | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | CVE-2017-0143 |
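
In the output above, rows 2 and 3 share a nodeID but carry different informationID values, which suggests that an event associated with several information IDs is expanded into one row per ID. The snippet below is a minimal sketch of that kind of expansion; `expand_events` and the intermediate `informationIDs` list field are hypothetical and are not taken from the package.

```python
def expand_events(events):
    """Turn each event holding a list of information IDs into one row per ID."""
    rows = []
    for event in events:
        for info_id in event["informationIDs"]:
            # Copy every field except the list, then attach a single ID.
            row = {k: v for k, v in event.items() if k != "informationIDs"}
            row["informationID"] = info_id
            rows.append(row)
    return rows


# One retweet from the output above, with both of its information IDs.
events = [
    {
        "actionType": "retweet",
        "nodeID": "5MmFhzTsyeoKeHdgmOBjnw",
        "informationIDs": ["nmap", "CVE-2017-0143"],
    }
]

rows = expand_events(events)
for row in rows:
    print(row["nodeID"], row["informationID"])
```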


However, if you were working with more standard Twitter data, you could also specify the fields containing Twitter's entity extraction results. For example, if you want to track the spread of mentions of specific users, you could specify the "entities.user_mentions.screen_name" field. In the anonymized data used in this tutorial, the corresponding field is named "entities.user_mentions.screen_name_h".

In [9]:
data = ss.extract_twitter_data('../data/example_raw_twitter_data.json',
                               info_id_fields = ["entities.user_mentions.screen_name_h"],
                               anonymized=True)
data.head()

Extracting fields...
Sorting...
Reconstructing cascades...
not running parent reconstruction...
Adding information IDs to children...
0/1
Expanding events...
Done!


| | actionType | nodeID | nodeTime | nodeUserID | parentID | platform | rootID | informationID |
|---|---|---|---|---|---|---|---|---|
| 0 | retweet | 56aepqqx2xLNmAxxvEeirw | 2017-07-01T20:21:45Z | h1spW_KFiKNKzGiAlr3YsA | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | tD7ZRX_Ce5viq7z2pkIwRg |
| 1 | retweet | 5MmFhzTsyeoKeHdgmOBjnw | 2017-07-01T20:24:41Z | E6VQWiZUritPuU40-Jy0Kw | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | tD7ZRX_Ce5viq7z2pkIwRg |
| 2 | retweet | TnEUnc301-QS8r7TJOO8CQ | 2017-07-01T20:25:27Z | l6f0yTrmfwWbkNcVyQteeQ | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | tD7ZRX_Ce5viq7z2pkIwRg |
| 3 | retweet | 2EwY04fGAIVtPVUvgBxWKQ | 2017-07-01T20:32:31Z | IHGHZa0JgIGjSsxGwySN8w | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | tD7ZRX_Ce5viq7z2pkIwRg |
| 4 | retweet | F8Z-373W1rilMPilh3yHcw | 2017-07-01T20:33:29Z | rwcnjlkIk8TVv64k0GnLTg | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | tD7ZRX_Ce5viq7z2pkIwRg |
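
The dotted field paths used in the two calls above can be read as a walk through the nested JSON, where a list along the way (such as `user_mentions`) contributes one value per element. The function below is a minimal sketch of that interpretation; `get_field_values` is a hypothetical helper and the package's actual path handling may differ.

```python
def get_field_values(record, dotted_path):
    """Collect all values found at a dotted path, descending into lists on the way."""
    current = [record]
    for key in dotted_path.split("."):
        next_level = []
        for item in current:
            # A list contributes each of its elements; anything else is used directly.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    # Flatten a final level of lists so the result is a flat list of values.
    values = []
    for item in current:
        if isinstance(item, list):
            values.extend(item)
        else:
            values.append(item)
    return values


# Trimmed, partly made-up tweet with the two fields used in this tutorial.
tweet = {
    "extension": {"socialsim_keywords": ["CVE-2017-0143"]},
    "entities": {
        "user_mentions": [
            {"screen_name_h": "mentioned_user_1"},
            {"screen_name_h": "mentioned_user_2"},
        ]
    },
}

keyword_ids = get_field_values(tweet, "extension.socialsim_keywords")
mention_ids = get_field_values(tweet, "entities.user_mentions.screen_name_h")
print(keyword_ids, mention_ids)
```

A path that does not exist in the record yields an empty list, so such a tweet would simply receive no information ID from that field.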


Alternatively, you can search for specific keywords in the text of the tweets by using the keywords argument instead.

In [10]:
data = ss.extract_twitter_data('../data/example_raw_twitter_data.json',
                               keywords = ['EternalBlue','MS17-010'],
                               anonymized=True)
data.head()

Extracting fields...
Sorting...
Reconstructing cascades...
not running parent reconstruction...
Adding information IDs to children...
0/1
Expanding events...
Done!


| | actionType | nodeID | nodeTime | nodeUserID | parentID | platform | rootID | informationID |
|---|---|---|---|---|---|---|---|---|
| 0 | tweet | 09lfHHshDvxRn0aGoG446w | 2017-07-01T20:17:16Z | SO1ac7x6aKy5z-VZgL6l4w | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | EternalBlue |
| 1 | tweet | 09lfHHshDvxRn0aGoG446w | 2017-07-01T20:17:16Z | SO1ac7x6aKy5z-VZgL6l4w | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | MS17-010 |
| 2 | retweet | 56aepqqx2xLNmAxxvEeirw | 2017-07-01T20:21:45Z | h1spW_KFiKNKzGiAlr3YsA | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | EternalBlue |
| 3 | retweet | 56aepqqx2xLNmAxxvEeirw | 2017-07-01T20:21:45Z | h1spW_KFiKNKzGiAlr3YsA | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | MS17-010 |
| 4 | retweet | 5MmFhzTsyeoKeHdgmOBjnw | 2017-07-01T20:24:41Z | E6VQWiZUritPuU40-Jy0Kw | 09lfHHshDvxRn0aGoG446w | twitter | 09lfHHshDvxRn0aGoG446w | EternalBlue |
