<a href="https://colab.research.google.com/github/mevah/nailsalon/blob/master/Reuters_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Connecting to the Reuters Connect API

How to access the **Reuters Connect API Developer Guide**:  
* Go to http://liaison.reuters.com/
* Choose 'Web Services - API' from the dropdown menu
* Click on the 'Related Content' tab
* Download the 'API Developer Guide' (and the 'XML Quick Reference')

Further documentation: https://docs.google.com/document/d/14Uiys8TKyoeYfSB8TBEzwlGwvUK-PjxnyuHN-uh2BQg/edit#heading=h.m8140bfkyycf

### Imports

In [0]:
# If any of the packages below is not installed (e.g. requests), uncomment the next line to install it.
# !pip3 install requests

import json
from datetime import datetime

import requests

### 1. Obtain a token for web services with your username and password

**Username** alternatives: HackAPI1, HackAPI2

In [0]:
username = 'HackAPI2'

Please retrieve the required **passwords** from the mentors on the day.

In [0]:
password = 'xmwMsuGfdqhXRGV' # needs to be adjusted!

Constructing the url to obtain the authorisation token.

In [5]:
auth_url = "https://commerce.reuters.com/rmd/rest/xml/login?username=" + username + "&password=" + password + "&format=json"
auth_url

'https://commerce.reuters.com/rmd/rest/xml/login?username=HackAPI2&password=xmwMsuGfdqhXRGV&format=json'
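Plain string concatenation breaks if the password contains characters such as `&` or a space. One option is to let the standard library encode the query string. This is a minimal sketch that assumes the same endpoint and parameter names as the cell above; `build_login_url` is a hypothetical helper, not part of the Reuters API, and the credentials are placeholders.

```python
from urllib.parse import urlencode

LOGIN_ENDPOINT = "https://commerce.reuters.com/rmd/rest/xml/login"  # same endpoint as above


def build_login_url(username, password):
    """Return the login URL with percent-encoded query parameters (hypothetical helper)."""
    query = urlencode({"username": username, "password": password, "format": "json"})
    return LOGIN_ENDPOINT + "?" + query


# Placeholder credentials for illustration only.
print(build_login_url("HackAPI2", "p@ss word&1"))
```

The resulting string can be passed to `requests.get` exactly like `auth_url`.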

Sending a GET request to obtain the authorisation token.

In [0]:
response = requests.get(auth_url)

Loading the response as JSON and drilling down into it to obtain the **authorisation token**.

In [7]:
authToken = json.loads(response.text).get('authToken').get('authToken')
authToken

'8vsT3ETAUZaiVRklHIqY7typIwTzvjss81kIX5wuiTI='

**Note**: The authentication token is valid for 24 hours. Once obtained, cache it on the client side and reuse it in all subsequent calls during that 24-hour period.
When the token expires, you will get an authentication error.
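The caching advice in the note could be sketched as a small wrapper that refetches the token only once it is older than 24 hours. `TokenCache` and its `fetch_token` argument are hypothetical names introduced for illustration; `fetch_token` would be any function that performs the login request and returns the token string.

```python
import time


class TokenCache:
    """Cache a token and refetch it once it is older than `max_age_seconds`."""

    def __init__(self, fetch_token, max_age_seconds=24 * 60 * 60, clock=time.time):
        self._fetch_token = fetch_token   # callable returning a fresh token string
        self._max_age = max_age_seconds   # 24 hours, per the note above
        self._clock = clock               # injectable so the expiry logic can be tested
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return the cached token, fetching a new one if none exists or it has expired."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._max_age
        if expired:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token


# Usage with a stand-in fetch function (a real one would call the login URL).
cache = TokenCache(lambda: "dummy-token")
print(cache.get())
```

This only keeps the token in memory; persisting it between notebook sessions would need a file or similar store.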

### 2. Obtain a list of the channels to which you have access

In [0]:
channels_url = 'http://rmb.reuters.com/rmd/rest/json/channels?token=' + authToken + '&format=json'

Overriding the value of the previous response with the result for channels.

In [0]:
response = requests.get(channels_url)

In [10]:
channels = json.loads(response.text)
channels

{'channelIds': [158,
  675,
  24217,
  585,
  1,
  644,
  24242,
  718,
  27055,
  61670,
  61669,
  656,
  61672,
  27052,
  676,
  81607,
  176033,
  32220,
  174,
  61667,
  61671,
  674,
  106821,
  61668],
 'channelInformation': [{'alias': 'CLE548',
   'availableContentProfiles': ['NEP-External_ANP'],
   'category': {'description': 'Online Video', 'id': 'OLV'},
   'description': 'BVO',
   'lastArrivalInternal': '2019-11-01T22:37:26Z',
   'lastUpdate': '2019-11-01T22:35:05Z'},
  {'alias': 'Efm208',
   'availableContentProfiles': ['SNI-Graphic'],
   'category': {'description': 'Graphics', 'id': 'GRA'},
   'description': 'French Language News Graphics Service',
   'lastArrivalInternal': '2019-11-01T12:36:05Z',
   'lastUpdate': '2019-11-01T12:35:30Z'},
  {'alias': 'FES376',
   'availableContentProfiles': ['SNI-Text',
    'SNI-Picture',
    'NEP-External',
    'SNEP-External'],
   'category': {'category': {'description': 'USA', 'id': 'OLR:USA'},
    'description': 'Online Reports',
   

All descriptions of the available channels are listed below.

In [11]:
descriptions = [value.get('description') for value in channels.get('channelInformation')]
descriptions

['BVO',
 'French Language News Graphics Service',
 'US Online Report Top News',
 'German General News Service',
 'EVO',
 'Reuters World Service',
 'UK Online Report Top News',
 'France Online Report Top News',
 'Germany Online Report Top News',
 'DNP Basic Germany OLR Markets',
 'DNP Basic Germany OLR Economy',
 'Swiss Domestic News Service German',
 'DNP Basic Germany OLR World',
 'Germany Online Report World News',
 'Spanish Language News Graphics Service',
 'DNP Basic Germany OLR Politics',
 'Reuters News Picture Service - RNPS',
 'Reuters Interactive Graphics',
 'GNVO',
 'DNP Basic Germany OLR Company',
 'DNP Basic Germany OLR Top',
 'English Language News Graphics Service',
 'Captioned Online Video - German',
 'DNP Basic Germany OLR Domestic']

In [12]:
aliases = [value.get('alias') for value in channels.get('channelInformation')]
aliases  # will be used for the next step

['CLE548',
 'Efm208',
 'FES376',
 'HkV652',
 'Iwu647',
 'STK567',
 'TRn222',
 'UXR369',
 'afj497',
 'dja779',
 'hro568',
 'jyn629',
 'nch777',
 'nld052',
 'oYM964',
 'pbn620',
 'pwu404',
 'shl347',
 'sst663',
 'ucz335',
 'uzq030',
 'wbq437',
 'xbt154',
 'xnk712']
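Since the alias is what later calls need, it can be handy to look it up by description rather than by list position. A minimal sketch, using a two-entry excerpt of the `channels` response shown above; `alias_by_description` is a hypothetical helper name.

```python
# Excerpt of the `channels` response shown above (two entries only).
channels_sample = {
    "channelInformation": [
        {"alias": "CLE548", "description": "BVO",
         "category": {"description": "Online Video", "id": "OLV"}},
        {"alias": "Efm208", "description": "French Language News Graphics Service",
         "category": {"description": "Graphics", "id": "GRA"}},
    ]
}


def alias_by_description(channels):
    """Map each channel description to its alias."""
    return {info.get("description"): info.get("alias")
            for info in channels.get("channelInformation", [])}


lookup = alias_by_description(channels_sample)
print(lookup["BVO"])
```

If two channels ever shared a description, the later one would overwrite the earlier one in this mapping.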

### 3. Retrieve content for a specific channel

The examples here are meant to build an intuition for how to query the API.

Getting the *first* alias. 

In [13]:
alias = channels.get('channelInformation')[0].get('alias')
alias

'CLE548'

Get all items in a specific channel ('CLE548' in this case) for the last 24 hours.

In [0]:
url = 'http://rmb.reuters.com/rmd/rest/json/items?channel=' + alias + '&token=' + authToken + '&format=json'

In [0]:
response = requests.get(url)

Retrieve text-based content for a specific channel category (group of channels containing similar content) using the **channelCategory** argument.

In [16]:
channelCat = 'TXT'  # or: OLR, GRA, PIC, ...
url = 'http://rmb.reuters.com/rmd/rest/json/items?channelCategory=' + channelCat + '&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'pending': False,
 'pollToken': 'ExwaY31kfnt3ZGNxcGd3YH1hYHhneg==',
 'results': [{'channelIds': [542,
    626,
    159003,
    176417,
    22321,
    63697,
    647,
    70666,
    29598,
    63633,
    49349,
    63667,
    599,
    63685,
    644,
    63665,
    158991,
    68936,
    635,
    22370,
    159015,
    118201,
    23278,
    27574,
    564],
   'channels': ['STK567'],
   'dateCreated': '2019-11-02T13:27:55Z',
   'geography': ['ZA', 'GB'],
   'guid': 'tag:reuters.com,2019:newsml_L8N27I0AL',
   'headline': "Rugby-Erasmus' South Africa game plan gets thumbs up from ex-Boks",
   'id': 'tag:reuters.com,2019:newsml_L8N27I0AL:76873634',
   'internalReceivedDate': '2019-11-02T13:27:58Z',
   'language': 'en',
   'mediaType': 'T',
   'priority': 3,
   'slug': 'RUGBY-UNION-WORLDCUP-ENG-ZAF/ (TV, PIX)',
   'source': 'Thomson Reuters',
   'version': 76873634},
  {'channelIds': [587,
    70659,
    70660,
    611,
    585,
    63412,
    586,
    159005,
    158993,
    158997,
    

Retrieve items for the last 4 hours using the **maxAge** argument.

In [19]:
max_age = '4h' 
url = 'http://rmb.reuters.com/rmd/rest/json/items?channel=CLE548&maxAge=' + max_age + '&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'pending': False, 'results': [], 'status': {'code': 10}}
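As the output above shows, a query can legitimately return an empty `results` list, so downstream code should not assume that at least one item exists. A small sketch of tolerant extraction follows; `extract_headlines` is a hypothetical helper, and no meaning is assumed for the `status` code.

```python
def extract_headlines(payload):
    """Return the headlines in an `items` response, or an empty list if there are none."""
    return [item.get("headline") for item in payload.get("results", [])]


# The empty response from the 4-hour query above.
empty_payload = {"pending": False, "results": [], "status": {"code": 10}}

# A trimmed-down response with one item, using a headline from this notebook's output.
one_item_payload = {
    "pending": False,
    "results": [
        {"headline": "Microsoft beats Amazon for Pentagon's $10 bln cloud computing contract"}
    ],
}

print(extract_headlines(empty_payload))
print(extract_headlines(one_item_payload))
```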

Retrieve items for a given date range using the **dateRange** argument.

In [29]:
# Format: YYYY.MM.DD-YYYY.MM.DD. If no date range filter is supplied,
# the news items returned are limited to the past 24 hours.
date_range = '2019.10.22-2019.10.28'
url = 'http://rmb.reuters.com/rmd/rest/json/items?channel=CLE548&dateRange=' + date_range + '&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'pending': False,
 'pollToken': 'ExwaY31kfnt3ZWF1cGF0ZHltYH9hdg==',
 'results': [{'author': 'Reuters, OCT 26',
   'channelIds': [157,
    172391,
    93257,
    156455,
    158,
    182551,
    194309,
    63328,
    79133,
    212951,
    166089,
    32100,
    122641],
   'channels': ['CLE548'],
   'dateCreated': '2019-10-26T15:13:08Z',
   'dimensions': '960x540',
   'duration': 108,
   'editNumber': '6434BO',
   'geography': ['US'],
   'guid': 'tag:reuters.com,2019:newsml_OVB2QOJ1N',
   'headline': "Microsoft beats Amazon for Pentagon's $10 bln cloud computing contract",
   'id': 'tag:reuters.com,2019:newsml_OVB2QOJ1N:2',
   'internalReceivedDate': '2019-10-26T15:13:14Z',
   'language': 'en',
   'mediaType': 'V',
   'previewUrl': 'http://content.reuters.com/auth-server/content/tag:reuters.com,2019:newsml_OVB2QOJ1N:2/tag:reuters.com,2019:binary_LOP000LBM8PBP-VIEWIMAGE:512X288',
   'priority': 4,
   'remoteContentComplete': True,
   'slug': 'PENTAGON-JEDI',
   'source': 'Thomson Reut

### 4. Retrieve the NewsML for an item

The *item* method is used to retrieve a particular news item as a NewsML-G2 document. To invoke this method, you must know the ID of the news item you wish to retrieve.

In [0]:
items = json.loads(response.text).get('results')  # list of item dicts, not only their IDs

In [31]:
example_id = items[0].get('id')
example_id

'tag:reuters.com,2019:newsml_OVB2QOJ1N:2'
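In the sample responses in this notebook, each `id` appears to be the item's `guid` followed by a colon and its `version` number. That is an observation from the printed output, not a documented guarantee. Under that assumption, the two parts can be separated as follows (`split_item_id` is a hypothetical helper).

```python
def split_item_id(item_id):
    """Split an item ID into (guid, version), assuming the form '<guid>:<version>'."""
    guid, _, version = item_id.rpartition(":")
    return guid, int(version)


print(split_item_id("tag:reuters.com,2019:newsml_OVB2QOJ1N:2"))
```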

Calling the *item* method with this ID returns the single news item it identifies.

In [32]:
url = 'http://rmb.reuters.com/rmd/rest/json/item?id=' + example_id + '&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'associations': [{'body_xhtml': '              <p>Microsoft has won the Pentagon\'s $10 billion cloud computing contract.</p>\n              <p>And beaten out favorite Amazon.</p>\n              <p/>\n              <p>The U.S. Defense Department made the announcement on Friday (October 25).</p>\n              <p>The contracting process had long been mired in conflict of interest allegations.</p>\n              <p>And had even drawn the attention of President Donald Trump - who has publicly taken swipes at Amazon and its founder Jeff Bezos. </p>\n              <p>The Joint Enterprise Defense Infrastructure Cloud contract - or JEDI as it is known - is part of a broader digital modernization of the Pentagon.</p>\n              <p>The idea is to make it more technologically agile.</p>\n              <p>And specifically to give the military better access to data and the cloud from battlefields and other remote locations.</p>\n              <p>Although the Pentagon boasts the world\'s most 
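The `body_xhtml` field holds the story as a string of `<p>` elements. For text processing it is often easier to work with plain paragraphs. The sketch below uses only the standard library's `html.parser`; the class and function names are hypothetical, and the sample string is a shortened excerpt of the output above.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []  # discard whitespace seen between paragraphs

    def handle_data(self, data):
        self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            text = "".join(self._buffer).strip()
            if text:  # skips empty paragraphs such as <p/>
                self.paragraphs.append(text)
            self._buffer = []


def xhtml_to_paragraphs(body_xhtml):
    """Return the list of paragraph texts contained in a body_xhtml string."""
    parser = ParagraphExtractor()
    parser.feed(body_xhtml)
    parser.close()
    return parser.paragraphs


sample_body = (
    "              <p>Microsoft has won the Pentagon's $10 billion cloud computing contract.</p>\n"
    "              <p>And beaten out favorite Amazon.</p>\n"
    "              <p/>\n"
    "              <p>The U.S. Defense Department made the announcement on Friday (October 25).</p>"
)
for paragraph in xhtml_to_paragraphs(sample_body):
    print(paragraph)
```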

Highlighting company entities in a NewsML-G2 text item using **entityMarkup** (this needs to be applied to a text-based `example_id`).

In [24]:
url = 'http://rmb.reuters.com/rmd/rest/json/item?id=' + example_id + '&entityMarkup=newsml&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'associations': [{'body_xhtml': '              <p>Microsoft has won the Pentagon\'s $10 billion cloud computing contract.</p>\n              <p>And beaten out favorite Amazon.</p>\n              <p/>\n              <p>The U.S. Defense Department made the announcement on Friday (October 25).</p>\n              <p>The contracting process had long been mired in conflict of interest allegations.</p>\n              <p>And had even drawn the attention of President Donald Trump - who has publicly taken swipes at Amazon and its founder Jeff Bezos. </p>\n              <p>The Joint Enterprise Defense Infrastructure Cloud contract - or JEDI as it is known - is part of a broader digital modernization of the Pentagon.</p>\n              <p>The idea is to make it more technologically agile.</p>\n              <p>And specifically to give the military better access to data and the cloud from battlefields and other remote locations.</p>\n              <p>Although the Pentagon boasts the world\'s most 

### 5. Search functionality

The search method is used to obtain a list of news items which match the search criteria you provide. This method provides more advanced filtering options than those offered with the items method as well as full text search capabilities. You can find more examples on p. 49 of the developer guide.

Example query based on the *headline* and *topic* of news articles.

In [25]:
# Search queries can target the following fields: headline, slug, body, caption, topic, entity, ... (as well as fulltext and main)
search_query = 'headline:obama AND iraq||topic:POL||topic:WAR OR SPO'
url = 'http://rmb.reuters.com/rmd/rest/json/search?q=' + search_query + '&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'results': {'mediaTypeBreakdownPercent': {'C': 2,
   'G': 0,
   'P': 80,
   'T': 17,
   'V': 1},
  'numFound': 2886,
  'result': [{'author': 'AKHTAR SOOMRO',
    'channelIds': [175935,
     86083,
     175155,
     175217,
     136867,
     216649,
     204337,
     823,
     198575,
     136717,
     125675,
     149225,
     175219,
     130019,
     134801,
     122999,
     176033,
     812,
     130523,
     125677,
     215187],
    'channels': ['pwu404'],
    'contributorId': 'RTRS',
    'contributorName': 'REUTERS',
    'dateCreated': 1572701625000,
    'destination': ['RPA'],
    'dimensions': '3352x4490',
    'geography': ['PAK'],
    'guid': 'tag:reuters.com,2019:newsml_RC14C207D760',
    'headline': 'Supporters of religious and political party during what participants call Freedom March to protest the government of Prime Minister Imran Khan in Islamabad',
    'id': 'tag:reuters.com,2019:newsml_RC14C207D760:746757794',
    'indexTimestamp': 1572701642141,
    'internalRecei
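The search query above contains spaces, colons and `|` characters, which are not safe to place in a URL as-is. One way to make the encoding explicit is to percent-encode the query before appending it. This is a sketch that assumes the server decodes percent-encoded characters in the usual way.

```python
from urllib.parse import quote, unquote

search_query = 'headline:obama AND iraq||topic:POL||topic:WAR OR SPO'

# safe='' encodes every reserved character, including ':' and '|'.
encoded_query = quote(search_query, safe='')
print(encoded_query)

# Decoding gives back the original query, so no information is lost.
print(unquote(encoded_query) == search_query)
```

`encoded_query` can then replace `search_query` when the URL is concatenated.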

In the next query, we are looking for the keyword *trafficking* in the main text.

(*main*: used to specify the keywords that may be contained within one or more of the following fields of the news items that are retrieved: body, headline, caption, id, topic, signal and slug.)

**Note**: If a **dateRange** is not specified, the results cover the last 24 hours only.

In [26]:
search_query = 'main:trafficking'
url = 'http://rmb.reuters.com/rmd/rest/json/search?q=' + search_query + '&mediaType=T&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'results': {'mediaTypeBreakdownPercent': {'C': 0,
   'G': 0,
   'P': 0,
   'T': 100,
   'V': 0},
  'numFound': 10,
  'result': [{'channelIds': [76637,
     63683,
     37917,
     26786,
     144993,
     626,
     68932,
     159003,
     662,
     176417,
     63692,
     22321,
     34325,
     63693,
     161681,
     57606,
     154815,
     180027,
     63675,
     63699,
     75332,
     46622,
     159011,
     583,
     63667,
     164607,
     150029,
     76633,
     63685,
     31396,
     644,
     76630,
     63646,
     27216,
     27199,
     44352,
     211865,
     180029,
     158991,
     20483,
     142441,
     63708,
     672,
     195519,
     610,
     648,
     635,
     62870,
     22283,
     159015,
     141803,
     23927,
     180025,
     180033,
     180023,
     561,
     41078,
     23926,
     661,
     565,
     601,
     118201,
     31397,
     63669,
     22307,
     2213,
     63643,
     127689,
     176419,
     634,
     31377,
     68945,
 

The query below is looking for the keyword *foundation* in the fulltext.

(*fulltext*: used to specify the keywords that may be contained within one or more of the following fields of the news items that are retrieved: body, headline, caption.)

In [27]:
search_query = 'fulltext:foundation'
url = 'http://rmb.reuters.com/rmd/rest/json/search?q=' + search_query + '&mediaType=T&token=' + authToken + '&format=json'
response = requests.get(url)
json.loads(response.text)

{'results': {'mediaTypeBreakdownPercent': {'C': 0,
   'G': 0,
   'P': 0,
   'T': 100,
   'V': 0},
  'numFound': 11,
  'result': [{'author': 'By Nick Mulvenney',
    'channelIds': [542,
     37917,
     626,
     159003,
     662,
     176417,
     63692,
     22321,
     63697,
     647,
     70666,
     29598,
     63633,
     159011,
     49349,
     583,
     63667,
     599,
     63685,
     644,
     182977,
     63665,
     158991,
     142441,
     63708,
     68936,
     635,
     22370,
     23742,
     159015,
     41078,
     118201,
     23278,
     2213,
     27574,
     564,
     63655],
    'channels': ['STK567'],
    'contributorId': 'RTRS',
    'contributorName': 'Reuters',
    'dateCreated': 1572695906000,
    'destination': ['PSC',
     'LBY',
     'G',
     'DNP',
     'AFN',
     'J',
     'PSP',
     'GNS',
     'RSP',
     'PGE',
     'RWS',
     'CSA',
     'UCDPTEST',
     'UKI',
     'SF',
     'RNP',
     'RWSA',
     'REULB',
     'AFA',
     'RBN'],
    'ge
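Unlike the *items* responses, the *search* results above show `dateCreated` as a large integer. It looks like a Unix timestamp in milliseconds; that interpretation is an assumption based on the values, which fall on the same day as the other output in this notebook. Under that assumption, it can be converted to a timezone-aware `datetime` like this (`from_epoch_millis` is a hypothetical helper).

```python
from datetime import datetime, timezone


def from_epoch_millis(ms):
    """Convert a millisecond Unix timestamp to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Value taken from the search result above.
print(from_epoch_millis(1572695906000).isoformat())
```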

We could also consider creating queries that find content created by the Thomson Reuters Foundation (e.g. 'contributorName': 'Thomson Reuters Foundation', 'contributorId': 'TRFN').
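Short of a dedicated query, one simple alternative is to filter an existing result list on the client side by `contributorId`. The sketch below uses placeholder entries; only the field names and the `RTRS` and `TRFN` identifiers come from this notebook, and `by_contributor` is a hypothetical helper.

```python
def by_contributor(results, contributor_id):
    """Keep only the result entries whose `contributorId` matches."""
    return [entry for entry in results if entry.get("contributorId") == contributor_id]


# Placeholder entries for illustration; the headlines are not real API output.
sample_results = [
    {"contributorId": "RTRS", "headline": "placeholder headline A"},
    {"contributorId": "TRFN", "headline": "placeholder headline B"},
]
print(by_contributor(sample_results, "TRFN"))
```

With a real response, `results` would be the list stored under `['results']['result']` in the search output.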

The query below returns all results within a given date range that include the keywords *slavery* or *trafficking*.

**Note**: The keyword *trafficking* may also match unrelated topics (drugs, etc.), so choose the search terms carefully.

In [28]:
search_query = 'fullText:(slavery OR trafficking)'
current_date = datetime.today().strftime('%Y.%m.%d')
past_date = '2019.10.02'  # Adjust this date: the API only returns articles from the past 30 days.
url_parts = [
    'http://rmb.reuters.com/rmd/rest/json/search?q=',
    search_query,
    '&mediaType=T&token=',
    authToken,
    '&format=json',
    '&dateRange=',
    past_date,
    '-',
    current_date]

url = ''.join(url_parts)
response = requests.get(url)
json.loads(response.text)

{'results': {'mediaTypeBreakdownPercent': {'C': 0,
   'G': 0,
   'P': 0,
   'T': 100,
   'V': 0},
  'numFound': 426,
  'result': [{'channelIds': [37917,
     176417,
     75332,
     159011,
     63446,
     644,
     211865,
     142441,
     159015,
     41078,
     118201,
     2213],
    'channels': ['STK567'],
    'contributorId': 'RTRS',
    'contributorName': 'Reuters',
    'dateCreated': 1572652457000,
    'destination': ['CSA',
     'UCDPTEST',
     'LBY',
     'RWSA',
     'REULB',
     'AFA',
     'GNS',
     'RWS'],
    'guid': 'tag:reuters.com,2019:newsml_L2N27H278',
    'headline': 'Reuters World News Summary',
    'id': 'tag:reuters.com,2019:newsml_L2N27H278:286682283',
    'indexTimestamp': 1572652460109,
    'internalReceivedDate': 1572652460109,
    'language': 'en',
    'mediaType': 'T',
    'priority': 4,
    'signal': ['prodId:TXT', 'pmt:text', 'source:ids', 'edStat:N'],
    'slug': 'BC-WORLD',
    'source': 'Thomson Reuters',
    'version': 286682283},
   {'channe
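Because the start date has to stay within the past 30 days, it can be computed instead of being hard-coded. A minimal sketch, assuming the `YYYY.MM.DD-YYYY.MM.DD` format used by **dateRange** above; `date_range_last_days` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def date_range_last_days(days, today=None):
    """Return a 'YYYY.MM.DD-YYYY.MM.DD' range covering the last `days` days."""
    today = today or datetime.today()
    start = today - timedelta(days=days)
    fmt = '%Y.%m.%d'
    return start.strftime(fmt) + '-' + today.strftime(fmt)


# Range ending today and starting 30 days earlier.
print(date_range_last_days(30))
```

The returned string can be appended after `'&dateRange='` in place of the separate `past_date` and `current_date` values.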