## Microsoft Concept Graph

[Microsoft Concept Graph](https://concept.research.microsoft.com/) is a taxonomy of terms that defines `is-a` relations between concepts.

The graph is available in two forms:
 * A downloadable text file
 * A REST API

Statistics:
 * 5,401,933 unique concepts
 * 12,551,613 unique entities
 * 87,603,947 `is-a` relations

## Using the web service

The web service offers several kinds of probabilistic scores for how strongly a concept belongs to different groups. Details are in the [API documentation](https://concept.research.microsoft.com/Home/Api).
Example request URL: `https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance=microsoft&topK=10`

In [4]:
import urllib.request
import json

def http(x):
    # Fetch the URL and return the response body decoded as UTF-8 text
    with urllib.request.urlopen(x) as response:
        return response.read().decode('utf-8')

def query(x):
    # Return the top 10 concepts for instance x as a {concept: score} dict
    return json.loads(http("https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance={}&topK=10".format(x)))

query('microsoft')

{'company': 0.6105356614382954,
 'vendor': 0.08858636677518003,
 'client': 0.048239124001183784,
 'firm': 0.045476965571668145,
 'large company': 0.043109401203511886,
 'organization': 0.043010752688172046,
 'corporation': 0.035908059583703265,
 'brand': 0.03383644076156654,
 'software company': 0.027522935779816515,
 'technology company': 0.023774292196902438}
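
The `query` helper above places the instance into the URL verbatim, so an instance that contains spaces or other reserved characters has to be escaped first. A minimal sketch of building the request URL with `urllib.parse.urlencode` follows. The helper name `build_score_url` and its `top_k` parameter are hypothetical; only the endpoint and its `instance` and `topK` parameters are taken from the example URL above.

```python
from urllib.parse import urlencode, quote

# Endpoint taken from the example request URL in this notebook
BASE_URL = "https://concept.research.microsoft.com/api/Concept/ScoreByProb"

def build_score_url(instance, top_k=10):
    # quote_via=quote encodes spaces as %20 instead of '+'
    params = urlencode({"instance": instance, "topK": top_k}, quote_via=quote)
    return "{}?{}".format(BASE_URL, params)

print(build_score_url("white house"))
```

The resulting string could be passed directly to the `http` function defined earlier.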

Let's try to categorize news headlines by their key concepts. To fetch the news we use the [NewsApi.org](http://newsapi.org) service. Using it requires registering and obtaining your own API key:

In [5]:
newsapi_key = '7015bc6ae10841679b21676c05bdad97'
def get_news(country='us'):
    res = json.loads(http("https://newsapi.org/v2/top-headlines?country={0}&apiKey={1}".format(country,newsapi_key)))
    return res['articles']

all_titles = [x['title'] for x in get_news('us')+get_news('gb')]

Extract the noun phrases from the headlines:

In [9]:
import sys
!{sys.executable} -m textblob.download_corpora
from textblob import TextBlob

[nltk_data] Downloading package brown to /home/nbuser/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/nbuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nbuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package conll2000 to /home/nbuser/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /home/nbuser/nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/nbuser/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


In [10]:
w = {}
for x in all_titles:
    for n in TextBlob(x).noun_phrases:
        if n in w:
            w[n].append(x)
        else:
            w[n]=[x]
{ x:len(w[x]) for x in w.keys()}

{'chapo': 2,
 'anxious juror': 2,
 'britain': 1,
 'eu agree': 1,
 'brexit': 1,
 'federal judge orders review': 1,
 'provisional ballots': 1,
 'georgia': 1,
 'delays deadline': 1,
 'terrorist attacks': 1,
 'trump': 4,
 'twitter': 1,
 'gesture': 2,
 'nazi': 2,
 'wis.': 2,
 'amazon': 1,
 'billion': 1,
 'queens': 1,
 'begins': 1,
 'fight': 1,
 'worth': 1,
 'white house': 2,
 'matt whitaker': 1,
 'temporary replacement': 1,
 'sessions': 1,
 'stan lee': 2,
 'true legacy': 1,
 'complicated cosmic mystery': 1,
 'ocasio-cortez': 1,
 'joins climate': 1,
 'pelosi': 1,
 "'s office": 1,
 'ruth bader ginsburg': 1,
 'supreme court': 1,
 'session': 1,
 'breaking ribs': 1,
 'improve': 1,
 'midlothian': 1,
 'police officer': 1,
 'shot security guard': 1,
 'jemel roberson': 1,
 'hate': 1,
 'new fbi data': 1,
 'cnn': 2,
 'sues president': 1,
 'jim acosta': 2,
 "'s press pass": 1,
 'california wildfires updates': 2,
 'dead': 2,
 'camp fire': 2,
 'expected': 2,
 'rise': 2,
 'halted': 1,
 'vatican': 1,
 'us 
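
The grouping loop above could also be written with `collections.defaultdict`, which removes the explicit membership check. This is only a sketch: `group_by_phrase` and its `extract` parameter are hypothetical names, and in this notebook `extract` would be something like `lambda t: TextBlob(t).noun_phrases`.

```python
from collections import defaultdict

def group_by_phrase(titles, extract):
    # Map each phrase to the list of titles in which it occurs
    groups = defaultdict(list)
    for title in titles:
        for phrase in extract(title):
            groups[phrase].append(title)
    return dict(groups)

# Stand-in extractor for illustration only: lowercase words split on whitespace
demo = group_by_phrase(["Trump tweets", "Trump sues"], lambda t: t.lower().split())
print({phrase: len(ts) for phrase, ts in demo.items()})
```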

We can see that the noun phrases by themselves do not split the headlines into any meaningful semantic groups. Let's try replacing each noun phrase with the more general concepts obtained from the knowledge graph.

In [16]:
w = {}
for x in all_titles:
    for noun in TextBlob(x).noun_phrases:
        terms = query(noun.replace(' ','%20'))
        for term in [u for u in terms.keys() if terms[u]>0.1]:
            if term in w:
                w[term].append(x)
            else:
                w[term]=[x]
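
The loop above sends one HTTP request per noun phrase, including phrases that repeat across headlines. A minimal sketch of memoizing the lookups with `functools.lru_cache` and keeping only the concepts above a score threshold is shown below. `make_cached_lookup` and `concepts_above` are hypothetical helpers, and the `fetch` argument stands in for the notebook's `query` function.

```python
from functools import lru_cache

def make_cached_lookup(fetch):
    # Wrap fetch so that each distinct phrase is requested only once
    @lru_cache(maxsize=None)
    def lookup(phrase):
        # Store an immutable tuple so cached results cannot be modified by callers
        return tuple(fetch(phrase).items())
    return lookup

def concepts_above(lookup, phrase, threshold=0.1):
    # Keep concepts whose score is strictly greater than the threshold
    return [concept for concept, score in lookup(phrase) if score > threshold]

# Stand-in for query(): returns fixed scores and records every call
calls = []
def fake_fetch(phrase):
    calls.append(phrase)
    return {"company": 0.61, "vendor": 0.09}

lookup = make_cached_lookup(fake_fetch)
print(concepts_above(lookup, "microsoft"), concepts_above(lookup, "microsoft"))
print(len(calls))
```

With the real service, `make_cached_lookup(query)` would play the role of `lookup`.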

In [19]:
{ x:len(w[x]) for x in w.keys() if len(w[x])>3}

{'topic': 6,
 'state': 4,
 'candidate': 4,
 'republican candidate': 4,
 'republican presidential candidate': 4,
 'city': 4,
 'place': 5,
 'concept': 4,
 'word': 9,
 'person': 5}

In [26]:
print('\nCITY:\n'+'\n'.join(w['city']))
print('\nSTATE:\n'+'\n'.join(w['state']))
print('\nPERSON:\n'+'\n'.join(w['person']))
print('\nCANDIDATE:\n'+'\n'.join(w['candidate']))


CITY:
Gesture intended as 'goodbye,' not Nazi salute, says parent who took controversial Wis. photo
Gesture intended as 'goodbye,' not Nazi salute, says parent who took controversial Wis. photo
Tommy Robinson: EDL founder not granted US visa in time to meet Congress members in Washington
Warm up through Friday in Denver before snow on Saturday

STATE:
Federal judge orders review of all provisional ballots in Georgia, delays deadline for certification
Halted from voting by the Vatican, US bishops begin heated debate about anti-sex abuse measures
NJ weather: Nor'easter to dump snow, wintry mix and rain on Thursday
Tommy Robinson: EDL founder not granted US visa in time to meet Congress members in Washington

PERSON:
Midlothian police officer who fatally shot security guard Jemel Roberson placed on administrative leave, officials say
Monica Lewinsky: Bill Clinton's Refusal To Apologize Is Proof Of 'What Power Looks Like'
I'm A Celebrity 2018 line-up: Who will be the mystery contestant?
M

## Using the "raw" data

The raw data (about 320 MB) can be downloaded from the site as follows:

In [1]:
!wget https://concept.research.microsoft.com/Home/StartDownload

--2018-11-13 16:45:09--  https://concept.research.microsoft.com/Home/StartDownload
Resolving webproxy (webproxy)... 10.36.64.1
Connecting to webproxy (webproxy)|10.36.64.1|:3128... connected.
Proxy request sent, awaiting response... 302 Found
Location: /Home/DownloadData?key=MV3tkEjOI0bN8OH8UDZ8sN3AxhLwZHvz&h=106218505 [following]
--2018-11-13 16:45:09--  https://concept.research.microsoft.com/Home/DownloadData?key=MV3tkEjOI0bN8OH8UDZ8sN3AxhLwZHvz&h=106218505
Reusing existing connection to concept.research.microsoft.com:443.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘StartDownload’

StartDownload           [      <=>           ] 320.76M  3.07MB/s    in 1m 49s  

2018-11-13 16:46:58 (2.95 MB/s) - ‘StartDownload’ saved [336346900]



In [7]:
!mv StartDownload concept.zip
!unzip -p concept.zip data-concept/data-concept-instance-relations.txt | head -10

mv: cannot stat 'StartDownload': No such file or directory
factor	age	35167
free rich company datum	size	33222
free rich company datum	revenue	33185
state	california	18062
supplement	msm glucosamine sulfate	15942
factor	gender	14230
factor	temperature	13660
metal	copper	11142
issue	stress pain depression sickness	11110
variable	age	9375
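
Judging by the sample above, each line appears to hold a concept, an instance, and a count, separated by tabs. Under that assumption, a relative-frequency estimate of how likely each concept is for a given instance can be sketched as follows. The function name `concept_shares` is hypothetical, and whether this matches the score returned by the web service's `ScoreByProb` endpoint is not confirmed here.

```python
from collections import defaultdict

def concept_shares(lines, instance):
    # Sum the counts per concept for the requested instance
    counts = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three columns
        concept, inst, count = parts
        if inst == instance:
            counts[concept] += int(count)
    total = sum(counts.values())
    # Normalize so that the shares for this instance sum to 1
    return {c: n / total for c, n in counts.items()} if total else {}

# A few rows copied from the sample output above
sample = [
    "factor\tage\t35167",
    "state\tcalifornia\t18062",
    "variable\tage\t9375",
]
shares = concept_shares(sample, "age")
print(shares)
```

For the full file, the lines could be read straight from the archive with the standard `zipfile` module instead of unpacking it to disk.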


In [27]:
!rm concept.zip
!rm -rf data-concept