# Recognize KCs of interest in book text

In [3]:
# Read the parsed book text; the context manager closes the file afterwards.
with open("../dat/parsed_books/mml-book.txt", "r") as f:
    raw_text = f.read()

We use the Wikifier (https://wikifier.org) to recognize entities in the text and link them to their corresponding Wikipedia pages. The resulting annotations may then need to be filtered down to the KCs that are relevant.

In [37]:
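import requests
from dotenv import dotenv_values

# Read the Wikifier API key from the project's .env file.
config = dotenv_values("../.env")
userKey = config['WIKIFIER_USER_KEY']

# Annotate only the first 1000 characters of the book for now.
response = requests.post("http://www.wikifier.org/annotate-article",
    data={"userKey": userKey,
          "lang": "en",
          "text": raw_text[:1000],
          "support": "true",  # include the mentions that support each annotation
          "ranges": "true"},
)
response.raise_for_status()  # fail early on an HTTP error
response = response.json()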

In [38]:
print(raw_text[0:1000])

MATHEMATICS  FOR 
MACHINE LEARNING
Marc Peter DeisenrothA. Aldo FaisalCheng Soon Ong
MATHEMATICS FOR MACHINE LEARNING DEISENROTH ET AL.
The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efﬁ  ciently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the

In [70]:
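# Map each annotated Wikipedia title to its URL and to the supporting
# mentions whose pageRank exceeds 0.001.
supports = {
    a['title']: {
        "url": a['url'],
        "occurences": [
            {"start": s['chFrom'], "end": s['chTo'], "pr": s['pageRank']}
            for s in a['support'] if s['pageRank'] > 0.001
        ],
    }
    for a in response['annotations']
}

# Show only the titles that still have at least one mention.
{k: v for k, v in supports.items() if len(v['occurences']) > 0}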

{'Mathematics': {'url': 'http://en.wikipedia.org/wiki/Mathematics',
  'occurences': [{'start': 500, 'end': 510, 'pr': 0.001829006013726964}]},
 'Machine learning': {'url': 'http://en.wikipedia.org/wiki/Machine_learning',
  'occurences': [{'start': 18, 'end': 33, 'pr': 0.003895690124344518},
   {'start': 18, 'end': 33, 'pr': 0.00790830941700353},
   {'start': 101, 'end': 116, 'pr': 0.003895690124344518},
   {'start': 101, 'end': 116, 'pr': 0.00790830941700353},
   {'start': 192, 'end': 207, 'pr': 0.003895690124344518},
   {'start': 583, 'end': 598, 'pr': 0.003895690124344518},
   {'start': 724, 'end': 739, 'pr': 0.003895690124344518},
   {'start': 954, 'end': 969, 'pr': 0.003895690124344518}]},
 'Linear algebra': {'url': 'http://en.wikipedia.org/wiki/Linear_algebra',
  'occurences': [{'start': 217, 'end': 230, 'pr': 0.003866890604590002}]},
 'Linear regression': {'url': 'http://en.wikipedia.org/wiki/Linear_regression',
  'occurences': [{'start': 750, 'end': 766, 'pr': 0.0034936306661461

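Next, we have to clean up overlapping annotations, such as "mixture model", where several annotations cover the same stretch of text. The output above also lists some spans twice with different pageRank values.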
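One possible way to do this cleanup is sketched below. It is not the notebook's own method: the function name `resolve_overlaps` and the rule it applies (keep the longest mention, break ties by the higher pageRank) are assumptions chosen for illustration. It expects a dictionary shaped like `supports` above and treats `start` and `end` as inclusive character offsets, which matches the offsets printed earlier.

```python
def resolve_overlaps(supports):
    """Keep at most one mention per stretch of text.

    Hypothetical helper: prefers longer mentions, then higher pageRank.
    """
    # Collapse exact duplicates (same title and span), keeping the best score.
    best = {}
    for title, info in supports.items():
        for occ in info["occurences"]:
            key = (title, occ["start"], occ["end"])
            best[key] = max(best.get(key, 0.0), occ["pr"])

    mentions = [
        {"title": t, "start": s, "end": e, "pr": pr}
        for (t, s, e), pr in best.items()
    ]
    # Longest mentions first; ties go to the higher pageRank.
    mentions.sort(key=lambda m: (-(m["end"] - m["start"]), -m["pr"]))

    kept = []
    for m in mentions:
        # Offsets are assumed inclusive, so spans that touch count as overlapping.
        if all(m["end"] < k["start"] or m["start"] > k["end"] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m["start"])


# Toy input with made-up offsets and scores, not real Wikifier output.
toy = {
    "Gaussian mixture model": {"url": "", "occurences": [
        {"start": 0, "end": 22, "pr": 0.001}]},
    "Mixture model": {"url": "", "occurences": [
        {"start": 9, "end": 22, "pr": 0.002}]},
    "Machine learning": {"url": "", "occurences": [
        {"start": 30, "end": 45, "pr": 0.003},
        {"start": 30, "end": 45, "pr": 0.007}]},
}
cleaned = resolve_overlaps(toy)
```

With the toy input, the shorter "Mixture model" mention is dropped because it lies inside the longer one, and the duplicated "Machine learning" span is kept once with its higher score. Preferring the longest mention is only one policy; ranking by pageRank first would be an equally plausible choice when the shorter concept is the more relevant KC.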

In [59]:
response

{'annotations': [{'title': 'Mathematics',
   'url': 'http://en.wikipedia.org/wiki/Mathematics',
   'lang': 'en',
   'pageRank': 0.005841219537781039,
   'cosine': 0.2593255174846198,
   'secLang': 'en',
   'secTitle': 'Mathematics',
   'secUrl': 'http://en.wikipedia.org/wiki/Mathematics',
   'wikiDataItemId': 'Q395',
   'wikiDataClasses': [{'itemId': 'Q11862829',
     'enLabel': 'academic discipline'},
    {'itemId': 'Q4671286', 'enLabel': 'academic major'},
    {'itemId': 'Q41511', 'enLabel': 'universal language'},
    {'itemId': 'Q1047113', 'enLabel': 'specialty'},
    {'itemId': 'Q9081', 'enLabel': 'knowledge'},
    {'itemId': 'Q492', 'enLabel': 'memory'},
    {'itemId': 'Q1048607', 'enLabel': 'conviction'},
    {'itemId': 'Q14819853', 'enLabel': 'learning or memory'},
    {'itemId': 'Q34394', 'enLabel': 'belief'},
    {'itemId': 'Q2200417', 'enLabel': 'cognition'},
    {'itemId': 'Q2990593', 'enLabel': 'animal behavior'},
    {'itemId': 'Q54989186', 'enLabel': 'mental state'},
    