<a href="https://colab.research.google.com/github/leodenale/ColabExamples/blob/master/notebooks/MiningWebPages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining the Social Web

## Mining Web Pages

This Jupyter Notebook provides an interactive way to follow along with and explore the examples from the video series. The intent behind this notebook is to reinforce the concepts in a fun, convenient, and effective way.

Original repository: [github.com/mikhailklassen/Mining-the-Social-Web-3rd-Edition](https://github.com/mikhailklassen/Mining-the-Social-Web-3rd-Edition)

### Book: Mining the Social Web, 3rd Edition
The official code repository for Mining the Social Web, 3rd Edition (O'Reilly, 2019). The book is available from Amazon and Safari Books Online.

### Mounting Google Drive in Google Colab

In [0]:
from google.colab import drive
drive.mount('/mydrive')

### Navigating to project directory

In [27]:
cd ..

/


In [28]:
ls

bin/      dev/   lib32/  mydrive/  run/    sys/                 usr/
boot/     etc/   lib64/  opt/      sbin/   tensorflow-2.0.0b1/  var/
content/  home/  media/  proc/     srv/    tmp/
datalab/  lib/   mnt/    root/     swift/  tools/


In [29]:
cd mydrive/My Drive/Colab Notebooks/data

/mydrive/My Drive/Colab Notebooks/data


In [30]:
ls

feed.json  speech1.wav  speech2.wav


## Using boilerpipe to extract the text from a web page

Example blog post:
http://radar.oreilly.com/2010/07/louvre-industrial-age-henry-ford.html

In [0]:
import matplotlib
matplotlib.use('Agg')

In [2]:
# May also require the installation of Java runtime libraries
!pip3 install boilerpipe3
from boilerpipe.extract import Extractor

# If you're interested, learn more about how Boilerpipe works by reading
# Christian Kohlschütter's paper: http://www.l3s.de/~kohlschuetter/boilerplate/

URL='https://www.oreilly.com/ideas/ethics-in-data-project-design-its-about-planning'

extractor = Extractor(extractor='ArticleExtractor', url=URL)

print(extractor.getText())

Collecting boilerpipe3
  Downloading https://files.pythonhosted.org/packages/6f/53/13795e30c8b6335b4c21ef408a18bc3217380ef24fef05a4e9dc8252ea63/boilerpipe3-1.1.tar.gz (1.3MB)
Collecting JPype1-py3 (from boilerpipe3)
  Downloading https://files.pythonhosted.org/packages/9b/81/63f5e4202c598f362ee4684b41890f993d6e58309c5d90703f570ab85f62/JPype1-py3-0.5.5.4.tar.gz (88kB)
Collecting charade (from boilerpipe3)
  Downloading https://files.pythonhosted.org/packages/74/26/565610c87e951b8a3182df890589c280a16c5897cfbca97eebd73705e0c6/charade-1.0.3.tar.gz (168kB)
Building wheels for collected packages: boilerpipe3, JPype1-py3, charade
  Building wheel for boilerpipe3 (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/b9/fd/7e/8db2c536d1946876d32cc990bf9e9ea255c56

## Using feedparser to extract the text (and other fields) from an RSS or Atom feed

In [4]:
!pip3 install feedparser
import feedparser # pip install feedparser

FEED_URL='http://feeds.feedburner.com/oreilly/radar/atom'

fp = feedparser.parse(FEED_URL)

for e in fp.entries:
    print(e.title)
    print(e.links[0].href)
    print(e.content[0].value)

Collecting feedparser
  Downloading https://files.pythonhosted.org/packages/91/d8/7d37fec71ff7c9dbcdd80d2b48bcdd86d6af502156fc93846fb0102cb2c4/feedparser-5.2.1.tar.bz2 (192kB)
Building wheels for collected packages: feedparser
  Building wheel for feedparser (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/8c/69/b7/f52763c41c5471df57703a0ef718a32a5e81ee35dcf6d4f97f
Successfully built feedparser
Installing collected packages: feedparser
Successfully installed feedparser-5.2.1
Four short links: 4 July 2019
http://feedproxy.google.com/~r/oreilly/radar/atom/~3/_waxd6utHmY/four-short-links-4-july-2019
<p><em>Debugging AI, Serverless Foundations, YouTube Bans, and Pathological UI</em></p><ol>
<li>
<a href="https://github.com/microsoft/tensorwatch">TensorWatch</a> -- open source Microsoft, <i>a debugging and visualization tool designed for data science, deep learning, and reinforcement learning
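
To see roughly what feedparser abstracts away, here is a minimal sketch that pulls the same three fields (title, first link, content) from an Atom document using only the standard library. The sample feed string and the `parse_atom` helper are assumptions made for illustration; they are not part of feedparser, and real feeds vary far more than this handles.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}

# Hypothetical sample document, standing in for a downloaded feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry>
    <title>First post</title>
    <link href="http://example.com/first"/>
    <content type="html">&lt;p&gt;Hello&lt;/p&gt;</content>
  </entry>
</feed>"""


def parse_atom(xml_text):
    """Return a list of dicts with the title, first link, and content of each entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall('atom:entry', ATOM_NS):
        link = entry.find('atom:link', ATOM_NS)
        entries.append({
            'title': entry.findtext('atom:title', default='', namespaces=ATOM_NS),
            'link': link.get('href') if link is not None else None,
            'content': entry.findtext('atom:content', default='', namespaces=ATOM_NS),
        })
    return entries


entries = parse_atom(SAMPLE_FEED)
print(entries[0]['title'], entries[0]['link'])
```

feedparser additionally copes with RSS variants, malformed XML, and encoding problems, which is why the notebook uses it rather than a hand-rolled parser.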

## Harvesting blog data by parsing feeds

In [5]:
import json
import feedparser
from bs4 import BeautifulSoup

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    if html == "": return ""

    return BeautifulSoup(html, 'html5lib').get_text()

fp = feedparser.parse(FEED_URL)

# Report the number of entries in the feed (not the length of the first title)
print("Fetched {0} entries from '{1}'".format(len(fp.entries), fp.feed.title))

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

out_file = 'feed.json'
with open(out_file, 'w') as f:
    f.write(json.dumps(blog_posts, indent=1))

print('Wrote output file to {0}'.format(out_file))

Fetched 29 entries from 'All - O'Reilly Media'
Wrote output file to feed.json
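
If BeautifulSoup or html5lib is not available, tag stripping can be approximated with the standard library. The sketch below is a hypothetical stand-in for `cleanHtml` (the name `clean_html_stdlib` is made up here). It also drops `<script>` and `<style>` contents and collapses whitespace, so its output may differ from what `get_text()` returns.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html_stdlib(html):
    """Hypothetical stand-in for cleanHtml that needs no third-party packages."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # Flush any text still buffered by the parser.
    # Collapse runs of whitespace left behind by removed tags.
    return ' '.join(''.join(parser.parts).split())


print(clean_html_stdlib('<p>Hello <em>world</em></p><script>var x = 1;</script>'))
```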


## Starting to write a web crawler

In [6]:
import httplib2
import re
from bs4 import BeautifulSoup

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

soup = BeautifulSoup(response, 'html5lib')

links = []

for link in soup.find_all('a', attrs={'href': re.compile("^http(s?)://")}):
    links.append(link.get('href'))

for link in links:
    print(link)

https://www.nytimes.com/es/
https://cn.nytimes.com
https://www.nytimes.com/subscription/multiproduct/lp8HYKU.html?campaignId=6W74R
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://www.nytimes.com/section/todayspaper
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/section/business
https://www.nytimes.com/section/opinion
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/science
https://www.nytimes.com/section/health
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/arts
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/section/food
https://www.nytimes.com/section/travel
https://www.nytimes.com/section/magazine
https://www.nytimes.com/section/t-magazine
https
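
The cell above only collects the links on a single page. One way to turn that into a crawler is a breadth-first traversal with a depth limit, sketched below under a few assumptions: the `crawl` and `extract_links` helpers are hypothetical, link extraction uses the standard library instead of BeautifulSoup, and pages come from an in-memory dictionary so the example runs without network access. A real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, resolved against base_url."""
    parser = _LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ('http', 'https'):
            links.append(absolute)
    return links


def crawl(seed_url, fetch, max_depth=1):
    """Breadth-first crawl. fetch(url) must return HTML text, or None on failure."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # Do not expand links beyond the depth limit.
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_PAGES = {
    'http://example.com/': '<a href="/a">A</a> <a href="http://example.com/b">B</a>',
    'http://example.com/a': '<a href="/">home</a> <a href="/c">C</a>',
    'http://example.com/b': '<a href="mailto:someone@example.com">mail</a>',
    'http://example.com/c': 'no links here',
}

print(crawl('http://example.com/', FAKE_PAGES.get, max_depth=1))
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the HTTP client, so the same function could be driven by `httplib2` as in the cell above.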

## Using NLTK to parse web page data

**Naive sentence detection based on periods**

In [7]:
text = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
print(text.split("."))

['Mr', ' Green killed Colonel Mustard in the study with the candlestick', ' Mr', ' Green is not a very nice fellow', '']
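
The split fails because the period in "Mr." is not a sentence boundary. A slightly less naive sketch, shown below, splits only where sentence-ending punctuation is followed by whitespace and skips tokens found in a small abbreviation list. Both `ABBREVIATIONS` and `split_sentences` are assumptions invented for this illustration, and the approach remains brittle compared with a trained tokenizer.

```python
import re

# Small, hand-picked list for illustration only; real text needs far more.
ABBREVIATIONS = {'mr.', 'mrs.', 'ms.', 'dr.', 'prof.'}


def split_sentences(text):
    """Split on ., !, or ? followed by whitespace, unless the token is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?]+\s+', text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # Not a real boundary; keep scanning.
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = ("Mr. Green killed Colonel Mustard in the study with the candlestick. "
        "Mr. Green is not a very nice fellow.")
print(split_sentences(text))
```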


**More sophisticated sentence detection**

In [8]:
import nltk # Installation instructions: http://www.nltk.org/install.html

# Downloading nltk packages used in this example
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [9]:
sentences = nltk.tokenize.sent_tokenize(text)
print(sentences)

['Mr. Green killed Colonel Mustard in the study with the candlestick.', 'Mr. Green is not a very nice fellow.']


In [10]:
harder_example = """My name is John Smith and my email address is j.smith@company.com.
Mostly people call Mr. Smith. But I actually have a Ph.D.!
Can you believe it? Neither can most people..."""

sentences = nltk.tokenize.sent_tokenize(harder_example)
print(sentences)

['My name is John Smith and my email address is j.smith@company.com.', 'Mostly people call Mr. Smith.', 'But I actually have a Ph.D.!', 'Can you believe it?', 'Neither can most people...']


**Word tokenization**

In [11]:
text = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
sentences = nltk.tokenize.sent_tokenize(text)

tokens = [nltk.word_tokenize(s) for s in sentences]
print(tokens)

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.'], ['Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.']]


**Part of speech tagging for tokens**

In [13]:
# Downloading nltk packages used in this example
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_treebank_pos_tagger')

pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]
print(pos_tagged_tokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!
[[('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')], [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')]]


**Alphabetical list of part-of-speech tags used in the Penn Treebank Project**

See: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

| # | POS Tag | Meaning |
|:-:|:-------:|:--------|
| 1 | CC | Coordinating conjunction |
| 2 | CD | Cardinal number |
| 3 | DT | Determiner |
| 4 | EX | Existential there |
| 5 | FW | Foreign word |
| 6 | IN | Preposition or subordinating conjunction |
| 7 | JJ | Adjective |
| 8 | JJR | Adjective, comparative |
| 9 | JJS | Adjective, superlative |
| 10 | LS | List item marker |
| 11 | MD | Modal |
| 12 | NN | Noun, singular or mass |
| 13 | NNS | Noun, plural |
| 14 | NNP | Proper noun, singular |
| 15 | NNPS | Proper noun, plural |
| 16 | PDT | Predeterminer |
| 17 | POS | Possessive ending |
| 18 | PRP | Personal pronoun |
| 19 | PRP\$ | Possessive pronoun |
| 20 | RB | Adverb |
| 21 | RBR | Adverb, comparative |
| 22 | RBS | Adverb, superlative |
| 23 | RP | Particle |
| 24 | SYM | Symbol |
| 25 | TO | to |
| 26 | UH | Interjection |
| 27 | VB | Verb, base form |
| 28 | VBD | Verb, past tense |
| 29 | VBG | Verb, gerund or present participle |
| 30 | VBN | Verb, past participle |
| 31 | VBP | Verb, non-3rd person singular present |
| 32 | VBZ | Verb, 3rd person singular present |
| 33 | WDT | Wh-determiner |
| 34 | WP | Wh-pronoun |
| 35 | WP\$ | Possessive wh-pronoun |
| 36 | WRB | Wh-adverb |
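
As a quick illustration of working with these tags, the sketch below counts the tags in the first tagged sentence from the output above and picks out the nouns. The tagged tokens are copied by hand so the example runs without NLTK. The `TAG_MEANINGS` dictionary is a hypothetical excerpt of the table, not something NLTK provides under that name.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output above (first sentence only).
tagged = [('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'),
          ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'),
          ('the', 'DT'), ('study', 'NN'), ('with', 'IN'),
          ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')]

# A few meanings from the table, keyed by tag.
TAG_MEANINGS = {
    'NNP': 'Proper noun, singular',
    'NN': 'Noun, singular or mass',
    'VBD': 'Verb, past tense',
    'IN': 'Preposition or subordinating conjunction',
    'DT': 'Determiner',
}

tag_counts = Counter(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(tag, count, TAG_MEANINGS.get(tag, '(not in the table excerpt)'))

# Tags that start with NN cover all four noun classes (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in tagged if tag.startswith('NN')]
print(nouns)
```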

**Named entity extraction/chunking for tokens**

In [14]:
# Downloading nltk packages used in this example
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [0]:
jim = "Jim bought 300 shares of Acme Corp. in 2006."

tokens = nltk.word_tokenize(jim)
jim_tagged_tokens = nltk.pos_tag(tokens)

ne_chunks = nltk.chunk.ne_chunk(jim_tagged_tokens)

In [31]:
ne_chunks

[Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]), Tree('PERSON', [('Green', 'NNP')]), ('killed', 'VBD'), Tree('ORGANIZATION', [('Colonel', 'NNP'), ('Mustard', 'NNP')]), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')]),
 Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]), Tree('ORGANIZATION', [('Green', 'NNP')]), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')])]

In [32]:
ne_chunks = [nltk.chunk.ne_chunk(ptt) for ptt in pos_tagged_tokens]

ne_chunks[0].pprint()
ne_chunks[1].pprint()

(S
  (PERSON Mr./NNP)
  (PERSON Green/NNP)
  killed/VBD
  (ORGANIZATION Colonel/NNP Mustard/NNP)
  in/IN
  the/DT
  study/NN
  with/IN
  the/DT
  candlestick/NN
  ./.)
(S
  (PERSON Mr./NNP)
  (ORGANIZATION Green/NNP)
  is/VBZ
  not/RB
  a/DT
  very/RB
  nice/JJ
  fellow/NN
  ./.)


In [33]:
ne_chunks[0]

Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]), Tree('PERSON', [('Green', 'NNP')]), ('killed', 'VBD'), Tree('ORGANIZATION', [('Colonel', 'NNP'), ('Mustard', 'NNP')]), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')])

In [24]:
ne_chunks[1]

Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]), Tree('ORGANIZATION', [('Green', 'NNP')]), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')])

## Using NLTK’s NLP tools to process human language in blog data

In [34]:
import json
import nltk

BLOG_DATA = "feed.json"

# Use a context manager so the file handle is closed after loading
with open(BLOG_DATA) as f:
    blog_data = json.load(f)

# Download nltk packages used in this example
nltk.download('stopwords')

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts.

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    ']',
    '[',
    '...'
    ]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [35]:
for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Remove stopwords from fdist
    for sw in stop_words:
        del fdist[sw]
   
    # Basic stats

    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once
    num_hapaxes = len(fdist.hapaxes())

    top_10_words_sans_stop_words = fdist.most_common(10)

    print(post['title'])
    print('\tNum Sentences:'.ljust(25), len(sentences))
    print('\tNum Words:'.ljust(25), num_words)
    print('\tNum Unique Words:'.ljust(25), num_unique_words)
    print('\tNum Hapaxes:'.ljust(25), num_hapaxes)
    print('\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
          '\n\t\t'.join(['{0} ({1})'.format(w[0], w[1]) for w in top_10_words_sans_stop_words]))
    print()

Four short links: 21 August 2017
	Num Sentences:           10
	Num Words:               140
	Num Unique Words:        113
	Num Hapaxes:             93
	Top 10 Most Frequent Words (sans stop words):
		 signals (5)
		cloud (4)
		application (3)
		drone (3)
		operations (2)
		machine (2)
		learning (2)
		radio (2)
		flying (2)
		cameras (2)

6 practical guidelines for implementing conversational AI
	Num Sentences:           69
	Num Words:               908
	Num Unique Words:        528
	Num Hapaxes:             354
	Top 10 Most Frequent Words (sans stop words):
		 ’ (21)
		“ (21)
		” (21)
		conversational (15)
		bots (7)
		says (7)
		interaction (7)
		must (7)
		user (7)
		kai (7)

Four short links: 18 August 2017
	Num Sentences:           16
	Num Words:               263
	Num Unique Words:        204
	Num Hapaxes:             173
	Top 10 Most Frequent Words (sans stop words):
		 hype (9)
		jobs (5)
		technologies (5)
		cycle (5)
		’ (5)
		bayesian (4)
		years (4)
		style (3)
		cycles (3)
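
The statistics printed above do not depend on NLTK in any deep way. As a rough sketch, the same quantities can be computed with `collections.Counter`. The regular-expression tokenizer and the tiny `STOP_WORDS` set below are simplifying assumptions, so counts on real posts will not match NLTK's exactly.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead.
STOP_WORDS = {'the', 'a', 'is', 'in', 'and', 'of'}


def basic_stats(text):
    """Return word count, unique word count, hapax count, and top words for text."""
    # Crude tokenizer (an assumption for this sketch): lowercase runs of letters/apostrophes.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    hapaxes = [w for w, c in counts.items() if c == 1]  # Words that appear exactly once.
    return {
        'num_words': sum(counts.values()),
        'num_unique_words': len(counts),
        'num_hapaxes': len(hapaxes),
        'top_words': counts.most_common(3),
    }


stats = basic_stats("The drone is flying. The drone and the camera record the signals in the cloud.")
print(stats)
```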

## A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences

In [0]:
import json
import nltk
import numpy

BLOG_DATA = "feed.json"

# Use a context manager so the file handle is closed after loading
with open(BLOG_DATA) as f:
    blog_data = json.load(f)

N = 100  # Number of words to consider
CLUSTER_THRESHOLD = 5  # Distance between words to consider
TOP_SENTENCES = 5  # Number of sentences to return for a "top n" summary

In [0]:
stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    '>',
    '<',
    '...'
    ]

In [0]:
# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn
def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = 0

    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:

        word_idx = []

        # For each word in the word list...
        for w in important_words:
            try:
                # Compute an index for where any important words occur in the sentence.
                word_idx.append(s.index(w))
            except ValueError: # w not in this particular sentence
                pass

        word_idx.sort()

        # It is possible that some sentences may not contain any important words at all.
        # Still advance sentence_idx so later scores stay aligned with their sentences.
        if len(word_idx) == 0:
            sentence_idx += 1
            continue

        # Using the word index, compute clusters by using a max distance threshold
        # for any two consecutive words.

        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # Score each cluster. The max score for any given cluster is the score 
        # for the sentence.

        max_cluster_score = 0
        
        for c in clusters:
            significant_words_in_cluster = len(c)
            # true clusters also contain insignificant words, so we get 
            # the total cluster length by checking the indices
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster**2 / total_words_in_cluster

            if score > max_cluster_score:
                max_cluster_score = score

        scores.append((sentence_idx, max_cluster_score))
        sentence_idx += 1

    return scores
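
To make the cluster scoring concrete, here is a small worked sketch of just that step, separated from tokenization. The helper name `max_cluster_score` is an assumption for this example. It takes the sorted positions of the important words in one sentence and applies the same rule as the function above: positions closer than the threshold share a cluster, and each cluster scores the square of its important-word count divided by its span.

```python
def max_cluster_score(word_idx, threshold=5):
    """Score a sentence from the sorted positions of its important words.

    Positions closer together than threshold share a cluster; each cluster
    scores (important words)^2 / (span of the cluster in words).
    """
    if not word_idx:
        return 0.0
    clusters = [[word_idx[0]]]
    for prev, cur in zip(word_idx, word_idx[1:]):
        if cur - prev < threshold:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return max(len(c) ** 2 / (c[-1] - c[0] + 1) for c in clusters)


# Important words at token positions 0, 2, 3 and 10:
# clusters are [0, 2, 3] (score 3^2 / 4 = 2.25) and [10] (score 1^2 / 1 = 1.0).
print(max_cluster_score([0, 2, 3, 10]))
```

Dense runs of important words score highest, which is why a sentence with three important words packed into four tokens beats one with the same words spread far apart.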

In [0]:
def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]

    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)
    
    # Remove stopwords from fdist
    for sw in stop_words:
        del fdist[sw]

    top_n_words = [w[0] for w in fdist.most_common(N)]

    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter

    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences

    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries

    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])
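
The two selection strategies can be tried in isolation on a handful of made-up scores. The sketch below uses the standard library's `statistics` module in place of numpy, and `select_sentences` is a hypothetical helper, not part of the notebook's code. The scores are invented purely to show how the two filters differ.

```python
from statistics import mean, pstdev


def select_sentences(scored, top_n=2, std_fraction=0.5):
    """Apply both filters to a list of (sentence_index, score) pairs."""
    scores = [score for _, score in scored]
    # pstdev is the population standard deviation, matching numpy.std's default.
    cutoff = mean(scores) + std_fraction * pstdev(scores)
    mean_scored = [(idx, score) for idx, score in scored if score > cutoff]

    # Keep the top_n highest scores, then restore document order.
    top_scored = sorted(scored, key=lambda s: s[1])[-top_n:]
    top_scored = sorted(top_scored, key=lambda s: s[0])
    return mean_scored, top_scored


scored = [(0, 1.0), (1, 4.0), (2, 2.0), (3, 1.0)]
mean_scored, top_scored = select_sentences(scored)
print(mean_scored)  # sentences scoring above mean + 0.5 * std
print(top_scored)   # two best sentences, in document order
```

The mean-based filter adapts to the score distribution and can return any number of sentences, while the top-n filter always returns a fixed number.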

In [40]:
for post in blog_data: 
    post.update(summarize(post['content']))

    print(post['title'])
    print('=' * len(post['title']))
    print()
    print('Top N Summary')
    print('-------------')
    print(' '.join(post['top_n_summary']))
    print()
    print('Mean Scored Summary')
    print('-------------------')
    print(' '.join(post['mean_scored_summary']))
    print()

Four short links: 21 August 2017

Top N Summary
-------------
Cloud Operations, Machine Learning Radio, Flying Cameras, and Text Organization

Paracloud: Bringing Application Insight into Cloud Operations -- In this work, we propose a uniform Paracloud interface (PaCI) to enable a bi-directional communication channel between application containers and the cloud management substrate. An application knows how it's doing, which it reports through this interface so the cloud management layer can figure how/when to migrate, scale, load balance. (via A Paper A Day)

DARPA Wants Machine Learning for Radio Signals -- An RFMLS would be able to discern subtle differences in the RF signals among identical, mass-manufactured IoT devices and identify signals intended to spoof or hack into these devices. “We want to ... stand up an RF forensics capability to identify unique and peculiar signals amongst the proverbial cocktail party of signals out there,” Tilghman said. XPose: Reinventing User Intera

## Visualizing document summarization results with HTML output

In [41]:
import os
from IPython.display import IFrame, display

HTML_TEMPLATE = """<html>
    <head>
        <title>{0}</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>{1}</body>
</html>"""

for post in blog_data:
   
    # Uses previously defined summarize function.
    post.update(summarize(post['content']))

    # You could also store a version of the full post with key sentences marked up
    # for analysis with simple string replacement...

    for summary_type in ['top_n_summary', 'mean_scored_summary']:
        post[summary_type + '_marked_up'] = '<p>{0}</p>'.format(post['content'])
        
        for s in post[summary_type]:
            post[summary_type + '_marked_up'] = \
            post[summary_type + '_marked_up'].replace(s, '<strong>{0}</strong>'.format(s))

        filename = post['title'].replace("?", "") + '.summary.' + summary_type + '.html'
        
        f = open(os.path.join(filename), 'wb')
        html = HTML_TEMPLATE.format(post['title'] + ' Summary', post[summary_type + '_marked_up'])    
        f.write(html.encode('utf-8'))
        f.close()

        print("Data written to", f.name)

# Display any of these files with an inline frame. This displays the
# last file processed by using the last value of f.name...
print()
print("Displaying {0}:".format(f.name))
display(IFrame('files/{0}'.format(f.name), '100%', '600px'))

Data written to Four short links: 21 August 2017.summary.top_n_summary.html
Data written to Four short links: 21 August 2017.summary.mean_scored_summary.html
Data written to 6 practical guidelines for implementing conversational AI.summary.top_n_summary.html
Data written to 6 practical guidelines for implementing conversational AI.summary.mean_scored_summary.html
Data written to Four short links: 18 August 2017.summary.top_n_summary.html
Data written to Four short links: 18 August 2017.summary.mean_scored_summary.html
Data written to How Ray makes continuous learning accessible and easy to scale.summary.top_n_summary.html
Data written to How Ray makes continuous learning accessible and easy to scale.summary.mean_scored_summary.html
Data written to Julie Stanford on vetting designs through rapid experimentation.summary.top_n_summary.html
Data written to Julie Stanford on vetting designs through rapid experimentation.summary.mean_scored_summary.html
Data written to Jack Daniel on buildin
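
The filenames above are built directly from post titles, so they contain colons and spaces, and the post text is inserted into HTML without escaping. One possible hardening is sketched below. The helpers `safe_filename` and `mark_up` are hypothetical, and the choice of which characters to keep in a filename is an assumption you may want to adjust.

```python
import html
import re


def safe_filename(title, suffix):
    """Replace characters that are awkward in filenames with underscores."""
    stem = re.sub(r'[^A-Za-z0-9._-]+', '_', title).strip('_')
    return stem + suffix


def mark_up(content, key_sentences):
    """Escape content for HTML, then wrap each key sentence in <strong> tags."""
    marked = html.escape(content)
    for sentence in key_sentences:
        escaped = html.escape(sentence)
        marked = marked.replace(escaped, '<strong>{0}</strong>'.format(escaped))
    return '<p>{0}</p>'.format(marked)


print(safe_filename('Four short links: 21 August 2017', '.summary.html'))
print(mark_up('A < B. Cats purr. Dogs bark.', ['Cats purr.']))
```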

## Extracting entities from a text with NLTK

In [42]:
import nltk
import json

BLOG_DATA = "feed.json"

# Use a context manager so the file handle is closed after loading
with open(BLOG_DATA) as f:
    blog_data = json.load(f)

for post in blog_data:

    sentences = nltk.tokenize.sent_tokenize(post['content'])
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    # Flatten the list since we're not using sentence structure
    # and sentences are guaranteed to be separated by a special
    # POS tuple such as ('.', '.')

    pos_tagged_tokens = [token for sent in pos_tagged_tokens for token in sent]

    all_entity_chunks = []
    previous_pos = None
    current_entity_chunk = []
    for (token, pos) in pos_tagged_tokens:

        if pos == previous_pos and pos.startswith('NN'):
            current_entity_chunk.append(token)
        elif pos.startswith('NN'):
            
            if current_entity_chunk != []:
                
                # Note that current_entity_chunk could be a duplicate when appended,
                # so frequency analysis again becomes a consideration

                all_entity_chunks.append((' '.join(current_entity_chunk), pos))
            current_entity_chunk = [token]

        previous_pos = pos

    # Store the chunks as an index for the document
    # and account for frequency while we're at it...

    post['entities'] = {}
    for c in all_entity_chunks:
        post['entities'][c] = post['entities'].get(c, 0) + 1

    # For example, we could display just the title-cased entities

    print(post['title'])
    print('-' * len(post['title']))
    for (entity, pos) in post['entities']:
        if entity.istitle():
            print('\t{0} ({1})'.format(entity, post['entities'][(entity, pos)]))
    print()

Four short links: 21 August 2017
--------------------------------
	Cloud Operations (1)
	Machine Learning Radio (1)
	Cameras (1)
	Text Organization Paracloud (1)
	Bringing Application Insight (1)
	Cloud Operations (1)
	Paracloud (1)
	A Paper A Day (1)
	Radio Signals (1)
	” Tilghman (1)
	Reinventing User Interaction (1)
	Flying Cameras (1)
	Drone (1)
	Tree Sheets (1)
	Nice (1)
	Continue (1)

6 practical guidelines for implementing conversational AI
---------------------------------------------------------
	Apple (1)
	Siri (4)
	Jeff Bezos (1)
	Star Trek (1)
	Alexa (1)
	Joseph Weizenbaum (1)
	Decades (1)
	Andrew Leonard (1)
	Bots (1)
	Mozambique. ” (1)
	Today (1)
	Slack (2)
	Starbucks (1)
	Mastercard (1)
	Macy ’ (1)
	Gartner (1)
	Alexa (3)
	Cortana (2)
	Google Home (1)
	Skipflag (1)
	Use (1)
	Taco Bell ’ (1)
	Google Home (1)
	Organizations (1)
	Start (1)
	Amir Shevat (1)
	Beyond (1)
	Shevat (1)
	Others (1)
	Figure (1)
	Figure (1)
	Screenshots (1)
	Susan Etlinger (1)
	Chris Mullins (1)
	Mi
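
The chunking idea in the loop above (join consecutive tokens that share the same noun tag) can also be written with `itertools.groupby`. The sketch below is an alternative formulation, not the notebook's method. Unlike the loop, it emits the final chunk of the text and labels each chunk with its own tag, so its results can differ slightly. The tagged tokens are written by hand in the shape `nltk.pos_tag` returns.

```python
from itertools import groupby


def noun_chunks(tagged_tokens):
    """Group runs of consecutive tokens that share the same noun tag (NN, NNS, NNP, NNPS)."""
    chunks = []
    for pos, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if pos.startswith('NN'):
            chunks.append((' '.join(token for token, _ in group), pos))
    return chunks


# Tagged tokens in the shape nltk.pos_tag returns; tags here are written by hand.
tagged = [('Jeff', 'NNP'), ('Bezos', 'NNP'), ('praised', 'VBD'),
          ('Star', 'NNP'), ('Trek', 'NNP'), ('fans', 'NNS'), ('.', '.')]

print(noun_chunks(tagged))
```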

## Discovering interactions between entities

In [43]:
import nltk
import json

BLOG_DATA = "feed.json"

def extract_interactions(txt):
    sentences = nltk.tokenize.sent_tokenize(txt)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    entity_interactions = []
    for sentence in pos_tagged_tokens:

        all_entity_chunks = []
        previous_pos = None
        current_entity_chunk = []

        for (token, pos) in sentence:

            if pos == previous_pos and pos.startswith('NN'):
                current_entity_chunk.append(token)
            elif pos.startswith('NN'):
                if current_entity_chunk != []:
                    all_entity_chunks.append((' '.join(current_entity_chunk),
                            pos))
                current_entity_chunk = [token]

            previous_pos = pos

        if len(all_entity_chunks) > 1:
            entity_interactions.append(all_entity_chunks)
        else:
            entity_interactions.append([])

    assert len(entity_interactions) == len(sentences)

    return dict(entity_interactions=entity_interactions,
                sentences=sentences)

# Use a context manager so the file handle is closed after loading
with open(BLOG_DATA) as f:
    blog_data = json.load(f)

# Display selected interactions on a per-sentence basis

for post in blog_data:

    post.update(extract_interactions(post['content']))

    print(post['title'])
    print('-' * len(post['title']))
    for interactions in post['entity_interactions']:
        print('; '.join([i[0] for i in interactions]))
    print()

Four short links: 21 August 2017
--------------------------------
Cloud Operations; Machine Learning Radio; Cameras; Text Organization Paracloud; Bringing Application Insight; Cloud Operations; work; Paracloud; interface; PaCI; communication channel; application; containers
application; interface; cloud management layer; how/when; scale
A Paper A Day; DARPA Wants Machine Learning; Radio Signals; RFMLS; differences; RF; signals; IoT; devices; signals
RF; forensics; capability; signals; cocktail party; signals
XPose; Reinventing User Interaction; Flying Cameras
Drone; bunch; shots; path; operators; drone; location; operator; framing; focus
drone; photo; thing


Continue; links

6 practical guidelines for implementing conversational AI
---------------------------------------------------------
organizations; interactions; humans; machines.It; years; Apple; Siri; Jeff Bezos; Star Trek
idea; interfaces; intelligence
MIT; professor; Joseph Weizenbaum; prototype; today; ’
Decades; WIRED; story

## Visualizing interactions between entities with HTML output

In [44]:
import os
import json
import nltk
from IPython.display import IFrame, display

BLOG_DATA = "feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>{0}</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>{1}</body>
</html>"""

# Use a context manager so the file handle is closed after loading
with open(BLOG_DATA) as f:
    blog_data = json.load(f)

for post in blog_data:

    post.update(extract_interactions(post['content']))

    # Display output as markup with entities presented in bold text

    post['markup'] = []

    for sentence_idx in range(len(post['sentences'])):

        s = post['sentences'][sentence_idx]
        for (term, _) in post['entity_interactions'][sentence_idx]:
            s = s.replace(term, '<strong>{0}</strong>'.format(term))

        post['markup'] += [s] 
            
    filename = post['title'].replace("?", "") + '.entity_interactions.html'
    f = open(os.path.join(filename), 'wb')
    html = HTML_TEMPLATE.format(post['title'] + ' Interactions', ' '.join(post['markup']))
    f.write(html.encode('utf-8'))
    f.close()

    print('Data written to', f.name)
    
    # Display each file with an inline frame as soon as it is written.
    
    print('Displaying {0}:'.format(f.name))
    display(IFrame('files/{0}'.format(f.name), '100%', '600px'))

Data written to Four short links: 21 August 2017.entity_interactions.html
Displaying Four short links: 21 August 2017.entity_interactions.html:


Data written to 6 practical guidelines for implementing conversational AI.entity_interactions.html
Displaying 6 practical guidelines for implementing conversational AI.entity_interactions.html:


Data written to Four short links: 18 August 2017.entity_interactions.html
Displaying Four short links: 18 August 2017.entity_interactions.html:


Data written to How Ray makes continuous learning accessible and easy to scale.entity_interactions.html
Displaying How Ray makes continuous learning accessible and easy to scale.entity_interactions.html:


Data written to Julie Stanford on vetting designs through rapid experimentation.entity_interactions.html
Displaying Julie Stanford on vetting designs through rapid experimentation.entity_interactions.html:


Data written to Jack Daniel on building community and historical context in InfoSec.entity_interactions.html
Displaying Jack Daniel on building community and historical context in InfoSec.entity_interactions.html:


Data written to Four short links: 17 August 2017.entity_interactions.html
Displaying Four short links: 17 August 2017.entity_interactions.html:


Data written to Contouring learning rate to optimize neural nets.entity_interactions.html
Displaying Contouring learning rate to optimize neural nets.entity_interactions.html:


Data written to Creating better disaster recovery plans.entity_interactions.html
Displaying Creating better disaster recovery plans.entity_interactions.html:


Data written to Announcing the Rebecca Bace Pioneer Award for Defensive Security.entity_interactions.html
Displaying Announcing the Rebecca Bace Pioneer Award for Defensive Security.entity_interactions.html:


Data written to How synthetic biology startups are building the future at RebelBio.entity_interactions.html
Displaying How synthetic biology startups are building the future at RebelBio.entity_interactions.html:


Data written to The impact of design at Shopify.entity_interactions.html
Displaying The impact of design at Shopify.entity_interactions.html:


Data written to Take the 2018 Data Science Salary Survey.entity_interactions.html
Displaying Take the 2018 Data Science Salary Survey.entity_interactions.html:


Data written to Four short links: 16 August 2017.entity_interactions.html
Displaying Four short links: 16 August 2017.entity_interactions.html:


Data written to Four short links: 15 August 2017.entity_interactions.html
Displaying Four short links: 15 August 2017.entity_interactions.html:


Data written to How to use Presto Sketching to clarify your team’s purpose.entity_interactions.html
Displaying How to use Presto Sketching to clarify your team’s purpose.entity_interactions.html:


Data written to Four short links: 14 August 2017.entity_interactions.html
Displaying Four short links: 14 August 2017.entity_interactions.html:


Data written to A multi-cloud strategy is the foundation for digital transformation.entity_interactions.html
Displaying A multi-cloud strategy is the foundation for digital transformation.entity_interactions.html:


Data written to How to choose a cloud provider.entity_interactions.html
Displaying How to choose a cloud provider.entity_interactions.html:


Data written to Four short links: 11 August 2017.entity_interactions.html
Displaying Four short links: 11 August 2017.entity_interactions.html:


Data written to Mike Roberts on serverless architectures.entity_interactions.html
Displaying Mike Roberts on serverless architectures.entity_interactions.html:


Data written to How to craft a voice user interface that won’t leave you frustrated.entity_interactions.html
Displaying How to craft a voice user interface that won’t leave you frustrated.entity_interactions.html:


Data written to Four short links: 10 August 2017.entity_interactions.html
Displaying Four short links: 10 August 2017.entity_interactions.html:


Data written to Four short links: 9 August 2017.entity_interactions.html
Displaying Four short links: 9 August 2017.entity_interactions.html:


Data written to Deep learning revolutionizes conversational AI.entity_interactions.html
Displaying Deep learning revolutionizes conversational AI.entity_interactions.html:


Data written to Cancer detection, one slice at a time.entity_interactions.html
Displaying Cancer detection, one slice at a time.entity_interactions.html:


Data written to Integrating data with AI.entity_interactions.html
Displaying Integrating data with AI.entity_interactions.html:


Data written to Jupyter Insights: Lorena Barba, an associate professor of mechanical and aerospace engineering.entity_interactions.html
Displaying Jupyter Insights: Lorena Barba, an associate professor of mechanical and aerospace engineering.entity_interactions.html:


Data written to Four short links: 8 August 2017.entity_interactions.html
Displaying Four short links: 8 August 2017.entity_interactions.html:


Data written to Why continuous learning is key to AI.entity_interactions.html
Displaying Why continuous learning is key to AI.entity_interactions.html:


Data written to Four short links: 7 August 2017.entity_interactions.html
Displaying Four short links: 7 August 2017.entity_interactions.html:


Data written to How to move your team closer to clarity.entity_interactions.html
Displaying How to move your team closer to clarity.entity_interactions.html:


Data written to Four short links: 4 August 2017.entity_interactions.html
Displaying Four short links: 4 August 2017.entity_interactions.html:


Data written to JupyterHub on Google Cloud.entity_interactions.html
Displaying JupyterHub on Google Cloud.entity_interactions.html:


Data written to Why AI and machine learning researchers are beginning to embrace PyTorch.entity_interactions.html
Displaying Why AI and machine learning researchers are beginning to embrace PyTorch.entity_interactions.html:


Data written to A DevOps approach to data management.entity_interactions.html
Displaying A DevOps approach to data management.entity_interactions.html:


Data written to Four short links: 3 August 2017.entity_interactions.html
Displaying Four short links: 3 August 2017.entity_interactions.html:


Data written to Declaring variables in Kotlin.entity_interactions.html
Displaying Declaring variables in Kotlin.entity_interactions.html:


Data written to Building—and scaling—a reliable distributed architecture.entity_interactions.html
Displaying Building—and scaling—a reliable distributed architecture.entity_interactions.html:


Data written to Operationalizing security risk.entity_interactions.html
Displaying Operationalizing security risk.entity_interactions.html:


Data written to Reinforcement learning for complex goals, using TensorFlow.entity_interactions.html
Displaying Reinforcement learning for complex goals, using TensorFlow.entity_interactions.html:


Data written to Jay Jacobs on data analytics and security.entity_interactions.html
Displaying Jay Jacobs on data analytics and security.entity_interactions.html:


Data written to Four short links: 2 August 2017.entity_interactions.html
Displaying Four short links: 2 August 2017.entity_interactions.html:


Data written to The wisdom hierarchy: From signals to artificial intelligence and beyond.entity_interactions.html
Displaying The wisdom hierarchy: From signals to artificial intelligence and beyond.entity_interactions.html:


Data written to Four short links: 1 August 2017.entity_interactions.html
Displaying Four short links: 1 August 2017.entity_interactions.html:


Data written to How can I add simple, automated data visualizations and dashboards to Jupyter Notebooks.entity_interactions.html
Displaying How can I add simple, automated data visualizations and dashboards to Jupyter Notebooks.entity_interactions.html:


Data written to From prototype to product with hybrid neural networks.entity_interactions.html
Displaying From prototype to product with hybrid neural networks.entity_interactions.html:


Data written to Four short links: 31 July 2017.entity_interactions.html
Displaying Four short links: 31 July 2017.entity_interactions.html:


Data written to Four short links: 28 July 2017.entity_interactions.html
Displaying Four short links: 28 July 2017.entity_interactions.html:


Data written to Eric Freeman and Elisabeth Robson on design patterns.entity_interactions.html
Displaying Eric Freeman and Elisabeth Robson on design patterns.entity_interactions.html:


Data written to John Whalen on using brain science in design.entity_interactions.html
Displaying John Whalen on using brain science in design.entity_interactions.html:


Data written to When you hear hooves, think horse, not zebra.entity_interactions.html
Displaying When you hear hooves, think horse, not zebra.entity_interactions.html:


Data written to Classifying traffic signs with Apache MXNet: An introduction to computer vision with neural networks.entity_interactions.html
Displaying Classifying traffic signs with Apache MXNet: An introduction to computer vision with neural networks.entity_interactions.html:


Data written to Four short links: 27 July 2017.entity_interactions.html
Displaying Four short links: 27 July 2017.entity_interactions.html:


Data written to R’s tidytext turns messy text into valuable insight.entity_interactions.html
Displaying R’s tidytext turns messy text into valuable insight.entity_interactions.html:


Data written to Four short links: 26 July 2017.entity_interactions.html
Displaying Four short links: 26 July 2017.entity_interactions.html:


Data written to Making great hires in your design organization.entity_interactions.html
Displaying Making great hires in your design organization.entity_interactions.html:


Data written to A lesson in prescriptive modeling.entity_interactions.html
Displaying A lesson in prescriptive modeling.entity_interactions.html:


Data written to Four short links: 25 July 2017.entity_interactions.html
Displaying Four short links: 25 July 2017.entity_interactions.html:


Data written to Data science startups focus on AI-enabled efficiency.entity_interactions.html
Displaying Data science startups focus on AI-enabled efficiency.entity_interactions.html:
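
The file names above are built directly from article titles, so they contain spaces, colons, and typographic punctuation. These characters can be awkward in file systems and in the `files/...` URL passed to `IFrame`. The following is a minimal sketch of one way to handle this, not part of the original notebook: `safe_filename` and `iframe_src` are hypothetical helper names, and the suffix and `files/` prefix are simply copied from the code and output above.

```python
import re
from urllib.parse import quote


def safe_filename(title, suffix='.entity_interactions.html'):
    # Hypothetical helper: collapse every run of characters other than
    # letters, digits, '_', '.' and '-' into a single underscore.
    stem = re.sub(r'[^\w.-]+', '_', title).strip('_')
    return stem + suffix


def iframe_src(filename, prefix='files/'):
    # Hypothetical helper: percent-encode the name so that spaces and
    # colons are valid inside a URL.
    return prefix + quote(filename)


print(safe_filename('Four short links: 21 August 2017'))
print(iframe_src('Four short links: 21 August 2017.entity_interactions.html'))
```

Either approach could be used alone: sanitizing the title before opening the file avoids the problematic characters entirely, while percent-encoding keeps the original file name and only changes the URL given to the frame.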
