In [1]:
# ! pip install spacy

In [2]:
# ! python -m spacy download en

In [3]:
reset -fs

<img src='images/intro.png'>

<img src='images/slide2.png'>

<img src='images/slide3.png'>

<img src='images/slide4a.png'>

In [4]:
import requests
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup

In [5]:
page = "https://pydata.org/seattle2017/schedule/"
html = urlopen(page)
soup = BeautifulSoup(html.read(), "lxml")
links = soup.find_all("span", { "class" : "title" })
print("Number of talks: " + str(len(links)))

Number of talks: 65


In [6]:
links[0]

<span class="title">
<a href="/seattle2017/schedule/presentation/57/">Using CNTK's Python Interface for Deep Learning</a>
</span>

In [7]:
links[64]

<span class="title">
<a href="/seattle2017/schedule/presentation/113/">Robust Algorithms for Machine Learning</a>
</span>

In [10]:
# let's get the title and the hyperlink of one talk
test = links[0]
print(test.get_text())


Using CNTK's Python Interface for Deep Learning



In [11]:
test.get_text().strip(' \t\n\r') # trims spaces of title

"Using CNTK's Python Interface for Deep Learning"

In [12]:
parsed = test.find("a", href=True)
parsed.attrs['href']

'/seattle2017/schedule/presentation/57/'

In [13]:
# Let's get all titles and links to each of the talks
titles = []
page_link = []
for link in links:
    title = link.get_text().strip(' \t\n\r') # trims spaces of link description
    link = link.find("a", href=True).attrs['href']
    titles.append(title)
    page_link.append(link)

In [14]:
titles[:5]

["Using CNTK's Python Interface for Deep Learning",
 'D’oh! Unevenly spaced time series analysis of The Simpsons in Pandas',
 'From Novice to Data Ninja',
 'So you want to be a Python expert?',
 'Introduction to data analytics with pandas']

In [15]:
page_link[:5]

['/seattle2017/schedule/presentation/57/',
 '/seattle2017/schedule/presentation/104/',
 '/seattle2017/schedule/presentation/109/',
 '/seattle2017/schedule/presentation/125/',
 '/seattle2017/schedule/presentation/105/']

> Links and titles parsed 😜
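
The hrefs collected above are site-relative paths, so they need a host before they can be fetched. A minimal sketch of one way to build the absolute URL, using `urljoin` from the standard library (the variable names `base` and `relative` are only illustrative):

```python
from urllib.parse import urljoin

base = "https://pydata.org"
relative = "/seattle2017/schedule/presentation/57/"

# urljoin copes with missing or duplicated slashes between host and path
absolute = urljoin(base, relative)
print(absolute)  # https://pydata.org/seattle2017/schedule/presentation/57/
```

Plain string concatenation gives the same result here; `urljoin` mainly helps when the base has a trailing slash or the href is already absolute.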

### Let's scrape - Talk abstracts

In [16]:
base = 'https://pydata.org'
page2 = base + page_link[0]
html2 = urlopen(page2)
soup2 = BeautifulSoup(html2.read(), "lxml")
abstract_description = soup2.find_all("div", { "class" : "abstract" })

In [17]:
abs_texts = []
for link in abstract_description:
    abstract_text = link.get_text().strip(' \t\n\r') # trims spaces of link description
    abs_texts.append(abstract_text)

In [18]:
abs_texts[0]

'Topics to be covered include ...\n\nCognitive Toolkit (CNTK) installation\nWhat is "machine learning"? [gradient descent example]\nWhat is "learning representations"?\nWhy do Graphics Processing Units (GPUs) help?\nHow do we prevent overfitting?\nCNTK Packages and Modules\nDeep Learning Examples [including Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) examples]'

## Let's get all of our data 😄

In [19]:
abs_texts = []
for p in page_link:
    page2 = base + p
    html2 = urlopen(page2)
    soup2 = BeautifulSoup(html2.read(), "lxml")
    abstract_description = soup2.find_all("div", { "class" : "abstract" })
    
    for link in abstract_description:
        abstract_text = link.get_text().strip(' \t\n\r') # trims spaces of link description
        abstract_text = abstract_text.replace("\n"," ").replace("(","").replace(")","").replace("?","").replace("]","").replace("[","")
        abs_texts.append(abstract_text)
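
The chain of `.replace()` calls above can also be written as a single translation table. This is only a sketch of an equivalent cleanup step; the helper name `clean_abstract` is hypothetical and not part of the original notebook:

```python
# Same substitutions as the chained .replace() calls:
# newlines become spaces; parentheses, brackets and question marks are dropped.
_CLEANUP = str.maketrans({"\n": " ", "(": None, ")": None,
                          "?": None, "[": None, "]": None})

def clean_abstract(text):
    """Trim surrounding whitespace, then apply the translation table."""
    return text.strip(" \t\n\r").translate(_CLEANUP)

sample = 'What is "machine learning"? [gradient descent example]\nWhy do GPUs help?\n'
print(clean_abstract(sample))
# What is "machine learning" gradient descent example Why do GPUs help
```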

In [20]:
len(abs_texts)

65

In [21]:
abs_texts[20]

'PixieDust is a new Python open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark to be more efficient. PixieDust speeds up data manipulation and display with features like:  Automated local install of Python and Scala kernels running with Spark  Realtime Spark Job progress monitoring directly from the Notebook  Use Scala directly in your Python notebook. Variables are automatically transferred from Python to Scala and vice-versa   Auto-visualisation of Spark DataFrames using popular chart engines like Matplotlib, Seaborn, Bokeh, or MapBox   Seamless integration to cloud services  Create embedded apps with your own visualisations or apps using the PixieDust extensibility APIs Come along and learn how you can use this tool in your own projects to visualise and explore data effortlessly with no coding. If you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being a

## Sample data

In [22]:
# talk # 20
titles[20]

'PixieDust - make Jupyter Python Notebooks with Apache Spark Faster, Flexible, and Easier to use'

In [23]:
abs_texts[20]

'PixieDust is a new Python open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark to be more efficient. PixieDust speeds up data manipulation and display with features like:  Automated local install of Python and Scala kernels running with Spark  Realtime Spark Job progress monitoring directly from the Notebook  Use Scala directly in your Python notebook. Variables are automatically transferred from Python to Scala and vice-versa   Auto-visualisation of Spark DataFrames using popular chart engines like Matplotlib, Seaborn, Bokeh, or MapBox   Seamless integration to cloud services  Create embedded apps with your own visualisations or apps using the PixieDust extensibility APIs Come along and learn how you can use this tool in your own projects to visualise and explore data effortlessly with no coding. If you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being a

## Feature Engineering

### Tokenizing data

In [24]:
import spacy
# spaCy for tokenization and lemmatization

In [25]:
en_nlp = spacy.load('en')

In [26]:
doc_spacy = en_nlp(abs_texts[20])

In [27]:
# simple tokenizer
print([token.lower_ for token in doc_spacy])

['pixiedust', 'is', 'a', 'new', 'python', 'open', 'source', 'library', 'that', 'helps', 'data', 'scientists', 'and', 'developers', 'working', 'in', 'jupyter', 'notebooks', 'and', 'apache', 'spark', 'to', 'be', 'more', 'efficient', '.', 'pixiedust', 'speeds', 'up', 'data', 'manipulation', 'and', 'display', 'with', 'features', 'like', ':', ' ', 'automated', 'local', 'install', 'of', 'python', 'and', 'scala', 'kernels', 'running', 'with', 'spark', ' ', 'realtime', 'spark', 'job', 'progress', 'monitoring', 'directly', 'from', 'the', 'notebook', ' ', 'use', 'scala', 'directly', 'in', 'your', 'python', 'notebook', '.', 'variables', 'are', 'automatically', 'transferred', 'from', 'python', 'to', 'scala', 'and', 'vice', '-', 'versa', '  ', 'auto', '-', 'visualisation', 'of', 'spark', 'dataframes', 'using', 'popular', 'chart', 'engines', 'like', 'matplotlib', ',', 'seaborn', ',', 'bokeh', ',', 'or', 'mapbox', '  ', 'seamless', 'integration', 'to', 'cloud', 'services', ' ', 'create', 'embedded', 

In [28]:
# Lemmatization
print("Lemmatization:")
print([token.lemma_ for token in doc_spacy])

Lemmatization:
['pixiedust', 'be', 'a', 'new', 'python', 'open', 'source', 'library', 'that', 'help', 'data', 'scientist', 'and', 'developer', 'work', 'in', 'jupyter', 'notebooks', 'and', 'apache', 'spark', 'to', 'be', 'more', 'efficient', '.', 'pixiedust', 'speed', 'up', 'datum', 'manipulation', 'and', 'display', 'with', 'feature', 'like', ':', ' ', 'automate', 'local', 'install', 'of', 'python', 'and', 'scala', 'kernel', 'run', 'with', 'spark', ' ', 'realtime', 'spark', 'job', 'progress', 'monitor', 'directly', 'from', 'the', 'notebook', ' ', 'use', 'scala', 'directly', 'in', '-PRON-', 'python', 'notebook', '.', 'variable', 'be', 'automatically', 'transfer', 'from', 'python', 'to', 'scala', 'and', 'vice', '-', 'versa', '  ', 'auto', '-', 'visualisation', 'of', 'spark', 'dataframes', 'use', 'popular', 'chart', 'engine', 'like', 'matplotlib', ',', 'seaborn', ',', 'bokeh', ',', 'or', 'mapbox', '  ', 'seamless', 'integration', 'to', 'cloud', 'service', ' ', 'create', 'embed', 'apps', 'wi

## Clean up data

In [32]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.corpus import stopwords
import string
import re

In [33]:

STOPLIST = ["’s", "’re", "n't", "'s", "'m", "ca"] + list(ENGLISH_STOP_WORDS)
SYMBOLS = " ".join(string.punctuation).split(" ") + ["–", "-----", "---", "...", "“", "”", "'ve"]

In [34]:
# Convert to lemmas
def tokenizeText(sample):

    # get the tokens using spaCy
    tokens = en_nlp(sample)

    # lemmatize
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas

    # stoplist the tokens
    tokens = [tok for tok in tokens if tok not in STOPLIST]

    # stoplist symbols
    tokens = [tok for tok in tokens if tok not in SYMBOLS]

    # remove leftover whitespace, bullet and dash tokens
    leftovers = {"", " ", "’", "⁃", "—", "•", "\n", "\n\n"}
    tokens = [tok for tok in tokens if tok not in leftovers]

    return tokens

In [35]:
sample1 = tokenizeText(abs_texts[20])
sample1

['pixiedust',
 'new',
 'python',
 'open',
 'source',
 'library',
 'help',
 'data',
 'scientist',
 'developer',
 'work',
 'jupyter',
 'notebooks',
 'apache',
 'spark',
 'efficient',
 'pixiedust',
 'speed',
 'datum',
 'manipulation',
 'display',
 'feature',
 'like',
 'automate',
 'local',
 'install',
 'python',
 'scala',
 'kernel',
 'run',
 'spark',
 'realtime',
 'spark',
 'job',
 'progress',
 'monitor',
 'directly',
 'notebook',
 'use',
 'scala',
 'directly',
 'python',
 'notebook',
 'variable',
 'automatically',
 'transfer',
 'python',
 'scala',
 'vice',
 'versa',
 'auto',
 'visualisation',
 'spark',
 'dataframes',
 'use',
 'popular',
 'chart',
 'engine',
 'like',
 'matplotlib',
 'seaborn',
 'bokeh',
 'mapbox',
 'seamless',
 'integration',
 'cloud',
 'service',
 'create',
 'embed',
 'apps',
 'visualisation',
 'app',
 'use',
 'pixiedust',
 'extensibility',
 'api',
 'come',
 'learn',
 'use',
 'tool',
 'project',
 'visualise',
 'explore',
 'datum',
 'effortlessly',
 'coding',
 'prefer',
 

## TF-IDF

In [414]:
tf_idf_all = TfidfVectorizer(tokenizer=tokenizeText, stop_words=STOPLIST, ngram_range=(1,1))

In [415]:
X = tf_idf_all.fit_transform(abs_texts)

In [417]:
X

<65x1773 sparse matrix of type '<class 'numpy.float64'>'
	with 4110 stored elements in Compressed Sparse Row format>
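
To make the weights stored in `X` less opaque, here is a rough hand computation of TF-IDF on a toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form $\ln\frac{1+n}{1+\mathrm{df}} + 1$, then L2 normalization); the corpus and the helper name `tfidf` are made up for illustration:

```python
import math
from collections import Counter

# Toy corpus of already-tokenized "abstracts" (hypothetical data)
corpus = [
    ["python", "spark", "notebook"],
    ["python", "deep", "learning"],
    ["spark", "spark", "scala"],
]

n_docs = len(corpus)
# document frequency: number of documents each term appears in
df = Counter(term for doc in corpus for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    # L2-normalize so the document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in corpus:
    print(tfidf(doc))
```

In the first document, "notebook" occurs in only one document and therefore ends up with a higher weight than "python", which occurs in two.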

In [418]:
import numpy as np
import pandas as pd

In [419]:
docs = pd.DataFrame(X.todense())

In [420]:
docs.shape

(65, 1773)

In [421]:
docs.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1763,1764,1765,1766,1767,1768,1769,1770,1771,1772
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<img src='images/slide5.png'>

To compensate for the effect of document length, the standard way of quantifying the similarity between two documents $d_1$ and $d_2$ is to compute the cosine similarity of their vector representations.
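
Written out, the cosine similarity is $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$. A small sketch with made-up count vectors shows why this removes the effect of length; the `cosine` helper and the three vectors are illustrative only:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]   # hypothetical term counts
long_doc = [3, 6, 0]    # same word proportions, three times as long
other_doc = [0, 1, 4]   # mostly different vocabulary

print(cosine(short_doc, long_doc))   # close to 1: length does not matter
print(cosine(short_doc, other_doc))  # much smaller
```

If every vector is first scaled to unit length, the denominator becomes 1 and the similarity is just the dot product.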

In [422]:
from sklearn.preprocessing import normalize

In [423]:
# L2-normalize each talk so that cosine similarity reduces to the dot product of two vectors
docs_normed = normalize(docs,axis=1)

### Most used words

In [424]:
summed = docs.sum(axis=0)

In [425]:
summed.shape

(1773,)

In [459]:
max_words = np.argsort(-summed)[:20]
max_words.values

array([ 386, 1685, 1005,  376, 1268,  886,  200, 1576, 1245, 1613, 1056,
       1041,  540,  617,   81,  932,  256,  940,  400,  895])

In [460]:
summed[max_words]

386     3.942727
1685    3.269268
1005    2.764599
376     2.210588
1268    2.202805
886     2.048558
200     2.037208
1576    1.655281
1245    1.641570
1613    1.631103
1056    1.549369
1041    1.474834
540     1.435485
617     1.368837
81      1.338632
932     1.337837
256     1.313613
940     1.312429
400     1.296345
895     1.295913
dtype: float64

In [461]:
# invert the vocabulary so a column index maps back to its term
index_to_word = {index: word for word, index in tf_idf_all.vocabulary_.items()}
for w in max_words:
    print(index_to_word[w])

datum
use
model
data
python
learn
build
talk
provide
time
notebook
network
example
follow
analysis
machine
code
make
deep
level


## Calculating similarity of talks

In [296]:
# docs_similarity is an [S, S] matrix, where S is the number of talks
# a higher docs_similarity[i, j] means talk i and talk j are more similar
docs_similarity = docs_normed.dot(docs_normed.T)
docs_similarity = pd.DataFrame(docs_similarity)

In [297]:
docs_similarity.shape

(65, 65)

In [298]:
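# find the top K most similar talks for each talk
def most_similar_talk(s,topk):
    # position 0 is normally the talk itself (self-similarity is 1.0), so skip it;
    # a row of all zeros would break this assumption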
    similar_ones = s.sort_values(ascending=False)[1:topk+1].index.values
    return pd.Series(similar_ones,index = ["similar#{}".format(i) for i in range(1,topk+1)])

In [302]:
top10 = docs_similarity.apply(most_similar_talk,topk=10,axis=1)

In [303]:
top10.head(10)

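|   | similar#1 | similar#2 | similar#3 | similar#4 | similar#5 | similar#6 | similar#7 | similar#8 | similar#9 | similar#10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | 29 | 47 | 9 | 11 | 31 | 21 | 17 | 6 | 52 |
| 1 | 4 | 2 | 41 | 52 | 30 | 50 | 58 | 11 | 37 | 21 |
| 2 | 1 | 4 | 11 | 40 | 56 | 33 | 32 | 22 | 18 | 8 |
| 3 | 31 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 |
| 4 | 1 | 21 | 11 | 32 | 2 | 20 | 30 | 50 | 22 | 8 |
| 5 | 35 | 24 | 59 | 9 | 34 | 21 | 28 | 29 | 48 | 1 |
| 6 | 50 | 60 | 29 | 13 | 37 | 46 | 24 | 28 | 36 | 15 |
| 7 | 13 | 28 | 39 | 36 | 14 | 26 | 4 | 54 | 37 | 1 |
| 8 | 39 | 52 | 21 | 59 | 56 | 14 | 16 | 1 | 11 | 58 |
| 9 | 29 | 51 | 24 | 31 | 0 | 5 | 47 | 37 | 52 | 21 |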


## Titles

In [225]:
for i,t in enumerate(titles):
    print("{}. {}".format(i, t))

0. Using CNTK's Python Interface for Deep Learning
1. D’oh! Unevenly spaced time series analysis of The Simpsons in Pandas
2. From Novice to Data Ninja
3. So you want to be a Python expert?
4. Introduction to data analytics with pandas
5. Parallelizing Scientific Python with Dask
6. pomegranate: fast and flexible probabilistic modeling in python
7. Effective Visual Studio
8. Data Visualization and Exploration with Python
9. A Quick Primer on TensorFrames: Apache Spark and TensorFlow Together
10. Python Web Sraping
11. Vocabulary Analysis of Job Descriptions
12. Keynote (McKinley & Livestream in Cascade)
13. Scalable Data Science in Python and R on Apache Spark
14. Using Scattertext and the Python NLP Ecosystem for Text Visualization
15. Provenance for Reproducible Data Science
16. Monitoring Displacement Crises with Python: A Humanitarian Project by Data for Democracy
17. Designing for Guidance in Machine Learning
18. Automatic Citation generation with Natural Language Processing
19. P

In [305]:
# print the titles of the talks most similar to talk number `talk_no`
def get_sims(topk, titles, talk_no):
    row = topk.loc[talk_no, :]
    print("Talk:", titles[talk_no])
    print("-"*60)
    # each value in the row is the index of a similar talk
    for idx in row.values:
        print("{}. {}".format(idx, titles[idx]))

In [307]:
get_sims(top10, titles, 0)

Talk: Using CNTK's Python Interface for Deep Learning
------------------------------------------------------------
51. Learn to be a painter using Neural Style Painting
29. Code First, Math Later: Learning Neural Nets Through Implementation and Examples
47. Medical image processing using Microsoft Deep Learning framework (CNTK)
9. A Quick Primer on TensorFrames: Apache Spark and TensorFlow Together
11. Vocabulary Analysis of Job Descriptions
31. Applying the four-step "Embed, Encode, Attend, Predict" framework to predict document similarity
21. Pandas, Pipelines, and Custom Transformers
17. Designing for Guidance in Machine Learning
6. pomegranate: fast and flexible probabilistic modeling in python
52. bqplot - Interactive Data Visualization in Jupyter


In [466]:
get_sims(top10, titles, 26)

Talk: Robust Automated Forecasting in Python and R
------------------------------------------------------------
41. Forecasting Time Series Data at scale with the TICK stack
1. D’oh! Unevenly spaced time series analysis of The Simpsons in Pandas
13. Scalable Data Science in Python and R on Apache Spark
47. Medical image processing using Microsoft Deep Learning framework (CNTK)
7. Effective Visual Studio
28. How to be a 10x Data Scientist
35. Make it Work, Make it Right, Make it Fast - Debugging and Profiling in Dask
11. Vocabulary Analysis of Job Descriptions
36. Python for .NET or .NET for Python
50. Implementing and Training Predictive Customer Lifetime Value Models in Python


In [309]:
get_sims(top10, titles, 36)

Talk: Python for .NET or .NET for Python
------------------------------------------------------------
20. PixieDust - make Jupyter Python Notebooks with Apache Spark Faster, Flexible, and Easier to use
37. Moving notebooks into the cloud: challenges and lessons learned
52. bqplot - Interactive Data Visualization in Jupyter
53. Making packages and packaging "just work"
7. Effective Visual Studio
1. D’oh! Unevenly spaced time series analysis of The Simpsons in Pandas
6. pomegranate: fast and flexible probabilistic modeling in python
27. In-database Machine Learning with Python in SQL Server - Sponsor Talk
62. Writing a Book in Jupyter Notebooks
47. Medical image processing using Microsoft Deep Learning framework (CNTK)


## Next steps
+ Use __"day and time"__ as constraints to create schedule
+ Use additional information such as __'description', 'speaker bio'__
+ Create __semantic embeddings (word2vec, doc2vec, lda2vec)__ and cluster talks
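
As a rough illustration of the first item, a schedule could keep, for each time slot, the candidate talk with the highest similarity score. Everything below is hypothetical: the talk names, slots, scores and the `build_schedule` helper are made up to show the idea, not taken from the conference data.

```python
# Hypothetical candidates: (talk, time slot, similarity to a talk the attendee likes)
candidates = [
    ("talk A", "slot 1", 0.31),
    ("talk B", "slot 1", 0.27),
    ("talk C", "slot 2", 0.22),
    ("talk D", "slot 2", 0.25),
    ("talk E", "slot 3", 0.18),
]

def build_schedule(candidates):
    """Greedy pick: keep the highest-scoring talk in every time slot."""
    best = {}
    for talk, slot, score in candidates:
        if slot not in best or score > best[slot][1]:
            best[slot] = (talk, score)
    return {slot: talk for slot, (talk, _score) in best.items()}

print(build_schedule(candidates))
# {'slot 1': 'talk A', 'slot 2': 'talk D', 'slot 3': 'talk E'}
```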