
In [0]:
!pip install bokeh


In [0]:
import numpy as np
import os
from random import shuffle
import re

In [0]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [0]:
#!pip install lxml

In [0]:
import urllib.request
import zipfile
import lxml.etree

In [0]:
# Download the dataset if it's not already there: this may take a minute as it is 75MB
if not os.path.isfile('ted_en-20160408.zip'):
    urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")

In [0]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))
del doc

In [0]:
i = input_text.find("Hyowon Gweon: See this?")
input_text[i-20:i+150]

' baby does.\n(Video) Hyowon Gweon: See this? (Ball squeaks) Did you see that? (Ball squeaks) Cool. See this one? (Ball squeaks) Wow.\nLaura Schulz: Told you. (Laughs)\n(Vide'

In [0]:
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)

In [0]:
i = input_text_noparens.find("Hyowon Gweon: See this?")
input_text_noparens[i-20:i+150]

"hat the baby does.\n Hyowon Gweon: See this?  Did you see that?  Cool. See this one?  Wow.\nLaura Schulz: Told you. \n HG: See this one?  Hey Clara, this one's for you. You "

Now let's try to remove speakers' names that occur at the beginning of a line by deleting any prefix of the form "<up to 20 characters>:", such as "Laura Schulz:" in the example above. This is an imperfect heuristic: it also strips any other short phrase that happens to precede a colon at the start of a line.

In [0]:
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

# Uncomment if you need to save some RAM: these strings are about 50MB.
# del input_text, input_text_noparens

# Let's view the first few:
sentences_strings_ted[:5]

["Here are two reasons companies fail: they only do more of the same, or they only do what's new",
 'To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation',
 ' Both are necessary, but it can be too much of a good thing',
 'Consider Facit',
 " I'm actually old enough to remember them"]
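The effect of the speaker-name heuristic is easier to judge on a few hand-picked lines. The sketch below reuses the same regular expression; the helper name `strip_speaker` and the last two sample lines are made up for illustration and do not come from the TED data.

```python
import re

# Same pattern as in the cell above: an optional prefix of at most 20
# non-colon characters followed by ":", then the rest of the line.
SPEAKER_RE = re.compile(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$')

def strip_speaker(line):
    """Return the text after a short leading '<prefix>:' (hypothetical helper)."""
    return SPEAKER_RE.match(line).group('postcolon')

samples = [
    # Short prefix before the colon: removed.
    "Laura Schulz: Told you.",
    # Prefix longer than 20 characters: the line is kept whole.
    "Here are two reasons companies fail: they only do more of the same",
    # Invented line: a false positive, "Remember" is not a speaker.
    "Remember: this is not a speaker name.",
    # Invented line: no colon, so nothing changes.
    "No colon on this line.",
]

for line in samples:
    print(repr(strip_speaker(line)))
```

The third sample shows the main weakness: any short phrase before a colon is treated as a speaker name and dropped.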


Now that we have sentences, we're ready to tokenize each of them into words. This tokenization is imperfect, of course. For instance, how many tokens is "can't", and where/how do we split it? We'll take the simplest naive approach of splitting on spaces. Before splitting, we remove non-alphanumeric characters, such as punctuation. You may want to consider the following question: why do we replace these characters with spaces rather than deleting them? Think of a case where this yields a different answer.

In [0]:
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)

In [0]:
len(sentences_ted), len(sentences_ted[5]), len(sentences_ted[1])

(266694, 5, 20)
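One way to approach the question about spaces versus deletion is to run both variants on the same text. This is only a sketch: `tokenize_by_deleting` is a hypothetical alternative written for comparison, not something used elsewhere in this notebook, and the sample strings are invented.

```python
import re

def tokenize_with_spaces(sentence):
    # The approach used above: every run of non-alphanumeric characters
    # becomes a single space, then we split on whitespace.
    return re.sub(r"[^a-z0-9]+", " ", sentence.lower()).split()

def tokenize_by_deleting(sentence):
    # Hypothetical alternative: delete punctuation outright, keeping only
    # letters, digits and whitespace.
    return re.sub(r"[^a-z0-9\s]+", "", sentence.lower()).split()

for text in ["I can't", "state-of-the-art results,really"]:
    print(tokenize_with_spaces(text), tokenize_by_deleting(text))
```

Deleting glues neighbouring words together whenever punctuation is the only thing separating them, which creates spurious vocabulary entries; replacing with spaces instead splits contractions such as "can't" into two tokens.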

If you store the counts of the top 1000 words in a list called counts_ted_top1000, the code below will plot the histogram requested in the writeup.

In [0]:
from collections import Counter

In [0]:
freq = Counter(p for o in sentences_ted for p in o)
itos = [o for o, c in freq.most_common()]

In [0]:
counts_ted_top1000 = freq.most_common(1000); counts_ted_top1000[:5]

[('the', 207748),
 ('and', 149305),
 ('to', 125169),
 ('of', 114818),
 ('a', 105399)]

In [0]:
import collections
count = collections.Counter()
for sentence in sentences_ted:
    for word in sentence:
        count[word] += 1
words_top_ted = [token_count_pair[0] for token_count_pair in count.most_common(1000)]
counts_ted_top1000 = [token_count_pair[1] for token_count_pair in count.most_common(1000)]

In [0]:
counts_ted_top1000[:5], words_top_ted[:5]

([207748, 149305, 125169, 114818, 105399], ['the', 'and', 'to', 'of', 'a'])
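The two lists can also be produced from a single `most_common` call by unzipping the (word, count) pairs. The sketch below uses a tiny invented corpus (`toy_sentences`) so the result can be checked by hand; the variable names are placeholders.

```python
from collections import Counter

# Invented toy corpus in the same shape as sentences_ted: a list of token lists.
toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]

# Count every token across all sentences in one pass.
toy_count = Counter(word for sentence in toy_sentences for word in sentence)

# most_common returns (word, count) pairs sorted by descending count;
# zip(*pairs) splits them into one tuple of words and one of counts.
top_pairs = toy_count.most_common(2)
words_top, counts_top = (list(column) for column in zip(*top_pairs))

print(words_top, counts_top)
```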

In [0]:
hist, edges = np.histogram(counts_ted_top1000, density=True, bins=100)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Top-1000 words distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)
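With `density=True`, `np.histogram` returns a probability density rather than raw bin counts, so the bar heights multiplied by the bin widths sum to 1. A minimal check on invented numbers (the `counts` array below is not taken from the TED data):

```python
import numpy as np

# Invented, heavily skewed counts standing in for word frequencies.
counts = np.array([100, 50, 20, 10, 5, 5, 2, 1, 1, 1])

hist, edges = np.histogram(counts, bins=5, density=True)

# edges has one more entry than hist: bin i spans edges[i]..edges[i + 1].
area = float(np.sum(hist * np.diff(edges)))
print(len(hist), len(edges), area)
```

This is why the y-axis of the plot above shows small fractional values rather than the number of words per bin.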

In [0]:
!pip install -U gensim

In [0]:
from gensim.models import Word2Vec

In [0]:
model_ted = Word2Vec(sentences_ted, min_count=1)

In [0]:
model_ted.wv.most_similar("man")

[('woman', 0.857656717300415),
 ('guy', 0.8415890336036682),
 ('lady', 0.7838987112045288),
 ('girl', 0.7515023946762085),
 ('boy', 0.7491970062255859),
 ('kid', 0.7147458791732788),
 ('gentleman', 0.7110449075698853),
 ('soldier', 0.7045953273773193),
 ('surgeon', 0.6718157529830933),
 ('john', 0.6670570373535156)]

In [0]:
model_ted.wv.most_similar("computer")

[('software', 0.7263487577438354),
 ('machine', 0.7113944888114929),
 ('interface', 0.6860880851745605),
 ('robot', 0.6775686144828796),
 ('3d', 0.6734641790390015),
 ('chip', 0.6694499254226685),
 ('printer', 0.6630687117576599),
 ('device', 0.6628063917160034),
 ('mechanical', 0.6486104130744934),
 ('program', 0.6475090384483337)]

In [0]:
model_ted.wv.most_similar("machine")

[('device', 0.7576619386672974),
 ('robot', 0.7519630193710327),
 ('computer', 0.7113944888114929),
 ('software', 0.7027640342712402),
 ('printer', 0.6825121641159058),
 ('program', 0.6762884855270386),
 ('3d', 0.6712007522583008),
 ('laser', 0.6686410903930664),
 ('model', 0.6660893559455872),
 ('interface', 0.6593688726425171)]

In [0]:
model_ted.wv.most_similar("learning")

[('designing', 0.6857173442840576),
 ('sharing', 0.6324872970581055),
 ('thinking', 0.5989416837692261),
 ('teaching', 0.5966650247573853),
 ('understanding', 0.5961527228355408),
 ('creativity', 0.5843908786773682),
 ('conscious', 0.573307991027832),
 ('knowledge', 0.5699175596237183),
 ('concerned', 0.5649411678314209),
 ('engaging', 0.5649142265319824)]
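The scores in these lists are cosine similarities between word vectors. The sketch below reimplements the ranking with NumPy on three invented 3-dimensional vectors; `toy_vectors`, `cosine_similarity` and this `most_similar` function are illustrative stand-ins, not gensim's implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product of the two vectors divided by the product of their lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, vectors, topn=3):
    # Score every other word against the query and sort by descending similarity.
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors; real word2vec embeddings have many more dimensions.
toy_vectors = {
    "man": np.array([1.0, 0.2, 0.0]),
    "woman": np.array([0.9, 0.3, 0.1]),
    "computer": np.array([0.0, 0.1, 1.0]),
}

ranking = most_similar("man", toy_vectors)
print(ranking)
```

Because cosine similarity ignores vector length, two words score highly when their vectors point in a similar direction, regardless of how frequent either word is.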

**t-SNE visualization**

To use the t-SNE code below, first put a list of the top 1000 words (as strings) into a variable words_top_ted. The following code gets the corresponding vectors from the model, assuming it's called model_ted:

In [0]:
# This assumes words_top_ted is a list of strings, the top 1000 words
words_top_vec_ted = model_ted.wv[words_top_ted]

In [0]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)

In [0]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

# Wiki Text

In [0]:
if not os.path.isfile('wikitext-103-raw-v1.zip'):
    urllib.request.urlretrieve("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip", filename="wikitext-103-raw-v1.zip")

In [0]:
with zipfile.ZipFile('wikitext-103-raw-v1.zip', 'r') as z:
    input_text = str(z.open('wikitext-103-raw/wiki.train.raw', 'r').read(), encoding='utf-8') # Thanks Robert Bastian

In [0]:
sentences_wiki = []
for line in input_text.split('\n'):
    s = [x for x in line.split('.') if x and len(x.split()) >= 5]
    sentences_wiki.extend(s)
    
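for s_i in range(len(sentences_wiki)):
    # Remove parenthesised text first: once the next line has replaced "(" and ")"
    # with spaces, this pattern can no longer match anything.
    sentences_wiki[s_i] = re.sub(r'\([^)]*\)', '', sentences_wiki[s_i])
    sentences_wiki[s_i] = re.sub("[^a-z]", " ", sentences_wiki[s_i].lower())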
del input_text

In [0]:
# sample 1/5 of the data
shuffle(sentences_wiki)
print(len(sentences_wiki))
sentences_wiki = sentences_wiki[:int(len(sentences_wiki)/5)]
print(len(sentences_wiki))

4267112
853422
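Because `shuffle` is called without a seed, each run keeps a different fifth of the sentences, so the embeddings trained afterwards are not reproducible. One possible alternative, sketched here with a hypothetical helper `subsample` and an invented toy list, is to draw the sample from a seeded `random.Random` instance:

```python
import random

def subsample(items, one_in, seed=0):
    # Hypothetical helper: keep len(items) // one_in items, chosen without
    # replacement by a generator that always starts from the same seed.
    rng = random.Random(seed)
    return rng.sample(items, len(items) // one_in)

# Invented stand-in for sentences_wiki.
toy_sentences = [f"sentence {i}" for i in range(100)]

subset = subsample(toy_sentences, one_in=5)
print(len(toy_sentences), len(subset))
```

Calling `subsample` again with the same seed returns the same sentences in the same order, which makes later results easier to compare across runs.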


Now, repeat all the same steps that you performed above. You should be able to reuse essentially all the code.



In [0]:
sentences_wiki[1]

' the militants who were not killed or captured either managed to flee back to albania or were hiding along the border   according to a kvm monitor '
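Note that each entry of `sentences_wiki` is a single string, whereas `sentences_ted` held lists of tokens. Iterating over a string yields characters, so any step that expects a list of words (counting words, or training `Word2Vec`) needs the string split first. A small illustration on a shortened, hand-written sentence:

```python
# Shortened example in the same shape as an entry of sentences_wiki.
sentence = ' the militants who were not killed '

# Iterating over the string directly gives single characters...
as_characters = list(sentence)[:5]

# ...whereas splitting on whitespace gives the word tokens.
as_tokens = sentence.split()

print(as_characters)
print(as_tokens)
```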

In [0]:
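# Word2Vec expects each sentence as a list of tokens, so split the cleaned strings on whitespace
model_wiki = Word2Vec([s.split() for s in sentences_wiki], min_count=1)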

In [0]:
count_wiki = collections.Counter()
for sentence in sentences_wiki:
    for word in sentence.split():  # sentences_wiki holds strings, so split them into words
        count_wiki[word] += 1
words_top_wiki = [token_count_pair[0] for token_count_pair in count_wiki.most_common(1000)]

In [0]:
# This assumes words_top_wiki is a list of strings, the top 1000 words
words_top_vec_wiki = model_wiki.wv[words_top_wiki]

tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)

In [0]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],
                                    x2=words_top_wiki_tsne[:,1],
                                    names=words_top_wiki))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)