### Embeddings from Language Models (ELMo)

ELMo represents words as vectors (embeddings). It models complex characteristics of word use, such as syntax and semantics, and how these vary across linguistic contexts.

These representations can easily be added to
existing models and significantly improve
state-of-the-art results in tasks such as semantic analysis and named entity recognition (NER).
For more information, please see:
<br />

https://arxiv.org/abs/1802.05365
 
 


## Imports

In [1]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf  # conda install -c conda-forge tensorflow 
import tensorflow_hub as hub  # conda install -c conda-forge tensorflow-hub
from sklearn import preprocessing 

#!python -m spacy download en_core_web_md #you will need to install this on first load
import spacy
from spacy.lang.en import English
from spacy import displacy
nlp = spacy.load('en_core_web_md')
from IPython.display import HTML
import logging
#logging.getLogger('tensorflow').disabled = True #OPTIONAL - to disable outputs from Tensorflow
import plotly.plotly as py  #conda install -c plotly plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

W0430 14:39:45.663673 12660 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


## Data 



Please download the file '01_df_v013.pickle' from the following GitHub directory to your local machine:<br />
 https://github.com/grasshoff/vorlesung2019/tree/master/notebooks/yeghaneh/data
 <br /><br />
The file 01_df_v013 contains annotated (labeled) data from the English version of Kepler's well-known book New Astronomy (Latin: Astronomia nova). It is available in the repository linked above.


In [2]:
importVersion = '013'  # Desired version of the data pickle file (we are currently working with version 013)

In [3]:
path = r'C:\Users\moha\Documents\BZML\01_df_v{0}.pickle'.format(importVersion)  # Set this to the location of the data on your machine; note the "r" (raw string) prefix before the path
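The note about the letter "r" deserves a short illustration. The sketch below shows what happens to backslashes with and without the raw-string prefix, and builds the same file name with `pathlib`. The `C:\temp\new` string and the variable names are made up for the demonstration; the folder is the one used in the cell above and should be replaced with your own.

```python
from pathlib import PureWindowsPath

# Without the r prefix, the sequences \t and \n inside a string literal
# are interpreted as a tab and a newline instead of literal characters.
plain = 'C:\temp\new'
raw = r'C:\temp\new'
print(len(plain), len(raw))  # the raw version keeps both backslashes

# Building the path from parts avoids hand-written separators.
# PureWindowsPath only manipulates the text of the path, so this runs on any OS.
data_dir = PureWindowsPath(r'C:\Users\moha\Documents\BZML')
import_version = '013'
full_path = data_dir / '01_df_v{0}.pickle'.format(import_version)
print(full_path)
```

In the notebook itself, `pd.read_pickle` should accept either the raw string or `str(full_path)`.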

### Pre-processing

In [4]:
dfAstroNova = pd.read_pickle(path)  # The data is stored as a pickle file; read it into a pandas DataFrame
type(dfAstroNova)  

pandas.core.frame.DataFrame

In [5]:
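# Sort the data by book chapter.
# "appendix b" is a non-numeric label, so map it to NaN first;
# the column can then be cast to float and sorted numerically (NaN sorts last).
dfAstroNova['chapter'] = dfAstroNova.chapter.replace("appendix b", np.nan).astype(float)
dfAstroNova.sort_values(by='chapter', inplace=True)
# Restore the original label after sorting
dfAstroNova.chapter.fillna('appendix b', inplace=True)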
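The ordering idea in the cell above (numeric chapters in ascending order, the non-numeric appendix label last) can also be expressed without pandas. The following is only a sketch of that idea, not the notebook's method; `chapter_sort_key` and the sample labels are hypothetical.

```python
def chapter_sort_key(chapter):
    """Sort numeric chapter labels first (ascending), any other label last."""
    try:
        return (0, float(chapter), "")
    except (TypeError, ValueError):
        # Non-numeric labels such as "appendix b" go after all numbers
        return (1, 0.0, str(chapter))


chapters = ["3", "appendix b", "1", "10", "2"]
ordered = sorted(chapters, key=chapter_sort_key)
print(ordered)
```

Sorting the labels as plain strings would place "10" before "2", which is why a numeric conversion is needed in either approach.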

In [6]:
len(dfAstroNova)   # Number of rows.

1605

In [7]:
dfAstroNova.head(5)  

Unnamed: 0,html,text,links,italic,chapter,graphic,table,marginal,sentences,tagged
18,"<p><span class=""anchor"" id=""bookmark0""></span>...",Chapter 1,[],[],1,[],[],[],[Chapter 1],"[[(Chapter, None), (1, NUM)]]"
34,"<p>But before that, I shall prove in this firs...","But before that, I shall prove in this first p...",[],[],1,[],[],[],"[But before that, I shall prove in this first ...","[[(But, None), (before, None), (that, None), (..."
33,<p>But since the sun's mean and apparent motio...,But since the sun's mean and apparent motions*...,[],"[Mysterium cosmographicum,]",1,[],[],[ Terms: * The sun's apparent position is that...,[But since the sun's mean and apparent motions...,"[[(But, None), (since, None), (the, None), (su..."
32,<p>Now the causes and measures of these inequa...,Now the causes and measures of these inequalit...,[],[],1,[],[],[ 5],[Now the causes and measures of these inequali...,"[[(Now, None), (the, None), (causes, None), (a..."
30,"<p>Again, however, it was noticed that these l...","Again, however, it was noticed that these loop...",[],[],1,[],[],[],"[Again, however, it was noticed that these loo...","[[(Again, None), (however, None), (it, None), ..."


In [8]:
sentences_01=dfAstroNova['text']

In [9]:
sentences_01=list(sentences_01)

In [10]:
sentences_01=sentences_01[0:100]  # Because of heavy computation, only the first 100 texts are embedded.

In [11]:
len(sentences_01)

100

## Create sentence embeddings

In [12]:
url = "https://tfhub.dev/google/elmo/2"
embed = hub.Module(url)


W0430 14:40:03.862963 12660 deprecation.py:323] From C:\Users\moha\Anaconda3\lib\site-packages\tensorflow\python\ops\control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


In [13]:
embeddings = embed(
    sentences_01,
    signature="default",
    as_dict=True)["default"]

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



In [14]:
embeddings

<tf.Tensor 'module_apply_default/truediv:0' shape=(100, 1024) dtype=float32>
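The shape `(100, 1024)` means one 1024-dimensional vector per input text. As far as the TF Hub module's documentation describes it, the `default` output is a mean-pooling of the contextualized word vectors of each text, and the tensor name `truediv` above is consistent with an average. The pure-Python sketch below illustrates mean pooling only; the three-dimensional toy vectors and the function name `mean_pool` are assumptions for the example, not ELMo output.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[j] for vec in token_vectors) / count for j in range(dim)]


# Two toy "sentences" with 2 and 3 tokens; each token is a 3-d vector
sentence_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_b = [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0], [4.0, 4.0, 2.0]]

pooled = [mean_pool(s) for s in (sentence_a, sentence_b)]
print(pooled)  # one fixed-size vector per sentence, regardless of token count
```

Because the average has the same size whatever the number of tokens, texts of different lengths end up as rows of one matrix, which is what the later PCA and similarity steps rely on.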

In [15]:
%%time
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(tf.tables_initializer())
  x = sess.run(embeddings)

Wall time: 3min 27s


## Visualize the sentences using PCA and t-SNE

The plot reduces the embeddings to 2D: PCA first reduces the 1024-dimensional vectors to 50 components, and t-SNE then maps these to two dimensions. Colours encode the length of each text in characters.

In [16]:
from sklearn.decomposition import PCA

pca = PCA(n_components=50)
y = pca.fit_transform(x)

from sklearn.manifold import TSNE

y = TSNE(n_components=2).fit_transform(y)
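To make the PCA step less of a black box, the sketch below finds the first principal component of a tiny 2-D data set: centre the data, build the covariance matrix, and find its dominant eigenvector. Power iteration is swapped in here for brevity; scikit-learn's `PCA` uses SVD-based solvers instead. All function names and the toy points are hypothetical.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Dominant eigenvector of cov by power iteration (unit length)."""
    d = len(cov)
    # Fixed start vector; it must not be orthogonal to the dominant direction
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(centered, direction):
    """Coordinates of each centred row along the given direction."""
    return [sum(a * b for a, b in zip(r, direction)) for r in centered]


points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # all on y = 2x
centered = center(points)
pc1 = first_component(covariance(centered))
scores = project(centered, pc1)
print(pc1, scores)
```

In the notebook, PCA keeps 50 such directions before t-SNE is applied, which is a common way to reduce noise and computation for t-SNE.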

In [17]:


init_notebook_mode(connected=True)


data = [
    go.Scatter(
        x=[i[0] for i in y],
        y=[i[1] for i in y],
        mode='markers',
        text=sentences_01,
        marker=dict(
            size=16,
            color=[len(i) for i in sentences_01],  # set color equal to a variable
            opacity=0.8,
            colorscale='Viridis',
            showscale=False
        )
    )
]
layout = dict(
    yaxis=dict(zeroline=False),
    xaxis=dict(zeroline=False)
)
fig = go.Figure(data=data, layout=layout)
file = plot(fig, filename='Sentence encode.html')

#from google.colab import files   #conda install -c conda-forge pydrive 
#files.download('Sentence encode.html')   

## Create a semantic search engine:

In [29]:
#@title Semantic search
#@markdown Enter a set of words to find matching sentences. 'results_returned' can be used to modify the number of matching sentences returned. To view the code behind this cell, use the menu in the top right to unhide...
search_string = "equavalence" #@param {type:"string"}
results_returned = "10" #@param [1, 2, 3]

from sklearn.metrics.pairwise import cosine_similarity

# Embed the query with the same ELMo module used for the corpus
embeddings2 = embed(
    [search_string],
    signature="default",
    as_dict=True)["default"]

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(tf.tables_initializer())
  search_vect = sess.run(embeddings2)

# Rank the embedded texts by cosine similarity to the query vector
cosine_similarities = pd.Series(cosine_similarity(search_vect, x).flatten())
# Whole-word matching for the highlighting (the original test was a substring check)
query_words = search_string.lower().split()
output = ""
for idx, score in cosine_similarities.nlargest(int(results_returned)).items():
  output += '<p style="font-family:verdana; font-size:110%;"> '
  for word in sentences_01[idx].split():
    # Bold the words that also occur in the query
    if word.lower() in query_words:
      output += " <b>" + str(word) + "</b>"
    else:
      output += " " + str(word)
  output += "</p><hr>"

output = '<h3>Results:</h3>' + output
display(HTML(output))


INFO:tensorflow:Saver not created because there are no variables in the graph to restore



The results suggest that the search engine treats "moon" and "eclipse" as closely related!
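The ranking step of the search cell boils down to computing the cosine similarity between the query vector and every text vector and keeping the highest scores. A minimal pure-Python sketch of that step follows; the 2-D toy vectors and the names `cosine` and `top_matches` are assumptions for illustration, whereas the notebook uses scikit-learn's `cosine_similarity` on the 1024-dimensional ELMo vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_matches(query_vec, corpus_vecs, k):
    """Return (index, score) pairs of the k corpus vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]
matches = top_matches(query, corpus, k=2)
print(matches)
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly even though the two vectors differ in magnitude.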