**Copyright 2019 The TensorFlow Hub Authors.**

Licensed under the Apache License, Version 2.0 (the "License");

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3
# Copyright 2019 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Cross-Lingual Similarity and Semantic Search Engine with Multilingual Universal Sentence Encoder


<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/hub/tutorials/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/hub/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
  <td>
    <a href="https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"><img src="https://www.tensorflow.org/images/hub_logo_32px.png" />See TF Hub model</a>
  </td>
</table>

This notebook illustrates how to access the Multilingual Universal Sentence Encoder module and use it for sentence similarity across multiple languages. This module is an extension of the [original Universal Sentence Encoder module](https://tfhub.dev/google/universal-sentence-encoder/2).

The notebook is divided as follows:

*   The first section visualizes sentence similarity between pairs of languages. This is a more academic exercise.
*   In the second section, we show how to build a semantic search engine from a sample of a Wikipedia corpus in multiple languages.

## Citation

*Research papers that make use of the models explored in this colab should cite:*

### [Multilingual universal sentence encoder for semantic retrieval](https://arxiv.org/abs/1907.04307)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. arXiv preprint arXiv:1907.04307

## Setup

This section sets up the environment for access to the Multilingual Universal Sentence Encoder Module and also prepares a set of English sentences and their translations. In the following sections, the multilingual module will be used to compute similarity *across languages*.

In [None]:
%%capture
#@title Setup Environment

!pip install tensorflow_text


In [None]:
#@title Setup common imports and functions

import numpy as np
import os
import pandas as pd
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
import sklearn.metrics.pairwise





The next cell loads the pre-trained model from TF Hub. This model is used to encode text throughout the rest of the notebook.

In [None]:
# The 16-language multilingual module is the default, but feel free
# to pick others from the list and compare the results.
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3' #@param ['https://tfhub.dev/google/universal-sentence-encoder-multilingual/3', 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3']

model = hub.load(module_url)



In [None]:
# pandas is re-imported so this cell can run on its own; gc is not used below.
import gc
import pandas as pd

In [None]:
# Load the cleaned articles and keep only the columns used below.
data = pd.read_csv('/content/drive/MyDrive/InfoDesk/cleaned_titles_new.csv')
data = data[['Title', 'Story']]
len(data)

1189

In [None]:
# Encode every title into a fixed-length sentence embedding (one row per title).
embedded = model(data['Title'])

In [None]:
embedded.shape

TensorShape([1189, 512])
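
Each row of `embedded` is one title's embedding, so two titles can be compared through the cosine similarity of their rows. The following is a minimal sketch of that computation using NumPy only. The toy vectors and the helper name `cosine_similarity_matrix` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; the dot product of two unit
    # vectors is their cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Hypothetical 2-D vectors standing in for real sentence embeddings.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(sim)
```

The clustering step later in the notebook works with cosine distance, which is one minus this similarity.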

In [None]:
from scipy.cluster import hierarchy

In [None]:
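# Average-linkage agglomerative clustering on cosine distance between embeddings.
threshold = 0.779  # cut the dendrogram at this cosine distance
Z = hierarchy.linkage(embedded, method="average", metric="cosine")
C = hierarchy.fcluster(Z, threshold, criterion="distance")  # one flat cluster label per title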

In [None]:
# fcluster labels start at 1, so the largest label equals the number of clusters.
max(C)

22
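
The number of clusters depends on where the dendrogram is cut. The sketch below shows that effect on a tiny example, using the same `linkage` and `fcluster` calls as above. The four 2-D vectors and the two cut heights are hypothetical values chosen only to make the behaviour easy to see.

```python
import numpy as np
from scipy.cluster import hierarchy

# Toy 2-D vectors standing in for sentence embeddings (hypothetical data):
# two vectors point roughly along the x axis, two roughly along the y axis.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

# Average-linkage hierarchy built on cosine distance, as in the notebook.
Z_toy = hierarchy.linkage(toy_embeddings, method="average", metric="cosine")

# A low cut keeps only nearly parallel vectors together.
tight = hierarchy.fcluster(Z_toy, t=0.1, criterion="distance")
# A cut above the height of the final merge puts everything in one cluster.
loose = hierarchy.fcluster(Z_toy, t=1.0, criterion="distance")

print(len(set(tight)), len(set(loose)))
```

With these toy vectors the low cut yields two clusters and the high cut yields one. The same trade-off applies to `threshold` above: lowering it can only keep or increase the number of clusters.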

In [None]:
# Attach each title's cluster label and save the result to Drive.
data['Clusters'] = C
data.to_csv('/content/drive/MyDrive/InfoDesk/USE_Hier_new.csv')
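
Before relying on the saved labels, it can help to check how the titles are distributed across clusters. This is a small sketch of one way to do that with the standard library. The `labels` list is hypothetical and stands in for the array `C` produced by `fcluster`.

```python
from collections import Counter

# Hypothetical cluster labels standing in for the fcluster output C.
labels = [1, 2, 1, 3, 2, 1]

# Count how many items fall into each cluster.
sizes = Counter(labels)

# (label, size) of the most populated cluster.
largest = sizes.most_common(1)[0]
print(sizes, largest)
```

A cluster that holds most of the titles may suggest the distance threshold is too high for the data.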