<a href="https://colab.research.google.com/github/oliverguhr/htw-nlp-lecture/blob/master/assignments/transformer/nlp_3_neural_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural search with Transformers

## What are we going to do?

Instead of searching text by comparing characters and words,
we will use the power of transformer models and compare texts in vector space.

![](https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif)

## installing dependencies

In [65]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [66]:
from transformers import AutoModel, AutoTokenizer

## loading a model

In [68]:
model_name = "dbmdz/bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


Some weights of the model checkpoint at dbmdz/bert-base-german-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## transforming a text to a vector

In [69]:
inputs = tokenizer("Hallo Welt!", return_tensors="pt")
inputs


{'input_ids': tensor([[  102, 10373, 30892, 24135,  3330,   103]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [74]:
inputs = tokenizer(["Hallo Welt!", "ein toller Test"], return_tensors="pt", padding=True,truncation=True)
inputs

{'input_ids': tensor([[  102,  4485,   866,  3330,   103,     0],
        [  102,   143, 13837, 30884,  4369,   103]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1]])}

In [75]:
outputs = model(**inputs)
outputs 

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[ 0.2415,  0.6428, -0.5918,  ...,  0.7404,  0.1338, -0.2779],
                                                        [ 0.2890,  0.7354, -0.9872,  ...,  0.7438, -0.6845, -0.1892],
                                                        [ 0.2164,  0.8085, -0.6320,  ...,  0.6260, -0.6735,  0.2411],
                                                        [-0.2210,  0.8246, -0.6606,  ...,  0.8001, -0.1529,  0.0831],
                                                        [-0.0573,  1.1134, -0.8065,  ...,  0.5099, -0.2104, -0.0108],
                                                        [ 0.2753,  1.0079, -0.7109,  ...,  0.3392,  0.1056,  0.1083]],
                                               
                                                       [[ 0.1849,  0.7584, -0.2930,  ...,  0.7572,  0.9876, -0.3735],
                                                        [

**last_hidden_state**: Sequence of hidden-states at the output of the last layer of the model.

In [76]:
outputs["last_hidden_state"].shape


torch.Size([2, 6, 768])

**pooler_output**: Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

In [77]:
outputs["pooler_output"].shape

torch.Size([2, 768])
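
To make the relation between the two outputs concrete, here is a minimal sketch of how a BERT-style pooler derives `pooler_output` from `last_hidden_state`. It uses random tensors and a freshly initialised `torch.nn.Linear` as stand-ins (assumptions for illustration only), so the values are meaningless. Only the shapes and the order of operations mirror the description above.

```python
import torch

torch.manual_seed(0)

# Stand-in for outputs["last_hidden_state"]: 2 sentences, 6 tokens, 768 dimensions.
last_hidden_state = torch.randn(2, 6, 768)

# Stand-in for the pooler's Linear layer; the real model uses trained weights.
dense = torch.nn.Linear(768, 768)

# Take the hidden state of the first token ([CLS]) of every sentence ...
cls_state = last_hidden_state[:, 0]            # shape: [2, 768]

# ... and pass it through the Linear layer followed by Tanh.
with torch.no_grad():
    pooled = torch.tanh(dense(cls_state))      # shape: [2, 768]

print(cls_state.shape, pooled.shape)
```

The token dimension disappears because only the first token is kept, which is why `pooler_output` has one 768-dimensional vector per input text.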

## loading data 

We load a data set of news headlines from German newspapers. The data set contains the headlines and the corresponding article URLs.
After loading the data, we need to convert all headlines into vectors.

In [78]:
!curl -O https://www2.htw-dresden.de/~guhr/dist/feeds.tsv 


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.0M  100 10.0M    0     0  4613k      0  0:00:02  0:00:02 --:--:-- 4613k


In [79]:
!head feeds.tsv

id	title	text	time	link		
https://www.spiegel.de/politik/deutschland/corona-krise-in-deutschland-wie-kommen-wir-wieder-raus-a-d8099433-e178-46be-957a-f6c779b3f2f5	'Corona-Krise in Deutschland: Wie kommen wir wieder raus?'	'Die Bundesregierung will in der kommenden Woche über mögliche Szenarien für den Exit aus dem Lockdown beraten. Schon jetzt warnen Politiker vor einem überhasteten Aussetzen der Maßnahmen. Der Überblick.'	'Mon, 13 Apr 2020 18:18:00 +0200'	'https://www.spiegel.de/politik/deutschland/corona-krise-in-deutschland-wie-kommen-wir-wieder-raus-a-d8099433-e178-46be-957a-f6c779b3f2f5#ref=rss		
https://www.spiegel.de/wissenschaft/leopoldina-forscher-legen-konkreten-fahrplan-fuer-ende-der-kontaktsperren-vor-a-0cfd0aed-cf48-4dd1-a219-241d818d60ae	'Leopoldina-Forscher legen konkreten Fahrplan für Ende der Kontaktsperren vor'	'Die Nationalakademie Leopoldina empfiehlt eine baldige Rückkehr zur Schule. Auch Geschäfte und Behörden sollen schrittweise eröffnen und Reisen erlaubt werden

In [80]:
import time
import pandas as pd
import numpy as np

feeds_df = pd.read_csv("feeds.tsv", sep="\t", header=0, encoding="utf-8")
# Keep only the headline and its link; drop all other columns in one call.
feeds_df = feeds_df.drop(columns=["text", "time", "id", "Unnamed: 5", "Unnamed: 6"])

In [81]:
feeds_df.head(5)

Unnamed: 0,title,link
0,'Corona-Krise in Deutschland: Wie kommen wir w...,'https://www.spiegel.de/politik/deutschland/co...
1,'Leopoldina-Forscher legen konkreten Fahrplan ...,'https://www.spiegel.de/wissenschaft/leopoldin...
2,'Philosophie Coronavirus-Lockdown: Wir müssen ...,'https://www.spiegel.de/wissenschaft/philosoph...
3,'Coronavirus in Indonesien: Gefährliche Heimre...,'https://www.spiegel.de/politik/ausland/corona...
4,'Coronavirus News am Montag: Die wichtigsten E...,'https://www.spiegel.de/wissenschaft/medizin/c...


In [82]:
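# We want to remove the quotes here in order to get better results.
# Some links in the file have no closing quote, so we only strip a
# quote character when it is actually there.

def remove_quotes(text):
    if text.startswith("'"):
        text = text[1:]
    if text.endswith("'"):
        text = text[:-1]
    return text

feeds_df["title"]=feeds_df["title"].map(remove_quotes)
feeds_df["link"]=feeds_df["link"].map(remove_quotes)
feeds_df.head(10)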

Unnamed: 0,title,link
0,Corona-Krise in Deutschland: Wie kommen wir wi...,https://www.spiegel.de/politik/deutschland/cor...
1,Leopoldina-Forscher legen konkreten Fahrplan f...,https://www.spiegel.de/wissenschaft/leopoldina...
2,Philosophie Coronavirus-Lockdown: Wir müssen ü...,https://www.spiegel.de/wissenschaft/philosophi...
3,Coronavirus in Indonesien: Gefährliche Heimreise,https://www.spiegel.de/politik/ausland/coronav...
4,Coronavirus News am Montag: Die wichtigsten En...,https://www.spiegel.de/wissenschaft/medizin/co...
5,Corona-Lockdown: Deutsche sind immer mehr unte...,https://www.spiegel.de/panorama/corona-lockdow...
6,Corona-Krise: Warum Vorhersagen zu Wirtschaft ...,https://www.spiegel.de/wirtschaft/corona-krise...
7,"Corona-Alltags-Heldin: Susanne Rudwill, 56, Ka...",https://www.spiegel.de/panorama/gesellschaft/c...
8,"Corona: Politik darf keine Erwartungen wecken,...",https://www.spiegel.de/politik/deutschland/cor...
9,Trigema-Chef Grupp kämpft gegen die Corona-Kri...,https://www.spiegel.de/wirtschaft/unternehmen/...


In [83]:
# Number of entries in our data set
len(feeds_df)

21257

## Processing the data

In [None]:
# since 21257 entries would take a lot of time to process, we just load
# the first 3000 articles here. But you are welcome to experiment with this 
# parameter. 

titles = list(feeds_df["title"][:3000])
links = list(feeds_df["link"][:3000])

In [84]:
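import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
tokens = tokenizer(titles, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    headline_vectors = model(**tokens)["pooler_output"]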

In [85]:
headline_vectors.shape

torch.Size([3000, 768])
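
Encoding all headlines in a single forward pass, as above, needs enough memory to hold every padded sequence at once. A common workaround is to encode smaller batches and concatenate the results. The sketch below shows one way to do that. `embed_in_batches`, `embed_fn`, and `batch_size` are hypothetical names, and `dummy_embed` is only a placeholder for the tokenizer and model call.

```python
import torch

def embed_in_batches(texts, embed_fn, batch_size=64):
    """Apply embed_fn to consecutive slices of texts and stack the results."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(embed_fn(batch))
    return torch.cat(chunks, dim=0)

def dummy_embed(batch):
    # Placeholder: returns one 4-dimensional vector per text.
    # In the notebook this would tokenize the batch, run the model under
    # torch.no_grad(), and return model(**tokens)["pooler_output"].
    return torch.tensor([[float(len(t))] * 4 for t in batch])

example_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(example_texts, dummy_embed, batch_size=2)
print(vectors.shape)
```

The row order of the result matches the order of the input texts, so indices returned by a later ranking step still point at the right headline.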

In [98]:
tokens = tokenizer("Auswirkungen der Corona Pandmie", return_tensors="pt",truncation=True, padding=True)
tokens.to(model.device)  # move the inputs to the device the model is on
with torch.no_grad():
  query_vector = model(**tokens)["pooler_output"]
query_vector.shape

torch.Size([1, 768])

In [100]:
# calculate the dot product
result = torch.sum(query_vector * headline_vectors,axis=1) 
result.shape

torch.Size([3000])

In [101]:
result

tensor([199.0725, 189.0664, 184.6973,  ..., 191.6362, 163.9664, 201.1741],
       device='cuda:0')
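
The element-wise product followed by a sum is the same as a matrix-vector product. Because a raw dot product also grows with the length of the vectors, a common alternative is cosine similarity, which normalises both sides first. The sketch below compares the two on small made-up vectors; the numbers are illustrative and not taken from the model.

```python
import torch
import torch.nn.functional as F

# Made-up stand-ins for query_vector (1 x d) and headline_vectors (n x d).
query = torch.tensor([[1.0, 2.0, 2.0]])
headlines = torch.tensor([[2.0, 4.0, 4.0],    # same direction as the query, twice as long
                          [3.0, 0.0, 0.0],    # different direction
                          [0.5, 1.0, 1.0]])   # same direction, half as long

# Dot product, written as in the cell above and as a matrix product.
dot_sum = torch.sum(query * headlines, dim=1)
dot_matmul = (headlines @ query.T).squeeze(1)

# Cosine similarity: dot product of the length-normalised vectors.
cosine = F.cosine_similarity(query, headlines, dim=1)

print(dot_sum)
print(cosine)
```

In this toy example the first and third rows point in the same direction as the query and receive the same cosine score, while their dot products differ because the rows have different lengths.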

## Ranking the results

In [106]:
topk = 20
values, indices = torch.topk(result, topk,largest=True)
print(values,indices)

tensor([210.6873, 209.8775, 209.4633, 209.4478, 209.4223, 209.3386, 209.1796,
        209.1045, 209.0961, 209.0785, 209.0172, 208.9439, 208.9334, 208.8723,
        208.8566, 208.8131, 208.5761, 208.5325, 208.4820, 208.4820],
       device='cuda:0') tensor([1043, 2657,  101, 1997, 1162, 1494, 1603, 2802, 2402, 1538, 2761, 1146,
        1971, 2386, 1502, 2905, 2046, 2514, 1416, 1409], device='cuda:0')


In [107]:
for i in range(0,topk):
  index = indices[i].item()
  value = int(values[i].item())
  print(value,titles[index],links[index])


210 Corona-Beschlüsse: Erste Schritte aus dem Lockdown https://www.spiegel.de/politik/deutschland/corona-beschluesse-erste-schritte-aus-dem-lockdown-a-affb6f1d-7f4a-452e-8087-ff6ea23b69a8#ref=rs
209 Corona-Sicherheitsmaßnahmen: Die große Stunde der Egoisten https://www.faz.net/aktuell/feuilleton/debatten/weshalb-so-wenige-menschen-im-alltag-masken-tragen-16733712.htm
209 Coronavirus: Mit Vorsicht zurück in den Alltag https://www.sueddeutsche.de/politik/leopoldina-coronavirus-walter-borjans-1.487550
209 CDU in der Corona-Krise: Plötzlich populär https://www.spiegel.de/politik/deutschland/cdu-in-der-corona-krise-ploetzlich-populaer-a-b48d2cc8-32db-44b2-b58e-e3ef8e7358b8#ref=rs
209 Corona-Schäden: Versicherer rufen nach staatlicher Hilfe https://www.handelsblatt.com/finanzen/immobilien/corona-schaeden-versicherer-rufen-nach-staatlicher-hilfe/25746022.htm
209 IT-Sicherheit: Der größte Risikofaktor sitzt vor dem Rechner https://www.sueddeutsche.de/digital/it-sicherheit-schulungen-corona-1.4

## Let's take a look at the vector space
Run the cell below to export `vectors.tsv` and `titles.tsv`, then download both files and upload them to the [TensorFlow Embedding Projector](https://projector.tensorflow.org/).

In [108]:
# export to tf projector
x_np = headline_vectors.cpu().numpy()
x_df = pd.DataFrame(x_np)
x_df.to_csv('vectors.tsv',sep="\t",index=False, header=None,encoding="utf-8")

# Write the titles as UTF-8 so that umlauts survive the export.
with open('titles.tsv', 'w', encoding="utf-8") as writer:
  for title in titles:
    writer.write(title[:150]+"...\n")


# Your tasks

Try to improve the search results. Here are some ideas:

* try a sentence-embedding model such as [Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852)
* try to adapt the sample code from the [sentence transformers project](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)

Check your results with the embedding projector and compare them. What do you see?


Bonus:

* try a clustering algorithm such as k-means to group news articles
