<a href="https://colab.research.google.com/github/oliverguhr/htw-nlp-lecture/blob/master/assignments/transformer/NLP_3_Neural_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural search with Transformers

## What are we going to do?

Instead of searching text by compareing characters and words, 
we will use the power of transfomer models and compare texts in vector sprace.

![](https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif)

## installing dependencies

In [195]:
!pip install transformers



In [196]:
from transformers import AutoModel, AutoTokenizer

## loading a model

In [197]:
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## transforming a text to an vector

In [198]:
inputs = tokenizer("Hello world!", return_tensors="pt")
inputs


{'input_ids': tensor([[ 101, 8667, 1362,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [199]:
inputs = tokenizer(["Hello world!", "funny test"], return_tensors="pt", padding=True,truncation=True)
inputs

{'input_ids': tensor([[ 101, 8667, 1362,  106,  102],
        [ 101, 6276, 2774,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]])}

In [200]:
outputs = model(**inputs)
outputs 

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[ 3.7628e-01,  3.6770e-01,  4.9426e-01,  ..., -1.6608e-01,
                                                          5.9425e-01, -1.5929e-01],
                                                        [ 7.1213e-01, -3.6788e-01,  8.5830e-01,  ..., -2.3904e-01,
                                                          5.6956e-01, -8.8592e-02],
                                                        [ 6.5599e-01,  4.7040e-01,  4.8593e-01,  ..., -5.6358e-01,
                                                         -2.6219e-01, -4.7541e-01],
                                                        [ 6.3835e-01,  8.1096e-02,  8.1820e-01,  ...,  2.6849e-01,
                                                          3.9201e-01,  1.5533e-01],
                                                        [ 7.4092e-01,  5.9099e-01,  3.2814e-01,  ..., -3.2534e-01,
                     

**last_hidden_state**: Sequence of hidden-states at the output of the last layer of the model.

In [201]:
outputs["last_hidden_state"].shape


torch.Size([2, 5, 768])

**pooler_output**: Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

In [202]:
outputs["pooler_output"].shape

torch.Size([2, 768])

## loading data 

We load a data set of news headlines from german newspapers. This data set contains the headlines and the according article urls.
After we loaded the data, we need to convert all headlines into vectors.

In [203]:
!curl -O https://www2.htw-dresden.de/~guhr/dist/feeds.tsv 


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.0M  100 10.0M    0     0  4163k      0  0:00:02  0:00:02 --:--:-- 4164k


In [204]:
!head feeds.tsv

id	title	text	time	link		
https://www.spiegel.de/politik/deutschland/corona-krise-in-deutschland-wie-kommen-wir-wieder-raus-a-d8099433-e178-46be-957a-f6c779b3f2f5	'Corona-Krise in Deutschland: Wie kommen wir wieder raus?'	'Die Bundesregierung will in der kommenden Woche über mögliche Szenarien für den Exit aus dem Lockdown beraten. Schon jetzt warnen Politiker vor einem überhasteten Aussetzen der Maßnahmen. Der Überblick.'	'Mon, 13 Apr 2020 18:18:00 +0200'	'https://www.spiegel.de/politik/deutschland/corona-krise-in-deutschland-wie-kommen-wir-wieder-raus-a-d8099433-e178-46be-957a-f6c779b3f2f5#ref=rss		
https://www.spiegel.de/wissenschaft/leopoldina-forscher-legen-konkreten-fahrplan-fuer-ende-der-kontaktsperren-vor-a-0cfd0aed-cf48-4dd1-a219-241d818d60ae	'Leopoldina-Forscher legen konkreten Fahrplan für Ende der Kontaktsperren vor'	'Die Nationalakademie Leopoldina empfiehlt eine baldige Rückkehr zur Schule. Auch Geschäfte und Behörden sollen schrittweise eröffnen und Reisen erlaubt werden

In [205]:
import time
import pandas as pd
import numpy as np

feeds_df = pd.read_csv("feeds.tsv", sep='\t', header=0,encoding="utf-8")
feeds_df.drop(columns=['text'], inplace=True)
feeds_df.drop(columns=['time'], inplace=True)
feeds_df.drop(columns=['id'], inplace=True)
feeds_df.drop(columns=['Unnamed: 5'], inplace=True)
feeds_df.drop(columns=['Unnamed: 6'], inplace=True)

In [206]:
feeds_df.head(5)

Unnamed: 0,title,link
0,'Corona-Krise in Deutschland: Wie kommen wir w...,'https://www.spiegel.de/politik/deutschland/co...
1,'Leopoldina-Forscher legen konkreten Fahrplan ...,'https://www.spiegel.de/wissenschaft/leopoldin...
2,'Philosophie Coronavirus-Lockdown: Wir müssen ...,'https://www.spiegel.de/wissenschaft/philosoph...
3,'Coronavirus in Indonesien: Gefährliche Heimre...,'https://www.spiegel.de/politik/ausland/corona...
4,'Coronavirus News am Montag: Die wichtigsten E...,'https://www.spiegel.de/wissenschaft/medizin/c...


In [207]:
# We want to remove the qoutes here in order to get better results.

def remove_quotes(text):
    return text[1:-1]

feeds_df["title"]=feeds_df["title"].map(remove_quotes)
feeds_df["link"]=feeds_df["link"].map(remove_quotes)
feeds_df.head(10)

Unnamed: 0,title,link
0,Corona-Krise in Deutschland: Wie kommen wir wi...,https://www.spiegel.de/politik/deutschland/cor...
1,Leopoldina-Forscher legen konkreten Fahrplan f...,https://www.spiegel.de/wissenschaft/leopoldina...
2,Philosophie Coronavirus-Lockdown: Wir müssen ü...,https://www.spiegel.de/wissenschaft/philosophi...
3,Coronavirus in Indonesien: Gefährliche Heimreise,https://www.spiegel.de/politik/ausland/coronav...
4,Coronavirus News am Montag: Die wichtigsten En...,https://www.spiegel.de/wissenschaft/medizin/co...
5,Corona-Lockdown: Deutsche sind immer mehr unte...,https://www.spiegel.de/panorama/corona-lockdow...
6,Corona-Krise: Warum Vorhersagen zu Wirtschaft ...,https://www.spiegel.de/wirtschaft/corona-krise...
7,"Corona-Alltags-Heldin: Susanne Rudwill, 56, Ka...",https://www.spiegel.de/panorama/gesellschaft/c...
8,"Corona: Politik darf keine Erwartungen wecken,...",https://www.spiegel.de/politik/deutschland/cor...
9,Trigema-Chef Grupp kämpft gegen die Corona-Kri...,https://www.spiegel.de/wirtschaft/unternehmen/...


In [208]:
# Number of entries in our data set
len(feeds_df)

21257

## Processing the data

In [209]:
# since 21257 entries would take a lot of time to process, we just load
# the first 500 articles here. But you are welcome to experiment with this 
# parameter. 

titles = list(feeds_df["title"][:500])
links = list(feeds_df["link"][:500])

In [210]:
tokens = tokenizer(titles, return_tensors="pt",truncation=True,padding=True)
with torch.no_grad():
    headline_vectors = model(**tokens)["pooler_output"]  

In [211]:
headline_vectors.shape

torch.Size([500, 768])

In [212]:
tokens = tokenizer("Auswirkungen der Corona Pandmie", return_tensors="pt",truncation=True, padding=True)
with torch.no_grad():
  query_vector = model(**tokens)["pooler_output"]
query_vector.shape

torch.Size([1, 768])

In [213]:
result = torch.sum(query_vector * headline_vectors,axis=1) 
result.shape

torch.Size([500])

In [214]:
result

tensor([434.5861, 427.5778, 417.6656, 425.2582, 443.6272, 444.1003, 429.4274,
        443.6042, 434.7585, 450.9636, 441.4206, 438.9923, 457.1965, 420.5591,
        434.4346, 440.5258, 439.7144, 447.6839, 438.9429, 443.9311, 451.3104,
        435.3569, 443.7168, 443.8405, 452.5818, 442.3846, 447.6673, 435.3538,
        411.6184, 433.8081, 441.5355, 438.1096, 437.5651, 450.1714, 439.7906,
        429.7813, 443.2680, 423.2501, 435.0064, 445.7311, 421.8165, 438.0741,
        416.5206, 425.5953, 444.5760, 431.6331, 439.0455, 454.2089, 386.2811,
        444.0154, 441.2814, 437.8051, 440.6770, 431.9468, 413.9568, 444.6879,
        445.4962, 434.6720, 421.6457, 424.0478, 439.4895, 419.4149, 429.1205,
        441.5368, 435.2282, 438.2585, 433.9427, 428.4285, 417.3004, 438.9912,
        422.2676, 422.6285, 437.2614, 453.0658, 446.9125, 430.0360, 435.3096,
        440.8641, 432.4124, 437.7288, 444.4147, 415.2664, 452.9628, 433.8447,
        416.7220, 450.0892, 445.6443, 440.6169, 426.9055, 444.79

## Ranking the results

In [215]:
topk = 10
values, indices = torch.topk(result, topk,largest=True)
print(values,indices)

tensor([458.3764, 457.1965, 454.6101, 454.5966, 454.4232, 454.3520, 454.2089,
        453.9844, 453.0658, 453.0188]) tensor([363,  12, 180, 289, 340, 143,  47, 144,  73, 187])


In [216]:
for i in range(0,topk):
  index = indices[i].item()
  value = int(values[i].item())
  print(value,titles[index],links[index])


458 Viele Corona-Tote: Durchsuchungen in Altenheimen in Mailand https://www.faz.net/aktuell/gesellschaft/gesundheit/coronavirus/viele-corona-tote-durchsuchungen-in-altenheimen-in-mailand-16724777.htm
457 Coronakrise: Condor-Rettung endgültig geplatzt https://www.spiegel.de/wirtschaft/unternehmen/coronakrise-condor-rettung-endgueltig-geplatzt-a-4ebf4558-adc1-48a1-ad52-0c4f5736144a#ref=rs
454 USA: Eine Pressekonferenz, wie sie wohl nur Trump geben kann https://www.sueddeutsche.de/politik/coronavirus-usa-trump-1.487579
454 Südkorea wählt trotz Corona-Krise https://www.wienerzeitung.at/nachrichten/politik/welt/2057223-Suedkorea-waehlt-trotz-Corona-Krise.htm
454 Coronakrise: Shinzo Abe allein zu Haus https://www.sueddeutsche.de/panorama/japan-coronavirus-shinzo-abe-1.487623
454 USA: Mindestens 20 Tote durch Stürme https://www.zeit.de/gesellschaft/zeitgeschehen/2020-04/usa-stuerme-tornados-mississippi-tot
454 Corona-Berichterstattung: Angesteckt https://www.zeit.de/2020/16/coronavirus-berich

## Let's take a look at the vector space
Download the two files and upload them into [Tensorflow Projector](https://projector.tensorflow.org/).

In [217]:
# export to tf projector
x_np = headline_vectors.numpy()
x_df = pd.DataFrame(x_np)
x_df.to_csv('vectors.tsv',sep="\t",index=False, header=None,encoding="utf-8")

with open('titles.tsv', 'w') as writer:
  for title in titles:
    writer.write(title[:150]+"...\n")


# Your tasks

Try to improve the search results. Here are some ideas:

* try out different language models
* try out different pooling modes 
  * e.g you can use the mean of the "last_hidden_state" tensor
* try out sentence transformers like this one: [Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852)

Don't forget to check your results with the embedding projector!


Bonus:

* modify our code and use the gpu
* try a clustering like k-nearest neighbors to group news artikels
