<a href="https://colab.research.google.com/github/mpsdecamargo/ml-data-science-portfolio/blob/main/bert-deep-learning-project/Covid_related_Text_Similar_Text_Assessment_with_Sentence_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## INTRODUCTION

Application feature covered in this notebook: given an input text, return the 5 most similar Covid-related fact-checked claims, each with the verified claim, similarity score, publication date, link, and assessment (verdict) of the claim.

Notebook content: dataset import, data processing, sentence embedding, and semantic similarity assessment.

Note: This notebook was developed in Google Colab. The datasets are not publicly available due to copyright restrictions. Because the dataset and the models are not publicly available, the notebook cannot be reproduced as-is; it is meant as a demonstration of problem-solving, Data Science, and Machine Learning skills. The code can, however, be adapted to similar tasks.





## ABOUT THE DATASET

The dataset was originally named `dataset_verifato_checagens` (for this update, it is named `dataset_verifato_sentence_similarity`) and contains 3894 samples, distributed across the following fact-checking sources:

| Source             | Number of Samples |
|---------------------|-------------------|
| Aos Fatos           | 1415              |
| Boatos.org          | 791               |
| Estadão Verifica    | 538               |
| G1 Fato ou Fake     | 499               |
| AFP Checamos        | 432               |
| Agência Lupa        | 219               |

Note: In the table below, "Claim Assessment" is the verdict given to each verified claim. It is shown alongside each of the top 5 claims most similar to the input text.

| Claim Assessment    | Number of Samples |
|----------------------|-------------------|
| False                | 3440              |
| Misleading          | 148               |
| Predominantly False  | 142               |
| Distorted           | 119               |
| No Context          | 18                |
| Partly True          | 12                |
| No Record           | 6                 |
| Exaggerated         | 5                 |
| Missing Context     | 4                 |
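
To make the distribution above easier to read, the counts can be turned into shares of the whole dataset. The snippet below is a minimal sketch that copies the numbers from the Claim Assessment table; the variable names are illustrative and are not part of the original notebook.

```python
# Counts copied from the Claim Assessment table above.
assessment_counts = {
    "False": 3440,
    "Misleading": 148,
    "Predominantly False": 142,
    "Distorted": 119,
    "No Context": 18,
    "Partly True": 12,
    "No Record": 6,
    "Exaggerated": 5,
    "Missing Context": 4,
}

# The total should match the 3894 samples reported for the dataset.
total = sum(assessment_counts.values())

# Share of each verdict in the dataset.
shares = {label: count / total for label, count in assessment_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The counts add up to the 3894 samples reported above, and the "False" verdict accounts for the large majority of them.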



In [None]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence_transformers
  Building wheel for sentence_transformers (setup.py) ... done
  Created wheel for sentence_transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=a8dec5aaf8f39742efbf3e4254443024215facef415d78d936caa6fec60bbb4b
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence_tra

In [None]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd
import pickle

In [None]:
from google.colab import drive
drive.mount('gdrive')

Mounted at gdrive


In [None]:
# Loading the Sentence-BERT model

embedder = SentenceTransformer('distiluse-base-multilingual-cased-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Defining the input samples in a list

corpus = ['Católicos celebram neste 14 de maio o dia de Santa Corona'
          ]

In [None]:
df = pd.read_csv("/content/gdrive/MyDrive/Datasets/dataset_verifato_sentence_similarity.csv", sep=";")

In [None]:
df.head()

| Unnamed: 0 | ID   | link | claimReviewed | title | text | datePublished | label |
|------------|------|------|---------------|-------|------|---------------|-------|
| 0 | 81   | https://politica.estadao.com.br/blogs/estadao-... | Dez artistas morreram de covid mesmo após toma... | Para atacar Doria, campanha antivacina falseia... |  | 19/ago/21 | False |
| 1 | 135  | https://politica.estadao.com.br/blogs/estadao-... | Números comprovam que não faltam vacinas de co... | Vídeo engana ao sugerir que sobram vacinas con... |  | 14/abr/21 | False |
| 2 | 1938 | https://www.boatos.org/saude/reportagem-globo-... | Globo/DF afirma, em reportagem de 2022, que má... | Reportagem da Globo/DF de 2022 aponta que másc... |  | 21/fev/22 | False |
| 3 | 1939 | https://www.boatos.org/saude/menina-gritou-for... | Após se vacinar e gritar “fora, Bolsonaro”, me... | Menina que gritou "fora Bolsonaro" ao se vacin... |  | 09/fev/22 | False |
| 4 | 1940 | https://www.boatos.org/saude/vacina-mata-crian... | Uma criança morreu ao ser vacinada contra a Co... | Vacina mata criança na Paraíba e pai fica dese... |  | 22/jan/22 | False |
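
The `datePublished` values shown above use Portuguese month abbreviations (for example `19/ago/21`), which generic date parsers may not recognize. The snippet below is a minimal sketch of one way to convert them into `datetime.date` objects. The helper `parse_pt_date` is hypothetical and not part of the original notebook, and it assumes every two-digit year belongs to the 2000s.

```python
from datetime import date

# Portuguese month abbreviations in the style of the datePublished column.
PT_MONTHS = {
    "jan": 1, "fev": 2, "mar": 3, "abr": 4, "mai": 5, "jun": 6,
    "jul": 7, "ago": 8, "set": 9, "out": 10, "nov": 11, "dez": 12,
}

def parse_pt_date(text):
    """Parse a 'dd/mmm/yy' string such as '19/ago/21' into a date.

    Hypothetical helper. Assumption: two-digit years are in the 2000s.
    """
    day, month_abbr, year = text.strip().lower().split("/")
    return date(2000 + int(year), PT_MONTHS[month_abbr], int(day))

# Values taken from the df.head() output above.
for raw in ["19/ago/21", "14/abr/21", "09/fev/22"]:
    print(raw, "->", parse_pt_date(raw).isoformat())
```

With dates in this form, the retrieved claims could be sorted or filtered by publication date if needed.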


In [None]:
# Extracting the reviewed claims as a list of strings to encode into numerical vectors

df_to = df['claimReviewed'].to_list()
type(df_to)

list

In [None]:
# Encoding the dataset

corpus_embeddings = embedder.encode(df_to)

In [None]:
# Saving the encoded dataset into pickle format

with open("/content/gdrive/MyDrive/Datasets/sts-embeddings.pkl", "wb") as fOut:
    pickle.dump(corpus_embeddings,fOut)

In [None]:
print(corpus_embeddings)

[[-0.02763003 -0.00216739  0.00612784 ...  0.05662372 -0.01689292
  -0.01049389]
 [ 0.07169698 -0.02848689 -0.06468349 ... -0.04537911  0.01191914
  -0.01098778]
 [-0.0120403   0.05308544 -0.02920126 ... -0.02822604  0.01654766
   0.0129501 ]
 ...
 [ 0.07244699 -0.0498138   0.00700273 ...  0.06558876  0.05121427
   0.00389253]
 [ 0.03662167  0.00838148  0.00049268 ...  0.06995937 -0.07485394
   0.05879562]
 [-0.00297564  0.01090902 -0.02740498 ...  0.00086305  0.00484336
  -0.00772337]]


In [None]:
# Loading the saved embeddings back from the pickle file
obj = pd.read_pickle(r'/content/gdrive/MyDrive/Datasets/sts-embeddings.pkl')
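
Saving the embeddings once and loading them later avoids re-encoding the whole dataset on every run. The snippet below is a minimal sketch of that save/load round trip using only the standard library. The toy vectors and the file name are illustrative stand-ins for `corpus_embeddings` and the Google Drive path used in this notebook.

```python
import os
import pickle
import tempfile

# Toy stand-in for corpus_embeddings: a small list of float vectors.
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-embeddings.pkl")  # illustrative file name

    # Save the embeddings to disk.
    with open(path, "wb") as f_out:
        pickle.dump(toy_embeddings, f_out)

    # Load them back and check that nothing changed.
    with open(path, "rb") as f_in:
        restored = pickle.load(f_in)

print(restored == toy_embeddings)
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from trusted sources.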

In [None]:
# sample input text to use for similarity assessment

to_predict = "Católicos celebram neste 14 de maio o dia de Santa Corona, padroeira dos madeireiros e dos que buscam ajuda em tempos de dificuldades financeiras. Com a crise gerada pela Covid-19 — e com a semelhança com o nome do coronavírus —, a santa também se tornou padroeira da luta contra a pandemia. Em março, os responsáveis pela Catedral de Aachen, na Alemanha, recuperaram as relíquias da Santa Corona guardadas dentro do relicário e começaram a polir o santuário dedicado a ela. A ideia é que eles fiquem expostos quando a pandemia do novo coronavírus passar. Acredita-se que Santa Corona morreu aos 16 anos, provavelmente na Síria, por professar a fé cristã — o que desagradou os romanos do Século II. Ela foi cruelmente assassinada ao ser amarrada em duas palmeiras esticadas até o chão. Quando as plantas se soltaram rapidamente, o corpo da mártir foi esquartejado. Santa padroeira À agência Reuters, a forma com a qual a santa morreu a tornou, primeiro, padroeira dos madeireiros"

In [None]:
# define the queries as a list of the sample input (in this case, only 1 input text sample)
queries = [to_predict]
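
Sentence-BERT models truncate inputs longer than their maximum sequence length (exposed as `embedder.max_seq_length`), so a long paragraph such as `to_predict` may be only partially embedded. One possible variation, not used in this notebook, is to split the text into sentences and query each one separately. The snippet below is a minimal sketch of a naive splitter. `split_into_sentences` is a hypothetical helper, and the sample is a shortened excerpt of the input text.

```python
import re

def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace.

    Hypothetical helper; it does not handle abbreviations or decimal numbers.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Shortened excerpt of the input text, for illustration only.
sample = (
    "Católicos celebram neste 14 de maio o dia de Santa Corona, padroeira dos "
    "madeireiros e dos que buscam ajuda em tempos de dificuldades financeiras. "
    "A ideia é que eles fiquem expostos quando a pandemia do novo coronavírus passar."
)

# Each sentence could then be used as its own query.
sentence_queries = split_into_sentences(sample)
for sentence in sentence_queries:
    print(sentence)
```

Whether per-sentence queries give better matches than a single long query would need to be checked against the dataset.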

In [None]:
def generateTop5():
  for query in queries:
    # Encode the query text into the same vector space as the corpus
    query_embedding = embedder.encode(query)

    print("Query:", query)
    print("\nTop 5 of Similar Claims:")

    # Perform semantic search (cosine similarity by default) and retrieve the top 5 hits
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]

    # corpus_id is the position of the claim in df_to, which is also its row position in df,
    # so the row can be read directly instead of scanning df for a matching claim text
    for hit in hits:
      row = df.iloc[hit['corpus_id']]
      print(f"Claim Verified: {row['claimReviewed']}, \n(Score: {hit['score']:.4f})\n Published Date: {row['datePublished']}\n Link: {row['link']}\n Assessment: {row['label']}")
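
To illustrate what the semantic search step computes, the snippet below is a minimal pure-Python sketch of a top-k cosine similarity search. It is not the `sentence_transformers` implementation. The functions `cosine_similarity` and `top_k_similar` are hypothetical helpers, and the 3-dimensional vectors are toy stand-ins for real sentence embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return up to k dicts with 'corpus_id' and 'score', best match first."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]

# Toy vectors standing in for real sentence embeddings.
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.0, 0.0]

toy_hits = top_k_similar(toy_query, toy_corpus, k=2)
for hit in toy_hits:
    print(hit["corpus_id"], round(hit["score"], 4))
```

Each hit carries the position of the matching corpus entry and its score, which is the same shape of result that `generateTop5` reads from `hits`.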


In [None]:
generateTop5()

Query: Católicos celebram neste 14 de maio o dia de Santa Corona, padroeira dos madeireiros e dos que buscam ajuda em tempos de dificuldades financeiras. Com a crise gerada pela Covid-19 — e com a semelhança com o nome do coronavírus —, a santa também se tornou padroeira da luta contra a pandemia. Em março, os responsáveis pela Catedral de Aachen, na Alemanha, recuperaram as relíquias da Santa Corona guardadas dentro do relicário e começaram a polir o santuário dedicado a ela. A ideia é que eles fiquem expostos quando a pandemia do novo coronavírus passar. Acredita-se que Santa Corona morreu aos 16 anos, provavelmente na Síria, por professar a fé cristã — o que desagradou os romanos do Século II. Ela foi cruelmente assassinada ao ser amarrada em duas palmeiras esticadas até o chão. Quando as plantas se soltaram rapidamente, o corpo da mártir foi esquartejado. Santa padroeira À agência Reuters, a forma com a qual a santa morreu a tornou, primeiro, padroeira dos madeireiros

Top 5 of Sim