<a href="https://colab.research.google.com/github/lymoelopez/automated-filipino-fake-news-detector/blob/main/evidenceCollector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evidence Collector

## I. Web Searcher

#### Web Search Filter

In [123]:
URL_banList = ["facebook", "twitter", "youtube", "pdf", "blog", "tiktok", "instagram", "youtu.be", "mp4", "mp3", "audiobook", "podcast", "spotify"]

def URLfilter(url):
  lowercasedURL = url.lower()
  if all(excludedURL not in lowercasedURL for excludedURL in URL_banList):
    return url
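
The filter above relies on plain substring matching, so a banned term matches anywhere in the URL, including the path and file extension. The sketch below restates the filter in self-contained form and runs it on a few made-up URLs (the `example.com` addresses and the `sampleURLs` name are hypothetical, not real search results) to show what passes and what is dropped.

```python
# Self-contained restatement of the substring filter, for illustration only.
URL_banList = ["facebook", "twitter", "youtube", "pdf", "blog", "tiktok",
               "instagram", "youtu.be", "mp4", "mp3", "audiobook", "podcast", "spotify"]

def URLfilter(url):
  lowercasedURL = url.lower()
  if all(excludedURL not in lowercasedURL for excludedURL in URL_banList):
    return url
  return None

# Hypothetical URLs, not taken from real search results
sampleURLs = [
  "https://news.example.com/balita/teves",
  "https://www.Facebook.com/somepage",
  "https://example.com/files/report.PDF",
  "https://example.com/blog/post-1",
]

# Keep only the URLs for which the filter returns a (truthy) URL
keptURLs = [url for url in sampleURLs if URLfilter(url)]
print(keptURLs)
```

Because matching is case-insensitive and position-independent, a legitimate news URL that happens to contain a term such as `blog` in its path would also be excluded.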

#### Google Search API

In [124]:
!pip install beautifulsoup4


In [125]:
!pip install google


In [126]:
from googlesearch import search

def webSearcher(inputClaim, numOfResults=20):

  filteredList = []

  # search() yields result URLs lazily; keep filtered, unique URLs
  # until numOfResults have been collected or the results run out
  for url in search(inputClaim, lang="tl", num=numOfResults, pause=3):
    filteredURL = URLfilter(url)
    if filteredURL and (filteredURL not in filteredList):
      filteredList.append(filteredURL)
    if len(filteredList) >= numOfResults:
      break

  return filteredList

## II. Evidence Selector

#### Article Extractor

In [127]:
!pip install newspaper3k


In [128]:
from newspaper import Article

def articleExtractor(url):

  try:
    article = Article(url)
    article.download()
    article.parse()
  except Exception:
    # download or parsing failed; return an empty string so the caller can continue
    return ""

  return article.text

In [129]:
def createExtractedArticlesList(urlList):

  extractedArticlesList = []

  for url in urlList:
    extractedArticlesList.append(articleExtractor(url))

  return extractedArticlesList
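
Since `articleExtractor` returns an empty string when a download or parse fails, `extractedArticlesList` can contain empty entries that are still passed on for scoring. One possible way to handle this, sketched below, is to keep each URL paired with its text and skip empty extractions. The helper `pairUrlsWithArticles` and the stand-in `fakeExtractor` are hypothetical names introduced for illustration; they are not part of the notebook.

```python
def pairUrlsWithArticles(urlList, extractor):
  """Return (url, text) pairs, skipping URLs whose extraction came back empty."""
  pairedList = []
  for url in urlList:
    text = extractor(url)
    if text.strip():
      pairedList.append((url, text))
  return pairedList

# Stand-in extractor so the sketch runs offline;
# in the notebook, articleExtractor would be passed instead (assumption)
def fakeExtractor(url):
  fakeTexts = {
    "https://example.com/a": "Unang artikulo.",
    "https://example.com/b": "",  # simulates a failed extraction
  }
  return fakeTexts.get(url, "")

pairs = pairUrlsWithArticles(
  ["https://example.com/a", "https://example.com/b"],
  fakeExtractor,
)
print(pairs)
```

Keeping the URL next to the text also makes it possible to report which source each selected piece of evidence came from.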

#### Cosine Similarity

In [130]:
!pip install sentence_transformers


In [131]:
from sentence_transformers import SentenceTransformer

cosineSimilarityModel = SentenceTransformer("danjohnvelasco/filipino-sentence-roberta-v1")
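
For reference, the cosine similarity between two embedding vectors is their dot product divided by the product of their lengths, so it depends on direction rather than magnitude. The following is a minimal pure-Python sketch of that formula on toy three-dimensional vectors. The function name `cosineSimilarity` and the vectors are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity` on model embeddings.

```python
import math

def cosineSimilarity(vectorA, vectorB):
  """Dot product of the two vectors divided by the product of their norms."""
  dotProduct = sum(a * b for a, b in zip(vectorA, vectorB))
  normA = math.sqrt(sum(a * a for a in vectorA))
  normB = math.sqrt(sum(b * b for b in vectorB))
  if normA == 0 or normB == 0:
    # convention chosen for this sketch: a zero vector is similar to nothing
    return 0.0
  return dotProduct / (normA * normB)

# Toy vectors standing in for sentence embeddings
claimVector = [1.0, 2.0, 0.0]
sameDirection = [2.0, 4.0, 0.0]   # scaled copy of claimVector
orthogonal = [0.0, 0.0, 3.0]      # shares no component with claimVector

print(cosineSimilarity(claimVector, sameDirection))
print(cosineSimilarity(claimVector, orthogonal))
```

A scaled copy of a vector scores close to 1 and an orthogonal vector scores 0, which is why the score can be used to rank articles by how closely their embedding points in the same direction as the claim's.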

In [132]:
from sklearn.metrics.pairwise import cosine_similarity

def findSimilarityScores(inputClaim, extractedArticlesList, model):

  # put the claim in front of the extracted articles so it is embedded with them
  sentences = list(extractedArticlesList)
  sentences.insert(0, inputClaim)
  sentenceEmbeddings = model.encode(sentences)

  # compare the claim embedding (row 0) against every article embedding
  similarityScores = cosine_similarity(
      [sentenceEmbeddings[0]],
      sentenceEmbeddings[1:]
  )

  return similarityScores[0]

In [133]:
import numpy as np

def findTopFiveEvidences(inputClaim, extractedArticlesList, model):

  similarityScores = findSimilarityScores(inputClaim, extractedArticlesList, model)
  # indices of the five highest scores, in no particular order
  topFiveSimilarityScoresIndex = np.argpartition(similarityScores,-5)[-5:]

  numpyExtractedArticlesList = np.array(extractedArticlesList)
  topFiveEvidences = numpyExtractedArticlesList[topFiveSimilarityScoresIndex]

  return topFiveEvidences
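
`np.argpartition` returns the indices of the five largest scores without ordering them, and it raises an error when fewer than five scores are available. If a ranked result is wanted, one alternative is sketched below using the standard library's `heapq.nlargest` in place of `np.argpartition`. The helper name `topKIndices` and the toy scores are assumptions made for illustration.

```python
import heapq

def topKIndices(scores, k=5):
  """Indices of the k highest scores, best first; fewer if len(scores) < k."""
  return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

# Toy similarity scores and placeholder article texts (hypothetical data)
scores = [0.12, 0.87, 0.45, 0.91, 0.33, 0.60, 0.05]
articles = ["a0", "a1", "a2", "a3", "a4", "a5", "a6"]

topIndices = topKIndices(scores)
topArticles = [articles[i] for i in topIndices]

print(topIndices)   # [3, 1, 5, 2, 4]
print(topArticles)  # ['a3', 'a1', 'a5', 'a2', 'a4']
```

With this approach the first element is always the best match, and a short input such as two scores simply yields two indices.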

# Evidence Collector Function

In [134]:
def evidenceCollector(inputClaim, model):

  filteredUrlList = webSearcher(inputClaim, 20)
  extractedArticlesList = createExtractedArticlesList(filteredUrlList)
  topFiveEvidences = findTopFiveEvidences(inputClaim, extractedArticlesList, model)

  return filteredUrlList, extractedArticlesList, topFiveEvidences

### Sample Claims

#### Sample Claim 1

In [135]:
# English: "There is a threat to Teves's life"
sampleClaim1 = "may banta sa buhay ni Teves"
filteredUrlList1, extractedArticlesList1, topFiveEvidences1 = evidenceCollector(sampleClaim1, cosineSimilarityModel)

In [136]:
len(filteredUrlList1)

20

In [140]:
len(extractedArticlesList1)

20

In [137]:
print(topFiveEvidences1)

['Nakiusap muli si Pangulong Ferdinand Marcos, Jr. kay Negros Oriental (3rd District) Rep. Arnolfo Teves, Jr. na umuwi na sa bansa upang harapin ang mga alegasyon laban sa kanya.\n\nPaniniyak ng Pangulo, ibibigay nila ang lahat ng klase ng seguridad sa pagbabalik ni Teves sa bansa.\n\n“But anyway, if to reassure him, we’ll provide all kinds of security kung anong gusto mo. Mayaman ka naman. May private jet ka naman eh. Mag-landing ka kung saan mo gusto, papaligiran – sa Air Force base. Mag-landing siya sa Basa, papaligiran natin ng sundalo, walang makalapit na isang kilometro. So that – that will guarantee his security,” reaksyon ni Marcos matapos dumalo sa ika-126 anibersaryo ng Philippine Army sa Taguig City nitong Miyerkules.\n\nItinanggi rin nito ang ulat sa sinasabing banta sa buhay ni Teves batay na rin nakalap na intelligence information ng Office of the President.\n\n“Well, I have no direct contact with him. Ang nakakausap niya is the Speaker (Martin Romualdez) dahil ‘yun ang S

#### Sample Claim 2

In [141]:
# English: "Tenorio no longer has cancer."
sampleClaim2 = "Wala nang cancer si Tenorio."
filteredUrlList2, extractedArticlesList2, topFiveEvidences2 = evidenceCollector(sampleClaim2, cosineSimilarityModel)



In [142]:
len(filteredUrlList2)

20

In [143]:
len(extractedArticlesList2)

20

In [144]:
print(topFiveEvidences2)

['NAKIKIPAGLABAN ngayon sa colon cancer ang tinaguriang “Iron Man” ng Philippine Basketball Association na si LA Tenorio.\n\nNa-diagnose ng Stage 3 colon cancer ang 38-anyos professional basketball player at Barangay Ginebra guard, base na rin sa official statement na ipinost sa PBA website kahapon, March 21.\n\nHumingi ng paumanhin si LA sa publiko lalo sa mga supporters ng Ginebra dahil inilihim muna niya ang kanyang health condition matapos malaman ang initial diagnosis sa kanya three weeks ago.\n\nMarami kasi ang nagtaka kung bakit hindi nakapaglaro ang veteran cager sa Gin Kings campaign ng PBA Governors’ Cup. Ang announcement ng kanyang team ay sumailalim siya sa surgery noong March 8 “for an aggravated sports hernia”.\n\nView this post on Instagram A post shared by LA Tenorio (@la_tenorio)\n\n\n\n“I was recently diagnosed with Stage 3 colon cancer. The initial testing three weeks ago led me to instantly miss practices and games. I have completed my surgery last week and will soo

# References

[1] https://www.geeksforgeeks.org/performing-google-search-using-python-code/

[2] https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1

[3] https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array