# GOOGLE SEARCH VERIFIER

Given a sentence (a claim), the Google Search Verifier forms a query from it and scrapes the results from the first page of Google. It then calculates the Levenshtein score between the claim and the title of each result; a lower score means the titles are closer to the claim. The averaged score is meant to help flag unreliable data.
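
To make the metric concrete: the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The notebook relies on the `leven` package for this; the pure-Python function below is only an illustrative sketch of the same metric, and the name `levenshtein_distance` is an assumption that does not appear in the notebook.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Illustrative dynamic-programming Levenshtein distance (not the `leven` package)."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Because the distance grows with string length, long claims compared against short titles tend to score high even when the topic matches.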

### Necessary Imports

In [1]:
import pandas as pd
import numpy as np
import urllib.parse
import urllib.request
import ssl
import random
import time
import json
from leven import levenshtein
from bs4 import BeautifulSoup
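# Disable HTTPS certificate verification for all urllib requests (insecure; avoids local certificate errors)
ssl._create_default_https_context = ssl._create_unverified_context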

### List of User Agents
- Google makes it hard to scrape results from its search engine. To reduce the chance of the scraper being blocked as a bot, a list of user agents is defined so that each request looks like it comes from a regular browser.
- For each query, a random user agent is picked from this list.

In [2]:
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.56 (KHTML, like Gecko) Chrome/87.0.4283.88 Safari/535.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.61 (KHTML, like Gecko) Chrome/87.0.4284.88 Safari/532.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_17_5) AppleWebKit/532.60 (KHTML, like Gecko) Chrome/87.0.4288.88 Safari/537.32',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_17_3) AppleWebKit/532.90 (KHTML, like Gecko) Chrome/87.0.4281.88 Safari/537.31',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.74.9 (KHTML, like Gecko) Chrome/87.0.4281.88 Safari/537.74.9',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.2526.106 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.2526.106 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3946.132 Safari/537.32'
]
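
As a minimal sketch of how one entry from such a list can be attached to a request, the snippet below builds a `urllib.request.Request` without sending it. The helper name `build_request` and the shortened `example_user_agents` list are assumptions made for illustration only.

```python
import random
import urllib.request

# Hypothetical two-entry list for illustration; the notebook uses the longer `user_agents` list.
example_user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
]


def build_request(url, agents):
    """Create a Request carrying a randomly chosen User-Agent header (nothing is sent here)."""
    request = urllib.request.Request(url)
    request.add_header('User-agent', random.choice(agents))
    return request


request = build_request('https://google.com/search?q=test', example_user_agents)
print(request.get_header('User-agent'))
```

Printing the header is a quick way to confirm that a single, complete user-agent string was selected.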

### Google Scraper

The algorithm works as follows:
- Take the list of user agents and a dataset of claims
- For each row in the dataset:
    - Create a Google query from the claim
    - Wait `waittime` seconds (5 by default)
    - Create a request with a random user agent
    - Read the response
    - Parse the results
    - Initialise the score and the result count to 0
    - For each result div that appears on the first page:
        - Get the title
        - Calculate the Levenshtein score between the title and the claim
        - Add it to leven_score and increment the result count
    - Save the total leven_score divided by the number of results scraped
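
The steps that do not need network access, building the query URL and averaging the per-title scores, can be sketched as below. This is an illustration under assumptions rather than the notebook's own code: `build_search_url`, `mean_distance`, and `length_gap` are hypothetical names, `quote_plus` is used here as one robust way to encode the claim, and `length_gap` is only a stand-in for the Levenshtein function.

```python
from urllib.parse import quote_plus


def build_search_url(claim):
    """Percent-encode the claim so characters such as '$', '&', or '#' survive in the URL."""
    return 'https://google.com/search?q=' + quote_plus(claim)


def mean_distance(claim, titles, distance):
    """Average `distance(claim, title)` over the titles; -1 signals that nothing was scraped."""
    if not titles:
        return -1
    return sum(distance(claim, title) for title in titles) / len(titles)


def length_gap(a, b):
    """Stand-in metric for the demo only; the notebook uses Levenshtein distance here."""
    return abs(len(a) - len(b))


print(build_search_url('Beyonce faces $20M suit'))
print(mean_distance('abcd', ['ab', 'abcdef'], length_gap))
print(mean_distance('abcd', [], length_gap))
```

Passing the metric in as a parameter keeps the averaging logic testable without the `leven` package or a live search.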

In [6]:
# dataframe -> DataFrame whose text column ("tweet__text" here, or "claim") is queried in Google;
#              the averaged score is written to its "leven__score" column
# waittime -> number of seconds to wait before each query. Default = 5s
def google_scrape(dataframe, waittime = 5):
    for i, row in dataframe.iterrows():
        
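        # Get the claim and form a Google query
        # (use row.claim instead for datasets that store the text in a "claim" column)
        claim = row.tweet__text
        # URL-encode the claim so characters such as '&', '#' or non-ASCII text do not break the URL
        query = 'https://google.com/search?q=' + urllib.parse.quote_plus(claim)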

        # wait "waittime" seconds in between each query
        time.sleep(waittime) 

        # Send the request and parse results
        request = urllib.request.Request(query)
        request.add_header('User-agent', random.choice(user_agents))
        raw_response = urllib.request.urlopen(request).read()
        html = raw_response.decode("utf-8")
        soup = BeautifulSoup(html, 'html.parser')
        divs = soup.select("#search div.g")
        
        results = 0
        leven_score = 0
        # Scrape the titles that appear on the main page
        for div in divs:
            result = div.select("h3")
            if (len(result) >= 1):
                title = result[0].get_text()
                # Drop a trailing ellipsis (and any space before it) from truncated titles
                if title[-3:] == '...':
                    title = title[:-3].rstrip()
                    
                # Calculate levenshtein score between the initial claim and a google result (title)
                leven_score += levenshtein(claim, title)
                results += 1
                
        # Save the average score for this claim once all titles have been processed
        if results != 0:
            dataframe.at[i,'leven__score'] = leven_score/results
        else:
            dataframe.at[i,'leven__score'] = -1 # something went wrong: no titles were scraped
                

In [5]:
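# Scrape Google for every row (4 s between queries), print a preview and save the scores to data_name
def getResults(dataframe, data_name):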
    google_scrape(dataframe, 4)
    print(dataframe.head(5))
    dataframe.to_csv(data_name, encoding='utf-8')

## Performing Google Scraping on particular datasets

### 1. Test
Running the scraper on the full datasets below would take a very long time, so a small test cell is provided here first.
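
As a rough lower bound on that runtime, the pause between queries alone costs `waittime` seconds per row. The sketch below assumes the 4-second wait that `getResults` passes to `google_scrape`; the function name `minimum_runtime` and the row count are hypothetical.

```python
from datetime import timedelta


def minimum_runtime(n_rows, waittime=4):
    """Lower bound on scraping time: one `waittime` pause per row, ignoring network latency."""
    return timedelta(seconds=n_rows * waittime)


# 10,000 rows is a hypothetical dataset size, not a figure from the notebook.
print(minimum_runtime(10_000))  # 11:06:40
```

The actual time is longer, since each request and its parsing add to the pause.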

In [8]:
data = [['Beyonce faces $20M copyright suit from Youtube stars estate', 1, 0], 
        ['Site is getting blown up now thanks to Real Clear Politics', 0, 0]]
df = pd.DataFrame(data, columns = ['tweet__text', 'claim_veracity', 'leven__score'])

getResults(df, 'results.csv')

                                         tweet__text  claim_veracity  \
0  Beyonce faces $20M copyright suit from Youtube...               1   
1  Site is getting blown up now thanks to Real Cl...               0   

   leven__score  
0           5.3  
1          46.6  


### 2. TwitterFakeNews

#### 2.1 Fake entries

In [5]:
Auto_df_fake = pd.read_csv('../Datasets_analysis/TwitterFakeNews/Auto_Format_Fake.csv', 
                           sep=';', usecols = ['tweet__fake', 'tweet__text'], low_memory=False)

# Add the leven__score column that will hold the results
Auto_df_fake['leven__score'] = 0.0
getResults(Auto_df_fake, 'Auto_df_fake_score.csv')

#### 2.2 True Entries

In [7]:
Auto_df_true = pd.read_csv('../Datasets_analysis/TwitterFakeNews/Auto_Format_True.csv', sep=';', 
                           usecols = ['tweet__fake', 'tweet__text'], low_memory=False)

# Add the leven__score column that will hold the results
Auto_df_true['leven__score'] = 0.0
getResults(Auto_df_true, 'Auto_df_true_score.csv')

### 3. LIAR Half-True entries

In [None]:
LIAR_half_true_df = pd.read_csv('../Datasets_analysis/LIAR/LIAR_half_true.csv', 
                                usecols = ['claim_veracity', 'claim'], low_memory=False)

# Add the leven__score column that will hold the results
LIAR_half_true_df['leven__score'] = 0.0
getResults(LIAR_half_true_df, 'LIAR_half_true_df.csv')