# Wikipedia Alternate Titles - Evaluation

## Peter Bokor

In [123]:
import json
import requests
import random
import re 
import numpy as np
from difflib import SequenceMatcher
import pandas as pd
import unidecode

In [124]:
def parse_response(resp):
    """function to parse the web search response and extract found titles"""
    # parse response text as dict 
    res_dict = json.loads(resp.text)
    # extract titles
    titles = [r['title'] for r in res_dict['value']]
    return titles

In [125]:
def string_similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

In [126]:
with open('indices/enwiki_index.json', encoding='utf8') as json_file:
    index = json.load(json_file)

In [127]:
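def choose_alt_and_title(index):
    """Pick a random single-word title and one of its single-word alternate titles."""
    try:
        title = random.choice([k for k in index if len(k.split()) == 1])
        alt = random.choice([v for v in index[title] if len(v.split()) == 1])
        return alt, title
    except IndexError:
        # the chosen title has no single-word alternate title; draw again
        return choose_alt_and_title(index)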

We load the previously created index into memory. The English Wikipedia index is used for the evaluation.

In [128]:
# prepare url and headers
url = "https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/WebSearchAPI"
headers = {
    'x-rapidapi-host': "contextualwebsearch-websearch-v1.p.rapidapi.com",
    'x-rapidapi-key': "9805ad717fmsh887356c31eef1c7p1d4dfejsn31827afa9324"
    }
main_titles, alt_titles, found, matches, similarities = [], [], [], [], []
for i in range(100):
    # select main and alt title at random from index
    title, alt = random.choice(list(index.items()))
    alt = random.choice(alt)
    # preprocess the titles (disambiguations and similar wiki markup may be present)
    title, alt = re.split('[#:(]', title)[0], re.split('[#:(]', alt)[0]
    # query the alt title in web search
    querystring = {"pageNumber":"1","q":alt,"autoCorrect":"false","pageSize":"50"}
    try:
        response = requests.request("GET", url, headers=headers, params=querystring)
        # parse found titles
        found_titles = parse_response(response)
        # Title found or not?
        title_found = any(title in t for t in found_titles)
        # calculate titles similarity
        max_similarity = max(string_similar(t, title) for t in found_titles)

        main_titles.append(title)
        alt_titles.append(alt)
        found.append(found_titles)
        matches.append(title_found)
        similarities.append(max_similarity)
    except Exception:
        # skip queries that fail or return no results
        continue

Above, we draw 100 random records from the index and search for each alternate title using the Web Search API. We parse and store the results for later evaluation.

In [129]:
matches = np.array(matches)
accuracy = len(matches[matches == True]) / len(matches)
accuracy

0.550561797752809

The accuracy is only about 55%, which is rather poor.

In [130]:
sum(similarities) / len(similarities)

0.5382991291393447

The results seem underwhelming. Looking at a few examples (below), however, we notice that the results returned by the search API are often not what we would expect.

Another problem is alternate titles that consist of two or more words. Using string similarity the way we did above may not be the best solution, because the search results are often longer than the phrase we search for, which immediately biases the similarity score.

We cannot fix the API (we could look for a better one, but let us presume that we already have the best one available). We can, however, deal with the second problem. We could calculate the similarity for all subsets of each search result and keep only the highest score. For example, for a 2-word alternate title, we could match the 2 words against every 2-word subset of every result. This would take a lot of time.

A better solution seems to be to use only 1-word alternate titles and to count a match whenever the word is present in any of the results returned by our search.
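The length bias is easy to reproduce. The sketch below repeats the `string_similar` helper defined earlier and compares a title with itself and with a longer string in the style of the returned results; the two input strings are chosen purely for illustration.

```python
from difflib import SequenceMatcher


def string_similar(a, b):
    # same helper as defined earlier in this notebook
    return SequenceMatcher(None, a, b).ratio()


# identical strings: every character matches
exact = string_similar("Phet", "Phet")

# the title is fully contained in the result, but the extra 12 characters
# lower the ratio: 2 * 4 matched chars / (4 + 16) total chars
padded = string_similar("Phet", "Phet - Wikipedia")

print(exact, padded)
```

Even though the result contains the title verbatim, its score is well below that of an exact match, so averaging such scores understates how often the right page was found.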
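A cheaper variant of this idea can be sketched as follows. Note that it is a simplification, not the full subset search described above: it only considers contiguous word windows of the same length as the phrase. The function name `best_window_similarity` is hypothetical and not part of the notebook.

```python
from difflib import SequenceMatcher


def string_similar(a, b):
    # same helper as defined earlier in this notebook
    return SequenceMatcher(None, a, b).ratio()


def best_window_similarity(phrase, result):
    """Highest similarity between `phrase` and any contiguous window of
    `result` with the same number of words (hypothetical helper)."""
    n = len(phrase.split())
    words = result.split()
    if n == 0 or not words:
        return 0.0
    if len(words) <= n:
        # result is not longer than the phrase: compare directly
        return string_similar(phrase, result)
    windows = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(string_similar(phrase, w) for w in windows)


full = string_similar("Golden larch", "Golden larch - Wikipedia")
windowed = best_window_similarity("Golden larch", "Golden larch - Wikipedia")
print(full, windowed)
```

The number of windows grows only linearly with the length of the result, whereas the number of arbitrary word subsets grows much faster, so this version trades completeness for speed.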
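A minimal sketch of such a word-presence check is shown below. It is an assumption about how the check could be written, not the code used in this notebook, which tests `title in t` as a case-sensitive substring. The helper name `word_in_results` is hypothetical, and the case-insensitive, whole-word comparison is a design choice of the sketch.

```python
import re


def word_in_results(word, results):
    """Return True if `word` occurs as a whole word in any result title.

    Hypothetical helper: compares case-insensitively and on word tokens,
    unlike the substring test used in the evaluation loop.
    """
    target = word.casefold()
    for result in results:
        tokens = re.findall(r"\w+", result.casefold())
        if target in tokens:
            return True
    return False


hit = word_in_results("phet", ["Phet - Wikipedia", "Kamphaeng Phet - Wikipedia"])
miss = word_in_results("Lat", ["Latitude - Wikipedia"])
print(hit, miss)
```

Matching on whole words avoids counting a short title as found just because it is a prefix of a longer, unrelated word.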

In [131]:
df = pd.DataFrame({'title': main_titles, 'alt_title': alt_titles, 'found': found, 'match': matches})
df

Unnamed: 0,title,alt_title,found,match
0,Kischinskinia,Kischinskinia scandens,"[Asarina scandens | A Growing Obsession, Hangi...",False
1,Phet,Phet,"[Phet - Wikipedia, Kamphaeng Phet - Wikipedia,...",True
2,Alinja,Əlincə,"[Lincolnshire Police, ABPI LINC | Resource Sea...",False
3,Conboy,Conboy,"[carter conboy Archives - Carter Conboy, Conbo...",True
4,Lat.,Lat.,[Mouths (Lat.) crossword clue - Crossword Quiz...,True
5,WLCT,WLCT-FM,"[WLCT - Wikipedia, WLCT - FM Station Profile -...",True
6,bismuthine,Bismuthane,[what is tris(fluoranyl)bismuthane - LookChemi...,False
7,JDeveloper,Jdeveloper,"[JDeveloper - Wikipedia, JDeveloper 11g Releas...",True
8,Pseudolarix,Golden larch,"[golden larch | Encyclopedia.com, Golden larch...",True
9,Tuulikki,Tuulikki,"[Tuulikki - Wikipedia, Hanna Tuulikki - Wikipe...",True


In [132]:
not_matched = df[df.match == False].reset_index(drop=True)
not_matched

Unnamed: 0,title,alt_title,found,match
0,Kischinskinia,Kischinskinia scandens,"[Asarina scandens | A Growing Obsession, Hangi...",False
1,Alinja,Əlincə,"[Lincolnshire Police, ABPI LINC | Resource Sea...",False
2,bismuthine,Bismuthane,[what is tris(fluoranyl)bismuthane - LookChemi...,False
3,Yalvaç,Yalvac,"[Yalvac Museum, Turkey (Antioch of Pisidia), S...",False
4,Semiperimeter,Semi-perimeter,[Find the value of c in a triangle if the semi...,False
5,Aglet,Agnet,"[Food and Fertilizer Technology Center, FFTC A...",False
6,Placentalia,Evolutionary history of placental mammals,[Skinny 'Shrew' Is Oldest True Mammal | Live S...,False
7,Corectopia,Corectopia in eye,[Delhi doctors flag rise in eye-related proble...,False
8,Anthroponymy,Anthroponym,[Anthroponym - OmegaWiki: Multilingual Diction...,False
9,Bulway,Buluwai,[Buluwai - Wikipedia],False


The "not found" table above lists the terms that the web search we used did not find.

In [133]:
# prepare url and headers
url = "https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/WebSearchAPI"
headers = {
    'x-rapidapi-host': "contextualwebsearch-websearch-v1.p.rapidapi.com",
    'x-rapidapi-key': "9805ad717fmsh887356c31eef1c7p1d4dfejsn31827afa9324"
    }
main_titles_single, alt_titles_single, found_single, matches_single, similarities_single = [], [], [], [], []
for i in range(100):
    # select main and alt title at random from index
    alt, title = choose_alt_and_title(index)
    # preprocess the titles (disambiguations and similar wiki markup may be present)
    title, alt = re.split('[#:(]', title)[0], re.split('[#:(]', alt)[0]
    # query the alt title in web search
    querystring = {"pageNumber":"1","q":alt,"autoCorrect":"false","pageSize":"50"}
    try:
        response = requests.request("GET", url, headers=headers, params=querystring)
        # parse found titles
        found_titles = parse_response(response)
        # Title found or not?
        title_found = any(title in t for t in found_titles)
        # calculate titles similarity
        max_similarity = max(string_similar(t, title) for t in found_titles)

        main_titles_single.append(title)
        alt_titles_single.append(alt)
        found_single.append(found_titles)
        matches_single.append(title_found)
        similarities_single.append(max_similarity)
    except Exception:
        # skip queries that fail or return no results
        continue

In [134]:
matches_single = np.array(matches_single)
accuracy = len(matches_single[matches_single == True]) / len(matches_single)
accuracy

0.3424657534246575

In [135]:
sum(similarities_single) / len(similarities_single)

0.480825282924828

In [136]:
df_single = pd.DataFrame({'title': main_titles_single, 'alt_title': alt_titles_single, 'found': found_single, 'match': matches_single})
df_single

Unnamed: 0,title,alt_title,found,match
0,Luck,Luckily,"[luckily - Wiktionary, Luckily | Define Luckil...",True
1,COSBI,CoSBi,"[COSBI - Wikipedia, Cosbi | rabbisylviarothsch...",True
2,Maljević,Maljevic,[Maljevic Map | Serbia and Montenegro Google S...,False
3,Lightstreamer,Weswit,[Giant iGaming Firm Wirex Joins Forces with We...,True
4,Karabas,Carabas,[Carabas Blog Archive Events Profile: Pop...,False
5,Epicaridea,Bopyroidea,[ADW: Bopyroidea: CLASSIFICATION],False
6,Zuiō-ji,Zuio-ji,[JI served nation during Covid-19 peak - Pakis...,False
7,LQG,LQGs,"[LQG - Wikipedia, LQG Archives | PlanetSave, T...",True
8,OU,Ou.,"[university college of law ou., Author at Lawc...",True
9,Vrskmaň,VrskmaN,[Vrskma - Wikipedia],False


In the end, it really seems that our web search tool is suboptimal: if we search manually using Google, we get much more accurate results. We have to accept this fact, as we have no better alternative. We therefore state that we achieved roughly 55% accuracy and 54% average similarity on the unrestricted sample, and roughly 34% accuracy and 48% average similarity on the single-word sample, where similarity is measured between the phrases we searched for and the ones we found.

**Below are a couple of examples that demonstrate the suboptimality of the web search. They are all taken from the "not found" table and searched for with Google.**

- Simplified Memory-Bounded A* - https://www.google.com/search?q=Simplified+Memory-Bounded+A*&oq=Simplified+Memory-Bounded+A*&aqs=chrome..69i57&sourceid=chrome&ie=UTF-8
- Chyormozskoye Urban Settlement - https://www.google.com/search?q=Chyormozskoye+Urban+Settlement&oq=Chyormozskoye+Urban+Settlement&aqs=chrome..69i57&sourceid=chrome&ie=UTF-8
- Tectonostratigraphic - https://www.google.com/search?q=Tectonostratigraphic&oq=Tectonostratigraphic&aqs=chrome..69i57&sourceid=chrome&ie=UTF-8
- Mansôa, Guinea-Bissau - https://www.google.com/search?q=Mans%C3%B4a%2C+Guinea-Bissau&oq=Mans%C3%B4a%2C+Guinea-Bissau&aqs=chrome..69i57&sourceid=chrome&ie=UTF-8
- Dediyapada - https://www.google.com/search?q=Dediyapada&oq=Dediyapada&aqs=chrome..69i57&sourceid=chrome&ie=UTF-8