
## Definitions of procedures

**1. get_page(url):** This function takes a URL and tries to fetch the content of the webpage at that URL. It returns the page content as a string. If it encounters an error (for example, the page does not exist), it returns an empty string.

**2. get_next_target(page):** This function finds the next hyperlink in the given page content. It returns the URL of the hyperlink and the position of the URL's closing quote in the page content. If no hyperlink is found, it returns None and 0.

**3. get_all_links(page):** It extracts all hyperlinks from the given page content. This is done by repeatedly calling get_next_target and collecting the URLs until no more links are found. It returns a list of URLs.

**4. union(p, q):** This function takes two lists p and q and appends to p every element of q that is not already in p. It modifies p in place and is used to avoid duplicate URLs in the list of pages to crawl.

**5. add_to_index(index, keyword, url):** This function adds a keyword and its corresponding URL to an index, which is a dictionary. If the keyword already exists in the index, the URL is appended to the list of URLs associated with that keyword. If not, a new entry is created.

**6. getClearPage(content):** This function extracts the title and body from a webpage's HTML content and removes all HTML tags from the body. It returns the title and body concatenated as clean text.

**7. addPageToIndex(index, url, content):** This function adds every word found in the content of a page to the index, together with the page's URL. It uses getClearPage to clean the HTML content and add_to_index to add each word to the index.

**8. lookup(index, keyword):** This function searches for a keyword in the index and returns the list of URLs associated with that keyword. If the keyword is not found, it returns None.

**9. computeRanks(graph):** This function calculates a rank for each page in a web graph, based on a simplified version of Google's PageRank algorithm. It uses a damping factor of 0.8 and runs 10 iterations. Pages are ranked higher if they are linked to by many pages or by pages that are themselves highly ranked.

**10. crawlWeb(seed):** This is the main function of the web crawler. It starts with a seed page and crawls the web from there, using get_page to fetch content, addPageToIndex to index it, get_all_links to find new pages to crawl, and union to add them to the list of pages to crawl. It keeps track of pages it has already crawled to avoid revisiting them. The function returns the index (a mapping from words to URLs) and a graph of outlinks (a mapping from each page to the pages it links to).

Finally, the script crawls the web starting from a seed page (seedpage) and then computes the ranks of the pages in the crawled web graph.
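
The update that computeRanks applies on each pass can be written out as a formula. This is a reading of the code in the next cell, with the damping factor $d = 0.8$ and 10 passes taken from it; the symbols $N$, $\mathrm{in}(p)$ and $\mathrm{out}(q)$ are notation introduced here, not names used in the script:

$$
r_{t+1}(p) = \frac{1-d}{N} + d \sum_{q \in \mathrm{in}(p)} \frac{r_t(q)}{|\mathrm{out}(q)|},
\qquad r_0(p) = \frac{1}{N}
$$

Here $N$ is the number of pages in the graph, $\mathrm{in}(p)$ is the set of crawled pages whose outlink list contains $p$, and $|\mathrm{out}(q)|$ is the length of the outlink list of $q$.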

In [1]:
import urllib.request


def get_page(url):
  # Fetch the page and decode it as UTF-8; return "" on any failure
  try:
    page = urllib.request.urlopen(url).read()
    return page.decode("utf-8")
  except Exception:
    return ""


def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote+1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote


def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:  # no more links on the page
            break
        if url:  # skip empty href values instead of stopping at them
            links.append(url)
        page = page[endpos:]
    return links


def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)


def add_to_index(index, keyword, url):
  if keyword in index:
    index[keyword].append(url)
  else:
    index[keyword] = [url]


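def getClearPage(content):
  # Take the text between <title>...</title> and between <body>...</body>
  title = content[content.find("<title>")+7:content.find("</title>")]
  body = content[content.find("<body>")+6:content.find("</body>")]
  # Strip tags: repeatedly cut everything from "<" up to the next ">"
  start = body.find("<")
  while start != -1:
    end = body.find(">", start)
    if end == -1:  # unmatched "<": stop instead of looping forever
      break
    body = body[:start] + body[end+1:]
    start = body.find("<")
  # Separate title and body so their words do not run together
  return title + " " + body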


def addPageToIndex(index, url, content):
  content = getClearPage(content)
  words = content.split()
  for word in words:
    add_to_index(index, word, url)


def lookup(index, keyword):
  if keyword in index:
    return index[keyword]
  else:
    return None

def computeRanks(graph):
  damping = 0.8   # damping factor
  numloops = 10   # number of update passes
  ranks = {}
  npages = len(graph)
  # Start every page with the same rank
  for page in graph:
    ranks[page] = 1/npages
  for i in range(numloops):
    newranks = {}
    for page in graph:
      newrank = (1-damping)/npages
      # Each page that links here passes on an equal share of its rank
      for node in graph:
        if page in graph[node]:
          newrank = newrank + damping*(ranks[node]/len(graph[node]))
      newranks[page] = newrank
    ranks = newranks
  return ranks


def crawlWeb(seed):
    tocrawl = [seed]
    crawled = []
    graph = {}
    index = {}
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            addPageToIndex(index, page, content)
            outlinks = get_all_links(content)
            graph[page] = outlinks
            union(tocrawl, outlinks)
            crawled.append(page)
    return index, graph  # word-to-URLs index and page-to-outlinks graph


seedpage = "https://ayaktayolcukalmasin.com.tr/ana_sayfa.html"
index, graph = crawlWeb(seedpage)
ranks = computeRanks(graph)
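
The cell above produces both an index and a set of ranks, but nothing combines them. One way they might be used together is sketched below, assuming a search should return the pages for a keyword ordered by rank. The helper ranked_lookup and the toy data are hypothetical and are not part of the notebook. Unlike lookup, the sketch returns an empty list rather than None when the keyword is missing.

```python
def ranked_lookup(index, ranks, keyword):
    """Hypothetical helper: URLs that contain keyword, highest rank first."""
    urls = index.get(keyword, [])
    # add_to_index appends a URL once per occurrence of the word,
    # so the same URL can appear several times; keep each one once.
    unique_urls = set(urls)
    # Pages without a rank entry are treated as rank 0.0 (an assumption).
    return sorted(unique_urls, key=lambda url: ranks.get(url, 0.0), reverse=True)


# Toy data invented for illustration only, not taken from the crawl
toy_index = {"kebab": ["b.html", "a.html", "b.html"], "tea": ["a.html"]}
toy_ranks = {"a.html": 0.10, "b.html": 0.25}

print(ranked_lookup(toy_index, toy_ranks, "kebab"))   # ['b.html', 'a.html']
print(ranked_lookup(toy_index, toy_ranks, "coffee"))  # []
```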

In [3]:
print(f"The graph has {len(graph)} elements. These are:")
for n, (key, value) in enumerate(graph.items(), start=1):
  print(f"\t{n}. ", key, ":", value)

The graph has 6 elements. These are:
	1.  https://ayaktayolcukalmasin.com.tr/ana_sayfa.html : ['https://ayaktayolcukalmasin.com.tr/ankara.html', 'https://ayaktayolcukalmasin.com.tr/konya.html', 'https://ayaktayolcukalmasin.com.tr/istanbul.html', 'https://ayaktayolcukalmasin.com.tr/oktayrecommends.html', 'https://ayaktayolcukalmasin.com.tr/seymarecommends.html']
	2.  https://ayaktayolcukalmasin.com.tr/seymarecommends.html : ['https://ayaktayolcukalmasin.com.tr/oktayrecommends.html', 'https://ayaktayolcukalmasin.com.tr/konya.html']
	3.  https://ayaktayolcukalmasin.com.tr/oktayrecommends.html : ['https://ayaktayolcukalmasin.com.tr/istanbul.html']
	4.  https://ayaktayolcukalmasin.com.tr/istanbul.html : []
	5.  https://ayaktayolcukalmasin.com.tr/konya.html : ['https://ayaktayolcukalmasin.com.tr/seymarecommends.html']
	6.  https://ayaktayolcukalmasin.com.tr/ankara.html : []
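
The outlink lists above come from get_all_links, which recognizes only anchors written exactly as `<a href="...">`, with href as the first attribute and double quotes around the URL. As a hedged alternative (not what the notebook uses), the standard library's html.parser module can collect href values regardless of attribute order, letter case or quote style. LinkCollector and extract_links are hypothetical names introduced for this sketch.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page):
    collector = LinkCollector()
    collector.feed(page)
    return collector.links


sample = """<a class="nav" href='a.html'>A</a>
<A HREF="b.html">B</A>
<a name="top">no link here</a>"""

links = extract_links(sample)
print(links)  # ['a.html', 'b.html']
```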


In [None]:
for key, value in ranks.items():
  print("The rank of the page", key,":\t", value)

The rank of the page https://ayaktayolcukalmasin.com.tr/ana_sayfa.html :	 0.033333333333333326
The rank of the page https://ayaktayolcukalmasin.com.tr/seymarecommends.html :	 0.10274769919999999
The rank of the page https://ayaktayolcukalmasin.com.tr/oktayrecommends.html :	 0.07998944255999998
The rank of the page https://ayaktayolcukalmasin.com.tr/istanbul.html :	 0.10274769919999999
The rank of the page https://ayaktayolcukalmasin.com.tr/konya.html :	 0.07998944255999998
The rank of the page https://ayaktayolcukalmasin.com.tr/ankara.html :	 0.038666666666666655
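
The ranks above can be checked offline from the printed graph alone. The sketch below is a hypothetical variant of computeRanks (the name compute_ranks_by_inlinks is an assumption) that inverts the graph once instead of scanning every page on every pass. The graph is copied from the output above with the common prefix https://ayaktayolcukalmasin.com.tr/ dropped for readability.

```python
def compute_ranks_by_inlinks(graph, damping=0.8, numloops=10):
    """Hypothetical variant of computeRanks that precomputes incoming links."""
    npages = len(graph)
    # Invert the graph once: inlinks[page] = set of pages that link to it.
    inlinks = {page: set() for page in graph}
    for node, outlinks in graph.items():
        for target in outlinks:
            if target in inlinks:  # ignore links to pages that were not crawled
                inlinks[target].add(node)
    ranks = {page: 1 / npages for page in graph}
    for _ in range(numloops):
        # The comprehension reads the old ranks; the new dict replaces them after.
        ranks = {
            page: (1 - damping) / npages
            + sum(damping * ranks[node] / len(graph[node]) for node in inlinks[page])
            for page in graph
        }
    return ranks


# Graph copied from the crawl output, URL prefix removed
site_graph = {
    "ana_sayfa.html": ["ankara.html", "konya.html", "istanbul.html",
                       "oktayrecommends.html", "seymarecommends.html"],
    "seymarecommends.html": ["oktayrecommends.html", "konya.html"],
    "oktayrecommends.html": ["istanbul.html"],
    "istanbul.html": [],
    "konya.html": ["seymarecommends.html"],
    "ankara.html": [],
}

site_ranks = compute_ranks_by_inlinks(site_graph)
for page, rank in site_ranks.items():
    print(f"{page}: {rank:.6f}")
```

No page links to ana_sayfa.html, so its rank stays at (1 - 0.8) / 6 ≈ 0.033333. The only page linking to ankara.html is ana_sayfa.html, which has five outlinks, so ankara.html settles at 0.033333 + 0.8 × 0.033333 / 5 ≈ 0.038667. Both values agree with the output above.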
