# Web Crawler implementation using Python
### WebCrawlerPython.py
There are three functions in this script:
- `findDomain` <- finds the root domain using the `tld` library

```python
from tld import get_tld


def findDomain(url):
    # get_tld returns a result object whose .fld attribute holds the
    # first-level domain, e.g. "example.com"
    root = get_tld(url, as_object=True)
    # Strip everything from the first dot onward: "example.com" -> "example"
    return root.fld[0:root.fld.find('.')]
```
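For example, assuming the `tld` package is installed (`pip install tld`), the function strips the suffix from the registered domain:

```python
# Hypothetical URL, used purely for illustration
print(findDomain("https://www.example.com/page"))  # prints "example"
```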
- `valid_url` <- validates a URL using `urlparse` from the `urllib.parse` library

```python
from urllib.parse import urlparse


def valid_url(url):
    # A URL counts as valid only if it has both a scheme (e.g. "https")
    # and a network location (the host)
    validated = urlparse(url)
    return bool(validated.scheme) and bool(validated.netloc)
```
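As a quick illustration, an absolute URL passes the check, while a relative path has neither a scheme nor a host and fails:

```python
print(valid_url("https://www.example.com/page"))  # True
print(valid_url("/about"))  # False
```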
- `getLinks` <- retrieves page content and collects same-domain links using BeautifulSoup

`getLinks` initializes a dictionary for the links, finds the root domain using the `findDomain` helper, and parses the HTML content using BeautifulSoup. For every `<a>` tag it finds, it returns the dictionary once it holds 25 links; otherwise it reads the `href` attribute, checks that it is neither empty nor `None`, and validates it with `valid_url`. It then checks that the link is on the same root domain and is not an internal page (fragment) link, retrieves the HTML content with `urllib.request`, and inserts it into the dictionary.

```python
import urllib.request

import requests
from bs4 import BeautifulSoup


def getLinks(url):
    # Initialize the dictionary that maps each link to its HTML content
    links = {}
    # Root domain of the starting page, used to stay on the same site
    domain = findDomain(url)
    # Parse the page's HTML content with BeautifulSoup
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for link in soup.find_all("a"):
        # Stop once 25 links have been collected
        if len(links) == 25:
            return links
        valid = link.attrs.get("href")
        # Skip empty or missing href attributes
        if valid == "" or valid is None:
            continue
        # Skip anything that is not a well-formed URL
        if not valid_url(valid):
            continue
        # Keep only links on the same root domain, and skip internal
        # page (fragment) links containing "#"
        if findDomain(valid) == domain and "#" not in valid:
            try:
                # Retrieve the HTML content to insert into the dictionary
                temp = urllib.request.urlopen(valid).read()
            except Exception:
                continue
            print("adding link to list: ", valid)
            links[valid] = temp
    return links
```
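The write-out step below assumes a dictionary named `urls`, which is produced by calling `getLinks` on a starting page (hypothetical URL shown):

```python
# Crawl the starting page and collect up to 25 same-domain links
urls = getLinks("https://www.example.com/")
```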
Once the dictionary has been completed, the script writes all items into `Links.txt` in JSON format, one object per line:

```python
import json

with open("Links.txt", "w") as file:
    for link in urls:
        # Write one JSON object per line: the link and the HTML fetched for it
        file.write(json.dumps({"link": link, "url": str(urls[link])}) + "\n")
```