### Web Scraping Sample Work Portfolio
```
Here I'm scraping SEO data from SpyFu (https://www.spyfu.com/overview/domain?tour=true&query=www.semrush.com).
The selected features are as follows:
    - Monthly Domain Overview Data,
    - Competition Data,
    - Top Keywords Data (Paid & Organic Keywords),
    - Google Ads Buy Recommendations Data,
    - Inbound Links (Internal & External Links), etc.
```
```
To get the page source I'm using Selenium with the Chrome browser. The same could be done with the requests or Scrapy modules, but a real browser is used here to avoid being blocked.
```

In [1]:
# ---------- Importing Dependencies ----------
from selenium import webdriver
from time import sleep
from selenium.webdriver import Chrome
from shutil import which
from selenium.webdriver.chrome.options import Options
from scrapy.selector import Selector

In [2]:
# ----- Using Selenium for Getting Complete Page Source -----
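from selenium.webdriver.chrome.service import Service
chrome_options = Options()
chrome_path = which("chromedriver.exe")
# Selenium 4 takes the driver path through a Service object; recent releases no longer accept executable_path
driver = webdriver.Chrome(service=Service(chrome_path), options=chrome_options)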
driver.maximize_window()
url = 'https://www.spyfu.com/overview/domain?tour=true&query=www.semrush.com'
driver.get(url)
sleep(5)
sel = Selector(text=driver.page_source)
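
The fixed `sleep(5)` above assumes the page finishes rendering within five seconds. A minimal sketch of a polling alternative follows. `wait_until` is a hypothetical helper written for illustration, not part of Selenium or of the original notebook; Selenium's own `WebDriverWait` covers the same need and may be the better choice in practice.

```python
import time


def wait_until(predicate, timeout=15.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with the driver from the cell above (assumes the
# overview panel is the last thing to render):
# wait_until(lambda: driver.find_elements("xpath", '//div[@class="sf-figure-trio"]'))
```

Polling returns as soon as the condition holds, so a fast page is not delayed and a slow page fails loudly instead of producing an empty selector.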

In [3]:
# ---------- Monthly Domain Overview ----------
# Organic Search (SEO) ------
total_organic_keywords = sel.xpath('//div[@class="sf-figure-trio"][contains(.,"Organic Search")]/div[@class="first-row"]/a[contains(@class,"left")]//span/span/text()').get()
est_organic_est_monthly_seo_clicks = "".join([''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('//div[@class="sf-figure-trio"][contains(.,"Organic Search")]/div[@class="first-row"]/a[contains(@class,"right")]')])
est_monthly_seo_click_value = "".join([''.join(i.xpath('./span/text()').getall()) for i in sel.xpath('//div[@class="sf-figure-trio"][contains(.,"Organic Search")]/a/span')])
# Paid Search (PPC) ------
total_paid_keywords = sel.xpath('//div[@class="sf-figure-trio"][contains(.,"Paid Search")]/div[@class="first-row"]/a[contains(@class,"left")]//span/span/text()').get()
est_paid_monthly_seo_clicks = "".join([''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('//div[@class="sf-figure-trio"][contains(.,"Paid Search")]/div[@class="first-row"]/a[contains(@class,"right")]')])
est_monthly_google_ads_budget = "".join([''.join(i.xpath('./span/text()').getall()) for i in sel.xpath('//div[@class="sf-figure-trio"][contains(.,"Paid Search")]/a/span')])
# ---------- Competition ----------
organic_competitors = sel.xpath('//div[@class="competitor-panel sf-panel sf-global-component tw-shadow"]/div[contains(.,"Organic Competitors")]/..//div[@class="competitors-chart"]//a/div/label/text()').getall()
paid_competitors = sel.xpath('//div[@class="competitor-panel sf-panel sf-global-component tw-shadow"]/div[contains(.,"Paid Competitors")]/..//div[@class="competitors-chart"]//a/div/label/text()').getall()
# ---------- Top Keywords ----------
# Top Organic Keywords ------
rank = [i.strip() for i in sel.xpath('//div[@class="sf-grid-inner"]//div[@class="sf-panel sf-global-component tw-shadow"]/div[@class="sf-table"][contains(.,"Organic Keyword")]//tbody/tr/td/div/text()').getall()]
organic_keyword = sel.xpath('//div[@class="sf-grid-inner"]//div[@class="sf-panel sf-global-component tw-shadow"]/div[@class="sf-table"][contains(.,"Organic Keyword")]//tbody/tr/td/div/a/span/text()').getall()
seo_click = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('//div[@class="sf-grid-inner"]//div[@class="sf-panel sf-global-component tw-shadow"]/div[@class="sf-table"][contains(.,"Organic Keyword")]//tbody/tr/td/div/span')]
# Top Paid Keywords ------
paid_keywords_ = sel.xpath('//div[@class="sf-grid-inner"]//div[@class="sf-panel sf-global-component tw-shadow"]/div[@class="sf-table"][contains(.,"Paid Keyword")]//tbody[@class="sf-global-component sf-table-body"]/tr//a//text()').getall()
cost_per_click_ = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('//div[@class="sf-grid-inner"]//div[contains(@class,"component")]/div[@class="sf-table"][contains(.,"Paid Keyword")]//tbody[contains(@class,"global-component")]/tr//td[contains(@class,"cpc")]/div/span')]
monthly_cost_ = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('//div[@class="sf-grid-inner"]//div[contains(@class,"component")]/div[@class="sf-table"][contains(.,"Paid Keyword")]//tbody[contains(@class,"global-component")]/tr//td[contains(@class,"monthly-cost-cell")]/div/span')]
# ---------- Top Pages -----------
top_page_name = [i.replace('.','') for i in sel.xpath('//table[@class="tw-border-none tw-border-collapse tw-w-full"]/tbody/tr//a[@class="tw-font-semibold"]/text()').getall()]
top_page_links = sel.xpath('//table[@class="tw-border-none tw-border-collapse tw-w-full"]/tbody/tr//a[@class="url"]/text()').getall()
est_monthly_seo_clicks = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('(//table[@class="tw-border-none tw-border-collapse tw-w-full"]//span[@class="sf-metricized-number"])[position() <= 5]')]
# ---------- Top Google Ads Buy Recommendations ----------
recommend_keywords = [i.strip().replace('.','_') for i in sel.xpath('//div[@data-test="recommendations"]/div[@class="recommendation"]//a[contains(@href,"/keyword/")]/text()').getall()]
# recommend_keywords = [i.strip() for i in sel.xpath('//div[@data-test="recommendations"]/div[@class="recommendation"]//a[contains(@href,"/keyword/")]/text()').getall()]
buying_recommendation = [i.strip() for i in sel.xpath('//div[@data-test="recommendations"]/div[@class="recommendation"]//div[@class="progress-meter-cell-inner"]/div[contains(@style,"color")]/text()').getall()]
impression_per_month = sel.xpath('//div[@data-test="recommendations"]/div[@class="recommendation"]//span[@class="sf-figure-value"]/text()').getall()
# ---------- Inbound Links (Backlinks) ----------
# web_name = [i.split('.')[1] for i in sel.xpath('//div[@class="overview-backlinks"]//table//a[contains(@class,"tw-block")]/@href').getall()]
backlink = [i.replace('.','_') for i in sel.xpath('//div[@class="overview-backlinks"]//table//a[contains(@class,"tw-block")]/@href').getall()]
domain_monthly_organic_clicks = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('(//div[@class="overview-backlinks"]//table/tbody/tr/td[3]//span[@class="sf-metricized-number"])[position()<=5]')]
page_monthly_organic_clicks = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('(//div[@class="overview-backlinks"]//table/tbody/tr/td[4]//span[@class="sf-metricized-number"])[position()<=5]')]
domain_strength = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('(//div[@class="overview-backlinks"]//table/tbody/tr/td[5]//span[@class="sf-metricized-number"])[position()<=5]')]
ranked_keywords = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('(//div[@class="overview-backlinks"]//table/tbody/tr/td[6]//span[@class="sf-metricized-number"])[position()<=5]')]
outbound_links = [''.join(i.xpath('.//span/text()').getall()) for i in sel.xpath('(//div[@class="overview-backlinks"]//table/tbody/tr/td[7]//span[@class="sf-metricized-number"])[position()<=5]')]

### Monthly Domain Overview

In [4]:
# Function for getting Monthly Domain Overview Data ----------
# Paid Domain Overview Data -----
def monthly_paid_domain_overview(total_paid_keywords, est_paid_monthly_seo_clicks, est_monthly_google_ads_budget):
    final_dict = {}
    final_dict['Total Paid Keywords'] = total_paid_keywords
    final_dict['Est Paid Monthly SEO Clicks'] = est_paid_monthly_seo_clicks
    final_dict['Est Monthly Google Ads Budget'] = est_monthly_google_ads_budget
    return final_dict
monthly_paid_domain_overviews = monthly_paid_domain_overview(total_paid_keywords, est_paid_monthly_seo_clicks, est_monthly_google_ads_budget)

In [5]:
# Function for getting Monthly Domain Overview Data ----------
# Organic Domain Overview Data -----
def monthly_organic_domain_overview(total_organic_keywords, est_organic_est_monthly_seo_clicks, est_monthly_seo_click_value):
    final_dict = {}
    final_dict['Total Organic Keywords'] = total_organic_keywords
    final_dict['Est Organic Est Monthly SEO Clicks'] = est_organic_est_monthly_seo_clicks
    final_dict['Est Monthly SEO Click Value'] = est_monthly_seo_click_value
    return final_dict
monthly_organic_domain_overviews = monthly_organic_domain_overview(total_organic_keywords, est_organic_est_monthly_seo_clicks, est_monthly_seo_click_value)

### Competition

In [6]:
# Function for fetching Competition data for Organic and Paid Competitors ----------
def competition(organic_competitors, paid_competitors):
    final_dict = {}
    final_dict['Organic Competitors'] = organic_competitors
    final_dict['Paid Competitors'] = paid_competitors
    return final_dict
competitions = competition(organic_competitors, paid_competitors)

### Top Keywords

In [7]:
# Function for fetching data of Top Organic Keywords of that particular website ----------
def top_organic_keywords(organic_keyword,rank,seo_click):
    inputs = zip(organic_keyword,rank,seo_click)
    final_dict = {}
    for organic_keyword, rank, seo_click in inputs:
        final_dict.update({
            organic_keyword : {
                "Rank" : rank,
                "SEO Clicks" : seo_click,
                        },
                })

    return final_dict
top_organic_keyword = top_organic_keywords(organic_keyword,rank,seo_click)
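
The keyword, rank, and click lists come from separate XPath queries, and `zip` stops at the shortest list without warning, so a missing cell can silently drop or misalign rows. The sketch below shows one way to guard against that. `columns_to_dict` is a hypothetical helper, not part of the original notebook, and the sample values are taken from the final output of this notebook.

```python
def columns_to_dict(keys, columns):
    """Build {key: {label: value}} from parallel lists.

    `columns` maps a label to a list of values. Raises ValueError if the
    lists are not all the same length, instead of truncating like zip().
    """
    lengths = {"keys": len(keys)}
    lengths.update({label: len(values) for label, values in columns.items()})
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return {
        key: {label: values[i] for label, values in columns.items()}
        for i, key in enumerate(keys)
    }


sample = columns_to_dict(
    ["white label facebook ads", "seo reseller"],
    {"Rank": ["4", "19"], "SEO Clicks": ["24", "16"]},
)
```

On Python 3.10 or later, `zip(..., strict=True)` raises on a length mismatch as well and could be used inside the existing functions instead.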

In [8]:
# Function for fetching data of Top Paid Keywords of that particular website ----------
def top_paid_keywords(paid_keywords_, cost_per_click_, monthly_cost_):
    inputs = zip(paid_keywords_, cost_per_click_, monthly_cost_)
    final_dict = {}
    for paid_keywords_, cost_per_click_, monthly_cost_ in inputs:
        final_dict.update({
            paid_keywords_ : {
                "Cost Per Click" : cost_per_click_,
                "Monthly Cost" : monthly_cost_,
                        },
                })

    return final_dict
top_paid_keyword = top_paid_keywords(paid_keywords_, cost_per_click_, monthly_cost_)

### Top Pages

In [9]:
# Function for scraping the data of Top SEO pages of that website ----------
def top_page(top_page_name, top_page_links, est_monthly_seo_clicks):
    inputs = zip(top_page_name, top_page_links, est_monthly_seo_clicks)
    final_dict = {}
    for top_page_name, top_page_links, est_monthly_seo_clicks in inputs:
        final_dict.update({
            top_page_name : {
                "Page Link" : top_page_links,
                "Est Monthly SEO Clicks" : est_monthly_seo_clicks,
                        },
                })

    return final_dict
top_pages = top_page(top_page_name, top_page_links, est_monthly_seo_clicks)

### Top Google Ads Buy Recommendations

In [10]:
# Function for getting detailed data of Google Ads Buying Recommendations ----------
def top_google_ads_buy_recommendation(recommend_keywords, buying_recommendation, impression_per_month):
    inputs = zip(recommend_keywords, buying_recommendation, impression_per_month)
    final_dict = {}
    for recommend_keywords, buying_recommendation, impression_per_month in inputs:
        final_dict.update({
            recommend_keywords : {
                "Buying Recommendations" : buying_recommendation,
                "Impression/Month" : impression_per_month,
                        },
                })

    return final_dict
top_google_ads_buy_recommendations = top_google_ads_buy_recommendation(recommend_keywords, buying_recommendation, impression_per_month)

### Inbound Links (Backlinks)

In [11]:
# Function for getting complete data of Inbound and Backlinks of website ----------
def inbound_link(backlink, domain_monthly_organic_clicks, page_monthly_organic_clicks, domain_strength, ranked_keywords, outbound_links):
    inputs = zip(backlink, domain_monthly_organic_clicks, page_monthly_organic_clicks, domain_strength, ranked_keywords, outbound_links)
    final_dict = {}
    for backlink, domain_monthly_organic_clicks, page_monthly_organic_clicks, domain_strength, ranked_keywords, outbound_links in inputs:
        final_dict.update({
            backlink : {
                "Domain Monthly Organic Clicks" : domain_monthly_organic_clicks,
                "Page Monthly Organic Clicks" : page_monthly_organic_clicks,
                "Domain Strength" : domain_strength,
                "Ranked Keywords" : ranked_keywords,
                "Outbound Links" : outbound_links,
                        },
                })

    return final_dict
inbound_links = inbound_link(backlink, domain_monthly_organic_clicks, page_monthly_organic_clicks, domain_strength, ranked_keywords, outbound_links)

### Final Results
```
Here I created a JSON-format schema for the scraped data, which makes it easy to store the data in a
database and fetch it back directly.
```
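
A nested dictionary of strings and lists, like the one assembled in this section, serializes to JSON text without extra work. The sketch below shows one possible way to store and reload such a report. The notebook does not name a database, so SQLite, the `seo_reports` table, and the `save_report` and `load_report` helpers are all assumptions made for illustration.

```python
import json
import sqlite3


def save_report(conn, domain, report):
    """Store one report dict as JSON text, keyed by domain (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seo_reports "
        "(domain TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO seo_reports (domain, payload) VALUES (?, ?)",
        (domain, json.dumps(report)),
    )
    conn.commit()


def load_report(conn, domain):
    """Return the stored report as a dict, or None if the domain is unknown."""
    row = conn.execute(
        "SELECT payload FROM seo_reports WHERE domain = ?", (domain,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# In-memory database, for demonstration only
conn = sqlite3.connect(":memory:")
example = {"Competition": {"Organic Competitors": ["agencyplatform.com"], "Paid Competitors": []}}
save_report(conn, "example-domain", example)
restored = load_report(conn, "example-domain")
```

A document store would accept the dictionary directly; the JSON-in-a-text-column layout here is only the simplest option available in the standard library.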

In [251]:
# Combine the results of all functions into one JSON-serializable dictionary ----------
def finalDataInJsonFormat():
    final = {}
    final['Organic Search (SEO)'] = monthly_organic_domain_overviews
    final['Paid Search (PPC)'] = monthly_paid_domain_overviews
    final['Competition'] = competitions
    final['Top Organic Keywords'] = top_organic_keyword
    final['Top Paid Keywords'] = top_paid_keyword
    final['Top Pages'] = top_pages
    final['Top Google Ads Buy Recommendations'] = top_google_ads_buy_recommendations
    final['Inbound Links (Backlinks)'] = inbound_links
    return final
finalDataInJsonFormat()

{'Organic Search (SEO)': {'Total Organic Keywords': '684',
  'Est Organic Est Monthly SEO Clicks': '405',
  'Est Monthly SEO Click Value': '$1.16k'},
 'Paid Search (PPC)': {'Total Paid Keywords': '0',
  'Est Paid Monthly SEO Clicks': '-',
  'Est Monthly Google Ads Budget': '-'},
 'Competition': {'Organic Competitors': ['agencyplatform.com',
   'rocketdriver.com',
   'seoresellerusa.com',
   'whitelabelseo.com',
   'conduitdigital.us'],
  'Paid Competitors': []},
 'Top Organic Keywords': {'white label facebook ads': {'Rank': '4',
   'SEO Clicks': '24'},
  'agency platform': {'Rank': '4', 'SEO Clicks': '24'},
  'digital agency white label': {'Rank': '13', 'SEO Clicks': '16'},
  'seo reseller': {'Rank': '19', 'SEO Clicks': '16'},
  'white label facebook advertising agency': {'Rank': '2',
   'SEO Clicks': '14'}},
 'Top Paid Keywords': {},
 'Top Pages': {'DashClicks: White-Label Solutions for Digital Marketing Agencies': {'Page Link': 'dashclicks.com',
   'Est Monthly SEO Clicks': '132'},
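
The scraped metrics are display strings such as `'684'`, `'$1.16k'`, and `'-'`, which cannot be sorted or summed as they stand. The sketch below converts them to numbers. `parse_metric` is a hypothetical helper, and reading the `k`, `m`, and `b` suffixes as thousand, million, and billion is an assumption about SpyFu's formatting rather than something the page source confirms.

```python
from decimal import Decimal, InvalidOperation

# Assumed meaning of the abbreviated suffixes
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_metric(text):
    """Convert a display string like '$1.16k' to a Decimal, or None if empty."""
    cleaned = text.strip().replace("$", "").replace(",", "").lower()
    if cleaned in ("", "-"):
        return None  # SpyFu shows '-' when no value is available
    multiplier = 1
    if cleaned[-1] in SUFFIXES:
        multiplier = SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        return Decimal(cleaned) * multiplier
    except InvalidOperation:
        return None  # unrecognized format


click_value = parse_metric("$1.16k")
```

`Decimal` is used instead of `float` so that values such as `1.16` scale exactly.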
 

##### Thank You