<a href="https://colab.research.google.com/github/hamletbatista/inbound/blob/master/CheatSheet_Web_Scraping101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping 101 with Python

Tool: https://scrapy.org/


Libraries

1. Scrapy

**The value of original research.**

We are going to collect the SEO and accessibility scores of themes in the Shopify Marketplace.

I performed two similar studies for Practical Ecommerce but focused on speed.

https://www.practicalecommerce.com/assessing-googles-core-web-vitals-on-shopify-themes

https://www.practicalecommerce.com/page-speed-scores-of-every-shopify-theme


## Scraping themes in two phases

### First, let's define custom HTML DOM element extractors

![alt text](https://github.com/hamletbatista/inbound/raw/master/selectors.png)

1. Select the theme name, right-click, and click "Inspect Element".
2. Hover over the highlighted HTML tag and right-click again.
3. Select "Copy > Copy selector".

Please paste the selectors in the form below.

### Phase 1: Get the theme name and URL by following links

In [None]:
%%capture
!pip install scrapy

In [None]:
%%writefile shopifyspider.py

import scrapy

class ShopifyThemeSpider(scrapy.Spider):
    name = 'shopifyspider'
    start_urls = ["https://themes.shopify.com/themes?page=1"]

    def parse(self, response):
        for theme in response.css(".theme-info"): # Div
            yield {"link": theme.css("a::attr(href)").get(), # A href
                   'theme': theme.css("a span ::text").get()} #Span text

        # Scrape each page in the series
        for next_page in response.css("a.next_page"):
            yield response.follow(next_page, self.parse)

Writing shopifyspider.py


In [None]:
!scrapy runspider shopifyspider.py -o data.csv

2020-07-18 03:51:28 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: scrapybot)
2020-07-18 03:51:28 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Apr 18 2020, 01:56:04) - [GCC 8.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
2020-07-18 03:51:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-18 03:51:28 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2020-07-18 03:51:28 [scrapy.extensions.telnet] INFO: Telnet Password: 73959149fc63ca6a
2020-07-18 03:51:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-07-18 03:51:

In [None]:
!head data.csv

link,theme
/themes/express/styles/bistro,Express
/themes/streamline/styles/core,Streamline
/themes/warehouse/styles/metal,Warehouse
/themes/context/styles/chic,Context
/themes/broadcast/styles/clean,Broadcast
/themes/avenue/styles/casual,Avenue
/themes/story/styles/chronicle,Story
/themes/boost/styles/flourish,Boost
/themes/cascade/styles/classic,Cascade


In [None]:
!wc -l data.csv

74 data.csv


### Phase 2: Get the theme demo URL by reading a list from a file



Libraries

1. urllib
2. csv
3. pickle

First, we need to make the retrieved URLs absolute so that we can scrape them.

In [None]:
from urllib.parse import urljoin
import csv
from collections import OrderedDict

In [None]:
data_file = csv.DictReader(open("data.csv"))


In [None]:
themes = dict()

for row in data_file:
  # Example row
  #OrderedDict([('link', '/themes/express/styles/bistro'), ('theme', 'Express')])
  
  theme = row["theme"]

  link = row["link"]

  #'/themes/express/styles/bistro' -> 'https://themes.shopify.com/themes/express/styles/bistro'

  link = urljoin("https://themes.shopify.com", link)
  
  themes[link] = theme


In [None]:
len(themes.keys())

73
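
As a quick check of the step above, here is a minimal sketch of how `urljoin` resolves links against the site root. The helper name `absolutize` and the constant `BASE_URL` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urljoin

# Hypothetical constant for this sketch; same base URL as used above
BASE_URL = "https://themes.shopify.com"

def absolutize(link, base=BASE_URL):
    """Resolve a scraped link against the base URL."""
    return urljoin(base, link)

# A root-relative path is appended to the scheme and host of the base
print(absolutize("/themes/express/styles/bistro"))
# prints https://themes.shopify.com/themes/express/styles/bistro

# A link that is already absolute is returned unchanged
print(absolutize("https://themes.shopify.com/themes/story/styles/chronicle"))
# prints https://themes.shopify.com/themes/story/styles/chronicle
```

Because absolute links pass through untouched, the same call is safe to use even if some rows already contain full URLs.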

Next, we persist the dictionary to a file so that a second spider can read it.

In [None]:
import pickle

with open("theme_links.pkl", "wb") as f:
  pickle.dump(themes, f)
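
Before handing the file to another spider, it can help to confirm that the pickle loads back into an identical dictionary. The sketch below does a round trip on a one-entry sample in a temporary directory; the function name `save_and_reload` is hypothetical.

```python
import os
import pickle
import tempfile

def save_and_reload(mapping, path):
    """Pickle `mapping` to `path`, then load it back and return the copy."""
    with open(path, "wb") as f:
        pickle.dump(mapping, f)
    with open(path, "rb") as f:
        return pickle.load(f)

# One entry shaped like the themes dictionary: {theme page URL: theme name}
sample = {"https://themes.shopify.com/themes/express/styles/bistro": "Express"}

with tempfile.TemporaryDirectory() as tmp_dir:
    restored = save_and_reload(sample, os.path.join(tmp_dir, "theme_links.pkl"))

print(restored == sample)  # prints True
```

Pickle is convenient here because both sides are Python. Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.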

### Now, we create another spider to scrape the theme links we saved

![alt text](https://github.com/hamletbatista/inbound/raw/master/demo-selector.png)

Follow the same steps as above to copy the selector for the theme demo link.

In [None]:
%%writefile themelink_spider.py

import scrapy
import pickle

class ShopifyThemeLinkSpider(scrapy.Spider):

    name = 'shopifyspider'

    # Load the {theme page URL: theme name} mapping saved in the previous step
    with open('theme_links.pkl', 'rb') as f:
        theme_links = pickle.load(f)

    # One start request per theme page
    start_urls = list(theme_links)

    def parse(self, response):
        for theme in response.xpath("//a[contains(@class, 'theme-preview-link')]"):
            demo_url = theme.css("::attr(data-demo-url)").get()

            yield {
                "demo-url": f"https://{demo_url}",
                "link": response.url,  # crawled page
                "theme": self.theme_links[response.url],  # theme from pickled file
            }


Writing themelink_spider.py
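
If a matched link has no `data-demo-url` attribute, `.get()` returns `None`, and the f-string in the spider would produce the literal string `https://None`. A minimal sketch of a guard follows, assuming the attribute normally holds a bare host name as the scraped output suggests. The helper `build_demo_url` is hypothetical and not part of the original spider.

```python
def build_demo_url(raw):
    """Return an https URL for a data-demo-url value, or None if it is missing."""
    raw = (raw or "").strip()
    if not raw:
        return None  # attribute absent or empty: nothing to build
    if raw.startswith(("http://", "https://")):
        return raw  # already a full URL, keep as-is
    # Bare host or protocol-relative value ("//host"): add the scheme
    return f"https://{raw.lstrip('/')}"

print(build_demo_url("express-theme-bistro.myshopify.com/"))
# prints https://express-theme-bistro.myshopify.com/
print(build_demo_url(None))
# prints None
```

Inside `parse`, the spider could call this helper and skip the `yield` when it returns `None`.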


In [None]:
%%capture
!scrapy runspider themelink_spider.py -o data2.csv

In [None]:
!head data2.csv

demo-url,link,theme
https://express-theme-bistro.myshopify.com/,https://themes.shopify.com/themes/express/styles/bistro,Express
https://express-theme-bistro.myshopify.com/,https://themes.shopify.com/themes/express/styles/bistro,Express
https://context-theme-chic.myshopify.com,https://themes.shopify.com/themes/context/styles/chic,Context
https://context-theme-chic.myshopify.com,https://themes.shopify.com/themes/context/styles/chic,Context
https://flourish-theme.myshopify.com,https://themes.shopify.com/themes/boost/styles/flourish,Boost
https://flourish-theme.myshopify.com,https://themes.shopify.com/themes/boost/styles/flourish,Boost
https://warehouse-theme-metal.myshopify.com,https://themes.shopify.com/themes/warehouse/styles/metal,Warehouse
https://warehouse-theme-metal.myshopify.com,https://themes.shopify.com/themes/warehouse/styles/metal,Warehouse
https://streamline-theme-core.myshopify.com,https://themes.shopify.com/themes/streamline/styles/core,Streamline


In [None]:
!wc -l data2.csv

147 data2.csv
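
Each theme appears twice in the `head` output above, and 147 lines (one header plus 146 rows) is twice the 73 themes collected in Phase 1. This suggests that the XPath matches two links on each theme page. A minimal sketch of one way to drop the repeats, keeping the first row per `link`, is shown below. The function name `dedupe_rows` is hypothetical, and the sample rows are copied from the output above.

```python
import csv
import io

def dedupe_rows(rows, key="link"):
    """Keep the first row for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# In-memory stand-in for data2.csv, using rows from the output above
sample_csv = """demo-url,link,theme
https://express-theme-bistro.myshopify.com/,https://themes.shopify.com/themes/express/styles/bistro,Express
https://express-theme-bistro.myshopify.com/,https://themes.shopify.com/themes/express/styles/bistro,Express
https://context-theme-chic.myshopify.com,https://themes.shopify.com/themes/context/styles/chic,Context
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
unique_rows = dedupe_rows(rows)
print(len(rows), len(unique_rows))  # prints 3 2
```

To apply this to the real file, replace `io.StringIO(sample_csv)` with `open("data2.csv", newline="")`.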
