Source: https://notebook.community/HrantDavtyan/Data_Scraping/2018/Scrapy

## Intro to Scrapy

Scrapy is a Python framework for data scraping. In short, it combines almost everything we have learnt so far: requests, CSS selectors (BeautifulSoup), XPath (lxml), regex (re), and even checking robots.txt or putting the scraper to sleep.
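In a Scrapy project, checking robots.txt and pausing between requests are usually switched on through settings rather than coded by hand. The fragment below is a minimal sketch of what a project's `settings.py` might contain, assuming a standard Scrapy project layout; the 2-second delay is an arbitrary illustrative value, not a recommendation from this notebook.

```python
# settings.py (sketch; the values are illustrative assumptions)

# Ask Scrapy to download robots.txt and skip URLs that it disallows.
ROBOTSTXT_OBEY = True

# Wait this many seconds between consecutive requests to the same site,
# which plays the role of "putting the scraper to sleep".
DOWNLOAD_DELAY = 2
```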

Because Scrapy is a framework, one does not usually write Scrapy code inside a Jupyter Notebook. To mimic Scrapy's behavior inside the Notebook, we need some additional imports that would not otherwise be required.

Key points:

- `response` - the object that holds the page source as a Scrapy object ready to be scraped,
- `response.css()` - the CSS approach to scraping (as in BeautifulSoup),
- `response.xpath()` - the XPath approach to scraping (as in lxml),
- `extract()` - extracts all elements satisfying some condition (returns a list),
- `extract_first()` - extracts the first element satisfying some condition (returns a single element, or `None` if nothing matches),
- `response.css("a::text").extract_first()` - returns the text of the first link matched (CSS),
- `response.xpath("//a/text()").extract_first()` - returns the text of the first link matched (XPath),
- `response.css('a::attr(href)').extract_first()` - returns the href attribute (URL) of the first link matched (CSS),
- `response.xpath("//a/@href").extract_first()` - returns the href attribute (URL) of the first link matched (XPath).

In [1]:
import requests
from scrapy.http import TextResponse

In [16]:
url = "https://meps.ahrq.gov/data_files/pufs/"
r = requests.get(url)
# wrap the downloaded page in a Scrapy response so that .css() and .xpath() are available
response = TextResponse(r.url, body=r.text, encoding="utf-8")

In [3]:
response

<200 https://meps.ahrq.gov/data_files/pufs/>

In [None]:
# first <a> element (CSS)
response.css("a").extract_first()

In [15]:
# first <a> element (XPath)
response.xpath("//a").extract_first()

'<a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>\n   <tr><th colspan="5"><hr></th></tr>\n<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/data_files/">Parent Directory</a></td><td>\xa0</td><td align="right">  - </td><td>\xa0</td></tr>\n<tr><td valign="top"><img src="/icons/binary.gif" alt="[   ]"></td><td><a href="h01dat.exe">h01dat.exe</a></td><td align="right">2021-06-08 17:26  </td><td align="right">1.1M</td><td>\xa0</td></tr>\n<tr><td valign="top"><img src="/icons/compressed.gif" alt="[   ]"></td><td><a href="h01dat.zip">h01dat.zip</a></td><td align="right">2021-06-08 17:26  </td><td align="right">1.0M</td><td>\xa0</td></tr>\n<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="h036/">h036/</a></td><td align="right">2022-08-12 19:11  </td><td align="right">  - </td><td>\xa0</td></tr>\n<tr><td valign="top"><i

In [5]:
# text of all <small> elements (CSS)
response.css("small::text").extract()

[]

In [6]:
# text of all <small> elements (XPath)
response.xpath("//small/text()").extract()

[]

In [7]:
# href of links whose style attribute is "text-decoration: none" (CSS)
response.css('a[style="text-decoration: none"]::attr(href)').extract()

[]

In [8]:
# text of links with class "tag" (CSS)
response.css("a[class='tag']::text").extract()

[]

In [9]:
# href of links with class "tag" (CSS)
response.css("a[class='tag']::attr(href)").extract()

[]

In [10]:
# text of links with class "tag" (XPath)
response.xpath("//a[@class='tag']/text()").extract()

[]

In [11]:
# href of links with class "tag" (XPath)
response.xpath("//a[@class='tag']/@href").extract()

[]

In [12]:
response.css("title").extract_first()

'<title>Index of /data_files/pufs</title>\n </head>\n <body>\n<h1>Index of /data_files/pufs</h1>\n  <table>\n   <tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>\n   <tr><th colspan="5"><hr></th></tr>\n<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/data_files/">Parent Directory</a></td><td>\xa0</td><td align="right">  - </td><td>\xa0</td></tr>\n<tr><td valign="top"><img src="/icons/binary.gif" alt="[   ]"></td><td><a href="h01dat.exe">h01dat.exe</a></td><td align="right">2021-06-08 17:26  </td><td align="right">1.1M</td><td>\xa0</td></tr>\n<tr><td valign="top"><img src="/icons/compressed.gif" alt="[   ]"></td><td><a href="h01dat.zip">h01dat.zip</a></td><td align="right">2021-06-08 17:26  </td><td align="right">1.0M</td><td>\xa0</td></tr>\n<tr><td valign="top"><img src="

In [13]:
response.css("title").re("title")

['title', 'title']

In [14]:
#regex to get text between tags
response.css("title").re('.+>(.+)<.+')

['Index of /data_files/pufs',
 'Index of /data_files/pufs',
 '</th>',
 '</th>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td>',
 '\xa0</td