**Challenge**

Do a little scraping or API-calling of your own. Pick a new website and see what you can get out of it. Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.

Formally, your goal is to write a scraper that will:

1. Return specific pieces of information (rather than just downloading a whole page)
2. Iterate over multiple pages/queries
3. Save the data to your computer
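
The three requirements above can be sketched without any network access. This is a minimal illustration, not the notebook's actual scraper: `SAMPLE_PAGES` is hand-written, hypothetical XML shaped roughly like a MediaWiki `linkshere` response, and `extract_titles`, `crawl`, and the output file name are names invented for this example.

```python
import json
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical pages keyed by continuation token (None = first page).
# Real responses come from the API; these strings are assumptions for illustration.
SAMPLE_PAGES = {
    None: '<api><continue lhcontinue="2" /><query><pages><page><linkshere>'
          '<lh ns="0" title="Corona" /><lh ns="1" title="Talk:Corona" />'
          '</linkshere></page></pages></query></api>',
    "2": '<api><query><pages><page><linkshere>'
         '<lh ns="0" title="Jim Bakker" />'
         '</linkshere></page></pages></query></api>',
}


def extract_titles(xml_text):
    """Requirement 1: return specific fields, not the whole page."""
    root = ET.fromstring(xml_text)
    titles = [lh.get("title") for lh in root.iter("lh") if lh.get("ns") == "0"]
    cont = root.find("continue")
    token = cont.get("lhcontinue") if cont is not None else None
    return titles, token


def crawl(pages):
    """Requirement 2: follow the continuation token across pages."""
    rows, token = [], None
    while True:
        titles, token = extract_titles(pages[token])
        rows.extend({"title": t} for t in titles)
        if token is None:
            break
    return rows


rows = crawl(SAMPLE_PAGES)

# Requirement 3: save the data to disk (a temp directory here, to keep the sketch tidy).
out_path = os.path.join(tempfile.gettempdir(), "sample_links.json")
with open(out_path, "w") as f:
    json.dump(rows, f)
```

In a real scraper the dictionary lookup `pages[token]` would be replaced by an HTTP request, which is the part Scrapy handles in the cell below.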

Once you have your data, compute some statistical summaries and/or visualizations that give you new insight into your scraping topic. Write up a report that covers everything from the scraping code to the summary, and share it with your mentor.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess


class CovidWikiSpider(scrapy.Spider):
    name = "CWS"

    # MediaWiki API query: pages that link to "Coronavirus disease 2019", returned as XML.
    start_urls = [
        'https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Coronavirus_disease_2019&lhprop=title%7Credirect'
    ]

    def parse(self, response):
        # Each <lh> element is one linking page; keep only namespace 0 (articles).
        for item in response.xpath('//lh'):
            if item.xpath('@ns').extract_first() == '0':
                yield {
                    'title': item.xpath('@title').extract_first()
                }

        # The API supplies a continuation token while more results remain.
        next_page = response.xpath('continue/@lhcontinue').extract_first()

        if next_page is not None:
            # Re-issue the original query with the token appended.
            next_page = '{}&lhcontinue={}'.format(self.start_urls[0], next_page)
            yield scrapy.Request(next_page, callback=self.parse)
            
    
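process = CrawlerProcess({
    # Write scraped items to covid.json.
    # Note: FEED_FORMAT and FEED_URI are deprecated since Scrapy 2.1 in favor of FEEDS.
    'FEED_FORMAT': 'json',
    'FEED_URI': 'covid.json',
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    # Throttle requests automatically and cache responses between runs.
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    # Keep Scrapy's log output out of the notebook.
    'LOG_ENABLED': False,
    # Stop after about 10 responses; the API returns 10 links per request by default.
    'CLOSESPIDER_PAGECOUNT': 10
})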
                                         

process.crawl(CovidWikiSpider)
process.start()
print('First 100 links extracted!')

First 100 links extracted!
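
The spider above builds its follow-up URL by string formatting. As a hedged alternative sketch, the same query can be assembled from a parameter dictionary with `urllib.parse.urlencode`, which also escapes any special characters in the continuation token. `build_url` and `BASE_PARAMS` are names made up for this example, and the token `"123"` in the usage lines is a placeholder, not a real API value.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API = "https://en.wikipedia.org/w/api.php"

# Same parameters as the spider's start URL, written out as a dictionary.
BASE_PARAMS = {
    "action": "query",
    "format": "xml",
    "prop": "linkshere",
    "titles": "Coronavirus_disease_2019",
    "lhprop": "title|redirect",
}


def build_url(lhcontinue=None):
    """Return the query URL, adding the continuation token when one is given."""
    params = dict(BASE_PARAMS)
    if lhcontinue is not None:
        params["lhcontinue"] = lhcontinue
    return "{}?{}".format(API, urlencode(params))


first_url = build_url()
next_url = build_url("123")  # "123" is a placeholder token
```

Inside `parse`, the request could then be issued as `scrapy.Request(build_url(next_page), callback=self.parse)`, assuming `build_url` is defined alongside the spider.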


In [2]:
import pandas as pd


covid = pd.read_json('covid.json', orient='records')
print(covid.shape)
print(covid.head())

(95, 1)
                             title
0     Traditional Chinese medicine
1                           Corona
2      List of infectious diseases
3                       Jim Bakker
4  National Basketball Association
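
The challenge also asks for statistical summaries of the scraped data. The following is only a sketch of what a first pass might look like, using the five titles printed by `covid.head()` above rather than the full 95 rows, and using the standard library so it runs on its own. The choice of metrics (word counts, "List of" pages) is an assumption made for illustration, not part of the original analysis.

```python
from statistics import mean

# The five titles shown in the head() output above; a real analysis would use every row.
titles = [
    "Traditional Chinese medicine",
    "Corona",
    "List of infectious diseases",
    "Jim Bakker",
    "National Basketball Association",
]

# Number of words in each title.
word_counts = [len(t.split()) for t in titles]

summary = {
    "n_titles": len(titles),
    "mean_words": mean(word_counts),
    "longest": max(titles, key=len),
    # Wikipedia list articles conventionally start with "List of".
    "list_pages": sum(t.startswith("List of") for t in titles),
}
print(summary)
```

On the full DataFrame, comparable figures could be computed with pandas string methods such as `covid['title'].str.len()` and `covid['title'].str.startswith('List of')`.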
