# Challenge

Do a little scraping or API-calling of your own. Pick a new website and see what you can get out of it. Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)
2) Iterate over multiple pages/queries
3) Save the data to your computer

Once you have your data, compute some statistical summaries and/or visualizations that give you some new insights into your scraping topic of interest. Write up a report that covers everything from the scraping code to the summary, and share it with your mentor.



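For this challenge, we query the Wikipedia API with the following URL to find pages that link to the Qasem Soleimani article:

https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Qasem_Soleimani&lhprop=title%7Credirect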

Let's break that down into its components:

**w/api.php**
        Tells the server that we are using the API to pull info, rather than scraping the raw pages.

**action=query**
        We want information from the API (as opposed to changing information in the API)

**format=xml**
        Return the response as XML, which we will then parse with XPath.

**prop=linkshere**
        We are interested in which pages link to our target page

**titles=Qasem_Soleimani**
        The target page is the Qasem_Soleimani page. Note that we used the exact name of the Wikipedia page (Qasem_Soleimani).

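**lhprop=title**
        For each page that links to the target, we want its title.

**redirect**
        We also want to know whether that link is a redirect. In the URL, %7C is the URL-encoded | character that separates the two lhprop values, title and redirect.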
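
The parameter breakdown above can also be written as code. The following is a minimal sketch, assuming we build the query string with the standard library's `urllib.parse.urlencode` instead of typing it by hand. The helper name `build_linkshere_url` and its optional `lhcontinue` argument are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URL above.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_linkshere_url(title, lhcontinue=None):
    """Build a 'linkshere' query URL (hypothetical helper, for illustration only)."""
    params = {
        "action": "query",           # read information from the API
        "format": "xml",             # ask for an XML response
        "prop": "linkshere",         # pages that link to the target
        "titles": title,             # exact page name, e.g. Qasem_Soleimani
        "lhprop": "title|redirect",  # urlencode turns '|' into '%7C'
    }
    if lhcontinue is not None:
        # Continuation token for the next batch of results.
        params["lhcontinue"] = lhcontinue
    return API_ENDPOINT + "?" + urlencode(params)


url = build_linkshere_url("Qasem_Soleimani")
print(url)
```

Printing `url` reproduces the URL shown above, including the `%7C` between `title` and `redirect`.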



1) Return specific pieces of information (rather than just downloading a whole page)
2) Iterate over multiple pages/queries
3) Save the data to your computer

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess


class WikiSpider(scrapy.Spider):
    name = "WS"
    
    # Here is where we insert our API call.
    start_urls = [
        'https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Qasem_Soleimani&lhprop=title%7Credirect'
        ]

    # Identifying the information we want from the query response and extracting it using xpath.
    def parse(self, response):
        for item in response.xpath('//lh'):
            # The ns code identifies the type of page the link comes from.  '0' means it is a Wikipedia entry.
            # Other codes indicate links from 'Talk' pages, etc.  Since we are only interested in entries, we filter:
            if item.xpath('@ns').extract_first() == '0':
                yield {
                    'title': item.xpath('@title').extract_first() 
                    }
        # Getting the information needed to continue to the next ten entries.
        next_page = response.xpath('continue/@lhcontinue').extract_first()
        
        # Recursively calling the spider to process the next ten entries, if they exist.
        if next_page is not None:
            next_page = '{}&lhcontinue={}'.format(self.start_urls[0],next_page)
            yield scrapy.Request(next_page, callback=self.parse)
            
    
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'SoleimaniLinks.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # CLOSESPIDER_PAGECOUNT stops the spider after 10 responses. With ten links
    # per response, that covers roughly the first 100 links.
    'CLOSESPIDER_PAGECOUNT' : 10
})
                                         

# Starting the crawler with our spider.
process.crawl(WikiSpider)
process.start()
print('First 100 links extracted!')

First 100 links extracted!
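
To see what the two XPath expressions in `parse` pick out, it can help to run the same logic on a small response offline. The sketch below uses the standard library's `xml.etree.ElementTree` instead of Scrapy. The XML string is a hand-written, abbreviated sample that assumes the general shape of a `linkshere` response; the page IDs, the Talk-page entry, and the continue value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (NOT real API output): assumed shape of a response.
SAMPLE_XML = """
<api>
  <continue lhcontinue="12345" continue="||" />
  <query>
    <pages>
      <page pageid="1" ns="0" title="Qasem Soleimani">
        <linkshere>
          <lh pageid="2" ns="0" title="Assassination" />
          <lh pageid="3" ns="1" title="Talk:Assassination" />
          <lh pageid="4" ns="0" title="Tikrit" />
        </linkshere>
      </page>
    </pages>
  </query>
</api>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Same filter as the spider: keep only links whose ns attribute is '0'.
titles = [lh.get("title") for lh in root.iter("lh") if lh.get("ns") == "0"]

# Same idea as the 'continue/@lhcontinue' XPath: read the continuation token,
# or None when the response has no <continue> element (last batch).
continue_node = root.find("continue")
next_token = continue_node.get("lhcontinue") if continue_node is not None else None

print(titles)
print(next_token)
```

In this sample the Talk-page entry is dropped by the `ns` filter, and `next_token` holds the value that the spider would append as `&lhcontinue=...` to request the next batch.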


In [3]:
import pandas as pd

# Checking whether we got data 

Soleimani=pd.read_json('SoleimaniLinks.json', orient='records')
print(Soleimani.shape)
print(Soleimani.tail())

(32, 1)
                       title
27      Baghdad Airport Road
28             Mohsen Rezaee
29  National day of mourning
30                 Code Pink
31          Death to America

The frame has 32 rows rather than 100, most likely because the spider keeps only links whose ns attribute is '0' and drops links from Talk pages and other namespaces.


Once you have your data, compute some statistical summaries and/or visualizations that give you some new insights into your scraping topic of interest. 

In [4]:
Soleimani

                           title
0                  Assassination
1                      January 3
2                           1957
3                          2020s
4               AGM-114 Hellfire
5                         Martyr
6                           2020
7         Love Story (1970 film)
8  Baghdad International Airport
9                         Tikrit
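
As a starting point for the summary step, the sketch below groups link titles into rough categories. It is only an illustration: it uses the ten titles displayed above rather than the full file, the category rules (`year`, `date`, `other`) are assumptions made for this example and not part of the original analysis, and it sticks to the standard library so it runs without pandas.

```python
import re
from collections import Counter

# The ten titles shown in the output above (not the full 32-row file).
titles = [
    "Assassination", "January 3", "1957", "2020s", "AGM-114 Hellfire",
    "Martyr", "2020", "Love Story (1970 film)",
    "Baghdad International Airport", "Tikrit",
]

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
YEAR_RE = re.compile(r"^\d{4}s?$")                 # e.g. '1957' or '2020s'
DATE_RE = re.compile(rf"^({MONTHS}) \d{{1,2}}$")   # e.g. 'January 3'


def categorize(title):
    """Assign a rough, assumed category to a page title."""
    if YEAR_RE.match(title):
        return "year"
    if DATE_RE.match(title):
        return "date"
    return "other"


counts = Counter(categorize(t) for t in titles)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

On these ten titles, three are year or decade pages and one is a calendar date, which suggests that some of the incoming links come from chronology pages rather than from topical articles. Running the same grouping on the full `SoleimaniLinks.json` file would be the natural next step.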
