# Scraping Challenge

Do a little scraping or API-calling of your own. Pick a new website and see what you can get out of it. Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.

Formally, your goal is to write a scraper that will:

1. Return specific pieces of information (rather than just downloading a whole page)
2. Iterate over multiple pages/queries
3. Save the data to your computer

Once you have your data, compute some statistical summaries and/or visualizations that give you new insights into the topic you scraped. Write up a report that covers everything from the scraping code to the summary, and share it with your mentor.

### The Website

What if I use Glassdoor to gather company information to prepare for an interview? The data to collect:

1. All interview questions
2. Links to the answers

In [4]:
################## Imports ##############################
# Importing in each cell because of the kernel restarts.
import scrapy
import re
from scrapy.crawler import CrawlerProcess
import numpy as np
import pandas as pd
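# json_normalize has lived at the top level of pandas since 1.0;
# the pandas.io.json import path is deprecated.
from pandas import json_normalize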


In [None]:
################### Create Crawler ######################

class GlassdoorSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "Glassdoor_Simple"
    
    # URL(s) to start with.
    start_urls = [
        'https://www.glassdoor.com/Interview/Facebook-Interview-Questions-E40772.htm?filter.jobTitleFTS=Data+Scientist',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
       
        # Each review lives in an element whose id starts with "InterviewReview_".
        # Iterate over those elements and extract the information we want from each.
        for review in response.xpath('//*[starts-with(@id, "InterviewReview_")]'):
            # Yield a dictionary with the values we want.
            yield {
                # These XPath expressions choose what to extract, relative to the current review.
                # Modify them to extract other information from the site.
                'interview_questions': review.xpath('.//div[3]/div/div[2]/div[2]/div/div/ul/li/span/text()').extract(),
                'answers': review.xpath('.//div[3]/div/div[2]/div[2]/div/div/ul/li/span/a/text()').extract(),
                'answer_links': review.xpath('.//div[3]/div/div[2]/div[2]/div/div/ul/li/span/a/@href').extract(),
                'helpful': review.xpath('.//div[4]/div/div[3]/span/@data-count').extract_first(),
            }

        next_page = response.xpath('//*[@id="FooterPageNav"]/div[2]/ul/li[7]/a/@href').extract()
        if next_page:
            next_page = next_page[0]
            next_page_url = 'https://www.glassdoor.com' + next_page
            print(next_page_url)
            # Request the next page and recursively parse it the same way we did above
            yield scrapy.Request(next_page_url, callback=self.parse)
            
# Tell the script how to run the crawler by passing in settings.
process = CrawlerProcess({
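    # FEED_FORMAT and FEED_URI are deprecated in Scrapy 2.1+ in favor of FEEDS.
    'FEED_FORMAT': 'json',          # Store data in JSON format.
    'FEED_URI': 'data.json',        # Name our storage file.
    'LOG_ENABLED': False,           # Turn off logging for now.
    'ROBOTSTXT_OBEY': True,         # Respect the site's robots.txt rules.
    'USER_AGENT': 'BrandynAdderleyCrawler (madderle@gmail.com)',  # Identify the crawler to the site.
    'AUTOTHROTTLE_ENABLED': True,   # Adapt the request rate to the server's response times.
    'HTTPCACHE_ENABLED': True,      # Cache responses so reruns do not re-download pages.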
})

# Start the crawler with our spider.
process.crawl(GlassdoorSpider)
process.start()
print('Success!')
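
The pagination step in the spider builds the next URL by prepending the domain to the extracted `href`. A minimal sketch of an alternative, assuming the `href` may be either site-relative or absolute, is to resolve it against the URL of the page it was found on with the standard library's `urljoin`; inside a spider, Scrapy's `response.urljoin` does the same kind of resolution. The `_P2` path and the `build_next_url` helper below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL above, with the query string dropped).
current_url = 'https://www.glassdoor.com/Interview/Facebook-Interview-Questions-E40772.htm'

# Hypothetical relative href, standing in for what the "next page" XPath returns.
next_href = '/Interview/Facebook-Interview-Questions-E40772_P2.htm'


def build_next_url(base_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base_url, href)


next_page_url = build_next_url(current_url, next_href)
print(next_page_url)
```

Unlike plain concatenation, this also leaves an already absolute `href` unchanged instead of producing a doubled domain.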

In [3]:
# Turn into DataFrame
glassdoor_df = pd.read_json('data.json', orient='records')
glassdoor_df.head()

|   | answer_links | answers | helpful | interview_questions |
|---|---|---|---|---|
| 0 | [/Interview/Data-challenge-was-very-similar-to... | [23 Answers] | 331 | [ Data challenge was very similar to the ads a... |
| 1 | [/Interview/How-would-you-measure-the-health-o... | [6 Answers, 9 Answers, 7 Answers, 3 Answers, 4... | 196 | [ How would you measure the health of Mentions... |
| 2 | [/Interview/Facebook-has-personal-information-... | [1 Answer, Answer Question] | 0 | [ Facebook has personal information such as ge... |
| 3 | [/Interview/Expected-value-combined-with-proba... | [Answer Question] | 0 | [ Expected value combined with probability ] |
| 4 | [/Interview/They-will-ask-you-to-design-a-very... | [Answer Question] | 0 | [ They will ask you to design a very simple th... |
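
The `answers` column holds labels such as `23 Answers`, `1 Answer`, and `Answer Question` rather than numbers. One possible way to turn them into counts for later summaries is sketched below with the `re` module already imported at the top of the notebook. The helper name `parse_answer_count` is an assumption for illustration, and treating `Answer Question` as zero answers is an interpretation of the label, not something the site confirms.

```python
import re


def parse_answer_count(label):
    """Return the number of answers encoded in a label such as '23 Answers'.

    Labels without a leading number (for example 'Answer Question')
    are treated as 0 answers.
    """
    match = re.match(r'\s*(\d+)\s+Answers?\b', label)
    return int(match.group(1)) if match else 0


# Labels copied from the DataFrame output above.
labels = ['23 Answers', '1 Answer', 'Answer Question']
counts = [parse_answer_count(label) for label in labels]
print(counts)
```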


In [20]:
individual_questions = []
for row in glassdoor_df['interview_questions']:
    #print(len(row))
    for question in row:
        individual_questions.append(question)

In [21]:
len(individual_questions)

313
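
With the questions flattened into a single list, a first statistical summary could look like the sketch below. It uses only the standard library, strips the padding spaces visible in the scraped strings, and counts words per question and across questions. The `rows` list is a small hypothetical stand-in for `glassdoor_df['interview_questions']`: only its first entry comes from the output above, and the rest are invented for illustration.

```python
from collections import Counter
from itertools import chain
from statistics import mean

# Hypothetical stand-in for glassdoor_df['interview_questions']:
# one list of question strings per review (a review may have none).
rows = [
    [' Expected value combined with probability '],
    [' How would you measure success? ', ' Write a SQL query. '],
    [],
]

# Flatten the nested lists and strip the surrounding whitespace.
questions = [q.strip() for q in chain.from_iterable(rows)]

# Length of each question in words.
word_counts = [len(q.split()) for q in questions]

# Frequency of each lowercased word, with trailing punctuation removed.
common_words = Counter(
    word.lower().strip('?.,') for q in questions for word in q.split()
)

print(len(questions))
print(mean(word_counts))
print(common_words.most_common(3))
```

On the real data, the same counts could feed a histogram of question lengths or a list of the most frequent topics.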