## Web Scraping - Challenge

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  


### 1) Return specific pieces of information (rather than just downloading a whole page)

In [1]:
# Importing in each cell because of the kernel restarts.
import scrapy
from scrapy.crawler import CrawlerProcess


class ESSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "ESS"
    
    # URL(s) to start with.
    start_urls = [
        'https://mujerespioneras.org/pioneeringwomen/',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every <article> element on the page.
        for article in response.xpath('//main/article'):
            
            # Yield a dictionary with the values we want.
            yield {
                'title': article.xpath('header/h1[@class="entry-title"]/a/text()').extract_first(),
                'achieve': article.xpath('div[@class="entry-summary"]/p/text()').extract_first()
            }

# Tell the script how to run the crawler by passing in settings.
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'pioneerwomen.json',  # Name our storage file.
    'LOG_ENABLED': False           # Turn off logging for now.
})

# Start the crawler with our spider.
process.crawl(ESSpider)
process.start()
print('Success!')




Success!
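
To see what the two selectors in `parse` pick out without requesting the site, the same structure can be tried on a small inline snippet. This is only a rough sketch: the markup below is made up to mirror the `<main>/<article>` layout the spider assumes, and it uses the standard library's `xml.etree.ElementTree` (a limited XPath subset that needs well-formed markup) rather than Scrapy's own selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet mirroring the layout the spider expects.
SAMPLE = """
<html><body><main>
  <article>
    <header><h1 class="entry-title"><a href="#">First title</a></h1></header>
    <div class="entry-summary"><p>First summary.</p></div>
  </article>
  <article>
    <header><h1 class="entry-title"><a href="#">Second title</a></h1></header>
    <div class="entry-summary"><p>Second summary.</p></div>
  </article>
</main></body></html>
"""


def extract_articles(markup):
    """Return one dict per <article>, shaped like the items the spider yields."""
    root = ET.fromstring(markup)
    items = []
    for article in root.findall(".//main/article"):
        items.append({
            # findtext() returns None when the path matches nothing,
            # which is similar in spirit to extract_first() in Scrapy.
            "title": article.findtext("header/h1[@class='entry-title']/a"),
            "achieve": article.findtext("div[@class='entry-summary']/p"),
        })
    return items


items = extract_articles(SAMPLE)
for item in items:
    print(item)
```

Each `<article>` becomes one dictionary with a `title` and an `achieve` key, which is the shape later written to `pioneerwomen.json`.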


### 2) Iterate over multiple pages/queries

In [1]:
# Importing in each cell because of the kernel restarts.
import scrapy
from scrapy.crawler import CrawlerProcess

class ESSpider(scrapy.Spider):
    name = "ESS"
    
    # URL(s) to start with.
    start_urls = [
        'https://mujerespioneras.org/pioneeringwomen/2018/08/29/first-woman-on-record-to-travel-to-every-country-in-the-world/',
    ]

    
    
    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every <article> element on the page.
        for article in response.xpath('//main/article'):

            n = article.xpath('div[@class="entry-content"]/h3/text()').extract_first()            
            if n is None:
                n = article.xpath('div[@class="entry-content"]/h2/text()').extract_first()
            
            # Yield a dictionary with the values we want.
            yield {
                'name': n,
                'highlights': article.xpath('div[@class="entry-content"]/p/strong/text()').extract(),
                'posted_on': article.xpath('footer[@class="entry-footer"]/span[@class="posted-on"]/a/time/text()').extract_first(),
                'tags': article.xpath('footer[@class="entry-footer"]/span[@class="tags-links"]/a/text()').extract()
            }

        # Get the URL of the next page.
        next_page = response.xpath('//div[@class="nav-next"]/a/@href').extract_first()
        print(next_page)

        # Number of "next" links followed so far. The count travels with each
        # request in `meta`, so it is not reset on every call to parse().
        pagenum = response.meta.get('pagenum', 0)

        # Follow the next page, if it exists, up to a limit of 10 links.
        if next_page is not None and pagenum < 10:

            next_page = response.urljoin(next_page)

            # Request the next page and parse it with this same callback.
            yield scrapy.Request(next_page, callback=self.parse,
                                 meta={'pagenum': pagenum + 1})

# Tell the script how to run the crawler by passing in settings.
# The new settings have to do with scraping etiquette.          
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'manypioneeringwomen.json',       # Name our storage file.
    'LOG_ENABLED': False,          # Turn off logging for now.
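    'ROBOTSTXT_OBEY': True,        # Respect the site's robots.txt rules.
    'USER_AGENT': 'Wendy Navarrete',  # Identify who is scraping.
    'AUTOTHROTTLE_ENABLED': True,  # Adjust the request rate to the server's load.
    'HTTPCACHE_ENABLED': True      # Cache responses to avoid repeat requests.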
})

# Start the crawler with our spider.
process.crawl(ESSpider)
process.start()
print('Success!')

https://mujerespioneras.org/pioneeringwomen/2018/09/10/first-canadian-woman-to-appear-alone-on-a-bank-note-10-bill/
https://mujerespioneras.org/pioneeringwomen/2019/02/18/evelyn-miot-the-first-black-woman-to-reach-the-top-fifteen-finalists-position-in-miss-universe/
https://mujerespioneras.org/pioneeringwomen/2019/05/13/zaha-hadid-first-woman-to-receive-the-pritzker-prize/
https://mujerespioneras.org/pioneeringwomen/2019/05/14/juana-ines-de-asbaje-y-ramirez-de-santillana-considered-as-the-first-female-feminist-in-america/
https://mujerespioneras.org/pioneeringwomen/2019/05/15/julieta-lanteri-the-first-woman-to-vote-in-latin-america/
https://mujerespioneras.org/pioneeringwomen/2019/05/16/martine-grael-the-first-brazilian-female-to-compete-on-the-volvo-ocean-race/
https://mujerespioneras.org/pioneeringwomen/2019/05/17/marie-curie-the-first-female-to-win-a-nobel-prize-and-the-first-person-to-win-that-award-twice/
https://mujerespioneras.org/pioneeringwomen/2019/05/17/351/
https://mujeresp
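
The crawl pattern in this cell (parse a page, find its next link, resolve it, and stop after a fixed number of hops) can be sketched without any network access. The sketch below is an illustration only: the `FAKE_SITE` dictionary, its URLs, and the `crawl` helper are hypothetical stand-ins, and `urllib.parse.urljoin` plays the role that `response.urljoin` plays in the spider.

```python
from urllib.parse import urljoin

# Hypothetical site: each page maps to the (relative) href of its next page.
FAKE_SITE = {
    "https://example.org/posts/1/": "../2/",
    "https://example.org/posts/2/": "../3/",
    "https://example.org/posts/3/": "../4/",
    "https://example.org/posts/4/": None,  # Last page: no next link.
}


def crawl(start_url, max_follow):
    """Visit start_url, then follow at most max_follow next links."""
    visited = []
    url = start_url
    followed = 0  # Kept outside the per-page step, so it is never reset.
    while url is not None:
        visited.append(url)
        href = FAKE_SITE.get(url)
        if href is None or followed >= max_follow:
            break
        url = urljoin(url, href)  # Resolve the relative link against the page URL.
        followed += 1
    return visited


pages = crawl("https://example.org/posts/1/", max_follow=2)
print(pages)
```

With `max_follow=2` the crawl visits three pages: the start page plus two followed links. By the same arithmetic, a cap of 10 followed links yields at most 11 pages.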

### 3) Save the data to your computer

#### From exercise 1, I saved the data to the pioneerwomen.json file. Details below.

In [2]:
import pandas as pd

# Checking json file
jsonFile=pd.read_json('pioneerwomen.json', orient='records')
print(jsonFile.shape)
jsonFile[['title','achieve']]


(10, 2)


| | title | achieve |
|---|---|---|
| 0 | Julieta Lanteri, the first woman to vote in La... | Julieta Lanteri Born in Italy on March 22, 187... |
| 1 | Juana Inés de Asbaje y Ramírez de Santillana, ... | Sor Juana Inés de la Cruz Juana Inés de Asbaje... |
| 2 | Zaha Hadid, first woman to receive the Pritzke... | Zaha Hadid The first woman to receive the Prit... |
| 3 | Evelyn Miot, the first black woman to reach th... | Evelyn Miot Born in Haiti in 1943. At age 18 s... |
| 4 | First Canadian Woman to Appear Alone on a Bank... | Viola Desmond She was born in Nova Scotia, Can... |
| 5 | First Woman on Record to Travel to Every Count... | Cassie de Pecol Cassie was born on June 23, 19... |
| 6 | Register the birth of a child, is not yet a ri... | Having a birth certificate is foundational. Bi... |
| 7 | First Latin American Woman To Win The Nobel Prize | Gabriel Mistral December 1945, Lucila Godoy Al... |
| 8 | Zahida Kazmi, First Pakistani Woman Taxi Driver | Zahida Kazmi Zahida is considered the first fe... |
| 9 | The First Latin American Female to Reach the S... | Elsa Ávila She is, in her own words, a profe... |


#### From exercise 2, I saved the data to the manypioneeringwomen.json file. Details below.

In [3]:
# Checking json file

jsonFile=pd.read_json('manypioneeringwomen.json', orient='records')
print(jsonFile.shape)
jsonFile[['name','highlights','tags','posted_on']]



(11, 4)


| | name | highlights | tags | posted_on |
|---|---|---|---|---|
| 0 | Cassie de Pecol | [\nCassie was born on June 23, 1989 in United ... | [entrepreneurwoman, entrepreneurwomen, firstfe... | August 29, 2018 |
| 1 | Viola Desmond | [Viola grew up in a middle-class mixed-race fa... | [blackrightsactivism, blackwomenrights, Canadi... | September 10, 2018 |
| 2 | Evelyn Miot | [Born in Haiti in 1943, her legacy is having b... | [Haiti, MIssUniverse] | February 18, 2019 |
| 3 | | [Important events in her childhood, Important ... | [Irak, pioneeringwomeninarchitecture, womenmak... | May 13, 2019 |
| 4 | Sor Juana Inés de la Cruz | [was born on November 12, 1651 , one of the mo... | [Feminism, feminist, Mexico, poetry, poetWomen... | May 14, 2019 |
| 5 | Julieta Lanteri | [Julieta became the first woman to vote in Lat... | [Argentina, pioneeringwomeninMedicine, votingR... | May 15, 2019 |
| 6 | Martine Grael | [She is a sailor and took gold in her home Ol... | [Brazil, BrazilianPioneeringWomen, firstfemale... | May 16, 2019 |
| 7 | Marie Curie | [this brilliant woman was the first to:, Marie... | [beingawoman, inspireothers, NobelPrize, Nobel... | May 17, 2019 |
| 8 | | [] | [GoodNews] | May 17, 2019 |
| 9 | Amanda F. Kelley | [became the first enlisted woman to graduate f... | [ArmyRangerSchool, beingawoman, firstfemale, i... | May 17, 2019 |
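
Rows 3 and 8 above have no `name`, and row 8 also has no `highlights`, so it can be useful to flag incomplete records after loading a saved file. The following is a minimal sketch using only the standard library; the two sample records and the `find_incomplete` helper are invented for illustration, and the real data would be read from `manypioneeringwomen.json` rather than from a temporary file.

```python
import json
import os
import tempfile

# Made-up records shaped like the items the spider yields.
records = [
    {"name": "Example Person", "highlights": ["did something first"],
     "posted_on": "May 17, 2019", "tags": ["firstfemale"]},
    {"name": None, "highlights": [],
     "posted_on": "May 17, 2019", "tags": ["GoodNews"]},
]


def find_incomplete(rows, required=("name", "highlights")):
    """Return the indexes of rows where a required field is missing or empty."""
    return [i for i, row in enumerate(rows)
            if any(not row.get(field) for field in required)]


# Write the records to a JSON file and read them back.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

incomplete = find_incomplete(loaded)
print(incomplete)
```

A check like this makes it easier to decide whether the XPath expressions need another fallback (as with the `h3`/`h2` lookup for `name`) or whether the page simply lacks that element.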


_________________________________________

By: Wendy Navarrete

September 2019