# Web-Scraping

First, install Scrapy. An installation guide is available at https://docs.scrapy.org/en/latest/intro/install.html#intro-install

In this tutorial, we will scrape https://fundrazr.com/find, specifically the campaigns listed under Health & Illness (https://fundrazr.com/find?category=Health).

A worked example of Scrapy usage can be found at https://github.com/mGalarnyk/Python_Tutorials/tree/master/Scrapy

Take a look at the webpage that lists all campaigns for Health & Illness.

In [1]:
%%html
<iframe src="https://fundrazr.com/find?category=Health" width="500" height="400"></iframe>

This is only the first page. There are several pages, each with 12 campaigns. 

For this exercise, we will scrape information only from the first page (i.e., 12 campaigns).

In [2]:
# Import required libraries

# Import the TextResponse class from scrapy
from scrapy.http import TextResponse

# Import the requests library
import requests

TextResponse is a convenient class that lets us use the XPath selectors of *scrapy* without having to run the scrapy shell.

In [3]:
# Send a request for the first page of Health & Illness campaigns.
res = requests.get("https://fundrazr.com/find?category=Health")

# Wrap the downloaded HTML in a TextResponse so that .xpath() can be used on it.
response = TextResponse(res.url, body=res.text, encoding='utf-8')

# Select the href attribute of the link inside each campaign title.
hrefs = response.xpath("//h2[contains(@class, 'title headline-font')]/a[contains(@class, 'campaign-link')]//@href").extract()
print('Campaigns listed on the first page')
print('\n'.join(hrefs))

Campaigns listed on the first page
//fundrazr.com/americangut
//fundrazr.com/IndonesiaCOVID
//fundrazr.com/britishgut
//fundrazr.com/f13EF1
//fundrazr.com/ViewHealthcareHeroes
//fundrazr.com/helpleewana
//fundrazr.com/RHK8_Snacks_2019
//fundrazr.com/71YCU4
//fundrazr.com/AlinkerwalkingbikeforHopesmom
//fundrazr.com/9OsT5
//fundrazr.com/31XUGd
//fundrazr.com/31WUvb


The list above contains partial URLs (they start with `//` and omit the scheme) for each of the 12 campaigns listed on the first page.

The code below builds complete URLs by prepending 'https:' to each partial URL.

In [4]:
campaign_urls = []
for href in hrefs:
    campaign_urls.append("https:" + href)
    
print('\n'.join(campaign_urls))

https://fundrazr.com/americangut
https://fundrazr.com/IndonesiaCOVID
https://fundrazr.com/britishgut
https://fundrazr.com/f13EF1
https://fundrazr.com/ViewHealthcareHeroes
https://fundrazr.com/helpleewana
https://fundrazr.com/RHK8_Snacks_2019
https://fundrazr.com/71YCU4
https://fundrazr.com/AlinkerwalkingbikeforHopesmom
https://fundrazr.com/9OsT5
https://fundrazr.com/31XUGd
https://fundrazr.com/31WUvb
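
As an alternative to string concatenation, the standard library's `urljoin` can resolve each href against the URL of the page it came from. The sketch below uses two of the hrefs shown above; the names `listing_url`, `sample_hrefs`, and `resolved` are illustrative only and not part of the original notebook.

```python
from urllib.parse import urljoin

# URL of the listing page the hrefs were scraped from.
listing_url = "https://fundrazr.com/find?category=Health"

# Two of the scheme-less hrefs printed above.
sample_hrefs = ["//fundrazr.com/americangut", "//fundrazr.com/britishgut"]

# urljoin copies the scheme (https) from the base URL onto each href.
resolved = [urljoin(listing_url, href) for href in sample_hrefs]
print('\n'.join(resolved))
```

Unlike a fixed `"https:"` prefix, `urljoin` also copes with path-only hrefs such as `/americangut` and leaves hrefs that are already absolute unchanged.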


Now that we have the URL of every campaign, we can launch the scrapy shell (or use TextResponse) on each campaign page to extract the relevant information.

Let's extract the following information from each of the 12 campaigns:

1. Title
2. Amount of money raised
3. Number of contributors
4. Duration of the campaign (how long it has been running, or how much time is left)

The following dictionary maps each field name to the XPath expression used to locate it. The form of each expression determines which part of the page is scraped. This step is the heart of the exercise, so make sure you understand how the expressions are built.

In [5]:
xpaths = {
    'title':"//title/text()",
    'currency':"//span[contains(@class,'currency-symbol')]/text()",
    'moneyRaised':"//span[contains(@class,'amount-raised')]/descendant::text()",
    "contributors":"//span[contains(@class,'donation-count stat')]/descendant::text()",
    "duration":"//span[contains(@class,'stats-label lowercase')]//span[contains(@class,'stat')]/text()"
}

In [6]:
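from collections import defaultdict
import re

# Column-oriented storage: one list of values per field.
data = defaultdict(list)

for url in campaign_urls:
    res = requests.get(url)
    response = TextResponse(res.url, body=res.text, encoding='utf-8')
    for field, query in xpaths.items():
        if field == 'duration':
            # The duration is spread over several text nodes padded with
            # newlines and tabs: clean each one, drop the blanks, join the rest.
            parts = [re.sub('\n|\t', '', s) for s in response.xpath(query).extract()]
            data[field].append(' '.join(p for p in parts if p != ''))
        else:
            # Take the first match; fall back to '' so that a missing element
            # neither raises IndexError nor leaves the columns with unequal lengths.
            data[field].append(response.xpath(query).extract_first(default=''))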
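
The cleaning of the duration text can be tried in isolation. The sketch below is a slightly stricter variant that also trims surrounding spaces; `clean_fragments` is a hypothetical helper name, and the sample fragments are made up to resemble the whitespace-padded text nodes an XPath query may return.

```python
import re


def clean_fragments(fragments):
    """Remove newlines and tabs from each fragment, drop blanks, join with spaces."""
    stripped = [re.sub(r"[\n\t]", "", s).strip() for s in fragments]
    return " ".join(s for s in stripped if s)


# Made-up fragments resembling the text nodes of a duration element.
sample = ["\n\t\t", "7", "\n\t", "Years running", "\n"]
print(clean_fragments(sample))
```

An empty list of fragments gives an empty string, so a campaign without a duration element still contributes one value to its column.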
        

All the scraped data is now stored in the 'data' dictionary.

Convert it to a dataframe to view the data in tabular form.

In [7]:
import pandas as pd
pd.DataFrame(data)

| | title | currency | moneyRaised | contributors | duration |
|---|---|---|---|---|---|
| 0 | American Gut by American Gut Project (UC San D... | $ | 1943404 | 12799 | 7 Years running |
| 1 | TOPENG SEHAT: Providing Respirator Masks for I... | $ | 8219 | 58 | |
| 2 | British Gut by American Gut Project (UC San Di... | £ | 741115 | 7128 | 5 Years running |
| 3 | Help Us Make Ireland's Dream a Reality! by Ire... | $ | 1952 | 28 | 4 Years running |
| 4 | Rescue Detroit Restaurants - Feed Healthcare W... | $ | 2625 | 31 | 25 Days running |
| 5 | Help Leewana heal -- get her back in action! b... | $ | 4801 | 104 | 0 days left |
| 6 | Healthy Snack Workshop by Raleigh Hills K8 6th... | $ | 1020 | 16 | 0 days left |
| 7 | Please help me #keepmoving #outdoors with an #... | $ | 2850 | 39 | 0 days left |
| 8 | Help me stay #unstoppable with an Alinker walk... | $ | 2800 | 21 | 0 days left |
| 9 | Operation Walk USA by Operation Walk USA | $ | 17275 | 30 | 7 Years running |


If you need to scrape data from multiple pages, repeat the above process on each page. You can start with the following code:

In [8]:
start_urls = ["https://fundrazr.com/find?category=Health"]

# Number of additional pages to fetch after the first one.
npages = 2

# Build the URLs that the "next" button would lead to (pages 2, 3, ...).
for i in range(2, npages+2):
    start_urls.append("https://fundrazr.com/find?category=Health&page="+str(i))

In [9]:
start_urls

['https://fundrazr.com/find?category=Health',
 'https://fundrazr.com/find?category=Health&page=2',
 'https://fundrazr.com/find?category=Health&page=3']
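
One possible way to organise the multi-page step is sketched below. It is not part of the original notebook: `build_page_urls` and `collect_campaign_urls` are hypothetical helpers, and `fetch_hrefs` stands for any function that takes a listing-page URL and returns the campaign hrefs found on it (for example, the `requests` plus `TextResponse` code used for the first page). The sketch relies only on the `category` and `page` query parameters already visible in the URLs above.

```python
import time
from urllib.parse import urlencode, urljoin

BASE_URL = "https://fundrazr.com/find"


def build_page_urls(category, n_pages):
    """Return the listing URLs for pages 1..n_pages of a category."""
    urls = []
    for page in range(1, n_pages + 1):
        params = {"category": category}
        if page > 1:
            # The first page carries no page parameter.
            params["page"] = page
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


def collect_campaign_urls(page_urls, fetch_hrefs, delay=1.0):
    """Gather unique, absolute campaign URLs from every listing page.

    fetch_hrefs is a caller-supplied (hypothetical) function: page URL -> list of hrefs.
    """
    collected = []
    for index, page_url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay)  # pause between requests to avoid hammering the site
        for href in fetch_hrefs(page_url):
            full_url = urljoin(page_url, href)
            if full_url not in collected:
                collected.append(full_url)
    return collected


print(build_page_urls("Health", 3))
```

Passing the fetching function in as an argument keeps the URL handling testable without network access; a stub that returns a fixed list of hrefs is enough to exercise the loop.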