# Collecting URLs using Scrapy

You will need:
* Scrapy (a Python package) for web crawling
* PyCharm Educational Edition or Spyder (on Windows: Start/All Apps/Anaconda2/Spyder)

You can download PyCharm Edu from https://www.jetbrains.com/pycharm-edu/download/#section=windows

## Scrapy

- A Python package for web crawling: https://scrapy.org/
- Suitable for large-scale data collection

Installation

- conda install -c conda-forge scrapy

## Scenario: BBC Football

Suppose you want to collect news articles about football from the BBC. The BBC site hosts many different kinds of pages. Some examples are:

    http://www.bbc.com/sport/football/35182965
    http://www.bbc.com/sport/football/35139938
    http://www.bbc.com/sport/football/35139941    
    http://www.bbc.com/news/technology-35591726
    http://www.bbc.com/news/technology-35614335
    http://www.bbc.com/news/science-environment-35612559
    http://www.bbc.com/sport/tennis/35603859
    http://www.bbc.com/sport/golf/35624058

We know that only the first three articles are related to football:

    http://www.bbc.com/sport/football/35182965
    http://www.bbc.com/sport/football/35139938
    http://www.bbc.com/sport/football/35139941   

The question, then, is how to collect only the football-related URLs while ignoring every other URL on the BBC site. The answer is to use a **regular expression**.

A **regular expression** that matches only the football pages above:
    
    .+\/sport\/football\/\d{8}

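In this pattern, `.+` matches any prefix, `\/sport\/football\/` matches the literal path, and `\d{8}` matches the eight-digit article ID. We can test the regular expression with the online tester **http://regexr.com/v1/**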
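
The same check can be sketched in Python with the standard `re` module. This is only an illustration: the helper name `is_football_url` is made up here, and the URLs are the eight examples listed above.

```python
import re

# The regular expression from this section, written as a raw string
FOOTBALL_PATTERN = re.compile(r".+\/sport\/football\/\d{8}")

# The eight example URLs listed earlier in this section
urls = [
    "http://www.bbc.com/sport/football/35182965",
    "http://www.bbc.com/sport/football/35139938",
    "http://www.bbc.com/sport/football/35139941",
    "http://www.bbc.com/news/technology-35591726",
    "http://www.bbc.com/news/technology-35614335",
    "http://www.bbc.com/news/science-environment-35612559",
    "http://www.bbc.com/sport/tennis/35603859",
    "http://www.bbc.com/sport/golf/35624058",
]


def is_football_url(url):
    """Return True when the URL matches the football article pattern."""
    return FOOTBALL_PATTERN.match(url) is not None


# Keep only the URLs that match the pattern
football_urls = [u for u in urls if is_football_url(u)]
for u in football_urls:
    print(u)
```

Only the three `/sport/football/` URLs pass the filter; the news, tennis, and golf URLs are rejected.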

To learn more about regular expressions, see **lecture 5_regularexpression**

## Game plan
- Start URL for football articles: http://www.bbc.com/sport/football/
- Collect the **URLs** of football-related articles (**using Scrapy**)
- Crawl the article text from the collected URLs
- Do content analytics (e.g., sentiment analysis, text classification) and descriptive analytics (e.g., business intelligence)

### 1. Start a Scrapy project (in a command prompt or terminal)
**scrapy startproject bbcfootball**

### 2. Generate a spider
**cd bbcfootball**

**scrapy genspider bbcfootballcrawler www.bbc.com**

This creates a folder and files that look like this:
<img src="images\scrapy.gif">

### 3. Coding

In [None]:
# Edit bbcfootballcrawler.py so that it contains the complete code below
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TryItem(Item):
    # Each scraped item stores only the URL of a matching article
    url = Field()


class BbcfootballcrawlerSpider(CrawlSpider):
    name = "bbcfootballcrawler"
    allowed_domains = ["www.bbc.com"]
    start_urls = ['http://www.bbc.com/sport/football/']

    # Follow links whose URL ends in /sport/football/<8 digits> and
    # pass each matching page to parse_item
    rules = (
        Rule(LinkExtractor(allow=[r'.+\/sport\/football\/\d{8}$']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Record the URL of the matched article page
        item = TryItem()
        item['url'] = response.url
        yield item
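
Note that the pattern in the rule ends with `$`, unlike the one shown earlier. Scrapy's documentation describes `allow` as one or more regular expressions that the absolute URL must match. The sketch below uses `re.search` from the standard library as a stand-in for that check (an assumption about the exact matching call, not Scrapy's own code) to show what the trailing `$` changes. The second URL is a hypothetical example invented for illustration.

```python
import re

# Pattern from the regular expression section (no end anchor)
unanchored = re.compile(r".+\/sport\/football\/\d{8}")
# Pattern used in the spider rule (anchored at the end with $)
anchored = re.compile(r".+\/sport\/football\/\d{8}$")

article = "http://www.bbc.com/sport/football/35182965"
# Hypothetical URL with extra path after the article ID
longer = "http://www.bbc.com/sport/football/35182965/extra"

results = {
    "article_unanchored": bool(unanchored.search(article)),
    "article_anchored": bool(anchored.search(article)),
    "longer_unanchored": bool(unanchored.search(longer)),
    "longer_anchored": bool(anchored.search(longer)),
}

for name, matched in results.items():
    print(name, matched)
```

With the anchor, only URLs that end right after the eight-digit ID are accepted, so pages nested below an article URL would not be collected.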

### 4. Run (in command prompt or terminal)
**scrapy crawl bbcfootballcrawler -o items.csv**

This creates a CSV file (items.csv) that contains the URLs of football-related articles.
<img src="images\scrapy2.gif">

### 5. How to stop
**CTRL-C**

Open items.csv. **Remove the first row (the header, url) so that items.csv contains only URLs**. Then **move this file to the data folder**. We will import it later.
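
If you prefer not to edit the file by hand, the header row can also be removed in Python. The following is a minimal sketch, not part of the original workflow: the function name `strip_header` and the temporary file names are made up, and the demo writes a small throwaway file that mimics the CSV produced by Scrapy.

```python
import csv
import os
import tempfile


def strip_header(in_path, out_path, header="url"):
    """Copy a one-column CSV, skipping a leading header row if present."""
    with open(in_path, newline="") as src:
        rows = [row for row in csv.reader(src) if row]  # ignore blank lines
    if rows and rows[0][0].strip().lower() == header:
        rows = rows[1:]
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
    return len(rows)


# Demo on a throwaway file shaped like items.csv (header row, then URLs)
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "items.csv")
clean_path = os.path.join(workdir, "items_clean.csv")
with open(raw_path, "w", newline="") as f:
    f.write("url\r\n"
            "http://www.bbc.com/sport/football/35182965\r\n"
            "http://www.bbc.com/sport/football/35139938\r\n")

kept = strip_header(raw_path, clean_path)
with open(clean_path, newline="") as f:
    cleaned = [row[0] for row in csv.reader(f)]
print(kept, cleaned)
```

In the tutorial's layout, the call would look like `strip_header("items.csv", "data/items.csv")`, assuming the data folder already exists.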

<img src="images\scrapy3.gif">
<img src="images\scrapy4.gif">

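### To restart
**scrapy crawl bbcfootballcrawler -o items.csv**

Because the -o option appends to an existing file, delete or rename the old items.csv first if you want a fresh list of URLs.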

# Now we crawl data using the URLs we just collected

In [1]:
# import python packages

import requests
from lxml import html
import csv
import pandas as pd

In [2]:
# Python 3: open the CSV in text mode; newline="" is what the csv module expects
with open("data/items.csv", newline="") as openfile:
    reader = csv.reader(openfile)
    for row in reader:
        # the urls are in the first column ... 0 refers to the first column
        url = row[0]
        print(url)

http://www.bbc.com/sport/football/41269176
http://www.bbc.com/sport/football/41227374
http://www.bbc.com/sport/football/41241398
http://www.bbc.com/sport/football/41234874
http://www.bbc.com/sport/football/41261402
http://www.bbc.com/sport/football/41244754
http://www.bbc.com/sport/football/41248175
http://www.bbc.com/sport/football/41268931
http://www.bbc.com/sport/football/41233171
http://www.bbc.com/sport/football/41265278
http://www.bbc.com/sport/football/41245454
http://www.bbc.com/sport/football/40883140
http://www.bbc.com/sport/football/41253551
http://www.bbc.com/sport/football/41227367
http://www.bbc.com/sport/football/41218469
http://www.bbc.com/sport/football/41241396
http://www.bbc.com/sport/football/41248903
http://www.bbc.com/sport/football/41229478
http://www.bbc.com/sport/football/41214982
http://www.bbc.com/sport/football/41133569
http://www.bbc.com/sport/football/41234873
http://www.bbc.com/sport/football/41249300
http://www.bbc.com/sport/football/41214712
http://www.

In [3]:
# the entire crawling process

bbc_football_data = []

with open("data/items.csv", newline="") as openfile:
    reader = csv.reader(openfile)
    for row in reader:
        url = row[0]
        print(url)  # to know the status of web crawling
        response = requests.get(url)
        data = html.fromstring(response.text)

        # text nodes directly under each <p> of the story body
        texts = data.xpath("//div[@id='story-body']/p/text()")
        raw = ''.join(texts)
        # strip carriage returns, newlines, and tabs
        finaldata = raw.replace('\r', '').replace('\n', '').replace('\t', '')
        bbc_football_data.append([finaldata])


In [4]:
len(bbc_football_data)

264

We should expect some articles to be too short (or even empty) to include in our football dataset. The XPath data.xpath("//div[@id='story-body']/p/text()") returns nothing for some short articles. This is intentional: such articles would be useless for further content analytics and descriptive analytics, so we will remove them.
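
To see what an XPath ending in `/p/text()` selects, here is a minimal sketch on a hand-written fragment. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (it supports only a subset of XPath and needs well-formed markup), so the `text()` step is emulated by hand. The fragment, its text, and the function name `direct_text_nodes` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the markup the XPath targets
FRAGMENT = """
<html><body>
  <div id="story-body">
    <p>First paragraph.</p>
    <p>Second paragraph with a <a href="#">link</a> inside.</p>
  </div>
  <div id="other"><p>Not part of the story.</p></div>
</body></html>
"""


def direct_text_nodes(fragment):
    """Emulate //div[@id='story-body']/p/text(): text directly under each <p>."""
    root = ET.fromstring(fragment)
    texts = []
    for p in root.findall(".//div[@id='story-body']/p"):
        if p.text:
            texts.append(p.text)       # text before the first child element
        for child in p:
            if child.tail:
                texts.append(child.tail)  # text that follows a child element
    return texts


texts = direct_text_nodes(FRAGMENT)
print(texts)
```

Two things follow from this selection: a page without a matching `div` yields an empty list, and text wrapped in nested tags such as links is not returned, which may account for small gaps in the crawled text.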

In [5]:
# we should expect that some articles are too short (and even empty) to be included in our football dataset
bbc_football_data[:2]

[['Real Madrid midfielder Casemiro has defended forward Gareth Bale after he was booed for the second game in a row.Wales international Bales, 28, was jeered by some of the Bernabeu crowd in Wednesday\'s  over Apoel Nicosia."He was a very strong player for us and I hope he continues to be so," Casemiro "We try to protect the players we have. We are always defending our own. It shows that we are a family."He added: "I defend my team-mates as if they were my family and I am with them until the end."With the quality he has, things will be going well for him one day."Bale has scored only one goal in six games for Madrid this season.Manager Zinedine Zidane said: "I did not listen to the whistles, only the applause."Swansea manager Paul Clement, the former assistant boss at Madrid, said Bale can cope with the boos and emerge a better player."That is a club where you need a lot of character," Clement said."Claude Makelele says it is the club you need the most mental toughness, because of how 

In [6]:
for row in bbc_football_data:
    for text in row:
        print(len(text.split()))  # number of words in the article

284
493
0
0
134
895
0
79
0
58
274
262
379
289
0
0
297
0
448
235
0
495
143
809
1333
1920
176
503
145
251
111
635
137
0
419
805
264
202
494
717
543
228
136
69
330
914
359
534
150
655
474
438
481
450
773
216
154
159
106
466
99
529
475
602
285
1327
1328
512
562
423
211
299
256
653
302
318
494
168
405
400
331
156
487
386
192
850
215
303
550
765
386
342
269
341
589
583
491
573
485
285
474
430
0
0
179
505
95
336
542
251
317
227
475
250
190
559
314
336
443
438
592
446
338
1014
545
511
135
137
319
248
705
557
581
135
212
359
456
300
606
1602
3369
801
503
361
304
82
523
526
129
521
445
497
574
617
118
753
284
241
327
456
818
302
783
108
471
181
180
424
615
884
121
68
239
225
286
286
166
406
257
337
530
355
151
237
321
327
360
230
631
296
373
219
75
417
185
198
290
494
283
0
200
236
413
282
793
242
291
198
412
282
571
375
808
1428
1258
1883
2182
1860
143
640
221
510
3307
90
1166
478
483
184
487
545
105
604
421
244
75
769
226
174
360
109
734
313
1088
478
209
369
251
388
792
90
485
201
512
257
450
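
The zeros in the output above correspond to articles for which the XPath returned no text. As a minimal sketch, such rows could be dropped in Python before building the DataFrame. The function name `drop_short_articles`, the `min_words` threshold, and the sample rows are illustrative assumptions, not part of the original notebook.

```python
def drop_short_articles(rows, min_words=1):
    """Keep rows whose text has at least min_words whitespace-separated words."""
    return [row for row in rows if len(row[0].split()) >= min_words]


# Sample rows in the same shape as bbc_football_data: one text per row
sample = [
    ["Real Madrid midfielder Casemiro has defended forward Gareth Bale"],
    [""],
    ["   "],
    ["Short note"],
]

kept = drop_short_articles(sample)                    # drops only empty rows
kept_long = drop_short_articles(sample, min_words=5)  # also drops very short rows
print(len(kept), len(kept_long))
```

Applied to the real data, the call would be `bbc_football_data = drop_short_articles(bbc_football_data)`; a higher `min_words` is a judgment call about what counts as too short.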


In [12]:
df = pd.DataFrame(bbc_football_data)
df.head()

|   | 0 |
|---|---|
| 0 | Harry Kane scored twice as Tottenham beat Boru... |
| 1 | The first woman to referee a senior men's game... |
| 2 | Raith Rovers moved four points clear at the to... |
| 3 | Nigeria midfielder Ogenyi Onazi has put his fa... |
| 4 | Pedro Caixinha backed Alfredo Morelos to keep ... |


In [13]:
df.to_csv("data/output_bbcfootball_crawledtexts.csv")

Open output_bbcfootball_crawledtexts.csv in Excel and remove the empty rows. Then prepare the dataset as follows:

<img src = "images\scrapy5.gif">

Done! Nice job. This tutorial has shown how to collect data through advanced web crawling (Scrapy and more).

### Now you should be able to repeat the entire process to collect data about "politics" (or any other topic) and prepare the final dataset as follows:

<img src = "images\scrapy6.gif">

### Then you can perform analyses such as sentiment analysis, word clouds, text classification, word frequency, data visualization, business intelligence, etc.

# Appendix: Can you collect politics articles?
- You need to develop a **regular expression for the URLs of politics articles**
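
As a starting point, the news URLs listed earlier follow the shape `/news/<section>-<8 digits>`. The sketch below builds a pattern for one section under that assumption and checks it against those example URLs. The helper name `section_pattern` is made up, and the section name `politics` is a guess: inspect real BBC politics URLs to find the actual section name before using it.

```python
import re


def section_pattern(section):
    """Build a regex for news URLs shaped like /news/<section>-<8 digits>."""
    return re.compile(r".+\/news\/" + re.escape(section) + r"-\d{8}$")


# Example URLs taken from the scenario section of this tutorial
examples = [
    "http://www.bbc.com/news/technology-35591726",
    "http://www.bbc.com/news/technology-35614335",
    "http://www.bbc.com/news/science-environment-35612559",
    "http://www.bbc.com/sport/tennis/35603859",
]

technology = section_pattern("technology")
matches = [u for u in examples if technology.match(u)]
print(matches)

# Hypothetical: the real section name for politics must be verified
politics = section_pattern("politics")
```

The resulting pattern string can then replace the football pattern in the spider's `allow` list, together with a matching start URL.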