# Example of a Web Scraper
We develop a web scraper step by step. The aim is to show the procedure of data collection with the help of a scraper.
Individual Wikipedia pages are queried - see also the blog post by Frank Andrade.
The text is extracted from the pages. From this, an XML file is generated that can be used as input for the search engine.
The pages about the Football World Cup in Wikipedia serve as a basis.

Please note
Many websites, including Wikipedia, have defense mechanisms against aggressive crawlers/scrapers. Keep this in mind during development and only access Wikipedia if absolutely necessary.
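
One way to follow this advice during development is to store every downloaded page on disk, so that re-running a cell does not contact Wikipedia again. The following is only a sketch and not part of the original notebook: the names `cached_get` and `CACHE_DIR` and the `fetch` parameter are hypothetical, and the one-second pause is an arbitrary assumption.

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path("scraper_cache")  # hypothetical folder for downloaded pages


def cached_get(url, fetch, cache_dir=CACHE_DIR, delay=1.0):
    """Return the page text for url, downloading it at most once.

    fetch is any callable that takes a URL and returns the page text,
    for example a small wrapper around requests.get.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # A hash of the URL gives a file name that is safe on every file system
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / name
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    time.sleep(delay)  # pause before every real request
    text = fetch(url)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

With requests, a call could look like `cached_get(web, lambda u: requests.get(u, timeout=30).text)`. Passing the download function as a parameter keeps the cache logic testable without network access.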

Further resources:
- Scraper vs Crawler: https://medium.com/oncrawl-seo-tips-tricks/an-introduction-to-web-crawler-119a2b492b63
- Frank's post: https://medium.com/geekculture/yes-you-can-easily-scrape-websites-with-pandas-heres-how-f833157781d5
- Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#xml

## Getting to know the tools
We will spend a few sections getting to know the tools.
Then we will build the scraper step by step.
We use:

- BeautifulSoup
- requests

In [13]:
from bs4 import BeautifulSoup
import requests
path = '/home/bfh/irsed/daten/FIFA/'


### Cooking a soup
To extract data with Beautiful Soup, we need a 'soup'. This also requires the lxml parser. lxml is a separate package that must be installed alongside Beautiful Soup, but it does not need its own import statement.
To get the HTML code of a web page, we send a request to this page and read the text of the response.
Note: help(soup) shows the documentation for the soup object. The call does not belong in the finished code, but it is handy during development.

In [14]:
web = 'https://en.wikipedia.org/wiki/2014_FIFA_World_Cup'
response = requests.get(web)
content = response.text
soup = BeautifulSoup(content, 'lxml')

## Getting to know the page structure
To make progress with Beautiful Soup, we examine the structure of the source page.
To do this, we open the page in the browser, right-click an element, and open the browser's developer tools.
There we find regularities that we can use for scraping.

In [15]:
# Read the title-tag and show the content
print(soup.title.string)
titel = soup.title.string

2014 FIFA World Cup - Wikipedia


In [16]:
# find all p-tags and count them
len(soup.find_all('p'))

93

In [17]:
# find and count all h2-tags
len(soup.find_all('h2'))

# Fetch the tag with index 2 from the list and show it
test = soup.find_all('h2')[2]
print(test.getText())

Participating teams and officials


In [18]:
# Find all the headlines: every section heading is an h2 tag

headlines = soup.find_all('h2')
# print(headlines[1])
for h in headlines:
    print(h.getText())


Contents
Host selection
Participating teams and officials
Venues
Innovations
Format
Opening ceremony
Group stage
Knockout stage
Statistics
Final standings
Preparations and costs
Marketing
Symbols
Media
Controversies
See also
Notes
References
External links


In [19]:
# Now we want to find the text that belongs to a headline
# We inspect "Host selection" in the browser and find that the text is in p tags that are on
# the same level as the h2 tag with the headline we are interested in
# As a first attempt, we search for all p tags that follow our example tag

host_selection = headlines[1]   # fixed for the time being to develop the code
print(host_selection)

ps = host_selection.find_all_next("p")
print(ps)

# This is not usable, because it also returns the p tags that belong to the following headlines

<h2 id="Host_selection">Host selection</h2>
[<p>In March 2003, FIFA announced that the tournament would be held in South America for the first time since <a href="/wiki/1978_FIFA_World_Cup" title="1978 FIFA World Cup">1978</a>, in line with its policy at the time of rotating the right to host the World Cup among different confederations.<sup class="reference" id="cite_ref-13"><a href="#cite_note-13"><span class="cite-bracket">[</span>13<span class="cite-bracket">]</span></a></sup><sup class="reference" id="cite_ref-14"><a href="#cite_note-14"><span class="cite-bracket">[</span>14<span class="cite-bracket">]</span></a></sup> With the <a href="/wiki/2010_FIFA_World_Cup" title="2010 FIFA World Cup">2010 FIFA World Cup</a> hosted in South Africa, it would be the second consecutive World Cup outside Europe, which was a first for the tournament. It was also sixth time (second consecutive) in the Southern Hemisphere.<sup class="reference" id="cite_ref-15"><a href="#cite_note-15"><span class="

In [20]:
# Obviously the search must start one level higher
# Inspection of the page in the browser shows that the parent element has the class mw-parser-output
# so we search for that element

body = soup.find_all('div', class_='mw-parser-output')
if len(body) != 1:
    # titel was set when we read the title tag
    print("body not found " + titel)

alles = body[0]


for child in alles.children:
    if child.name == 'div':
        # A div with the class mw-heading2 contains the h2 headline of a section
        if child.has_attr("class") and "mw-heading2" in child["class"]:
            print("**********************************")
            h2 = child.find("h2").getText()
            print(h2)
    elif child.name == 'p':
        print(f'p: {child.name}')
        text = ""
        is_footnote = False
        # Skip everything between '[' and ']' (footnote markers such as [13])
        for elt in child.strings:
            if elt.startswith('['):
                is_footnote = True
            if not is_footnote:
                text += elt
            if elt.startswith(']'):
                is_footnote = False

        print(text)


p: p


p: p
The 2014 FIFA World Cup was the 20th FIFA World Cup, the quadrennial world championship for men's national football teams organised by FIFA. It took place in Brazil from 12 June to 13 July 2014, after the country was awarded the hosting rights in 2007. It was the second time that Brazil staged the competition, the first being in 1950, and the fifth time that it was held in South America.

p: p
31 national teams advanced through qualification competitions to join the host nation in the final tournament (with Bosnia and Herzegovina as the only debutant). A total of 64 matches were played in 12 venues located in as many host cities across Brazil. For the first time at a World Cup finals, match officials used goal-line technology, as well as vanishing spray for free kicks. FIFA Fan Fests in each host city gathered a total of 5 million people, and the country received 1 million visitors from 202 countries. Spain, the defending champions, were eliminated at the group stage. Host 
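
The footnote filter used in the loop above can be isolated and tried on a short list of strings. This is a sketch under assumptions: the function name `strip_footnotes` is hypothetical and the sample fragments are made up to resemble what `child.strings` yields for a paragraph with one reference.

```python
def strip_footnotes(strings):
    """Join text fragments, skipping everything from '[' up to and including ']'."""
    parts = []
    in_footnote = False
    for s in strings:
        if s.startswith("["):
            in_footnote = True
        if not in_footnote:
            parts.append(s)
        if s.startswith("]"):
            in_footnote = False
    return "".join(parts)


# Made-up fragments; the bracket and the number arrive as separate strings
fragments = ["held in South America since ", "1978", ".", "[", "13", "]", " With the 2010 cup"]
print(strip_footnotes(fragments))  # held in South America since 1978. With the 2010 cup
```

The filter relies on the opening and the closing bracket arriving as separate strings. A single fragment such as `"[13]"` would switch the filter on without switching it off again, so the rest of the paragraph would be lost.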

## Create and write XML
We need an XML file with the following structure:

<pre>
&lt;add&gt;
  &lt;doc&gt;
   &lt;field name="xxx"&gt;yyyyy&lt;/field&gt;
   &lt;field name="xxx"&gt;yyyyy&lt;/field&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
   &lt;field name="xxx"&gt;yyyyy&lt;/field&gt;
   &lt;field name="xxx"&gt;yyyyy&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;
</pre>

In [21]:
xml = BeautifulSoup(features='xml')
xml.append(xml.new_tag("add"))
add = xml.find('add')
document = xml.new_tag('doc')
add.append(document)
field = xml.new_tag('field', attrs = {'name':'xxxx'})
field.string = 'yyyy'
field2 = xml.new_tag('field', attrs = {'name':'xxxx'})
field2.string = 'yyyy'
document.append(field)
document.append(field2)
document2 = xml.new_tag('doc')
add.append(document2)
field = xml.new_tag('field', attrs = {'name':'yyyy'})
field.string = 'yyyy'
field2 = xml.new_tag('field', attrs = {'name':'yyyy'})
field2.string = 'yyyy'
document2.append(field)
document2.append(field2)

from bs4.formatter import XMLFormatter
formatter = XMLFormatter(indent=0)
with open(path+"test.xml", "w") as f:
    f.write(xml.prettify(formatter=formatter))

# print(xml.prettify(formatter=formatter))

# View and verify the file on the file system

In [22]:
from bs4 import BeautifulSoup


print("Hello World")
xml = BeautifulSoup(features='xml')
add = xml.new_tag('foo')
add.string = 'bar&bar'
xml.append(add)
from bs4.formatter import XMLFormatter
formatter = XMLFormatter(indent=0)
xml.prettify(formatter=formatter)

Hello World


'<?xml version="1.0" encoding="utf-8"?>\n<foo>\nbar&bar\n</foo>\n'
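
In the output above, the ampersand is written unescaped, so an XML parser would reject the string. The following sketch uses only the Python standard library (not Beautiful Soup) to show the difference between the raw and the escaped form. It is meant as an illustration of the problem, not as the notebook's method.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "bar&bar"

# The unescaped form is rejected by an XML parser
try:
    ET.fromstring(f"<foo>{raw}</foo>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

# escape() replaces &, < and > with the corresponding entities
escaped = escape(raw)
element = ET.fromstring(f"<foo>{escaped}</foo>")

print(unescaped_ok)   # False
print(escaped)        # bar&amp;bar
print(element.text)   # bar&bar
```

Beautiful Soup's documented `formatter="minimal"` setting performs this kind of substitution when the tree is converted to a string.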

# Find field names
- Field names must comply with the XML rules and must not contain blanks.
- We first find all the headings and then create a dictionary.
- This will be needed later, when the search engine is set up.
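
A possible way to turn a heading into such a field name is sketched below. The function name `to_field_name` is hypothetical, and the rule about leading digits assumes that the same restrictions apply as for XML element names. For plain headings the result is the same as with `h2.lower().replace(" ", "_")`.

```python
import re


def to_field_name(heading):
    """Turn a heading into a lowercase name without blanks."""
    name = heading.strip().lower()
    # Replace every run of characters other than a-z, 0-9 and _ with one underscore
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    name = name.strip("_")
    # Assumption: like XML element names, a field name must not start with a digit
    if name and name[0].isdigit():
        name = "_" + name
    return name


print(to_field_name("Preparations and costs"))  # preparations_and_costs
print(to_field_name("Qualification (UEFA)"))    # qualification_uefa
```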

The link structure of the Football World Cup pages on Wikipedia is very regular:
- https://en.wikipedia.org/wiki/2014_FIFA_World_Cup
- https://en.wikipedia.org/wiki/2018_FIFA_World_Cup
We can take advantage of this to get all the pages.

<pre>
years = [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
         1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014,
         2018, 2022]
</pre>
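
The list of years and the matching URLs can also be derived instead of typed in. This is a small sketch with hypothetical helper names (`world_cup_years`, `world_cup_url`). It relies on the tournament taking place every four years since 1930, with no tournaments in 1942 and 1946.

```python
def world_cup_years(first=1930, last=2022):
    """Every fourth year from first to last, without 1942 and 1946."""
    return [y for y in range(first, last + 1, 4) if y not in (1942, 1946)]


def world_cup_url(year):
    """Build the Wikipedia URL for the World Cup of the given year."""
    return f"https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup"


years = world_cup_years()
urls = [world_cup_url(y) for y in years]
print(len(years))  # 22
print(urls[-3])    # https://en.wikipedia.org/wiki/2014_FIFA_World_Cup
```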

In [23]:
from bs4 import BeautifulSoup
import requests

# We parse all files and extract all h2
def get_headers(year):
    web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
    response = requests.get(web)
    content = response.text
    headers = set()
    soup = BeautifulSoup(content, 'lxml')
    bodies = soup.find_all('div', class_='mw-parser-output')
    if (len(bodies) != 1):
        print(f"Body inexistent or non-unique: year={year} count={len(bodies)}")
    for body in bodies:
        for child in body.children:
            if child.name == 'div':
                if child.has_attr("class") and "mw-heading2" in child["class"]:
                    h2 = child.find("h2").getText()    
                    print(h2)
                    headers.add(h2.lower().replace(" ","_"))
    return headers

# We already know the years
years = [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
         1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014,
         2018, 2022]
years = ["2014"]  # overrides the full list, so only one page is requested

# Name of the <h1> field - should be the same for all DU
fields = {'title'}
# Fetch all titles
for year in years:
    fields = fields.union(get_headers(year))

print(fields)

# Write all fields nicely sorted and on individual lines in a file
fields = list(fields)
fields.sort()
with open(path+"fifa_fields.csv", "w") as f:
    for field in fields:
        f.write(field+"\n")

Host selection
Participating teams and officials
Venues
Innovations
Format
Opening ceremony
Group stage
Knockout stage
Statistics
Final standings
Preparations and costs
Marketing
Symbols
Media
Controversies
See also
Notes
References
External links
{'opening_ceremony', 'see_also', 'final_standings', 'venues', 'knockout_stage', 'controversies', 'format', 'preparations_and_costs', 'title', 'host_selection', 'innovations', 'references', 'notes', 'group_stage', 'participating_teams_and_officials', 'marketing', 'statistics', 'external_links', 'symbols', 'media'}


# Putting together the script
Now we have all the elements we need to put together the finished script.
When assembling the output XML, we must take into account the special structure of the input:
<pre>
h1
p
h2
p
h2
p
</pre>
- h1 is read separately
- h2 and p are read within the for loop. Be careful: the first p elements belong to the h1
- The aim is to create this structure:
<pre>
&lt;add&gt;
  &lt;doc&gt;
   &lt;field name="title"&gt;Content of the h1 tag&lt;/field&gt;
   &lt;field name="description"&gt;Introductory description&lt;/field&gt;
   &lt;field name="field"&gt;content&lt;/field&gt;
   &lt;field name="field"&gt;content&lt;/field&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
   &lt;field name="title"&gt;Content of the h1 tag&lt;/field&gt;
   &lt;field name="description"&gt;Introductory description&lt;/field&gt;
   &lt;field name="field"&gt;content&lt;/field&gt;
   &lt;field name="field"&gt;content&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;
</pre>
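
The grouping step can be tried without any HTML. The sketch below works on a flat list of (tag, text) pairs. The function name `group_sections` and the sample page are made up for illustration. Paragraphs before the first h2 end up in the field `description`, and the text after the last h2 is stored once the loop has finished.

```python
def group_sections(elements, first_field="description"):
    """Group a flat sequence of (tag, text) pairs into {field name: text}."""
    fields = {}
    fieldname = first_field
    text = ""
    for tag, content in elements:
        if tag == "h2":
            # A new headline closes the previous field
            if text.strip():
                fields[fieldname] = text
            fieldname = content.lower().replace(" ", "_")
            text = ""
        elif tag == "p":
            text += content
    # Store the paragraphs that follow the last h2 as well
    if text.strip():
        fields[fieldname] = text
    return fields


# Made-up sample page
page = [
    ("p", "Intro text. "),
    ("h2", "Host selection"),
    ("p", "First paragraph. "),
    ("p", "Second paragraph."),
    ("h2", "Venues"),
    ("p", "Twelve venues."),
]
print(group_sections(page))
```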


In [24]:
from bs4 import BeautifulSoup
import requests

def one_document(year, document):
    web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
    response = requests.get(web)
    content = response.text
    soup = BeautifulSoup(content, 'lxml')
    text = ""
    # Find h1
    find_h1(soup, document)
    # find all other fields
    bodies = soup.find_all('div', class_='mw-parser-output')
    if len(bodies) != 1:
        print(f"Body does not exist or is not unique: year={year} count={len(bodies)}")
    fieldname = 'description' #first field
    for body in bodies:
        for child in body.children:
            if child.name == 'div':
                if child.has_attr("class") and "mw-heading2" in child["class"]:
                    # print("**********************************")
                    create_field(document, fieldname, text)
                    text = ""
                    # print (f'h2: {child.name}')
                    h2 = child.find("h2").getText()    
                    print(h2)
                    fieldname = h2.lower().replace(" ","_")
                    # print(fieldname)
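            elif child.name == 'p':
                # Append the paragraph text, skipping footnote markers such as [13]
                is_footnote = False
                for elt in child.strings:
                    if elt.startswith('['):
                        is_footnote = True
                    if not is_footnote:
                        text += elt
                    if elt.startswith(']'):
                        is_footnote = False
    # Write the text collected after the last headline, if there is any
    if text.strip():
        create_field(document, fieldname, text)
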
def find_h1(soup, document):
    h1 = soup.find('span', class_='mw-page-title-main').string
    create_field(document, "title", h1)

def create_field(document, fieldname, text):
    print(fieldname)
    if len(text.strip()) > 0:
        field = xml.new_tag('field', attrs = {'name':fieldname})
        field.string = text
        document.append(field)

# We will re-use the list of years
years = [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
         1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014,
         2018, 2022]
years = [2014]

xml = BeautifulSoup(features='xml')
add = xml.new_tag('add')
xml.append(add)

# Create a new doc-tag per year
for year in years:
    print(year)
    document = xml.new_tag('doc')
    create_field(document, "year", str(year))   # generate id_field
    one_document(year,document)
    xml.add.append(document)


# Write the XML to a file, with all line breaks replaced by blanks
# (str.replace returns a new string, so the result must be assigned)
with open(path+"fifa.xml", "w", encoding="utf-8") as f:
    output = str(xml).replace('\n', ' ')
    print(output)
    f.write(output)


2014
year
title
description
Host selection
host_selection
Participating teams and officials
participating_teams_and_officials
Venues
venues
Innovations
innovations
Format
format
Opening ceremony
opening_ceremony
Group stage
group_stage
Knockout stage
knockout_stage
Statistics
statistics
Final standings
final_standings
Preparations and costs
preparations_and_costs
Marketing
marketing
Symbols
symbols
Media
media
Controversies
controversies
See also
see_also
Notes
notes
References
references
External links
<?xml version="1.0" encoding="utf-8"?>
<add><doc><field name="year">2014</field><field name="title">2014 FIFA World Cup</field><field name="description">
The 2014 FIFA World Cup was the 20th FIFA World Cup, the quadrennial world championship for men's national football teams organised by FIFA. It took place in Brazil from 12 June to 13 July 2014, after the country was awarded the hosting rights in 2007. It was the second time that Brazil staged the competition, the first being in 1950, 

# Quality
The output in the file must be checked and verified manually.
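
Part of this check can be automated before the manual review. The following sketch uses the standard library parser. The function name `check_add_xml` and the choice of required fields (`year`, `title`) are assumptions based on the script above. It only tests the structure and cannot judge whether the extracted text is meaningful.

```python
import xml.etree.ElementTree as ET


def check_add_xml(xml_text, required=("year", "title")):
    """Return a list of problems found in an <add><doc>...</doc></add> document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    problems = []
    if root.tag != "add":
        problems.append(f"root element is <{root.tag}>, expected <add>")
    docs = root.findall("doc")
    if not docs:
        problems.append("no <doc> elements found")
    for number, doc in enumerate(docs, start=1):
        fields = doc.findall("field")
        names = [field.get("name") for field in fields]
        for name in required:
            if name not in names:
                problems.append(f"doc {number}: missing field '{name}'")
        for field in fields:
            if not (field.text or "").strip():
                problems.append(f"doc {number}: field '{field.get('name')}' is empty")
    return problems
```

For the file written above, the function could be called with the file content read in binary mode, for example `check_add_xml(open(path + "fifa.xml", "rb").read())`. An empty list means that no structural problem was found.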