# Introduction to Web Scraping with BeautifulSoup

## Getting Started



In terms of libraries, we have a few choices, including:
* Requests
* Beautiful Soup
* Scrapy
* Selenium

`Scrapy` is a complete web scraping framework that takes care of everything from fetching the HTML to processing the data. Selenium is a browser automation tool that can, for example, let you navigate between multiple pages. Both have a steeper learning curve than Requests, which is used to fetch the HTML, and BeautifulSoup, which is used to parse it.

## Inspect the website

To get information about the elements we want to access, we first need to inspect the web page using the browser's developer tools.
In this post we will scrape the “content” and “see also” sections from an arbitrary Wikipedia article. To find the elements and attributes used for these sections, we can right-click on an element and inspect it. This opens the inspector, which lets us look at the HTML code.

<img src ="https://miro.medium.com/max/2284/1*C2cTA1aBq5zJMTujIv444A.png" width ="500px"/>

The content section has an id of `toc` and each list item has a class of `tocsection-n`, where n is the number of the list item. To get the content text, we can loop through all list items that have a class starting with `tocsection-`. This can be done using BeautifulSoup in combination with regular expressions.
To get the data from the “see also” section, we can loop through all the list items contained in the div with the classes `div-col columns column-width`.
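BeautifulSoup treats `class` as a multi-valued attribute and checks a filter against each class name separately. That is why a pattern anchored with `^` can still match a list item whose class attribute starts with a different class. The following is a minimal sketch of that idea using only the `re` module. The variable names are made up for illustration, and the class string is taken from the kind of list item described above.

```python
import re

pattern = re.compile('^tocsection-')

# A class attribute as it appears on a table-of-contents list item
class_attribute = "toclevel-1 tocsection-1"

# Taken as one string, the attribute starts with "toclevel-", so '^' does not match
matches_whole = bool(pattern.search(class_attribute))

# Checked one class name at a time, the second class matches
matching_classes = [c for c in class_attribute.split() if pattern.search(c)]

print(matches_whole)     # False
print(matching_classes)  # ['tocsection-1']
```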

## Parse HTML

Now that we know what we need to scrape, we can get started by parsing the HTML. First we need to import the libraries that we will use for scraping the website. As mentioned above, we will use BeautifulSoup for parsing the page and searching for specific elements. For connecting to the website and getting the HTML we will use urllib, which is part of the Python standard library and therefore already installed. Lastly, the re library will be used for working with regular expressions.

In [6]:
# importing libraries
from bs4 import BeautifulSoup
import urllib.error
import urllib.request
import re

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

try:
    page = urllib.request.urlopen(url)  # connect to the website
except urllib.error.URLError as err:
    print(f"An error occurred: {err}")
    raise  # without a response there is nothing to parse

soup = BeautifulSoup(page, 'html.parser')
print(soup)

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Artificial intelligence - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"35345a7b-140a-4a9e-91f9-03f0aaf9ff25","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Artificial_intelligence","wgTitle":"Artificial intelligence","wgCurRevisionId":1084791620,"wgRevisionId":1084791620,"wgArticleId":1164,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Harv and Sfn no-target errors","Articles with short description","Short description is different from Wikidata","Wikipedia in
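The fetching step can also be wrapped in a small helper so that a failed request is handled in one place. The sketch below is one possible shape for such a helper, not the approach used in the rest of this post. The function name `fetch_html`, the timeout value and the User-Agent string are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    # Hypothetical User-Agent string; some sites reject requests that send none
    request = urllib.request.Request(
        url, headers={"User-Agent": "scraping-tutorial-example"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset announced by the server, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print(f"Could not fetch {url}: {err}")
        return None
```

With a helper like this, the soup would only be created when the result is not `None`, for example `html = fetch_html(url)` followed by `BeautifulSoup(html, 'html.parser')`.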

## Find specific elements in the page

The BeautifulSoup object we created can now be used to find elements in the HTML. When we inspected the website, we saw that every list item in the content section has a class that starts with `tocsection-`, so we can use BeautifulSoup’s `find_all` method to find all list items with such a class.

In [7]:
regex = re.compile('^tocsection-')
content_lis = soup.find_all('li', attrs={'class': regex})
print(content_lis)

[<li class="toclevel-1 tocsection-1"><a href="#History"><span class="tocnumber">1</span> <span class="toctext">History</span></a></li>, <li class="toclevel-1 tocsection-2"><a href="#Goals"><span class="tocnumber">2</span> <span class="toctext">Goals</span></a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Reasoning,_problem-solving"><span class="tocnumber">2.1</span> <span class="toctext">Reasoning, problem-solving</span></a></li>
<li class="toclevel-2 tocsection-4"><a href="#Knowledge_representation"><span class="tocnumber">2.2</span> <span class="toctext">Knowledge representation</span></a></li>
<li class="toclevel-2 tocsection-5"><a href="#Planning"><span class="tocnumber">2.3</span> <span class="toctext">Planning</span></a></li>
<li class="toclevel-2 tocsection-6"><a href="#Learning"><span class="tocnumber">2.4</span> <span class="toctext">Learning</span></a></li>
<li class="toclevel-2 tocsection-7"><a href="#Natural_language_processing"><span class="tocnumber">2.5</span> <spa

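To get the raw text, we can loop through the list and call the `getText` method on each list item. The text of a parent item also contains the text of its nested items on later lines, so we keep only the first line.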

In [8]:
content = []
for li in content_lis:
    content.append(li.getText().split('\n')[0])
print(content)

['1 History', '2 Goals', '2.1 Reasoning, problem-solving', '2.2 Knowledge representation', '2.3 Planning', '2.4 Learning', '2.5 Natural language processing', '2.6 Perception', '2.7 Motion and manipulation', '2.8 Social intelligence', '2.9 General intelligence', '3 Tools', '3.1 Search and optimization', '3.2 Logic', '3.3 Probabilistic methods for uncertain reasoning', '3.4 Classifiers and statistical learning methods', '3.5 Artificial neural networks', '3.5.1 Deep learning', '3.6 Specialized languages and hardware', '4 Applications', '5 Philosophy', '5.1 Defining artificial intelligence', '5.1.1 Thinking vs. acting: the Turing test', '5.1.2 Acting humanly vs. acting intelligently: intelligent agents', '5.2 Evaluating approaches to AI', '5.2.1 Symbolic AI and its limits', '5.2.2 Neat vs. scruffy', '5.2.3 Soft vs. hard computing', '5.2.4 Narrow vs. general AI', '5.3 Machine consciousness, sentience and mind', '5.3.1 Consciousness', '5.3.2 Computationalism and functionalism', '5.3.3 Robot 
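To see why only the first line of each item's text is kept, here is a small self-contained sketch that runs the same steps on a hand-written HTML fragment instead of the live page. The fragment and the variable names are made up for illustration and only mimic the structure of the table of contents.

```python
from bs4 import BeautifulSoup
import re

# Hand-written fragment: the second item contains a nested list
sample_html = """
<ul>
<li class="toclevel-1 tocsection-1"><a href="#History"><span class="tocnumber">1</span> <span class="toctext">History</span></a></li>
<li class="toclevel-1 tocsection-2"><a href="#Goals"><span class="tocnumber">2</span> <span class="toctext">Goals</span></a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Planning"><span class="tocnumber">2.1</span> <span class="toctext">Planning</span></a></li>
</ul>
</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
items = sample_soup.find_all('li', attrs={'class': re.compile('^tocsection-')})

# The parent item's text also includes the text of its nested item
parent_text = items[1].get_text()

# Keeping only the first line gives one clean title per item
titles = [li.get_text().split('\n')[0] for li in items]
print(titles)  # ['1 History', '2 Goals', '2.1 Planning']
```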

To get the data from the “see also” section, we use the `find` method to get the div containing the list items, and then use `find_all` to get a list of the list items.

In [15]:
see_also_section = soup.find('div', attrs={'class': 'div-col'})
see_also_soup = see_also_section.find_all('li')
print(see_also_soup)

AttributeError: 'NoneType' object has no attribute 'find_all'
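The `AttributeError` above is what Python reports when `find` returns `None`, which it does whenever no element matches, for example because the page markup differs from what was seen in the inspector. A defensive variant checks for `None` before calling `find_all`. The sketch below uses a hand-written fragment, and the helper name `list_items_in` is hypothetical.

```python
from bs4 import BeautifulSoup


def list_items_in(soup, tag, class_name):
    """Return the <li> elements inside the first matching container, or [] if none matches."""
    container = soup.find(tag, attrs={'class': class_name})
    if container is None:
        return []
    return container.find_all('li')


# Hand-written fragment standing in for the "see also" section
sample_html = """
<div class="div-col"><ul>
<li><a href="/wiki/A" title="A">A</a></li>
<li><a href="/wiki/B" title="B">B</a></li>
</ul></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

found = list_items_in(sample_soup, 'div', 'div-col')
missing = list_items_in(sample_soup, 'div', 'no-such-class')
print(len(found), len(missing))  # 2 0
```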

To extract the hrefs and the text, we can use a loop in combination with the `find` method.

In [10]:
see_also = []
for li in see_also_soup:
    a_tag = li.find('a', href=True, attrs={'title':True, 'class':False}) # find <a> tags that have a title but no class attribute
    href = a_tag['href'] # get the href attribute
    text = a_tag.getText() # get the text
    see_also.append([href, text]) # append to the list
print(see_also)

NameError: name 'see_also_soup' is not defined
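The extraction loop can be tried out without the live page. The following sketch runs a simplified version of it on a hand-written list and skips items that contain no link, since `find` returns `None` for them. The fragment and names are made up for illustration, and the filter is reduced to `href=True`.

```python
from bs4 import BeautifulSoup

# Hand-written fragment: the second item has no link
sample_html = """
<ul>
<li><a href="/wiki/Machine_learning" title="Machine learning">Machine learning</a></li>
<li>No link in this item</li>
<li><a href="/wiki/Robotics" title="Robotics">Robotics</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

links = []
for li in sample_soup.find_all('li'):
    a_tag = li.find('a', href=True)  # first <a> with an href, or None
    if a_tag is None:
        continue  # nothing to extract from this item
    links.append([a_tag['href'], a_tag.get_text()])

print(links)  # [['/wiki/Machine_learning', 'Machine learning'], ['/wiki/Robotics', 'Robotics']]
```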

## Saving data

We almost always want to save our scraped data so that we can use it later. The easiest way is to save it to a .txt or .csv file using the `open` function, which is built into Python.

We will save the content section into a text file with the name content.txt.

In [11]:
with open('content.txt', 'w') as f:
    for i in content:
        f.write(i+"\n")

The best format for the “see also” data is probably CSV, because it has two columns (one for the href and one for the text).

In [12]:
with open('see_also.csv', 'w') as f:
    for i in see_also:
        f.write(",".join(i)+"\n")
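Joining fields with `","` works as long as no field contains a comma itself. Article titles can contain commas, so a more robust option is the standard library's `csv` module, which quotes such fields automatically. The sketch below uses made-up rows and a hypothetical file name in the temporary directory.

```python
import csv
import tempfile
from pathlib import Path

# Example rows in the same [href, text] shape as the see_also list;
# the values are made up for illustration
rows = [
    ['/wiki/Machine_learning', 'Machine learning'],
    ['/wiki/Example', 'A title, with a comma'],
]

# Hypothetical output location in the system's temporary directory
csv_path = Path(tempfile.gettempdir()) / 'see_also_example.csv'

# newline='' lets the csv module control line endings itself
with open(csv_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['href', 'text'])  # header row
    writer.writerows(rows)

# Read the file back to check that the comma survived the round trip
with open(csv_path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(read_back[2])  # ['/wiki/Example', 'A title, with a comma']
```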

## Conclusion

Web scraping is the process of downloading data from web pages and extracting information from that data. It is a great tool to have in your toolkit because it gives you access to a rich variety of data.

BeautifulSoup is a web scraping library that is best suited to small projects. For larger projects, libraries like Scrapy and Selenium start to shine, and I will cover both of them in another blog post.

## Enhancement

In [19]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

try:
    page = urllib.request.urlopen(url)  # connect to the website
except urllib.error.URLError as err:
    print(f"An error occurred: {err}")
    raise  # without a response there is nothing to parse

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('^tocsection-')
content_lis = soup.find_all('li', attrs={'class': regex})

content = []
for li in content_lis:
    content.append(li.getText().split('\n')[0])
print(content)

['1 History', '2 Design philosophy and features', '3 Syntax and semantics', '3.1 Indentation', '3.2 Statements and control flow', '3.3 Expressions', '3.4 Methods', '3.5 Typing', '3.6 Arithmetic operations', '4 Programming examples', '5 Libraries', '6 Development environments', '7 Implementations', '7.1 Reference implementation', '7.2 Other implementations', '7.3 Unsupported implementations', '7.4 Cross-compilers to other languages', '7.5 Performance', '8 Development', '9 API documentation generators', '10 Naming', '11 Popularity', '12 Uses', '13 Languages influenced by Python', '14 See also', '15 References', '15.1 Sources', '16 Further reading', '17 External links']


In [87]:
url = "https://www.businessinsider.com/what-is-python"

try:
    page = urllib.request.urlopen(url)  # connect to the website
except urllib.error.URLError as err:
    print(f"An error occurred: {err}")
    raise  # without a response there is nothing to parse

soup = BeautifulSoup(page, 'html.parser')
#print(soup)

# Find the title (h1) and the subtitles (h2) on the page
content_lis = soup.find_all('h2')
content_lis2 = soup.find_all('h1')

content2 = []
print("Titles:")
for li in content_lis2:
    content2.append(li.getText().split('\n')[0])
print(f"1.{content2[0]} \n")

content = []
print("Subtitles:")
for li in content_lis:
    content.append(li.getText().split('\n')[0])
print(f"1.{content[0]}  \n2.{content[1]} \n3.{content[2]}")

Titles:
1.What is Python? The popular, scalable programming language, explained

Subtitles:
1.What is Python?
2.How Python is used
3.Advantages of Python
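Indexing the first three subtitles by hand fails with an `IndexError` if a page has fewer than three `h2` elements. One way around this is to number whatever headings are found with `enumerate`. The sketch below runs on a hand-written fragment, the helper name `heading_texts` is hypothetical, and it uses `get_text(strip=True)` rather than the `split('\n')[0]` approach used above.

```python
from bs4 import BeautifulSoup


def heading_texts(soup, tag):
    """Return the stripped text of every element with the given tag name."""
    return [heading.get_text(strip=True) for heading in soup.find_all(tag)]


# Hand-written fragment standing in for an article page
sample_html = """
<h1>What is Python?</h1>
<h2>How Python is used</h2>
<h2>Advantages of Python</h2>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

subtitles = heading_texts(sample_soup, 'h2')
for number, text in enumerate(subtitles, start=1):
    print(f"{number}.{text}")
```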
