# Web scraping with Python

This is a short tutorial on web scraping. 

Web scraping is the automated extraction of data from websites. Web pages are written in text-based markup languages (HTML and XHTML), so each page is essentially code with embedded pieces of data (text, photos, links to other pages, and so on) that we want to extract. The browser renders this code into what we see as users. To get the data, we first download the page code, then parse it, and finally extract the pieces we need.

Parsing the page code, that is, working out which tags and attributes hold the elements we need (for example, user comments), is not always easy. Sometimes different kinds of data sit under the same tags and attributes, and it is not obvious how to collect only what we need automatically. In addition, the page code can change, and then the scraper has to be reworked. Some resources, such as YouTube or Vkontakte, provide an application programming interface (API): a documented set of commands for retrieving data. This is much more convenient, because we do not have to figure out where the data sits in the page. However, not all resources on the Internet have an API, so web scraping remains a useful tool.

An important rule: before collecting information from the site, find the terms of use and make sure that scraping is not prohibited.
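
Besides the terms of use, many sites publish a robots.txt file that tells automated clients which paths they may request. Below is a minimal sketch of checking such rules with Python's standard `urllib.robotparser` module. The rules and the `example.com` addresses are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of downloading a real file
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(user_agent, url) reports whether the rules allow the request
allowed = parser.can_fetch("*", "https://example.com/series-1-episode-1/")
blocked = parser.can_fetch("*", "https://example.com/private/page.html")
print(allowed, blocked)
```

For a live site, calling `set_url(...)` and then `read()` on the parser downloads the file instead of using a hard-coded sample. A permissive robots.txt does not replace reading the terms of use.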

To see how web scraping works, let's download the transcripts of "The Big Bang Theory" series from here https://bigbangtrans.wordpress.com (kudos to the creator of this site!). 

Let's look at the code of the page with the transcript of the first episode https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/. Press Ctrl + Shift + I, or right-click the page and choose "Inspect" (the exact wording depends on the browser). If you click the arrow icon in the upper left corner of the code window, you can hover over elements on the page and see where they sit in the code and which tags hold them.

You can also move the mouse slowly over the lines of page code on the right and see which page fragments are highlighted. This is how we look for the code fragment that contains the data we need.

We can see, for example, that links are written with the `<a>` tag. In HTML, this tag tells the browser that the text it contains is a web link. You can read more about HTML tags, for example, here: http://htmlbook.ru/

We are interested in the transcript itself. It is contained in the "div" tag with the attribute 'class="entrytext"'.

Now let's try to extract this data.

In [1]:
# The first stage is getting the page code, which we will then parse
# To get the content of web pages, the "requests" package is used
# let's import it

import requests



In [2]:
# Now we can use this package to send requests and receive information from web pages
# The package documentation lists the available commands: https://docs.python-requests.org/en/master/

# Let's create a variable with the link to the page we need to get

url = "https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/"  

# creating a request to get this page, passing the link as a parameter:

response = requests.get(url) 


*Sometimes a site requires details about the user agent, i.e. the browser from which the request is made.*   
*In this case, you can pass a User-Agent header with the request, for example:*

response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}) 


In [3]:
# now the entire content of the page is written to the response variable
# we can see the entire content with this command --

response.text 



'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en">\n\n<head profile="http://gmpg.org/xfn/11">\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\t<title>Series 01 Episode 01 &#8211; Pilot Episode | Big Bang Theory Transcripts</title>\n\t<!--[if lte IE 8]>\n\t<link rel="stylesheet" href="https://s2.wp.com/wp-content/themes/pub/chaoticsoul/ie.css" type="text/css" media="screen" />\n\t<![endif]-->\n\t<link rel="pingback" href="https://bigbangtrans.wordpress.com/xmlrpc.php" />\n\t<meta name=\'robots\' content=\'max-image-preview:large\' />\n<link rel=\'dns-prefetch\' href=\'//s2.wp.com\' />\n<link rel=\'dns-prefetch\' href=\'//s1.wp.com\' />\n<link rel=\'dns-prefetch\' href=\'//s0.wp.com\' />\n<link rel=\'dns-prefetch\' href=\'//s.pubmine.com\' />\n<link rel=\'dns-prefetch\' href=\'//x.bidswitch.net\' />\n<link rel=\'dns-prefetch\' href=\'//

This is roughly what we saw when viewing the page code in the browser.
In this raw form it is very hard to read.

We need a parser: a tool that separates the markup (tags, attributes) from everything else and lets us pull out the data we need.
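
To make the idea of a parser concrete, here is a small sketch that uses only Python's built-in `html.parser` module, not the package used in the rest of this tutorial. The class name `DivTextExtractor` and the sample HTML string are invented for this illustration. The sketch walks through the tags one by one and keeps only the text found inside a `div` with a given class.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collects the text inside <div class="..."> for one given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1            # a nested div inside the target
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1             # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Made-up page fragment: a menu block and a block with the text we want
sample_html = (
    '<div class="menu"><a href="/about/">About</a></div>'
    '<div class="entrytext"><p>Leonard: Agreed.</p><p>Sheldon: No point.</p></div>'
)

extractor = DivTextExtractor("entrytext")
extractor.feed(sample_html)
print(extractor.chunks)
```

Writing such a class for every task is tedious, which is why a ready-made parsing package is usually the more practical choice.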

We will use the HTML parser from the BeautifulSoup package. Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/



In [4]:
# let's parse the page
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser') 

# The same page is now stored in the "soup" variable, but in a structured form
# Knowing the tag, we can get what we need with the soup.find_all(tag) method
# (findAll is an older spelling of the same method)
# For example, let's find all the links using the tag "a"

# The result still contains some markup, but it is much easier to read

soup.find_all('a')


[<a href="https://bigbangtrans.wordpress.com/">Big Bang Theory Transcripts</a>,
 <a class="share-twitter sd-button share-icon" data-shared="sharing-twitter-3" href="https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/?share=twitter" rel="nofollow noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a>,
 <a class="share-facebook sd-button share-icon" data-shared="sharing-facebook-3" href="https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/?share=facebook" rel="nofollow noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a>,
 <a class="sd-link-color"></a>,
 <a href="https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&amp;hosted_button_id=4682827"><img src="https://www.paypal.com/en_GB/i/btn/btn_donate_SM.gif"/></a>,
 <a href="https://bigbangtrans.wordpress.com/about/">About</a>,
 <a aria-current="page" href="https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episod

In [5]:
# Now let's finally get the transcript
# Viewing the page code in the browser showed that the text we need is inside the "div" tag with the attribute 'class="entrytext"'

# By the way, this is the hardest part: working out which tags and attributes you need. It takes time and experimentation

# So, let's use this tag and attribute to construct the command
# The structure of the command comes from the BeautifulSoup documentation (in case you wonder)

soup.find_all('div', {'class': 'entrytext'})


[<div class="entrytext">
 <p class="MsoNormal" style="margin:0 0 10pt;"><em><span style="font-size:small;"><span style="font-family:Calibri;">Scene: A corridor at a sperm bank.</span></span></em></p>
 <p class="MsoNormal" style="margin:0 0 10pt;"><span style="font-size:small;font-family:Calibri;">Sheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.</span></p>
 <p class="MsoNormal" style="margin:0 0 10pt;"><span style="font-size:small;font-family:Calibri;">Leonard: Agreed, what’s your point?</span></p>
 <p class="MsoNormal" style="margin:0 0 10pt;"><span style="font-size:small;font-family:Calibri;">Sheldon: There’s no point, I just think it’s a good idea for a tee-shirt. </span></p>
 <p class="MsoNormal" style="margin:0 0 10pt;"><span style="font-size:sma

We got a list-like structure. Now we need to keep the text and get rid of the markup. According to the documentation, this can be done with the "get_text()" method, which we call on the first element of the list.

In [6]:
text = soup.find_all('div', {'class': 'entrytext'})[0].get_text()
text

"\nScene: A corridor at a sperm bank.\nSheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.\nLeonard: Agreed, what’s your point?\nSheldon: There’s no point, I just think it’s a good idea for a tee-shirt. \nLeonard: Excuse me?\nReceptionist: Hang on. \nLeonard: One across is Aegean, eight down is Nabakov, twenty-six across is MCM, fourteen down is… move your finger… phylum, which makes fourteen across Port-au-Prince. See, Papa Doc’s capital idea, that’s Port-au-Prince. Haiti. \nReceptionist: Can I help you?\nLeonard: Yes. Um, is this the High IQ sperm bank?\nReceptionist: If you have to ask, maybe you shouldn’t be here.\nSheldon: I think this is the place.\nReceptionist: Fill these out.\nLeonard: Thank-you. We’ll be right back.\nReceptionist: Oh, take you

Great, this is almost what we need. However, there is some technical text at the end. Let's remove it. This can be done in many ways, but we will use regular expressions (https://docs.python.org/3/howto/regex.html), because this text probably looks a little different on different pages of the site. We need to remove the part that starts with `__ATA` and ends with ":Like Loading...". The regular expression for that is `"__ATA(.|\s)*:Like Loading..."`

`(.|\s)*` means any character except a line break (`.`) or any whitespace character, including a line break (`\s`), repeated any number of times (`*`). The dots in `...` at the end are not escaped, so each of them matches any single character, which includes a literal dot.
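
The pattern can be tried on a short string before running it on a whole page. The sketch below uses an invented stand-in for the end of a transcript (the real trailing text on the site may look different) and writes the pattern as a raw string with the final dots escaped. It also shows a variant with the `re.DOTALL` flag, which makes `.` match line breaks too, so the `(.|\s)` group is no longer needed.

```python
import re

# Made-up stand-in for the tail of a scraped transcript
sample = (
    "Leonard: Excuse me?\n"
    "Receptionist: Hang on.\n"
    "__ATA.cmd.push(x);\n"
    "Share this:Like Loading..."
)

# Pattern from the text, as a raw string and with the final dots escaped
cleaned = re.sub(r"__ATA(.|\s)*:Like Loading\.\.\.", "", sample)

# Same idea with re.DOTALL: "." now also matches line breaks
cleaned_dotall = re.sub(r"__ATA.*:Like Loading\.\.\.", "", sample, flags=re.DOTALL)

print(repr(cleaned))
print(cleaned == cleaned_dotall)
```

Both calls remove everything from `__ATA` to the end of the "Like Loading..." marker and leave the dialogue lines untouched.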

In [12]:
# Importing regular expressions
import re
 
# Deleting the technical text
# (a raw string, r"...", passes \s to the regex engine unchanged)
lines = re.sub(r"__ATA(.|\s)*:Like Loading...", "", text)

# Splitting the whole text into lines
lines = lines.split('\n')
lines

# Removing empty lines
lines = [x for x in lines if x != ""]
lines

['Scene: A corridor at a sperm bank.',
 'Sheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
 'Leonard: Agreed, what’s your point?',
 'Sheldon: There’s no point, I just think it’s a good idea for a tee-shirt. ',
 'Leonard: Excuse me?',
 'Receptionist: Hang on. ',
 'Leonard: One across is Aegean, eight down is Nabakov, twenty-six across is MCM, fourteen down is… move your finger… phylum, which makes fourteen across Port-au-Prince. See, Papa Doc’s capital idea, that’s Port-au-Prince. Haiti. ',
 'Receptionist: Can I help you?',
 'Leonard: Yes. Um, is this the High IQ sperm bank?',
 'Receptionist: If you have to ask, maybe you shouldn’t be here.',
 'Sheldon: I think this is the place.',
 'Receptionist: Fill these out.',
 'Leonard: Thank-you. We’ll be righ

In [13]:
# Let's now save the result into .txt
with open("s1e1.txt", "w", encoding="utf-8") as text_file:
    for i in lines:
        text_file.write(i + '\n') # Add a newline character so that each line of dialogue ends up on its own line
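
A quick way to check that a file was saved correctly is to read it back and compare it with the list. The sketch below does this in a temporary folder so that it leaves no files behind. `sample_lines` is a made-up stand-in for the real list, and `pathlib` is used here only as one convenient option.

```python
import tempfile
from pathlib import Path

# Stand-in for the cleaned list of transcript lines
sample_lines = ["Leonard: Excuse me?", "Receptionist: Hang on."]

# Write into a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "s1e1.txt"
    path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")

    # Read the file back and split it into lines again
    restored = path.read_text(encoding="utf-8").splitlines()

print(restored == sample_lines)
```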

We downloaded one transcript, but we need all of them. To do this, we need to generate or collect the links to all the pages. There is no universal method here. Sometimes links look like "https://myblog.com/?skip=0", "https://myblog.com/?skip=10", "https://myblog.com/?skip=20", etc. In that case, we can generate the links by changing only the number at the end. In our case, the links look different: they include the name of an episode, so we cannot generate them directly. But each transcript page has a "Pages" section on the left with a list of all the episode names and links to them. We can collect these links and use them to request the pages.
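
For the first kind of site, generating the links takes one line. This is a sketch for the hypothetical `myblog.com` address mentioned above, assuming ten posts per page and three pages in total.

```python
# Hypothetical blog address from the example above; not a real endpoint
base_url = "https://myblog.com/?skip={}"

# Assumption: 10 posts per page and three pages, so skip = 0, 10, 20
page_urls = [base_url.format(skip) for skip in range(0, 30, 10)]
print(page_urls)
```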

We need the `<a>` tag, which is used for storing links. The link destination (URL) itself is stored in the "href" attribute (you can find this in any HTML tutorial, for example, here: http://htmlbook.ru/).

In [16]:
links_with_text = []

links = soup.find_all('a', href=True)
for link in links:
    # Keep only the links that have displayed text (as in the Pages section), because a page can contain many other links
    if link.text: 
        links_with_text.append(link['href'])
            

links_with_text

['https://bigbangtrans.wordpress.com/',
 'https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/?share=twitter',
 'https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/?share=facebook',
 'https://bigbangtrans.wordpress.com/about/',
 'https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/',
 'https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/',
 'https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/',
 'https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/',
 'https://bigbangtrans.wordpress.com/series-1-episode-5-the-hamburger-postulate/',
 'https://bigbangtrans.wordpress.com/series-1-episode-6-the-middle-earth-paradigm/',
 'https://bigbangtrans.wordpress.com/series-1-episode-7-the-dumpling-paradox/',
 'https://bigbangtrans.wordpress.com/series-1-episode-8-the-grasshopper-experiment/',
 'https://bigbangtrans.wordpress.com/series-1-episode-9-the-cooper-hofstadter-po

In [20]:
# Removing the links that we don't need
links_with_text = [link for link in links_with_text if "https://bigbangtrans.wordpress.com/series" in link and "share" not in link]
links_with_text

['https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/',
 'https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/',
 'https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/',
 'https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/',
 'https://bigbangtrans.wordpress.com/series-1-episode-5-the-hamburger-postulate/',
 'https://bigbangtrans.wordpress.com/series-1-episode-6-the-middle-earth-paradigm/',
 'https://bigbangtrans.wordpress.com/series-1-episode-7-the-dumpling-paradox/',
 'https://bigbangtrans.wordpress.com/series-1-episode-8-the-grasshopper-experiment/',
 'https://bigbangtrans.wordpress.com/series-1-episode-9-the-cooper-hofstadter-polarization/',
 'https://bigbangtrans.wordpress.com/series-1-episode-10-the-loobenfeld-decay/',
 'https://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/',
 'https://bigbangtrans.wordpress.com/series-1-episode-12-the-jerusalem-duality/

In [21]:
# Extract the part of each link that holds the season and episode numbers and the episode name; the files will be saved under these names

titles = []

for link in links_with_text:
    title = link.replace('https://bigbangtrans.wordpress.com/', "")
    title = title.rstrip("/")
    titles.append(title)
titles

['series-1-episode-1-pilot-episode',
 'series-1-episode-2-the-big-bran-hypothesis',
 'series-1-episode-3-the-fuzzy-boots-corollary',
 'series-1-episode-4-the-luminous-fish-effect',
 'series-1-episode-5-the-hamburger-postulate',
 'series-1-episode-6-the-middle-earth-paradigm',
 'series-1-episode-7-the-dumpling-paradox',
 'series-1-episode-8-the-grasshopper-experiment',
 'series-1-episode-9-the-cooper-hofstadter-polarization',
 'series-1-episode-10-the-loobenfeld-decay',
 'series-1-episode-11-the-pancake-batter-anomaly',
 'series-1-episode-12-the-jerusalem-duality',
 'series-1-episode-13-the-bat-jar-conjecture',
 'series-1-episode-14-the-nerdvana-annihilation',
 'series-1-episode-15-the-porkchop-indeterminacy',
 'series-1-episode-16-the-peanut-reaction',
 'series-1-episode-17-the-tangerine-factor',
 'series-2-episode-01-the-bad-fish-paradigm',
 'series-2-episode-02-the-codpiece-topology',
 'series-2-episode-03-the-barbarian-sublimation',
 'series-2-episode-04-the-griffin-equivalency',
 '
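
The same file names can also be obtained without hard-coding the site address. This sketch uses `urlsplit` from the standard `urllib.parse` module as an alternative to the string replacement above. It is shown for a single link from the list.

```python
from urllib.parse import urlsplit

link = "https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/"

# urlsplit breaks the address into parts; .path is everything after the domain
title = urlsplit(link).path.strip("/")
print(title)
```

This keeps working if the domain changes, since only the path part of the address is used.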

In [19]:
# Now everything is ready to scrape the pages

for i, url in enumerate(links_with_text):
    print(url) # print the link so that we can see where the problem is if something doesn't work
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
    soup = BeautifulSoup(response.text, 'html.parser')
    
    text = soup.find_all('div', {'class': 'entrytext'})[0].get_text()

    lines = re.sub(r"__ATA(.|\s)*:Like Loading...", "", text)
    lines = lines.split("\n")
    lines = [x for x in lines if x != ""]
    
    # save to a txt file
    with open(titles[i] + ".txt", "w", encoding="utf-8") as text_file:
        for line in lines:  # "line" rather than "i", so that the outer loop index is not overwritten
            text_file.write(line + '\n')
    
  

https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/
https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/
https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/
https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/
https://bigbangtrans.wordpress.com/series-1-episode-5-the-hamburger-postulate/
https://bigbangtrans.wordpress.com/series-1-episode-6-the-middle-earth-paradigm/
https://bigbangtrans.wordpress.com/series-1-episode-7-the-dumpling-paradox/
https://bigbangtrans.wordpress.com/series-1-episode-8-the-grasshopper-experiment/
https://bigbangtrans.wordpress.com/series-1-episode-9-the-cooper-hofstadter-polarization/
https://bigbangtrans.wordpress.com/series-1-episode-10-the-loobenfeld-decay/
https://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/
https://bigbangtrans.wordpress.com/series-1-episode-12-the-jerusalem-duality/
https://bigbangtrans.wordpress.com/series-1-e

https://bigbangtrans.wordpress.com/series-5-episode-14-the-beta-test-initiation/
https://bigbangtrans.wordpress.com/series-5-episode-15-the-friendship-contraction/
https://bigbangtrans.wordpress.com/series-5-episode-16-the-vacation-solution/
https://bigbangtrans.wordpress.com/series-5-episode-17-the-rothman-disintegration/
https://bigbangtrans.wordpress.com/series-5-episode-18-the-werewolf-transformation/
https://bigbangtrans.wordpress.com/series-5-episode-19-the-weekend-vortex/
https://bigbangtrans.wordpress.com/series-5-episode-20-the-transporter-malfunction/
https://bigbangtrans.wordpress.com/series-5-episode-21-the-hawking-excitation/
https://bigbangtrans.wordpress.com/series-5-episode-22-the-stag-convergence/
https://bigbangtrans.wordpress.com/series-5-episode-23-the-launch-acceleration/
https://bigbangtrans.wordpress.com/series-5-episode-24-the-countdown-reflection/
https://bigbangtrans.wordpress.com/series-6-episode-01-the-date-night-variable/
https://bigbangtrans.wordpress.com/

https://bigbangtrans.wordpress.com/series-9-episode-18-the-application-deterioration/
https://bigbangtrans.wordpress.com/series-9-episode-19-the-solder-excursion-diversion/
https://bigbangtrans.wordpress.com/series-9-episode-20-the-big-bear-precipitation/
https://bigbangtrans.wordpress.com/series-9-episode-21-the-viewing-party-combustion/
https://bigbangtrans.wordpress.com/series-9-episode-22-the-fermentation-bifurcation/
https://bigbangtrans.wordpress.com/series-9-episode-23-the-line-substitution-solution/
https://bigbangtrans.wordpress.com/series-9-episode-24-the-convergence-convergence/
https://bigbangtrans.wordpress.com/series-10-episode-01-the-conjugal-conjecture/
https://bigbangtrans.wordpress.com/series-10-episode-02-the-military-miniturization/
https://bigbangtrans.wordpress.com/series-10-episode-03-the-dependence-transcendence/
https://bigbangtrans.wordpress.com/series-10-episode-04-the-cohabitation-experimentation/
https://bigbangtrans.wordpress.com/series-10-episode-05-the-h

Done!
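
When many pages are requested in a row, it can help to pause between requests and to keep going when a single page fails. The sketch below shows one possible structure. The names `scrape_all`, `fake_fetch` and `delay_seconds` are invented for this example, and the fetching step is passed in as a function so that the sketch can run without network access.

```python
import time


def scrape_all(urls, fetch, delay_seconds=1.0):
    """Fetch every URL, pausing between requests and recording failures.

    `fetch` is any callable that takes a URL and returns the page text.
    """
    pages, failed = {}, []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)   # pause so the server is not flooded
        try:
            pages[url] = fetch(url)
        except Exception as error:      # keep going; report the problem later
            failed.append((url, str(error)))
    return pages, failed


# Fake fetcher used only for this demonstration (no network access)
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("page not available")
    return "<html>" + url + "</html>"


pages, failed = scrape_all(
    ["https://example.com/a/", "https://example.com/broken/"],
    fake_fetch,
    delay_seconds=0,
)
print(len(pages), failed)
```

With the requests package, `fetch` could be a small function that calls `requests.get(url, headers=...)`, checks the result with `response.raise_for_status()`, and returns `response.text`.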