# Introduction to Web Scraping

This tutorial covers the basic concepts of web scraping. In all the previous tutorials, the datasets that you worked with were provided by us. Sometimes such datasets can be downloaded directly from a website, but some websites do not make it that easy for you! For instance, the Moons_and_planets.csv file in the first tutorial was parsed from this Wikipedia page https://en.wikipedia.org/wiki/List_of_natural_satellites, which offers no option to download the dataset directly :(
In such cases, we use web scraping: a technique for extracting data from websites and storing it in a file on your computer. <br>

First, let's go through what makes up a web page. When we visit a webpage, our browser sends a request to the web server called a **GET** request (because we are requesting the server to send us files). The server then sends back files that tell our browser what the website looks like. These files are of different types:

1. HTML :  Has the main contents of the page 
2. CSS : Used to "style" the webpage and make it look good
3. JS : Javascript files make the webpage more interactive
4. Images : Picture files (such as PNG or JPG) that are displayed on the webpage

For our purposes, we need only concern ourselves with HTML, but you are free to look up the others. 

## Basics of HTML

HTML is not a programming language; it is a <i>markup language</i>, which means it tells the browser how the content of the website is laid out. It lets you do things like make a new heading, start a new paragraph, make text bold, italicise text etc. HTML consists of elements called **tags**, which give the browser instructions like "the following text is meant to be bold" or "the following text is another paragraph". The most basic tag is the `<html>` tag. People who are already familiar with HTML can skip this section and move on to the actual web scraping part.

In [43]:
%%HTML
<html>

</html>

This tells the browser that everything inside these two tags is HTML code. Ignore the first line (`%%HTML`); it is a Jupyter cell magic and not part of an HTML file.

In [156]:
%%HTML
<!-- This is how we comment in HTML :D  -->

Given below are some common HTML tags.

- `<head>` tag : Contains information about the page, such as its title
- `<body>` tag : Contains the visible contents of the web page
- `<p>` tag : Starts a new paragraph
- `<a>` tag : Used to insert links in the webpage. The `href` attribute of this tag determines where the link goes.

Run the code cell to see how HTML formats the page using these tags!

In [4]:
%%HTML  
<html>   

    <head>                                    
        This is <i>really</i> neat!           <!-- i tag : italics -->
    </head>
    
    <body>                                    
    
        <p>
        <b>This is <i>really</i> neat!</b>
        </p>                                  <!-- b tag : bold -->
                
        <p>        
        <a href = "https://www.tech-iitb.org/krittika/">
        Krittika, the Astronomy Club of IIT Bombay </a>  
        </p>
    </body>
    
</html>

A tag is called a <b>child</b> tag when it is inside another tag. Similarly, the enclosing tag is called the <b>parent</b> tag. A tag is the <b>sibling</b> of another tag if it is enclosed inside the same parent tag. In the above example, the two p tags are the children of the body tag and are sibling tags, while the body tag is a parent tag for these two.<br>
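
The parent, child and sibling relationships can also be made visible from Python. The following is a minimal sketch, assuming Python's built-in `html.parser` module and a made-up HTML snippet; the class name `NestingPrinter` is hypothetical and exists only for this illustration. It indents each tag by how deeply it is nested, so children appear one level to the right of their parent and siblings line up with each other.

```python
from html.parser import HTMLParser


class NestingPrinter(HTMLParser):
    """Records every opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # A tag opened while another tag is still open is a child of that tag.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag takes us back up to its parent.
        # (Tags without a closing tag, such as <br>, are not handled here.)
        self.depth -= 1


# Made-up snippet with the same shape as the example above.
html = "<html><body><p><b>bold</b></p><p><a href='#'>link</a></p></body></html>"

printer = NestingPrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

In the printed outline, the two `p` entries sit at the same indentation (siblings) directly under `body` (their parent).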

Some other very common HTML tags are `<div>` (divides the webpage into different areas), `<table>` (creates a table) and `<form>` (creates an input form). Refer to [the MDN HTML element reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a more detailed discussion of HTML tags.<br>

One last important concept in HTML is the pair of `class` and `id` attributes. These are used to give HTML elements names, and they make web scraping much easier for us. A single element can have multiple classes, and a class can also be shared among multiple elements. However, an id is meant to be unique to one HTML element and should not be used more than once in a single webpage. 

In [157]:
%%HTML
<html>   

    <head>                                    
        This is <i>really</i> neat!           
    </head>
    
    <body>                                    
    
        <p class = "neat">                     <!-- This 'p' tag is part of 1 class-->
        <b>This is <i>really</i> neat!</b>
        </p>                                  
                
        <p class = "neat very-neat">            <!--This 'p' tag is part of 2 classes-->    
        <a href = "https://www.tech-iitb.org/krittika/" id="very-very-neat">
        Krittika, the Astronomy Club of IIT Bombay </a>  
        </p>
    </body>
    
</html>

As can be seen, adding classes and ids does not change the website's layout.

## Using the requests Library

To scrape a webpage, we first download the HTML contents of the page, and the **requests** library in Python lets us do that. There are different types of requests that we can send to a web server, but here we will be using the `GET` request. Let's scrape a very simple webpage we have created [here](https://fathimazarin.github.io/simple.html).

In [203]:
import requests
page = requests.get("https://fathimazarin.github.io/simple.html")
page

<Response [200]>

If you run the above code and get a response code of 200, it means that your page downloaded successfully, while codes starting with a 4 or a 5 indicate an error.
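
With requests, the numeric code is available as `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx codes. To show how the code ranges are read without needing a network connection, here is a minimal sketch that uses only Python's standard `http` module; the helper name `describe_status` is hypothetical and not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know about.
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"      # e.g. the page does not exist
    elif 500 <= code < 600:
        kind = "server error"      # e.g. the server is down
    else:
        kind = "other"
    return f"{code} {phrase}: {kind}"


for code in (200, 404, 503):
    print(describe_status(code))
```

In a scraping script, checking the status before parsing avoids silently parsing an error page.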

In [204]:
page.content

b'<html>   \n\n    <head>                                    \n        This is <i>really</i> neat!          \n    </head>\n    \n    <body>                                    \n    \n        <p class = "neat">\n        <b>This is <i>really</i> neat!</b>\n        </p>                                  \n                \n        <p class = "neat very-neat">        \n        <a href = "https://www.tech-iitb.org/krittika/" id="very-very-neat">\nKrittika, the Astronomy Club of IIT Bombay</a>  \n            \n        </p>\n    </body>\n    \n</html>\n'

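This command shows the raw HTML code of the page as a bytes object (note the leading `b`). The same content decoded into a regular string is available as `page.text`.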

## Using the BeautifulSoup library

BeautifulSoup is a Python library that lets us parse HTML and extract whatever text we want from it. This code loads the BeautifulSoup library and creates an instance of the BeautifulSoup class (we will be discussing classes in a later tutorial).

In [205]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

As you can see above, the HTML code was printed out in a very messy way when we used `page.content`. The following command can be used to format the code in a better way.

In [206]:
print(soup.prettify())

<html>
 <head>
  This is
  <i>
   really
  </i>
  neat!
 </head>
 <body>
  <p class="neat">
   <b>
    This is
    <i>
     really
    </i>
    neat!
   </b>
  </p>
  <p class="neat very-neat">
   <a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
    Krittika, the Astronomy Club of IIT Bombay
   </a>
  </p>
 </body>
</html>



Now, the information that we want to extract is most likely inside the body tag, in a <i>p</i> tag (a paragraph tag), a table tag or something similar. The BeautifulSoup library has functions that help you search for these tags directly in the HTML code.

In [217]:
soup.find('p') #What datatype does it return?

<p class="neat">
<b>This is <i>really</i> neat!</b>
</p>

You might have realized that the HTML code given above has multiple <i>p</i> tags, yet it returned only the first one. The soup.find_all() function returns a list of all occurrences of that particular tag.

In [163]:
soup.find_all('p') 

[<p class="neat">
 <b>This is <i>really</i> neat!</b>
 </p>,
 <p class="neat very-neat">
 <a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
             
         Krittika, the Astronomy Club of IIT Bombay </a>
 </p>]

But this still returns HTML code with pesky HTML tags that you don't want in your parsed file. The get_text() function will help you out here. You cannot use the get_text() function on the list returned by find_all(); you will have to access each element and call the function on it.

In [178]:
soup.find_all('p')[0].get_text()

'\nThis is really neat!\n'

In [214]:
soup.find_all('p')[1].get_text()

'\n\nKrittika, the Astronomy Club of IIT Bombay\n'
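
Since get_text() only works on individual tags, a loop or a list comprehension is a convenient way to apply it to every result of find_all(). The sketch below assumes an inline copy of the sample page's body (written on a single line, so there are no stray newlines) instead of downloading the page; the variable names are for illustration only.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample page's body, so no download is needed.
html = (
    '<body>'
    '<p class="neat"><b>This is <i>really</i> neat!</b></p>'
    '<p class="neat very-neat">'
    '<a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">'
    'Krittika, the Astronomy Club of IIT Bombay</a></p>'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

# Call get_text() on each tag in the list, one at a time.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)
```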

In [215]:
print(soup.find_all('p')[1].get_text())     #Why do you think there is a difference in output ?



Krittika, the Astronomy Club of IIT Bombay



In [216]:
krittika=soup.find_all('p')[1]
krittika

<p class="neat very-neat">
<a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
Krittika, the Astronomy Club of IIT Bombay</a>
</p>

In [213]:
krittika.find_all('a')  #The find_all() function can also be used to search for tags inside a parent tag

[<a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
 Krittika, the Astronomy Club of IIT Bombay</a>]

In [212]:
krittika.find_all('a')[0]

<a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
Krittika, the Astronomy Club of IIT Bombay</a>

In [211]:
krittika.find_all('a')[0].get_text()

'\nKrittika, the Astronomy Club of IIT Bombay'

You still would not want those \n cluttering up your file.

In [207]:
soup.find_all('p')[1].get_text().replace('\n','')  #What do you think this code does?

'Krittika, the Astronomy Club of IIT Bombay'
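
Note that replace('\n', '') removes every newline, including ones in the middle of the text, whereas strip() only trims whitespace at the two ends. The following is a small sketch on plain strings (the second string is made up for illustration) that shows the difference. BeautifulSoup's get_text() also accepts a `strip=True` argument for a similar purpose.

```python
raw = '\n\nKrittika, the Astronomy Club of IIT Bombay\n'

# Both approaches give the same result when newlines occur only at the ends.
removed = raw.replace('\n', '')
trimmed = raw.strip()
print(removed == trimmed)

# Made-up text with a newline in the middle.
multi = '\nline one\nline two\n'
print(repr(multi.replace('\n', '')))   # the two lines get glued together
print(repr(multi.strip()))             # the inner newline is kept
print(repr(' '.join(multi.split())))   # all whitespace collapsed to single spaces
```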

## Searching by class and id
Adding one argument to the find_all() function helps you search by class and id.

In [153]:
soup.find_all('p', attrs={"class":"neat"})

[<p class="neat">
 <b>This is <i>really</i> neat!</b>
 </p>,
 <p class="neat very-neat">
 <a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
         Krittika, the Astronomy Club of IIT Bombay </a>
 </p>]

In [154]:
soup.find_all('p',attrs={"class":"very-neat"})

[<p class="neat very-neat">
 <a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
         Krittika, the Astronomy Club of IIT Bombay </a>
 </p>]

In [155]:
soup.find_all(id="very-very-neat")

[<a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">
         Krittika, the Astronomy Club of IIT Bombay </a>]
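
BeautifulSoup offers a couple of other spellings for the same searches: the `class_` keyword (with a trailing underscore, because `class` is a reserved word in Python) and CSS selectors through select(). The sketch below tries them on an inline copy of the sample page's body rather than the downloaded page, and also reads the `href` attribute of the link.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample page's body, so no download is needed.
html = """
<body>
  <p class="neat"><b>This is <i>really</i> neat!</b></p>
  <p class="neat very-neat">
    <a href="https://www.tech-iitb.org/krittika/" id="very-very-neat">Krittika, the Astronomy Club of IIT Bombay</a>
  </p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

neat = soup.find_all("p", class_="neat")   # every <p> that has the class "neat"
very_neat = soup.select("p.very-neat")     # CSS selector: <p> with class "very-neat"
link = soup.find(id="very-very-neat")      # an id matches at most one element

print(len(neat), len(very_neat))
print(link["href"])                        # attributes are read like dictionary keys
print(link.get_text())
```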

## Moving on to actual webpages

In the above example, we parsed a relatively simple file with only a few lines of code, which is not the case for real-life websites (open any webpage in Google Chrome or Firefox, press Ctrl+U to view its source and check for yourself). It is hard to find the exact location of the paragraph or the tag that you want in such a large file. In such scenarios, the class and id attributes that we discussed above become useful. Web scraping is largely about finding the right tag to search for using the find() and find_all() functions. <br>

Let us try scraping a much longer webpage, say, a Wikipedia page [here](https://en.wikipedia.org/wiki/Lists_of_stars_by_constellation), to print the list of constellations. The constellations are listed as an unordered list, hence they can be found inside `<li></li>` tags (these tags are responsible for the bullets).

In [143]:
page = requests.get("https://en.wikipedia.org/wiki/Lists_of_stars_by_constellation")
soup = BeautifulSoup(page.content, 'html.parser')

In [144]:
soup.find_all('li')

[<li class="toclevel-1 tocsection-1"><a href="#Lists_of_stars_by_constellation"><span class="tocnumber">1</span> <span class="toctext">Lists of stars by constellation</span></a></li>,
 <li class="toclevel-1 tocsection-2"><a href="#Criteria_of_inclusion"><span class="tocnumber">2</span> <span class="toctext">Criteria of inclusion</span></a></li>,
 <li class="toclevel-1 tocsection-3"><a href="#See_also"><span class="tocnumber">3</span> <span class="toctext">See also</span></a></li>,
 <li class="toclevel-1 tocsection-4"><a href="#References"><span class="tocnumber">4</span> <span class="toctext">References</span></a></li>,
 <li class="toclevel-1 tocsection-5"><a href="#External_links"><span class="tocnumber">5</span> <span class="toctext">External links</span></a></li>,
 <li><a href="/wiki/List_of_stars_in_Andromeda" title="List of stars in Andromeda">Andromeda</a></li>,
 <li><a href="/wiki/List_of_stars_in_Antlia" title="List of stars in Antlia">Antlia</a></li>,
 <li><a href="/wiki/List_

You can see that the list of constellations that we want starts from the 6th element in the above list and there are 88 constellations in total. Let us try printing it out.

In [69]:
print(soup.find_all('li')[5].get_text())
print(soup.find_all('li')[6].get_text())
constellations=[]
for element in soup.find_all('li')[5:93]:    #How 93?
    constellations.append(element.get_text())
    
print(constellations)

Andromeda
Antlia
['Andromeda', 'Antlia', 'Apus', 'Aquarius', 'Aquila', 'Ara', 'Aries', 'Auriga', 'Boötes', 'Caelum', 'Camelopardalis', 'Cancer', 'Canes Venatici', 'Canis Major', 'Canis Minor', 'Capricornus', 'Carina', 'Cassiopeia', 'Centaurus', 'Cepheus', 'Cetus', 'Chamaeleon ', 'Circinus', 'Columba', 'Coma Berenices', 'Corona Australis', 'Corona Borealis', 'Corvus', 'Crater', 'Crux', 'Cygnus', 'Delphinus', 'Dorado', 'Draco', 'Equuleus', 'Eridanus', 'Fornax', 'Gemini', 'Grus', 'Hercules', 'Horologium', 'Hydra', 'Hydrus', 'Indus ', 'Lacerta', 'Leo', 'Leo Minor', 'Lepus', 'Libra', 'Lupus', 'Lynx', 'Lyra', 'Mensa', 'Microscopium', 'Monoceros', 'Musca', 'Norma', 'Octans', 'Ophiuchus', 'Orion', 'Pavo', 'Pegasus', 'Perseus', 'Phoenix', 'Pictor', 'Pisces ', 'Piscis Austrinus', 'Puppis', 'Pyxis', 'Reticulum', 'Sagitta', 'Sagittarius', 'Scorpius', 'Sculptor', 'Scutum', 'Serpens', 'Sextans', 'Taurus', 'Telescopium', 'Triangulum', 'Triangulum Australe', 'Tucana', 'Ursa Major', 'Ursa Minor', 'Vela
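
The end index 93 in the slice above comes from 5 + 88: the list of constellations starts at index 5 (the 6th element) and a slice stops just before its end index. Here is a small sketch of that arithmetic on a made-up list, so that it can be run without downloading anything.

```python
# Made-up stand-in for the list returned by soup.find_all('li').
items = [f"item {i}" for i in range(100)]

start = 5     # index of the 6th element
count = 88    # number of constellations

selected = items[start:start + count]   # same as items[5:93]

print(len(selected))    # number of elements in the slice
print(selected[0])      # first element taken
print(selected[-1])     # last element taken (index 92, since 93 is excluded)
```

Hard-coded indices like these depend on the current layout of the page, so it is worth printing the first and last elements, as done above, to check that the slice still lines up.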

## One final lesson in web scraping
Not all website owners appreciate a random person parsing through their website and collecting data. There are serious ethical concerns related to web scraping, and before doing it you should always check the website's terms of use and its robots.txt file to make sure that the owner is okay with it. In this tutorial, we have used only Wikipedia pages, whose content is freely licensed for reuse; even so, keep your requests few and far between so that you do not overload the server :)
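
Many websites publish a robots.txt file that states which paths automated programs may visit. As a minimal sketch, Python's built-in `urllib.robotparser` module can read such rules; the rules, the user agent name `my-scraper` and the example.com URLs below are all made up for illustration, and the rules are parsed from a string so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt contents: everyone is asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
allowed_public = parser.can_fetch("my-scraper", "https://example.com/public/index.html")

print(allowed_private)
print(allowed_public)
```

For a real website, set_url() followed by read() on the parser fetches the site's actual robots.txt instead of a string.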

If you have understood everything so far, you are good to go! You must have realized how tedious it is to open the website, see how it is formatted, check if there are any errors, find the tag which contains your data etc. So for today's assignment, you will be given the code to parse the data and format it. The corresponding functions will be explained and you can use them directly in your code. <br>
## Your assignment...
...should you choose to accept it, will be the following:
1. Parse [this](https://en.wikipedia.org/wiki/Lists_of_stars_by_constellation) webpage for the RA and Dec of the stars of each constellation, convert these coordinates to Cartesian coordinates, store them by constellation and plot them using matplotlib.
2. Try to recreate the 'Moons_and_planets.csv' file (used in the first tutorial) from [this](https://en.wikipedia.org/wiki/List_of_natural_satellites) webpage. You can take inspiration from how tables are scraped in the get_map() function for Task 1.

You can use the following code in Task 1. The data in https://en.wikipedia.org/wiki/Lists_of_stars_by_constellation is not well formatted, and these functions will help with that.
**The useful data to be extracted in Task 1 and Task 2 are stored in a table under the class 'wikitable sortable' and you can directly search by class for both tasks.**

In [89]:
import numpy as np
import matplotlib.pyplot as plt

In [90]:
def get_coords(ra_s, dec_s):
    h = float(ra_s[:2])
    m = float(ra_s[3:5])
    s = float(ra_s[6:-1])
    ra = h + m/60 + s/3600
    if dec_s[0] == '+':
        sign = 1
    else:
        sign = -1
    d = float(dec_s[1:3])
    m = float(dec_s[4:6])
    s = float(dec_s[7:-1])
    dec = sign*(d + m/60 + s/3600)
    return ra, dec

The get_coords() function converts the RA and Dec information of each star into numbers. Right Ascension is similar to longitude and is measured in hours, minutes and seconds, while Declination is similar to latitude and is measured in degrees, minutes and seconds. The scraped values arrive as strings; this function reads the individual fields at fixed character positions, converts them to floats, and returns the RA in decimal hours and the Declination in decimal degrees.
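
A worked example may make the conversion clearer. The sketch below is not the tutorial's get_coords(); it is a hypothetical, regex-based variant (`parse_ra_hours` and `parse_dec_degrees` are names made up for illustration) applied to made-up input strings in the format that get_coords() appears to expect after the non-breaking spaces are removed.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"   # an integer or a decimal number


def parse_ra_hours(ra_s):
    """Convert a string like '00h08m23.17s' to decimal hours."""
    h, m, s = (float(x) for x in re.findall(NUMBER, ra_s))
    return h + m / 60 + s / 3600


def parse_dec_degrees(dec_s):
    """Convert a string like '+29°05′27.0″' to decimal degrees."""
    # Same convention as get_coords(): anything not starting with '+' is negative.
    sign = 1 if dec_s.strip()[0] == "+" else -1
    d, m, s = (float(x) for x in re.findall(NUMBER, dec_s))
    return sign * (d + m / 60 + s / 3600)


ra = parse_ra_hours("00h08m23.17s")        # 0 + 8/60 + 23.17/3600 hours
dec = parse_dec_degrees("+29°05′27.0″")    # 29 + 5/60 + 27/3600 degrees
south = parse_dec_degrees("−05°30′00.0″")  # a minus sign gives a negative value

print(ra, dec, south)
print(ra * 15)   # RA in degrees: 24 hours correspond to 360 degrees
```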

In [150]:
def get_map(constellation, abs_mag = False, verb=False):
    url = f'https://en.wikipedia.org/wiki/List_of_stars_in_{constellation}' #page gets downloaded according to constellation
    r = requests.get(url)

    soup = BeautifulSoup(r.content, 'lxml')   #needs the lxml package; 'html.parser' can be used instead if lxml is not installed

    tab = soup.find_all('table', attrs={'class':'wikitable sortable'})[0]   #To extract information from a wikipedia table
                               
    data = [[]]
    for i in tab.find_all('tr'):   #searching in each row of the table ( 'tr' tag stands for row)
        row = []                    #declaring empty row
        for j in i.find_all('td'):  #'td' tag stands for a cell
            row.append(j.get_text())   #add the text contents of each cell to the row
        data.append(row)

    heads = []
    for i in tab.find_all('tr')[:1]:
        for j in i.find_all('th'):
            heads.append(j.get_text().strip('\n'))

    name_ind = heads.index('Name')
    ra_ind = heads.index('RA')
    dec_ind = heads.index('Dec')
    if abs_mag:
        mag_ind = heads.index('abs.mag.')
    else:
        mag_ind = heads.index('vis.mag.')

    name = []
    ra = []
    dec = []
    mag = []
    for i in data[2:-2]:
        name_string = i[name_ind]
        try:
            ra_string = i[ra_ind].replace('\xa0', '')
            dec_string = i[dec_ind].replace('\xa0', '')
            mag_string = i[mag_ind]
            if mag_string[0]=='−':
                mag_string = '-'+mag_string[1:]
        except:
            if verb:
                print(f"{name_string} has no data for coordinates")
            continue
        try:
            ra_i, dec_i = get_coords(ra_string, dec_string)
        except:
            if verb:
                print(f"{name_string} has coordinate format issues")
            continue
        try:
            mag.append(float(mag_string))
            name.append(name_string)
            ra.append(ra_i)
            dec.append(dec_i)
        except:
            if verb:
                print(f"{name_string} does not have magnitude data")
            continue

    name = np.array(name)
    ra = np.array(ra)
    dec = np.array(dec)
    mag = np.array(mag)
    return name, ra, dec, mag

The function get_map() returns the names of the stars in the constellation, their RA in hours, their Dec in degrees and their magnitudes, each as a NumPy array. You might have noticed the use of `try` and `except` in the above code. These statements are used to handle errors during execution. The interpreter will first try to execute the code inside `try`. If an error is raised during that execution (for instance, because of incorrect formatting), the code inside `except` will get executed instead.<br>
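
As a minimal sketch of the same idea on a much smaller scale, the helper below (the name `to_float` and the sample strings are made up for illustration) tries to convert a string to a number and falls back to `None` when the conversion fails. Unlike the bare `except` used in get_map(), it names the specific error it expects, which avoids hiding unrelated bugs.

```python
def to_float(text):
    """Return text converted to a float, or None if it is not a number."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError for strings such as '' or 'n/a'.
        return None


values = [to_float(s) for s in ["4.36", "", "n/a", "-1.46"]]
print(values)
```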

For the next step, you will need to write a function that takes in the celestial coordinates (RA, Dec) and returns their projection onto a Cartesian plane. This is called a **[Stereographic projection](https://en.wikipedia.org/wiki/Stereographic_projection)**, where points on a sphere are projected onto a plane. A hint to approach this would be to first convert RA, Dec to spherical coordinates on a unit sphere and then apply the stereographic projection formulae. <br>

To plot the final figure, write a function plot() that takes in a constellation name and plots it. The size of the star must be proportional to its brightness or flux. In tutorial 2, there was a discussion on Magnitudes in Astronomy, which might prove useful here. Do normalize the brightness values before using them.
Do not forget to make the background dark!