<a href="https://colab.research.google.com/github/gurpreet-vilkhoo/Web-scrapping-on-Wikipedia-Data-Engineering-/blob/main/Data_Engineering_Summer_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Engineering


## Retrieving data from the web

### requests

The first task is to retrieve some data from the Internet. Python has many built-in libraries that were developed over the years to do exactly that (e.g. urllib, urllib2, urllib3).

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.

Luckly, as with most tasks in Python, someone has developed a library that simplifies these tasks. In reality, the requests made both on this assignment are fairly simple, and could easily be done using one of the built-in libraries. However, it is better to get acquainted to `requests` as soon as possible, since you will probably need it in the future.

In [1]:
# You tell Python that you want to use a library with the import statement.
import requests

Now that the requests library was imported into our namespace, we can use the functions offered by it.

In this case we'll use the appropriately named `get` function to issue a *GET* request. This is equivalent to typing a URL into your browser and hitting enter.

In [2]:
# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

Another very nifty Python function is `dir`. You can use it to list all the properties of an object.


Right now `req` holds a reference to a *Request* object; but we are interested in the text associated with the web page, not the object itself.

So the next step is to assign the value of the `text` property of this `Request` object to a variable.

In [3]:
page = req.text
page

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Harvard University - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YE-0BnLhdeoEXKrWcgqVygAAAE4","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1012281595,"wgRevisionId":1012281595,"wgArticleId":18426501,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","Articles with short description","Short description matches 

Great! Now we have the text of the HU Wikipedia page. But this mess of HTML tags would be a pain to parse manually. Which is why we will use another very cool Python library called BeautifulSoup.

### BeautifulSoup

Parsing data would be a breeze if we could always use well formatted data sources, such as CSV, JSON, or XML; but some formats such as HTML are at the same time a very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as open tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.


# 1) Import BeautifulSoup

In [4]:
from bs4 import BeautifulSoup

BeautifulSoup can deal with HTML or XML data, so the next line parser the contents of the `page` variable using its HTML parser, and assigns the result of that to the `soup` variable.

# 2) Create a Soup variable to store the parsed contents of the page

In [5]:
soup = BeautifulSoup(req.content, 'html.parser')

Let's check the string representation of the `soup` object.

In [6]:
soup.title.text

'Harvard University - Wikipedia'

Doesn't look much different from the `page` object representation. Let's make sure the two are different types.

In [7]:
type(page)

str

In [8]:
type(soup)

bs4.BeautifulSoup

Looks like they are indeed different.

# 3) Display the title of the webpage

### Expected Output
```
<title>Harvard University - Wikipedia, the free encyclopedia</title>
```

In [9]:
print (soup.title)

<title>Harvard University - Wikipedia</title>


This is nice for HTML elements that only appear once per page, such the the `title` tag. But what about elements that can appear multiple times?

# 4) Display all p tags from the webpage

#### You may use find_all method!

In [10]:
soup.find_all('p')


[<p class="mw-empty-elt">
 </p>,
 <p><b>Harvard University</b> is a <a href="/wiki/Private_university" title="Private university">private</a> <a href="/wiki/Ivy_League" title="Ivy League">Ivy League</a> <a href="/wiki/Research_university" title="Research university">research university</a> in <a href="/wiki/Cambridge,_Massachusetts" title="Cambridge, Massachusetts">Cambridge, Massachusetts</a>. Established in 1636 and named for its first benefactor, clergyman <a href="/wiki/John_Harvard_(clergyman)" title="John Harvard (clergyman)">John Harvard</a>, Harvard is the <a href="/wiki/Colonial_colleges" title="Colonial colleges">oldest institution of higher learning in the United States</a><sup class="reference" id="cite_ref-6"><a href="#cite_note-6">[6]</a></sup>
 and among the most prestigious in the world.<sup class="reference" id="cite_ref-7"><a href="#cite_note-7">[7]</a></sup>
 </p>,
 <p>The Massachusetts colonial legislature, the <a href="/wiki/Massachusetts_General_Court" title="Mass

# 5) How may p tags are present?

In [11]:
soup_s=str(soup)


---

If you look at the Wikipedia page on your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the HTML attributes that will be very useful to us is the "class" attribute.

Getting the class of a single element is easy...

In [12]:
soup.table['class']

['infobox', 'vcard']

---

### List Comprehensions

Next we will use a list comprehension to see all the tables that have a "class" attributes. List comprehensions are a very cool Python feature that allows for a loop iteration and a list creation in a single line.



# 6) Create a nested list containing classes of all the table tags

In [13]:
tablenest=[classes.get('class') for classes in soup.find_all('table')  ]
tablenest


[['infobox', 'vcard'],
 ['toccolours'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable'],
 ['box-Cleanup_gallery', 'plainlinks', 'metadata', 'ambox', 'ambox-style'],
 ['metadata', 'mbox-small'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', '

# 7) Check the classes and find the Demographics Table
#### Use find method to find the table using the correct class , convert it into string format and store it in table_html also stored the original form in html_soup

In [14]:
table_html=(soup.find_all('table',{'class':"wikitable"}))
table_html= table_html[2]
table_html=str(table_html)

In [15]:
table_html

'<table class="wikitable" style="text-align:center; float:right; font-size:85%; margin-right:2em;">\n<caption><i>Student demographics (Fall 2019)</i><sup class="reference" id="cite_ref-104"><a href="#cite_note-104">[104]</a></sup>\n</caption>\n<tbody><tr>\n<th></th>\n<th>Undergrad</th>\n<th>Grad/prof\n</th></tr>\n<tr>\n<th>Asian\n</th>\n<td>21%</td>\n<td>13%\n</td></tr>\n<tr>\n<th>Black\n</th>\n<td>9%</td>\n<td>5%\n</td></tr>\n<tr>\n<th>Hispanic or Latino\n</th>\n<td>11%</td>\n<td>7%\n</td></tr>\n<tr>\n<th>White\n</th>\n<td>37%</td>\n<td>38%\n</td></tr>\n<tr>\n<th>Two or more races\n</th>\n<td>8%</td>\n<td>3%\n</td></tr>\n<tr>\n<th>International\n</th>\n<td>12%</td>\n<td>32%\n</td></tr></tbody></table>'

In [16]:
from IPython.core.display import HTML

HTML(table_html)

Unnamed: 0,Undergrad,Grad/prof
Asian,21%,13%
Black,9%,5%
Hispanic or Latino,11%,7%
White,37%,38%
Two or more races,8%,3%
International,12%,32%


# 8) Extract the rows from the Demographics table and store it in rows variable

In [17]:
table1=(soup.find_all('table',{'class':"wikitable"}))
table1= table1[2]
rows=table1.find_all('tr')
rows

[<tr>
 <th></th>
 <th>Undergrad</th>
 <th>Grad/prof
 </th></tr>, <tr>
 <th>Asian
 </th>
 <td>21%</td>
 <td>13%
 </td></tr>, <tr>
 <th>Black
 </th>
 <td>9%</td>
 <td>5%
 </td></tr>, <tr>
 <th>Hispanic or Latino
 </th>
 <td>11%</td>
 <td>7%
 </td></tr>, <tr>
 <th>White
 </th>
 <td>37%</td>
 <td>38%
 </td></tr>, <tr>
 <th>Two or more races
 </th>
 <td>8%</td>
 <td>3%
 </td></tr>, <tr>
 <th>International
 </th>
 <td>12%</td>
 <td>32%
 </td></tr>]

In [18]:
# Lambda expressions return the value of the expression inside it.
# In this case, it will return a string with new line characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")

# 8) Extract the columns from the Demographics table and store it in columns variable

In [19]:
columns=[i.text.strip()  for i in table1.find_all('th')]
columns=columns[1:4]
print(columns)

['Undergrad', 'Grad/prof', 'Asian']


# 9) Extract the indexes from the rows variable
### Store it in a variable named indexes

In [24]:
indexes=[i.text.strip() for i in table1.find_all('th')]
indexes=indexes[4:]


In [25]:
indexes


['Black', 'Hispanic or Latino', 'White', 'Two or more races', 'International']

In [26]:
# Here's the original HTML table.
HTML(table_html)

Unnamed: 0,Undergrad,Grad/prof
Asian,21%,13%
Black,9%,5%
Hispanic or Latino,11%,7%
White,37%,38%
Two or more races,8%,3%
International,12%,32%


Next we start by checking if the last character of the string (Python allows for negative indexes) is a percent sign. If that is true, then we convert the characters before the sign to integers. Lastly, if one of the prior checks fails, we return a value of None.

This is a very common pattern in Python, and it works for two reasons: Python's `and` and `or` are "short-circuit" operators. This means that if the first element of an `and` statement evaluates to False, the second one is never computed (which in this case would be a problem since we can't convert a non-digit string to an integer). The `or` statement works the other way: if the first element evaluates to True, the second is never computed.

The second reason this works is because these operators will return the value of the last expression that was evaluated, which is this case will be either the integer value or the value `None`.

One last thing to notice: Python slices are open on the upper bound. So the `[:-1]` construct will return all elements of the string, except for the last.

# 10) Convert the percentages to integers
### Store it in a variable named values

In [None]:
values=[i.text.strip('%\n') for i in table1.find_all('td')]

values = list(map(int, values[0:-1]))
values.append('NaN')
values

[21, 13, 5, 9, 5, 12, 11, 7, 16, 37, 38, 64, 8, 3, 9, 12, 32, 'NaN']

The problem with the list above is that the values lost their grouping.

The `zip` function is used to combine two sequences element wise. So `zip([1,2,3], [4,5,6])` would return `[(1, 4), (2, 5), (3, 6)]`.

This is the first time we see a container bounded by parenthesis. This is a tuple, which you can think of as an immutable list (meaning you can't add, remove, or change elements from it). Otherwise they work just like lists and can be indexed, sliced, etc.

In [None]:
stacked_values = list( zip(*[values[i::3] for i in range(len(columns))]))
stacked_values

[(21, 13, 5),
 (9, 5, 12),
 (11, 7, 16),
 (37, 38, 64),
 (8, 3, 9),
 (12, 32, 'NaN')]

In [None]:
# Here's the original HTML table.
HTML(table_html)

Unnamed: 0,Undergrad,Grad/prof,US census
Asian,21%,13%,5%
Black,9%,5%,12%
Hispanic or Latino,11%,7%,16%
White,37%,38%,64%
Two or more races,8%,3%,9%
International,12%,32%,


---

## Pandas data structures

### DataFrames

To recap, we now have three data structures holding our column names, our row (index) names, and our values grouped by index.

We will now load this data into a Pandas DataFrame. The loading process is pretty straightforward, and all we need to do is tell Pandas which container goes where.

In [None]:
import pandas as pd

# 11) Create the DataFrame
### Use stacked_values, columns and indexes to create the Demographics DataFrame
#### Name the DataFrame df

In [None]:
df=pd.DataFrame(stacked_values,index=indexes,columns=columns)
df.head(6)

Unnamed: 0,Undergrad,Grad/prof,US census
Asian,21,13,5.0
Black,9,5,12.0
Hispanic or Latino,11,7,16.0
White,37,38,64.0
Two or more races,8,3,9.0
International,12,32,


In [None]:
# Here's the original HTML table.
HTML(table_html)

Unnamed: 0,Undergrad,Grad/prof,US census
Asian,21%,13%,5%
Black,9%,5%,12%
Hispanic or Latino,11%,7%,16%
White,37%,38%,64%
Two or more races,8%,3%,9%
International,12%,32%,


---

### DataFrame cleanup

Our DataFrame looks nice; but does it have the right data types?

# 12) Display the datatypes of all the columns

In [None]:
df.corr()

Unnamed: 0,Undergrad,Grad/prof
Undergrad,1.0,0.730634
Grad/prof,0.730634,1.0


The `U.S Census` looks a little strange. It should have been evaluated as an integer, but instead it came in as a float. It probably has something to do with the `NaN` value...

In fact, missing values can mess up a lot of our calculations, and some function don't work at all when `NaN` are present. So we should probably clean this up.

One way to do that is by dropping the rows that have missing values:

# 13) Drop the row containing NaN value.
### After droping the row store it in df_clean_row

In [None]:
df_clean_row=df.mask(df.eq('NaN')).dropna()


# 13) Drop the column containing NaN value.
### After droping the row store it in df_clean_column

In [None]:
df_clean_column=df.mask(df.eq('NaN')).dropna(axis='columns')

We will take a less radical approach and replace the missing value with a zero. In this case this solution makes sense, since 0% value meaningful in this context. We will also transform all the values to integers at the same time.

# 13) Fill the NaN value with 0 
### After filling the NaN value with 0 store it in df_clean

In [None]:
df_clean=df.fillna(0, inplace=False)
df_clean

Unnamed: 0,Undergrad,Grad/prof,US census
Asian,21,13,5.0
Black,9,5,12.0
Hispanic or Latino,11,7,16.0
White,37,38,64.0
Two or more races,8,3,9.0
International,12,32,


In [None]:
df_clean.dtypes


Undergrad     int64
Grad/prof     int64
US census    object
dtype: object

Now our table looks good!


---

### NumPy

Pandas is awesome, but it is built on top of another library the we will use extensively during the course. NumPy implements new data types and vectorized functions.

In [None]:
import numpy as np

The `values` method of the DataFrame will return a two-dimensional `array` with the DataFrame values. The `array` is a NumPy structure that we will be using a lot during this class.

In [None]:
df_clean.values

array([[21, 13, 5],
       [9, 5, 12],
       [11, 7, 16],
       [37, 38, 64],
       [8, 3, 9],
       [12, 32, 'NaN']], dtype=object)

# 14) Find the mean for the column 'Undergrad' from the cleaned dataset

In [None]:
df.Undergrad.mean()

16.333333333333332

# 15) Find the standard deviation for all the columns of the cleaned dataset

In [None]:
df.std()

Undergrad    11.129540
Grad/prof    14.962175
dtype: float64