# Wikipedia Dataset

---

## Introduction

In this Jupyter Notebook, we build a Wikipedia dataset from scratch to pre-train a machine learning model that recognizes the subject of a question (e.g. mathematics, physics).

To build the dataset, we use the [Outline of Academic Disciplines](https://wikipedia.org/wiki/Outline_of_academic_disciplines) to classify Wikipedia articles so that they map onto the tags of 100mentors.

We do this by selecting the article links listed under each relevant subject. In other words, we select the "sub-domains" of the academic disciplines as presented in the outline.

We then visit each selected article, collect its introduction text and the links in it, and repeat the process on those links until the desired depth is reached.

We therefore refer to an article as a "node" and to the articles it links to as its "children nodes", as if we were traversing a tree. The further we traverse the tree, the weaker the correlation between an article and the academic discipline it was reached from. This is not a concern here, because we retrieve only the direct "descendants" of each starting "node".
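
The depth-limited traversal can be sketched on a toy link graph. The graph `toy_links` and the `crawl` helper are hypothetical and only illustrate how a maximum depth of 1 restricts the visit to a node and its direct descendants; they are not the crawler used in the Build section.

```python
# Toy link graph: each key is an article, each value the articles it links to.
# All names are hypothetical stand-ins for real Wikipedia articles.
toy_links = {
    "Physics": ["Mechanics", "Optics"],
    "Mechanics": ["Statics"],
    "Optics": [],
    "Statics": [],
}

def crawl(node, max_depth, depth=0, visited=None):
    """Collect `node` and the nodes visited within `max_depth` links of it."""
    if visited is None:
        visited = []
    # stop when the depth limit is exceeded or the node was already seen
    if depth > max_depth or node in visited:
        return visited
    visited.append(node)
    for child in toy_links.get(node, []):
        crawl(child, max_depth, depth + 1, visited)
    return visited

reachable = crawl("Physics", max_depth=1)
print(reachable)
```

With `max_depth=1`, "Statics" is never visited because it sits two links away from the starting node.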

## Summary

The dataset contains **12872** reports.

Each report contains:
- **tag**: the **tag\*** that matches the article
- **title**: the **title** of the article
- **text**: the **text\*\*** of the article

The dataset covers **21** of **23 tags**.

**"Career"** and **"Foreign Languages"** do **NOT** match with any domain. 

\* **tag** refers to the **100mentors tags**<br>
\*\* **text** refers to the **introduction text** of an article

#### Abbreviations

- **WArt** (Wikipedia Article)
- **AD** (Academic Discipline)

## Build

---

First, we import the packages required to build the Wikipedia dataset.

In [1]:
import re
import sys
import traceback
import time

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

import pandas as pd
import numpy as np

# set tqdm's bar format
bar_format = '{l_bar}{bar:50}{r_bar}{bar:-10b}'
# set pandas' column width
pd.set_option('max_colwidth', None)

PATH = "data/intros/"

We manually select the ADs from the "Outline of Academic Disciplines" article that are most relevant to the tags of 100mentors.

In [2]:
ads = [
    # Humanities
    { 'el': 'h3', 'attr': 'id',    'ad': 'Visual arts',              'tag': 'Visual Arts'                         },
    { 'el': 'h3', 'attr': 'id',    'ad': 'History',                  'tag': 'History'                             },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Languages and literature', 'tag': 'Language and Literature'             },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Philosophy',               'tag': 'Philosophy'                          },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Theology',                 'tag': 'Religion'                            },
    { 'el': 'li', 'attr': 'title', 'ad': 'Film',                     'tag': 'Film'                                },
    { 'el': 'li', 'attr': 'title', 'ad': 'Dance',                    'tag': 'Dance'                               },
    { 'el': 'li', 'attr': 'title', 'ad': 'Theatre',                  'tag': 'Theatre'                             },
    { 'el': 'li', 'attr': 'title', 'ad': 'Music',                    'tag': 'Music'                               },
    # Social Sciences
    { 'el': 'h3', 'attr': 'id',    'ad': 'Anthropology',             'tag': 'Anthropology'                        },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Economics',                'tag': 'Economics'                           },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Geography',                'tag': 'Geography'                           },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Political science',        'tag': 'Global Politics'                     },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Psychology',               'tag': 'Psychology'                          },
    # Natural Sciences
    { 'el': 'h3', 'attr': 'id',    'ad': 'Biology',                  'tag': 'Biology'                             },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Chemistry',                'tag': 'Chemistry'                           },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Physics',                  'tag': 'Physics'                             },
    # Formal Sciences
    { 'el': 'h3', 'attr': 'id',    'ad': 'Computer Science',         'tag': 'Computer Science'                    },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Mathematics',              'tag': 'Mathematics'                         },
    # Applied Sciences
    { 'el': 'h3', 'attr': 'id',    'ad': 'Business',                 'tag': 'Business Management'                 },
    { 'el': 'h3', 'attr': 'id',    'ad': 'Medicine and health',      'tag': 'Sports, Exercise and Health Science' },
    { 'el': 'li', 'attr': 'title', 'ad': 'Physical fitness',         'tag': 'Sports, Exercise and Health Science' },
]

There are 22 ADs in total, because the "Sports, Exercise and Health Science" tag is built from 2 ADs.

Next, we implement a function that fetches a WArt.

In [3]:
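def get_wart(url):
    # fetch a WArt by its relative url and return the parsed page (None on failure)
    try:
        req = requests.get("https://en.wikipedia.org" + url)
    except requests.exceptions.RequestException as reqex:
        # report the failing url and signal the failure explicitly
        print(f"Unable to get {url}: {reqex}")
        return None
    else:
        html = req.content
        soup = BeautifulSoup(html, 'html.parser')
        return soup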

We also implement a function that collects the links in an element, optionally keeping only the "validated" ones.

In [4]:
def get_links(element, validate=None, *args, **kwargs):
    links = []
    for a in element.find_all('a'):
        link = a.attrs.get('href', False)
        if not link:
            continue

        if validate:
            # keep only the links accepted (and possibly normalized) by the validator
            validated_link = validate(link, *args, **kwargs)
            if validated_link:
                links.append(validated_link)
            continue

        links.append(link)

    return links

As the validator, we implement a function that checks whether a link points to a WArt.

In [5]:
def validate_wart_link(link):
    # wart's link pattern
    wart_link_pattern = re.compile(r'^(\/wiki\/)[^:]*$')
    
    if not link\
       or not bool(re.match(wart_link_pattern, link))\
       or 'outline' in link.lower(): # 'outline' not in wart link

        return False
    
    hashtag_index = link.find('#')
    if hashtag_index != -1: # remove hashtag from wart link
        return link[:hashtag_index]
    
    return link
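
The behaviour of the validator is easiest to see on a few sample links. The snippet below is a minimal sketch: it repeats the same rule as `validate_wart_link` in compact form so that it runs on its own, and the sample links are hypothetical.

```python
import re

# Same rule as validate_wart_link above, repeated so the snippet is self-contained.
WART_LINK_PATTERN = re.compile(r'^(\/wiki\/)[^:]*$')

def validate_wart_link(link):
    if not link or not WART_LINK_PATTERN.match(link) or 'outline' in link.lower():
        return False
    # drop the fragment part, if any
    return link.split('#', 1)[0]

samples = [
    '/wiki/Physics',                     # plain article link: kept
    '/wiki/Physics#History',             # fragment is stripped
    '/wiki/File:Example.jpg',            # namespaced page (contains ':'): rejected
    '/wiki/Outline_of_physics',          # outline page: rejected
    'https://example.org/wiki/Physics',  # not a relative /wiki/ link: rejected
]
results = [validate_wart_link(sample) for sample in samples]
for sample, result in zip(samples, results):
    print(f'{sample!r:40} -> {result!r}')
```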

We can now collect the "starting nodes" using the list of ADs we created earlier.

In [6]:
wart = get_wart('/wiki/Outline_of_academic_disciplines')

starting_warts = []
for ad in ads:
    # find ad's element
    ad_element = wart.find(lambda el: ad['el'] == el.name and ad['ad'] in el.text)

    # find links of ad element
    wart_links = get_links(ad_element, validate_wart_link)
    starting_warts.extend([{'tag': ad['tag'], 'url': wart_link} for wart_link in wart_links])

    # if ad not heading element, continue
    if not bool(re.match(r'h.', ad['el'])):
        continue
    
    # find sibling element
    sibling_element = ad_element\
        .find_next_sibling( lambda tag: (tag.name == 'div' and 'div-col' in tag.attrs.get('class', []))
                            or (tag.name == 'ul') ) # .get avoids a KeyError on divs without a class

    # find links of sibling element
    while (sibling_element and not bool(re.match(r'h.', sibling_element.name))):
        current_wart_links = get_links(sibling_element, validate_wart_link)
        starting_warts.extend([{'tag': ad['tag'], 'url': current_wart_link} for current_wart_link in current_wart_links])
        sibling_element = sibling_element\
            .find_next_sibling( lambda el: re.match(r'h.', el.name)
                                or ( el.name == 'div' and 'div-col' in el.attrs.get('class', []))
                                or (el.name == 'ul') )

print('length:', len(starting_warts))
[print(starting_wart) for starting_wart in starting_warts]; # semicolon to hide the cell's return value

length: 949
{'tag': 'Visual Arts', 'url': '/wiki/Visual_arts'}
{'tag': 'Visual Arts', 'url': '/wiki/Fine_arts'}
{'tag': 'Visual Arts', 'url': '/wiki/Graphic_arts'}
{'tag': 'Visual Arts', 'url': '/wiki/Drawing'}
{'tag': 'Visual Arts', 'url': '/wiki/Painting'}
{'tag': 'Visual Arts', 'url': '/wiki/Photography'}
{'tag': 'Visual Arts', 'url': '/wiki/Sculpture'}
{'tag': 'Visual Arts', 'url': '/wiki/Applied_arts'}
{'tag': 'Visual Arts', 'url': '/wiki/Animation'}
{'tag': 'Visual Arts', 'url': '/wiki/Calligraphy'}
{'tag': 'Visual Arts', 'url': '/wiki/Decorative_arts'}
{'tag': 'Visual Arts', 'url': '/wiki/Mixed_media'}
{'tag': 'Visual Arts', 'url': '/wiki/Printmaking'}
{'tag': 'Visual Arts', 'url': '/wiki/Studio_art'}
{'tag': 'Visual Arts', 'url': '/wiki/Architecture'}
{'tag': 'Visual Arts', 'url': '/wiki/Interior_architecture'}
{'tag': 'Visual Arts', 'url': '/wiki/Landscape_architecture'}
{'tag': 'Visual Arts', 'url': '/wiki/Landscape_design'}
{'tag': 'Visual Arts', 'url': '/wiki/Landscape_plan

Lastly, we implement a function that recursively visits a "node" and its "children nodes", retrieving the introduction of each WArt.

The links to other WArts found in the introduction are the ones we follow.

In [7]:
def get_data(dataset, visited, current_wart, max_depth, depth=0):
    # if maximum depth exceeded, return
    if depth > max_depth:
        return
    
    # add wart in visited
    visited.add(str(current_wart))

    wart = get_wart(current_wart['url'])
    # skip warts that could not be fetched
    if wart is None:
        return

    wart_header = wart.find('h1', {'id': 'firstHeading'})
    wart_inner_content = wart\
        .find('div', {'id': 'mw-content-text'})\
        .find('div', {'class': 'mw-parser-output'})

    # find wart's introduction
    p_list = []
    sibling_element = wart_inner_content.find('p', recursive=False) # strict mode, only on the current DOM depth
    while (sibling_element and not bool(re.match(r'h.', sibling_element.name))):
        p_list.append(sibling_element)
        sibling_element = sibling_element.find_next_sibling(lambda el: el.name == 'p' or bool(re.match(r'h.', el.name)))

    # get wart's introduction links
    for p in p_list:
        for current_wart_link in get_links(p, validate_wart_link):
            temp_wart = {'tag': current_wart['tag'], 'url': current_wart_link}

            # if already visited under the same tag, continue
            if str(temp_wart) in visited:
                continue
            
            # visit wart's introduction links
            get_data(dataset, visited, temp_wart, max_depth, depth+1)
    
    # add wart introduction in dataset 
    text = ' '.join([p.text.strip() for p in p_list])
    dataset.append({'tag': current_wart['tag'], 'title': wart_header.get_text(), 'text': text})
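
The stopping rule used for the introduction (collect top-level paragraphs until the first heading) can be sketched without any HTML parsing. The snippet below is a simplified illustration: the flat list of `(element name, text)` pairs and the `intro_paragraphs` helper are hypothetical stand-ins for the parsed page and for the sibling walk in `get_data`.

```python
import re

# Matches heading element names h1..h6 (an assumption for this sketch).
HEADING = re.compile(r'h[1-6]$')

def intro_paragraphs(elements):
    """Return the text of the paragraphs that precede the first heading."""
    intro = []
    for name, text in elements:
        if name == 'p':
            intro.append(text)
        elif intro and HEADING.match(name):
            # the first heading after the lead paragraphs ends the introduction
            break
    return intro

# Hypothetical top-level children of an article body.
page = [
    ('div', 'infobox'),
    ('p', 'Lead paragraph one.'),
    ('table', 'sidebar'),
    ('p', 'Lead paragraph two.'),
    ('h2', 'History'),
    ('p', 'Body paragraph.'),
]
intro = intro_paragraphs(page)
print(' '.join(intro))
```

Non-paragraph elements between the lead paragraphs are skipped, and everything after the first heading is ignored.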

We now fetch the WArts, starting from the ones collected from the ADs list.

In [8]:
visited = set()
dataset = []
for starting_wart in tqdm(starting_warts, bar_format=bar_format):
    try:
        get_data(dataset, visited, starting_wart, 1)
    except Exception: # a bare except would also swallow KeyboardInterrupt
        print('Error:', sys.exc_info()[0])
        print('=' * len('Error:' + str(sys.exc_info()[0])))
        print(traceback.format_exc())

100%|██████████████████████████████████████████████████| 949/949 [1:41:21<00:00,  6.41s/it]  


We can see that we retrieved the introductions of **12872** WArts.

In [9]:
print('length:', len(dataset))

length: 12872
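
Beyond the total, it can be useful to check how the reports are spread across tags. The following is a minimal sketch using `collections.Counter`; the three records in `sample_dataset` are hypothetical placeholders with the same keys as the real `dataset`, not actual results.

```python
from collections import Counter

# Hypothetical stand-in records with the same keys as the real dataset.
sample_dataset = [
    {'tag': 'Physics', 'title': 'Optics', 'text': '...'},
    {'tag': 'Physics', 'title': 'Mechanics', 'text': '...'},
    {'tag': 'History', 'title': 'Archaeology', 'text': '...'},
]

# count the reports per tag, most frequent first
tag_counts = Counter(report['tag'] for report in sample_dataset)
for tag, count in tag_counts.most_common():
    print(f'{tag}: {count}')
```

Running the same two lines on `dataset` instead of `sample_dataset` would give the per-tag counts of the crawl.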


An example of a report in the dataset can be seen below.

In [10]:
print(dataset[0])

{'tag': 'Visual Arts', 'title': 'Art', 'text': '  Art is a diverse range of (and products of) human activities involving creative imagination to express technical proficiency, beauty, emotional power, or conceptual ideas.[1][2][3] There is no generally agreed definition of what constitutes art,[4][5][6] and ideas have changed over time. The three classical branches of visual art are painting, sculpture, and architecture.[7] Theatre, dance, and other performing arts, as well as literature, music, film and other media such as interactive media, are included in a broader definition of the arts.[1][8] Until the 17th century, art referred to any skill or mastery and was not differentiated from crafts or sciences. In modern usage after the 17th century, where aesthetic considerations are paramount, the fine arts are separated and distinguished from acquired skills in general, such as the decorative or applied arts. The nature of art and related concepts, such as creativity and interpretation

Lastly, we will convert the dataset to a DataFrame...

In [11]:
df = pd.DataFrame(dataset)
df.head(3)

Unnamed: 0,tag,title,text
0,Visual Arts,Art,"Art is a diverse range of (and products of) human activities involving creative imagination to express technical proficiency, beauty, emotional power, or conceptual ideas.[1][2][3] There is no generally agreed definition of what constitutes art,[4][5][6] and ideas have changed over time. The three classical branches of visual art are painting, sculpture, and architecture.[7] Theatre, dance, and other performing arts, as well as literature, music, film and other media such as interactive media, are included in a broader definition of the arts.[1][8] Until the 17th century, art referred to any skill or mastery and was not differentiated from crafts or sciences. In modern usage after the 17th century, where aesthetic considerations are paramount, the fine arts are separated and distinguished from acquired skills in general, such as the decorative or applied arts. The nature of art and related concepts, such as creativity and interpretation, are explored in a branch of philosophy known as aesthetics.[9] The resulting artworks are studied in the professional fields of art criticism and the history of art."
1,Visual Arts,Painting,"Painting is the practice of applying paint, pigment, color or other medium to a solid surface (called the ""matrix"" or ""support"").[1] The medium is commonly applied to the base with a brush, but other implements, such as knives, sponges, and airbrushes, can be used. In art, the term painting describes both the act and the result of the action (the final work is called ""a painting""). The support for paintings includes such surfaces as walls, paper, canvas, wood, glass, lacquer, pottery, leaf, copper and concrete, and the painting may incorporate multiple other materials, including sand, clay, paper, plaster, gold leaf, and even whole objects. Painting is an important form in the visual arts, bringing in elements such as drawing, composition, gesture (as in gestural painting), narration (as in narrative art), and abstraction (as in abstract art).[2] Paintings can be naturalistic and representational (as in still life and landscape painting), photographic, abstract, narrative, symbolistic (as in Symbolist art), emotive (as in Expressionism) or political in nature (as in Artivism). A portion of the history of painting in both Eastern and Western art is dominated by religious art. Examples of this kind of painting range from artwork depicting mythological figures on pottery, to Biblical scenes on the Sistine Chapel ceiling, to scenes from the life of Buddha (or other images of Eastern religious origin)."
2,Visual Arts,Drawing,"Drawing is a form of visual art in which an artist uses instruments to mark paper or other two-dimensional surface. Drawing instruments include graphite pencils, pen and ink, various kinds of paints, inked brushes, colored pencils, crayons, charcoal, chalk, pastels, erasers, markers, styluses, and metals (such as silverpoint). Digital drawing is the act of using a computer to draw. Common methods of digital drawing include a stylus or finger on a touchscreen device, stylus- or finger-to-touchpad, or in some cases, a mouse. There are many digital art programs and devices. A drawing instrument releases a small amount of material onto a surface, leaving a visible mark. The most common support for drawing is paper, although other materials, such as cardboard, wood, plastic, leather, canvas, and board, have been used. Temporary drawings may be made on a blackboard or whiteboard. Drawing has been a popular and fundamental means of public expression throughout human history. It is one of the simplest and most efficient means of communicating ideas.[1] The wide availability of drawing instruments makes drawing one of the most common artistic activities. In addition to its more artistic forms, drawing is frequently used in commercial illustration, animation, architecture, engineering, and technical drawing. A quick, freehand drawing, usually not intended as a finished work, is sometimes called a sketch. An artist who practices or works in technical drawing may be called a drafter, draftsman, or draughtsman.[2]"


and export it to a CSV file.

In [12]:
df.to_csv(PATH + 'temp_intros.csv', index=False, index_label=False)
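
The example report shown earlier still carries inline citation markers such as `[1][2][3]`. If these are unwanted for training, one possible cleanup is a regular-expression substitution. This is a sketch under the assumption that markers are always a number in square brackets; the `strip_citations` helper is hypothetical and not part of the original notebook.

```python
import re

# Numeric citation markers such as [1] or [23] (assumed format).
CITATION_PATTERN = re.compile(r'\[\d+\]')

def strip_citations(text):
    """Remove numeric citation markers and collapse leftover whitespace."""
    without_markers = CITATION_PATTERN.sub('', text)
    return re.sub(r'\s{2,}', ' ', without_markers).strip()

sample = ('There is no generally agreed definition of what constitutes art,'
          '[4][5][6] and ideas have changed over time.')
cleaned = strip_citations(sample)
print(cleaned)
```

On the DataFrame, the helper could be applied with `df['text'] = df['text'].map(strip_citations)` before exporting.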