# Analyzing Wikipedia Pages

In this project, I will be working with data scraped from [Wikipedia](https://www.wikipedia.org/), a crowdsourced online encyclopedia. 

I will be analyzing 54 megabytes of articles to determine patterns in Wikipedia's writing and content presentation style. The data was scraped by requesting random Wikipedia pages and downloading the contents using the `requests` package. The scraped data is in HTML format with embedded JavaScript code.

The main goals of this project will be:

* Extract only the text from the Wikipedia pages, and remove all HTML and JavaScript markup.
* Remove common page headers and footers from the Wikipedia pages.
* Figure out what tags are the most common in Wikipedia pages.
* Figure out patterns in the text.

## Understanding the data

I will begin by listing all of the files in the `wiki` folder, counting the number of files, and inspecting a single file to see if there are any patterns in the raw HTML. I will print only the first 100 characters of the random file I have chosen. 

In [1]:
import os

# List the directory once and build a relative path for each file.
filenames = ["wiki/{}".format(file) for file in os.listdir("wiki")]
print("Number of files in 'wiki' folder:", len(filenames), "\n")

print(filenames[:5], "\n")

# Inspect the first 100 characters of a single file.
with open('wiki/Furubira_District,_Hokkaido.html') as f:
    example = f.read()
print(example[:100])

Number of files in 'wiki' folder: 999 

['wiki/Furubira_District,_Hokkaido.html', 'wiki/Valentin_Yanin.html', 'wiki/Kings_XI_Punjab_in_2014.html', 'wiki/William_Harvey_Lillard.html', 'wiki/Radial_Road_3.html'] 

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title


As for patterns, certain tags are repeated frequently in the raw HTML, such as `</th>`, `</tr>`, `<tr>`, `<td>`, `</ul>`, and `</div>`.
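A rough way to check which tags repeat is to count tag names with a regular expression. The sketch below is only an illustration: `sample_html` is a hypothetical stand-in for the `example` string, and the pattern is a simple assumption that does not handle every HTML edge case.

```python
import re
from collections import Counter

# Hypothetical stand-in for the `example` string read from a wiki file.
sample_html = "<table><tr><th>Name</th></tr><tr><td>Furubira</td></tr></table>"

# Match opening and closing tags and capture only the tag name.
tag_pattern = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)")
tag_counts = Counter(name.lower() for name in tag_pattern.findall(sample_html))

# Opening and closing tags are counted together under the same name.
print(tag_counts.most_common(3))
```

A proper HTML parser is more reliable than a regular expression for real pages, which is why the later sections use BeautifulSoup.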

## Reading in the data

Now that I understand the file structure and the structure of a single file, I can read in all of the files. Since this task is I/O bound, I can use threads to help read in the data more quickly. I will also experiment with different thread counts to determine which number works best. 

In [2]:
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def read_file(filename):
    with open(filename) as f:
        data = f.read()
    return data

start = time.time()
filenames = ["wiki/{}".format(file) for file in os.listdir("wiki")]
# pool.map returns results in the same order as filenames.
content = list(pool.map(read_file, filenames))

# Derive article names by stripping the folder prefix and the file extension.
articles = [name.replace("wiki/", "").replace(".html", "") for name in filenames]

print(articles[:10])
print(time.time() - start)

['Furubira_District,_Hokkaido', 'Valentin_Yanin', 'Kings_XI_Punjab_in_2014', 'William_Harvey_Lillard', 'Radial_Road_3', 'George_Weldrick', 'Zgornji_Otok', 'Blue_Heelers_(season_8)', 'Taggen_Nunatak', '1951_National_League_tie-breaker_series']
0.2934556007385254


I have found that two threads (`max_workers=2`) are sufficient for this data: reading all 999 files takes about 0.3 seconds.
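The thread-count experiment can be sketched as a small timing loop. This is a minimal illustration, not the exact procedure used above: the helper `time_read`, the worker counts tried, and the temporary files standing in for the `wiki` folder are all assumptions.

```python
import concurrent.futures
import os
import tempfile
import time

def read_file(filename):
    with open(filename, encoding="utf-8") as f:
        return f.read()

def time_read(filenames, workers):
    """Read all files with a thread pool of the given size; return (seconds, contents)."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        contents = list(pool.map(read_file, filenames))
    return time.perf_counter() - start, contents

# Hypothetical stand-in for the `wiki` folder: a few small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(20):
        path = os.path.join(tmp, "page_{}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html><body>page {}</body></html>".format(i))
        paths.append(path)

    # Time the same read with several pool sizes.
    timings = {}
    for workers in (1, 2, 4, 8):
        elapsed, contents = time_read(paths, workers)
        timings[workers] = elapsed
        print(workers, "threads:", round(elapsed, 4), "seconds")
```

Timings from a single run are noisy, especially once the operating system has cached the files, so repeating each measurement a few times gives a fairer comparison.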

## Removing extraneous markup

Now that I have read in the data files, I can remove the extraneous markup outside of the `div` element with id `content` (`div#content`), which appears to hold most of the article data. I will use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) package for this, which will allow me to extract all of the content inside a specific tag.

With this package, I will parse each wiki article, then extract the `div` with id `content` and everything inside it.

This operation is more CPU intensive than before, so I will use a process pool to improve speed as opposed to a thread pool. 

In [3]:
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    parsed_soup = str(soup.find_all("div", id="content")[0])
    return parsed_soup

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
parsed = pool.map(parse_html, content)
parsed = list(parsed)

print(time.time() - start)

53.37423658370972


It seems like this is a fairly slow process no matter how many workers are used. With three workers, the parsing above takes about 53 seconds. Below I will print just the first 1000 characters of the first item in the `parsed` list.

In [4]:
print(parsed[0][:1000])

<div class="mw-body" id="content" role="main">
<a id="top"></a>
<div id="siteNotice"><!-- CentralNotice --></div>
<div class="mw-indicators">
</div>
<h1 class="firstHeading" id="firstHeading" lang="en">Furubira District, Hokkaido</h1>
<div class="mw-body-content" id="bodyContent">
<div id="siteSub">From Wikipedia, the free encyclopedia</div>
<div id="contentSub"></div>
<div class="mw-jump" id="jump-to-nav">
					Jump to:					<a href="#mw-head">navigation</a>, 					<a href="#p-search">search</a>
</div>
<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><table class="plainlinks metadata ambox ambox-content ambox-Unreferenced" role="presentation">
<tr>
<td class="mbox-image">
<div style="width:52px"><a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/

## Finding common tags

I have now extracted the main part of each page, so I can count up how many times each tag occurs. This will give information about how Wikipedia pages are typically structured. For example, many `a` tags would tell me that articles tend to be very connected to other articles or pages. Many `div` tags would tell me that Wikipedia pages tend to have a nested structure with many page elements. 

This process will be CPU bound, so I will use processes. 

In [5]:
def find_tag_counts(document):
    soup = BeautifulSoup(document, 'html.parser')
    tags = {}
    for tag in soup.find_all():
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)
tags = pool.map(find_tag_counts, content)
tags = list(tags)

total_tag_counts = {}
for tag in tags:
    for k,v in tag.items():
        if k not in total_tag_counts:
            total_tag_counts[k] = 0
        total_tag_counts[k] += v
        
print(time.time() - start)
total_tag_counts

35.30339241027832


{'a': 214557,
 'abbr': 3665,
 'annotation': 2,
 'area': 39,
 'audio': 2,
 'b': 14455,
 'bdi': 4,
 'big': 75,
 'blockquote': 58,
 'body': 999,
 'br': 4986,
 'caption': 200,
 'center': 64,
 'cite': 3563,
 'code': 108,
 'dd': 1376,
 'del': 2,
 'div': 58927,
 'dl': 457,
 'dt': 334,
 'font': 40,
 'form': 999,
 'h1': 999,
 'h2': 5044,
 'h3': 11954,
 'h4': 117,
 'h5': 4,
 'h6': 1,
 'head': 999,
 'hr': 51,
 'html': 999,
 'i': 18246,
 'img': 8699,
 'input': 3996,
 'label': 999,
 'li': 133277,
 'link': 12985,
 'map': 2,
 'math': 2,
 'meta': 4499,
 'mo': 2,
 'mrow': 2,
 'mstyle': 2,
 'noscript': 999,
 'ol': 858,
 'p': 7998,
 'pre': 1,
 'q': 76,
 'rb': 16,
 'rp': 32,
 'rt': 16,
 'ruby': 16,
 's': 10,
 'samp': 2,
 'script': 4995,
 'semantics': 2,
 'small': 3272,
 'source': 2,
 'span': 75342,
 'strong': 599,
 'sub': 151,
 'sup': 11157,
 'table': 4010,
 'td': 57673,
 'th': 14472,
 'title': 999,
 'tr': 27300,
 'u': 51,
 'ul': 24147,
 'wbr': 85}

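Upon inspecting the tag counts, I see that the two most common tags are `a` and `li`. This indicates that there are many hyperlinks throughout the articles, and that there are many list items. This is fairly close to what I would expect, since it is very common to see links and lists on Wikipedia pages. Note that `find_tag_counts` was mapped over the raw pages in `content` rather than over `parsed`, so these counts cover the entire page, including headers and footers; this is why `html`, `head`, and `body` each appear exactly 999 times, once per file.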
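As an aside, the manual loop that merges the per-page dictionaries can also be written with `collections.Counter`. The sketch below uses hypothetical per-page counts shaped like the dictionaries returned by `find_tag_counts`; the numbers are made up for illustration.

```python
from collections import Counter

# Hypothetical per-page tag counts, shaped like the dicts from find_tag_counts.
per_page_counts = [
    {"a": 3, "li": 2, "div": 1},
    {"a": 1, "p": 4},
]

total = Counter()
for page_counts in per_page_counts:
    # update() adds the counts key by key instead of overwriting them.
    total.update(page_counts)

print(total.most_common(2))
```

`Counter.most_common` also makes it easy to list the most frequent tags in order, rather than reading them off an alphabetically sorted dictionary.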

## Finding common words

After finding common tags, I am now able to find common words in the article body. The criteria that I will use to determine if characters form a word are the following: 

* There are at least 5 characters. The reason for this is to exclude short words like "a" and "the", since they do not necessarily give insight into trends in text.
* They use the characters A-Z, a-z, 0-9, or _.

I will replace all other characters with a space, and will also make all words lowercase. 
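The rules above can be illustrated on a single sentence. This is a minimal sketch: the function name `tokenize` and its `min_length` parameter are hypothetical, and the threshold follows the `len(word) >= 5` check used in the code cell below.

```python
import re

def tokenize(text, min_length=5):
    """Lowercase the text, replace non-word characters with spaces,
    and keep words with at least min_length characters."""
    cleaned = re.sub(r"\W+", " ", text.lower())
    return [word for word in cleaned.split() if len(word) >= min_length]

print(tokenize("The Furubira District is in Hokkaido, Japan."))
# → ['furubira', 'district', 'hokkaido', 'japan']
```

Short words such as "the", "is", and "in" are dropped, while punctuation is removed before the words are split.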

In [6]:
import re
from collections import Counter

def find_word_counts(document):
    soup = BeautifulSoup(document, 'html.parser')
    parsed_soup = str(soup.find_all("div", id="content")[0])
    parsed_soup = parsed_soup.lower()
    data = re.sub(r'\W+', " ", parsed_soup)
    data = data.split(" ")
    words = []
    for word in data:
        if len(word) >= 5:
            words.append(word)
    count = Counter(words)
    return dict(count)

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)
words = pool.map(find_word_counts, content)
words = list(words)

total_word_counts = {}
for word in words:
    for k,v in word.items():
        if k not in total_word_counts:
            total_word_counts[k] = 0
        total_word_counts[k] += v
        
print(time.time() - start)
top_20 = Counter(total_word_counts).most_common(20)
top_20

55.61172318458557


[('title', 153011),
 ('class', 146280),
 ('style', 69490),
 ('width', 35321),
 ('wikipedia', 30397),
 ('height', 26472),
 ('border', 25620),
 ('align', 24089),
 ('category', 19921),
 ('template', 19570),
 ('padding', 18893),
 ('wikimedia', 16990),
 ('thumb', 16990),
 ('upload', 16759),
 ('index', 16053),
 ('navbox', 15531),
 ('reference', 15261),
 ('cite_ref', 14867),
 ('action', 13849),
 ('commons', 13463)]

The run time above could likely be reduced by having `find_word_counts` return only the top 100 words from each article, since less data would then be passed back from the worker processes and merged.
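Because `find_word_counts` calls `str()` on the content `div`, tag and attribute names are counted as words too, which would explain entries such as `class`, `style`, and `width` in the list above. A minimal sketch of counting only the visible text, and keeping only the top words per article, might look like the following. The function name `find_text_word_counts`, the `top_n` parameter, and the sample page are hypothetical; the results on the real data have not been computed here.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

def find_text_word_counts(document, top_n=100):
    """Count words in the visible text of div#content and keep only the top_n."""
    soup = BeautifulSoup(document, "html.parser")
    content_div = soup.find("div", id="content")
    if content_div is None:
        return {}
    # Drop embedded scripts and styles so their source is not counted as text.
    for tag in content_div.find_all(["script", "style"]):
        tag.decompose()
    # get_text() returns only the text nodes, without tags or attributes.
    text = content_div.get_text(separator=" ").lower()
    words = [w for w in re.sub(r"\W+", " ", text).split() if len(w) >= 5]
    return dict(Counter(words).most_common(top_n))

# Hypothetical miniature page for illustration.
sample = ('<html><body><div id="content" class="mw-body">'
          '<p>District district town</p></div></body></html>')
print(find_text_word_counts(sample))
```

In this sample the attribute values `content` and `mw-body` are not counted, because `get_text()` ignores markup.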