# Analyzing Wikipedia Pages

In this guided project, you'll be working with data scraped from Wikipedia, a popular online encyclopedia. Wikipedia is maintained by volunteer content contributors and editors who continuously improve content. Anyone can edit Wikipedia, and Wikipedia's help pages explain how to make an edit. Because Wikipedia is crowdsourced, it has been able to rapidly assemble a huge library of articles.

## Introducing Wikipedia Data

* List all of the files in the wiki folder.
* Count up and display the number of files in the wiki folder.
* Display a single file from the wiki folder, and look at the raw HTML. Do you see any patterns?

In [1]:
import pandas as pd
import os

os.listdir("wiki")

['Furubira_District,_Hokkaido.html',
 'Valentin_Yanin.html',
 'Kings_XI_Punjab_in_2014.html',
 'William_Harvey_Lillard.html',
 'Radial_Road_3.html',
 'George_Weldrick.html',
 'Zgornji_Otok.html',
 'Blue_Heelers_(season_8).html',
 'Taggen_Nunatak.html',
 '1951_National_League_tie-breaker_series.html',
 'List_of_number-one_singles_of_1993_(Finland).html',
 'Vrila.html',
 'William_Henry_Porter.html',
 'Clive_Brown_(footballer).html',
 '2010_Karshi_Challenger_%E2%80%93_Singles.html',
 'Blick_nach_Rechts.html',
 'Central_District_(Rezvanshahr_County).html',
 'Gal%C3%A1pagos,_Guadalajara.html',
 'Campus_of_Texas_A%26M_University.html',
 'Alexios_Aspietes.html',
 'Mei_Lanfang.html',
 'Thalkirchen-Obersendling-Forstenried-F%C3%BCrstenried-Solln.html',
 'Coalville_Town_railway_station.html',
 'Gennady_Lesun.html',
 'Bartrum_Glacier.html',
 'Victor_S._Mamatey.html',
 'Gottfried_Keller.html',
 'Table_Point_Formation.html',
 'Nobuhiko_Ushiba.html',
 'Master_of_Space_and_Time.html',
 'Early_medieva

In [2]:
len(os.listdir("wiki"))

999

In [3]:
with open("wiki/Campus_of_Texas_A%26M_University.html") as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Campus of Texas A&amp;M University - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Campus_of_Texas_A\u0026M_University","wgTitle":"Campus of Texas A\u0026M University","wgCurRevisionId":767577617,"wgRevisionId":767577617,"wgArticleId":11246998,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from September 2016","Articles with dead external links from January 2017","Articles with permanently dead external links","Articles with dead external links from April 2009","Articles wi

## Reading in the Data

It appears that the main content is nested inside the div tag with the id content.


* Read all of the files into a list called content.
    * Each value in the list should contain the content of a file.
* Create a list called articles that contains the article names.
    * These should be the filenames without the wiki/ prefix and the .html suffix.
* Experiment with different performance options, and write up your findings.
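
Here is a minimal sketch of one way to derive article names from the file paths. The helper names `article_name` and `readable_title` are hypothetical and are not part of the project. The second helper also assumes you want to decode the percent-encoded characters that appear in some filenames, which the instructions do not require.

```python
import os
from urllib.parse import unquote

def article_name(path):
    """Drop the folder prefix and a trailing .html suffix from a path."""
    base = os.path.basename(path)
    name, ext = os.path.splitext(base)
    # Only strip the extension when it really is .html.
    return name if ext == ".html" else base

def readable_title(path):
    """Optional extra step: decode %XX escapes and turn underscores into spaces."""
    return unquote(article_name(path)).replace("_", " ")

print(article_name("wiki/Campus_of_Texas_A%26M_University.html"))
print(readable_title("wiki/Gal%C3%A1pagos,_Guadalajara.html"))
```

Unlike chained `str.replace` calls, `os.path.splitext` only touches the end of the name, so an article whose title happened to contain ".html" or "wiki/" in the middle would be left intact.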

In [10]:
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def read_data(filename):
    with open(filename) as f:
        data = f.read()
    return data

start = time.time()
filenames = ["wiki/{}".format(f) for f in os.listdir("wiki")]
content = pool.map(read_data, filenames)
content = list(content)

end = time.time()
print(end - start)
articles = [f.replace(".html", "").replace("wiki/", "") for f in filenames]

0.23008227348327637


After doing some profiling, it doesn't appear that threading makes a huge difference to performance. This may be because, although the threads can read files concurrently, most of the time saved is offset by the overhead of creating and coordinating the threads.
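
One way to check a claim like this is to time the same reads with and without a thread pool. The sketch below is an illustration under stated assumptions, not the profiling that was actually run: the `time_reads` helper is hypothetical, and the demo reads throwaway temporary files rather than the wiki folder.

```python
import concurrent.futures
import os
import tempfile
import time

def read_data(filename):
    with open(filename, encoding="utf-8") as f:
        return f.read()

def time_reads(filenames, max_workers=None):
    """Return (seconds, contents). max_workers=None means a plain sequential loop."""
    start = time.perf_counter()
    if max_workers is None:
        contents = [read_data(name) for name in filenames]
    else:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            contents = list(pool.map(read_data, filenames))
    return time.perf_counter() - start, contents

# Demo on 20 small temporary files (stand-ins for the wiki pages).
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i in range(20):
        path = os.path.join(folder, "page_{}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html>{}</html>".format(i))
        paths.append(path)
    sequential_seconds, sequential = time_reads(paths)
    threaded_seconds, threaded = time_reads(paths, max_workers=2)

print(sequential_seconds, threaded_seconds)
```

Timings from a single run are noisy, especially once the operating system has cached the files, so it is worth repeating each measurement several times before drawing conclusions.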

## Remove Extraneous Markup

Using the BeautifulSoup package, we'll parse each wiki article, then extract the div with id content and everything inside it.

Since this operation is more CPU intensive than before, let's try using a process pool to see if the speed improves.

In [13]:
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return str(soup.find_all("div", id="content")[0])

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
parsed = pool.map(parse_html, content)
parsed = list(parsed)
end = time.time()

print(end - start)

26.994540691375732


This operation is quite slow and CPU-intensive. It looks like using as many processes as there are available processors speeds things up.
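
As a rough sketch of how the worker count could be matched to the machine, the helpers below size a process pool from `os.cpu_count()`. Both `pick_worker_count` and `map_in_processes` are hypothetical names introduced for illustration. Whether the CPU count is the best setting for this workload is an assumption you should verify by timing.

```python
import concurrent.futures
import os

def pick_worker_count(requested=None):
    """Return the requested worker count, capped at the CPUs the OS reports."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    if requested is None:
        return available
    return max(1, min(requested, available))

def map_in_processes(func, items, requested=None):
    """Apply func to every item using a process pool sized by pick_worker_count."""
    workers = pick_worker_count(requested)
    # The with block shuts the pool down once all results are collected.
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

print(pick_worker_count())
```

With the cell above, a call such as `map_in_processes(parse_html, content)` would take the place of creating the pool by hand. The function passed in must be defined at module level so that it can be pickled and sent to the worker processes.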



## Finding Common Tags

Now that we've extracted the main part of each page, let's count up how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured. For example, if there are a lot of a tags on each page, we know that Wikipedia articles tend to be very connected to other articles or pages. On the other hand, a lot of div tags will tell us that Wikipedia pages tend to have a nested structure with many page elements.

In [15]:
from bs4 import BeautifulSoup

def count_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    tags = {}
    for tag in soup.find_all():
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
tags = pool.map(count_tags, parsed)
tags = list(tags)

tag_counts = {}
for tag in tags:
    for k,v in tag.items():
        if k not in tag_counts:
            tag_counts[k] = 0
        tag_counts[k] += v
end = time.time()

print(end - start)
tag_counts

13.91459345817566


{'a': 161065,
 'abbr': 3665,
 'annotation': 2,
 'area': 39,
 'audio': 2,
 'b': 14455,
 'bdi': 4,
 'big': 75,
 'blockquote': 58,
 'br': 4986,
 'caption': 200,
 'center': 64,
 'cite': 3563,
 'code': 108,
 'dd': 1376,
 'del': 2,
 'div': 28581,
 'dl': 457,
 'dt': 334,
 'font': 40,
 'h1': 999,
 'h2': 4045,
 'h3': 777,
 'h4': 117,
 'h5': 4,
 'h6': 1,
 'hr': 51,
 'i': 18246,
 'img': 6701,
 'li': 85779,
 'map': 2,
 'math': 2,
 'mo': 2,
 'mrow': 2,
 'mstyle': 2,
 'noscript': 999,
 'ol': 858,
 'p': 7998,
 'pre': 1,
 'q': 76,
 'rb': 16,
 'rp': 32,
 'rt': 16,
 'ruby': 16,
 's': 10,
 'samp': 2,
 'semantics': 2,
 'small': 3272,
 'source': 2,
 'span': 67350,
 'strong': 599,
 'sub': 151,
 'sup': 11157,
 'table': 4010,
 'td': 57673,
 'th': 14472,
 'tr': 27300,
 'u': 51,
 'ul': 10972,
 'wbr': 85}

Based on our findings, the most common tags are a (161,065), li (85,779), span (67,350), and td (57,673). This indicates that articles tend to have lots of links, along with lists and tables. Links (a tags) are the most numerous, which reflects the interconnected nature of articles on Wikipedia.
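
The per-page dictionaries can also be merged and ranked with `collections.Counter`, which makes the most frequent tags easy to read off. This is a minimal sketch of an alternative to the manual loop, not the code that produced the output above. The `merge_counts` helper is hypothetical and the sample counts are made up.

```python
from collections import Counter

def merge_counts(per_page_counts):
    """Add up a sequence of {tag: count} dictionaries into one Counter."""
    total = Counter()
    for page_counts in per_page_counts:
        total.update(page_counts)  # update() adds counts instead of replacing them
    return total

# Made-up counts for two pages, used only to show the behavior.
pages = [{"a": 3, "li": 2}, {"a": 2, "td": 4}]
merged = merge_counts(pages)
print(merged.most_common(2))
```

Calling `most_common()` on the merged counter returns the tags sorted from most to least frequent, which is easier to scan than an alphabetical dictionary.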

## Finding Common Words

After finding the common tags, we should be able to find the common words in the article body. We can apply any definition of "word" that we want, but it might be helpful to apply similar criteria to what we saw in the last mission.

One thing to be aware of here is that depending on the words you choose, you may run out of memory, or performance may be slow. Try tweaking things to improve performance.
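
To make one possible definition of "word" concrete, the sketch below lowercases the text, splits on runs of non-word characters, and keeps only words of at least five characters. The `top_words` helper and its parameters are hypothetical, and the sample sentence is made up. Returning only the few most common words per page is one way to keep the amount of data held in memory small.

```python
import re
from collections import Counter

def top_words(text, min_length=5, limit=15):
    """Return the most common words of at least min_length characters."""
    # Split on runs of non-word characters; a raw string keeps \W intact.
    tokens = re.split(r"\W+", text.lower())
    long_tokens = [t for t in tokens if len(t) >= min_length]
    return Counter(long_tokens).most_common(limit)

sample = "Wikipedia articles link to other Wikipedia pages about Wikipedia."
print(top_words(sample, limit=1))
```

Raising `min_length` or lowering `limit` shrinks the result for each page, which matters when the results from many worker processes have to be collected in one place.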

In [17]:
from bs4 import BeautifulSoup
from collections import Counter
import re

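def count_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # Collapse every run of non-word characters into a single space.
    text = re.sub(r"\W+", " ", text.lower())
    words = text.split(" ")
    # Keep only words with at least 5 characters.
    words = [w for w in words if len(w) >= 5]
    # Return only the 15 most common words to limit the data sent back.
    return Counter(words).most_common(15)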

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
words = pool.map(count_words, parsed)
words = list(words)

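# Tally how many articles have each word among their 15 most common words.
# The per-article count is not added, so these are article counts, not totals.
word_counts = {}
for wc in words:
    for word, count in wc:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1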
end = time.time()

print(end - start)
word_counts

14.085181474685669


{'third': 4,
 'archetypal': 1,
 'medalist': 2,
 'ryfylkeheiane': 1,
 'swept': 1,
 'morris': 3,
 'rendered': 1,
 'switzerland': 6,
 'povolný': 1,
 'margin': 1,
 'battalion': 1,
 'border': 3,
 'period': 3,
 'arxiv': 1,
 'omiodes': 1,
 'navigo': 1,
 'auxiliary': 1,
 'mumbai': 3,
 'floor': 3,
 'guitarist': 1,
 'cotton': 1,
 'least': 4,
 'mother': 1,
 'italy': 7,
 'supermoon': 1,
 'process': 2,
 'bundesliga': 2,
 'rapid': 2,
 'geezer': 1,
 'reefs': 1,
 'systems': 7,
 'riding': 1,
 'hepatitis': 1,
 'vivian': 1,
 'judges': 1,
 'creek': 11,
 'foliar': 1,
 'streatfields': 1,
 'jeeva': 1,
 'seyyed': 2,
 '01333': 1,
 'chitradurga': 1,
 'suffragan': 1,
 'aomori': 1,
 'solid': 1,
 'pathania': 1,
 'amazon': 2,
 'revolt': 1,
 'barros': 1,
 'bombay': 1,
 'yasmin': 1,
 'smilax': 1,
 'thamesmead': 1,
 'gornje': 1,
 'incubus': 1,
 'score': 7,
 '13833': 1,
 'pterygota': 1,
 'count': 1,
 'forces': 4,
 'williams': 9,
 'ståhlberg': 1,
 'yaddo': 1,
 'image': 5,
 'bells': 1,
 'kjetil': 1,
 'studied': 1,
 'cand