<a href="https://colab.research.google.com/github/mattiapocci/PhilosopherRank/blob/master/scrapingWikiDumps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import

In [3]:
from google.colab import drive
import os
import json
import re
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Download and extract Wikipedia dump


In [None]:
!wget  -P "/content/drive/My Drive/Wiki_dump/" "https://dumps.wikimedia.org/enwiki/20200120/enwiki-20200120-pages-articles-multistream.xml.bz2"

In [None]:
!bunzip2 -d -k -s /content/drive/My\ Drive/Wiki_dump/enwiki-20200120-pages-articles-multistream.xml.bz2

In [None]:
!python3 WikiExtractor.py "enwiki-20200120-pages-articles-multistream.xml"  --json --processes 2

# Parsing

## Parsing utility

In [None]:
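def create_valid_json(filename):
    """
    Rewrite a WikiExtractor output file in place as a single valid JSON array.
    Assumes the file holds one JSON object per line.
    :param filename: path of the file to convert
    :return: None
    """
    with open(filename, 'r', encoding='utf-8') as f:
        # Parse line by line: replacing every '}' with '},' would also
        # corrupt any closing brace that appears inside an article's text
        articles = [json.loads(line) for line in f if line.strip()]

    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(articles, f)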


def find_matches(filename, word, phil_list):
    """
    Append to phil_list every article in the file whose text contains the given word
    :param filename: path of a file holding a JSON array of articles
    :param word: substring to look for in each article's text
    :param phil_list: list the matching articles are appended to
    :return: phil_list, extended with the matching articles
    """
    with open(filename, 'r', encoding='utf-8') as file:
        try:
            data = json.loads(file.read())
            for article in data:
                if word in article['text']:
                    phil_list.append(article)
        except Exception as err:
            print("Error with: ", filename, err)
    # Always return the list, so a failing file never turns the caller's accumulator into None
    return phil_list


def write_json(data, name):
    """
    Write the given data as JSON to a file with the given name in the Wiki_dump folder on Drive
    :param data: list of JSON-serializable articles
    :param name: name of the output file
    :return: None
    """
    with open('/content/drive/My Drive/Wiki_dump/'+name, 'w') as outfile:
        json.dump(data, outfile)




## Executing the parsing

If the Wikipedia articles have already been converted with the create_valid_json function, do NOT run the parsing again: a second pass would corrupt every article file. To avoid this, comment out the indicated line.

In [None]:
def parse_data(rootdir="/home/luigi/Downloads/wir/wikiextractor-master/text/"):
    """
    Parse all the files present in rootdir and its subdirectories
    :param rootdir: root folder of the WikiExtractor output
    :return: None
    """
    lst = []
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            filename_fullpath = os.path.join(subdir, file)
            create_valid_json(filename_fullpath) #COMMENT OUT THIS LINE if the files are already converted (see the note above)
            lst = find_matches(filename_fullpath,"philosopher",lst)
    # The output file is named after the last file visited by os.walk
    write_json(lst,file+".json")
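
The warning above about converting a file twice can also be enforced in code rather than by commenting a line in and out. The following is a minimal sketch, not part of the original notebook: `is_json_array` is a hypothetical helper name, and it assumes an unconverted file holds one JSON object per line, so it fails to parse as a single JSON document or parses as something other than a list.

```python
import json


def is_json_array(filename):
    """Return True if the file already parses as a single JSON array.

    Hypothetical helper: an unconverted file with several JSON objects,
    one per line, raises a decode error, and a file with a single object
    parses as a dict, so both cases return False.
    """
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            return isinstance(json.load(f), list)
    except ValueError:  # covers json.JSONDecodeError and UnicodeDecodeError
        return False


# Possible use inside parse_data (sketch):
#     if not is_json_array(filename_fullpath):
#         create_valid_json(filename_fullpath)
```

The check parses the whole file, so it costs one extra read per file; for a one-off conversion that is likely an acceptable trade for not corrupting the articles.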


Format the data so that it has the same structure as the Beautiful Soup data.

In [None]:
lst = []
with open("/content/drive/My Drive/Wiki_dump/wiki_75.json", 'r', encoding='utf-8') as file:
    data = json.loads(file.read())
    for article in data:
        if "philosopher" in article['text']:
            var = {
                "philosopher": article['title'],
                "article": article['text'],
                "pageid": article['id'],
                "table_influenced": [],
                "table_influences": []
            }
            lst.append(var)
    write_json(lst,"uniformat.json")

# Compare with the category dump

Construct lists from the entire dump

In [13]:
reg_a_phil_dump = []
born_lived_dump = []

with open("/content/drive/My Drive/Wiki_dump/uniformat.json", 'r', encoding='utf-8') as file:
    data = json.loads(file.read())
    for article in data:
        if "born" in article['article'] or "lived" in article['article']:
            born_lived_dump.append(article)
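        # re.match only tests from the start of the text and '.' does not cross
        # newlines, so this pattern is checked against the first line only
        if re.match(r".*a.*philosopher",article['article']):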
            reg_a_phil_dump.append(article)
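
The regex test above uses `re.match` without flags, which only looks at the first line of each article. The sketch below illustrates the difference on a made-up article text; the `re.search` variant with word boundaries is an assumption about what a broader check could look like, not the pattern that produced the counts reported in this notebook.

```python
import re

# Made-up sample: a title line followed by the article body
text = "Plato\n\nPlato was a Greek philosopher born in Athens."

# re.match anchors at the start of the string, and '.' stops at the first newline,
# so only the title line "Plato" is tested
first_line_only = re.match(r".*a.*philosopher", text)

# re.search scans the whole string; re.DOTALL lets '.' cross newlines,
# and \b restricts 'a' and 'philosopher' to whole words
anywhere = re.search(r"\ba\b.*?\bphilosopher\b", text, flags=re.DOTALL)

print(first_line_only)    # None
print(anywhere.group(0))  # a Greek philosopher
```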

Construct lists from the category dump

In [14]:
born_lived = []
phil = []
reg_a_phil = []
with open("/content/drive/My Drive/Wiki_dump/mattia_ground_t.json", 'r', encoding='utf-8') as file:
    data_cat = json.loads(file.read())
for a in data_cat:
    if 'philosopher' in a['article']:
        phil.append(a)
        if "born" in a['article'] or "lived" in a['article']:
            born_lived.append(a)
        if re.match(r".*a.*philosopher",a['article']):
            reg_a_phil.append(a)

Find the articles present in both dumps

In [None]:
match = 0
trovato = False
for cat_art in phil:
    trovato = False
    for dump_art in data:
        if not trovato and cat_art['pageid'] == dump_art['pageid']:
            match = match + 1
            trovato = True
    if not trovato:
        print(cat_art)
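
The nested loop above compares every category article with every dump article. An equivalent lookup can be sketched with a set of page ids, which does one pass over each list. `count_matches` is a hypothetical helper, not part of the original notebook, and it assumes the `pageid` values have the same type in both dumps, as the equality test in the loop already requires.

```python
def count_matches(category_articles, dump_articles):
    """Count the category articles whose pageid also appears in the dump.

    Hypothetical helper. Returns (number of matches, list of unmatched
    category articles).
    """
    # Set membership is constant time on average, unlike scanning the whole dump list
    dump_ids = {article['pageid'] for article in dump_articles}
    missing = [a for a in category_articles if a['pageid'] not in dump_ids]
    return len(category_articles) - len(missing), missing


# Possible use with the lists built above (sketch):
#     match, missing = count_matches(phil, data)
#     for article in missing:
#         print(article)
```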


Printing the scores

In [16]:
print("===============DATA ANALYSIS OF CATEGORY DUMP ============")
print("Length of category dump: ",len(data_cat))
print("Length with 'philosopher' in article: ",len(phil))
print("Length with 'born' or 'lived' in article: ",len(born_lived))
print("Length with regex '.* a.* philosopher' in article: ",len(reg_a_phil))
print("\n")
print("===============DATA ANALYSIS OF ENTIRE DUMP ============")
print("Length of dump: 62000000")
print("Length with 'philosopher' in article: ",len(data))
print("Length with 'born' or 'lived' in article: ",len(born_lived_dump))
print("Length with regex '.* a.* philosopher' in article: ",len(reg_a_phil_dump))
print("\n")
print("Matched ",match," articles between category dump and all wiki")
print("Missing",len(phil)-match," articles from all dump")

Length of category dump:  1712
Length with 'philosopher' in article:  1161
Length with 'born' or 'lived' in article:  968
Length with regex '.* a.* philosopher' in article:  996


Length of dump: 62000000
Length with 'philosopher' in article:  26312
Length with 'born' or 'lived' in article:  14272
Length with regex '.* a.* philosopher' in article:  309


Matched  1131  articles between category dump and all wiki
Missing 30  articles from all dump
