### Parallel Data Processing with Wiki Data

In this project, we work with web-scraped Wikipedia data stored as HTML. The goal is to implement a grep-like command-line utility that searches the articles saved in the `wiki` folder. Each file corresponds to an article URL of the form:

`https://en.wikipedia.org/wiki/*article_name*`

Through this project, I aim to complete the following goals:

- Search for all occurrences of a string
- Provide a case-insensitive option for the search
- Refine the results by reporting the specific location of each match within the files

In [1]:
# first we will list out the available articles found under the wiki directory

import os
file_names = os.listdir("wiki")

# Print first 20 html files
for i in file_names[:20]:
    print(i)

Bay_of_ConcepciC3B3n.html
Bye_My_Boy.html
Valentin_Yanin.html
Kings_XI_Punjab_in_2014.html
William_Harvey_Lillard.html
Radial_Road_3.html
George_Weldrick.html
Zgornji_Otok.html
Blue_Heelers_(season_8).html
Taggen_Nunatak.html
Henri_BraqueniC3A9.html
Vrila.html
William_Henry_Porter.html
Clive_Brown_(footballer).html
Blick_nach_Rechts.html
Central_District_(Rezvanshahr_County).html
Alexios_Aspietes.html
Mei_Lanfang.html
Wangeroogeclass_tug.html
Dowell_Philip_O27Reilly.html


In [2]:
# Total number of article files
print(len(file_names))

999


In [3]:
# Preview first file data
folder_name = 'wiki'
first_file = file_names[0]

# Display the name of the first file
first_file

'Bay_of_ConcepciC3B3n.html'

In [4]:
# Preview the contents of the first file in the 'wiki' folder
with open(os.path.join(folder_name,first_file)) as file: 
    print(file.read())


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Bay of Concepción - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNa

----------

#### Counting File Lines

Using the `map_reduce` function defined below, I implement a mapper and a reducer to count the total number of lines across all files in the `wiki` folder.

`Mapper`: receives one chunk of file names, counts the lines in each file of that chunk, and returns the chunk's total.

`Reducer`: combines the mapper results to give the total number of lines across all files in the `wiki` directory.
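Before the parallel version, the mapper/reducer split can be illustrated on in-memory data. This is a minimal sketch that runs serially with the built-in `map` instead of a process pool; `chunk_list`, `count_items`, and `add_counts` are illustrative names and are not part of the notebook.

```python
import functools
import math

def chunk_list(data, num_chunks):
    """Split data into at most num_chunks consecutive slices."""
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def count_items(chunk):
    """Mapper: produce one partial result per chunk (here, its length)."""
    return len(chunk)

def add_counts(left, right):
    """Reducer: merge two partial results into one."""
    return left + right

toy_data = list(range(10))
chunks = chunk_list(toy_data, 4)
partials = list(map(count_items, chunks))        # one result per chunk
total = functools.reduce(add_counts, partials)   # fold the partial results

print(chunks)    # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(partials)  # [3, 3, 3, 1]
print(total)     # 10
```

Swapping the built-in `map` for `Pool.map` is what distributes the chunks across processes; the mapper and reducer themselves stay the same.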

In [5]:
# create map_reduce function 

import math
import functools
import itertools
from multiprocessing import Pool

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    # The context manager shuts the worker pool down once the map finishes
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

In [6]:
# create mapper: takes one chunk (a list of file names) and returns its total line count

def count_lines(file_name):
    folder = 'wiki'
    file_len = 0
    for files in file_name:
        with open(os.path.join(folder,files)) as file:
            lines = [line for line in file.readlines()]
        file_len += len(lines)
    return file_len


# Create reducer function 

def sum_lines(counts1, counts2):
    counts1 += counts2
    return counts1

In [7]:
all_file_lines = map_reduce(file_names, 4, count_lines, sum_lines)

print(all_file_lines)

499797


#### Review of Line Counts in the Wiki Folder

Using the `map_reduce` function with a custom mapper and reducer, we found that the files in the `wiki` directory contain about 500K lines in total (499,797). The mapper, `count_lines`, computes the line count for one chunk produced by `map_reduce`. It iterates over the file names in the chunk, joins each with the folder name `wiki` using `os.path.join`, and opens the file. A list comprehension then collects the file's lines into a list named `lines`, and the length of that list is added to `file_len`, which starts at 0. The reducer, `sum_lines`, adds the results of two chunks, so reducing over all chunks yields a single total.
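The per-file count does not strictly need to hold every line in memory. As a hedged alternative to the list-based mapper, the sketch below counts lines by iterating over the file object directly. The function name `count_lines_streaming` and the temporary demo files are assumptions made so the example is self-contained; they are not part of the notebook.

```python
import os
import tempfile

def count_lines_streaming(paths):
    """Count lines across files without building a list of lines."""
    total = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            # Iterating a file object yields one line at a time
            total += sum(1 for _ in handle)
    return total

# Build two small throwaway files to exercise the function
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_paths = []
    for name, text in [("a.html", "one\ntwo\n"), ("b.html", "three\n")]:
        path = os.path.join(tmp_dir, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        demo_paths.append(path)
    demo_total = count_lines_streaming(demo_paths)

print(demo_total)  # 3
```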

#### Creating a `grep`-like function to identify lines containing a string in each file

In [8]:
# mapper function to find the lines that contain the target string (case-sensitive)

def map_find_data_string(file_name):
    data_dict = {}
    for files in file_name:
        with open(files) as file:
            lines = [line for line in file.readlines()]
        for line_index, line in enumerate(lines):
            if target in line:
                if files not in data_dict:
                    data_dict[files] = []
                data_dict[files].append(line_index)
    return data_dict
                
def reduce_data_strings(grepd1, grepd2):
    grepd1.update(grepd2)
    return grepd1


def map_grep(path, num_process):
    file_names = [os.path.join(path,fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_process, map_find_data_string, reduce_data_strings)
    

In [9]:
target = "data"
find_data = map_grep('wiki', 8)


#### Creating a case-insensitive grep



In [10]:
def map_find_data_string_case(file_name):
    data_dict = {}
    for files in file_name:
        with open(files) as file:
            lines = [line.lower() for line in file.readlines()]
        for line_index, line in enumerate(lines):
            if target.lower() in line:
                if files not in data_dict:
                    data_dict[files] = []
                data_dict[files].append(line_index)
    return data_dict
                
def reduce_data_strings_case(grepd1, grepd2):
    grepd1.update(grepd2)
    return grepd1


def map_grep_case(path, num_process):
    file_names = [os.path.join(path,fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_process, map_find_data_string_case, reduce_data_strings_case)
    

In [11]:
target = "data"

find_data_insen = map_grep_case('wiki', 8)

#### Grep Review

In the initial grep function, we searched for the string "data" in each HTML file in the `wiki` folder and built a dictionary mapping each file to the list of line indices where the string occurred. However, that function is case-sensitive. To find all occurrences regardless of case, we lowercased each line in the list comprehension and lowercased the target before comparing. Below, we compare the two dictionaries to count the additional matching lines found by the case-insensitive search.
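The difference between the two searches can be seen on a few in-memory lines. This is a small sketch, not the notebook's mapper: `matching_line_indexes` and `sample_lines` are hypothetical names introduced for illustration.

```python
def matching_line_indexes(lines, target, ignore_case=False):
    """Return the indexes of lines that contain target."""
    if ignore_case:
        target = target.lower()
    hits = []
    for line_index, line in enumerate(lines):
        # Lowercase the line only when case should be ignored
        haystack = line.lower() if ignore_case else line
        if target in haystack:
            hits.append(line_index)
    return hits

sample_lines = ["Data is everywhere.", "No match here.", "Open data, big DATA."]

sensitive = matching_line_indexes(sample_lines, "data")
insensitive = matching_line_indexes(sample_lines, "data", ignore_case=True)

print(sensitive)    # [2]
print(insensitive)  # [0, 2]
```

Line 0 only matches once case is ignored, which is the kind of additional match the comparison reports.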

In [12]:
for fn in find_data_insen:
    if fn not in find_data:
        print("There are {} new occurences within the {} file".format(len(find_data_insen[fn]), fn))
    elif len(find_data_insen[fn]) > len(find_data[fn]): 
        print("There are {} new occurences within the {} file".format(len(find_data_insen[fn]) - len(find_data[fn]), fn))


There are 1 new occurences within the wiki/Table_Point_Formation.html file
There are 1 new occurences within the wiki/Ingrid_GuimarC3A3es.html file
There are 2 new occurences within the wiki/Jules_Verne_ATV.html file
There are 1 new occurences within the wiki/Pictogram.html file
There are 2 new occurences within the wiki/Claire_Danes.html file
There are 1 new occurences within the wiki/PTPRS.html file
There are 1 new occurences within the wiki/A_Beautiful_Valley.html file
There are 1 new occurences within the wiki/Mudramothiram.html file
There are 2 new occurences within the wiki/Gordon_Bau.html file
There are 1 new occurences within the wiki/Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html file
There are 3 new occurences within the wiki/Code_page_1023.html file
There are 1 new occurences within the wiki/Cryptographic_primitive.html file
There are 1 new occurences within the wiki/Alex_Kurtzman.html file
There are 1 new occurences within the wiki/Filip_Pyrochta.html file
There are 1 new o

#### Finding every occurrence of the string within a line

So far we have recorded the index of each line that contains the string "data" at least once. The next step is to record the position of every match. To do this, we pair the line index with the index of the first character of each match within that line.

In [13]:
def find_match_indexes(line, target):
    results = []
    i = line.find(target, 0)
    while i != -1:
        results.append(i)
        i = line.find(target, i + 1)
    return results

# Test implementation
s = "Data science is related to data mining, machine learning and big data.".lower()
print(find_match_indexes(s, "data"))


[0, 27, 65]
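An alternative to the `str.find` loop above is `re.finditer`, sketched here under the assumption that a regular-expression search is acceptable. It ignores case through `re.IGNORECASE` instead of lowercasing, and it already returns `(line_index, match_index)` pairs. Unlike the `find` loop, which advances one character at a time, `finditer` reports only non-overlapping matches. The name `find_match_positions` is illustrative.

```python
import re

def find_match_positions(lines, target):
    """Return (line_index, match_index) for every match of target, ignoring case."""
    # re.escape treats the target as a literal string, not a pattern
    pattern = re.compile(re.escape(target), re.IGNORECASE)
    positions = []
    for line_index, line in enumerate(lines):
        for match in pattern.finditer(line):
            positions.append((line_index, match.start()))
    return positions

sample = [
    "Data science is related to data mining,",
    "machine learning and big data.",
]
positions = find_match_positions(sample, "data")
print(positions)  # [(0, 0), (0, 27), (1, 25)]
```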


In [14]:
def map_occurance(file_name):
    data_dict = {}
    for files in file_name:
        with open(files) as file:
            lines = [line.lower() for line in file.readlines()]
        for line_index, line in enumerate(lines):
            match_indexes = find_match_indexes(line, target.lower())
            if files not in data_dict:
                data_dict[files] = []
            data_dict[files] += [(line_index, match_index) for match_index in match_indexes]
    return data_dict


def mapreduce_grep_match_indexes(path, num_processes):
    file_names = [os.path.join(path, fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_processes,  map_occurance, reduce_data_strings_case)

target = "science"
occurrences = mapreduce_grep_match_indexes("wiki", 8)
for k,v in itertools.islice(occurrences.items(),20):
    print(k,": ", v)

wiki/Bay_of_ConcepciC3B3n.html :  []
wiki/Bye_My_Boy.html :  []
wiki/Valentin_Yanin.html :  [(6, 840), (6, 890), (66, 90), (66, 145), (66, 173), (144, 1440), (144, 1502), (144, 1548), (144, 1632), (144, 1697), (144, 1746)]
wiki/Kings_XI_Punjab_in_2014.html :  []
wiki/William_Harvey_Lillard.html :  [(80, 166)]
wiki/Radial_Road_3.html :  []
wiki/George_Weldrick.html :  []
wiki/Zgornji_Otok.html :  []
wiki/Blue_Heelers_(season_8).html :  []
wiki/Taggen_Nunatak.html :  []
wiki/Henri_BraqueniC3A9.html :  []
wiki/Vrila.html :  []
wiki/William_Henry_Porter.html :  []
wiki/Clive_Brown_(footballer).html :  []
wiki/Blick_nach_Rechts.html :  []
wiki/Central_District_(Rezvanshahr_County).html :  []
wiki/Alexios_Aspietes.html :  []
wiki/Mei_Lanfang.html :  []
wiki/Wangeroogeclass_tug.html :  []
wiki/Dowell_Philip_O27Reilly.html :  []


### Display Results

We will write the results to a CSV file, including a short context snippet around each match.
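The context snippet is a fixed-width slice around the match, clamped at the start of the line. The following sketch isolates that slicing step; `context_window` is a hypothetical helper name, and the sample sentence is reused from the earlier test.

```python
def context_window(line, match_index, match_length, delta):
    """Return the match plus up to delta characters on each side."""
    start = max(match_index - delta, 0)       # never slice before the line start
    end = match_index + match_length + delta  # slicing past the end is safe in Python
    return line[start:end]

text = "Data science is related to data mining, machine learning and big data."

middle = context_window(text, 27, len("data"), 10)
at_start = context_window(text, 0, len("data"), 10)

print(repr(middle))    # 'elated to data mining, m'
print(repr(at_start))  # 'Data science i'
```

The match index must refer to the same string that is sliced, so any stripping applied after the search has to preserve leading characters.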


In [15]:
import csv

# How many characters to show before and after the match
context_delta = 30

# newline="" lets the csv module control line endings
with open("results.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    rows = [["File", "Line", "Index", "Context"]]
    for fn in occurrences:
        with open(fn) as f:
            # Strip only the line ending: match indexes were computed on unstripped
            # lines, so removing leading whitespace would shift the context window
            lines = [line.rstrip("\n") for line in f.readlines()]
        for line, index in occurrences[fn]:
            start = max(index - context_delta, 0)
            end   = index + len(target) + context_delta
            rows.append([fn, line, index, lines[line][start:end]])
    writer.writerows(rows)

In [16]:
import pandas as pd
df = pd.read_csv("results.csv")

df.head(10)

| | File | Line | Index | Context |
|---|---|---|---|---|
| 0 | wiki/Valentin_Yanin.html | 6 | 840 | `embers of the USSR Academy of Sciences","Full ...` |
| 1 | wiki/Valentin_Yanin.html | 6 | 890 | `ers of the Russian Academy of Sciences","Demid...` |
| 2 | wiki/Valentin_Yanin.html | 66 | 90 | `href="/wiki/Soviet_Academy_of_Sciences" class=...` |
| 3 | wiki/Valentin_Yanin.html | 66 | 145 | `ect" title="Soviet Academy of Sciences">Soviet...` |
| 4 | wiki/Valentin_Yanin.html | 66 | 173 | `f Sciences">Soviet Academy of Sciences</a>; he...` |
| 5 | wiki/Valentin_Yanin.html | 144 | 1440 | `rs_of_the_USSR_Academy_of_Sciences" title="Cat...` |
| 6 | wiki/Valentin_Yanin.html | 144 | 1502 | `rs of the USSR Academy of Sciences">Full Membe...` |
| 7 | wiki/Valentin_Yanin.html | 144 | 1548 | `rs of the USSR Academy of Sciences</a></li><li...` |
| 8 | wiki/Valentin_Yanin.html | 144 | 1632 | `of_the_Russian_Academy_of_Sciences" title="Cat...` |
| 9 | wiki/Valentin_Yanin.html | 144 | 1697 | `of the Russian Academy of Sciences">Full Membe...` |
