# Notes, Links, Code Snippets During Common Crawl Data Processing

In [3]:
# import libraries

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

### Links

Example Repo: https://github.com/commoncrawl/cc-pyspark

Common Crawl Format Example: https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-warc

Configure EMR to run a pyspark job using Python: https://aws.amazon.com/premiumsupport/knowledge-center/emr-pyspark-python-3x/

Apache PySpark Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext

PySpark Cheat Sheet: Spark in Python: https://www.datacamp.com/community/blog/pyspark-cheat-sheet-python

PySpark Tutorial-Learn to use Apache Spark with Python: https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial

Apache Spark: Python Programming Guide: https://spark.apache.org/docs/0.9.0/python-programming-guide.html

Open Source Search Engines in Python: http://pythonsource.com/open-source/search-engines

Implementing a Search Engine with Ranking in Python: http://aakashjapi.com/fuckin-search-engines-how-do-they-work/


### Bash Scripts

Point the environment variable SPARK_HOME to your Spark installation

In [None]:
$ export SPARK_HOME="/Users/lxu213/spark/"

Submit an example job to Spark:

In [None]:
$ $SPARK_HOME/bin/spark-submit ./server_count.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt servernames

ReadWARC: Assuming that you have the aws command line tools installed, you can list the contents of a crawl using:

In [None]:
$ aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2014-10/ --recursive | head -6

Copy one segment to local using:

In [None]:
$ aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2014-10/segments/1394023864559/warc/CC-MAIN-20140305125104-00002-ip-10-183-142-35.ec2.internal.warc.gz .

### Notes

SparkContext = Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. 

SQLContext = The entry point for working with structured data (rows and columns) in Spark. Allows the creation of DataFrame objects as well as the execution of SQL queries. 

Resilient Distributed Datasets (RDDs) are Apache Spark’s core data abstraction, and the way they are designed and implemented accounts for much of Spark’s speed. More about RDDs:

1. RDDs are read-only, partitioned data stores that are distributed across many machines (typically a cluster).
2. RDDs can be used within Spark through PySpark, Spark SQL, or Scala. Data that is ingested, or that already exists on disk on the Linux file system or on the Hadoop Distributed File System (HDFS), can be loaded and converted to a distributed dataset.
3. RDDs work better as an abstraction for distributed data processing mainly because they avoid some of the issues of MapReduce, the older data-processing paradigm that Spark is increasingly replacing. Chiefly, these are:
    - Replication: Replicating data across different parts of a cluster is the HDFS feature that lets data be stored in a fault-tolerant manner. Spark’s RDDs address fault tolerance with a lineage graph instead: a lost partition is recomputed from the transformations that produced it. The name (resilient, as opposed to replicated) reflects this difference in implementation.
    - Serialization: Serialization slows MapReduce down in operations like shuffling and sorting.
    - Disk IO: Writing files to disk and reading them back is one of the most expensive operations, and this kind of disk input-output hurts the performance of big compute jobs (as opposed to “big data”, which refers to storing and handling large data sets). Although Spark can cache and persist RDDs, it is primarily an in-memory processing engine that depends on cheap access to RAM (which differs from the “commodity hardware” argument made for Hadoop). MapReduce incurs disk IO at every map or reduce stage; Spark avoids much of it by keeping intermediate data in memory, with its scheduler and optimiser giving fine-grained control over scheduling and resilient processing.
    - Optimisation and Lazy Evaluation: These are mentioned together because lazy evaluation (a la Scala) lets a sequence of transformations be declared on RDDs without spending compute time on them until a result is requested. Spark represents these transformations as a Directed Acyclic Graph (DAG), which its scheduler splits into stages; for DataFrames and Spark SQL, the Catalyst optimizer additionally optimises the query plan. Spark’s own standalone cluster manager can handle scheduling by itself alongside a file system, but Spark also integrates with existing resource managers from the Hadoop ecosystem (such as YARN).
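
The lazy-evaluation and lineage ideas in the list above can be sketched in plain Python. This is a toy illustration only, not Spark's API or implementation: `LazyDataset` and `tracked_square` are hypothetical names, and everything runs in a single process.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original input, never modified
        self._lineage = tuple(lineage)   # ordered (kind, function) steps

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # The "action": replay the recorded lineage over the source data.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def tracked_square(x):
    calls.append(x)  # lets us observe when the work actually happens
    return x * x

squares = LazyDataset(range(5)).map(tracked_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # no work done yet: only the lineage was recorded
result = squares.collect()         # the transformations run here
recomputed = squares.collect()     # a lost result can be rebuilt from the lineage
```

Because each dataset keeps its source and its list of steps, any result can be recomputed on demand, which is the intuition behind using lineage rather than replication for fault tolerance.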


Reading WARC Records

A key feature of the warcio library is the ability to iterate over a stream of WARC records using the ArchiveIterator.

It includes the following features:

- Reading a WARC/ARC stream
- On-the-fly ARC to WARC record conversion
- Decompressing and de-chunking HTTP payload content stored in WARC/ARC files

For example, the following prints the URL for each WARC response record:

from warcio.archiveiterator import ArchiveIterator

with open('path/to/file', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))

The stream object could be a file on disk or a remote network stream. The ArchiveIterator reads the WARC content in a single pass. Each record is represented by an ArcWarcRecord object, which contains the format (ARC or WARC), the record type, the record headers, the HTTP headers (if any), and a raw stream for reading the payload.

class ArcWarcRecord(object):
    def __init__(self, *args):
        (self.format, self.rec_type, self.rec_headers, self.raw_stream,
         self.http_headers, self.content_type, self.length) = args
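
To make the record layout more concrete, here is a simplified pure-Python sketch that reads one uncompressed WARC record from a byte stream. It is not how warcio is implemented: `read_warc_record` is a hypothetical helper, the sample record is made up, and real Common Crawl files are gzip-compressed, which warcio handles for you.

```python
import io

def read_warc_record(stream):
    """Read one uncompressed WARC record; return (headers, payload) or None."""
    version = stream.readline().strip()
    if not version:
        return None                      # end of stream
    if not version.startswith(b"WARC/"):
        raise ValueError("not a WARC record: %r" % version)
    headers = {}
    while True:
        line = stream.readline().strip()
        if not line:                     # a blank line ends the header block
            break
        name, _, value = line.partition(b":")
        headers[name.decode("utf-8")] = value.strip().decode("utf-8")
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)                       # skip the \r\n\r\n record separator
    return headers, payload

# Made-up sample record (uncompressed), for illustration only.
body = b"HTTP/1.1 200 OK\r\n\r\nhello"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

stream = io.BytesIO(sample)
headers, payload = read_warc_record(stream)
end_of_stream = read_warc_record(stream)   # None once the stream is exhausted
```

The payload of a response record still contains the HTTP status line and headers, which is why ArcWarcRecord exposes `http_headers` separately from the raw stream.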

### Running Questions
1. "The key reasons RDDs are an abstraction that works better for distributed data processing, is because they don’t feature some of the issues that MapReduce" ... Is MapReduce (MR) a strategy that can also be used in Spark?
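
On the question above: the map and reduce steps are a programming pattern, and PySpark exposes them on RDDs as `map` and `reduceByKey`; what differs from Hadoop MapReduce is the execution engine. The sketch below shows only the pattern, in plain Python with no Spark involved. `map_phase` and `reduce_by_key` are hypothetical names and the input lines are made up.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word, like a MapReduce mapper.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_by_key(pairs):
    # Sum the values of each key, like a reducer.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_by_key(map_phase(["hot dog", "Hot sauce"]))
```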

### Extract Keywords Python Function

Inherits from `CCSparkJob` and can run locally

In [29]:
# TODO: incorporate more robust search engine index
# TODO: more robust adlink detection. 

In [None]:
# run extract_keyword.py in shell
$ cd data/ad-free-search-engine
$ python extract_keyword.py input/test_wat.txt output

In [36]:
# extract links wat 
parquet_path = '/Users/lxu213/data/cc-pyspark-master/spark-warehouse/word_count_output/part-00009-564210ad-3e62-4dc7-96b9-f127908a22c8-c000.snappy.parquet'
table_wat = pq.read_table(parquet_path, use_threads=True).to_pandas()


In [51]:
# extract keywords
kw_path = '/Users/lxu213/data/ad-free-search-engine/spark-warehouse/output_features/part-00000-34ccb7a1-4cbe-413d-bb91-165ea931b1f8-c000.snappy.parquet'
data = pq.read_table(kw_path, use_threads=True).to_pandas()
data['section'].unique()

array([u'', u'Careers', u'Moms &amp; Babies', u'ricerca', u'Style',
       u'teens', u'Audio', u'Europe', u'Middle East', u'Cornwall',
       u'Manchester', u'Magazine', u'The Ashes', u'Southampton', u'USPGA',
       u'Saints & Angels', u"Results for 'fleecing'", u'US', u'us',
       u'international', u'spanish', u'Fashion', u'education',
       u'entertainment', u'Entertainment', u'innovations',
       u'Neighborhood News', u'/', u'Zuigflessen', u'Press Releases',
       u'Vino rosso', u'Blog'], dtype=object)

In [55]:
url = 'https://www.huffingtonpost.com/.../jacob-comfort-dog-parkland-shooting_us_5a85bc'
url = 'http://gradestack.com/Complete-CAT-Prep/If-b-be-the-pth-term-of-a/1'
len(url)

description = 'As a Marine dog handler, Jose is a perpetual outsider, assigned to platoons that have been together for years, tight-knit combat brotherhoods that regard newcomers, especially dog handlers, with a high degree of circumspection. His job is to accompany that platoon, to clear a path through hostile territory for his fellow'
len(description)

title = 'It&#8217;s Finally Happening: America Will Get a Cat Cafe   '
len(title)

descrip_line = 'Practice complete test of topic Advanced-2 inside chapter Sequence and Series. This chapter is part of '
len(descrip_line)

103

In [29]:
url[:25] + '...'

'http://www.flowerpictures...'
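
The length checks above are aimed at fitting URLs, titles, and descriptions on a results page. A minimal sketch of a display helper follows, assuming Python 3 (for `html.unescape`); `shorten` and the limits passed to it are hypothetical, not part of the project code.

```python
import html

def shorten(text, limit, suffix="..."):
    """Unescape HTML entities, trim whitespace, keep at most `limit` characters
    and append `suffix` only when something was cut off."""
    clean = html.unescape(text).strip()
    if len(clean) <= limit:
        return clean
    return clean[:limit] + suffix

title = shorten('It&#8217;s Finally Happening: America Will Get a Cat Cafe   ', 60)
short_url = shorten('http://gradestack.com/Complete-CAT-Prep/If-b-be-the-pth-term-of-a/1', 25)
```

Unescaping before measuring matters because an entity such as `&#8217;` takes seven characters in the raw string but renders as one.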

In [None]:
# parenthesise each condition: & binds more tightly than !=
trips_wkd_rain = trips.loc[(trips['DoW'].isin([5, 6])) & (trips['PRCP'] != 0)]

In [43]:
query = 'hot dog'
query.lower().split()

['hot', 'dog']

In [36]:
data[['url', 'description']].loc[data['keywords'].isin(query.split(' '))] 
data[['url', 'description']].loc[data['keywords'].isin(['lily','cyrus'])] 

Unnamed: 0,url,description
30094,http://short-movies-animation.blogspot.com/201...,"Lily and the Snowman, vagotanulo 2016, Short..."
40776,http://www.backyardgardener.com/forums/showthr...,I have around 100 Blood Lily (Haemanthus multi...
47467,http://www.courtsystem.org/lily-ky-courts/,Search Lily court records to access free publi...
54005,http://www.flirtic.ee/polls/music/d5259a7d-4d2...,
54068,http://www.flowerpictures1.com/r-lily-flowers-...,"Free picture of Big Lily Flower Tattoo, Big Li..."
54072,http://www.flowerpictures1.com/r-lily-flowers-...,"Free picture of Big Lily Flower Tattoo, Big Li..."
54076,http://www.flowerpictures1.com/r-lily-flowers-...,"Free picture of Big Lily Flower Tattoo, Big Li..."
69405,http://www.nbcchicago.com/news/sports/Sochi-Wi...,After helping the U.S. sweep the Olympics slop...
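
The `isin` lookups above return pages that match any one of the query terms. As a possible step toward the more robust index mentioned in the TODO, here is a plain-Python sketch of an inverted index that requires every term to match. `build_index`, `search`, and the rows are hypothetical, and no ranking is attempted.

```python
from collections import defaultdict

def build_index(rows):
    """Map each keyword to the set of URLs it appears on."""
    index = defaultdict(set)
    for url, keyword in rows:
        index[keyword.lower()].add(url)
    return index

def search(index, query):
    """Return the URLs that match every term in the query, sorted."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)

rows = [  # hypothetical (url, keyword) pairs
    ("http://example.com/a", "hot"),
    ("http://example.com/a", "dog"),
    ("http://example.com/b", "hot"),
    ("http://example.com/c", "dog"),
]
index = build_index(rows)
both_terms = search(index, "hot dog")
one_term = search(index, "Hot")
```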


In [18]:
data[:10]

Unnamed: 0,url,keywords,title,description
0,http://0lik.ru/cliparts/clipartfoto/128020-sto...,stock,Stock Photo - Panoramas of European Cities,8 SHQ JPEG | up to ~ 8200 x 5500 | 300 dpi | 1...
1,http://0lik.ru/cliparts/clipartfoto/128020-sto...,photo,Stock Photo - Panoramas of European Cities,8 SHQ JPEG | up to ~ 8200 x 5500 | 300 dpi | 1...
2,http://0lik.ru/cliparts/clipartfoto/128020-sto...,panoramas,Stock Photo - Panoramas of European Cities,8 SHQ JPEG | up to ~ 8200 x 5500 | 300 dpi | 1...
3,http://0lik.ru/cliparts/clipartfoto/128020-sto...,european,Stock Photo - Panoramas of European Cities,8 SHQ JPEG | up to ~ 8200 x 5500 | 300 dpi | 1...
4,http://0lik.ru/cliparts/clipartfoto/128020-sto...,cities,Stock Photo - Panoramas of European Cities,8 SHQ JPEG | up to ~ 8200 x 5500 | 300 dpi | 1...
5,http://0lik.ru/templates/othert/241992-3-real-...,real,3 Real Estate Business Card Templates PSD,3 Real Estate Business Card Templates PSD PSD ...
6,http://0lik.ru/templates/othert/241992-3-real-...,estate,3 Real Estate Business Card Templates PSD,3 Real Estate Business Card Templates PSD PSD ...
7,http://0lik.ru/templates/othert/241992-3-real-...,business,3 Real Estate Business Card Templates PSD,3 Real Estate Business Card Templates PSD PSD ...
8,http://0lik.ru/templates/othert/241992-3-real-...,card,3 Real Estate Business Card Templates PSD,3 Real Estate Business Card Templates PSD PSD ...
9,http://0lik.ru/templates/othert/241992-3-real-...,templates,3 Real Estate Business Card Templates PSD,3 Real Estate Business Card Templates PSD PSD ...


In [21]:
for index, row in data[:10].iterrows():
    print(row['keywords'])
    print(row['title'])

stock
Stock Photo - Panoramas of European Cities
photo
Stock Photo - Panoramas of European Cities
panoramas
Stock Photo - Panoramas of European Cities
european
Stock Photo - Panoramas of European Cities
cities
Stock Photo - Panoramas of European Cities
real
3 Real Estate Business Card Templates PSD
estate
3 Real Estate Business Card Templates PSD
business
3 Real Estate Business Card Templates PSD
card
3 Real Estate Business Card Templates PSD
templates
3 Real Estate Business Card Templates PSD


In [56]:
data['description'].unique()

array([ u'8 SHQ JPEG | up to ~ 8200 x 5500 | 300 dpi | 169 Mb RAR LetitBit.netVip-File.com  128020',
       u'3 Real Estate Business Card Templates PSD PSD | AI | Font info | 7.69 MB 3 Real Estate Business Card Templates PSD PSD | AI | Font info | 7.69 MB Letitbit.net \u0421\u043a\u0430\u0447\u0430\u0442\u044c 3 Real Estate Business   241992',
       u'', ...,
       u'OVE AKADEMSKE GODINE, 2010./2011. GOSTOVAT \u0106E OVI PROFESORI IZ HRVATSKE: prof. dr. Mira MENAC MIHALI\u0106, (Dijalektologija s akcentologijom hrvatskog jezika 1, 2), prof. dr. Marko SAMARD\u017dIJA (Povijest hrvatkskog jezika (naglasak na 19. i 20. stolje\u0107e), prof. dr. Stipe Botica i prof. dr. Jelena Lu\u017eina\xa0 (Kultura i civilizacija u Hrvata)... Sva su ta predavanja obvezna za&hellip;',
       u'Fronzoli edoardiani, tipici della Belle \xc9poque britannica, cappelli a larghe falde e mises da marinaio si mescolano nei looks della P/E 2010 della stilista neozelandese Karen Walker. Lo stile trae ispirazione 

In [72]:
kw_data = data[['url', 'title', 'description']].loc[data['keywords'] == 'cities'][:10]
len(kw_data['title'])

10

In [67]:
data['url'].loc[data['keywords'] == 'cities']

4        http://0lik.ru/cliparts/clipartfoto/128020-sto...
9835        http://distancebetween.in/from/Tiruchchirapali
15582                       http://headshops.us/montana/e/
22308    http://metabolismofcities.org/people/325-maria...
22312    http://metabolismofcities.org/people/610-phili...
23981             http://netindian.in/book/export/html/473
48843    http://www.dentalby.com/endodontist-germany/de...
49593         http://www.distancefromto.net/city-la-paz-ph
49597    http://www.distancefromto.net/city-lancenigo-v...
49604      http://www.distancefromto.net/city-port-st-mary
80614    http://www.tcmaker.org/forum/viewtopic.php?p=7693
83961     http://www.ubercities.us/uber-in-burning-fork-ky
85595      http://www.w3-directory.com/events-Taichung.php
Name: url, dtype: object

In [49]:
urls

4        http://0lik.ru/cliparts/clipartfoto/128020-sto...
9835        http://distancebetween.in/from/Tiruchchirapali
15582                       http://headshops.us/montana/e/
22308    http://metabolismofcities.org/people/325-maria...
22312    http://metabolismofcities.org/people/610-phili...
23981             http://netindian.in/book/export/html/473
48843    http://www.dentalby.com/endodontist-germany/de...
49593         http://www.distancefromto.net/city-la-paz-ph
49597    http://www.distancefromto.net/city-lancenigo-v...
49604      http://www.distancefromto.net/city-port-st-mary
80614    http://www.tcmaker.org/forum/viewtopic.php?p=7693
83961     http://www.ubercities.us/uber-in-burning-fork-ky
85595      http://www.w3-directory.com/events-Taichung.php
Name: url, dtype: object

In [48]:
for url in urls:
    print(url)

http://0lik.ru/cliparts/clipartfoto/128020-stock-photo-panoramas-of-european-cities.html
http://distancebetween.in/from/Tiruchchirapali
http://headshops.us/montana/e/
http://metabolismofcities.org/people/325-mariane-planchon
http://metabolismofcities.org/people/610-philippe-bouillard
http://netindian.in/book/export/html/473
http://www.dentalby.com/endodontist-germany/dentist-in-beller-8/
http://www.distancefromto.net/city-la-paz-ph
http://www.distancefromto.net/city-lancenigo-villorba-it
http://www.distancefromto.net/city-port-st-mary
http://www.tcmaker.org/forum/viewtopic.php?p=7693
http://www.ubercities.us/uber-in-burning-fork-ky
http://www.w3-directory.com/events-Taichung.php


In [None]:
# no ads
kw_path = '/Users/lxu213/data/ad-free-search-engine/spark-warehouse/output/part-00000-cce80d6f-9481-49c6-a24f-0ea08880f341-c000.snappy.parquet'
data_adfree = pq.read_table(kw_path, use_threads=True).to_pandas()
data_adfree.head()

In [30]:
# with adwords in links
data.describe()
# without adwords in links > removed about 25% of web pages
data_adfree.describe()

Unnamed: 0,url,keywords
count,115170,115170
unique,22551,35387
top,https://www.bookrenter.com/linda-d-urden-dnsc-...,de
freq,20,901


Percent of crawled web pages that contain (detected) ad links:

In [27]:
100 - (100*data_adfree.count()['url']/data.count()['url'])

26
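
The whole-number result above appears to come from Python 2 integer division, which truncates the ratio. A small sketch of the same calculation with float arithmetic follows; `percent_removed` is a hypothetical helper and the counts are made up, not the notebook's actual row counts.

```python
def percent_removed(total_rows, adfree_rows):
    """Share of rows dropped by the ad filter, as a float percentage."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return 100.0 * (total_rows - adfree_rows) / total_rows

pct = percent_removed(200, 148)  # hypothetical counts
```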

### Flask
http://flask.pocoo.org/

Build a simple search GUI that uses the sample Parquet files to render a results page for a given search keyword.

In [70]:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run(host='0.0.0.0')
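
One way the results page could be produced is sketched below in plain Python, assuming Python 3 and rows of `(url, title, description)` like the `kw_data` frame above. `render_results` is a hypothetical helper and the sample row is made up; a Flask view could return the string it builds.

```python
import html

def render_results(query, results):
    """Build the HTML body for a results page; every value is escaped."""
    items = "".join(
        '<li><a href="{0}">{1}</a><p>{2}</p></li>'.format(
            html.escape(url, quote=True),
            html.escape(title),
            html.escape(description),
        )
        for url, title, description in results
    )
    return "<h1>Results for {0}</h1><ul>{1}</ul>".format(html.escape(query), items)

page = render_results(
    "cities",
    [("http://example.com/?a=1&b=2", "A <b> title", "About cities")],  # made-up row
)
```

Escaping matters here because crawled titles and descriptions are untrusted text and may contain markup.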

In [None]:
$ cd data/ad-free-search-engine/
$ . venv/bin/activate
$ deactivate 