Commit

indexwarcsjob.py: update hard-coded path to be correct (TODO: load from env)

reqs: update to correct python-hadoop git
ikreymer committed Feb 22, 2016
1 parent 7041826 commit 0ee5018
Showing 2 changed files with 4 additions and 4 deletions.
4 changes: 2 additions & 2 deletions indexwarcsjob.py
@@ -64,7 +64,7 @@ def mapper_init(self):
'surt_ordered': True,
'sort': True,
'cdxj': True,
- 'minimal': True
+ #'minimal': True
}

def mapper(self, _, line):
@@ -77,7 +77,7 @@ def mapper(self, _, line):

def _conv_warc_to_cdx_path(self, warc_path):
# set cdx path
- cdx_path = warc_path.replace('common-crawl/crawl-data', 'cdx2')
+ cdx_path = warc_path.replace('common-crawl/crawl-data', 'common-crawl/cc-index/cdx')
cdx_path = cdx_path.replace('.warc.gz', '.cdx.gz')
return cdx_path
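
The commit message leaves loading this path from the environment as a TODO. A minimal sketch of one way that could look follows. It is not part of this commit: the variable names `WARC_SRC_PREFIX` and `CDX_DEST_PREFIX` and the standalone function `conv_warc_to_cdx_path` are hypothetical, and the defaults are the values hard-coded above.

```python
import os

# Defaults mirror the values hard-coded in this commit.
DEFAULT_SRC_PREFIX = 'common-crawl/crawl-data'
DEFAULT_DEST_PREFIX = 'common-crawl/cc-index/cdx'


def conv_warc_to_cdx_path(warc_path, environ=os.environ):
    """Map a WARC path to its CDX path, reading the prefixes from the environment.

    WARC_SRC_PREFIX and CDX_DEST_PREFIX are hypothetical variable names,
    not something the repository defines.
    """
    src_prefix = environ.get('WARC_SRC_PREFIX', DEFAULT_SRC_PREFIX)
    dest_prefix = environ.get('CDX_DEST_PREFIX', DEFAULT_DEST_PREFIX)

    # Same two replacements as _conv_warc_to_cdx_path, with configurable prefixes.
    cdx_path = warc_path.replace(src_prefix, dest_prefix)
    cdx_path = cdx_path.replace('.warc.gz', '.cdx.gz')
    return cdx_path
```

Passing `environ` as a parameter keeps the function easy to test without touching the real process environment. On a cluster, the variables would also have to reach the task processes that run the mapper.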

4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,5 +1,5 @@
mrjob
boto
pywb
#-e git+https://github.com/ikreymer/pywb.git@0.9.0b#egg=pywb
-e git+https://github.com/matteobertozzi/Hadoop.git#egg=hadoop&subdirectory=python-hadoop
#-e git+https://github.com/matteobertozzi/Hadoop.git#egg=hadoop&subdirectory=python-hadoop
-e git+https://github.com/commoncrawl/python-hadoop.git#egg=master
