# Natural Language Processing

## Material

- https://arxiv.org/pdf/1805.03818.pdf
- https://cs.stanford.edu/people/chrismre/papers/dd.pdf
- https://cs.stanford.edu/people/chrismre/papers/deepdive_highlight.pdf
- https://arxiv.org/pdf/1711.10160.pdf

## Setup

In [None]:
%%bash

cd ~/
curl -LOJ http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip
pip install stanfordnlp warc3-wet bs4 requests

## CoreNLP

In [22]:
from os import environ
from pprint import pprint
from stanfordnlp.server import CoreNLPClient

environ['CORENLP_HOME'] = f'{environ["HOME"]}/stanford-corenlp-full-2018-10-05/'

### Set Up Parsing Pipeline

In [35]:
annot = ['parse']
nlp = CoreNLPClient(
    annotators=annot,
    timeout=3000,
    memory='16G'
)

### Parse Sentence and Print Tree 

In [80]:
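def prpr(t, v=False):
    """Convert a parse tree into nested lists; include node labels when v is True."""
    # Pass v down so labels are kept at every level, not only at the root.
    w = [prpr(c, v) for c in t.child]
    w = [t.value, w] if v else w
    # Leaves have no children: return the token itself.
    return w if t.child else t.value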

In [85]:
nlp.register_properties_key('multiparse', {'parse.kbest': 10})
doc = nlp.annotate('I went to the store to get ice cream.', properties_key='multiparse')

pprint(prpr(doc.sentence[0].kBestParseTrees[0]))

[[[['I']],
  [['went'],
   [['to'], [['the'], ['store']]],
   [[['to'], [['get'], [['ice'], ['cream']]]]]],
  ['.']]]
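
The nested lists returned by `prpr` drop the constituent labels, so it can help to recover the token sequence or the nesting depth from them. A minimal sketch, assuming the list structure printed above; `leaves` and `depth` are hypothetical helper names, not part of stanfordnlp:

```python
# The nested list printed above, copied as a literal.
tree = [[[['I']],
         [['went'],
          [['to'], [['the'], ['store']]],
          [[['to'], [['get'], [['ice'], ['cream']]]]]],
         ['.']]]

def leaves(node):
    """Return the tokens of a nested-list tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [tok for child in node for tok in leaves(child)]

def depth(node):
    """Return the number of list levels above the deepest token."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node)

tokens = leaves(tree)
print(' '.join(tokens))
print(depth(tree))
```

Both helpers only rely on the distinction between strings (tokens) and lists (constituents), so they also work on subtrees such as `tree[0][1]`.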


## Common Crawl

In [283]:
import gzip, io, json, http.client  # http.client must be imported explicitly
import requests, warc, bs4

In [261]:
cc_base = 'https://commoncrawl.s3.amazonaws.com'
cc_path = 'crawl-data/CC-MAIN-2019-47'
cc_idx = f'{cc_path}/cc-index.paths.gz'
cc_url = f'{cc_base}/{cc_path}'

### List WARC files

In [262]:
res = requests.get(f'{cc_url}/warc.paths.gz')
# Strip the final newline so the list does not end with an empty string.
paths = gzip.decompress(res.content).decode('utf8').rstrip('\n').split('\n')

### List Crawled URLs

In [272]:
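# The index path listing names the gzipped shards of the URL index.
res = requests.get(f'{cc_base}/{cc_idx}')
shards = gzip.decompress(res.content).decode('utf8').rstrip('\n').split('\n')

# This downloads and decompresses one whole shard in memory.
res = requests.get(f'{cc_base}/{shards[1]}')
urls = gzip.decompress(res.content).decode('utf8').rstrip('\n').split('\n')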

### Map crawl records

In [285]:
# Each index line is '<SURT URL key> <timestamp> <JSON metadata>'.
url, timestamp, data = urls[0].split(' ', 2)
crawled_at = json.loads(data)
crawled_at
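
Each line of an index shard holds three space-separated fields: a SURT-formatted URL key, a 14-digit capture timestamp, and a JSON object describing where the capture is stored. A minimal sketch of mapping one line onto a typed record; the sample line and its values are invented for illustration, and `IndexEntry` and `parse_index_line` are hypothetical names:

```python
import json
from datetime import datetime
from typing import NamedTuple

class IndexEntry(NamedTuple):
    urlkey: str          # SURT-formatted key, e.g. 'com,example)/'
    captured: datetime   # capture time parsed from the 14-digit timestamp
    meta: dict           # decoded JSON metadata (url, status, filename, ...)

def parse_index_line(line: str) -> IndexEntry:
    """Split one index line into key, timestamp and decoded JSON metadata."""
    urlkey, timestamp, data = line.split(' ', 2)
    captured = datetime.strptime(timestamp, '%Y%m%d%H%M%S')
    return IndexEntry(urlkey, captured, json.loads(data))

# Hypothetical sample line; the values are invented, not taken from the crawl.
sample = ('com,example)/ 20191112000000 '
          '{"url": "http://example.com/", "status": "200", '
          '"filename": "crawl-data/example.warc.gz", '
          '"offset": "1024", "length": "2048"}')

entry = parse_index_line(sample)
print(entry.urlkey, entry.captured.isoformat(), entry.meta['url'])
```

Splitting with `maxsplit=2` matters because the JSON part itself contains spaces.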

### Read records from WARC archive

In [133]:
res = requests.get(f'{cc_base}/{paths[0]}')
arc = warc.warc.WARCFile(
    fileobj=io.BytesIO(gzip.decompress(res.content))
)

In [246]:
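# The file starts with a `warcinfo` record, which has no target URI,
# so skip ahead to the first HTTP response.
record = next(r for r in arc if r['WARC-Type'] == 'response')
url = record['WARC-Target-URI']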
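
Downloading and decompressing a full WARC file is expensive when only a few pages are needed. Common Crawl stores each record as its own gzip member, so the `filename`, `offset`, and `length` fields of an index entry are enough to fetch a single record with an HTTP Range request. A sketch under that assumption; `byte_range` and `fetch_record` are hypothetical helpers, and the standard library's `urllib` is used here only to keep the sketch dependency-free (`requests.get(..., headers=...)` would serve equally well). `fetch_record` needs network access, so it is defined but not called:

```python
import gzip
import urllib.request

def byte_range(offset: int, length: int) -> str:
    """Build an inclusive HTTP Range header value covering one record."""
    return f'bytes={offset}-{offset + length - 1}'

def fetch_record(base: str, filename: str, offset: int, length: int) -> bytes:
    """Download and decompress a single WARC record (network required)."""
    req = urllib.request.Request(
        f'{base}/{filename}',
        headers={'Range': byte_range(offset, length)},
    )
    with urllib.request.urlopen(req) as res:
        return gzip.decompress(res.read())

print(byte_range(1024, 2048))
```

The index stores `offset` and `length` as strings, so they would need converting with `int(...)` before being passed in.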

### Parse records

In [None]:
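# A response payload is an HTTP message: status line, headers, blank line, body.
raw = record.payload.read()
head, _, html = raw.partition(b'\r\n\r\n')
status_line, _, header_block = head.partition(b'\r\n')  # parse_headers expects headers only
headers = http.client.parse_headers(io.BytesIO(header_block + b'\r\n\r\n'))
body = bs4.BeautifulSoup(html, 'html.parser')
links = [link.get('href') for link in body.find_all('a')]
text = body.text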
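
The `href` values collected above can be `None` (anchors without an `href`), relative paths, fragments, or non-web schemes. A minimal sketch of normalizing them against the record's URL with the standard library; `absolute_links` is a hypothetical helper and the sample values are invented:

```python
from urllib.parse import urljoin, urldefrag, urlparse

def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping unique http(s) links in order."""
    seen, out = set(), []
    for href in hrefs:
        if not href:
            continue  # skip anchors without an href
        # Resolve relative paths, then drop any '#fragment' part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ('http', 'https') and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out

# Invented sample values covering the common cases.
hrefs = [None, '/about', 'news/today.html', '#top',
         'mailto:me@example.com', 'https://example.org/x', '/about#team']
resolved = absolute_links('http://example.com/blog/index.html', hrefs)
print(resolved)
```

In the notebook this would be called as `absolute_links(url, links)`, with `url` taken from the record's `WARC-Target-URI` header.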

## Data sources to add

- Wikipedia Corpus
- Reddit Corpus
- Reddit API
- Twitter API
- Bing API
- Google API