# GIAN 9a: Processing Scraped Data

## 1. Importing the data
We will import the data that we have scraped from Language Log. This data has been stored in a *json* file, giving structured access to the different elements of the data.

In [None]:
import spacy
import en_core_web_sm
nlp=en_core_web_sm.load(disable=["parser", "tagger", "ner"])

In [None]:
import json
from collections import defaultdict, Counter

In [None]:
with open("language_log.json", "r", encoding="utf-8") as f_in:
    posts=json.load(f_in)

In [None]:
len(posts)

Let's look at what the structured data for a post looks like. As an example, we print the author of each comment on one post.

In [None]:
for comment in posts[10]['comments']:
    print(comment["author"])

Here is a simple function to extract the text data (the entry and the comment bodies) from a post.

In [None]:
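def extract_text(post):
    """Return the entry and all comment bodies of a post as one string."""
    output=[post['entry']]
    # posts without comments simply contribute their entry
    for comment in post.get('comments',[]):
        output.append(comment['body'])
    return "\n".join(output)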

In [None]:
# perform a sanity check
extract_text(posts[0])
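
To make the expected structure explicit, here is a minimal sketch with a made-up post. Only the keys `entry`, `comments`, `author`, and `body` come from the code above; the values (and any other keys the real file may contain) are assumptions for illustration.

```python
# Hypothetical post that mirrors the keys used in the cells above.
toy_post = {
    "entry": "Main text of the post.",
    "comments": [
        {"author": "A. Reader", "body": "First comment."},
        {"author": "B. Reader", "body": "Second comment."},
    ],
}

# Same logic as extract_text: the entry first, then each comment body,
# joined with newlines.
toy_parts = [toy_post["entry"]] + [c["body"] for c in toy_post.get("comments", [])]
toy_text = "\n".join(toy_parts)
print(toy_text)
```

Using `.get('comments', [])` means a post without a `comments` key still yields its entry text instead of raising a `KeyError`.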

## 2. Tokenizing the corpus
We will first take all of the posts and make them into one big blob of text.

Let's try this first on 100 posts. We can remove the limit later.

In [None]:
corpus_blob="\n".join([extract_text(post) for post in posts[:100]])
# raise spaCy's document size limit (default: 1,000,000 characters) so the blob fits in one document
nlp.max_length=len(corpus_blob)+1

In [None]:
corpus=nlp(corpus_blob)

Our corpus has now been tokenized, and we can answer some basic questions.

How many tokens are in the corpus?

In [None]:
nwords=len(corpus)
print(nwords)

Some basic statistics about the corpus

In [None]:
word_lengths=[len(word) for word in corpus] # lengths of all the words
mean_word_length=sum(word_lengths)/nwords
min_word_length=min(word_lengths)
max_word_length=max(word_lengths)

In [None]:
print("Mean word length:", mean_word_length)
print("Shortest word length:", min_word_length)
print("Longest word length:", max_word_length)

We can inspect words of different lengths

In [None]:
def words_of_length(words, length):
    result=[word.lower_ for word in words if len(word)==length]
    return(set(result))

In [None]:
# inspect the longest tokens, using the maximum length computed above
print("words", words_of_length(corpus, max_word_length))

## 3. Counting words

Word frequencies are the basic building blocks of many text mining techniques. Once text has been tokenized it becomes very easy to count how often each particular word occurs in that text.

It is good to remember that:

+ We call each particular word a word *type*
+ We call every occurrence of a word a word *token*
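
The distinction can be made concrete with a toy sentence. This is a minimal sketch that splits on whitespace with `str.split` instead of using the spaCy tokenizer, and the sentence itself is invented for illustration.

```python
from collections import Counter

# Invented example sentence; whitespace splitting stands in for spaCy here.
toy_tokens = "the cat saw the dog and the bird".split()
toy_counts = Counter(toy_tokens)

n_tokens = len(toy_tokens)  # every occurrence counts as a token
n_types = len(toy_counts)   # each distinct word counts once as a type
print("tokens:", n_tokens)
print("types:", n_types)
print("tokens of the type 'the':", toy_counts["the"])
```

The sentence has more tokens than types because the type "the" occurs several times.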

In [None]:
from collections import Counter
from math import log10 # only log10 is needed below
import re

In [None]:
corpus_wf=Counter([word.orth_ for word in corpus]) # frequency of the word in the corpus

In [None]:
corpus_wf.most_common(10)

In [None]:
# Let's make all words lowercase and get rid of punctuation!
# (keep only tokens that contain at least one word character)
corpus_wf=Counter([word.lower_ for word in corpus if re.search(r"\w+", word.lower_)])

In [None]:
corpus_wf.most_common(10)

Let's convert the counts to relative frequencies, expressed as occurrences per million tokens.

In [None]:
n=sum(corpus_wf.values())
corpus_fpm=Counter({word: (frequency/n*1E6) for word, frequency in corpus_wf.items()})

In [None]:
corpus_fpm.most_common(10)
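
As a small worked example of the per-million scaling, here is a sketch with invented counts (they are not taken from the Language Log corpus, and the `toy_` names are hypothetical).

```python
from collections import Counter

# Invented counts for illustration only.
toy_wf = Counter({"the": 50, "language": 30, "log": 20})
toy_n = sum(toy_wf.values())  # total number of tokens in the toy corpus

# Scale each raw count to occurrences per million tokens.
toy_fpm = {word: count / toy_n * 1e6 for word, count in toy_wf.items()}
print(toy_fpm["the"])
```

Because the values are normalized by corpus size, they can be compared across corpora of different sizes, which raw counts cannot.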

And Zipf values (the base-10 logarithm of the frequency per billion words):

In [None]:
n=sum(corpus_wf.values())
ntypes=len(corpus_wf) # number of distinct word types
# log10 of the smoothed frequency per billion words (+1 per type, so +ntypes in the total)
corpus_zipf = Counter({word: log10((f+1)/(n+ntypes)*1E9) for word, f in corpus_wf.items()})

In [None]:
corpus_zipf.most_common(10)
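
The formula in the cell above adds 1 to every count and the number of types to the total, which is a form of add-one (Laplace) smoothing. The sketch below applies the same formula to invented counts; the `toy_` names are hypothetical and exist only for this illustration.

```python
from math import log10

# Invented counts for illustration only.
toy_wf = {"the": 60, "language": 30, "log": 10}
toy_n = sum(toy_wf.values())  # total tokens
toy_ntypes = len(toy_wf)      # distinct types

def toy_zipf(count, n=toy_n, ntypes=toy_ntypes):
    """Smoothed log10 frequency per billion words, same formula as above."""
    return log10((count + 1) / (n + ntypes) * 1e9)

for word, count in toy_wf.items():
    print(word, round(toy_zipf(count), 2))

# A word that never occurs (count 0) still gets a finite value because of the +1.
print("unseen", round(toy_zipf(0), 2))
```

Without the smoothing term, a zero count would lead to `log10(0)`, which raises a `ValueError` in Python.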

In [None]:
# sort the frequencies in descending order for the graph
frequencies=sorted(corpus_zipf.values(),reverse=True)
ranks=list(range(1, len(frequencies)+1)) # ranks conventionally start at 1

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.plot(ranks, frequencies)