### *Exercise*: Just a couple of examples from the book: Work through the exercises NLPP1e 3.12: 6, 30.

#### 6.
Describe the class of strings matched by the following regular expressions.
    1. [a-zA-Z]+
    2. [A-Z][a-z]*
    3. p[aeiou]{,2}t
    4. \d+(\.\d+)?
    5. ([^aeiou][aeiou][^aeiou])*
    6. \w+|[^\w\s]+

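1. Matches one or more ASCII letters (upper or lower case), i.e. alphabetic strings without digits, punctuation, or accented characters.
2. Matches an uppercase ASCII letter followed by zero or more lowercase letters, e.g. a capitalized word such as *Hello* or a single capital such as *A*.
3. Matches *p*, followed by zero to two of the vowels *a, e, i, o, u*, followed by *t*, e.g. *pt*, *pat*, *paet*, but not *paaat*.
4. Matches an unsigned integer or decimal number: one or more digits, optionally followed by a decimal point and one or more digits. A leading minus sign is not part of the match.
5. ...
6. Matches either a run of word characters (letters, digits, underscore) or a run of characters that are neither word characters nor whitespace, such as punctuation. Whether non-ASCII letters count as word characters depends on the Python version and the re.UNICODE flag.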

Test your answers using nltk.re_show().

In [2]:
import nltk
string = 'Hello, world. paat paet paaat. 1, 23, 2.3, -23. Óðinn'
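# Raw strings keep backslashes such as \d and \w intact
reg_exp = [r'[a-zA-Z]+', r'[A-Z][a-z]*', r'p[aeiou]{,2}t',
           r'\d+(\.\d+)?', r'([^aeiou][aeiou][^aeiou])*',
           r'\w+|[^\w\s]+']

# Show each pattern, then the test string with every match wrapped in braces
for pattern in reg_exp:
    print(pattern)
    nltk.re_show(pattern, string)
    print('---')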

[a-zA-Z]+
{Hello}, {world}. {paat} {paet} {paaat}. 1, 23, 2.3, -23. Óð{inn}
---
[A-Z][a-z]*
{Hello}, world. paat paet paaat. 1, 23, 2.3, -23. Óðinn
---
p[aeiou]{,2}t
Hello, world. {paat} {paet} paaat. 1, 23, 2.3, -23. Óðinn
---
\d+(\.\d+)?
Hello, world. paat paet paaat. {1}, {23}, {2.3}, -{23}. Óðinn
---
([^aeiou][aeiou][^aeiou])*
{Hello,} {wor}l{}d{}.{} {}p{}a{}a{}t{} {}p{}a{}e{}t{} {}p{}a{}a{}a{}t{}.{} {}1{},{} {}2{}3{},{} {}2{}.{}3{},{} {}-{}2{}3{}.{} {}�{}�{}�{�in}n{}
---
\w+|[^\w\s]+
{Hello}{,} {world}{.} {paat} {paet} {paaat}{.} {1}{,} {23}{,} {2}{.}{3}{,} {-}{23}{.} {Óð}{inn}
---
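
nltk.re_show() marks every match anywhere in the string, which is why pattern 5 produces so many empty `{}` pairs: the `*` allows zero repetitions, so the pattern also matches the empty string at every position. As a complement, the sketch below checks whether a pattern matches a *whole* string. The helper `matches_whole` and the test strings are illustrative assumptions, not part of the book exercise.

```python
import re

def matches_whole(pattern, s):
    """True if the pattern matches the entire string s (hypothetical helper)."""
    # Wrap the pattern in a non-capturing group and anchor it at the end
    return re.match(r'(?:%s)\Z' % pattern, s) is not None

# (pattern, test string, expected result)
checks = [
    (r'[a-zA-Z]+', 'Hello', True),
    (r'[a-zA-Z]+', 'Hello1', False),
    (r'[A-Z][a-z]*', 'A', True),
    (r'[A-Z][a-z]*', 'HeLLo', False),
    (r'p[aeiou]{,2}t', 'pt', True),
    (r'p[aeiou]{,2}t', 'paaat', False),
    (r'\d+(\.\d+)?', '2.3', True),
    (r'\d+(\.\d+)?', '-23', False),
    # Zero or more non-vowel, vowel, non-vowel triples
    (r'([^aeiou][aeiou][^aeiou])*', '', True),
    (r'([^aeiou][aeiou][^aeiou])*', 'Hello,', True),
    (r'([^aeiou][aeiou][^aeiou])*', 'paat', False),
    (r'\w+|[^\w\s]+', 'world', True),
    (r'\w+|[^\w\s]+', '?!', True),
    (r'\w+|[^\w\s]+', 'a.b', False),
]

for pattern, s, expected in checks:
    assert matches_whole(pattern, s) == expected, (pattern, s)
print('all checks passed')
```

Anchoring the match makes the description of each class easier to verify than scanning the brace-marked output by eye.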


#### 30.
Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

In [3]:
# Raw text
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

print(raw)

# Tokenize the raw text
tokens = nltk.word_tokenize(raw)

DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony.


In [4]:
# Porter Stemmer
porter = nltk.PorterStemmer()
for t in tokens:
    print(porter.stem(t))

DENNI
:
Listen
,
strang
women
lie
in
pond
distribut
sword
is
no
basi
for
a
system
of
govern
.
Suprem
execut
power
deriv
from
a
mandat
from
the
mass
,
not
from
some
farcic
aquat
ceremoni
.


In [5]:
# Lancaster Stemmer
lancaster = nltk.LancasterStemmer()
for t in tokens:
    print(lancaster.stem(t))

den
:
list
,
strange
wom
lying
in
pond
distribut
sword
is
no
bas
for
a
system
of
govern
.
suprem
execut
pow
der
from
a
mand
from
the
mass
,
not
from
som
farc
aqu
ceremony
.


### Exercises: TF-IDF and the branches of philosophy.

Setup. We want to start from a clean version of the philosopher pages with as little wiki markup as possible. We needed the markup earlier to extract the links, but now we want a readable version. The Wikipedia API can return one directly: request prop=extracts&exlimit=max&explaintext instead of prop=revisions as we did earlier. The API then returns the text without links and other markup.
* Use this method to retrieve a clean copy of every philosopher's page text. You can, of course, also clean the existing pages using regular expressions if you like (but that's probably more work).
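
A minimal sketch of how the plain text can be pulled out of such a response. The sample response is a hand-written stand-in (the page id, title, and text are made up) that follows the layout of a `format=json` query result, where each page sits under `query` and `pages`, keyed by its page id. The helper name `extract_text` is hypothetical.

```python
import json

# Hand-written stand-in for an API response (hypothetical page id and text)
sample_response = json.loads(u'''
{"batchcomplete": "",
 "query": {"pages": {"12345": {"pageid": 12345, "ns": 0,
                               "title": "Some Philosopher",
                               "extract": "Some Philosopher was a thinker."}}}}
''')

def extract_text(wikijson):
    """Return the plain-text extracts of all pages in a query response."""
    pages = wikijson.get('query', {}).get('pages', {})
    # Missing pages carry no 'extract' key, so they contribute nothing
    return [page['extract'] for page in pages.values() if 'extract' in page]

print(extract_text(sample_response))
```

Working on the `extract` field rather than on the raw JSON text keeps keys such as `pageid` and `title` out of the word lists built later.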

In [6]:
# Create a list of all philosophers from six different branches
import io
import re

branches_of_phi = ['aestheticians', 'epistemologists',
                   'ethicists', 'logicians', 'metaphysicians',
                   'social_and_political_philosophers']

all_phi = []
for phi in branches_of_phi:
    # Each file holds the wiki markup of one branch's list page
    with io.open('./wikitext_' + phi + '.txt', 'r', encoding='utf8') as f:
        # Page links look like [[Name]]; capture the text between the brackets
        branch_of_phi = re.findall(r'\[\[(.*?)\]\]', f.read())
    all_phi = all_phi + branch_of_phi

# Remove duplicates from the combined list and sort alphabetically
philosophers = sorted(set(all_phi))

In [1]:
import urllib
import urllib2
import json

# Function for retrieving json data from Wikipedia
# 'name' is expected to be a UTF-8 encoded byte string
def jsonWiki( name ):
    # Parameters for retrieving page from Wikipedia
    baseurl = 'https://en.wikipedia.org/w/api.php?'
    action = "action=query"
    # Percent-encode the title so that non-ASCII names form a valid URL
    title = "titles=" + urllib.quote(name)
    # rvprop belongs to prop=revisions and is not needed for extracts
    content = "prop=extracts&exlimit=max&explaintext"
    dataformat = "format=json"

    # Construct the query
    query = "%s%s&%s&%s&%s" % (baseurl, action, title, content, dataformat)

    # Download the page and parse the JSON response
    wikiresponse = urllib2.urlopen(query)
    wikisource = wikiresponse.read()
    wikijson = json.loads(wikisource)

    return wikijson

In [7]:
# Go through the list 'philosophers' and download their respective pages
for i in philosophers:
    # Encode the philosopher's name (unicode) as a UTF-8 byte string
    nameStr = i.encode("utf-8")
    
    # Replace whitespace with underscores
    name_url = re.sub(r'\s+', '_', nameStr)
    
    with io.open('./philosophers_json/' + name_url + '.json', 'w', encoding='utf8') as json_file:
        json_file.write(unicode(json.dumps(jsonWiki(name_url), ensure_ascii=False)))

In [8]:
# Find where each of the six branches starts in 'all_phi'
branch_start_list = [0]
branch_start = 0
for phi in branches_of_phi:
    with io.open('./wikitext_' + phi + '.txt', 'r', encoding='utf8') as f:
        branch_of_phi = re.findall(r'\[\[(.*?)\]\]', f.read())
    # Running total: the index in 'all_phi' where the next branch begins
    branch_start = branch_start + len(branch_of_phi)
    branch_start_list.append(branch_start)

# Extract philosophers from 'all_phi' list into their branches by using the size of their branch
aestheticians = all_phi[branch_start_list[0]:branch_start_list[1]]
epistemologists = all_phi[branch_start_list[1]:branch_start_list[2]]
ethicists = all_phi[branch_start_list[2]:branch_start_list[3]]
logicians = all_phi[branch_start_list[3]:branch_start_list[4]]
metaphysicians = all_phi[branch_start_list[4]:branch_start_list[5]]
social_and_political = all_phi[branch_start_list[5]:branch_start_list[6]]

The exercise.
* First, check out [the wikipedia page for TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Explain in your own words the point of TF-IDF.
    * What does TF stand for?
    * What does IDF stand for?

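TF stands for term frequency: how often a term occurs in a document. IDF stands for inverse document frequency: a measure of how rare the term is across the documents of the corpus. The TF-IDF weight combines the two and is widely used in information retrieval and text mining as a statistical measure of how important a word is to a document in a collection or corpus. The weight increases in proportion to the number of times the word appears in the document but is offset by the number of documents in the corpus that contain the word. Words that are common everywhere (such as *the*) therefore score low, while words that are characteristic of a single document score high.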
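
The idea can be sketched in a few lines. This is one common variant among the several listed on the Wikipedia page, assuming term frequency is the count divided by the document length and IDF is log(N / df), with N the number of documents and df the number of documents containing the term. The function name `tf_idf` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Map each document name to a {term: tf-idf weight} dictionary.

    documents: dict of document name -> list of tokens.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))

    weights = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = float(len(tokens))
        # tf = share of the document taken up by the term, idf = log(N / df)
        weights[name] = {term: (count / total) * math.log(n_docs / float(df[term]))
                         for term, count in counts.items()}
    return weights

# Toy corpus with two "branches" (made-up tokens)
toy = {
    'ethics': ['virtue', 'duty', 'virtue', 'theory'],
    'logic': ['proof', 'theory', 'proof', 'proof'],
}
scores = tf_idf(toy)
print(sorted(scores['ethics'].items(), key=lambda kv: -kv[1]))
```

In the toy corpus *theory* occurs in both documents, so its IDF is log(2/2) = 0 and it drops out, while *virtue* is frequent in one document and absent from the other and gets the highest weight there.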

* Since we want to find out which words are important for each branch, we're going to create six large documents, one per branch of philosophy. Tokenize the pages, and combine the tokens into one long list per branch. Remember the bullets below for success.
    * If you don't know what tokenization means, go back and read Chapter 3 again. This advice is valid for every cleaning step below.
    * Exclude philosopher names (since we're interested in the words, not the names).
    * Exclude punctuation.
    * Exclude stop words (if you don't know what stop words are, go back and read NLPP1e again).
    * Exclude numbers (since they're difficult to interpret in the word cloud).
    * Set everything to lower case.
    * Note that none of the above has to be perfect. It might not be easy to remove all philosopher names. And there's some room for improvisation. You can try using stemming. In my own first run the results didn't look so nice, because some pages are very detailed and repeat certain words again and again and again, whereas other pages are very short. For that reason, I decided to use the unique set of words from each page rather than each word in proportion to how it's actually used on that page. Choices like that are up to you.
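
The cleaning steps above can be sketched as a single function. This is a rough illustration under stated assumptions, not a definitive pipeline: the stop word set is a tiny hand-picked sample (`nltk.corpus.stopwords.words('english')` would be a natural replacement once that corpus is downloaded), names are removed word by word, and the function name `clean_tokens` and its `unique` switch are hypothetical.

```python
import re

# Tiny illustrative stop word list (assumption, see the note above)
STOP_WORDS = {'the', 'a', 'of', 'and', 'in', 'was', 'is', 'to', 'his'}

def clean_tokens(text, names, stop_words=STOP_WORDS, unique=False):
    """Return lower-cased word tokens without names, stop words, or numbers."""
    # Runs of ASCII letters only: drops punctuation and numbers in one step
    tokens = re.findall(r'[a-z]+', text.lower())
    # Treat every part of every philosopher name as a word to exclude
    name_parts = set(part for name in names
                     for part in re.findall(r'[a-z]+', name.lower()))
    # Single letters are mostly leftovers such as the "s" of a possessive
    tokens = [t for t in tokens
              if len(t) > 1 and t not in stop_words and t not in name_parts]
    # Optionally keep each word once per page, as suggested in the last bullet
    return sorted(set(tokens)) if unique else tokens

page = "Immanuel Kant (1724-1804) was a German philosopher. Kant's ethics is the ethics of duty."
print(clean_tokens(page, ['Immanuel Kant']))
print(clean_tokens(page, ['Immanuel Kant'], unique=True))
```

Restricting tokens to ASCII letters also drops accented words, which is a trade-off of this simple approach; stemming could be added as a further step after the filtering.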

In [25]:
# Get the words from the first five aestheticians' pages
aestheticians_words = []
for name in aestheticians[0:5]:
    name_url = re.sub(r'\s+', '_', name)
    with io.open('./philosophers_json/' + name_url + '.json', 'r', encoding='utf8') as f:
        pages = json.load(f)['query']['pages']
    # Use only the plain text under 'extract', not the JSON keys around it
    for page in pages.values():
        text = page.get('extract', u'')
        aestheticians_words.append(re.findall(r'[a-zA-Z]+', text))

print(aestheticians_words)

