<div align="center">
    <h1><a href="index.ipynb">Knowledge Discovery in Digital Humanities</a></h1>
</div>

<div align="center">
    <h2>Class 01. Course overview</h2>
    <img src="img/labyrinth.png" width="300">
</div>

### Table of contents

- [Introduction](#Introduction)
- [Content samples](#Content-samples)
- [Syllabus](#Syllabus)

### Introduction

This course teaches how to obtain valuable knowledge from a variety of online information sources. The approach is to access online resources, extract their text data, and apply text analysis to discover interesting patterns and support decision making, with an emphasis on practical applications.

A strong background in text processing is necessary because the information found in online communities is generated by human beings. This information is therefore especially valuable for discovering knowledge about people's opinions and preferences, in addition to the many other kinds of knowledge that we encode in text.

Due to the interdisciplinary nature of this course, students from all backgrounds, but especially from the humanities, social sciences, computer science and engineering, are strongly encouraged to take it. There are no prerequisites or antirequisites.

### Content samples

#### Knowledge
> Knowledge is *"justified true belief"*. **(Plato, *Theaetetus*, 201c-d)**

> Trust is a firm belief in the reliability, truth, ability, or strength of someone or something. ***(Oxford Dictionaries)***

Sources of knowledge in epistemology:
- **Perception**. Knowledge is acquired through the experience of the senses. This is the base of empiricism.
- **Reason**. Knowledge, at least in part, is derived from previous knowledge through pure reasoning. This is the base of rationalism.
- **Introspection**. Knowledge of one's self can be found through internal self-evaluation.
- **Memory**. Knowledge arises by remembering past information or events.
- **Testimony**. Knowledge is directly obtained from others' knowledge.

Principle of testimony:
> If *B* knows *q*, *B* shares *q* with *C*, and *C* trusts *B*, then *C* knows *q* too.
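
The principle can be read as an inference rule. The following is a minimal Python sketch of it, not a formal epistemic model: the names `apply_testimony`, `knows`, `shares` and `trusts` are hypothetical and chosen only for illustration.

```python
def apply_testimony(knows, shares, trusts):
    """Return a new knowledge mapping after one round of testimony.

    knows:  dict mapping an agent to the set of propositions it knows
    shares: iterable of (speaker, proposition, listener) triples
    trusts: set of (listener, speaker) pairs, meaning listener trusts speaker
    """
    # Copy so that the input mapping is left untouched
    result = {agent: set(props) for agent, props in knows.items()}
    for speaker, prop, listener in shares:
        speaker_knows = prop in knows.get(speaker, set())
        listener_trusts = (listener, speaker) in trusts
        if speaker_knows and listener_trusts:
            result.setdefault(listener, set()).add(prop)
    return result


knows = {'B': {'q'}, 'C': set()}
shares = [('B', 'q', 'C')]

# C trusts B, so C comes to know q
trusting = apply_testimony(knows, shares, trusts={('C', 'B')})
# Without trust, the shared proposition is not taken up
distrusting = apply_testimony(knows, shares, trusts=set())
```

All three conditions of the principle must hold: dropping the trust relation leaves *C*'s knowledge unchanged even though *B* shared *q*.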

#### Social web
<br/>
<div align="center">
    <figure>
        <img src="img/toh_hashtags.png">
        <figcaption>Hashtags used by the <em>Taste of Home</em> community on Twitter</figcaption>
    </figure>
</div>

#### Programming in Python
Exercise:

All the ebooks from [Project Gutenberg](http://www.gutenberg.org/) share the same format: the first 20 lines of each file contain metadata such as the title, author, release date, language, and encoding. Given the ebook at the URL [http://www.gutenberg.org/files/2554/2554.txt](http://www.gutenberg.org/files/2554/2554.txt), print its title and author.

Solution:
```
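from urllib.request import urlopen

url = 'http://www.gutenberg.org/files/2554/2554.txt'
# The metadata sits in the first 20 lines, so stop reading after them
for counter, raw_line in enumerate(urlopen(url), start=1):
    # urlopen yields bytes in Python 3; decode before comparing with text
    line = raw_line.decode('utf-8', errors='replace')
    if line.startswith(('Title:', 'Author:')):
        print(line.strip())
    if counter == 20:
        break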
```

Result:
```
Title: Crime and Punishment
Author: Fyodor Dostoevsky
```
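
The same idea extends to collecting every `Key: value` pair in the header. The sketch below works on an in-memory sample instead of downloading the file, so it runs offline. The function name `parse_header`, the `max_lines` parameter and the sample text are assumptions made for illustration, not the full contents of the ebook.

```python
def parse_header(lines, max_lines=20):
    """Collect 'Key: value' metadata pairs from the first max_lines lines."""
    metadata = {}
    for line in lines[:max_lines]:
        key, sep, value = line.partition(':')
        # Keep only lines that look like a short field name followed by a value
        if sep and key.strip() and value.strip() and len(key.split()) <= 2:
            metadata[key.strip()] = value.strip()
    return metadata


# Illustrative sample only; the real header contains more lines and fields
sample = [
    'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky',
    '',
    'Title: Crime and Punishment',
    '',
    'Author: Fyodor Dostoevsky',
]
metadata = parse_header(sample)
print(metadata['Title'])
print(metadata['Author'])
```

Returning a dictionary rather than printing inside the loop makes the metadata reusable, for instance when comparing several ebooks.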

#### Text processing
Exercise:
- Build a tagger by combining a trigram, a bigram, a unigram and a regular expression tagger (for the default case)
- Use it to tag a sentence
- Evaluate its performance

Solution:
```
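import nltk

# patterns, train_sents and test_sents are assumed to be defined beforehand:
# a list of (regex, tag) pairs and two lists of tagged sentences
t0 = nltk.RegexpTagger(patterns)
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

# tag() expects a list of tokens, so the pre-tokenized sentence is split
sent = ("The Fulton County Grand Jury said Friday an investigation of "
        "Atlanta's recent primary election produced `` no evidence '' "
        "that any irregularities took place .").split()
t3.tag(sent)
t3.evaluate(test_sents)  # recent NLTK releases name this method accuracy()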
```

Result:
```
[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', 'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', 'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', 'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

0.8620552177813217
```
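
To make the backoff idea concrete without depending on NLTK or a corpus download, the following is a minimal pure-Python sketch of a unigram tagger that falls back to regular expression patterns, together with an accuracy measure. It is not NLTK's implementation: the class names, the toy training data and the patterns are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


class RegexBackoff:
    """Tag a word with the first pattern that matches it."""

    def __init__(self, patterns):
        self.patterns = [(re.compile(regex), tag) for regex, tag in patterns]

    def tag_word(self, word):
        for regex, tag in self.patterns:
            if regex.match(word):
                return tag
        return None


class UnigramBackoff:
    """Tag a word with its most frequent training tag, else ask the backoff."""

    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.best:
            return self.best[word]
        return self.backoff.tag_word(word) if self.backoff else None

    def tag(self, tokens):
        return [(word, self.tag_word(word)) for word in tokens]

    def accuracy(self, gold_sents):
        # Fraction of tokens whose predicted tag equals the gold tag
        pairs = [(self.tag_word(w), t) for sent in gold_sents for w, t in sent]
        return sum(pred == gold for pred, gold in pairs) / len(pairs)


# Toy data, invented for this sketch
train = [[('the', 'AT'), ('jury', 'NN'), ('said', 'VBD')]]
patterns = [(r'.*ed$', 'VBD'), (r'.*s$', 'NNS'), (r'.*', 'NN')]
tagger = UnigramBackoff(train, backoff=RegexBackoff(patterns))

tagged = tagger.tag(['the', 'jury', 'produced', 'irregularities'])
score = tagger.accuracy([[('the', 'AT'), ('election', 'NN'), ('took', 'VBD')]])
```

Words seen in training get their most frequent tag, and unseen words fall through to the patterns, with the catch-all `.*` pattern playing the role of the default case. Bigram and trigram taggers extend the same chain by looking up the word together with the preceding tags before backing off.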

### Syllabus

[Go to the syllabus of the course](syllabus.ipynb).