# Accessing Text Corpora and Lexical Resources

## Corpus Linguistics
- A corpus is a large, structured collection of existing texts.
- Ideally, a corpus contains documents selected for variety.
- Corpora are used for:
    - Training data for various problems
    - Statistical analysis and hypothesis testing
    - Validating linguistic rules within a specific language territory

## Types of Corpora
- Plain text corpora
    - Project Gutenberg (digitized cultural works, founded in 1971)
    - British National Corpus (a 100-million-word collection: about 90% written text, e.g. newspapers, academic books, letters, and essays, and about 10% spoken text, e.g. informal conversations and radio shows; the spoken part is also available as audio)
    - Presidential inaugural addresses
    - The Universal Declaration of Human Rights (translated into >300 languages)
    - CHILDES (conversations between parents and children)
    - Wikipedia
    - Google Books
    - The entire Web
- Annotated corpora
    - Brown corpus (has part of speech tags)
    - Penn Treebank (complete parse trees of sentences, mostly from the Wall Street Journal)
    - SemCor (distinguishes word senses)
    - LDC (Linguistic Data Consortium) (many corpora in many languages for different purposes)

Corpora can also be categorized along other lines, such as monolingual vs. multilingual...


## Text Corpus Structure in NLTK
![text-corpus-structure.png](attachment:text-corpus-structure.png)

- **Isolated** text corpora are just collections of texts.
- **Categorized** text corpora group texts into categories that might correspond to genre, source, author, language, etc.
- Sometimes these categories **overlap**, notably in the case of topical categories as a text can be relevant to more than one topic.
- Some text collections have **temporal** structure, news collections being the most common example.

## Gutenberg Corpus
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books.

In [None]:
%matplotlib inline 
import nltk

from nltk.corpus import gutenberg

In [None]:
gutenberg.fileids()

In [None]:
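for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # characters, including whitespace
    num_words = len(gutenberg.words(fileid))  # tokens, including punctuation
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # distinct lowercased tokens
    # average word length, average sentence length, lexical diversity
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)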
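
The loop prints three statistics per text: average word length, average sentence length, and the average number of times each vocabulary item is used (a lexical diversity score). Below is a minimal sketch of the same arithmetic on a tiny hand-made text, so it runs without any downloaded corpus. The function name `text_stats` and the toy data are assumptions for illustration. Because the raw character count includes whitespace, the first figure somewhat overstates the true word length.

```python
def text_stats(raw, words, sents):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    vocab = set(w.lower() for w in words)
    return (
        len(raw) / len(words),
        len(words) / len(sents),
        len(words) / len(vocab),
    )

# Toy stand-ins for gutenberg.raw(), .words() and .sents() (hypothetical data).
sents = [["The", "cat", "sat", "."], ["The", "cat", "slept", "."]]
words = [w for sent in sents for w in sent]
raw = "The cat sat. The cat slept."

avg_word_len, avg_sent_len, uses_per_item = text_stats(raw, words, sents)
print(avg_word_len, avg_sent_len, uses_per_item)
```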

## Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.

In [None]:
from nltk.corpus import brown

In [None]:
brown.categories()

In [None]:
brown.fileids()

In [None]:
brown.words(categories='news')

In [None]:
brown.sents(fileids=['cg22'])

In [None]:
brown.paras(categories=['news'])

In [None]:
brown.raw(categories=['news', 'editorial', 'reviews'])

In [None]:
nltk.corpus.brown.tagged_words(categories='news')

### Let's compare genres in their usage of modal verbs.

In [None]:
from nltk.corpus import brown
news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

We want to count word frequencies separately for each genre. `ConditionalFreqDist` computes conditional frequencies from a sequence of pairs; here each pair is (genre, word).

In [None]:
list((genre, word) for genre in brown.categories() for word in brown.words(categories=genre))[:20]

In [None]:
cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))

In [None]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
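
Conceptually, a conditional frequency distribution is one frequency counter per condition. The following is a rough standard-library sketch of that idea, not NLTK's implementation. The (genre, word) pairs are made-up toy data and `conditional_counts` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Group (condition, event) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table

# Toy (genre, word) pairs standing in for the Brown Corpus (hypothetical data).
pairs = [
    ("news", "will"), ("news", "can"), ("news", "will"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]
counts = conditional_counts(pairs)

# Print a small table, similar in spirit to cfd.tabulate().
modals = ["can", "could", "will"]
print("genre".ljust(10) + "".join(m.rjust(7) for m in modals))
for genre in sorted(counts):
    print(genre.ljust(10) + "".join(str(counts[genre][m]).rjust(7) for m in modals))
```

Looking up a word that never occurred under a condition, such as `counts["news"]["could"]`, gives 0 rather than raising an error, which is what makes tabulating a fixed list of samples convenient.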

## Reuters Corpus
- The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test".
- Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. 
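
Overlapping categories mean that a single document can carry several topics. The sketch below assumes a corpus object that exposes `fileids()` and `categories(fileid)`, as the Brown reader used above does. `ToyCorpus`, `multi_topic_files`, and the file ids are hypothetical stand-ins so the example runs without downloaded data. If the Reuters data is installed, passing `nltk.corpus.reuters` instead should behave the same way.

```python
class ToyCorpus:
    """Hypothetical stand-in for a categorized corpus reader."""

    def __init__(self, topics_by_file):
        self._topics_by_file = topics_by_file

    def fileids(self):
        return sorted(self._topics_by_file)

    def categories(self, fileid):
        return sorted(self._topics_by_file[fileid])


def multi_topic_files(corpus, min_topics=2):
    """Map each file id that has at least `min_topics` topics to its topic list."""
    result = {}
    for fileid in corpus.fileids():
        topics = corpus.categories(fileid)
        if len(topics) >= min_topics:
            result[fileid] = topics
    return result


# Made-up file ids and topics, for illustration only.
toy = ToyCorpus({
    "doc1": ["grain", "wheat"],
    "doc2": ["gold"],
    "doc3": ["corn", "grain", "wheat"],
})
overlapping = multi_topic_files(toy)
print(overlapping)
```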

## Inaugural Address Corpus
This corpus has temporal structure: each file is one address, and its file id starts with the year.

In [None]:
from nltk.corpus import inaugural
inaugural.fileids()[:10]

In [None]:
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])  # the first four characters of the file id are the year
    for target in ['america', 'citizen']
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    if w.lower().startswith(target))
cfd.plot()
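
The plot relies on two small steps: slicing the year out of each file id, and counting words that start with a target prefix. Here is a minimal sketch of those steps on made-up data. The file ids and token lists are hypothetical and only mimic the year-first naming pattern.

```python
from collections import Counter

# Hypothetical miniature corpus: file id -> list of tokens.
toy_addresses = {
    "1789-Example.txt": ["Fellow", "Citizens", "of", "the", "Senate"],
    "1793-Example.txt": ["Fellow", "citizens", "of", "America"],
}
targets = ["america", "citizen"]

# One Counter per target, keyed by year (the first four characters of the file id).
counts_by_target = {target: Counter() for target in targets}
for fileid, tokens in toy_addresses.items():
    year = fileid[:4]
    for token in tokens:
        for target in targets:
            if token.lower().startswith(target):
                counts_by_target[target][year] += 1

for target in targets:
    print(target, sorted(counts_by_target[target].items()))
```

Prefix matching after lowercasing is deliberately loose: `citizen` also matches `Citizens`, and `america` would also match `American`.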

## Corpora in Other Languages

In [None]:
nltk.corpus.cess_esp.words()

In [None]:
nltk.corpus.indian.words('hindi.pos')

The `udhr` corpus contains the Universal Declaration of Human Rights in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1.

In [None]:
nltk.corpus.udhr.words('Javanese-Latin1')[11:]

# Lexical Resources
- A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions.
- A lexical entry consists of a **headword** (also known as a **lemma**) along with additional information such as the part of speech and the sense definition. Two distinct words having the same spelling are called **homonyms**.
![lexicon.png](attachment:lexicon.png)
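
One way to picture a lexical entry is as a small record. The sketch below only illustrates the terminology. The `LexicalEntry` class, its field names, and the sample definitions are assumptions, not an NLTK data structure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    headword: str        # the lemma
    part_of_speech: str
    sense: str           # a short sense definition


# Two entries share a spelling but are distinct words.
lexicon = [
    LexicalEntry("saw", "verb", "past tense of see"),
    LexicalEntry("saw", "noun", "cutting instrument"),
    LexicalEntry("corpus", "noun", "a body of text"),
]

# Group entries by headword; a headword with several entries is a homonym candidate.
by_headword = defaultdict(list)
for entry in lexicon:
    by_headword[entry.headword].append(entry)

homonyms = sorted(h for h, entries in by_headword.items() if len(entries) > 1)
print(homonyms)
```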

## Wordlist Corpora
-  The **Words** Corpus: used by some spell checkers

In [None]:
print(nltk.corpus.words.words()[:30])

Use it to detect unusual words in a text:

In [None]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

In [None]:
print(unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))[:30])

In [None]:
print(unusual_words(nltk.corpus.nps_chat.words())[:30])

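The **Stopwords** corpus contains high-frequency words such as *the* and *to*, which are often filtered out before further processing.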

In [None]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

Let's define a function to compute what fraction of words in a text are not in the stopwords list:

In [None]:
def content_fraction(text):
    # A set makes each membership test fast instead of scanning a list.
    stopword_set = set(nltk.corpus.stopwords.words('english'))
    content = [w for w in text if w.lower() not in stopword_set]
    return len(content) / len(text)

In [None]:
content_fraction(nltk.corpus.reuters.words())

One more wordlist corpus is the Names corpus, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names which appear in both files, i.e. names that are ambiguous for gender:

In [None]:
names = nltk.corpus.names
names.fileids()

In [None]:
male_names = names.words('male.txt')
female_names = set(names.words('female.txt'))  # set for fast membership tests
[w for w in male_names if w in female_names]

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])  # condition: file (gender), event: last letter of the name
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()

Other lexical resources in NLTK include:
- **A Pronouncing Dictionary** (`nltk.corpus.cmudict`): a slightly richer kind of resource is a table (or spreadsheet) containing a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary for US English, which was designed for use by speech synthesizers.
- **Comparative Wordlists** (e.g. `nltk.corpus.swadesh`): lists of about 200 common words in several languages.
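
A pronouncing dictionary can be pictured as rows of (word, phoneme list), where one word may appear in several rows. The sketch below groups such rows into a word-to-pronunciations mapping. The three rows are typed in by hand in the CMU style and `group_pronunciations` is a hypothetical helper, so nothing here needs the `cmudict` data.

```python
from collections import defaultdict

def group_pronunciations(entries):
    """Collect every pronunciation listed for each word."""
    prons = defaultdict(list)
    for word, phonemes in entries:
        prons[word].append(phonemes)
    return dict(prons)

# Hand-typed sample rows: (word, list of phoneme codes); digits mark stress.
entries = [
    ("fir", ["F", "ER1"]),
    ("fire", ["F", "AY1", "ER0"]),
    ("fire", ["F", "AY1", "R"]),
]
prons = group_pronunciations(entries)
print(len(prons["fire"]), prons["fir"])
```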

## WordNet
WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. 