# **Practical Understanding of a Corpus and Dataset**

## What is a corpus?

A corpus is a collection of written or spoken natural language material, stored on a computer
and used to find out how language is used. More precisely, a corpus is a systematic,
computerized collection of authentic language that is used for linguistic analysis and
corpus analysis. The plural of corpus is **corpora**.

Sometimes, NLP applications use a single corpus as the input, and at other times, they use multiple corpora as input.

Corpora fall into three types, based on the number of languages they contain:
* **Monolingual corpus**: This type of corpus has one language
* **Bilingual corpus**: This type of corpus has two languages
* **Multilingual corpus**: This type of corpus has more than two languages

In [11]:
import nltk

from nltk.corpus import brown as cb
from nltk.corpus import gutenberg as cg

In [17]:
# cb.fileids()     returns the file IDs of the corpus
# cb.categories()  returns the categories of the corpus
# cb.raw()         returns the raw content of the corpus
# cb.words()       returns the words of the whole corpus
# cb.sents()       returns the sentences of the corpus, each as a list of words
cb.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

#### Exercise
1. Calculate the number of words in the Brown corpus file with the file ID `cc12`.
2. Create your own corpus file, load it using nltk, and then check the frequency
distribution of that corpus.

In [20]:
len(cb.words(fileids=["cc12"]))

2342

In [22]:
nltk.data.load('corpora/mycorpus/worter.txt', format='raw')

b'Vorspeise\n\ndie Zwiebelsuppe\ndie Fleischsuppe\ndie Fishsuppe\n\n\n_______________________________________________________\nHauptgericht\n\ndas Fischpizza\ndie Eierpizza\ndie Zwiebelpizza\nder Fleischkuchen\nder Fishsalat\n\n_______________________________________________________\nDessert/Nachtisch\n\nder Obstsalat\nder Zitronenkuchen\ndas Zitroneneis\nder Milchreis\n\n'

In [27]:
import os, os.path
from nltk.corpus.reader import PlaintextCorpusReader

corpus_root = os.path.expanduser('~/nltk_data/corpora/mycorpus')
corpus = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1')

In [30]:
corpus.words()

['Vorspeise', 'die', 'Zwiebelsuppe', 'die', ...]
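
The second exercise also asks for a frequency distribution. A minimal sketch of that step follows, using `collections.Counter` from the standard library on a few tokens copied by hand from the start of the raw file shown earlier. The `sample_words` list is an illustrative stand-in, not the full corpus. Because `nltk.FreqDist` is a subclass of `Counter`, building `nltk.FreqDist(corpus.words())` in the notebook should support the same `most_common` call.

```python
from collections import Counter

# Hand-copied tokens from the first lines of worter.txt (illustrative sample only);
# in the notebook, corpus.words() would take the place of this list.
sample_words = ["Vorspeise", "die", "Zwiebelsuppe", "die", "Fleischsuppe", "die", "Fishsuppe"]

# Count how often each token occurs.
freq = Counter(sample_words)

# Show the single most frequent token together with its count.
print(freq.most_common(1))
```

On this sample, the article `die` is the most frequent token because it precedes each of the three soup names.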

## Understanding types of data attributes

##### Categorical or qualitative data attributes

Categorical or qualitative data attributes have the following characteristics:

* These kinds of data attributes describe qualities rather than measured quantities
* Examples include written notes, the corpora provided by `nltk`, and a corpus that records different breeds of dogs, such as collie, shepherd, and terrier

There are two sub-types of categorical data attributes:

* **Ordinal data**: This type of data attribute measures non-numeric concepts whose values have a natural order, such as satisfaction level, happiness level, discomfort level, and so on.
* **Nominal data**: This type of data attribute records categories that don't overlap and have no inherent order. Example: What is your gender? The answer is either male or female, and the answers are not overlapping.
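
The difference between the two sub-types shows up in which operations make sense. The sketch below reuses the satisfaction level and dog breed examples from above. The three-level scale and the sample responses are hypothetical values chosen for illustration.

```python
from collections import Counter

# Ordinal: the levels are labels, but they have a meaningful order.
# The three-level scale is a hypothetical example.
satisfaction_scale = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(satisfaction_scale)}

responses = ["high", "low", "medium", "low"]

# Sorting and taking a maximum are meaningful because the levels are ordered.
ordered = sorted(responses, key=rank.__getitem__)
highest = max(responses, key=rank.__getitem__)

# Nominal: the labels have no order, so counting and equality checks
# are the meaningful operations.
breeds = ["collie", "terrier", "collie", "shepherd"]
breed_counts = Counter(breeds)

print(ordered)
print(highest)
print(breed_counts["collie"])
```

Sorting the breeds alphabetically would still run, but the resulting order would carry no meaning for the data, which is what separates nominal from ordinal attributes.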

##### Numeric or quantitative data attributes

Numeric or quantitative data attributes have the following characteristics:

* These kinds of data attributes are numeric and represent a measurable quantity
* Examples: Financial data, population of a city, weight of people, and so on

There are two sub-types of numeric data attributes:
* **Continuous data**: These kinds of data attributes can take any value within a range. Examples: If you are recording the weight of students from 10 to 12 years of age, the weights you collect are continuous data; the measurements in the Iris flower dataset are another example.
* **Discrete data**: Discrete data can only take certain values. Examples: If you are rolling two dice, you can only have the resultant values of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12; you never get 1 or 1.5 as a result if you are rolling two dice. Take another example: If you toss a coin once, the number of heads is either 0 or 1.

Numeric data attributes are a major part of analytics applications.
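
The two-dice example can be checked by enumerating every possible roll, and it contrasts with a continuous attribute such as weight. This is a small sketch, and the weights in `weights_kg` are made-up values for illustration.

```python
# Discrete: enumerate every outcome of rolling two six-sided dice
# and collect the distinct sums.
dice_sums = sorted({a + b for a in range(1, 7) for b in range(1, 7)})
print(dice_sums)

# Continuous: a weight can take any value within a range.
# These weights (in kilograms) are hypothetical.
weights_kg = [31.4, 35.25, 38.9]
mean_weight = sum(weights_kg) / len(weights_kg)
print(mean_weight)
```

The set of dice sums is fixed and countable, while the mean weight is a value that need not appear anywhere in the recorded data.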

In [33]:
import nltk.corpus
dir(nltk.corpus)

['AlignedCorpusReader',
 'AlpinoCorpusReader',
 'BCP47CorpusReader',
 'BNCCorpusReader',
 'BracketParseCorpusReader',
 'CHILDESCorpusReader',
 'CMUDictCorpusReader',
 'CategorizedBracketParseCorpusReader',
 'CategorizedCorpusReader',
 'CategorizedPlaintextCorpusReader',
 'CategorizedSentencesCorpusReader',
 'CategorizedTaggedCorpusReader',
 'ChasenCorpusReader',
 'ChunkedCorpusReader',
 'ComparativeSentencesCorpusReader',
 'ConllChunkCorpusReader',
 'ConllCorpusReader',
 'CorpusReader',
 'CrubadanCorpusReader',
 'DependencyCorpusReader',
 'EuroparlCorpusReader',
 'FramenetCorpusReader',
 'IEERCorpusReader',
 'IPIPANCorpusReader',
 'IndianCorpusReader',
 'KNBCorpusReader',
 'LazyCorpusLoader',
 'LinThesaurusCorpusReader',
 'MTECorpusReader',
 'MWAPPDBCorpusReader',
 'MacMorphoCorpusReader',
 'NKJPCorpusReader',
 'NPSChatCorpusReader',
 'NombankCorpusReader',
 'NonbreakingPrefixesCorpusReader',
 'OpinionLexiconCorpusReader',
 'PPAttachmentCorpusReader',
 'PanLexLiteCorpusReader',
 'Panle