# Lexicons
## What is a lexicon?
In linguistics, a lexicon is a language's inventory of lexemes. Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalogue of a language's words (its wordstock); and a grammar, a system of rules that allows those words to be combined into meaningful sentences. The lexicon is also thought to include bound morphemes, which cannot stand alone as words (such as most affixes).

## Stopwords
In computing, stop words are words that are filtered out before or after natural language data (text) is processed. Although "stop words" usually refers to the most common words in a language, there is no single universal list used by all natural language processing tools, and some tools use no such list at all. Some tools deliberately keep stop words in order to support phrase search.

In [1]:
from nltk.corpus import stopwords
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
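
As a rough illustration of the filtering step described above, the sketch below removes stop words from a list of tokens. The helper name `remove_stop_words` and the small `sample_stop_words` list are assumptions made for this example; with NLTK and its stopwords corpus available, `stopwords.words('english')` could be passed in instead.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    # A set gives constant-time membership tests.
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]


# Hypothetical mini stop list, for illustration only.
sample_stop_words = ['i', 'me', 'my', 'we', 'our', 'you', 'the', 'into']

tokens = ['My', 'dog', 'chased', 'our', 'cat']
filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)  # ['dog', 'chased', 'cat']
```

The comparison is case-insensitive because the NLTK list shown above is all lowercase, while tokens at the start of a sentence are usually capitalized.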

## CMU Pronouncing Dictionary

In [2]:
import nltk
entries = nltk.corpus.cmudict.entries()
print(len(entries))
for entry in entries[10000:10010]:
    print(entry)

133737
('belford', ['B', 'EH1', 'L', 'F', 'ER0', 'D'])
('belfry', ['B', 'EH1', 'L', 'F', 'R', 'IY0'])
('belgacom', ['B', 'EH1', 'L', 'G', 'AH0', 'K', 'AA0', 'M'])
('belgacom', ['B', 'EH1', 'L', 'JH', 'AH0', 'K', 'AA0', 'M'])
('belgard', ['B', 'EH0', 'L', 'G', 'AA1', 'R', 'D'])
('belgarde', ['B', 'EH0', 'L', 'G', 'AA1', 'R', 'D', 'IY0'])
('belge', ['B', 'EH1', 'L', 'JH', 'IY0'])
('belger', ['B', 'EH1', 'L', 'G', 'ER0'])
('belgian', ['B', 'EH1', 'L', 'JH', 'AH0', 'N'])
('belgians', ['B', 'EH1', 'L', 'JH', 'AH0', 'N', 'Z'])
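
Each entry is a `(word, phones)` pair, and a word with several pronunciations (such as `belgacom`) appears more than once. The sketch below, which uses a few entries copied from the output above so that it runs without the corpus, groups pronunciations by word and counts syllables by counting the phones that carry a stress digit. The helper names `build_pronunciations` and `count_syllables` are assumptions for this example; NLTK also offers `nltk.corpus.cmudict.dict()`, which returns a similar word-to-pronunciations mapping.

```python
from collections import defaultdict

# Sample entries copied from the output above: (word, list of phones).
sample_entries = [
    ('belfry', ['B', 'EH1', 'L', 'F', 'R', 'IY0']),
    ('belgacom', ['B', 'EH1', 'L', 'G', 'AH0', 'K', 'AA0', 'M']),
    ('belgacom', ['B', 'EH1', 'L', 'JH', 'AH0', 'K', 'AA0', 'M']),
    ('belgian', ['B', 'EH1', 'L', 'JH', 'AH0', 'N']),
]


def build_pronunciations(entries):
    """Group entries into {word: [phones, phones, ...]}."""
    index = defaultdict(list)
    for word, phones in entries:
        index[word].append(phones)
    return dict(index)


def count_syllables(phones):
    """Count vowel phones, which are the ones ending in a stress digit."""
    return sum(1 for phone in phones if phone[-1].isdigit())


pronunciations = build_pronunciations(sample_entries)
print(len(pronunciations['belgacom']))                 # 2
print(count_syllables(pronunciations['belgian'][0]))   # 2
print(count_syllables(pronunciations['belgacom'][0]))  # 3
```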


## WordNet

In [3]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')

[Synset('car.n.01')]

In [4]:
wn.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

# Implementing Tokenization

## Testing Sentence Tokenizer

In [5]:
import nltk
text = 'My dog got bored and chased our cat. The cat jumped into my arms'
sent = nltk.sent_tokenize(text)
print(sent)

['My dog got bored and chased our cat.', 'The cat jumped into my arms']


## Testing Word Tokenizer

In [6]:
word = nltk.word_tokenize(text)
print(word)

['My', 'dog', 'got', 'bored', 'and', 'chased', 'our', 'cat', '.', 'The', 'cat', 'jumped', 'into', 'my', 'arms']
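
To give a feel for what the two tokenizers do, here is a deliberately simplified regex-based sketch. It is not how NLTK implements `sent_tokenize` or `word_tokenize`; the function names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this example, and the approach mishandles abbreviations such as "Dr." and contractions such as "don't", which NLTK's tokenizers are designed to cope with.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = 'The cat jumped. It landed in my arms!'
sentences = simple_sent_tokenize(sample)
print(sentences)  # ['The cat jumped.', 'It landed in my arms!']

words = simple_word_tokenize(sentences[1])
print(words)  # ['It', 'landed', 'in', 'my', 'arms', '!']
```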
