### One-Hot Encoding
- Gives each category its own binary feature, so the encoding does not impose an artificial order on the categories

In [1]:
from sklearn.feature_extraction import DictVectorizer

In [2]:
onehot_encoder = DictVectorizer()
instances = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'}
]
print(onehot_encoder.fit_transform(instances).toarray())

[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
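
To make the column layout of the output above easier to read, here is a from-scratch sketch of the same encoding in plain Python. It is not how `DictVectorizer` is implemented; it only assumes that columns are named `feature=value` and sorted alphabetically, which is what produces the column order shown above. The names `feature_names`, `column_of`, and `encoded` are illustrative.

```python
# From-scratch one-hot encoding of a single categorical feature (illustrative only).
instances = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'},
]

# One column per distinct "feature=value" pair, in sorted order.
feature_names = sorted(
    {f"{key}={value}" for row in instances for key, value in row.items()}
)
column_of = {name: i for i, name in enumerate(feature_names)}

encoded = []
for row in instances:
    vector = [0.0] * len(feature_names)
    for key, value in row.items():
        # Switch on the single column that matches this category.
        vector[column_of[f"{key}={value}"]] = 1.0
    encoded.append(vector)

print(feature_names)
for vector in encoded:
    print(vector)
```

Each row has exactly one non-zero entry, and no city is "greater" than another, which is the point of avoiding integer codes such as 0, 1, 2.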


## Extracting features from text
### The bag-of-words representation
- Does not encode syntax: it ignores word order and disregards grammar
- Effective for document classification and retrieval
- Uses a feature vector with one element for each word in the vocabulary
- Tokens: case-insensitive words or affixes

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]
vectorizer = CountVectorizer()

## Prints the document-term matrix: row i is the i-th document, and
## column j holds the count of the word whose vocabulary index is j
print(vectorizer.fit_transform(corpus).todense())
## Prints the vocabulary: a mapping from each word to its column index
print(vectorizer.vocabulary_)

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}
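
The matrix above can be rebuilt by hand, which helps show where each column comes from. The sketch below is a plain-Python approximation, not `CountVectorizer` itself. It assumes lowercasing, tokens of two or more word characters (which is why "I" and "a" are missing from the vocabulary), and an alphabetically sorted vocabulary. The helper names `tokenize` and `count_vector` are made up for this example.

```python
import re
from collections import Counter

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

def tokenize(text):
    # Lowercase, then keep runs of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Map each distinct token to a column index, in alphabetical order.
vocabulary = {
    word: i
    for i, word in enumerate(sorted({t for doc in tokenized for t in doc}))
}

def count_vector(tokens):
    counts = Counter(tokens)
    # Dict iteration follows insertion order, i.e. the column order.
    return [counts[word] for word in vocabulary]

matrix = [count_vector(tokens) for tokens in tokenized]
for row in matrix:
    print(row)
print(vocabulary)
```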


#### We can calculate the Euclidean distance between each pair of the vectors, to see which ones are more similar. 

In [25]:
from sklearn.metrics.pairwise import euclidean_distances
counts = vectorizer.fit_transform(corpus).todense()
counts

matrix([[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
        [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]], dtype=int64)

In [26]:
print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0], counts[1]))
print('Distance between 2nd and 3rd documents:', euclidean_distances(counts[1], counts[2]))
print('Distance between 3rd and 1st documents:', euclidean_distances(counts[2], counts[0]))

Distance between 1st and 2nd documents: [[2.44948974]]
Distance between 2nd and 3rd documents: [[2.64575131]]
Distance between 3rd and 1st documents: [[2.64575131]]
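
As a check on the numbers above, the distances can be recomputed directly from the definition: the square root of the sum of squared differences between two count vectors. This is a small sketch with a hand-written helper named `euclidean`, not the scikit-learn function.

```python
import math

# The three count vectors printed above, one per document.
counts = [
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
]

def euclidean(u, v):
    # Square root of the sum of squared element-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d12 = euclidean(counts[0], counts[1])
d23 = euclidean(counts[1], counts[2])
d31 = euclidean(counts[2], counts[0])
print(d12, d23, d31)
```

The first two documents differ in 6 positions and each of the other pairs differs in 7, so the distances are the square roots of 6 and 7. The two basketball documents are therefore the closest pair.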


- **Sparse vectors** ::= High-dimensional feature vectors in which most elements are zero
    - We want to reduce the dimensionality because higher-dimensional data require more memory and more training examples
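
One common way to save space is to store only the non-zero entries. `CountVectorizer.fit_transform` returns a SciPy sparse matrix for this reason, which is why the cells above call `.todense()` before printing. The sketch below illustrates the idea with a plain dictionary; the names `sparse`, `to_dense`, and `zero_fraction` are illustrative and this is not SciPy's storage format.

```python
# Count vector of the first document from the bag-of-words example.
dense = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

# Keep only the non-zero entries as {column index: value}.
sparse = {i: value for i, value in enumerate(dense) if value != 0}

def to_dense(sparse_vector, size):
    # Rebuild the full vector, filling the missing positions with zeros.
    return [sparse_vector.get(i, 0) for i in range(size)]

zero_fraction = 1 - len(sparse) / len(dense)
print(sparse)
print(zero_fraction)
```

With a ten-word vocabulary the saving is small, but with a realistic vocabulary of tens of thousands of words almost every entry of a document vector is zero.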

### Stop-word filtering
- Stop words include determiners, auxiliary verbs, prepositions, etc.
- They contribute meaning through grammar rather than through their denotations


In [31]:
vectorizer = CountVectorizer(stop_words = "english")
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}


#### We can see that the vocabulary list ignored stop words "the" and "in".
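
The filtering step itself is simple to sketch in plain Python. The example below assumes a tiny hand-picked stop list containing only "in" and "the"; scikit-learn's built-in English list is much longer, so `STOP_WORDS` here is a stand-in for illustration only. The tokenizer follows the same assumptions as before: lowercase tokens of two or more word characters.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

# Hypothetical, deliberately tiny stop list (not scikit-learn's list).
STOP_WORDS = {'in', 'the'}

def tokenize(text, stop_words):
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Drop every token that appears in the stop list.
    return [t for t in tokens if t not in stop_words]

tokenized = [tokenize(doc, STOP_WORDS) for doc in corpus]
vocabulary = {
    word: i
    for i, word in enumerate(sorted({t for doc in tokenized for t in doc}))
}
print(vocabulary)
```

The vocabulary shrinks from ten words to eight, so every document vector loses two dimensions.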

### Stemming
- A strategy to condense the inflected and derived forms of a word into a single feature

In [43]:
vectorizer = CountVectorizer(binary=True, stop_words='english')
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him',
    'I eat that sandwich'
]

counts = vectorizer.fit_transform(corpus).todense()
print(counts)
print(vectorizer.vocabulary_)

print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0], counts[1]))
print('Distance between 2nd and 3rd documents:', euclidean_distances(counts[1], counts[2]))
print('Distance between 3rd and 1st documents:', euclidean_distances(counts[2], counts[0]))

[[1 0 0 0 1]
 [0 0 1 1 0]
 [0 1 0 1 0]]
{'ate': 0, 'sandwiches': 4, 'sandwich': 3, 'eaten': 2, 'eat': 1}
Distance between 1st and 2nd documents: [[2.]]
Distance between 2nd and 3rd documents: [[1.41421356]]
Distance between 3rd and 1st documents: [[2.]]
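
In the output above, "sandwich" and "sandwiches" occupy separate columns, which pushes the first document away from the other two. The sketch below shows how even a crude stemmer changes that. `toy_stem` is a hypothetical rule that only strips a trailing "es" or "s"; it is far simpler than a real stemmer such as the Porter stemmer and is used here only to illustrate the effect. The token lists are the stop-word-filtered tokens from the vocabulary printed above.

```python
import math

# Tokens left in each document after stop-word filtering (see output above).
tokenized = [
    ['ate', 'sandwiches'],
    ['sandwich', 'eaten'],
    ['eat', 'sandwich'],
]

def toy_stem(token):
    # Hypothetical rule: strip a plural ending, keeping at least 3 characters.
    for suffix in ('es', 's'):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

stemmed = [[toy_stem(t) for t in doc] for doc in tokenized]
vocabulary = sorted({t for doc in stemmed for t in doc})

# Binary vectors: 1 if the stem occurs in the document, else 0.
vectors = [[1 if word in doc else 0 for word in vocabulary] for doc in stemmed]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(vocabulary)
print(euclidean(vectors[0], vectors[1]))
```

After merging the two forms of "sandwich", every pair of documents is the same distance apart. The forms "ate", "eat", and "eaten" are still separate features, because suffix stripping cannot relate irregular verb forms.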


### Lemmatization
- The process of determining the lemma, i.e. the morphological root, of a word
- Stemming, by contrast, removes all patterns of characters that appear to be affixes, which can produce tokens that are not real words

In [44]:
corpus = [
    'I am gathering ingredients for the sandwich.',
    'There were many wizards at the gathering.'
]
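
The two sentences above use "gathering" first as a verb and then as a noun, which is where lemmatization and stemming diverge. The sketch below uses a hand-written lookup table keyed by word and part of speech. The table `LEMMAS` and both helper functions are hypothetical stand-ins for illustration; a real lemmatizer would consult a lexical resource such as WordNet.

```python
# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ('gathering', 'verb'): 'gather',
    ('gathering', 'noun'): 'gathering',
    ('ingredients', 'noun'): 'ingredient',
    ('wizards', 'noun'): 'wizard',
}

def toy_lemmatize(word, pos):
    # Fall back to the lowercased word when the table has no entry.
    return LEMMAS.get((word.lower(), pos), word.lower())

def toy_stem(word):
    # Crude stemmer: strips "ing" regardless of the part of speech.
    word = word.lower()
    if word.endswith('ing') and len(word) > 5:
        return word[:-3]
    return word

print(toy_lemmatize('gathering', 'verb'))  # verb sense in the first sentence
print(toy_lemmatize('gathering', 'noun'))  # noun sense in the second sentence
print(toy_stem('gathering'))               # same result for both senses
```

The lemmatizer keeps the noun "gathering" intact while reducing the verb to "gather", whereas the stemmer collapses both senses into the same token.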

In [4]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


Exception in Tkinter callback
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/tkinter/__init__.py", line 1705, in __call__
    return self.func(*args)
  File "/usr/local/lib/python3.7/site-packages/nltk/downloader.py", line 1796, in _download
    return self._download_threaded(*e)
  File "/usr/local/lib/python3.7/site-packages/nltk/downloader.py", line 2082, in _download_threaded
    assert self._download_msg_queue == []
AssertionError

True