### One-Hot Encoding
- Represents each category as its own binary feature, so no order is implied between categories

In [1]:
from sklearn.feature_extraction import DictVectorizer

In [2]:
onehot_encoder = DictVectorizer()
instances = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'}
]
print(onehot_encoder.fit_transform(instances).toarray())

[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
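
To make the output above easier to read, here is a minimal hand-rolled sketch of the same idea without scikit-learn. It assumes, as the output suggests, that columns are assigned to category values in sorted order; the names `column_of` and `one_hot` are illustrative and not part of any library.

```python
# Hand-rolled one-hot encoding of a single categorical feature.
cities = ['New York', 'San Francisco', 'Chapel Hill']

# Sort the distinct values so each one gets a stable column index.
categories = sorted(set(cities))
column_of = {city: index for index, city in enumerate(categories)}

def one_hot(value):
    """Return a vector with a single 1.0 in the column assigned to `value`."""
    vector = [0.0] * len(categories)
    vector[column_of[value]] = 1.0
    return vector

encoded = [one_hot(city) for city in cities]
for row in encoded:
    print(row)
```

Each row contains exactly one 1.0, and the position of that 1.0 carries no notion of one city being "greater" than another.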


## Extracting features from text
### The bag-of-words representation
- Does not encode syntax, ignores the order of words, disregards all grammar
- Effective for document classification and retrieval
- Uses a feature vector with an element for each of the words
- Tokens: case-insensitive words or affixes

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
        'UNC played Duke in basketball',
        'Duke lost the basketball game',
        'I ate a sandwich'
]
vectorizer = CountVectorizer()

## Prints the count matrix: row i is the ith document, and column j
## holds how many times the jth vocabulary word appears in it
print(vectorizer.fit_transform(corpus).todense())
## Prints the vocabulary as a mapping from each word to its column index
print(vectorizer.vocabulary_)

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}
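
The matrix above can be reproduced by hand, which makes the steps explicit: lowercase the text, split it into tokens, build a sorted vocabulary, and count. This is a sketch rather than scikit-learn's actual implementation; it assumes the default `CountVectorizer` behaviour of keeping only tokens of two or more word characters, which is why "I" and "a" are missing from the vocabulary.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

# One column per distinct token, in sorted order.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})
index_of = {token: i for i, token in enumerate(vocabulary)}

def count_vector(text):
    # Count how often each vocabulary word occurs in the text.
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        vector[index_of[token]] += 1
    return vector

bag_of_words = [count_vector(doc) for doc in corpus]
for row in bag_of_words:
    print(row)
print(index_of)
```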


#### We can calculate the Euclidean distance between each pair of vectors to see which documents are more similar.

In [4]:
from sklearn.metrics.pairwise import euclidean_distances
counts = vectorizer.fit_transform(corpus).todense()
counts

matrix([[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
        [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]], dtype=int64)

In [5]:
print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0], counts[1]))
print('Distance between 2nd and 3rd documents:', euclidean_distances(counts[1], counts[2]))
print('Distance between 3rd and 1st documents:', euclidean_distances(counts[2], counts[0]))

Distance between 1st and 2nd documents: [[2.44948974]]
Distance between 2nd and 3rd documents: [[2.64575131]]
Distance between 3rd and 1st documents: [[2.64575131]]
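
As a check on the numbers above, the Euclidean distance is just the square root of the summed squared differences between two vectors. The following sketch recomputes the three distances with the standard library only; the helper name `euclidean` is illustrative.

```python
import math

# Count vectors copied from the output above.
doc1 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]
doc2 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
doc3 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

def euclidean(u, v):
    # Square root of the sum of squared element-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean(doc1, doc2))  # sqrt(6), about 2.449
print(euclidean(doc2, doc3))  # sqrt(7), about 2.646
print(euclidean(doc3, doc1))  # sqrt(7), about 2.646
```

The first two documents differ in six positions, while each of them differs from the third in seven, so the two basketball sentences are the closest pair.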


- **Sparse vectors** ::= High-dimensional feature vectors in which most elements are zero
    - We want to reduce the dimensionality because higher-dimensional data require more storage and more training examples
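
To illustrate why sparsity matters for storage, the sketch below keeps only the non-zero entries of a count vector as an index-to-value mapping. This is a simplified stand-in for what sparse matrix formats do, not a description of any particular library's layout.

```python
# Count vector of the third document from the earlier example.
dense = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

# Store only the non-zero entries, keyed by column index.
sparse = {index: value for index, value in enumerate(dense) if value != 0}

# Fraction of elements that are non-zero.
density = len(sparse) / len(dense)

print(sparse)   # {0: 1, 7: 1}
print(density)  # 0.2
```

With a realistic vocabulary of tens of thousands of words, a single document touches only a tiny fraction of the columns, so the saving grows accordingly.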

## Extending bag-of-words with TF-IDF weights
- A word that appears once in a document can carry very different weight than one that appears many times, so instead of recording only presence we count how often each word occurs in a text (its term frequency).

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
vectorizer = CountVectorizer(stop_words='english')
result = vectorizer.fit_transform(corpus).todense()
print (result,"\n", vectorizer.vocabulary_)


## Now also take document frequency into account:
## we introduce the inverse document frequency (IDF), which
## measures how rare a word is across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'The dog ate a sandwich and I ate a sandwich',
'The wizard transfigured a sandwich'
]
vectorizer = TfidfVectorizer(stop_words='english')
result = vectorizer.fit_transform(corpus).todense()
print ("\n\n",result,"\n", vectorizer.vocabulary_)

[[2 1 3 1 1]] 
 {'dog': 1, 'ate': 0, 'sandwich': 2, 'wizard': 4, 'transfigured': 3}


 [[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]] 
 {'dog': 1, 'ate': 0, 'sandwich': 2, 'wizard': 4, 'transfigured': 3}


#### We can see that words that appear in more documents of the corpus are penalized (e.g., sandwich).
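
The TF-IDF weights above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, raw term counts as TF, and L2 normalization of each row. The term counts are typed in directly from the two documents after stop-word removal.

```python
import math

# Term counts per document after English stop words are removed.
docs = [
    {'dog': 1, 'ate': 2, 'sandwich': 2},
    {'wizard': 1, 'transfigured': 1, 'sandwich': 1},
]
vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger values.
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(doc):
    # Weight each count by its IDF, then scale the row to unit length.
    weights = [doc.get(term, 0) * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf = [tfidf_vector(doc) for doc in docs]
for row in tfidf:
    print([round(w, 8) for w in row])
```

In the first document "ate" and "sandwich" both occur twice, yet "sandwich" ends up with the lower weight because it also occurs in the second document.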

### Stop-word filtering
- Determiners, auxiliary verbs, prepositions, etc.
- Contribute to meaning through grammar rather than through their denotations


In [6]:
vectorizer = CountVectorizer(stop_words = "english")
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}


#### We can see that the vocabulary list ignored stop words "the" and "in".
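
Stop-word filtering itself is a simple set lookup during tokenization. The sketch below uses a tiny, hypothetical stop list (`STOP_WORDS`) chosen for this corpus only; scikit-learn's built-in English list is much longer.

```python
import re

# Tiny illustrative stop list, NOT scikit-learn's English list.
STOP_WORDS = {'the', 'in', 'a', 'i'}

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

# Drop every token that is in the stop list.
filtered = [[t for t in tokenize(doc) if t not in STOP_WORDS] for doc in corpus]
vocabulary = sorted({token for doc in filtered for token in doc})

print(filtered)
print(vocabulary)
```

The vocabulary shrinks from ten entries to eight, matching the dimensionality of the matrix printed above.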

### Stemming
- A strategy to condense inflected and derived forms of a word into a single feature

In [7]:
vectorizer = CountVectorizer(binary=True, stop_words='english')
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him',
    'I eat that sandwich'
]

counts = vectorizer.fit_transform(corpus).todense()
print(counts)
print(vectorizer.vocabulary_)

print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0], counts[1]))
print('Distance between 2nd and 3rd documents:', euclidean_distances(counts[1], counts[2]))
print('Distance between 3rd and 1st documents:', euclidean_distances(counts[2], counts[0]))

[[1 0 0 0 1]
 [0 0 1 1 0]
 [0 1 0 1 0]]
{'ate': 0, 'sandwiches': 4, 'sandwich': 3, 'eaten': 2, 'eat': 1}
Distance between 1st and 2nd documents: [[2.]]
Distance between 2nd and 3rd documents: [[1.41421356]]
Distance between 3rd and 1st documents: [[2.]]
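
In the output above, "sandwiches" and "sandwich" are separate features, as are "eaten" and "eat", so documents with nearly the same meaning look far apart. The sketch below applies a deliberately crude suffix-stripping rule to show the effect of stemming. The `SUFFIXES` list and `toy_stem` function are hypothetical and far simpler than a real stemmer such as the Porter algorithm.

```python
import math

# Toy suffix rules for illustration only; a real stemmer has many more.
SUFFIXES = ('es', 'en', 's')

def toy_stem(token):
    # Strip the first matching suffix if at least three characters remain.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

# Tokens left in each document after stop-word removal (from the output above).
documents = [
    ['ate', 'sandwiches'],
    ['sandwich', 'eaten'],
    ['eat', 'sandwich'],
]
stemmed = [[toy_stem(t) for t in doc] for doc in documents]
vocabulary = sorted({t for doc in stemmed for t in doc})

# Binary vectors: 1 if the stem occurs in the document, else 0.
vectors = [[1 if term in doc else 0 for term in vocabulary] for doc in stemmed]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(vocabulary)                          # ['ate', 'eat', 'sandwich']
print(euclidean(vectors[0], vectors[1]))   # sqrt(2), about 1.414
print(euclidean(vectors[1], vectors[2]))   # 0.0
```

After stemming, the second and third documents become identical, but the irregular form "ate" is still not merged with "eat", which is the kind of case lemmatization is meant to handle.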


### Lemmatization
- The process of determining the lemma, i.e. the morphological root, of a word
- Stemming, by contrast, removes all character patterns that appear to be affixes, which may produce tokens that are not real words

In [8]:
corpus = [
       'I am gathering ingredients for the sandwich.',
       'There were many wizards at the gathering.'
]

In [11]:
import nltk
## The toolkit can be used after calling nltk.download()
## But I have no space left on my device for downloading.
## Check p.58 in Mastering Machine Learning with Scikit-Learn.
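
Since the NLTK data could not be downloaded here, the idea can still be sketched with a small hand-written lookup table. The table `LEMMAS` and the function `lemmatize` below are hypothetical stand-ins for a real lemmatizer (for example NLTK's `WordNetLemmatizer`, which needs the WordNet data). They only illustrate that the lemma depends on the word's part of speech: "gathering" is a verb in the first sentence and a noun in the second.

```python
# Hypothetical (word, part-of-speech) -> lemma table covering the corpus above.
# 'v' marks a verb and 'n' marks a noun.
LEMMAS = {
    ('gathering', 'v'): 'gather',
    ('gathering', 'n'): 'gathering',
    ('ingredients', 'n'): 'ingredient',
    ('wizards', 'n'): 'wizard',
    ('am', 'v'): 'be',
    ('were', 'v'): 'be',
}

def lemmatize(word, pos):
    # Fall back to the lowercased word when no entry is known.
    return LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatize('gathering', 'v'))  # gather
print(lemmatize('gathering', 'n'))  # gathering
print(lemmatize('wizards', 'n'))    # wizard
```

Unlike suffix stripping, a lookup of this kind always returns a real word and can map irregular forms such as "were" to "be".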