#### One-hot encoding
One-hot encoding creates a new binary column for each distinct value in the original categorical column, marking whether that value is present in a given row. It works well when the column has only a few distinct values, since every distinct value adds one more column.

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('employee.csv')
df

Unnamed: 0,Employee_Id,Gender
0,101,Male
1,102,Female
2,103,Female
3,104,Male
4,105,Male


In [3]:
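# dtype=int keeps the 0/1 output shown here; newer pandas versions return True/False by default
one_hot_encoded_data = pd.get_dummies(df, dtype=int)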
one_hot_encoded_data

Unnamed: 0,Employee_Id,Gender_Female,Gender_Male
0,101,0,1
1,102,1,0
2,103,1,0
3,104,0,1
4,105,0,1
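
To make the idea concrete without needing the `employee.csv` file, here is a minimal plain-Python sketch of the same encoding. It illustrates the concept and is not how pandas implements `get_dummies`; the `genders` list simply repeats the values of the `Gender` column above, and the variable names are chosen for this example.

```python
# Plain-Python sketch of one-hot encoding (illustrative; not pandas' implementation).
genders = ["Male", "Female", "Female", "Male", "Male"]  # values of the Gender column above

# One new binary column per distinct category, in sorted order.
categories = sorted(set(genders))

# Each row gets a 1 in the column matching its value and a 0 everywhere else.
one_hot_rows = [[int(value == category) for category in categories] for value in genders]

print(["Gender_" + category for category in categories])
for row in one_hot_rows:
    print(row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.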


#### Bag of Words
We cannot pass raw text directly to a classification model, because the features must be numeric. Several approaches exist for extracting numeric features from text; the simplest is Bag of Words.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether (and how often) known words occur in the document, not where they occur.


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
text = ['its great time to learn something new','time is precious']
vectorizer = CountVectorizer()
vectorizer.fit(text)
print(sorted(vectorizer.vocabulary_))


['great', 'is', 'its', 'learn', 'new', 'precious', 'something', 'time', 'to']


In [7]:
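# Transform the documents into a document-term matrix
vector = vectorizer.transform(text)
# Shape of the matrix: (number of documents, vocabulary size)
print(vector.shape)
# Each entry is the count of that vocabulary word in the document (0 means absent)
print(vector.toarray())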

(2, 9)
[[1 0 1 1 1 0 1 1 1]
 [0 1 0 0 0 1 0 1 0]]
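
The claim that word order is discarded can be checked with a small sketch. The helper `bag_of_words` below is a hypothetical plain-Python stand-in written for this example, and it tokenizes by splitting on whitespace, which is a simplification (`CountVectorizer` uses its own token pattern, so its output can differ on real text).

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

doc_a = "time is precious"
doc_b = "precious is time"  # same words, different order

# Shared vocabulary built from both documents, sorted alphabetically.
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))
vector_a = bag_of_words(doc_a, vocabulary)
vector_b = bag_of_words(doc_b, vocabulary)

print(vocabulary)
print(vector_a, vector_b)
print(vector_a == vector_b)  # identical vectors: the order of words is lost
```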


#### Bag of N-grams
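N-grams are contiguous sequences of n words or tokens in a document. A bag of n-grams counts these sequences instead of single words, which preserves some local word order.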

In [9]:
text = 'Learning is a continous process'
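def generate_N_grams(text, ngram=1):
    '''
    Return the list of n-grams in a text.
      text  (str): input text, split on whitespace
      ngram (int): number of words per n-gram
    '''
    words = text.split()
    # Zip together `ngram` copies of the word list, each shifted by one position
    shifted = zip(*[words[i:] for i in range(ngram)])
    return [' '.join(gram) for gram in shifted]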

In [10]:
generate_N_grams(text,2)

['Learning is', 'is a', 'a continous', 'continous process']
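
Generating n-grams is only the first half of a bag of n-grams; the second half is counting them per document, exactly as Bag of Words does for single words. The following is a minimal sketch under simple assumptions: whitespace tokenization, lowercasing, and two made-up example sentences. The helper name `ngrams` is chosen for this example.

```python
from collections import Counter

def ngrams(text, n):
    """Return the contiguous n-word sequences in text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

documents = ["learning is a process", "learning is fun"]  # made-up examples

# Vocabulary of all bigrams seen in the corpus, sorted alphabetically.
vocabulary = sorted({gram for doc in documents for gram in ngrams(doc, 2)})

# One count vector per document, with one entry per bigram in the vocabulary.
vectors = []
for doc in documents:
    counts = Counter(ngrams(doc, 2))
    vectors.append([counts[gram] for gram in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Both documents share the bigram "learning is", so that column is non-zero in both rows.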

#### TF-IDF
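TF-IDF stands for Term Frequency-Inverse Document Frequency. It weights a word by how often it occurs in a document, discounted by how common the word is across the corpus.
Term Frequency: the number of times a term t occurs in document d, normalized by the length of d.
tf(t, d) = (count of t in d) / (number of words in d)
Inverse Document Frequency: the logarithm of the total number of documents N in the corpus divided by df(t), the number of documents that contain t. Adding 1 to the denominator avoids division by zero.
idf(t) = log(N / (df(t) + 1))
tf-idf(t, d) = tf(t, d) * idf(t)
Libraries differ in the exact smoothing and normalization they apply, so computed values may not match this formula exactly.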

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
d1 = 'Learning is continous process'
d2 = 'Machine learning'
d3 = 'deep learning'
li = [d1,d2,d3]
tfidf = TfidfVectorizer()
result = tfidf.fit_transform(li)
print('Word indexes:')
print(tfidf.vocabulary_)
print('tf-idf values:')
print(result)

Word indexes:
{'learning': 3, 'is': 2, 'continous': 0, 'process': 5, 'machine': 4, 'deep': 1}
tf-idf values:
  (0, 5)	0.546454011634009
  (0, 0)	0.546454011634009
  (0, 2)	0.546454011634009
  (0, 3)	0.3227445421804912
  (1, 4)	0.8610369959439764
  (1, 3)	0.5085423203783267
  (2, 1)	0.8610369959439764
  (2, 3)	0.5085423203783267
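
The values printed above do not follow the textbook formula directly. With its default settings, `TfidfVectorizer` is documented to use raw term counts, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each document vector. The sketch below redoes that calculation in plain Python as a sanity check; it is a hand-rolled approximation of the documented defaults, not scikit-learn's actual code, and the helper `tfidf_vector` is a name chosen for this example.

```python
import math

documents = [
    "learning is continous process".split(),  # spelling kept from the example above
    "machine learning".split(),
    "deep learning".split(),
]
n_docs = len(documents)
vocabulary = sorted({word for doc in documents for word in doc})

# Document frequency: number of documents that contain each word.
doc_freq = {word: sum(word in doc for doc in documents) for word in vocabulary}

# Smoothed IDF, following the formula documented for the default smooth_idf=True.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary}

def tfidf_vector(doc):
    """Raw count times IDF, then scaled to unit Euclidean (L2) length."""
    weights = [doc.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

print(vocabulary)
for doc in documents:
    print([round(w, 4) for w in tfidf_vector(doc)])
```

Because "learning" appears in all three documents, it gets the lowest IDF and therefore the smallest weight in every row.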


#### OOV Problem:
Out-of-vocabulary (OOV) words are a very common problem in NLP. They show up in text classification, named entity recognition (NER), information retrieval, and similar tasks.
OOV words can occur for several reasons:


1.   Typos: a misspelled word does not match any entry in the vocabulary.
2.   Different vocabulary: if your model is trained on a limited dataset, it very likely has a limited vocabulary. When you apply the model to new data, it will fail to recognize some words.
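
Both causes can be seen in a small sketch. The training sentences, the test sentence (with the deliberate typo "precoius"), and the helper `split_known_unknown` are all made up for this illustration; tokenization is plain whitespace splitting.

```python
training_docs = ["time is precious", "learn something new"]  # made-up training data
vocabulary = {word for doc in training_docs for word in doc.split()}

def split_known_unknown(document, vocabulary):
    """Separate tokens seen during training from out-of-vocabulary tokens."""
    tokens = document.split()
    known = [token for token in tokens if token in vocabulary]
    unknown = [token for token in tokens if token not in vocabulary]
    return known, unknown

# "precoius" is a typo; "and" and "valuable" never appeared in the training data.
known, unknown = split_known_unknown("time is precoius and valuable", vocabulary)
print(known)    # tokens the model has features for
print(unknown)  # tokens a count-based model would silently ignore
```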





#### Word Embeddings:
Word embeddings represent words as dense numeric vectors, so that text can be fed to a machine learning model as features. They try to preserve syntactic and semantic information: words used in similar contexts end up with similar vectors. Count-based methods such as Bag of Words (BoW, for example via CountVectorizer) and TF-IDF rely on word counts in a document and do not capture this syntactic or semantic information.
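
A common way to compare two embeddings is cosine similarity. The sketch below uses hypothetical 3-dimensional vectors that were hand-picked for illustration; they are not taken from any trained model, and real embeddings typically have tens to hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(king_queen)
print(king_apple)
```

With these toy vectors, "king" is much closer to "queen" than to "apple", which is the kind of relationship trained embeddings aim to encode.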

#### CBOW and Skip Gram
CBOW (Continuous Bag of Words) and Skip-gram are the two Word2Vec architectures. In CBOW, the model takes the surrounding context words as input and predicts the target word they surround. The Skip-gram architecture does the reverse: it takes one word as input and predicts the context words around it.
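
The difference is easiest to see in the training examples each architecture would be fed. The sketch below only builds those examples and does not train a model; the function names and the `window` parameter (number of words considered on each side) are assumptions made for this illustration.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding the word itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window=1):
    """(context words, target word) pairs: several words in, one word out."""
    return [(context_window(tokens, i, window), target) for i, target in enumerate(tokens)]

def skipgram_examples(tokens, window=1):
    """(input word, context word) pairs: one word in, one context word out per pair."""
    return [
        (word, context_word)
        for i, word in enumerate(tokens)
        for context_word in context_window(tokens, i, window)
    ]

tokens = "time is precious".split()
print(cbow_examples(tokens))
print(skipgram_examples(tokens))
```

For the middle word "is", CBOW produces one example with both neighbors as input, while Skip-gram produces two separate examples with "is" as the input.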

#### GloVe
Word2Vec learns from local context: during training it only looks at the words inside a small window around each word. GloVe (Global Vectors) instead builds a large word-word co-occurrence matrix from the entire corpus and learns vectors from those global counts.
The GloVe authors report strong results on word analogy and named entity recognition benchmarks. In practice it outperforms Word2Vec on some tasks and is comparable on others, so the better choice depends on the task and the data.
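
The co-occurrence matrix is the starting point for this kind of model. The sketch below only counts co-occurrences over a tiny made-up corpus and stores them sparsely in a dictionary; it does not fit any vectors, and it uses plain unweighted counts within a fixed window, which is a simplification chosen for this example.

```python
from collections import defaultdict

def cooccurrence_counts(documents, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for doc in documents:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["machine learning", "deep learning", "deep learning is fun"]  # made-up corpus
counts = cooccurrence_counts(corpus)

print(counts.get(("deep", "learning"), 0))     # pair seen in two documents
print(counts.get(("machine", "learning"), 0))  # pair seen in one document
print(counts.get(("machine", "deep"), 0))      # pair never seen together
```

Unlike the window-by-window view of Word2Vec, these counts are accumulated over the whole corpus before any learning happens.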