**Representations**
---

### Preparing the Text Data with scikit-learn: Feature Extraction
---

### **Bag of Words Model**

Find the unique words (the vocabulary) across the list of documents. Then, for each document, mark every vocabulary word with 1 if it occurs in the document and 0 otherwise. Each document vector therefore has the same length as the vocabulary. The same vocabulary is reused to vectorize new documents.

In [1]:
# define the documents
docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]

# build the vocabulary from lowercased tokens, sorted for a stable column order
words = sorted({word for doc in docs for word in doc.lower().split()})

# build one binary vector per document
vectors = []

for doc in docs:
    tokens = set(doc.lower().split())
    vectors.append([1 if word in tokens else 0 for word in words])
print("vocabulary: ", words)
print("vectors: ", vectors)


vocabulary:  ['am', 'hate', 'i', 'in', 'love', 'phone', 'superb,', 'this']
vectors:  [[1, 0, 1, 1, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 0, 1]]
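
The description above says the vocabulary is reused to vectorize new documents. A minimal sketch of that step, assuming lowercased whitespace tokenization; `encode` is a hypothetical helper name introduced here for illustration, not part of any library:

```python
# Documents used to build the vocabulary
docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]

# Fixed, sorted vocabulary of lowercased whitespace tokens
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

def encode(doc, vocabulary):
    """Hypothetical helper: 1 if a vocabulary word occurs in doc, else 0."""
    tokens = set(doc.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

# "new" is not in the vocabulary, so it simply has no position in the vector
new_doc = "I love this new phone"
new_vector = encode(new_doc, vocabulary)

print("vocabulary: ", vocabulary)
print("new vector: ", new_vector)
```

The new vector always has the length of the vocabulary, whatever the new document contains; words never seen while building the vocabulary are silently ignored.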


### **Word Counts with CountVectorizer (scikit-learn)**

Tokenize the collection of documents, build a vocabulary from the tokens, and use that vocabulary to encode new documents. scikit-learn's CountVectorizer does this. By default it lowercases the text, strips punctuation, and ignores single-character tokens, which is why “I” does not appear in the vocabulary below.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# list of documents
docs = ['SUPERB, I AM IN LOVE IN THIS PHONE', 'I hate this phone']
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(docs)
print('vocabulary: ', vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(docs)
# summarize encoded vector
print('shape: ', vector.shape)
print('vectors: ', vector.toarray())

vocabulary:  {'superb': 5, 'am': 0, 'in': 2, 'love': 3, 'this': 6, 'phone': 4, 'hate': 1}
shape:  (2, 7)
vectors:  [[1 0 2 1 1 1 1]
 [0 1 0 0 1 0 1]]
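
The fitted vocabulary can also encode a document that was not part of the fit. A short sketch, assuming the same two documents and default `CountVectorizer` settings; the sentence in `new_doc` is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['SUPERB, I AM IN LOVE IN THIS PHONE', 'I hate this phone']

# learn the vocabulary from the original documents only
vectorizer = CountVectorizer()
vectorizer.fit(docs)

# "new" was not seen during fit, so it has no column and is dropped;
# "I" is dropped by the default tokenizer because it is a single character
new_doc = ['I love this new phone']
new_vector = vectorizer.transform(new_doc).toarray()

print('shape: ', new_vector.shape)
print('vector: ', new_vector)
```

The result still has 7 columns, one per vocabulary word, with counts only for "love", "phone", and "this".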


### **Word Frequencies with TfidfVectorizer (scikit-learn)**

Raw word counts are fairly basic. In the first document the word “in” occurs twice, yet it carries little meaning. Stop words can repeat many times in a document, and word counts rank words purely by how often they occur. As a result, counts favor stop words and other low-information words over the more interesting ones.

TF-IDF, short for “Term Frequency, Inverse Document Frequency”, is a popular alternative. It scores words so that the more interesting ones stand out, e.g. words that are frequent in one document but rare across documents.

There are a few types of weighting schemes for tf-idf in general. Let's see how scikit-learn calculates tf*idf.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of documents
docs = ['SUPERB, I AM IN LOVE IN THIS PHONE', 'I hate this phone']

# create the transform, learn the vocabulary and idf weights, and encode the documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# summarize the fitted vocabulary and idf weights
print('vocabulary: ', vectorizer.vocabulary_)
print('idfs: ', vectorizer.idf_)

# show the encoded documents (X already holds the tf-idf vectors)
print(X.toarray())

vocabulary:  {'superb': 5, 'am': 0, 'in': 2, 'love': 3, 'this': 6, 'phone': 4, 'hate': 1}
idfs:  [1.40546511 1.40546511 1.40546511 1.40546511 1.         1.40546511
 1.        ]
[[0.35327777 0.         0.70655553 0.35327777 0.25136004 0.35327777
  0.25136004]
 [0.         0.70490949 0.         0.         0.50154891 0.
  0.50154891]]
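
The numbers above can be reproduced by hand. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes idf as `ln((1 + n) / (1 + df)) + 1`, multiplies it by the raw term count, and then scales each document vector to unit Euclidean length. A sketch of that calculation in plain Python; the counts are copied from the CountVectorizer output for the same two documents, and `tfidf_row` is a hypothetical helper name:

```python
import math

# Term counts for the two documents;
# columns: am, hate, in, love, phone, superb, this
counts = [[1, 0, 2, 1, 1, 1, 1],
          [0, 1, 0, 0, 1, 0, 1]]
n_docs = len(counts)

# document frequency: number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# smoothed idf, as used by scikit-learn's default settings
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Hypothetical helper: weight counts by idf, then L2-normalize."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]

print('idfs: ', [round(w, 8) for w in idf])
for row in tfidf:
    print([round(v, 8) for v in row])
```

Terms that appear in both documents ("phone", "this") get the minimum idf of 1, while terms unique to one document get about 1.405, which is why "in" (count 2, unique to the first document) ends up with the largest weight.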
