In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk

plt.style.use("default")

## Vectorization

The process of encoding text as numbers in order to create feature vectors.

**Feature Vector:** vector of numerical features that represent an object

**Types of Vectorization:**
- Count Vectorization (unigrams)
- N-Grams
- TF-IDF
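
As a rough sketch of how the three types listed above differ, the snippet below runs each of them on a single toy sentence. The variable names and the sentence are chosen for illustration only; `ngram_range=(2, 2)` asks for bigrams only, and `TfidfVectorizer` is used with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Stay healthy, stay happy"]  # toy document, chosen for illustration

# Count vectorization (unigrams): one feature per distinct word
unigram = CountVectorizer()
unigram_counts = unigram.fit_transform(docs).toarray()
unigram_vocab = sorted(unigram.vocabulary_)

# N-grams (here bigrams only): one feature per pair of adjacent words
bigram = CountVectorizer(ngram_range=(2, 2))
bigram.fit(docs)
bigram_vocab = sorted(bigram.vocabulary_)

# TF-IDF: counts reweighted by inverse document frequency,
# with each row scaled to unit length by default
tfidf = TfidfVectorizer()
tfidf_weights = tfidf.fit_transform(docs).toarray()

print(unigram_vocab, unigram_counts)
print(bigram_vocab)
print(tfidf_weights.round(3))
```

The unigram and bigram vectorizers both produce integer counts, while TF-IDF produces floating-point weights.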

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

##### To give each distinct word a unique integer, we need to fit the `CountVectorizer` on the texts. Fitting builds the vocabulary and assigns the integers.

**Fitting** here means learning from data: the vectorizer learns its vocabulary from the texts we pass in.

In [4]:
corpus = ["Whoever is happy will make others happy too.",
          "Stay healthy, stay happy"]

X = cv.fit(corpus)
type(X)

sklearn.feature_extraction.text.CountVectorizer

#### To see the distinct words that the `CountVectorizer` found during fitting

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(X.get_feature_names_out().tolist())

['happy', 'healthy', 'is', 'make', 'others', 'stay', 'too', 'whoever', 'will']


#### To see the unique index assigned to each word

In [6]:
print(X.vocabulary_)

{'whoever': 7, 'is': 2, 'happy': 0, 'will': 8, 'make': 3, 'others': 4, 'too': 6, 'stay': 5, 'healthy': 1}


#### If you're wondering where the two attributes of `X` used above come from, `dir` lists everything available on the object

In [15]:
dir(X)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_n_features',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_validate_data',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixe
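
The full `dir()` listing is long. A shorter way to find the useful names, sketched below, is to drop the private ones (leading underscore) and then pick out the names that end with an underscore, which is the scikit-learn convention for attributes learned during fitting. The variable names here are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(["Stay healthy, stay happy"])

# Public names only: skip anything that starts with an underscore
public_names = [name for name in dir(vec) if not name.startswith("_")]

# By scikit-learn convention, attributes learned in fit() end with "_"
fitted_attributes = [name for name in public_names if name.endswith("_")]

print(fitted_attributes)
```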

#### Now transform our data using the fitted vectorizer

**Note:** In practice the vectorizer is fitted on the training data only and then used to transform both the training data and any test/new data. For simplicity, let's transform the same corpus we fitted on.

In [10]:
X_transformed = cv.transform(corpus)
type(X), type(X_transformed)

(sklearn.feature_extraction.text.CountVectorizer, scipy.sparse.csr.csr_matrix)
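
A minimal sketch of the usual fit/transform split: the vocabulary is learned once from the training texts, and the same fitted vectorizer then encodes any later text. `new_texts` is a made-up example, and `fit_transform` is simply `fit` followed by `transform` on the same data.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["Whoever is happy will make others happy too.",
               "Stay healthy, stay happy"]
new_texts = ["happy and healthy"]  # hypothetical unseen text

vectorizer = CountVectorizer()

# Learn the vocabulary from the training texts and encode them in one step
train_matrix = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary for new text; nothing is re-fitted here
new_matrix = vectorizer.transform(new_texts)

print(train_matrix.shape, new_matrix.shape)
print(new_matrix.toarray())
```

Both matrices have the same number of columns because they share one vocabulary.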

In [13]:
# X_transformed is a sparse matrix, so let's check its shape
X_transformed.shape

(2, 9)

`shape = (2, 9)` 👉 2 rows and 9 columns: one row per text in the corpus list and one column per unique word found during fitting.

In [14]:
print(X_transformed)

  (0, 0)	2
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 6)	1
  (0, 7)	1
  (0, 8)	1
  (1, 0)	1
  (1, 1)	1
  (1, 5)	2
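
Each line of the sparse printout above reads as `(row, column)` followed by the count stored there; zero entries are simply not stored. As a sketch, the snippet below rebuilds the same matrix and maps each column index back to its word by inverting `vocabulary_`. The names `vec`, `index_to_word` and `entries` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Whoever is happy will make others happy too.",
        "Stay healthy, stay happy"]

vec = CountVectorizer()
matrix = vec.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to go from column to word
index_to_word = {index: word for word, index in vec.vocabulary_.items()}

# COO format exposes parallel arrays of row indices, column indices and values
coo = matrix.tocoo()
entries = [(int(r), index_to_word[int(c)], int(v))
           for r, c, v in zip(coo.row, coo.col, coo.data)]

for row, word, count in sorted(entries):
    print(f"document {row}: {word!r} appears {count} time(s)")
```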


##### Let's convert the transformed matrix into a dense array to see it more clearly

In [21]:
X_array = X_transformed.toarray()
X_array

array([[2, 0, 1, 1, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 2, 0, 0, 0]], dtype=int64)

In [25]:
print(X.get_feature_names_out().tolist())

['happy', 'healthy', 'is', 'make', 'others', 'stay', 'too', 'whoever', 'will']


**observation:** The raw array is hard to read on its own. Let's convert it to a `pandas` DataFrame to see it clearly.

- We need to pass the array as the data and the feature names as the column labels.

In [28]:
df = pd.DataFrame(data=X_array, columns=X.get_feature_names_out())
df

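|   | happy | healthy | is | make | others | stay | too | whoever | will |
|---|-------|---------|----|------|--------|------|-----|---------|------|
| 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |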


In [29]:
corpus

['Whoever is happy will make others happy too.', 'Stay healthy, stay happy']

**observation:** From the above DataFrame, we see that each row shows the frequency of each word in the corresponding sentence.
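
To check that these really are plain word frequencies, the counts can be reproduced by hand. The sketch below uses only the standard library and assumes the default `CountVectorizer` behaviour: lowercase the text, then keep tokens of two or more word characters (the default `token_pattern`). The helper `count_words` is a made-up name.

```python
import re
from collections import Counter

corpus = ["Whoever is happy will make others happy too.",
          "Stay healthy, stay happy"]

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_words(text):
    """Lowercase the text, tokenize it, and count each token."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

counts = [count_words(text) for text in corpus]

# Sorted union of all words seen, i.e. the column order of the table
vocabulary = sorted(set().union(*counts))

# One row per sentence, one column per vocabulary word (0 if absent)
rows = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(rows)
```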

### Try transforming new text data

In [41]:
test = ["early to bed, early to rise, makes a man healthy, wealthy and wise"]

test_transformed = cv.transform(test)
pd.DataFrame(data=test_transformed.toarray(), columns=X.get_feature_names_out())

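|   | happy | healthy | is | make | others | stay | too | whoever | will |
|---|-------|---------|----|------|--------|------|-----|---------|------|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |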


**observation:** Words that are not in the fitted vocabulary are ignored. Only `healthy` was seen during fitting, so it is the only non-zero column.
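
One way to see exactly which words get dropped, sketched here under the same default settings, is to tokenize the new text with the vectorizer's own analyzer and compare the tokens against `vocabulary_`. The names `vec`, `known` and `unknown` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Whoever is happy will make others happy too.",
          "Stay healthy, stay happy"]
vec = CountVectorizer().fit(corpus)

new_text = "early to bed, early to rise, makes a man healthy, wealthy and wise"

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer applies inside transform()
analyzer = vec.build_analyzer()
tokens = analyzer(new_text)

known = [t for t in tokens if t in vec.vocabulary_]
unknown = sorted({t for t in tokens if t not in vec.vocabulary_})

print("kept:   ", known)
print("ignored:", unknown)
```

Note that `makes` is ignored even though `make` is in the vocabulary: matching is on exact tokens, with no stemming.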

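**Note:** `get_feature_names()` was removed in scikit-learn 1.2 and raises `AttributeError` there; `get_feature_names_out()` is the replacement.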