Weekly Coding Submission: Week 3
Scikit Practice

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Methods - Text Features Extraction with Bag-of-Words Using Scikit Learn

In [2]:
corpus = np.array([
      'The sun is shining',
      'The weather is sweet',
      'The sun is shining, the weather is sweet, and this is an extra sentence'
])

In [3]:
len(corpus)

3

Raw Term Frequency

Import Count Vectorizer
* Using the CountVectorizer from SKLearn, we can construct a bag-of-words model with the term frequencies
* Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# tokenize and build vocabulary
tf = cv.fit_transform(corpus).toarray()
tf

array([[0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1],
       [1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1]])

In [5]:
cv.vocabulary_

{'the': 8,
 'sun': 6,
 'is': 3,
 'shining': 5,
 'weather': 10,
 'sweet': 7,
 'and': 1,
 'this': 9,
 'an': 0,
 'extra': 2,
 'sentence': 4}

In [6]:
# Shape
tf.shape

(3, 11)

We have 3 samples (documents) and 11 unique tokens (the vocabulary size).
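
To make the bag-of-words idea concrete, the count matrix can be approximated by hand with the standard library. This is a minimal sketch, not scikit-learn's implementation: it assumes lowercasing plus the default token pattern `(?u)\b\w\w+\b` (tokens of two or more word characters), and the helper names `tokenize` and `count_row` are made up for illustration.

```python
import re
from collections import Counter

# The three example documents.
docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and this is an extra sentence',
]

# Assumption: lowercase the text and keep tokens of two or more word
# characters, which mirrors CountVectorizer's default settings.
TOKEN_RE = re.compile(r'\b\w\w+\b')

def tokenize(text):
    """Hypothetical helper: split a document into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())

# Vocabulary: unique tokens in alphabetical order, one column per token.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_row(text):
    """Hypothetical helper: raw term counts of one document, in vocab order."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]

rows = [count_row(doc) for doc in docs]
print(len(vocab))  # 11
print(rows[-1])    # [1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1]
```

Each row has one entry per vocabulary token, which is why the matrix has shape (3, 11).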

In [9]:
cv.get_feature_names_out()

array(['an', 'and', 'extra', 'is', 'sentence', 'shining', 'sun', 'sweet',
       'the', 'this', 'weather'], dtype=object)

In [10]:
cv.inverse_transform(tf)

[array(['is', 'shining', 'sun', 'the'], dtype='<U8'),
 array(['is', 'sweet', 'the', 'weather'], dtype='<U8'),
 array(['an', 'and', 'extra', 'is', 'sentence', 'shining', 'sun', 'sweet',
        'the', 'this', 'weather'], dtype='<U8')]

#### tf-idf
* tf-idf rescales term frequencies so that words occurring in many documents carry less weight
* As a first step, we can use the TfidfVectorizer with the idf part switched off (use_idf=False, smooth_idf=False) to obtain L2-normalized term frequencies only

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(use_idf=False, norm='l2', smooth_idf=False)
tf_norm = tfidf.fit_transform(corpus).toarray()
np.set_printoptions(precision=2)
print(f'Normalized term frequencies: \n {tf_norm[-1]}')

Normalized term frequencies: 
 [0.21 0.21 0.21 0.64 0.21 0.21 0.21 0.21 0.43 0.21 0.21]
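
The L2 normalization can be checked by hand: each count is divided by the Euclidean length of the document's count vector. The sketch below reuses the raw counts of document 3 from the CountVectorizer output; the variable names are illustrative only.

```python
import math

# Raw counts for document 3, copied from the term-frequency matrix.
tf_doc3 = [1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1]

# L2 norm: square root of the sum of squared counts (sqrt(22) here).
l2_norm = math.sqrt(sum(c * c for c in tf_doc3))

# Divide every count by the norm so the vector has unit length.
tf_doc3_norm = [c / l2_norm for c in tf_doc3]
print([round(v, 2) for v in tf_doc3_norm])
# [0.21, 0.21, 0.21, 0.64, 0.21, 0.21, 0.21, 0.21, 0.43, 0.21, 0.21]
```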


#### Term frequency-inverse document frequency -- tf-idf

In [12]:
tfidf = TfidfVectorizer(use_idf=True, smooth_idf=True, norm='l2')
tf_idf = tfidf.fit_transform(corpus).toarray()
print(f'tf-idf of document 3:\n {tf_idf[-1]}')

tf-idf of document 3:
 [0.29 0.29 0.29 0.51 0.29 0.22 0.22 0.22 0.34 0.29 0.22]
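
For reference, the tf-idf weights of document 3 can be worked out by hand. This sketch assumes the smoothed idf documented for smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization; the document frequencies are read off the count matrix, and the variable names are illustrative.

```python
import math

n_docs = 3

# Vocabulary order: an, and, extra, is, sentence, shining, sun, sweet,
# the, this, weather.
tf_doc3 = [1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1]  # raw counts in document 3
df = [1, 1, 1, 3, 1, 2, 2, 2, 3, 1, 2]       # documents containing each term

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight each count by its idf, then scale the vector to unit L2 length.
raw = [t * i for t, i in zip(tf_doc3, idf)]
norm = math.sqrt(sum(v * v for v in raw))
tfidf_doc3 = [v / norm for v in raw]
print([round(v, 2) for v in tfidf_doc3])
# [0.29, 0.29, 0.29, 0.51, 0.29, 0.22, 0.22, 0.22, 0.34, 0.29, 0.22]
```

Compared with the plain normalized term frequencies, "is" and "the" (present in all three documents) lose weight relative to terms that appear only in document 3.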


#### Bigrams and N-Grams

In [13]:
# Sequences of tokens of minimum length 2 and maximum length 2
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(corpus)

In [15]:
bigram_vectorizer.get_feature_names_out()

array(['an extra', 'and this', 'extra sentence', 'is an', 'is shining',
       'is sweet', 'shining the', 'sun is', 'sweet and', 'the sun',
       'the weather', 'this is', 'weather is'], dtype=object)

In [16]:
bigram_vectorizer.transform(corpus).toarray()

array([[0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Often we want to include both unigrams (single tokens) and bigrams, which we can do by passing the tuple (1, 2) as the ngram_range argument of the CountVectorizer constructor:

In [17]:
gram_vectorizer = CountVectorizer(ngram_range=(1, 2))
gram_vectorizer.fit(corpus)

In [18]:
gram_vectorizer.get_feature_names_out()

array(['an', 'an extra', 'and', 'and this', 'extra', 'extra sentence',
       'is', 'is an', 'is shining', 'is sweet', 'sentence', 'shining',
       'shining the', 'sun', 'sun is', 'sweet', 'sweet and', 'the',
       'the sun', 'the weather', 'this', 'this is', 'weather',
       'weather is'], dtype=object)

In [19]:
gram_vectorizer.transform(corpus).toarray()

array([[0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
        1, 1],
       [1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
        1, 1]])
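
The effect of ngram_range can be approximated by hand: slide a window of each allowed length over the token list and collect the joined tokens. This is a rough sketch under the same tokenization assumption as CountVectorizer's defaults (lowercase, tokens of two or more word characters); `tokenize`, `ngrams`, and `build_vocab` are hypothetical helpers, not scikit-learn functions.

```python
import re

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and this is an extra sentence',
]

def tokenize(text):
    """Hypothetical helper: lowercase tokens of two or more word characters."""
    return re.findall(r'\b\w\w+\b', text.lower())

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined by a single space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocab(docs, min_n, max_n):
    """Sorted set of every n-gram with min_n <= n <= max_n."""
    grams = set()
    for doc in docs:
        tokens = tokenize(doc)
        for n in range(min_n, max_n + 1):
            grams.update(ngrams(tokens, n))
    return sorted(grams)

bigram_vocab = build_vocab(docs, 2, 2)
print(len(bigram_vocab))  # 13
print(bigram_vocab[:3])   # ['an extra', 'and this', 'extra sentence']

uni_bi_vocab = build_vocab(docs, 1, 2)
print(len(uni_bi_vocab))  # 24
print(uni_bi_vocab[:3])   # ['an', 'an extra', 'and']
```

The combined vocabulary is simply the 11 unigrams plus the 13 bigrams, which is where the 24 columns come from.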

### Character n-grams

Sometimes it is helpful to look not only at words but at individual characters.  That is particularly useful if we have very noisy data and want to identify the language, or if we want to predict something about a single word.  We can look at characters instead of words by setting analyzer="char".  Looking at single characters is usually not very informative, but looking at longer n-grams of characters could be:

In [20]:
corpus

array(['The sun is shining', 'The weather is sweet',
       'The sun is shining, the weather is sweet, and this is an extra sentence'],
      dtype='<U71')

In [22]:
char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
char_vectorizer.fit(corpus)

In [23]:
print(char_vectorizer.get_feature_names_out())

[' a' ' e' ' i' ' s' ' t' ' w' ', ' 'a ' 'an' 'at' 'ce' 'd ' 'e ' 'ea'
 'ee' 'en' 'er' 'et' 'ex' 'g,' 'he' 'hi' 'in' 'is' 'n ' 'nc' 'nd' 'ng'
 'ni' 'nt' 'r ' 'ra' 's ' 'se' 'sh' 'su' 'sw' 't,' 'te' 'th' 'tr' 'un'
 'we' 'xt']
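
The character bigram vocabulary can also be reproduced by hand. This sketch assumes the "char" analyzer lowercases the text, collapses runs of whitespace to a single space, and then slides a window over the whole document, spaces and punctuation included; `char_ngrams` is a hypothetical helper.

```python
docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and this is an extra sentence',
]

def char_ngrams(text, n):
    """Hypothetical helper: all substrings of length n, after lowercasing
    and collapsing whitespace (assumed to match analyzer="char")."""
    text = ' '.join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Unique character bigrams across the corpus, in sorted order.
char_vocab = sorted({g for doc in docs for g in char_ngrams(doc, 2)})
print(len(char_vocab))  # 44
print(char_vocab[:6])   # [' a', ' e', ' i', ' s', ' t', ' w']
```

Because the window runs across word boundaries, bigrams such as 'e ' and ', ' appear in the vocabulary alongside within-word pairs like 'th'.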
