# Converting text to numeric data

Machine learning algorithms learn from numeric data, so text has to be converted to a numerical form before an algorithm can learn from it. One way to do this is to represent each document as a vector of frequencies, where each position in the vector holds the count of a specific word in that document. To demonstrate this, a toy corpus is created below.

In [7]:
corpus = ['Mary had a little lamb.',
         'The lamb followed Mary to school one day.',
         'The lamb was white.',
         'Mary should not bring a lamb to school.',
         'Mary is a little rebel.']
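
Before turning to scikit-learn, the idea of a frequency vector can be sketched by hand. The following is a minimal illustration only: the two short documents and the simple `\w+` tokenizer are assumptions made for this example, not the tokenization scikit-learn uses.

```python
import re
from collections import Counter

# Hypothetical two-document example, separate from the toy corpus above.
docs = ["the lamb was white", "the lamb followed the lamb"]

# Lowercase each document and split it into word tokens.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# Sorted vocabulary: one vector position per distinct word.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One count vector per document, aligned with the vocabulary order.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocab])

print(vocab)
print(vectors)
```

Each inner list has one entry per vocabulary word, so every document ends up with a vector of the same length regardless of how many words it contains.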

### CountVectorizer

The sklearn CountVectorizer class:
* tokenizes the text
* assigns a unique integer id to each word in the corpus
* counts the frequencies for each word

Documentation for CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

The code below first creates a CountVectorizer instance, then fits it to the corpus.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

The output above shows the default options for CountVectorizer. Full descriptions of each parameter are provided in the documentation.

Next, fit_transform returns the document-term count matrix, and the fitted vectorizer supplies the feature names, which are the vocabulary words in column order.

In [9]:
corpus_counts = vectorizer.fit_transform(corpus)
print(corpus_counts.toarray())
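# note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print('names:', vectorizer.get_feature_names())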

[[0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0]
 [0 1 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1]
 [1 0 0 0 0 1 0 1 1 0 0 1 1 0 1 0 0]
 [0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0]]
names: ['bring', 'day', 'followed', 'had', 'is', 'lamb', 'little', 'mary', 'not', 'one', 'rebel', 'school', 'should', 'the', 'to', 'was', 'white']


In the 2D array above, each row represents one document in the corpus and each column represents a word. Each element is the count of that word in that document. Most counts are zero, because each document uses only a small part of the vocabulary. For this reason the vectorizer returns a sparse matrix, which stores only the nonzero entries; toarray() converts it to a dense array for display.
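
The sparsity can be checked directly. The sketch below repeats the toy corpus so that it runs on its own, and reports how many entries are actually stored; `issparse` comes from SciPy, which scikit-learn depends on.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Mary had a little lamb.',
          'The lamb followed Mary to school one day.',
          'The lamb was white.',
          'Mary should not bring a lamb to school.',
          'Mary is a little rebel.']

counts = CountVectorizer().fit_transform(corpus)

# shape is (documents, vocabulary size); nnz is the number of stored nonzero entries
n_docs, n_terms = counts.shape
density = counts.nnz / (n_docs * n_terms)

print(issparse(counts))
print(counts.shape, counts.nnz)
print(f"fraction of nonzero entries: {density:.2f}")
```

On a corpus this small the saving is modest, but with a realistic vocabulary of tens of thousands of words the fraction of nonzero entries becomes very small.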

In [14]:
# vocabulary_ maps each word to its column index in the count matrix
print(vectorizer.vocabulary_)

{'mary': 7, 'had': 3, 'little': 6, 'lamb': 5, 'the': 13, 'followed': 2, 'to': 14, 'school': 11, 'one': 9, 'day': 1, 'was': 15, 'white': 16, 'should': 12, 'not': 8, 'bring': 0, 'is': 4, 'rebel': 10}
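
Note that the vocabulary has no entry for the word "a". The default token_pattern shown earlier only matches tokens of two or more word characters. The sketch below uses build_analyzer() to show the tokens the vectorizer extracts from one sentence; the second pattern is an assumption chosen for illustration, in case single-character words should be kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'Mary had a little lamb.'

# Default analyzer: lowercases, then keeps tokens of 2+ word characters.
default_analyzer = CountVectorizer().build_analyzer()
default_tokens = default_analyzer(sentence)

# Illustrative alternative pattern that also keeps one-character tokens.
keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
all_tokens = keep_short(sentence)

print(default_tokens)
print(all_tokens)
```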


### TfidfVectorizer

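The code below shows how to use TfidfVectorizer, which produces TF-IDF (term frequency-inverse document frequency) weights instead of raw frequency counts. TF-IDF scales each count by how rare the word is across the corpus, so words that appear in most documents carry less weight.

Documentation for TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html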

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [24]:
# inspect the vocabulary (term -> column index) and the learned idf weights
print('terms:', vectorizer.vocabulary_)
print('\nidf:', vectorizer.idf_)

terms: {'mary': 7, 'had': 3, 'little': 6, 'lamb': 5, 'the': 13, 'followed': 2, 'to': 14, 'school': 11, 'one': 9, 'day': 1, 'was': 15, 'white': 16, 'should': 12, 'not': 8, 'bring': 0, 'is': 4, 'rebel': 10}

idf: [2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 1.18232156
 1.69314718 1.18232156 2.09861229 2.09861229 2.09861229 1.69314718
 2.09861229 1.69314718 1.69314718 2.09861229 2.09861229]
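
The idf values above can be reproduced by hand. With the default smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The following sketch, which repeats the toy corpus so it runs on its own, recomputes the weights from the raw counts and compares them with the fitted vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['Mary had a little lamb.',
          'The lamb followed Mary to school one day.',
          'The lamb was white.',
          'Mary should not bring a lamb to school.',
          'Mary is a little rebel.']

counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# document frequency: number of documents in which each term appears
df = (counts > 0).sum(axis=0)

# smoothed idf, matching the default smooth_idf=True
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1

sklearn_idf = TfidfVectorizer().fit(corpus).idf_
print(np.allclose(manual_idf, sklearn_idf))
```

Words found in four of the five documents ('lamb', 'mary') get the lowest weight, and words found in a single document get the highest.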


The fit method above learns the vocabulary and the idf weights from the data. The fit_transform method below does the same and also returns the document-term matrix. It is equivalent to calling fit followed by transform, but the single call is more efficient.

In [27]:
tfidf = vectorizer.fit_transform(corpus)
print(tfidf)

  (0, 5)	0.3726424037188896
  (0, 6)	0.5336436873608266
  (0, 3)	0.6614375955758462
  (0, 7)	0.3726424037188896
  (1, 1)	0.42304772562360027
  (1, 9)	0.42304772562360027
  (1, 11)	0.34131224130803434
  (1, 14)	0.34131224130803434
  (1, 2)	0.42304772562360027
  (1, 13)	0.34131224130803434
  (1, 5)	0.23833770928448939
  (1, 7)	0.23833770928448939
  (2, 16)	0.5804234289808452
  (2, 15)	0.5804234289808452
  (2, 13)	0.46828196785865284
  (2, 5)	0.3270004354105097
  (3, 0)	0.4500747244546631
  (3, 8)	0.4500747244546631
  (3, 12)	0.4500747244546631
  (3, 11)	0.36311745378911303
  (3, 14)	0.36311745378911303
  (3, 5)	0.2535642489869185
  (3, 7)	0.2535642489869185
  (4, 10)	0.5804234289808452
  (4, 4)	0.5804234289808452
  (4, 6)	0.46828196785865284
  (4, 7)	0.3270004354105097
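
Printing a sparse matrix lists only the stored entries, each as a (row, column) index pair followed by its value. As a quick check that fit_transform gives the same matrix as fit followed by transform, the two can be compared directly. This is a small sketch that repeats the toy corpus so it runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Mary had a little lamb.',
          'The lamb followed Mary to school one day.',
          'The lamb was white.',
          'Mary should not bring a lamb to school.',
          'Mary is a little rebel.']

# one call: learn vocabulary and idf, and return the matrix
one_step = TfidfVectorizer().fit_transform(corpus)

# two calls: learn first, then transform the same documents
two_step = TfidfVectorizer().fit(corpus).transform(corpus)

print(np.allclose(one_step.toarray(), two_step.toarray()))
```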


In [28]:
tfidf.toarray()

array([[0.        , 0.        , 0.        , 0.6614376 , 0.        ,
        0.3726424 , 0.53364369, 0.3726424 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.42304773, 0.42304773, 0.        , 0.        ,
        0.23833771, 0.        , 0.23833771, 0.        , 0.42304773,
        0.        , 0.34131224, 0.        , 0.34131224, 0.34131224,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.32700044, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.46828197, 0.        ,
        0.58042343, 0.58042343],
       [0.45007472, 0.        , 0.        , 0.        , 0.        ,
        0.25356425, 0.        , 0.25356425, 0.45007472, 0.        ,
        0.        , 0.36311745, 0.45007472, 0.        , 0.36311745,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.58042343,
        0.        , 0.46828197, 0.32700044, 0.        , 0.        ,
        0.58042343, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]])
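
To see where these numbers come from, the matrix can be rebuilt from the raw counts. With the default settings (use_idf=True, norm='l2', sublinear_tf=False), each count is multiplied by the idf of its term and each row is then scaled to unit Euclidean length. The sketch below repeats the toy corpus so it runs on its own, and assumes those defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['Mary had a little lamb.',
          'The lamb followed Mary to school one day.',
          'The lamb was white.',
          'Mary should not bring a lamb to school.',
          'Mary is a little rebel.']

counts = CountVectorizer().fit_transform(corpus).toarray()

tfidf_vectorizer = TfidfVectorizer()
expected = tfidf_vectorizer.fit_transform(corpus).toarray()

# weight each raw count by the idf of its term
weighted = counts * tfidf_vectorizer.idf_

# scale every row (document) to unit L2 norm
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(manual, expected))
print(np.linalg.norm(expected, axis=1))  # every row has length 1
```

Because of the row normalization, values are comparable across documents of different lengths.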

## Isolating the test data

The previous examples applied the vectorizers to the entire corpus. For supervised machine learning, a better approach is to first divide the data into train and test sets, then fit the vectorizer on the training data only. The test set is later transformed with the already fitted vectorizer; it is never used for fitting. In this way, no information from the test set leaks into training. The following code demonstrates how to do this on a spam data set.

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# read the data
df = pd.read_csv('../data/sms-spam.csv')

X = df.text      # features
y = df.spam    # targets

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

# use defaults
vectorizer = TfidfVectorizer()

# vectorize
X_train = vectorizer.fit_transform(X_train) # learn vocabulary and idf from the training data, then transform it
X_test = vectorizer.transform(X_test) # transform only, using what was learned from the training data
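
The effect of fitting on the training data only can be seen on a tiny example. The messages below are hypothetical stand-ins for the SMS data, invented for this sketch. Words that occur only in the test messages are not in the learned vocabulary and are ignored, and the test matrix has exactly the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages standing in for the SMS data set.
train_texts = ["win a free prize now",
               "are we meeting for lunch",
               "free entry in a prize draw"]
test_texts = ["claim your free voucher",
              "lunch tomorrow"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary and idf come from here
X_test = vectorizer.transform(test_texts)        # reuses them, learns nothing new

print(X_train.shape, X_test.shape)
print('voucher' in vectorizer.vocabulary_)  # unseen test word is not a feature
```

A model trained on X_train can therefore be applied to X_test without any change to the feature layout.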

### Adding more features

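Setting ngram_range=(1, 2) adds bigrams (pairs of adjacent words) as features alongside single words. The min_df and max_df settings then drop terms that appear in fewer than two documents or in more than half of the documents. The example below applies these settings to the toy corpus.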

In [31]:
vectorizer = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))

In [39]:
X_features = vectorizer.fit_transform(corpus)
print(X_features.toarray())

[[1.         0.         0.         0.         0.         0.        ]
 [0.         0.4472136  0.4472136  0.4472136  0.4472136  0.4472136 ]
 [0.         0.         0.70710678 0.70710678 0.         0.        ]
 [0.         0.57735027 0.         0.         0.57735027 0.57735027]
 [1.         0.         0.         0.         0.         0.        ]]


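When to use counts versus tfidf? As a rule of thumb, tfidf tends to help more as the vocabulary size increases, because it down-weights words that appear in most documents and carry little information. The reliable way to choose is to try both and compare the results on held-out data.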