# Machine Learning with Text in scikit-learn

`CountVectorizer` learns the vocabulary of a corpus of text. Fit it to your training data, then inspect the learned vocabulary (the feature names). It does, or can be configured to do, the following:

1. Ignores punctuation
2. Can use stop words to remove common words
3. Tokenises with a regular expression; the default pattern keeps only words of 2 or more characters
4. Converts text to lowercase, so capitalisation is removed
5. Removes duplicate words, so each term appears only once in the feature set
6. Stores the feature names in lowercase, alphabetical order
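
The behaviours listed above can be checked on a couple of short messages. This is a minimal sketch: the custom stop word list (`["me"]`) and the looser `token_pattern` are illustrative choices, not settings used elsewhere in these notes.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Call me a cab", "please call me... PLEASE!"]

# Default settings: lowercase, drop punctuation, keep tokens of 2+ characters
default_vect = CountVectorizer().fit(docs)
default_vocab = sorted(default_vect.vocabulary_)
print(default_vocab)  # ['cab', 'call', 'me', 'please']

# A stop word list removes chosen common words (illustrative list)
stop_vect = CountVectorizer(stop_words=["me"]).fit(docs)
stop_vocab = sorted(stop_vect.vocabulary_)
print(stop_vocab)  # ['cab', 'call', 'please']

# A looser token pattern also keeps single-character tokens such as "a"
single_vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)
single_vocab = sorted(single_vect.vocabulary_)
print(single_vocab)  # ['a', 'cab', 'call', 'me', 'please']
```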

After you've fit the vectorizer, you can transform text into a document-term matrix (DTM). This generates a sparse matrix where the documents are the rows and the terms are the columns.

Each term's occurrence frequency within a document (normalised or not) is treated as a feature.

Printing the scipy sparse matrix shows only the non-zero elements. On the left is the (row, column) position in the matrix, on the right is the value. Sparse representation means you don't store the zeroes, just the non-zero values and their locations.
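
As a rough illustration of what "storing only values and locations" means, the sketch below converts a small dense array of counts into a CSR matrix and lists its stored entries. The variable names are made up for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 1, 0, 0, 1, 1],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 2, 0, 0]])
sparse_counts = csr_matrix(dense)

# Only the non-zero cells are stored: 9 of the 18 cells here
print(sparse_counts.nnz)

# List each stored entry as (row, column, value)
coo = sparse_counts.tocoo()
entries = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for row, col, value in entries:
    print((row, col), value)

# Converting back to dense restores the zeroes
restored = sparse_counts.toarray()
```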

This strategy does not weight words by importance; `TfidfVectorizer` might be better for that purpose.
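
A small sketch of the difference, assuming scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, L2-normalised rows): a word that appears in every message ("call") gets a lower weight than a word that appears in only one ("tonight").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

tfidf = TfidfVectorizer()
tfidf_dtm = tfidf.fit_transform(simple_train)

# Look up the column index of each term in the learned vocabulary
call_idx = tfidf.vocabulary_['call']
tonight_idx = tfidf.vocabulary_['tonight']

# Weights for the first message, 'call you tonight'
row0 = tfidf_dtm.toarray()[0]
print(row0[call_idx], row0[tonight_idx])
```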

If you want to add another column to the feature matrix, you can:

1. Build the feature matrices separately and then `hstack` them together. Convert both to sparse matrices first.
2. Use a feature union (`FeatureUnion`) to combine the outputs of several feature extractors.
3. Train two separate models, one on the text and one on the non-text features, and then ensemble them.
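
The first option might look roughly like the sketch below. The extra feature (message length in characters) is a hypothetical example chosen for illustration; any numeric column with one row per document would work the same way.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# Text features: a 3x6 sparse document-term matrix
vect = CountVectorizer()
text_dtm = vect.fit_transform(simple_train)

# Hypothetical non-text feature: message length, as a 3x1 sparse column
msg_length = np.array([[len(text)] for text in simple_train])
extra = csr_matrix(msg_length)

# Stack the two matrices side by side; rows must line up document by document
combined = hstack([text_dtm, extra], format="csr")
print(combined.shape)  # (3, 7)
```

The combined matrix can be passed to a model's `fit` method just like the original document-term matrix.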

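The vectorizer dropped the word "don't" from the test message because it was not in the vocabulary learned from the training data. Even though this word could be useful, a bag-of-words representation ignores word order and can't capture the relationships between words, so the model could not use it to tell that "don't" negates "call me".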
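
To see where "don't" goes, one can inspect the tokens the vectorizer produces and compare them with the fitted vocabulary. This is a minimal sketch using `build_analyzer()`, which returns the preprocessing and tokenising function the vectorizer applies internally.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vect = CountVectorizer().fit(simple_train)

# Tokenise the test message the same way transform() would
analyze = vect.build_analyzer()
tokens = analyze("please don't call me")
print(tokens)  # ['please', 'don', 'call', 'me']

# Tokens missing from the training vocabulary are silently ignored by transform()
unseen = [t for t in tokens if t not in vect.vocabulary_]
print(unseen)  # ['don']
```

The apostrophe splits "don't" into "don" and "t"; the single-character "t" fails the default token pattern, and "don" was never seen during fitting.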

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

In [2]:
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [3]:
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [4]:
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)
knn.predict([[3, 5, 4, 2]])

array([1])

In [5]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
# example response vector
is_desperate = [0, 0, 1]

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [7]:
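# examine the fitted vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions)
vect.get_feature_names()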

['cab', 'call', 'me', 'please', 'tonight', 'you']

In [8]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [9]:
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

In [10]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |


In [11]:
type(simple_train_dtm)

scipy.sparse.csr.csr_matrix

In [12]:
# examine the sparse matrix contents
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


In [13]:
# build a model to predict desperation
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(simple_train_dtm, is_desperate)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [14]:
# example text for model testing
simple_test = ["please don't call me"]

In [15]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]])

In [16]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


In [17]:
# predict whether simple_test is desperate
knn.predict(simple_test_dtm)

array([1])