## Loading and vectorizing texts with sklearn

Scikit-learn provides vectorizer classes that transform a collection of documents into a matrix of "**bag of words**" representations, with one row per document and one column per word form.

These matrices use a scipy.sparse type, which stores only the non-zero entries and is therefore appropriate for **sparse matrices**.

These vectorizers have 3 main methods:
- fit : builds the vocabulary and the correspondence between word forms and word ids
- transform : transforms the documents into matrices of counts, using the vocabulary built by fit
- fit_transform : performs both actions on the same documents

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer
# a French corpus (to see what is going on with diacritics)
train_corpus = [
     'Ceci est un document.',
     'Ce document est encore un document à moi.',
     'Et voilà le troisième.',
     'Le premier document est-il le plus intéressant?',
 ]
vectorizer = CountVectorizer()
# the vectorizer is not fitted yet: accessing its vocabulary raises an error
#print(vectorizer.vocabulary_)
#print(vectorizer.get_feature_names_out())

# we can fill it using the training set
# and transform the training set into a matrix
X_train = vectorizer.fit_transform(train_corpus)

# the matrix is sparse
print("type of X_train", type(X_train))
print("shape of X_train", X_train.shape) # 4 documents/sentences and 15 unique words/types 
print(X_train)

# here it is as a standard matrix
print(X_train.toarray()) 

type of X_train <class 'scipy.sparse.csr.csr_matrix'>
shape of X_train (4, 15)
  (0, 1)	1
  (0, 4)	1
  (0, 13)	1
  (0, 2)	1
  (1, 4)	1
  (1, 13)	1
  (1, 2)	2
  (1, 0)	1
  (1, 3)	1
  (1, 9)	1
  (2, 5)	1
  (2, 14)	1
  (2, 8)	1
  (2, 12)	1
  (3, 4)	1
  (3, 2)	1
  (3, 8)	2
  (3, 11)	1
  (3, 6)	1
  (3, 10)	1
  (3, 7)	1
[[0 1 1 0 1 0 0 0 0 0 0 0 0 1 0]
 [1 0 2 1 1 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 1 0 0 0 1 0 1]
 [0 0 1 0 1 0 1 1 2 0 1 1 0 0 0]]
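
To make the sparse storage more concrete, here is a minimal sketch that counts how many cells are actually stored and lists them as (row, column, count) triples. It rebuilds the same `train_corpus`; the names `stored`, `total` and `triples` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_corpus = [
    'Ceci est un document.',
    'Ce document est encore un document à moi.',
    'Et voilà le troisième.',
    'Le premier document est-il le plus intéressant?',
]
X_train = CountVectorizer().fit_transform(train_corpus)

# number of stored (non-zero) cells versus total number of cells
n_rows, n_cols = X_train.shape
stored = X_train.nnz
total = n_rows * n_cols
print(f"{stored} stored cells out of {total}")

# the COO view exposes the (row, column, count) triples of the non-zero cells,
# i.e. the same information that print(X_train) displays
coo = X_train.tocoo()
triples = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(triples[:4])
```

On a real corpus the vocabulary has thousands of columns and each document uses only a few of them, so the gap between stored and total cells is much larger than in this toy example.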


In [None]:
# here is the mapping between word forms and ids (our "w2i" in previous lab session) 
print(vectorizer.vocabulary_) 
# the list of word forms (our i2w) 
print(vectorizer.get_feature_names_out()) 

# QUESTIONS:
# What is the size of the vocabulary?
print("\n Size of vocabulary: ", len(vectorizer.vocabulary_))
# What does the 3rd column of X_train.toarray() represent?
## the counts of the word form with id 2 ('document') in each of the 4 documents: [1, 2, 0, 1]
## (a row, not a column, is the bag-of-words vector of one document)
# What is printed when printing the sparse matrix?
## only the non-zero cells, one per line, as "(document index, word id)  count";
## e.g. "(1, 2)  2" means that 'document' occurs twice in the second document. Zero counts are not stored.

{'ceci': 1, 'est': 4, 'un': 13, 'document': 2, 'ce': 0, 'encore': 3, 'moi': 9, 'et': 5, 'voilà': 14, 'le': 8, 'troisième': 12, 'premier': 11, 'il': 6, 'plus': 10, 'intéressant': 7}
['ce' 'ceci' 'document' 'encore' 'est' 'et' 'il' 'intéressant' 'le' 'moi'
 'plus' 'premier' 'troisième' 'un' 'voilà']

 Size of vocabulary:  15


In [None]:
test_corpus = [ 'Ah un nouveau document.',
              'Et ceci est encore un document.']
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.transform(test_corpus)
print("shape of X_test", X_test.shape)

# What happened to the words in test_corpus that are not present in train_corpus?
print(vectorizer.get_feature_names_out())
print(X_test.toarray())
## they are silently ignored: the vocabulary is fixed by fit, so 'ah' and 'nouveau' have no column
## and X_test keeps the 15 columns learned from train_corpus
# Compare to vectorizer.fit_transform
## fit_transform(test_corpus) would rebuild the vocabulary from the test set alone,
## so the columns of X_test would no longer match those of X_train


shape of X_test (2, 15)
['ce' 'ceci' 'document' 'encore' 'est' 'et' 'il' 'intéressant' 'le' 'moi'
 'plus' 'premier' 'troisième' 'un' 'voilà']
[[0 0 1 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 1 1 1 1 1 0 0 0 0 0 0 0 1 0]]
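
The following sketch, under the same toy data, lists the test tokens that fall outside the training vocabulary and shows what refitting on the test set would do instead. `build_analyzer()` returns the vectorizer's own preprocessing and tokenization function; `oov` and `refit_shape` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_corpus = [
    'Ceci est un document.',
    'Ce document est encore un document à moi.',
    'Et voilà le troisième.',
    'Le premier document est-il le plus intéressant?',
]
test_corpus = ['Ah un nouveau document.',
               'Et ceci est encore un document.']

vectorizer = CountVectorizer()
vectorizer.fit(train_corpus)
X_test = vectorizer.transform(test_corpus)

# tokens of each test document that have no column in the training vocabulary
analyze = vectorizer.build_analyzer()
oov = [[tok for tok in analyze(doc) if tok not in vectorizer.vocabulary_]
       for doc in test_corpus]
print("out-of-vocabulary tokens:", oov)

# number of tokens actually counted in each test document
kept = X_test.toarray().sum(axis=1).tolist()
print("tokens kept per document:", kept)

# refitting on the test set builds a different, incompatible vocabulary
refit_shape = CountVectorizer().fit_transform(test_corpus).shape
print("shape after refitting on test_corpus:", refit_shape)
```

This is why the usual practice is to call fit (or fit_transform) on the training set only, and transform on any later data.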


In [None]:
train_corpus = [
     'Ceci est un document .',
     'Ce document est encore un document à moi .',
     'Et voilà le troisième .',
     'Le premier document est -il le plus intéressant ?',
 ]

# QUESTIONS: 

# How can you change the tokenization that the CountVectorizer will use (see its constructor),
# in particular, how to split on spaces only
#  (which corresponds to supposing texts were already tokenized)
# Indications: study 
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# to see all the members of the instance, 
# and deduce which member to modify:
print("\nMEMBERS:\n", "\n".join([ str(x) for x in vectorizer.__dict__.items()]))

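## pass token_pattern=r'\S+' to the constructor (or tokenizer=str.split) to split on whitespace only;
## the default pattern keeps only tokens of 2 or more word characters, which is why 'à' and the punctuation were dropped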

# Which parameters can you modify to switch to character bigram and trigram features?

## set analyzer='char' (or 'char_wb' to build n-grams inside word boundaries only) and ngram_range=(2, 3)

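# Look up what TF.IDF weighting is (very famous)

## each count (term frequency) is multiplied by an inverse document frequency,
## which lowers the weight of words that occur in many documents

# Study the TfidfVectorizer class
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# and deduce how to obtain TF.IDF weighted vector representations of the documents

## use TfidfVectorizer in place of CountVectorizer: fit_transform returns the TF.IDF weighted matrix;
## the attribute idf_ only holds the IDF weight of each vocabulary entry, not the document vectors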


MEMBERS:
 ('input', 'content')
('encoding', 'utf-8')
('decode_error', 'strict')
('strip_accents', None)
('preprocessor', None)
('tokenizer', None)
('analyzer', 'word')
('lowercase', True)
('token_pattern', '(?u)\\b\\w\\w+\\b')
('stop_words', None)
('max_df', 1.0)
('min_df', 1)
('max_features', None)
('ngram_range', (1, 1))
('vocabulary', None)
('binary', False)
('dtype', <class 'numpy.int64'>)
('fixed_vocabulary_', False)
('_stop_words_id', 4347562040)
('stop_words_', set())
('vocabulary_', {'ceci': 1, 'est': 4, 'un': 13, 'document': 2, 'ce': 0, 'encore': 3, 'moi': 9, 'et': 5, 'voilà': 14, 'le': 8, 'troisième': 12, 'premier': 11, 'il': 6, 'plus': 10, 'intéressant': 7})
