#### Text Vectorization

Text vectorization is the process of converting text into a numerical representation.

Here are some popular methods to accomplish text vectorization:

1. Binary Term Frequency: captures the presence (1) or absence (0) of a term in a document.

2. Bag of Words (BoW) Term Frequency: captures the number of times a term occurs in a document.

3. (L1) Normalized Term Frequency: captures the BoW term frequency divided by the document's total term count, so the values for each document sum to 1.

4. (L2) Normalized TF-IDF (Term Frequency–Inverse Document Frequency): captures the TF-IDF weight of each term, with each document vector scaled to unit length.

5. Word2Vec: provides an embedded (dense vector) representation of words. Word2Vec starts with a one-hot representation of every word in the corpus and trains a neural network (NN) with one hidden layer on a very large corpus of data.

Here are the two methods that are typically used for training the NN:
• Continuous Bag of Words (CBOW): predicts the center/target word from a window of surrounding context words.
• Skip-Gram (SG): predicts the window of context words from the center/target word.
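
To make the difference between the two training setups concrete, here is a minimal pure-Python sketch of how training examples could be generated from a token list. It does not train a network; the helper names (`context_windows`, `cbow_examples`, `skipgram_examples`) and the window size are illustrative assumptions, not part of any library.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for every position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]


tokens = "the quick brown fox jumped".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW produces one example per position (many inputs, one target), while skip-gram produces one example per (center, context) pair.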

#### Case Study: Book Recommendations from Charles Darwin

Data

Charles Darwin is one of the most famous scientists in the world. Besides “On the Origin of Species,” he wrote many other books on a wide range of topics, including geology, plants, and his personal life. In this project, we will develop a content-based book recommendation system that determines which books are close to each other based on how similar their discussed topics are.

##### Text Preprocessing

As the first step, we load the content of each book and use a regular expression to remove all non-alphanumeric characters, which simplifies the later steps. We call such a collection of texts a corpus.

Next, we tokenize the corpus, transforming each text into a list of individual word tokens.
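
The cleaning and tokenization steps could look roughly like the sketch below. The two sample strings stand in for full book texts, and the `clean_text` and `tokenize` helpers, as well as the lowercasing step, are assumptions made for illustration.

```python
import re


def clean_text(raw):
    """Replace runs of non-alphanumeric characters with a space.

    Lowercasing is an extra assumption; the text above only mentions
    removing non-alphanumeric characters.
    """
    return re.sub(r"[^A-Za-z0-9]+", " ", raw).lower().strip()


def tokenize(text):
    """Split cleaned text on whitespace into word tokens."""
    return text.split()


# Hypothetical stand-ins for the full content of each book.
raw_books = {
    "OriginofSpecies": "On the Origin of Species, by means of Natural Selection!",
    "VoyageBeagle": "The Voyage of the Beagle: a naturalist's journal.",
}

corpus_tokens = {title: tokenize(clean_text(text)) for title, text in raw_books.items()}

print(corpus_tokens["VoyageBeagle"])
# ['the', 'voyage', 'of', 'the', 'beagle', 'a', 'naturalist', 's', 'journal']
```

Note that stripping the apostrophe leaves a stray `s` token; a real pipeline would usually filter such fragments out together with stop words.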

For the next part of text preprocessing, we apply stemming, which groups together the inflected forms of a word so they can be analyzed as a single item: the stem. To save time, we load the final stemmed results directly from a pickle file and then review the method used to generate them.
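
As a rough illustration of what stemming does, the sketch below uses a deliberately crude suffix stripper. It is only a stand-in: `toy_stem` and its suffix list are assumptions for this example, and a real pipeline would rely on an established algorithm such as the Porter stemmer.

```python
def toy_stem(token, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem.

    This is a toy rule set for illustration, not a real stemming algorithm.
    """
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


tokens = ["selected", "selecting", "selects", "select", "plants", "sea"]
stems = [toy_stem(t) for t in tokens]

print(stems)  # ['select', 'select', 'select', 'select', 'plant', 'sea']
```

All four inflected forms of "select" collapse to one stem, so later counting steps treat them as the same item.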

#### Text Vectorization

##### Bag-of-Words Models (BoW)

First, we create a universe of all the words contained in our corpus of Charles Darwin’s books, which we call a dictionary. Then, using the stemmed tokens and the dictionary, we create bag-of-words (BoW) models that represent each book as a list of the unique tokens it contains, each paired with its number of occurrences.
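
A minimal sketch of the dictionary and the bag-of-words representation, using only the standard library. The two short token lists are hypothetical stand-ins for stemmed books, and `token2id` and `to_bow` are illustrative names.

```python
from collections import Counter

# Hypothetical stemmed tokens for two books.
stemmed_books = {
    "BookA": ["speci", "select", "natur", "speci"],
    "BookB": ["plant", "speci", "plant", "climb"],
}

# Dictionary: every unique token in the corpus mapped to an integer id.
vocabulary = sorted({tok for toks in stemmed_books.values() for tok in toks})
token2id = {tok: i for i, tok in enumerate(vocabulary)}


def to_bow(tokens):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(tokens)
    return sorted((token2id[tok], n) for tok, n in counts.items())


bows = {title: to_bow(toks) for title, toks in stemmed_books.items()}

print(token2id)       # {'climb': 0, 'natur': 1, 'plant': 2, 'select': 3, 'speci': 4}
print(bows["BookA"])  # [(1, 1), (3, 1), (4, 2)]
```

Each book is reduced to the ids of the tokens it contains and how often each occurs; word order is discarded.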

##### TF-IDF Model

Next, we use a TF-IDF model to weight each word by how frequent it is in a given text and how rare it is across the rest of the corpus. As a result, a high TF-IDF score for a word indicates that the word is specific to that text.
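
The weighting can be sketched in a few lines of plain Python. This uses one common variant (raw count times log2 of the inverse document frequency, followed by L2 normalization); other variants exist, and the three tiny documents and the `tfidf` helper are assumptions for illustration.

```python
import math

# Hypothetical bag-of-words counts for three books.
bows = {
    "BookA": {"speci": 2, "select": 1, "natur": 1},
    "BookB": {"speci": 1, "plant": 2, "climb": 1},
    "BookC": {"speci": 1, "coral": 3},
}

n_docs = len(bows)

# Document frequency: in how many books does each token appear?
doc_freq = {}
for counts in bows.values():
    for tok in counts:
        doc_freq[tok] = doc_freq.get(tok, 0) + 1


def tfidf(counts):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weights = {tok: n * math.log2(n_docs / doc_freq[tok]) for tok, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else weights


scores = {title: tfidf(counts) for title, counts in bows.items()}

for title, weights in scores.items():
    print(title, {tok: round(w, 3) for tok, w in weights.items()})
```

The token `speci` appears in every book, so its weight is zero everywhere, while `coral`, which occurs only in `BookC`, carries that book's entire weight.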

##### Recommendation

Now that the TF-IDF model tells us how specific each word is to each book, we can measure how related the books are to one another. To do so, we use cosine similarity and visualize the results as a similarity matrix.
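
Cosine similarity between every pair of books can be computed from a matrix with one TF-IDF row per book. The sketch below uses NumPy; the 3×3 matrix is made-up data, `cosine_similarity_matrix` is an illustrative helper, and it assumes no row is all zeros.

```python
import numpy as np

titles = ["BookA", "BookB", "BookC"]  # hypothetical labels

# One row per book, one column per token (made-up TF-IDF weights).
tfidf_matrix = np.array([
    [0.8, 0.6, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])


def cosine_similarity_matrix(X):
    """Return the matrix of cosine similarities between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # assumes no zero rows
    unit = X / norms
    return unit @ unit.T


sim = cosine_similarity_matrix(tfidf_matrix)
print(np.round(sim, 2))
# sim[0, 1] is about 0.96 (similar books); sim[0, 2] is 0.0 (no shared tokens)
```

The result is symmetric with ones on the diagonal, since every book is perfectly similar to itself.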

##### Conclusion

We now have a matrix containing the similarity measures between every pair of Charles Darwin’s books. We can use barh() to display a horizontal bar plot of the books that are most similar to “On the Origin of Species.”
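
The data behind such a bar plot is simply one row of the similarity matrix, sorted. The sketch below prepares that ranking with NumPy; the titles other than `OriginofSpecies`, the similarity values, and the `most_similar` helper are all made up for illustration.

```python
import numpy as np

titles = ["OriginofSpecies", "BookB", "BookC", "BookD"]  # hypothetical labels

# Made-up symmetric similarity matrix.
sim = np.array([
    [1.0, 0.2, 0.7, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])


def most_similar(title, titles, sim):
    """Return (other_title, similarity) pairs in ascending order of similarity."""
    i = titles.index(title)
    order = np.argsort(sim[i])
    return [(titles[j], float(sim[i, j])) for j in order if j != i]


ranked = most_similar("OriginofSpecies", titles, sim)
print(ranked)  # [('BookB', 0.2), ('BookD', 0.4), ('BookC', 0.7)]
```

Passing these labels and scores to matplotlib's `barh` draws the bars from the bottom up, so the ascending order leaves the most similar book at the top of the plot.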

However, we also want a better understanding of the big picture: how Darwin’s books are generally related to each other. To this end, we represent the whole similarity matrix as a dendrogram, which is a standard tool for displaying such data.
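
One way to build such a dendrogram is hierarchical clustering with SciPy. The sketch below makes several assumptions: the similarity values are made up, similarities are turned into distances as `1 - similarity`, and average linkage is used. `no_plot=True` returns the tree layout without drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

titles = ["OriginofSpecies", "BookB", "BookC", "BookD"]  # hypothetical labels

# Made-up symmetric similarity matrix.
sim = np.array([
    [1.0, 0.2, 0.7, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])

# Assumption: treat 1 - similarity as the distance between two books.
distance = 1.0 - sim
np.fill_diagonal(distance, 0.0)

# linkage expects the condensed (upper-triangle) form of the distance matrix.
Z = linkage(squareform(distance), method="average")

# no_plot=True computes the layout only; drop it to draw with matplotlib.
tree = dendrogram(Z, labels=titles, no_plot=True)

print(Z[0])         # first merge: the two closest books and their distance
print(tree["ivl"])  # leaf labels in the order they appear along the axis
```

Books that merge low in the tree are the most similar; here the first merge joins `OriginofSpecies` and `BookC`, the pair with the highest similarity (0.7).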

Finally, based on the chart we created earlier, we can conclude that “The Variation of Animals and Plants under Domestication” is the book most closely related to “On the Origin of Species.”

#### Machine Learning Text Processing and Vectorization

In [1]:
# Load the training split of the 20 Newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

Types of data:
1. Structured data: data in numeric form, e.g. a CSV file that contains only numbers.
2. Unstructured data: data not in numeric form, e.g. images or text.

##### Vectorization:

The process of converting unstructured data into structured data is called vectorization. For text, vectorization means converting human-readable language into a meaningful numeric (vector) representation.

We cannot work with text directly when using machine learning algorithms. Instead, we need to convert the text to numbers.

The methods used are:

- Bag of words (BoW) model: A simple and effective model for thinking about text documents in ML. Each word is assigned a unique number, and a document is represented by the counts of its words.

- Count Vectorizer: Provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

- TF-IDF Vectorizer: Like the Count Vectorizer, but weights each word count by how rare the word is across the documents.

In [2]:
# an example of using the CountVectorizer

In [3]:
text = ['The quick brown fox jumped over the lazy dog.']

In [4]:
text

['The quick brown fox jumped over the lazy dog.']

Tokenization refers to splitting a larger body of text into smaller units such as lines or words. For languages that do not separate words with spaces, it also means working out where the word boundaries are.
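
To see the tokens that `CountVectorizer` would extract with its default settings, one option is to call its analyzer directly. This is a small sketch; the second sentence is a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenization function
# that the vectorizer applies to each document.
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("The quick brown fox jumped over the lazy dog.")
print(tokens)
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

short = analyzer("I saw a 2-door car")
print(short)  # ['saw', 'door', 'car']
```

By default the analyzer lowercases the text, drops punctuation, and keeps only tokens of two or more characters, which is why "I", "a", and "2" disappear in the second example.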

In [5]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [6]:
len(twenty_train.target) # number of target labels (y), one per document

11314

In [7]:
twenty_train.target

array([7, 4, 4, ..., 3, 1, 8])

In [8]:
set(twenty_train.target)

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [9]:
len(twenty_train.data) # number of input records (X)

11314

In [10]:
twenty_train.data[0:5]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
#list of text documents
text = ['The quick brown fox jumped over the lazy dog.']

In [13]:
text

['The quick brown fox jumped over the lazy dog.']

In [14]:
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
<class 'scipy.sparse._csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


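CountVectorizer converts your text into numbers: each column of the vector corresponds to one vocabulary word (at the index shown in the vocabulary above), and each value is the number of times that word occurs in the document. Here, 'the' (index 7) occurs twice.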

In [15]:
vectorizer.vocabulary_

{'the': 7,
 'quick': 6,
 'brown': 0,
 'fox': 2,
 'jumped': 3,
 'over': 5,
 'lazy': 4,
 'dog': 1}

In [16]:
print(vectorizer.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [17]:
# Encode another document; 'puppy' is not in the fitted vocabulary, so it is ignored
text2 = ['the puppy']
vector = vectorizer.transform(text2)
print(vector.toarray())

[[0 0 0 0 0 0 0 1]]


In [18]:
# Encode another document; 'brown' (index 0) occurs twice, 'puppy' is still ignored
text2 = ['the brown puppy brown']
vector = vectorizer.transform(text2)
print(vector.toarray())

[[2 0 0 0 0 0 0 1]]
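
The two encodings above contain no entry for "puppy" because `transform` only counts words that were in the vocabulary at fit time. The sketch below makes the dropped words visible; `unknown` is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(['The quick brown fox jumped over the lazy dog.'])

new_doc = 'the brown puppy brown'

# Tokenize the new document the same way the vectorizer does,
# then keep the tokens that are missing from the fitted vocabulary.
analyzer = vectorizer.build_analyzer()
unknown = [tok for tok in analyzer(new_doc) if tok not in vectorizer.vocabulary_]

print(unknown)                                    # ['puppy']
print(vectorizer.transform([new_doc]).toarray())  # [[2 0 0 0 0 0 0 1]]
```

If unseen words matter for the task, the vectorizer has to be fitted on a corpus that contains them.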


In [19]:
text

['The quick brown fox jumped over the lazy dog.']

Working with text documents: preprocessing and the different types of vectorization

Types of data:
1. Structured: in numeric form, e.g. a CSV file that contains only numbers.
2. Unstructured: not in numeric form, e.g. images or text data.

Vectorization: the process of converting text (unstructured, human-readable language) into meaningful numeric vectors.

We cannot work with text directly when using ML algorithms. Instead, we need to convert the text to numbers.

A bag of words (BoW):
- A way of extracting features from text for use in modelling, such as with ML algorithms. It is simple and flexible to use.

- It is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words and a measure of the presence of those words.

- A simple and effective model for thinking about text documents in ML.

Vectorization methods:
1. Bag-of-words model.
2. Count vectorizer.
3. TF-IDF vectorizer.

Word Counts with CountVectorizer: The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

In [20]:
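from sklearn.feature_extraction.text import CountVectorizer
# list of text documents: reuses `text` defined in the earlier cells
# create the transform
vectorizer = CountVectorizer()
# tokenize, build the vocabulary, and encode the documents in one step
vector = vectorizer.fit_transform(text)
# summarize: the shape, contents, and vocabulary are printed in the next cells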

In [21]:
print(vector.shape)
print(vector.toarray())

(1, 8)
[[1 1 1 1 1 1 1 2]]


In [22]:
print(vectorizer.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}

