# Text Vectorization
Machine learning algorithms most often take numeric feature vectors as input. Thus, when working with text documents, we need a way to convert each document into a numeric vector. This process is known as text vectorization.

## Part 1

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In this lab, you will convert the corpus below into a document-term matrix using several vectorization methods.

In [None]:
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?'
]

### Exercise 1
- Vectorize the corpus using the bag-of-words model, where each entry is a raw term count.

In [None]:
### START CODE HERE ###
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray()
### END CODE HERE ###

#### Expected output
```
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
```
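
To see where each column of the expected output comes from, here is a minimal pure-Python sketch of the same counting. It assumes `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the `tokenize` helper is a hypothetical name introduced only for this illustration.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(text):
    # Assumption: mimic the default tokenizer (lowercase, words of 2+ characters).
    return re.findall(r"\b\w\w+\b", text.lower())

# Columns are the sorted unique terms across the whole corpus.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Each row counts how often every vocabulary term occurs in one document.
matrix = [[Counter(tokenize(doc))[term] for term in vocabulary] for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

The second row contains a 2 in the `document` column because that word appears twice in the second sentence.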

### Exercise 2
- Vectorize the corpus using bag-of-words with binary (one-hot style) encoding, where each entry records only whether a term is present.

In [None]:
### START CODE HERE ###
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)
X.toarray()
### END CODE HERE ###

#### Expected output
```
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
```
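
The only difference from Exercise 1 is that counts above 1 are clipped to 1. A small sketch, starting from the count matrix shown in the Exercise 1 expected output, illustrates this:

```python
# Count matrix copied from the Exercise 1 expected output.
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

# Binary encoding keeps only presence (1) or absence (0) of each term.
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

for row in binary:
    print(row)
```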

### Exercise 3
- Vectorize the corpus using a bag of n-grams that covers both unigrams and bigrams (the expected output has 22 columns: 9 unigrams plus 13 bigrams)

In [None]:
### START CODE HERE ###
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
X.toarray()
### END CODE HERE ###

#### Expected output
```
array([[0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 2, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1]],
      dtype=int64)
```
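
The 22 columns can be reproduced by hand. The sketch below builds word n-grams in plain Python, again assuming the default tokenization (lowercase, tokens of two or more word characters); `tokenize` and `ngrams` are hypothetical helper names used only for this illustration.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(text):
    # Assumption: mimic the default tokenizer (lowercase, words of 2+ characters).
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n_min, n_max):
    # Slide a window of every size from n_min to n_max over the token list.
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Unigrams plus bigrams, as in ngram_range=(1, 2).
vocab_uni_bi = sorted({g for doc in corpus for g in ngrams(tokenize(doc), 1, 2)})
# Bigrams only, as in ngram_range=(2, 2).
vocab_bi = sorted({g for doc in corpus for g in ngrams(tokenize(doc), 2, 2)})

print(len(vocab_uni_bi), len(vocab_bi))
print(ngrams(tokenize(corpus[0]), 2, 2))
```

Restricting the range to bigrams alone would drop the 9 single-word columns and leave 13.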

### Exercise 4
- Vectorize the corpus using the same unigram-plus-bigram features with binary (one-hot style) encoding

In [None]:
### START CODE HERE ###
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(corpus)
X.toarray()
### END CODE HERE ###

#### Expected output
```
array([[0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1]],
      dtype=int64)
```

### Exercise 5
- Vectorize the corpus using TF-IDF weighting

In [None]:
### START CODE HERE ###
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray()
### END CODE HERE ###

#### Expected output
```
array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
```
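
The numbers above can be derived by hand. The following sketch assumes `TfidfVectorizer`'s default settings: raw term counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `tokenize` helper is a hypothetical name introduced for this illustration.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(text):
    # Assumption: mimic the default tokenizer (lowercase, words of 2+ characters).
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(d)) for d in corpus]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(t in d for d in docs) for t in vocabulary}
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

tfidf = []
for d in docs:
    weights = [d[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))  # L2 norm of the row
    tfidf.append([w / norm for w in weights])

for row in tfidf:
    print([round(w, 8) for w in row])
```

Terms that occur in every document (`is`, `the`, `this`) get the minimum IDF of 1, which is why they carry the smallest nonzero weights in each row.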

## Part 2

In [None]:
import gensim
import csv
import re
import jieba
import numpy as np

### Exercise 6
- Use the CBOW algorithm in gensim to train a word embedding model on a custom corpus.
- Set the embedding dimension to 300.
- Save the model as mymodel.
- The training corpus is news-8000.csv, whose columns are newsid, title, and content.
- You may need to preprocess the corpus further before training.

In [None]:
### START CODE HERE ###
hRe = re.compile(r'<.*?>') # search pattern for HTML tags
nRe = re.compile(r'\d*\.?\d+') # search pattern for integer or decimal numbers
sRe = re.compile(r'\s+') # search pattern for whitespace characters

class MySentences(object):
    def __init__(self, infile):
        self.infile = infile
 
    def __iter__(self):
        with open(self.infile, 'r', encoding='utf8') as csvfile:
            csvreader = csv.reader(csvfile)
            for row in csvreader:
                title = row[1]
                content = row[2]
                text = f'{title}。{content}'
                text = hRe.sub('', text)
                text = nRe.sub('NUM', text)
                text = text.replace('&nbsp;', '')
                yield [x for x in jieba.cut(text) if not sRe.search(x)]

sentences = MySentences('./news-8000.csv') 
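# sg=0 selects CBOW explicitly (it is also the default).
# gensim >= 4.0 names the dimension parameter vector_size; gensim 3.x calls it size.
model = gensim.models.Word2Vec(sentences, vector_size=300, sg=0)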
model.save('mymodel')
### END CODE HERE ###
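
To check what the cleaning steps do before tokenization, the sketch below applies the same two patterns and the `&nbsp;` removal to a single string. The sample text is a hypothetical example and is not taken from news-8000.csv.

```python
import re

# Same patterns as in the solution above.
hRe = re.compile(r'<.*?>')       # HTML tags
nRe = re.compile(r'\d*\.?\d+')   # integer or decimal numbers

# Hypothetical sample string, for illustration only.
raw = '<p>Revenue rose 3.5% to 120&nbsp;million</p>'

text = hRe.sub('', raw)            # strip the <p> and </p> tags
text = nRe.sub('NUM', text)        # collapse every number into one shared token
text = text.replace('&nbsp;', '')  # drop leftover HTML entities
print(text)  # Revenue rose NUM% to NUMmillion
```

Replacing all numbers with a single `NUM` token keeps the vocabulary small, since individual numeric values rarely occur often enough to learn a useful embedding.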

#### After model training is complete:
- Load the model from disk
- Output the top 5 words most similar to 美团

In [None]:
### START CODE HERE ###
model = gensim.models.Word2Vec.load('mymodel')
print(model.wv.most_similar('美团', topn=5))
### END CODE HERE ###

- Output the similarity between 华为 and 苹果

In [None]:
### START CODE HERE ###
print(model.wv.similarity('华为', '苹果'))
### END CODE HERE ###
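
The similarity reported here is the cosine of the angle between the two word vectors. A minimal sketch with toy 3-dimensional vectors (hypothetical values, not taken from the trained model) shows the computation:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; real embeddings here have 300 dimensions.
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 1.0, 2.0]

print(cosine_similarity(vec_a, vec_b))
```

Values close to 1 mean the two words appear in similar contexts in the training corpus; values near 0 mean their contexts are largely unrelated.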

- Vectorize the document below by averaging the vectors of all the words in it

In [None]:
document = '谷歌追踪用户高度隐私信息发广告，在欧洲遭多起诉讼'

In [None]:
### START CODE HERE ###
tokens = jieba.cut(document)
# Average only the tokens that exist in the model vocabulary.
doc_vector = np.average([model.wv[word] for word in tokens if word in model.wv], axis=0)
print(doc_vector)
### END CODE HERE ###
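
The averaging step can be tried without a trained model. The sketch below uses a toy embedding table in plain Python; the vectors, the token list, and the `average_vector` helper are hypothetical and serve only to show how out-of-vocabulary tokens are skipped and how an empty result might be handled.

```python
def average_vector(tokens, embeddings, dim):
    # Keep only the tokens that have an embedding.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: fall back to a zero vector when no token is known.
        return [0.0] * dim
    # Average each dimension across the remaining word vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy 2-dimensional embeddings for illustration only.
toy_embeddings = {
    '谷歌': [1.0, 0.0],
    '广告': [0.0, 1.0],
    '欧洲': [1.0, 1.0],
}

tokens = ['谷歌', '追踪', '广告', '欧洲']  # '追踪' is out of vocabulary here
print(average_vector(tokens, toy_embeddings, dim=2))
```

Averaging discards word order, so two documents containing the same words in different arrangements receive the same vector.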