# 16 Vector models of text, Exercises

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

To install spaCy and its English model, uncomment and execute the following command.

In [None]:
# !pip install spacy && python -m spacy download en_core_web_sm

## Lesson 1

### Exercises

1\. Describe in writing why stopword removal is required when a vectorized model of a text is prepared.

2\. Describe in writing what stemming and lemmatization are. For what purpose are they used?

3\. Describe in writing what n-grams are and why using them may improve a text model.

## Lesson 2

In [None]:
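import nltk

# Uncomment if the NLTK sentence tokenizer data has not been downloaded yet
# nltk.download('punkt')

text = """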
We are reading about Natural Language Processing Here. What an interesting topic of data science! 
Natural Language Processing makes computers to comprehend language data. The field 
of Natural Language Processing is evolving everyday."""

sentences = nltk.sent_tokenize(text)
print(sentences)

In [None]:
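from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
# The stopword list requires the NLTK 'stopwords' corpus to be downloaded
from nltk.corpus import stopwords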

stemmer = SnowballStemmer(language = 'english')
tokenizer = TfidfVectorizer().build_tokenizer()

wh_words = set(['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'])
stop = set(stopwords.words('english')) - wh_words

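def stemmed_tokenizer(doc):
    # Drop stopwords first, then reduce the remaining tokens to their stems
    tokens = [tok for tok in tokenizer(doc) if tok not in stop]
    stem_tokens = [stemmer.stem(tok) for tok in tokens]
    return stem_tokens

# token_pattern=None: the default pattern is unused when a custom tokenizer is supplied
tfidf_vectorizer = TfidfVectorizer(tokenizer=stemmed_tokenizer, token_pattern=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)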

The results of TF-IDF vectorization of the preprocessed corpus are shown below.

The vocabulary is the same as the one `CountVectorizer` would produce; however, the
entries are now TF-IDF weights rather than raw counts, so their values differ for the various terms across the documents:

In [None]:
print(tfidf_vectorizer.get_feature_names_out(), "\n")
print(tfidf_matrix.toarray())

Additionally, each vector, i.e., each row of the TF-IDF matrix, has been normalized: after computation, each vector element
has been divided by the Euclidean norm of the vector (also called the L2 norm).

The resulting vectors have unit Euclidean length.

This simplifies the subsequent computation of distances between the vectors: for unit vectors, cosine similarity reduces to a dot product.

This normalization can be switched off by passing `norm=None` to `TfidfVectorizer`.
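
The effect of the normalization can be verified numerically. The following is a minimal sketch under stated assumptions: the corpus `docs` is made up for the illustration, and the default scikit-learn tokenizer replaces the stemmed tokenizer used above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for this illustration
docs = [
    "natural language processing is interesting",
    "language processing makes computers comprehend language",
    "the field is evolving",
]

# Default setting norm='l2': every row is scaled to unit Euclidean length
X_l2 = TfidfVectorizer().fit_transform(docs)
row_norms = np.linalg.norm(X_l2.toarray(), axis=1)
print("Row norms with norm='l2':", row_norms)

# For unit vectors the dot product coincides with cosine similarity
dot_products = (X_l2 @ X_l2.T).toarray()
print("Dot product equals cosine similarity:",
      np.allclose(dot_products, cosine_similarity(X_l2)))

# With norm=None the raw TF-IDF weights are kept, so the lengths differ from 1
X_raw = TfidfVectorizer(norm=None).fit_transform(docs)
raw_norms = np.linalg.norm(X_raw.toarray(), axis=1)
print("Row norms with norm=None:", raw_norms)
```

Note that cosine similarity itself does not depend on the vector lengths, so switching the normalization off changes the dot products but not the cosine similarities.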

### Exercises

1\. Describe in writing the key differences between BoW and TF-IDF models of text.

2\. Describe in writing the idea behind word embedding. What are its advantages over other vectorization techniques?

3\. Come up with two sentences with high cosine similarity and two whose similarity is exactly zero. Compute these similarities using the code used above.

4\. Compute word mover's distances for the sentences from the previous exercise. Use a Word2vec model trained on the `text8` corpus or download the pretrained model `glove-wiki-gigaword-50`. Compare the distances with the cosine similarities. Which method produces more reasonable results?

5\. Below you will find a piece of text. Split it into sentences and create a BoW model. Above we used stemming for the analogous model; use lemmatization instead. Do not forget that lemmatization may require whole sentences to identify parts of speech. This means that stopword removal must be done after lemmatization.

In [None]:
text = """The skull and the upper bones lay beside it in the thick dust, 
and in one place, where rain-water had dropped through a leak in the 
roof, the thing itself had been worn away. Further in the gallery was 
the huge skeleton barrel of a Brontosaurus. My museum hypothesis was 
confirmed. Going towards the side I found what appeared to be sloping 
shelves, and clearing away the thick dust, I found the old familiar 
glass cases of our own time. But they must have been air-tight to 
judge from the fair preservation of some of their contents."""

6\. Below you will find a list of tweets. Create a TF-IDF model for them. For tokenization, use the `TweetTokenizer` provided by NLTK. Using cosine similarity, find the two most similar tweets.

In [None]:
tweets = [
"@Tatiana_K nope they didn't have it ",
"@twittera que me muera ? ",
"spring break in plain city... it's snowing ))) ",
"I just re-pierced my ears ",
"@caregiving I couldn't bear to watch it.  And I thought the UA losssssss was embarrassing . . . . .",
"@octolinz16 It it counts, idk why I did either. you never talk to me anymore ",
"@smarrison i would've been the first, but i didn't have a gun.    not really though, zac snyder's just a doucheclown.",
"@iamjazzyfizzle I wish I got to watch it with you!! I miss you and @iamlilnicki  how was the premiere?!",
"Hollis' death scene will hurt me severely to watch on film  wry is directors cut not out now?",
"about to file taxes ",
"@LettyA ahh ive always wanted to see rent  love the soundtrack!!",
"@FakerPattyPattz Oh dear. Were you drinking out of the forgotten table drinks? ",
"@alydesigns i was out most of the day so didn't get much done ;) ",
"one of my friend called me, and asked to meet with her at Mid Valley today...but i've no time *sigh* "]