### Exercises

1\. Describe in writing why stopword removal is required when a vectorized model of a text is prepared.

2\. Describe in writing what stemming and lemmatization are. For what purpose are they used?

3\. Describe in writing what n-grams are and why using them may improve a text model.

1. Stopwords appear frequently in a text but carry almost no meaningful information. Removing them reduces the number of tokens the model has to process, which speeds up training. If they are kept in the vectorized model, they add significant noise that can keep the model from identifying the important patterns and relationships in the text data.
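
As a minimal sketch of the idea, the snippet below filters a sentence against a small stopword set. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names introduced only for this illustration; a real pipeline would more likely use a library-provided list.

```python
# Hypothetical, hand-picked stopword list used only for illustration;
# real projects would usually take one from a library such as NLTK.
STOPWORDS = {"the", "a", "an", "in", "of", "and", "is", "to"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]


sentence = "The skull and the upper bones lay in the thick dust"
tokens = remove_stopwords(sentence)
print(tokens)  # ['skull', 'upper', 'bones', 'lay', 'thick', 'dust']
print(len(sentence.split()), "->", len(tokens))  # 11 -> 6
```

Nearly half of the tokens disappear here, which is the reduction in processed tokens that the answer refers to.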


2. *   **Stemming** converts words to a base form by stripping prefixes or suffixes. The goal is to cut related words down to a common root; the resulting stem does not have to be a real word. **Lemmatization** reduces a word to its dictionary form (lemma), which is always a meaningful word. It relies on dictionary lookups and linguistic analysis, such as identifying the part of speech.
   *   Both techniques address the problem of multiple forms of the same word. Reducing words to a base form standardizes the vocabulary and shrinks the feature space, which can improve the accuracy of tasks such as text classification, sentiment analysis, and information retrieval.
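
The contrast between the two techniques can be sketched with a toy example. The suffix list, the lemma dictionary, and the helpers `toy_stem` and `toy_lemmatize` below are all assumptions made up for illustration; real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Toy rules and a toy dictionary, both hypothetical; real stemmers
# (e.g. Porter) and WordNet-based lemmatizers are far more thorough.
SUFFIXES = ("ing", "ies", "ed", "es", "s")
LEMMAS = {"studies": "study", "studying": "study", "was": "be", "better": "good"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ["studies", "studying", "was", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

In this sketch the stemmer maps "studies" to "stud" and "studying" to "study", so the two forms are not unified, while the dictionary lookup returns "study" for both and can also handle irregular forms such as "was".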


3. N-grams are contiguous sequences of n items (usually words) in a body of text: an n-gram consists of n adjacent words in a sentence. Because they capture short sequences of words, n-grams let a model see the context in which words appear and derive meaning from the relationships between adjacent words. Using n-grams can therefore improve a text model by exposing contextual relationships and providing more informative features for text analysis tasks.
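
A short sketch of n-gram extraction, assuming whitespace tokenization; the helper name `ngrams` and the example sentence are chosen here for illustration only.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the film was not good".split()
print(ngrams(tokens, 1))  # unigrams: each word on its own
print(ngrams(tokens, 2))  # bigrams: pairs of adjacent words
```

The bigram `('not', 'good')` keeps the negation attached to the word it modifies, which a unigram model would lose.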

### Exercises

4\. Describe in writing the key differences between BoW and TF-IDF models of text.

5\. Describe in writing the idea behind word embedding. What are its advantages compared with other vectorization techniques?

6\. Come up with two sentences with high cosine similarity and two whose similarity is exactly zero. Compute these similarities using the code that has been used above.

7\. Compute word mover's distances for the sentences from the previous exercise. Use a Word2vec model trained on the `text8` corpus or download the pretrained model `glove-wiki-gigaword-50`. Compare the distances with cosine similarity. Which method produces more reasonable results?

4. BoW treats all words equally and looks only at how often each word occurs in a document. TF-IDF additionally weights each term by its inverse document frequency (IDF), so a word that appears in many documents of the corpus counts less than a word concentrated in a few documents. BoW counts may be normalized by document length to reduce the bias towards longer documents, whereas TF-IDF rescales the term frequency according to how common the term is across the corpus.
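
The difference can be sketched by computing both models by hand on a tiny corpus. The corpus and the helper names `bow` and `tfidf` are hypothetical, and the formula used is the plain textbook one (tf = count / document length, idf = log(N / df)); library implementations such as scikit-learn's use a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter

# Tiny hypothetical corpus; each document is already tokenized.
docs = [
    "the moon is bright".split(),
    "the sky is dark".split(),
    "the moon fills the sky".split(),
]


def bow(doc):
    """Bag of words: raw term counts for one document."""
    return Counter(doc)


def tfidf(doc, corpus):
    """Textbook TF-IDF: tf = count / len(doc), idf = log(N / df)."""
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(doc).items():
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        weights[term] = (count / len(doc)) * math.log(n_docs / df)
    return weights


print(bow(docs[2]))
print(tfidf(docs[2], docs))
```

In the third document "the" has the highest BoW count, yet its TF-IDF weight is zero because it occurs in every document, while "fills", which occurs in only one document, receives the largest weight.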

5. Word embedding is a method used in natural language processing to represent words as dense, low-dimensional vectors in a continuous vector space. The basic idea is to encode semantic and syntactic similarity by placing words close together in the vector space when they share similar contexts or meanings. Compared with the BoW and TF-IDF models described above, word embeddings capture semantic meaning and relationships between words, and they give more compact and efficient representations for natural language processing tasks. Contextual embedding models go further and can assign different vectors to different meanings of the same word, whereas static models such as Word2vec keep a single vector per word.
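
A small sketch of why dense vectors help: one-hot vectors make every pair of distinct words orthogonal, while dense vectors can express graded similarity. The two-dimensional embedding values below are invented for illustration and do not come from any trained model; the helper `cosine` is likewise an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors over a three-word vocabulary: every pair is orthogonal.
one_hot = {"moon": [1, 0, 0], "sun": [0, 1, 0], "banana": [0, 0, 1]}

# Hypothetical 2-dimensional embeddings, invented for illustration only;
# a trained model would learn such vectors from the contexts of each word.
embedding = {"moon": [0.9, 0.1], "sun": [0.8, 0.3], "banana": [0.1, 0.9]}

print(cosine(one_hot["moon"], one_hot["sun"]))            # 0.0
print(cosine(embedding["moon"], embedding["sun"]))        # close to 1
print(cosine(embedding["moon"], embedding["banana"]))     # much smaller
```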

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  # almost the same function as in the lecture

sentence1 = 'The moon shines brightly in the night sky today'
sentence2 = 'Night falls as the moon fills the dark sky today'
sentence3 = 'The sky blue so as the moon'
sentence4 = 'Bananas are a good source of potassium'

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2, sentence3, sentence4])
cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(f'Cosine Similarity of sentence 1 and sentence 2: {round(cosine_similarities[0, 1], 3)}')
print(f'Cosine Similarity of sentence 3 and sentence 4: {cosine_similarities[2, 3]}')

Cosine Similarity of sentence 1 and sentence 2: 0.527
Cosine Similarity of sentence 3 and sentence 4: 0.0
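
The exact zero has a simple explanation: TF-IDF weights are non-negative, so the dot product of two sentence vectors is zero exactly when the sentences share no vocabulary term. The sketch below checks this with the four sentences from the cell above, assuming `TfidfVectorizer`'s default tokenization (lowercasing, tokens of two or more word characters); the helper name `vocabulary` is introduced here for illustration.

```python
import re

# Mimics TfidfVectorizer's default tokenization: lowercase, then keep
# tokens of two or more word characters (so the one-letter "a" is dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def vocabulary(sentence):
    """Set of distinct tokens in a sentence."""
    return set(TOKEN_PATTERN.findall(sentence.lower()))


sentence1 = 'The moon shines brightly in the night sky today'
sentence2 = 'Night falls as the moon fills the dark sky today'
sentence3 = 'The sky blue so as the moon'
sentence4 = 'Bananas are a good source of potassium'

print(sorted(vocabulary(sentence1) & vocabulary(sentence2)))
# ['moon', 'night', 'sky', 'the', 'today']
print(vocabulary(sentence3) & vocabulary(sentence4))
# set()
```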


In [35]:
!pip install gensim



In [70]:
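from gensim.models import Word2Vec
import gensim.downloader as api

# api.load('text8') returns the raw corpus (a Dataset), not a trained model;
# calling .wv on it raised "AttributeError: 'Dataset' object has no attribute 'wv'".
# Train a Word2vec model on the corpus first.
corpus = api.load('text8')
word2vec_model = Word2Vec(corpus)

sentences = [
    'The moon shines brightly in the night sky today',
    'Night falls as the moon fills the dark sky today',
    'The sky blue so as the moon',
    'Bananas are a good source of potassium'
]

# text8 is all lowercase, so lowercase the tokens and keep only in-vocabulary words
tokenized_sentences = [
    [word for word in sentence.lower().split() if word in word2vec_model.wv]
    for sentence in sentences
]

# Note: wmdistance needs an extra optimal-transport package (POT or pyemd, depending on the gensim version)
for i in range(len(tokenized_sentences)):
    for j in range(i + 1, len(tokenized_sentences)):
        distance_wmd = word2vec_model.wv.wmdistance(tokenized_sentences[i], tokenized_sentences[j])
        # Named cos_sim so that sklearn's cosine_similarity function is not shadowed
        cos_sim = word2vec_model.wv.n_similarity(tokenized_sentences[i], tokenized_sentences[j])
        print(f'Distance between Sentence {i+1} and Sentence {j+1} (WMD): {distance_wmd}')
        print(f'Cosine Similarity between Sentence {i+1} and Sentence {j+1}: {cos_sim}')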

8\. Below you will find a piece of text. Split it into sentences and create a BoW model. Above we used stemming for the analogous model; use lemmatization instead. Do not forget that lemmatization may require whole sentences to identify parts of speech, which means that stopword removal must be done after lemmatization.

In [76]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [79]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter

text = """The skull and the upper bones lay beside it in the thick dust, and in one place, where rain-water had dropped through a leak in the roof,
the thing itself had been worn away. Further in the gallery was the huge skeleton barrel of a Brontosaurus. My museum hypothesis was confirmed.
Going towards the side I found what appeared to be sloping shelves, and clearing away the thick dust,
I found the old familiar glass cases of our own time. But they must have been air-tight to judge from the fair preservation of some of their contents."""
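sentences = sent_tokenize(text)
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized_words = []
for sentence in sentences:
    words = word_tokenize(sentence)
    # lemmatize() treats every word as a noun unless a POS tag is passed,
    # which is why 'was' comes out as 'wa' in the output
    lemmatized_words.extend([wordnet_lemmatizer.lemmatize(word) for word in words])
# Remove stopwords only after lemmatization
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in lemmatized_words if word.lower() not in stop_words]
print(filtered_words)
# Bag of words: count of each remaining token (punctuation included)
bow_model = Counter(filtered_words)
print(bow_model)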

['skull', 'upper', 'bone', 'lay', 'beside', 'thick', 'dust', ',', 'one', 'place', ',', 'rain-water', 'dropped', 'leak', 'roof', ',', 'thing', 'worn', 'away', '.', 'gallery', 'wa', 'huge', 'skeleton', 'barrel', 'Brontosaurus', '.', 'museum', 'hypothesis', 'wa', 'confirmed', '.', 'Going', 'towards', 'side', 'found', 'appeared', 'sloping', 'shelf', ',', 'clearing', 'away', 'thick', 'dust', ',', 'found', 'old', 'familiar', 'glass', 'case', 'time', '.', 'must', 'air-tight', 'judge', 'fair', 'preservation', 'content', '.']
Counter({',': 5, '.': 5, 'thick': 2, 'dust': 2, 'away': 2, 'wa': 2, 'found': 2, 'skull': 1, 'upper': 1, 'bone': 1, 'lay': 1, 'beside': 1, 'one': 1, 'place': 1, 'rain-water': 1, 'dropped': 1, 'leak': 1, 'roof': 1, 'thing': 1, 'worn': 1, 'gallery': 1, 'huge': 1, 'skeleton': 1, 'barrel': 1, 'Brontosaurus': 1, 'museum': 1, 'hypothesis': 1, 'confirmed': 1, 'Going': 1, 'towards': 1, 'side': 1, 'appeared': 1, 'sloping': 1, 'shelf': 1, 'clearing': 1, 'old': 1, 'familiar': 1, 'glas
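
The cell above pools all tokens into a single `Counter`. Since the exercise asks to split the text into sentences, a per-sentence BoW matrix is another reasonable reading; the sketch below shows one way it might be built. The token lists are copied from the output above for the second and third sentences, with punctuation dropped, and the names `sentence_tokens`, `vocabulary`, and `bow_matrix` are assumptions of this sketch.

```python
from collections import Counter

# Tokens of the second and third sentences, copied from the output above
# with punctuation dropped (so the 'wa' artifact is still present).
sentence_tokens = [
    ["gallery", "wa", "huge", "skeleton", "barrel", "Brontosaurus"],
    ["museum", "hypothesis", "wa", "confirmed"],
]

# Shared vocabulary: one column per distinct token, in sorted order.
vocabulary = sorted({token for tokens in sentence_tokens for token in tokens})

# One row of counts per sentence.
bow_matrix = []
for tokens in sentence_tokens:
    counts = Counter(tokens)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row is the BoW vector of one sentence over the shared vocabulary, so rows can be compared directly, for example with cosine similarity.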

9\. Below you will find a list of tweets. Create a TF-IDF model for them. For tokenization use the TweetTokenizer provided by NLTK. Using cosine similarity, find the two most similar tweets.

In [82]:
import nltk
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# List of tweets
tweets = ["@Tatiana_K nope they didn't have it ","@twittera que me muera ? ","spring break in plain city... it's snowing ))) ",
          "I just re-pierced my ears ","@caregiving I couldn't bear to watch it.  And I thought the UA losssssss was embarrassing . . . . .",
          "@octolinz16 It it counts, idk why I did either. you never talk to me anymore ","@smarrison i would've been the first, but i didn't have a gun.  not really though, zac snyder's just a doucheclown.",
          "@iamjazzyfizzle I wish I got to watch it with you!! I miss you and @iamlilnicki  how was the premiere?!",
          "Hollis' death scene will hurt me severely to watch on film  wry is directors cut not out now?","about to file taxes ",
          "@LettyA ahh ive always wanted to see rent  love the soundtrack!!","@FakerPattyPattz Oh dear. Were you drinking out of the forgotten table drinks? ",
          "@alydesigns i was out most of the day so didn't get much done ;) ","one of my friend called me, and asked to meet with her at Mid Valley today...but i've no time *sigh* "]

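# Tokenize with NLTK's TweetTokenizer, then rejoin so TfidfVectorizer receives strings
tokenizer = TweetTokenizer()
tokenized_tweets = [tokenizer.tokenize(tweet) for tweet in tweets]
tokenized_tweets_str = [' '.join(tokens) for tokens in tokenized_tweets]
# Note: TfidfVectorizer() re-tokenizes these strings with its own default pattern
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(tokenized_tweets_str).toarray()
cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)
# Zero the diagonal so that a tweet is not matched with itself
np.fill_diagonal(cosine_similarities, 0)
# Row and column index of the largest remaining similarity
most_similar_indices = np.unravel_index(cosine_similarities.argmax(), cosine_similarities.shape)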

print('Most similar tweets:')
print(tweets[most_similar_indices[0]])
print(tweets[most_similar_indices[1]])

Most similar tweets:
@caregiving I couldn't bear to watch it.  And I thought the UA losssssss was embarrassing . . . . .
@iamjazzyfizzle I wish I got to watch it with you!! I miss you and @iamlilnicki  how was the premiere?!
