<a href="https://colab.research.google.com/github/kevin-r-murphy/ba820/blob/main/Hands-on/04-text-mining/text_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text cleaning and Regex

We will start with some sentences. In this context, each sentence can be considered *a document*.

In [None]:
corpus = [
    "In 1945, the US dropped two nuclear bombs on Japan. Japan surrendered afterwards.",
    "Japan is located in Asia. Tokyo is its capital.",
    "The capital of the USA is Washington D.C., which is located on the eastern seaboard.",
    "I like eating apples! I eat 2.3 pounds everyday.",
    "The capitol of Canada is Ottawa. My aunt's number there is (613)-554-2121. I enjoy visiting here.",
    "       5/2 = 2.5.",
    # Note: the next two strings have no trailing comma, so Python concatenates them
    # with the string that follows; the corpus therefore holds 9 documents.
    "The professor was very kind to us when creating the midterm exam."
    "An apple a day keeps the doctor away!"
    "@jason We won the game! #WeAreTheChampions.",
    "My phone number in Canada is (613)-224-2311        ",
    "Apples are good for your health."
]

import pandas as pd
df = pd.DataFrame({'text':corpus})
df

Unnamed: 0,text
0,"In 1945, the US dropped two nuclear bombs on J..."
1,Japan is located in Asia. Tokyo is its capital.
2,"The capital of the USA is Washington D.C., whi..."
3,I like eating apples! I eat 2.3 pounds everyday.
4,The capitol of Canada is Ottawa. My aunt's num...
5,5/2 = 2.5.
6,The professor was very kind to us when creatin...
7,My phone number in Canada is (613)-224-2311 ...
8,Apples are good for your health.


Some pre-processing you might want to consider:

- Lower/upper casing. *Is the effect positive or negative?*
- Removing leading and trailing spaces.
- Removing punctuation. *Is the effect positive or negative?*
- Replacing synonyms.

In [None]:
df = pd.DataFrame(df.text.str.lower()) # to lower case.
# df = pd.DataFrame(df.text.str.strip()) # removing leading and trailing spaces
# df = pd.DataFrame(df.text.str.replace("like", "enjoy")) # synonym replacement
df = pd.DataFrame(df.text.str.replace(r'[^\w\s]', '', regex=True)) # remove punctuation; regex=True is needed because newer pandas versions default to literal matching

# df = pd.DataFrame(df.text.str.findall(r'@\S+|#\S+')) # extract mentions and hashtags
# df = df.text.str.contains("us") # which documents contain "us"?

df




Unnamed: 0,text
0,in 1945 the us dropped two nuclear bombs on ja...
1,japan is located in asia tokyo is its capital
2,the capital of the usa is washington dc which ...
3,i like eating apples i eat 23 pounds everyday
4,the capitol of canada is ottawa my aunts numbe...
5,52 25
6,the professor was very kind to us when creatin...
7,my phone number in canada is 6132242311
8,apples are good for your health
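
The effect of each step is easiest to judge on a single string. The sketch below mirrors the pandas calls above with Python's built-in `re` module; the helper name `clean` and its flags are made up for illustration and are not part of the notebook.

```python
import re

def clean(text, lower=True, strip=True, drop_punct=True):
    """Apply the pre-processing steps one by one to a single document."""
    if lower:
        text = text.lower()
    if strip:
        text = text.strip()                    # leading and trailing whitespace
    if drop_punct:
        text = re.sub(r"[^\w\s]", "", text)    # keep only word characters and whitespace
    return text

sentence = "I like eating apples! I eat 2.3 pounds everyday."
print(clean(sentence))                    # the decimal point is lost: "2.3" becomes "23"
print(clean(sentence, drop_punct=False))  # casing changes, numbers stay intact
```

Removing punctuation turns "2.3" into "23" and "(613)-224-2311" into one long digit string, which is one way the effect can be negative.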


## Tokenization

Let's break down the text. This might entail some processing.

In [None]:
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize, WhitespaceTokenizer, RegexpTokenizer
from nltk.tokenize.casual import TweetTokenizer
import nltk
nltk.download('punkt')

tokenized = [word_tokenize(t) for t in corpus] # word tokenization
# tokenized = [WhitespaceTokenizer().tokenize(t) for t in corpus] # whitespace tokenization
# tokenized = [wordpunct_tokenize(t) for t in corpus] # word/punctuation tokenization
# tokenized = [TweetTokenizer().tokenize(t) for t in corpus] # tweet tokenization
# tokenized = [RegexpTokenizer(r'\d{4}|\d{3}', gaps=False).tokenize(t) for t in corpus] # regex tokenization: keeps runs of 3 or 4 digits. Use r'\([0-9]{3}\)-[0-9]{3}-[0-9]{4}' instead to keep phone numbers only
tokenized

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['In',
  '1945',
  ',',
  'the',
  'US',
  'dropped',
  'two',
  'nuclear',
  'bombs',
  'on',
  'Japan',
  '.',
  'Japan',
  'surrendered',
  'afterwards',
  '.'],
 ['Japan',
  'is',
  'located',
  'in',
  'Asia',
  '.',
  'Tokyo',
  'is',
  'its',
  'capital',
  '.'],
 ['The',
  'capital',
  'of',
  'the',
  'USA',
  'is',
  'Washington',
  'D.C.',
  ',',
  'which',
  'is',
  'located',
  'on',
  'the',
  'eastern',
  'seaboard',
  '.'],
 ['I',
  'like',
  'eating',
  'apples',
  '!',
  'I',
  'eat',
  '2.3',
  'pounds',
  'everyday',
  '.'],
 ['The',
  'capitol',
  'of',
  'Canada',
  'is',
  'Ottawa',
  '.',
  'My',
  'aunt',
  "'s",
  'number',
  'there',
  'is',
  '(',
  '613',
  ')',
  '-554-2121',
  '.',
  'I',
  'enjoy',
  'visiting',
  'here',
  '.'],
 ['5/2', '=', '2.5', '.'],
 ['The',
  'professor',
  'was',
  'very',
  'kind',
  'to',
  'us',
  'when',
  'creating',
  'the',
  'midterm',
  'exam.An',
  'apple',
  'a',
  'day',
  'keeps',
  'the',
  'doctor',
  'away',
  '

Tokenization does not have to be at the word level...

In [None]:
nltk.download('words')
from nltk.corpus import words
from nltk.tokenize import LegalitySyllableTokenizer

# print(words)
LP = LegalitySyllableTokenizer(words.words())

for tokenized_document in tokenized:
  tokenized2 = [LP.tokenize(word) for word in tokenized_document] # split each word into syllables
  print(tokenized2)

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


[['In'], ['1945'], [','], ['the'], ['US'], ['drop', 'ped'], ['two'], ['nucl', 'ear'], ['bombs'], ['on'], ['Ja', 'pan'], ['.'], ['Ja', 'pan'], ['sur', 'ren', 'de', 'red'], ['af', 'ter', 'wards'], ['.']]
[['Ja', 'pan'], ['is'], ['lo', 'ca', 'ted'], ['in'], ['As', 'ia'], ['.'], ['Tok', 'yo'], ['is'], ['its'], ['ca', 'pi', 'tal'], ['.']]
[['The'], ['ca', 'pi', 'tal'], ['of'], ['the'], ['U', 'SA'], ['is'], ['Wa', 'shing', 'ton'], ['D.C.'], [','], ['which'], ['is'], ['lo', 'ca', 'ted'], ['on'], ['the'], ['e', 'a', 'stern'], ['s', 'eab', 'oard'], ['.']]
[['I'], ['li', 'ke'], ['e', 'a', 'ting'], ['ap', 'ples'], ['!'], ['I'], ['eat'], ['2.3'], ['p', 'ounds'], ['e', 've', 'ryd', 'ay'], ['.']]
[['The'], ['ca', 'pi', 'tol'], ['of'], ['Ca', 'na', 'da'], ['is'], ['Ot', 'ta', 'wa'], ['.'], ['My'], ['aunt'], ["'s"], ['num', 'ber'], ['the', 're'], ['is'], ['('], ['613'], [')'], ['-554-2121'], ['.'], ['I'], ['enj', 'oy'], ['vi', 'si', 'ting'], ['he', 're'], ['.']]
[['5/2'], ['='], ['2.5'], ['.']]
[['The

- **Question:** Is it better to tokenize by word, sentence, character, or "sub-word"?

**Questions:**

- Any interesting sentences you want to try?
- What if I want to only collect phone numbers?
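
For the phone-number question, one possible approach is a regular expression that matches the `(ddd)-ddd-dddd` format used in the corpus. This is a sketch with the standard `re` module; the pattern covers only that one format, and the name `PHONE` is chosen here for illustration.

```python
import re

# Matches numbers written like (613)-554-2121: area code in parentheses, then two dashes.
PHONE = re.compile(r"\(\d{3}\)-\d{3}-\d{4}")

docs = [
    "The capitol of Canada is Ottawa. My aunt's number there is (613)-554-2121. I enjoy visiting here.",
    "My phone number in Canada is (613)-224-2311        ",
    "In 1945, the US dropped two nuclear bombs on Japan. Japan surrendered afterwards.",
]

for doc in docs:
    print(PHONE.findall(doc))   # an empty list means no phone number was found
```

The pattern relies on the parentheses and dashes, so it has to run on the raw text, before any punctuation is removed.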


## Bag of Words

Now that we have tokenized the sentences, we can vectorize them. Let's use **Bag of Words**, the simplest way we know.

Let's create and fit the model.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer() #tokenizer= lambda x: word_tokenize(x), stop_words='english'

cv.fit(df.text)

print('number of `tokens`', len(cv.vocabulary_))
cv.vocabulary_

number of `tokens` 70


{'in': 33,
 '1945': 0,
 'the': 53,
 'us': 58,
 'dropped': 21,
 'two': 57,
 'nuclear': 43,
 'bombs': 13,
 'on': 46,
 'japan': 36,
 'surrendered': 52,
 'afterwards': 6,
 'is': 34,
 'located': 40,
 'asia': 10,
 'tokyo': 56,
 'its': 35,
 'capital': 15,
 'of': 45,
 'usa': 59,
 'washington': 63,
 'dc': 19,
 'which': 67,
 'eastern': 22,
 'seaboard': 51,
 'like': 39,
 'eating': 24,
 'apples': 8,
 'eat': 23,
 '23': 1,
 'pounds': 49,
 'everyday': 26,
 'capitol': 16,
 'canada': 14,
 'ottawa': 47,
 'my': 42,
 'aunts': 11,
 'number': 44,
 'there': 54,
 '6135542121': 5,
 'enjoy': 25,
 'visiting': 61,
 'here': 32,
 '52': 3,
 '25': 2,
 'professor': 50,
 'was': 62,
 'very': 60,
 'kind': 38,
 'to': 55,
 'when': 66,
 'creating': 17,
 'midterm': 41,
 'examan': 27,
 'apple': 7,
 'day': 18,
 'keeps': 37,
 'doctor': 20,
 'awayjason': 12,
 'we': 64,
 'won': 68,
 'game': 29,
 'wearethechampions': 65,
 'phone': 48,
 '6132242311': 4,
 'are': 9,
 'good': 30,
 'for': 28,
 'your': 69,
 'health': 31}

**Questions:**
- Can we do custom tokenization?
- Can we specify stop words?

You can print the list of stop words that was used. It is `None` when no stop words were specified, as is the case here.

In [None]:
cv.get_stop_words()

Now, let's transform the documents into BoW format

In [None]:
dtm = cv.transform(df.text)
bow = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
bow

Unnamed: 0,1945,23,25,52,6132242311,6135542121,afterwards,apple,apples,are,...,very,visiting,was,washington,we,wearethechampions,when,which,won,your
0,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,1,0,0,...,1,0,1,0,1,1,1,0,1,0
7,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1
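
To see roughly what `CountVectorizer` does under the hood, here is a hand-rolled sketch that uses only the standard library. It assumes the default token pattern `\b\w\w+\b` (tokens of at least two word characters), which is why single-letter words such as "i" and "a" are missing from the vocabulary above. The variable names are illustrative.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")   # same idea as CountVectorizer's default token_pattern

docs = [
    "japan is located in asia tokyo is its capital",
    "i like eating apples i eat 23 pounds everyday",
]

tokenized_docs = [TOKEN.findall(d.lower()) for d in docs]

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = sorted({tok for doc in tokenized_docs for tok in doc})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# Document-term matrix: one row per document, one count per vocabulary entry.
dtm = []
for doc in tokenized_docs:
    counts = Counter(doc)
    dtm.append([counts[tok] for tok in all_tokens])

print(vocabulary)
print(dtm)
```

Word order is discarded entirely: each row only records how often each vocabulary entry occurs in the document.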


**Questions:**

- What does this table remind you of? Is there any relevance?
- What does the representation using TF-IDF look like?





We can see which tokens were extracted for a sentence using `cv.inverse_transform`

In [None]:
recognized_tokens_sentence0 = cv.inverse_transform([bow.iloc[0]])
recognized_tokens_sentence0

[array(['1945', 'afterwards', 'bombs', 'dropped', 'in', 'japan', 'nuclear',
        'on', 'surrendered', 'the', 'two', 'us'], dtype='<U17')]

## Document Similarity

Let's compare cosine similarity vs. Euclidean distance. We will calculate the *similarity matrix*.

First, cosine similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist

# Cosine sim
cos_sim = pd.DataFrame(cosine_similarity(bow, bow))
cos_sim

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1.0,0.23355,0.215353,0.0,0.062622,0.0,0.218218,0.09759,0.0
1,0.23355,1.0,0.377217,0.0,0.292509,0.0,0.0,0.341882,0.0
2,0.215353,0.377217,1.0,0.0,0.404577,0.0,0.422944,0.157622,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.154303
4,0.062622,0.292509,0.404577,0.0,1.0,0.0,0.163984,0.458349,0.0
5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.218218,0.0,0.422944,0.0,0.163984,0.0,1.0,0.0,0.0
7,0.09759,0.341882,0.157622,0.0,0.458349,0.0,0.0,1.0,0.0
8,0.0,0.0,0.0,0.154303,0.0,0.0,0.0,0.0,1.0


We can try to answer a question

In [None]:
q = "What is my aunt's number?"

q_vector = cv.transform([q])

pd.DataFrame(cosine_similarity(q_vector, bow))

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.0,0.348155,0.240772,0.0,0.560112,0.0,0.0,0.654654,0.0


We can see that the top two matches (documents 7 and 4) are the ones about phone numbers (mine and my aunt's).

Now, Euclidean *similarity*

In [None]:
# Turn pairwise Euclidean distances d into similarities with 1 / (1 + d)
euclid_sim = 1 / (1 + pd.DataFrame(
    squareform(pdist(bow)),
))

euclid_sim

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1.0,0.182744,0.154387,0.175734,0.154387,0.195194,0.136527,0.182744,0.179129
1,0.182744,1.0,0.175734,0.190744,0.182744,0.217129,0.128496,0.224009,0.195194
2,0.154387,0.175734,1.0,0.154387,0.169521,0.166667,0.146392,0.163961,0.156613
3,0.175734,0.190744,0.154387,1.0,0.169521,0.25,0.133677,0.210897,0.231662
4,0.154387,0.182744,0.169521,0.169521,1.0,0.186605,0.131006,0.210897,0.172538
5,0.195194,0.217129,0.166667,0.25,0.186605,1.0,0.141188,0.25,0.261204
6,0.136527,0.128496,0.146392,0.133677,0.131006,0.141188,1.0,0.133677,0.135078
7,0.182744,0.224009,0.163961,0.210897,0.210897,0.25,0.133677,1.0,0.217129
8,0.179129,0.195194,0.156613,0.231662,0.172538,0.261204,0.135078,0.217129,1.0


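Notice that with the Euclidean-based similarity 1 / (1 + d), documents that share no words still get a non-zero similarity, and short documents look similar to each other simply because their count vectors lie close to the origin. This makes it a poor measure of relatedness compared with cosine similarity.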
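
The difference is visible on two documents that share no words. The sketch below uses plain Python lists as count vectors for documents 5 and 8 of the cleaned corpus, restricted to the eight tokens that occur in either one (all other columns are zero for both and do not change the result). The function names are illustrative.

```python
import math

# Columns: 25, 52, apples, are, for, good, health, your
doc5 = [1, 1, 0, 0, 0, 0, 0, 0]   # "52 25"
doc8 = [0, 0, 1, 1, 1, 1, 1, 1]   # "apples are good for your health"

def cosine_sim(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclid_sim(u, v):
    """Euclidean distance d mapped to a similarity with 1 / (1 + d)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1 / (1 + dist)

print(cosine_sim(doc5, doc8))   # 0.0 because no token is shared
print(euclid_sim(doc5, doc8))   # about 0.261, the value in row 5, column 8 of the table
```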

## TF-IDF

Let's rerun the same experiment with TF-IDF.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf_model = TfidfVectorizer(norm=None)

tfidf_model.fit(df.text)

df_tfidf_transformed = tfidf_model.transform(df.text)
tfidf_vectors = pd.DataFrame(df_tfidf_transformed.toarray(), columns=tfidf_model.get_feature_names_out())
tfidf_vectors

Unnamed: 0,1945,23,25,52,6132242311,6135542121,afterwards,apple,apples,are,...,very,visiting,was,washington,we,wearethechampions,when,which,won,your
0,2.609438,0.0,0.0,0.0,0.0,0.0,2.609438,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.609438,0.0,0.0,0.0,2.609438,0.0,0.0
3,0.0,2.609438,0.0,0.0,0.0,0.0,0.0,0.0,2.203973,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,2.609438,0.0,0.0,0.0,0.0,...,0.0,2.609438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.609438,2.609438,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.609438,0.0,0.0,...,2.609438,0.0,2.609438,0.0,2.609438,2.609438,2.609438,0.0,2.609438,0.0
7,0.0,0.0,0.0,0.0,2.609438,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.203973,2.609438,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.609438
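
The two distinct non-zero values in this table can be reproduced by hand. With `norm=None` and scikit-learn's default smoothing, each entry is the term count multiplied by `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents that contain the term. A minimal check with the standard library (the function name `smooth_idf` is illustrative):

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 9                                  # documents in the cleaned corpus

print(round(smooth_idf(n_docs, 1), 6))      # a term found in one document, e.g. "1945"
print(round(smooth_idf(n_docs, 2), 6))      # "apples" occurs in documents 3 and 8
```

Each of these terms occurs once in its document, so the table entry equals the IDF value itself. Terms that appear in more documents get a smaller weight.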


## Spam Detection



In [None]:
url = "https://raw.githubusercontent.com/elhamod/BA820/main/Hands-on/04-text-mining/hamspam.csv"
df_sms = pd.read_csv(url, names = ['type', 'text'], index_col='type')

X = df_sms['text']
y = df_sms.index

# df = pd.DataFrame(df.text.str.lower()) # We can try lower-casing.

df_sms

Unnamed: 0_level_0,text
type,Unnamed: 1_level_1
ham,"Go until jurong point, crazy.. Available only ..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup fina...
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives aro..."
...,...
spam,This is the 2nd time we have tried 2 contact u...
ham,Will ü b going to esplanade fr home?
ham,"Pity, * was in mood for that. So...any other s..."
ham,The guy did some bitching but I acted like i'd...


### Unsupervised step: Let's vectorize

In [None]:
from sklearn.model_selection import train_test_split

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
vectorizer = CountVectorizer() #lowercase=False

# learn the vocabulary from the training set only, and vectorize it.
X_train_counts = vectorizer.fit_transform(X_train)

# vectorize the test set with the vocabulary learned from the training set
X_test_counts = vectorizer.transform(X_test)

In [None]:
X_train_counts.toarray().shape

(4457, 7702)

There are about 7,700 different tokens! This is very high-dimensional.

### Supervised learning: Let's train a classifier and look at the results.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix

# train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_counts, y_train)

# Predict on the test data
y_pred = model.predict(X_test_counts)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
pd.DataFrame(confusion_matrix(y_test, y_pred, normalize='true'), columns=model.classes_, index=model.classes_ )

Accuracy: 0.9883408071748879


Unnamed: 0,ham,spam
ham,1.0,0.0
spam,0.087248,0.912752


**Question:** Would making tokens lower case help?

## Let's explore n-grams

Let's split the data into training and testing

In [None]:
lowercase= True
n_gram_range = (1,3)
stratify = False

In [None]:
def get_split_datasets(X, y, stratify=False, stratify_size = 400):
  if stratify:
    # split into ham and spam
    y_ham = y[y == 'ham']
    X_ham = X[y == 'ham']
    y_spam = y[y == 'spam']
    X_spam = X[y == 'spam']

    # split and randomly sample into train and test
    X_ham_train, X_ham_test, y_ham_train, y_ham_test = train_test_split(X_ham, y_ham, test_size=0.2, random_state=42)
    X_spam_train, X_spam_test, y_spam_train, y_spam_test = train_test_split(X_spam, y_spam, test_size=0.2, random_state=42)

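    # merge again, keeping at most `stratify_size` samples per class.
    # Note: this balances the classes; it is not the proportional stratification of train_test_split(..., stratify=y).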
    X_train = pd.concat([X_ham_train[:stratify_size], X_spam_train[:stratify_size]], axis=0)
    X_test = pd.concat([X_ham_test[:stratify_size], X_spam_test[:stratify_size]], axis=0)
    y_train = y_ham_train[:stratify_size].append(y_spam_train[:stratify_size])
    y_test = y_ham_test[:stratify_size].append(y_spam_test[:stratify_size])
  else:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = get_split_datasets(X, y, stratify=stratify)

In [None]:
pd.DataFrame(y_train).value_counts()

type
ham     3859
spam     598
dtype: int64
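
The counts above show that only about 13% of the training messages are spam. Note that the `stratify` flag of `get_split_datasets` balances the classes by truncation, whereas a stratified split in the usual sense keeps the class proportions the same in the train and test sets. The sketch below shows the latter with the standard library only; `stratified_split` and the toy labels are made up for illustration (in scikit-learn, `train_test_split(..., stratify=y)` serves the same purpose).

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the same class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # same fraction taken from every class
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return sorted(train_idx), sorted(test_idx)

labels = ["ham"] * 87 + ["spam"] * 13              # toy labels, roughly the ham/spam ratio above
train_idx, test_idx = stratified_split(labels)
n_spam_test = sum(labels[i] == "spam" for i in test_idx)
print(n_spam_test, "spam messages out of", len(test_idx), "in the test set")
```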

Let's vectorize

In [None]:
import sklearn
vectorizer_ngram = CountVectorizer(lowercase=lowercase, ngram_range=n_gram_range)

# learn the n-gram vocabulary from the training set and vectorize it.
X_train_ngram = vectorizer_ngram.fit_transform(X_train)

# vectorize the test set with the same vocabulary
X_test_ngram = vectorizer_ngram.transform(X_test)

In [None]:
X_train_ngram_df = pd.DataFrame(X_train_ngram.toarray(), columns=vectorizer_ngram.get_feature_names_out())

In [None]:
X_train_ngram_df

Unnamed: 0,00,00 in,00 in our,00 per,00 sub,00 sub 16,00 subs,00 subs 16,000,000 bonus,...,zouk,zouk with,zouk with nichols,zyada,zyada kisi,zyada kisi ko,èn,ú1,ú1 20,ú1 20 poboxox36504w45wq
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4453,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
X_train_ngram_df.astype(int).sum().sum()

179596
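
The column names above ("00", "00 in", "00 in our") show what `ngram_range=(1, 3)` produces: every run of one, two, or three consecutive tokens becomes a feature. Below is a small standard-library sketch of that expansion; the function `word_ngrams` is a stand-in written for this example, and the token list is loosely based on a message in the dataset.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """List all n-grams of consecutive tokens for every n in the given range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["free", "entry", "in", "wkly", "comp"]
features = word_ngrams(tokens)
print(len(features))   # 5 unigrams + 4 bigrams + 3 trigrams
print(features)
```

A message with k tokens contributes up to 3k - 3 features in this setting, which is why the feature space grows so quickly compared with unigrams alone.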

Let's predict

In [None]:
# train the model
model_ngram = LogisticRegression(max_iter=1000)
model_ngram.fit(X_train_ngram, y_train)

# Predict on the test data
y_pred_ngram = model_ngram.predict(X_test_ngram)

# Evaluate the model
accuracy_ngram = accuracy_score(y_test, y_pred_ngram)
f1_score_ngram = sklearn.metrics.f1_score(y_test, y_pred_ngram, pos_label="spam")
print(f"Accuracy: {accuracy_ngram}")
print(f"f1_score: {f1_score_ngram}")
print(sklearn.metrics.classification_report(y_test,y_pred_ngram))
pd.DataFrame(confusion_matrix(y_test, y_pred_ngram, normalize='true'), columns=model_ngram.classes_, index=model_ngram.classes_ )

Accuracy: 0.9838565022421525
f1_score: 0.9357142857142857
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       966
        spam       1.00      0.88      0.94       149

    accuracy                           0.98      1115
   macro avg       0.99      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115



Unnamed: 0,ham,spam
ham,1.0,0.0
spam,0.120805,0.879195
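
All the numbers in this report follow from four counts. The sketch below rebuilds them from the support (966 ham, 149 spam) and the spam row of the confusion matrix; the count of 131 correctly flagged spam messages is inferred from 0.879195 × 149 rather than printed by the notebook.

```python
# Confusion counts for the n-gram model, inferred from the report above.
ham_total, spam_total = 966, 149
ham_as_ham = 966                  # ham recall is 1.0
spam_as_spam = 131                # 131 / 149 = 0.879195
spam_as_ham = spam_total - spam_as_spam
ham_as_spam = ham_total - ham_as_ham

accuracy = (ham_as_ham + spam_as_spam) / (ham_total + spam_total)
precision = spam_as_spam / (spam_as_spam + ham_as_spam)
recall = spam_as_spam / spam_total
f1 = 2 * precision * recall / (precision + recall)
baseline = ham_total / (ham_total + spam_total)   # accuracy of always predicting "ham"

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.4f} f1={f1:.4f}")
print(f"always-ham baseline accuracy={baseline:.4f}")
```

Because about 87% of the test messages are ham, accuracy stays high even though roughly 12% of the spam is missed; recall and F1 for the spam class are more informative here.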


**Questions:**

- What would happen if I use a large context (i.e., a large n)?
- What would happen if I use a wide range of n values together (i.e., a mixed n-gram model)?
- Would the results change if we do stratified sampling?

## Word2Vec

Let's train a Word2Vec model on some small set of sentences.




In [None]:
corpus2 = [
    'I love sleeping',
    'He hates eating',
    'I love drinking',
    'He hates studying',
    'I love traveling',
    'He hates swimming',
]

# Tokenize first.
tokenized_corpus2 = [word_tokenize(t) for t in corpus2]

In [None]:
from gensim.models import Word2Vec
import numpy as np

# We construct and train our own Word2Vec (Not a common practice but just to see how it works.)
model_word2vec = Word2Vec(sentences=tokenized_corpus2, vector_size=300, window=3, min_count=1, workers=4, negative=20, epochs=5000)

print("All words captured by the model:", model_word2vec.wv.key_to_index)
print("The embedding of", "love", "is", model_word2vec.wv["love"])

# Get the embedding for each word captured by the model.
embeddings = np.array([model_word2vec.wv[word] for word in model_word2vec.wv.key_to_index])

All words captured by the model: {'hates': 0, 'He': 1, 'love': 2, 'I': 3, 'swimming': 4, 'traveling': 5, 'studying': 6, 'drinking': 7, 'eating': 8, 'sleeping': 9}
The embedding of love is [-0.04574433  0.01937618  0.03162149  0.17942335 -0.0568259  -0.10790729
  0.07773129  0.17535149  0.00187355 -0.0916574   0.20438986 -0.0476336
 -0.00746679  0.10955089 -0.12008529 -0.05144455  0.24842072 -0.09873094
 -0.076167   -0.09188095  0.01952533  0.06001721  0.11701898  0.11338437
  0.05147251 -0.01172497 -0.14430407  0.16938813 -0.04475872 -0.08336935
  0.12505609  0.00868358  0.06391178 -0.01708465 -0.13951997 -0.00928784
  0.0837507  -0.15357025 -0.05577698 -0.03617855 -0.01911599  0.09228368
 -0.02256209 -0.1727733  -0.01518576  0.07049675 -0.13698967 -0.00157919
 -0.01492404  0.12475214  0.02668422  0.00725186 -0.18868412  0.0094077
 -0.00074468 -0.1987038   0.04731525 -0.0489654   0.01507965  0.01239138
 -0.10086067  0.06148646  0.15719633 -0.03371906 -0.08125222 -0.02043831
 -0.0748168

In [None]:
embeddings

array([[-0.0636627 ,  0.0130497 ,  0.04815424, ..., -0.03991516,
         0.04819209, -0.09363694],
       [-0.07017189,  0.02347881,  0.04166984, ..., -0.05163064,
         0.0409615 , -0.0931811 ],
       [-0.04574433,  0.01937618,  0.03162149, ..., -0.0376985 ,
         0.04311553, -0.09021162],
       ...,
       [-0.05282729,  0.02596222,  0.03709991, ..., -0.0423765 ,
         0.03695458, -0.07815757],
       [-0.05901086,  0.01160224,  0.04292152, ..., -0.03687622,
         0.0397259 , -0.0819128 ],
       [-0.05168435,  0.02423323,  0.03293442, ..., -0.04434136,
         0.03497027, -0.08070703]], dtype=float32)

In [None]:
embeddings.shape

(10, 300)

Ten words have ten embeddings. Each word has a 300-dimensional embedding (i.e., `vector_size`).

Now, let's plot a 3D PCA plot to see these embeddings.



In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd

# dim_red = TSNE(n_components=3, perplexity=2, random_state=42)
dim_red = PCA(n_components=3, random_state=42)
embeddings_for_visualization = dim_red.fit_transform(embeddings)

# Convert the reduced embeddings and words into a DataFrame
df = pd.DataFrame(embeddings_for_visualization, columns=['x', 'y', 'z'])
df['word'] = model_word2vec.wv.index_to_key

# Create a scatter plot using Plotly
fig = px.scatter_3d(df, x='x', y='y', z='z', text='word', title='Word Embeddings Visualization')
fig.update_traces(marker=dict(size=8, opacity=0.8, line=dict(width=1, color='DarkSlateGrey')),
                  textposition='top center')
fig.update_layout(showlegend=False)
fig.show()

### Using Word2Vec for a Downstream task

Now that we understand how Word2Vec works, let's apply it to our spam detection problem.

In [None]:
# First, word tokenize.
tokenized_sms_messages = [word_tokenize(t) for t in df_sms['text'].str.lower()]

# Create and train the Word2Vec model
model_word2vec = Word2Vec(sentences=tokenized_sms_messages, vector_size=300, window=5, min_count=1, workers=4) #, negative=50 , epochs=50

# Construct the embeddings (i.e., vectorization) using your Word2Vec model
embeddings = [] # List of message embeddings
for tokenized_document in tokenized_sms_messages: # Iterate through the messages
  message_word_embeddings = [model_word2vec.wv[word] for word in tokenized_document] # Look up the embedding of each word in the message. Put them all in a list.
  message_embedding = np.mean(message_word_embeddings, axis=0) # Average the word embeddings to get a message embedding
  embeddings.append(message_embedding) # Add the current message embedding to the list of embeddings for all messages.

embeddings = np.array(embeddings)

In [None]:
message = df_sms.iloc[1].text.lower()

print("a message:", message)

message_word_embeddings = []
print("Embeddings for each token in the message:")
for word in word_tokenize(message):
  word_embedding = model_word2vec.wv[word]
  print(word, ":", word_embedding)
  message_word_embeddings.append(word_embedding)
  print("")

print("Embedding of the entire message (the above averaged):", np.mean(message_word_embeddings, axis=0))

a message: ok lar... joking wif u oni...
Embeddings for each token in the message:
ok : [ 0.03140329  0.2777986   0.14247431 -0.00304514  0.02283175 -0.39883572
  0.36180368  0.9184446  -0.07061599 -0.26217198  0.21208423 -0.23730332
 -0.11263571 -0.06464706 -0.24095154 -0.34411556  0.25076327 -0.12339689
 -0.0118097  -0.06488261 -0.21425247  0.02138565  0.2807665   0.12437944
  0.3071085   0.08082989 -0.41088313  0.08564737 -0.12218192 -0.25936422
 -0.01304633 -0.1500889   0.06020748 -0.02144791  0.03906859  0.17783897
  0.11318205 -0.46123254  0.02694819 -0.12840007 -0.1655276  -0.04313102
  0.06895073 -0.11960714  0.14927284  0.37665096  0.10097046  0.1549118
 -0.01832077  0.2762437   0.0532097   0.09546474 -0.24162045  0.24298938
 -0.27798593  0.33616477  0.119671   -0.01674274  0.13518894  0.02655589
 -0.08972952 -0.12410627 -0.06747197  0.12147194 -0.16763857  0.28044498
  0.10385843  0.05799345 -0.34115246 -0.01695586 -0.00216818  0.314344
  0.31232482 -0.40957597  0.09343396  0

In [None]:
len(tokenized_sms_messages)

5572

In [None]:
embeddings.shape

(5572, 300)

In [None]:
y.shape

(5572,)
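
The message embedding built above is just the element-wise mean of the word vectors. A toy version with made-up 3-dimensional vectors makes the arithmetic explicit; the values and the `average_embedding` helper are hypothetical and not taken from the trained model. It also shows one possible way to handle a message with no usable tokens, for which a plain mean is undefined.

```python
def average_embedding(tokens, word_vectors, dim):
    """Element-wise mean of the vectors of the known tokens; zeros if there are none."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Hypothetical 3-dimensional word vectors, for illustration only.
word_vectors = {
    "ok":     [1.0, 0.0, 2.0],
    "lar":    [3.0, 2.0, 0.0],
    "joking": [2.0, 4.0, 1.0],
}

print(average_embedding(["ok", "lar", "joking"], word_vectors, dim=3))
print(average_embedding([], word_vectors, dim=3))
```

Averaging discards word order and gives every word equal weight, which is worth keeping in mind when comparing these features with the n-gram counts.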

Now that the embeddings are constructed, we can split to train/test sets and use supervised learning.

In [None]:
# Split the dataset
X_train, X_test, y_train, y_test = get_split_datasets(embeddings, y, stratify=False)

In [None]:
# train the model
model_word2vec_classification = LogisticRegression()
model_word2vec_classification.fit(X_train, y_train)

# Predict on the test data
y_pred2 = model_word2vec_classification.predict(X_test)

# Evaluate the model
accuracy2 = accuracy_score(y_test, y_pred2)
f1_score2 = sklearn.metrics.f1_score(y_test, y_pred2, pos_label="spam") # named f1_score2 so it does not shadow sklearn's f1_score function
print(f"Accuracy: {accuracy2}")
print(f"f1_score: {f1_score2}")
print(sklearn.metrics.classification_report(y_test,y_pred2))
pd.DataFrame(confusion_matrix(y_test, y_pred2, normalize='true'), columns=model_word2vec_classification.classes_, index=model_word2vec_classification.classes_ )

Accuracy: 0.8905829596412556
f1_score: 0.3838383838383838
              precision    recall  f1-score   support

         ham       0.90      0.99      0.94       966
        spam       0.78      0.26      0.38       149

    accuracy                           0.89      1115
   macro avg       0.84      0.62      0.66      1115
weighted avg       0.88      0.89      0.87      1115



Unnamed: 0,ham,spam
ham,0.988613,0.011387
spam,0.744966,0.255034


**Question:** What parameters of `Word2Vec` could you tune to get better results?

**Questions:**

- Would dimensionality reduction help?
- Would clustering work?
- Plot the t-SNE plot of this dataset.