Workflow

1. Data and library collection:
  1.   Library
  2.   Data Ingestion

2. EDA - Exploratory Data Analysis:
  1.   Dataframe visualization
  2.   Data cleaning: missing values, duplicates or aberrant values
  3.   Level balancing analysis

3. Data preprocessing:

  1.   Level balancing
  2.   Text cleaning
  3.   Train-Validation-Test partitioning
  4.   Lemmatization, tokenization, sequencing and padding

4. Model:

  1.   Base Model definition
  2.   Compile
  3.   Fit
  4.   Base model Evaluation

5. Test SPAM recognition:
  1.   Random Ham Extraction
  2.   Test on random Ham
  3.   Random Spam Extraction
  4.   Test on random Spam

6. Topic Modelling:
  1.   Identify main topics in subset Spam

7. Word Embedding:
  1.   Semantic distances determination

8. NER tagging, ORGanization in ham:
  1.   NER and ORG extraction from subset ham



# 1. Data and library collection

### 1.1 Library

In [8]:
pip install scikeras



In [137]:
import pandas as pd
import numpy as np
import re
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Embedding, Bidirectional, Dropout, LSTM, BatchNormalization
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import string
import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
import tensorflow as tf
from keras.optimizers import Adam

import gensim
from gensim.utils import simple_preprocess
stop_words = stopwords.words('english')
from scipy import spatial
import itertools
import gensim.corpora as corpora
from pprint import pprint
from gensim.models import CoherenceModel
from gensim.models import Word2Vec
import gensim.downloader
import spacy
# Assumption: the notebook does not show how `nlp` (used in section 8) is created;
# en_core_web_sm stands in for whichever spaCy English pipeline was loaded.
nlp = spacy.load('en_core_web_sm')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 1.2 Data Ingestion

Import the dataframe

In [10]:
from google.colab import files

In [11]:
files.upload()

Saving spam_dataset.csv to spam_dataset.csv




In [None]:
dataframe = pd.read_csv('spam_dataset.csv')

# 2. EDA - Exploratory Data Analysis

### 2.1 Dataframe visualization

In [13]:
dataframe.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\nth...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\n( see a...",0
2,3624,ham,"Subject: neon retreat\nho ho ho , we ' re arou...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\nthis deal is to ...,0


### 2.2 Data cleaning: missing values, duplicates or aberrant values

In [14]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


No missing values are present. Next, check that the label column contains only the two expected binary values and no aberrant ones.

In [15]:
label_num_values = dataframe['label_num'].unique()
label_num_values

array([0, 1])

### 2.3 Level balancing analysis

In [16]:
dataframe['label_num'].value_counts()

label_num,count
0,3672
1,1499


The levels are imbalanced (3672 ham vs 1499 spam), so level 1 (spam) is oversampled until both levels have the same number of rows. Given the small size of the dataset, oversampling the minority level is preferable to undersampling the majority level, which would discard data.

# 3. Data preprocessing

### 3.1 Level balancing

Start by isolating the minority level, spam.

In [17]:
spam = dataframe[dataframe['label_num'] == 1]
len(spam)

1499

To balance exactly against level 0 (ham), count the ham rows and then compute the difference.

In [18]:
ham_number = dataframe[dataframe['label_num'] == 0]
len(ham_number)

3672

Ham has more than twice as many rows as spam (3672 vs 1499). We could repeat every spam row by the right factor and then add the remainder or, more simply, concatenate the spam subset with itself and then append the rows that are still missing.

In [19]:
overspam = pd.concat([spam, spam], ignore_index = True)
len(overspam)

2998

In [20]:
rest_to_oversample = len(ham_number) - len(overspam)
rest_to_oversample

674

In [21]:
overspam = pd.concat([overspam, spam[:rest_to_oversample]], ignore_index = True)
overspam

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
1,4185,spam,Subject: looking for medication ? we ` re the ...,1
2,4922,spam,Subject: vocable % rnd - word asceticism\nvcsc...,1
3,3799,spam,Subject: report 01405 !\nwffur attion brom est...,1
4,3948,spam,Subject: vic . odin n ^ ow\nberne hotbox carna...,1
...,...,...,...,...
3667,4862,spam,Subject: this service is provided by licensed ...,1
3668,4120,spam,"Subject: david\ntriplett , *\n75 % off for all...",1
3669,4636,spam,"Subject: doing clal 1 is , \ / 11 agrra , xana...",1
3670,5027,spam,Subject: photos\nmonth family baby were . simp...,1
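
The repeat-and-top-up arithmetic used above can be sketched in plain Python. This is a minimal illustration on lists rather than on the notebook's DataFrames; the helper name `oversample_to` is hypothetical, and only the two row counts come from the notebook.

```python
def oversample_to(minority, target_size):
    """Repeat `minority` whole as many times as fits, then top up with its first rows."""
    full_copies, remainder = divmod(target_size, len(minority))
    return minority * full_copies + minority[:remainder]

# Row counts taken from the notebook: 1499 spam rows, 3672 ham rows
spam_rows = list(range(1499))
oversampled = oversample_to(spam_rows, 3672)

full_copies, remainder = divmod(3672, 1499)
print(full_copies, remainder, len(oversampled))  # 2 674 3672
```

Because the top-up always takes the first rows, those 674 rows end up appearing three times while the others appear twice. Drawing the remainder at random would be one way to spread the extra copies more evenly.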


In [22]:
ham = dataframe[dataframe['label_num'] == 0]
balanced_dataframe = pd.concat([ham, overspam])
balanced_dataframe

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\nth...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\n( see a...",0
2,3624,ham,"Subject: neon retreat\nho ho ho , we ' re arou...",0
4,2030,ham,Subject: re : indian springs\nthis deal is to ...,0
5,2949,ham,Subject: ehronline web address change\nthis me...,0
...,...,...,...,...
3667,4862,spam,Subject: this service is provided by licensed ...,1
3668,4120,spam,"Subject: david\ntriplett , *\n75 % off for all...",1
3669,4636,spam,"Subject: doing clal 1 is , \ / 11 agrra , xana...",1
3670,5027,spam,Subject: photos\nmonth family baby were . simp...,1


### 3.2 Text cleaning

In [23]:
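def text_preprocessing(text):
    """Lowercase the text and strip digits, punctuation and extra whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)       # digits
    text = re.sub(r'[^\w\s]', '', text)   # anything that is not a word character or whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))  # leftover punctuation, i.e. '_'
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace, newlines included
    # Punctuation and newlines are already gone at this point, so the patterns below
    # (newline, [bracketed] text, URLs, HTML tags) no longer match anything.
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    return text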
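
The order of the substitutions matters in a cleaning function like this one: a URL pattern can only match while the `://` and `.` characters are still in the text. The sketch below compares the two orders on a made-up sentence; the helper names and the sample string are hypothetical and not part of the notebook.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punct_then_urls(text):
    # Punctuation goes first, so the URL pattern finds nothing afterwards
    text = text.translate(PUNCT_TABLE)
    return URL_RE.sub('', text)

def strip_urls_then_punct(text):
    # URLs go first, while '://' and '.' are still present
    text = URL_RE.sub('', text)
    return text.translate(PUNCT_TABLE)

sample = "visit http://example.com/deal now!"
a = " ".join(strip_punct_then_urls(sample).split())
b = " ".join(strip_urls_then_punct(sample).split())
print(a)  # visit httpexamplecomdeal now
print(b)  # visit now
```

With punctuation removed first, the URL survives as a single run-together token, which may or may not be desirable as a spam signal.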

In [24]:
balanced_dataframe['text'] = balanced_dataframe['text'].apply(text_preprocessing)

In [25]:
balanced_dataframe

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,subject enron methanol meter this is a follow ...,0
1,2349,ham,subject hpl nom for january see attached file ...,0
2,3624,ham,subject neon retreat ho ho ho we re around to ...,0
4,2030,ham,subject re indian springs this deal is to book...,0
5,2949,ham,subject ehronline web address change this mess...,0
...,...,...,...,...
3667,4862,spam,subject this service is provided by licensed i...,1
3668,4120,spam,subject david triplett off for all new softwar...,1
3669,4636,spam,subject doing clal is agrra xanaax adlpex all ...,1
3670,5027,spam,subject photos month family baby were simple l...,1


### 3.3 Train-Validation-Test partitioning

Declare x and y, then split the data into train, validation and test sets.

In [26]:
y = balanced_dataframe['label_num']
x = balanced_dataframe['text']
X_train_temp, X_test, y_train_temp, y_test = train_test_split(x, y, test_size = .2, stratify=y)

In [27]:
X_train, X_val, y_train, y_val = train_test_split(X_train_temp, y_train_temp, test_size=.2, stratify= y_train_temp)
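
Two consecutive 20% splits do not give three equal-looking shares: the validation set is 20% of the remaining 80%, i.e. 16% of the data. The sketch below works out the resulting sizes for the 7344 balanced rows. It assumes the test share is rounded up, which is how scikit-learn's `train_test_split` treats a float `test_size`; the helper `split_sizes` is hypothetical.

```python
import math
from fractions import Fraction

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

total = 3672 * 2          # balanced dataframe: 3672 ham + 3672 oversampled spam
fifth = Fraction(1, 5)    # exact 0.2, avoids float rounding in this sketch

n_train_temp, n_test = split_sizes(total, fifth)
n_train, n_val = split_sizes(n_train_temp, fifth)
print(n_train, n_val, n_test)  # 4700 1175 1469
```

These sizes agree with the logs in section 4: 4700 training rows in batches of 32 give 147 steps per epoch, and 1469 test rows give 46 evaluation steps.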

### 3.4 Lemmatization, tokenization, sequencing and padding

In [28]:
lemmatizer = WordNetLemmatizer()

In [29]:
tokenizer = Tokenizer(num_words= 10000)

In [30]:
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    words = text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

The function below combines cleaning, lemmatization, tokenization and padding for a single text. It is used at the end of the process to test the model on a randomly drawn email.

In [None]:
def preprocess_single_text(text):
    # Same cleaning steps as in text_preprocessing, then lemmatization
    text = text_preprocessing(text)
    text = lemmatize_text(text)

    # tokenizer and max_len are the fitted Tokenizer and the padding length defined below
    sequence = tokenizer.texts_to_sequences([text])
    padded_sequence = pad_sequences(sequence, maxlen=max_len)

    return padded_sequence

Apply the lemmatize_text function to the train, validation and test sets. The lemmatizer learns nothing from the data (there is no real fit), so applying it to all three sets causes no leakage. The Keras Tokenizer, by contrast, is fit on the train set only.

In [32]:
X_train_lemmatized = X_train.apply(lemmatize_text)
X_val_lemmatized = X_val.apply(lemmatize_text)
X_test_lemmatized = X_test.apply(lemmatize_text)

In [33]:
tokenizer.fit_on_texts(X_train_lemmatized)

Sequencing

In [34]:
X_train_seq = tokenizer.texts_to_sequences(X_train_lemmatized)
X_val_seq = tokenizer.texts_to_sequences(X_val_lemmatized)
X_test_seq = tokenizer.texts_to_sequences(X_test_lemmatized)
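
To illustrate what fitting on the train set only implies, here is a minimal pure-Python stand-in for the tokenizer. It is a sketch, not the Keras implementation: the function names and sample texts are hypothetical. It mirrors the behaviour of a Keras `Tokenizer` created without an `oov_token`, where words unseen during fitting are simply dropped from the sequences.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer id, most frequent first (ids start at 1)."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(ordered, start=1)}

def to_sequences(texts, word_index):
    """Replace known words by their ids and silently skip unknown ones."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

train_texts = ["cheap pills cheap offer", "meeting offer tomorrow"]
val_texts = ["cheap meeting unseen"]

word_index = fit_word_index(train_texts)
val_seq = to_sequences(val_texts, word_index)
print(word_index)
print(val_seq)  # [[1, 4]]
```

The word "unseen" leaves no trace in the validation sequence, which is the same situation the model faces with genuinely new emails.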

Sequence padding: first compute maxlen as the median of the training sequence lengths

In [35]:
sequence_length = [len(seq) for seq in X_train_seq]

In [36]:
max_len = int(np.median(sequence_length))
max_len

69

In [37]:
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_val_pad = pad_sequences(X_val_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)
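
The effect of padding to the median length can be sketched without Keras. The helper `pad_pre` below is hypothetical; it imitates the default behaviour of `pad_sequences`, which pads and truncates at the start of the sequence (`padding='pre'`, `truncating='pre'`).

```python
from statistics import median

def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` tokens and left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

sequences = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
max_len = int(median(len(s) for s in sequences))
padded = [pad_pre(s, max_len) for s in sequences]
print(max_len, padded)  # 3 [[3, 4, 5], [0, 6, 7], [8, 9, 10]]
```

Choosing the median as maxlen means that about half of the training sequences are longer than maxlen and lose their leading tokens, while the other half are left-padded with zeros.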

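Define vocab_size: the number of distinct words seen by the tokenizer, plus one for the padding index 0. Note that word_index lists every word regardless of num_words, so this value is larger than the 10000 limit set on the Tokenizer.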

In [38]:
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

40314


# 4. Model

### 4.1 Base model definition

In [50]:
model = Sequential()
model.add(Embedding(input_dim = vocab_size, output_dim=128))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Dropout(0.5))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

### 4.2 Compile

In [51]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

### 4.3 Fit

In [52]:
model.fit(X_train_pad, y_train, epochs=4, batch_size=32, validation_data=(X_val_pad, y_val))

Epoch 1/4
147/147 ━━━━━━━━━━━━━━━━━━━━ 79s 478ms/step - accuracy: 0.7817 - loss: 0.4214 - val_accuracy: 0.9540 - val_loss: 0.1267
Epoch 2/4
147/147 ━━━━━━━━━━━━━━━━━━━━ 81s 471ms/step - accuracy: 0.9856 - loss: 0.0669 - val_accuracy: 0.9813 - val_loss: 0.0600
Epoch 3/4
147/147 ━━━━━━━━━━━━━━━━━━━━ 81s 466ms/step - accuracy: 0.9948 - loss: 0.0202 - val_accuracy: 0.9770 - val_loss: 0.1073
Epoch 4/4
147/147 ━━━━━━━━━━━━━━━━━━━━ 0s 452ms/step - accuracy: 0.9967 - loss: 0.0128

<keras.src.callbacks.history.History at 0x79a2120a7b80>

### 4.4 Model Evaluation

In [53]:
evaluation = model.evaluate(X_test_pad, y_test)
print(f'Test Loss: {evaluation[0]}, Test Accuracy: {evaluation[1]}')

46/46 ━━━━━━━━━━━━━━━━━━━━ 6s 137ms/step - accuracy: 0.9863 - loss: 0.0723
Test Loss: 0.07915417850017548, Test Accuracy: 0.9829816222190857


# 5. Test on Spam and Ham recognition

### 5.1 Random Ham Extraction

In [40]:
random_ham = ham.sample(n=1, random_state=42)
random_ham

Unnamed: 0.1,Unnamed: 0,label,text,label_num
2977,3444,ham,Subject: conoco - big cowboy\ndarren :\ni ' m ...,0


In [41]:
ham_test = ham[ham['Unnamed: 0'] == 3444]
ham_test

Unnamed: 0.1,Unnamed: 0,label,text,label_num
2977,3444,ham,Subject: conoco - big cowboy\ndarren :\ni ' m ...,0


In [42]:
sample_ham_text = ham_test['text'].iloc[0]
sample_ham_true_label = ham_test['label_num'].iloc[0]

In [43]:
preprocessed_text = preprocess_single_text(sample_ham_text)
predicted_labels = model.predict(preprocessed_text)[0]
predicted_labels_binary = (predicted_labels >= 0.5).astype(int)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 801ms/step


### 5.2 Random Ham Test

In [44]:
print(f"Text: {sample_ham_text}")
print(f"True Labels: {sample_ham_true_label}")
print(f"Predicted Labels: {predicted_labels_binary}")

Text: Subject: conoco - big cowboy
darren :
i ' m not sure if you can help me with this , but i don ' t know who else to ask . for april and may , we have gas pathed on deal 133304 to conoco at the gepl big cowboy point . conoco is saying that we did not buy that gas from them . they have accounted for all of the hpl big cowboy gas and think we have over paid by about $ 1 . 5 mil each month for the gepl gas . do you know why we added the gepl meter to the deal in april ? could we have bought this gas from someone else ? i have the meter statements from tejas , but they do not say who the supply company was .
megan
True Labels: 0
Predicted Labels: [0]


### 5.3 Random Spam Extraction

In [45]:
random_spam = spam.sample(n=1, random_state=42)
random_spam

Unnamed: 0.1,Unnamed: 0,label,text,label_num
3927,5085,spam,"Subject: liffe is great\nhello ,\nvlsit our me...",1


In [46]:
spam_test = spam[spam['Unnamed: 0'] == 5085]
spam_test

Unnamed: 0.1,Unnamed: 0,label,text,label_num
3927,5085,spam,"Subject: liffe is great\nhello ,\nvlsit our me...",1


In [47]:
sample_spam_text = spam_test['text'].iloc[0]
sample_spam_true_label = spam_test['label_num'].iloc[0]

In [54]:
preprocessed_text = preprocess_single_text(sample_spam_text)
predicted_labels = model.predict(preprocessed_text)[0]
predicted_labels_binary = (predicted_labels >= 0.5).astype(int)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 499ms/step


### 5.4 Random Spam Test

In [55]:
print(f"Text: {sample_spam_text}")
print(f"True Labels: {sample_spam_true_label}")
print(f"Predicted Labels: {predicted_labels_binary}")

Text: Subject: liffe is great
hello ,
vlsit our medsbymail shop and save over 80 %
vl
raam
enle
racl
is ,
and
ag
bi
vlt
al
manyother .
you will be pieasantly surprised with our prlces !
have a nice day .
True Labels: 1
Predicted Labels: [1]


# 6. Topic Modelling

In [56]:
documents = spam['text']

In [57]:
def send_to_words(items):
  for item in items:
    yield(simple_preprocess(item, deacc=True))

In [58]:
def remove_stopwords(texts):
  return[[word for word in words if word not in stop_words and len(word) >= 5] for words in texts]

In [59]:
data_words = list(send_to_words(documents))
data_words = remove_stopwords(data_words)

In [61]:
id2word = corpora.Dictionary(data_words)

corpus = [id2word.doc2bow(text) for text in data_words]
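
The dictionary and bag-of-words step can be pictured with a small pure-Python sketch. The functions `build_token_ids` and `to_bow` are hypothetical stand-ins for `corpora.Dictionary` and `doc2bow`; the ids they assign need not match the ones gensim would choose, but the output has the same shape: one list of (token id, count) pairs per document.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Count the known tokens of one document as sorted (token id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["cheap", "pills", "cheap"], ["stock", "report", "stock", "pills"]]
token2id = build_token_ids(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)  # [[(0, 2), (1, 1)], [(1, 1), (2, 2), (3, 1)]]
```

Word order is discarded at this point: LDA only sees how often each token occurs in each document.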

In [63]:
num_topics = 10

In [64]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,id2word=id2word, num_topics = num_topics, passes=3)



In [65]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.017*"subject" + 0.009*"computron" + 0.008*"please" + 0.007*"contact" + '
  '0.006*"message" + 0.006*"email" + 0.004*"click" + 0.004*"prices" + '
  '0.004*"reply" + 0.004*"money"'),
 (1,
  '0.019*"subject" + 0.003*"normally" + 0.003*"click" + 0.003*"email" + '
  '0.003*"please" + 0.002*"viagra" + 0.002*"epson" + 0.002*"visit" + '
  '0.002*"paliourg" + 0.002*"quality"'),
 (2,
  '0.014*"company" + 0.011*"statements" + 0.008*"subject" + '
  '0.008*"information" + 0.006*"within" + 0.006*"report" + 0.006*"securities" '
  '+ 0.006*"stock" + 0.005*"investment" + 0.005*"looking"'),
 (3,
  '0.009*"subject" + 0.003*"information" + 0.003*"email" + 0.002*"january" + '
  '0.002*"story" + 0.002*"company" + 0.002*"money" + 0.002*"mining" + '
  '0.002*"exploration" + 0.002*"please"'),
 (4,
  '0.015*"height" + 0.010*"width" + 0.007*"style" + 0.006*"family" + '
  '0.006*"moopid" + 0.006*"hotlist" + 0.006*"valign" + 0.006*"subject" + '
  '0.005*"border" + 0.004*"align"'),
 (5,
  '0.010*"subject"

Calculate topic coherence

In [66]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f'Topic coherence: {coherence_lda}')

Topic coherence: 0.46972285737038233


In [67]:
topics = lda_model.show_topics(formatted=False)
representative_words = []

For each topic, the first word is taken as its representative, because the words of a topic are listed in descending order of weight.

In [68]:
for topic in topics:
    topic_id, word_probs = topic
    words = [word for word, _ in word_probs]
    representative_word = words[0]
    representative_words.append(representative_word)

### 6.1 Subset Spam Main Topics

In [69]:
main_topics = []
for idx, word in enumerate(representative_words):
  if word not in main_topics:
    main_topics.append(word)
  print(f"Topic {idx}: {word}")

Topic 0: subject
Topic 1: subject
Topic 2: company
Topic 3: subject
Topic 4: height
Topic 5: subject
Topic 6: subject
Topic 7: height
Topic 8: subject
Topic 9: pills


In [70]:
main_topics

['subject', 'company', 'height', 'pills']
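
The order-preserving deduplication above can also be written in one line with `dict.fromkeys`. The sketch below reuses the ten representative words printed by the notebook and additionally counts how often each one heads a topic.

```python
from collections import Counter

# Top word of each of the 10 topics, as printed above
representative_words = ["subject", "subject", "company", "subject", "height",
                        "subject", "subject", "height", "subject", "pills"]

# dict keys keep insertion order, so duplicates drop out without reordering
main_topics = list(dict.fromkeys(representative_words))
topic_counts = Counter(representative_words)

print(main_topics)            # ['subject', 'company', 'height', 'pills']
print(topic_counts["subject"])  # 6
```

"subject" heads 6 of the 10 topics, most likely because every email in the dataset starts with "Subject:" and the word is not in the stop word list. Adding it to `stop_words` before building the corpus is one option for getting more distinctive topics.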

# 7. Word Embedding

In [73]:
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [None]:
glove_vector = gensim.downloader.load('glove-wiki-gigaword-300')



In [75]:
glove_vector.most_similar(main_topics[0])

[('subjects', 0.6848979592323303),
 ('topic', 0.6692212224006653),
 ('question', 0.6213614344596863),
 ('particular', 0.6028071045875549),
 ('matter', 0.5995612740516663),
 ('discussion', 0.5915062427520752),
 ('topics', 0.5881487727165222),
 ('certain', 0.5754355192184448),
 ('issue', 0.5730682611465454),
 ('questions', 0.5612434148788452)]

In [76]:
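# The vectors follow the order of main_topics, which depends on the LDA run, while the
# variable names are fixed: with main_topics == ['subject', 'company', 'height', 'pills']
# as printed above, vector_pills holds 'company', vector_company 'height' and vector_color 'pills'.
vector_subjects = glove_vector.get_vector(main_topics[0])
vector_pills = glove_vector.get_vector(main_topics[1])
vector_company = glove_vector.get_vector(main_topics[2])
vector_color = glove_vector.get_vector(main_topics[3])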

In [84]:
vector_subjects

array([-2.1840e-01, -2.8124e-01,  3.7378e-01,  6.1875e-02,  3.1466e-01,
        8.3846e-03, -1.6347e-01, -1.8485e-01, -1.0463e-01, -1.7608e+00,
       -7.7687e-02,  1.7499e-03, -3.0439e-01,  1.3145e-01,  1.3656e-01,
       -3.4374e-01, -2.1504e-01,  1.1346e-02,  1.1311e-01,  9.0882e-02,
       -1.0749e-01,  3.7275e-01, -1.7323e-01,  1.7267e-01, -1.1843e-01,
        8.3395e-02,  1.7774e-01, -2.8722e-01,  2.0361e-01,  1.7814e-01,
       -8.8604e-02,  1.4406e-01, -3.9039e-01, -1.8592e-01, -8.1113e-01,
        4.9848e-02, -9.7536e-02, -2.9705e-01, -3.0720e-01,  2.9255e-01,
        1.5588e-01, -2.6929e-01,  1.1744e-01, -2.9994e-02, -1.3471e-01,
        4.5017e-01, -3.7211e-01, -4.2043e-02, -1.6881e-01,  4.5912e-01,
       -4.7173e-02,  2.7309e-02, -2.4706e-02,  2.5323e-01,  4.6287e-01,
       -1.9394e-02, -2.2568e-01, -3.2923e-01,  7.3393e-01, -1.3008e-02,
        1.9766e-01,  9.4465e-02, -9.3694e-02,  2.4454e-01, -2.9178e-01,
       -1.8398e-01,  2.1481e-01, -9.3525e-02, -6.1812e-02,  2.15

In [77]:
main_vectors = [vector_subjects, vector_pills, vector_company, vector_color]

In [90]:
vector_names = ["vector_subjects", "vector_pills", "vector_company", "vector_color"]

In [91]:
vector_dict = {id(vec): name for vec, name in zip(main_vectors, vector_names)}

In [87]:
pairings = list(itertools.combinations(main_vectors, 2))

### 7.1 Semantic distances calculation

In [None]:
for vector_a, vector_b in pairings:
    # 1 - cosine distance = cosine similarity: higher means semantically closer
    cosine_similarity = 1 - spatial.distance.cosine(vector_a, vector_b)
    name_a = vector_dict[id(vector_a)]
    name_b = vector_dict[id(vector_b)]
    print(f"Cosine similarity between {name_a} and {name_b}: {cosine_similarity}")

Cosine similarity between vector_subjects and vector_pills: 0.2164314173705686
Cosine similarity between vector_subjects and vector_company: 0.16034751024632765
Cosine similarity between vector_subjects and vector_color: 0.006097154373375657
Cosine similarity between vector_pills and vector_company: 0.0846695646086274
Cosine similarity between vector_pills and vector_color: 0.1001885832748749
Cosine similarity between vector_company and vector_color: 0.07542619540708062
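
The relation between cosine similarity and cosine distance can be made explicit with a small self-contained sketch. The three toy vectors are made up for illustration and are not GloVe values. Keeping the vectors in a dictionary keyed by name is a possible alternative to the `id()` lookup, since each pair then carries its own labels.

```python
import itertools
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, invented for this example
vectors = {
    "north": [0.0, 1.0, 0.0],
    "east": [1.0, 0.0, 0.0],
    "northeast": [1.0, 1.0, 0.0],
}

results = {}
for (name_a, vec_a), (name_b, vec_b) in itertools.combinations(vectors.items(), 2):
    similarity = cosine_similarity(vec_a, vec_b)
    results[(name_a, name_b)] = similarity
    print(f"{name_a} vs {name_b}: similarity={similarity:.3f}, distance={1 - similarity:.3f}")
```

A similarity near 1 means the two vectors point in almost the same direction, and a value near 0 means they are close to orthogonal. The cosine distance is one minus the similarity, so it moves in the opposite direction.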


# 8. NER tagging, ORGanizations in ham

Check that the NER pipeline works on a sample sentence

In [101]:
sentence = 'This is an example sentence with Microsoft and Apple inside!'

In [102]:
doc_test_mo= nlp(sentence)

In [136]:
for token in doc_test_mo:
  print(f' {token} : {token.ent_type_}')

 This : 
 is : 
 an : 
 example : 
 sentence : 
 with : 
 Microsoft : ORG
 and : 
 Apple : ORG
 inside : 
 ! : 


In [130]:
org_tokens = set()

In [135]:
ham_test = ham['text'].apply(nlp)

### 8.1 NER ORG extraction from Ham subset

In [132]:
for each_sentence in ham_test:
  for token in each_sentence:
    # org_tokens is a set, so duplicates are discarded automatically
    if token.ent_type_ == 'ORG':
      org_tokens.add(token.text)

In [133]:
org_tokens

{'cheryl',
 'brown',
 'venita',
 'phibro',
 'wagner',
 'stephenson',
 '#',
 'esa',
 'martin',
 'needs',
 'sears',
 'resource',
 'miller',
 'hoong',
 'gda',
 'fortin',
 'devon',
 'lindley',
 'dreyfus',
 'see',
 'litigation',
 'painewebber',
 'campbell',
 'texoma',
 'gulf',
 '(',
 '3',
 'holland',
 'devries',
 'errigo',
 'texas',
 '9747',
 'shona',
 'support',
 'templeton',
 'landman',
 'veronica',
 'florida',
 'position',
 'zajac',
 'morela',
 'rick',
 'operating',
 'clem',
 'fidelity',
 'lehman',
 'controls',
 'resume',
 'mrha',
 'lauer',
 'cold',
 'syrup',
 'mci',
 '03',
 'delta',
 'ohio',
 'goodman',
 'gcs',
 'community',
 'milbank',
 'minerals',
 'rico',
 'ets',
 'times',
 'black',
 'fujitsu',
 'katherine',
 'cynet',
 'tammy',
 'effective',
 'evans',
 'org',
 'walters',
 'board',
 'payne',
 'kerr',
 'cavanaug',
 'associate',
 'dodge',
 'briana',
 'pdf',
 'for',
 '70120',
 'emerald',
 'microsoft',
 'n',
 'jdf',
 'he',
 'shut',
 'kemper',
 'alstom',
 'pan',
 'lloyd',
 'julie',
 'explo