# Embedding Corpus Builder:
The code in this notebook pulls text from the Pega website and from Agile Studio epics to create a corpus of Pega-related text. Once the corpus is defined, we train a word2vec model on it to capture the specialized language that surrounds Pega technology.

Further, by combining a standard English corpus (the Brown Corpus) with the Pega corpus we created, we train a second word2vec (W2V) model to capture relationships between words in Pega-specific language as well as relationships that apply across the English language.

---
## Corpus 1: Pega Crawler
Purpose: Crawl a Pega link with a single layer of recursion and pull the relevant text to establish a Pega corpus of words.

*Authors: Jake Epstein & Matt Kenney*

#### Import Libraries:

In [1]:
#!bash ../nlp_workspace/install_packages.sh

In [2]:
# import libraries
import multiprocessing
import re
from gensim.models import Word2Vec
import contractions
import string
import pickle
from tqdm import tqdm
import time

import pandas as pd
import numpy as np

In [3]:
import sys
sys.path.insert(1, '../nlp_engine')
from Preprocessing import preprocess_training_text, text_to_corpus

Spacy model is using GPU


#### Define Preprocessing Functions


#### Crawl a specified URL, and pull all english text to define our corpus

In [4]:
import urllib3
from urllib.parse import urlparse
from urllib.parse import urljoin
from langdetect import detect
import requests
from bs4 import BeautifulSoup
from validator_collection import validators, checkers

def crawl_webpage(url):
    """
    Crawls all webpages at a recursion depth of 1 page from the passed url. Pulls
    all paragraphs from the crawled webpages and returns them as a list of paragraphs.
    Ignores webpages that were not written in english
    """

    # request data from url and parse html
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # make a list of all of the embedded links in the page
    href_tags = soup.find_all(href=True)
    hrefs = [tag.get('href') for tag in href_tags]
    count = 0
    paragraph_list = []

    # loop through embedded links
    for href in hrefs:
        # if the link is relative, join it with the base url to form a full url
        o = urlparse(href)
        if o.scheme != 'http' and o.scheme != 'https':
            href = urljoin('https://www.pega.com', href)

        # skip anything that is still not a valid url
        if not checkers.is_url(href):
            continue

        # fetch the linked page and parse its html
        response = requests.get(href)
        soup = BeautifulSoup(response.content, "html.parser")

        # skip pages with no body
        if soup.body is None:
            continue

        # keep the page only if its body text is detected as english
        lang = detect(soup.body.get_text())
        if lang == 'en':
            print(href)
            # collect every <p> tag; cleaning happens later in text_to_corpus
            paras = soup.find_all('p')
            paragraph_list.extend(paras)
            count += 1

    print('Number of Links Crawled', count)
    return paragraph_list

In [5]:
paragraph_list = crawl_webpage('https://www.pega.com/glossary')

https://www.pega.com/glossary
https://www.pega.com/glossary
https://www.pega.com/node/66696
https://www.pega.com/glossary


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://academy.pega.com
https://community.pega.com/support
https://community.pega.com/knowledgebase
https://www.pega.com/events/pegaworld
https://design.pega.com/
https://www.pega.com/services/partnerships
https://www.pega.com/services/consulting
https://www.pega.com/insights
https://www.pega.com/about/careers
https://www.pega.com/glossary
https://www.pega.com
https://www.pega.com/contact-us
https://www.pega.com/user/register?destination=/glossary
https://www.pega.com/user/login?destination=/glossary
https://www.pega.com/
https://www.pega.com/
https://www.pega.com/contact-us
https://www.pega.com/user/register?destination=/glossary
https://www.pega.com/user/login?destination=/glossary
https://www.pega.com/products
https://www.pega.com/products/crm-applications
https://www.pega.com/products/crm-applications/marketing
https://www.pega.com/products/crm-applications/sales-automation
https://www.pega.com/products/crm-applications/customer-service
https://www.pega.com/products/pega-platform/
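
The crawl log above lists several URLs more than once (for example `https://www.pega.com/glossary`), so the same paragraphs enter the corpus repeatedly. One possible way to avoid this is to resolve and deduplicate the links before requesting them. The sketch below is only an illustration: `normalize_links` is a hypothetical helper that is not part of this notebook, and it uses only the standard library.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order.

    Hypothetical helper for illustration; not part of the original crawler.
    """
    seen = set()
    links = []
    for href in hrefs:
        # urljoin leaves absolute URLs unchanged and resolves relative ones;
        # urldefrag strips "#fragment" so in-page anchors collapse to one URL
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # skip mailto:, javascript:, and other non-web schemes
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


example = normalize_links(
    "https://www.pega.com/glossary",
    ["/glossary", "https://www.pega.com/glossary", "/products", "#main", "mailto:x@example.com"],
)
print(example)
```

With a helper like this, the loop in `crawl_webpage` could iterate over the normalized list instead of the raw `hrefs`, so each page is requested at most once.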

In [6]:
# Test text_to_corpus function
text_to_corpus(str(paragraph_list[2]))

'pegaworld inspire .'

#### Convert pulled text into a format suitable for training a word2vec model

In [7]:
from importlib import reload
import tqdm
reload(tqdm)
from tqdm import tqdm

In [8]:
# put all crawled paragraphs into one big string and preprocess
big_string = ''
for paragraph in tqdm(paragraph_list):
    big_string += text_to_corpus(str(paragraph)) + ' '

100%|██████████| 7209/7209 [02:00<00:00, 60.04it/s] 


In [9]:
# separate the string into a list of sentences, each a list of tokens ending
# with a period - the input format word2vec expects
regex = r'\b\w+\b'
sentences = big_string.split('.')
for i, sentence in enumerate(sentences):
    words = re.findall(regex, sentence)
    words.append('.')
    sentences[i] = words

# Inspect the sentences. They are now in proper format for loading into a w2v model.
print('Total number of sentences in this corpus:', len(sentences), '\n\nExample Sentences:')
sentences[:5]

Total number of sentences in this corpus: 10472 

Example Sentences:


[['ebook', '.'],
 ['case', 'study', '.'],
 ['pegaworld', 'inspire', '.'],
 ['adaptive', 'case', 'management', '.'],
 ['banking', 'business', 'process', '.']]
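
The same split-on-period step is repeated later for the Agile Studio text, so it may be convenient to wrap it in a function. The sketch below assumes the input has already been through `text_to_corpus`; `to_w2v_sentences` is a hypothetical name. Unlike the cell above, it skips chunks that contain no words, which would otherwise become sentences consisting of a lone `'.'`.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w+\b")


def to_w2v_sentences(text):
    """Split preprocessed text on periods into lists of tokens, each ending with '.'.

    Hypothetical helper for illustration; mirrors the notebook's loop.
    """
    sentences = []
    for chunk in text.split("."):
        words = TOKEN_PATTERN.findall(chunk)
        if not words:
            # assumption: empty chunks (e.g. after the final period) carry no signal
            continue
        words.append(".")
        sentences.append(words)
    return sentences


demo = to_w2v_sentences("ebook . case study . pegaworld inspire . ")
print(demo)
```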

#### Save the cleaned and processed pega corpus into a pickle called "crawl_words" for future use

In [10]:
# save the cleaned and processed pega corpus into a pickle called "crawl_words" for future use
with open('../data/pickles/crawl_words.pkl', 'wb') as f:
    pickle.dump(sentences, f)

## Corpus 2: Agile Studio
Purpose: Convert Agile Studio epic texts from the past 5 years into a format suitable for a w2v model. We will use these texts to augment the corpus we obtained by crawling the Pega website.

#### Step 1: Pull Agile Studio Epics and Define Corpus:

In [11]:
# pull csv
astudio_bodies = pd.read_csv("../data/csvs/EpicTextForNLP.csv")

In [12]:
astudio_bodies

| Unnamed: 0 | Epic ID | Label | Description | Update Date/Time |
|---|---|---|---|---|
| 0 | EPIC-60115 | UX enhancements to Spaces (quick wins) | Below are a list of small UX enhancements for ... | 2/3/20 1:48 PM |
| 1 | EPIC-56149 | Add support for Websphere Liberty 19 | As a...\n\nCustomer, \n\n\n\nI would...\n\nlik... | 2/3/20 1:26 PM |
| 2 | EPIC-59441 | [Regional-ML1] Basic regional UIService - depl... | As a modern cloud service reqional deployment ... | 2/3/20 1:00 PM |
| 3 | EPIC-59446 | [Regional-MLx] Service scaleability and sharin... | When deployed for Regional use, as a sys admin... | 2/3/20 12:56 PM |
| 4 | EPIC-37840 | Component build optimization w/cosmos defer load | Create a webpack configuration for both produc... | 2/3/20 12:55 PM |
| ... | ... | ... | ... | ... |
| 7839 | EPIC-9994 | UIKit Changes for End User Portals | UI Kit updates for end user portals:\nCase Man... | 2/9/15 9:17 AM |
| 7840 | EPIC-9387 | PERF: JS Modularization for improved mobile pe... | As a mobile user, I should get the response ti... | 2/6/15 4:19 PM |
| 7841 | EPIC-9513 | DX: Build from scratch demo enhancements for ML8 | As a Pega demoer, I would like to be able to r... | 2/4/15 10:07 AM |
| 7842 | EPIC-10940 | Efficient offline work in native applications | Native mobile applications should work efficie... | 2/4/15 8:51 AM |


In [13]:
# rename columns
astudio_bodies = astudio_bodies.rename(columns={"Epic ID" : "epic_id", "Label" : "label", "Description" : "description", "Update Date/Time" : "update date/time"})

In [14]:
# clean the astudio_bodies data set: drop rows with a null label or description,
# then drop every row whose label appears more than once (keep=False keeps no copy)
astudio_bodies = astudio_bodies.dropna(subset=['label', 'description'])
astudio_bodies = astudio_bodies.drop_duplicates(subset='label', keep=False)

astudio_bodies['combined'] = astudio_bodies['label'].map(str) + '.' + ' ' + astudio_bodies['description']

In [15]:
astudio_bodies['combined'].iloc[6936]

'Easily reference a data type from a case type form. As a new user defining a case type, I want a simple and intuitive way to reference a data type I defined from a step in my process.\n\nUse case: I created a data type for job posting. I want to reference job postings from the first step of my job application case type. I should be able to very simply configure the ability to search for a job posting and pull in associated details. Only the job key should be stored with my case type but I should be able to use it to lookup other data about the job easily.'

In [16]:
tqdm.pandas(desc="performing text to corpus conversion")

  from pandas import Panel


In [17]:
# Apply the `text_to_corpus` function to preprocess epic texts and start converting them to w2v format
from tqdm import tqdm
astudio_bodies['combined'] = astudio_bodies['combined'].progress_apply(text_to_corpus)

performing text to corpus conversion: 100%|██████████| 7159/7159 [02:37<00:00, 45.52it/s]


In [18]:
# split into sentences (the word2vec format)
big_string = ''
for paragraph in astudio_bodies['combined']:
    big_string += paragraph + ' '
    
regex = r'\b\w+\b'
sentences = big_string.split('.')
for i, sentence in enumerate(sentences):
    words = re.findall(regex, sentence)
    words.append('.')
    sentences[i] = words

#### Step 2: Combine the Pega Crawler Corpus with the Agile Studio Corpus:

In [19]:
# load the crawled corpus from the "crawl_words" pickle
print('Num Sentences in Agile Studio Corpus:', len(sentences))
with open('../data/pickles/crawl_words.pkl', 'rb') as f:
    crawl_sentences = pickle.load(f)
    
sentences.extend(crawl_sentences)
print('Num Sentences in  Agile Studio + Pega Glossary Corpus:', len(sentences))

Num Sentences in Agile Studio Corpus: 36629
Num Sentences in  Agile Studio + Pega Glossary Corpus: 47101


In [20]:
# Shuffle the combined Agile Studio and Pega website sentences in place
import random
random.shuffle(sentences)

In [21]:
# Save the Combined Pega Corpus:
with open('../data/pickles/pega_corpus.pkl', 'wb') as f:
    pickle.dump(sentences, f)

## Training The Word2Vec Model on the Pega Corpus
Now that we've defined a Pega corpus, complete with sentences from both the Pega website and Agile Studio epic texts, we'll train a W2V model to represent the semantic similarity between words in these corpora.

#### Define Word2Vec Model

In [22]:
# make word2vec model
# note: `size` and `iter` are the Gensim 3.x argument names; Gensim 4 renamed them
# to `vector_size` and `epochs`
word_list = sentences
EMB_DIM = 300 # The embedding dimension is the size of the embedding vector that represents each word
w2v = Word2Vec(word_list, size=EMB_DIM, window=5, min_count=2, negative=15, iter=10, workers=multiprocessing.cpu_count())
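
The `min_count` argument tells Word2Vec to ignore words whose total frequency in the corpus is below the threshold, so it directly controls vocabulary size. The standard-library sketch below only imitates that filter on a toy corpus to show the effect of raising the threshold; `surviving_vocab` is a hypothetical helper and does not call Gensim.

```python
from collections import Counter


def surviving_vocab(sentences, min_count):
    """Return the set of tokens that occur at least min_count times.

    Hypothetical helper that imitates Word2Vec's frequency filter.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


toy = [
    ["case", "management", "."],
    ["case", "study", "."],
    ["adaptive", "case", "management", "."],
]
vocab_2 = surviving_vocab(toy, 2)
vocab_3 = surviving_vocab(toy, 3)
print(sorted(vocab_2))
print(sorted(vocab_3))
```

Running the same count on the real `sentences` list before training is a cheap way to see how many Pega-specific terms a given threshold would drop.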

In [23]:
# save word2vec model for future use
w2v.wv.save("../saved_models/word_embeddings/pega_corpus.kv")

In [24]:
# test word2vec model by finding similar vectors of known words
word_vectors = w2v.wv
result = word_vectors.similar_by_word("trefler")
print("Most similar words to 'trefler':\n", result[:4])
result = word_vectors.similar_by_word("wait")
print("Most similar words to 'wait':\n", result[:4])
result = word_vectors.similar_by_word("app")
print("Most similar words to 'app':\n", result[:4])

Most similar words to 'trefler':
 [('alan', 0.9895352125167847), ('founder', 0.9795060157775879), ('ceo', 0.9764717221260071), ('magazine', 0.9440256357192993)]
Most similar words to 'wait':
 [('delay', 0.7934874892234802), ('second', 0.7325660586357117), ('lose', 0.7165380120277405), ('average', 0.7122946977615356)]
Most similar words to 'app':
 [('application', 0.6767767667770386), ('device', 0.5885218381881714), ('sa', 0.5421031713485718), ('customize', 0.5119743943214417)]
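
The scores printed above are cosine similarities between word vectors, which is how Gensim ranks neighbours in `similar_by_word`. As a rough illustration of what that ranking does, the sketch below reimplements it in plain Python on made-up two-dimensional vectors. The vectors and the `most_similar` helper are assumptions for demonstration and are not taken from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=4):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# toy vectors, invented for this example
toy_vectors = {
    "app": [1.0, 0.0],
    "application": [0.9, 0.1],
    "wait": [0.0, 1.0],
}
ranked = most_similar("app", toy_vectors)
print(ranked)
```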


## Corpus 3: The Brown Corpus
Purpose: By downloading the Brown Corpus with NLTK and appending it to our existing Pega corpus, we can train a w2v model that is more general. By combining Pega-specific text with standard English text, we expect to derive a W2V model that captures the relationships between standard English words as well as Pega terminology.

In [26]:
# Download & Load the brown corpus
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [27]:
# Get W2V format of brown corpus
from nltk.corpus import brown
brown_corpus = brown.sents()

In [28]:
# Reload Pega Corpus (if it isn't loaded already):
with open('../data/pickles/pega_corpus.pkl', 'rb') as f:
    pega_corpus = pickle.load(f)

In [29]:
print("Brown Corpus Length:", len(brown_corpus))
print("Pega Corpus Length:", len(pega_corpus))

Brown Corpus Length: 57340
Pega Corpus Length: 47101


#### Combine brown and pega corpus, create word2vec model

In [30]:
# combine corpora
import random
combined_corpus = []
combined_corpus.extend(pega_corpus)
combined_corpus.extend(brown_corpus)
random.shuffle(combined_corpus)
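
Because `random.shuffle` is unseeded here, the sentence order differs on every run. If a repeatable corpus order is wanted, one option is to shuffle with a dedicated seeded generator. The sketch below is a minimal illustration under that assumption: `combine_and_shuffle` and the seed value are hypothetical, and the toy lists stand in for the real corpora.

```python
import random


def combine_and_shuffle(corpora, seed=0):
    """Concatenate several corpora and shuffle them with a fixed seed.

    Hypothetical helper; accepts any iterables of token sequences.
    """
    combined = []
    for corpus in corpora:
        # copy each sentence into a plain list so the result is self-contained
        combined.extend(list(sentence) for sentence in corpus)
    # a private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(combined)
    return combined


pega_demo = [["case", "study", "."], ["pegaworld", "inspire", "."]]
brown_demo = [["The", "jury", "said", "."], ["It", "was", "late", "."]]
first = combine_and_shuffle([pega_demo, brown_demo], seed=42)
second = combine_and_shuffle([pega_demo, brown_demo], seed=42)
print(first == second)
```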

In [31]:
# make word2vec model (Gensim 3.x argument names; Gensim 4 uses `vector_size` and `epochs`)
EMB_DIM = 300 # The embedding dimension is the size of the embedding vector that represents each word
w2v = Word2Vec(combined_corpus, size=EMB_DIM, window=5, min_count=5, negative=15, iter=10, workers=multiprocessing.cpu_count())

In [32]:
# test model using known keywords
word_vectors = w2v.wv
result = word_vectors.similar_by_word("pega")
print("Most similar words to 'pega':\n", result[:4])
result = word_vectors.similar_by_word("nature")
print("Most similar words to 'nature':\n", result[:4])
result = word_vectors.similar_by_word("cloud")
print("Most similar words to 'cloud':\n", result[:4])

Most similar words to 'pega':
 [('pega7', 0.7314003705978394), ('customize', 0.7154072523117065), ('demo', 0.7009561061859131), ('profile', 0.6969360709190369)]
Most similar words to 'nature':
 [('moral', 0.7723469734191895), ('philosophy', 0.7510935068130493), ('existence', 0.7411466836929321), ('importance', 0.7322114706039429)]
Most similar words to 'cloud':
 [('pegacloud', 0.7072383165359497), ('infinity', 0.7040613889694214), ('azure', 0.7028890252113342), ('pcf', 0.6927876472473145)]


#### Save word2vec model for future use
**Note** that `.wv` returns the model's `KeyedVectors` object, so only the trained vectors are saved. According to Gensim:
        
> The reason for separating the trained vectors into KeyedVectors is that if you don’t need the full model state any more (don’t need to continue training), the state can discarded, resulting in a much smaller and faster object that can be mmapped for lightning fast loading and sharing the vectors in RAM between processes.

In [33]:
w2v.wv.save("../saved_models/word_embeddings/combined_corpus.kv")