## 1. Why and When Do We Need It?

* **Patent data** has been a primary source for monitoring technological advancement and innovation activities; however, it has a limited ability to portray **market-side information**.
  
* To overcome this, existing studies have relied on **web mining techniques** to analyze companies’ web content, which companies commonly use to exhibit their **commercial activities**, including **main products and services, key technologies,** and personnel.
  
* However, company websites often include **miscellaneous information**, ranging from primary business information (i.e., products and services) to more general information (i.e., factory sites and recruitment). The noisy information embedded in the raw web data may blur the following analysis.
  
* The dual-attention model is designed to **automatically identify technical information** (i.e., products or firms’ technological capabilities) from the raw web data.



## 2. The Structure of the Dual-Attention Model

* Two attention layers are applied at the **webpage-level** and **word-level**, respectively.

* The training target is a **binary objective** of **high-tech versus non-high-tech companies**. High-tech companies are defined as companies with at least one patent in this study.

* The **assumption**: the terminologies presented on the websites of high-tech companies are distributed heterogeneously among
their non-high-tech counterparts (i.e., food companies and wholesalers).

* After the training process, both word-level and page-level attention scores will be extracted for the following information selection process.

<div style="text-align:center">
    <img src="./dual_attn_model.png" width="800" height="400">
</div>


* **Sparsemax vs. Softmax:** In contrast to the conventional softmax function, the sparsemax function converts real values into sparse probabilities. Because a company's website usually covers miscellaneous webpages, using sparsemax enables the algorithm to be more concentrated on crucial webpages, which also helps save one parameter for the following keyword extraction process.

* **Reference**: [Identifying technology opportunity using dual-attention model and technology-market concordance matrix.](https://www.sciencedirect.com/science/article/abs/pii/S0040162523006017)


In [1]:
import pandas as pd
import numpy as np
import pickle
import torch
import torch.nn as nn
from tqdm import tqdm
from collections import Counter
from tokenizer import Tokenizer
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score
from dual_attn import DualAttnModel

#### Step 1. Load the data

* When loading your own data, make sure the **column names** match those provided in the example dataframe.

In [2]:
data = pd.read_excel("fpatent.xlsx")
data

Unnamed: 0.2,Unnamed: 0.1,company_name,urls,cleaned_content,hojin_id,hightechflag,Unnamed: 0
0,0,IN*110022183105,http://www.shaktipumps.com/pdf/Investor_Relati...,streamhj,10000,1,
1,1,IN*110022183105,https://www.shaktipumps.com/branch-offices.php,shakti|branch|offices||location|branch|shakti|...,10000,1,
2,2,IN*110022183105,https://www.shaktipumps.com/csr-activity.php,company|activity-|shakti|limited|blower|homeco...,10000,1,
3,3,IN*110022183105,https://www.shaktipumps.com,shakti|manufacturer|supplier|blower|homecompan...,10000,1,
4,4,IN*110022183105,http://www.shaktipumps.com/media/promotions/de...,blower|homecompany|closecompany|profilevision|...,10000,1,
...,...,...,...,...,...,...,...
244174,244174,IN0007502693,http://www.lexistoolingsystems.com/straight-co...,,23199,0,128739.0
244175,244175,IN0007502693,http://www.lexistoolingsystems.com/sitemap.html,,23199,0,128740.0
244176,244176,IN0007502693,http://www.lexistoolingsystems.com/er-tap-coll...,,23199,0,128741.0
244177,244177,IN0007502693,http://www.lexistoolingsystems.com/er-wrench-s...,,23199,0,128742.0


#### Step 2. Load the pretrained word vectors

* To initialize the dual-attention model, we need to use the pre-trained word vectors.

* For example, one can download different pre-trained word vectors provided by [FastText](https://fasttext.cc/docs/en/crawl-vectors.html).

<div style="text-align:center">
    <img src="./pretrainedftt.png" width="800" height="400">
</div>

In [3]:

import numpy as np


class FastVector:
    """
    Minimal wrapper for fastvector embeddings.
    ```
    Usage:
        $ model = FastVector(vector_file='/path/to/wiki.en.vec')
        $ 'apple' in model
        > TRUE
        $ model['apple'].shape
        > (300,)
    ```
    """

    def __init__(self, vector_file='', transform=None):
        """Read in word vectors in fasttext format"""
        self.word2id = {}

        # Captures word order, for export() and translate methods
        self.id2word = []

        print('reading word vectors from %s' % vector_file)
        with open(vector_file, 'r', encoding='utf-8') as f:
            print(f)
            (self.n_words, self.n_dim) = \
                (int(x) for x in f.readline().rstrip('\n').split(' '))
            self.embed = np.zeros((self.n_words, self.n_dim))
            for i, line in enumerate(f):
                elems = line.rstrip('\n').split(' ')
                self.word2id[elems[0]] = i
                self.embed[i] = elems[1:self.n_dim+1]
                self.id2word.append(elems[0])
        
        # Used in translate_inverted_softmax()
        self.softmax_denominators = None
        
        if transform is not None:
            print('Applying transformation to embedding')
            self.apply_transform(transform)

    def apply_transform(self, transform):
        """
        Apply the given transformation to the vector space

        Right-multiplies given transform with embeddings E:
            E = E * transform

        Transform can either be a string with a filename to a
        text file containing a ndarray (compat. with np.loadtxt)
        or a numpy ndarray.
        """
        transmat = np.loadtxt(transform) if isinstance(transform, str) else transform
        self.embed = np.matmul(self.embed, transmat)

    def export(self, outpath):
        """
        Transforming a large matrix of WordVectors is expensive. 
        This method lets you write the transformed matrix back to a file for future use
        :param The path to the output file to be written 
        """
        fout = open(outpath, "w")

        # Header takes the guesswork out of loading by recording how many lines, vector dims
        fout.write(str(self.n_words) + " " + str(self.n_dim) + "\n")
        for token in self.id2word:
            vector_components = ["%.6f" % number for number in self[token]]
            vector_as_string = " ".join(vector_components)

            out_line = token + " " + vector_as_string + "\n"
            fout.write(out_line)

        fout.close()

    def translate_nearest_neighbour(self, source_vector):
        """Obtain translation of source_vector using nearest neighbour retrieval"""
        similarity_vector = np.matmul(FastVector.normalised(self.embed), source_vector)
        target_id = np.argmax(similarity_vector)
        return self.id2word[target_id]

    def translate_inverted_softmax(self, source_vector, source_space, nsamples,
                                   beta=10., batch_size=100, recalculate=True):
        """
        Obtain translation of source_vector using sampled inverted softmax retrieval
        with inverse temperature beta.

        nsamples vectors are drawn from source_space in batches of batch_size
        to calculate the inverted softmax denominators.
        Denominators from previous call are reused if recalculate=False. This saves
        time if multiple words are translated from the same source language.
        """
        embed_normalised = FastVector.normalised(self.embed)
        # calculate contributions to softmax denominators in batches
        # to save memory
        if self.softmax_denominators is None or recalculate is True:
            self.softmax_denominators = np.zeros(self.embed.shape[0])
            while nsamples > 0:
                # get batch of randomly sampled vectors from source space
                sample_vectors = source_space.get_samples(min(nsamples, batch_size))
                # calculate cosine similarities between sampled vectors and
                # all vectors in the target space
                sample_similarities = \
                    np.matmul(embed_normalised,
                              FastVector.normalised(sample_vectors).transpose())
                # accumulate contribution to denominators
                self.softmax_denominators \
                    += np.sum(np.exp(beta * sample_similarities), axis=1)
                nsamples -= batch_size
        # cosine similarities between source_vector and all target vectors
        similarity_vector = np.matmul(embed_normalised,
                                      source_vector/np.linalg.norm(source_vector))
        # exponentiate and normalise with denominators to obtain inverted softmax
        softmax_scores = np.exp(beta * similarity_vector) / \
                         self.softmax_denominators
        # pick highest score as translation
        target_id = np.argmax(softmax_scores)
        return self.id2word[target_id]

    def get_samples(self, nsamples):
        """Return a matrix of nsamples randomly sampled vectors from embed"""
        sample_ids = np.random.choice(self.embed.shape[0], nsamples, replace=False)
        return self.embed[sample_ids]

    @classmethod
    def normalised(cls, mat, axis=-1, order=2):
        """Utility function to normalise the rows of a numpy array."""
        norm = np.linalg.norm(
            mat, axis=axis, ord=order, keepdims=True)
        norm[norm == 0] = 1
        return mat / norm
    
    @classmethod
    def cosine_similarity(cls, vec_a, vec_b):
        """Compute cosine similarity between vec_a and vec_b"""
        return np.dot(vec_a, vec_b) / \
            (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

    def __contains__(self, key):
        return key in self.word2id

    def __getitem__(self, key):
        return self.embed[self.word2id[key]]


In [4]:

en_dictionary = FastVector(vector_file='cc.en.300.vec')

words = list(en_dictionary.word2id.keys())
vectors = np.array([en_dictionary[word] for word in words])

wv_dict = dict(zip(words, vectors))

reading word vectors from cc.en.300.vec
<_io.TextIOWrapper name='cc.en.300.vec' mode='r' encoding='utf-8'>


#### Step 3. Preprocess the data

* Remove bad entries.
* Set the **max_page** parameter for cutting and padding strategies.

In [5]:
no_values = []

for i in tqdm(data.cleaned_content):
    try:
        i = i.split('|')
        i = [j for j in i if j in wv_dict]
        if len(i) < 1:
            no_values.append(1)
        else:
            no_values.append(0)
    except:
        no_values.append(1)


data['no_values'] = no_values
data = data[data.no_values == 0]
data = data[['hojin_id', 'company_name', 'urls', 'cleaned_content', 'hightechflag']]


hojin_ids = list(set(data.hojin_id))

sample_data = pd.DataFrame({})
max_page = 32

for hojin_id in tqdm(hojin_ids):
    temp = data[data.hojin_id == hojin_id]
    if temp.shape[0] <= max_page:
        sample_data = pd.concat([sample_data, temp], ignore_index=True)
    else:
        temp = temp.sample(n=max_page)
        sample_data = pd.concat([sample_data, temp], ignore_index=True)
        #sample_data = pd.concat([sample_data, temp.iloc[:max_page, :]], ignore_index=True)

num_words = [len(i.split('|')) for i in sample_data.cleaned_content]
sample_data['num_words'] = num_words
sample_data = sample_data[sample_data.num_words > 5]

hojin_ids = list(set(sample_data.hojin_id))
hojin_ids = [int(i) for i in hojin_ids]

tokenizer = Tokenizer(words, max_len=864, data = sample_data)

web_vectors = [tokenizer.encode_webportfolio(company_id=idx, max_page=max_page) for idx in tqdm(hojin_ids)]

seq_ids = torch.tensor([i[1] for i in web_vectors])
num_pages = torch.tensor([i[0] for i in web_vectors])
seq_lengths = tokenizer.max_len - torch.sum(seq_ids == 0, axis=-1)

labels = torch.tensor([tokenizer.get_label(i) for i in hojin_ids])
hojin_ids = torch.tensor(hojin_ids)

100%|███████████████████████████████████████████████████████████████████████| 244179/244179 [00:12<00:00, 18874.97it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 5358/5358 [00:14<00:00, 372.80it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 5194/5194 [00:09<00:00, 571.67it/s]


#### Step 4. Initialize the dual-attention model and training settings

In [6]:
batch_size = 10
dataset = TensorDataset(seq_ids, num_pages, seq_lengths, labels, hojin_ids)
train_dataloader = DataLoader(dataset, batch_size=batch_size)

def evaluate(data_loader, model):
    model.eval()
    count = 0
    i = 0
    gold_labels = []
    pred_labels = []
    for ind, batch in tqdm(enumerate(data_loader), ncols=80):

        seq_ids, num_pages, seq_lengths, label_list, hojin = batch
        outputs, _, _, _, _, _, _ = model(seq_ids.to(device), num_pages.to(device), seq_lengths.to(device))
        preds = (outputs>0.5).squeeze()

        gold_labels += list(label_list.cpu().numpy())
        pred_labels += list(preds.cpu().numpy())
        num = (preds.cpu() == label_list.bool()).sum().cpu().item()
        count += num
        i += 1
    accuracy = count*1.0/(i * batch_size)
    print('Evaluation accuracy:', accuracy)
    return accuracy



vectors = np.array(list(wv_dict.values()))
words = list(wv_dict.keys())
vectors_all = np.vstack([np.zeros(300), vectors])

torch.manual_seed(1218)
loss_function = nn.BCELoss()
scale = 10

model = DualAttnModel(vocab_size=len(words)+1, embed_dim=300, hidden_dim=300,
                             label_dim=1, scale=10, page_scale=10)
model.load_vector(pretrained_vectors=vectors_all, trainable=False)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = model.to(device)

optimizer = torch.optim.Adagrad(model.parameters(), lr=0.02, weight_decay=0.0000, lr_decay=0.01) # try different learning rates


embeddings loaded


#### Step 5. Training

In [None]:
best_model = None
best_acc = 0
for i in range(10):
    print('Epoch:', i)
    print('#'*20)
    total_loss = 0
    count = 0
    model.train()
    for ind, batch in tqdm(enumerate(train_dataloader), ncols=80):
        seq_ids, num_pages, seq_lengths, label_list, hojin = batch
        model.zero_grad()
        preds, _, _, _, _, _, _ = model(seq_ids.to(device), num_pages.to(device), seq_lengths.to(device))
        loss = loss_function(preds.squeeze(), label_list.to(device).float())
        loss.backward()
        optimizer.step()
        total_loss += loss.cpu().item()*len(seq_ids)
        count += len(seq_ids)
    print('total_loss:  ', total_loss/count)
    print('Training Accuracy')
    evaluate(train_dataloader, model)

Epoch: 0
####################


520it [00:13, 38.65it/s]


total_loss:   0.7498329973790019
Training Accuracy


520it [00:11, 44.86it/s]


Evaluation accuracy: 0.5340384615384616
Epoch: 1
####################


520it [00:12, 42.16it/s]


total_loss:   0.7247671073429576
Training Accuracy


520it [00:10, 49.38it/s]


Evaluation accuracy: 0.5340384615384616
Epoch: 2
####################


520it [00:11, 44.33it/s]


total_loss:   0.7216899127261566
Training Accuracy


520it [00:10, 50.15it/s]


Evaluation accuracy: 0.5340384615384616
Epoch: 3
####################


520it [00:12, 41.51it/s]


total_loss:   0.7214595958917565
Training Accuracy


520it [00:11, 46.84it/s]


Evaluation accuracy: 0.5340384615384616
Epoch: 4
####################


520it [00:12, 40.46it/s]


total_loss:   0.7205519180459429
Training Accuracy


520it [00:10, 51.00it/s]


Evaluation accuracy: 0.5340384615384616
Epoch: 5
####################


520it [00:11, 45.66it/s]


total_loss:   0.7209792135195132
Training Accuracy


520it [00:10, 49.64it/s]


Evaluation accuracy: 0.5340384615384616
Epoch: 6
####################


520it [00:11, 43.48it/s]


total_loss:   0.7202679919058147
Training Accuracy


304it [00:07, 54.67it/s]

#### Step 6. Collect the results

In [None]:
words = list(en_dictionary.word2id.keys())
vectors = np.array([en_dictionary[word] for word in words])

wv_dict = dict(zip(words, vectors))

hojin_ids = list(set(sample_data.hojin_id))
hojin_ids = [int(i) for i in hojin_ids]

tokenizer = Tokenizer(words, max_len=864, data = sample_data)

web_vectors = [tokenizer.encode_webportfolio(company_id=idx, max_page=max_page) for idx in tqdm(hojin_ids)]

seq_ids = torch.tensor([i[1] for i in web_vectors])
num_pages = torch.tensor([i[0] for i in web_vectors])
seq_lengths = tokenizer.max_len - torch.sum(seq_ids == 0, axis=-1)
labels = torch.tensor([tokenizer.get_label(i) for i in hojin_ids])
hojin_ids = torch.tensor(hojin_ids)

import numpy as np
import matplotlib
import matplotlib.pyplot as plt


def colorize(words, color_array):
    cmap=matplotlib.cm.Blues
    template = '<span class="barcode"; style="color: black; background-color: {}">{}</span>'
    colored_string = ''
    for word, color in zip(words, color_array):
        color = matplotlib.colors.rgb2hex(cmap(color)[:3])
        print(color)
        colored_string += template.format(color, '&nbsp' + word + '&nbsp')
    return colored_string

def select_keywords(attn_w, words, n=10):
    combo = [(i, j) for i, j in zip(attn_w, words) if i != 0]
    attn_w = np.array([i[0] for i in combo])
    words = [i[1] for i in combo]
    attn_diff = attn_w.max() - attn_w
    attn_thres = np.percentile(attn_diff, n)
    selected_keywords = [i for i, j in zip(words, attn_diff) if j <= attn_thres]
    selected_keywords_show = [0.6 if j <= attn_thres else 0 for i, j in zip(words, attn_diff)]
    return selected_keywords, selected_keywords_show

url_col = []
text_col = []
sents_selected = []
weight_col = []
hojin_id_col = []
hightechflag_col = []
model.eval()

final_vecs = []
web_vecs = []
page_attns = []
urls = []


for t in range(len(hojin_ids)):

    probs, senti_scores, attn, page_attn, final_vec, page_score, web_vec = model(seq_ids[t:(t+1)].to(device), num_pages[t:(t+1)].to(device), seq_lengths[t:(t+1)].to(device))
    id_to_token = tokenizer.id_to_token
    id_to_token[0] = '#'


    sents = []
    for i in range(num_pages[t:(t+1)].tolist()[0]):
        sents.append(' '.join([id_to_token[w] for w in seq_ids[t:(t+1)][0][i].tolist()]))

    final_vecs.append(final_vec.detach().cpu().numpy())
    df = pd.DataFrame({'url': list(sample_data[sample_data.hojin_id == int(hojin_ids[t].tolist())].urls),
                   'hojin_id': list(sample_data[sample_data.hojin_id == int(hojin_ids[t].tolist())].hojin_id),
                   'hightechflag': list(sample_data[sample_data.hojin_id == int(hojin_ids[t].tolist())].hightechflag),
                   'text':list(sample_data[sample_data.hojin_id == int(hojin_ids[t].tolist())].cleaned_content),
                   'weight': page_attn.view(-1)[:num_pages[t:(t+1)].tolist()[0]].tolist(),
                   'page_score': page_score.view(-1)[page_score.view(-1) > -9999].tolist(),
                   #'web_vecs': list(web_vec[0])[:num_pages[t:(t+1)].tolist()[0]]
                   })
    df = df[df.weight > 0].reset_index()


    hojin_id_col.extend(df.hojin_id)
    hightechflag_col.extend(df.hightechflag)
    url_col.extend(df.url)
    text_col.extend(df.text)
    weight_col.extend(df.weight)

    for i in list(df['index']):
        sent = sents[i]
        attn1 = attn.squeeze()[i]

        words = sent.split()
        color_array = np.array(attn1.view(-1).tolist())

        selected_keywords, selected_keywords_show = select_keywords(color_array, words, n=20)

        sents_selected.append([j for j, k in zip(words, selected_keywords_show) if k != 0])


sents_selected = ['|'.join(i) for i in sents_selected]
sents_selected = ['|'.join(list(set(i.split('|')))) for i in sents_selected]


selected_df = pd.DataFrame({
    'hojin_id': hojin_id_col,
    'url': url_col,
    'weight': weight_col,
    'text':text_col,
    'sents': sents_selected, 'hightechflag': hightechflag_col,
    })


100%|██████████████████████████████████████████| 20/20 [00:00<00:00, 162.31it/s]


In [None]:
selected_df

Unnamed: 0,hojin_id,url,weight,text,sents,hightechflag
0,704512,https://sbgi.net/sinclair-broadcast-group-name...,0.091239,menu|investor|relations|profile|sec|filings|fi...,via|phone|tulsa|billie|mail|expand|mass|im|net...,1.0
1,704512,https://sbgi.net/join-sinclair/,0.100939,menu|join|sinclair|current|openings|view|job|l...,bally|open|fmla|dental|telehealth|leave|k|inc|...,1.0
2,704512,https://sbgi.net/news/the-national-desk/,0.096508,menu|americas|news|national|desk|tnd|launched|...,rights|menu|subscription|inc|news|sinclair|ame...,1.0
3,704512,https://sbgi.net/investor-relations/,0.093900,menu|investor|relations|profile|sec|filings|fi...,phone|grow|billie|download|mail|expand|network...,1.0
4,704512,https://sbgi.net/,0.107802,diversified|media|company|broadcast|sports|mar...,drag|bally|sg|via|nhl|go|select|mass|networks|...,1.0
...,...,...,...,...,...,...
297,704530,https://www.highcomarmor.com/product-category/...,0.039235,american|manufactured|ballistic|armor|solution...,ar|contact|locator|threat|registration|trauma|...,0.0
298,704530,https://www.highcomarmor.com/product-category/...,0.023877,american|manufactured|ballistic|armor|solution...,ba|contact|locator|registration|ohio|outreach|...,0.0
299,704530,https://www.highcomarmor.com/product-category/...,0.009736,american|manufactured|ballistic|armor|solution...,stp|locator|threat|registration|ohio|sa|outrea...,0.0
300,704531,https://morgangroupholdingco.com/category/press/,0.443004,morgan|group|holding|company|home|sec|filings|...,p|non|date|officer|rye|no|fees|inc|of|j|to|yor...,0.0
