# KBIR-inspec
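- Virtual environment: sait
- KBIR extractor(): confirmed that keyphrases are extracted all the way to the end of the text, even for the longest abstract (abstracts[423])
- Long backgrounds exceed KBIR's maximum token count → split them into sentences before extraction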
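
A minimal sketch of how one might check which texts exceed the model's token limit before deciding to split. The names `texts_over_limit` and `count_tokens` are hypothetical and not part of the notebook; `count_tokens` is assumed to wrap the model tokenizer, for example `lambda t: len(tokenizer(t)["input_ids"])`. The default of 512 is only a placeholder, so the real limit should be read from the tokenizer's `model_max_length`.

```python
from typing import Callable, List, Sequence


def texts_over_limit(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,  # placeholder; check tokenizer.model_max_length
) -> List[int]:
    """Return the indices of texts whose token count exceeds max_tokens."""
    return [i for i, text in enumerate(texts) if count_tokens(text) > max_tokens]


# Whitespace splitting stands in for the real tokenizer in this toy example.
sample = ["short text", "one two three four five six"]
too_long = texts_over_limit(sample, lambda t: len(t.split()), max_tokens=4)
print(too_long)
```

Only the texts at the returned indices would need sentence-level splitting; shorter ones could be passed to the extractor whole.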

## Data Loading & Preprocessing

In [1]:
import pickle
from tqdm.notebook import tqdm
from nltk import sent_tokenize

with open('Preprocessed_Data/H01L_2020-2022_9585_abstract.pickle', 'rb') as fr:
    abstracts = pickle.load(fr)
with open('Preprocessed_Data/H01L_2020-2022_9585_background.pickle', 'rb') as fr:
    backgrounds = pickle.load(fr)

In [2]:
len(abstracts)

9585

In [6]:
# Drop backgrounds that are None
backgrounds = [b for b in backgrounds if b is not None]
len(backgrounds)

9293

In [23]:
# Split each background into sentences

backgrounds_sent = []

for b in backgrounds:
    backgrounds_sent.extend(sent_tokenize(b))

In [25]:
len(backgrounds_sent)

108915
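
The flattened sentence list no longer records which patent each sentence came from, and filtering out `None` entries beforehand shifts the indices. A hedged sketch of one way to keep that mapping follows. `split_with_source` and `naive_sent_split` are hypothetical names; the regex splitter is only a stand-in so the example runs without NLTK data, and in the notebook `nltk.sent_tokenize` would be passed as `splitter` instead.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple


def naive_sent_split(text: str) -> List[str]:
    """Stand-in splitter for illustration only (not equivalent to NLTK)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_with_source(
    docs: Sequence[Optional[str]],
    splitter: Callable[[str], List[str]] = naive_sent_split,
) -> List[Tuple[int, int, str]]:
    """Return (doc_index, sentence_index, sentence) rows, skipping None docs."""
    rows = []
    for doc_idx, doc in enumerate(docs):
        if doc is None:
            continue  # keep original numbering instead of re-indexing
        for sent_idx, sent in enumerate(splitter(doc)):
            rows.append((doc_idx, sent_idx, sent))
    return rows


rows = split_with_source(["A b. C d.", None, "E f."])
for row in rows:
    print(row)
```

Because `doc_idx` refers to the position in the original list, each sentence-level keyphrase can later be traced back to its patent.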

In [24]:
for i in backgrounds_sent[:10]:
    print(i)
    print()

Recently, the high integration of semiconductors has increased the processing and storage capacity of information per unit area.

This has led to demands for large diameter semiconductor wafers, miniaturization of circuit line width, and multilayer wiring.

In order to form a multi-layered wiring on a semiconductor wafer, high-level flatness of the wafer is required, and a wafer flattening process is required for such high-level flatness.

One of the wafer flattening processes is a wafer polishing process.

The wafer polishing process is a step of polishing the upper and lower surfaces of the wafer with a polishing pad.

The wafer polishing process is carried out using a polishing system having a polishing unit provided with an upper plate, a lower plate and a means for supplying polishing slurry to the polishing unit.

A pipe connected to the polishing unit for supplying the slurry to the polishing unit may be provided in the polishing system.

However, the abrasive grains contained i

## Model Loading

In [26]:
from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        keyphrases = [result.get("word").strip() for result in results]  # duplicates of the same keyphrase are kept

        return results, keyphrases

In [27]:
# CUDA is not available, so the GPU is not used
# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-inspec"
extractor = KeyphraseExtractionPipeline(model=model_name)

## Keyphrase Extraction from Abstracts

In [6]:
%%time
result = []
list_keyphrases = []

for abstract in tqdm(abstracts):
    patent = {}
    patent['abstract'] = abstract
    patent['keyphrases'] = []
    
    raw, words = extractor(abstract)
    list_keyphrases.extend(words)
    
    for r in raw:
        keyphrase = {}
        keyphrase['keyphrase'] = r['word'].strip()
        keyphrase['start_index'] = r['start']
        keyphrase['end_index'] = r['end']
        patent['keyphrases'].append(keyphrase)
    result.append(patent)

  0%|          | 0/9585 [00:00<?, ?it/s]

CPU times: user 6h 54min 53s, sys: 15.3 s, total: 6h 55min 8s
Wall time: 52min 8s


In [14]:
print(len(result))
print(len(list_keyphrases))

9585
99169
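
The second count includes repeats, since the pipeline keeps the same keyphrase every time it occurs in a text. If per-document unique phrases are wanted, an order-preserving de-duplication could look like the sketch below. `unique_in_order` is a hypothetical helper, not something the notebook defines.

```python
from typing import Hashable, Iterable, List


def unique_in_order(items: Iterable[Hashable]) -> List[Hashable]:
    """Drop repeated items while keeping the order of first occurrence."""
    # dict preserves insertion order, so its keys are the de-duplicated items
    return list(dict.fromkeys(items))


words = ["wafer polishing system", "slurry", "wafer polishing system"]
print(unique_in_order(words))
```

Note that de-duplicating before counting changes the meaning of the frequency list from total occurrences to the number of documents containing each phrase.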


In [15]:
from collections import Counter

list_keyphrases = Counter(list_keyphrases).most_common()
print(len(list_keyphrases))

24995


In [20]:
result[0]

{'abstract': 'The wafer polishing system is disclosed. The wafer polishing system may comprise a polishing unit; a slurry distribution unit mounted on the polishing unit and distributing a slurry flowing into the polishing unit for wafer polishing; a slurry tank connected to the slurry distribution unit and storing the slurry; a slurry pump connected to the polishing unit and the slurry tank for transferring the slurry from the slurry tank to the polishing unit; a first circulation line in which one side is connected to the slurry tank; a second circulation line in which one side is connected to the other side of the first circulation line and the other side is connected to the slurry distribution unit; and a cleaning liquid supply unit connected to the second circulation line for supplying a cleaning liquid flowing through the second circulation line.',
 'keyphrases': [{'keyphrase': 'wafer polishing system',
   'start_index': 4,
   'end_index': 26},
  {'keyphrase': 'wafer polishing sy

In [21]:
list_keyphrases[:30]

[('semiconductor device', 1518),
 ('semiconductor substrate', 1218),
 ('substrate', 854),
 ('dielectric layer', 654),
 ('gate structure', 530),
 ('first', 495),
 ('semiconductor structure', 493),
 ('semiconductor wafer', 488),
 ('layer', 474),
 ('top surface', 456),
 ('semiconductor layer', 428),
 ('second', 415),
 ('process chamber', 407),
 ('substrate processing apparatus', 380),
 ('processing chamber', 368),
 ('fin structure', 359),
 ('gate electrode', 341),
 ('embodiment', 295),
 ('semiconductor fin', 280),
 ('gate stack', 276),
 ('plasma', 260),
 ('insulating layer', 257),
 ('semiconductor material', 246),
 ('conductive layer', 246),
 ('channel region', 240),
 ('conductive material', 227),
 ('electrostatic chuck', 226),
 ('gate dielectric layer', 213),
 ('dielectric material', 209),
 ('semiconductor devices', 173)]

In [22]:
# Save as pickle files
with open('Preprocessed_Data/H01L_2020-2022_9585_abstract_keyphrases.pickle','wb') as fw:
    pickle.dump(result, fw)
with open('Preprocessed_Data/H01L_2020-2022_9585_abstract_keyphrases_list.pickle','wb') as fw:
    pickle.dump(list_keyphrases, fw)

## Keyphrase Extraction from Backgrounds

In [28]:
%%time
result = []
list_keyphrases = []

for sent in tqdm(backgrounds_sent):
    patent = {}
    patent['sent'] = sent
    patent['keyphrases'] = []
    
    raw, words = extractor(sent)
    list_keyphrases.extend(words)
    
    for r in raw:
        keyphrase = {}
        keyphrase['keyphrase'] = r['word'].strip()
        keyphrase['start_index'] = r['start']
        keyphrase['end_index'] = r['end']
        patent['keyphrases'].append(keyphrase)
    result.append(patent)

  0%|          | 0/108915 [00:00<?, ?it/s]

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)


In [29]:
print(len(result))
print(len(list_keyphrases))

108915
260672
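
The abstract loop and the background loop differ only in the dictionary key used for the text. A sketch of a shared helper follows, assuming the extractor returns a `(raw, words)` tuple as the custom pipeline's `postprocess` does. `extract_records` and `fake_extractor` are hypothetical names; the stub stands in for the real model so the example runs on its own.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Extractor = Callable[[str], Tuple[List[Dict[str, Any]], List[str]]]


def extract_records(
    texts: Iterable[str], extractor: Extractor, text_key: str
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Run the extractor on every text and collect records plus a flat word list."""
    records, all_words = [], []
    for text in texts:
        raw, words = extractor(text)
        all_words.extend(words)
        records.append({
            text_key: text,
            "keyphrases": [
                {
                    "keyphrase": r["word"].strip(),
                    "start_index": r["start"],
                    "end_index": r["end"],
                }
                for r in raw
            ],
        })
    return records, all_words


def fake_extractor(text: str):
    """Stub used only for this example; always tags the first five characters."""
    raw = [{"word": " " + text[:5], "start": 0, "end": 5}]
    return raw, [text[:5]]


records, words = extract_records(["wafer polishing"], fake_extractor, "sent")
print(records, words)
```

With the real pipeline, the two loops would reduce to `extract_records(abstracts, extractor, 'abstract')` and `extract_records(backgrounds_sent, extractor, 'sent')`.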


In [30]:
from collections import Counter

list_keyphrases = Counter(list_keyphrases).most_common()
print(len(list_keyphrases))

54952


In [31]:
result[0]

{'sent': 'Recently, the high integration of semiconductors has increased the processing and storage capacity of information per unit area.',
 'keyphrases': [{'keyphrase': 'semiconductors',
   'start_index': 34,
   'end_index': 48},
  {'keyphrase': 'storage capacity', 'start_index': 82, 'end_index': 98}]}

In [32]:
list_keyphrases[:30]

[('semiconductor devices', 2941),
 ('semiconductor device', 2537),
 ('semiconductor wafer', 2286),
 ('integrated circuits', 1488),
 ('semiconductor substrate', 1444),
 ('transistors', 1194),
 ('substrate processing apparatus', 894),
 ('semiconductor industry', 846),
 ('etching', 775),
 ('Semiconductor devices', 685),
 ('CMP', 679),
 ('electrostatic chuck', 635),
 ('semiconductor wafers', 622),
 ('fabrication process', 595),
 ('plasma', 583),
 ('integrated circuit', 577),
 ('channel region', 545),
 ('FinFET', 523),
 ('electronic devices', 482),
 ('manufacturing', 472),
 ('electronic components', 466),
 ('manufacturing process', 456),
 ('photolithography', 455),
 ('lithography', 431),
 ('processing chamber', 419),
 ('wafers', 416),
 ('gate electrode', 414),
 ('process chamber', 409),
 ('capacitors', 395),
 ('functional density', 394)]
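
The list above counts 'semiconductor devices' and 'Semiconductor devices' as separate entries because the counting is case-sensitive. A minimal sketch of merging case variants after the fact is shown below. `merge_case_variants` is a hypothetical helper, and lowercasing is only one possible normalization (it would also fold acronyms such as 'CMP' to 'cmp').

```python
from collections import Counter
from typing import Iterable, List, Tuple


def merge_case_variants(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum counts of phrases that differ only by letter case."""
    merged = Counter()
    for phrase, count in pairs:
        merged[phrase.lower()] += count
    return merged.most_common()


top = [
    ("semiconductor devices", 2941),
    ("semiconductor device", 2537),
    ("Semiconductor devices", 685),
]
print(merge_case_variants(top))
```

Singular and plural forms are still counted separately here; merging those would need a lemmatizer or similar step, which this sketch does not attempt.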

In [33]:
# Save as pickle files
with open('Preprocessed_Data/H01L_2020-2022_9585_background_keyphrases.pickle','wb') as fw:
    pickle.dump(result, fw)
with open('Preprocessed_Data/H01L_2020-2022_9585_background_keyphrases_list.pickle','wb') as fw:
    pickle.dump(list_keyphrases, fw)

## Loading the Results
#### Abstract
- abstract_keyphrases: a list of the form [abstract 1, abstract 2, ... ]
  - abstract n: a dictionary with the keys 'abstract' and 'keyphrases'
    - abstract: the abstract text
    - keyphrases: a list of the form [keyphrase 1, keyphrase 2, ... ]
      - keyphrase n: a dictionary with the keys 'keyphrase', 'start_index', and 'end_index'
        - keyphrase: the keyphrase text
        - start_index: character index in the abstract where the keyphrase starts
        - end_index: character index in the abstract where the keyphrase ends (exclusive, as in Python slicing)

- abstract_keyphrases_list: a list of (keyphrase, count) pairs sorted by descending frequency

#### Background
- background_keyphrases: a list of the form [sentence 1, sentence 2, ... ]
  - sentence n: a dictionary with the keys 'sent' and 'keyphrases'
    - sent: the background text (a single sentence)
    - keyphrases: a list of the form [keyphrase 1, keyphrase 2, ... ]
      - keyphrase n: a dictionary with the keys 'keyphrase', 'start_index', and 'end_index'
        - keyphrase: the keyphrase text
        - start_index: character index in the background sentence where the keyphrase starts
        - end_index: character index in the background sentence where the keyphrase ends (exclusive, as in Python slicing)

- background_keyphrases_list: a list of (keyphrase, count) pairs sorted by descending frequency
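
Given the structure described above, the stored offsets can be sanity-checked by slicing the text and comparing against the stored keyphrase. The sketch below does this for a whole list of records. `find_offset_mismatches` is a hypothetical helper, and the second keyphrase in the sample is deliberately wrong to show what a mismatch looks like.

```python
from typing import Any, Dict, List, Sequence, Tuple


def find_offset_mismatches(
    records: Sequence[Dict[str, Any]], text_key: str
) -> List[Tuple[int, int]]:
    """Return (record_index, keyphrase_index) pairs whose slice differs from the stored text."""
    mismatches = []
    for rec_idx, rec in enumerate(records):
        text = rec[text_key]
        for kp_idx, kp in enumerate(rec["keyphrases"]):
            span = text[kp["start_index"]:kp["end_index"]]
            # stored keyphrases were stripped, so strip the slice before comparing
            if span.strip() != kp["keyphrase"]:
                mismatches.append((rec_idx, kp_idx))
    return mismatches


sample = [{
    "sent": "Recently, the high integration of semiconductors has increased.",
    "keyphrases": [
        {"keyphrase": "semiconductors", "start_index": 34, "end_index": 48},
        {"keyphrase": "wrong", "start_index": 0, "end_index": 8},  # deliberate mismatch
    ],
}]
print(find_offset_mismatches(sample, "sent"))
```

The same function would apply to the abstract records by passing `'abstract'` as `text_key`.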

In [34]:
# Load the pickle files

# Abstract
with open('Preprocessed_Data/H01L_2020-2022_9585_abstract_keyphrases.pickle', 'rb') as fr:
    abstract_keyphrases = pickle.load(fr)
with open('Preprocessed_Data/H01L_2020-2022_9585_abstract_keyphrases_list.pickle', 'rb') as fr:
    abstract_keyphrases_list = pickle.load(fr)

# Background
with open('Preprocessed_Data/H01L_2020-2022_9585_background_keyphrases.pickle', 'rb') as fr:
    background_keyphrases = pickle.load(fr)
with open('Preprocessed_Data/H01L_2020-2022_9585_background_keyphrases_list.pickle', 'rb') as fr:
    background_keyphrases_list = pickle.load(fr)

In [36]:
# Abstract
print(len(abstract_keyphrases)) # equals the total number of patents
print(len(abstract_keyphrases_list)) # number of unique keyphrases

9585
24995


In [35]:
# Background
print(len(background_keyphrases)) # equals the total number of sentences
print(len(background_keyphrases_list)) # number of unique keyphrases

108915
54952


In [37]:
# Abstract Example
print(abstract_keyphrases[0]['abstract']) # abstract of the first patent
print()
print(abstract_keyphrases[0]['keyphrases'][0]) # first keyphrase of the first patent

The wafer polishing system is disclosed. The wafer polishing system may comprise a polishing unit; a slurry distribution unit mounted on the polishing unit and distributing a slurry flowing into the polishing unit for wafer polishing; a slurry tank connected to the slurry distribution unit and storing the slurry; a slurry pump connected to the polishing unit and the slurry tank for transferring the slurry from the slurry tank to the polishing unit; a first circulation line in which one side is connected to the slurry tank; a second circulation line in which one side is connected to the other side of the first circulation line and the other side is connected to the slurry distribution unit; and a cleaning liquid supply unit connected to the second circulation line for supplying a cleaning liquid flowing through the second circulation line.

{'keyphrase': 'wafer polishing system', 'start_index': 4, 'end_index': 26}


In [38]:
# Abstract Example
start = abstract_keyphrases[0]['keyphrases'][0]['start_index']
end = abstract_keyphrases[0]['keyphrases'][0]['end_index']

print(abstract_keyphrases[0]['keyphrases'][0]['keyphrase'])
print(abstract_keyphrases[0]['abstract'][start:end]) # slice the abstract using the stored indices

wafer polishing system
wafer polishing system


In [43]:
# Background Example
print(background_keyphrases[0]['sent']) # first background sentence
print()
print(background_keyphrases[0]['keyphrases'][0]) # first keyphrase of the first sentence

Recently, the high integration of semiconductors has increased the processing and storage capacity of information per unit area.

{'keyphrase': 'semiconductors', 'start_index': 34, 'end_index': 48}


In [44]:
# Background Example
start = background_keyphrases[0]['keyphrases'][0]['start_index']
end = background_keyphrases[0]['keyphrases'][0]['end_index']

print(background_keyphrases[0]['keyphrases'][0]['keyphrase'])
print(background_keyphrases[0]['sent'][start:end]) # slice the background sentence using the stored indices

semiconductors
semiconductors
