# 문장에서 일치하는 단어 찾기

20.02.17

문장으로 이루어진 original_text / voice_text가 있고, 단어로 이루어진 Meta / Keytalk가 있다. 이때 문장 속에서 일치하는 단어 목록을 찾아내면 되는 작업이다.   
  
Ex)  
  
문장 1: 나는 맛있는 사과를 먹는다.    
문장 2 : 그는 달콤한 바나나를 먹었다.   

단어 목록 : 사과, 바나나, 멜론, 맛있는, 달콤한, 새콤한    

문장 1에 대한 결과 : [맛있는, 사과]  
문장 2에 대한 결과 : [달콤한, 바나나]    



### Library

In [1]:
import pandas as pd
import re
import warnings
warnings.filterwarnings(action='ignore')
from tqdm import tqdm_notebook

## Preprocessing

#### Sentence

In [2]:
# 데이터 불러오기
sentence_1 = pd.read_excel('sentences.xlsx', sheet_name = 'sentences_1')[['vt_pk', 'original_text', 'voice_text']]
sentence_2 = pd.read_excel('sentences.xlsx', sheet_name = 'sentences_2', nrows = 8160)[['vt_pk', 'original_text', 'voice_text']]

sentence = pd.concat([sentence_1, sentence_2], axis = 0).reset_index(drop = True)

# 인덱스 붙이기
sentence['No'] = sentence.index.values

# 소문자로
sentence['original'] = sentence['original_text'].str.lower()
sentence['voice'] = sentence['voice_text'].str.lower()

# . 제거
sentence['original'] = sentence['original_text'].str.replace('.', '')
sentence['voice'] = sentence['voice_text'].str.replace('.', '')

# 데이터 예시
sentence[['original', 'voice']].sample()

Unnamed: 0,original,voice
7402,What are some rowan joffe movies with rotten t...,what are some rowing you off of movies with Ro...


#### Meta

In [3]:
# 데이터 불러오기
meta = pd.read_excel('words.xlsx', sheet_name = 'm')

# 소문자로
meta['meta'] = meta['kl_name'].str.lower()
meta['meta'] = meta['meta'].str.replace('.', '')

# 데이터 예시
meta[['meta']].sample(3)

Unnamed: 0,meta
357,imdb: 7-8
6120,gbenga akinnagbe
8004,time travel


#### Key

In [4]:
# 데이터 불러오기
key = pd.read_excel('words.xlsx', sheet_name = 'k')

# 소문자로
key['key'] = key['kl_name'].str.lower()
key['key'] = key['key'].str.replace('.', '')

# 키토크 고유값 뽑아내기
key_original = list(key['key'].unique())

# 데이터 예시
key[['key']].sample(3)

Unnamed: 0,key
8954,commit atrocity
6384,sanitized
1891,aggravating voice


In [5]:
print('전체 키토크 수 : {}\n고유 키토크 수 : {}'.format(len(key['key']), len(key_original)))

전체 키토크 수 : 13458
고유 키토크 수 : 7999


---    

'단어 in 문장'을 이용   
  
But)   
문장1 : butterfly fly away  
단어 : butterfly, butter, fly, way  
  
내가 원하는 결과 : butterfly, fly  
'단어 in 문장'으로 했을 때 결과 : butterfly, butter, fly, way  
  
→ 애매하게 겹치는 단어도 매칭이 된다.      


&nbsp;   
★ 해결방안      
'공백 단어 공백'일 때 확실하게 들어있는 것은 1  
'단어 공백'&'공백 단어'일 때 확인해야하는 것은 2  
  
Ex)      
문장 1 : ' butterfly fly away '  
단어 : butterfly, butter, fly, way  
  
결과 : [[butterfly, 1], [butter, 2], [fly, 1], [way, 2]]  
  
→ 1로 나온거는 다 선택하고, 2로 나온건 확인한다.    
  

In [6]:
def wis(sent, word):   # word in sentence
    sent = list(sent.unique())
    word = list(word.unique())

    result = []
    
    for i in tqdm_notebook(word):
        for text in sent:
            if re.search(f' {i} ', text):
                result.append([i, text])
                
    result = pd.DataFrame(result)
    return result


def binder(df, oov, mok):
    binded = pd.DataFrame()

    if oov == 'original':
        sentence_unique = list(df['original'].unique())
    else:
        sentence_unique = list(df['voice'].unique())
        
    for i in tqdm_notebook(range(len(sentence_unique))):
        temp = pd.DataFrame({oov:sentence_unique[i], 
                            f'{mok}':[list(df[df[oov]==sentence_unique[i]]['kl_name'])]})
        binded = pd.concat([binded, temp], axis = 0)

    return binded

앞 뒤 공백 생성

In [7]:
sentence['original'] = ' ' + sentence['original']  + ' '
sentence['voice'] = ' ' + sentence['voice'] + ' '

### 메타 오리지널

문장 속에 있는 메타 찾기

In [8]:
meta_original = wis(sentence['original'], meta['meta'])

meta_original_2 = pd.DataFrame({'meta':meta_original[0], 'original':meta_original[1]})
meta_original_2 = pd.merge(meta_original_2, meta, on = 'meta', how = 'left')

HBox(children=(IntProgress(value=0, max=7979), HTML(value='')))




Sentence가 같은 값에 따라 meta값 묶어주기

In [9]:
original_meta = binder(meta_original_2, 'original', 'meta_o')

original_meta.sample(3)

HBox(children=(IntProgress(value=0, max=8281), HTML(value='')))




Unnamed: 0,original,meta_o
0,What are some david hogan movies with metasco...,"[Metascore: 60-70, David Hogan]"
0,A comeuppance movie directed by ronny yu,[Ronny Yu]
0,An action movie by keishi otomo,"[Action, Keishi Otomo]"


sentence와 합치기

In [10]:
sentence_result = pd.merge(sentence, original_meta, on = 'original', how = 'left')

---
### 메타 보이스

문장 속에 있는 메타 찾기

In [11]:
meta_voice = wis(sentence['voice'], meta['meta'])

meta_voice_2 = pd.DataFrame({'meta':meta_voice[0], 'voice':meta_voice[1]})
meta_voice_2 = pd.merge(meta_voice_2, meta, on = 'meta', how = 'left')

HBox(children=(IntProgress(value=0, max=7979), HTML(value='')))




Sentence가 같은 값에 따라 meta값 묶어주기

In [12]:
voice_meta = binder(meta_voice_2, 'voice', 'meta_v')

voice_meta.sample(3)

HBox(children=(IntProgress(value=0, max=2093), HTML(value='')))




Unnamed: 0,voice,meta_v
0,Find me a duty documentary movie,[Documentary]
0,a horror movie by kimya Juan,[Horror]
0,a science fiction movie by Don Roose,[Science Fiction]


sentence와 합치기

In [13]:
sentence_result = pd.merge(sentence_result, voice_meta, on = 'voice', how = 'left')

In [14]:
sentence_result.to_excel('메타 결과_원본.xlsx', index = False)

--- 
### 키토크 오리지널  

문장 속에 있는 키토크 찾기

In [15]:
key_original = wis(sentence['original'], key['key'])

key_original_2 = pd.DataFrame({'key':key_original[0], 'original':key_original[1]})
key_original_2 = pd.merge(key_original_2, key, on = 'key', how = 'left')

HBox(children=(IntProgress(value=0, max=7999), HTML(value='')))




In [16]:
key_original_2.head()

Unnamed: 0,key,original,kl_pk,kl_name,kl_category
0,masterpiece,Can you recommend me an artistic masterpiece ...,1.0,masterpiece,Opinion
1,masterpiece,Can you recommend me an artistic masterpiece ...,,masterpiece,
2,masterpiece,Something timeless masterpiece with a rating ...,1.0,masterpiece,Opinion
3,masterpiece,Something timeless masterpiece with a rating ...,,masterpiece,
4,masterpiece,Something epic masterpiece with a rating rott...,1.0,masterpiece,Opinion


Sentence가 같은 값에 따라 meta값 묶어주기

In [17]:
original_key = binder(key_original_2, 'original', 'key_o')

HBox(children=(IntProgress(value=0, max=6843), HTML(value='')))




In [18]:
original_key.sample(3)

Unnamed: 0,original,key_o
0,Something sheer spectacle with a rating metas...,"[sheer spectacle, sheer spectacle]"
0,Something overly impressive with a rating met...,"[impressive, impressive, overly impressive, ov..."
0,A movie where james cromwell plays hissable c...,"[hissable, hissable]"


sentence와 합치기

In [19]:
sentence_result = pd.merge(sentence_result, original_key, on = 'original', how = 'left')

--- 
### 키토크 보이스

문장 속에 있는 키토크 찾기

In [20]:
key_voice = wis(sentence['voice'], key['key'])

key_voice_2 = pd.DataFrame({'key':key_voice[0], 'voice':key_voice[1]})
key_voice_2 = pd.merge(key_voice_2, key, on = 'key', how = 'left')

HBox(children=(IntProgress(value=0, max=7999), HTML(value='')))




In [21]:
key_voice_2.head()

Unnamed: 0,key,voice,kl_pk,kl_name,kl_category
0,piece of art,something piece of art with a rating imdb rat...,2.0,piece of art,General Reaction
1,piece of art,something a piece of art with a rating IMDb r...,2.0,piece of art,General Reaction
2,good,which movie has a sitting front row good script,3.0,good,Opinion
3,good,which movie has a sitting front row good script,,good,
4,good,recommend me a movie by Mark Andrews featurin...,3.0,good,Opinion


Sentence가 같은 값에 따라 meta값 묶어주기

In [22]:
voice_key = binder(key_voice_2, 'voice', 'key_v')

HBox(children=(IntProgress(value=0, max=5240), HTML(value='')))




In [23]:
voice_key.sample(3)

Unnamed: 0,voice,key_v
0,something accessible with a rating IMDb 7 to 8,"[accessible, accessible]"
0,recommend me a movie by Richie King featuring...,"[likable, likable, immensely likable, immensel..."
0,Something lustrous with a rating Metascore ab...,"[lustrous, lustrous]"


sentence와 합치기

In [24]:
sentence_result = pd.merge(sentence_result, voice_key, on = 'voice', how = 'left')

### 내보내기

In [25]:
sentence_result[['original_text', 'meta_o', 'key_o', 'voice_text', 'meta_v', 'key_v']].sample(3)

Unnamed: 0,original_text,meta_o,key_o,voice_text,meta_v,key_v
8968,Which movie has searing a lot of twist,,"[searing, searing, a lot of twist]",which movie has searing a lot of twists,,"[lot of twists, searing, searing]"
1184,Fetch a tacky movies by bill pohlad,[Bill Pohlad],"[tacky, tacky]",fecha tacky movies by Bill pohlad,,"[tacky, tacky]"
5090,Something absolute blast with a rating metasco...,[Metascore: 60-70],"[blast, blast, absolute blast, absolute blast]",Something absolute blast with a rating Metasco...,,"[blast, blast, absolute blast, absolute blast]"


In [26]:
sentence_result.to_excel('result.xlsx', index = False, encoding = 'utf-8')