# 핵심 키워드 추출

### text rank (누가 나의 텍스트를 많이 참조하는가?)
 * TextRank: Bringing Order into Texts (논문)

하나의 노드가 연결된 노드에게 균등하게 score를 준다.  
자신의 노드가 n개에 연결되어 있으면, 1/n의 만큼 균등하게 score를 준다.  
## 1. TF-IDF
#### sample text

In [2]:
d1 = "The cat sat on my face. I hate a cat."
d2 = "The dog sat on my bed. I love a dog"

#### sklearn 활용 TF-IDF

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

document_ls = [d1, d2]
# print(document_ls)
# > ['The cat sat on my face. I hate a cat.', 'The dog sat on my bed. I love a dog']

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(document_ls)
# print(tfidf)

word2id = defaultdict(lambda: 0)
# print(word2id)
# > {}

for idx, feature in enumerate(vectorizer.get_feature_names()):
    word2id[feature] = idx
#     print(word2id[feature])
#     0, 1, 2, 3, 4, 5... 9

#### dataframe 으로 변환하여 출력

In [6]:
import pandas as pd
count_vect_df = pd.DataFrame(tfidf.todense(), columns=vectorizer.get_feature_names())
count_vect_df

Unnamed: 0,bed,cat,dog,face,hate,love,my,on,sat,the
0,0.0,0.706006,0.0,0.353003,0.353003,0.0,0.251164,0.251164,0.251164,0.251164
1,0.353003,0.0,0.706006,0.0,0.0,0.353003,0.251164,0.251164,0.251164,0.251164


In [7]:
tfidf.todense()

matrix([[0.        , 0.70600557, 0.        , 0.35300279, 0.35300279,
         0.        , 0.25116439, 0.25116439, 0.25116439, 0.25116439],
        [0.35300279, 0.        , 0.70600557, 0.        , 0.        ,
         0.35300279, 0.25116439, 0.25116439, 0.25116439, 0.25116439]])

#### TF-IDF score가 높은 순으로 출력

In [15]:
import numpy as np 

feature_array = np.array(vectorizer.get_feature_names())
tfidf_sorting = np.argsort(tfidf[0].toarray()).flatten()[::-1] # 정렬하는 코드

n = 3
top_n = feature_array[tfidf_sorting][:n] # 3개를 보여준다. (0, 1, 2)

top_n

array(['cat', 'hate', 'face'], dtype='<U4')

## 2. Textrank

In [16]:
#https://www.businessinsider.sg/faang-stocks-are-a-dead-trade-for-the-next-several-months-2018-3/amp/

Text = "The FAANG stocks won’t see much more growth in the near future, according to Bill Studebaker, founder and Chief Investment Officer of Robo Global. \
Studebaker argues we are seeing a 'reallocation' that will continue from large-cap tech stocks into market-weight stocks. \
The FAANG stocks have had a rough few weeks, and have been hit hard since March 12. \
One FAANG to look out for, in the midst of all this, is Amazon, according to Studebaker. \
The stock market is seeing a 'reallocation' out of FAANG stocks, which are not where the smart money is, founder and Chief Investment Officer of Robo Global Bill Studebaker told Business Insider. \
The FAANG stocks (Facebook, Apple, Amazon, Netflix, Google) are all down considerably since March 12, a trend that accelerated when news of a massive Facebook data scandal broke, sending the tech-heavy Nasdaq into a downward frenzy. \
Investors are wondering what’s next. \
And what’s next isn’t good news for FAANG stock optimists, Studebaker thinks. 'This is a dead trade' for the next several months, he said. 'I wouldn’t expect there to be a lot of performance attribution coming from the FAANG stocks,' he added. That is, if the stock market is to see gains in the next several months, they will largely not come from the big tech companies. \
The market is seeing a 'reallocation out of large-cap technology, into other parts of the market,' he said. And this trend could continue for the foreseeable future. 'When you get these reallocation trades, a de-risking, this can go on for months and months.' The FAANG’s are pricey stocks, he said, pointing out that investors will 'factor in the law of big numbers,' he said. 'Just because they’re big cap doesn’t mean they’re safe,' he added. \
Still, he doesn’t necessarily think that investors are going to shift drastically into value stocks. 'With an increasingly favorable macro backdrop, you have strong growth demand.' \
Studebaker, who runs an artificial intelligence and robotics exchange-traded fund with $4 billion in assets under management, thinks that AI and robotics are better areas of growth. His ETF is up 27% in the past year, while the FAANG stocks are also largely up over that same span, even if they are down since March 12. \
While many point to artificial intelligence as an area that will be a boost to Google and Amazon, Studebaker doesn’t see that as a sign of significant growth for the FAANGs. He pointed out that 'eighty to ninety percent of their businesses are still search,' and that 'AI doesn’t really move the needle on the business.' He also said 'the revenue mix [attributable to AI] in those businesses are insignificant.' \
And while he’s not bullish on FAANG’s, he does say that the one FAANG to still watch out for is Amazon, simply because ecommerce still represents a small portion of the global retail market, giving the company room to grow." 

### 1) 토큰화 (Tokenization)
 분석 텍스트 정제

In [18]:
import nltk
import string
from nltk.tokenize import TreebankWordTokenizer
nltk.download('punkt')

text = TreebankWordTokenizer().tokenize(Text)

print("Tokenized Text: \n")
print(text)

Tokenized Text: 

['The', 'FAANG', 'stocks', 'won', '’', 't', 'see', 'much', 'more', 'growth', 'in', 'the', 'near', 'future', ',', 'according', 'to', 'Bill', 'Studebaker', ',', 'founder', 'and', 'Chief', 'Investment', 'Officer', 'of', 'Robo', 'Global.', 'Studebaker', 'argues', 'we', 'are', 'seeing', 'a', "'reallocation", "'", 'that', 'will', 'continue', 'from', 'large-cap', 'tech', 'stocks', 'into', 'market-weight', 'stocks.', 'The', 'FAANG', 'stocks', 'have', 'had', 'a', 'rough', 'few', 'weeks', ',', 'and', 'have', 'been', 'hit', 'hard', 'since', 'March', '12.', 'One', 'FAANG', 'to', 'look', 'out', 'for', ',', 'in', 'the', 'midst', 'of', 'all', 'this', ',', 'is', 'Amazon', ',', 'according', 'to', 'Studebaker.', 'The', 'stock', 'market', 'is', 'seeing', 'a', "'reallocation", "'", 'out', 'of', 'FAANG', 'stocks', ',', 'which', 'are', 'not', 'where', 'the', 'smart', 'money', 'is', ',', 'founder', 'and', 'Chief', 'Investment', 'Officer', 'of', 'Robo', 'Global', 'Bill', 'Studebaker', 'told'

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 2) 품사부착 (POS Tagging)
 토큰화된 텍스트에 품사 부착

In [19]:
nltk.download('averaged_perceptron_tagger')
POS_tag = nltk.pos_tag(text) # 품사 부착

print(POS_tag)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


[('The', 'DT'), ('FAANG', 'NNP'), ('stocks', 'NNS'), ('won', 'VBD'), ('’', 'JJ'), ('t', 'NN'), ('see', 'VBP'), ('much', 'RB'), ('more', 'JJR'), ('growth', 'NN'), ('in', 'IN'), ('the', 'DT'), ('near', 'JJ'), ('future', 'NN'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('Bill', 'NNP'), ('Studebaker', 'NNP'), (',', ','), ('founder', 'NN'), ('and', 'CC'), ('Chief', 'NNP'), ('Investment', 'NNP'), ('Officer', 'NNP'), ('of', 'IN'), ('Robo', 'NNP'), ('Global.', 'NNP'), ('Studebaker', 'NNP'), ('argues', 'VBZ'), ('we', 'PRP'), ('are', 'VBP'), ('seeing', 'VBG'), ('a', 'DT'), ("'reallocation", 'NN'), ("'", 'POS'), ('that', 'WDT'), ('will', 'MD'), ('continue', 'VB'), ('from', 'IN'), ('large-cap', 'JJ'), ('tech', 'NN'), ('stocks', 'NNS'), ('into', 'IN'), ('market-weight', 'JJ'), ('stocks.', 'NN'), ('The', 'DT'), ('FAANG', 'NNP'), ('stocks', 'NNS'), ('have', 'VBP'), ('had', 'VBD'), ('a', 'DT'), ('rough', 'JJ'), ('few', 'JJ'), ('weeks', 'NNS'), (',', ','), ('and', 'CC'), ('have', 'VBP'), ('been',

[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


### 3) 표제어 추출 (Lemmatization)

In [20]:
from nltk.corpus import wordnet
nltk.download('wordnet')

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('N'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


In [22]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

lemmatized_text = []
# 각 토큰의 표제어 추출
for word in POS_tag:
    lemmatized_text.append(str(wordnet_lemmatizer.lemmatize(word[0],pos=get_wordnet_pos(word[1]))))
    
print(lemmatized_text)

['The', 'FAANG', 'stock', 'won', '’', 't', 'see', 'much', 'more', 'growth', 'in', 'the', 'near', 'future', ',', 'according', 'to', 'Bill', 'Studebaker', ',', 'founder', 'and', 'Chief', 'Investment', 'Officer', 'of', 'Robo', 'Global.', 'Studebaker', 'argues', 'we', 'are', 'seeing', 'a', "'reallocation", "'", 'that', 'will', 'continue', 'from', 'large-cap', 'tech', 'stock', 'into', 'market-weight', 'stocks.', 'The', 'FAANG', 'stock', 'have', 'had', 'a', 'rough', 'few', 'weeks', ',', 'and', 'have', 'been', 'hit', 'hard', 'since', 'March', '12.', 'One', 'FAANG', 'to', 'look', 'out', 'for', ',', 'in', 'the', 'midst', 'of', 'all', 'this', ',', 'is', 'Amazon', ',', 'according', 'to', 'Studebaker.', 'The', 'stock', 'market', 'is', 'seeing', 'a', "'reallocation", "'", 'out', 'of', 'FAANG', 'stock', ',', 'which', 'are', 'not', 'where', 'the', 'smart', 'money', 'is', ',', 'founder', 'and', 'Chief', 'Investment', 'Officer', 'of', 'Robo', 'Global', 'Bill', 'Studebaker', 'told', 'Business', 'Insider

### 4) 불용어(Stopwords) 처리 및 불필요한 품사 제거

In [24]:
stopwords = [] # 불용어 배열

# 추출 키워드 대상이 되는 품사 지정 (N: 명사, J: 형용사)
wanted_POS = ['NN','NNS','NNP','NNPS','JJ','JJR','JJS']

# 추출 키워드 대상 품사가 아닌 토큰은 불용어로 등록
for word in POS_tag:
    if word[1] not in wanted_POS:
        stopwords.append(word[0])
        
# punctuation 을 불용어로 추가
punctuations = list(str(string.punctuation))
stopwords = stopwords + punctuations

# 사용자 정의 토큰을 불용어로 추가
stopwords_plus = ['t','isn']
stopwords = stopwords + stopwords_plus
stopwords = set(stopwords)

processed_text = []
for word in lemmatized_text:
    if word not in stopwords:
        processed_text.append(word)
print(processed_text)

['FAANG', 'stock', 'more', 'growth', 'near', 'future', 'Bill', 'Studebaker', 'founder', 'Chief', 'Investment', 'Officer', 'Robo', 'Global.', 'Studebaker', "'reallocation", 'large-cap', 'tech', 'stock', 'market-weight', 'stocks.', 'FAANG', 'stock', 'rough', 'few', 'weeks', 'hard', 'March', 'One', 'FAANG', 'midst', 'Amazon', 'Studebaker.', 'stock', 'market', "'reallocation", 'FAANG', 'stock', 'smart', 'money', 'founder', 'Chief', 'Investment', 'Officer', 'Robo', 'Global', 'Bill', 'Studebaker', 'Business', 'Insider.', 'FAANG', 'stock', 'Facebook', 'Apple', 'Amazon', 'Netflix', 'Google', 'March', 'trend', 'news', 'massive', 'Facebook', 'data', 'scandal', 'tech-heavy', 'Nasdaq', 'downward', 'frenzy.', 'Investors', 'next', 'good', 'news', 'FAANG', 'stock', 'optimists', 'Studebaker', 'thinks.', "'This", 'dead', 'trade', 'next', 'several', 'months', 'lot', 'performance', 'attribution', 'FAANG', 'stock', 'stock', 'market', 'gain', 'next', 'several', 'months', 'big', 'tech', 'market', "'realloca

In [25]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### Unique한 토큰 목록 생성
 그래프 생성을 위해서 Unique한 토큰 목록 생성

In [26]:
vocabulary = list(set(processed_text))
print(vocabulary)

['Global', 'safe', 'backdrop', 'Chief', 'percent', 'technology', "'AI", 'businesses', 'cap', 'data', 'Insider.', 'stocks.', 'Netflix', 'foreseeable', 'area', 'next', 'frenzy.', 'doe', 're', 'rough', 'several', 'robotics', 'Business', "'This", 'areas', 'Studebaker', 'number', 'tech-heavy', 'Investors', 'law', 'demand.', 'optimists', 'Studebaker.', 'founder', 'large-cap', 'room', 'growth', 'Apple', 'lot', 'performance', 'artificial', 'stock', 'retail', 'other', 'trade', 'fund', 'year', 'intelligence', 'Facebook', 'revenue', 'de-risking', 'Investment', 'span', 'months', 'hard', 'massive', 'AI', 'sign', 'Robo', 'FAANGs.', 'boost', 'dead', 'business.', 'global', 'market-weight', 'smart', 'thinks.', 'run', 'ecommerce', 'Officer', 'weeks', 'gain', 'future.', 'Global.', 'mix', 'more', 'past', 'Google', 'pricey', 'point', 'market', 'value', 'future', 'tech', 'downward', 'Bill', "'reallocation", 'trend', 'Nasdaq', 'assets', 'midst', 'FAANG', 'favorable', 'management', 'March', 'many', 'ETF', 'st

### 6) 그래프 생성 (weighted edge 계산)
 * TextRank는 그래프 기반 모델
 * 각 단어(토큰)은 그래프의 노드(vertex)
 * weighted_edge 행렬은 노드간 가중치 정보를 담고 있음
 * weighted_edge[i][j] 는 i번째 단어와 j번째 단어의 가중치를 의미
 * weighted_edge[i][j] 가 0인 경우는 노드간 연결이 없음을 의미
 * 모든 노드는 1로 초기화

In [27]:
import numpy as np
import math
vocab_len = len(vocabulary)

# 토큰별로 그래프 edge를 Matrix 형태로 생성
weighted_edge = np.zeros((vocab_len, vocab_len),dtype=np.float32)

# 각 토큰 노트 별로 점수계산을 위한 배열 생성
score = np.zeros((vocab_len),dtype=np.float32)

# coocurrence를 판단하기 위한 window 사이즈 설정
window_size = 3
covered_coocurrences = []

# ~20행까지 같은 단어는 skip 하는 코드 
for i in range(0, vocab_len):
    score[i] = 1
    for j in range(0, vocab_len):
        if j == i:
            continue
            
        else:
            # 마지막 사이즈에서 window size를 차감한 만큼만 for문이 돈다. 
            for window_start in range(0, (len(processed_text) - window_size)):
                
                window_end = window_start + window_size
                
                window = processed_text[window_start:window_end]
                
                # 탐색하고 있는 두 단어가 윈도(window)에 동시 등장할 경우 edge로 연결한다.
                if (vocabulary[i] in window) and (vocabulary[j] in window):
                    
                    index_of_i = window_start + window.index(vocabulary[i])
                    index_of_j = window_start + window.index(vocabulary[j])
                    
                    if [index_of_i,index_of_j] not in covered_coocurrences:
                        weighted_edge[i][j]+=1/math.fabs(index_of_i-index_of_j) 
                        # math.fabs -> 절대값을 취하는 코드 
                        covered_coocurrences.append([index_of_i,index_of_j])

### 각 노드의 score 계산
각 노드와 연결된 weighted edge의 값을 합산

In [28]:
inout = np.zeros((vocab_len), dtype=np.float32)

for i in range(0, vocab_len):
    for j in range(0, vocab_len):
        inout[i]+=weighted_edge[i][j]

In [46]:
MAX_ITERATIONS = 50
d = 0.85 # 임의의 상수
threshold = 0.0001 # convergence threshold
# threshold가 뭔데?
# 전의 계산된 score의 합과 현재 계산된 score의 합과의 차이 
# 계산을 여러번 할수록 아주 조금씩 줄어들게 된다. 

for iter in range(0, MAX_ITERATIONS):
    prev_score = np.copy(score)
    
    for i in range(0, vocab_len):
        
        summation = 0
        for j in range(0, vocab_len):
            if weighted_edge[i][j] != 0:
                summation += (weighted_edge[i][j]/inout[j])*score[j]
                
        score[i] = (1-d) + d*(summation)
        
    if np.sum(np.fabs(prev_score-score)) <= threshold: # convergence condition
        break

In [31]:
for i in range(0,10): # vocab_len:
    print('Score of '+vocabulary[i]+": "+str(score[i]))

Score of Global: 0.64329684
Score of safe: 0.7101814
Score of backdrop: 0.79021776
Score of Chief: 1.1211275
Score of percent: 0.7779519
Score of technology: 0.682462
Score of 'AI: 0.83814615
Score of businesses: 1.4344289
Score of cap: 0.70205617
Score of data: 0.7879862


### 8) 핵심 단어 추출

In [33]:
sorted_index = np.flip(np.argsort(score,0))

keywords_num = 10

print('Keywords:\n')

for i in range(0, keywords_num):
    print(str(vocabulary[sorted_index[i]]) + " : " + str(score[sorted_index[i]]))

Keywords:

FAANG : 5.3772945
stock : 5.2782774
Studebaker : 3.369168
market : 2.717907
Amazon : 2.3066387
growth : 1.9410361
March : 1.8605113
next : 1.8059615
big : 1.7737058
months : 1.7517571


In [34]:
sorted_index
np.argsort(score)

array([ 35, 106,  73,  64,  78,  32,  17,  31,  71,  65,  10,   0, 108,
        90,  22, 101, 112,  75,  37,  29,  12, 102, 114,  26,   5,  57,
        67,  52,  38,  82,  19, 105,  39, 122,  30,  43,  55,  50,   8,
        60, 124,  46,  66,   1,  81, 118, 121,  14,  95,  79,  68,  13,
        61,  23, 107,  54,  72, 113,  70,  59,  92,  42,  76,  97,  28,
        24,  98,   4,  63, 123,   9, 100,   2, 116,  93,  99,  96,  45,
        16,  89, 117, 120,   6,  84,  27,  74, 119,  88,  49,  62, 115,
        83,  51,   3,  69,  34,  33,  58,  20, 110,  85,  77, 109,  11,
        87,  48,  18, 103,  40,  44,  47,  21,   7,  56,  86,  53, 111,
        15,  94,  36, 104,  80,  25,  41,  91], dtype=int64)

In [35]:
np.flip?

## 2.2 TextRank 핵심 구 추출
#### 1) 불용어를 기준으로 구 추출

In [36]:
phrases = []

phrase = " "
for word in lemmatized_text:
    if word in stopwords:
        if phrase != " ":
            phrases.append(str(phrase).strip().split())
            
        phrase = " "
    elif word not in stopwords:
        phrase += str(word)
        phrase += " "
        
print(phrases)
        
        

[['FAANG', 'stock'], ['more', 'growth'], ['near', 'future'], ['Bill', 'Studebaker'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global.', 'Studebaker'], ["'reallocation"], ['large-cap', 'tech', 'stock'], ['market-weight', 'stocks.'], ['FAANG', 'stock'], ['rough', 'few', 'weeks'], ['hard'], ['March'], ['One', 'FAANG'], ['midst'], ['Amazon'], ['Studebaker.'], ['stock', 'market'], ["'reallocation"], ['FAANG', 'stock'], ['smart', 'money'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global', 'Bill', 'Studebaker'], ['Business', 'Insider.'], ['FAANG', 'stock'], ['Facebook'], ['Apple'], ['Amazon'], ['Netflix'], ['Google'], ['March'], ['trend'], ['news'], ['massive', 'Facebook', 'data', 'scandal'], ['tech-heavy', 'Nasdaq'], ['downward', 'frenzy.', 'Investors'], ['next'], ['good', 'news'], ['FAANG', 'stock', 'optimists'], ['Studebaker', 'thinks.', "'This"], ['dead', 'trade'], ['next', 'several', 'months'], ['lot'], ['performance', 'attribution'], ['FAANG', 'stock'

In [38]:
unique_phrases = []
for phrase in phrases:
    if phrase not in unique_phrases:
        unique_phrases.append(phrase)
        
print(unique_phrases)

[['FAANG', 'stock'], ['more', 'growth'], ['near', 'future'], ['Bill', 'Studebaker'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global.', 'Studebaker'], ["'reallocation"], ['large-cap', 'tech', 'stock'], ['market-weight', 'stocks.'], ['rough', 'few', 'weeks'], ['hard'], ['March'], ['One', 'FAANG'], ['midst'], ['Amazon'], ['Studebaker.'], ['stock', 'market'], ['smart', 'money'], ['Robo', 'Global', 'Bill', 'Studebaker'], ['Business', 'Insider.'], ['Facebook'], ['Apple'], ['Netflix'], ['Google'], ['trend'], ['news'], ['massive', 'Facebook', 'data', 'scandal'], ['tech-heavy', 'Nasdaq'], ['downward', 'frenzy.', 'Investors'], ['next'], ['good', 'news'], ['FAANG', 'stock', 'optimists'], ['Studebaker', 'thinks.', "'This"], ['dead', 'trade'], ['next', 'several', 'months'], ['lot'], ['performance', 'attribution'], ['gain'], ['big', 'tech'], ['market'], ['large-cap', 'technology'], ['other', 'part'], ['foreseeable', 'future.'], ['reallocation', 'trade'], ['de-risking'], ['months']

In [42]:
for word in vocabulary:
    # print(word)
    
    for phrase in unique_phrases:
        if (word in phrase) and ([word] in unique_phrases) and (len(phrase) > 1):
            unique_phrases.remove([word])
            
print(unique_phrases)
        

[['FAANG', 'stock'], ['more', 'growth'], ['near', 'future'], ['Bill', 'Studebaker'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global.', 'Studebaker'], ["'reallocation"], ['large-cap', 'tech', 'stock'], ['market-weight', 'stocks.'], ['rough', 'few', 'weeks'], ['hard'], ['March'], ['One', 'FAANG'], ['midst'], ['Amazon'], ['Studebaker.'], ['stock', 'market'], ['smart', 'money'], ['Robo', 'Global', 'Bill', 'Studebaker'], ['Business', 'Insider.'], ['Apple'], ['Netflix'], ['Google'], ['trend'], ['massive', 'Facebook', 'data', 'scandal'], ['tech-heavy', 'Nasdaq'], ['downward', 'frenzy.', 'Investors'], ['good', 'news'], ['FAANG', 'stock', 'optimists'], ['Studebaker', 'thinks.', "'This"], ['dead', 'trade'], ['next', 'several', 'months'], ['lot'], ['performance', 'attribution'], ['gain'], ['big', 'tech'], ['large-cap', 'technology'], ['other', 'part'], ['foreseeable', 'future.'], ['reallocation', 'trade'], ['de-risking'], ['months.'], ['pricey', 'stock'], ['investors'], ['law']

#### 2) 각 구의 Score 계산
 앞서 산출한 각 단어별 점수를 합산

In [45]:
phrase_scores = []
keywords = []
for phrase in unique_phrases:
    phrase_score = 0
    keyword = ''
    for word in phrase:
        keyword += str(word)
        keyword += " "
        phrase_score += score[vocabulary.index(word)]
        
    phrase_scores.append(phrase_score)
    keywords.append(keyword.strip())
    
i = 0
for keyword in keywords:
    print("Keyword: '"+str(keyword)+"', Score: "+str(phrase_scores[i]))
    i += 1

Keyword: 'FAANG stock', Score: 10.655571937561035
Keyword: 'more growth', Score: 2.6003788113594055
Keyword: 'near future', Score: 1.4000205397605896
Keyword: 'Bill Studebaker', Score: 4.571102142333984
Keyword: 'founder', Score: 1.14383864402771
Keyword: 'Chief Investment Officer', Score: 3.3657681941986084
Keyword: 'Robo Global. Studebaker', Score: 5.150090396404266
Keyword: ''reallocation', Score: 1.5940415859222412
Keyword: 'large-cap tech stock', Score: 7.522923707962036
Keyword: 'market-weight stocks.', Score: 1.8791368007659912
Keyword: 'rough few weeks', Score: 2.1873615384101868
Keyword: 'hard', Score: 0.740159273147583
Keyword: 'March', Score: 1.860511302947998
Keyword: 'One FAANG', Score: 6.052494943141937
Keyword: 'midst', Score: 0.651337742805481
Keyword: 'Amazon', Score: 2.306638717651367
Keyword: 'Studebaker.', Score: 0.6348740458488464
Keyword: 'stock market', Score: 7.996184349060059
Keyword: 'smart money', Score: 1.2878848910331726
Keyword: 'Robo Global Bill Studebake

### 3) 각 구를 Score로 정렬하게 핵심 구 추출

In [47]:
sorted_index = np.flip(np.argsort(phrase_scores),0)

keywords_num = 10

print("Keywords:\n")

for i in range(0, keywords_num):
    print(str(keywords[sorted_index[i]])+", ")

Keywords:

FAANG stock optimists, 
FAANG stock, 
stock market, 
large-cap tech stock, 
Robo Global Bill Studebaker, 
One FAANG, 
pricey stock, 
Robo Global. Studebaker, 
Studebaker thinks. 'This, 
next several months, 


## 2.3 gensim Textrank

In [49]:
pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/3a/bc/1415be59292a23ff123298b4b46ec4be80b3bfe72c8d188b58ab2653dee4/gensim-3.8.0.tar.gz (23.4MB)
Collecting smart_open>=1.7.0 (from gensim)
  Downloading https://files.pythonhosted.org/packages/37/c0/25d19badc495428dec6a4bf7782de617ee0246a9211af75b302a2681dea7/smart_open-1.8.4.tar.gz (63kB)
Collecting boto3 (from smart_open>=1.7.0->gensim)
  Downloading https://files.pythonhosted.org/packages/fb/e5/7ff9a5f23b3e94aefccc18f9d2493ddbeb31725986ab5ae4552387381e01/boto3-1.9.198-py2.py3-none-any.whl (128kB)
Collecting botocore<1.13.0,>=1.12.198 (from boto3->smart_open>=1.7.0->gensim)
  Downloading https://files.pythonhosted.org/packages/8b/2d/74e83c1cc3e8ab5c9675b2990af2f4c2a7b66379627deef4b08780fe6b23/botocore-1.12.198-py2.py3-none-any.whl (5.6MB)
Collecting s3transfer<0.3.0,>=0.2.0 (from boto3->smart_open>=1.7.0->gensim)
  Downloading https://files.pythonhosted.org/packages/16/8a/1fc3dba0c4923c2a76e1ff0d52b305c44606da63f

In [50]:
from gensim.summarization import keywords
# 원문만 넣으면 바로 중요한 keywords를 추출한다. 

keywords(Text).split('\n')

['stocks',
 'stock',
 'studebaker',
 'trade',
 'trades',
 'amazon',
 'tech',
 'attribution',
 'attributable',
 'cap',
 'facebook',
 'market',
 'future',
 'growth',
 'thinks',
 'think',
 'frenzy',
 'investment',
 'big',
 'global',
 'favorable macro']

In [51]:
keywords?