## Step 1: Data acquisition and preprocessing

First, get the dataset.

It consists of 2225 documents from the BBC news website:
http://mlg.ucd.ie/datasets/bbc.html

In [None]:
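# Raw string: in a normal string literal, "\U" would be parsed as a Unicode escape and raise a SyntaxError.
dataset_path = r"C:\Users\qingh\Desktop\project\data\dataset.csv"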

In [None]:
import pandas as pd

In [None]:
content=pd.read_csv(dataset_path,encoding="iso-8859-1")

In [4]:
content.head()

Unnamed: 0,news,type
0,China had role in Yukos split-up\n \n China le...,business
1,Oil rebounds from weather effect\n \n Oil pric...,business
2,Indonesia 'declines debt freeze'\n \n Indonesi...,business
3,$1m payoff for former Shell boss\n \n Shell is...,business
4,US bank in $515m SEC settlement\n \n Five Bank...,business


In [5]:
# Convert the 'news' column into a list of article strings.
content_list = content['news'].tolist()

In [6]:
content_list[0:2]

['China had role in Yukos split-up\n \n China lent Russia $6bn (Â£3.2bn) to help the Russian government renationalise the key Yuganskneftegas unit of oil group Yukos, it has been revealed.\n \n The Kremlin said on Tuesday that the $6bn which Russian state bank VEB lent state-owned Rosneft to help buy Yugansk in turn came from Chinese banks. The revelation came as the Russian government said Rosneft had signed a long-term oil supply deal with China. The deal sees Rosneft receive $6bn in credits from China\'s CNPC.\n \n According to Russian newspaper Vedomosti, these credits would be used to pay off the loans Rosneft received to finance the purchase of Yugansk. Reports said CNPC had been offered 20% of Yugansk in return for providing finance but the company opted for a long-term oil supply deal instead. Analysts said one factor that might have influenced the Chinese decision was the possibility of litigation from Yukos, Yugansk\'s former owner, if CNPC had become a shareholder. Rosneft a

#### Data cleaning: remove the extra symbols.

In [7]:
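import re
def token(string):
    # The first newline separates the headline from the body: replace it with a full stop.
    string = re.sub(r'\n', '.', string, count=1)
    # Remove apostrophes and all remaining newlines.
    return re.sub(r'\'|\n+', '', string)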

In [8]:
content_list=[token(n) for n in content_list]

In [9]:
# Each item is already a string, so this join leaves the list unchanged.
content_list = [''.join(n) for n in content_list]

In [10]:
content_list[:3]

['China had role in Yukos split-up.  China lent Russia $6bn (Â£3.2bn) to help the Russian government renationalise the key Yuganskneftegas unit of oil group Yukos, it has been revealed.  The Kremlin said on Tuesday that the $6bn which Russian state bank VEB lent state-owned Rosneft to help buy Yugansk in turn came from Chinese banks. The revelation came as the Russian government said Rosneft had signed a long-term oil supply deal with China. The deal sees Rosneft receive $6bn in credits from Chinas CNPC.  According to Russian newspaper Vedomosti, these credits would be used to pay off the loans Rosneft received to finance the purchase of Yugansk. Reports said CNPC had been offered 20% of Yugansk in return for providing finance but the company opted for a long-term oil supply deal instead. Analysts said one factor that might have influenced the Chinese decision was the possibility of litigation from Yukos, Yugansks former owner, if CNPC had become a shareholder. Rosneft and VEB declined
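To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same two substitutions to a short made-up string. The sample text and the name `clean_article` are illustrative and are not part of the notebook.

```python
import re

def clean_article(text):
    # Hypothetical stand-in that mirrors the two substitutions in token().
    # 1) The first newline separates headline and body: turn it into a full stop.
    text = re.sub(r'\n', '.', text, count=1)
    # 2) Drop apostrophes and any remaining newlines.
    return re.sub(r"'|\n+", '', text)

# Invented sample in the same "headline, blank line, paragraph" layout as the dataset.
sample = "Oil price's rise\n \n Prices rose.\n \n Analysts agreed."
cleaned = clean_article(sample)
print(repr(cleaned))
```

Because apostrophes are deleted rather than replaced, possessives lose their apostrophe, which is why forms such as `Chinas` and `Yugansks` show up in the cleaned output above.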

#### Tokenizing is also required before using the dataset.
Tokenizing means splitting the body of text into sentences and words.

#### One tokenized copy of the data is needed.
It is used to find all the words that mean ‘say’.

In [11]:
from nltk.tokenize import word_tokenize

In [12]:
def word_token(string):
    # \w already covers digits; inside [...] a '|' would match a literal pipe character.
    return re.findall(r'\w+', string)

In [13]:
no_symbols_content_list = [word_token(n) for n in content_list]

In [14]:
no_symbols_content_list = [' '.join(n) for n in no_symbols_content_list]

In [15]:
no_symbols_content_list = [word_tokenize(n) for n in no_symbols_content_list]

In [16]:
word_content_list = [' '.join(n) for n in no_symbols_content_list]
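As a rough illustration of what regex-based word extraction does to a sentence, the sketch below runs `re.findall` on an invented sentence. It also shows a common character-class pitfall: inside `[...]`, a `|` is a literal pipe character, not alternation. The sample strings are assumptions made for the example.

```python
import re

# Invented sentence with punctuation, a currency sign and a percent sign.
sentence = "Rosneft's $6bn deal, 20% stake."

# \w+ keeps runs of letters, digits and underscores and drops everything else.
words = re.findall(r'\w+', sentence)
print(words)

# Inside a character class, '|' is matched literally, so it survives extraction.
with_pipe = re.findall(r'[\d|\w]+', 'a|b')
without_pipe = re.findall(r'\w+', 'a|b')
print(with_pipe, without_pipe)
```

Note that the apostrophe splits `Rosneft's` into two tokens here; in the notebook the apostrophes have already been removed by the cleaning step, so this split does not occur there.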

Now we have one processed dataset.

The next step is to get all the words that mean ‘say’.

## Step 2: Find the beginning of the sentence you want to display (get all the words that mean ‘say’)

My first idea was to look through an English dictionary, manually pick out every word with a similar meaning, and type those words into a txt file.

However, that approach is crude, and there are too many words to collect by hand. So I found another way to build the txt file.

The new approach combines word embedding with breadth-first search.

Word2vec is a word embedding model.

Reference: https://arxiv.org/pdf/1411.2738.pdf

Roughly, the model maps each word to a vector. Words that are used in similar contexts end up with vectors that lie close together, so the nearest neighbours of a word's vector are often its synonyms. I use this property to find synonyms.
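The idea of "nearby vectors" can be sketched with cosine similarity, which is the measure gensim's `most_similar` uses to rank neighbours. The three-dimensional vectors below are made up for illustration (real embeddings are learned from the corpus), and `rank_neighbours` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors; the values carry no real meaning.
toy_vectors = {
    'said': [0.9, 0.1, 0.2],
    'says': [0.8, 0.2, 0.2],
    'oil':  [0.1, 0.9, 0.3],
}

def rank_neighbours(word, vectors):
    # Score every other word against `word`, most similar first.
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = rank_neighbours('said', toy_vectors)
print(ranked)
```

With these toy values, `says` points in nearly the same direction as `said` and is ranked first, while `oil` scores much lower.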

In [17]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence



In [18]:
with open('news_content.txt', 'w',encoding="iso-8859-1") as f:
    for n in word_content_list:
        f.write(n + '\n')

Because of a problem with the file encoding, convert the text file to UTF-8.

In [19]:
def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
    # Re-encode a text file line by line from encoding_from to encoding_to.
    with open(filename, 'r', encoding=encoding_from) as fr:
        with open(newFilename, 'w', encoding=encoding_to) as fw:
            for line in fr:
                # Strip only the trailing newline; text mode writes the platform line ending.
                fw.write(line.rstrip('\n') + '\n')
correctSubtitleEncoding("news_content.txt","news_content_utf.txt","iso-8859-1")
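The `Â£` visible in the article text earlier is the typical result of decoding UTF-8 bytes as ISO-8859-1. The sketch below reproduces that effect on a single character and shows the round trip that undoes it. It assumes the original CSV is UTF-8 encoded, which the notebook does not state.

```python
# The pound sign is two bytes in UTF-8 (0xC2 0xA3).
pound = '£'

# Decoding those two bytes as ISO-8859-1 yields two separate characters.
garbled = pound.encode('utf-8').decode('iso-8859-1')
print(garbled)

# Reversing the mistake: re-encode as ISO-8859-1, then decode as UTF-8.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)
```

If that assumption holds, reading the text as ISO-8859-1 and writing it back as UTF-8 only changes the file's encoding; the garbled characters themselves stay in the text unless they are repaired as shown here.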

In [20]:
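# `size` is the gensim 3.x name for the embedding dimension; gensim 4.x renamed it to `vector_size`.
news_word2ve = Word2Vec(LineSentence('news_content_utf.txt'), size=35, workers=8)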



In [21]:
news_word2ve.wv.most_similar('said', topn=50)

[('says', 0.9205301403999329),
 ('added', 0.9003747701644897),
 ('told', 0.7915402054786682),
 ('believes', 0.7894333600997925),
 ('Smernicki', 0.7773923873901367),
 ('warned', 0.7764408588409424),
 ('argued', 0.7676066160202026),
 ('admitted', 0.7652660608291626),
 ('Ingram', 0.7645844221115112),
 ('insisted', 0.7607728242874146),
 ('Babinet', 0.757400393486023),
 ('Raskin', 0.7566797137260437),
 ('explained', 0.7558911442756653),
 ('thinks', 0.7463495135307312),
 ('Houlihan', 0.7432024478912354),
 ('Ebbers', 0.7427487969398499),
 ('Hogan', 0.7411108016967773),
 ('Butlers', 0.738810658454895),
 ('Sullivan', 0.7351438999176025),
 ('Donofrio', 0.733007550239563),
 ('denied', 0.7299922704696655),
 ('correctly', 0.7214597463607788),
 ('Myers', 0.7208898067474365),
 ('Fogg', 0.7095553874969482),
 ('Yoran', 0.7010022401809692),
 ('Blairs', 0.6985139846801758),
 ('Underwood', 0.6974301934242249),
 ('Kilroy', 0.6868307590484619),
 ('joked', 0.6856988072395325),
 ('Nachison', 0.685654759407043

As you can see, some of the words further down the list do not mean 'say'; many of them are proper names.

So I use an algorithm similar to breadth-first search to refine the result.

The principle of the algorithm is as follows.

Suppose I am looking for synonyms of banana.

The search results probably look like this.

![0](https://github.com/ngnl333/Character-Speech-Extraction/blob/master/image/image0.PNG?raw=true)


You can see that one of the words is not a fruit.

To solve this problem, continue by searching for synonyms of the synonyms,

like this.

![1](https://github.com/ngnl333/Character-Speech-Extraction/blob/master/image/image1.JPG?raw=true)

Although there are still some wrong words, most of the words are fruits.

This suggests a method. I first find the synonyms of banana: apple, pear, peach, etc.

Then I look for synonyms of apple (one of the synonyms of banana I just found).

I find the words banana, pear, and peach again.

Most of the results are fruits. A small part is something else.

Then I search for pear, and then for peach.

![2](https://github.com/ngnl333/Character-Speech-Extraction/blob/master/image/image2.PNG?raw=true)

Keep searching in this way.

If a word is a fruit, it will appear many times.

Other kinds of words appear only in certain searches, not consistently.

For example, curry appears only when searching for synonyms of banana. It does not appear when searching for synonyms of apple or pineapple.

This gives a root node and child nodes, which turns the problem into a classic graph search problem in which each word is counted every time it is visited.
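The counting idea can be tried on a toy graph before running it on real embeddings. In the sketch below, the hand-written neighbour lists stand in for `most_similar` results and are invented for illustration; `count_visits` is a hypothetical helper. Unlike the notebook's function, it caps the number of expansions rather than the number of distinct words, because the toy vocabulary is tiny.

```python
from collections import Counter, deque

# Invented neighbour lists standing in for word2vec's nearest neighbours.
toy_neighbours = {
    'banana': ['apple', 'pear', 'curry'],
    'apple':  ['banana', 'pear', 'peach'],
    'pear':   ['apple', 'banana', 'peach'],
    'peach':  ['apple', 'pear', 'banana'],
    'curry':  ['rice', 'banana', 'spice'],
}

def count_visits(seeds, neighbours, max_steps):
    # Breadth-first expansion that counts every visit to a word.
    queue = deque(seeds)
    visits = Counter()
    steps = 0
    while queue and steps < max_steps:
        word = queue.popleft()
        visits[word] += 1
        # Words without an entry (e.g. 'rice') simply have no neighbours here.
        queue.extend(neighbours.get(word, []))
        steps += 1
    return visits

visits = count_visits(['banana'], toy_neighbours, max_steps=40)
for word, count in visits.most_common():
    print(word, count)
```

In this toy run the fruits are reached from several different words and accumulate more visits than `curry`, which is only ever reached from `banana`. Sorting by visit count therefore pushes the off-topic word toward the bottom of the list.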

In [22]:
from collections import defaultdict

In [23]:
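def get_related_words(initial_words, model):
    # Copy the seed list so the caller's list is not modified.
    unseen = list(initial_words)
    # seen[word] counts how many times each word has been expanded.
    seen = defaultdict(int)
    # Stop once this many distinct words have been collected.
    max_size = 500

    while unseen and len(seen) < max_size:
        # Take the oldest word in the queue (breadth-first order).
        node = unseen.pop(0)
        # Queue its 20 nearest neighbours for later expansion.
        new_expanding = [w for w, s in model.wv.most_similar(node, topn=20)]
        unseen += new_expanding
        seen[node] += 1

    return seen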

In [24]:
related_words = get_related_words(['say','told','said','think'], news_word2ve)

In [25]:
word_list=sorted(related_words.items(), key=lambda x: x[1], reverse=True)

In [26]:
words_means_say=[n[0] for n in word_list]

In [27]:
words_means_say[:50]

['ask',
 'understand',
 'accept',
 'feel',
 'tell',
 'realise',
 'afford',
 'So',
 'why',
 'love',
 'stop',
 'know',
 'consider',
 'require',
 'maybe',
 'actually',
 'believe',
 'thank',
 'see',
 'certainly',
 'disappear',
 'ignore',
 'agree',
 'achieve',
 'hope',
 'avoid',
 'Ingram',
 'hear',
 'quite',
 'wonder',
 'remember',
 'thats',
 'absolutely',
 'hassle',
 'find',
 'Sullivan',
 'felt',
 'pursue',
 'everything',
 'definitely',
 'think',
 'Myers',
 'doing',
 'improve',
 'Ebbers',
 'really',
 'listen',
 'learn',
 'apply',
 'wish']

In [31]:
with open('data/words_mean_say.txt', 'w') as f:  # forward slash avoids the invalid '\w' escape
    for words in words_means_say:
        f.write(words+'\n')

So we now have a candidate list of words that mean ‘say’, ranked by how often each word was visited.