<a href="https://colab.research.google.com/github/m37335/kanagawa-exam/blob/master/%E3%80%90%E8%8B%B1%E8%AA%9E%E3%80%91%E7%A5%9E%E5%A5%88%E5%B7%9D%E5%85%A5%E8%A9%A6%E5%95%8F%EF%BC%92%E9%81%A9%E8%AA%9E%E8%A3%9C%E5%85%85%E5%88%86%E6%9E%90.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

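# **[English] Analysis of Question 2 (Fill-in-the-Blank) of the Kanagawa Entrance Exam**
We use BERT, a natural language processing model, to predict the answers.
For tokenization (splitting each sentence into words), we use Stanza, developed at Stanford University.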

# **Installing the libraries**

In [1]:
!pip install transformers
!pip install stanza

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1
Collecting stanza
  Downloading https://files.pythonhosted.

In [2]:
# pytorch
import torch
from transformers import BertTokenizer, BertForMaskedLM
# stanza
import stanza
stanza.download('en') # download English model

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 17.2MB/s]                    
2021-05-03 21:14:23 INFO: Downloading default packages for language: en (English)...
Downloading http://nlp.stanford.edu/software/stanza/1.2.0/en/default.zip: 100%|██████████| 411M/411M [01:13<00:00, 5.59MB/s]
2021-05-03 21:15:43 INFO: Finished downloading models and saved to /root/stanza_resources.


## **Importing the Question 2 fill-in-the-blank items and converting them to a DataFrame**

In [9]:
import pandas as pd
allSentence_df = pd.read_csv('https://raw.githubusercontent.com/m37335/kanagawa-exam/master/data/kanagawaPart2.csv')

In [10]:
allSentence_df

Unnamed: 0,year,s_id,question,sentence_question,sentence
0,2009,1,1,"Tom, a high school student from the U.S.A., (c...","Tom, a high school student from the U.S.A., ca..."
1,2009,2,2,"When he was in his (country), he studied Japan...","When he was in his country, he studied Japanes..."
2,2009,3,3,So he can speak Japanese a (little) now.,So he can speak Japanese a little now.
3,2009,5,4,"For (example), Japanese songs, movies, and books.","For example, Japanese songs, movies, and books."
4,2010,2,1,He lived in America for (nine) years and retur...,He lived in America for nine years and returne...
5,2010,3,2,He speaks English well and is also good at usi...,He speaks English well and is also good at usi...
6,2010,4,3,"Last (Friday), Ryota talked with one of his Am...","Last Friday, Ryota talked with one of his Amer..."
7,2010,5,4,His American friends are going to come to Japa...,His American friends are going to come to Japa...
8,2011,4,1,Mt.Fuji is the (highest) mountain in Japan.,Mt.Fuji is the highest mountain in Japan.
9,2011,5,2,Many people climb this mountain (during) the s...,Many people climb this mountain during the sum...


## **Splitting into tokens with Stanza**
The tokens are appended to a list, which is then converted to a DataFrame.  
Only the 'sentence' column of the data is used here.

In [11]:
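# Build the Stanza pipeline (the language defaults to English)
nlp = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', use_gpu=True)
stanza_token = []

for df_sentence in allSentence_df.sentence:
  doc = nlp(df_sentence)
  for sentence in doc.sentences:
    # Collect the word strings of this sentence
    tmp_token = [word.text for word in sentence.words]
    # Add the special tokens BERT expects at the start and end of the input
    tmp_token.insert(0, "[CLS]")
    tmp_token.append("[SEP]")

    stanza_token.append(tmp_token)

#print(stanza_token)
# Rows shorter than the longest sentence are padded on the right with None
stanza_token_df = pd.DataFrame(stanza_token)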

2021-05-03 21:26:09 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |

2021-05-03 21:26:09 INFO: Use device: gpu
2021-05-03 21:26:09 INFO: Loading: tokenize
2021-05-03 21:26:09 INFO: Loading: pos
2021-05-03 21:26:09 INFO: Loading: lemma
2021-05-03 21:26:09 INFO: Done loading processors!


In [12]:
stanza_token_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
0,[CLS],Tom,",",a,high,school,student,from,the,U.S.A.,",",came,to,Japan,last,month,.,[SEP],,,,,
1,[CLS],When,he,was,in,his,country,",",he,studied,Japanese,in,his,high,school,.,[SEP],,,,,,
2,[CLS],So,he,can,speak,Japanese,a,little,now,.,[SEP],,,,,,,,,,,,
3,[CLS],For,example,",",Japanese,songs,",",movies,",",and,books,.,[SEP],,,,,,,,,,
4,[CLS],He,lived,in,America,for,nine,years,and,returned,to,Japan,when,he,was,fifteen,years,old,.,[SEP],,,
5,[CLS],He,speaks,English,well,and,is,also,good,at,using,a,computer,.,[SEP],,,,,,,,
6,[CLS],Last,Friday,",",Ryota,talked,with,one,of,his,American,friends,over,the,phone,and,heard,good,news,.,[SEP],,
7,[CLS],His,American,friends,are,going,to,come,to,Japan,to,see,Ryota,this,July,.,[SEP],,,,,,
8,[CLS],Mt.,Fuji,is,the,highest,mountain,in,Japan,.,[SEP],,,,,,,,,,,,
9,[CLS],Many,people,climb,this,mountain,during,the,summer,every,year,.,[SEP],,,,,,,,,,
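
The concatenation in the next cell pairs each row of `stanza_token_df` with a row of `allSentence_df` by position, so it relies on Stanza returning exactly one sentence per input text. The following is a minimal plain-Python sketch of one way to keep one token row per input even if a text were split into several sentences. The helper names `build_token_row` and `pad_rows` are hypothetical and are not part of the notebook or of Stanza; `sentences` is an assumed stand-in for the word lists that would come from `doc.sentences`.

```python
def build_token_row(sentences):
    """Merge the word lists of every sentence from ONE input text into one row.

    `sentences` is a list of word lists, an assumed stand-in for
    [[w.text for w in s.words] for s in doc.sentences].
    """
    row = ["[CLS]"]
    for words in sentences:
        row.extend(words)
    row.append("[SEP]")
    return row


def pad_rows(rows, fill=None):
    """Right-pad rows to a common width, similar to the padding seen in the table."""
    width = max(len(row) for row in rows)
    return [row + [fill] * (width - len(row)) for row in rows]


# Hypothetical inputs: the second text was cut into two sentences by the splitter
rows = [
    build_token_row([["So", "he", "can", "speak", "Japanese", "a", "little", "now", "."]]),
    build_token_row([["He", "came", "."], ["He", "stayed", "."]]),
]
padded = pad_rows(rows)
print(len(padded))  # 2: one row per input text
print(padded[1])
```

A simpler safeguard under the same assumption would be to compare `len(stanza_token)` with `len(allSentence_df)` before concatenating.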


In [13]:
# Concatenate allSentence_df and stanza_token_df column-wise into a single DataFrame.
df_concat = pd.concat([allSentence_df, stanza_token_df], axis=1)

In [14]:
df_concat

Unnamed: 0,year,s_id,question,sentence_question,sentence,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
0,2009,1,1,"Tom, a high school student from the U.S.A., (c...","Tom, a high school student from the U.S.A., ca...",[CLS],Tom,",",a,high,school,student,from,the,U.S.A.,",",came,to,Japan,last,month,.,[SEP],,,,,
1,2009,2,2,"When he was in his (country), he studied Japan...","When he was in his country, he studied Japanes...",[CLS],When,he,was,in,his,country,",",he,studied,Japanese,in,his,high,school,.,[SEP],,,,,,
2,2009,3,3,So he can speak Japanese a (little) now.,So he can speak Japanese a little now.,[CLS],So,he,can,speak,Japanese,a,little,now,.,[SEP],,,,,,,,,,,,
3,2009,5,4,"For (example), Japanese songs, movies, and books.","For example, Japanese songs, movies, and books.",[CLS],For,example,",",Japanese,songs,",",movies,",",and,books,.,[SEP],,,,,,,,,,
4,2010,2,1,He lived in America for (nine) years and retur...,He lived in America for nine years and returne...,[CLS],He,lived,in,America,for,nine,years,and,returned,to,Japan,when,he,was,fifteen,years,old,.,[SEP],,,
5,2010,3,2,He speaks English well and is also good at usi...,He speaks English well and is also good at usi...,[CLS],He,speaks,English,well,and,is,also,good,at,using,a,computer,.,[SEP],,,,,,,,
6,2010,4,3,"Last (Friday), Ryota talked with one of his Am...","Last Friday, Ryota talked with one of his Amer...",[CLS],Last,Friday,",",Ryota,talked,with,one,of,his,American,friends,over,the,phone,and,heard,good,news,.,[SEP],,
7,2010,5,4,His American friends are going to come to Japa...,His American friends are going to come to Japa...,[CLS],His,American,friends,are,going,to,come,to,Japan,to,see,Ryota,this,July,.,[SEP],,,,,,
8,2011,4,1,Mt.Fuji is the (highest) mountain in Japan.,Mt.Fuji is the highest mountain in Japan.,[CLS],Mt.,Fuji,is,the,highest,mountain,in,Japan,.,[SEP],,,,,,,,,,,,
9,2011,5,2,Many people climb this mountain (during) the s...,Many people climb this mountain during the sum...,[CLS],Many,people,climb,this,mountain,during,the,summer,every,year,.,[SEP],,,,,,,,,,


### **Marking the blanked words for masking**

In [15]:
# Give the token index (column number in the token table) of the blanked word in each sentence
mask_id = pd.Series([11, 6, 7, 2, 6, 12, 2, 14, 5, 6, 6, 4, 4, 6, 16, 16, 5, 1, 8, 16, 5, 12, 7, 8, 7, 3, 2, 3, 4, 5, 5, 5, 5, 2, 3, 11, 8, 15], name="mask_id")
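
The indices above are entered by hand. As a sketch of how they could instead be derived, the function below reads the parenthesized answer from a `sentence_question` string and looks it up in the matching token row. `find_mask_index` is a hypothetical helper, not part of the notebook, and it assumes the answer is a single word that Stanza keeps as one token.

```python
import re


def find_mask_index(sentence_question, tokens):
    """Return the index in `tokens` of the word written as (word) in the question."""
    match = re.search(r"\(([^()]+)\)", sentence_question)
    if match is None:
        raise ValueError("no parenthesized answer found")
    answer = match.group(1)
    # Count earlier whole-word occurrences of the answer so that a repeated
    # word (for example "to ... (to)") resolves to the right token.
    before = sentence_question[:match.start()]
    earlier = len(re.findall(r"\b" + re.escape(answer) + r"\b", before))
    positions = [i for i, token in enumerate(tokens) if token == answer]
    return positions[earlier]


tokens = ["[CLS]", "So", "he", "can", "speak", "Japanese", "a", "little", "now", ".", "[SEP]"]
print(find_mask_index("So he can speak Japanese a (little) now.", tokens))  # 7
```

For this sentence the result agrees with the third hand-entered value, 7. Cases where the tokenizer splits the answer into several tokens would raise an `IndexError` here and would still need manual handling.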

In [16]:
# Concatenate
df = pd.concat([df_concat, mask_id], axis=1)

In [17]:
df

Unnamed: 0,year,s_id,question,sentence_question,sentence,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,mask_id
0,2009,1,1,"Tom, a high school student from the U.S.A., (c...","Tom, a high school student from the U.S.A., ca...",[CLS],Tom,",",a,high,school,student,from,the,U.S.A.,",",came,to,Japan,last,month,.,[SEP],,,,,,11
1,2009,2,2,"When he was in his (country), he studied Japan...","When he was in his country, he studied Japanes...",[CLS],When,he,was,in,his,country,",",he,studied,Japanese,in,his,high,school,.,[SEP],,,,,,,6
2,2009,3,3,So he can speak Japanese a (little) now.,So he can speak Japanese a little now.,[CLS],So,he,can,speak,Japanese,a,little,now,.,[SEP],,,,,,,,,,,,,7
3,2009,5,4,"For (example), Japanese songs, movies, and books.","For example, Japanese songs, movies, and books.",[CLS],For,example,",",Japanese,songs,",",movies,",",and,books,.,[SEP],,,,,,,,,,,2
4,2010,2,1,He lived in America for (nine) years and retur...,He lived in America for nine years and returne...,[CLS],He,lived,in,America,for,nine,years,and,returned,to,Japan,when,he,was,fifteen,years,old,.,[SEP],,,,6
5,2010,3,2,He speaks English well and is also good at usi...,He speaks English well and is also good at usi...,[CLS],He,speaks,English,well,and,is,also,good,at,using,a,computer,.,[SEP],,,,,,,,,12
6,2010,4,3,"Last (Friday), Ryota talked with one of his Am...","Last Friday, Ryota talked with one of his Amer...",[CLS],Last,Friday,",",Ryota,talked,with,one,of,his,American,friends,over,the,phone,and,heard,good,news,.,[SEP],,,2
7,2010,5,4,His American friends are going to come to Japa...,His American friends are going to come to Japa...,[CLS],His,American,friends,are,going,to,come,to,Japan,to,see,Ryota,this,July,.,[SEP],,,,,,,14
8,2011,4,1,Mt.Fuji is the (highest) mountain in Japan.,Mt.Fuji is the highest mountain in Japan.,[CLS],Mt.,Fuji,is,the,highest,mountain,in,Japan,.,[SEP],,,,,,,,,,,,,5
9,2011,5,2,Many people climb this mountain (during) the s...,Many people climb this mountain during the sum...,[CLS],Many,people,climb,this,mountain,during,the,summer,every,year,.,[SEP],,,,,,,,,,,6


## **Masking the words**

In [18]:
# Set up the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# Set up the masked-language model
msk_model = BertForMaskedLM.from_pretrained("bert-base-cased")
# Use the GPU
msk_model.cuda()
# Switch to evaluation mode (disables dropout)
msk_model.eval()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr
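
The prediction step later passes Stanza's word tokens straight to `tokenizer.convert_tokens_to_ids` rather than running BERT's own WordPiece tokenization. The toy sketch below imitates that kind of dictionary lookup with a made-up five-entry vocabulary to show the consequence: a word that is not a vocabulary entry is replaced by `[UNK]`. The `toy_vocab` dictionary and the `lookup_ids` function are illustrative assumptions, not the library's implementation, and the ids are invented.

```python
def lookup_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its id, falling back to the id of the unknown token."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


# Invented ids for illustration only; the real vocabulary has 28,996 entries
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3, "talked": 4}

tokens = ["[CLS]", "Ryota", "talked", "[MASK]", "[SEP]"]
print(lookup_ids(tokens, toy_vocab))  # [1, 0, 4, 3, 2]
```

Whether a particular word is in the real vocabulary can be checked with an expression such as `"word" in tokenizer.get_vocab()`.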

## **Predicting the masked words with BERT**

In [19]:
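possible_answer = []

mask_id_list = df.mask_id
token_rows = stanza_token_df.values.tolist()

for (tokens, mask_id) in zip(token_rows, mask_id_list):
  # Remove the None padding from the row
  tokens = [x for x in tokens if x is not None]

  # Replace the target word with the [MASK] token
  msk_idx = mask_id
  tokens[msk_idx] = "[MASK]"

  #print(tokens)

  # Convert the tokens to ids (tokens missing from the BERT vocabulary become [UNK])
  word_ids = tokenizer.convert_tokens_to_ids(tokens)
  #print(word_ids)
  word_tensor = torch.tensor([word_ids])
  # print(word_tensor)

  # Predict the masked position; gradients are not needed for inference
  x = word_tensor.cuda()
  with torch.no_grad():
    y = msk_model(x)
  result = y[0]  # logits: (batch, sequence length, vocabulary size)
  # print(result.size())

  # Take the 10 highest-scoring vocabulary ids at the masked position
  _, msk_ids = torch.topk(result[0][msk_idx], k=10)
  result_words = tokenizer.convert_ids_to_tokens(msk_ids.tolist())

  possible_answer.append(result_words)

  # Print the predictions
  print(result_words)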

['came', 'went', 'returned', 'comes', 'moved', 'arrived', 'traveled', 'returns', 'flew', 'goes']
['teens', 'twenties', 'prime', 'youth', 'childhood', 'infancy', '60s', 'hometown', '70s', '80s']
['lot', 'little', 'bit', 'while', 'second', 'secret', 'minute', 'ways', 'moment', 'word']
['example', 'instance', 'children', 'music', 'entertainment', 'television', 'sale', 'comparison', 'reference', 'them']
['two', 'three', 'four', 'five', 'several', 'seven', 'six', 'ten', 'eight', 'nine']
['wheelchair', 'cane', 'knife', 'gun', 'weapon', 'computer', 'dictionary', 'hammer', 'keyboard', 'sword']
['night', 'week', '##ly', 'year', 'Sunday', 'month', 'time', 'day', 'Friday', 'Monday']
['year', 'summer', 'week', 'day', 'time', 'weekend', 'month', 'Christmas', 'evening', 'morning']
['highest', 'tallest', 'largest', 'northernmost', 'lowest', 'southernmost', 'smallest', 'deepest', 'Highest', 'longest']
['in', 'during', 'over', 'for', 'throughout', 'through', 'on', 'around', 'into', 'every']
[',', '...'
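
Each printed list is the result of `torch.topk` applied to the scores at the masked position. As a worked illustration of that ranking step without PyTorch, the sketch below applies a softmax and picks the highest-scoring entries from a four-word toy vocabulary. The scores are invented for the example, and `top_k_tokens` is a hypothetical helper.

```python
import math


def top_k_tokens(logits, vocab, k=3):
    """Return the k (token, probability) pairs with the highest scores."""
    # Softmax, subtracting the maximum first for numerical stability
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank vocabulary positions by score, highest first
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(vocab[i], probs[i]) for i in ranked[:k]]


toy_vocab = ["lot", "little", "bit", "word"]
toy_logits = [3.0, 2.5, 1.0, -1.0]  # invented scores
print([token for token, _ in top_k_tokens(toy_logits, toy_vocab, k=2)])  # ['lot', 'little']
```

Because softmax preserves the order of the scores, ranking the raw logits (as the notebook does) yields the same candidates; the probabilities are only useful for comparing how confident the model is.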

In [20]:
# Convert the lists of predicted words into a DataFrame
result_words_df = pd.DataFrame(possible_answer)
# Concatenate the new DataFrame with the original one
possibleAnswer_df = pd.concat([allSentence_df, result_words_df], axis=1)
# Drop the column that is no longer needed
possibleAnswer_df = possibleAnswer_df.drop(columns='sentence')

In [21]:
possibleAnswer_df

Unnamed: 0,year,s_id,question,sentence_question,0,1,2,3,4,5,6,7,8,9
0,2009,1,1,"Tom, a high school student from the U.S.A., (c...",came,went,returned,comes,moved,arrived,traveled,returns,flew,goes
1,2009,2,2,"When he was in his (country), he studied Japan...",teens,twenties,prime,youth,childhood,infancy,60s,hometown,70s,80s
2,2009,3,3,So he can speak Japanese a (little) now.,lot,little,bit,while,second,secret,minute,ways,moment,word
3,2009,5,4,"For (example), Japanese songs, movies, and books.",example,instance,children,music,entertainment,television,sale,comparison,reference,them
4,2010,2,1,He lived in America for (nine) years and retur...,two,three,four,five,several,seven,six,ten,eight,nine
5,2010,3,2,He speaks English well and is also good at usi...,wheelchair,cane,knife,gun,weapon,computer,dictionary,hammer,keyboard,sword
6,2010,4,3,"Last (Friday), Ryota talked with one of his Am...",night,week,##ly,year,Sunday,month,time,day,Friday,Monday
7,2010,5,4,His American friends are going to come to Japa...,year,summer,week,day,time,weekend,month,Christmas,evening,morning
8,2011,4,1,Mt.Fuji is the (highest) mountain in Japan.,highest,tallest,largest,northernmost,lowest,southernmost,smallest,deepest,Highest,longest
9,2011,5,2,Many people climb this mountain (during) the s...,in,during,over,for,throughout,through,on,around,into,every
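
One way to summarize a table like this is to ask how often the expected word appears among the first k candidates. The sketch below computes that for four of the rows displayed above, with the expected words taken from the parenthesized answers (`came`, `country`, `little`, `nine`). `rank_of` and `top_k_accuracy` are hypothetical helper names, and the figures apply only to this four-row sample, not to the full data set.

```python
def rank_of(answer, candidates):
    """Return the 1-based rank of `answer` in `candidates`, or None if absent."""
    return candidates.index(answer) + 1 if answer in candidates else None


def top_k_accuracy(answers, predictions, k):
    """Fraction of items whose answer is among the first k candidates."""
    hits = sum(
        1 for answer, candidates in zip(answers, predictions)
        if answer in candidates[:k]
    )
    return hits / len(answers)


# Rows 0, 1, 2 and 4 of the table above
answers = ["came", "country", "little", "nine"]
predictions = [
    ["came", "went", "returned", "comes", "moved", "arrived", "traveled", "returns", "flew", "goes"],
    ["teens", "twenties", "prime", "youth", "childhood", "infancy", "60s", "hometown", "70s", "80s"],
    ["lot", "little", "bit", "while", "second", "secret", "minute", "ways", "moment", "word"],
    ["two", "three", "four", "five", "several", "seven", "six", "ten", "eight", "nine"],
]

print(top_k_accuracy(answers, predictions, k=1))   # 0.25
print(top_k_accuracy(answers, predictions, k=10))  # 0.75
print(rank_of("nine", predictions[3]))             # 10
```

The comparison is case-sensitive, which matches the cased model used here: `Highest` and `highest` appear as separate candidates in the table.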
