### OCI Data Science - Useful Tips
Everything stored in the <span style="background-color: #d5d8dc ">/home/datascience</span> folder is now saved to your block volume. The <span style="background-color: #d5d8dc ">ads-examples</span> folder has moved outside your working space; notebook examples are now available through the "Notebook Examples" button on the Launcher tab.
<details>
<summary><font size="2">1. Check for Public Internet Access</font></summary>

```python
import requests

# A timeout keeps the cell from hanging if the network is unreachable
response = requests.get("https://oracle.com", timeout=10)
assert response.status_code == 200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">2. OCI Configuration and Key Files Set Up</font></summary><p>Follow the instructions in the getting-started notebook, which you can open with the "Getting Started" button on the Launcher tab.</p>
</details>
<details>
<summary><font size="2">3. Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">4. Typical Cell Imports and Settings</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import MLData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">5. Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>
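
Tip 5 indexes `os.environ` directly, which raises `KeyError` when a variable is missing (for example, when the code runs outside a notebook session). A minimal sketch of a more forgiving lookup follows; the helper name `read_session_env` and the `<not set>` placeholder are illustrative assumptions, not part of any OCI API.

```python
import os

# Variable names taken from tip 5 above
SESSION_VARS = [
    "NB_SESSION_COMPARTMENT_OCID",
    "PROJECT_OCID",
    "USER_OCID",
    "TENANCY_OCID",
    "NB_REGION",
]

def read_session_env(names, default="<not set>"):
    """Return a dict of name -> value, using `default` for missing variables."""
    return {name: os.environ.get(name, default) for name in names}

for name, value in read_session_env(SESSION_VARS).items():
    print(f"{name}={value}")
```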

### Word Prediction Demo Using a Pre-trained BERT Model
#### A simple BERT demo presented at the Oracle Big Data Jamsession on 2020/11/30

In [None]:
##############################################################################################
# Prerequisites
# pip install transformers
# pip install torch
# git clone https://github.com/cl-tohoku/bert-japanese.git
# pip install ipywidgets
# jupyter nbextension enable --py widgetsnbextension
# pip install fugashi[unidic-lite]
# pip install ipadic
##############################################################################################

In [1]:
# Import libraries
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

In [2]:
# Define the tokenizer of the pre-trained BERT model, used for word segmentation
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

In [18]:
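# Define the input text and split it into tokens
# (English: "I study English at school. That makes it easier to talk with Americans.")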
text = '学校で英語の勉強をする。それによって、アメリカ人と会話がしやすくなる。'
tokenized_text = tokenizer.tokenize(text)
tokenized_text

['学校',
 'で',
 '英語',
 'の',
 '勉強',
 'を',
 'する',
 '。',
 'それ',
 'によって',
 '、',
 'アメリカ',
 '人',
 'と',
 '会話',
 'が',
 'し',
 'やすく',
 'なる',
 '。']

In [19]:
# Mask the word to predict
# Here the token at index 2, '英語' ("English"), is masked and then predicted
masked_index = 2
tokenized_text[masked_index] = '[MASK]'
tokenized_text

['学校',
 'で',
 '[MASK]',
 'の',
 '勉強',
 'を',
 'する',
 '。',
 'それ',
 'によって',
 '、',
 'アメリカ',
 '人',
 'と',
 '会話',
 'が',
 'し',
 'やすく',
 'なる',
 '。']

In [20]:
# Convert the tokenized sentence to IDs to feed into the pre-trained BERT model
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
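
Note that the IDs above are built from the bare token list. BERT checkpoints are pre-trained on sequences wrapped in `[CLS]` and `[SEP]`, and `tokenizer.encode` adds those markers by default, so one may want to wrap the tokens and shift the mask position before converting to IDs. The sketch below shows that bookkeeping in plain Python; the helper name `wrap_with_special_tokens` is hypothetical, and the shortened token list is only for illustration.

```python
def wrap_with_special_tokens(tokens, masked_index,
                             cls_token="[CLS]", sep_token="[SEP]"):
    """Return (wrapped_tokens, shifted_index) without modifying `tokens`."""
    if not 0 <= masked_index < len(tokens):
        raise IndexError("masked_index is outside the token list")
    wrapped = [cls_token] + list(tokens) + [sep_token]
    # The leading [CLS] pushes every original position right by one
    return wrapped, masked_index + 1

# Shortened example sentence with the mask already applied at index 2
tokens = ['学校', 'で', '[MASK]', 'の', '勉強', 'を', 'する', '。']
wrapped, shifted_index = wrap_with_special_tokens(tokens, 2)
print(wrapped)
print(shifted_index, wrapped[shifted_index])
```

With this variant, the wrapped list would be passed to `tokenizer.convert_tokens_to_ids`, and the shifted index would be used when reading the model output.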

In [7]:
# Load the pre-trained model
model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model.eval()

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

In [21]:
# Predict the masked word
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0][0, masked_index].topk(10)  # keep the 10 highest-scoring predictions

In [22]:
# Check the results
for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

0 英語
1 法律
2 日本語
3 語学
4 中国語
5 フランス語
6 数学
7 多く
8 ビジネス
9 会話
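
The listing above shows only the rank of each candidate. The scores behind it are raw logits over the vocabulary, and `topk` picks the largest ones. As a rough illustration of that step, the plain-Python sketch below applies a softmax to turn scores into probabilities and keeps the `k` best entries. The function name `top_k_with_probs` and the toy scores are made up for this example and are not model output.

```python
import math

def top_k_with_probs(logits, k):
    """Return [(index, probability)] for the k largest logits, best first."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

toy_logits = [0.5, 3.0, -1.0, 2.0]  # made-up scores for a 4-word vocabulary
top2 = top_k_with_probs(toy_logits, 2)
for rank, (index, prob) in enumerate(top2):
    print(rank, index, round(prob, 3))
```

In the notebook itself, the same idea would correspond to applying `torch.softmax` to `outputs[0][0, masked_index]` before calling `topk`.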
