# Lecture 14: Data Science in the Age of LLMs

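In this final session we look back on the course so far and, with worked examples, introduce several topics on the relationship between data science and large language models (LLMs) such as ChatGPT, which we have touched on from time to time.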

- Run the Transformer, the architecture that grew out of deep learning and underlies ChatGPT and similar models. (We also try time-series forecasting.)
- LLM use cases
 - CSV data agent
 - Q&A with sources

## Transformer

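Here we actually run the examples from the book 「機械学習エンジニアのためのTransformers」 (the Japanese edition of *Natural Language Processing with Transformers*).

The examples use the `pipeline()` function to download a Transformer model from Hugging Face, a hub that collects open-source models and datasets, and run it locally (in practice, on the Colab virtual machine).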


https://github.com/nlp-with-transformers/notebooks/blob/main/01_introduction.ipynb

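## History of the Transformer
The Transformer is a deep learning architecture invented at Google. It was followed by models such as BERT and GPT, and this line of development led to GPT-3.5 and GPT-4, the Transformer-based models behind ChatGPT.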

![History of the Transformer](https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter01_timeline.png)

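## The encoder-decoder architecture

In deep learning, the encoder-decoder architecture pairs an encoder block, which uses an RNN (recurrent neural network) to encode an ordered input sequence into a state, with a decoder block, which again uses an RNN to decode that state into an output sequence. This is the architecture on which the Transformer is based; the Transformer keeps the encoder-decoder structure but replaces the recurrence with attention.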
![](https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter01_enc-dec.png)
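As a rough illustration of the encode-then-decode flow described above, the following toy sketch uses a scalar state and fixed, hand-picked weights. The function names `rnn_step`, `encode`, and `decode` and the weights `w_state` and `w_input` are assumptions made for this example; a real model learns weight matrices and operates on token embeddings.

```python
import math


def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """One recurrent update: mix the previous state with the current input."""
    return math.tanh(w_state * state + w_input * x)


def encode(inputs):
    """Fold an ordered input sequence into a single final state."""
    state = 0.0
    for x in inputs:
        state = rnn_step(state, x)
    return state


def decode(state, steps):
    """Unroll the decoder: each output is fed back in as the next input."""
    outputs = []
    y = 0.0  # stand-in for a start-of-sequence input
    for _ in range(steps):
        state = rnn_step(state, y)
        y = state  # a real decoder would apply an output layer here
        outputs.append(y)
    return outputs


context = encode([0.1, -0.4, 0.9])
generated = decode(context, steps=3)
print(context, generated)
```

Because the state is updated one element at a time, feeding the same values in a different order gives a different final state, which is what lets the encoder represent sequence order.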

In [9]:
!pip install transformers  sentencepiece
!pip install diffusers accelerate scipy safetensors




This is the example text (in English). It appears to be a complaint email to ~~Amazon~~ an e-commerce retailer.

In [10]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

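## Text classification example
When you pass a task type to the `pipeline` function, it automatically selects a suitable model and creates an instance of it on the Colab virtual machine. (You can also specify a particular model.)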

In [11]:
from transformers import pipeline
#distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


The following shows which model was actually selected.

In [13]:
classifier.model.name_or_path

'distilbert/distilbert-base-uncased-finetuned-sst-2-english'
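The log above warns against relying on the default model in production. A minimal sketch of pinning the model and revision explicitly, reusing the name and revision reported in that log, might look like this. The constant `PINNED` and the helper `build_classifier` are names invented for this example.

```python
# Pin the task, model and revision reported in the log above so that
# a later change of the library default does not silently change results.
PINNED = {
    "task": "text-classification",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the pipeline with an explicit model and revision (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the library.
    from transformers import pipeline

    return pipeline(**PINNED)


print(PINNED["model"], PINNED["revision"])
```

Calling `build_classifier()` should download the model on first use, in the same way as `pipeline("text-classification")` did above.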

In [8]:
classifier.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

The classifier thus analyzes the sentiment of the given text and returns a label together with a score.

In [4]:
classifier(text)

[{'label': 'NEGATIVE', 'score': 0.9015465974807739}]
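To give a feel for where such a score comes from, the sketch below turns two raw model outputs (logits) into probabilities with a softmax and reports the most likely label. The logit values are hypothetical, chosen only for illustration, and the label order is an assumption about this particular model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]  # assumed label order
logits = [1.2, -1.0]               # hypothetical values, not real model output

probs = softmax(logits)
best = max(range(len(labels)), key=probs.__getitem__)
result = {"label": labels[best], "score": probs[best]}
print(result)
```

The score is therefore a relative confidence between the two labels, not a measure of how strongly negative the text is.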

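Once the model has been created, you can apply it to other texts as many times as you like. The next example is taken from an interview with Professor Geoffrey Hinton, often called the father of deep learning.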

In [6]:
text_hinton="""Geoff Hinton 

Yes, I do. I strongly believe that. I strongly believe that when we eventually understand how the brain works,
 that's going to give us lots of psychological insight too. Just as understanding chemistry at the atomic level, 
 understanding of how molecules bump into each other and what happens, 
 gives us lots of insight into the gas laws. 
"""

In [7]:
classifier(text_hinton)

[{'label': 'POSITIVE', 'score': 0.9994831085205078}]

## Named entity recognition

This task identifies named entities (entities that have names) in the text.

In [14]:
import pandas as pd
#dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf

ner_tagger = pipeline("ner", aggregation_strategy="simple")

outputs = ner_tagger(text)
pd.DataFrame(outputs)  

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.879011 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.556571 | Mega | 208 | 212 |
| 4 | PER | 0.590255 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.775363 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |
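In the output, "Megatron" and "Decepticons" are each split into two fragments with different labels, the second fragment carrying the `##` word-piece prefix. One possible post-processing step, sketched below, glues such fragments back together when their character offsets are adjacent and keeps the label and score of the first fragment. The helper `merge_wordpieces` is hypothetical; the library also offers other `aggregation_strategy` values (for example `"first"`) that may avoid the split without extra code.

```python
def merge_wordpieces(entities):
    """Join '##' continuation fragments onto the preceding entity."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if prev and ent["word"].startswith("##") and ent["start"] == prev["end"]:
            # Extend the previous entity; keep its label and score.
            prev["word"] += ent["word"][2:]
            prev["end"] = ent["end"]
        else:
            merged.append(dict(ent))  # copy so the input is not modified
    return merged


# Rows 3 to 6 of the output above.
fragments = [
    {"entity_group": "MISC", "score": 0.556571, "word": "Mega", "start": 208, "end": 212},
    {"entity_group": "PER", "score": 0.590255, "word": "##tron", "start": 212, "end": 216},
    {"entity_group": "ORG", "score": 0.669692, "word": "Decept", "start": 253, "end": 259},
    {"entity_group": "MISC", "score": 0.498349, "word": "##icons", "start": 259, "end": 264},
]

merged = merge_wordpieces(fragments)
for ent in merged:
    print(ent["entity_group"], ent["word"], ent["start"], ent["end"])
```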


## Question answering

In [16]:

reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs]) 

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |
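The `start` and `end` columns are character offsets into the context, so the answer is a span copied out of the original text rather than newly generated wording. The sketch below shows the relationship on a shortened context; the offsets here are found by plain string search purely for illustration, whereas the model predicts them.

```python
context = (
    "To resolve the issue, I demand an exchange of Megatron "
    "for the Optimus Prime figure I ordered."
)
answer = "an exchange of Megatron"

# Character offsets of the answer span, found here by plain string search.
start = context.find(answer)
end = start + len(answer)

# Slicing the context with the offsets recovers the answer text.
print(start, end, context[start:end])
```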


## Summarization

In [17]:

summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=60, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])
     

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I hope you can understand


## Translation with FuguMT

In [18]:
from transformers import pipeline
fugu_translator = pipeline('translation', model='staka/fugumt-en-ja')
fugu_translator(outputs[0]['summary_text'])

[{'translation_text': 'Ess に見当がつくようなものまで取り込むなんていう、そんなお手っ取り早い仕打っ取りの仕方ができるなんて、そんなお手っ取り早い仕置きが出来たんにしといられちゃうって気にももった、こんなお手頃なお手頃なお手頃なお手持ちのゲームがお手頃になるなんていう気にも思いは、この手っ取り早くお手っ取り早くお手元のお手数をお手元のドイツのお手数をお手元のお手製にかけることなんじゃないっスススススススススススススススススススススススリもお手製のお手製のお手書きの指手が、このお手製のお手製のお手製のお手製パンクタリドに仕立手製のお手製のお手製のお手上げ手上げ手がお手が、お手製のお手数をお手数をお手が、お手がお手に入るようなお手数をお手がお手に入るようなお手がお手数をお手軽に手がお手軽に手がおののののののののののかな手製でご手軽にご手がお手がお手がお手がお手に入るののののののののののののののののののののののに、お手製でお手製でお手元で、お手元に、お手がお手がお手がお手がお手がお手がお手がお手がお手がおのののののののののののののもと、お手がおのののののののに、と、お手がお手が、おのもごのもと、お手が、お手がお手がおののののとお手のとお手がおののののののほど、おのほどおのものののに、このに、おのものほど、おのもございますののものものもと、おのもおのもおのに、に、のに、このにごののののとおのおにごののののののののののがのものののののの'}]

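The FuguMT output above degenerates into repeated fragments. Next, a somewhat more serious attempt with a different model: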

In [19]:
translator = pipeline("translation", model='Hoax0930/marian-finetuned-kde4-en-to-ja_kftt')
translator(outputs[0]['summary_text'])

[{'translation_text': 'バンブルビーはあなたのドイツのオンラインストアからのプロヴィデンスプライムアクション図をオーダーしました。'}]

## Image generation
You can also try Stable Diffusion, the well-known model that generates images from text.

In [20]:
!pip install diffusers transformers accelerate scipy safetensors




Memory usage may exceed the limit of the free Colab tier, in which case the session crashes. The crash resets the memory, so simply run the cell again.

In [21]:
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
import torch

model_id = "stabilityai/stable-diffusion-2"

scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "The professor in panda outfits is teaching in a busy and packed class room. Some student is throwing spears to him"
image = pipe(prompt).images[0]
image

AssertionError: Torch not compiled with CUDA enabled
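The `AssertionError` above is raised when `pipe.to("cuda")` is called on a PyTorch build without CUDA support, as happens on a machine without a usable GPU setup. One way to make the cell more forgiving, sketched here under the assumption that a slow CPU run is acceptable, is to choose the device and precision from a single flag. The helper `pick_device_and_dtype` is hypothetical.

```python
def pick_device_and_dtype(cuda_available):
    """Return (device, dtype name) for the pipeline (hypothetical helper)."""
    # float16 is mainly useful on a GPU; float32 is the safer choice on CPU.
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


print(pick_device_and_dtype(False))

# With PyTorch installed, the result could be used like this:
#   import torch
#   device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
#   pipe = StableDiffusionPipeline.from_pretrained(
#       model_id, scheduler=scheduler, torch_dtype=getattr(torch, dtype_name))
#   pipe = pipe.to(device)
```

On Colab, selecting a GPU runtime is likely the more practical fix, since image generation on a CPU can take a very long time.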

In [25]:
!pip install langchain langchain_community


In [26]:
from langchain_community.llms import HuggingFacePipeline
#from langchain import PromptTemplate,  LLMChain
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate


llm = HuggingFacePipeline.from_model_id(model_id="bigscience/bloom-1b7", task="text-generation", model_kwargs={"temperature":0, "max_length":64})

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is electroencephalography?"

print(llm_chain.run(question))

Once the model has been loaded, subsequent calls work without loading it again.

In [None]:
question = "Why nuclear fusion is so difficult from enginnering perspective?"

print(llm_chain.run(question))

 First, we need to understand the basic principles of nuclear fusion. Nuclear fusion is a process in which two nuclei are brought together by the action of a strong electric field. The two nuclei are brought together by the action
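Before the model sees the question, `PromptTemplate` fills the `{question}` placeholder in the template. Assuming its default f-string-style formatting, the effect can be approximated with plain `str.format`, which makes it easy to inspect the exact text sent to the model. The helper `render_prompt` is hypothetical and not part of LangChain.

```python
template = """Question: {question}

Answer: Let's think step by step."""


def render_prompt(question):
    """Fill the template the way the chain would before calling the model."""
    return template.format(question=question)


prompt_text = render_prompt("What is electroencephalography?")
print(prompt_text)
```

The trailing "Let's think step by step." is part of the prompt itself, so the model continues from that sentence, which is consistent with the answer above opening with "First, ...".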


## If you have an OpenAI API key

In [None]:
!pip install openai


In [None]:
# get a token: https://platform.openai.com/account/api-keys

from getpass import getpass

OPENAI_API_KEY = getpass()

In [None]:
import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
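If the key is already set in the environment, prompting for it again is unnecessary. The following sketch reuses an existing value and only falls back to `getpass` when none is found. The helper `load_api_key` and the placeholder value `sk-demo` are assumptions for this example.

```python
import os
from getpass import getpass


def load_api_key(name, env=None, ask=getpass):
    """Return the key from `env` if set, otherwise prompt for it without echo."""
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        key = ask(f"{name}: ")
        env[name] = key  # remember it for libraries that read the environment
    return key


demo_env = {"OPENAI_API_KEY": "sk-demo"}  # placeholder value, not a real key
key = load_api_key("OPENAI_API_KEY", env=demo_env)
print(key == "sk-demo")
```

In the notebook, `load_api_key("OPENAI_API_KEY")` with the default arguments would read from and write to `os.environ`.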

In [None]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import OpenAI

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = OpenAI()

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is electroencephalography?"

print(llm_chain.run(question))



Electroencephalography (EEG) is a medical test that measures and records the electrical activity of the brain. It is typically used to diagnose neurological conditions, track brain activity during sleep, or monitor the effects of certain drugs. EEGs are noninvasive, meaning they do not involve any surgical procedures or the introduction of any foreign substances into the body. The process involves attaching electrodes to the scalp to measure the electrical activity of the brain and then sending the information to a computer for analysis.


## Without a paid OpenAI API key, you can use the free Hugging Face Hub instead

In [None]:
!pip install huggingface_hub 


In [None]:
# get a token: https://huggingface.co/docs/api-inference/quicktour#get-your-api-token

from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()

In [None]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

In [None]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = HuggingFaceHub(repo_id="google/flan-ul2", model_kwargs={"temperature":0.1,
                                                                "max_length":64})

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is electroencephalography?"

print(llm_chain.run(question))

Electroencephalography is the process of measuring brain activity by recording electrical signals in the brain. The
