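**About the libraries**
- unstructured: a preprocessing tool for natural-language data in ML services. It converts HTML, PDF, Word, and similar documents into a form that ML services can consume.
- Pinecone: a managed vector database, used below to store the document embeddings and run similarity search over them.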

In [27]:
import os

from dotenv import load_dotenv, find_dotenv

import openai
import pinecone
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [28]:
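load_dotenv(find_dotenv())  # loads .env into os.environ, so no re-assignment is needed
# Fail early with a clear message instead of a TypeError when a key is missing.
for key in ("OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file")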

## Load Documents

In [29]:
def load_docs(directory):
    loader = PyPDFDirectoryLoader(directory)
    documents = loader.load()
    return documents

In [30]:
directory = "Docs/"
documents = load_docs(directory)
len(documents)

3

In [31]:
documents

[Document(page_content="However, India also faces various socio-economic challenges. Poverty, income inequality, and \nunemployment are persistent issues that the country strives to address. Efforts are being made\nto improve education, healthcare, infrastructure, and social welfare programs to uplift \nmarginalized sections of society.\nEducation plays a vital role in India, with a strong emphasis on academic excellence. The \ncountry has a vast network of schools, colleges, and universities, producing a large number of \ngraduates every year. Indian professionals have made significant contributions in various fields \nglobally, particularly in science, technology, engineering, and mathematics (STEM).\nThe Indian film industry, popularly known as Bollywood, is a global phenomenon, producing the\nlargest number of films annually. Indian cinema reflects the diversity and cultural richness of \nthe country and has a massive following both within India and among the Indian diaspora \nworl

In [32]:
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    # Sizes are in characters: the splitter measures length with len() by default.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

In [33]:
docs = split_docs(documents)
print(len(docs))

7
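
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line boundaries, whereas the hypothetical `sliding_chunks` helper below cuts purely by character count.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters.

    Each window repeats the last chunk_overlap characters of the previous one.
    Hypothetical helper for illustration only, not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing a few duplicated characters.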


## Text Embeddings

In [34]:
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [35]:
query_result = embeddings.embed_query("Hello Budy")
len(query_result)

384

In [36]:
query_result

[-0.06722398847341537,
 0.037480905652046204,
 0.029822470620274544,
 0.02237970195710659,
 -0.031980857253074646,
 -0.053381361067295074,
 0.06989926099777222,
 -0.011718658730387688,
 0.009420957416296005,
 -0.012913979589939117,
 -0.0012694346951320767,
 0.06621184200048447,
 0.004129477776587009,
 -0.016221601516008377,
 -0.010018410161137581,
 0.0009275245247408748,
 0.021145079284906387,
 -0.029878467321395874,
 -0.1343306303024292,
 -0.007621017284691334,
 -0.03920266777276993,
 0.08743668347597122,
 -0.07066568732261658,
 -0.014600532129406929,
 -0.03446967527270317,
 -0.08962449431419373,
 0.023275161162018776,
 0.05753074213862419,
 0.02163335680961609,
 0.011480447836220264,
 0.023150013759732246,
 0.03507670760154724,
 0.02814657986164093,
 0.014081288129091263,
 0.06822077184915543,
 -0.034113142639398575,
 -0.06482100486755371,
 -0.06238289922475815,
 0.07804656773805618,
 0.01914365030825138,
 0.06848161667585373,
 -0.034505169838666916,
 -0.07234135270118713,
 0.0438658

In [37]:
pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),
    environment=os.getenv("PINECONE_ENVIRONMENT")
)

index_name = "mcq-quiz-creator"

In [38]:
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)

## Retrieve Answer

In [39]:
def get_similar_docs(query, k=2):
    similar_docs = index.similarity_search(query, k=k)
    return similar_docs
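
Conceptually, a similarity search embeds the query and returns the `k` stored vectors closest to it. The sketch below shows that idea in plain Python. It assumes cosine similarity as the metric (the actual metric is whatever the Pinecone index was created with), and the three-dimensional vectors and the `top_k` helper are made up for illustration. Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k vectors most similar to query_vec (hypothetical helper)."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy data, invented for this example.
toy_vectors = {
    "economy": [0.9, 0.1, 0.0],
    "cinema": [0.0, 0.2, 0.9],
    "education": [0.6, 0.7, 0.1],
}
nearest = top_k([1.0, 0.2, 0.0], toy_vectors, k=2)
print(nearest)  # ['economy', 'education']
```

A vector database does the same ranking with approximate indexes so that it scales beyond a brute-force scan.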

In [40]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [41]:
llm = HuggingFaceHub(repo_id="bigscience/bloom", model_kwargs={"temperature": 1e-10})
llm

HuggingFaceHub(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, metadata=None, client=InferenceAPI(api_url='https://api-inference.huggingface.co/pipeline/text-generation/bigscience/bloom', task='text-generation', options={'wait_for_model': True, 'use_gpu': False}), repo_id='bigscience/bloom', task=None, model_kwargs={'temperature': 1e-10}, huggingfacehub_api_token=None)

In [42]:
chain = load_qa_chain(llm=llm, chain_type="stuff")
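
With `chain_type="stuff"`, all retrieved documents are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea only. The prompt wording and the `build_stuff_prompt` helper are assumptions made for this example, not LangChain's actual template.

```python
def build_stuff_prompt(doc_texts, question):
    """Join every document into one context block, then append the question.

    Hypothetical helper; the wording is illustrative, not LangChain's template.
    """
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


stuffed = build_stuff_prompt(
    ["India has a mixed economy.", "Bollywood produces many films."],
    "How is India's economy?",
)
print(stuffed)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason to keep `k` and `chunk_size` small.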

In [45]:
def get_answer(query):
    relevant_docs = get_similar_docs(query)
    response = chain.run(input_documents=relevant_docs, question=query)
    return response

In [46]:
our_query = "How is India's economy?"
answer = get_answer(our_query)
print(answer)


India's economy is a mixed economy. It is a developing country with a large population. It


## Structure the Output

In [48]:
import json
import re

In [49]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

In [50]:
response_schema = [
    ResponseSchema(name="question", description="Question generated from provided input text data."),
    ResponseSchema(name="choices", description="Available options for a multiple-choice question in comma separated"),
    ResponseSchema(name="answer", description="Correct answer for the asked question.")
]

output_parser = StructuredOutputParser.from_response_schemas(response_schema)
output_parser

StructuredOutputParser(response_schemas=[ResponseSchema(name='question', description='Question generated from provided input text data.', type='string'), ResponseSchema(name='choices', description='Available options for a multiple-choice question in comma separated', type='string'), ResponseSchema(name='answer', description='Correct answer for the asked question.', type='string')])

In [51]:
format_instructions = output_parser.get_format_instructions()

print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"question": string  // Question generated from provided input text data.
	"choices": string  // Available options for a multiple-choice question in comma separated
	"answer": string  // Correct answer for the asked question.
}
```


In [52]:
chat_model = ChatOpenAI()

In [53]:
chat_model

ChatOpenAI(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, metadata=None, client=<class 'openai.api_resources.chat_completion.ChatCompletion'>, model_name='gpt-3.5-turbo', temperature=0.7, model_kwargs={}, openai_api_key='sk-...(redacted)', openai_api_base='', openai_organization='', openai_proxy='', request_timeout=None, max_retries=6, streaming=False, n=1, max_tokens=None, tiktoken_model_name=None)

In [54]:
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("""When a text input is given by the user, please generate multiple choice questions 
        from it along with the correct answer. 
        \n{format_instructions}\n{user_prompt}""")  
    ],
    input_variables=["user_prompt"],
    partial_variables={"format_instructions": format_instructions}
)
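
The `partial_variables` argument fixes some placeholders in advance (here the format instructions), so only `user_prompt` has to be supplied later. A rough plain-Python analogy, using a hypothetical `partial_format` helper and a shortened template invented for this example:

```python
def partial_format(template, **fixed):
    """Return a function that fills the remaining placeholders of template."""
    def fill(**remaining):
        return template.format(**fixed, **remaining)
    return fill


# Shortened stand-in for the real prompt; the wording is illustrative only.
template = "Write one quiz question.\n{format_instructions}\n{user_prompt}"
fill_prompt = partial_format(template, format_instructions="Reply in JSON.")
filled = fill_prompt(user_prompt="India's economy is a mixed economy.")
print(filled)
```

This is only an analogy for how the two kinds of variables relate. `ChatPromptTemplate` additionally wraps the result in message objects.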

In [55]:
final_query = prompt.format_prompt(user_prompt=answer)
print(final_query)

messages=[HumanMessage(content='When a text input is given by the user, please generate multiple choice questions \n        from it along with the correct answer. \n        \nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"question": string  // Question generated from provided input text data.\n\t"choices": string  // Available options for a multiple-choice question in comma separated\n\t"answer": string  // Correct answer for the asked question.\n}\n```\n\nIndia\'s economy is a mixed economy. It is a developing country with a large population. It', additional_kwargs={}, example=False)]


In [56]:
final_query.to_messages()

[HumanMessage(content='When a text input is given by the user, please generate multiple choice questions \n        from it along with the correct answer. \n        \nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"question": string  // Question generated from provided input text data.\n\t"choices": string  // Available options for a multiple-choice question in comma separated\n\t"answer": string  // Correct answer for the asked question.\n}\n```\n\nIndia\'s economy is a mixed economy. It is a developing country with a large population. It', additional_kwargs={}, example=False)]

In [57]:
final_query_output = chat_model(final_query.to_messages())
print(final_query_output.content)

```json
{
	"question": "What type of economy does India have?",
	"choices": "A. Mixed economy, B. Capitalist economy, C. Socialist economy, D. Communist economy",
	"answer": "A. Mixed economy"
}
```


In [59]:
md_text = final_query_output.content
json_string = re.search(r'{(.*?)}', md_text, re.DOTALL).group(1)

In [61]:
print(json_string)


	"question": "What type of economy does India have?",
	"choices": "A. Mixed economy, B. Capitalist economy, C. Socialist economy, D. Communist economy",
	"answer": "A. Mixed economy"
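
Note that `group(1)` returns the text between the braces, so `json_string` on its own is not valid JSON. One possible way to get a dictionary is to match the whole object, braces included, and pass it to `json.loads`. The sketch below does that on a copy of the reply shown above. The `parse_mcq` helper is hypothetical, and splitting `choices` on commas assumes that no individual choice contains a comma. LangChain's own `output_parser.parse(md_text)` is intended for the same job.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence


def parse_mcq(md_text):
    """Return the first {...} object in md_text as a dict (hypothetical helper)."""
    match = re.search(r"\{.*\}", md_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    mcq = json.loads(match.group(0))  # group(0) keeps the surrounding braces
    # Assumption: choices are comma separated and contain no commas themselves.
    mcq["choices"] = [choice.strip() for choice in mcq["choices"].split(",")]
    return mcq


# Copy of the chat model reply shown earlier.
sample_reply = (
    FENCE + "json\n"
    '{\n'
    '\t"question": "What type of economy does India have?",\n'
    '\t"choices": "A. Mixed economy, B. Capitalist economy, C. Socialist economy, D. Communist economy",\n'
    '\t"answer": "A. Mixed economy"\n'
    '}\n' + FENCE
)
mcq = parse_mcq(sample_reply)
print(mcq["choices"])
```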

