## Install Libraries

In [13]:
!pip install langchain --quiet
!pip install openai --quiet
!pip install pdf2image --quiet
!pip install tiktoken --quiet
!pip install unstructured --quiet

## Load the PDF from an online location

In [16]:
from langchain.document_loaders import OnlinePDFLoader

# If using on your S2 workspace remember to add a firewall exception for https://wolfpaulus.com
loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")

data = loader.load()

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


## Check to see how many documents and chars are in the book

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

print (f"You have {len(data)} document(s) in your data")
print (f"There are {len(data[0].page_content)} characters in your document")

You have 1 document(s) in your data
There are 72790 characters in your document


## Now let's split the data into chunks before we turn them into embeddings

In [18]:
# Split the data
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 2000, chunk_overlap = 0)
texts = text_splitter.split_documents(data)

# Check how many chunks we have now
print(f"You have {len(texts)} chunks")

You have 40 chunks
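
To make the two splitter parameters concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification and not LangChain's implementation: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks are often shorter than `chunk_size`. That probably explains why the split above yields 40 pieces rather than the 37 that a strict 2,000-character cut of 72,790 characters would give. The function name `split_fixed` is hypothetical.

```python
import math


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbours share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']

# Lower bound on the number of chunks for the book with no overlap
print(math.ceil(72790 / 2000))  # 37
```

With `chunk_overlap = 0`, as in this notebook, no text is repeated between chunks. A small overlap can help when an answer straddles a chunk boundary, at the cost of storing and embedding some text twice.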


## Let's convert the chunks into embeddings and then store them in a SingleStore table

### Let's create the table first

In [19]:
%%sql

USE winter_wikipedia;
DROP TABLE IF EXISTS my_book;
CREATE TABLE IF NOT EXISTS my_book (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);



In [21]:
import os

import openai

# Check to see if there is an environment variable with your API key,
# if not, use what you put below

OPENAI_API_KEY = os.environ.get(
    'OPENAI_API_KEY',
    'YOUR-KEY-HERE'
)
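
If the environment variable is missing, the fallback above silently passes the placeholder string to OpenAI, and the failure only shows up later as an authentication error. One possible way to fail earlier is sketched below. The helper name `resolve_api_key` is hypothetical, and `sk-example` is a made-up value used only to exercise the function.

```python
import os
from typing import Mapping

PLACEHOLDER = "YOUR-KEY-HERE"


def resolve_api_key(env: Mapping[str, str] = os.environ, fallback: str = PLACEHOLDER) -> str:
    """Return the key from the environment, or the fallback if it was edited."""
    key = env.get("OPENAI_API_KEY", fallback)
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "Set the OPENAI_API_KEY environment variable or replace the placeholder."
        )
    return key


# Exercise the helper with an explicit mapping instead of the real environment
print(resolve_api_key({"OPENAI_API_KEY": "sk-example"}))

try:
    resolve_api_key({})
except RuntimeError as err:
    print(f"Caught: {err}")
```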

In [22]:
print(texts[0].__dict__)

{'lc_kwargs': {'page_content': 'Everyone you will ever meet knows something you don’t.\n\n[1]\n\nT H E S TO RY\n\n››\n\nof T H E F I E L D\n\nG U I D E\n\nSeveral years ago we created The Field Guide to Data Science because we wanted to help organizations of all types and sizes. There were countless industry and academic publications describing what Data Science is and why we should care, but very little information was available to explain how to make use of data as a resource. We find that situation to be just as true today as we did two years ago, when we created the first edition of the field guide.\n\nAt Booz Allen Hamilton, we built an industry-leading team of Data Scientists. Over the course of hundreds of analytic challenges for countless clients, we’ve unraveled the DNA of Data Science. Many people have put forth their thoughts on single aspects of Data Science. We believe we can offer a broad perspective on the conceptual models, tradecraft, processes and culture of Data Scie

In [23]:
from sqlalchemy import create_engine

# connection_url is assumed to be provided by the SingleStore notebook environment
db_connection = create_engine(connection_url)

In [24]:
from langchain.embeddings import OpenAIEmbeddings

# Initialize the OpenAIEmbeddings
embedder = OpenAIEmbeddings(openai_api_key = OPENAI_API_KEY)

## Now let's add the embeddings to the my_book table. Truncate the table first so that rerunning the cell doesn't collide with rows already there

In [25]:
# Clear the my_book table
db_connection.execute("TRUNCATE TABLE my_book")

# Iterate over the texts
for i, document in enumerate(texts):
    # Extract the text content from the Document
    text_content = document.page_content

    # Convert the text to embeddings
    embedding = embedder.embed_documents([text_content])[0]

    # Insert the text and its embedding into the database
    stmt = """
        INSERT INTO my_book (
            id,
            text,
            embedding
        )
        VALUES (
            %s,
            %s,
            JSON_ARRAY_PACK_F32(%s)
        )
    """

    db_connection.execute(stmt, (int(i), str(text_content), str(embedding)))
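
The loop above calls `embed_documents` once per chunk. Because that method accepts a list of texts, the chunks could also be sent in batches, which reduces the number of round trips. The sketch below shows the batching logic only. It uses a stand-in `fake_embed` function so that it runs offline, and `batched`, `embed_in_batches`, and the batch size of 16 are all hypothetical choices. In the notebook, `embedder.embed_documents` would take the place of `fake_embed`.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,
) -> List[Tuple[int, str, str]]:
    """Return (id, text, embedding-as-string) rows ready for the INSERT."""
    rows = []
    for batch in batched(chunks, batch_size):
        vectors = embed_fn(list(batch))
        for chunk, vector in zip(batch, vectors):
            # len(rows) is the running index, matching enumerate() in the loop above
            rows.append((len(rows), chunk, str(vector)))
    return rows


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for embedder.embed_documents: one tiny vector per text."""
    return [[float(len(t)), 1.0] for t in texts]


rows = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
for row in rows:
    print(row)
```

Each tuple has the same shape as the parameters passed to `db_connection.execute(stmt, ...)` above, so the rows can be inserted one by one with the same statement.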

In [26]:
%%sql

USE winter_wikipedia;
SELECT JSON_ARRAY_UNPACK_F32(embedding)
FROM my_book
WHERE id = 5;

JSON_ARRAY_UNPACK_F32(embedding)
"[-0.0142523674,-0.0120618027,0.0159637462,-0.0168673545,-0.00181577355, ... ,-0.00498011382,-0.0414564535,-0.00160527392]"
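
The `embedding` column holds raw bytes rather than text, which is why we need `JSON_ARRAY_UNPACK_F32` to read it back. As a rough illustration of what a packed float32 vector looks like, the sketch below packs and unpacks a short list with Python's `struct` module. It assumes little-endian 32-bit floats, which is an assumption about the layout `JSON_ARRAY_PACK_F32` produces and is worth checking against the SingleStore documentation. The helper names `pack_f32` and `unpack_f32` are hypothetical.

```python
import struct


def pack_f32(values):
    """Pack a list of floats into little-endian 32-bit floats (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_f32(blob):
    """Inverse of pack_f32: bytes back to a list of Python floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


# These values are exactly representable in float32, so the round trip is lossless
vector = [0.5, -0.25, 1.0]
blob = pack_f32(vector)

print(len(blob))         # 12 bytes: 3 values x 4 bytes each
print(unpack_f32(blob))  # [0.5, -0.25, 1.0]
```

Most values do not survive the round trip exactly, because float32 keeps fewer significant digits than a Python float. For similarity search that loss of precision is normally acceptable, and it halves the storage compared with 64-bit floats.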


## Do a semantic search using DOT_PRODUCT in SingleStore, and JOIN with other data if needed

In [27]:
# Your query text
query_text = "What is the collect stage of data maturity?"

# Convert the query text to an embedding
query_embedding = embedder.embed_query(query_text)

# Perform a similarity search against the embeddings
stmt = """
    SELECT
        text,
        DOT_PRODUCT_F32(JSON_ARRAY_PACK_F32(%s), embedding) AS similarity
    FROM my_book
    ORDER BY similarity DESC
    LIMIT 1
"""

results = db_connection.execute(stmt, (str(query_embedding),))

for row in results:
    print(row[1])
    print(row[0])

0.8848673701286316
All of your initial effort will be focused on identifying and aggregating data. Over time, you will have the data you need and a smaller proportion of your effort can focus on Collect. You can now begin to Describe your data. Note, however, that while the proportion of time spent on Collect goes down dramatically, it never goes away entirely. This is indicative of the four activities outlined earlier – you will continue to Aggregate and Prepare data as new analytic questions arise, additional data is needed and new data sources become available.

Organizations continue to advance in maturity as they move through the stages from Describe to Advise. At each stage they can tackle increasingly complex analytic goals with a wider breadth of analytic capabilities. As described for Collect, each stage never goes away entirely. Instead, the proportion of time spent focused on it goes down and new, more mature activities begin. A brief description of each stage of maturity is
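
The query above scores every chunk by its dot product with the query vector and keeps the highest. The same ranking can be sketched in pure Python with toy 3-dimensional vectors. For vectors of length 1 the dot product equals the cosine similarity, and OpenAI's embeddings are, as far as we know, normalized to length 1, which is why a plain dot product works as the similarity here. The function names and the toy rows are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_k(query, rows, k=1):
    """Equivalent of ORDER BY similarity DESC LIMIT k over (text, embedding) rows."""
    scored = [(dot(query, embedding), text) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


rows = [
    ("chunk about collecting data", normalize([1.0, 0.0, 0.0])),
    ("chunk about describing data", normalize([0.0, 1.0, 0.0])),
    ("chunk about advising", normalize([0.0, 0.0, 1.0])),
]
query = normalize([0.9, 0.1, 0.0])

best_score, best_text = top_k(query, rows, k=1)[0]
print(best_text)
print(round(best_score, 3))
```

Doing the ranking inside SingleStore, as the notebook does, avoids pulling every embedding into Python and lets the similarity score sit alongside ordinary SQL filters and joins.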

In [28]:
import os
import openai
OPENAI_API_KEY = os.environ.get(
    'OPENAI_API_KEY',
    'YOUR-KEY-HERE'
)

# Assign the API key to the openai module
openai.api_key = OPENAI_API_KEY

# Pass the best-matching chunk to ChatGPT as context.
# `row` is the last (and only) row from the search loop in the previous cell.
prompt = f"The user asked: {query_text}. The most similar text from the book is: {row[0]}"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

# Print the assistant's response
print(response['choices'][0]['message']['content'])

The Collect stage of data maturity involves gathering and aggregating internal or external datasets to enhance or refine raw data, and leveraging basic analytic functions such as counts. As organizations advance through the stages of maturity, the proportion of time spent on collecting data goes down, but it never goes away entirely since new data sources and analytic questions will arise over time, requiring continued aggregation and preparation of data.
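
The prompt above passes only the single best chunk. If the search used a larger `LIMIT`, several chunks could be combined into one prompt. The sketch below only builds the `messages` list and makes no API call. The function name `build_messages`, the passage numbering, and the closing instruction are hypothetical choices and not part of the notebook.

```python
def build_messages(question, contexts, system="You are a helpful assistant."):
    """Assemble chat messages from a question and a list of retrieved chunks."""
    numbered = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(contexts, start=1)
    )
    user = (
        f"The user asked: {question}\n\n"
        f"Relevant passages from the book:\n{numbered}\n\n"
        "Answer using only these passages."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "What is the collect stage of data maturity?",
    ["First retrieved chunk.", "  Second retrieved chunk.  "],
)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument used in the cell above, so it could be passed to the same call.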
