# Semantic Search
The following notebook implements semantic search using OpenAI's text-embedding-ada-002 embedding model.

## Install Dependencies
Here we install the packages the notebook needs.

In [1]:
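# openai.embeddings_utils was removed in openai 1.0, so pin an older release
!pip3 install "openai<1" -q
!pip3 install pandas
!pip3 install plotly
# the PyPI package for sklearn is named scikit-learn
!pip3 install scikit-learn
!pip3 install scipy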



## Imports
Import our dependencies and configure our OpenAI API key.

In [2]:
import os

import openai
import pandas as pd
import numpy as np

# Read the API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]


## Create our DataFrame
Here we read our .csv file into a pandas DataFrame called 'df'.

In [3]:
df = pd.read_csv('accData.csv')
print(df)

                  name                                          interests
0         Abbas Momeni  Partial Differential Equations. Nonlinear Anal...
1    AbdelRahman Abdou  Internet Measurements. Computer Systems and Ne...
2          Adrian Chan  Non-invasive sensor systems. Biomedical signal...
3       Ahmed Almaskut                                                  .
4        Ahmed El-Roby  Question answering over knowledge graphs. info...
..                 ...                                                ...
141         Yuhong Guo  Machine Learning. Artificial Intelligence. Nat...
142        Yuly Billig  Representation theory of infinite-dimensional ...
143         Yunran Wei  Include but are not limited to: Quantitative R...
144            Yuu Ono  Sensors development and applications. Biomedic...
145       Yvan Labiche  object-oriented systems. high-dependability sy...

[146 rows x 2 columns]


## Create embeddings

Using the openai.embeddings_utils module, we import the get_embedding function and use the 'text-embedding-ada-002' engine to create a new column in our data frame called 'embedding'. The result is then saved to openai_embeddings.csv.

In [8]:
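from openai.embeddings_utils import get_embedding

# One API call per row: embed each professor's research interests
df['embedding'] = df['interests'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
# Save the result so the embeddings do not have to be recomputed
df.to_csv('openai_embeddings.csv')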
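
One thing to keep in mind when reloading openai_embeddings.csv later: a CSV cell can only hold text, so each embedding list is written out as its string form and comes back as a string, not a list of floats. The sketch below shows one way to convert such a cell back, using only the standard library. The helper name parse_embedding and the shortened sample vector are hypothetical and only for illustration.

```python
import ast


def parse_embedding(cell):
    """Return a list of floats from a cell that is either a list or its string form."""
    if isinstance(cell, str):
        # literal_eval parses "[0.1, -0.2]" into a Python list without running arbitrary code
        cell = ast.literal_eval(cell)
    return [float(x) for x in cell]


original = [0.0028, -0.0091, 0.0029]  # hypothetical, shortened embedding
stored = str(original)                # the text form that ends up in a CSV cell
restored = parse_embedding(stored)

print(type(stored).__name__, "->", type(restored).__name__)
print(restored == original)
```

Assuming the file is read back with pd.read_csv('openai_embeddings.csv'), applying the helper with df['embedding'].apply(parse_embedding) should restore the column to lists before any similarity is computed.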

## Search query
Take a search query as input and compute its embedding. Then create a new column in our data frame that holds a similarity value, computed with openai's cosine_similarity function from the search term vector and the embedding calculated for each research interest in the previous step. Cosine similarity ranges from -1 to 1; the closer the value is to 1, the more similar the two embeddings are. Using this information, we keep only the 20 highest-ranked professors and export them to a .csv file.
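
To make the ranking step concrete, here is a minimal sketch of cosine similarity and a top-k selection in plain Python. It is not the code inside openai.embeddings_utils; the helper names cosine_sim and top_k and the toy 2-D vectors are hypothetical and stand in for real embeddings.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, named_vectors, k=20):
    """Return the k (name, score) pairs most similar to query_vec, best first."""
    scored = [(name, cosine_sim(vec, query_vec)) for name, vec in named_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-D "embeddings" (hypothetical values, not real model output)
toy = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
query = [1.0, 0.0]

for name, score in top_k(query, toy, k=3):
    print(f"{name}: {score:.2f}")
```

Vector length does not affect the score, only direction does, which is why "same direction" scores 1 even though it is twice as long as the query.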

In [50]:
search_term = input("What would you like to search for?")
# Semantic search: embed the query with the same model used for the interests
search_term_vector = get_embedding(search_term, engine='text-embedding-ada-002')

What would you like to search for? professors who work on social media


In [51]:
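from openai.embeddings_utils import cosine_similarity

# Score every professor's embedding against the search term vector
df['similarities'] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))
# Keep the 20 highest-scoring rows and export them
(df.sort_values("similarities", ascending=False).head(20)).to_csv('ranking.csv')
ranking = pd.read_csv('ranking.csv')

# Show long text columns in full instead of truncating them
pd.set_option('display.max_colwidth', 1000)
ranking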

Unnamed: 0.1,Unnamed: 0,name,interests,embedding,similarities
0,93,Michel Barbeau,orcid.org/0000-0003-3531-4926,"[0.002847991418093443, -0.009178299456834793, 0.0029256639536470175, -0.03288135677576065, -0.011243093758821487, 0.029619110748171806, -0.03396877273917198, -0.004019551444798708, -0.014175230637192726, -0.0020842119120061398, 0.012485853396356106, -0.007663686294108629, 0.03570345789194107, 0.010395169258117676, -0.02580021321773529, 0.0023544474970549345, 0.026538101956248283, -0.02481636218726635, 0.007113506086170673, -0.008084412664175034, -0.03156092390418053, -0.006534198764711618, -0.028324570506811142, -0.0028981550130993128, -0.007074669934809208, 0.011430801823735237, 0.006738089025020599, -0.017748165875673294, 0.017100894823670387, -0.02539890632033348, 0.018576672300696373, -0.00599049124866724, -0.01079647708684206, -0.02300400473177433, -0.009838515892624855, -0.004754203837364912, -0.0007767249480821192, -0.01723034866154194, 0.018473109230399132, -0.004990458022803068, 0.0021117208525538445, 0.003912752028554678, -0.004466168582439423, -0.014628320001065731, 0.00...",0.797527
1,23,Christine Laurendeau,After working in the high tech industry for nearly 10 years as a software developer and team leader. Dr. Laurendeau returned to academia in 2005 to pursue a doctorate in Computer Science at Carleton University. She completed her Ph.D. in 2009. and in the same year joined the School of Computer Science as a tenure-track Instructor (teaching professor). She served as Associate Director (Undergraduate) from 2011 to 2014. then as the School’s Co-op Faculty Advisor from 2014 to 2018h She is currently the Associate Director (Recruitment and Outreach) for the School of Computer Scienceh Dr. Laurendeau specializes in teaching first and second year computer programming courses. with a focus on software engineering.,"[0.0017560936976224184, -0.0038342508487403393, -0.008536377921700478, -0.02793477289378643, 0.004058000165969133, 0.02558879368007183, -0.013485985808074474, 0.00944493617862463, -0.02686348743736744, -0.023798799142241478, 0.019513659179210663, 0.005712389480322599, 0.024205615743994713, -0.00722100306302309, 0.006092085503041744, -0.022564787417650223, 0.021629108116030693, -0.02204948477447033, -0.01448268722742796, -0.006726042367517948, -0.02652447298169136, -0.008516036905348301, 0.0003106217773165554, 0.003498626872897148, -0.014794580638408661, -0.003563039470463991, 0.0010661997366696596, 9.847303590504453e-05, 0.014794580638408661, -0.008990657515823841, 0.00916016474366188, -0.0003176139434799552, -0.008590620011091232, -0.010048381984233856, -0.01924244686961174, -0.013980946503579617, -0.0068006254732608795, 0.016611697152256966, -0.0029901054222136736, -0.006020892411470413, 0.018049117177724838, 0.02160198614001274, -0.003556259209290147, 0.00545134861022234, 0.0016...",0.78672
2,114,Rabe Abdalkareem,"My research interests and expertise are in Software Engineering. with a special interest in Crowdsourcing in Software Engineering (e.,g. Stack Overflow and npm). Empirical Software Engineering. Mining Software Repositories. and Software Ecosystems. In particular. I leverage historical project data and apply Data Mining. Machine Learning. and Statistical Analysis techniques in order to better understand what and how software practitioners use the crowd and build techniques to help them effectively take advantage of these crowd resources.","[-0.009694980457425117, -0.01598726585507393, 0.006049235351383686, -0.04274973273277283, -0.0139888571575284, 0.020591706037521362, 0.0013241141568869352, 0.008905068971216679, -0.010289101861417294, -0.029598044231534004, 0.0002694222202990204, 0.016324834898114204, 0.009978538379073143, -0.02550670877099037, -0.014474956318736076, -0.014920547604560852, 0.024169936776161194, -0.005623898468911648, -0.016378846019506454, -0.011922935023903847, -0.023602820932865143, 0.00947218481451273, 0.01346224918961525, -0.014677497558295727, -0.013279962353408337, 0.018026182428002357, 0.012105222791433334, -0.020429672673344612, -0.006109998095780611, -0.02198248915374279, 0.029571039602160454, -0.002577338833361864, -0.003922550939023495, -0.011760901659727097, -0.01704047992825508, -0.023656832054257393, -0.006967423018068075, -0.0028170128352940083, 0.004790103528648615, -0.015703707933425903, -0.0054956222884356976, 0.004833987448364496, -0.01248329970985651, -0.019930070266127586, 0.00...",0.786701
3,80,Junfeng Wen,Artificial Intelligence and Machine Learning,"[-0.014772557653486729, -0.0036964428145438433, 0.02074500545859337, -0.01320677250623703, -0.007373065687716007, 0.0176662877202034, -0.0037228695582598448, 0.018974412232637405, -0.018617650493979454, -0.023493388667702675, 0.019450094550848007, 0.027774523943662643, 0.0023437230847775936, 0.012750910595059395, -0.0005619815201498568, -0.001038901973515749, 0.0186308640986681, 0.007558052893728018, 0.011971321888267994, -0.0030506388284265995, -0.017296314239501953, 0.03160640224814415, 0.010385716333985329, -0.038900189101696014, -0.005427395459264517, -0.010927464812994003, 0.024193696677684784, -0.04146358370780945, -0.010147875174880028, -0.008806717582046986, 0.023823723196983337, -0.014508290216326714, -0.02004469558596611, -0.009751473553478718, -0.006206985097378492, 0.0038120599929243326, 0.021022485569119453, 0.008463169448077679, 0.0016731441719457507, -0.01353050023317337, 0.013900474645197392, 0.028593752533197403, -0.0038583066780120134, -0.024431537836790085, -0.01...",0.777047
4,105,Olga Baysal,Empirical software engineering. code review. recommender systems. mining Stack Overflow and GitHub. machine learning. NLP. qualitative research (e.g.. survey. experiments. interviews). eye tracking. API summarization. etc.,"[-0.00330601679161191, -0.010066484101116657, -0.0032638483680784702, -0.04725579917430878, -0.005441433750092983, 0.010889614932239056, -0.017191287130117416, 0.013500693254172802, -0.015787918120622635, -0.033330049365758896, -0.005087217781692743, 0.006450106389820576, 0.01574743539094925, -0.0020358990877866745, -0.03521919995546341, -0.009519979357719421, 0.027689578011631966, 0.007556609809398651, -0.008541667833924294, -0.0242216344922781, -0.015072737820446491, 0.005552758928388357, 0.01733972132205963, -0.03262836113572121, -0.01885104365646839, 0.019431283697485924, 0.013797559775412083, -0.032547399401664734, -0.021333929151296616, -0.00842022243887186, 0.034976307302713394, -0.016422132030129433, 0.004476616624742746, 0.002821921603754163, -0.020753689110279083, 0.002282163593918085, -0.0026313194539397955, 0.015814905986189842, 0.005252518691122532, -0.0065378169529139996, 0.0043079424649477005, 0.013898764736950397, -0.01423611305654049, -0.019215378910303116, 0.01256...",0.773108
5,79,John Oommen,Dr. Oommen has done research in the general areas of Artificial Intelligence for more than 40 years. In the fields of Stochastic Learning and Learning Automata (LA). he has pioneered the field of Discretized LA. Some of the fastest and most accurate LA are due to the work done by him and his co-authors. Research in these fields earned him his elevation to be a Fellow of the IEEE. In the area of Statistical Pattern Recognition (PR) he has pioneered the theory and applications of the so-called Anti-Bayesian paradigm of PR. Together with his co-authors. he also introduced the science and art of Chaotic PR. His work in using Dependence Trees to achieve PR is also well recorded. In the area of Syntactic PR. his award-winning algorithm has attained the optimal and information theoretic bound. and this earned him his elevation to be a Fellow of the IAPR. Within the field of Neural Networks (NNs). he has worked extensively with the Kohonen’s NN. and has demonstrated how one can merge the s...,"[-0.016920659691095352, 0.004084177315235138, 0.018992293626070023, -0.013889679685235023, 0.001845700666308403, 0.025429653003811836, -0.0049149165861308575, 0.0250125452876091, -0.03523167967796326, -0.022746261209249496, 0.011880612000823021, 0.008196162059903145, 0.004282303620129824, 0.004626417066901922, 0.009607375599443913, 0.002370561007410288, 0.03823485225439072, 0.0320894680917263, 0.004963579121977091, -0.006211425643414259, -0.02916971780359745, 0.0015572012634947896, -0.008717546239495277, -0.03659423068165779, 0.009920206852257252, 0.006253136321902275, 0.015001965686678886, -0.03784555196762085, 0.0047063627280294895, -0.010205229744315147, 0.026569746434688568, -0.0024418167304247618, -0.011241046711802483, -0.004828019067645073, -0.026625361293554306, 0.002417485462501645, 0.020048966631293297, 0.008383860811591148, 0.026277771219611168, -0.0067606172524392605, 0.004101557191461325, 0.017699260264635086, 0.002287139417603612, -0.03998670354485512, 0.0141468960791...",0.771482
6,43,Dr. David Thue,Data Science. Human-Computer Interaction. Game Design. Development. & Interactions. Animation. Virtual Environments. and Simulation. Interactive Storytelling. Machine Learning & Artificial Intelligence,"[-0.014260808005928993, -0.01137885544449091, 0.01633322425186634, -0.016307318583130836, -0.0008888719021342695, 0.012952595949172974, -0.02335353009402752, 0.023029714822769165, -0.012861927971243858, -0.038754165172576904, 0.0019412703113630414, 0.016553416848182678, 0.016786564141511917, 0.006395344156771898, -0.0072987875901162624, -0.01200705673545599, 0.014468049630522728, 0.0157892145216465, 0.023107431828975677, -0.011929340660572052, -0.008529284037649632, 0.016825422644615173, 0.012402110733091831, -0.02014128677546978, -0.015944644808769226, 0.008160135708749294, 0.016695896163582802, -0.03712213784456253, -0.0175766721367836, 0.002200322225689888, 0.020944347605109215, 0.014584623277187347, 0.0002738259790930897, -0.04118925333023071, -0.02831437438726425, -0.002862523775547743, -0.0058966693468391895, 0.008509855717420578, 0.005653807893395424, 0.00019671754853334278, -0.01126228179782629, 0.0036008215975016356, -0.02467469498515129, -0.012972024269402027, 0.015025011...",0.77051
7,49,Dr. Marzieh Amini,Data Science. Video/Image Processing and Compression. Internet of Things (IoT). Ad-hoc and Wireless Sensor Networks. Data Privacy. Machine Learning & Artificial Intelligence. Connected/Autonomous Vehicles,"[-0.007601149845868349, 0.002328544156625867, 0.022236783057451248, -0.016817625612020493, -0.00814176257699728, 0.013912644237279892, -0.02072567120194435, 0.017091188579797745, 0.008011494763195515, -0.047626055777072906, 0.004683142062276602, 0.0153846750035882, -0.0007987068966031075, 0.0078095789067447186, 0.015006897039711475, 0.0010755268158391118, 0.015658238902688026, 0.02310957945883274, 0.013873564079403877, -0.013547893613576889, -0.016921840608119965, 0.02327892743051052, 0.01646590046584606, -0.012167050503194332, 0.004937164951115847, 0.00019672534835990518, 0.024347126483917236, -0.03212413936853409, -0.01343065220862627, -0.021859005093574524, 0.027330270037055016, -0.018198467791080475, -0.024855174124240875, -0.027486590668559074, -0.01740383170545101, -0.012844445183873177, -0.00015459171845577657, -0.007093103602528572, 0.004546360112726688, -0.024190805852413177, -0.0004795498389285058, 0.014524905011057854, -0.007679310627281666, -0.013521839864552021, -0.000...",0.769941
8,57,Elizabeth Stobert,Dr. Stobert’s research interests are in usable security systems. applying research from cognitive science to improve the design of security systems. She is interested in usable authentication. usable security for healthcare. and security education.,"[-0.006986687425523996, 0.0038376152515411377, 0.004644290544092655, -0.03596625104546547, 0.01583649218082428, 0.03496718779206276, -0.0005910838954150677, -0.0012184513034299016, -0.007195950485765934, -0.016025502234697342, 0.008181512355804443, 0.01214401051402092, 0.014688919298350811, 0.003064691787585616, -0.00632177060469985, -0.005133696366101503, 0.037235330790281296, 0.023167449980974197, 0.0018529909430071712, -0.027136698365211487, -0.010226890444755554, 0.01871217042207718, -0.004654416348785162, 0.01535046100616455, -0.0037059818860143423, -0.00210781954228878, 0.022951437160372734, -0.017942622303962708, -0.007931746542453766, 0.0012344835558906198, 0.0265426617115736, 0.010105382651090622, -0.030295895412564278, -0.03831539675593376, 0.00515394750982523, -0.022465405985713005, -0.004357397556304932, -0.009221076965332031, 0.000526954885572195, -0.005039190407842398, -0.011894244700670242, 0.009619352407753468, 0.01806413009762764, -0.034724172204732895, 0.005717607...",0.768167
9,81,Jörg-Rüdiger Sack,Algorithms. Computational Geometry. Data Structures. and GIS,"[0.006956277880817652, 0.015610179863870144, 0.010358156636357307, -0.0217242781072855, -0.006956277880817652, 0.017387378960847855, -0.03299755975604057, -0.006850176490843296, -0.004625361412763596, -0.037772126495838165, 0.018435131758451462, 0.0023988881148397923, 0.016724245622754097, 0.02149881236255169, 0.0071684811264276505, 9.55845825956203e-05, 0.013594252057373524, 0.010762668214738369, 0.01308363862335682, 0.007314370479434729, -0.010019958019256592, 0.016167212277650833, 0.024589017033576965, -0.037904754281044006, -0.004317003767937422, 0.007851509377360344, 0.03161824122071266, -0.02627337910234928, 0.00010371833923272789, -0.008262652903795242, 0.020451059564948082, -0.009522607550024986, -0.01564996875822544, -0.03366069495677948, -0.021923217922449112, 0.004416474141180515, 0.01372687891125679, -0.010099533945322037, -0.01102129090577364, -0.009476188570261002, 0.005069661419838667, 0.0014713291311636567, 0.0025994861498475075, -0.020132755860686302, 0.00369365769...",0.767532


0                                                                                                                                                          Software Engineering: Software Architecture. Service-Oriented Computing. Generative Programming. Self-Managing Systems. Software Reengineering. Networks: Software Defined Networking (SDN). Traffic Engineering. Quality of Service. Content-Based Routing. Wireless Communications. Ad-Hoc Networks. Sensor Networks. and Network-Based Control Systems. Cloud Computing and Distributed Computing: Internet of Things (IoT). Big Data Analytics. Resource Management. Web Services. Real-time Concurrent Software.
1                                                                                                                                                                                                                                                                                                                                                      