In [1]:
import pandas
import gensim
import duckdb

# This section concerns training models.

In [2]:
data = pandas.read_csv("postings.csv")

In [7]:
documents = [gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(str(doc)), [i]) for i, doc in enumerate(data["description"])]

### Model save naming scheme

Take the default parameters to be `(documents, vector_size=50, window=5, min_count=5, workers=6, epochs=1)`.
The corresponding default save name is `"doc2vec.model"`.

When a parameter differs from its default, e.g. `epochs=10`, its name and value are appended to the save name: `"doc2vec-epochs_10.model"`.
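
As a rough sketch of this naming scheme, the helper below builds a save name from whichever parameters differ from the defaults. `DEFAULTS` and `model_save_name` are hypothetical names introduced only for illustration, and the sketch assumes parameters appear in the name in the order they are passed.

```python
# Hypothetical helper illustrating the naming scheme; not part of gensim.
DEFAULTS = {
    "vector_size": 50,
    "window": 5,
    "min_count": 5,
    "workers": 6,
    "epochs": 1,
}


def model_save_name(**params):
    """Build a save name from the parameters that differ from DEFAULTS."""
    # Keep only parameters whose value is not the default; parameters that
    # have no entry in DEFAULTS are always kept.
    parts = [
        f"{name}_{value}"
        for name, value in params.items()
        if DEFAULTS.get(name) != value
    ]
    return "-".join(["doc2vec", *parts]) + ".model"


print(model_save_name())           # doc2vec.model
print(model_save_name(epochs=10))  # doc2vec-epochs_10.model
```

Generating the name from the same values passed to the model keeps the file name and the training parameters from drifting apart.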

In [4]:
# Run this if we need to generate new models
vector_size = 150
model = gensim.models.Doc2Vec(documents, vector_size=vector_size, window=8, min_count=5, workers=4, epochs=20, negative=15, sample=1e-5)
# The save name lists the non-default parameters, following the naming scheme above
model.save("doc2vec-vector_size_150-window_8-epochs_20-negative_15-sample_1e-5.model")

In [10]:
# Run this to load models
# vector_size must match the loaded model; it also sizes the DuckDB FLOAT[...] column below
vector_size = 150
model = gensim.models.Doc2Vec.load("doc2vec-vector_size_150-window_8-epochs_20-negative_15-sample_1e-5.model")

In [5]:
# Run this to test models
print(model.wv.most_similar("java", topn=15))

[('golang', 0.884445071220398), ('angular', 0.8738394379615784), ('javascript', 0.8638027906417847), ('python', 0.8609697222709656), ('nodejs', 0.850350558757782), ('js', 0.8358636498451233), ('reactjs', 0.8354408740997314), ('typescript', 0.814383327960968), ('kotlin', 0.7976672053337097), ('sql', 0.7898983955383301), ('oop', 0.7849213480949402), ('angularjs', 0.7722851037979126), ('struts', 0.7709268927574158), ('backend', 0.7635115385055542), ('graphql', 0.760227620601654)]


# This section concerns speeding up query results using an embedded database (DuckDB in this demo)

In [11]:
# Build a pandas DataFrame for DuckDB to query: one row per document, holding its vector and its description text
tmp_df = pandas.DataFrame({"vec": [model.dv[i] for i in range(len(documents))], "idx": data["description"]})
print(tmp_df.head())

                                                 vec  \
0  [0.16971074, -0.016613226, 0.4590436, -0.42158...   
1  [1.1401042, -0.1168741, 0.12566207, 0.67148614...   
2  [0.4515414, 0.15629922, 0.16206044, 0.03234753...   
3  [-0.25204325, -0.09656739, 0.2107027, 0.274823...   
4  [0.036246527, 0.0866064, 0.06634248, -0.027899...   

                                                 idx  
0  Job descriptionA leading real estate firm in N...  
1  At Aspen Therapy and Wellness , we are committ...  
2  The National Exemplar is accepting application...  
3  Senior Associate Attorney - Elder Law / Trusts...  
4  Looking for HVAC service tech with experience ...  


In [13]:
# Set up the database from the `model` vectors
duckdb.sql("install vss")
duckdb.sql("load vss")
# IF EXISTS keeps the first run from failing when the table has not been created yet
duckdb.sql("DROP TABLE IF EXISTS embeddings")
duckdb.sql(f"CREATE TABLE embeddings (vec FLOAT[{vector_size}], idx STRING);")
duckdb.sql("CREATE INDEX cos_idx ON embeddings USING HNSW (vec) WITH (metric='cosine')")
duckdb.sql("INSERT INTO embeddings SELECT * FROM tmp_df")

In [20]:
# Sample query
queryString = """
OpenText is a global leader in information management, where innovation, creativity, and collaboration are the key components of our corporate culture. As a member of our team, you will have the opportunity to partner with the most highly regarded companies in the world, tackle complex issues, and contribute to projects that shape the future of digital transformation.

The Common Components unit provides shared software engineering services to global Product Management and development teams in the OpenText™ Products group. This includes User Experience Design support (guidelines and individual product designs), Product Information artifacts (product documentation and help), Localization support, Accessibility governance (VPAT, WCAG), Performance Engineering and Product Security and Build and Release management tools and services.

Your Impact:

A Development role at OpenText is more than just a job; it's an opportunity to impact lives. As a key contributor, you'll be instrumental in constructing cutting-edge Information Management Solutions that contribute to sustainable supply chains, support refugees, and enhance medical information access to save lives. You will engage in solving meaningful challenges within a motivated team, gaining exposure to advanced technologies beyond individual access. You will be encouraged to cultivate an engineering mindset, driving the creation of innovative software solutions that address real-world problems and shape the future.

What the role offers:

As a Senior Software Developer, you will:
• Own and deliver projects aligned with the team's quarterly cadence, ensuring work contributes to customer success.
• Model integrity and excellence, influencing best practices within the team and leveraging expertise.
• Identify and address issues when the current path does not effectively serve customer needs, collaborating with the manager for corrections.
• Keep customer value in focus, using input from others to determine appropriate technical solutions and making timely decisions without compromising trust.
• Decompose customer problems into designs with multiple interacting software components, mastering code fluency fundamentals.
• Serve as a technical lead, recognized for growing domain expertise, embracing change, and navigating ambiguity with resiliency, while fostering the development of less experienced team members.
• Possess 5-8 years of previous professional experience.

What you need to Succeed:
• Expertise in one development stack: MVC, Java (Spring), JavaScript (jQuery, Tomcat), or Ruby (Rails).
• Strong knowledge of SQL, Windows, Linux, and AWS.
• Experience with Node.js, Python, Bash, and Docker (Terraform and React are a plus).
• Ability to work with multiple development stacks.
• Exceptional troubleshooting and problem-solving skills.
• Proficiency in adapting to and learning new technologies.

One last thing:

OpenText is more than just a corporation, it's a global community where trust is foundational, the bar is raised, and outcomes are owned.

Join us on our mission to drive positive change through privacy, technology, and collaboration. At OpenText, we don't just have a culture; we have character. Choose us because you want to be part of a company that embraces innovation and empowers its employees to make a difference.

OpenText's efforts to build an inclusive work environment go beyond simply complying with applicable laws. Our Employment Equity and Diversity Policy provides direction on maintaining a working environment that is inclusive of everyone, regardless of culture, national origin, race, color, gender, gender identification, sexual orientation, family status, age, veteran status, disability, religion, or other basis protected by applicable laws.

If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please contact us at hr@opentext.com. Our proactive approach fosters collaboration, innovation, and personal growth, enriching OpenText's vibrant workplace.
"""
queryVec = model.infer_vector(gensim.utils.simple_preprocess(queryString), epochs=10)
result = duckdb.sql(f"SELECT array_cosine_similarity(vec, ?::FLOAT[{vector_size}]) AS dst, idx FROM embeddings ORDER BY dst DESC LIMIT 10", params=[queryVec]).fetchall()
for i in result:
	print(i)

(0.559284508228302, "Opentext - The Information Company\n\nAs the Information Company, our mission at OpenText is to create software solutions and deliver services that redefine the future of digital. Be part of a winning team that leads the way in Enterprise Information Management.\n\nOpenText is a global leader in information management, where innovation, creativity, and collaboration are the key components of our corporate culture. As a member of our team, you will have the opportunity to partner with the most highly regarded companies in the world, tackle complex issues, and contribute to projects that shape the future of digital transformation.\n\nYOUR IMPACT\n\nUnique opportunity for a Sr Sales Director, who has an excellent record in providing vision and strategic leadership for an international software business in Eastern Europe.\n\nIn this role you'll focus on successful territory prioritization, leveraging channels for an international sales organization. You will be experie
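
The `dst` value in each result row is the cosine similarity between a stored document vector and the query vector. As a rough standalone illustration of that measure, here is a plain-Python sketch; it is not DuckDB's implementation, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Values closer to 1 mean the vectors point in more similar directions, which is why the query sorts by `dst` in descending order.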

In [22]:
# Check which of the top results mention OpenText in their description text
for i in result:
	if "OpenText" in i[1]:
		print("found")

found
found
