# vectorizing job offers

## aim
- cluster job offeres by similarity based on a dictionary of skills

## outline
- preprocess doc2vec with full job offers
- train model
- test similarity of job descriptions
- cluster offers (Kmeans, KNN)

## outcome
unicorns in a meadow

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import gensim
import joblib
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import math
import multiprocessing
import gensim.models.doc2vec
import time
import json

%matplotlib inline

In [2]:
df = joblib.load('../../../raw_data/processed_data.joblib')

In [3]:
df.shape

(7859, 14)

In [4]:
df['tag_language'] = df['tag_language'].fillna(value='en')

In [5]:
df.head(3)

Unnamed: 0,job_title,job_text,company,location,job_info,query_text,source,job_link,tag_language,reviews,job_info_tokenized,job_text_tokenized,job_text_tokenized_titlecase,job_title_tokenized
0,(Junior) Data Engineer (f/m/x),Customlytics ist die führende App Marketing Be...,Customlytics GmbH,Berlin,(Junior) Data Engineer (f/m/x)\nCustomlytics G...,data science,scrape_json,,en,,"[junior, data, engineer, fmx, customlytics, gm...","[customlytics, ist, die, führende, app, market...","[Customlytics, ist, die, führende, App, Market...","[junior, data, engineer, fmx]"
1,,Responsibilities\n\nAs working student (m/f/x)...,Aroundhome,Berlin,Aroundhome6 Bewertungen - Berlin,data science,scrape_json,,en,,"[aroundhome, bewertungen, berlin]","[responsibilities, as, working, student, mfx, ...","[Responsibilities, As, working, student, mfx, ...",[]
2,,Aufgaben\nAls Werkstudent (m/w/d) IT arbeitest...,Aroundhome,Berlin,"Aroundhome6 Bewertungen - Berlin\nTeilzeit, Pr...",data science,scrape_json,,de,,"[aroundhome, bewertungen, berlin, teilzeit, pr...","[aufgaben, als, werkstudent, mwd, it, arbeites...","[Aufgaben, Als, Werkstudent, mwd, IT, arbeites...",[]


In [6]:
# select english jobs
df_eng = df.copy()
df_eng = df_eng[df_eng['tag_language'] == 'en']
df_eng.reset_index(inplace=True)
df_eng.drop(columns='index', inplace=True)

In [7]:
df_eng.head()

Unnamed: 0,job_title,job_text,company,location,job_info,query_text,source,job_link,tag_language,reviews,job_info_tokenized,job_text_tokenized,job_text_tokenized_titlecase,job_title_tokenized
0,(Junior) Data Engineer (f/m/x),Customlytics ist die führende App Marketing Be...,Customlytics GmbH,Berlin,(Junior) Data Engineer (f/m/x)\nCustomlytics G...,data science,scrape_json,,en,,"[junior, data, engineer, fmx, customlytics, gm...","[customlytics, ist, die, führende, app, market...","[Customlytics, ist, die, führende, App, Market...","[junior, data, engineer, fmx]"
1,,Responsibilities\n\nAs working student (m/f/x)...,Aroundhome,Berlin,Aroundhome6 Bewertungen - Berlin,data science,scrape_json,,en,,"[aroundhome, bewertungen, berlin]","[responsibilities, as, working, student, mfx, ...","[Responsibilities, As, working, student, mfx, ...",[]
2,Full Stack Developer (m/f/d),We’re Phiture: a leading mobile growth consult...,Phiture,BerlinKreuzberg,Full Stack Developer (m/f/d)\nPhiture - Berlin...,data science,scrape_json,,en,,"[full, stack, developer, mfd, phiture, berlink...","[were, phiture, a, leading, mobile, growth, co...","[Were, Phiture, a, leading, mobile, growth, co...","[full, stack, developer, mfd]"
3,,"We are 18,000+ employees strong, operating in ...",PRA Health Sciences,Berlin,PRA Health Sciences - Berlin,data science,scrape_json,,en,,"[pra, health, sciences, berlin]","[we, are, employees, strong, operating, in, mo...","[We, are, employees, strong, operating, in, mo...",[]
4,Head of Finance,Head of Finance (m/f/d)\nAt Home our mission i...,Home HT GmbH,Berlin,Head of Finance\nHome HT GmbH2 Bewertungen - B...,data science,scrape_json,,en,,"[head, of, finance, home, ht, gmbh, bewertunge...","[head, of, finance, mfd, at, home, our, missio...","[Head, of, Finance, mfd, At, Home, our, missio...","[head, of, finance]"


In [8]:
df_eng.shape

(7547, 14)

In [9]:
# join strings
def join_strings(text):
    return ' '.join(text)

In [10]:
# lemmatize
def lemmatize_words(word):
    lemmatizer = WordNetLemmatizer()
    lemmatized = lemmatizer.lemmatize(word)

    return lemmatized

In [11]:
# remove stopwords
def remove_stopwords(text):

    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text) 
    text = [w for w in word_tokens if not w in stop_words] 
  
    return text

#['heute', 'weiter', 'zur', 'bewerbung', 'diesen', 'job', 'melden']

In [21]:
# process text
df_eng['clean'] = df_eng['job_text_tokenized'].apply(join_strings).apply(lemmatize_words)\
    .apply(remove_stopwords)

## model doc2vec 1

Conclusions :)
- ~ 700 offers - 100 epocs
    - model performs ok, but tends to cluster according to company
    - texts with very high similarity (> 0.90) are likely to be duplicated job adds
    - looks like the model first shows offers based on duplicates, then company, then position (probably because of semantics)


- 2500 offers - 150 epocs
    - still clusters by company
    - add more data? or try bigrams

In [22]:
# tag texts
texts = df_eng['clean']

def tag_text(texts):
    texts_tagged = [TaggedDocument(text, tags=['tag_'+str(tag)]) for tag, text in enumerate(texts)]

    return texts_tagged

texts_tagged = tag_texts(texts)
texts_tagged

NameError: name 'tag_texts' is not defined

In [19]:
# reduced dataset
texts_tagged_small = texts_tagged[:3000]
texts_tagged_small[0]

NameError: name 'texts_tagged' is not defined

In [18]:
data_to_train = texts_tagged_small # texts_tagged_small, texts_tagged

# build vocabulary with CBOW (dm=0)
cores = multiprocessing.cpu_count()
model_dbow = Doc2Vec(documents=data_to_train,
                     dm=0,
                     alpha=0.025,
                     vector_size=len(data_to_train), 
                     min_count=1,
                     workers=cores)

# train the model
model_dbow.train(data_to_train, total_examples=model_dbow.corpus_count, epochs=15)

NameError: name 'texts_tagged' is not defined

In [None]:
model_dbow.save('../../../models/doc2vec_3000_15_epochs')
#joblib.dump(model_dbow, filename='../../../models/doc2vec_3000_20_epochs.joblib' )

In [None]:
model_dbow.corpus_count

### test the model by hand and with copy-pasted text

**test model with texts in the database**

In [14]:
# load model
model_loaded = Doc2Vec.load('../../../models/doc2vec_all_10_epochs')

In [15]:
def similar_jobs(tokenized_job, offers):
    ''' input: tokenized job offers, number of offers 
        returns tags of top x most similar job offers and similarity probabilities
    '''

    # infer vector from text 
    infer_vector = model_loaded.infer_vector(tokenized_job)
    # find similar offers
    similar_documents = model_loaded.docvecs.most_similar([infer_vector], topn = offers)

    return similar_documents


def print_top_jobs(text, offers=5):
    
    """ input: index of text in dataframe and number of offers we want to see
        prints text of the offers
    """
    
    tags = similar_jobs(text, offers)
    indices = [int(tag[0].replace('tag_', '')) for tag in tags]  
    
    #print(text)
    print(f"{tags}\n")
    for num in indices:
        print(f"{df_eng['job_title'][num], df_eng['company'][num], df_eng['job_text'][num]}\n {filtered_texts[num]} -------------END------------\n ") 

In [16]:
similar_jobs(texts_tagged_small[0][0], 10)

NameError: name 'texts_tagged_small' is not defined

## test model with copy-pasted job

In [24]:
## change case to lower
import string
def to_lower(text):
    return text.lower()

## remove numbers from the corpus
def remove_number(text):
    text = ''.join(word for word in text if not word.isdigit())
    
    return text

## remove special puncutation from text
def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    
    return text

In [25]:
offer = """
The Data Science team at OLX Group is responsible for building algorithmic solutions to facilitate transactions between buyers and sellers. We are developing personalization technologies and optimization strategies that have a direct impact on OLX’s users as well as the company's bottom line.

You will be encouraged to research state-of-the-art machine learning, in the areas of user segmentation, image metadata extraction (including multi-label classification and tagging with deep learning), semi-supervised learning, recommender systems, and more. Applying these methods to core OLX product platforms deployed in the cloud that are affecting the user experience for millions of visitors per month, rolling your solutions to production, analysing model results offline and online, and measuring site impact.

What you will be doing:

Work in multi-functional teams with people from different backgrounds
Find opportunities where data science will make an impact
Help to translate business requirements into machine learning models
Build effective solutions with machine learning
Bring machine learning services to productions together with engineers
Measure the impact of your models on company goals
Collaborate with internal and external stakeholders


Who we’re looking for someone who has:

Strong analytical and software development background
At least 2 to 3 years of professional data science experience or equivalent time in PhD studies.
Experience with at least one of the following machine learning frameworks: Scikit-Learn, TensorFlow, PyTorch, (or similar)
Hands-on experience in SQL
Strong engineering background: good knowledge of Python and good understanding of best engineering practices
Proficient in English with excellent written and oral communication skills
Position based in Poznan or Warsaw


Nice to have

Experience bringing models in production and serving models at scale
Experience using AWS for deploying machine learning solutions
Experience with building data pipelines using tools like Spark and Airflow
Exposure to other programming languages such as Kotlin, Java, Scala, etc
Exposure to production infrastructure and DevOps practices: monitoring, alerting, CI/CD, container-orchestrating platforms, and infrastructure-as-code tools (Grafana, Prometheus, Kubernetes, Terraform)

What we’ll give you:

Competitive compensation and benefits
Contributing to the global OLX Group
A passionate and diverse team of data scientists spanning several tech hubs across the globe.
The opportunity to learn from each other and become better every day
Competitive salary and good benefits
Company Mobile phone, laptop of your choice: Laptop MacBook Pro, Windows or Linux, Notebook, PC, any tool you might need


What you need to know about us:

OLX is the world’s leading classifieds platform in high-growth markets. It’s available in more than 40 countries and in over 50 languages. The platform makes it so easy to connect people to buy, sell or exchange used goods and services.
OLX is part of the OLX Group, a global product and tech company with 19 brands, +40 countries, +5000 people and one mindset.
Our mission is to make it super easy for people to buy and sell almost anything, boosting local economies.
We are proud to be different, and we work differently too. We combine the spirit and agility of a start-up with the maturity that comes from being part of a 100 year-old company.
We are curious, ambitious and allergic to corporate interference. We improvise, experiment and push each other further, embracing uncertainty and driving change.

"""

In [87]:
offer = """

TD Reply is an innovation and marketing consultancy and part of the Reply Group. We take a data-driven and execution-oriented approach to drive organizational change through meaningful insights. In our Berlin and Beijing offices we are around 90 thinkers, developers, analysts, designers, consultants, visualizers, futurologists and organizers. We’ve helped transform the business of global leading brands such as Coca-Cola, BMW, adidas, Miele, FrieslandCampina, Volkswagen, Lufthansa, Postbank, Deutsche Bahn, L’Oreal, and Telefónica.

You & us

We drive innovation in marketing and data science and believe that magic happens when data meets imagination. We generate value by applying this mix to real-world business problems for the world’s leading companies and brands. Together with our consultants, developers and designers we help to build impactful products and drive digital transformation for our clients. Our solutions predict the future of consumer behavior, find the best media budget allocation, guide brand management operations and to provide insights and management solutions on how to drive business. If you are keen to shape the future of data driven management solutions and break down boundaries in marketing and data, you are welcome in our data engineering team. At TD Reply, you will play a central role in shaping product design, technical decision-making and delivering value to the company and our clients.

Your tasks will include

Implementing and maintaining production-level, robust data analytics scripts in Python or R
Build and shape scalable technical infrastructure, testing and monitoring
Define and put into action development, quality assurance, and support processes
Keep code quality high through pair programming, code reviews, code analytics
Crunching and describing data sets containing important consumer/customer insights on structured and unstructured data
Integrate publicly available datasources, external APIs, databases and leverage our reporting platform Pulse
Promote the spirit of Data Engineering throughout the company

There is a match if…

… your profile is completed by demonstrated skills in a relevant programming language, e.g. R or Python, for a minimum of 2 years

… you have a solid understanding of coding techniques, robust code, and error handling in data pipelines

… you have worked with DBMS like MongoDB, Athena/Presto, Elasticsearch or Snowflake

… you have worked with Cloud Services like AWS, Azure or GCP

… you have experience in connecting with external RESTful APIs

…. you possess a deep understanding of data modelling and transformation, and have handled large datasets in the past

…. you hold a degree in IT/Engineering/Econometrics/Mathematics/Statistics with at least 2 years’ working experience

… you like working in an agile environment and being responsible for the technical design, implementation, maintenance, monitoring and automated testing of software

… you are a team player who likes to share, discuss and work on ideas / tasks with your colleagues


That’s in it for you

A startup atmosphere in a sustainably successful company located in the center of Berlin – modern office with rooftop terrace and stunning views included
Flexible, family-friendly working hours
Work with a young, creative and diverse team
Get an opportunity to try out new technologies
A chance to actively shape products and projects
Projects & clients that will make you feel proud to work for"""

In [88]:
token_offer = to_lower(offer)
token_offer = remove_number(token_offer)
token_offer = remove_punctuation(token_offer)
token_offer = lemmatize_words(token_offer)
token_offer = remove_stopwords(token_offer)

In [89]:
infer_vector = model_loaded.infer_vector(token_offer)
infer_vector

array([ 0.01941206, -0.02264883,  0.08563899, ..., -0.04149463,
       -0.02167387, -0.07774137], dtype=float32)

In [90]:
similar_documents = model_loaded.docvecs.most_similar([infer_vector], topn = 5)
similar_documents

[('tag_157', 0.9139636158943176),
 ('tag_413', 0.6613494753837585),
 ('tag_353', 0.6545976996421814),
 ('tag_371', 0.6463046669960022),
 ('tag_327', 0.6298779845237732)]

In [84]:
# top_index = [text[0].replace('tag_', '') for text in similar_documents]
# top_offers = pd.DataFrame(df_eng.iloc[top_index]['job_text'])
# top_offers['similarities'] = [text[1] for text in similar_documents]

In [85]:
#top_offers.head()

In [91]:
tags = similar_documents
tags = [list(i) for i in tags]

print(f"{tags}\n")
# print(f"{df_eng['job_title'][text_index], df_eng['company'][text_index], df_eng['job_text'][text_index]} \
#     \n-------------END------------\n ")

for tag in tags: 
    num = int(tag[0].strip('tag_'))

    print(f"{df_eng['job_title'][num], df_eng['company'][num], df_eng['job_text'][num]} \
    \n-------------END------------\n ") 


[['tag_157', 0.9139636158943176], ['tag_413', 0.6613494753837585], ['tag_353', 0.6545976996421814], ['tag_371', 0.6463046669960022], ['tag_327', 0.6298779845237732]]

('DATA ENGINEER', 'TD Reply GmbH', 'That’s us\nTD Reply is an innovation and marketing consultancy and part of the Reply Group. We take a data-driven and execution-oriented approach to drive organizational change through meaningful insights. In our Berlin and Beijing offices we are around 90 thinkers, developers, analysts, designers, consultants, visualizers, futurologists and organizers. We’ve helped transform the business of global leading brands such as Coca-Cola, BMW, adidas, Miele, FrieslandCampina, Volkswagen, Lufthansa, Postbank, Deutsche Bahn, L’Oreal, and Telefónica.\nYou & us\nWe drive innovation in marketing and data science and believe that magic happens when data meets imagination. We generate value by applying this mix to real-world business problems for the world’s leading companies and brands. Together wit

## improve the model 1
Steps:
- filter out all words not in dictionary 
- train model
- get output and see if it's better

In [None]:
# import dictionary

with open('../fydjob/data/dicts/skills_dict.json') as json_file:
    dictionary = json.load(json_file)

# collapse dictionary into list
skills_list = [item for key, value in dictionary.items() for item in value]
#print(sorted(skills_list))

In [None]:
# filter tokens for skill
filtered_texts = [[word for word in text if word in skills_list] for text in texts]

In [None]:
# tag documents
filtered_texts_tag = tag_texts(filtered_texts)
filtered_texts_tag_small = filtered_texts_tag[:5000]

In [None]:
# train model # train_model.train(alldocs, total_examples=len(alldocs), epochs=epochs, start_alpha=0.025, end_alpha=0.001)

data_to_train = filtered_texts_tag_small # texts_tagged_small, texts_tagged, filtered_texts_tag_small

# build vocabulary with CBOW (dm=0)
model_dbow = Doc2Vec(documents=data_to_train,
                     dm=0,
                     alpha=0.025,
                     vector_size=len(data_to_train), 
                     min_count=1)

# train the model
model_dbow.train(data_to_train, total_examples=model_dbow.corpus_count, epochs=50)

In [None]:
# save model
model_dbow.save('../../../models/doc2vec_filtered_5000_50_epochs')

In [None]:
# test model
model_loaded = Doc2Vec.load('../../../models/doc2vec_filtered_5000_50_epochs')

In [None]:
num = 770
print(f"{df_eng['job_title'][num], df_eng['company'][num], df_eng['job_text'][num]}\n {filtered_texts[num]} -------------END------------\n ") 

In [None]:
print_top_jobs(filtered_texts[num], 5)

In [None]:
offer = '''Description

Would you like to join the team that protects the global AWS platform from fraud? Do you enjoy thinking like a fraudster and using your technical skills to help detect & mitigate AWS accounts from being compromised? If so, AWS Fraud Prevention has an exciting opportunity for you.

AWS has the most services and more features within those services, than any other cloud provider–from infrastructure technologies like compute, storage, and databases–to emerging technologies, such as machine learning and artificial intelligence, data lakes and analytics, and Internet of Things. AWS Platform is the glue that holds the AWS ecosystem together. Whether its Identity features such as access management and sign on, cryptography, console, builder & developer tools, and even projects like automating all of our contractual billing systems, AWS Platform is always innovating with the customer in mind. The AWS Platform team sustains over 750 million transactions per second.

The AWS Fraud Prevention Compromise vertical is responsible for detecting & mitigating AWS account compromise. You’ll be part of a team of Data Scientists, Investigations Analysts, and Technical & non-Technical Program Managers. The team’s goal is to identify and neutralize fraudsters from compromising AWS customers’ accounts.

As a Data Scientist, you will work directly with Business Analysts and Software Development Engineers to monitor the flavor/ trend of compromise on AWS worldwide and design appropriate solutions to respond in a collaborative environment. There are no walls, and success is determined by your ability to dive deep, and understand the subtle demands new and complex services will place upon systems and teams.

As a Data Scientist Your Responsibilities Will Include
Apply state-of-the-art Machine Learning methods to large amounts of data from different sources to build and productionalize fraud prevention, detection and mitigation solutions
Deep dive on the problems using SQL and scripting languages like Python/R to drive short term and long term solutions leveraging Statistical Analysis
Analyze data (past customer behavior, sales inputs, and other sources) to figure out trends, create compromise prevention and mitigation solutions and output reports with clear recommendations
Collaborate closely with the development team to recommend and build innovations based on Data Science
Manage your own process: identify and execute on high impact projects, triage external requests, and make sure you bring projects to conclusion in time for the results to be useful
Learn and Be Curious. We have a formal mentor search application that lets you find a mentor that works best for you based on location, job family, job level etc. Your manager can also help you find a mentor or two, because two is better than one. In addition to formal mentors, we work and train together so that we are always learning from one another, and we celebrate and support the career progression of our team members.

Inclusion and Diversity. Our team is diverse! We drive towards an inclusive culture and work environment. We are intentional about attracting, developing, and retaining amazing talent from diverse backgrounds. Team members are active in Amazon’s 10+ affinity groups, sometimes known as employee resource groups, which bring employees together across businesses and locations around the world. These range from groups such as the Black Employee Network, Latinos at Amazon, Indigenous at Amazon, Families at Amazon, Amazon Women and Engineering, LGBTQ+, Warriors at Amazon (Military), Amazon People With Disabilities, and more.

Learn more about Amazon on our Day 1 Blog: https://blog.aboutamazon.com


Basic Qualifications
Master’s degree in Mathematics, Statistics, Computer Science or in another related field
Several years of hands-on relevant experience using programming/scripting languages such as Python or equivalent
Proven understanding of Statistical Analysis, Modeling and Machine Learning techniques
Experience in designing and deploying ML modeling and prediction pipelines
Ability to leverage SQL or Spark for Ad-hoc analyses and building out ETL pipelines on heterogeneous data sources
Experience performing statistical analysis and using tools such as R, pandas, or equivalent
Preferred Qualifications
Experience and proficiency with AWS technologies (EC2, CloudTrail, S3, SageMaker, Lambda, DynamoDB, RDS, etc.), and Big Data technologies
Familiarity with AWS Redshift, Spark or other distributed computing technologies
Previous work as a Data Scientist in the context of fraud analytics or risk scoring
Ability to work in a fast-paced, ambiguous environment while prioritizing and managing multiple responsibilities
Excellent written and verbal communication skills
Excellent problem solving skills with a attention to detail
Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice to know more about how we collect, use and transfer the personal data of our candidates.
'''

In [None]:
token_offer = to_lower(offer)
token_offer = remove_number(token_offer)
token_offer = remove_punctuation(token_offer)
token_offer = lemmatize_words(token_offer)
token_offer = remove_stopwords(token_offer)
token_offer = [word for word in token_offer if word in skills_list]
print(token_offer)

In [None]:
# offer needs to b masked as well!!

print_top_jobs(token_offer, 5)

# Dummy copy-paste model

In [None]:
from fydjob.Doc2VecPipeline import Doc2VecPipeline

In [None]:
model = Doc2VecPipeline()

In [None]:
offer = '''Description

Would you like to join the team that protects the global AWS platform from fraud? Do you enjoy thinking like a fraudster and using your technical skills to help detect & mitigate AWS accounts from being compromised? If so, AWS Fraud Prevention has an exciting opportunity for you.

AWS has the most services and more features within those services, than any other cloud provider–from infrastructure technologies like compute, storage, and databases–to emerging technologies, such as machine learning and artificial intelligence, data lakes and analytics, and Internet of Things. AWS Platform is the glue that holds the AWS ecosystem together. Whether its Identity features such as access management and sign on, cryptography, console, builder & developer tools, and even projects like automating all of our contractual billing systems, AWS Platform is always innovating with the customer in mind. The AWS Platform team sustains over 750 million transactions per second.

The AWS Fraud Prevention Compromise vertical is responsible for detecting & mitigating AWS account compromise. You’ll be part of a team of Data Scientists, Investigations Analysts, and Technical & non-Technical Program Managers. The team’s goal is to identify and neutralize fraudsters from compromising AWS customers’ accounts.

As a Data Scientist, you will work directly with Business Analysts and Software Development Engineers to monitor the flavor/ trend of compromise on AWS worldwide and design appropriate solutions to respond in a collaborative environment. There are no walls, and success is determined by your ability to dive deep, and understand the subtle demands new and complex services will place upon systems and teams.

As a Data Scientist Your Responsibilities Will Include
Apply state-of-the-art Machine Learning methods to large amounts of data from different sources to build and productionalize fraud prevention, detection and mitigation solutions
Deep dive on the problems using SQL and scripting languages like Python/R to drive short term and long term solutions leveraging Statistical Analysis
Analyze data (past customer behavior, sales inputs, and other sources) to figure out trends, create compromise prevention and mitigation solutions and output reports with clear recommendations
Collaborate closely with the development team to recommend and build innovations based on Data Science
Manage your own process: identify and execute on high impact projects, triage external requests, and make sure you bring projects to conclusion in time for the results to be useful
Learn and Be Curious. We have a formal mentor search application that lets you find a mentor that works best for you based on location, job family, job level etc. Your manager can also help you find a mentor or two, because two is better than one. In addition to formal mentors, we work and train together so that we are always learning from one another, and we celebrate and support the career progression of our team members.

Inclusion and Diversity. Our team is diverse! We drive towards an inclusive culture and work environment. We are intentional about attracting, developing, and retaining amazing talent from diverse backgrounds. Team members are active in Amazon’s 10+ affinity groups, sometimes known as employee resource groups, which bring employees together across businesses and locations around the world. These range from groups such as the Black Employee Network, Latinos at Amazon, Indigenous at Amazon, Families at Amazon, Amazon Women and Engineering, LGBTQ+, Warriors at Amazon (Military), Amazon People With Disabilities, and more.

Learn more about Amazon on our Day 1 Blog: https://blog.aboutamazon.com


Basic Qualifications
Master’s degree in Mathematics, Statistics, Computer Science or in another related field
Several years of hands-on relevant experience using programming/scripting languages such as Python or equivalent
Proven understanding of Statistical Analysis, Modeling and Machine Learning techniques
Experience in designing and deploying ML modeling and prediction pipelines
Ability to leverage SQL or Spark for Ad-hoc analyses and building out ETL pipelines on heterogeneous data sources
Experience performing statistical analysis and using tools such as R, pandas, or equivalent
Preferred Qualifications
Experience and proficiency with AWS technologies (EC2, CloudTrail, S3, SageMaker, Lambda, DynamoDB, RDS, etc.), and Big Data technologies
Familiarity with AWS Redshift, Spark or other distributed computing technologies
Previous work as a Data Scientist in the context of fraud analytics or risk scoring
Ability to work in a fast-paced, ambiguous environment while prioritizing and managing multiple responsibilities
Excellent written and verbal communication skills
Excellent problem solving skills with a attention to detail
Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice to know more about how we collect, use and transfer the personal data of our candidates.
'''

In [None]:
model.find_similar_jobs_from_string(offer)