
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the description column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [2]:
# read in the csv
job_listings = pd.read_csv('data/job_listings.csv')
job_listings.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [3]:
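def clean_up_html(txt):
    """
    Takes in an HTML document as a string.
    Returns its text with tags and most punctuation removed.
    """
    # strip the HTML tags, keeping only the text nodes
    soup = BeautifulSoup(txt, 'html.parser')
    html_txt = soup.get_text()
    # the raw strings contain literal backslash-n sequences, not real newlines
    clean1 = html_txt.replace('\\n', '.')
    # replace everything except letters, digits, spaces, and double quotes with a space
    cleaned_up_txt = re.sub('[^a-zA-Z 0-9"]', ' ', clean1)

    return cleaned_up_txt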

In [4]:
# apply clean_up_html to description column
job_listings['description'] = job_listings['description'].apply(clean_up_html)

# drop the leading b left over from the bytes-literal prefix (b'...')
job_listings['description'] = job_listings['description'].apply(lambda txt: txt[1:])
job_listings.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"""Job Requirements Conceptual understanding in...",Data scientist
1,1,Job Description As a Data Scientist 1 you w...,Data Scientist I
2,2,As a Data Scientist you will be working on co...,Data Scientist - Entry Level
3,3,4 969 6 756 a monthContractUnder the gene...,Data Scientist
4,4,Location USA xe2 x80 x93 multiple locations...,Data Scientist
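
The `xe2 x80 x93` fragments in the output above are what remains of escaped UTF-8 bytes once the backslashes are stripped. A minimal sketch of one alternative, assuming each description really is the printed form of a Python bytes object (as the `b'...'` prefix suggests): parse the literal with `ast.literal_eval` and decode it before stripping tags. The helper name `decode_bytes_literal` is hypothetical and is not used elsewhere in this notebook.

```python
import ast

def decode_bytes_literal(raw):
    """Turn a string such as "b'caf\\xc3\\xa9'" into the text it encodes."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a valid Python literal; leave it unchanged
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw

# A shortened, made-up sample in the same shape as the description column.
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li></ul>'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

Applied before `clean_up_html`, this would remove the need for the `txt[1:]` step and for the custom stop words used later, at the cost of changing the outputs shown in this notebook.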


## 2) Use spaCy to tokenize the listings

In [5]:
nlp = spacy.load('en_core_web_lg')

def create_lemmas(txt):
    """
    This function takes in a text document and returns a list of lemmas
    """

    # custom stop words
    STOP_WORDS = nlp.Defaults.stop_words.union(['xe2', 'x80', 'x93'])

    # our language model creates tokens for us and flags their attributes
    doc = nlp(txt)
    
    lemmas = []

    for token in doc:
        # keep tokens that are not punctuation, stop words, or whitespace
        if not token.is_punct and token.text.lower() not in STOP_WORDS and not token.is_space:
            # store the lowercase lemma of the token
            lemmas.append(token.lemma_.lower())

    return lemmas
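
Loading `en_core_web_lg` is slow, so it can help to test the rest of the pipeline with a lightweight stand-in first. The sketch below is not equivalent to `create_lemmas`: it does no lemmatization, and `SIMPLE_STOP_WORDS` is a hypothetical, deliberately tiny stop list. It only illustrates the same filter-then-collect shape.

```python
import re

# Hypothetical stop list for illustration; spaCy's default list is far larger.
SIMPLE_STOP_WORDS = {
    "a", "an", "and", "as", "the", "of", "to", "you", "will", "be", "on",
    "xe2", "x80", "x93",
}

def simple_tokens(txt):
    """Lowercase, split on non-alphanumerics, and drop stop words (no lemmas)."""
    words = re.findall(r"[a-z0-9]+", txt.lower())
    return [w for w in words if w not in SIMPLE_STOP_WORDS]

tokens_demo = simple_tokens(
    "As a Data Scientist you will be working on models xe2 x80 x93 daily"
)
print(tokens_demo)
```

Because it has the same signature (string in, list of strings out), a function like this could be passed as the `tokenizer` argument while debugging and swapped for `create_lemmas` afterwards.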

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [6]:
# raw description text; the vectorizer tokenizes it with create_lemmas
tokens = job_listings['description']

# create transformer
vect = CountVectorizer(tokenizer=create_lemmas)

# build its vocab
vect.fit(tokens)

# transform text into a sparse document-term matrix
dtm = vect.transform(tokens)



In [7]:
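# visualize as dataframe
# (get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 lack get_feature_names_out())
dtm = pd.DataFrame(data=dtm.toarray(), columns=vect.get_feature_names_out())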
dtm

Unnamed: 0,0,00,000,02115,03,0305,0356,04,062,06366,...,zero,zeus,zf,zheng,zillow,zogsport,zone,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
422,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
423,0,0,2,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
424,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
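
To make the matrix above less opaque, here is a small sketch of the same idea built by hand: one row per document, one column per vocabulary term, each cell a raw count. The two documents are made up and whitespace splitting stands in for the real tokenizer, so this illustrates the structure rather than reimplementing `CountVectorizer`.

```python
from collections import Counter

docs = [
    "data science team builds data products",
    "machine learning team ships models",
]

# One Counter per document: token -> number of occurrences in that document.
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all tokens, which fixes the column order.
vocab = sorted(set().union(*counts))

# Each row holds one document's counts, with 0 for tokens it does not contain.
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Most cells are zero even in this toy case, which is why the vectorizer returns a sparse matrix and why `.toarray()` can be expensive on a large corpus.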


## 4) Visualize the most common word counts

In [8]:
dtm.sum().sort_values(ascending=False).head(15)

datum         2820
experience    1957
work          1661
data          1646
team          1373
business      1265
science        989
product        896
model          883
analytic       839
analysis       812
skill          724
scientist      710
machine        708
build          635
dtype: int64
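
The cell above prints the counts without drawing anything. As a dependency-free sketch of a visualization, the snippet below renders a horizontal text bar chart. The counts are copied from the output above (top five terms only), and `text_bar_chart` is a hypothetical helper.

```python
# Counts copied from the output above (top five terms only).
top_terms = {
    "datum": 2820,
    "experience": 1957,
    "work": 1661,
    "data": 1646,
    "team": 1373,
}

def text_bar_chart(counts, width=40):
    """Return one line per term, with a bar scaled to the largest count."""
    largest = max(counts.values())
    label_width = max(len(term) for term in counts)
    lines = []
    for term, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * count / largest)
        lines.append(f"{term:<{label_width}} {bar} {count}")
    return lines

chart = text_bar_chart(top_terms)
print("\n".join(chart))
```

In the notebook itself, calling `.plot.barh()` on the same sorted Series should give a comparable matplotlib chart.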

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [9]:
# Instantiate vectorizer object
tfidf = TfidfVectorizer(tokenizer=create_lemmas)

# Learn the vocabulary and the IDF weight of each term
tfidf.fit(tokens)

# Turn each document into a row of TF-IDF weights
tfidf_dtm = tfidf.transform(tokens)



In [10]:
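# (get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 lack get_feature_names_out())
tfidf_dtm = pd.DataFrame(data=tfidf_dtm.toarray(), columns=tfidf.get_feature_names_out())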
tfidf_dtm

Unnamed: 0,0,00,000,02115,03,0305,0356,04,062,06366,...,zero,zeus,zf,zheng,zillow,zogsport,zone,zoom,zuckerberg,zurich
0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.108188,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
422,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
423,0.0,0.0,0.121650,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.102652,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
424,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [12]:
# Fit on TFIDF DTM
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(tfidf_dtm)

NearestNeighbors(algorithm='kd_tree')

In [23]:
my_description = ["As a Machine Learning Engineer / Data Scientist within the Ground Truth Systems Team, you will be part of a team building infrastructure and tools for exciting new technologies that will shape our future. We build web services that work at scale to support new products and also enable Machine Learning development workflows. We are looking for a Machine learning (ML) engineer with a broad experience and knowledge in machine/deep learning and computer vision. You will be part of every stage of development from concept to deployment. The Engineer will be working on cutting edge problems in efficient data annotation building ML models that assist humans in reducing labeling efforts without sacrificing quality. The candidate should be proficient in the theoretical fundamentals of the above areas with experience in applying them to solve real-world problems.  Key Qualifications  Key Qualifications • Strong coding skills in Python using scientific libraries like numpy, scipy. • Experience with one or more deep learning frameworks such as PyTorch, Tensorflow, or Keras is a must. • Experience with training deep neural networks on large-scale datasets. • Understanding of data structures, software design principles and algorithms. • Interest in building machine learning models that assist humans in reducing labeling efforts. • Deep knowledge of traditional ML concepts such as GMMs, SVMs, trees, and boosting as well as more recent deep learning fundamentals. • Previous publication experience in conferences such as CVPR, ICCV, NeurIPS, and ICLR will be strongly considered  Description  Description The Video Computer Vision org is a centralized applied research and engineering organization responsible for developing real-time on-device Computer Vision and Machine Perception technologies across Apple products. We balance research and product to deliver Apple quality, state-of-the-art experiences, innovating through the full stack, and partnering with HW, SW and ML teams to influence the sensor and silicon roadmap that brings our vision to life. Examples include FaceID, Animoji/Memoji, Scene Understanding, People Understanding and Positional Tracking (VIO/SLAM).  Education & Experience  Education & Experience MS or PHD in CS/CE/EE (or equivalent) with emphasis in machine learning and computer vision"]

In [26]:
# vectorize my description (using model that created the dtm I'm comparing against) and make it dense array
my_description_dense_arr = tfidf.transform(my_description).toarray()

# Query Using NN kneighbors 
neigh_dist, neigh_index = nn.kneighbors(my_description_dense_arr)

In [27]:
neigh_dist

array([[1.17742676, 1.1922605 , 1.1922605 , 1.21785287, 1.23572137]])

In [28]:
neigh_index

array([[283, 272, 279, 201,  52]])
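
The distances above are Euclidean. If both the query and the stored rows are unit-length (which should hold for TF-IDF rows under the default L2 normalization), Euclidean distance `d` and cosine similarity are tied by `d**2 = 2 - 2 * cos`. The sketch below checks that relation on made-up vectors with a brute-force neighbour search. It is an illustration only and does not use the k-d tree that `NearestNeighbors` was configured with.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

# Toy unit-length "documents" and a query (hypothetical values).
rows = [normalize(v) for v in ([1, 0, 2], [0, 3, 1], [2, 2, 0])]
query = normalize([1, 1, 2])

# Brute-force nearest neighbours: sort row indices by distance to the query.
ranked = sorted(range(len(rows)), key=lambda i: euclidean(query, rows[i]))
nearest = ranked[0]

d = euclidean(query, rows[nearest])
cos = cosine_similarity(query, rows[nearest])
print(ranked, d, cos)
```

Under that assumption, the smallest distance reported above (about 1.18) corresponds to a cosine similarity of roughly 0.31, so even the best match shares only a modest part of its weighted vocabulary with the query.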

In [29]:
# Look at one of the returned neighbors (index 52 is the fifth closest; 283 is the closest)
job_listings['description'][52]

'"The challenge Adobe is looking for a Senior Data Scientist who will be building the next generation of marketing cloud products by leveraging machine learning  predictive modeling and optimization techniques  These products would help businesses understand  manage  and optimize the experience throughout the customer journey  Example applications include real time online media optimization  media attribution  predictive sales analytics  product recommendation  mobile analytics  predictive customer scoring and segmentation and large scale experimentation  Ideal candidates will have a strong academic background as well as technical skills including applied statistics  machine learning  data mining  and software development  Familiarity working with large scale datasets and big data techniques would be a plus  What you xe2 x80 x99ll do Develop predictive models on large scale datasets to address various business problems through leveraging advanced statistical modeling  machine learning 

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify the requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 