<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the description column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [33]:
from bs4 import BeautifulSoup
import re

##### Your Code Here #####
df = pd.read_csv('./data/job_listings.csv')
df.rename(columns={'description': 'raw_description'}, inplace=True)
df.head()

Unnamed: 0.1,Unnamed: 0,raw_description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [34]:
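# Extract the visible text from each HTML description into a new column.
# TODO: improve the cleaning. The regex below also strips backslashes, so a
# literal "\n" survives as a stray "n" and escaped bytes such as
# "\xe2\x80\x93" survive as "xe2x80x93".

def extract_text(col):
    # Parse the HTML; [1:] drops the leading "b" of the bytes-literal prefix.
    soup = BeautifulSoup(col, 'html.parser')
    text = re.sub('[^a-zA-Z 0-9]', '', soup.get_text()[1:])
    # Restore apostrophes that were stored as the escaped bytes \xe2\x80\x99.
    text = re.sub('xe2x80x99', '\'', text)
    return text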

df['cleaned_description'] = df['raw_description'].apply(extract_text)

df.head()




Unnamed: 0.1,Unnamed: 0,raw_description,title,cleaned_description
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Job RequirementsnConceptual understanding in M...
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Job DescriptionnnAs a Data Scientist 1 you wil...
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,As a Data Scientist you will be working on con...
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,4969 6756 a monthContractUnder the general su...
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Location USA xe2x80x93 multiple locationsn2 ye...
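The stray `n` characters and `xe2x80x93` sequences in `cleaned_description` suggest that `raw_description` holds the text form of a Python bytes literal (`b'...'`), so `\n` and `\xe2\x80\x93` are literal backslash sequences rather than real characters. One possible approach, sketched here with only the standard library, is to evaluate the literal and decode it before stripping tags. The helper name `decode_bytes_literal` is hypothetical, and UTF-8 is an assumption about the original encoding.

```python
import ast


def decode_bytes_literal(raw):
    """Turn the text "b'...'" into the string those bytes encode (assumed UTF-8)."""
    value = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return str(value)


# Sample modeled on row 4 of the table above (shortened by hand).
sample = "b'<ul><li>Location: USA \\xe2\\x80\\x93 multiple locations</li></ul>'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

The decoded string could then be passed to `BeautifulSoup(decoded, 'html.parser').get_text()` as in `extract_text`, without the `[1:]` slice.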


In [35]:
# Testing cell

soup = BeautifulSoup(df['raw_description'][1], 'html.parser')
soup = re.sub('[^a-zA-Z 0-9]', '', soup.get_text()[1:])
print(soup)

Job DescriptionnnAs a Data Scientist 1 you will help us build machine learning models data pipelines and microservices to help our clients navigate their healthcare journey You will do so by empowering and improving the next generation of Accolade Applications and user experiencesnA day in the lifexe2x80xa6nWork with a small agile team to design and develop mobile applications in an iterative fashionnWork with a tightknit group of development team members in SeattlenContribute to best practices and help guide the future of our applicationsnOperates effectively as a collaborative member of the development teamnOperates effectively as an individual for quick turnaround of enhancements and fixesnResponsible for meeting expectations and deliverables on time with high qualitynDrive and implement new features within our mobile applicationsnPerform thorough manual testing and writing test cases that cover all areasnIdentify new development toolsapproaches that will increase code quality effic

## 2) Use spaCy to tokenize the listings

In [36]:
##### Your Code Here #####
nlp = spacy.load("en_core_web_md", disable=['parser', 'tagger', 'ner'])

def tokenizer(text):
    """Return lowercased lemmas, skipping stop words and punctuation."""
    doc = nlp(text)

    # Filter out low-quality tokens and save case-normalized lemmas
    return [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct
    ]

df['tokens'] = df['cleaned_description'].apply(tokenizer)
df.head()

Unnamed: 0.1,Unnamed: 0,raw_description,title,cleaned_description,tokens
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Job RequirementsnConceptual understanding in M...,"[job, requirementsnconceptual, understand, mac..."
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Job DescriptionnnAs a Data Scientist 1 you wil...,"[job, descriptionnnas, data, scientist, 1, hel..."
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,As a Data Scientist you will be working on con...,"[data, scientist, work, consult, business, res..."
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,4969 6756 a monthContractUnder the general su...,"[4969, , 6756, monthcontractunder, general, s..."
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Location USA xe2x80x93 multiple locationsn2 ye...,"[location, usa, xe2x80x93, multiple, locations..."


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
##### Your Code Here #####

# Instantiate a CountVectorizer
vect = CountVectorizer(stop_words='english',
                       ngram_range=(1,3),
                       min_df=5,
                       max_df=0.25,
                       tokenizer=tokenizer)

# Learn the vocabulary
vect.fit(df['cleaned_description'])

# Get the sparse document-term matrix (DTM)
dtm = vect.transform(df['cleaned_description'])

# Densify for display. Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions use get_feature_names_out() instead.
dtm = pd.DataFrame(data=dtm.todense(), columns=vect.get_feature_names())
print(dtm.shape)
dtm.head()



(426, 5956)


Unnamed: 0,ability,able,analytics,business,datum,deep,design,disability,disability.1,engineer,...,year relate,year relate experience,year relevant,year relevant work,year work,year work experience,yes,york,york city,younnabout
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
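The `min_df` and `max_df` arguments prune the vocabulary by document frequency: with `min_df=5` a term must occur in at least five listings, and with `max_df=0.25` any term found in more than a quarter of the listings is dropped. A minimal sketch on a made-up four-document corpus (the documents and thresholds below are illustrative assumptions, not taken from `job_listings.csv`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up "listings" (illustrative only).
docs = [
    "python sql statistics",
    "python spark",
    "python sql",
    "python tableau",
]

# min_df=2: keep terms found in at least 2 documents.
# max_df=0.75: drop terms found in more than 75% of documents.
toy_vect = CountVectorizer(min_df=2, max_df=0.75)
toy_dtm = toy_vect.fit_transform(docs)

print(toy_vect.vocabulary_)  # only "sql" survives both filters
print(toy_dtm.toarray())
```

Here `python` is removed for being too common and `statistics`, `spark`, and `tableau` for being too rare, which leaves a single column.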


## 4) Visualize the most common word counts

In [49]:
##### Your Code Here #####
# Most common words
dtm.sum().sort_values(ascending=False).head(20)

client          238
digital         199
state           190
employee        178
job             162
global          159
user            158
intelligence    157
perform         149
network         149
level           146
like            146
protect         145
program         144
internal        142
diverse         140
growth          139
approach        139
stakeholder     136
quality         136
dtype: int64
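The cell above prints the counts rather than plotting them. One way to visualize them is a horizontal bar chart; the sketch below hardcodes the top five values from the output so that it runs on its own, and the variable names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Top five counts copied from the output above; in the notebook this could be
# dtm.sum().sort_values(ascending=False).head(20) instead.
top_words = pd.Series(
    {"client": 238, "digital": 199, "state": 190, "employee": 178, "job": 162}
)

fig, ax = plt.subplots(figsize=(6, 3))
# Reverse the order so the most frequent word ends up at the top of the chart.
ax.barh(top_words.index[::-1], top_words.values[::-1])
ax.set_xlabel("Total count across listings")
ax.set_title("Most common terms")
fig.tight_layout()
```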

array(['Data scientist\xa0', 'Data Scientist I',
       'Data Scientist - Entry Level', 'Data Scientist',
       'Associate Data Scientist – Premium Analytics',
       'Sr. Data Scientist', 'Data Scientist, Lifecyle',
       'Data Scientist, Neuroimaging', 'Data Scientist II',
       'Data Scientist - Risk', 'Data Analyst/Jr. Data Scientist',
       'Assistant Data Scientist', 'Data Scientist, Junior',
       'Data Scientist – Personalization', 'Data Scientist - [Remote]',
       'Measurement Data Scientist', 'WTE Data Science Engineer',
       'Data Scientist, Sales', 'Data Scientist Intern',
       'Jr. Data Scientist', 'Sr. Data Engineer',
       'Data Scientist/Data Analytics Intern - Summer 2019',
       'Data Science Internship – Summer 2019',
       'Data Scientist Summer Intern', 'Data Scientist (Senior)',
       'Data Scientist Internship - Summer 2019',
       'Data Scientist - Insurance', 'Junior Data Scientist',
       'Data Scientist - Computer Vision',
       'Data Scient

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer
##### Your Code Here #####

# Instantiate a TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english',
                        ngram_range=(1,3),
                        min_df=5,
                        max_df=0.25,
                        tokenizer=tokenizer)

# Learn the vocabulary
tfidf.fit(df['cleaned_description'])

# Get the sparse document-term matrix (DTM)
dtm = tfidf.transform(df['cleaned_description'])

# Densify for display. Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions use get_feature_names_out() instead.
dtm = pd.DataFrame(data=dtm.todense(), columns=tfidf.get_feature_names())
print(dtm.shape)
dtm.head()



(426, 5956)


Unnamed: 0,ability,able,analytics,business,datum,deep,design,disability,disability.1,engineer,...,year relate,year relate experience,year relevant,year relevant work,year work,year work experience,yes,york,york city,younnabout
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
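TF-IDF differs from raw counts in that terms appearing in many documents get a lower weight, and each row is scaled to unit length by default (`norm='l2'`). A small sketch on a made-up corpus (not the job listings) shows the effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "data" is in every document, the other words in one each.
docs = ["data python", "data sql", "data spark"]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(docs).toarray()

vocab = toy_tfidf.vocabulary_  # maps term -> column index
first_doc = weights[0]
print("data:  ", round(first_doc[vocab["data"]], 3))
print("python:", round(first_doc[vocab["python"]], 3))
```

Both words occur once in the first document, yet `python` receives the larger weight because it is rarer across the corpus.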


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [58]:
df['cleaned_description']

0      Job RequirementsnConceptual understanding in M...
1      Job DescriptionnnAs a Data Scientist 1 you wil...
2      As a Data Scientist you will be working on con...
3      4969  6756 a monthContractUnder the general su...
4      Location USA xe2x80x93 multiple locationsn2 ye...
                             ...                        
421    About UsnWant to be part of a fantastic and fu...
422    InternshipAt Uber we ignite opportunity by set...
423    200000  350000 a yearA million people a year d...
424    SENIOR DATA SCIENTISTnJOB DESCRIPTIONnnABOUT U...
425    Cerner Intelligence is a new innovative organi...
Name: cleaned_description, Length: 426, dtype: object

In [60]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree")

# Fit on DTM
nn.fit(dtm)

# writing my ideal job description and vectorizing it using our tfidf model
ideal = "Machine learning engineer role, client facing, excellent renumeration, travel opportinities, small team / start-up."
ideal = pd.Series([ideal]) # Wrap it in a Series, since tfidf.transform expects an iterable of documents
ideal_vectorized = tfidf.transform(ideal)
ideal_vectorized = pd.DataFrame(data=ideal_vectorized.todense(), columns= tfidf.get_feature_names())
# Query Using kneighbors
neigh_dist, neigh_ind = nn.kneighbors(ideal_vectorized)

In [61]:
neigh_dist

array([[1.30289781, 1.32557847, 1.34004016, 1.34301779, 1.34395737]])
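These distances are easier to read as cosine similarities. `NearestNeighbors` uses Euclidean distance by default and `TfidfVectorizer` L2-normalizes each row by default, so for two unit vectors d² = 2 − 2·cos, which gives cos = 1 − d²/2. The sketch below assumes the query vector also has unit length, which holds as long as the query contains at least one in-vocabulary term.

```python
import math

# Distances copied from the neigh_dist output above.
neigh_dist = [1.30289781, 1.32557847, 1.34004016, 1.34301779, 1.34395737]

# For unit-length vectors: d^2 = 2 - 2*cos  =>  cos = 1 - d^2 / 2
cosine_sims = [1 - d ** 2 / 2 for d in neigh_dist]

for d, c in zip(neigh_dist, cosine_sims):
    print(f"distance {d:.3f} -> cosine similarity {c:.3f}")

# Largest possible distance between two non-negative unit vectors.
print("max possible distance:", round(math.sqrt(2), 3))
```

The closest listing works out to a cosine similarity of roughly 0.15, so even the best match shares only a small part of the query's vocabulary.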

In [67]:
neigh_ind

array([[182, 345, 311, 178, 137]])

In [70]:
# Print the cleaned description of each of the five nearest listings
for description in df.iloc[neigh_ind[0]]['cleaned_description']:
    print(description)

About ScoopnnScoop brings coworkers and neighbors together to enjoy a smooth carpooling experiencexe2x80x94unlocking new opportunities to create friendships improve their wellbeing and make the most of their valuable timennLearn more in Forbes httpswwwforbescomsitesmiguelhelft20171108with36millioninfinancingscoopwantstomakecarpoolingmainstreamnnEngineering  ScoopnnFew companies get to face such diverse technical challenges as Scoop and we've built a team of people excited to face these challenges together while investing in each others' growthnnScoop's engineering team may move bits and pixels but we also put real live human beings in cars together We're touching problems academics have written about for years and have data that no other company has ever collectednnBut Scoop knows engineering is not a lone discipline We're a small team with varied backgrounds big companies VCbacked startups bootcamps academia We like to build together and we like to learn together Our entire team and p
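The query above densifies the TF-IDF matrix and searches it with a k-d tree. As an alternative sketch (not the approach used in this notebook), the sparse matrix can be kept as is and searched by cosine distance with brute force, which scikit-learn supports for sparse input. The mini corpus and variable names below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up mini corpus standing in for df['cleaned_description'].
listings = [
    "machine learning engineer at a startup",
    "financial analyst with excel",
    "nurse practitioner in a clinic",
]

toy_tfidf = TfidfVectorizer()
X = toy_tfidf.fit_transform(listings)  # stays a SciPy sparse matrix

# Brute-force search accepts sparse input and the cosine metric.
toy_nn = NearestNeighbors(n_neighbors=1, metric="cosine", algorithm="brute")
toy_nn.fit(X)

query = toy_tfidf.transform(["machine learning startup"])
dist, ind = toy_nn.kneighbors(query)
print(listings[ind[0][0]])
```

Skipping `todense()` avoids materializing a mostly-zero array, which matters more as the number of listings or n-grams grows.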

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from the Data Scientist job listings?
 - Try to identify the requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little research to see which algorithms might suit it better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
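
The hint about Euclidean distance can be illustrated with a short experiment: as the number of dimensions grows, the nearest and farthest points from a query end up at almost the same distance. The sketch below uses uniformly random points, which is an assumption made for simplicity; real TF-IDF vectors are sparse and will differ in detail. The function name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# 5956 matches the number of columns in the DTM above.
for n_dims in (2, 10, 100, 1000, 5956):
    print(f"{n_dims:>5} dims: relative contrast = {relative_contrast(n_dims):.2f}")
```

A contrast near zero means "nearest" carries little information, which is one reason to consider cosine-based or dimensionality-reduced approaches for clustering text.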