
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [2]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy
from spacy.tokenizer import Tokenizer

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy and full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [3]:
from bs4 import BeautifulSoup # used only to extract the jobs description
import requests
import os

##### Your Code Here #####
df = pd.read_csv(os.path.join('data', 'job_listings.csv'))
df = df.drop(['Unnamed: 0'], axis=1)
df

Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist
...,...,...
421,"b""<b>About Us:</b><br/>\nWant to be part of a ...",Senior Data Science Engineer
422,"b'<div class=""jobsearch-JobMetadataHeader icl-...",2019 PhD Data Scientist Internship - Forecasti...
423,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist - Insurance
424,"b""<p></p><div><p>SENIOR DATA SCIENTIST</p><p>\...",Senior Data Scientist


In [4]:
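# Strip the HTML tags from each description, keeping only the visible text
df['description'] = df['description'].apply(lambda html: BeautifulSoup(html, 'html.parser').get_text())
# Replace real line breaks with a space (the literal two-character "\n" inside the stringified bytes is not matched)
df['description'] = df['description'].apply(lambda text: re.sub(r'\r?\n|\r', ' ', text))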

In [27]:
df['description'][3]

"b'$4,969 - $6,756 a monthContractUnder the general supervision of Professors Dana Mukamel and Kai Zheng, the incumbent will join the CalMHSA Mental Health Tech Suite Innovation (INN) Evaluation Team. This large, statewide multi-year study will evaluate the effectiveness of two new and innovative applications offered to people with mental health conditions, which include opportunities for online chatting between users and online listeners Responsibilities of the incumbent will include managing and analyzing text data created by users of the two mental health applications as part of the research and evaluation objectives of the team. The incumbent will collaborate with faculty and other team researchers, and will be expected to create under supervision and direction variables describing the usage of the apps, the interactions between users, and the effectiveness of the apps. The incumbent will also be expected to interact with the vendors of the apps around data issues.\\n\\nThe Univers

In [6]:
df.head(5)

Unnamed: 0,description,title
0,"b""Job Requirements:\nConceptual understanding ...",Data scientist
1,"b'Job Description\n\nAs a Data Scientist 1, yo...",Data Scientist I
2,b'As a Data Scientist you will be working on c...,Data Scientist - Entry Level
3,"b'$4,969 - $6,756 a monthContractUnder the gen...",Data Scientist
4,b'Location: USA \xe2\x80\x93 multiple location...,Data Scientist
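
The cleaned descriptions above still start with `b'` and contain escape sequences such as `\xe2\x80\x93`, which suggests the column holds the string form of Python bytes objects. A minimal sketch of one way to recover the real text follows, assuming every value is a well-formed bytes literal encoded as UTF-8. The helper name `decode_bytes_literal` is hypothetical and not part of the assignment.

```python
import ast


def decode_bytes_literal(text):
    """Return real text for a string that looks like a Python bytes literal."""
    if text.startswith(("b'", 'b"')):
        try:
            # literal_eval parses the literal without executing arbitrary code
            return ast.literal_eval(text).decode("utf-8", errors="replace")
        except (ValueError, SyntaxError):
            # Malformed or truncated literal: leave the value untouched
            return text
    return text


# The raw value holds literal backslashes, exactly as it appears in the CSV
raw = "b'Location: USA \\xe2\\x80\\x93 multiple locations\\nFull-time'"
clean = decode_bytes_literal(raw)
print(clean)
```

If this assumption holds for the data, applying the helper with `df['description'].apply(decode_bytes_literal)` before the BeautifulSoup step would turn the escaped bytes into an en dash and the literal `\n` into real line breaks.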


## 2) Use spaCy to tokenize the listings

In [7]:
##### Your Code Here #####
nlp = spacy.load("en_core_web_lg")
tokenizer = Tokenizer(nlp.vocab)

In [8]:
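# Extra stop words. Note: the loop below filters on token.is_stop, so this set is not applied there.
STOP_WORDS = nlp.Defaults.stop_words.union([':','\xe2\x80\x93',"b'","\n"])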

In [9]:

tokens  = []
for doc in tokenizer.pipe(df['description']):
    doc_tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)
    
df['spacy_tokens'] = tokens
df['spacy_tokens'].head()

0    [b"job, requirements:\nconceptual, understandi...
1    [b'job, description\n\nas, data, scientist, 1,...
2    [b'as, data, scientist, working, consulting, b...
3    [b'$4,969, $6,756, monthcontractunder, general...
4    [b'location:, usa, \xe2\x80\x93, multiple, loc...
Name: spacy_tokens, dtype: object
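
Tokens such as `requirements:\nconceptual` keep their punctuation because a `Tokenizer` built from `nlp.vocab` alone has no prefix, suffix, or infix rules and splits only on whitespace. The sketch below compares it with the pipeline's own tokenizer. It assumes a blank English pipeline (`spacy.blank("en")`) so that no model download is needed, and the sample sentence is made up for illustration.

```python
import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline is enough to illustrate tokenization
nlp = spacy.blank("en")
text = "Job Requirements: Python, SQL."

# A tokenizer built from the vocab alone splits on whitespace only
bare = Tokenizer(nlp.vocab)
bare_tokens = [t.text for t in bare(text)]

# The pipeline's default tokenizer also separates punctuation
full_tokens = [t.text for t in nlp.tokenizer(text)]

# Same filter as in the notebook: drop stop words and punctuation, lowercase the rest
kept = [t.text.lower() for t in nlp.tokenizer(text)
        if not t.is_stop and not t.is_punct]

print(bare_tokens)
print(full_tokens)
print(kept)
```

With the default tokenizer, `token.is_punct` has something to match, so the punctuation filter in the loop above would take effect.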

In [10]:
from collections import Counter

def count(docs):
    word_counts = Counter()
    appears_in = Counter()
    total_docs = len(docs)
    
    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))
        
    temp = zip(word_counts.keys(), word_counts.values())
    
    wc = pd.DataFrame(temp,columns = ['word','count'])
    wc['rank'] = wc['count'].rank(method='first',ascending=False)
    total = wc['count'].sum()
    
    wc['pct_total'] = wc['count'].apply(lambda x: x / total)
    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()
    
    t2 = zip(appears_in.keys(), appears_in.values())
    ac = pd.DataFrame(t2, columns=['word','appears_in'])
    wc = ac.merge(wc,on='word')
    wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)
    
    return wc.sort_values(by='rank')

In [11]:
wc = count(df['spacy_tokens'])
wc.tail()

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
21300,receive.,1,1,21335.0,8e-06,0.999968,0.002347
21296,licensed,1,1,21336.0,8e-06,0.999976,0.002347
21289,25000,1,1,21337.0,8e-06,0.999984,0.002347
21297,countries.\n\ncerner\xe2\x80\x99s,1,1,21338.0,8e-06,0.999992,0.002347
21313,category.,1,1,21339.0,8e-06,1.0,0.002347
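
The `count` function tracks two different quantities: how often a word occurs overall (`count`) and how many documents contain it (`appears_in`). A small sketch on a made-up three-document corpus shows the difference. The toy documents are an assumption for illustration only.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [["data", "science", "data"], ["data", "python"], ["python", "sql"]]

word_counts = Counter()  # total occurrences across the corpus
appears_in = Counter()   # number of documents containing the word

for doc in docs:
    word_counts.update(doc)
    appears_in.update(set(doc))  # set() counts each word once per document

total = sum(word_counts.values())
for word, n in word_counts.most_common():
    pct_total = n / total
    appears_in_pct = appears_in[word] / len(docs)
    print(word, n, round(pct_total, 3), round(appears_in_pct, 3))
```

Here "data" occurs three times but appears in only two documents, which is why `pct_total` and `appears_in_pct` can rank words differently.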


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [12]:
df

Unnamed: 0,description,title,spacy_tokens
0,"b""Job Requirements:\nConceptual understanding ...",Data scientist,"[b""job, requirements:\nconceptual, understandi..."
1,"b'Job Description\n\nAs a Data Scientist 1, yo...",Data Scientist I,"[b'job, description\n\nas, data, scientist, 1,..."
2,b'As a Data Scientist you will be working on c...,Data Scientist - Entry Level,"[b'as, data, scientist, working, consulting, b..."
3,"b'$4,969 - $6,756 a monthContractUnder the gen...",Data Scientist,"[b'$4,969, $6,756, monthcontractunder, general..."
4,b'Location: USA \xe2\x80\x93 multiple location...,Data Scientist,"[b'location:, usa, \xe2\x80\x93, multiple, loc..."
...,...,...,...
421,"b""About Us:\nWant to be part of a fantastic an...",Senior Data Science Engineer,"[b""about, us:\nwant, fantastic, fun, startup, ..."
422,"b'InternshipAt Uber, we ignite opportunity by ...",2019 PhD Data Scientist Internship - Forecasti...,"[b'internshipat, uber,, ignite, opportunity, s..."
423,"b'$200,000 - $350,000 a yearA million people a...",Data Scientist - Insurance,"[b'$200,000, $350,000, yeara, million, people,..."
424,"b""SENIOR DATA SCIENTIST\nJOB DESCRIPTION\n\nAB...",Senior Data Scientist,"[b""senior, data, scientist\njob, description\n..."


In [13]:
##### Your Code Here #####
vect = CountVectorizer(stop_words='english')

In [14]:
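# CountVectorizer expects an iterable of strings (one per document),
# so join each token list back into a single string.
df['str_spacy_tokens'] = [','.join(map(str, l)) for l in df['spacy_tokens']]
print(df['str_spacy_tokens'])

# Passing df['spacy_tokens'] directly fails because each element is a list, not a string.
# Passing the whole DataFrame only appears to work: iterating a DataFrame yields its
# column names, so the vectorizer would see the column labels instead of the listings.
dtm = vect.fit_transform(df['str_spacy_tokens'])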
print(dtm.shape)

0      b"job,requirements:\nconceptual,understanding,...
1      b'job,description\n\nas,data,scientist,1,,help...
2      b'as,data,scientist,working,consulting,busines...
3      b'$4,969,$6,756,monthcontractunder,general,sup...
4      b'location:,usa,\xe2\x80\x93,multiple,location...
                             ...                        
421    b"about,us:\nwant,fantastic,fun,startup,that\x...
422    b'internshipat,uber,,ignite,opportunity,settin...
423    b'$200,000,$350,000,yeara,million,people,year,...
424    b"senior,data,scientist\njob,description\n\nab...
425    b'cerner,intelligence,new,,innovative,organiza...
Name: str_spacy_tokens, Length: 426, dtype: object
(426, 9813)
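
An alternative to joining the tokens back into strings is to give `CountVectorizer` a callable `analyzer`, which receives each document as-is and returns its tokens. This is a sketch on made-up token lists, not the approach the notebook takes. Note that options such as `stop_words` are not applied when the analyzer is a callable.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity(tokens):
    """Return the document unchanged because it is already a list of tokens."""
    return tokens


# Hypothetical pre-tokenized documents
token_lists = [
    ["data", "scientist", "python"],
    ["data", "analyst", "sql", "data"],
]

vect = CountVectorizer(analyzer=identity)
counts = vect.fit_transform(token_lists)

vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)  # terms in column order
print(vocab)
print(counts.toarray())
```

Because the vectorizer no longer re-tokenizes the text, the vocabulary contains exactly the tokens produced upstream.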


In [15]:
type(dtm)

scipy.sparse.csr.csr_matrix

In [16]:
dtm.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 2, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0]], dtype=int64)

## 4) Visualize the most common word counts

In [17]:
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names_out())
print(dtm.shape)
dtm.head()

(426, 9813)


Unnamed: 0,00,000,02115,03,0356,04,062,06366,08,10,...,zenreach,zero,zeus,zf,zheng,zillow,zones,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
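
The table above shows the counts but not which words are most common. One possible way to get there is to sum each column of the document-term matrix and plot the largest totals. The sketch below uses a small made-up matrix, and the helper names `top_terms` and `plot_top_terms` are hypothetical.

```python
import pandas as pd


def top_terms(dtm, n=20):
    """Sum each term's column and return the n largest totals."""
    return dtm.sum(axis=0).sort_values(ascending=False).head(n)


def plot_top_terms(dtm, n=20):
    """Draw a horizontal bar chart of the n most frequent terms."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    counts = top_terms(dtm, n).sort_values()  # ascending so the largest bar is on top
    ax = counts.plot.barh(figsize=(8, 6))
    ax.set_xlabel("Total count across listings")
    ax.set_title(f"Top {len(counts)} terms")
    plt.tight_layout()
    return ax


# Hypothetical document-term matrix: rows are listings, columns are terms
toy = pd.DataFrame({"data": [3, 2, 4], "python": [1, 0, 2], "zurich": [0, 1, 0]})
top = top_terms(toy, n=2)
print(top)
```

On the notebook's data, `plot_top_terms(dtm, n=20)` would be the corresponding call, assuming `dtm` is the count DataFrame built above.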


## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [18]:
##### Your Code Here #####
tfidf = TfidfVectorizer(stop_words = 'english',
                      ngram_range = (1,2),
                      min_df = 2,
                      max_df = 0.75) # 75%

dtm = tfidf.fit_transform(df['str_spacy_tokens'])

# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
dtm = pd.DataFrame(dtm.todense(), columns = tfidf.get_feature_names_out())

print(dtm.shape)
dtm.head()

(426, 29099)


Unnamed: 0,000,000 125,000 350,000 associates,000 cities,000 client,000 employees,000 nthe,000 restaurants,000 subcontractor,...,zero,zero invite,zf,zf digital,zf divisions,zf organization,zf platforms,zf portfolio,zf positioning,zf xe2
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
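
In the cell above, `min_df = 2` drops terms that appear in fewer than two listings and `max_df = 0.75` drops terms that appear in more than 75% of them. The weights themselves follow scikit-learn's default formula: raw counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit length. The sketch below reproduces those weights by hand on a made-up corpus, assuming default `TfidfVectorizer` settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus
docs = ["python sql python", "python statistics", "sql dashboards"]

# Term frequencies: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (smooth_idf=True)

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Because every row ends up with length 1, distances between rows depend only on the angle between them.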


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [19]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors

In [20]:
nn = NearestNeighbors(n_neighbors = 5, algorithm = 'kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [21]:
nn.kneighbors([dtm.iloc[0].values])

(array([[0.        , 1.33852255, 1.34109624, 1.34473944, 1.35136249]]),
 array([[  0, 336, 276, 115, 403]], dtype=int64))
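
The distances returned here are Euclidean (`metric='minkowski'` with `p=2`). Since `TfidfVectorizer` L2-normalizes each row by default, the distance between two rows relates to their cosine similarity by `d = sqrt(2 - 2 * cos)`. The sketch below checks that identity on random unit vectors and converts the nearest distance above back to a similarity. It assumes neither row is all zeros, and `cosine_from_distance` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random non-negative vectors, scaled to unit length like TF-IDF rows
a = rng.random(50)
b = rng.random(50)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos = float(a @ b)
euclid = float(np.linalg.norm(a - b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(euclid, np.sqrt(2 - 2 * cos))


def cosine_from_distance(d):
    """Invert the identity to get cosine similarity from a Euclidean distance."""
    return 1 - d ** 2 / 2


# Distance to the closest other listing in the output above
print(cosine_from_distance(1.33852255))
```

Under that assumption, a distance of about 1.34 corresponds to a cosine similarity of roughly 0.10, so even the closest listing shares only a small part of its weighted vocabulary with listing 0.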

In [22]:
df['description'][336]

"b'Discover business insights, identify opportunities and provide solutions and recommendations to solve business problems through the use of statistical, algorithmic, data mining and visualization techniques.\\nLevel 1\\nUnder close supervision, conduct predictive analyses for population health management, marketing campaign management and forecasting,\\nAnalyze and design solutions using healthcare data\\nWork with datasets of varying degrees of size and complexity including both structured and unstructured data\\nAnalyze and mine multiple data sources to select statistically valid data samples\\nBuild, test and implement machine learning models by investigating appropriate methods and algorithms Develop analytical datasets and transform fields from data sources as necessary for modeling\\nWork with other departments to develop research plans to match company goals\\nStrategize on approaches to develop robust and meaningful analytics Prepare and co-present reports of model performanc

In [23]:
a = ['''
Responsibilities:
Applying principled methods to quantitative insurance challenges in areas such as telematics risk scoring, pricing, reserving, and estimating customer lifetime value.
Learning the required tools to get the job done, e.g. Python, R, Spark, SQL, etc. Building data processing pipelines to quickly iterate on research ideas and put them into production.
Effectively communicating insights from complex analyses.
Taking end-to-end ownership of problem domains and continuously improving upon quantitative solutions.
Minimum Qualifications:
PhD in a quantitative discipline and/or 3+ years of applying advanced quantitative techniques to problems in industry.
Strong demonstrable knowledge of topics such as statistical inference, numerical linear algebra, machine learning, and numerical optimization.
Exceptional communicator and storyteller. Strong programming skills with experience using modern packages in R and Python.
Demonstrated experience building, validating, and applying statistical machine learning methods to real world problems.
''']
# A job listing for a data scientist at Root Insurance Co., used here as the query text


In [24]:
new = tfidf.transform(a)
new

<1x29099 sparse matrix of type '<class 'numpy.float64'>'
	with 99 stored elements in Compressed Sparse Row format>

In [25]:
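# toarray() returns an ndarray; recent scikit-learn versions reject the np.matrix that todense() returns
nn.kneighbors(new.toarray())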

(array([[1.32850924, 1.33650361, 1.33769416, 1.33780662, 1.33870596]]),
 array([[283,  62,   8, 349,  25]], dtype=int64))
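
An alternative setup is to keep the TF-IDF matrix sparse and search by cosine distance with a brute-force index, which avoids building a dense 29,099-column array. This is a sketch on made-up listings, not the configuration used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical listings
listings = [
    "python machine learning statistics",
    "sql dashboards reporting",
    "marketing sales strategy",
]

tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(listings)  # stays a scipy sparse matrix

# Brute-force search supports the cosine metric and accepts sparse input
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(dtm)

query = tfidf.transform(["machine learning with python"])
distances, indices = nn.kneighbors(query)
print(indices[0], distances[0])
```

Cosine distance is `1 - cosine similarity`, so 0 means identical direction and 1 means no shared terms.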

In [26]:
df['description'][283]
# The closest match is an Apple listing

"b'Summary\\nPosted: Mar 15, 2019\\nRole Number: 200041695\\nAt Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly. If you are a self-motivated, high-energy individual who is not afraid of challenges, we\\xe2\\x80\\x99re looking for you. Apple is seeking a junior level Data Scientist to join a team passionate about Machine Learning applications for Apple Media Products (AMP), covering the App Store, Apple Music/iTunes, Video and other services. This role will involve working with Internet-scale data across numerous product and customer touch points, undertaking in-depth analysis as well as modeling, and building end-to-end Machine Learning applications to solve the problem. The team\\xe2\\x80\\x99s culture is centered around rapid iteration with open feedback and debate along the way, plus strong collaboration with product, engineering, business, and marketing partners.\\nKey Qualifications\\n4+ years of experience in machine learn

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 