<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [2]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

`Tip:` You will need to install the `bs4` library inside your conda environment. 

In [3]:
from bs4 import BeautifulSoup
import pandas as pd

url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/master/module2-vector-representations/data/job_listings.csv"
jobs = pd.read_csv(url)
jobs.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [4]:
#Taking one entry and trying to parse just the words from the entry
test_entry = jobs.description[0]
test_entry[:100]

'b"<div><div>Job Requirements:</div><ul><li><p>\\nConceptual understanding in Machine Learning models '

In [5]:
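# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(test_entry, 'html.parser')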
print(soup.get_text())

b"Job Requirements:\nConceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them\nIntermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)\nExposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R\nAbility to communicate Model findings to both Technical and Non-Technical stake holders\nHands on experience in SQL/Hive or similar programming language\nMust show past work via GitHub, Kaggle or any other published article\nMaster's degree in Statistics/Mathematics/Computer Science or any other quant specific field.\nApply Now"


In [6]:
parsed_entries = []
for entry in jobs.description:
    soup = BeautifulSoup(entry, 'html.parser')
    parsed_entries.append(soup.get_text())
    
test = pd.DataFrame(parsed_entries, columns=['parsed'])
test.head()

Unnamed: 0,parsed
0,"b""Job Requirements:\nConceptual understanding ..."
1,"b'Job Description\n\nAs a Data Scientist 1, yo..."
2,b'As a Data Scientist you will be working on c...
3,"b'$4,969 - $6,756 a monthContractUnder the gen..."
4,b'Location: USA \xe2\x80\x93 multiple location...


In [7]:
test['parsed'][0]

'b"Job Requirements:\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them\\nIntermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)\\nExposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R\\nAbility to communicate Model findings to both Technical and Non-Technical stake holders\\nHands on experience in SQL/Hive or similar programming language\\nMust show past work via GitHub, Kaggle or any other published article\\nMaster\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.\\nApply Now"'
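
The parsed text above still starts with `b"` and contains literal `\\n` and `\\xc2\\xa8` sequences, which suggests each description was saved as the printed form of a bytes object. A minimal sketch of one way to undo that before calling BeautifulSoup, assuming the values really are UTF-8 bytes literals (the helper name `decode_bytes_repr` and the sample string are illustrative, not part of the assignment):

```python
import ast


def decode_bytes_repr(raw: str) -> str:
    """Convert text such as 'b"<p>caf\\xc3\\xa9</p>"' into a normal string.

    Hypothetical helper: assumes the value is the repr() of a bytes object
    holding UTF-8 text. Anything that does not look like that is returned
    unchanged.
    """
    if not (raw.startswith("b'") or raw.startswith('b"')):
        return raw
    try:
        value = ast.literal_eval(raw)  # safely evaluates the bytes literal
    except (ValueError, SyntaxError):
        return raw
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw


# Made-up sample in the same shape as the job descriptions
sample = 'b"<div>Location: USA \\xe2\\x80\\x93 multiple</div>"'
decoded = decode_bytes_repr(sample)
print(decoded)
```

The decoded string could then be passed to `BeautifulSoup(decoded, 'html.parser')` in place of the raw entry, so that `get_text()` returns real newlines and characters instead of escape sequences.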

## 2) Use spaCy to tokenize the listings

In [7]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_lg')

# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

In [8]:
# Sanity check: the tokenizer returns a Doc whose text matches the input
a = tokenizer('test text')
a.text

'test text'

In [10]:
tokens = []
for doc in tokenizer.pipe(test['parsed'],batch_size=500):
    # Generate a list of tokens for every entry in the df
    doc_tokens = [token.text for token in doc]
    # append that list to the tokens list
    tokens.append(doc_tokens)

In [19]:
tokens[0][:5]

['b"Job', 'Requirements:\\nConceptual', 'understanding', 'in', 'Machine']
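
Tokens such as `'Requirements:\\nConceptual'` appear because a `Tokenizer` built from the vocabulary alone splits only on whitespace. The sketch below compares it with the tokenizer that ships with an English pipeline, which also applies punctuation rules. It uses `spacy.blank("en")` so no model download is needed, and the sample sentence is made up for illustration:

```python
import spacy
from spacy.tokenizer import Tokenizer

# A blank English pipeline carries the default tokenization rules
nlp = spacy.blank("en")

# Built from the vocab alone: no prefix/suffix/infix rules, whitespace only
whitespace_only = Tokenizer(nlp.vocab)

text = "Job Requirements: Python, SQL."

bare_tokens = [t.text for t in whitespace_only(text)]
default_tokens = [t.text for t in nlp.tokenizer(text)]

print(bare_tokens)
print(default_tokens)

# Optionally keep only alphabetic, lower-cased tokens
clean_tokens = [t.text.lower() for t in nlp.tokenizer(text) if t.is_alpha]
print(clean_tokens)
```

With the default rules, punctuation is split into separate tokens, so it can be filtered out before counting words.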

In [43]:
tokens_series=pd.Series(tokens,name='tokens')

In [44]:
jobs.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [42]:
tokens_series.head()

0    [b"Job, Requirements:\nConceptual, understandi...
1    [b'Job, Description\n\nAs, a, Data, Scientist,...
2    [b'As, a, Data, Scientist, you, will, be, work...
3    [b'$4,969, -, $6,756, a, monthContractUnder, t...
4    [b'Location:, USA, \xe2\x80\x93, multiple, loc...
Name: tokens, dtype: object

In [38]:
tokens_series.name

'tokens'

In [49]:
token_jobs = pd.merge(jobs,tokens_series,left_index=True,right_index=True)

In [55]:
token_jobs.drop(columns=['description'], inplace=True)

In [56]:
token_jobs.head()

Unnamed: 0.1,Unnamed: 0,title,tokens
0,0,Data scientist,"[b""Job, Requirements:\nConceptual, understandi..."
1,1,Data Scientist I,"[b'Job, Description\n\nAs, a, Data, Scientist,..."
2,2,Data Scientist - Entry Level,"[b'As, a, Data, Scientist, you, will, be, work..."
3,3,Data Scientist,"[b'$4,969, -, $6,756, a, monthContractUnder, t..."
4,4,Data Scientist,"[b'Location:, USA, \xe2\x80\x93, multiple, loc..."


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [72]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(stop_words='english')

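# Flatten every listing's tokens into one list. Duplicates are harmless here:
# fit() only learns the vocabulary (the set of unique terms).
# Note that each token is passed to fit() as its own one-word document.
all_tokens = [token for doc_tokens in tokens for token in doc_tokens]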

vect.fit(all_tokens)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [84]:
# test = vect.transform(all_tokens)
# pd.DataFrame(test.todense(), columns=vect.get_feature_names_out())

# Transform each listing's token list once. Every token is treated as a
# one-word document, so each result is an (n_tokens x n_vocabulary) sparse matrix.
rows = [{'token': token_list, 'count_vector': vect.transform(token_list)}
        for token_list in token_jobs['tokens']]
count_df = pd.DataFrame(rows, columns=['token', 'count_vector'])

count_df.head(10)

Unnamed: 0,token,count_vector
0,"[b""Job, Requirements:\nConceptual, understandi...","(0, 4315)\t1\n (1, 5469)\t1\n (1, 7952)\t1..."
1,"[b'Job, Description\n\nAs, a, Data, Scientist,...","(0, 4315)\t1\n (1, 2235)\t1\n (1, 5269)\t1..."
2,"[b'As, a, Data, Scientist, you, will, be, work...","(0, 705)\t1\n (2, 2071)\t1\n (3, 8234)\t1\..."
3,"[b'$4,969, -, $6,756, a, monthContractUnder, t...","(0, 214)\t1\n (2, 187)\t1\n (4, 5057)\t1\n..."
4,"[b'Location:, USA, \xe2\x80\x93, multiple, loc...","(0, 4635)\t1\n (1, 9550)\t1\n (2, 9911)\t1..."
5,"[b'Create, various, Business, Intelligence, An...","(0, 1949)\t1\n (1, 9610)\t1\n (2, 1162)\t1..."
6,"[b'As, Spotify, Premium, swells, to, over, 96M...","(0, 705)\t1\n (1, 8695)\t1\n (2, 7373)\t1\..."
7,"[b""Everytown, for, Gun, Safety,, the, nation's...","(0, 2885)\t1\n (1, 3249)\t1\n (2, 3613)\t1..."
8,"[b""MS, in, a, quantitative, discipline, such, ...","(0, 5096)\t1\n (1, 3944)\t1\n (3, 7657)\t1..."
9,"[b'Slack, is, hiring, experienced, data, scien...","(0, 8522)\t1\n (1, 4275)\t1\n (2, 3741)\t1..."


In [87]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(count_df['count_vector'][0].todense(), columns=vect.get_feature_names_out())
dtm

Unnamed: 0,00,000,02115,03,0356,04,062,06366,08,10,...,zenreach,zero,zeus,zf,zheng,zillow,zones,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4) Visualize the most common word counts

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
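
One way this query could work, sketched on placeholder documents: fit `NearestNeighbors` on the TF-IDF matrix, transform the ideal-job text with the same fitted vectorizer, and ask for the closest rows. Cosine distance is an assumption made here because it ignores document length; the `docs` list and the `ideal_job` text are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Placeholder documents standing in for cleaned job descriptions
docs = [
    "machine learning python models",
    "sql reporting dashboards excel",
    "frontend javascript react web",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Cosine distance on sparse input is handled with a brute-force search
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(X)

# The query must go through the already-fitted vectorizer (transform, not fit)
ideal_job = "python machine learning"
query = tfidf.transform([ideal_job])

distances, indices = nn.kneighbors(query)
best_match = docs[indices[0][0]]
print(best_match, distances[0][0])
```

`indices[0]` lists the row positions of the closest listings, which can be used to look up the matching titles in the original DataFrame.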

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
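
As a starting point for the clustering goal, here is a minimal sketch of one alternative to K-means: agglomerative (hierarchical) clustering on cosine distances between TF-IDF vectors, using SciPy. This is only one candidate among many; the placeholder `docs`, the average linkage, and the choice of two clusters are all assumptions for illustration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents standing in for cleaned job descriptions
docs = [
    "python machine learning models",
    "machine learning python pipelines",
    "sql dashboards excel reporting",
    "excel reporting sql charts",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Average-linkage hierarchy built from pairwise cosine distances
Z = linkage(X, method="average", metric="cosine")

# Cut the hierarchy so that at most two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Documents that share vocabulary end up with the same label. On the real listings, the number of clusters would need tuning, and very short or empty descriptions should be removed first because cosine distance is undefined for all-zero vectors.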