<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from bs4 import BeautifulSoup
import squarify
from sklearn.neighbors import NearestNeighbors

nlp = spacy.load("en_core_web_lg")

## 1) *Optional:* Scrape 100 Job Listings that contain the title "Data Scientist" from indeed.com

At a minimum your final dataframe of job listings should contain
- Job Title
- Job Description

If you choose not to scrape the data, there is a CSV with outdated data in the directory. Remember, if you scrape Indeed, you're helping yourself find a job. ;)

In [2]:
jobs = pd.read_csv("/Users/ianforrest/Desktop/coding/repos/ianforrest11/DS-Unit-4-Sprint-1-NLP/module2-vector-representations/data/job_listings.csv", index_col=0)
jobs.head()

Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [3]:
jobs.shape

(426, 2)

In [4]:
x = jobs['description'][0]

In [5]:
nlp.Defaults.stop_words |= {"it's", "i", "the", "and", "to",
                            "i'm", "i've", "it"}


In [6]:
nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)

12

## 2) Use spaCy to clean and tokenize the listings


In [7]:
# remove html code
from bs4 import BeautifulSoup

def remove_html(df):
    """Return a copy of df with HTML tags and non-alphanumeric characters removed."""
    df = df.copy()
    # Drop literal fragments. '\n' matches real newline characters only; the
    # escaped backslash-n pairs inside the b'...' text are not matched here.
    for fragment in ['\n', 'n2', 'b"', "b'", "Job Description", "Job Requirements"]:
        df['description'] = df['description'].str.replace(fragment, '', regex=False)
    # Strip HTML tags from every cell
    df = df.applymap(lambda text: BeautifulSoup(text, 'html.parser').get_text())
    # Keep only letters, digits and spaces. regex=True is passed explicitly
    # because newer pandas versions treat the pattern as a literal by default.
    df['description'] = df['description'].str.replace(r'[^a-zA-Z0-9 ]', '', regex=True)
    return df


In [8]:
df = remove_html(jobs)
df.head()

Unnamed: 0,description,title
0,nConceptual understanding in Machine Learning ...,Data scientist
1,nnAs a Data Scientist 1 you will help us build...,Data Scientist I
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level
3,4969 6756 a monthContractUnder the general su...,Data Scientist
4,Location USA xe2x80x93 multiple locationsn2 ye...,Data Scientist


In [9]:
df['description'][0]

'nConceptual understanding in Machine Learning models like Naixc2xa8ve Bayes KMeans SVM Apriori Linear Logistic Regression Neural Random Forests Decision Trees KNN along with handson experience in at least 2 of themnIntermediate to expert level coding skills in PythonR Ability to write functions clean and efficient data manipulation are mandatory for this rolenExposure to packages like NumPy SciPy Pandas Matplotlib etc in Python or GGPlot2 dplyr tidyR in RnAbility to communicate Model findings to both Technical and NonTechnical stake holdersnHands on experience in SQLHive or similar programming languagenMust show past work via GitHub Kaggle or any other published articlenMasters degree in StatisticsMathematicsComputer Science or any other quant specific fieldnApply Now'
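
The cleaned text above still carries stray `n` prefixes and fragments such as `xc2xa8`, which suggests the CSV stores each description as the printed form of a Python bytes object (`b'...'`) with escaped newlines and UTF-8 bytes. One possible way to avoid that debris is to parse the literal back into real text before stripping HTML. The sketch below uses only the standard library; `decode_listing` and `sample` are hypothetical names introduced for illustration, not part of the original notebook.

```python
import ast

def decode_listing(raw):
    """Convert the text of a bytes literal into a real str; return other input unchanged."""
    try:
        value = ast.literal_eval(raw)  # safely parses literals only, never runs code
    except (ValueError, SyntaxError):
        return raw  # not a Python literal, so leave it as it is
    if isinstance(value, bytes):
        # Escapes such as \xe2\x80\x93 and \n are now real bytes; decode them as UTF-8
        return value.decode("utf-8", errors="replace")
    return raw

# Hypothetical sample shaped like the rows shown in jobs.head()
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li>\n</ul>'"
decoded = decode_listing(sample)
print(decoded)
```

Assuming the column really holds bytes literals, something like `jobs['description'].apply(decode_listing)` could run before `remove_html`, so that escaped newlines and multi-byte characters become real characters rather than leftover letters.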

In [10]:
from spacy.tokenizer import Tokenizer

# Tokenizer pipe with removal of default and custom stop words

# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

tokens = []

# Collect lowercased tokens, skipping stop words and punctuation
for doc in tokenizer.pipe(df['description'], batch_size=500):

    doc_tokens = []

    for token in doc:
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token.text.lower())

    tokens.append(doc_tokens)

df['tokens'] = tokens

In [11]:
from collections import Counter

# import 'count' function from lecture
def count(docs):
    """Summarize token frequencies across a collection of tokenized documents."""
    word_counts = Counter()  # total occurrences of each word
    appears_in = Counter()   # number of documents containing each word

    total_docs = len(docs)

    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    temp = zip(word_counts.keys(), word_counts.values())

    wc = pd.DataFrame(temp, columns=['word', 'count'])

    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    wc['pct_total'] = wc['count'].apply(lambda x: x / total)

    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    t2 = zip(appears_in.keys(), appears_in.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')

    wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')

In [12]:
df.head()

Unnamed: 0,description,title,tokens
0,nConceptual understanding in Machine Learning ...,Data scientist,"[nconceptual, understanding, machine, learning..."
1,nnAs a Data Scientist 1 you will help us build...,Data Scientist I,"[nnas, data, scientist, 1, help, build, machin..."
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,"[data, scientist, working, consulting, busines..."
3,4969 6756 a monthContractUnder the general su...,Data Scientist,"[4969, , 6756, monthcontractunder, general, s..."
4,Location USA xe2x80x93 multiple locationsn2 ye...,Data Scientist,"[location, usa, xe2x80x93, multiple, locations..."


In [13]:
wc = count(df['tokens'])
wc.head()

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
58,data,421,4034,1.0,0.032281,0.032281,0.988263
103,business,316,1094,2.0,0.008754,0.041035,0.741784
29,experience,359,1024,3.0,0.008194,0.049229,0.842723
267,,269,928,4.0,0.007426,0.056655,0.631455
38,work,329,927,5.0,0.007418,0.064073,0.7723


In [14]:
df['tokens'].dtype

dtype('O')

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [15]:
count_vect = CountVectorizer()

dtm = count_vect.fit_transform(df.description.values)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dtm_df = pd.DataFrame(dtm.toarray(), columns=count_vect.get_feature_names_out())

In [16]:
dtm_df.head()

Unnamed: 0,02,02115njob,030nnmicrosoft,031819,032519,041819,06366,10,100,1000,...,zeus,zf,zfxe2x80x99s,zheng,zillow,zillows,zonesnability,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# remove common word columns
dtm_df = dtm_df.drop(columns=['from', 'this', 'all',
                              'such', 'with', 'at', 'other',
                              'have', 'as', 'you', 'on', 'in',
                              'or', 'are', 'by', 'for', 'and', 'to',
                              'the', 'that', 'we', 'of',
                              'is', 'be', 'our', 'an', 'it', 'into', 'your'])

## 4) Visualize the most common word counts

In [None]:
word_ranks = dtm_df.sum().sort_values(ascending=False)
word_ranks.head()

In [None]:
squarify.plot(sizes=word_ranks.values[:50], label=word_ranks.index[:50], alpha=0.8)
plt.axis('off')
plt.show();

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [None]:
# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words = 'english')

dtm_tfidf = tfidf.fit_transform(df.description.values)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dtm_tfidf_df = pd.DataFrame(dtm_tfidf.toarray(), columns=tfidf.get_feature_names_out())

# View Feature Matrix as DataFrame
dtm_tfidf_df.head()

## 6) Create a NearestNeighbors Model. Write the description of your ideal data science job and query your job listings.

In [None]:
nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')
nn.fit(dtm_tfidf_df)

In [None]:
dtm_tfidf_df.iloc[5].head()

In [None]:

# Query Using kneighbors 
nn.kneighbors([dtm_tfidf_df.iloc[5]])
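
The query above reuses an existing listing. To query with a free-text description of an ideal job, the text first has to pass through the same fitted vectorizer with `transform` (not `fit_transform`), so that it lands in the same feature space. The sketch below shows the idea on a made-up corpus; `listings` and `ideal` are hypothetical stand-ins for the job descriptions and your own text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy listings standing in for df['description']
listings = [
    "build machine learning models in python and deploy them to production",
    "create dashboards and sql reports for the business team",
    "research deep learning methods for natural language processing",
    "manage client relationships and sales pipelines",
]

tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(listings).toarray()  # ball_tree needs a dense array

nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree")
nn.fit(dtm)

# Hypothetical ideal-job description, vectorized with the already fitted vocabulary
ideal = ["python machine learning models in production"]
query = tfidf.transform(ideal).toarray()

distances, indices = nn.kneighbors(query)

# Show the closest listings, nearest first
for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {listings[idx]}")
```

Words in the query that the vectorizer never saw during fitting are ignored, so a description written in the vocabulary of the listings tends to match better.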

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify experience requirements for specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 