# Webscraping Glassdoor.com: Will I ever find a job?


The project, in a nutshell:
- Creating a webdriver for Glassdoor;
- Gathering job posts containing:
 - Company, location, rating, URL;
 - Job description;
- Parsing the data into a dictionary, according to a number of relevant categories;
- Gathering analytics for the Data Scientist market in the US;
- Measuring the "similarity" between a job post and a "simplified" résumé.
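
The parsing step can be pictured with a minimal sketch. The category names, the keyword lists, and the `parse_post` function below are hypothetical and are not taken from `helperP3`; they only illustrate how a free-text job description could be reduced to keywords grouped by category.

```python
import re

# Hypothetical keyword dictionary (assumption); the real one lives in helperP3.
KEYWORDS = {
    'skills': ['python', 'r', 'sql', 'spark', 'hadoop'],
    'education': ['bachelor', 'master', 'phd'],
    'languages': ['french', 'german'],
}

def parse_post(description):
    """Return {category: sorted list of keywords found in the description}."""
    # Split into lowercase tokens; '+' and '#' are kept for terms such as "c++"
    tokens = set(re.findall(r"[a-z0-9+#]+", description.lower()))
    return {category: sorted(tokens.intersection(words))
            for category, words in KEYWORDS.items()}

example = parse_post("PhD or Master in Statistics; strong Python, R and SQL. French is a plus.")
print(example)
```

Matching whole tokens rather than substrings keeps a one-letter keyword such as "r" from matching inside other words.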

In [1]:
# Load libraries
%matplotlib inline
from collections import Counter  # used below to count locations and companies
from plotly.graph_objs import Bar, Figure, Layout, Scatter
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import pandas as pd # For converting results to a dataframe and bar chart plots
import numpy as np
import pickle
from IPython.display import HTML

# load helper
from helperP3 import *

init_notebook_mode(connected=True)



In [2]:
# load data
jobDict = load_obj('glassDoorDict')
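# 6- Analytics: first check for consistency.
# Keep only the posts that have exactly six fields.
completeDict = {key: value for key, value in jobDict.items() if len(value) == 6}

# Keep the first five fields and replace the last one with the output of skills_info
finalDict = {key: value[0:5] + [skills_info([value[0]] + value[5])]
             for key, value in completeDict.items()}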

In [3]:
# Calculate top locations  
    
# Field 3 of each post is its location
location_dict = Counter(post[3] for post in finalDict.values())
location_frame = pd.DataFrame(list(location_dict.items()), columns=['Term', 'NumPostings'])\
                 .sort_values(by='NumPostings', ascending=False).head(10)




In [6]:
data = [Bar(
            x=location_frame.Term,
            y=location_frame.NumPostings
    )]
layout = Layout(yaxis = dict(title = "Number of posts"), title = 'Top 10 cities in US')
fig = Figure(data=data, layout=layout)
iplot(fig,filename='basic-bar')

In [7]:
# Calculate top companies - (company, rating) , Num posting
        
company_dict = Counter((post[2], post[1]) for post in finalDict.values())
company_frame = pd.DataFrame(list(company_dict.items()), columns=['Term', 'NumPostings'])\
                  .sort_values(by='NumPostings', ascending=False).head(20)

tmp = pd.DataFrame(list(company_dict.keys()), columns=['Company', 'Rating'])\
                        .sort_values(by='Rating', ascending=False).head(20)

# Names and ratings of the top 20 companies, in the same order as company_frame,
# so that the bar labels line up with company_frame.NumPostings
company, company_rating = zip(*company_frame.Term)

In [8]:
data = [Bar(
            x=company,
            y=company_frame.NumPostings
    )]
layout = Layout(yaxis = dict(title = "Number of posts"), title = 'Top 20 employers on Glassdoor')
fig = Figure(data=data, layout=layout)
iplot(fig,filename='basic-bar')

In [15]:
# Calculate other analytics
skill_frame, edu_frame, lang_frame = skills_info(completeDict)
skill = skill_frame.sort_values(by='NumPostings', ascending = False)


In [16]:
data = [Bar(
            x=skill.Term,
            y=skill.NumPostings
    )]
layout = Layout(yaxis = dict(title = "Number of posts"), title = 'Top skills for Data Scientists')
fig = Figure(data=data, layout=layout)
iplot(fig, filename='basic-bar')

In [18]:
data = [Bar(
            x=edu_frame.Term,
            y=edu_frame.NumPostings
    )]
layout = Layout(yaxis = dict(title = "Number of posts"), title = 'Required/preferred education for a Data Scientist')
fig = Figure(data=data, layout=layout)
iplot(fig, filename='basic-bar')

# Measuring similarity between CV and Job posting
- Both are unstructured text. They can be reduced to lists of keywords, according to a pre-existing dictionary;
- The comparison uses the Jaccard similarity measure, defined as the ratio of the number of common categorical features to the total number of distinct features (i.e. |intersection(A,B)| / |union(A,B)|).
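
As a minimal sketch of the measure described above, the function below computes the Jaccard similarity of two keyword lists. The name `jaccard` and the example keywords are illustrative; whether `get_match` in `helperP3` computes the score in exactly this way is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / float(len(union))

cv = ['python', 'r', 'sql', 'excel']
post = ['python', 'sql', 'spark', 'excel', 'master']
score = jaccard(cv, post)  # 3 shared keywords out of 6 distinct ones
print(score)
```

Both inputs are converted to sets, so repeated keywords count once and the score always lies between 0 and 1.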

In [8]:
# 7- Find your match!

Diego = ['Data Scientist', 'PhD','French','Python','R','Matlab','Spark','SQL','Physics']
Linlin = ['Statistics','Python','R','Matlab','SQL', 'French','STATA','Economics','Master','Excel']
Sam = ['Python','R','Hadoop','SQL', 'Java', 'Javascript', 'Master', 'Excel','German']
Amy = ['Python','R','SQL', 'Bachelor', 'Excel','German','C++', 'SAS', 'Statistics']
# first parse the CV
DiegoCV = [item.lower() for item in Diego]
LinlinCV = [item.lower() for item in Linlin]
SamCV = [item.lower() for item in Sam]
AmyCV = [item.lower() for item in Amy]

BestMatch = get_match(AmyCV,finalDict)    



In [12]:

print('The top 10 companies matching my CV are:')
print(BestMatch.head(10))


The top 10 companies matching my CV are:
             Id                                 Company           Location  \
835  1886821278                              Home Depot        Atlanta, GA   
791  1924978071                          PCO Innovation       New York, NY   
48   1876919434                                    DBRS       New York, NY   
721  1849152045                          Liberty Mutual         Boston, MA   
123  1910294624                        Truth Initiative     Washington, DC   
573  1920913211                               Rover.com        Seattle, WA   
304  1895317223  Gemological Institute of America, Inc.       New York, NY   
867  1882318194                                 Nielsen  San Francisco, CA   
831  1836557681                            AlixPartners         Boston, MA   
440  1884200296                             J.P. Morgan       New York, NY   

     Similarity  
835    0.700000  
791    0.600000  
48     0.600000  
721    0.600000  
123    0.60

In [17]:
# Well, I do not believe it. What was matched?
tmp = finalDict[BestMatch.iloc[8,0]][5]

In [16]:
# Or check the website
GlassDoor = finalDict[BestMatch.iloc[7,0]][4]
# Quote the href value so that special characters in the URL do not break the link
HTML('<a href="' + GlassDoor + '">Check your next job!</a>')

# Coming Next:
- Scrape/parse a CV from LinkedIn and find the best match with the (updated) job list
- Improve the parsing to extract years of experience and more complicated keywords.
- Avoid CAPTCHA messages?
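
One possible starting point for the years-of-experience extraction is a regular expression. The pattern and the function name `min_years_experience` below are hypothetical, and the sketch only covers simple phrasings such as "3 years", "5+ years" or "2-4 yrs".

```python
import re

# Matches e.g. "3 years", "5+ years", "2-4 yrs"; captures the first (lower) number.
YEARS_RE = re.compile(r"(\d+)\s*(?:\+|-\s*\d+)?\s*(?:years?|yrs?)", re.IGNORECASE)

def min_years_experience(description):
    """Return the smallest number of years mentioned, or None if none is found."""
    years = [int(match) for match in YEARS_RE.findall(description)]
    return min(years) if years else None

print(min_years_experience("At least 3+ years of experience in Python; 5 years preferred."))
```

Taking the minimum treats the lowest figure as the requirement; a post that also mentions unrelated durations would need a stricter pattern.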