# Webscraping Glassdoor.com: Will I ever find a job?


The project, in a nutshell:
- Creating a webdriver for Glassdoor;
- Gathering job posts containing:
  - Company, location, rating, URL;
  - Job description;
- Parsing the data into a dictionary, according to a number of relevant categories;
- Gathering analytics for the Data Scientist market in the US;
- Measuring the "similarity" between a job post and a "simplified" résumé.

In [125]:
# Load libraries
%matplotlib inline
from collections import Counter  # used below to count locations and companies
from plotly.graph_objs import Bar, Scatter, Figure, Layout
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import pandas as pd # For converting results to a dataframe and bar chart plots
import numpy as np
import pickle
from IPython.core.display import HTML

# load helper
from helperP3 import *

init_notebook_mode(connected=True)

In [126]:
# load data
jobDict = load_obj('glassDoorDict')
# 6- Analytics: first check for consistency.
# Keep only the postings that have all 6 expected fields.
completeDict = {key: value for key, value in jobDict.items() if len(value) == 6}
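To illustrate the consistency check, here is a minimal sketch on toy records. The field order (title, rating, company, location, URL, description tokens) is an assumption inferred from the indices used later in this notebook, the sample values are invented, and `keep_complete` is a hypothetical name.

```python
# Toy records keyed by job id; field order is assumed, values are invented.
toy_jobs = {
    101: ['Data Scientist', '4.1', 'Acme', 'Boston, MA',
          'http://example.com/101', ['python', 'sql']],
    102: ['Data Scientist', 'Acme', 'Boston, MA'],  # incomplete scrape: only 3 fields
}

def keep_complete(jobs, n_fields=6):
    """Return only the records that have exactly n_fields entries."""
    return {job_id: rec for job_id, rec in jobs.items() if len(rec) == n_fields}

complete = keep_complete(toy_jobs)
print(sorted(complete))
```

A length check only catches missing fields: a record whose fields are all present but shifted into the wrong positions would still pass.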

In [127]:
# Calculate top locations  

location_dict = Counter()
location_dict.update([completeDict[item][3] for item in completeDict.keys()])    
location_frame = pd.DataFrame(location_dict.items(), columns = ['Term', 'NumPostings']) \
                 .sort_values(by='NumPostings', ascending = False).head(10)
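As a cross-check that does not need pandas, `Counter.most_common` from the standard library produces the same kind of ranking. A minimal sketch with invented locations (not the scraped data):

```python
from collections import Counter

# Invented sample of posting locations, for illustration only.
locations = ['San Francisco, CA', 'New York, NY', 'San Francisco, CA',
             'Seattle, WA', 'New York, NY', 'San Francisco, CA']

location_counts = Counter(locations)
# most_common(n) returns a list of (item, count) pairs, highest count first.
top_two = location_counts.most_common(2)
print(top_two)
```

The DataFrame route used in the notebook is still convenient for plotting, since the sorted columns can be passed straight to `Bar`.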




In [128]:
data = [Bar(
            x=location_frame.Term,
            y=location_frame.NumPostings
    )]
layout = Layout(yaxis = dict(title = "Number of posts"), title = 'Top 10 cities in US')
fig = Figure(data=data, layout=layout)
iplot(fig,filename='basic-bar')

In [129]:
# Calculate top companies - (company, rating) , Num posting

company_dict = Counter()
company_dict.update([(completeDict[item][2],completeDict[item][1]) for item in completeDict.keys()])
company_frame = pd.DataFrame(company_dict.items(), columns = ['Term', 'NumPostings'])\
                  .sort_values(by='NumPostings', ascending = False).head(20)

tmp = pd.DataFrame(company_dict.keys(),columns = ['Company','Rating'])\
                        .sort_values(by='Rating', ascending = False).head(20)


# Take the names from the sorted top-20 frame so the x labels line up with NumPostings
company, company_rating = zip(*company_frame.Term)

In [130]:
data = [Bar(
            x=company,
            y=company_frame.NumPostings
    )]
layout = Layout(yaxis = dict(title = "Number of posts"), title = 'Top 20 employers on Glassdoor')
fig = Figure(data=data, layout=layout)
iplot(fig,filename='basic-bar')

In [131]:
# Calculate other analytics
skill_frame, edu_frame, lang_frame = skills_info(completeDict)
skill = skill_frame.sort_values(by='NumPostings', ascending = False)


In [132]:
data = [Bar(
            x=skill.Term,
            y=skill.NumPostings
    )]
layout = Layout(yaxis = dict(title = "Number of posts"), title = 'Top skills for Data Scientists')
fig = Figure(data=data, layout=layout)
iplot(fig, filename='basic-bar')

In [133]:
# 7- Find your match!

DiegoCV = ['Data Scientist', 'PhD','French','Python','R','Matlab','Spark','SQL','Physics']
Linlin = ['Statistics','Python','R','Matlab','SQL', 'French','STATA','Economics','Master','Excel']
# First normalize both CVs to lowercase
myCV = [item.lower() for item in DiegoCV]
linlinCV = [item.lower() for item in Linlin]
BestMatch = get_match(linlinCV,completeDict)    
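The helper `get_match` is not shown in this notebook, so its exact scoring is unknown. One plausible way to score a post against a keyword list is the Jaccard index between the CV keywords and the terms detected in the post. The sketch below is an assumption, not the helper's actual implementation; `jaccard`, `rank_posts`, and the toy posts are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two keyword collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))

def rank_posts(cv, posts, top=5):
    """Return the `top` (job_id, score) pairs, best match first."""
    cv = [term.lower() for term in cv]
    scores = [(job_id, jaccard(cv, terms)) for job_id, terms in posts.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]

# Invented posts: job id -> terms detected in the description.
toy_posts = {
    1: ['python', 'sql', 'r'],
    2: ['java', 'scala'],
    3: ['python', 'sql', 'r', 'spark'],
}
ranking = rank_posts(['Python', 'R', 'SQL'], toy_posts)
print(ranking)
```

With this measure, a post that lists many extra terms is penalized even when it contains every CV keyword, which may or may not be the desired behavior.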



In [134]:

print('The top 5 companies matching my CV are:')
print(BestMatch)


The top 5 companies matching my CV are:
             Id       Company           Location  Similarity
675  1915307177       Twitter  San Francisco, CA    0.666667
691  1791204320          Uber  San Francisco, CA    0.600000
28   1518243809        Fitbit  San Francisco, CA    0.555556
640  1837694688  New York, NY                14d    0.555556
609  1871161850        Dia&Co       New York, NY    0.555556


In [135]:
# Well, I do not believe it. What is the first posting about?
completeDict[BestMatch.iloc[0,0]][5]

['super',
 'help',
 'scrappy',
 'executive',
 'pursuant',
 'results',
 'utilize',
 'employer.',
 'scientist',
 'questions',
 'committed',
 'go',
 'skill',
 'decisions',
 'resourceful',
 'consider',
 'impact',
 'partners',
 'based',
 'monitor',
 'statistics',
 'disability',
 'unanswered',
 'writing',
 'environment',
 'arrest',
 'presentation',
 'collaborate',
 'finance',
 'exceptional',
 'python',
 'big',
 'internet',
 'masters',
 'dashboards',
 'finding',
 'qualified',
 'interpreting',
 'closely',
 'hands',
 'using',
 'years',
 'advanced',
 'applicants',
 'execute',
 'like',
 'success',
 'gender',
 'undergrad',
 'large',
 'banking',
 'race',
 'math',
 'team',
 'small',
 'consulting',
 'passionate',
 'statistical',
 'findings',
 'twitter.',
 'twitter',
 'group.',
 'related',
 'sex',
 'analytics',
 'growth',
 'ancestry',
 'sexual',
 'investment',
 'ability',
 'records.',
 'degree',
 'driving',
 'product.',
 '+',
 'looking',
 'religion',
 'protected',
 'conduct',
 'experience',
 'across',

In [136]:
# Or check the website
GlassDoor = completeDict[BestMatch.iloc[0,0]][4].encode('ascii')
HTML('<a href="' + GlassDoor + '">Check your next job!</a>')

# Coming Next:
- Scraping/parsing a CV from LinkedIn and finding the best match in the (updated) job list
- Improve the parsing to extract years of experience and more complicated keywords.
- Avoid CAPTCHA messages?
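
As a starting point for the years-of-experience item, a regular expression can pull out phrases such as "3+ years" or "2-4 yrs". This is only a rough sketch: the pattern and the name `min_years` are hypothetical, and real postings will need more cases (for example, numbers written as words).

```python
import re

# Matches e.g. "3 years", "5+ years", "2-4 yrs"; the pattern is illustrative only.
YEARS_RE = re.compile(r'(\d+)\s*(?:\+|-\s*\d+)?\s*(?:years?|yrs?)\b', re.IGNORECASE)

def min_years(text):
    """Return the smallest number of years mentioned in text, or None."""
    found = [int(m.group(1)) for m in YEARS_RE.finditer(text)]
    return min(found) if found else None

print(min_years('Requires 3+ years of Python and 5 years of SQL experience.'))
```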