## Skills requested in Google job posts

### Introduction

Which **languages, skills, and experience** should we add to our toolbox to get a job at Google? Niyamat Ullah set out to answer this question by analyzing the **Google Jobs site**. Google publishes all of its job openings at `careers.google.com`, so he scraped the job data from that site by visiting every job page with a tool called **Selenium**, keeping only the job title, location, responsibilities, and minimum and preferred qualifications.

The data set contains posts for 1,250 jobs. The variables are:

* `company`: either Google or YouTube.

* `title`: the title of the job.

* `category`: the category of the job.

* `location`: the location of the job.

* `responsibilities`: the responsibilities for the job.

* `minqual`: minimum qualifications for the job.

* `prefqual`: preferred qualifications for the job.

### Importing the data

I load the CSV file with the data. The encoding is specified to prevent problems with systems which do not use UTF-8 (typically Windows). This is probably not needed here, but it may save trouble in other cases.

In [1]:
google = read.csv('https://raw.githubusercontent.com/iese-bad/DataSci/master/Data/skills_google.csv',
    stringsAsFactors=FALSE, encoding='UTF-8')

As usual, I check that the content of the file is as expected. Note that the longer strings (e.g. in the column `responsibilities`) are not fully printed.

In [2]:
str(google)

'data.frame':	1250 obs. of  7 variables:
 $ company         : chr  "Google" "Google" "Google" "Google" ...
 $ title           : chr  "Google Cloud Program Manager" "Supplier Development Engineer (SDE), Cable/Connector" "Data Analyst, Product and Tools Operations, Google Technical Services" "Developer Advocate, Partner Engineering" ...
 $ category        : chr  "Program Management" "Manufacturing & Supply Chain" "Technical Solutions" "Developer Relations" ...
 $ location        : chr  "Singapore" "Shanghai, China" "New York, NY, United States" "Mountain View, CA, United States" ...
 $ responsibilities: chr  "Shape, shepherd, ship, and show technical programs designed to support the work of Cloud Customer Engineers and"| __truncated__ "Drive cross-functional activities in the supply chain for overall Technical Operational readiness in all NPI ph"| __truncated__ "Collect and analyze data to draw insight and identify strategic solutions.\nBuild consensus by facilitating bro"| __truncated__

I also load the package `stringr`, whose functions I use in this example.

In [3]:
library(stringr)

### Exploring the company

I start my exploratory analysis with the company, which, for most of the jobs, is Google.

In [4]:
table(google[, 'company'])


 Google YouTube 
   1227      23 

### Exploring the titles

Many different titles are included in the data set. I extract the ten most frequent ones.

In [5]:
title = google[, 'title']
print(length(unique(title)))

[1] 794


In [6]:
sort(table(title), decreasing=TRUE)[1:10]

title
                      Business Intern 2018 
                                        35 
                   MBA Intern, Summer 2018 
                                        34 
                           MBA Intern 2018 
                                        28 
                  BOLD Intern, Summer 2018 
                                        21 
  Field Sales Representative, Google Cloud 
                                        17 
                      Interaction Designer 
                                        12 
                User Experience Researcher 
                                         9 
      Partner Sales Engineer, Google Cloud 
                                         7 
                                 Recruiter 
                                         7 
User Experience Design Intern, Summer 2018 
                                         7 

Interns seem to dominate the picture but, with 794 different titles, this quick view could be misleading. So I check other possibilities.

In [7]:
print(sum(str_detect(title, 'Intern')))

[1] 187
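
A note on what this count means: `str_detect` returns a logical vector and `sum` counts the `TRUE` values, but the match is on substrings, so `'Intern'` would also match a title such as "International Tax Manager". The minimal sketch below uses two titles from the data and that one hypothetical title (an assumption, not taken from the data set) to compare the plain pattern with one delimited by word boundaries (`\\b`).

```r
library(stringr)

# Two titles from the data plus one hypothetical title (assumption)
titles = c('MBA Intern, Summer 2018',
    'Interaction Designer',
    'International Tax Manager')

# Substring match: TRUE whenever 'Intern' appears anywhere in the title
plain = str_detect(titles, 'Intern')
print(plain)

# Whole-word match: \\b marks a word boundary
whole = str_detect(titles, '\\bIntern\\b')
print(whole)

# sum() counts the TRUE values in each logical vector
print(c(sum(plain), sum(whole)))
```

Whether the 187 matches in the data include such cases is not checked here; the sketch only shows how one could check.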


In [8]:
print(sum(str_detect(title, 'Sales')))

[1] 135


In [9]:
print(sum(str_detect(title, 'Cloud')))

[1] 277


In [10]:
print(sum(str_detect(title, 'Google Cloud')))

[1] 259


So far, the cloud dominates. To proceed more systematically, I extract a list of the most frequent tokens. Before the extraction, I clean the data by deleting all the expressions within parentheses. I use the function `str_replace_all` with the **regular expression** ` [(].+[)]`.

Some technicalities about this regular expression: 

1. In a regular expression, parentheses are used for grouping pieces of text. To refer to the parentheses themselves, I enclose them in square brackets.

2. The dot (.) stands for any character.

3. The plus sign (+) is a **quantifier**, meaning one or more occurrences of the preceding item.
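
One more detail may be worth seeing in action: `+` is greedy, so `.+` extends to the last closing parenthesis in the string. The sketch below applies the pattern to one title from the data and to one hypothetical title with two parenthesized expressions (an assumption, made up for illustration), and compares it with the lazy variant `.+?`.

```r
library(stringr)

# First title is from the data; the second is hypothetical (assumption)
titles = c('Supplier Development Engineer (SDE), Cable/Connector',
    'Program Manager (EMEA), Google Cloud (Contract)')

# Greedy: .+ runs from the first '(' to the last ')'
greedy = str_replace_all(titles, ' [(].+[)]', '')
print(greedy)

# Lazy: .+? stops at the first ')' it finds
lazy = str_replace_all(titles, ' [(].+?[)]', '')
print(lazy)
```

With a single pair of parentheses both patterns give the same result. They differ only when a title contains several pairs, in which case the greedy pattern also removes the text between them.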

In [11]:
title_short = str_replace_all(title, ' [(].+[)]', '')

Now I split the composite titles, whose parts are separated by a comma followed by a space. Since `str_split` returns a list, I use `unlist` to get a vector. I get 886 different tokens.

In [12]:
title_terms = unlist(str_split(title_short, ', '))
print(length(unique(title_terms)))

[1] 886
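
To make the role of `unlist` concrete, here is a small sketch on three titles taken from the data. It shows that `str_split` returns one list element per title and that `unlist` flattens them into a single character vector.

```r
library(stringr)

# Titles taken from the data
titles = c('Developer Advocate, Partner Engineering',
    'Data Analyst, Product and Tools Operations, Google Technical Services',
    'Recruiter')

# One list element per title, each a character vector of parts
parts = str_split(titles, ', ')
print(lengths(parts))

# unlist() flattens the list into a single character vector
terms = unlist(parts)
print(terms)
```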


Most of the jobs are for Google Cloud or for interns. The sales jobs seem to be very scattered, with many different titles.

In [13]:
sort(table(title_terms), decreasing=TRUE)[1:10]

title_terms
              Google Cloud                Summer 2018 
                       206                         76 
      Business Intern 2018                 MBA Intern 
                        51                         34 
           MBA Intern 2018  Google Technical Services 
                        32                         31 
         Consumer Hardware Field Sales Representative 
                        27                         26 
     Google Cloud Platform  Product Marketing Manager 
                        25                         23 

### Exploring categories

I apply the same approach to the categories. In the end, most of the jobs do not seem to call for techies.

In [14]:
sort(table(google[, 'category']), decreasing=TRUE)


      Sales & Account Management       Marketing & Communications 
                             168                              165 
                         Finance              Technical Solutions 
                             115                              101 
               Business Strategy                People Operations 
                              98                               86 
        User Experience & Design               Program Management 
                              84                               74 
                    Partnerships       Product & Customer Support 
                              60                               50 
    Legal & Government Relations                   Administrative 
                              46                               40 
                Sales Operations             Software Engineering 
                              31                               31 
            Hardware Engineering Real Estate & Workplace Serv

### Exploring countries

I extract the country from the location. The location has one to three components, separated by the string ', '. So I have to drop all the characters up to and including the last occurrence of that string. I use `str_replace_all` with a regular expression that stands for the string to be suppressed: any string ending with a comma followed by white space. Since the quantifier `+` is greedy, the match extends to the last comma.

In [15]:
country = str_replace_all(google[, 'location'], '.+, ', '')
print(length(unique(country)))

[1] 49
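
As a quick check of the pattern, the sketch below applies it to three locations taken from the data. It also shows what happens when the trailing space is left out of the pattern: a leading blank stays in the result, so that "Singapore" and " Singapore" would be counted as different countries.

```r
library(stringr)

# Locations taken from the data
location = c('Singapore', 'Shanghai, China', 'New York, NY, United States')

# Pattern ending with a comma and a space
country = str_replace_all(location, '.+, ', '')
print(country)

# Pattern ending with a comma only: a leading blank survives
padded = str_replace_all(location, '.+,', '')
print(padded)
```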


There are 49 countries but, as we see next, most of the job postings are for the United States.

In [16]:
sort(table(country), decreasing=TRUE)[1:10]

country
  United States         Ireland  United Kingdom         Germany       Singapore 
            638              87              62              54              41 
          China       Australia           Japan          Taiwan           India 
             38              35              31              30              28 

### Exploring responsibilities

To explore the content of the `responsibilities` column, I first convert everything to **lowercase**.

In [17]:
resp = google[, 'responsibilities']
resp = str_to_lower(resp)
print(resp[1])

[1] "shape, shepherd, ship, and show technical programs designed to support the work of cloud customer engineers and solutions architects.\nmeasure and report on key metrics tied to those programs to identify any need to change course, cancel, or scale the programs from a regional to global platform.\ncommunicate status and identify any obstacles and paths for resolution to stakeholders, including those in senior roles, in a transparent, regular, professional and timely manner.\nestablish expectations and rationale on deliverables for stakeholders and program contributors.\nprovide program performance feedback to teams in product, engineering, sales, and marketing (among others) to enable efficient cross-team operations."


Next, I extract the words. This leaves out **punctuation** and the **control character** '\n', which means new line and is used to separate paragraphs. I use `str_extract_all` with the regular expression `[a-z]+`, which stands for any sequence of lowercase letters. `str_extract_all` returns a list, every element of which is a **bag of words**, that is, a vector whose elements are words.

In [18]:
resp_terms = str_extract_all(resp, '[a-z]+')
print(is.list(resp_terms))

[1] TRUE


In [19]:
print(resp_terms[[1]])

  [1] "shape"        "shepherd"     "ship"         "and"          "show"        
  [6] "technical"    "programs"     "designed"     "to"           "support"     
 [11] "the"          "work"         "of"           "cloud"        "customer"    
 [16] "engineers"    "and"          "solutions"    "architects"   "measure"     
 [21] "and"          "report"       "on"           "key"          "metrics"     
 [26] "tied"         "to"           "those"        "programs"     "to"          
 [31] "identify"     "any"          "need"         "to"           "change"      
 [36] "course"       "cancel"       "or"           "scale"        "the"         
 [41] "programs"     "from"         "a"            "regional"     "to"          
 [46] "global"       "platform"     "communicate"  "status"       "and"         
 [51] "identify"     "any"          "obstacles"    "and"          "paths"       
 [56] "for"          "resolution"   "to"           "stakeholders" "including"   
 [61] "those"        "in"   
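
The pattern `[a-z]+` has some side effects that are easier to see on a small case. The sketch below uses a hypothetical sentence (an assumption, not taken from the data): digits are dropped, hyphenated words are split, and a possessive such as "google's" yields the one-letter token "s", which may explain why "s" shows up among the frequent terms.

```r
library(stringr)

# Hypothetical text (assumption), converted to lowercase as in the analysis
text = str_to_lower("Manage Google's 3 cross-team programs.\nReport on key metrics.")

# Every maximal run of lowercase letters becomes a token
words = str_extract_all(text, '[a-z]+')[[1]]
print(words)
```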

I join all the terms collected in the 1,250 bags of words into a single vector using `unlist`.

In [20]:
resp_terms = unlist(resp_terms)
print(is.vector(resp_terms))

[1] TRUE


This leaves me with a vector which contains 3,824 different terms.

In [21]:
print(length(unique(resp_terms)))

[1] 3824


In [22]:
sort(table(resp_terms), decreasing=TRUE)[1:50]

resp_terms
          and            to           the            of          with 
         9457          4303          2668          2233          2182 
          for        google            in      business             a 
         1372          1292          1247          1218          1185 
      product            on       develop         teams          work 
          968           870           779           768           755 
           as          team      partners     technical        manage 
          712           660           633           606           596 
     customer           our          that       partner       support 
          561           536           517           516           489 
      provide         drive         sales         cloud    management 
          478           474           473           441           440 
            s         their     customers          data     including 
          426           424           423           420           

Most of these terms are **stopwords**, that is, words that do not carry relevant information (and, to, the, etc). The leading topics seem to be development, teams and partners. To get a better picture, I should continue the analysis by dropping the stopwords and merging **synonyms** and variants of the same word (such as "team" and "teams"). I stop here.
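
For readers who want to go one step further, the cleanup described above could be sketched as follows. This is only an illustration under stated assumptions: the bag of words is a toy vector, the stopword list is a short hypothetical one (neither comes from the source), and plurals are merged with a crude rule that strips a final "s".

```r
# Toy bag of words (assumption); in the analysis this would be resp_terms
terms = c('and', 'develop', 'teams', 'to', 'team', 'the',
    'partners', 'team', 'partner')

# Short hypothetical stopword list (assumption, far from complete)
stopwords = c('and', 'to', 'the', 'of', 'with', 'for', 'in', 'a', 'on', 'as')

# Keep only the terms that are not stopwords
kept = terms[!(terms %in% stopwords)]

# Crude plural merge: strip a final "s" (it would also turn "sales" into "sale")
merged = sub('s$', '', kept)

# Frequency table of the cleaned terms, most frequent first
freq = sort(table(merged), decreasing=TRUE)
print(freq)
```

A real analysis would probably rely on a curated stopword list and a proper stemmer rather than on these shortcuts.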

### Exploring minimum qualifications

The analysis of the minimum qualifications follows the same lines.

In [23]:
minqual = google[, 'minqual']
minqual = str_to_lower(minqual)
print(minqual[1])

[1] "ba/bs degree or equivalent practical experience.\n3 years of experience in program and/or project management in cloud computing, enterprise software and/or marketing technologies."


In [24]:
minqual_terms = str_extract_all(minqual, '[a-z]+')
minqual_terms = unlist(minqual_terms)
print(length(unique(minqual_terms)))

[1] 1920


In [25]:
sort(table(minqual_terms), decreasing=TRUE)[1:50]

minqual_terms
   experience            or            in           and             a 
         3036          2478          2400          2304          1231 
           of    equivalent        degree     practical            to 
         1110          1063          1059           993           928 
           bs            ba         years          with           the 
          879           838           722           718           611 
   management       ability         field       related       working 
          413           363           341           321           313 
      program            as       english         speak      fluently 
          305           292           286           281           280 
idiomatically       science         write      computer     technical 
          278           276           276           273           265 
            s            be   engineering            an         sales 
          249           248           237           233        

Bad news for beginners: experience is the main thing. I check how often the leading programming languages are mentioned. Not often.

In [26]:
print(sum(str_detect(minqual_terms, 'sql')))

[1] 85


In [27]:
print(sum(str_detect(minqual_terms, 'javascript')))

[1] 77


In [28]:
print(sum(str_detect(minqual_terms, 'python')))

[1] 97
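
These counts are numbers of tokens that contain the search string, not numbers of job postings, and the match is on substrings. The sketch below, on three hypothetical qualification texts (an assumption, made up for illustration), compares three ways of counting "sql": tokens containing it, tokens equal to it, and postings that mention it as a whole word.

```r
library(stringr)

# Hypothetical qualification texts (assumption), already in lowercase
quals = c('experience with sql and python.',
    'knowledge of mysql or postgresql; sql scripting.',
    'fluent english.')

# Bag of words pooled over all postings, as in the analysis
terms = unlist(str_extract_all(quals, '[a-z]+'))

# Tokens containing 'sql' (also counts mysql and postgresql)
n_contains = sum(str_detect(terms, 'sql'))

# Tokens exactly equal to 'sql'
n_exact = sum(terms == 'sql')

# Postings that mention 'sql' as a whole word
n_posts = sum(str_detect(quals, '\\bsql\\b'))

print(c(n_contains, n_exact, n_posts))
```

Which of the three counts is appropriate depends on the question. To learn how many jobs ask for a language, counting postings seems the most natural choice.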


### Exploring preferred qualifications

In [29]:
prefqual = google[, 'prefqual']
prefqual = str_to_lower(prefqual)
print(prefqual[1])

[1] "experience in the business technology market as a program manager in saas, cloud computing, and/or emerging technologies.\nsignificant cross-functional experience across engineering, sales, and marketing teams in cloud computing or related technical fields.\nproven successful program outcomes from idea to launch in multiple contexts throughout your career.\nability to manage the expectations, demands and priorities of multiple internal stakeholders based on overarching vision and success for global team health.\nability to work under pressure and possess flexibility with changing needs and direction in a rapidly-growing organization.\nstrong organization and communication skills."


In [30]:
prefqual_terms = str_extract_all(prefqual, '[a-z]+')
prefqual_terms = unlist(prefqual_terms)
print(length(unique(prefqual_terms)))

[1] 3205


In [31]:
sort(table(prefqual_terms), decreasing=TRUE)[1:50]

prefqual_terms
           and             to             in           with     experience 
          6496           2991           2501           2419           2308 
       ability             of              a         skills             or 
          1856           1655           1654           1461           1432 
           the     management       business   demonstrated             as 
          1379            720            680            640            582 
     excellent           work  communication    environment      technical 
           570            549            546            520            476 
        strong     analytical        working          cloud           data 
           472            446            440            395            394 
           for      knowledge     technology          sales  understanding 
           389            371            371            359            326 
             s      effective        project         google         degre

Now experience and ability lead. Nothing new about the languages.

In [32]:
print(sum(str_detect(prefqual_terms, 'sql')))

[1] 84


In [33]:
print(sum(str_detect(prefqual_terms, 'javascript')))

[1] 66


In [34]:
print(sum(str_detect(prefqual_terms, 'python')))

[1] 79
