# Group Project: Indeed NLP Analysis w/ Scrapy

Today we will break into groups to look at the global data science job market on Indeed. You will extend an existing spider project and work with NLP.

The following insights are required from all Groups:

 - Top hiring companies
 - Counts of "skill" keywords (e.g., Statistics, Python, Machine Learning, Big Data)
 - Prediction of "data scientist" job titles against job titles not labeled "data scientist"
   - Capture your model's predicted probabilities, add them to your dataframe, and sort to see the top predicted jobs that are not labeled "data scientist". Extend your analysis from there to see which titles are most common or most likely. What insights does this give you?
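
One way to capture the probabilities and rank the postings that are not labeled "data scientist" is sketched below. The toy dataframe, the `is_ds` and `ds_probability` column names, and the choice of a `CountVectorizer` plus `LogisticRegression` pipeline are assumptions made for illustration. Your real rows come from the spider output.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy postings; in the project these rows come from the spider.
jobs = pd.DataFrame({
    'title': ['Data Scientist', 'Data Scientist', 'Machine Learning Researcher',
              'Office Manager', 'Data Scientist', 'Receptionist'],
    'summary': ['python statistics machine learning models',
                'machine learning python big data',
                'research machine learning models in python',
                'schedule meetings and order office supplies',
                'statistics python predictive models',
                'answer phones and greet office visitors'],
})

# 1 when the title contains "data scientist" (case-insensitive), else 0.
jobs['is_ds'] = jobs['title'].str.contains('data scientist', case=False).astype(int)

# Bag-of-words features feeding a logistic regression.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(jobs['summary'], jobs['is_ds'])

# Column 1 of predict_proba is P(is_ds == 1); store it next to the original rows.
jobs['ds_probability'] = model.predict_proba(jobs['summary'])[:, 1]

# Rank the postings that are NOT labeled "data scientist" by that probability.
lookalikes = jobs[jobs['is_ds'] == 0].sort_values('ds_probability', ascending=False)
print(lookalikes[['title', 'ds_probability']])
```

The sketch scores the same rows it was trained on, which is enough to show the mechanics. For the real analysis, consider scoring held-out rows so the ranking is not inflated by overfitting.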
 
BONUS:
 - Perform LDA on job summaries.
 
Advice:
 - Create a binary feature that is 1 if the title is "Data Scientist" and 0 otherwise.
 - Develop an XPath expression in the spider that extracts the company name.
 - Set DOWNLOAD_DELAY to 4 in your settings.py file and debug your queries first. Remove the delay only once you are ready to scrape the whole site.
 - Use CountVectorizer, and compare with TfidfVectorizer.
 - Use the vectorized summary as your X and the 0/1 feature from your dataframe as your **y**.
 - LogisticRegression is a good place to start with modeling.
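
The vectorizer comparison suggested above can be sketched as follows. The eight toy postings, the 2-fold cross-validation, and the use of accuracy as the metric are assumptions chosen to keep the example small. They are not part of the assignment.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy postings; replace with the dataframe built from the spider.
jobs = pd.DataFrame({
    'title': ['Data Scientist', 'Senior Data Scientist', 'Data Scientist, Analytics',
              'Data Scientist', 'Office Manager', 'Receptionist',
              'Sales Executive', 'Account Manager'],
    'summary': ['python statistics machine learning',
                'machine learning models big data python',
                'statistics analytics python sql',
                'predictive models machine learning statistics',
                'schedule meetings order office supplies',
                'answer phones greet office visitors',
                'sell products meet sales targets',
                'manage client accounts sales meetings'],
})

# y: 1 if the title contains "data scientist", else 0. X: the raw summary text.
y = jobs['title'].str.contains('data scientist', case=False).astype(int)
X = jobs['summary']

scores = {}
for name, vectorizer in [('count', CountVectorizer()), ('tfidf', TfidfVectorizer())]:
    # Putting the vectorizer inside the pipeline refits it on each training fold,
    # so the held-out rows never influence the vocabulary.
    pipeline = make_pipeline(vectorizer, LogisticRegression())
    scores[name] = cross_val_score(pipeline, X, y, cv=2, scoring='accuracy').mean()

print(scores)
```

On real data, use more folds and consider ROC AUC, since "data scientist" titles are likely to be a minority of the postings.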

Use this spider to start your analysis. Be mindful of the default request rate in your settings file!
https://gist.github.com/dyerrington/902b13d3b128cd211b5059039714e798
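
For reference, the delay lives in the project's `settings.py`. This is a minimal fragment rather than a full settings file, and the rest of the linked spider's settings are not reproduced here.

```python
# settings.py (fragment)

# Seconds Scrapy waits between consecutive requests to the same site.
# Scrapy's built-in default is 0 (no delay), so set it explicitly while debugging.
DOWNLOAD_DELAY = 4

# The same setting can be overridden for a single run from the command line,
# for example: scrapy crawl <spider_name> -s DOWNLOAD_DELAY=4
# (<spider_name> is a placeholder for whatever the spider is called).
```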


In [56]:
# PREPARE REQUIRED LIBRARIES
import requests
from scrapy.selector import Selector
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
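# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# both classes now live in sklearn.model_selection.
from sklearn.model_selection import StratifiedKFold, GridSearchCV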
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc

# this line tells jupyter notebook to put the plots in the notebook rather than saving them to file.
%matplotlib inline

# this line makes plots prettier on mac retina screens. If you don't have one it shouldn't do anything.
%config InlineBackend.figure_format = 'retina'

plt.style.use('fivethirtyeight')
pd.options.display.max_rows = 999

## Group 1:  San Francisco

Work with the New York group to investigate which skills are important in each market and, most importantly, why they might be. Are there any other differences between New York and San Francisco that you can draw on?

Otherwise, complete the required insights for your presentation. Pick someone to present who hasn't presented during a group activity yet.


## Group 2:  New York

Work with the San Francisco group and use their data to help classify data scientist jobs in your regional market. Is there anything different between New York and San Francisco in terms of what is predicted outside of the "data scientist" job title?

Otherwise, complete the required insights for your presentation. Pick someone to present who hasn't presented during a group activity yet.


## Group 3:  United States

Work with the International group to compare skill keywords. Is there anything different about the US compared to markets outside the US?  Are some job requirements more emphasised than others?
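
A minimal sketch of comparing skill keywords across two markets with pandas. The skill list, the toy summaries, and the `skill_counts` helper are hypothetical. Swap in each group's scraped `summary` column.

```python
import re

import pandas as pd

# Hypothetical skill list; extend it with whatever keywords your group tracks.
SKILLS = ['python', 'statistics', 'machine learning', 'big data', 'sql']


def skill_counts(summaries, skills=SKILLS):
    """Return the number of postings that mention each skill at least once."""
    counts = {}
    for skill in skills:
        # Word boundaries stop 'sql' from matching inside words such as 'MySQL'.
        pattern = r'\b' + re.escape(skill) + r'\b'
        counts[skill] = int(summaries.str.contains(pattern, case=False, regex=True).sum())
    return pd.Series(counts)


# Hypothetical summaries standing in for each group's scraped data.
us = pd.Series(['Python and SQL for big data pipelines',
                'Statistics and machine learning with Python',
                'MySQL administration'])
international = pd.Series(['Machine learning research in Python',
                           'Big data engineering'])

comparison = pd.DataFrame({'us': skill_counts(us),
                           'international': skill_counts(international)})

# Divide by the number of postings so markets of different sizes are comparable.
comparison['us_share'] = comparison['us'] / len(us)
comparison['international_share'] = comparison['international'] / len(international)
print(comparison)
```

Comparing shares rather than raw counts matters here because the markets will almost certainly differ in how many postings were scraped.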

## Group 4:  International

Generally, you can work with any of the other groups but you must find at least one aspect of data science job market comparison with other groups in order to complete your presentation.  Consider working with the United States group.

You may focus on a single market outside the US. London appears to be the biggest market.

In [2]:
london = pd.read_csv('indeed_jobs_london.csv')

In [18]:
london.head()

Unnamed: 0,company,title,summary
0,\n\n\n McLaren\n,Data Scientist,\nData Scientists at McLaren Applied Technolog...
1,\n\n Oliver Bernard\n,"Data Scientist - Python, Machine learning - Fi...","\nData Scientist (Python, Machine learning). D..."
2,\n\n\n Facebook\n,"Data Scientist, Analytics",\nThe Data Scientist Analytics role has work a...
3,\n\n Stratagem Technologies Limited\n,Machine Learning Researcher,\nWe are seeking talented and motivated data s...
4,\n\n\n Vodafone\n,Data Scientist,\nWe are looking for an experience Data Scient...


In [19]:
london.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 3 columns):
company    3956 non-null object
title      3960 non-null object
summary    3960 non-null object
dtypes: object(3)
memory usage: 92.9+ KB


In [20]:
london.isnull().sum()

company    4
title      0
summary    0
dtype: int64

In [42]:
london_wip = london.dropna()

In [43]:
# .ix was removed in pandas 1.0; .loc is the label-based equivalent.
london.loc[0, 'company'].strip()

'McLaren'

In [44]:
# Strip the newlines and spaces the scraper left around each text field.
# Taking an explicit copy first avoids the SettingWithCopyWarning that pandas
# raises when assigning into the result of dropna().
london_wip = london_wip.copy()
for col in ['company', 'title', 'summary']:
    london_wip[col] = london_wip[col].str.strip()


In [46]:
london_wip.groupby('company').count().reset_index()

Unnamed: 0,company,title,summary
0,A C Human Resources Ltd,4,4
1,ABILITY Resorucing Ltd,12,12
2,ACR Solutions Ltd.,4,4
3,ACS Performance,4,4
4,AECOM,4,4
5,AIG,12,12
6,AKQA,4,4
7,AMS Contingent,4,4
8,AS Watson Group,4,4
9,ASI,4,4


In [58]:
# Case-insensitive substring -> canonical name. Checked first, in this order.
KEYWORD_NAMES = [
    ('amazon', 'Amazon'),
    ('xcede', 'Xcede'),
    ('ikas', 'iKas'),
    ('vitality', 'Vitality'),
    ('berkeley', 'Berkeley Square'),
    ('fresh minds', 'Fresh Minds Talent'),
    ('big wednesday', 'Big Wednesday'),
    ('clear cube', 'ClearCube Consulting'),
    ('digital gurus', 'Digital Gurus Recruitment'),
    ('pcr', 'PCR'),  # also covers 'PCR Digital' and 'PCR Recruitment Limited'
]

# Case-sensitive substring -> canonical name. Checked only if no keyword matched.
EXACT_NAMES = [
    ('University College London (UCL)', 'University College London'),
    ('UCB S.A.', 'UCB'),
    ('Stratagem Technologies Limited', 'Stratagem Technologies Ltd'),
    ('Salt Recruitment', 'Salt'),
    ('ResourceFlow Recruitment Ltd', 'Resource Flow'),
    ('Quant Capital Consulting Ltd', 'Quant Capital'),
    ('ONE Campaign', 'ONE'),
    ('Nicoll Curtin Limited', 'Nicoll Curtin'),
    ('Nicoll Curtin Technology', 'Nicoll Curtin'),
    ('Networking People (UK) Limited', 'Networking People'),
    ('Mortimer Spinks (Manchester)', 'Mortimer Spinks'),
]


def clean_companies(company):
    """Collapse spelling variants of a company name into one canonical label."""
    lowered = company.lower()
    for keyword, name in KEYWORD_NAMES:
        if keyword in lowered:
            return name
    for fragment, name in EXACT_NAMES:
        if fragment in company:
            return name
    return company

In [59]:
# .assign() returns a new DataFrame, so this does not raise SettingWithCopyWarning.
london_wip = london_wip.assign(company=london_wip['company'].map(clean_companies))


In [63]:
london_wip.groupby('company').count().reset_index().sort_values('title',ascending=False)[0:10]

Unnamed: 0,company,title,summary
430,Xcede,196,196
153,Harnham,172,172
67,CK Science,96,96
374,TechNET IT Recruitment Ltd,68,68
401,UCB,64,64
83,Client Server,60,60
187,JPMorgan Chase,60,60
176,Imperial College London,44,44
17,Aimia,44,44
196,King.com,36,36
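
Every count in the table above is a multiple of 4, which may mean the spider collected each posting more than once. That is only a guess from the numbers, and the notebook does not confirm it. A minimal sketch of checking for and dropping exact duplicate rows before counting, using a hypothetical stand-in frame instead of `london_wip`:

```python
import pandas as pd

# Hypothetical stand-in for london_wip; the second row repeats the first exactly.
jobs = pd.DataFrame({
    'company': ['Xcede', 'Xcede', 'Harnham', 'Xcede'],
    'title': ['Data Scientist', 'Data Scientist', 'Data Analyst', 'Data Engineer'],
    'summary': ['Build models.', 'Build models.', 'Analyse data.', 'Build pipelines.'],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = int(jobs.duplicated().sum())

# Keep the first copy of each posting, then count postings per company.
deduped = jobs.drop_duplicates().reset_index(drop=True)
top_companies = deduped['company'].value_counts().head(10)

print(n_duplicates)
print(top_companies)
```

If the real data does contain repeats, the ranking of top hiring companies may stay the same, but the counts will be closer to the number of distinct postings.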
