# Group Project:  Indeed NLP Analysis w/ Scrapy

Today we will break into groups to look at the global data science job market on Indeed.  You will extend an existing spider project and work with NLP.

The following insights are required from all Groups:

 - Top hiring companies
 - Counts of "skill" keywords (e.g., Statistics, Python, Machine Learning, Big Data)
 - Prediction of "data scientist" job titles against job titles not labeled "data scientist"
   - Capture your model's predicted probabilities, put them back in your dataframe, and sort to see the top predicted jobs that are not labeled "data scientist".  Extend your analysis from here to see which titles are most common or most likely.  What insights does this give you?
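
One way the probability step could look is sketched below. The rows are made-up stand-ins for scraped postings, and the column name `p_data_scientist` is an assumption introduced for illustration; the overall flow (vectorize, fit, `predict_proba`, write back, filter, sort) is what the bullet describes.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the scraped results (hypothetical rows, not real Indeed data).
jobs = pd.DataFrame({
    "title": [
        "Data Scientist",
        "Senior Data Scientist",
        "Machine Learning Engineer",
        "Statistician",
        "Office Manager",
        "Warehouse Associate",
    ],
    "summary": [
        "python machine learning statistics models",
        "machine learning python big data models",
        "python machine learning models in production",
        "statistics models and experiments",
        "schedule meetings and order office supplies",
        "load trucks and track inventory",
    ],
})

# 1 if the title contains "data scientist", else 0.
jobs["data_scientist"] = (
    jobs["title"].str.lower().str.contains("data scientist").astype(int)
)

X = CountVectorizer(stop_words="english").fit_transform(jobs["summary"])
y = jobs["data_scientist"]

model = LogisticRegression().fit(X, y)

# Column 1 of predict_proba is P(class == 1), since model.classes_ is [0, 1].
jobs["p_data_scientist"] = model.predict_proba(X)[:, 1]

# Jobs NOT labeled "data scientist", ranked by how much they look like one.
lookalikes = (
    jobs[jobs["data_scientist"] == 0]
    .sort_values("p_data_scientist", ascending=False)
)
print(lookalikes[["title", "p_data_scientist"]])
```

From `lookalikes`, something like `value_counts()` on the titles would show which non-"data scientist" titles turn up most often near the top.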
 
BONUS:
 - Perform LDA on job summaries.
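
A minimal sketch of LDA on summaries, assuming scikit-learn's `LatentDirichletAllocation` on word counts. The summaries and the choice of two topics are placeholders, not values from the scraped data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical summaries standing in for the scraped `summary` column.
summaries = [
    "python machine learning models statistics",
    "machine learning python deep learning models",
    "statistics experiments hypothesis testing models",
    "sql dashboards reporting business stakeholders",
    "business reporting sql excel dashboards",
    "stakeholders reporting dashboards business metrics",
]

cvec = CountVectorizer(stop_words="english")
counts = cvec.fit_transform(summaries)

# n_components is the number of topics; 2 is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_documents, n_topics)

# Recover the words in column order from the fitted vocabulary mapping.
words = sorted(cvec.vocabulary_, key=cvec.vocabulary_.get)

# Print the five highest-weighted words for each topic.
for topic_idx, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[::-1][:5]]
    print(topic_idx, top_words)
```

Each row of `doc_topics` is that summary's mix of topics, so it can be joined back to the dataframe the same way as the predicted probabilities.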
 
Advice:
 - Create a feature that takes the value 0 or 1 if the title is "Data Scientist".
 - Add an XPath expression to the spider that extracts the company name.
 - Set DOWNLOAD_DELAY in your settings.py file to 4 and debug your queries first.  Remove the delay only once you are ready to scrape the whole site.
 - Use CountVectorizer, and compare with TfidfVectorizer
 - Vectorize the summary as your X, and the 0 / 1 feature from your dataframe as your **y**
 - LogisticRegression is a good place to start with modeling.
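
The vectorizer comparison in the advice above could be set up roughly like this. The summaries and labels are invented for the example, and 3-fold cross-validation is an assumption; with real data, `summaries` would be the summary column and `labels` the 0 / 1 title feature.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical summaries with 0/1 labels (1 = title contains "data scientist").
summaries = [
    "python machine learning models and statistics",
    "machine learning pipelines in python",
    "statistics experiments and predictive models",
    "big data machine learning with spark",
    "predictive models python and big data",
    "deep learning research and statistics",
    "schedule meetings and order office supplies",
    "sales quotas and client meetings",
    "load trucks and track warehouse inventory",
    "legal filings and document review",
    "customer support tickets and phone calls",
    "payroll processing and office budgets",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

vectorizers = [
    ("count", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfVectorizer(stop_words="english")),
]

scores = {}
for name, vectorizer in vectorizers:
    # The pipeline refits the vectorizer inside each fold, so the
    # vocabulary never sees the held-out summaries.
    pipeline = make_pipeline(vectorizer, LogisticRegression())
    scores[name] = cross_val_score(pipeline, summaries, labels, cv=3).mean()

print(scores)
```

Putting the vectorizer inside the pipeline is a design choice: it keeps the comparison fair because both models are scored on text they did not help build a vocabulary from.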

Use this spider to start your analysis.  Be mindful of the default request rate in your settings file!
https://gist.github.com/dyerrington/902b13d3b128cd211b5059039714e798
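
For reference, the delay mentioned in the advice is a plain assignment in the Scrapy project's `settings.py`. The fragment below is a sketch of the debugging setup only; the rest of the file from the gist is not reproduced here.

```python
# settings.py (fragment)

# Seconds to wait between consecutive requests to the same site.
# Keep this at 4 while debugging queries, as suggested in the advice above.
DOWNLOAD_DELAY = 4
```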


## Group 1:  San Francisco

Work with the New York group to investigate which skills are important in each market and, most importantly, why they might be.  Is there anything else that differs between New York and San Francisco that you can draw on?

Otherwise, complete the required insights for your presentation.  Pick someone to present that hasn't presented during a group activity yet.


## Group 2:  New York

Work with the San Francisco group and use their data to help classify data scientist jobs in your regional market.  Is there anything different between New York and San Francisco in terms of what is predicted outside of the "data scientist" job title?
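
One possible reading of "use their data" is to fit on San Francisco postings and score New York postings with the same vocabulary. The sketch below assumes that reading; the rows and the `label` helper are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical rows for each market (not real scraped data).
sf = pd.DataFrame({
    "title": ["Data Scientist", "Senior Data Scientist",
              "Data Engineer", "Account Executive"],
    "summary": ["python machine learning models experiments",
                "machine learning statistics python",
                "python pipelines and sql warehouses",
                "sales quotas and client meetings"],
})
ny = pd.DataFrame({
    "title": ["Quantitative Analyst", "Data Scientist", "Paralegal"],
    "summary": ["statistics python models for trading",
                "machine learning python recommendation models",
                "legal filings and document review"],
})

def label(titles):
    """Return 1 where the title contains 'data scientist', else 0."""
    return titles.str.lower().str.contains("data scientist").astype(int)

cvec = CountVectorizer(stop_words="english")
X_sf = cvec.fit_transform(sf["summary"])  # vocabulary learned from SF only
X_ny = cvec.transform(ny["summary"])      # NY mapped onto the same columns

model = LogisticRegression().fit(X_sf, label(sf["title"]))

# Score NY postings with the SF-trained model.
ny["p_data_scientist"] = model.predict_proba(X_ny)[:, 1]
print(ny.sort_values("p_data_scientist", ascending=False)[["title", "p_data_scientist"]])
```

Words that appear only in New York summaries are dropped by `transform`, which is itself a hint about where the two markets use different language.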

Otherwise, complete the required insights for your presentation.  Pick someone to present that hasn't presented during a group activity yet.


## Group 3:  United States

Work with the International group to compare skill keywords. Is there anything different about the US compared to markets outside the US?  Are some job requirements more emphasised than others?
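
A skill-keyword comparison between two markets could be sketched as follows. The skill list and summaries are placeholders; the idea is to compare the share of postings mentioning each skill rather than raw counts, since the markets will have different numbers of postings.

```python
import pandas as pd

# Hypothetical skill list and summaries; swap in the scraped data per market.
skills = ["python", "statistics", "machine learning", "big data", "sql"]

markets = {
    "US": pd.Series([
        "Python and machine learning with big data tools",
        "Statistics, SQL and Python for reporting",
    ]),
    "International": pd.Series([
        "Machine learning research using Python",
        "SQL reporting and statistics",
        "Statistics for clinical trials",
    ]),
}

rows = {}
for market, summaries in markets.items():
    lowered = summaries.str.lower()
    # Share of postings that mention each skill (plain substring match).
    rows[market] = {
        skill: lowered.str.contains(skill, regex=False).mean()
        for skill in skills
    }

# Skills as rows, markets as columns.
comparison = pd.DataFrame(rows)
print(comparison)
```

Substring matching is crude (for example, "sql" also matches "nosql"), so treat the shares as a first pass.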

In [17]:
import pandas as pd
df = pd.read_csv("results.csv")
print(df.shape)

df.head()



(2128, 2)


|   | title | summary |
|---|---|---|
| 0 | Data Scientist / Machine Learning Scientist | Data Scientist / Machine Learning Scientist. S... |
| 1 | Research Engineer* (Maplewood, MN) Job | \nData handling and analytics. The Research En... |
| 2 | Scientist - Analytical R&D | \nThe Associate Scientist / Scientist will pro... |
| 3 | Data Scientist - Eden Prairie, MN | \nData Scientists work closely with the Busine... |
| 4 | Senior Statistician | \nProvide thought leadership and project leade... |


In [20]:
summary_only = df.summary.values
print(summary_only[:5])

[ 'Data Scientist / Machine Learning Scientist. Strong knowledge of computer vision, sensor data fusion and analytics, and machine learning algorithms with proven...'
 '\nData handling and analytics. The Research Engineer will be part of a team of scientists and technicians using approximately 90 weathering machines to study 3M...'
 '\nThe Associate Scientist / Scientist will provide technical support to assigned projects, using robust scientific methods which comply with standard operating...'
 '\nData Scientists work closely with the Business, data stewards, scrum masters, project managers, and other software teams to turn data into actionable signals...'
 '\nProvide thought leadership and project leadership to develop advanced analytic methodologies to detect patterns in mid- and large-scale clinical, product...']


In [39]:
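# ======================================  count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english')
cvec.fit(summary_only)

# One row per summary, one column per vocabulary word.
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names().
df2  = pd.DataFrame(cvec.transform(summary_only).toarray(),
              columns=cvec.get_feature_names_out())

# NOTE: sort_values(0) ranks words by their count in the first summary (row 0),
# not by their frequency across all summaries.
ranked = df2.transpose().sort_values(0, ascending=False)
X = ranked.head(30).transpose()

# Totals across all summaries for the first 10 of those words.
ranked.head(10).transpose().sum()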



data          2403
scientist      701
machine        622
learning       568
sensor         135
algorithms     245
fusion         130
proven         137
strong         184
computer       185
dtype: int64

In [28]:
df['data_scientist'] = df['title'].map(lambda x: 1 if 'data scientist' in x.lower() else 0)

In [38]:
y = df['data_scientist'].values
print(y.shape)

(2128,)


In [31]:
X

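|   | data | scientist | machine | learning | sensor | algorithms | fusion | proven | strong | computer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |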


In [49]:
from sklearn.ensemble import RandomForestClassifier

# Three shallow forests. rfc2 and rfc3 share the same settings (max_depth=2),
# so any gap between them comes from random seeding alone.
rfc1 = RandomForestClassifier(n_estimators=100,max_depth=1, verbose=1)
rfc2 = RandomForestClassifier(n_estimators=100,max_depth=2, verbose=1)
rfc3 = RandomForestClassifier(n_estimators=100,max_depth=2, verbose=1)

print(y.shape, X.shape)

rfc1.fit(X,y)
rfc2.fit(X,y)
rfc3.fit(X,y)

yhat1 = rfc1.predict(X)
yhat2 = rfc2.predict(X)
yhat3 = rfc3.predict(X)

# score() is accuracy on the same rows used for fitting, not on held-out data.
print(rfc1.score(X,y))
print(rfc2.score(X,y))
print(rfc3.score(X,y))

# Feature importances from each forest, side by side.
df3 = pd.DataFrame({'features': X.columns.values, 'rfc1': rfc1.feature_importances_, 'rfc2':rfc2.feature_importances_, 'rfc3':rfc3.feature_importances_ })
print(df3.sort_values('rfc3',ascending=False))


(2128,) (2128, 30)


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished


0.781015037594
0.801221804511
0.789473684211
           features  rfc1      rfc2      rfc3
2           machine  0.14  0.218404  0.212772
1         scientist  0.14  0.170834  0.199730
5        algorithms  0.15  0.078912  0.121136
3          learning  0.10  0.166266  0.104081
0              data  0.10  0.102085  0.098223
6            fusion  0.09  0.058508  0.073856
7            proven  0.04  0.052357  0.048582
12        analytics  0.03  0.041762  0.040878
4            sensor  0.07  0.027587  0.031076
11           vision  0.04  0.053350  0.023990
9          computer  0.03  0.004602  0.022478
8            strong  0.02  0.016521  0.017309
10        knowledge  0.01  0.005777  0.002045
24  personalization  0.00  0.000480  0.001211
14       performing  0.02  0.001351  0.000954
23          perform  0.01  0.000120  0.000619
22           person  0.01  0.000152  0.000485
13        performed  0.00  0.000713  0.000286
28      perspective  0.00  0.000000  0.000140
18             perl  0.00  0.000014


In [51]:
import numpy as np
print(np.mean(y))

0.316729323308
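
Since about 31.7% of titles are labeled 1, a model that always predicts 0 would already be right about 68.3% of the time (1 - 0.3167), which is the figure the forest accuracies of 0.78 to 0.80 should be read against. Those accuracies were also measured on the rows used for fitting. The sketch below shows one way to check both points, assuming a held-out split and scikit-learn's `DummyClassifier` as the baseline; the data is synthetic, not the scraped results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the word-count matrix X and the 0/1 target y.
rng = np.random.RandomState(0)
y = (rng.rand(500) < 0.32).astype(int)  # roughly the positive rate seen above
# Positive rows get higher average word counts than negative rows.
X = rng.poisson(lam=0.3 + 1.5 * y[:, None], size=(500, 10))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline: always predict the most frequent class in the training rows.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=2, random_state=0
).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("forest accuracy:  ", forest.score(X_test, y_test))
```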


## Group 4:  International

You can work with any of the other groups, but your presentation must include at least one comparison of the data science job market with another group's market.  Consider working with the United States group.

You may focus on a single market outside of the US.  London appears to be the biggest market.