# Project Jobs: EDA Preliminary Report

### Problem Statement: finding treasure

**Objective**

Searching for new jobs can be an arduous process. A diligent job seeker can get through 10-20 postings per week, and the hit rate may not be that high. If the job search goes on for 2 months, that's around 150 job postings. From my initial search, there are thousands of data science and data analyst positions, and many more similar positions listed under business and product engineering roles. So can data science help data scientists find jobs?

Assumption: NLP can be applied to the detailed job descriptions to:
1. find correct titles (if any) for job descriptions
2. recommend similar jobs 
3. cluster similar job skills extracted from these job descriptions 

---
### Part 1: Web Scraping

#### Goal: Pull job postings for possible data science-related careers


**Terms Searched for:**
- "Data Science"
- "Data Scientist"
- "Data Analyst"
- "Data Engineer"
- "Business Analyst"
- "Machine Learning"
- "Statistics"
- "Product Analyst"
- "Deep Learning"

**Cities/Locations**
- San Francisco, CA
- Mountain View, CA
- Seattle, WA
- Los Angeles, CA
- Boston, MA
- New York, NY
- Philadelphia, PA
- Washington, DC
- Atlanta, GA
- Houston, TX
- Austin, TX
- Chicago, IL
- Minneapolis, MN

**Job Website 1:** This website has an API. The Python `requests` library was used to pull job listings from the API, but the API only returns job listing URLs along with titles, company names, cities, and states. From those links, Python with Selenium was used to pull the job description details. About 50,000 job postings were pulled.

**Notes:** As noted in the EDA, a majority of the job postings were duplicates. After de-duping, there were around 14,000 unique postings.

**Process**
1. Get an API key by registering
2. Write a Python loop over ~14 different cities, using the same API search string (no maximum on search results)
3. For each search string, pull all possible results using Python to go through the pagination
4. Save all the individual job links (~50,000)
5. Start a second crawler to pull job description details for the remaining jobs


**Job Website 2:** This website does not have an API, so Selenium was used for both the job links and the job descriptions. Over a period of 2 weeks, with a slow 20-second delay, 14,000 job postings were pulled. Due to website security and blocking, scraping was very slow and had to be restarted multiple times. Search results were capped at 40 pages x 25 results per page = 1,000 search results. As a result, ~14 cities x ~10 search terms were combined to make ~140 unique searches.

**Process**
1. Test website limits
2. Write a Python loop over 140 different search combinations
3. For each search string, pull all possible results using Selenium to go through the pagination
4. Save all the individual job links (~14,000)
5. Start a second crawler to pull job description details for the remaining jobs
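
The search combinations for Job Website 2 can be sketched as a cross product of terms and locations. This is only a minimal illustration: the two lists are short subsets of the ones given earlier, and the names `build_searches` and `MAX_RESULTS_PER_SEARCH` are made up for this example rather than taken from the original scraper.

```python
from itertools import product

# Illustrative subsets of the search terms and locations listed earlier.
SEARCH_TERMS = ["Data Scientist", "Data Analyst", "Data Engineer"]
LOCATIONS = ["San Francisco, CA", "Seattle, WA"]

# Cap described above: 40 pages of 25 results per search.
PAGES_PER_SEARCH = 40
RESULTS_PER_PAGE = 25
MAX_RESULTS_PER_SEARCH = PAGES_PER_SEARCH * RESULTS_PER_PAGE


def build_searches(terms, locations):
    """Return one (term, location) pair per unique search."""
    return list(product(terms, locations))


searches = build_searches(SEARCH_TERMS, LOCATIONS)
print(len(searches), "searches, at most",
      len(searches) * MAX_RESULTS_PER_SEARCH, "result links")
```

Because each search is capped, splitting the crawl into many narrow searches is what lets the total number of collected links exceed 1,000.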


Sample API Link
http://api.indeed.com/ads/apisearch?publisher=2313019136084570&q=%22data+analyst%22+OR+%22data+scientist%22+OR+%22data+science%22+OR+%22business+analyst%22+or+%22Analytics%22&l=San+Francisco%2C+CA&st=&jt=fulltime&start=0&limit=20&fromage=60&latlong=1&co=us&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2
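
The pagination step for Job Website 1 could look roughly like the sketch below, which only builds the page URLs and does not send any requests. The parameter names (`publisher`, `q`, `l`, `start`, `limit`, `v`) are copied from the sample link above; the function name `build_page_urls`, the placeholder publisher ID, and the assumption that the caller already knows the total number of results are all hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.indeed.com/ads/apisearch"  # endpoint from the sample link


def build_page_urls(publisher, query, location, total_results, page_size=25):
    """Build one API URL per result page by stepping the `start` offset."""
    urls = []
    for start in range(0, total_results, page_size):
        params = {
            "publisher": publisher,
            "q": query,
            "l": location,
            "start": start,
            "limit": page_size,
            "v": 2,
        }
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


# Placeholder publisher ID; a real key comes from registering for the API.
urls = build_page_urls("YOUR_PUBLISHER_ID", '"data analyst"',
                       "San Francisco, CA", total_results=60)
for url in urls:
    print(url)
```

Each generated URL would then be fetched (the report used the `requests` library for this) and the job links collected from the response.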

### Sample API Results

```
{u'response': {u'@version': 2,
  u'clickedCategories': {},
  u'dupefilter': {u'$': True},
  u'end': {u'$': 25},
  u'highlight': {u'$': False},
  u'location': {u'$': u'San Jose, CA'},
  u'pageNumber': {u'$': 0},
  u'paginationPayload': {},
  u'query': {u'$': u'"data analyst" OR "data scientist" OR "data science" OR "business analyst" or "Analytics"'},
  u'radius': {u'$': 25},
  u'results': {u'result': [{u'city': {u'$': u'San Jose'},
     u'company': {u'$': u'Adobe'},
     u'country': {u'$': u'US'},
     u'date': {u'$': u'Mon, 08 Aug 2016 19:52:38 GMT'},
     u'expired': {u'$': False},
     u'formattedLocation': {u'$': u'San Jose, CA'},
     u'formattedLocationFull': {u'$': u'San Jose, CA'},
     u'formattedRelativeTime': {u'$': u'30+ days ago'},
     u'indeedApply': {u'$': False},
     u'jobkey': {u'$': u'788848ad2706b267'},
     u'jobtitle': {u'$': u'Business Systems Analyst'},
     u'latitude': {u'$': 37.337914},
     u'longitude': {u'$': -121.89011},
     u'onmousedown': {u'$': u"indeed_clk(this,'2086');"},
     u'snippet': {u'$': u'The Adobe Information, Data and services Business Solutions Analyst (BSA) will be responsible for collaborating with IT and business clients to understand key...'},
     u'source': {u'$': u'Adobe'},
     u'sponsored': {u'$': False},
     u'state': {u'$': u'CA'},
     u'url': {u'$': u'http://www.indeed.com/viewjob?jk=788848ad2706b267&qd=I0TxM_wMPz_iqziD0cGR0PxLjuFXC0PsEfAsModHF2st56MW-C4_aRZZ49ZIFQT7Q6T0dt0aQ-pjLjZ1rU0TrrgvjzJLVDraSv5jRUBSQRE4rzSPTGVAZrn57D8I7-iZPPOnyqkZAWHSsZVuWLMMyGdT8RqGX3zRviuHfoN9_16z9Y5HrDbNczadbi7F-QkeLN4wjpfgfMVbIHjTO97TuDfgwv9wD6L4u89hEa7BM75_dbjB9DmsoKYh9Vr0WnfH&indpubnum=2313019136084570&atk=1asm1jak9b9rkajg'}
                           }]}}}
```        
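
The report does not show how the duplicate postings were removed. One possible approach, sketched here under the assumption that the `jobkey` field in the result above uniquely identifies a posting, is to keep the first result seen for each key. The helper names `unwrap` and `dedupe_by_jobkey` and the sample records are made up for illustration.

```python
def unwrap(field):
    """Return the value inside a {'$': value} wrapper, or None if absent."""
    return field.get("$") if isinstance(field, dict) else None


def dedupe_by_jobkey(results):
    """Keep the first result seen for each jobkey, preserving order."""
    seen = set()
    unique = []
    for result in results:
        key = unwrap(result.get("jobkey"))
        # Results without a jobkey are skipped in this sketch.
        if key is None or key in seen:
            continue
        seen.add(key)
        unique.append(result)
    return unique


# Tiny made-up sample in the same shape as the API results above.
sample = [
    {"jobkey": {"$": "788848ad2706b267"}, "jobtitle": {"$": "Business Systems Analyst"}},
    {"jobkey": {"$": "788848ad2706b267"}, "jobtitle": {"$": "Business Systems Analyst"}},
    {"jobkey": {"$": "0000000000000000"}, "jobtitle": {"$": "Data Analyst"}},
]
unique = dedupe_by_jobkey(sample)
print(len(sample), "->", len(unique))
```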
                    

### Sample job description

````
-------------------------------------------------------------------------------------------------
We’re looking for Data Engineers to put complex algorithms and data models into production.
-------------------------------------------------------------------------------------------------
Kabbage s looking to hire a Data Engineer to to put complex algorithms and data models into production. This is a particularly exciting phase, as Kabbage offers the only fully automated, online lending platform designed to support continuous customer data monitoring. We serve small businesses and consumers directly through Kabbage.com and Karrot.com and power lending for organizations all over the globe.
---------------------------------------------
Day in the life of a Data Engineer at Kabbage
---------------------------------------------

Skills & Experience
-----------------------

Some of the things we're looking for in you
-------------------------------------------


Of course we have all of the requisite goodies: unlimited food and beverages (of all kinds), lunch catered daily, regular ping pong tournaments, free parking and white boards filled with questionable drawings. More than that, we have a pretty tight team of passionate, driven individuals with whom you wouldn't mind spending a big part of your day.
#36 on the Inc. 500 | Fast Company’s "Top 10 Most Innovative Companies in Finance" | Business Insider's "Top 20 Unicorn Startups to Work For" | Forbes' "America’s 100 Most Promising Companies" | AJC's "Top Places to Work"
Kabbage is an equal opportunity employer. At Kabbage we make all employment decisions, which include hiring, promoting, transferring, demoting, evaluating, compensating and separating, without regard to sex, sexual orientation, gender identity, race, color, religion, age, national origin, pregnancy, citizenship, disability, service in the uniform services, or any other classification protected by federal, state or local law.
*******************************************************

-------------------------------------------------------------------------------------------------
We’re looking for Data Engineers to put complex algorithms and data models into production.
-------------------------------------------------------------------------------------------------

Kabbage s looking to hire a Data Engineer to to put complex algorithms and data models into production. This is a particularly exciting phase, as Kabbage offers the only fully automated, online lending platform designed to support continuous customer data monitoring. We serve small businesses and consumers directly through Kabbage.com and Karrot.com and power lending for organizations all over the globe.

---------------------------------------------
Day in the life of a Data Engineer at Kabbage
---------------------------------------------


Design and develop ETL packages from our source systems that are scalable
Manage a large-scale data Hadoop platform supporting terabytes of data growing rapidly
Analyze complex data systems and elements, data flow, relationships and dependencies to contribute to conceptual, logical and physical data models
Perform thorough testing and validation to support the accuracy of data transformations, data verification used in the machine learning models
Ensure data quality and governance, create/maintain data dictionary and related metadata
Lead innovation by exploration, recommendation, benchmarking and implementing big data technologies for the platform
Provide support to the entire analytics team for their data centric needs

Skills & Experience
-----------------------


Experience in Hadoop and MapReduce
Familiarity with Hive, Oozie, HBase and related technologies
Proficiency in programming in Python, SQL, shell scripting and Java
3+ years in using big data technologies and proven ability to work with complex data systems
Bachelor’s Degree in Computer Science, Math or related disciplines with 4+ years industry experience OR Master’s Degree with 2+ years experience
Strong experience in design and development of large scale applications
Strong ability to collaborate with team members throughout the process and consult with other project teams on the design and use of enterprise data
Proven ability to create and maintain online and printed documentation
Aptitude to independently learn new technologies
Excellent written and oral communication skills and a desire to educate co-workers and business users about our data and how to access it

Some of the things we're looking for in you
-------------------------------------------


Highly motivated to tackle multiple projects in parallel
Aptitude to independently learn new technologies
Excellent written and oral communication skills and ability to share findings with a large, non-technical audience
An excellent communicator, teacher and student
A perfectionist who tempers it with deadlines

The environment: Of course we have all of the requisite goodies: unlimited food and beverages (of all kinds), lunch catered daily, regular ping pong tournaments, free parking and white boards filled with questionable drawings. More than that, we have a pretty tight team of passionate, driven individuals with whom you wouldn't mind spending a big part of your day.

#36 on the Inc. 500 | Fast Company’s "Top 10 Most Innovative Companies in Finance" | Business Insider's "Top 20 Unicorn Startups to Work For" | Forbes' "America’s 100 Most Promising Companies" | AJC's "Top Places to Work"

Kabbage is an equal opportunity employer. At Kabbage we make all employment decisions, which include hiring, promoting, transferring, demoting, evaluating, compensating and separating, without regard to sex, sexual orientation, gender identity, race, color, religion, age, national origin, pregnancy, citizenship, disability, service in the uniform services, or any other classification protected by federal, state or local law.

13 days ago  - save job
 - 
                original job









» Apply Now

            Please review all application instructions before applying to Kabbage.






Recommended Jobs

Data Engineer

Trivoca Consulting -
		 Atlanta, GA
11 days ago
Big Data Engineer

CapTech Consulting -
		 Atlanta, GA
6 days ago
Big Data Engineer

Principle Solutions Group -
		 Atlanta, GA
5 days ago
Big Data Engineer

Trace Staffing -
		 Atlanta, GA
13 days ago

 Easily apply
Data Engineer

Veredus -
		 Atlanta, GA
11 days ago

 Easily apply
````

---
### Part 2: Munging

**Sample code: HTML extraction with Beautiful Soup**
```
import pickle

import numpy as np
from bs4 import BeautifulSoup

# `singlefile` is the path to one pickle file of scraped pages; it is
# expected to be set by the surrounding loop over files.
# Initialize these once, before looping over files.
i = 0                  # running record counter
collection_list = []   # rows waiting to be written out

with open(singlefile, 'rb') as f:
    loaded = pickle.load(f)
print(len(loaded))

for record in loaded:
    i += 1
    print(i, end=' ')
    # Fields captured from the search results page
    row = {}
    row['head-title'] = record['title']
    row['head-company'] = record['company']
    row['head-jobid'] = record['jobid']
    row['link'] = record['link']
    row['head-location'] = record['location']
    row['searchlink'] = record['searchlink']

    soup = BeautifulSoup(record['fullHTML'], 'lxml')
    # Turn <br> tags into newlines so get_text() keeps the line breaks
    for br in soup.find_all('br'):
        br.replace_with('\n')

    # Fields parsed from the job page itself (None when an element is missing)
    elements = {
        'page-title': soup.find('title'),
        'body-jobname': soup.find(itemprop='title'),
        'body-company': soup.find(itemprop='name'),
        'body-location': soup.find(itemprop='addressLocality'),
        'industry': soup.find(itemprop='industry'),
        'employment': soup.find(itemprop='employmentType'),
        'experience': soup.find(itemprop='experienceRequirements'),
        'jobcat': soup.find(itemprop='occupationalCategory'),
        'jobdesc': soup.find(itemprop='description'),
        'company_desc': soup.find('div', {'class': 'description-container'}),
        'views': soup.find('li', {'class': 'views'}),
    }

    # Store the text of each element, or NaN when it was not found
    for k, v in elements.items():
        row[k] = v.get_text() if v is not None else np.nan

    collection_list.append(row)
    # Write out every 100 records (zero-padded file name) and start a new batch
    if i % 100 == 0:
        print('saving files')
        with open('lnkcompiled{:05d}.p'.format(i), 'wb') as f:
            pickle.dump(collection_list, f)
        collection_list = []
```

Loading the compiled DataFrame and previewing the first four rows:

```
import pickle

with open('master_lnk_df.p', 'rb') as f:
    master_lnk_df = pickle.load(f)
master_lnk_df.head(4)
```

Unnamed: 0,body-company,body-jobname,body-location,company_desc,days_posted,employment,experience,head-company,head-jobid,head-location,...,industry,jobcat,jobdesc,link,page-title,post_end_date,post_start_date,searchlink,views,jobkey
0,Jobspring Partners,Python/SQL Analyst,"Chicago, IL, US",Jobspring Partners is a nationwide contingency...,Posted 28 days ago,Full-time,Entry level,Jobspring Partners,11,"Chicago, IL, US",...,Staffing and Recruiting,Research,Python/SQL Analyst\n\nData Analysts capable of...,https://www.linkedin.com/jobs/view/195559462,Python/SQL Analyst Job at Jobspring Partners i...,"September 22, 2016","August 23, 2016",https://www.linkedin.com/jobs/search?keywords=...,0,195559462
1,Broad Institute,Senior Associate Computational Biologist,"Cambridge, MA, US",The Broad Institute was launched to pioneer a ...,Posted 13 days ago,Full-time,Not Applicable,Broad Institute,12,"Cambridge, MA, US",...,"Biotechnology, Nonprofit Organization Manageme...",Research,The Broad Institute of Harvard and MIT is look...,https://www.linkedin.com/jobs/view/196771881,Senior Associate Computational Biologist Job a...,"October 8, 2016","September 8, 2016",https://www.linkedin.com/jobs/search?keywords=...,52,196771881
2,Stripe,Compensation Analyst,"San Francisco, CA, US",Stripe is a set of tools for building and runn...,Posted 9 days ago,Full-time,Associate,Stripe,13,"San Francisco, CA, US",...,"Computer Software, Financial Services, and Int...",Human Resources,Stripe’s people are its most valuable resource...,https://www.linkedin.com/jobs/view/196902956,Compensation Analyst Job at Stripe in San Fran...,"October 12, 2016","September 12, 2016",https://www.linkedin.com/jobs/search?keywords=...,83,196902956
3,Facebook,"Data Engineer, Analytics - Seattle",Seattle -WA -US,Facebook was founded in 2004. Our mission is t...,Posted 9 days ago,Full-time,Not Applicable,Facebook,14,Seattle -WA -US,...,Internet,"Information Technology,Engineering,Analyst",Facebook was built to help people connect and ...,https://www.linkedin.com/jobs/view/172687405,"Data Engineer, Analytics - Seattle Job at Face...","October 21, 2016","September 12, 2016",https://www.linkedin.com/jobs/search?keywords=...,1893,172687405


---
### Part 3: EDA

Process:
1. clean the two data sources
2. merge the two datasets on their common columns
3. extract and normalize some of the fields, filling blanks
4. pull out generic titles (analyst, engineer, scientist, ...)
5. pull out 2-gram versions of the titles (extended titles)
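
Steps 4 and 5 could be sketched as follows. The exact rule used in the project is not shown in this report, so this is an assumption: the generic title is taken to be the first token found in a known list, and the 2-gram title is that token plus the word before it. The list `GENERIC_TITLES` is a subset of the titles in the counts below, and `extract_titles` is a hypothetical helper name.

```python
import re

# Subset of the generic titles that appear in the counts in this section.
GENERIC_TITLES = {"analyst", "engineer", "scientist", "manager", "specialist",
                  "architect", "developer", "consultant", "director"}


def extract_titles(raw_title):
    """Return (generic, two_gram) for a raw job title.

    generic  : first token that is a known generic title, else ''.
    two_gram : the preceding token plus the generic title, when there is one.
    """
    tokens = re.findall(r"[a-z]+", raw_title.lower())
    for i, token in enumerate(tokens):
        if token in GENERIC_TITLES:
            two_gram = tokens[i - 1] + " " + token if i > 0 else token
            return token, two_gram
    return "", ""


for title in ["Senior Data Scientist", "Machine Learning Engineer",
              "Data Engineer, Analytics - Seattle", "Director"]:
    print(title, "->", extract_titles(title))
```

Under this rule a title such as "Machine Learning Engineer" becomes "learning engineer", and a title with no known generic word maps to an empty string.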




```
analyst          9150
engineer         4120
scientist        1506
                 1446
manager          1263
specialist       1182
architect         862
developer         729
consultant        556
associate         439
lead              367
director          360
designer          139
assistant         111
researcher         94
intern             88
administrator      73
programmer         61
accountant         56
vp                 46
statistician       35
editor             24
```

```
business analyst          4029
data analyst              1530
software engineer         1285
data scientist            1139
data engineer             1085
analyst                    310
systems analyst            294
development engineer       266
senior consultant          249
marketing specialist       188
director                   183
product manager            178
manager                    170
senior analyst             155
intelligence analyst       146
research scientist         139
marketing analyst          139
data architect             132
operations analyst         124
project manager            121
solution architect         118
product analyst            115
financial analyst          111
senior associate           105
learning engineer          105
research analyst           102
solutions architect         93
software developer          88
systems engineer            88
applied scientist           81

```

---
### Part 4: Linear Regression - predicting the number of views from NLP elements

### Features that will be used

- City
- State
- Posting start date
- Days up
- Length of Job description
- Company
- Position name (generic)
- Semi-specific Position Name
- Posting Month (start)

Even though there are only a few features, most of them are categorical, so encoding them produces a large number of features. For example, "Apple", "IBM", and "Amazon" listings in the "COMPANY" field will result in 2-3 new features (one indicator column per company, or one fewer if a baseline category is dropped). For ~8,000 job postings, the number of features grows much larger.
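
To make the expansion concrete, here is a minimal pure-Python sketch of one-hot encoding a categorical column. It is only an illustration of the idea, not the encoding code used for the results below, and the helper name `one_hot` is made up.

```python
def one_hot(values):
    """One-hot encode a list of category labels.

    Returns (columns, rows): the sorted distinct labels and one 0/1 row per value.
    """
    columns = sorted(set(values))
    index = {label: j for j, label in enumerate(columns)}
    rows = []
    for value in values:
        row = [0] * len(columns)
        row[index[value]] = 1
        rows.append(row)
    return columns, rows


companies = ["Apple", "IBM", "Amazon", "IBM"]
columns, rows = one_hot(companies)
print(columns)
for row in rows:
    print(row)
```

Every distinct company adds a column, so a field with thousands of distinct values adds thousands of features.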

If these features are not enough, word analysis will be performed to see whether certain words lead to more views of the postings. The date will be cleaned first.

### Current Results
model | R2 training score | R2 test score
--------|----------|------------
LassoCV | 0.286 | 0.142
RidgeCV | 0.372 | 0.189
Decision Tree Regressor | 0.407 | 0.078
RidgeCV w words | 0.347 | 0.144
Decision Tree Regressor w words | 0.407 | -0.049
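
For reading the table, it helps to recall how R2 is defined: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a plain-Python version of that definition with made-up view counts. It shows why a negative test score, like the last row, means the model does worse than always predicting the mean.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


views = [10, 20, 30, 40]                   # made-up view counts
print(r2_score(views, views))              # perfect predictions
print(r2_score(views, [25, 25, 25, 25]))   # always predicting the mean
print(r2_score(views, [40, 30, 20, 10]))   # worse than predicting the mean
```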

```
======================================================================================
======================================================================================
1. loading data
0:00:04.043642  of  0:00:04.043650
======================================================================================
======================================================================================
2. cleaning data
(22707, 15)
(8038, 15)
0:00:05.455217  of  0:00:09.498867
======================================================================================
======================================================================================
3. scaling data data
0:00:00.017944  of  0:00:09.516811
======================================================================================
======================================================================================
4. splitting X and Y datasets
(8038, 4047) (8038,)
0:00:00.746515  of  0:00:10.263326
======================================================================================
======================================================================================
4a. train_test_split
0:00:00.098361  of  0:00:10.361687
======================================================================================
======================================================================================
begin basic and baseline regressions (without NLP additions) =========
5.Performing lasso
0.286436230017 0.5856753767
train / test score
0.141959208326 0.866105745957
0:01:17.637280  of  0:01:27.998967
======================================================================================
======================================================================================
6. Performing Ridge
0.372487195455 10
train / test score
0.188878880572 10
0:01:54.864978  of  0:03:22.863945
======================================================================================
======================================================================================
7. Performing Decision Tree Regressor
0.407789552976
test train score
0.0781474174699
0:00:03.338161  of  0:03:26.202106
======================================================================================
======================================================================================
8. prepping the word data
0:00:03.065934  of  0:03:29.268040
0:00:03.622710  of  0:03:32.890750
======================================================================================
======================================================================================
begin add in NLP features =========
9. merging the top 10,000 most common words; merging with X
(8038, 14047)
0:00:04.618510  of  0:03:37.509260
======================================================================================
======================================================================================
10. using enhanced data with Ridge
ridgeCV - training score
0.347754677768 1000
ridgeCV - test score
0.144250344331 1000
0:02:37.631455  of  0:06:15.140715
======================================================================================
======================================================================================
11. using enhanced data with decision tree
0.407789552976
test train score
-0.0490863894688
```

---
### Part 5: Logistic Regression

### Features that will be used

- City
- State
- Days up
- Length of Job description
- Company
- Words or Phrases (1-->2 Gram)

Features are added, and NLP is then applied to the job descriptions to identify keywords that can predict job titles. The model is then used to predict the title from the job description. The predictions will also be used to infer clusters and to flag job descriptions that may be mismatched to their current titles.

```
['analyst' 'business analyst' 'data analyst' 'data architect'
 'data engineer' 'data scientist' 'development engineer' 'director'
 'financial analyst' 'intelligence analyst' 'learning engineer' 'manager'
 'marketing analyst' 'marketing specialist' 'operations analyst'
 'product analyst' 'product manager' 'project manager' 'research analyst'
 'research scientist' 'senior analyst' 'senior associate'
 'senior consultant' 'software engineer' 'solution architect'
 'systems analyst']
```
Current Logistic Regression Score: 0.627

```
====================================================================================================
0. loading data
task time:  0:00:00.055854 overall 0:00:00.055889
====================================================================================================
1. start data cleaning
task time:  0:00:10.157849 overall 0:00:10.213738
====================================================================================================
2. process the job description word data for NLP
task time:  0:00:16.732239 overall 0:00:26.945977
====================================================================================================
3. word vectorizing and processing complete
expanded_title ~ company + city + state + desc_len -1
(12518, 15748)
['analyst' 'business analyst' 'data analyst' 'data architect'
 'data engineer' 'data scientist' 'development engineer' 'director'
 'financial analyst' 'intelligence analyst' 'learning engineer' 'manager'
 'marketing analyst' 'marketing specialist' 'operations analyst'
 'product analyst' 'product manager' 'project manager' 'research analyst'
 'research scientist' 'senior analyst' 'senior associate'
 'senior consultant' 'software engineer' 'solution architect'
 'systems analyst']
(12518, 26)
task time:  0:00:02.460229 overall 0:00:29.406206
====================================================================================================
4. train test and split the data
(10014, 15748) (2504, 15748) (10014, 26) (2504, 26)
task time:  0:00:00.993279 overall 0:00:30.399485
====================================================================================================
5. start basic logistic regression
0:01:04.628651
score: 0.626996805112
task time:  0:01:08.933368 overall 0:01:39.332853
====================================================================================================
6. create feature importance by models SAMPLES
============================================================
analyst
(-0.7294571620125575, 'state[T.other]')
(0.60752602170861592, u'nlp_contribute')
(0.56008826484315311, u'nlp_applicable')
(-0.55330929427153253, u'nlp_machine')
(-0.50260812587536741, u'nlp_python')
============================================================
business analyst
(-1.3185233118900246, u'nlp_machine')
(-0.63542970482329042, u'nlp_spark')
(-0.59549567740698928, u'nlp_entry')
(-0.5905707658908671, u'nlp_sapient')
(-0.57991409222923129, u'nlp_follows')
============================================================
data analyst
(-0.78851163917594225, u'nlp_phd')
(-0.7419060741690171, u'nlp_aws')
(-0.70317755107476998, u'nlp_java')
(-0.657736632750478, u'nlp_amazon')
(0.63353943588904083, u'nlp_excel')
============================================================
data architect
(0.68058302210380006, u'nlp_architecture')
(0.61319157673180225, u'nlp_architect')
(0.54462427652792267, u'nlp_10')
(0.52948853561862586, u'nlp_zipcar')
(-0.46109320413764215, 'state[T.other]')
============================================================
data engineer
(0.71850196547062606, u'nlp_etl')
(0.62422750215671463, u'nlp_pipelines')
(-0.4801069829163066, u'nlp_excel')
(-0.42731579742720199, u'nlp_statistical')
(0.4253979425387614, u'nlp_warehouse')
============================================================
data scientist
(-0.93358364023154439, u'nlp_30')
(-0.87326940157254285, u'nlp_excel')
(0.7797469790420245, u'nlp_phd')
(0.72062782058667341, u'nlp_python')
(0.59399705561080585, u'nlp_statistics')
============================================================
development engineer
(-0.60044002414727149, u'nlp_requirements')
(0.57417600056886919, u'nlp_mining')
(-0.49781548012691956, u'nlp_com')
(0.49676435215758213, u'nlp_teradata')
(-0.48700221168771218, u'nlp_management')
============================================================
director
(-0.8149324745716009, u'nlp_30')
(0.72353772425633367, u'nlp_leadership')
(-0.65769126452244653, u'nlp_join')
(0.59732858480159168, u'nlp_managing')
(-0.5496765899677305, u'nlp_sql')
============================================================
financial analyst
(0.80860907138264049, u'nlp_financial')
(0.5519457583731302, u'nlp_finance')
(0.47213540492732753, u'nlp_modeling')
(0.44321014920959639, u'nlp_reporting')
(-0.41159737232354149, u'nlp_plans')
============================================================
intelligence analyst
(1.3745988769239414, u'nlp_intelligence')
(-0.59927251105401635, 'state[T.other]')
(0.54551585080260523, u'nlp_dashboards')
(-0.49719563769898623, 'state[T.CA]')
(-0.48146223842226493, u'nlp_process')
============================================================
learning engineer
(0.98281816984954828, u'nlp_machine')
(-0.45697373405838787, u'nlp_years')
(0.43540369115115163, u'nlp_ocado')
(-0.42000864562103002, 'state[T.other]')
(-0.38541587216190587, u'nlp_client')
====================================================================================================
7. identify top 3 titles per
(12518, 26)
task time:  0:00:24.687746 overall 0:02:04.262677
====================================================================================================
```
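
The per-title lists in step 6 of the log appear to be ordered by absolute coefficient size. A minimal sketch of that ranking is shown below. It assumes one coefficient per feature for a given title (for example, one row of a fitted model's coefficient matrix); the function name `top_features` and the extra `nlp_example` entry are made up for illustration.

```python
def top_features(coefficients, feature_names, k=5):
    """Return the k (coefficient, name) pairs with the largest absolute value."""
    pairs = zip(coefficients, feature_names)
    return sorted(pairs, key=lambda pair: abs(pair[0]), reverse=True)[:k]


# Rounded values copied from the 'data scientist' block of the log above,
# plus one made-up small coefficient to show that it gets dropped.
coefs = [0.594, -0.934, 0.721, 0.010, -0.873, 0.780]
names = ["nlp_statistics", "nlp_30", "nlp_python",
         "nlp_example", "nlp_excel", "nlp_phd"]
for coef, name in top_features(coefs, names):
    print(coef, name)
```

Sorting by absolute value keeps strong negative signals (such as `nlp_excel` for data scientist) alongside strong positive ones.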

---
### Part 6: Topic Modeling

### 3-Topic Modeling
```
[(0,
  u'0.028*data + 0.015*experience + 0.009*business + 0.007*team + 0.007*work + 0.006*years + 0.005*skills + 0.005*design'),
 (1,
  u'0.011*marketing + 0.009*team + 0.008*experience + 0.007*work + 0.007*business + 0.006*analytics + 0.005*new + 0.005*data'),
 (2,
  u'0.017*business + 0.010*experience + 0.009*management + 0.007*work + 0.007*data + 0.007*project + 0.007*requirements + 0.006*skills')]
```
### 5-Topic Modeling
```
[(0,
  u'0.017*business + 0.009*experience + 0.007*team + 0.007*management + 0.007*work + 0.006*data + 0.006*support + 0.006*financial'),
 (1,
  u'0.031*data + 0.015*experience + 0.009*business + 0.009*team + 0.008*work + 0.006*software + 0.006*development + 0.006*skills'),
 (2,
  u'0.009*data + 0.009*experience + 0.009*business + 0.008*health + 0.007*work + 0.007*systems + 0.006*management + 0.006*analysis'),
 (3,
  u'0.018*marketing + 0.010*data + 0.009*analytics + 0.008*experience + 0.008*team + 0.008*digital + 0.007*work + 0.007*business'),
 (4,
  u'0.017*business + 0.013*experience + 0.011*requirements + 0.011*management + 0.010*project + 0.007*support + 0.007*team + 0.006*development')]
```
### 10-Topic Modeling
```
[(0,
  u'0.033*data + 0.016*experience + 0.010*team + 0.009*work + 0.008*business + 0.007*analytics + 0.006*software + 0.006*development'),
 (1,
  u'0.024*supply + 0.021*chain + 0.011*sourcing + 0.010*analytics + 0.007*manufacturing + 0.006*business + 0.006*planning + 0.005*team'),
 (2,
  u'0.015*business + 0.010*experience + 0.009*requirements + 0.009*management + 0.008*project + 0.008*data + 0.008*support + 0.007*work'),
 (3,
  u'0.008*experience + 0.007*work + 0.006*solutions + 0.006*management + 0.006*program + 0.006*business + 0.005*client + 0.005*services'),
 (4,
  u'0.016*experience + 0.011*security + 0.011*systems + 0.008*analysis + 0.008*management + 0.008*information + 0.007*requirements + 0.007*related'),
 (5,
  u'0.019*business + 0.010*marketing + 0.010*data + 0.009*experience + 0.009*management + 0.009*team + 0.008*skills + 0.007*analysis'),
 (6,
  u'0.018*health + 0.015*care + 0.009*experience + 0.008*clinical + 0.008*data + 0.007*healthcare + 0.007*research + 0.007*hospital'),
 (7,
  u'0.017*search + 0.014*seo + 0.010*marketing + 0.008*campaigns + 0.007*experience + 0.006*google + 0.006*paid + 0.006*tools'),
 (8,
  u'0.011*information + 0.009*experience + 0.008*required + 0.008*service + 0.008*systems + 0.007*application + 0.006*data + 0.006*work'),
 (9,
  u'0.015*marketing + 0.013*business + 0.008*project + 0.007*management + 0.007*product + 0.007*experience + 0.006*team + 0.006*work')]
```

---
### Part 7: Recommender Systems

Using NLP, jobs will be associated with one another to find similar jobs. The job descriptions will be split into word vectors, which will be compared using cosine similarity.
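
A minimal sketch of this idea, using plain word counts and the standard library only, is shown below. The actual vectorization used in the project (for example, raw counts versus TF-IDF weights) is not stated here, so the bag-of-words choice is an assumption, and the function names and sample postings are made up for illustration.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query, postings, top_n=3):
    """Rank postings by similarity to the query; returns (score, index) pairs."""
    q = word_vector(query)
    scored = [(cosine_similarity(q, word_vector(p)), i)
              for i, p in enumerate(postings)]
    return sorted(scored, reverse=True)[:top_n]


postings = [
    "analyze user data with sql and tableau",
    "seo and paid search campaigns",
    "sql data analysis for product teams",
]
for score, index in recommend("product data analysis with sql", postings):
    print(round(score, 3), index)
```

Each result is a (similarity, posting index) pair, the same shape as the similarity ratings reported for the recommended jobs.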

**Original Posting**

```
The Discovery wing of the Data Analytics and Engineering (DEA) team focuses on analytics of the core features of the Netflix product - Kids, Title Merchandising, Personalization, Billboard, Post Play, Ratings, My List (aka Queue), Popularity of UI features, Netflix Originals promotions etc. These areas are constantly evolving to improve our users personalized experience worldwide. The best part is, the work you do helps in shaping the future of Netflix product and fun part is, as a consumer can see and feel the difference.


We are looking for an individual with following characteristics - exceptional analytical skills, passion for data, looking for a fast pace environment, dislikes micromanagement, explore creative and innovative ideas, not afraid of failures, eager to collaborate, good listener and has courage to challenge ideas. You will partner directly with the Product Innovation directors to analyze user behavior data, help them understand how users are consuming content and help them innovate Netflix product.


Problems we are solving:

What metrics should we use to measure the success of our product features? How do Kids use our product compared to adult? How to evolve the product for global landscape - regional difference, language difference, cultural difference etc.? What does a user’s interaction behavior tell us about their engagement/retention?


What you’ll do


Mine and analyze data pertaining to customers’ discovery and viewing experiences to identify critical actionable product insights.
Proactively develop new metrics and studies to quantify the value of different aspects of discovery features, and set up ongoing reports to continually measure their performance.
Translate analytic insights into concrete, actionable recommendations for business or product improvement.
Partner closely with product managers and engineering leaders throughout the lifecycle of launching new Netflix features. Ensure that analytic needs are well-defined up front, and development timelines are coordinated with analytic needs.
Drive efforts to enable product managers and engineering leaders to share your knowledge and insights through clear and concise communication, education, and data visualization.


Who you are:


5+ years of experience in data analysis to derive impactful insights.
Strong interpersonal and communication skills: ability to tell a clear, concise, actionable story with data.
Proficiency at querying high volume data using SQL, and pulling data from various sources (e.g. log files, Redshift).
Strong data visualization skills (Tableau).
Experience with big data technologies - Hadoop, Hive, Presto, etc.
Good understanding of high-level product architecture and data flow patterns.


A few more things to know:


Our culture is unique and we live by our values, so it’s worth learning more about Netflix at www.netflix.com/Jobs.
You will need to be comfortable working in the most agile of environments. Requirements will be vague. Iterations will be rapid. You will need to be nimble and take smart risks.

30+ 
```

### Recommended Jobs



#### similarity rating: (0.63400555872866082, 190)
```
============================================
The Discovery wing of the Data Analytics and Engineering (DEA) team focuses on analytics of the core Netflix product features - Original content promotion, Kids, feature performance, etc. In addition to learning about user’s behavior through product usage, we also partner with consumer insight business team to understand the user's behavior/perception of product and in general Netflix brand by conducting quantitative and qualitative study.


We are looking for an individual with following characteristics - strong analytical skills, passion for data, looking for a fast paced environment, self starter, explore creative and innovative ideas, eager to collaborate. You will partner directly with the consumer insight team to support their data need to conduct quantitative/qualitative studies and analytics requirement. You will also work with other product innovation director to help them understand how users are interacting with product and partner with them to innovate product.


Problems we are solving :

What metrics should we use to measure the success of our product features? How do Kids use our product compared to adult? How to evolve the product for global landscape - regional difference, language difference, cultural difference etc.? What does a user’s interaction behavior tell us about their engagement/retention?


What you’ll do


Mine and analyze data pertaining to customers’ discovery and viewing experiences to support quantitative and qualitative study.
Develop insightful reports/dashboard for business to self serve
Develop new metrics to measure the impact of discovery features
Partner with product managers and engineering leaders to help innovate product


Who you are:


1+ years of data analytics experience
Strong SQL background, and proficient in pulling data from different sources (e.g. log files, Redshift).
Strong interpersonal and communication skills
Strong data visualization skills (Tableau)
Experience with big data technologies (Hadoop, Hive, Presto) is a plus


A few more things to know:


Our culture is unique and we live by our values, so it’s worth learning more about Netflix at www.netflix.com/Jobs.
You will need to be comfortable working in the most agile of environments. Requirements will be vague. Iterations will be rapid. You will need to be nimble and take smart risks.

5 
```
#### similarity rating: (0.29482984870228707, 31)

```
============================================

Our client-partner is a leading eCommerce services company – they are one of Canada’s fastest-growing and most innovative Internet companies. As the Business Intelligence Product Manager, you will be responsible for working on a talented team of Data Science and Business Intelligence professionals.


In this role you will be responsible for defining BI product strategy and collaborating with engineering to bring business intelligence and reporting solutions from conception to delivery. You will establish strong working relationships with internal and external clients, aggregate requirements, build a product roadmap, and manage the life cycle of new and existing BI technology initiatives.


Responsibilities

Work with internal clients to understand, document, support, and re-envision the reporting and tools that support a variety of business processes.
Work with the Relationship Management team and loyalty partner representatives to understand partner reporting and data requirements.
Define all aspects of business intelligence product requirements and specifications, including mockups, detailed definitions, external data source integrations, internal data and transaction flows, and operational and product adoption documentation
Work closely with the Product Managers, Directors, and VP of Product to drive an understanding of business intelligence throughout all product development work at the company.
Use Agile product delivery practices; working closely with a scrum master, software engineering and QA to build, test, release, and measure all product initiatives.
Manage a product backlog, project timelines, scope, milestones, and prioritization.

Required Skills

3+ years of hands-on experience in software/internet product management
A curiosity for and desire to work with data products (this could be databases, data warehousing concepts, BI reporting tools or business data visualization tools).
Comfortable writing SQL to explore data sets and perform basic manipulations.
An understanding of strategic trends in data (e.g. predictive analytics, big data, machine learning) and how they relate to Business Intelligence.
Experience in defining requirements for reports and underlying ETL processes.
Familiarity with ecommerce development, Web Services, and related areas.
Computer Science or Engineering degree, or significant technical work experience.
Experience in an eCommerce environment is highly advantageous
Hands-on experience with client interfacing for requirements analysis, developing multi-phase feature roadmaps, ongoing communication with clients, granular feature prioritization, detailed feature design and documentation of product/functional requirement specifications.
Detail-oriented, able to bring clarity to complex situations, and skilled at bridging the gap between technical and non-technical audiences.
Strongly analytic, able to find patterns in data, and skilled at understanding a diverse set of transactional processes.
Results-oriented at an individual and team level, ambitious, and strong ability to follow-through on commitments.
Previous experience with Agile development practices.

30+ 
```
#### similarity rating: (0.29460614637750837, 215)

```
============================================

Vencore is a proven provider of information solutions, engineering and analytics for the U.S. Government. With more than 40 years of experience working in the defense, civilian and intelligence communities, Vencore designs, develops and delivers high impact, mission-critical services and solutions to overcome its customers most complex problems.


Headquartered in Chantilly, Virginia, Vencore employs 3,800 engineers, analysts, IT specialists and other professionals who strive to be the best at everything they do.


Vencore is an AA/EEO Employer - Minorities/Women/Veterans/Disabled


Responsibilities:
The M&S Data Architect will develop a detailed knowledge of the underlying data and data products and become the subject matter expert on content, current and potential future uses of data, and the quality and interrelationship between core elements of the data repository and data products.


The M&S Data Architect will consult with information technology and M&S staff to design and implement scripts, programs, databases, software components and analysis that will support product quality and an in depth understanding of potential uses of the data.


Performs a key leadership role in the areas of advanced data techniques, including data modeling, data access, data integration, data visualization, text mining, data discovery, statistical methods, database design and implementation.


Defines and achieves the strategy roadmap for the enterprise; including data modeling, implementation and data management for the enterprise data warehouse and advanced data analytics systems.


Establishes standards and guidelines for the design & development, tuning, deployment and maintenance of information, advanced data analytics, and text mining models and physical data persistence technologies.


Provides leadership in establishing analytic environments required for structured, semi-structured and unstructured data.


Works with staff and customers to understand the business requirements and business processes, design data warehouse ("DW") schema and define extract-translate-load ("ETL") and/or extract-load-translate ("ELT") processes for DW.


Utilization of advanced data analysis, including statistical analysis, data mining techniques, and use of computational packages such as SAS, R or SPSS.


Qualifications:
Required Qualifications:

BS in Electrical/Computer Engineering, Computer Science, Math, Physics, or equivalent, with 5 or more years of relevant experience in a DoD-related program or similar.


-Experience in Data Management, Data Storage and Data Analysis.

-Technical knowledge with; Big Data storage & analytics, Information Science, Software Development and Databases, Systems Security process and systems engineering.

-Proven experience converting large ambiguous data sets into compact visualizations, and transitioning code and requirements to developers.

-Experience defining/architecting end-to-end data flow and data management systems, and multi-system/multi-software interface definition & documentation.

-Strong experience working directly with non-technical customers and translating requirements into actionable prototypes, and developing formal requirements for such prototypes.

-Strong knowledge and experience of software development, modern database and indexing theories.

-Strong knowledge and experience of different data storage, search, and retrieval models


Security: Top Secret Clearance is required.

6 
```
#### similarity rating: (0.29166454034940587, 464)

```
============================================

Amazon Relational Database Service (Amazon RDS) is an industry leading web service that makes it easy to set up, operate, and scale a relational database in the cloud using any of the leading database engines – MySQL, MariaDB, PostgreSQL, SQL Server and Oracle, as well as Amazon’s own MySQL-compatible database engine, Aurora. We are looking for a for a seasoned and talented data engineer to join the team in our Seattle Headquarters. More information on Amazon RDS is available at http://aws.amazon.com/rds.


The data engineer must be passionate about data and the insights that large amounts of data can provide and has the ability to contribute major novel innovations for our team. The role will focus on working with a team of product and program managers, engineering leaders and business leaders to build pipelines and data analysis tools to help the organization run it’s business better. The role will focus on business insights, deep data and trend analysis, operational monitoring and metrics as well as new ideas we haven’t had yet (but you’ll help us have!). The ideal candidate will possess both a data engineering background and a strong business acumen that enables him/her to think strategically and add value to help us improve the RDS customer experience. He/she will experience a wide range of problem solving situations, strategic to real-time, requiring extensive use of data collection and analysis techniques such as data mining and machine learning. In addition, the data engineering role will act as a foundation for the business intelligence team and be forward facing to all levels within the organization.


· Develop and improve the current data architecture for RDS

· Drive insights into how our customers use RDS, how successful they are, where our revenue trends are going up or down, how we are helping customers have a remarkable experience, etc.

· Improve upon the data ingestion models, ETLs, and alarming to maintain data integrity and data availability.

· Keep up to date with advances in big data technologies and run pilots to design the data architecture to scale with the increased data sets of RDS.

· Partner with BAs across teams such as product management, operations, sales, marketing and engineering to build and verify hypotheses.

· Manage and report via dashboards and papers the results of daily, weekly, and monthly reporting


Basic Qualifications
Basic Qualifications

· Bachelor's Degree in Computer Science or a related technical field.

· 6+ years of experience developing data management systems, tools and architectures using SQL, databases, Redshift and/or other distributed computing systems.

· Familiarity with new advances in the data engineering space such as EMR and NoSQL technologies like Dynamo DB.

· Experience designing and operating very large Data Warehouses.

· Demonstrated strong data modelling skills in areas such as data mining and machine learning.

· Proficient in Oracle, Linux, and programming languages such as R, Python, Ruby or Java.

· Skilled in presenting findings, metrics and business information to a broad audience consisting of multiple disciplines and all levels or the organizations.

· Track record for quickly learning new technologies.

· Solid experience in at least one business intelligence reporting tool, e.g. Tableau.

· An ability to work in a fast-paced environment where continuous innovation is occurring and ambiguity is the norm.


Preferred Qualifications
Preferred Qualification

· Master’s degree in Information Systems or a related field.

· Capable of investigating, familiarizing and mastering new datasets quickly.

· Knowledge of a programming or scripting language (R, Python, Ruby, or JavaScript).

· Experience with MPP databases such as Greenplum, Vertica, or Redshift

· Experience with Java and Map Reduce frameworks such as Hive/Hadoop.

· 1+ years of experience managing an Analytic or Data Engineering team.

· Strong organizational and multitasking skills with ability to balance competing priorities.

Amazon.com - 8 
```

### Part 8: Clustering by Hard Skills by Job Name (NLP)