# Company finder based on user's current job


## Goal:

Search for people who have held machine learning, data science or data analytics internships or junior positions around Europe, and filter by company size. The objective is to build a list of companies that hire for such positions. I also count how many matching positions turn up at the same company, since that helps decide which companies to put on the watchlist.

To get people from LinkedIn, I will use Tom Quirk's unofficial [linkedin-api](https://github.com/tomquirk/linkedin-api).
I will also use `dotenv` to load the username and password safely. Create a `.env` file and populate it with the environment variables you want to keep out of the notebook; this also keeps the scripts cleaner.
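
If a variable is missing from `.env`, `os.environ.get` quietly returns `None` and the problem only shows up later. A minimal sketch of a stricter lookup, assuming the two variable names used in this notebook; `require_env` is a hypothetical helper and not part of `dotenv` or `linkedin_api`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Expected .env layout (values are placeholders):
#   LINKEDIN_USERNAME=you@example.com
#   LINKEDIN_PASSWORD=your-password
```

With this in place, `require_env("LINKEDIN_USERNAME")` could replace the `os.environ.get(...)` call so that a missing entry stops the notebook in the first cell.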

In [1]:
import re
import os
from diskcache import Cache
from linkedin_api import Linkedin
from utils.helper import *

%load_ext dotenv
%dotenv

LINKEDIN_USERNAME = os.environ.get("LINKEDIN_USERNAME")
LINKEDIN_PASSWORD = os.environ.get("LINKEDIN_PASSWORD")

linkedin = Linkedin(LINKEDIN_USERNAME, LINKEDIN_PASSWORD)

### Get the search results, NOT cached

Region codes: https://developer.linkedin.com/docs/reference/geography-codes
To obtain other codes, see https://i.imgur.com/lcHeXRa.png

Examples:
    nl:0 means all of the Netherlands
    de:4944 means the Berlin area in Germany

See https://github.com/tomquirk/linkedin-api/blob/252add3bda1a9ce1f50fcd5419aac05dbf81498c/linkedin_api/linkedin.py#L113 for more about the params.
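
Since every entry in the search below uses the `:0` suffix for a whole country, the list can be built from plain country codes. A small sketch; `whole_country` is a hypothetical helper, and it assumes the `xx:0` convention described above holds for each country you pass in:

```python
def whole_country(country_codes):
    """Turn two-letter country codes into 'xx:0' region codes (entire country)."""
    return [f"{code.lower()}:0" for code in country_codes]


regions = whole_country(["NL", "DE", "FR", "GB"])
print(regions)
```

City-level codes such as `de:4944` still have to be looked up by hand and appended to the list.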

In [2]:
# raise Exception("Comment me out if you want to load new data")

results = linkedin.search_people(
    keywords='Junior data analyst',
    regions=["nl:0", "de:0", "fr:0", "gb:0", "es:0", "ch:0", "dk:0", "fi:0", "pt:0", "se:0", "be:0"],
    profile_languages=['en'],
    limit=(49 * 1),  # limit is 49 per page, so we get multiples of that
#     network_depth="O"
)

print(f"{len(results)} results retrieved")

49 results retrieved


### Get each user profile from the cache; if it is not cached, make an API call and save the result

Now we loop over the search results, using DiskCache (stored in `./cache.db`) with no expiry on the cached objects.
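
The cache-first lookup used in the next cell boils down to a small pattern. The sketch below uses a plain `dict` as a stand-in for the DiskCache object so it runs without any dependencies; `get_or_fetch` and `fake_fetch` are hypothetical names for illustration only:

```python
def get_or_fetch(cache, key, fetch):
    """Return (value, was_hit). Calls fetch(key) only when key is not cached."""
    if key in cache:
        return cache[key], True
    value = fetch(key)
    cache[key] = value  # store it so the next lookup is a hit
    return value, False


calls = []


def fake_fetch(urn_id):
    """Stand-in for the real API call; records how often it is invoked."""
    calls.append(urn_id)
    return {"urn_id": urn_id, "experience": []}


cache = {}
get_or_fetch(cache, "abc", fake_fetch)  # miss: fetches and stores
get_or_fetch(cache, "abc", fake_fetch)  # hit: no second fetch
print(f"API calls made: {len(calls)}")
```

The point of the pattern is that re-running the notebook costs no additional API calls for profiles that were already fetched.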

In [3]:
found = []

# The cache holds every profile fetched so far, so you can loop over
# them without calling the API every time.
# The cache is saved in ./cache.db
with Cache("./") as cache:
    for i, res in enumerate(results, start=1):
        cache_id = res["urn_id"]
        
        if cache_id not in cache:
            print(f"Miss, getting profile from API and saving {i}/{len(results)}")
            profile = linkedin.get_profile(urn_id=res["urn_id"])
            cache.set(cache_id, profile)
            headline = profile['headline'] if 'headline' in profile else ''
            print(f"Miss, get+save: {profile['lastName']} [{headline}] from {profile['locationName']}")
        else:
            print("Hit, getting profile from cache")
            profile = cache.get(cache_id)
        
        match = chk_match(profile["experience"], verbose=True)
        if(match):
            found.append(res)
        
print(f"\n\n\n{len(found)} users found")

Hit, getting profile from cache
Hit, getting profile from cache

---
Junior Data Analyst @ WonderLoudly (?)
Berlin Metropolitan Area, size 0
from {'month': 1, 'year': 2020} to ?
id: None

Description: Building and developing custom made data analytics products tailored  make for the clients' need
Such as:     
- Activity monitoring dashboards using Python(Matplotlib, Numpy)
- Data migration tool using Python APIs  (Monday, Google drive, Tableau)
- CV  analysis tool using Python NLP libraries  
---

Hit, getting profile from cache
Hit, getting profile from cache
Hit, getting profile from cache
Hit, getting profile from cache

---
Junior Data Analyst @ Meister (Internet)
Vienna, Austria, size 51
from {'month': 6, 'year': 2020} to ?
id: urn:li:fs_miniCompany:613998
---

Hit, getting profile from cache

---
Data Scientist - Trainee @ Ubiqum Code Academy (Internet)
Germany, size 11
from {'month': 2, 'year': 2019} to {'month': 7, 'year': 2019}
id: urn:li:fs_miniCompany:3011456

Description: 

### To avoid getting your LinkedIn account suspended, always use the cache and avoid making extra calls to LinkedIn

In [4]:
found = []
found_companies = {}

# The cache holds all the users, so you can loop over them without fetching them every time

with Cache("./") as cache:
    for key in cache:        
        profile = cache.get(key)        
        exp = chk_match(profile['experience'], verbose=False)
        
        if exp:
            found.append(profile)
            cid = exp["companyName"]
            
            if cid in found_companies: # add
                found_companies[cid]['count'] += 1
                found_companies[cid]['titles'].append(exp["title"])
            else: # new
                found_companies[cid] = {
                    'name': exp["companyName"], 
                    'count': 1, 
                    'id': exp.get("companyUrn", None),

                    'company': exp["company"] if "company" in exp else [], 
                    'locationName': exp["locationName"] if "locationName" in exp else "unknown", 
                    'titles': [exp["title"]]
                }

print(f"\nFound {len(found)} users out of {len(cache)}")
print(f"\nFound {len(found_companies)} companies with jobs matching our search/regex criteria. We can check them out and monitor them for future jobs.")


Found 32 users out of 75

Found 23 companies with jobs matching our search/regex criteria. We can check them out and monitor them for future jobs.
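
The matching itself happens in `chk_match`, which is imported from `utils.helper` and not shown in this notebook. As a rough idea of what such a check could look like, here is a hypothetical stand-in (`match_experience`, with made-up patterns) that returns the first experience entry whose title contains both a junior-level keyword and a data-related keyword. The real helper may work differently, for example by also looking at company size:

```python
import re

# Assumed patterns, not taken from utils.helper
LEVEL = re.compile(r"\b(junior|intern(ship)?|trainee)\b", re.IGNORECASE)
FIELD = re.compile(r"\b(data|machine learning|analytics)\b", re.IGNORECASE)


def match_experience(experience):
    """Return the first experience dict whose title matches both patterns, else None."""
    for exp in experience:
        title = exp.get("title", "")
        if LEVEL.search(title) and FIELD.search(title):
            return exp
    return None


sample = [
    {"title": "International Sales Manager", "companyName": "Example Co"},
    {"title": "Junior mentor and data analyst", "companyName": "Ubiqum Code Academy"},
]
print(match_experience(sample))
```

The word boundaries keep titles such as "International Sales Manager" from matching on the substring "intern".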


In [5]:
x=0
for key, item in found_companies.items():
    #if the company has no linkedin page, skip
    if(item['id']):
        x+=1
        print(f'''
Name: {item["name"]}
Id:   {item["id"]}
Location: {item["locationName"]}
Titles: {item["titles"]}
Company Size: between {item["company"]["employeeCountRange"]["start"]} to {item["company"]["employeeCountRange"]["end"]}
Industries: {item["company"].get("industries")}
Hits: {item["count"]}''')
        
print(f"Final number of companies that have a LinkedIn page: {x}")


Name: E.ON Energie Deutschland GmbH
Id:   urn:li:fs_miniCompany:9595025
Location: Potsdam Area, Germany
Titles: ['Junior Data Analyst']
Company Size: between 11 to 50
Industries: None
Hits: 1

Name: myTomorrows
Id:   urn:li:fs_miniCompany:3319435
Location: Anthony Fokkerweg 61, 1059 CP Amsterdam
Titles: ['Data Scientist Intern']
Company Size: between 51 to 200
Industries: ['Hospital & Health Care']
Hits: 1

Name: Ubiqum Code Academy
Id:   urn:li:fs_miniCompany:3011456
Location: Berlin Area, Germany
Titles: ['Data Analytics & Machine Learning  Trainee', 'Data Science Trainee', 'Data Analyst Trainee', 'Data Analyst Trainee', 'Junior mentor and data analyst', 'Data Analyst & Machine learning trainee', 'Data Scientist - Trainee', 'Junior Data Analyst', 'Data Analyst Trainee']
Company Size: between 11 to 50
Industries: ['Internet']
Hits: 9

Name: adviqo GmbH
Id:   urn:li:fs_miniCompany:5098944
Location: Berlin, Germany
Titles: ['Junior Data Scientist']
Company Size: between 201 to 500
Indu

Now that you have a list of companies, you can monitor them for new jobs, or filter them further based on industry, location or job title.
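
As one possible way to do that filtering, the sketch below narrows a dictionary shaped like `found_companies` by company size and location, then sorts by hit count. `shortlist` is a hypothetical helper; the sample entries are copied from the output above, and the size filter compares against the start of the employee count range, which is an assumption about what "size" should mean here:

```python
def shortlist(companies, min_size=0, max_size=None, location=None):
    """Filter companies by size range start and location, most hits first."""
    picked = []
    for item in companies.values():
        # 'company' may be missing or an empty list, so fall back to {}
        size = (item.get("company") or {}).get("employeeCountRange", {})
        start = size.get("start", 0)
        if start < min_size:
            continue
        if max_size is not None and start > max_size:
            continue
        if location and location.lower() not in item.get("locationName", "").lower():
            continue
        picked.append(item)
    return sorted(picked, key=lambda c: c["count"], reverse=True)


sample = {
    "E.ON Energie Deutschland GmbH": {
        "name": "E.ON Energie Deutschland GmbH", "count": 1,
        "locationName": "Potsdam Area, Germany",
        "company": {"employeeCountRange": {"start": 11, "end": 50}},
    },
    "myTomorrows": {
        "name": "myTomorrows", "count": 1,
        "locationName": "Anthony Fokkerweg 61, 1059 CP Amsterdam",
        "company": {"employeeCountRange": {"start": 51, "end": 200}},
    },
    "Ubiqum Code Academy": {
        "name": "Ubiqum Code Academy", "count": 9,
        "locationName": "Berlin Area, Germany",
        "company": {"employeeCountRange": {"start": 11, "end": 50}},
    },
}

for company in shortlist(sample, location="Germany"):
    print(company["name"], company["count"])
```

Sorting by `count` puts companies that repeatedly hire for these roles at the top of the watchlist.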