# Internship and Junior data analyst job finder

To run this `.ipynb` Jupyter notebook, see [How To Set Up Jupyter Notebook with Python 3 on Ubuntu 18.04](https://www.digitalocean.com/community/tutorials/how-to-set-up-jupyter-notebook-with-python-3-on-ubuntu-18-04).


## Goal:

Search for people who have held ML internships or junior positions around Europe, then filter by company size. The objective is to build a list of companies that hire for such positions. I also keep a count of repeated positions at the same company, since it helps decide which companies to put on the watchlist.

To get people from LinkedIn, I will use Tom Quirk's unofficial [linkedin-api](https://github.com/tomquirk/linkedin-api).
I will also use `dotenv` to load the username and password safely. Create a `.env` file and populate it with the environment variables that you don't want others to see; this also keeps the scripts cleaner.
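
As a small safeguard that is not part of the original notebook, the credentials can be checked right after loading, because `os.environ.get` returns `None` silently when a variable is missing. A minimal sketch; `require_env` is a hypothetical helper, and the demo variable name is made up so the example runs anywhere:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Throwaway variable standing in for LINKEDIN_USERNAME / LINKEDIN_PASSWORD
os.environ["DEMO_LINKEDIN_USERNAME"] = "someone@example.com"
print(require_env("DEMO_LINKEDIN_USERNAME"))
```

In the notebook this would mean calling `require_env("LINKEDIN_USERNAME")` in place of `os.environ.get("LINKEDIN_USERNAME")`, so a missing `.env` entry fails before any login attempt.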

In [5]:
import re
import os
from diskcache import Cache
from linkedin_api import Linkedin
from utils.helper import *

%load_ext dotenv
%dotenv

LINKEDIN_USERNAME = os.environ.get("LINKEDIN_USERNAME")
LINKEDIN_PASSWORD = os.environ.get("LINKEDIN_PASSWORD")

linkedin = Linkedin(LINKEDIN_USERNAME, LINKEDIN_PASSWORD)

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


### Get the search results, NOT cached

Region codes: https://developer.linkedin.com/docs/reference/geography-codes
To obtain other codes, see https://i.imgur.com/lcHeXRa.png

Translation:
    nl:0 means all NL
    de:4944 means Berlin Area in Germany

See https://github.com/tomquirk/linkedin-api/blob/252add3bda1a9ce1f50fcd5419aac05dbf81498c/linkedin_api/linkedin.py#L113 for more about the params.
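
Since the country-wide codes all follow the `xx:0` pattern from the translation note, the region list can be generated rather than typed by hand. A minimal sketch; `country_wide_regions` is a hypothetical helper, not part of linkedin-api:

```python
def country_wide_regions(country_codes):
    """Build 'xx:0' (whole-country) region codes from two-letter country codes."""
    return [f"{code.lower()}:0" for code in country_codes]


regions = country_wide_regions(["NL", "DE", "FR"])
print(regions)  # ['nl:0', 'de:0', 'fr:0']
```

City-level codes such as `de:4944` still have to be looked up and added manually.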

In [2]:
# raise Exception("Comment me out if you want to load new data")

results = linkedin.search_people(
    keywords='Junior data analyst',
    regions=["nl:0", "de:0", "fr:0", "gb:0", "es:0", "ch:0", "dk:0", "fi:0", "pt:0", "se:0", "be:0"],
    profile_languages=['en'],
    limit=(49 * 1),  # limit is 49 per page, so we get multiples of that
    # network_depth="O"
)

print(f"{len(results)} results retrieved")

49 results retrieved


### Get each user profile from the cache; on a miss, call the API and save the result

Now we loop over the search results, using DiskCache (stored in ./cache.db) with no expiry on the cached objects.
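
The hit/miss logic can be isolated into one function. A minimal sketch, assuming only that the cache supports `in` and item access (a plain `dict` is used here; `diskcache.Cache` offers the same mapping interface). `get_or_fetch` and `fake_fetch` are hypothetical names, and no real API call is made:

```python
def get_or_fetch(cache, key, fetch):
    """Return (value, was_hit); call fetch(key) and store the result only on a miss."""
    if key in cache:
        return cache[key], True
    value = fetch(key)
    cache[key] = value
    return value, False


calls = []


def fake_fetch(urn_id):
    # Stand-in for linkedin.get_profile(urn_id=...); records each simulated API call
    calls.append(urn_id)
    return {"urn_id": urn_id, "experience": []}


cache = {}
for urn_id in ["a1", "b2", "a1"]:
    profile, hit = get_or_fetch(cache, urn_id, fake_fetch)
    print(urn_id, "hit" if hit else "miss")

print(len(calls))  # 2: the repeated id was served from the cache
```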

In [8]:
found = []

# The cache holds every profile fetched so far, so later runs can loop
# over them without calling the API again.
# Cache saved in ./cache.db
with Cache("./") as cache:
    for i, res in enumerate(results, start=1):
        cache_id = res["urn_id"]

        if cache_id not in cache:
            print(f"Miss, getting profile from API and saving {i}/{len(results)}")
            profile = linkedin.get_profile(urn_id=cache_id)
            cache.set(cache_id, profile)
            headline = profile.get('headline', '')
            print(f"Miss, get+save: {profile['lastName']} [{headline}] from {profile['locationName']}")
        else:
            print("Hit, getting profile from cache")
            profile = cache.get(cache_id)

        # chk_match comes from utils.helper; a truthy result means an experience entry matched
        match = chk_match(profile["experience"], verbose=True)
        if match:
            found.append(res)

print(f"\n\n\n{len(found)} users found")

Hit, getting profile from cache
Hit, getting profile from cache
Miss, getting profile from API and saving 3/49
Miss, get+save: S. [Junior Data Analyst at WonderLoudly] from Germany

---
Junior Data Analyst @ WonderLoudly (?)
Berlin Metropolitan Area, size 0
from {'month': 1, 'year': 2020} to ?
id: urn:li:fs_geo:90009712

Description: Building and developing custom made data analytics products tailored  make for the clients' need
Such as:     
- Activity monitoring dashboards using Python(Matplotlib, Numpy)
- Data migration tool using Python APIs  (Monday, Google drive, Tableau)
- CV  analysis tool using Python NLP libraries  
---

Miss, getting profile from API and saving 4/49
Miss, get+save: Kaur [Junior Data Analyst at E.ON Energie Deutschland GmbH] from Potsdam, Brandenburg, Germany

---
Junior Data Analyst @ E.ON Energie Deutschland GmbH (?)
Potsdam Area, Germany, size 11
from {'month': 9, 'year': 2019} to ?
id: urn:li:fs_geo:103091164

Description: * Developing database queries an

Miss, get+save: Ariza [Junior Data Analyst ] from Netherlands
Miss, getting profile from API and saving 22/49
Miss, get+save: Farrés [Business Data Analyst en Adevinta] from Spain
Miss, getting profile from API and saving 23/49
Miss, get+save: Behairy [Data Science Analyst] from Germany
Miss, getting profile from API and saving 24/49
Miss, get+save: Thornewill von Essen [💻 Data Scientist | ⚗️ Bayesian | 🐢 Persistence] from Germany

---
Junior Data Analyst @ Goodgame Studios (Computer Games)
Hamburg Area, Germany, size 201
from {'month': 8, 'year': 2018} to ?
id: urn:li:fs_geo:90009725

Description: ► Created and maintained multiple Tableau dashboards (hosted on Tableau Server) for use in marketing optimisation.
► Wrote multiple production level ETLs using SQL to transform raw data into usable data sources.
► Performed multiple ad-hoc analyses for various stakeholders throughout the company ranging from Bayesian AB Testing to using machine learning (ML) algorithms (Decision Trees/Logist
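
The `chk_match` helper used in the cell above lives in `utils.helper` and is not shown in this notebook. As a rough illustration only, a matcher of this kind might combine a title regex with a company-size threshold. Everything in this sketch is an assumption: the function name `match_experience`, the regex, the threshold, and the nested `company["employeeCountRange"]["start"]` layout are hypothetical and may differ from the real helper and the real profile data:

```python
import re

# Hypothetical pattern and threshold; the real criteria live in utils.helper
TITLE_PATTERN = re.compile(r"\b(intern(ship)?|junior|trainee)\b", re.IGNORECASE)
MIN_COMPANY_SIZE = 10


def match_experience(experiences, min_size=MIN_COMPANY_SIZE):
    """Return the first experience entry whose title and company size match, else None."""
    for exp in experiences:
        title = exp.get("title", "")
        # Assumed layout: exp["company"]["employeeCountRange"]["start"]
        size = exp.get("company", {}).get("employeeCountRange", {}).get("start", 0)
        if TITLE_PATTERN.search(title) and size >= min_size:
            return exp
    return None


# Made-up sample entries, not real profile data
sample = [
    {"title": "Senior Analyst", "companyName": "A",
     "company": {"employeeCountRange": {"start": 500}}},
    {"title": "Junior Data Analyst", "companyName": "B",
     "company": {"employeeCountRange": {"start": 201}}},
]
hit = match_experience(sample)
print(hit["companyName"] if hit else "no match")
```

Returning the matching entry itself, rather than `True`, fits how the result is used later in the notebook, where `exp["companyName"]` and `exp["title"]` are read from the return value.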

### To avoid getting your LinkedIn account suspended, always use the cache and avoid extra calls to LinkedIn

In [7]:
found = []
found_companies = {}

# The cache holds all the users, so loop over those instead of reloading every time

with Cache("./") as cache:
    for key in cache:        
        profile = cache.get(key)        
        exp = chk_match(profile['experience'], verbose=False)
        
        if exp:
            found.append(profile)
            cid = exp["companyName"]
            
            if cid in found_companies: # add
                found_companies[cid]['count'] += 1
                found_companies[cid]['titles'].append(exp["title"])
            else: # new
                found_companies[cid] = {
                    'name': exp["companyName"], 
                    'count': 1, 
                    'company': exp["company"] if "company" in exp else [], 
                    'locationName': exp["locationName"] if "locationName" in exp else "unknown", 
                    'titles': [exp["title"]]
                }

print(f"\nFound {len(found)} users out of {len(cache)}")
print(f"\nFound {len(found_companies)} companies with jobs matching our search/regex criteria. We can check them out and monitor them for future jobs.")


Found 0 users out of 2

Found 0 companies with jobs matching our search/regex criteria. We can check them out and monitor them for future jobs.
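
Once `found_companies` is populated, the per-company counts can be turned into a watchlist by sorting. A minimal sketch with made-up sample data in the same shape as the dictionary built in the cell above (the company names and counts are illustrative, not results from this notebook):

```python
# Illustrative data only, shaped like found_companies in the cell above
found_companies = {
    "Example GmbH": {"name": "Example GmbH", "count": 3, "titles": ["Junior Data Analyst"] * 3},
    "Sample BV": {"name": "Sample BV", "count": 1, "titles": ["Data Analyst Intern"]},
    "Demo Ltd": {"name": "Demo Ltd", "count": 2, "titles": ["Junior Analyst", "Intern"]},
}

# Most frequent employers first: these are the ones worth watching
watchlist = sorted(found_companies.values(), key=lambda c: c["count"], reverse=True)

for company in watchlist:
    print(f"{company['count']:>2}  {company['name']}")
```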
