# Internship and Junior data analyst job finder

To run the `.ipynb` Jupyter notebook, see [How To Set Up Jupyter Notebook with Python 3 on Ubuntu 18.04](https://www.digitalocean.com/community/tutorials/how-to-set-up-jupyter-notebook-with-python-3-on-ubuntu-18-04).


## Goal:

Search for people who have held ML internships or junior positions around Europe, and filter by company size. The objective is to build a list of companies that hire for such positions. I also keep a count of repeated positions at the same company, as it helps decide which companies to put on the watchlist.
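The filtering step described above can be sketched roughly as follows. The notebook's real matcher lives in `utils.helper` and is not shown here, so everything in this block is an assumption: the function name `find_matching_experience`, the title regex, the size bounds, and the guess that each experience entry carries the company size under `company.employeeCountRange.start`.

```python
import re

# Hypothetical title pattern and size bounds; tune them to your own search.
TITLE_PATTERN = re.compile(r"\b(intern(ship)?|junior|trainee)\b", re.IGNORECASE)
MIN_SIZE, MAX_SIZE = 11, 1000


def find_matching_experience(experiences, min_size=MIN_SIZE, max_size=MAX_SIZE):
    """Return the first experience whose title and company size match, else None."""
    for exp in experiences:
        title = exp.get("title", "")
        if not TITLE_PATTERN.search(title):
            continue
        # Assumed layout: exp["company"]["employeeCountRange"]["start"]
        size = exp.get("company", {}).get("employeeCountRange", {}).get("start")
        if size is not None and min_size <= size <= max_size:
            return exp
    return None
```

Returning the matching experience entry (rather than just `True`) makes it easy to read the company name and title from it afterwards.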

To get people from LinkedIn, I will use Tom Quirk's unofficial [linkedin-api](https://github.com/tomquirk/linkedin-api).
I will also use `dotenv` to load the username and password safely. Create a `.env` file and populate it with the environment variables that you don't want others to see; this also keeps the scripts cleaner.
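A `.env` file is a plain text file with one `KEY=value` pair per line, here `LINKEDIN_USERNAME=...` and `LINKEDIN_PASSWORD=...`. Since `os.environ.get` silently returns `None` when a variable is missing, a small guard can fail early with a clearer message. This is only a sketch; `require_env` is a hypothetical helper, not part of the notebook or of `dotenv`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical message; the point is to fail before any API call is made.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

With this in place, `require_env("LINKEDIN_USERNAME")` could stand in for the plain `os.environ.get(...)` lookup once the `.env` file has been loaded.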

In [1]:
import re
import os
from diskcache import Cache
from linkedin_api import Linkedin
from utils.helper import *

%load_ext dotenv
%dotenv

LINKEDIN_USERNAME = os.environ.get("LINKEDIN_USERNAME")
LINKEDIN_PASSWORD = os.environ.get("LINKEDIN_PASSWORD")

linkedin = Linkedin(LINKEDIN_USERNAME, LINKEDIN_PASSWORD)

### Get the search results, NOT cached

Region codes: https://developer.linkedin.com/docs/reference/geography-codes
To obtain other codes, see https://i.imgur.com/lcHeXRa.png

Translation:
    nl:0 means all NL
    de:4944 means Berlin Area in Germany

See https://github.com/tomquirk/linkedin-api/blob/252add3bda1a9ce1f50fcd5419aac05dbf81498c/linkedin_api/linkedin.py#L113 for more about the params.
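Because `<country>:0` selects a whole country, the `regions` list can be generated from a list of country codes rather than typed by hand. A minimal sketch, where `whole_country_regions` is a hypothetical helper name:

```python
# Country codes taken from the search in this notebook.
COUNTRY_CODES = ["nl", "de", "fr", "gb", "es", "ch", "dk", "fi", "pt", "se", "be"]


def whole_country_regions(country_codes):
    """Build '<country>:0' region codes, where 0 means the whole country."""
    return [f"{code}:0" for code in country_codes]


regions = whole_country_regions(COUNTRY_CODES)
print(regions)
```

To narrow a country down to one area, replace its entry with a specific code such as `de:4944` for the Berlin area.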

In [2]:
# raise Exception("Comment me out if you want to load new data")

results = linkedin.search_people(
    keywords='Junior data analyst',
    regions=["nl:0", "de:0", "fr:0", "gb:0", "es:0", "ch:0", "dk:0", "fi:0", "pt:0", "se:0", "be:0"],
    profile_languages=['en'],
    limit=(49 * 1),  # limit is 49 per page, so we get multiples of that
#     network_depth="O"
)

print(f"{len(results)} results retrieved")

[DEBUG] fetching https://www.linkedin.com/voyager/api/search/blended?count=49&filters=List(resultType-%3EPEOPLE,geoRegion-%3Enl%3A0%7Cde%3A0%7Cfr%3A0%7Cgb%3A0%7Ces%3A0%7Cch%3A0%7Cdk%3A0%7Cfi%3A0%7Cpt%3A0%7Cse%3A0%7Cbe%3A0,profileLanguage-%3Een)&origin=GLOBAL_SEARCH_HEADER&q=all&start=0&queryContext=List(spellCorrectionEnabled-%3Etrue,relatedSearchesEnabled-%3Etrue,kcardTypes-%3EPROFILE%7CCOMPANY)&keywords=Junior+data+analyst
49 results retrieved


### Get each user profile from the cache; if it is not cached, make an API call and save the result in the cache

Now we loop over the search results, using DiskCache (stored in `./cache.db`) with no expiry on the cached objects.
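The hit-or-miss logic can be isolated into a small helper. The sketch below is an illustration, not the notebook's code: `get_or_fetch` and `fake_fetch` are hypothetical names, and a plain `dict` stands in for the cache so the example runs without a LinkedIn account. A `diskcache.Cache` also supports `in`, `cache[key]`, and `cache[key] = value`, so it should be usable in place of the `dict`.

```python
def get_or_fetch(cache, key, fetch):
    """Return (value, was_hit); on a miss, call fetch(key) and store the result."""
    if key in cache:
        return cache[key], True
    value = fetch(key)
    cache[key] = value
    return value, False


# Demo with a dict and a fake fetcher that records how often it is called.
calls = []


def fake_fetch(key):
    calls.append(key)
    return {"urn_id": key}


cache = {}
get_or_fetch(cache, "abc", fake_fetch)  # miss: fetches and stores
get_or_fetch(cache, "abc", fake_fetch)  # hit: served from the cache
print(len(calls))  # prints 1
```

Keeping the fetch behind a single function makes it harder to trigger an API call by accident.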

In [8]:
found = []

# The cache holds every profile fetched so far, so you can loop over it without re-fetching each time.
# The cache is saved in ./cache.db
with Cache("./") as cache:
    i = 0
    for res in results:
        cache_id = res["urn_id"]
        i += 1
        
        if cache_id not in cache:
            print(f"Miss, getting profile from API and saving {i}/{len(results)}")
            profile = linkedin.get_profile(urn_id=res["urn_id"])
            cache.set(cache_id, profile)
            headline = profile['headline'] if 'headline' in profile else ''
            print(f"Miss, get+save: {profile['lastName']} [{headline}] from {profile['locationName']}")
        else:
            print("Hit, getting profile from cache")
            profile = cache.get(cache_id)
        
        match = chk_match(profile["experience"], verbose=True)
        if match:
            found.append(res)
        
print(f"\n\n\n{len(found)} users found")

Hit, getting profile from cache

---
Junior Data Analyst @ Datactics (Computer Software)
Belfast, United Kingdom, size 11
from {'month': 7, 'year': 2015} to {'month': 1, 'year': 2016}

Description: • Worked on design, implementation, and testing of data quality management projects for financial data using company's in-house developed tool FlowDesigner.
• Implemented and managed dashboards for Data Quality Management using QlikView. 
• Worked on analysis of big volumes of data focusing on data cleansing, normalization, fuzzy matching.
---

Hit, getting profile from cache
Hit, getting profile from cache

---
Data Scientist Intern @ myTomorrows (Hospital & Health Care)
Anthony Fokkerweg 61, 1059 CP Amsterdam, size 51
from {'month': 9, 'year': 2018} to {'month': 2, 'year': 2019}

Description: Task:  Analyzing the behavior of website visitors, positioning visitors into the marketing funnel and   
           predict helped website visitors based on their interaction within the website.

Acti

[DEBUG] fetching https://www.linkedin.com/voyager/api/identity/profiles/ACoAACMpC4gB8S-BuTwWZv0fBJjmf9BpuIbRjx0/profileView
Miss, get+save: Hobeiche [Junior Data Analyst at FocusEconomics] from Barcelona, Catalonia, Spain

---
Junior Data Analyst @ FocusEconomics (Information Services)
Barcelona Area, Spain, size 11
from {'month': 8, 'year': 2018} to ?
---

Miss, getting profile from API and saving 20/49
[DEBUG] fetching https://www.linkedin.com/voyager/api/identity/profiles/ACoAABxXjN4BF1oyd8_C3q8unfBEM29KGSRvOL8/profileView
Miss, get+save: Talic [Data Analyst at trivago] from Düsseldorf, North Rhine-Westphalia, Germany
Miss, getting profile from API and saving 21/49
[DEBUG] fetching https://www.linkedin.com/voyager/api/identity/profiles/ACoAABu2S-IBYCvZwqQsQfHekEvd-wc9pZxzinA/profileView
Miss, get+save: N. [Data Analyst | Data Scientist | Researcher | Digital Consultant] from Berlin, Berlin, Germany
Miss, getting profile from API and saving 22/49
[DEBUG] fetching https://www.linkedin

Miss, get+save: Makarov [Data Analyst at PEAT GmbH] from Germany
Miss, getting profile from API and saving 36/49
[DEBUG] fetching https://www.linkedin.com/voyager/api/identity/profiles/ACoAABnLrE4B0c8ZZVDB4G-sZ8B1ucRD8r0K940/profileView
Miss, get+save: Adam [Junior Data Scientist for Medical Devices @ Fresenius Medical Care] from Berlin, Berlin, Germany
Miss, getting profile from API and saving 37/49
[DEBUG] fetching https://www.linkedin.com/voyager/api/identity/profiles/ACoAACCAyH4BC8ml7VxwmCeOCXnheCc_g842QLY/profileView
Miss, get+save: Zhu [Junior data analyst] from Netherlands
Miss, getting profile from API and saving 38/49
[DEBUG] fetching https://www.linkedin.com/voyager/api/identity/profiles/ACoAAAfo7yMBtCmvCY6CR4b_xzppXEfgznwM3Lo/profileView
Miss, get+save: Crucello [Researcher - Genetics and Microbiology | Data analyst] from Berlin, Berlin, Germany

---
Data Analyst Trainee @ Ubiqum Code Academy (Internet)
Berlin Area, Germany, size 11
from {'month': 11, 'year': 2018} to {'mont

### To avoid getting your LinkedIn account suspended, always use the cache and avoid making extra calls to LinkedIn

In [3]:
found = []
found_companies = {}

# The cache holds every profile fetched so far, so you can loop over it without re-fetching each time.

with Cache("./") as cache:
    for key in cache:        
        profile = cache.get(key)        
        exp = chk_match(profile['experience'], verbose=False)
        
        if exp:
            found.append(profile)
            cid = exp["companyName"]
            
            if cid in found_companies: # add
                found_companies[cid]['count'] += 1
                found_companies[cid]['titles'].append(exp["title"])
            else: # new
                found_companies[cid] = {
                    'name': exp["companyName"], 
                    'count': 1, 
                    'company': exp["company"] if "company" in exp else [], 
                    'locationName': exp["locationName"] if "locationName" in exp else "unknown", 
                    'titles': [exp["title"]]
                }

print(f"\nFound {len(found)} users out of {len(cache)}")
print(f"\nFound {len(found_companies)} companies with jobs matching our search/regex criteria. We can check them out and monitor them for future jobs.")


Found 23 users out of 49

Found 18 companies with jobs matching our search/regex criteria. We can check them out and monitor them for future jobs.
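To turn these counts into a watchlist, the companies can be sorted by how often a matching position appears. The sketch below assumes a dictionary shaped like `found_companies` above; `rank_companies` is a hypothetical helper, and the sample entries are made-up placeholders, not results from this search.

```python
def rank_companies(found_companies, min_count=1):
    """Return company records sorted by descending count, then by name."""
    rows = [c for c in found_companies.values() if c["count"] >= min_count]
    return sorted(rows, key=lambda c: (-c["count"], c["name"]))


# Placeholder data with the same shape as found_companies (not real results).
sample = {
    "Company A": {"name": "Company A", "count": 1, "titles": ["Junior Data Analyst"]},
    "Company B": {"name": "Company B", "count": 3, "titles": ["Data Intern"] * 3},
    "Company C": {"name": "Company C", "count": 1, "titles": ["Data Analyst Trainee"]},
}

for company in rank_companies(sample):
    print(f"{company['count']:>3}  {company['name']}  {sorted(set(company['titles']))}")
```

Passing `min_count=2` keeps only companies that filled such a position more than once.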
