# Initial Data Exploration - Reed API #

Here I connect to the Reed API and request some data. This lets me view the format of the data returned by the API, so I can decide which fields are relevant for the **extract** stage of the **ETL** process.

### Connecting to the API ###

Import libraries.

In [1]:
import requests
import json
import os.path
import csv
import pandas as pd
from datetime import datetime

I created a function that calls the Reed API and returns the results as a list of dictionaries.

In [2]:
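def search_jobs(keywords, locationName, distance_in_miles):
    """Call the Reed search endpoint and return the 'results' list of job dicts."""

    # HTTP Basic auth: the API key is the username and the password is blank.
    # Assumption: the key is read from a REED_API_KEY environment variable
    # instead of being hard-coded in the notebook.
    api_key = os.environ["REED_API_KEY"]

    url = "https://www.reed.co.uk/api/1.0/search"
    # Passing params lets requests URL-encode values such as "data engineer".
    params = {
        "keywords": keywords,
        "locationName": locationName,
        "distanceFromLocation": distance_in_miles,
    }

    response = requests.get(url, params=params, auth=(api_key, ""))
    response.raise_for_status()  # stop early on HTTP errors
    return response.json()["results"]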
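
One detail worth checking is how search terms that contain spaces, such as "data engineer", end up in the query string. The sketch below uses only the standard library to show the encoding. `build_search_url` is a hypothetical helper for illustration; it is not part of this notebook or of the Reed API.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.reed.co.uk/api/1.0/search"

def build_search_url(keywords, location_name, distance_in_miles):
    # Hypothetical helper: urlencode escapes spaces and other reserved characters.
    query = urlencode({
        "keywords": keywords,
        "locationName": location_name,
        "distanceFromLocation": distance_in_miles,
    })
    return f"{BASE_URL}?{query}"

print(build_search_url("data engineer", "Brighton", 10))
```

The space in "data engineer" is encoded as `+`, so the request stays well-formed for multi-word job titles.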

### Request Data ###

Search for all job listings that contain the keyword 'data' and save the result to the `data` variable.

In [3]:
data = search_jobs("data","","")

### Explore Data ###

Check the number of jobs returned in the response.

In [4]:
len(data)

100

It appears the API response is limited to 100 results and there is no pagination within the response. As I will need more than 100 results for this project, I will make my API requests more specific.  
My idea is to create a list of UK cities and run a separate API request for each city.

In [5]:

uk_cities = ['London',
 'Birmingham',
 'Manchester',
 'Glasgow',
 'Leeds',
 'Liverpool',
 'Bristol',
 'Sheffield',
 'Edinburgh',
 'Leicester',
 'Coventry',
 'Nottingham',
 'Hull',
 'Stoke-on-Trent',
 'Derby',
 'Southampton',
 'Reading',
 'Luton',
 'Wolverhampton',
 'Belfast',
 'Cambridge',
 'Brighton',
 'Oxford',
 'Newcastle',
 'Cardiff',
 'York']

len(uk_cities)

26

In [6]:
for city in uk_cities:
    city_results = search_jobs("data engineer", city, 10)
    print(city, len(city_results))

London 100
Birmingham 100
Manchester 100
Glasgow 37
Leeds 92
Liverpool 74
Bristol 100
Sheffield 59
Edinburgh 32
Leicester 51
Coventry 100
Nottingham 84
Hull 18
Stoke-on-Trent 55
Derby 52
Southampton 75
Reading 100
Luton 76
Wolverhampton 52
Belfast 8
Cambridge 42
Brighton 26
Oxford 52
Newcastle 62
Cardiff 31
York 23
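
Since the API appears to cap each response at 100 results, any city that returns exactly 100 has probably been truncated. A minimal sketch of that check, using the counts printed above; `RESULT_CAP`, `city_counts` and `capped_cities` are names introduced here for illustration.

```python
RESULT_CAP = 100  # assumed per-request limit, based on the responses seen so far

# Counts copied from the output of the city loop.
city_counts = {
    "London": 100, "Birmingham": 100, "Manchester": 100, "Glasgow": 37,
    "Leeds": 92, "Liverpool": 74, "Bristol": 100, "Sheffield": 59,
    "Edinburgh": 32, "Leicester": 51, "Coventry": 100, "Nottingham": 84,
    "Hull": 18, "Stoke-on-Trent": 55, "Derby": 52, "Southampton": 75,
    "Reading": 100, "Luton": 76, "Wolverhampton": 52, "Belfast": 8,
    "Cambridge": 42, "Brighton": 26, "Oxford": 52, "Newcastle": 62,
    "Cardiff": 31, "York": 23,
}

# A count at the cap suggests there may be more listings than were returned.
capped_cities = [city for city, n in city_counts.items() if n >= RESULT_CAP]
print(capped_cities)
```

Cities that hit the cap are the ones where a narrower search is most likely to surface additional listings.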


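To see what a single listing looks like, I will request the Brighton results and inspect one record.

In [7]:
brighton_jobs = search_jobs("data engineer", "Brighton", 10)

In [8]:
brighton_jobs[20]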

{'jobId': 53990627,
 'employerId': 617997,
 'employerName': 'In Technology Group Limited',
 'employerProfileId': None,
 'employerProfileName': None,
 'jobTitle': 'Analytics Consultant',
 'locationName': 'Brighton',
 'minimumSalary': 55000.0,
 'maximumSalary': 75000.0,
 'currency': 'GBP',
 'expirationDate': '29/11/2024',
 'date': '08/11/2024',
 'jobDescription': 'Title: Analytics Consultant (GCP) Salary: 55,000 - 75,000 Bonus Location: Brighton / Fully Remote About Us The client is a consultancy based in Brighton. They work with organisations to modernise and scale their data analytics capabilities based on a modern data stack based on Google Cloud, DBT Labs, Looker, Cube, Dagster, and other partner technology. We work with our clients to design, build and support innovative analytics solutions that empo... ',
 'applications': 27,
 'jobUrl': 'https://www.reed.co.uk/jobs/analytics-consultant/53990627'}
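
The record shows that `date` and `expirationDate` arrive as day/month/year strings, and that `jobDescription` is cut off with a trailing ellipsis. As a rough sketch of how a few fields might be pulled out and the dates parsed (the helper name `summarise_job` and the choice of fields are illustrative assumptions, not a decision made in this notebook):

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # matches values such as '08/11/2024'

def summarise_job(job):
    # Hypothetical helper: keep a handful of fields and parse the date strings.
    posted = datetime.strptime(job["date"], DATE_FORMAT).date()
    expires = datetime.strptime(job["expirationDate"], DATE_FORMAT).date()
    return {
        "jobId": job["jobId"],
        "jobTitle": job["jobTitle"],
        "locationName": job["locationName"],
        "minimumSalary": job["minimumSalary"],
        "maximumSalary": job["maximumSalary"],
        "posted": posted,
        "expires": expires,
        "days_listed": (expires - posted).days,
    }

# Subset of the Brighton record shown above.
sample = {
    "jobId": 53990627,
    "jobTitle": "Analytics Consultant",
    "locationName": "Brighton",
    "minimumSalary": 55000.0,
    "maximumSalary": 75000.0,
    "expirationDate": "29/11/2024",
    "date": "08/11/2024",
}
summary = summarise_job(sample)
print(summary["posted"], summary["expires"], summary["days_listed"])
```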

To further increase the number of data jobs collected, I will search for different tech job titles in each city.

In [9]:
data_tech_job_titles = [
    "Data Scientist",
    "Data Analyst",
    "Data Engineer",
    "Cloud Engineer",
    "Software Engineer",
    "Software Developer",
    "Frontend Developer",
    "Backend Developer",
    "Full Stack Developer",
    "DevOps Engineer",
    "Machine Learning Engineer",
    "AI Engineer",
    "Mobile App Developer",
    "Game Developer",
]

In [10]:
len(data_tech_job_titles)

14

I have created a list of 14 job titles. I will use the API to search for each job title in each city in my `uk_cities` list, which should extract enough data for a good analysis.
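
With 14 titles and 26 cities, this plan means 14 × 26 = 364 requests. Because each search covers a radius around the city, neighbouring cities could return the same listing more than once, and one listing could also match several titles. A sketch of the loop, assuming `jobId` uniquely identifies a listing; `collect_jobs` is a hypothetical helper, and the `search` argument stands in for `search_jobs` so the example runs without network access:

```python
def collect_jobs(titles, cities, search, distance_in_miles=10):
    # Hypothetical helper: run one search per (title, city) pair and
    # keep the first copy of each listing, keyed on jobId.
    jobs_by_id = {}
    for title in titles:
        for city in cities:
            for job in search(title, city, distance_in_miles):
                jobs_by_id.setdefault(job["jobId"], job)
    return list(jobs_by_id.values())

# Stand-in for search_jobs that returns made-up records, for illustration only.
def fake_search(keywords, location_name, distance_in_miles):
    listings = {
        "Birmingham": [{"jobId": 1}, {"jobId": 2}],
        "Coventry": [{"jobId": 2}, {"jobId": 3}],
    }
    return listings.get(location_name, [])

jobs = collect_jobs(["Data Engineer"], ["Birmingham", "Coventry"], fake_search)
print([job["jobId"] for job in jobs])
```

In the real run, `search_jobs` would be passed as the `search` argument along with the full title and city lists.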

The next stage is the **extract** phase of the **ETL** pipeline, in which I will extract the data from the API and store it in a pandas DataFrame ready for **transformation**.