# 6.1 LinkedIn 1
1. Suppose that Integrify wants insights into the Machine Learning and Data Science job market in order to establish best practices and update the curriculum, so that students receive as many job offers as possible. 
2. Your tasks are the following:
    - a. Each group member will be working on one country (Finland, Netherlands, Denmark, Sweden, and Germany)
    - b. Use the following keyword sets and try to locate 20 companies in each country:
    <br><br>
    DataScience= [Data Science, Big data, Machine learning, Data mining, Artificial intelligence, Predictive modeling, Statistical analysis, Data visualization, Deep learning, Natural language processing, Business intelligence, Data warehousing, Data management, Data cleaning, Feature engineering, Time series analysis, Text analytics, Database, SQL, NoSQL, Neural networks, Regression analysis, Clustering, Dimensionality reduction, Anomaly detection, Recommender systems, Data integration, Data governance]
    <br><br>
    MachineLearning = [Machine learning, Data preprocessing, Feature selection, Feature engineering, Data visualization, Model selection, Hyperparameter tuning, Cross-validation, Ensemble methods, Neural networks, Deep learning, Convolutional neural networks, Recurrent neural networks, Natural language processing, Computer vision, Reinforcement learning, Unsupervised learning, Clustering, Dimensionality reduction, Bayesian methods, Time series analysis, Random forest, Gradient boosting, Support vector machines, Decision trees, Regression analysis]

    - c. Collect all job offers of each company for a one-year time frame. 
    - d. You will end up with a dictionary where the keys are the company names and the values are a list of dictionaries. 
    - e. The keys in the sub-dictionaries correspond to keywords, and the values correspond to the company’s posts that include those keywords. 
    - f. In total, you will produce five dictionaries, each corresponding to one of the listed countries above. 
    - g. Save each dictionary in JSON format under the name of the corresponding country.
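The target structure described in steps d to g can be sketched as follows. This is only an illustration: the company names, keywords, and post texts are made-up placeholders, and `save_country_dict` is a hypothetical helper that is not part of the assignment.

```python
import json
import os

# Hypothetical example data: company name -> list of {keyword: [matching posts]}
netherlands = {
    "Example Analytics BV": [
        {"Machine learning": ["Post text mentioning machine learning ..."]},
        {"SQL": ["Post text mentioning SQL ...", "Another SQL post ..."]},
    ],
    "Sample Data Labs": [
        {"Deep learning": ["Post text mentioning deep learning ..."]},
    ],
}

def save_country_dict(country, data, directory="."):
    """Save one country's dictionary as <country>.json and return the file path."""
    file_path = os.path.join(directory, f"{country}.json")
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    return file_path
```

Calling `save_country_dict("Netherlands", netherlands)` would then write `Netherlands.json` to the current directory; the same call is repeated once per country.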

### !!! Note !!!
I got as far as logging in, locating 20 companies, and collecting the job listing IDs, but then LinkedIn flagged my account as suspicious, so now I can't continue... (I could try a few tricks, but I don't think it's worth the risk of getting my account suspended.)
<br><br>
Next time you run this course, you might want to inform students about the risk of getting blacklisted by their most important job-seeking resource *before* you have them do the assignment :) Or, even better, don't use LinkedIn for scraping exercises (use e.g. an American job board instead).

### Class with scraping functions

In [1]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from bs4 import BeautifulSoup
from selenium import webdriver
import time

class LinkedInScraper():

    def __init__(self):
        self.driver = self.create_webdriver()
    
    def create_webdriver(self):
        # create a webdriver instance
        exe_location = r"C:\Webdriver\geckodriver.exe"
        firefox_binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"  
        options = Options()
        options.binary_location = firefox_binary_location
        service = Service(executable_path=exe_location)
        driver = webdriver.Firefox(service=service, options=options)
        return driver

    def login(self):
        # set my credentials for LinkedIn
        login_name = "nronaldvdberg@gmail.com"
        with open('linkedinpw.txt', 'r') as file:
            login_pw = file.readline().strip()       
        # perform the log in
        self.driver.get("https://linkedin.com/uas/login")
        wait = WebDriverWait(self.driver, 10)
        username = wait.until(EC.presence_of_element_located((By.ID, "username")))
        username.send_keys(login_name)
        pword = self.driver.find_element(By.ID, "password")
        pword.send_keys(login_pw)
        self.driver.find_element(By.XPATH, "//button[@type='submit']").click()
        time.sleep(3)
    
    def find_company_ids(self, keyword_string, country, max_num = 20):
        ''' Returns a list of company IDs (URNs) for a given keyword string (comma-separated keywords) and country '''
        # look up the region code    
        region_dict = {'Denmark': '104514075', 'Finland': '100456013', 'Germany': '101282230', 'Netherlands': '102890719', 'Sweden': '105117694'}
        region_code = {name.lower(): code for name, code in region_dict.items()}.get(country.lower())  # case-insensitive lookup; None if the country is unknown
        if region_code is None:
            print("Invalid country name -> returning empty list")
            return []

        # LinkedIn shows 10 results per page; loop over pages until we have enough results
        company_ids = []
        page_nr = 1
        last_page_was_empty = False
        while len(company_ids) < max_num and not last_page_was_empty:
            # perform the request
            URL  = "https://www.linkedin.com/search/results/companies/?"   # base URL  
            URL += "page=" + str(page_nr)                                  # page nr (we get 10 results per page)
            URL += "&keywords=" + keyword_string                           # keywords
            URL += '&companyHqGeo=["' + region_code + '"]'                 # specify country
            self.driver.get(URL)
            time.sleep(3)

            # extract company ids from the resulting data - they are recognized as these kinds of strings: <div class="entity-result" data-chameleon-result-urn="urn:li:company:3740012">      
            soup = BeautifulSoup(self.driver.page_source, "html.parser")
            company_results = soup.find_all("div", class_="entity-result")
            last_page_was_empty = True
            for result in company_results:
                urn_string = result.get("data-chameleon-result-urn")
                print(f"urn_string = {urn_string}")   # f-string, so a missing attribute (None) does not raise a TypeError
                if urn_string:
                    last_page_was_empty = False
                    company_id = urn_string.split(":")[-1]
                    company_ids.append(company_id)
            
            # move on to the next page (last_page_was_empty was already set in the loop above)
            page_nr += 1
            
        # return list
        return company_ids[:max_num]
    
    
    def find_job_ids(self, company_id_list, country, max_age_in_days = 365):
        ''' Gets the job post IDs found on the search results page for the companies in the specified list '''
        
        # perform the request
        company_id_string = '%2C'.join(company_id_list)
        self.driver.get("https://www.linkedin.com/jobs/search/?f_TPR=r" + str(max_age_in_days * 60 * 60 * 24) + "&f_C=" + company_id_string)
        time.sleep(5)
        
        # extract job ids from the resulting data - they are recognized as these kinds of strings: <div data-job-id="3567339171" class="job-card-container relative job-card-list
        soup = BeautifulSoup(self.driver.page_source, "html.parser")
        job_results = soup.find_all("div", {"data-job-id": True})
        job_ids = []
        for result in job_results:
            job_id = result.get("data-job-id")
            if job_id:
                job_ids.append(int(job_id))

        # return list
        return job_ids
        
    def close_driver(self):
        self.driver.quit()
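
The request URL in `find_company_ids` is assembled by plain string concatenation, so spaces, commas, quotes, and brackets end up in the URL unescaped. A minimal sketch of a safer construction with the standard library is shown below. The base URL and parameter names are taken from the code above; `build_company_search_url` is a hypothetical helper, and whether LinkedIn treats the encoded form identically is an assumption that has not been verified here.

```python
from urllib.parse import urlencode

def build_company_search_url(keywords, region_code, page=1):
    """Build a company-search URL with all query parameters percent-encoded."""
    params = {
        "page": page,
        "keywords": keywords,
        "companyHqGeo": f'["{region_code}"]',
    }
    return "https://www.linkedin.com/search/results/companies/?" + urlencode(params)

url = build_company_search_url("Data Science, Big data", "102890719", page=2)
print(url)
# https://www.linkedin.com/search/results/companies/?page=2&keywords=Data+Science%2C+Big+data&companyHqGeo=%5B%22102890719%22%5D
```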


### Main script

In [2]:
# instantiate the class
print("Instantiating LinkedInScraper object...")
lis = LinkedInScraper()

# log in
print("Logging in...")
lis.login()

# find companies
print("Searching for companies...")
ds_keyword_string = "Data Science, Big data, Machine learning, Data mining, Artificial intelligence, Predictive modeling, Statistical analysis, Data visualization, Deep learning, Natural language processing, Business intelligence, Data warehousing, Data management, Data cleaning, Feature engineering, Time series analysis, Text analytics, Database, SQL, NoSQL, Neural networks, Regression analysis, Clustering, Dimensionality reduction, Anomaly detection, Recommender systems, Data integration, Data governance"
ms_keyword_string = "Machine learning, Data preprocessing, Feature selection, Feature engineering, Data visualization, Model selection, Hyperparameter tuning, Cross-validation, Ensemble methods, Neural networks, Deep learning, Convolutional neural networks, Recurrent neural networks, Natural language processing, Computer vision, Reinforcement learning, Unsupervised learning, Clustering, Dimensionality reduction, Bayesian methods, Time series analysis, Random forest, Gradient boosting, Support vector machines, Decision trees, Regression analysis"
company_ids = lis.find_company_ids(ds_keyword_string, 'Netherlands', 10)

# find job post for each company
job_ids = lis.find_job_ids(company_ids, 'Netherlands')

# close
lis.close_driver()

Instantiating LinkedInScraper object...
Logging in...
Searching for companies...


# 6.2 LinkedIn 2
I couldn't do this exercise due to my account being flagged as suspicious - see my earlier remark about this

# 6.3 NASA satellite images - solution 1 (with selenium; slow-ish)

1. Suppose we want to build a Computer vision dataset that involves satellite images. 
2. Your tasks are the following:
    * Collect satellite images from  https://earthobservatory.nasa.gov/images
    * Make sure to render the whole page using selenium and then use BeautifulSoup to scrape the data.
    * Create a repo named Images and save the crawled images based on their titles. 
    * Create a dictionary where the keys are the images/titles and the values are the images’ descriptions.
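
Saving images "based on their titles" requires turning a free-text title into a file name that the file system accepts. The sketch below shows one way to do that, similar in spirit to the solutions that follow; `title_to_filename` and its parameters are hypothetical names chosen for illustration.

```python
def title_to_filename(title, category=None, ext=".jpg"):
    """Turn an image title into a file name that is safe on common file systems."""
    stem = title.strip().replace(" ", "_")
    # keep letters, digits, underscores, and hyphens; drop everything else
    stem = "".join(c for c in stem if c.isalnum() or c in ("_", "-"))
    if category:
        stem = f"{category}_{stem}"
    return stem + ext

print(title_to_filename("Tulare Lake Grows", category="water"))
# water_Tulare_Lake_Grows.jpg
print(title_to_filename("How Dust Affects the World’s Health"))
# How_Dust_Affects_the_Worlds_Health.jpg
```

Note that different titles can map to the same file name after sanitizing, so a real run may need an extra check before overwriting.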


### Class with scraping functions

In [4]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium import webdriver
import time
import os
import requests
from bs4 import BeautifulSoup
import json

class NasaScraper():

    # constructor
    def __init__(self, max_download_num = 100):
        ''' Initialize the driver and dictionary with meta info '''
        self.image_dict = {}
        self.driver = self.create_webdriver()
        self.max_download_num = max_download_num
        self.n_downloaded = 0
       
    def save_dictionary_to_JSON(self):
        ''' Save the dictionary with meta info to JSON file '''
        print("Saving dictionary as JSON")
        with open('nasa_images.json', 'w') as json_file:
            json.dump(self.image_dict, json_file)
    
    def create_webdriver(self):
        ''' create a webdriver -- we could move this to a separate class (or superclass), since we use this in multiple projects '''
        # create a webdriver instance
        exe_location = r"C:\Webdriver\geckodriver.exe"
        firefox_binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"  
        options = Options()
        options.binary_location = firefox_binary_location
        service = Service(executable_path=exe_location)
        driver = webdriver.Firefox(service=service, options=options)
        return driver
    
    # download and save an image 
    def download_image(self, url, category, title):
        ''' download and save the image at the specified url '''
        # create filename
        filename = category + "_" + title.strip().replace(' ','_').replace('.','_').replace('?','') + ".jpg"
        filename = ''.join(c for c in filename if c.isalnum() or c.isspace() or c in ('.', '_', '-'))
        os.makedirs("./nasa_images", exist_ok=True)   # make sure the target directory exists
        file_path = os.path.join("./nasa_images", filename)        
        # do nothing if file was already downloaded earlier
        if os.path.exists(file_path):
            print(f"Skip (already downloaded): {title}")
            return
        # otherwise, download and save the image
        print(f"Downloading image: {title}")
        response = requests.get(url, stream=True)
        response.raise_for_status()
        self.n_downloaded += 1
        # write to file
        with open(file_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    
    def scrape_images(self):
        ''' automatically navigate through the website with images and scrape all images '''
        self.driver.get("https://earthobservatory.nasa.gov/images")
        # define the pages (to do: read this list automatically from the page instead of hard-code it)
        button_names = ["atmosphere", "heat", "human", "land", "life", "naturalevent", "remote", "snowice", "water"]
        # loop over image pages and download the images
        for button_name in button_names:
            # press the menu button to open the current image page
            try:
                menu_button = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.btn-filter.btn-" + button_name + ".no-underline.hvr-rectangle-out"))
                )
                menu_button.click()
            except Exception as e:
                print("An error occurred:", e)
            # download the images and press 'explore more' button until there is no new content
            downloaded_image_ids = set()   # keep track of downloaded images to avoid duplicate downloads
            while self.n_downloaded < self.max_download_num:
                # Find all thumbnail divs
                page_content = self.driver.page_source
                soup = BeautifulSoup(page_content, "html.parser")
                thumbnail_divs = soup.find_all("div", class_="thumbnail-image")

                for div in thumbnail_divs:
                    image_tag = div.find("img")

                    if image_tag:
                        image_url = image_tag["src"]
                        image_title = image_tag["alt"]
                        image_id = image_url.split('/')[-1]  # Extract image ID from the URL
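                        caption_div = div.find_next_sibling("div", class_="caption")
                        # guard against thumbnails without a caption, which would otherwise raise an AttributeError
                        caption_p = caption_div.find("p") if caption_div else None
                        image_description = caption_p.text if caption_p else ""
                        self.image_dict[image_title] = image_description                        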
                        
                        if image_id not in downloaded_image_ids and self.n_downloaded < self.max_download_num:
                            self.download_image(image_url, button_name, image_title)
                            downloaded_image_ids.add(image_id)
                # Try to click the "Explore More" button
                try:
                    explore_more_button = WebDriverWait(self.driver, 10).until(
                        EC.element_to_be_clickable((By.CSS_SELECTOR, ".explore-more"))
                    )
                    explore_more_button.click()
                    time.sleep(1)
                except Exception as e:
                    print("No more content to load or an error occurred:", e)
                    break
        print(f"Downloaded {self.n_downloaded} images.\nThe repo now contains {len(self.image_dict)} images.")
    
    def print_image_dict(self):
        for title, description in self.image_dict.items():
            print(f"Title: {title}\nDescription: {description}\n\n")
    
    def close_driver(self):
        self.driver.quit()


### Main script

In [6]:
# instantiate the class
print("Instantiating NasaScraper object...")
scraper = NasaScraper()

# scrape images
scraper.scrape_images()

# we are done -- close the driver
scraper.close_driver()

Instantiating NasaScraper object...
Skip (already downloaded): Deadly Blooms in the Gulf of Mannar
Skip (already downloaded): Popocatépetl Volcano Keeps on Puffing
Skip (already downloaded): Espíritu Santo Archipelago
Skip (already downloaded): Tulare Lake Grows
Skip (already downloaded): Freddy Brings Lean Times to Malawi
Skip (already downloaded): An Awesome Aurora
Skip (already downloaded): Wave Clouds Over the Crozet Islands
Skip (already downloaded): Swirly Clouds in the Canaries
Skip (already downloaded): Cyclone Ilsa Reaches Western Australia
Skip (already downloaded): Kamchatka Erupts
Skip (already downloaded): Tornado Razes a Path Through Wynne
Skip (already downloaded): How Dust Affects the World’s Health
Skip (already downloaded): For the Longest Time
Skip (already downloaded): Taking Stock of Carbon Dioxide Emissions
Skip (already downloaded): Nitrogen Dioxide in the Neighborhood
Skip (already downloaded): A Dazzling Aurora Borealis
Skip (already downloaded): Dust Blows Acr


KeyboardInterrupt



# 6.3 NASA satellite images - solution 2 (without selenium; faster)

1. Suppose we want to build a Computer vision dataset that involves satellite images. 
2. Your tasks are the following:
    * Collect satellite images from  https://earthobservatory.nasa.gov/images
    * Make sure to render the whole page using selenium and then use BeautifulSoup to scrape the data.
    * Create a repo named Images and save the crawled images based on their titles. 
    * Create a dictionary where the keys are the images/titles and the values are the images’ descriptions.


### Scraping functions

In [1]:
import requests
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor

def download_image(image_info):
    ''' Download and save a given image '''
    url = image_info['url']
    title = image_info['title']
    filename = ''.join(c for c in title if c.isalnum() or c.isspace() or c in (';', '&', '.', '_', '-')).rstrip() + ".jpg"
    filename = filename.replace(';', '-').replace('&', '_').replace(' ', '_').replace('\r', '').replace('\n', '')
    
    file_path = os.path.join("./nasa_images", filename)
    if os.path.exists(file_path):
        return
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

def process_image_page(page_number):
    ''' Process one page with image URLs '''
    sleep_time = 1
    # sleep briefly to avoid getting blocked
    time.sleep(sleep_time)
    url = f"https://earthobservatory.nasa.gov/images/getRecords?page={page_number}"
    # try up to 10 times - double sleep_time each time a request fails with a 503
    for retry in range(10):  
        response = requests.get(url)   
        # handle a 503 error (service unavailable -> likely because of being rate-limited by the server)
        if response.status_code == 503:
            sleep_time *= 2
            print(f"Received 503 error, retrying in {sleep_time} seconds...")
            time.sleep(sleep_time)  # Exponential backoff
            continue
        # handle other errors:
        elif response.status_code != 200:
            print(f"Error: status code {response.status_code}, skipping page {page_number}")
            return []
        data = json.loads(response.text)
        break
    else:
        print(f"Failed to process page {page_number} after 10 attempts.")
        return []        
    # we got data -> process it
    image_data = []
    for record in data['data']:
        # get the image data
        image_url = record['image_path'] + record['thumbnail_file']
        title = record['title']
        caption_short = record['caption_short']
        image_info = {'url': image_url, 'title': title, 'caption_short': caption_short}
        image_data.append(image_info)
        # download the image
        download_image(image_info)
    return image_data

def save_dictionary_to_JSON(data_dict, file_name):
    with open(file_name, 'w') as json_file:
        json.dump(data_dict, json_file, indent=4)        
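
The retry logic in `process_image_page` follows the exponential backoff pattern: wait, retry, and double the waiting time after every failure. Stripped of the request details, the pattern can be sketched as below. `retry_with_backoff` and its parameters are hypothetical names; the `sleep` argument is injectable only so the example can run without actually waiting.

```python
import time

def retry_with_backoff(fetch, max_attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value, doubling the wait after each failure.

    Returns the fetched value, or None if every attempt failed.
    """
    delay = initial_delay
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(delay)   # wait before the next attempt
            delay *= 2     # exponential backoff
    return None

# Simulated server: fails twice, then succeeds (no real network access)
responses = iter([None, None, {"data": []}])
waits = []
result = retry_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
# {'data': []} [1.0, 2.0]
```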

### Main script

In [2]:
# some settings
image_directory = "./nasa_images"
page_number = 1
all_image_data = {}
n_workers = 16
n_seconds_total = 0

# make sure the image directory exists
os.makedirs(image_directory, exist_ok=True)

# process image pages in parallel
with ThreadPoolExecutor(max_workers=n_workers) as executor:
    while True:        
        print(f"Processing pages {page_number} to {page_number+n_workers-1}")
        start_time = time.time()

        # process the batch of pages in parallel
        time.sleep(10) 
        page_numbers = [page_number + i for i in range(n_workers)]
        image_data_list = list(executor.map(process_image_page, page_numbers))
        image_data = [image_info for sublist in image_data_list for image_info in sublist]

        # check if we are done
        if len(image_data) == 0:
            print("No more images found -> exiting")
            break

        # add meta data to dictionary
        for image_info in image_data:
            all_image_data[image_info['title']] = image_info

        # save meta data
        save_dictionary_to_JSON(all_image_data, 'nasa_image_data.json')

        # show some progress info
        n_seconds = time.time() - start_time
        n_seconds_total += n_seconds
        print(f" took {round(n_seconds,1)} seconds; total number of images = {len(all_image_data)}; images per second = {round(len(all_image_data)/n_seconds_total,1)}")
        
        # increase starting page number for next batch
        page_number += n_workers

Processing pages 1 to 16
 took 16.4 seconds; total number of images = 80; images per second = 4.9
Processing pages 17 to 32
 took 15.6 seconds; total number of images = 160; images per second = 5.0
Processing pages 33 to 48
 took 15.5 seconds; total number of images = 240; images per second = 5.1
Processing pages 49 to 64
 took 16.1 seconds; total number of images = 320; images per second = 5.0
Processing pages 65 to 80
 took 16.2 seconds; total number of images = 399; images per second = 5.0
Processing pages 81 to 96
 took 15.4 seconds; total number of images = 479; images per second = 5.0
Processing pages 97 to 112
 took 16.0 seconds; total number of images = 559; images per second = 5.0
Processing pages 113 to 128
 took 16.2 seconds; total number of images = 639; images per second = 5.0
Processing pages 129 to 144
 took 16.1 seconds; total number of images = 719; images per second = 5.0
Processing pages 145 to 160
 took 16.0 seconds; total number of images = 799; images per second =
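
The main loop above processes pages in batches of `n_workers` and stops as soon as a whole batch comes back empty. A minimal, network-free sketch of that control flow is given below; `fetch_all_pages` and `fake_fetch_page` are hypothetical stand-ins for the real request code.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_pages(fetch_page, n_workers=4):
    """Fetch pages in batches of n_workers until a whole batch comes back empty."""
    results = []
    first_page = 1
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        while True:
            page_numbers = range(first_page, first_page + n_workers)
            # executor.map returns results in the same order as page_numbers
            batch = [item for page in executor.map(fetch_page, page_numbers) for item in page]
            if not batch:
                break
            results.extend(batch)
            first_page += n_workers
    return results

# Stand-in for a real request: pages 1 to 6 hold two records each, later pages are empty
def fake_fetch_page(page_number):
    if page_number > 6:
        return []
    return [f"page{page_number}-a", f"page{page_number}-b"]

records = fetch_all_pages(fake_fetch_page, n_workers=4)
print(len(records))
# 12
```

One consequence of this stopping rule is that a batch in which every page fails (and therefore returns an empty list) is indistinguishable from the end of the data.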

# 6.4 IMDB images

1. Suppose we want to build a data set for a Computer vision task that involves gender images. 
2. Your tasks are the following:
   * Collect 10k male/female images from: https://www.imdb.com
   * Make sure to render the whole page using selenium and then use BeautifulSoup to scrape the images
   * Create a folder for male/female
   * Each image will be named after the person in the picture
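
Naming each image after the person in it raises the question of what happens when two people share a name. The sketch below builds a per-gender path and appends a numeric suffix on collisions. It assumes that duplicate names can occur in the listing, which has not been verified here; `target_path` and the example name are hypothetical.

```python
import os

def target_path(root, gender, name, used_paths):
    """Return a unique file path of the form <root>/<gender>/<Name>.jpg."""
    stem = "".join(c for c in name.strip().replace(" ", "_") if c.isalnum() or c in ("_", "-"))
    path = os.path.join(root, gender, stem + ".jpg")
    suffix = 2
    # if two people share a name, append _2, _3, ... instead of overwriting
    while path in used_paths:
        path = os.path.join(root, gender, f"{stem}_{suffix}.jpg")
        suffix += 1
    used_paths.add(path)
    return path

used = set()
first = target_path("imdb_images", "female", "Jane Doe", used)
second = target_path("imdb_images", "female", "Jane Doe", used)
print(first)
print(second)
```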


In [26]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium import webdriver
import time
import os
import requests
from bs4 import BeautifulSoup
import json

class IMDBScraper():

    # constructor
    def __init__(self):
        self.image_dict = {}
        self.driver = self.create_webdriver()
       
    def save_dictionary_to_JSON(self):
        print("Saving dictionary as JSON")
        with open('imdb_images.json', 'w') as json_file:
            json.dump(self.image_dict, json_file)
    
    def create_webdriver(self):
        # create a webdriver instance
        exe_location = r"C:\Webdriver\geckodriver.exe"
        firefox_binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"  
        options = Options()
        options.binary_location = firefox_binary_location
        service = Service(executable_path=exe_location)
        driver = webdriver.Firefox(service=service, options=options)
        return driver
    
    # download and save an image 
    def download_image(self, url, gender, filename):
        # create directory path for the gender
        dir_path = os.path.join("./imdb_images", gender)
        file_path = os.path.join(dir_path, filename)
        # create directory if it doesn't exist
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)        
        # do nothing if file was already downloaded earlier
        if os.path.exists(file_path):
            return
        # otherwise, download and save the image
        response = requests.get(url, stream=True)
        response.raise_for_status()
        # write to file
        with open(file_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    
    def scrape_images(self, gender, n_images_to_download):
        # initialize
        n_downloaded = 0
        batch_size = 250
        batch_nr = 1
        self.driver.get("https://www.imdb.com/search/name/?gender=" + gender + "&count=" + str(batch_size) + "&start=1&ref_=rlm")
        while n_downloaded < n_images_to_download:
            start_time = time.time()
            soup = BeautifulSoup(self.driver.page_source, "html.parser")
            image_elements = soup.select('.lister-item-image img')
            for image_element in image_elements:
                # get the URL and name of actor/actress
                img_url = image_element['src']
                lister_item_content = image_element.find_parent('div', class_='lister-item').find('div', class_='lister-item-content')
                name = lister_item_content.find('h3', class_='lister-item-header').find('a').text.strip()
                # construct file name based on name
                filename = name.strip().replace(' ','_').replace('.','_').replace('?','') + ".jpg"
                filename = ''.join(c for c in filename if c.isalnum() or c.isspace() or c in ('.', '_', '-'))
                # download and save the image, then count it so that the loop can terminate
                self.download_image(img_url, gender, filename)
                n_downloaded += 1
                # update dictionary
                self.image_dict[name] = (gender, filename, img_url)                
                # are we done?
                if n_downloaded >= n_images_to_download:                   
                    break
            # Print some progress info
            n_seconds = time.time() - start_time
            print(f"Batch nr {batch_nr} ({batch_size} images) took {round(n_seconds,1)} seconds; current repo size = {len(self.image_dict)}")
            batch_nr += 1
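            # Click 'next' button (the &start=... parameter works only for the first 10k images);
            # stop if there is no next page instead of raising NoSuchElementException
            next_buttons = self.driver.find_elements(By.XPATH, '//a[@class="lister-page-next next-page"]')
            if not next_buttons:
                print("No 'next' button found -> stopping")
                break
            next_buttons[0].click()
            time.sleep(10)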
    
        # Save the dictionary as a JSON file
        self.save_dictionary_to_JSON()

    def print_image_dict(self):
        for name, (gender, image_file_name, img_url) in self.image_dict.items():
            print(f"Name: {name}\nGender: {gender}\nImage File Name: {image_file_name}\n")
    
    def close_driver(self):
        self.driver.quit()


### Main script

In [27]:
# Example usage
scraper = IMDBScraper()
scraper.scrape_images('female', 10000)
scraper.scrape_images('male', 10000)
scraper.close_driver()

#scraper.print_image_dict()


Batch nr 1 (250 images) took 2.5 seconds; current repo size = 250
Batch nr 2 (250 images) took 0.3 seconds; current repo size = 500
Batch nr 3 (250 images) took 0.3 seconds; current repo size = 750
Batch nr 4 (250 images) took 0.4 seconds; current repo size = 999
Batch nr 5 (250 images) took 0.3 seconds; current repo size = 1249
Batch nr 6 (250 images) took 0.4 seconds; current repo size = 1499
Batch nr 7 (250 images) took 0.3 seconds; current repo size = 1749
Batch nr 8 (250 images) took 0.3 seconds; current repo size = 1999
Batch nr 9 (250 images) took 0.4 seconds; current repo size = 2249
Batch nr 10 (250 images) took 0.3 seconds; current repo size = 2499
Batch nr 11 (250 images) took 0.4 seconds; current repo size = 2749
Batch nr 12 (250 images) took 0.3 seconds; current repo size = 2999
Batch nr 13 (250 images) took 0.3 seconds; current repo size = 3249
Batch nr 14 (250 images) took 0.4 seconds; current repo size = 3499
Batch nr 15 (250 images) took 0.3 seconds; current repo size 

KeyboardInterrupt: 