# Data Collection: Building an Instagram Crawler

This walkthrough shows how to crawl data from Instagram using the Selenium WebDriver.

# Step 1: Set up the WebDriver

The first thing we need is a web driver: a browser that can be controlled programmatically.

In [68]:
# Set up a function to start the webdriver
from selenium import webdriver
def start_webdriver():
    chromedriver_path = 'helpers/chromedriver'
    driver = webdriver.Chrome(chromedriver_path)
    return driver
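
The cell above uses the Selenium 3 call style, in which the driver path is passed positionally. Newer Selenium 4 releases no longer accept that argument and expect a `Service` object instead. A minimal sketch of the equivalent setup, assuming Selenium 4 is installed and reusing the same `helpers/chromedriver` path; the function name `start_webdriver_v4` is made up for illustration:

```python
def start_webdriver_v4(chromedriver_path='helpers/chromedriver'):
    """Start Chrome through the Selenium 4 Service API (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service(executable_path=chromedriver_path)
    return webdriver.Chrome(service=service)
```

In those releases the `find_element_by_*` helpers used later in this walkthrough are gone as well; the counterpart is `driver.find_element(By.CSS_SELECTOR, ...)`, with `By` imported from `selenium.webdriver.common.by`.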

In [94]:
# Execute the webdriver
driver = start_webdriver()

# Step 2: Navigate to Instagram and Login
To log in, we need to mimic the user flow that we would follow when logging in manually. To do so, we use the Google Chrome Developer Tools to find the selectors / classes of the respective buttons and form fields.

In [70]:
driver.get("https://www.instagram.com/")

In [71]:
# Import packages that we need
from random import randint
import time
import json

In [95]:
# Define Login Helper Function
def login():
  try:
    # The login_info.json file contains the login information for five
    # different Instagram accounts. We randomly pick one of the five accounts
    # and save the username and password in respective variables.
    with open('helpers/login_info.json') as f:
        login_info = json.load(f)
    random_number = randint(0, len(login_info['accounts']) - 1)
    user = login_info['accounts'][random_number]['username']
    pw = login_info['accounts'][random_number]['pw']
    # We go to the Instagram home page
    driver.get("https://www.instagram.com/")
    # We wait five seconds in order to make sure the page has fully loaded
    time.sleep(5)
    # We programmatically close the cookie notice, if one is shown
    try:
      driver.find_element_by_css_selector('.aOOlW').click()
    except Exception:
      pass
    time.sleep(3)
    # We find the username and password fields and make sure they are empty
    username = driver.find_element_by_css_selector("input[name='username']")
    password = driver.find_element_by_css_selector("input[name='password']")
    username.clear()
    password.clear()
    # We enter the login credentials into the form and then programmatically click the login button.
    username.send_keys(user)
    password.send_keys(pw)
    time.sleep(randint(3,5))
    driver.find_element_by_css_selector("button[type='submit']").click()
  except Exception as e:
    print("Could not log in / already logged in:", e)
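
The content of `helpers/login_info.json` is not shown here, but the lookups in `login()` imply a top-level `accounts` list whose entries carry `username` and `pw` keys. A minimal sketch of that layout and of the random pick, using placeholder credentials and a hypothetical helper name `pick_account`:

```python
import json
import random

# Placeholder content; the real file would hold actual credentials
example_login_info = json.loads("""
{
  "accounts": [
    {"username": "example_user_1", "pw": "example_password_1"},
    {"username": "example_user_2", "pw": "example_password_2"}
  ]
}
""")

def pick_account(login_info):
    """Return (username, password) of one randomly chosen account."""
    account = random.choice(login_info['accounts'])
    return account['username'], account['pw']

user, pw = pick_account(example_login_info)
```

`random.choice` does the same job as indexing the list with `randint(0, len(...) - 1)`.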

In [73]:
# Define Logout Helper Function
def logout():
  try:
    # We go to the Instagram home page
    driver.get("https://www.instagram.com/")
    time.sleep(5)
    # We open the dropdown menu
    menu = driver.find_element_by_class_name("_47KiJ")
    menu.find_element_by_class_name("_2dbep").click()
    time.sleep(3)
    # We click on the logout button
    menu.find_element_by_css_selector('.-qQT3:last-child').click()
    time.sleep(1)
    # We delete all cookies so that Instagram does not
    # modify the login process when we want to log in again
    driver.delete_all_cookies()
  except Exception as e:
    print("Could not log out / already logged out:", e)

In [96]:
# Execute Login Helper Function
login()

In [81]:
# Execute Logout Helper Function
logout()

# Step 3: Decide which Accounts / Companies to Crawl and Create a Folder for Each

In [83]:
# For this tutorial we take three big FMCG brands as an example
companies = ['nestle', 'unilever', 'proctergamble']

In [102]:
# Import the os package
import os
# Define a function that checks if a folder for the company exists
# within the example data folder. If not, it creates one.
def create_company_folder(company):
  path = os.getcwd() + "/example_data/" + company + "/"
  if not os.path.exists(path):
      os.makedirs(path)
  return path
# Execute the function
for company in companies:
  create_company_folder(company)
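
The same check-then-create logic can also be written with `pathlib`. A minimal sketch; the function name and the `base_dir` parameter are additions for illustration, so that the target folder can be chosen freely:

```python
from pathlib import Path

def create_company_folder_pathlib(company, base_dir="example_data"):
    """Create <base_dir>/<company> if it is missing and return it as a Path."""
    path = Path(base_dir) / company
    # parents=True creates base_dir as well; exist_ok=True makes repeat calls harmless
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling it twice for the same company is safe, which mirrors the `os.path.exists` check above.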

In [85]:
# Check out one of the accounts to plan the next steps
driver.get("https://www.instagram.com/" + companies[0])

# Step 4: Get Post URLs
To crawl the content of all Instagram posts of the selected accounts, we need their respective URLs. We get these by programmatically scrolling to the bottom of the page and saving the URLs (the href attribute of the pictures) after every scroll action.

For this tutorial we limit the number of collected post URLs to 100 per company.
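
The collection rules can be tried out without a browser. The sketch below replaces the scrolling page with a plain list of batches, one batch per scroll action, and applies the same three rules: keep only post links, skip duplicates, stop at the limit. The function name and the sample links are made up for illustration.

```python
def collect_post_urls(batches, limit=100):
    """Collect unique post URLs from successive batches of links.

    Each batch stands in for the links visible after one scroll action.
    """
    hrefs = []
    for batch in batches:
        for href in batch:
            # keep post links only and skip duplicates
            if href and '/p/' in href and href not in hrefs:
                hrefs.append(href)
                if len(hrefs) >= limit:
                    return hrefs
    return hrefs

# Two simulated scrolls; the second repeats a post that is still visible
batches = [
    ["https://www.instagram.com/p/AAA/", "https://www.instagram.com/explore/"],
    ["https://www.instagram.com/p/AAA/", "https://www.instagram.com/p/BBB/", None],
]
print(collect_post_urls(batches))
```

With these sample batches the result contains the two post links once each, in the order in which they were first seen.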

In [104]:
# Collect the post URLs of a company, unless an hrefs.txt file already exists in its folder
from pathlib import Path
def find_hrefs(company):
    # Define the file in which all URLs will be saved in the end
    href_file = os.getcwd() + "/example_data/" + company + "/hrefs.txt"
    # Check if the file already exists
    href_file_exists = Path(href_file).is_file()
    # if the file does not exist, start scraping the URLs
    if not href_file_exists:
        print("no URLs found for " + company + ", starting to scrape them")
        time.sleep(randint(3, 5))
        # Go to the company's Instagram account
        driver.get("https://www.instagram.com/" + company)
        # Scrape all URLs, then scroll down and scrape the newly loaded posts until the end of the page is reached
        # For this tutorial we also limit the number of scrolled posts to 100 per company
        hrefs = []
        scrolldown = 0
        match = False
        while not match and len(hrefs) < 100:
            last_count = scrolldown
            # Find all links that are currently displayed on the page
            links = driver.find_elements_by_tag_name('a')
            time.sleep(randint(3, 4))
            for link in links:
                try:
                    href = link.get_attribute('href')
                except Exception:
                    # the element may have disappeared from the page in the meantime
                    continue
                # only take links that include "/p/", indicating that it is a post link,
                # skip links that are already in the list (prevent duplicates)
                # and stop adding once 100 URLs have been collected
                if href and '/p/' in href and href not in hrefs and len(hrefs) < 100:
                    hrefs.append(href)
            # scroll to the bottom of the page to load new posts
            scrolldown = driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
            # stop the process when the end of the page is reached or we have collected 100 URLs
            if last_count == scrolldown or len(hrefs) >= 100:
                match = True
        print("saving " + str(len(hrefs)) + " URLs to file for " + company)
        with open(href_file, 'w') as file:
            file.write(str(hrefs))
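    else:
        print("URLs file discovered")
        # Parse the saved list with literal_eval instead of eval,
        # so that the file content is never executed as code
        import ast
        with open(href_file, 'r') as file:
            hrefs = ast.literal_eval(file.read())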
    return hrefs, href_file_exists
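
The function stores the list by writing its string representation to `hrefs.txt`. One way to read such a file back is `ast.literal_eval` from the standard library, which only accepts Python literals. A minimal sketch of that round trip on a temporary file; the sample URLs are made up:

```python
import ast
import tempfile
from pathlib import Path

sample_hrefs = ["https://www.instagram.com/p/AAA/", "https://www.instagram.com/p/BBB/"]

with tempfile.TemporaryDirectory() as tmp:
    href_file = Path(tmp) / "hrefs.txt"
    # Save the list as its string representation
    href_file.write_text(str(sample_hrefs))
    # Parse it back; literal_eval accepts literals only, not arbitrary expressions
    restored = ast.literal_eval(href_file.read_text())

print(restored == sample_hrefs)
```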

In [103]:
# Execute the Scraping Function, iterating over the companies
for company in companies:
  find_hrefs(company)

no URLs found for nestle, starting to scrape them
saving 100 to file for nestle
no URLs found for unilever, starting to scrape them
saving 100 to file for unilever
no URLs found for proctergamble, starting to scrape them
saving 100 to file for proctergamble


# Step 5: Iterate over the URLs and Save Post Data as Raw .json Files
Fortunately, we can leverage the Instagram GraphQL API to retrieve posts as structured data; otherwise we would have to go through all the different HTML elements that contain the information we need and scrape it from there. To retrieve a post as structured data, we just append "?__a=1" to the post URL. This yields a JSON (JavaScript Object Notation) document that we can save to our hard drive as a .json file.
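
A minimal sketch of the URL handling only, with no request being sent. The helper names `to_json_url` and `shortcode` are made up for illustration, and naming the output file after the path segment that follows `/p/` is an assumption, not something prescribed by Instagram. Whether Instagram still answers such requests, and under which conditions, is outside the scope of this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

def to_json_url(post_url):
    """Return the post URL with the query string set to __a=1."""
    parts = urlsplit(post_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "__a=1", ""))

def shortcode(post_url):
    """Return the path segment after /p/, usable as a file name."""
    segments = [s for s in urlsplit(post_url).path.split("/") if s]
    return segments[segments.index("p") + 1]

url = "https://www.instagram.com/p/AAA/"
print(to_json_url(url))
print(shortcode(url) + ".json")
```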

There is a caveat: the JSON contains images only as URLs, and Instagram periodically changes image URLs. Therefore, we need to save the images themselves on our own hard drive or on some other service (see next steps).
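
One possible way to store an image locally, given its URL, is to copy the response bytes into a file with the standard library. This is a sketch under assumptions: the helper name `save_image` is hypothetical, and the demonstration copies a local placeholder file through a `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen

def save_image(image_url, target_path):
    """Copy the bytes behind image_url into target_path."""
    with urlopen(image_url) as response, open(target_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return target_path

# Demonstration with a local placeholder file instead of a real image URL
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.jpg"
    source.write_bytes(b"not a real image")
    copy = save_image(source.as_uri(), Path(tmp) / "copy.jpg")
    copied_ok = copy.read_bytes() == b"not a real image"
```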