<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Initialization" data-toc-modified-id="Initialization-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Initialization</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Specify-Global-Variables" data-toc-modified-id="Specify-Global-Variables-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Specify Global Variables</a></span></li><li><span><a href="#Functions-and-Classes" data-toc-modified-id="Functions-and-Classes-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Functions and Classes</a></span></li><li><span><a href="#System-dependent-Configuration" data-toc-modified-id="System-dependent-Configuration-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>System-dependent Configuration</a></span></li></ul></li><li><span><a href="#Collect-Data" data-toc-modified-id="Collect-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collect Data</a></span><ul class="toc-item"><li><span><a href="#Setup-the-Data-Collection-Environment" data-toc-modified-id="Setup-the-Data-Collection-Environment-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Setup the Data Collection Environment</a></span></li><li><span><a href="#Collect-Instagram-Posts" data-toc-modified-id="Collect-Instagram-Posts-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Collect Instagram Posts</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Introduction

<p> This playbook scrapes Instagram posts for a given hashtag. It can also scrape a particular Instagram post. </p>



# Initialization


<p> The imports, function and class definitions, global variables, and system-dependent configuration are in this section. </p>

<p> Review the system-dependent configuration carefully and adjust it for each system (e.g., Linux vs. Windows, or the path of an external program), since the playbook will most likely fail without proper configuration. </p>

## Imports

In [1]:
### This cell imports necessary Python modules and performs initial configuration

### Data manipulation libraries
# import json
import pandas as pd 
import csv

### Visualization and Interaction
# import matplotlib.pyplot as plt
# plt.style.use('ggplot')

from IPython.display import set_matplotlib_formats, clear_output
set_matplotlib_formats('retina')

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
init_notebook_mode(connected=True)

import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import VBox, HBox, Button, HTML

### Computation libraries 
import numpy as np
import re
import random

### Graph analysis
# import networkx as nx
# import community

### System related
# import sys
# import warnings;
# warnings.filterwarnings('ignore')

# import io
# from joblib import Parallel, delayed

### Datetime libraries
from datetime import datetime
import time
from pytz import timezone

### NLP dependencies
# import spacy
# from spacy.tokenizer import Tokenizer
# nlp = spacy.load('en')
# tokenizer = Tokenizer(nlp.vocab)

# from langdetect import detect

### Scraping libraries
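from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
# Needed by the explicit wait in scrape_instagram_post
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup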

### Machine learning libraries
# from sklearn import datasets
# from sklearn import linear_model
# from sklearn.feature_selection import f_regression, mutual_info_regression
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import classification_report

### Logging
import logging 
logging.basicConfig(level=logging.INFO)

## Specify Global Variables

In [2]:
### This cell defines global variables and parameters used throughout the playbook

# Make this True if you want to watch Selenium scrape pages
WATCH_SCRAPING = True

RAW_DATA_DIRECTORY = "../data/raw/"

# Specify the maximum number of thumbnails to scrape
MAX_NO_OF_THUMBNAILS_TO_SCRAPE = 5

# Specify the maximum number of comments to scrape from a post
MAX_NO_OF_COMMENTS_TO_SCRAPE = 10

# datetime.now(timezone('US/Eastern')).strftime('%Y-%m-%dT%H:%M:%S%z')
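
The commented-out line above uses the same timestamp format that `scrape_instagram_post` writes into `scraped_datetime`. The sketch below shows what that format produces, using only the standard library. The helper name `format_scrape_time` and the fixed UTC-4 offset (standing in for US/Eastern during daylight saving time) are assumptions made for illustration; the playbook itself uses `pytz`.

```python
from datetime import datetime, timedelta, timezone

# Format string copied from the playbook: date, time, and UTC offset.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def format_scrape_time(moment):
    """Render a timezone-aware datetime in the playbook's timestamp format."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.strftime(TIMESTAMP_FORMAT)


# Hypothetical fixed offset standing in for US/Eastern in summer (UTC-4).
eastern_daylight = timezone(timedelta(hours=-4))
example = datetime(2019, 9, 2, 12, 30, 0, tzinfo=eastern_daylight)
print(format_scrape_time(example))  # 2019-09-02T12:30:00-0400
```

Rejecting naive datetimes is a design choice of this sketch: without a timezone, `%z` renders as an empty string and the offset would be silently lost.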

## Functions and Classes

In [3]:
### This cell defines functions and classes used throughout the playbook

def scrape_instagram_search(driver, scrape_parameter, type_of_scrape, max_thumbnails=1000, max_scrolls=2000):
    """Searches for an instagram tag and returns a list of links to thumbnails"""
    url = ""
    
    # For TAG search
    if type_of_scrape == "TAG":
        tag = scrape_parameter.lower().strip()
        url = "https://www.instagram.com/explore/tags/" + tag

    # For LOCATION search
    if type_of_scrape == "LOCATION":
        url = scrape_parameter
 
    driver.get(url)

    thumbnail_link_set = []

    hit_bottom = False
    
    thumbnail_wrappers = set()
    i = 0
    # Links are only extracted after the loop, so count the collected wrappers here
    while not hit_bottom and len(thumbnail_wrappers) < max_thumbnails and i < max_scrolls:
        
        i += 1
        print("Retrieving page number {}".format(i))
        try:
            driver.execute_script("window.scrollTo(0, " + str(i * 250) + ");")
            page_html = driver.page_source        
            soup = BeautifulSoup(page_html, 'html.parser')
            kIKUGs = soup.find_all('div', class_='kIKUG')
            for sd in kIKUGs:
                thumbnail_wrappers.add(sd)
            logging.info("Number of thumbnail wrappers = {}".format(len(thumbnail_wrappers)))
            
        except StaleElementReferenceException as e:
            logging.info('stale element')
            logging.info(e)
            continue

        except Exception as e:
            logging.info('finished')
            logging.info(e)
            break


    
    
    if thumbnail_wrappers:
        for thumbnail in thumbnail_wrappers:
            link_of_the_thumbnail = thumbnail.find('a').get('href')
            thumbnail_link_set.append(link_of_the_thumbnail)
            
    #driver.quit()
    logging.info("Number of distinct links retrieved = {}".format(len(thumbnail_link_set)))
    return thumbnail_link_set


def scrape_instagram_post(driver, post_link, max_comments = 100):
    """Scrapes an instagram post"""

    # Form the URL
    url = "https://www.instagram.com" + post_link
    
    print(url)
    driver.get(url)
    
    tracking_time = datetime.now(timezone('US/Eastern')).strftime('%Y-%m-%dT%H:%M:%S%z')
    comment_row = []
    post_metadata = {"post_id": post_link[3:][:-1],
                     "scraped_datetime": tracking_time,
                     "post_thumbnail_link": None,
                     "post_thumbnail_tags": None,
                     "post_thumbnail_type": None,
                     "post_number_of_likes": None,
                     "post_number_of_views": None,
                     "post_datetime": None,
                     "user_profile_id": None,
                     "user_profile_picture": None,
                     "user_location": None,
                     "user_verified": None
                    }
   

    try:
        try:
            wait = WebDriverWait(driver, 2)
            # TODO: Do the load_more comments later.
            while False: 
            # while True:
                load_more_comments = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "Z4IfV")))
                load_more_comments.click()
        except Exception as e:
            pass
            # logging.info("no load more comments button was found")
            # logging.info(e)
            
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        # Check whether it is a video/image
        try:
            if soup.find("div", class_="KL4Bh"):
                
                # Check if the img["alt"] exists
                try:
                    _ = soup.find("div", {"class": "KL4Bh"}).find("img")["alt"]
                    image_alt_exists = True
                    logging.info("image alt exists")
                except:
                    logging.info("image alt does not exist")
                    image_alt_exists = False
                    
                if image_alt_exists:
                    post_metadata["post_thumbnail_type"] = "image"
                    image_container = soup.find("div", {"class": "KL4Bh"}).find("img")
                    image_tags = image_container["alt"]
                    image_tags = image_tags.replace("Image may contain: ", "")
                    post_metadata['post_thumbnail_tags'] = image_tags        

                    image_src = image_container["src"]
                    post_metadata["post_thumbnail_link"] = image_src
                else:
                    post_metadata["post_thumbnail_type"] = "video"
                    image_container = soup.find("div", {"class": "KL4Bh"}).find("img")
                    image_src = image_container["src"]
                    post_metadata["post_thumbnail_link"] = image_src
                    
            """
            if soup.find("div", class_="OAXCp"):
                post_metadata["post_thumbnail_type"] = "video"
                
                video_container = soup.find("div", class_="OAXCp").find("video")
                video_src = video_container["src"]
                post_metadata["post_thumbnail_link"] = video_src
            """
        except:
            print("Image/video scraping did not work")
            
            
        # Scrape views and likes
        try:
            # Check if it is likes
            if soup.find("div", class_="Nm9Fw"):
                number_of_likes = soup.find("div", class_="Nm9Fw").text.lower().replace(" likes", "")
                post_metadata["post_number_of_likes"] = number_of_likes
                
            if soup.find("div", class_="HbPOm"):
                number_of_views = soup.find("div", class_="HbPOm").text.lower().replace(" views", "")
                post_metadata["post_number_of_views"] = number_of_views     
                
        except:
            print("Likes/views scrape did not work")
            
        # Scrape time
        try:
            time_wrapper = soup.find("div", class_="NnvRN")
            datetime_of_the_post = time_wrapper.find("time")["datetime"]
            post_metadata["post_datetime"] = datetime_of_the_post
        except:
            print("Post datetime scrape did not work")
           
        # Scrape profile pic
        try:
            header_wrapper = soup.find("header", class_="UE9AK")
            profile_pic_wrapper = header_wrapper.find("div", class_="mrq0Z")
            profile_pic_src = profile_pic_wrapper.find("img")["src"]
            post_metadata["user_profile_picture"] = profile_pic_src
            
            profile_id = header_wrapper.find("div", class_="e1e1d").find("a")["title"]
            post_metadata["user_profile_id"] = profile_id
        except:
            print("Profile pic and name scrape did not work")

        # Scrape profile id
        try:
            profile_id_wrapper = soup.find("div", class_= "e1e1d")
            profile_id = profile_id_wrapper.find("a")["title"]
            post_metadata["user_profile_id"] = profile_id
        except:
            logging.info("Profile id did not work")
            
        # Scrape user location    
        try:
            if soup.find("div", class_="M30cS"):
                user_location_wrapper = soup.find("div", class_="M30cS")
                user_location = user_location_wrapper.text
                if user_location:
                    post_metadata["user_location"] = user_location
        except:
            logging.info("User location did not work")
            
        # Scrape whether user is verified
        try:
            profile_id_wrapper = soup.find("div", class_= "e1e1d")
            if profile_id_wrapper.find("span"):
                user_verified = 1
            else:
                user_verified = 0
                
            post_metadata["user_verified"] = user_verified
        except:
            logging.info("User verified did not work")
        
        try:
            user_elements = soup.find("div", class_="EtaWk").find_all("div", class_="ZyFrc")
        except:
            print("Could not find EtaWk and/or ZyFrc")
            # Fall back to an empty list so the loop below is skipped cleanly
            user_elements = []
            
        processing_the_first_element = True    
        for e in user_elements:
            comment_data = {"user_post_or_comment": None,
                            "commenter_user_id": None,
                            "commenter_profile_picture": None,
                            "commenter_verified": None,
                            "comment_text": None,
                            "comment_likes": None,
                            "comment_replies": None,
                            "comment_datetime": None}
            
            if processing_the_first_element:
                comment_data["user_post_or_comment"] = "user_post"
            else:
                comment_data["user_post_or_comment"] = "comment"
                
            element = e.find("div", class_="C4VMK")
            try:
                comment_text = element.find('span').text
                comment_data["comment_text"] = comment_text
            except:
                print("User comment could not be scraped")
                logging.info("User comment could not be scraped")
                
            try:    
                commenter_user_id = element.find("a")["title"]
                comment_data["commenter_user_id"] = commenter_user_id
            except:
                print("User id could not be scraped")
                logging.info("User id could not be scraped")

            try:
                comment_datetime = element.find("time")["datetime"]
                comment_data["comment_datetime"] = comment_datetime
            except:
                logging.info("Comment datetime could not be scraped")
            
            try:
                comment_buttons = element.find_all("button")
                if comment_buttons:
                    for cb in comment_buttons:
                        text_of_cb = cb.text.lower()
                        
                        if ("like" in text_of_cb) or ("likes" in text_of_cb):
                            if len(text_of_cb.split()) > 1:
                                comment_data["comment_likes"] = text_of_cb.split()[0]

                if not processing_the_first_element:
                    comment_data["comment_replies"] = 0
            except:
                print("Likes/replies could not be scraped")
                
            try:
                commenter_profile_picture = e.find("div", class_="TKzGu").find("img")["src"]
                comment_data["commenter_profile_picture"] = commenter_profile_picture
            except:
                logging.info("Commenter profile picture could not be scraped")
                              
            # rows.append([comment_text.encode("utf-8"), username, numbers[0], numbers[1], numbers[2]])
            comment_row.append({**post_metadata, **comment_data})
            processing_the_first_element = False    

            
        logging.info("Retrieved the post and comments for " + url)
    except Exception as e:
        logging.info("could not load comments")
        logging.info(e)
        
    return comment_row  
    # return df    
        
        
# TODO: Currently all data is saved into a dataframe and exported to a CSV. 
# Creating an Instagram class might be useful for further analysis.
class InstagramPost:
    """This class represents an Instagram post."""
    
    def __init__(self, unique_id):
        self.unique_id = unique_id
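
The TODO above suggests that a richer class might help later analysis. One possible shape is sketched below as a dataclass holding a subset of the `post_metadata` fields. The name `InstagramPostRecord`, the choice of fields, and the `from_link` and `url` helpers are hypothetical and not part of the playbook; only the slicing rule and the base URL are taken from `scrape_instagram_post`.

```python
from dataclasses import dataclass
from typing import Optional

INSTAGRAM_BASE_URL = "https://www.instagram.com"


@dataclass
class InstagramPostRecord:
    """Hypothetical container for a subset of the post-level fields."""

    post_id: str
    post_link: str
    scraped_datetime: Optional[str] = None
    post_number_of_likes: Optional[str] = None
    user_profile_id: Optional[str] = None

    @classmethod
    def from_link(cls, post_link):
        # Same slicing as scrape_instagram_post: drop the leading "/p/"
        # and the trailing "/".
        return cls(post_id=post_link[3:][:-1], post_link=post_link)

    @property
    def url(self):
        # Same concatenation as scrape_instagram_post.
        return INSTAGRAM_BASE_URL + self.post_link


# "/p/ABC123/" is a made-up link used only for illustration.
post = InstagramPostRecord.from_link("/p/ABC123/")
print(post.post_id, post.url)
```

Keeping the counts as optional strings mirrors how the scraper currently stores them; converting them to numbers would be a separate post-processing step.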
    
    

## System-dependent Configuration

In [4]:
### This cell defines system-dependent configuration, such as settings that differ between Linux and Windows

# Assuming a particular directory structure and a Linux-based system
# As of Sep 2, 2019, the chromedriver is version 76.X
EXECUTABLE_PATH = "../WebDriver/chromedriver"
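
Since the driver path differs between operating systems, one option is to derive it from the platform instead of hard-coding it. The sketch below is a minimal illustration, assuming the driver binaries sit in the same `../WebDriver` directory and that the Windows binary is named `chromedriver.exe`; the helper name `chromedriver_path` is hypothetical.

```python
import os
import platform

# Assumed layout: driver binaries live under ../WebDriver, as in this playbook.
WEBDRIVER_DIRECTORY = os.path.join("..", "WebDriver")


def chromedriver_path(system_name=None):
    """Return a chromedriver path for the given OS name (default: current OS)."""
    system_name = system_name or platform.system()
    # Windows binaries carry an .exe suffix; Linux and macOS binaries do not.
    filename = "chromedriver.exe" if system_name == "Windows" else "chromedriver"
    return os.path.join(WEBDRIVER_DIRECTORY, filename)


print(chromedriver_path("Linux"))
print(chromedriver_path("Windows"))
```

Calling `chromedriver_path()` with no argument would use the value reported by `platform.system()` on the machine running the notebook.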

# Collect Data

## Setup the Data Collection Environment

In [5]:
### This cell launches the Chrome WebDriver used for scraping

# Create the driver
chrome_options = webdriver.ChromeOptions()
if not WATCH_SCRAPING:
    chrome_options.add_argument('--headless')
chrome_options.add_argument('--incognito')

try:
    driver = webdriver.Chrome(options=chrome_options, executable_path=EXECUTABLE_PATH)
    logging.info("Chrome launched")
except:
    logging.critical("Chrome could not be launched. Check if EXECUTABLE_PATH is configured correctly. If it is, check if the Chromedriver supports the version of the browser.")
    

CRITICAL:root:Chrome could not be launched. Check if EXECUTABLE_PATH is configured correctly. If it is, check if the Chromedriver supports the version of the browser.


## Collect Instagram Posts

In [6]:
### This cell creates the UI for the instagram tag search and retrieves the links for the thumbnails

# The dropdown menu that selects the search type (only hashtag search is currently offered)
dropdown = widgets.Dropdown(
    options=['Search Tag'],
    value='Search Tag',
    description='',
    disabled=False
)

# Text input box
text = widgets.Text(
    value='',
    placeholder='',
    description='',
    disabled=False,
    style={'description_width': '200px'}, 
    layout={'width': '500px'}
)

# The Instagram collection button
button = widgets.Button(description="Start Collection!")

# Formatting
html_text = widgets.HTML(value="<h1 style='text-align:center'>Instagram Search by Tag</h1>")

COLUMNS = ["post_id", "scraped_datetime", "post_thumbnail_link", "post_thumbnail_type", "post_thumbnail_tags", "post_number_of_likes", "post_number_of_views", "post_datetime", "user_profile_id", "user_profile_picture", "user_location", "user_verified", "user_post_or_comment", "commenter_user_id", "commenter_profile_picture", "commenter_verified", "comment_text", "comment_likes", "comment_replies", "comment_datetime"]

def on_button_clicked(b):
    """Triggered by a button click."""
    global thumbnails, df_user
    
    with out:
        clear_output()
        spinner = widgets.HTML(value='<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.8.2/css/all.css" integrity="sha384-oS3vJWv+0UjzBfQzYUhtDYW+Pj2yciDJxpsK1OYPAYjqT085Qq/1cq5FLXAZQ7Ay" crossorigin="anonymous"><div style="margin:auto; width:20%"><i class="fas fa-spinner fa-spin fa-10x"></i></div>')
        display(spinner)
        
        ### TODO: Scrape the thumbnails
        type_of_scrape = "TAG"
        if dropdown.value == "Search Tag":
            file_identifier = text.value
            type_of_scrape = "TAG"
        
        # TODO: Only works when logged in
        if dropdown.value == "Search by Location":
            # Use the last path segment of the location URL, not its last character
            file_identifier = text.value.strip("/").split("/")[-1]
            type_of_scrape = "LOCATION"
        
        df_user = pd.DataFrame(columns=COLUMNS)
        thumbnails = scrape_instagram_search(driver, text.value, type_of_scrape)
        for thumbnail in thumbnails[:MAX_NO_OF_THUMBNAILS_TO_SCRAPE]:
            thumbnail_post_data = scrape_instagram_post(driver, thumbnail)
            dummy_df = pd.DataFrame.from_dict(thumbnail_post_data)
            df_user = pd.concat([df_user, dummy_df], ignore_index=True, sort=False)
        
        
        df_user.to_csv(RAW_DATA_DIRECTORY + "INS-" + type_of_scrape + "-" + file_identifier + "-" + datetime.now().strftime("%Y-%m-%dT%H-%M-%S") + ".csv", index=False, na_rep='None', columns=COLUMNS)
        ### TODO: Scrape the comments
        #clear_output()
        display(df_user[COLUMNS])
        #print(dropdown.value)
        #print(text.value)
    
button.on_click(on_button_clicked)
hbox = HBox([dropdown, text, button])
vbox = VBox([html_text, hbox])
display(vbox)
out = widgets.Output()
display(out)
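
The CSV written by the callback is named after the scrape type, an identifier, and a timestamp. The sketch below isolates that naming step so it can be checked without a browser. The helper names `build_output_path` and `location_identifier`, the tag `sunset`, and the location URL are all hypothetical; in particular, the location URL shape is an assumption and not something this playbook confirms.

```python
from datetime import datetime


def location_identifier(location_url):
    """Take the last path segment of a URL, ignoring trailing slashes."""
    return location_url.strip("/").split("/")[-1]


def build_output_path(raw_directory, type_of_scrape, identifier, moment):
    """Assemble the CSV path as INS-<type>-<identifier>-<timestamp>.csv."""
    timestamp = moment.strftime("%Y-%m-%dT%H-%M-%S")
    return "{}INS-{}-{}-{}.csv".format(
        raw_directory, type_of_scrape, identifier, timestamp
    )


# Made-up inputs, for illustration only.
print(build_output_path("../data/raw/", "TAG", "sunset", datetime(2019, 9, 2, 8, 15, 0)))
print(location_identifier("https://www.instagram.com/explore/locations/123/example-place/"))
```

The timestamp uses hyphens instead of colons in the time part, matching the callback, which keeps the file name valid on Windows as well.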


# Conclusion

In [7]:
"""Add post-processing steps here
"""

# Clean up the environment
driver.quit()  # run this after any post-processing is done
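
If an earlier cell fails, the browser can be left running because the cleanup cell is never reached. One way to guard against that is to wrap the driver in a context manager that always calls `quit()`. The sketch below is a generic pattern, not the playbook's method; `managed_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` only stands in for a Selenium driver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver from create_driver() and always call quit() afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises an exception.
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as fake:
    pass  # scraping and post-processing would go here

print(fake.closed)  # True
```

With a real browser, the factory passed to `managed_driver` would be a function that builds and returns the Chrome driver configured earlier in the playbook.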