# Image Retrieval From Instagram

Goal: collect image data from Instagram (the image files shown on a user's profile page or on a hashtag page) and then preprocess it.

Image size: 224*224

Resolution: 

Number of images: 



#### Websites: 

This notebook's code is based on the following tutorials: 

https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058

https://edmundmartin.com/scraping-instagram-with-python/

https://michaeljsanders.com/2017/05/12/scrapin-and-scrollin.html

**Important Note:** *Remember to respect users' rights when you download copyrighted content. Do not use images or videos from Instagram for commercial purposes.*

### 1. Import dependencies

Install the non-standard libraries: requests and BeautifulSoup (PyPI package name: beautifulsoup4).

In [4]:
from random import choice
import json

# to install
import requests
from bs4 import BeautifulSoup

### 2. Build InstagramScraper class
based on: https://edmundmartin.com/scraping-instagram-with-python/

Rotating user agents is a common practice in web scraping and can help avoid detection. If the caller of our class provides a list of user agents, we pick a random agent from that list; otherwise we fall back to our default user agent.

In [6]:
_user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
]
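The selection rule described above can be sketched as a standalone function. This is only an illustration of the logic, not the class method itself; the names `pick_user_agent` and `DEFAULT_USER_AGENTS` are hypothetical and introduced here for the example.

```python
from random import choice

# Hypothetical default list, mirroring the single default agent used in this notebook.
DEFAULT_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
]


def pick_user_agent(user_agents=None):
    """Return a random agent from a non-empty caller list, else a default agent."""
    if user_agents and isinstance(user_agents, list):
        return choice(user_agents)
    return choice(DEFAULT_USER_AGENTS)


print(pick_user_agent(['agent-a', 'agent-b']))  # one of the two caller-supplied agents
print(pick_user_agent())                        # falls back to the default agent
```

An empty list or a non-list argument also falls back to the default, which matches the check used in the class below.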

Define a class called InstagramScraper: 

In [178]:
class InstagramScraper:

    def __init__(self, user_agents=None, proxy=None):
        self.user_agents = user_agents
        self.proxy = proxy

    def __random_agent(self):
        if self.user_agents and isinstance(self.user_agents, list):
            return choice(self.user_agents)
        return choice(_user_agents)

    def __request_url(self, url):
        """Thin wrapper around requests.get.
        We pass in a URL and make a request using a random user agent and the configured proxy.
        If Instagram responds with an error status code we raise requests.HTTPError; any other
        request failure is re-raised unchanged.
        If everything goes fine, we return the page's HTML."""
        try:
            response = requests.get(url, headers={'User-Agent': self.__random_agent()}, proxies={'http': self.proxy,
                                                                                                 'https': self.proxy})
            response.raise_for_status()
        except requests.HTTPError as e:
            # Keep the original exception as the cause so the status code is not lost.
            raise requests.HTTPError('Received non 200 status code from Instagram') from e
        except requests.RequestException:
            # Bare raise preserves the original exception and its traceback.
            raise
        else:
            return response.text


    @staticmethod
    def extract_json_data(html):
        """Instagram serves all of the information about a user as a JavaScript object
        (window._sharedData) embedded in the page. This means that we can extract a user's
        profile information and their recent posts with a single HTTP request to their
        profile page. We simply need to turn this JavaScript object into JSON."""
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.find('body')
        script_tag = body.find('script')
        # Strip only the assignment prefix and the trailing semicolon; removing every ';'
        # would corrupt JSON strings that contain semicolons (e.g. a biography).
        raw_string = script_tag.text.strip().replace('window._sharedData =', '', 1).rstrip(';')
        return json.loads(raw_string)

    def profile_page_metrics(self, profile_url):
        results = {}
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']
        except Exception as e:
            raise e
        else:
            for key, value in metrics.items():
                #print('key:', key, '-value:', value)
                if key != 'edge_owner_to_timeline_media':
                    if value and isinstance(value, dict):
                        value = value['count']
                        results[key] = value
                    elif value:
                        results[key] = value
        return results

    # TODO: decide which fields of the related-tag edges to keep
    def hash_page_metrics(self, hashtag_url):
        results = {}
        try:
            response = self.__request_url(hashtag_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['TagPage'][0]['graphql']['hashtag']
        except Exception as e:
            raise e
        else:
            # The post lists and the profile picture are not metrics, so skip them.
            skipped = ('edge_hashtag_to_media', 'edge_hashtag_to_top_posts', 'profile_pic_url')
            for key, value in metrics.items():
                if key in skipped:
                    continue
                if isinstance(value, dict) and 'count' in value:
                    # Edge objects that carry a count are reduced to that count.
                    results[key] = value['count']
                elif isinstance(value, dict) and 'edges' in value:
                    # Edge objects without a count: keep the names of the first five nodes.
                    # Assumes each node may carry a 'name' key; missing names become None.
                    results[key] = [edge.get('node', {}).get('name') for edge in value['edges'][:5]]
                else:
                    results[key] = value
        return results
    
    def profile_page_posts(self, profile_url):
        results = []
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']["edges"]
            #pprint(metrics)
        except Exception as e:
            raise e
        else:
            for node in metrics:
                node = node.get('node')
                #if node and isinstance(node, dict): #this line only gets most recent post out
                results.append(node)
        return results
    
    def hashtag_page_posts(self, hashtag_url):
        results = []
        try:
            response = self.__request_url(hashtag_url)
            json_data = self.extract_json_data(response)
            #pprint(json_data)
            metrics = json_data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']["edges"]
            #pprint(metrics)
        except Exception as e:
            raise e
        else:
            for node in metrics:
                node = node.get('node')
                #if node and isinstance(node, dict): #this line only gets most recent post out
                results.append(node)
        return results
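To make the extraction step concrete without a network request, the sketch below parses a small synthetic page. It is a dependency-free variant that uses a regular expression instead of BeautifulSoup, so it is an alternative illustration rather than the class method above. `SAMPLE_HTML` and `extract_shared_data` are hypothetical names, and the sketch assumes the page embeds a `window._sharedData = {...};` assignment inside a script tag.

```python
import json
import re

# Synthetic stand-in for a profile page; real pages are far larger.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">window._sharedData = {"entry_data": {"ProfilePage": [{"graphql": {"user": {"full_name": "A; B"}}}]}};</script>
</body></html>
"""


def extract_shared_data(html):
    """Return the window._sharedData object embedded in html as a Python dict."""
    # Capture everything between the assignment and the ';' that closes the script tag.
    pattern = r'window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>'
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        raise ValueError('window._sharedData not found in page')
    return json.loads(match.group(1))


data = extract_shared_data(SAMPLE_HTML)
user = data['entry_data']['ProfilePage'][0]['graphql']['user']
print(user['full_name'])  # A; B
```

The sample value contains a semicolon on purpose: only the semicolon that terminates the assignment is dropped, so text inside the JSON survives intact.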

### 3. Specify Instagram page

Specify the Instagram username whose profile page you want to scrape. Get a dictionary with all information (images, comments, etc.) from that Instagram profile.

#### User-profile Page

If you want to open a user-profile page, specify the username as:

In [110]:
# to specify
username='pickuplimes'
hashtag = False
url = 'https://www.instagram.com/'+username+'/?hl=en'

#### Hashtag Page

If you want to open a hashtag page (instead of a user profile): 

In [115]:
# to specify
hashtag='swatch'
username = False
url = 'https://www.instagram.com/explore/tags/'+hashtag
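Because both cells assign to the same `url` variable, whichever cell ran last decides which page is scraped. One way to make the choice explicit is a small helper like the sketch below. `build_page_url` and `BASE_URL` are hypothetical names; the URL patterns are the ones used in the two cells above.

```python
from urllib.parse import quote

BASE_URL = 'https://www.instagram.com/'


def build_page_url(username=None, hashtag=None):
    """Build a profile URL or a hashtag URL; exactly one argument must be given."""
    if bool(username) == bool(hashtag):
        raise ValueError('specify either username or hashtag, not both or neither')
    if username:
        return BASE_URL + quote(username) + '/?hl=en'
    # Accept the hashtag with or without a leading '#'.
    return BASE_URL + 'explore/tags/' + quote(hashtag.lstrip('#'))


print(build_page_url(username='pickuplimes'))  # https://www.instagram.com/pickuplimes/?hl=en
print(build_page_url(hashtag='swatch'))        # https://www.instagram.com/explore/tags/swatch
```

The helper also accepts the notebook's convention of setting the unused variable to False, for example `build_page_url(username='pickuplimes', hashtag=False)`.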

### 3. Get information from Instagram page

Now that the URL of the Instagram page is defined, the InstagramScraper class can extract all posts or meta-information from that page.

Get meta-information metrics by using a class method. 

In [114]:
# get profile page metrics
from pprint import pprint

k = InstagramScraper()
results = k.profile_page_metrics(url) 
pprint(results)

{'biography': '🌱plant-based recipes & wholesome living \n'
              '🍒nourish the cells & the soul \n'
              '🌱a YouTube community of 2M friends 👩🏻\u200d🌾\n'
              '👇 NEW VIDEO 👇',
 'business_category_name': 'Publishers',
 'category_id': '2707',
 'edge_felix_video_timeline': 0,
 'edge_follow': 127,
 'edge_followed_by': 531071,
 'edge_media_collections': 0,
 'edge_mutual_followed_by': 0,
 'edge_saved_media': 0,
 'external_url': 'https://youtu.be/0Kgi-H2W7Hk',
 'external_url_linkshimmed': 'https://l.instagram.com/?u=https%3A%2F%2Fyoutu.be%2F0Kgi-H2W7Hk&e=ATM5rZNI8I5aBiZz3RAszJWMkhflagAU_QiH_SQDII3ITWclaigcQbJHAT__clKn0V1x15eE&s=1',
 'full_name': 'Sadia Badiei, BSc Dietetics',
 'highlight_reel_count': 1,
 'id': '2072931271',
 'is_business_account': True,
 'is_verified': True,
 'profile_pic_url': 'https://instagram.fzrh2-1.fna.fbcdn.net/v/t51.2885-19/s150x150/84057956_823380854858266_527460638654464000_n.jpg?_nc_ht=instagram.fzrh2-1.fna.fbcdn.net&_nc_ohc=RvJ85_MOJB4AX_
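The integer values in the output above (for example 'edge_follow': 127) come from reducing each edge object to its count. The sketch below reproduces that reduction on a small synthetic dictionary. `flatten_metrics` and `sample_user` are hypothetical names, and the sample values are illustrative rather than scraped.

```python
def flatten_metrics(user, skip=('edge_owner_to_timeline_media',)):
    """Keep truthy fields and reduce edge objects (dicts) to their 'count' value."""
    results = {}
    for key, value in user.items():
        if key in skip or not value:
            continue
        if isinstance(value, dict):
            # Edge objects look like {'count': 127, ...}; keep only the number.
            results[key] = value.get('count')
        else:
            results[key] = value
    return results


# Synthetic input shaped like the 'user' object of a profile page.
sample_user = {
    'full_name': 'Example User',
    'edge_follow': {'count': 127},
    'edge_saved_media': {'count': 0},
    'external_url': None,
    'edge_owner_to_timeline_media': {'count': 12, 'edges': []},
}

print(flatten_metrics(sample_user))
# {'full_name': 'Example User', 'edge_follow': 127, 'edge_saved_media': 0}
```

A count of zero is kept because the enclosing dictionary is non-empty and therefore truthy, while fields whose value is None or empty are dropped.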

In [179]:
# get hashtag page metrics
from pprint import pprint

k = InstagramScraper()
#TODO
results = k.hash_page_metrics(url) 
#pprint(results)

Get all posts on an Instagram **profile page** that are visible on the landing page. 

In [56]:
# get posts (images) from profile page 
from pprint import pprint


k = InstagramScraper()
results = k.profile_page_posts(url)

print('Posts on Instagram profile page: ', len(results))
print('Second image url on instagram profile: ', results[1]['display_url'])

Posts on Instagram profile page:  12
Second image url on instagram profile:  https://instagram.fzrh2-1.fna.fbcdn.net/v/t51.2885-15/e35/89830458_279055869752422_1934838557654693738_n.jpg?_nc_ht=instagram.fzrh2-1.fna.fbcdn.net&_nc_cat=110&_nc_ohc=TatfJYTxmiAAX_jXpLZ&oh=de7f6ba79434acf649c2664156bb8a2b&oe=5E837309
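Each element of `results` is a post node, and the download step later only needs its 'shortcode' and 'display_url'. The sketch below builds that mapping from synthetic nodes. `image_index` and `sample_posts` are hypothetical names, and the example.com URLs are placeholders.

```python
def image_index(posts):
    """Map each post's shortcode to its display_url, skipping incomplete nodes."""
    return {
        post['shortcode']: post['display_url']
        for post in posts
        if post and 'shortcode' in post and 'display_url' in post
    }


sample_posts = [
    {'shortcode': 'AAA111', 'display_url': 'https://example.com/a.jpg'},
    {'shortcode': 'BBB222', 'display_url': 'https://example.com/b.jpg'},
    None,  # an edge without a 'node' key yields None in the results list
]

print(image_index(sample_posts))
# {'AAA111': 'https://example.com/a.jpg', 'BBB222': 'https://example.com/b.jpg'}
```

Filtering out None entries matters because the class appends whatever `node.get('node')` returns, including None.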


Get all posts on an Instagram **hashtag page** that are visible on the landing page. 

In [58]:
# get posts (images) from hashtag page 
from pprint import pprint


k = InstagramScraper()
results = k.hashtag_page_posts(url)

#pprint(results)
print('Posts on Instagram hashtag page: ', len(results))
print('Second image url on instagram hashtag: ', results[1]['display_url'])

Posts on Instagram hashtag page:  70
Second image url on instagram hashtag:  https://instagram.fzrh2-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/91390209_138245237708029_7195385645887789729_n.jpg?_nc_ht=instagram.fzrh2-1.fna.fbcdn.net&_nc_cat=101&_nc_ohc=2I9-8_vl0AYAX-Whws3&oh=2f77674f588a954ba4f0b3743e9fd603&oe=5EAB50DC


### 4. Save images from list of dicts

Use the requests library to download each image from the ‘display_url’ of every entry in the ‘results’ list (a list of dicts) and store it with the respective shortcode as file name.

Specify the directory for storing the images. 

In [59]:
# download all images from an Instagram page 
import os
import requests
import shutil

# to specify
directory = r"C:\Users\Anonym\Documents\GitHub\DLfM_BrandManagement\images"

# Store the images in a subfolder named after the username or hashtag.
# exist_ok=True avoids an error when the cell is run a second time.
target_dir = os.path.join(directory, username or hashtag or "")
os.makedirs(target_dir, exist_ok=True)

for post in results:
    r = requests.get(post['display_url'], stream=True)
    r.raise_for_status()
    # Decode any content encoding (e.g. gzip) so the raw stream yields the actual image bytes.
    r.raw.decode_content = True
    with open(os.path.join(target_dir, post['shortcode'] + ".jpg"), 'wb') as f:
        # Copy the response stream raw data to local image file.
        shutil.copyfileobj(r.raw, f)
    # Release the connection once the image is written.
    r.close()
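When the cell above is re-run, every image is downloaded again. A possible refinement, sketched here under assumptions, skips files that already exist. `save_posts` and `fetch_with_requests` are hypothetical helpers; the download function is passed in as a parameter so the file handling can be tried offline.

```python
import os


def fetch_with_requests(url):
    """Download url and return its bytes (requires the requests package)."""
    import requests  # imported lazily so save_posts can be used without network access
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def save_posts(posts, target_dir, fetch=fetch_with_requests):
    """Save each post as <shortcode>.jpg in target_dir and return the new file paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for post in posts:
        path = os.path.join(target_dir, post['shortcode'] + '.jpg')
        if os.path.exists(path):
            continue  # already downloaded in an earlier run
        data = fetch(post['display_url'])
        with open(path, 'wb') as f:
            f.write(data)
        saved.append(path)
    return saved
```

With the variables from this notebook, a call could look like `save_posts(results, os.path.join(directory, username))`. Loading the whole image into memory before writing is simpler than streaming and should be acceptable for single photos.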

In [43]:
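# download one image only
import os
import requests
import shutil

# to specify
directory = r"C:\Users\Anonym\Documents\GitHub\DLfM_BrandManagement\images"
shortcode = "B-Tckr0AgrH"

# `url` points at the HTML page, not at an image, so look up the image URL instead.
# Assumes `results` still contains the post with this shortcode.
image_url = next(post['display_url'] for post in results if post['shortcode'] == shortcode)
r = requests.get(image_url, stream=True)
r.raise_for_status()
# Decode any content encoding (e.g. gzip) so the raw stream yields the actual image bytes.
r.raw.decode_content = True

# os.path.join inserts the path separator that plain string concatenation omitted.
with open(os.path.join(directory, shortcode + ".jpg"), 'wb') as f:
    # Copy the response stream raw data to local image file.
    shutil.copyfileobj(r.raw, f)
# Release the connection once the image is written.
r.close()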