# Instagram scraping by Scraping Fish 🐟

This notebook is a comprehensive tutorial for scraping public Instagram profile information and posts using [Scraping Fish API](https://scrapingfish.com).
To run it and actually scrape the data, you will need a Scraping Fish API key, which you can get here: [Scraping Fish Request Packs](https://scrapingfish.com/buy).
A starter pack of 1,000 API requests costs only $2 and will let you run this tutorial and play with the API on your own ⛹️.
Without a Scraping Fish API key, you are likely to get blocked instantly ⛔️.

Scraping Fish is a premium scraping API that uses rotating 4G/LTE proxies by default.
It is the best available proxy type for scraping since mobile IPs are ephemeral and constantly reassigned between real users.
This type of proxy is capable of scraping even the most demanding websites, like Instagram, without being blocked.
You can read more on advanced topics in Scraping Fish API [Documentation](https://scrapingfish.com/docs/intro).

This notebook implements a function for scraping Instagram profile data and posts and handles pagination.
As an example, we will use it to obtain the posts published by the profile [Stare domy](https://www.instagram.com/staredomynasprzedaz/) 🏚.
It is an aggregate listing of old houses for sale in Poland.
Post descriptions in this profile provide fairly structured data about the property, including location, price, size, etc.

### Imports

The packages imported in the cell below are listed in the `requirements.txt` file.
Install them first by running `pip install -r requirements.txt`.

In [1]:
import json
import re
import time
from typing import Any, Dict, List, Optional, Union
from urllib.parse import quote_plus

import pandas as pd
import requests
from tqdm.notebook import tqdm
from retry.api import retry_call

### API key

A Scraping Fish API key is needed to run this example without being instantly blocked by Instagram.

Get your API key and a starter pack of 1,000 API requests for just $2 here: [Scraping Fish Request Packs](https://scrapingfish.com/buy).

In [2]:
API_KEY = "your API key"
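
If you prefer not to keep the key in the notebook itself, one option is to read it from an environment variable. The snippet below is a minimal sketch of that idea; the variable name `SCRAPING_FISH_API_KEY` is an assumption made for this example, not something defined by Scraping Fish.

```python
import os

# SCRAPING_FISH_API_KEY is a hypothetical variable name chosen for this sketch.
# If it is not set, fall back to the same placeholder used in the cell above.
API_KEY = os.environ.get("SCRAPING_FISH_API_KEY", "your API key")
```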

### Parsing Instagram response

The `parse_posts` function implemented in the cell below extracts basic post data from a JSON response:
* shortcode: you can use it to access the post at `https://www.instagram.com/p/<shortcode>/`
* image_url 🏞
* description: post text 📝
* n_comments: number of comments 💬
* n_likes: number of likes 👍
* timestamp: when the post was created ⏰

In [3]:
def parse_posts(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    post_edges = user_data.get("edge_owner_to_timeline_media", {}).get("edges", [])
    posts = []
    for node in post_edges:
        post_json = node.get("node", {})
        shortcode = post_json.get("shortcode")
        image_url = post_json.get("display_url")
        caption_edges = post_json.get("edge_media_to_caption", {}).get("edges", [])
        description = caption_edges[0].get("node", {}).get("text") if len(caption_edges) > 0 else None
        n_comments = post_json.get("edge_media_to_comment", {}).get("count")
        likes_key = "edge_liked_by" if "edge_liked_by" in post_json else "edge_media_preview_like"
        n_likes = post_json.get(likes_key, {}).get("count")
        timestamp = post_json.get("taken_at_timestamp")
        posts.append({
            "shortcode": shortcode,
            "image_url": image_url,
            "description": description,
            "n_comments": n_comments,
            "n_likes": n_likes,
            "timestamp": timestamp,
        })
    return posts
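
To make the nesting easier to follow, here is a hand-written, heavily trimmed sample of the response shape that the function above walks through. It only contains the keys the parser reads (plus a `page_info` entry), with values taken from one row of the scraped data shown later; real responses carry many more fields, and the image URL is a placeholder.

```python
# Hand-written sample for illustration only; not a real Instagram response.
sample_response = {
    "graphql": {
        "user": {
            "id": "123",  # placeholder user id
            "edge_owner_to_timeline_media": {
                "page_info": {"has_next_page": False, "end_cursor": None},
                "edges": [
                    {
                        "node": {
                            "shortcode": "CernE8HMqMC",
                            "display_url": "https://example.com/image.jpg",  # placeholder
                            "edge_media_to_caption": {
                                "edges": [{"node": {"text": "Wysoka, woj. lubuskie\nCena: 260 000 zł"}}]
                            },
                            "edge_media_to_comment": {"count": 18},
                            "edge_liked_by": {"count": 563},
                            "taken_at_timestamp": 1654985766,
                        }
                    }
                ],
            },
        }
    }
}

# Walk the same path as the parser: user -> timeline media -> edges -> node.
media = sample_response["graphql"]["user"]["edge_owner_to_timeline_media"]
first_post = media["edges"][0]["node"]
caption_edges = first_post.get("edge_media_to_caption", {}).get("edges", [])
caption = caption_edges[0]["node"]["text"] if caption_edges else None
print(first_post["shortcode"], first_post["edge_liked_by"]["count"], caption.splitlines()[0])
```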

The `parse_page_info` function implemented in the cell below extracts the page info dictionary, which contains the cursor used for pagination.

In [4]:
def parse_page_info(response_json: Dict[str, Any]) -> Dict[str, Union[Optional[bool], Optional[str]]]:
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    page_info = user_data.get("edge_owner_to_timeline_media", {}).get("page_info", {})
    return page_info
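
The cursor returned in the page info is what drives pagination: each response says where the next page starts, and the loop stops once no cursor comes back. The following sketch shows that pattern in isolation, without any network access. `FAKE_PAGES`, `fetch_page` and `collect_all` are hypothetical names invented for this illustration.

```python
from typing import Any, Dict, List, Optional

# Fake pages keyed by cursor; None stands for the first request. Illustration only.
FAKE_PAGES: Dict[Optional[str], Dict[str, Any]] = {
    None: {"items": ["post-1", "post-2"], "page_info": {"has_next_page": True, "end_cursor": "cursor-a"}},
    "cursor-a": {"items": ["post-3"], "page_info": {"has_next_page": True, "end_cursor": "cursor-b"}},
    "cursor-b": {"items": ["post-4"], "page_info": {"has_next_page": False, "end_cursor": None}},
}


def fetch_page(cursor: Optional[str]) -> Dict[str, Any]:
    """Hypothetical stand-in for an HTTP request that returns one page."""
    return FAKE_PAGES[cursor]


def collect_all() -> List[str]:
    """Follow end_cursor from page to page until no cursor is returned."""
    items: List[str] = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        items.extend(page["items"])
        cursor = page["page_info"].get("end_cursor")
        if not cursor:
            break
    return items


all_items = collect_all()
print(all_items)
```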

### Instagram profile scraping logic

The main function for scraping data from an Instagram profile is implemented in the cell below.
It takes three arguments:
* username: Instagram profile username, e.g., `staredomynasprzedaz`, `itsdougthepug`, `selenagomez`
* url_prefix: (optional) Scraping Fish API endpoint with your API key and other query params set according to your needs, e.g. `f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&render_js=false&url="`
* n_retries: (optional) The number of retries in case of any error (timeout, JSON parsing, etc.)

and returns a list of all user posts.

In [5]:
def scrape_ig_profile(username: str, url_prefix: str = "", n_retries: int = 5) -> List[Dict[str, Any]]:
    # url in Scraping Fish API must be encoded: https://scrapingfish.com/docs/scraping-urls-with-query-params
    ig_profile_url = quote_plus(f"https://www.instagram.com/{username}/?__a=1")
    
    def request_json(url: str) -> Dict[str, Any]:
        response = requests.get(url)
        response.raise_for_status()
        return response.json()
    
    response_json = retry_call(request_json, fargs=[f"{url_prefix}{ig_profile_url}"], tries=n_retries)
    
    # get user_id from response to request next pages with posts
    user_id = response_json.get("graphql", {}).get("user", {}).get("id")
    if not user_id:
        print(f"User {username} not found.")
        return []
    # parse the first batch of posts from user profile response
    posts = parse_posts(response_json=response_json)
    page_info = parse_page_info(response_json=response_json)
    # get next page cursor
    end_cursor = page_info.get("end_cursor")
    pbar = tqdm()
    while end_cursor:
        posts_url = quote_plus(
            f"https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}"
        )
        response_json = retry_call(request_json, fargs=[f"{url_prefix}{posts_url}"], tries=n_retries)
        posts.extend(parse_posts(response_json=response_json))
        page_info = parse_page_info(response_json=response_json)
        end_cursor = page_info.get("end_cursor")
        pbar.update()
    return posts
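
The encoding step at the top of the function matters because the target URL has its own query string. Left unencoded, its `?`, `=` and `&` characters would be read as parameters of the Scraping Fish request rather than as part of the `url` value. A small sketch of what `quote_plus` produces, using a placeholder API key:

```python
from urllib.parse import quote_plus

API_KEY = "YOUR_API_KEY"  # placeholder, replace with a real key
url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&render_js=false&url="

target_url = "https://www.instagram.com/staredomynasprzedaz/?__a=1"
encoded_target = quote_plus(target_url)

# The encoded target is appended as the value of the trailing `url=` parameter.
request_url = f"{url_prefix}{encoded_target}"
print(encoded_target)
print(request_url)
```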

### Scrape with Scraping Fish 🐟

Now we are ready to scrape posts for profile `staredomynasprzedaz` using Scraping Fish API 🚀

In [6]:
url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&render_js=false&url="
posts = scrape_ig_profile(username="staredomynasprzedaz", url_prefix=url_prefix)

0it [00:00, ?it/s]

As you can see, Scraping Fish scraped all the posts from profile `staredomynasprzedaz` in less than 2 seconds per requested page❗️

Let us create a pandas data frame to inspect the data and make further processing easier.

In [7]:
df = pd.DataFrame(posts)
df

Unnamed: 0,shortcode,image_url,description,n_comments,n_likes,timestamp
0,CfGM_rwMIU2,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Jarki, Rojewo, woj. kujawsko-pomorskie \nCena:...",16,579,1655878061
1,CfBcqbCs7rD,https://scontent-frx5-1.cdninstagram.com/v/t51...,"Kalinowo, woj. warmińsko-mazurskie \nCena: 350...",6,436,1655718503
2,Ce4Z5oCs5NE,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Kwiatów, Złotoryja, woj. dolnośląskie \nCena: ...",16,651,1655415065
3,CewFMPHsxwf,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Sadki, woj. kujawsko-pomorskie \nCena: 290 000...",16,688,1655135772
4,CernE8HMqMC,https://scontent-frx5-2.cdninstagram.com/v/t51...,"Wysoka, woj. lubuskie\nCena: 260 000 zł\n95 m²...",18,563,1654985766
...,...,...,...,...,...,...
377,CM1ukXeHR5R,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Kobiele Wielkie, woj. łódzkie\nCena: 119 000 z...",0,52,1616670532
378,CM1shw6H8kO,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Siedliska, Tuchów, woj. małopolskie\nCena: 119...",1,58,1616669462
379,CM0QmlwnfQ9,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Siedliska-Bogusz, Brzostek, woj. podkarpackie....",0,48,1616621267
380,CM0LEY5nL46,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Henryków Lubański, Lubań, woj. dolnośląskie.\n...",7,106,1616618366
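
The `timestamp` column holds plain integers. Assuming they are Unix timestamps in seconds (which is how the values above read), they can be turned into readable dates. The sketch below uses only the standard library and the timestamp from the first row of the table.

```python
from datetime import datetime, timezone

timestamp = 1655878061  # first row of the table above

# Interpret the value as seconds since the Unix epoch, in UTC.
created_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(created_at.isoformat())
```

For the whole data frame, `pd.to_datetime(df["timestamp"], unit="s")` performs the same conversion on every row.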


### Parsing post description

Everything looks good.
Now, we can parse post descriptions to retrieve more detailed property information.
For this, we will implement a function that uses regular expressions to parse basic information about properties:
* location (address and province) 📍
* price in PLN 💰
* house size in m² 🏠
* plot area in m² 📐

In [8]:
# precompiled regex for extracting property info from description
address_pattern = re.compile(r"(?P<address>[\w\-,\s]+)woj.(?P<province>[\w\-,\s]+)")
price_pattern = re.compile(r"Cena:(?P<price>[\d\s]+)zł")
house_size_pattern = re.compile(r"[a-żA-Ż,\s]*(?P<house_size>[\d,\s]+)(m²|m2)")
plot_area_pattern = re.compile(
    # "|" inside a character class is a literal, so [łl] is enough to accept both spellings
    r"Dzia[łl]ka:"
    + r"(((?P<plot_area_m>[\d,\s]+)(m²|m2))"
    + r"|((?P<plot_area_a>[\d,\s]+)arów)"
    + r"|((?P<plot_area_h>[\d,\s]+)ha))"
)

def parse_description(description: str):
    address = None
    province = None
    price = None
    house_size = None
    plot_area = None
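    # get the structured part of the post description (everything before the first blank line);
    # split() also handles a missing blank line, and `or ""` guards against posts without a caption
    property_info = (description or "").split("\n\n", 1)[0]
    property_info = property_info.replace("&nbsp;", " ").replace("\xa0", " ").split("\n")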
    for i, line in enumerate(property_info):
        if i == 0:
            address_match = address_pattern.match(line)
            if address_match:
                address = address_match.group("address").strip().rstrip(",")
                province = address_match.group("province").strip()
                continue
        if price is None:
            price_match = price_pattern.match(line)
            if price_match:
                price = float(price_match.group("price").replace(" ", "").replace(",", "."))
                continue
        if house_size is None:
            house_size_match = house_size_pattern.match(line)
            if house_size_match:
                house_size = float(house_size_match.group("house_size").replace(" ", "").replace(",", "."))
                continue
        if plot_area is None:
            plot_area_match = plot_area_pattern.match(line)
            if plot_area_match:
                if plot_area_match.group("plot_area_m"):
                    plot_area = float(plot_area_match.group("plot_area_m").replace(" ", "").replace(",", "."))
                    break
                if plot_area_match.group("plot_area_a"):
                    plot_area = float(plot_area_match.group("plot_area_a").replace(" ", "").replace(",", "."))
                    plot_area = plot_area * 100
                    break
                if plot_area_match.group("plot_area_h"):
                    plot_area = float(plot_area_match.group("plot_area_h").replace(" ", "").replace(",", "."))
                    plot_area = plot_area * 10_000
                    break
    return address, province, price, house_size, plot_area
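
The plot area is the only field that needs unit conversion: values given in ares are multiplied by 100 and values in hectares by 10,000 to get m². The sketch below isolates that step with a simplified pattern and a lookup table. It is an alternative formulation written for illustration, not the code used above, and `PLOT_AREA`, `M2_PER_UNIT` and `plot_area_m2` are hypothetical names. The sample lines are hand-written in the style of the post descriptions.

```python
import re
from typing import Optional

# Simplified pattern: a number (digits, spaces, decimal comma) followed by a unit.
PLOT_AREA = re.compile(r"Dzia[łl]ka:\s*(?P<value>[\d,\s]+?)\s*(?P<unit>m²|m2|arów|ha)")

# Conversion factors to square metres: 1 are = 100 m², 1 ha = 10,000 m².
M2_PER_UNIT = {"m²": 1, "m2": 1, "arów": 100, "ha": 10_000}


def plot_area_m2(line: str) -> Optional[float]:
    """Return the plot area in m², or None if the line is not a plot area line."""
    match = PLOT_AREA.match(line)
    if match is None:
        return None
    # Drop thousands separators (spaces) and switch the decimal comma to a dot.
    number = re.sub(r"\s", "", match.group("value")).replace(",", ".")
    return float(number) * M2_PER_UNIT[match.group("unit")]


for sample_line in ["Działka: 1 990 m²", "Działka: 12 arów", "Działka: 1,5 ha", "Cena: 260 000 zł"]:
    print(sample_line, "->", plot_area_m2(sample_line))
```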

We can add columns with the property information extracted from descriptions to the data frame:

In [9]:
df[["address", "province", "price", "house_size", "plot_area"]] = df.apply(
    lambda row: parse_description(description=row["description"]), axis="columns", result_type="expand"
)

It would be useful to have additional derived features, e.g., the price per m² of the house and of the plot:

In [10]:
df["price_per_house_m2"] = df["price"].div(df["house_size"])
df["price_per_plot_m2"] = df["price"].div(df["plot_area"])

Let us inspect the final result:

In [11]:
df

Unnamed: 0,shortcode,image_url,description,n_comments,n_likes,timestamp,address,province,price,house_size,plot_area,price_per_house_m2,price_per_plot_m2
0,CfGM_rwMIU2,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Jarki, Rojewo, woj. kujawsko-pomorskie \nCena:...",16,579,1655878061,"Jarki, Rojewo",kujawsko-pomorskie,330000.0,42.0,1137.0,7857.142857,290.237467
1,CfBcqbCs7rD,https://scontent-frx5-1.cdninstagram.com/v/t51...,"Kalinowo, woj. warmińsko-mazurskie \nCena: 350...",6,436,1655718503,Kalinowo,warmińsko-mazurskie,350000.0,90.0,12279.0,3888.888889,28.503950
2,Ce4Z5oCs5NE,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Kwiatów, Złotoryja, woj. dolnośląskie \nCena: ...",16,651,1655415065,"Kwiatów, Złotoryja",dolnośląskie,2800000.0,625.0,1314.0,4480.000000,2130.898021
3,CewFMPHsxwf,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Sadki, woj. kujawsko-pomorskie \nCena: 290 000...",16,688,1655135772,Sadki,kujawsko-pomorskie,290000.0,110.0,1840.0,2636.363636,157.608696
4,CernE8HMqMC,https://scontent-frx5-2.cdninstagram.com/v/t51...,"Wysoka, woj. lubuskie\nCena: 260 000 zł\n95 m²...",18,563,1654985766,Wysoka,lubuskie,260000.0,95.0,1990.0,2736.842105,130.653266
...,...,...,...,...,...,...,...,...,...,...,...,...,...
377,CM1ukXeHR5R,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Kobiele Wielkie, woj. łódzkie\nCena: 119 000 z...",0,52,1616670532,Kobiele Wielkie,łódzkie,119000.0,80.0,2600.0,1487.500000,45.769231
378,CM1shw6H8kO,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Siedliska, Tuchów, woj. małopolskie\nCena: 119...",1,58,1616669462,"Siedliska, Tuchów",małopolskie,119000.0,60.0,5900.0,1983.333333,20.169492
379,CM0QmlwnfQ9,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Siedliska-Bogusz, Brzostek, woj. podkarpackie....",0,48,1616621267,"Siedliska-Bogusz, Brzostek",podkarpackie,120000.0,88.2,1300.0,1360.544218,92.307692
380,CM0LEY5nL46,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Henryków Lubański, Lubań, woj. dolnośląskie.\n...",7,106,1616618366,"Henryków Lubański, Lubań",dolnośląskie,245000.0,180.0,1399.0,1361.111111,175.125089


Everything looks correct, we can save the data to a CSV file:

In [12]:
df.to_csv(f"staredomynasprzedaz-{time.strftime('%Y-%m-%d-%H-%M-%S')}.csv", index=False)

Now, you can add some automation on top of the code in this notebook to run scraping every day or hour, depending on your needs, and get notified of new great deals posted on Instagram.
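
One possible building block for such automation is remembering which shortcodes were already seen, so that each run reports only new posts. The sketch below keeps that state in a JSON file. `find_new_posts` and the file name `seen_shortcodes.json` are hypothetical choices for this example, and the notification step itself is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def find_new_posts(posts: List[Dict[str, Any]], seen_path: Path) -> List[Dict[str, Any]]:
    """Return posts whose shortcode was not seen in earlier runs and update the state file."""
    seen = set(json.loads(seen_path.read_text())) if seen_path.exists() else set()
    new_posts = [post for post in posts if post["shortcode"] not in seen]
    seen.update(post["shortcode"] for post in new_posts)
    seen_path.write_text(json.dumps(sorted(seen)))
    return new_posts


# Simulate two consecutive runs against a temporary state file.
with tempfile.TemporaryDirectory() as tmp_dir:
    state_file = Path(tmp_dir) / "seen_shortcodes.json"
    first_run = find_new_posts([{"shortcode": "CfBcqbCs7rD"}], state_file)
    second_run = find_new_posts(
        [{"shortcode": "CfGM_rwMIU2"}, {"shortcode": "CfBcqbCs7rD"}], state_file
    )

print([post["shortcode"] for post in first_run])
print([post["shortcode"] for post in second_run])
```

In a scheduled job, the list passed in would be the output of `scrape_ig_profile`, and the returned posts are the ones worth a notification.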

### Data exploration

Based on the data frame that we created, we can extract some useful stats, e.g., the number of houses in each province and the mean price per m²:

In [13]:
df.groupby("province").agg({"price_per_house_m2": ["mean", "count"]}).sort_values(by=("price_per_house_m2", "mean"))

province,price_per_house_m2 (mean),price_per_house_m2 (count)
opolskie,1452.879375,4
zachodniopomorskie,1832.76397,23
dolnośląskie,1970.157887,47
lubuskie,2022.4196,13
podkarpackie,2272.833264,50
łódzkie,2396.973905,8
warmińsko - mazurskie,2733.333333,1
lubelskie,2781.316515,18
podlaskie,2781.343499,50
śląskie,2893.644595,27


You can also filter the data to find houses that you might be interested in.
The example below searches for houses with a price below 200,000 PLN and a size between 100 m² and 200 m².
Here is a link to one of them based on its shortcode: https://www.instagram.com/p/CYv93e8Nvwh/

In [14]:
df[(df["price"] < 200000.0) & (df["house_size"] < 200.0) & (df["house_size"] > 100.0)]

Unnamed: 0,shortcode,image_url,description,n_comments,n_likes,timestamp,address,province,price,house_size,plot_area,price_per_house_m2,price_per_plot_m2
69,CYv93e8Nvwh,https://scontent-vie1-1.cdninstagram.com/v/t51...,"Ponikwa, Bystrzyca Kłodzka, woj. dolnośląskie ...",24,710,1642247030,"Ponikwa, Bystrzyca Kłodzka",dolnośląskie,165000.0,120.0,3882.0,1375.0,42.503864
95,CXGiAE6sw6y,https://instagram.fwaw7-1.fna.fbcdn.net/v/t51....,"Gadowskie Holendry, Tuliszków, woj. wielkopols...",7,466,1638709205,"Gadowskie Holendry, Tuliszków",wielkopolskie,199000.0,111.0,3996.0,1792.792793,49.7998
159,CTSiKe4sGsx,https://instagram.fwaw7-1.fna.fbcdn.net/v/t51....,"Gotówka, Ruda - Huta, woj. lubelskie\nCena: 18...",3,188,1630522009,"Gotówka, Ruda - Huta",lubelskie,186000.0,120.0,1832.0,1550.0,101.528384
186,CR040yusVyV,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Leżajsk, woj. podkarpackie \nCena: 175 000 zł\...",26,550,1627379773,Leżajsk,podkarpackie,175000.0,108.0,912.0,1620.37037,191.885965
218,CQvirvJM0vi,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Rząśnik, Świerzawa, woj. dolnośląskie \nCena: ...",4,239,1625052909,"Rząśnik, Świerzawa",dolnośląskie,199000.0,160.0,1900.0,1243.75,104.736842
227,CQhJJTFsyPt,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Szymbark, Gorlice, woj. małopolskie \nCena: 19...",2,222,1624569758,"Szymbark, Gorlice",małopolskie,199000.0,136.0,9574.0,1463.235294,20.785461
243,CQBAajvM4q0,https://instagram.fwaw8-1.fna.fbcdn.net/v/t51....,"Brzeżanka, Strzyżów, woj. podkarpackie \nCena:...",9,256,1623491439,"Brzeżanka, Strzyżów",podkarpackie,179000.0,143.66,24600.0,1245.997494,7.276423
254,CPnu52aMKbl,https://scontent-frx5-1.cdninstagram.com/v/t51...,"Gierałtów, Nowogrodziec, woj. dolnośląskie \nC...",4,225,1622643397,"Gierałtów, Nowogrodziec",dolnośląskie,190000.0,176.0,900.0,1079.545455,211.111111
273,CPJEkPEMdOU,https://scontent-frt3-1.cdninstagram.com/v/t51...,"Winiec-Sułowo, Bisztynek, woj. warmińsko-mazur...",10,245,1621614567,"Winiec-Sułowo, Bisztynek",warmińsko-mazurskie,185000.0,180.0,4600.0,1027.777778,40.217391
284,CO3DCWKs9Tv,https://scontent-frx5-1.cdninstagram.com/v/t51...,"Rząśnik, Świerzawa, woj. dolnośląskie \nCena: ...",4,193,1621009785,"Rząśnik, Świerzawa",dolnośląskie,199000.0,160.0,1900.0,1243.75,104.736842
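
The same range condition can also be written with `Series.between`, which some readers find easier to scan. The sketch below runs on a small hand-built frame with four rows copied from the results above, so it works without scraping anything. The `inclusive="neither"` argument assumes pandas 1.3 or newer.

```python
import pandas as pd

# Four rows copied from the scraped data, reduced to the columns used by the filter.
sample = pd.DataFrame(
    {
        "shortcode": ["CfGM_rwMIU2", "CernE8HMqMC", "CYv93e8Nvwh", "CR040yusVyV"],
        "price": [330000.0, 260000.0, 165000.0, 175000.0],
        "house_size": [42.0, 95.0, 120.0, 108.0],
    }
)

# Price strictly below 200,000 PLN and house size strictly between 100 and 200 m².
mask = (sample["price"] < 200_000) & sample["house_size"].between(100, 200, inclusive="neither")
matches = sample[mask]
print(matches)
```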


### Conclusion

I hope you now feel more confident in scraping.
As you can see, it is super easy to scrape publicly available data with Scraping Fish API, even from websites as challenging as Instagram.
In a similar way, you can scrape other user profiles as well as other websites that contain relevant information for you or your business 📈.

### Let's talk about your use case 💼

Feel free to reach out using our [contact form](https://scrapingfish.com/contact).
We can assist you in integrating Scraping Fish API into your existing scraping workflow or help you set up a scraping system for your use case.