# README

This notebook is a scratchpad for implementing Instagram web scraping.

The goal is to design and test draft functions that get data from Instagram.

This notebook contains the implementation and tests for a snowballing effect function to get data about Instagram tags.

**Some references**

https://www.eatthis.com/biggest-fast-food-chains-america/

https://medium.com/@h4t0n/instagram-data-scraping-550c5f2fb6f1

https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058

https://towardsdatascience.com/social-network-analysis-of-related-hashtags-on-instagram-using-instacrawlr-46c397cb3dbe

**Endpoint to User Information**

https://i.instagram.com/api/v1/users/{USER_ID}/info/
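
A minimal sketch of how the `{USER_ID}` placeholder in the endpoint above could be filled in. `USER_INFO_TEMPLATE` and `user_info_url` are hypothetical names introduced for illustration. Whether the endpoint answers an unauthenticated request is not established in this notebook, so the sketch only builds the URL and sends nothing.

```python
# Hypothetical helper: fills the {USER_ID} placeholder of the endpoint above.
USER_INFO_TEMPLATE = 'https://i.instagram.com/api/v1/users/{USER_ID}/info/'


def user_info_url(user_id):
    # owner ids appear as strings in the tag JSON, but ints format the same way
    return USER_INFO_TEMPLATE.format(USER_ID=user_id)


# owner id taken from the sample media shown later in this notebook
print(user_info_url('8381139047'))
```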

In [0]:
# Instagram base URL prefix for tag pages
tagurl_prefix = 'https://www.instagram.com/explore/tags/'

In [0]:
# suffix to append to tag request url to retrieve data in JSON format
tagurl_suffix = '/?__a=1'

In [0]:
tagurl_endcursor = '&max_id='

In [0]:
# generic media post prefix (concatenate with a media shortcode to view the post)
posturl_prefix = 'https://www.instagram.com/p/'

In [0]:
# target initial tags
tags = ['bolsonaro', 'haddad', 'dilma', 'ciro', 'guedes', 'moro']

In [0]:
# target URL for the initial test
tagurl = tagurl_prefix + tags[0] + tagurl_suffix

In [0]:
# checking target url
tagurl

'https://www.instagram.com/explore/tags/bolsonaro/?__a=1'
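
The concatenation above could be wrapped in a small helper so that the same code builds both the first request and the later paged requests. This is only a sketch: `build_tag_url` is a hypothetical name, and percent-encoding the tag with `urllib.parse.quote` is an added assumption for tags with accented characters, which the cells above do not handle.

```python
from urllib.parse import quote

# same constants as defined in the cells above
tagurl_prefix = 'https://www.instagram.com/explore/tags/'
tagurl_suffix = '/?__a=1'
tagurl_endcursor = '&max_id='


def build_tag_url(tag, end_cursor=''):
    # percent-encode the tag so accented hashtags still give a valid URL
    url = tagurl_prefix + quote(tag) + tagurl_suffix
    # append the cursor parameter only when there is a cursor to send
    if end_cursor:
        url += tagurl_endcursor + end_cursor
    return url


print(build_tag_url('bolsonaro'))
```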

In [0]:
# needed module
import requests

In [0]:
# requesting JSON information
json_info = requests.get(tagurl).json()

In [0]:
# retrieving a list of medias
medias_list = json_info['graphql']['hashtag']['edge_hashtag_to_media']['edges']

In [0]:
# checking the length of the list
len(medias_list)

61

In [0]:
# checking details about one media
medias_list[0]

{'node': {'__typename': 'GraphImage',
  'accessibility_caption': 'Image may contain: one or more people and text',
  'comments_disabled': False,
  'dimensions': {'height': 599, 'width': 480},
  'display_url': 'https://scontent-iad3-1.cdninstagram.com/vp/d36ffd95fe83839648100da029da0dd5/5D8DD563/t51.2885-15/e35/61457319_405267080197269_7121056769722451689_n.jpg?_nc_ht=scontent-iad3-1.cdninstagram.com',
  'edge_liked_by': {'count': 0},
  'edge_media_preview_like': {'count': 0},
  'edge_media_to_caption': {'edges': [{'node': {'text': '#lavajato #vazajato #morogate #lulalivre #lula #dilma #deltandallagnol #curitiba #brasil #bolsonaro #forabolsonaro #14J'}}]},
  'edge_media_to_comment': {'count': 0},
  'id': '2063482202007812352',
  'is_video': False,
  'owner': {'id': '8381139047'},
  'shortcode': 'Byi9iFDBDkA',
  'taken_at_timestamp': 1560206269,
  'thumbnail_resources': [{'config_height': 150,
    'config_width': 150,
    'src': 'https://scontent-iad3-1.cdninstagram.com/vp/a9680f8b0fd0c9

**Important:**

- shortcode
- edge_media_to_caption > edges[0] > node > text
- owner

In [0]:
# list of media dictionaries (filtered and processed information)
medias = []

for media in medias_list:
  
  node = media['node']
  
  id_media = node['id']
  
  id_owner = node['owner']['id']
  
  edges = node['edge_media_to_caption']['edges']
  
  shortcode = node['shortcode']
  
  # not all medias have a text
  text = edges[0]['node']['text'].replace('\n','') if len(edges) else ''
  
  # posturl_prefix is the post URL prefix defined above
  mediaurl = posturl_prefix + shortcode + '/'
  
  media_dict = {
      'id_media': id_media,
      'id_owner': id_owner,
      'shortcode': shortcode,
      'text': text,
      'mediaurl': mediaurl
  }
  
  medias.append( media_dict )

medias[0:3]

[{'id_media': '2063482202007812352',
  'id_owner': '8381139047',
  'mediaurl': 'https://www.instagram.com/p/Byi9iFDBDkA/',
  'shortcode': 'Byi9iFDBDkA',
  'text': '#lavajato #vazajato #morogate #lulalivre #lula #dilma #deltandallagnol #curitiba #brasil #bolsonaro #forabolsonaro #14J'},
 {'id_media': '2063481337260851473',
  'id_owner': '3047349704',
  'mediaurl': 'https://www.instagram.com/p/Byi9VfsFMER/',
  'shortcode': 'Byi9VfsFMER',
  'text': '‘Em tempo: a dra. Erika Marena(Lava Jato) é uma das responsáveis pela Operação Ouvidos Moucos - que, com seus métodos abusivos, teria levado ao suicídio o reitor da UFSC, Luiz Carlos #Cancellier de Olivo, o Cau.’ #ConversaAfiada#democracia #resistência #lulalivre #lulapresopolítico #história #brasil #ditadura #bolsonaro #bol卐onaro #extremadireita #maquiavel #thomashobbes #rousseau #foucault #poweredbypride #sérgiomoro #stf #vazajato #togasuja #pgr #cnj #cnmp #mpf'},
 {'id_media': '2063345956317004326',
  'id_owner': '8682580051',
  'mediaurl':

**Note**

We have a valid list of information to work with.

The goal now is to create a function that retrieves a large number of posts. Are we talking about a **snowballing effect**?

In [0]:
# checking the end_cursor variable to iterate the search
json_info['graphql']['hashtag']['edge_hashtag_to_media']['page_info']['end_cursor']

'QVFBTmJMWWlteEcxQmgwQUFJR3lBUTlQa0pDcU51d2hMeVJRekdJZFhzbjVDRGd6Ul92eGdHSzhwR3h2VUMxdHZqaXZ1dVZ1TDZJWFA2OUJoY0Y0ekUxVQ=='
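
The cursor above is what makes paging possible: each response names the cursor for the next page. Below is a minimal sketch of that loop, written iteratively and with the request injected as a function so it can run offline. `collect_pages` and `fake_fetch` are hypothetical names, and the two fake pages are made-up data. Treating an empty or missing `end_cursor` as the last page is an assumption, not documented Instagram behaviour.

```python
def collect_pages(fetch_page, max_pages=3):
    """Follow end_cursor from page to page, at most max_pages times."""
    items, cursor = [], ''
    for _ in range(max_pages):
        page = fetch_page(cursor)
        media = page['graphql']['hashtag']['edge_hashtag_to_media']
        items.extend(edge['node'] for edge in media['edges'])
        cursor = media['page_info']['end_cursor']
        if not cursor:  # assumption: no cursor means no further page
            break
    return items


# Offline stand-in for the real request (hypothetical data, two pages).
def fake_fetch(cursor):
    pages = {
        '': (['a', 'b'], 'c1'),
        'c1': (['c'], None),
    }
    shortcodes, next_cursor = pages[cursor]
    return {'graphql': {'hashtag': {'edge_hashtag_to_media': {
        'edges': [{'node': {'shortcode': s}} for s in shortcodes],
        'page_info': {'end_cursor': next_cursor},
    }}}}


print(len(collect_pages(fake_fetch, max_pages=5)))
```

A loop avoids growing the call stack by one frame per page, which matters once the depth reaches the hundreds.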

In [0]:
def json2medias(json_info):
  
  medias_list = json_info['graphql']['hashtag']['edge_hashtag_to_media']['edges']
  
  medias = []

  for media in medias_list:

    node = media['node']

    id_media = node['id']

    id_owner = node['owner']['id']

    edges = node['edge_media_to_caption']['edges']

    shortcode = node['shortcode']
    
    text = edges[0]['node']['text'].replace('\n','') if len(edges) else ''

    # posturl_prefix is the post URL prefix defined above
    mediaurl = posturl_prefix + shortcode + '/'

    media_dict = {
        'id_media': id_media,
        'id_owner': id_owner,
        'shortcode': shortcode,
        'text': text,
        'mediaurl': mediaurl
    }

    medias.append( media_dict )
    
  return medias

In [0]:
import time

In [0]:
def snowball(url, deep=1, end_cursor='', count=0, showurl=False, 
             sleep=0, forever=False, progress=False, pause=60 ):
  
  count = count + 1
  
  request_url = url + tagurl_endcursor + end_cursor
  
  if showurl :
    
    print(request_url)
    
  else:
    
    if progress :
      
      print( count )
      # if count == 1 :
      #  print( '*' * (deep-1) )
      # else:
      #  print( '*', end='' )
    
  # TODO: wrap the request in a try-except block
  json_info = requests.get( request_url ).json()
    
  end_cursor = json_info['graphql']['hashtag']['edge_hashtag_to_media']['page_info']['end_cursor']
  
  medias = json2medias( json_info )
  
  time.sleep(sleep)
  
  if count < deep :
    
    try:
      
      medias += snowball(
          url=url, 
          deep=deep, 
          end_cursor=end_cursor, 
          count=count, 
          showurl=showurl, 
          sleep=sleep,
          forever=forever,
          progress=progress, 
          pause=pause)
      
    except Exception:
      
      if forever :
        
        print( 'Fail, retrying in ' + str(pause) + ' seconds' )
        
        time.sleep(pause)
        
        medias += snowball(
          url=url, 
          deep=deep, 
          end_cursor=end_cursor, 
          count=count, 
          showurl=showurl, 
          sleep=sleep,
          forever=forever, 
          progress=progress, 
          pause=pause)
      
      else:
        
        print( 'Fail, ' + str(count) + ' requests done' )
      
  else:
    
    pass
  
  return medias
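
The TODO inside `snowball` asks for the request to be wrapped in a try-except block. One possible shape is a generic retry wrapper, sketched here under assumptions: `call_with_retries` and its parameters are hypothetical names, and the choice to re-raise after the last attempt is a design decision of this sketch, not something the notebook prescribes.

```python
import time


def call_with_retries(func, retries=3, pause=60, exceptions=(Exception,)):
    """Call func(); on failure wait `pause` seconds and try again.

    The last failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions as error:
            print('Attempt ' + str(attempt) + ' failed: ' + repr(error))
            if attempt == retries:
                raise
            time.sleep(pause)


# Offline demonstration with a function that fails once, then succeeds.
state = {'calls': 0}


def flaky():
    state['calls'] += 1
    if state['calls'] < 2:
        raise ValueError('temporary failure')
    return {'ok': True}


print(call_with_retries(flaky, retries=3, pause=0))
```

Inside `snowball`, the request line could then read `json_info = call_with_retries(lambda: requests.get(request_url).json(), pause=pause)`, which retries the failing request itself instead of relying on the caller one level up to catch the error.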

In [0]:
medias = snowball(tagurl, deep=1)

print( len(medias) )  # number of posts retrieved

63

In [0]:
medias = snowball(tagurl, deep=3)

print( len(medias) )  # number of posts retrieved

185

In [0]:
medias = snowball(tagurl, deep=10)

print( len(medias) )  # number of posts retrieved

641

In [0]:
medias = snowball(tagurl, deep=10, showurl=True)

print( len(medias) )  # number of posts retrieved

https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFDYmd3TGJCbTN1QjBXM25zbm1rUFBBam9oX2o3cmdTd05Mdkc0b05OMm9LOXdQQWpVX0o3VVVpLUxjRV9TU0lza2RKMk9iT0Q1WWdEYVZqUkl4V1gyVg==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFEMzd1cnowZXNPMkxpZTB0d19VSlptVjhVSmktck41Szk2TGdpRjd6R3hUTEwxRFFIWG5lQWxuMFowZGpxNkcwZUxPQWQ0YUk2VmZPVVVreTRYRTdCVQ==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFCTnlWY2hCdU01UjlDZ0xReTBRXzR6ckpxZVlpRnQxYTBGcDJIOFRtRFFRNGVwZndhUmJPclBrZlFaNmhtdks4cDViMk4yWHUwak54MVVqTDdsaEgxZw==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFCOFRYa0g2VTlOUGVPMWtreFFSOGtPR2FVenNlNDF3QmcwbHZQcmdGRHFoMkN3Qi1TUEVmWTVZUVg2V3Vlc3NEaWx4Yy1fVFQ1SHUzRzFoT2lwZmFVZw==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFCZ3pwaTNDY0o0NEtHUUY1aGhyY1lhU0w3dUpGSTNYWlBBODRWdHJvQkFIVDdmbGwxdDhlTG5VaWJsTWNlQ1ZvdHFHQzZnN1JBZU1hYUlSMHQ5cmJYcg==
https://www.inst

641

In [0]:
%%time

medias = snowball(tagurl, deep=20, sleep=1, showurl=True)

print( len(medias) )  # number of posts retrieved

https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFBSF9kcVJ0MWt4Vmp0TnBsS2tFVWdkek1UTk9zTGthcDRjM1BMOVhrYnhfcnp3V0JHYXdTdGZIRWh1akdTUTI0Y3NYdURaTFNvUExsU0ZJSFNYb2pRNw==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFEc0RxOC0zMHRzX3pfbUtIc2hzXzNaLXgwUUgtV3JJTzkwaDdTRU1FR240cENrMVBuYmhHWWUxT2xaeUFZVXdpMkVTcDJWaXhmWl9nZ2VGaXdrR2FXYQ==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFCaWtzcTBxaWxiU0M2Rm0xNVAyclcyc3VrbmN2OGJMcC11bTVncjFjRDZFNGdjSU4wMVRKR24xcDNyX3Vlb3lQcmRyYVkweGd5N3hlYlU4SjhGQjVkQg==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFBSC1BMUxnZ1FoTGVNNWl6RUJVVkVSY1R3QTRuVVJpMlNFa2tFY2JHeUh3SExzZUdnYnB3bWdEMUplTmFJN0ppd0d3VGFiN01HN0RWTE1Ua2k2YThXbA==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFCZFVrQzAyaFlvRVRfdDZyMUJiOU5rSzRMMTZ2RWRRQ016LXhicDlhVWY5dmpfUkF6SjUyclc4RE54RzF6RnlvTXlXX1VGQ2lCY2d1XzNpeTNoQ21Faw==
https://www.inst

In [0]:
%%time

# testing the progress output

medias = snowball(tagurl, deep=20, sleep=1, progress=True)

print( len(medias) )  # number of posts retrieved

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1286
CPU times: user 425 ms, sys: 45.3 ms, total: 470 ms
Wall time: 47.4 s


In [0]:
%%time

medias = snowball(tagurl, deep=50, sleep=1, progress=True)

print( len(medias) )  # number of posts retrieved

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Fail, 23 requests done
1486
CPU times: user 494 ms, sys: 54.1 ms, total: 549 ms
Wall time: 48.6 s


In [0]:
%%time

medias = snowball(tagurl, deep=50, sleep=1, progress=True, forever=True)

print( len(medias) )  # number of posts retrieved

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41

Fail, retrying in 60 seconds
41
42
43
44
45
46
47
48
49
50
3168
CPU times: user 1.11 s, sys: 120 ms, total: 1.23 s
Wall time: 2min 46s


In [0]:
%%time

medias = snowball(tagurl, deep=100, sleep=0.5, progress=True, forever=True, pause=30)

print( len(medias) )  # number of posts retrieved

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
Fail, retrying in 30 seconds
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
Fail, retrying in 30 seconds
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
Fail, retrying in 30 seconds
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
Fail, retrying in 30 seconds
98
99
100
6322
CPU times: user 2.32 s, sys: 232 ms, total: 2.55 s
Wall time: 4min 55s
