# Microstock Keyword Helper
Author: Irfan Muhammad Ghani
Updated: 13 Apr 2021


Pada dasarnya urutan keyword yang digunakan pada metadata gambar sangat menentukan posisi hasil pencarian, dan ini mempengaruhi banyaknya download dan penghasilan.

Tools ini dibuat untuk memudahkan microstocker dalam menganalisa title dan keywords yang akan digunakan di dalam metadata gambar.

Yang dibutuhkan:

- Pandas
- Selenium
- BeautifulSoup
- Driver browser:
    - Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
    - Chrome: https://chromedriver.chromium.org/downloads
    - Apabila browser yang digunakan bukan di atas, cari dengan kata kunci "nama_browsernya driver"

Cara kerjanya:

- Pastikan telah menginstall library yang diperlukan dan memasukkan driver browser ke variable "driver"
- Masukkan kata kunci pencarian gambar di variable "search_for", dan masukkan jumlah data yg ingin diambil di variable "how_many"
- Sistem akan melakukan scapping data pada halaman pertama pencarian, diantaranya:
    - Title
    - Keywords
    - Jumlah keywords
    - Category
    - Author
    - Jumlah stok Author
    - Link author
    - Link gambar
- Semua data akan disimpan pada dataframe "metadata", dari sini temen-temen bisa menganalisa beberapa hal, semisal keyword apa saja yg paling relevan untuk pencarian tersebut, atau menentukan judul yang cocok untuk gambar yang akan diupload.
- Selain itu, data tersebut akan disimpan dalam format .csv<br/><br/>

Tools ini baru bisa mengambil data dari Shutterstock saja, mungkin jika ada waktu luang akan saya update agar bisa mengambil data dari agensi lainnya, dan menambahkan beberapa fitur otomatis.


In [1]:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# masukkan driver browser yang digunakan
driver = webdriver.Edge('msedgedriver')

In [2]:
# masukkan kata kuncinya di sini
search_for = 'isometric laptop'

# masukkan jumlah data yang ingin diambil
how_many = 5

URL_home = 'https://www.shutterstock.com'
URL_search = 'https://www.shutterstock.com/search/'

driver.get(URL_search + search_for)
content = driver.page_source
soup = BeautifulSoup(content)

In [3]:
# get item url from search page
links = []
for a in soup.find_all('a', href = True, class_ = 'z_h_81637'):
    links.append(URL_home + a['href'])
    # print(URL_home + a['href'])

titles = []
keywords = []
categories = []
author = []
author_link = []
author_total_stock_text = []
image_link = []
word_temp = []

#get title and keywords from item
for link in links[:how_many]:
    driver.get(link)
    content = driver.page_source
    soup = BeautifulSoup(content)

    # get image link
    image_link.append(link)

    # get title
    titles.append(soup.find('h1', class_ = 'm_b_d59a1 font-headline-base').get_text())

    # get categories
    for category in soup.find_all('p', class_ = 'm_q_51037'):
        for a in category.find_all('a'):
            word_temp.append(a.get_text())
        categories.append(word_temp)
        word_temp = []

    # get keywords
    for keyword in soup.find_all('div', class_ = 'C_a_03061'):
        for word in keyword.find_all('a'):
            word_temp.append(word.get_text())
        keywords.append(word_temp)
        word_temp = []

    # get author
    for auth in soup.find_all('p', class_ = "oc_Q_7bfac"):
        for a in auth.find_all('a', href = True):
            author.append(a.get_text())
            author_link.append(a['href'])
    
    # get total stock author
    driver.get(a['href'])
    content = driver.page_source
    soup = BeautifulSoup(content)

    author_total_stock_text.append(soup.find('h2', class_ = "z_n_1ebe6").get_text())

    print("^")

proses
proses
proses
proses
proses


In [4]:
# get total keyword
keywords_total = []
for idx, words in enumerate(keywords):
    keywords_total.append(len(words))

In [5]:
# cleaning total stock
author_total_stock = []
for idx, total in enumerate(author_total_stock_text):
    author_total_stock.append(total.split()[0].replace(",",""))

In [6]:
# create dataframe
metadata = pd.DataFrame({
    "Title" : titles,
    # "Category" : categories,
    "Keyword" : keywords,
    "TotalKeyword" : keywords_total,
    "Author" : author,
    "Total Stock" : author_total_stock,
    "Author Link" : author_link,
    "Image Link" : image_link
})

In [7]:
metadata

Unnamed: 0,Title,Keyword,TotalKeyword,Author,Total Stock,Author Link
0,Big data concept. Isometric vector illustration,"[isometric, laptop, computer, data, 3d, vector...",43,victoria pineapple,252,https://www.shutterstock.com/g/victoria+pineapple
1,Concept business strategy. Analysis data and I...,"[isometric, laptop, market, analytic, business...",50,monkographic,2588,https://www.shutterstock.com/g/monkographic
2,Web design and Front end development isometric...,"[app, application, background, banner, busines...",50,Andrii Symonenko,13079,https://www.shutterstock.com/g/Symonenko_Andrii
3,"Isometric vector set of computer, laptop. 3d d...","[isometric, laptop, vector, 3d, pc, notebook, ...",30,RobertCop93,373,https://www.shutterstock.com/g/RobertCop13
4,"Opened grey laptop, modern technology, device,...","[3d, business, cartoon, clip art, communicatio...",50,robuart,139629,https://www.shutterstock.com/g/robuart


In [8]:
metadata.to_csv(search_for + ".csv", index=False)

In [9]:
# jumlah kemunculan setiap keyword
keyword_gabungan = metadata.Keyword.sum()
print("jumlah seluruh keyword:", len(keyword_gabungan))
print("jumlah keyword unik:", pd.Series(metadata["Keyword"].sum()).nunique())

jumlah seluruh keyword: 223
jumlah keyword unik: 132


In [10]:
# daftar keyword unik
keyword_unik = pd.Series(metadata["Keyword"].sum()).unique()
print("daftar keyword unik:\n", keyword_unik)

daftar keyword unik:
 ['isometric' 'laptop' 'computer' 'data' '3d' 'vector' 'science' 'chart'
 'business' 'design' 'mobile' 'internet' 'technology' 'seo' 'coffee'
 'notebook' 'web' 'concept' 'gradient' 'graphic' 'intelligence' 'website'
 'app' 'information' 'optimization' 'statistics' 'network' 'phone'
 'background' 'banner' 'cup' 'mining' 'analytics' 'big data'
 'data analysis' 'data management' 'documents' 'graphs' 'illustration'
 'infographics' 'interface' 'management' 'overload' 'market' 'analytic'
 'analysis' 'report' 'icon' 'strategy' 'research' 'flat' 'graph'
 'financial' 'stock' 'calculator' 'service' 'money' 'infographic'
 'statistic' 'economic' 'plan' 'success' 'growth' 'diagram' 'corporate'
 'modern' 'progress' 'symbol' 'communication' 'development' 'digital'
 'finance' 'graphics' 'investment' 'marketing' 'application' 'code'
 'coding' 'creating' 'developer' 'doodle' 'end' 'front' 'front-end'
 'frontend' 'hands' 'icons' 'landing page' 'male' 'media' 'page'
 'production' 'pro

In [11]:
# jumlah kemunculan keyword unik
kemunculan = []
jumlah_kemunculan = 0
for key1 in keyword_unik:
    for key2 in keyword_gabungan:
        if key1 == key2:
            jumlah_kemunculan += 1
    kemunculan.append(jumlah_kemunculan)
    jumlah_kemunculan = 0

df_kemunculan_keyword_unik = pd.DataFrame({"Keyword Unik" : keyword_unik, "Kemunculan" : kemunculan})

print(df_kemunculan_keyword_unik.sort_values("Kemunculan", ascending=False).head(50))

      Keyword Unik  Kemunculan
16             web           5
38    illustration           5
2         computer           5
5           vector           5
8         business           5
9           design           5
1           laptop           5
11        internet           5
12      technology           5
17         concept           5
47            icon           4
50            flat           4
28      background           4
65          modern           4
0        isometric           4
70         digital           4
3             data           4
4               3d           4
15        notebook           3
23     information           3
40       interface           3
10          mobile           3
19         graphic           3
67          symbol           3
100             pc           2
68   communication           2
101       isolated           2
103         device           2
104         office           2
76            code           2
107        display           2
69     d

In [12]:
# irisan / intersection seluruh keyword apabila digabung
# akan terlihat bagus jika hanya 3 data saja yg diambil

set.intersection(*map(set,keywords))

{'business',
 'computer',
 'concept',
 'design',
 'illustration',
 'internet',
 'laptop',
 'technology',
 'vector',
 'web'}

In [13]:
# rata-rata posisi keyword
# keywords[0]
posisi = []
pos = []
for row in range(len(keywords)):
    for idx, key in enumerate(keywords[row]):
        # print(key,idx+1)
        pos.append(idx+1)
    posisi.append(pos)

sum_posisi = []
ambil_posisi = 0
rata_posisi = []

for idx1, key1 in enumerate(keyword_unik):
    for row in range(len(keywords)):
        for idx2, key2 in enumerate(keywords[row]):
            if key1 == key2:
                ambil_posisi += posisi[row][idx2]
    sum_posisi.append(ambil_posisi)
    rata_posisi.append(ambil_posisi/kemunculan[idx1])
    # print(key1, ambil_posisi, ambil_posisi/kemunculan[idx1])
    ambil_posisi = 0

In [14]:
df_kemunculan_keyword_unik = pd.DataFrame({
    "Keyword Unik" : keyword_unik,
    "Kemunculan" : kemunculan,
    "Rata-rata Posisi" : rata_posisi
})

In [15]:
df_kemunculan_keyword_unik.sort_values(["Kemunculan","Rata-rata Posisi"], ascending=(False, True))[:20]

Unnamed: 0,Keyword Unik,Kemunculan,Rata-rata Posisi
5,vector,5,4.6
8,business,5,7.8
2,computer,5,11.4
1,laptop,5,13.0
9,design,5,13.0
16,web,5,13.6
17,concept,5,17.6
38,illustration,5,21.0
12,technology,5,21.8
11,internet,5,24.0
