<a href="https://colab.research.google.com/github/luthfiyahastutiningtyas/web-scraping/blob/main/Data_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping (Extract Data)**

Web scraping is the process of automatically retrieving data from websites with a computer program. It is useful for collecting information from the internet without copying it by hand. Typical uses include:

- Collecting product prices from e-commerce sites (Sh**e, Tok*****a).
- Gathering lists of the latest news articles.
- Collecting app reviews and ratings from the Google Play Store.
- Collecting job postings from career websites.
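As a rough illustration of the idea, the sketch below pulls job links out of a small HTML fragment. The fragment, the class names, and `JobLinkParser` are all hypothetical, and it uses only Python's standard-library `html.parser` so it runs without extra installs.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="job"><a href="/jobs/1">Data Analyst</a></li>
  <li class="job"><a href="/jobs/2">Data Engineer</a></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
"""

class JobLinkParser(HTMLParser):
    """Collect (href, text) pairs for links inside <li class="job"> items."""

    def __init__(self):
        super().__init__()
        self.in_job = False        # True while inside a job list item
        self.current_href = None   # href of the link whose text we are waiting for
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.in_job = attrs.get("class") == "job"
        elif tag == "a" and self.in_job:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        # The first non-blank text after a job link is taken as its label.
        if self.current_href and data.strip():
            self.links.append((self.current_href, data.strip()))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_job = False

parser = JobLinkParser()
parser.feed(SAMPLE_HTML)
job_links = parser.links
print(job_links)
```

The same pattern (fetch a page, locate the repeating element, extract the fields) applies to the other use cases in the list.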

## **Install Google Play Scraper**

In [1]:
pip install google-play-scraper

Collecting google-play-scraper
  Downloading google_play_scraper-1.2.7-py3-none-any.whl.metadata (50 kB)
Downloading google_play_scraper-1.2.7-py3-none-any.whl (28 kB)
Installing collected packages: google-play-scraper
Successfully installed google-play-scraper-1.2.7


## **Scraping Sharia Mobile Banking App Links**

This step scrapes a list of apps from the Google Play Store for a given search keyword, in this case "bank syariah".

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
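import requests
from bs4 import BeautifulSoup

# Search Google Play for apps matching "bank syariah" (English locale).
url = 'https://play.google.com/store/search?q=bank%20syariah&c=apps&hl=en'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.content, 'html.parser')

# The jsname/class values below come from the page markup at scrape time
# and may stop matching if Google Play changes its HTML.
get_all = soup.find_all('div', {'jsname' : 'O2DNWb', 'class': 'fUEl2e'})

app_links = []
for item in get_all:
    app_link_elements = item.find_all('a', {'class': 'Si6A0c Gy4nib'})
    for link_element in app_link_elements:
        # Strip the URL prefix so only the package ID remains.
        id_link = link_element['href'].replace('/store/apps/details?id=', '')
        app_links.append(id_link)

print(app_links)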

['co.id.bankbsi.superapp', 'com.bsm.activity2', 'id.aladinbank.mobile', 'id.co.bcasyariah.bsya', 'com.usbank.mobilebanking', 'com.jago.digitalBanking', 'com.brightact.android.bankaroo', 'com.simas.mobile.SimobiPlusSyariah', 'com.ruya.bank', 'com.gma.mbanking.bmd', 'com.malauzai.DH17677', 'com.asiweb.pineriver.mobilebanking', 'com.sdb.app.rl', 'com.asiweb.waverly.mobilebanking']
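The cell above strips a fixed prefix from each `href`. A possible alternative, sketched here under the assumption that the links keep the `?id=...` query format, is to parse the query string with the standard library so that extra parameters do not end up in the package ID. The helper `extract_package_id` and the second and third sample links are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def extract_package_id(href):
    """Return the `id` query parameter of a Play Store details link, or None."""
    query = parse_qs(urlparse(href).query)
    values = query.get("id")
    return values[0] if values else None

hrefs = [
    "/store/apps/details?id=com.bsm.activity2",
    "/store/apps/details?id=id.aladinbank.mobile&hl=en",  # hypothetical extra parameter
    "/store/search?q=bank",                                # hypothetical non-app link
]
package_ids = [extract_package_id(h) for h in hrefs]
print(package_ids)
```

Links without an `id` parameter yield `None`, which makes them easy to filter out before further processing.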


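## **Evaluating the Results**

The scraped list needs to be reviewed because not every entry is relevant: the search also returns unrelated apps such as `com.usbank.mobilebanking`. The list below is the curated set of package IDs used for the rest of the process.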

In [4]:
master_id_bank_syariah = [
    'com.bsm.activity2',
    'com.megasyariah',
    'id.aladinbank.mobile',
    'com.mobilemaslahah',
    'com.app.vioss4',
    'com.dwidasa.bcas.mb.android',
    'com.bankbtpns.mobilebanking',
    'com.simas.mobile.SimobiPlusSyariah',
    'com.aceh.action',
    'mlpt.siemo.mobilebanking.riau',
    'com.bankntbsyariah.mobilebanking',
    'com.gma.mbanking.bmd',
    'com.jago.digitalBanking',
    'com.alami_funder',
    'com.danasyariah.mobiledanasyariah',
    'id.co.danamonsyariah.shafa'
]
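To see how the curated list differs from the search results, the two can be compared with set operations. This is only a sketch of one way to check; both lists are copied from the cells above, and the variable names are illustrative.

```python
# Package IDs returned by the search (copied from the printed output above).
scraped = [
    'co.id.bankbsi.superapp', 'com.bsm.activity2', 'id.aladinbank.mobile',
    'id.co.bcasyariah.bsya', 'com.usbank.mobilebanking', 'com.jago.digitalBanking',
    'com.brightact.android.bankaroo', 'com.simas.mobile.SimobiPlusSyariah',
    'com.ruya.bank', 'com.gma.mbanking.bmd', 'com.malauzai.DH17677',
    'com.asiweb.pineriver.mobilebanking', 'com.sdb.app.rl',
    'com.asiweb.waverly.mobilebanking',
]

# Curated package IDs (copied from master_id_bank_syariah above).
master = [
    'com.bsm.activity2', 'com.megasyariah', 'id.aladinbank.mobile',
    'com.mobilemaslahah', 'com.app.vioss4', 'com.dwidasa.bcas.mb.android',
    'com.bankbtpns.mobilebanking', 'com.simas.mobile.SimobiPlusSyariah',
    'com.aceh.action', 'mlpt.siemo.mobilebanking.riau',
    'com.bankntbsyariah.mobilebanking', 'com.gma.mbanking.bmd',
    'com.jago.digitalBanking', 'com.alami_funder',
    'com.danasyariah.mobiledanasyariah', 'id.co.danamonsyariah.shafa',
]

scraped_set, master_set = set(scraped), set(master)
kept = sorted(scraped_set & master_set)      # found by the search and kept
dropped = sorted(scraped_set - master_set)   # found by the search but not used
added = sorted(master_set - scraped_set)     # in the curated list but not found by the search

print(f"kept={len(kept)}, dropped={len(dropped)}, added={len(added)}")
```

The comparison shows that most of the curated IDs were not returned by this single search query.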

## **Scraping Reviews & Ratings of Sharia Mobile Banking Apps**


In [5]:
import pandas as pd
from google_play_scraper import app, reviews_all

def convert_result_to_dataframe(results):
    """Turn a list of review dicts into a DataFrame with one column per key."""
    return pd.DataFrame(list(results))

# One DataFrame of reviews per app; combined after the loop.
frames = []

for app_package in master_id_bank_syariah:
    try:
        app_detail = app(app_package)
        print(f"App name: {app_detail['title']}")

        # Fetch every available review. `lang` is left commented out,
        # so the library default applies.
        result = reviews_all(
            app_package,
            #lang = 'id'
        )

        data = convert_result_to_dataframe(result)
        data['app_name'] = app_detail['title'].upper()
        frames.append(data)
    except Exception as exc:
        # Report the failure instead of silently skipping the app.
        print(f"Skipped {app_package}: {exc}")

data_integrasi = pd.concat(frames) if frames else pd.DataFrame()

# Convert the review and reply times to datetime; invalid values become NaT.
for col_date in ['at', 'repliedAt']:
    if col_date in data_integrasi.columns:
        data_integrasi[col_date] = pd.to_datetime(data_integrasi[col_date], errors='coerce')

App name: BSI Mobile
App name: M-Syariah
App name: Aladin : Bank Syariah Digital
App name: Mobile Maslahah by bjb syariah
App name: BISA Mobile by KBBS
App name: Tepat Mobile
App name: Aira Mobile
App name: Action Mobile
App name: BRKS Mobile
App name: Bank NTB Syariah mBanking
App name: BMD Syariah Mobile System
App name: Bank Jago/Jago Syariah
App name: ALAMI P2P Funding Sharia
App name: Dana Syariah
App name: Shafa by Danamon Syariah
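The master list holds 16 package IDs, but the output above shows 15 app names, so one lookup appears to have failed without a visible message. The sketch below shows one way to record which IDs failed and why. `fetch_title` is a hypothetical stand-in for the real network call, and the package IDs are made up.

```python
# Hypothetical package IDs; the real list is master_id_bank_syariah.
requested = ["com.example.alpha", "com.example.beta", "com.example.gamma"]

def fetch_title(package_id):
    """Hypothetical stand-in for a network call that looks up an app title."""
    titles = {
        "com.example.alpha": "Alpha Mobile",
        "com.example.gamma": "Gamma Mobile",
    }
    if package_id not in titles:
        raise LookupError(f"app not found: {package_id}")
    return titles[package_id]

succeeded = {}  # package ID -> title
failed = {}     # package ID -> error message

for package_id in requested:
    try:
        succeeded[package_id] = fetch_title(package_id)
    except Exception as exc:
        failed[package_id] = str(exc)

print(f"{len(succeeded)} of {len(requested)} apps fetched")
for package_id, reason in failed.items():
    print(f"failed: {package_id} ({reason})")
```

Keeping the failures in a dictionary makes it possible to retry or investigate them after the loop finishes.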


In [10]:
data_integrasi = data_integrasi.rename(columns={
    "reviewId": "review_id",
    "userName": "user_name",
    "userImage": "user_image",
    "content": "review_content",
    "score": "rating_score",
    "thumbsUpCount": "thumbs_up_count",
    "reviewCreatedVersion": "review_created_version",
    "at": "review_timestamp",
    "replyContent": "reply_content",
    "repliedAt": "replied_at",
    "appVersion": "app_version",
    "app_name": "app_name"
})
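Most entries in the mapping above are plain camelCase to snake_case conversions. For those, a small helper could generate the new names instead of listing each one by hand. This is a sketch only: `camel_to_snake` is a hypothetical helper, and the descriptive renames (`content`, `score`, `at`) would still need an explicit mapping.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase name such as 'thumbsUpCount' to 'thumbs_up_count'."""
    # Insert an underscore before each capital that follows a lowercase letter or digit.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

columns = ["reviewId", "userName", "thumbsUpCount",
           "reviewCreatedVersion", "repliedAt", "app_name"]
converted = {column: camel_to_snake(column) for column in columns}
print(converted)
```

The resulting dictionary can be passed to `DataFrame.rename(columns=...)` in the same way as the hand-written mapping.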


In [11]:
# Show the result
data_integrasi.head(10)

Unnamed: 0,review_id,user_name,user_image,review_content,rating_score,thumbs_up_count,review_created_version,review_timestamp,reply_content,replied_at,app_version,app_name
0,b54f4a03-95d8-4ea0-9378-f5052f40b204,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,"pathetic app, lots of bugs, frequent glitches....",1.0,0.0,,2025-09-02 10:33:51,Terima kasih atas kepercayaan Kakak terhadap B...,2025-09-02 12:16:17,,BSI MOBILE
1,8c007323-23a8-4678-bd8f-dfc24e9ff839,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,Please fix and improve application reability,2.0,0.0,6.21.1,2025-09-02 03:18:34,Terima kasih atas kepercayaan Kakak terhadap B...,2025-09-02 12:18:51,6.21.1,BSI MOBILE
2,2f7c5c43-5488-44e9-8e58-31d582dc8bfb,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,Buat transfer susahnya minta ampun... selalu p...,1.0,0.0,6.26.1,2025-09-01 07:25:21,Terima kasih atas kepercayaan Kakak terhadap B...,2025-09-01 08:19:40,6.26.1,BSI MOBILE
3,1099327a-f1a0-45c3-bd9a-f02260603cd0,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,"Tidak bisa dibuka di HP Huawei, saya butuh sekali",4.0,0.0,6.26.1,2025-08-30 22:39:47,Terima kasih atas kepercayaan Kakak terhadap B...,2025-08-31 01:03:23,6.26.1,BSI MOBILE
4,10aeba7d-f899-4e95-9a1a-b2e9e9d98afe,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,"salam, mohon bantuannya kenapa saya tidak bisa...",5.0,0.0,6.26.1,2025-08-27 15:32:07,"Assalamualaikum Bapak Lutfie, mohon maaf atas ...",2025-08-27 16:20:54,6.26.1,BSI MOBILE
5,a396d2a5-8382-42ba-a222-cabbefafab67,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,sering nge-bug pls 🥹,1.0,0.0,6.26.1,2025-08-23 01:11:54,Assalamualaikum Bapak/Ibu Tosca. Mohon maaf at...,2025-08-23 02:38:19,6.26.1,BSI MOBILE
6,1a8eb854-66da-4958-a732-75ce72d7de5d,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,good,5.0,0.0,,2025-08-22 23:14:06,Terima kasih atas kepercayaan Kakak terhadap B...,2025-08-22 23:35:20,,BSI MOBILE
7,d0062289-e25b-49f3-a2d5-39dada02b5bd,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,untuk sekarang apakah masih boleh pakai Bsi mo...,4.0,0.0,,2025-08-20 06:43:37,Terima kasih atas kepercayaan Kakak terhadap B...,2025-08-20 09:31:56,,BSI MOBILE
8,b9a77ca5-c34b-41c0-b3b2-458431dec410,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,"Beberapa kali ngalamin error, gak bisa transaksi",1.0,0.0,6.26.1,2025-08-18 01:48:28,"Assalamualaikum Bapak/Ibu, mohon maaf atas ket...",2025-08-18 01:59:48,6.26.1,BSI MOBILE
9,c9c124a5-1aa5-43f9-9c78-01b6d2ec1851,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,tolong dong byond by bsi juga bisa di Android ...,5.0,0.0,6.25.0,2025-08-17 06:04:42,Assalamualaikum Bapak Luthfi. Mohon maaf atas ...,2025-08-17 06:18:40,6.25.0,BSI MOBILE


In [12]:
data_integrasi.shape

(28666, 12)

# **Load Data**


In [14]:
!pip install --upgrade google-cloud-bigquery
!pip install --upgrade pandas-gbq

from google.colab import auth
from google.cloud import bigquery

# Authenticate the Google account
auth.authenticate_user()

# Create the client
client = bigquery.Client(project='teak-vent-471410-p8')

# Set the dataset and table
dataset_id = 'Bank_Syariah_Scraping'
table_id = 'reviews_bank_syariah'
full_table_id = f"{client.project}.{dataset_id}.{table_id}"

# Load into BigQuery
data_integrasi.to_gbq(
    destination_table=full_table_id,
    project_id=client.project,
    if_exists='replace'  # 'replace' = overwrite the table, 'append' = add rows
)




100%|██████████| 1/1 [00:00<00:00, 2613.27it/s]
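If the load is ever switched from `'replace'` to `'append'` and the scrape is rerun, the same review could be loaded more than once. One possible safeguard, sketched here with made-up rows, is to drop duplicate `review_id` values before loading.

```python
import pandas as pd

# Hypothetical rows; "r2" appears twice, as it might after a repeated scrape.
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r2", "r3"],
    "rating_score": [5, 1, 1, 3],
    "app_name": ["APP A", "APP A", "APP A", "APP B"],
})

# Keep the last occurrence of each review and rebuild a clean 0..n-1 index.
deduplicated = (
    reviews
    .drop_duplicates(subset="review_id", keep="last")
    .reset_index(drop=True)
)
print(deduplicated)
```

This only removes duplicates within the DataFrame being loaded. Rows already stored in the table would need a separate check.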


# **Transform Data (BigQuery)**

In [18]:
project_id = "teak-vent-471410-p8"  # replace with your own project
client = bigquery.Client(project=project_id)

# Transformation steps:
# 1. Clean text with TRIM on user_name, review_content, reply_content
# 2. Cast rating_score to INT64 for consistency
# 3. Fill NULLs: thumbs_up_count with 0; app_version and review_created_version with 'unknown'
# 4. Add derived columns:
#    - review_length: length of the (trimmed) review text
#    - sentiment_label: sentiment category based on the rating
#    - response_time_hours: time (in hours) between the review and the developer's reply
# 5. Save the result to a new table, reviews_bank_syariah_cleaned

query = f"""
CREATE OR REPLACE TABLE `{project_id}.Bank_Syariah_Scraping.reviews_bank_syariah_cleaned` AS
SELECT
  review_id,
  TRIM(user_name) AS user_name,
  user_image,
  TRIM(review_content) AS review_content,
  CAST(rating_score AS INT64) AS rating_score,
  IFNULL(thumbs_up_count, 0) AS thumbs_up_count,
  IFNULL(review_created_version, 'unknown') AS review_created_version,
  TIMESTAMP(review_timestamp) AS review_timestamp,
  TRIM(reply_content) AS reply_content,
  TIMESTAMP(replied_at) AS replied_at,
  IFNULL(app_version, 'unknown') AS app_version,
  app_name,

  -- derived columns
  LENGTH(TRIM(review_content)) AS review_length,
  CASE
    WHEN rating_score >= 4 THEN 'positive'
    WHEN rating_score = 3 THEN 'neutral'
    WHEN rating_score BETWEEN 1 AND 2 THEN 'negative'
    ELSE 'unknown'
  END AS sentiment_label,

  -- response time (in hours)
  TIMESTAMP_DIFF(replied_at, review_timestamp, HOUR) AS response_time_hours

FROM `{project_id}.Bank_Syariah_Scraping.reviews_bank_syariah`
"""

client.query(query).result()
print("Transformation complete, cleaned table created.")


Transformation complete, cleaned table created.
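The rating-to-sentiment rule in the `CASE` expression can also be written as a plain Python function, which may be handy for checking it on a few values before running the query. This is a sketch that mirrors the rule as written in the SQL; the function itself is not part of the notebook.

```python
def sentiment_label(rating_score):
    """Map a rating to a sentiment label using the same thresholds as the SQL CASE."""
    if rating_score is None:
        return "unknown"       # a NULL rating matches no WHEN branch
    if rating_score >= 4:
        return "positive"
    if rating_score == 3:
        return "neutral"
    if 1 <= rating_score <= 2:
        return "negative"
    return "unknown"           # anything outside the expected 1-5 range

samples = [5, 4, 3, 2, 1, None, 0]
labels = [sentiment_label(score) for score in samples]
print(list(zip(samples, labels)))
```

Note that this label is derived from the star rating alone, not from the review text.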


In [19]:
# Fetch 5 rows from the transformed table (no ORDER BY, so the rows are arbitrary)
check_query = """
SELECT *
FROM `teak-vent-471410-p8.Bank_Syariah_Scraping.reviews_bank_syariah_cleaned`
LIMIT 5
"""
df_check = client.query(check_query).to_dataframe()
df_check


Unnamed: 0,review_id,user_name,user_image,review_content,rating_score,thumbs_up_count,review_created_version,review_timestamp,reply_content,replied_at,app_version,app_name,review_length,sentiment_label,response_time_hours
0,118a0b2d-5fc6-4723-94b9-a5323b849173,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,what happened to the e-wallet top up feature? ...,1,0.0,unknown,2025-05-09 07:04:27+00:00,Assalamualaikum. Apologize for the inconvenien...,2025-05-09 10:18:25+00:00,unknown,BSI MOBILE,174,negative,3
1,d736205a-af7f-499a-8f77-b9299fa3ef80,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,App sering error dan jelek,1,0.0,unknown,2025-02-22 05:16:53+00:00,Assalamualaikum Bapak mohon maaf atas akses BS...,2025-02-22 07:38:36+00:00,unknown,BSI MOBILE,26,negative,2
2,09bb9781-a659-4099-814b-762a48bdf102,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,Sejak ada berita peretasan dan aplikasi diperb...,1,0.0,unknown,2024-06-05 12:28:47+00:00,Assalamualaikum Bapak/Ibu. Mohon maaf atas ket...,2024-06-05 13:03:03+00:00,unknown,BSI MOBILE,294,negative,1
3,0029f86a-103c-422d-8ac5-ceda7f3f8088,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,Rajin amat force close🥲,1,0.0,unknown,2024-05-28 09:45:49+00:00,"Assalamualaikum Bapak Hanako, mohon maaf atas ...",2024-05-28 14:27:12+00:00,unknown,BSI MOBILE,23,negative,5
4,7b971fbe-5fef-4590-84a8-a4ace63c1eab,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,Parah mau bula rekening bolak balik isi ktp da...,1,4.0,unknown,2024-01-17 08:14:15+00:00,"Assalamualaikum Bapak Bambang, mohon maaf atas...",2024-02-02 03:14:29+00:00,unknown,BSI MOBILE,196,negative,379


# **Example Analysis Use Case**


In [8]:
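# Use the renamed columns (review_id, rating_score) produced by the rename step above.
rataan_score = data_integrasi.groupby(['app_name'], as_index = False).agg(
    jumlah_data = ('review_id', 'count'),
    total_score = ('rating_score', 'sum'),
    avg_score = ('rating_score', 'mean'),
    stdev_score = ('rating_score', 'std')
)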

rataan_score = rataan_score.sort_values(by = ['avg_score'], ascending = False, ignore_index = True)
rataan_score

Unnamed: 0,app_name,jumlah_data,total_score,avg_score,stdev_score
0,BMD SYARIAH MOBILE SYSTEM,3,15.0,5.0,0.0
1,ALADIN : BANK SYARIAH DIGITAL,2519,11820.0,4.692338,1.00881
2,ACTION MOBILE,179,746.0,4.167598,1.396167
3,TEPAT MOBILE,47,193.0,4.106383,1.563636
4,BISA MOBILE BY KBBS,120,489.0,4.075,1.551158
5,DANA SYARIAH,1084,4378.0,4.038745,1.566139
6,ALAMI P2P FUNDING SHARIA,274,1090.0,3.978102,1.571109
7,MOBILE MASLAHAH BY BJB SYARIAH,227,901.0,3.969163,1.555442
8,M-SYARIAH,307,1169.0,3.807818,1.710847
9,BANK JAGO/JAGO SYARIAH,7902,28365.0,3.589598,1.734499
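In the summary above, the top-ranked app has only 3 reviews, so its average of 5.0 is not very informative. One possible refinement, sketched below, is to rank only apps with a minimum number of reviews. The rows are copied from the output shown above, and the threshold `MIN_REVIEWS` is a hypothetical choice.

```python
import pandas as pd

# Rows copied from the summary output above (app name, review count, average score).
summary = pd.DataFrame({
    "app_name": [
        "BMD SYARIAH MOBILE SYSTEM", "ALADIN : BANK SYARIAH DIGITAL", "ACTION MOBILE",
        "TEPAT MOBILE", "BISA MOBILE BY KBBS", "DANA SYARIAH",
        "ALAMI P2P FUNDING SHARIA", "MOBILE MASLAHAH BY BJB SYARIAH",
        "M-SYARIAH", "BANK JAGO/JAGO SYARIAH",
    ],
    "jumlah_data": [3, 2519, 179, 47, 120, 1084, 274, 227, 307, 7902],
    "avg_score": [5.0, 4.692338, 4.167598, 4.106383, 4.075,
                  4.038745, 3.978102, 3.969163, 3.807818, 3.589598],
})

MIN_REVIEWS = 100  # hypothetical threshold; adjust to the analysis at hand

# Keep apps with enough reviews, then rank them by average score.
reliable = (
    summary[summary["jumlah_data"] >= MIN_REVIEWS]
    .sort_values(by="avg_score", ascending=False, ignore_index=True)
)
print(reliable)
```

With this threshold, the two apps with fewer than 100 reviews are excluded from the ranking.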
