## *Web Scraping using XPath*

In [1]:
# Importing Libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

options = Options()
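# The `headless` property is deprecated in newer Selenium 4 releases; pass the Chrome flag instead
options.add_argument("--headless=new")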
options.add_argument("--window-size=1920,1200")

In [2]:
# Installing webdriver
driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

# Get URL
driver.get("https://www.jpnn.com/indeks/")


# Get article detail with XPath (find_element returns the first match only)
base_xpath = '//div[@class="content list-berita"]//div[@class="content"]'
title = driver.find_element(by=By.XPATH, value=base_xpath + '//h1')
url = driver.find_element(by=By.XPATH, value=base_xpath + '//h1//a')
date = driver.find_element(by=By.XPATH, value=base_xpath + '//h6//span[@class="silver"]')

# Save the details in the detail_article dictionary
detail_article = {'title': title.text, 'url': url.get_attribute("href"), 'date': date.text}

# print detail article
print(detail_article)
driver.quit()

{'title': 'Alam Ganjar Cari Atlet Esport Potensial di Jateng', 'url': 'https://jateng.jpnn.com/jateng-terkini/4707/alam-ganjar-cari-atlet-esport-potensial-di-jateng', 'date': 'Senin, 29 Agustus 2022 – 02:37 WIB'}
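
To see what these XPath expressions select without launching a browser, the same paths can be tried on a static snippet. The sketch below is only an illustration under assumptions: the markup is made up and much simpler than the real page, and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of Selenium. The names `SAMPLE_PAGE`, `ARTICLE_XPATH`, and `extract_articles` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup for illustration; the real page is larger.
SAMPLE_PAGE = """
<html><body>
  <div class="content list-berita">
    <div class="content">
      <h1><a href="https://example.com/news/1">First headline</a></h1>
      <h6><span class="silver">Senin, 29 Agustus 2022 – 02:37 WIB</span></h6>
    </div>
    <div class="content">
      <h1><a href="https://example.com/news/2">Second headline</a></h1>
      <h6><span class="silver">Senin, 29 Agustus 2022 – 02:10 WIB</span></h6>
    </div>
  </div>
</body></html>
"""

# ElementTree needs a relative path, hence the leading "." before "//".
ARTICLE_XPATH = './/div[@class="content list-berita"]//div[@class="content"]'


def extract_articles(page_source):
    """Return one dict per article block found in the markup."""
    root = ET.fromstring(page_source)
    articles = []
    for block in root.findall(ARTICLE_XPATH):
        # Paths are evaluated relative to each article block.
        link = block.find(".//h1//a")
        date = block.find('.//h6//span[@class="silver"]')
        articles.append(
            {"title": link.text, "url": link.get("href"), "date": date.text}
        )
    return articles


articles = extract_articles(SAMPLE_PAGE)
print(articles[0])    # the first block, comparable to what find_element returns
print(len(articles))  # number of article blocks in the sample
```

In Selenium, collecting every block instead of only the first would use `driver.find_elements(By.XPATH, ...)`, which returns a list of elements.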


In [3]:
# Open the saved URL to load the article page with the full text
saved_url = detail_article.get('url')
driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))
driver.get(saved_url)

# Get article content using XPATH
article_body = driver.find_element(by=By.XPATH, value='//*[@class="page-content"]//*[@itemprop="articleBody"]').text

# print article content
print(article_body)
driver.quit()

jateng.jpnn.com, SOLO - Ketua Harian Indonesia Esports Association (IESPA) Alam Ganjar menyatakan banyak atlet esport yang potensial di Jawa Tengah seiring dengan perkembangan jenis olahraga baru tersebut.
"Salah satunya terlihat pada saat turnamen esport di Palembang beberapa waktu lalu, kami mengirimkan atlet untuk tiga cabor dan mampu menghasilkan dua medali," katanya di sela pelaksanaan DIGIFUN Festival di Mal Solo Paragon, Minggu (28/8).
Selain itu, sejumlah atlet di Jawa Tengah juga berhasil mewakili Indonesia ke SEA Games 2021.
Baca Juga:
Persis Solo Tumbang di Markas Borneo FC
"Memang potensi atlet asal Jawa Tengah tinggi, makanya kami coba bisa enggak sampai ke akar rumput (mencari potensi yang lain) melalui turnamen seperti ini," katanya.
Alam Ganjar menyampaikan dalam beberapa bulan ke depan Indonesia akan menjadi tuan rumah turnamen esport internasional di Bali.
"Harapannya pada makin semangat ikut kualifikasi dari Jateng. Sayang kalau dari Jateng sebetulnya potensial, teta

In [4]:
# Make a Dataframe
import pandas as pd

# A list with one dict yields a one-row DataFrame
df = pd.DataFrame([detail_article])
df["article_body"] = article_body
df

| | title | url | date | article_body |
|---|---|---|---|---|
| 0 | Alam Ganjar Cari Atlet Esport Potensial di Jateng | https://jateng.jpnn.com/jateng-terkini/4707/al... | Senin, 29 Agustus 2022 – 02:37 WIB | jateng.jpnn.com, SOLO - Ketua Harian Indonesia... |


## *Saving the news details to a database*

In [5]:
# importing libraries
import mysql.connector
from mysql.connector import Error
import pandas as pd

In [6]:
# Connecting to MySQL Server
def create_server_connection(host_name, user_name, user_password):
    connection = None
    try:
        connection = mysql.connector.connect(
            host=host_name,
            user=user_name,
            passwd=user_password
        )
        print("MySQL Database connection successful")
    except Error as err:
        print(f"Error: '{err}'")

    return connection

In [7]:
# Connecting
connection = create_server_connection("localhost", "root", "insert your mysql password here")

MySQL Database connection successful


In [8]:
# Function for Creating a New Database
def create_database(connection, query):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        print("Database created successfully")
    except Error as err:
        print(f"Error: '{err}'")

In [None]:
# Creating a New Database
create_database_query = "CREATE DATABASE news_article"
create_database(connection, create_database_query)

Now that the database exists on the MySQL server, we can modify our `create_server_connection` function to connect directly to it. A single MySQL server commonly hosts several databases, so we want every connection to select the one we are interested in automatically.

We can do this like so:

In [10]:
# Connecting to the Database
def create_db_connection(host_name, user_name, user_password, db_name):
    connection = None
    try:
        connection = mysql.connector.connect(
            host=host_name,
            user=user_name,
            passwd=user_password,
            database=db_name
        )
        print("MySQL Database connection successful")
    except Error as err:
        print(f"Error: '{err}'")

    return connection

In [11]:
# Creating a Query Execution Function
def execute_query(connection, query):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        connection.commit()
        print("Query successful")
    except Error as err:
        print(f"Error: '{err}'")
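
The helper above takes a finished SQL string. When the values come from scraped text, which can contain quotes and newlines, a variant that passes them as parameters may be safer. The sketch below is an assumption-laden illustration, not part of the original notebook: `execute_query_params` is a hypothetical name, it catches a broad `Exception` so it does not depend on a specific driver (with MySQL Connector/Python, `mysql.connector.Error` would be the narrower choice), and it assumes a DB-API style connection whose cursor accepts `%s` placeholders.

```python
def execute_query_params(connection, query, params):
    """Run one parameterized statement and commit; return True on success."""
    cursor = connection.cursor()
    try:
        # The driver escapes each value in `params` before sending the query.
        cursor.execute(query, params)
        connection.commit()
        print("Query successful")
        return True
    except Exception as err:  # assumption: broad catch to stay driver-agnostic
        connection.rollback()
        print(f"Error: '{err}'")
        return False
    finally:
        cursor.close()
```

A call could look like `execute_query_params(connection, "INSERT INTO some_table (name) VALUES (%s)", ("value",))`, where `some_table` is a placeholder table name.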

In [None]:
# Creating Tables
create_newsarticle_table = """
CREATE TABLE news_article (
  id int(11) NOT NULL AUTO_INCREMENT,
  title varchar(255) COLLATE utf8mb4_unicode_520_ci NOT NULL DEFAULT '',
  url varchar(255) CHARACTER SET latin1 NOT NULL DEFAULT '',
  content longtext COLLATE utf8mb4_unicode_520_ci,
  summary text COLLATE utf8mb4_unicode_520_ci,
  article_ts bigint(20) NOT NULL DEFAULT '0' COMMENT 'published timestamp of article',
  published_date date DEFAULT NULL,
  inserted timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  UNIQUE KEY UNIK (url)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;
 """

connection = create_db_connection("localhost", "root", "insert your mysql password here", "news_article") # Connect to the Database
execute_query(connection, create_newsarticle_table) # Execute our defined query

*Since the task is to import the news details into a database, I will show how to write the insert query by hand, using the values from the `df` dataframe.*

In [17]:
# Populating the Tables
pop_newsarticle = """
INSERT INTO news_article VALUES
(1, 'Alam Ganjar Cari Atlet Esport Potensial di Jateng', 'https://jateng.jpnn.com/jateng-terkini/4707/alam-ganjar-cari-atlet-esport-potensial-di-jateng', 'jateng.jpnn.com, SOLO - Ketua Harian Indonesia Esports Association (IESPA) Alam Ganjar menyatakan banyak atlet esport yang potensial di Jawa Tengah seiring dengan perkembangan jenis olahraga baru tersebut.\n"Salah satunya terlihat pada saat turnamen esport di Palembang beberapa waktu lalu, kami mengirimkan atlet untuk tiga cabor dan mampu menghasilkan dua medali," katanya di sela pelaksanaan DIGIFUN Festival di Mal Solo Paragon, Minggu (28/8).\nSelain itu, sejumlah atlet di Jawa Tengah juga berhasil mewakili Indonesia ke SEA Games 2021.\nBaca Juga:\nPersis Solo Tumbang di Markas Borneo FC\n"Memang potensi atlet asal Jawa Tengah tinggi, makanya kami coba bisa enggak sampai ke akar rumput (mencari potensi yang lain) melalui turnamen seperti ini," katanya.\nAlam Ganjar menyampaikan dalam beberapa bulan ke depan Indonesia akan menjadi tuan rumah turnamen esport internasional di Bali.\n"Harapannya pada makin semangat ikut kualifikasi dari Jateng. Sayang kalau dari Jateng sebetulnya potensial, tetapi enggak bisa berkontribusi," katanya.\nBaca Juga:\nBerkat Lapak Ganjar, Pengusaha Mebel di Jepara Banjir Orderan\nPutra Gubernur Jateng Ganjar Pranowo itu juga berharap ke depan esport bisa menjadi salah satu pilihan ekstra kurikuler di sekolah sehingga pelajar makin terwadahi untuk mengembangkan potensi mereka.\n"Sebetulnya mayoritas atlet kami dari pelajar, mahasiswa. Memang belajar tetap yang utama, tetapi kami juga ingin mendorong esport jadi ekskul, ini bisa jadi industri besar," katanya.', NULL, NULL, 'Senin, 29 Agustus 2022 – 02:37 WIB', NULL, NULL);
"""

connection = create_db_connection("localhost", "root", "insert your mysql password here", "news_article")
execute_query(connection, pop_newsarticle)

MySQL Database connection successful
Error: '1048 (23000): Column 'article_ts' cannot be null'


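***Column `article_ts`*** cannot be null: the table declares it `NOT NULL`, but the query passes `NULL` for it. I inserted `NULL` because the scraped timestamp is a `string` and I am still working out how to convert it. The same raw string is also passed to `published_date`, which is a `DATE` column, so it needs converting as well. I will fix this later.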
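
One possible way to do the conversion is sketched below, under some assumptions: the date string always follows the pattern seen in the scraped result, `WIB` means UTC+7, and the month names are the standard Indonesian ones. The names `parse_wib_date`, `build_insert_params`, and `INSERT_ARTICLE` are hypothetical and not part of the original notebook.

```python
import re
from datetime import datetime, timedelta, timezone

# WIB (Waktu Indonesia Barat) is UTC+7.
WIB = timezone(timedelta(hours=7), "WIB")

# Indonesian month names mapped to month numbers.
MONTHS_ID = {
    "Januari": 1, "Februari": 2, "Maret": 3, "April": 4,
    "Mei": 5, "Juni": 6, "Juli": 7, "Agustus": 8,
    "September": 9, "Oktober": 10, "November": 11, "Desember": 12,
}

# Matches e.g. "29 Agustus 2022 – 02:37 WIB"; the weekday prefix is ignored.
DATE_PATTERN = re.compile(r"(\d{1,2}) (\w+) (\d{4})\D+(\d{1,2}):(\d{2}) WIB")

# Columns are named explicitly so id, summary, inserted and updated
# fall back to their defaults.
INSERT_ARTICLE = """
INSERT INTO news_article (title, url, content, article_ts, published_date)
VALUES (%s, %s, %s, %s, %s)
"""


def parse_wib_date(text):
    """Turn the scraped date string into a timezone-aware datetime."""
    match = DATE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unrecognized date format: {text!r}")
    day, month_name, year, hour, minute = match.groups()
    if month_name not in MONTHS_ID:
        raise ValueError(f"Unknown month name: {month_name!r}")
    return datetime(int(year), MONTHS_ID[month_name], int(day),
                    int(hour), int(minute), tzinfo=WIB)


def build_insert_params(row):
    """Build the parameter tuple for INSERT_ARTICLE from one scraped row."""
    published = parse_wib_date(row["date"])
    return (
        row["title"],
        row["url"],
        row["article_body"],
        int(published.timestamp()),      # article_ts as a Unix timestamp
        published.date().isoformat(),    # published_date as YYYY-MM-DD
    )


sample_row = {
    "title": "Alam Ganjar Cari Atlet Esport Potensial di Jateng",
    "url": "https://jateng.jpnn.com/jateng-terkini/4707/alam-ganjar-cari-atlet-esport-potensial-di-jateng",
    "date": "Senin, 29 Agustus 2022 – 02:37 WIB",
    "article_body": "jateng.jpnn.com, SOLO - ...",  # shortened for the example
}
params = build_insert_params(sample_row)
print(params[3], params[4])
```

With an open connection, the tuple could then be passed as `cursor.execute(INSERT_ARTICLE, params)` followed by `connection.commit()`, so the article text does not have to be pasted into the SQL string by hand.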

Still, we are close to the goal, which is to scrape the news articles using **`XPath`** and import the result into a **`database`**.