<a href="https://colab.research.google.com/github/jinzalabim/Data-Mining-Projects/blob/PSSC-Web-Scraper/PSSC_Web_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PSSC Web Scraper**

### **Introduction**

The [Philippine Social Science Council (PSSC) Knowledge Archives](https://pssc.org.ph/knowledgearchives/) is a repository of academic and scholarly resources. It hosts journals, articles, and publications used by researchers, academics, and students. By scraping the PSSC Knowledge Archives, we extract metadata such as filenames, file links, authors, and publication years, then organize it so that it is easier to search and analyze for academic research and study.

### **Value of Scraping the PSSC Knowledge Archives**

1. **Efficient Data Collection**: Automating the extraction of metadata from the PSSC Knowledge Archives saves time and effort compared to manual collection.
2. **Data Organization**: The scraped data can be structured into a well-organized format, making it easier to search, filter, and analyze.
3. **Enhanced Accessibility**: Organizing the metadata into a database or CSV file makes it more accessible for researchers who need to find specific articles or publications.
4. **Supporting Research**: With organized metadata, researchers can quickly identify relevant resources, aiding in literature reviews and supporting their academic work.
5. **Preservation and Archiving**: Extracting and storing metadata ensures that even if the website structure changes, the information remains preserved and accessible.

### **Objectives**

In this notebook, we will:
1. Scrape metadata from various sections of the PSSC Knowledge Archives, including:
   - Aghamtao
   - Philippine Review of Economics
   - Philippine Political Science Journal
2. Extract and clean data such as filenames, file links, publication years, and, where the section lists them, authors and volumes.
3. Save the extracted data into CSV files for easy access and further analysis.

By the end of this notebook, we will have a well-organized dataset containing valuable metadata from the PSSC Knowledge Archives, ready to be used for academic research and study.


# **AGHAMTAO**


In [None]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin, quote

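# Define the URL and retrieve the web page
url = 'https://pssc.org.ph/knowledgearchives/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.content, 'lxml')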

# Identify the Aghamtao section
aghamtao_tab = soup.find('li', id='aghamtao')
if not aghamtao_tab:
    raise Exception("Aghamtao tab not found.")

# The tab's aria-controls attribute holds the id of the div with its content
aghamtao_section_id = aghamtao_tab['aria-controls']
aghamtao_section = soup.find('div', id=aghamtao_section_id)

if not aghamtao_section:
    raise Exception("Aghamtao section not found.")

# Extract filenames and file links
books = aghamtao_section.find_all('td', class_='filename')

# Prepare lists to store the data
filenames = []
filelinks = []
years = []

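for book in books:
    a_tag = book.find('a')
    if a_tag is None or not a_tag.get('href'):
        continue  # Skip cells without a usable link
    filename = a_tag.text.strip()
    raw_link = a_tag['href']

    # Print raw data for inspection
    print(f"Raw Filename: {filename}")
    print(f"Raw Link: {raw_link}")

    # Correct the file link: keep the part after '/pssc-archives/' and rebuild
    # the URL without the '../' segments
    if '/pssc-archives/' not in raw_link:
        continue  # Skip links that do not point into the archive folder
    relative_path = raw_link.split('/pssc-archives/', 1)[1]
    cleaned_link = f"https://pssc.org.ph/wp-content/pssc-archives/{relative_path}"
    cleaned_link = quote(cleaned_link, safe=':/')  # Percent-encode spaces and other unsafe characters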

    # Print cleaned link for inspection
    print(f"Cleaned Link: {cleaned_link}")

    filenames.append(filename)
    filelinks.append(cleaned_link)

    # Extract year from the surrounding div
    year_div = book.find_previous('div', class_='elementor-tab-title')
    if year_div:
        years.append(year_div.text.strip())
    else:
        years.append(None)

# Create a Pandas DataFrame
data = {
    'Filename': filenames,
    'Filelink': filelinks,
    'Year': years,
}

df = pd.DataFrame(data)

# Save DataFrame to a CSV file in the current working directory
output_path = os.path.join(os.getcwd(), 'aghamtao_books.csv')
df.to_csv(output_path, index=False)
print(f"Data has been saved to {output_path}")

# Print current working directory for debugging
print(f"Current working directory: {os.getcwd()}")


Raw Filename: 0   Contents aghamtao 28
Raw Link: https://pssc.org.ph/wp-content/uploads/2024/07/../../../pssc-archives/Aghamtao/2020/0 - Contents aghamtao 28 .pdf
Cleaned Link: https://pssc.org.ph/wp-content/pssc-archives/Aghamtao/2020/0%20-%20Contents%20aghamtao%2028%20.pdf
Raw Filename: 0   Editors note A G H A M T A O 28
Raw Link: https://pssc.org.ph/wp-content/uploads/2024/07/../../../pssc-archives/Aghamtao/2020/0 - Editors note AGHAMTAO 28.pdf
Cleaned Link: https://pssc.org.ph/wp-content/pssc-archives/Aghamtao/2020/0%20-%20Editors%20note%20AGHAMTAO%2028.pdf
Raw Filename: 1 Aghamtao Vol 28 1 22 T Gibson  Islamic Models of Social Justice in South Sulawesi, Indonesia
Raw Link: https://pssc.org.ph/wp-content/uploads/2024/07/../../../pssc-archives/Aghamtao/2020/1 Aghamtao Vol 28 1-22 TGibson _Islamic Models of Social Justice in South Sulawesi, Indonesia .pdf
Cleaned Link: https://pssc.org.ph/wp-content/pssc-archives/Aghamtao/2020/1%20Aghamtao%20Vol%2028%201-22%20TGibson%20_Islamic%20Mo
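
The raw links above contain `../` segments that the cell removes by splitting on `/pssc-archives/`. As a possible alternative, the path can be normalized with the standard library, which does not depend on a fixed folder name. This is only a sketch: `normalize_link` is a hypothetical helper, not part of the notebook, and it assumes the raw links are not already percent-encoded.

```python
import posixpath
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_link(raw_link):
    """Collapse '..' segments in the URL path and percent-encode the result."""
    parts = urlsplit(raw_link)
    # normpath resolves 'uploads/2024/07/../../../' back to the parent folder
    clean_path = posixpath.normpath(parts.path)
    # quote() keeps '/' by default and encodes spaces as %20
    return urlunsplit(
        (parts.scheme, parts.netloc, quote(clean_path), parts.query, parts.fragment)
    )


# Raw link taken from the cell output above
raw = ('https://pssc.org.ph/wp-content/uploads/2024/07/../../../'
       'pssc-archives/Aghamtao/2020/0 - Contents aghamtao 28 .pdf')
cleaned = normalize_link(raw)
print(cleaned)
```

For this input the result should match the first cleaned link printed by the cell.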

# **ECONOMICS**

In [None]:
import os
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

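# Define the URL and retrieve the web page
url = 'https://pssc.org.ph/knowledgearchives/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.content, 'lxml')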

# Identify the Philippine Review of Economics section
philippine_review_tab = soup.find('li', id='philippine-review-of-economics')
if not philippine_review_tab:
    raise Exception("Philippine Review of Economics tab not found.")

philippine_review_section_id = philippine_review_tab['aria-controls']
philippine_review_section = soup.find('div', id=philippine_review_section_id)

if not philippine_review_section:
    raise Exception("Philippine Review of Economics section not found.")

# Extract filenames, file links, authors, volumes, and years
books = philippine_review_section.find_all('li')

# Prepare lists to store the data
titles = []
filelinks = []
authors = []
volumes = []
years = []

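for book in books:
    # Extract filename (title) and file link.
    # Skip list items without a link so that every column list stays the same
    # length; otherwise pd.DataFrame raises a ValueError.
    a_tag = book.find('a')
    if not a_tag or not a_tag.get('href'):
        continue

    title = a_tag.text.strip()
    link = a_tag['href']

    # Print raw data for inspection
    print(f"Title: {title}")
    print(f"Link: {link}")

    titles.append(title)
    filelinks.append(link)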

    # Extract author
    em_tag = book.find('em')
    if em_tag:
        author = em_tag.text.strip()
        authors.append(author)
    else:
        authors.append(None)

    # Extract volume and year from the surrounding div
    year_div = book.find_previous('div', class_='elementor-tab-title')
    if year_div:
        year_text = year_div.text.strip()
        # Extract volume and year using regex
        match = re.search(r'(Vol \d+, No \d+)\s+\((\d{4})\)', year_text)
        if match:
            volume, year = match.groups()
            volumes.append(volume)
            years.append(year)
        else:
            volumes.append(None)
            years.append(None)
    else:
        volumes.append(None)
        years.append(None)

# Create a Pandas DataFrame
data = {
    'Filename': titles,
    'Filelink': filelinks,
    'Author': authors,
    'Volume': volumes,
    'Year': years,
}

df = pd.DataFrame(data)

# Save DataFrame to a CSV file in the current working directory
output_path = os.path.join(os.getcwd(), 'philippine_review_of_economics_books.csv')
df.to_csv(output_path, index=False)
print(f"Data has been saved to {output_path}")

# Print current working directory for debugging
print(f"Current working directory: {os.getcwd()}")


Title: On Multiple Objectives in the Firm and Arrows Theorem
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/677
Title: A Marketing Approach to Exporting
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/694
Title: On the Tax Conciousness Survey
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/693
Title: A Second Look at the Agricultural Land Reform Code of 1963
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/695
Title: Development and Undevelopment: The Quest for Valid Theory
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/696
Title: Pattern of Philippine Public Expenditure, 1951-60
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/697
Title: Book Review
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/710
Title: The College in Review
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/719
Title: The Philippine Sugar Industry
Link: https://pre.econ.upd.edu.ph/index.php/pre/article/view/703
Ti
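
The volume and year come from the regular expression applied to each tab title. The sketch below isolates that pattern so it can be checked against sample headings before a full run. The sample strings are hypothetical, since the real tab titles are not shown in the output above, and `parse_volume_year` is an illustrative name rather than part of the notebook.

```python
import re

# Same pattern as in the cell above: "Vol <n>, No <n> (<yyyy>)"
VOLUME_YEAR = re.compile(r'(Vol \d+, No \d+)\s+\((\d{4})\)')


def parse_volume_year(heading):
    """Return (volume, year), or (None, None) when the heading does not match."""
    match = VOLUME_YEAR.search(heading)
    return match.groups() if match else (None, None)


# Hypothetical headings, used only to exercise the pattern
print(parse_volume_year('Vol 1, No 1 (1964)'))
print(parse_volume_year('Volume 1 (1964)'))
```

A heading that deviates from the expected shape yields `(None, None)`, so counting the empty `Volume` values in the CSV is a quick way to see whether the pattern needs loosening.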

# **POLITICAL SCIENCE**

In [None]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin, quote

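# Define the URL and retrieve the web page
url = 'https://pssc.org.ph/knowledgearchives/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.content, 'lxml')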

# Identify the Philippine Political Science Journal section
political_science_tab = soup.find('li', id='philippine-political-science-journal')
if not political_science_tab:
    raise Exception("Philippine Political Science Journal tab not found.")

political_science_section_id = political_science_tab['aria-controls']
political_science_section = soup.find('div', id=political_science_section_id)

if not political_science_section:
    raise Exception("Philippine Political Science Journal section not found.")

# Extract filenames, file links, and years
# Print the section HTML to debug what is being retrieved
print(political_science_section.prettify())

# Find all `a` tags with file links
links = political_science_section.find_all('a', href=True)

# Prepare lists to store the data
filenames = []
filelinks = []
years = []

for link in links:
    filename = link.text.strip()
    raw_link = link['href']

    # Print raw data for inspection
    print(f"Raw Filename: {filename}")
    print(f"Raw Link: {raw_link}")

    # Correct the file link
    # Skip links outside /wp-content/; they have no file path to extract
    if '/wp-content/' not in raw_link:
        continue
    relative_path = raw_link.split('/wp-content/', 1)[1]  # Extract the relative path part
    cleaned_link = f"https://pssc.org.ph/wp-content/{relative_path}"
    cleaned_link = quote(cleaned_link, safe=':/')  # Percent-encode spaces and other unsafe characters

    # Print cleaned link for inspection
    print(f"Cleaned Link: {cleaned_link}")

    filenames.append(filename)
    filelinks.append(cleaned_link)

    # Extract year from the accordion title
    # find_previous returns the closest preceding span with class 'eael-accordion-tab-title'
    year_span = link.find_previous('span', class_='eael-accordion-tab-title')
    year = year_span.text.strip() if year_span else 'Unknown'

    years.append(year)

# Create a Pandas DataFrame
data = {
    'Filename': filenames,
    'Filelink': filelinks,
    'Year': years,
}

df = pd.DataFrame(data)

# Save DataFrame to a CSV file in the current working directory
output_path = os.path.join(os.getcwd(), 'philippine_political_science_journal.csv')
df.to_csv(output_path, index=False)
print(f"Data has been saved to {output_path}")

# Print current working directory for debugging
print(f"Current working directory: {os.getcwd()}")


Streaming output truncated to the last 5000 lines.
            <i aria-hidden="true" class="fa-accordion-icon fas fa-plus">
            </i>
           </span>
           <span class="eael-advanced-accordion-icon-opened">
            <i aria-hidden="true" class="fa-accordion-icon fas fa-minus">
            </i>
           </span>
           <span class="eael-accordion-tab-title">
            1990 | Num 30-32
           </span>
           <i aria-hidden="true" class="fa-toggle fas fa-angle-right">
           </i>
          </div>
          <div aria-labelledby="1990-num-30-32" class="eael-accordion-content clearfix" data-tab="22" id="elementor-tab-content-20522">
           <p>
           </p>
           <table class="">
            <tr>
             <th class="filename">
              Filename / Link
             </th>
             <th class="filesize">
              Size
             </th>
            </tr>
            <tr>
             <td class="filename">
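
The accordion titles in the printed HTML combine a year and an issue number, for example `1990 | Num 30-32`, and the cell stores the whole string in the `Year` column. A minimal sketch for splitting the two parts follows. `split_year_issue` is a hypothetical helper, and it assumes every title uses `|` as the separator, which the truncated output confirms for only one title.

```python
def split_year_issue(title):
    """Split an accordion title such as '1990 | Num 30-32' into (year, issue)."""
    year, _, issue = title.partition('|')
    # Titles without a '|' keep the full text as the year and get no issue
    return year.strip(), issue.strip() or None


# Title text as it appears in the prettified HTML, including the whitespace
year, issue = split_year_issue('\n            1990 | Num 30-32\n           ')
print(year, issue)
```

The two values could then be stored in separate `Year` and `Issue` columns.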
            

# **Final Notes**


* Adjust the scraping logic as needed if the HTML structure changes.
* Make sure to test the script thoroughly and verify the output CSV file.
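
One way to verify an output CSV is to scan it for empty fields and repeated links. The sketch below uses only the standard library and runs on in-memory sample data. `find_problems` is a hypothetical helper, the sample rows are invented for illustration, and the column names assume the `Filename`, `Filelink`, and `Year` headers written by the cells above.

```python
import csv
import io


def find_problems(csv_file, required=('Filename', 'Filelink', 'Year')):
    """Return a list of human-readable problems found in the CSV."""
    reader = csv.DictReader(csv_file)
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen_links = set()
    for row_no, row in enumerate(reader, start=1):  # data rows, header excluded
        for col in required:
            if not (row[col] or '').strip():
                problems.append(f"row {row_no}: empty {col}")
        link = (row.get('Filelink') or '').strip()
        if link and link in seen_links:
            problems.append(f"row {row_no}: duplicate link {link}")
        seen_links.add(link)
    return problems


# Invented sample data; replace with open('<output file>.csv', newline='')
sample = io.StringIO(
    "Filename,Filelink,Year\n"
    "Contents,https://example.org/a.pdf,2020\n"
    "Editors note,https://example.org/a.pdf,2020\n"
    "Untitled,https://example.org/b.pdf,\n"
)
problems = find_problems(sample)
for problem in problems:
    print(problem)
```

An empty list means none of these checks failed; it does not confirm that the links themselves resolve.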


