# PubMed Health-Related Articles Extraction

This project focuses on extracting health-related articles from PubMed using the NCBI Entrez Programming Utilities (E-utilities). PubMed provides access to a wealth of peer-reviewed articles on various health topics, making it invaluable for applications like literature reviews, sentiment analysis, trend monitoring, and topic modeling.

## Objectives
The main objectives of this project are:
- **Data Collection**: Retrieve articles from PubMed based on specific health-related categories (e.g., mental health, fitness, oncology).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.

## Tools and Technologies
The following tools and technologies are used in this project:
- **Python**: The primary programming language for interacting with the PubMed API.
- **NCBI E-utilities**: The official API for retrieving articles and metadata from PubMed.
- **Requests**: A Python library for making HTTP requests to access web pages and APIs.
- **Pandas**: A Python library for data manipulation and storage in CSV or database formats.





<p style="color:#FBCE60;text-align:center;font-size:30px"> Scraping PubMed Articles</p>

In [None]:
!pip install bs4
!pip install selenium

^C



[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


### Scraping PubMed  Articles 

<p style="color:#FFC107;text-align:left;font-size:20px"> Searching for PubMed's health related Articles   </p>

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import pandas as pd 
import re
# URL to scrape
url = 'https://pubmed.ncbi.nlm.nih.gov/?term=author+manuscript%5Bfilter%5D&format=abstract&size=200'

# List to store scraped articles
articles = []
 

# Function to save articles periodically
def save_articles_to_file(data):
    try:
        data=pd.DataFrame(data)     
        data.to_csv("../data/pubmed/pubmedArticles.csv")                 
        print("Data saved to ../data/pubmed/pubmedArticles.csv")
    except Exception as e:
        print(f"Error saving data: {e}")

def scrape_current_page(url, pageIndex):
    if(pageIndex>1):
        url+=f"&page={pageIndex}"
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36',
    'Referer': 'https://pubmed.ncbi.nlm.nih.gov',
    }

    request = Request(url, headers=headers)

    # Open the URL and read the page content
    with urlopen(request) as response:
        page_source = response.read()
    try:
        soup = BeautifulSoup(page_source, 'html.parser')
        # Find all articles on the page
        article_elements = soup.find_all('article', class_="article-overview")
    except KeyError:
        print(KeyError)

    # Extract article data
    for articleElement in article_elements:
        try:
            article = {
                "articleTitle": articleElement.find("h1",class_="heading-title").find("a").get_text(),
                "articleTextContent":re.sub(r'\s+', ' ', articleElement.find("div", "abstract-content").find("p").get_text().strip())
            }
            articles.append(article)
            print(article)

            # Save data periodically every 500 articles
            if len(articles) % 100 == 0:
                print(f"Last seen page {pageIndex}")
                save_articles_to_file()

        except Exception as e:
            print(f"Error processing article: {e}")
            continue

pageIndex=1
# Start scraping from the initial URL
for i in range(1,4984):
    scrape_current_page(url, pageIndex)
    pageIndex+=1


{'articleTitle': '\n        \n  Intraventricular CARv3-TEAM-E T Cells in Recurrent Glioblastoma\n\n\n      ', 'articleTextContent': 'In this first-in-human, investigator-initiated, open-label study, three participants with recurrent glioblastoma were treated with CARv3-TEAM-E T cells, which are chimeric antigen receptor (CAR) T cells engineered to target the epidermal growth factor receptor (EGFR) variant III tumor-specific antigen, as well as the wild-type EGFR protein, through secretion of a T-cell-engaging antibody molecule (TEAM). Treatment with CARv3-TEAM-E T cells did not result in adverse events greater than grade 3 or dose-limiting toxic effects. Radiographic tumor regression was dramatic and rapid, occurring within days after receipt of a single intraventricular infusion, but the responses were transient in two of the three participants. (Funded by Gateway for Cancer Research and others; INCIPIENT ClinicalTrials.gov number, NCT05660369.).'}
{'articleTitle': '\n        \n  Newe

HTTPError: HTTP Error 502: Bad Gateway