<h2 style='text-align:center;'>NLP Web Scraping & XML Parsing Project</h2>

In this project, we parse a **medical article in XML format**, extract structured metadata, clean the article body, and preprocess the text for **Natural Language Processing (NLP)** tasks.

**Tech Stack:** Python, XML Parsing (ElementTree), BeautifulSoup, NLTK, Pandas.

In [1]:
# Import required libraries
import xml.etree.ElementTree as ET   # XML parsing
from bs4 import BeautifulSoup        # HTML/XML text extraction
import re, nltk, string, unicodedata # text cleaning
import pandas as pd                  # structured outputs

# Download NLTK resources (only first time)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

### Step 1: Load the XML file
We will parse the XML file using `ElementTree`.

In [2]:
# Load XML file
xml_file = '../data/769952.xml'
tree = ET.parse(xml_file)
root = tree.getroot()

# Display root tag
root.tag

'article'
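
`ET.parse` raises an exception if the file is missing or the XML is malformed. A minimal sketch of a guarded loader follows; `load_root` is a hypothetical helper name, not part of the original notebook.

```python
import xml.etree.ElementTree as ET

def load_root(path):
    """Hypothetical helper: return the root element, or None if loading fails."""
    try:
        return ET.parse(path).getroot()
    except (OSError, ET.ParseError) as exc:
        # OSError covers missing/unreadable files; ParseError covers malformed XML
        print(f"Could not load {path}: {exc}")
        return None
```

Callers can then check for `None` before moving on to metadata extraction.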

### Step 2: Extract Metadata
We extract article metadata such as **title, author, publication date, and keywords**.

In [3]:
# Extract article ID
article_id = root.find('.//article-id').text

# Extract journal title
journal_title = root.find('.//journal-title').text

# Extract article title
article_title = root.find('.//article-title').text

# Extract author name
author = root.find('.//contrib/name/surname').text

# Extract publication date
pub_day = root.find('.//pub-date/day').text
pub_month = root.find('.//pub-date/month').text
pub_year = root.find('.//pub-date/year').text
pub_date = f"{pub_day}-{pub_month}-{pub_year}"

# Extract keywords
keywords = root.find('.//kwd').text

# Store metadata in dictionary
metadata = {
    'article_id': article_id,
    'journal_title': journal_title,
    'article_title': article_title,
    'author': author,
    'publication_date': pub_date,
    'keywords': keywords
}

metadata

{'article_id': '0901c79180555528',
 'journal_title': 'Orphan Drug Approvals',
 'article_title': 'FDA Grants Orphan Drug Status to Gevokizumab',
 'author': 'Troy Brown',
 'publication_date': '29-08-2012',
 'keywords': 'choroiditis,cyclitis,intermediate uveitis,orphan drugs,pars planitis,posterior uveitis'}
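
Two details of the cell above are worth guarding against in other files: `root.find(...)` returns `None` when a tag is absent, so `.text` would raise `AttributeError`, and `find` returns only the first match, so an article with several `<kwd>` elements would yield just one keyword. The sketch below shows one way to handle both. The helper `find_text` and the tiny sample document are assumptions for illustration, not taken from the real article.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document; the element layout is an assumption for illustration only.
sample = """<article>
  <front>
    <article-title>Example title</article-title>
    <kwd-group><kwd>uveitis</kwd><kwd>orphan drugs</kwd></kwd-group>
  </front>
</article>"""
sample_root = ET.fromstring(sample)

def find_text(node, path, default=None):
    """Hypothetical helper: text of the first match, or `default` if absent or empty."""
    found = node.find(path)
    if found is None or found.text is None:
        return default
    return found.text.strip()

title = find_text(sample_root, './/article-title')
journal = find_text(sample_root, './/journal-title', default='unknown')  # tag is absent here

# findall returns every match, so a multi-element keyword list is not truncated
keywords = [k.text.strip() for k in sample_root.findall('.//kwd') if k.text]

print(title, journal, keywords)
```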

### Step 3: Extract Body Text
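We collect the text of every `<p>` tag with BeautifulSoup and join the paragraphs into a single string. Note that `find_all('p')` searches the whole document, not only `<body>`, so paragraphs outside the body (such as the author disclosure) are included in the result.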

In [4]:
# Convert XML to string for BeautifulSoup parsing
xml_string = ET.tostring(root, encoding='utf8').decode('utf8')
soup = BeautifulSoup(xml_string, "html.parser")

# Extract the text of every <p> tag in the document (not only those under <body>)
paragraphs = [p.get_text() for p in soup.find_all('p')]
raw_text = ' '.join(paragraphs)

# Save the extracted raw text
with open('../results/cleaned_text.txt', 'w', encoding='utf-8') as f:
    f.write(raw_text)

raw_text[:500]  # Display sample


'WebMD, LLC index Troy Brown is a freelance writer for Medscape. Troy Brown has disclosed no relevant financial relationships.  August 29, 2012 — The US Food and Drug Administration (FDA) has granted orphan drug status to gevokizumab (Xoma 052, Xoma Corp), a monoclonal antibody that binds strongly to interleukin 1β (IL-1β), for the treatment of noninfectious intermediate uveitis, posterior uveitis, or panuveitis, or chronic noninfectious anterior uveitis. The Orphan Drug Act of 1983 was passed to'
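
The sample output begins with publisher and disclosure text, which suggests that paragraphs outside `<body>` were picked up. One possible alternative is to stay within ElementTree and walk only the `<p>` elements under `<body>`. This is a sketch under assumptions: the sample XML is made up, `body_paragraphs` is a hypothetical helper, and nested `<p>` elements (not handled here) would be counted twice.

```python
import xml.etree.ElementTree as ET

# Made-up document with one paragraph outside <body> and two inside it.
sample = """<article>
  <front><p>Publisher note</p></front>
  <body>
    <p>First <b>body</b> paragraph.</p>
    <sec><p>Second   paragraph.</p></sec>
  </body>
</article>"""
sample_root = ET.fromstring(sample)

def body_paragraphs(root):
    """Hypothetical helper: whitespace-normalized text of each <p> under <body>."""
    body = root.find('.//body')
    if body is None:
        return []
    # itertext() yields the text of an element and all its descendants in document order
    return [' '.join(''.join(p.itertext()).split()) for p in body.iter('p')]

body_text = ' '.join(body_paragraphs(sample_root))
print(body_text)
```

Because the XML is never handed to an HTML parser, this route also avoids any parser-mismatch warnings.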

### Step 4: Text Preprocessing (NLP)
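We lowercase the text, strip every character that is not an ASCII letter, digit, or whitespace, tokenize, remove English stopwords, and lemmatize the remaining tokens.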

In [5]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase
    text = text.lower()
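    # Remove everything except ASCII letters, digits, and whitespace
    # (this drops punctuation and also non-ASCII characters such as β)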
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return tokens

tokens = preprocess_text(raw_text)
tokens[:50]  # Show first 50 tokens

['webmd',
 'llc',
 'index',
 'troy',
 'brown',
 'freelance',
 'writer',
 'medscape',
 'troy',
 'brown',
 'disclosed',
 'relevant',
 'financial',
 'relationship',
 'august',
 '29',
 '2012',
 'u',
 'food',
 'drug',
 'administration',
 'fda',
 'granted',
 'orphan',
 'drug',
 'status',
 'gevokizumab',
 'xoma',
 '052',
 'xoma',
 'corp',
 'monoclonal',
 'antibody',
 'bind',
 'strongly',
 'interleukin',
 '1',
 'il1',
 'treatment',
 'noninfectious',
 'intermediate',
 'uveitis',
 'posterior',
 'uveitis',
 'panuveitis',
 'chronic',
 'noninfectious',
 'anterior',
 'uveitis',
 'orphan']
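
The token list contains `'1'` and `'il1'` where the source text had "interleukin 1β (IL-1β)": the ASCII-only pattern removes the Greek letter along with the punctuation. The sketch below compares that pattern with a Unicode-aware one, `[^\w\s]`, as a possible alternative if such symbols should survive. Note that `\w` also keeps underscores, which may or may not be desirable.

```python
import re

text = "interleukin 1β (IL-1β)".lower()

# Pattern equivalent to the one used in preprocess_text (input is already lowercase)
ascii_only = re.sub(r'[^a-z0-9\s]', '', text)

# Unicode-aware alternative: \w matches letters and digits in any script
unicode_aware = re.sub(r'[^\w\s]', '', text)

print(ascii_only.split())     # ['interleukin', '1', 'il1']
print(unicode_aware.split())  # ['interleukin', '1β', 'il1β']
```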

### Step 5: Final Structured Output
We combine metadata and processed body into a **DataFrame** and save as CSV.

In [6]:
df = pd.DataFrame([{**metadata, 'cleaned_body_text': ' '.join(tokens)}])
df.to_csv('../results/article_data.csv', index=False)
df.head()

|   | article_id | journal_title | article_title | author | publication_date | keywords | cleaned_body_text |
|---|---|---|---|---|---|---|---|
| 0 | 0901c79180555528 | Orphan Drug Approvals | FDA Grants Orphan Drug Status to Gevokizumab | Troy Brown | 29-08-2012 | choroiditis,cyclitis,intermediate uveitis,orph... | webmd llc index troy brown freelance writer me... |


### ✅ Conclusion
We successfully parsed an XML medical article, extracted metadata (title, author, keywords, etc.), cleaned and preprocessed the body text for NLP tasks, and stored the results in **CSV and TXT formats**.

This workflow can be extended to parse **multiple XML articles** and perform advanced NLP tasks such as **sentiment analysis, keyword extraction, or topic modeling**.
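
As a rough sketch of the multi-article extension, the loop below reads every `.xml` file in a directory and collects one metadata row per article. The function names `text_or_none` and `collect_metadata` are hypothetical, the tag paths are the ones used earlier in this notebook, and the demo files are made up.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def text_or_none(root, path):
    """Hypothetical helper: text of the first match, or None if the tag is absent."""
    node = root.find(path)
    return node.text if node is not None else None

def collect_metadata(xml_dir):
    """Hypothetical helper: one dict per parseable XML file in xml_dir."""
    rows = []
    for path in sorted(Path(xml_dir).glob('*.xml')):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            continue  # skip malformed files instead of aborting the whole batch
        rows.append({
            'file': path.name,
            'article_id': text_or_none(root, './/article-id'),
            'article_title': text_or_none(root, './/article-title'),
        })
    return rows

# Demo with two made-up files, one of them deliberately malformed
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.xml').write_text(
        '<article><article-id>1</article-id><article-title>A</article-title></article>',
        encoding='utf-8')
    Path(tmp, 'b.xml').write_text('<article><article-id>2</article-id>', encoding='utf-8')
    rows = collect_metadata(tmp)

print(rows)
```

Passing the resulting list to `pd.DataFrame(rows)` gives one row per article, in the same shape as the single-article DataFrame built in Step 5.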