# Q2.a – RDF Ontology and Data Extraction
This Jupyter Notebook imports and analyzes the RDF/XML ontology provided in the CA assignment for the CityPulse AI system, in which news articles are represented together with their attributes: headline, description, category, and location. The data extracted from that file is used later as input to a Naive Bayes classifier.

In [1]:
from rdflib import Graph, Namespace
import pandas as pd

def load_data_from_rdf_xml(filename):
    # Reads the RDF data from the XML file using rdflib
    g = Graph()
    g.parse(filename, format='xml')

    # Define the namespace used by the news ontology
    NEWS = Namespace("http://www.example.org/news#")

    # Lists to store the data extracted
    article_ids = []
    category = []
    title = []
    description = []
    place = []

    # Extract the article attributes from the graph using a SPARQL query
    qres = g.query(
        """
        SELECT ?article ?category ?title ?description ?place
        WHERE {
            ?article news:category ?category .
            ?article news:headline ?title .
            ?article news:short_description ?description .
            ?article news:place ?place .
        }
        """
        , initNs = {'news' : NEWS}
    )

    # Iterate over the query results and append each field to its list
    for row in qres:
        # Keep only the numeric part of the article ID
        article_id = str(row.article).split("/")[-1]
        numeric_id = article_id.replace("Article", "")
        article_ids.append(numeric_id)
        # Stores the rest of the information
        category.append(str(row.category))
        title.append(str(row.title))
        description.append(str(row.description))
        place.append(str(row.place))

    # Generates a DataFrame
    df = pd.DataFrame({
        "Article ID" : article_ids,
        "Category" : category,
        "Headline" : title,
        "Description" : description,
        "Place" : place
    })
    return df

final_data_frame = load_data_from_rdf_xml("../data/News_Categorizer_RDF.xml")
final_data_frame.to_csv("../data/news_extracted.csv", index=False)
final_data_frame


| | Article ID | Category | Headline | Description | Place |
|---|---|---|---|---|---|
| 0 | 1 | WELLNESS | 143 Miles in 35 Days: Lessons Learned | Resting is part of training. I've confirmed wh... | Torrance |
| 1 | 2 | WELLNESS | Talking to Yourself: Crazy or Crazy Helpful? | Think of talking to yourself as a tool to coac... | Norwalk |
| 2 | 3 | WELLNESS | Crenezumab: Trial Will Gauge Whether Alzheimer... | The clock is ticking for the United States to ... | Norwalk |
| 3 | 4 | WELLNESS | Oh, What a Difference She Made | If you want to be busy, keep trying to be perf... | Norwalk |
| 4 | 5 | WELLNESS | Green Superfoods | First, the bad news: Soda bread, corned beef a... | Norwalk |
| ... | ... | ... | ... | ... | ... |
| 9994 | 9995 | SPORTS | ESPN's Linda Cohn Predicts The Super Bowl Will... | The "SportsCenter" anchor added that she expec... | Pasadena |
| 9995 | 9996 | SPORTS | Indians Fireworks Guy Accidentally Lets 'Em Fl... | He was all types of sad after making this mist... | Santa Monica |
| 9996 | 9997 | SPORTS | Meet The First UFC Fighter To Wear A Turban To... | Arjan Singh Bhullar is the UFC's first Sikh an... | Pasadena |
| 9997 | 9998 | SPORTS | Larry Nassar Was Allowed To See Patients Durin... | Michigan State University didn't suspend the t... | Norwalk |
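
The loader derives each numeric ID by splitting the article URI on `/` and stripping the text `Article`, which works when the URI ends in a path segment such as `.../Article12`. If the identifiers were instead written as fragments on the `news#` namespace, the same split would leave the `news#` prefix in the ID. A slightly more tolerant variant is sketched below; the helper name `extract_numeric_id` and the example URIs are hypothetical and not part of the assignment code.

```python
import re


def extract_numeric_id(article_uri: str) -> str:
    """Return the trailing digits of an article URI, or "" if there are none."""
    # Take the text after the last "/" or "#", whichever comes last.
    local_name = re.split(r"[/#]", article_uri)[-1]
    # Keep only the run of digits at the end, e.g. "Article12" -> "12".
    match = re.search(r"(\d+)$", local_name)
    return match.group(1) if match else ""


# Hypothetical URIs; the real identifiers in the assignment file may differ.
path_style_id = extract_numeric_id("http://www.example.org/news/Article12")
fragment_style_id = extract_numeric_id("http://www.example.org/news#Article12")
print(path_style_id, fragment_style_id)
```

Like the original loader, the helper returns the ID as a string, so the `Article ID` column keeps the same type.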
