Skip to content

mohan-kartik/PubMed-Analysis-in-SQL-and-R

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Objectives:

Extract PubMed data from an XML document to store relationally in a SQLite database (transactional database) and create an "analytical" database using a star schema in MySQL for analysis.

  • Extract XML document,
  • realize the relational schema in MySQL,
  • load data into the database,
  • create star/snowflake schema,
  • execute analytical SQL queries,
  • performed some visualization of the data

Tasks

Part 1: Load XML Data into Database (LoadXML2DB.Rmd)

  • Build an external DTD for the XML file
  • Create the database schema in SQLite that contains the following entities/tables: Articles, Journals, Authors
  • Extract and transform the data from the XML file and load into the appropriate tables in the database

Part 2: Create Star/Snowflake Schema (LoadDataWarehouse.Rmd)

  • Create and populate a star schema for journal facts.

  • Perform the analytical queries below:

    i) What the are number of articles published in every journal in 2012 and 2013?

    ii) What is the number of articles published in every journal in each quarter of 2012 through 2015?

    iii) How many articles were published each quarter (across all years)?

    iv) How many unique authors published articles in each year for which there is data?

Part 3: Explore and Mine Data (AnalyzeData.Rmd)

  • Top five journals with the most articles published in them for the time period.
  • Number of articles per journal per year broken down by quarter.

About

ETL of PubMed data in MySQL & R for analysis.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages