1.2 Open Data: Definitions, Principles, Formats, and Typologies
**Note:** This guide has been adapted from Atenas, J., Ciociola, C., & Rodés, V. ILDA Guía Breve Datos Abiertos. Universidad de la República (UDELAR).
Open data sits at the intersection of technology, governance, and knowledge production, playing a central role in how information is shared, accessed, and reused in contemporary societies. As data becomes increasingly embedded in everyday life, understanding what constitutes open data, and how it is structured, managed, and used, has become an essential component of data literacy for academics, researchers, and students alike.
At its core, open data refers to data that can be freely accessed, used, modified, and shared by anyone for any purpose, subject at most to requirements such as attribution. However, openness is not simply a technical condition; it is also a political and ethical commitment to transparency, accountability, and participation. Open data initiatives, particularly in government and research contexts, aim to make information more accessible, enabling citizens, scholars, and organisations to engage with evidence, scrutinise decision-making, and generate new forms of knowledge and innovation.
To fully engage with open data, it is necessary to move beyond definition and explore three interconnected dimensions. First, the principles of open data establish the conditions under which data can be considered genuinely open, including accessibility, interoperability, and reuse. Second, the formats of open data determine how data is structured and shared, shaping its usability across different systems and contexts. Third, the typologies of data highlight the diversity of data forms (qualitative and quantitative, structured and unstructured, primary and secondary), each requiring different approaches to management and interpretation.
Understanding these elements is crucial for academic practice. It enables educators to design meaningful learning experiences around data, supports researchers in publishing and reusing datasets responsibly, and equips learners with the skills to critically navigate data-rich environments.
Ultimately, engaging with open data is not only about improving technical competence, but about fostering a more transparent, inclusive, and collaborative knowledge ecosystem.
Open Data must, in addition, be:
- Complete: that is, it must include all elements required to enable exportation, online and offline use, integration and aggregation with other resources, and dissemination across the web.
- Timely: data should be published promptly and remain quickly accessible, so that users can work with them while they are still current.
- Accessible: data should be made available to the widest possible range of users without barriers to use, preferably without reliance on proprietary platforms. Moreover, data should be accessible without subscription, contractual agreement, payment, registration, or formal request.
- Machine-readable: data must be in a format that computers can process automatically, without manual transcription.
- In non-proprietary formats: data should be encoded in open and public formats over which no single entity (company or organisation) retains exclusive control.
- Free from restrictive licences: Open Data is characterised by licences that do not restrict use, dissemination, or redistribution.
- Reusable: users must be able to reuse and integrate data in order to create new resources, such as applications and public-interest or commons-based services.
- Searchable / discoverable: data must be easily identifiable on the web through catalogues that are readily indexed by search engines.
```mermaid
flowchart TD
%% Definition layer
A[Data] --> B[Information]
B --> C["Open Data<br/>Freely used, reused, shared"]
%% Core characteristics
C --> D[Accessible]
C --> E[Reusable]
C --> F[Redistributable]
%% Principles layer
C --> G[Open Data Principles]
G --> G1[Open by Default]
G --> G2[Accessible & Usable]
G --> G3[Comparable & Interoperable]
G --> G4[Timely & Comprehensive]
G --> G5[For Governance & Transparency]
G --> G6[For Innovation & Inclusion]
%% Formats layer
C --> H[Formats]
H --> H1[Structured Data]
H --> H2[Machine-Readable]
H --> H3[Non-Proprietary Formats]
H --> H4[Metadata & Documentation]
%% Typologies layer
C --> I[Typologies of Data]
I --> I1[Qualitative Data]
I --> I2[Quantitative Data]
I --> I3[Structured vs Unstructured]
I --> I4[Primary vs Secondary Data]
%% Outcomes
G --> J[Transparency]
H --> K[Interoperability]
I --> L[Knowledge Creation]
J --> M[Accountability]
K --> N[Data Reuse]
L --> N
%% Styling (pastel tones + black text)
classDef definition fill:#e6f7ff,stroke:#444,color:#000;
classDef principles fill:#e6ffe6,stroke:#444,color:#000;
classDef formats fill:#fff5e6,stroke:#444,color:#000;
classDef types fill:#f3e6ff,stroke:#444,color:#000;
classDef outcomes fill:#ffe6f0,stroke:#444,color:#000;
class A,B,C,D,E,F definition;
class G,G1,G2,G3,G4,G5,G6 principles;
class H,H1,H2,H3,H4 formats;
class I,I1,I2,I3,I4 types;
class J,K,L,M,N outcomes;
```
With regard to Open Government Data, it is expected that information will be delivered in open, machine-readable file formats. However, where the choice is between publishing data that do not fully comply with the above requirements or not publishing them at all, the logic of Open Data tends towards the former option.
In such cases, the principle of “Raw Data Now” applies: the provision of raw data in its most immediate form. Even if data are not fully open, it is preferable that they are nevertheless published. The assumption is that if the datasets are sufficiently valuable, the user community will subsequently transform them into Open Data through processes such as data scraping (i.e., the automated extraction of data from sources).
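To make the idea of data scraping concrete, the following is a minimal sketch of extracting a table from an HTML page and re-publishing it as CSV. It uses only the Python standard library. The page fragment, the column names, and the helper names `TableScraper` and `html_table_to_csv` are invented for illustration; real pages are usually messier and may need a more robust parser.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html: str) -> str:
    """Return the table cells found in `html` as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(scraper.rows)
    return out.getvalue()


# Hypothetical page fragment, invented for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>district</th><th>population</th></tr>
  <tr><td>North</td><td>12000</td></tr>
  <tr><td>South</td><td>8500</td></tr>
</table>
"""

CSV_TEXT = html_table_to_csv(SAMPLE_HTML)
print(CSV_TEXT)
```

The result is a structured, non-proprietary file that others can reuse, although scraping remains a workaround: it breaks whenever the page layout changes, which is one reason publishing in an open format at source is preferable.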
Data can be published in a wide variety of formats; however, not all of them satisfy the requirements necessary to be considered “open”.
A data format refers to the digital structure used to store information. Formats may be either open or closed. An open format is one in which technical specifications are publicly and freely available, allowing anyone to use them in their own software without restrictions imposed by intellectual property rights.
By contrast, a closed (proprietary) format may either conceal technical specifications or restrict their use, even when documentation is partially available.
The fundamental rationale for emphasising openness can be summarised in one concept: interoperability. Interoperability refers to the capacity of different systems and organisations to work together. In this context, it enables the combination of datasets across different sources.
Interoperability represents the primary practical advantage of openness, as it significantly increases the potential to integrate heterogeneous datasets and thereby develop new and improved services and applications.
Open formats also reduce barriers to reuse, allowing developers to build software and services without dependence on proprietary systems. Conversely, proprietary formats may generate dependency on specific software providers or licensing regimes. In the worst cases, information may only be accessible through specific, potentially costly or obsolete software systems.
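As a rough illustration of interoperability in practice, the sketch below combines two small datasets published in different open formats (CSV and JSON) by different hypothetical sources, and derives a new indicator from the combination. The data, the shared `district_code` key, and the `combine` function are assumptions made for this example only.

```python
import csv
import io
import json

# Hypothetical sources, invented for illustration: a CSV file from one
# publisher and a JSON document from another, sharing "district_code".
POPULATION_CSV = "district_code,population\nD01,12000\nD02,8500\n"
SCHOOLS_JSON = (
    '[{"district_code": "D01", "schools": 4},'
    ' {"district_code": "D02", "schools": 2}]'
)


def combine(population_csv: str, schools_json: str) -> list[dict]:
    """Join the two datasets on district_code and derive a new indicator."""
    schools = {
        row["district_code"]: row["schools"] for row in json.loads(schools_json)
    }
    combined = []
    for row in csv.DictReader(io.StringIO(population_csv)):
        code = row["district_code"]
        if code not in schools:
            continue  # keep only districts present in both sources
        population = int(row["population"])  # CSV values arrive as strings
        combined.append({
            "district_code": code,
            "population": population,
            "schools": schools[code],
            "residents_per_school": population / schools[code],
        })
    return combined


COMBINED = combine(POPULATION_CSV, SCHOOLS_JSON)
for record in COMBINED:
    print(record)
```

The join is only possible because both sources are machine-readable and use the same identifier for districts. Shared identifiers and documented fields matter as much as the file format itself.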
The following provides an overview of common formats, coding systems, and data containers used for the creation, storage, and dissemination of data.
- CSV (Comma Separated Values): CSV is a text-based file format for datasets, which facilitates import and export to spreadsheets and databases. Values are separated by commas. CSV files are highly useful due to their compactness and efficiency in transferring large datasets with consistent structure. However, their simplicity can also be a limitation: without proper documentation, CSV files may be difficult to interpret, as the meaning of columns is not self-evident. Therefore, adequate metadata is essential.
- Spreadsheets: Many organisations store data in spreadsheet formats (e.g., Microsoft Excel). Such data can be used effectively when accompanied by clear descriptions of column content. However, spreadsheets may include formulas or functions that complicate data extraction and reuse.
- Databases: Databases allow direct access to structured data and enable users to retrieve only relevant subsets of information. However, remote access may raise security concerns. Furthermore, databases are only useful when their structure (tables, fields, and relationships) is clearly documented.
- RDF (Resource Description Framework): RDF is a framework for representing information on the web. It links data with web resources, enabling computers to interpret context and meaning, thereby supporting interoperability across applications.
- HTML (HyperText Markup Language): HTML is a markup language used to structure and present web documents, which are typically delivered over HTTP. It enables hyperlinks between documents, thereby supporting the organisation of hypertext systems. A large proportion of web-based data is currently published in HTML format.
- XML (eXtensible Markup Language): XML is a flexible format widely used for data exchange. It preserves structural information and allows documentation to be embedded within data without compromising readability.
- JSON (JavaScript Object Notation): JSON is a lightweight data-interchange format that is easy to read and process across programming languages. Its simplicity makes it highly efficient for computational use.
- Text Documents (Word, PDF): Traditional document formats such as Word or PDF may be sufficient for certain types of data presentation. They are widely used and easy to share. However, they often fail to preserve structured data in a consistent form, making automated extraction difficult or impossible.
- Plain Text: Plain text files (.txt) are highly machine-readable but lack structural metadata. As a result, developers must create parsers to interpret each file individually.
- Scanned Images: Scanned documents (e.g., TIFF, JPEG) are among the least suitable formats for data reuse. Although they may include visual contextual information, they are not inherently structured and are not machine-readable in a meaningful way.
- Proprietary Formats: Some systems use proprietary formats for storing and exporting data. While sharing in such formats may sometimes be acceptable, particularly within the same ecosystem, it is always advisable to provide documentation and alternative open formats where possible.
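To compare three of the open formats described above, the following sketch serialises the same two records as CSV, JSON, and XML using only the Python standard library. The records, the field names (`station`, `pm10`), and the element names in the XML output are invented for illustration and do not follow any particular published schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records, invented for illustration.
RECORDS = [
    {"station": "A1", "pm10": 21.5},
    {"station": "B2", "pm10": 34.0},
]


def to_csv(records):
    """Tabular text: compact, but column meaning and types are not recorded."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["station", "pm10"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def to_json(records):
    """Keeps the distinction between text and numbers."""
    return json.dumps(records)


def to_xml(records):
    """Carries structure through named elements and attributes."""
    root = ET.Element("measurements")
    for record in records:
        item = ET.SubElement(root, "measurement", station=record["station"])
        ET.SubElement(item, "pm10").text = str(record["pm10"])
    return ET.tostring(root, encoding="unicode")


print(to_csv(RECORDS))
print(to_json(RECORDS))
print(to_xml(RECORDS))

# Reading the CSV back shows why documentation matters: every value
# returns as a string, so the reader must know that pm10 is numeric.
CSV_ROWS = list(csv.DictReader(io.StringIO(to_csv(RECORDS))))
print(type(CSV_ROWS[0]["pm10"]).__name__)
```

All three outputs are open and machine-readable. They differ in how much of the structure and typing travels with the data, which is where accompanying metadata fills the gap.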
To classify datasets according to openness, Tim Berners-Lee proposes a five-level rating system, commonly known as the five-star scheme. Each level builds on the one below it:
- 1 star: Data are available online in any format (e.g., PDF, JPEG, Word). They are accessible and printable but not reusable or structured.
- 2 stars: Data are structured but use proprietary formats (e.g., Microsoft Excel). They can be processed but require proprietary software.
- 3 stars: Data are structured and use non-proprietary formats (e.g., CSV). This is the baseline level of Open Data.
- 4 stars: Data are structured, non-proprietary, and published using web standards (e.g., via URIs enabling direct access and linking to data items).
- 5 stars: Linked Open Data (LOD). Datasets are not only open and structured but also interlinked with other datasets, enabling cross-dataset integration across different sources and institutions.
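The cumulative logic of the rating can be sketched in a few lines of Python. This is an illustrative simplification, not an official assessment tool: the `Dataset` fields and the `star_rating` function are assumptions made for this example, and each field stands for one of the five levels described above.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset (fields are assumptions)."""
    online: bool                # level 1: available on the web in any format
    structured: bool            # level 2: machine-processable structure
    open_format: bool           # level 3: non-proprietary format
    uses_uris: bool             # level 4: web standards identify data items
    linked_to_other_data: bool  # level 5: interlinked with other datasets


def star_rating(ds: Dataset) -> int:
    """Return 0 to 5; a level counts only if every level below it is met."""
    criteria = [
        ds.online,
        ds.structured,
        ds.open_format,
        ds.uses_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


SCANNED_PDF = Dataset(True, False, False, False, False)
EXCEL_FILE = Dataset(True, True, False, False, False)
CSV_FILE = Dataset(True, True, True, False, False)

for name, ds in [("scanned PDF", SCANNED_PDF), ("Excel file", EXCEL_FILE), ("CSV file", CSV_FILE)]:
    print(name, star_rating(ds))
```

Stopping at the first unmet criterion reflects the cumulative nature of the scheme: a dataset that links to other data but is only available in a proprietary format would still not rate above the second level.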
Open Data commonly refers to information structured in databases across diverse domains, including cartography, genetics, chemistry, scientific modelling, medicine, biosciences, census and civil registry data, government records, statistics, and economics.
Typical thematic categories include:
- Geospatial data: mapping data such as street locations, buildings, topography, administrative boundaries, and georeferenced points of interest.
- Culture: data relating to cultural artefacts and outputs (e.g., titles, authors), typically held by libraries, archives, museums, and galleries.
- Science: research data across disciplines from astronomy to zoology.
- Economy and finance: public accounts (income and expenditure) and financial market data (stocks, bonds, securities).
- Statistics: datasets produced by statistical offices, including social, economic, and demographic indicators.
- Meteorology: data used for weather and climate analysis and forecasting.
- Environment and health: environmental monitoring (pollution levels, water quality, waste) and health-related indicators (mortality rates, disease incidence).
- Transport: timetables, routes, journey times, and transport system performance data.