# Frictionless framework demo

*Jacqueline R. M. A. Maasch* | *January 2021*

This notebook provides a brief walk-through of `frictionless` functionality in both command-line and Python syntax. The data validated in this demo was scraped from the [MEROPS Peptidase Database](https://www.ebi.ac.uk/merops/index.shtml) in June 2020.

>[Frictionless](https://frictionlessdata.io/tooling/python/#purpose) is a framework to describe, extract, validate, and transform tabular data. It supports a great deal of data sources and formats, as well as provides popular platforms integrations. The framework is powered by the lightweight yet comprehensive Frictionless Data Specifications.

>**Describe your data:** You can infer, edit and save metadata of your data tables. It’s a first step for ensuring data quality and usability. Frictionless metadata includes general information about your data like textual description, as well as, field types and other tabular data details.

> **Extract your data:** You can read your data using a unified tabular interface. Data quality and consistency are guaranteed by a schema. Frictionless supports various file protocols like HTTP, FTP, and S3 and data formats like CSV, XLS, JSON, SQL, and others.

>**Validate your data:** You can validate data tables, resources, and datasets. Frictionless generates a unified validation report, as well as supports a lot of options to customize the validation process.

>**Transform your data:** You can clean, reshape, and transfer your data tables and datasets. Frictionless provides a pipeline capability and a lower-level interface to work with the data.

In [1]:
# Imports.
import frictionless
import pandas as pd

## Command line syntax

In [2]:
# View the first 4 lines of the data file.
! head -n 4 merops_peptidase_families.csv

,Family,Subfamily,Type enzyme,Group
0,A1,A1A,pepsin A (Homo sapiens),Aspartic (A) Peptidase
1,A1,A1B,nepenthesin (Nepenthes gracilis),Aspartic (A) Peptidase
2,A2,A2A,HIV-1 retropepsin (human immunodeficiency virus 1),Aspartic (A) Peptidase


In [3]:
# Describe data file's inferred schema.
! frictionless describe merops_peptidase_families.csv

---
metadata: merops_peptidase_families.csv
---

compression: 'no'
compressionPath: ''
control:
  newline: ''
dialect: {}
encoding: utf-8
format: csv
hashing: md5
name: merops_peptidase_families
path: merops_peptidase_families.csv
profile: tabular-data-resource
query: {}
schema:
  fields:
    - name: field1
      type: integer
    - name: Family
      type: string
    - name: Subfamily
      type: any
    - name: Type enzyme
      type: string
    - name: Group
      type: string
scheme: file
stats:
  bytes: 27216
  fields: 5
  hash: 55a6de92a8855391526150876ca9f33c
  rows: 345


In [4]:
# Extract normalized data that conforms to the inferred schema
# (e.g., with invalid cells removed).
! frictionless extract merops_peptidase_families.csv | sed '1,10!d'

---
data: merops_peptidase_families.csv
---

field1  Family  Subfamily  Type enzyme                                                                                                                      Group                              
     0  A1      A1A        pepsin A (Homo sapiens)                                                                                                          Aspartic (A) Peptidase             
     1  A1      A1B        nepenthesin (Nepenthes gracilis)                                                                                                 Aspartic (A) Peptidase             
     2  A2      A2A        HIV-1 retropepsin (human immunodeficiency virus 1)                                                                               Aspartic (A) Peptidase             


In [5]:
# Validate data file.
! frictionless validate merops_peptidase_families.csv

---
invalid: merops_peptidase_families.csv
---

row   field  code         message
None      1  blank-label  Label in the header in field at position "1" is blank
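
The `blank-label` error points at the first header cell, which is empty in the raw file (the header line begins with a comma). As a rough illustration of what such a check amounts to, and not a reproduction of how `frictionless` implements it, the standard-library sketch below reports the 1-based positions of blank header labels. The `SAMPLE` text is copied from the first lines of the file shown earlier; `blank_label_positions` is a hypothetical helper name.

```python
import csv
import io

# First lines of the data file, copied from the output of cell [2].
SAMPLE = """\
,Family,Subfamily,Type enzyme,Group
0,A1,A1A,pepsin A (Homo sapiens),Aspartic (A) Peptidase
1,A1,A1B,nepenthesin (Nepenthes gracilis),Aspartic (A) Peptidase
"""


def blank_label_positions(csv_text):
    """Return 1-based positions of header labels that are empty or whitespace."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [pos for pos, label in enumerate(header, start=1) if not label.strip()]


print(blank_label_positions(SAMPLE))  # → [1]
```

Position 1 matches the `field` column of the validation report above.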


## Python syntax

In [6]:
# View first 4 lines of data file.
df = pd.read_csv("merops_peptidase_families.csv")
display(df.head(4))

,Unnamed: 0,Family,Subfamily,Type enzyme,Group
0,0,A1,A1A,pepsin A (Homo sapiens),Aspartic (A) Peptidase
1,1,A1,A1B,nepenthesin (Nepenthes gracilis),Aspartic (A) Peptidase
2,2,A2,A2A,HIV-1 retropepsin (human immunodeficiency viru...,Aspartic (A) Peptidase
3,3,A2,A2B,Ty3 transposon peptidase (Saccharomyces cerevi...,Aspartic (A) Peptidase
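
The `Unnamed: 0` column in the pandas output is the name pandas assigns to a header cell that is blank in the file. One possible repair, sketched here with the standard library only, is to give every blank header label an explicit name before validating. The helper `name_blank_labels` and the `column` prefix are hypothetical choices for illustration, and whether the rewritten file then passes `frictionless validate` has not been checked here.

```python
import csv
import io

# First lines of the data file, copied from the raw file contents shown earlier.
SAMPLE = """\
,Family,Subfamily,Type enzyme,Group
0,A1,A1A,pepsin A (Homo sapiens),Aspartic (A) Peptidase
1,A1,A1B,nepenthesin (Nepenthes gracilis),Aspartic (A) Peptidase
"""


def name_blank_labels(csv_text, prefix="column"):
    """Return csv_text with each blank header label replaced by prefix + position."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Only the header row is modified; data rows are written back unchanged.
    rows[0] = [
        label if label.strip() else f"{prefix}{pos}"
        for pos, label in enumerate(rows[0], start=1)
    ]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


fixed = name_blank_labels(SAMPLE)
print(fixed.splitlines()[0])  # → column1,Family,Subfamily,Type enzyme,Group
```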


In [7]:
# Describe data file's inferred schema.
frictionless.describe("merops_peptidase_families.csv")

{'name': 'merops_peptidase_families',
 'profile': 'tabular-data-resource',
 'path': 'merops_peptidase_families.csv',
 'scheme': 'file',
 'format': 'csv',
 'hashing': 'md5',
 'encoding': 'utf-8',
 'compression': 'no',
 'compressionPath': '',
 'control': {'newline': ''},
 'dialect': {},
 'query': {},
 'schema': {'fields': [{'name': 'field1', 'type': 'integer'},
   {'name': 'Family', 'type': 'string'},
   {'name': 'Subfamily', 'type': 'any'},
   {'name': 'Type enzyme', 'type': 'string'},
   {'name': 'Group', 'type': 'string'}]},
 'stats': {'hash': '55a6de92a8855391526150876ca9f33c',
  'bytes': 27216,
  'fields': 5,
  'rows': 345}}
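
The descriptor printed above is dictionary-like, so its schema can be inspected with ordinary Python. The sketch below groups field names by inferred type, which makes it easy to spot fields typed `any` (here `Subfamily`) that may merit a closer look. To stay self-contained it works on a plain `dict` copied from the schema portion of the output rather than on the object returned by `frictionless.describe`; `fields_by_type` is a hypothetical helper name.

```python
# Schema portion of the descriptor, copied from the output above.
descriptor = {
    "schema": {
        "fields": [
            {"name": "field1", "type": "integer"},
            {"name": "Family", "type": "string"},
            {"name": "Subfamily", "type": "any"},
            {"name": "Type enzyme", "type": "string"},
            {"name": "Group", "type": "string"},
        ]
    }
}


def fields_by_type(descriptor):
    """Group field names by their inferred type, preserving field order."""
    grouped = {}
    for field in descriptor["schema"]["fields"]:
        grouped.setdefault(field["type"], []).append(field["name"])
    return grouped


print(fields_by_type(descriptor)["any"])  # → ['Subfamily']
```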

In [8]:
# Extract normalized data that conforms to the inferred schema
# (e.g., with invalid cells removed).
frictionless.extract("merops_peptidase_families.csv")[:4]

[{'field1': 0, 'Family': 'A1', 'Subfamily': 'A1A', 'Type enzyme': 'pepsin A (Homo sapiens)', 'Group': 'Aspartic (A) Peptidase'},
 {'field1': 1, 'Family': 'A1', 'Subfamily': 'A1B', 'Type enzyme': 'nepenthesin (Nepenthes gracilis)', 'Group': 'Aspartic (A) Peptidase'},
 {'field1': 2, 'Family': 'A2', 'Subfamily': 'A2A', 'Type enzyme': 'HIV-1 retropepsin (human immunodeficiency virus 1)', 'Group': 'Aspartic (A) Peptidase'},
 {'field1': 3, 'Family': 'A2', 'Subfamily': 'A2B', 'Type enzyme': 'Ty3 transposon peptidase (Saccharomyces cerevisiae)', 'Group': 'Aspartic (A) Peptidase'}]
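
Because the extracted rows print as a list of dictionary-like records, they can be summarized with standard-library tools. As a small sketch, the code below counts rows per peptidase family. It assumes each row supports key lookup as the output above suggests, and it uses plain dictionaries copied from that output (trimmed to two columns) so that it runs without the data file; `count_by` is a hypothetical helper name.

```python
from collections import Counter

# The four rows shown above, reduced to the columns needed here.
rows = [
    {"field1": 0, "Family": "A1", "Subfamily": "A1A"},
    {"field1": 1, "Family": "A1", "Subfamily": "A1B"},
    {"field1": 2, "Family": "A2", "Subfamily": "A2A"},
    {"field1": 3, "Family": "A2", "Subfamily": "A2B"},
]


def count_by(rows, key):
    """Count how many rows share each value of the given key."""
    return Counter(row[key] for row in rows)


print(count_by(rows, "Family"))  # → Counter({'A1': 2, 'A2': 2})
```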

In [9]:
# Validate data file.
frictionless.validate("merops_peptidase_families.csv")

{'version': '3.48.0',
 'time': 0.038,
 'valid': False,
 'stats': {'errors': 1, 'tables': 1},
 'errors': [],
 'tables': [{'path': 'merops_peptidase_families.csv',
   'scheme': 'file',
   'format': 'csv',
   'hashing': 'md5',
   'encoding': 'utf-8',
   'compression': 'no',
   'compressionPath': '',
   'control': {'newline': ''},
   'dialect': {},
   'query': {},
   'schema': {'fields': [{'name': 'field1', 'type': 'integer'},
     {'name': 'Family', 'type': 'string'},
     {'name': 'Subfamily', 'type': 'any'},
     {'name': 'Type enzyme', 'type': 'string'},
     {'name': 'Group', 'type': 'string'}]},
   'header': ['', 'Family', 'Subfamily', 'Type enzyme', 'Group'],
   'time': 0.038,
   'valid': False,
   'scope': ['dialect-error',
    'schema-error',
    'field-error',
    'extra-label',
    'missing-label',
    'blank-label',
    'duplicate-label',
    'blank-header',
    'incorrect-label',
    'extra-cell',
    'missing-cell',
    'blank-row',
    'type-error',
    'constraint-error',
 