### File ingestion and schema validation

Take any CSV or text file of at least 2 GB. (You can do this assignment on Google Colab.)

Read the file and present your approach to reading it.

Try different methods of reading the file, e.g. Dask, Modin, Ray, and pandas, and present your findings in terms of computational efficiency.
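One way to compare readers is a small timing harness like the sketch below. The helper names (`time_reader`, `benchmark`, `available_readers`) are hypothetical, and the baseline reader uses only the standard-library `csv` module. pandas and Dask are registered only if they are installed. Modin's `modin.pandas.read_csv` could be added to the dictionary in the same way.

```python
import csv
import time


def time_reader(reader, path, repeats=3):
    """Return the best wall-clock time in seconds of reader(path) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    return best


def read_with_csv_module(path):
    """Baseline reader: count the rows with the standard-library csv module."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.reader(handle))


def available_readers():
    """Collect one reader per library that can be imported (assumed optional)."""
    readers = {"csv (stdlib)": read_with_csv_module}
    try:
        import pandas as pd
        readers["pandas"] = pd.read_csv
    except ImportError:
        pass
    try:
        import dask.dataframe as dd
        # Dask is lazy, so .compute() is needed to force the actual read.
        readers["dask"] = lambda path: dd.read_csv(path).compute()
    except ImportError:
        pass
    return readers


def benchmark(path, readers, repeats=3):
    """Map each reader name to its best read time for the file at `path`."""
    return {name: time_reader(fn, path, repeats) for name, fn in readers.items()}
```

Calling `benchmark("en-books-dataset.csv", available_readers())` would return a dictionary of timings. Taking the best of several runs reduces noise from disk caching, but the first run on a cold cache is also worth reporting separately.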

Perform basic validation on the data columns, e.g. remove special characters and whitespace from the column names.
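The cleaning step can be sketched on plain strings, independently of any DataFrame library. The function name `clean_column_name` and the sample column names are made up for illustration.

```python
import re


def clean_column_name(name):
    """Lowercase, replace non-word characters with '_', collapse and trim underscores."""
    name = name.strip().lower()
    name = re.sub(r"[^\w]", "_", name)   # special characters and spaces become '_'
    name = re.sub(r"_{2,}", "_", name)   # runs of '_' collapse to a single '_'
    return name.strip("_")               # no leading or trailing '_'


raw_columns = [" Title ", "URL (source)", "body-text", "abstract__short"]
cleaned = [clean_column_name(col) for col in raw_columns]
print(cleaned)  # ['title', 'url_source', 'body_text', 'abstract_short']
```

In Python 3, `\w` also matches non-ASCII letters, so accented characters in a column name are kept rather than replaced.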

Since you already know the schema, create a YAML file that lists the column names. Also define the separators for the read file and the written file in the YAML.

Validate the number of columns and the column names of the ingested file against the YAML.

Write the file as a pipe-separated (|) text file in gz format.
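A minimal sketch of this step using only the standard library is shown below. The helper names `write_pipe_gz` and `read_pipe_gz` are hypothetical. With pandas, `DataFrame.to_csv(path, sep="|", compression="gzip", index=False)` produces the same kind of file.

```python
import csv
import gzip


def write_pipe_gz(rows, header, out_path):
    """Write the header and rows to a gzip-compressed, pipe-delimited text file."""
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="|")
        writer.writerow(header)
        writer.writerows(rows)


def read_pipe_gz(path):
    """Read the file back as a list of rows (header first) to check the round trip."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="|"))
```

The `csv` writer quotes any field that itself contains a `|`, so text such as `Radiation Oncology | RTOG Trials` survives the round trip as a single field.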

Create a summary of the file:

- total number of rows
- total number of columns
- file size
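The summary could be computed along these lines. This is a sketch that assumes the first line of the file is a header; the function name `summarize_delimited_file` is hypothetical.

```python
import csv
import os


def summarize_delimited_file(path, delimiter=","):
    """Return the row count (excluding the header), column count and size in bytes."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])          # empty file -> zero columns
        row_count = sum(1 for _ in reader)  # streams the file, one row at a time
    return {
        "rows": row_count,
        "columns": len(header),
        "size_bytes": os.path.getsize(path),
    }
```

Because the rows are streamed rather than loaded, memory use stays small even for a multi-gigabyte file. If some fields are very long (for example full HTML bodies), the parser's per-field limit may need to be raised with `csv.field_size_limit`.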

# Data Ingestion sample code walkthrough

- Create a utility file
- Config file creation
- Data ingestion pipeline

In [None]:
%%writefile testutility.py
import logging
import re

import yaml


################
# File Reading #
################

def read_config_file(filepath):
    '''
    Load the YAML config file.
    If the YAML cannot be parsed, the error is logged and None is returned.
    '''
    with open(filepath, 'r') as stream:
        try:
            return yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            logging.error(exc)
            return None


def replacer(string, char):
    '''Collapse runs of two or more `char` into a single `char`.'''
    pattern = re.escape(char) + '{2,}'
    return re.sub(pattern, char, string)


def col_header_val(df, table_config):
    '''
    Standardize the column names of df in place (lowercase, non-word
    characters replaced by '_', repeated and leading/trailing '_' removed)
    and compare them with table_config['columns'].
    Returns 1 if the names match, 0 otherwise.
    '''
    df.columns = df.columns.str.lower()
    # raw string so that \w is not treated as a string escape
    df.columns = df.columns.str.replace(r'[^\w]', '_', regex=True)
    df.columns = [replacer(col.strip('_'), '_') for col in df.columns]
    # compare sorted lists so that column order does not matter
    expected_col = sorted(col.lower() for col in table_config['columns'])
    actual_col = sorted(df.columns)
    if actual_col == expected_col:
        print("column name and column length validation passed")
        return 1
    else:
        print("column name and column length validation failed")
        mismatched_columns_file = list(set(actual_col).difference(expected_col))
        print("Following File columns are not in the YAML file",mismatched_columns_file)
        missing_YAML_file = list(set(expected_col).difference(actual_col))
        print("Following YAML columns are not in the file uploaded",missing_YAML_file)
        logging.info(f'df columns: {actual_col}')
        logging.info(f'expected columns: {expected_col}')
        return 0

Writing testutility.py


# Write YAML File

In [None]:
%%writefile file.yaml
file_type: csv
dataset_name: en-books-dataset
file_name: test_data
table_name: edsurv
inbound_delimiter: ","
outbound_delimiter: "|"
skip_leading_rows: 1
columns: 
    - title
    - url
    - abstract

Writing file.yaml


In [None]:
# Read config file
import testutility as util
config_data = util.read_config_file("file.yaml")

In [None]:
config_data['inbound_delimiter']

','

In [None]:
# inspecting data of config file
config_data

{'file_type': 'csv',
 'dataset_name': 'en-books-dataset',
 'file_name': 'test_data',
 'table_name': 'edsurv',
 'inbound_delimiter': ',',
 'outbound_delimiter': '|',
 'skip_leading_rows': 1,
 'columns': ['title', 'url', 'abstract']}

In [None]:
# Normal reading process of the file
import pandas as pd
df = pd.read_csv("en-books-dataset.csv",delimiter=',')
df.head()

Unnamed: 0,title,url,abstract,body_text,body_html
0,Wikibooks: Radiation Oncology/NHL/CLL-SLL,https://en.wikibooks.org/wiki/Radiation_Oncolo...,Chronic Lymphocytic Leukemia and Small Lymphoc...,Front Page: Radiation Oncology | RTOG Trials |...,"<div class=""mw-parser-output""><table width=""10..."
1,Wikibooks: Romanian/Lesson 9,https://en.wikibooks.org/wiki/Romanian/Lesson_9,==Băuturi/Beverages==,Băuturi/Beverages[edit]\nTea : Ceai\nMilk : La...,"<div class=""mw-parser-output""><h2><span id=""B...."
2,Wikibooks: Karrigell,https://en.wikibooks.org/wiki/Karrigell,Karrigell is an open Source Python web framewo...,Karrigell is an open Source Python web framewo...,"<div class=""mw-parser-output""><p>Karrigell is ..."
3,Wikibooks: The Pyrogenesis Engine/0 A.D./GuiSe...,https://en.wikibooks.org/wiki/The_Pyrogenesis_...,====setupUnitPanel====,setupUnitPanel[edit]\nHelper function for upda...,"<div class=""mw-parser-output""><h4><span class=..."
4,Wikibooks: LMIs in Control/pages/Exterior Coni...,https://en.wikibooks.org/wiki/LMIs_in_Control/...,== The Concept ==,Contents\n\n1 The Concept\n2 The System\n3 The...,"<div class=""mw-parser-output""><div id=""toc"" cl..."


In [None]:
# read the file using config file
file_type = config_data['file_type']
source_file = "./" + config_data['file_name'] + f'.{file_type}'
#print("",source_file)
df = pd.read_csv(source_file, sep=config_data['inbound_delimiter'])
df.head()

Unnamed: 0,title,url,abstract
0,Wikibooks: Radiation Oncology/NHL/CLL-SLL,https://en.wikibooks.org/wiki/Radiation_Oncolo,Chronic Lymphocytic Leukemia and Small Lymphoc...
1,Wikibooks: Romanian/Lesson 9,https://en.wikibooks.org/wiki/Romanian/Lesson_9,==Băuturi/Beverages==
2,Wikibooks: Karrigell,https://en.wikibooks.org/wiki/Karrigell,Karrigell is an open Source Python web framewo...
3,Wikibooks: Calculus/Precalculus,https://en.wikibooks.org/wiki/The_Pyrogenesis,====setupUnitPanel====


In [None]:
# validate the header of the file
util.col_header_val(df,config_data)

column name and column length validation failed
Following File columns are not in the YAML file ['body_text', 'body_html']
Following YAML columns are not in the file uploaded []


0

In [None]:
print("columns of files are:" ,df.columns)
print("columns of YAML are:" ,config_data['columns'])

columns of files are: Index(['title', 'url', 'abstract', 'body_text', 'body_html'], dtype='object')
columns of YAML are: ['title', 'url', 'abstract']


In [None]:
if util.col_header_val(df,config_data)==0:
    print("validation failed")
    # write code to reject the file
else:
    print("col validation passed")
    # write the code to perform further action
    # in the pipeline

column name and column length validation failed
Following File columns are not in the YAML file ['body_text', 'body_html']
Following YAML columns are not in the file uploaded []
validation failed


In [None]:
pd.read_csv("en-books-dataset.csv")

Unnamed: 0,title,url,abstract,body_text,body_html
0,Wikibooks: Radiation Oncology/NHL/CLL-SLL,https://en.wikibooks.org/wiki/Radiation_Oncolo...,Chronic Lymphocytic Leukemia and Small Lymphoc...,Front Page: Radiation Oncology | RTOG Trials |...,"<div class=""mw-parser-output""><table width=""10..."
1,Wikibooks: Romanian/Lesson 9,https://en.wikibooks.org/wiki/Romanian/Lesson_9,==Băuturi/Beverages==,Băuturi/Beverages[edit]\nTea : Ceai\nMilk : La...,"<div class=""mw-parser-output""><h2><span id=""B...."
2,Wikibooks: Karrigell,https://en.wikibooks.org/wiki/Karrigell,Karrigell is an open Source Python web framewo...,Karrigell is an open Source Python web framewo...,"<div class=""mw-parser-output""><p>Karrigell is ..."
3,Wikibooks: The Pyrogenesis Engine/0 A.D./GuiSe...,https://en.wikibooks.org/wiki/The_Pyrogenesis_...,====setupUnitPanel====,setupUnitPanel[edit]\nHelper function for upda...,"<div class=""mw-parser-output""><h4><span class=..."
4,Wikibooks: LMIs in Control/pages/Exterior Coni...,https://en.wikibooks.org/wiki/LMIs_in_Control/...,== The Concept ==,Contents\n\n1 The Concept\n2 The System\n3 The...,"<div class=""mw-parser-output""><div id=""toc"" cl..."
...,...,...,...,...,...
82253,Wikibooks: Python Programming/Creating Python ...,https://en.wikibooks.org/wiki/Python_Programmi...,Welcome to Python! This tutorial will show you...,Previous: Self Help\n\nIndex\n\nNext: Variable...,"<div class=""mw-parser-output""><div class=""nopr..."
82254,Wikibooks: Calculus/Precalculus,https://en.wikibooks.org/wiki/Calculus/Precalc...,==Precalculus==,← Contributing\n\nCalculus\n\nAlgebra →\n\n\nP...,"<div class=""mw-parser-output""><table width=""10..."
82255,Wikibooks: Castles of England/Somerset,https://en.wikibooks.org/wiki/Castles_of_Engla...,There are 11 castles in Somerset.,There are 11 castles in Somerset.\n\n\n\n\nNam...,"<div class=""mw-parser-output""><p>There are 11 ..."
82256,Wikibooks: Digital Technology and Cultures/Int...,https://en.wikibooks.org/wiki/Digital_Technolo...,=CULTURAL STUDIES AND IDENTITY=,Contents\n\n1 CULTURAL STUDIES AND IDENTITY\n\...,"<div class=""mw-parser-output""><div id=""toc"" cl..."


In [None]:
df['url'][0:4]

0    https://en.wikibooks.org/wiki/Radiation_Oncolo...
1      https://en.wikibooks.org/wiki/Romanian/Lesson_9
2              https://en.wikibooks.org/wiki/Karrigell
3    https://en.wikibooks.org/wiki/The_Pyrogenesis_...
Name: url, dtype: object

In [None]:
df['abstract'][2]

'Karrigell is an open Source Python web framework written in Python'

In [None]:
### Creating test file for this demo:
testdata = {
    'title' : ['Wikibooks: Radiation Oncology/NHL/CLL-SLL', 'Wikibooks: Romanian/Lesson 9', 'Wikibooks: Karrigell','Wikibooks: Calculus/Precalculus'],
    'url' : ['https://en.wikibooks.org/wiki/Radiation_Oncolo', 'https://en.wikibooks.org/wiki/Romanian/Lesson_9', 'https://en.wikibooks.org/wiki/Karrigell','https://en.wikibooks.org/wiki/The_Pyrogenesis'],
    'abstract' : ['Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma (CLL/SLL)','==Băuturi/Beverages==','Karrigell is an open Source Python web framework written in Python','====setupUnitPanel====']
}
import pandas as pd
df = pd.DataFrame(testdata, columns=['title', 'url','abstract'])
df.to_csv("./test_data.csv",index=False)

In [None]:
df

Unnamed: 0,title,url,abstract
0,Wikibooks: Radiation Oncology/NHL/CLL-SLL,https://en.wikibooks.org/wiki/Radiation_Oncolo,Chronic Lymphocytic Leukemia and Small Lymphoc...
1,Wikibooks: Romanian/Lesson 9,https://en.wikibooks.org/wiki/Romanian/Lesson_9,==Băuturi/Beverages==
2,Wikibooks: Karrigell,https://en.wikibooks.org/wiki/Karrigell,Karrigell is an open Source Python web framewo...
3,Wikibooks: Calculus/Precalculus,https://en.wikibooks.org/wiki/The_Pyrogenesis,====setupUnitPanel====


In [None]:
testdata

{'title': ['Wikibooks: Radiation Oncology/NHL/CLL-SLL',
  'Wikibooks: Romanian/Lesson 9',
  'Wikibooks: Karrigell',
  'Wikibooks: Calculus/Precalculus'],
 'url': ['https://en.wikibooks.org/wiki/Radiation_Oncolo',
  'https://en.wikibooks.org/wiki/Romanian/Lesson_9',
  'https://en.wikibooks.org/wiki/Karrigell',
  'https://en.wikibooks.org/wiki/The_Pyrogenesis'],
 'abstract': ['Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma (CLL/SLL)',
  '==Băuturi/Beverages==',
  'Karrigell is an open Source Python web framework written in Python',
  '====setupUnitPanel====']}