<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_building_flat_files_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building flat files tutorial

In this tutorial we'll review some useful shell commands that can be run from Jupyter notebooks using the ! prefix and % magics. We'll then build example DataFrames in pandas to illustrate how to write CSV, Parquet, and AVRO files.

## Shell commands

What directory are we currently in? The 'print working directory' command, pwd, tells us. The ! prefix tells Jupyter to run the line as a shell command rather than as Python code.

### pwd

In [1]:
!pwd

/content


Show the contents of the working directory with ls. For now it contains only one other directory, sample_data.

### ls

In [2]:
!ls

sample_data


Make a new directory named some_dir.

### mkdir

In [3]:
!mkdir some_dir

Run ls again to see the new directory.

In [4]:
!ls

sample_data  some_dir


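Change the current working directory to our new directory. Use the %cd magic rather than !cd: each ! command runs in its own subshell, so a !cd would not persist to the next command.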

In [5]:
%cd some_dir

/content/some_dir


Check our working directory.

In [6]:
!pwd

/content/some_dir
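
For comparison, the same navigation steps can be done from Python itself. The following is a minimal sketch using only the standard library; the temporary directory is an assumption added here so the example does not touch /content, and the variable names are illustrative.

```python
import os
import tempfile
from pathlib import Path

# Remember where we started (like `pwd`) and work in a throwaway directory.
start_dir = Path.cwd()
base = Path(tempfile.mkdtemp())

new_dir = base / "some_dir"
new_dir.mkdir()                                    # like `mkdir some_dir`
listing = sorted(p.name for p in base.iterdir())   # like `ls`
print(listing)

os.chdir(new_dir)                                  # like `%cd some_dir`
print(Path.cwd())                                  # like `pwd` again

os.chdir(start_dir)                                # go back to where we started
```

Unlike a `!cd`, `os.chdir` changes the working directory of the notebook's own Python process, so it persists across cells in the same way `%cd` does.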


Create a new file with the touch command. It will be an empty text file.

### touch

In [7]:
!touch somefile.txt

List our new directory to see the file we created. Options (also called flags) modify a command's behavior: -l produces a "long" listing and -h prints file sizes in human-readable units. Here the size is 0 bytes because the file is empty. The listing also shows the file's timestamp and its permissions.

This file is readable and writable by the owner, and read-only for the group and for everyone else.

### command arguments

In [8]:
!ls -lh

total 0
-rw-r--r-- 1 root root 0 Feb 27 22:58 somefile.txt
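
The permission string and size in that listing can also be read from Python. This is a small sketch with the standard library's `os.stat` and `stat.filemode`; the temporary file and the explicit `chmod` to 0o644 are assumptions made so the result does not depend on the local umask.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "somefile.txt")
    open(path, "w").close()        # like `touch somefile.txt`
    os.chmod(path, 0o644)          # owner read/write, group read, others read

    info = os.stat(path)
    mode_string = stat.filemode(info.st_mode)   # same format as `ls -l`
    size_bytes = info.st_size

print(mode_string, size_bytes)
```

The three `rw-`, `r--`, `r--` groups in the string correspond to the owner, the group, and everyone else, in that order.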


### echo

The echo command prints its argument to the screen or, with a redirect, writes it to a file. The redirect operator determines what happens to any existing content: > overwrites the file, while >> appends to it.

In [9]:
!echo 'hello world!'

hello world!


In [10]:
!echo "hi matt!" > somefile.txt ## > overwrite

The cat command prints the contents of a file. Be warned: if the file is large, this command is a poor choice!

In [11]:
!cat somefile.txt

hi matt!


In [12]:
!echo "more text!" >> somefile.txt ## >> append

In [13]:
!cat somefile.txt

hi matt!
more text!
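
Python's `open` has the same overwrite-versus-append distinction as the shell redirects. A minimal sketch, assuming a temporary file so nothing in the working directory is touched (the "hi again!" line is added here only to make the overwrite visible):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "somefile.txt")

with open(path, "w") as f:      # mode "w" truncates first, like `>`
    f.write("hi matt!\n")
with open(path, "w") as f:      # a second "w" replaces the earlier content
    f.write("hi again!\n")
with open(path, "a") as f:      # mode "a" adds to the end, like `>>`
    f.write("more text!\n")

with open(path) as f:
    contents = f.read()
print(contents)
```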


Running cat on the os-release file tells us which operating system Colab is using.

In [14]:
!cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


## Build a CSV from scratch with shell commands

We'll use Jupyter magics and shell commands to create a simple CSV file on our filesystem and then import it with pandas.

In [15]:
%ls -l

total 4
-rw-r--r-- 1 root root 20 Feb 27 22:58 somefile.txt


In [16]:
!touch mycsv.csv

In [17]:
%ls -lh

total 4.0K
-rw-r--r-- 1 root root  0 Feb 27 22:58 mycsv.csv
-rw-r--r-- 1 root root 20 Feb 27 22:58 somefile.txt


In [18]:
!echo "Company,Email,Phone Number,TotalSales" > mycsv.csv
!echo "Company A,email@companya.com,888-555-3333,98.12" >> mycsv.csv
!echo "Company B,email@companyb.com,444-555-1234,123.45" >> mycsv.csv
!echo "Company C,email@companyc.com,987-123-4855,65.64" >> mycsv.csv
!echo "Company D,email@companyd.com,987-125-9542,49.18" >> mycsv.csv
!echo "Company E,email@companyd.com,987-634-4687,44.38" >> mycsv.csv

In [19]:
!cat mycsv.csv

Company,Email,Phone Number,TotalSales
Company A,email@companya.com,888-555-3333,98.12
Company B,email@companyb.com,444-555-1234,123.45
Company C,email@companyc.com,987-123-4855,65.64
Company D,email@companyd.com,987-125-9542,49.18
Company E,email@companyd.com,987-634-4687,44.38
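
Building the file with echo works for simple values, but echo knows nothing about CSV quoting. As a point of comparison, here is a hedged sketch that writes the first rows of the same table with Python's built-in `csv` module; the file name and temporary directory are assumptions for illustration.

```python
import csv
import os
import tempfile

rows = [
    ["Company", "Email", "Phone Number", "TotalSales"],
    ["Company A", "email@companya.com", "888-555-3333", 98.12],
    ["Company B", "email@companyb.com", "444-555-1234", 123.45],
]

path = os.path.join(tempfile.mkdtemp(), "mycsv_from_python.csv")

# newline="" lets the csv module control line endings itself
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading back with csv.reader returns every field as a string
with open(path, newline="") as f:
    read_back = list(csv.reader(f))
print(read_back)
```

The writer quotes any field that contains a comma, which is the case the echo approach would silently get wrong.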


Import the pandas package so we can deserialize the file into a DataFrame.

In [20]:
import pandas as pd

See the pandas read_csv documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [21]:
my_csv = pd.read_csv('mycsv.csv')
my_csv

Unnamed: 0,Company,Email,Phone Number,TotalSales
0,Company A,email@companya.com,888-555-3333,98.12
1,Company B,email@companyb.com,444-555-1234,123.45
2,Company C,email@companyc.com,987-123-4855,65.64
3,Company D,email@companyd.com,987-125-9542,49.18
4,Company E,email@companyd.com,987-634-4687,44.38


### shape

In [22]:
my_csv.shape

(5, 4)

### info

In [23]:
my_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company       5 non-null      object 
 1   Email         5 non-null      object 
 2   Phone Number  5 non-null      object 
 3   TotalSales    5 non-null      float64
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


### describe

In [24]:
my_csv.describe(include='all')

Unnamed: 0,Company,Email,Phone Number,TotalSales
count,5,5,5,5.0
unique,5,4,5,
top,Company A,email@companyd.com,888-555-3333,
freq,1,2,1,
mean,,,,76.154
std,,,,33.790327
min,,,,44.38
25%,,,,49.18
50%,,,,65.64
75%,,,,98.12


### head

In [25]:
my_csv.head(n=2)

Unnamed: 0,Company,Email,Phone Number,TotalSales
0,Company A,email@companya.com,888-555-3333,98.12
1,Company B,email@companyb.com,444-555-1234,123.45


### tail

In [26]:
my_csv.tail(n=2)

Unnamed: 0,Company,Email,Phone Number,TotalSales
3,Company D,email@companyd.com,987-125-9542,49.18
4,Company E,email@companyd.com,987-634-4687,44.38


### cp

The cp shell command makes a copy of a file.

In [27]:
!cp mycsv.csv anewcsv.csv

In [28]:
%ls -l

total 12
-rw-r--r-- 1 root root 279 Feb 27 22:58 anewcsv.csv
-rw-r--r-- 1 root root 279 Feb 27 22:58 mycsv.csv
-rw-r--r-- 1 root root  20 Feb 27 22:58 somefile.txt


### rm

The rm command removes a file or files. The removal is permanent, so double-check the file name first.

In [29]:
!rm anewcsv.csv

In [30]:
%ls -l

total 8
-rw-r--r-- 1 root root 279 Feb 27 22:58 mycsv.csv
-rw-r--r-- 1 root root  20 Feb 27 22:58 somefile.txt


## Faker

Faker is a useful Python package for generating synthetic data that looks like real data. Internally it picks at random from built-in lists of values to mock new data. The supply of names is not infinite, but it is still quite useful when prototyping for an upcoming project that you have no data for yet.
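
To make the "random picks from internal lists" idea concrete, here is a toy sketch. It is not Faker's implementation; the name lists and the `toy_name` helper are hypothetical and exist only for illustration.

```python
import random

# Hypothetical, tiny stand-ins for the much larger lists a real library ships
FIRST_NAMES = ["Calvin", "Stacey", "David", "Margaret", "Kevin"]
LAST_NAMES = ["Thomas", "Martin", "Mitchell", "Gonzalez", "Parrish"]

def toy_name(rng):
    """Combine one random first name with one random last name."""
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"

rng = random.Random(42)                  # seeded so reruns give the same names
names = [toy_name(rng) for _ in range(1000)]
distinct = len(set(names))
print(distinct)                          # can never exceed 5 * 5 = 25
```

With only 5 first names and 5 last names there are at most 25 distinct combinations, which is why a generator built this way eventually repeats itself.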

In [31]:
!pip install faker

Collecting faker
  Downloading Faker-23.3.0-py3-none-any.whl (1.8 MB)
Installing collected packages: faker
Successfully installed faker-23.3.0


In [32]:
from faker import Faker
fake=Faker()

In [33]:
fake.name()

'Calvin Thomas'

In [34]:
fake.name()

'Stacey Martin'

In [35]:
fake.name()

'David Mitchell'

In [36]:
fake.random_int(min=10, max=60, step=1)

30

In [37]:
fake.random_int(min=10, max=60, step=1)

43

In [38]:
fake.random_int(min=10, max=60, step=10)

40

In [39]:
fake.random_int(min=10, max=60, step=10)

30

Generate a list of dictionaries. For illustration purposes we keep the list small at first, then scale it up to generate more data.

Generate 2 records.

In [40]:
our_fake_data_list = list()

for x in range(2):

  data={"name":fake.name(),
        "age":fake.random_int(min=10, max=60, step=1),
        "street":fake.street_address(),
        "city":fake.city(),"state":fake.state(),
        "zip":fake.zipcode(),
        "lng":float(fake.longitude()),
        "lat":float(fake.latitude())}

  our_fake_data_list.append(data)

our_fake_data_list

[{'name': 'Kevin Parrish',
  'age': 26,
  'street': '39177 Gonzalez Avenue Suite 986',
  'city': 'South Saraland',
  'state': 'Texas',
  'zip': '28784',
  'lng': -136.100363,
  'lat': 78.1452915},
 {'name': 'Margaret Gonzalez',
  'age': 60,
  'street': '49755 Joanne Fork Suite 461',
  'city': 'New Amyberg',
  'state': 'New Jersey',
  'zip': '49961',
  'lng': 90.721848,
  'lat': 39.1536915}]

Generate 1000 records.

In [41]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "age":fake.random_int(min=10, max=60, step=1),
        "street":fake.street_address(),
        "city":fake.city(),"state":fake.state(),
        "zip":fake.zipcode(),
        "lng":float(fake.longitude()),
        "lat":float(fake.latitude())}

  our_fake_data_list.append(data)

In [42]:
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

In [43]:
our_fake_df.shape

(1000, 8)

In [44]:
our_fake_df.head()

Unnamed: 0,name,age,street,city,state,zip,lng,lat
0,David Terrell,38,765 Scott Cape Apt. 829,Lake Tracy,Rhode Island,61860,-168.09529,42.786221
1,Cesar Hayes,40,10791 Alvarez Bypass Apt. 512,North Josephview,Montana,54158,-74.877781,59.931328
2,Sarah Wolf,33,9488 Ryan Radial Suite 029,Port Mitchell,New Jersey,4563,-36.009133,14.344936
3,Molly Baker,41,92459 Stewart Skyway,Ericborough,Georgia,65031,-164.901127,-73.987098
4,Randall Williams,48,05456 Dwayne Ports Suite 404,Paigeburgh,Illinois,72813,-13.155734,33.944083


Serialize our DataFrame to a CSV file. What encoding does to_csv use by default? Look at the help text to see.

In [45]:
our_fake_df.to_csv('fake_csv.csv',index=False)

In [46]:
%ls -lh

total 100K
-rw-r--r-- 1 root root 89K Feb 27 22:58 fake_csv.csv
-rw-r--r-- 1 root root 279 Feb 27 22:58 mycsv.csv
-rw-r--r-- 1 root root  20 Feb 27 22:58 somefile.txt


In [47]:
only_three_rows = pd.read_csv('fake_csv.csv',nrows=3)

In [48]:
only_three_rows.shape

(3, 8)

In [49]:
our_fake_df.to_csv('fake_csv_utf_16.csv',index=False,encoding='UTF-16')
our_fake_df.to_csv('fake_csv_utf_32.csv',index=False,encoding='UTF-32')

In [50]:
%ls -lh

total 636K
-rw-r--r-- 1 root root  89K Feb 27 22:58 fake_csv.csv
-rw-r--r-- 1 root root 178K Feb 27 22:58 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K Feb 27 22:58 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  279 Feb 27 22:58 mycsv.csv
-rw-r--r-- 1 root root   20 Feb 27 22:58 somefile.txt
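
The roughly 1x, 2x, and 4x file sizes in that listing follow from how many bytes each encoding spends per character. A quick standard-library sketch (the sample text is made up for illustration) shows the same ratios on a single string:

```python
# Mostly-ASCII text, similar in character to the rows of our CSV
text = "name,age,city\nKevin Parrish,26,South Saraland\n"

sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16", "utf-32")}

for enc, size in sizes.items():
    print(f"{enc}: {size} bytes")
```

For ASCII characters UTF-8 uses one byte each, UTF-16 uses two, and UTF-32 uses four. Python's `utf-16` and `utf-32` codecs also prepend a byte order mark (2 and 4 bytes respectively), which is negligible for a file of this size.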


## Compression, CSV files, Parquet files and AVRO

Next we compare file sizes across formats to see where compression helps and where it is limited.

In [51]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "Customer":fake.random_element(elements=('Yes','No')),
        "Status":fake.random_element(elements=('Active','Inactive'))}

  our_fake_data_list.append(data)

# build the DataFrame once, after the loop, instead of on every iteration
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

our_fake_df

Unnamed: 0,name,Customer,Status
0,Charles Carroll,Yes,Inactive
1,Mary Cunningham,No,Active
2,Jill Lutz,No,Active
3,Adam Villegas,No,Inactive
4,Martin Moore,No,Inactive
...,...,...,...
995,Kimberly Stephens,Yes,Active
996,David Avila,No,Inactive
997,Donald Woods,Yes,Active
998,Mark Werner,Yes,Active


### CSV and Parquet

Pay close attention to the file sizes when comparing the CSV to the Parquet file.

In [52]:
our_fake_df.to_csv('compression_csv.csv',index=False)
our_fake_df.to_parquet('compression_parquet.parquet',index=False)

In [53]:
%ls -lh

total 680K
-rw-r--r-- 1 root root  26K Feb 27 22:58 compression_csv.csv
-rw-r--r-- 1 root root  14K Feb 27 22:58 compression_parquet.parquet
-rw-r--r-- 1 root root  89K Feb 27 22:58 fake_csv.csv
-rw-r--r-- 1 root root 178K Feb 27 22:58 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K Feb 27 22:58 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  279 Feb 27 22:58 mycsv.csv
-rw-r--r-- 1 root root   20 Feb 27 22:58 somefile.txt


In [54]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "Customer":fake.lexify('???')}

  our_fake_data_list.append(data)

# build the DataFrame once, after the loop, instead of on every iteration
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

our_fake_df.to_csv('random_letters_compression_csv.csv',index=False)
our_fake_df.to_parquet('random_letters_compression_parquet.parquet',index=False)

In [55]:
our_fake_df

Unnamed: 0,name,Customer
0,Kathy Jones,kST
1,Jeremy Ellis,UFm
2,Audrey Patterson,wLd
3,Ronald Wilson,plT
4,James Cordova,CzE
...,...,...
995,Joseph White,NtG
996,Rhonda Haney,RUQ
997,Deborah Frank,fXn
998,Jennifer Smith,oXx


In [56]:
our_fake_df.Customer.nunique() # almost all unique values

997

In [57]:
our_fake_df.name.nunique() # almost all unique values

991

In [58]:
%ls -lh

total 720K
-rw-r--r-- 1 root root  26K Feb 27 22:58 compression_csv.csv
-rw-r--r-- 1 root root  14K Feb 27 22:58 compression_parquet.parquet
-rw-r--r-- 1 root root  89K Feb 27 22:58 fake_csv.csv
-rw-r--r-- 1 root root 178K Feb 27 22:58 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K Feb 27 22:58 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  279 Feb 27 22:58 mycsv.csv
-rw-r--r-- 1 root root  18K Feb 27 22:58 random_letters_compression_csv.csv
-rw-r--r-- 1 root root  19K Feb 27 22:58 random_letters_compression_parquet.parquet
-rw-r--r-- 1 root root   20 Feb 27 22:58 somefile.txt


Wait, what? I thought you said it used compression! Why is the random_letters_compression Parquet file larger than the CSV?
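
One way to build intuition is to compress two columns of different cardinality with a general-purpose compressor. The sketch below uses the standard library's zlib as a stand-in. That is an assumption for illustration: pandas' to_parquet defaults to snappy compression and Parquet adds its own column encodings and file metadata, so the exact numbers will differ.

```python
import random
import string
import zlib

rng = random.Random(0)
n = 1000

# Low cardinality: every row is one of a handful of repeated values
low_cardinality = "\n".join(
    rng.choice(["Yes", "No"]) + "," + rng.choice(["Active", "Inactive"])
    for _ in range(n)
).encode("utf-8")

# High cardinality: three random letters per row, so almost every row is unique
high_cardinality = "\n".join(
    "".join(rng.choices(string.ascii_letters, k=3)) for _ in range(n)
).encode("utf-8")

def ratio(raw):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(raw)) / len(raw)

low_ratio = ratio(low_cardinality)
high_ratio = ratio(high_cardinality)
print(f"low cardinality:  {low_ratio:.2f}")
print(f"high cardinality: {high_ratio:.2f}")
```

Repeated values compress well; nearly unique values do not. When the data itself barely shrinks, the fixed cost of the format's metadata can leave a small Parquet file larger than the equivalent CSV.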

In [62]:
!mkdir /content/parquet_files

In [68]:
!wget -q -O /content/parquet_files/file1.parquet https://github.com/matthewpecsok/data_engineering/raw/main/data/set_df_1.parquet
!wget -q -O /content/parquet_files/file2.parquet https://github.com/matthewpecsok/data_engineering/raw/main/data/set_df_2.parquet
!wget -q -O /content/parquet_files/file3.parquet https://github.com/matthewpecsok/data_engineering/raw/main/data/set_df_3.parquet

In [70]:
print(pd.read_parquet('/content/parquet_files/file1.parquet').shape)
print(pd.read_parquet('/content/parquet_files/file2.parquet').shape)
print(pd.read_parquet('/content/parquet_files/file3.parquet').shape)

(3001, 11)
(3000, 11)
(3999, 11)


In [71]:
pd.read_parquet('/content/parquet_files/')

Unnamed: 0,Auction,Color,IsBadBuy,MMRCurrentAuctionAveragePrice,Size,TopThreeAmericanName,VehBCost,VehicleAge,VehOdo,WarrantyCost,WheelType
0,ADESA,WHITE,False,2871,LARGE TRUCK,FORD,5300,8,75419,869,Alloy
1,ADESA,GOLD,True,1840,VAN,FORD,3600,8,82944,2322,Alloy
2,ADESA,RED,False,8931,SMALL SUV,CHRYSLER,7500,4,57338,588,Alloy
3,ADESA,GOLD,False,8320,CROSSOVER,FORD,8500,5,55909,1169,Alloy
4,ADESA,GREY,False,11520,LARGE TRUCK,FORD,10100,5,86702,853,Alloy
...,...,...,...,...,...,...,...,...,...,...,...
9995,ADESA,RED,False,7536,SMALL SUV,CHRYSLER,6600,4,85377,983,Alloy
9996,ADESA,BLACK,False,4921,LARGE TRUCK,GM,7000,7,89665,1543,Alloy
9997,ADESA,BLACK,False,9263,MEDIUM SUV,CHRYSLER,9000,4,59383,1417,Alloy
9998,ADESA,BLUE,False,3240,MEDIUM,OTHER,5500,4,48642,482,Covers


In [72]:
pd.read_parquet('/content/parquet_files/',columns=['VehicleAge','Size'])

Unnamed: 0,VehicleAge,Size
0,8,LARGE TRUCK
1,8,VAN
2,4,SMALL SUV
3,5,CROSSOVER
4,5,LARGE TRUCK
...,...,...
9995,4,SMALL SUV
9996,7,LARGE TRUCK
9997,4,MEDIUM SUV
9998,4,MEDIUM


In [73]:
!pip install memory_profiler

Collecting memory_profiler
  Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.61.0


In [74]:
list("Size,Age")

['S', 'i', 'z', 'e', ',', 'A', 'g', 'e']
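
Calling `list` on a string splits it into single characters, which is not what we want for a comma-separated column argument. A short sketch of the difference, using `str.split` as the script in the next cell does (the padded example and the `strip` call are additions for illustration):

```python
arg = "Size,Age"

as_chars = list(arg)          # one element per character, including the comma
as_columns = arg.split(",")   # one element per comma-separated value

# Stripping whitespace guards against input like "VehicleAge, Size"
padded = [c.strip() for c in "VehicleAge, Size".split(",")]

print(as_chars)
print(as_columns)
print(padded)
```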

In [75]:
%%writefile load_parquet.py

# Import necessary library (no need to import memory_profiler in the script)
import pandas as pd
import sys
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--columns', type=str, help='Enter a list of values separated by commas')
parser.add_argument('--files', type=str, help='the filepath')
args = parser.parse_args()
columns = args.columns.split(',')
file_name = args.files


# Function to be profiled
@profile
def load_data():
    if (columns[0]!='all'):
      df = pd.read_parquet(file_name,columns=columns)
    else:
      df = pd.read_parquet(file_name)
    # any other code

# Call the function
load_data()

Writing load_parquet.py


In [76]:
!python -m memory_profiler load_parquet.py --files /content/parquet_files/ --columns 'all'

Filename: load_parquet.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    16  107.023 MiB  107.023 MiB           1   @profile
    17                                         def load_data():
    18  107.023 MiB    0.000 MiB           1       if (columns[0]!='all'):
    19                                               df = pd.read_parquet(file_name,columns=columns)
    20                                             else:
    21  119.766 MiB   12.742 MiB           1         df = pd.read_parquet(file_name)
    22                                             # any other code




In [77]:
!python -m memory_profiler load_parquet.py --files /content/parquet_files/ --columns 'VehicleAge'

Filename: load_parquet.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    16  106.500 MiB  106.500 MiB           1   @profile
    17                                         def load_data():
    18  106.500 MiB    0.000 MiB           1       if (columns[0]!='all'):
    19  117.074 MiB   10.574 MiB           1         df = pd.read_parquet(file_name,columns=columns)
    20                                             else:
    21                                               df = pd.read_parquet(file_name)
    22                                             # any other code




### AVRO

In [78]:
!pip install fastavro

Collecting fastavro
  Downloading fastavro-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: fastavro
Successfully installed fastavro-1.9.4


In [79]:
from fastavro import writer, reader, parse_schema

In [80]:
schema = {
    'doc': 'fake test',
    'name': 'faketest',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'Customer', 'type': 'string'}
    ]
}

parsed_schema = parse_schema(schema)

In [81]:
# convert the DataFrame to a list of dicts, one per row
records = our_fake_df.to_dict('records')

# write the records to an Avro file using the parsed schema
with open('random_letters_compression_csv.avro', 'wb') as out:
    writer(out, parsed_schema, records)

In [82]:
compression_schema = {
  'doc': 'fake test',
  'name': 'faketest',
  'namespace': 'test',
  'type': 'record',
  'fields' : [
  {"name" : "name", 'type' : 'string'},
  {"name" : "age", 'type' : 'int'},
  {"name" : "street", 'type' : 'string'},
  {"name" : "zip", 'type' : 'string'},
  {"name" : "lng", 'type' : 'float'},
  {"name" : "lat", 'type' : 'float'}
  ]}

parsed_comp_schema = parse_schema(compression_schema)

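# our_fake_df still holds the name/Customer frame from above, so these records
# match parsed_schema; parsed_comp_schema describes the earlier address data
# (name, age, street, ...) and is not used by this write
records = our_fake_df.to_dict('records')

# write the records to an Avro file
with open('compression_avro.avro', 'wb') as out:
    writer(out, parsed_schema, records)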

In [83]:
!ls -lh

total 768K
-rw-r--r-- 1 root root  18K Feb 27 23:05 compression_avro.avro
-rw-r--r-- 1 root root  26K Feb 27 22:58 compression_csv.csv
-rw-r--r-- 1 root root  14K Feb 27 22:58 compression_parquet.parquet
-rw-r--r-- 1 root root  89K Feb 27 22:58 fake_csv.csv
-rw-r--r-- 1 root root 178K Feb 27 22:58 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K Feb 27 22:58 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  639 Feb 27 23:04 load_parquet.py
-rw-r--r-- 1 root root  279 Feb 27 22:58 mycsv.csv
drwxr-xr-x 2 root root 4.0K Feb 27 22:58 parquet_files
-rw-r--r-- 1 root root  18K Feb 27 23:05 random_letters_compression_csv.avro
-rw-r--r-- 1 root root  18K Feb 27 22:58 random_letters_compression_csv.csv
-rw-r--r-- 1 root root  19K Feb 27 22:58 random_letters_compression_parquet.parquet
-rw-r--r-- 1 root root   20 Feb 27 22:58 somefile.txt


## Problems with files

Files can contain problematic data. Imagine a price column with words in it, or a row with more commas than there are columns.

In [84]:
!echo "Price,Customer" > "problem.csv"
!echo "23.45," >> "problem.csv"
!echo "14.64" >> "problem.csv"
!echo "Ooops!,," >> "problem.csv"
!echo "48.87,,," >> "problem.csv"

!cat problem.csv

Price,Customer
23.45,
14.64
Ooops!,,
48.87,,,


In [85]:
problem_df = pd.read_csv('problem.csv')
problem_df

ParserError: Error tokenizing data. C error: Expected 2 fields in line 4, saw 3
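
The parser stops at line 4 because that row has more fields than the header. The short row on line 3 does not raise, since missing trailing fields are filled with NaN. Before loading a suspect file, it can help to find every row whose field count differs from the header. The sketch below does that with the standard library's `csv` module; the file content is inlined as a string here as an assumption, to keep the example self-contained.

```python
import csv
import io

# Same content as problem.csv above
raw = "Price,Customer\n23.45,\n14.64\nOoops!,,\n48.87,,,\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)

good_rows, bad_lines = [], []
for line_number, row in enumerate(reader, start=2):   # the header is line 1
    if len(row) == len(header):
        good_rows.append(row)
    else:
        bad_lines.append((line_number, len(row)))      # (line, fields seen)

print("expected fields:", len(header))
print("bad lines:", bad_lines)
```

Recent pandas versions also accept an `on_bad_lines` argument to `read_csv` (for example `on_bad_lines='skip'`) to drop rows with too many fields. Note that a non-numeric value such as "Ooops!" in a numeric column is a separate problem that a field-count check will not catch.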


In [None]:
problem_df.info()