<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_building_flat_files_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building flat files tutorial 

In this tutorial we'll review some useful shell commands that can be run from Jupyter notebooks. We'll then build example DataFrames in pandas to illustrate how to write CSV, Parquet, and Avro files.

## Shell commands (as magics)

What directory are we currently in? The 'print working directory' command, pwd, tells us. Prefixing a command with ! tells Jupyter to run it in the shell rather than as Python code.

### pwd

In [64]:
!pwd

/content/some_dir


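Show the contents of the working directory with ls. In a fresh session it is just another directory; the listing below includes files left over from an earlier run of this tutorial.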

### ls

In [65]:
!ls

compression_avro.avro	     mycsv.csv
compression_csv.csv	     random_letters_compression_csv.avro
compression_parquet.parquet  random_letters_compression_csv.csv
fake_csv.csv		     random_letters_compression_parquet.parquet
fake_csv_utf_16.csv	     somefile.txt
fake_csv_utf_32.csv


Make a new directory named some_dir.

### mkdir

In [66]:
!mkdir some_dir

Run ls again to see the new directory.

In [67]:
!ls

compression_avro.avro	     mycsv.csv
compression_csv.csv	     random_letters_compression_csv.avro
compression_parquet.parquet  random_letters_compression_csv.csv
fake_csv.csv		     random_letters_compression_parquet.parquet
fake_csv_utf_16.csv	     some_dir
fake_csv_utf_32.csv	     somefile.txt


Change the current working directory to our new directory with the %cd magic.

In [68]:
%cd some_dir

/content/some_dir/some_dir


Check our working directory.

In [69]:
!pwd

/content/some_dir/some_dir
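
Why %cd rather than !cd? Each ! command runs in its own temporary subshell, so a directory change made there is gone as soon as the command returns, while the %cd magic changes the directory of the notebook process itself. The sketch below imitates both behaviors in plain Python, assuming a POSIX shell is available: subprocess.run stands in for !cd and os.chdir stands in for %cd.

```python
import os
import subprocess
import tempfile

start = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    # Like `!cd tmp`: the directory changes only inside the child shell.
    child = subprocess.run(
        f'cd "{tmp}" && pwd', shell=True, capture_output=True, text=True
    )
    child_dir = child.stdout.strip()
    unchanged = os.getcwd() == start  # this process did not move

    # Like `%cd tmp`: the current process itself changes directory.
    os.chdir(tmp)
    moved = os.path.realpath(os.getcwd()) == os.path.realpath(tmp)

    # Go back so the temporary directory can be cleaned up.
    os.chdir(start)

print(child_dir)
print(unchanged, moved)
```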


Create a new file with the touch command. It will be an empty text file.

### touch

In [70]:
!touch somefile.txt

List our new directory to see the file we created. Passing arguments to ls gives more detail: -l produces a "long" listing and -h prints file sizes in human-readable units. The listing shows a file size of 0, so the file is empty.
We also get the timestamp on the file and the file permissions.

This file is readable and writable by the owner, and read-only for the group and for everyone else.

### command arguments

In [71]:
!ls -lh

total 0
-rw-r--r-- 1 root root 0 May 23 17:24 somefile.txt
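
The permission string at the start of each long-listing line can also be decoded programmatically. As a small sketch, Python's standard stat module converts a numeric mode into the same ten-character form that ls -l prints; the octal value 0o644 used here is the conventional numeric spelling of rw-r--r--.

```python
import stat

# A regular file (S_IFREG) with mode 644: owner rw-, group r--, others r--.
mode = stat.S_IFREG | 0o644
listing = stat.filemode(mode)
print(listing)  # -rw-r--r--

# Split the nine permission characters into owner / group / others.
owner, group, others = listing[1:4], listing[4:7], listing[7:10]
print(owner, group, others)  # rw- r-- r--

# Individual bits can be tested with the S_I* constants.
print(bool(mode & stat.S_IWUSR))  # True: the owner may write
print(bool(mode & stat.S_IWOTH))  # False: others may not write
```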


### echo

The echo command prints text to the screen or, combined with redirection, writes that text to a file. The redirection operator determines whether the file is overwritten (>) or appended to (>>).

In [72]:
!echo 'hello world!'

hello world!


In [73]:
!echo "hi matt!" > somefile.txt ## > overwrite

The cat command prints the contents of a file. Be warned: if the file is large, this command is a poor choice!

In [74]:
!cat somefile.txt

hi matt!


In [75]:
!echo "more text!" >> somefile.txt ## >> append

In [76]:
!cat somefile.txt

hi matt!
more text!
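
The same overwrite-versus-append distinction exists when writing files from Python. A minimal sketch, using a temporary directory so nothing in the working directory is touched: mode 'w' behaves like > and mode 'a' behaves like >>.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "somefile.txt"

    path.touch()                      # like `touch somefile.txt`
    size_after_touch = path.stat().st_size

    with open(path, "w") as f:        # like `echo ... > somefile.txt`
        f.write("hi matt!\n")

    with open(path, "a") as f:        # like `echo ... >> somefile.txt`
        f.write("more text!\n")

    contents = path.read_text()       # like `cat somefile.txt`

print(size_after_touch)
print(contents, end="")
```

The two lines add up to 20 bytes, newlines included, which is the size ls reports for somefile.txt later in this tutorial.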


cat on the os-release file tells us what operating system we are using in Colab. 

In [77]:
!cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal


## Build a CSV from scratch with shell commands

We'll use Jupyter magics to execute Unix shell commands that create a simple CSV file on our filesystem, and then import it with pandas.

In [78]:
%ls -l

total 4
-rw-r--r-- 1 root root 20 May 23 17:24 somefile.txt


In [79]:
!touch mycsv.csv

In [80]:
%ls -lh

total 4.0K
-rw-r--r-- 1 root root  0 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root 20 May 23 17:24 somefile.txt


In [81]:
!echo "Company,Email,Phone Number,TotalSales" > mycsv.csv
!echo "Company A,email@companya.com,888-555-3333,98.12" >> mycsv.csv
!echo "Company B,email@companyb.com,444-555-1234,123.45" >> mycsv.csv
!echo "Company C,email@companyc.com,987-123-4855,65.64" >> mycsv.csv
!echo "Company D,email@companyd.com,987-125-9542,49.18" >> mycsv.csv
!echo "Company E,email@companyd.com,987-634-4687,44.38" >> mycsv.csv

In [82]:
!cat mycsv.csv

Company,Email,Phone Number,TotalSales
Company A,email@companya.com,888-555-3333,98.12
Company B,email@companyb.com,444-555-1234,123.45
Company C,email@companyc.com,987-123-4855,65.64
Company D,email@companyd.com,987-125-9542,49.18
Company E,email@companyd.com,987-634-4687,44.38
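
Building rows with echo works here because no field contains a comma, a quote, or a newline. As a hedged sketch of what a CSV writer adds, Python's standard csv module quotes such fields automatically. The company name "Acme, Inc." and its contact details are made up for illustration.

```python
import csv
import io

rows = [
    ["Company", "Email", "Phone Number", "TotalSales"],
    ["Company A", "email@companya.com", "888-555-3333", 98.12],
    # Hypothetical row whose first field contains the delimiter.
    ["Acme, Inc.", "email@acme.example", "555-000-0000", 10.0],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text, end="")

# Reading it back recovers four fields per row, embedded comma and all.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[2][0])
```

Written with plain echo and no quoting, the hypothetical row would have five comma-separated fields instead of four.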


Import the pandas package so we can deserialize the file.

In [83]:
import pandas as pd

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [84]:
my_csv = pd.read_csv('mycsv.csv')
my_csv

,Company,Email,Phone Number,TotalSales
0,Company A,email@companya.com,888-555-3333,98.12
1,Company B,email@companyb.com,444-555-1234,123.45
2,Company C,email@companyc.com,987-123-4855,65.64
3,Company D,email@companyd.com,987-125-9542,49.18
4,Company E,email@companyd.com,987-634-4687,44.38


### shape

In [85]:
my_csv.shape

(5, 4)

### info

In [129]:
my_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company       5 non-null      object 
 1   Email         5 non-null      object 
 2   Phone Number  5 non-null      object 
 3   TotalSales    5 non-null      float64
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


### describe

In [86]:
my_csv.describe(include='all')

,Company,Email,Phone Number,TotalSales
count,5,5,5,5.0
unique,5,4,5,
top,Company A,email@companyd.com,888-555-3333,
freq,1,2,1,
mean,,,,76.154
std,,,,33.790327
min,,,,44.38
25%,,,,49.18
50%,,,,65.64
75%,,,,98.12
max,,,,123.45


### head

In [87]:
my_csv.head(n=2)

,Company,Email,Phone Number,TotalSales
0,Company A,email@companya.com,888-555-3333,98.12
1,Company B,email@companyb.com,444-555-1234,123.45


### tail

In [88]:
my_csv.tail(n=2)

,Company,Email,Phone Number,TotalSales
3,Company D,email@companyd.com,987-125-9542,49.18
4,Company E,email@companyd.com,987-634-4687,44.38


### cp 

The cp shell command makes a copy of a file.

In [89]:
!cp mycsv.csv anewcsv.csv

In [90]:
%ls -l

total 12
-rw-r--r-- 1 root root 279 May 23 17:24 anewcsv.csv
-rw-r--r-- 1 root root 279 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root  20 May 23 17:24 somefile.txt


### rm 

The rm command removes a file or files.

In [91]:
!rm anewcsv.csv

In [92]:
%ls -l

total 8
-rw-r--r-- 1 root root 279 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root  20 May 23 17:24 somefile.txt
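
For scripts that need to run outside a notebook, the standard library offers equivalents of cp and rm. A small sketch in a temporary directory, with made-up file contents: shutil.copy copies a file and Path.unlink removes one.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "mycsv.csv"
    original.write_text("Company,TotalSales\nCompany A,98.12\n")

    duplicate = Path(tmp) / "anewcsv.csv"
    shutil.copy(original, duplicate)          # like `cp mycsv.csv anewcsv.csv`
    same_bytes = duplicate.read_bytes() == original.read_bytes()

    duplicate.unlink()                        # like `rm anewcsv.csv`
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(same_bytes)
print(remaining)
```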


## Faker

Faker is a useful Python package for generating synthetic data that looks like real data. It uses randomization to pull from internal lists of values to mock new data. The supply of names is not infinite, but it's still quite useful when prototyping for an upcoming project that you have no data for yet.

In [93]:
!pip install faker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [94]:
from faker import Faker
fake=Faker()

In [95]:
fake.name()

'Tiffany Villanueva'

In [96]:
fake.name()

'Anna Adams'

In [97]:
fake.name()

'Donna Bowman'

In [98]:
fake.random_int(min=10, max=60, step=1)

45

In [99]:
fake.random_int(min=10, max=60, step=1)

22

In [100]:
fake.random_int(min=10, max=60, step=10) 

10

In [101]:
fake.random_int(min=10, max=60, step=10) 

60

Generate a list of dictionaries. For illustration purposes, keep the list of dicts small at first, then scale it up to generate more data.

Generate 2 records.

In [102]:
our_fake_data_list = list()

for x in range(2):

  data={"name":fake.name(),
        "age":fake.random_int(min=10, max=60, step=1),
        "street":fake.street_address(),
        "city":fake.city(),"state":fake.state(),
        "zip":fake.zipcode(),
        "lng":float(fake.longitude()),
        "lat":float(fake.latitude())}

  our_fake_data_list.append(data)

our_fake_data_list

[{'name': 'Chris Parker',
  'age': 11,
  'street': '29606 Fischer Trail',
  'city': 'Matthewsberg',
  'state': 'Connecticut',
  'zip': '37787',
  'lng': -142.541804,
  'lat': 11.5730575},
 {'name': 'Nancy Rios',
  'age': 43,
  'street': '938 Lee Loaf Suite 075',
  'city': 'New Jonathan',
  'state': 'Wyoming',
  'zip': '60147',
  'lng': -149.33502,
  'lat': 67.652322}]

Generate 1000 records.

In [103]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "age":fake.random_int(min=10, max=60, step=1),
        "street":fake.street_address(),
        "city":fake.city(),"state":fake.state(),
        "zip":fake.zipcode(),
        "lng":float(fake.longitude()),
        "lat":float(fake.latitude())}

  our_fake_data_list.append(data)

In [104]:
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

In [105]:
our_fake_df.shape

(1000, 8)

In [106]:
our_fake_df.head()

,name,age,street,city,state,zip,lng,lat
0,Christopher Morse,43,1159 Kimberly Garden Apt. 529,West Nicholaston,Ohio,73094,-161.453697,79.445381
1,Wayne Contreras,25,373 Johnson Forest,West John,Alaska,91032,-75.239872,-19.652461
2,Erika Graham,10,0592 Ryan Valley,Michaelville,New Jersey,76452,-129.123349,-50.790276
3,Elizabeth Reynolds,44,2459 Angelica Landing Apt. 413,New Cassandra,Maryland,46994,-147.75672,89.851244
4,Mallory Frazier,44,2759 Lewis Estates Apt. 611,Matthewhaven,Louisiana,24673,-51.793957,46.735985


Serialize our DataFrame to a CSV file. What type of encoding does this use by default? Look at the help text to see.

In [107]:
our_fake_df.to_csv('fake_csv.csv',index=False)

In [108]:
%ls -lh

total 100K
-rw-r--r-- 1 root root 89K May 23 17:25 fake_csv.csv
-rw-r--r-- 1 root root 279 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root  20 May 23 17:24 somefile.txt


In [109]:
only_three_rows = pd.read_csv('fake_csv.csv',nrows=3)

In [110]:
only_three_rows.shape

(3, 8)

In [111]:
our_fake_df.to_csv('fake_csv_utf_16.csv',index=False,encoding='UTF-16')
our_fake_df.to_csv('fake_csv_utf_32.csv',index=False,encoding='UTF-32')

In [112]:
%ls -lh

total 636K
-rw-r--r-- 1 root root  89K May 23 17:25 fake_csv.csv
-rw-r--r-- 1 root root 178K May 23 17:25 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K May 23 17:25 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  279 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root   20 May 23 17:24 somefile.txt
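
The listing shows the UTF-16 file at about twice, and the UTF-32 file at about four times, the size of the first CSV. The reason is the number of bytes each encoding spends per character. A short illustration with an ASCII-only string (Python's utf-16 and utf-32 codecs also prepend a byte order mark, which accounts for the extra 2 and 4 bytes):

```python
text = "Company A,98.12"  # 15 ASCII characters

sizes = {}
for encoding in ("utf-8", "utf-16", "utf-32"):
    encoded = text.encode(encoding)
    sizes[encoding] = len(encoded)
    print(f"{encoding:>6}: {len(encoded)} bytes")

# utf-8 : 15 bytes (1 byte per ASCII character)
# utf-16: 32 bytes (2-byte BOM + 2 bytes per character)
# utf-32: 64 bytes (4-byte BOM + 4 bytes per character)
```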


## Compression, CSV files, Parquet files and Avro

Here we compare file sizes across formats to see what compression buys us, and where it falls short.

In [113]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "Customer":fake.random_element(elements=('Yes','No')),
        "Status":fake.random_element(elements=('Active','Inactive'))}

  our_fake_data_list.append(data)

# build the DataFrame once, after the loop has collected every record
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

our_fake_df

,name,Customer,Status
0,Jessica Williams,No,Active
1,Gregory Yu,No,Active
2,Shannon Henry,Yes,Inactive
3,Christopher Kidd,Yes,Active
4,Alvin Mccarty,No,Active
...,...,...,...
995,Manuel Estes,No,Inactive
996,Justin Ross,No,Active
997,Dr. Alisha Roberts,Yes,Inactive
998,Jerry Yoder,No,Inactive


### CSV and Parquet

Pay close attention to the file sizes when comparing the CSV to the Parquet file.

In [114]:
our_fake_df.to_csv('compression_csv.csv',index=False)
our_fake_df.to_parquet('compression_parquet.parquet',index=False)

In [115]:
%ls -lh

total 680K
-rw-r--r-- 1 root root  26K May 23 17:25 compression_csv.csv
-rw-r--r-- 1 root root  14K May 23 17:25 compression_parquet.parquet
-rw-r--r-- 1 root root  89K May 23 17:25 fake_csv.csv
-rw-r--r-- 1 root root 178K May 23 17:25 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K May 23 17:25 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  279 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root   20 May 23 17:24 somefile.txt


In [116]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "Customer":fake.lexify('???')}

  our_fake_data_list.append(data)

# build the DataFrame once, after the loop has collected every record
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

our_fake_df.to_csv('random_letters_compression_csv.csv',index=False)
our_fake_df.to_parquet('random_letters_compression_parquet.parquet',index=False)

In [117]:
our_fake_df

,name,Customer
0,Allen Dickson,jpI
1,Aaron Carey,zNW
2,Kelsey Rodriguez,okO
3,Melissa Medina,afH
4,Gabrielle Williams,puT
...,...,...
995,Kevin Hoffman,yiO
996,Amy Reyes,bTK
997,Claudia Jackson,ief
998,Carol Cunningham,TPX


In [118]:
our_fake_df.Customer.nunique() # almost all unique values

997

In [119]:
our_fake_df['name'].nunique() # almost all unique values

993

In [120]:
%ls -lh

total 720K
-rw-r--r-- 1 root root  26K May 23 17:25 compression_csv.csv
-rw-r--r-- 1 root root  14K May 23 17:25 compression_parquet.parquet
-rw-r--r-- 1 root root  89K May 23 17:25 fake_csv.csv
-rw-r--r-- 1 root root 178K May 23 17:25 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K May 23 17:25 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  279 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root  18K May 23 17:25 random_letters_compression_csv.csv
-rw-r--r-- 1 root root  19K May 23 17:25 random_letters_compression_parquet.parquet
-rw-r--r-- 1 root root   20 May 23 17:24 somefile.txt


Wait, what? I thought you said it used compression! Why is the random_letters_compression Parquet file larger than the CSV?
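
One way to build intuition for this result: compression removes repetition, and nearly unique random strings contain very little of it. The sketch below is not Parquet. It uses the standard library's zlib as a stand-in for a general-purpose compressor and Python's random module as a stand-in for Faker, comparing a low-cardinality column with a column of random three-letter strings. On top of this, a Parquet file carries schema and column metadata that a CSV does not, which can outweigh the savings on a file this small.

```python
import random
import string
import zlib

random.seed(0)  # fixed seed so repeated runs use the same data
n = 1000

# Low cardinality: only two distinct values, heavily repeated.
low = "\n".join(random.choice(["Yes", "No"]) for _ in range(n)).encode()

# High cardinality: random three-letter strings, almost all distinct.
high = "\n".join(
    "".join(random.choices(string.ascii_letters, k=3)) for _ in range(n)
).encode()


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


low_ratio, high_ratio = ratio(low), ratio(high)
print(f"low cardinality:  {low_ratio:.2f}")
print(f"high cardinality: {high_ratio:.2f}")
```

The repeated Yes/No values shrink far more than the random strings do, which mirrors the gap between the two Parquet files in the listing above.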

### Avro

In [121]:
!pip install fastavro

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [122]:
from fastavro import writer, reader, parse_schema

In [123]:
schema = {
    'doc': 'fake test',
    'name': 'faketest',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'Customer', 'type': 'string'}
    ]
}

parsed_schema = parse_schema(schema)

In [124]:
records = our_fake_df.to_dict('records')

# Write the records to an Avro file
with open('random_letters_compression_csv.avro', 'wb') as out:
    writer(out, parsed_schema, records)

In [125]:
compression_schema = {
  'doc': 'fake test',
  'name': 'faketest',
  'namespace': 'test',
  'type': 'record',
  'fields' : [
  {"name" : "name", 'type' : 'string'},
  {"name" : "age", 'type' : 'int'},
  {"name" : "street", 'type' : 'string'},
  {"name" : "zip", 'type' : 'string'},
  {"name" : "lng", 'type' : 'float'},
  {"name" : "lat", 'type' : 'float'}
  ]}

parsed_comp_schema = parse_schema(compression_schema)

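records = our_fake_df.to_dict('records')

# Write to an Avro file. Note that this uses parsed_schema, not parsed_comp_schema:
# our_fake_df currently holds only the name and Customer columns, so this file
# contains the same records as random_letters_compression_csv.avro.
with open('compression_avro.avro', 'wb') as out:
    writer(out, parsed_schema, records)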

In [126]:
!ls -lh

total 760K
-rw-r--r-- 1 root root  19K May 23 17:25 compression_avro.avro
-rw-r--r-- 1 root root  26K May 23 17:25 compression_csv.csv
-rw-r--r-- 1 root root  14K May 23 17:25 compression_parquet.parquet
-rw-r--r-- 1 root root  89K May 23 17:25 fake_csv.csv
-rw-r--r-- 1 root root 178K May 23 17:25 fake_csv_utf_16.csv
-rw-r--r-- 1 root root 356K May 23 17:25 fake_csv_utf_32.csv
-rw-r--r-- 1 root root  279 May 23 17:24 mycsv.csv
-rw-r--r-- 1 root root  19K May 23 17:25 random_letters_compression_csv.avro
-rw-r--r-- 1 root root  18K May 23 17:25 random_letters_compression_csv.csv
-rw-r--r-- 1 root root  19K May 23 17:25 random_letters_compression_parquet.parquet
-rw-r--r-- 1 root root   20 May 23 17:24 somefile.txt


## Problems with files

Files can contain problematic data. Imagine a price column with words in it, or rows with more commas than there are columns.

In [132]:
!echo "Price,Customer" > "problem.csv"
!echo "23.45," >> "problem.csv"
!echo "14.64" >> "problem.csv"
!echo "Ooops!,," >> "problem.csv"
!echo "48.87,,," >> "problem.csv"

!cat problem.csv

Price,Customer
23.45,
14.64
Ooops!,,
48.87,,,


In [133]:
problem_df = pd.read_csv('problem.csv')
problem_df

ParserError: ignored

In [131]:
problem_df.info()  # output kept from an earlier run (note the execution count) in which the read succeeded

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Price   4 non-null      object
dtypes: object(1)
memory usage: 160.0+ bytes
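
One defensive option is to scan a file for malformed rows before loading it. The sketch below uses only the standard csv module and an in-memory copy of the problem file; the helper name find_bad_rows is hypothetical. It reports rows whose field count differs from the header and rows whose Price is not numeric. This check is stricter than the default behavior of read_csv, which typically fills short rows with missing values and raises only when a row has too many fields. pandas offers related controls as well, such as the on_bad_lines parameter of read_csv and pd.to_numeric with errors='coerce'.

```python
import csv
import io

# In-memory copy of problem.csv as built with echo above.
PROBLEM_CSV = "Price,Customer\n23.45,\n14.64\nOoops!,,\n48.87,,,\n"


def find_bad_rows(text):
    """Return (line_number, reason) pairs for rows that look malformed."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    problems = []
    # Data starts on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                (line_number, f"expected {len(header)} fields, saw {len(row)}")
            )
        value = row[0] if row else ""
        try:
            float(value)
        except ValueError:
            problems.append((line_number, f"non-numeric Price: {value!r}"))
    return problems


problems = find_bad_rows(PROBLEM_CSV)
for line_number, reason in problems:
    print(line_number, reason)
```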
