<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_building_flat_files_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building flat files tutorial

In this tutorial we'll review some useful shell commands that can be run from Jupyter notebooks. We'll then build example dataframes in pandas to illustrate how to write CSV, Parquet, and Avro files.

## Shell commands (as magics)

What directory are we currently in? The 'print working directory' command, aka pwd, tells us. Prefixing a command with ! tells Jupyter to run it in the system shell rather than as Python code.

### pwd

In [None]:
!pwd

Show the contents of the working directory. For now it contains just another directory.

### ls

In [None]:
!ls

make a new directory named some_dir

### mkdir

In [None]:
!mkdir some_dir

ls again to see the new directory

In [None]:
!ls

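change the current working directory to our new directory. We use the %cd magic here because !cd would run in a temporary subshell and the change would not persist.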

In [None]:
%cd some_dir

check our working directory

In [None]:
!pwd

create a new file with the touch command. It will be an empty text file.

### touch

In [None]:
!touch somefile.txt

List our new directory to see the file we created. Options give more detail: -l produces a "long" listing and -h prints file sizes in human-readable units. The size shows as 0, meaning the file is empty.
We also get the timestamp on the file and the file permissions.

This file is readable/writable by the owner, readable by the group and readable by everyone (shown as -rw-r--r--).

### command arguments

In [None]:
!ls -lh
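The permission string and size from the long listing can also be inspected from Python. This is a minimal sketch, assuming a Unix-like system such as the Colab VM. The temporary file and the explicit chmod to 0o644 are illustrative choices, not part of the tutorial's own files.

```python
import os
import stat
import tempfile

# Create a throwaway empty file, similar to what `touch` does.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

# Set owner read/write, group read, everyone read (octal 644).
os.chmod(path, 0o644)

info = os.stat(path)
mode_string = stat.filemode(info.st_mode)  # same format as the first column of `ls -l`
size_bytes = info.st_size                  # 0 for an empty file

print(mode_string, size_bytes)
os.remove(path)
```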

### echo

The echo command prints text to the screen or, with redirection, writes the same text to a file. The redirection operator determines what happens to an existing file: > overwrites it and >> appends to it.

In [None]:
!echo 'hello world!'

In [None]:
!echo "hi matt!" > somefile.txt ## > overwrite

The cat command prints the contents of a file. Be warned: it prints the whole file, so it is a poor choice for large files!

In [None]:
!cat somefile.txt

In [None]:
!echo "more text!" >> somefile.txt ## >> append

In [None]:
!cat somefile.txt
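The overwrite versus append distinction maps directly onto Python's file modes. A minimal sketch, assuming a throwaway temporary directory (the `work_dir` name is made up here) so that no real files are touched:

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so no real files are touched.
work_dir = Path(tempfile.mkdtemp())
target = work_dir / "somefile.txt"

# Mode "w" truncates the file first, like `echo ... > somefile.txt`.
with open(target, "w") as f:
    f.write("hi matt!\n")

# Mode "a" adds to the end, like `echo ... >> somefile.txt`.
with open(target, "a") as f:
    f.write("more text!\n")

contents = target.read_text()  # like `cat somefile.txt`
print(contents)
```

Running the "w" block a second time would leave only one line in the file, while repeating the "a" block keeps adding lines.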

cat on the os-release file tells us what operating system we are using in Colab.

In [None]:
!cat /etc/os-release

## Build a CSV from scratch with shell commands

we'll use jupyter magics to execute unix shell commands to create a simple csv file on our filesystem and then import it with pandas

In [None]:
%ls -l

In [None]:
!touch mycsv.csv

In [None]:
%ls -lh

In [None]:
!echo "Company,Email,Phone Number,TotalSales" > mycsv.csv
!echo "Company A,email@companya.com,888-555-3333,98.12" >> mycsv.csv
!echo "Company B,email@companyb.com,444-555-1234,123.45" >> mycsv.csv
!echo "Company C,email@companyc.com,987-123-4855,65.64" >> mycsv.csv
!echo "Company D,email@companyd.com,987-125-9542,49.18" >> mycsv.csv
!echo "Company E,email@companye.com,987-634-4687,44.38" >> mycsv.csv

In [None]:
!cat mycsv.csv

import the pandas package so we can deserialize the file.

In [None]:
import pandas as pd

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [None]:
my_csv = pd.read_csv('mycsv.csv')
my_csv
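read_csv takes the first line as the column names and infers a type for each column. A small sketch of that behaviour, using an in-memory string in place of the file on disk (the two rows are copied from the CSV built above):

```python
import io
import pandas as pd

csv_text = (
    "Company,Email,Phone Number,TotalSales\n"
    "Company A,email@companya.com,888-555-3333,98.12\n"
    "Company B,email@companyb.com,444-555-1234,123.45\n"
)

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())  # column names come from the header row
print(df.dtypes)            # TotalSales is parsed as a float column
```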

### shape

In [None]:
my_csv.shape

### info

In [None]:
my_csv.info()

### describe

In [None]:
my_csv.describe(include='all')

### head

In [None]:
my_csv.head(n=2)

### tail

In [None]:
my_csv.tail(n=2)

### cp

the cp shell command makes a copy of a file.

In [None]:
!cp mycsv.csv anewcsv.csv

In [None]:
%ls -l

### rm

removes a file or files.

In [None]:
!rm anewcsv.csv

In [None]:
%ls -l
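The same copy and remove steps can be done from Python when a script needs to run outside a notebook. A sketch using the standard library, assuming a temporary directory and a tiny stand-in CSV rather than the real mycsv.csv:

```python
import shutil
import tempfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())
original = work_dir / "mycsv.csv"
original.write_text("Company,TotalSales\nCompany A,98.12\n")

# Like `cp mycsv.csv anewcsv.csv`
duplicate = work_dir / "anewcsv.csv"
shutil.copy(original, duplicate)
copied_ok = duplicate.read_text() == original.read_text()

# Like `rm anewcsv.csv`
duplicate.unlink()
remaining = sorted(p.name for p in work_dir.iterdir())
print(copied_ok, remaining)
```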

## Faker

Faker is a useful Python package for generating synthetic data that looks like real data. It picks values at random from built-in lists, so the supply of distinct names is finite, but it is still quite useful when prototyping for an upcoming project that you have no data for yet.

In [None]:
!pip install faker

In [None]:
from faker import Faker
fake=Faker()

In [None]:
fake.name()

In [None]:
fake.name()

In [None]:
fake.name()

In [None]:
fake.random_int(min=10, max=60, step=1)

In [None]:
fake.random_int(min=10, max=60, step=1)

In [None]:
fake.random_int(min=10, max=60, step=10)

In [None]:
fake.random_int(min=10, max=60, step=10)

generate a list of dictionaries. for illustration purposes keep the list of dicts small, then scale it up to generate more data.

generate 2 records

In [None]:
our_fake_data_list = list()

for x in range(2):

  data={"name":fake.name(),
        "age":fake.random_int(min=10, max=60, step=1),
        "street":fake.street_address(),
        "city":fake.city(),"state":fake.state(),
        "zip":fake.zipcode(),
        "lng":float(fake.longitude()),
        "lat":float(fake.latitude())}

  our_fake_data_list.append(data)

our_fake_data_list

generate 1000 records

In [None]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "age":fake.random_int(min=10, max=60, step=1),
        "street":fake.street_address(),
        "city":fake.city(),"state":fake.state(),
        "zip":fake.zipcode(),
        "lng":float(fake.longitude()),
        "lat":float(fake.latitude())}

  our_fake_data_list.append(data)

In [None]:
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

In [None]:
our_fake_df.shape

In [None]:
our_fake_df.head()
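The list-of-dicts pattern is easy to see on a tiny scale. In this sketch two hand-written records (made-up values, no Faker needed) stand in for the generated ones:

```python
import pandas as pd

# Two hand-written records standing in for Faker output (values are made up).
records = [
    {"name": "Ada Example", "age": 36, "zip": "02139", "lat": 42.36},
    {"name": "Bob Sample", "age": 52, "zip": "94105", "lat": 37.79},
]

df = pd.DataFrame.from_records(records)

print(df.shape)            # one row per dict, one column per key
print(df.columns.tolist())
print(df["zip"].tolist())  # zip codes stay as strings, keeping leading zeros
```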

Serialize our dataframe to a CSV file. What encoding does this use by default? Look at the help text for to_csv to see.

In [None]:
our_fake_df.to_csv('fake_csv.csv',index=False)

In [None]:
%ls -lh

In [None]:
only_three_rows = pd.read_csv('fake_csv.csv',nrows=3)

In [None]:
only_three_rows.shape

In [None]:
our_fake_df.to_csv('fake_csv_utf_16.csv',index=False,encoding='UTF-16')
our_fake_df.to_csv('fake_csv_utf_32.csv',index=False,encoding='UTF-32')

In [None]:
%ls -lh
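The size differences in the listing follow from how many bytes each encoding spends per character. The sketch below, on a tiny made-up dataframe, encodes the same CSV text three ways. For plain ASCII text UTF-8 uses one byte per character, while Python's UTF-16 and UTF-32 codecs use two and four bytes per character plus a byte-order mark.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bob"], "city": ["Boston", "Denver"]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# Encode the same text three ways and count the resulting bytes.
sizes = {enc: len(csv_text.encode(enc)) for enc in ("utf-8", "utf-16", "utf-32")}
print(sizes)
```

pandas documents UTF-8 as the default encoding for to_csv, which is why the first file is the smallest of the three.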

## Compression, CSV files, Parquet files and AVRO

Here we compare file sizes across formats to see what compression buys us, and where it falls short.

In [None]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "Customer":fake.random_element(elements=('Yes','No')),
        "Status":fake.random_element(elements=('Active','Inactive'))}

  our_fake_data_list.append(data)

# build the dataframe once, after the loop has collected every record
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

our_fake_df

### CSV and Parquet

Pay close attention to the file size comparing the CSV to Parquet.

In [None]:
our_fake_df.to_csv('compression_csv.csv',index=False)
our_fake_df.to_parquet('compression_parquet.parquet',index=False)

In [None]:
%ls -lh

In [None]:
our_fake_data_list = list()

for x in range(1000):

  data={"name":fake.name(),
        "Customer":fake.lexify('???')}

  our_fake_data_list.append(data)

# build the dataframe once, after the loop has collected every record
our_fake_df = pd.DataFrame.from_records(our_fake_data_list)

our_fake_df.to_csv('random_letters_compression_csv.csv',index=False)
our_fake_df.to_parquet('random_letters_compression_parquet.parquet',index=False)

In [None]:
our_fake_df

In [None]:
our_fake_df.Customer.nunique() # almost all unique values

In [None]:
our_fake_df.name.nunique() # almost all unique values

In [None]:
%ls -lh

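Wait, what? Parquet is supposed to use compression! Why is the random_letters_compression Parquet file larger than the CSV? Hint: nearly every value in this dataframe is unique, so there is little repetition for encoding or compression to exploit, and a Parquet file also stores its schema and metadata alongside the data.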
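The effect of repetition on compression can be seen without Parquet at all. The sketch below uses zlib from the standard library as a stand-in for Parquet's codec (an assumption made for illustration, since Parquet's encodings work differently in detail) and compares a column with two distinct values against a column of random letters.

```python
import random
import string
import zlib

random.seed(0)  # fixed seed so the sketch is repeatable
n = 1000

# Low-cardinality column: only two distinct values, like the Yes/No Customer column.
repetitive = "\n".join(random.choice(["Yes", "No"]) for _ in range(n)).encode()

# High-cardinality column: three random letters, similar to fake.lexify('???').
unique_ish = "\n".join(
    "".join(random.choices(string.ascii_letters, k=3)) for _ in range(n)
).encode()

def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)

repetitive_ratio = ratio(repetitive)
unique_ratio = ratio(unique_ish)
print(round(repetitive_ratio, 2), round(unique_ratio, 2))
```

The repetitive column shrinks far more than the random one, which is the same pattern the two pairs of files show.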

### AVRO

In [None]:
!pip install fastavro

In [None]:
from fastavro import writer, reader, parse_schema

In [None]:
schema = {
    'doc': 'fake test',
    'name': 'faketest',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'Customer', 'type': 'string'}
    ]
}

parsed_schema = parse_schema(schema)

In [None]:
records = our_fake_df.to_dict('records')

# Write the records to an Avro file using the parsed schema
with open('random_letters_compression_csv.avro', 'wb') as out:
    writer(out, parsed_schema, records)

In [None]:
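compression_schema = {
  'doc': 'fake test',
  'name': 'faketest',
  'namespace': 'test',
  'type': 'record',
  'fields' : [
  {"name" : "name", 'type' : 'string'},
  {"name" : "age", 'type' : 'int'},
  {"name" : "street", 'type' : 'string'},
  {"name" : "city", 'type' : 'string'},
  {"name" : "state", 'type' : 'string'},
  {"name" : "zip", 'type' : 'string'},
  {"name" : "lng", 'type' : 'float'},
  {"name" : "lat", 'type' : 'float'}
  ]}

parsed_comp_schema = parse_schema(compression_schema)

# our_fake_df now holds only name/Customer, so reload the address data written
# to fake_csv.csv earlier; read zip as text to match the 'string' field type
address_df = pd.read_csv('fake_csv.csv', dtype={'zip': str})
records = address_df.to_dict('records')

# Write to Avro file using the schema that matches these records
with open('compression_avro.avro', 'wb') as out:
    writer(out, parsed_comp_schema, records)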

In [None]:
!ls -lh

## Problems with files

Files can contain problematic data. Imagine a price column with words in it, or rows with more commas than there are columns.

In [None]:
!echo "Price,Customer" > "problem.csv"
!echo "23.45," >> "problem.csv"
!echo "14.64" >> "problem.csv"
!echo "Ooops!,," >> "problem.csv"
!echo "48.87,,," >> "problem.csv"

!cat problem.csv

In [None]:
problem_df = pd.read_csv('problem.csv')
problem_df

In [None]:
problem_df.info()
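One possible way to cope with a file like this is sketched below. It is not the only approach. It assumes the same five lines as problem.csv, held in a string, and uses Python's csv module to tolerate the ragged rows before handing them to pandas. The padding and trimming rule is an illustrative choice.

```python
import csv
import io
import pandas as pd

raw = "Price,Customer\n23.45,\n14.64\nOoops!,,\n48.87,,,\n"

# csv.reader tolerates ragged rows, so pad or trim each one to the header width.
rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]
width = len(header)
cleaned = [(row + [""] * width)[:width] for row in body]

df = pd.DataFrame(cleaned, columns=header)

# errors="coerce" turns anything that is not a number into NaN instead of raising.
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
bad_rows = int(df["Price"].isna().sum())
print(df)
print("rows with an unparseable price:", bad_rows)
```

Counting the NaN values after coercion gives a quick measure of how many rows need attention before the column can be trusted as numeric.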