
#Tutorial PETL
`petl` is a general-purpose Python package for implementing ETL (Extract, Transform, Load) workflows. In this tutorial you will learn how to use `petl` to extract data from different sources into so-called **tables**. A **table** contains rows and columns, which can be transformed based on user-defined criteria. The transformed **table** can then be loaded into another database for further use.

##Table of Contents


1.   Creating our first table.
2.   Basic transformations.
3.   Selecting Rows.
5.   Filling Missing Values.
6.   Deduplicating Rows.
7.   Reshaping tables.
8.   Importing and exporting tables.

##1. Creating our first table

We will create our first table from a list of lists, i.e. a nested Python list whose first row holds the column names. Since `petl` is not part of the Python standard library, we first install it via pip.

In [None]:
pip install petl

Collecting petl
  Downloading petl-1.7.8.tar.gz (401 kB)

In [None]:
import petl as etl
etl.__version__

'1.7.8'

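Now we can run the code below to create a toy example. Here the list is passed through `etl.convert()`; because no conversions are specified, the values stay unchanged and we get back a `petl` table object that offers methods such as `.look()`. (`petl` also provides `etl.wrap()` for wrapping a plain list of lists in this way.)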

In [None]:
table1 = [["foo","bar","baz"],
          ["a"  ,  1  ,  3.4],
          ["b"  ,  2  ,  7.4],
          ["c"  ,  6  ,  2.2],
          ["d"  ,  9  ,  8.1]]

print(f"Type before transformation: {type(table1)}\n")
table1 = etl.convert(table1)
print(f"Type after transformation: {type(table1)}\n")

# .look() method enables us to visualize the table nicely.
table1.look()

Type before transformation: <class 'list'>

Type after transformation: <class 'petl.transform.conversions.FieldConvertView'>



+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'a' |   1 | 3.4 |
+-----+-----+-----+
| 'b' |   2 | 7.4 |
+-----+-----+-----+
| 'c' |   6 | 2.2 |
+-----+-----+-----+
| 'd' |   9 | 8.1 |
+-----+-----+-----+
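To make the table layout concrete, here is a minimal plain-Python sketch. It does not call `petl`; it only illustrates the convention used above, where the first row holds the field names and every later row holds data. The helper name `records` is hypothetical and exists only for this illustration.

```python
# Plain-Python illustration of the table layout used above:
# the first row holds the field names, the remaining rows hold the data.
table1 = [["foo", "bar", "baz"],
          ["a", 1, 3.4],
          ["b", 2, 7.4],
          ["c", 6, 2.2],
          ["d", 9, 8.1]]


def records(table):
    """Yield each data row as a dict keyed by the header fields (hypothetical helper)."""
    rows = iter(table)
    header = next(rows)
    for row in rows:
        yield dict(zip(header, row))


header = table1[0]
data_rows = table1[1:]
first_record = next(records(table1))

print(header)          # ['foo', 'bar', 'baz']
print(len(data_rows))  # 4
print(first_record)    # {'foo': 'a', 'bar': 1, 'baz': 3.4}
```

Keeping this layout in mind helps with the functions in the next chapter, which refer to columns by the names in the header row.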

Now that you know how to create a table in `petl`, we will briefly look at importing tables from files and then move on to the second chapter, where we will see how to add and remove parts of a table.

##1.2 Importing JSON and CSV files and converting them to tables

We can also import JSON files with `etl.fromjson()` or CSV files with `etl.fromcsv()`, as shown below.

In [None]:
# mount Google Drive, where the JSON and CSV files are located
from google.colab import drive
drive.mount('/content/drive')

# load the JSON file as a petl table
# (separate names are used so that the toy table1 from above is not overwritten)
iris_json = etl.fromjson("/content/drive/MyDrive/Daten/iris.json")
print(iris_json)

# load the CSV file as a petl table
iris_csv = etl.fromcsv("/content/drive/MyDrive/Daten/iris.csv")
print(iris_csv)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
+-------------+------------+-------------+------------+---------+
| sepalLength | sepalWidth | petalLength | petalWidth | species |
+=============+============+=============+============+=========+
|         5.1 |        3.5 |         1.4 |        0.2 | setosa  |
+-------------+------------+-------------+------------+---------+
|         4.9 |        3.0 |         1.4 |        0.2 | setosa  |
+-------------+------------+-------------+------------+---------+
|         4.7 |        3.2 |         1.3 |        0.2 | setosa  |
+-------------+------------+-------------+------------+---------+
|         4.6 |        3.1 |         1.5 |        0.2 | setosa  |
+-------------+------------+-------------+------------+---------+
|         5.0 |        3.6 |         1.4 |        0.2 | setosa  |
+-------------+------------+-------------+------------+---------+
...

+--------------+-------------+--------------+-------------+---------+
| se
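One point worth keeping in mind when importing CSV data is that the values arrive as text, so numeric columns usually need an explicit conversion. The sketch below illustrates this with the standard-library `csv` module only, not with `petl`. The in-memory sample and its column names are assumptions borrowed from the JSON output above; it is not the real `iris.csv`.

```python
import csv
import io

# In-memory stand-in for a CSV file (hypothetical sample, not the real iris.csv).
sample = io.StringIO(
    "sepalLength,sepalWidth,species\n"
    "5.1,3.5,setosa\n"
    "4.9,3.0,setosa\n"
)

rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]

# Every value read from CSV is a string, including the numbers.
print(type(data[0][0]))  # <class 'str'>

# Convert the numeric columns explicitly before doing arithmetic on them.
numeric_fields = {"sepalLength", "sepalWidth"}
converted = [
    [float(value) if field in numeric_fields else value
     for field, value in zip(header, row)]
    for row in data
]
print(converted[0])  # [5.1, 3.5, 'setosa']
```

In `petl`, the same kind of conversion can be expressed with `etl.convert()`, for example `etl.convert(table, 'sepalLength', float)`.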

##2. Basic transformations
Among the most basic table operations are accessing specific chunks of a table's rows and columns and adding new rows and columns.

If we want to access only part of the rows we can use `.rowslice()` to choose which rows to keep.

In [None]:
etl.rowslice(table1, 1, 4).look()

+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'b' |   2 | 7.4 |
+-----+-----+-----+
| 'c' |   6 | 2.2 |
+-----+-----+-----+
| 'd' |   9 | 8.1 |
+-----+-----+-----+
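The output above suggests how the two numbers are interpreted: they index the data rows starting at zero, the start is included, the stop is excluded, and the header is always kept. A minimal plain-Python model of that behaviour, using `itertools.islice` rather than `petl` itself (the name `rowslice_sketch` is hypothetical), could look like this:

```python
from itertools import islice

table1 = [["foo", "bar", "baz"],
          ["a", 1, 3.4],
          ["b", 2, 7.4],
          ["c", 6, 2.2],
          ["d", 9, 8.1]]


def rowslice_sketch(table, start, stop):
    """Keep the header plus data rows start..stop-1 (zero-based, stop exclusive)."""
    rows = iter(table)
    header = next(rows)          # the header is never counted as a data row
    return [header] + list(islice(rows, start, stop))


# Mirrors etl.rowslice(table1, 1, 4) from the cell above.
result = rowslice_sketch(table1, 1, 4)
for row in result:
    print(row)
```

This is only a model of the slicing semantics seen in the output, not `petl`'s implementation.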

If we want to access only certain columns we can use `.cut()`.

In [None]:
etl.cut(table1,'foo','baz').look()

+-----+-----+
| foo | baz |
+=====+=====+
| 'a' | 3.4 |
+-----+-----+
| 'b' | 7.4 |
+-----+-----+
| 'c' | 2.2 |
+-----+-----+
| 'd' | 8.1 |
+-----+-----+
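Selecting columns by name can be pictured as looking up each requested field in the header and keeping only those positions in every row. The following plain-Python sketch models this idea without calling `petl`; `cut_sketch` is a hypothetical name, and the sketch assumes that the output columns follow the order in which the field names are passed.

```python
table1 = [["foo", "bar", "baz"],
          ["a", 1, 3.4],
          ["b", 2, 7.4],
          ["c", 6, 2.2],
          ["d", 9, 8.1]]


def cut_sketch(table, *fields):
    """Return a new table containing only the named columns, in the order given."""
    header = table[0]
    indices = [header.index(field) for field in fields]
    return [[row[i] for i in indices] for row in table]


# Requesting 'baz' before 'foo' also reorders the columns in this model.
selected = cut_sketch(table1, "baz", "foo")
for row in selected:
    print(row)
```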

Suppose we have similar datasets from multiple sources. Wouldn't it be practical to combine them into one table? `petl` provides this functionality via `.cat()`, which stands for concatenation.

In [None]:
table2 = [["foo","bar","baz"],
          ["e"  ,  2  ,  5.5],
          ["f"  ,  7  ,  8.4]]

table3 = etl.cat(table1, table2)
table3.look()

+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'a' |   1 | 3.4 |
+-----+-----+-----+
| 'b' |   2 | 7.4 |
+-----+-----+-----+
| 'c' |   6 | 2.2 |
+-----+-----+-----+
| 'd' |   9 | 8.1 |
+-----+-----+-----+
| 'e' |   2 | 5.5 |
+-----+-----+-----+
...
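The example above concatenates two tables with identical headers. A natural question is what happens when the headers differ. The plain-Python sketch below models one reasonable answer and does not call `petl`: it assumes that the combined header lists the fields in order of first appearance and that fields missing from a table are filled with `None`, which is how the `petl` documentation describes `cat()`. The name `cat_sketch` and the two small input tables are hypothetical.

```python
def cat_sketch(*tables, missing=None):
    """Concatenate tables, padding fields a table lacks with `missing`."""
    # Combined header: every field, in order of first appearance.
    out_header = []
    for table in tables:
        for field in table[0]:
            if field not in out_header:
                out_header.append(field)

    result = [out_header]
    for table in tables:
        header = table[0]
        for row in table[1:]:
            record = dict(zip(header, row))
            result.append([record.get(field, missing) for field in out_header])
    return result


left = [["foo", "bar"],
        ["a", 1]]
right = [["bar", "baz"],
         [2, 5.5]]

combined = cat_sketch(left, right)
for row in combined:
    print(row)
```

With identical headers, as in the tutorial's example, no padding is needed and the rows are simply appended.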

If we want to add a column, we can do so via `.addcolumn()`. Note that the name of the new column (here `"zoo"`) is passed separately from its values.

In [None]:
table1 = etl.addcolumn(table1,"zoo", [1,2,3,4])
table1.look()

+-----+-----+-----+-----+
| foo | bar | baz | zoo |
+=====+=====+=====+=====+
| 'a' |   1 | 3.4 |   1 |
+-----+-----+-----+-----+
| 'b' |   2 | 7.4 |   2 |
+-----+-----+-----+-----+
| 'c' |   6 | 2.2 |   3 |
+-----+-----+-----+-----+
| 'd' |   9 | 8.1 |   4 |
+-----+-----+-----+-----+

To perform the opposite operation, i.e. removing a column, `.cutout()` can be used. In contrast to `.cut()`, which keeps only the columns you name, this method returns the table without the specified column.

In [None]:
table1 = etl.cutout(table1,"zoo")
table1.look()

+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'a' |   1 | 3.4 |
+-----+-----+-----+
| 'b' |   2 | 7.4 |
+-----+-----+-----+
| 'c' |   6 | 2.2 |
+-----+-----+-----+
| 'd' |   9 | 8.1 |
+-----+-----+-----+
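The cell above assigns the result back to `table1`, which hints that the transformation produces a new table rather than editing the old one in place. The plain-Python sketch below models column removal in that spirit, without calling `petl`; `cutout_sketch` and `with_zoo` are hypothetical names.

```python
with_zoo = [["foo", "bar", "baz", "zoo"],
            ["a", 1, 3.4, 1],
            ["b", 2, 7.4, 2],
            ["c", 6, 2.2, 3],
            ["d", 9, 8.1, 4]]


def cutout_sketch(table, *fields):
    """Return a new table without the named columns; the input is left untouched."""
    header = table[0]
    keep = [i for i, name in enumerate(header) if name not in fields]
    return [[row[i] for i in keep] for row in table]


trimmed = cutout_sketch(with_zoo, "zoo")
print(trimmed[0])   # ['foo', 'bar', 'baz']
print(with_zoo[0])  # ['foo', 'bar', 'baz', 'zoo']
```

Because the input is not modified in this model, the result has to be stored under a name (possibly the same one) if it should replace the original table.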

You learned how to perform some basic transformations of the table, but what if you want to look up values not based on indices but rather on criteria such as a certain column's entry being bigger than some threshold? In the next chapter we are going to take a look at how to select rows via user given conditions.