# "Data Engineering - Week 1"
> "This is week 1 from the Data Engineering Zoomcamp course."

- toc: True
- branch: master
- badges: true
- comments: true
- categories: [data engineering, jupyter]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

Here is the video for this week:

> youtube: https://youtu.be/bkJZDmreIpA

GitHub repo: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/dataset.md

# Google Cloud Platform

A very short introduction to Google Cloud services:

> youtube: https://youtu.be/18jIzE41fJ4

# Docker




> youtube: https://youtu.be/EYNwNlOrpr0


The main goal is to take some data (in CSV format, for example), process it, and load it into a Postgres database:

![](images/data-engineering-w1/data-pipeline.png)

Let's write a Dockerfile and build an image that runs a simple Python script:

In [None]:
# pipeline.py

import sys
import pandas as pd
print(sys.argv)
day = sys.argv[1]
# some fancy stuff with pandas

print(f'job finished successfully for day = {day}')

In [None]:
# Dockerfile

# the base image to start from
FROM python:3.9.1

# run a command to install Python packages
RUN pip install pandas

# set the working directory (similar to the cd command in Linux)
WORKDIR /app
# copy the file from the current folder on the host into the working directory
COPY pipeline.py pipeline.py

# run "python pipeline.py" when the container is started with docker run
ENTRYPOINT [ "python", "pipeline.py" ]

Use the following command to build the image from the Dockerfile in the current directory:

In [None]:
docker build -t test:pandas .
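
Because the Dockerfile sets `ENTRYPOINT`, any arguments placed after the image name in `docker run` (for example `docker run -it test:pandas 2021-01-15`, where the date is only an illustrative value) are appended to `python pipeline.py` and end up in `sys.argv`. Below is a minimal sketch of how the script could read that argument a little more defensively. The `parse_day` helper is hypothetical and not part of the original script:

```python
import sys


def parse_day(argv):
    """Return the day argument, i.e. the first item after the script name."""
    # argv[0] is the script name ('pipeline.py'); real arguments start at index 1
    if len(argv) < 2:
        raise SystemExit("usage: python pipeline.py <day>")
    return argv[1]


if __name__ == "__main__":
    # Simulate what the script would see for:
    #   docker run -it test:pandas 2021-01-15
    day = parse_day(["pipeline.py", "2021-01-15"])
    print(f"job finished successfully for day = {day}")
```

In the real script you would pass `sys.argv` instead of the hard-coded list.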

# PostgreSQL

> youtube: https://youtu.be/2JM-ziJt0WI

Now let's see how we can run a PostgreSQL database with Docker and load some data into it.

Run the *postgres:13* image with a few environment variables (set with -e), a volume that maps a local folder on the host machine to a path inside the container (the -v flag), and port 5432 published (the -p flag) so that clients outside the container (our Python code, for example) can connect to the database.

In [None]:
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13

Download the data from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), under `2021 > January > Yellow Taxi Trip Records`. The file name is *yellow_tripdata_2021-01.csv*.

The following code reads the file in chunks and imports it into Postgres.

In [None]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine

# create the engine; the connection URL has the form postgresql://user:password@host:port/database
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# read the CSV lazily, 100,000 rows at a time
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)

# iterate over the chunks and append each one to the table;
# a for loop stops cleanly when the iterator is exhausted
# (calling next() inside "while True" would end with a StopIteration error)
for df in df_iter:
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
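
The chunking idea itself does not depend on pandas or Postgres. As a rough illustration, the sketch below uses only the standard library and swaps in an in-memory SQLite database so that it runs without a server. The table name `trips`, its two columns, the sample rows, and the `chunked` helper are all made up for this example:

```python
import csv
import io
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` items until `rows` is exhausted."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def load_csv(conn, csv_file, chunk_size):
    """Append the CSV rows to the `trips` table chunk by chunk.

    Returns the number of chunks that were written.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS trips (pickup TEXT, total_amount REAL)")
    reader = csv.DictReader(csv_file)
    n_chunks = 0
    for chunk in chunked(reader, chunk_size):
        # each row is a dict, so named placeholders pick the right columns
        conn.executemany("INSERT INTO trips VALUES (:pickup, :total_amount)", chunk)
        conn.commit()
        n_chunks += 1
    return n_chunks


# Tiny stand-in for the real CSV file (values are invented)
sample = io.StringIO(
    "pickup,total_amount\n"
    "2021-01-01 00:30:10,11.8\n"
    "2021-01-01 00:51:20,4.3\n"
    "2021-01-01 00:43:30,51.95\n"
)
conn = sqlite3.connect(":memory:")
n_chunks = load_csv(conn, sample, chunk_size=2)
row_count = conn.execute("SELECT count(*) FROM trips").fetchone()[0]
print(n_chunks, row_count)
```

Only one chunk is held in memory at a time, which is the same reason the pandas version passes `chunksize` instead of reading the whole file at once.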

Next, we need to connect to the Postgres database. `pgcli` is a Python package that provides a command-line client for this.

In [None]:
!pip install pgcli

In [None]:
pgcli -h localhost -p 5432 -u root -d ny_taxi

Then, using the `\dt` command, we can list the tables in the database.

Use the `\d yellow_taxi_data` command to see the schema of the imported table:

In [None]:
+-----------------------+-----------------------------+-----------+
| Column                | Type                        | Modifiers |
|-----------------------+-----------------------------+-----------|
| index                 | bigint                      |           |
| VendorID              | bigint                      |           |
| tpep_pickup_datetime  | timestamp without time zone |           |
| tpep_dropoff_datetime | timestamp without time zone |           |
| passenger_count       | bigint                      |           |
| trip_distance         | double precision            |           |
| RatecodeID            | bigint                      |           |
| store_and_fwd_flag    | text                        |           |
| PULocationID          | bigint                      |           |
| DOLocationID          | bigint                      |           |
| payment_type          | bigint                      |           |
| fare_amount           | double precision            |           |
| extra                 | double precision            |           |
| mta_tax               | double precision            |           |
| tip_amount            | double precision            |           |
| tolls_amount          | double precision            |           |
| improvement_surcharge | double precision            |           |
| total_amount          | double precision            |           |
| congestion_surcharge  | double precision            |           |
+-----------------------+-----------------------------+-----------+


We can also run any query against the imported tables. For example:

In [None]:
root@localhost:ny_taxi> SELECT max(tpep_pickup_datetime), min(tpep_pickup_datetime), max(total_amount) FROM yellow_taxi_data;
+---------------------+---------------------+---------+
| max                 | min                 | max     |
|---------------------+---------------------+---------|
| 2021-02-22 16:52:16 | 2008-12-31 23:05:14 | 7661.28 |
+---------------------+---------------------+---------+
SELECT 1
Time: 0.204s

We can then write the whole pipeline as a single script that reads the data, processes it, and loads it into Postgres.

We will have one Docker container for the data pipeline and one for Postgres.
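
One possible way to parameterize such a pipeline script is to take the connection details as command-line arguments instead of hard-coding them. The following is only a sketch: the argument names, their defaults, and the `build_url` helper are assumptions made for illustration, not the course's actual script.

```python
import argparse


def build_url(user, password, host, port, db):
    """Assemble a URL of the form postgresql://user:password@host:port/database."""
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"


def parse_args(argv=None):
    """Parse the (hypothetical) command-line options of the ingestion script."""
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user", default="root", help="database user")
    parser.add_argument("--password", default="root", help="database password")
    parser.add_argument("--host", default="localhost", help="database host")
    parser.add_argument("--port", type=int, default=5432, help="database port")
    parser.add_argument("--db", required=True, help="database name")
    parser.add_argument("--table", required=True, help="destination table")
    return parser.parse_args(argv)


# Example invocation with the values used earlier in this post
args = parse_args(["--db", "ny_taxi", "--table", "yellow_taxi_data"])
url = build_url(args.user, args.password, args.host, args.port, args.db)
print(url)
```

Note that when the pipeline runs in its own container, `localhost` refers to that container itself, so the host would typically be the name of the Postgres container on a shared Docker network.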

# PgAdmin

pgAdmin is a graphical tool for connecting to and managing PostgreSQL databases.