Skip to content

org-not-included/simple_analytics_pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

simple_analytics_pipeline

A simple example of loading data from CSV into SQLite DB, and then representing some data as charts.

CodeFactor Grade GitHub languages GitHub license
GitHub pull requests GitHub issues GitHub contributors
GitHub Release Date GitHub last commit GitHub commit activity
GitHub forks GitHub stars GitHub watchers Twitter Follow

Quick-Start

Download Github Repo

git clone https://github.com/org-not-included/simple_analytics_pipeline
cd simple_analytics_pipeline

Load data and build sample chart

source setup.sh

Process Overview

What a Data Engineer does

A data engineer is responsible for Extracting (reading), Transforming, and Loading (saving) data between different locations in your environment. Primarily loading data into/out of your database/data warehouse.
As part of their skillset, they need to understand:

  1. Languages:
    1. Python (coding)
    2. SQL (querying syntax)
  2. Tools:
    1. Pandas (parse/transform)
    2. SQLite (store)
    3. Matplotlib (visualize)

ETL (Extract, Transform, Load)

  1. Use python3 sqlite to initialize a local db connection:
import sqlite3
conn = sqlite3.connect(‘your_database_name’)

1/2. (Extract/Transform) Use pandas (io) to convert the csv into a python object (dictionary):

 dataframe = pandas.read_csv(filename)

The reason to convert it into a dict is because you can then use pandas to load a DataFrame into the database your_database_name.

Read a csv file and use pandas to read/convert it: Create a dataframe which is a pandas object (python dict with extra functionality) of the data that is read from csv.

csv -> Pandas (io read) -> python dict -> Pandas (io write) -> sqlite table with table name


  1. (Load) Save the Pandas DataFrame into your database, in a new table called sample_table.
dataframe.to_sql(‘sample_table’, conn)

Use Dataframe to write into your connection. Result is a table with the name of your choice stored in your_database_name specified by the connection.


  1. (Optional) Visualize the Result
def create_chart(sql, conn, x, y):
    # Fetch query results as pandas dataframe (dictionary object)
    dataframe = pd.read_sql_query(sql, con=conn)
    # pprint(dataframe)
    dataframe.plot(x=x, y=y, kind="bar")
    plt.show()