<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_tutorial_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Extraction from an OLTP Source System

In this tutorial we'll begin exploring the concept of querying an OLTP database, not for analytics, but rather to start extracting data for downstream data engineering applications.

**Data Engineering**

**Matthew Pecsok 2/10/2023**




# 1.&nbsp;A quick overview of lists and tuples


## Lists

A list is an object that can store multiple items, is changeable (mutable), and allows duplicate values. Define a list by wrapping its elements in square brackets [].

A list can begin empty and items can be added to it.

In [1]:
import sys

In [2]:
my_list = [] #create an empty list
my_list

[]

In [3]:
my_list.append('movie_1') # append a single item to the list
my_list

['movie_1']

In [4]:
my_list.extend(['movie_2','movie_3']) # concatenate two lists together creating a single list as a result
my_list

['movie_1', 'movie_2', 'movie_3']

In [5]:
len(my_list) # how long is our new list?

3

In [6]:
my_list[0] # get the zeroth (or first depending on how you count) element in the list

'movie_1'
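
The difference between append and extend is easy to mix up, so here is a minimal sketch of what happens when a list is passed to each. The variable name watchlist is made up for this illustration and is not used elsewhere in the tutorial.

```python
# Hypothetical list used only for this illustration.
watchlist = ['movie_1']
watchlist.append(['movie_2', 'movie_3'])  # adds ONE item, which is itself a list
print(watchlist)  # ['movie_1', ['movie_2', 'movie_3']]
nested_length = len(watchlist)

watchlist = ['movie_1']
watchlist.extend(['movie_2', 'movie_3'])  # adds each item of the second list individually
print(watchlist)  # ['movie_1', 'movie_2', 'movie_3']
flat_length = len(watchlist)
```

If you want one flat list, extend is usually the method you are after.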

### YOUDO

Create a list containing apples, apples, oranges and pears. Apples should appear twice, because lists allow duplicate values. Call the list variable "fruit".

## Tuple

What if we want to store immutable data? We can use a **tuple**. A tuple is similar to a list except that it is unchangeable. Notice the parentheses instead of brackets.

A tuple, like a list, can hold multiple data types; the elements need not all be the same type. Here we store a string as well as an integer literal value.

In [7]:
movie_1 = ('Toy Story',1995)
movie_1

('Toy Story', 1995)

In [8]:
movie_2 = ('Monsters Inc.',2001)
movie_2

('Monsters Inc.', 2001)

In [9]:
my_complex_list = []
my_complex_list

[]

In [10]:
my_complex_list.extend([movie_1,movie_2])

In [11]:
my_complex_list

[('Toy Story', 1995), ('Monsters Inc.', 2001)]

In [12]:
my_complex_list[0] # get the zeroth (or first depending on how you count) element in the list

('Toy Story', 1995)

In [13]:
my_complex_list[0][1]

1995
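
To make "unchangeable" concrete, the sketch below tries to modify a tuple and catches the resulting error. The names movie, changed and corrected_movie are assumptions made for this example only.

```python
# Hypothetical example value, mirroring the movie tuples used above.
movie = ('Toy Story', 1995)

try:
    movie[1] = 1996  # tuples do not support item assignment
    changed = True
except TypeError as err:
    changed = False
    print(f'could not change the tuple: {err}')

# To "change" a tuple, build a new tuple instead.
corrected_movie = (movie[0], 1996)
print(corrected_movie)  # ('Toy Story', 1996)
```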

### YOUDO

Create a tuple with a first name, last name and age.

## For loops on lists

It's quite easy to loop through a list and execute some code on each element of that list. Here are a few examples to get you comfortable.

Please note, for loops are not the most efficient way to accomplish tasks like this, but they are simple to understand. Experienced programmers might prefer list comprehensions coupled with functions.

Note that the code that is enclosed in the for loop is indented. Anything not indented is outside of the for loop and will not run for every iteration.

some_numbers = [1,2,3,4,5,6,7,8,9,10]

for number in some_numbers:
  print(f'the current number is {number}') # <- this line is indented

print('we are outside the for loop and run once')

In [14]:
some_numbers = [1,2,3,4,5,6,7,8,9,10]

for number in some_numbers:
  print(f'the current number is {number}') # this line of code is indented

print('we are outside the for loop and run once') # this line of code is NOT indented

the current number is 1
the current number is 2
the current number is 3
the current number is 4
the current number is 5
the current number is 6
the current number is 7
the current number is 8
the current number is 9
the current number is 10
we are outside the for loop and run once


In [15]:
for number in some_numbers:
  print(f"adding 10 to {number} gives the result {number + 10}")

adding 10 to 1 gives the result 11
adding 10 to 2 gives the result 12
adding 10 to 3 gives the result 13
adding 10 to 4 gives the result 14
adding 10 to 5 gives the result 15
adding 10 to 6 gives the result 16
adding 10 to 7 gives the result 17
adding 10 to 8 gives the result 18
adding 10 to 9 gives the result 19
adding 10 to 10 gives the result 20


The loop variable name should describe what it contains, but Python does not require this. Any valid name works, as the next cell shows.

In [16]:
for water in some_numbers:
  print(f"adding 10 to {water} gives the result {water + 10}")

adding 10 to 1 gives the result 11
adding 10 to 2 gives the result 12
adding 10 to 3 gives the result 13
adding 10 to 4 gives the result 14
adding 10 to 5 gives the result 15
adding 10 to 6 gives the result 16
adding 10 to 7 gives the result 17
adding 10 to 8 gives the result 18
adding 10 to 9 gives the result 19
adding 10 to 10 gives the result 20


### YOUDO

Create a list of the numbers 10, 15 and 20. Run a for loop on this list, add 100 to each number, and print the result.

In [17]:
i = 0

fruit_list = ['apples','oranges','bananas']

for fruit in fruit_list:
  print(f'the current fruit is {fruit}')
  i += 1

print(f'we looped {i} times')

the current fruit is apples
the current fruit is oranges
the current fruit is bananas
we looped 3 times


In [18]:
for the_number in [1,2,3,4,5,6,42]:
  print(f'{the_number} is even = {(the_number%2==0)}')

1 is even = False
2 is even = True
3 is even = False
4 is even = True
5 is even = False
6 is even = True
42 is even = True


The takeaway here is that a for loop allows us to iterate over a list and execute code for each element in the list.
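
The list comprehensions mentioned at the start of this section do the same work in a single expression. The sketch below, with result names chosen only for this illustration, builds the same list both ways so the two styles can be compared.

```python
some_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# for loop version: start with an empty list and append inside the loop
plus_ten_loop = []
for number in some_numbers:
    plus_ten_loop.append(number + 10)

# list comprehension version: same result in one expression
plus_ten = [number + 10 for number in some_numbers]

# a comprehension can also filter with a trailing if
evens = [number for number in some_numbers if number % 2 == 0]

print(plus_ten)
print(evens)
```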

## Brief introduction to Conditional Statements. If statements.

What if we want to only operate on certain elements of the list? We can use Python conditions to do so.

The examples later in this section check whether the_number leaves a remainder when divided by 2. If the remainder is 0, we conditionally print the number and say that it is even.

If statements are also indented to denote what should execute if the statement evaluates to True.

In [19]:
if True:
  print('Yes!')

Yes!


In [20]:
if 1==1:
  print('Yes!')

Yes!


In [21]:
if 1!=2:
  print('Yes!')

Yes!


In [22]:
if 2!=2:
  print('Yes!')

### YOUDO

Create an if statement that checks to see if 4 is greater than 2 and print "It is!" if the result is True.

In [63]:
if 4>2:
  print('it is')

it is


## Combining For Loops and Conditionals



In [23]:
for the_number in [1,2,3,4,5,6,42]:
  if (the_number%2==0):
    print(f'{the_number} is even')


2 is even
4 is even
6 is even
42 is even


In [24]:
for the_number in [1,2,3,4,5,6,42]:
  if (the_number%2!=0):
    print(f'{the_number} is odd')

1 is odd
3 is odd
5 is odd


If the number does not have a remainder of 0 when divided by 2 it must be odd.
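
The two loops above can also be combined into one with an else branch, since every integer is either even or odd. This is a small sketch; the labels list is an assumption added so the result can be inspected afterwards.

```python
labels = []  # hypothetical list that collects the label chosen for each number

for the_number in [1, 2, 3, 4, 5, 6, 42]:
    if the_number % 2 == 0:
        label = 'even'
    else:  # runs only when the condition above is False
        label = 'odd'
    labels.append(label)
    print(f'{the_number} is {label}')
```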

### YOUDO  

Loop on the "fruit" variable and use an If statement to conditionally print that the element is an apple if it is truly an apple.

In [64]:
fruit

'bananas'

# 2.&nbsp;Package import and get database

In [25]:
import pandas as pd
import sqlite3
from tqdm import tqdm

In [26]:
!wget -qO movies.db https://github.com/matthewpecsok/data_engineering/blob/main/data/movies.sqlite?raw=true

!pip -q install --upgrade ipython
!pip -q install --upgrade ipython-sql

con_movie_source = sqlite3.connect('movies.db')

%load_ext sql
%sql sqlite:///movies.db

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires ipython==7.34.0, but you have ipython 8.32.0 which is incompatible.

In [27]:
con_movie_source # con is our connection to the database

<sqlite3.Connection at 0x7d1014926b60>

In [28]:
cur_movie_source = con_movie_source.cursor()
cur_movie_source # cursor

<sqlite3.Cursor at 0x7d0ffffc8540>

## Querying with Pandas

We can directly query with Pandas and receive a DataFrame as the result of the query.

Below we query the movies table and intentionally limit the query to two movies. The resulting dataframe has just two rows as expected.

Pandas knows which database to query because we pass the connection object as the second argument to the read_sql_query function.

https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html

*pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None, dtype_backend=_NoDefault.no_default)*
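
If you want to try read_sql_query without downloading anything, here is a minimal sketch against an in-memory SQLite database. The table layout and the three rows are made up for this illustration and are much simpler than the real movies table.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for movies.db: an in-memory database with three rows.
con_demo = sqlite3.connect(':memory:')
con_demo.execute('create table movies (title text, release_date text, budget integer)')
con_demo.executemany(
    'insert into movies values (?, ?, ?)',
    [('Movie A', '2001-05-01', 100),
     ('Movie B', '2001-09-15', 300),
     ('Movie C', '2002-01-20', 50)],
)
con_demo.commit()

# First argument: the SQL text. Second argument: the connection to run it on.
two_rows = pd.read_sql_query('select * from movies limit 2', con_demo)
print(two_rows.shape)  # (2, 3)
```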

## verify the source data

We do some basic queries to inspect the source database table and the data in it. This is a crucial step to be sure the data is what is expected.

### inspect the first few rows

In [29]:
two_movies = pd.read_sql_query('select * from movies limit 2',con_movie_source)

two_movies

Unnamed: 0,id,original_title,budget,popularity,release_date,revenue,title,vote_average,vote_count,overview,tagline,uid,director_id
0,43597,Avatar,237000000,150,2009-12-10,2787965087,Avatar,7.2,11800,"In the 22nd century, a paraplegic Marine is di...",Enter the World of Pandora.,19995,4762
1,43598,Pirates of the Caribbean: At World's End,300000000,139,2007-05-19,961000000,Pirates of the Caribbean: At World's End,6.9,4500,"Captain Barbossa, long believed to be dead, ha...","At the end of the world, the adventure begins.",285,4763


Now we query again and return all movies as a dataframe. This new dataframe has 4773 movies in it.

### get the total count of rows and columns

In [30]:
all_movies = pd.read_sql_query('select * from movies',con_movie_source)

all_movies.shape

(4773, 13)

### check datatypes and null values.

In [31]:
all_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4773 entries, 0 to 4772
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              4773 non-null   int64  
 1   original_title  4773 non-null   object 
 2   budget          4773 non-null   int64  
 3   popularity      4773 non-null   int64  
 4   release_date    4773 non-null   object 
 5   revenue         4773 non-null   int64  
 6   title           4773 non-null   object 
 7   vote_average    4773 non-null   float64
 8   vote_count      4773 non-null   int64  
 9   overview        4770 non-null   object 
 10  tagline         3951 non-null   object 
 11  uid             4773 non-null   int64  
 12  director_id     4773 non-null   int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 484.9+ KB


# Creating an extract query along with a transform

Using SQL, we can combine the extract (E) and transform (T) steps in a single query.

We'll assume the machine learning team doesn't need each individual movie. They just want yearly data and average budget and revenue for the year. So, we can reduce the granularity from specific dates and movies to just aggregated information.

In [32]:
yearly_aggregates = pd.read_sql_query('''
select
  strftime('%Y',release_date) as release_year
  ,avg(budget) as avg_budget
  ,avg(revenue) as avg_revenue
  ,count(1) as movie_count
from movies
group by release_year
''',con_movie_source)

## show the top 5 rows

In [33]:
yearly_aggregates.head()

Unnamed: 0,release_year,avg_budget,avg_revenue,movie_count
0,1916,385907.0,8394751.0,1
1,1925,245000.0,22000000.0,1
2,1927,92620000.0,650422.0,1
3,1929,189500.0,2179000.0,2
4,1930,3950000.0,8000000.0,1


# Migrate as Bulk Insert

We insert all rows at once. This only works for smaller data migrations. A larger migration could exhaust the available RAM, because the entire dataset must be held in memory at once.
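
One way to keep memory use bounded is the chunksize argument of read_sql_query, which returns an iterator of smaller DataFrames instead of one large one. The sketch below is only an illustration: the numbers table, the connection names and the chunk size of 4 are assumptions, not part of the movies database.

```python
import sqlite3
import pandas as pd

# Hypothetical source and destination, both in memory.
con_src = sqlite3.connect(':memory:')
con_dst = sqlite3.connect(':memory:')
con_src.execute('create table numbers (n integer)')
con_src.executemany('insert into numbers values (?)', [(i,) for i in range(10)])
con_src.commit()

chunk_sizes = []
# With chunksize set, each loop iteration receives a DataFrame of at most 4 rows.
for chunk in pd.read_sql_query('select * from numbers', con_src, chunksize=4):
    chunk.to_sql('numbers', con_dst, if_exists='append', index=False)
    chunk_sizes.append(len(chunk))

rows_copied = con_dst.execute('select count(*) from numbers').fetchone()[0]
print(chunk_sizes, rows_copied)  # [4, 4, 2] 10
```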

## Migrating data to a NEW database

In data engineering we are often moving data from a SOURCE system to a DESTINATION system. For the purpose of this tutorial we'll assume the movies database is the source.

Let's create an entirely new database called "movie_destination.db" and practice moving data into it.

We'll consider multiple movement strategies. First, we'll just move the entire movie table, then we'll begin thinking about how to move this data in batches.

This code creates a new movie database. You can see it in your files on the left nav bar after you run the code.

Also, notice that we have created a new connection object as well as a cursor object both with appropriate names for the destination database.

## create destination connection and cursor

In [34]:
con_movie_destination = sqlite3.connect('movie_destination.db')
cur_movie_destination = con_movie_destination.cursor()

## drop table (if exists)

In [35]:
cur_movie_destination.execute('''
drop table if exists yearly_aggregates
''')
con_movie_destination.commit()

## create table (if not exists)

In [36]:
cur_movie_destination.execute('''
CREATE TABLE IF NOT EXISTS yearly_aggregates (
  release_year int,
  avg_budget REAL,
  avg_revenue REAL,
  movie_count INTEGER
)
''')
con_movie_destination.commit()

## check for existing data

In [37]:
pd.read_sql_query('select * from yearly_aggregates',con_movie_destination)

Unnamed: 0,release_year,avg_budget,avg_revenue,movie_count


## delete existing data

In [38]:
cur_movie_destination.execute('delete from yearly_aggregates')
con_movie_destination.commit()

Ensure the deletion was successful.

In [39]:
pd.read_sql_query('select * from yearly_aggregates',con_movie_destination)

Unnamed: 0,release_year,avg_budget,avg_revenue,movie_count


## check for the existing years in destination (coalesce)

Use coalesce so the query returns 0 instead of NULL when the destination table is empty.

In [40]:
migrated_movie_years = cur_movie_destination.execute('select coalesce(max(release_year),0) from yearly_aggregates').fetchall()

The return value is a list of tuples.

In [41]:
migrated_movie_years

[(0,)]

## convert list of tuples to a simple integer value

In [42]:
migrated_movie_years = migrated_movie_years[0][0] # 0 means no migrated years
migrated_movie_years

0
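
To see why coalesce matters, the sketch below runs the same max query with and without it on an empty table. It uses a throwaway in-memory database, so the connection name con_demo is an assumption for this example.

```python
import sqlite3

# Hypothetical empty destination table in an in-memory database.
con_demo = sqlite3.connect(':memory:')
con_demo.execute('create table yearly_aggregates (release_year integer)')

# max() over zero rows yields NULL, which Python receives as None.
without_coalesce = con_demo.execute(
    'select max(release_year) from yearly_aggregates').fetchall()

# coalesce returns its first non-NULL argument, here the fallback 0.
with_coalesce = con_demo.execute(
    'select coalesce(max(release_year), 0) from yearly_aggregates').fetchall()

print(without_coalesce)  # [(None,)]
print(with_coalesce)     # [(0,)]
```

Comparing a year against None would fail later, which is why the fallback value of 0 is convenient here.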


## create a dataframe for Loading into Destination db

We filter the query to exclude years that have already been migrated.

In [43]:
year_data = pd.read_sql_query(f"""
select
  CAST(strftime('%Y',release_date) as INTEGER) as release_year,
  avg(budget) as avg_budget,
  avg(revenue) as avg_revenue,
  count(1) as movie_count
from movies
where release_year > {migrated_movie_years}
group by release_year;
""",con=con_movie_source)

year_data.shape

(90, 4)
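
The query above inserts the year into the SQL text with an f-string. An alternative is to pass the value through the params argument of read_sql_query and let the driver bind it to a ? placeholder. The sketch below assumes a tiny made-up movies table and a hypothetical last_migrated_year of 2001.

```python
import sqlite3
import pandas as pd

# Hypothetical four-row stand-in for the movies table.
con_demo = sqlite3.connect(':memory:')
con_demo.execute('create table movies (title text, release_date text, budget integer)')
con_demo.executemany(
    'insert into movies values (?, ?, ?)',
    [('Movie A', '2001-05-01', 100),
     ('Movie B', '2001-09-15', 300),
     ('Movie C', '2002-01-20', 50),
     ('Movie D', '2003-07-04', 70)],
)
con_demo.commit()

last_migrated_year = 2001  # assumed value; the tutorial reads this from the destination

new_years = pd.read_sql_query(
    """
    select
      CAST(strftime('%Y', release_date) as INTEGER) as release_year,
      avg(budget) as avg_budget,
      count(1) as movie_count
    from movies
    where CAST(strftime('%Y', release_date) as INTEGER) > ?
    group by release_year
    order by release_year
    """,
    con_demo,
    params=(last_migrated_year,),  # bound to the ? placeholder by sqlite3
)
print(new_years)
```

Binding values this way avoids quoting mistakes and is the usual defence against SQL injection when a value comes from outside your own code.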

The pandas DataFrame method to_sql takes the data in the dataframe and appends it to the existing table (if it exists), or creates the table if it doesn't exist, in the db specified by the con argument. Passing index=False excludes the internal dataframe index from being added to the table.

In [44]:
year_data.to_sql('yearly_aggregates',if_exists='append',index=False,con=con_movie_destination)

90
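
Because if_exists='append' never checks for duplicates, running the same load twice doubles the rows. The sketch below shows this on a made-up two-row dataframe in an in-memory database; the names year_rows and con_demo are assumptions for this example.

```python
import sqlite3
import pandas as pd

con_demo = sqlite3.connect(':memory:')  # hypothetical destination
year_rows = pd.DataFrame({'release_year': [2001, 2002], 'movie_count': [2, 1]})

# First call: the table does not exist yet, so to_sql creates it.
year_rows.to_sql('yearly_aggregates', con_demo, if_exists='append', index=False)
count_after_first = con_demo.execute(
    'select count(*) from yearly_aggregates').fetchone()[0]

# Second call: the same rows are appended again.
year_rows.to_sql('yearly_aggregates', con_demo, if_exists='append', index=False)
count_after_second = con_demo.execute(
    'select count(*) from yearly_aggregates').fetchone()[0]

print(count_after_first, count_after_second)  # 2 4
```

This is why the tutorial either deletes existing rows first or filters the extract to years that have not been migrated yet.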

## check migrated data integrity

Perform queries to ensure the data has been migrated as expected. Perform multiple checks rather than relying on a single one.

In [45]:
pd.read_sql_query('select * from yearly_aggregates limit 5',con_movie_destination)

Unnamed: 0,release_year,avg_budget,avg_revenue,movie_count
0,1916,385907.0,8394751.0,1
1,1925,245000.0,22000000.0,1
2,1927,92620000.0,650422.0,1
3,1929,189500.0,2179000.0,2
4,1930,3950000.0,8000000.0,1


In [46]:
pd.read_sql_query('select min(release_year),max(release_year) from yearly_aggregates',con_movie_destination)

Unnamed: 0,min(release_year),max(release_year)
0,1916,2017


In [47]:
pd.read_sql_query('select count(1) as year_count from yearly_aggregates',con_movie_destination)

Unnamed: 0,year_count
0,90
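
The checks above look only at the destination. A further option is to compare it directly against the source, as in the sketch below. The in-memory tables and their rows are made up for this illustration; with the real databases you would run the same kind of queries on con_movie_source and con_movie_destination.

```python
import sqlite3

# Hypothetical source with three movies across two years.
con_src = sqlite3.connect(':memory:')
con_src.execute('create table movies (title text, release_date text)')
con_src.executemany(
    'insert into movies values (?, ?)',
    [('Movie A', '2001-05-01'), ('Movie B', '2001-09-15'), ('Movie C', '2002-01-20')],
)

# Hypothetical destination holding the already-migrated yearly rows.
con_dst = sqlite3.connect(':memory:')
con_dst.execute('create table yearly_aggregates (release_year integer, movie_count integer)')
con_dst.executemany('insert into yearly_aggregates values (?, ?)', [(2001, 2), (2002, 1)])

source_years = con_src.execute(
    "select count(distinct strftime('%Y', release_date)) from movies").fetchone()[0]
destination_years = con_dst.execute(
    'select count(1) from yearly_aggregates').fetchone()[0]

source_movies = con_src.execute('select count(1) from movies').fetchone()[0]
destination_movies = con_dst.execute(
    'select sum(movie_count) from yearly_aggregates').fetchone()[0]

# Each assert raises an AssertionError if source and destination disagree.
assert source_years == destination_years
assert source_movies == destination_movies
print('source and destination agree')
```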


# Migrate in mini-batches.

We'll migrate each year separately.

This has the advantage of reducing the batch size: each insert is just a single row. The trade-off is a longer overall run time, because we issue many small queries and inserts instead of one large one.

## get a list of year tuples

In [48]:
unique_source_years = cur_movie_source.execute("select distinct(CAST(strftime('%Y',release_date) as INTEGER)) as unique_release_year from movies order by unique_release_year").fetchall()
unique_source_years[0:5]

[(1916,), (1925,), (1927,), (1929,), (1930,)]

In [49]:
cur_movie_destination.execute('delete from yearly_aggregates')
con_movie_destination.commit()

## ensure the deletion was successful

In [50]:
pd.read_sql_query('select * from yearly_aggregates',con_movie_destination)

Unnamed: 0,release_year,avg_budget,avg_revenue,movie_count


## for loop

We query for each year individually. Each iteration through the for loop creates a new row in the database.

In [51]:
migrated_years = cur_movie_destination.execute("select release_year as migrated_years from yearly_aggregates").fetchall()
migrated_years[0:3]

[]

Next we use a list comprehension, a simple and concise way to build a list, to keep only the years that have not been migrated yet.

https://www.w3schools.com/Python/python_lists_comprehension.asp


In [52]:
unmigrated_years = [x for x in unique_source_years if x not in migrated_years]

unmigrated_years[0:4]

[(1916,), (1925,), (1927,), (1929,)]
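
The comparison works because both lists hold one-element tuples, so (1916,) in one list matches (1916,) in the other. For long lists, checking membership in a set is typically faster than checking it in a list. The sketch below assumes a partially migrated destination; the migrated_years values are made up for this illustration.

```python
unique_source_years = [(1916,), (1925,), (1927,), (1929,)]
migrated_years = [(1916,), (1927,)]  # hypothetical: two years already migrated

# Tuples are hashable, so they can be stored in a set for fast lookups.
migrated_lookup = set(migrated_years)

# Keep the source order, dropping years that are already in the destination.
unmigrated_years = [year for year in unique_source_years if year not in migrated_lookup]
print(unmigrated_years)  # [(1925,), (1929,)]
```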

In [53]:
for year in tqdm(unmigrated_years): # loop only over years not already in the destination
  year = year[0] # get int from tuple

  year_data = pd.read_sql_query(f"""
  select
    CAST(strftime('%Y',release_date) as INTEGER) as release_year,
    avg(budget) as avg_budget,
    avg(revenue) as avg_revenue,
    count(1) as movie_count
  from movies
  where release_year = {year}
  group by release_year;
  """,con=con_movie_source)

  year_data.to_sql('yearly_aggregates',if_exists='append',index=False,con=con_movie_destination)

100%|██████████| 90/90 [00:02<00:00, 31.49it/s]


## check migrated data integrity

Perform queries to ensure the data has been migrated as expected. Perform multiple checks rather than relying on a single one.

In [54]:
pd.read_sql_query('select * from yearly_aggregates limit 5',con_movie_destination)

Unnamed: 0,release_year,avg_budget,avg_revenue,movie_count
0,1916,385907.0,8394751.0,1
1,1925,245000.0,22000000.0,1
2,1927,92620000.0,650422.0,1
3,1929,189500.0,2179000.0,2
4,1930,3950000.0,8000000.0,1


In [55]:
pd.read_sql_query('select min(release_year),max(release_year) from yearly_aggregates',con_movie_destination)

Unnamed: 0,min(release_year),max(release_year)
0,1916,2017


In [56]:
pd.read_sql_query('select count(1) as year_count from yearly_aggregates',con_movie_destination)

Unnamed: 0,year_count
0,90


In [57]:
import random
random.randint(1, 100)

46

### YOUDO

Migrate the director data from the existing db to a new database.

Use .execute and .fetchall to get a list of tuples. Use a for loop to iterate over that list of tuples and print each element if the director's name contains 'Steven'.

In [58]:
name = "Matthew Pecsok"
has_substr = "Matt" in name
has_substr

True

In [59]:
directors = cur_movie_source.execute("select * from directors").fetchall()

In [60]:
directors[0:3]

[('James Cameron', 4762, 2, 2710, 'Directing'),
 ('Gore Verbinski', 4763, 2, 1704, 'Directing'),
 ('Sam Mendes', 4764, 2, 39, 'Directing')]

In [61]:
for director in directors:
  if "Steven" in director[0]:
    print(director)

('Steven Spielberg', 4799, 2, 488, 'Directing')
('Mark Steven Johnson', 4895, 2, 16837, 'Directing')
('Steven Soderbergh', 4909, 2, 1884, 'Directing')
('Steven Brill', 5013, 2, 32593, 'Directing')
('Steven Zaillian', 5117, 2, 2260, 'Directing')
('Robert Stevenhagen', 5120, 0, 64152, 'Directing')
('Steven Quale', 5216, 2, 93214, 'Directing')
('Steven Seagal', 5221, 2, 23880, 'Directing')
('Steven E. de Souza', 5390, 2, 1726, 'Directing')
('George Stevens', 5698, 2, 18738, 'Directing')
('Steven Shainberg', 5803, 2, 67795, 'Directing')
('Robert Stevenson', 6293, 2, 5834, 'Directing')
('Steven R. Monroe', 6713, 2, 88039, 'Directing')
('D. Stevens', 6863, 0, 146026, 'Directing')


### YOUDO

Create a new database called steven_directors.
It should contain just one column, called name, that holds the director's name.
Only migrate directors whose name contains "Steven".


In [62]:
# Create the new database. Should return a new connection object.
# Create a new cursor to the new database using the new connection object.

# drop the table if it exists.
con_steven_directors.commit()

# create the table.
con_steven_directors.commit()

# fetch all directors from the origin db as a list of tuples.

# use a for loop to identify directors with the name Steven
# use a parameterized insert to insert each row one at a time in the for loop
# cur_steven_directors.execute("insert into steven_directors (name) values (?)",(director[0],))
con_steven_directors.commit()


NameError: name 'con_steven_directors' is not defined

Check the migrated data to see if all of the correct directors were migrated.
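
For reference, here is one possible shape for the steps outlined in the skeleton above. It is a sketch, not the only correct answer. To stay self-contained it uses in-memory databases and a made-up four-row directors table with only name and id columns. With the real data you would read from cur_movie_source and connect to a database file instead.

```python
import sqlite3

# Hypothetical source: a tiny directors table instead of the real movies.db.
con_source = sqlite3.connect(':memory:')
con_source.execute('create table directors (name text, id integer)')
con_source.executemany(
    'insert into directors values (?, ?)',
    [('James Cameron', 4762), ('Steven Spielberg', 4799),
     ('Sam Mendes', 4764), ('Steven Soderbergh', 4909)],
)

# New destination database (in memory here; a file name also works).
con_steven_directors = sqlite3.connect(':memory:')
cur_steven_directors = con_steven_directors.cursor()

cur_steven_directors.execute('drop table if exists steven_directors')
cur_steven_directors.execute('create table steven_directors (name text)')
con_steven_directors.commit()

# Fetch all directors from the source as a list of tuples.
directors = con_source.execute('select * from directors').fetchall()

# Insert one row at a time using a parameterized insert.
for director in directors:
    if 'Steven' in director[0]:
        cur_steven_directors.execute(
            'insert into steven_directors (name) values (?)', (director[0],))
con_steven_directors.commit()

migrated = cur_steven_directors.execute(
    'select name from steven_directors order by name').fetchall()
print(migrated)  # [('Steven Soderbergh',), ('Steven Spielberg',)]
```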