# Project 2, Part 1, Create and load the product mapping table

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Landon Morin

Year: 2022

Semester: Spring

Section: 9


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import csv

import math
import numpy as np
import pandas as pd

import psycopg2


# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [2]:
#
# function to run a select query and return rows in a pandas dataframe
# pandas reads numeric columns from postgres as float
# if every value in a column is a whole number, convert the column to the nullable integer type Int64
#

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return df
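
The integer fix-up in the function above can be tried without a database connection. The following is a minimal sketch of the same idea using vectorized checks instead of a Python loop; the helper name `floats_to_nullable_ints` and the sample data are hypothetical and are not part of the labs.

```python
import numpy as np
import pandas as pd


def floats_to_nullable_ints(df):
    """Return a copy of df with whole-number float64 columns cast to Int64."""
    result = df.copy()
    for column in result.columns:
        if result[column].dtype == "float64":
            values = result[column].dropna()
            # Cast only when every non-missing value is finite and has no fraction.
            if np.isfinite(values).all() and (values == np.floor(values)).all():
                result[column] = result[column].astype("Int64")
    return result


# Hypothetical sample data: one whole-number column with a gap, one fractional column.
sample = pd.DataFrame({
    "product_id": [1.0, 2.0, np.nan],
    "price": [9.5, 12.0, 7.25],
})
converted = floats_to_nullable_ints(sample)
print(converted.dtypes)
```

The nullable `Int64` type is what lets a column keep missing values while still displaying as integers, which the plain `int64` type cannot do.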
    

In [3]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [4]:
cursor = connection.cursor()

# 2.1.1 Drop the product mapping table if it exists

The mapping table should be named peak_product_mapping

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [5]:
#
# drop the product mapping table if it exists
#

connection.rollback()

query = """

drop table if exists peak_product_mapping;

"""

cursor.execute(query)

connection.commit()


# 2.1.2 Create the product mapping table

The mapping table should be named peak_product_mapping with the following columns:
* product_id numeric(3) - AGM's product id
* peak_product_id numeric(12) - Peak's product id

product_id should be the primary key

AGM has entered its products into Peak's system.  Peak is using its product IDs and not AGM's product IDs.  This table will allow us to map between the two IDs.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [6]:
#
# create the product mapping table
#

connection.rollback()

query = """

create table peak_product_mapping (
  product_id numeric(3),
  peak_product_id numeric(12), 
  primary key (product_id)
  
)

"""

cursor.execute(query)

connection.commit()

In [7]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select * 
from peak_product_mapping;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| |product_id|peak_product_id|
|---|---|---|
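
To illustrate what the primary key on product_id provides, here is a minimal sketch that uses Python's built-in `sqlite3` module as an in-memory stand-in for the course's PostgreSQL database. That substitution is an assumption made so the example runs anywhere; SQLite does not enforce the `numeric(3)` and `numeric(12)` precisions, and the second insert uses a made-up Peak id.

```python
import sqlite3

# In-memory SQLite database used as a stand-in; the notebook itself uses PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table peak_product_mapping (
      product_id numeric(3),
      peak_product_id numeric(12),
      primary key (product_id)
    )
""")
conn.execute("insert into peak_product_mapping values (1, 42314677)")

# A second row for the same product_id violates the primary key.
try:
    conn.execute("insert into peak_product_mapping values (1, 99999999)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute(
    "select count(*) from peak_product_mapping"
).fetchone()[0]
print(duplicate_rejected, row_count)
```

Because each AGM product id can appear only once, a lookup from product_id to peak_product_id always returns at most one row.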


# 2.1.3 Create a CSV file of product mapping data and display it

Create a CSV file of product mapping data named peak_product_mapping.csv

Check this file into your GitHub repo

The field names in the first line of the CSV file should match the column names of the peak_product_mapping table

The data should map the products as follows:

|product_id |peak_product_id |
|---|---|
|1|42314677|
|2|42314678|
|3|42314679|
|4|42314780|
|5|42314781|
|6|42314782|
|7|42314783|
|8|42314784|

Display all the rows in the CSV file using the function my_read_csv_file() from the labs.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [8]:
def my_read_csv_file(file_name, limit):
    "read the csv file and print only the first limit rows"
    
    i = 0
    
    # the with block closes the file even if reading fails
    with open(file_name, "r", newline="") as csv_file:
    
        csv_data = csv.reader(csv_file)
    
        for row in csv_data:
            i += 1
            if i <= limit:
                print(row)
            
    print("\nPrinted ", min(limit, i), "lines of ", i, "total lines.")

In [9]:
my_read_csv_file('peak_product_mapping.csv', 9)

['product_id', 'peak_product_id']
['1', '42314677']
['2', '42314678']
['3', '42314679']
['4', '42314780']
['5', '42314781']
['6', '42314782']
['7', '42314783']
['8', '42314784']

Printed  9 lines of  9 total lines.
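
The notebook displays the CSV file but does not show how it was produced. One way to generate it, sketched here with the standard `csv` module, is to write the header and the eight mapping rows from the table above. The temporary output directory is an assumption made so the sketch does not overwrite the file checked into the repo.

```python
import csv
import os
import tempfile

# Mapping rows copied from the table in this section.
rows = [
    (1, 42314677),
    (2, 42314678),
    (3, 42314679),
    (4, 42314780),
    (5, 42314781),
    (6, 42314782),
    (7, 42314783),
    (8, 42314784),
]

# Written to a temporary directory here; point this at the repo to create the real file.
out_path = os.path.join(tempfile.mkdtemp(), "peak_product_mapping.csv")

with open(out_path, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    # The header must match the table's column names for the load step.
    writer.writerow(["product_id", "peak_product_id"])
    writer.writerows(rows)

# Read the file back to confirm what was written.
with open(out_path, newline="") as csv_file:
    written = list(csv.reader(csv_file))
print(written[0], len(written))
```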


# 2.1.4 Load product mapping data into database table

Load the CSV file (peak_product_mapping.csv) into the database table (peak_product_mapping)

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [10]:
#
# load the csv file into the product mapping table
#

connection.rollback()

query = """

copy peak_product_mapping
from '/user/projects/ucb_mids_w205_project_2/peak_product_mapping.csv' delimiter ',' NULL '' csv header;

"""

cursor.execute(query)

connection.commit()
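
The `copy ... from '<path>'` form above is executed by the database server, so the path must be readable from the server's file system. Where that is not the case, a client-side alternative is psycopg2's `cursor.copy_expert()` with `copy ... from stdin`, which streams a file opened by the notebook process. The sketch below assumes that approach; the helper name `load_mapping_from_client` is hypothetical and is not part of the labs.

```python
def load_mapping_from_client(cursor, csv_path):
    """Stream a local CSV file into peak_product_mapping via COPY ... FROM STDIN."""
    copy_sql = """
        copy peak_product_mapping (product_id, peak_product_id)
        from stdin with (format csv, header true)
    """
    with open(csv_path, "r", newline="") as csv_file:
        # copy_expert is a psycopg2 cursor method; it feeds the open file to COPY.
        cursor.copy_expert(copy_sql, csv_file)


# Hypothetical usage in this notebook, with the existing cursor and connection:
#   load_mapping_from_client(cursor, "peak_product_mapping.csv")
#   connection.commit()
```

As with the server-side form, the load only becomes permanent once the transaction is committed.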

# 2.1.5 Verify the product mapping loaded correctly

Write a query to verify the product mapping loaded correctly

Also join to the products table and pull the description as product name

Include: product id, peak product id, product name

Sort by product id

Display the results in a Pandas data frame

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [11]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select * 
from peak_product_mapping;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| |product_id|peak_product_id|
|---|---|---|
|0|1|42314677|
|1|2|42314678|
|2|3|42314679|
|3|4|42314780|
|4|5|42314781|
|5|6|42314782|
|6|7|42314783|
|7|8|42314784|


In [12]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select pe.product_id, 
       pe.peak_product_id, 
       p.description as product_name
from peak_product_mapping as pe
    join products as p
        on pe.product_id = p.product_id
group by pe.product_id, pe.peak_product_id, p.description
order by pe.product_id
        

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| |product_id|peak_product_id|product_name|
|---|---|---|---|
|0|1|42314677|Pistachio Salmon|
|1|2|42314678|Teriyaki Chicken|
|2|3|42314679|Spinach Orzo|
|3|4|42314780|Eggplant Lasagna|
|4|5|42314781|Chicken Salad|
|5|6|42314782|Curry Chicken|
|6|7|42314783|Tilapia Piccata|
|7|8|42314784|Brocolli Stir Fry|
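
Beyond inspecting the output by eye, the load can also be checked programmatically by comparing the CSV contents with the rows returned from the table. The following is a minimal sketch under that assumption; the helper name `mapping_mismatches` and the sample values are hypothetical. In the notebook, the table rows could come from `cursor.fetchall()`, where psycopg2 returns `numeric` columns as `Decimal` values.

```python
import csv
import io
from decimal import Decimal


def mapping_mismatches(csv_text, db_rows):
    """Return the (product_id, peak_product_id) pairs found in only one source."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = {
        (int(row["product_id"]), int(row["peak_product_id"])) for row in reader
    }
    # int() normalizes Decimal values so they compare equal to the CSV integers.
    actual = {(int(product_id), int(peak_id)) for product_id, peak_id in db_rows}
    # Symmetric difference: pairs missing from either the CSV or the table.
    return sorted(expected ^ actual)


# Hypothetical sample standing in for the file contents and the query result.
sample_csv = "product_id,peak_product_id\n1,42314677\n2,42314678\n"
sample_db_rows = [(Decimal(1), Decimal(42314677)), (Decimal(2), Decimal(42314678))]
mismatches = mapping_mismatches(sample_csv, sample_db_rows)
print(mismatches)
```

An empty list means every row in the CSV file is present in the table and the table holds no extra rows.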
