# Multinational Retail Data Centralisation
##### Tasks 3 - 8 are broken down into three key steps: Extraction, Cleaning and Upload. Throughout the document we store the code as methods inside classes spread across three Python files, which keeps the amount of code in any single file small.

***
## **TASK 3 - USER DATA (Database)**
Retrieve **user_data** from the AWS RDS database.
***

***
### <font color='lightblue'>**EXTRACTION // T3**</font>
First we need a method that connects to the AWS RDS database. The RDS credentials are provided in the file [db_creds.yaml](db_creds.yaml).
<br> We will use SQLAlchemy to make the task simpler.

In [12]:
from    sqlalchemy              import create_engine
import  pandas                  as pd
import  yaml

def read_db_creds(name):
    # Load the database credentials from a YAML file into a dictionary
    with open(name, 'r') as stream:
        return yaml.safe_load(stream)

def init_db_engine(credentials):
    # Build the connection URL from the credentials and create the engine
    engine = create_engine(f"postgresql+psycopg2://{credentials['RDS_USER']}:{credentials['RDS_PASSWORD']}@{credentials['RDS_HOST']}:{credentials['RDS_PORT']}/{credentials['RDS_DATABASE']}")
    # Open and immediately close a connection to confirm the credentials work
    with engine.connect():
        pass
    return engine
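
Credentials interpolated straight into the connection string can break the URL if the password contains characters such as `@` or `/`. Below is a minimal sketch of one way around that, assuming the same `RDS_*` keys as above. The helper name `build_postgres_url` and the example values are hypothetical and not part of the project.

```python
from urllib.parse import quote_plus

def build_postgres_url(credentials: dict) -> str:
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    user = quote_plus(str(credentials["RDS_USER"]))
    password = quote_plus(str(credentials["RDS_PASSWORD"]))
    host = credentials["RDS_HOST"]
    port = credentials["RDS_PORT"]
    database = credentials["RDS_DATABASE"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"

# Hypothetical values, shaped like the dictionary returned by read_db_creds
example_credentials = {
    "RDS_USER": "reader",
    "RDS_PASSWORD": "p@ss/word",
    "RDS_HOST": "example-host.invalid",
    "RDS_PORT": 5432,
    "RDS_DATABASE": "postgres",
}

print(build_postgres_url(example_credentials))
```

The resulting string could be passed to `create_engine` in place of the f-string used in `init_db_engine`.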

We will use the credentials provided to identify a series of tables inside the database.

In [13]:
from    sqlalchemy          import inspect

name = "db_creds.yaml"
credentials = read_db_creds(name)
engine = init_db_engine(credentials)

def list_db_tables(engine):
    # The inspector opens its own connection, so no explicit connect() is needed
    inspector = inspect(engine)
    return inspector.get_table_names()

list_db_tables(engine)

['legacy_store_details', 'legacy_users', 'orders_table']

We can now transfer the above code into our DATA CONNECTOR class. *[see here](data_utils.py)*
<br> The user_data is stored in the second table in the list, "legacy_users".
<br> We can now begin the extraction process.
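
Selecting the table by position depends on the order in which the inspector returns the names. As a small sketch, the position can instead be looked up from the table name. The helper `table_position` is hypothetical, and the list is the one printed above.

```python
def table_position(table_names, wanted):
    """Return the index of `wanted` in table_names, with a clearer error if absent."""
    try:
        return table_names.index(wanted)
    except ValueError:
        raise ValueError(f"Table {wanted!r} not found; available: {table_names}") from None

# Table names as returned by list_db_tables(engine) earlier in this document
table_names = ['legacy_store_details', 'legacy_users', 'orders_table']
column = table_position(table_names, "legacy_users")
```

The value of `column` can then be passed on exactly as a hard-coded index would be.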

In [22]:
from    sqlalchemy          import create_engine
from    sqlalchemy          import inspect
import  pandas              as pd
import  yaml

name            = "db_creds.yaml"
credentials     = read_db_creds(name)
engine          = init_db_engine(credentials)
table_names     = list_db_tables(engine)
column          = 1     # position of "legacy_users" in table_names

def read_rds_table(table_names, column, engine):
    # Read the whole table at position `column` of table_names into a DataFrame
    database = pd.read_sql_table(table_names[column], engine)
    return database

read_rds_table(table_names, column, engine)


Unnamed: 0,index,first_name,last_name,date_of_birth,company,email_address,address,country,country_code,phone_number,join_date,user_uuid
0,0,Sigfried,Noack,1990-09-30,Heydrich Junitz KG,rudi79@winkler.de,Zimmerstr. 1/0\n59015 Gießen,Germany,DE,+49(0) 047905356,2018-10-10,93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8
1,1,Guy,Allen,1940-12-01,Fox Ltd,rhodesclifford@henderson.com,Studio 22a\nLynne terrace\nMcCarthymouth\nTF0 9GH,United Kingdom,GB,(0161) 496 0674,2001-12-20,8fe96c3a-d62d-4eb5-b313-cf12d9126a49
2,2,Harry,Lawrence,1995-08-02,"Johnson, Jones and Harris",glen98@bryant-marshall.co.uk,92 Ann drive\nJoanborough\nSK0 6LR,United Kingdom,GB,+44(0)121 4960340,2016-12-16,fc461df4-b919-48b2-909e-55c95a03fe6b
3,3,Darren,Hussain,1972-09-23,Wheeler LLC,daniellebryan@thompson.org,19 Robinson meadow\nNew Tracy\nW22 2QG,United Kingdom,GB,(0306) 999 0871,2004-02-23,6104719f-ef14-4b09-bf04-fb0c4620acb0
4,4,Garry,Stone,1952-12-20,Warner Inc,billy14@long-warren.com,3 White pass\nHunterborough\nNN96 4UE,United Kingdom,GB,0121 496 0225,2006-09-01,9523a6d3-b2dd-4670-a51a-36aebc89f579
...,...,...,...,...,...,...,...,...,...,...,...,...
15315,14913,Stephen,Jenkins,1943-08-09,"Thornton, Carroll and Newman",s.jenkins@smith.com,Studio 41I\nJones lodge\nOliviaborough\nE8 3DU,United Kingdom,GB,+44(0)292018946,2016-04-15,2bd3a12f-a92d-4cdd-b99c-fc70572db302
15316,14994,Stephen,Smith,1948-08-20,Robinson-Harris,s.smith@smith.com,530 Young parkway\nMillsfurt\nL4G 7NX,United Kingdom,GB,+44(0)1144960977,2020-07-20,d234c04b-c07c-46a5-a902-526f91478ecc
15317,15012,Stephen,Losekann,1940-10-09,Rosenow,s.losekann@smith.com,Viviane-Fritsch-Straße 3/5\n15064 Bad Liebenwerda,Germany,DE,02984 08192,2021-03-07,1a0a8b7b-7c17-42d8-a946-8a85d5495651
15318,15269,Stephen,Rivera,1952-06-04,"Taylor, Fry and Jones",s.rivera@smith.com,"660 Ross Falls Suite 357\nAnthonymouth, MA 09610",United States,US,239.711.3836,2011-01-03,187fe06e-bd5f-4381-af2f-d7ac37ca7572


We will now transfer the above code into our DATA EXTRACTOR Class. *[see here](data_extraction.py)*

***
### <font color='yellow'>**CLEANING // T3**</font>
Now that we have the user data as a pandas DataFrame, we need to clean up any mistakes and issues that would affect our querying.
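
No cleaning code is shown for this step. As a rough sketch only, assuming the cleaning involves parsing the two date columns and discarding rows whose dates fail to parse, it might look like the following. The function name `clean_user_dates` and the second sample row are hypothetical.

```python
import pandas as pd

def clean_user_dates(users: pd.DataFrame) -> pd.DataFrame:
    """Parse the date columns and drop rows whose dates cannot be parsed."""
    cleaned = users.copy()
    for column in ("date_of_birth", "join_date"):
        # Values that do not match the format become NaT instead of raising
        cleaned[column] = pd.to_datetime(cleaned[column], format="%Y-%m-%d", errors="coerce")
    # Drop rows where either date is missing after coercion
    return cleaned.dropna(subset=["date_of_birth", "join_date"]).reset_index(drop=True)

# First row taken from the extracted table above; second row is a made-up bad record
sample = pd.DataFrame({
    "first_name": ["Sigfried", "Placeholder"],
    "date_of_birth": ["1990-09-30", "NULL"],
    "join_date": ["2018-10-10", "NULL"],
})
cleaned_sample = clean_user_dates(sample)
```

A fixed format is strict: dates written in any other style would also be coerced to `NaT` and dropped, so this only suits data that is already in ISO form.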

***
### <font color='lightgreen'>**UPLOAD TO SCHEMA // T3**</font>
<sub><sup>This CELL will contain all operations of extraction, cleaning and uploading within a single operation in order to streamline future updates.</sup></sub>

***
## **TASK 4 - CARD DETAILS (Database)**
Retrieve **card_details** from a PDF document located in an AWS S3 bucket.
<br>
PDF LINK: https://data-handling-public.s3.eu-west-1.amazonaws.com/card_details.pdf
***

***
## **TASK 5 - STORE DATA (Database)**
Retrieve **store_data** & **store_details** via API.
<br>
Retrieve a store: https://aqj7u5id95.execute-api.eu-west-1.amazonaws.com/prod/store_details/{store_number}
<br>
Return the number of stores: https://aqj7u5id95.execute-api.eu-west-1.amazonaws.com/prod/number_stores
***
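
The two endpoints suggest a simple loop: ask for the number of stores, then request each store in turn. The sketch below shows only the looping part and rests on several assumptions: store numbers are taken to run from 0 to the store count minus 1, authentication is not handled, and `fetch_json` is a hypothetical stand-in for whatever performs the HTTP request (for example a wrapper around `requests.get(url).json()`).

```python
BASE_URL = "https://aqj7u5id95.execute-api.eu-west-1.amazonaws.com/prod"

def store_url(store_number: int) -> str:
    """Return the store_details URL for one store number."""
    return f"{BASE_URL}/store_details/{store_number}"

def fetch_all_stores(number_of_stores: int, fetch_json) -> list:
    """Call fetch_json once per store and collect the results in order."""
    # Assumption: store numbers run from 0 to number_of_stores - 1
    return [fetch_json(store_url(number)) for number in range(number_of_stores)]

def fake_fetch_json(url: str) -> dict:
    # Offline stand-in so the sketch runs without network access
    return {"requested_url": url}

stores = fetch_all_stores(3, fake_fetch_json)
```

The collected list of dictionaries could then be passed to `pd.DataFrame` to build the store table.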

***
## **TASK 6 - PRODUCT INFO (Database)**
Retrieve **products** from a CSV document located in an AWS S3 bucket.
<br>
S3 Address: s3://data-handling-public/products.csv
***
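
An `s3://` address combines the bucket name and the object key, which S3 client calls usually take separately. The sketch below splits the address and, as an assumption about how the download might be done, passes the parts to Boto3's `download_file`. The helper names are hypothetical, and the download itself requires Boto3 to be installed with AWS credentials configured.

```python
from urllib.parse import urlparse

def split_s3_address(address: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(address)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 address: {address!r}")
    return parsed.netloc, parsed.path.lstrip("/")

def download_from_s3(address: str, local_path: str) -> None:
    """Download the object at `address` to `local_path` (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3
    bucket, key = split_s3_address(address)
    boto3.client("s3").download_file(bucket, key, local_path)

bucket, key = split_s3_address("s3://data-handling-public/products.csv")
```

Once downloaded, the file could be loaded with `pd.read_csv(local_path)`.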

***
## **TASK 7 - ORDERS TABLE (Database)**
Retrieve the **orders table** from the AWS RDS database, using the credentials stored in a YAML document.
***

***
### <font color='lightblue'>**EXTRACTION // T7**</font>
Reading the credentials from the YAML document and extracting a table from the RDS database were already done in TASK 3, so we only need to call the same methods and adjust a few details, such as the table index. <br> DATA EXTRACTOR class. *[see here](data_extraction.py)*

In [3]:
from    data_utils          import DataConnector
from    data_extraction     import DataExtractor

import  pandas              as pd

dc              = DataConnector()
de              = DataExtractor()

name            = "db_creds.yaml"
credentials     = dc.read_db_creds(name)
engine          = dc.init_db_engine(credentials)
table_names     = dc.list_db_tables(engine)
column          = 2     # position of "orders_table" in table_names

database = de.read_rds_table(table_names, column, engine)
database

Unnamed: 0,level_0,index,date_uuid,first_name,last_name,user_uuid,card_number,store_code,product_code,1,product_quantity
0,0,0,9476f17e-5d6a-4117-874d-9cdb38ca1fa6,,,93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8,30060773296197,BL-8387506C,R7-3126933h,,3
1,1,1,0423a395-a04d-4e4a-bd0f-d237cbd5a295,,,8fe96c3a-d62d-4eb5-b313-cf12d9126a49,349624180933183,WEB-1388012W,C2-7287916l,,2
2,2,2,65187294-bb16-4519-adc0-787bbe423970,,,fc461df4-b919-48b2-909e-55c95a03fe6b,3529023891650490,CH-01D85C8D,S7-1175877v,,2
3,3,3,579e21f7-13cb-436b-83ad-33687a4eb337,,,6104719f-ef14-4b09-bf04-fb0c4620acb0,213142929492281,CL-C183BE4B,D8-8421505n,,2
4,4,4,00ab86c3-2039-4674-b9c1-adbcbbf525bd,,,9523a6d3-b2dd-4670-a51a-36aebc89f579,502067329974,SO-B5B9CB3B,B6-2596063a,,2
...,...,...,...,...,...,...,...,...,...,...,...
120118,110549,110548,f0e8fff6-9998-4661-954b-0e258e09d33c,,,95c74b0a-d495-4359-b1c0-e2da511e8403,575421945446,KA-FA7ED3B8,C9-6827622o,,4
120119,82164,82164,1c80940a-d186-4ba9-9daa-8abd1aceae32,,,5d6fa6fe-e583-4baf-8bbb-d1dd6e2b551f,4971858637664481,WA-A41DA979,I0-1146408B,,1
120120,97599,97599,58598aca-049c-418e-8e39-46327634a7f1,Sharon,Miller,48b7f1fc-db13-4611-ad8e-3dac0b759488,4971858637664481,WEB-1388012W,A4-5443400b,,4
120121,106591,106591,3a76f661-0707-4fbc-9862-f21d3249f581,,,51c0b538-7ded-4697-8e84-9f7aa13f9112,4971858637664481,SO-6D328417,E9-2782979e,,4


***
### <font color='yellow'>**CLEANING // T7**</font>
Now that we have the orders table as a pandas DataFrame, we need to clean up any mistakes and issues that would affect our querying.

In [11]:
from    data_cleaning       import DataCleaning

import  numpy               as np

database = de.read_rds_table(table_names, column, engine)
# Drop the redundant index columns in a single call
database = database.drop(columns=["level_0", "index"])

database

Unnamed: 0,date_uuid,first_name,last_name,user_uuid,card_number,store_code,product_code,1,product_quantity
0,9476f17e-5d6a-4117-874d-9cdb38ca1fa6,,,93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8,30060773296197,BL-8387506C,R7-3126933h,,3
1,0423a395-a04d-4e4a-bd0f-d237cbd5a295,,,8fe96c3a-d62d-4eb5-b313-cf12d9126a49,349624180933183,WEB-1388012W,C2-7287916l,,2
2,65187294-bb16-4519-adc0-787bbe423970,,,fc461df4-b919-48b2-909e-55c95a03fe6b,3529023891650490,CH-01D85C8D,S7-1175877v,,2
3,579e21f7-13cb-436b-83ad-33687a4eb337,,,6104719f-ef14-4b09-bf04-fb0c4620acb0,213142929492281,CL-C183BE4B,D8-8421505n,,2
4,00ab86c3-2039-4674-b9c1-adbcbbf525bd,,,9523a6d3-b2dd-4670-a51a-36aebc89f579,502067329974,SO-B5B9CB3B,B6-2596063a,,2
...,...,...,...,...,...,...,...,...,...
120118,f0e8fff6-9998-4661-954b-0e258e09d33c,,,95c74b0a-d495-4359-b1c0-e2da511e8403,575421945446,KA-FA7ED3B8,C9-6827622o,,4
120119,1c80940a-d186-4ba9-9daa-8abd1aceae32,,,5d6fa6fe-e583-4baf-8bbb-d1dd6e2b551f,4971858637664481,WA-A41DA979,I0-1146408B,,1
120120,58598aca-049c-418e-8e39-46327634a7f1,Sharon,Miller,48b7f1fc-db13-4611-ad8e-3dac0b759488,4971858637664481,WEB-1388012W,A4-5443400b,,4
120121,3a76f661-0707-4fbc-9862-f21d3249f581,,,51c0b538-7ded-4697-8e84-9f7aa13f9112,4971858637664481,SO-6D328417,E9-2782979e,,4


***
### <font color='lightgreen'>**UPLOAD TO SCHEMA // T7**</font>
<sub><sup>This CELL will contain all operations of extraction, cleaning and uploading within a single operation in order to streamline future updates.</sup></sub>

In [8]:
from    data_utils         import DataConnector
from    data_extraction    import DataExtractor
from    data_cleaning      import DataCleaning

import  pandas              as pd
import  numpy               as np

def     upload_to_dim_orders_table():
                #CLASS VARIABLES
                dc              = DataConnector()
                de              = DataExtractor()
                dcl             = DataCleaning()

                #EXTRACTION
                name            = "db_creds.yaml" 
                credentials     = dc.read_db_creds(name)
                engine          = dc.init_db_engine(credentials)
                table_names     = dc.list_db_tables(engine)
                column          = 2     # position of "orders_table" in table_names

                database = de.read_rds_table(table_names, column, engine)

                #CLEANING
                


                #SCHEMA SERVER
                sql_name        = "orders_table"
                local_name      = "db_creds_local.yaml" 
                credentials     = dc.read_db_creds(local_name)
                engine          = dc.init_db_engine(credentials)

                #UPLOAD
                dc.upload_to_db(database, sql_name, engine)
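
The body of `upload_to_db` lives in data_utils.py and is not shown in this document. As a minimal sketch, assuming it is a thin wrapper around `DataFrame.to_sql`, it could look like the following. The `if_exists="replace"` and `index=False` choices are assumptions, and an in-memory SQLite connection stands in for the local PostgreSQL engine so the example runs without a server.

```python
import sqlite3
import pandas as pd

def upload_to_db(database, sql_name, engine):
    """Write the DataFrame to the table `sql_name`, replacing it if it exists."""
    # index=False keeps the DataFrame's row index out of the stored table
    database.to_sql(sql_name, engine, if_exists="replace", index=False)

# Stand-in for the local database; pandas accepts a sqlite3 connection directly
connection = sqlite3.connect(":memory:")

# Two rows with values taken from the orders table shown earlier
orders = pd.DataFrame({
    "product_code": ["R7-3126933h", "C2-7287916l"],
    "product_quantity": [3, 2],
})

upload_to_db(orders, "orders_table", connection)
stored = pd.read_sql_query("SELECT * FROM orders_table", connection)
```

Replacing the table on every run keeps repeated uploads idempotent, at the cost of discarding whatever was stored before.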

***
## **TASK 8 - EVENT DATES (Database)**
Obtain Event/Activity Dates & Times for purchase orders from an AWS S3 bucket via URL
***

### <font color='lightblue'>**EXTRACTION**</font>
We have created a function that manages this connection over HTTP. <br> To create this table, we extract the data from the Amazon S3 server via its URL. <br> For background on accessing S3 objects by URL, see the Boto3 1.33.11 documentation, *https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html*. The cell below requests the URL directly with `requests`.

In [7]:
import  pandas              as pd
import  requests

LINK = "https://data-handling-public.s3.eu-west-1.amazonaws.com/date_details.json"
def extract_from_s3_LINK(link):
    # Download the JSON document at `link` and load it into a DataFrame
    response = requests.get(link)
    response.raise_for_status()     # stop early on an HTTP error
    return pd.DataFrame(response.json())

extract_from_s3_LINK(LINK)

Unnamed: 0,timestamp,month,year,day,time_period,date_uuid
0,22:00:06,9,2012,19,Evening,3b7ca996-37f9-433f-b6d0-ce8391b615ad
1,22:44:06,2,1997,10,Evening,adc86836-6c35-49ca-bb0d-65b6507a00fa
2,10:05:37,4,1994,15,Morning,5ff791bf-d8e0-4f86-8ceb-c7b60bef9b31
3,17:29:27,11,2001,6,Midday,1b01fcef-5ab9-404c-b0d4-1e75a0bd19d8
4,22:40:33,12,2015,31,Evening,dfa907c1-f6c5-40f0-aa0d-40ed77ac5a44
...,...,...,...,...,...,...
120156,22:56:56,11,2022,12,Evening,d6c4fb31-720d-4e94-aa6b-dcbcb85f2bb7
120157,18:25:20,5,1997,31,Evening,f7722027-1aae-49c3-8f8d-853e93f9f3e6
120158,18:21:40,9,2011,13,Evening,4a3b9851-52e1-463c-ac81-1960f141444e
120159,19:10:53,7,2013,12,Evening,64974909-0d4b-42a2-822a-73b5695e8bfb


We can now transfer the above code into our DATA EXTRACTOR Class. *[see here](data_extraction.py)*

***
### <font color='yellow'>**CLEANING**</font>
Now that we have the event dates table as a pandas DataFrame, we need to clean up any mistakes and issues that would affect our querying.
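
No cleaning code is shown for this step. As a rough sketch only, assuming invalid rows can be recognised by non-numeric values in the `day`, `month` and `year` columns, the cleaning might look like this. The function name `clean_event_dates` and the second sample row are hypothetical.

```python
import pandas as pd

def clean_event_dates(dates: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose day, month and year are numeric, stored as integers."""
    numeric_columns = ["day", "month", "year"]
    cleaned = dates.copy()
    for column in numeric_columns:
        # Non-numeric entries become NaN instead of raising
        cleaned[column] = pd.to_numeric(cleaned[column], errors="coerce")
    cleaned = cleaned.dropna(subset=numeric_columns).reset_index(drop=True)
    # With the NaN rows gone, the columns can safely be cast to integers
    return cleaned.astype({column: int for column in numeric_columns})

# First row taken from the extracted table above; second row is a made-up bad record
sample = pd.DataFrame({
    "timestamp": ["22:00:06", "NULL"],
    "month": ["9", "NULL"],
    "year": ["2012", "NULL"],
    "day": ["19", "NULL"],
    "time_period": ["Evening", "NULL"],
})
cleaned_sample = clean_event_dates(sample)
```

The same pattern extends to other columns once the kinds of bad values present in the real data are known.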


***
### <font color='lightgreen'>**UPLOAD TO SCHEMA**</font>
This CELL will contain all operations of extraction, cleaning and uploading within a single operation in order to streamline future updates.

In [1]:
from    data_utils          import  DataConnector
from    data_extraction     import  DataExtractor
from    data_cleaning       import  DataCleaning

def     upload_to_dim_dates():
                #CLASS VARIABLES
                dc              = DataConnector()
                de              = DataExtractor()
                dcl             = DataCleaning()

                #EXTRACTION
                LINK            = "https://data-handling-public.s3.eu-west-1.amazonaws.com/date_details.json"
                database        = de.extract_from_s3_LINK(LINK)

                #CLEANING


                #SCHEMA SERVER
                sql_name        = "event_dates"
                local_name      = "db_creds_local.yaml" 
                credentials     = dc.read_db_creds(local_name)
                engine          = dc.init_db_engine(credentials)

                #UPLOAD
                dc.upload_to_db(database, sql_name, engine)

upload_to_dim_dates()