# Multinational Retail Data Centralisation

This notebook is for working interactively with the project classes and the data they return, which makes development easier. For example, exploring a DataFrame helps in understanding what the database holds and in designing the cleaning methods.

In [18]:
import pandas as pd
from database_utils import DatabaseConnector
from data_extraction import DataExtractor

connector = DatabaseConnector()
extractor = DataExtractor(connector)

## Fetch DataFrame from table name

Use the connector to list the table names, then use the extractor to load a specific table into a DataFrame.

In [19]:
connector.list_db_tables()

['legacy_store_details', 'dim_card_details', 'legacy_users', 'orders_table']
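
The internals of `DatabaseConnector` and `DataExtractor` are not shown in this notebook. As a rough illustration of the same two steps (list the tables, then load one into a DataFrame), here is a minimal sketch against an in-memory SQLite database. The helper names `list_tables` and `read_table`, the two-column table and its rows are all assumptions made for the example, not the project's actual code.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the remote database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_users (first_name TEXT, country_code TEXT)")
conn.executemany(
    "INSERT INTO legacy_users VALUES (?, ?)",
    [("Sigfried", "DE"), ("Guy", "GB")],
)
conn.commit()


def list_tables(connection):
    """Return the names of all tables in the SQLite database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def read_table(connection, table_name):
    """Load a whole table into a DataFrame, refusing unknown table names."""
    if table_name not in list_tables(connection):
        raise ValueError(f"Unknown table: {table_name}")
    return pd.read_sql_query(f'SELECT * FROM "{table_name}"', connection)


tables = list_tables(conn)
users = read_table(conn, "legacy_users")
```

Checking the requested name against the listed tables before building the query keeps arbitrary strings out of the SQL text.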

In [20]:
df = extractor.read_rds_table("legacy_users")
df.head(5)

Unnamed: 0,index,first_name,last_name,date_of_birth,company,email_address,address,country,country_code,phone_number,join_date,user_uuid
0,0,Sigfried,Noack,1990-09-30,Heydrich Junitz KG,rudi79@winkler.de,Zimmerstr. 1/0\n59015 Gießen,Germany,DE,+49(0) 047905356,2018-10-10,93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8
1,1,Guy,Allen,1940-12-01,Fox Ltd,rhodesclifford@henderson.com,Studio 22a\nLynne terrace\nMcCarthymouth\nTF0 9GH,United Kingdom,GB,(0161) 496 0674,2001-12-20,8fe96c3a-d62d-4eb5-b313-cf12d9126a49
2,2,Harry,Lawrence,1995-08-02,"Johnson, Jones and Harris",glen98@bryant-marshall.co.uk,92 Ann drive\nJoanborough\nSK0 6LR,United Kingdom,GB,+44(0)121 4960340,2016-12-16,fc461df4-b919-48b2-909e-55c95a03fe6b
3,3,Darren,Hussain,1972-09-23,Wheeler LLC,daniellebryan@thompson.org,19 Robinson meadow\nNew Tracy\nW22 2QG,United Kingdom,GB,(0306) 999 0871,2004-02-23,6104719f-ef14-4b09-bf04-fb0c4620acb0
4,4,Garry,Stone,1952-12-20,Warner Inc,billy14@long-warren.com,3 White pass\nHunterborough\nNN96 4UE,United Kingdom,GB,0121 496 0225,2006-09-01,9523a6d3-b2dd-4670-a51a-36aebc89f579


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15320 entries, 0 to 15319
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   index          15320 non-null  int64 
 1   first_name     15320 non-null  object
 2   last_name      15320 non-null  object
 3   date_of_birth  15320 non-null  object
 4   company        15320 non-null  object
 5   email_address  15320 non-null  object
 6   address        15320 non-null  object
 7   country        15320 non-null  object
 8   country_code   15320 non-null  object
 9   phone_number   15320 non-null  object
 10  join_date      15320 non-null  object
 11  user_uuid      15320 non-null  object
dtypes: int64(1), object(11)
memory usage: 1.4+ MB


## Cleaning user data

An interactive attempt at cleaning the user table, so that the working steps can then be implemented in the DataCleaning class.

In [22]:
# Convert the text columns from object to the pandas string dtype
df = df.astype(
    {
        "first_name": "string",
        "last_name": "string",
        "company": "string",
        "email_address": "string",
        "address": "string",
        "country_code": "string",
        "country": "string",
        "user_uuid": "string"
    }
)

df.dtypes

index                     int64
first_name       string[python]
last_name        string[python]
date_of_birth            object
company          string[python]
email_address    string[python]
address          string[python]
country          string[python]
country_code     string[python]
phone_number             object
join_date                object
user_uuid        string[python]
dtype: object

In [23]:
# Convert object date columns to the datetime type
date_format = "%Y-%m-%d"
df.date_of_birth = pd.to_datetime(df.date_of_birth, errors='coerce', format=date_format)
df.join_date = pd.to_datetime(df.join_date, errors='coerce', format=date_format)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15320 entries, 0 to 15319
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   index          15320 non-null  int64         
 1   first_name     15320 non-null  string        
 2   last_name      15320 non-null  string        
 3   date_of_birth  15257 non-null  datetime64[ns]
 4   company        15320 non-null  string        
 5   email_address  15320 non-null  string        
 6   address        15320 non-null  string        
 7   country        15320 non-null  string        
 8   country_code   15320 non-null  string        
 9   phone_number   15320 non-null  object        
 10  join_date      15261 non-null  datetime64[ns]
 11  user_uuid      15320 non-null  string        
dtypes: datetime64[ns](2), int64(1), object(1), string(8)
memory usage: 1.4+ MB
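
With `errors='coerce'`, every value that does not match the given format becomes `NaT`. The non-null counts above drop from 15320 to 15257 and 15261 for that reason. A strict format cannot tell garbage apart from a legitimate date written differently, so it may be worth listing what was rejected before dropping it. The sketch below uses made-up sample values purely for illustration.

```python
import pandas as pd

# Made-up sample values: one ISO date, two dates in other formats, one placeholder.
raw = pd.Series(["1990-09-30", "1968 October 16", "NULL", "2006/11/20"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values that were present in the source but could not be parsed with the format.
rejected = raw[parsed.isna() & raw.notna()]
```

Here `rejected` holds the three non-ISO entries, which can then be reviewed by eye or parsed with a second, more lenient pass.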


In [24]:
# Genuine user rows can be told apart from bad data by their UUID (version 4 format)
from re import search
uuid_regex = r'^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-4[0-9A-Fa-f]{3}-[89ABab][0-9A-Fa-f]{3}-[0-9A-Fa-f]{12}$'

good_uuid = "93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8"
bad_uuid = "AS45323"

match_good = search(uuid_regex, good_uuid)
match_bad = search(uuid_regex, bad_uuid)
match_good, match_bad

(<re.Match object; span=(0, 36), match='93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8'>,
 None)
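
As an alternative to a hand-written pattern, the standard library's `uuid` module can do the parsing, since it only accepts hexadecimal digits. This is a minimal sketch; `is_uuid4` is a hypothetical helper name and not part of the project.

```python
from uuid import UUID


def is_uuid4(value) -> bool:
    """Return True only for a canonical, hyphenated version 4 UUID string."""
    if not isinstance(value, str):
        return False
    try:
        parsed = UUID(value)
    except ValueError:
        return False
    # UUID() also accepts braces, a urn: prefix or missing hyphens,
    # so compare against the canonical form to keep the check strict.
    return parsed.version == 4 and str(parsed) == value.lower()


good_result = is_uuid4("93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8")
bad_result = is_uuid4("AS45323")
```

Such a function could be applied to a column with `Series.map`. A vectorised `str.match` is likely faster on a large table, so this is mainly useful as a cross-check.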

In [25]:
# pandas suggests using pd.NA rather than numpy.nan for string-dtype columns
df.loc[~df.user_uuid.str.match(uuid_regex, na=False), 'user_uuid'] = pd.NA

df[df.user_uuid.isna()].head()

Unnamed: 0,index,first_name,last_name,date_of_birth,company,email_address,address,country,country_code,phone_number,join_date,user_uuid
752,752,PYCLKLLC7I,W350SCUD6R,NaT,R7IZUNSQX0,3Q791B3VIY,YW2YXLOQ5J,I7G4DMDZOZ,VSM4IZ4EL3,A4Q4HQBI3I,NaT,
866,867,,,NaT,,,,,,,NaT,
1022,1023,,,NaT,,,,,,,NaT,
1046,1047,GI4C78KWH0,UTB5PPYFG8,NaT,CA1XGS8GZW,7HSZB429UK,63GXGYR3XL,AJ1ENKS3QL,QVUW9JSKY3,64ZO0ONUQO,NaT,
1805,1807,,,NaT,,,,,,,NaT,


In [27]:
# Some rows have the country code GB mistyped as GGB
df.country_code.value_counts()

country_code
GB     9335
DE     4692
US     1201
GGB       6
Name: count, dtype: Int64

In [28]:
df.country_code = df.country_code.replace("GGB", "GB")
df.country_code.value_counts()

country_code
GB    9341
DE    4692
US    1201
Name: count, dtype: Int64
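
A single `replace` fixes the one typo found here. A slightly more general approach is to keep the known corrections in a mapping and flag anything that still falls outside the expected set. The sketch below assumes the expected codes are the three seen in the counts above; the sample series is made up for illustration.

```python
import pandas as pd

EXPECTED_CODES = {"GB", "DE", "US"}   # codes seen in value_counts() above
CORRECTIONS = {"GGB": "GB"}           # known typos and their fixes

# Small made-up sample, including one garbage value.
codes = pd.Series(["GB", "GGB", "DE", "US", "VSM4IZ4EL3"], dtype="string")

codes = codes.replace(CORRECTIONS)
unexpected = codes[~codes.isin(EXPECTED_CODES)]
```

Anything left in `unexpected` is either a new typo to add to `CORRECTIONS` or a bad row to drop.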

In [29]:
df.phone_number.head(50)

0        +49(0) 047905356
1         (0161) 496 0674
2       +44(0)121 4960340
3         (0306) 999 0871
4           0121 496 0225
5       277-664-6389x8405
6             028 9018749
7        +44(0)1414960221
8           028 9018 0338
9              6554215915
10        +44141 496 0404
11        (0114) 496 0775
12        (028) 9018 0333
13       +49(0)9775 74337
14       +49(0)0406372221
15          +441174960765
16           572.068.8397
17         (0114) 4960518
18           08346 147221
19         +44306 9990447
20       +44(0)1614960247
21    +49 (0) 9914 457670
22         (0151) 4960510
23     +44(0)808 157 0714
24          029 2018 0952
25        (0117) 496 0586
26            03069990628
27     (987)576-3015x3130
28         (0151) 4960784
29     +44(0)117 496 0576
30    +49 (0) 5932 078914
31      +49(0)1338 672811
32        +44909 879 0133
33           0117 4960692
34            03381 12459
35      723-654-4681x6799
36         +44306 9990216
37       +49(0) 438105334
38       872

In [30]:
import phonenumbers
import re

def parse_phone_number(phone: str, region: str):
    """Return the number in international format, or None if it cannot be parsed."""
    # Clean the phone number by removing (0), extensions, and other unnecessary characters
    phone = re.sub(r'\(0\)', '', phone)  # Remove (0)
    phone = phone.replace("(", "").replace(")", "")  # Remove parentheses
    phone = re.sub(r'x.*$', '', phone)  # Remove extensions (e.g., x1234)
    phone = re.sub(r'[^\d+]', '', phone)  # Remove non-numeric characters except for +

    try:
        # Attempt to parse the number with the phonenumbers library
        # Without a leading '+', assume a national number and use the row's country code as the region
        if not phone.startswith('+'):
            parsed_number = phonenumbers.parse(phone, region)
        else:
            parsed_number = phonenumbers.parse(phone)

        # Format the parsed number in international format
        return phonenumbers.format_number(parsed_number, phonenumbers.PhoneNumberFormat.INTERNATIONAL)

    except phonenumbers.phonenumberutil.NumberParseException:
        return None

df.phone_number = df.apply(
    lambda row: parse_phone_number(row['phone_number'], row['country_code']), axis=1
) # type: ignore
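
The regex clean-up that runs before `phonenumbers.parse` can be checked on its own, without the library. The sketch below repeats those substitutions in a hypothetical helper, `strip_phone_noise`, and runs it over a few of the raw values listed earlier.

```python
import re


def strip_phone_noise(phone: str) -> str:
    """Reduce a raw phone string to its digits plus any '+' sign."""
    phone = re.sub(r"\(0\)", "", phone)   # drop the optional "(0)" trunk prefix
    phone = re.sub(r"x.*$", "", phone)    # drop extensions such as x8405
    return re.sub(r"[^\d+]", "", phone)   # drop spaces, dots, dashes and brackets


samples = [
    "+49(0) 047905356",
    "(0161) 496 0674",
    "277-664-6389x8405",
    "+49 (0) 9914 457670",
    "572.068.8397",
]
cleaned = {raw: strip_phone_noise(raw) for raw in samples}
```

Parsing a number does not by itself confirm that the number is valid. `phonenumbers.is_valid_number` can be called on the parsed result if a stricter check is wanted.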


In [31]:
df.loc[df.country_code == "DE"].head(10)

Unnamed: 0,index,first_name,last_name,date_of_birth,company,email_address,address,country,country_code,phone_number,join_date,user_uuid
0,0,Sigfried,Noack,1990-09-30,Heydrich Junitz KG,rudi79@winkler.de,Zimmerstr. 1/0 59015 Gießen,Germany,DE,+49 4790 5356,2018-10-10,93caf182-e4e9-4c6e-bebb-60a1a9dcf9b8
13,13,Hajo,Hölzenbecher,1963-07-17,Scholl,evelynemetz@hartung.de,Christian-Kensy-Platz 9 52957 Kyritz,Germany,DE,+49 9775 74337,2006-07-21,196c8554-5df5-4519-973c-e05c0781cf52
14,14,Till,Schönland,1977-12-27,Noack GmbH,wagnerphilip@trueb.org,Hans-Günther-Kranz-Straße 1/4 07843 Cuxhaven,Germany,DE,+49 40 6372221,2001-09-22,21be4057-b932-41c4-96c0-0c9ab2c48d6c
18,18,Claus-Peter,Mitschke,1966-01-08,Käster,barkholzdetlef@tschentscher.net,Mark-Stiffel-Allee 8/7 90829 Rehau,Germany,DE,+49 8346 147221,2015-03-16,4dff00da-ffb7-4cd4-accf-805738e630dd
21,21,Jörg,Hoffmann,1940-11-04,Höfig,zgehringer@beckmann.de,Marliese-Holzapfel-Gasse 767 09198 Gardelegen,Germany,DE,+49 991 4457670,2006-02-04,fbf6ce18-7838-40fc-9cf9-5c90e26b5b65
30,30,Piotr,Lindau,1968-01-09,Mülichen Kade AG,tsoeding@patberg.com,Schmidtring 937 34482 Eberswalde,Germany,DE,+49 5932 078914,2015-08-05,fbedd1b1-463d-4660-83f5-cda91c900731
31,31,Gertrude,Neureuther,1944-05-25,Caspar GmbH,loewerrobert@misicher.net,Kabusweg 116 46469 Hansestadttralsund,Germany,DE,+49 133 8 672811,2001-06-15,c4e8899f-a2c2-4bb6-98d0-58fc3a23c412
34,34,Swantje,Hermann,1949-10-06,Ladeck Oestrovsky AG & Co. OHG,karolinaschuchhardt@huhn.org,Langegasse 32 00806 Genthin,Germany,DE,+49 3381 12459,2000-07-18,06999681-7360-461c-9a1f-9dd85a97b871
37,37,Sepp,Mans,1999-01-29,Dowerg GmbH & Co. OHG,marliese83@kabus.com,Jonas-Stadelmann-Gasse 98 62315 Ahaus,Germany,DE,+49 4381 05334,2012-04-28,a0f59605-c280-4714-aad1-0768269ecd86
39,39,Paul,Misicher,2006-08-24,Kuhl,tlangern@ebert.org,Aleksandra-Schlosser-Gasse 0 97750 Neuruppin,Germany,DE,+45 0208239,1996-02-26,15e92a28-e885-4844-9ead-17cb1a56fa6c


In [32]:
# Confirming no dates are in the future
df.date_of_birth.dt.date.min(), df.date_of_birth.dt.date.max()

(datetime.date(1938, 11, 23), datetime.date(2006, 11, 20))

In [33]:
df.join_date.dt.date.min(), df.join_date.dt.date.max()

(datetime.date(1992, 11, 21), datetime.date(2022, 11, 19))
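
The minimum and maximum only show that no date lies in the future. A row-wise check can also catch dates that contradict each other. For instance, row 39 of the German sample above lists a `date_of_birth` of 2006-08-24 and a `join_date` of 1996-02-26. The sketch below is one possible way to flag such rows on a two-row stand-in frame; whether they should be dropped or kept is a separate decision.

```python
import pandas as pd

# Two-row stand-in: the second row mirrors the dates of row 39 above.
users = pd.DataFrame(
    {
        "date_of_birth": pd.to_datetime(["1990-09-30", "2006-08-24"]),
        "join_date": pd.to_datetime(["2018-10-10", "1996-02-26"]),
    }
)

today = pd.Timestamp.now().normalize()

in_future = (users.date_of_birth > today) | (users.join_date > today)
joined_before_birth = users.join_date < users.date_of_birth

suspect = users[in_future | joined_before_birth]
```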

In [34]:
# Treat the literal string "NULL" as missing, then drop every row that has a missing value
df.replace("NULL", pd.NA, inplace=True)
df = df.dropna(how='any', axis='index')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15203 entries, 0 to 15319
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   index          15203 non-null  int64         
 1   first_name     15203 non-null  string        
 2   last_name      15203 non-null  string        
 3   date_of_birth  15203 non-null  datetime64[ns]
 4   company        15203 non-null  string        
 5   email_address  15203 non-null  string        
 6   address        15203 non-null  string        
 7   country        15203 non-null  string        
 8   country_code   15203 non-null  string        
 9   phone_number   15203 non-null  object        
 10  join_date      15203 non-null  datetime64[ns]
 11  user_uuid      15203 non-null  string        
dtypes: datetime64[ns](2), int64(1), object(1), string(8)
memory usage: 1.5+ MB
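
After the drop, the index still runs from 0 to 15319 with gaps. If a contiguous index is preferred, the same steps can be chained without `inplace` and finished with `reset_index`. This is a minimal sketch on a made-up three-row frame.

```python
import pandas as pd

# Made-up frame: one good row, one "NULL" placeholder row, one row with a missing phone.
users = pd.DataFrame(
    {
        "first_name": pd.array(["Guy", "NULL", "Harry"], dtype="string"),
        "phone_number": ["0121 496 0225", "NULL", None],
    }
)

cleaned = (
    users.replace("NULL", pd.NA)   # treat the placeholder string as missing
    .dropna(how="any")             # drop rows with any missing value
    .reset_index(drop=True)        # renumber the surviving rows from 0
)
```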
