**Data extraction was done manually by downloading the CSV files from municipal open data portals.**

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("edmonton_business_licenses.csv",on_bad_lines="warn")

**Drop columns which are not needed.**

In [3]:
drop_cols = ['Licence Status','Business Improvement Area','Neighbourhood ID','Neighbourhood','Ward','Latitude','Longitude','Location','Count','Geometry Point']
df = df.drop(drop_cols,axis=1)
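
**A minimal sketch of a more forgiving drop, assuming a portal export may not always contain every column listed above. The toy frame `raw` and its placeholder values are hypothetical; `errors="ignore"` is a standard `DataFrame.drop` option that skips missing labels instead of raising a `KeyError`.**

```python
import pandas as pd

# Hypothetical miniature of the raw licence table (placeholder values only).
raw = pd.DataFrame({
    "Licence Number": ["100001421-002"],
    "Trade Name": ["INTERIOR ILLUSIONS"],
    "Ward": ["PLACEHOLDER"],
    "Latitude": [0.0],
})

# "Count" is deliberately absent from the toy frame.
drop_cols = ["Ward", "Latitude", "Count"]

# errors="ignore" drops the labels that exist and silently skips the rest.
trimmed = raw.drop(columns=drop_cols, errors="ignore")
```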

**Rename columns.**

In [4]:
df = df.rename(columns={'Trade Name':'Business Name','Licence Number':'License Number'})

**Reorder the columns in the dataframe. The ordering is arbitrary; I chose what felt most intuitive.**

In [5]:
columns = df.columns
df = df[[columns[3], columns[1], columns[0], columns[4], columns[5], columns[2]]]
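
**The same reordering can be sketched by column name rather than by position, which may be easier to read and does not depend on the order in which the CSV happens to list its columns. The toy frame `toy` is hypothetical; the target order is the one shown in the `df.info()` output of this notebook.**

```python
import pandas as pd

# Hypothetical one-row frame whose columns are in a different order.
toy = pd.DataFrame({
    "Category": ["General Business"],
    "Business Name": ["INTERIOR ILLUSIONS"],
    "Address": [None],
    "License Number": ["100001421-002"],
    "Issue Date": ["2021-09-24"],
    "Expiry Date": ["2022-09-24"],
})

# Desired order, spelled out by name.
ordered_cols = [
    "License Number", "Business Name", "Category",
    "Issue Date", "Expiry Date", "Address",
]

# Indexing with a list of labels returns the columns in that order.
reordered = toy[ordered_cols]
```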

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39394 entries, 0 to 39393
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   License Number  39394 non-null  object
 1   Business Name   39393 non-null  object
 2   Category        39394 non-null  object
 3   Issue Date      39392 non-null  object
 4   Expiry Date     39394 non-null  object
 5   Address         25832 non-null  object
dtypes: object(6)
memory usage: 1.8+ MB


**Modify column dtypes to match their values more appropriately.**

In [7]:
df["Issue Date"] = pd.to_datetime(df["Issue Date"])
df["Expiry Date"] = pd.to_datetime(df["Expiry Date"])
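
**A hedged sketch of how unparseable dates could be detected, assuming some rows might contain malformed date strings. The series `dates` is made up for illustration; `errors="coerce"` is a standard `pd.to_datetime` option that turns unparseable values into `NaT` instead of raising.**

```python
import pandas as pd

# Hypothetical input: one valid date, one malformed string, one missing value.
dates = pd.Series(["2021-09-24", "not a date", None])

# Unparseable strings become NaT rather than raising an exception.
parsed = pd.to_datetime(dates, errors="coerce")

# Values that were present before parsing but are NaT afterwards failed to parse.
n_failed = int(parsed.isna().sum() - dates.isna().sum())
```

**Comparing null counts before and after the conversion, as `df.info()` does, serves the same purpose.**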

**Observe 'Issue Date' and 'Expiry Date' are now datetime.**

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39394 entries, 0 to 39393
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   License Number  39394 non-null  object        
 1   Business Name   39393 non-null  object        
 2   Category        39394 non-null  object        
 3   Issue Date      39392 non-null  datetime64[ns]
 4   Expiry Date     39394 non-null  datetime64[ns]
 5   Address         25832 non-null  object        
dtypes: datetime64[ns](2), object(4)
memory usage: 1.8+ MB


**I chose to include the phone information from the Toronto dataset. In order to merge all 3 city datasets into 1 unified database, the columns need to be consistent across each table.**

**Adding the missing columns.**

In [9]:
df.insert(5,"Phone Number",np.nan)
df.insert(6,"Phone Ext.",np.nan)
df.insert(7,"Phone Type",np.nan)
df.insert(9,"City",'Edmonton')
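
**An alternative sketch, assuming the shared schema is kept in a single list (`TARGET_COLUMNS` is a hypothetical name; its order is the one shown in this notebook). `DataFrame.reindex(columns=...)` adds any missing columns filled with `NaN` and fixes the order in one step, so the positions passed to `insert` do not need to be tracked by hand.**

```python
import pandas as pd

# Assumed shared schema for all city tables.
TARGET_COLUMNS = [
    "License Number", "Business Name", "Category", "Issue Date", "Expiry Date",
    "Phone Number", "Phone Ext.", "Phone Type", "Address", "City",
]

# Hypothetical one-row frame that lacks the phone and city columns.
toy = pd.DataFrame({
    "License Number": ["100001421-002"],
    "Business Name": ["INTERIOR ILLUSIONS"],
    "Category": ["General Business"],
    "Issue Date": pd.to_datetime(["2021-09-24"]),
    "Expiry Date": pd.to_datetime(["2022-09-24"]),
    "Address": [None],
})

# Columns missing from the frame are created and filled with NaN.
unified = toy.reindex(columns=TARGET_COLUMNS)

# The city is constant for this table.
unified["City"] = "Edmonton"
```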

In [10]:
df

| | License Number | Business Name | Category | Issue Date | Expiry Date | Phone Number | Phone Ext. | Phone Type | Address | City |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001421-002 | INTERIOR ILLUSIONS | General Business | 2021-09-24 | 2022-09-24 | NaN | NaN | NaN | NaN | Edmonton |
| 1 | 101224086-002 | OPENING DOORS | General Business | 2021-11-02 | 2022-11-04 | NaN | NaN | NaN | NaN | Edmonton |
| 2 | 101481614-002 | AGP TRUCKING LTD | Delivery/Transportation Services | 2021-09-22 | 2022-10-13 | NaN | NaN | NaN | NaN | Edmonton |
| 3 | 101988181-003 | ELEGANT MOTORS | Vehicle Repair | 2021-03-08 | 2022-04-17 | NaN | NaN | NaN | NaN | Edmonton |
| 4 | 102754139-004 | HAIRCANDY | Minor Retail Store | 2021-11-25 | 2022-12-07 | NaN | NaN | NaN | NaN | Edmonton |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39389 | 7479134-001 | SHIPLEY CONSTRUCTION INC | General Contractor | 2021-12-02 | 2022-12-02 | NaN | NaN | NaN | NaN | Edmonton |
| 39390 | 84167978-001 | FX INC TATTOOS & PIERCING | Personal Service Shop | 2021-01-21 | 2022-02-25 | NaN | NaN | NaN | 1401, 8882 - 170 STREET NW | Edmonton |
| 39391 | 88231442-002 | LOUD MOUTH COMMUNICATIONS | General Business | 2021-01-03 | 2022-01-11 | NaN | NaN | NaN | NaN | Edmonton |
| 39392 | 94584438-002 | MIXES R US | Food Processing | 2021-10-25 | 2022-10-25 | NaN | NaN | NaN | 312 - DUNLUCE ROAD NW | Edmonton |


**Convert dataframe back to CSV for concatenation.**

In [11]:
df.to_csv('/Users/graemebalint/Documents/Python/Jupyter Notebooks/Canadian Businesses/csv_reformatted/edmonton_reformatted.csv',index=False)
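
**One caveat worth sketching: CSV stores dates as plain text, so the datetime dtypes are lost on export and need to be restored when the file is read back for concatenation. The example below is a minimal illustration that uses an in-memory buffer instead of the real file; the second frame `other` holds placeholder values and is not taken from any real dataset.**

```python
import io
import pandas as pd

# Reduced stand-in for the reformatted Edmonton table.
edmonton = pd.DataFrame({
    "License Number": ["100001421-002"],
    "Issue Date": pd.to_datetime(["2021-09-24"]),
    "City": ["Edmonton"],
})

# Write to an in-memory CSV, then rewind so it can be read back.
buffer = io.StringIO()
edmonton.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot store.
reloaded = pd.read_csv(buffer, parse_dates=["Issue Date"])

# Placeholder table for a second city, using the same columns.
other = pd.DataFrame({
    "License Number": ["PLACEHOLDER-001"],
    "Issue Date": pd.to_datetime(["2021-01-01"]),
    "City": ["Toronto"],
})

# Stack the tables row-wise and rebuild a clean 0..n-1 index.
combined = pd.concat([reloaded, other], ignore_index=True)
```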