# ETL Extract

## Modules

In [None]:
# Importing the needed modules for extraction
import pandas as pd

## Load and Preview

### Loading

In [None]:
# Loading the raw_data.csv and incremental_data.csv files into pandas dataframes
df_full = pd.read_csv("data/raw_data.csv")
df_incremental = pd.read_csv("data/incremental_data.csv")

### Information

In [None]:
# Printing information about the dataframe created from the raw_data.csv file
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       100 non-null    int64  
 1   customer_name  99 non-null     object 
 2   product        100 non-null    object 
 3   quantity       74 non-null     float64
 4   unit_price     65 non-null     float64
 5   order_date     99 non-null     object 
 6   region         75 non-null     object 
dtypes: float64(2), int64(1), object(4)
memory usage: 5.6+ KB


In [None]:
# Printing information about the dataframe created from the incremental_data.csv file
df_incremental.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       10 non-null     int64  
 1   customer_name  4 non-null      object 
 2   product        10 non-null     object 
 3   quantity       6 non-null      float64
 4   unit_price     10 non-null     float64
 5   order_date     10 non-null     object 
 6   region         8 non-null      object 
dtypes: float64(2), int64(1), object(4)
memory usage: 692.0+ bytes


### Snapshots

In [None]:
# Displaying the first 5 rows
df_full.head()

   order_id  customer_name  product  quantity  unit_price  order_date  region
0         1          Diana   Tablet       NaN       500.0  2024-01-20   South
1         2            Eve   Laptop       NaN         NaN  2024-04-29   North
2         3        Charlie   Laptop       2.0       250.0  2024-01-08     NaN
3         4            Eve   Laptop       2.0       750.0  2024-01-07    West
4         5            Eve   Tablet       3.0         NaN  2024-03-07   South


In [None]:
# Displaying the first 5 rows
df_incremental.head()

   order_id  customer_name  product  quantity  unit_price  order_date   region
0       101          Alice   Laptop       NaN       900.0  2024-05-09  Central
1       102            NaN   Laptop       1.0       300.0  2024-05-07  Central
2       103            NaN   Laptop       1.0       600.0  2024-05-04  Central
3       104            NaN   Tablet       NaN       300.0  2024-05-26  Central
4       105          Heidi   Tablet       2.0       600.0  2024-05-21    North


## Observations

Every column in both `raw_data.csv` and `incremental_data.csv` is aligned with the aim of tracking the sales made by a given business. However, the `order_id` column is not needed for data analysis, so it should be omitted. Moreover, the `quantity` and `unit_price` columns should be multiplied to obtain the total value of a particular sale. The `order_date` column is stored as a string (`object` dtype) in a format that is not very human-readable, so it will need to be converted.
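The three transformations described above can be sketched as follows. This is a minimal illustration on a small made-up frame, not the actual cleaning step; the column name `total_price` and the display format `%d %b %Y` are assumptions chosen for the example.

```python
import pandas as pd

# Small illustrative frame with the same columns as the source files (values are made up)
sample = pd.DataFrame({
    "order_id": [1, 2],
    "customer_name": ["Diana", "Eve"],
    "product": ["Tablet", "Laptop"],
    "quantity": [2.0, 3.0],
    "unit_price": [500.0, 250.0],
    "order_date": ["2024-01-20", "2024-04-29"],
    "region": ["South", "North"],
})

# Drop the identifier column, which is not needed for the analysis
transformed = sample.drop(columns=["order_id"])

# Total value of each sale; `total_price` is a hypothetical column name
transformed["total_price"] = transformed["quantity"] * transformed["unit_price"]

# Parse the ISO date strings, then render them in a more readable form
parsed_dates = pd.to_datetime(transformed["order_date"], format="%Y-%m-%d")
transformed["order_date"] = parsed_dates.dt.strftime("%d %b %Y")

print(transformed)
```

Any row where `quantity` or `unit_price` is still null will produce a null `total_price`, so this step is best run after the null values have been handled.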

For the `raw_data.csv` file:

In [8]:
# Printing the number of null values for each column
df_full_null_counts = df_full.isnull().sum()
print(df_full_null_counts)

order_id          0
customer_name     1
product           0
quantity         26
unit_price       35
order_date        1
region           25
dtype: int64


In [10]:
# Printing the number of duplicate rows
df_full_num_dupes = df_full.duplicated().sum()
print(df_full_num_dupes)

1


For the `incremental_data.csv` file:

In [9]:
# Printing the number of null values for each column
df_incremental_null_counts = df_incremental.isnull().sum()
print(df_incremental_null_counts)

order_id         0
customer_name    6
product          0
quantity         4
unit_price       0
order_date       0
region           2
dtype: int64


In [11]:
# Printing the number of duplicate rows
df_incremental_num_dupes = df_incremental.duplicated().sum()
print(df_incremental_num_dupes)

0


These null values will be handled by:

- Mean imputation
- Mode imputation
- Median imputation

The method chosen depends on the data type and, for quantitative data, on the data distribution.
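One way this choice could be expressed in code is sketched below. It assumes mode imputation for non-numeric columns and, for numeric columns, the median when the distribution is strongly skewed and the mean otherwise. The helper `impute`, the `SKEW_THRESHOLD` cut-off of 1.0, and the sample values are all hypothetical.

```python
import pandas as pd

# Illustrative frame (made-up values) with gaps in numeric and categorical columns
sample = pd.DataFrame({
    "quantity": [1.0, 2.0, None, 3.0, 2.0],
    "unit_price": [250.0, None, 500.0, 750.0, 500.0],
    "region": ["South", None, "North", "South", "West"],
})

# Hypothetical cut-off: above this absolute skewness, prefer the median to the mean
SKEW_THRESHOLD = 1.0


def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with nulls filled per column (assumes no column is entirely null)."""
    result = df.copy()
    for column in result.columns:
        series = result[column]
        if pd.api.types.is_numeric_dtype(series):
            # Quantitative data: median for skewed distributions, mean otherwise
            use_median = abs(series.skew()) > SKEW_THRESHOLD
            fill_value = series.median() if use_median else series.mean()
        else:
            # Categorical data: most frequent value (nulls are ignored by mode())
            fill_value = series.mode().iloc[0]
        result[column] = series.fillna(fill_value)
    return result


imputed = impute(sample)
print(imputed)
```

Mean imputation can introduce fractional values into a count-like column such as `quantity`, which is one reason the median or mode may be preferred there.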

The `raw_data.csv` file has one duplicate row, which will be dropped during data cleaning.
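Dropping the duplicate could look like the sketch below, shown on a made-up frame rather than on `df_full` itself. Keeping the first occurrence and resetting the index are assumptions about how the cleaning step will behave.

```python
import pandas as pd

# Illustrative frame (made-up values) in which the third row repeats the second
sample = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "product": ["Tablet", "Laptop", "Laptop", "Tablet"],
    "quantity": [1.0, 2.0, 2.0, 1.0],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged
num_dupes = sample.duplicated().sum()
print(num_dupes)

# Keep the first occurrence of each row and renumber the index
deduplicated = sample.drop_duplicates(keep="first").reset_index(drop=True)
print(len(deduplicated))
```

Because `duplicated()` compares whole rows by default, two orders that differ in any column are both kept.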

## Saving Raw Copies

In [None]:
# Saving the raw dataframes to the data folder (this overwrites the source files of the same name)
df_full.to_csv("data/raw_data.csv", index=False)
df_incremental.to_csv("data/incremental_data.csv", index=False)