## **Structured Extraction of Retail Transaction Records**

## **Full_Data**

In [220]:
# Importing the pandas library and giving it an alias 'pd' for data manipulation and analysis
import pandas as pd

In [221]:
# Reading the CSV file named 'raw_data.csv' into a DataFrame called df_full
df_full = pd.read_csv("raw_data.csv")

# Displaying the first 5 rows of the DataFrame to quickly inspect the data
df_full.head()

Unnamed: 0,order_id,customer_name,product,quantity,unit_price,order_date,region
0,1,Diana,Tablet,,500.0,1/20/2024,South
1,2,Eve,Laptop,,,4/29/2024,North
2,3,Charlie,Laptop,2.0,250.0,1/8/2024,
3,4,Eve,Laptop,2.0,750.0,1/7/2024,West
4,5,Eve,Tablet,3.0,,3/7/2024,South


In [222]:
# Displaying a summary of the DataFrame, including:
# - Number of entries (rows)
# - Column names and their data types
# - Number of non-null (non-missing) values per column
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       100 non-null    int64  
 1   customer_name  99 non-null     object 
 2   product        100 non-null    object 
 3   quantity       74 non-null     float64
 4   unit_price     65 non-null     float64
 5   order_date     99 non-null     object 
 6   region         75 non-null     object 
dtypes: float64(2), int64(1), object(4)
memory usage: 5.6+ KB


### 📌 Observation: Missing Values

The dataset contains missing values in several key columns:

- **customer_name** has 1 missing entry (99 out of 100).
- **quantity** is missing in 26 rows (only 74 non-null).
- **unit_price** is missing in 35 rows (only 65 non-null).
- **order_date** has 1 missing entry.
- **region** is missing in 25 rows.
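
The counts above can also be computed directly instead of being read off `info()`. The following is a minimal sketch, assuming a small stand-in frame (`sample`, a hypothetical name) built from the five rows shown by `head()`; on the real data the same two calls would be applied to `df_full`.

```python
import numpy as np
import pandas as pd

# Stand-in frame built from the five rows shown by df_full.head();
# 'sample' is a hypothetical name used only for this illustration.
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_name": ["Diana", "Eve", "Charlie", "Eve", "Eve"],
    "product": ["Tablet", "Laptop", "Laptop", "Laptop", "Tablet"],
    "quantity": [np.nan, np.nan, 2.0, 2.0, 3.0],
    "unit_price": [500.0, np.nan, 250.0, 750.0, np.nan],
    "order_date": ["1/20/2024", "4/29/2024", "1/8/2024", "1/7/2024", "3/7/2024"],
    "region": ["South", "North", np.nan, "West", "South"],
})

# Missing values per column, as a count and as a percentage of rows
missing_count = sample.isna().sum()
missing_pct = sample.isna().mean() * 100

# Combine both views and keep only the columns that actually have gaps
missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
missing_summary = missing_summary[missing_summary["missing"] > 0]
print(missing_summary)
```

Reporting the percentage next to the raw count makes it easier to compare columns across datasets of different sizes.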


In [223]:
# Check for duplicate rows
df_full.duplicated().sum()

1

In [224]:
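# Show every row involved in a duplication
# (keep=False marks all copies, not only the later repeats)
duplicate_rows = df_full[df_full.duplicated(keep=False)]
duplicate_rows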

Unnamed: 0,order_id,customer_name,product,quantity,unit_price,order_date,region
3,4,Eve,Laptop,2.0,750.0,1/7/2024,West
5,4,Eve,Laptop,2.0,750.0,1/7/2024,West


### 🔁 Observation: Duplicate Record

The dataset contains **1 duplicate entry**: the row at index 5 is an exact repetition of the row at index 3:

- **order_id**: 4  
- **customer_name**: Eve  
- **product**: Laptop  
- **quantity**: 2.0  
- **unit_price**: 750.0  
- **order_date**: 1/7/2024  
- **region**: West  

This duplicate should be investigated and possibly removed to maintain data integrity during analysis.
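
If the duplicate turns out to be a genuine repeat, one possible way to remove it is sketched below. The frame `orders` is a hypothetical stand-in containing the duplicated pair shown above plus one unrelated row; keeping the first occurrence is an assumption, not a decision taken in this notebook.

```python
import pandas as pd

# Stand-in rows; index labels 3 and 5 mirror the duplicated pair shown above.
# 'orders' is a hypothetical name used only for this illustration.
orders = pd.DataFrame(
    {
        "order_id": [3, 4, 4],
        "customer_name": ["Charlie", "Eve", "Eve"],
        "product": ["Laptop", "Laptop", "Laptop"],
        "quantity": [2.0, 2.0, 2.0],
        "unit_price": [250.0, 750.0, 750.0],
        "order_date": ["1/8/2024", "1/7/2024", "1/7/2024"],
        "region": [None, "West", "West"],
    },
    index=[2, 3, 5],
)

# duplicated() flags every repeat after the first occurrence
n_duplicates = int(orders.duplicated().sum())

# Keep the first occurrence of each fully identical row and renumber the index
deduplicated = orders.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Because `drop_duplicates()` compares all columns by default, rows that share an `order_id` but differ elsewhere would be kept and would need a separate check.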


In [225]:
# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df_full.describe()

Unnamed: 0,order_id,quantity,unit_price
count,100.0,74.0,65.0
mean,50.48,1.959459,500.0
std,29.043151,0.818271,211.947812
min,1.0,1.0,250.0
25%,25.75,1.0,250.0
50%,50.5,2.0,500.0
75%,75.25,3.0,750.0
max,100.0,3.0,750.0


### ⚠️ Observation: Suspicious Columns

#### 1. Unusual Values in Numeric Columns
From the statistical summary:

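- **`unit_price`** ranges from **250.0 to 750.0**, and its quartiles fall exactly on 250, 500, and 750, so it appears to take only a few fixed price points. The minimum of **1.0** in the summary belongs to **`quantity`** (range 1.0 to 3.0), not to `unit_price`, so the summary itself does not show an implausibly low price for products like laptops, tablets, monitors, and phones.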

#### 2. Significant Missing Values

- **`unit_price`** is missing in **35% of the rows (35 out of 100)**.
- **`quantity`** is missing in **26% of the rows (26 out of 100)**.

These two fields are **crucial for revenue calculations** (e.g., `revenue = quantity × unit_price`). Missing values in such critical columns reduce the dataset's analytical value.
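
To illustrate how the gaps affect a revenue calculation, the sketch below multiplies the two columns on a stand-in frame built from the five `head()` rows (`sample` and the `revenue` column are hypothetical names). A missing value in either operand propagates, so only rows with both fields present yield a result.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the quantity and unit_price values shown by df_full.head()
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "quantity": [np.nan, np.nan, 2.0, 2.0, 3.0],
    "unit_price": [500.0, np.nan, 250.0, 750.0, np.nan],
})

# NaN in either column makes the product NaN for that row
sample["revenue"] = sample["quantity"] * sample["unit_price"]

# Rows where revenue could actually be computed
computable = sample["revenue"].notna()
n_computable = int(computable.sum())
total_revenue = sample.loc[computable, "revenue"].sum()

print(sample)
print(f"{n_computable} of {len(sample)} rows have a computable revenue")
```

In this five-row sample only two rows have both fields, which is the kind of loss the missing-value percentages above imply for the full dataset.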


In [226]:
# Export the df_full DataFrame to a CSV file named 'raw_data.csv' without including the index column
df_full.to_csv("raw_data.csv", index=False)
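
The effect of `index=False` can be seen with a small round trip. This is a minimal sketch using an in-memory buffer and a hypothetical two-row frame (`frame`) rather than the real file.

```python
import io

import pandas as pd

# Hypothetical two-row frame used only for this illustration
frame = pd.DataFrame({"order_id": [1, 2], "product": ["Tablet", "Laptop"]})

# Default behaviour (index=True): row labels are written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'order_id', 'product']
print(list(reloaded_without_index.columns))  # ['order_id', 'product']
```

Writing without the index keeps the file's columns identical to the original seven, so re-reading it does not introduce an extra `Unnamed: 0` column.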



## **Incremental_Data**

In [227]:
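# Reading the CSV file named 'incremental_data.csv' into a DataFrame called df_incremental
df_incremental = pd.read_csv("incremental_data.csv")

# Displaying the first 5 rows of the DataFrame to quickly inspect the data
df_incremental.head()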

Unnamed: 0,order_id,customer_name,product,quantity,unit_price,order_date,region
0,101,Alice,Laptop,,900,5/9/2024,Central
1,102,,Laptop,1.0,300,5/7/2024,Central
2,103,,Laptop,1.0,600,5/4/2024,Central
3,104,,Tablet,,300,5/26/2024,Central
4,105,Heidi,Tablet,2.0,600,5/21/2024,North


In [228]:
# Displaying row count, column data types, and non-null counts per column
df_incremental.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       10 non-null     int64  
 1   customer_name  4 non-null      object 
 2   product        10 non-null     object 
 3   quantity       6 non-null      float64
 4   unit_price     10 non-null     int64  
 5   order_date     10 non-null     object 
 6   region         8 non-null      object 
dtypes: float64(1), int64(2), object(4)
memory usage: 692.0+ bytes


### 📌 Observation: Missing Values

The dataset contains missing values in several key columns:

- **customer_name** → 6 missing entries (4 out of 10 non-null)
- **quantity** → 4 missing entries (6 out of 10 non-null)
- **region** → 2 missing entries (8 out of 10 non-null)

In [229]:
# Check for duplicate rows
df_incremental.duplicated().sum()

0

### 📌 Observation: Duplicate Records

The incremental data contains **no duplicate rows**.


In [230]:
# Summary statistics for the numeric columns
df_incremental.describe()

Unnamed: 0,order_id,quantity,unit_price
count,10.0,6.0,10.0
mean,105.5,1.5,600.0
std,3.02765,0.547723,200.0
min,101.0,1.0,300.0
25%,103.25,1.0,600.0
50%,105.5,1.5,600.0
75%,107.75,2.0,600.0
max,110.0,2.0,900.0


### ⚠️ Observation: Suspicious Columns

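- **unit_price** shows a potential anomaly:
  - The 25th, 50th, and 75th percentiles are all **600**, while the minimum is **300** and the maximum is **900**. With only 10 rows this may reflect different products or price tiers rather than errors, so prices should be checked per product.
  - The column is read as `int64` here but as `float64` in the full data (which contains missing prices), a difference worth keeping in mind if the two datasets are compared.
- **quantity** has only 6 non-null entries out of 10, so 40% of its values are missing, which can affect any analysis of sales volume.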
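
One way to check whether the price spread follows the product is to summarise `unit_price` per product. The sketch below assumes a stand-in frame (`incremental_sample`, a hypothetical name) built from the five rows shown by `head()`; it is an illustration, not a step taken in this notebook.

```python
import pandas as pd

# Stand-in frame built from the five rows shown by df_incremental.head()
incremental_sample = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "product": ["Laptop", "Laptop", "Laptop", "Tablet", "Tablet"],
    "unit_price": [900, 300, 600, 300, 600],
})

# Lowest price, highest price, and number of distinct prices for each product
price_by_product = (
    incremental_sample
    .groupby("product")["unit_price"]
    .agg(["min", "max", "nunique"])
)

# Products sold at more than one price are candidates for a closer look
multi_price = price_by_product[price_by_product["nunique"] > 1]
print(multi_price)
```

In these five rows both products appear at several prices, so the spread is not explained by product type alone.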

In [231]:
# Export the df_incremental DataFrame to a CSV file named 'incremental_data.csv' without including the index column
df_incremental.to_csv("incremental_data.csv", index=False)