### 1) Clean and Analyse the Dataset

* Create a Jupyter Notebook `data_analysis.ipynb` in which you clean the dataset and analyze the relationship between Sale Price and Quantity for some products.

* Save the cleaned dataset into `sales_data_cleaned.csv`.
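
As a rough illustration of what the cleaning step involves, here is a minimal sketch on a small made-up DataFrame. The column names `SalePrice` and `Quantity` and all values are assumptions for the example; only `TransactionID` is taken from the notebook code.

```python
import pandas as pd

# Made-up rows for illustration only; the real dataset's columns may differ.
toy = pd.DataFrame(
    {
        "TransactionID": [1, 2, 2, 3, 4],
        "SalePrice": [10.0, 12.5, 12.5, None, 8.0],
        "Quantity": [3, 1, 1, 2, 5],
    }
)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]

# Drop incomplete rows, then exact duplicate rows, and renumber the index
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print("original shape:", toy.shape)
print("TransactionIDs with missing values:", rows_with_missing["TransactionID"].tolist())
print("cleaned shape:", cleaned.shape)
```

In this toy data, one row is dropped for a missing price and one as an exact duplicate, so five rows become three.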

In [22]:
import os
import pandas as pd


In [25]:
file_path = os.path.join("..", "data", "raw", "sales_data.csv")
cleaned_file_path = os.path.join("..", "data", "processed", "sales_data_cleaned.csv")
log_file_path = os.path.join("..", "log", "sales_data_cleaning.log")

In [33]:
# Read the CSV file into a DataFrame
df = pd.read_csv(file_path, sep=";")
original_shape = df.shape

# Collect the rows with missing values so their TransactionID can be logged
missing_values = df[df.isnull().any(axis=1)]

# Drop rows with missing values
df = df.dropna()
shape_after_dropna = df.shape

# Drop duplicate rows
df = df.drop_duplicates()
shape_after_dedup = df.shape

# Reset the index without keeping the old one as a column, then save to CSV
df = df.reset_index(drop=True)
df.to_csv(cleaned_file_path, index=False)

# Write a summary of each cleaning step to the log file.
# The shapes were captured after each step above, so every line reports its own stage.
with open(log_file_path, "w") as f:
    f.write("size of the data set: " + str(original_shape) + "\n")
    f.write(
        "Rows with missing values: "
        + str(missing_values["TransactionID"].tolist())
        + "\n"
    )
    f.write(
        "size of the data set after dropping rows with missing values: "
        + str(shape_after_dropna)
        + "\n"
    )
    f.write(
        "size of the data set after dropping duplicate rows: "
        + str(shape_after_dedup)
        + "\n"
    )
    f.write("cleaned file path: " + cleaned_file_path + "\n")

print("Data cleaning is done and saved in the file: ", cleaned_file_path)
print("Log file is saved in the file: ", log_file_path)

Data cleaning is done and saved in the file:  ../data/processed/sales_data_cleaned.csv
Log file is saved in the file:  ../log/sales_data_cleaning.log
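
The task also asks for an analysis of the relationship between Sale Price and Quantity for some products. One possible starting point, sketched here on made-up data, is a per-product Pearson correlation. The column names `Product`, `SalePrice` and `Quantity` are hypothetical and would need to be replaced with the actual column names in `sales_data_cleaned.csv`.

```python
import pandas as pd

# Hypothetical data and column names, for illustration only.
sales = pd.DataFrame(
    {
        "Product": ["A", "A", "A", "B", "B", "B"],
        "SalePrice": [10.0, 12.0, 14.0, 5.0, 6.0, 7.0],
        "Quantity": [30, 20, 10, 8, 10, 12],
    }
)

# Pearson correlation between price and quantity, computed separately per product
correlations = pd.Series(
    {
        product: group["SalePrice"].corr(group["Quantity"])
        for product, group in sales.groupby("Product")
    },
    name="price_quantity_corr",
)

print(correlations)
```

In this made-up data, quantity falls linearly as price rises for product A (correlation close to -1) and rises linearly for product B (correlation close to +1). On real data, a scatter plot per product is usually worth checking alongside the coefficient, since correlation only captures linear relationships.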
