![walmartecomm](walmartecomm.jpg)

Walmart is the largest retailer in the United States. Like its competitors, it has been expanding the e-commerce side of its business. By the end of 2022, e-commerce accounted for $80 billion in sales, or 13% of Walmart's total sales. One of the main factors affecting those sales is holidays and major events, such as the Super Bowl, Labor Day, Thanksgiving, and Christmas.

In this project, you have been tasked with creating a data pipeline for analyzing supply and demand around the holidays, along with conducting a preliminary analysis of the data. You will be working with two data sources: grocery sales and complementary data. You have been provided with the `grocery_sales` table in a `PostgreSQL` database with the following features:

# `grocery_sales`
- `"index"` - unique ID of the row
- `"Store_ID"` - the store number
- `"Date"` - the week of sales
- `"Weekly_Sales"` - sales for the given store

Also, you have the `extra_data.parquet` file that contains complementary data:

# `extra_data.parquet`
- `"IsHoliday"` - whether the week contains a public holiday (1 if yes, 0 if no)
- `"Temperature"` - Temperature on the day of sale
- `"Fuel_Price"` - Cost of fuel in the region
- `"CPI"` - Prevailing consumer price index
- `"Unemployment"` - The prevailing unemployment rate
- `"MarkDown1"`, `"MarkDown2"`, `"MarkDown3"`, `"MarkDown4"` - number of promotional markdowns
- `"Dept"` - Department Number in each store
- `"Size"` - size of the store
- `"Type"` - type of the store (depends on `Size` column)

You will need to merge those files and perform some data manipulations. The transformed DataFrame can then be stored as the `clean_data` variable containing the following columns:
- `"Store_ID"`
- `"Month"`
- `"Dept"`
- `"IsHoliday"`
- `"Weekly_Sales"`
- `"CPI"`
- `"Unemployment"`

After merging and cleaning the data, you will have to analyze monthly sales of Walmart and store the results of your analysis as the `agg_data` variable that should look like:

|  Month | Weekly_Sales  | 
|---|---|
| 1.0  |  33174.178494 |
|  2.0 |  34333.326579 |
|  ... | ...  |  

Finally, you should save `clean_data` and `agg_data` as CSV files.

It is recommended to use `pandas` for this project. 

In [None]:
# import required libraries
import pandas as pd
import os
import logging

In [56]:
# import the grocery_sales raw data

Unnamed: 0,index,Store_ID,Date,Dept,Weekly_Sales
0,0,1,2010-02-05,1,24924.50
1,1,1,2010-02-05,26,11737.12
2,2,1,2010-02-05,17,13223.76
3,3,1,2010-02-05,45,37.44
4,4,1,2010-02-05,28,1085.29
...,...,...,...,...,...
231517,232414,24,2011-05-06,8,49471.07
231518,232415,24,2011-05-06,50,1210.00
231519,232416,24,2011-05-06,87,25893.32
231520,232417,24,2011-05-06,85,1357.83


In [58]:

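def extract(grocery_sales_df: pd.DataFrame, parquet_file: str) -> pd.DataFrame:
    """Merge the grocery sales data with the complementary parquet data.

    Parameters:
        grocery_sales_df : grocery sales dataframe
        parquet_file : path to the parquet file with the complementary data

    Returns:
        merged_df : the two sources merged on the "index" column
    """
    try:
        extra_data_df = pd.read_parquet(parquet_file, engine="pyarrow")
        merged_df = pd.merge(extra_data_df, grocery_sales_df, on="index")
    except Exception as e:
        print(f"Error {e} while extracting data...")
        merged_df = pd.DataFrame()

    return merged_df

# extract() expects a DataFrame, so read the sales CSV before passing it in
grocery_sales_df = pd.read_csv("grocery_sales.csv")
merged_df = extract(grocery_sales_df, "extra_data.parquet")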
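
To make the join on `"index"` concrete, here is a minimal sketch on two toy frames. The values are made up for illustration, and `how="inner"` and `validate="one_to_one"` are assumptions about what the join should do; they are not taken from the project brief.

```python
import pandas as pd

# Toy stand-ins for the two sources (made-up values, for illustration only)
sales = pd.DataFrame({
    "index": [0, 1, 2],
    "Store_ID": [1, 1, 2],
    "Weekly_Sales": [24924.50, 11737.12, 9500.00],
})
extra = pd.DataFrame({
    "index": [0, 1, 3],
    "IsHoliday": [0, 1, 0],
    "CPI": [211.1, None, 210.8],
})

# Inner join on the shared key: only "index" values present in both frames survive.
# validate="one_to_one" raises a MergeError if "index" is duplicated on either side.
toy_merged = pd.merge(sales, extra, on="index", how="inner", validate="one_to_one")
print(toy_merged)
```

In this toy case, rows with `"index"` 2 and 3 are dropped because each appears in only one of the two frames.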

### Transforming the data

The following transformations are to be performed on the merged dataframe:

- `fill missing values`
- `add a column "Month"`
- `keep the rows where sales are over 10,000`

The output DataFrame should be stored as `clean_data`
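
The three steps can be tried on a small toy frame first. This is only a sketch: the values are invented, and filling with the column mean is one possible strategy for the missing values, not necessarily the best one for this data.

```python
import pandas as pd

# Toy frame with one missing value per numeric column (made-up values)
toy = pd.DataFrame({
    "Date": ["2010-02-05", "2010-03-12", "2010-03-19"],
    "Weekly_Sales": [24924.50, None, 9000.00],
    "CPI": [211.0, None, 212.0],
})

# 1. Fill missing values with the column mean (an assumed strategy)
toy = toy.fillna({
    "Weekly_Sales": toy["Weekly_Sales"].mean(),
    "CPI": toy["CPI"].mean(),
})

# 2. Derive the month number from the date string
toy["Month"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d").dt.month

# 3. Keep only the rows where sales are above 10,000
toy_clean = toy.loc[toy["Weekly_Sales"] > 10000].reset_index(drop=True)
print(toy_clean)
```

Note that the order matters: a row with missing sales is kept or dropped depending on the value it is filled with before the filter runs.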



In [59]:
# NOTE: There are some null values in CPI, Weekly_Sales and Unemployment;
# they are filled with the column mean below

def transform(raw_df : pd.DataFrame):
    """
    Transform and clean the data
    
    Parameters:
        raw_df : dataframe to be cleaned
        
    Returns: 
        clean_data : cleaned and transformed pandas dataframe
    """
    raw_df.fillna(
        {
            'CPI' : raw_df['CPI'].mean(),
            'Weekly_Sales' : raw_df['Weekly_Sales'].mean(),
            'Unemployment' : raw_df['Unemployment'].mean()
         }, inplace = True
    )
    
    raw_df['Date'] = pd.to_datetime(raw_df['Date'], format="%Y-%m-%d")
        
    
    # create a month column
    raw_df["Month"] = raw_df["Date"].dt.month
    
    # keep the rows where sales are greater than 10,000
    filtered_df = raw_df.loc[raw_df.Weekly_Sales > 10000]
    
    # keep only the required columns
    cols = ["Store_ID","Month","Dept","IsHoliday","Weekly_Sales","CPI","Unemployment"]
    
    clean_data = filtered_df[cols]
    
    return clean_data

clean_data = transform(merged_df)
    
    

Create a function called `avg_monthly_sales()` that takes `clean_data` as input and returns an aggregated DataFrame containing two columns, `"Month"` and `"Avg_Sales"` (rounded to two decimal places).

Call the function and store the results as a variable called `agg_data`.
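
As a rough illustration of the aggregation, the sketch below computes the mean of `"Weekly_Sales"` per month on a toy frame. The data is invented, and the `as_index=False` plus `rename` chain is just one way to arrive at the two required columns.

```python
import pandas as pd

# Toy cleaned data (made-up values)
toy_clean = pd.DataFrame({
    "Month": [1, 1, 2],
    "Weekly_Sales": [20000.0, 30000.555, 15000.0],
})

# Average sales per month, with the result column renamed and rounded to 2 d.p.
toy_agg = (
    toy_clean.groupby("Month", as_index=False)["Weekly_Sales"]
    .mean()
    .rename(columns={"Weekly_Sales": "Avg_Sales"})
    .round({"Avg_Sales": 2})
)
print(toy_agg)
```

Using `as_index=False` keeps `"Month"` as a regular column, so no separate step is needed to rebuild the DataFrame from the grouped index.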

In [60]:
def avg_monthly_sales(clean_df):
    """
    Parameters:
        clean_df : pandas dataframe
    Output:
        aggregated pandas dataframe
    """
    
    # select only the required columns for the analysis
    filtered_df = clean_df[["Month", "Weekly_Sales"]]
    monthly_sales = filtered_df.groupby("Month")["Weekly_Sales"].mean().round(2)
    
    # build the output dataframe with the average sales per month
    agg_data = pd.DataFrame({
        "Month": monthly_sales.index,
        "Avg_Sales": monthly_sales.values
    })
    
    return agg_data


agg_data = avg_monthly_sales(clean_data)
    
    

Save `agg_data` and `clean_data` as CSV files.
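
A quick way to see the effect of `index=False` is to write a toy frame and read it back. This sketch writes to a temporary directory so it leaves no files behind; the file name `toy_agg.csv` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy aggregated data (made-up values)
toy_agg = pd.DataFrame({"Month": [1, 2], "Avg_Sales": [25000.28, 15000.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_agg.csv")
    # index=False keeps the row labels out of the file,
    # so no extra unnamed column appears when it is read back
    toy_agg.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)

print(round_trip.columns.tolist())
```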

In [61]:
def load(full_data, full_data_filepath, agg_data, agg_data_filepath):
    """Save the dataframes as csv files
    
    Parameters:
        full_data : cleaned dataframe
        full_data_filepath : output filename of the cleaned dataframe
        agg_data : aggregated dataframe
        agg_data_filepath : output filename of the aggregated dataframe
    """

    full_data.to_csv(full_data_filepath, index=False)
    agg_data.to_csv(agg_data_filepath, index=False)
    
load(clean_data, "clean_data.csv", agg_data, "agg_data.csv")

Create a `validation()` function that checks whether the two CSV files written by `load()` exist in the current working directory.
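
One possible variation, sketched below, reports every missing file in a single pass instead of stopping at the first one. The helper name `missing_files` is hypothetical and not part of the project; the sketch uses a temporary directory so it does not depend on the real output files.

```python
import tempfile
from pathlib import Path


def missing_files(directory, filenames):
    """Return the names in `filenames` that are not regular files in `directory`."""
    return [name for name in filenames if not (Path(directory) / name).is_file()]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create only one of the two expected files
    (Path(tmp_dir) / "clean_data.csv").write_text("Month,Weekly_Sales\n")
    missing = missing_files(tmp_dir, ["clean_data.csv", "agg_data.csv"])

print(missing)
```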

In [62]:
def validation(file_path : str):
    """
    Check whether a csv file exists at the given path
    
    Parameters:
        file_path : path to the csv file
    """
    
    file_exists = os.path.exists(file_path)
    
    # Raise an exception if path does not exist
    if not file_exists:
        raise Exception(f"File {file_path} Not found")
        
validation("clean_data.csv")
validation("agg_data.csv")