John Ferrara
---------------

# Final Examination: Business Analytics and Data Science

## Instructions:

You are required to complete this take-home final examination by the end of the last week of class. Upload your solutions in **PDF format** as a knitted document that includes your graphs, content, and commentary. This project will showcase your ability to apply the concepts learned throughout the course.

The dataset you will use for this examination is provided as **`retail_data.csv`**, which contains the following variables:

- **Product_ID**: Unique identifier for each product.
- **Sales**: Simulated sales numbers (in dollars).
- **Inventory_Levels**: Inventory levels for each product.
- **Lead_Time_Days**: The lead time in days for each product.
- **Price**: The price of each product.
- **Seasonality_Index**: An index representing seasonality.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import gamma, lognorm, norm

In [9]:
## Read in the data and inspect its structure
retail_df = pd.read_csv("synthetic_retail_data.csv")
retail_df.info()  # info() prints its summary directly and returns None, so it needs no print() wrapper
print('---')
print(retail_df.describe())
print('---')
# print(retail_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Product_ID         200 non-null    int64  
 1   Sales              200 non-null    float64
 2   Inventory_Levels   200 non-null    float64
 3   Lead_Time_Days     200 non-null    float64
 4   Price              200 non-null    float64
 5   Seasonality_Index  200 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 9.5 KB

## Problem 1: Business Risk and Revenue Modeling

### Context:
You are a data scientist working for a retail chain that models sales, inventory levels, and the impact of pricing and seasonality on revenue. Your task is to analyze various distributions that can describe sales variability and forecast potential revenue.

### Part 1: Empirical and Theoretical Analysis of Distributions (5 Points)

#### Task:

1. **Generate and Analyze Distributions:**

   - **X ~ Sales**: Consider the `Sales` variable from the dataset. Assume it follows a **Gamma distribution** and estimate its shape and scale parameters using the `fitdistr` function from the **MASS** package.
   - **Y ~ Inventory Levels**: Assume that the sum of inventory levels across similar products follows a **Lognormal distribution**. Estimate the parameters for this distribution.
   - **Z ~ Lead Time**: Assume that `Lead_Time_Days` follows a **Normal distribution**. Estimate the mean and standard deviation.

In [12]:
## Starting with X (Sales)
retail_df['Sales']

0       158.439522
1       278.990204
2       698.858683
3      1832.394674
4       459.703882
          ...     
195     493.610495
196     814.089770
197     525.058515
198     369.390301
199    1649.189739
Name: Sales, Length: 200, dtype: float64
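
The task names R's `fitdistr` from the **MASS** package, while this notebook is written in Python. A rough Python counterpart is maximum-likelihood fitting with `scipy.stats`. The sketch below is one possible approach, not a prescribed solution: the helper name `fit_distributions` is hypothetical, pinning the location parameter at zero (`floc=0`) is an assumption made so that the Gamma and Lognormal fits each have two free parameters, and the Lognormal is fitted to the inventory values as given rather than to sums across products.

```python
import numpy as np
from scipy.stats import gamma, lognorm, norm

def fit_distributions(sales, inventory, lead_time):
    """Maximum-likelihood fits for the three assumed distributions.

    floc=0 pins the location parameter at zero (an assumption), so the
    Gamma and Lognormal fits estimate two parameters each.
    """
    # Gamma: scipy returns (shape, loc, scale)
    g_shape, _, g_scale = gamma.fit(sales, floc=0)
    # Lognormal: scipy's shape is the sd of log(Y); its scale is exp(mean of log(Y))
    ln_sigma, _, ln_scale = lognorm.fit(inventory, floc=0)
    # Normal: scipy returns (mean, standard deviation)
    n_mu, n_sigma = norm.fit(lead_time)
    return {
        "gamma": {"shape": g_shape, "scale": g_scale},
        "lognormal": {"meanlog": np.log(ln_scale), "sdlog": ln_sigma},
        "normal": {"mean": n_mu, "sd": n_sigma},
    }

# Possible usage with the notebook's DataFrame:
# params = fit_distributions(retail_df["Sales"],
#                            retail_df["Inventory_Levels"],
#                            retail_df["Lead_Time_Days"])
```

Fixing `floc=0` requires strictly positive data for the Gamma and Lognormal fits, which is worth checking before running it on the real columns.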

2. **Calculate Empirical Expected Value and Variance:**

   - Calculate the **empirical mean and variance** for all three variables.
   - Compare these empirical values with the **theoretical values** derived from the estimated distribution parameters.
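
For the comparison, the theoretical moments follow from the standard formulas: a Gamma with shape k and scale θ has mean kθ and variance kθ²; a Lognormal with log-mean μ and log-sd σ has mean exp(μ + σ²/2) and variance (exp(σ²) − 1)·exp(2μ + σ²); a Normal has mean μ and variance σ². A minimal sketch, with hypothetical helper and parameter names:

```python
import numpy as np

def theoretical_moments(dist, **p):
    """Mean and variance implied by fitted parameters (hypothetical helper)."""
    if dist == "gamma":        # shape k, scale theta
        return p["shape"] * p["scale"], p["shape"] * p["scale"] ** 2
    if dist == "lognormal":    # meanlog mu, sdlog sigma
        mu, s2 = p["meanlog"], p["sdlog"] ** 2
        return np.exp(mu + s2 / 2), (np.exp(s2) - 1) * np.exp(2 * mu + s2)
    if dist == "normal":       # mean mu, sd sigma
        return p["mean"], p["sd"] ** 2
    raise ValueError(f"unknown distribution: {dist}")

def empirical_moments(x):
    """Sample mean and unbiased sample variance (ddof=1)."""
    x = np.asarray(x, dtype=float)
    return x.mean(), x.var(ddof=1)

# Possible usage:
# empirical_moments(retail_df["Sales"])
# theoretical_moments("gamma", shape=..., scale=...)  # fill in the fitted values
```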

### Part 2: Probability Analysis and Independence Testing (5 Points)

#### Task:

1. **Empirical Probabilities**  
   For the `Lead_Time_Days` variable (assumed to be normally distributed), calculate the following empirical probabilities:

   - \\( P(Z > \mu \mid Z > \mu - \sigma) \\)
   - \\( P(Z > \mu + \sigma \mid Z > \mu) \\)
   - \\( P(Z > \mu + 2\sigma \mid Z > \mu) \\)
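
Each conditional probability can be estimated by restricting the sample to the conditioning event and counting the share that also satisfies the target event. The sketch below assumes μ and σ are taken as the sample mean and sample standard deviation; the function names are hypothetical.

```python
import numpy as np

def conditional_prob(z, a, b):
    """Empirical P(Z > a | Z > b): among observations above b, the share above a."""
    z = np.asarray(z, dtype=float)
    given = z[z > b]
    if given.size == 0:
        return float("nan")  # conditioning event never occurs in the sample
    return float(np.mean(given > a))

def lead_time_probabilities(z):
    """The three conditional probabilities, using the sample mean and sd."""
    z = np.asarray(z, dtype=float)
    mu, sigma = z.mean(), z.std(ddof=1)
    return (
        conditional_prob(z, mu, mu - sigma),       # P(Z > mu | Z > mu - sigma)
        conditional_prob(z, mu + sigma, mu),       # P(Z > mu + sigma | Z > mu)
        conditional_prob(z, mu + 2 * sigma, mu),   # P(Z > mu + 2 sigma | Z > mu)
    )

# Possible usage:
# lead_time_probabilities(retail_df["Lead_Time_Days"])
```

For reference, an exactly normal variable would give roughly 0.594, 0.317, and 0.046 for these three quantities, so large departures from those values hint at non-normal lead times.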

2. **Correlation and Independence**  

   - Investigate the **correlation** between `Sales` and `Price`.  
     Create a **contingency table** using **quartiles** of `Sales` and `Price`, and then evaluate the **marginal** and **joint probabilities**.
   - Use **Fisher’s Exact Test** and the **Chi-Square Test** to check for **independence** between `Sales` and `Price`.  
     **Discuss which test is most appropriate and why.**
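
One way to approach the contingency-table task is sketched below. It bins both variables into quartiles with `pd.qcut`, derives joint and marginal probabilities from the counts, and runs a Chi-Square test on the 4×4 table. Because support for tables larger than 2×2 in `scipy.stats.fisher_exact` depends on the SciPy version, the sketch applies Fisher's Exact Test to a 2×2 median split instead; that substitution is an assumption of this example, not something the task specifies. The function names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact

def quartile_table(df, a="Sales", b="Price"):
    """Cross-tabulate quartile bins of two numeric columns."""
    labels = ["Q1", "Q2", "Q3", "Q4"]
    qa = pd.qcut(df[a], 4, labels=labels)
    qb = pd.qcut(df[b], 4, labels=labels)
    return pd.crosstab(qa, qb)

def independence_tests(df, a="Sales", b="Price"):
    counts = quartile_table(df, a, b)
    joint = counts / counts.to_numpy().sum()   # joint probabilities
    marginal_a = joint.sum(axis=1)             # marginal distribution of a
    marginal_b = joint.sum(axis=0)             # marginal distribution of b

    # Chi-Square test of independence on the full 4x4 table
    chi2, chi2_p, dof, _expected = chi2_contingency(counts.to_numpy())

    # Fisher's Exact Test on a 2x2 median split (assumption of this sketch)
    two_by_two = pd.crosstab(df[a] > df[a].median(), df[b] > df[b].median())
    _odds, fisher_p = fisher_exact(two_by_two.to_numpy())

    return {"counts": counts, "joint": joint,
            "marginal_a": marginal_a, "marginal_b": marginal_b,
            "chi2": chi2, "chi2_p": chi2_p, "dof": dof, "fisher_p": fisher_p}

# Possible usage:
# results = independence_tests(retail_df)
```

With 200 rows spread over 16 cells, the expected count per cell under independence is about 12.5, which is relevant when weighing the large-sample Chi-Square approximation against an exact test.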

## Problem 2: Advanced Forecasting and Optimization (Calculus) in Retail

### Context:
You are working for a large retail chain that wants to optimize pricing, inventory management, and sales forecasting using data-driven strategies. Your task is to use regression, statistical modeling, and calculus-based methods to make informed decisions.

### Part 1: Descriptive and Inferential Statistics for Inventory Data (5 Points)

#### Task:

1. **Inventory Data Analysis:**

   - Generate **univariate descriptive statistics** for the `Inventory_Levels` and `Sales` variables.
   - Create appropriate **visualizations** such as **histograms** and **scatterplots** for `Inventory_Levels`, `Sales`, and `Price`.
   - Compute a **correlation matrix** for `Sales`, `Price`, and `Inventory_Levels`.
   - Test the hypotheses that the **correlations between the variables are zero** and provide a **95% confidence interval**.

2. **Discussion:**

   - Explain the meaning of your findings and discuss the **implications of the correlations for inventory management**.
   - Would you be concerned about **multicollinearity** in a potential regression model? Why or why not?
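
The correlation tests in task 1 could be sketched as follows. The example uses `scipy.stats.pearsonr` for the test of zero correlation and a Fisher z-transform for the 95% confidence interval; choosing the Fisher z interval is an assumption of this sketch, and `correlation_test` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def correlation_test(x, y, confidence=0.95):
    """Pearson r, p-value for H0: rho = 0, and a Fisher-z confidence interval."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    r, p_value = stats.pearsonr(x, y)
    z = np.arctanh(r)                          # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)                  # approximate standard error of z
    z_crit = stats.norm.ppf(0.5 + confidence / 2)
    lo, hi = np.tanh([z - z_crit * se, z + z_crit * se])  # back-transform to r scale
    return {"r": float(r), "p_value": float(p_value), "ci": (float(lo), float(hi))}

# Possible usage:
# retail_df[["Sales", "Price", "Inventory_Levels"]].corr()   # correlation matrix
# correlation_test(retail_df["Sales"], retail_df["Price"])
```

Running the helper on each of the three variable pairs gives one p-value and one interval per correlation; an interval that contains zero is consistent with failing to reject the null hypothesis at the matching level.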

### Part 2: Linear Algebra and Pricing Strategy (5 Points)

#### Task:

1. **Price Elasticity of Demand:**

   - Use **linear regression** to model the relationship between `Sales` and `Price` (assuming `Sales` as the dependent variable).
   - **Invert the correlation matrix** from your model, and calculate the **precision matrix**.
   - Discuss the implications of the **diagonal elements of the precision matrix** (which are **variance inflation factors**).
   - Perform **LU decomposition** on the correlation matrix and interpret the results in the context of **price elasticity**.
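
A minimal sketch of these steps, under a few stated assumptions: the regression is fitted with `np.polyfit`, the elasticity is reported as a point elasticity at the sample means (slope × mean price / mean sales), and the LU factors come from `scipy.linalg.lu`, which returns `P`, `L`, `U` with `corr = P @ L @ U`. Both function names are hypothetical.

```python
import numpy as np
from scipy.linalg import lu

def price_regression(price, sales):
    """Simple linear regression Sales = intercept + slope * Price."""
    price, sales = np.asarray(price, dtype=float), np.asarray(sales, dtype=float)
    slope, intercept = np.polyfit(price, sales, 1)
    # Point elasticity at the sample means (an assumption of this sketch)
    elasticity = slope * price.mean() / sales.mean()
    return slope, intercept, elasticity

def precision_and_lu(X):
    """X holds one column per variable, e.g. Sales and Price."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # correlation matrix
    precision = np.linalg.inv(corr)       # precision matrix
    vif = np.diag(precision)              # diagonal entries = variance inflation factors
    P, L, U = lu(corr)                    # corr == P @ L @ U
    return corr, precision, vif, (P, L, U)

# Possible usage:
# price_regression(retail_df["Price"], retail_df["Sales"])
# precision_and_lu(retail_df[["Sales", "Price"]].to_numpy())
```

For two variables with correlation r, each diagonal entry of the precision matrix equals 1 / (1 − r²), so values close to 1 indicate little shared variance between `Sales` and `Price`.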

### Part 3: Calculus-Based Probability & Statistics for Sales Forecasting (5 Points)

#### Task:

1. **Sales Forecasting Using Exponential Distribution:**

   - Identify a variable in the dataset that is skewed to the right (e.g., `Sales` or `Price`) and fit an **exponential distribution** to this data using the `fitdistr` function.
   - Generate **1,000 samples** from the fitted exponential distribution and compare a **histogram** of these samples with the original data's histogram.
   - Calculate the **5th and 95th percentiles** using the **cumulative distribution function (CDF)** of the exponential distribution.
   - Compute a **95% confidence interval** for the original data assuming normality and compare it with the empirical percentiles.

2. **Discussion:**

   - Discuss how well the exponential distribution models the data and what this implies for forecasting future sales or pricing.
   - Consider whether a different distribution might be more appropriate.
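
The exponential-fit steps in task 1 might look like the sketch below. For an exponential distribution the maximum-likelihood rate is 1 / mean, which is the quantity `fitdistr` reports, so no optimizer is needed. Two assumptions are made here: the "95% confidence interval assuming normality" is read as an interval for the mean (mean ± z · sd / √n), and `exponential_summary` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import expon, norm

def exponential_summary(x, n_samples=1000, seed=0):
    x = np.asarray(x, dtype=float)
    rate = 1.0 / x.mean()                      # MLE of the exponential rate
    fitted = expon(scale=1.0 / rate)
    samples = fitted.rvs(size=n_samples, random_state=seed)
    p5, p95 = fitted.ppf([0.05, 0.95])         # inverse CDF: -ln(1 - p) / rate
    z = norm.ppf(0.975)
    half_width = z * x.std(ddof=1) / np.sqrt(x.size)
    return {
        "rate": rate,
        "samples": samples,
        "exp_percentiles": (p5, p95),
        "normal_ci_mean": (x.mean() - half_width, x.mean() + half_width),
        "empirical_percentiles": tuple(np.percentile(x, [5, 95])),
    }

# Possible usage:
# out = exponential_summary(retail_df["Sales"])
# plt.hist(retail_df["Sales"], bins=30, alpha=0.5, density=True, label="data")
# plt.hist(out["samples"], bins=30, alpha=0.5, density=True, label="exponential samples")
# plt.legend()
```

Plotting both histograms with `density=True` puts 200 observations and 1,000 simulated draws on the same vertical scale, which makes the shape comparison fairer.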

### Part 4: Regression Modeling for Inventory Optimization (10 Points)

#### Task:

1. **Multiple Regression Model:**

   - Build a **multiple regression model** to predict `Inventory_Levels` based on `Sales`, `Lead_Time_Days`, and `Price`.
   - Provide a **full summary** of your model, including **coefficients**, **R-squared value**, and **residual analysis**.

2. **Optimization:**

   - Use your model to **optimize inventory levels** for a **peak sales season**, balancing **minimizing stockouts** with **minimizing overstock**.
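
A minimal sketch of both tasks using only NumPy is shown below. The regression is ordinary least squares with an intercept. The optimization step is purely illustrative and rests on an assumption not stated in the task: the target inventory is the model prediction plus a buffer equal to a chosen quantile of the residuals, so a higher `service_level` trades more overstock for fewer stockouts. Both function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """OLS with an intercept. Returns coefficients, R-squared, and residuals."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta, 1.0 - ss_res / ss_tot, resid

def target_inventory(beta, resid, x_new, service_level=0.95):
    """Prediction plus a residual-quantile buffer (illustrative assumption)."""
    pred = beta[0] + np.dot(np.asarray(x_new, dtype=float), beta[1:])
    return pred + np.quantile(resid, service_level)

# Possible usage:
# X = retail_df[["Sales", "Lead_Time_Days", "Price"]].to_numpy()
# beta, r2, resid = fit_ols(X, retail_df["Inventory_Levels"].to_numpy())
# target_inventory(beta, resid, x_new=[...])  # peak-season Sales, Lead_Time_Days, Price
```

The returned residuals can be plotted against the fitted values or passed to a normality check as part of the residual analysis.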