# Data Analyst Professional Practical Exam Submission

# Data Validation and Cleaning Report

## 1. Working Environment and Tools

This analysis was conducted using the Datalab workbook, utilizing the following Python libraries:
- `pandas` for data manipulation and cleaning
- `matplotlib`, `seaborn`, and `plotly` for data visualization
- `numpy`, `scipy` for numerical operations and statistics
- `sklearn` for any machine learning-based analysis (if applicable)

To get an overview of the dataset, a column profile function was defined. This function provides a summarized view of each column, allowing for quick identification of potential issues or health checks across the data. The function was inspired by the "Column Profile" feature found in Power Query and has helped guide the validation and cleaning process.
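
A minimal sketch of what such a column profile helper might look like is shown below. The function name `column_profile`, the selection of statistics, and the toy data are assumptions made for illustration; this is not the original implementation.

```python
import pandas as pd


def column_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: type, uniqueness, nulls, and basic statistics."""
    rows = []
    for name in df.columns:
        col = df[name]
        is_num = pd.api.types.is_numeric_dtype(col)
        counts = col.value_counts()  # nulls are excluded by default
        row = {
            "column": name,
            "dtype": str(col.dtype),
            "unique": col.nunique(),  # nulls are not counted
            "nulls": int(col.isna().sum()),
            "duplicated": int(col.duplicated().sum()),
            "min": col.min() if is_num else None,
            "max": col.max() if is_num else None,
            "mean": col.mean() if is_num else None,
            "median": col.median() if is_num else None,
            "std": col.std() if is_num else None,
            "most_common": counts.index[0] if len(counts) else None,
            "most_common_count": int(counts.iloc[0]) if len(counts) else 0,
        }
        if not is_num:
            # String lengths only make sense for non-numeric columns.
            lengths = col.dropna().astype(str).str.len()
            row["max_str_len"] = int(lengths.max()) if len(lengths) else None
            row["min_str_len"] = int(lengths.min()) if len(lengths) else None
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")


# Toy data for illustration only (not the exam dataset).
toy = pd.DataFrame({
    "week": [1, 1, 2, 3],
    "sales_method": ["Email", "Call", "Email", None],
})
profile = column_profile(toy)
print(profile[["unique", "nulls", "duplicated"]])
```

Running the helper once before and once after cleaning gives two directly comparable profiles.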

### Initial Column Profile Output:
| Column Name         | Data Type    | Unique Values | Null Values | Duplicated Values | Min Value | Max Value | Mean   | Median | Standard Deviation | Most Common Value | Most Common Value Count | Max String Length | Min String Length |
|---------------------|--------------|---------------|-------------|-------------------|-----------|-----------|--------|--------|--------------------|-------------------|------------------------|-------------------|-------------------|
| **week** | int64 | 6 | 0 | 14994 | 1 | 6 | 3.0982666666666665 | 3.0 | 1.656419807092205 | 1 | 3721 | N/A | N/A |
| **sales_method** | object | 5 | 0 | 14995 | N/A | N/A | N/A | N/A | N/A | Email | 7456 | 12 | 4 |
| **customer_id** | object | 15000 | 0 | 0 | N/A | N/A | N/A | N/A | N/A | 00019f95-cd18-4a2a-aa62-512cc6b17ac5 | 1 | 36 | 36 |
| **nb_sold** | int64 | 10 | 0 | 14990 | 7 | 16 | 10.084666666666667 | 10.0 | 1.8122133327416081 | 10 | 3677 | N/A | N/A |
| **revenue** | float64 | 6743 | 1074 | 8256 | 32.54 | 238.32 | 93.93494255349705 | 89.5 | 47.43531224572558 | 51.86 | 11 | N/A | N/A |
| **years_as_customer** | int64 | 42 | 0 | 14958 | 0 | 63 | 4.965933333333333 | 3.0 | 5.044951558865982 | 1 | 2504 | N/A | N/A |
| **nb_site_visits** | int64 | 27 | 0 | 14973 | 12 | 41 | 24.990866666666665 | 25.0 | 3.5009142152079415 | 25 | 1688 | N/A | N/A |
| **state** | object | 50 | 0 | 14950 | N/A | N/A | N/A | N/A | N/A | California | 1872 | 14 | 4 |

---

## 2. Initial Findings and Actions Taken

### Week Column:
- **Observation:** The most common value is `1`, which appears 3,721 times. Given that the product launch spanned only 6 weeks, this may indicate that a large portion of sales occurred in the first week. This is a notable pattern but not necessarily an issue with data quality.
- **Action:** After further diagnostics, this column was confirmed as healthy with no discrepancies.

### Sales Method Column:
- **Observation:** There are 5 unique values, although only 3 methods are expected: `"Email"`, `"Call"`, and `"Email + Call"`. The discrepancy points to typographical errors or inconsistent naming.
- **Action:** After further diagnostics, necessary mappings were applied to ensure that the column reflects the correct and validated sales methods.
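
The mapping step could look roughly like the sketch below. The variant spellings (`"email"`, `"em + call"`) are hypothetical, since the report does not list the actual inconsistent values.

```python
import pandas as pd

# Hypothetical raw labels; the real variant spellings are not listed in this report.
raw = pd.Series(
    ["Email", "email", "Call", "Email + Call", "em + call"],
    name="sales_method",
)

# Map every observed variant (stripped and lowercased) to one of the three valid methods.
canonical = {
    "email": "Email",
    "call": "Call",
    "email + call": "Email + Call",
    "em + call": "Email + Call",
}
normalized = raw.str.strip().str.lower()
cleaned = normalized.map(canonical)

# Labels missing from the mapping become NaN, so fail loudly instead of hiding them.
unmapped = normalized[cleaned.isna()].unique()
assert len(unmapped) == 0, f"Unmapped sales_method values: {unmapped}"

print(cleaned.value_counts())
```

Checking for unmapped labels guards against a new spelling silently turning into a missing value.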

### Revenue Column:
- **Observation:** There are 1,074 null values in the `revenue` column, about 7% of the 15,000 rows. Missing revenue data could skew calculations and affect the analysis of sales method effectiveness.
- **Action:** After plotting the revenue distribution and running skewness diagnostics (revealing a right-skewed distribution), I opted for median imputation. Revenue medians were calculated for each sales method, and these values were used to replace the null rows.
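
A minimal sketch of per-method median imputation is shown below. It assumes the columns are named `sales_method` and `revenue` as in the column profile, and the toy rows are for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "sales_method": ["Email", "Email", "Email", "Call", "Call", "Call"],
    "revenue": [90.0, 100.0, np.nan, 40.0, np.nan, 50.0],
})

# Median revenue within each sales method, aligned back to the original rows.
group_median = df.groupby("sales_method")["revenue"].transform("median")

# Replace only the missing values; observed revenues are left untouched.
df["revenue"] = df["revenue"].fillna(group_median)
print(df)
```

Using the group median rather than a single global median keeps imputed values close to the typical revenue of the method they belong to.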

### Years as Customer Column:
- **Observation:** The maximum value in this column is 63, but the company was founded in 1984. This means the maximum number of years a customer could have been with the company should not exceed the difference between 2023 and 1984, which is 39 years.
- **Action:** After confirming that the dataset includes values for 2023, all values exceeding 39 years were capped at 39. This resolved the discrepancies.
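
The capping rule can be sketched as follows. The constant names and the toy values are illustrative assumptions; only the years 1984 and 2023 come from the observation above.

```python
import pandas as pd

FOUNDED_YEAR = 1984
DATA_YEAR = 2023  # year the report takes as the reference point
max_tenure = DATA_YEAR - FOUNDED_YEAR  # 39

# Toy values for illustration only.
years = pd.Series([0, 3, 39, 47, 63], name="years_as_customer")

# Values above the maximum possible tenure are capped; valid values are unchanged.
capped = years.clip(upper=max_tenure)
print(capped.tolist())
```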

### State Column:
- **Observation:** With 50 unique values, this column appears to align with the number of U.S. states. No significant issues were initially noted.
- **Action:** Further validation was performed to check for misspellings, typos, and inconsistent abbreviations. The column was confirmed as accurate and ready for further analysis.

### Customer ID Column:
- **Observation:** There are 15,000 unique customer IDs with no duplicates or null values.
- **Action:** No further action was required as this column was deemed valid.

---

## 3. Final Column Profile

After completing the validation and cleaning processes, the final column profile appears as follows:
| Column Name         | Data Type    | Unique Values | Null Values | Duplicated Values | Min Value | Max Value | Mean   | Median | Standard Deviation | Most Common Value | Most Common Value Count | Max String Length | Min String Length |
|---------------------|--------------|---------------|-------------|-------------------|-----------|-----------|--------|--------|--------------------|-------------------|------------------------|-------------------|-------------------|
| **week** | int64 | 6 | 0 | 14994 | 1 | 6 | 3.0982666666666665 | 3.0 | 1.656419807092205 | 1 | 3721 | N/A | N/A |
| **sales_method** | object | 3 | 0 | 14997 | N/A | N/A | N/A | N/A | N/A | Email | 7466 | 12 | 4 |
| **customer_id** | object | 15000 | 0 | 0 | N/A | N/A | N/A | N/A | N/A | 00019f95-cd18-4a2a-aa62-512cc6b17ac5 | 1 | 36 | 36 |
| **nb_sold** | int64 | 10 | 0 | 14990 | 7 | 16 | 10.084666666666667 | 10.0 | 1.8122133327416081 | 10 | 3677 | N/A | N/A |
| **revenue** | float64 | 6743 | 0 | 8257 | 32.54 | 238.32 | 95.565964 | 90.95 | 47.985181822124396 | 95.58 | 546 | N/A | N/A |
| **years_as_customer** | int64 | 40 | 0 | 14960 | 0 | 39 | 4.9638 | 3.0 | 5.026294904734961 | 1 | 2504 | N/A | N/A |
| **nb_site_visits** | int64 | 27 | 0 | 14973 | 12 | 41 | 24.990866666666665 | 25.0 | 3.5009142152079415 | 25 | 1688 | N/A | N/A |
| **state** | object | 50 | 0 | 14950 | N/A | N/A | N/A | N/A | N/A | California | 1872 | 14 | 4 |

---

# Analysis

## 1. How Many Customers Were There for Each Approach?
![Sales Method Counts](method_counts.png)
The analysis shows the distribution of customers across the three sales methods:

- **Email method:** 7,466 customers (49.8%)
- **Call method:** 4,962 customers (33.1%)
- **Combined Email + Call method:** 2,572 customers (17.1%)
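
As a quick arithmetic check, the shares can be recomputed from the reported counts. The sketch below starts from the counts themselves; on the raw data, the equivalent would likely be a `value_counts(normalize=True)` call on `sales_method`.

```python
import pandas as pd

# Customer counts per method as reported above.
counts = pd.Series({"Email": 7466, "Call": 4962, "Email + Call": 2572})

# Share of all customers, in percent, rounded to one decimal place.
share = (counts / counts.sum() * 100).round(1)
summary = pd.DataFrame({"customers": counts, "share_pct": share})
print(summary)
```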

### Commentary:

At first glance, it’s clear that the **Email method** has been the most widely used, accounting for nearly half of all customer interactions. This may suggest that email marketing, which requires the least effort from the sales team, was seen as a quick and cost-effective way to reach a large audience. Given the lower time commitment per customer, it’s possible that the sales team prioritized email to maximize reach during the early stages of the product launch.

On the other hand, the **Call method**, which involves a more direct and time-intensive engagement, accounted for a significant portion—33%—of customers. This indicates that a notable share of customers required or preferred more personalized contact, which might reflect the nature of the products (i.e., office tools that potentially require explanations or demonstrations). It also highlights the willingness of the sales team to invest time in higher-value prospects.

The **Combined Email + Call method** was used for 17% of the customers. Despite the smaller share, this approach could be interpreted as a more balanced method, targeting prospects who either showed interest after receiving the email or required follow-up to close the sale. While this method involves both email and a shorter phone call, it’s reasonable to assume that it could lead to more personalized attention and better customer understanding. However, the lower proportion may indicate resource limitations, where the sales team couldn’t apply this method as widely due to time constraints.

### Additional Insights:

- **Strategic Considerations:** The fact that almost half of the customers were reached solely by email suggests it was a central component of the strategy. Depending on how effective this method was (to be explored in further analyses), the company may want to evaluate whether it should continue investing heavily in email campaigns or balance it more evenly with direct calls or the combined method.

- **Efficiency vs. Personalization:** The distribution also underscores a trade-off between efficiency (email) and personalization (calls or combined methods). As the analysis continues, it will be useful to explore whether the methods with fewer customers (calls and combined) resulted in higher sales per customer or a stronger relationship, potentially justifying the increased time investment.


## 2. What Does the Spread of Revenue Look Like Overall? And for Each Method?

![Sales Method Distributions](distribution_sales_methods.png)

### Commentary:

The visual clearly highlights the distinct clustering of revenue distributions across the three sales methods. Key observations include:

- **Email Method:** The distribution for the email method shows a high frequency of revenues clustering around the lower to mid-range (50 to 100). This suggests that while the email method was the most widely used, it tended to generate lower revenue per customer on average, which may indicate that email marketing was more effective for lower-value customers or smaller purchases.

- **Call Method:** The distribution for the call method appears to have a tighter cluster, with most of the revenues concentrated in the 30-60 range. This reflects a lower revenue distribution than the other two methods. Despite the time-intensive nature of calls, this method seems to capture smaller, perhaps more personalized sales interactions. It could be that the sales team targeted specific, lower-value customers who responded better to direct engagement.

- **Email + Call Method:** The distribution for this combined approach is more dispersed than the other two, with most revenues falling in the 150-200 range. This indicates that while fewer customers were reached with this method, those who were tended to generate higher revenue per sale. This could suggest that combining personalized contact with prior email outreach might lead to better customer engagement, and consequently, higher sales. It appears to be the most successful method for generating higher-value purchases.

- **Overall Distribution:** The dashed line representing the overall distribution shows an uneven but clear right-skewed pattern, with the highest frequencies occurring in the lower revenue ranges, particularly between 50 and 100. This suggests that across all methods, most sales were of relatively lower value, with fewer high-revenue sales scattered throughout.

### Additional Insights:

- **Trade-off Between Reach and Revenue:** While the email method reached more customers, it brought in lower revenue per customer, suggesting a trade-off between volume and value. In contrast, the combined approach reached fewer customers but yielded higher revenues, indicating the importance of personalized follow-up after initial outreach.

- **Strategic Implications:** Depending on the company's goals (e.g., high-volume sales or targeting high-revenue customers), the sales strategy could be adjusted to either focus on maintaining high customer engagement through the combined approach or to continue scaling efforts with email campaigns, possibly with more targeted follow-ups for higher-value prospects.


## 3. Was There Any Difference in Revenue Over Time for Each of the Methods?

![Average Revenue for Each Week Over Time by Sales Method](avg_rev_week_sales_method.png)

### Commentary:

The visual tracks the **average revenue** per customer for each week across the three sales methods:

- **Email Method:** The average revenue rises gradually over the six-week period, from 88 in week 1 to 128.1 in week 6. The method is consistent, but its revenue per customer stays well below that of the combined method, suggesting that it may be effective for maintaining steady engagement but not necessarily for driving high-value sales.

- **Email + Call Method:** The average revenue steadily increases over time, with the highest value at 220.8 in week 6. This pattern indicates that combining email with follow-up calls is not only more effective than using emails alone but also continues to drive higher revenue per customer as time progresses. This steady increase highlights the importance of personalized interaction after initial outreach.

- **Call Method:** The average revenue for the call method remains the lowest, ranging from 35.7 in week 1 to 65.4 in week 6. Despite the increase, the call method is significantly less effective in generating higher revenue per customer compared to the other methods. This suggests that while the method could have been useful for specific customer segments, it has not performed as well overall.
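
A weekly breakdown like this one can be produced with a pivot table. The sketch below uses toy rows and assumes the column names `week`, `sales_method`, and `revenue` from the column profile.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2],
    "sales_method": ["Email", "Email", "Call", "Email", "Call", "Call"],
    "revenue": [80.0, 100.0, 40.0, 95.0, 45.0, 55.0],
})

# Rows are weeks, columns are sales methods, cells are mean revenue per customer.
avg_by_week = df.pivot_table(
    index="week", columns="sales_method", values="revenue", aggfunc="mean"
)
print(avg_by_week)

# Swapping the aggregation to "sum" gives total revenue per week instead.
total_by_week = df.pivot_table(
    index="week", columns="sales_method", values="revenue", aggfunc="sum"
)
print(total_by_week)
```

The mean answers "how much does a typical customer spend", while the sum also reflects how many customers each method reached in a given week.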

### Focus on Email and Email + Call Methods:

At this point, it’s clear that the **call method** is the least effective, and further analysis should focus on **email** and **email + call** methods for performance evaluation. However, it’s important to note that this analysis tracks average revenue per customer, and while it gives a good indication of individual customer performance, it does not fully capture the total revenue trend over time, which is key for determining the success of the methods.

---

## 4. Total Revenue Over Time for Email and Email + Call Methods

![Total Revenue Over Time for Email and Email + Call Sales Methods](revenue_over_time_email_email+call.png)

### Commentary:

This visualization tracks **total revenue** over time for the **email** and **email + call** methods. Key observations include:

- **Email Method:** The total revenue for the email method shows a significant drop after the initial product launch, declining steeply from over 247,000 in week 1 to approximately 25,000 by week 6. This steep decline suggests that while email was effective in generating an immediate response from customers during the product launch, its effectiveness decreased sharply over time. The slight stabilization around week 3 coincides with the **follow-up email procedure**, indicating a temporary boost in engagement, but the overall trend continues to decline afterward.

- **Email + Call Method:** In contrast, the **email + call method** demonstrates a more gradual but consistent increase in total revenue, starting at around 20,000 in week 1 and peaking at 128,000 in week 5. The total revenue increases steadily after the **follow-up call in week 2**, indicating that combining personal interaction with prior email outreach is a more successful long-term strategy for driving revenue. However, the slight dip in week 6 suggests that the momentum may not be fully sustained beyond a certain point.

### Additional Insights:

- **Effect of Follow-ups:** Both methods demonstrate the positive effect of follow-ups. The **email method** saw a temporary boost after the follow-up email in week 3, but the long-term revenue trend was still downward. On the other hand, the **email + call method** saw continuous growth after the follow-up call in week 2, showing that the combination of email and call creates a more lasting customer engagement.

- **Strategic Takeaway:** Based on the total revenue trends, it is evident that the **email + call method** outperforms the email method in terms of sustained revenue generation. The personalized interaction, though more resource-intensive, clearly pays off in driving higher total revenue over time. This suggests that the company may want to invest more in strategies that include personal follow-ups for higher-value customers. However, the decision on which method to continue is still open, and further analysis is required.


## 5. Metric Hunt - Discovering the possible metrics

With the initial questions from the Sales Rep answered, this section continues with additional analyses aimed at establishing metrics for both decision-making and tracking. The following calculations were made:

### Composite Score Table
| sales_method   | revenue_mean | revenue_sum | nb_sold_mean | nb_sold_sum | nb_site_visits_mean | years_as_customer_mean | week_mean | time_spent_sum | revenue_per_item | revenue_per_time | composite_score |
|:---------------|-------------:|------------:|-------------:|------------:|--------------------:|-----------------------:|----------:|---------------:|-----------------:|-----------------:|----------------:|
| Email          |        97.01 |   724313.35 |         9.73 |      72639  |               24.75 |                   4.98 |      2.47 |          37330 |              9.97 |            19.40 |           14.22 |
| Email + Call   |       183.80 |   472730.95 |        12.23 |      31444  |               26.77 |                   4.51 |      4.29 |          38580 |             15.03 |            12.25 |           12.99 |
| Call           |        47.65 |   236445.16 |         9.51 |      47187  |               24.42 |                   5.18 |      3.43 |         148860 |              5.01 |             1.59 |            5.62 |

The code for this table generates a **composite score** for each sales method by combining several metrics to evaluate the overall effectiveness of the methods. Here’s a breakdown of the process:

1. **Time Spent Calculation:** A new column (`time_spent`) is added to the dataset, with estimated time spent for each sales method (5 minutes for email, 30 minutes for call, and 15 minutes for the combined method).
   
2. **Aggregation:** The data is grouped by `sales_method` and several metrics are calculated, such as:
   - **Revenue (mean and sum)**
   - **Average number of items sold and total number of items sold**
   - **Average number of site visits**
   - **Average years as a customer**
   - **Average weeks passed since product launch**
   - **Total time spent on each method**

3. **Additional Metrics:** 
   - **Revenue per item** is calculated as the total revenue divided by the total number of items sold.
   - **Revenue per time** is calculated as the total revenue divided by the total time spent on each method.

4. **Composite Score:** A weighted scoring system is applied to the metrics:
   - `Revenue per item` (30% weight)
   - `Revenue per time` (40% weight)
   - `Average site visits` (10% weight)
   - `Average years as customer` (20% weight)
   
   These weighted metrics are summed to generate a **composite score** for each method, which represents the overall effectiveness of each approach. The data is then sorted by this composite score to rank the sales methods.
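
The steps above can be reproduced from the per-method aggregates in the Composite Score Table. The sketch below is not the original code: it starts from the already aggregated (and rounded) values rather than the row-level data, and the variable names are illustrative. The customer counts are those reported in the first analysis question.

```python
import pandas as pd

# Per-method aggregates taken from the Composite Score Table.
agg = pd.DataFrame(
    {
        "customers": [7466, 2572, 4962],
        "revenue_sum": [724313.35, 472730.95, 236445.16],
        "nb_sold_sum": [72639, 31444, 47187],
        "nb_site_visits_mean": [24.75, 26.77, 24.42],
        "years_as_customer_mean": [4.98, 4.51, 5.18],
    },
    index=["Email", "Email + Call", "Call"],
)

# Estimated minutes of sales-team time per customer for each method.
minutes_per_customer = pd.Series({"Email": 5, "Email + Call": 15, "Call": 30})
agg["time_spent_sum"] = agg["customers"] * minutes_per_customer

agg["revenue_per_item"] = agg["revenue_sum"] / agg["nb_sold_sum"]
agg["revenue_per_time"] = agg["revenue_sum"] / agg["time_spent_sum"]

# Weighted sum of the four metrics (the weights add up to 1.0).
weights = {
    "revenue_per_item": 0.3,
    "revenue_per_time": 0.4,
    "nb_site_visits_mean": 0.1,
    "years_as_customer_mean": 0.2,
}
agg["composite_score"] = sum(agg[col] * w for col, w in weights.items())

ranked = agg.sort_values("composite_score", ascending=False)
print(ranked[["revenue_per_item", "revenue_per_time", "composite_score"]].round(2))
```

Because the four metrics are on different scales, `revenue_per_time` dominates the score both through its 40% weight and through its larger range across methods.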

---

### Composite Score Table Insights

1. **Email Method:**
   - **Revenue Efficiency:** The **Email method** shows solid efficiency, with a revenue per item of `9.97` and a revenue per time of `19.40`. This suggests that the method is efficient in generating revenue both per unit sold and in terms of the time invested.
   - **Total Revenue:** It has the highest total revenue of `724,313`, which is achieved by selling the highest number of items (`72,639`).
   - **Years as Customer:** The average years as customer is `4.98`, reflecting the method’s balance in reaching both newer and long-standing customers.
   - **Conclusion:** The **Email method** is the most effective approach overall, boasting a composite score of `14.22`. It delivers high returns with efficient time usage and strong revenue generation, making it the best candidate for continuation.

2. **Email + Call Method:**
   - **Revenue Efficiency:** The **Email + Call method** has the highest revenue per item (`15.03`), suggesting it is particularly effective for driving higher-value purchases per transaction. However, the revenue per time (`12.25`) is lower than the email-only method, reflecting a higher time cost.
   - **Total Revenue:** While the total revenue (`472,731`) is lower than the email method, the combined method shows stronger individual customer sales.
   - **Years as Customer:** The method has an average years as customer of `4.51`, which is slightly lower than the other methods, but still substantial. Additionally, it excels in terms of customer engagement, evidenced by the highest average site visits (`26.77`).
   - **Conclusion:** The **Email + Call method** ranks second with a composite score of `12.99`. This approach is less efficient in terms of time spent but drives higher-value purchases, making it suitable for engaging high-value customers.

3. **Call Method:**
   - **Revenue Efficiency:** The **Call method** performs the worst in terms of revenue per item (`5.01`) and revenue per time (`1.59`), indicating that it is inefficient both in terms of the revenue generated per sale and the time invested.
   - **Total Revenue:** The method has the lowest total revenue (`236,445`), even though it sold a substantial number of items (`47,187`), highlighting its poor performance in driving high-value transactions.
   - **Years as Customer:** Interestingly, the **Call method** has the highest average years as customer (`5.18`), suggesting that it may be better suited for long-term customer relationships. However, the time cost outweighs the benefits, making this method inefficient.
   - **Conclusion:** The **Call method** ranks the lowest with a composite score of `5.62`. It is inefficient and underperforms across most metrics. While it retains long-term customers, the high time cost and low revenue generation make it the least favorable method.

---

### Overall Conclusion:
- The **Email method** is the most effective and efficient, with the highest composite score and the best balance between revenue generation and time spent. It should remain the primary strategy for sales.
- The **Email + Call method** is useful for driving higher-value purchases, but its higher time cost means it should be used selectively, particularly for high-value or long-term customers.
- The **Call method** is highly inefficient and should be reconsidered. It may have a role in maintaining relationships with long-term customers, but it is not effective for generating revenue or managing time efficiently.



## 6. Customer Type Creation and Distribution

### Explanation of Customer Type Calculation

The code calculates and assigns **customer types** based on three key metrics: revenue, years as a customer, and site visits. The thresholds for these metrics are derived from the mean values in the dataset:

1. **Revenue Threshold:** The average revenue is calculated, and customers whose revenue exceeds this threshold are labeled as **High-Value**. Those below the threshold are labeled as **Low-Value**.

2. **Years as Customer Threshold:** A second threshold is defined based on the average years as a customer. For customers whose tenure exceeds this threshold, their label is updated to **Long-Term**, overriding the initial High-Value/Low-Value label if applicable.

3. **Site Visits Threshold:** The final threshold is based on the average number of site visits. Customers with visits above this threshold are labeled as **Engaged**, once again overriding the previous labels if applicable.

Thus, the hierarchy of labeling works as follows:
- If a customer has revenue above the threshold, they are first classified as **High-Value**.
- If they are also a long-term customer, their type changes to **Long-Term**.
- If they are highly engaged in terms of site visits, they are ultimately labeled as **Engaged**.

This approach categorizes customers based on a mix of financial value, loyalty (years), and engagement (site visits), with engagement overriding other classifications.
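
The override hierarchy can be sketched as follows. The toy customers and variable names are illustrative assumptions; the column names follow the column profile.

```python
import numpy as np
import pandas as pd

# Toy customers for illustration only.
df = pd.DataFrame({
    "revenue": [200.0, 50.0, 180.0, 40.0],
    "years_as_customer": [1, 10, 2, 1],
    "nb_site_visits": [20, 22, 30, 21],
})

# Thresholds are the column means, as described above.
rev_thr = df["revenue"].mean()
tenure_thr = df["years_as_customer"].mean()
visits_thr = df["nb_site_visits"].mean()

# Later assignments override earlier ones: Engaged > Long-Term > High/Low-Value.
df["customer_type"] = np.where(df["revenue"] > rev_thr, "High-Value", "Low-Value")
df.loc[df["years_as_customer"] > tenure_thr, "customer_type"] = "Long-Term"
df.loc[df["nb_site_visits"] > visits_thr, "customer_type"] = "Engaged"
print(df)
```

Because each customer ends up with exactly one label, a high-revenue customer with above-average site visits is counted as Engaged rather than High-Value.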

### Customer Type Distribution

Once the customer types are established, the code generates a visualization showing how these types are distributed across the different sales methods (Email, Email + Call, and Call).

![Customer Type Distribution by Sales Method](customer_type_distribution.png)

Key observations from the visualization:

- **Email Method:** This method has a mid-range proportion of **Engaged** customers (52.9%), sitting between the other two methods, followed by **Long-Term** customers (19.1%) and a significant share of **High-Value** customers (12.2%). This indicates that although email reaches a wide customer base, many of those customers may not be the most high-value or engaged prospects.
  
- **Email + Call Method:** The combined method has the largest share of **Engaged** customers (75.2%), which suggests that the personalized approach of combining email and a follow-up call helps retain highly engaged customers. This method also has the highest proportion of **High-Value** customers (15.5%).

- **Call Method:** The Call method has a notable share of **Low-Value** customers (27.5%) and a relatively balanced distribution among the other types. **Engaged** customers are still its largest group (49.4%), but this is the smallest Engaged share of the three methods. This indicates that calls may not be as effective in maintaining engagement as other methods.

---

### Question: Did Sales Methods Target Pre-Determined Customer Segments?

One interesting observation is the apparent distinction in how sales methods performed, particularly in the early stages. Specifically, in the first week, both **Email** and **Email + Call** begin as email-only methods, yet their revenue trajectories differ noticeably before the "call" step in the Email + Call method.

- **Email:** There is a significant decline in total revenue after the initial email outreach.
- **Email + Call:** Total revenue shows an increasing trend prior to the call being made, suggesting that even before the follow-up call, customers responded more positively to the initial email.

#### Discussion:

At first glance, you would expect the **Email** and **Email + Call** methods to perform similarly before the call step, as both methods involve sending the same email initially. However, the discrepancy suggests that the customers targeted by these methods may have been different from the outset.

There are a few possible explanations for this:

1. **Pre-Determined Segmentation:** It’s possible that the sales team pre-segmented customers based on certain characteristics (e.g., engagement, value, or loyalty) and assigned different sales methods accordingly. For instance, more valuable or engaged customers may have been pre-selected for the Email + Call method, expecting that these customers would respond better to personalized follow-up, leading to stronger early results.

2. **Self-Selection:** Another possibility is that the **Email + Call** method targets customers who self-select based on their initial engagement with the email. These customers may have shown higher interest earlier on, which prompted the follow-up call. This could explain why **Email + Call** shows increasing revenue even before the call.

3. **Customer Type Distribution:** As seen in the customer type distribution chart, the **Email + Call** method has a much higher proportion of **Engaged** and **High-Value** customers, compared to the **Email** method, which has a larger share of **Low-Value** customers. This difference in customer profiles could account for the performance gap, even before the distinct sales method (the call) was introduced.

In conclusion, while both methods may have started with an email, the performance gap suggests that customers were likely pre-segmented or self-selected based on early engagement, leading to better results for the **Email + Call** method, even before the calls were made.


## 7. Ridge Regression Analysis

In this section, I conducted a **ridge regression analysis** to evaluate the impact of several features on revenue across different sales methods. The features used in this analysis are:
- **Week** (number of weeks since launch)
- **Number of items sold** (`nb_sold`)
- **Years as a customer** (`years_as_customer`)
- **Composite score** (as defined earlier)

### Reason for Choosing Ridge Regression Over OLS
Initially, **Ordinary Least Squares (OLS)** regression was selected to model the relationship between these features and revenue. However, OLS showed signs of **multicollinearity**, which occurs when independent variables are highly correlated. This can inflate the variance of the coefficient estimates, making the model less reliable. To address this, **ridge regression** was chosen, as it applies a penalty to the size of the coefficients, which stabilizes the estimates when features are correlated and improves the model's robustness.
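
To illustrate the effect of the penalty, the sketch below fits OLS and ridge on synthetic, deliberately collinear features using the closed-form ridge solution in NumPy. This is an illustration of the technique only, not the code used for the analysis; the data, the helper name `ridge_coefficients`, and the penalty `alpha=1.0` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, deliberately collinear features (x2 is almost a copy of x1).
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Center the data so the intercept is handled separately and not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def ridge_coefficients(Xc, yc, alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^-1 X'y."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)


beta_ols = ridge_coefficients(Xc, yc, alpha=0.0)    # alpha = 0 reduces to OLS
beta_ridge = ridge_coefficients(Xc, yc, alpha=1.0)  # hypothetical penalty strength

print("OLS  :", beta_ols)
print("Ridge:", beta_ridge)
```

With nearly identical features, OLS can split the combined effect between them almost arbitrarily, whereas the ridge penalty pulls the coefficient vector toward a smaller, more stable solution.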

### Interpretation of Coefficients:

The plot below illustrates the **ridge regression coefficients** for each feature across the different sales methods (Email, Call, and Email + Call). These coefficients provide insight into how changes in each feature affect revenue. For each feature:

- **Positive Coefficients**: An increase in this feature is associated with an increase in revenue.
- **Negative Coefficients**: An increase in this feature is associated with a decrease in revenue.

**Ridge Coefficients:**  
   The ridge regression coefficients for each feature and method are displayed below. This highlights the relative importance of each feature in the model.

   ![Ridge Coefficients](coeffs.png)

1. **Number of Items Sold (`nb_sold`)**:
   - For all methods, an increase in the number of items sold leads to an increase in revenue, as expected. The positive coefficients suggest that, for each additional unit sold:
     - Revenue increases the most for the **Email + Call method**, with the largest positive coefficient.
     - The **Call method** also shows a significant positive relationship, though to a lesser extent than the combined method.
   
2. **Composite Score**:
   - The **Composite Score** has the largest positive coefficient for **Email + Call**, which suggests that for every unit increase in the composite score, revenue increases significantly. This reflects that the combined method works particularly well when targeting high composite score customers.
   - For **Call**, the composite score also positively impacts revenue, though its effect is weaker compared to the Email + Call method.
   
3. **Years as Customer**:
   - Surprisingly, for **Email + Call**, an increase in the **years as customer** has a strong **negative** impact on revenue. This could indicate diminishing returns on long-term customers when the combined method is used, potentially due to the nature of the product or customer fatigue.
   - For **Call**, the effect is less pronounced, though still slightly negative, whereas **Email** shows a moderate negative impact, indicating that focusing on newer customers might be more fruitful.
   
4. **Week**:
   - The **week** coefficient is small across all methods. This indicates that the number of weeks passed since the start of the sales period doesn't have a substantial impact on revenue for any of the methods once the other features are accounted for.

### Key Takeaway:
- The **number of items sold** and **composite score** are the most influential features positively affecting revenue, particularly in the **Email + Call** method.
- The negative coefficient of **years as customer** suggests that longer tenure does not translate into higher revenue, especially for long-term customers under the combined method.


#### Test Results
Here are the results of the ridge regression model, evaluated using the **Mean Squared Error (MSE)** and **R-squared** values:

| Method           | Ridge Test MSE | Ridge Test R-squared |
|------------------|----------------|----------------------|
| **Email**        | 5.18           | 0.96                 |
| **Call**         | 4.00           | 0.95                 |
| **Email + Call** | 23.24          | 0.97                 |

- **All three methods** show high R-squared values (0.95-0.97), indicating that the models explain most of the variance in revenue.
- The **Email + Call method** has a markedly higher MSE (23.24 versus 4.00-5.18). Because MSE is measured in squared revenue units and this method also has the highest R-squared, the larger error likely reflects the wider spread of revenue in this group rather than a poorer fit.
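
The metrics in the table can be computed along the following lines. This is a minimal sketch, not the notebook's actual code: it fits a closed-form ridge model with NumPy on synthetic data, and the names `fit_ridge` and `evaluate`, the penalty `alpha=1.0`, and the generated data are all assumptions for illustration. It also reports RMSE relative to mean revenue, which makes errors easier to compare across methods whose revenue scales differ.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit on centred data, so the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def evaluate(y_true, y_pred):
    """Return MSE, RMSE, R-squared, and RMSE as a fraction of the mean target."""
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    r2 = float(1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    rmse = mse ** 0.5
    return {"mse": mse, "rmse": rmse, "r2": r2,
            "relative_rmse": rmse / float(y_true.mean())}

# Synthetic stand-in for one sales method (hypothetical data, four features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 100 + X @ np.array([20.0, 5.0, -2.0, 0.5]) + rng.normal(scale=3.0, size=500)

# Simple hold-out split: first 400 rows for training, last 100 for testing.
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

coef, intercept = fit_ridge(X_train, y_train, alpha=1.0)
metrics = evaluate(y_test, X_test @ coef + intercept)
print(metrics)
```

Running the same evaluation separately on each method's subset would yield one row of the table per method.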

#### Variance Inflation Factor (VIF)
VIF was calculated to further assess multicollinearity among the features. As a rule of thumb, a VIF above 5 indicates problematic multicollinearity, and a VIF above 10 indicates severe multicollinearity. Here are the results:

| Feature                  | Email VIF | Call VIF | Email + Call VIF |
|--------------------------|-----------|----------|------------------|
| **Constant**             | 298.95    | 184.84   | 276.79           |
| **Week**                 | 2.64      | 5.75     | 8.23             |
| **Number of Items Sold** | 4.47      | 5.99     | 8.94             |
| **Years as Customer**    | 4.67      | 11.12    | 3.10             |
| **Composite Score**      | 6.93      | 11.26    | 4.07             |

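- Several features exceed the threshold of 5: **composite score** for Email (6.93), all four features for Call, and **week** and **number of items sold** for Email + Call (8.23 and 8.94). **Ridge regression** helps mitigate the effect of this collinearity on the coefficient estimates.
- In the **Call method**, both **years as customer** (11.12) and **composite score** (11.26) have VIF > 10, suggesting that these variables carry largely redundant information in that model.
- The large VIF for the **constant** term typically reflects uncentered features rather than collinearity among the predictors, so it is not interpreted here.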
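
For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on all the others. The sketch below computes it directly with NumPy on synthetic data. The function name `vif` and the generated columns are assumptions for illustration, and the report's own values may have been produced with a library routine instead.

```python
import numpy as np

def vif(X):
    """VIF per column: regress column j on the other columns plus an intercept."""
    n, p = X.shape
    values = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

# Hypothetical features: b is nearly a copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + 0.3 * rng.normal(size=1000)
c = rng.normal(size=1000)

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

In this toy example the two correlated columns receive VIF values well above 5, while the independent column stays close to 1.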

#### Visualizations
Below are some key plots generated during the analysis:

1. **Residual Plots:**  
   These histograms show the distribution of residuals for each method. Ideally, residuals are approximately normally distributed with a mean of zero.

   ![Residual Plots](resid_histograms.png)

2. **QQ Plots:**  
   These quantile-quantile plots help assess the normality of residuals. Deviations from the line indicate departures from normality.

   ![QQ Plots](qq_plots.png)

3. **Residuals vs Fitted:**  
   These plots check for patterns in the residuals versus the fitted values. Systematic patterns, such as curvature or a funnel shape, indicate non-linearity or non-constant variance in the model fit.

   ![Residuals vs Fitted](resid_vs_fitted.png)

4. **Partial Regression Plots:**  
   These plots illustrate the relationship between each feature and the response variable (revenue), adjusting for the other features.

   ![Partial Regression for Email](email_partial_regression.png)  
   ![Partial Regression for Call](call_partial_regression.png)  
   ![Partial Regression for Email + Call](email+call_partial_regression.png)
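
The normality checks behind the residual histograms and QQ plots can also be summarised numerically. The sketch below is an illustration under stated assumptions rather than the code used for the figures: `qq_points` and `qq_correlation` are hypothetical helper names, the residuals are synthetic, and the correlation between theoretical and observed quantiles is used as a simple stand-in for visually judging how closely the QQ points follow the line.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted residuals) for a QQ plot."""
    observed = np.sort(np.asarray(residuals, dtype=float))
    n = observed.size
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    std_normal = NormalDist()
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical, observed

def qq_correlation(residuals):
    """Correlation of QQ points; values near 1 suggest near-normal residuals."""
    theoretical, observed = qq_points(residuals)
    return float(np.corrcoef(theoretical, observed)[0, 1])

# Synthetic residuals: one roughly normal set, one right-skewed set.
rng = np.random.default_rng(1)
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500) - 1.0

print("mean of normal residuals:", normal_resid.mean())
print("QQ correlation (normal):", qq_correlation(normal_resid))
print("QQ correlation (skewed):", qq_correlation(skewed_resid))
```

A lower correlation for one method's residuals would correspond to visible departures from the reference line in its QQ plot.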

---

#### Key Takeaways:
- The **ridge regression** approach was adopted to mitigate multicollinearity. While the model produces reasonably strong results, certain features still show high VIF values, especially **years as customer** and **composite score** in the **Call method** (both above 10).
- The model's performance across methods is solid, as evidenced by high R-squared values. However, the **Email + Call method** exhibits higher error (MSE), and the residuals deviate slightly from normality, suggesting that there is room for further refinement.
- Overall, this step reinforces the validity of the **composite score** approach, showing that it contributes meaningfully to the explanation of revenue.

While the tests indicate some imperfections, the results align with expectations, making this approach a valid way to analyze and predict revenue performance.

# Conclusion

The analysis has provided a clear understanding of how different sales methods perform in terms of customer engagement, revenue generation, and resource efficiency. Here's a summary of the findings:

- **Email** is the most effective method overall, reaching the widest customer base with the highest total revenue. It is the most efficient in terms of both time spent and revenue generated per item.
- **Email + Call** performs better with high-value customers, generating higher revenue per customer and showing continuous growth. However, this method requires more resources, so it should be used selectively.
- **Call** is the least effective method, generating less revenue per unit of time and per item sold. It does maintain long-term customer relationships, but this does not justify the resources spent.

## Recommendations

1. **Continue Prioritizing the Email Method:**
   - The company should continue to invest in email marketing as it reaches a large audience efficiently and maintains steady revenue growth over time.
   
2. **Strategically Use the Email + Call Method:**
   - This method should be reserved for **high-value and engaged customers** where the additional resource investment is justified by higher potential returns. The composite score can be used to identify which customers would benefit most from this approach.

3. **Deprioritize the Call Method:**
   - The call method should be reduced or eliminated unless specifically targeting long-term or low-value customers who show a clear preference for phone communication. The resource cost outweighs the revenue gains.

4. **Track the Composite Score Over Time:**
   - Implement a system to monitor the **composite score** regularly. This metric can guide decisions on customer targeting and help optimize resource allocation for maximum revenue efficiency.

5. **Refine Customer Segmentation Strategy:**
   - Based on the evidence of performance differences, the company should refine its **customer segmentation** process, ensuring that high-value or engaged customers receive the most personalized follow-up methods (e.g., Email + Call), while others are managed through email-only strategies.

By following these recommendations, the company can optimize its sales strategies, ensuring that resources are allocated effectively and that high-value customers are targeted with the appropriate sales method.
