### **Introduction**

In this analysis, we aimed to identify the small businesses most likely to respond to a wave-2 mailing campaign for upgrading to QuickBooks version 3.0. Our approach was data-driven, leveraging a dataset of 75,000 small businesses provided by Intuit. This dataset, a subset of 801,821 businesses contacted in the wave-1 mailing, included detailed information on each business's interaction with Intuit Direct, their purchasing history, and their response to the initial mailing campaign.

 
 
**Q1. Describe how you developed your predictive models, and discuss predictive performance for each model**


To achieve our goal, we employed two predictive modeling techniques: Logistic Regression and Neural Networks. These models were chosen for their ability to handle binary classification tasks effectively and to accommodate the varied nature of our dataset. The development of these predictive models was guided by a meticulous exploratory data analysis, which helped us understand the underlying patterns and relationships within our data. By identifying key predictors of a positive response to the mailing campaign, we were able to refine our models for greater accuracy. We also used a variety of performance metrics to evaluate the models' effectiveness. 




### **Data Preprocessing**

We first investigated the feature variables and data types to understand the nature of the dataset. We identified the target variable, res1, which indicates whether a business responded positively to the wave-1 mailing campaign. We created a dummy variable, res1_yes, to represent the target variable in a binary format, with 1 indicating a positive response and 0 indicating no response. We then examined the distribution of the target variable and found that the dataset was imbalanced, with only 4.76% of businesses in the training dataset responding positively to the initial mailing campaign. We addressed this imbalance by including the positive responses from the test data when training our model. These businesses would not have been used when calculating the model's performance in any case, because we cannot send the offer to businesses that have already responded. This approach allowed us to train our model with a larger number of positive responses, enabling us to better understand the underlying patterns and improve the model's predictive power. Lastly, we converted the zip_bins variable, which categorizes the zip codes of the businesses into 20 bins, into a categorical variable to avoid imposing a linear relationship between the zip code and the target variable.

We then performed some basic exploratory analysis and confirmed the absence of missing values across the dataset. ***Given the large differences in means and standard deviations across features, we recognized the need to standardize the data before interpreting the importance of the features. Standardization is vital for our modeling techniques, particularly neural networks, which are sensitive to the scale of input features.***
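
As a minimal sketch of the standardization step, the snippet below converts a feature to z-scores using only the Python standard library. The feature values are hypothetical and the helper name `standardize` is ours; the report does not show the exact routine that was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column has no spread, so map it to zeros
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical values for two features on very different scales
dollars = [20.0, 150.0, 75.0, 300.0, 55.0]
numords = [1, 3, 2, 5, 1]

dollars_z = standardize(dollars)
numords_z = standardize(numords)
print([round(z, 2) for z in dollars_z])
print([round(z, 2) for z in numords_z])
```

After this transformation both features have mean 0 and standard deviation 1, so their coefficients or importances can be compared on a common scale. In practice the mean and standard deviation would normally be estimated on the training rows and reused for the test rows.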

### **Logistic Regression**

**Base Model**

We ran the base model (lr) with res1 as the target variable and all other variables as predictors. The model was significant with a p-value of <0.001 and a decent Pseudo R-squared value of 0.121. This value indicates that the model has moderate predictive power, but there is room for improvement. The AUC was 0.755, which indicates that the model has a good ability to distinguish between positive and negative responses. There are several variables (sex, bizflag, and sincepurch) that are not significant at the 0.05 level. We performed further analysis to determine if these variables should be included in the model.
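
For readers unfamiliar with AUC, the sketch below computes it directly from its definition: the probability that a randomly chosen responder receives a higher predicted score than a randomly chosen non-responder. The labels and scores are hypothetical, and this pairwise version is for illustration only; it is not the routine used to produce the 0.755 figure.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical responses (1 = responded) and predicted probabilities
labels = [1, 0, 1, 0, 0, 1]
scores = [0.30, 0.05, 0.12, 0.08, 0.15, 0.40]
print(round(auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of rows, so a rank-based implementation would be preferable on a dataset of 75,000 businesses.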

**Feature Selection**

We tested the base model with and without the three non-significant predictors to determine if the model's performance improved with their inclusion. The Pseudo R-squared for both models was 0.121, indicating that the model's predictive power did not improve with the inclusion of these variables. The chi-squared test was used to compare the fit of the two models. The p-value of 0.99 was much higher than an acceptable alpha level of 0.05, indicating that the addition of these variables did not significantly improve the model's ability to explain the variability in res1. 
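
The nested-model comparison above can be sketched as a likelihood-ratio test. The log-likelihood values and the degrees of freedom below are hypothetical placeholders, not the fitted values from our models, and the pseudo R-squared is assumed here to be McFadden's version. The chi-squared tail probability is computed with closed-form expressions that hold for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        total = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer terms
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def likelihood_ratio_test(ll_reduced, ll_full, df):
    """Return the LR statistic and its p-value for two nested models."""
    stat = 2.0 * (ll_full - ll_reduced)
    return stat, chi2_sf(stat, df)


def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo R-squared: 1 - LL(model) / LL(intercept only)."""
    return 1.0 - ll_model / ll_null


# Hypothetical log-likelihoods (illustration only)
ll_null, ll_full, ll_reduced = -1000.0, -880.0, -880.2
stat, p_value = likelihood_ratio_test(ll_reduced, ll_full, df=4)
print(round(stat, 3), round(p_value, 3))
print(round(mcfadden_r2(ll_full, ll_null), 3))
```

A large p-value, as in our comparison, means the extra predictors do not improve the fit by more than would be expected by chance.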

We then ran a new model (lrM2) without the sex, bizflag, and sincepurch variables to check whether the odds ratios of the remaining variables changed, as a test for multicollinearity. A material change in those odds ratios would suggest that the removed variables are correlated with other predictors in the model, in which case excluding them could introduce omitted variable bias. After examining the odds ratios, we found that excluding the non-significant variables did not materially change the odds ratios of the other variables. This suggests that multicollinearity is not a problem in the model, and that we could remove the non-significant variables without introducing bias.

Finally, before removing the variables, we tested for interactions between variables to see whether the effect of one predictor depends on the level of another.

**Q3. If you created new variables to include in the model, please describe these as well**

**Interactions Testing**

Through exploratory data analysis, we identified that zip_bins had potential interactions with other variables. We created a function to test for interactions between zip_bins and all other variables, since zip_bins has the greatest number of categories and adds complexity to the model. The results of this test suggest that there are significant interactions between zip_bins and sex, numords, dollars, last, sincepurch, version1, and upgraded.
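
The structure of such an interaction-testing loop can be sketched as follows. This snippet only generates one candidate model formula per interaction term; the model fitting and the chi-squared comparison against the main-effects model are omitted. The R-style formula syntax and the helper name `interaction_formulas` are assumptions for illustration, not the function we actually wrote.

```python
# Predictor names as used in this report
predictors = ["zip_bins", "sex", "bizflag", "numords", "dollars", "last",
              "sincepurch", "version1", "owntaxprod", "upgraded"]


def interaction_formulas(target, predictors, anchor):
    """Build one formula per candidate interaction between anchor and another predictor."""
    base = " + ".join(predictors)
    formulas = {}
    for var in predictors:
        if var == anchor:
            continue  # a variable is not interacted with itself
        term = f"{anchor}:{var}"
        formulas[term] = f"{target} ~ {base} + {term}"
    return formulas


candidates = interaction_formulas("res1_yes", predictors, "zip_bins")
for term, formula in candidates.items():
    print(term, "->", formula)
```

Each candidate formula would then be fit and compared with the main-effects model, keeping the interaction terms whose test is significant at the chosen alpha level.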

The significant interaction terms highlight that the influence of geographical location (as categorized into zip code bins) on the likelihood of responding to the upsell campaign is moderated by factors such as the number of orders, the time since the last order, the time since the original QuickBooks purchase, the QuickBooks version owned, and whether the customer previously purchased tax software or upgraded their QuickBooks software. These interactions suggest complex relationships that warrant further investigation to understand the underlying dynamics and to tailor marketing strategies accordingly.

We ran the new model (lrM3) including the significant interaction terms with zip_bins and the other variables. The model was significant with a p-value of <0.001 and a Pseudo R-squared value of 0.131, indicating a modest but meaningful improvement in explanatory power compared to the base model. However, the inclusion of interaction terms increased the complexity of the model, and we needed to evaluate the significance of these terms to ensure that the model remains interpretable and practically useful for targeting the QuickBooks upsell campaign. 

Throughout the process of refining our predictive model for the QuickBooks upsell campaign, we iteratively tested the significance of various variables and their interactions. Initially, we included all potential predictors and their interactions, specifically focusing on the interactions between geographical locations (zip_bins) and other variables. After identifying significant interactions, we evaluated the model's complexity and the significance of these terms. Non-significant interaction terms, along with variables such as dollars and sincepurch, which lost significance upon the inclusion of interaction terms, were systematically evaluated for exclusion to simplify the model without sacrificing its predictive power. This process involved running chi-squared tests to compare models with and without these terms, aiming to strike a balance between model complexity and predictive accuracy.

**Final Model Selection**

The final model, lrM12, was chosen after removing non-significant interaction terms and variables that did not contribute to improving the model's performance. This model includes the variables zip_bins, numords, last, version1, owntaxprod, upgraded, and dollars, along with significant interaction terms between zip_bins and numords, last, version1, upgraded, and interaction terms between version1 and numords, and version1 and last. The inclusion of these interactions allows the model to capture the nuanced effects of geographical differences on customer behavior regarding the QuickBooks upgrade decision, as well as the differential response of customers depending on whether they currently have version1 or not. The final model demonstrates a Pseudo R-squared value of 0.135, indicating a modest but meaningful improvement in explanatory power compared to the base model. It was significant with a p-value of <0.001, highlighting its statistical robustness. This streamlined approach ensures that the model remains interpretable and practically useful for targeting the upsell campaign while reflecting the complex dynamics of customer response patterns across different geographical areas and current product ownership.

We evaluated the individual impact and importance of each predictor and interaction term using two plots. ***Figure 1*** shows the predicted probabilities of a customer responding positively to the upsell campaign across different levels of each predictor, holding all other variables constant. ***Figure 2*** illustrates the relative importance of each predictor measured on a standardized scale (AUC decrease), highlighting the most influential predictors in the model. These visualizations provide valuable insights into the relative impact of each predictor and interaction term on the likelihood of a positive response, guiding the development of targeted marketing strategies for the QuickBooks upsell campaign. In Figure 2, zip_bins emerges as the most influential variable, indicating that geographical location has the highest impact on the model's predictions. Upgraded and last also rank highly, suggesting that a business's history of software upgrades is a strong indicator of its likelihood to respond to the campaign, as is the time since the last order.

**Q2. How did you compare and evaluate different models**

### **Model Evaluation**

We performed the following analysis to determine the fit and performance of our selected model.

- Decile Analysis: ***Figure 3*** illustrates the response rate by predicted probability, grouped by decile. This chart reflects a strong predictive gradient; higher deciles show significantly higher response rates, indicating that the model is effective at ranking individuals by their likelihood of response.

- Cumulative Lift Chart: ***Figure 4*** illustrates the model's ability to target a smaller subset of individuals with a higher likelihood of response more effectively than random chance. The convergence of the training and test lines suggests that the model performs consistently on both seen and unseen data.

- Cumulative Gains Chart: ***Figure 5*** shows the model's effectiveness compared to no model (random chance), with the gains curve for the test data closely following the training data, which indicates that the model has a high degree of generalizability.

- Together, these evaluation metrics suggest that the logistic regression model is well-calibrated, has strong predictive power, and is expected to perform reliably on new data, making it a valuable tool for targeted marketing efforts.
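
The quantities behind the decile, lift, and gains charts can be computed as in the sketch below. The labels and probabilities are hypothetical, five bins are used instead of ten to keep the example small, and the helper name `decile_table` is ours.

```python
def decile_table(labels, probs, n_bins=10):
    """Rank by predicted probability (highest first) and summarise each bin."""
    n = len(labels)
    total_resp = sum(labels)
    if n < n_bins or total_resp == 0:
        raise ValueError("need at least n_bins rows and one responder")
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    overall_rate = total_resp / n
    rows, cum_n, cum_resp = [], 0, 0
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        resp = sum(y for _, y in chunk)
        cum_n += len(chunk)
        cum_resp += resp
        rows.append({
            "bin": b + 1,
            "resp_rate": resp / len(chunk),                 # decile chart
            "cum_gains": cum_resp / total_resp,             # gains chart
            "cum_lift": (cum_resp / cum_n) / overall_rate,  # lift chart
        })
    return rows


# Hypothetical responses and predicted probabilities
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
rows = decile_table(labels, probs, n_bins=5)
for row in rows:
    print(row)
```

Cumulative gains always reach 1.0 in the last bin and cumulative lift always falls to 1.0 there, because by then the whole population has been targeted. Running the same function on training and test predictions gives the two lines compared in the figures.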

**Accuracy Testing**

***Figure 6*** outlines the confusion matrix for the selected model, which shows that the model's predictions led to a mailing strategy that achieved a high recall of 93.47%, indicating that most actual responders were correctly identified. The accuracy was relatively low at 28.03%, with a specificity of 24.65%, suggesting room for improvement in reducing false positives and better targeting to enhance the campaign's cost-effectiveness. The low precision of 6.01% indicates that a small percentage of the predicted responders actually respond to the campaign, leading to a high number of false positives. The low specificity further exacerbates this issue, as it indicates that the model incorrectly identifies a large portion of non-responders as responders, leading to further unnecessary expenditure. 

These metrics make the campaign financially challenging. The high recall ensures that most actual responders are targeted, but the extremely low precision means a high cost per actual responder, because a large number of non-responders are also mailed. The resulting expenditure on non-responders might outweigh the revenue generated from responders, especially when the mailing cost is considered at scale. However, since customers can only purchase the upgrade if they are sent the mail, Intuit may be willing to accept a low precision in order to capture as many potential responders as possible.
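
For reference, the four metrics discussed above follow from the cells of the confusion matrix as in the sketch below. The counts are hypothetical and were chosen only to mimic the pattern of high recall with low precision; they are not the counts in Figure 6.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "recall": tp / (tp + fn),        # share of actual responders who are mailed
        "specificity": tn / (tn + fp),   # share of non-responders who are not mailed
        "precision": tp / (tp + fp),     # share of mailed businesses who respond
    }


# Hypothetical counts (illustration only)
m = classification_metrics(tp=90, fp=1400, tn=500, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

With a low breakeven cutoff, most businesses are classified as responders, which pushes recall up and pushes specificity, accuracy, and precision down.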

**Q4. What criteria did you use to decide which customers should receive the wave-2 mailing?**

***Breakeven Analysis***

We predicted the probability of a positive response to the QuickBooks upsell campaign using the selected model (lrM12). The pred_logit_wave1 and pred_logit_wave2 columns were created in the intuit75k dataframe to store the predicted probabilities for wave 1 and wave 2 of the campaign, respectively.

We used a breakeven analysis to determine the cutoff probability for the QuickBooks upsell campaign. The breakeven response rate is the probability at which the expected margin from mailing a business equals the cost of that mailing. The resulting cutoff is 0.0235, which means the company should only send the upsell offer to businesses with a predicted probability of response greater than 0.0235. Note that applying this cutoff to the wave-1 predictions assumes that businesses will respond to wave 2 with the same probability as they did to wave 1. Since the response probability for wave 2 is expected to be half that of wave 1, the cutoff for wave 2 should be adjusted accordingly.
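
The breakeven rule can be sketched as below. The mailing cost of $1.41 and the margin of $60 per responder are assumptions that are not stated in this section; they are used here because they reproduce the 0.0235 cutoff. The halving factor follows the wave-2 assumption described in this report.

```python
MAIL_COST = 1.41  # assumed cost per mail piece, in dollars
MARGIN = 60.0     # assumed margin per responder, in dollars

# Breakeven response rate: expected margin equals mailing cost
breakeven = MAIL_COST / MARGIN


def mail_in_wave2(p_wave1, wave2_factor=0.5):
    """Mail a business if its adjusted wave-2 probability beats breakeven."""
    return p_wave1 * wave2_factor > breakeven


print(round(breakeven, 4))
print(mail_in_wave2(0.05), mail_in_wave2(0.04))
```

Comparing the halved probability with the breakeven rate is equivalent to comparing the wave-1 probability with twice the breakeven rate, so under these assumptions a business needs a wave-1 probability above roughly 0.047 to be mailed in wave 2.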

**Q5. How much profit do you anticipate from the wave-2 mailing?**

**Profit Calculations**


The predicted probabilities for wave 2 are assumed to be 50% of the predicted probabilities for wave 1. This is based on the assumption that the probability of a business responding to the upsell campaign is lower for the second wave than the first wave, as the businesses that were most likely to respond have already done so.

Using the response rate from the training data, and the percentage of the customer base that we chose to target from the test data, we can extrapolate what the overall performance of the direct mail campaign would be for a larger customer base, excluding those who have already responded. We recommend that Intuit target 303,949 businesses, which is 39.82% of the remaining customer base. This targeting is expected to result in 25,225 positive responses, generating a profit of $1,084,932 and a return on marketing expenditure (ROME) of 253.15%.
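
The arithmetic behind these figures can be checked with the short sketch below. The mailing cost of $1.41 and the margin of $60 per responder are assumptions that do not appear in this section; with those values the reported profit and ROME are reproduced from the reported mailing and response counts.

```python
MAIL_COST = 1.41  # assumed cost per mail piece, in dollars
MARGIN = 60.0     # assumed margin per responder, in dollars

mailed = 303_949    # businesses targeted (from the report)
responses = 25_225  # expected positive responses (from the report)

cost = mailed * MAIL_COST
profit = responses * MARGIN - cost
rome = profit / cost  # return on marketing expenditure

print(f"cost=${cost:,.0f} profit=${profit:,.0f} ROME={rome:.2%}")
```

ROME divides profit by the marketing spend, so a value above 100% means the campaign earns more than one dollar of profit for every dollar spent on mailing.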

The consistent ROME across different subsets of data (training, test, and now the overall campaign) reinforces the model's reliability and effectiveness. The high ROME indicates that the logistic targeting model is very effective in predicting customer responses and achieving substantial returns on the marketing investments made. This outcome suggests that the model could be a valuable tool for Intuit in optimizing the profitability of their direct marketing efforts.

**Q6. What did you learn about the type of businesses that are likely to upgrade?**

Based on the interaction between zip_bins and upgraded in our model, we infer that the businesses most likely to upgrade are located near Intuit's offices and have a high number of orders. Intuit's flyer mentions a successful experiment testing the version 3.0 update, so it is possible that some of these local businesses took part in that experiment and had already tried version 3.0.

### **Neural Networks**

To analyze and enhance the performance of our neural network (NN) model for a predictive task, we adopted a systematic approach, detailed in the flow below:

1. **Starting Simple with Neural Network**:
    Initially, our NN model featured just one hidden layer. This simple structure served as our baseline, enabling us to understand the basic performance of our model without the complexities introduced by additional layers.

2. **Incrementally Adding Complexity**:
    To explore the potential for improved performance, we progressively increased the number of hidden layers in our NN model. This step was guided by the understanding that deeper networks can capture more complex relationships in the data but also acknowledging the increased risk of overfitting associated with more complex models.

3. **Monitoring Overfitting with Gains Plot**:
    As we adjusted the model's complexity, we closely monitored for signs of overfitting. This was achieved by plotting gains charts for both the training and test sets. These plots provided a visual means to assess how additional layers affected the model's ability to generalize to unseen data versus memorizing the training set.

4. **Hyperparameter Tuning via Grid Search**:
    To optimize the NN model's performance, we utilized Grid Search. This method systematically worked through a predefined grid of hyperparameters, evaluating each combination's effectiveness using the Area Under the Curve (AUC) metric. The optimal configuration identified was a single hidden layer with 10 units, which achieved an AUC of 0.755.

5. **Evaluating on the Test Set**:
    The model, configured with the best parameters from the Grid Search, was then assessed on the test set. Here, it achieved an AUC of 0.761, marking the highest performance we had observed to date.

6. **Assessing Model Profitability**:
    Beyond AUC, we evaluated the model's practical effectiveness by calculating the profit generated from predictions in two distinct phases, referred to as "wave 1" and "wave 2." The profit from wave 2 predictions slightly exceeded that of the logistic regression model, indicating the NN model's potential financial advantage.
    

7. **Testing with different classifiers**:
    In our comprehensive analysis to identify the most effective model for our binary classification task, we extended our exploration beyond neural networks (NN) to include several other classifiers, alongside implementing a voting classifier strategy. Here's how the process unfolded:

    1. Selection of Classifiers:
        Given that our target variable was binary, we opted to evaluate a range of classification models known for their effectiveness in such scenarios. The models chosen for comparison included Decision Trees, K-Nearest Neighbours (KNN), and Random Forest classifiers, alongside our previously discussed NN and logistic regression models.

    2. Hyperparameter Tuning:
        For each classifier, we conducted hyperparameter tuning to optimize performance:
        - Decision Tree: tuned max_depth, min_samples_split, and min_samples_leaf.
        - K-Nearest Neighbours (KNN): tuned n_neighbors.
        - Random Forest: tuned n_estimators, max_depth, min_samples_split, and min_samples_leaf.

    3. Performance Evaluation Using AUC:
        The Area Under the Curve (AUC) metric served as our standard for evaluating each model's performance. Additionally, we experimented with a Voting Classifier, which combines predictions from multiple models, to assess potential improvements in AUC.

    4. Comparison of AUC Scores:
        The AUC scores for the individual classifiers were as follows:
        - Decision Tree: 0.737
        - K-Nearest Neighbours (KNN): 0.636
        - Random Forest: 0.742

        The Voting Classifier, which aggregated predictions from the individual models, achieved an AUC of 0.740.

    5. Rationale Behind Classifier Choices:
        Our choice of classifiers was driven by a preference for simplicity and interpretability. The use of a Voting Classifier was motivated by the hypothesis that it might enhance the AUC score by leveraging the strengths of individual classifiers.
    
8. **Final Model Selection**:
    Among the additional classifiers we tested, Random Forest emerged as the top-performing individual model with an AUC score of 0.742.

    However, when we revisited the neural network (NN) model, which had been tuned to an optimal configuration of a single hidden layer with 10 units achieving an AUC of 0.755 on validation and 0.761 on the test set, it surpassed the performance of all individual classifiers and the Voting Classifier.

    This superior performance of the NN model, coupled with the gains in AUC, led us to choose the neural network as our final model, despite its relatively higher complexity compared to the logistic regression model. The decision was based on achieving the highest possible predictive accuracy, acknowledging the trade-off between interpretability and performance.
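
The grid search used in the tuning steps above can be sketched generically as follows. The scoring function here is a toy placeholder with a known maximum so that the search logic can be checked; in our workflow its role would be played by fitting the model with the given hyperparameters and returning its validation AUC. The parameter names `hidden_layers` and `units` and the helper name `grid_search` are hypothetical.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid over network depth and width
param_grid = {"hidden_layers": [1, 2, 3], "units": [5, 10, 20]}


def placeholder_score(params):
    """Toy score that peaks at one layer and 10 units; NOT a real AUC."""
    return -(params["hidden_layers"] - 1) ** 2 - (params["units"] - 10) ** 2


best_params, best_score = grid_search(param_grid, placeholder_score)
print(best_params, best_score)
```

The cost of an exhaustive search grows with the product of the grid sizes, which is one reason to keep each hyperparameter list short when every evaluation involves training a network.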

**Conclusion**

This approach, which combined a range of classifiers, hyperparameter tuning, and performance evaluation through AUC, was designed to identify the most effective model for our predictive task. The neural network's superior AUC confirmed its selection as the best model among those tested.