# Final Report - "Title"
### Group 5
#### Nelson Li
#### Chriscenci Susanto
#### Nariman Tavakoli
#### Yao Xiao

## Introduction

> 232 words

Understanding which types of customers are more likely to engage with marketing campaigns is an important question in consumer analytics. Businesses invest substantial resources in customer research to guide campaign targeting, yet it is not always clear which factors are most strongly associated with customer engagement. The [Customer Personality Analysis Dataset](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis/data) provides rich demographic and behavioural information that can be used to explore these associations statistically.

In this project, we analyze a marketing campaign dataset containing information on customer demographics, past purchasing behaviour, and previous campaign participation. Our goal is to investigate how these characteristics relate to the likelihood of responding to the company's most recent marketing campaign. Since the data come from an observational setting, our focus is on identifying associations, rather than drawing causal conclusions about the effect of any variable on campaign response.

To guide our analysis, we consider the broad question:

> **What customer characteristics (demographic information, purchasing behaviour, and past campaign engagement) are associated with responding to the company's most recent marketing campaign?**

Addressing this question requires fitting multiple logistic regression models that share a common binary response variable (`Response`) and draw on a set of demographic and behavioural covariates. Since the primary goal of this analysis is inference, we seek to understand which characteristics show statistically significant associations with campaign response. These insights can support businesses in making data-driven marketing decisions.
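
For reference, a logistic regression model of this kind can be written in the usual logit form. The symbols $p_i$, $x_{ij}$, $\beta_j$ and $k$ are notation introduced here for illustration and are not tied to any particular set of covariates chosen later in the report:

$$
\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad
p_i = P(\texttt{Response}_i = 1 \mid x_{i1}, \dots, x_{ik})
$$

Under this form, $e^{\beta_j}$ is the multiplicative change in the odds of responding associated with a one-unit increase in $x_{ij}$, holding the other covariates fixed.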

## Method & Results

> 156 words

### Data

**Dataset Summary**

The [Customer Personality Analysis Dataset](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis/data) from Kaggle contains demographic, behavioural and spending information about customers, intended to help a business better understand its target customers. Although the collection methodology is not described on the Kaggle page, the data most likely come from an observational setting, as customers were not randomly assigned to marketing conditions.

- Number of observations: 2240 customers
- Number of variables: 27 documented variables
    - The file has 29 columns once `Z_CostContact` and `Z_Revenue` are included, but the website gives no description for these two
- Unique Identifier: `ID`

**Variables**

| Variable Name   | Type        | Description     |
|-----------------|-------------|-----------------|
| `ID`            | Numeric     | Unique customer identifier|
| `Year_Birth`    | Numeric     | Year of birth of the customer|
| `Education`     | Categorical | Education level (e.g., Graduation, PhD, Master, etc.)|
| `Marital_Status`| Categorical | Marital status (e.g., Single, Married, Together, etc.)|
| `Income`        | Numeric     | Yearly household income|
| `Kidhome`       | Numeric     | Number of children in the household|
| `Teenhome`      | Numeric     | Number of teenagers in the household|
| `Dt_Customer`   | Date        | Date of customer enrollment with the company|
| `Recency`       | Numeric     | Number of days since the last purchase|
| `Complain`      | Binary      | 1 if the customer complained in the last 2 years, 0 otherwise|
| `MntWines`      | Numeric     | Amount spent on wine in the last 2 years|
| `MntFruits`     | Numeric     | Amount spent on fruits in the last 2 years|
| `MntMeatProducts`| Numeric     | Amount spent on meat products in the last 2 years|
| `MntFishProducts`| Numeric     | Amount spent on fish products in the last 2 years|
| `MntSweetProducts`| Numeric     | Amount spent on sweet products in the last 2 years|
| `MntGoldProds`  | Numeric     | Amount spent on gold products in the last 2 years|
| `NumWebPurchases`| Numeric     | Number of purchases made through the company’s website|
| `NumCatalogPurchases`| Numeric     | Number of purchases made using a catalogue|
| `NumStorePurchases`| Numeric     | Number of purchases made directly in stores|
| `NumWebVisitsMonth`| Numeric     | Number of company’s website visits in the last month |
| `NumDealsPurchases`| Numeric     | Number of purchases made with a discount|
| `AcceptedCmp1`  | Binary      | 1 if customer accepted offer in campaign 1, 0 otherwise|
| `AcceptedCmp2`  | Binary      | 1 if customer accepted offer in campaign 2, 0 otherwise|
| `AcceptedCmp3`  | Binary      | 1 if customer accepted offer in campaign 3, 0 otherwise|
| `AcceptedCmp4`  | Binary      | 1 if customer accepted offer in campaign 4, 0 otherwise|
| `AcceptedCmp5`  | Binary      | 1 if customer accepted offer in campaign 5, 0 otherwise|
| `Response`      | Binary      | 1 if customer accepted last campaign offer, 0 otherwise|


#### Source & Information

- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis/data (created by user `imakash3011`)
- Descriptions are provided for 27 of the 29 variables in the file; these are the 27 listed in the table above.

#### Pre-selection of variables

**Variables to be dropped**

- `ID` - a unique identifier with no meaningful relationship to the response
- `Z_CostContact`, `Z_Revenue` - undocumented, and constant across all observations
- `Dt_Customer` - the enrollment date is unlikely to be very useful here, especially since it may be collinear with `Recency`

In [12]:
# Loading in libraries
library(tidyverse)
library(broom)

In [13]:
# Reading in data (stored on github)
url <- "https://raw.githubusercontent.com/nelsonli2323/STAT-301-Project/refs/heads/main/marketing_campaign.csv"
customer_data <- read.delim(url, header = TRUE, sep = "\t")
num_rows <- nrow(customer_data)
num_cols <- ncol(customer_data)

cat("Number of rows:", num_rows, "\n")
cat("Number of columns:", num_cols, "\n")
glimpse(customer_data)

Number of rows: 2240 
Number of columns: 29 
Rows: 2,240
Columns: 29
$ ID                  <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 485…
$ Year_Birth          <int> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 19…
$ Education           <chr> "Graduation", "Graduation", "Graduation", "Graduat…
$ Marital_Status      <chr> "Single", "Single", "Together", "Together", "Marri…
$ Income              <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 3…
$ Kidhome             <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,…
$ Teenhome            <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
$ Dt_Customer         <chr> "04-09-2012", "08-03-2014", "21-08-2013", "10-02-2…
$ Recency             <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82…
$ MntWines            <int>
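
The pre-selection listed under "Variables to be dropped" could be applied along the following lines. This is a minimal base-R sketch, not the report's final code: the helpers `drop_preselected()` and `constant_columns()` and the data frame `toy` are hypothetical names introduced here, and the toy rows are made up so that the block runs on its own.

```r
# Columns flagged for removal in the pre-selection list
cols_to_drop <- c("ID", "Z_CostContact", "Z_Revenue", "Dt_Customer")

# Hypothetical helper: return `data` without the columns named in `cols`
drop_preselected <- function(data, cols = cols_to_drop) {
  data[, !(names(data) %in% cols), drop = FALSE]
}

# Hypothetical helper: names of columns that hold a single distinct value
constant_columns <- function(data) {
  is_const <- vapply(data, function(x) length(unique(x)) == 1, logical(1))
  names(data)[is_const]
}

# Made-up toy rows (NOT the real data); only the column names mirror the dataset
toy <- data.frame(
  ID            = c(1, 2, 3),
  Income        = c(50000, 62000, 41000),
  Dt_Customer   = c("04-09-2012", "08-03-2014", "21-08-2013"),
  Z_CostContact = c(3, 3, 3),
  Z_Revenue     = c(11, 11, 11),
  Response      = c(1, 0, 0)
)

# Check which columns are constant, then drop the pre-selected ones
constant_columns(toy)
toy_selected <- drop_preselected(toy)
names(toy_selected)
```

In the notebook, the same two calls would be made on `customer_data` in place of `toy`; running `constant_columns()` first is one way to confirm that `Z_CostContact` and `Z_Revenue` really are constant before discarding them.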

### Exploratory Data Analysis

- Clean and wrangle your data into a tidy format
- Include 2 effective and creative visualizations
    - explore the association of some potential explanatory variables with the response (use colours, point types, point size and/or faceting to include more variables)
    - highlight potential problems (e.g., multicollinearity or outliers)
    - You may utilize sub-plots as you did in Stage 1 Report.
    - Use easily readable main/axis/legend titles, appropriately sized and without any underscores.
- Transform some variables if needed and include a clear explanation (e.g. log-transformation may be useful when outliers are present)
- Include any summary tables that are relevant to your analysis (e.g., summarize the number of observations in groups, indicate if NAs exist)
- Be sure not to print output that takes up a lot of screen space!
- Your EDA must be comprehensive with high quality plots

In [None]:
# EDA code
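
Two of the checks named in the list above, missing values and multicollinearity, could start from something like the sketch below. It is written in base R so it runs without extra packages. The helpers `count_missing()` and `high_correlations()`, the 0.8 cut-off, and the made-up `toy` data frame are all assumptions introduced for illustration, not results from the real data.

```r
# Hypothetical helper: number of missing values in each column
count_missing <- function(data) {
  colSums(is.na(data))
}

# Hypothetical helper: pairs of numeric columns whose absolute Pearson
# correlation exceeds `threshold` (0.8 is an arbitrary illustrative cut-off)
high_correlations <- function(data, threshold = 0.8) {
  numeric_data <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]
  corr <- cor(numeric_data, use = "pairwise.complete.obs")
  corr[upper.tri(corr, diag = TRUE)] <- NA  # keep each pair only once
  idx <- which(abs(corr) > threshold, arr.ind = TRUE)
  data.frame(
    var1        = rownames(corr)[idx[, "row"]],
    var2        = colnames(corr)[idx[, "col"]],
    correlation = corr[idx]
  )
}

# Made-up toy rows (NOT the real data); only the column names mirror the dataset
toy <- data.frame(
  Income          = c(30000, 45000, NA, 80000, 95000),
  MntWines        = c(10, 120, 300, 650, 900),
  MntMeatProducts = c(5, 60, 140, 330, 470),
  Recency         = c(50, 10, 80, 20, 60)
)

missing_counts <- count_missing(toy)
flagged_pairs  <- high_correlations(toy)
print(missing_counts)
print(flagged_pairs)
```

Pairs flagged this way are only candidates for closer inspection; a correlation heatmap or variance inflation factors from the fitted model would give a fuller picture of multicollinearity.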

### Methods: Plan

- Describe in written English the methods/models you used to perform your analysis from beginning to end.
- Provide a detailed justification of the method(s) used. The analysis must be based on methods learned in class.
    - Make sure that the analysis responds to the question posed and that the proposed method is appropriate for the characteristics of the data.
- If a variable selection method is used, you need to describe and justify the method. Furthermore, explain what data will be used, and how final model will be chosen.
- Include a careful model assessment plan relevant to your goal (i.e. diagnostics and/or evaluation, however appropriate), with justifications.
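
As one possible shape for such a plan, the sketch below fits two nested logistic regression models with `glm()` and compares them with a likelihood ratio test via `anova()`. It is an illustration under stated assumptions, not the analysis itself: the data are simulated, the covariates in each model are picked arbitrarily from the dataset's column names, and the coefficients used to generate `Response` are invented.

```r
set.seed(301)  # arbitrary seed so the simulated toy data are reproducible

# Simulated stand-in data (NOT the real customers); column names mirror the dataset
n <- 500
toy <- data.frame(
  Recency      = sample(0:99, n, replace = TRUE),
  Income       = round(rnorm(n, mean = 52000, sd = 20000)),
  AcceptedCmp1 = rbinom(n, 1, 0.1)
)

# Response is generated from an invented model, purely for illustration
eta <- -1 - 0.02 * toy$Recency + 1.5 * toy$AcceptedCmp1
toy$Response <- rbinom(n, 1, plogis(eta))

# Reduced model: a single behavioural covariate
fit_reduced <- glm(Response ~ Recency, data = toy, family = binomial)

# Full model: adds income (in thousands) and past campaign engagement
fit_full <- glm(
  Response ~ Recency + I(Income / 1000) + AcceptedCmp1,
  data = toy, family = binomial
)

# Likelihood ratio test of the reduced model against the full model
lrt <- anova(fit_reduced, fit_full, test = "Chisq")
print(lrt)
```

A small p-value in the second row of the table would suggest that the added covariates jointly improve the fit; the test is only valid because the reduced model is nested within the full one.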

### Code & Results

- All the analysis code, from reading the data to visualizing results, must be clean, reproducible (e.g. read from an open source and not a local directory on your server or computer), and well-commented.
- Include no more than 3 visualizations and/or tables to summarize and highlight your results. Ensure your tables and/or figures are labelled with a figure/table number and readable fonts.
    - You may utilize sub-plots as you did in Stage 1 Report.
    - Use easily readable main/axis/legend titles, appropriately sized and without any underscores.
- Make sure to interpret/explain the results you obtain. It’s not enough to just say, “I fitted a linear model with these covariates, and my R-square is 0.87”.
    - If inference is the aim of your project, a detailed interpretation of your fitted models will be required, as well as a discussion of relevant quantities.
        - For example: which coefficients are statistically significant? What are some hypothesis tests of interest? How should the coefficients be interpreted? How well does the model fit the data?
        - Also explain briefly the key differences between your fitted models.
    - If prediction is the aim, you must highlight the key outcomes from your model fitting/selection/prediction in written English.

In [1]:
# more code?
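
For the inferential interpretation described above, logistic regression coefficients are often reported as odds ratios with confidence intervals. The sketch below shows one way to build such a table in base R using Wald intervals. The helper `odds_ratio_table()` is a hypothetical name, and the data and the coefficients used to simulate `Response` are invented for illustration only.

```r
# Hypothetical helper: odds ratios, Wald confidence intervals and p-values
# for a fitted logistic regression
odds_ratio_table <- function(fit, level = 0.95) {
  est <- coef(summary(fit))                   # Estimate, Std. Error, z value, Pr(>|z|)
  ci  <- confint.default(fit, level = level)  # Wald intervals on the log-odds scale
  data.frame(
    term       = rownames(est),
    odds_ratio = exp(est[, "Estimate"]),
    conf_low   = exp(ci[, 1]),
    conf_high  = exp(ci[, 2]),
    p_value    = est[, "Pr(>|z|)"],
    row.names  = NULL
  )
}

set.seed(2024)  # arbitrary seed so the simulated toy data are reproducible

# Simulated stand-in data (NOT the real customers); column names mirror the dataset
n <- 400
toy <- data.frame(
  Recency  = sample(0:99, n, replace = TRUE),
  Teenhome = rbinom(n, 1, 0.4)
)
toy$Response <- rbinom(n, 1, plogis(-0.5 - 0.02 * toy$Recency - 0.6 * toy$Teenhome))

fit <- glm(Response ~ Recency + Teenhome, data = toy, family = binomial)
or_table <- odds_ratio_table(fit)
print(or_table)
```

An odds ratio below 1 indicates lower odds of responding as the covariate increases, holding the others fixed, and an interval that excludes 1 corresponds to a significant Wald test at the matching level. Since `broom` is already loaded in this notebook, `tidy(fit, exponentiate = TRUE, conf.int = TRUE)` should give a comparable table, though its intervals may be computed differently from the Wald intervals used here.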

## Discussion

In this section, you’ll interpret and reflect on the results you obtained in the previous section with respect to the main question/goal of your project.

- Summarize what you found and the implications/impact of your findings
- If relevant, discuss whether your results were what you expected to find
- Discuss how your model could be improved
- Discuss future questions/research this study could lead to

## References

Include any citation of literature relevant to the project. The citation format is your choice – just be consistent. Make sure to cite the source of your data as well.