<h1 style="text-align: center;">Promo Abuse Detection</h1>

### Introduction

Promotions and discounts are widely used by businesses to attract customers, drive sales, and enhance brand loyalty. However, in the digital era, promotional abuse has become a significant concern for organizations across various industries. Promo abuse refers to the misuse or exploitation of promotional offers, codes, or incentives by individuals with the intention to gain undue advantages, circumvent rules, or engage in fraudulent activities. It poses serious challenges to businesses, leading to financial losses, erosion of trust, and negative customer experiences.

This project aims to delve into the complex phenomenon of promo abuse, investigating its underlying patterns, understanding its impacts on businesses and consumers, and exploring effective strategies for its mitigation. 

### Objective
- Clean and preprocess the dataset.
- Explore the MCA (Multiple Correspondence Analysis) technique for working with categorical or qualitative variables.
- Find the best classification model.

### Outline

1. [Data Review](#datareview)

2. [Data Cleaning](#datacleaning)

3. [Labeling the target variable](#labelingthetargetvariable)

4. [MCA (Multiple Correspondence Analysis) technique](#mca)

5. [Modeling](#modeling)

6. [Conclusion](#conclusion)

---

## 1. Data review <a name="datareview"></a>

### About the dataset ([Data Source](https://www.kaggle.com/datasets/tanisha1416/promo-abuse-detection-for-payment-apps))

Imagine that ABC, a company offering subscription-based services (such as streaming platforms), uses a payment gateway to manage recurring payments and subscriptions. For every new user sign-up on its platform, ABC offers benefits to thank new customers for their business and encourage them to become repeat users. Sign-up bonus abuse occurs when a customer uses multi-accounting to benefit from a sign-up deal multiple times.

This dataset contains information about the users registering for ABC's online payment gateway app.


**Target variable:**
- This dataset is designed to represent a real-world scenario, so it contains no target variable: whether a registration is promo abuse is *unknown* and has to be labeled later.

**Feature variables:** 
- *'X':* The ID for each registration.
- *'FirstName':* First Name of the user.
- *'LastName':* Last Name of the User.
- *'Dob':* Date of birth of the user.
- *'Gender':* Gender of the user. 
- *'UserId':* User ID given by the payment app.
- *'Status':* Status of Registration.
- *'Email':* Email of the user.
- *'Mobile':* Mobile Number of the user.
- *'ProgramId':* Program id provided by the payment gateway app.
- *'ProgramName':* Name of the Payment program.
- *'PostalCode':* Zipcode of the user's location.
- *'BadTryCount':* User taking too long to fill in his details or switching between tabs too much.
- *'LastLoginTime':* Timestamp of the last login.
- *'PresentLoginTime':* Timestamp of recent login.
- *'RegisteredOn':* Date and time of registration.
- *'OS Type':* Device type.
- *'Reg Referral Code':* Referral code name provided by the app.
- *'Reg Referral Prefix':* Prefix of the referral code name provided by the app.
- *'Ref PC AC Number':* Account number entered by the user.
- *'Reg Client IP':* IP address of the registration device.
- *'Device Id':* ID of the device used to make the registration.
- *'Sim Id':* Id of the sim of the registered number on the app.
- *'Meta Data':* Additional Info about the data.

**Now, let's examine the dataset to gain insights, understand the data and identify any potential issues or trends.**

---

## 2. Data cleaning <a name="datacleaning"></a>

In [1]:
# import libraries
library(skimr)
library(dplyr)
library(caTools)
library(lubridate)
library(caret)
library(FactoMineR)
library(pROC)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union


Loading required package: ggplot2

Loading required package: lattice

Type 'citation("pROC")' for a citation.


Attaching package: 'pROC'


The following objects are masked from 'package:stats':

    cov, smooth, var




In [2]:
# load dataset into R
df <- read.csv('Promo_Abuse_Detection_Dataset.csv')
head(df)

Unnamed: 0_level_0,X,FirstName,LastName,Dob,Gender,UserId,Status,Email,Mobile,ProgramId,⋯,PresentLoginTime,RegisteredOn,OS.Type,Reg.Referral.Code,Reg.Referral.Prefix,Ref.PC.AC.Number,Reg.Client.IP,Device.Id,Sim.Id,Meta.Data
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<dbl>,<int>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,0,smt,shyani devi,28-05-1985,F,13260350,1,smtXXXXXX@gmail.com,8812569734,6019,⋯,19:34:34,10/19/2020 18:45,Android,srpaa335,srpaa,IN017XXXXX1161,111.94.41.232,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:22b12,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b11,56641ab-7e57-4962-a362a89985
2,1,shakshi,sagar,09-06-1998,F,13454400,1,shaXXXXXX@gmail.com,8615389304,6019,⋯,18:58:54,10/19/2020 18:45,Android,srpaa345,srpaa,IN017XXXXX1166,111.92.31.237,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:14b15,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:22b15,56641ab-7e57-4962-a362a89990
3,2,anshu,d/o,23-11-2001,F,13321313,1,ansXXXXXX@gmail.com,8418054019,6019,⋯,18:44:38,10/19/2020 18:45,ios,srpaa349,srpaa,IN017XXXXX1168,111.94.91.346,iosvid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b16,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:14b16,56641ab-7e57-4962-a362a89992
4,3,kanika,kathuria,20-06-2001,F,12444655,1,kanXXXXXX@gmail.com,8055221226,6019,⋯,18:41:04,10/19/2020 18:45,ios,srpaa350,srpaa,IN012XXXXX1162,111.94.41.237,iosvid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b17,iosvid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b16,56641ab-7e53-4962-a362a89967
5,4,riya,masi,18-07-2005,U,12547615,1,riyXXXXXX@gmail.com,8556603655,6019,⋯,18:08:58,10/19/2020 18:45,Android,srpaa359,srpaa,IN017XXXXX1173,111.94.41.240,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:22b20,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b19,56641ab-7e57-4962-a362a89997
6,5,leela,with a child,13-04-1987,U,12379326,1,leeXXXXXX@gmail.com,9396979136,6019,⋯,18:01:50,10/19/2020 18:45,Android,srpaa361,srpaa,IN017XXXXX1174,111.94.91.350,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b20,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:14b20,56641ab-7e57-4962-a362a89998


In [3]:
# skim through df, getting useful summary statistics
skim(df)

── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             395   
Number of columns          24    
_______________________          
Column type frequency:           
  character                17    
  numeric                  7     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────────────────────────
   skim_variable       n_missing complete_rate min max empty n_unique whitespace
 1 FirstName                   0             1   1  13     0      309          0
 2 LastName                    0             1   0  28   133      184          0
 3 Dob                         0             1  10  10     0      392          0
 4 Gender                      0             1   0   1     2        4          0
 5 Email                       0             1  17  19     0      220          0


"'length(x) = 17 > 1' in coercion to 'logical(1)'"


Unnamed: 0_level_0,skim_type,skim_variable,n_missing,complete_rate,character.min,character.max,character.empty,character.n_unique,character.whitespace,numeric.mean,numeric.sd,numeric.p0,numeric.p25,numeric.p50,numeric.p75,numeric.p100,numeric.hist
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,character,FirstName,0,1,1.0,13.0,0.0,309.0,0.0,,,,,,,,
2,character,LastName,0,1,0.0,28.0,133.0,184.0,0.0,,,,,,,,
3,character,Dob,0,1,10.0,10.0,0.0,392.0,0.0,,,,,,,,
4,character,Gender,0,1,0.0,1.0,2.0,4.0,0.0,,,,,,,,
5,character,Email,0,1,17.0,19.0,0.0,220.0,0.0,,,,,,,,
6,character,ProgramName,0,1,7.0,7.0,0.0,1.0,0.0,,,,,,,,
7,character,LastLoginTime,0,1,8.0,16.0,0.0,392.0,0.0,,,,,,,,
8,character,PresentLoginTime,0,1,8.0,8.0,0.0,394.0,0.0,,,,,,,,
9,character,RegisteredOn,0,1,18.0,18.0,0.0,3.0,0.0,,,,,,,,
10,character,OS.Type,0,1,3.0,7.0,0.0,2.0,0.0,,,,,,,,


**Things to notice:**

- *'Gender'* has 4 unique values. We would normally expect two categories here (Female and Male), so the extra values need a closer look.

- *'Empty values':* An empty value means the cell holds an empty string (""), not a missing value (NA). For consistency and clarity, we will replace all empty values with the label "Unknown".

- *'Problems with data type':* Some columns have unsuitable data types. Identifiers such as *'UserId'* and *'Mobile'* are stored as numbers, and dates such as *'Dob'* and *'RegisteredOn'* are stored as character strings.

**Gender**

In [4]:
# extract unique elements of gender column
unique(df$Gender)

As we can observe, the *'Gender'* column has 4 unique values. 'M' and 'F' stand for 'Male' and 'Female', as expected. The other 2 values, "U" and "", both indicate an unknown gender. To keep the dataset consistent, we will replace both of them with "Unknown".

In [5]:
# replace values "U" and 'empty values' with "Unknown"
df$Gender <- replace(df$Gender, df$Gender != "M" & df$Gender != "F", "Unknown")
unique(df$Gender)

**Empty value**

As mentioned above, we replace all empty values with the label "Unknown" to keep the dataset consistent.

In [6]:
df[df == ""] <- "Unknown"
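Before overwriting empty strings, it can help to see how many each column holds. The following is a minimal sketch on a hypothetical toy data frame (the `toy` object and its values are assumptions for illustration, not rows from the dataset):

```r
# Hypothetical toy data frame, not the project dataset
toy <- data.frame(
  LastName = c("sagar", "", "kathuria", ""),
  Gender   = c("F", "U", "", "M"),
  stringsAsFactors = FALSE
)

# Count empty strings per column before overwriting them
empty_counts <- colSums(toy == "")
print(empty_counts)

# Replace empty strings with the label "Unknown"
toy[toy == ""] <- "Unknown"
```

Comparing the counts with the `empty` column reported by `skim()` is a quick sanity check that the replacement touched only the expected cells.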

**Problems with data type**

To ensure accurate data representation and facilitate appropriate data analysis, it is crucial to correctly assign and manage the data types within a dataset. By choosing the appropriate data types for each variable, we can enhance the efficiency, accuracy, and reliability of data processing and analysis.

In [7]:
# change identifier-like columns to character type
col_char <- c("X", "UserId","Mobile", "ProgramId", "PostalCode")
df[col_char] <- lapply(df[col_char], as.character)

# change data to factor type
df$Gender <- factor(df$Gender)
df$Status <- as.factor(df$Status)
df$BadTryCount <- as.factor(df$BadTryCount)

# change data to datetime type
df$Dob <- dmy(df$Dob)
df$RegisteredOn <- mdy_hm(df$RegisteredOn)

str(df)

'data.frame':	395 obs. of  24 variables:
 $ X                  : chr  "0" "1" "2" "3" ...
 $ FirstName          : chr  "smt" "shakshi" "anshu" "kanika" ...
 $ LastName           : chr  "shyani devi" "sagar" "d/o" "kathuria" ...
 $ Dob                : Date, format: "1985-05-28" "1998-06-09" ...
 $ Gender             : Factor w/ 3 levels "F","M","Unknown": 1 1 1 1 3 3 2 1 1 3 ...
 $ UserId             : chr  "13260350" "13454400" "13321313" "12444655" ...
 $ Status             : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Email              : chr  "smtXXXXXX@gmail.com" "shaXXXXXX@gmail.com" "ansXXXXXX@gmail.com" "kanXXXXXX@gmail.com" ...
 $ Mobile             : chr  "8812569734" "8615389304" "8418054019" "8055221226" ...
 $ ProgramId          : chr  "6019" "6019" "6019" "6019" ...
 $ ProgramName        : chr  "PayZapp" "PayZapp" "PayZapp" "PayZapp" ...
 $ PostalCode         : chr  "509726" "772845" "377130" "534482" ...
 $ BadTryCount        : Factor w/ 1 level "0": 1 1 1 1 1 1 1 

**Review the data**

In [8]:
skim(df)

── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             395   
Number of columns          24    
_______________________          
Column type frequency:           
  character                19    
  Date                     1     
  factor                   3     
  POSIXct                  1     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────────────────────────
   skim_variable       n_missing complete_rate min max empty n_unique whitespace
 1 X                           0             1   1   3     0      395          0
 2 FirstName                   0             1   1  13     0      309          0
 3 LastName                    0             1   2  28     0      184          0
 4 UserId                      0             1   8   8     0      394          0
 5 Email      

"'length(x) = 20 > 1' in coercion to 'logical(1)'"


Unnamed: 0_level_0,skim_type,skim_variable,n_missing,complete_rate,Date.min,Date.max,Date.median,Date.n_unique,POSIXct.min,POSIXct.max,POSIXct.median,POSIXct.n_unique,character.min,character.max,character.empty,character.n_unique,character.whitespace,factor.ordered,factor.n_unique,factor.top_counts
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<date>,<date>,<date>,<int>,<dttm>,<dttm>,<dttm>,<int>,<int>,<int>,<int>,<int>,<int>,<lgl>,<int>,<chr>
1,Date,Dob,0,1,1975-01-27,2005-12-17,1990-06-27,392.0,,,,,,,,,,,,
2,POSIXct,RegisteredOn,0,1,,,,,2020-10-19 18:45:00,2020-10-19 19:00:00,2020-10-19 19:00:00,3.0,,,,,,,,
3,character,X,0,1,,,,,,,,,1.0,3.0,0.0,395.0,0.0,,,
4,character,FirstName,0,1,,,,,,,,,1.0,13.0,0.0,309.0,0.0,,,
5,character,LastName,0,1,,,,,,,,,2.0,28.0,0.0,184.0,0.0,,,
6,character,UserId,0,1,,,,,,,,,8.0,8.0,0.0,394.0,0.0,,,
7,character,Email,0,1,,,,,,,,,17.0,19.0,0.0,220.0,0.0,,,
8,character,Mobile,0,1,,,,,,,,,10.0,10.0,0.0,394.0,0.0,,,
9,character,ProgramId,0,1,,,,,,,,,4.0,4.0,0.0,1.0,0.0,,,
10,character,ProgramName,0,1,,,,,,,,,7.0,7.0,0.0,1.0,0.0,,,


**Delete all columns with all-unique or constant values**

To build a model that captures the relevant information, we need to choose the input columns carefully. A column in which every value is unique (such as a row ID) or in which every value is the same (a constant) carries no information the model can generalize from, so we remove such columns from the dataset.

In [9]:
# count the number of unique values in every column of df
col_unique <- sapply(df, n_distinct)

# remove the columns whose values are all unique or all the same
col_remove <- names(col_unique[col_unique == nrow(df) | col_unique == 1])
df <- df[ , !(names(df) %in% col_remove)]

head(df)

Unnamed: 0_level_0,FirstName,LastName,Dob,Gender,UserId,Email,Mobile,PostalCode,LastLoginTime,PresentLoginTime,RegisteredOn,OS.Type,Reg.Referral.Code,Reg.Referral.Prefix,Ref.PC.AC.Number,Device.Id
Unnamed: 0_level_1,<chr>,<chr>,<date>,<fct>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dttm>,<chr>,<chr>,<chr>,<chr>,<chr>
1,smt,shyani devi,1985-05-28,F,13260350,smtXXXXXX@gmail.com,8812569734,509726,00:23:00,19:34:34,2020-10-19 18:45:00,Android,srpaa335,srpaa,IN017XXXXX1161,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:22b12
2,shakshi,sagar,1998-06-09,F,13454400,shaXXXXXX@gmail.com,8615389304,772845,00:22:40,18:58:54,2020-10-19 18:45:00,Android,srpaa345,srpaa,IN017XXXXX1166,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:14b15
3,anshu,d/o,2001-11-23,F,13321313,ansXXXXXX@gmail.com,8418054019,377130,00:22:32,18:44:38,2020-10-19 18:45:00,ios,srpaa349,srpaa,IN017XXXXX1168,iosvid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b16
4,kanika,kathuria,2001-06-20,F,12444655,kanXXXXXX@gmail.com,8055221226,534482,00:22:30,18:41:04,2020-10-19 18:45:00,ios,srpaa350,srpaa,IN012XXXXX1162,iosvid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b17
5,riya,masi,2005-07-18,Unknown,12547615,riyXXXXXX@gmail.com,8556603655,258413,00:22:12,18:08:58,2020-10-19 18:45:00,Android,srpaa359,srpaa,IN017XXXXX1173,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:22b20
6,leela,with a child,1987-04-13,Unknown,12379326,leeXXXXXX@gmail.com,9396979136,261600,00:22:08,18:01:50,2020-10-19 18:45:00,Android,srpaa361,srpaa,IN017XXXXX1174,faid:y46GHDAWDVBGF1937ETsdfhd56483GDSSAA=:81b20


---

## 3. Labeling the target variable <a name="labelingthetargetvariable"></a>

Because our dataset has no target variable, we have to label it ourselves. Labeling the target variable is an essential step in building a supervised learning model: the labels let the model learn from the data and classify new observations. The labels must be assigned carefully and accurately, because the model can only be as reliable as its labels. These are the steps I followed:

- **Understanding the Target Variable:** First, we need a clear understanding of what the target variable represents. Here, it is a binary variable where 0 represents a normal registration and 1 represents a promo abuse case.

- **Define the Labeling Criteria:** This step defines the rules that determine how each observation is labeled. Promo abuse is a type of fraud in which people register many times to take advantage of the company's offers. With that in mind, the columns fall into three groups:

    - *Personally identifiable information that should not be duplicated (UserId, Email, Mobile, Ref PC AC Number and Device Id):* These values are personally identifiable information (PII) because each one should identify a single individual. Under normal circumstances, each value corresponds to a distinct person. A repeated value in the dataset is therefore a red flag for promo abuse, and we assign 1 to the target variable for these repeated occurrences.

    - *Personal information that can be duplicated (FirstName, LastName, Dob, Gender, PostalCode and OS Type):* Different people commonly share a first name, last name, gender, device type, or even a birthday. Such matches alone are not sufficient evidence of promo abuse, and labeling on them would produce false positives.

    - *Non-personal information (LastLoginTime, PresentLoginTime, RegisteredOn, Reg Referral Code, Reg Referral Prefix):* These variables may reveal patterns of promo abuse on further exploration, but at this stage we cannot draw definitive conclusions about their relationship to promo abuse, so they are not used for labeling.

- **Consider Imbalanced Classes:** If the target variable has imbalanced classes, meaning one class significantly outweighs the others, carefully address the issue during the labeling process. Employ techniques such as oversampling, undersampling, or synthetic sampling to balance the classes and prevent the model from being biased towards the majority class.

By labeling the target variable accurately and consistently, we provide the necessary ground truth for the supervised learning model. The labeled dataset serves as the foundation for training, validating, and evaluating the model's performance. Careful attention to the labeling process enhances the reliability and effectiveness of the subsequent modeling and analysis tasks.

In [10]:
# create target value with all 0 (normal register)
df$Target <- "0"

# change the data type of target variable to factor
df$Target <- factor(df$Target, levels = c("0", "1"))

**Personally identifiable information cannot be duplicated (UserId, Email, Mobile, Ref PC AC Number and Device Id)**

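When one of these identifiers is repeated, we assign a value of 1 to the target variable. Note that `duplicated()` marks only the second and later occurrences of a value, so the first registration that uses a given identifier keeps the label 0.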

In [11]:
# target the promo abuse case

# UserId
df$Target[duplicated(df$UserId)] <- "1"

# Email
df$Target[duplicated(df$Email)] <- "1"

# Mobile
df$Target[duplicated(df$Mobile)] <- "1"

# Ref PC AC Number
df$Target[duplicated(df$Ref.PC.AC.Number)] <- "1"

# device id
df$Target[duplicated(df$Device.Id)] <- "1"
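The labeling rule can also be written compactly over a list of identifier columns. The sketch below uses a hypothetical toy data frame and contrasts the rule used in this notebook with a stricter variant that flags every row in a duplicated group. The stricter variant is an assumption added for comparison, not the rule applied to the dataset.

```r
# Hypothetical toy registrations (values are illustrative only)
toy <- data.frame(
  Email     = c("a@x.com", "b@x.com", "a@x.com", "c@x.com"),
  Device.Id = c("d1", "d2", "d3", "d2"),
  stringsAsFactors = FALSE
)
id_cols <- c("Email", "Device.Id")

# Rule used in this notebook: flag second and later occurrences only
later_dup <- Reduce(`|`, lapply(toy[id_cols], duplicated))

# Stricter variant (an assumption, not the notebook's rule):
# flag every row that shares an identifier with another row
any_dup <- Reduce(`|`, lapply(toy[id_cols], function(x) {
  duplicated(x) | duplicated(x, fromLast = TRUE)
}))

toy$Target_later <- factor(as.integer(later_dup), levels = c(0, 1))
toy$Target_any   <- factor(as.integer(any_dup), levels = c(0, 1))
print(toy)
```

Which variant is appropriate depends on whether the first registration in a group should count as legitimate. The two choices can give noticeably different class proportions.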

**Consider Imbalanced Classes**

It is important to recognize and handle imbalanced datasets appropriately to avoid biased or ineffective models. 

In [12]:
# check the frequency of the target variable in the dataset
print("Target Frequency || 0: Normal register & 1: Promo abuse")
table(df$Target)

# check the proportion (%) of the target variable in the original dataset
print("Target Proportion (%) || 0: Normal register & 1: Promo abuse")
prop.table(table(df$Target)) * 100

[1] "Target Frequency || 0: Normal register & 1: Promo abuse"



  0   1 
205 190 

[1] "Target Proportion (%) || 0: Normal register & 1: Promo abuse"



       0        1 
51.89873 48.10127 

In this case, classes 0 and 1 appear in roughly equal proportions (about 52% and 48%), so the dataset can be considered balanced. No resampling is needed, and accuracy remains a meaningful metric alongside precision, recall, and F1.
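Had the labels turned out to be imbalanced, random oversampling of the minority class is one of the options mentioned earlier. A minimal base-R sketch on hypothetical toy data (the `toy` data frame and its 8-to-2 class ratio are assumptions for illustration):

```r
set.seed(42)

# Hypothetical imbalanced toy data: 8 normal rows, 2 abuse rows
toy <- data.frame(
  Dim.1  = rnorm(10),
  Target = factor(c(rep("0", 8), rep("1", 2)), levels = c("0", "1"))
)

counts   <- table(toy$Target)
minority <- names(counts)[which.min(counts)]
n_extra  <- max(counts) - min(counts)

# Draw minority rows with replacement until both classes have equal size
minority_rows <- which(toy$Target == minority)
extra_rows    <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
balanced      <- rbind(toy, toy[extra_rows, , drop = FALSE])

table(balanced$Target)
```

Resampling of this kind is normally applied to the training set only, so that the test set keeps the original class distribution.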

**Next, we will use the MCA technique to reduce the dimensions of the categorical or qualitative features before building a model.**

---

## 4. MCA (Multiple Correspondence Analysis) technique <a name="mca"></a>

Given that most of the features in the dataset are qualitative variables, it is indeed appropriate to apply MCA (Multiple Correspondence Analysis) to facilitate the analysis and modeling process.

**Definition of MCA (Multiple Correspondence Analysis) technique**

MCA (Multiple Correspondence Analysis) is a dimensionality reduction technique used for categorical or qualitative data analysis. It is an extension of Correspondence Analysis (CA) and is commonly applied to explore relationships and patterns in large categorical datasets.

MCA operates on a table that represents the associations between multiple categorical variables. It seeks to find a lower-dimensional representation of the data while preserving the relationships between variables. The resulting dimensions, called factorial dimensions, provide a visual representation of the associations among the categorical variables in the dataset.
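To make the idea concrete, the textbook computation can be sketched in base R: build the indicator (one-hot) table, then run a correspondence analysis on it through a singular value decomposition. This is an illustrative sketch on hypothetical toy data, not FactoMineR's implementation, although the row coordinates should agree with it up to sign.

```r
# Toy categorical data (hypothetical values, not from the project dataset)
toy <- data.frame(
  os     = factor(c("Android", "ios", "Android", "ios", "Android", "Android")),
  gender = factor(c("F", "F", "M", "Unknown", "M", "F")),
  prefix = factor(c("a", "a", "b", "b", "a", "b"))
)

# 1. Indicator matrix: one 0/1 column per category of each variable
Z <- do.call(cbind, lapply(toy, function(x) {
  outer(as.character(x), levels(x), "==") + 0
}))

# 2. Correspondence matrix with row and column masses
P        <- Z / sum(Z)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# 3. Standardized residuals and their singular value decomposition
S  <- diag(1 / sqrt(row_mass)) %*% (P - row_mass %o% col_mass) %*% diag(1 / sqrt(col_mass))
sv <- svd(S)

# 4. Principal coordinates of the rows on the first ncp dimensions
ncp       <- 2
row_coord <- diag(1 / sqrt(row_mass)) %*% sv$u[, 1:ncp] %*% diag(sv$d[1:ncp])
colnames(row_coord) <- paste0("Dim.", 1:ncp)
print(round(row_coord, 3))
```

Each row of `row_coord` is the numeric representation of one observation, which is what the `$ind$coord` element returned by `MCA()` provides.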

**Some advantages of the MCA technique**

MCA can be a useful tool when building models, especially when working with categorical or qualitative data:

- *Feature Engineering:* MCA helps in feature engineering by transforming categorical variables into numerical representations. It captures the relationships and associations between the categorical variables and provides a reduced-dimensional representation of the data. This transformation allows for the inclusion of categorical variables in predictive models that typically require numerical inputs.

- *Dimensionality Reduction:* MCA reduces the dimensionality of the categorical data while preserving the underlying structure and associations. MCA allows for a more concise representation of the data. This reduction in dimensionality can be advantageous in building models, as it helps to avoid the curse of dimensionality and enhances the interpretability of the model.

- *Variable Selection:* MCA can assist in variable selection by identifying the most informative categorical variables. The factorial dimensions generated by MCA provide insights into the importance and contribution of different variables to the overall structure of the data. Variables that exhibit strong associations with the dimensions can be selected as relevant features for modeling, potentially improving the performance and efficiency of the model.

- *Preprocessing for Machine Learning Algorithms:* MCA can serve as a preprocessing step for machine learning algorithms that require numerical inputs. By applying MCA to categorical variables, the transformed dimensions can be used as features in the model-building process. This enables the utilization of powerful machine learning algorithms, such as decision trees, random forests, or neural networks, for predictive modeling tasks involving categorical data.

In summary, by leveraging the reduced-dimensional representation and insights provided by MCA, analysts and data scientists can enhance the effectiveness, interpretability, and performance of models when working with categorical or qualitative data.

In [13]:
# run MCA on the 16 feature columns (column 17 is Target) and keep 7 dimensions
mca_result <- MCA(df[, -17], ncp = 7)$ind$coord

# view the features after applying MCA
head(mca_result)

"ggrepel: 9 unlabeled data points (too many overlaps). Consider increasing max.overlaps"


Unnamed: 0,Dim 1,Dim 2,Dim 3,Dim 4,Dim 5,Dim 6,Dim 7
1,2.31452,1.573361,1.1086826,0.61131277,-0.9475785,0.2593996,-0.7705553
2,2.943898,2.833644,-0.3934609,1.18108346,-1.077075,-0.0836121,-0.3525873
3,4.874686,6.655806,-0.2262243,0.03932749,-1.2939013,-0.2712755,-0.2445833
4,4.874686,6.655806,-0.2262243,0.03932749,-1.2939013,-0.2712755,-0.2445833
5,3.987433,3.300953,-0.2011732,3.39416292,-3.451824,0.6066285,0.8891277
6,3.987433,3.300953,-0.2011732,3.39416292,-3.451824,0.6066285,0.8891277


As the result above shows, the MCA technique not only transforms the qualitative features into numerical ones but also reduces their dimensionality: the 16 qualitative columns become 7 numeric dimensions.

Let's combine these results with the original dataset.
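The number of retained dimensions (`ncp = 7` here) is a modeling choice. One way to inform it is to inspect how much inertia the leading dimensions explain. The sketch below assumes the FactoMineR package is installed and uses a hypothetical toy data frame. `graph = FALSE` suppresses the plots that `MCA()` draws by default.

```r
library(FactoMineR)

# Hypothetical toy data: three categorical variables, 12 rows
toy <- data.frame(
  os     = factor(rep(c("Android", "ios"), times = 6)),
  gender = factor(rep(c("F", "M", "Unknown"), times = 4)),
  prefix = factor(rep(c("a", "a", "b", "b", "c", "c"), times = 2))
)

res <- MCA(toy, ncp = 2, graph = FALSE)

# Third column of res$eig holds the cumulative percentage of variance
cum_var <- res$eig[, 3]
print(round(cum_var, 1))

# Row coordinates on the retained dimensions
head(res$ind$coord)
```

A common heuristic is to keep enough dimensions to reach a chosen cumulative percentage, then check that model performance does not degrade.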

In [14]:
# combine the result of MCA into the original dataset
df <- cbind(df, mca_result)

# remove the qualitative features from the original dataset
df <- df[,-c(1:16)]

# make valid names for new columns
colnames(df) <- make.names(colnames(df))

# view the dataset after apply MCA
head(df)

Unnamed: 0_level_0,Target,Dim.1,Dim.2,Dim.3,Dim.4,Dim.5,Dim.6,Dim.7
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,2.31452,1.573361,1.1086826,0.61131277,-0.9475785,0.2593996,-0.7705553
2,0,2.943898,2.833644,-0.3934609,1.18108346,-1.077075,-0.0836121,-0.3525873
3,0,4.874686,6.655806,-0.2262243,0.03932749,-1.2939013,-0.2712755,-0.2445833
4,0,4.874686,6.655806,-0.2262243,0.03932749,-1.2939013,-0.2712755,-0.2445833
5,0,3.987433,3.300953,-0.2011732,3.39416292,-3.451824,0.6066285,0.8891277
6,0,3.987433,3.300953,-0.2011732,3.39416292,-3.451824,0.6066285,0.8891277


***Next, let's build supervised model on this dataset.***

---

## 5. Modeling <a name="modeling"></a>

In [15]:
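# Split the data into train set and test set
set.seed(100)
# Note: sample.split() expects the label vector (e.g. df$Target). Passing the
# whole data frame returns a logical mask of length ncol(df) = 8 that R recycles
# across the rows, so the effective split here is about 75/25
# (296 train rows, 99 test rows) rather than 80/20.
ind = sample.split(df, SplitRatio = 0.8)
train = df[ind, ]
test =  df[!ind, ]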

# see the proportion of the target variable in train set and test set
print("Class Proportion (%) in Trainset and Testset || 0: Normal register & 1: Promo abuse")
cbind(Trainset = prop.table(table(train$Target)) * 100, 
      Testset = prop.table(table(test$Target)) * 100)

[1] "Class Proportion (%) in Trainset and Testset || 0: Normal register & 1: Promo abuse"


| Class | Trainset | Testset |
|---|---|---|
| 0 | 51.35135 | 53.53535 |
| 1 | 48.64865 | 46.46465 |


**Target Frequency:** The class proportions in the train set and the test set are close to those in the original dataset. Because the classes are roughly balanced, no resampling is applied to the train set.
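`sample.split()` from caTools takes the vector of labels and returns one TRUE/FALSE entry per observation, preserving the class ratio in both parts. A minimal sketch of an 80/20 split on hypothetical toy data (the `toy` data frame is an assumption for illustration):

```r
library(caTools)

set.seed(100)

# Hypothetical toy data with 52 normal and 48 abuse labels
toy <- data.frame(
  Dim.1  = rnorm(100),
  Target = factor(rep(c("0", "1"), times = c(52, 48)))
)

# Pass the label vector so the mask has one entry per row
mask      <- sample.split(toy$Target, SplitRatio = 0.8)
toy_train <- toy[mask, ]
toy_test  <- toy[!mask, ]

c(train = nrow(toy_train), test = nrow(toy_test))
prop.table(table(toy_train$Target))
```

Checking `length(mask) == nrow(toy)` is a cheap guard against a mask that is silently recycled across rows.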

Next, we apply several predictive models to detect fraudulent registrations:
- Random forest
- K-nearest neighbors
- Logistic regression
- Decision tree

In [16]:
# create list of models 
model_name <- c( "rf", "knn", "glm", "rpart")

# this variable is used to store the evaluation metrics
evaluation <- matrix(nrow = 0, ncol = 5) 
# this variable is used to store confusion matrices
list_confusion_matrix <- list() 

for (methods in model_name){
    # build model on train set
    set.seed(1)
    model <- train(Target ~ ., 
                   data = train, 
                   method = methods, 
                   trControl = trainControl(method = "cv", number = 5))

    # use the model to predict on test set
    predictions <- predict(model, newdata = test)

    # calculate evaluation metrics 
    accuracy <- sum(predictions == test$Target) / length(test$Target)
    auc <- as.numeric(auc(test$Target, as.numeric(predictions)))
    cm <- confusionMatrix(predictions, test$Target)$table
    precision <- cm[2, 2]/(cm[2, 2] + cm[2, 1]) # TP/(TP+FP)
    recall <- cm[2, 2]/(cm[2, 2] + cm[1, 2]) # TP/(TP+FN)
    f1 <- 2 * (precision * recall) / (precision + recall)

    # store calculated metrics and confusion matrices
    evaluation <- rbind(evaluation, c(accuracy, auc, precision, recall, f1))
    list_confusion_matrix[[length(list_confusion_matrix)+1]] <- list(cm)
}

Setting levels: control = 0, case = 1

Setting direction: controls < cases

Setting levels: control = 0, case = 1

Setting direction: controls < cases

Setting levels: control = 0, case = 1

Setting direction: controls < cases

Setting levels: control = 0, case = 1

Setting direction: controls < cases



Once the model has been constructed using the training set, it will be assessed by applying it to the test set. This evaluation process involves examining various performance metrics to gauge the model's effectiveness.
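The loop above passes hard class labels to `auc()`. AUC is more commonly computed from predicted scores or probabilities, because it measures how well the model ranks positives above negatives. The following is a base-R sketch of the rank-based (Mann-Whitney) form. The function name `rank_auc` and the toy `scores` and `labels` are hypothetical.

```r
# Rank-based AUC (Mann-Whitney form); `scores` are hypothetical predicted
# probabilities for class "1", `labels` are the true classes
rank_auc <- function(scores, labels, positive = "1") {
  is_pos <- labels == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(scores)  # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.90, 0.80, 0.35, 0.60, 0.20, 0.10)
labels <- c("1", "1", "1", "0", "0", "0")
rank_auc(scores, labels)
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, so the function returns 8/9. With caret, class probabilities can typically be obtained with `predict(model, newdata = test, type = "prob")` for models that support them.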

In [17]:
# see evaluation metrics of all models
print("The evaluation metrics: ")
colnames(evaluation) <- c("Accuracy", "AUC", "Precision", "Recall", "F1")
rownames(evaluation) <- c("Random Forest", "K-nearest neighbors", "Logistic Regression", "Decision Tree")
evaluation

# see the confusion matrices of all models
print("The confusion matrices: ")
names(list_confusion_matrix) <- c("Random Forest", "K-nearest neighbors", "Logistic Regression", "Decision Tree")
list_confusion_matrix

[1] "The evaluation metrics: "


| Model | Accuracy | AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Random Forest | 0.7676768 | 0.7686628 | 0.7346939 | 0.7826087 | 0.7578947 |
| K-nearest neighbors | 0.7070707 | 0.7034454 | 0.6976744 | 0.6521739 | 0.6741573 |
| Logistic Regression | 0.6464646 | 0.6454061 | 0.6170213 | 0.6304348 | 0.6236559 |
| Decision Tree | 0.7272727 | 0.7251846 | 0.7111111 | 0.6956522 | 0.7032967 |


[1] "The confusion matrices: "


$`Random Forest`
$`Random Forest`[[1]]
          Reference
Prediction  0  1
         0 40 10
         1 13 36


$`K-nearest neighbors`
$`K-nearest neighbors`[[1]]
          Reference
Prediction  0  1
         0 40 16
         1 13 30


$`Logistic Regression`
$`Logistic Regression`[[1]]
          Reference
Prediction  0  1
         0 35 17
         1 18 29


$`Decision Tree`
$`Decision Tree`[[1]]
          Reference
Prediction  0  1
         0 40 14
         1 13 32



**Interpret the evaluation metrics**

- **Accuracy** 
    - Accuracy is a straightforward metric that gives a quick overview of a classification model's overall performance. However, it has limitations: a high accuracy score can be misleading, for example when the classes are imbalanced. It is therefore essential to also consider precision, recall, F1-score, and area under the curve (AUC) for a more complete picture of the model's performance.
    - The 'Random Forest' model has the best accuracy score.

- **AUC**
    - AUC stands for Area Under the (ROC) Curve and is a commonly used evaluation metric for binary classification models. It measures the model's ability to discriminate between positive and negative instances.
    - The AUC score ranges from 0 to 1, with a higher value indicating better performance. An AUC of 1 means the model perfectly distinguishes positive from negative instances, while an AUC of 0.5 means its predictions are no better than random guessing.
    - In the code above, AUC is computed from the predicted class labels rather than predicted probabilities, so each value equals the average of sensitivity and specificity at the default decision threshold.
    - The 'Random Forest' model has the best AUC score.

- **Precision**
    - Precision measures the accuracy of the model's positive predictions: the proportion of true positives (correctly predicted positive instances) among all positive predictions made.
    - The precision metric focuses on the quality of the positive predictions, that is, on avoiding false positives.
    - The 'Random Forest' model has the best Precision score.

- **Recall**
    - Recall, also known as sensitivity or true positive rate, measures the model's ability to correctly identify positive instances out of all actual positive instances in the dataset.
    - The recall metric focuses on the model's ability to avoid false negatives.
    - The 'Random Forest' model has the best Recall score.

- **F1**
    - F1-score is a metric that combines precision and recall into a single value, providing a balanced measure of a model's performance in binary classification tasks. It strikes a balance between false positives and false negatives and is particularly useful when both precision and recall are equally important.
    - The 'Random Forest' model has the best F1 score.

***Overall, the 'Random Forest' model shows the best results on our test set.***
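As a worked check, the Random Forest metrics can be recomputed by hand from its confusion matrix shown in the output above. The sketch copies the four counts from that output. The variable names are chosen here for illustration.

```r
# Random Forest confusion matrix copied from the output above
# (rows: predicted class, columns: reference class)
cm <- matrix(c(40, 13, 10, 36), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

tp <- cm["1", "1"]  # predicted abuse, actually abuse
fp <- cm["1", "0"]  # predicted abuse, actually normal
fn <- cm["0", "1"]  # predicted normal, actually abuse
tn <- cm["0", "0"]  # predicted normal, actually normal

accuracy    <- (tp + tn) / sum(cm)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)

# Average of recall and specificity; matches the AUC reported for hard labels
balanced_acc <- (recall + specificity) / 2

round(c(Accuracy = accuracy, Precision = precision, Recall = recall,
        F1 = f1, BalancedAccuracy = balanced_acc), 7)
```

The values reproduce the Random Forest row of the evaluation table: 0.7676768, 0.7346939, 0.7826087, 0.7578947, and 0.7686628 for the AUC column.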

---


## 6. Conclusion <a name="conclusion"></a>

In conclusion, the project focused on detecting promo abuse using machine learning techniques, specifically leveraging MCA (Multiple Correspondence Analysis). The goal was to identify patterns, associations, and relationships within the dataset to effectively detect and prevent promo abuse instances.

**Summary**

- **Labeling the target variable:** The dataset had no target variable, so it was labeled with rule-based criteria derived from domain knowledge: registrations that repeat a personally identifiable value (UserId, Email, Mobile, Ref PC AC Number, or Device Id) were flagged as promo abuse.

- **MCA technique:** The MCA technique played a crucial role in the project by transforming qualitative variables into numerical representations, facilitating feature engineering, dimensionality reduction, and model interpretation. It allowed for the exploration of associations between categorical variables and provided insights into the underlying structure of the data.

- **Machine learning model:** Machine learning models built with the MCA dimensions as features were used to predict promo abuse cases. The models were trained and then evaluated with accuracy, precision, recall, F1-score, and AUC. The 'Random Forest' model gave the best predictions on the test set.


**Some recommendations for future improvements**

- **Advanced Machine Learning Techniques:** Experiment with advanced machine learning algorithms beyond traditional models. Deep learning approaches, such as neural networks or recurrent neural networks (RNNs), may uncover complex patterns and improve the detection accuracy of promo abuse.

- **Unsupervised Learning Techniques:** Consider using unsupervised learning techniques such as clustering or anomaly detection algorithms. These techniques can help identify unusual patterns or outliers that may indicate promo abuse, even in cases where labeled data is limited.

- **Ensembling and Model Stacking:** Explore ensemble methods such as model averaging, bagging, or boosting to combine multiple models' predictions. Additionally, consider model stacking, which combines the predictions of multiple models as input to a higher-level model, to improve overall performance and capture diverse aspects of promo abuse.