# COGS 118B - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training an unsupervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project must include some elements of unsupervised learning, but you are welcome to include some supervised or other learning approaches as well.
- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

# Names

- Markus Buan
- Yasamin Mazaheri
- Vishal Patel
- Akash Premkumar
- Michael Tang


# Abstract 
With the vast number of people opening credit card accounts, banks need a way to find valuable insights about customers and their spending habits. Currently, the FICO score is the primary way that banks gauge the creditworthiness of credit card holders. However, this score is affected by so many factors that it may not provide a straightforward picture of a customer. Our data is composed of 8950 credit card holders and their credit card spending habits, described by 18 different features. We will use this data to build a clustering model that performs customer segmentation and identifies distinct groups of customers based on their spending habits. Our results can be used to improve marketing campaigns or to tailor services such as new card applications or rewards programs to each customer group.

# Background

With the rise of the Internet and e-commerce, credit card datasets have been increasingly studied with machine learning methods to gain insight into fraud detection problems. Datasets with credit card transaction details are used to discover what kinds of transactions can be identified as fraud using both unsupervised and supervised techniques<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). Supervised techniques use existing labels of historical transactions to predict the probability of a fraudulent transaction, while unsupervised techniques identify fraud by detecting outliers in the dataset. Clustering algorithms such as k-means, when used for unsupervised outlier detection on fraud data, may utilize features such as average transaction spending and total transactions over a certain time range. We are also interested in using unsupervised learning, but rather than only tracking customer spending through transaction datasets, we also need features that capture credit card payment behavior.

Marketing effectively to the appropriate customers is essential for fostering growth and longevity in various industries, which unsupervised machine learning can help accomplish<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). For instance, a hierarchical clustering algorithm can be used to group similar objects together using distance metrics, and an optimal number of clusters can be determined through the elbow method. By combining credit card spending and payment variables, we can use unsupervised methods like clustering to group customers based on how similar their creditworthiness or risk levels are to those of other customers.

# Problem Statement

The problem we are addressing is finding a more effective way to group customers based on their credit card usage habits. The FICO score is the primary metric for creditors to understand a customer’s credit behavior, as it is determined by the customer’s credit history and is tracked by different credit reporting agencies<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). This presents a problem, since each of the three agencies is provided with different information on credit card holders, which creates inconsistency between the FICO scores measured by each agency. This is why we propose clustering credit card holders based on their spending habits, which would set aside much of the other “noise” that the FICO score includes. We can utilize several clustering algorithms like KMeans, DBSCAN, and Gaussian Mixture Models to find which factors contribute the most to a cluster and to group similar customers, in terms of credit card usage habits, to gain valuable insight into different types of customers. For instance, a person with low "PURCHASES_FREQUENCY", low "BALANCE", high "MINIMUM_PAYMENTS", and low "TENURE" might be grouped into a cluster representing low-spending and higher-risk customers. On the other hand, a person with high "PURCHASES_FREQUENCY", high "CREDIT_LIMIT", and high "PRC_FULL_PAYMENT" might be grouped into a cluster representing high-spending and lower-risk customers. A distance measure such as the Euclidean distance makes the similarity between data points measurable. This method of clustering is replicable: someone who uses this dataset and implements the same clustering algorithm should get similar clustering results, depending on the exact method and parameters used.
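
As a rough illustration of the distance-based grouping described above, the sketch below implements Lloyd's algorithm for k-means with Euclidean distance using only the Python standard library. The toy points, the function name `kmeans`, and the choice of the first `k` points as initial centroids are assumptions made for illustration only; the project itself is expected to rely on library implementations of KMeans, DBSCAN, and Gaussian Mixture Models.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: returns (labels, centroids).

    Initial centroids are simply the first k points (an illustrative choice).
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Hypothetical standardized features: [PURCHASES_FREQUENCY, CREDIT_LIMIT]
toy = [
    (-1.0, -0.9), (1.1, 1.0), (-1.1, -1.0),
    (0.9, 1.2), (-0.9, -1.1), (1.0, 0.8),
]
labels, centroids = kmeans(toy, k=2)
```

On this toy input the low-frequency, low-limit points and the high-frequency, high-limit points end up in separate clusters. With real data the result would depend on the initialization and on the number of clusters chosen.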


# Data

- Dataset Link: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata/data 
- The dataset contains 8950 observations (unique customer IDs each representing one customer’s credit card information), and 18 features. These features include customer ID (CUST_ID), balance, balance frequency, purchases, one-off purchases, installments purchases, cash advance, purchase frequency, one-off purchase frequency, purchase installment frequency, cash advance frequency, cash advance transactions (CASH_ADVANCE_TRX), purchase transactions (PURCHASES_TRX), credit limit, payments, minimum payments, percent of full payments (PRC_FULL_PAYMENT), and tenure.
- Each observation consists of one customer’s credit card information, identified by a unique customer ID.
- Some critical variables are the customer ID, balance, purchases, purchase frequency, and credit limit. The customer ID is a string that is unique to each observation. The balance is given as a float and represents the balance left in the customer’s account. The purchase frequency is also a float in the range of 0 to 1, where 0 means purchases are infrequent and 1 means purchases are frequent. The credit limit is an integer and represents the customer’s credit card limit.
- This dataset does not require much cleaning to be ready to use; however, many of the features are on different scales. For example, the frequency features are on a scale from 0 to 1, while the balance and payments are raw dollar amounts. Some normalization may be required to put the features on a uniform scale. This is important when we run a clustering algorithm, since distance-based methods would otherwise weight large-valued features more heavily, making it harder to find trends or to judge which features are more important. In addition, data points with missing values will either be dropped completely or handled on a per-variable basis (for example, by imputing the missing value).
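
The scaling step described above can be sketched as follows. This is a minimal, standard-library-only illustration that assumes z-score standardization (one of several possible normalizations) and simply drops rows with missing values; the toy rows and the helper name `standardize` are hypothetical. In practice a library scaler such as scikit-learn's `StandardScaler` could be used instead.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column of `rows` (a list of equal-length numeric lists).

    Rows containing None (a missing value) are dropped first.
    """
    complete = [r for r in rows if None not in r]
    columns = list(zip(*complete))
    means = [mean(c) for c in columns]
    # Fall back to 1.0 so that constant columns do not cause division by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [
        [(v - m) / s for v, m, s in zip(row, means, stds)]
        for row in complete
    ]


# Hypothetical toy rows: [BALANCE (dollars), PURCHASES_FREQUENCY (0 to 1)]
toy = [
    [40.0, 0.10],
    [2500.0, 0.90],
    [1200.0, None],  # missing value, so this row is dropped
    [800.0, 0.50],
]
scaled = standardize(toy)
```

After this step every remaining column has mean 0 and unit standard deviation, so a dollar-valued feature no longer dominates a Euclidean distance simply because of its units.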


# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Why might your solution work? Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

One evaluation metric that can be used is the silhouette score. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1, where a high value indicates that the object is well-matched to its own cluster and poorly matched to neighboring clusters. This makes it a suitable metric for assessing how effectively a clustering algorithm, such as k-means, groups customers based on their credit card usage habits.


To calculate the silhouette score:
- For each data point, calculate $a$, the average distance from all other data points in the same cluster.
- For the same data point, calculate $b$, the average distance from all data points in the nearest neighboring cluster.
- Compute the silhouette score for the data point using the formula: silhouette score $= \frac{b - a}{\max(a, b)}$

    
The overall silhouette score for the clustering solution is the average of the silhouette scores of all data points.
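
The calculation above can be sketched in plain Python as follows. This is a minimal illustration that assumes Euclidean distance and at least two clusters; the toy points and the function name `mean_silhouette` are hypothetical, and singleton clusters are assigned a score of 0 by convention. In practice, scikit-learn's `silhouette_score` from `sklearn.metrics` could be used instead.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette score over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so it adds nothing to the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight groups that are far apart.
toy = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(toy, [0, 0, 1, 1])  # labels match the two groups
bad = mean_silhouette(toy, [0, 1, 0, 1])   # labels cut across the groups
```

A labeling that matches the well-separated groups scores close to 1, while a labeling that cuts across them scores below 0, which is the behavior we would rely on when comparing clustering setups.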


# Ethics & Privacy

A major privacy concern when working with credit card data in general involves the use of personally identifiable information (PII). If we were using data directly from a creditor, it would likely include PII such as names, addresses, account numbers, and SSNs. In this case, steps should be taken to anonymize the data. The data should still be representative of a diverse population of credit card applicants before anonymization and include observations of varying age, gender, income, ethnicity, geographic location, etc. This helps produce less biased results, as a machine learning model typically generalizes only to the kind of data it is trained on. The dataset we are working with consists mostly of numerical data, and personal information is not included. The data is anonymized, as each observation is denoted only by a customer ID. However, this means we can’t assess the representativeness of the dataset, and there is a possibility that the model could be biased or not generalizable to the overall population.


# Team Expectations 

- Meet weekly at a time we all agree on
    - Please try your very best to attend all meetings on time.
- Equitable contribution
    - Each team member contributes to their portion(s) of the project equally, to the best of their ability, and in a timely manner
    - If issues arise, communicate sooner rather than later
    - Ask another team member for help/advice if you run into any issues
- Clear and timely communication
    - Reply to messages addressed to everyone in the group chat within 24 hours
    - Communicate when you will be late to a meeting or miss a deadline
    - Share relevant information, discuss project progress in a timely manner
    - Communicate when you start committing changes on Github
- Be open-minded to other ideas and opinions and try to compromise with one another.
- If we have a conflict within our team, we will gather all members together and talk it out. Communication is the best way to solve any problem, and we will collectively take whatever action is needed.


# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/20  |  Before 11:59 PM |  Work on proposal  | Edit, finalize, and submit proposal | 
| 2/29  |  10 AM | Import & Clean Data, Work on Proposed Solution and Evaluation Metric, Review feedback from project proposal | Review and Finalize Data Section; Discuss our work on Proposed Solution and Evaluation Metric, Edit sections from the proposal based on feedback, Brainstorm ideas for the different subsections of the Results and distribute work based off this | 
| 3/7  | 10 AM  | Continue to work on Proposed Solution and Evaluation Metric, continue to discuss the different subsections of the Results and distribute work if this was not finished in the previous meeting, work on assigned parts for Results section | Finalize Proposed Solution and Evaluation Metric, Work on Results sections, Make plans to complete the Results section outside of group meetings, Distribute work for Discussion Section |
| 3/14  | 10 AM  | Continue working on Results and Draft Discussion section | Finalize Results section and work on finishing Discussion, Take an overall look at our project to see if anything is missing/needs to be improved  |
| 3/20  | Before 10:00 PM  | Finish Discussion Section, Final Touches and Polish Document aesthetics | Turn in Final Project |

# Footnotes

<a name="cite_note-1"></a>1.[^](#cite_ref-1): Carcillo, Fabrizio, et al. “Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection.” Information Sciences, vol. 557, May 2021, pp. 317–31. ScienceDirect, https://doi.org/10.1016/j.ins.2019.05.042.<br> 
<a name="cite_note-2"></a>2.[^](#cite_ref-2): van Leeuwen, Rik, and Ger Koole. “Data-Driven Market Segmentation in Hospitality Using Unsupervised Machine Learning.” Machine Learning with Applications, vol. 10, Dec. 2022, p. 100414. ScienceDirect, https://doi.org/10.1016/j.mlwa.2022.100414.<br>
<a name="cite_note-3"></a>3.[^](#cite_ref-3): White, Alexandria, and Dan Avery. “What Is a FICO Score and Why Is It Important?” CNBC, https://www.cnbc.com/select/what-is-fico-score/. Accessed 20 Feb. 2024.



