# In-class Competition: Home Credit Default Risk

## Background

This competition is based on the **“Home Credit Default Risk”** competition, which was originally held on Kaggle in 2018 (https://www.kaggle.com/c/home-credit-default-risk/overview/description).

The goal is to predict, from a variety of customer data, whether each customer (one row of data) is likely to default on their loan. The original competition provided multiple datasets; this competition distributes only a subset of them.

## Objective

Predict the **probability** of default for each customer based on the provided data. The TARGET column in the dataset serves as the target variable, where 1 indicates payment difficulties (default), and 0 indicates no payment difficulties. All other columns are explanatory variables.
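As a minimal sketch of the split described above, the following separates `TARGET` from the explanatory variables. It uses a tiny in-memory stand-in rather than the real file, and `FEATURE_A` and `FEATURE_B` are hypothetical column names. Setting `SK_ID_CURR` aside as an identifier instead of using it as a feature is an assumption of this sketch, not a rule stated here.

```python
import csv
import io

# Tiny in-memory stand-in for input/train.csv.
# FEATURE_A and FEATURE_B are hypothetical column names used only for illustration.
TRAIN_CSV = """SK_ID_CURR,TARGET,FEATURE_A,FEATURE_B
1,0,0.2,10
2,1,0.9,3
3,0,0.4,7
"""

def split_target(rows, target="TARGET", id_col="SK_ID_CURR"):
    """Split rows into identifiers, explanatory variables, and the 0/1 target."""
    ids = [r[id_col] for r in rows]
    y = [int(r[target]) for r in rows]
    # Assumption: the identifier is kept aside rather than used as a feature.
    X = [{k: v for k, v in r.items() if k not in (target, id_col)} for r in rows]
    return ids, X, y

rows = list(csv.DictReader(io.StringIO(TRAIN_CSV)))
ids, X, y = split_target(rows)
print(y)             # [0, 1, 0]
print(sorted(X[0]))  # ['FEATURE_A', 'FEATURE_B']
```

With the real data, a library such as pandas (`pandas.read_csv`) would be the more usual way to load the file. The split itself stays the same.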

## Data

In data science competitions, **training data** and **test data** are typically provided. In this competition, the training data is **input/train.csv**, and the test data is **input/test.csv**. The training and test data differ in the following ways:
- Training Data: This data is used to train the machine learning model.
- Test Data: You will submit predictions for this data and compete on how well those predictions score on the evaluation metric.

For details on the columns in the training data, please refer to **HomeCredit_columns_description.xlsx**.

## Evaluation Metric

Submissions are judged by the [area under the ROC curve (AUC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), computed between the predicted probabilities and the ground truth for the test data.
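To make the metric concrete, here is a small pure-Python sketch of AUC. It computes the fraction of (defaulter, non-defaulter) pairs in which the defaulter received the higher predicted score, with ties counting as half. It is illustrative only: the scoring code used by the platform is not shown here, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    This is O(n_pos * n_neg), which is fine for illustration but slow on large data.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
print(roc_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc(y_true, [0.5, 0.5, 0.5, 0.5]))   # 0.5 (constant predictions carry no ranking)
```

Because AUC depends only on how the predictions are ordered, submitting hard 0/1 labels instead of probabilities discards ranking information and usually lowers the score.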

## Submission, Scoring, and Ranking

Predictions must be submitted via the **Omnicampus** platform. The format of the submitted data should be a CSV file as shown below.

SK_ID_CURR|TARGET
---|---
171202|0.5
171203|0.5
171204|0.5
…|…
232699|0.5
232700|0.5
232701|0.5


Please follow the format shown in **sample_submission.csv**. Submissions that do not adhere to this format receive a score of -1, so if you see a score of -1, check your file format first.
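The following sketch writes a file with the two columns shown above and runs a few basic checks before upload. The helper names `write_submission` and `find_problems` are hypothetical. The checks are assumptions about what a valid file looks like (comma-separated, exact header, probabilities in [0, 1], IDs in the same order as the test data), not the platform's actual validation rules. When in doubt, compare your file against sample_submission.csv.

```python
import csv
import io

HEADER = ["SK_ID_CURR", "TARGET"]

def write_submission(ids, probabilities, out):
    """Write (SK_ID_CURR, TARGET) rows as CSV to the open text file `out`."""
    if len(ids) != len(probabilities):
        raise ValueError("ids and probabilities must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for sk_id, p in zip(ids, probabilities):
        # Also rejects NaN, because any comparison with NaN is False.
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {sk_id}: {p}")
        writer.writerow([sk_id, p])

def find_problems(text, expected_ids):
    """Return a list of format problems; an empty list means these checks passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != HEADER:
        return ["first line must be the header SK_ID_CURR,TARGET"]
    problems = []
    body = rows[1:]
    # Assumption: IDs should appear in the same order as in the test data.
    if [r[0] for r in body if r] != [str(i) for i in expected_ids]:
        problems.append("SK_ID_CURR values do not match the test IDs")
    for r in body:
        try:
            ok = len(r) == 2 and 0.0 <= float(r[1]) <= 1.0
        except ValueError:
            ok = False
        if not ok:
            problems.append(f"bad row: {r}")
    return problems

test_ids = [171202, 171203, 171204]
buf = io.StringIO()  # replace with open("submission.csv", "w", newline="") for a real file
write_submission(test_ids, [0.5, 0.5, 0.5], buf)
print(buf.getvalue())
print(find_problems(buf.getvalue(), test_ids))  # []
```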

The number of submissions is unlimited. However, scoring results may not be immediately reflected. When checking your results, please ensure that the scoring timestamp has been updated.

During the competition period, scoring will be conducted on a subset of the test data. This approach helps prevent "leaderboard overfitting." Based on these interim results, the leaderboard on Omnicampus will be updated during the competition. This leaderboard, reflecting scores based on a subset of the test data, is referred to as the **Public Leaderboard**.

After the competition ends, final scoring will be performed on the entire test data, and the final rankings will be determined from this evaluation. The leaderboard generated from this final scoring is referred to as the **Private Leaderboard**.
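The difference between the two leaderboards can be illustrated with a small simulation on synthetic data. Everything here is an assumption made for illustration: the data is randomly generated, and the 30% public fraction is hypothetical, since the actual size and selection of the public subset are not stated.

```python
import random

def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Synthetic "test set": 0/1 labels plus noisy scores that are higher on average for label 1.
y_true = [1 if rng.random() < 0.3 else 0 for _ in range(500)]
y_score = [min(1.0, max(0.0, 0.3 + 0.2 * y + rng.gauss(0, 0.2))) for y in y_true]

# Hypothetical public split: 30% of the rows, chosen once and then kept fixed.
public_idx = rng.sample(range(len(y_true)), k=150)

public_auc = roc_auc([y_true[i] for i in public_idx], [y_score[i] for i in public_idx])
private_auc = roc_auc(y_true, y_score)
print(f"public  (subset): {public_auc:.4f}")
print(f"private (full)  : {private_auc:.4f}")
```

The two numbers generally differ because the subset is a noisy sample of the full data. Tuning a model repeatedly against the public score therefore risks fitting that noise, which is why local validation on the training data is a more reliable guide.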

Final rankings will be determined based on the **last submitted file**.

Participants who rank highly on the Private Leaderboard will be asked to submit code that runs on Google Colab, so that reproducibility can be verified before the final rankings are confirmed. We appreciate your cooperation in this matter.

## Timeline

Details regarding the start date, submission deadline, and ranking announcements will be provided separately.

## Rules
- **Prohibition of External Data Usage**  

No external data may be used at any stage of the analysis. You must rely solely on the provided datasets (input/train.csv and input/test.csv).

- **Prohibition of Hand-Labeling**  

Creating predictions manually instead of using a model is referred to as Hand-Labeling, and it is prohibited in most data science competitions. This rule also applies to this competition, prohibiting Hand-Labeling for all or part of the test data. **Making manual decisions for part of the predictions** based on rules or specific conditions derived from EDA (Exploratory Data Analysis) also falls under Hand-Labeling. All predictions must be generated by a model, and manual predictions are not permitted. Data processing must be based on reproducible methods, and the submitted predictions must be **automatically generated** by a model that can also be applied to other data with similar characteristics.

- **Ensuring Reproducibility**  
  
Ensure reproducibility of your predictions as much as possible. To achieve this, it is essential to **set seed values for random number generation**.

The phrase "as much as possible" is used because reproducibility cannot always be fully guaranteed. For example, an update to a deep learning framework such as PyTorch can change numerical results, in which case reproducing predictions made with an earlier version would require downgrading the framework. Such measures are not required for this competition.
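A minimal sketch of seed fixing is shown below. The helper name `set_seed` and the default value 42 are arbitrary choices for illustration. NumPy is seeded only if it is installed.

```python
import random

def set_seed(seed=42):
    """Seed the random number generators that this process uses."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # seeds NumPy's legacy global generator
    except ImportError:
        pass  # NumPy not installed; only the standard library is seeded

set_seed(0)
first = [random.random() for _ in range(3)]
set_seed(0)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Libraries usually need their own seed as well: many scikit-learn estimators and splitters accept a `random_state` argument, and PyTorch provides `torch.manual_seed`. Passing the same value everywhere keeps a notebook repeatable when it is rerun from top to bottom.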


- **Permission for Private Sharing**  

In data science competitions, private information sharing is typically strictly regulated for reasons of fairness. In most cases, information may be shared within a team only when the team has been declared through official channels. Sharing information privately with people outside the team, referred to as **Private Sharing**, is strictly prohibited.
However, since this competition is positioned as a tutorial for beginners, the Private Sharing prohibition does not apply. If you have experienced people nearby, you are encouraged to actively seek their advice, and you may also exchange information and discuss with other participants online. When discussing online with other participants, please try to do so in public forums rather than private channels, so that all participants can benefit from the insights.

- **Citing Original Sources (Reference Code)**  

Many high-quality and informative notebooks for this competition have been published online. Studying such approaches and working through their code is also a form of learning, so please feel free to reference them. There is no problem with referencing others' code.
From an educational standpoint, GCI distinguishes between one's own work and citations. Therefore, when you quote code, please explicitly indicate the original source. Top-ranking participants in this competition are required to disclose their code. If this disclosure reveals code that clearly comes from another source and is not cited, the prize may be revoked. It is therefore recommended to keep a record of the notebooks you referenced so that you can track which code you have used.