# Choice of Problem

In this notebook, we explain our choice of problem and link it back to what we learned in Project 0, taking into account both the issues we faced and the things that worked well.

**Table of Contents**
- [1. Reflections and Links to Project 0](#1-reflections-and-links-to-project-0)
- [2. Data Search](#2-data-search)
    - [2.1. Data Sources](#21-data-sources)
- [3. Datasets](#3-datasets)
    - [3.1. Classification Problems](#31-classification-problems)
    - [3.2. Kaggle Datasets](#32-kaggle-datasets)
    - [3.3. UCI Datasets](#33-uci-datasets)
    - [3.4. scikit-learn Datasets](#34-scikit-learn-datasets)
- [4. Choice of Dataset](#4-choice-of-dataset)
- [5. Conclusion](#5-conclusion)

## 1. Reflections and Links to Project 0
In Project 0, we used structured finance data as it was easy to find and provided opportunities to apply many different techniques. We investigated a classification problem, learning about challenges that it can involve, such as imbalanced data and missing data. This was interesting to us and we therefore chose a **classification problem** for Project 1. We chose not to analyse the same data, as we wanted to showcase our abilities to draw unique insights when the answers were not presented to us. We will instead draw upon the Kaggle notebooks we encountered and adapt our approach to our chosen problem. We believe that this will also help us grow and develop our skills further.

In Project 0, we encountered an issue with the Kaggle dataset: it was too large to upload to GitHub, and we did not have any other file hosting solution. We took this into account in our data search for Project 1 and aimed to find data that was easy to import.

## 2. Data Search

### 2.1. Data Sources

We searched a variety of data sources, starting from those we used in Project 0, to find datasets for Project 1. We found several interesting datasets that present challenges such as imbalanced data, missing values, and outliers, making them good candidates for learning opportunities. In searching for sources for this project, we also discovered additional sources, such as:
- [Dataset List](https://www.datasetlist.com/): This contains several machine learning datasets, involving image classification, NLP and other techniques.
- [QuantumStat Dataset Index](https://index.quantumstat.com/): This contains several NLP datasets.

We did not pursue these further, as their datasets were either too large or required advanced techniques (such as natural language processing). They are included for the reader's interest. Below are the other sources we explored, which were more feasible for analysis:
- [Serokell’s Blog on Best ML Datasets](https://serokell.io/blog/best-machine-learning-datasets): An accessible blog post listing many sources of datasets, such as **[Google Dataset Search](https://datasetsearch.research.google.com/)**.
- [Scikit-learn Real-World Datasets](https://scikit-learn.org/1.5/datasets/real_world.html): The scikit-learn package provides real-world datasets that can easily be imported using its loader functions.

We also used other resources from Project 0, such as **[Kaggle](https://www.kaggle.com/datasets)** and the **[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)**. Together, these sources provided us with a rich and varied set of datasets. We present some interesting datasets in the next section.


## 3. Datasets

As described earlier, we chose to focus on a classification problem. We provide some datasets for classification below. The datasets we selected present a wide range of challenges, including missing data and imbalanced classes. This provides an excellent opportunity for testing various machine learning techniques.
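
Before committing to a dataset, it can help to quantify these challenges rather than judge them by eye. The following is a minimal sketch, assuming the data has already been loaded into a pandas `DataFrame`; the helper name `summarise_challenges` and the toy data are our own illustrations and do not come from any of the datasets below.

```python
import numpy as np
import pandas as pd


def summarise_challenges(df: pd.DataFrame, target: str) -> dict:
    """Report the fraction of missing values per column and the class balance of the target."""
    missing = df.isna().mean()  # fraction of NaN entries in each column
    return {
        "missing_fraction": missing[missing > 0].round(3).to_dict(),
        "class_proportions": df[target].value_counts(normalize=True).round(3).to_dict(),
    }


# Tiny made-up frame, for illustration only (not real data).
toy = pd.DataFrame(
    {
        "age": [25, np.nan, 47, 51, 38, 29, np.nan, 60],
        "hours": [40, 35, 50, 45, 40, 20, 40, 55],
        "label": ["low", "low", "high", "low", "low", "low", "low", "high"],
    }
)

summary = summarise_challenges(toy, target="label")
print(summary)
```

Note that `isna()` only counts values pandas already treats as missing. If a dataset encodes missing entries with a placeholder string, those would need converting to `NaN` first.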

### 3.1. Classification Problems

The datasets are grouped by source. Kaggle datasets can be downloaded and then uploaded to GitHub or another file hosting solution if the file size permits. The UCI datasets can be loaded very easily by following the links provided and clicking the "Import to Python" button, which gives Python code to load the data. The scikit-learn datasets can also be loaded easily using the loader functions listed with each dataset.
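
As a minimal sketch of the scikit-learn loading pattern, the snippet below loads the digits dataset. We use `load_digits()` here because it ships with the package and needs no download. The `fetch_*` loaders, such as `fetch_covtype()`, return the same kind of object with `.data` and `.target` attributes, but download the data on first use.

```python
from sklearn.datasets import load_digits

# load_digits() is bundled with scikit-learn, so this runs offline.
digits = load_digits()

# Every loader returns a dictionary-like object with the features and the labels.
X, y = digits.data, digits.target

print(X.shape)  # (number of samples, number of features)
print(y[:10])   # the first ten class labels
```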

### 3.2. Kaggle Datasets

1. **Titanic Survival Dataset**
   - **Description:** Predict passenger survival based on demographic data like age, gender, and class.
   - **Challenges:** **Missing data** (especially age) and **imbalanced data** (fewer survivors).
   - **Size:** 891 records with 12 features.
   - **Use Case:** Survival prediction, historical data analysis.
   - **Source:** [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

---

2. **Loan Prediction Dataset**
   - **Description:** Predict loan approval based on credit history, income, and other financial information.
   - **Challenges:** **Missing data** and slightly **imbalanced** outcomes.
   - **Size:** 614 records with 13 columns, with a separate test set of 367 records.
   - **Use Case:** Financial decision-making.
   - **Source:** [Loan Prediction Dataset](https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset)

---

### 3.3. UCI Datasets

3. **Breast Cancer Dataset**
   - **Description:** Predict whether a tumor is benign or malignant based on cell nuclei features.
   - **Challenges:** Slightly **imbalanced dataset** (more benign than malignant cases).
   - **Size:** 569 samples with 30 features.
   - **Use Case:** Medical diagnosis.
   - **Source:** [UCI Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)

---

4. **Adult Income Dataset (Census Data)**
   - **Description:** Predict whether an individual's income exceeds $50,000 based on demographic features.
   - **Challenges:** **Missing values** in some attributes like occupation, **imbalanced classes** (fewer high-income individuals).
   - **Size:** 48,842 records with 14 features.
   - **Use Case:** Social and economic analysis.
   - **Source:** [UCI Adult Dataset](https://archive.ics.uci.edu/ml/datasets/adult)

---

5. **Drug Consumption Dataset**
   - **Description:** Predict whether an individual consumes a drug based on personality traits and demographics.
   - **Challenges:** **Imbalanced data** and potential ethical concerns.
   - **Size:** 1,885 records with 12 features.
   - **Use Case:** Behavioral studies, addiction research.
   - **Source:** [UCI Drug Consumption Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29)

---

### 3.4. scikit-learn Datasets

6. **Forest Cover Type Dataset**
   - **Description:** Classify forest cover type based on cartographic features like elevation and soil type.
   - **Challenges:** **Imbalanced data** across forest types.
   - **Size:** 581,012 records with 54 features.
   - **Use Case:** Environmental monitoring, land management.
   - **Method to Load:** `fetch_covtype()` (also available on [UCI ML Repository](https://archive.ics.uci.edu/dataset/31/covertype))

---

7. **Digits Dataset**
   - **Description:** Handwritten digit images (0–9) with pixel values, used for classification.
   - **Challenges:** Preprocessing steps like feature scaling and dimensionality reduction add complexity.
   - **Size:** 1,797 records, each with 64 features (an 8x8 pixel image).
   - **Use Case:** Optical character recognition (OCR), computer vision.
   - **Method to Load:** `load_digits()`

---

8. **20 Newsgroups Dataset**
   - **Description:** A text dataset for classification tasks, with the goal of classifying documents into one of 20 different newsgroups based on their content.
   - **Challenges:** Text classification and natural language processing (NLP).
   - **Size:** About 18,000 newsgroup posts.
   - **Use Case:** Text mining and topic modeling.
   - **Method to Load:** `fetch_20newsgroups()`

---

For more information and additional datasets, please visit Kaggle, the UCI Machine Learning Repository, or [scikit-learn's real-world datasets page](https://scikit-learn.org/1.5/datasets/real_world.html).

---

## 4. Choice of Dataset

The datasets that particularly stood out to us were the forest cover type dataset and the adult income dataset. These datasets present a variety of challenges and opportunities for applying classification techniques. They are also well suited to testing methods for handling imbalanced data (e.g., the synthetic minority oversampling technique, SMOTE) and missing data (e.g., imputation).
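
To make these two ideas concrete, here is a minimal sketch on synthetic data, not on either dataset. It combines median imputation with class reweighting in a scikit-learn pipeline. Note that `class_weight="balanced"` is a simpler stand-in we have swapped in for SMOTE, which lives in the separate imbalanced-learn package; the data, threshold, and missing-value rate are all made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 rows, 4 numeric features, positives in the minority.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.3).astype(int)

# Blank out roughly 10% of the entries to mimic missing values.
X[rng.random(X.shape) < 0.1] = np.nan

model = make_pipeline(
    SimpleImputer(strategy="median"),             # fill gaps with each column's median
    StandardScaler(),                             # put features on a common scale
    LogisticRegression(class_weight="balanced"),  # upweight the rare class
)
model.fit(X, y)

preds = model.predict(X)
print("Share of positives in the labels:", y.mean())
print("Share of positives in the predictions:", preds.mean())
```

Placing the imputer inside the pipeline means that, under cross-validation, the medians are learned from the training folds only, which avoids leaking information from the held-out data.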

After an in-person meeting and online WhatsApp discussions, we settled on using the adult income dataset.

## 5. Conclusion

This notebook has set out the reasoning behind our choice to analyse the adult income dataset. The next notebooks will describe the models we explored.