# Assignment #2: Data Management and Preprocessing

## Data Management and Pre-processing

This section covers data management and pre-processing for machine learning. How raw data is handled and transformed significantly affects the performance and reliability of your models. We will use the Python libraries pandas and scikit-learn to address common challenges in real-world datasets, such as missing values, categorical variables, and features on different scales.

### Assignment Overview: AI/ML Solutions for Financial Services

As part of the assignment, students will develop AI/ML solutions for financial services. The objective is to build models that can analyze and predict credit risk, a vital task in the financial industry. The assignment covers dataset exploration, preprocessing, model development, and evaluation. Each step contributes to the robustness and accuracy of the resulting models.

Let's proceed with a practical example to understand the fundamental steps involved in data management and pre-processing.


## Paul's original data pre-processing code:

```python
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Sample data with one missing value each in Age and Income
data = pd.DataFrame({
    'Age': [25, 30, None, 35, 28],
    'Income': [50000, 60000, 75000, None, 55000],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
    'Loan_Status': ['Approved', 'Rejected', 'Approved', 'Approved', 'Rejected']
})

print("Original Data: ")
print(data)

# Handle missing values in the numeric columns with mean imputation
imputer = SimpleImputer(strategy='mean')
data[['Age', 'Income']] = imputer.fit_transform(data[['Age', 'Income']])

print("\nMissing Data replaced with mean: ")
print(data)

# Label-encode the categorical variables (Gender and Loan_Status).
# A separate encoder per column keeps each fitted mapping available later.
gender_encoder = LabelEncoder()
loan_status_encoder = LabelEncoder()
data['Gender'] = gender_encoder.fit_transform(data['Gender'])
data['Loan_Status'] = loan_status_encoder.fit_transform(data['Loan_Status'])

print("\nLabel encoding categorical data: ")
print(data)

# Scale the numerical features (Age and Income) using StandardScaler
scaler = StandardScaler()
data[['Age', 'Income']] = scaler.fit_transform(data[['Age', 'Income']])

# Display the preprocessed and cleansed data
print("\nScaling numerical features using StandardScaler: ")
print(data)
```

```
Original Data: 
    Age   Income Gender Loan_Status
0  25.0  50000.0      M    Approved
1  30.0  60000.0      F    Rejected
2   NaN  75000.0      M    Approved
3  35.0      NaN      F    Approved
4  28.0  55000.0      M    Rejected

Missing Data replaced with mean: 
    Age   Income Gender Loan_Status
0  25.0  50000.0      M    Approved
1  30.0  60000.0      F    Rejected
2  29.5  75000.0      M    Approved
3  35.0  60000.0      F    Approved
4  28.0  55000.0      M    Rejected

Label encoding categorical data: 
    Age   Income  Gender  Loan_Status
0  25.0  50000.0       1            0
1  30.0  60000.0       0            1
2  29.5  75000.0       1            0
3  35.0  60000.0       0            0
4  28.0  55000.0       1            1

Scaling numerical features using StandardScaler: 
        Age    Income  Gender  Loan_Status
0 -1.382164 -1.195229       1            0
1  0.153574  0.000000       0            1
2  0.000000  1.792843       1            0
3  1.689312  0.000000       0            0
4 -0.460721 -0.597614       1            1
```

## Explanation of Paul's Code:

**Import Libraries:**

- Import necessary libraries, including pandas for data manipulation and scikit-learn for data preprocessing.

**Sample Data:**

- Create a sample dataset with the columns `Age`, `Income`, `Gender`, and `Loan_Status`. Missing values and categorical variables are included intentionally so that each preprocessing step has something to act on.

**Handling Missing Values:**

- Use SimpleImputer to handle missing values in the `Age` and `Income` columns by replacing each missing value with the mean of that column's non-missing values.
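
Before imputing, it helps to confirm where the gaps are. The following is a minimal sketch on the numeric columns of the same sample data: it counts missing entries per column with `DataFrame.isna()` and then applies mean imputation. The names `missing_counts` and `imputed` are illustrative and do not appear in Paul's code.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({
    'Age': [25, 30, None, 35, 28],
    'Income': [50000, 60000, 75000, None, 55000],
})

# Count missing entries per column before imputing
missing_counts = data.isna().sum()
print(missing_counts)

# Mean imputation: each NaN is replaced by the mean of its column's observed values
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(imputed)
```

The mean is only one option; `SimpleImputer` also accepts `strategy='median'` or `strategy='most_frequent'`, which may be preferable for skewed columns.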

**Encoding Categorical Variables:**

- Use LabelEncoder to encode the categorical variables `Gender` and `Loan_Status` as integer labels. This is label encoding, not one-hot encoding.
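
Label encoding maps each category to an integer, which can suggest an ordering that does not exist. One-hot encoding is a common alternative for nominal features. The sketch below contrasts the two on the sample `Gender` column using `pandas.get_dummies`; it is shown for comparison only and is not part of Paul's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F', 'M']})

# Label encoding: classes are sorted, so 'F' -> 0 and 'M' -> 1
label_encoded = LabelEncoder().fit_transform(df['Gender'])
print(label_encoded)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df, columns=['Gender'], dtype=int)
print(one_hot)
```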

**Scaling Numerical Features:**

- Use StandardScaler to standardize the numerical features `Age` and `Income` so that each has zero mean and unit variance, making their values comparable.
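
StandardScaler computes $z = (x - \mu) / \sigma$ for each column, where $\sigma$ is the population standard deviation. As a small check, the sketch below standardizes the imputed `Age` values by hand with NumPy and compares the result with StandardScaler; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age values after mean imputation (29.5 replaces the missing entry)
age = np.array([25.0, 30.0, 29.5, 35.0, 28.0])

# Manual standardization; np.std defaults to the population standard deviation (ddof=0)
manual = (age - age.mean()) / age.std()

# StandardScaler expects a 2-D array, so reshape to a single column
scaled = StandardScaler().fit_transform(age.reshape(-1, 1)).ravel()

print(np.round(manual, 6))
print(np.allclose(manual, scaled))
```

The first manual value, about -1.382164, matches the first `Age` entry in the scaled output above.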

**Display Data:**

- Display the preprocessed and cleansed data at each step.
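
The same steps can also be combined into a single scikit-learn object. The sketch below is one possible arrangement, not Paul's method: it uses `Pipeline` and `ColumnTransformer`, and it swaps `LabelEncoder` for `OneHotEncoder` on `Gender`, since `LabelEncoder` is designed for target labels. `Loan_Status` is left out on the assumption that it is the prediction target.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    'Age': [25, 30, None, 35, 28],
    'Income': [50000, 60000, 75000, None, 55000],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
})

# Numeric columns: impute with the mean, then standardize
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# sparse_threshold=0 forces a dense NumPy array as output
preprocessor = ColumnTransformer(
    [
        ('num', numeric_steps, ['Age', 'Income']),
        ('cat', OneHotEncoder(), ['Gender']),
    ],
    sparse_threshold=0,
)

features = preprocessor.fit_transform(data)
print(features.shape)  # two scaled numeric columns plus two one-hot columns
```

Keeping the steps in one object means the statistics learned from training data (means, standard deviations, categories) can later be applied unchanged to new data with `preprocessor.transform`.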

# Assignment: Data Management and Pre-processing in Financial Services

## Introduction:

In this assignment, you will practice data management and pre-processing for financial services using a dataset from LendingClub. The goal is to apply essential data cleaning and transformation techniques to prepare the data for further analysis and modeling.

## Tasks:

1. **Data Loading:**
   - Import the LendingClub Loan Data dataset, limiting the import to three numeric variables and three character variables.
   - This step is spelled out in more detail below. The LendingClub Loan Data dataset will be used in other projects, so it is best practice to save it in the same directory as your Jupyter notebook.
  
2. **Data Exploration:**
   - Conduct an initial exploration of the dataset, examining summary statistics and understanding the distribution of key variables.

3. **Handling Missing Values:**
   - Identify and handle missing values for numeric variables using an appropriate strategy (e.g., imputation).

4. **Encoding Categorical Variables:**
   - Utilize encoding techniques (e.g., one-hot encoding) for handling categorical variables.

5. **Scaling Numerical Features:**
   - Implement scaling on numeric features to standardize their values.
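
For the data exploration task, a few pandas calls cover most of an initial look. The sketch below is a hedged starting point that reuses the sample values from Paul's example rather than the LendingClub data; apply the same calls to your own DataFrame.

```python
import pandas as pd

# Stand-in data reusing the sample values from Paul's example
df = pd.DataFrame({
    'Age': [25, 30, None, 35, 28],
    'Income': [50000, 60000, 75000, None, 55000],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
})

# Column types: numeric columns vs. object (character) columns
print(df.dtypes)

# Count, mean, std, min, quartiles, and max for the numeric columns
summary = df.describe()
print(summary)

# Frequency of each category in a character column
gender_counts = df['Gender'].value_counts()
print(gender_counts)

# Fraction of missing values per column
missing_share = df.isna().mean()
print(missing_share)
```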

## Grading:

- **Data Loading (15%):**
  - Successful import of the LendingClub dataset with specified limitations.

- **Data Exploration (20%):**
  - Effective exploration of key statistics and distribution of variables.

- **Handling Missing Values (20%):**
  - Appropriate identification and handling of missing values using a chosen strategy.

- **Encoding Categorical Variables (20%):**
  - Accurate encoding of categorical variables for enhanced model compatibility.

- **Scaling Numerical Features (15%):**
  - Proper implementation of scaling techniques on numeric features.

- **Code Readability and Comments (10%):**
  - Well-structured and commented code for clarity and understanding.

## Submission Guidelines:

- Submit a Jupyter notebook (.ipynb) containing your code, explanations, and results, or a Python file that I can run to generate your results.
- Clearly label each section with corresponding task numbers.
- Ensure code readability and provide comments where necessary.

Feel free to reach out if you have any questions. Happy coding!


## LendingClub Dataset Setup

### Overview:
The LendingClub dataset will be used in several projects. For easy access, save the dataset CSV file in the same directory as your Jupyter notebook. Follow the steps below to download, extract, and load the dataset into your working directory.

### Step-by-Step Guidance:

1. **Download the LendingClub Dataset:**
   - Visit the [LendingClub Dataset on Kaggle](https://www.kaggle.com/datasets/wordsforthewise/lending-club/).
   - Click on the "Download" button to obtain the dataset in ZIP format.

2. **Extract the Dataset:**
   - Locate the downloaded ZIP file (e.g., `loan.zip`).
   - Extract the contents to reveal the CSV file (`accepted_2007_to_2018q4.csv`).

3. **Move the CSV to Your Notebook Directory:**
   - Move the extracted CSV file to the directory where your Jupyter notebook resides.
   - Alternatively, you can specify the full path to the CSV file in your notebook.

4. **Load the Dataset in Your Notebook:**
   - Use the following code to read the CSV file in your Jupyter notebook:
     ```python
     import pandas as pd

     # Assuming the CSV file is in the same directory as your notebook
     data = pd.read_csv('accepted_2007_to_2018q4.csv', low_memory=False)
     ```

By following these steps, you'll have the LendingClub dataset readily available for analysis in your Jupyter notebook.
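
Task 1 asks you to limit the import to three numeric and three character variables. One way to do this is the `usecols` parameter of `pd.read_csv`. The sketch below runs against a small in-memory stand-in so that it works without the real file. The column names are an assumed selection (check them against the header of your downloaded CSV) and the two data rows are made up for illustration.

```python
import io
import pandas as pd

# Assumed selection: three numeric and three character variables
numeric_cols = ['loan_amnt', 'int_rate', 'annual_inc']
character_cols = ['grade', 'home_ownership', 'purpose']

# In-memory stand-in with made-up rows; for the real dataset, pass
# 'accepted_2007_to_2018q4.csv' to read_csv instead of `source`
source = io.StringIO(
    "id,loan_amnt,int_rate,annual_inc,grade,home_ownership,purpose\n"
    "1,3600,13.99,55000,C,MORTGAGE,debt_consolidation\n"
    "2,24700,11.99,65000,C,MORTGAGE,small_business\n"
)

# usecols restricts parsing to the listed columns, which also reduces memory use
data = pd.read_csv(source, usecols=numeric_cols + character_cols)
print(data.dtypes)
```

Because the full CSV is large, restricting columns at load time is usually faster than loading everything and dropping columns afterwards.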
