### Objective:
The objective of this assignment is to manipulate, wrangle, and visualize data.

We use the German Credit Risk dataset to answer the questions given in this notebook.

### German Credit Risk Data

**About the dataset**\
The dataset consists of the following columns:
1. **checking_balance**           : Amount of money available in the customer's checking account
2. **months_loan_duration**       : Duration of the loan in months
3. **credit_history**             : Credit history of the customer
4. **purpose**                    : Purpose for which the loan was taken
5. **amount**                     : Amount of the loan
6. **savings_balance**            : Balance in the customer's savings account
7. **employment_duration**        : Duration of employment
8. **percent_of_income**          : Percentage of monthly income
9. **years_at_residence**         : Number of years at the current residence
10. **age**                       : Age of the customer
11. **other_credit**              : Any other credit taken by the customer
12. **housing**                   : Type of housing, such as rent or own
13. **existing_loans_count**      : Number of existing loans
14. **job**                       : Job type
15. **dependents**                : Dependents of the customer
16. **phone**                     : Whether the customer has a phone
17. **default**                   : Default status (target column)

#### Install Libraries

In [30]:
!pip install imbalanced-learn



In [26]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import RandomOverSampler

# Categorical encoding (ordinal and label)
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

#### Manipulate and Wrangle Data


Impute missing values in the column using a random sampling approach based on the distribution of non-missing values.

In [5]:
def impute_missing_data(df, column):
    # Show the category counts before imputation (missing values are excluded)
    value_counts = df[column].value_counts()
    print("Original value_counts:\n", value_counts)
    print('\n')

    # Get the distribution of non-missing values
    distribution = df[column].value_counts(normalize=True)

    print("Distribution:\n", distribution)
    print('\n')

    # Replace missing values based on distribution
    missing_indices = df[df[column].isnull()].index
    imputed_values = np.random.choice(distribution.index, size=len(missing_indices), p=distribution.values)
    df.loc[missing_indices, column] = imputed_values

    # Calculate value counts after imputation
    value_counts_after = df[column].value_counts()
    print("Value_counts after imputation:\n", value_counts_after)
    print('\n')
    
    return df
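
The sampling idea behind this function can be seen on a small toy column. The sketch below is illustrative only: the category labels, the toy DataFrame, and the seeded generator are assumptions, not values from the German Credit Risk data. Missing entries are filled by drawing from the observed categories in proportion to their relative frequency.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries (labels are illustrative)
toy = pd.DataFrame(
    {"checking_balance": ["low", "medium", None, "low", None, "high"]}
)

# Seeded generator so the random draw is repeatable (an assumption of this sketch)
rng = np.random.default_rng(seed=0)

# Relative frequencies of the observed categories; NaN is excluded by default
distribution = toy["checking_balance"].value_counts(normalize=True)

# Draw one replacement per missing row, weighted by those frequencies
missing_mask = toy["checking_balance"].isna()
toy.loc[missing_mask, "checking_balance"] = rng.choice(
    distribution.index.to_numpy(),
    size=int(missing_mask.sum()),
    p=distribution.to_numpy(),
)

print(toy["checking_balance"].value_counts())
```

Because the replacements are sampled, the exact counts after imputation vary with the seed, but the overall shape of the distribution is roughly preserved.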

Merge categories

In [8]:
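def merge_categories(df, column, category1, category2):
    # Show category counts before merging
    value_counts = df[column].value_counts()
    print(value_counts)
    print('\n')

    # Relabel every occurrence of category1 as category2
    df[column] = df[column].replace(category1, category2)

    # Show category counts after merging, largest first
    value_counts = df[column].value_counts()
    print(value_counts.sort_values(ascending=False))
    return df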
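
As a quick illustration of the merge step, the sketch below folds one label into another on a toy column. The labels `car0` and `car` and the toy DataFrame are hypothetical, chosen only to show the effect of `replace`.

```python
import pandas as pd

# Hypothetical category labels, for illustration only
toy = pd.DataFrame({"purpose": ["car", "car0", "education", "car", "business"]})

# Fold the rare label "car0" into "car"
toy["purpose"] = toy["purpose"].replace("car0", "car")

# Counts after merging: "car" now includes the former "car0" row
counts = toy["purpose"].value_counts()
print(counts)
```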

Visualization: Feature Distribution

In [14]:
def feature_distribution(df, features):
    # Calculate the number of rows and columns needed for subplots
    num_rows = (len(features) + 1) // 2
    num_cols = 2

    # Create a figure and subplots
    fig, axs = plt.subplots(num_rows, num_cols, figsize=(12, 30))

    # Flatten the axs array to iterate through subplots
    axs = axs.flatten()

    # Iterate through features and create bar plots
    for i, feature in enumerate(features):
        ax = axs[i]
        df[feature].value_counts().plot(kind='bar', ax=ax)
        ax.set_title(f'{feature} Distribution')
        ax.set_xlabel(feature)
        ax.set_ylabel('Frequency')

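    # Hide any unused subplot (left over when the number of features is odd)
    for ax in axs[len(features):]:
        ax.set_visible(False)

    # Adjust spacing between subplots
    plt.tight_layout()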

    # Show the plots
    plt.show()


    

### Wrangling: Categorical Encoding

- Ordinal Encoding

In [19]:
def perform_ordinal_encoding(df, columns, categories_list):
    for col, categories in zip(columns, categories_list):
        encoder = OrdinalEncoder(categories=[categories])
        # fit_transform returns an (n, 1) array; flatten it before assigning to a column
        encoded_values = encoder.fit_transform(df[[col]]).ravel()
        df[f'{col}_encoded'] = encoded_values
    return df
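
Ordinal encoding is useful when the categories have a natural order, because the order passed to `OrdinalEncoder` determines the integer codes. A minimal sketch, assuming a hypothetical three-level ordering (`low`, `medium`, `high`) that is not taken from the dataset:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column and category order, for illustration only
toy = pd.DataFrame({"savings_balance": ["low", "high", "medium", "low"]})
order = ["low", "medium", "high"]

# The position of each label in `order` becomes its code: low=0, medium=1, high=2
encoder = OrdinalEncoder(categories=[order])
toy["savings_balance_encoded"] = encoder.fit_transform(
    toy[["savings_balance"]]
).ravel()

print(toy)
```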

- Label Encoder

In [25]:
def perform_label_encoding(df, columns_to_encode):
    for col in columns_to_encode:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
    return df
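
Unlike ordinal encoding, `LabelEncoder` assigns codes according to the sorted order of the labels, so the resulting integers carry no meaningful ranking. A small sketch with hypothetical housing labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
labels = ["rent", "own", "other", "own"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted labels; each code is an index into it
print(list(le.classes_))
print(list(codes))

# The original labels can be recovered from the codes
restored = list(le.inverse_transform(codes))
```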

Relationship of categorical predictor columns with target

In [41]:
def categorical_predictors_to_target(df, categorical_columns):
    plt.figure(figsize=(15, 10))
    for i, col in enumerate(categorical_columns):
        # The 3x3 grid holds at most nine columns
        plt.subplot(3, 3, i + 1)
        # errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
        sns.barplot(x=col, y='default', data=df, errorbar=None)
        plt.title(f'Relationship between {col} and Default')
        plt.ylabel('Default Probability')
        plt.xlabel(col)
        plt.xticks(rotation=45)
    plt.tight_layout()
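    # Save before showing: after plt.show() the figure may already be closed and the file would be blank.
    # The raw string keeps backslashes such as \r and \f from being read as escape sequences.
    plt.savefig(r'C:\Jasmine\GreatLearning\ML\project\GermanBankLoan\german-bank-loan-defaults\reports\figures\CategoricalFeaturesRelationToTarget.png')
    plt.show()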

Relationship of numeric predictor columns with target

In [44]:
def numeric_predictors_to_target(df, numeric_columns):
    plt.figure(figsize=(15, 10))
    for i, col in enumerate(numeric_columns):
        # The 3x3 grid holds at most nine columns
        plt.subplot(3, 3, i + 1)
        sns.boxplot(x='default', y=col, data=df)
        plt.title(f'Relationship between {col} and Default')
        plt.xlabel('Default')
        plt.ylabel(col)
    plt.tight_layout()
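    # Save before showing: after plt.show() the figure may already be closed and the file would be blank
    plt.savefig('C:/Jasmine/GreatLearning/project/GermanBankLoan/german-bank-loan-defaults/reports/figures/NumericalFeaturesRelationToTarget.png')
    plt.show()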
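
When a figure is both displayed and written to disk, the order of the two calls matters: in notebooks and other non-interactive backends the figure is usually released once `plt.show()` returns, so a later `plt.savefig()` can write an empty image. A minimal sketch of the save-then-show order, using hypothetical toy counts and a temporary output path (both are assumptions, not part of the project):

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical toy counts for the two target classes
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["no", "yes"], [7, 3])
ax.set_title("default (toy data)")
fig.tight_layout()

# Hypothetical output location inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "toy_figure.png")

fig.savefig(out_path)  # save first, through the Figure object
plt.show()             # then display (has no visible effect under the Agg backend)
plt.close(fig)         # release the figure explicitly
```

Saving through the `Figure` object rather than `plt.savefig` also avoids depending on which figure happens to be current.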

Perform random oversampling of the minority class to improve recall

In [32]:
def perform_random_oversampling(X, y, random_state=42):
    # Count the occurrences of each class before oversampling
    class_counts = np.bincount(y)
    print("Class counts before oversampling:", class_counts)

    # Create an instance of RandomOverSampler
    oversampler = RandomOverSampler(sampling_strategy="minority", random_state=random_state)

    # Perform random oversampling on the data
    X_resampled, y_resampled = oversampler.fit_resample(X, y)

    # Count the occurrences of each class after oversampling
    resampled_class_counts = np.bincount(y_resampled)
    print("Class counts after oversampling:", resampled_class_counts)

    return X_resampled, y_resampled
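
To make the effect of random oversampling concrete, the sketch below reproduces the idea with NumPy only: minority-class rows are duplicated at random, with replacement, until both classes have the same count. This is a simplified stand-in for a binary target, not the imbalanced-learn implementation; the helper name `oversample_minority` and the toy arrays are hypothetical.

```python
import numpy as np


def oversample_minority(X, y, random_state=42):
    """Simplified illustration of random oversampling for a binary 0/1 target."""
    rng = np.random.default_rng(random_state)

    # Class sizes, indexed by class label
    counts = np.bincount(y)
    minority, majority = np.argmin(counts), np.argmax(counts)

    # Number of extra minority rows needed to match the majority class
    n_extra = counts[majority] - counts[minority]

    # Sample existing minority rows with replacement
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

    # Keep every original row and append the duplicated minority rows
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Hypothetical toy data: 7 rows of class 0 and 3 rows of class 1
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)

X_res, y_res = oversample_minority(X, y)
print("Before:", np.bincount(y))      # [7 3]
print("After: ", np.bincount(y_res))  # [7 7]
```

Oversampling is usually applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.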