Apache PySpark Project - Credit_Score_Calculator

Overview

This Apache PySpark project focuses on calculating loan scores based on factors such as payment history, financial health, and default history. It uses PySpark to process and analyze loan-related data in order to derive insights and support decision-making.

Project Structure

  • Data Cleaning and Processing: Scripts that clean data related to loan payment histories, customer financial health, and loan defaulters, create processed DataFrames, and write the results back in both CSV and Parquet formats.
  • Data Analysis: Scripts to create external tables and views for analyzing processed data.
  • Loan Score Calculation: A detailed implementation of the loan score calculation logic based on predefined criteria.

Installation

Prerequisites

  • Apache Spark
  • Python 3.x
  • PySpark
  • Access to a Hadoop Distributed File System (HDFS) or a local filesystem for storage

Setup

  1. Clone the repository:
    git clone [repository-url]
    cd [project-directory]
  2. Install required Python packages:
    pip install -r requirements.txt

Usage

To run the project scripts, navigate to the project directory and use the following command:

spark-submit --master local[4] script_name.py

Replace script_name.py with the actual name of the script you want to execute.

Configuration

Modify the config.py file to update paths or parameters according to your environment setup. This includes specifying the paths for input data and locations for storing outputs.
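The repository's actual `config.py` is not reproduced here; the following is a hypothetical sketch of the kind of settings such a file typically holds. The variable names and paths are assumptions and will likely differ from the real file.

```python
# Hypothetical config.py sketch -- the real file's names and paths may differ.

# Where the raw input datasets live (HDFS or local filesystem)
raw_data_path = "hdfs://namenode:9000/data/raw/loans"

# Where processed DataFrames are written back
processed_data_path = "hdfs://namenode:9000/data/processed/loans"

# Output format for processed data: "parquet" or "csv"
output_format = "parquet"
```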

Data Model

  • Input Data: Includes loan repayment history, public records, bankruptcies, inquiries, and customer financial data.
  • Processed Data: DataFrames that aggregate and cleanse the input data to prepare it for analysis.
  • Output Data: Includes detailed records and summary statistics useful for downstream analysis and reporting.

Features

  • Data Cleaning: Scripts to identify and remove bad data, such as duplicate records.
  • Loan Score Calculation: Comprehensive calculation of loan scores from weighted metrics: loan repayment history (20%), loan defaulters history (45%), and financial health (35%).

Business Problem Statement

Business requirement 1: Teams need to analyze the cleaned data, which requires creating permanent tables on top of it so that downstream teams can query it with simple SQL-like queries. (External tables are preferred over managed tables because dropping an external table removes only its metadata; the underlying data files are not deleted, so an accidental drop does not lose the data.)

Business requirement 2: The teams require a single consolidated view of all the datasets with the latest up-to-date data. (Best practice is to create a view over the refreshed data; a view stores only the query, so if the underlying data is refreshed every 24 hours, nothing served by the view is more than 24 hours old. This is preferred over materializing the results when no further analytics is required, since no extra storage or refresh job is needed.)

Business requirement 3: Another team wants very quick access to the "view data" without waiting for the view's query to run, since computing those results takes a long time. (The joined result can instead be created and stored on a schedule, e.g. weekly; the data can then be read quickly, though it may not be the very latest. In this case a managed table is used, with the actual data stored in the warehouse directory.)

