## Initial Data Check

Purpose: Understanding what's in our credit card dataset before we clean it.

What we'll do:

1. Load the data
2. Check how many records we have
3. Look for missing information
4. Count unique customers
5. See what needs fixing for Tableau

## 1. Initial Setup

In [None]:
# importing libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import os

print("Libraries imported successfully.")   

## 2. Loading the Data

Loading the credit card dataset from the raw data folder.

In [None]:
# loading the data from the data/raw folder
df = pd.read_csv('../data/raw/clients_card_data.csv')

print("Data loaded.")
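
If the notebook is started from a different working directory, the relative path above will not resolve. A minimal sketch of a guard that fails early with a clearer message is shown below; `load_csv_checked` is a hypothetical helper name, not part of the original notebook.

```python
import os
import pandas as pd


def load_csv_checked(path):
    """Hypothetical helper: confirm the file exists before handing it to pandas."""
    if not os.path.isfile(path):
        # Show the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Expected raw data at: {os.path.abspath(path)}")
    return pd.read_csv(path)


# Usage in this notebook would look like:
# df = load_csv_checked('../data/raw/clients_card_data.csv')
```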

## 3. Data Overview

Getting an initial look at what's in our dataset.

In [None]:
# Shows the first 10 rows
print("First 10 rows of data:")
df.head(10)

## 4. Dataset Information

Checking how many rows and columns we have, what data types they hold, and whether any values are missing.

In [None]:
# checking to see the shape of the data

print("Shape of the data:")
print(f"columns: {df.shape[1]}, rows: {df.shape[0]}")

In [None]:
# checking to see overall info of the data
print("Overall info of the data:")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

The RangeIndex shows 6147 entries, while the non-null count for each column is 6146, meaning every column has one missing value.

In [None]:
# checking to see missing values in each column
print("Missing values in each column:")
print(df.isnull().sum())    
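
Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a small summary table of missing counts and percentages; `missing_summary` is a hypothetical helper, and the `demo` frame is made-up data standing in for `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Hypothetical helper: missing count and percentage per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0]


# Made-up stand-in for df, only to show the output shape
demo = pd.DataFrame({
    "client_id": [1, 2, np.nan],
    "credit_limit": [5000.0, np.nan, 1200.0],
    "card_brand": ["Visa", "Visa", "Amex"],
})
summary = missing_summary(demo)
print(summary)
```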

## 5. Finding and Removing Missing Data

Checking the rows with missing values and removing them.

In [None]:
# select the rows that have at least one missing value
missing_rows = df[df.isnull().any(axis=1)]
missing_rows.head()

In [None]:
# removing the rows with missing data
df_cleaned = df.dropna()
print(f"total rows after removing missing data: {df_cleaned.shape[0]}")
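
It can help to confirm how many rows `dropna()` actually removed and that nothing is left missing afterwards. The sketch below does this on a made-up `demo` frame (an assumption, not the real data). If all the missing values sit in a single row, exactly one row should be removed.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df: the last row is empty in every column
demo = pd.DataFrame({
    "id": [1.0, 2.0, 3.0, np.nan],
    "client_id": [10.0, 10.0, 11.0, np.nan],
    "card_brand": ["Visa", "Amex", "Visa", None],
})

rows_before = len(demo)
demo_cleaned = demo.dropna()
rows_removed = rows_before - len(demo_cleaned)

print(f"rows before: {rows_before}, rows removed: {rows_removed}")

# Sanity check: no missing values should remain after dropna()
assert not demo_cleaned.isnull().values.any()
```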

## 6. Customer Analysis

Renaming the card identifier column and counting how many unique customers we have.

In [None]:
# for clarity renaming the column id to card_id as this represents the card id.
df_cleaned = df_cleaned.rename(columns={'id': 'card_id'})
df_cleaned.head(1)

In [None]:
# checking how many unique clients we have 
unique_clients = df_cleaned['client_id'].nunique()

print(f"unique clients: {unique_clients}")
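
Since each row is a card rather than a client, the unique client count will usually be lower than the row count. The sketch below shows one way to see how cards are spread across clients. It uses a made-up `demo` frame and assumes the `card_id` and `client_id` column names from the cells above.

```python
import pandas as pd

# Made-up stand-in for df_cleaned: four cards held by three clients
demo = pd.DataFrame({
    "card_id": [100, 101, 102, 103],
    "client_id": [1, 1, 2, 3],
})

# Number of cards held by each client
cards_per_client = demo.groupby("client_id")["card_id"].count()

# Number of clients holding more than one card
clients_with_multiple_cards = int((cards_per_client > 1).sum())

print(cards_per_client)
print(f"clients with more than one card: {clients_with_multiple_cards}")
```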

## 7. Removing Columns

Keeping only the columns we need for key insights.

In [None]:
# list of columns we want to remove 
columns_to_remove = ['expires', 'year_pin_last_changed', 'acct_open_date', 'card_on_dark_web']

df_cleaned = df_cleaned.drop(columns=columns_to_remove)
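
By default, `drop` raises a `KeyError` when a listed column is not present, which happens if this cell is run a second time. A minimal sketch of a rerun-safe variant follows; the `demo` frame is made-up data, and filtering the list first is one possible approach rather than the notebook's original method.

```python
import pandas as pd

# Made-up stand-in for df_cleaned with only some of the columns present
demo = pd.DataFrame({
    "card_id": [1],
    "expires": ["12/2025"],
    "card_on_dark_web": ["No"],
})

columns_to_remove = ['expires', 'year_pin_last_changed', 'acct_open_date', 'card_on_dark_web']

# Split the list into columns that exist and columns that are already gone
present = [c for c in columns_to_remove if c in demo.columns]
absent = [c for c in columns_to_remove if c not in demo.columns]

demo_trimmed = demo.drop(columns=present)

print(f"dropped: {present}")
print(f"not found (already removed?): {absent}")
```

Passing `errors='ignore'` to `drop` would also avoid the error, but it gives no report of which columns were missing.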


In [None]:
# final list of columns after cleaning
df_cleaned.columns.tolist()

['card_id',
 'client_id',
 'card_brand',
 'card_type',
 'card_number',
 'cvv',
 'has_chip',
 'num_cards_issued',
 'credit_limit',
 'current_age',
 'retirement_age',
 'birth_year',
 'birth_month',
 'gender',
 'address',
 'latitude',
 'longitude',
 'per_capita_income',
 'yearly_income',
 'total_debt',
 'credit_score',
 'num_credit_cards']

## 8. Saving Cleaned Data

In [None]:
# saving the cleaned data to data/cleaned folder
df_cleaned.to_csv('../data/cleaned/clients_card_data_cleaned.csv', index=False)
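
`to_csv` does not create missing folders, so the save fails if the cleaned folder does not exist yet. The sketch below creates the folder first and reads the file back as a quick check. It writes to a temporary directory (an assumption made so the example runs anywhere) and uses a made-up `demo` frame in place of `df_cleaned`.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for df_cleaned
demo = pd.DataFrame({"card_id": [1, 2], "client_id": [10, 11]})

# Temporary stand-in for the notebook's ../data/cleaned folder
out_dir = os.path.join(tempfile.mkdtemp(), "data", "cleaned")

# Create the folder if needed; exist_ok avoids an error on reruns
os.makedirs(out_dir, exist_ok=True)

out_path = os.path.join(out_dir, "clients_card_data_cleaned.csv")
demo.to_csv(out_path, index=False)

# Read the file back to confirm it has the expected shape and columns
reloaded = pd.read_csv(out_path)
print(f"saved {reloaded.shape[0]} rows and {reloaded.shape[1]} columns to {out_path}")
```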

In [None]:
# create a working copy of the cleaned data - df_cc = dataframe cleaned copy
df_cc = df_cleaned.copy()