## Data Exploration and Cleaning w/ Jupyter Notebook

### Package versions

In [1]:
import sys
import pandas as pd

print('The Python version is: {}\n'.format(sys.version))
print('The Pandas version is: {}\n'.format(pd.__version__))

The Python version is: 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0]

The Pandas version is: 1.1.0



### Loading data into Jupyter w/ Pandas

In [2]:
df = pd.read_excel('data/default_of_credit_card_clients__courseware_version_1_21_19.xls')

Number of rows and columns (rows, columns):

In [3]:
df.shape 

(30000, 25)

### Getting familiar with the data

Examining the column names:

In [4]:
df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

Formatting the column names to lowercase:

In [5]:
df.columns = [column.lower() for column in df.columns]
df.columns

Index(['id', 'limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_1',
       'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2',
       'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1',
       'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6',
       'default payment next month'],
      dtype='object')
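The lowercasing step can also be written with the vectorized string accessor on the column index. A minimal sketch on an empty, made-up frame (the name `toy` and its three columns are only for illustration), showing that both forms give the same names:

```python
import pandas as pd

# Hypothetical frame that borrows three column names from the dataset
toy = pd.DataFrame(columns=['ID', 'LIMIT_BAL', 'default payment next month'])

# List comprehension, as used in the cell above
lowered_by_comprehension = [column.lower() for column in toy.columns]

# Vectorized alternative: Index.str.lower() returns a new Index
lowered_by_str = toy.columns.str.lower()

print(list(lowered_by_str) == lowered_by_comprehension)
```

Either form leaves the spaces in `default payment next month` untouched; only the letter case changes.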

Observing the first few rows of data:

In [6]:
df.head()

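| | id | limit_bal | sex | education | marriage | age | pay_1 | pay_2 | pay_3 | pay_4 | ... | bill_amt4 | bill_amt5 | bill_amt6 | pay_amt1 | pay_amt2 | pay_amt3 | pay_amt4 | pay_amt5 | pay_amt6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 798fc410-45c1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 8a8c8f3b-8eb4 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 85698822-43f5 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 0737c11b-be42 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 3b7f77cc-dbc0 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |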


Checking whether the ids are unique by counting the distinct values (a result below the 30000 rows means some ids repeat):

In [7]:
df.id.nunique()

29687

Inspecting the duplicated ids:

In [8]:
id_counts = df['id'].value_counts()
id_counts.head()

956cbf4a-d24e    2
31d0095f-75bc    2
0a2d6db4-0c54    2
cd944abe-cef9    2
64fe3b96-7ed0    2
Name: id, dtype: int64

Verifying the number of duplicates (how many ids occur once and how many occur twice):

In [9]:
id_counts.value_counts()

1    29374
2      313
Name: id, dtype: int64
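The double `value_counts()` pattern above can be cross-checked with `Series.duplicated`. A minimal sketch on a small made-up frame (the ids and values in `toy` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: id 'b' appears twice, 'a' and 'c' once each
toy = pd.DataFrame({'id': ['a', 'b', 'b', 'c'],
                    'limit_bal': [100, 200, 0, 300]})

# First value_counts: how many times each id occurs
toy_id_counts = toy['id'].value_counts()

# Second value_counts: how many ids occur once, twice, ...
occurrence_summary = toy_id_counts.value_counts()

# keep=False flags every row whose id occurs more than once
rows_with_dup_id = toy['id'].duplicated(keep=False)
n_dup_rows = int(rows_with_dup_id.sum())

print(occurrence_summary)
print(n_dup_rows)
```

In the summary, the index holds the number of occurrences and the values hold how many ids have that number of occurrences, which is how the output above should be read.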

### Boolean Masks

In [10]:
dup_mask = id_counts == 2
dup_mask[0:5]

956cbf4a-d24e    True
31d0095f-75bc    True
0a2d6db4-0c54    True
cd944abe-cef9    True
64fe3b96-7ed0    True
Name: id, dtype: bool

Extracting the duplicated ids:

In [11]:
id_counts.index[0:5]

Index(['956cbf4a-d24e', '31d0095f-75bc', '0a2d6db4-0c54', 'cd944abe-cef9',
       '64fe3b96-7ed0'],
      dtype='object')

In [12]:
dup_ids = id_counts.index[dup_mask]
dup_ids = list(dup_ids)
len(dup_ids)

313

In [13]:
dup_ids[0:5]

['956cbf4a-d24e',
 '31d0095f-75bc',
 '0a2d6db4-0c54',
 'cd944abe-cef9',
 '64fe3b96-7ed0']
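The masking steps above can be sketched on a tiny hand-made Series. The names `toy_counts`, `toy_mask` and `toy_dup_ids` are illustrative only, and the sketch filters the Series first and then takes its index, which should select the same labels as indexing the index with the mask:

```python
import pandas as pd

# Hypothetical occurrence counts, indexed by id
toy_counts = pd.Series([2, 2, 1, 1, 1], index=['a', 'b', 'c', 'd', 'e'])

# Comparing a Series to a scalar yields a boolean Series (the mask)
toy_mask = toy_counts == 2

# Keep only the entries where the mask is True, then read off their labels
toy_dup_ids = toy_counts[toy_mask].index.tolist()

print(toy_dup_ids)
```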

Displaying df records that have some of these ids:

In [14]:
df.loc[df.id.isin(dup_ids[0:3]), :].head(10)

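| | id | limit_bal | sex | education | marriage | age | pay_1 | pay_2 | pay_3 | pay_4 | ... | bill_amt4 | bill_amt5 | bill_amt6 | pay_amt1 | pay_amt2 | pay_amt3 | pay_amt4 | pay_amt5 | pay_amt6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10102 | 0a2d6db4-0c54 | 30000 | 2 | 2 | 1 | 42 | 2 | 2 | 2 | 0 | ... | 6795 | 7342 | 7075 | 2521 | 0 | 245 | 658 | 0 | 268 | 1 |
| 10202 | 0a2d6db4-0c54 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17226 | 31d0095f-75bc | 130000 | 2 | 2 | 2 | 24 | 0 | 0 | 0 | 0 | ... | 109428 | 90359 | 89216 | 4155 | 5802 | 3487 | 2908 | 3013 | 3093 | 0 |
| 17326 | 31d0095f-75bc | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21588 | 956cbf4a-d24e | 500000 | 2 | 1 | 2 | 37 | -2 | -2 | -2 | -2 | ... | 813 | 0 | 4443 | 2579 | 3824 | 813 | 0 | 4443 | 330 | 0 |
| 21688 | 956cbf4a-d24e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |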
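The row selection above combines `Series.isin` with `.loc`. A minimal sketch of the same pattern on made-up data (the frame `toy` and the list `wanted_ids` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: id 'b' has one real row and one all-zero row
toy = pd.DataFrame({'id': ['a', 'b', 'b', 'c'],
                    'limit_bal': [100, 200, 0, 300]})

wanted_ids = ['b']

# isin builds a boolean mask; .loc keeps the rows where it is True
matches = toy.loc[toy['id'].isin(wanted_ids), :]

print(matches)
```

The original row labels are preserved in the result, which is why the output above shows index values such as 10102 and 10202 rather than a fresh 0-based index.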


Selecting df records in which every column except the id is zero:

In [15]:
df_zeroes = df == 0
feature_zeroes_mask = df_zeroes.iloc[:, 1:].all(axis=1)
sum(feature_zeroes_mask)

315

Dropping the all-zero df records:

In [16]:
df_clean = df.loc[~feature_zeroes_mask, :].copy()
df_clean.shape

(29685, 25)
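The two cells above (build an all-zero mask, then invert it to drop rows) can be sketched end to end on a small made-up frame. The data and the names `toy`, `zero_rows` and `toy_clean` are hypothetical, and `drop(columns='id')` is used here in place of `iloc[:, 1:]` on the assumption that `id` is the only non-feature column:

```python
import pandas as pd

# Hypothetical toy data: the second 'b' row has zeros in every feature
toy = pd.DataFrame({
    'id': ['a', 'b', 'b', 'c'],
    'limit_bal': [100, 200, 0, 300],
    'age': [30, 41, 0, 25],
})

# True for rows where every column except 'id' equals zero
zero_rows = (toy.drop(columns='id') == 0).all(axis=1)

# ~ inverts the mask; .copy() gives an independent frame rather than a view
toy_clean = toy.loc[~zero_rows, :].copy()

print(int(zero_rows.sum()))
print(toy_clean.shape)
print(toy_clean['id'].is_unique)
```

`Series.is_unique` is a compact alternative to comparing `nunique()` with the number of rows.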

Checking the uniqueness of the ids:

In [17]:
df_clean.id.nunique() == df_clean.shape[0]

True