This notebook aims to simulate a data scientist consuming data from a data pipeline, using PySpark. 

In this notebook, I will be executing the following steps:
  - Conduct EDA on data produced by the Medallion Data Pipeline 
  - Handle missing values 

Thereafter, feature engineering, feature selection and multicollinearity checks will be handled individually in their respective notebooks. I have done this deliberately, since feature engineering should be targeted separately at PD, LGD and EAD. Producing a single master table for all 3 models would make model building tedious, computationally expensive and hard to interpret. 

# 1. Import Libraries

In [20]:
# == Import Spark Functions == 
from init_spark import start_spark
spark = start_spark()

from pyspark.sql.functions import (
    col, when, count, desc, isnan, isnull, lit, length, trim, lower, upper, to_date, concat_ws,  regexp_extract, sum 
)

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, DateType, NumericType
)

4.0.0


25/08/07 23:53:35 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [21]:
# == Helper Functions ==

def drop_constant_columns(df):
    """
    Removes all columns in the DataFrame that have only one distinct value.
    Returns a new DataFrame with those columns removed.
    """
    cols_to_drop = []

    for column_name in df.columns:
        if df.select(col(column_name)).distinct().count() <= 1:
            cols_to_drop.append(column_name)

    print(f"⚠️ Dropping constant columns: {cols_to_drop}")
    return df.drop(*cols_to_drop)



# 2. EDA: Summary Statistics & Identify Data Issues

In this section, I will be acting as a data scientist (credit risk modeling) pulling data from the Gold Delta Layer of the Medallion Structure. I will mainly be observing summary statistics, spotting and solving issues (e.g. missing values), and understanding the distribution of features.


In [22]:
df = spark.read.format("delta")\
    .load("../data/gold/medallion_cleaned_lc_data")
    
df.limit(10).toPandas()


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,credit_history_years
0,109730906,,1200.0,1200.0,1200.0,36,18.06,43.42,D,D2,...,,Cash,N,,,,,,,9
1,111701885,,15000.0,15000.0,15000.0,36,5.32,451.73,A,A1,...,,Cash,N,,,,,,,20
2,110090185,,35000.0,35000.0,35000.0,36,16.02,1230.85,C,C5,...,,Cash,N,,,,,,,15
3,110744306,,10875.0,10875.0,10875.0,36,12.62,364.44,C,C1,...,,Cash,N,,,,,,,7
4,109516487,,9500.0,9500.0,9500.0,36,13.59,322.8,C,C2,...,,Cash,N,,,,,,,34
5,110713514,,25000.0,25000.0,25000.0,60,10.42,536.36,B,B3,...,,Cash,N,,,,,,,16
6,110686233,,40000.0,40000.0,40000.0,36,10.42,1298.59,B,B3,...,,Cash,N,,,,,,,7
7,110212195,,3000.0,3000.0,3000.0,36,7.21,92.92,A,A3,...,,Cash,N,,,,,,,15
8,111091132,,2800.0,2800.0,2800.0,36,19.03,102.68,D,D3,...,,Cash,N,,,,,,,15
9,109922306,,3475.0,3475.0,3475.0,36,17.09,124.05,D,D1,...,,Cash,N,,,,,,,15


In [23]:
df.summary().toPandas()

25/08/07 23:54:02 WARN DAGScheduler: Broadcasting large task binary with size 2.0 MiB
                                                                                

Unnamed: 0,summary,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,settlement_status,settlement_amount,settlement_percentage,settlement_term,credit_history_years
0,count,1339155.0,0.0,1339155.0,1339155.0,1339155.0,1339155.0,1339155.0,1339155.0,1339155,...,3732.0,5719.0,5719.0,1339155,1339155,33140,33140.0,33140.0,33140.0,1339155.0
1,mean,56328509.91023817,,14421.497268800102,14413.091557736036,14390.870506804406,41.78888627530047,13.229712684492386,438.06957889115455,,...,410.3293649517684,10994.577940199326,184.80631054380137,,,,5026.490442365717,47.69125437537728,13.158147254073628,15.80248066877994
2,stddev,38366080.73598679,,8713.595538056383,8709.666221854559,8711.945681379939,10.26752867158084,4.766807743202633,261.36724975360954,,...,357.6684379796852,7474.319121444205,196.6367945132472,,,,3681.726883650031,7.305866633564949,8.234771724233399,7.51340678056518
3,min,54734.0,,500.0,500.0,0.0,36.0,5.31,4.93,A,...,1.92,55.73,0.01,Cash,N,ACTIVE,44.21,0.2,0.0,3.0
4,25%,19966939.0,,8000.0,8000.0,7900.0,36.0,9.75,248.48,,...,147.42,5041.17,39.78,,,,2226.0,45.0,6.0,11.0
5,50%,57732661.0,,12000.0,12000.0,12000.0,36.0,12.74,375.43,,...,303.15,9284.28,121.17,,,,4173.0,45.0,14.0,14.0
6,75%,84546087.0,,20000.0,20000.0,20000.0,36.0,15.99,580.61,,...,563.64,15302.56,267.41,,,,6875.0,50.0,18.0,20.0
7,max,145636374.0,,40000.0,40000.0,40000.0,60.0,30.99,1719.83,G,...,2343.15,39542.45,1407.86,DirectPay,Y,COMPLETE,33601.0,521.35,181.0,83.0


In [24]:
# == Inspect distribution of default vs non-default (to check for class imbalance) == 
df.groupBy(col('default_status')).count().show()

+--------------+-------+
|default_status|  count|
+--------------+-------+
|             1| 267056|
|             0|1072099|
+--------------+-------+
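
The counts above imply that roughly one loan in five defaulted. A minimal sketch of that arithmetic, with the two counts copied by hand from the output above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the groupBy output above.
default_count = 267_056        # default_status == 1
non_default_count = 1_072_099  # default_status == 0

total = default_count + non_default_count
default_rate = default_count / total
imbalance_ratio = non_default_count / default_count

print(f"Total loans: {total}")
print(f"Default rate: {default_rate:.2%}")
print(f"Non-default to default ratio: {imbalance_ratio:.2f} : 1")
```

A ratio of about 4 : 1 is a moderate imbalance, which is worth keeping in mind when choosing evaluation metrics for the PD model.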



Based on the summary statistics of the Lending Club dataset, several data quality issues become apparent:
- Some columns are unusable because they have a **large percentage of missing values** (see `../sandbox/string_issues` for reference).

- A number of features exhibit **missing values**, e.g. `emp_length`. These require an imputation strategy.

- Some categorical columns are unsuitable for our credit risk modeling project due to **high cardinality** (a large number of unique values), e.g. `emp_title`.

- There are also **redundant columns**, like `member_id`, which provide no value to our prediction of LGD, EAD and PD.

- There is also **post-loan information**, i.e. features whose values are generated after loan origination (application and approval). Such features leak information that is unavailable at application time and would skew a model like PD, so they should be dropped for it. Examples include `total_pymnt` and `last_pymnt_d`. These shall be removed in the individual modelling notebooks rather than here, since models like LGD and EAD will require them.

- There are features like `delinq_2yrs` which have **outliers** (maximum value far above the 75th percentile).
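
One common way to make "far above the 75th percentile" concrete is the 1.5 × IQR rule of thumb. The sketch below is only an illustration in plain Python: the sample values are hypothetical (not taken from `delinq_2yrs`), and `iqr_upper_fence` is a helper name introduced here, not something used elsewhere in this project.

```python
from statistics import quantiles


def iqr_upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, a rule-of-thumb cutoff for high outliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Hypothetical delinquency counts, NOT taken from the dataset.
sample = [0, 0, 0, 0, 0, 0, 1, 1, 2, 30]

fence = iqr_upper_fence(sample)
outliers = [v for v in sample if v > fence]

print(f"Upper fence: {fence}")
print(f"Values flagged as outliers: {outliers}")
```

For heavily skewed count features the fence can sit very low, so a flagged value is a prompt for inspection rather than proof of a data error.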


# 3. Handling Missing Values

Based on the issues identified above in the Lending Club dataset, I will now tackle each of them in order. 

### 3.1 Find Null Value % Per Column 
For this credit risk modeling project, I will be dropping columns with >= 50% missingness. Many credit risk modelling projects on Kaggle and GitHub also use 50%-65% missingness as the threshold to drop columns. 

However, let's first display the columns which have >= 50% missing values to inspect them. 

In [25]:
# == Get total number of rows == 
total_rows = df.count()

# == Calculate % of nulls per column and keep only those ≥ 50% == 
missing_val_threshold = 50
high_missingness_columns = []

for column in df.columns:
    null_count = df.select(sum(col(column).isNull().cast("int"))).collect()[0][0]
    null_pct = (null_count / total_rows) * 100
    if null_pct >= missing_val_threshold:
        print(f"{column}: {null_pct:.2f}% null")
        high_missingness_columns.append(column)

# == Drop columns with >= 50% missing values (Low predictive power upon inspection) == 
df = df.drop(*high_missingness_columns) 
print("\n✅ Columns with high pct of missing values dropped ... \n")

# == Inspect Dimensions == 
num_rows = df.count()
num_cols = len(df.columns)
print(f"Updated Shape: ({num_rows}, {num_cols})")

member_id: 100.00% null
mths_since_last_delinq: 50.43% null
mths_since_last_record: 82.96% null
next_pymnt_d: 100.00% null
mths_since_last_major_derog: 73.68% null
annual_inc_joint: 98.11% null
dti_joint: 98.11% null
verification_status_joint: 98.12% null
open_acc_6m: 60.04% null
open_act_il: 60.04% null
open_il_12m: 60.04% null
open_il_24m: 60.04% null
mths_since_rcnt_il: 61.09% null
total_bal_il: 60.04% null
il_util: 65.43% null
open_rv_12m: 60.04% null
open_rv_24m: 60.04% null
max_bal_bc: 60.04% null
all_util: 60.04% null
inq_fi: 60.04% null
total_cu_tl: 60.04% null
inq_last_12m: 60.04% null
mths_since_recent_bc_dlq: 76.28% null
mths_since_recent_revol_delinq: 66.54% null
revol_bal_joint: 98.64% null
sec_app_fico_range_low: 98.64% null
sec_app_fico_range_high: 98.64% null
sec_app_earliest_cr_line: 98.64% null
sec_app_inq_last_6mths: 98.64% null
sec_app_mort_acc: 98.64% null
sec_app_open_acc: 98.64% null
sec_app_revol_util: 98.66% null
sec_app_open_act_il: 98.64% null
sec_app_num_rev
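
To get a feel for how sensitive the drop list is to the cutoff (50% versus the 65% upper end mentioned earlier), here is a small plain-Python sketch. The percentages are a handful copied from outputs in this notebook (`emp_length` is reported further down); `columns_to_drop` is an illustrative helper, not part of the pipeline.

```python
# Null percentages copied by hand from this notebook's outputs (subset only).
null_pct = {
    "member_id": 100.00,
    "mths_since_last_delinq": 50.43,
    "mths_since_last_record": 82.96,
    "open_acc_6m": 60.04,
    "il_util": 65.43,
    "emp_length": 5.82,
}


def columns_to_drop(pct_by_column, threshold):
    """Return the columns whose null percentage is at or above the threshold."""
    return [name for name, pct in pct_by_column.items() if pct >= threshold]


drop_at_50 = columns_to_drop(null_pct, 50)
drop_at_65 = columns_to_drop(null_pct, 65)

print(f"Dropped at 50%: {drop_at_50}")
print(f"Dropped at 65%: {drop_at_65}")
```

With the stricter 50% cutoff, borderline columns such as `mths_since_last_delinq` are removed; a 65% cutoff would keep them and leave more values to impute.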

### 3.2 Dropping Irrelevant/Redundant Columns 
This section removes **meaningless columns, high-cardinality categorical features, features with little predictive value (e.g. `member_id`, `emp_title`) and post-loan features**. Including such features may introduce multicollinearity and ultimately reduce the predictive power of our credit models. 

The reasons why I removed certain columns are as follows: 
- Columns with `inv`: largely the same as their counterparts, e.g. `total_pymnt_inv` is largely the same as `total_pymnt`.

- `last_pymnt_d` and `last_credit_pull_d` (according to the Data Dictionary) have little predictive value even after feature engineering. They merely show the borrower's last payment date and the last date on which the credit report was pulled. This has little value in predicting PD, LGD or EAD. 

- `sub_grade` is more granular than `grade`. This may lead to a risk of overfitting our PD, LGD, and EAD models. 

- High Cardinality Columns may lead to high computational costs when encoding for machine learning models, which makes them undesirable in a big data space such as credit risk. 

- Hardship & Settlement Features: for Lending Club, borrowers only become eligible for hardship and settlement programmes after loan origination, not when they apply for the loan. Borrowers in financial hardship contact the lender and attempt to settle for interest-free payments or lower principal payments. Such features should not be used to predict PD, LGD, and EAD, since my models should not know in advance whether a borrower will fall into hardship. 

- `disbursement_method` indicates how loan funds are delivered to the borrower. This has little relevance in predicting PD, LGD or EAD. 

- 🚩 Low Variance Features may slow down PCA (which aims to reduce dimensionality). They also add little value to the prediction of PD, EAD and LGD. (Dealt with after standardisation)


In [26]:
# == Drop Derived/Meaningless Features ==
derived_features = ["funded_amnt_inv", "sub_grade", "out_prncp_inv", "total_pymnt_inv", "last_pymnt_d", "last_credit_pull_d"] 
df = df.drop(*derived_features)
print(f"✅ Derived/Meaningless Features Dropped ...")

✅ Derived/Meaningless Features Dropped ...


In [27]:
# == Drop High Cardinality Features == 

# == 1. Define Threshold == 
high_cardinality_threshold = 50

# == 2. Find Categorical Features (to identify high cardinality columns) == 
categorical_cols = [field.name for field in df.schema.fields if isinstance(field.dataType, StringType)]
print(categorical_cols)

# == 3. Identify high-cardinality columns == 
high_card_cols = []

for col_name in categorical_cols:
    unique_count = df.select(col_name).distinct().count()

    if unique_count >= high_cardinality_threshold:
        print(f"\n{col_name} has {unique_count} unique values → dropping ... ")
        high_card_cols.append(col_name)

# == 4. Drop high cardinality columns == 
df = df.drop(*high_card_cols)

print(f"\n✅ High Cardinality Features Dropped ...")

['grade', 'emp_title', 'home_ownership', 'verification_status', 'pymnt_plan', 'addr_state', 'initial_list_status', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag']

emp_title has 359391 unique values → dropping ... 

addr_state has 51 unique values → dropping ... 

✅ High Cardinality Features Dropped ...


In [28]:
# == Drop columns with only 1 distinct value (no variation, so useless for the modeling phase) == 
df = drop_constant_columns(df)
print("✅ Columns with only 1 distinct value dropped...")

⚠️ Dropping constant columns: ['pymnt_plan', 'policy_code', 'hardship_flag']
✅ Columns with only 1 distinct value dropped...


In [29]:
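# == Drop hardship related columns & miscellaneous columns (rare & irrelevant for modeling) == 
# Note: "hardship_flag" and "policy_code" were already removed as constant columns above;
# DataFrame.drop() silently ignores column names that are not in the schema.
hardship_columns = ["hardship_flag", "disbursement_method", "debt_settlement_flag", "policy_code"]

df = df.drop(*hardship_columns)
print("✅ Hardship & miscellaneous columns dropped ...")

✅ Hardship & miscellaneous columns dropped ...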


🚩 I will need to remove post-loan origination features later on for PD prediction. Post-loan origination features such as `recoveries` are needed for LGD and EAD prediction, but not for PD prediction. 
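
As a rough sketch of how this per-model filtering could look in the later notebooks, the helper below keeps or removes post-loan columns depending on the target model. `POST_LOAN_FEATURES` and `features_for_model` are hypothetical names introduced for illustration, and the set only contains the two post-loan features named in this notebook; a real list would be longer.

```python
# Post-origination features named in this notebook (illustrative, incomplete).
POST_LOAN_FEATURES = {"total_pymnt", "recoveries"}


def features_for_model(columns, model):
    """Return the columns a given model may use.

    PD is predicted at application time, so post-loan features are excluded.
    LGD and EAD are modelled on defaulted loans and may keep them.
    """
    if model not in {"PD", "LGD", "EAD"}:
        raise ValueError(f"Unknown model: {model}")
    if model == "PD":
        return [c for c in columns if c not in POST_LOAN_FEATURES]
    return list(columns)


columns = ["loan_amnt", "int_rate", "total_pymnt", "recoveries"]
pd_columns = features_for_model(columns, "PD")
lgd_columns = features_for_model(columns, "LGD")

print(f"PD features:  {pd_columns}")
print(f"LGD features: {lgd_columns}")
```

The resulting list could then be passed to `df.select(*pd_columns)` in the PD notebook.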

### 3.3 Impute Missing Values (Categorical & Numerical)
After removing unnecessary columns with little predictive power, we will proceed to impute missing values. We will first identify the % of missing values per column. 

For numerical columns, missing values shall be replaced with the median, which is robust to the outliers we haven't dealt with yet. For categorical columns, missing values shall be replaced with the mode category. This approach is common and simple; more advanced imputation techniques exist (e.g. clustering-based imputation), but we shall not lose focus on learning the credit risk modeling domain in this project. 
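
To illustrate why the median is preferred over the mean while outliers are still present, here is a minimal plain-Python sketch. The values are hypothetical (not taken from the dataset) and `impute` is a helper name introduced only for this example; the notebook itself performs the imputation with PySpark.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with the median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical values, NOT taken from the dataset.
annual_inc = [40_000, 52_000, None, 61_000, 2_000_000]  # one extreme outlier
home_ownership = ["RENT", "MORTGAGE", None, "RENT", "OWN"]

imputed_inc = impute(annual_inc, "median")
imputed_home = impute(home_ownership, "mode")
mean_fill = mean(v for v in annual_inc if v is not None)

print(f"Median-imputed income: {imputed_inc}")
print(f"Mode-imputed ownership: {imputed_home}")
print(f"Mean of observed incomes (for contrast): {mean_fill}")
```

The single extreme income pulls the mean far above every typical value, while the median stays close to the bulk of the observations.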

In [30]:
total_rows = df.count()

for column in df.columns: 
    null_count = df.filter(col(column).isNull()).count()
    if null_count > 0:
        print(f"{column}: {null_count} null values, {round(null_count/total_rows * 100,2)}% missing values.")

emp_length: 77891 null values, 5.82% missing values.
inq_last_6mths: 1 null values, 0.0% missing values.
collections_12_mths_ex_med: 54 null values, 0.0% missing values.
tot_coll_amt: 67219 null values, 5.02% missing values.
tot_cur_bal: 67219 null values, 5.02% missing values.
total_rev_hi_lim: 67219 null values, 5.02% missing values.
acc_open_past_24mths: 46992 null values, 3.51% missing values.
avg_cur_bal: 67219 null values, 5.02% missing values.
bc_open_to_buy: 59990 null values, 4.48% missing values.
bc_util: 60631 null values, 4.53% missing values.
chargeoff_within_12_mths: 54 null values, 0.0% missing values.
mo_sin_old_il_acct: 105039 null values, 7.84% missing values.
mo_sin_old_rev_tl_op: 67219 null values, 5.02% missing values.
mo_sin_rcnt_rev_tl_op: 67219 null values, 5.02% missing values.
mo_sin_rcnt_tl: 67219 null values, 5.02% missing values.
mort_acc: 46992 null values, 3.51% missing values.
mths_since_recent_bc: 59142 null values, 4.42% missing values.
mths_since_rece

In [31]:
# == Loop over each column == 
for feature in df.schema.fields:
    col_name = feature.name
    dtype = feature.dataType

    # == Impute Categorical Columns with Mode == 
    if isinstance(dtype, StringType):
        # Exclude nulls so that the mode is always a real category, never null
        mode_row = (
            df.filter(col(col_name).isNotNull())
            .groupBy(col(col_name))
            .count()
            .orderBy(col("count").desc())
            .first()
        )

        # mode_row is None only if the column is entirely null
        if mode_row is not None:
            df = df.fillna({col_name: mode_row[0]})

    # == Impute Numerical Columns with Median == 
    elif isinstance(dtype, NumericType):
        if df.filter(col(col_name).isNull()).count() > 0:
            # approxQuantile ignores nulls; 0.01 is the allowed relative error
            median_val = df.approxQuantile(col_name, [0.5], 0.01)[0]
            df = df.fillna({col_name: median_val})

print('✅ Categorical Column Missing Values Filled!')
print('✅ Numerical Column Missing Values Filled!')

✅ Categorical Column Missing Values Filled!
✅ Numerical Column Missing Values Filled!


In [32]:
# == Double check if there are any missing values before subsequent steps == 
total_rows = df.count()

output_arr = []
for column in df.columns: 
    null_count = df.filter(col(column).isNull()).count()
    if null_count > 0:
        output_arr.append(f"{column}: {null_count} null values, {round(null_count/total_rows * 100,2)}% missing values.")

if len(output_arr) == 0: 
    print('✅ No Missing Values Found!')
else:
    print(output_arr)

✅ No Missing Values Found!


In [33]:
df.limit(10).toPandas()

Unnamed: 0,id,loan_amnt,funded_amnt,term,int_rate,installment,grade,emp_length,home_ownership,annual_inc,...,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,credit_history_years
0,109730906,1200.0,1200.0,36,18.06,43.42,D,4,RENT,36000.0,...,1.0,37.5,0.0,0.0,0.0,20814.0,23191.0,300.0,17014.0,9
1,111701885,15000.0,15000.0,36,5.32,451.73,A,4,MORTGAGE,87000.0,...,0.0,100.0,0.0,0.0,0.0,184380.0,24876.0,5000.0,54630.0,20
2,110090185,35000.0,35000.0,36,16.02,1230.85,C,4,MORTGAGE,162000.0,...,2.0,100.0,42.9,0.0,0.0,662307.0,40185.0,0.0,61874.0,15
3,110744306,10875.0,10875.0,36,12.62,364.44,C,6,RENT,56556.0,...,1.0,100.0,0.0,0.0,0.0,40600.0,31962.0,6000.0,31400.0,7
4,109516487,9500.0,9500.0,36,13.59,322.8,C,6,RENT,40000.0,...,0.0,100.0,100.0,1.0,0.0,27892.0,22350.0,7000.0,20292.0,34
5,110713514,25000.0,25000.0,60,10.42,536.36,B,0,MORTGAGE,72000.0,...,2.0,100.0,0.0,0.0,0.0,221459.0,363.0,55000.0,12434.0,16
6,110686233,40000.0,40000.0,36,10.42,1298.59,B,3,OWN,120000.0,...,2.0,100.0,0.0,0.0,0.0,91421.0,9100.0,66300.0,21000.0,7
7,110212195,3000.0,3000.0,36,7.21,92.92,A,3,MORTGAGE,136000.0,...,1.0,77.3,0.0,0.0,0.0,240862.0,36546.0,12500.0,57166.0,15
8,111091132,2800.0,2800.0,36,19.03,102.68,D,10,RENT,35000.0,...,1.0,87.0,0.0,0.0,0.0,39582.0,32883.0,3300.0,35782.0,15
9,109922306,3475.0,3475.0,36,17.09,124.05,D,2,MORTGAGE,26880.0,...,1.0,100.0,100.0,0.0,0.0,162177.0,4582.0,500.0,5000.0,15


In [34]:
# == Save the DataFrame produced by Chapter 3 (Handling Missing Values) == 
df.write.format("delta")\
    .mode("overwrite")\
    .save("../data/gold/medallion_cleaned_lc_data_b4_model")

                                                                                

In [35]:
# Check if Gold Delta is accessible for subsequent model building 
gold_table = spark.read.format("delta")\
    .load("../data/gold/medallion_cleaned_lc_data_b4_model")
    
gold_table.limit(10).toPandas()

Unnamed: 0,id,loan_amnt,funded_amnt,term,int_rate,installment,grade,emp_length,home_ownership,annual_inc,...,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,credit_history_years
0,61403622,7200.0,7200.0,36,17.57,258.75,D,6,RENT,38000.0,...,4.0,92.6,42.9,0.0,0.0,34764.0,26286.0,0.0,30564.0,17
1,61872125,10100.0,10100.0,60,12.29,226.16,C,5,OWN,44000.0,...,2.0,84.8,0.0,0.0,0.0,62475.0,53306.0,9600.0,47375.0,11
2,61552063,12000.0,12000.0,60,14.65,283.28,C,1,MORTGAGE,60000.0,...,1.0,92.1,0.0,1.0,0.0,41987.0,28670.0,4500.0,29287.0,26
3,62346225,10800.0,10800.0,60,24.99,316.94,F,6,MORTGAGE,68000.0,...,6.0,100.0,0.0,0.0,0.0,184487.0,66021.0,12600.0,74181.0,7
4,62844280,29775.0,29775.0,60,26.77,905.11,G,2,RENT,60000.0,...,3.0,100.0,42.9,0.0,0.0,70664.0,52704.0,0.0,67064.0,16
5,31547781,12000.0,12000.0,60,12.29,268.7,C,6,MORTGAGE,60000.0,...,0.0,93.3,0.0,0.0,0.0,306431.0,67190.0,5500.0,58805.0,22
6,61323404,12000.0,12000.0,36,9.17,382.55,B,6,RENT,72613.32,...,1.0,95.5,50.0,2.0,0.0,23537.0,3101.0,2300.0,20811.0,47
7,61441285,4000.0,4000.0,36,9.99,129.05,B,6,RENT,40000.0,...,1.0,69.0,100.0,1.0,0.0,49442.0,38440.0,1800.0,47642.0,14
8,62277179,18350.0,18350.0,36,18.55,668.48,E,6,MORTGAGE,54000.0,...,0.0,100.0,100.0,0.0,0.0,108559.0,12997.0,13500.0,0.0,11
9,60537374,7000.0,7000.0,36,7.89,219.0,A,0,RENT,73000.0,...,2.0,96.4,0.0,0.0,0.0,44723.0,30110.0,6300.0,37623.0,15


In [36]:
print("✅ All missing values filled and saved to Gold Delta Table!")
print("✅ Ready for PD, LGD, and EAD Modeling!")

✅ All missing values filled and saved to Gold Delta Table!
✅ Ready for PD, LGD, and EAD Modeling!

