# Feature Engineering with Modelling
## Author: Dulan Wijeratne 1181873

In this notebook we will make new features using modelling techniques.

First we will start by creating a Spark session and reading in the joined aggregated data.

In [1]:
from pyspark.sql import SparkSession, functions as f

In [2]:
spark = (
    SparkSession.builder.appName("feature_engineering")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)



In [3]:
joined = spark.read.parquet("../../../data/curated/removed_outliers.parquet")

                                                                                

In [4]:
joined.orderBy(f.col("consumer_diff_over_period").asc()).show()


+------------+--------------------+-------------+------------------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+--------------------+---------------------------------+--------------------+------------------+
|merchant_abn|                name|revenue_level|         take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|number_of_orders|average_cost_of_order|average_spend_per_consumer|a

In [5]:
joined.filter(joined.merchant_abn == 71118957552).show()

+------------+--------------------+-------------+-----------------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+-----------------+------------------+--------------------+---------------------------------+------------------+------------------+
|merchant_abn|                name|revenue_level|        take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|number_of_orders|average_cost_of_order|average_spend_per_consumer|averag

In [6]:
joined.orderBy(f.col("average_growth_consumers").desc()).show()

+------------+--------------------+-------------+------------------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+------------------+---------------------------------+--------------------+------------------+
|merchant_abn|                name|revenue_level|         take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|number_of_orders|average_cost_of_order|average_spend_per_consumer|ave

Changing NULLs to 0s

As we are going to be using modelling techniques, we need to replace the NULLs with a value the models can interpret. Here we use 0.

In [7]:
joined = joined.fillna(0)

Next we want to convert the categorical features into integer values so that we can check their correlation with the target variable.

In the dataset there are 2 categorical features:
- Revenue Level
- Segment
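To make the indexing step concrete, here is a minimal plain-Python sketch of how StringIndexer's default ordering (most frequent label first, ties broken alphabetically) turns labels into numbers. The helper `frequency_index` and the sample labels are hypothetical, used only for illustration; it is not Spark's implementation.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to an index: most frequent label gets 0.0, ties broken alphabetically."""
    counts = Counter(labels)
    # Sort by descending count first, then alphabetically for equal counts
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


# Hypothetical labels, not taken from the dataset
levels = ["a", "b", "a", "c", "a", "b"]
mapping = frequency_index(levels)
indexed = [mapping[value] for value in levels]
print(mapping)
print(indexed)
```

Because the index reflects how often a label occurs rather than any natural ordering, correlations involving the indexed columns should be read with some caution.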

In [8]:
from pyspark.ml.feature import StringIndexer

In [9]:
input_cols = ["revenue_level","segment"]
output_cols = ["revenue_level_indexed","segment_indexed"]

In [10]:
revenue_level_indexer = StringIndexer(inputCol = "revenue_level", outputCol= "revenue_level_indexed")
segment_indexer = StringIndexer(inputCol = "segment", outputCol = "segment_indexed")

In [11]:
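# Index segment first, then revenue_level, and drop the columns not used in the correlation analysis
segment_indexed_df = segment_indexer.fit(joined).transform(joined)
pre_correlation_df = revenue_level_indexer.fit(joined).transform(segment_indexed_df)
pre_correlation_df = pre_correlation_df.drop("revenue_level", "segment","name","first_recorded_transaction","last_recorded_transaction")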

                                                                                

In [12]:
correlation_df = pre_correlation_df.toPandas()

Now we will check the correlation matrix

In [13]:
import pandas as pd

In [14]:
correlation_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,number_of_orders,average_cost_of_order,average_spend_per_consumer,...,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,bnpl_maximum_gain,segment_indexed,revenue_level_indexed
0,87566366459,2.72,0.0,62,0.0,0,1.0,62,232.798871,232.798871,...,60,766.669355,1928.637097,42.362903,2.480323,0.022736,1.033333,392.59202,3.0,2.0
1,90173050473,2.49,0.0,13026,0.058007,5700,1.616076,21051,238.164377,384.891625,...,2619,789.088167,1971.626241,43.118165,2.459049,0.992421,8.037801,124838.598397,2.0,2.0
2,80893432676,5.95,0.0,776,0.012474,18,1.023196,794,36.97932,37.837088,...,683,798.984887,2004.309194,43.103275,2.454106,0.25881,1.162518,1747.013954,0.0,0.0
3,70309831462,3.64,0.0,1610,0.033903,66,1.042236,1678,109.33885,113.956888,...,1212,799.475864,2005.841478,43.130215,2.458665,0.459265,1.384488,6678.329668,0.0,1.0
4,15043504837,4.62,0.994698,163,18.443855,0,1.0,163,15856.225399,15856.225399,...,159,815.592025,2010.564417,42.337423,2.485399,0.06025,1.025157,119406.88803,4.0,1.0


In [15]:
corr_matrix = correlation_df.corr()
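By default, the pandas corr method computes the Pearson correlation coefficient for every pair of columns. The sketch below spells out that formula for a single pair in plain Python. The function name `pearson` and the numbers are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values standing in for two numeric columns
consumers = [10, 20, 30, 40]
postcodes = [12, 18, 33, 41]

r = pearson(consumers, postcodes)
# An exact linear relationship gives a correlation of 1
perfect = pearson(consumers, [2 * c + 1 for c in consumers])
print(round(r, 3), round(perfect, 3))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.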

### Feature Engineering

Predicting number of consumers in 3 years

In [16]:
corr_matrix.loc["number_of_unique_consumers"]

merchant_abn                                0.004690
take_rate                                   0.039364
average_merchant_fraud_probability         -0.005603
number_of_unique_consumers                  1.000000
average_consumer_fraud_probability         -0.144273
number_of_repeat_consumers                  0.860697
average_repeat_transactions_per_consumer    0.613602
number_of_orders                            0.714180
average_cost_of_order                      -0.180890
average_spend_per_consumer                 -0.171051
average_monthly_diff_consumers              0.773080
consumer_diff_over_period                   0.773120
average_growth_consumers                    0.442527
merchant_revenue_rounded                    0.647470
transcation_period_months                   0.243886
number_of_postcodes                         0.845176
avg_total_weekly_personal_income            0.008346
avg_total_weekly_fam_income                 0.010363
avg_median_age                             -0.

Next we separate the features from the target variable.

In [17]:
modelling_df = correlation_df.copy()

In [18]:
modelling_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,number_of_orders,average_cost_of_order,average_spend_per_consumer,...,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,bnpl_maximum_gain,segment_indexed,revenue_level_indexed
0,87566366459,2.72,0.0,62,0.0,0,1.0,62,232.798871,232.798871,...,60,766.669355,1928.637097,42.362903,2.480323,0.022736,1.033333,392.59202,3.0,2.0
1,90173050473,2.49,0.0,13026,0.058007,5700,1.616076,21051,238.164377,384.891625,...,2619,789.088167,1971.626241,43.118165,2.459049,0.992421,8.037801,124838.598397,2.0,2.0
2,80893432676,5.95,0.0,776,0.012474,18,1.023196,794,36.97932,37.837088,...,683,798.984887,2004.309194,43.103275,2.454106,0.25881,1.162518,1747.013954,0.0,0.0
3,70309831462,3.64,0.0,1610,0.033903,66,1.042236,1678,109.33885,113.956888,...,1212,799.475864,2005.841478,43.130215,2.458665,0.459265,1.384488,6678.329668,0.0,1.0
4,15043504837,4.62,0.994698,163,18.443855,0,1.0,163,15856.225399,15856.225399,...,159,815.592025,2010.564417,42.337423,2.485399,0.06025,1.025157,119406.88803,4.0,1.0


In [19]:
target_variable = "number_of_unique_consumers"

In [20]:
features_unique_customers = modelling_df.drop(columns = ["merchant_abn",target_variable])
number_of_unique_customer = modelling_df[target_variable]

Feature Selection

In [21]:
from sklearn.feature_selection import f_regression, SelectKBest

In [22]:
selector = SelectKBest(score_func=f_regression, k= 5)
features_unique_customers_selected = selector.fit_transform(features_unique_customers, number_of_unique_customer)

In [23]:
selected_feature_indices = selector.get_support(indices=True)
selected_features = features_unique_customers.columns[selected_feature_indices]
print(selected_features)

Index(['number_of_repeat_consumers', 'average_monthly_diff_consumers',
       'consumer_diff_over_period', 'number_of_postcodes', 'postcode_reach'],
      dtype='object')
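For intuition about how these five features were chosen: f_regression scores each feature by the F-statistic of a one-feature linear fit, which can be written in terms of the Pearson correlation r with the target as r² / (1 − r²) × (n − 2). The NumPy sketch below reproduces that scoring on synthetic data. The helper `f_scores` and the generated columns are assumptions for illustration, not the notebook's data.

```python
import numpy as np


def f_scores(X, y):
    """F-statistic of a univariate linear fit of y on each column of X."""
    n_rows = X.shape[0]
    # Pearson correlation between each feature column and the target
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return r ** 2 / (1 - r ** 2) * (n_rows - 2)


rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([
    y + rng.normal(scale=0.1, size=200),  # strongly related to the target
    y + rng.normal(scale=1.0, size=200),  # moderately related
    rng.normal(size=200),                 # unrelated noise
])

scores = f_scores(X, y)
# Indices of the two highest-scoring columns, best first
top_k = np.argsort(scores)[::-1][:2]
print(top_k)
```

Since the score grows with r², selecting the k best features this way amounts to keeping the k features with the largest absolute correlation with the target.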


Splitting the data into train and test sets
 - We will hold out 33% of the rows for testing (test_size=0.33) and train on the remaining 67%

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
features_unique_customers_train, features_unique_customers_test, number_of_unique_customer_train, number_of_unique_customer_test = \
    train_test_split(features_unique_customers[selected_features], number_of_unique_customer, test_size=0.33, random_state=42)
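To see what test_size=0.33 means in terms of row counts, here is a plain-Python stand-in for a seeded shuffle split. The helper `split_indices` is hypothetical and is not scikit-learn's implementation, so the actual rows chosen will differ from train_test_split even with the same seed.

```python
import math
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and hold out ceil(test_size * n_rows) of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_rows)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=1000, test_size=0.33, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible between runs, which is the role random_state=42 plays in the cell above.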

Fitting the model
- We will use a linear regression model

In [26]:
from sklearn.linear_model import LinearRegression

In [27]:
num_of_unique_customers_model = LinearRegression()
num_of_unique_customers_model.fit(features_unique_customers_train, number_of_unique_customer_train)
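LinearRegression fits an ordinary least squares model. With a single feature the fit has a simple closed form, sketched below in plain Python. The helper `fit_line` and the data are made up for illustration. The example also shows that a fitted line can produce negative predictions for small inputs, which is why the notebook later clips negative predictions to 0.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x - 15
xs = [10, 20, 30, 40]
ys = [5, 25, 45, 65]

slope, intercept = fit_line(xs, ys)
# A small input lands below zero on this line
prediction_small = slope * 4 + intercept
print(slope, intercept, prediction_small)
```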

Model Evaluation

In [28]:
from sklearn.metrics import mean_squared_error, r2_score

In [29]:
num_of_unique_customer_pred = num_of_unique_customers_model.predict(features_unique_customers_test)
mse = mean_squared_error(number_of_unique_customer_test, num_of_unique_customer_pred)
rmse = (mse ** 0.5)
r2 = r2_score(number_of_unique_customer_test, num_of_unique_customer_pred)
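For reference, the three metrics computed above can be written out directly. This is a minimal plain-Python sketch with made-up numbers. The helper `regression_metrics` is hypothetical and only mirrors the standard definitions of MSE, RMSE and R².

```python
def regression_metrics(actual, predicted):
    """Return (mse, rmse, r2) for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares and total sum of squares
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return mse, mse ** 0.5, 1 - ss_res / ss_tot


# Made-up values: only the last prediction is off, by 1
mse, rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(mse, rmse, r2)
```

RMSE is in the same units as the target (number of consumers), which can make it easier to interpret alongside R².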

In [30]:
print(f'R-squared (R2): {r2}')

R-squared (R2): 0.9700684617272636


Next we will predict the number of customers in 3 years by adding 36 months to each merchant's transaction period.

In [31]:
future_modelling_df = modelling_df.copy()
future_modelling_df = future_modelling_df.sort_values(by='merchant_abn')

In [32]:
future_modelling_df["transcation_period_months"] = future_modelling_df["transcation_period_months"] + 36

In [33]:
future_modelling_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,number_of_orders,average_cost_of_order,average_spend_per_consumer,...,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,bnpl_maximum_gain,segment_indexed,revenue_level_indexed
1094,10023283211,0.18,0.0,2525,0.095502,174,1.071683,2706,215.798008,231.267093,...,1628,786.702328,1971.123799,43.031966,2.456907,0.6169,1.662162,1051.10898,2.0,4.0
2543,10142254217,4.22,0.0,2389,0.064356,151,1.064881,2544,38.59136,41.095195,...,1591,792.25,1983.427083,42.850629,2.464033,0.60288,1.598994,4143.044718,1.0,1.0
3960,10165489824,4.4,0.0,4,6.979425,0,1.0,4,8885.895,8885.895,...,4,817.5,2066.125,41.625,2.475,0.001516,1.0,1563.917554,4.0,1.0
429,10187291046,3.29,0.0,291,0.058022,1,1.003436,292,115.995445,116.394055,...,273,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,1114.34503,4.0,1.0
2729,10192359162,6.33,0.0,321,0.036126,2,1.006231,323,460.347214,463.215421,...,303,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,9412.212982,0.0,0.0


In [34]:
future_features_unique_customers = future_modelling_df.drop(columns = ["merchant_abn",target_variable])

In [35]:
future_features_unique_customers.columns

Index(['take_rate', 'average_merchant_fraud_probability',
       'average_consumer_fraud_probability', 'number_of_repeat_consumers',
       'average_repeat_transactions_per_consumer', 'number_of_orders',
       'average_cost_of_order', 'average_spend_per_consumer',
       'average_monthly_diff_consumers', 'consumer_diff_over_period',
       'average_growth_consumers', 'merchant_revenue_rounded',
       'transcation_period_months', 'number_of_postcodes',
       'avg_total_weekly_personal_income', 'avg_total_weekly_fam_income',
       'avg_median_age', 'avg_household_size', 'postcode_reach',
       'avg_num_of_consumers_per_postcode', 'bnpl_maximum_gain',
       'segment_indexed', 'revenue_level_indexed'],
      dtype='object')

In [36]:
predicted_num_of_unique_customers= num_of_unique_customers_model.predict(future_features_unique_customers[selected_features])
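A quick sanity check worth running at this point is whether the column shifted by 36 months is actually one of the model's inputs. The sketch below uses the feature names printed by the feature-selection cell. The variable names are ours.

```python
# Names copied from the printed output of the feature-selection cell
selected_features = [
    "number_of_repeat_consumers",
    "average_monthly_diff_consumers",
    "consumer_diff_over_period",
    "number_of_postcodes",
    "postcode_reach",
]
# The column that had 36 months added to it
shifted_column = "transcation_period_months"

shift_reaches_model = shifted_column in selected_features
print(shift_reaches_model)
```

If this prints False, the 36-month shift does not change the values passed to the model, so the predictions equal what the model would output on the current data. One possible remedy, offered as an assumption rather than the author's method, would be to include the period column among the model inputs before refitting.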

In [37]:
results = future_modelling_df.copy()
results["predicted_num_of_unique_customers"] = predicted_num_of_unique_customers

In [38]:
results.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,number_of_orders,average_cost_of_order,average_spend_per_consumer,...,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,bnpl_maximum_gain,segment_indexed,revenue_level_indexed,predicted_num_of_unique_customers
1094,10023283211,0.18,0.0,2525,0.095502,174,1.071683,2706,215.798008,231.267093,...,786.702328,1971.123799,43.031966,2.456907,0.6169,1.662162,1051.10898,2.0,4.0,3220.890429
2543,10142254217,4.22,0.0,2389,0.064356,151,1.064881,2544,38.59136,41.095195,...,792.25,1983.427083,42.850629,2.464033,0.60288,1.598994,4143.044718,1.0,1.0,3066.441393
3960,10165489824,4.4,0.0,4,6.979425,0,1.0,4,8885.895,8885.895,...,817.5,2066.125,41.625,2.475,0.001516,1.0,1563.917554,4.0,1.0,-291.338108
429,10187291046,3.29,0.0,291,0.058022,1,1.003436,292,115.995445,116.394055,...,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,1114.34503,4.0,1.0,270.059494
2729,10192359162,6.33,0.0,321,0.036126,2,1.006231,323,460.347214,463.215421,...,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,9412.212982,0.0,0.0,304.966874


In [39]:
results_df = spark.createDataFrame(results)

In [40]:
results_df = results_df.select(f.col("merchant_abn"),f.col("predicted_num_of_unique_customers"))

In [41]:
joined = joined.join(results_df, on = "merchant_abn", how = "inner")

In [42]:
joined = joined.withColumn("predicted_num_of_unique_customers", f.when(joined.predicted_num_of_unique_customers < 0, 0).otherwise(f.round(joined.predicted_num_of_unique_customers)))
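The cell above clips negative predictions to 0 and rounds the rest to whole customers. Below is a plain-Python sketch of the same rule applied to the five predictions shown earlier. The helper `clean_prediction` is hypothetical. It uses half-up rounding to match Spark's round, whereas Python's built-in round uses banker's rounding.

```python
from decimal import Decimal, ROUND_HALF_UP


def clean_prediction(value):
    """Clip negative predictions to 0, otherwise round half-up to a whole number."""
    if value < 0:
        return 0.0
    rounded = Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return float(rounded)


# Predictions taken from the results.head() output above
raw = [3220.890429, 3066.441393, -291.338108, 270.059494, 304.966874]
cleaned = [clean_prediction(value) for value in raw]
print(cleaned)
```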

In [43]:
joined.orderBy(f.col("number_of_unique_consumers").asc()).show()


+------------+--------------------+-------------+------------------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+--------------+------------------+--------------------+---------------------------------+--------------------+------------------+---------------------------------+
|merchant_abn|                name|revenue_level|         take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|number_of_orders|average_cost_of_orde

                                                                                

In [44]:
joined.write.mode("overwrite").parquet("../../../data/ranking_data.parquet")

                                                                                

In [45]:
spark.stop()