# Feature Engineering with Modelling
## Author: Dulan Wijeratne 1181873

In this notebook we will make new features using modelling techniques.

First we will start by creating a Spark session and reading in the joined aggregated data.

In [1]:
from pyspark.sql import SparkSession, functions as f

In [2]:
spark = (
    SparkSession.builder.appName("feature_engineering")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)


23/09/29 22:54:36 WARN Utils: Your hostname, LAPTOP-RELH58H1 resolves to a loopback address: 127.0.1.1; using 172.19.22.4 instead (on interface eth0)
23/09/29 22:54:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/29 22:54:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/29 22:54:42 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
joined = spark.read.parquet("../../../data/insights/joined.parquet")

                                                                                

In [4]:
joined.orderBy(f.col("consumer_diff_over_period").asc()).show()

23/09/29 22:54:59 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+------------+--------------------+-------------+---------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+--------------------+---------------------------------+--------------------+
|merchant_abn|                name|revenue_level|take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|consumer_retainability|number_of_orders|average_cost_of_order|average_spend_per_c

In [5]:
joined.filter(joined.merchant_abn == 71118957552).show()

+------------+--------------------+-------------+---------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+-----------------+------------------+--------------------+---------------------------------+------------------+
|merchant_abn|                name|revenue_level|take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|consumer_retainability|number_of_orders|average_cost_of_order|average_spend_per_cons

                                                                                

In [6]:
joined.orderBy(f.col("average_growth_consumers").desc()).show()


Changing NULLs to 0s

The modelling techniques we use below cannot handle NULLs, so we replace them with 0. Note that `fillna(0)` only fills numeric columns; NULLs in string columns are left unchanged.

In [None]:
joined = joined.fillna(0)

Next we convert the categorical features into integer indices so that we can check their correlation with the target variable.

The dataset has 2 categorical features:
- Revenue level (`revenue_level`)
- Segment (`segment`)

In [None]:
from pyspark.ml.feature import StringIndexer

In [None]:
input_cols = ["revenue_level","segment"]
output_cols = ["revenue_level_indexed","segment_indexed"]

In [None]:
revenue_level_indexer = StringIndexer(inputCol = "revenue_level", outputCol= "revenue_level_indexed")
segment_indexer = StringIndexer(inputCol = "segment", outputCol = "segment_indexed")
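
As a rough illustration of what the indexed columns will contain: by default, `StringIndexer` orders labels by descending frequency, so the most common label receives index 0. The pure-Python sketch below mimics that ordering on made-up labels. The helper name `frequency_index` and the sample values are illustrative and are not taken from the dataset.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical revenue levels, not the real data.
levels = ["a", "b", "a", "c", "a", "b"]
mapping = frequency_index(levels)
indexed = [mapping[level] for level in levels]
print(mapping)
print(indexed)
```

Because the indices reflect how often a label occurs rather than any natural order, a Pearson correlation computed on them should be read with caution.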

In [None]:
# Index each categorical column in turn, then drop the columns that cannot be correlated.
indexed = segment_indexer.fit(joined).transform(joined)
indexed = revenue_level_indexer.fit(indexed).transform(indexed)
pre_correlation_df = indexed.drop("revenue_level", "segment", "name", "first_recorded_transaction", "last_recorded_transaction")

In [None]:
correlation_df = pre_correlation_df.toPandas()

Now we will check the correlation matrix

In [None]:
import pandas as pd

In [None]:
correlation_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,transcation_period_months,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed
0,83412691377,2.94,0.0,8990,0.032161,2419,1.326808,0.269077,11928,34.971224,...,19.193548,2551,791.10576,1980.449153,43.164361,2.457791,0.966654,4.675813,4.0,2.0
1,38700038932,6.31,0.0,5154,0.605566,707,1.153279,0.137175,5944,1344.342695,...,19.612903,2286,794.789283,1984.221484,43.09211,2.452603,0.866237,2.600175,2.0,0.0
2,73256306726,4.81,0.0,3899,0.056487,436,1.118492,0.111824,4361,283.94462,...,19.580645,2049,791.844875,1982.720936,43.252006,2.449668,0.77643,2.128355,3.0,1.0
3,73841664453,5.55,0.0,792,0.026275,19,1.02399,0.02399,811,85.539625,...,18.612903,679,796.419852,2013.956227,42.881011,2.439963,0.257294,1.194404,1.0,0.0
4,35344855546,2.92,0.0,1237,0.053722,37,1.029911,0.029911,1274,89.123652,...,19.83871,995,795.796703,1973.193485,42.75471,2.450118,0.377037,1.280402,4.0,2.0


In [None]:
corr_matrix = correlation_df.corr()
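
`DataFrame.corr()` computes the Pearson correlation by default. A minimal sketch of that coefficient for two columns is shown below, using invented numbers and a hypothetical helper named `pearson`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only, not rows from the dataset.
repeat_consumers = [10, 20, 30, 40, 50]
unique_consumers = [110, 205, 330, 390, 520]
r = pearson(repeat_consumers, unique_consumers)
print(round(r, 3))
```

The coefficient only measures linear association, so a value near 0 does not rule out a non-linear relationship.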

### Feature Engineering

Predicting number of consumers in 3 years

In [None]:
corr_matrix.loc["number_of_unique_consumers"]

merchant_abn                                0.014463
take_rate                                   0.027533
average_merchant_fraud_probability          0.000810
number_of_unique_consumers                  1.000000
average_consumer_fraud_probability         -0.119152
number_of_repeat_consumers                  0.869945
average_repeat_transactions_per_consumer    0.613173
consumer_retainability                      0.975056
number_of_orders                            0.710842
average_cost_of_order                      -0.194075
average_spend_per_consumer                 -0.160724
average_monthly_diff_consumers              0.770269
consumer_diff_over_period                   0.770314
average_growth                              0.346373
merchant_revenue_rounded                    0.616352
transcation_period_months                   0.302558
number_of_postcodes                         0.834871
avg_total_weekly_personal_income           -0.003423
avg_total_weekly_fam_income                 0.

Next we separate the features from the target variable.

In [None]:
modelling_df = correlation_df.copy()

In [None]:
modelling_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,transcation_period_months,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed
0,83412691377,2.94,0.0,8990,0.032161,2419,1.326808,0.269077,11928,34.971224,...,19.193548,2551,791.10576,1980.449153,43.164361,2.457791,0.966654,4.675813,4.0,2.0
1,38700038932,6.31,0.0,5154,0.605566,707,1.153279,0.137175,5944,1344.342695,...,19.612903,2286,794.789283,1984.221484,43.09211,2.452603,0.866237,2.600175,2.0,0.0
2,73256306726,4.81,0.0,3899,0.056487,436,1.118492,0.111824,4361,283.94462,...,19.580645,2049,791.844875,1982.720936,43.252006,2.449668,0.77643,2.128355,3.0,1.0
3,73841664453,5.55,0.0,792,0.026275,19,1.02399,0.02399,811,85.539625,...,18.612903,679,796.419852,2013.956227,42.881011,2.439963,0.257294,1.194404,1.0,0.0
4,35344855546,2.92,0.0,1237,0.053722,37,1.029911,0.029911,1274,89.123652,...,19.83871,995,795.796703,1973.193485,42.75471,2.450118,0.377037,1.280402,4.0,2.0


In [None]:
target_variable = "number_of_unique_consumers"

In [None]:
features_unique_customers = modelling_df.drop(columns = ["merchant_abn",target_variable])
number_of_unique_customer = modelling_df[target_variable]

Feature Selection

In [None]:
from sklearn.feature_selection import f_regression, SelectKBest

In [None]:
selector = SelectKBest(score_func=f_regression, k= 5)
features_unique_customers_selected = selector.fit_transform(features_unique_customers, number_of_unique_customer)

In [None]:
selected_feature_indices = selector.get_support(indices=True)
selected_features = features_unique_customers.columns[selected_feature_indices]
print(selected_features)

Index(['number_of_repeat_consumers', 'consumer_retainability',
       'consumer_diff_over_period', 'number_of_postcodes', 'postcode_reach'],
      dtype='object')
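
For context, `f_regression` scores each feature on its own: it takes the feature's Pearson correlation r with the target and converts it to an F-statistic, F = r^2 / (1 - r^2) * (n - 2) for n samples. The sketch below applies that conversion to four correlations copied from the matrix output earlier in this notebook. The sample size `n` is a placeholder, since the number of merchants is not shown here, and `f_statistic` is a hypothetical helper.

```python
def f_statistic(r, n):
    """Univariate F-statistic for a feature with correlation r over n samples."""
    return (r ** 2) / (1 - r ** 2) * (n - 2)

# Correlations with number_of_unique_consumers, copied from the matrix output.
correlations = {
    "consumer_retainability": 0.975056,
    "number_of_repeat_consumers": 0.869945,
    "number_of_postcodes": 0.834871,
    "take_rate": 0.027533,
}
n = 4000  # placeholder sample size, not taken from the data

scores = {name: f_statistic(r, n) for name, r in correlations.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

For a fixed `n` the ranking depends only on the magnitude of r, which is consistent with the selected features being those most strongly correlated with the target. Scoring features one at a time does not account for redundancy between them.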


Splitting the data into train and test sets
 - We will use a 67/33 split (`test_size=0.33`)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
features_unique_customers_train, features_unique_customers_test, number_of_unique_customer_train, number_of_unique_customer_test = \
    train_test_split(features_unique_customers[selected_features], number_of_unique_customer, test_size=0.33, random_state=42)

Fitting the model
- We will use a linear regression model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
num_of_unique_customers_model = LinearRegression()
num_of_unique_customers_model.fit(features_unique_customers_train, number_of_unique_customer_train)
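
`LinearRegression` fits ordinary least squares. For a single feature the fit has a closed form, sketched below on invented numbers. The model above uses five features, so this is only the one-dimensional analogue, and `fit_simple_ols` is a hypothetical helper.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimising squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-deviation of x and y divided by the squared deviation of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative values: number of postcodes vs. unique consumers.
postcodes = [100, 200, 300, 400]
consumers = [120, 210, 330, 400]
slope, intercept = fit_simple_ols(postcodes, consumers)
print(slope, intercept)
```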

Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
num_of_unique_customer_pred = num_of_unique_customers_model.predict(features_unique_customers_test)
mse = mean_squared_error(number_of_unique_customer_test, num_of_unique_customer_pred)
rmse = (mse ** 0.5)
r2 = r2_score(number_of_unique_customer_test, num_of_unique_customer_pred)

In [None]:
print(f'R-squared (R2): {r2}')

R-squared (R2): 0.9962345114520971
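
For reference, both metrics can be written out directly. The sketch below uses made-up values and hypothetical helper names, and follows the usual definitions of RMSE and R-squared. RMSE is worth printing alongside R-squared because it is expressed in the target's own units (consumers).

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values only.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
print(rmse(actual, predicted), r_squared(actual, predicted))
```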


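Next we will predict the number of customers in 3 years. To do this we add 36 months to each merchant's transaction period and run the fitted model on the shifted data.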

In [None]:
future_modelling_df = modelling_df.copy()
future_modelling_df = future_modelling_df.sort_values(by='merchant_abn')

In [None]:
future_modelling_df["transcation_period_months"] = future_modelling_df["transcation_period_months"] + 36

In [None]:
future_modelling_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,number_of_postcodes,most_popular_postcode,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed
0,10023283211,0.18,0.0,2525,0.095502,174,1.071683,0.068911,2706,215.797947,...,1628,3275,786.702328,1971.123799,43.031966,2.456914,0.6169,1.662162,2.0,4.0
1,10142254217,4.22,0.0,2389,0.064356,151,1.064881,0.063206,2544,38.59147,...,1591,6438,792.25,1983.427083,42.850629,2.464025,0.60288,1.598994,1.0,1.0
2,10165489824,4.4,0.0,0,0.0,0,0.0,0.0,4,8885.894209,...,4,2534,817.5,2066.125,41.625,2.475,0.001516,1.0,3.0,1.0
3,10187291046,3.29,0.0,291,0.058022,1,1.003436,0.003436,292,115.99557,...,273,5067,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,3.0,1.0
4,10192359162,6.33,0.0,321,0.036126,2,1.006231,0.006231,323,460.347109,...,303,2062,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,0.0,0.0


In [None]:
future_features_unique_customers = future_modelling_df.drop(columns = ["merchant_abn",target_variable])

In [None]:
future_features_unique_customers.columns

Index(['take_rate', 'average_merchant_fraud_probability',
       'average_consumer_fraud_probability', 'number_of_repeat_consumers',
       'average_repeat_transactions_per_consumer', 'consumer_retainability',
       'number_of_orders', 'average_cost_of_order',
       'average_spend_per_consumer', 'average_monthly_diff_consumers',
       'consumer_diff_over_period', 'average_growth',
       'merchant_revenue_rounded', 'transcation_period_months',
       'number_of_postcodes', 'most_popular_postcode',
       'avg_total_weekly_personal_income', 'avg_total_weekly_fam_income',
       'avg_median_age', 'avg_household_size', 'postcode_reach',
       'avg_num_of_consumers_per_postcode', 'segment_indexed',
       'revenue_level_indexed'],
      dtype='object')

In [None]:
predicted_num_of_unique_customers= num_of_unique_customers_model.predict(future_features_unique_customers[selected_features])
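
One thing to check at this step: the model only reads the columns in `selected_features`, and `transcation_period_months` is not among the five printed by the selector, so adding 36 months would leave the predictions unchanged. The sketch below shows the effect with a hypothetical stand-in model; `predict`, its coefficients, and the sample row are invented for illustration.

```python
# Features the fitted model reads, as printed by the selector earlier.
selected = [
    "number_of_repeat_consumers",
    "consumer_retainability",
    "consumer_diff_over_period",
    "number_of_postcodes",
    "postcode_reach",
]

# Hypothetical coefficients; the real ones live in the fitted LinearRegression.
coefficients = dict(zip(selected, [1.5, 2000.0, 0.8, 1.1, 300.0]))
intercept = 5.0

def predict(row):
    """Linear prediction that, like the fitted model, ignores unselected columns."""
    return intercept + sum(coefficients[name] * row[name] for name in selected)

# Invented merchant row.
current = {
    "number_of_repeat_consumers": 174,
    "consumer_retainability": 0.068911,
    "consumer_diff_over_period": 120,
    "number_of_postcodes": 1628,
    "postcode_reach": 0.6169,
    "transcation_period_months": 19.2,
}
# Shift only the transaction period, as the notebook does.
future = dict(current, transcation_period_months=current["transcation_period_months"] + 36)

print(predict(current) == predict(future))
```

If the forecast is meant to depend on elapsed time, a time-dependent column would need to be among the model's inputs; how best to do that is a modelling decision outside this sketch.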

In [None]:
results = future_modelling_df.copy()
results["predicted_num_of_unique_customers"] = predicted_num_of_unique_customers

In [None]:
results.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,most_popular_postcode,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed,predicted_num_of_unique_customers
46,10023283211,0.18,0.0,2525,0.095502,174,1.071683,0.068911,2706,215.797947,...,1628,786.702328,1971.123799,43.031966,2.456914,0.6169,1.662162,2.0,4.0,2733.324536
761,10142254217,4.22,0.0,2389,0.064356,151,1.064881,0.063206,2544,38.59147,...,1591,792.25,1983.427083,42.850629,2.464025,0.60288,1.598994,1.0,1.0,2499.825334
2500,10187291046,3.29,0.0,291,0.058022,1,1.003436,0.003436,292,115.99557,...,273,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,4.0,1.0,16.669832
2165,10192359162,6.33,0.0,321,0.036126,2,1.006231,0.006231,323,460.347109,...,303,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,0.0,0.0,129.359745
783,10206519221,6.34,0.0,6652,0.058119,1302,1.222489,0.195731,8132,37.385626,...,2438,787.455116,1969.985735,43.111842,2.453609,0.923835,3.335521,0.0,0.0,6933.080489


In [None]:
results_df = spark.createDataFrame(results)

In [None]:
results_df = results_df.select(f.col("merchant_abn"),f.col("predicted_num_of_unique_customers"))

In [None]:
joined = joined.join(results_df, on = "merchant_abn", how = "inner")

In [None]:
joined = joined.withColumn("predicted_num_of_unique_customers", f.when(joined.predicted_num_of_unique_customers < 0, 0).otherwise(f.round(joined.predicted_num_of_unique_customers)))
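
The clean-up rule above sets negative predictions to 0 and rounds the rest to whole consumers. A plain-Python sketch of the same rule follows; `clean_prediction` is a hypothetical helper. Spark's `round` uses HALF_UP rounding, whereas Python's built-in `round` rounds halves to even, so the sketch uses `decimal` to match. It returns integers, while the Spark column holds doubles.

```python
from decimal import Decimal, ROUND_HALF_UP

def clean_prediction(value):
    """Clamp negative predictions to 0 and round the rest half-up to an integer."""
    if value < 0:
        return 0
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

# The first three values come from the results table above; the last two are made up.
raw = [2733.324536, 16.669832, 129.359745, -42.7, 2.5]
cleaned = [clean_prediction(v) for v in raw]
print(cleaned)
```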

In [None]:
joined.orderBy(f.col("number_of_unique_consumers").asc()).show()

+------------+--------------------+-------------+---------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+-------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+--------------------+---------------------------------+--------------------+---------------------------------+
|merchant_abn|                name|revenue_level|take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|consumer_retainability|number_of_orders|average_cost

                                                                                

In [None]:
spark.stop()