# Feature Engineering with Modelling
## Author: Dulan Wijeratne 1181873

In this notebook we will make new features using modelling techniques.

First we will start by creating a Spark session and reading in the joined aggregated data.

In [3]:
from pyspark.sql import SparkSession, functions as f

In [4]:
spark = (
    SparkSession.builder.appName("feature_engineering")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)



In [5]:
joined = spark.read.parquet("../../../data/insights/joined.parquet")

                                                                                

In [6]:
joined.orderBy(f.col("consumer_diff_over_period").asc()).show()

                                                                                

+------------+--------------------+-------------+---------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+--------------------+---------------------------------+--------------------+
|merchant_abn|                name|revenue_level|take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|consumer_retainability|number_of_orders|average_cost_of_order|average_spend_per_c

In [7]:
joined.filter(joined.merchant_abn == 71118957552).show()

+------------+--------------------+-------------+---------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+-----------------+------------------+--------------------+---------------------------------+------------------+
|merchant_abn|                name|revenue_level|take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|consumer_retainability|number_of_orders|average_cost_of_order|average_spend_per_cons

In [8]:
joined.orderBy(f.col("average_growth_consumers").desc()).show()


Changing NULLs to 0s

The modelling steps below (scikit-learn's feature selection and linear regression) do not accept missing values, so we replace every NULL with 0.

In [9]:
joined = joined.fillna(0)
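Before filling, it can help to see which columns actually contain missing values and how the fill shifts their statistics. The sketch below uses a small pandas frame rather than the Spark DataFrame; the column names are borrowed from `joined`, but the values are made up for illustration. Note that in Spark, `fillna(0)` with a numeric value only replaces NULLs in numeric columns, so NULLs in string columns would remain.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the notebook
toy = pd.DataFrame({
    "merchant_abn": [1, 2, 3],
    "average_growth_consumers": [0.5, np.nan, 1.5],
    "postcode_reach": [0.1, 0.2, np.nan],
})

# Count missing values per column before filling
null_counts = toy.isna().sum()
print(null_counts[null_counts > 0])

# Replace missing values with 0, as the notebook does
filled = toy.fillna(0)

# The mean drops because the missing row now counts as a 0
mean_before = toy["average_growth_consumers"].mean()
mean_after = filled["average_growth_consumers"].mean()
print(mean_before, mean_after)
```

Filling with 0 treats "unknown" as "zero", which pulls averages down for any column where 0 is not a natural default.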

Next we convert the categorical features to integer indices so that we can check their correlation with the target variable.

The dataset has 2 categorical features:
- Revenue level (`revenue_level`)
- Segment (`segment`)

In [10]:
from pyspark.ml.feature import StringIndexer

In [11]:
input_cols = ["revenue_level","segment"]
output_cols = ["revenue_level_indexed","segment_indexed"]

In [12]:
revenue_level_indexer = StringIndexer(inputCol = "revenue_level", outputCol= "revenue_level_indexed")
segment_indexer = StringIndexer(inputCol = "segment", outputCol = "segment_indexed")
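By default, `StringIndexer` orders labels by descending frequency, so the most common label receives index 0.0. The pandas sketch below mimics that ordering on hypothetical labels; it illustrates the default behaviour and is not Spark's implementation.

```python
import pandas as pd

# Hypothetical segment labels (not taken from the dataset)
segments = pd.Series(["retail", "tech", "retail", "hobby", "retail", "tech"])

# value_counts sorts by descending frequency: retail (3), tech (2), hobby (1)
labels_by_frequency = segments.value_counts().index

# Assign 0.0 to the most frequent label, 1.0 to the next, and so on
mapping = {label: float(i) for i, label in enumerate(labels_by_frequency)}
segment_indexed = segments.map(mapping)
print(mapping)
```

Because the resulting codes reflect label frequency rather than any natural order, correlations computed on the indexed columns should be read with care.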

In [13]:
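# Index `segment` first, then `revenue_level`, so both indexed columns are appended
segment_indexed_df = segment_indexer.fit(joined).transform(joined)
pre_correlation_df = revenue_level_indexer.fit(segment_indexed_df).transform(segment_indexed_df)
pre_correlation_df = pre_correlation_df.drop("revenue_level", "segment","name","first_recorded_transaction","last_recorded_transaction")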

                                                                                

In [14]:
correlation_df = pre_correlation_df.toPandas()

Now we will check the correlation matrix.

In [15]:
import pandas as pd

In [16]:
correlation_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,transcation_period_months,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed
0,10023283211,0.18,0.0,2525,0.095502,174,1.071683,0.068911,2706,215.798008,...,19.677419,1628,786.702328,1971.123799,43.031966,2.456907,0.6169,1.662162,2.0,4.0
1,10142254217,4.22,0.0,2389,0.064356,151,1.064881,0.063206,2544,38.59136,...,19.322581,1591,792.25,1983.427083,42.850629,2.464033,0.60288,1.598994,1.0,1.0
2,10165489824,4.4,0.0,0,0.0,0,0.0,0.0,4,8885.895,...,6.903226,4,817.5,2066.125,41.625,2.475,0.001516,1.0,3.0,1.0
3,10187291046,3.29,0.0,291,0.058022,1,1.003436,0.003436,292,115.995445,...,18.903226,273,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,3.0,1.0
4,10192359162,6.33,0.0,321,0.036126,2,1.006231,0.006231,323,460.347214,...,19.483871,303,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,0.0,0.0


In [17]:
corr_matrix = correlation_df.corr()
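`DataFrame.corr()` computes pairwise Pearson correlations by default. A convenient way to read one row of the matrix is to rank the other columns by absolute correlation with the target. The sketch below does this on synthetic data; the column names are borrowed from the notebook, and the values are randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the target depends strongly on one feature only
repeat = rng.normal(size=n)
toy = pd.DataFrame({
    "number_of_repeat_consumers": repeat,
    "take_rate": rng.normal(size=n),
    "number_of_unique_consumers": 5 * repeat + rng.normal(scale=0.5, size=n),
})

target = "number_of_unique_consumers"

# Drop the target's self-correlation, then sort by absolute value
ranked = toy.corr()[target].drop(target).abs().sort_values(ascending=False)
print(ranked)
```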

### Feature Engineering

Predicting number of consumers in 3 years

In [18]:
corr_matrix.loc["number_of_unique_consumers"]

merchant_abn                                0.005000
take_rate                                   0.040962
average_merchant_fraud_probability         -0.012018
number_of_unique_consumers                  1.000000
average_consumer_fraud_probability         -0.047828
number_of_repeat_consumers                  0.859005
average_repeat_transactions_per_consumer    0.665024
consumer_retainability                      0.976162
number_of_orders                            0.713317
average_cost_of_order                      -0.166808
average_spend_per_consumer                 -0.159572
average_monthly_diff_consumers              0.772166
consumer_diff_over_period                   0.772211
average_growth_consumers                    0.447964
merchant_revenue_rounded                    0.648542
transcation_period_months                   0.198016
number_of_postcodes                         0.847743
avg_total_weekly_personal_income            0.012759
avg_total_weekly_fam_income                 0.

Next we separate the features from the target variable.

In [19]:
modelling_df = correlation_df.copy()

In [20]:
modelling_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,transcation_period_months,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed
0,10023283211,0.18,0.0,2525,0.095502,174,1.071683,0.068911,2706,215.798008,...,19.677419,1628,786.702328,1971.123799,43.031966,2.456907,0.6169,1.662162,2.0,4.0
1,10142254217,4.22,0.0,2389,0.064356,151,1.064881,0.063206,2544,38.59136,...,19.322581,1591,792.25,1983.427083,42.850629,2.464033,0.60288,1.598994,1.0,1.0
2,10165489824,4.4,0.0,0,0.0,0,0.0,0.0,4,8885.895,...,6.903226,4,817.5,2066.125,41.625,2.475,0.001516,1.0,3.0,1.0
3,10187291046,3.29,0.0,291,0.058022,1,1.003436,0.003436,292,115.995445,...,18.903226,273,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,3.0,1.0
4,10192359162,6.33,0.0,321,0.036126,2,1.006231,0.006231,323,460.347214,...,19.483871,303,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,0.0,0.0


In [21]:
target_variable = "number_of_unique_consumers"

In [22]:
features_unique_customers = modelling_df.drop(columns = ["merchant_abn",target_variable])
number_of_unique_customer = modelling_df[target_variable]

Feature Selection

In [23]:
from sklearn.feature_selection import f_regression, SelectKBest

In [24]:
selector = SelectKBest(score_func=f_regression, k= 5)
features_unique_customers_selected = selector.fit_transform(features_unique_customers, number_of_unique_customer)

In [25]:
selected_feature_indices = selector.get_support(indices=True)
selected_features = features_unique_customers.columns[selected_feature_indices]
print(selected_features)

Index(['number_of_repeat_consumers', 'consumer_retainability',
       'consumer_diff_over_period', 'number_of_postcodes', 'postcode_reach'],
      dtype='object')
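`f_regression` scores each feature on its own: it converts the Pearson correlation between that feature and the target into an F-statistic, and `SelectKBest` keeps the `k` highest scores. Ranking by F-score is therefore equivalent to ranking by absolute correlation. The sketch below checks this on synthetic data (the arrays `X` and `y` are made up for illustration).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)

# Synthetic data: only the first two columns drive the target
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = selector.get_support(indices=True)
print(selected)

# Absolute Pearson correlation of each column with the target
abs_corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# The F-score ordering matches the |r| ordering
same_ranking = bool((np.argsort(selector.scores_) == np.argsort(abs_corr)).all())
print(same_ranking)
```

Because the scores are univariate, strongly correlated features can all be selected even when they carry overlapping information.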


Splitting the data into train and test sets
 - We will use a 67 - 33 split (`test_size=0.33`)

In [26]:
from sklearn.model_selection import train_test_split

In [27]:
features_unique_customers_train, features_unique_customers_test, number_of_unique_customer_train, number_of_unique_customer_test = \
    train_test_split(features_unique_customers[selected_features], number_of_unique_customer, test_size=0.33, random_state=42)
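The call above passes `test_size=0.33`, so roughly a third of the rows are held out for testing, and `random_state=42` makes the shuffle reproducible. The sketch below confirms both properties on placeholder arrays (the data is synthetic).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, 2 feature columns
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(len(X_train), len(X_test))

# Repeating the call with the same seed yields the same partition
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.33, random_state=42
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```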

Fitting the model
- We will use a linear regression model

In [28]:
from sklearn.linear_model import LinearRegression

In [29]:
num_of_unique_customers_model = LinearRegression()
num_of_unique_customers_model.fit(features_unique_customers_train, number_of_unique_customer_train)

Model Evaluation

In [30]:
from sklearn.metrics import mean_squared_error, r2_score

In [31]:
num_of_unique_customer_pred = num_of_unique_customers_model.predict(features_unique_customers_test)
mse = mean_squared_error(number_of_unique_customer_test, num_of_unique_customer_pred)
rmse = (mse ** 0.5)
r2 = r2_score(number_of_unique_customer_test, num_of_unique_customer_pred)

In [32]:
print(f'R-squared (R2): {r2}')

R-squared (R2): 0.9966278166932794
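For reference, the metrics computed above follow directly from the residuals: MSE is the mean squared residual, RMSE is its square root, and R2 equals 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes them by hand with NumPy on a small made-up example.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred

# Mean squared error and its square root
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
# prints: MSE=0.375 RMSE=0.612 R2=0.949
```

RMSE is expressed in the same units as the target (here, number of consumers), so reporting it next to R2 gives a sense of the typical error size.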


Next we will predict the number of customers in 3 years by adding 36 months to `transcation_period_months` and re-scoring each merchant.

In [33]:
future_modelling_df = modelling_df.copy()
future_modelling_df = future_modelling_df.sort_values(by='merchant_abn')

In [34]:
future_modelling_df["transcation_period_months"] = future_modelling_df["transcation_period_months"] + 36

In [35]:
future_modelling_df.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,transcation_period_months,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed
0,10023283211,0.18,0.0,2525,0.095502,174,1.071683,0.068911,2706,215.798008,...,55.677419,1628,786.702328,1971.123799,43.031966,2.456907,0.6169,1.662162,2.0,4.0
1,10142254217,4.22,0.0,2389,0.064356,151,1.064881,0.063206,2544,38.59136,...,55.322581,1591,792.25,1983.427083,42.850629,2.464033,0.60288,1.598994,1.0,1.0
2,10165489824,4.4,0.0,0,0.0,0,0.0,0.0,4,8885.895,...,42.903226,4,817.5,2066.125,41.625,2.475,0.001516,1.0,3.0,1.0
3,10187291046,3.29,0.0,291,0.058022,1,1.003436,0.003436,292,115.995445,...,54.903226,273,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,3.0,1.0
4,10192359162,6.33,0.0,321,0.036126,2,1.006231,0.006231,323,460.347214,...,55.483871,303,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,0.0,0.0


In [36]:
future_features_unique_customers = future_modelling_df.drop(columns = ["merchant_abn",target_variable])

In [37]:
future_features_unique_customers.columns

Index(['take_rate', 'average_merchant_fraud_probability',
       'average_consumer_fraud_probability', 'number_of_repeat_consumers',
       'average_repeat_transactions_per_consumer', 'consumer_retainability',
       'number_of_orders', 'average_cost_of_order',
       'average_spend_per_consumer', 'average_monthly_diff_consumers',
       'consumer_diff_over_period', 'average_growth_consumers',
       'merchant_revenue_rounded', 'transcation_period_months',
       'number_of_postcodes', 'avg_total_weekly_personal_income',
       'avg_total_weekly_fam_income', 'avg_median_age', 'avg_household_size',
       'postcode_reach', 'avg_num_of_consumers_per_postcode',
       'segment_indexed', 'revenue_level_indexed'],
      dtype='object')

In [38]:
predicted_num_of_unique_customers= num_of_unique_customers_model.predict(future_features_unique_customers[selected_features])
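The five columns in `selected_features` do not include `transcation_period_months`, so it is worth checking whether shifting that column changes the model's output at all. The sketch below reproduces the pattern on a hypothetical toy frame (the column names `a`, `b` and `months` are made up): a linear model fitted on `a` and `b` returns identical predictions before and after `months` is shifted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data: `months` exists in the frame but is not a model input
toy = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
    "months": rng.uniform(6, 20, size=50),
})
target = 2.0 * toy["a"] - toy["b"] + rng.normal(scale=0.1, size=50)

model_inputs = ["a", "b"]
model = LinearRegression().fit(toy[model_inputs], target)

# Shift the unused column by 36, mirroring the notebook's future frame
future = toy.copy()
future["months"] = future["months"] + 36

pred_now = model.predict(toy[model_inputs])
pred_future = model.predict(future[model_inputs])

# The predictions are identical because `months` never reaches the model
unchanged = bool(np.allclose(pred_now, pred_future))
print(unchanged)
```

If the same comparison on `num_of_unique_customers_model` also reports no change, the forecast horizon would need to enter the model through one of its inputs, for example by including `transcation_period_months` among the selected features.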

In [39]:
results = future_modelling_df.copy()
results["predicted_num_of_unique_customers"] = predicted_num_of_unique_customers

In [40]:
results.head()

Unnamed: 0,merchant_abn,take_rate,average_merchant_fraud_probability,number_of_unique_consumers,average_consumer_fraud_probability,number_of_repeat_consumers,average_repeat_transactions_per_consumer,consumer_retainability,number_of_orders,average_cost_of_order,...,number_of_postcodes,avg_total_weekly_personal_income,avg_total_weekly_fam_income,avg_median_age,avg_household_size,postcode_reach,avg_num_of_consumers_per_postcode,segment_indexed,revenue_level_indexed,predicted_num_of_unique_customers
0,10023283211,0.18,0.0,2525,0.095502,174,1.071683,0.068911,2706,215.798008,...,1628,786.702328,1971.123799,43.031966,2.456907,0.6169,1.662162,2.0,4.0,2726.168026
1,10142254217,4.22,0.0,2389,0.064356,151,1.064881,0.063206,2544,38.59136,...,1591,792.25,1983.427083,42.850629,2.464033,0.60288,1.598994,1.0,1.0,2489.348772
2,10165489824,4.4,0.0,0,0.0,0,0.0,0.0,4,8885.895,...,4,817.5,2066.125,41.625,2.475,0.001516,1.0,3.0,1.0,-72.459657
3,10187291046,3.29,0.0,291,0.058022,1,1.003436,0.003436,292,115.995445,...,273,796.547945,1961.171233,43.125,2.449418,0.103448,1.069597,3.0,1.0,99.810098
4,10192359162,6.33,0.0,321,0.036126,2,1.006231,0.006231,323,460.347214,...,303,808.877709,2024.267802,43.294118,2.44548,0.114816,1.066007,0.0,0.0,212.324384


In [41]:
results_df = spark.createDataFrame(results)

In [42]:
results_df = results_df.select(f.col("merchant_abn"),f.col("predicted_num_of_unique_customers"))

In [43]:
joined = joined.join(results_df, on = "merchant_abn", how = "inner")
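An inner join silently drops any merchant that is missing from either side, so comparing row counts before and after the join is a cheap safeguard (in Spark, via `count()`). The pandas sketch below shows the same idea on hypothetical rows; `validate="one_to_one"` additionally raises an error if `merchant_abn` is duplicated on either side.

```python
import pandas as pd

# Hypothetical merchants and predictions keyed by merchant_abn
merchants = pd.DataFrame({
    "merchant_abn": [101, 102, 103],
    "take_rate": [0.18, 4.22, 4.40],
})
predictions = pd.DataFrame({
    "merchant_abn": [101, 102, 103],
    "predicted_num_of_unique_customers": [2726.2, 2489.3, -72.5],
})

# Inner join on the key; fails loudly if keys are not unique
merged = merchants.merge(
    predictions, on="merchant_abn", how="inner", validate="one_to_one"
)

# No merchant should be lost by the join
rows_preserved = len(merged) == len(merchants)
print(rows_preserved)
```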

In [44]:
joined = joined.withColumn("predicted_num_of_unique_customers", f.when(joined.predicted_num_of_unique_customers < 0, 0).otherwise(f.round(joined.predicted_num_of_unique_customers)))
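The cell above clips negative predictions to 0 and rounds the rest to whole customers, since a linear model can produce negative or fractional counts. A pandas equivalent, applied to the five predictions shown in `results.head()`, looks like this:

```python
import pandas as pd

# Predictions copied from the first five rows of `results`
preds = pd.Series([2726.168026, 2489.348772, -72.459657, 99.810098, 212.324384])

# Floor negative values at 0, then round to the nearest whole number
cleaned = preds.clip(lower=0).round()
print(cleaned.tolist())
# prints: [2726.0, 2489.0, 0.0, 100.0, 212.0]
```

One difference to keep in mind: Spark's `round` rounds halves up, whereas pandas rounds halves to the nearest even number, so the two can disagree on values ending in exactly .5.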

In [45]:
joined.orderBy(f.col("number_of_unique_consumers").asc()).show()

                                                                                

+------------+--------------------+-------------+---------+----------------------------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+----------------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+------------------------+------------------------+--------------------------+-------------------------+-------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+--------------------+---------------------------------+--------------------+---------------------------------+
|merchant_abn|                name|revenue_level|take_rate|average_merchant_fraud_probability|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|consumer_retainability|number_of_orders|average

In [46]:
joined.write.mode("overwrite").parquet("../../../data/ranking_data.parquet")

                                                                                

In [None]:
spark.stop()