# Feature Engineering

I need to create features for this data. There are too many categories to one-hot encode each one, so I have been experimenting with frequency-based measures instead. I count occurrences of single features and of pairs of features, both over the dataset as a whole and within class one only (a form of target encoding).

Dividing each count by the number of rows produces tiny values in the tail. Dividing by the maximum count instead scales the most frequent value to 1, which should be more numerically stable.
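
To make the difference in scale concrete, here is a minimal plain-Python sketch (no Spark required). The toy column `values` and the names `by_rows` and `by_max` are made up for illustration and are not part of the pipeline in this notebook.

```python
from collections import Counter

# Hypothetical toy column: category i appears (100 - i) times, for i in 0..99.
values = [f'cat_{i}' for i in range(100) for _ in range(100 - i)]

counts = Counter(values)
n_rows = len(values)               # 5050 rows in total
max_count = max(counts.values())   # 100, the count of the most frequent category

# Normalize by the number of rows: the ratios sum to 1, so each one is small.
by_rows = {k: c / n_rows for k, c in counts.items()}
# Normalize by the maximum count: the most frequent category maps to exactly 1.0.
by_max = {k: c / max_count for k, c in counts.items()}

print('rarest  :', by_rows['cat_99'], by_max['cat_99'])
print('frequent:', by_rows['cat_0'], by_max['cat_0'])
```

Both scalings preserve the ordering of the categories. They differ only by the constant factor `n_rows / max_count`, which grows as the number of categories grows.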

## To do

Right now I'm creating feature counts over the entire set of training data (except for ip, which is per calendar day). Data goes 
stale over time. It would be interesting to see whether the most useful features come from the last hour, 8 hours, day, 2 days, etc. 
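
One possible way to explore this is to count only the clicks that fall inside a trailing time window. The sketch below is plain Python on made-up data. The helper `windowed_counts`, the app ids, and the timestamps are all hypothetical, and a Spark version would need a different approach.

```python
from collections import Counter
from datetime import datetime, timedelta

def windowed_counts(clicks, now, window):
    """Count feature values among clicks with now - window < click_time <= now."""
    start = now - window
    return Counter(value for value, ts in clicks if start < ts <= now)

# Hypothetical clicks: (app id, click time).
now = datetime(2017, 11, 9, 12, 0)
clicks = [
    ('app_3', now - timedelta(minutes=10)),
    ('app_3', now - timedelta(hours=5)),
    ('app_9', now - timedelta(hours=7)),
    ('app_3', now - timedelta(days=1, hours=2)),
]

# Compare the counts seen through the last hour, 8 hours, day, and 2 days.
for hours in (1, 8, 24, 48):
    print(hours, dict(windowed_counts(clicks, now, timedelta(hours=hours))))
```

Each window could then be normalized by its own maximum count, in the same way as the full-dataset tables.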

In [1]:
import pyspark
import pyspark.sql.functions as F
import pyspark.sql.types as T

In [24]:
from itertools import combinations

In [2]:
%load_ext watermark
%watermark -iv

pyspark 2.4.3



In [3]:
# Comment these out to run on a cluster. Also, adjust the memory to suit your laptop.
pyspark.sql.SparkSession.builder.config('spark.driver.memory', '8g')
pyspark.sql.SparkSession.builder.config('spark.sql.shuffle.partitions', 5)

<pyspark.sql.session.SparkSession.Builder at 0x103a9ca90>

In [4]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

## Load the data

In [14]:
df = spark.read.parquet('../data/intermed/train.parquet')
class1 = spark.read.parquet('../data/intermed/train1.parquet')
test = spark.read.parquet('../data/intermed/test_supplement.parquet')

## Counting 

Each count table is written to its own parquet file.

In [27]:
def make_count_table(sdf, groupby_clause, prefix):
    """Count rows per group, scale by the largest count, and write the table to parquet."""
    if isinstance(groupby_clause, str):
        column_name = groupby_clause + '_pct'  # for example: ip_pct
    else:
        column_name = '_'.join(groupby_clause)  # for example: device_os

    file_name = ''.join([prefix, column_name, '.parquet'])

    counts_sdf = (sdf.groupby(groupby_clause)
                     .count()
                     .orderBy('count', ascending=False))

    # largest group count, used as the normalizing constant
    maxcnt = counts_sdf.select(F.max('count').alias('maxcnt')).collect()
    maxcnt = maxcnt[0].maxcnt

    counts_sdf = counts_sdf.withColumn(
        'ratios', F.col('count').astype(T.DoubleType()) / float(maxcnt))
    counts_sdf = counts_sdf.drop('count').withColumnRenamed('ratios', column_name)

    counts_sdf.write.parquet('../data/features/' + file_name)
    return counts_sdf

## Device, OS, channel, app 

In [28]:
# not doing ip address yet -- it's special because IPs come and go more quickly over time
columns = [ 'device', 'os', 'channel', 'app' ]
bigrams = [ list(b) for b in combinations(columns,2)]

for c in columns:
    make_count_table( df, c, 'df_' )
    make_count_table( class1, c, 'tgt_')
    
for bigram in bigrams:
    make_count_table( df, bigram, 'df_' )
    make_count_table( class1, bigram, 'tgt_')
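
As a quick check on what the two loops above produce, this plain-Python sketch lists the feature pairs and the parquet file names that would be written. The helper `feature_file_name` is hypothetical. It only mirrors the naming logic inside `make_count_table` and does not touch Spark or the file system.

```python
from itertools import combinations

def feature_file_name(prefix, groupby_clause):
    """Mirror the file naming used by make_count_table (illustration only)."""
    if isinstance(groupby_clause, str):
        column_name = groupby_clause + '_pct'
    else:
        column_name = '_'.join(groupby_clause)
    return prefix + column_name + '.parquet'

columns = ['device', 'os', 'channel', 'app']
bigrams = [list(b) for b in combinations(columns, 2)]

# One file per single column and per pair, for both the full data and class one.
files = [feature_file_name(prefix, clause)
         for clause in columns + bigrams
         for prefix in ('df_', 'tgt_')]

print(len(bigrams), 'pairs:', ['_'.join(b) for b in bigrams])
print(len(files), 'files, e.g.', files[:2], files[-2:])
```

Four columns give six unordered pairs, so the two loops write 20 tables in total.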
        

## IP

In [33]:
train_ip = df.select('ip', 'click_time')
test_ip = test.select('ip', 'click_time')
ip_data = train_ip.union(test_ip)  # unionAll is deprecated since Spark 2.0

ip_data = ip_data.withColumn('doy', F.dayofyear('click_time'))

In [35]:
# count how many times an ip appears each day
day_counts = ip_data.groupby(['doy','ip']).count()
# find the max count for each day
day_max = day_counts.groupby('doy').agg(F.max('count').alias('day_max'))
# merge the max per day into the daily counts table
merge = day_counts.join(day_max, ['doy'], how='left')
# normalize all the counts by the max
ip_table = merge.withColumn('ip_pct',
                 F.col('count').astype(T.FloatType())/
                 F.col('day_max').astype(T.FloatType())
                ).drop(
                    'count'
                ).drop(
                    'day_max'
                )

In [37]:
ip_table.write.parquet('../data/features/ip_pct.parquet')

In [32]:
# barrier so I don't accidentally kill my spark session by hitting return too many times
assert(0)

In [41]:
spark.stop()