# Data Analysis and Feature Engineering Notebook

This notebook guides you through key steps in data analysis and preparation for machine learning:

1. **Calculate statistical summaries** to understand data distributions and central tendencies.
2. **Test hypotheses** to compare patterns between weekdays and weekends.
3. **Identify correlations** among variables to uncover relationships.
4. **Engineer features for ML** to enhance predictive modeling.

### Task-1 Calculate Statistical Summaries

In [0]:
# Load the silver events table into a DataFrame
events = spark.read.table("workspace.default.silver_ecommerce_events_event_type_part")
# Show summary statistics (count, mean, stddev, min, max) for every column
events.describe().display()

summary,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
count,42448764,42448764,42448764.0,42448764.0,28933155,36335756,42448764.0,42448764.0,42448762
mean,,,10549932.375842676,2.0574042379407944e+18,,,290.3236606849145,533537147.50816846,
stddev,,,11881906.97060811,1.8439264661404264e+16,,,358.2691553394025,18523738.174654134,
min,2019-10-01 00:00:00 UTC,cart,1000978.0,2.053013552226108e+18,accessories.bag,a-case,0.0,183503497.0,00000042-3e3f-42f9-810d-f3d264139c50
max,2019-10-31 23:59:59 UTC,view,9900461.0,2.1754195950939676e+18,stationery.cartrige,zyxel,999.82,64078358.0,fffffc65-7ce9-435c-8b72-1d9f7062fe77
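
One detail in the summary is worth a second look: the reported `max` for `user_id` (64078358) is smaller than the reported `min` (183503497). That pattern is what string comparison would produce, so a plausible reading is that the ID columns were stored as strings when `describe()` ran. The sketch below uses plain Python, not Spark, and only the two `user_id` values from the output, to show how lexicographic and numeric ordering disagree.

```python
# The two user_id values reported as min and max in the summary, held as strings.
user_ids = ["183503497", "64078358"]

# String comparison goes character by character: "1" sorts before "6".
string_min = min(user_ids)
string_max = max(user_ids)

# Numeric comparison after casting gives the opposite ordering.
numeric_min = min(int(u) for u in user_ids)
numeric_max = max(int(u) for u in user_ids)

print("as strings :", string_min, string_max)
print("as integers:", numeric_min, numeric_max)
```

If this reading is right, min and max for the ID columns only become meaningful once the columns are cast to numeric types.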


In [0]:
from pyspark.sql import functions as F

# Cast ID, price, and time columns to explicit types so that numeric
# comparisons and date functions behave as expected, then derive event_date.
events = events.withColumn("product_id", F.col("product_id").cast("long")) \
               .withColumn("user_id", F.col("user_id").cast("long")) \
               .withColumn("price", F.col("price").cast("double")) \
               .withColumn("category_id", F.col("category_id").cast("long")) \
               .withColumn("event_time", F.col("event_time").cast("timestamp")) \
               .withColumn("event_date", F.to_date("event_time"))

### Task-2 Hypothesis Testing

In [0]:
# Step-1 Create weekend flag
from pyspark.sql import functions as F

events_flagged = events.withColumn(
    "is_weekend",
    F.dayofweek("event_time").isin([1, 7])  # Sunday=1, Saturday=7
)
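
Spark's `dayofweek` numbers days from Sunday=1 to Saturday=7, which differs from Python's `isoweekday()` (Monday=1 to Sunday=7). As a sanity check on the weekend flag, here is a minimal plain-Python sketch of the same convention. The helper names `spark_dayofweek` and `is_weekend` are hypothetical and exist only for this illustration.

```python
from datetime import date

def spark_dayofweek(d):
    """Mirror Spark's dayofweek numbering: Sunday=1, Monday=2, ..., Saturday=7."""
    # isoweekday() is Monday=1 ... Sunday=7, so shift by one and wrap Sunday to 1.
    return d.isoweekday() % 7 + 1

def is_weekend(d):
    """Same rule as the notebook's flag: day 1 (Sunday) or day 7 (Saturday)."""
    return spark_dayofweek(d) in (1, 7)

for d in (date(2019, 10, 7), date(2019, 10, 26), date(2019, 10, 27)):
    print(d, spark_dayofweek(d), is_weekend(d))
```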


In [0]:
# Step-2 Compare event-type counts for weekdays vs. weekends

events_flagged.groupBy("is_weekend", "event_type") \
    .count() \
    .orderBy("is_weekend", "event_type") \
    .display()


is_weekend,event_type,count
False,cart,664318
False,purchase,546439
False,view,29775216
True,cart,262198
True,purchase,196410
True,view,11004183
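
The counts above describe the two groups but do not test anything yet. One way to turn them into a hypothesis test is a chi-square test of independence between `is_weekend` and `event_type`. The sketch below is plain Python applied to the counts copied from the output, not part of the original notebook. The function name `chi_square_independence` is hypothetical, and the p-value shortcut relies on the table having exactly 2 degrees of freedom.

```python
import math

# Observed event counts copied from the groupBy output above.
observed = {
    "weekday": {"cart": 664318, "purchase": 546439, "view": 29775216},
    "weekend": {"cart": 262198, "purchase": 196410, "view": 11004183},
}

def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a dict-of-dicts contingency table."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())

    statistic = 0.0
    for r in rows:
        for c in cols:
            # Expected count if event type were independent of the weekend flag.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (table[r][c] - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(cols) - 1)
    return statistic, dof

chi2, dof = chi_square_independence(observed)

# For exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```

With roughly 42 million events, even small differences in the event mix will come out as statistically significant, and events from the same user are not independent observations. The test result is therefore best read alongside the size of the difference in each event type's share.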


### Task-3 Identify Correlations

Correlation (Pearson's r) measures the strength and direction of the linear relationship between two numeric columns.

+1 → perfect positive linear relationship

0 → no linear relationship

-1 → perfect negative linear relationship

Values near zero indicate a weak linear relationship; the variables may still be related in a non-linear way.
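
To make the scale concrete, here is a small plain-Python sketch of the Pearson coefficient on made-up toy data (none of these numbers come from the events table). The function name `pearson_corr` is hypothetical. The third case shows that a coefficient of zero rules out only a linear relationship: `y = x**2` depends entirely on `x`, yet its coefficient over a symmetric range is zero.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0]

# Toy data: y rises linearly with x.
perfect_positive = pearson_corr(x, [2 * v + 1 for v in x])

# Toy data: y falls linearly with x.
perfect_negative = pearson_corr(x, [-3 * v for v in x])

# Toy data: y = x**2 over a symmetric range, dependent but not linear.
symmetric_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
no_linear = pearson_corr(symmetric_x, [v ** 2 for v in symmetric_x])

print(perfect_positive, perfect_negative, no_linear)
```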

In [0]:
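# Correlation only works on numeric, row-level columns.
# Note: user_id is an identifier rather than a measurement, so the
# near-zero result below is unsurprising and carries little meaning.
events.stat.corr("price", "user_id")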

0.0033993499464311703

### Task-4 Feature Engineering for ML

Feature engineering transforms raw data into model-ready signals.

**Feature 1: Time-based features**

In [0]:
# Print the schema to verify the column types
events.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: long (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_session: string (nullable = true)
 |-- event_date: date (nullable = true)



In [0]:
# Derive hour-of-day and day-of-week (Sunday=1 ... Saturday=7) features
from pyspark.sql import functions as F

features = events \
    .withColumn("hour", F.hour("event_time")) \
    .withColumn("day_of_week", F.dayofweek("event_time"))


In [0]:
# View the first 10 rows of the new features
features.select("event_time", "hour", "day_of_week").limit(10).display()

event_time,hour,day_of_week
2019-10-26T07:55:55.000Z,7,7
2019-10-26T07:55:55.000Z,7,7
2019-10-26T07:55:55.000Z,7,7
2019-10-26T07:55:57.000Z,7,7
2019-10-26T07:55:57.000Z,7,7
2019-10-26T07:55:58.000Z,7,7
2019-10-26T07:55:58.000Z,7,7
2019-10-26T07:55:59.000Z,7,7
2019-10-26T07:55:59.000Z,7,7
2019-10-26T07:55:59.000Z,7,7


**Feature 2: Time since first event**

In [0]:
# Running window per user, ordered by time, so first() returns the user's earliest event
from pyspark.sql import Window

window = Window.partitionBy("user_id") \
    .orderBy("event_time") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

features = features.withColumn(
    "first_event_time",
    F.first("event_time").over(window)
)


In [0]:
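# Compute seconds elapsed since the user's first event.
# Despite the column name, events are not filtered to views: every event type counts.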
features = features.withColumn(
    "time_since_first_view",
    F.unix_timestamp("event_time") -
    F.unix_timestamp("first_event_time")
)


In [0]:
# View the new feature for the first 10 rows
features.select(
    "user_id",
    "event_time",
    "first_event_time",
    "time_since_first_view"
).limit(10).display()


user_id,event_time,first_event_time,time_since_first_view
237271696,2019-10-07T07:13:38.000Z,2019-10-07T07:13:38.000Z,0
237271696,2019-10-07T07:14:42.000Z,2019-10-07T07:13:38.000Z,64
237271696,2019-10-07T07:14:56.000Z,2019-10-07T07:13:38.000Z,78
237271696,2019-10-07T07:15:21.000Z,2019-10-07T07:13:38.000Z,103
239876607,2019-10-09T06:33:01.000Z,2019-10-09T06:33:01.000Z,0
239876607,2019-10-09T06:34:49.000Z,2019-10-09T06:33:01.000Z,108
239876607,2019-10-09T06:35:12.000Z,2019-10-09T06:33:01.000Z,131
239876607,2019-10-09T06:36:13.000Z,2019-10-09T06:33:01.000Z,192
239876607,2019-10-09T06:36:46.000Z,2019-10-09T06:33:01.000Z,225
239876607,2019-10-09T06:37:19.000Z,2019-10-09T06:33:01.000Z,258
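
The window logic above can be cross-checked by hand. The following plain-Python sketch recomputes the feature for a few of the sample rows shown in the output, using a dictionary of each user's earliest timestamp in place of the Spark window. The variable names (`sample_rows`, `first_seen`, `seconds_since_first`) are hypothetical and only serve this illustration.

```python
from datetime import datetime

# (user_id, event_time) pairs copied from the sample output above.
sample_rows = [
    (237271696, "2019-10-07T07:13:38"),
    (237271696, "2019-10-07T07:14:42"),
    (237271696, "2019-10-07T07:14:56"),
    (239876607, "2019-10-09T06:33:01"),
    (239876607, "2019-10-09T06:34:49"),
]

parsed = [(uid, datetime.fromisoformat(ts)) for uid, ts in sample_rows]

# Earliest timestamp per user, playing the role of first_event_time.
first_seen = {}
for uid, ts in parsed:
    first_seen[uid] = min(ts, first_seen.get(uid, ts))

# Seconds elapsed since each user's first event.
seconds_since_first = [
    (uid, int((ts - first_seen[uid]).total_seconds())) for uid, ts in parsed
]

for row in seconds_since_first:
    print(row)
```

The resulting values (0, 64, 78 for the first user and 0, 108 for the second) agree with the `time_since_first_view` column in the output.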
