# AI for Market Trend Analysis – Data Preparation & EDA  
## Project: Astro Coach Dashboard – Subscription Intelligence System  
**Author:** Gurmeet Singh  
**Course:** Minor in AI  
**Objective:** Build an end-to-end AI pipeline to analyze user behavior, predict demand trends, and support business decision-making for a freemium mobile application.


## 1. Data Loading and Initial Inspection

This section loads the raw datasets and performs an initial inspection
to understand data volume, structure, and potential quality issues.
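
The inspection steps in this section can also be bundled into one helper. The following is a minimal sketch, assuming pandas DataFrames like the ones loaded here; the function name `summarize_frame` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, distinct non-null count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

# Toy frame standing in for a raw export (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 2, 2, None],
    "isPremiumUser": ["YES", "NO", "NO", None],
})
summary = summarize_frame(toy)
print(summary)
```

Calling the same helper on `users`, `payments`, and `events` would give a comparable overview of each table in a single cell.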


In [None]:
import pandas as pd

users = pd.read_csv("../data/raw/subscribed_users.csv")
payments = pd.read_csv("../data/raw/payment_history.csv")
events = pd.read_csv("../data/raw/analytics_events.csv")

In [None]:
users.shape, payments.shape, events.shape


In [None]:
users.head()



In [None]:
payments.head()


In [None]:
events.head()

In [None]:
users.columns


In [None]:
payments.columns

In [None]:
events.columns

In [None]:
users.dtypes



In [None]:
payments.dtypes


In [None]:
events.dtypes

In [None]:
# Print each summary; a notebook cell only displays its last expression.
print(users.isna().sum())
print(payments.isna().sum())
print(events.isna().sum())


In [None]:
print("Unique users in users table:", users["user_id"].nunique())
print("Unique users with payment records:", payments["user_id"].nunique())
print("Unique users with event records:", events["user_id"].nunique())


In [None]:
users["isPremiumUserFlag"] = users["isPremiumUser"].map({"YES": 1, "NO": 0})


## 2. Datetime Normalization  

All timestamp columns are converted to standardized datetime formats with consistent timezone handling.  
This ensures that:

- Temporal comparisons are valid  
- Time-series aggregation is accurate  
- Churn windows and engagement lookback periods are computed correctly  

Proper datetime handling is critical in behavioral and subscription analytics.
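
As a small illustration of why `errors="coerce"` and `utc=True` are used together, the sketch below parses a toy series (made-up values, not project data). It shows two offsets resolving to the same UTC instant, and it counts values that were present but unparseable, which is worth checking because coercion silently turns them into `NaT`.

```python
import pandas as pd

raw = pd.Series([
    "2024-01-05T10:00:00+05:30",  # local time with an explicit offset
    "2024-01-05T04:30:00+00:00",  # the same instant expressed in UTC
    "not a date",                 # malformed value
    None,                         # genuinely missing value
])

parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# Values that existed in the raw data but could not be parsed.
n_coerced = int((parsed.isna() & raw.notna()).sum())

print(parsed)
print("Same instant after normalization:", parsed[0] == parsed[1])
print("Values lost to coercion:", n_coerced)
```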


In [None]:
date_cols_users = [
    "premium_starts_at",
    "LastPaymentDate",
    "premium_renewed_at",
    "premium_expires_at"
]
for col in date_cols_users:
    users[col] = pd.to_datetime(users[col], errors="coerce", utc=True)


In [None]:
date_cols_payments = [
    "PaymentDate",
    "premium_expires_at"
]

for col in date_cols_payments:
    payments[col] = pd.to_datetime(payments[col], errors="coerce", utc=True)

In [None]:
events["EventDate"] = pd.to_datetime(events["EventDate"], errors="coerce", utc=True)


In [None]:
print(users.dtypes)
print(payments.dtypes)
print(events.dtypes)


## 3. Handling Invalid Payment Records

A small number of payment records contain missing or invalid user identifiers.
Since these records cannot be reliably linked to any user, they are excluded
from behavioral and conversion analysis.

This situation is common in real-world systems due to logging gaps or
incomplete transaction metadata.
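
The exclusion rule can be sketched on toy data as follows. The values and the names `payments_toy` and `dropped_share` are hypothetical; the point is to record how many rows are removed so the share of unlinkable payments stays visible.

```python
import pandas as pd

# Toy payments with a mix of valid, empty, non-numeric, and missing IDs.
payments_toy = pd.DataFrame({
    "user_id": ["101", "102", "", "abc", None],
    "amount": [9.99, 9.99, 4.99, 4.99, 9.99],
})

ids = pd.to_numeric(payments_toy["user_id"], errors="coerce")
invalid_mask = ids.isna()

# Keep only linkable rows and store the ID as an integer.
clean = payments_toy.loc[~invalid_mask].assign(
    user_id=ids[~invalid_mask].astype("int64")
)
dropped_share = float(invalid_mask.mean())

print(clean)
print("Share of payments dropped:", dropped_share)
```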

In [None]:
users["user_id"] = pd.to_numeric(users["user_id"], errors="coerce")
payments["user_id"] = pd.to_numeric(payments["user_id"], errors="coerce")
events["user_id"] = pd.to_numeric(events["user_id"], errors="coerce")


In [None]:
print(users["user_id"].isna().sum())
print(payments["user_id"].isna().sum())
print(events["user_id"].isna().sum())


In [None]:
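invalid_payments = payments[payments["user_id"].isna()]
print("Invalid payment rows:", invalid_payments.shape[0])

# Drop payments that cannot be linked to a user, then restore integer IDs.
payments_clean = payments.dropna(subset=["user_id"]).copy()
payments_clean["user_id"] = payments_clean["user_id"].astype(int)

# astype(int) raises on NaN, so drop unlinked rows from users and events first.
users = users.dropna(subset=["user_id"]).copy()
events = events.dropna(subset=["user_id"]).copy()
users["user_id"] = users["user_id"].astype(int)
events["user_id"] = events["user_id"].astype(int)

payments_clean.shape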


In [None]:

pd.DataFrame({
    "Users": users.isna().sum(),
    "Payments": payments.isna().sum(),
    "Events": events.isna().sum()
})



In [None]:
# Calendar-date columns (UTC) used for daily aggregation.
payments_clean["payment_date"] = payments_clean["PaymentDate"].dt.date
events["event_date"] = events["EventDate"].dt.date


## 4. CORE EDA: MARKET & CONVERSION TRENDS

### Objective of this Section

This analysis examines how premium demand and user engagement evolve over time
to support:
- Time-series forecasting
- Conversion modeling
- Business decision-making


### 4.1 Daily Premium Demand Time Series

From the payment history, only successful payments are counted.

In [None]:
payments_clean["payment_status"].value_counts(dropna=False)

In [None]:

successful_payments = payments_clean[payments_clean["payment_status"] == "payment_success"]

successful_payments.head()



In [None]:
print("Total successful payments:", len(successful_payments))

In [None]:
daily_premium_demand = (
    successful_payments
    .groupby("payment_date")
    .size()
    .reset_index(name="premium_subscriptions")
)


## 5. Exploratory Data Analysis (EDA)  

This section explores user activity, payment behavior, and premium subscription trends to identify:
- Patterns in user engagement  
- Signals associated with premium conversion  
- Temporal trends in subscription demand  

The objective is not only visualization but also insight generation to guide model design.


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
plt.plot(
    daily_premium_demand["payment_date"],
    daily_premium_demand["premium_subscriptions"]
)
plt.title("Daily Premium Subscription Trend")
plt.xlabel("Date")
plt.ylabel("Number of Subscriptions")
plt.show()


### Observations from Premium Demand Trend  

Key insights from the daily premium subscription trend:

- Premium demand shows short-term fluctuations rather than a smooth pattern.  
- Occasional spikes suggest event-driven or campaign-driven behavior.  
- This volatility suggests that models able to use recent history, such as an LSTM, may fit the series better than a simple linear trend, although this should be checked against a naive baseline.  

These observations justify the choice of sequence-based models in later stages of the project.


## 5. Feature Engineering Strategy  

To enable predictive modeling, raw event logs are transformed into structured, per-user engagement features.

Design principles:
- Focus on **recent behavior** rather than lifetime activity  
- Use rolling time windows to capture intent-building patterns  
- Convert qualitative actions into quantitative signals  

Separate feature sets are prepared for:
- Conversion prediction (free → premium)  
- Churn prediction (premium renewal risk)  

This ensures that each model learns from the most relevant behavioral context.


In [None]:
daily_premium_demand = daily_premium_demand.sort_values("payment_date")
daily_premium_demand.to_csv(
    "../data/processed/daily_premium_demand.csv",
    index=False
)


### Insight from Demand Aggregation

The resulting time-series reveals that premium subscriptions occur in bursts
rather than uniformly over time. This pattern supports the choice of sequence-
based forecasting models and indicates that user decisions are influenced by
specific triggers such as campaigns or feature exposure.
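
One detail matters for bursty series like this: a `groupby` on payment date produces no row for days without any successful payment. The sketch below, on made-up numbers, fills those calendar gaps with zeros so that a later sliding window covers consecutive days. Whether the project data actually contains such gaps is an assumption to verify.

```python
import pandas as pd

# Toy daily counts with two missing days (2024-03-03 and 2024-03-04).
daily = pd.DataFrame({
    "payment_date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-05"]),
    "premium_subscriptions": [3, 1, 4],
})

full_range = pd.date_range(
    daily["payment_date"].min(), daily["payment_date"].max(), freq="D"
)
daily_full = (
    daily.set_index("payment_date")
    .reindex(full_range, fill_value=0)   # days without payments become 0
    .rename_axis("payment_date")
    .reset_index()
)
print(daily_full)
```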


## User Engagement EDA (Events → Conversion Signals)

**Objective of this step:** identify which user events typically precede a premium subscription.
This directly supports:

- Conversion modeling
- Business insights

In [None]:
conversion_dates = (
    successful_payments
    .groupby("user_id")["PaymentDate"]
    .min()
    .reset_index()
    .rename(columns={"PaymentDate": "conversion_date"})
)


In [None]:
conversion_dates.shape
conversion_dates.head()




In [None]:
events_with_conversion = events.merge(
    conversion_dates,
    on="user_id",
    how="left"
)
print(events_with_conversion["conversion_date"].notna().sum())

### Pre-Conversion Event Analysis

Identify which events meaningfully precede a premium subscription, in a way that is:

- Explainable
- Quantifiable
- Model-ready
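
The windowing logic can be illustrated on a toy example (made-up users and timestamps; the column names follow the notebook). Note that `.dt.days` floors the difference, so an event a few hours after conversion gets `-1` and is excluded, and users without a conversion date get `NaN` and are excluded as well.

```python
import pandas as pd

events_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "EventValue": ["open_HomeScreen", "day_clicked", "open_HomeScreen", "day_clicked"],
    "EventDate": pd.to_datetime(
        ["2024-02-01 09:00", "2024-02-09 18:00", "2024-02-11 08:00", "2024-02-03 12:00"],
        utc=True,
    ),
})
conversions_toy = pd.DataFrame({
    "user_id": [1],  # user 2 never converted
    "conversion_date": pd.to_datetime(["2024-02-10 12:00"], utc=True),
})

merged = events_toy.merge(conversions_toy, on="user_id", how="left")
merged["days_before_conversion"] = (
    merged["conversion_date"] - merged["EventDate"]
).dt.days

# Keep events from the 7 days leading up to conversion (inclusive bounds).
window = merged[merged["days_before_conversion"].between(0, 7)]
print(merged[["user_id", "EventValue", "days_before_conversion"]])
print(window[["user_id", "EventValue"]])
```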

In [None]:
events_with_conversion["days_before_conversion"] = (
    events_with_conversion["conversion_date"] -
    events_with_conversion["EventDate"]
).dt.days
events_with_conversion[["days_before_conversion"]].describe()


In [None]:
pre_conversion_events = events_with_conversion[
    (events_with_conversion["days_before_conversion"] >= 0) &
    (events_with_conversion["days_before_conversion"] <= 7)
]

print(pre_conversion_events.shape)
print(pre_conversion_events["user_id"].nunique())



In [None]:
top_events = (
    pre_conversion_events["EventValue"]
    .value_counts()
    .reset_index()
)

top_events.columns = ["event", "count"]

top_events.head(10)



In [None]:
import matplotlib.pyplot as plt

top_events.head(10).plot(
    kind="bar",
    x="event",
    y="count",
    figsize=(12, 5),
    legend=False
)

plt.title("Top Events in 7 Days Before Premium Subscription")
plt.xlabel("Event")
plt.ylabel("Event Frequency")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()


Users who convert typically exhibit repeated engagement behaviors such as calendar exploration, profile viewing, and frequent app sessions before subscribing.

### Building the Engagement Features Table (Model-Ready Data)

This step:

- Aggregates events per user
- Creates numerical features
- Prepares input for the conversion prediction model and the Streamlit filters

In [None]:
last_activity = (
    events.groupby("user_id")["EventDate"]
    .max()
    .reset_index()
    .rename(columns={"EventDate": "last_activity_date"})
)


In [None]:
user_reference_dates = users[["user_id", "isPremiumUserFlag"]].merge(
    conversion_dates,
    on="user_id",
    how="left"
).merge(
    last_activity,
    on="user_id",
    how="left"
)

# Premium users are anchored at their conversion date; everyone else at last activity.
user_reference_dates["reference_date"] = user_reference_dates["conversion_date"].where(
    user_reference_dates["isPremiumUserFlag"] == 1,
    user_reference_dates["last_activity_date"],
)


In [None]:
print(user_reference_dates["reference_date"].isna().sum())
user_reference_dates.head()



In [None]:
events_ref = events.merge(
    user_reference_dates[["user_id", "reference_date"]],
    on="user_id",
    how="left"
)


In [None]:
events_ref["days_before_reference"] = (
    events_ref["reference_date"] - events_ref["EventDate"]
).dt.days


In [None]:
events_7d = events_ref[
    (events_ref["days_before_reference"] >= 0) &
    (events_ref["days_before_reference"] <= 7)
]


In [None]:
important_events = top_events["event"].head(10).tolist()
print(important_events)

In [None]:
event_groups = {
    "engagement_activity": [
        "app_backgrounded",
        "app_foregrounded",
        "open_SplashScreen",
        "open_HomeScreen"
    ],
    "player_interaction": [
        "Player_DwnS",
        "Player_Dwn",
        "long_meditation_track_selected"
    ],
    "prediction_engagement": [
        "daily_predection_viewed",
        "daily_prediction_expand_clicked"
    ],
    "navigation_intent": [
        "day_clicked"
    ]
}


## Event Consolidation Strategy

Raw application events often represent the same underlying user behavior.
To improve model interpretability and reduce feature redundancy, related
events were grouped into higher-level behavioral signals.

For example:
- Navigation and app state events were consolidated into a single
  engagement activity feature.
- Media player actions were grouped as player interaction behavior.
- Prediction-related events were combined to represent prediction engagement.

This abstraction ensures that the models learn from meaningful behavioral
patterns rather than technical logging differences, resulting in more stable
and business-relevant predictions.
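
The consolidation can be sketched as a lookup from raw event name to behavioral group, followed by a per-user count. This is an illustrative alternative to looping over the groups, on made-up events; `event_groups_toy` is a shortened, hypothetical version of the mapping used in this notebook.

```python
import pandas as pd

event_groups_toy = {
    "engagement_activity": ["app_foregrounded", "open_HomeScreen"],
    "prediction_engagement": ["daily_prediction_expand_clicked"],
}
# Invert the mapping: raw event name -> behavioral group.
event_to_group = {
    event: group for group, names in event_groups_toy.items() for event in names
}

events_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "EventValue": [
        "app_foregrounded", "open_HomeScreen", "daily_prediction_expand_clicked",
        "open_HomeScreen", "unmapped_event",
    ],
})
events_toy["group"] = events_toy["EventValue"].map(event_to_group)

# Events outside every group are dropped before counting.
mapped = events_toy.dropna(subset=["group"])
group_counts = (
    pd.crosstab(mapped["user_id"], mapped["group"])
    .reindex(columns=list(event_groups_toy), fill_value=0)
    .reset_index()
)
print(group_counts)
```

Reindexing on the group names keeps a column for every group even when no user triggered it, which keeps the feature schema stable between runs.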


In [None]:
behavior_features = []

for feature_name, events_list in event_groups.items():
    temp = (
        events_7d[events_7d["EventValue"].isin(events_list)]
        .groupby("user_id")
        .size()
        .reset_index(name=feature_name)
    )
    behavior_features.append(temp)


In [None]:
from functools import reduce

behavior_df = reduce(
    lambda left, right: pd.merge(left, right, on="user_id", how="outer"),
    behavior_features
).fillna(0)


In [None]:
event_features = (
    events_7d[events_7d["EventValue"].isin(important_events)]
    .pivot_table(
        index="user_id",
        columns="EventValue",
        values="EventDate",
        aggfunc="count",
        fill_value=0
    )
    .reset_index()
)

print("Users with at least one important event:", len(event_features))
event_features.head()

In [None]:
feature_table = users[["user_id", "isPremiumUserFlag"]].merge(
    event_features,
    on="user_id",
    how="left"
)

feature_table.fillna(0, inplace=True)


In [None]:
print(feature_table.shape)
feature_table.head()



In [None]:
feature_table.to_csv(
    "../data/processed/user_engagement_features.csv",
    index=False
)


User interaction logs were aggregated into per-user engagement features using a 7-day lookback window prior to conversion or last activity. Event frequencies were used to capture intent-building behavior leading to premium subscription.

## Conversion Prediction Model (Free → Premium)

### Objective of This Step

Build a simple, explainable ML model that:

- Predicts whether a user will convert to premium
- Quantifies which events matter most

In [None]:
X = feature_table.drop(columns=["user_id", "isPremiumUserFlag"])
y = feature_table["isPremiumUserFlag"]


In [None]:
y.value_counts(normalize=True)


## Conversion Prediction Model
This section estimates the likelihood of a free user converting to
a premium subscriber based on recent behavioral patterns.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# 1. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

# 2. Model definition
conversion_model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    solver="liblinear"
)

# 3. Fit
conversion_model.fit(X_train, y_train)

# 4. Evaluation
from sklearn.metrics import classification_report
y_pred = conversion_model.predict(X_test)
print(classification_report(y_test, y_pred))

# 5. Feature importance
importance = pd.DataFrame({
    "feature": X.columns,
    "coefficient": conversion_model.coef_[0]
}).sort_values(by="coefficient", ascending=False)

importance.to_csv(
    "../data/processed/conversion_feature_importance.csv",
    index=False
)


The dataset exhibits significant class imbalance, with premium subscribers representing a small fraction of the total user base. As a result, overall accuracy can look high without reflecting the model's ability to identify premium users, and recall for premium users remains constrained. The model nevertheless fulfills its primary objective of identifying key behavioral drivers of conversion, and it serves as an explanatory tool rather than a production-grade classifier.
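
To make the accuracy caveat concrete, the sketch below uses synthetic data from `make_classification` (not the project data) with roughly 5% positives. It compares the accuracy of always predicting the majority class with metrics that focus on the minority class, namely recall and average precision. All names and parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: about 5% of samples are "premium".
X_toy, y_toy = make_classification(
    n_samples=4000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]

# A "model" that never predicts premium still scores high accuracy.
always_free = np.zeros_like(y_te)
majority_acc = accuracy_score(y_te, always_free)
majority_recall = recall_score(y_te, always_free)

model_recall = recall_score(y_te, pred)
model_ap = average_precision_score(y_te, proba)

print("Majority-class accuracy:", round(majority_acc, 3))
print("Majority-class recall  :", majority_recall)
print("Model recall (premium) :", round(model_recall, 3))
print("Model average precision:", round(model_ap, 3))
```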

In [None]:

importance.head(10)


In [None]:
import matplotlib.pyplot as plt

importance.head(10).plot(
    kind="bar",
    x="feature",
    y="coefficient",
    figsize=(12, 5),
    legend=False
)

plt.title("Top 10 Positive Event Coefficients")
plt.xlabel("Feature Name")
plt.ylabel("Coefficient Value")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()


Feature importance analysis using logistic regression indicates that explicit intent-driven interactions such as viewing the paywall, engaging with prediction content, and successful payment attempts are strongly associated with premium conversion. Payment-related events occur at or near the moment of conversion, so their coefficients likely reflect the conversion itself rather than an early signal and should be interpreted with caution. In contrast, passive navigation events and background app activity show limited or negative association with conversion, suggesting that repeated meaningful engagement rather than general app usage drives subscription decisions.

Users who actively engage with core features such as predictions, media content, and calendar navigation demonstrate significantly higher conversion likelihood, highlighting these behaviors as key indicators of purchase intent.
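
Raw logistic regression coefficients depend on each feature's scale, so frequent events can look less important than rare ones. A hedged sketch of one common remedy is shown below: standardize the features first and read the coefficients as odds ratios per standard deviation. The data is simulated and the feature names `paywall_views` and `app_sessions` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X_toy = pd.DataFrame({
    "paywall_views": rng.poisson(1.0, n),    # rare, high-intent event (simulated)
    "app_sessions": rng.poisson(30.0, n),    # frequent, low-intent event (simulated)
})
# Simulated ground truth: only paywall views affect conversion.
logit = -2.0 + 1.2 * X_toy["paywall_views"].to_numpy()
y_toy = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_toy, y_toy)

coefs = pipe.named_steps["logisticregression"].coef_[0]
report = pd.DataFrame({
    "feature": X_toy.columns,
    "std_coefficient": coefs,
    "odds_ratio_per_sd": np.exp(coefs),  # multiplicative change in odds per +1 SD
}).sort_values("std_coefficient", ascending=False)
print(report)
```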

## Time-Series Forecasting (LSTM)

### Premium Subscription Demand Forecasting

**Objective:** use historical premium subscription data to:

- Learn temporal patterns
- Capture trend and volatility
- Forecast future premium demand
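
Learning temporal patterns with an LSTM requires turning the series into supervised (window, next value) pairs. The following minimal sketch shows that transformation on a toy series so the resulting shapes are easy to verify; the helper name `make_windows` is hypothetical.

```python
import numpy as np

def make_windows(series, window_size):
    """Turn a 1-D series into (samples, window_size, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    if len(series) <= window_size:
        raise ValueError("series must be longer than window_size")
    X = np.stack([series[i:i + window_size] for i in range(len(series) - window_size)])
    y = series[window_size:]
    return X[..., np.newaxis], y  # trailing axis = one feature per time step

toy_series = np.arange(10)
X_win, y_win = make_windows(toy_series, window_size=3)

print(X_win.shape, y_win.shape)
print(X_win[0, :, 0], "->", y_win[0])
```

The same `window_size` has to be used when the model is trained and when it is later asked to forecast, since the input shape is fixed at training time.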

In [None]:
import pandas as pd

ts = pd.read_csv("../data/processed/daily_premium_demand.csv")


In [None]:
ts["payment_date"] = pd.to_datetime(ts["payment_date"])
ts = ts.sort_values("payment_date")
ts.head()


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))
plt.plot(ts["payment_date"], ts["premium_subscriptions"])
plt.title("Daily Premium Subscription Time Series")
plt.xlabel("Date")
plt.ylabel("Subscriptions")
plt.show()


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
ts["scaled_subscriptions"] = scaler.fit_transform(
    ts[["premium_subscriptions"]]
)


In [None]:
import numpy as np

def create_sequences(data, window_size=7):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i+window_size])
        y.append(data[i+window_size])
    return np.array(X), np.array(y)

values = ts["scaled_subscriptions"].values
X, y = create_sequences(values, window_size=7)


In [None]:
X.shape, y.shape


In [None]:
split = int(len(X) * 0.8)

X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


In [None]:
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))


In [None]:
import sys
!{sys.executable} -m pip install tensorflow


In [None]:
import tensorflow as tf
print(tf.__version__)


## Premium Demand Forecasting
This section models future premium subscription demand using
historical time-series data to support planning decisions.


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

lstm_model  = Sequential([
    LSTM(50, activation="relu", input_shape=(X_train.shape[1], 1)),
    Dense(1)
])

lstm_model.compile(optimizer="adam", loss="mse")
lstm_model.summary()


In [None]:
history = lstm_model.fit(
    X_train,
    y_train,
    epochs=30,
    batch_size=16,
    validation_split=0.2,
    verbose=1
)


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot(history.history["loss"], label="Training Loss")
plt.plot(history.history["val_loss"], label="Validation Loss")
plt.title("LSTM Training vs Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss (MSE)")
plt.legend()
plt.show()


Training loss decreases steadily, while validation loss fluctuates due to limited data and high volatility, which is expected in short time-series forecasting problems.

In [None]:
predictions = lstm_model.predict(X_test)


In [None]:
predictions_inv = scaler.inverse_transform(predictions)
y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))


In [None]:
plt.figure(figsize=(12, 4))
plt.plot(y_test_inv, label="Actual")
plt.plot(predictions_inv, label="Predicted")
plt.title("LSTM – Actual vs Predicted Premium Demand")
plt.xlabel("Time")
plt.ylabel("Subscriptions")
plt.legend()
plt.show()


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(y_test_inv, predictions_inv)
rmse = np.sqrt(mean_squared_error(y_test_inv, predictions_inv))

print("MAE:", mae)
print("RMSE:", rmse)


In [None]:
# The forecast window must match the window the model was trained on (7 days).
window_size = X_train.shape[1]
last_sequence = values[-window_size:]
forecast_scaled = []

current_seq = last_sequence.copy()

for _ in range(14):
    seq = current_seq.reshape((1, window_size, 1))
    next_val = lstm_model.predict(seq, verbose=0)[0][0]
    forecast_scaled.append(next_val)
    current_seq = np.append(current_seq[1:], next_val)


In [None]:
forecast = scaler.inverse_transform(
    np.array(forecast_scaled).reshape(-1, 1)
)


In [None]:
forecast_dates = pd.date_range(
    start=ts["payment_date"].max() + pd.Timedelta(days=1),
    periods=14
)

forecast_df = pd.DataFrame({
    "date": forecast_dates,
    "predicted_premium_subscriptions": forecast.flatten()
})


In [None]:
plt.figure(figsize=(12, 4))
plt.plot(ts["payment_date"], ts["premium_subscriptions"], label="Historical")
plt.plot(forecast_df["date"], forecast_df["predicted_premium_subscriptions"], label="Forecast")
plt.title("Premium Subscription Demand Forecast (Next 14 Days)")
plt.xlabel("Date")
plt.ylabel("Subscriptions")
plt.legend()
plt.show()


An LSTM model with a 7-day sliding window was used to forecast premium subscription demand. The historical demand data is limited and exhibits short-term volatility, so the forecast is best read as an indication of trend direction for high-level business planning and campaign timing rather than as a precise numerical forecast. The MAE and RMSE reported above should be compared against a simple baseline before relying on the model.
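
A simple way to put the LSTM error in context is to compare it with naive forecasts on the same test period. The sketch below, on made-up numbers, computes the MAE of a "same as yesterday" forecast (lag 1) and a "same as last week" forecast (lag 7). The helper `naive_forecast_mae` is hypothetical and not part of the original pipeline.

```python
import numpy as np

def naive_forecast_mae(series, lag, n_test):
    """MAE of predicting each of the last n_test values with the value `lag` steps earlier."""
    series = np.asarray(series, dtype=float)
    if n_test + lag > len(series):
        raise ValueError("series too short for this lag and test size")
    actual = series[-n_test:]
    predicted = series[-n_test - lag:-lag]
    return float(np.mean(np.abs(actual - predicted)))

# Toy daily subscription counts (illustrative only).
toy_counts = [5, 7, 6, 8, 9, 7, 10, 6, 8, 7, 9, 10, 8, 11]

mae_lag1 = naive_forecast_mae(toy_counts, lag=1, n_test=4)
mae_lag7 = naive_forecast_mae(toy_counts, lag=7, n_test=4)
print("MAE, yesterday's value:", mae_lag1)
print("MAE, same day last week:", mae_lag7)
```

If the LSTM's MAE on the test window is not clearly below these baselines, the simpler rule may be the more defensible choice for planning.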

In [None]:
forecast_df.to_csv(
    "../data/processed/premium_demand_forecast.csv",
    index=False
)


## How to Identify High-Potential Non-Premium Users

In [None]:
conversion_model.feature_names_in_


In [None]:
# Select features by name, in the order the fitted model expects.
required_features = conversion_model.feature_names_in_
X_full = feature_table[required_features]


In [None]:
type(X_full)
X_full.columns
X_full.shape


In [None]:
conversion_prob = conversion_model.predict_proba(X_full)[:, 1]


In [None]:
scored_users = feature_table.copy()
scored_users["conversion_probability"] = conversion_prob

# Save for Streamlit
scored_users.to_csv(
    "../data/processed/user_conversion_scores.csv",
    index=False
)

In [None]:
target_users = scored_users[
    (scored_users["isPremiumUserFlag"] == 0) &
    (scored_users["conversion_probability"] >= 0.4)
]


In [None]:
assert list(X_full.columns) == list(required_features)


In [None]:
target_users.head()

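## Churn Prediction for Premium Users

Churn modeling can be more valuable than new acquisition because:

- Retention is cheaper than acquisition
- Premium churn directly impacts revenue predictability

### Churn Definition

A premium user is considered churned if:

- Their `premium_expires_at` date has passed
- AND they have not renewed
- AND they show low recent engagement

This mirrors real subscription businesses. The label computed in the next cell applies only the first condition (expiry date in the past) as a proxy; the renewal and engagement conditions are not checked there.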
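
A label that applies all three conditions could look like the sketch below. It rests on several assumptions that are not confirmed by the notebook: a renewal is taken to mean `premium_renewed_at` falls on or after the expiry date, low engagement is defined by a hypothetical threshold `min_recent_events`, and the column `events_last_14d` is an illustrative precomputed count. A fixed snapshot date is used instead of the current time so the label is reproducible.

```python
import pandas as pd

as_of = pd.Timestamp("2024-06-30", tz="UTC")  # fixed snapshot date (assumption)
min_recent_events = 3                          # hypothetical engagement threshold

users_toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "premium_expires_at": pd.to_datetime(
        ["2024-06-01", "2024-06-01", "2024-08-01"], utc=True
    ),
    "premium_renewed_at": pd.to_datetime([None, "2024-06-02", None], utc=True),
    "events_last_14d": [0, 12, 5],
})

expired = users_toy["premium_expires_at"] < as_of
# Comparisons with NaT evaluate to False, so "never renewed" counts as not renewed.
renewed = users_toy["premium_renewed_at"] >= users_toy["premium_expires_at"]
low_engagement = users_toy["events_last_14d"] < min_recent_events

users_toy["is_churned"] = (expired & ~renewed & low_engagement).astype(int)
print(users_toy[["user_id", "is_churned"]])
```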

In [None]:
import pandas as pd
today = pd.Timestamp.now(tz="UTC")



users["is_churned"] = (
    (users["isPremiumUserFlag"] == 1) &
    (users["premium_expires_at"] < today)
).astype(int)



In [None]:
users["is_churned"].value_counts()
users[["premium_expires_at", "is_churned"]].head(10)


In [None]:
premium_users = users[users["isPremiumUserFlag"] == 1]
premium_users.shape


### Build Churn Feature Set

We reuse the engagement logic but shift the window: the churn signal is declining engagement before expiry.

In [None]:
premium_users_ref = premium_users[["user_id", "premium_expires_at", "is_churned"]]


In [None]:
events_churn = events.merge(
    premium_users_ref,
    on="user_id",
    how="inner"
)


In [None]:
events_churn["premium_expires_at"] = pd.to_datetime(
    events_churn["premium_expires_at"], utc=True
)

events_churn["EventDate"] = pd.to_datetime(
    events_churn["EventDate"], utc=True
)

events_churn["days_before_expiry"] = (
    events_churn["premium_expires_at"] - events_churn["EventDate"]
).dt.days

events_pre_expiry = events_churn[
    (events_churn["days_before_expiry"] >= 0) &
    (events_churn["days_before_expiry"] <= 14)
]


In [None]:
important_events = conversion_model.feature_names_in_

churn_features = (
    events_pre_expiry[events_pre_expiry["EventValue"].isin(important_events)]
    .pivot_table(
        index="user_id",
        columns="EventValue",
        values="EventDate",
        aggfunc="count",
        fill_value=0
    )
    .reset_index()
)


In [None]:
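churn_dataset = premium_users_ref.merge(
    churn_features,
    on="user_id",
    how="left"
)

# Fill only the event-count columns so a missing expiry date stays NaT.
count_cols = churn_dataset.columns.difference(["user_id", "premium_expires_at", "is_churned"])
churn_dataset[count_cols] = churn_dataset[count_cols].fillna(0)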


In [None]:
print(churn_dataset.shape)
churn_dataset.head()



## Premium Churn Risk Modeling
This section identifies premium users at risk of not renewing their
subscription based on declining engagement prior to expiry.
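
Raw event counts in the pre-expiry window describe the level of engagement but not its decline. One possible way to capture decline, sketched on toy data, is to compare the final 7 days before expiry with the 7 days before that. The column `engagement_change` and the window split are illustrative assumptions, not features used by the model in this notebook.

```python
import numpy as np
import pandas as pd

# Toy events: days_before_expiry as computed for premium users.
events_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "days_before_expiry": [13, 12, 10, 2, 9, 1],
})

recent = events_toy[events_toy["days_before_expiry"].between(0, 13)].copy()
# Days 0-6 form the last week before expiry, days 7-13 the week before that.
recent["window"] = np.where(recent["days_before_expiry"] <= 6, "last_7d", "prior_7d")

counts = (
    pd.crosstab(recent["user_id"], recent["window"])
    .reindex(columns=["prior_7d", "last_7d"], fill_value=0)
)
# Negative values mean activity dropped as expiry approached.
counts["engagement_change"] = counts["last_7d"] - counts["prior_7d"]
print(counts)
```

Users with no events in either window do not appear in `counts`, so they would need to be added back with zeros after merging onto the premium user list.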


In [None]:
y_churn = churn_dataset["is_churned"]
X_churn = churn_dataset.drop(columns=["user_id", "is_churned", "premium_expires_at"])


In [None]:
# 1. Train-test split
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_churn, y_churn,
    test_size=0.25,
    stratify=y_churn,
    random_state=42
)

# 2. Model definition
churn_model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    solver="liblinear"
)

# 3. Fit
churn_model.fit(X_train_c, y_train_c)

# 4. Evaluation
y_pred_c = churn_model.predict(X_test_c)
print(classification_report(y_test_c, y_pred_c))

# 5. Feature importance
churn_importance = pd.DataFrame({
    "feature": X_churn.columns,
    "coefficient": churn_model.coef_[0]
}).sort_values(by="coefficient")

churn_importance.to_csv(
    "../data/processed/churn_feature_importance.csv",
    index=False
)


In [None]:
churn_prob = churn_model.predict_proba(X_churn)[:, 1]

churn_scores = churn_dataset.copy()
churn_scores["churn_probability"] = churn_prob

churn_scores.to_csv(
    "../data/processed/premium_churn_scores.csv",
    index=False
)

at_risk_users = churn_scores[churn_scores["churn_probability"] >= 0.5]
at_risk_users.head()


## 6. Modeling Datasets Generated  

At the end of the data preparation phase, the following clean datasets are produced:

- `user_engagement_features.csv` – Aggregated event-based features  
- `user_conversion_scores.csv` – All users with predicted conversion likelihood (free users are selected by filtering on `isPremiumUserFlag == 0`)  
- `premium_churn_scores.csv` – Premium users with churn risk probabilities  
- `daily_premium_demand.csv` – Time-series of premium subscriptions  
- `premium_demand_forecast.csv` – Forecasted future demand  
- `conversion_feature_importance.csv` and `churn_feature_importance.csv` – Logistic regression coefficients per feature  

These datasets form the foundation for:
- Predictive modeling  
- Business insight generation  
- Dashboard visualization


In [None]:
# Re-normalize timestamps before analysing which events may indicate an upcoming churn

events_churn["premium_expires_at"] = pd.to_datetime(
    events_churn["premium_expires_at"], utc=True
)

events_churn["EventDate"] = pd.to_datetime(
    events_churn["EventDate"], utc=True
)

## Contribution of this Notebook

This notebook establishes the complete data foundation for the Astro Coach
AI system. It transforms raw operational data into structured, validated,
and business-ready datasets that directly support:

- Forecasting of premium demand
- Prediction of user conversion likelihood
- Identification of churn risk

This ensures that all downstream modeling and visualization steps are built
on reliable and well-documented data pipelines.


### 🔹 Ethical Considerations & Responsible AI

Astro Coach is designed as a decision support system, not an automated decision-making engine.
All predictions generated by the models are probabilistic and intended to guide human judgment, not replace it.

Key ethical considerations include:

- **Human-in-the-loop design:** Final business decisions remain with leadership teams.
- **Data privacy:** The project uses processed, non-personally identifiable data.
- **Bias awareness:** Behavioral models may reflect historical user patterns and therefore require continuous monitoring for bias.
- **Transparency:** Model outputs are presented with interpretability through feature importance and business context.

The system follows the principle of responsible AI adoption by prioritizing clarity, accountability, and ethical use.

### 🔹 Limitations of the System

While Astro Coach Dashboard delivers meaningful business insights, several limitations remain:

- The system is currently based on historical batch data and does not yet support real-time prediction.
- Forecasts depend on past trends and may not fully capture sudden market shifts.
- Predictions are probabilistic, not deterministic guarantees.
- Business effectiveness depends on how insights are operationalized by teams.

These limitations highlight the importance of using Astro Coach Dashboard as a strategic advisory tool, not an automated execution engine.

### 🔹 Conclusion & Future Scope

This project demonstrates how AI can move beyond traditional analytics to support decision intelligence in subscription-based businesses.

Astro Coach Dashboard successfully integrates:

- Behavioral analytics
- Predictive modeling
- Scenario-based simulations

to help leaders make informed, proactive decisions.

Future enhancements include:

- Real-time data pipelines
- Sentiment analysis from user feedback
- CRM and marketing automation integration
- Automated retention triggers
- Mobile-first executive dashboards

With these extensions, Astro Coach Dashboard can evolve from a decision intelligence platform into a decision automation ecosystem.