# Final Fall 2025 

## Project Overview

You have been hired by a data science company to complete a specific analytical task. Your responsibility is to complete the task and present your results in a notebook.

**Important:** Your audience (stakeholders, managers, and decision-makers) is not interested in code implementation details or technical minutiae. They want to understand:

- **Your findings** - What insights did you discover?
- **Your solution** - What approach did you take and why?
- **The results** - What are the outcomes and recommendations?

Focus on clear, concise explanations of your analysis, methodology, and conclusions. Use visualizations and summaries to communicate effectively. The code should support your narrative, not dominate it.

**Rubric:** https://docs.google.com/spreadsheets/d/1vgd5cFvqIrfhafaOu--UD1e3jF7Oh3h3hH_y3PyioZI/edit?gid=56981286#gid=56981286

# Option 1

In [1]:
from IPython.display import Markdown, display
display(Markdown("ecommerce_dataset_documentation.md"))


# E-commerce Customer Purchase Prediction

## Business Context

**Company A Solutions** is an e-commerce analytics company hired by online retailers to predict which customers are likely to make a purchase during their browsing session. This helps optimize marketing spend and personalize user experience in real-time.

This dataset does NOT include the target variable. The data scientist must analyze the data and may ask the subject matter expert (SME) whether a customer will purchase, but only a limited number of times.

## Dataset Overview

- **Total Records**: 50,000
- **Features**: 26
- **Target Variable**: NOT INCLUDED
- **Missing Values**: 10,000
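
The missing-value total above does not say how the gaps are spread across columns. A minimal sketch of a per-column audit follows. The toy frame and its values are invented for illustration; only the column names come from the documentation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sessions table (values are invented).
toy = pd.DataFrame({
    "page_views": [6.0, np.nan, 3.0, 1.0],
    "clicks": [2.0, 0.0, np.nan, np.nan],
    "device_type": ["mobile", None, "desktop", "tablet"],
})

# Count and share of missing cells per column, worst first.
missing = toy.isna().sum().sort_values(ascending=False)
share = (missing / len(toy)).round(2)
audit = pd.DataFrame({"missing": missing, "share": share})

print(audit)
print("total missing cells:", int(missing.sum()))
```

Running the same three lines on the real `df` would show which features need imputation before modeling.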

## Feature Categories

### 1. Temporal Features
- `session_start_time`: Timestamp when session began
- `hour`: Hour of day (0-23)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `month`: Month (1-12)
- `is_weekend`: Binary flag for weekend (1) vs weekday (0)
- `session_duration_minutes`: Length of session in minutes
- `days_since_last_visit`: Days since customer's last visit
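
Several of the temporal fields can be recomputed from `session_start_time`, which offers a quick consistency check. The sketch below assumes the stored fields follow the calendar conventions listed above; the two rows are copied from the `df.head()` sample of this dataset.

```python
import pandas as pd

# Two rows copied from the df.head() sample of the e-commerce dataset.
toy = pd.DataFrame({
    "session_start_time": ["2024-08-21 13:28:14", "2024-03-22 14:52:35"],
    "hour": [13, 14],
    "day_of_week": [2, 4],
    "is_weekend": [0, 0],
})

ts = pd.to_datetime(toy["session_start_time"])

# Recompute the calendar fields; pandas also numbers Monday as 0.
derived = pd.DataFrame({
    "hour": ts.dt.hour,
    "day_of_week": ts.dt.dayofweek,
    "is_weekend": (ts.dt.dayofweek >= 5).astype(int),
})

# True where every stored field agrees with its recomputed value.
consistent = (derived == toy[derived.columns]).all(axis=1)
print(consistent.tolist())
```

Rows where `consistent` is False would be worth inspecting before the derived fields are used as features.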

### 2. Behavioral Numerical Features
- `page_views`: Number of pages viewed during session
- `clicks`: Number of clicks made during session
- `avg_scroll_depth`: Average scroll depth (0-1)
- `items_added_to_cart`: Number of items added to cart
- `items_removed_from_cart`: Number of items removed from cart
- `search_queries_count`: Number of search queries made
- `product_page_time_minutes`: Time spent on product pages
- `categories_viewed_count`: Number of different categories viewed
- `avg_price_viewed`: Average price of products viewed ($)

### 3. Categorical Features
- `device_type`: mobile, desktop, tablet
- `traffic_source`: organic_search, paid_search, social_media, email, direct, referral
- `customer_segment`: new_customer, returning_customer, vip_customer, at_risk_customer
- `region`: North_America, Europe, Asia_Pacific, Latin_America, Middle_East_Africa
- `browser`: Chrome, Safari, Firefox, Edge, Other
- `operating_system`: Windows, macOS, iOS, Android, Linux
- `user_type`: new_user, returning_user

### 4. Text Features
- `search_queries`: Comma-separated search terms used
- `categories_viewed`: Comma-separated product categories viewed
- `user_agent`: Browser user agent string
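
Comma-separated text fields such as `categories_viewed` can be turned into one indicator column per term. This is a sketch, not a prescribed method: the example values are invented, and it assumes terms are separated by a bare comma with no surrounding spaces.

```python
import numpy as np
import pandas as pd

# Invented example values; only the column name comes from the documentation.
toy = pd.DataFrame({
    "categories_viewed": ["electronics,books", "books", np.nan, "toys,electronics"],
})

# One indicator column per distinct term; a missing row becomes all zeros.
multi_hot = (
    toy["categories_viewed"]
    .str.get_dummies(sep=",")
    .add_prefix("cat_")
)
print(multi_hot)
```

If the real data contains spaces after the commas, stripping them first (for example with `str.replace(", ", ",")`) avoids duplicate columns.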

## Using the Subject Matter Expert (SME)

The SME class allows you to query purchase probabilities for specific customer profiles.

### Example Usage:

```python
from ecommerce_sme import SME

# Initialize SME
sme = SME()

# Query purchase probability
# Pass a dictionary with feature values
purchase_prob = sme.ask({
    'device_type': 'desktop',
    'items_added_to_cart': 3,
    'session_duration_minutes': 25,
    'traffic_source': 'email'
})

print(f"Purchase probability: {purchase_prob}")
```

### Important Notes:
- You can query the SME up to 500 times (the SME has other tasks).
- Pass feature values as a dictionary
- String values should be passed as strings
- Numeric values should be passed as numbers
- If no matching records are found, the SME raises an exception
- The SME returns the mean purchase rate (0-1) for matching records
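
Because queries are limited and a query with no matches raises an exception, it may help to wrap the SME so that repeated questions are cached and the remaining budget is visible. The sketch below is hypothetical: `BudgetedSME` and `FakeSME` are illustrative names, `FakeSME` only mimics the documented `ask()` interface, and the assumption that a failed query also counts against the budget is not confirmed by the documentation.

```python
class FakeSME:
    """Stand-in with the documented ask() interface; NOT the real SME."""

    def ask(self, query):
        if query.get("device_type") == "smart_fridge":
            raise ValueError("no matching records")
        return 0.25  # arbitrary placeholder rate


class BudgetedSME:
    """Hypothetical wrapper: caches answers and tracks the query budget."""

    def __init__(self, sme, budget=500):
        self.sme = sme
        self.remaining = budget
        self.cache = {}

    def ask(self, query):
        key = tuple(sorted(query.items()))
        if key in self.cache:          # repeated question: no budget spent
            return self.cache[key]
        if self.remaining <= 0:
            raise RuntimeError("SME query budget exhausted")
        self.remaining -= 1            # assumed: failed queries also count
        try:
            answer = self.sme.ask(query)
        except Exception:              # exception type is not documented
            answer = None              # treated as "no matching records"
        self.cache[key] = answer
        return answer


sme = BudgetedSME(FakeSME())
q = {"device_type": "desktop", "traffic_source": "email"}
print(sme.ask(q), sme.ask(q), sme.remaining)
print(sme.ask({"device_type": "smart_fridge"}), sme.remaining)
```

With the real class, `BudgetedSME(SME())` would be used in place of the stand-in.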


In [2]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/2025/ecommerce_sessions_X.csv')
df.shape

(50000, 26)

In [3]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| session_start_time | 2024-08-21 13:28:14 | 2024-01-10 19:20:38 | 2024-04-10 14:22:10 | 2024-03-22 14:52:35 | 2024-09-25 09:02:21 |
| hour | 13 | 19 | 14 | 14 | 9 |
| day_of_week | 2 | 2 | 2 | 4 | 2 |
| month | 8 | 1 | 4 | 3 | 9 |
| is_weekend | 0 | 0 | 0 | 0 | 0 |
| session_duration_minutes | 17.323063 | 3.268847 | 16.22868 | 4.920244 | 13.404538 |
| days_since_last_visit | 3.044042 | 11.887877 | 0.264852 | 13.660919 | 0.30211 |
| page_views | 6.132345 | 0.884203 | 3.060268 | 1.014015 | 3.821892 |
| clicks | 2.069694 | 0.0 | 1.042415 | 0.967902 | 2.012747 |
| avg_scroll_depth | 0.261632 | 0.590109 | 0.223406 | 0.403105 | 0.310772 |


# Option 2

In [4]:
display(Markdown("churn_dataset_documentation.md"))


# Streaming Service Customer Churn Prediction

## Business Context

**StreamFlix** is a popular streaming service that wants to identify customers who are likely to cancel their subscription within the next 30 days. Early identification allows the company to implement targeted retention strategies, such as personalized content recommendations, promotional offers, or customer support outreach, to reduce churn and maintain revenue.

This dataset does NOT include the target variable. The data scientist must analyze the data and may ask the subject matter expert (SME) whether a customer will churn, but only a limited number of times.

## Dataset Overview

- **Total Records**: 50,000
- **Features**: 37
- **Target Variable**: NOT INCLUDED
- **Missing Values**: 12,500

## Feature Categories

### 1. Subscription Features
- `subscription_start_date`: Date when customer first subscribed
- `subscription_length_days`: Number of days since subscription started
- `subscription_length_months`: Number of months since subscription started
- `subscription_plan`: basic, standard, premium, family, student
- `monthly_price`: Monthly subscription price ($)
- `billing_cycle`: monthly, annual
- `payment_method`: credit_card, debit_card, paypal, apple_pay, google_pay, bank_transfer

### 2. Usage Behavioral Features (Last 30 Days)
- `total_watch_time_hours`: Total hours of content watched
- `sessions_count`: Number of viewing sessions
- `avg_session_duration_minutes`: Average length of viewing sessions
- `days_since_last_watch`: Days since customer last watched content
- `unique_titles_watched`: Number of different titles watched
- `content_completion_rate`: Percentage of started content that was completed
- `abandoned_series_count`: Number of series started but not finished
- `binge_sessions_count`: Number of binge-watching sessions (>3 hours)

### 3. Content Preference Features
- `primary_genre`: Most watched genre (Action, Comedy, Drama, etc.)
- `genres_watched_count`: Number of different genres watched
- `genres_watched`: Comma-separated list of genres watched
- `content_type_preference`: movies, tv_shows, both
- `new_content_ratio`: Ratio of new releases vs catalog content watched

### 4. Device and Platform Features
- `primary_device`: smart_tv, mobile, tablet, desktop, game_console, streaming_device
- `devices_used_count`: Number of different devices used
- `devices_used`: Comma-separated list of devices used
- `primary_os`: Operating system of primary device

### 5. Engagement and Interaction Features
- `has_profile`: Binary flag for profile creation (1) or not (0)
- `number_of_profiles`: Number of user profiles on account
- `watchlist_size`: Number of titles in watchlist
- `ratings_given_count`: Number of content ratings provided
- `reviews_written_count`: Number of reviews written
- `downloads_count`: Number of titles downloaded for offline viewing

### 6. Support and Service Features
- `support_tickets_count`: Number of support tickets opened (last 90 days)
- `support_ticket_reasons`: Comma-separated reasons for support tickets
- `payment_failures_count`: Number of payment failures (last 90 days)
- `account_on_hold`: Binary flag for account suspension (1) or not (0)
- `used_free_trial`: Binary flag if customer used free trial before subscribing

### 7. Temporal Features
- `current_month`: Current month (1-12) for seasonality analysis
- `days_until_next_billing`: Days until next billing date

## Using the Subject Matter Expert (SME)

The SME class allows you to query churn probabilities for specific customer profiles.

### Example Usage:

```python
from streamflix_sme import SME

# Initialize SME
sme = SME()

# Query churn probability
# Pass a dictionary with feature values
churn_prob = sme.ask({
    'subscription_plan': 'basic',
    'days_since_last_watch': 25,
    'total_watch_time_hours': 3,
    'payment_failures_count': 2
})

print(f"Churn probability: {churn_prob}")
```

### Important Notes:
- You can query the SME up to 500 times (the SME has other tasks).
- Pass feature values as a dictionary
- String values should be passed as strings
- Numeric values should be passed as numbers
- If no matching records are found, the SME raises an exception
- The SME returns the mean churn rate (0-1) for matching records
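
Since the SME returns a mean rate for all matching records, one inexpensive way to spend part of the budget is a single-feature query per category value, which yields marginal churn rates. The sketch below is illustrative only: `FakeSME` is a stand-in with invented rates, and the real exception type is not documented, so a broad `except` is used.

```python
class FakeSME:
    """Stand-in with the documented ask() interface; rates are invented."""

    _rates = {"basic": 0.30, "standard": 0.20, "premium": 0.10}

    def ask(self, query):
        plan = query.get("subscription_plan")
        if plan not in self._rates:
            raise ValueError("no matching records")
        return self._rates[plan]


sme = FakeSME()

# Plan names as listed in the dataset documentation.
plans = ["basic", "standard", "premium", "family", "student"]

# One query per plan: five of the 500 available questions.
marginal = {}
for plan in plans:
    try:
        marginal[plan] = sme.ask({"subscription_plan": plan})
    except Exception:
        marginal[plan] = None  # no matching records for this value

# Plans with an answer, highest churn rate first.
ranked = sorted(
    (p for p in marginal if marginal[p] is not None),
    key=marginal.get,
    reverse=True,
)
print(ranked)
```

Marginal rates like these can guide which feature combinations deserve the remaining queries.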


In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/2025/streaming_churn_dataset.csv')
df.shape

(50000, 37)

In [6]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| subscription_start_date | 2022-11-07 | 2023-02-28 | 2024-01-31 | 2021-05-21 | 2023-04-05 |
| subscription_length_days | 785 | 672 | 335 | 1320 | 636 |
| subscription_plan | basic | standard | standard | standard | family |
| monthly_price | 9.99 | 15.99 | 15.99 | 15.99 | 24.99 |
| billing_cycle | monthly | monthly | monthly | monthly | annual |
| payment_method | credit_card | apple_pay | debit_card | credit_card | credit_card |
| total_watch_time_hours | 74.963172 | 13.848148 | 43.368652 | 28.69851 | 204.641244 |
| sessions_count | 29.742333 | 8.341995 | 20.970448 | 13.858491 | 95.626485 |
| avg_session_duration_minutes | 152.370633 | 102.980411 | 120.414698 | 112.382035 | 119.821246 |
| days_since_last_watch | 5.58705 | 5.350973 | 10.167759 | 20.525603 | 6.219125 |


# Production Notebook

In [7]:
# Placeholder that predicts at random. Don't use it as your model.
from random import randint

class TestModel:
    def predict(self, X):
        return [ randint(0, 1) for _ in range(len(X))]
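
The random placeholder above stands in for a trained model that the production notebook has to load. One possible approach, sketched here under assumptions, is to bundle preprocessing and the classifier in a scikit-learn `Pipeline` and persist it with `joblib`. The training rows, labels, choice of classifier, and file name are all invented for illustration; only the column names come from the documentation.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented training rows and labels (the real labels come from SME queries).
X = pd.DataFrame({
    "days_since_last_watch": [1.0, 25.0, None, 30.0, 2.0, 28.0],
    "subscription_plan": ["premium", "basic", "basic", "basic", "family", "standard"],
})
y = [0, 1, 1, 1, 0, 1]

# Impute numeric gaps and one-hot encode categories inside the pipeline,
# so the production notebook needs no separate preprocessing code.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["days_since_last_watch"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["subscription_plan"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)

# Persist in the training notebook, reload in the production notebook.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")  # hypothetical file name
joblib.dump(model, path)
loaded = joblib.load(path)

pred = loaded.predict(X)
print(len(pred))
```

Keeping the preprocessing inside the saved object means `loaded.predict(df_X)` can be called directly on the raw feature columns the pipeline was trained on.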

In [9]:
# In the second notebook

import pandas as pd
from sklearn.metrics import classification_report

def production(X_path, y_path):
    # Load your trained model here; TestModel is only a placeholder.
    model = TestModel()

    # Load the features.
    df_X = pd.read_csv(X_path)

    # Apply the same preprocessing used during training, if required.
    # -------------------------

    # -------------------------
    pred = model.predict(df_X)

    # The label column is chosen from the file name in y_path.
    y_label = 'will_churn' if 'churn' in y_path else 'will_purchase'

    df_y = pd.read_csv(y_path)[y_label]
    print(classification_report(df_y, pred))


production(
    X_path='https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/2025/streaming_churn_dataset.csv',
    y_path='https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/2025/streaming_churn_dataset_y.csv'
)
    

              precision    recall  f1-score   support

           0       0.81      0.50      0.62     40674
           1       0.19      0.50      0.27      9326

    accuracy                           0.50     50000
   macro avg       0.50      0.50      0.44     50000
weighted avg       0.70      0.50      0.55     50000
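
The report above is what random guessing produces on imbalanced classes, which makes it a useful floor for judging a real model. The arithmetic below uses only the support counts from the report; the coin-flip figures are expected values, not measured results.

```python
# Class counts taken from the support column of the report.
n_neg, n_pos = 40674, 9326
total = n_neg + n_pos

prevalence = n_pos / total          # share of class 1
majority_accuracy = n_neg / total   # accuracy of always predicting 0

# A fair coin flags about half of each class, so its expected precision
# on class 1 equals the prevalence and its expected recall is 0.5.
coin_precision = (0.5 * n_pos) / (0.5 * n_pos + 0.5 * n_neg)
coin_recall = 0.5

print(round(prevalence, 2), round(majority_accuracy, 2), round(coin_precision, 2))
```

Always predicting class 0 would score higher accuracy than the coin while finding no positives at all, so recall and precision on class 1 are more informative here than accuracy.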

