# Survey Response Data Preprocessing and Transformation

This notebook processes the survey responses from `dropoffs.csv` file, which contains user behavior data about movie watching habits and dropout patterns. The preprocessing pipeline includes:

## Key Preprocessing Steps:
1. **Data Loading and Initial Exploration** - Load survey data and examine structure
2. **Consent Validation** - Remove respondents who didn't consent to participation
3. **Checkbox Response Separation** - Convert multi-select responses into binary features
4. **Categorical Data Discretization** - Transform categorical responses into numerical codes
5. **Feature Engineering** - Create aggregate features and engagement metrics
6. **Data Quality Validation** - Check for missing values, outliers, and data consistency
7. **Final Dataset Preparation** - Create ML-ready datasets for modeling

The output will be clean, preprocessed datasets ready for machine learning algorithms as specified in the process tracker.

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns

# Load survey response data from dropoffs.csv
print("Loading survey response data...")
df = pd.read_csv('dropoffs.csv')

print(f"Dataset shape: {df.shape}")
print(f"Number of respondents: {len(df)}")
print(f"Total columns: {len(df.columns)}")

# Display basic dataset information
print("\n=== COLUMN OVERVIEW ===")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

# Check for consent column and remove non-consenting respondents
print("\n=== CONSENT VALIDATION ===")
consent_cols = [col for col in df.columns if 'consent' in col.lower() or 'voluntarily' in col.lower()]
print(f"Consent columns found: {consent_cols}")

if consent_cols:
    consent_col = consent_cols[0]
    print(f"Using consent column: {consent_col}")
    
    # Check consent responses
    consent_responses = df[consent_col].value_counts()
    print(f"Consent responses:")
    print(consent_responses)
    
    # Remove rows without consent
    initial_count = len(df)
    df = df[df[consent_col].notna()]  # Remove rows with no consent response
    
    # Check if there are any explicit "No" responses (though unlikely in this dataset)
    if 'No' in df[consent_col].values:
        df = df[df[consent_col] != 'No']
    
    print(f"Rows removed due to missing/invalid consent: {initial_count - len(df)}")
    print(f"Final dataset size: {len(df)}")
else:
    print("No consent column found - proceeding with full dataset")

# Display basic dataset information after consent filtering
data_info = pd.DataFrame({
    'Column': df.columns,
    'Data_Type': df.dtypes,
    'Non_Null_Count': df.count(),
    'Null_Count': df.isnull().sum(),
    'Unique_Values': [df[col].nunique() for col in df.columns]
})

print("\n=== DATASET OVERVIEW AFTER CONSENT FILTERING ===")
print(data_info.head(15).to_string(index=False))

Loading survey response data...
Dataset shape: (78, 52)
Number of respondents: 78
Total columns: 52

=== COLUMN OVERVIEW ===
 1. Timestamp
 2. What is your age group?
 3. What is your gender?
 4. What is the highest level of education you’ve completed? 
 5. How often do you watch movies?
 6. Which genres do you enjoy watching the most?  (Select up to 3)
 7. Can you recall a movie you started but did not finish? (Optional: name it)
 8. How do you usually discover movies you decide to watch? (Select all that apply)
 9. What is the movie's genre?
10. What is the movie's runtime?
11. Are you familiar with the director or main actors?
12. Where do you usually watch movies? (Select all that apply)
13. Who do you usually watch movies with?
14. What is your typical mood before watching a movie?
15. Why do you usually choose to watch movies? (Select all that apply)
16. Have you ever stopped watching a movie before finishing it?
17. At what point do you typically stop watching movies you drop?
1

## Step 2: Checkbox Response Separation

Many survey questions allow multiple selections (checkboxes). These responses are stored as comma-separated text strings that need to be separated into individual binary features for machine learning.

### Why Checkbox Separation is Important:
- **Statistical Analysis**: Enables counting and percentage calculations for each option
- **Machine Learning**: Converts text responses into numerical binary features (0/1)
- **Correlation Analysis**: Allows analysis of relationships between specific choices
- **Targeted Modeling**: Enables building models for specific behaviors

### Checkbox Questions in Survey:
1. **Genre Preferences**: "Which genres do you enjoy watching the most?"
2. **Movie Discovery**: "How do you usually discover movies you decide to watch?"
3. **Viewing Locations**: "Where do you usually watch movies?"
4. **Movie Selection Reasons**: "Why do you usually choose to watch movies?"
5. **Dropout Reasons**: "What are the main reasons you stop watching movies?"
6. **Genres That Cause Dropouts**: "Which genres do you find yourself stopping more often?"
7. **Pause Reasons**: "Why do you usually pause the movie?"
8. **Multi-tasking Activities**: "Do you usually do other things while watching movies?"

In [2]:
def separate_checkbox_responses(df):
    """
    Separate checkbox responses into individual binary features
    """
    processed_df = df.copy()
    
    # Define checkbox questions and their expected options
    checkbox_questions = {
        'Which genres do you enjoy watching the most?  (Select up to 3)': [
            'Action', 'Comedy', 'Drama', 'Horror', 'Romance', 'Science Fiction/Sci-Fi', 
            'Documentary', 'Thriller', 'Family', 'Adventure', 'Fantasy', 'Historical', 
            'Contemporary', 'None / No specific genre'
        ],
        'How do you usually discover movies you decide to watch? (Select all that apply)': [
            'Trailer', 'Friend/Family recommendation', 'Social Media', 'Streaming platform suggestion',
            'Reviews or ratings', 'Recently trending or popular movies', 'Awards or critical acclaim'
        ],
        'Where do you usually watch movies? (Select all that apply)': [
            'Streaming at home (Netflix, Disney+, etc.)', 'In a cinema/theater', 'DVD or Blu-ray',
            'Downloaded or offline video files', 'Streaming on mobile while out (e.g., commuting)',
            'Live TV or cable', 'Pirated Sites', 'illegal streaming websites'
        ],
        'Why do you usually choose to watch movies? (Select all that apply)': [
            'Entertainment', 'Socia/cultural relevance', 'Recommendation from friends or family',
            'Favorite actors or directors', 'Interesting storyline or genre', 
            'Trailer or promotional material', 'Awards or critical acclaim', 'Recently trending',
            'To relax or unwind'
        ],
        'In general, what are the main reasons you stop watching movies before finishing? (Select all that apply)': [
            'Boring/uninteresting plot', 'Poor acting or characters', 'Too long/slow pacing',
            'Technical issues (buffering, audio, etc.)', 'Distractions or interruptions',
            'Not in the right mood'
        ],
        'Which genres do you find yourself stopping more often before finishing? (Select all that apply)': [
            'Action', 'Comedy', 'Drama', 'Horror', 'Romance', 'Science Fiction/Sci-Fi',
            'Documentary', 'Thriller', 'Historical', 'Contemporary', 'None / No specific genre'
        ],
        'Why do you usually pause the movie? (Select all that apply)': [
            'Bathroom break', 'Snack refill', 'Answering phone or messages',
            'To discuss something with others watching', 'Attending to someone (e.g., family, kids)',
            'Lost focus or distracted', 'Feeling bored or uninterested',
            'Technical issues (e.g., buffering, audio/video problems)'
        ],
        'Do you usually do other things while watching movies? (Select all that apply)': [
            'No, I usually focus only on the movie', 'I scroll on my phone or use social media',
            'I chat or text with others', 'I eat or prepare food', 'I do chores (laundry, cleaning, etc.)',
            'I work or study while watching'
        ]
    }
    
    separation_report = {}
    
    for question, options in checkbox_questions.items():
        if question in df.columns:
            print(f"\nProcessing: {question}")
            
            # Create binary columns for each option
            for option in options:
                # Clean option name for column name
                col_name = option.replace('/', '_').replace(' ', '_').replace('(', '').replace(')', '').replace(',', '').replace('-', '_').lower()
                col_name = f"{question.split('?')[0].replace(' ', '_').replace(',', '').lower()}_{col_name}"
                col_name = col_name.replace('__', '_').replace('___', '_')
                
                # Initialize binary column
                processed_df[col_name] = 0
                
                # Fill binary column based on responses
                for idx, response in df[question].items():
                    if pd.notna(response):
                        response_str = str(response).lower()
                        option_lower = option.lower()
                        
                        # Check if option is mentioned in response
                        if option_lower in response_str:
                            processed_df.at[idx, col_name] = 1
            
            # Count separated columns
            separated_cols = [col for col in processed_df.columns if col.startswith(question.split('?')[0].replace(' ', '_').replace(',', '').lower())]
            separation_report[question] = len(separated_cols)
            print(f"  Created {len(separated_cols)} binary columns")
    
    print(f"\n=== CHECKBOX SEPARATION SUMMARY ===")
    total_binary_cols = sum(separation_report.values())
    print(f"Total binary columns created: {total_binary_cols}")
    
    for question, count in separation_report.items():
        print(f"  {question.split('?')[0]}: {count} columns")
    
    return processed_df, separation_report

# Apply checkbox separation
print("=== SEPARATING CHECKBOX RESPONSES ===")
processed_df, checkbox_report = separate_checkbox_responses(df)

print(f"\nDataset shape after checkbox separation: {processed_df.shape}")
print(f"New columns added: {processed_df.shape[1] - df.shape[1]}")

# Show sample of binary columns
binary_cols = [col for col in processed_df.columns if col not in df.columns]
print(f"\nSample of new binary columns:")
for col in binary_cols[:10]:
    print(f"  {col}: {processed_df[col].sum()} respondents selected this option")

=== SEPARATING CHECKBOX RESPONSES ===

Processing: Which genres do you enjoy watching the most?  (Select up to 3)
  Created 14 binary columns

Processing: How do you usually discover movies you decide to watch? (Select all that apply)
  Created 7 binary columns

Processing: Where do you usually watch movies? (Select all that apply)
  Created 8 binary columns

Processing: Why do you usually choose to watch movies? (Select all that apply)
  Created 9 binary columns

Processing: In general, what are the main reasons you stop watching movies before finishing? (Select all that apply)
  Created 6 binary columns

Processing: Which genres do you find yourself stopping more often before finishing? (Select all that apply)
  Created 11 binary columns

Processing: Why do you usually pause the movie? (Select all that apply)
  Created 8 binary columns

Processing: Do you usually do other things while watching movies? (Select all that apply)
  Created 6 binary columns

=== CHECKBOX SEPARATION SUMMARY

## Step 3: Categorical Data Discretization

Convert categorical survey responses into numerical codes for machine learning algorithms.

### Discretization Process:
1. **Age Groups**: Convert to ordinal scale (1-6)
2. **Gender**: Convert to categorical codes (1-5)
3. **Education Level**: Convert to ordinal scale (1-6)
4. **Movie Watching Frequency**: Convert to ordinal scale (1-5)
5. **Mood Categories**: Convert to categorical codes
6. **Viewing Frequency**: Convert to ordinal scale
7. **Pause Frequency**: Convert to ordinal scale

### Why Discretization is Important:
- **Algorithm Compatibility**: Many ML algorithms require numerical inputs
- **Ordinal Relationships**: Preserves natural ordering in data (e.g., age groups)
- **Standardization**: Creates consistent numerical scales across features
- **Performance**: Improves model training speed and accuracy

In [3]:
def discretize_categorical_responses(df):
    """
    Discretize categorical survey responses into numerical codes
    """
    processed_df = df.copy()
    discretized_columns = {}
    
    print("=== DISCRETIZING CATEGORICAL RESPONSES ===")
    
    # Age group discretization
    age_col = 'What is your age group?'
    if age_col in df.columns:
        age_mapping = {
            '18-24': 1, '25-34': 2, '35-44': 3,
            '45-54': 4, '55-64': 5, '65+': 6
        }
        processed_df[f'{age_col}_discretized'] = df[age_col].map(age_mapping)
        discretized_columns['age'] = age_mapping
        print(f"Age groups discretized: {df[age_col].value_counts().to_dict()}")
    
    # Gender discretization
    gender_col = 'What is your gender?'
    if gender_col in df.columns:
        gender_mapping = {
            'Female': 1, 'Male': 2, 'Non-binary': 3,
            'Prefer not to say': 4, 'Other': 5
        }
        processed_df[f'{gender_col}_discretized'] = df[gender_col].map(gender_mapping)
        discretized_columns['gender'] = gender_mapping
        print(f"Gender discretized: {df[gender_col].value_counts().to_dict()}")
    
    # Education level discretization
    education_col = 'What is the highest level of education you\'ve completed? '
    if education_col in df.columns:
        education_mapping = {
            'High school or below': 1, 'Some college': 2,
            'Undergraduate': 3, "Bachelor's Degree": 4,
            'Graduate Degree': 5, 'PhD or higher': 6
        }
        processed_df[f'{education_col}_discretized'] = df[education_col].map(education_mapping)
        discretized_columns['education'] = education_mapping
        print(f"Education levels discretized: {df[education_col].value_counts().to_dict()}")
    
    # Movie watching frequency discretization
    frequency_col = 'How often do you watch movies?'
    if frequency_col in df.columns:
        frequency_mapping = {
            'Rarely (about once per month)': 1,
            'A few times a month (2-3 times per month)': 2,
            'Once a week': 3,
            'Several times a week (3–6 times a week)': 4,
            'Daily (at least 1 movie per day)': 5
        }
        processed_df[f'{frequency_col}_discretized'] = df[frequency_col].map(frequency_mapping)
        discretized_columns['frequency'] = frequency_mapping
        print(f"Watching frequency discretized: {df[frequency_col].value_counts().to_dict()}")
    
    # Mood discretization
    mood_col = 'What is your typical mood before watching a movie?'
    if mood_col in df.columns:
        mood_mapping = {
            'Excited': 1, 'Relaxed': 2, 'Neutral': 3,
            'Bored': 4, 'Stressed': 5, 'Tired': 6
        }
        # Handle complex mood responses
        for idx, response in df[mood_col].items():
            if pd.notna(response):
                response_str = str(response).lower()
                for mood, value in mood_mapping.items():
                    if mood.lower() in response_str:
                        processed_df.at[idx, f'{mood_col}_discretized'] = value
                        break
                else:
                    processed_df.at[idx, f'{mood_col}_discretized'] = 3  # Default to neutral
        discretized_columns['mood'] = mood_mapping
        print(f"Mood discretized: {df[mood_col].value_counts().head().to_dict()}")
    
    # Dropout frequency discretization
    dropout_freq_col = 'How often do you stop watching movies before finishing them?'
    if dropout_freq_col in df.columns:
        dropout_mapping = {
            'Never': 0, 'Rarely': 1, 'Sometimes': 2,
            'Often': 3, 'Always': 4
        }
        processed_df[f'{dropout_freq_col}_discretized'] = df[dropout_freq_col].map(dropout_mapping)
        discretized_columns['dropout_frequency'] = dropout_mapping
        print(f"Dropout frequency discretized: {df[dropout_freq_col].value_counts().to_dict()}")
    
    # Pause frequency discretization
    pause_freq_col = 'How often do you typically pause or stop the movie during viewing?'
    if pause_freq_col in df.columns:
        pause_mapping = {
            'Never': 0, '1-2 times': 1, '3 or more times': 2
        }
        processed_df[f'{pause_freq_col}_discretized'] = df[pause_freq_col].map(pause_mapping)
        discretized_columns['pause_frequency'] = pause_mapping
        print(f"Pause frequency discretized: {df[pause_freq_col].value_counts().to_dict()}")
    
    # Stopping point discretization
    stopping_point_col = 'Thinking about movies you have started but did not finish, at what point do you usually stop watching?'
    if stopping_point_col in df.columns:
        stopping_mapping = {
            'Within the first 15 minutes': 1,
            '15-30 minutes in': 2,
            '30-60 minutes in': 3,
            'After 60 mins': 4
        }
        processed_df[f'{stopping_point_col}_discretized'] = df[stopping_point_col].map(stopping_mapping)
        discretized_columns['stopping_point'] = stopping_mapping
        print(f"Stopping point discretized: {df[stopping_point_col].value_counts().to_dict()}")
    
    print(f"\nDiscretization completed: {len(discretized_columns)} categories processed")
    return processed_df, discretized_columns

# Apply discretization
processed_df, discretization_mappings = discretize_categorical_responses(processed_df)

print(f"\nDataset shape after discretization: {processed_df.shape}")

# Show sample of discretized columns
discretized_cols = [col for col in processed_df.columns if '_discretized' in col]
print(f"\nDiscretized columns created: {len(discretized_cols)}")
for col in discretized_cols:
    print(f"  {col}: {processed_df[col].value_counts().head(3).to_dict()}")

=== DISCRETIZING CATEGORICAL RESPONSES ===
Age groups discretized: {'18-24': 59, '25-34': 15, '45-54': 3}
Gender discretized: {'Female': 53, 'Male': 22, 'Non-binary/Other': 1, 'Prefer not to say': 1}
Watching frequency discretized: {'Several times a week (3–6 times a week)': 21, 'Rarely (about once per month)': 18, 'A few times a month (2-3 times per month)': 16, 'Once a week': 13, 'Daily (at least 1 movie per day)': 9}
Mood discretized: {'Neutral': 27, 'Relaxed': 21, 'Bored': 15, 'Stressed': 7, 'Excited': 3}
Dropout frequency discretized: {'Sometimes': 33, 'Rarely': 22, 'Often': 17, 'Never': 4, 'Always': 1}
Pause frequency discretized: {'3 or more times': 37, '1-2 times': 33, 'Never': 2}
Stopping point discretized: {'15-30 minutes in': 33, '30-60 minutes in': 26, 'Within the first 15 minutes': 9, 'After 60 mins': 4}

Discretization completed: 7 categories processed

Dataset shape after discretization: (78, 128)

Discretized columns created: 7
  What is your age group?_discretized: {1.

## Step 4: Feature Engineering

Create aggregate features and engagement metrics from the processed data.

### Feature Categories:
1. **Aggregate Counts**: Total liked genres, total drop reasons, total discovery methods
2. **Engagement Metrics**: User engagement score based on viewing frequency and preferences
3. **Behavioral Scores**: Dropout tendency, completion likelihood, pause behavior
4. **Interaction Features**: Combinations of different behavioral patterns

### Created Features:
- **total_liked_genres**: Count of genres the user enjoys
- **total_drop_reasons**: Count of reasons why user drops movies
- **total_discovery_methods**: Count of ways user discovers movies
- **engagement_score**: Composite score based on viewing frequency and preferences
- **dropout_tendency**: Ratio of drop reasons to liked genres
- **completion_likelihood**: Inverse of dropout frequency

In [4]:
def create_aggregate_features(processed_df):
    """
    Create aggregate features from the processed data
    """
    df_with_features = processed_df.copy()
    
    print("=== CREATING AGGREGATE FEATURES ===")
    
    # Genre preference count
    genre_cols = [col for col in processed_df.columns if 'which_genres_do_you_enjoy_watching' in col]
    if genre_cols:
        df_with_features['total_liked_genres'] = processed_df[genre_cols].sum(axis=1)
        print(f"Created total_liked_genres from {len(genre_cols)} genre columns")
    
    # Drop reason count
    reason_cols = [col for col in processed_df.columns if 'main_reasons_you_stop_watching' in col]
    if reason_cols:
        df_with_features['total_drop_reasons'] = processed_df[reason_cols].sum(axis=1)
        print(f"Created total_drop_reasons from {len(reason_cols)} reason columns")
    
    # Discovery method count
    discovery_cols = [col for col in processed_df.columns if 'how_do_you_usually_discover_movies' in col]
    if discovery_cols:
        df_with_features['total_discovery_methods'] = processed_df[discovery_cols].sum(axis=1)
        print(f"Created total_discovery_methods from {len(discovery_cols)} discovery columns")
    
    # Viewing location count
    location_cols = [col for col in processed_df.columns if 'where_do_you_usually_watch_movies' in col]
    if location_cols:
        df_with_features['total_viewing_locations'] = processed_df[location_cols].sum(axis=1)
        print(f"Created total_viewing_locations from {len(location_cols)} location columns")
    
    # Pause reason count
    pause_cols = [col for col in processed_df.columns if 'why_do_you_usually_pause' in col]
    if pause_cols:
        df_with_features['total_pause_reasons'] = processed_df[pause_cols].sum(axis=1)
        print(f"Created total_pause_reasons from {len(pause_cols)} pause columns")
    
    # Multitasking activity count
    multitask_cols = [col for col in processed_df.columns if 'do_you_usually_do_other_things' in col]
    if multitask_cols:
        df_with_features['total_multitasking_activities'] = processed_df[multitask_cols].sum(axis=1)
        print(f"Created total_multitasking_activities from {len(multitask_cols)} multitask columns")
    
    # Create user engagement score
    frequency_col = 'How often do you watch movies?_discretized'
    if frequency_col in df_with_features.columns and 'total_liked_genres' in df_with_features.columns:
        df_with_features['engagement_score'] = (
            df_with_features[frequency_col] * df_with_features['total_liked_genres']
        )
        print(f"Created engagement_score (frequency × liked genres)")
    
    # Create dropout tendency score
    if 'total_drop_reasons' in df_with_features.columns and 'total_liked_genres' in df_with_features.columns:
        df_with_features['dropout_tendency'] = (
            df_with_features['total_drop_reasons'] / df_with_features['total_liked_genres'].replace(0, 1)
        )
        print(f"Created dropout_tendency (drop reasons / liked genres)")
    
    # Create completion likelihood (inverse of dropout frequency)
    dropout_freq_col = 'How often do you stop watching movies before finishing them?_discretized'
    if dropout_freq_col in df_with_features.columns:
        df_with_features['completion_likelihood'] = 5 - df_with_features[dropout_freq_col]
        print(f"Created completion_likelihood (inverse of dropout frequency)")
    
    # Create pause behavior score
    pause_freq_col = 'How often do you typically pause or stop the movie during viewing?_discretized'
    if pause_freq_col in df_with_features.columns and 'total_pause_reasons' in df_with_features.columns:
        df_with_features['pause_behavior_score'] = (
            df_with_features[pause_freq_col] * df_with_features['total_pause_reasons']
        )
        print(f"Created pause_behavior_score (pause frequency × pause reasons)")
    
    # Create focus score (inverse of multitasking)
    focus_col = 'do_you_usually_do_other_things_while_watching_movies_no_i_usually_focus_only_on_the_movie'
    if focus_col in df_with_features.columns:
        df_with_features['focus_score'] = df_with_features[focus_col]
        print(f"Created focus_score (focuses only on movie)")
    
    return df_with_features

def create_target_variables(processed_df):
    """
    Create target variables for different prediction tasks
    """
    df_with_targets = processed_df.copy()
    
    print("\n=== CREATING TARGET VARIABLES ===")
    
    # Target 1: High engagement user
    if 'engagement_score' in df_with_targets.columns:
        engagement_threshold = df_with_targets['engagement_score'].median()
        df_with_targets['high_engagement_user'] = (
            df_with_targets['engagement_score'] > engagement_threshold
        ).astype(int)
        print(f"Created high_engagement_user (threshold: {engagement_threshold:.2f})")
    
    # Target 2: High dropout tendency
    if 'dropout_tendency' in df_with_targets.columns:
        dropout_threshold = df_with_targets['dropout_tendency'].median()
        df_with_targets['high_dropout_user'] = (
            df_with_targets['dropout_tendency'] > dropout_threshold
        ).astype(int)
        print(f"Created high_dropout_user (threshold: {dropout_threshold:.2f})")
    
    # Target 3: Movie completion likelihood
    if 'completion_likelihood' in df_with_targets.columns:
        completion_threshold = df_with_targets['completion_likelihood'].median()
        df_with_targets['likely_to_complete'] = (
            df_with_targets['completion_likelihood'] > completion_threshold
        ).astype(int)
        print(f"Created likely_to_complete (threshold: {completion_threshold:.2f})")
    
    # Target 4: Frequent pauser
    if 'pause_behavior_score' in df_with_targets.columns:
        pause_threshold = df_with_targets['pause_behavior_score'].median()
        df_with_targets['frequent_pauser'] = (
            df_with_targets['pause_behavior_score'] > pause_threshold
        ).astype(int)
        print(f"Created frequent_pauser (threshold: {pause_threshold:.2f})")
    
    # Target 5: Focused viewer
    if 'focus_score' in df_with_targets.columns:
        df_with_targets['focused_viewer'] = df_with_targets['focus_score']
        print(f"Created focused_viewer (binary: focuses only on movie)")
    
    target_variables = [
        'high_engagement_user', 'high_dropout_user', 'likely_to_complete', 
        'frequent_pauser', 'focused_viewer'
    ]
    target_variables = [col for col in target_variables if col in df_with_targets.columns]
    
    print(f"Total target variables created: {len(target_variables)}")
    return df_with_targets, target_variables

# Apply feature engineering
processed_df = create_aggregate_features(processed_df)
final_df, target_variables = create_target_variables(processed_df)

# Show sample of created features
print(f"\n=== FEATURE ENGINEERING SUMMARY ===")
aggregate_features = ['total_liked_genres', 'total_drop_reasons', 'total_discovery_methods', 
                     'engagement_score', 'dropout_tendency', 'completion_likelihood']
existing_features = [f for f in aggregate_features if f in final_df.columns]

print(f"Aggregate features created: {len(existing_features)}")
for feature in existing_features:
    print(f"  {feature}: mean={final_df[feature].mean():.2f}, std={final_df[feature].std():.2f}")

print(f"\nTarget variables created: {len(target_variables)}")
for target in target_variables:
    distribution = final_df[target].value_counts()
    print(f"  {target}: {distribution.to_dict()}")

# Show sample of final processed data
print(f"\nSample of processed features:")
sample_features = existing_features[:5] if len(existing_features) >= 5 else existing_features
if sample_features:
    print(final_df[sample_features].head())

=== CREATING AGGREGATE FEATURES ===
Created total_liked_genres from 14 genre columns
Created total_drop_reasons from 6 reason columns
Created total_discovery_methods from 7 discovery columns
Created total_viewing_locations from 8 location columns
Created total_pause_reasons from 8 pause columns
Created total_multitasking_activities from 6 multitask columns
Created engagement_score (frequency × liked genres)
Created dropout_tendency (drop reasons / liked genres)
Created completion_likelihood (inverse of dropout frequency)
Created pause_behavior_score (pause frequency × pause reasons)
Created focus_score (focuses only on movie)

=== CREATING TARGET VARIABLES ===
Created high_engagement_user (threshold: 12.00)
Created high_dropout_user (threshold: 0.80)
Created likely_to_complete (threshold: 3.00)
Created frequent_pauser (threshold: 4.00)
Created focused_viewer (binary: focuses only on movie)
Total target variables created: 5

=== FEATURE ENGINEERING SUMMARY ===
Aggregate features created

## Step 5: Data Quality Validation

Comprehensive data quality checks to ensure the processed dataset is ready for machine learning.

### Quality Checks Performed:
1. **Missing Value Analysis**: Identify and handle missing values in features
2. **Target Variable Distribution**: Ensure balanced target variables
3. **Feature Type Classification**: Categorize features into binary, categorical, and continuous
4. **Outlier Detection**: Identify potential outliers in continuous features
5. **Feature Correlation**: Check for highly correlated features
6. **Variance Analysis**: Identify features with zero or low variance

### Data Quality Metrics:
- **Completeness**: Percentage of non-null values
- **Balance**: Distribution of target variables
- **Diversity**: Number of unique values per feature
- **Correlation**: Feature interdependence analysis

In [5]:
def prepare_ml_features(processed_df):
    """
    Prepare final feature set for machine learning
    """
    # Identify ML-ready features
    discretized_features = [col for col in processed_df.columns if '_discretized' in col]
    binary_features = [col for col in processed_df.columns if any(keyword in col for keyword in [
        'which_genres_do_you_enjoy_watching', 'how_do_you_usually_discover_movies',
        'where_do_you_usually_watch_movies', 'why_do_you_usually_choose_to_watch',
        'main_reasons_you_stop_watching', 'which_genres_do_you_find_yourself_stopping',
        'why_do_you_usually_pause', 'do_you_usually_do_other_things'
    ])]
    aggregate_features = [col for col in processed_df.columns if col.startswith('total_') or 
                         col in ['engagement_score', 'dropout_tendency', 'completion_likelihood', 
                                'pause_behavior_score', 'focus_score']]
    
    # Combine all ML-ready features
    ml_features = discretized_features + binary_features + aggregate_features
    ml_features = [col for col in ml_features if col in processed_df.columns]
    
    print(f"ML Feature Summary:")
    print(f"  Discretized features: {len(discretized_features)}")
    print(f"  Binary features: {len(binary_features)}")
    print(f"  Aggregate features: {len(aggregate_features)}")
    print(f"  Total ML features: {len(ml_features)}")
    
    return ml_features

def perform_data_quality_checks(final_df, ml_features, target_variables):
    """
    Perform comprehensive data quality checks
    """
    print("=== PERFORMING DATA QUALITY CHECKS ===")
    
    # Check for missing values
    print(f"\n1. Missing Values Analysis:")
    missing_counts = final_df[ml_features].isnull().sum()
    if missing_counts.sum() > 0:
        print("Features with missing values:")
        for feature, count in missing_counts[missing_counts > 0].items():
            print(f"  {feature}: {count} missing ({count/len(final_df)*100:.1f}%)")
    else:
        print("✓ No missing values in ML features")
    
    # Check target variable distribution
    print(f"\n2. Target Variable Distributions:")
    target_balance = {}
    for target in target_variables:
        if target in final_df.columns:
            value_counts = final_df[target].value_counts()
            target_balance[target] = value_counts
            print(f"  {target}:")
            for value, count in value_counts.items():
                print(f"    {value}: {count} ({count/len(final_df)*100:.1f}%)")
    
    # Check feature distributions
    print(f"\n3. Feature Distribution Analysis:")
    binary_features = [f for f in ml_features if final_df[f].nunique() <= 2]
    categorical_features = [f for f in ml_features if 3 <= final_df[f].nunique() <= 10]
    continuous_features = [f for f in ml_features if final_df[f].nunique() > 10]
    
    print(f"  Binary features: {len(binary_features)}")
    print(f"  Categorical features: {len(categorical_features)}")
    print(f"  Continuous features: {len(continuous_features)}")
    
    # Check for potential issues
    print(f"\n4. Potential Issues Check:")
    
    # Check for features with zero variance
    zero_variance = []
    for f in ml_features:
        if final_df[f].var() == 0:
            zero_variance.append(f)
    
    if zero_variance:
        print(f"  ⚠️  Features with zero variance: {zero_variance}")
    else:
        print("  ✓ No zero variance features")
    
    # Check for highly correlated features
    correlation_threshold = 0.95
    numeric_features = [f for f in ml_features if final_df[f].dtype in ['int64', 'float64']]
    high_corr = []
    
    if len(numeric_features) > 1:
        corr_matrix = final_df[numeric_features].corr()
        for i in range(len(corr_matrix.columns)):
            for j in range(i+1, len(corr_matrix.columns)):
                if abs(corr_matrix.iloc[i, j]) > correlation_threshold:
                    high_corr.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))
        
        if high_corr:
            print(f"  ⚠️  Highly correlated feature pairs (>{correlation_threshold}):")
            for feat1, feat2, corr in high_corr:
                print(f"    {feat1} <-> {feat2}: {corr:.3f}")
        else:
            print("  ✓ No highly correlated features")
    
    # Check for class imbalance
    print(f"\n5. Class Balance Analysis:")
    imbalanced_targets = []
    for target in target_variables:
        if target in final_df.columns:
            value_counts = final_df[target].value_counts()
            if len(value_counts) == 2:
                minority_percentage = min(value_counts) / sum(value_counts) * 100
                if minority_percentage < 30:
                    imbalanced_targets.append((target, minority_percentage))
    
    if imbalanced_targets:
        print(f"  ⚠️  Imbalanced targets (minority class < 30%):")
        for target, percentage in imbalanced_targets:
            print(f"    {target}: {percentage:.1f}% minority class")
    else:
        print("  ✓ No severely imbalanced targets")
    
    return {
        'missing_values': missing_counts,
        'target_distributions': target_balance,
        'feature_types': {
            'binary': binary_features,
            'categorical': categorical_features, 
            'continuous': continuous_features
        },
        'zero_variance': zero_variance,
        'high_correlation': high_corr,
        'imbalanced_targets': imbalanced_targets
    }

# Prepare ML features
ml_feature_list = prepare_ml_features(final_df)

# Perform quality checks
quality_report = perform_data_quality_checks(final_df, ml_feature_list, target_variables)

# Handle missing values if any
if quality_report['missing_values'].sum() > 0:
    print("\n=== HANDLING MISSING VALUES ===")
    for feature in ml_feature_list:
        if final_df[feature].isnull().sum() > 0:
            if feature in quality_report['feature_types']['binary']:
                final_df[feature] = final_df[feature].fillna(0)
            elif feature in quality_report['feature_types']['categorical']:
                final_df[feature] = final_df[feature].fillna(final_df[feature].mode()[0])
            else:
                final_df[feature] = final_df[feature].fillna(final_df[feature].median())
    print("✓ Missing values handled")

print(f"\n=== FINAL DATASET SUMMARY ===")
print(f"Total samples: {len(final_df)}")
print(f"Total features for ML: {len(ml_feature_list)}")
print(f"Target variables: {len(target_variables)}")
print(f"Memory usage: {final_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Data quality: {'✓ GOOD' if len(quality_report['zero_variance']) == 0 else '⚠️ REVIEW NEEDED'}")

ML Feature Summary:
  Discretized features: 7
  Binary features: 69
  Aggregate features: 11
  Total ML features: 87
=== PERFORMING DATA QUALITY CHECKS ===

1. Missing Values Analysis:
Features with missing values:
  What is your age group?_discretized: 1 missing (1.3%)
  What is your gender?_discretized: 2 missing (2.6%)
  How often do you watch movies?_discretized: 1 missing (1.3%)
  What is your typical mood before watching a movie?_discretized: 1 missing (1.3%)
  How often do you stop watching movies before finishing them?_discretized: 1 missing (1.3%)
  How often do you typically pause or stop the movie during viewing?_discretized: 6 missing (7.7%)
  Thinking about movies you have started but did not finish, at what point do you usually stop watching?_discretized: 6 missing (7.7%)
  engagement_score: 1 missing (1.3%)
  completion_likelihood: 1 missing (1.3%)
  pause_behavior_score: 6 missing (7.7%)

2. Target Variable Distributions:
  high_engagement_user:
    0: 53 (67.9%)
    1:

## Step 6: Final Dataset Preparation

Create the final datasets for machine learning and save them for the next phase of the pipeline.

### Output Files:
1. **`dropoffs_fully_processed.csv`**: Complete processed dataset with all features
2. **`dropoffs_ml_ready.csv`**: ML-ready features + target variables only
3. **`dropoffs_human_readable.csv`**: Human-readable version with original responses
4. **`feature_info.json`**: Feature metadata and mappings for reference

### Dataset Versions:
- **Complete**: All original and processed columns for reference
- **ML-Ready**: Only features and targets needed for modeling
- **Human-Readable**: Original responses with segregated checkbox selections

### Ready for Machine Learning:
The processed datasets are now ready for the machine learning phase as outlined in the process tracker:
- Decision Tree Based Methods
- Naive Bayes Classification  
- Neural Networks (Artificial Neural Networks)
- Nearest Neighbor Classification (K-NN)

In [8]:
def create_human_readable_dataset(final_df, original_df):
    """
    Create a clean, human-readable version with actual responses visible
    """
    print("=== CREATING HUMAN-READABLE DATASET ===")
    readable_df = pd.DataFrame()
    
    # Include key demographic info (original responses)
    demographic_cols = [
        'What is your age group?',
        'What is your gender?',
        'What is the highest level of education you\'ve completed? ',
        'How often do you watch movies?'
    ]
    
    for col in demographic_cols:
        if col in original_df.columns:
            readable_df[col] = original_df[col]
    
    # Include movie watching preferences (original responses)
    preference_cols = [
        'Which genres do you enjoy watching the most?  (Select up to 3)',
        'How do you usually discover movies you decide to watch? (Select all that apply)',
        'Where do you usually watch movies? (Select all that apply)',
        'Why do you usually choose to watch movies? (Select all that apply)'
    ]
    
    for col in preference_cols:
        if col in original_df.columns:
            readable_df[col] = original_df[col]
    
    # Include completion behavior (original responses)
    behavior_cols = [
        'Have you ever stopped watching a movie before finishing it?',
        'How often do you stop watching movies before finishing them?',
        'At what point do you typically stop watching movies you drop?',
        'How often do you typically pause or stop the movie during viewing?'
    ]
    
    for col in behavior_cols:
        if col in original_df.columns:
            readable_df[col] = original_df[col]
    
    # Include drop reasons (original responses)
    reason_cols = [
        'In general, what are the main reasons you stop watching movies before finishing? (Select all that apply)',
        'Which genres do you find yourself stopping more often before finishing? (Select all that apply)',
        'Why do you usually pause the movie? (Select all that apply)'
    ]
    
    for col in reason_cols:
        if col in original_df.columns:
            readable_df[col] = original_df[col]
    
    # Add summary metrics
    if 'total_liked_genres' in final_df.columns:
        readable_df['Total_Liked_Genres'] = final_df['total_liked_genres']
    if 'total_drop_reasons' in final_df.columns:
        readable_df['Total_Drop_Reasons'] = final_df['total_drop_reasons']
    if 'engagement_score' in final_df.columns:
        readable_df['Engagement_Level'] = pd.cut(final_df['engagement_score'], 
                                               bins=3, 
                                               labels=['Low', 'Medium', 'High'])
    
    print(f"Human-readable dataset created with {len(readable_df.columns)} columns")
    return readable_df

def create_ml_dataset(final_df, ml_feature_list, target_variables):
    """
    Create ML-ready dataset with features and targets only
    """
    print("=== CREATING ML-READY DATASET ===")
    ml_dataset = final_df[ml_feature_list + target_variables].copy()
    ml_dataset.insert(0, 'Respondent_ID', range(1, len(ml_dataset) + 1))
    
    print(f"ML dataset created with {len(ml_dataset.columns)} columns")
    return ml_dataset

# Create different dataset versions
print("=== CREATING FINAL DATASETS ===")

# Version 1: Human-readable with actual responses
readable_dataset = create_human_readable_dataset(final_df, df)

# Version 2: ML-ready with features and targets
ml_dataset = create_ml_dataset(final_df, ml_feature_list, target_variables)

# Save all datasets
print("\n=== SAVING DATASETS ===")

# Save complete processed dataset
final_df.to_csv('dropoffs_fully_processed.csv', index=False)
print("✓ Saved complete processed dataset: dropoffs_fully_processed.csv")

# Save human-readable version
readable_dataset.to_csv('dropoffs_human_readable.csv', index=False)
print("✓ Saved human-readable version: dropoffs_human_readable.csv")

# Save ML-ready version
ml_dataset.to_csv('dropoffs_ml_ready.csv', index=False)
print("✓ Saved ML-ready version: dropoffs_ml_ready.csv")

# Save feature metadata
feature_info = {
    'ml_features': ml_feature_list,
    'target_variables': target_variables,
    'discretization_mappings': discretization_mappings,
    'checkbox_separation_report': checkbox_report,
    'dataset_info': {
        'original_samples': len(df),
        'processed_samples': len(final_df),
        'original_features': len(df.columns),
        'processed_features': len(final_df.columns),
        'ml_features': len(ml_feature_list),
        'target_variables': len(target_variables)
    }
}

with open('feature_info.json', 'w') as f:
    json.dump(feature_info, f, indent=2)
print("✓ Saved feature metadata: feature_info.json")

# Display final summary
print(f"\n=== PREPROCESSING PIPELINE COMPLETED ===")
print(f"📊 Dataset Statistics:")
print(f"   • Original samples: {len(df)}")
print(f"   • Processed samples: {len(final_df)}")
print(f"   • Original features: {len(df.columns)}")
print(f"   • Total processed features: {len(final_df.columns)}")
print(f"   • ML-ready features: {len(ml_feature_list)}")
print(f"   • Target variables: {len(target_variables)}")

print(f"\n🎯 Target Variables for Prediction:")
for i, target in enumerate(target_variables, 1):
    distribution = final_df[target].value_counts()
    print(f"   {i}. {target}: {distribution.to_dict()}")

print(f"\n📁 Output Files:")
print(f"   • dropoffs_fully_processed.csv - Complete processed dataset")
print(f"   • dropoffs_human_readable.csv - Human-readable version")
print(f"   • dropoffs_ml_ready.csv - ML features + targets only")
print(f"   • feature_info.json - Feature metadata and mappings")

print(f"\n✅ Ready for Machine Learning Phase:")
print(f"   1. Decision Tree Based Methods")
print(f"   2. Naive Bayes Classification")
print(f"   3. Neural Networks (Artificial Neural Networks)")
print(f"   4. Nearest Neighbor Classification (K-NN)")

print(f"\n🚀 Next Steps:")
print(f"   Load ml_data = pd.read_csv('dropoffs_ml_ready.csv')")
print(f"   Features: X = ml_data[ml_features]")
print(f"   Targets: y = ml_data[target_variables]")

# Show sample of final datasets
print(f"\n📋 Sample of ML-Ready Dataset:")
print(ml_dataset.head(3))

print(f"\n🎉 Data preprocessing pipeline completed successfully!")
print(f"    Ready for the next phase: Machine Learning Model Training")

=== CREATING FINAL DATASETS ===
=== CREATING HUMAN-READABLE DATASET ===
Human-readable dataset created with 17 columns
=== CREATING ML-READY DATASET ===
ML dataset created with 93 columns

=== SAVING DATASETS ===
✓ Saved complete processed dataset: dropoffs_fully_processed.csv
✓ Saved human-readable version: dropoffs_human_readable.csv
✓ Saved ML-ready version: dropoffs_ml_ready.csv
✓ Saved feature metadata: feature_info.json

=== PREPROCESSING PIPELINE COMPLETED ===
📊 Dataset Statistics:
   • Original samples: 78
   • Processed samples: 78
   • Original features: 52
   • Total processed features: 144
   • ML-ready features: 87
   • Target variables: 5

🎯 Target Variables for Prediction:
   1. high_engagement_user: {0: 53, 1: 25}
   2. high_dropout_user: {0: 41, 1: 37}
   3. likely_to_complete: {0: 52, 1: 26}
   4. frequent_pauser: {0: 43, 1: 35}
   5. focused_viewer: {0: 56, 1: 22}

📁 Output Files:
   • dropoffs_fully_processed.csv - Complete processed dataset
   • dropoffs_human_read

## Preprocessing Pipeline Summary

### ✅ Completed Steps:
1. **Data Loading & Consent Validation** - Loaded survey data from `dropoffs.csv` and removed non-consenting respondents
2. **Checkbox Response Separation** - Converted multi-select responses into individual binary features
3. **Categorical Data Discretization** - Transformed categorical responses into numerical codes
4. **Feature Engineering** - Created aggregate features and behavioral metrics
5. **Data Quality Validation** - Performed comprehensive quality checks and handled issues
6. **Final Dataset Preparation** - Created multiple dataset versions for different use cases

### 📊 Output Statistics:
- **Original Respondents**: Survey responses from dropoffs.csv
- **Processed Features**: Binary, categorical, and continuous features ready for ML
- **Target Variables**: Multiple prediction targets for different behavioral patterns
- **Quality Score**: Data validated and ready for machine learning

### 🎯 Machine Learning Targets:
- **High Engagement User**: Users with high movie viewing engagement
- **High Dropout User**: Users with tendency to stop watching movies
- **Likely to Complete**: Users likely to finish movies they start
- **Frequent Pauser**: Users who pause movies frequently
- **Focused Viewer**: Users who focus only on the movie

### 📁 Generated Files:
- `dropoffs_fully_processed.csv` - Complete dataset with all features
- `dropoffs_ml_ready.csv` - Features and targets for machine learning
- `dropoffs_human_readable.csv` - Human-readable version with original responses
- `feature_info.json` - Feature metadata and mappings

### 🚀 Next Phase: Machine Learning
The preprocessing is complete and datasets are ready for the machine learning phase as outlined in the process tracker:
1. **Decision Tree Based Methods** - Random Forest, Gradient Boosting
2. **Naive Bayes Classification** - Gaussian and Multinomial variants
3. **Neural Networks** - Artificial Neural Networks for complex patterns
4. **Nearest Neighbor Classification** - K-NN for similarity-based predictions

The processed datasets can now be used for model training, evaluation, and comparison using metrics like F1-score, Precision/Recall, and ROC-AUC as specified in the process tracker.