# Dataset Description: Musical EdTech Simulation

## Overview
This dataset is a simulated representation of a **Musical EdTech platform**, aiming to analyze user behavior, retention patterns, and premium conversion rates. It emulates a **real-life scenario** with added complexity, including realistic user attributes, engagement trends, and outliers. 

The dataset's core question is:
**To what extent do specific factors, such as age, instrument, or geographic location, affect retention and premium conversion rates on the platform?**

## Objectives
The dataset is designed to:
1. **Analyze User Retention**:
   - Investigate how retention varies by age, instrument, and country.
   - Identify the impact of onboarding features on short-term (Day 1) and long-term (Day 30) retention.

2. **Evaluate Premium Conversion**:
   - Explore which user groups (e.g., age, geography) are more likely to convert to premium subscriptions.
   - Study the relationship between engagement metrics (e.g., lessons completed) and conversions.

3. **Test Robustness Against Outliers**:
   - Include outliers to simulate unexpected but realistic scenarios, such as power users, extreme engagement levels, and geographic anomalies.

## Dataset Features
1. **Attributes**:
   - **Age Group**: Simulates user demographics, with adjustments for seasonal trends like back-to-school effects.
   - **Skill Level**: Captures user proficiency across Beginner, Intermediate, and Advanced levels.
   - **Instrument**: Reflects the diversity of user interests in learning Guitar, Piano, Ukulele, Bass, or Voice.
   - **Country**: Represents a global audience with varying economic conditions.

2. **Retention Metrics**:
   - **Day 1, Day 7, Day 30 Retention**: Tracks user engagement over time.

3. **Churn**:
   - Models disengagement based on age, instrument, and retention behavior.

4. **Premium Conversion**:
   - Simulates the likelihood of users upgrading to premium, influenced by multiple factors like age, country, retention, and engagement.

5. **Outliers**:
   - **Power Users**: Users with unusually high engagement (e.g., completing 50+ lessons).
   - **Geographic Anomalies**: Rare cases with unexpectedly high or low premium conversions.
   - **Instrument Extremes**: Users exhibiting disproportionate activity in specific instruments.

## Real-Life Simulation
The dataset reflects **real-world complexity** by:
- Adjusting activity trends based on **day-of-week and seasonal patterns**.
- Modeling demographic-specific behaviors, such as **higher retention in older users** and **greater conversions in stronger economies**.
- Including **random variability and outliers** to test the robustness of analytical models.
- Accounting for **cross-feature relationships** like age influencing instrument choice or country impacting retention.

This dataset was conceived with the help of AI to ensure realistic variability while integrating the creator's background in **teaching, music, and data analysis**. It serves as an ideal testbed for advanced analytics, hypothesis testing, and predictive modeling.



# Constants and Initial Setup
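This section initializes the dataset constants, such as the total number of users, the sign-up date range, and the probability distributions for age group, skill level, instrument, and country.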
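The normalization step can be sketched in isolation as follows. This is a minimal, standard-library-only illustration, not the notebook's own helper: the name `normalize_weights` is hypothetical, and the input values are the unscaled age shares printed in the cell output. It uses a tolerance-based comparison, because testing a float sum with `== 1` is fragile. It also shows that draws from Python's `random` module are repeatable only when that module (or a `random.Random` instance) is seeded, since `np.random.seed` does not affect it.

```python
import math
import random

SEED = 301


def normalize_weights(weights):
    """Return weights rescaled to sum to 1 (unchanged if they already do)."""
    total = math.fsum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    if math.isclose(total, 1.0, rel_tol=1e-9):
        return list(weights)
    return [w / total for w in weights]


# Unscaled age shares as printed in the notebook output
age_no_scale = [0.12, 0.57, 0.43, 0.06]
age_distribution = normalize_weights(age_no_scale)
print([round(p, 4) for p in age_distribution])

# Two generators seeded identically produce the same draw
draw_a = random.Random(SEED).uniform(0.3, 0.4)
draw_b = random.Random(SEED).uniform(0.3, 0.4)
print(draw_a == draw_b)
```

Raising on a non-positive sum is a design choice of this sketch; it avoids a silent division by zero if every weight happens to be 0.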


In [4]:
import pandas as pd 
import numpy as np
import random
print("Libraries successfully imported")

Libraries successfully imported


In [5]:
# Constants for dataset
SEED = 301 
np.random.seed(SEED) 
n_users = 30848  # Total number of users
round_float = lambda x, y: round(np.random.uniform(x,y),2)
def feat_scale(no_scale):
    total = sum(no_scale)
    if np.isclose(total, 1):  # Tolerance-based check; exact float equality is fragile
        print(f"Distribution already correct for {no_scale}")
        print('\n') 
        return no_scale  # No need to scale
    else:
        print(f"Distribution properly scaled for {no_scale}")
        # Scale to ensure the sum is 1
        scale = [x / total for x in no_scale]
        print(f"Corrected distribution: {scale}")
        print('\n')
        return scale
     
start_date_range = pd.date_range(start="2024-08-01", end="2024-08-31")
age_distribution_no_scale = [round_float(0.05, 0.25), round_float(0.45, 0.65), round_float(0.30, 0.50), round_float(0.02, 0.10)]
age_distribution = feat_scale(age_distribution_no_scale)

skill_distribution_no_scale = [round_float(0.55, 0.75), round_float(0.15, 0.30), round_float(0.05, 0.15)]
skill_distribution = feat_scale(skill_distribution_no_scale)
instrument_distribution_no_scale = [
    round_float(0.3 + random.uniform(0, 0.1), 0.6),
    round_float(0.15 + random.uniform(0, 0.05), 0.3),
    round_float(0.1, 0.2 + random.uniform(0, 0.1)),
    round_float(0.05, 0.2),
    round_float(0.02, 0.15)
]
instrument_distribution = feat_scale(instrument_distribution_no_scale)
 
country_weight_dict = {"United States":13, "United Kingdom":8, "Germany":8, "Finland": 5, "France": 5, "Sweden": 5, "Norway": 5,
     "Brazil": 3, "Australia": 5, "Netherlands": 5, "Canada": 5,  "Italy": 3, "Spain": 3, "Ireland": 3, "Argentina": 2, 
    "South Africa": 2, "New Zealand": 5, "Mexico": 1}

country_list = list(country_weight_dict)

country_weights = [(weight * round_float(1, 1.25)) for weight in country_weight_dict.values()] 
country_weights = [weight*100 for weight in feat_scale(country_weights)]

country_weight_scale = {country:scale for country,scale in zip(country_list,country_weights)}  

print(f"Country weights: {country_weight_scale}\n")
print("Constants successfully created")

Distribution properly scaled for [0.12, 0.57, 0.43, 0.06]
Corrected distribution: [0.1016949152542373, 0.4830508474576271, 0.364406779661017, 0.05084745762711865]


Distribution properly scaled for [0.69, 0.22, 0.14]
Corrected distribution: [0.657142857142857, 0.20952380952380953, 0.13333333333333333]


Distribution properly scaled for [0.58, 0.27, 0.18, 0.11, 0.11]
Corrected distribution: [0.46399999999999997, 0.21600000000000003, 0.144, 0.088, 0.088]


Distribution properly scaled for [13.65, 8.0, 9.36, 5.25, 6.0, 5.0, 5.949999999999999, 3.7199999999999998, 6.15, 6.25, 5.6499999999999995, 3.3000000000000003, 3.54, 3.5999999999999996, 2.34, 2.38, 5.300000000000001, 1.21]
Corrected distribution: [0.14123124676668392, 0.08277289187790998, 0.09684428349715467, 0.05431971029487843, 0.062079668908432487, 0.05173305742369374, 0.06156233833419554, 0.03848939472322814, 0.0636316606311433, 0.06466632177961718, 0.05845835488877392, 0.03414381789963787, 0.03662700465597517, 0.037247801345059485,

# Simulating Real-Life Date Distributions
User activity typically varies by the day of the week, with weekdays seeing higher engagement. Additionally, mid/late August may see a drop due to seasonal patterns. This section normalizes user distribution while maintaining realistic patterns.
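Turning fractional daily weights into whole-user counts that sum exactly to the total can be sketched as below. The cell that follows rounds each day's share and adds the leftover to the last day. This sketch swaps in a different technique, the largest-remainder method, which spreads the rounding error over the days with the largest fractional parts. The function name `allocate_counts` and the weekly weights are hypothetical values chosen for illustration.

```python
import math


def allocate_counts(weights, total):
    """Split `total` into integer counts proportional to `weights`.

    Uses the largest-remainder method so the counts sum to `total`
    without pushing the whole rounding error onto a single day.
    """
    weight_sum = math.fsum(weights)
    exact = [w / weight_sum * total for w in weights]
    counts = [math.floor(x) for x in exact]
    leftover = total - sum(counts)
    # Give one extra user to the days with the largest fractional parts
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical weights for one week: weekdays heavier than the weekend
weekly_weights = [1.0, 0.9, 1.1, 0.8, 1.0, 0.5, 0.4]
daily_counts = allocate_counts(weekly_weights, 1000)
print(daily_counts, sum(daily_counts))
```

With 31 days and roughly 30,000 users the difference from the last-day correction is small, so this mainly matters when the number of buckets is large relative to the total.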


In [7]:
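SEED = 301 
np.random.seed(SEED)
# Note: this seeds NumPy only. The `random.uniform` calls below use Python's
# `random` module, which needs `random.seed(SEED)` to be reproducible.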
# Simulating real-life date distributions (weekdays more active)
daily_weights = [
    random.uniform(0.5, 1.2) if date.weekday() < 5 else random.uniform(0.3, 0.8)
    for date in start_date_range
]
print(f"Daily weights: {daily_weights}\n") 
# Introduce mid/late August drop
late_august_decay = [0.9 if date > pd.Timestamp("2024-08-15") else 1 for date in start_date_range]
print(f"Late August decay: {late_august_decay}\n")
daily_weights = np.array(daily_weights) * late_august_decay
print(f"Daily weights after: {daily_weights}\n")
# Normalize daily weights to sum to total number of users
daily_weights = daily_weights / daily_weights.sum() * n_users
daily_counts = np.round(daily_weights).astype(int)


# Ensure the counts sum to the total number of users
daily_counts[-1] += n_users - daily_counts.sum()

# Generate start dates based on daily counts
start_dates = np.hstack([[date] * count for date, count in zip(start_date_range, daily_counts)])
if len(start_dates) == n_users:
    print("Dates generated successfully")
else:
    print(f"Inconsistent number of dates") 

Daily weights: [0.962580630737286, 0.8197100384922145, 0.30138974136848257, 0.4927688110516895, 0.5565537929689999, 0.6033393066065744, 0.7290139219254756, 0.5415240091039004, 0.8980186724925285, 0.6066200120787864, 0.7534722798657969, 1.081052634696308, 0.652291281783919, 0.6286641896095017, 0.9487465110505584, 0.812274605658382, 0.33120238763919335, 0.3732939688133825, 0.7712310805299213, 1.1442420369211816, 0.6510221353621124, 1.1854552312627833, 0.777030570208064, 0.4995371852564891, 0.7299428232222319, 1.036954957808348, 0.697731730009278, 1.0254070611309376, 0.5249336610041005, 0.9388767556148099, 0.70417575710165]

Late August decay: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]

Daily weights after: [0.96258063 0.81971004 0.30138974 0.49276881 0.55655379 0.60333931
 0.72901392 0.54152401 0.89801867 0.60662001 0.75347228 1.08105263
 0.65229128 0.62866419 0.94874651 0.73104715 0.29808215 0.33596457
 0

# Generating User Attributes
This section assigns attributes to users, such as:
- Age group
- Skill level
- Instrument
- Country

These distributions follow predefined probabilities but are adjusted for real-life scenarios, such as the back-to-school drop for the `"Under 18"` group.
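The effect of the back-to-school adjustment is easier to see with the numbers written out. The sketch below is a standalone illustration that uses the scaled age shares from the earlier cell output, rounded to four decimals. It applies the same steps as the next cell: subtract the drop from the "Under 18" share, floor it at zero, and renormalize. Because the drop is subtracted in absolute terms, a share of about 10.2% shrinks to roughly 0.2% rather than by a tenth.

```python
# Scaled age shares as printed in the notebook output (rounded here)
age_labels = ["Under 18", "18-34", "35-54", "55+"]
age_distribution = [0.1017, 0.4831, 0.3644, 0.0508]
back_to_school_drop = 0.1  # subtracted from the "Under 18" share

adjusted = list(age_distribution)
# Never let the share go negative
adjusted[0] = max(0.0, adjusted[0] - back_to_school_drop)

# Renormalize so the shares sum to 1 again
total = sum(adjusted)
adjusted = [share / total for share in adjusted]

for label, share in zip(age_labels, adjusted):
    print(f"{label}: {share:.2%}")
```

If a proportional reduction is intended instead, multiplying the share by `(1 - back_to_school_drop)` before renormalizing would keep most of the group; that variant is an assumption about intent, not something the notebook states.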


In [9]:
SEED = 301 
np.random.seed(SEED)
# Assign age groups with back-to-school adjustment
back_to_school_drop = 0.1  # Absolute reduction (10 percentage points) in the "Under 18" share
age_distribution_adjusted = age_distribution.copy()
# Prevent any negative probabilities
if age_distribution_adjusted[0] > back_to_school_drop:
    age_distribution_adjusted[0] -= back_to_school_drop
else:
    back_to_school_drop = age_distribution_adjusted[0]
    age_distribution_adjusted[0] = 0

# Normalize the adjusted distribution to ensure it sums to 1
age_distribution_adjusted = np.array(age_distribution_adjusted) / np.sum(age_distribution_adjusted)
age_groups = np.random.choice(['Under 18', '18-34', '35-54', '55+'], size=n_users, p=age_distribution_adjusted)

# Assign skill levels
skill_levels = np.random.choice(['Beginner', 'Intermediate', 'Advanced'], size=n_users, p=skill_distribution)

# Assign instruments
instruments = np.random.choice(['Guitar', 'Piano', 'Ukulele', 'Bass', 'Voice'], size=n_users, p=instrument_distribution)

# Assign countries
countries = random.choices(country_list, weights=country_weights, k=n_users)
print("Attributes successfully generated")

Attributes successfully generated


# Churn Rates
Churn rates reflect the likelihood of users disengaging from the platform. These are influenced by:
1. **Age Groups**: Younger users (Under 18, 18-34) churn more frequently.
2. **Instruments**: Instruments with broader appeal (Guitar, Piano) have lower churn rates than niche ones (Voice).
3. **Retention Behavior**: Users with higher Day 30 retention have a lower chance of churn.
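How the individual factors multiply into a single churn probability can be sketched without the random parts. The example below covers the age, instrument, interaction, and country terms with the values used in the following cell, and leaves out the noise and the churn spikes. The function names are hypothetical, and the country weights 13 and 1 are the base weights listed for the United States and Mexico.

```python
churn_scalers_age = {"Under 18": 0.6, "18-34": 0.5, "35-54": 0.4, "55+": 0.3}
instrument_engagement = {
    "Guitar": 0.2, "Piano": 0.25, "Ukulele": 0.3, "Bass": 0.35, "Voice": 0.4
}


def interaction(age, instrument):
    """Age-by-instrument multiplier."""
    if age == "18-34" and instrument == "Guitar":
        return 0.7
    if age == "Under 18" and instrument == "Voice":
        return 1.3
    if age == "35-54" and instrument in ("Piano", "Ukulele"):
        return 1.1
    return 1.0


def country_factor(weight, min_weight=1, max_weight=13):
    """Min-max scale a country weight to the range [0.5, 1.5]."""
    return 0.5 + (weight - min_weight) / (max_weight - min_weight)


def churn_probability(age, instrument, country_weight):
    """Noise-free churn probability, clipped to [0, 1]."""
    p = churn_scalers_age[age] * instrument_engagement[instrument]
    p *= interaction(age, instrument) * country_factor(country_weight)
    return min(max(p, 0.0), 1.0)


print(round(churn_probability("18-34", "Guitar", 13), 3))   # United States
print(round(churn_probability("Under 18", "Voice", 1), 3))  # Mexico
```

Note that the country factor grows with the country weight, so in this formulation users from the most heavily weighted countries get the largest churn multiplier.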

In [11]:
SEED = 301 
np.random.seed(SEED)
churn_scalers_age = {'Under 18': 0.6, '18-34': 0.5, '35-54': 0.4, '55+': 0.3}

# Define churn probabilities based on instruments
instrument_engagement = {'Guitar': 0.2, 'Piano': 0.25, 'Ukulele': 0.3, 'Bass': 0.35, 'Voice': 0.4}
churn_prob_instrument = [instrument_engagement[instrument] for instrument in instruments]
noise_instrument = np.random.normal(0, 0.15, len(churn_prob_instrument))  # Overwritten below with SD=0.05; this draw still advances the random stream

# Add random noise to churn probabilities
churn_prob_age = [churn_scalers_age[age] for age in age_groups]
noise_age = np.random.normal(0, 0.1, len(churn_prob_age))  # Noise with SD=0.1
churn_prob_age_noisy = np.clip(np.array(churn_prob_age) + noise_age, 0, 1)

noise_instrument = np.random.normal(0, 0.05, len(churn_prob_instrument))  # Noise with SD=0.05
churn_prob_instrument_noisy = np.clip(np.array(churn_prob_instrument) + noise_instrument, 0, 1)

# Combine churn probabilities
churn_prob = churn_prob_age_noisy * churn_prob_instrument_noisy

# Add country-specific effects
min_weight = min(country_weight_dict.values())
max_weight = max(country_weight_dict.values())
country_weight_norm = {
    country: 0.5 + (weight - min_weight) / (max_weight - min_weight) * 1.0  # Scale to [0.5, 1.5]
    for country, weight in country_weight_dict.items()
}


# Compute interaction term with country weights factored in
interaction_term = np.array([
    (0.7 if age == '18-34' and instrument == 'Guitar' else 
     1.3 if age == 'Under 18' and instrument == 'Voice' else 
     1.1 if age == '35-54' and instrument in ['Piano', 'Ukulele'] else 
     1.0) * country_weight_norm.get(country, 1.0)  # Default to 1.0 if country is missing
    for age, instrument, country in zip(age_groups, instruments, countries)
])
churn_prob *= interaction_term

# Add churn spikes for a random 10% of users
spike_indices = np.random.choice(len(churn_prob), size=int(len(churn_prob) * 0.1), replace=False)
churn_prob[spike_indices] = np.clip(churn_prob[spike_indices] + np.random.uniform(0.4, 0.6, len(spike_indices)), 0, 1)

# Simulate churn
churn_prob = np.clip(churn_prob, 0, 1)  # Interaction multipliers can push values above 1
churn = np.random.binomial(1, churn_prob)
print("Churn rates successfully created\n")
# Create dataset
dataset = pd.DataFrame({
    'user_id': [f"AUG{str(i).zfill(5)}" for i in range(1, n_users + 1)],
    'start_date': start_dates,
    'age_group': age_groups,
    'instrument': instruments,
    'country': countries,
    'skill_level': skill_levels,
    'churn': churn
})
dataset.head(3)
churn = dataset['churn'].sum() 
churn_rate = churn / dataset['churn'].size
print(f"Churn: {churn}") 
print(f"Churn rate: {churn_rate:.1%}")

Churn rates successfully created

Churn: 4791
Churn rate: 15.5%


# Retention Metrics
Retention metrics track user engagement at different time periods (Day 1, Day 7, Day 30). These are influenced by:
1. **Age Group**: Older users (35-54, 55+) are typically more consistent.
2. **Instrument**: Instruments like Guitar and Piano encourage more consistent retention.
3. **Country**: Users from stronger economies tend to retain longer due to better resources.
4. **Outliers**: Retention extremes, such as users retained across all periods, are introduced for realism.
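Retention is modelled as a chain of conditional probabilities: Day 7 is only possible after Day 1, and Day 30 only after Day 7. The sketch below shows how the unconditional Day 30 rate follows from the chain and checks it with a small simulation. The three probabilities are illustrative midpoints of the base ranges drawn in the next cell, and the per-user scaling by age, instrument, and country is left out.

```python
import random

p_day_1 = 0.35            # P(retained on Day 1)
p_day_7_given_1 = 0.55    # P(retained on Day 7 | retained on Day 1)
p_day_30_given_7 = 0.45   # P(retained on Day 30 | retained on Day 7)

# Unconditional Day 30 retention is the product of the chain
expected_day_30 = p_day_1 * p_day_7_given_1 * p_day_30_given_7

rng = random.Random(301)
n = 50_000
day_1 = day_7 = day_30 = 0
for _ in range(n):
    r1 = rng.random() < p_day_1
    r7 = r1 and rng.random() < p_day_7_given_1
    r30 = r7 and rng.random() < p_day_30_given_7
    day_1 += r1
    day_7 += r7
    day_30 += r30

print(f"expected Day 30 rate:  {expected_day_30:.4f}")
print(f"simulated Day 30 rate: {day_30 / n:.4f}")
print(f"funnel counts: {day_1} >= {day_7} >= {day_30}")
```

Because each stage multiplies in another factor below 1, even moderate per-stage probabilities give a single-digit Day 30 rate.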


In [13]:
SEED = 301 
np.random.seed(SEED)
# Define retention scalers for factors
retention_scalers_age = {'Under 18': 0.8, '18-34': 1.0, '35-54': 1.5, '55+': 1.2}
retention_scalers_instrument = {'Guitar': 1.2, 'Piano': 1.1, 'Ukulele': 1.0, 'Bass': 0.9, 'Voice': 0.8}
retention_scalers_country = {
    country: round(scaler / 10 * round_float(1, 1.5), 2)
    for country, scaler in zip(country_list, country_weights)
}

# Define base retention probabilities
day_1_prob = round_float(0.3, 0.4)  # Adjusted upwards
day_7_given_day_1_prob = round_float(0.5, 0.6)  # Adjusted upwards
day_30_given_day_7_prob = round_float(0.4, 0.5)  # Adjusted upwards

# Add noise to base retention probabilities
day_1_prob_noisy = np.clip(day_1_prob * round_float(0.5, 1.5), 0, 1)
day_7_given_day_1_prob_noisy = np.clip(day_7_given_day_1_prob * round_float(0.5, 1.5), 0, 1)
day_30_given_day_7_prob_noisy = np.clip(day_30_given_day_7_prob * round_float(0.5, 1.5), 0, 1)

# Interaction effects for retention probabilities
interaction_effects_age = np.array([retention_scalers_age[age] for age in dataset['age_group']])
interaction_effects_instrument = np.array([retention_scalers_instrument[instrument] for instrument in dataset['instrument']])
interaction_effects_country = np.array([retention_scalers_country.get(country, 1.0) for country in dataset['country']])

# Combine interaction effects and normalize tightly
interaction_effects = interaction_effects_age * interaction_effects_instrument * interaction_effects_country
interaction_effects /= (np.max(interaction_effects) * 1.2)  # Tighter normalization

# Adjust retention probabilities with interaction effects
adjusted_day_1_prob = np.clip(day_1_prob_noisy * interaction_effects, 0, 1)
adjusted_day_7_given_day_1_prob = np.clip(day_7_given_day_1_prob_noisy * interaction_effects, 0, 1)
adjusted_day_30_given_day_7_prob = np.clip(day_30_given_day_7_prob_noisy * interaction_effects, 0, 1)

# Spike retention probabilities for 2% of users
spike_indices = np.random.choice(len(dataset), size=int(len(dataset) * 0.02), replace=False)
adjusted_day_1_prob[spike_indices] = np.clip(adjusted_day_1_prob[spike_indices] + np.random.uniform(0.2, 0.4, len(spike_indices)), 0, 1)

# Drop retention probabilities for 1% of users
drop_indices = np.random.choice(len(dataset), size=int(len(dataset) * 0.01), replace=False)
adjusted_day_1_prob[drop_indices] -= np.random.uniform(0.05, 0.1, len(drop_indices))  # Reduced noise range

# Ensure probabilities don't go too low
adjusted_day_1_prob = np.clip(adjusted_day_1_prob, 0.05, 1)  # Set a minimum threshold

# Simulate retention metrics
dataset['day_1_retention'] = np.random.binomial(1, adjusted_day_1_prob)
dataset['day_7_retention'] = np.where(
    dataset['day_1_retention'] == 1,
    np.random.binomial(1, adjusted_day_7_given_day_1_prob),
    0
)
dataset['day_30_retention'] = np.where(
    dataset['day_7_retention'] == 1,
    np.random.binomial(1, adjusted_day_30_given_day_7_prob),
    0
)

# Introduce retention outliers (users retained across all periods)
n_users = len(dataset)
retention_outliers = np.random.choice(dataset.index, size=int(n_users * 0.015), replace=False)  # Increased to 1.5%
dataset.loc[retention_outliers, ['day_1_retention', 'day_7_retention', 'day_30_retention']] = 1

# Retention metrics
retention_rates = dataset[['day_1_retention', 'day_7_retention', 'day_30_retention']]
retention = retention_rates.sum().sum()  # Total 1s
non_retention = retention_rates.size - retention  # Total 0s
total_entries = retention_rates.size  # Total number of values

# Debugging intermediate values
print("Adjusted Day 1 Probabilities (mean, min, max):", adjusted_day_1_prob.mean(), adjusted_day_1_prob.min(), adjusted_day_1_prob.max())
print("Adjusted Day 7 Probabilities (mean, min, max):", adjusted_day_7_given_day_1_prob.mean(), adjusted_day_7_given_day_1_prob.min(), adjusted_day_7_given_day_1_prob.max())
print("Adjusted Day 30 Probabilities (mean, min, max):", adjusted_day_30_given_day_7_prob.mean(), adjusted_day_30_given_day_7_prob.min(), adjusted_day_30_given_day_7_prob.max())

# Calculate retention rate
retention_rate = round((retention / total_entries) * 100, 2)
print(f"Retention: {retention}")
print(f"Non-retention: {non_retention}")
print(f"Retention rate: {retention_rate}%")

Adjusted Day 1 Probabilities (mean, min, max): 0.1965616946120591 0.05 0.8723096306705631
Adjusted Day 7 Probabilities (mean, min, max): 0.20897759947329453 0.01638787878787879 0.52
Adjusted Day 30 Probabilities (mean, min, max): 0.14604988963189702 0.011453131313131316 0.36341666666666667
Retention: 9231
Non-retention: 83313
Retention rate: 9.97%


# Fine-Tuning Premium Conversion
Premium conversion is influenced by:
1. **Age Groups**: Older users (35-54, 55+) convert more due to higher disposable income.
2. **Country**: Stronger economies like the US and Germany have higher conversion rates.
3. **Retention**: Users with higher Day 30 retention are more likely to convert.
4. **Instrument Engagement**: Structured instruments like Piano tend to encourage premium conversions.
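The conversion metrics reported for this section are conditional rates: the share of premium users among those retained at a given checkpoint. A minimal sketch of that calculation follows, using plain tuples instead of a DataFrame. The rows are the five sample users shown at the end of the notebook, so the resulting rates only illustrate the mechanics and say nothing about the full dataset. The helper name `conversion_rate` is hypothetical.

```python
# (user_id, day_1, day_7, day_30, premium) for the five sample rows
rows = [
    ("AUG00001", 0, 0, 0, 0),
    ("AUG00002", 1, 1, 0, 1),
    ("AUG00003", 0, 0, 0, 0),
    ("AUG00004", 1, 0, 1, 1),
    ("AUG00005", 1, 0, 0, 0),
]


def conversion_rate(rows, retention_col=None):
    """Mean premium flag, optionally restricted to retained users.

    retention_col is the tuple position of a retention flag
    (1 = Day 1, 2 = Day 7, 3 = Day 30), or None for all users.
    """
    if retention_col is not None:
        rows = [r for r in rows if r[retention_col] == 1]
    if not rows:
        return float("nan")
    return sum(r[4] for r in rows) / len(rows)


overall = conversion_rate(rows)
given_day_1 = conversion_rate(rows, retention_col=1)
print(f"overall: {overall:.2%}, given Day 1 retention: {given_day_1:.2%}")
```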

In [15]:
SEED = 301 
np.random.seed(SEED)
# Premium conversion scalers
premium_scalers_instrument = {'Guitar': 1.2, 'Piano': 1.1, 'Ukulele': 1.0, 'Bass': 0.9, 'Voice': 0.8}
premium_scalers_age = {'Under 18': 0.8, '18-34': 1.0, '35-54': 1.2, '55+': 1.1}
premium_scalers_country = {
    country: round(scaler / 10 * round_float(0.8, 1.2), 2)
    for country, scaler in zip(country_list, country_weights)
}

# Factor in instrument, age, and country scalers
premium_effects_instrument = np.array([premium_scalers_instrument[instrument] for instrument in dataset['instrument']])
premium_effects_age = np.array([premium_scalers_age[age] for age in dataset['age_group']])
premium_effects_country = np.array([premium_scalers_country.get(country, 1.0) for country in dataset['country']])

# Combine interaction effects
premium_interaction_effects = premium_effects_instrument * premium_effects_age * premium_effects_country
premium_interaction_effects /= np.max(premium_interaction_effects)  # Normalize

# Adjust premium conversion probabilities with additional factors
premium_conversion_prob = np.clip(
    adjusted_day_1_prob * adjusted_day_7_given_day_1_prob * adjusted_day_30_given_day_7_prob * premium_interaction_effects,
    0,
    1
)

# Simulate premium conversion
dataset['premium_conversion'] = np.random.binomial(1, premium_conversion_prob)

# Calculate premium conversion metrics
day_1_premium_conversion_rate = dataset.loc[dataset['day_1_retention'] == 1, 'premium_conversion'].mean()
day_7_premium_conversion_rate = dataset.loc[dataset['day_7_retention'] == 1, 'premium_conversion'].mean()
day_30_premium_conversion_rate = dataset.loc[dataset['day_30_retention'] == 1, 'premium_conversion'].mean()
overall_premium_conversion_rate = dataset['premium_conversion'].mean()

# Print premium conversion metrics
print(f"Day 1 Premium Conversion Rate (given Day 1 Retention): {day_1_premium_conversion_rate:.2%}")
print(f"Day 7 Premium Conversion Rate (given Day 7 Retention): {day_7_premium_conversion_rate:.2%}")
print(f"Day 30 Premium Conversion Rate (given Day 30 Retention): {day_30_premium_conversion_rate:.2%}")
print(f"Overall Premium Conversion Rate: {overall_premium_conversion_rate:.2%}")


Day 1 Premium Conversion Rate (given Day 1 Retention): 1.64%
Day 7 Premium Conversion Rate (given Day 7 Retention): 2.58%
Day 30 Premium Conversion Rate (given Day 30 Retention): 2.21%
Overall Premium Conversion Rate: 0.90%


# Lessons Completed
Lessons completed track user activity. The number of lessons depends on:
1. **Retention**: Users retained longer complete more lessons.
2. **Skill Level**: Advanced users may complete more lessons, but beginners might binge early.
3. **Instrument**: Some instruments (e.g., Guitar, Piano) encourage higher engagement.
4. **Outliers**: Power users completing an unusually high number of lessons are introduced for added realism.
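Lesson counts are drawn from a normal distribution and then clipped at zero, so the mean and standard deviation together decide how many users end up with no lessons at all. The sketch below estimates that share with `statistics.NormalDist`. The two parameter pairs are illustrative picks from the 5 to 10 range that the next cell uses for Beginners, and the function name is hypothetical.

```python
from statistics import NormalDist


def share_clipped_to_zero(mean, std_dev):
    """Probability that a Normal(mean, std_dev) draw is negative."""
    return NormalDist(mu=mean, sigma=std_dev).cdf(0)


low_mean_wide = share_clipped_to_zero(5, 10)     # low mean, wide spread
high_mean_narrow = share_clipped_to_zero(10, 5)  # higher mean, narrow spread
print(f"{low_mean_wide:.1%} vs {high_mean_narrow:.1%}")
```

With a mean of 5 and a standard deviation of 10, about 31% of draws fall below zero and are clipped, compared with about 2% for a mean of 10 and a standard deviation of 5. A wide spread therefore creates a large group of zero-lesson users before any idle users are added explicitly.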


In [17]:
SEED = 301 
np.random.seed(SEED)
# Define baseline lessons completed
lessons_mean = {
    'Beginner': random.randint(5, 10),  # Randomized mean for Beginner
    'Intermediate': random.randint(10, 15),  # Randomized mean for Intermediate
    'Advanced': random.randint(15, 20)  # Randomized mean for Advanced
}
lessons_std_dev = {
    'Beginner': random.randint(5, 10),  # Randomized standard deviation for Beginner
    'Intermediate': random.randint(10, 15),  # Randomized standard deviation for Intermediate
    'Advanced': random.randint(15, 20)  # Randomized standard deviation for Advanced
}

# Simulate lessons completed
lessons_completed = [
    np.random.normal(
        lessons_mean[skill_level] * (1 + 0.1 * retention),  # Scale by retention
        lessons_std_dev[skill_level]  # Standard deviation per skill level
    )
    for skill_level, retention in zip(dataset['skill_level'], dataset['day_30_retention'])
]
dataset['lessons_completed'] = np.clip(np.round(lessons_completed), 0, None).astype(int)

# Introduce lessons completed outliers
lesson_outliers = np.random.choice(dataset.index, size=int(n_users * 0.01), replace=False)
dataset.loc[lesson_outliers, 'lessons_completed'] += np.random.randint(10, 50, len(lesson_outliers))
print(f"Lessons completed per user were successfully created") 
print(f"Number of lessons completed: {dataset['lessons_completed'].sum()}")
print(f"Standard deviation: {dataset['lessons_completed'].std()}")

Lessons completed per user were successfully created
Number of lessons completed: 368845
Standard deviation: 10.342755572784496


# Power Users
Power users are individuals with unusually high engagement, completing significantly more lessons than average. These users add realism and test the dataset's ability to handle extreme cases.

# Idle Users  
Idle users are individuals who register on the platform but exhibit minimal or no engagement, completing zero or very few lessons. These users reflect a common real-world scenario where many people sign up out of curiosity but do not actively use the platform.  

#### Justification  
- **Realistic Representation:** Including idle users ensures the dataset mirrors actual platform usage patterns, where a significant percentage of users fail to engage meaningfully.  
- **Business Insights:** Identifying and analyzing idle users helps businesses understand barriers to engagement, allowing them to design targeted strategies for reactivation.  
- **Performance Testing:** Idle users test the platform's ability to handle inactive accounts without impacting metrics like conversion rates, retention, or scalability.  
- **Strategic Planning:** Incorporating idle users highlights the need for outreach campaigns, personalized onboarding, or incentive programs to convert these users into active ones.  

By including idle users in the dataset, the simulation gains credibility and creates opportunities for actionable insights into user engagement.
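In the cell below, power users and idle users are sampled independently, so a user can land in both groups, in which case the idle assignment (applied second) overwrites the power-user boost. If the two groups should be disjoint, one option is to sample both in a single draw and split the result. The sketch below illustrates this with the standard library; the population size and baseline lesson count are hypothetical placeholders.

```python
import random

rng = random.Random(301)
n_users = 1_000                 # small hypothetical population
n_power = int(n_users * 0.01)   # 1% power users
n_idle = int(n_users * 0.15)    # 15% idle users

# Sample both groups in one draw, then split, so no user is in both
picked = rng.sample(range(n_users), n_power + n_idle)
power_users = set(picked[:n_power])
idle_users = set(picked[n_power:])

lessons = [5] * n_users         # placeholder baseline for every user
for i in power_users:
    lessons[i] += rng.randint(10, 49)  # same range as np.random.randint(10, 50)
for i in idle_users:
    lessons[i] = rng.randint(0, 3)     # same range as np.random.randint(0, 4)

print(len(power_users), len(idle_users), power_users & idle_users)
```

Note the endpoint difference: `random.randint` includes its upper bound, while `np.random.randint` excludes it.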


In [19]:
SEED = 301 
np.random.seed(SEED)
lessons_mean = {'Beginner': 5, 'Intermediate': 7, 'Advanced': 10}
lessons_std_dev = {'Beginner': 2, 'Intermediate': 3, 'Advanced': 4}

lessons_completed = [
    np.random.normal(
        lessons_mean[skill_level], lessons_std_dev[skill_level]
    )
    for skill_level in dataset['skill_level']
]
dataset['lessons_completed'] = np.clip(np.round(lessons_completed), 0, None).astype(int)

# Select power users (1%) and idle users (15%)
power_user_indices = np.random.choice(dataset.index, size=int(n_users * 0.01), replace=False)
idle_users_indices = np.random.choice(dataset.index, size=int(n_users * 0.15), replace=False)

# Add extreme lessons for power users, then overwrite idle users with 0-3 lessons
dataset.loc[power_user_indices, 'lessons_completed'] += np.random.randint(10, 50, len(power_user_indices))
dataset.loc[idle_users_indices, 'lessons_completed'] = np.random.randint(0, 4, len(idle_users_indices))
print(f"Idle and Power Users successfully created") 

Idle and Power Users successfully created


# Geographic Anomalies
Geographic anomalies represent rare or unexpected patterns, such as users from underrepresented countries with unusually high retention or conversion rates.
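The anomaly cell below overwrites `day_30_retention` for rare countries without touching the Day 1 and Day 7 flags, so some rows can show Day 30 retention without Day 7 retention. The sample user AUG00004 from Seychelles, shown at the end of the notebook, is one such case. A simple consistency check makes these rows easy to find before analysis. The function name and record layout here are hypothetical.

```python
def funnel_violations(records):
    """Return ids where a later retention flag is set without the earlier one."""
    bad = []
    for r in records:
        if r["day_7"] > r["day_1"] or r["day_30"] > r["day_7"]:
            bad.append(r["user_id"])
    return bad


# Three of the sample rows from the final table
sample = [
    {"user_id": "AUG00002", "country": "United States",
     "day_1": 1, "day_7": 1, "day_30": 0},
    {"user_id": "AUG00004", "country": "Seychelles",
     "day_1": 1, "day_7": 0, "day_30": 1},
    {"user_id": "AUG00005", "country": "Mexico",
     "day_1": 1, "day_7": 0, "day_30": 0},
]
print(funnel_violations(sample))  # ['AUG00004']
```

Whether such rows count as intended anomalies or as inconsistencies to repair depends on the analysis; flagging them keeps that decision explicit.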


In [21]:
SEED = 301 
np.random.seed(SEED)
# Expanded list of rare countries
rare_countries = [
    'Maldives', 'Bhutan', 'Timor-Leste',  # Asia
    'Tuvalu', 'Nauru', 'Palau',          # Oceania
    'San Marino', 'Liechtenstein', 'Monaco',  # Europe
    'Seychelles', 'Djibouti', 'Eswatini',    # Africa
    'Guyana', 'Suriname', 'Paraguay',    # South America
    'Belize', 'Barbados', 'Saint Kitts and Nevis'  # North America
]

# Assign rare countries to 3% of users
rare_indices = np.random.choice(dataset.index, size=int(len(dataset) * 0.03), replace=False)
dataset.loc[rare_indices, 'country'] = np.random.choice(rare_countries, size=len(rare_indices))

# Retention anomalies for rare countries
high_retention_countries = ['Maldives', 'Bhutan', 'San Marino', 'Liechtenstein', 'Seychelles']
low_retention_countries = ['Tuvalu', 'Nauru', 'Timor-Leste', 'Djibouti', 'Eswatini']

# Boost or drop retention based on country
dataset.loc[dataset['country'].isin(high_retention_countries), 'day_30_retention'] = np.random.binomial(
    1, 0.9, len(dataset[dataset['country'].isin(high_retention_countries)])
)
dataset.loc[dataset['country'].isin(low_retention_countries), 'day_30_retention'] = np.random.binomial(
    1, 0.3, len(dataset[dataset['country'].isin(low_retention_countries)])
)

# Premium conversion anomalies
high_conversion_countries = ['Monaco', 'Palau', 'Barbados', 'San Marino']
low_conversion_countries = ['Tuvalu', 'Djibouti', 'Timor-Leste', 'Eswatini']

# Apply conversion rates based on country
dataset.loc[dataset['country'].isin(high_conversion_countries), 'premium_conversion'] = np.random.binomial(
    1, 0.8, len(dataset[dataset['country'].isin(high_conversion_countries)])
)
dataset.loc[dataset['country'].isin(low_conversion_countries), 'premium_conversion'] = np.random.binomial(
    1, 0.2, len(dataset[dataset['country'].isin(low_conversion_countries)])
)

# Lessons completed anomalies
high_activity_countries = ['Bhutan', 'Seychelles', 'Monaco', 'Liechtenstein']
low_activity_countries = ['Tuvalu', 'Guyana', 'Paraguay', 'Eswatini']

# Boost or reduce lessons completed
dataset.loc[dataset['country'].isin(high_activity_countries), 'lessons_completed'] += np.random.randint(
    10, 50, len(dataset[dataset['country'].isin(high_activity_countries)])
)
dataset.loc[dataset['country'].isin(low_activity_countries), 'lessons_completed'] -= np.random.randint(
    1, 10, len(dataset[dataset['country'].isin(low_activity_countries)])
)
dataset['lessons_completed'] = np.clip(dataset['lessons_completed'], 0, None)  # Ensure no negative values
print("Geographic anomalies successfully created")

Geographic anomalies successfully created


# Instrument Extremes
Certain users exhibit disproportionate activity for specific instruments. These outliers represent highly engaged users who skew the typical engagement patterns.


In [23]:
SEED = 301 
np.random.seed(SEED)
# Instrument outlier counts with variability
instrument_outlier_base = {
    'Guitar': 0.01, 'Piano': 0.01, 'Ukulele': 0.005, 'Bass': 0.005, 'Voice': 0.003
}

# Add variability to outlier proportions
instrument_outlier_counts = {
    instrument: max(0, proportion + np.random.uniform(-0.002, 0.002))  # Add noise to proportions
    for instrument, proportion in instrument_outlier_base.items()
}

# Introduce both high and low outliers
for instrument, proportion in instrument_outlier_counts.items():
    # Get indices for outliers
    outlier_indices = dataset[dataset['instrument'] == instrument].sample(
        frac=proportion, random_state=42
    ).index
    
    # Randomly decide high or low outliers for each user
    for idx in outlier_indices:
        if np.random.rand() > 0.5:  # 50% chance of being a high outlier
            dataset.loc[idx, 'lessons_completed'] += np.random.randint(15, 40)
        else:  # 50% chance of being a low outlier
            dataset.loc[idx, 'lessons_completed'] -= np.random.randint(5, 15)

# Ensure lessons_completed remains non-negative
dataset['lessons_completed'] = np.clip(dataset['lessons_completed'], 0, None)
print("Instrument extremes successfully created")

Instrument extremes successfully created


In [None]:
dataset.describe() 

In [25]:
pd.set_option('display.max_columns', None) 
dataset.head(5)

| | user_id | start_date | age_group | instrument | country | skill_level | churn | day_1_retention | day_7_retention | day_30_retention | premium_conversion | lessons_completed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AUG00001 | 2024-08-01 | 18-34 | Guitar | Mexico | Beginner | 0 | 0 | 0 | 0 | 0 | 7 |
| 1 | AUG00002 | 2024-08-01 | 35-54 | Guitar | United States | Beginner | 0 | 1 | 1 | 0 | 1 | 2 |
| 2 | AUG00003 | 2024-08-01 | 35-54 | Guitar | United Kingdom | Beginner | 0 | 0 | 0 | 0 | 0 | 6 |
| 3 | AUG00004 | 2024-08-01 | 18-34 | Guitar | Seychelles | Beginner | 0 | 1 | 0 | 1 | 1 | 53 |
| 4 | AUG00005 | 2024-08-01 | 35-54 | Guitar | Mexico | Intermediate | 0 | 1 | 0 | 0 | 0 | 3 |


In [26]:
dataset.to_csv('pretest.csv', index=False) 