# RTS Data Analyst take-home assignment

# 🔍 Step 1: Understand the Objective
You need to:
- Recommend a production volume (i.e., how much content to produce) for each of the 5 themes: info, sport, musique, societe, humour
- Help the business understand the function each theme serves: acquisition, retention, or loyalty
- Communicate this with clear insights and visuals

# 🧹 Step 2: Data Exploration & Cleaning
## 📁 Files Overview
You have:

1. Mesures_contenu_volume_audio_à_commander.csv: performance metrics for segments (episodes)
2. Correspondance_show_segment_tag.csv: mapping from episodes to tags (themes)

Actions:
- Load both CSVs
- Join them on segment ID
- Filter to:
  - Date from Jan 2025 onward (for tags to be valid)
  - Platform = "rts.ch" (only this has theme data)
  - Keep only the 5 themes of interest via assigned tags
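
The filter actions listed above could be sketched as follows on a tiny made-up frame. The rows, the `cutoff` variable and the theme column name `Primary Theme` are illustrative assumptions; only the date format and the `rts.ch` platform value come from the data shown in this notebook.

```python
import pandas as pd

# Toy rows (invented for illustration), using column names from the metrics file
toy = pd.DataFrame({
    "Segment ID": ["a", "b", "c", "d"],
    "Publication Date": ["12.05.2024", "07.04.2025", "28.03.2025", "12.05.2025"],
    "App/Site Name": ["rts.ch", "rts.ch", "rts-app-sport", "rts.ch"],
    "Primary Theme": ["info", "societe", "info", None],  # hypothetical theme column
})

# Parse DD.MM.YYYY dates before comparing them
toy["Publication Date"] = pd.to_datetime(toy["Publication Date"], format="%d.%m.%Y")

cutoff = pd.Timestamp("2025-01-01")  # assumed start of the period with valid tags
themes = {"info", "sport", "musique", "societe", "humour"}

# Combine the three conditions: date, platform, and theme of interest
filtered = toy[
    (toy["Publication Date"] >= cutoff)
    & (toy["App/Site Name"] == "rts.ch")
    & (toy["Primary Theme"].isin(themes))
]
print(filtered["Segment ID"].tolist())
```

Only segment `b` passes all three conditions here; the others fail on date, platform, or missing theme respectively.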


# 📊 Step 3: Define Key Metrics per Theme

For each of the 5 themes, calculate:

- Total play duration (overall consumption)
- Average play duration per segment
- Visitors: how many users reached the content
- New visit rate: to estimate acquisition
- Returning visits and bounces: for retention vs. bounce
- Entries and Exits: to see if the segment starts or ends visits

This will let you profile each theme:
- Acquisition = High entries + High new visit rate
- Retention = Low bounce + Long play duration
- Loyalty = High returning visitors + Low exit

# 📌 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import plotly.express as px
import plotly.graph_objects as go

import warnings
warnings.filterwarnings('ignore')

# 📂 2. Load Raw Data

## 2.1. Mesures_contenu_volume_audio_à_commander.csv

In [2]:
# Load the CSV file
path_volume = "../data/Mesures_contenu_volume_audio_à_commander.csv"
metrics_df = pd.read_csv(path_volume, sep=';', encoding='utf-8')

# Show first few rows of each for context
metrics_df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Segment Length,Media Views,Avg Play Duration,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Total Play Duration
0,14897825,Le Suisse Nemo triomphe à lEurovision avec sa ...,2031524,Le Journal horaire,12.05.2024,rts.ch,Smartphone,1234,20762,00:05:19,18877,"84,56%",9770,13135,3428,5181,94:50:23
1,15102359,Une trombe sest formée au-dessus du lac Léman,2031524,Le Journal horaire,18.08.2024,rts.ch,Smartphone,586,14703,00:03:27,13381,"53,30%",9889,11505,6458,6798,108:13:53
2,14572281,De Genève à Zurich: un périple sanglant en Hel...,14546712,Crimes suisses,05.01.2024,rts-app-play,Smartphone,3490,7327,00:24:41,4124,"2,49%",1527,1928,6594,602,2601:23:11
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,09.02.2024,rts.ch,Smartphone,1500,7560,00:06:25,7934,"71,32%",4370,4993,2671,2729,151:43:36
4,359fc205-7470-38e0-b393-3b4a2e429508,Pourquoi les couples se séparent,6067786,Tribu,07.04.2025,rts.ch,Smartphone,1956,7201,00:08:34,7147,"43,80%",6741,3901,4808,4016,851:19:51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277581,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,12.05.2025,rts-app-sport,Smartphone,564,524,00:01:01,90,"0,00%",103,140,23,726,00:04:16
277582,96015a33-f517-3cf8-bcda-c9658dd6c844,En Douceur,14570123,En Douceur,12.05.2025,rts.ch,Smartphone,4677,451,00:04:25,772,"103,00%",141,802,695,687,00:04:11
277583,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,28.03.2025,rts-app-sport,Smartphone,814,438,00:05:24,989,"0,00%",476,772,859,92,00:00:12
277584,41568641-62b4-3596-99ce-3b8bf4d09ad8,Helveticus,12027724,Léchappée,28.03.2025,rts.ch,Smartphone,1150,512,00:05:02,289,"103,00%",1222,1055,82,889,00:00:14


In [3]:
metrics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277586 entries, 0 to 277585
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Segment ID           277586 non-null  object
 1   Segment              277586 non-null  object
 2   Show ID              277467 non-null  object
 3   Show                 277467 non-null  object
 4   Publication Date     277467 non-null  object
 5   App/Site Name        277467 non-null  object
 6   Device Class         277467 non-null  object
 7   Segment Length       277586 non-null  int64 
 8   Media Views          277586 non-null  int64 
 9   Avg Play Duration    277586 non-null  object
 10  Visitors             277586 non-null  int64 
 11  New Visit Rate %     277586 non-null  object
 12  Entries              277586 non-null  int64 
 13  Exits                277586 non-null  int64 
 14  Returning Visits     277586 non-null  int64 
 15  Bounces              277586 non-nu

In [4]:
metrics_df.describe()

Unnamed: 0,Segment Length,Media Views,Visitors,Entries,Exits,Returning Visits,Bounces
count,277586.0,277586.0,277586.0,277586.0,277586.0,277586.0,277586.0
mean,2266.756205,328.131336,667.504734,630.95284,639.351001,661.112099,624.488497
std,2800.624729,215.137803,375.419407,361.20359,362.775949,373.208065,358.03318
min,6.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,783.0,178.0,353.0,321.0,328.0,346.0,316.0
50%,1310.0,319.0,666.0,632.0,639.0,659.0,623.0
75%,2131.0,457.0,976.0,941.0,949.0,969.0,934.0
max,22957.0,20762.0,18877.0,9889.0,13135.0,6594.0,6798.0


In [5]:
metrics_df.isnull().sum()

Segment ID               0
Segment                  0
Show ID                119
Show                   119
Publication Date       119
App/Site Name          119
Device Class           119
Segment Length           0
Media Views              0
Avg Play Duration        0
Visitors                 0
New Visit Rate %         0
Entries                  0
Exits                    0
Returning Visits         0
Bounces                  0
Total Play Duration      0
dtype: int64

## 2.2. Correspondance_show_segment_tag.csv

In [6]:
# Load the CSV file
path_tags = "../data/Correspondance_show_segment_tag.csv"
tags_df = pd.read_csv(path_tags, sep=';', encoding='utf-8')

# Show rows for context
tags_df

Unnamed: 0,Segment ID,Show,Show ID,Assigned Tags
0,14897825,Le Journal horaire,2031524,-
1,15102359,Le Journal horaire,2031524,-
2,15112045,Crimes suisses,14546712,-
3,14689374,La Matinale,8849020,-
4,15126915,Vertigo,4197907,-
...,...,...,...,...
107797,14818255,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...
107798,14851148,Forum,1784426,media_radio:la-1ere_rts_info:rts_info_media_tv...
107799,15228940,Forum,1784426,media_radio:media_radio_rts_info:monde_rts_inf...
107800,14845847,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...


In [7]:
tags_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107802 entries, 0 to 107801
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Segment ID     107802 non-null  object
 1   Show           107802 non-null  object
 2   Show ID        107802 non-null  object
 3   Assigned Tags  107802 non-null  object
dtypes: object(4)
memory usage: 3.3+ MB


In [8]:
tags_df.describe()

Unnamed: 0,Segment ID,Show,Show ID,Assigned Tags
count,107802,107802,107802,107802
unique,80731,491,470,2360
top,15447445,Le Journal horaire,2031524,-
freq,9,18894,18971,59253


In [9]:
tags_df.isnull().sum()

Segment ID       0
Show             0
Show ID          0
Assigned Tags    0
dtype: int64

# 🧼 3. Clean and Prepare Metrics

#### 3.1. Cleaning titles

In [10]:
# Strip surrounding whitespace from column names and drop non-ASCII characters
# (e.g. invisible characters left over from the export)
metrics_df.columns = metrics_df.columns.str.strip().str.replace(r'[^\x00-\x7F]+', '', regex=True)

# Renaming 'Segment Length' to 'Episode Length (s)' as per data description
metrics_df = metrics_df.rename(columns={'Segment Length': 'Episode Length (s)'})

#### 3.2. Checking duplicated rows

In [11]:
# Count duplicated rows (full row duplicates)
duplicate_rows = metrics_df.duplicated()
print(duplicate_rows.sum())

0


#### 3.3. Missing values

In [12]:
# Identify the rows with missing values
## affected columns: 'Show ID', 'Show', 'Publication Date', 'App/Site Name', 'Device Class'
missing_rows_1 = metrics_df[metrics_df['Show ID'].isnull()]
missing_rows_2 = metrics_df[metrics_df['Publication Date'].isnull()]
missing_rows_3 = metrics_df[metrics_df['App/Site Name'].isnull()]
missing_rows_4 = metrics_df[metrics_df['Device Class'].isnull()]

## Count the missing rows per column to see whether the totals agree
missing_rows_dfs = [missing_rows_1, missing_rows_2, missing_rows_3, missing_rows_4]
total_count = len(metrics_df)
for i, missing_rows in enumerate(missing_rows_dfs, start=1):
    missing_count = len(missing_rows)
    missing_ratio = missing_count / total_count
    print(f"missing_rows_{i}: {missing_count}, Total rows: {total_count}, Missing ratio: {missing_ratio:.2%}")

missing_rows_1: 119, Total rows: 277586, Missing ratio: 0.04%
missing_rows_2: 119, Total rows: 277586, Missing ratio: 0.04%
missing_rows_3: 119, Total rows: 277586, Missing ratio: 0.04%
missing_rows_4: 119, Total rows: 277586, Missing ratio: 0.04%


In [13]:
# Check whether the same rows are missing across all four columns:
## build one mask that is True only where every listed column is null at once
all_specified_cols_null_mask = (
    metrics_df['Show ID'].isnull() &
    metrics_df['Publication Date'].isnull() &
    metrics_df['App/Site Name'].isnull() &
    metrics_df['Device Class'].isnull()
)
rows_where_all_specified_are_missing = metrics_df[all_specified_cols_null_mask]


## Then we compare this combined result to our individual missing_rows_X DataFrames
for i, missing_rows in enumerate(missing_rows_dfs, start=1):
    print(f"Is missing_rows_{i} identical to rows where ALL specified columns are missing?",
          missing_rows.equals(rows_where_all_specified_are_missing))

Is missing_rows_1 identical to rows where ALL specified columns are missing? True
Is missing_rows_2 identical to rows where ALL specified columns are missing? True
Is missing_rows_3 identical to rows where ALL specified columns are missing? True
Is missing_rows_4 identical to rows where ALL specified columns are missing? True


In the file "Mesures_contenu_volume_audio_à_commander.csv", 119 rows (0.04% of the data) have missing values, and they are the same 119 rows across all affected columns. On closer inspection, these rows also contain malformed numerical values and wrong titles. I therefore drop them rather than filling in placeholders.

In [14]:
# drop the rows containing empty values in 'Show ID', 'Publication Date', 'App/Site Name', 'Device Class'
metrics_df = metrics_df.dropna(subset=['Show ID', 'Publication Date', 'App/Site Name', 'Device Class']).copy()
print(f"Remaining entries after drop: {metrics_df.shape[0]}")

Remaining entries after drop: 277467


#### 3.4. Data Consistency Checks

In [15]:
metrics_df.columns

Index(['Segment ID', 'Segment', 'Show ID', 'Show', 'Publication Date',
       'App/Site Name', 'Device Class', 'Episode Length (s)', 'Media Views',
       'Avg Play Duration', 'Visitors', 'New Visit Rate %', 'Entries', 'Exits',
       'Returning Visits', 'Bounces', 'Total Play Duration'],
      dtype='object')

- "New Visit Rate %" column

In [16]:
# Many values are formatted with a decimal comma and a trailing '%' sign

## Converting the column to string
metrics_df["New Visit Rate %"] = metrics_df["New Visit Rate %"].astype(str)

## Remove the '%' character and replace ',' with '.' for decimal conversion
metrics_df["New Visit Rate %"] = metrics_df["New Visit Rate %"] \
                                 .str.replace('%', '', regex=False) \
                                 .str.replace(',', '.', regex=False)
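
As a small check on this step, the sketch below parses a few raw strings copied from the table above and flags rates outside the expected 0 to 100 range (the raw data contains values such as "103,00%"). The `out_of_range` name and the decision to flag rather than drop are assumptions for illustration.

```python
import pandas as pd

# Values as they appear in the raw 'New Visit Rate %' column
raw = pd.Series(["84,56%", "2,49%", "0,00%", "103,00%"])

# Same cleaning as above, followed by an explicit numeric conversion
rate = pd.to_numeric(
    raw.str.replace("%", "", regex=False).str.replace(",", ".", regex=False),
    errors="coerce",
)

# A rate is expected to lie in [0, 100]; flag anything outside (or unparsable) for review
out_of_range = ~rate.between(0, 100)
print(rate.tolist())
print(out_of_range.tolist())
```

Whether such rows should be capped, dropped, or kept is a business decision that the data description would need to settle.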

- "Publication Date" column

In [17]:
# Date standardization based on "DD.MM.YYYY"
metrics_df['Publication Date'] = pd.to_datetime(metrics_df['Publication Date'], format='%d.%m.%Y', errors='coerce')
# metrics_df['Publication Date']

- Converting time strings to seconds

In [18]:
# Function to convert hh:mm:ss to total seconds (hours may exceed 24, e.g. "2601:23:11")
def duration_to_seconds(duration_str):
    try:
        h, m, s = map(int, duration_str.split(':'))
        return h * 3600 + m * 60 + s
    except (AttributeError, ValueError):
        return None  # Invalid format or non-string input (e.g. NaN)

# Apply conversion to 'Avg Play Duration'
metrics_df['Avg Play Duration (s)'] = metrics_df['Avg Play Duration'].apply(duration_to_seconds)
# metrics_df['Avg Play Duration (s)']

# Apply conversion to 'Total Play Duration'
metrics_df['Total Play Duration (s)'] = metrics_df['Total Play Duration'].apply(duration_to_seconds)
# metrics_df['Total Play Duration (s)']
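
The conversion can be verified by hand on values from the table above: "94:50:23" is 94 × 3600 + 50 × 60 + 23 = 341423 seconds. A standalone sketch of the same arithmetic, where `hms_to_seconds` is a hypothetical stand-in for the function above:

```python
def hms_to_seconds(text):
    """Convert 'H:MM:SS' (hours may exceed 99) to seconds; None if malformed."""
    try:
        hours, minutes, seconds = (int(part) for part in text.split(":"))
    except (AttributeError, ValueError):
        return None
    return hours * 3600 + minutes * 60 + seconds

print(hms_to_seconds("00:05:19"))    # 319
print(hms_to_seconds("94:50:23"))    # 341423
print(hms_to_seconds("2601:23:11"))  # 9364991
print(hms_to_seconds("n/a"))         # None
```

The three numeric results match the `Avg Play Duration (s)` and `Total Play Duration (s)` values displayed for the first rows of the cleaned table.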

- Converting numerical values

In [19]:
# Numeric columns check
numeric_columns = ['Episode Length (s)', 'Media Views', 'Visitors', 'New Visit Rate %', 
                   'Entries', 'Exits', 'Returning Visits', 'Bounces',
                   'Avg Play Duration (s)', 'Total Play Duration (s)']
# Ensure columns are converted to float explicitly
metrics_df[numeric_columns] = metrics_df[numeric_columns].apply(lambda col: pd.to_numeric(col, errors='coerce')).astype(float)

# Check for numeric conversion issues
print("Numeric conversion check:")
print(metrics_df[numeric_columns].isnull().sum())

Numeric conversion check:
Episode Length (s)         0
Media Views                0
Visitors                   0
New Visit Rate %           0
Entries                    0
Exits                      0
Returning Visits           0
Bounces                    0
Avg Play Duration (s)      0
Total Play Duration (s)    0
dtype: int64


- Converting categorical values

In [20]:
# Convert selected columns to categorical type
categorical_columns = ['Segment ID', 'Segment', 'Show ID', 'Show', 'App/Site Name', 'Device Class']
metrics_df[categorical_columns] = metrics_df[categorical_columns].astype('category')

In [21]:
metrics_df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Episode Length (s),Media Views,Avg Play Duration,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Total Play Duration,Avg Play Duration (s),Total Play Duration (s)
0,14897825,Le Suisse Nemo triomphe à lEurovision avec sa ...,2031524,Le Journal horaire,2024-05-12,rts.ch,Smartphone,1234.0,20762.0,00:05:19,18877.0,84.56,9770.0,13135.0,3428.0,5181.0,94:50:23,319.0,341423.0
1,15102359,Une trombe sest formée au-dessus du lac Léman,2031524,Le Journal horaire,2024-08-18,rts.ch,Smartphone,586.0,14703.0,00:03:27,13381.0,53.30,9889.0,11505.0,6458.0,6798.0,108:13:53,207.0,389633.0
2,14572281,De Genève à Zurich: un périple sanglant en Hel...,14546712,Crimes suisses,2024-01-05,rts-app-play,Smartphone,3490.0,7327.0,00:24:41,4124.0,2.49,1527.0,1928.0,6594.0,602.0,2601:23:11,1481.0,9364991.0
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500.0,7560.0,00:06:25,7934.0,71.32,4370.0,4993.0,2671.0,2729.0,151:43:36,385.0,546216.0
4,359fc205-7470-38e0-b393-3b4a2e429508,Pourquoi les couples se séparent,6067786,Tribu,2025-04-07,rts.ch,Smartphone,1956.0,7201.0,00:08:34,7147.0,43.80,6741.0,3901.0,4808.0,4016.0,851:19:51,514.0,3064791.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277581,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,564.0,524.0,00:01:01,90.0,0.00,103.0,140.0,23.0,726.0,00:04:16,61.0,256.0
277582,96015a33-f517-3cf8-bcda-c9658dd6c844,En Douceur,14570123,En Douceur,2025-05-12,rts.ch,Smartphone,4677.0,451.0,00:04:25,772.0,103.00,141.0,802.0,695.0,687.0,00:04:11,265.0,251.0
277583,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,2025-03-28,rts-app-sport,Smartphone,814.0,438.0,00:05:24,989.0,0.00,476.0,772.0,859.0,92.0,00:00:12,324.0,12.0
277584,41568641-62b4-3596-99ce-3b8bf4d09ad8,Helveticus,12027724,Léchappée,2025-03-28,rts.ch,Smartphone,1150.0,512.0,00:05:02,289.0,103.00,1222.0,1055.0,82.0,889.0,00:00:14,302.0,14.0


In [22]:
# Dropping the column "Avg Play Duration" & "Total Play Duration" 
# as we have their values in seconds in "Avg Play Duration (s)" & "Total Play Duration (s)"
metrics_df.drop(columns=['Avg Play Duration', 'Total Play Duration'], inplace=True)

In [23]:
metrics_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277467 entries, 0 to 277585
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   Segment ID               277467 non-null  category      
 1   Segment                  277467 non-null  category      
 2   Show ID                  277467 non-null  category      
 3   Show                     277467 non-null  category      
 4   Publication Date         277467 non-null  datetime64[ns]
 5   App/Site Name            277467 non-null  category      
 6   Device Class             277467 non-null  category      
 7   Episode Length (s)       277467 non-null  float64       
 8   Media Views              277467 non-null  float64       
 9   Visitors                 277467 non-null  float64       
 10  New Visit Rate %         277467 non-null  float64       
 11  Entries                  277467 non-null  float64       
 12  Exits                

In [24]:
# Count duplicated rows (full row duplicates)
duplicate_rows = metrics_df.duplicated()
print(duplicate_rows.sum())

0


# 🧼 4. Clean and Prepare Tags

#### 4.1 Cleaning titles

In [25]:
# Clean column names (e.g., remove invisible characters)
tags_df.columns = tags_df.columns.str.strip().str.replace(r'[^\x00-\x7F]+', '', regex=True)

#### 4.2. Cleaning 'Assigned Tags'

In [26]:
# Replace the '-' placeholder with NaN
## (np.nan rather than None: older pandas releases ignored an explicit None here and forward-filled instead)
tags_df['Assigned Tags'] = tags_df['Assigned Tags'].replace('-', np.nan)

# Drop rows where 'Assigned Tags' is missing or empty after stripping whitespace
tags_df = tags_df[tags_df['Assigned Tags'].notna() & (tags_df['Assigned Tags'].astype(str).str.strip() != '')].copy()

In [27]:
# Define the five exact valid tags
valid_tags = {
    'media_radio:societe',
    'media_radio:humour',
    'media_radio:info',
    'media_radio:musique',
    'media_radio:sport'
}

In [28]:
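# Function returning the valid tags found in the full string,
# ordered by where they first appear (set iteration order is not reproducible across runs)
def match_valid_tags_in_string(tag_string, valid_tags):
    tag_string = str(tag_string).lower()
    found = [tag for tag in valid_tags if tag in tag_string]
    return sorted(found, key=tag_string.find)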

In [29]:
# Apply the matching function
tags_df['cleaned_themes'] = tags_df['Assigned Tags'].apply(lambda x: match_valid_tags_in_string(x, valid_tags))
tags_df['Primary Theme'] = tags_df['cleaned_themes'].apply(lambda tags: tags[0] if tags else None)

Ten shows carry more than one of the five themes:

    3ème mi-temps
    Dis, pourquoi?
    Émission spéciale
    Footaises
    La Matinale
    Le 12h30
    Le grand soir
    Les beaux parleurs
    Sport-Première
    The Jam

In this case, we keep the assumption that the primary theme is the first theme matched.
The alternative would be to explode each segment into one row per theme, but shows would then be repeated across categories.
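
For reference, the multi-theme alternative could look roughly like the sketch below. The toy rows and the `weight` column are assumptions, not part of this analysis; the weight is one possible way to avoid double counting a segment's metrics when it appears under several themes.

```python
import pandas as pd

# Toy tag table (invented); 'cleaned_themes' holds every matched theme per segment
toy_tags = pd.DataFrame({
    "Segment ID": ["s1", "s2", "s3"],
    "Show": ["Forum", "La Matinale", "Vertigo"],
    "cleaned_themes": [
        ["media_radio:info"],
        ["media_radio:info", "media_radio:sport"],
        [],
    ],
})

# One row per (segment, theme); segments without a theme become NaN and are dropped
exploded = (
    toy_tags.explode("cleaned_themes")
    .dropna(subset=["cleaned_themes"])
    .rename(columns={"cleaned_themes": "Theme"})
    .reset_index(drop=True)
)

# Split each segment evenly across its themes so that weighted sums stay comparable
exploded["weight"] = 1 / exploded.groupby("Segment ID")["Theme"].transform("count")
print(exploded)
```

With this layout, segment `s2` contributes half of its metrics to info and half to sport, at the cost of a less intuitive per-theme row count.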

#### 4.3 Dropping null values in 'Primary Theme'

In [30]:
# Drop rows with no 'Primary Theme': none of the five themes was found in their tags
tags_df = tags_df.dropna(subset=['Primary Theme'])

#### 4.4 Dropping unnecessary columns

In [31]:
# Drop 'Assigned Tags' and 'cleaned_themes'; they were only needed to check the extraction
tags_df = tags_df.drop(["Assigned Tags", "cleaned_themes"], axis=1)

#### 4.5. Checking duplicated rows

In [32]:
# Remove duplicated rows, keeping the first occurrence of each (Segment ID, Show, Show ID)
tags_df = tags_df.drop_duplicates(subset=['Segment ID', 'Show', 'Show ID'], keep='first')

In [33]:
tags_df

Unnamed: 0,Segment ID,Show,Show ID,Primary Theme
5,359fc205-7470-38e0-b393-3b4a2e429508,Tribu,6067786,media_radio:societe
10,973c9679-fa7e-35b5-a450-fa60781e10f4,Le 12h30,1423859,media_radio:info
22,61f0806a-251b-3c30-b3f3-0f2fb6558cc1,La Matinale,8849020,media_radio:info
26,10547a3f-6d1f-3f68-a746-7d7e96abbbca,Le 12h30,1423859,media_radio:info
45,3083d203-f816-3b44-8f77-6d1104da54fa,Forum,1784426,media_radio:info
...,...,...,...,...
107796,14614555,Forum,1784426,media_radio:info
107797,14818255,Forum,1784426,media_radio:info
107799,15228940,Forum,1784426,media_radio:info
107800,14845847,Forum,1784426,media_radio:info


In [34]:
# tags_df.to_csv("tags.csv", encoding='utf-8-sig')

# 🔗 5. Merge Datasets

#### 5.1 Merge datasets

In [35]:
# Left join on "Segment ID": a many-to-one merge, since each Segment ID appears
# several times in the metrics (once per "App/Site Name" and "Device Class")
merged_df = pd.merge(metrics_df, tags_df[['Segment ID', 'Primary Theme']],
                     on='Segment ID', how='left')

# Check merge results
merged_df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Episode Length (s),Media Views,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Avg Play Duration (s),Total Play Duration (s),Primary Theme
0,14897825,Le Suisse Nemo triomphe à lEurovision avec sa ...,2031524,Le Journal horaire,2024-05-12,rts.ch,Smartphone,1234.0,20762.0,18877.0,84.56,9770.0,13135.0,3428.0,5181.0,319.0,341423.0,
1,15102359,Une trombe sest formée au-dessus du lac Léman,2031524,Le Journal horaire,2024-08-18,rts.ch,Smartphone,586.0,14703.0,13381.0,53.30,9889.0,11505.0,6458.0,6798.0,207.0,389633.0,
2,14572281,De Genève à Zurich: un périple sanglant en Hel...,14546712,Crimes suisses,2024-01-05,rts-app-play,Smartphone,3490.0,7327.0,4124.0,2.49,1527.0,1928.0,6594.0,602.0,1481.0,9364991.0,
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500.0,7560.0,7934.0,71.32,4370.0,4993.0,2671.0,2729.0,385.0,546216.0,media_radio:info
4,359fc205-7470-38e0-b393-3b4a2e429508,Pourquoi les couples se séparent,6067786,Tribu,2025-04-07,rts.ch,Smartphone,1956.0,7201.0,7147.0,43.80,6741.0,3901.0,4808.0,4016.0,514.0,3064791.0,media_radio:societe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277462,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,564.0,524.0,90.0,0.00,103.0,140.0,23.0,726.0,61.0,256.0,media_radio:info
277463,96015a33-f517-3cf8-bcda-c9658dd6c844,En Douceur,14570123,En Douceur,2025-05-12,rts.ch,Smartphone,4677.0,451.0,772.0,103.00,141.0,802.0,695.0,687.0,265.0,251.0,
277464,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,2025-03-28,rts-app-sport,Smartphone,814.0,438.0,989.0,0.00,476.0,772.0,859.0,92.0,324.0,12.0,media_radio:info
277465,41568641-62b4-3596-99ce-3b8bf4d09ad8,Helveticus,12027724,Léchappée,2025-03-28,rts.ch,Smartphone,1150.0,512.0,289.0,103.00,1222.0,1055.0,82.0,889.0,302.0,14.0,media_radio:musique
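
A left join like this one can be sanity-checked with the `validate` and `indicator` options of `DataFrame.merge`. The sketch below uses invented toy frames (`metrics_toy`, `tags_toy`); whether the real tag table is unique per Segment ID is not established here, so `validate` may raise on the actual data.

```python
import pandas as pd

# Toy inputs (invented): s1 appears on two platforms, s3 has no tag
metrics_toy = pd.DataFrame({
    "Segment ID": ["s1", "s1", "s2", "s3"],
    "App/Site Name": ["rts.ch", "rts-app-play", "rts.ch", "rts.ch"],
})
tags_toy = pd.DataFrame({
    "Segment ID": ["s1", "s2"],
    "Primary Theme": ["media_radio:info", "media_radio:societe"],
})

checked = metrics_toy.merge(
    tags_toy,
    on="Segment ID",
    how="left",
    validate="many_to_one",  # raises MergeError if a Segment ID repeats in tags_toy
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Share of metric rows that found a theme, and a row-count check
match_share = (checked["_merge"] == "both").mean()
print(len(checked), match_share)
```

An unchanged row count confirms the join did not duplicate metric rows, and the match share shows how much of the data will survive the theme filter.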


#### 5.2. Checking duplicates after merging

In [36]:
# Count duplicated rows (full row duplicates); keep=False counts every member of a duplicate group
print(merged_df.duplicated(keep=False).sum())

# merged_df.to_csv("dups.csv", encoding='utf-8-sig')

0


#### 5.3. Filtering valid tags

In [37]:
# Keep only the rows tagged with one of the 5 themes of interest
## (reuses valid_tags from above, so adding a tag there extends this filter too)
## .copy() makes df an independent frame, so the assignments below do not touch merged_df
df = merged_df[merged_df['Primary Theme'].isin(valid_tags)].copy()

# Remove 'media_radio:' prefix from 'Primary Theme'
df['Primary Theme'] = df['Primary Theme'].str.replace('media_radio:', '', regex=False)

## changing the values to categories for consistency
df['Primary Theme'] = df['Primary Theme'].astype('category')

## display data
df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Episode Length (s),Media Views,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Avg Play Duration (s),Total Play Duration (s),Primary Theme
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500.0,7560.0,7934.0,71.32,4370.0,4993.0,2671.0,2729.0,385.0,546216.0,info
4,359fc205-7470-38e0-b393-3b4a2e429508,Pourquoi les couples se séparent,6067786,Tribu,2025-04-07,rts.ch,Smartphone,1956.0,7201.0,7147.0,43.80,6741.0,3901.0,4808.0,4016.0,514.0,3064791.0,societe
18,14760215,Cinq des six randonneurs à ski portés disparus...,8849020,La Matinale,2024-03-11,rts-app-info,Smartphone,803.0,6022.0,5304.0,0.33,2828.0,2777.0,5473.0,1367.0,350.0,248913.0,info
21,14673612,Retour sur la prise d’otage qui a eu lieu entr...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1233.0,6018.0,5844.0,72.39,2144.0,4829.0,1714.0,2543.0,149.0,670593.0,info
26,973c9679-fa7e-35b5-a450-fa60781e10f4,"Nouvel effondrement sur le glacier du Birch, a...",1423859,Le 12h30,2025-05-28,rts.ch,Smartphone,1540.0,5230.0,5073.0,62.48,3156.0,2781.0,2085.0,1181.0,110.0,219705.0,info
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277458,c13bf95f-c6f7-3259-acc7-fd95ca2d4518,"Larmée suisse encourage le port de brassières,...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,1453.0,516.0,366.0,0.00,289.0,884.0,180.0,322.0,29.0,264.0,info
277459,24226e74-ec65-3590-99f9-34b77f57430f,La Chine et les Etats-Unis repartent sur de no...,7006364,Tout un monde,2025-05-12,rts-app-sport,Smartphone,1076.0,253.0,82.0,0.00,438.0,538.0,867.0,1052.0,315.0,243.0,info
277462,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,564.0,524.0,90.0,0.00,103.0,140.0,23.0,726.0,61.0,256.0,info
277464,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,2025-03-28,rts-app-sport,Smartphone,814.0,438.0,989.0,0.00,476.0,772.0,859.0,92.0,324.0,12.0,info


#### 5.4. Setting the KPI within the dataset

For each "Primary Theme", we’ll calculate:

| KPI                           | Formula                          |
| ----------------------------- | -------------------------------- |
| 🎧 Total Media Views          | `sum(Media Views)`               |
| 👥 Total Visitors             | `sum(Visitors)`                  |
| 📈 Average New Visit Rate (%) | `mean(New Visit Rate %)`         |
| 🚪 Total Entries              | `sum(Entries)`                   |
| 🚶‍♂️ Total Exits             | `sum(Exits)`                     |
| 🔁 Total Returning Visits     | `sum(Returning Visits)`          |
| ⛔️ Total Bounces              | `sum(Bounces)`                   |
| ⏱️ Average Play Duration (s)  | `mean(Avg Play Duration (s))`    |
| ⏱️ Total Play Duration (s)    | `sum(Total Play Duration (s))`   |
| 🔁 Engagement per Visitor     | `Total Play Duration / Visitors` |
| 🧲 Acquisition Rate           | `Entries / Visitors`             |
| 📌 Retention Rate             | `Returning Visits / Visitors`    |
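
The ratio KPIs in the table above can be computed per row and then averaged, or computed once per theme from the summed columns. The two approaches can differ a lot when segments with very few visitors have extreme ratios. A minimal sketch on invented numbers:

```python
import pandas as pd

# Toy rows (invented): one large and one tiny segment in the same theme
toy = pd.DataFrame({
    "Primary Theme": ["info", "info"],
    "Entries": [500.0, 50.0],
    "Visitors": [1000.0, 10.0],
})

# Row level: ratio per segment, then averaged (every row counts equally)
toy["Acquisition Rate %"] = toy["Entries"] / toy["Visitors"] * 100
mean_of_ratios = toy.groupby("Primary Theme")["Acquisition Rate %"].mean()

# Theme level: sum first, then divide (rows are weighted by their visitors)
sums = toy.groupby("Primary Theme")[["Entries", "Visitors"]].sum()
ratio_of_sums = sums["Entries"] / sums["Visitors"] * 100

print(mean_of_ratios["info"])  # 275.0
print(ratio_of_sums["info"])   # roughly 54.46
```

The tiny segment pulls the row-level mean to 275%, while the theme-level ratio stays near 54%. This is worth keeping in mind when comparing row-level and theme-level figures for the same KPI.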


In [38]:
df["Engagement per Visitor (s)"] = df["Total Play Duration (s)"] / df["Visitors"]
df["Acquisition Rate %"] = df["Entries"] / df["Visitors"] * 100
df["Retention Rate %"] = df["Returning Visits"] / df["Visitors"] * 100


#### 5.5. Exporting final dataset

In [39]:
# Generating a file to keep for further investigation
df.to_csv("../data/rts_data_metrics_tags.csv", encoding='utf-8-sig', index=False)

In [40]:
df.shape

(78312, 21)

In [41]:
df.describe()

Unnamed: 0,Publication Date,Episode Length (s),Media Views,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Avg Play Duration (s),Total Play Duration (s),Engagement per Visitor (s),Acquisition Rate %,Retention Rate %
count,78312,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0,78312.0
mean,2024-11-26 12:33:40.104198656,2395.841799,351.262412,685.017596,27.077243,633.771976,647.440737,679.277263,626.46989,613.88293,45170.7,77.355297,194.311621,201.895948
min,2024-01-01 00:00:00,24.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2024-08-19 00:00:00,814.0,194.0,370.0,0.0,323.0,338.0,363.0,318.0,234.0,2197.0,3.798315,47.155916,53.305083
50%,2025-01-14 00:00:00,1342.0,338.0,683.0,22.875,633.0,645.0,676.0,625.0,364.0,6709.0,12.712092,92.917257,99.19968
75%,2025-03-23 00:00:00,2201.0,477.0,993.0,50.5,945.0,955.0,986.0,935.0,610.0,22099.5,44.166775,172.127458,183.120408
max,2025-05-31 00:00:00,22947.0,7560.0,7934.0,103.0,6741.0,4993.0,5473.0,4016.0,39130.0,4658323.0,7220.038462,106600.0,48800.0
std,,2906.555635,223.421458,377.950429,28.640365,361.50865,362.446133,377.035979,358.106729,735.509058,176887.1,242.634007,695.616214,589.671485


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 78312 entries, 3 to 277465
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Segment ID                  78312 non-null  object        
 1   Segment                     78312 non-null  category      
 2   Show ID                     78312 non-null  category      
 3   Show                        78312 non-null  category      
 4   Publication Date            78312 non-null  datetime64[ns]
 5   App/Site Name               78312 non-null  category      
 6   Device Class                78312 non-null  category      
 7   Episode Length (s)          78312 non-null  float64       
 8   Media Views                 78312 non-null  float64       
 9   Visitors                    78312 non-null  float64       
 10  New Visit Rate %            78312 non-null  float64       
 11  Entries                     78312 non-null  float64       

In [51]:
df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Episode Length (s),Media Views,Visitors,...,Entries,Exits,Returning Visits,Bounces,Avg Play Duration (s),Total Play Duration (s),Primary Theme,Engagement per Visitor (s),Acquisition Rate %,Retention Rate %
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500.0,7560.0,7934.0,...,4370.0,4993.0,2671.0,2729.0,385.0,546216.0,info,68.844971,55.079405,33.665238
4,359fc205-7470-38e0-b393-3b4a2e429508,Pourquoi les couples se séparent,6067786,Tribu,2025-04-07,rts.ch,Smartphone,1956.0,7201.0,7147.0,...,6741.0,3901.0,4808.0,4016.0,514.0,3064791.0,societe,428.822023,94.319295,67.272982
18,14760215,Cinq des six randonneurs à ski portés disparus...,8849020,La Matinale,2024-03-11,rts-app-info,Smartphone,803.0,6022.0,5304.0,...,2828.0,2777.0,5473.0,1367.0,350.0,248913.0,info,46.929299,53.318250,103.186275
21,14673612,Retour sur la prise d’otage qui a eu lieu entr...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1233.0,6018.0,5844.0,...,2144.0,4829.0,1714.0,2543.0,149.0,670593.0,info,114.748973,36.687201,29.329227
26,973c9679-fa7e-35b5-a450-fa60781e10f4,"Nouvel effondrement sur le glacier du Birch, a...",1423859,Le 12h30,2025-05-28,rts.ch,Smartphone,1540.0,5230.0,5073.0,...,3156.0,2781.0,2085.0,1181.0,110.0,219705.0,info,43.308693,62.211709,41.099941
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277458,c13bf95f-c6f7-3259-acc7-fd95ca2d4518,"Larmée suisse encourage le port de brassières,...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,1453.0,516.0,366.0,...,289.0,884.0,180.0,322.0,29.0,264.0,info,0.721311,78.961749,49.180328
277459,24226e74-ec65-3590-99f9-34b77f57430f,La Chine et les Etats-Unis repartent sur de no...,7006364,Tout un monde,2025-05-12,rts-app-sport,Smartphone,1076.0,253.0,82.0,...,438.0,538.0,867.0,1052.0,315.0,243.0,info,2.963415,534.146341,1057.317073
277462,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,564.0,524.0,90.0,...,103.0,140.0,23.0,726.0,61.0,256.0,info,2.844444,114.444444,25.555556
277464,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,2025-03-28,rts-app-sport,Smartphone,814.0,438.0,989.0,...,476.0,772.0,859.0,92.0,324.0,12.0,info,0.012133,48.129424,86.855410


# 📊 6. Compute KPIs

In [43]:
# Convert numeric columns to ensure they aggregate properly
kpi_columns = [
    'Media Views', 'Visitors', 'New Visit Rate %', 'Entries', 'Exits',
    'Returning Visits', 'Bounces', 'Avg Play Duration (s)', 'Total Play Duration (s)'
]
df[kpi_columns] = df[kpi_columns].apply(pd.to_numeric, errors='coerce')

In [44]:
# Group by Primary Theme and compute aggregations
theme_kpis = df.groupby('Primary Theme').agg({
    'Media Views': 'sum',
    'Visitors': 'sum',
    'New Visit Rate %': 'mean',
    'Entries': 'sum',
    'Exits': 'sum',
    'Returning Visits': 'sum',
    'Bounces': 'sum',
    'Avg Play Duration (s)': 'mean',
    'Total Play Duration (s)': 'sum'
}).reset_index()
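
The `'mean'` aggregations above weight every segment equally, so a segment with 10 visitors moves `New Visit Rate %` as much as one with 10,000. A minimal sketch of a visitor-weighted alternative follows. The toy frame and the helper name `weighted_mean` are hypothetical, and weighting by `Visitors` is an assumption rather than something the assignment specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two themes, two segments each
toy = pd.DataFrame({
    'Primary Theme': ['info', 'info', 'sport', 'sport'],
    'Visitors': [10_000, 10, 500, 500],
    'New Visit Rate %': [20.0, 80.0, 30.0, 50.0],
})

def weighted_mean(group, value_col, weight_col):
    """Average of value_col weighted by weight_col (hypothetical helper)."""
    return np.average(group[value_col], weights=group[weight_col])

# Unweighted mean: each segment counts the same
simple = toy.groupby('Primary Theme')['New Visit Rate %'].mean()

# Visitor-weighted mean: large segments dominate
weighted = pd.Series({
    theme: weighted_mean(group, 'New Visit Rate %', 'Visitors')
    for theme, group in toy.groupby('Primary Theme')
})

print(simple)
print(weighted)
```

In the toy data, the unweighted mean for `info` is pulled to 50 by a 10-visitor segment, while the weighted mean stays close to 20. Note that `np.average` raises an error when the weights of a group sum to zero.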

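Derived KPIs, computed on the per-theme totals:
- Engagement per Visitor = Total Play Duration (s) / Visitors, in seconds per visitor
- Acquisition Rate = Entries / Visitors
- Retention Rate = Returning Visits / Visitors

Acquisition Rate and Retention Rate are ratios of summed counts, not bounded shares: in the results below, Retention Rate exceeds 1 for humour, musique, and sport.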

In [45]:
# Add calculated KPIs
theme_kpis['Engagement per Visitor'] = theme_kpis['Total Play Duration (s)'] / theme_kpis['Visitors']
theme_kpis['Acquisition Rate'] = theme_kpis['Entries'] / theme_kpis['Visitors']
theme_kpis['Retention Rate'] = theme_kpis['Returning Visits'] / theme_kpis['Visitors']

In [46]:
# Min-max normalization (optional)
for col in ['Engagement per Visitor', 'Acquisition Rate', 'Retention Rate']:
    theme_kpis[f'norm_{col}'] = (theme_kpis[col] - theme_kpis[col].min()) / (theme_kpis[col].max() - theme_kpis[col].min())
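
The normalization above divides by `max - min`, which is zero when every theme has the same value, and the result is then NaN. A minimal sketch of a guarded version follows, assuming a constant column should map to 0. That choice and the name `min_max` are assumptions.

```python
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    """Scale a series to [0, 1]; return zeros when all values are equal (hypothetical helper)."""
    span = series.max() - series.min()
    if span == 0:
        # No spread to normalize: avoid 0/0 and return a constant column
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span

scaled = min_max(pd.Series([26.1, 40.9, 279.5]))  # smallest maps to 0, largest to 1
flat = min_max(pd.Series([5.0, 5.0, 5.0]))        # constant input maps to all zeros
print(scaled.tolist(), flat.tolist())
```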

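Reference cut-offs on an absolute scale, with the rationale for each:

| Threshold                | Rationale                                                    |
| ------------------------ | ------------------------------------------------------------ |
| `Engagement > 180`       | 180 seconds = 3 minutes, a strong signal of attention        |
| `Retention Rate > 0.3`   | A return ratio of 30% or more is solid for editorial content |
| `Acquisition Rate > 0.4` | Entries at 40% or more of visitors show discovery value      |
| `Engagement < 60`        | Under 1 minute per visitor usually indicates disinterest     |

The code below does not apply these fixed values; it derives its thresholds from the quartiles of the per-theme KPIs. In the results, every theme has an Acquisition Rate above 0.89 and a Retention Rate above 0.98, so the fixed 0.4 and 0.3 cut-offs would not separate the themes.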


In [47]:
thresholds = {
    'high_engagement': theme_kpis['Engagement per Visitor'].quantile(0.75),
    'low_engagement': theme_kpis['Engagement per Visitor'].quantile(0.25),
    'high_retention': theme_kpis['Retention Rate'].quantile(0.75),
    'high_acquisition': theme_kpis['Acquisition Rate'].quantile(0.75),
}
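
With only five themes, the quartile thresholds coincide with observed values: pandas' default linear interpolation places the 0.75 quantile at sorted position 0.75 × (5 − 1) = 3, the second-highest value, and the 0.25 quantile at position 1, the second-lowest. The sketch below checks this with the `Engagement per Visitor` values copied from the results table. The variable names are illustrative.

```python
import pandas as pd

# Engagement per Visitor per theme, copied from the theme_kpis output
engagement = pd.Series(
    [279.465337, 40.873542, 76.327117, 137.784073, 26.104479],
    index=['humour', 'info', 'musique', 'societe', 'sport'],
)

high = engagement.quantile(0.75)  # lands on the second-highest value (societe)
low = engagement.quantile(0.25)   # lands on the second-lowest value (info)

# Strict comparisons, as in the recommendation rules
above_high = engagement[engagement > high].index.tolist()
below_low = engagement[engagement < low].index.tolist()
print(high, low, above_high, below_low)
```

Because the comparisons are strict, only the top theme can exceed a 0.75 threshold and only the bottom theme can fall below the 0.25 threshold, barring ties. The rules therefore rank themes against each other; they do not measure them against an absolute standard.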

In [48]:
def recommend(row):
    """Map one theme's KPIs to a recommendation; the first matching rule wins."""
    # Top-quartile engagement and top-quartile retention
    if row['Engagement per Visitor'] > thresholds['high_engagement'] and row['Retention Rate'] > thresholds['high_retention']:
        return "Boost Production"
    # Top-quartile acquisition with engagement above the bottom quartile
    elif row['Acquisition Rate'] > thresholds['high_acquisition'] and row['Engagement per Visitor'] > thresholds['low_engagement']:
        return "Maintain & Monitor"
    # Bottom-quartile engagement
    elif row['Engagement per Visitor'] < thresholds['low_engagement']:
        return "Review Content Strategy"
    else:
        return "Needs Further Analysis"

In [49]:
theme_kpis['Recommendation'] = theme_kpis.apply(recommend, axis=1)
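
The row-wise `apply` above is fine for five themes. On larger frames, the same first-match-wins rules can be written with `numpy.select`, which returns the label of the first condition that holds for each row. This is a sketch using three KPI columns copied from the results table; the frame name `kpis` and the threshold variable names are illustrative.

```python
import numpy as np
import pandas as pd

# KPI values copied from the theme_kpis output
kpis = pd.DataFrame({
    'Primary Theme': ['humour', 'info', 'musique', 'societe', 'sport'],
    'Engagement per Visitor': [279.465337, 40.873542, 76.327117, 137.784073, 26.104479],
    'Acquisition Rate': [0.893592, 0.912784, 0.986524, 0.943588, 0.986959],
    'Retention Rate': [1.043018, 0.98406, 1.000164, 0.997696, 1.035188],
})

eng = kpis['Engagement per Visitor']
acq = kpis['Acquisition Rate']
ret = kpis['Retention Rate']

# Same quartile-based thresholds as in the notebook
hi_eng, lo_eng = eng.quantile(0.75), eng.quantile(0.25)
hi_ret, hi_acq = ret.quantile(0.75), acq.quantile(0.75)

# Conditions are listed in the same order as the if/elif chain
conditions = [
    (eng > hi_eng) & (ret > hi_ret),
    (acq > hi_acq) & (eng > lo_eng),
    eng < lo_eng,
]
labels = ["Boost Production", "Maintain & Monitor", "Review Content Strategy"]

kpis['Recommendation'] = np.select(conditions, labels, default="Needs Further Analysis")
print(kpis[['Primary Theme', 'Recommendation']])
```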

In [50]:
theme_kpis

Unnamed: 0,Primary Theme,Media Views,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Avg Play Duration (s),Total Play Duration (s),Engagement per Visitor,Acquisition Rate,Retention Rate,norm_Engagement per Visitor,norm_Acquisition Rate,norm_Retention Rate,Recommendation
0,humour,1949133.0,3275922.0,21.387552,2927337.0,3074072.0,3416846.0,2887076.0,864.24261,915506600.0,279.465337,0.893592,1.043018,1.0,0.0,1.0,Boost Production
1,info,19951723.0,38662597.0,25.787312,35290596.0,36248585.0,38046311.0,35021585.0,372.30671,1580277000.0,40.873542,0.912784,0.98406,0.058293,0.205554,0.0,Needs Further Analysis
2,musique,3547351.0,7653360.0,32.978826,7550225.0,7527588.0,7654616.0,7375975.0,1519.311485,584158900.0,76.327117,0.986524,1.000164,0.198226,0.995339,0.273146,Needs Further Analysis
3,societe,1627968.0,3148810.0,33.056259,2971178.0,2968424.0,3141555.0,2891233.0,961.50825,433855900.0,137.784073,0.943588,0.997696,0.440793,0.535472,0.231283,Needs Further Analysis
4,sport,431887.0,904409.0,26.981917,892615.0,883710.0,936233.0,884241.0,549.851389,23609130.0,26.104479,0.986959,1.035188,0.0,1.0,0.867186,Review Content Strategy
