## RTS Data Analyst take-home assignment

🔍 Step 1: Understand the Objective
You need to:
- Recommend volume per content (i.e., how much to produce) for each of 5 themes: info, sport, musique, societe, humour
- Help the business understand the function each theme serves: acquisition, retention, or loyalty
- Communicate this with clear insights and visuals

🧹 Step 2: Data Exploration & Cleaning
📁 Files Overview
You have:

1. Mesures_contenu_volume_audio_à_commander.csv: performance metrics for segments (episodes)
2. Correspondance_show_segment_tag.csv: mapping from episodes to tags (themes)

Actions:
- Load both CSVs
- Join them on segment ID
- Filter to:
  - Date from Jan 2025 onward (for tags to be valid)
  - Platform = "rts.ch" (only this has theme data)
  - Keep only the 5 themes of interest via assigned tags

📊 Step 3: Define Key Metrics per Theme
For each of the 5 themes, calculate:

- Number of segments
- Total play duration (overall consumption)
- Average play duration per segment
- Visitors: how many users reached the content
- New visit rate: to estimate acquisition
- Returning visits and bounces: for retention vs. bounce
- Entries and Exits: to see if the segment starts or ends visits

This will let you profile each theme:
- Acquisition = High entries + High new visit rate
- Retention = Low bounce + Long play duration
- Loyalty = High returning visitors + Low exit
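
One possible way to turn these profiles into numbers is sketched below. It is an assumption-laden illustration: the rows are invented, counts are normalised by visitors (my choice, not something the brief specifies), and "high" or "low" is expressed as a percentile rank across themes.

```python
import pandas as pd

# Hypothetical segment-level rows; numbers are made up for illustration
df = pd.DataFrame({
    "Primary Theme": ["info", "info", "musique", "musique", "humour", "humour"],
    "Visitors": [1000, 800, 300, 200, 500, 400],
    "Entries": [900, 700, 100, 80, 200, 150],
    "Exits": [850, 650, 60, 50, 300, 250],
    "Bounces": [600, 500, 40, 30, 250, 200],
    "Returning Visits": [200, 150, 250, 180, 150, 100],
    "New Visit Rate %": [80.0, 75.0, 10.0, 15.0, 40.0, 45.0],
    "Avg Play Duration (s)": [120, 150, 900, 800, 200, 180],
})

# Aggregate the key metrics per theme
per_theme = df.groupby("Primary Theme").agg(
    segments=("Visitors", "size"),
    visitors=("Visitors", "sum"),
    entries=("Entries", "sum"),
    exits=("Exits", "sum"),
    bounces=("Bounces", "sum"),
    returning=("Returning Visits", "sum"),
    new_visit_rate=("New Visit Rate %", "mean"),
    avg_play_s=("Avg Play Duration (s)", "mean"),
)

# Normalise counts by visitors so themes of different sizes are comparable
rates = pd.DataFrame({
    "entry_rate": per_theme["entries"] / per_theme["visitors"],
    "exit_rate": per_theme["exits"] / per_theme["visitors"],
    "bounce_rate": per_theme["bounces"] / per_theme["visitors"],
    "returning_rate": per_theme["returning"] / per_theme["visitors"],
    "new_visit_rate": per_theme["new_visit_rate"],
    "avg_play_s": per_theme["avg_play_s"],
})

# Percentile ranks in (0, 1]; a higher score means a stronger signal for that role
pct = rates.rank(pct=True)
scores = pd.DataFrame({
    "acquisition": (pct["entry_rate"] + pct["new_visit_rate"]) / 2,
    "retention": ((1 - pct["bounce_rate"]) + pct["avg_play_s"]) / 2,
    "loyalty": (pct["returning_rate"] + (1 - pct["exit_rate"])) / 2,
})
scores["dominant_role"] = scores.idxmax(axis=1)
print(scores)
```

Rank-based scores are relative: they say which theme leans most towards a role compared with the others, not whether it performs well in absolute terms.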

### 0. Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import plotly.express as px
import plotly.graph_objects as go

### 1. Data Ingestion

#### 1.1. Mesures_contenu_volume_audio_à_commander.csv

In [2]:
# Load the CSV file
path_volume = "../data/Mesures_contenu_volume_audio_à_commander.csv"
metrics_df = pd.read_csv(path_volume, sep=';', encoding='utf-8')

# Show the first and last rows for context
metrics_df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Segment Length,Media Views,Avg Play Duration,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Total Play Duration
0,14897825,Le Suisse Nemo triomphe à lEurovision avec sa ...,2031524,Le Journal horaire,12.05.2024,rts.ch,Smartphone,1234,20762,00:05:19,18877,"84,56%",9770,13135,3428,5181,94:50:23
1,15102359,Une trombe sest formée au-dessus du lac Léman,2031524,Le Journal horaire,18.08.2024,rts.ch,Smartphone,586,14703,00:03:27,13381,"53,30%",9889,11505,6458,6798,108:13:53
2,14572281,De Genève à Zurich: un périple sanglant en Hel...,14546712,Crimes suisses,05.01.2024,rts-app-play,Smartphone,3490,7327,00:24:41,4124,"2,49%",1527,1928,6594,602,2601:23:11
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,09.02.2024,rts.ch,Smartphone,1500,7560,00:06:25,7934,"71,32%",4370,4993,2671,2729,151:43:36
4,359fc205-7470-38e0-b393-3b4a2e429508,Pourquoi les couples se séparent,6067786,Tribu,07.04.2025,rts.ch,Smartphone,1956,7201,00:08:34,7147,"43,80%",6741,3901,4808,4016,851:19:51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277581,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,12.05.2025,rts-app-sport,Smartphone,564,524,00:01:01,90,"0,00%",103,140,23,726,00:04:16
277582,96015a33-f517-3cf8-bcda-c9658dd6c844,En Douceur,14570123,En Douceur,12.05.2025,rts.ch,Smartphone,4677,451,00:04:25,772,"103,00%",141,802,695,687,00:04:11
277583,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,28.03.2025,rts-app-sport,Smartphone,814,438,00:05:24,989,"0,00%",476,772,859,92,00:00:12
277584,41568641-62b4-3596-99ce-3b8bf4d09ad8,Helveticus,12027724,Léchappée,28.03.2025,rts.ch,Smartphone,1150,512,00:05:02,289,"103,00%",1222,1055,82,889,00:00:14


In [3]:
metrics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277586 entries, 0 to 277585
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Segment ID           277586 non-null  object
 1   Segment              277586 non-null  object
 2   Show ID              277467 non-null  object
 3   Show                 277467 non-null  object
 4   Publication Date     277467 non-null  object
 5   App/Site Name        277467 non-null  object
 6   Device Class         277467 non-null  object
 7   Segment Length       277586 non-null  int64 
 8   Media Views          277586 non-null  int64 
 9   Avg Play Duration    277586 non-null  object
 10  Visitors             277586 non-null  int64 
 11  New Visit Rate %     277586 non-null  object
 12  Entries              277586 non-null  int64 
 13  Exits                277586 non-null  int64 
 14  Returning Visits     277586 non-null  int64 
 15  Bounces              277586 non-nu

In [4]:
metrics_df.describe()

Unnamed: 0,Segment Length,Media Views,Visitors,Entries,Exits,Returning Visits,Bounces
count,277586.0,277586.0,277586.0,277586.0,277586.0,277586.0,277586.0
mean,2266.756205,328.131336,667.504734,630.95284,639.351001,661.112099,624.488497
std,2800.624729,215.137803,375.419407,361.20359,362.775949,373.208065,358.03318
min,6.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,783.0,178.0,353.0,321.0,328.0,346.0,316.0
50%,1310.0,319.0,666.0,632.0,639.0,659.0,623.0
75%,2131.0,457.0,976.0,941.0,949.0,969.0,934.0
max,22957.0,20762.0,18877.0,9889.0,13135.0,6594.0,6798.0


In [5]:
metrics_df.isnull().sum()

Segment ID               0
Segment                  0
Show ID                119
Show                   119
Publication Date       119
App/Site Name          119
Device Class           119
Segment Length           0
Media Views              0
Avg Play Duration        0
Visitors                 0
New Visit Rate %         0
Entries                  0
Exits                    0
Returning Visits         0
Bounces                  0
Total Play Duration      0
dtype: int64

In [6]:
# Count duplicated rows
duplicate_count = metrics_df.duplicated().sum()
print(f"Number of duplicated rows: {duplicate_count}")

Number of duplicated rows: 0


#### 1.2. Correspondance_show_segment_tag.csv

In [7]:
# Load the CSV file
path_tags = "../data/Correspondance_show_segment_tag.csv"
tags_df = pd.read_csv(path_tags, sep=';', encoding='utf-8')

# Show rows for context
tags_df

Unnamed: 0,Segment ID,Show,Show ID,Assigned Tags
0,14897825,Le Journal horaire,2031524,-
1,15102359,Le Journal horaire,2031524,-
2,15112045,Crimes suisses,14546712,-
3,14689374,La Matinale,8849020,-
4,15126915,Vertigo,4197907,-
...,...,...,...,...
107797,14818255,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...
107798,14851148,Forum,1784426,media_radio:la-1ere_rts_info:rts_info_media_tv...
107799,15228940,Forum,1784426,media_radio:media_radio_rts_info:monde_rts_inf...
107800,14845847,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...


In [8]:
tags_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107802 entries, 0 to 107801
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Segment ID     107802 non-null  object
 1   Show           107802 non-null  object
 2   Show ID        107802 non-null  object
 3   Assigned Tags  107802 non-null  object
dtypes: object(4)
memory usage: 3.3+ MB


In [9]:
tags_df.describe()

Unnamed: 0,Segment ID,Show,Show ID,Assigned Tags
count,107802,107802,107802,107802
unique,80731,491,470,2360
top,15447445,Le Journal horaire,2031524,-
freq,9,18894,18971,59253


In [10]:
tags_df.isnull().sum()

Segment ID       0
Show             0
Show ID          0
Assigned Tags    0
dtype: int64

In [11]:
# Count duplicated rows
duplicate_count = tags_df.duplicated().sum()
print(f"Number of duplicated rows: {duplicate_count}")

Number of duplicated rows: 0


### 2. Data Cleaning

#### 2.1.1. Cleaning titles

In [12]:
# Strip surrounding whitespace and remove non-ASCII characters from column names
metrics_df.columns = metrics_df.columns.str.strip().str.replace(r'[^\x00-\x7F]+', '', regex=True)

#### 2.1.2. Removing duplicated rows

In [13]:
# Count duplicated rows (full row duplicates)
duplicate_rows = metrics_df.duplicated()
duplicate_count = duplicate_rows.sum()
print(duplicate_count)

0


#### 2.1.3. Missing values

In [14]:
# Identify the rows with missing values
## Affected columns: 'Show ID', 'Show', 'Publication Date', 'App/Site Name', 'Device Class'
missing_rows_1 = metrics_df[metrics_df['Show ID'].isnull()]
missing_rows_2 = metrics_df[metrics_df['Publication Date'].isnull()]
missing_rows_3 = metrics_df[metrics_df['App/Site Name'].isnull()]
missing_rows_4 = metrics_df[metrics_df['Device Class'].isnull()]

## Count the missing rows per column and their share of the dataset
missing_rows_dfs = [missing_rows_1, missing_rows_2, missing_rows_3, missing_rows_4]
total_count = len(metrics_df)
for i, missing_rows in enumerate(missing_rows_dfs, start=1):
    missing_count = len(missing_rows)
    missing_ratio = missing_count / total_count
    print(f"missing_rows_{i}: {missing_count}, Total rows: {total_count}, Missing ratio: {missing_ratio:.2%}")

missing_rows_1: 119, Total rows: 277586, Missing ratio: 0.04%
missing_rows_2: 119, Total rows: 277586, Missing ratio: 0.04%
missing_rows_3: 119, Total rows: 277586, Missing ratio: 0.04%
missing_rows_4: 119, Total rows: 277586, Missing ratio: 0.04%


In [15]:
# Check whether the same rows are missing across all of these columns
## Build the intersection of the null masks: "Is this row null in ALL specified columns simultaneously?"
all_specified_cols_null_mask = (
    metrics_df['Show ID'].isnull() &
    metrics_df['Publication Date'].isnull() &
    metrics_df['App/Site Name'].isnull() &
    metrics_df['Device Class'].isnull()
)
rows_where_all_specified_are_missing = metrics_df[all_specified_cols_null_mask]


## Then compare this combined result to each individual missing_rows_X DataFrame
for i, missing_rows in enumerate(missing_rows_dfs, start=1):
    print(f"Is missing_rows_{i} identical to rows where ALL specified columns are missing?",
          missing_rows.equals(rows_where_all_specified_are_missing))

Is missing_rows_1 identical to rows where ALL specified columns are missing? True
Is missing_rows_2 identical to rows where ALL specified columns are missing? True
Is missing_rows_3 identical to rows where ALL specified columns are missing? True
Is missing_rows_4 identical to rows where ALL specified columns are missing? True


In the file "Mesures_contenu_volume_audio_à_commander.csv", the same 119 rows (0.04% of the data) are missing 'Show ID', 'Publication Date', 'App/Site Name', and 'Device Class'. On further inspection, these rows also contain malformed numeric values and incorrect titles. I will therefore drop them rather than fill them with placeholders.
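
The same check can be written more compactly by comparing an "any column missing" mask with an "all columns missing" mask. The sketch below runs on a made-up three-row frame (`toy` is hypothetical); if the two masks are equal, the missing values are confined to the same rows.

```python
import numpy as np
import pandas as pd

cols = ["Show ID", "Publication Date", "App/Site Name", "Device Class"]

# Toy frame: row 1 is missing all four columns, the other rows none
toy = pd.DataFrame({
    "Segment ID": ["a", "b", "c"],
    "Show ID": ["1", np.nan, "3"],
    "Publication Date": ["07.04.2025", np.nan, "12.05.2025"],
    "App/Site Name": ["rts.ch", np.nan, "rts.ch"],
    "Device Class": ["Smartphone", np.nan, "Smartphone"],
})

null_mask = toy[cols].isnull()
any_missing = null_mask.any(axis=1)   # at least one of the columns is null
all_missing = null_mask.all(axis=1)   # every one of the columns is null

# If both masks match, no row is only partially missing
same_rows = any_missing.equals(all_missing)
print(f"Rows with any missing: {any_missing.sum()}, all missing: {all_missing.sum()}")
print(f"Missingness confined to the same rows: {same_rows}")

# Drop the affected rows
cleaned = toy.loc[~any_missing].copy()
```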

In [16]:
# drop the rows containing empty values in 'Show ID', 'Publication Date', 'App/Site Name', 'Device Class'
metrics_df = metrics_df.dropna(subset=['Show ID', 'Publication Date', 'App/Site Name', 'Device Class']).copy()
print(f"Remaining entries after drop: {metrics_df.shape[0]}")

Remaining entries after drop: 277467


#### 2.1.4. Data Consistency Checks

In [17]:
metrics_df.columns

Index(['Segment ID', 'Segment', 'Show ID', 'Show', 'Publication Date',
       'App/Site Name', 'Device Class', 'Segment Length', 'Media Views',
       'Avg Play Duration', 'Visitors', 'New Visit Rate %', 'Entries', 'Exits',
       'Returning Visits', 'Bounces', 'Total Play Duration'],
      dtype='object')

- "New Visit Rate %" column

In [18]:
# Many values are formatted with a decimal comma and a trailing '%' sign

## Converting the column to string
metrics_df["New Visit Rate %"] = metrics_df["New Visit Rate %"].astype(str)

## Remove the '%' character and replace ',' with '.' for decimal conversion
metrics_df["New Visit Rate %"] = metrics_df["New Visit Rate %"] \
                                 .str.replace('%', '', regex=False) \
                                 .str.replace(',', '.', regex=False)
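
The displayed rows include rates such as "103,00%", which exceed 100%. A small range check after parsing can surface such values before they feed into the acquisition metric. This is a sketch on a made-up Series; how to treat out-of-range rates (cap, drop, or keep) is left open.

```python
import pandas as pd

# Illustrative raw values, including one above 100% and one unparseable entry
raw = pd.Series(["84,56%", "0,00%", "103,00%", "n/a"], name="New Visit Rate %")

# Strip the '%' sign, switch to a decimal point, and coerce failures to NaN
rate = pd.to_numeric(
    raw.str.replace("%", "", regex=False).str.replace(",", ".", regex=False),
    errors="coerce",
)

unparsed = rate.isna()
out_of_range = rate.notna() & ~rate.between(0, 100)

print(f"Unparsed values: {unparsed.sum()}")
print(f"Values outside 0-100: {out_of_range.sum()}")
```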

- "Publication Date" column

In [19]:
# Date standardization based on "DD.MM.YYYY"
metrics_df['Publication Date'] = pd.to_datetime(metrics_df['Publication Date'], format='%d.%m.%Y', errors='coerce')
# metrics_df['Publication Date']

- Converting time strings to seconds

In [20]:
# Function to convert hh:mm:ss to total seconds
def duration_to_seconds(duration_str):
    try:
        h, m, s = map(int, duration_str.split(':'))
        return h * 3600 + m * 60 + s
    except (AttributeError, ValueError):
        return None  # Non-string input or malformed duration

# Apply conversion to 'Avg Play Duration'
metrics_df['Avg Play Duration (s)'] = metrics_df['Avg Play Duration'].apply(duration_to_seconds)
# metrics_df['Avg Play Duration (s)']

# Apply conversion to 'Total Play Duration'
metrics_df['Total Play Duration (s)'] = metrics_df['Total Play Duration'].apply(duration_to_seconds)
# metrics_df['Total Play Duration (s)']
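
With roughly 277,000 rows, a vectorised conversion may be faster than a row-wise `apply`. The sketch below is an alternative, not the method used above; `durations_to_seconds` is a hypothetical helper name. Hours are allowed to exceed 24, as in "94:50:23".

```python
import pandas as pd

def durations_to_seconds(series: pd.Series) -> pd.Series:
    """Convert 'h:mm:ss' strings to seconds; malformed values become NaN."""
    # Capture hours (any number of digits), minutes, and seconds
    parts = series.astype(str).str.extract(r"^(\d+):(\d{2}):(\d{2})$")
    parts = parts.apply(pd.to_numeric, errors="coerce")
    return parts[0] * 3600 + parts[1] * 60 + parts[2]

sample = pd.Series(["94:50:23", "00:05:19", "bad"])
converted = durations_to_seconds(sample)
print(converted)
```

The first two values reproduce the seconds shown in the cleaned table for the same durations (341423 and 319).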

- Converting numerical values

In [21]:
# Numeric columns check
numeric_columns = ['Media Views', 'Visitors', 'New Visit Rate %', 
                   'Entries', 'Exits', 'Returning Visits', 'Bounces',
                   'Avg Play Duration (s)', 'Total Play Duration (s)']
# Ensure columns are converted to float explicitly
metrics_df[numeric_columns] = metrics_df[numeric_columns].apply(lambda col: pd.to_numeric(col, errors='coerce')).astype(float)

# Check for numeric conversion issues
print("Numeric conversion check:")
print(metrics_df[numeric_columns].isnull().sum())

Numeric conversion check:
Media Views                0
Visitors                   0
New Visit Rate %           0
Entries                    0
Exits                      0
Returning Visits           0
Bounces                    0
Avg Play Duration (s)      0
Total Play Duration (s)    0
dtype: int64


- Converting categorical values

In [22]:
# Convert selected columns to categorical type
categorical_columns = ['Segment ID', 'Segment', 'Show ID', 'Show', 'App/Site Name', 'Device Class']
metrics_df[categorical_columns] = metrics_df[categorical_columns].astype('category')

In [23]:
metrics_df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Segment Length,Media Views,Avg Play Duration,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Total Play Duration,Avg Play Duration (s),Total Play Duration (s)
0,14897825,Le Suisse Nemo triomphe à lEurovision avec sa ...,2031524,Le Journal horaire,2024-05-12,rts.ch,Smartphone,1234,20762.0,00:05:19,18877.0,84.56,9770.0,13135.0,3428.0,5181.0,94:50:23,319.0,341423.0
1,15102359,Une trombe sest formée au-dessus du lac Léman,2031524,Le Journal horaire,2024-08-18,rts.ch,Smartphone,586,14703.0,00:03:27,13381.0,53.30,9889.0,11505.0,6458.0,6798.0,108:13:53,207.0,389633.0
2,14572281,De Genève à Zurich: un périple sanglant en Hel...,14546712,Crimes suisses,2024-01-05,rts-app-play,Smartphone,3490,7327.0,00:24:41,4124.0,2.49,1527.0,1928.0,6594.0,602.0,2601:23:11,1481.0,9364991.0
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500,7560.0,00:06:25,7934.0,71.32,4370.0,4993.0,2671.0,2729.0,151:43:36,385.0,546216.0
4,359fc205-7470-38e0-b393-3b4a2e429508,Pourquoi les couples se séparent,6067786,Tribu,2025-04-07,rts.ch,Smartphone,1956,7201.0,00:08:34,7147.0,43.80,6741.0,3901.0,4808.0,4016.0,851:19:51,514.0,3064791.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277581,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,564,524.0,00:01:01,90.0,0.00,103.0,140.0,23.0,726.0,00:04:16,61.0,256.0
277582,96015a33-f517-3cf8-bcda-c9658dd6c844,En Douceur,14570123,En Douceur,2025-05-12,rts.ch,Smartphone,4677,451.0,00:04:25,772.0,103.00,141.0,802.0,695.0,687.0,00:04:11,265.0,251.0
277583,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,2025-03-28,rts-app-sport,Smartphone,814,438.0,00:05:24,989.0,0.00,476.0,772.0,859.0,92.0,00:00:12,324.0,12.0
277584,41568641-62b4-3596-99ce-3b8bf4d09ad8,Helveticus,12027724,Léchappée,2025-03-28,rts.ch,Smartphone,1150,512.0,00:05:02,289.0,103.00,1222.0,1055.0,82.0,889.0,00:00:14,302.0,14.0


In [24]:
# Dropping the column "Avg Play Duration" & "Total Play Duration" 
# as we have their values in seconds in "Avg Play Duration (s)" & "Total Play Duration (s)"
metrics_df.drop(columns=['Avg Play Duration', 'Total Play Duration'], inplace=True)

In [25]:
metrics_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277467 entries, 0 to 277585
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   Segment ID               277467 non-null  category      
 1   Segment                  277467 non-null  category      
 2   Show ID                  277467 non-null  category      
 3   Show                     277467 non-null  category      
 4   Publication Date         277467 non-null  datetime64[ns]
 5   App/Site Name            277467 non-null  category      
 6   Device Class             277467 non-null  category      
 7   Segment Length           277467 non-null  int64         
 8   Media Views              277467 non-null  float64       
 9   Visitors                 277467 non-null  float64       
 10  New Visit Rate %         277467 non-null  float64       
 11  Entries                  277467 non-null  float64       
 12  Exits                

#### 2.2.1. File #2: Cleaning titles

In [26]:
# Clean column names (e.g., remove invisible characters)
tags_df.columns = tags_df.columns.str.strip().str.replace(r'[^\x00-\x7F]+', '', regex=True)

#### 2.2.2. File #2: Cleaning 'Assigned Tags'

In [27]:
# Replace the '-' placeholder with NaN (an explicit NaN avoids the version-dependent handling of value=None)
tags_df['Assigned Tags'] = tags_df['Assigned Tags'].replace('-', np.nan)

# Drop rows where 'Assigned Tags' is None or effectively empty after stripping whitespace
# Use .astype(str) to safely apply .str.strip() to potential None/NaN values
tags_df = tags_df[tags_df['Assigned Tags'].notna() & (tags_df['Assigned Tags'].astype(str).str.strip() != '')]

In [28]:
tags_df

Unnamed: 0,Segment ID,Show,Show ID,Assigned Tags
5,359fc205-7470-38e0-b393-3b4a2e429508,Tribu,6067786,media_radio:media_radio_media_radio:societe_me...
6,88f6e038-226d-3883-9e5f-f23844d012f7,Le Journal horaire,2031524,media_radio:la-1ere_rts_info:valais_media_radi...
10,973c9679-fa7e-35b5-a450-fa60781e10f4,Le 12h30,1423859,media_radio:la-1ere_media_radio:info_rts_info:...
14,15477997,Le Journal horaire,2031524,media_radio:media_radio_rts_info:monde_rts_inf...
15,15479892,Egosystème,6067782,media_radio:media_radio_media_radio:entretiens...
...,...,...,...,...
107797,14818255,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...
107798,14851148,Forum,1784426,media_radio:la-1ere_rts_info:rts_info_media_tv...
107799,15228940,Forum,1784426,media_radio:media_radio_rts_info:monde_rts_inf...
107800,14845847,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...


In [29]:
# Checking all the possible themes
## Define the core themes to look for
core_themes = ['info', 'sport', 'musique', 'societe', 'humour']

## Function to extract tags that contain any of the core themes
def extract_theme_tags(tag_string, themes):
    parts = re.split(r'[\s,;:_]+', str(tag_string).lower())
    return [tag for tag in parts if any(theme in tag for theme in themes)]

## Apply and get unique matching tags
all_matching_tags = tags_df['Assigned Tags'].apply(lambda x: extract_theme_tags(x, core_themes))
flat_tags = [tag for sublist in all_matching_tags for tag in sublist]
unique_theme_tags = sorted(set(flat_tags))

## Display matching tags
unique_theme_tags

['autres-sports',
 'economie-et-transport',
 'enjeux-de-societe',
 'humour',
 'info',
 'information',
 'monde-et-societe',
 'musique',
 'option-musique',
 'societe',
 'sport',
 'sportapp',
 'transports']

Here, I considered broadening the tag matching beyond the five exact themes.

✅ Pros:
- More complete picture of what content belongs to each theme.
- Better reflects real audience interest and content diversity.
- Shows initiative and domain awareness in your analysis.

❌ Cons:
- Introduces judgment calls that must be clearly documented.
- Slight risk of overgeneralizing or including off-topic content.

This broader approach could be explored as a follow-up. For this assignment, we will stick to the five themes requested.
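
For reference, a broader mapping could look like the sketch below. The dictionary is hypothetical and encodes my own judgment calls. It deliberately leaves out 'economie-et-transport' and 'transports', which only matched earlier because "sport" is a substring of "transport".

```python
import re

# Hypothetical mapping from broader tag tokens to the five core themes.
# Every entry beyond the five exact names is a judgment call to document.
broader_to_core = {
    "info": "info",
    "information": "info",
    "sport": "sport",
    "autres-sports": "sport",
    "musique": "musique",
    "option-musique": "musique",
    "societe": "societe",
    "enjeux-de-societe": "societe",
    "monde-et-societe": "societe",
    "humour": "humour",
}

def broad_themes(tag_string):
    """Return the core themes found in a tag string, in order of appearance."""
    # Same separators as the exploratory split above; note that a namespace
    # such as 'rts_info' also yields the token 'info'
    tokens = re.split(r"[\s,;:_]+", str(tag_string).lower())
    found = []
    for token in tokens:
        theme = broader_to_core.get(token)
        if theme and theme not in found:
            found.append(theme)
    return found

example = "media_radio:option-musique_media_radio:media_radio"
print(broad_themes(example))
```

Exact token lookup avoids the substring accidents of the `theme in tag` test, at the cost of having to list every accepted variant explicitly.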

In [30]:
# Define the five exact valid tags
valid_tags = {
    'media_radio:societe',
    'media_radio:humour',
    'media_radio:info',
    'media_radio:musique',
    'media_radio:sport'
}

In [31]:
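# Return the valid tags found in the full tag string, ordered by where they first appear.
# (Iterating over a set has no guaranteed order, so the result is sorted by position
# to make "first theme" deterministic.)
def match_valid_tags_in_string(tag_string, valid_tags):
    tag_string = str(tag_string).lower()
    matches = [tag for tag in valid_tags if tag in tag_string]
    return sorted(matches, key=tag_string.find)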

In [32]:
# Apply the matching function
# Work on an explicit copy: tags_df was produced by boolean filtering, and assigning
# columns to such a slice triggers pandas' SettingWithCopyWarning
tags_df = tags_df.copy()
tags_df['cleaned_themes'] = tags_df['Assigned Tags'].apply(lambda x: match_valid_tags_in_string(x, valid_tags))
tags_df['Primary Theme'] = tags_df['cleaned_themes'].apply(lambda tags: tags[0] if tags else None)


In [33]:
# # Export the output to check for discrepancies manually
# tags_df.to_csv("check_tags.csv")

We noticed that 10 shows carry more than one of the five themes:

    3ème mi-temps
    Dis, pourquoi?
    Émission spéciale
    Footaises
    La Matinale
    Le 12h30
    Le grand soir
    Les beaux parleurs
    Sport-Première
    The Jam

In this case, we assume that the primary theme is the first theme in cleaned_themes.
Alternatively, we could explode multi-theme segments into one row per theme, but the same show would then be counted under several categories.
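
The multi-theme alternative could be sketched as follows. The toy frame is invented, and the `weight` column is an assumption of mine: it splits each segment evenly across its themes so that totals are not double counted.

```python
import pandas as pd

# Toy tag table: segment 'b' has two themes, segment 'c' has none
toy_tags = pd.DataFrame({
    "Segment ID": ["a", "b", "c"],
    "cleaned_themes": [
        ["media_radio:info"],
        ["media_radio:sport", "media_radio:humour"],
        [],
    ],
})

# One row per (segment, theme); empty lists become NaN and are dropped
exploded = (
    toy_tags.explode("cleaned_themes")
    .dropna(subset=["cleaned_themes"])
    .rename(columns={"cleaned_themes": "Theme"})
    .reset_index(drop=True)
)

# Split each segment evenly across its themes
exploded["weight"] = 1 / exploded.groupby("Segment ID")["Segment ID"].transform("size")
print(exploded)
```

Multiplying metrics by `weight` before aggregating keeps the grand totals unchanged, while each theme still receives a share of every multi-theme segment.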

In [34]:
# Optionally drop 'cleaned_themes', which was only used to check the extraction
# tags_df = tags_df.drop("cleaned_themes", axis=1)

In [35]:
tags_df

Unnamed: 0,Segment ID,Show,Show ID,Assigned Tags,cleaned_themes,Primary Theme
5,359fc205-7470-38e0-b393-3b4a2e429508,Tribu,6067786,media_radio:media_radio_media_radio:societe_me...,[media_radio:societe],media_radio:societe
6,88f6e038-226d-3883-9e5f-f23844d012f7,Le Journal horaire,2031524,media_radio:la-1ere_rts_info:valais_media_radi...,[],
10,973c9679-fa7e-35b5-a450-fa60781e10f4,Le 12h30,1423859,media_radio:la-1ere_media_radio:info_rts_info:...,[media_radio:info],media_radio:info
14,15477997,Le Journal horaire,2031524,media_radio:media_radio_rts_info:monde_rts_inf...,[],
15,15479892,Egosystème,6067782,media_radio:media_radio_media_radio:entretiens...,[],
...,...,...,...,...,...,...
107797,14818255,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...,[media_radio:info],media_radio:info
107798,14851148,Forum,1784426,media_radio:la-1ere_rts_info:rts_info_media_tv...,[],
107799,15228940,Forum,1784426,media_radio:media_radio_rts_info:monde_rts_inf...,[media_radio:info],media_radio:info
107800,14845847,Forum,1784426,media_radio:media_radio_media_radio:info_rts_i...,[media_radio:info],media_radio:info


### 3. Data Transformation

We will now merge the datasets to retrieve the tags from tags_df

In [41]:
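# Merge datasets on "Segment ID"
# metrics_df has several rows per Segment ID (one per "App/Site Name" and "Device Class").
# Caution: tags_df also repeats some Segment IDs, so this left merge duplicates metric rows:
# the result below is longer than metrics_df (index up to 293876 vs. 277467 input rows).
merged_df = pd.merge(metrics_df, tags_df[['Segment ID', 'Assigned Tags', 'Primary Theme']],
                     on='Segment ID',
                     how='left')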

# Check merge results
merged_df

Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Segment Length,Media Views,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Avg Play Duration (s),Total Play Duration (s),Assigned Tags,Primary Theme
0,14897825,Le Suisse Nemo triomphe à lEurovision avec sa ...,2031524,Le Journal horaire,2024-05-12,rts.ch,Smartphone,1234,20762.0,18877.0,84.56,9770.0,13135.0,3428.0,5181.0,319.0,341423.0,media_radio:la-1ere,
1,15102359,Une trombe sest formée au-dessus du lac Léman,2031524,Le Journal horaire,2024-08-18,rts.ch,Smartphone,586,14703.0,13381.0,53.30,9889.0,11505.0,6458.0,6798.0,207.0,389633.0,media_radio:la-1ere_rts_info:rts_info_rts_info...,
2,14572281,De Genève à Zurich: un périple sanglant en Hel...,14546712,Crimes suisses,2024-01-05,rts-app-play,Smartphone,3490,7327.0,4124.0,2.49,1527.0,1928.0,6594.0,602.0,1481.0,9364991.0,media_radio:podcasts-originaux,
3,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500,7560.0,7934.0,71.32,4370.0,4993.0,2671.0,2729.0,385.0,546216.0,rts_info:regions_media_radio:media_radio_rts_i...,media_radio:info
4,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500,7560.0,7934.0,71.32,4370.0,4993.0,2671.0,2729.0,385.0,546216.0,media_radio:la-1ere_rts_info:rts_info_rts_info...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
293873,8d3ad86d-97e1-372b-bc60-8e0f58643b37,"Face au défi climatique, Neuchâtel fait un app...",1423859,Le 12h30,2025-05-12,rts-app-sport,Smartphone,564,524.0,90.0,0.00,103.0,140.0,23.0,726.0,61.0,256.0,media_radio:la-1ere_media_radio:info_rts_info:...,media_radio:info
293874,96015a33-f517-3cf8-bcda-c9658dd6c844,En Douceur,14570123,En Douceur,2025-05-12,rts.ch,Smartphone,4677,451.0,772.0,103.00,141.0,802.0,695.0,687.0,265.0,251.0,media_radio:espace-2_media_radio:media_radio,
293875,0267bc07-2c73-327c-9f5b-f692289ed9d2,Le Suisse mort en Ukraine était un Lausannois ...,1784426,Forum,2025-03-28,rts-app-sport,Smartphone,814,438.0,989.0,0.00,476.0,772.0,859.0,92.0,324.0,12.0,media_radio:media_radio_media_radio:info_rts_i...,media_radio:info
293876,41568641-62b4-3596-99ce-3b8bf4d09ad8,Helveticus,12027724,Léchappée,2025-03-28,rts.ch,Smartphone,1150,512.0,289.0,103.00,1222.0,1055.0,82.0,889.0,302.0,14.0,media_radio:media_radio_media_radio:option-mus...,media_radio:musique
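
The merged frame has more rows than `metrics_df` (segment 14689374 appears twice above with identical metrics but different tags), which would inflate any summed metric. One way to guard against this, sketched on toy data, is to reduce the tag table to one row per segment before merging and let pandas verify the relationship with `validate`. Preferring the row that has a theme is an assumption of mine, not a rule taken from the data.

```python
import pandas as pd

toy_metrics = pd.DataFrame({
    "Segment ID": ["s1", "s1", "s2"],
    "App/Site Name": ["rts.ch", "rts-app-play", "rts.ch"],
    "Media Views": [100, 50, 30],
})
# 's1' is listed twice in the tag table, once without a theme
toy_tags = pd.DataFrame({
    "Segment ID": ["s1", "s1", "s2"],
    "Primary Theme": [None, "media_radio:info", "media_radio:sport"],
})

# Naive merge: every 's1' metric row is matched to both tag rows
naive = toy_metrics.merge(toy_tags, on="Segment ID", how="left")

# Keep one tag row per segment, preferring rows that carry a theme
tags_unique = (
    toy_tags.sort_values("Primary Theme", na_position="last")
    .drop_duplicates(subset="Segment ID", keep="first")
)

# validate raises MergeError if the right side still has duplicate keys
safe = toy_metrics.merge(
    tags_unique, on="Segment ID", how="left", validate="many_to_one"
)

print(len(naive), naive["Media Views"].sum())
print(len(safe), safe["Media Views"].sum())
```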


In [37]:
# Check for any unmatched segments
print("Check unmatched segments:")
merged_df[merged_df['Primary Theme'].isnull()]

Check unmatched segments:


Unnamed: 0,Segment ID,Segment,Show ID,Show,Publication Date,App/Site Name,Device Class,Segment Length,Media Views,Visitors,New Visit Rate %,Entries,Exits,Returning Visits,Bounces,Avg Play Duration (s),Total Play Duration (s),Assigned Tags,Primary Theme
0,14897825,Le Suisse Nemo triomphe à lEurovision avec sa ...,2031524,Le Journal horaire,2024-05-12,rts.ch,Smartphone,1234,20762.0,18877.0,84.56,9770.0,13135.0,3428.0,5181.0,319.0,341423.0,media_radio:la-1ere,
1,15102359,Une trombe sest formée au-dessus du lac Léman,2031524,Le Journal horaire,2024-08-18,rts.ch,Smartphone,586,14703.0,13381.0,53.30,9889.0,11505.0,6458.0,6798.0,207.0,389633.0,media_radio:la-1ere_rts_info:rts_info_rts_info...,
2,14572281,De Genève à Zurich: un périple sanglant en Hel...,14546712,Crimes suisses,2024-01-05,rts-app-play,Smartphone,3490,7327.0,4124.0,2.49,1527.0,1928.0,6594.0,602.0,1481.0,9364991.0,media_radio:podcasts-originaux,
4,14689374,Prise dotages dans un train près dYverdon: les...,8849020,La Matinale,2024-02-09,rts.ch,Smartphone,1500,7560.0,7934.0,71.32,4370.0,4993.0,2671.0,2729.0,385.0,546216.0,media_radio:la-1ere_rts_info:rts_info_rts_info...,
6,88f6e038-226d-3883-9e5f-f23844d012f7,Une quarantaine de caravanes bloquées par la p...,2031524,Le Journal horaire,2025-04-15,rts.ch,Smartphone,730,6107.0,5808.0,55.68,3725.0,5231.0,3694.0,2801.0,341.0,128204.0,media_radio:la-1ere_rts_info:valais_media_radi...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
293868,a233bb0c-d1e1-3149-9eea-89b104dcf5b5,Séquence 2,4191283,CQFD,2025-05-12,rts-app-info,Smartphone,2469,234.0,545.0,0.00,701.0,1153.0,427.0,674.0,208.0,313.0,media_radio:sciences_media_radio:la-1ere_media...,
293871,f669acf3-cdf8-3ad9-91d2-a79add7e1407,La fin de vie de retour dans lhémicycle de lAs...,2031524,Le Journal horaire,2025-05-12,rts-app-sport,Smartphone,85,533.0,215.0,0.00,473.0,518.0,445.0,902.0,130.0,3.0,media_radio:la-1ere_rts_info:monde_media_radio...,
293872,a79bdf00-2e6f-32db-b3a4-aaf8be8da70e,Retour de Flamme,14254338,Fuego,2025-05-12,rts.ch,Smartphone,1530,43.0,525.0,101.00,1179.0,897.0,1206.0,1011.0,25.0,325.0,media_radio:couleur3_media_radio:media_radio,
293874,96015a33-f517-3cf8-bcda-c9658dd6c844,En Douceur,14570123,En Douceur,2025-05-12,rts.ch,Smartphone,4677,451.0,772.0,103.00,141.0,802.0,695.0,687.0,265.0,251.0,media_radio:espace-2_media_radio:media_radio,


In [None]:
# merged_df.to_csv("test.csv", encoding='utf-8-sig')

In [39]:
merged_df["Device Class"].unique()

['Smartphone', 'PC / Laptop']
Categories (2, object): ['PC / Laptop', 'Smartphone']

In [40]:
merged_df["App/Site Name"].unique()

['rts.ch', 'rts-app-play', 'rts-app-info', 'rts-app-sport']
Categories (4, object): ['rts-app-info', 'rts-app-play', 'rts-app-sport', 'rts.ch']

### 4. Feature Engineering