***
# CAPSTONE PROJECT (DSI-25)
***
# Female Conversations Online in Singapore: What is She Talking About?
### Topic Modelling of Female-focused YouTube Content
***

# Project Content
1. Project Background, Audio Scraping & Preprocessing
2. Initial Analysis with LDA
3. Text-preprocessing, EDA & LDA Modelling
4. Transformer Modelling with BERTopic
5. Overall Insights & Conclusion

## Problem Statement

With the gradual liberalisation of Singapore society, which places a higher value on self-expression and freedom of thought, more female-focused content can be found online in the form of serious conversations that reflect the current concerns, opinions and values of Singapore women. Because such conversations are personal and intimate and are rooted in local customs and culture, these sentiments cannot easily be generalised or derived from information or studies conducted outside Singapore. However, this feature-rich content is frequently overlooked as a source of information on female perspectives in Singapore because it forms only a small portion of all local content created online. Hence, this project aims to use the Natural Language Processing (NLP) technique of topic modelling to uncover the everyday concerns and opinions of women in Singapore.

## Background

The past few decades have witnessed one of the most dramatic cultural changes since the dawn of recorded history. Singapore, too, has become more liberal since 2002, when it first participated in the World Values Survey (WVS), a global research project that monitors changing public beliefs and their socio-political impact over time across 80 countries ([source](https://www.straitstimes.com/singapore/community/singapore-still-conservative-on-moral-sexuality-issues-but-more-liberal-since)). The WVS has demonstrated over the years that people’s beliefs play a key role in economic development, the emergence and flourishing of democratic institutions, the rise of gender equality, and the extent to which societies have effective government. Technology has also continued to advance through the decades, and most societies now have access to the internet and its various social networks. In a liberal post-industrial economy, an increasing share of the population has grown up taking survival and freedom of thought for granted, with the result that self-expression is highly valued ([source](https://www.worldvaluessurvey.org/WVSContents.jsp)).

The amalgamation of the above realities has led to the rise of self-expression online in various forms on a wide range of platforms ([source](https://viewpoint.pointloma.edu/the-rise-of-the-social-media-influencer/)), including greater visibility of, and more serious discussion on, female-focused topics that have traditionally been overlooked or discouraged from being openly addressed ([source](https://womenlovetech.com/female-founders-apply-science-to-womens-health-issues/)). This global phenomenon has also spread to Singapore, and it is now more common than before to find videos, podcasts and other forms of social media content that have a female-centric focus or tackle female-related issues.

While these forms of female-focused content are present in our local online ecosystem and can be accessed easily by many, they form only a small percentage of the information that online users are overwhelmed with daily ([source](https://www.wsj.com/articles/social-media-algorithms-rule-how-we-see-the-world-good-luck-trying-to-stop-them-11610884800)). As a result, they are frequently overlooked as a source of information that can have larger impacts in the real world.

### Project Impact

This project aims to analyse the topics discussed in a selection of Singapore-based female-focused YouTube channels to look into the concerns, opinions and values of Singaporean women through the use of topic modelling in Natural Language Processing (NLP). Sentiments expressed in this content are often unique to Singapore because of the personal and intimate nature of such discussions, and are hence very valuable: for cultural and accessibility reasons, they cannot be fully generalised from information obtained outside Singapore.

Online content can provide an indicator of current trends related to women in Singapore and has been shown to have far-reaching impacts ranging from product development ([source](https://www.sciencedirect.com/science/article/pii/S0148296320301363)) and marketing campaigns ([source](https://digitalcontentnext.org/blog/2020/04/10/evaluating-the-value-of-media-content/)) to wide-sweeping social change ([source](https://www.thedrum.com/opinion/2020/06/19/the-importance-social-media-instigating-social-change)). As a serious discussion format, it is also an informal educational outlet that younger, more tech-savvy girls have easy access to. The amount of online content has been growing and will only continue to grow, and so will the female-focused content that is produced in Singapore. Therefore, it is worthwhile to know what topics have been addressed so that more varied narratives can be added to the present collection to provide more diverse perspectives for a wider reach.

## Methodology

![flowchart](../media/methodology.png)

## Dataset

The dataset used in this project is vectorised from text transcripts produced by the Google Cloud Speech-to-Text API. These transcripts were obtained from 45 hours of audio content scraped from 2 local YouTube channels, [itsclarityco](https://www.youtube.com/channel/UCEAGCuChX7adlus-NQOamog/featured) and [Something Private](https://www.youtube.com/channel/UCAZ7NfSRX1reSpRUw0xtEmg). The 130 videos were obtained in batches from 28 November to 5 December 2021, and have upload dates ranging from 19 September 2020 to 1 December 2021.
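
The download commands in Section 1 save each track as `uploader_uploaddate_title_duration.ext`, so figures such as the number of videos, total hours and upload-date range can be cross-checked from the filenames alone. The following is a minimal sketch under that assumption; `parse_audio_filename` and `summarise` are hypothetical helper names and the regular expression is an illustration, not part of the original notebook.

```python
import re
from datetime import date, datetime
from typing import Iterable, NamedTuple, Optional, Tuple

# Assumed to mirror the youtube-dl output template used in Section 1:
# %(uploader)s_%(upload_date)s_%(title)s_%(duration)s.%(ext)s
FILENAME_RE = re.compile(
    r"^(?P<uploader>.+?)_(?P<upload_date>\d{8})_(?P<title>.+)"
    r"_(?P<duration>\d+)\.(?P<ext>\w+)$"
)


class AudioFile(NamedTuple):
    uploader: str
    upload_date: date
    title: str
    duration_s: int  # duration in seconds, as written by youtube-dl


def parse_audio_filename(name: str) -> Optional[AudioFile]:
    """Return the metadata encoded in a filename, or None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None  # e.g. subtitle files ending in .en.vtt
    return AudioFile(
        uploader=match["uploader"],
        upload_date=datetime.strptime(match["upload_date"], "%Y%m%d").date(),
        title=match["title"],
        duration_s=int(match["duration"]),
    )


def summarise(names: Iterable[str]) -> Tuple[int, float, date, date]:
    """Return (file count, total hours, earliest upload, latest upload)."""
    parsed = [p for p in map(parse_audio_filename, names) if p is not None]
    if not parsed:
        raise ValueError("no filenames matched the expected pattern")
    total_hours = sum(p.duration_s for p in parsed) / 3600
    dates = [p.upload_date for p in parsed]
    return len(parsed), total_hours, min(dates), max(dates)
```

Passing the names of the files in the audio folder (for example from `os.listdir`) to `summarise` gives a quick sanity check against the totals quoted above. Titles may themselves contain underscores, which is why the pattern anchors on the 8-digit date and the trailing duration rather than splitting on `_`.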

***
# 1. Data Scraping
***

In [1]:
from youtube_dl import YoutubeDL
import ffmpeg
from ffprobe import FFProbe
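
The download cells in this section shell out to the `youtube-dl` command line, while the `YoutubeDL` class imported above is not used in the cells shown. As a rough sketch (not the notebook's own code), the same flags could be expressed as a `YoutubeDL` options dictionary. `build_ydl_options` and `download_channel` are hypothetical helper names, and the flag-to-option mapping in the comments is an assumption based on youtube-dl's documented parameters.

```python
import os


def build_ydl_options(archive_file: str, out_dir: str, max_downloads: int = 300) -> dict:
    """Translate the CLI flags used in this notebook into YoutubeDL options."""
    return {
        "format": "140",                   # -f 140 (saved as .m4a in the logs)
        "ignoreerrors": True,              # -i: keep going after a failed video
        "noplaylist": False,               # --yes-playlist
        "writeautomaticsub": True,         # --write-auto-sub
        "download_archive": archive_file,  # --download-archive
        "max_downloads": max_downloads,    # --max-downloads
        "outtmpl": os.path.join(           # -o
            os.path.expanduser(out_dir),
            "%(uploader)s_%(upload_date)s_%(title)s_%(duration)s.%(ext)s",
        ),
    }


def download_channel(url: str, options: dict) -> None:
    """Download every video listed at `url` using the given options."""
    # Imported lazily so that the options can be built and inspected offline.
    from youtube_dl import YoutubeDL

    with YoutubeDL(options) as ydl:
        ydl.download([url])
```

Keeping the options in a function makes it easier to reuse one configuration for both channels, changing only the archive file and the output folder.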

Audio tracks of each video in the stereo m4a format were scraped from the 2 YouTube channels. These channels were also chosen because they used minimal music and sound effects in their videos and would hence present fewer issues with noisy audio data.
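
One way to confirm that a downloaded track really is stereo audio in an m4a container is to ask `ffprobe` for its stream metadata. The sketch below assumes the `ffprobe` binary is on `PATH` and calls it through `subprocess` rather than through the `ffprobe` package imported earlier, whose usage is not shown in this notebook. `probe_streams`, `audio_summary` and `is_stereo` are hypothetical helper names.

```python
import json
import subprocess


def probe_streams(path: str) -> dict:
    """Run ffprobe (assumed to be on PATH) and return its JSON stream description."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def audio_summary(probe: dict) -> dict:
    """Pick out codec, channel count and sample rate of the first audio stream."""
    for stream in probe.get("streams", []):
        if stream.get("codec_type") == "audio":
            return {
                "codec": stream.get("codec_name"),
                "channels": stream.get("channels"),
                # ffprobe reports the sample rate as a string
                "sample_rate_hz": int(stream["sample_rate"]),
            }
    raise ValueError("no audio stream found")


def is_stereo(probe: dict) -> bool:
    """True when the first audio stream has exactly two channels."""
    return audio_summary(probe)["channels"] == 2
```

For a single file this would be called as `is_stereo(probe_streams(path))`. Separating the `ffprobe` call from the parsing keeps the check easy to test on a saved probe result.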

## itsclarityco

In [3]:
!youtube-dl --max-downloads 300 -i --yes-playlist -f 140 --write-auto-sub --download-archive ITSCLARITYCO_done.txt -o "~/Desktop/dsi25-workspace/Projects/capstone_project/audio/%(uploader)s_%(upload_date)s_%(title)s_%(duration)s.%(ext)s" "https://www.youtube.com/channel/UCEAGCuChX7adlus-NQOamog/videos" 

[youtube:tab] UCEAGCuChX7adlus-NQOamog: Downloading webpage
[download] Downloading playlist: itsclarityco - Videos
[youtube:tab] Downloading page 1
[youtube:tab] Downloading page 2
[youtube:tab] Downloading page 3
[youtube:tab] playlist itsclarityco - Videos: Downloading 101 videos
[download] Downloading video 1 of 101
[youtube] y73sW-OBX-g: Downloading webpage
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20211124_Does work life balance even exist (ft. Sangeeta) _ Hush Podcast_2882.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20211124_Does work life balance even exist (ft. Sangeeta) _ Hush Podcast_2882.m4a
[download] 100% of 44.48MiB in 11:25
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20211124_Does work life balance even exist (ft. Sangeeta) 

[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20211029_Our love-hate relationships with arm, facial & body hair _ Hush tl;dr_521.m4a
[download] 100% of 8.04MiB in 02:12
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20211029_Our love-hate relationships with arm, facial & body hair _ Hush tl;dr_521.m4a"
[download] Downloading video 12 of 101
[youtube] Q0wpG76fmwE: Downloading webpage
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20211028_Has upbringing affected the way we love (ft. Narelle Kheng) _ Hush tl;dr_398.m4a
[download] 100% of 6.15MiB in 02:27
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20211028_Has upbringing affected the way we love (ft. Narelle Kheng) _ Hush tl;d

[download] 100% of 39.39MiB in 09:58
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210915_Are women more emotionally expressive than men (ft. Yung Raja) _ Hush Podcast_2552.m4a"
[download] Downloading video 39 of 101
[youtube] jAtCkdfZ7RU: Downloading webpage
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210908_We Tried to Create 3 Different Looks Under 15 Minutes with the Dyson Airwrap _ Hush Hush_643.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210908_We Tried to Create 3 Different Looks Under 15 Minutes with the Dyson Airwrap _ Hush Hush_643.m4a
[download] 100% of 9.93MiB in 03:28
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210908_We Tried t

[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210804_How has portrayal of women in the media changed _ Hush Podcast_2382.m4a
[download] 100% of 36.77MiB in 08:37
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210804_How has portrayal of women in the media changed _ Hush Podcast_2382.m4a"
[download] Downloading video 51 of 101
[youtube] cbEdqTNUi2Y: Downloading webpage
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210729_Ghost in the studio! _ Hush Podcast Season 2 Bloopers_273.m4a
[download] 100% of 4.21MiB in 01:06
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210729_Ghost in the studio! _ Hush Podcast Season 2 Bloopers_273.m4a"
[download] Downloading video 52 of 

[download] 100% of 6.05MiB in 01:50
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210625_“I will jump into a pool full of snakes for her” _ Joint Account Ep 2 _ Gillyn & Jolyn_392.m4a"
[download] Downloading video 62 of 101
[youtube] P-4SQnGAL-E: Downloading webpage
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210619_On shopping online, lying and feeling embarrassed about our purchases _ Hush WFH Special Part 2_2341.m4a
[download] 100% of 36.13MiB in 11:14
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210619_On shopping online, lying and feeling embarrassed about our purchases _ Hush WFH Special Part 2_2341.m4a"
[download] Downloading video 63 of 101
[youtube] -jbJeIZ0mJY: Downloading webpage
[info] Writing video subtitles to: /

[download] Downloading video 74 of 101
[youtube] CE0c9v_eSJM: Downloading webpage
ERROR: unable to download video data: HTTP Error 403: Forbidden
[download] Downloading video 75 of 101
[youtube] eK02Zvxg3Tg: Downloading webpage
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210504_New Dads on the Truth About Pregnancy, The Postpartum Body and Sexy Time _ MEN, EXPLAIN EP 2_1356.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210504_New Dads on the Truth About Pregnancy, The Postpartum Body and Sexy Time _ MEN, EXPLAIN EP 2_1356.m4a
[download] 100% of 20.93MiB in 08:07
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210504_New Dads on the Truth About Pregnancy, The Postpartum Body and Sexy Time _ MEN, EXPLAIN EP 2_1356.m4a"
[download] Do

[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210308_Learning Self-Defence Tricks with Sharul Channa _ [FT. Tiffany Teo]_480.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210308_Learning Self-Defence Tricks with Sharul Channa _ [FT. Tiffany Teo]_480.m4a
[download] 100% of 7.41MiB in 01:48
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210308_Learning Self-Defence Tricks with Sharul Channa _ [FT. Tiffany Teo]_480.m4a"
[download] Downloading video 89 of 101
[youtube] JU-INcCHpew: Downloading webpage
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20210307_Girls Trying Different Types of Fancy Period Panties _ Hush Hush Ep 1_1016.en.vtt
[download] Destination: /Users/lukasiwe

[download] 100% of 10.61MiB in 03:48
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20201115_What is it like to live with multiple mental health disorders _ A Closer Stranger_688.m4a"
[download] Downloading video 100 of 101
[youtube] XpONjTSucAw: Downloading webpage
[youtube] XpONjTSucAw: Downloading MPD manifest
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20201101_Strangers connect through their struggle with body image _ A Closer Stranger_709.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itsclarityco_20201101_Strangers connect through their struggle with body image _ A Closer Stranger_709.m4a
[download] 100% of 10.94MiB in 02:48
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/itscl

## Something Private

In [2]:
# Something Private
!youtube-dl --max-downloads 300 -i --yes-playlist -f 140 --write-auto-sub --download-archive SOMETHING_PRIVATE_done.txt -o "~/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/%(uploader)s_%(upload_date)s_%(title)s_%(duration)s.%(ext)s" "https://www.youtube.com/channel/UCAZ7NfSRX1reSpRUw0xtEmg/videos" 

[youtube:tab] UCAZ7NfSRX1reSpRUw0xtEmg: Downloading webpage
[download] Downloading playlist: Something Private - Videos
[youtube:tab] playlist Something Private - Videos: Downloading 29 videos
[download] Downloading video 1 of 29
[youtube] gZUYmklBTJ4: Downloading webpage
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20211201_Social Media Burnout is Real, We Know You Feel It_2342.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20211201_Social Media Burnout is Real, We Know You Feel It_2342.m4a
[download] 100% of 36.15MiB in 10:22
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20211201_Social Media Burnout is Real, We Know You Feel It_2342.m4a"
[download] Downloading video 2 of 29
[y

[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20210630_Discharge, Husband Stitch, Love After Menopause - How Your Vulva Grows With You_2538.m4a
[download] 100% of 39.18MiB in 14:27
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20210630_Discharge, Husband Stitch, Love After Menopause - How Your Vulva Grows With You_2538.m4a"
[download] Downloading video 11 of 29
[youtube] Ory2TS90OaI: Downloading webpage
[youtube] Ory2TS90OaI: Downloading MPD manifest
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20210628_NEW SERIES - Voyage to the vulva-verse (COMING SOON!)_63.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Someth

[download] 100% of 36.51MiB in 10:42
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20210302_How’d you like your eggs Fertilised, or Frozen _ Season 4 Episode 1_2366.m4a"
[download] Downloading video 21 of 29
[youtube] 4KOzolOd7uo: Downloading webpage
[youtube] 4KOzolOd7uo: Downloading MPD manifest
[info] Writing video subtitles to: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20201128_The Men’s Telehealth Service that’s Breaking Taboos in Asia_2439.en.vtt
[download] Destination: /Users/lukasiwei/Desktop/dsi25-workspace/Projects/capstone_project/audio/something_private/Something Private_20201128_The Men’s Telehealth Service that’s Breaking Taboos in Asia_2439.m4a
[download] 100% of 37.64MiB in 10:53
[ffmpeg] Correcting container in "/Users/lukasiwei/Desktop/dsi25-workspace/Projects/c

In [3]:
# clear the cache if downloads cannot resume after an HTTP Error 403
#!youtube-dl --rm-cache-dir;
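
The itsclarityco log shows video `CE0c9v_eSJM` failing with an HTTP Error 403, and because `-i` is set the run simply moves on to the next video. To see which videos still need a retry, one option is to compare the expected video IDs against the `--download-archive` file. The sketch below assumes the archive holds one completed download per line in the form `<extractor> <video id>` (for example `youtube y73sW-OBX-g`); `archived_ids` and `missing_ids` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable, List, Set


def archived_ids(archive_path: str) -> Set[str]:
    """Return the video IDs recorded in a youtube-dl download archive file."""
    ids = set()
    for line in Path(archive_path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) == 2:  # assumed layout: "<extractor> <video id>"
            ids.add(parts[1])
    return ids


def missing_ids(expected_ids: Iterable[str], archive_path: str) -> List[str]:
    """Return the expected IDs that are absent from the archive, in input order."""
    done = archived_ids(archive_path)
    return [video_id for video_id in expected_ids if video_id not in done]
```

Re-running the same download command after clearing the cache should then pick up only the missing videos, since IDs already listed in the archive are skipped.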