<a href="https://colab.research.google.com/github/lauberto/politopic/blob/main/notebook/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling on Politicians' speech

This notebook shows how to apply topic modelling on the transcription of yt videos with Politicians' speech. The transcription is performed with [`whisper`](https://github.com/openai/whisper).

## Installing dependencies

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!sudo apt update && sudo apt install ffmpeg

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1,581 B]
Hit:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [1,073 kB]
Get:6 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:9 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:12 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Get:13 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [3,0

In [3]:
# Installing whisper
!pip install git+https://github.com/openai/whisper.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-f8ws7i94
  Running command git clone -q https://github.com/openai/whisper.git /tmp/pip-req-build-f8ws7i94
Collecting transformers>=4.19.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
[K     |████████████████████████████████| 5.8 MB 33.6 MB/s 
[?25hCollecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 64.3 MB/s 
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
[K     |████████████████████████████████| 182 kB 76.9 MB/s 
Building wheels for collected packages: whisper
  Buil

In [4]:
import torch
torch.cuda.is_available()

True


 Installing [`pytube`](https://https://github.com/pytube/pytube). and downloading the [video](https://https://www.youtube.com/watch?v=knj8ULToNvo) from youtube.

In [5]:
!pip install git+https://github.com/pytube/pytube

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/pytube/pytube
  Cloning https://github.com/pytube/pytube to /tmp/pip-req-build-yv45tb2f
  Running command git clone -q https://github.com/pytube/pytube /tmp/pip-req-build-yv45tb2f
Building wheels for collected packages: pytube
  Building wheel for pytube (setup.py) ... [?25l[?25hdone
  Created wheel for pytube: filename=pytube-12.1.0-py3-none-any.whl size=56809 sha256=430333b5b2e0422e9f70bfdcc808e1da212f1a1d3ce2dde2e6f85b60da16869e
  Stored in directory: /tmp/pip-ephem-wheel-cache-dbb2__1a/wheels/a8/ac/8c/337af6a10cc543c5eadf4eb2bbd02bd8609b25bea729df338a
Successfully built pytube
Installing collected packages: pytube
Successfully installed pytube-12.1.0


In [6]:
import requests
import json
from getpass import getpass

In [7]:
def parse_response(search_response: dict):
	videos = []
	channels = []
	playlists = []
	
	for search_result in search_response.get('items', []):
		if search_result['id']['kind'] == 'youtube#video':
			videos.append({
				'title': search_result['snippet']['title'],
				'video_id':search_result['id']['videoId']
			})
		elif search_result['id']['kind'] == 'youtube#channel':
			channels.append({
				'title': search_result['snippet']['title'],
				'channel_id': search_result['id']['channelId']
			})
		elif search_result['id']['kind'] == 'youtube#playlist':
			playlists.append({
				'title': search_result['snippet']['title'],
				'playlist_id': search_result['id']['playlistId']
			})

	return {
		'videos': videos,
		'channels': channels,
		'playlists': playlists
	}

In [8]:
api_key = getpass("Google API key: ")
channel_id = "UCDjM54fZ-cD7F8uom767OhA" # Salvini Channel
# channel_id = "UCw2AF_J2QizzmWw99bs3e6Q" # Lega Channel
max_results = 100

Google API key: ··········


In [9]:
res = requests.get(f"https://www.googleapis.com/youtube/v3/search?key={api_key}&channelId={channel_id}&part=snippet,id&order=date&maxResults={max_results}")

In [10]:
res_dict = dict(json.loads(res.text))

In [11]:
parsed_res = parse_response(res_dict)

In [12]:
parsed_res['videos'][:5]

[{'title': 'MATTEO SALVINI A NON STOP NEWS (RTL 102.5, 7.11.2022)',
  'video_id': 'hnL4Ysh5oOI'},
 {'title': 'MATTEO SALVINI A DRITTO E ROVESCIO (RETE 4, 3.11.2022)',
  'video_id': 'dEO4XxqxoU8'},
 {'title': 'MATTEO SALVINI A FUORI DAL CORO (RETE 4, 25.10.2022)',
  'video_id': 'j2snzceN3uA'},
 {'title': 'MATTEO SALVINI A PORTA A PORTA (RAI 1, 24.10.2022)',
  'video_id': 'c98wjS-B8r4'},
 {'title': 'MATTEO SALVINI A PASSWORD (RTL 102.5, 21.10.2022)',
  'video_id': 'Wcy_FNa8Fxs'}]

## First Demo - Salvini's Speech at a Coldiretti Conference.

Who's better than a populist far-right politician like Salvini to start with the analysis of political speeches?

In [13]:
video_ids = [video['video_id'] for video in parsed_res['videos']]
video_urls = ['https://youtu.be/' + video_id for video_id in video_ids]
video_urls[:5]

['https://youtu.be/hnL4Ysh5oOI',
 'https://youtu.be/dEO4XxqxoU8',
 'https://youtu.be/j2snzceN3uA',
 'https://youtu.be/c98wjS-B8r4',
 'https://youtu.be/Wcy_FNa8Fxs']

### Download & Transcription
Download the video from YouTube and transcribe it using whisper. We are going to 

In [14]:
import pandas as pd
import whisper
from pytube import YouTube, Channel

In [15]:
whisper_model = whisper.load_model('base')

100%|███████████████████████████████████████| 139M/139M [00:02<00:00, 64.3MiB/s]


In [17]:
from tqdm import tqdm

In [18]:
titles = []
texts = []

for url in tqdm(video_urls):
  yt = YouTube(url=url)
  titles.append(yt.title)
  path = yt.streams.filter(only_audio=True)[0].download(filename="audio.mp4")
  transcription = whisper_model.transcribe(path)
  texts.append(transcription["text"])  

 30%|███       | 15/50 [30:21<1:10:49, 121.43s/it]


KeyboardInterrupt: ignored

### Sentence segmentation

In [None]:
import nltk

nltk.download('punkt')

In [None]:
from nltk import sent_tokenize

In [None]:
docs = []
video_titles = []

for title, text in zip(titles, texts):
  sents = sent_tokenize(text)
  docs.extend(sents)
  video_titles.extend([title]*len(sents))

Let's save the texts on google drive just in case.

In [None]:
import pandas as pd

df = pd.DataFrame({"Title": video_titles, "Text": docs})
df.head()

In [None]:
df.to_csv('/content/drive/MyDrive/Data/political_speeches/politopic_salvini.tsv', sep='\t', index=False)

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Data/political_speeches/politopic_salvini.tsv', sep='\t')

In [None]:
df.groupby('Title').count()

### Preprocessing
Let's normalize the numbers and lemmatize.

In [None]:
!pip install stanza

In [None]:
import stanza
import re

In [None]:
nlp = stanza.Pipeline('it', processors='tokenize,mwt,pos,lemma')

In [None]:
def lemmatize(text: str):
  doc = nlp(text)
  lemmas = []

  for sent in doc.sentences:
    for word in sent.words:
      if word.upos != "PUNCT":
        lemmas.append(word.lemma)
  return " ".join(lemmas)

In [None]:
def normalize(text: str):
  text = re.sub("[0-9]+", "NUM", text)
  text = lower(text)
  return text

In [None]:
df.Text = df.Text.apply(lambda text: normalize(text))
df.lemmas = df.Text.apply(lambda text: lemmatize(text))

## Latent Dirichlet Allocation

In [None]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

In [None]:
it_stopwords = stopwords.words('italian')

In [None]:
it_stopwords[:5]

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(max_df=0.9, min_df=3, stop_words=it_stopwords)
lda = LatentDirichletAllocation(n_components=5)

In [None]:
count = cv.fit_transform(df.lemmas)
lda.fit(count)

count.toArray()

In [None]:
cv.get_feature_names()[:5]



['10', '100', '15', '23', '25']

In [None]:
def get_feature_names_per_topic(lda_components, count_vectorizer, topn=5):
  topics = []

  for i in range(len(lda_components)):
    topic = lda_components[i]
    top_words_idx = topic.argsort()[-topn:]
    topic_names = [count_vectorizer.get_feature_names_out()[id] for id in top_words_idx]
    topics.append('_'.join(topic_names))
  
  return topics

In [None]:
topics = get_feature_names_per_topic(lda.components_, cv)

In [None]:
topics

['quando_bollette_italiani_governo_prima',
 'italiani_energia_fare_destra_centro',
 'cosa_anni_italia_grazie_fa',
 'essere_poi_penso_quindi_però',
 'sì_assolutamente_quindi_essere_lavoro']