<a href="https://colab.research.google.com/github/lauberto/politopic/blob/main/notebook/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling on Politicians' Speeches

This notebook shows how to apply topic modelling to transcriptions of YouTube videos of politicians' speeches. The transcription is performed with [`whisper`](https://github.com/openai/whisper).

## Installing dependencies

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!sudo apt update && sudo apt install ffmpeg

In [None]:
# Installing whisper
!pip install git+https://github.com/openai/whisper.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-gkejdgyq
  Running command git clone -q https://github.com/openai/whisper.git /tmp/pip-req-build-gkejdgyq


Installing [`pytube`](https://github.com/pytube/pytube) and downloading the [video](https://www.youtube.com/watch?v=knj8ULToNvo) from YouTube.

In [None]:
!pip install git+https://github.com/pytube/pytube

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/pytube/pytube
  Cloning https://github.com/pytube/pytube to /tmp/pip-req-build-qdi093gu
  Running command git clone -q https://github.com/pytube/pytube /tmp/pip-req-build-qdi093gu


## First Demo - Salvini's Speech at a Coldiretti Conference.

Who's better than a populist far-right politician like Salvini to start with the analysis of political speeches?

In [None]:
video_ids = ['knj8ULToNvo', 'wVHyFwyhQwE', 'y1qlZ077zJI', '4OQmieY4a-Q', 'V9w4ZtZisrs', '5h-M0Qbuj74']
video_urls = ['https://youtu.be/' + video_id for video_id in video_ids]
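
IDs copied from a browser address bar can carry a trailing query fragment such as `&t=73s`, which is a playback timestamp and not part of the video ID. A minimal sketch of stripping such suffixes before building the URLs; `clean_video_id` and the toy list `raw_ids` are hypothetical names introduced here, not part of `pytube`.

```python
def clean_video_id(raw_id: str) -> str:
    """Drop everything from the first '&' or '?' onwards (e.g. '&t=73s')."""
    for sep in ("&", "?"):
        raw_id = raw_id.split(sep, 1)[0]
    return raw_id


# Toy input: one clean ID and one with a timestamp suffix
raw_ids = ["knj8ULToNvo", "wVHyFwyhQwE&t=73s"]
clean_urls = ["https://youtu.be/" + clean_video_id(v) for v in raw_ids]
print(clean_urls)
```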

### Download & Transcription
Download the audio track of each video from YouTube and transcribe it with whisper, using the `base` model.

In [None]:
import whisper
from pytube import YouTube

In [None]:
whisper_model = whisper.load_model('base')

100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 124MiB/s]


In [None]:
titles = []
texts = []

for url in video_urls:
  yt = YouTube(url=url)
  titles.append(yt.title)
  path = yt.streams.filter(only_audio=True)[0].download(filename="audio.mp4")
  transcription = whisper_model.transcribe(path)
  texts.append(transcription["text"])  



### Sentence segmentation

In [None]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk import sent_tokenize

In [None]:
docs = []
video_titles = []

for title, text in zip(titles, texts):
  sents = sent_tokenize(text)
  docs.extend(sents)
  video_titles.extend([title]*len(sents))
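
The loop above flattens every transcript into sentences while repeating the video title once per sentence, so that the two lists stay aligned. A small sketch of that alignment on toy data; `split_sentences` is a simplified regex stand-in for `sent_tokenize` (used here only so the example runs without downloading `punkt`), and the `toy_*` names are hypothetical. Since the speeches are in Italian, passing `language='italian'` to `sent_tokenize` may also be worth trying.

```python
import re


def split_sentences(text):
    # Simplified stand-in for nltk.sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


toy_titles = ["Video A", "Video B"]
toy_texts = ["Prima frase. Seconda frase!", "Una sola frase."]

toy_docs, toy_video_titles = [], []
for title, text in zip(toy_titles, toy_texts):
    sents = split_sentences(text)
    toy_docs.extend(sents)
    # One title per sentence keeps both lists the same length
    toy_video_titles.extend([title] * len(sents))

print(list(zip(toy_video_titles, toy_docs)))
```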

Let's save the texts to Google Drive, just in case.

In [None]:
import pandas as pd

df = pd.DataFrame({"Title": video_titles, "Text": docs})
df.head()

Unnamed: 0,Title,Text
0,ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022),accogliamo un ultimo dei nostri amici.
1,ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022),"Parlaramo poco fammi, ha detto, sono una perso..."
2,ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022),Quindi diamo la parola per ultimo proprio il s...
3,ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022),"Cioè, presentazione alla Gesmundo New Look, se..."
4,ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022),"Non ho il compitino pronto, cioè il mio uffici..."


In [None]:
df.to_csv('/content/drive/MyDrive/Data/political_speeches/politopic_salvini.tsv', sep='\t', index=False)

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Data/political_speeches/politopic_salvini.tsv', sep='\t')
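
The save and reload cells above assume that the target folder already exists on Drive. A minimal sketch of the same TSV round trip against a temporary directory, creating the folder first; the directory, the file name `toy.tsv` and the rows in `toy_df` are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Video A", "Video A"],
    "Text": ["Prima frase.", "Seconda frase!"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "political_speeches")
    # Create the folder if it is missing; no error if it already exists
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "toy.tsv")

    # index=False avoids writing the row index as an extra column
    toy_df.to_csv(out_path, sep="\t", index=False)
    reloaded = pd.read_csv(out_path, sep="\t")

print(reloaded.columns.tolist())
```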

## Latent Dirichlet Allocation

In [None]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
it_stopwords = stopwords.words('italian')

In [None]:
it_stopwords[:5]

['ad', 'al', 'allo', 'ai', 'agli']

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(max_df=0.9, min_df=3, stop_words=it_stopwords)
lda = LatentDirichletAllocation(n_components=5)

In [None]:
count = cv.fit_transform(df.Text)
lda.fit(count)

LatentDirichletAllocation(n_components=5)

In [None]:
count.toarray()

In [None]:
list(cv.get_feature_names_out()[:5])



['10', '100', '15', '23', '25']

In [None]:
def get_feature_names_per_topic(lda_components, count_vectorizer, topn=5):
  # Look up the vocabulary once instead of once per word
  feature_names = count_vectorizer.get_feature_names_out()
  topics = []

  for topic in lda_components:
    # argsort is ascending, so the last `topn` indices are the heaviest words,
    # listed from lowest to highest weight
    top_words_idx = topic.argsort()[-topn:]
    topics.append('_'.join(feature_names[idx] for idx in top_words_idx))

  return topics

In [None]:
topics = get_feature_names_per_topic(lda.components_, cv)

In [None]:
topics

['quando_bollette_italiani_governo_prima',
 'italiani_energia_fare_destra_centro',
 'cosa_anni_italia_grazie_fa',
 'essere_poi_penso_quindi_però',
 'sì_assolutamente_quindi_essere_lavoro']
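
Because `argsort` sorts in ascending order, each label above lists its words from lowest to highest weight, so the strongest word of a topic comes last. A small variant that lists the strongest word first, shown on a made-up weight matrix; `top_words_descending`, `toy_components` and `toy_features` are hypothetical names, and the numbers are not taken from the fitted model.

```python
import numpy as np


def top_words_descending(components, feature_names, topn=3):
    feature_names = np.asarray(feature_names, dtype=object)
    topics = []
    for weights in components:
        # Reverse the ascending argsort, then keep the first `topn` indices
        top_idx = np.argsort(weights)[::-1][:topn]
        topics.append("_".join(feature_names[top_idx]))
    return topics


toy_components = np.array([
    [0.1, 5.0, 2.0, 0.3],
    [4.0, 0.2, 0.1, 3.0],
])
toy_features = ["bollette", "energia", "governo", "lavoro"]

toy_topics = top_words_descending(toy_components, toy_features, topn=2)
print(toy_topics)  # ['energia_governo', 'bollette_lavoro']
```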