<a href="https://colab.research.google.com/github/lauberto/politopic/blob/main/notebook/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling on Politicians' Speeches

This notebook shows how to apply topic modelling to transcriptions of YouTube videos of politicians' speeches. The transcription is performed with [`whisper`](https://github.com/openai/whisper).

## Installing dependencies

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
!sudo apt update && sudo apt install -y ffmpeg

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1,581 B]
Hit:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:8 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:9 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,567 kB]
Hit:10 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:11 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [3,095 kB]
Get:12 http://archive.ubuntu.c

In [2]:
# Installing whisper
!pip install git+https://github.com/openai/whisper.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-8e332mz2
  Running command git clone -q https://github.com/openai/whisper.git /tmp/pip-req-build-8e332mz2
Collecting transformers>=4.19.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Building wheels for collected packages: whisper
  Buil

Install [`pytube`](https://github.com/pytube/pytube), which is used below to download the [video](https://www.youtube.com/watch?v=knj8ULToNvo) from YouTube.

In [3]:
!pip install git+https://github.com/pytube/pytube

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/pytube/pytube
  Cloning https://github.com/pytube/pytube to /tmp/pip-req-build-my32wrxi
  Running command git clone -q https://github.com/pytube/pytube /tmp/pip-req-build-my32wrxi
Building wheels for collected packages: pytube
  Building wheel for pytube (setup.py) ... done
  Created wheel for pytube: filename=pytube-12.1.0-py3-none-any.whl size=56809 sha256=13bbbb2c2a49414c1542945304ada5f564fcecac79fff5bbed0d623e3c951ee0
  Stored in directory: /tmp/pip-ephem-wheel-cache-03e_1zep/wheels/a8/ac/8c/337af6a10cc543c5eadf4eb2bbd02bd8609b25bea729df338a
Successfully built pytube
Installing collected packages: pytube
Successfully installed pytube-12.1.0


## First Demo - Salvini's Speech at a Coldiretti Conference

Who better than a populist far-right politician like Salvini to kick off an analysis of political speeches?

In [4]:
# YouTube video IDs; '&t=...' timestamp suffixes are dropped because the full audio track is downloaded anyway
video_ids = ['knj8ULToNvo', 'wVHyFwyhQwE', 'y1qlZ077zJI', '4OQmieY4a-Q', 'V9w4ZtZisrs']
video_urls = ['https://youtu.be/' + video_id for video_id in video_ids]

### Download & Transcription
Download the audio track of each video from YouTube and transcribe it with Whisper. We are going to use Whisper's `base` model.

In [5]:
import whisper
from pytube import YouTube

In [7]:
whisper_model = whisper.load_model('base')

100%|███████████████████████████████████████| 139M/139M [00:05<00:00, 29.0MiB/s]


In [14]:
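titles = []
texts = []

for url in video_urls:
  yt = YouTube(url=url)
  titles.append(yt.title)
  # Take the first audio-only stream; audio.mp4 is overwritten on every iteration
  path = yt.streams.filter(only_audio=True)[0].download(filename="audio.mp4")
  # transcribe() returns a dict; its "text" entry holds the full transcript
  transcription = whisper_model.transcribe(path)
  texts.append(transcription["text"])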



### Sentence segmentation

In [29]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [30]:
from nltk import sent_tokenize

In [31]:
docs = []
video_titles = []

for title, text in zip(titles, texts):
  sents = sent_tokenize(text)
  docs.extend(sents)
  video_titles.extend([title]*len(sents))
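
The loop above keeps `docs` and `video_titles` aligned, so every sentence carries the title of the video it came from. A minimal sketch of that bookkeeping, assuming a naive regex splitter (`naive_sent_split`, a hypothetical stand-in for `sent_tokenize` so that the example runs without NLTK data) and made-up titles and texts:

```python
import re

def naive_sent_split(text):
    # Hypothetical stand-in for nltk.sent_tokenize: split after '.', '!' or '?'
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Made-up inputs, for illustration only
titles = ['Video A', 'Video B']
texts = ['First sentence. Second sentence!', 'Only one sentence here.']

docs = []
video_titles = []
for title, text in zip(titles, texts):
    sents = naive_sent_split(text)
    docs.extend(sents)
    # Repeat the title once per sentence so both lists stay the same length
    video_titles.extend([title] * len(sents))

for title, sentence in zip(video_titles, docs):
    print(title, '|', sentence)
```

Keeping the two lists the same length is what later allows them to be placed side by side as DataFrame columns.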

Let's save the texts to Google Drive, just in case.

In [33]:
import pandas as pd

df = pd.DataFrame({"Title": video_titles, "Text": docs})
df.head()

| | Title | Text |
|---|---|---|
| 0 | ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022) | accogliamo un ultimo dei nostri amici. |
| 1 | ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022) | Parlaramo poco fammi, ha detto, sono una perso... |
| 2 | ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022) | Quindi diamo la parola per ultimo proprio il s... |
| 3 | ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022) | Cioè, presentazione alla Gesmundo New Look, se... |
| 4 | ASSEMBLEA NAZIONALE COLDIRETTI (28.07.2022) | Non ho il compitino pronto, cioè il mio uffici... |


In [41]:
df.to_csv('/content/drive/MyDrive/Data/political_speeches/politopic_salvini.tsv', sep='\t', index=False)

In [8]:
import pandas as pd

In [10]:
df = pd.read_csv('/content/drive/MyDrive/Data/political_speeches/politopic_salvini.tsv', sep='\t')

## Latent Dirichlet Allocation

In [14]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
it_stopwords = stopwords.words('italian')

In [16]:
it_stopwords[:20]

['ad',
 'al',
 'allo',
 'ai',
 'agli',
 'all',
 'agl',
 'alla',
 'alle',
 'con',
 'col',
 'coi',
 'da',
 'dal',
 'dallo',
 'dai',
 'dagli',
 'dall',
 'dagl',
 'dalla']

In [7]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
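# Drop words found in more than 90% of the sentences or in fewer than 3 sentences
cv = CountVectorizer(max_df=0.9, min_df=3, stop_words=it_stopwords)
# Five topics; no random_state is set, so results can differ between runs
lda = LatentDirichletAllocation(n_components=5)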
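
`max_df=0.9` and `min_df=3` prune the vocabulary by document frequency: a float is read as a proportion of documents, an integer as an absolute count. A rough pure-Python sketch of that rule on a made-up corpus follows. The helper name `prune_by_document_frequency` and the whitespace tokenisation are assumptions made for illustration, not scikit-learn's implementation.

```python
from collections import Counter

def prune_by_document_frequency(docs, max_df=0.9, min_df=3):
    # Hypothetical helper: count in how many documents each word appears
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(docs)
    # Keep words seen in at least min_df documents and in at most max_df of them
    return sorted(
        word for word, freq in doc_freq.items()
        if freq >= min_df and freq <= max_df * n_docs
    )

# Made-up corpus, for illustration only
docs = [
    'italia lavoro tasse',
    'italia lavoro',
    'italia lavoro agricoltura',
    'italia tasse',
]
kept = prune_by_document_frequency(docs)
print(kept)
```

In this toy corpus `italia` is too frequent (4 of 4 documents), while `tasse` and `agricoltura` are too rare, so only `lavoro` survives.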

In [25]:
count = cv.fit_transform(df.Text)
lda.fit_transform(count)

array([[0.10000576, 0.10000503, 0.1007525 , 0.10000721, 0.5992295 ],
       [0.10000769, 0.10000689, 0.10102752, 0.5989493 , 0.10000859],
       [0.03412359, 0.03373869, 0.86335336, 0.03444671, 0.03433765],
       [0.03361694, 0.03338984, 0.03338702, 0.86611208, 0.03349412],
       [0.05016168, 0.05014483, 0.05072297, 0.79879065, 0.05017986],
       [0.1014639 , 0.10127744, 0.10000757, 0.10001141, 0.59723968],
       [0.05147728, 0.2999771 , 0.29865569, 0.05033114, 0.29955879],
       [0.02044627, 0.02020788, 0.02010445, 0.02013906, 0.91910234],
       [0.90917128, 0.0223167 , 0.02259083, 0.02282919, 0.023092  ],
       [0.02897043, 0.88382139, 0.02874845, 0.02859883, 0.0298609 ],
       [0.03439513, 0.03333615, 0.86531576, 0.03346464, 0.03348833],
       [0.05054971, 0.79877614, 0.05066846, 0.05000305, 0.05000265],
       [0.05090852, 0.05000499, 0.05032201, 0.79772848, 0.051036  ],
       [0.02874263, 0.02862965, 0.0286975 , 0.02893401, 0.88499622],
       [0.03357951, 0.03336408, 0.
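
Each row returned by `fit_transform` is one sentence's distribution over the five topics, so it sums to 1 and its largest entry marks the dominant topic. A small sketch using the first three rows printed above, copied by hand (`doc_topic` and `dominant_topic` are names introduced here):

```python
import numpy as np

# First three rows of the output above, copied by hand
doc_topic = np.array([
    [0.10000576, 0.10000503, 0.1007525, 0.10000721, 0.5992295],
    [0.10000769, 0.10000689, 0.10102752, 0.5989493, 0.10000859],
    [0.03412359, 0.03373869, 0.86335336, 0.03444671, 0.03433765],
])

# Each row is a probability distribution over the topics
rows_sum_to_one = np.allclose(doc_topic.sum(axis=1), 1.0)
print(rows_sum_to_one)

# 0-based index of the most probable topic for each sentence
dominant_topic = doc_topic.argmax(axis=1)
print(dominant_topic.tolist())
```

For these three sentences the dominant topics are 4, 3 and 2 (0-based).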

The sparse count matrix can be converted to a dense array with `count.toarray()`. Row 30, for example:

In [43]:
count.toarray()[30]

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0])

In [36]:
list(cv.get_feature_names_out())



['agricoltori',
 'agricoltura',
 'alimentare',
 'assolutamente',
 'centro',
 'cioè',
 'costa',
 'destra',
 'devo',
 'diretti',
 'due',
 'essere',
 'euro',
 'fa',
 'fare',
 'fatto',
 'già',
 'grazie',
 'inno',
 'intusiasta',
 'italia',
 'italiani',
 'lega',
 'livello',
 'no',
 'ora',
 'parola',
 'parte',
 'persone',
 'però',
 'piano',
 'poi',
 'prendo',
 'problema',
 'qua',
 'quando',
 'quindi',
 'ringrazio',
 'salute',
 'senza',
 'soldi',
 'solo',
 'stamattina',
 'stato',
 'sì',
 'tanti',
 'tema',
 'ultimo',
 'vita']
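
With this vocabulary, a row of the count matrix can be read back as words. The sketch below rebuilds row 30 from its three non-zero positions (indices 2, 30 and 44, read off the array printed earlier) and assumes that the vocabulary above is the one used to build that matrix:

```python
import numpy as np

# Vocabulary copied from the output above
feature_names = np.array([
    'agricoltori', 'agricoltura', 'alimentare', 'assolutamente', 'centro',
    'cioè', 'costa', 'destra', 'devo', 'diretti', 'due', 'essere', 'euro',
    'fa', 'fare', 'fatto', 'già', 'grazie', 'inno', 'intusiasta', 'italia',
    'italiani', 'lega', 'livello', 'no', 'ora', 'parola', 'parte', 'persone',
    'però', 'piano', 'poi', 'prendo', 'problema', 'qua', 'quando', 'quindi',
    'ringrazio', 'salute', 'senza', 'soldi', 'solo', 'stamattina', 'stato',
    'sì', 'tanti', 'tema', 'ultimo', 'vita',
])

# Row 30 of the count matrix: ones at positions 2, 30 and 44, zeros elsewhere
row = np.zeros(len(feature_names), dtype=int)
row[[2, 30, 44]] = 1

# Positions of the non-zero counts, mapped back to words
present = np.flatnonzero(row)
words_in_row = feature_names[present].tolist()
print(words_in_row)
```

Under that assumption, sentence 30 contributes only the words `alimentare`, `piano` and `sì` to the model; everything else in it was removed as a stop word or pruned by `max_df`/`min_df`.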

In [54]:
len(lda.components_)

5

In [47]:
first_topic = lda.components_[0]
first_topic

array([0.20002656, 2.19912448, 1.1980291 , 1.20130993, 1.20061042,
       1.19743369, 0.20001028, 1.20031285, 0.20001323, 0.20002601,
       0.20001088, 2.20764045, 0.20001411, 0.20002617, 2.20178964,
       0.20002087, 1.19988985, 0.20002881, 3.20947825, 0.20002755,
       2.20319975, 3.20301082, 1.20115287, 1.19669393, 1.20233841,
       0.20000756, 0.20001194, 0.20002497, 0.20001289, 3.19965086,
       1.20327205, 1.19200254, 0.20000767, 1.19423074, 1.20289721,
       0.20002626, 3.20164163, 0.20000651, 0.2000111 , 3.20448175,
       0.20001028, 0.20001095, 0.20001663, 2.1983794 , 1.18912246,
       1.19960469, 1.3484609 , 0.20097756, 0.20001164])

In [45]:
first_topic.argsort()

array([37, 25, 32, 40,  6, 10, 41, 38, 48, 26, 28,  8, 12, 42, 15, 27,  9,
       13, 35,  0, 19, 17, 47, 44, 31, 33, 23,  5,  2, 45, 16,  7,  4, 22,
        3, 24, 34, 30, 46, 43,  1, 14, 20, 11, 29, 36, 21, 39, 18])

In [48]:
first_topic.argsort()[-10:]

array([43,  1, 14, 20, 11, 29, 36, 21, 39, 18])
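
`argsort` returns indices in ascending order of weight, so the slice `[-10:]` holds the ten heaviest words with the strongest one last. Reversing the slice gives a strongest-first ranking. A toy sketch with made-up weights and words:

```python
import numpy as np

# Made-up topic weights and vocabulary, for illustration only
weights = np.array([0.2, 3.2, 1.2, 2.2])
words = np.array(['euro', 'inno', 'lega', 'italia'])

# Indices ordered from the lightest to the heaviest weight
ascending = weights.argsort()
# Two heaviest words, strongest first
top2 = ascending[-2:][::-1]

print(ascending.tolist())
print(words[top2].tolist())
```

Whether to reverse the slice is a presentation choice; the set of selected words is the same either way.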

In [59]:
list(cv.get_feature_names_out()[21:23])

['italiani', 'lega']

In [60]:
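def get_feature_names_per_topic(lda_components, count_vectorizer, topn=5):
  """Label each topic with its `topn` highest-weighted words, joined by '_'.

  Words are listed in ascending order of weight, so the strongest word comes last.
  """
  feature_names = count_vectorizer.get_feature_names_out()
  topics = []

  for topic in lda_components:
    # argsort is ascending, so the last `topn` indices are the top words
    top_words_idx = topic.argsort()[-topn:]
    topics.append('_'.join(feature_names[idx] for idx in top_words_idx))

  return topics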

In [62]:
topics = get_feature_names_per_topic(lda.components_, cv)

In [63]:
topics

['però_quindi_italiani_senza_inno',
 'quindi_poi_devo_ringrazio_fatto',
 'prendo_grazie_soldi_costa_fa',
 'intusiasta_cioè_problema_sì_alimentare',
 'salute_ultimo_agricoltura_senza_quando']