Recommending YouTube videos based on personal needs, using YouTube API and specified tags / fields of interest through GenSim and BERTopic
Extends this project, based on this Medium article
pytube is used to scrape video metadata; youtube-transcript-api is used to download captions.
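A minimal sketch of how these two libraries are typically called (illustrative only, not this package's internal code; the video id is just an example):

```python
from pytube import YouTube
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "dQw4w9WgXcQ"  # example id

# pytube: fetch basic metadata for a single video
yt = YouTube(f"https://www.youtube.com/watch?v={video_id}")
print(yt.title, yt.author, yt.length, yt.views)

# youtube-transcript-api: download the caption track as a list of
# {"text", "start", "duration"} dicts
transcript = YouTubeTranscriptApi.get_transcript(video_id)
caption_text = " ".join(chunk["text"] for chunk in transcript)
```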
# add YouTube v3 API key to youtube_recommender/config/config.yaml
# structure:
# api_key:
# "your_api_key"
# copy configuration files
# to installed package
cp -r ${SECRETS_DIR}/rarc/config/youtube_recommender/* ~/anaconda3/envs/py39/lib/python3.9/site-packages/youtube_recommender/config
# to cloned repo
cp -r ${SECRETS_DIR}/rarc/config/youtube_recommender/* ~/repos/youtube-recommender/youtube_recommender/config
# install deps
pip install -r requirements.txt
# download english corpus for SpaCy
python -m spacy download en_core_web_sm
# install app
pip install -U ~/repos/youtube_recommender
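The package reads the API key from the config.yaml copied above; a minimal sketch of loading it with PyYAML (the loader inside the package may differ):

```python
from pathlib import Path
import yaml

# path as described above; point this at the installed package's config if needed
cfg_path = Path("youtube_recommender/config/config.yaml")
with cfg_path.open() as fh:
    cfg = yaml.safe_load(fh)

api_key = cfg["api_key"]  # structure as shown in the snippet above
```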
Search for YouTube videos, example usage:
# alias ipy="ipython --no-confirm-exit --no-banner -i"
ipy -m youtube_recommender -- 'search term 1' 'search term 2'
# with different search-period (default is 365 days)
# make sure to decrease it for recent videos, or increase it for all-time popular videos
ipy -m youtube_recommender -- 'search term 1' --search-period 29
# save top_videos to feather
ipy -m youtube_recommender -- 'search term 1' 'search term 2' --save
# to inspect results, inspect the `res` object, or `df` for only top_videos
python -m youtube_recommender --help
:
usage: __main__.py [-h] [--search-period SEARCH_PERIOD] [--dryrun] [-f] [--filter] [-n NITEMS] [-s] [-p] search_terms [search_terms ...]
Defining search parameters
positional arguments:
search_terms The terms to query. Can be multiple.
options:
-h, --help show this help message and exit
--search-period SEARCH_PERIOD
The number of days to search for.
--dryrun only load modules, do not request APIs
-f, --force force to run query, do not use cache
--filter filter non English titles from dataset using langid
-n NITEMS, --nitems NITEMS
Max search results to fetch from YouTube API
-s, --save Save results to feather file
-p, --push_db push queryResult and Video rows to PostgreSQL
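Under the hood the search goes through the YouTube Data API v3; a rough, hedged sketch of what `--search-period`, `--nitems` and `--filter` correspond to, calling google-api-python-client and langid directly (illustrative only, not the package's internals):

```python
from datetime import datetime, timedelta, timezone

import langid
from googleapiclient.discovery import build

# key from config.yaml
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

# --search-period N -> only videos published in the last N days (default 365)
published_after = (datetime.now(timezone.utc) - timedelta(days=365)).isoformat()

# --nitems -> maxResults (the API caps a single page at 50 results)
response = youtube.search().list(
    q="search term 1",
    part="snippet",
    type="video",
    order="viewCount",          # popularity ordering is an assumption here
    publishedAfter=published_after,
    maxResults=50,
).execute()

# --filter -> drop results whose title does not classify as English
items = [
    item for item in response["items"]
    if langid.classify(item["snippet"]["title"])[0] == "en"
]
```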
Download captions for YouTube videos, example usage:
ipy -m youtube_recommender.topicer -- 'video_id_1' 'video_id_2'
# optionally save captions to feather file
ipy -m youtube_recommender.topicer -- 'video_id_1' 'video_id_2' --save_captions
# load video_ids from top_videos.feather file automatically
ipy -m youtube_recommender.topicer -- --save_captions --from_feather
# keep videos data with captions
ipy -m youtube_recommender.topicer -- --save_captions --from_feather --merge_with_videos
# short, most used option:
ipy -m youtube_recommender.topicer -- --from_feather -sp
# to inspect results, inspect the `captions` object
# download video caption and save to clipboard
python -m youtube_recommender.topicer --with_start_times --to_clipboard \
VIDEO_ID
python -m youtube_recommender.topicer --help
:
usage: __main__.py [-h] [--from_feather] [-n N] [--dryrun] [-f] [--merge_with_videos] [--with_start_times] [-s] [-c] [-p] [video_ids ...]
Defining parameters
positional arguments:
video_ids The YouTube videos to extract captions from. Can be multiple.
options:
-h, --help show this help message and exit
--from_feather Import video ids from `/home/paul/repos/youtube-recommender/youtube_recommender/data/top_videos.feather`, created in main.py. ignores any manually passed video_ids
-n N select first `n` rows from feather file
--dryrun only load data, do not download captions
-f, --force force to download captions, do not use cache
--merge_with_videos merge resulting captions dataset with videos metadata
--with_start_times include start_times in the output caption string
-s, --save_captions Save captions to `/home/paul/repos/youtube-recommender/youtube_recommender/data/captions.feather`
-c, --to_clipboard Save captions to clipboard
-p, --push_db push Video, Channel and Caption rows to PostgreSQL
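The feather files behind `--save`, `--from_feather` and `--merge_with_videos` are plain pandas round-trips; a hedged sketch of the idea (the join column name is an assumption about the schema):

```python
import pandas as pd

# written by the search step (--save)
videos = pd.read_feather("youtube_recommender/data/top_videos.feather")

# written by the caption step (--save_captions)
captions = pd.read_feather("youtube_recommender/data/captions.feather")

# --merge_with_videos: keep caption rows together with their video metadata
merged = captions.merge(videos, on="video_id", how="inner")
```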
Convert `.ipynb` to `.py` files and run them in IPython:
cd ~/repos/youtube-recommender/youtube_recommender
jupyter nbconvert --to script recommend/logistic_regression/train.ipynb && ipy recommend/logistic_regression/train.py
Deploy locally with docker-compose
# rsync nginx configuration: nginx.cert and nginx.key
rsync -avz -e "ssh -p PORT" --progress USER@HOST:/home/paul/repos/youtube-recommender/cert ~/repos/youtube-recommender
# create network
docker network create microservices
# deploy scraper
docker-compose up -d --build --scale scrape-service=5 && docker-compose logs -f
# this should automatically update ./nginx/includes/grpcservers with compose service names
# check if the reverse proxy is running successfully
docker logs youtube-recommender_nginx-reverseproxy_1 --tail 20 -f
# maybe convert service names manually to upstream nginx servers
cd ./nginx
chmod +x ./save_server_names.sh
./save_server_names.sh ./includes/grpcservers
Deploy to production with Kubernetes:
# create persistent volumes, if not already created
# k apply -f $(find ./kubernetes -name 'persistentvolume0.yaml' -type f | tr '\n' ',' | sed 's/,$//')
# I use secrets for now, to keep pods stateless
# only deploy scrape-service and its secret files
k apply -f $(find ./kubernetes \( -name 'scrape-service*.yaml' -o -name '*secret.yaml' \) -type f | tr '\n' ',' | sed 's/,$//')
# verify that the service DNS resolves to all pods
k run dnsutils --image=tutum/dnsutils --command -- sleep infinity
# should show a list of multiple ips
k exec dnsutils -- nslookup scrape-service
# get all logs
k logs --selector io.kompose.service=scrape-service
# create a file with kubernetes pod names, and inline it into `nginx-inline.conf`. OLD
# cd ./nginx
# ./save_server_names.sh ./includes/grpcservers kubernetes
# ./create_inlined_conf.sh && less nginx-inline.conf
# save this conf to secret file base64-encoded
# alias cs="xclip -selection clipboard"
# cat ./nginx-inline.conf | base64 | cs
# and paste it into ./kubernetes/secrets/nginx-conf-secret.yaml
# now you can deploy load balancer
# k apply -f $(find ./kubernetes \( -name 'nginx-reverseproxy*.yaml' -o -name '*secret.yaml' \) -type f | tr '\n' ',' | sed 's/,$//')
# apply grpc linkerd ingress
# k create ns ingress-nginx
# k apply -f ./kubernetes/ingress-service.yaml
# now add a dummy load balancer, to request an external-ip from cloud provider
# k apply -f kubernetes/loadbalancer-service.yaml
# now expose this reverseproxy endpoint
# kubectl expose deployment nginx-reverseproxy --port=1443 --target-port=1443 --name=external-service --type=NodePort
# expose service
# k expose deployment scrape-service --type=LoadBalancer --name=scrape-service-ext
k expose deployment scrape-service --type=NodePort --target-port=50051 --name=scrape-service-ext
# wait for external-ip to be assigned
# k get svc
# k describe svc scrape-service-ext
# access the service on the node where it is running, at the assigned NodePort
# linkerd for meshing your grpc cluster
# install linkerd onto your cluster
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
# assuming pods run in `default` namespace
kubectl get -n default deploy scrape-service -o yaml \
| linkerd inject - \
| kubectl apply -f -
# show linkerd dashboard
# install viz extension first
linkerd viz install | kubectl apply -f -
linkerd viz dashboard &
# and test your cluster
# externalIp=$(k get svc my-loadbalancer -o=json | jq --raw-output '.status.loadBalancer.ingress[0].ip')
externalIp=$(k get svc nginx-ingress-ingress-nginx-controller -o=json | jq --raw-output '.status.loadBalancer.ingress[0].ip')
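A minimal connectivity check against the exposed scrape-service, assuming a plaintext gRPC endpoint on port 50051 reachable at that external IP (if TLS terminates at the nginx proxy on 1443, use `grpc.secure_channel` with the cluster certificate instead):

```python
import grpc

external_ip = "203.0.113.10"  # value of $externalIp from the command above
channel = grpc.insecure_channel(f"{external_ip}:50051")

# blocks until the channel is ready or the timeout expires
grpc.channel_ready_future(channel).result(timeout=10)
print("scrape-service is reachable")
```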
k create ns monitoring
# k apply -f ./kubernetes/prometheus/deployment.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# deploy
helm install -n monitoring prometheus prometheus-community/prometheus
k -n monitoring get po -w
# delete helm package
helm uninstall prometheus