uriscrape

Scrapes URIs from Telegram channel transcripts in PDF files. URIs typically look like these examples:

https://t.me/joinchat/AAAAAEOs3wFD4Mv6SN4hlQ

(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)

https://drive.google.com/open?id=0B_3xyna6XV4GMHNPU0VVWHZKRXc

https://archive.org/details/Rumiyah13UR_201709
(https://archive.org/details/Rumiyah13UR_201709)

(tg://search_hashtag?hashtag=%D8%A6%DB%95%D9%84%DA%BE%D8%A7%D9%8A%D8%A7%D8%AA)
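As a rough illustration of how URIs like the examples above could be pulled out of transcript text, here is a minimal sketch. The pattern `URI_PATTERN` and the helper `find_uris` are hypothetical names chosen for this example and are not taken from uriscrape.py; the sketch also assumes the PDF text has already been extracted to a string.

```python
import re

# Hypothetical pattern (an assumption, not the one used by uriscrape.py):
# matches http://, https:// and tg:// URIs up to whitespace or a bracket.
URI_PATTERN = re.compile(r"(?:https?|tg)://[^\s()<>\[\]]+")


def find_uris(text):
    """Return URIs in order of first appearance, without duplicates."""
    seen = set()
    found = []
    for match in URI_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a URI in running text.
        uri = match.group(0).rstrip(".,;:!?")
        if uri not in seen:
            seen.add(uri)
            found.append(uri)
    return found


sample = (
    "https://archive.org/details/Rumiyah13UR_201709\n"
    "(https://archive.org/details/Rumiyah13UR_201709)\n"
    "(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)\n"
)
print(find_uris(sample))
```

Excluding parentheses keeps the wrapped forms such as `(tg://...)` clean, at the cost of truncating the rare URI that legitimately contains a parenthesis. Dropping duplicates is a choice made for this sketch only.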

Running the program

usage: `python uriscrape.py transcript`

positional arguments:
  transcript         filepath to transcript pdf or directory

optional arguments:
  None yet...
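The command line described above could be handled roughly as follows. This is only a sketch under assumptions: `build_parser` and `collect_pdfs` are hypothetical helpers, and the way uriscrape.py actually walks a directory is not documented here.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the single positional argument from the usage text."""
    parser = argparse.ArgumentParser(prog="uriscrape.py")
    parser.add_argument(
        "transcript", help="filepath to transcript pdf or directory"
    )
    return parser


def collect_pdfs(path):
    """Return the PDF files to process for a file or directory path.

    Assumption: a directory is searched one level deep for *.pdf files.
    """
    path = Path(path)
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.suffix.lower() == ".pdf")
    return [path]


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["transcript.pdf"])
for pdf in collect_pdfs(args.transcript):
    print(pdf)
```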

Output file

urls.xlsx - All URIs found, with the following columns:

  • File: PDF file processed
  • Access_Date: Date and time the program was run. Useful for documenting when the program attempted to resolve the URIs
  • Post_Date: Date of the post, as derived from the date labels in the Telegram transcript
  • URL: URL as found
  • Site_Reached: True/False - whether the URI could be resolved
  • Unshortened URL: Full URL after following a shortened link (e.g. https://youtu.be/lqXwyl89xU4 unshortens to https://www.youtube.com/watch?v=lqXwyl89xU4&feature=youtu.be )
  • Status: Error code, if an error was encountered in trying to access the URI
  • Type: Classification of the link
  • Hashtag: Hashtag, if the link is a Telegram hashtag link
  • Channel: Channel, if the link is a Telegram join link
  • Account: Account, if the link is a Telegram account link
  • Domain: Full server domain (e.g. www.youtube.com)
  • Primary_Secondary: Just the primary and secondary portions of the domain (e.g. youtube.com)
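
To illustrate how some of the columns above might be derived from a single URI, here is a minimal sketch using only the standard library. The function `describe_uri`, the `Type` labels, and the decision to store the invite code in `Channel` are all assumptions made for this example; the classification rules in uriscrape.py may differ.

```python
from urllib.parse import parse_qs, urlsplit


def describe_uri(uri):
    """Fill a few output columns for one URI (hypothetical helper)."""
    parts = urlsplit(uri)
    row = {"URL": uri, "Type": "web", "Hashtag": "", "Channel": "",
           "Domain": "", "Primary_Secondary": ""}

    if parts.scheme == "tg":
        # For tg:// links the action (join, search_hashtag) sits in the netloc.
        query = parse_qs(parts.query)
        if parts.netloc == "search_hashtag":
            row["Type"] = "telegram_hashtag"  # assumed label
            row["Hashtag"] = query.get("hashtag", [""])[0]
        elif parts.netloc == "join":
            row["Type"] = "telegram_join"  # assumed label
            row["Channel"] = query.get("invite", [""])[0]
        else:
            row["Type"] = "telegram_other"  # assumed label
        return row

    host = parts.hostname or ""
    row["Domain"] = host
    # Naive split: keeps the last two labels, so it is wrong for
    # suffixes such as "co.uk".
    row["Primary_Secondary"] = ".".join(host.split(".")[-2:])
    if host == "t.me" and parts.path.startswith("/joinchat/"):
        row["Type"] = "telegram_join"
        row["Channel"] = parts.path.rsplit("/", 1)[-1]
    return row


print(describe_uri("https://www.youtube.com/watch?v=lqXwyl89xU4&feature=youtu.be"))
print(describe_uri("tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ"))
```

Taking the last two labels of the host name is the simplest way to get a value like youtube.com, but a public suffix list would be needed to handle multi-part suffixes correctly.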