uriscrape

Scrapes URIs from Telegram channel transcripts saved as PDF files. URIs typically look like these examples:

https://t.me/joinchat/AAAAAEOs3wFD4Mv6SN4hlQ

(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)

https://drive.google.com/open?id=0B_3xyna6XV4GMHNPU0VVWHZKRXc

https://archive.org/details/Rumiyah13UR_201709
(https://archive.org/details/Rumiyah13UR_201709)

(tg://search_hashtag?hashtag=%D8%A6%DB%95%D9%84%DA%BE%D8%A7%D9%8A%D8%A7%D8%AA)
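As a rough illustration of how URIs like the ones above could be pulled out of transcript text, here is a minimal sketch. It is not the code in uriscrape.py: the pattern, the `find_uris` helper, and the choice to drop duplicates are all assumptions made for this example. The pattern stops at whitespace and brackets, so a URL that legitimately contains parentheses would be cut short.

```python
import re

# Assumed pattern: http(s) and tg schemes, ending at whitespace or a bracket.
URI_PATTERN = re.compile(r"(?:https?|tg)://[^\s()<>\[\]]+")


def find_uris(text):
    """Return URIs in order of first appearance (hypothetical helper)."""
    found = []
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if uri not in found:  # assumption: report each URI once
            found.append(uri)
    return found


sample = (
    "https://archive.org/details/Rumiyah13UR_201709\n"
    "(https://archive.org/details/Rumiyah13UR_201709)\n"
    "(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)\n"
)
print(find_uris(sample))
```

The parenthesised copies in the sample are matched without their surrounding brackets, so the archive.org link is reported once.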

Running the program

usage: `python uriscrape.py transcript`

positional arguments:
  transcript         filepath to transcript pdf or directory

optional arguments:
  None yet...
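The usage text above could be produced by a command-line parser along these lines. This is only a sketch using the standard `argparse` module; `build_parser` and `collect_pdfs` are hypothetical names, and treating a directory argument as "every .pdf file directly inside it" is an assumption about how the program behaves.

```python
import argparse
from pathlib import Path


def build_parser():
    """Parser with the single positional argument shown in the usage text."""
    parser = argparse.ArgumentParser(prog="uriscrape.py")
    parser.add_argument("transcript",
                        help="filepath to transcript pdf or directory")
    return parser


def collect_pdfs(target):
    """Expand the argument into a list of PDF paths (assumed behaviour)."""
    path = Path(target)
    if path.is_dir():
        # Assumption: only PDFs directly inside the directory are processed.
        return sorted(p for p in path.iterdir() if p.suffix.lower() == ".pdf")
    return [path]


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["channel.pdf"])
print(collect_pdfs(args.transcript))
```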

Output file

urls.xlsx - All URIs found, with the following columns/variables:

  • File: PDF file processed
  • Access_Date: Date/time the program was run. Useful for documenting when the program attempted to resolve the URIs
  • Post_Date: Date of the post, as derived from the date labels in the Telegram transcript
  • URL: URL as found
  • Site_Reached: True/False - whether the URI could be resolved
  • Unshortened URL: Full URL behind a shortened link (e.g. https://youtu.be/lqXwyl89xU4 unshortens to https://www.youtube.com/watch?v=lqXwyl89xU4&feature=youtu.be )
  • Status: Error code, if an error was encountered in trying to access the URI
  • Type: Classification of the link
  • Hashtag: Hashtag, if the link is a Telegram hashtag link
  • Channel: Channel, if the link is a Telegram join link
  • Account: Account, if the link is a Telegram account link
  • Domain: Full server domain (e.g. www.youtube.com)
  • Primary_Secondary: Just the primary and secondary portions of the domain (e.g. youtube.com)
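
Several of the columns above (Type, Hashtag, Channel, Account, Domain, Primary_Secondary) can be derived from the URI string alone. The sketch below shows one way to do that with the standard `urllib.parse` module. It is an illustration, not the program's actual logic: `describe_uri` is a hypothetical helper, the Type labels are invented for this example, `tg://resolve?domain=...` is assumed to be the form of an account link, and taking the last two labels of the host is a naive rule that mishandles suffixes such as co.uk. Web-form Telegram links (https://t.me/...) are treated as ordinary web links here.

```python
from urllib.parse import urlsplit, parse_qs


def describe_uri(uri):
    """Build a partial output row for one URI (hypothetical helper)."""
    parts = urlsplit(uri)
    query = parse_qs(parts.query)  # also percent-decodes the values
    row = {"URL": uri, "Type": "web", "Hashtag": None, "Channel": None,
           "Account": None, "Domain": None, "Primary_Secondary": None}

    if parts.scheme == "tg":
        # In tg:// links the host position carries the action name.
        action = parts.netloc
        if action == "search_hashtag":
            row["Type"] = "telegram_hashtag"
            row["Hashtag"] = query.get("hashtag", [None])[0]
        elif action == "join":
            row["Type"] = "telegram_join"
            row["Channel"] = query.get("invite", [None])[0]
        elif action == "resolve":  # assumed form of an account link
            row["Type"] = "telegram_account"
            row["Account"] = query.get("domain", [None])[0]
        else:
            row["Type"] = "telegram_other"
    else:
        host = (parts.hostname or "").lower()
        row["Domain"] = host or None
        # Naive rule: keep the last two dot-separated labels of the host.
        row["Primary_Secondary"] = ".".join(host.split(".")[-2:]) or None
    return row


row = describe_uri("https://www.youtube.com/watch?v=lqXwyl89xU4&feature=youtu.be")
print(row["Domain"], row["Primary_Secondary"])
```

Columns that need network access (Site_Reached, Unshortened URL, Status) or the transcript context (File, Post_Date) are left out of this sketch.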