# Prototype: Telegram Scraping & Data Exploration

## Objective
This notebook serves as a sandbox environment to:
1. **Validate Credentials:** Test the connection to the Telegram API using `Telethon`.
2. **Schema Discovery:** Inspect the raw JSON structure of messages to design the Postgres database schema.
3. **Entity Analysis:** Handle differences between "Channels", "Groups", and "Users" to prevent scraper crashes.
4. **Media Testing:** Verify the logic for downloading and storing images locally.

## Prerequisites
*   Ensure your `.env` file is configured in the project root with:
    *   `TG_API_ID`
    *   `TG_API_HASH`
    *   `TG_PHONE_NUMBER`
*   **Note:** If this is your first time connecting, you will need to enter the OTP code sent to your Telegram app in the input field that appears.
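
The three variables above can be checked up front so that a missing or malformed value fails with a clear message instead of surfacing later as a `TypeError` in `int(api_id)`. The following is a minimal sketch, not part of the project code: `TelegramConfig` and `load_telegram_config` are hypothetical names, and the function takes any mapping (for example `os.environ` after `load_dotenv`) so it can be tried without a real `.env` file.

```python
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the Prerequisites list above.
REQUIRED_VARS = ("TG_API_ID", "TG_API_HASH", "TG_PHONE_NUMBER")


@dataclass(frozen=True)
class TelegramConfig:
    """Hypothetical container for the validated credentials."""
    api_id: int
    api_hash: str
    phone: str


def load_telegram_config(env: Mapping[str, str]) -> TelegramConfig:
    """Validate the required variables and convert the API ID to an int."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    try:
        api_id = int(env["TG_API_ID"])
    except ValueError as exc:
        raise ValueError("TG_API_ID must be an integer") from exc
    return TelegramConfig(api_id, env["TG_API_HASH"], env["TG_PHONE_NUMBER"])
```

In the notebook this would be called as `load_telegram_config(os.environ)` once `load_dotenv` has run.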

### 1. Imports and Setup

In [3]:
import os
import sys
import asyncio
from dotenv import load_dotenv
from telethon import TelegramClient

# Helper to allow async loops in notebooks
import nest_asyncio
nest_asyncio.apply()

# Load environment variables
load_dotenv('../.env') # Adjust path if your notebook is in notebooks/ subdir

api_id = os.getenv('TG_API_ID')
api_hash = os.getenv('TG_API_HASH')
phone = os.getenv('TG_PHONE_NUMBER')  # Same variable name as listed in Prerequisites

print(f"API ID Loaded: {api_id is not None}")

API ID Loaded: True


### 2. Client Initialization & Authentication
Creating the Telethon client session.

In [9]:
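from telethon.sessions import MemorySession

# Name for a file-based session; unused here because MemorySession is passed below
SESSION_NAME = 'medical_scraper_dev_session'

# Create the client. MemorySession keeps the login in memory only, so nothing is
# written to disk and every new kernel has to log in again.
client = TelegramClient(MemorySession(), int(api_id), api_hash)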

# Start the client (This will trigger the login flow if not authenticated)
await client.start(phone=phone) # type: ignore

print("Client Created and Authenticated!")

Signed in successfully as Nat; remember to not break the ToS or you will risk an account ban!
Client Created and Authenticated!


In [10]:
# Verify who we are logged in as
me = await client.get_me()
print(f"Logged in as: {me.username or me.first_name}")
print(f"Phone: {me.phone}")

Logged in as: Nat
Phone: 251712291187


### 3. Entity Inspection
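Testing how to robustly extract a display name from any entity type. `User` objects have no `title` attribute, so reading `entity.title` unconditionally raises `AttributeError` (the issue being fixed here).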

In [11]:
channel_handle = 'tikvahpharma' # Example channel

# Get the entity
entity = await client.get_entity(channel_handle)

# Inspect what attributes exist
print(f"ID: {entity.id}")
print(f"Type: {type(entity)}")

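# Test the robust name extraction logic.
# getattr(..., None) also covers attributes that exist but are None
# (e.g. a User without a username), which hasattr() would treat as present.
if getattr(entity, 'title', None):
    print(f"Title found: {entity.title}")
elif getattr(entity, 'username', None):
    print(f"Username found: {entity.username}")
elif getattr(entity, 'first_name', None):
    full_name = ' '.join(filter(None, [entity.first_name, getattr(entity, 'last_name', None)]))
    print(f"Name found: {full_name}")
else:
    print("No name attribute found")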

ID: 1569871437
Type: <class 'telethon.tl.types.Channel'>
Title found: Tikvah | Pharma
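
The same fallback order can be wrapped in a reusable helper so the scraper gets a name back for Channels, Groups, and Users alike. This is a minimal sketch under stated assumptions: `display_name` is a hypothetical function, and the `SimpleNamespace` objects are stand-ins for Telethon entities so the logic can be tried without a connection.

```python
from types import SimpleNamespace


def display_name(entity) -> str:
    """Return title, then username, then full name, else an ID-based placeholder."""
    title = getattr(entity, "title", None)
    if title:
        return title
    username = getattr(entity, "username", None)
    if username:
        return username
    parts = (getattr(entity, "first_name", None), getattr(entity, "last_name", None))
    full_name = " ".join(part for part in parts if part)
    if full_name:
        return full_name
    return f"unknown_{getattr(entity, 'id', 'entity')}"


# Stand-ins for a Channel, a User without username or last name, and an empty entity.
channel_like = SimpleNamespace(id=1569871437, title="Tikvah | Pharma", username="tikvahpharma")
user_like = SimpleNamespace(id=1, username=None, first_name="Nat", last_name=None)
bare_entity = SimpleNamespace(id=7)

print(display_name(channel_like))
print(display_name(user_like))
print(display_name(bare_entity))
```

Returning a placeholder instead of raising keeps a long scraping run alive when one entity has no usable name; whether that trade-off is right depends on how the names are used downstream.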


### 4. Message Structure Analysis
Inspecting raw message dictionaries to determine necessary columns for the Data Warehouse.

In [13]:
# Fetch last 5 messages
messages = await client.get_messages(channel_handle, limit=5)

for msg in messages:
    print(f"--- Message ID: {msg.id} ---")
    print(f"Date: {msg.date}")
    print(f"Views: {msg.views}")
    print(f"Text Preview: {msg.text[:50] if msg.text else '[No Text]'}")
    print(f"Has Media?: {bool(msg.media)}")
    print("-" * 20)

# Inspect the raw object of the most recent message (index 0) to see all available fields
print("\nRAW MESSAGE OBJECT (Last Message):")
print(messages[0].to_dict())

--- Message ID: 188968 ---
Date: 2026-01-16 11:48:23+00:00
Views: 2
Text Preview: 16**-01-2**6
Currently available Drugs 


**💊. Dic
Has Media?: False
--------------------
--- Message ID: 188963 ---
Date: 2026-01-16 10:46:10+00:00
Views: 227
Text Preview: NEW JOB  VACCANCY
07/05/2018
 SENIOR PHARMACIAT WZ
Has Media?: False
--------------------
--- Message ID: 188962 ---
Date: 2026-01-16 10:40:11+00:00
Views: 231
Text Preview: TO BE SOLD PHARMAC
08/05/2018
PIYASSA
ALEM BANK
BO
Has Media?: True
--------------------
--- Message ID: 188961 ---
Date: 2026-01-16 08:09:09+00:00
Views: 488
Text Preview: ·ã®·àö·à∏·å• ·çã·à≠·àõ·à≤ (1)
Price for all shelf¬† and key 600000

Has Media?: False
--------------------
--- Message ID: 188960 ---
Date: 2026-01-16 07:12:32+00:00
Views: 505
Text Preview: **üåÄ****MEDIX PHARMA PLC
**
*
Has Media?: True
--------------------

RAW MESSAGE OBJECT (Last Message):
{'_': 'Message', 'id': 188968, 'peer_id': {'_': 'PeerChannel', 'channel_id': 1569871437}, 'date':
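
The fields printed above suggest a flat row per message. The sketch below shows one possible mapping; the function name `message_to_row` and the column names are assumptions for illustration, not the final Postgres schema. It reads only the attributes already inspected in this cell (`id`, `date`, `views`, `text`, `media`), and a `SimpleNamespace` stands in for a Telethon message.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def message_to_row(msg, channel_name: str) -> dict:
    """Flatten one message into a dict keyed by (hypothetical) column names."""
    return {
        "message_id": msg.id,
        "channel_name": channel_name,
        "message_date": msg.date,                  # timezone-aware datetime (UTC)
        "message_text": msg.text or "",            # media-only posts have no text
        "views": getattr(msg, "views", None),      # left as None when no count is present
        "has_media": bool(msg.media),
    }


# Stand-in built from the values of message 188962 printed above.
sample = SimpleNamespace(
    id=188962,
    date=datetime(2026, 1, 16, 10, 40, 11, tzinfo=timezone.utc),
    views=231,
    text="TO BE SOLD PHARMAC",
    media=object(),
)
row = message_to_row(sample, "tikvahpharma")
print(row["message_id"], row["views"], row["has_media"])
```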

### 5. Media Download Test
Testing the download logic for images.

In [14]:
# Find a message with a photo
output_dir = '../data/raw/images_test'
os.makedirs(output_dir, exist_ok=True)

print("Searching for a photo...")
async for msg in client.iter_messages(channel_handle, limit=20):
    if msg.photo:
        print(f"Found photo in message {msg.id}")
        path = await client.download_media(msg.photo, file=output_dir)
        print(f"Saved to: {path}")
        break

Searching for a photo...
Found photo in message 188962
Saved to: ../data/raw/images_test\photo_2026-01-16_14-52-36.jpg
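
Passing a directory lets the library choose the file name, as the timestamp-based name above shows. If the stored images need to be traceable back to their messages, one option is to build the target path from the channel and message ID and pass it as `file=`. This is a minimal sketch: `build_image_path` and the `<base>/<channel>/<message_id>.jpg` layout are assumptions, not the project's actual convention.

```python
from pathlib import Path


def build_image_path(base_dir: str, channel: str, message_id: int) -> Path:
    """Return <base_dir>/<channel>/<message_id>.jpg without touching the disk."""
    return Path(base_dir) / channel / f"{message_id}.jpg"


target = build_image_path("../data/raw/images_test", "tikvahpharma", 188962)
print(target.name)

# In the download cell above, the directory would be created first and the
# path handed to the existing call (not executed here):
#   target.parent.mkdir(parents=True, exist_ok=True)
#   path = await client.download_media(msg.photo, file=str(target))
```

Using `pathlib` also avoids the mixed `/` and `\` separators visible in the saved path above.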


### 6. Cleanup
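**Important:** Always disconnect the client when you are done testing. With a file-based (SQLite) session, a client that stays connected keeps the session file locked, and the main script (`src/scraper.py`) fails with a "database is locked" error if it opens the same session file, because two clients cannot use one session file at the same time. This notebook creates the client with `MemorySession`, which writes no session file, but disconnecting still closes the connection cleanly.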

In [15]:
await client.disconnect()
print("Client Disconnected. You can now run the main script.")

Client Disconnected. You can now run the main script.
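
Because a forgotten disconnect is easy to miss when a cell raises halfway through, the cleanup can be tied to the work itself with `try`/`finally`. The sketch below is illustrative only: `run_then_disconnect` is a hypothetical helper, and `FakeClient` is a stand-in exposing just the `disconnect()` coroutine so the pattern can be run without Telegram credentials.

```python
import asyncio


async def run_then_disconnect(client, job):
    """Await job(client) and always disconnect afterwards, even on errors."""
    try:
        return await job(client)
    finally:
        await client.disconnect()


class FakeClient:
    """Stand-in for a connected client; records whether disconnect() ran."""

    def __init__(self):
        self.disconnected = False

    async def disconnect(self):
        self.disconnected = True


async def failing_job(client):
    raise RuntimeError("scrape failed")


fake = FakeClient()
try:
    asyncio.run(run_then_disconnect(fake, failing_job))
except RuntimeError:
    pass
print(f"Disconnected after failure: {fake.disconnected}")
```

Inside the notebook the helper would be awaited directly (`await run_then_disconnect(client, job)`) rather than wrapped in `asyncio.run`.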
