# 🤖 Social Media Post Scraper - Google Colab Setup

This notebook sets up and runs the Chinese Social Media scraper in Google Colab.

## ⚠️ Important Limitations in Google Colab

✅ **What WILL work:**
- Douyin post scraping (including comments)
- Weibo post scraping
- Basic Weixin scraping (title, content, publish date)
- Database storage and Excel export

❌ **What WON'T work:**
- `scrape_weixin_post_ui.py` (requires Windows desktop automation)
- Advanced Weixin metrics (like/share/comment counts - requires Windows UI automation)

---

## Step 1: Install System Dependencies

In [None]:
# Install Tesseract OCR and language packs
!apt-get update
!apt-get install -y tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-eng

print("✅ System dependencies installed!")
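
Before moving on, it can help to confirm that the language packs were actually installed. The following is a minimal sketch, not part of the repository: the helper names `parse_lang_list` and `installed_tesseract_langs` are hypothetical, and it assumes the `tesseract` binary supports the `--list-langs` option and reports the packs above as `chi_sim`, `chi_tra`, and `eng`.

```python
import shutil
import subprocess


def parse_lang_list(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # The first line is a header such as "List of available languages (4):"
    return [ln for ln in lines if not ln.lower().startswith("list of")]


def installed_tesseract_langs(binary="tesseract"):
    """Return installed language codes, or None if the binary is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--list-langs"], capture_output=True, text=True)
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_lang_list(result.stdout + "\n" + result.stderr)


langs = installed_tesseract_langs()
if langs is None:
    print("tesseract not found on PATH")
else:
    missing = {"chi_sim", "chi_tra", "eng"} - set(langs)
    print("Missing language packs:", sorted(missing) or "none")
```

If any pack is reported missing, re-running the install cell above is usually the first thing to try.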

## Step 2: Clone the Repository

In [None]:
# Clone your repository
!git clone https://github.com/pwklam/scrape_chinese_social_media.git
%cd scrape_chinese_social_media

# List files to verify
!ls -la

print("\n✅ Repository cloned successfully!")

## Step 3: Install Python Dependencies

In [None]:
# Install required Python packages
!pip install -q playwright pandas openpyxl pytesseract Pillow

print("✅ Python packages installed!")

## Step 4: Install Playwright Browsers

In [None]:
# Install Playwright and its browser dependencies
!playwright install chromium
!playwright install-deps chromium

print("\n✅ Playwright browsers installed!")

## Step 5: Update Configuration for Colab Environment

In [None]:
# Update config.py to use Colab's Tesseract path
config_content = '''import os

# Tesseract OCR configuration (Colab path)
tesseract_cmd = '/usr/bin/tesseract'

# Database configuration
DATABASE_PATH = 'data.db'

# Scraping configuration
DOUYIN_TIMEOUT = 30000  # 30 seconds
WEIBO_TIMEOUT = 30000
WEIXIN_TIMEOUT = 30000

# Comment scraping configuration
MAX_COMMENTS = 20
SCROLL_ATTEMPTS = 3
'''

with open('config.py', 'w', encoding='utf-8') as f:
    f.write(config_content)

print("✅ Configuration updated for Colab environment!")

## Step 6: Add Your URLs

Edit the `urls.txt` file with your target URLs (one per line).

In [None]:
# Example: Create urls.txt with sample URLs
# REPLACE THESE WITH YOUR ACTUAL URLS!

urls_content = '''# Add your URLs here, one per line
# Douyin example:
# https://www.douyin.com/video/1234567890123456789

# Weibo example:
# https://weibo.com/1234567890/Abcdefghijk

# Weixin example:
# https://mp.weixin.qq.com/s/abcdefghijklmnopqrstuvwxyz
'''

with open('urls.txt', 'w', encoding='utf-8') as f:
    f.write(urls_content)

print("✅ urls.txt created!")
print("\n⚠️  IMPORTANT: Replace the placeholders in urls.txt with your actual URLs before running the scraper!")
print("\nTo change them, edit urls_content in this cell and re-run it.")

## Step 7: View/Edit urls.txt

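Use this cell to view the current contents of `urls.txt`. To change the URLs, edit `urls_content` in Step 6 and re-run that cell, or open `urls.txt` in the Colab file browser.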

In [None]:
# View current urls.txt content
!cat urls.txt
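
A quick sanity check of the URL list can catch typos before a full scraping run. The sketch below is illustrative only: `classify_url` and `read_urls` are hypothetical helpers, the hostnames are taken from the example URLs in Step 6, and it assumes that `main.py` ignores blank lines and lines starting with `#`, as the template suggests.

```python
from urllib.parse import urlparse

# Hostnames taken from the example URLs in Step 6; adjust if your links differ.
PLATFORM_HOSTS = {
    "douyin": ("douyin.com",),
    "weibo": ("weibo.com",),
    "weixin": ("mp.weixin.qq.com",),
}


def classify_url(url):
    """Return the platform name for a URL, or None if the host is not recognized."""
    host = (urlparse(url).hostname or "").lower()
    for platform, suffixes in PLATFORM_HOSTS.items():
        if any(host == s or host.endswith("." + s) for s in suffixes):
            return platform
    return None


def read_urls(text):
    """Return (url, platform) pairs, skipping blank lines and '#' comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pairs.append((line, classify_url(line)))
    return pairs


sample = """# Douyin example:
https://www.douyin.com/video/1234567890123456789
https://example.com/not-a-supported-site
"""
for url, platform in read_urls(sample):
    print(platform or "UNRECOGNIZED", url)
```

To check the real file, pass `open('urls.txt', encoding='utf-8').read()` to `read_urls` instead of `sample`.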

## Step 8: Run the Scraper

This will scrape all URLs from `urls.txt` and save data to `data.db`.

In [None]:
# Run the main scraper
!python main.py

print("\n✅ Scraping completed! Check the output above for details.")

## Step 9: Export Data to Excel

In [None]:
# Export database to Excel
!python export_excel_data.py

print("\n✅ Data exported to data.xlsx!")

## Step 10: Preview the Data

In [None]:
# Preview the scraped data
import pandas as pd

try:
    df = pd.read_excel('data.xlsx')
    print(f"Total posts scraped: {len(df)}")
    print("\nFirst few rows:")
    display(df.head())
    
    print("\nColumn summary:")
    df.info()  # info() prints its summary directly and returns None
except FileNotFoundError:
    print("⚠️  No data.xlsx file found. Make sure scraping completed successfully.")
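
If the Excel export is missing or looks incomplete, the database can also be inspected directly. This is a minimal sketch that assumes `data.db` is a SQLite file; the table names are not documented here, so the hypothetical helper `table_row_counts` discovers them from `sqlite_master` instead of hard-coding any.

```python
import os
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    if not os.path.exists(db_path):
        # sqlite3.connect would otherwise silently create an empty database.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if os.path.exists("data.db"):
    for table, count in table_row_counts("data.db").items():
        print(f"{table}: {count} rows")
else:
    print("data.db not found")
```

A table with zero rows usually means the corresponding URLs failed to scrape, in which case `app.log` is the next place to look.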

## Step 11: Download Results

Download the database, Excel file, and log file to your local machine.

In [None]:
from google.colab import files
import os

# Download each result file if it exists
for name in ('data.xlsx', 'data.db', 'app.log'):
    if os.path.exists(name):
        files.download(name)
        print(f"✅ Downloaded: {name}")
    else:
        print(f"⚠️  {name} not found")
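
Some browsers prompt before allowing several downloads in a row, so bundling the results into one archive can be more convenient. The sketch below uses only the standard library; the helper `bundle_results` and the archive name `results.zip` are illustrative choices, not part of the repository.

```python
import os
import zipfile


def bundle_results(paths, archive="results.zip"):
    """Zip whichever of `paths` exist; return the list of files included."""
    included = [p for p in paths if os.path.exists(p)]
    if not included:
        return []  # nothing to bundle, so no archive is created
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in included:
            # Store each file under its bare name rather than its full path.
            zf.write(p, arcname=os.path.basename(p))
    return included


included = bundle_results(["data.xlsx", "data.db", "app.log"])
print("Bundled:", included)
# In Colab, follow with:
#   from google.colab import files
#   files.download("results.zip")
```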

## 📝 Additional Notes

### Troubleshooting:

1. **Timeouts**: If scraping times out, increase the timeout values in `config.py` (`DOUYIN_TIMEOUT`, `WEIBO_TIMEOUT`, `WEIXIN_TIMEOUT`, all in milliseconds)
2. **Missing data**: Some platforms may block automated access - check `app.log` for details
3. **Empty results**: Verify your URLs are correct and accessible
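
For the timeout case, the values can be changed without retyping the whole configuration cell. The following is a minimal sketch: `set_timeouts` is a hypothetical helper that rewrites every `*_TIMEOUT = <number>` assignment in the text of `config.py`, assuming the file keeps the simple layout written in Step 5.

```python
import re


def set_timeouts(config_text, milliseconds):
    """Return config_text with every *_TIMEOUT = <int> assignment set to `milliseconds`."""
    pattern = re.compile(r"^(\w+_TIMEOUT\s*=\s*)\d+", flags=re.MULTILINE)
    return pattern.sub(lambda m: f"{m.group(1)}{milliseconds}", config_text)


sample = (
    "DOUYIN_TIMEOUT = 30000  # 30 seconds\n"
    "WEIBO_TIMEOUT = 30000\n"
    "MAX_COMMENTS = 20\n"
)
print(set_timeouts(sample, 60000))
```

To apply it for real, read `config.py`, pass its text through `set_timeouts`, and write the result back before re-running Step 8. Trailing comments such as `# 30 seconds` are left untouched and may need updating by hand, and re-running Step 5 resets the values to their defaults.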

### Platform-Specific Notes:

- **Douyin**: Includes up to 20 comments per post with author, content, timestamp, and likes
- **Weibo**: Includes basic post metrics
- **Weixin**: Only basic info (title, content, date) - advanced metrics require Windows UI automation

### Re-running the Scraper:

To scrape new URLs:
1. Update `urls.txt` (Steps 6-7)
2. Re-run Steps 8-11

---

**Repository**: [pwklam/scrape_chinese_social_media](https://github.com/pwklam/scrape_chinese_social_media)