# 🤖 Social Media Post Scraper - Google Colab Setup

This notebook sets up and runs the Chinese social media scraper in Google Colab.

## ⚠️ Important Limitations

✅ **What WILL work:**
- Douyin post scraping (including comments)
- Weibo post scraping (including comments)
- Basic Weixin scraping (title, content, publish date)
- Database storage and Excel export

❌ **What WON'T work:**
- scrape_weixin_post_ui.py (requires Windows desktop automation)
- Advanced Weixin metrics (like/share/comment counts)
- Login-protected content (Weibo may require login)

## Step 1: Install System Dependencies

In [None]:
!apt-get update
!apt-get install -y tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-eng
print('✅ System dependencies installed!')
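
The success message above prints even if `apt-get` failed, so it can help to confirm that Tesseract and its language data are actually present. The following is a minimal sketch, not part of the repository: `parse_lang_list` and `installed_languages` are hypothetical helper names, and the expected codes `chi_sim`, `chi_tra`, and `eng` are assumed to correspond to the packages installed in this step.

```python
import shutil
import subprocess


def parse_lang_list(output):
    """Extract language codes from `tesseract --list-langs` output."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if line and not line.lower().startswith("list of"):
            codes.append(line)
    return sorted(codes)


def installed_languages(binary="tesseract"):
    """Return installed language codes, or None if the binary is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Depending on the Tesseract version, the list may go to stdout or stderr.
    return parse_lang_list(result.stdout or result.stderr)


langs = installed_languages()
if langs is None:
    print("tesseract not found on PATH")
else:
    wanted = {"chi_sim", "chi_tra", "eng"}  # assumed codes for the packages above
    print("missing language data:", sorted(wanted - set(langs)) or "none")
```

If any code is reported missing, re-running the install cell is the first thing to try.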

## Step 2: Clone the Repository

In [None]:
import os

# Remove old repository if exists (prevents duplicate paths)
if os.path.exists('/content/scrape_chinese_social_media'):
    print('üóëÔ∏è Removing old repository.. .')
    ! rm -rf /content/scrape_chinese_social_media

# Clone fresh copy
!git clone https://github.com/pwklam/scrape_chinese_social_media. git

# Change to repository directory
%cd /content/scrape_chinese_social_media

# Verify location and files
print('\nüìÇ Current directory:')
!pwd
print('\nüìÑ Python files:')
!ls -la *.py
print('\n‚úÖ Repository cloned successfully!')

## Step 3: Install Python Dependencies

In [None]:
# Install ALL required Python packages from requirements.txt
!pip install -q -r requirements.txt
print('✅ All Python packages from requirements.txt installed!')

## Step 4: Install Playwright Browsers

In [None]:
!playwright install chromium
!playwright install-deps chromium
print('\n✅ Playwright browsers installed!')

## Step 5: Update Configuration for Colab Environment

In [None]:
# Fix config.py with ALL required attributes for Colab
config_content = '''import os

# Database configuration
db_name = "data.db"
table_name = "posts"

# Tesseract OCR configuration (Colab path)
tesseract_cmd = "/usr/bin/tesseract"

# Scraping configuration
DOUYIN_TIMEOUT = 30000
WEIBO_TIMEOUT = 30000
WEIXIN_TIMEOUT = 30000

# Comment scraping configuration
MAX_COMMENTS = 20
SCROLL_ATTEMPTS = 3

# Export Excel path
export_excel_path = "data.xlsx"
'''

with open('config.py', 'w', encoding='utf-8') as f:
    f.write(config_content)

print('✅ Configuration updated for Colab environment!')
print('\n📄 Config contents:')
!cat config.py
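
Because this cell overwrites `config.py` wholesale, a missing name would only surface later as an `AttributeError` inside the scraper. As a rough sanity check, the sketch below executes the config source in an isolated namespace and reports which expected names are absent. `missing_settings` is a hypothetical helper, and the `REQUIRED` tuple simply mirrors the names written in this step; whether the scraper reads exactly these names is an assumption.

```python
# Names copied from the config written above; assumed to be what the scraper reads.
REQUIRED = (
    "db_name", "table_name", "tesseract_cmd",
    "DOUYIN_TIMEOUT", "WEIBO_TIMEOUT", "WEIXIN_TIMEOUT",
    "MAX_COMMENTS", "SCROLL_ATTEMPTS", "export_excel_path",
)


def missing_settings(config_source, required=REQUIRED):
    """Return the required names that the config source does not define."""
    namespace = {}
    # Run the config in its own namespace so nothing leaks into the notebook.
    exec(compile(config_source, "config.py", "exec"), namespace)
    return [name for name in required if name not in namespace]


# Usage inside the notebook, after config.py has been written:
# with open("config.py", encoding="utf-8") as f:
#     print(missing_settings(f.read()) or "all settings present")
```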

## Step 6: Verify URLs from GitHub

The urls.txt file is automatically cloned from your GitHub repository in Step 2.

**To update URLs:**
1. Edit urls.txt on GitHub: https://github.com/pwklam/scrape_chinese_social_media/blob/main/urls.txt
2. Re-run Step 2 to clone the latest version

**Or** use the code below to add URLs temporarily (without committing to GitHub).

In [None]:
# Display URLs from GitHub repository
print('✅ Using urls.txt from GitHub repository')
print('\n📄 Current URLs in urls.txt:')
!cat urls.txt

print('\n' + '='*60)
print('💡 How to manage URLs:')
print('='*60)
print('1. Edit on GitHub: https://github.com/pwklam/scrape_chinese_social_media/blob/main/urls.txt')
print('2. Then re-run Step 2 to pull latest changes')
print('\n   OR temporarily add URLs below (uncomment the code):')
print('='*60)

# Uncomment the lines below to add more URLs temporarily
# additional_urls = '''\nhttps://weibo.com/YOUR_WEIBO_URL
# https://www.douyin.com/YOUR_DOUYIN_URL
# '''

# with open('urls.txt', 'a', encoding='utf-8') as f:
#     f.write(additional_urls)

# print('\n✅ Additional URLs added!')
# print('\n📄 Updated URLs:')
# !cat urls.txt
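
Before running the scraper, it can be useful to see which platform each entry in `urls.txt` belongs to, since the limitations above differ by platform. The sketch below is illustrative only: `classify_url`, `load_urls`, and the `PLATFORM_DOMAINS` mapping are hypothetical, the domain list is an assumption rather than the repository's own routing logic, and the file is assumed to hold one URL per line.

```python
from urllib.parse import urlparse

# Assumed domain-to-platform mapping; the repository may route URLs differently.
PLATFORM_DOMAINS = {
    "douyin": ("douyin.com",),
    "weibo": ("weibo.com", "weibo.cn"),
    "weixin": ("weixin.qq.com",),
}


def classify_url(url):
    """Return 'douyin', 'weibo', 'weixin', or None for an unrecognised host."""
    host = (urlparse(url.strip()).hostname or "").lower()
    for platform, domains in PLATFORM_DOMAINS.items():
        if any(host == d or host.endswith("." + d) for d in domains):
            return platform
    return None


def load_urls(text):
    """Split the file content into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Usage inside the notebook:
# with open("urls.txt", encoding="utf-8") as f:
#     for url in load_urls(f.read()):
#         print(classify_url(url) or "unknown", url)
```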

## Step 7: Run the Scraper

In [None]:
# Ensure we're in the right directory
%cd /content/scrape_chinese_social_media

print('='*60)
print('üöÄ Starting scraper...')
print('='*60 + '\n')

! python main.py

print('\n' + '='*60)
print('‚úÖ Scraping completed!')
print('='*60)

# Verify results
import os
import sqlite3

if os.path.exists('data.db'):
    conn = sqlite3.connect('data.db')
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT COUNT(*) FROM posts")
        count = cursor.fetchone()[0]
        print(f'\n📊 Total posts in database: {count}')
    except sqlite3.Error as e:
        print(f'\n⚠️ Database error: {e}')
    finally:
        # Always release the connection, even if the query fails
        conn.close()
else:
    print('\n❌ No database created - check errors above')
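
A row count alone does not show which fields were populated. As a sketch under the assumption that the table is named `posts` (as in the config from Step 5), the hypothetical helper `summarize_table` below lists the column names via SQLite's `PRAGMA table_info` alongside the row count, and returns `None` when the table does not exist.

```python
import sqlite3


def summarize_table(conn, table="posts"):
    """Return {'columns': [...], 'rows': n} for a table, or None if it is absent."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if not columns:
        return None
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return {"columns": columns, "rows": count}


# Usage inside the notebook:
# conn = sqlite3.connect("data.db")
# try:
#     print(summarize_table(conn))
# finally:
#     conn.close()
```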

## Step 8: Export Data to Excel

In [None]:
# Ensure we're in the right directory
%cd /content/scrape_chinese_social_media

# Verify export script exists
import os
if os.path.exists('export_excel_data.py'):
    print('✅ Found export_excel_data.py')
    !python export_excel_data.py

    if os.path.exists('data.xlsx'):
        print('\n✅ Data exported to data.xlsx!')
    else:
        print('\n⚠️ export_excel_data.py ran but no data.xlsx created')
else:
    print('❌ export_excel_data.py not found!')

## Step 9: Preview the Data

In [None]:
import pandas as pd
import json

try:
    df = pd.read_excel('data.xlsx')
    print(f'✅ Total posts scraped: {len(df)}')
    print('\n📊 First few rows:')
    display(df.head())
    print('\n📋 Columns:')
    print(df.columns.tolist())

    # Check for comments in the first post
    if 'comments' in df.columns and not df.empty:
        comments_data = df['comments'].iloc[0]
        # Empty Excel cells are read back as NaN, which is truthy, so test for it first
        if pd.notna(comments_data) and comments_data:
            try:
                comments = json.loads(comments_data)
                print(f'\n💬 Found {len(comments)} comments in first post')
                if comments:
                    print('\n📝 Sample comment:')
                    print(json.dumps(comments[0], indent=2, ensure_ascii=False))
            except (TypeError, ValueError):
                # json.JSONDecodeError is a subclass of ValueError
                print('\n⚠️ Comments field exists but could not parse')
        else:
            print('\n⚠️ No comments found in the scraped post')

except FileNotFoundError:
    print('⚠️ No data.xlsx file found.')
except Exception as e:
    print(f'❌ Error: {e}')
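
The preview above only inspects the first post. To check every row, a tolerant parser helps, because a cell may be empty, NaN, or malformed. The sketch below assumes, as the cell above does, that the `comments` column holds a JSON-encoded list; `parse_comments` is a hypothetical helper, not a function from the repository.

```python
import json
import math


def parse_comments(value):
    """Return the comment list stored in a cell, or [] if it is empty or unparseable."""
    if value is None:
        return []
    # Empty Excel cells come back from pandas as float NaN.
    if isinstance(value, float) and math.isnan(value):
        return []
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return []
    # Anything other than a JSON list is treated as "no comments".
    return parsed if isinstance(parsed, list) else []


# Usage inside the notebook, once df has been loaded:
# df["comment_count"] = df["comments"].map(parse_comments).map(len)
# display(df[["comment_count"]].describe())
```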

## Step 10: Download Results

In [None]:
from google.colab import files
import os

if os.path.exists('data.xlsx'):
    files.download('data.xlsx')
    print('✅ Downloaded: data.xlsx')
else:
    print('⚠️ data.xlsx not found')

if os.path.exists('data.db'):
    files.download('data.db')
    print('✅ Downloaded: data.db')
else:
    print('⚠️ data.db not found')

if os.path.exists('app.log'):
    files.download('app.log')
    print('✅ Downloaded: app.log')
else:
    print('⚠️ app.log not found')

## 📝 Troubleshooting

### Common Issues:

1. **Login Required**: Weibo may require login (won't work in headless Colab)
   - Solution: Run locally with `headless=False` set in the scraper code

2. **No Comments Scraped**: Comments section didn't load or requires login
   - Check if the post actually has comments on Weibo
   - Weibo may block automated access

3. **Timeouts**: Page took too long to load
   - Increase timeout values in config.py

4. **No Data**: Check app.log for errors
   - Database may be empty if scraping failed
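
For the timeout case in item 3, the values to raise are the `*_TIMEOUT` settings written to `config.py` in Step 5. As one possible approach, the sketch below rewrites those lines in the config text by a chosen factor. `scale_timeouts` is a hypothetical helper, and the assumption that the values are in milliseconds (so `30000` would mean 30 seconds) is based on Playwright's convention, not on anything the repository states.

```python
import re

# Matches lines such as "WEIBO_TIMEOUT = 30000"; spaces and tabs only, so the
# pattern never swallows the newline that ends the line.
TIMEOUT_LINE = re.compile(
    r"^(?P<name>[A-Z_]+_TIMEOUT)[ \t]*=[ \t]*(?P<value>\d+)[ \t]*$",
    re.MULTILINE,
)


def scale_timeouts(config_source, factor=2):
    """Return the config text with every *_TIMEOUT value multiplied by factor."""
    def _scale(match):
        new_value = int(int(match["value"]) * factor)
        return f"{match['name']} = {new_value}"

    return TIMEOUT_LINE.sub(_scale, config_source)


# Usage inside the notebook:
# with open("config.py", encoding="utf-8") as f:
#     updated = scale_timeouts(f.read(), factor=2)
# with open("config.py", "w", encoding="utf-8") as f:
#     f.write(updated)
```

Other settings such as `MAX_COMMENTS` are left untouched because they do not end in `_TIMEOUT`. Re-running Step 5 resets the values to their defaults.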

### Platform Notes:

- **Douyin**: Includes up to 20 comments
- **Weibo**: Includes post metrics and up to 20 comments
- **Weixin**: Basic info only (no advanced metrics in Colab)

### Success Indicators:

Look for these messages in Step 7 output:
```
🚀 Launching Playwright browser...
🔍 Page content loaded, starting data extraction...
💾 Inserting scraped data into the database...
🔎 Scraping comments for: [URL]
✅ Successfully scraped X comments
```

---

**Repository**: [pwklam/scrape_chinese_social_media](https://github.com/pwklam/scrape_chinese_social_media)