# Lianjia Rental Listings: Scraping and Analysis

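## Project Background
This project works with rental listings scraped from Lianjia (lianjia.com). Each listing records core attributes such as city, district, monthly rent, floor area, and layout. The aim is to uncover pricing patterns in the rental market and to build a predictive model that supports rental decisions.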

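## Project Goals
1. **Data collection**: build an automated crawler that scrapes listings for multiple cities, one city at a time.
2. **Data insight**: analyze how city, district, and transit access (subway) affect rent.
3. **Price prediction**: train machine-learning models that predict rent from listing attributes (area, layout, floor, and so on).
4. **Visualization**: present market characteristics through interactive charts.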

## Dataset Overview
* **Source**: Lianjia (lianjia.com)
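* **Cities covered**: the twelve cities configured in `CITIES_MAP`: Beijing, Shanghai, Guangzhou, Shenzhen, Tianjin, Chengdu, Nanjing, Hangzhou, Qingdao, Shenyang, Xiamen, and Wuhan.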

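### Core Fields
| Field | Type | Meaning | Role in modeling |
| :--- | :--- | :--- | :--- |
| `city` | Categorical | City name | Macro-level location feature |
| `district` | Categorical | Administrative district | Finer-grained location feature |
| `price_avg` | Numerical | Monthly rent (CNY/month) | **Target variable** |
| `area_sqm` | Numerical | Floor area (㎡) | Core numeric feature |
| `bedrooms` | Numerical | Number of bedrooms | Layout feature |
| `is_subway` | Categorical | Whether the listing is near a subway (derived from the `近地铁` tag) | Transit-access feature |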

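## Stage 1: Environment Setup and Imports

### 1.1 Core Dependencies
The project relies on the following Python data-science and scraping libraries:
* **Data processing**: `pandas` (cleaning structured data), `numpy` (numerical computation).
* **Web scraping**: `requests` (HTTP requests), `BeautifulSoup` (HTML parsing), `fake_useragent` (randomized User-Agent strings).
* **Visualization**: `plotly` (interactive charts).
* **Machine learning**: `sklearn` (scikit-learn: random forest, evaluation metrics), `catboost` (gradient-boosted trees).

### 1.2 Global Configuration
* **Paths**: the `data/` and `images/` directories are created automatically for persistent storage.
* **Crawl policy**: a random delay between `MIN_DELAY` and `MAX_DELAY` seconds separates page requests, and a circuit breaker stops a city after `MAX_FAILURES` consecutive empty pages, to reduce the chance of being blocked.
* **City mapping**: the `CITIES_MAP` dictionary maps each city's Lianjia subdomain code (for example `bj`) to its Chinese name.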
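
The crawl policy above can be sketched in isolation. This is a minimal, standard-library illustration rather than the notebook's spider: `crawl_pages`, `fake_fetch`, and `recording_fetch` are hypothetical names, and the delays are set to zero so the sketch finishes instantly (the notebook configures 6 to 8 seconds).

```python
import random
import time
from typing import Callable, List

MIN_DELAY = 0.0   # seconds; the notebook uses 6
MAX_DELAY = 0.0   # seconds; the notebook uses 8
MAX_FAILURES = 3  # consecutive empty pages tolerated before stopping


def crawl_pages(fetch_page: Callable[[int], List[dict]], start: int, end: int) -> List[dict]:
    """Collect rows page by page, stopping after MAX_FAILURES empty pages in a row."""
    rows: List[dict] = []
    empty_streak = 0
    for page in range(start, end + 1):
        page_rows = fetch_page(page)
        if page_rows:
            rows.extend(page_rows)
            empty_streak = 0  # any non-empty page resets the counter
        else:
            empty_streak += 1
        if empty_streak >= MAX_FAILURES:
            break  # circuit breaker trips
        time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
    return rows


def fake_fetch(page: int) -> List[dict]:
    """Hypothetical stand-in for an HTTP fetch: only pages 1-4 return data."""
    return [{"page": page}] if page <= 4 else []


visited: List[int] = []


def recording_fetch(page: int) -> List[dict]:
    """Wrap fake_fetch to record which pages were requested."""
    visited.append(page)
    return fake_fetch(page)


result = crawl_pages(recording_fetch, 1, 50)
print(len(result), visited)  # 4 rows; pages 1..7 requested, then the breaker stops the loop
```

Resetting the counter on every non-empty page means only a run of consecutive empty pages ends the crawl, so a single transient failure does not cut a city short.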

In [29]:
import logging
import random
import re
import time
from datetime import datetime, timedelta
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import requests
from bs4 import BeautifulSoup, Tag
from catboost import CatBoostRegressor, Pool
from fake_useragent import UserAgent
from plotly.subplots import make_subplots
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

logging.basicConfig(
    level=logging.INFO,
    format='[%(asctime)s] %(message)s',
    datefmt='%H:%M:%S'
)

logger = logging.getLogger(__name__)

In [30]:
PROJECT_ROOT = Path('.').resolve()
DATA_DIR = PROJECT_ROOT / "data"
IMAGE_DIR = PROJECT_ROOT / "images"

DATA_DIR.mkdir(
    parents=True,
    exist_ok=True
)

IMAGE_DIR.mkdir(
    parents=True,
    exist_ok=True
)

# A logged-in session cookie is a credential, so it should not be committed with the notebook.
# It is read from an environment variable instead; the name LIANJIA_COOKIE is an assumed convention.
import os

COOKIE = os.environ.get("LIANJIA_COOKIE", "")

START_PAGE = 1
END_PAGE = 50
MIN_DELAY = 6
MAX_DELAY = 8
MAX_FAILURES = 3

CITIES_MAP = {
    'bj': '北京',
    'sh': '上海',
    'gz': '广州',
    'sz': '深圳',
    'tj': '天津',
    'cd': '成都',
    "nj": '南京',
    'hz': '杭州',
    'qd': '青岛',
    'sy': '沈阳',
    'xm': '厦门',
    'wh': '武汉',
}

ANALYSIS_CITY = '上海'

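## Helper Module: Data Persistence

### Utility class: `CsvUtil`
File reads and writes go through a single I/O helper class:
1. **`save_data`**: saves scraped or cleaned data as a CSV file in `utf-8-sig` encoding (UTF-8 with a byte-order mark), which prevents garbled Chinese text when the file is opened in tools such as Excel.
2. **`load_data`**: reads a single file or, with `match_multiple=True`, reads and concatenates every file matching `{prefix}_*.csv` (for example the per-city `raw_*.csv` files).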
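
The effect of the `utf-8-sig` encoding can be checked with the standard library alone. The sketch below is illustrative and does not use `CsvUtil`; the sample row and the file name `raw_demo.csv` are made up for the demonstration.

```python
import csv
import tempfile
from pathlib import Path

rows = [{"city": "上海", "price_rmb": "3200"}]  # hypothetical sample row

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "raw_demo.csv"

    # Write with a byte-order mark, the same encoding save_data passes to to_csv
    with path.open("w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=["city", "price_rmb"])
        writer.writeheader()
        writer.writerows(rows)

    # The file starts with the three BOM bytes EF BB BF
    has_bom = path.read_bytes().startswith(b"\xef\xbb\xbf")

    # Reading with 'utf-8-sig' strips the BOM again
    with path.open(newline="", encoding="utf-8-sig") as fh:
        loaded = list(csv.DictReader(fh))

    # Reading with plain 'utf-8' leaves U+FEFF glued to the first header name
    with path.open(newline="", encoding="utf-8") as fh:
        first_header_utf8 = next(csv.reader(fh))[0]

print(has_bom, loaded, repr(first_header_utf8))
```

This is why `load_data` reads with the same `utf-8-sig` encoding that `save_data` writes: the first column name stays `city` rather than `\ufeffcity`.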

In [31]:
class CsvUtil:
    @staticmethod
    def save_data(
            dataset: Union[pd.DataFrame, List[Dict]],
            prefix: str,
            identifier: Optional[str] = None
    ) -> Path:
        df = dataset if isinstance(dataset, pd.DataFrame) else pd.DataFrame(dataset)

        file_stem = f"{prefix}_{identifier}" if identifier else prefix
        output_path = DATA_DIR / f"{file_stem}.csv"

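        # df is never None here. Skip the write when there are no rows, so a failed
        # crawl does not leave an empty CSV that marks the city as already scraped.
        if not df.empty:
            df.to_csv(output_path, index=False, encoding='utf-8-sig')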

        return output_path

    @staticmethod
    def load_data(prefix: str, match_multiple: bool = False) -> Optional[pd.DataFrame]:
        search_pattern = f"{prefix}_*.csv" if match_multiple else f"{prefix}.csv"
        found_files = list(DATA_DIR.glob(search_pattern))

        if not found_files:
            return None

        return pd.concat(
            [pd.read_csv(file_path, encoding='utf-8-sig') for file_path in found_files],
            ignore_index=True
        )

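## Stage 2: Data Collection

### Crawler class: `LianJiaSpider`
This module scrapes the raw listings from Lianjia. Its main logic:

1. **City loop**: iterates over `CITIES_MAP` and scrapes each city's rental list pages in turn.
2. **Pagination**: walks the configured page range (`START_PAGE` to `END_PAGE`) and stops early after `MAX_FAILURES` consecutive empty pages.
3. **Parsing**:
    * Uses `BeautifulSoup` to locate each listing card.
    * Uses **regular expressions** to pull layout, area, floor, and orientation values out of the cards' free-form text.
4. **Anti-blocking measures**:
    * **Random User-Agent**: a random UA string from `fake_useragent` is set when the spider's session is created and reused for that session's requests.
    * **Random delay**: the crawler sleeps for a random interval between page requests to mimic human browsing.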
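
To make the regex step concrete, the sketch below applies the same five patterns that `LianJiaSpider.REGEX_PATTERNS` defines to one description string. The sample text is hypothetical and only assumed to resemble a card's description. For brevity every pattern runs against the same string, whereas the spider applies the two floor patterns to a separate hidden `span`.

```python
import re

# Same patterns as LianJiaSpider.REGEX_PATTERNS
PATTERNS = {
    'layout_structure': re.compile(r'(\d+)室(\d+)厅(\d+)卫'),
    'area_size': re.compile(r'(\d+\.?\d*)㎡'),
    'orientation': re.compile(r'/\s*([\u4e00-\u9fa5\s]+)\s*/'),
    'floor_position': re.compile(r'([\u4e00-\u9fa5]+)楼层'),
    'total_stories': re.compile(r'（(\d+)层）'),
}

# Hypothetical description text, not copied from a real page
sample = "杨浦-控江路-杨家浜小区/36.00㎡/南/1室1厅1卫/低楼层（3层）"


def extract(text: str) -> dict:
    """Return the captured groups for each pattern, or None when it does not match."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.groups() if match else None
    return found


fields = extract(sample)
print(fields)
# layout_structure -> ('1', '1', '1'), area_size -> ('36.00',),
# orientation -> ('南',), floor_position -> ('低',), total_stories -> ('3',)
```

The orientation pattern only matches a slash-delimited segment made of Chinese characters and whitespace, which is why it skips `36.00㎡` and lands on `南`.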

In [32]:
class LianJiaSpider:
    REGEX_PATTERNS = {
        'layout_structure': re.compile(r'(\d+)室(\d+)厅(\d+)卫'),
        'area_size': re.compile(r'(\d+\.?\d*)㎡'),
        'orientation': re.compile(r'/\s*([\u4e00-\u9fa5\s]+)\s*/'),
        'floor_position': re.compile(r'([\u4e00-\u9fa5]+)楼层'),
        'total_stories': re.compile(r'（(\d+)层）')
    }

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': UserAgent().random,
            'Cookie': COOKIE
        })

    def run(self, city_map: Dict[str, str]):
        for city_code, city_name in city_map.items():
            listings = self.crawl_city(city_name, city_code)

            CsvUtil.save_data(
                listings,
                prefix='raw',
                identifier=city_name
            )

    def crawl_city(self, city_name: str, city_code: str) -> List[Dict]:
        logger.info(f">>> 启动抓取: {city_name} ({city_code})")
        all_listings = []
        empty_page_count = 0

        for page_num in range(START_PAGE, END_PAGE + 1):
            listings_on_page = self.fetch_page_listings(city_code, page_num, city_name)

            if listings_on_page:
                all_listings.extend(listings_on_page)
                empty_page_count = 0
                logger.info(f"    √ 已抓取 {len(all_listings)} 条 (Page: {page_num})")
            else:
                empty_page_count += 1
                logger.info(f"    × Page {page_num} 无数据")

            if empty_page_count >= MAX_FAILURES:
                logger.error(f"连续空页触发熔断: {city_name}")
                break

            time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))

        return all_listings

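    def fetch_page_listings(self, city_code: str, page_num: int, city_name: str) -> List[Dict]:
        url = f"https://{city_code}.lianjia.com/zufang/pg{page_num}/"

        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            # Treat a failed request as an empty page so that the MAX_FAILURES
            # counter in crawl_city decides when to stop, instead of crashing
            # and losing the listings already collected for this city.
            logger.warning(f"    ! Page {page_num} request failed: {exc}")
            return []

        soup = BeautifulSoup(response.content, 'html.parser')
        listing_cards = soup.select('div.content__list--item')

        return [
            self.parse_listing_card(card, city_name) for card in listing_cards
        ]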

    def parse_listing_card(self, card_element: Tag, city_name: str) -> Dict[str, Any]:
        def _text(selector: str) -> str:
            # select_one returns None when a card lacks the node; fall back to ''
            node = card_element.select_one(selector)
            return node.get_text(strip=True) if node else ''

        house_details_text = _text('p.content__list--item--des')
        floor_info_text = _text('p.content__list--item--des span.hide')
        listing_title = _text('p.content__list--item--title')

        location_anchors = card_element.select('p.content__list--item--des a')
        location_names = [a.get_text(strip=True) for a in location_anchors[:3]]
        location_names += [''] * (3 - len(location_names))

        layout_match = self.REGEX_PATTERNS['layout_structure'].search(house_details_text)
        area_match = self.REGEX_PATTERNS['area_size'].search(house_details_text)
        orientation_match = self.REGEX_PATTERNS['orientation'].search(house_details_text)

        floor_pos_match = self.REGEX_PATTERNS['floor_position'].search(floor_info_text)
        total_floor_match = self.REGEX_PATTERNS['total_stories'].search(floor_info_text)

        feature_tags = [
            tag.get_text(strip=True) for tag in card_element.select('p.content__list--item--bottom i')
        ]

        brand_span = card_element.select_one('p.content__list--item--brand span.brand')
        time_span = card_element.select_one('p.content__list--item--brand span.content__list--item--time')
        price_em = card_element.select_one('span.content__list--item-price em')

        return {
            'city': city_name,
            'title': listing_title,
            'rent_type': '整租' if '整租' in listing_title else ('合租' if '合租' in listing_title else '独栋'),
            'district': location_names[0],
            'sub_district': location_names[1],
            'community': location_names[2],
            'area_sqm': float(area_match.group(1)) if area_match else 0.0,
            'bedrooms': int(layout_match.group(1)) if layout_match else 0,
            'living_rooms': int(layout_match.group(2)) if layout_match else 0,
            'bathrooms': int(layout_match.group(3)) if layout_match else 0,
            'orientation': orientation_match.group(1).strip() if orientation_match else '',
            'floor_level': floor_pos_match.group(1) if floor_pos_match else '',
            'total_floors': int(total_floor_match.group(1)) if total_floor_match else 0,
            'tags': '|'.join(feature_tags),
            'platform': brand_span.get_text(strip=True) if brand_span else "",
            'update_time': time_span.get_text(strip=True) if time_span else "",
            'price_rmb': price_em.get_text(strip=True) if price_em else "0"
        }

In [33]:
def get_pending_cities() -> dict:
    raw_files = list(DATA_DIR.glob("raw_*.csv"))
    processed_cities = {
        p.stem.split('_')[1] for p in raw_files
    }
    return {
        code: name for code, name in CITIES_MAP.items()
        if name not in processed_cities
    }


cities_to_scrape = get_pending_cities()

if cities_to_scrape:
    print(f"待抓取列表: {list(cities_to_scrape.values())}")
    LianJiaSpider().run(cities_to_scrape)
else:
    print("所有目标城市数据已存在，跳过。")

所有目标城市数据已存在，跳过。


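## Stage 3: Data Preprocessing

### Cleaning class: `DataCleaner`
Raw data usually contains noise, missing values, and non-standard formats. This stage cleans it with a pipeline of steps:

1. **Text normalization (`clean_text`)**: 
    * Strips leading and trailing whitespace.
    * Fills missing values uniformly (`NaN` -> `'未知'`, "unknown").
2. **Numeric conversion (`clean_numeric`)**: 
    * Converts numeric text to numbers with `pd.to_numeric`; values that cannot be parsed become 0.
3. **Price parsing (`parse_prices`)**: 
    * Extracts the numeric value from strings such as "3500元/月" (`3500.0`). If a string contains several numbers, only the first is kept; rows without any number are dropped.
4. **Date standardization (`standardize_dates`)**: 
    * Converts relative timestamps (for example "3天前维护", "maintained 3 days ago") into absolute dates (`YYYY-MM-DD`), counted back from the time the cleaner runs. A month is approximated as 30 days and a year as 365.
5. **Outlier handling (`remove_outliers`)**: 
    * Drops listings whose `price_avg` falls below the 2nd or above the 98th percentile (the **bottom 2%** and **top 2%**), with cut-offs computed over all cities pooled together, so that extreme prices do not distort the analysis.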
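
Two of the steps above, price parsing and relative-date conversion, can be sketched with the standard library. The regexes and the day offsets are the ones `DataCleaner` defines; the function names `parse_price` and `to_absolute_date`, the fixed reference date, and the sample strings are assumptions made for this illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

PRICE = re.compile(r'(\d+\.?\d*)')
REL_DATE = re.compile(r'(\d+)\s*(天|周|个月|月|年)前')
DAYS_PER_UNIT = {'天': 1, '周': 7, '月': 30, '年': 365}


def parse_price(text: str) -> Optional[float]:
    """Return the first number in the string, or None when there is none."""
    match = PRICE.search(text)
    return float(match.group(1)) if match else None


def to_absolute_date(text: str, reference: datetime) -> str:
    """Turn '今天...' or 'N<unit>前...' into YYYY-MM-DD; pass other text through."""
    if '今天' in text:
        return reference.strftime('%Y-%m-%d')
    match = REL_DATE.search(text)
    if match:
        count, unit = match.groups()
        days = int(count) * DAYS_PER_UNIT[unit.replace('个', '')]
        return (reference - timedelta(days=days)).strftime('%Y-%m-%d')
    return text


reference = datetime(2026, 1, 8)  # fixed so the results are reproducible

print(parse_price("3500元/月"))                     # 3500.0
print(parse_price("4670-5200"))                     # 4670.0 (only the first number is kept)
print(to_absolute_date("今天维护", reference))       # 2026-01-08
print(to_absolute_date("3天前维护", reference))      # 2026-01-05
print(to_absolute_date("1个月前维护", reference))    # 2025-12-09 (a month counts as 30 days)
```

Passing the reference date in explicitly, rather than calling `datetime.now()` inside the function, keeps the conversion deterministic and makes it possible to anchor dates to the scrape time instead of the cleaning time.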

In [34]:
class DataCleaner:
    TEXT_COLUMNS = [
        'city', 'title', 'rent_type', 'district', 'sub_district',
        'community', 'orientation', 'floor_level', 'tags',
        'platform', 'update_time', 'price_rmb'
    ]
    NUMERIC_COLUMNS = [
        'area_sqm', 'bedrooms', 'living_rooms', 'bathrooms', 'total_floors'
    ]

    REGEX_PRICE = re.compile(r'(\d+\.?\d*)')
    REGEX_REL_DATE = re.compile(r'(\d+)\s*(天|周|个月|月|年)前')

    DATE_OFFSET_MAP = {
        '天': 1, '周': 7, '月': 30, '年': 365
    }

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.reference_date = datetime.now()

    @classmethod
    def run(cls) -> pd.DataFrame:
        raw_df = CsvUtil.load_data("raw", match_multiple=True)

        if raw_df is None or raw_df.empty:
            return pd.DataFrame()

        cleaned_df = cls(raw_df).process_pipeline()

        CsvUtil.save_data(cleaned_df, prefix='cleaned')

        return cleaned_df

    def process_pipeline(self) -> pd.DataFrame:
        self.df = (
            self.df
            .pipe(self.clean_text)
            .pipe(self.clean_numeric)
            .pipe(self.parse_prices)
            .pipe(self.standardize_dates)
            .pipe(self.remove_outliers)
        )

        return self.df

    def clean_text(self, df: pd.DataFrame) -> pd.DataFrame:
        cols = df.columns.intersection(self.TEXT_COLUMNS)

        df[cols] = df[cols].fillna('未知')

        df[cols] = df[cols].astype(str).apply(lambda x: x.str.strip())

        df[cols] = df[cols].replace(
            {'nan': '未知', 'None': '未知', '': '未知'}
        )

        return df

    def clean_numeric(self, df: pd.DataFrame) -> pd.DataFrame:
        cols = df.columns.intersection(self.NUMERIC_COLUMNS)

        for col in cols:
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)

        return df

    def parse_prices(self, df: pd.DataFrame) -> pd.DataFrame:
        prices = df['price_rmb'].astype(str).str.extract(self.REGEX_PRICE)[0]

        df['price_avg'] = pd.to_numeric(prices, errors='coerce')

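        # Return a copy so the next pipeline step can add columns without
        # writing into a slice of the original frame.
        return df.dropna(subset=['price_avg']).copy()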

    def standardize_dates(self, df: pd.DataFrame) -> pd.DataFrame:
        today_str = self.reference_date.strftime('%Y-%m-%d')

        def _parse(val: str):
            val = str(val)

            if '今天' in val:
                return today_str

            match = self.REGEX_REL_DATE.search(val)

            if match:
                num, unit = match.groups()
                days = int(num) * self.DATE_OFFSET_MAP[unit.replace('个', '')]

                return (self.reference_date - timedelta(days=days)).strftime('%Y-%m-%d')

            return val

        df['clean_date'] = df['update_time'].apply(_parse)

        return df

    def remove_outliers(self, df: pd.DataFrame) -> pd.DataFrame:
        low = df['price_avg'].quantile(0.02)
        high = df['price_avg'].quantile(0.98)

        return df[df['price_avg'].between(low, high)]

In [35]:
df_clean = DataCleaner.run()

display(df_clean.head())

Unnamed: 0,city,title,rent_type,district,sub_district,community,area_sqm,bedrooms,living_rooms,bathrooms,orientation,floor_level,total_floors,tags,platform,update_time,price_rmb,price_avg,clean_date
0,上海,整租·杨家浜小区 1室1厅 南,整租,杨浦,控江路,杨家浜小区,36.0,1,1,1,南,低,3,自营|新上|近地铁|精装|押一付一|随时看房|首次出租,贝壳优选,今天维护,3200,3200,2026-01-08
1,上海,整租·金杨六街坊 1室1厅 南,整租,浦东,金杨,金杨六街坊,41.07,1,1,1,南,中,6,官方核验|近地铁|随时看房,链家,1天前维护,3400,3400,2026-01-07
2,上海,整租·东方明珠大宁公寓 2室2厅 东南/南,整租,静安,大宁,东方明珠大宁公寓,95.18,2,2,2,东南 南,低,25,自营|新上|押一付一|双卫生间|随时看房|首次出租,贝壳优选,今天维护,7500,7500,2026-01-08
3,上海,独栋·方隅服务公寓 上海会展旗舰店 【康复医院陪护】近华山西院康复医院 可做饭 陪护优选拎包...,独栋,未知,未知,未知,50.0,1,1,1,未知,未知,0,独栋公寓|月租|精装|开放厨房|押一付一,方隅服务公寓,1天前维护,4670,4670,2026-01-07
4,上海,整租·绿地海域笙晖(公寓) 3室1厅 南,整租,宝山,杨行,绿地海域笙晖(公寓),96.32,3,1,1,南,低,16,自营|新上|精装|押一付一|随时看房|首次出租,贝壳优选,今天维护,6100,6100,2026-01-08


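## Stage 4: Exploratory Data Analysis

### Visualization class: `RentalDataVisualizer`
Interactive `Plotly` charts examine the market from the city level down to individual districts:

1. **City overview (`plot_city_comparison`)**:
    * **Dual-axis chart**: bars show the mean monthly rent and a line shows the mean rent per square metre, giving a direct comparison of housing costs across cities.
2. **Transit premium (`plot_subway_premium`)**:
    * Compares the mean rent of listings tagged as near a subway ("地铁房") with all other listings ("普通房") to quantify the premium for transit access.
3. **District deep dive (`plot_city_analysis`)**:
    * **Box plot**: shows the price range and median for each district of one city.
    * **Ranking**: lists the ten districts with the highest mean rent.
    * **Price structure**: a pie chart shows the share of listings in each price band (for example 2k-4k).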
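
The subway comparison boils down to a grouped mean. The sketch below reproduces that logic with the standard library on four made-up listings and additionally expresses the gap as a percentage, which the notebook's chart does not compute. The rows and the `premium_pct` name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical listings: 'tags' mimics the pipe-separated tag string
listings = [
    {"tags": "近地铁|精装", "price_avg": 6000.0},
    {"tags": "近地铁", "price_avg": 5000.0},
    {"tags": "精装", "price_avg": 4000.0},
    {"tags": "", "price_avg": 5000.0},
]

# Label each listing the same way the visualizer does: by the '近地铁' tag
groups = defaultdict(list)
for row in listings:
    label = "地铁房" if "近地铁" in row["tags"] else "普通房"
    groups[label].append(row["price_avg"])

avg = {label: mean(prices) for label, prices in groups.items()}

# Relative premium of subway listings over the rest, in percent
premium_pct = (avg["地铁房"] / avg["普通房"] - 1) * 100

print(avg)                       # {'地铁房': 5500.0, '普通房': 4500.0}
print(round(premium_pct, 1))     # 22.2
```

A raw difference in means does not control for area or district, so it is best read as a descriptive gap rather than the causal effect of subway access.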

In [None]:
class RentalDataVisualizer:
    LABELS_MAP = {
        'city': '城市',
        'price_avg': '月租均价（元）',
        'price_per_sqm': '每平米单价（元／㎡）',
        'is_subway': '房源类型',
        'district': '行政区',
        'count': '房源数量',
        'range': '价格区间'
    }

    COLOR_SEQ = [
        '#636EFA', '#EF553B', '#00CC96',
        '#AB63FA', '#FFA15A', '#19D3F3'
    ]

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

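        # Unparseable areas are stored as 0 upstream; treat them as missing so the
        # per-square-metre price becomes NaN (skipped by mean) instead of inf.
        self.df['price_per_sqm'] = self.df['price_avg'] / self.df['area_sqm'].replace(0, np.nan)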

        self.df['is_subway'] = np.where(
            self.df['tags'].str.contains('近地铁', na=False),
            '地铁房',
            '普通房'
        )

    def _apply_style(self, fig: go.Figure, title: str) -> go.Figure:
        fig.update_layout(
            title=dict(
                text=title,
                x=0.5,
                y=0.95,
                xanchor='center',
                yanchor='top'
            ),
            template='plotly_white',
            font=dict(
                family='Microsoft YaHei',
                size=12
            ),
            margin=dict(
                l=40,
                r=40,
                t=80,
                b=40
            ),
            showlegend=True
        )
        return fig

    def plot_city_comparison(self) -> go.Figure:
        metrics = (
            self.df
            .groupby('city')[['price_avg', 'price_per_sqm']]
            .mean()
            .reset_index()
        )

        fig = make_subplots(
            rows=1,
            cols=1,
            specs=[[{'secondary_y': True}]]
        )

        fig.add_trace(
            go.Bar(
                x=metrics['city'],
                y=metrics['price_avg'],
                name='月租均价',
                marker=dict(
                    color=self.COLOR_SEQ[0]
                )
            ),
            secondary_y=False
        )

        fig.add_trace(
            go.Scatter(
                x=metrics['city'],
                y=metrics['price_per_sqm'],
                name='每平米单价',
                mode='lines+markers',
                line=dict(
                    color=self.COLOR_SEQ[1],
                    width=3
                ),
                marker=dict(
                    size=8
                )
            ),
            secondary_y=True
        )

        fig.update_yaxes(
            title_text='月租均价（元）',
            secondary_y=False
        )

        fig.update_yaxes(
            title_text='每平米单价（元／㎡）',
            secondary_y=True,
            showgrid=False
        )

        return self._apply_style(fig, '各城市租金水平对比')

    def plot_subway_premium(self) -> go.Figure:
        metrics = (
            self.df
            .groupby('is_subway')['price_avg']
            .mean()
            .reset_index()
        )

        fig = go.Figure()

        fig.add_trace(
            go.Bar(
                x=metrics['is_subway'],
                y=metrics['price_avg'],
                name='月租均价',
                marker=dict(
                    color=[
                        self.COLOR_SEQ[0],
                        self.COLOR_SEQ[1]
                    ]
                ),
                text=metrics['price_avg'].round(0),
                textposition='outside'
            )
        )

        fig.update_xaxes(
            title_text=self.LABELS_MAP['is_subway']
        )

        fig.update_yaxes(
            title_text=self.LABELS_MAP['price_avg']
        )

        return self._apply_style(fig, '地铁房 vs 普通房：月租均价对比')

    def plot_city_analysis(self, city: str) -> dict:
        city_data = self.df[self.df['city'] == city]

        fig_box = go.Figure()

        for i, district in enumerate(city_data['district'].unique()):
            fig_box.add_trace(
                go.Box(
                    y=city_data.loc[
                        city_data['district'] == district,
                        'price_avg'
                    ],
                    name=district,
                    marker=dict(
                        color=self.COLOR_SEQ[i % len(self.COLOR_SEQ)]
                    ),
                    boxmean=True
                )
            )

        fig_box.update_xaxes(
            title_text=self.LABELS_MAP['district']
        )

        fig_box.update_yaxes(
            title_text=self.LABELS_MAP['price_avg']
        )

        top10 = (
            city_data
            .groupby('district')['price_avg']
            .mean()
            .nlargest(10)
            .reset_index()
        )

        fig_bar = go.Figure()

        fig_bar.add_trace(
            go.Bar(
                x=top10['price_avg'],
                y=top10['district'],
                orientation='h',
                marker=dict(
                    color=top10['price_avg'],
                    colorscale='Blues'
                )
            )
        )

        fig_bar.update_xaxes(
            title_text=self.LABELS_MAP['price_avg']
        )

        fig_bar.update_yaxes(
            title_text=self.LABELS_MAP['district'],
            autorange='reversed'
        )

        bins = [0, 2000, 4000, 6000, 8000, 10000, float('inf')]
        labels = ['2k以下', '2k-4k', '4k-6k', '6k-8k', '8k-1w', '1w以上']

        counts = (
            pd.cut(city_data['price_avg'], bins=bins, labels=labels)
            .value_counts()
            .reset_index()
        )
        counts.columns = ['range', 'count']

        fig_pie = go.Figure()

        fig_pie.add_trace(
            go.Pie(
                labels=counts['range'],
                values=counts['count'],
                hole=0.4,
                textinfo='percent+label',
                marker=dict(
                    colors=self.COLOR_SEQ
                )
            )
        )

        return {
            'box': self._apply_style(fig_box, f'{city}－各区域租金分布'),
            'bar': self._apply_style(fig_bar, f'{city}－租金最贵行政区 Top 10'),
            'pie': self._apply_style(fig_pie, f'{city}－房源价格区间占比')
        }


In [None]:
viz = RentalDataVisualizer(df_clean)

fig1 = viz.plot_city_comparison()
fig1.show()

In [None]:
fig2 = viz.plot_subway_premium()
fig2.show()

In [None]:
city_charts = viz.plot_city_analysis(ANALYSIS_CITY)

city_charts['box'].show()

In [None]:
city_charts['bar'].show()

In [None]:
city_charts['pie'].show()

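## Stage 5: Price Prediction Modeling

### Modeling class: `ModelPipeline`
An end-to-end machine-learning workflow that predicts rent and identifies the features that matter most.

### 5.1 Modeling Workflow
1. **Feature engineering**:
    * **Numeric features**: area, bedrooms, living rooms, bathrooms, total floors, and the 0/1 `is_subway` flag.
    * **Categorical features**: city, rent type, district, sub-district, orientation, and floor level. CatBoost consumes them natively; the random forest receives them through `OrdinalEncoder`.
2. **Model training**:
    * **CatBoostRegressor**: handles categorical features natively and is robust.
    * **RandomForestRegressor**: a classic ensemble model used as the baseline.
3. **Model evaluation**:
    * Performance is judged with **RMSE** (root mean squared error), **MAE** (mean absolute error), and **R²** (coefficient of determination).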
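
For reference, the three metrics can be computed by hand. This sketch uses plain Python on four made-up rent values; `regression_metrics` is a hypothetical helper, while the notebook itself relies on the scikit-learn functions imported in Stage 1.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute RMSE, MAE and R² from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1 - ss_res / ss_tot,
    }


# Hypothetical true and predicted monthly rents (CNY)
y_true = [3000.0, 4000.0, 5000.0, 6000.0]
y_pred = [3200.0, 3900.0, 5100.0, 5600.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)  # RMSE ≈ 234.52, MAE = 200.0, R2 ≈ 0.956
```

RMSE and MAE are in the same unit as the rent, with RMSE penalizing the single 400-yuan miss more heavily, while R² is unitless and compares the model against always predicting the mean.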

### 5.2 Interpretability
* **Feature importance**: extracts the features the model ranks as most important (such as area or district) to explain the main drivers of rent.

In [None]:
class ModelPipeline:
    FEATURE_DISPLAY_MAP = {
        'city': '城市',
        'rent_type': '租赁方式',
        'district': '行政区',
        'sub_district': '商圈',
        'orientation': '朝向',
        'floor_level': '楼层等级',
        'area_sqm': '面积（㎡）',
        'bedrooms': '室数',
        'living_rooms': '厅数',
        'bathrooms': '卫数',
        'total_floors': '总楼层',
        'is_subway': '是否近地铁'
    }

    COLOR_SEQ = [
        '#636EFA',
        '#EF553B',
        '#00CC96',
        '#AB63FA',
        '#FFA15A'
    ]

    def __init__(self, df: pd.DataFrame, target_col: str, cat_features: Optional[List[str]] = None):
        self.df = df
        self.target_col = target_col
        self.cat_features = cat_features if cat_features else []
        self.performance_metrics: List[Dict] = []

        self.X = self.df.drop(columns=[self.target_col])
        self.y = self.df[self.target_col]
        self.feature_names = self.X.columns.tolist()

    def run(self):
        train_x, test_x, train_y, test_y = train_test_split(
            self.X,
            self.y,
            test_size=0.2,
            random_state=42
        )

        print('Training CatBoost...')
        cat_model = self._train_catboost(train_x, train_y, test_x, test_y)
        self._evaluate(cat_model, test_x, test_y, 'CatBoost')

        print('Training RandomForest...')
        rf_model, test_x_enc = self._train_random_forest(train_x, train_y, test_x)
        self._evaluate(rf_model, test_x_enc, test_y, 'RandomForest')

        self._plot_metrics_comparison()
        self._plot_importance(cat_model)

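    def _train_catboost(self, train_x, train_y, test_x, test_y):
        # Early stopping needs a validation set. It is carved out of the training
        # data so the test set stays unseen until evaluation; test_x and test_y
        # remain in the signature for the caller but are deliberately unused here.
        fit_x, val_x, fit_y, val_y = train_test_split(
            train_x,
            train_y,
            test_size=0.2,
            random_state=42
        )

        model = CatBoostRegressor(
            iterations=1000,
            learning_rate=0.05,
            depth=5,
            loss_function='RMSE',
            verbose=0,
            early_stopping_rounds=50,
            allow_writing_files=False
        )

        model.fit(
            Pool(fit_x, fit_y, cat_features=self.cat_features),
            eval_set=Pool(val_x, val_y, cat_features=self.cat_features),
            use_best_model=True
        )

        return model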

    def _train_random_forest(self, train_x, train_y, test_x):
        encoder = OrdinalEncoder(
            handle_unknown='use_encoded_value',
            unknown_value=-1
        )

        if self.cat_features:
            train_x_enc = train_x.copy()
            test_x_enc = test_x.copy()

            train_x_enc[self.cat_features] = encoder.fit_transform(train_x[self.cat_features])
            test_x_enc[self.cat_features] = encoder.transform(test_x[self.cat_features])
        else:
            train_x_enc = train_x
            test_x_enc = test_x

        model = RandomForestRegressor(
            n_estimators=100,
            max_depth=20,
            n_jobs=-1,
            random_state=42
        )

        model.fit(train_x_enc, train_y)

        return model, test_x_enc

    def _evaluate(self, model, features, target, model_name: str):
        pred = model.predict(features)

        rmse = np.sqrt(mean_squared_error(target, pred))
        mae = mean_absolute_error(target, pred)
        r2 = r2_score(target, pred)

        print(f'{model_name}: RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.3f}')

        self.performance_metrics.extend([
            {'Model': model_name, 'Metric': 'RMSE', 'Value': rmse},
            {'Model': model_name, 'Metric': 'MAE', 'Value': mae},
            {'Model': model_name, 'Metric': 'R2', 'Value': r2}
        ])

    def _plot_metrics_comparison(self):
        metrics_df = pd.DataFrame(self.performance_metrics)

        fig = make_subplots(
            rows=1,
            cols=2,
            subplot_titles=('模型拟合度（R2）', '误差指标（RMSE／MAE）')
        )

        r2_df = metrics_df[metrics_df['Metric'] == 'R2']

        fig.add_trace(
            go.Bar(
                x=r2_df['Model'],
                y=r2_df['Value'],
                name='R2',
                text=r2_df['Value'].round(3),
                textposition='auto',
                marker=dict(color=self.COLOR_SEQ[0])
            ),
            row=1,
            col=1
        )

        err_df = metrics_df[metrics_df['Metric'].isin(['RMSE', 'MAE'])]

        for i, metric in enumerate(['RMSE', 'MAE']):
            d = err_df[err_df['Metric'] == metric]

            fig.add_trace(
                go.Bar(
                    x=d['Model'],
                    y=d['Value'],
                    name=metric,
                    marker=dict(color=self.COLOR_SEQ[i + 1])
                ),
                row=1,
                col=2
            )

        fig.update_layout(
            barmode='group'
        )

        fig = self._apply_style(fig, '模型性能评估对比')
        fig.show()

    def _plot_importance(self, model):
        importance_df = pd.DataFrame({
            'feature': self.feature_names,
            'score': model.get_feature_importance()
        })

        importance_df['feature_cn'] = (
            importance_df['feature']
            .map(self.FEATURE_DISPLAY_MAP)
            .fillna(importance_df['feature'])
        )

        importance_df = (
            importance_df
            .sort_values('score', ascending=True)
            .tail(10)
        )

        fig = go.Figure()

        fig.add_trace(
            go.Bar(
                x=importance_df['score'],
                y=importance_df['feature_cn'],
                orientation='h',
                marker=dict(
                    color=importance_df['score'],
                    colorscale='Viridis'
                )
            )
        )

        fig.update_xaxes(title_text='贡献度')
        fig.update_yaxes(title_text='特征')

        fig = self._apply_style(fig, 'CatBoost 核心特征贡献排行')
        fig.show()

    def _apply_style(self, fig: go.Figure, title: str) -> go.Figure:
        fig.update_layout(
            title=dict(
                text=title,
                x=0.5,
                y=0.95,
                xanchor='center',
                yanchor='top'
            ),
            template='plotly_white',
            font=dict(
                family='Microsoft YaHei',
                size=12
            ),
            margin=dict(
                l=40,
                r=40,
                t=80,
                b=40
            ),
            showlegend=True
        )
        return fig

In [None]:
def prepare_features(raw_df):
    df = raw_df.copy()
    df['is_subway'] = df['tags'].str.contains('近地铁', na=False).astype(int)

    cat_cols = ['city', 'rent_type', 'district', 'sub_district', 'orientation', 'floor_level']
    num_cols = ['area_sqm', 'bedrooms', 'living_rooms', 'bathrooms', 'total_floors', 'is_subway']
    target = 'price_avg'

    return df[cat_cols + num_cols + [target]], target, cat_cols

df_train, target_col, cat_features = prepare_features(df_clean)

pipeline = ModelPipeline(df_train, target_col, cat_features)

pipeline.run()