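# Data Analysis

This notebook generates data for the front end, ready to be used directly for visualization.

The Markdown cells mainly describe what each analysis means; the code cells turn the raw data into chart data.

Unless stated otherwise, each data group consists of two arrays, holding the x coordinates and the y coordinates respectively.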
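
As a concrete picture of that layout, here is a minimal sketch of one data group. The name `example_counts` and its numbers are made up for illustration and do not come from the dataset.

```python
# Hypothetical data group: two parallel arrays, aligned by position.
example_counts = {
    'x': [0, 1, 2, 5],      # x coordinates
    'y': [120, 40, 13, 2],  # y coordinates
}

# Pair the coordinates into (x, y) points, e.g. for a scatter plot.
points = list(zip(example_counts['x'], example_counts['y']))
```

Keeping x and y as separate arrays matches the input format many charting libraries accept, and zipping them back into points is a one-liner when needed.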

In [16]:
import pandas as pd
import numpy as np
from src.data.load import load_dataset
data = load_dataset()
df: pd.DataFrame = data['all_train']
data_for_vis = {}

In [17]:
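# Shared helper functions
def series_to_obj(s: pd.Series):
    # Split a Series into parallel lists: index -> 'x', values -> 'y'
    return {
        'x': s.index.to_list(),
        'y': s.values.tolist()
    }

def get_cumulative_percent(value_counts: pd.Series) -> pd.Series:
    # Accumulate counts from the largest value downward, then normalise,
    # giving the fraction of items whose value is >= each index value
    cumulative = value_counts.sort_index(ascending=False).cumsum()
    return cumulative / cumulative.max()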

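## 1 - Distribution of engagement counts

Engagement covers three actions: like, forward, and comment.

This section describes how engagement counts are distributed across all posts.

Chart types: log-scale scatter plot, line chart

- `xxx_value_counts` data point meaning: **the number of posts whose engagement count equals x is y**
- `cumulative_xxx_value_percent` data point meaning: **the fraction of posts whose engagement count is at least x is y**

Additional request: for the line chart, it would be good if hovering over the line showed two guide lines plus the x and y coordinates of that point, so that one can read off, for example, what percentage of posts have at least a given number of likes.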
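
To make the two data point definitions concrete, the following sketch repeats the calculation on six made-up posts. The toy `like_count` values are an assumption for illustration; the steps mirror `get_cumulative_percent`. Because the cumulative sum runs from the largest value downward and includes x itself, y is the share of posts with at least x engagements.

```python
import pandas as pd

# Made-up like counts for six posts (illustration only).
like_count = pd.Series([0, 0, 0, 1, 1, 3])

# x = engagement count, y = number of posts with exactly that count.
value_counts = like_count.value_counts()

# Same steps as get_cumulative_percent: accumulate from the largest x
# downward, then divide by the total number of posts.
cumulative = value_counts.sort_index(ascending=False).cumsum()
percent = cumulative / cumulative.max()

print(value_counts.sort_index().to_dict())
print(percent.sort_index().to_dict())
```

In this toy data, all posts have at least zero likes, half have at least one like, and one in six has at least three.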

In [18]:
like_value_counts    = df['like_count'].value_counts()
forward_value_counts = df['forward_count'].value_counts()
comment_value_counts = df['comment_count'].value_counts()
cumulative_like_value_percent    = get_cumulative_percent(like_value_counts)
cumulative_forward_value_percent = get_cumulative_percent(forward_value_counts)
cumulative_comment_value_percent = get_cumulative_percent(comment_value_counts)
data_for_vis.update({
    'like_value_counts': series_to_obj(like_value_counts),
    'forward_value_counts': series_to_obj(forward_value_counts),
    'comment_value_counts': series_to_obj(comment_value_counts),
    'cumulative_like_value_percent': series_to_obj(cumulative_like_value_percent),
    'cumulative_forward_value_percent': series_to_obj(cumulative_forward_value_percent),
    'cumulative_comment_value_percent': series_to_obj(cumulative_comment_value_percent),
})

## 2 - Correlation between columns

The matrix needs to be stored as a 2-D array.
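
A sketch of what that 2-D array looks like, using a tiny made-up table. The nested list carries no row or column names, so the order follows the column order of the DataFrame. The `labels` list below is an assumption about what a heatmap might need, not something this notebook exports.

```python
import pandas as pd

# Made-up engagement numbers for four posts (illustration only).
toy = pd.DataFrame({
    'like_count':    [0, 1, 2, 3],
    'forward_count': [0, 2, 4, 6],
    'comment_count': [3, 2, 1, 0],
})

corr = toy.corr()                  # Pearson correlation by default
matrix = corr.to_numpy().tolist()  # nested Python lists, JSON-serialisable
labels = corr.columns.to_list()    # hypothetical axis labels for a heatmap

# matrix[i][j] is the correlation between labels[i] and labels[j].
```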

In [19]:
engagements = df[['like_count', 'forward_count', 'comment_count']]
engagements.corr()

| | like_count | forward_count | comment_count |
|---|---|---|---|
| like_count | 1.0 | 0.554547 | 0.617456 |
| forward_count | 0.554547 | 1.0 | 0.579505 |
| comment_count | 0.617456 | 0.579505 | 1.0 |


In [20]:
data_for_vis.update({
    'engagement_corr': engagements.corr().to_numpy().tolist()
})

## 3.1 - Number of posts per user

In [21]:
num_posts_of_users_counts = df['uid'].value_counts().value_counts()
cumulative_num_posts = get_cumulative_percent(num_posts_of_users_counts)
data_for_vis.update({
    'num_posts_of_users_counts': series_to_obj(num_posts_of_users_counts),
    'cumulative_num_posts': series_to_obj(cumulative_num_posts)
})
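
The chained `value_counts()` can be read in two steps: the first counts posts per user, the second counts how many users share each post count. A small sketch with made-up `uid` values (the variable names here are illustrative):

```python
import pandas as pd

# Made-up author ids for seven posts (illustration only).
uid = pd.Series(['a', 'a', 'a', 'b', 'c', 'c', 'd'])

# Step 1: number of posts written by each user.
posts_per_user = uid.value_counts()

# Step 2: number of users who wrote exactly x posts.
users_per_post_count = posts_per_user.value_counts()

print(users_per_post_count.sort_index().to_dict())
```

Here two users wrote one post each, one user wrote two, and one user wrote three.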

## 3.2 - Average engagement per post for each user

How to present this part has not been decided yet.

In [22]:
# Group posts by user, with each user's posts sorted chronologically
user_dataframes_sorted = {uid: group.sort_values(by='time') for uid, group in df.groupby('uid')}
# One row per user: mean likes, forwards, and comments per post, plus the number of posts
user_stats = []
for uid, dataframe in user_dataframes_sorted.items():
    user_stats.append([
        dataframe['like_count'].mean(),
        dataframe['forward_count'].mean(),
        dataframe['comment_count'].mean(),
        len(dataframe)
    ])
user_stats = pd.DataFrame(user_stats, columns=['like_count', 'forward_count', 'comment_count', 'num_posts'])

## 3.3 - Correlation between the average engagement a user receives and the number of posts

In [23]:
user_stats.corr()

| | like_count | forward_count | comment_count | num_posts |
|---|---|---|---|---|
| like_count | 1.0 | 0.514941 | 0.801157 | 0.024916 |
| forward_count | 0.514941 | 1.0 | 0.527527 | 0.030005 |
| comment_count | 0.801157 | 0.527527 | 1.0 | 0.028349 |
| num_posts | 0.024916 | 0.030005 | 0.028349 | 1.0 |


In [24]:
data_for_vis.update({
    'user_engagement_corr': user_stats.corr().to_numpy().tolist()
})

## 4 - Posting time

In [25]:
hour_of_day_counts = df['time'].dt.hour.value_counts()
day_of_week_counts = df['time'].dt.day_of_week.value_counts()
data_for_vis.update({
    'hour_of_day_counts': series_to_obj(hour_of_day_counts),
    'day_of_week_counts': series_to_obj(day_of_week_counts),
})
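
`value_counts()` orders its result by frequency and omits values that never occur, so a chart that expects the x axis in clock order may need the data reindexed first. A possible sketch, assuming the front end wants all 24 hours in order (an assumption about the chart, not something this notebook states); the timestamps are made up:

```python
import pandas as pd

# Made-up timestamps (illustration only).
time = pd.Series(pd.to_datetime([
    '2015-02-01 08:15', '2015-02-01 08:40', '2015-02-01 23:05',
]))

hour_counts = time.dt.hour.value_counts()

# Reindex to 0..23 so hours appear in order and missing hours show up as 0.
hour_counts_ordered = hour_counts.reindex(range(24), fill_value=0)
```

The same idea applies to `dt.day_of_week`, which numbers days from Monday = 0 to Sunday = 6, by reindexing over `range(7)`.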

## 5 - Text content

For now this section only contains the text length distribution.

In [26]:
from src.data.process import process_text

text_length = df['content'].astype(str).apply(process_text).apply(len)
text_characters_counts = text_length.value_counts()
data_for_vis.update({
    'text_characters_counts': series_to_obj(text_characters_counts)
})

In [27]:
def run_wordcloud():
    # Count word frequencies over all posts and write them to word_counts.csv
    import jieba
    from tqdm import tqdm
    from collections import Counter
    jieba.setLogLevel(20)  # 20 == logging.INFO
    jieba.initialize()

    def cut_text(text: str) -> list[str]:
        # Keep only alphanumeric tokens (drops punctuation and whitespace)
        return [word for word in jieba.cut(text) if word.isalnum()]

    word_counts = Counter()
    # astype(str) avoids errors on non-string values such as NaN, as in the previous cell
    for content in tqdm(df['content'].astype(str)):
        word_counts.update(cut_text(content))

    word_counts = pd.Series(word_counts).sort_values(ascending=False)
    word_counts.to_csv('word_counts.csv')

## Save data

In [28]:
import json

class CustomJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        # Emit NumPy arrays as real JSON arrays; returning a formatted
        # string here would be written to the file as a quoted string
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        # Convert NumPy scalars to native Python numbers
        if isinstance(obj, np.generic):
            return obj.item()
        # For other types, fall back to default serialization
        return super().default(obj)

with open('./test_data.json', 'w', encoding='utf-8') as f:
    json.dump(data_for_vis, f, ensure_ascii=True, cls=CustomJSONEncoder, indent=4)
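
A quick way to sanity-check the export is to load the file back and confirm that every entry has one of the two layouts used in this notebook: an x/y pair of equal-length arrays, or a 2-D matrix. The sketch below uses a made-up payload and a temporary file; `check_entry` is a hypothetical helper, not part of the project.

```python
import json
import os
import tempfile

# Made-up payload with one entry of each layout (illustration only).
payload = {
    'like_value_counts': {'x': [0, 1, 2], 'y': [5, 3, 1]},
    'engagement_corr': [[1.0, 0.5], [0.5, 1.0]],
}


def check_entry(value) -> bool:
    """True for an x/y pair of equal length or a rectangular 2-D list."""
    if isinstance(value, dict):
        return set(value) == {'x', 'y'} and len(value['x']) == len(value['y'])
    if isinstance(value, list) and value:
        width = len(value[0]) if isinstance(value[0], list) else -1
        return all(isinstance(row, list) and len(row) == width for row in value)
    return False


# Round-trip through a file, then validate what was read back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test_data.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(payload, f, indent=4)
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

results = {name: check_entry(value) for name, value in loaded.items()}
```

Running the same check against the real `test_data.json` would catch an entry that was accidentally written as a string or with mismatched array lengths before it reaches the front end.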