# FPM: Funding-Productivity Model

This notebook is a simple model to estimate the funding required to achieve a certain level of productivity in a research lab. The model is based on the assumption that the productivity of a lab is proportional to the funding it receives. 

The model is based on the following assumptions:

## Data Formatting

The data is formatted as a table with the following columns:

- `Year`: The year of the data
- `Funding`: The funding received by the lab in that year
- `Productivity`: The productivity of the lab in that year
- `Target Productivity`: The target productivity of the lab


## Key Assumptions

1. **Productivity is proportional to funding**: The productivity of the lab is assumed to be proportional to the funding it receives. This is a simplifying assumption, but it is a common one in the literature on research funding.
1. **Diminishing returns to funding**: The relationship between funding and productivity is assumed to be concave, meaning that the marginal productivity of funding decreases as funding increases. This is also a common assumption in the literature on research funding.
1. **Lag between funding and productivity**: The effect of funding on productivity is assumed to be lagged, meaning that funding in one year affects productivity in the following year. This is a common assumption in the literature on research funding.

In [2]:
import pandas as pd
import numpy as np
from semopy import Model, Optimizer
from semopy.stats import calc_stats
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

import semopy
print(semopy.__version__)

# 读取数据
company_df = pd.read_csv('公司概况.csv', encoding='utf-8') 
financing_df = pd.read_csv('融资历史.csv', encoding='utf-8')

# 数据预处理
financing_df['融资金额'] = pd.to_numeric(financing_df['金额'], errors='coerce')
company_df['子行业'] = company_df['子行业'].astype('category')
company_df['省'] = company_df['省'].astype('category')
company_df['估值'] = pd.to_numeric(company_df['估值(万人民币)'], errors='coerce')

# 计算融资汇总统计信息
financing_summary = financing_df.groupby('公司简称').agg(
    平均融资金额=('融资金额', 'mean'),
    融资总次数=('融资金额', 'count'),
    融资总金额=('融资金额', 'sum')
).reset_index()

# 将汇总数据与公司概况表合并
df = pd.merge(company_df, financing_summary, on='公司简称', how='left')

# 缺失值处理
df['年营收'] = pd.to_numeric(df['年营收(亿元)'], errors='coerce')
df['年利润'] = pd.to_numeric(df['年度利润(亿元)'], errors='coerce')
df['员工人数'] = pd.to_numeric(df['员工人数'], errors='coerce')

# 填充缺失值为中位数
columns_to_fill = ['年营收', '年利润', '员工人数', '平均融资金额', '融资总次数', '融资总金额']
for col in columns_to_fill:
    df[col] = df[col].fillna(df[col].median())

# 计算人均产值
df['人均产值'] = df['年营收'] / df['员工人数']

# 将子行业变量转换为哑变量
df = pd.get_dummies(df, columns=['子行业'], prefix='子行业')

# 检查数据类型和缺失值
# print(df.info())

# 定义SEM模型
model_spec = """
    # measurement model
    融资策略 =~ 融资总次数
    业务健康度 =~ 年营收 + 年利润 + 人均产值

    # structural model
    业务健康度 ~ 融资策略

    # 可以考虑添加其他控制变量或影响因素
    人均产值 ~ 业务健康度 + 融资策略 + 年营收
"""

# 构建模型对象
model = Model(model_spec)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['融资总次数', '平均融资金额', '融资总金额', '人均产值', '年营收', '年利润', '估值']])
scaled_df = pd.DataFrame(scaled_data, columns=['融资总次数', '平均融资金额', '融资总金额', '人均产值', '年营收', '年利润', '估值'])

# 拟合模型
model.fit(scaled_df)

# 输出拟合结果
print("Fit results:")
print(model.inspect())

# 输出标准化系数
print("Standardized coefficients:")
print(model.inspect(std_est=True))

# 输出拟合优度指标
stats = calc_stats(model)
print("Fit indices:")
print(stats.T[['Value']])  # This will print the table of fit indices

# Bootstrap稳健性检验
nrep = 1000  # Bootstrap抽样次数
bootstrap_results = []

# 手动创建一个包含自由参数名称的列表
free_params = ['融资策略 =~ 融资总次数',
               '融资策略 =~ 融资总金额',
               '业务健康度 =~ 年营收',
               '业务健康度 =~ 人均产值',
               '业务健康度 =~ 年利润',
               '业务健康度 ~ 融资策略',
               '估值 ~ 业务健康度',
               '估值 ~ 融资策略',
               '年营收 ~~ 年营收']

# 获取model.param_vals的长度
param_vals_length = len(model.param_vals)

# 创建bootstrap_results数组
bootstrap_results = np.empty((nrep, param_vals_length), dtype=float)

for i in range(nrep):
    # 对数据进行Bootstrap抽样
    bootstrap_data = resample(scaled_df)

    # 在Bootstrap样本上拟合模型
    bootstrap_model = semopy.Model(model_spec)
    bootstrap_model.fit(bootstrap_data)

    # 直接使用bootstrap_model.param_vals数组
    param_estimates = bootstrap_model.param_vals

    # 储存Bootstrap样本的参数估计结果
    bootstrap_results[i, :] = param_estimates

# 计算参数估计的置信区间
conf_int = np.percentile(bootstrap_results, [2.5, 97.5], axis=0)

# 将Bootstrap结果转换为numpy数组
bootstrap_results = np.array(bootstrap_results)

# 输出Bootstrap结果的置信区间
print("Bootstrap confidence intervals:")
for param, ci in zip(free_params, conf_int):
    print(f"{param}: [{ci[0]:.3f}, {ci[1]:.3f}]")

2.3.11
Fit results:
     lval  op   rval  Estimate        Std. Err   z-value   p-value
0     年营收   ~  业务健康度  1.000000               -         -         -
1   业务健康度   ~   融资策略 -0.064070  1307490.264796      -0.0       1.0
2   融资总次数   ~   融资策略  1.000000               -         -         -
3     年利润   ~  业务健康度  0.305732        4.612128  0.066289  0.947148
4    人均产值   ~  业务健康度 -0.395614  8389480.542206      -0.0       1.0
5    人均产值   ~  业务健康度 -0.395614  8395727.536504      -0.0       1.0
6    人均产值   ~   融资策略  0.331128  6771791.016456       0.0       1.0
7    人均产值   ~    年营收  0.603717        8.783867   0.06873  0.945204
8   业务健康度  ~~  业务健康度  0.646716    28116.018576  0.000023  0.999982
9     年营收  ~~    年营收  0.449578        9.775857  0.045989  0.963319
10   融资策略  ~~   融资策略  0.335630  6849269.750743       0.0       1.0
11   人均产值  ~~   人均产值  0.794333   750992.338278  0.000001  0.999999
12    年利润  ~~    年利润  1.019824        0.924136  1.103543  0.269792
13  融资总次数  ~~  融资总次数  0.674083  6849269.75

  return np.sqrt((chi2 / dof - 1) / (model.n_samples - 1))


Bootstrap confidence intervals:
融资策略 =~ 融资总次数: [-34.160, -4.575]
融资策略 =~ 融资总金额: [8.092, 10.955]
