### Understanding the following code
```python
if (
    any(x in row["consequence"]
        for x in ["missense", "non_coding_transcript", "intergenic"])
    and not any(x in row["consequence"]
                for x in calculate_oe.oe_functions.more_severe_than_missense)
):
    ...  # action to take for matching rows
```

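1. What it does: checks that `row["consequence"]` contains at least one of `missense`, `non_coding_transcript`, or `intergenic`, and contains none of the terms in `calculate_oe.oe_functions.more_severe_than_missense`. When `row["consequence"]` is a string, `x in ...` is a substring test, so `missense` also matches `missense_variant`.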

A more readable version:

```python
is_medium_impact = any(x in row["consequence"] for x in ["missense", "non_coding_transcript", "intergenic"])
is_not_severe = not any(x in row["consequence"] for x in calculate_oe.oe_functions.more_severe_than_missense)
if is_medium_impact and is_not_severe:
    ...  # action to take for matching rows
```
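To try the check on toy data, here is a minimal sketch. The contents of `MORE_SEVERE_THAN_MISSENSE` are a hypothetical stand-in for `calculate_oe.oe_functions.more_severe_than_missense`, and the helper name `is_medium_not_severe` and the sample rows are made up for illustration.

```python
# Hypothetical stand-in for calculate_oe.oe_functions.more_severe_than_missense
MORE_SEVERE_THAN_MISSENSE = ["stop_gained", "frameshift", "start_lost"]
MEDIUM_IMPACT = ["missense", "non_coding_transcript", "intergenic"]


def is_medium_not_severe(consequence):
    """True if a medium-impact term is present and no more severe term is."""
    has_medium = any(term in consequence for term in MEDIUM_IMPACT)
    has_severe = any(term in consequence for term in MORE_SEVERE_THAN_MISSENSE)
    return has_medium and not has_severe


# Made-up rows; the consequence strings are substring-matched
rows = [
    {"consequence": "missense_variant"},
    {"consequence": "missense_variant&stop_gained"},
    {"consequence": "synonymous_variant"},
]
flags = [is_medium_not_severe(row["consequence"]) for row in rows]
print(flags)  # [True, False, False]
```

The second row is rejected because a more severe term appears alongside `missense`, and the third because no medium-impact term appears at all.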

### Sharing a dictionary across processes

```python
# Excerpt from inside a larger function
# (assumes `import multiprocessing as mp` and `import pandas as pd`)
# Initialize the shared dictionary used by the parallelized function
manager = mp.Manager()
kmers = manager.dict()
pool = mp.Pool(mp.cpu_count())

# Calculate oe ratios and CIs for every possible kmer
for position in range(1, 16570):
    pool.apply_async(parallelize_kmers, args=(
        position, kmer_length, lookup_dictionary, excluded_sites, fit_parameters, mitomap_loci, kmers))
pool.close()  # stop accepting new tasks
pool.join()  # block until all queued tasks have finished

# Use pandas to assign percentile ranks to each kmer by their upper CI value
# Question: why can kmers be passed here directly, without calling kmers.get()?
df = pd.DataFrame.from_dict(kmers, orient='index', columns=[
    'total_all', 'obs_max_het', 'exp_max_het', 'ratio_oe', 'lower_CI', 'upper_CI', 'loci', 'loci_type'])
```

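- `kmers` is a shared dictionary created with `multiprocessing.Manager().dict()`, used to pass data between processes. It is not a return value: the workers write into it (it is passed to `parallelize_kmers` as the last argument), and once `pool.join()` returns all of those writes are done, so the main process can read `kmers` directly. `.get()` is only needed on the `AsyncResult` objects that `pool.apply_async()` returns, to fetch a worker function's return value. Because those results are never fetched here, an exception raised inside a worker goes unnoticed.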
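The pattern can be reduced to a small sketch. The worker `record_square` and the wrapper `squares_via_shared_dict` are hypothetical names, not part of the original code. The sketch also keeps the `AsyncResult` objects and calls `.get()` on them, which is one way to surface worker exceptions. It assumes a Unix-like system, since it explicitly requests the `fork` start method.

```python
import multiprocessing as mp


def record_square(n, shared):
    """Worker: write the result into the shared dict instead of returning it."""
    shared[n] = n * n


def squares_via_shared_dict(numbers):
    # "fork" keeps this sketch runnable as a plain script on Linux/macOS;
    # on Windows, use the default context under `if __name__ == "__main__":`.
    ctx = mp.get_context("fork")
    with ctx.Manager() as manager:
        shared = manager.dict()
        with ctx.Pool(2) as pool:
            async_results = [
                pool.apply_async(record_square, args=(n, shared)) for n in numbers
            ]
            pool.close()  # no more tasks will be submitted
            pool.join()   # wait for every queued task to finish
            # .get() belongs to AsyncResult; it re-raises any worker exception
            for res in async_results:
                res.get()
        # Copy the proxy into a plain dict while the manager is still alive
        return dict(shared)


result = squares_via_shared_dict(range(5))
print(result)
```

Copying with `dict(shared)` is optional when the proxy is consumed right away, as in the excerpt above, but it gives an ordinary dictionary that stays usable after the manager shuts down.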


### Dropping duplicates based on several columns in pandas
````python
# General form, shown with its default arguments
df.drop_duplicates(subset=['col1', 'col2'], keep='first', inplace=False, ignore_index=False)
# Deduplicate on the combination of columns 'A' and 'B'
df_unique = df.drop_duplicates(subset=['A', 'B'])
````
### Keeping the max/min of a column when dropping duplicates
````python
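# Sort by 'age', then keep the record with the largest age for each 'name'
df_max = df.sort_values('age').drop_duplicates(subset=['name'], keep='last')
# keep='first' on the same sort keeps the smallest age instead
df_min = df.sort_values('age').drop_duplicates(subset=['name'], keep='first')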
````
After deduplication the row index may no longer be contiguous; reset it with `reset_index(drop=True)`.
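A small end-to-end sketch of sort, deduplicate, and reset, using made-up data with columns `name` and `age`:

```python
import pandas as pd

# Made-up example data: two people appear more than once
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Bob", "Cid"],
    "age": [30, 25, 35, 20, 40],
})

df_max = (
    df.sort_values("age")                            # ascending by age
      .drop_duplicates(subset=["name"], keep="last")  # last row per name = largest age
      .reset_index(drop=True)                        # renumber the rows 0..n-1
)
print(df_max)
```

Without the final `reset_index(drop=True)`, the surviving rows would keep their original labels (1, 2 and 4 here).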

### How to negate `df2[df2['POS'].isin(df1_pos)]`?
```python
# Negation: select the rows of df2 whose 'POS' value is not in df1_pos
df2_not_in_pos = df2[~df2['POS'].isin(df1_pos)]
```
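A runnable sketch with made-up values for `df1_pos` and `df2`. The boolean mask is stored once and reused for both the matching and the non-matching rows. Note that the operator must be `~`: Python's `not` raises a `ValueError` on a Series.

```python
import pandas as pd

# Made-up inputs for illustration
df1_pos = [1, 3]
df2 = pd.DataFrame({"POS": [1, 2, 3, 4], "REF": ["G", "A", "T", "C"]})

mask = df2["POS"].isin(df1_pos)  # element-wise membership test
df2_in_pos = df2[mask]           # rows whose POS is in df1_pos
df2_not_in_pos = df2[~mask]      # ~ inverts the boolean mask element-wise

print(df2_not_in_pos)
```

If `df1_pos` holds strings while `df2['POS']` holds integers, `isin` finds no matches, so it is worth confirming that both sides share a dtype before filtering.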

### Forcing conversion to string type
```python
df1['POS_REF_ALT'] = df1['POS'].astype(str) + '_' + df1['REF'] + '_' + df1['ALT']

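# Using str.cat()
# pandas' own string concatenation method, which makes the intent explicit:
df1['POS_REF_ALT'] = df1['POS'].astype(str).str.cat([df1['REF'], df1['ALT']], sep='_')

# Using apply
# Handles mixed types flexibly, because the f-string formats every value:
df1['POS_REF_ALT'] = df1.apply(lambda x: f"{x['POS']}_{x['REF']}_{x['ALT']}", axis=1)

# On large datasets, astype(str) or str.cat() is typically faster than a row-wise apply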
```
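As a quick sanity check, the three approaches can be compared on a tiny made-up table. The variable names `via_plus`, `via_cat` and `via_apply` are for illustration only.

```python
import pandas as pd

# Made-up data: POS is an integer column, REF and ALT are strings
df1 = pd.DataFrame({"POS": [1, 2], "REF": ["G", "A"], "ALT": ["A", "C"]})

# 1. Cast POS to str, then concatenate with +
via_plus = df1["POS"].astype(str) + "_" + df1["REF"] + "_" + df1["ALT"]

# 2. Cast POS to str, then join the columns with str.cat()
via_cat = df1["POS"].astype(str).str.cat([df1["REF"], df1["ALT"]], sep="_")

# 3. Row-wise apply with an f-string
via_apply = df1.apply(lambda x: f"{x['POS']}_{x['REF']}_{x['ALT']}", axis=1)

print(via_plus.tolist())  # ['1_G_A', '2_A_C']
```

All three produce the same keys here; the cast on `POS` is what allows an integer column to be joined with string columns in the first two.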

### Converting a list to a pandas DataFrame
```python
# 1. Flat list to a single-column DataFrame
import pandas as pd
my_list = [10, 20, 30, 40]
df = pd.DataFrame(my_list, columns=['Column1'])  # the column is named 'Column1'

# 2. List of lists to a DataFrame
data = [[1, 'Alice', 25], [2, 'Bob', 30]]
df = pd.DataFrame(data, columns=['ID', 'Name', 'Age'])  # column names given explicitly

# 3. List of dicts to a DataFrame
data = [{'ID': 1, 'Name': 'Alice', 'Age': 25}, {'ID': 2, 'Name': 'Bob', 'Age': 30}]
df = pd.DataFrame(data)  # column names are taken from the dict keys

# 4. Nested structures: flatten each record first
data = [{'ID': 1, 'Details': {'Age': 25, 'City': 'NY'}}]
flattened = [{'ID': item['ID'], **item['Details']} for item in data]
df = pd.DataFrame(flattened)
```
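For nested records, `pd.json_normalize` is an alternative to flattening by hand. The sketch below puts the two side by side on the same record; `df_manual` and `df_norm` are illustrative names. The main visible difference is the column naming: `json_normalize` prefixes nested keys with the parent key.

```python
import pandas as pd

data = [{"ID": 1, "Details": {"Age": 25, "City": "NY"}}]

# Manual flattening: merge the nested dict into the top level
flattened = [{"ID": item["ID"], **item["Details"]} for item in data]
df_manual = pd.DataFrame(flattened)

# json_normalize flattens nested dicts and joins the key path with "."
df_norm = pd.json_normalize(data)

print(list(df_manual.columns))
print(list(df_norm.columns))
```

Passing `sep="_"` to `json_normalize` changes the separator used in the generated column names.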

### Dropping all rows where the cutoff column is NaN in pandas
```python
import pandas as pd

# Recreate the example data as a DataFrame
data = {
    'CHROM': ['MT', 'MT', 'MT', 'MT', 'MT'],
    'POS': [1, 1, 1, 2, 2],
    'REF': ['G', 'G', 'G', 'A', 'A'],
    'ALT': ['A', 'C', 'T', 'C', 'G'],
    'type': ['SNV', 'SNV', 'SNV', 'SNV', 'SNV'],
    'mean': [0.000013, 0.000010, 0.000029, 0.000026, 0.000018],
    'sd': [0.000078, 0.000069, 0.000128, 0.000106, 0.000104],
    'cutoff': [float('nan'), float('nan'), float('nan'), float('nan'), float('nan')],
    'note': ['Not curated', 'Not curated', 'Not curated', 'Not curated', 'Not curated']
}
df = pd.DataFrame(data)

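# Drop the rows where cutoff is NaN
df_cleaned = df.dropna(subset=['cutoff'])

# Print the result
print("Rows in original data:", len(df))
print("Rows after cleaning:", len(df_cleaned))
print("\nCleaned DataFrame:")
print(df_cleaned)

# To drop only the rows in which every column is NaN, omit subset and use how='all'
# (with a single-column subset, how='all' behaves the same as the default how='any'):
df_all_nan_dropped = df.dropna(how='all')

# It is worth checking the NaN distribution in real data first:
print(df['cutoff'].isna().value_counts())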
```

