# Handling Titles


**Objective**:
1. Extract `year` from the title to build a new composite primary key: **(code, year)**
2. Remove duplicates:
   - Delete English versions
   - Handle *revisions*: keep only the revised version for each (code, year), i.e. titles marked 更正后 ("corrected") or 更新后 ("updated")
3. Form new data structure:
   - **code**
   - **year**
   - name
   - link

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [44]:
# load sample data (first 10 pages); without dtype, `code` is parsed as int and loses leading zeros (superseded by the next cell)
df = pd.read_csv('./Data/annual_report_10pages.csv')

In [45]:
# load sample data (first 10 pages) with specified data types
dtype_dict = {'code': str, 'name': str, 'title': str, 'link': str}
df = pd.read_csv('./Data/annual_report_10pages.csv', dtype=dtype_dict)
df

Unnamed: 0,code,name,title,link
0,603019,中科曙光,中科曙光2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...
1,872808,曙光数创,2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...
2,300353,东土科技,2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...
3,000063,中兴通讯,2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...
4,688041,海光信息,海光信息技术股份有限公司2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...
...,...,...,...,...
223,603008,喜临门,喜临门家具股份有限公司2023年年度报告（更正后）,http://www.cninfo.com.cn/new/disclosure/detail...
224,300398,飞凯材料,2023年年度报告（更新后）,http://www.cninfo.com.cn/new/disclosure/detail...
225,300398,飞凯材料,2021年年度报告（更正后）,http://www.cninfo.com.cn/new/disclosure/detail...
226,300398,飞凯材料,2022年年度报告（更正后）,http://www.cninfo.com.cn/new/disclosure/detail...
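The reason for passing `dtype` can be seen on a tiny in-memory CSV. This is only a sketch: `csv_text` is a made-up stand-in for the real file, reusing two codes from the sample above.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for ./Data/annual_report_10pages.csv
csv_text = "code,name\n000063,中兴通讯\n603019,中科曙光\n"

# Default type inference turns the code column into integers
inferred = pd.read_csv(io.StringIO(csv_text))
# Forcing str keeps the codes exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'code': str})

print(inferred['code'].tolist())  # [63, 603019]
print(as_text['code'].tolist())   # ['000063', '603019']
```

Since `code` is half of the composite key, keeping it as a string avoids `000063` and `63` being treated as different (or accidentally identical) identifiers later on.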


## 1. Extract `year` from titles

In [46]:
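# take the first 4-digit run in the title as the report year; expand=False returns a Series
df['year'] = df['title'].str.extract(r'(\d{4})', expand=False)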
df.head()

Unnamed: 0,code,name,title,link,year
0,603019,中科曙光,中科曙光2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...,2024
1,872808,曙光数创,2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...,2024
2,300353,东土科技,2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...,2024
3,000063,中兴通讯,2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...,2024
4,688041,海光信息,海光信息技术股份有限公司2024年年度报告,http://www.cninfo.com.cn/new/disclosure/detail...,2024
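A bare `\d{4}` matches any four consecutive digits, so a stricter variant may be worth considering. The sketch below anchors the match to the character 年 ("year"); this is an assumption about the title format, not something the notebook does. The last title is hypothetical and only shows what happens when no year is present.

```python
import pandas as pd

titles = pd.Series([
    '中科曙光2024年年度报告',
    '2023年年度报告（更正后）',
    '福成股份：关于更正《2023年年度报告》的公告',
    'Annual Report',  # hypothetical title without a year
])

# Require the four digits to be followed by 年; expand=False yields a Series
year = titles.str.extract(r'(\d{4})年', expand=False)
print(year.tolist())
```

Titles with no match come back as missing values, which can be checked with `year.isna()` before the column is used as part of the key.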


## 2. Remove Duplicates

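### Delete English versions

In [47]:
# '英文' ("English") also matches '英文版' ("English version"); .copy() avoids chained-assignment issues when columns are added later
df = df[~df['title'].str.contains('英文', na=False)].copy()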
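A minimal sketch of the same filter on made-up rows, showing how a missing title can be handled. The English-version title and the `None` row are hypothetical; `na=False` is one possible choice that treats a missing title as "not English" so the boolean mask can be negated safely.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['2024年年度报告', '2024年年度报告（英文版）', None],  # hypothetical rows
})

# na=False makes missing titles evaluate to False instead of NaN
is_english = sample['title'].str.contains('英文', na=False)
kept = sample[~is_english]
print(len(kept))  # 2
```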

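### Deal with revisions
1. Drop exact duplicates, then count how many titles each (code, year) pair has.
2. Inspect the pairs with more than one title to find the revision patterns.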

In [48]:
# drop rows that repeat the same (code, year, title)
df = df.drop_duplicates(subset=['code','year','title'])

In [49]:
# count the titles within each (code, year) combination
df['count'] = df.groupby(['code','year'])['title'].transform('count')

In [36]:
df[df['count']>1].sort_values(by=['code','year'])

Unnamed: 0,code,name,title,link,year,count
49,002569,ST步森,2021年年度报告全文（补充更正后）,http://www.cninfo.com.cn/new/disclosure/detail...,2021,2
60,002569,ST步森,2021年年度报告全文（更正后）,http://www.cninfo.com.cn/new/disclosure/detail...,2021,2
48,002569,ST步森,2022年年度报告（补充更正后）,http://www.cninfo.com.cn/new/disclosure/detail...,2022,2
62,002569,ST步森,2022年年度报告（更正后）,http://www.cninfo.com.cn/new/disclosure/detail...,2022,2
47,002569,ST步森,2023年年度报告（补充更正后）,http://www.cninfo.com.cn/new/disclosure/detail...,2023,2
59,002569,ST步森,2023年年度报告（更正后）,http://www.cninfo.com.cn/new/disclosure/detail...,2023,2
141,600965,福成股份,福成股份：2023年年度报告（ 更正后）,http://www.cninfo.com.cn/new/disclosure/detail...,2023,2
142,600965,福成股份,福成股份：关于更正《2023年年度报告》的公告,http://www.cninfo.com.cn/new/disclosure/detail...,2023,2
174,688165,埃夫特,埃夫特关于披露2023年年度报告补充信息及发布《2023年年度报告（修订稿）》的公告,http://www.cninfo.com.cn/new/disclosure/detail...,2023,2
175,688165,埃夫特,埃夫特2023年年度报告（修订稿）,http://www.cninfo.com.cn/new/disclosure/detail...,2023,2


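Revision patterns found:
- Drop titles like '埃夫特**关于**披露2023年年度报告补充信息及发布《2023年年度报告（修订稿）》的公告': the marker 关于 ("regarding") signals an announcement about a report, not the report itself.
- e.g. ('2023年度报告', '2023年度报告（更正后）') --> keep the corrected version (更正后)
- e.g. ('2022年度报告（更正后）', '2022年度报告（补充更正后）') --> both are corrected versions (补充更正后 = "after supplementary correction"), so keep one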
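These rules can be sketched as a small selection routine. This is an illustration under stated assumptions rather than the notebook's exact method: `REVISION_MARKERS` is a hypothetical, broader marker list, the sample rows are adapted from the titles shown above, and unlike a plain keyword filter it falls back to the original report when a (code, year) pair has no revised version.

```python
import pandas as pd

rows = pd.DataFrame({
    'code': ['688165', '688165', '002569', '002569', '603019'],
    'year': ['2023', '2023', '2022', '2022', '2024'],
    'title': [
        '埃夫特关于披露2023年年度报告补充信息及发布《2023年年度报告（修订稿）》的公告',
        '埃夫特2023年年度报告（修订稿）',
        '2022年年度报告（补充更正后）',
        '2022年年度报告（更正后）',
        '中科曙光2024年年度报告',
    ],
})

REVISION_MARKERS = '更正|修订|更新'  # assumed marker list, not taken from the notebook

# 1. Announcements about a report (关于 ...) are not reports
reports = rows[~rows['title'].str.contains('关于', na=False)].copy()

# 2. Flag revised versions
reports['revised'] = reports['title'].str.contains(REVISION_MARKERS, na=False)

# 3. Put revised rows ahead of originals, then keep the first row per key
ordered = pd.concat([reports[reports['revised']], reports[~reports['revised']]])
best = ordered.drop_duplicates(subset=['code', 'year'], keep='first')

print(best[['code', 'year', 'title']])
```

When several revised versions share a key, this keeps whichever appears first in the source order, so it only selects the latest revision if the rows are listed newest-first.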

In [51]:
df_unique = df[df['count']==1]
df_duplicated = df[df['count']>1].sort_values(by=['code','year'])

In [52]:
# drop announcements: titles containing 关于 ("regarding") are not reports
df_duplicated = df_duplicated[~df_duplicated['title'].str.contains('关于')]

In [55]:
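# keep only the revised versions (corrected / revised titles)
df_duplicated = df_duplicated[df_duplicated['title'].str.contains('更正版|更正后|修订稿|修订版|修订后|更正公告|更正稿')]
# keep one row per (code, year); 'first' is the latest version only if rows are ordered newest-first
df_duplicated = df_duplicated.drop_duplicates(subset=['code','year'], keep='first')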

In [57]:
df = pd.concat([df_unique, df_duplicated])

In [None]:
df.duplicated(subset=['code','year']).sum() # no duplicated (code, year) combinations

0
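Another way to check the composite key, sketched on a made-up three-row frame (`cleaned` is a hypothetical stand-in for `df`), is to let pandas enforce uniqueness when the key is set as the index.

```python
import pandas as pd

cleaned = pd.DataFrame({
    'code': ['603019', '300398', '300398'],
    'year': ['2024', '2023', '2022'],
    'name': ['中科曙光', '飞凯材料', '飞凯材料'],
})

# verify_integrity=True raises ValueError if any (code, year) pair repeats
keyed = cleaned.set_index(['code', 'year'], verify_integrity=True)
print(keyed.index.is_unique)  # True

# Re-adding an existing row violates the key and is rejected
with_dup = pd.concat([cleaned, cleaned.iloc[[0]]])
try:
    with_dup.set_index(['code', 'year'], verify_integrity=True)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

Compared with counting duplicates, this fails loudly, which can be convenient when the cleaning step runs unattended.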

In [60]:
df = df[['code','year', 'name', 'link']]
df

Unnamed: 0,code,year,name,link
0,603019,2024,中科曙光,http://www.cninfo.com.cn/new/disclosure/detail...
1,872808,2024,曙光数创,http://www.cninfo.com.cn/new/disclosure/detail...
2,300353,2024,东土科技,http://www.cninfo.com.cn/new/disclosure/detail...
3,000063,2024,中兴通讯,http://www.cninfo.com.cn/new/disclosure/detail...
4,688041,2024,海光信息,http://www.cninfo.com.cn/new/disclosure/detail...
...,...,...,...,...
67,830974,2023,凯大催化,http://www.cninfo.com.cn/new/disclosure/detail...
169,831856,2023,浩淼科技,http://www.cninfo.com.cn/new/disclosure/detail...
50,834475,2023,三友科技,http://www.cninfo.com.cn/new/disclosure/detail...
107,839719,2023,宁新新材,http://www.cninfo.com.cn/new/disclosure/detail...


In [61]:
df.to_csv('./Data/annual_report_cleaned.csv', index=False)