# CKO JAR Revision

In [15]:
import pandas as pd, numpy as np
import rpy2.rinterface #ggplot tool

## Review TNIC-3 data

In [16]:
"""
Hoberg and Phillips TNIC-3 database
"""
tnic = pd.read_csv('/Users/ohn0000/Project/cko/0_data/external/tnic3_data.txt', delimiter='\t', header=0)
tnic.dropna(inplace=True)

In [17]:
# """
# Subset to firms with more than 20 competitors each year
# """
# tnic.set_index(["gvkey1", "year", "gvkey2"], inplace=True, verify_integrity=True)
# tnic_industry = tnic.groupby(level=['gvkey1', 'year']).apply(lambda x: x.nlargest(20, 'score')).reset_index(level=(0,1), drop=True)
# tnic_industry = tnic_industry.groupby(level=['gvkey1', 'year']).filter(lambda x: x.size == 20)
# tnic.reset_index(inplace=True)

# """
# Save tnic_industry to a csv file
# """
# tnic_industry.to_csv('/Users/ohn0000/Project/cko/2_pipeline/tnic_sub.csv')
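
The commented-out cell above keeps the 20 highest-scoring competitors of each firm-year and drops firm-years with fewer than 20. A minimal sketch of the same selection on toy data, using a sort followed by `groupby().head()`. The helper name `top_k_competitors` and the toy values are illustrative only and not part of the project code.

```python
import pandas as pd

def top_k_competitors(pairs: pd.DataFrame, k: int = 20) -> pd.DataFrame:
    """Keep the k highest-score rows per (gvkey1, year); drop groups with fewer than k rows."""
    ranked = pairs.sort_values(['gvkey1', 'year', 'score'], ascending=[True, True, False])
    top = ranked.groupby(['gvkey1', 'year']).head(k)
    # Size of each surviving group, broadcast back to the rows of `top`.
    group_size = top.groupby(['gvkey1', 'year'])['score'].transform('size')
    return top[group_size == k].reset_index(drop=True)

# Toy data (hypothetical): firm 1 has three competitors, firm 2 has two.
toy_pairs = pd.DataFrame({
    'gvkey1': [1, 1, 1, 2, 2],
    'year': [2000] * 5,
    'gvkey2': [10, 11, 12, 10, 11],
    'score': [0.30, 0.10, 0.20, 0.50, 0.40],
})
toy_top2 = top_k_competitors(toy_pairs, k=2)
```

With `k=3` only firm 1 survives in the toy data, because firm 2 has just two competitors.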

In [18]:
tnic_industry = pd.read_csv('/Users/ohn0000/Project/cko/2_pipeline/tnic_sub.csv', header=0)

In [19]:
# tnic_industry['gvkey1'] = tnic_industry['gvkey1'].apply(lambda x: str(x).zfill(6))
# tnic_industry['gvkey2'] = tnic_industry['gvkey2'].apply(lambda x: str(x).zfill(6))

Remember that _year_ in __tnic_industry__ is the base year for identifying close competitors. Accordingly, _lead1_ is the M&A year and _lead2_ is the year following the M&A.

Readme_tnic3.txt explains that _year_ equals the first four digits of the __compustat__ _datadate_.

Shift years in __tnic__ back by one so that next year's scores line up with the base year, giving _lead1_ similarity scores.

In [20]:
tnic['year'] = tnic['year'] - 1
tnic.rename(columns={'score':'score_lead1'}, inplace=True)

In [21]:
tnic_industry = pd.merge(tnic_industry, tnic, how='left', left_on=['gvkey1', 'year', 'gvkey2'], right_on=['gvkey1', 'year', 'gvkey2'])

Shift years one more time to get _lead2_ similarity scores.

In [22]:
tnic['year'] = tnic['year'] - 1
tnic.rename(columns={'score_lead1':'score_lead2'}, inplace=True)

In [23]:
tnic_industry = pd.merge(tnic_industry, tnic, how='left', left_on=['gvkey1', 'year', 'gvkey2'], right_on=['gvkey1', 'year', 'gvkey2'])

In [24]:
tnic_industry.drop_duplicates(inplace=True)

In [25]:
tnic_industry.set_index(['gvkey1', 'year', 'gvkey2'], inplace=True, verify_integrity=True)

Reset __tnic__ years and column name back to original

In [26]:
tnic['year'] = tnic['year'] + 2
tnic.rename(columns={'score_lead2':'score'}, inplace=True)
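
The shift, merge, and reset cells above temporarily mutate __tnic__. As a hedged alternative, the same lead scores can be built from a shifted copy, which leaves the original frame untouched. The helper `add_lead_score` and the toy values below are hypothetical and only illustrate the alignment logic.

```python
import pandas as pd

def add_lead_score(base: pd.DataFrame, scores: pd.DataFrame, lead: int) -> pd.DataFrame:
    """Left-join the score observed `lead` years after the base year, without mutating `scores`."""
    # Subtracting `lead` from year makes the score of year t + lead sit on base year t.
    shifted = scores.assign(year=scores['year'] - lead).rename(columns={'score': f'score_lead{lead}'})
    return base.merge(shifted, how='left', on=['gvkey1', 'year', 'gvkey2'])

# Toy data (hypothetical): one pair observed in three consecutive years.
toy_scores = pd.DataFrame({
    'gvkey1': [1, 1, 1],
    'gvkey2': [2, 2, 2],
    'year': [2000, 2001, 2002],
    'score': [0.10, 0.20, 0.30],
})
toy_base = toy_scores[toy_scores['year'] == 2000]
toy_leads = add_lead_score(add_lead_score(toy_base, toy_scores, 1), toy_scores, 2)
```

For base year 2000 the toy result carries the 2001 score as `score_lead1` and the 2002 score as `score_lead2`. Pairs that are absent in a later year get a missing value from the left join.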

Average TNIC similarity score across the 20 closest competitors.  
Remember that in __TNIC_ALL__ most of the scores equal zero. The _z\__ averages, which fill missing scores with zero, might be more suitable.

In [27]:
avg_sim = tnic_industry.groupby(level=['gvkey1','year']).mean()
avg_sim = avg_sim.join(tnic_industry.groupby(level=['gvkey1','year']).count().add_prefix("n_"))
avg_sim = avg_sim.join(tnic_industry.fillna(0).groupby(level=['gvkey1','year']).mean().add_prefix("z_"))
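
To make the difference between the plain and the _z\__ averages concrete, here is a small sketch on toy data. It assumes, as the cell above does, that a missing lead score can be treated as a zero similarity. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): four competitors, two of which have no lead1 score.
toy = pd.DataFrame({
    'gvkey1': [1, 1, 1, 1],
    'year': [2000] * 4,
    'score_lead1': [0.10, 0.30, np.nan, np.nan],
})
grouped = toy.groupby(['gvkey1', 'year'])['score_lead1']
mean_observed = grouped.mean()    # missing scores are skipped
n_observed = grouped.count()      # number of non-missing scores
# Zero-filled mean: missing scores count as zero in the numerator and as rows in the denominator.
mean_zero_filled = toy.fillna(0).groupby(['gvkey1', 'year'])['score_lead1'].mean()
```

The zero-filled mean equals the observed mean scaled by the share of competitors still present, which matches the pattern between the plain, _n\__ and _z\__ columns of __avg_sim__.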

In [28]:
avg_sim

| gvkey1 | year | score | score_lead1 | score_lead2 | n_score | n_score_lead1 | n_score_lead2 | z_score | z_score_lead1 | z_score_lead2 |
|:------:|:----:|:-----:|:-----------:|:-----------:|:-------:|:-------------:|:-------------:|:-------:|:-------------:|:-------------:|
| 1004 | 1997 | 0.028105 | 0.016573 | 0.025478 | 20 | 11 | 9 | 0.028105 | 0.009115 | 0.011465 |
| 1004 | 1998 | 0.015110 | 0.021625 | 0.033844 | 20 | 12 | 9 | 0.015110 | 0.012975 | 0.015230 |
| 1013 | 1996 | 0.095175 | 0.088694 | 0.087237 | 20 | 17 | 16 | 0.095175 | 0.075390 | 0.069790 |
| 1013 | 1997 | 0.097925 | 0.092165 | 0.078647 | 20 | 20 | 19 | 0.097925 | 0.092165 | 0.074715 |
| 1013 | 1998 | 0.096375 | 0.077726 | 0.068343 | 20 | 19 | 14 | 0.096375 | 0.073840 | 0.047840 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 317264 | 2015 | 0.097115 | 0.081853 | 0.080572 | 20 | 17 | 18 | 0.097115 | 0.069575 | 0.072515 |
| 317264 | 2016 | 0.098755 | 0.101350 | | 20 | 20 | 0 | 0.098755 | 0.101350 | 0.000000 |
| 317264 | 2017 | 0.101895 | | | 20 | 0 | 0 | 0.101895 | 0.000000 | 0.000000 |
| 318728 | 2016 | 0.144200 | 0.138850 | | 20 | 18 | 0 | 0.144200 | 0.124965 | 0.000000 |


In [29]:
# avg_sim.dropna() 
# # 54963 observations with non-missing scores

# avg_sim[(avg_sim['n_score'] == 20) & (avg_sim['n_score_lead1'] == 20) & (avg_sim['n_score_lead2'] == 20)]
# # 991 observations with all 20 competitors present in TNIC

## IV candidates

The materiality measure based on deal value will be the last resort for the IV.   
Alternatively, 2SLS using multiple IVs is feasible.

Candidates
* Max deal value
* Sum deal value
* Date difference between _dateeff_ and _datadate_
    * _dateeff_ of the first M&A
    * _dateeff_ of the largest M&A
    * weighted average of _dateeff_ 
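
The candidates listed above could be computed per acquirer fiscal year roughly as follows. This is only a sketch under stated assumptions: it uses _rankval_ as the deal value, measures the date difference as days from _dateeff_ to _datadate_, weights the average by deal value, and assumes deals with missing _rankval_ were dropped beforehand. The helper `iv_candidates`, its output column names, and the toy deals are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): three deals by one acquirer within one fiscal year.
toy_deals = pd.DataFrame({
    'agvkey': [1004, 1004, 1004],
    'fyear': [1997, 1997, 1997],
    'rankval': [10.0, 30.0, 60.0],
    'dateeff': pd.to_datetime(['1997-06-19', '1997-10-24', '1998-03-02']),
    'datadate': pd.to_datetime(['1998-05-31'] * 3),
})

def iv_candidates(deals: pd.DataFrame) -> pd.DataFrame:
    """Aggregate deal-level data to one row of IV candidates per acquirer fiscal year."""
    d = deals.copy()
    d['days_to_fye'] = (d['datadate'] - d['dateeff']).dt.days
    d['weighted_days'] = d['days_to_fye'] * d['rankval']
    d = d.sort_values('dateeff')  # so that 'first' picks the earliest deal
    grouped = d.groupby(['agvkey', 'fyear'])
    out = grouped.agg(max_dval=('rankval', 'max'),
                      sum_dval=('rankval', 'sum'),
                      days_first=('days_to_fye', 'first'),
                      weighted_days_sum=('weighted_days', 'sum'))
    # Row of the largest deal in each group, aligned on the group keys.
    largest = d.loc[grouped['rankval'].idxmax()].set_index(['agvkey', 'fyear'])
    out['days_largest'] = largest['days_to_fye']
    out['days_wavg'] = out.pop('weighted_days_sum') / out['sum_dval']
    return out

toy_iv = iv_candidates(toy_deals)
```

Having several candidates in one frame makes it straightforward to pass more than one instrument to a 2SLS routine later.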

## Import previously constructed datasets

### Materiality of M&A

In [30]:
material = pd.read_csv('/Users/ohn0000/Project/cko/0_data/external/materiality.csv')
material.set_index(["year", "gvkey1"], inplace=True, verify_integrity=True)

### M&A Disclosure

Disclosure data might also need additional collection.

In [31]:
disc = pd.read_csv('/Users/ohn0000/Project/cko/0_data/manual/disc.csv', parse_dates=['DATADATE'])
disc['CIK'] = disc['CIK'].apply(lambda x: str(int(x)).zfill(10) if pd.notnull(x) else None)

In [32]:
disc.rename(columns={"GVKEY":"gvkey1", "FYEAR":"year"}, inplace=True)
disc.set_index(["year", "gvkey1"], inplace=True, verify_integrity=True)

In [33]:
manual = disc.join(material)[['DATADATE', 'CIK', 'TGTAT_ACQAT', 'TGTDVAL_ACQAT', 'MD_A', 'PROFORMA']].sort_index()

### SDC and Compustat Link File

The link file is from [Michael Ewens](https://github.com/michaelewens/SDC-to-Compustat-Mapping.git). Cite the papers below.

```
@article{phillips2013r,
  title={R\&D and the Incentives from Merger and Acquisition Activity},
  author={Phillips, Gordon M and Zhdanov, Alexei},
  journal={The Review of Financial Studies},
  volume={26},
  number={1},
  pages={34--78},
  year={2013},
  publisher={Society for Financial Studies}
  }
 ```

```
@article{ewensPetersWang2018,
 title={Acquisition prices and the measurement of intangible capital},
 author={Ewens, Michael and Peters, Ryan and Wang, Sean},
 journal={Working Paper},
 year={2018}
 }
```

In [34]:
sdc_link = pd.read_csv('/Users/ohn0000/Project/cko/0_data/external/dealnum_to_gvkey.csv', 
                       dtype={'DealNumber':'Int64', 'agvkey':'Int64', 'tgvkey':'Int64'}, index_col='DealNumber')

In [35]:
# import wrds
# db = wrds.Connection(wrds_username = "yaera")
# ma_details_desc = db.describe_table('sdc', 'ma_details').sort_values('name')
# with pd.option_context('display.max_rows', None):
#     print(ma_details_desc)

|     Variable | Description                    |
|:------------:|:-------------------------------|
|bookvalue     |Target Book Value (\$mil)       |
|compete       |Competing Bidder (Y/N)          |
|competecode   |Competing Bid Deal Code         |  
|dateann       |Date Announced                  |
|dateannest    |_dateann_ is estimated (Y/N)    | 
|dateeff       |Date Effective                  | 
|ebitltm       |Target EBIT LTM (\$mil)         |
|pct_cash      |Percentage of consideration paid in cash|
|pct_other|Percentage of consideration paid in other than cash or stock|
|pct_stk|Percentage of consideration paid in stock|
|pct_unknown|Percentage of consideration which is unknown|
|ptincltm|Target Pre-Tax Income LTM (\$mil)|
|salesltm|Target Sales LTM (\$mil)|
|rankval|Ranking Value incl Net Debt of Target (\$mil)|

Run the SQL query below on _WRDS_.

In [36]:
# import wrds
# sdc_query = """
# select master_deal_no as DealNumber, 
#         bookvalue, 
#         compete, 
#         competecode, 
#         dateann, 
#         dateannest, 
#         dateeff, 
#         ebitltm, 
#         pct_cash,
#         pct_other,
#         pct_stk,
#         pct_unknown,
#         ptincltm,
#         salesltm,
#         rankval
# from sdc.ma_details
# where dateeff is not null 
# """
# # and master_deal_no in %(deal_no)s
# sdc = db.raw_sql(sdc_query, date_cols=['dateann', 'dateeff'])
# sdc.to_pickle('/home/upenn/yaera/sdc.pkl')

In [37]:
sdc = pd.read_pickle('/Users/ohn0000/Project/cko/0_data/external/sdc.pkl')
sdc.drop_duplicates('dealnumber', inplace = True)
sdc['dealnumber'] = sdc['dealnumber'].apply(int)

# Clean up values and change dtype to float: '*********' placeholders are treated
# as missing, and numbers stored as strings may contain thousands separators.
def to_float(x):
    if pd.isna(x) or x == '*********':
        return np.nan
    return float(x.replace(',', '')) if isinstance(x, str) else float(x)

for column in ['bookvalue', 'ebitltm', 'pct_cash', 'pct_other', 'pct_stk', 'pct_unknown', 'ptincltm', 'salesltm', 'rankval']:
    sdc[column] = sdc[column].apply(to_float).astype('float64')

In [38]:
sdc_sub = pd.merge(sdc_link, sdc, left_index=True, right_on='dealnumber').drop('dealnumber', axis='columns')
sdc_sub.index.name = 'dealnumber'

In [39]:
sdc_sub.sort_values(['agvkey', 'dateeff'], inplace=True)

Use __compustat__ _datadate_ and _gvkey_ to link the SDC data to the similarity scores.

In [40]:
import wrds
db = wrds.Connection(wrds_username = 'hohn')

funda_query = """
select gvkey, datadate, fyear, cusip, cik
from comp.funda
where consol = %(consol)s and indfmt in %(indfmt)s and datafmt = %(datafmt)s and popsrc = %(popsrc)s and curcd in %(curcd)s
"""

# Scalars bind to the "=" filters; tuples bind to the "in" filters.
parm = {'consol': 'C', 'indfmt': ('INDL', 'FS'), 'datafmt': 'STD', 'popsrc': 'D', 'curcd': ('USD', 'CAD')}

funda = db.raw_sql(funda_query, params=parm, date_cols=['datadate'])

Loading library list...
Done


In [41]:
funda['start'] = funda['datadate'] - pd.DateOffset(months = 12) + pd.DateOffset(days = 1)
funda['gvkey'] = funda['gvkey'].astype('int64')
funda.set_index('gvkey', inplace=True)

In [42]:
funda.fyear = funda.fyear.astype('Int16')

In [43]:
import pandasql as ps

sql_query = '''
select a.*, b.datadate, b.fyear, b.cusip, b.cik
from sdc_sub a left join funda b
on a.agvkey = b.gvkey and a.dateeff between b.start and b.datadate
'''

newdf = ps.sqldf(sql_query, locals())
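
The SQL join above matches each deal to the fiscal year whose window, from _start_ to _datadate_, contains _dateeff_. A pandas-only sketch of the same range join is shown below on toy data: an equi-join on the firm key, a filter on the date window, and a left join back so unmatched deals are kept. The toy frames are hypothetical, and this is an alternative formulation rather than the code used in the notebook.

```python
import pandas as pd

# Toy data (hypothetical): two deals by gvkey 1004 and one by a firm absent from funda.
toy_deals = pd.DataFrame({
    'dealnumber': [1, 2, 3],
    'agvkey': [1004, 1004, 9999],
    'dateeff': pd.to_datetime(['1997-06-19', '2000-09-29', '2001-01-15']),
})
toy_funda = pd.DataFrame({
    'gvkey': [1004, 1004],
    'datadate': pd.to_datetime(['1998-05-31', '2001-05-31']),
    'fyear': [1997, 2000],
})
toy_funda['start'] = toy_funda['datadate'] - pd.DateOffset(months=12) + pd.DateOffset(days=1)

# Equi-join on the firm key, then keep rows whose effective date falls in the fiscal year.
candidates = toy_deals.merge(toy_funda, how='left', left_on='agvkey', right_on='gvkey')
in_window = candidates['dateeff'].between(candidates['start'], candidates['datadate'])
matched = candidates.loc[in_window, ['dealnumber', 'datadate', 'fyear']]
# Left-join back so that deals without a matching fiscal year are kept with missing values.
linked = toy_deals.merge(matched, how='left', on='dealnumber')
```

The intermediate `candidates` frame holds every deal paired with every fiscal year of the acquirer, so on large inputs it can be much bigger than either input.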

In [44]:
col = list(newdf)
for i in range(2, 6):
    col.insert(i, col.pop(-1))
newdf = newdf.loc[:,col]

In [45]:
for i in ['datadate', 'dateann', 'dateeff']:
    newdf[i] = newdf[i].astype('datetime64[ns]')
    
newdf['year'] = newdf['datadate'].dt.year.astype('Int16')
for i in ['fyear', 'agvkey', 'tgvkey']:
    newdf[i] = newdf[i].astype('Int64')

In [46]:
col = list(newdf)
col.insert(col.index('datadate'), col.pop(col.index('year')))
newdf = newdf.loc[:,col]

In [47]:
newdf = newdf.drop_duplicates(subset='dealnumber')

In [48]:
newdf[newdf['agvkey'].notnull()]

| | dealnumber | agvkey | cik | cusip | fyear | year | datadate | tgvkey | bookvalue | compete | ... | dateannest | dateeff | ebitltm | pct_cash | pct_other | pct_stk | pct_unknown | ptincltm | salesltm | rankval |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 2238597 | 1004 | 0000001750 | 000361105 | 1997 | 1998 | 1998-05-31 | | | | ... | No | 1997-06-19 | | | | | | | 45.000 | |
| 1 | 2273624 | 1004 | 0000001750 | 000361105 | 1997 | 1998 | 1998-05-31 | | | | ... | No | 1997-10-24 | | | | | | | 18.000 | |
| 2 | 2570557 | 1004 | 0000001750 | 000361105 | 2000 | 2001 | 2001-05-31 | 1300 | | | ... | No | 2000-09-29 | | 18.75 | 81.25 | | | | 20.000 | 0.016 |
| 3 | 3307499 | 1004 | 0000001750 | 000361105 | 2006 | 2007 | 2007-05-31 | | | | ... | No | 2007-04-03 | | | | | | | | |
| 4 | 3419980 | 1004 | 0000001750 | 000361105 | 2007 | 2008 | 2008-05-31 | | | | ... | No | 2007-12-03 | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33059 | 1029741 | 289735 | 0001565228 | 92836Y201 | 2016 | 2016 | 2016-12-31 | | | | ... | No | 2016-02-01 | | | 100.00 | | | | 12.238 | 3.000 |
| 33060 | 402724 | 296753 | 0001528903 | N07831105 | 2013 | 2013 | 2013-12-31 | | | | ... | No | 2013-01-29 | | | | | | | | |
| 33061 | 850888 | 314866 | 0001492966 | G6583A102 | 2015 | 2015 | 2015-08-31 | | | No | ... | No | 2015-06-25 | | 100.00 | | | | | | 559.000 |
| 33062 | 687462 | 315318 | 0001590714 | 28618M106 | 2014 | 2014 | 2014-12-31 | | | | ... | No | 2014-10-01 | | | | | 100.0 | | 169.854 | 401.553 |


In [49]:
newdf['rankval'].count()

18994

18994 observations with non-missing _rankval_

In [50]:
newdf['salesltm'].count()

8055

8055 observations with non-missing _salesltm_

In [51]:
np.sum(newdf['rankval'].notnull() & newdf['salesltm'].notnull())

6445

6445 observations with both _rankval_ and _salesltm_ available

## Append similarity score between acquirer and target

In [69]:
upload = newdf[newdf['agvkey'].notnull() & newdf['tgvkey'].notnull() & newdf['year'].notnull()][['agvkey', 'tgvkey', 'year']].rename(columns={'agvkey':'gvkey1', 'tgvkey':'gvkey2'})
upload.to_csv('/Users/ohn0000/Project/cko/2_pipeline/upload.csv', index=False)
!scp /Users/ohn0000/Project/cko/2_pipeline/upload.csv $WRDS:/scratch/ou/hohn

upload.csv                                    100%  113KB 448.0KB/s   00:00    


Run this on the WRDS server. The __TNIC_All__ files should be uploaded to scratch beforehand.

In [None]:
"""
The server killed the previous code, which joined after combining all files. The current code instead loops over the files.
"""
# !cd /scratch/ou/hohn/TNIC_AllPairsDistrib
# !cat tnicall1996.txt > tnicall_combined.txt
# !for file in tnicall{1997..2017}.txt; do sed '1d' $file >> tnicall_combined.txt; done
# !cd ~


"""
atsim.py
"""
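
The contents of atsim.py are not reproduced in this notebook. The following is only a sketch of the file-by-file approach described above, not the actual script. It assumes each yearly file is tab-delimited with columns _year_, _gvkey1_, _gvkey2_ and _score_, and the function name `match_pairs_by_year` is hypothetical.

```python
import os
import tempfile
import pandas as pd

def match_pairs_by_year(pairs: pd.DataFrame, tnic_dir: str, years) -> pd.DataFrame:
    """Look up each (gvkey1, gvkey2, year) pair in that year's file, one file at a time."""
    keys = ['gvkey1', 'gvkey2', 'year']
    chunks = []
    for year in years:
        path = os.path.join(tnic_dir, f'tnicall{year}.txt')
        wanted = pairs[pairs['year'] == year]
        if wanted.empty or not os.path.exists(path):
            continue
        # Only one year's file is held in memory at any time.
        scores = pd.read_csv(path, delimiter='\t', header=0)
        chunks.append(wanted.merge(scores, how='inner', on=keys))
    if not chunks:
        return pd.DataFrame(columns=keys + ['score'])
    return pd.concat(chunks, ignore_index=True)

# Toy check with two hypothetical yearly files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year, score in [(1996, 0.05), (1997, 0.07)]:
        pd.DataFrame({'year': [year], 'gvkey1': [1004], 'gvkey2': [1300], 'score': [score]}).to_csv(
            os.path.join(tmp, f'tnicall{year}.txt'), sep='\t', index=False)
    toy_pairs = pd.DataFrame({'gvkey1': [1004, 1004], 'gvkey2': [1300, 1300], 'year': [1997, 1998]})
    toy_atsim = match_pairs_by_year(toy_pairs, tmp, range(1996, 1999))
```

On the server, a function like this would presumably be called with the pairs from upload.csv and the TNIC_AllPairsDistrib directory, and its result written to atsim.csv.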


In [72]:
!scp atsim.py $WRDS:~

atsim.py                                      100%  939    19.4KB/s   00:00    


In [None]:
!scp $WRDS:/scratch/ou/hohn/atsim.csv /Users/ohn0000/Project/cko/2_pipeline/

In [None]:
col = list(newdf)
col.insert(col.index('bookvalue'), col.pop(col.index('atsim')))
newdf = newdf.loc[:,col]
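
The reordering cell above expects an _atsim_ column in __newdf__, but the merge that creates it is not shown. A minimal sketch of that step follows, assuming atsim.csv has columns _gvkey1_, _gvkey2_, _year_ and _score_ and that the key columns have compatible dtypes. The helper `append_atsim` and the toy values are hypothetical.

```python
import pandas as pd

def append_atsim(deals: pd.DataFrame, atsim: pd.DataFrame) -> pd.DataFrame:
    """Left-join acquirer-target similarity onto deal rows as a new `atsim` column."""
    sim = atsim.rename(columns={'gvkey1': 'agvkey', 'gvkey2': 'tgvkey', 'score': 'atsim'})
    # One similarity value per (acquirer, target, year) so the join cannot duplicate deals.
    sim = sim[['agvkey', 'tgvkey', 'year', 'atsim']].drop_duplicates(['agvkey', 'tgvkey', 'year'])
    return deals.merge(sim, how='left', on=['agvkey', 'tgvkey', 'year'])

# Toy data (hypothetical): only the first deal has a similarity score available.
toy_deals = pd.DataFrame({
    'dealnumber': [101, 102],
    'agvkey': [1004, 1004],
    'tgvkey': [1300, 2000],
    'year': [2001, 2007],
})
toy_atsim = pd.DataFrame({'gvkey1': [1004], 'gvkey2': [1300], 'year': [2001], 'score': [0.0312]})
toy_merged = append_atsim(toy_deals, toy_atsim)
```

Deals without a score keep a missing _atsim_ value, so the number of rows is unchanged by the merge.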

## Cross-sections
* Similarity between acquirer and target 
    - Relation stronger in diversifying
    - Could be more of a U-shaped relation, i.e., competitors don't follow when you move far enough
* Average value of pre-similarities between acquirer and close competitors 
    - Prediction not clear
* M&A performance during the completed firm-year
    - Relation stronger when the M&A was more successful (open question: how do we define the success of an M&A?)
* Number of close competitors of the target
    - Potential targets are candidates of future mergers
* How many competitors were there initially?
    - The size of the TNIC industry