#MIP Merge

This script merges the different data sets (MUP ownership data, MUP MIP panel on the owners and the companies), turns them into panels and cleans the different columns

In [1]:
import pandas as pd
import numpy as np

Load the data into DataFrames

In [2]:
df_ownership = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPOwn.csv", 
                           encoding="ISO-8859-1")
df_companies = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owned.csv", 
                           encoding="ISO-8859-1")
df_owners = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owner.csv", 
                        encoding="ISO-8859-1")

  df_companies = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owned.csv",
  df_owners = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owner.csv",


The flag `b_is_main_owner` is used to seperate minority from majority shareholders (defined as the owner of at least 50% of the equity), and, where there is no information on the percentage owned, only owners with following "characteristics" (dt. Eigenschaft) were considered majority: "Owner" (Inhaber), "Shareholder" (Gesellschafter), "Limited Partner" (Kommanditist), "General Partner" (Komplementär), and "Majority Shareholder" (Hauptaktionär)

In [3]:
df_ownership["b_is_main_owner"] = np.where(df_ownership["b_anteil"] >= 50 
                                           | df_ownership["b_eigenschaft"].str.contains(
                                               "Inhaber|Gesellschafter|Kommanditist|Komplementär|Hauptaktionär",
                                                 regex=True), True, False)
df_ownership["b_is_main_owner"] = np.where(df_ownership["b_anteil"] < 50, False, True)

Specify the start and end year of the participation, as a step to turn the ownership data into a panel. The entries which have no start nor end date will be assigned all the years where there are observations in the MIP data set (first year 1993, last year 2021). The end date is set to 2023 for all participations which didn't end in the observation period or where there is no information so that 2021 is within the start to end range

In [4]:
df_ownership["b_start_year"] = df_ownership["b_beginn"].astype(str).str[:4]
df_ownership["b_end_year"] = df_ownership["b_ende"].astype(str).str[:4]
df_ownership["b_start_year"] = np.where(df_ownership["b_start_year"] == "0.0", 1993, df_ownership["b_start_year"])
df_ownership["b_start_year"] = np.where(df_ownership["b_start_year"] == "nan", 1993, df_ownership["b_start_year"])
df_ownership["b_end_year"] = np.where(df_ownership["b_end_year"] == "0.0", 2023, df_ownership["b_end_year"])
df_ownership["b_end_year"] = np.where(df_ownership["b_end_year"] == "nan", 2023, df_ownership["b_end_year"])

Parse `b_start_year` and `b_end_year` to integers

In [5]:
df_ownership["b_start_year"] = pd.to_numeric(df_ownership["b_start_year"], downcast="integer")
df_ownership["b_end_year"] = pd.to_numeric(df_ownership["b_end_year"], downcast="integer")

The array `survey_years` contains all the years with sample data for the companies in the MIP panel. Now I will create dummy variables for all sample years, so that we transform the ownership data frame later to a panel

In [6]:
survey_years = np.unique(df_companies["smpljahr"])
for i in range(len(survey_years)):
    df_ownership[str(survey_years[i])] = np.where((df_ownership["b_start_year"] <= survey_years[i]) 
                                                  & (df_ownership["b_end_year"] > survey_years[i]),True, False)

`df_panel_ownership` now has the structure of a panel data set, after using `df.melt()`

In [7]:
df_panel_ownership = df_ownership.melt(id_vars=['crefo', 'b_crefo', 'b_eigenschaft', 'b_betrag', 'b_anteil', 
                                                'b_beginn','b_ende', 'b_firma', 'b_person', 'welle', 'companyid', 
                                                'ownerid','b_is_main_owner', 'b_start_year', 'b_end_year'],
                                                  var_name="panel_year")

In this step, I filtered data set to only have the years where there was an observation on the owner and to only have the main owners included, and drop the variables afterwards

In [8]:
df_panel_ownership = df_panel_ownership[df_panel_ownership["value"] == True]
df_panel_ownership = df_panel_ownership[df_panel_ownership["b_is_main_owner"] == True]
df_panel_ownership.drop(labels=["value", "b_is_main_owner"], axis=1, inplace=True)

Currently, we have 127.400 different owners corresponding to 25.306 companies, and 3.9 million observations in our data set

In [9]:
print(len(df_panel_ownership["b_crefo"].unique()))
print(len(df_panel_ownership))
print(len(df_panel_ownership["companyid"].unique()))

127400
3949831
25306


Inner merge with companies data: the option `inner` when merging the ownership panel data and the companies panel data ensures that only companies with ownership data (and vice-versa, i.e. only ownership data linked to a company) end up in the merged data set

In [10]:
df_panel_ownership["panel_year"] = pd.to_numeric(df_panel_ownership["panel_year"])
df_merged_companies = pd.merge(df_panel_ownership, df_companies, how="inner", left_on=["panel_year", "companyid"], right_on=["smpljahr", "companyid"])
df_merged_companies

Unnamed: 0,crefo,b_crefo,b_eigenschaft,b_betrag,b_anteil,b_beginn,b_ende,b_firma,b_person,welle_x,...,ghe3,ghp1,ghp2,ghvarp,maein1,maein2,maein3,maein4,maein5,_merge
0,8170003453,2010000581,Gesellschafter,62500000.0,50.0,0.0,20150105.0,Unternehmen,Unternehmen,30,...,,,,,,,,,,matched (3)
1,8170003453,2010000581,Gesellschafter,62500000.0,50.0,0.0,20150105.0,Unternehmen,Unternehmen,30,...,,,,,,,,,,matched (3)
2,8170003453,2010000581,Gesellschafter,62500000.0,50.0,0.0,20150105.0,Unternehmen,Unternehmen,30,...,,,,,,,,,,matched (3)
3,2010169588,2010000581,Gesellschafter,1022590.0,50.0,19910204.0,0.0,Unternehmen,Unternehmen,30,...,,,,,,,,,,matched (3)
4,2010169588,2010000581,Gesellschafter,1022590.0,50.0,19910204.0,0.0,Unternehmen,Unternehmen,30,...,,,,,,,,,,matched (3)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
346655,8130162928,8350233990,Geschäftsführer,,,20081218.0,0.0,nat. Person,nat. Person,57,...,,,,,,,,,,matched (3)
346656,8130162928,8350233990,Geschäftsführer,,,20081218.0,0.0,nat. Person,nat. Person,57,...,ja,nein,ja,80.0,nein,nein,ja,nein,nein,matched (3)
346657,8130162928,8350233990,Gesellschafter,25564.0,100.0,20090114.0,0.0,nat. Person,nat. Person,52,...,,,,,,,,,,matched (3)
346658,8130162928,8350233990,Gesellschafter,25564.0,100.0,20090114.0,0.0,nat. Person,nat. Person,52,...,ja,nein,ja,80.0,nein,nein,ja,nein,nein,matched (3)


23.976 companies contained in the merged data set

In [11]:
len(df_merged_companies["companyid"].value_counts())

23976

Create a file with all variables and their descriptions, in order to sort out which ones are needed

In [13]:
df_columns = pd.DataFrame(df_merged_companies.columns)
df_columns.columns = ["labels"]
df_descriptions_own = pd.read_excel(r"C:\Users\lucas\OneDrive\BA\Data\MUPOwn_panel_variables.xlsx")
df_descriptions_mip = pd.read_excel(r"C:\Users\lucas\OneDrive\BA\Data\MUPMIP_panel_variables.xlsx")
df_columns = pd.merge(df_columns, df_descriptions_own, how="left", left_on="labels", right_on="name")
df_columns = pd.merge(df_columns, df_descriptions_mip, how="left", left_on="labels", right_on="name")
df_columns.to_csv(r"C:\Users\lucas\OneDrive\BA\Data\merged_variables.csv")

In [15]:
df_merged_companies.groupby(["companyid", "smpljahr"])["b_crefo"].nunique().value_counts()

b_crefo
1      9581
2      7241
3      3475
4      1574
5       725
6       372
7       242
8       169
9       103
10       81
11       73
13       55
12       50
14       39
15       35
16       27
17       23
19       18
18       18
20       16
22       12
27       10
26        7
21        7
23        6
37        5
25        4
35        4
24        4
34        4
33        3
40        3
29        3
28        3
41        2
31        2
42        2
189       2
49        2
48        2
61        1
39        1
30        1
38        1
70        1
206       1
94        1
64        1
57        1
43        1
32        1
45        1
44        1
Name: count, dtype: int64

In [16]:
df_merged_companies.groupby(["companyid"])["b_crefo"].nunique().value_counts()

b_crefo
1      9569
2      7229
3      3464
4      1570
5       723
6       373
7       241
8       169
9       104
10       80
11       73
13       54
12       50
14       40
15       34
16       28
17       22
18       19
19       18
20       15
22       12
27       10
21        8
26        7
23        6
37        5
25        4
35        4
24        4
34        4
33        3
40        3
29        3
28        3
41        2
31        2
42        2
189       2
49        2
48        2
61        1
39        1
30        1
38        1
70        1
206       1
94        1
64        1
57        1
43        1
32        1
45        1
44        1
Name: count, dtype: int64

In [29]:
df_merged_companies[df_merged_companies["oekpz1"].notna()]["smpljahr"].value_counts().sort_index()

smpljahr
1993     3537
1994      539
1995     3750
1997      622
1998       59
1999     1689
2001     3494
2003     4031
2004      199
2005     8171
2007     6031
2008        5
2009    10483
2010       23
2011     5662
2012      295
2013     5161
2015     1692
2016        8
2017     2049
2019     1011
2021     1951
Name: count, dtype: int64

In [34]:
df_companies[df_companies["oekpz1"].notna()]["smpljahr"].value_counts().sort_index()

smpljahr
1993    1028
1994     104
1995    1087
1997     198
1998      19
1999     478
2001    1081
2003    1059
2004      58
2005    2081
2007    1800
2008       1
2009    2961
2010       5
2011    1372
2012      67
2013    1257
2015     419
2016       3
2017     468
2019     202
2021     304
Name: count, dtype: int64

In [35]:
df_companies["smpljahr"].value_counts().sort_index()

smpljahr
1993     6086
1994      678
1995     6629
1996       10
1997     1196
1998      144
1999     3096
2001     6672
2003     6637
2004      307
2005    13019
2007    11215
2008        9
2009    12631
2010       29
2011    10225
2012      812
2013     7208
2014        2
2015     1688
2016        7
2017     3455
2019      881
2021      584
Name: count, dtype: int64