#MIP Merge

This script merges the different data sets (MUP ownership data, MUP MIP panel on the owners and the companies), turns them into panels and cleans the different columns

In [1]:
import pandas as pd
import numpy as np

Load the data into DataFrames

In [2]:
df_ownership = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPOwn.csv", 
                           encoding="ISO-8859-1")
df_companies = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owned.csv", 
                           encoding="ISO-8859-1")
df_owners = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owner.csv", 
                        encoding="ISO-8859-1")

  df_companies = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owned.csv",
  df_owners = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owner.csv",


The flag `b_is_main_owner` is used to seperate minority from majority shareholders (defined as the owner of at least 50% of the equity), and, where there is no information on the percentage owned, only owners with following "characteristics" (dt. Eigenschaft) were considered majority: "Owner" (Inhaber), "Shareholder" (Gesellschafter), "Limited Partner" (Kommanditist), "General Partner" (Komplementär), and "Majority Shareholder" (Hauptaktionär)

In [3]:
df_ownership["b_is_main_owner"] = np.where(df_ownership["b_anteil"] >= 50 
                                           | df_ownership["b_eigenschaft"].str.contains(
                                               "Inhaber|Gesellschafter|Kommanditist|Komplementär|Hauptaktionär",
                                                 regex=True), True, False)
df_ownership["b_is_main_owner"] = np.where(df_ownership["b_anteil"] < 50, False, True)

Specify the start and end year of the participation, as a step to turn the ownership data into a panel. The entries which have no start nor end date will be assigned all the years where there are observations in the MIP data set (first year 1993, last year 2021). The end date is set to 2023 for all participations which didn't end in the observation period or where there is no information so that 2021 is within the start to end range

In [4]:
df_ownership["b_start_year"] = df_ownership["b_beginn"].astype(str).str[:4]
df_ownership["b_end_year"] = df_ownership["b_ende"].astype(str).str[:4]
df_ownership["b_start_year"] = np.where(df_ownership["b_start_year"] == "0.0", 1993, df_ownership["b_start_year"])
df_ownership["b_start_year"] = np.where(df_ownership["b_start_year"] == "nan", 1993, df_ownership["b_start_year"])
df_ownership["b_end_year"] = np.where(df_ownership["b_end_year"] == "0.0", 2023, df_ownership["b_end_year"])
df_ownership["b_end_year"] = np.where(df_ownership["b_end_year"] == "nan", 2023, df_ownership["b_end_year"])

Parse `b_start_year` and `b_end_year` to integers

In [5]:
df_ownership["b_start_year"] = pd.to_numeric(df_ownership["b_start_year"], downcast="integer")
df_ownership["b_end_year"] = pd.to_numeric(df_ownership["b_end_year"], downcast="integer")

The array `survey_years` contains all the years with sample data for the companies in the MIP panel. Now I will create dummy variables for all sample years, so that we transform the ownership data frame later to a panel

In [65]:
survey_years = np.unique(df_companies["jahr"])
for i in range(len(survey_years)):
    df_ownership[str(survey_years[i])] = np.where((df_ownership["b_start_year"] <= survey_years[i]) 
                                                  & (df_ownership["b_end_year"] > survey_years[i]),True, False)

`df_panel_ownership` now has the structure of a panel data set, after using `df.melt()`

In [66]:
df_panel_ownership = df_ownership.melt(id_vars=['crefo', 'b_crefo', 'b_eigenschaft', 'b_betrag', 'b_anteil', 
                                                'b_beginn','b_ende', 'b_firma', 'b_person', 'welle', 'companyid', 
                                                'ownerid','b_is_main_owner', 'b_start_year', 'b_end_year'],
                                                  var_name="panel_year")

In this step, I filtered data set to only have the years where there was an observation on the owner and to only have the main owners included, and drop the variables afterwards

In [67]:
df_panel_ownership = df_panel_ownership[df_panel_ownership["value"] == True]
df_panel_ownership = df_panel_ownership[df_panel_ownership["b_is_main_owner"] == True]
df_panel_ownership.drop(labels=["value", "b_is_main_owner"], axis=1, inplace=True)

Currently, we have 127.442 different owners corresponding to 25.306 companies, and 4.6 million observations in our data set

In [68]:
print(len(df_panel_ownership["b_crefo"].unique()))
print(len(df_panel_ownership))
print(len(df_panel_ownership["companyid"].unique()))

127442
4624052
25306


Inner merge with companies data: the option `inner` when merging the ownership panel data and the companies panel data ensures that only companies with ownership data (and vice-versa, i.e. only ownership data linked to a company) end up in the merged data set

In [69]:
df_panel_ownership["panel_year"] = pd.to_numeric(df_panel_ownership["panel_year"])
df_merged_companies = pd.merge(df_panel_ownership, df_companies, how="inner", left_on=["panel_year", "companyid"], right_on=["jahr", "companyid"])
df_merged_companies

Unnamed: 0,crefo,b_crefo,b_eigenschaft,b_betrag,b_anteil,b_beginn,b_ende,b_firma,b_person,welle_x,...,ghe3,ghp1,ghp2,ghvarp,maein1,maein2,maein3,maein4,maein5,_merge
0,2010141336,2010000074,Gesellschafter,1000000.0,,20011001.0,0.0,Unternehmen,Unternehmen,62,...,,,,,,,,,,matched (3)
1,2010141336,2010000074,Gesellschafter,1000000.0,100.0,20011001.0,0.0,Unternehmen,Unternehmen,30,...,,,,,,,,,,matched (3)
2,2010141336,2010000074,Gesellschafter,1000000.0,,20011001.0,0.0,nat. Person,Unternehmen,57,...,,,,,,,,,,matched (3)
3,2010141336,2012141877,,,,20070129.0,0.0,nat. Person,nat. Person,57,...,,,,,,,,,,matched (3)
4,2010837305,2010000524,Hauptaktionär,100.0,100.0,0.0,0.0,Unternehmen,Unternehmen,30,...,,,,,,,,,,matched (3)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
477906,8130162928,8350233990,Gesellschafter,25565.0,,20090114.0,0.0,nat. Person,nat. Person,57,...,,,,,,,,,,matched (3)
477907,8130162928,8350233990,Geschäftsführer,,,20081218.0,0.0,nat. Person,nat. Person,57,...,,,,,,,,,,matched (3)
477908,8130162928,8350233990,Gesellschafter,25564.0,100.0,20090114.0,0.0,nat. Person,nat. Person,52,...,,,,,,,,,,matched (3)
477909,8330201961,8350236630,Inhaber,,,19780000.0,0.0,nat. Person,nat. Person,57,...,,,,,,,,,,matched (3)


25.025 companies contained in the merged data set

In [71]:
len(df_merged_companies["companyid"].value_counts())

25025

Create a file with all variables and their descriptions, in order to sort out which ones are needed

In [13]:
df_columns = pd.DataFrame(df_merged_companies.columns)
df_columns.columns = ["labels"]
df_descriptions_own = pd.read_excel(r"C:\Users\lucas\OneDrive\BA\Data\MUPOwn_panel_variables.xlsx")
df_descriptions_mip = pd.read_excel(r"C:\Users\lucas\OneDrive\BA\Data\MUPMIP_panel_variables.xlsx")
df_columns = pd.merge(df_columns, df_descriptions_own, how="left", left_on="labels", right_on="name")
df_columns = pd.merge(df_columns, df_descriptions_mip, how="left", left_on="labels", right_on="name")
df_columns.to_csv(r"C:\Users\lucas\OneDrive\BA\Data\merged_variables.csv")

Data exploration (code blocks below)

In [72]:
df_merged_companies.groupby(["companyid", "jahr"])["b_crefo"].nunique().value_counts()

b_crefo
1      28426
2      24444
3      14474
4       7392
5       3970
       ...  
112        1
72         1
88         1
124        1
233        1
Name: count, Length: 120, dtype: int64

In [73]:
df_merged_companies.groupby(["companyid"])["b_crefo"].nunique().value_counts()

b_crefo
1      7420
2      6484
3      4040
4      2359
5      1315
       ... 
105       1
56        1
48        1
277       1
233       1
Name: count, Length: 90, dtype: int64

In [74]:
df_merged_companies[df_merged_companies["oekpz1"].notna()]["jahr"].value_counts().sort_index()

jahr
2008    23630
2014    26025
2020    33184
Name: count, dtype: int64

In [75]:
df_companies[df_companies["oekpz1"].notna()]["jahr"].value_counts().sort_index()

jahr
2008    5987
2014    5224
2020    4841
Name: count, dtype: int64

In [76]:
df_merged_companies["jahr"].value_counts().sort_index()

jahr
2006    18334
2007    21607
2008    26452
2009    24311
2010    26601
2011    29932
2012    30647
2013    26201
2014    30700
2015    26383
2016    32055
2017    31674
2018    33915
2019    40999
2020    38903
2021    39197
Name: count, dtype: int64

In [77]:
print("Number of ownerid observations:", df_merged_companies["ownerid"].count())
print("Number of unique ownerids:", df_merged_companies["ownerid"].nunique())

Number of ownerid observations: 14931
Number of unique ownerids: 1381


In [78]:
print("Number of companyid observations:", df_merged_companies["companyid"].count())
print("Number of unique companyids:", df_merged_companies["companyid"].nunique())

Number of companyid observations: 477911
Number of unique companyids: 25025


In [79]:
print("Number of b_crefo observations:", df_merged_companies["b_crefo"].count())
print("Number of b_crefo observations:", df_merged_companies["b_crefo"].nunique())

Number of b_crefo observations: 477911
Number of b_crefo observations: 87577


In [94]:
df_merged_companies.loc[np.where((df_merged_companies["companyid"] == 3) & 
                                 (df_merged_companies["jahr"] == 2007))].to_csv(
                                     r"C:\Users\lucas\OneDrive\BA\Data\company3.csv")