#MIP Merge

This script merges the different data sets (MUP ownership data, MUP MIP panel on the owners and the companies), turns them into panels and cleans the different columns

In [20]:
import pandas as pd
import numpy as np

Load the data into DataFrames

In [21]:
df_ownership = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPOwn.csv", 
                           encoding="ISO-8859-1")
df_companies = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owned.csv", 
                           encoding="ISO-8859-1")
df_owners = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owner.csv", 
                        encoding="ISO-8859-1")

  df_companies = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owned.csv",
  df_owners = pd.read_csv(r"C:\Users\lucas\OneDrive\BA\Data\Ownership_Change\MUPMIP_panel_owner.csv",


The flag `b_is_main_owner` is used to seperate minority from majority shareholders (defined as the owner of at least 50% of the equity), and, where there is no information on the percentage owned, only owners with following "characteristics" (dt. Eigenschaft) were considered majority: "Owner" (Inhaber), "Shareholder" (Gesellschafter), "Limited Partner" (Kommanditist), "General Partner" (Komplementär), and "Majority Shareholder" (Hauptaktionär)

In [22]:
df_ownership["b_is_main_owner"] = np.where(df_ownership["b_anteil"] >= 50 
                                           | df_ownership["b_eigenschaft"].str.contains(
                                               "Inhaber|Gesellschafter|Kommanditist|Komplementär|Hauptaktionär",
                                                 regex=True), True, False)
df_ownership["b_is_main_owner"] = np.where(df_ownership["b_anteil"] < 50, False, True)

Specify the start and end year of the participation, as a step to turn the ownership data into a panel. The entries which have no start nor end date will be assigned all the years where there are observations in the MIP data set (first year 1993, last year 2021). The end date is set to 2023 for all participations which didn't end in the observation period or where there is no information so that 2021 is within the start to end range

In [23]:
df_ownership["b_start_year"] = df_ownership["b_beginn"].astype(str).str[:4]
df_ownership["b_end_year"] = df_ownership["b_ende"].astype(str).str[:4]
df_ownership["b_start_year"] = np.where(df_ownership["b_start_year"] == "0.0", 1993, df_ownership["b_start_year"])
df_ownership["b_start_year"] = np.where(df_ownership["b_start_year"] == "nan", 1993, df_ownership["b_start_year"])
df_ownership["b_end_year"] = np.where(df_ownership["b_end_year"] == "0.0", 2023, df_ownership["b_end_year"])
df_ownership["b_end_year"] = np.where(df_ownership["b_end_year"] == "nan", 2023, df_ownership["b_end_year"])

Parse `b_start_year` and `b_end_year` to integers

In [24]:
df_ownership["b_start_year"] = pd.to_numeric(df_ownership["b_start_year"], downcast="integer")
df_ownership["b_end_year"] = pd.to_numeric(df_ownership["b_end_year"], downcast="integer")

The array `survey_years` contains all the years with sample data for the companies in the MIP panel. Now I will create dummy variables for all sample years, so that we transform the ownership data frame later to a panel

In [25]:
survey_years = np.unique(df_companies["smpljahr"])
for i in range(len(survey_years)):
    df_ownership[str(survey_years[i])] = np.where((df_ownership["b_start_year"] <= survey_years[i]) 
                                                  & (df_ownership["b_end_year"] > survey_years[i]),True, False)

`df_panel_ownership` now has the structure of a panel data set, after using `df.melt()`

In [26]:
df_panel_ownership = df_ownership.melt(id_vars=['crefo', 'b_crefo', 'b_eigenschaft', 'b_betrag', 'b_anteil', 
                                                'b_beginn','b_ende', 'b_firma', 'b_person', 'welle', 'companyid', 
                                                'ownerid','b_is_main_owner', 'b_start_year', 'b_end_year'],
                                                  var_name="panel_year")

In this step, I filtered data set to only have the years where there was an observation on the owner and to only have the main owners included, and drop the variables afterwards

In [27]:
df_panel_ownership = df_panel_ownership[df_panel_ownership["value"] == True]
df_panel_ownership = df_panel_ownership[df_panel_ownership["b_is_main_owner"] == True]
df_panel_ownership.drop(labels=["value", "b_is_main_owner"], axis=1, inplace=True)

Currently, we have 127.400 different owners corresponding to 25.306 companies, and 3.9 million observations in our data set

In [28]:
print(len(df_panel_ownership["b_crefo"].unique()))
print(len(df_panel_ownership))
print(len(df_panel_ownership["companyid"].unique()))

127400
3949831
25306


Inner merge with companies data: the option `inner` when merging the ownership panel data and the companies panel data ensures that only companies with ownership data (and vice-versa, i.e. only ownership data linked to a company) end up in the merged data set

In [None]:
df_panel_ownership["panel_year"] = pd.to_numeric(df_panel_ownership["panel_year"])
df_merged_companies = pd.merge(df_panel_ownership, df_companies, how="inner", left_on=["panel_year", "companyid"], right_on=["smpljahr", "companyid"])
df_merged_companies

23.976 companies contained in the merged data set

In [37]:
len(df_merged_companies["companyid"].value_counts())

23976

Create a file with all variables and their descriptions, in order to sort out which ones are needed

In [48]:
df_columns = pd.DataFrame(df_merged_companies.columns)
df_columns.columns = ["labels"]
df_descriptions_own = pd.read_excel(r"C:\Users\lucas\OneDrive\BA\Data\MUPOwn_panel_variables.xlsx")
df_descriptions_mip = pd.read_excel(r"C:\Users\lucas\OneDrive\BA\Data\MUPMIP_panel_variables.xlsx")
df_columns = pd.merge(df_columns, df_descriptions_own, how="left", left_on="labels", right_on="name")
df_columns = pd.merge(df_columns, df_descriptions_mip, how="left", left_on="labels", right_on="name")
df_columns.to_csv(r"C:\Users\lucas\OneDrive\BA\Data\merged_variables.csv")

In [54]:
df_merged_companies.groupby(["companyid", "smpljahr"]).nunique()

Unnamed: 0_level_0,Unnamed: 1_level_0,crefo,b_crefo,b_eigenschaft,b_betrag,b_anteil,b_beginn,b_ende,b_firma,b_person,welle_x,...,ghe3,ghp1,ghp2,ghvarp,maein1,maein2,maein3,maein4,maein5,_merge
companyid,smpljahr,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1.0,2001,1,1,0,0,0,1,1,1,2,2,...,0,0,0,0,0,0,0,0,0,1
2.0,1995,1,1,1,1,0,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,1
3.0,1995,1,2,1,0,0,1,1,2,2,2,...,0,0,0,0,0,0,0,0,0,1
4.0,2012,1,5,2,1,0,3,1,1,2,3,...,0,0,0,0,0,0,0,0,0,1
5.0,1995,1,1,2,4,2,1,1,1,1,4,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26202.0,2019,1,2,2,2,1,2,1,2,2,4,...,0,0,0,0,0,0,0,0,0,1
26203.0,2019,1,1,1,0,0,1,1,1,1,2,...,0,0,0,0,1,1,1,1,1,1
26204.0,2019,1,1,2,1,1,1,1,1,1,2,...,1,1,1,1,1,1,1,1,1,1
26205.0,2021,1,3,2,1,1,2,1,2,1,3,...,1,1,1,1,1,1,1,1,1,1
