# Creating Final Datasets
In this notebook, we complete two datasets: 

1. The "final_df", which we will use for the old target model (usable with any new client or company).
To create this dataset, we merge the three datasets we have preprocessed so far:
- The Omdia Spend Tracker, with spending by company, market, submarket, device and year
- The Omdia Market Spending dataset, with spending in semiconductors by Market, Submarket, Device and Year
- The Financial data we scraped from the web

2. The "newtarget_df", which we create directly from final_df and will use for the new target model. This is the better-performing model, but it can only be used for existing clients or companies (those with previous spending on semiconductors).

Both datasets cover the years 2018 to 2024, as this is the data we will use to train and test the models. For the years 2025, 2026 and 2027, we will create the datasets in the "prediction_dataset.ipynb" notebook.

In [1]:
import pandas as pd
import numpy as np

# Creating Final_df
## Merging Omdia Spend Tracker with Omdia Market Spending
The issue with this first merge is that the names assigned to markets and submarkets in the two datasets do not match. Before merging, we therefore need a dictionary that maps the market and submarket names of one dataset to those of the other.

In [2]:
#cleaned_data is the Omdia Spend Tracker data
cleaned_data = pd.read_csv("../clean_data/cleaned_spend_data.csv", index_col = 0)
cleaned_data = cleaned_data.groupby(["CompanyName", "Device", "Market", " SubMarket", "PeriodCode"]).sum().reset_index()
cleaned_data.rename(columns = {" SubMarket" : "SubMarket", "PeriodCode" : "Year"}, inplace = True)
cleaned_data.Year = cleaned_data.Year.apply( lambda x: int(x[1:]))
#cleaned_data.drop("CompanyName", axis = 1, inplace = True)
cleaned_data

Unnamed: 0,CompanyName,Device,Market,SubMarket,Year,Value
0,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2018,7.951500
1,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2019,8.616500
2,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2020,8.565350
3,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2021,12.353845
4,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2022,13.879158
...,...,...,...,...,...,...
36010,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2020,10.295800
36011,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2021,17.639800
36012,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2022,23.717500
36013,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2023,23.067700


In [4]:
#market_spend is the Omdia Market Spend data
market_spend = pd.read_csv("../input_data/market_data.csv")
market_spend

Unnamed: 0,Market,Submarket,Device,Year,Region,Market size
0,Automotive Electronics Categories,ADAS,Amplifier/Comparator,2008,Americas,0.424
1,Automotive Electronics Categories,ADAS,Amplifier/Comparator,2008,Asia & Oceania (excl. Japan),0.246
2,Automotive Electronics Categories,ADAS,Amplifier/Comparator,2008,EMEA,0.678
3,Automotive Electronics Categories,ADAS,Amplifier/Comparator,2008,Japan,0.421
4,Automotive Electronics Categories,ADAS,Amplifier/Comparator,2008,Worldwide,1.769
...,...,...,...,...,...,...
32495,Wireless Communications Categories,Wireless LAN Equipment,Voltage Regulator/Reference,2027,Americas,0.000
32496,Wireless Communications Categories,Wireless LAN Equipment,Voltage Regulator/Reference,2027,Asia & Oceania (excl. Japan),0.000
32497,Wireless Communications Categories,Wireless LAN Equipment,Voltage Regulator/Reference,2027,EMEA,0.000
32498,Wireless Communications Categories,Wireless LAN Equipment,Voltage Regulator/Reference,2027,Japan,0.000


### Market - Submarket mapping between the two datasets (done with GPT)
Now that we've loaded both datasets, we can create the dictionary that maps the market and submarket names. To do that, we gave ChatGPT the two lists of market-submarket pairs, one per dataset, and asked it to output this dictionary of mappings.
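
Since the mapping is generated rather than written by hand, it can be worth checking it against the labels actually present in the two datasets. The following is a minimal sketch on toy data; the frames `tracker` and `market` and the dictionary `candidate_map` are hypothetical stand-ins, not the real data or the real mapping.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, not the real data)
tracker = pd.DataFrame({
    "Market": ["Automotive", "Consumer", "Consumer"],
    "SubMarket": ["Auto ADAS", "TV", "Appliance"],
})
market = pd.DataFrame({
    "Market": ["Automotive Electronics Categories", "Consumer Electronics Categories"],
    "Submarket": ["ADAS", "LCD TV"],
})

# Distinct (market, submarket) pairs on each side
tracker_pairs = set(zip(tracker["Market"], tracker["SubMarket"]))
market_pairs = set(zip(market["Market"], market["Submarket"]))

# A candidate mapping from market-data labels to spend-tracker labels
candidate_map = {
    ("Automotive Electronics Categories", "ADAS"): ("Automotive", "Auto ADAS"),
    ("Consumer Electronics Categories", "LCD TV"): ("Consumer", "TV"),
}

# Source pairs the mapping does not cover (these would raise a KeyError on lookup)
unmapped_sources = market_pairs - set(candidate_map)
# Mapped targets that do not exist in the tracker (these would never match in a merge)
unknown_targets = set(candidate_map.values()) - tracker_pairs
# Tracker pairs that no source maps to (these would receive no market size)
uncovered_tracker = tracker_pairs - set(candidate_map.values())

print(unmapped_sources, unknown_targets, uncovered_tracker)
```

In this toy case only `uncovered_tracker` is non-empty, which flags a tracker submarket that would be left without market data after the merge.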

In [6]:
mapping_dict = {('Automotive Electronics Categories', 'ADAS'): ('Automotive', 'Auto ADAS'),
    ('Automotive Electronics Categories', 'Body & Convenience'): ('Automotive', 'Other Auto & Aftermarket'),
    ('Automotive Electronics Categories', 'Chassis & Safety'): ('Automotive', 'Other Auto & Aftermarket'),
    ('Automotive Electronics Categories', 'Connectivity & Telematics'): ('Automotive', 'Other Auto & Aftermarket'),
    ('Automotive Electronics Categories', 'Hybrid & Electric Drive Train'): ('Automotive', 'Other Auto & Aftermarket'),
    ('Automotive Electronics Categories', 'Infotainment & Cluster'): ('Automotive', 'Auto Infotainment'),
    ('Automotive Electronics Categories', 'Other Automotive'): ('Automotive', 'Other Auto & Aftermarket'),
    ('Automotive Electronics Categories', 'Powertrain & Vehicle Dynamics'): ('Automotive', 'Auto Powertrain'),
    ('Computing & Data Storage Categories', 'Data Center Servers'): ('Computer Platforms', 'Data Center Servers'),
    ('Computing & Data Storage Categories', 'Desktop PCs'): ('Computer Platforms', 'Desktop PCs'),
    ('Computing & Data Storage Categories', 'Flash Storage Cards'): ('Computer Peripherals & Storage', 'Flash Cards/Drives'),
    ('Computing & Data Storage Categories', 'Hard Disk Drives'): ('Computer Peripherals & Storage', 'HDD'),
    ('Computing & Data Storage Categories', 'Notebook PCs'): ('Computer Platforms', 'Notebook PCs'),
    ('Computing & Data Storage Categories', 'Other Computing'): ('Computer Platforms', 'Other Computer Products'),
    ('Computing & Data Storage Categories', 'Other Data Storage'): ('Computer Peripherals & Storage', 'Other Storage'),
    ('Computing & Data Storage Categories', 'Other Peripherals'): ('Computer Peripherals & Storage', 'Other Peripherals'),
    ('Computing & Data Storage Categories', 'Smart Cards'): ('Computer Peripherals & Storage', 'Smart Cards'),
    ('Computing & Data Storage Categories', 'Solid-State Drives'): ('Computer Peripherals & Storage', 'Flash Cards/Drives'),
    ('Computing & Data Storage Categories', 'Tablet PCs'): ('Computer Platforms', 'Tablet PCs'),
    ('Computing & Data Storage Categories', 'USB Flash Drive'): ('Computer Peripherals & Storage', 'Flash Cards/Drives'),
    ('Consumer Electronics Categories', 'Fitness & Wellness Wearable Electronics'): ('Consumer', 'Other Consumer'),
    ('Consumer Electronics Categories', 'LCD TV'): ('Consumer', 'TV'),
    ('Consumer Electronics Categories', 'Major Home Appliances'): ('Consumer', 'Appliance'),
    ('Consumer Electronics Categories', 'OLED TV'): ('Consumer', 'TV'),
    ('Consumer Electronics Categories', 'Other Audio/Video'): ('Consumer', 'Audio'),
    ('Consumer Electronics Categories', 'Other Consumer Electronics'): ('Consumer', 'Other Consumer'),
    ('Consumer Electronics Categories', 'Set-Top Boxes'): ('Consumer', 'STB'),
    ('Consumer Electronics Categories', 'Smart Speakers & Digital Assistants'): ('Consumer', 'Connected Consumer'),
    ('Consumer Electronics Categories', 'Smart Watches'): ('Consumer', 'Other Consumer'),
    ('Consumer Electronics Categories', 'VR Headsets'): ('Consumer', 'Other Consumer'),
    ('Consumer Electronics Categories', 'Video Game Consoles'): ('Consumer', 'Video Games'),
    ('Industrial Electronics Categories', 'Automation'): ('Industrial', 'Other Industrial'),
    ('Industrial Electronics Categories', 'Building & Home Control'): ('Industrial', 'Other Industrial'),
    ('Industrial Electronics Categories', 'Lighting'): ('Industrial', 'Other Industrial'),
    ('Industrial Electronics Categories', 'Medical Electronics'): ('Industrial', 'Medical'),
    ('Industrial Electronics Categories', 'Military & Civil Aerospace'): ('Industrial', 'Military/Aerospace'),
    ('Industrial Electronics Categories', 'Other Industrial'): ('Industrial', 'Other Industrial'),
    ('Industrial Electronics Categories', 'Power & Energy'): ('Industrial', 'Power & Energy'),
    ('Industrial Electronics Categories', 'Security & Video Surveillance'): ('Industrial', 'Other Industrial'),
    ('Industrial Electronics Categories', 'Test & Measurement'): ('Industrial', 'Test & Measurement'),
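    # NOTE: this key already appears above; in a dict literal the later entry wins,
    # so 'Other Industrial' ends up mapped to 'Manufacturing Equipment'
    ('Industrial Electronics Categories', 'Other Industrial') : ('Industrial', 'Manufacturing Equipment'),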
    ('Wired Communications Categories', 'Broadcast & Streaming Video'): ('Wired Communications', 'Other Wired'),
    ('Wired Communications Categories', 'Cable Aggregation Equipment'): ('Wired Communications', 'Carrier'),
    ('Wired Communications Categories', 'Cable CPE'): ('Wired Communications', 'Carrier'),
    ('Wired Communications Categories', 'Carrier Ethernet Switches & Routers'): ('Wired Communications', 'Carrier'),
    ('Wired Communications Categories', 'DSL Aggregation Equipment'): ('Wired Communications', 'Carrier'),
    ('Wired Communications Categories', 'DSL CPE'): ('Wired Communications', 'Carrier'),
    ('Wired Communications Categories', 'Data Center Network Switches'): ('Wired Communications', 'DC Network & Threat Mitigation'),
    ('Wired Communications Categories', 'Enterprise Ethernet Switches & Routers'): ('Wired Communications', 'Enterprise/SOHO'),
    ('Wired Communications Categories', 'Enterprise UC & Voice'): ('Wired Communications', 'Enterprise/SOHO'),
    ('Wired Communications Categories', 'FTTH Aggregation Equipment'): ('Wired Communications', 'Carrier'),
    ('Wired Communications Categories', 'FTTH CPE'): ('Wired Communications', 'Carrier'),
    ('Wired Communications Categories', 'Low-Tier Consumer/SOHO Routers'): ('Wired Communications', 'Enterprise/SOHO'),
    ('Wired Communications Categories', 'Optical Equipment'): ('Wired Communications', 'Other Wired'),
    ('Wired Communications Categories', 'Other Wired Communications'): ('Wired Communications', 'Other Wired'),
    ('Wired Communications Categories', 'Threat Mitigation Products'): ('Wired Communications', 'DC Network & Threat Mitigation'),
    ('Wireless Communications Categories', 'Gray Market Handsets'): ('Wireless Communications', 'Handset'),
    ('Wireless Communications Categories', 'High-Tier Smartphone'): ('Wireless Communications', 'Handset'),
    ('Wireless Communications Categories', 'Low-Tier Smartphone'): ('Wireless Communications', 'Handset'),
    ('Wireless Communications Categories', 'M2M Modules'): ('Wireless Communications', 'Other Wireless'),
    ('Wireless Communications Categories', 'Media Tablets'): ('Wireless Communications', 'Media Tablets'),
    ('Wireless Communications Categories', 'Mid-Tier Smartphone'): ('Wireless Communications', 'Handset'),
    ('Wireless Communications Categories', 'Mobile Comm Infrastructure'): ('Wireless Communications', 'Infrastructure'),
    ('Wireless Communications Categories', 'Mobile Phone (ULCH, Entry, Feature)'): ('Wireless Communications', 'Handset'),
    ('Wireless Communications Categories', 'Other Wireless Communications'): ('Wireless Communications', 'Other Wireless'),
    ('Wireless Communications Categories', 'Wireless LAN Equipment'): ('Wireless Communications', 'Infrastructure')}

Now we can rename the market_spend Market and Submarket entries according to the dictionary. 

In [7]:
#returns the spend tracker (market, submarket) pair for a market_spend (market, submarket) pair
def remapping(market, submarket):
    key = (market, submarket)
    return mapping_dict[key][0], mapping_dict[key][1]

market_spend.Market, market_spend.Submarket = zip(*market_spend.apply(lambda x : remapping(x.Market, x.Submarket), axis = 1))
market_spend.rename(columns = {"Submarket" : "SubMarket"}, inplace = True)
market_spend

Unnamed: 0,Market,SubMarket,Device,Year,Region,Market size
0,Automotive,Auto ADAS,Amplifier/Comparator,2008,Americas,0.424
1,Automotive,Auto ADAS,Amplifier/Comparator,2008,Asia & Oceania (excl. Japan),0.246
2,Automotive,Auto ADAS,Amplifier/Comparator,2008,EMEA,0.678
3,Automotive,Auto ADAS,Amplifier/Comparator,2008,Japan,0.421
4,Automotive,Auto ADAS,Amplifier/Comparator,2008,Worldwide,1.769
...,...,...,...,...,...,...
32495,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2027,Americas,0.000
32496,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2027,Asia & Oceania (excl. Japan),0.000
32497,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2027,EMEA,0.000
32498,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2027,Japan,0.000
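
Note that the mapping is many-to-one: for example, both 'LCD TV' and 'OLED TV' map to ('Consumer', 'TV'). After renaming, several rows can therefore share the same Market, SubMarket, Device and Year. The notebook deals with this further down by summing after the merge; the sketch below only illustrates the effect on toy data, and the names `toy`, `toy_map` and `collapsed` are hypothetical.

```python
import pandas as pd

# Toy market data: two source submarkets that map to the same target label
toy = pd.DataFrame({
    "Market": ["Consumer Electronics Categories"] * 2,
    "Submarket": ["LCD TV", "OLED TV"],
    "Device": ["Amplifier/Comparator"] * 2,
    "Year": [2020, 2020],
    "Market size": [3.0, 1.5],
})
toy_map = {
    ("Consumer Electronics Categories", "LCD TV"): ("Consumer", "TV"),
    ("Consumer Electronics Categories", "OLED TV"): ("Consumer", "TV"),
}

# Look up the new (market, submarket) pair for every row
pairs = [toy_map[key] for key in zip(toy["Market"], toy["Submarket"])]
toy["Market"] = [p[0] for p in pairs]
toy["Submarket"] = [p[1] for p in pairs]

# Both rows now share the same key, so their market sizes are summed
collapsed = toy.groupby(
    ["Market", "Submarket", "Device", "Year"], as_index=False
)["Market size"].sum()
print(collapsed)
```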


### Further adjustments in the market_spend dataset
- We keep only "Worldwide" as a region, as we did for the spend tracker dataset.
- We aggregate the market spend by market, by submarket and by device to obtain yearly totals at each level. This is done by creating 3 additional datasets (**marketspendings**, **submarketspendings**, **devicespendings**) that we will then merge with the rest.

In [8]:
#keep only region = worldwide in the market dataset as agreed
#filter and drop in one step, without inplace, to avoid pandas' SettingWithCopyWarning
market_spend = market_spend[market_spend.Region == "Worldwide"].drop(columns = "Region")
market_spend


Unnamed: 0,Market,SubMarket,Device,Year,Market size
4,Automotive,Auto ADAS,Amplifier/Comparator,2008,1.769
9,Automotive,Auto ADAS,Amplifier/Comparator,2009,1.390
14,Automotive,Auto ADAS,Amplifier/Comparator,2010,3.401
19,Automotive,Auto ADAS,Amplifier/Comparator,2011,5.911
24,Automotive,Auto ADAS,Amplifier/Comparator,2012,7.250
...,...,...,...,...,...
32479,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2023,0.000
32484,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2024,0.000
32489,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2025,0.000
32494,Wireless Communications,Infrastructure,Voltage Regulator/Reference,2026,0.000


In [9]:
#group by market, submarket and device to have more granular information on the spending
#create the three new df that will then be re-merged into the main dataset
marketspendings = market_spend[["Market", "Year", "Market size"]].groupby(["Market", "Year"]).sum().reset_index()
submarketspendings = market_spend[["SubMarket", "Year", "Market size"]].groupby(["SubMarket", "Year"]).sum().reset_index()
devicespendings = market_spend[["Device", "Year", "Market size"]].groupby(["Device", "Year"]).sum().reset_index()

### Merging the two datasets
We finally merge the two main datasets with the newly created market, submarket and device spendings, rename all the columns consistently and drop the rows with spending = 0.

In [12]:
#merge market_spend and cleaned_data
merged_dataset = cleaned_data.merge(market_spend, how = 'left', on = ['Market', "SubMarket", "Device", "Year"])
merged_dataset = merged_dataset.groupby(["CompanyName", "Device", "Market", "SubMarket", "Year", "Value"]).sum().reset_index()
merged_dataset.rename(columns = {"Market size" : "Spendings"}, inplace = True)
merged_dataset = merged_dataset[merged_dataset['Spendings'] != 0]
merged_dataset

Unnamed: 0,CompanyName,Device,Market,SubMarket,Year,Value,Spendings
0,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2018,7.951500,82.904
1,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2019,8.616500,84.255
2,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2020,8.565350,85.306
3,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2021,12.353845,122.722
4,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2022,13.879158,138.207
...,...,...,...,...,...,...,...
36010,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2020,10.295800,282.999
36011,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2021,17.639800,399.001
36012,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2022,23.717500,469.001
36013,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2023,23.067700,412.999
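
A left merge keeps every spend tracker row even when no market row matches, leaving a missing market size. Because pandas sums an all-missing group to 0 by default, such rows would then be removed by the `!= 0` filter together with the true zeros. The sketch below shows one way to inspect unmatched rows before they disappear, using the `indicator` option of `merge`; the frames `left` and `right` are toy data, not the notebook's.

```python
import pandas as pd

# Toy spend tracker rows (left) and toy market rows (right)
left = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "Device": ["Amp", "Amp", "Amp"],
    "Year": [2020, 2021, 2020],
    "Value": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "Device": ["Amp"],
    "Year": [2020],
    "Market size": [10.0],
})

# indicator=True adds a "_merge" column telling where each row came from
checked = left.merge(right, how="left", on=["Device", "Year"], indicator=True)

# Rows present only on the left side found no market data
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Company", "Device", "Year"]])
```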


In [13]:
#merge the market, submarket and device spendings with the merged_dataset
merged_dataset = merged_dataset.merge(marketspendings, how = 'left', on = ["Market", "Year"])
merged_dataset.rename(columns = {"Market size" : "MarketSpends"}, inplace = True)
merged_dataset = merged_dataset.merge(submarketspendings, how = 'left', on = ["SubMarket", "Year"])
merged_dataset.rename(columns = {"Market size" : "SubMarketSpends"}, inplace = True)
merged_dataset = merged_dataset.merge(devicespendings, how = 'left', on = ["Device", "Year"])
merged_dataset.rename(columns = {"Market size" : "DevSpends"}, inplace = True)
merged_dataset

Unnamed: 0,CompanyName,Device,Market,SubMarket,Year,Value,Spendings,MarketSpends,SubMarketSpends,DevSpends
0,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2018,7.951500,82.904,11486.855,926.369,3990.996
1,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2019,8.616500,84.255,10750.418,905.298,3800.001
2,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2020,8.565350,85.306,10966.167,942.751,3786.004
3,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2021,12.353845,122.722,13599.835,1128.090,4785.001
4,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2022,13.879158,138.207,15244.412,1221.082,5469.003
...,...,...,...,...,...,...,...,...,...,...
33730,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2020,10.295800,282.999,5679.186,558.000,11956.010
33731,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2021,17.639800,399.001,7289.565,722.000,15765.008
33732,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2022,23.717500,469.001,7755.155,887.001,17307.004
33733,iRobot,Voltage Regulator/Reference,Consumer,Appliance,2023,23.067700,412.999,6292.004,674.000,16419.001


## Adding Financial Information scraped from the web
The first step is, once again, mapping the company names from the Omdia spend tracker to those from the Capital IQ financial data, as they do not always match.
The dictionary is stored in the "financials_map.csv" file and is simply read here with pandas.

Then the three financial datasets (revenue, EBITDA and COGS per company and year) are read, and the mapping function is applied to their company names so that they match the spend tracker's.
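
As a minimal sketch of this renaming step, the same effect can be obtained with `Series.map`, which returns a missing value for any name that is absent from the dictionary. The names and numbers below (`name_map`, `fin`, "Alpha Inc.", and so on) are made up for illustration.

```python
import pandas as pd

# Hypothetical mapping from financial-data names to spend tracker names
name_map = {"Alpha Inc.": "Alpha", "Beta Corp.": "Beta"}

# Toy financial data; "Gamma Ltd." has no entry in the mapping
fin = pd.DataFrame({
    "Company": ["Alpha Inc.", "Beta Corp.", "Gamma Ltd."],
    "Year": [2020, 2020, 2020],
    "Revenue": [100.0, 50.0, 5.0],
})

# Names without a mapping become NaN and are then dropped
fin["Company"] = fin["Company"].map(name_map)
fin = fin.dropna(subset=["Company"]).reset_index(drop=True)
print(fin)
```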

In [18]:
merged_dataset.rename(columns = {"CompanyName" : "Company"}, inplace = True)

In [19]:
#reading the dictionary "financials_map.csv"
comp_map = pd.read_csv("../input_data/financials_map.csv", delimiter= ",", encoding='unicode_escape', index_col = 0)
comp_map = comp_map.set_index("Company Tag from S&P Global").to_dict()["Company Name from Spend Tracker"]
companies_available = comp_map.keys()

In [20]:
#function that takes a company name in the S&P Global and returns the respective name in the spend tracker
def mapping_companies(x):
    if x in companies_available:
        return comp_map[x]
    else :
        return None

In [21]:
#reading each financial file and applying the mapping function to match company names
ebitda = pd.read_csv("../clean_data/ebitda_complete.csv")
ebitda.Company = ebitda.Company.apply(mapping_companies)
#remove the rows whose company name has no mapping
ebitda = ebitda[~ebitda['Company'].isnull()]
ebitda["Ebitda_1"] = ebitda.groupby('Company')['Ebitda'].shift(1)
ebitda["Ebitda_2"] = ebitda.groupby('Company')['Ebitda'].shift(2)
#create a set of companies which have negative ebitda (or negative lagged) to then remove from the dataset
remove_ebitda = ebitda[(ebitda.Ebitda_2 < 0) | (ebitda.Ebitda < 0) | (ebitda.Ebitda_1 < 0)].Company.unique()

In [22]:
#same with cogs
cogs = pd.read_csv("../clean_data/cogs_complete.csv")
cogs.Company = cogs.Company.apply(mapping_companies)
cogs = cogs[~cogs['Company'].isnull()]
cogs.rename(columns = {"Value" : "Cogs"}, inplace = True)
cogs["Cogs_1"] = cogs.groupby('Company')['Cogs'].shift(1)
cogs["Cogs_2"] = cogs.groupby('Company')['Cogs'].shift(2)
remove_cogs = cogs[(cogs.Cogs_2 < 0) | (cogs.Cogs < 0) | (cogs.Cogs_1 < 0)].Company.unique()
remove_cogs

array(['Wabtec', 'Hewlett-Packard', 'Johnson Controls', 'Schlumberger',
       'LG Display', 'Nabtesco Corporation'], dtype=object)

In [24]:
#same with revenue
revenue = pd.read_csv("../clean_data/revenue_complete.csv")
revenue.Company = revenue.Company.apply(mapping_companies)
revenue = revenue[~revenue['Company'].isnull()]
revenue["Revenue_1"] = revenue.groupby('Company')['Revenue'].shift(1)
revenue["Revenue_2"] = revenue.groupby('Company')['Revenue'].shift(2)
remove_revenue = revenue[(revenue.Revenue_2 < 0) | (revenue.Revenue < 0) | (revenue.Revenue_1 < 0)].Company.unique()
remove_revenue

array(['Deere & Company', 'Rolls-Royce Holding', 'LG Display', 'TomTom',
       'HTC'], dtype=object)
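
The lagged columns above rely on `groupby(...).shift`, which shifts by row position within each company. This gives the previous year's value only if the rows of each company are ordered by year and no year is missing. The sketch below makes the ordering explicit on toy data; whether the real files are already sorted is not shown here, so the sort is an assumption-free safeguard rather than a required fix.

```python
import pandas as pd

# Toy EBITDA data, deliberately out of order
fin = pd.DataFrame({
    "Company": ["A", "B", "A", "B", "A"],
    "Year": [2020, 2019, 2018, 2020, 2019],
    "Ebitda": [30.0, -5.0, 10.0, 7.0, 20.0],
})

# Sort so that shift(1) really means "previous year" within each company
fin = fin.sort_values(["Company", "Year"]).reset_index(drop=True)
fin["Ebitda_1"] = fin.groupby("Company")["Ebitda"].shift(1)
fin["Ebitda_2"] = fin.groupby("Company")["Ebitda"].shift(2)

# Companies with a negative current or lagged value in any row
# (comparisons with NaN are False, so missing lags are not flagged)
negative = (fin[["Ebitda", "Ebitda_1", "Ebitda_2"]] < 0).any(axis=1)
to_remove = fin.loc[negative, "Company"].unique()
print(to_remove)
```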

### Merging everything together
Finally, we merge the new financial datasets with the previously created "merged_dataset" to obtain the final df.

In [27]:
#merge everything and remove the companies in the previously created sets of negative ebitda, cogs or revenue
final_df = merged_dataset.merge(cogs, how = "left", on = ["Company", "Year"]).merge(revenue, how = "left", on = ["Company", "Year"]).merge(ebitda, how = "left", on = ["Company", "Year"])
final_df = final_df[~final_df['Cogs'].isnull()].reset_index(drop = True)
final_df = final_df[~final_df['Ebitda'].isnull()].reset_index(drop = True)
final_df = final_df[~final_df['Ebitda_1'].isnull()].reset_index(drop = True)
final_df = final_df[~final_df['Ebitda_2'].isnull()].reset_index(drop = True)
final_df = final_df[final_df['Value'] != 0]
final_df = final_df[~final_df.Company.isin(remove_ebitda)]
final_df = final_df[~final_df.Company.isin(remove_cogs)]
final_df = final_df[~final_df.Company.isin(remove_revenue)]
final_df

Unnamed: 0,Company,Device,Market,SubMarket,Year,Value,Spendings,MarketSpends,SubMarketSpends,DevSpends,Cogs,Cogs_1,Cogs_2,Revenue,Revenue_1,Revenue_2,Ebitda,Ebitda_1,Ebitda_2
0,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2018,7.951500,82.904,11486.855,926.369,3990.996,19059.000000,17278.000000,17270.0,27662.00,25196.00,24929.0,3227.000000,2929.000000,2987.0
1,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2019,8.616500,84.255,10750.418,905.298,3800.001,19018.000000,19059.000000,17278.0,27978.00,27662.00,25196.0,3347.000000,3227.000000,2929.0
2,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2020,8.565350,85.306,10966.167,942.751,3786.004,18123.000000,19018.000000,19059.0,26134.00,27978.00,27662.0,2668.000000,3347.000000,3227.0
3,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2021,12.353845,122.722,13599.835,1128.090,4785.001,19407.000000,18123.000000,19018.0,28945.00,26134.00,27978.0,4641.000000,2668.000000,3347.0
4,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2022,13.879158,138.207,15244.412,1221.082,5469.003,19712.000000,19407.000000,18123.0,29446.00,28945.00,26134.0,4477.000000,4641.000000,2668.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27696,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2020,0.809700,454.000,25838.124,4852.175,11956.010,2445.000000,2385.000000,2237.0,4448.00,4485.00,4218.0,831.000000,899.000000,817.0
27697,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2021,1.271038,584.000,33129.151,5885.459,15765.008,2999.000000,2445.000000,2385.0,5627.00,4448.00,4485.0,1198.000000,831.000000,899.0
27698,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2022,1.229600,555.000,37356.594,5729.355,17307.004,3157.000000,2999.000000,2445.0,5781.00,5627.00,4448.0,1140.000000,1198.000000,831.0
27699,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2023,1.182734,474.999,28959.587,4701.997,16419.001,7385.362857,3157.000000,2999.0,4536.76,5781.00,5627.0,3668.885714,1140.000000,1198.0


In [39]:
final_df.to_csv("../clean_data/final_df.csv")

# Creating Newtarget_df
The idea is to transform all the numerical values into log differentials with respect to the previous year's values, and then use these new values to train a model.
Thus, we add to the dataset new columns for the previous year's Value, Spendings, MarketSpends, SubMarketSpends and DevSpends.
Then, we apply the log_diff function and add the log-diff version of all those variables to the dataset.
We will use this dataset to run the log-diff version of the model (newtarget).
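
As a worked example, the target for a row is `log(1 + Value) - log(1 + Value_1)`. For ABB's first row in final_df, Value is 7.9515 in 2018 and 8.6165 in 2019, which gives roughly 0.0717 for 2019. The sketch below reproduces that number and also shows how a predicted log differential could be turned back into a spending level; `invert_log_diff` is a hypothetical helper that does not appear in the notebook.

```python
import numpy as np

def log_diff(current, previous):
    # log(1 + current) - log(1 + previous), written with log1p
    return np.log1p(current) - np.log1p(previous)

def invert_log_diff(target, previous):
    # Hypothetical helper: recover the current level from a log differential
    # and the previous year's level
    return (1.0 + previous) * np.exp(target) - 1.0

# ABB, Amplifier/Comparator, Manufacturing Equipment (values from final_df)
value_2018, value_2019 = 7.9515, 8.6165

target_2019 = log_diff(value_2019, value_2018)
recovered = invert_log_diff(target_2019, value_2018)
print(round(target_2019, 6), round(recovered, 4))
```

The transformation requires both arguments to be greater than -1, which is consistent with removing companies that have negative financial values earlier in the notebook.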

In [28]:
#defining the function
def log_diff(x, y):
    return np.log(1+x) - np.log(1+y)

In [29]:
#creating cols with lagged data
new_df = final_df.copy()
new_df["Value_1"] = new_df.groupby(["Company", "Device", "Market", "SubMarket"]).Value.shift(1)
new_df["Spendings_1"] = new_df.groupby(["Company", "Device", "Market", "SubMarket"]).Spendings.shift(1)
new_df["MarketSpends_1"] = new_df.groupby(["Company", "Device", "Market", "SubMarket"]).MarketSpends.shift(1)
new_df["SubMarketSpends_1"] = new_df.groupby(["Company", "Device", "Market", "SubMarket"]).SubMarketSpends.shift(1)
new_df["DevSpends_1"] = new_df.groupby(["Company", "Device", "Market", "SubMarket"]).DevSpends.shift(1)
new_df = new_df[~new_df['Value_1'].isnull()].reset_index(drop = True)

#creating cols with logdiff data using the lagged data
#log_diff is applied to whole columns at once, which gives the same values as a row-wise apply
new_df["Target"] = log_diff(new_df.Value, new_df.Value_1)
new_df["LogDiffSpendings"] = log_diff(new_df.Spendings, new_df.Spendings_1)
new_df["LogDiffMarketSpends"] = log_diff(new_df.MarketSpends, new_df.MarketSpends_1)
new_df["LogDiffSubMarketSpends"] = log_diff(new_df.SubMarketSpends, new_df.SubMarketSpends_1)
new_df["LogDiffDevSpends"] = log_diff(new_df.DevSpends, new_df.DevSpends_1)
new_df["LogDiffCogs"] = log_diff(new_df.Cogs, new_df.Cogs_1)
new_df["LogDiffRevenue"] = log_diff(new_df.Revenue, new_df.Revenue_1)
new_df["LogDiffEbitda"] = log_diff(new_df.Ebitda, new_df.Ebitda_1)
new_df["LogDiffCogs_1"] = log_diff(new_df.Cogs_1, new_df.Cogs_2)
new_df["LogDiffRevenue_1"] = log_diff(new_df.Revenue_1, new_df.Revenue_2)
new_df["LogDiffEbitda_1"] = log_diff(new_df.Ebitda_1, new_df.Ebitda_2)

#drop old values and lagged ones
new_df.drop(["Value", "Value_1", "Spendings", "Spendings_1", "MarketSpends", "MarketSpends_1", "SubMarketSpends", "SubMarketSpends_1", "DevSpends", "DevSpends_1"], axis = 1, inplace = True)
new_df.drop(["Cogs_1", "Cogs_2", "Cogs", "Revenue", "Revenue_1", "Revenue_2", "Ebitda", "Ebitda_1", "Ebitda_2"], axis = 1, inplace = True)
new_df

Unnamed: 0,Company,Device,Market,SubMarket,Year,Target,LogDiffSpendings,LogDiffMarketSpends,LogDiffSubMarketSpends,LogDiffDevSpends,LogDiffCogs,LogDiffRevenue,LogDiffEbitda,LogDiffCogs_1,LogDiffRevenue_1,LogDiffEbitda_1
0,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2019,0.071659,0.015973,-0.066253,-0.022983,-0.049027,-0.002153,0.011358,0.036500,0.098100,0.093371,0.096860
1,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2020,-0.005333,0.012252,0.019868,0.040494,-0.003689,-0.048201,-0.068179,-0.226659,-0.002153,0.011358,0.036500
2,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2021,0.333657,0.360138,0.215225,0.179305,0.234120,0.068448,0.102157,0.553441,-0.048201,-0.068179,-0.226659
3,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2022,0.108157,0.117925,0.114147,0.079144,0.133584,0.015593,0.017160,-0.035969,0.068448,0.102157,0.553441
4,ABB,Amplifier/Comparator,Industrial,Manufacturing Equipment,2023,0.146973,0.045711,0.068025,-0.009323,-0.063393,-0.690697,0.088614,0.260177,0.015593,0.017160,-0.035969
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18671,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2020,0.058855,-0.004677,0.153879,0.049009,0.085060,0.024836,-0.008282,-0.078562,0.064036,0.061363,0.095532
18672,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2021,0.227076,0.251314,0.248554,0.193021,0.276539,0.204158,0.235075,0.365411,0.024836,-0.008282,-0.078562
18673,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2022,-0.018415,-0.050844,0.120092,-0.026877,0.093313,0.051327,0.026996,-0.049583,0.204158,0.235075,0.365411
18674,Zebra Technologies,Voltage Regulator/Reference,Wireless Communications,Other Wireless,2023,-0.021244,-0.155353,-0.254600,-0.197578,-0.052669,0.849697,-0.242316,1.168255,0.051327,0.026996,-0.049583


In [30]:
new_df.to_csv("../clean_data/newtarget_df.csv")