### The Libraries

This is where I import all of the libraries I will use throughout the project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import math

### The Data

This part loads all of the data that will be used for the project into dataframes.

In [2]:
fi_2014 = pd.read_csv("Financial_Indicators/2014_Financial_Data.csv")
fi_2015 = pd.read_csv("Financial_Indicators/2015_Financial_Data.csv")
fi_2016 = pd.read_csv("Financial_Indicators/2016_Financial_Data.csv")
fi_2017 = pd.read_csv("Financial_Indicators/2017_Financial_Data.csv")
fi_2018 = pd.read_csv("Financial_Indicators/2018_Financial_Data.csv")
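
The five reads above follow one pattern, so they could also be written as a loop. This is only a minimal sketch, assuming the same `<year>_Financial_Data.csv` naming inside one folder; the helper name `load_yearly_files` and the dictionary it returns are hypothetical and not part of this notebook.

```python
from pathlib import Path

import pandas as pd


def load_yearly_files(folder, years):
    """Read one CSV per year into a dict keyed by year (hypothetical helper)."""
    frames = {}
    for year in years:
        # Assumes every file is named '<year>_Financial_Data.csv'.
        path = Path(folder) / f"{year}_Financial_Data.csv"
        frames[year] = pd.read_csv(path)
    return frames


# Possible usage with the folder layout used above:
# frames = load_yearly_files("Financial_Indicators", range(2014, 2019))
# fi_2014 = frames[2014]
```

Keeping the frames in a dictionary keyed by year makes later per-file steps easy to loop over, at the cost of not having one named variable per file.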

In [3]:
# This is a quick check of one of the files.  The files should have very similar information
# with the only difference between them being the year the data was collected.

fi_2014.head()

Unnamed: 0.1,Unnamed: 0,Revenue,Revenue Growth,Cost of Revenue,Gross Profit,R&D Expenses,SG&A Expense,Operating Expenses,Operating Income,Interest Expense,...,Receivables growth,Inventory Growth,Asset Growth,Book Value per Share Growth,Debt Growth,R&D Expense Growth,SG&A Expenses Growth,Sector,2015 PRICE VAR [%],Class
0,PG,74401000000.0,-0.0713,39030000000.0,35371000000.0,0.0,21461000000.0,21461000000.0,13910000000.0,709000000.0,...,-0.0187,-0.0217,0.0359,0.0316,0.1228,0.0,-0.1746,Consumer Defensive,-9.323276,0
1,VIPS,3734148000.0,1.1737,2805625000.0,928522600.0,108330300.0,344141400.0,793926700.0,134595900.0,12148690.0,...,,,,,,1.6484,1.7313,Consumer Defensive,-25.512193,0
2,KR,98375000000.0,0.0182,78138000000.0,20237000000.0,0.0,15196000000.0,17512000000.0,2725000000.0,443000000.0,...,0.0618,0.0981,0.1886,0.3268,0.2738,0.0,0.0234,Consumer Defensive,33.118297,1
3,RAD,25526410000.0,0.0053,18202680000.0,7323734000.0,0.0,6561162000.0,6586482000.0,737252000.0,424591000.0,...,0.0211,-0.051,-0.0189,0.1963,-0.0458,0.0,-0.006,Consumer Defensive,2.752291,1
4,GIS,17909600000.0,0.0076,11539800000.0,6369800000.0,0.0,3474300000.0,3412400000.0,2957400000.0,302400000.0,...,0.0257,0.009,0.0215,0.0274,0.1025,0.0,-0.022,Consumer Defensive,12.897715,1


In [4]:
# This is a quick check on the shape of the datasets.

print('The shape of the 2014 file is:', fi_2014.shape)
print('\n')
print('The shape of the 2015 file is:', fi_2015.shape)
print('\n')
print('The shape of the 2016 file is:', fi_2016.shape)
print('\n')
print('The shape of the 2017 file is:', fi_2017.shape)
print('\n')
print('The shape of the 2018 file is:', fi_2018.shape)

The shape of the 2014 file is: (3808, 225)


The shape of the 2015 file is: (4120, 225)


The shape of the 2016 file is: (4797, 225)


The shape of the 2017 file is: (4960, 225)


The shape of the 2018 file is: (4392, 225)


The shapes show that the files have different numbers of rows, which means that some files cover more stocks than others.

The good news is that they all have the same number of columns (features).  That makes it possible to merge the files into a single dataframe, which simplifies the rest of the project.

The 2014 file is a good example of how the datasets are built.  Since all of the files should hold the same kind of information, their column names should also be the same.  The next step is to check exactly that.
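
One way to make such a check explicit is to compare the column sets directly. The sketch below uses two toy frames with made-up values rather than the real files, and `report_column_differences` is a hypothetical helper name. It relies on `Index.equals` and `Index.difference`, which work even when the two frames have different numbers of columns.

```python
import pandas as pd


def report_column_differences(reference, other):
    """Return the column names found in only one of the two frames."""
    only_in_reference = reference.columns.difference(other.columns)
    only_in_other = other.columns.difference(reference.columns)
    return list(only_in_reference), list(only_in_other)


# Toy frames standing in for two yearly files (illustrative values only).
toy_2014 = pd.DataFrame({"Revenue": [1.0], "2015 PRICE VAR [%]": [-9.3]})
toy_2015 = pd.DataFrame({"Revenue": [2.0], "2016 PRICE VAR [%]": [4.1]})

print(toy_2014.columns.equals(toy_2015.columns))  # False: the names differ
print(report_column_differences(toy_2014, toy_2015))
```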

In [5]:
# This step will isolate the column names to check if all files have equal columns.

col_2014 = fi_2014.columns # Isolating the 2014 column names

col_2015 = fi_2015.columns # Isolating the 2015 column names

col_2016 = fi_2016.columns # Isolating the 2016 column names

col_2017 = fi_2017.columns # Isolating the 2017 column names

col_2018 = fi_2018.columns # Isolating the 2018 column names

# Now let's print the column names for the 2014 file just to familiarize with them.
for f in col_2014:
    print(f)

Unnamed: 0
Revenue
Revenue Growth
Cost of Revenue
Gross Profit
R&D Expenses
SG&A Expense
Operating Expenses
Operating Income
Interest Expense
Earnings before Tax
Income Tax Expense
Net Income - Non-Controlling int
Net Income - Discontinued ops
Net Income
Preferred Dividends
Net Income Com
EPS
EPS Diluted
Weighted Average Shs Out
Weighted Average Shs Out (Dil)
Dividend per Share
Gross Margin
EBITDA Margin
EBIT Margin
Profit Margin
Free Cash Flow margin
EBITDA
EBIT
Consolidated Income
Earnings Before Tax Margin
Net Profit Margin
Cash and cash equivalents
Short-term investments
Cash and short-term investments
Receivables
Inventories
Total current assets
Property, Plant & Equipment Net
Goodwill and Intangible Assets
Long-term investments
Tax assets
Total non-current assets
Total assets
Payables
Short-term debt
Total current liabilities
Long-term debt
Total debt
Deferred revenue
Tax Liabilities
Deposit Liabilities
Total non-current liabilities
Total liabilities
Other comprehensive income
Re

In [6]:
# Let's check how the 2014 column names compare to those of the other files.

print('The number of unequal column names comparing the 2014 dataset with the others:')

print('2014 - 2015:', (col_2014 != col_2015).sum()) # Checking amount of unmatching columns

print('2014 - 2016:', (col_2014 != col_2016).sum()) # Checking amount of unmatching columns

print('2014 - 2017:', (col_2014 != col_2017).sum()) # Checking amount of unmatching columns

print('2014 - 2018:', (col_2014 != col_2018).sum()) # Checking amount of unmatching columns

The number of unequal column names comparing the 2014 dataset with the others:
2014 - 2015: 1
2014 - 2016: 1
2014 - 2017: 1
2014 - 2018: 1


In [7]:
# Let's check which column name in the 2014 file does not match the other files' column names.

print('The different column name is:\n')

print('2014 - 2015:\n', col_2014[col_2014 != col_2015][0]) # Checking which column name is not matching
print('\n')
print('2014 - 2016:\n', col_2014[col_2014 != col_2016][0]) # Checking which column name is not matching
print('\n')
print('2014 - 2017:\n', col_2014[col_2014 != col_2017][0]) # Checking which column name is not matching
print('\n')
print('2014 - 2018:\n', col_2014[col_2014 != col_2018][0]) # Checking which column name is not matching
print('\n')

The different column name is:

2014 - 2015:
 2015 PRICE VAR [%]


2014 - 2016:
 2015 PRICE VAR [%]


2014 - 2017:
 2015 PRICE VAR [%]


2014 - 2018:
 2015 PRICE VAR [%]




The mismatch comes from a column title that contains a year: '2015 PRICE VAR [%]'.  Because the 2014 file labels this column 2015, the values appear to be taken at the end of the data year, that is, at the beginning of the following year.  The other files can be assumed to follow the same pattern, so the 2015 data should have a column named '2016 PRICE VAR [%]'.  The fix is to remove the year from the column name.

The other column name that needs to change is the very first one, which is currently 'Unnamed: 0'.  In each dataset this column will be renamed 'Stock'.

To separate the data by year, I will create a new column named "Year" that holds the year of the dataset as an integer.  This column helps organize the data after merging, since I expect many of the stock names to appear in more than one dataset.
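
The three changes described here can be sketched as one reusable function. This is an illustration on a toy frame, not the code used in this notebook: `standardize` is a hypothetical name, and the regular expression assumes that the price variation column is the only one whose name starts with a four-digit year followed by a space.

```python
import re

import pandas as pd


def standardize(frame, year):
    """Return a copy with unified column names and a 'Year' column."""
    renamed = {}
    for name in frame.columns:
        if name == "Unnamed: 0":
            renamed[name] = "Stock"
        else:
            # Drop a leading four-digit year, e.g. '2015 PRICE VAR [%]'.
            renamed[name] = re.sub(r"^\d{4}\s+", "", name)
    out = frame.rename(columns=renamed)  # rename returns a new dataframe
    out["Year"] = year
    return out


toy = pd.DataFrame({"Unnamed: 0": ["PG"], "Revenue": [1.0], "2015 PRICE VAR [%]": [-9.3]})
print(list(standardize(toy, 2014).columns))
# ['Stock', 'Revenue', 'PRICE VAR [%]', 'Year']
```

Because the function returns a new dataframe instead of renaming in place, the original frame stays untouched and the same call can be repeated for every year.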

In [8]:
fi_2014.rename(columns = {'Unnamed: 0' : 'Stock', # Changing 'Stock' column name
                          '2015 PRICE VAR [%]' : 'PRICE VAR [%]'}, inplace = True) # Changing 'PRICE VAR' column name
fi_2014['Year'] = 2014 # Creating 'Year' Column

fi_2015.rename(columns = {'Unnamed: 0' : 'Stock', # Changing 'Stock' column name
                          '2016 PRICE VAR [%]' : 'PRICE VAR [%]'}, inplace = True) # Changing 'PRICE VAR' column name
fi_2015['Year'] = 2015 # Creating 'Year' Column

fi_2016.rename(columns = {'Unnamed: 0' : 'Stock', # Changing 'Stock' column name
                          '2017 PRICE VAR [%]' : 'PRICE VAR [%]'}, inplace = True) # Changing 'PRICE VAR' column name
fi_2016['Year'] = 2016 # Creating 'Year' Column

fi_2017.rename(columns = {'Unnamed: 0' : 'Stock', # Changing 'Stock' column name
                          '2018 PRICE VAR [%]' : 'PRICE VAR [%]'}, inplace = True) # Changing 'PRICE VAR' column name
fi_2017['Year'] = 2017 # Creating 'Year' Column

fi_2018.rename(columns = {'Unnamed: 0' : 'Stock', # Changing 'Stock' column name
                          '2019 PRICE VAR [%]' : 'PRICE VAR [%]'}, inplace = True) # Changing 'PRICE VAR' column name
fi_2018['Year'] = 2018 # Creating 'Year' Column

Following the column changes, we will now check that they succeeded.  Success here means that the column names in all of the datasets match.

In [9]:
# The first step to check whether the column names match is to isolate the column names from the datasets.

col_2014 = fi_2014.columns # Isolating the 2014 column names

col_2015 = fi_2015.columns # Isolating the 2015 column names

col_2016 = fi_2016.columns # Isolating the 2016 column names

col_2017 = fi_2017.columns # Isolating the 2017 column names

col_2018 = fi_2018.columns # Isolating the 2018 column names

# The second step is to compare and see if they match.

# For this step it is enough to compare one dataset's columns, 2014 in this case, with each of the others.
# If 2015 matches 2014 and 2016 matches 2014, then logically the 2015 and 2016 column names
# must also match each other.

print('The number of unequal column names comparing the 2014 dataset with the others:')

print('2014 - 2015:', (col_2014 != col_2015).sum()) # Checking amount of unmatching columns

print('2014 - 2016:', (col_2014 != col_2016).sum()) # Checking amount of unmatching columns

print('2014 - 2017:', (col_2014 != col_2017).sum()) # Checking amount of unmatching columns

print('2014 - 2018:', (col_2014 != col_2018).sum()) # Checking amount of unmatching columns

The number of unequal column names comparing the 2014 dataset with the others:
2014 - 2015: 0
2014 - 2016: 0
2014 - 2017: 0
2014 - 2018: 0


With all of the datasets having the same column names, it is time to merge all of the files.

In [10]:
# Since the files are already dataframes, let's first make a list of them.

data_frames = [fi_2014, fi_2015, fi_2016, fi_2017, fi_2018]

# Now stack them into a single dataframe.  pd.concat is used here because
# DataFrame.append was removed in pandas 2.0.

df = pd.concat(data_frames) # Merging the yearly dataframes, keeping each file's original row index

In [11]:
# Let's check how the new dataframe looks

df

Unnamed: 0,Stock,Revenue,Revenue Growth,Cost of Revenue,Gross Profit,R&D Expenses,SG&A Expense,Operating Expenses,Operating Income,Interest Expense,...,Inventory Growth,Asset Growth,Book Value per Share Growth,Debt Growth,R&D Expense Growth,SG&A Expenses Growth,Sector,PRICE VAR [%],Class,Year
0,PG,7.440100e+10,-0.0713,3.903000e+10,3.537100e+10,0.000000e+00,2.146100e+10,2.146100e+10,1.391000e+10,7.090000e+08,...,-0.0217,0.0359,0.0316,0.1228,0.0000,-0.1746,Consumer Defensive,-9.323276,0,2014
1,VIPS,3.734148e+09,1.1737,2.805625e+09,9.285226e+08,1.083303e+08,3.441414e+08,7.939267e+08,1.345959e+08,1.214869e+07,...,,,,,1.6484,1.7313,Consumer Defensive,-25.512193,0,2014
2,KR,9.837500e+10,0.0182,7.813800e+10,2.023700e+10,0.000000e+00,1.519600e+10,1.751200e+10,2.725000e+09,4.430000e+08,...,0.0981,0.1886,0.3268,0.2738,0.0000,0.0234,Consumer Defensive,33.118297,1,2014
3,RAD,2.552641e+10,0.0053,1.820268e+10,7.323734e+09,0.000000e+00,6.561162e+09,6.586482e+09,7.372520e+08,4.245910e+08,...,-0.0510,-0.0189,0.1963,-0.0458,0.0000,-0.0060,Consumer Defensive,2.752291,1,2014
4,GIS,1.790960e+10,0.0076,1.153980e+10,6.369800e+09,0.000000e+00,3.474300e+09,3.412400e+09,2.957400e+09,3.024000e+08,...,0.0090,0.0215,0.0274,0.1025,0.0000,-0.0220,Consumer Defensive,12.897715,1,2014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4387,YRIV,0.000000e+00,0.0000,0.000000e+00,0.000000e+00,0.000000e+00,3.755251e+06,3.755251e+06,-3.755251e+06,1.105849e+07,...,0.0000,-0.0508,-0.1409,-0.0152,0.0000,-0.2602,Real Estate,-90.962099,0,2018
4388,YTEN,5.560000e+05,-0.4110,0.000000e+00,5.560000e+05,4.759000e+06,5.071000e+06,9.830000e+06,-9.274000e+06,0.000000e+00,...,0.0000,-0.2323,-0.8602,0.0000,0.0352,-0.0993,Basic Materials,-77.922077,0,2018
4389,ZKIN,5.488438e+07,0.2210,3.659379e+07,1.829059e+07,1.652633e+06,7.020320e+06,8.672953e+06,9.617636e+06,1.239170e+06,...,0.7706,0.2489,0.4074,-0.0968,0.2415,0.8987,Basic Materials,-17.834400,0,2018
4390,ZOM,0.000000e+00,0.0000,0.000000e+00,0.000000e+00,1.031715e+07,4.521349e+06,1.664863e+07,-1.664863e+07,0.000000e+00,...,0.0000,0.1568,-0.2200,0.0000,2.7499,0.1457,Industrials,-73.520000,0,2018


Now that we can see the new dataframe is built as expected, it is time to set the index.  For this dataframe the index will consist of both the 'Year' and the 'Stock' columns.  The 'Year' column keeps the files separated within the single dataframe, and the 'Stock' column is the original identifier, since each stock has different feature values.
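
To show what a two-level index makes possible, here is a small sketch on a toy frame with made-up revenue values. The names `toy`, `rows_2014`, `pg_history` and `single` are illustrative only.

```python
import pandas as pd

toy = (
    pd.DataFrame(
        {
            "Year": [2014, 2014, 2015, 2015],
            "Stock": ["PG", "KR", "PG", "KR"],
            "Revenue": [10.0, 20.0, 11.0, 21.0],
        }
    )
    .set_index(["Year", "Stock"])
    .sort_index()  # sorting keeps label-based lookups predictable
)

rows_2014 = toy.loc[2014]                   # all stocks for one year
pg_history = toy.xs("PG", level="Stock")    # one stock across all years
single = toy.loc[(2015, "KR"), "Revenue"]   # one (year, stock) value

print(rows_2014)
print(pg_history)
print(single)
```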

In [12]:
# Setting the index of the dataframe

df = df.set_index(['Year', 'Stock'])

In [13]:
# Checking the dataframe with the new index

df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue,Revenue Growth,Cost of Revenue,Gross Profit,R&D Expenses,SG&A Expense,Operating Expenses,Operating Income,Interest Expense,Earnings before Tax,...,Receivables growth,Inventory Growth,Asset Growth,Book Value per Share Growth,Debt Growth,R&D Expense Growth,SG&A Expenses Growth,Sector,PRICE VAR [%],Class
Year,Stock,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
2014,PG,74401000000.0,-0.0713,39030000000.0,35371000000.0,0.0,21461000000.0,21461000000.0,13910000000.0,709000000.0,14494000000.0,...,-0.0187,-0.0217,0.0359,0.0316,0.1228,0.0,-0.1746,Consumer Defensive,-9.323276,0
2014,VIPS,3734148000.0,1.1737,2805625000.0,928522600.0,108330300.0,344141400.0,793926700.0,134595900.0,12148690.0,175382300.0,...,,,,,,1.6484,1.7313,Consumer Defensive,-25.512193,0
2014,KR,98375000000.0,0.0182,78138000000.0,20237000000.0,0.0,15196000000.0,17512000000.0,2725000000.0,443000000.0,2270000000.0,...,0.0618,0.0981,0.1886,0.3268,0.2738,0.0,0.0234,Consumer Defensive,33.118297,1
2014,RAD,25526410000.0,0.0053,18202680000.0,7323734000.0,0.0,6561162000.0,6586482000.0,737252000.0,424591000.0,250218000.0,...,0.0211,-0.051,-0.0189,0.1963,-0.0458,0.0,-0.006,Consumer Defensive,2.752291,1
2014,GIS,17909600000.0,0.0076,11539800000.0,6369800000.0,0.0,3474300000.0,3412400000.0,2957400000.0,302400000.0,2707700000.0,...,0.0257,0.009,0.0215,0.0274,0.1025,0.0,-0.022,Consumer Defensive,12.897715,1


The dataframe now has 224 columns.  That is so many that not all of them are shown when we simply display the dataframe.  Because of that, it is wise to check whether the dataframe contains any duplicate columns.

In [14]:
# To check for duplicates, let's first create a list of all column names.
# The names on the list should be lower-cased and without any space.  
# This will make it easier to check the strings in the list.

lst1 = [] # Creating an empty list.

for c in df.columns:
    lst1.append(c.replace(' ', '').lower()) # Adding the column names to the list lower-cased and without space.
    
# Let's check the newly created list

lst1

['revenue',
 'revenuegrowth',
 'costofrevenue',
 'grossprofit',
 'r&dexpenses',
 'sg&aexpense',
 'operatingexpenses',
 'operatingincome',
 'interestexpense',
 'earningsbeforetax',
 'incometaxexpense',
 'netincome-non-controllingint',
 'netincome-discontinuedops',
 'netincome',
 'preferreddividends',
 'netincomecom',
 'eps',
 'epsdiluted',
 'weightedaverageshsout',
 'weightedaverageshsout(dil)',
 'dividendpershare',
 'grossmargin',
 'ebitdamargin',
 'ebitmargin',
 'profitmargin',
 'freecashflowmargin',
 'ebitda',
 'ebit',
 'consolidatedincome',
 'earningsbeforetaxmargin',
 'netprofitmargin',
 'cashandcashequivalents',
 'short-terminvestments',
 'cashandshort-terminvestments',
 'receivables',
 'inventories',
 'totalcurrentassets',
 'property,plant&equipmentnet',
 'goodwillandintangibleassets',
 'long-terminvestments',
 'taxassets',
 'totalnon-currentassets',
 'totalassets',
 'payables',
 'short-termdebt',
 'totalcurrentliabilities',
 'long-termdebt',
 'totaldebt',
 'deferredrevenue',
 't

The new list keeps the names in the same order as they appear in the dataframe.  To preserve that order, I will turn the list into a dataframe, so that the index of the new dataframe matches the column positions in the 'original' dataframe (df).

Creating a new dataframe also makes it possible to check for duplicates with the pandas *duplicated* method.

In [15]:
# Creating the columns names dataframe

col_df = pd.DataFrame({"Column Names" : lst1})

# Checking the column names dataframe

col_df

Unnamed: 0,Column Names
0,revenue
1,revenuegrowth
2,costofrevenue
3,grossprofit
4,r&dexpenses
...,...
219,r&dexpensegrowth
220,sg&aexpensesgrowth
221,sector
222,pricevar[%]


With the new dataframe in place, it is time to check for duplicates.  Because of how the *duplicated* method works, two dataframes will be created.  One holds the column names that come 'second', and the other holds the names that come 'first', following the order in the original dataframe.
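
The effect of the `keep` argument is easiest to see on a tiny example. The names in this sketch are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

names = pd.DataFrame({"Column Names": ["eps", "ebit", "eps", "roe", "ebit"]})

later = names[names.duplicated()]               # default keep='first' flags the later repeats
earlier = names[names.duplicated(keep="last")]  # keep='last' flags the earlier repeats
every = names[names.duplicated(keep=False)]     # keep=False flags every copy

print(list(later.index))    # [2, 4]
print(list(earlier.index))  # [0, 1]
print(list(every.index))    # [0, 1, 2, 4]
```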

In [16]:
# Creating the duplicate dataframe with the 'second' names.

duplicates_last = col_df[col_df.duplicated()]

# Creating the duplicate dataframe with the 'first' names.

duplicates_first = col_df[col_df.duplicated(keep = 'last')]

In [17]:
# Let's see the duplicates

duplicates_last

Unnamed: 0,Column Names
93,netprofitmargin
98,niperebt
99,ebtperebit
100,ebitperrevenue
133,operatingcashflowpershare
134,freecashflowpershare
135,cashpershare
143,pricetosalesratio
157,currentratio
158,interestcoverage


In [18]:
duplicates_first

Unnamed: 0,Column Names
30,netprofitmargin
77,pricetosalesratio
84,dividendyield
87,ebitperrevenue
88,ebtperebit
89,niperebt
101,payablesturnover
102,inventoryturnover
105,currentratio
117,interestcoverage


The same column names appear in both dataframes, but with different index values.  Those index values are the positions of the duplicated columns in the original dataframe.

Although the column names are in both dataframes, they are not in the same order.  That is easy to fix by sorting both dataframes by the 'Column Names' value.

In [19]:
# Ordering the dataframes while maintaining the same dataframe names.

duplicates_last = duplicates_last.sort_values(by = ['Column Names'])

duplicates_first = duplicates_first.sort_values(by = ['Column Names'])

In [20]:
duplicates_last

Unnamed: 0,Column Names
135,cashpershare
157,currentratio
160,dividendyield
100,ebitperrevenue
99,ebtperebit
134,freecashflowpershare
158,interestcoverage
185,inventoryturnover
93,netprofitmargin
98,niperebt


In [21]:
duplicates_first

Unnamed: 0,Column Names
122,cashpershare
105,currentratio
84,dividendyield
87,ebitperrevenue
88,ebtperebit
121,freecashflowpershare
117,interestcoverage
102,inventoryturnover
30,netprofitmargin
89,niperebt


Now both dataframes are ordered by 'Column Names' rather than by index value.  The good news is that each index value stayed with its column name.

For the next step, a duplicates dataframe will be created with those column names in a more readable format.  It will compare the values, or lack thereof, in the duplicated columns of the original dataframe.

In [22]:
# Creating a list of the column names to be used as an index.

duplicates_columns = ['Cash per Share', 'Current Ratio', 'Dividend Yield', 'EBIT per Revenue',
 'EBT per EBIT', 'Free Cash Flow per Share', 'Interest Coverage',
 'Inventory Turnover', 'Net Profit Margin', 'NI per EBT',
 'Operating Cash Flow per Share', 'Payables Turnover', 'Payout Ratio',
 'Price to Sales Ratio']

In [23]:
# Creating the new dataframe with the new index

duplicates = pd.DataFrame(index = duplicates_columns)

In [24]:
duplicates

Cash per Share
Current Ratio
Dividend Yield
EBIT per Revenue
EBT per EBIT
Free Cash Flow per Share
Interest Coverage
Inventory Turnover
Net Profit Margin
NI per EBT
Operating Cash Flow per Share


In [25]:
# Creating a list of the columns from the original dataframe using the index gathered earlier in the duplicates process.

# The index is used in the order found when ordering the 'Column Names' alphabetically.

list1 = list(df.columns[[122, 105, 84, 87, 88, 121, 117, 102, 30, 89, 120, 101, 123, 77]]) # List of the 1st duplicates.

list2 = list(df.columns[[135, 157, 160, 100, 99, 134, 158, 185, 93, 98, 133, 184, 161, 143]]) # List of the 2nd duplicates.
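
Typing the positions by hand works, but they could also be derived from the normalized names. The following is a sketch on a toy column list, not the notebook's method: `duplicate_positions` is a hypothetical helper, and it assumes each duplicated name occurs exactly twice.

```python
import pandas as pd


def duplicate_positions(columns):
    """Map each repeated normalized name to the positions where it occurs."""
    normalized = pd.Series([c.replace(" ", "").lower() for c in columns])
    repeated = normalized[normalized.duplicated(keep=False)]
    positions = {}
    for pos, name in repeated.items():
        positions.setdefault(name, []).append(pos)
    # Sort by name so the order matches an alphabetical listing.
    return dict(sorted(positions.items()))


toy_columns = ["Net Profit Margin", "EPS", "netProfitMargin", "Current Ratio", "currentRatio"]
pairs = duplicate_positions(toy_columns)
first_positions = [p[0] for p in pairs.values()]
second_positions = [p[1] for p in pairs.values()]

print(pairs)             # {'currentratio': [3, 4], 'netprofitmargin': [0, 2]}
print(first_positions)   # [3, 0]
print(second_positions)  # [4, 2]
```

Deriving the positions this way avoids transcription mistakes and keeps the two lists aligned if the column order ever changes.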

The first values to be added to the new dataframe will be the quantity of missing values each duplicate column has.

In [26]:
# Creating an array with the number of missing values in each set of duplicate columns.

# The first set holds the duplicate columns that appear first in the original dataframe.
first = np.array(df[list1].isna().sum())

# The second set holds the duplicate columns that appear second in the original dataframe.
second = np.array(df[list2].isna().sum())

The next values are the total number of non-matching values between each pair of duplicate columns.

In [27]:
# Counting the non-matching values in each pair of duplicate columns.
# Note that NaN != NaN evaluates to True, so rows with missing values count as non-matching.

unmatched = []

for x, y in zip(list1, list2):
    unmatched.append(int((df[x] != df[y]).sum()))

Now let's add those values to the duplicates dataframe.  I will also derive new columns from the values gathered above.

In [28]:
duplicates['MV1'] = first # Missing values from the first set of columns.

duplicates['MV2'] = second # Missing values from the second set of columns.

duplicates['(MV1 - MV2)'] = duplicates.iloc[:, 0] - duplicates.iloc[:, 1] # Difference between first and second set.

duplicates['UV'] = unmatched # Amount of unmatching values between the two sets of columns.

duplicates['(UV - MV1)'] = duplicates.iloc[:, 3] - duplicates.iloc[:, 0] # Difference between unmatched and first set of missing values

duplicates['(UV - MV2)'] = duplicates.iloc[:, 3] - duplicates.iloc[:, 1] # Difference between unmatched and second set of missing values


In [29]:
# Let's check the duplicates data
duplicates

Unnamed: 0,MV1,MV2,(MV1 - MV2),UV,(UV - MV1),(UV - MV2)
Cash per Share,2429,2430,-1,2430,1,0
Current Ratio,3065,6516,-3451,6516,3451,0
Dividend Yield,3298,2246,1052,12742,9444,10496
EBIT per Revenue,2862,2862,0,2862,0,0
EBT per EBIT,8007,8007,0,8007,0,0
Free Cash Flow per Share,2424,2412,12,2425,1,13
Interest Coverage,2187,2157,30,2188,1,31
Inventory Turnover,2512,2512,0,2512,0,0
Net Profit Margin,1722,2862,-1140,22017,20295,19155
NI per EBT,8595,8595,0,8595,0,0


We can see that for some of the column pairs the number of missing values equals the number of non-matching values.  Since a missing value never compares equal to anything, not even to another missing value, this means that the missing entries are the only ones that do not match between the two columns.  We can therefore extract one column of each such pair without worrying about affecting the final result of the project.
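
The reason missing values show up as non-matching can be seen on two short series. This sketch uses made-up numbers, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0, np.nan])
b = pd.Series([1.0, np.nan, 3.5, 4.0])

# NaN != NaN is True, so a gap shared by both series counts as a mismatch.
naive_mismatches = int((a != b).sum())

# Ignoring the rows where both values are missing leaves the real differences.
both_missing = a.isna() & b.isna()
true_mismatches = int(((a != b) & ~both_missing).sum())

print(naive_mismatches)  # 3
print(true_mismatches)   # 2
```

When the naive count equals the number of missing values in both columns, no real differences are left, which is the situation described above.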

In [30]:
# First let's extract the index of the columns for which the difference in missing values is '0'.

exact_duplicates = duplicates[duplicates.iloc[:, 2] == 0].index

# Now let's put that index into a list.

list1 = []

for i in exact_duplicates:
    list1.append(duplicates_columns.index(i))
    
print(list1)

[3, 4, 7, 9, 11]


In [31]:
# The positions in list1 will now be used to find, in the original dataframe,
# the columns that will be extracted.

# list1 holds positions within the alphabetically ordered list of duplicate names.
# To map those positions back to the lower-cased, space-free names, I need the
# duplicate names in the same alphabetical order, which is the order of the
# duplicates_last dataframe.
dl_col = list(duplicates_last['Column Names'])

# Now to match the positions with the names
equal = [dl_col[i] for i in list1]

# Keeping every row of col_df whose name is one of the exact duplicates
equal_df = col_df[col_df['Column Names'].isin(equal)]

equal_duplicated = equal_df[equal_df.duplicated()]

# Here is where I will create a list of the columns that I plan to extract from the original dataframe.
# The first column names to be appended to the list will be the ones that are the 
# exact match of another column in the dataframe
extract = []

for col in equal_duplicated.values:
    extract += list(col)
    
print(extract)

exact_matches = list(equal_duplicated.index)

print(exact_matches)

['niperebt', 'ebtperebit', 'ebitperrevenue', 'payablesturnover', 'inventoryturnover']
[98, 99, 100, 184, 185]


In [32]:
df_exact_matches = df.iloc[:, exact_matches]

In [33]:
df_exact_matches

Unnamed: 0_level_0,Unnamed: 1_level_0,nIperEBT,eBTperEBIT,eBITperRevenue,Payables Turnover,Inventory Turnover
Year,Stock,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2014,PG,0.803298,0.953364,0.204339,1.5648,10.8869
2014,VIPS,0.774438,0.935218,0.050221,,
2014,KR,0.669163,0.836712,0.027578,4.9593,18.2227
2014,RAD,0.996787,0.370798,0.026436,4.9289,8.3030
2014,GIS,0.673782,0.899538,0.168072,2.8234,11.5363
...,...,...,...,...,...,...
2018,YRIV,,,,0.0000,0.0000
2018,YTEN,,,-16.492806,1.3990,0.0000
2018,ZKIN,0.833869,0.871662,0.175924,,3.9427
2018,ZOM,,,,0.0000,0.0000
