# Importing Libraries

In [1]:
import pandas as pd
import numpy as np

# Loading the Files

In [2]:
headers = ['Company','Company_Score','Reviews','Director','Director_Score','Company_Employee','Company_Revenue','Company_Sector_1','Company_URL_1','Company_Data_1','Company_Data_2','Company_Sector_2','Company_Sector_3','Null_column','Company_URL_2','Company_Data_3']

Companies = pd.read_excel('companies.xlsx', names = headers)
Companies

Unnamed: 0,Company,Company_Score,Reviews,Director,Director_Score,Company_Employee,Company_Revenue,Company_Sector_1,Company_URL_1,Company_Data_1,Company_Data_2,Company_Sector_2,Company_Sector_3,Null_column,Company_URL_2,Company_Data_3
0,BOK Financial,3.6,468 evaluaciones,Stacy Kymes,0.80,,más de $10 mil millones USD,Finanzas,https://www.bokfinancial.com/,,,,,,,De 1001 a 5000
1,Live Nation,4.1,"1,173 evaluaciones",Michael Rapino,0.89,más de 10 000,,Audiovisual y medios de comunicación,http://www.livenationentertainment.com/careers,,,,,,,más de 10 000
2,Amex,4.1,"8,889 evaluaciones",Stephen J Squeri,0.85,más de 10 000,más de $10 mil millones USD,Finanzas,https://www.americanexpress.com/,,,,,,,más de 10 000
3,Staples,3.4,"13,154 evaluaciones",Michael Motz,0.61,más de 10 000,más de $10 mil millones USD,Ventas al mayoreo y al menudeo,https://www.staples.com/,,,,,,,más de 10 000
4,M&T Bank,3.5,"2,474 evaluaciones",René Jones,0.78,más de 10 000,,Finanzas,https://www.mtb.com/,,,,,,,más de 10 000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8788,Coppin State University,4.0,145 evaluaciones,,,,,,,201 a 500,,,,,,
8789,Acnusa,,,,,,,,,,,,,,,
8790,Southwest Allen County Schools,4.4,17 evaluaciones,,,,,Sitio web de Southwest Allen County Schools,,De 1001 a 5000,500 mdp a 2 mil mdp MXN,Educación,,,,
8791,City of Santa Barbara,4.1,35 evaluaciones,,,,,,,De 1001 a 5000,,,,,,


# Data Cleaning

Let's explore the dataset:

In [3]:
Companies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8793 entries, 0 to 8792
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           8792 non-null   object 
 1   Company_Score     6643 non-null   float64
 2   Reviews           6643 non-null   object 
 3   Director          2785 non-null   object 
 4   Director_Score    2421 non-null   float64
 5   Company_Employee  1080 non-null   object 
 6   Company_Revenue   455 non-null    object 
 7   Company_Sector_1  3049 non-null   object 
 8   Company_URL_1     2020 non-null   object 
 9   Company_Data_1    1134 non-null   object 
 10  Company_Data_2    2564 non-null   object 
 11  Company_Sector_2  883 non-null    object 
 12  Company_Sector_3  1630 non-null   object 
 13  Null_column       0 non-null      float64
 14  Company_URL_2     1558 non-null   object 
 15  Company_Data_3    3868 non-null   object 
dtypes: float64(3), object(13)
memory usage: 1.

In [4]:
Companies.describe()

Unnamed: 0,Company_Score,Director_Score,Null_column
count,6643.0,2421.0,0.0
mean,3.617628,0.736415,
std,0.633907,0.112766,
min,1.0,0.25,
25%,3.3,0.67,
50%,3.6,0.75,
75%,4.0,0.82,
max,5.0,1.0,


A few observations:

- Only three columns are numerical. We can drop Null_column, since it has no values. The "Reviews" column should also be numerical.
- Some columns are repeated. The extraction from Power Automate was not as clean as it could have been, so we need to merge each group of repeated columns into a single one.
- The "Company_Data_x" columns contain either the number of employees or the company revenue.
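The first observation can also be checked programmatically rather than read off the `info()` output. The following is a minimal sketch on a small hypothetical frame (`toy` stands in for `Companies`; its values are made up for illustration) that lists the columns with no values at all:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Companies
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Company_Score": [3.6, np.nan, 4.1],
    "Null_column": [np.nan, np.nan, np.nan],
})

# Share of missing values per column (0.0 = complete, 1.0 = entirely empty)
missing_share = toy.isna().mean()

# Columns that are entirely empty and therefore safe to drop
empty_cols = missing_share[missing_share == 1.0].index.tolist()
print(empty_cols)
```

On the real data, the same `isna().mean()` call would also show how sparse each of the repeated columns is before they are merged.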

Let's work on it!

In [5]:
Companies = Companies.drop(columns=['Null_column'])

## Reviews

We need to convert the "Reviews" column to a numerical type:

In [6]:
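# Missing review counts become '0' so the column can be cast to int
Companies['Reviews'] = Companies['Reviews'].fillna('0')
# Strip every non-digit character ("1,173 evaluaciones" -> "1173"), then cast
Companies['Reviews'] = Companies['Reviews'].replace(regex=r'\D', value='').astype(int)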

Companies

Unnamed: 0,Company,Company_Score,Reviews,Director,Director_Score,Company_Employee,Company_Revenue,Company_Sector_1,Company_URL_1,Company_Data_1,Company_Data_2,Company_Sector_2,Company_Sector_3,Company_URL_2,Company_Data_3
0,BOK Financial,3.6,468,Stacy Kymes,0.80,,más de $10 mil millones USD,Finanzas,https://www.bokfinancial.com/,,,,,,De 1001 a 5000
1,Live Nation,4.1,1173,Michael Rapino,0.89,más de 10 000,,Audiovisual y medios de comunicación,http://www.livenationentertainment.com/careers,,,,,,más de 10 000
2,Amex,4.1,8889,Stephen J Squeri,0.85,más de 10 000,más de $10 mil millones USD,Finanzas,https://www.americanexpress.com/,,,,,,más de 10 000
3,Staples,3.4,13154,Michael Motz,0.61,más de 10 000,más de $10 mil millones USD,Ventas al mayoreo y al menudeo,https://www.staples.com/,,,,,,más de 10 000
4,M&T Bank,3.5,2474,René Jones,0.78,más de 10 000,,Finanzas,https://www.mtb.com/,,,,,,más de 10 000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8788,Coppin State University,4.0,145,,,,,,,201 a 500,,,,,
8789,Acnusa,,0,,,,,,,,,,,,
8790,Southwest Allen County Schools,4.4,17,,,,,Sitio web de Southwest Allen County Schools,,De 1001 a 5000,500 mdp a 2 mil mdp MXN,Educación,,,
8791,City of Santa Barbara,4.1,35,,,,,,,De 1001 a 5000,,,,,


Nice!
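Note that filling the gaps with '0' makes a company with no review data (such as Acnusa) indistinguishable from one with zero reviews. If that distinction matters, one option is pandas' nullable integer type. This is a sketch of an alternative, not what the notebook does; `raw` is a hypothetical sample of the column:

```python
import pandas as pd

# Hypothetical sample of the raw "Reviews" column, including a missing value
raw = pd.Series(["468 evaluaciones", "1,173 evaluaciones", None], dtype=object)

# Keep only the digits; missing entries stay missing
digits = raw.str.replace(r"\D", "", regex=True)

# Convert to numbers, then to the nullable Int64 dtype so <NA> is preserved
reviews = pd.to_numeric(digits, errors="coerce").astype("Int64")
print(reviews)
```

With this variant, aggregations such as `reviews.mean()` skip the missing rows instead of counting them as zeros.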

## Sector

Let's merge the Sector columns:

In [7]:
Companies['Company_Sector_1'].head(50).values

array(['Finanzas', 'Audiovisual y medios de comunicación', 'Finanzas',
       'Ventas al mayoreo y al menudeo', 'Finanzas',
       'Sitio web de Equitable', 'Salud', 'Venta al por mayor', nan, nan,
       'Electricidad y Servicios públicos', nan, nan, nan, nan,
       'Ventas al mayoreo y al menudeo',
       'Fabricación de productos electrónicos',
       'Energía, minería e infrastructura pública', nan, 'Salud', nan,
       'Manufactura', 'Farmacéutica y biotecnología',
       'Aeroespacial y defensa', nan, nan, nan, nan, 'Salud',
       'Compañías de seguros', nan, 'Sitio web de Accella',
       'Sitio web de Carnival Cruise Line',
       'Ventas al mayoreo y al menudeo', nan, nan, nan, nan, nan,
       'Sitio web de Broward County Public Schools', 'Finanzas',
       'Sitio web de MicroAire Surgical Instruments',
       'Telecomunicaciones', 'Banca y Servicios de crédito', nan,
       'Seguros', 'Manufactura', nan,
       'Sitio web de Eugene Water and Electric Board', nan], dtype=object)

In [8]:
Companies['Company_Sector_2'].head(50).values

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan,
       'Servicios de construcción, reparación y mantenimiento', nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, 'Ocio y Cultura', nan,
       nan, nan, nan, nan, nan, 'Educación', nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan], dtype=object)

In [9]:
Companies['Company_Sector_3'].head(50).values

array([nan, nan, nan, nan, nan, 'Finanzas', nan, nan, nan, nan, nan,
       'Educación', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, 'Fabricación de maquinaria', nan, nan, nan, nan, nan, nan,
       nan, nan, nan, 'Tecnologías de la información', nan, nan, nan,
       'Tecnologías de la información', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, 'Energía, minería e infrastructura pública'],
      dtype=object)

The values in columns 2 and 3 look good, but column 1 needs cleaning. We need to blank out every value containing "Sitio web" ("Website of"), because that is a website label, not a sector:

In [10]:
Companies['Company_Sector_1'] = Companies['Company_Sector_1'].fillna('')
Companies.loc[Companies['Company_Sector_1'].str.contains('Sitio web'), 'Company_Sector_1'] = ''

In [11]:
Companies['Company_Sector_1'].head(50).values

array(['Finanzas', 'Audiovisual y medios de comunicación', 'Finanzas',
       'Ventas al mayoreo y al menudeo', 'Finanzas', '', 'Salud',
       'Venta al por mayor', '', '', 'Electricidad y Servicios públicos',
       '', '', '', '', 'Ventas al mayoreo y al menudeo',
       'Fabricación de productos electrónicos',
       'Energía, minería e infrastructura pública', '', 'Salud', '',
       'Manufactura', 'Farmacéutica y biotecnología',
       'Aeroespacial y defensa', '', '', '', '', 'Salud',
       'Compañías de seguros', '', '', '',
       'Ventas al mayoreo y al menudeo', '', '', '', '', '', '',
       'Finanzas', '', 'Telecomunicaciones',
       'Banca y Servicios de crédito', '', 'Seguros', 'Manufactura', '',
       '', ''], dtype=object)
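The same cleanup can be done without the empty-string placeholder by masking the offending values straight to NaN. A minimal sketch, assuming a hypothetical sample `sector` of the column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Company_Sector_1, with a website label and a gap
sector = pd.Series(
    ["Finanzas", "Sitio web de Equitable", np.nan, "Salud"], dtype=object
)

# na=False treats missing values as "no match", so no fillna('') is needed
is_website = sector.str.contains("Sitio web", regex=False, na=False)

# mask() replaces the flagged entries with NaN and leaves the rest untouched
cleaned = sector.mask(is_website)
print(cleaned)
```

Keeping missing values as NaN throughout means the later `replace('', np.nan)` step would not be required.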

Nice! Now let's merge the three columns:

In [12]:
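# Priority order: Company_Sector_3, then Company_Sector_2, then Company_Sector_1
Companies['Sector'] = Companies['Company_Sector_3'].fillna(Companies['Company_Sector_2']).fillna(Companies['Company_Sector_1'])
# Company_Sector_1 uses '' for missing values; turn those back into NaN
Companies['Sector'] = Companies['Sector'].replace('', np.nan)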

In [13]:
Companies['Sector'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 8793 entries, 0 to 8792
Series name: Sector
Non-Null Count  Dtype 
--------------  ----- 
4548 non-null   object
dtypes: object(1)
memory usage: 68.8+ KB
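The chained `fillna` in In [12] takes the value from Company_Sector_3 first, then Company_Sector_2, then Company_Sector_1, so any row with more than one sector filled in silently keeps only the highest-priority one. The sketch below, on a hypothetical toy frame where missing values are NaN in all three columns, writes the same coalescing with `bfill` and flags such rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; row 3 has two different sectors filled in
toy = pd.DataFrame({
    "Company_Sector_1": ["Finanzas", np.nan, np.nan, "Salud"],
    "Company_Sector_2": [np.nan, "Educación", np.nan, np.nan],
    "Company_Sector_3": [np.nan, np.nan, np.nan, "Seguros"],
}, dtype=object)

# Columns ordered from highest to lowest priority
priority = ["Company_Sector_3", "Company_Sector_2", "Company_Sector_1"]

# Back-fill across columns: the first column ends up holding the first
# non-null value found in priority order
sector = toy[priority].bfill(axis=1).iloc[:, 0]

# Rows where more than one source column carries a value
filled = toy[priority].notna().sum(axis=1)
conflicts = toy[filled > 1]
print(sector.tolist())
print(conflicts)
```

If `conflicts` is non-empty on the real data, it is worth inspecting those rows before trusting the merged column.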


In [14]:
Companies['Sector'].value_counts().head(50)

Salud                                                    545
Educación                                                357
Manufactura                                              355
Tecnologías de la información                            349
Finanzas                                                 265
Gobierno y administración pública                        184
Ventas al mayoreo y al menudeo                           166
Aeroespacial y defensa                                   122
ONG y Organizaciones sin fines de lucro                  117
Energía, minería e infrastructura pública                 94
Seguros                                                   93
Transporte y logística                                    92
Farmacéutica y biotecnología                              81
Banca y Servicios de crédito                              80
Electricidad y Servicios públicos                         73
Servicios de construcción, reparación y mantenimiento     72
Audiovisual y medios de 

Perfect!

## URL

We do the same with the URL columns, merging them into a single one:

In [15]:
Companies['Company_URL_1'].head(50).values

array(['https://www.bokfinancial.com/',
       'http://www.livenationentertainment.com/careers',
       'https://www.americanexpress.com/', 'https://www.staples.com/',
       'https://www.mtb.com/', nan, 'https://www.geisinger.org/',
       'http://www.yamaha-motor.com/', nan, nan, 'http://careers.abb/',
       nan, nan, nan, nan, 'http://www.publix.jobs/',
       'https://www.intel.com/', 'http://www.halliburton.com/', nan,
       'http://www.exactsciences.com/', nan, 'https://www.tesla.com/',
       'https://career.bayer.com/en/career',
       'http://www.leidos.com/careers/', nan, nan, nan, nan,
       'https://www.sentaracareers.com/', 'https://www.bluecrossnc.com/',
       nan, nan, nan, 'https://corporate.dollartree.com/careers', nan,
       nan, nan, nan, nan, nan, 'https://www.bnymellon.com/careers', nan,
       'http://corporate.comcast.com/', 'https://www.axosbank.com/', nan,
       'http://www.usablelife.com/', 'http://www.sherwin-williams.com/',
       nan, nan, nan], dtype=object)

In [16]:
Companies['Company_URL_2'].head(50).values

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       'https://www.nova.edu/', nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, 'https://www.rti-inc.com/', nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'http://www.solera.com/', nan, nan, nan,
       'http://mindbodyonline.com/', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, 'http://www.aes.com/'], dtype=object)

In [17]:
Companies['URL'] = Companies['Company_URL_2'].fillna(Companies['Company_URL_1'])
Companies['URL'] = Companies['URL'].replace('', np.nan)

In [18]:
Companies['URL'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 8793 entries, 0 to 8792
Series name: URL
Non-Null Count  Dtype 
--------------  ----- 
3578 non-null   object
dtypes: object(1)
memory usage: 68.8+ KB


Looks great!
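The counts support this: the merged column has 3578 non-null values, which is exactly 2020 (Company_URL_1) plus 1558 (Company_URL_2), so the two source columns never overlap and no URL was discarded. A minimal sketch of checking this directly, using a hypothetical toy frame with placeholder URLs:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the URLs are placeholders, not real data
toy = pd.DataFrame({
    "Company_URL_1": ["https://a.example/", np.nan, np.nan],
    "Company_URL_2": [np.nan, "https://b.example/", np.nan],
}, dtype=object)

# combine_first keeps Company_URL_2 and falls back to Company_URL_1
url = toy["Company_URL_2"].combine_first(toy["Company_URL_1"])

# Number of rows where both source columns are filled
overlap = (toy["Company_URL_1"].notna() & toy["Company_URL_2"].notna()).sum()
print(overlap, url.notna().sum())
```

An overlap of zero means the choice of priority between the two columns does not affect the result.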

## Revenue

Now let's work on the revenue columns. We need to merge the Company_Revenue and Company_Data_x columns. The challenge is that "Company_Data_1", "Company_Data_2" and "Company_Data_3" hold revenue values in some rows and employee counts in others. Let's create the new column "Revenue":