### Fortune 500 Scraper

__author__: Mohammad Alfi Hasan  <br/>
__update__: Dec 17, 2021

This project scrapes company information from the fortune.com (Fortune 500 companies) website. A subscription to the site is required.

In [1]:
## --- Initialization --- ##

# ... libraries ... #
import pandas as pd 
import bs4 
import re
import platform
import sys

# ... paths ... #
sys.path.append('../data/')
sys.path.append('../util/')


## The approach and limitations

To scrape the Fortune 500 companies, the previous repo tried to download the information using the site's old static HTML format. However, with the new Fortune 500 website, it is hard even to copy the visible data into an Excel sheet. To work around this, we unfortunately need to download the website locally by hand, selecting 100 rows per page.

The approach is as follows:
1. Download the website with 100 rows per page selected. Use "Save as" to save the complete page, and place only the HTML file in the data folder.
2. Once the file is in the data folder, open it in a text editor and search for the name of a company.
3. Once you find the company name, find the class attribute of the cell div that encloses it. This step is crucial. For example, in this case we searched for "amgen" in a text editor (Sublime Text) and found Amgen in the HTML list. The matching text was as follows: <br/> ```<div class="searchResults__cellContent--3WEWj"><span>112</span></div></a></div></div></div><div class="rt-tr-group" role="rowgroup"><div class="rt-tr -even" role="row"><div class="rt-td searchResults__cell--2Y7Ce searchResults__rank--1sTfo" role="gridcell" style="flex: 100 0 auto; width: 100px;"><a class="searchResults__cellWrapper--39MAj" trackerdata="[object Object]" href="https://fortune.com/company/amgen/fortune500/"><div class="searchResults__cellContent--3WEWj"><span>112</span></div></a></div><div class="rt-td searchResults__cell--2Y7Ce searchResults__title--3LyRA" role="gridcell" style="flex: 100 0 auto; width: 100px;"><a class="searchResults__cellWrapper--39MAj" trackerdata="[object Object]" href="https://fortune.com/company/amgen/fortune500/"><div class="searchResults__cellContent--3WEWj"><span><div>Amgen</div></span></div></a></div><div class="rt-td searchResults__cell--2Y7Ce" role="gridcell" style="flex: 100 0 auto; width: 100px;"><a class="searchResults__cellWrapper--39MAj" trackerdata="[object Object]" href="https://fortune.com/company/amgen/fortune500/"><div class="searchResults__cellContent--3WEWj"><span>$25,424</span></div></a></div><div class="rt-td searchResults__cell--2Y7Ce" role="gridcell" style="flex: 100 0 auto; width: 100px;"><a class="searchResults__cellWrapper--39MAj" trackerdata="[object Object]" href="https://fortune.com/company/amgen/fortune500/"><div class="searchResults__cellContent--3WEWj"><span>8.8%</span></div></a></div>```. In this text, two classes make up the table: `rt-td searchResults__cell--2Y7Ce searchResults__title--3LyRA` (the company-name cells) and `rt-td searchResults__cell--2Y7Ce` (the remaining data cells). 
Finding these two keys for your own download is essential for the next steps; otherwise the following code WILL NOT WORK.
4. Once you have found the keys, set them as "titleKey" and "tableKey", which the following functions read.
5. In the current case, the downloaded HTML table has 10 columns (the default). If you have more or fewer than 10, adjust the "col_names" variable accordingly. Make sure the column names are listed in the same order as the columns you downloaded.
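
Steps 2 to 4 can also be done programmatically instead of in a text editor. The following is a minimal sketch, not part of the original workflow: it uses only the standard library's `html.parser` and assumes, as in the Amgen snippet above, that every table cell is a `div` with `role="gridcell"`. The names `CellClassFinder` and `find_cell_classes` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class CellClassFinder(HTMLParser):
    """Record the class attribute of each gridcell div whose text equals `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle.strip().lower()
        self.open_divs = []  # one entry per open <div>: its class if it is a gridcell, else None
        self.found = []      # distinct class strings of matching cells, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attrs = dict(attrs)
            is_cell = attrs.get("role") == "gridcell"  # assumption taken from the sample HTML
            self.open_divs.append(attrs.get("class") if is_cell else None)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        if data.strip().lower() != self.needle:
            return
        # Walk outwards to the innermost enclosing gridcell div.
        for cell_class in reversed(self.open_divs):
            if cell_class is not None:
                if cell_class not in self.found:
                    self.found.append(cell_class)
                break


def find_cell_classes(html_text, needle):
    """Return the class strings of the gridcell divs that contain `needle`."""
    finder = CellClassFinder(needle)
    finder.feed(html_text)
    finder.close()
    return finder.found


# Shortened version of the Amgen row shown in step 3.
sample = (
    '<div class="rt-td searchResults__cell--2Y7Ce searchResults__title--3LyRA" role="gridcell">'
    '<a href="https://fortune.com/company/amgen/fortune500/">'
    '<div class="searchResults__cellContent--3WEWj"><span><div>Amgen</div></span></div></a></div>'
    '<div class="rt-td searchResults__cell--2Y7Ce" role="gridcell">'
    '<a href="https://fortune.com/company/amgen/fortune500/">'
    '<div class="searchResults__cellContent--3WEWj"><span>$25,424</span></div></a></div>'
)

title_classes = find_cell_classes(sample, "Amgen")    # candidate for titleKey
value_classes = find_cell_classes(sample, "$25,424")  # candidate for tableKey
print(title_classes)
print(value_classes)
```

With a real download, read the saved HTML file into a string and pass a company name and one of that company's values; the returned class strings are the candidates for "titleKey" and "tableKey".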

* Note: after a complete download, only the HTML file is needed; the accompanying folders do not need to be in the data folder.
* Note: Each downloaded file is renamed as follows: `../data/Fortune_500_list_of_companies_2021_p<part_no>.html`. 
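
Because the functions below expect the renamed files to exist, it can help to check the data folder before running them. This is a minimal sketch that assumes the naming pattern in the note above; `missing_downloads` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def missing_downloads(data_dir, no_of_files):
    """Return the expected page files (parts 1..no_of_files) that are not in data_dir."""
    expected = [
        Path(data_dir) / f"Fortune_500_list_of_companies_2021_p{part_no}.html"
        for part_no in range(1, no_of_files + 1)
    ]
    return [path for path in expected if not path.is_file()]


# Report any page that still has to be downloaded and renamed.
for path in missing_downloads("../data/", 10):
    print(f"missing: {path}")
```

An empty result means every expected part is present under the expected name.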

In [2]:
### **** config for entire project **** ###
no_of_files = 10      # although this is a Fortune 500 script, the website lists 1000 companies (10 pages of 100 rows)
data_dir = '../data/' 
titleKey = 'rt-td searchResults__cell--2Y7Ce searchResults__title--3LyRA'
tableKey = 'rt-td searchResults__cell--2Y7Ce'
col_names = [
        'revenues',
        'revenues(change%)',
        'profit',
        'profit(change%)',
        'asset($m)',
        'marketcap(03032021)',
        'change_rank(1000)',
        'employees',
        'change_rank(500)',
        'measure_rank'
    ] # 10 columns
row_no = 100  # rows per downloaded page (not used by the functions below)


### ++++ User functions ++++ ###

def get_text_from_div_list(div_result : list):
    '''
    return the text of the first cell-content div inside each element of div_result
    '''
    return [ i.find_all('div', {'class': 'searchResults__cellContent--3WEWj'})[0].get_text() 
            for i in div_result ]

def get_title( html_file : bs4.BeautifulSoup, div_key : str):
    '''
    getting title (company name) values as a list from the parsed HTML 
    '''
    
    # -- getting title cells; a multi-class string matches the exact class attribute
    div_raw =  html_file.find_all('div', {'class': div_key}) ## one entry per row, i.e. 100 for a full page

    return get_text_from_div_list(div_raw)
    
def get_table( html_file : bs4.BeautifulSoup, div_key : str):
    '''
    getting table values as a list of dicts (one per company) from the parsed HTML 
    
    non-local var : col_names 
    
    '''
    
    # -- getting table cells: all columns of all rows, in document order
    div_raw =  html_file.find_all('div', {'class': div_key})
    
    # -- one text value per cell
    val_lst = get_text_from_div_list(div_raw)
    
    n_cols = len(col_names)
    company_no = len(val_lst) // n_cols
    
    # -- split the flat value list into consecutive chunks of n_cols values
    company_lst = []
    
    for m in range(company_no) :
        company_lst.append(dict(zip(col_names, val_lst[m*n_cols:(m+1)*n_cols])))
    
    return company_lst


def single_html_read( filename : str ):
    '''
    reading a single HTML file and returning dataframe 
    
    non-local var : titleKey
    non-local var : tableKey
    
    '''
    
    # -- reading the file 
    with open(filename, encoding='utf8') as infile:
        html_file_ = bs4.BeautifulSoup(infile, 'html.parser')
    
    title_lst = get_title( html_file_, titleKey)
    #print(title_lst) # the order of the titles must match the row order of the table values below
    
    table_lst = get_table( html_file_, tableKey)
    
    dfCompany = pd.DataFrame(table_lst)
    dfCompany['company'] = title_lst
    
    return dfCompany
    
def scrapping_values_from_download():
    '''
    reading every downloaded HTML file and returning one combined dataframe 
    
    non-local var : no_of_files
    non-local var : data_dir
    
    '''

    frames = []
    
    for partNo in range(no_of_files):
        filename_ = f'{data_dir}Fortune_500_list_of_companies_2021_p{partNo + 1}.html'
        frames.append(single_html_read( filename_ ))
        
    # -- each file keeps its own 0-based row index; pass ignore_index=True to renumber
    return pd.concat(frames)

### ++++ +++++++++++++ ++++ ###
 

scrapping_values_from_download()  


Unnamed: 0,revenues,revenues(change%),profit,profit(change%),asset($m),marketcap(03032021),change_rank(1000),employees,change_rank(500),measure_rank,company
0,"$559,151",6.7%,"$13,510",-9.2%,"$252,496","$382,642.8",-,2300000,-,20,Walmart
1,"$386,064",37.6%,"$21,331",84.1%,"$321,195","$1,558,069.6",-,1298000,-,11,Amazon
2,"$274,515",5.5%,"$57,411",3.9%,"$323,888","$2,050,665.9",1,147000,1,188,Apple
3,"$268,706",4.6%,"$7,179",8.2%,"$230,715","$98,653.2",1,256500,1,57,CVS Health
4,"$257,141",6.2%,"$15,403",11.3%,"$197,289","$351,725",2,330000,2,25,UnitedHealth Group
...,...,...,...,...,...,...,...,...,...,...,...
95,"$1,860.1",1.6%,$-116.1,-,"$5,413.2","$2,649.4",-,9950,-,-,Surgery Partners
96,"$1,859.3",16.9%,$295,15.7%,"$2,917.7","$15,120.7",-,5800,-,-,Entegris
97,"$1,856.6",9.3%,$139.2,70.1%,$800.1,"$3,560.3",-,4625,-,-,Sleep Number
98,"$1,855.4",-5%,$88.6,-52%,"$8,241.2","$3,817.5",-,3583,-,-,Spire
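
The scraped values in the output above are strings such as "$559,151", "6.7%" and "-", so they usually need converting before any numeric analysis. The following is a minimal sketch of one way to do that; `to_number` is a hypothetical helper, and treating a lone "-" as a missing value is an assumption about what the site means by it.

```python
def to_number(text):
    """Convert a scraped cell such as '$559,151', '6.7%' or '$-116.1' to a float.

    Returns None for an empty cell or a lone '-' (assumed here to mean "no value").
    """
    cleaned = str(text).strip()
    for symbol in ("$", ",", "%"):
        cleaned = cleaned.replace(symbol, "")
    if cleaned in ("", "-"):
        return None
    return float(cleaned)


# Values taken from the output above.
for raw in ["$559,151", "6.7%", "$-116.1", "-9.2%", "2300000", "-"]:
    print(raw, "->", to_number(raw))
```

On the dataframe returned by `scrapping_values_from_download()`, the helper could be applied column by column, for example with `df['revenues'].map(to_number)`.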
