## Welcome
This is a simple web scraping project that consolidates 2023 data on U.S. banks into a single dataset. The data comes from a website called ***US Bank Locations*** (https://www.usbanklocations.com/). The project involves the following steps:
1. Importing libraries necessary for the execution of this program.
2. Inspecting the webpages to correctly identify target elements.
3. Scraping the webpages.
4. Transforming the scraped data into dataframes.
5. Merging the dataframes into a single dataframe and exporting it as a csv file.

**Importing libraries**

* ***requests*** - sends HTTP requests.
* ***BeautifulSoup*** - parses HTML.
* ***pandas*** - transforms the scraped data into dataframes, combines them into a single dataframe, and exports the result as a CSV file.

In [118]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

* ***base_url*** is the URL of the website's homepage.
* ***response*** is the HTTP response returned for the homepage request.
* ***soup*** holds the parsed HTML of the response.
* ***divs*** is a list of all `div` elements on the page.
* ***divs[13]*** is the `div` element that contains the desired anchor elements and their href attributes.
* ***hrefs*** is a list of all anchor elements in *divs[13]* that have an href attribute.
* The for loop in **line 9** iterates through the *hrefs* list. On each iteration, ***final_link*** is built by concatenating *base_url* with the href attribute of the current anchor element.
* Each ***final_link*** is then appended to the ***links*** list.


In [119]:
base_url = 'https://www.usbanklocations.com/'  # homepage URL, also used as the prefix for each href
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

divs = soup.find_all('div')
#print(divs[13])
links = []
hrefs = divs[13].find_all('a', href=True)
for a in hrefs:
    final_link = base_url+a['href']
    links.append(final_link)
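
Plain string concatenation works as long as the href values are relative paths without a leading slash. As a hedged alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which resolves both forms against the base URL. The helper name `build_links` and the sample paths are hypothetical and are not taken from the website.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.usbanklocations.com/'


def build_links(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url and return absolute URLs."""
    # urljoin copes with hrefs that start with '/' as well as those that do not,
    # so the joined path never ends up with a doubled slash.
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical href values, used only to illustrate the two forms.
sample_hrefs = ['/example-a.html', 'example-b.html']
sample_links = build_links(sample_hrefs)
print(sample_links)
```

In the cell above, this would amount to replacing `base_url+a['href']` with `urljoin(base_url, a['href'])`.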

* The ***scrape(link)*** function takes a *link* as its argument and returns a dataframe.
* Inside it, the ***column1*** and ***column2*** lists start out empty. The ***link*** argument is passed to the get() method to obtain a response, which is then parsed and stored in the ***soup*** variable.
* ***tables*** is the list of all *table* elements on the page, and ***rows*** is the list of all *tr* elements in *tables[1]*.
* A for loop iterates over ***rows*** from the second row to the last, since the first row is the header.
* For each row, ***attribute_value*** and ***bank_name*** are read from the second and third cells and appended to ***column2*** and ***column1***, respectively.
* ***attribute_name*** is read from the header row. The ***dataframe*** variable is then built from a dictionary in which ***column1*** is the value of the *'Bank'* key and ***column2*** is the value of the *attribute_name* key.
* **Duplicate rows** with the same bank name are dropped from ***dataframe***, keeping the first occurrence.


In [120]:
# Define the scrape function
def scrape(link):
    column1 = []
    column2 = []
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    tables = soup.find_all('table')
    rows = tables[1].find_all('tr')
    for i in range(1, len(rows)):
        cells = rows[i].find_all('td')
        attribute_value = cells[1].text
        bank_name = cells[2].text
        column1.append(bank_name)
        column2.append(attribute_value)
    
    att_row = rows[0].find_all('td')
    attribute_name = att_row[1].text
    table = {'Bank':column1, attribute_name:column2}
    dataframe = pd.DataFrame(data=table).drop_duplicates('Bank', keep='first')
    return dataframe
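
The parsing step can be tried offline before sending any requests. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is made-up markup that only mimics the layout the function relies on (the second table holds the data, its first row is the header, and each data row has at least three cells), and `parse_ranking` is a hypothetical helper, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the assumed page layout; the real pages differ.
SAMPLE_HTML = """
<table><tr><td>layout table</td></tr></table>
<table>
  <tr><td>Rank</td><td>Total Assets</td><td>Bank Name</td></tr>
  <tr><td>1</td><td>$100</td><td>Example Bank A</td></tr>
  <tr><td>2</td><td>$90</td><td>Example Bank B</td></tr>
  <tr><td colspan="3">short row</td></tr>
</table>
"""


def parse_ranking(html):
    """Return (attribute_name, records), where records holds (bank, value) tuples."""
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table')
    if len(tables) < 2:
        raise ValueError('expected at least two tables on the page')
    rows = tables[1].find_all('tr')
    # The header row names the attribute in its second cell.
    attribute_name = rows[0].find_all('td')[1].get_text(strip=True)
    records = []
    for row in rows[1:]:
        cells = row.find_all('td')
        if len(cells) < 3:
            continue  # skip rows that do not have the expected three cells
        records.append((cells[2].get_text(strip=True),
                        cells[1].get_text(strip=True)))
    return attribute_name, records


attribute_name, records = parse_ranking(SAMPLE_HTML)
print(attribute_name, records)
```

Unlike the cell above, this sketch skips rows with fewer than three cells instead of raising an `IndexError`. Whether such rows occur on the real pages is not something the notebook shows.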

* ***dataframes*** is an initially empty list that will eventually hold all the dataframes.
* The for loop calls the ***scrape*** function on each link and appends the resulting dataframe to the *dataframes* list.

In [121]:
dataframes = []
for link in links:
    dataframes.append(scrape(link))
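
In the loop above, a single failing page stops the whole run. The sketch below shows one possible variation, assuming it is acceptable to skip failed pages and to pause between requests. The helper name `scrape_all` and the `delay_seconds` parameter are hypothetical, and the one-second default is an arbitrary choice.

```python
import time


def scrape_all(links, scrape, delay_seconds=1.0):
    """Call scrape(link) for every link and return (results, failed)."""
    results = []
    failed = []
    for position, link in enumerate(links):
        if position > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            results.append(scrape(link))
        except Exception as error:  # deliberately broad for this sketch
            failed.append((link, repr(error)))
    return results, failed


# Stand-in scraper used only for demonstration; it fails on one input.
def fake_scrape(link):
    if link == 'bad':
        raise ValueError('no table found')
    return link.upper()


demo_results, demo_failed = scrape_all(['a', 'bad', 'b'], fake_scrape, delay_seconds=0)
print(demo_results, demo_failed)
```

With the notebook's own function, the call would be `dataframes, failed = scrape_all(links, scrape)`.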

* ***merged_df*** is the column-wise concatenation (`axis=1`) of all dataframes. Each dataframe is first indexed by its **'Bank'** column, so rows are aligned on the bank name. A bank that is missing from one of the dataframes gets a missing value in that dataframe's column.
* A column named ***'Id'*** is inserted as the first column and serves as a unique identifier for each bank in the dataset.

In [122]:
merged_df = pd.concat([df.set_index('Bank') for df in dataframes], ignore_index=False, axis=1)
merged_df.insert(0, 'Id', range(1, 1+(len(merged_df))))
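
To see how the index-based alignment behaves, here is a small sketch with two made-up dataframes. The bank names and values are invented for illustration. Note that the duplicate 'Bank A' row is dropped before concatenation, as the `scrape` function does.

```python
import pandas as pd

# Invented sample data; the first frame contains a duplicate bank on purpose.
assets = pd.DataFrame({'Bank': ['Bank A', 'Bank B', 'Bank A'],
                       'Total Assets': ['100', '90', '100']})
deposits = pd.DataFrame({'Bank': ['Bank B', 'Bank C'],
                         'Total Deposits': ['70', '40']})

# Drop duplicate banks first so that every index label is unique.
frames = [df.drop_duplicates('Bank', keep='first') for df in (assets, deposits)]

# Align rows on the bank name; banks missing from a frame get NaN in its column.
merged = pd.concat([df.set_index('Bank') for df in frames], axis=1)
merged.insert(0, 'Id', range(1, len(merged) + 1))
print(merged)
```

'Bank A' has no deposits entry and 'Bank C' has no assets entry, so both rows contain one missing value after the merge.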

The final dataset is exported as a CSV file named ***US_Bank_2023_Dataset.csv***.

In [123]:
merged_df.to_csv('US_Bank_2023_Dataset.csv')
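
Because the bank name is the index, `to_csv` writes it as the first column of the file. The sketch below shows a round trip through an in-memory buffer on invented data, so no file is written. Reading the file back with `index_col='Bank'` restores the same layout.

```python
import io

import pandas as pd

# Invented stand-in for merged_df, indexed by bank name.
sample = pd.DataFrame({'Id': [1, 2], 'Total Assets': ['100', '90']},
                      index=pd.Index(['Bank A', 'Bank B'], name='Bank'))

buffer = io.StringIO()
sample.to_csv(buffer)  # the 'Bank' index becomes the first CSV column
buffer.seek(0)

# Keep the scraped values as text instead of letting pandas infer numbers.
restored = pd.read_csv(buffer, index_col='Bank', dtype={'Total Assets': str})
print(restored)
```

For the exported file, the equivalent call would be `pd.read_csv('US_Bank_2023_Dataset.csv', index_col='Bank')`.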