# Exercise: Counties in Maryland

Use web scraping to gather the information in the table on Wikipedia's list of counties in Maryland.

The information to include in your final dataframe is:
- County Name
- FIPS Code
- County Seat
- Established (year)
- Origin
- Etymology
- Population
- Area

Upload your completed Jupyter notebook to GitHub and submit the URL for this assignment.

The `requests` library is the de facto standard for making HTTP requests in Python.

In [1]:
import requests #library used to connect to a website

In [2]:
#specify the url
URL = "https://en.wikipedia.org/wiki/List_of_counties_in_Maryland#List_of_counties"
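
The URL above ends in a fragment (`#List_of_counties`). A fragment is a client-side anchor that points a browser at a section of the page; it is not part of the HTTP request, so the server returns the whole page either way. As a small sketch using only the standard library, `urllib.parse.urldefrag` separates the two parts (the names `base_url` and `fragment` are introduced here for illustration):

```python
from urllib.parse import urldefrag

URL = "https://en.wikipedia.org/wiki/List_of_counties_in_Maryland#List_of_counties"

# Split the URL into the part that identifies the page and the fragment after '#'
base_url, fragment = urldefrag(URL)

print(base_url)   # https://en.wikipedia.org/wiki/List_of_counties_in_Maryland
print(fragment)   # List_of_counties
```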

In [3]:
# Connect to the website and store the response in the variable 'page'
# The GET method indicates that you’re trying to get or retrieve data from a specified resource.
# To make a GET request, call requests.get().
page = requests.get(URL)

In [4]:
# A Response is a powerful object for inspecting the results of the request.
type(page)

requests.models.Response

In [5]:
# Verify that the connection to the website succeeded

# For a reference of all status codes, see
# https://www.restapitutorial.com/httpstatuscodes.html

# A 200 OK status means that your request was successful and the server responded with the data you requested,
# whereas a 404 NOT FOUND status means that the resource you were looking for was not found.
page.status_code

200
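
To make the meaning of these codes concrete, here is a minimal sketch that uses the standard library's `http.HTTPStatus` enum to turn a numeric code into a readable label. The helper `describe_status` is a hypothetical name introduced for this illustration; it is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short readable label for an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError if the code is not a known status
    # Codes in the 200-299 range indicate success
    outcome = "success" if 200 <= code < 300 else "not successful"
    return f"{code} {status.phrase} ({outcome})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (not successful)
```

In the notebook itself, calling `page.raise_for_status()` is another option: it raises an exception for 4xx and 5xx responses instead of requiring a manual check.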

## HTML - The Basics
This is the basic syntax of an HTML webpage. Each `<tag>` defines a block of the webpage:
1. `<!DOCTYPE html>`: HTML documents must start with a type declaration.
2. The HTML document is contained between `<html>` and `</html>`.
3. The metadata and script declarations of the HTML document go between `<head>` and `</head>`.
4. The visible part of the HTML document is between the `<body>` and `</body>` tags.
5. Title headings are defined with the `<h1>` through `<h6>` tags.
6. Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for a standard cell in an HTML table.
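
To see these tags in context, the sketch below feeds a tiny made-up page to Python's built-in `html.parser.HTMLParser` and records each opening tag in document order. The sample HTML and the `TagCollector` class are assumptions made for this illustration; they are not taken from the Wikipedia page.

```python
from html.parser import HTMLParser

# A tiny, made-up page that uses the tags described above
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Demo</title></head>
<body>
<h1>Counties</h1>
<p>See the <a href="https://example.com">list</a>.</p>
<table>
<tr><th>County</th><td>Seat</td></tr>
</table>
</body>
</html>"""

class TagCollector(HTMLParser):
    """Collect the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag; the DOCTYPE declaration is not a tag
        self.tags.append(tag)

collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.tags)
```

The `<th>` tag in the sample marks a header cell; a table row can mix `<th>` and `<td>` cells.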

In [6]:
#save string format of website HTML into a variable
HTMLstr = page.text
#print(HTMLstr[:300])

In [7]:
#import the Beautiful soup functions to parse the data returned from the website

# Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML
# or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
from bs4 import BeautifulSoup

In [8]:

# Parse the HTML using Beautiful Soup and store the result in the variable `soup`
# First argument: the raw HTML content.
# Second argument: the HTML parser to use ("html.parser" is Python's built-in parser).
soup = BeautifulSoup(HTMLstr, "html.parser")

In [9]:
# Format the page contents to include indentation
# Printing soup.prettify() gives a visual representation
# of the parse tree created from the raw HTML content.
#print(soup.prettify())

In [10]:
# Get the <table> tag that contains the data we want to scrape
right_table = soup.find('table', class_='wikitable sortable')

In [11]:
# Set up empty lists to hold the data for each column
A = []  # County Name
B = []  # FIPS Code
C = []  # County Seat
D = []  # Established (year)
E = []  # Origin
F = []  # Etymology
I = []  # Population
J = []  # Area

# Find all <tr> (table row) tags in the table and go through each one, skipping the header row
for row in right_table.find_all("tr")[1:]:

    # Get all the <td> tags for this <tr> tag
    cells = row.find_all("td")

    # Only use rows with a <th> cell and exactly 10 <td> cells, so every list stays the same length
    if row.th is not None and len(cells) == 10:
        # find(string=True) returns only the first text node inside a cell
        A.append(row.th.find(string=True))    # County Name (held in the row's <th> cell)
        B.append(cells[0].find(string=True))  # FIPS Code
        C.append(cells[1].find(string=True))  # County Seat
        D.append(cells[2].find(string=True))  # Established (year)
        E.append(cells[3].find(string=True))  # Origin
        F.append(cells[4].find(string=True))  # Etymology
        I.append(cells[7].find(string=True))  # Population
        J.append(cells[8].find(string=True))  # Area
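
Taking only the first text node of a cell truncates any cell whose text is split by a nested tag such as a link. The sketch below shows the difference on a small made-up table; the sample HTML, the county name, and the variable names are assumptions for illustration only. `get_text(" ", strip=True)` is one way to collect all of a cell's text instead.

```python
from bs4 import BeautifulSoup

# A made-up table with the same shape as a wikitable: <th> for the name, <td> for the rest
SAMPLE = """
<table class="wikitable sortable">
<tr><th>County</th><th>Seat</th><th>Etymology</th></tr>
<tr><th>Example County</th><td>Exampleville</td><td>The <a href="#">Example</a> family</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for row in table.find_all("tr")[1:]:       # skip the header row
    cells = row.find_all("td")
    if row.th is None or len(cells) != 2:  # skip rows with an unexpected layout
        continue
    first_node = cells[1].find(string=True)         # only the text before the <a> tag
    full_text = cells[1].get_text(" ", strip=True)  # all text in the cell, joined by spaces
    rows.append((row.th.get_text(strip=True), cells[0].get_text(strip=True), full_text))

print(repr(first_node))  # 'The '
print(rows)
```

Collecting each row as one tuple also keeps the columns aligned by construction, since a skipped row adds nothing to any column.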

In [12]:
# Import pandas to convert the lists into a dataframe
import pandas as pd

df = pd.DataFrame(A, columns=['County Name'])  # turn list A into a dataframe first

# Add the other lists as new columns in the new dataframe
df['FIPS Code'] = B
df['County Seat'] = C
df['Established (year)'] = D
df['Origin'] = E
df['Etymology'] = F
df['Population'] = I
df['Area'] = J

#show first 5 rows of created dataframe
df.head()

| | County Name | FIPS Code | County Seat | Established (year) | Origin | Etymology | Population | Area |
|---|---|---|---|---|---|---|---|---|
| 0 | Allegany County | 1 | Cumberland | 1789 | Formed from part of Washington County. | From the Lenape Indian word | 74012 | 430 |
| 1 | Anne Arundel County | 3 | Annapolis | 1650 | Formed from part of St. Mary's County. | Anne Arundell | 550488 | 588 |
| 2 | Baltimore County | 5 | Towson | 1659 | Formed from unorganized territory | Cecil Calvert, 2nd Baron Baltimore | 817455 | 682 |
| 3 | Baltimore City | 510 | Baltimore City | 1851 | Founded in 1729. Detached in 1851 from Baltimo... | Cecil Calvert, 2nd Baron Baltimore | 621342 | 92 |
| 4 | Calvert County | 9 | Prince Frederick | 1654 | Formed as Patuxent County from unorganized ter... | The | 89628 | 345 |

In [13]:
#export scraped data to a csv file
df.to_csv("MD_Counties.csv")
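
By default, `to_csv` also writes the dataframe's row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below demonstrates this on a small made-up dataframe, writing to an in-memory string rather than a file; `df_demo` and its values are assumptions for illustration. Passing `index=False` is one way to leave the index out.

```python
import io

import pandas as pd

# A small made-up dataframe standing in for the scraped data
df_demo = pd.DataFrame({"County Name": ["Example County"], "Population": ["12345"]})

# to_csv() with no path returns the CSV text as a string
with_index = pd.read_csv(io.StringIO(df_demo.to_csv()))
without_index = pd.read_csv(io.StringIO(df_demo.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'County Name', 'Population']
print(list(without_index.columns))  # ['County Name', 'Population']
```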