## 1. Multipage Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all doctors whose last names begin with "Z".
- Export the content into a CSV file called ```md_Z.csv```.


In [1]:
from bs4 import BeautifulSoup  ## web scraping
import requests ## request html for a page(s)
import pandas as pd ## pandas to work with data

In [2]:
## Build the request URL. Luckily, the site lets us filter by last-name initial, so we can target "Z" directly.
# The URL is identical for every results page except for the page number at the end, so a "{}" placeholder
# marks where that number goes.

base_url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p={}"




In [3]:
# Fill the placeholder with each page number. There are five pages of results:

for url_number in range(1,6):
    print(base_url.format(url_number))

https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1
https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2
https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=3
https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=4
https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=5


In [4]:
all_urls = [base_url.format(url_number) for url_number in range(1,6)]

all_urls

['https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=3',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=4',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=5']
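The same list can also be built from the query parameters rather than a hand-written format string. The sketch below is only an illustration: `ENDPOINT` and `build_page_urls` are hypothetical names, and it assumes the site accepts the same two parameters (`alpbhabetSearch` and `d-49653-p`) seen in the URLs above for whatever letter and page count are passed in.

```python
from urllib.parse import urlencode

# Endpoint copied from the URLs above, without the query string.
ENDPOINT = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action"


def build_page_urls(letter, page_count):
    """Return one results-page URL per page for the given last-name initial."""
    return [
        f"{ENDPOINT}?{urlencode({'alpbhabetSearch': letter, 'd-49653-p': page})}"
        for page in range(1, page_count + 1)
    ]


urls = build_page_urls("Z", 5)
print(urls[0])
```

Keeping the letter and page count as arguments makes it easier to reuse the cell for another initial, provided the number of pages is checked on the site first.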

In [5]:
# Before running the full scrape, test the first page to see what comes back.

# First, store the URL:

url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1"

# Then request the page, store the result in "response", and check the status:

response = requests.get(url)

response.status_code
# A 200 status code means the request succeeded and the page content was returned.

200

In [6]:
# Parse the tables on this one page to see how the data looks before repeating the step for every page:
df_list = pd.read_html(response.text)

df_list

[                                  Physician Last Name  \
 0                                             Zaccheo   
 1                                           Zachariah   
 2                                              Zachel   
 3                                              Zackin   
 4                                              Zackin   
 5                                              Zackin   
 6                                               Zadeh   
 7                                               Zafar   
 8                                               Zafar   
 9                                                Zahl   
 10                                             Zahler   
 11                                              Zaino   
 12                                                Zak   
 13                                               Zaki   
 14                                              Zales   
 15                                           Zalmanov   
 16           

In [7]:
# Verify the type of the first item:
# pd.read_html returns a list of DataFrames (one per table it finds), so index 0 selects the first table.
type(df_list[0])

pandas.core.frame.DataFrame

In [8]:
phys_df = df_list[0]

phys_df

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842.0,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458.0,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699.0,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457.0,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457.0,MD,03/16/2005,03/09/2005,1941.0
5,Zackin,Henry,J,101457.0,MD,02/21/1990,03/09/2005,1941.0
6,Zadeh,Mehran,,3399.0,PA,07/21/2010,09/06/2013,1961.0
7,Zafar,Kamal,,113.0,SA,08/04/2016,08/08/2016,1968.0
8,Zafar,Syeda,,158264.0,MD,10/16/2007,11/06/2007,1936.0
9,Zahl,Kenneth,,151413.0,MD,04/18/2008,04/11/2008,1956.0


In [9]:
## Import the libraries needed to add a delay between requests:
from random import randrange ## picks a random integer from a range
import time ## provides sleep() for pausing

In [10]:
# Now scrape each of the URLs:

# DataFrames from successfully scraped links:
all_df = []
# Links that failed to scrape:
busted_links = []

counter = 1

for link in all_urls:
    print(f"Scraping {counter} of {len(all_urls)}")
    counter += 1
    print(f"Scraping {link}")
    response = requests.get(link)
    if response.status_code == 200:
        df = pd.read_html(response.text)
        all_df.append(df[0])
    else:
        print(f"{link} returned a busted link with response {response.status_code}")
        busted_links.append(link)
    # ^ this helps us keep track of broken links, i.e. pages that were not scraped successfully.
    snooze = randrange(10,20)
    print(f"Snoozing for {snooze} sec. before next scrape")
    time.sleep(snooze)

print("ALL DONE!")

Scraping 1 of 5
Scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1
Snoozing for 14 sec. before next scrape
Scraping 2 of 5
Scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2
Snoozing for 11 sec. before next scrape
Scraping 3 of 5
Scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=3
Snoozing for 15 sec. before next scrape
Scraping 4 of 5
Scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=4
Snoozing for 16 sec. before next scrape
Scraping 5 of 5
Scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=5
Snoozing for 14 sec. before next scrape
ALL DONE!
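The loop above can also be wrapped in a function so that it is testable without touching the network. This is a rough sketch rather than a replacement for the cell: `scrape_pages`, `fetch`, and `fake_fetch` are hypothetical names, and `fetch` stands in for a small wrapper around `requests.get` that returns a status code and the page text. Unlike the cell above, the sketch skips the pause after the final page.

```python
import time
from random import randrange


def scrape_pages(urls, fetch, sleep=time.sleep, min_wait=10, max_wait=20):
    """Fetch each URL and sort the results into successes and failures.

    `fetch` is any callable returning (status_code, body); it is a
    hypothetical stand-in for a wrapper around requests.get.
    """
    pages, busted = [], []
    for position, url in enumerate(urls, start=1):
        print(f"Scraping {position} of {len(urls)}: {url}")
        status, body = fetch(url)
        if status == 200:
            pages.append(body)
        else:
            busted.append(url)
        # Pause only between requests, not after the final one.
        if position < len(urls):
            sleep(randrange(min_wait, max_wait))
    return pages, busted


# Fake fetcher so the sketch runs offline: any URL ending in "3" fails.
def fake_fetch(url):
    return (404, "") if url.endswith("3") else (200, f"<html>{url}</html>")


pages, busted = scrape_pages(["p1", "p2", "p3"], fake_fetch, sleep=lambda seconds: None)
print(len(pages), busted)
```

Passing `sleep` as an argument lets a dry run skip the delays, while a real run keeps the default `time.sleep`.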


In [11]:
len(all_df)

5

In [17]:
# Check each page to find the rows with unusable data. The list holds five DataFrames, so the valid indexes are 0 through 4.

all_df[4]

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zulfacar,Mary,,130166.0,MD,10/21/2005,10/14/2005,1940.0
1,Zuniga,Dario,,123324.0,MD,05/07/2002,05/07/2002,1941.0
2,Zuttah,Silas,H,153216.0,MD,01/22/2003,06/17/2003,1953.0
3,Zweig,Steven,Jeffrey,140242.0,MD,05/17/2006,05/10/2006,1949.0
4,New Physician Search,,,,,,,
5,"Physician Records \tvar css = "".visualping-con...",Physician Last Name Physician First Name Physi...,,,,,,


In [59]:
# Combine all 5 DataFrames into one:
df = pd.concat(all_df, ignore_index = True)

df

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842.0,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458.0,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699.0,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457.0,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457.0,MD,03/16/2005,03/09/2005,1941.0
...,...,...,...,...,...,...,...,...
89,Zuniga,Dario,,123324.0,MD,05/07/2002,05/07/2002,1941.0
90,Zuttah,Silas,H,153216.0,MD,01/22/2003,06/17/2003,1953.0
91,Zweig,Steven,Jeffrey,140242.0,MD,05/17/2006,05/10/2006,1949.0
92,New Physician Search,,,,,,,
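To see what `ignore_index = True` changes, here is a small sketch on two toy frames (`page_1` and `page_2` are made-up stand-ins for the scraped pages). Without the flag, each page keeps its own row labels, so labels repeat across pages; with it, the combined frame is renumbered from 0.

```python
import pandas as pd

# Toy stand-ins for two scraped pages.
page_1 = pd.DataFrame({"Physician Last Name": ["Zaccheo", "Zachariah"]})
page_2 = pd.DataFrame({"Physician Last Name": ["Zulfacar", "Zuniga"]})

kept = pd.concat([page_1, page_2])
renumbered = pd.concat([page_1, page_2], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Unique labels matter for any later step that removes rows by label, since a repeated label would match a row on every page.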


In [76]:
df.tail(35)

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
59,Zhitlovsky,German,,163179.0,MD,06/23/1995,,
60,Zhong,Chizheng,,201312.0,MD,08/20/2003,08/22/2003,1953.0
61,Zhu,Mary,M,186722.0,MD,04/05/2013,04/03/2013,1939.0
62,Zhu,Ming,Zhong,200493.0,MD,03/28/2011,10/14/2014,1949.0
63,Ziegler,Ross,,151632.0,MD,11/10/2015,11/04/2015,1952.0
64,New Physician Search,,,,,,,
65,"Physician Records \tvar css = "".visualping-con...",Physician Last Name Physician First Name Physi...,,,,,,
66,Ziering,William,H,80678.0,MD,11/05/2001,11/14/2001,1930.0
67,Ziets,Robert,,177701.0,MD,11/27/1996,,
68,Zigelbaum,Sheldon,D,142022.0,MD,09/14/1993,,


In [77]:
# Drop the bad rows found above, using their index labels.
df = df.drop(labels=[20, 21, 42, 43, 64, 65, 86, 87, 92, 93], axis=0)

df

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842.0,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458.0,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699.0,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457.0,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457.0,MD,03/16/2005,03/09/2005,1941.0
...,...,...,...,...,...,...,...,...
85,Zugec,Mirko,,213710.0,MD,12/08/2020,12/01/2020,1960.0
88,Zulfacar,Mary,,130166.0,MD,10/21/2005,10/14/2005,1940.0
89,Zuniga,Dario,,123324.0,MD,05/07/2002,05/07/2002,1941.0
90,Zuttah,Silas,H,153216.0,MD,01/22/2003,06/17/2003,1953.0
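Hard-coded index labels stop working if the site adds or removes a record. As an alternative, the junk rows can be filtered by a condition. The sketch below uses a toy frame and rests on one assumption that should be checked against the real data: every genuine record has a license number, while the leftover page-navigation rows do not.

```python
import pandas as pd

# Toy frame mimicking the scraped layout: two real rows and two junk rows.
sample = pd.DataFrame({
    "Physician Last Name": ["Zhu", "New Physician Search", "Physician Records", "Ziering"],
    "License Number": [186722.0, None, None, 80678.0],
})

# Assumption: a missing license number marks a non-record row.
clean = sample[sample["License Number"].notna()].reset_index(drop=True)
print(clean)
```

`reset_index(drop=True)` renumbers the remaining rows so the index has no gaps.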


In [78]:
df.to_csv("md_Z.csv", encoding = "UTF-8", index = False)

## 2. Conversion function


Write a function that takes string values like ```$12.24267```, ```10,201``` and ```$12,501``` and converts them into floating point numbers like ```12.24```, ```10201.0``` and ```12501.0```

Test it out on those 3 string values.




In [22]:
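numeros = ["$12.24267", "10,201", "$12,501"]

In [34]:
# Strip the dollar sign and the thousands separators, then round to two decimals.
# Return the value (rather than printing it) so the caller can collect the results.

def numConvert(input_num):
    cleaned = input_num.replace("$", "").replace(",", "")
    return round(float(cleaned), 2)

In [52]:
numConvert("$12.24267")

12.24


In [58]:
fpn = []

for number in numeros:
    fpn.append(numConvert(number))
    
fpn

[12.24, 10201.0, 12501.0]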
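As a variation on the task for this section, a regular expression can strip every character that is not part of the number, which also tolerates stray spaces or other currency symbols. This is a minimal sketch: `to_float` and its `ndigits` parameter are hypothetical names, and it assumes the inputs are strings that use `.` as the decimal point and `,` only as a thousands separator.

```python
import re


def to_float(text, ndigits=2):
    """Convert a string such as "$12,501" to a float rounded to `ndigits`."""
    # Drop everything except digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return round(float(cleaned), ndigits)


values = [to_float(s) for s in ["$12.24267", "10,201", "$12,501"]]
print(values)
```

Because the function returns its result instead of printing it, the list comprehension collects the converted numbers directly.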