### Sunrise and Sunset Times Webscraping Project by Jeff Kirkpatrick

#### Problem Statement:
I need the 2023 sunrise and set times for Las Cruces, NM to create a sunrise set table in a weather database. We can use the python requests module along with regular expressions to scrap to a weather website and retrieve the needed information in the format required to load in a SQL database table.

In [1]:
# import Python Requests Module
import requests
import re
import csv

#### Pageanation Operations

The website only displays the sunrise and sunset times a month at a time for the year needed. To get subsequent months, they append the month at the end of the url after the city and before the extension. Therefore, we will substitute each month in the url from a list and save the entire 12 months in the variable 'html'.

In [17]:
# To deal with the pagination for 2023. Cycle through all the months and get url data.
# We need the whole year of 2023. Create a months list to use for pageanation operations.
html = ''
months = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 
          'october', 'november', 'december']

# cycle through all the months on the website and gather the text in the 'html' variable.
for month in months:
    print(month)
    url = f'https://www.sunrisesunsettime.org/north-america/united-states/las-cruces-{month}.htm'
    #print(url)
    r = requests.get(url)
    #print(r.status_code) # should get 12 '200' status codes indicating each page was accessed successfully.
    #print(r.encoding)
    html = r.text
    print(html)

january
<!doctype html>
<html class="no-js" lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <meta name="google" content="notranslate">
    <meta http-equiv="x-ua-compatible" content="ie=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
		<title>January Sunrise and Sunset times Las Cruces   (NM) | United States</title>
		<meta name="keywords" content="sunset times,sunrise times,first light,day length">
<meta name="description" content="January Las Cruces (NM), United States sunrise and sunset times. Calculation include position of the sun and are in the local timezone">
<meta name="author" content="Copyright Sunrise and Sunset Times - A-Connect Ltd">

<!--	<link rel="preconnect" href="https://adservice.google.com/" crossorigin>
	<link rel="preconnect" href="https://googleads.g.doubleclick.net/" crossorigin>
	<link rel="preconnect" href="https://www.googletagservices.com/" crossorigin>
	<link rel="preconnect" href="https://tpc.googlesyndicatio

#### Regular Expression Operations

In [18]:
# Next find the data in the html and remove unwanted data using regular expression replacements.

# create a list of regex expressions we will use to filter and remove unwanted formating data.

REGEX_REPLACEMENTS = [
    (r"Sunrise", ''),
    (r"Sunset", ''),
    (r"Dawn", ''),
    (r"Day Length", ''),
    (r"<th>", ''),
    (r"</th>", ''),
    (r"<td>", ''),
    (r"</td>", ''),
    (r"<span class.*</span>", '')
]

outList = []

# Now we look for specific tags in each html line that contain the data we need.
for line in html.split('\n'):
    print(line)
    if '<th>' in line or ('<td>' and '<span' and '</span></td>') in line:
        for old, new in REGEX_REPLACEMENTS:
            line = re.sub(old, new, line, flags=re.IGNORECASE)
            newline = re.sub(r"[\n\t]*", "", line) # replace all '\n' and '\t' with blank.
            outList.append(newline)
            
# Take a look at the outList.
print(outList)

<!doctype html>
<html class="no-js" lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <meta name="google" content="notranslate">
    <meta http-equiv="x-ua-compatible" content="ie=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
		<title>January Sunrise and Sunset times Las Cruces   (NM) | United States</title>
		<meta name="keywords" content="sunset times,sunrise times,first light,day length">
<meta name="description" content="January Las Cruces (NM), United States sunrise and sunset times. Calculation include position of the sun and are in the local timezone">
<meta name="author" content="Copyright Sunrise and Sunset Times - A-Connect Ltd">

<!--	<link rel="preconnect" href="https://adservice.google.com/" crossorigin>
	<link rel="preconnect" href="https://googleads.g.doubleclick.net/" crossorigin>
	<link rel="preconnect" href="https://www.googletagservices.com/" crossorigin>
	<link rel="preconnect" href="https://tpc.googlesyndication.com/" 

#### Cleanup Operations

In [13]:
# Process the scraped data.
goutlist = []

while("" in outList): #remove blank lines
    outList.remove("")

for element in outList: # create a clean list of values
    goutlist.append(element)

print(goutlist)

['<th></th>', '<th></th>', '<th></th>', '<th></th>', '</th>', '<th>Sunset</th>', '<th></th>', '<th></th>', '<th></th>', '</th>', '<th>Dawn</th>', '<th>Dawn</th>', '<th></th>', '<th></th>', '</th>', '<th>Day length</th>', '<th>Day length</th>', '<th>Day length</th>', '<th></th>', '</th>', '<th>Thu, 30 Nov</th>', '<th>Thu, 30 Nov</th>', '<th>Thu, 30 Nov</th>', '<th>Thu, 30 Nov</th>', 'Thu, 30 Nov</th>', 'Thu, 30 Nov', 'Thu, 30 Nov', 'Thu, 30 Nov', 'Thu, 30 Nov', '<td>06:50 <span class="azimuth ESE" title="ESE">(115&deg;)</span></td>', '<td>06:50 <span class="azimuth ESE" title="ESE">(115&deg;)</span></td>', '<td>06:50 <span class="azimuth ESE" title="ESE">(115&deg;)</span></td>', '<td>06:50 <span class="azimuth ESE" title="ESE">(115&deg;)</span></td>', '<td>06:50 <span class="azimuth ESE" title="ESE">(115&deg;)</span></td>', '<td>06:50 <span class="azimuth ESE" title="ESE">(115&deg;)</span></td>', '06:50 <span class="azimuth ESE" title="ESE">(115&deg;)</span></td>', '06:50 <span class="a

In [14]:
# Create a list of tuples with desired values.
tupList = []

for i in range(3, len(goutlist), 3): #don't need Dec 31 last year start at record 3
    tupList.append((goutlist[i].split(','), goutlist[i + 1], goutlist[i + 2]))

print(tupList)

IndexError: list index out of range

In [6]:
# Get the date, sunrise and sunset values and assign them to variables by indexing through the list of tuples.
outList2 = []

for i in range(len(tupList)):
    date = tupList[i][0][1].strip().split(' ')
    sunrise = tupList[i][1].strip()
    sunset = tupList[i][2].strip()
    outList2.append([date, sunrise, sunset])
    
print(outList2)

[[['1', 'Dec'], '06:51', '17:03'], [['2', 'Dec'], '06:52', '17:03'], [['3', 'Dec'], '06:53', '17:03'], [['4', 'Dec'], '06:54', '17:03'], [['5', 'Dec'], '06:55', '17:03'], [['6', 'Dec'], '06:55', '17:03'], [['7', 'Dec'], '06:56', '17:03'], [['8', 'Dec'], '06:57', '17:03'], [['9', 'Dec'], '06:58', '17:03'], [['10', 'Dec'], '06:58', '17:03'], [['11', 'Dec'], '06:59', '17:03'], [['12', 'Dec'], '07:00', '17:04'], [['13', 'Dec'], '07:00', '17:04'], [['14', 'Dec'], '07:01', '17:04'], [['15', 'Dec'], '07:02', '17:04'], [['16', 'Dec'], '07:02', '17:05'], [['17', 'Dec'], '07:03', '17:05'], [['18', 'Dec'], '07:04', '17:05'], [['19', 'Dec'], '07:04', '17:06'], [['20', 'Dec'], '07:05', '17:06'], [['21', 'Dec'], '07:05', '17:07'], [['22', 'Dec'], '07:06', '17:07'], [['23', 'Dec'], '07:06', '17:08'], [['24', 'Dec'], '07:06', '17:08'], [['25', 'Dec'], '07:07', '17:09'], [['26', 'Dec'], '07:07', '17:09'], [['27', 'Dec'], '07:08', '17:10'], [['28', 'Dec'], '07:08', '17:11'], [['29', 'Dec'], '07:08', '17

In [7]:
# Remove any duplicate values.
outList3 = []

# We can use a list comprehension to remove any duplicate values.
[outList3.append(x) for x in outList2 if x not in outList3]

print(outList3)

[[['1', 'Dec'], '06:51', '17:03'], [['2', 'Dec'], '06:52', '17:03'], [['3', 'Dec'], '06:53', '17:03'], [['4', 'Dec'], '06:54', '17:03'], [['5', 'Dec'], '06:55', '17:03'], [['6', 'Dec'], '06:55', '17:03'], [['7', 'Dec'], '06:56', '17:03'], [['8', 'Dec'], '06:57', '17:03'], [['9', 'Dec'], '06:58', '17:03'], [['10', 'Dec'], '06:58', '17:03'], [['11', 'Dec'], '06:59', '17:03'], [['12', 'Dec'], '07:00', '17:04'], [['13', 'Dec'], '07:00', '17:04'], [['14', 'Dec'], '07:01', '17:04'], [['15', 'Dec'], '07:02', '17:04'], [['16', 'Dec'], '07:02', '17:05'], [['17', 'Dec'], '07:03', '17:05'], [['18', 'Dec'], '07:04', '17:05'], [['19', 'Dec'], '07:04', '17:06'], [['20', 'Dec'], '07:05', '17:06'], [['21', 'Dec'], '07:05', '17:07'], [['22', 'Dec'], '07:06', '17:07'], [['23', 'Dec'], '07:06', '17:08'], [['24', 'Dec'], '07:06', '17:08'], [['25', 'Dec'], '07:07', '17:09'], [['26', 'Dec'], '07:07', '17:09'], [['27', 'Dec'], '07:08', '17:10'], [['28', 'Dec'], '07:08', '17:11'], [['29', 'Dec'], '07:08', '17

#### Write to file Operations

In [8]:
# %% Write data to a csv file in the current directory.

with open('sunrise_set_2023.csv', mode='w', newline='') as suntimes:
    writer = csv.writer(suntimes)
    for i in range(len(outList3)):
        writer.writerow([outList3[i][0][1], outList3[i][0][0], outList3[i][1], outList3[i][2]])