# <center>Class 2</center>

## Navigating The File System

In [None]:
import os

In [None]:
os.getcwd()

Create a path in an OS-compatible format. (macOS and Linux use forward slashes, Windows uses backslashes as the separator in paths.)

In [None]:
directory = 'my_folder'
subdirectory = 'sub_folder'
filename = 'file.txt'

full_path = os.path.join(directory, subdirectory, filename)

full_path

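**For Windows users**: the output above shows double backslashes because a backslash is escaped in the string's representation. The `print()` function displays the path with single backslashes.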

In [None]:
print(full_path)
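
As a small sketch of what `os.path.join()` does: the separator it inserts is `os.sep`, which depends on the operating system running the code. The folder and file names are the illustrative ones used above.

```python
import os

# os.sep is the path separator of the operating system running this code
print(repr(os.sep))

parts = ['my_folder', 'sub_folder', 'file.txt']
joined = os.path.join(*parts)

# Joining by hand with os.sep gives the same result for simple relative parts
manual = os.sep.join(parts)
print(joined == manual)

# os.path.split() reverses the last join: (directory part, file name)
head, tail = os.path.split(joined)
print(head, tail)
```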

Create and change directories, list directory contents. 

In [None]:
os.mkdir('data')

In [None]:
os.chdir('data')

In [None]:
os.getcwd()

In [None]:
os.pardir

In [None]:
os.chdir(os.pardir)

In [None]:
os.getcwd()

In [None]:
os.listdir()

Remove directories.

In [None]:
os.rmdir('data')

In [None]:
os.listdir()
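
Two details are worth knowing here: `os.mkdir()` raises `FileExistsError` if the directory already exists, and `os.rmdir()` only removes empty directories. A minimal sketch, run inside a temporary directory so that it does not touch your working folder (the names `scratch` and `note.txt` are made up for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    scratch = os.path.join(tmp, 'scratch')
    os.mkdir(scratch)

    # A second mkdir on the same path fails
    try:
        os.mkdir(scratch)
        second_mkdir_failed = False
    except FileExistsError:
        second_mkdir_failed = True

    # os.makedirs(..., exist_ok=True) is a tolerant alternative
    os.makedirs(scratch, exist_ok=True)

    # rmdir refuses to delete a directory that still has content
    note = os.path.join(scratch, 'note.txt')
    with open(note, 'w', encoding='utf-8') as f:
        f.write('hello')
    try:
        os.rmdir(scratch)
        rmdir_failed = False
    except OSError:
        rmdir_failed = True

    # After removing the file, rmdir works
    os.remove(note)
    os.rmdir(scratch)
    still_exists = os.path.exists(scratch)

print(second_mkdir_failed, rmdir_failed, still_exists)
```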

## To Do

List all files with a *'txt'* extension in the `data` directory in this repo. Hint: use list comprehension. 

In [None]:
# To navigate to the `data` directory from the current directory
os.path.join(os.pardir, 'data')

In [None]:
[x for x in os.listdir(os.path.join(os.pardir, 'data')) if x.endswith('.txt')]
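
A substring test such as `'.txt' in x` would also match names like `notes.txt.bak`. The sketch below uses a made-up list of file names (not the real contents of `data`) to compare it with two stricter checks:

```python
import os

# Hypothetical directory listing, for illustration only
names = ['example.txt', 'notes.txt.bak', 'fruits.txt', 'image.png']

by_substring = [x for x in names if '.txt' in x]
by_suffix = [x for x in names if x.endswith('.txt')]
# os.path.splitext() splits a name into (root, extension)
by_splitext = [x for x in names if os.path.splitext(x)[1] == '.txt']

print(by_substring)
print(by_suffix)
print(by_splitext)
```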

<br> 

## I/O: Reading from and writing to files

### Reading

First you need to **open** the file. 

In [None]:
datadir = os.path.join(os.pardir, 'data')

In [None]:
file = os.path.join(datadir, 'example.txt')

In [None]:
f = open(file)
print(f)

In [None]:
f.read()

In [None]:
f.close()

Let's fix the encoding issues...

You can also add **encoding information** to the `open()` function. You need to know these encodings:
- **'utf-8'**: The most common encoding designed for backward compatibility with ASCII. UTF-8 is by far the most common encoding for the World Wide Web, accounting for over 97% of all web pages, and up to 100% for some languages, as of 2021.
- **'cp1250'** or **'Windows-1250'**: It is used under Microsoft Windows to represent texts in Central European and Eastern European languages that use Latin script, such as Polish, Czech, Slovak, Hungarian, Slovene, Serbo-Croatian (Latin script), Romanian (before 1993 spelling reform) and Albanian. It may also be used with the German language; German-language texts encoded with Windows-1250 and Windows-1252 are identical.
- **'iso8859_2'**: Part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987. It is informally referred to as "Latin-2". It is generally intended for Central or Eastern European languages that are written in the Latin script.

You can read about the encoding 'codecs' [here](https://docs.python.org/3/library/codecs.html).
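
To make the difference between these encodings concrete, here is a small sketch that encodes the same string with all three codecs. The Hungarian sample sentence is an assumption chosen for illustration; any text with accented Central European letters would do.

```python
# Sample text with accented letters (chosen for illustration)
text = "árvíztűrő tükörfúrógép"

sizes = {}
for codec in ("utf-8", "cp1250", "iso8859_2"):
    raw = text.encode(codec)
    # Decoding with the same codec restores the original string
    restored = raw.decode(codec)
    sizes[codec] = len(raw)
    print(codec, len(raw), restored == text)

# Decoding with the wrong codec does not fail for this text,
# it silently produces garbled characters instead
garbled = text.encode("utf-8").decode("cp1250")
print(garbled != text)
```

UTF-8 needs two bytes for each accented letter in this sample, while the two single-byte encodings need one, which is why the byte counts differ.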

In [None]:
f = open(file, encoding="utf-8")
f.read()

You also need to **close** the file. An open file keeps holding system resources, and other programs may be unable to access it until it is closed.

In [None]:
f.close()

In [None]:
file = os.path.join(datadir, 'ilikeapples.txt')

In [None]:
f = open(file, encoding="utf-8")

f.readline() # This command only reads one line

In [None]:
f.readline()

In [None]:
f.close()

In [None]:
# Now try this:
f.read()

The best way to close a file is by using the `with` statement. This ensures that the file is closed when the block inside the with statement is exited. We don't need to explicitly call the `close()` method. It is done internally.

In [None]:
with open(file, encoding="utf-8") as f:
    for line in f:                # remember to indent! 
        print(line)
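
A minimal sketch to confirm that the file really is closed after the `with` block. It writes a throwaway file into a temporary directory first so that it is self-contained; the file name `demo.txt` is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')

    with open(path, encoding='utf-8') as f:
        # rstrip('\n') drops the line break that each line carries,
        # so printing the lines would not produce empty lines in between
        lines = [line.rstrip('\n') for line in f]
        closed_inside = f.closed

    # The variable f still exists here, but the file behind it is closed
    closed_after = f.closed

print(lines)
print(closed_inside, closed_after)
```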

There are four ways to open a file:
- "r" - Read - Default value. Opens a file for reading, error if the file does not exist
- "a" - Append - Opens a file for appending, creates the file if it does not exist
- "w" - Write - Opens a file for writing, creates the file if it does not exist, and overwrites the content if it does
- "x" - Create - Creates the specified file, returns an error if the file exists
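
The four modes can be sketched as follows, using a throwaway file in a temporary directory (the file name `modes.txt` is made up for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'modes.txt')

    with open(path, 'x', encoding='utf-8') as f:   # create: file must not exist
        f.write('created\n')

    with open(path, 'a', encoding='utf-8') as f:   # append: adds to the end
        f.write('appended\n')

    with open(path, 'r', encoding='utf-8') as f:   # read
        after_append = f.read()

    with open(path, 'w', encoding='utf-8') as f:   # write: empties the file first
        f.write('overwritten\n')

    with open(path, 'r', encoding='utf-8') as f:
        after_write = f.read()

    # 'x' fails now, because the file already exists
    try:
        with open(path, 'x', encoding='utf-8') as f:
            pass
        x_failed = False
    except FileExistsError:
        x_failed = True

print(repr(after_append))
print(repr(after_write))
print(x_failed)
```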

In [None]:
file = os.path.join(datadir, 'fruits.txt')

In [None]:
dc_fruits=dict()
f = open(file)
f.readline() # This reads only one line, the first one with X and Y. We are often not interested in the header line of a file

for line in f:
    mykey, myvalue=line.strip().split('\t') # strip() removes whitespace
    dc_fruits[mykey]=myvalue

f.close() # Remember to close your file!!

print(dc_fruits)
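
The same parsing can be written with a `with` block, so that the file is closed automatically. This is only a sketch: the sample content below is a made-up stand-in for `fruits.txt`, assuming a header line followed by tab-separated pairs.

```python
import os
import tempfile

# Hypothetical stand-in for fruits.txt: a header line, then tab-separated pairs
sample = 'X\tY\napple\tred\nbanana\tyellow\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'fruits_sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)

    dc_sample = {}
    with open(path, encoding='utf-8') as f:
        next(f)                                  # skip the header line
        for line in f:
            key, value = line.strip().split('\t')
            dc_sample[key] = value

print(dc_sample)
```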

### Writing

In [None]:
write_text = open(os.path.join(datadir, 'message.txt'), 'w')

In [None]:
write_text.write('Hello Monthy! \nThis is my second Python class!')

In [None]:
write_text.close()

In [None]:
with open(os.path.join(datadir, 'message.txt'), mode = 'r', encoding='utf-8') as f:
    for line in f:
        print(line)

In [None]:
with open(os.path.join(datadir, 'message.txt'), mode = 'a', encoding='utf-8') as f:
    f.write('I hope you are enjoying Python!')

In [None]:
with open(os.path.join(datadir, 'message.txt'), mode = 'r', encoding='utf-8') as f:
    for line in f:
        print(line)

Question: How do you make sure that your entry is appended at a new line?

## Datetime

Python has no built-in date type, but the `datetime` module provides access to date and time functionality.

In [None]:
import datetime

In [None]:
D1 = datetime.date(2024, 9, 1)
T1 = datetime.time(12,0,0) # noon
DT = datetime.datetime(2024, 9, 1, 12, 15, 0)

# Typically you want to work with datetime because you can
# omit the time values and then it defaults to midnight.
D = datetime.datetime(2024,9, 1)

In [None]:
print('D1:', D1)
print('T1:', T1)
print('DT:', DT)
print('D:', D)

Once you have a `datetime` object you can do fancy things with it:

In [None]:
print("The year was %d and the day was %d." % (D.year, D.day))

print (f"The day of the week was {D.weekday()}.")
print ("(Monday = 0, ..., Sunday = 6.)")

In [None]:
D.utctimetuple()

In [None]:
utct = DT.utctimetuple()

In [None]:
utct.tm_yday

In [None]:
utct.tm_min

In [None]:
Dnow = datetime.datetime.now()
print(Dnow)

We can format the time using, for example, `strftime()` (all information about the format codes is [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)):

In [None]:
# 'strftime' stands for 'string format time'
Dnow.strftime("Date: %Y-%m-%d time: %H:%M") 

In [None]:
Dnow.strftime("%Y-%m-%d %H:%M:%S") 

Math operations make sense with datetime objects.

In [None]:
dt = Dnow - D
print(type(dt))
print("There are {:,.0f} days between then and now.".format(dt.days))

`timedelta` encodes time intervals. This allows us to do more operations:

In [None]:
interval = datetime.timedelta(days=100,hours=12) # 100.5 days

soon = datetime.datetime.now() + interval # addition!

interval_days = interval.total_seconds() / 86400 # 100.5; interval.days alone would drop the half day

print ("In %0.1f days it will be %s." % (interval_days, soon))
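
A short sketch of the attributes a `timedelta` offers, using two made-up dates that are 17.5 days apart:

```python
import datetime

start = datetime.datetime(2024, 9, 1)            # midnight
end = datetime.datetime(2024, 9, 18, 12, 0, 0)   # noon, 17.5 days later

delta = end - start
print(delta.days)             # whole days only
print(delta.seconds)          # remaining seconds within the last day
print(delta.total_seconds())  # the whole interval expressed in seconds

# Dividing one timedelta by another gives a plain float
fractional_days = delta / datetime.timedelta(days=1)
print(fractional_days)
```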


In [None]:
ts1 = "2024-09-18"
ts2 = "March 31, 1971"

We can now use a function to parse a string for a time given a string representing a time format. This uses a function called `strptime` (read it as **str**ing **p**arse **time**). 

Here we go.

In [None]:
d1 = datetime.datetime.strptime( ts1, "%Y-%m-%d" )
print(d1)
print(d1 + datetime.timedelta(days=-7))

The string `"%Y-%m-%d"` encodes the timestamp format we were looking for. A four-digit year (`%Y`), a dash (`-`), a two-digit month number (`%m`), another dash, and then a day number (`%d`).

Now ts2 incorporates the name of a month, so that format string is a little different (`%B` means the full month name).

In [None]:
d2 = datetime.datetime.strptime( ts2, "%B %d, %Y" )
print(d2)
print(d2 + datetime.timedelta(days=-7))

There's a huge number of ways to build a format string. It is best to look up the documentation: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Parallel to `strptime` is another function, `strftime` (string format time), that does the opposite: it takes a `date` or `datetime` and returns a string formatted according to a format string.

In [None]:
s_original = "Jun 1, '13"
d = datetime.datetime.strptime(s_original, "%b %d, '%y") # taking the time in a specific format as input
s_formatted  = d.strftime("%Y-%m-%d") # writing the time in another format
print (s_original, "--->", s_formatted)

Datetime is extremely useful, because different data sources encode times in different ways. Some formats are easy for humans to read, but I like the ISO 8601-style `%Y-%m-%d %H:%M:%S` timestamp because it _sorts nicely_.
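
What "sorts nicely" means can be sketched with a few made-up dates. The example assumes English month names, which is what `%B` expects unless the locale has been changed.

```python
import datetime

# Made-up timestamps in a human-friendly format
raw = ["March 31, 1971", "January 5, 2024", "December 1, 1999"]

# Sorted as plain strings, the order is alphabetical, not chronological
sorted_raw = sorted(raw)
print(sorted_raw)

# Parse each string, then write it back in year-month-day form
iso = [datetime.datetime.strptime(s, "%B %d, %Y").strftime("%Y-%m-%d") for s in raw]

# Sorted as plain strings, this form is also in chronological order
sorted_iso = sorted(iso)
print(sorted_iso)
```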

### Wrangling with timezones

In [None]:
D = datetime.datetime(2021,9,28,1,0,15)
D

Date and time objects may be categorized as “**aware**” or “**naive**” depending on whether or not they include timezone information. Our **D** variable is *naive*. 

In [None]:
print(D.tzinfo)

In [None]:
# pytz is a third-party timezone package (install it with pip if it is missing)
import pytz

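**D_W** will be *aware*. Note that calling `astimezone()` on a naive datetime assumes it is in your computer's local time zone, so the result below depends on where the code runs.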

In [None]:
D_W = D.astimezone(pytz.timezone("Europe/Vienna"))

In [None]:
print(D_W.tzinfo)

In [None]:
D_W.hour

Convert to other timezones.

In [None]:
D_W.astimezone(pytz.timezone("Europe/London"))

In [None]:
D_W.astimezone(pytz.timezone("Europe/London")).hour

In [None]:
D_W.astimezone(pytz.timezone("America/Montevideo"))
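
The difference between attaching a time zone and converting between time zones can be sketched with the standard library alone. The fixed offsets UTC+2 and UTC+1 below are assumptions standing in for Vienna and London summer time; named zones such as those in `pytz` also handle daylight saving changes, which fixed offsets do not. With `pytz`, the documented way to attach a zone to a naive datetime is the zone's `localize()` method.

```python
import datetime

naive = datetime.datetime(2021, 9, 28, 1, 0, 15)

# Fixed offsets used as stand-ins (assumption): UTC+2 and UTC+1
plus_two = datetime.timezone(datetime.timedelta(hours=2))
plus_one = datetime.timezone(datetime.timedelta(hours=1))

# replace() attaches the zone and keeps the clock time unchanged
aware = naive.replace(tzinfo=plus_two)
print(aware)

# astimezone() on an aware datetime expresses the same instant in another zone
converted = aware.astimezone(plus_one)
print(converted)
print(aware == converted)

# Mixing naive and aware values in arithmetic is an error
try:
    aware - naive
    mixing_failed = False
except TypeError:
    mixing_failed = True
print(mixing_failed)
```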

Everything about datetime is [here](https://docs.python.org/3/library/datetime.html). A list of tz database time zones is [here](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).

Note: Time zones are going to be extremely important when working with databases. Interfaces interact with various databases in sometimes unexpected ways regarding time zone conversion and behavior. You ***always*** need to check the consistency of your time fields. 

## Lambda functions

A lambda function is a small anonymous function. A lambda function can take any number of arguments, but can only have one expression. It is created using the `lambda` keyword.

In [None]:
square = lambda x: x ** 2

In [None]:
square(2)

We use lambda to simplify our code by creating temporary functions that are used only once. The same can be achieved with a normal function definition:

In [None]:
def square_def(x): 
    return x ** 2

In [None]:
square_def(2)
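
Two further sketches: a lambda with more than one argument, and the typical "used only once" case where a lambda is passed directly to another function. The word list is made up for illustration.

```python
# A lambda with two arguments
add = lambda a, b: a + b
print(add(3, 4))

# A throwaway function passed as the sort key
words = ['pear', 'fig', 'banana']
by_length = sorted(words, key=lambda w: len(w))
print(by_length)
```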

You can combine lambda functions with list comprehension. 

In [None]:
ls_numbers = list(range(10))

In [None]:
ls_numbers

Let's square all the values from the list and add 1 to each element

In [None]:
f = lambda x: x**2 + 1
[f(x) for x in ls_numbers]

Let's square and add one to each even number in the list

In [None]:
[f(x) for x in ls_numbers if x%2 == 0 ]

Square and add one to each even number in the list but return the odd numbers without transformation

In [None]:
[f(x) if x%2 == 0 else x for x in ls_numbers]

## Exception Handling (Try Except)

Exceptions signal errors in the code. Handling them lets you write constructs so that your program falls back to somewhere else if an error blocks the normal run of your code.

The `try` block lets you test a block of code for errors. <br>
The `except` block lets you handle the error.<br>
The `else` block is to be executed if no errors were raised.<br>
The `finally` block lets you execute code, regardless of the result of the try- and except blocks.<br>
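
How the four blocks interact can be sketched with a small helper that records which blocks were executed. The function `safe_divide` is hypothetical and exists only for this illustration.

```python
def safe_divide(a, b):
    """Hypothetical helper that records which blocks were executed."""
    steps = []
    result = None
    try:
        steps.append('try')
        result = a / b
    except ZeroDivisionError:
        steps.append('except')     # runs only if the division failed
    else:
        steps.append('else')       # runs only if no error was raised
    finally:
        steps.append('finally')    # runs in both cases
    return result, steps

ok = safe_divide(6, 3)
failed = safe_divide(6, 0)
print(ok)
print(failed)
```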

In [None]:
try:
    print("test")
    # generate an error: the variable 'test' is not defined
    print(test)
except:
    print("Caught an exception")

To get information about the error, we can access the `Exception` class instance that describes the exception by using for example:

    except Exception as e:

In [None]:
try:
    print("test")
    # generate an error: the variable test is not defined
    print(test)
except Exception as e:
    print("The problem with our code is the following: " + str(e))
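
Instead of catching every `Exception`, you can list specific exception types, each with its own `except` branch. A minimal sketch with a hypothetical helper, `describe_division`, that returns a message instead of printing it:

```python
def describe_division(a, b):
    """Hypothetical helper: returns the result or the error as a string."""
    try:
        return str(a / b)
    except ZeroDivisionError as e:
        return 'ZeroDivisionError: ' + str(e)
    except TypeError as e:
        return 'TypeError: ' + str(e)

fine = describe_division(1, 2)
by_zero = describe_division(1, 0)
by_text = describe_division(1, 'b')

print(fine)
print(by_zero)
print(by_text)
```

Catching only the exceptions you expect keeps unexpected errors visible instead of hiding them.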

In [None]:
from utils import add_two_numbers

In [None]:
add_two_numbers(3, 4)

In [None]:
add_two_numbers(3, 'b')

In [None]:
try:
    add_two_numbers(3, 'b')
except Exception as e:
    print('We ran into this error: ' + str(e))

And what happens here? 

In [None]:
def divide_two_numbers(a, b):
    try:
        c = a / b
    except Exception as e:
        print('The division could not be executed:', str(e))
    else:
        return c

In [None]:
try:
    divide_two_numbers(3, 'b') # This function already handles the error inside!
except Exception as e:
    print('We ran into this error: ' + str(e))
else:
    print('Everything went fine.')

Our code above printed an error message *and then* printed 'Everything went fine.' Why is that?

Our `divide_two_numbers` function already handled the error internally: it caught the exception, printed the message, and returned `None`. For this reason the function call itself did not raise any error, and the outer `try` statement exited on the `else` branch.

You can also raise an exception yourself with the `raise` statement.

In [None]:
def divide_two_numbers(a, b):
    if b == 1:
        raise Exception('It makes no sense to divide by 1.')
    else:
        try:
            c = a / b
        except Exception as e:
            print('The division could not be executed:', str(e))
        else:
            return c

In [None]:
divide_two_numbers(10, 1)
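
An exception raised this way can be caught like any other. The sketch below uses a hypothetical variant, `checked_divide`, which raises the more specific built-in `ValueError` so that the caller can react to exactly this case:

```python
def checked_divide(a, b):
    """Hypothetical variant that raises a specific exception type."""
    if b == 1:
        raise ValueError('It makes no sense to divide by 1.')
    return a / b

try:
    checked_divide(10, 1)
    message = None
except ValueError as e:
    message = str(e)

print(message)
print(checked_divide(10, 4))
```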

## Extra: Web-scraping using BeautifulSoup

Scraping data from the internet is not an integral part of the data analysis course, but it is a useful tool nonetheless. This exercise is just a taste of what's possible.

We are scraping currency exchange rates from the internet.

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
response = requests.get('https://www.investing.com/currencies/streaming-forex-rates-majors')

In [None]:
content = response.content

In [None]:
content

In [None]:
soup = BeautifulSoup(content, 'html.parser')

In [None]:
table = soup.find('table')

In [None]:
table

In [None]:
type(table)

In [None]:
# find the headers

ls_headers = []
for header in table.find_all('th'):
    ls_headers.append(header.text.strip())

In [None]:
ls_headers

In [None]:
# find the rates, that is the elements in the table

ls_rates = []

for row in table.find_all('tr'): # finds all the rows
    cols = row.find_all('td')  # finds the cells within the table; 'th' finds the header cells
    for element in cols:
        print(element)

In [None]:
# find the rates, that is the elements in the table

ls_rates = []

for row in table.find_all('tr'): # finds all the rows
    cols = row.find_all('td')  # finds the cells within the table; 'th' finds the header cells
    for element in cols:
        print(element.text.strip())

In [None]:
# find the rates, that is the elements in the table

ls_rates = []

for row in table.find_all('tr'): # finds all the rows
    cols = row.find_all('td')  # finds the cells within the table; 'th' finds the header cells
    cols = [element.text.strip() for element in cols]  # get text and strip whitespace
    if cols:  # see the condition, adds non-empty rows only
        ls_rates.append(cols)

In [None]:
ls_rates # list of lists

In [None]:
ls_rates[0]
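
Because a live web page can change or block automated requests, it helps to try the same parsing steps on a small HTML string first. The table below is made up and only mimics the shape of a rates table; the variable names ending in `_demo` are chosen so that they do not overwrite the ones above.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<table>
  <tr><th>Pair</th><th>Bid</th><th>Ask</th></tr>
  <tr><td>EUR/USD</td><td> 1.10 </td><td>1.11</td></tr>
  <tr><td>GBP/USD</td><td>1.27</td><td>1.28</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, 'html.parser')
table_demo = soup_demo.find('table')

# Header cells are <th> elements
headers_demo = [th.text.strip() for th in table_demo.find_all('th')]

rows_demo = []
for row in table_demo.find_all('tr'):
    cells = [td.text.strip() for td in row.find_all('td')]
    if cells:                      # the header row has no <td> cells
        rows_demo.append(cells)

print(headers_demo)
print(rows_demo)
```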

In [None]:
# now we put all this into a table; we will explore Pandas next time
import pandas as pd

In [None]:
df_crossrates = pd.DataFrame(ls_rates, columns= ls_headers)
df_crossrates

In [None]:
# we have an unwanted empty column here
df_crossrates.info()

In [None]:
df_crossrates.columns

In [None]:
df_crossrates.drop(columns= '', inplace = True)

In [None]:
df_crossrates.info()