# Modules
A module is a file with a `.py` extension that contains Python code (variables, executable statements, functions, and classes) and can be imported into another Python program or application. One of the major advantages of using modules is that they let you break large programs down into smaller, more readable and manageable components.

## Storing Functions in Modules

To store a function in a module, create a text file in your current directory, write your functions in it, then save it as a Python file (a name of your choice with the `.py` extension). You can create a text file from Jupyter Notebook as shown in the images below:
![image.png](attachment:image.png)

![image-4.png](attachment:image-4.png)
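
The screenshots show the file being created, but the file's contents are not reproduced in the text. As a rough sketch only, a `my_functions.py` that is consistent with the outputs shown in the cells that follow might look like this. The parameter names, the `'nigeria'` default, and the message wording are inferred from those outputs and are assumptions, not a copy of the original file:

```python
# my_functions.py (hypothetical reconstruction, inferred from the notebook outputs)

def user(f_name, l_name, age, country='nigeria', **extra):
    """Build a user profile as a dictionary."""
    profile = dict(extra)            # optional details such as dob, city, ...
    profile['f_name'] = f_name
    profile['l_name'] = l_name
    profile['age'] = age
    profile['country'] = country
    return profile


def greet_bday_user(today, *users):
    """Print a greeting for every user whose 'dob' equals `today` ('DD/MM')."""
    for person in users:
        if person.get('dob') == today:   # .get() tolerates users without a dob
            name = f"{person['f_name'].title()} {person['l_name'].title()}"
            print(f"Hurray! it's {today}")
            print(f"Happy birthday {name}. We celebrate you on your day!\n")
```

Once a file like this is saved next to the notebook, `import my_functions` makes both functions available.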

### Importing an Entire Module

When you import an entire module, to use any function in that module you have to call the name of the module followed by the function name:

    module_name.function_name(args...)

In [2]:
import my_functions

user1 = my_functions.user('peter','obarotu', 22, dob='12/12')
user2 = my_functions.user('kobe', 'bryant', 35, country='america', middle_name='bean', dob='23/08', city='philadephia')

In [3]:
user1

{'dob': '12/12',
 'f_name': 'peter',
 'l_name': 'obarotu',
 'age': 22,
 'country': 'nigeria'}

In [4]:
user2

{'middle_name': 'bean',
 'dob': '23/08',
 'city': 'philadephia',
 'f_name': 'kobe',
 'l_name': 'bryant',
 'age': 35,
 'country': 'america'}

### Importing a Specific Function
To import a specific function use the following syntax:

    from module_name import function_name

To import more than one function, use the following syntax:
    
    from module_name import function1, function2, ...


In [5]:
from my_functions import user

user3 = user('phil', 'mcguiness', 58, 'ireland', dob='23/08', city='dublin', phone_no='+35376329054')
user4 = user('lionel','messi', 34, 'argentina', middle_name='andres', spouse='antonella')
user5 = user('jack', 'bauer', 52, 'america', dob='23/08', city='los angeles', occupation='security agent')

users = [user1, user2, user3, user4, user5]

In [6]:
user3

{'dob': '23/08',
 'city': 'dublin',
 'phone_no': '+35376329054',
 'f_name': 'phil',
 'l_name': 'mcguiness',
 'age': 58,
 'country': 'ireland'}

In [7]:
user5

{'dob': '23/08',
 'city': 'los angeles',
 'occupation': 'security agent',
 'f_name': 'jack',
 'l_name': 'bauer',
 'age': 52,
 'country': 'america'}

In [8]:
from my_functions import user, greet_bday_user

greet_bday_user('23/08', *users)

Hurray! it's 23/08
Happy birthday Kobe Bryant. We celebrate you on your day!

Hurray! it's 23/08
Happy birthday Phil Mcguiness. We celebrate you on your day!

Hurray! it's 23/08
Happy birthday Jack Bauer. We celebrate you on your day!



### Aliases For Module and Function
Aliases are useful for reducing typing (for example, when a function name is long) or for avoiding a conflict with a name that already exists in your program. You create an alias with the `as` keyword. The syntax for both modules and functions is shown below:

    import module_name as mn
    
    from module_name import function_name as fn
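
As a small illustration with the standard library (the local `sqrt` function here is made up for the example), an alias lets an imported name coexist with a name your program already uses:

```python
import math as m                       # module alias
from math import sqrt as math_sqrt     # function alias


def sqrt(x):
    """A local function that happens to share a name with math.sqrt."""
    return f"square root of {x}"


print(sqrt(16))         # the local function, untouched by the import
print(math_sqrt(16))    # math.sqrt under its alias
print(m.pi)             # attribute access through the module alias
```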

In [9]:
import my_functions as mf

user1 = mf.user('peter','obarotu', 22, dob='12/12')
user2 = mf.user('kobe', 'bryant', 35, country='america', middle_name='bean', dob='23/08', city='philadephia')

In [10]:
user1

{'dob': '12/12',
 'f_name': 'peter',
 'l_name': 'obarotu',
 'age': 22,
 'country': 'nigeria'}

In [11]:
from my_functions import greet_bday_user as gb

gb('23/08', *users)

Hurray! it's 23/08
Happy birthday Kobe Bryant. We celebrate you on your day!

Hurray! it's 23/08
Happy birthday Phil Mcguiness. We celebrate you on your day!

Hurray! it's 23/08
Happy birthday Jack Bauer. We celebrate you on your day!



### Importing All Functions in a Module
You can import all functions in a module using the asterisk (`*`), but this is not always advisable because of name conflicts. If you unintentionally import a function whose name is already used in your program, the later definition silently replaces the earlier one and you might get an unexpected result. This method is even more dangerous with a module you didn't write.

    from module_name import *

In [12]:
from my_functions import *

user7 = user('marshall', 'mathers', 49, dob='17/10', middle_name='bruce', city='missouri')
user8 = user('tom', 'mapother', 60, country='america', middle_name='cruise', dob='03/07', city='new york')

In [13]:
greet_bday_user('17/10', user7, user8)

Hurray! it's 17/10
Happy birthday Marshall Mathers. We celebrate you on your day!



## Storing Classes in Modules

Storing and importing classes in a module follows the same method as for functions. First write a `.py` file that stores your classes. Here I have saved a module called `my_classes` that contains the `Employee`, `Manager` and `Benefits` classes from Lesson 8.

### Importing an Entire Module

In [14]:
import my_classes

John = my_classes.Employee('John', 'Olu', job='Marketer', salary=67000)
Davina = my_classes.Employee('Davina', 'Klu', 'Secretary', 75000)

In [15]:
John

Employee
Name:John Olu 
Role:Marketer 
Basic salary:67000

In [16]:
Davina

Employee
Name:Davina Klu 
Role:Secretary 
Basic salary:75000

### Importing a Specific Class

In [17]:
from my_classes import Employee, Manager, Benefits

Drake = Manager('Drake', 'Graham', 200000)
print(Drake)

Manager
Name:Drake Graham
Department:Sales 
Basic salary:200000


In [18]:
Drake.benefits.pto

0

In [19]:
Drake.benefits

Benefits: 
PTO:0 days 
Health Insurance:0
Life Insurance:0 
Pension:0

In [20]:
Drake.compute_salary(percent=.08)

216000

In [21]:
Drake.compute_salary(percent=.08, add_on=0.10)

236000

### Module in a Module
You can use a module in another module. Let's create a module for the `Employee` class called `employee.py` and another module for the `Benefits` and `Manager` classes called `manager.py`. We have to import the employee module into the manager module because the manager module depends on it; without that import, Python raises an error.

![image-4.png](attachment:image-4.png)
<br>


![image-5.png](attachment:image-5.png)
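
The dependency between two modules can also be tried out without touching your own files. The sketch below writes two throwaway modules into a temporary folder and imports the second one, which in turn imports the first. The module names `demo_worker` and `demo_supervisor` and their classes are hypothetical stand-ins, not the `employee.py` and `manager.py` shown in the screenshots:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source of the module that others depend on (hypothetical)
base_src = '''
class Worker:
    def __init__(self, name):
        self.name = name
'''

# Source of the dependent module: note the import on its first line
child_src = '''
from demo_worker import Worker

class Supervisor(Worker):
    def describe(self):
        return f"Supervisor: {self.name}"
'''

with tempfile.TemporaryDirectory() as folder:
    Path(folder, 'demo_worker.py').write_text(base_src)
    Path(folder, 'demo_supervisor.py').write_text(child_src)

    sys.path.insert(0, folder)          # make the folder importable
    try:
        importlib.invalidate_caches()
        demo_supervisor = importlib.import_module('demo_supervisor')
    finally:
        sys.path.remove(folder)

boss = demo_supervisor.Supervisor('Ada')
print(boss.describe())
```

Importing `demo_supervisor` is enough: Python loads `demo_worker` automatically because the dependent module asks for it.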

In [22]:
from manager import Manager, Benefits

In [23]:
trump = Manager('Donald', 'Trump', 350000)
print(trump)

Manager
Name:Donald Trump
Department:Sales 
Basic salary:350000


In [24]:
trump.give_raise()
print(trump)

Manager
Name:Donald Trump
Department:Sales 
Basic salary:427000


# Python Modules and Libraries

In this section, we will look at some common modules in the Python standard library and other third party packages that provide libraries not included in the standard library. 

## datetime Module

The `datetime` module in Python provides classes for working with dates and times. It includes classes for working with dates, times, date-times, time deltas, and time zones. In the `datetime` module, the following classes are available:

* `date`: Represents a date (year, month, day) and provides methods for working with dates.
* `time`: Represents a time (hour, minute, second, microsecond) and provides methods for working with times.
* `datetime`: Represents a date and time together and provides methods for working with both.
* `timedelta`: Represents a duration or difference between two dates or times.
* `tzinfo`: This is an abstract base class that can be used to define time zones. It has several methods that must be overridden in any concrete subclass.

We will take a look at each of these in detail:

### `date`
The `date` constructor takes three arguments:
    
    date(year, month, day)
    
* `year`: A four-digit integer representing the year.
* `month`: An integer between 1 and 12 representing the month.
* `day`: An integer between 1 and the number of days in the specified month and year.

An example is given below:

In [25]:
from datetime import date

my_date = date(2021, 3, 20)
print(my_date)
my_date

2021-03-20


datetime.date(2021, 3, 20)

Some of the attributes and methods available to a date object include the following:

* `year`: An integer representing the year.
* `month`: An integer representing the month, where January is 1 and December is 12.
* `day`: An integer representing the day of the month.
* `today()`: A class method that returns a date object representing the current local date.
* `replace(year=None, month=None, day=None)`: Returns a new date object with the specified year, month, and/or day. Any arguments that are not provided default to the corresponding value in the original date object.
* `ctime()`: Returns a string representing the date object in a human-readable format.
* `weekday()`: Returns the day of the week as an integer, where Monday is 0 and Sunday is 6.
* `isoweekday()`: Returns the day of the week as an integer, where Monday is 1 and Sunday is 7.
* `isoformat()`: Returns the date as an ISO 8601 formatted string.
* `strftime(format)`:  Pronounced "ess-tee-are-f-time" or "string-f-time". Returns the date as a formatted string according to the specified format string.
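
A quick sketch of the methods in this list that return the weekday or an ISO string, using 20 March 2021 (which fell on a Saturday) as the example date:

```python
from datetime import date

d = date(2021, 3, 20)

print(d.weekday())      # 5  (Monday is 0, so Saturday is 5)
print(d.isoweekday())   # 6  (Monday is 1, so Saturday is 6)
print(d.isoformat())    # 2021-03-20
```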


In [26]:
print(my_date.year)    # Return the year components of the date object
print(my_date.month)   # Return the month components of the date object
print(my_date.day)     # Return the day components of the date object

2021
3
20


In [27]:
# Return the current system date
print(my_date.today())     # Using an instance
print(date.today())        # Using the class directly

2025-02-21
2025-02-21


In [28]:
# Change the year and month, leaving the original day
my_date.replace(year=1998, month=7)

datetime.date(1998, 7, 20)

In [29]:
my_date = my_date.replace(year=1998, month=7)
print(my_date)

1998-07-20


In [30]:
ctime = my_date.ctime()
ctime

'Mon Jul 20 00:00:00 1998'

If you are curious about the day of the week you or someone else was born, you can apply the split method to the ctime string and get the first item:

In [31]:
dob = date(1999, 10, 17)
dob_ctime = dob.ctime()
dob_dow = dob_ctime.split()[0]   # gets day of the week(dow)
print(dob_dow)

Sun


In [32]:
print(my_date)
my_date.strftime(format='%d-%m-%Y') # Reorder the date as a string based on the format specified (day-month-year)

1998-07-20


'20-07-1998'

In [33]:
my_date.strftime(format='%B-%d-%Y')   #  American format(month-day-year)

'July-20-1998'

Some format codes and examples are given in the table below:

|Directive|Description|Example|
|:--:|:--:|:--:|
|%d|Day of month 01-31|31|
|%b|Month name, short version|Dec|
|%B|Month name, full version|December|
|%m|Month as a number 01-12|12|
|%y|Year, short version, without century|18|
|%Y|Year, full version|2018|
|%H|Hour 00-23|17|
|%I|Hour 01-12|05|
|%p|AM/PM|PM|
|%M|Minute 00-59|41|
|%S|Second 00-59|08|
|%f|Microsecond 000000-999999|548513|
|%a|Weekday, short version|Wed|
|%A|Weekday, full version|Wednesday|
|%w|Weekday as a number 0-6, 0 is Sunday|3|
|%z|UTC offset|+0100|
|%Z|Time zone name|CST|
|%c|Locale's version of date and time|Mon Dec 31 17:41:00 2018|

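Note that format codes meant for a time object (e.g. `%H`) are not meaningful for a date object, because a date object has no time components (e.g. hour), and vice versa. Python does not raise an error in this case; it fills in a placeholder value (such as `00` for the hour), so avoid mixing them.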
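
A minimal check of what happens when a format code does not match the object type. The call does not fail; the missing components are filled with placeholder values (0 for time fields on a `date`, and 1900-01-01 for date fields on a `time`), so the result is easy to misread:

```python
from datetime import date, time

d = date(1998, 7, 20)
t = time(17, 30, 20)

print(d.strftime('%H:%M:%S'))   # 00:00:00    (a date has no time part)
print(t.strftime('%Y-%m-%d'))   # 1900-01-01  (a time has no date part)
```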

In [34]:
my_date.strftime(format='%A, %B-%d-%Y')

'Monday, July-20-1998'

In [35]:
my_date.strftime(format='%A')

'Monday'

Finally, let's write a simple program that tells a person the day of the week they were born on, based on their date of birth:

In [36]:
print('Tell me your date of birth and I will tell you the day of the week you were born')

# Accept the date component and convert to an int
year = int(input('Enter Year(YYYY) '))
month = int(input('Enter Month(MM) '))
day = int(input('Enter Day(DD) '))

# Create a date object
dob = date(year, month, day)

# Extract the weekday and print
weekday = dob.strftime(format='%A')
print(f"You were born on {weekday}")

Tell me your date of birth and I will tell you the day of the week you were born


Enter Year(YYYY)  2009
Enter Month(MM)  05
Enter Day(DD)  04


You were born on Monday


### `time`
To create a time object, use the `time` class constructor, which is most often called with up to four arguments (all optional, each defaulting to 0):

    time(hour, minute, second, microsecond)

* `hour`: An integer representing the hour (0-23).
* `minute`: An integer representing the minute (0-59).
* `second`: An integer representing the second (0-59).
* `microsecond`: An integer representing the microsecond (0-999999).

An example of how to create a time object is given below:

In [37]:
from datetime import time

my_time = time(17, 30, 20)
print(my_time)
my_time

17:30:20


datetime.time(17, 30, 20)

Some of the common attributes and methods available to time objects include:
* `hour`: An integer representing the hour (0-23).
* `minute`: An integer representing the minute (0-59).
* `second`: An integer representing the second (0-59).
* `microsecond`: An integer representing the microsecond (0-999999).
* `tzinfo`: A time zone information object.
* `replace(hour=None, minute=None, second=None, microsecond=None, tzinfo=None)`: Returns a new time object with one or more of its attributes replaced. Any arguments that are not provided will be taken from the original object.
* `isoformat(timespec='auto')`: Returns a string representation of the time in ISO 8601 format (HH:MM:SS.mmmmmm).
* `strftime(format)`: Returns a string representation of the time formatted according to the specified format string.
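
The `timespec` argument of `isoformat()` controls how much of the time is included in the string. A short sketch with a few of the accepted values:

```python
from datetime import time

t = time(19, 15, 20, 100)

print(t.isoformat())                         # 19:15:20.000100
print(t.isoformat(timespec='minutes'))       # 19:15
print(t.isoformat(timespec='milliseconds'))  # 19:15:20.000
```

Note that `timespec` truncates rather than rounds, which is why the 100 microseconds disappear in the last line.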

In [38]:
print(my_time.hour)
print(my_time.minute)
print(my_time.second)
print(my_time.tzinfo)

17
30
20
None


We will discuss time zones shortly. Let's replace some of the time components in `my_time` using the `replace()` method to get a new time called `my_time2`:

In [39]:
my_time2 = my_time.replace(hour=19, minute=15, microsecond=100)
print(my_time2)

19:15:20.000100


In [40]:
# Convert to 12-hour clock using `%I`(hour; 01-12) and `%p`(AM/PM) format codes
my_time3 = my_time2.strftime(format='%I:%M:%S:%f %p')
print(my_time3)

07:15:20:000100 PM


Note that you can use any delimiter of your choice (including a space) to separate the components. The arrangement is entirely up to you, and the same goes for formatting a date object:

In [41]:
# Arrange from lower to higher time component using different delimiter
my_time3 = my_time2.strftime(format='%f.%S:%M*%I %p')
print(my_time3)

000100.20:15*07 PM


In [42]:
# convert to ISO string
my_time2.isoformat()

'19:15:20.000100'

### `datetime`
The datetime class represents a **date and time together**. It has the following syntax:

    datetime(year, month, day, hour, minute, second, microsecond, tzinfo)

where `year`, `month`, and `day` are integers that specify the date, and `hour`, `minute`, `second`, and `microsecond` are **optional** integers that specify the time. The `tzinfo` parameter is an optional object that provides time zone information.

An example is given below:

In [43]:
from datetime import datetime
date = datetime(2021, 3, 20, hour=17, minute=30, second=20)
print(date)
date

2021-03-20 17:30:20


datetime.datetime(2021, 3, 20, 17, 30, 20)

Some of the common attributes and methods include:

* `year`: Returns the year of the date as an integer.
* `month`: Returns the month of the date as an integer (1-12).
* `day`: Returns the day of the date as an integer.
* `hour`: Returns the hour of the time as an integer (0-23).
* `minute`: Returns the minute of the time as an integer (0-59).
* `second`: Returns the second of the time as an integer (0-59).
* `microsecond`: Returns the microsecond of the time as an integer.
* `tzinfo`: Returns the time zone information associated with the datetime object, or `None` if no time zone information is present.

* `now()`: Returns a datetime object representing the system current date and time.
* `date()`: Returns a date object representing the date part of the datetime object.
* `time()`: Returns a time object representing the time part of the datetime object.
* `replace()`: Returns a new datetime object with one or more attributes replaced. Any attributes not specified default to their current values.
* `strptime()`: Parses a string representation of a date and time into a datetime object.
* `strftime(format)`: Returns a string representing the date and time, formatted according to the specified format string. 
* `weekday()`: Returns the day of the week as an integer, where Monday is 0 and Sunday is 6.
* `ctime()`: Returns a string representing the datetime object in a human-readable format according to the standard ctime() format.
* `timestamp()`: Returns the number of seconds since the Unix epoch (January 1, 1970, at 00:00:00 UTC) as a floating-point number.

* `fromtimestamp()`: Takes a Unix timestamp as its argument and returns a datetime object representing the corresponding time in your local timezone.
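
As a brief sketch of a few methods from this list on a single object (the variable name `dt` is chosen only for this example):

```python
from datetime import datetime

dt = datetime(2021, 3, 20, 17, 30, 20)

print(dt.replace(hour=9, minute=0))   # 2021-03-20 09:00:20
print(dt.weekday())                   # 5  (Saturday)
print(dt.ctime())                     # Sat Mar 20 17:30:20 2021
print(dt.date(), dt.time())           # 2021-03-20 17:30:20
```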

In [44]:
print(date.year)  # gets the year
print(date.hour)  # gets the hour

2021
17


In [45]:
print(date.date())    # gets the date part
print(date.time())    # gets the time part 

2021-03-20
17:30:20


In [46]:
# get current system date and time as a datetime object
datetime.now()

datetime.datetime(2025, 2, 21, 10, 42, 56, 699641)

In [47]:
str_time = '2023-02-03 16:45:59'

# Parse date string as datetime object
ptime = datetime.strptime(str_time, '%Y-%m-%d %H:%M:%S')
print(ptime)
ptime

2023-02-03 16:45:59


datetime.datetime(2023, 2, 3, 16, 45, 59)

In [48]:
str_time

'2023-02-03 16:45:59'

When using `strptime` (pronounced es-tee-ar-pee-time), be aware that the format string must match the date string exactly; otherwise a `ValueError` is raised. You can always refer back to the table showing the format codes:

In [49]:
str_time = '04:45:59 PM 2023-02-03'

# Parse date string as datetime object
ptime = datetime.strptime(str_time, '%I:%M:%S %p %Y-%m-%d')  
print(ptime)
ptime

2023-02-03 16:45:59


datetime.datetime(2023, 2, 3, 16, 45, 59)
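
To see the failure case as well, here is a minimal sketch in which the first format string does not match the input, so `strptime` raises a `ValueError` that is caught and reported. The sample string is made up for the example:

```python
from datetime import datetime

text = '03/02/2023'                       # day/month/year

try:
    datetime.strptime(text, '%Y-%m-%d')   # wrong format for this string
except ValueError as err:
    print(f"Could not parse: {err}")

parsed = datetime.strptime(text, '%d/%m/%Y')   # matching format
print(parsed)                             # 2023-02-03 00:00:00
```

Wrapping the call in `try`/`except` like this is a common way to handle dates typed in by a user.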

Earlier we wrote a program that tells a person the weekday they were born on. We can rewrite the program so that only one input is required from the user instead of three:

In [51]:
## print("Tell me your date of birth and I will tell you the day of the week you were born")

# Accept a date string
str_dob = input('Enter DOB (YYYY-MM-DD): ')

# Parse the string into a datetime object
dob = datetime.strptime(str_dob, '%Y-%m-%d')

# Extract the weekday and print
weekday = dob.strftime(format='%A')
print(f"You were born on {weekday}")

Enter DOB (YYYY-MM-DD):  2009-05-04


You were born on Monday


Unix time, also known as POSIX time or epoch time, is a system for tracking time in computing systems. It is represented as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC) on January 1, 1970. This date and time is referred to as the "Unix epoch." The `timestamp()` method is used to get the Unix time from a datetime object and `fromtimestamp()` is used to convert a Unix time to a datetime object.

In [52]:
import datetime          # Imports the entire module

date = datetime.datetime.now()
print(date)

# Convert to unix timestamp
unix = date.timestamp()
print(unix)

# Convert from Unix to datetime object
converted = datetime.datetime.fromtimestamp(unix)
print(converted)

2025-02-21 10:45:42.701155
1740131142.701155
2025-02-21 10:45:42.701155
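
`now()` and `fromtimestamp()` without extra arguments work in the machine's local time zone, so the printed values differ from one computer to another. A small sketch that pins everything to UTC (by passing `tz` or `tzinfo`) gives results that are the same everywhere:

```python
import datetime

utc = datetime.timezone.utc

# The Unix epoch itself is timestamp 0
epoch = datetime.datetime.fromtimestamp(0, tz=utc)
print(epoch)                  # 1970-01-01 00:00:00+00:00

# Midnight UTC on 1 January 2025 as a Unix timestamp
new_year = datetime.datetime(2025, 1, 1, tzinfo=utc)
print(new_year.timestamp())   # 1735689600.0
```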


### pytz Module

None of the dates and times we have created so far has a time zone. A datetime object created without a time zone is called *naive*, and methods such as `timestamp()` treat it as being in the local time zone of the computer where the code is running. If you want to explicitly set the time zone of a datetime object to UTC, you can pass `timezone.utc` to the `tzinfo` argument. If you want another time zone, you have to define it based on its offset from UTC using the `timezone` class (we'll come to this shortly) or use the predefined time zones in the `pytz` module. `pytz` is a third-party package that covers a much wider range of named time zones and handles daylight saving time transitions, which the fixed-offset `timezone` class in the built-in `datetime` module does not. In order to use pytz, you need to install it:

In [53]:
# Un-comment the line below to install pytz package if you haven't already
#%pip install pytz

In [54]:
# Import the time and timezone class from datetime
from datetime import time, timezone
my_time = time(20, 30, tzinfo=timezone.utc)
print(my_time)
print(my_time.tzinfo)
my_time.tzinfo

20:30:00+00:00
UTC


datetime.timezone.utc

In [55]:
import datetime

# Import pytz
import pytz

# Creates lagos time zone
timezone = pytz.timezone('Africa/Lagos')

my_time = datetime.time(20, 30, tzinfo=timezone)
print(my_time)
print(my_time.tzinfo)
my_time.tzinfo

20:30:00
Africa/Lagos


<DstTzInfo 'Africa/Lagos' LMT+0:14:00 STD>

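You can give a previously defined datetime object that has no time zone (a naive object, with `tzinfo=None`) a time zone without recreating it from scratch. The `localize()` method of a pytz time zone attaches that time zone to a naive `datetime` object. If the time zone is already set, a `ValueError` will be raised. The method takes a datetime object as input and returns a new datetime object with the specified time zone. With pytz, `localize()` is preferred over passing the zone through the `tzinfo` argument, which can attach the zone's historical local mean time (the `LMT+0:14:00` in the output above) instead of its current offset: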

In [56]:
import pytz 

# Create date+time (naive tzinfo[=None])
time = datetime.datetime(2004, 10, 1, 23, 30)
print(time)

# Create New York time zone
nyc = pytz.timezone('America/New_York')

# Set time zone to NYC 
nyc_time = nyc.localize(time)      # Use the localize method of a time zone
print(nyc_time)

2004-10-01 23:30:00
2004-10-01 23:30:00-04:00


The string "2004-10-01 23:30:00-04:00" represents a specific point in time, in the format year-month-day hour:minute:second followed by the UTC offset. In this case, the offset is "-04:00", which indicates that New York was four hours behind Coordinated Universal Time (UTC-4) on that date. To get a list of all time zones available in `pytz`, use `pytz.all_timezones`:

In [57]:
# Get a list of all available time zones
time_zones = pytz.all_timezones

# Print the list of time zones
for tz in time_zones:
    print(tz)

Africa/Abidjan
Africa/Accra
Africa/Addis_Ababa
Africa/Algiers
Africa/Asmara
Africa/Asmera
Africa/Bamako
Africa/Bangui
Africa/Banjul
Africa/Bissau
Africa/Blantyre
Africa/Brazzaville
Africa/Bujumbura
Africa/Cairo
Africa/Casablanca
Africa/Ceuta
Africa/Conakry
Africa/Dakar
Africa/Dar_es_Salaam
Africa/Djibouti
Africa/Douala
Africa/El_Aaiun
Africa/Freetown
Africa/Gaborone
Africa/Harare
Africa/Johannesburg
Africa/Juba
Africa/Kampala
Africa/Khartoum
Africa/Kigali
Africa/Kinshasa
Africa/Lagos
Africa/Libreville
Africa/Lome
Africa/Luanda
Africa/Lubumbashi
Africa/Lusaka
Africa/Malabo
Africa/Maputo
Africa/Maseru
Africa/Mbabane
Africa/Mogadishu
Africa/Monrovia
Africa/Nairobi
Africa/Ndjamena
Africa/Niamey
Africa/Nouakchott
Africa/Ouagadougou
Africa/Porto-Novo
Africa/Sao_Tome
Africa/Timbuktu
Africa/Tripoli
Africa/Tunis
Africa/Windhoek
America/Adak
America/Anchorage
America/Anguilla
America/Antigua
America/Araguaina
America/Argentina/Buenos_Aires
America/Argentina/Catamarca
America/Argentina/ComodRivad

To convert a datetime object from one time zone to another using pytz, you can use the `astimezone()` method of a datetime object. This method creates a new datetime object in the specified time zone by adjusting the original datetime object to the new time zone:

In [58]:
print(nyc_time)

# Create a lagos time zone
lag_timezone = pytz.timezone('Africa/Lagos')

# Converts nyc time to lagos time and print
lag_time = nyc_time.astimezone(lag_timezone)
print(lag_time)

2004-10-01 23:30:00-04:00
2004-10-02 04:30:00+01:00


As you can see, while New York is at 11:30 PM on October **1**, 2004, Lagos is already in the next day, October **2**, 2004, at 04:30 AM. The UTC offset is also attached to each time. New York is 4 hours behind UTC and Lagos is an hour ahead, so Lagos is 5 hours ahead of New York in total.
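
The same arithmetic can be checked without pytz. This sketch uses the standard library's fixed-offset `timezone` class, with the two offsets from the printed output hard-coded (UTC-4 and UTC+1). Unlike a pytz zone, a fixed offset knows nothing about daylight saving time, so this is only an illustration of what `astimezone()` does:

```python
import datetime

# Fixed offsets copied from the printed output (assumed constant here)
new_york = datetime.timezone(datetime.timedelta(hours=-4))
lagos = datetime.timezone(datetime.timedelta(hours=1))

ny_moment = datetime.datetime(2004, 10, 1, 23, 30, tzinfo=new_york)
lagos_moment = ny_moment.astimezone(lagos)

print(ny_moment)                    # 2004-10-01 23:30:00-04:00
print(lagos_moment)                 # 2004-10-02 04:30:00+01:00
print(ny_moment == lagos_moment)    # True: same instant, different wall clock
```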

For more information and methods that can be used with pytz datetime objects, check the [online documentation](https://pythonhosted.org/pytz/).

Now that we have taken care of time zones, let's get back to the classes in the datetime module. The two we have not yet discussed are `timedelta` and `tzinfo`.

### `timedelta`
`timedelta` is another class in the datetime module that represents a duration or difference between two dates or times. It is used for various time-related calculations.

The general syntax for creating a timedelta object is as follows:

    timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)

The arguments are as follows:

* `days`: The number of days in the duration (can be negative)
* `seconds`: The number of seconds in the duration (can be negative)
* `microseconds`: The number of microseconds in the duration (can be negative)
* `milliseconds`: The number of milliseconds in the duration (can be negative)
* `minutes`: The number of minutes in the duration (can be negative)
* `hours`: The number of hours in the duration (can be negative)
* `weeks`: The number of weeks in the duration (can be negative)

In [59]:
timestamp1 = datetime.datetime(2023, 1, 1)   # New year 2023
timestamp2 = datetime.datetime.now()         # Current datetime

# Number of days passed since new year of 2023
delta = timestamp2 - timestamp1
print(delta)

782 days, 10:46:10.509024


The difference between the two times above is an example of a time delta. One can also be created directly using the `timedelta` constructor:

In [60]:
from datetime import timedelta
delta = timedelta(days=100)
print(delta)
delta

100 days, 0:00:00


datetime.timedelta(days=100)

If you're inaugurated today into a public office, you can determine the date of your first 100 days in office:

In [61]:
from datetime import date

In [62]:
inaugural_date = date.today()
delta = timedelta(days=100)
# Add 100 days to the inaugural_date
first_100 = inaugural_date + delta
print(first_100)

2025-06-01


Good luck! Hope you achieve your 100-day goals before that date. The total number of seconds in the duration/delta can be calculated using the `total_seconds()` method:

In [63]:
delta.total_seconds()

8640000.0

Some of the most commonly used attributes and methods of a `timedelta`:

* `days`: Returns the number of days in the duration.
* `seconds`: Returns the number of seconds in the duration (**excluding** days).
* `microseconds`: Returns the number of microseconds in the duration (**excluding** days and seconds).
* `total_seconds()`: Returns the total number of seconds in the duration (**including** days, seconds, and microseconds).

In [64]:
timedelta(milliseconds=25, microseconds=600)

datetime.timedelta(microseconds=25600)

In [65]:
delta = timedelta(days=2, hours=4, minutes=30, seconds=47)
print(delta)
print(delta.days)
print(delta.seconds)

2 days, 4:30:47
2
16247
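
Internally a `timedelta` stores only days, seconds and microseconds, and normalizes whatever you pass in. A short sketch of how that affects `seconds` versus `total_seconds()`, including a negative duration (the variable names are chosen only for this example):

```python
from datetime import timedelta

span = timedelta(weeks=1, hours=36)     # 7 days + 1.5 days
print(span.days)                        # 8
print(span.seconds)                     # 43200  (the leftover 12 hours)
print(span.total_seconds())             # 734400.0

back = timedelta(hours=-1)              # stored as -1 day + 23 hours
print(back.days, back.seconds)          # -1 82800
print(back.total_seconds())             # -3600.0
```

For negative durations in particular, `total_seconds()` is usually the value you want, since `days` and `seconds` on their own can be misleading.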


Python can subtract time zone aware datetime objects directly, but converting them to a common time zone such as UTC first makes the calculation easier to follow:

In [66]:
# Attach each time zone with localize(); with pytz, passing the zone through
# tzinfo= would silently use its historical local mean time offset
datetime_1 = pytz.timezone('America/New_York').localize(datetime.datetime(2022, 10, 20, 13, 57))
datetime_2 = pytz.timezone('Asia/Tokyo').localize(datetime.datetime(2023, 2, 28, 21, 15))

# Convert time to UTC timezone
datetime_1_utc = datetime_1.astimezone(pytz.utc)
datetime_2_utc = datetime_2.astimezone(pytz.utc)

# Calculate difference in time
delta = datetime_1_utc - datetime_2_utc
print(delta)
delta

-131 days, 5:42:00


datetime.timedelta(days=-131, seconds=20520)

### `tzinfo`
This is an abstract base class used to define time zone information for time zone aware datetimes in the datetime module. `tzinfo` **should not be instantiated directly; instead you create a subclass that implements the necessary methods**. The `timezone` class in the datetime module is itself a subclass of `tzinfo` and can be used to create a custom time zone with a fixed offset. If, for example, you discover that the Lagos time zone is actually offset from UTC by 1 hour 15 minutes and not the usual 1 hour, you can create a datetime object that is aware of this time zone:

In [67]:
from datetime import datetime, timezone, timedelta

# Create the offset, which must be a time delta for proper time manipulation
offset = timedelta(minutes=75)

# Create the timezone with the proper offset
new_tz = timezone(offset)

dt = datetime(2022, 9, 15, 10, 45, tzinfo=new_tz)
print(dt)
print(dt.tzinfo)
dt

2022-09-15 10:45:00+01:15
UTC+01:15


datetime.datetime(2022, 9, 15, 10, 45, tzinfo=datetime.timezone(datetime.timedelta(seconds=4500)))

To define your own custom `tzinfo` subclass, you need to define the necessary methods to convert between UTC and the local timezone, and to determine the timezone offset at a given point in time. The required methods are:

* `tzname(self, dt)`: Returns the name of the local timezone at the specified datetime dt. This method should return a string representing the timezone name.
* `utcoffset(self, dt)`: Returns the offset of the local timezone from UTC at the specified datetime dt. This method should return a timedelta object representing the offset.
* `dst(self, dt)`: Returns the daylight saving time (DST) adjustment for the local timezone at the specified datetime dt. This method should return a timedelta object representing the DST adjustment (`timedelta(0)` if DST is not in effect), or None if DST information isn't known.

In [68]:
import datetime

class NewLagTZ(datetime.tzinfo):
    
    def tzname(self, dt):
        return "New_Lagos_Timezone"
        
    def utcoffset(self, dt):  # Set the offset from UTC
        return datetime.timedelta(hours=1, minutes=15)
    
    def dst(self, dt):
        return datetime.timedelta(0)     # DST is never in effect for this zone
    
    def __repr__(self):
        offset = self.utcoffset(dt=None)
        return f"New_Lagos_Timezone UTC+{offset}"

In [69]:
import datetime

dt = datetime.datetime(2022, 9, 15, 10, 45, tzinfo=NewLagTZ())
print(dt)
print(dt.tzinfo)
dt

2022-09-15 10:45:00+01:15
New_Lagos_Timezone UTC+1:15:00


datetime.datetime(2022, 9, 15, 10, 45, tzinfo=New_Lagos_Timezone UTC+1:15:00)
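
Once the three methods are defined, the standard conversion machinery works with a custom class. A minimal sketch, using a hypothetical `QuarterPastOne` class with the same 1 hour 15 minute offset (it returns `timedelta(0)` from `dst()`, which is the documented way to say that DST is not in effect):

```python
import datetime


class QuarterPastOne(datetime.tzinfo):
    """Hypothetical fixed zone, 1 hour 15 minutes ahead of UTC."""

    def tzname(self, dt):
        return "UTC+01:15"

    def utcoffset(self, dt):
        return datetime.timedelta(hours=1, minutes=15)

    def dst(self, dt):
        return datetime.timedelta(0)


local = datetime.datetime(2022, 9, 15, 10, 45, tzinfo=QuarterPastOne())
as_utc = local.astimezone(datetime.timezone.utc)

print(local.tzname())   # UTC+01:15
print(as_utc)           # 2022-09-15 09:30:00+00:00
```

`astimezone()` calls `utcoffset()` on the custom class to work out the UTC time, which is why 10:45 local becomes 09:30 UTC.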

In [70]:
pytz.timezone('Asia/Tokyo')

<DstTzInfo 'Asia/Tokyo' LMT+9:19:00 STD>

## requests Library

The requests library in Python is a third-party library that allows you to easily make HTTP requests and interact with web APIs (Application Programming Interface). It provides a simple and intuitive interface for sending HTTP requests and handling responses, making it a popular choice for web scraping, automation, and other HTTP-related tasks.

The requests library provides several functions, classes and modules that you can use to make HTTP requests and interact with APIs. Since we're only interested in retrieving information from the web, we will only look at the read-only methods like `get()` and `head()`. Write methods like `put()` and `post()` are used to modify information on a server (if you have the necessary credentials). Here are the functions, classes and modules in the requests library we will look at in detail:

* `get()`: This method sends a GET request to the specified URL and returns the server's response.
* `head()`: This method sends a HEAD request to the specified URL and returns the server's response headers.
* `Session`: This class represents a persistent session with a server. It allows you to reuse **a single connection to the server for multiple requests**, which can improve performance and reduce the risk of errors.
* `exceptions`: This module provides several exception classes that are raised when errors occur during an HTTP request, such as `ConnectionError`, `HTTPError`, and `Timeout`.

In this section we will work with the JSONPlaceholder API. It is a free fake API that allows you to test and experiment with HTTP requests. It provides several *endpoints* for different types of data, including posts, comments, albums, and more. You can use the `requests` library to make requests to the API and receive JSON responses. The API documentation can be found at https://jsonplaceholder.typicode.com/.

### `get()`
`get()` is a function in the requests module that sends an [HTTP GET](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/GET) request to a specified URL and returns the server's response. Here's the syntax for using `get()`:

    response = requests.get(url, params=None, **kwargs)

The `get()` method takes the following arguments:

* `url`: The URL to which the GET request is sent.
* `params (optional)`: A dictionary or list of tuples that contains the query string parameters to include in the request URL.
* `**kwargs (optional)`: Additional keyword arguments that are passed on to the request. These include `headers`, `auth`, `cookies`, `proxies`, `timeout`, `allow_redirects`, `stream`, and `verify`.


In [71]:
# Un-comment and run the code below if you're running this cell for the first time
# %pip install requests

In [96]:
# import the module
import requests

# URL of the JSONPlaceholder site with the `posts` endpoint
# This is where the posts are served as JSON
url = 'https://jsonplaceholder.typicode.com/posts'

# GET the JSON file
response = requests.get(url=url)

# Check the status of the request
response.status_code

200

The `get()` function returns what is called a `Response` object, which contains the server's response to the request. Some common attributes and methods of the Response object returned by the `get()` method of the requests module are:

* `status_code`: Returns the HTTP status code of the response (e.g. 200 for success, 404 for not found, etc.).
* `headers`: Returns a dictionary of the HTTP headers sent by the server in the response.
* `text`: Returns the response content as a Unicode string.
* `content`: Returns the response content as bytes.
* `json()`: Returns the response content as a Python object, parsed as JSON.
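
To make the relationship between `content`, `text` and `json()` concrete, here is a minimal offline sketch. The `raw_body` bytes are a made-up stand-in for a server reply (an assumption, not real API output), and the standard-library `json` module is used to approximate what `json()` does conceptually:

```python
import json

# Hypothetical raw body, standing in for what `response.content` would hold
raw_body = b'[{"userId": 1, "id": 1, "title": "demo post"}]'

# `text` is essentially the bytes decoded to a string
body_text = raw_body.decode("utf-8")

# `json()` is essentially the text parsed into Python objects
parsed = json.loads(body_text)

print(type(raw_body).__name__, type(body_text).__name__, type(parsed).__name__)
print(parsed[0]["title"])
```

In other words, the three attributes are different views of the same payload: bytes, string, and parsed Python object.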

In [97]:
response.headers

{'Date': 'Fri, 21 Feb 2025 09:49:24 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Report-To': '{"group":"heroku-nel","max_age":3600,"endpoints":[{"url":"https://nel.heroku.com/reports?ts=1740027441&sid=e11707d5-02a7-43ef-b45e-2cf4d2036f7d&s=cs1tPGvvHjCJUAI2GiTLeRAklxpB%2BYnJvZ1uMN2pncQ%3D"}]}', 'Reporting-Endpoints': 'heroku-nel=https://nel.heroku.com/reports?ts=1740027441&sid=e11707d5-02a7-43ef-b45e-2cf4d2036f7d&s=cs1tPGvvHjCJUAI2GiTLeRAklxpB%2BYnJvZ1uMN2pncQ%3D', 'Nel': '{"report_to":"heroku-nel","max_age":3600,"success_fraction":0.005,"failure_fraction":0.05,"response_headers":["Via"]}', 'X-Powered-By': 'Express', 'X-Ratelimit-Limit': '1000', 'X-Ratelimit-Remaining': '959', 'X-Ratelimit-Reset': '1740027495', 'Vary': 'Origin, Accept-Encoding', 'Access-Control-Allow-Credentials': 'true', 'Cache-Control': 'max-age=43200', 'Pragma': 'no-cache', 'Expires': '-1', 'X-Content-Type-Options': 'nosniff', 'Etag': 'W/"6b80-Y

In [98]:
print(response.headers['Date'])
print(response.headers['Content-Type'])
print(response.headers['Server'])

Fri, 21 Feb 2025 09:49:24 GMT
application/json; charset=utf-8
cloudflare


In [99]:
print(response.text)

[
  {
    "userId": 1,
    "id": 1,
    "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
    "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
  },
  {
    "userId": 1,
    "id": 2,
    "title": "qui est esse",
    "body": "est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla"
  },
  {
    "userId": 1,
    "id": 3,
    "title": "ea molestias quasi exercitationem repellat qui ipsa sit aut",
    "body": "et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut"
  },
  {
    "userId": 1,
    "id": 4,
    "title": "eum et est occaecati",
    "body": "ullam et saepe reic

In [100]:
type(response.text)

str

In [101]:
# Converts to a Python object
response = response.json()

# print object type
print(type(response))

# print length 
len(response)

<class 'list'>


100

In [102]:
response[:3]    # List the first three items/posts

[{'userId': 1,
  'id': 1,
  'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
  'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'},
 {'userId': 1,
  'id': 2,
  'title': 'qui est esse',
  'body': 'est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla'},
 {'userId': 1,
  'id': 3,
  'title': 'ea molestias quasi exercitationem repellat qui ipsa sit aut',
  'body': 'et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut'}]

Each post records the user who wrote it (`userId`), its own identifier (`id`), and its content: the `title` and the `body`.

Now that we have our data, we can perform analysis on them.
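
As one small illustration of such analysis, the sketch below counts posts per user and measures title lengths. It runs on a hand-made `sample_posts` list (an assumption, shaped like the API output but with shortened titles) so that it works without a network connection:

```python
from collections import Counter

# Hand-made sample in the same shape as the posts above (titles shortened)
sample_posts = [
    {"userId": 1, "id": 1, "title": "first"},
    {"userId": 1, "id": 2, "title": "second"},
    {"userId": 2, "id": 11, "title": "third"},
]

# Count how many posts each user has written
posts_per_user = Counter(post["userId"] for post in sample_posts)
print(posts_per_user)

# Length of each title, keyed by post id
title_lengths = {post["id"]: len(post["title"]) for post in sample_posts}
print(title_lengths)
```

The same two expressions should work unchanged on the full list of 100 posts.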

In [103]:
response[0]  # Pull the first post

{'userId': 1,
 'id': 1,
 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}

In [104]:
print(f"Title:\n{response[0]['title']}\nBody:\n{response[0]['body']}")


Title:
sunt aut facere repellat provident occaecati excepturi optio reprehenderit
Body:
quia et suscipit
suscipit recusandae consequuntur expedita et cum
reprehenderit molestiae ut ut quas totam
nostrum rerum est autem sunt rem eveniet architecto


We can pull all the posts belonging to a certain user:

In [105]:
user_id = 1
user_post = []
for post in response:
    if post['userId'] == user_id:
        user_post.append(post)

user_post[:3]       # First three posts belonging to user 1

[{'userId': 1,
  'id': 1,
  'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
  'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'},
 {'userId': 1,
  'id': 2,
  'title': 'qui est esse',
  'body': 'est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla'},
 {'userId': 1,
  'id': 3,
  'title': 'ea molestias quasi exercitationem repellat qui ipsa sit aut',
  'body': 'et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut'}]

In [106]:
for post in user_post[1:5]:
    print(f"Title:\n{post['title']}\nBody:\n{post['body']} \n")

Title:
qui est esse
Body:
est rerum tempore vitae
sequi sint nihil reprehenderit dolor beatae ea dolores neque
fugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis
qui aperiam non debitis possimus qui neque nisi nulla 

Title:
ea molestias quasi exercitationem repellat qui ipsa sit aut
Body:
et iusto sed quo iure
voluptatem occaecati omnis eligendi aut ad
voluptatem doloribus vel accusantium quis pariatur
molestiae porro eius odio et labore et velit aut 

Title:
eum et est occaecati
Body:
ullam et saepe reiciendis voluptatem adipisci
sit amet autem assumenda provident rerum culpa
quis hic commodi nesciunt rem tenetur doloremque ipsam iure
quis sunt voluptatem rerum illo velit 

Title:
nesciunt quas odio
Body:
repudiandae veniam quaerat sunt sed
alias aut fugiat sit autem sed est
voluptatem omnis possimus esse voluptatibus quis
est aut tenetur dolor neque 



Let's wrap this in a function:

In [107]:
def fetch_user_post(response, uid, printed=False, start=None, end=None):
    """
    Fetch all posts belonging to a particular user into a list by default.
    If `printed` is set to True, print the post contents instead (and return None).
    `start` and `end` are optional parameters that select only a slice of
    the posts.
    """
    user_post = [post for post in response if post['userId'] == uid]
    if printed:
        for post in user_post[start:end]:
            print(f"Id:{post['id']}\nTitle:\n{post['title']}\nBody:\n{post['body']} \n")
    else:
        return user_post[start:end]

In [108]:
user_4 = fetch_user_post(response, uid=4, printed=True, start=3, end=5)
user_4

Id:34
Title:
magnam ut rerum iure
Body:
ea velit perferendis earum ut voluptatem voluptate itaque iusto
totam pariatur in
nemo voluptatem voluptatem autem magni tempora minima in
est distinctio qui assumenda accusamus dignissimos officia nesciunt nobis 

Id:35
Title:
id nihil consequatur molestias animi provident
Body:
nisi error delectus possimus ut eligendi vitae
placeat eos harum cupiditate facilis reprehenderit voluptatem beatae
modi ducimus quo illum voluptas eligendi
et nobis quia fugit 



As a bit of a diversion, the posts are in Latin, and if you want them translated to English or any other language you can use a module named `easygoogletranslate`, an unofficial Google Translate API created by a [GitHub user](https://github.com/ahmeterenodaci/easygoogletranslate).

In [109]:
# Un-comment and run the code below if you're running this cell for the first time
# %pip install easygoogletranslate

In [110]:
# Import the EasyGoogleTranslate class from the module
from easygoogletranslate import EasyGoogleTranslate

# Create an instance of the class with certain parameters set
translator = EasyGoogleTranslate(source_language='la',     # Latin
                                 target_language='en',     # English
                                 timeout=10)

# Translate using the translate() method of the class
translator.translate("Python amo")

'Python Like'

In [111]:
response

[{'userId': 1,
  'id': 1,
  'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
  'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'},
 {'userId': 1,
  'id': 2,
  'title': 'qui est esse',
  'body': 'est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla'},
 {'userId': 1,
  'id': 3,
  'title': 'ea molestias quasi exercitationem repellat qui ipsa sit aut',
  'body': 'et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut'},
 {'userId': 1,
  'id': 4,
  'title': 'eum et est occaecati',
  'body': 'ullam et saepe reiciendis voluptatem adipisci\nsit amet autem assumenda provid

In [112]:
response[0]['title']

'sunt aut facere repellat provident occaecati excepturi optio reprehenderit'

In [113]:
translator.translate(response[0]['title'])

'are or to do the drives provision blinded by an exception option to criticize'

The `params` argument of the `get()` method is used to specify query parameters to include in the URL for a GET request. You can think of the parameters in an API request as similar to ordering food or drinks in a restaurant. Just as you would give instructions to the waiter on how you want your coffee or meal prepared, you have the flexibility to utilize parameters that define the desired format of the response data, the specific type of data you wish to retrieve, the preferred language of the content, and more. Put simply, these parameters empower you to tailor the API request according to your requirements, within the permissible constraints. For instance, utilizing the `userId` key in a query parameter allows us to retrieve posts specific to a certain user. This approach eliminates the necessity of downloading all posts and subsequently filtering out the desired ones. 
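
To see what these parameters look like once they are attached to a URL, the sketch below uses the standard library's `urllib.parse.urlencode` to approximate the query string. This is an offline illustration rather than the library's own code path; the exact URL a real request used can be inspected afterwards through `response.url`.

```python
from urllib.parse import urlencode

url = "https://jsonplaceholder.typicode.com/posts"

# A single value becomes one key=value pair
single = urlencode({"userId": 1})

# doseq=True repeats the key once per list element
multiple = urlencode({"userId": [1, 2, 3]}, doseq=True)

print(f"{url}?{single}")
print(f"{url}?{multiple}")
```

A list value is therefore just a compact way of asking for the same key several times.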

In [114]:
# import the module
import requests

url = 'https://jsonplaceholder.typicode.com/posts'

# Get all posts belonging to user 1 only
params = {'userId':1}

# GET the JSON file
response = requests.get(url=url, params=params)

# Check the status of the request
response.status_code

200

In [115]:
# Converts to a Python object
response = response.json()

len(response)

10

Even better, we can get posts for multiple users:

In [116]:
url = 'https://jsonplaceholder.typicode.com/posts'

# Get all posts belonging to users 1, 2 and 3
params = {'userId':[1, 2, 3]}

# GET the JSON file
response = requests.get(url=url, params=params)

# Check the status of the request
response.status_code

200

In [117]:
response = response.json()
len(response)

30

In [118]:
response[8:25]

[{'userId': 1,
  'id': 9,
  'title': 'nesciunt iure omnis dolorem tempora et accusantium',
  'body': 'consectetur animi nesciunt iure dolore\nenim quia ad\nveniam autem ut quam aut nobis\net est aut quod aut provident voluptas autem voluptas'},
 {'userId': 1,
  'id': 10,
  'title': 'optio molestias id quia eum',
  'body': 'quo et expedita modi cum officia vel magni\ndoloribus qui repudiandae\nvero nisi sit\nquos veniam quod sed accusamus veritatis error'},
 {'userId': 2,
  'id': 11,
  'title': 'et ea vero quia laudantium autem',
  'body': 'delectus reiciendis molestiae occaecati non minima eveniet qui voluptatibus\naccusamus in eum beatae sit\nvel qui neque voluptates ut commodi qui incidunt\nut animi commodi'},
 {'userId': 2,
  'id': 12,
  'title': 'in quibusdam tempore odit est dolorem',
  'body': 'itaque id aut magnam\npraesentium quia et ea odit et ea voluptas et\nsapiente quia nihil amet occaecati quia id voluptatem\nincidunt ea est distinctio odio'},
 {'userId': 2,
  'id': 13,
  

To deepen our understanding of this concept, we'll work with the GitHub API. The base URL is https://api.github.com/, which lists the URLs of all the available endpoints (as shown in the image below):
![image-2.png](attachment:image-2.png)


* Repository url: `https://api.github.com/repos/{owner}/{repo}` - A URL template that returns information about a specific repository, where the `{owner}` parameter is the username or organization that owns the repository, and `{repo}` is the name of the repository.
* Repository search url: `https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}` - A URL template that you can use to search through all repositories on GitHub.
* Topic search url: `https://api.github.com/search/topics?q={query}{&page,per_page}` - A URL template that you can use to search for topics. The `{query}` parameter is the search query you want to perform. The `page` parameter sets the page number you want to retrieve, and the `per_page` parameter sets the number of results you want to retrieve per page.

Let's play with some of the link parameters:
* https://api.github.com/search/topics?q=python  -- Returns topics containing python
* https://api.github.com/search/topics?q=python,java -- Returns topics containing python and java
* https://api.github.com/search/topics?q=python&page=2 -- Retrieves only page 2 of the topics containing python
* https://api.github.com/search/topics?q=python&page=2&per_page=3 -- Retrieves page 2 of the topics containing python, with 3 items per page.
* https://api.github.com/search/topics?q=python&per_page=3 -- Retrieves the first page (the default when `page` is omitted) of the topics containing python, with 3 items per page.
* https://api.github.com/search/repositories?q=python&sort=stars&order=desc -- Returns repositories matching the keyword python, sorted by `stargazers_count` in descending order.

The `sort` parameter in the GitHub API can take the following values:

* `stars`: Sort by stars count, descending.
* `forks`: Sort by forks count, descending.
* `updated`: Sort by the time when the repository was last pushed to, descending.

The `order` parameter can take either `asc` or `desc` values, indicating the ascending or descending order of the sorted results.
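
The example links above can also be assembled programmatically. The sketch below is an offline approximation using the standard library's `urlencode` (not a real request); it shows that a dictionary and a list of tuples describe the same query, and how `page` and `per_page` are appended:

```python
from urllib.parse import urlencode

base_url = "https://api.github.com"
endpoint = "/search/repositories"

# Two equivalent ways of writing the same parameters
params_dict = {"q": "python", "sort": "stars", "order": "desc"}
params_list = [("q", "python"), ("sort", "stars"), ("order", "desc")]

query_from_dict = urlencode(params_dict)
query_from_list = urlencode(params_list)
print(query_from_dict == query_from_list)

# Add pagination parameters on top of the search parameters
paged = urlencode({**params_dict, "page": 2, "per_page": 3})
print(f"{base_url}{endpoint}?{paged}")
```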

Let's see how we can perform this operation with `requests` and get the result in Python:

In [119]:
import requests
base_url = "https://api.github.com"
endpoint = "/search/repositories"
url = base_url + endpoint

params = {'q':'python', 
          'sort':'stars',
          'order':'desc'}    # Dict style

# OR
params = [('q', 'python'), ('sort', 'stars'), ('order', 'desc')]    # List of tuples style

response = requests.get(url=url, params=params)
response.status_code

200

In [120]:
response = response.json()     # Converts to a Python object
type(response)

dict

In [121]:
response.keys()

dict_keys(['total_count', 'incomplete_results', 'items'])

In [122]:
print(response['total_count'])          # Total number of items(repositories)
print(response['incomplete_results'])   # Indicates if incomplete results are present.

4677751
False


In [123]:
repos = response['items']     # Save the list of items as repos
print(type(repos))      # It is a list of dicts; each dict is a repository
len(repos)       # Number of repositories returned

<class 'list'>


30

In [124]:
repos[0]   

{'id': 83222441,
 'node_id': 'MDEwOlJlcG9zaXRvcnk4MzIyMjQ0MQ==',
 'name': 'system-design-primer',
 'full_name': 'donnemartin/system-design-primer',
 'private': False,
 'owner': {'login': 'donnemartin',
  'id': 5458997,
  'node_id': 'MDQ6VXNlcjU0NTg5OTc=',
  'avatar_url': 'https://avatars.githubusercontent.com/u/5458997?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/donnemartin',
  'html_url': 'https://github.com/donnemartin',
  'followers_url': 'https://api.github.com/users/donnemartin/followers',
  'following_url': 'https://api.github.com/users/donnemartin/following{/other_user}',
  'gists_url': 'https://api.github.com/users/donnemartin/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/donnemartin/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/donnemartin/subscriptions',
  'organizations_url': 'https://api.github.com/users/donnemartin/orgs',
  'repos_url': 'https://api.github.com/users/donnemartin/repos',
  'events_url':

In [125]:
print(f"Repository Id:{repos[0]['id']}" )        # repo id
print(f"Owner:{repos[0]['owner']['login']}")     # Owner
print(f"Owner Id:{repos[0]['owner']['id']}")     # Owner id
print(f"Stars:{repos[0]['stargazers_count']}" )  # number of stars given.
print(f"No of forks:{repos[0]['forks_count']}")  # Number of copies (forks) made by other users (signifies popularity or usefulness)
print(f"Repo Size(Kb):{repos[0]['size']}")       # size of the default branch of the repo

Repository Id:83222441
Owner:donnemartin
Owner Id:5458997
Stars:289770
No of forks:48209
Repo Size(Kb):11220


In [126]:
stars = [repo['stargazers_count'] for repo in repos]
forks = [repo['forks_count'] for repo in repos]
sizes = [repo['size'] for repo in repos]

In [127]:
print(stars[:5])
print(forks[:5])
print(sizes[:5])

[289770, 234697, 218342, 197491, 188140]
[48209, 25319, 28465, 46286, 74537]
[11220, 6769, 438, 15205, 1137883]
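
With the numbers in plain lists, simple summaries are only a line or two away. The sketch below reuses the first five star and fork counts printed above (copied by hand, so they reflect that particular run) and computes a few illustrative figures; the choice of metrics is ours, not something the API provides:

```python
from statistics import mean

# First five values copied from the output above
stars = [289770, 234697, 218342, 197491, 188140]
forks = [48209, 25319, 28465, 46286, 74537]

# Average star count across the five repositories
average_stars = mean(stars)

# Position of the repository with the most forks
most_forked_index = max(range(len(forks)), key=lambda i: forks[i])

# Forks per star, a rough measure of how often starred repos get copied
forks_per_star = [round(f / s, 3) for f, s in zip(forks, stars)]

print(average_stars)
print(most_forked_index)
print(forks_per_star)
```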


The `kwargs` parameter in the `get()` method is used to pass optional arguments that modify the behavior of the request. These arguments are passed as keyword arguments, and can include:

* `headers`: a dictionary of HTTP headers to be sent with the request.
* `cookies`: a dictionary or `CookieJar` of cookies to be sent with the request.
* `auth`: a tuple of (username, password) to enable Basic HTTP authentication.
* `timeout`: the timeout in seconds for the request.
* `allow_redirects`: a boolean value indicating whether or not to follow redirects.
* `verify`: a boolean value indicating whether or not to verify SSL certificates for HTTPS requests.

In [128]:
import requests
base_url = "https://api.github.com"
endpoint = "/search/repositories"
url = base_url + endpoint

params = {'q':'python', 
          'sort':'stars',
          'order':'desc'}   

headers = {'User-Agent':'Mozilla/5.0',
           'Accept':'application/vnd.github.v3+json'}

response = requests.get(url=url, params=params, timeout=15,
                        allow_redirects=True, headers=headers )
response.status_code

200

In the above cell, we modify our previous code to have a [timeout](https://realpython.com/python-requests/#timeouts) of 15 seconds and to permit redirection when needed (`allow_redirects=True`). The last keyword argument is the `headers` parameter. HTTP headers are a set of key-value pairs that provide additional information about an HTTP request or response. They are used to communicate [metadata](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers#metadata_headers) about the message being sent, such as the content type, authentication credentials, caching directives, and more.

The ['User-Agent'](https://www.seobility.net/en/wiki/User_Agent) header identifies the client software making the request (typically the browser and operating system). Servers can use it to tailor their responses to that software. Here, `'Mozilla/5.0'` indicates that the client is Mozilla-compatible. We've set the `'Accept'` header to `'application/vnd.github.v3+json'` (the default), which tells the GitHub API to return the response in JSON format using version 3 of the API. Other values of `Accept` for the GitHub API include:

* `application/vnd.github+json`: This is the JSON media type that GitHub's REST API documentation currently recommends.
* `text/plain`: This returns the response data as plain text.
* `application/vnd.github.VERSION.raw`: This returns the raw content of a file, without any metadata or formatting.

All this information can be overwhelming for beginners, but the good news is that you rarely need to specify all these keyword arguments, because their defaults are usually sufficient. You might, however, want to always include a `timeout` so that your request won't hang indefinitely when the server is busy. In summary, the `url` and `params` arguments are what you will most likely need to make a request.

### `head()`
Just as a request has headers, a response also has headers. `requests.head()` is a method in the Requests library in Python that sends an HTTP HEAD request to a URL and returns the response headers. It is similar to the `requests.get()` method, but it only retrieves the headers of the response, not the body.

In [129]:
import requests

base_url = "https://api.github.com/"
endpoint = "search/repositories"
url = base_url + endpoint

params = {'q':'python', 
          'sort':'stars',
          'order':'desc'}   

headers = {'User-Agent':'Mozilla/5.0',
           'Accept':'application/vnd.github.v3+json'}

response = requests.head(url=url, params=params, timeout=15,
                        allow_redirects=True, headers=headers )


print(response.status_code)  # prints the status code of the response
print(response.headers)      # prints the headers of the response


200
{'Date': 'Fri, 21 Feb 2025 09:50:29 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Cache-Control': 'no-cache', 'Vary': 'Accept,Accept-Encoding, Accept, X-Requested-With', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Link': '<https://api.github.com/search/repositories?q=python&sort=stars&order=desc&page=2>; rel="next", <https://api.github.com/search/repositories?q=python&sort=stars&order=desc&page=34>; rel="last"', 'x-github-api-version-selected': '2022-11-28', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection'

In this example, `requests.head()` sends an HTTP HEAD request to the URL https://api.github.com/search/repositories?q=python&sort=stars&order=desc, and the response is stored in the response variable. The status code of the response is printed using `response.status_code`, and the headers of the response are printed using `response.headers`.
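
One practical use of these headers is pagination: the `Link` header in the output above advertises the URLs of the next and last pages. The sketch below parses that header value (copied from the output) with a small regular expression. The helper name `parse_link_header` is our own; `requests` also exposes a parsed form of this header as `response.links`.

```python
import re

# "Link" header value copied from the HEAD response above
link_header = (
    '<https://api.github.com/search/repositories?q=python&sort=stars&order=desc&page=2>; rel="next", '
    '<https://api.github.com/search/repositories?q=python&sort=stars&order=desc&page=34>; rel="last"'
)

def parse_link_header(value):
    """Return a dict mapping each rel name (e.g. 'next') to its URL."""
    pattern = re.compile(r'<([^>]+)>;\s*rel="([^"]+)"')
    return {rel: url for url, rel in pattern.findall(value)}

links = parse_link_header(link_header)
print(links["next"])
print(links["last"])
```

A HEAD request is therefore a cheap way to learn how many pages a search has before downloading any of them.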

### `Session`
A _session_ is a way to store information between different HTTP requests. It allows a server to track the actions of a single user across multiple requests. When a client makes a request to a server, the server can create a session for that client and return a session ID to the client. The client can then include the session ID in subsequent requests to the server, allowing the server to identify the client and retrieve any stored session data.

Sessions are commonly used to store user authentication information, such as login credentials or session tokens, as well as user-specific preferences and settings. They can also be used to track user activity and perform server-side processing of data.

The requests module has a `Session` object that can be used to send multiple requests to a server and retrieve different sets of data:

In [130]:
import requests

base_url = 'https://jsonplaceholder.typicode.com'

# Create a Session object
session = requests.Session()

# Send a GET request to the posts endpoint
response_posts = session.get(f"{base_url}/posts")
print(response_posts.status_code)

# Send a GET request to the albums endpoint
response_albums = session.get(f"{base_url}/albums")
print(response_albums.status_code)

200
200


In the example above, we first create a `Session` object using `requests.Session()`. We then use the same Session object to send two GET requests to the JSONPlaceholder API, one to the `'/posts'` endpoint and one to the `'/albums'` endpoint.

Using a Session object in this way allows us to persist certain parameters, such as cookies or headers, across multiple requests. This can be useful when we need to send multiple requests to the same server and want to reuse the same session data across all requests. It can also help improve performance by reusing the same [TCP](https://www.techtarget.com/searchnetworking/definition/TCP) connection for multiple requests.
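
The sketch below illustrates this persistence without sending anything over the network. It sets a header once on the session and then only *prepares* two requests with `Session.prepare_request()`, so we can inspect what would be sent. The `Accept` value and the `userId` parameter are arbitrary choices made for the illustration.

```python
import requests

base_url = 'https://jsonplaceholder.typicode.com'

session = requests.Session()

# Set once on the session; applied to every request made through it
session.headers.update({'Accept': 'application/json'})

# Prepare (but do not send) two requests through the same session
prepared_posts = session.prepare_request(
    requests.Request('GET', f"{base_url}/posts"))
prepared_albums = session.prepare_request(
    requests.Request('GET', f"{base_url}/albums", params={'userId': 1}))

print(prepared_posts.headers['Accept'])
print(prepared_albums.headers['Accept'])
print(prepared_albums.url)
```

Both prepared requests carry the session-level header, even though it was never passed to either request directly.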

In [131]:
response_posts = response_posts.json()
response_albums = response_albums.json() 

# print object type
print(type(response_posts))
print(type(response_albums))

<class 'list'>
<class 'list'>


In [132]:
print(response_posts[0])
print(response_albums[0])

{'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}
{'userId': 1, 'id': 1, 'title': 'quidem molestiae enim'}


### `exceptions`
It is not uncommon to encounter errors such as HTTP errors and connection errors when making requests, and it is advisable to write code that can handle them. The `exceptions` module of `requests` provides a number of built-in exceptions that can be used to handle common error conditions that may arise during HTTP requests, such as network errors, timeouts, and invalid URLs.

Common HTTP error status codes include:
* `400 Bad Request`: Indicates that the server could not understand the request due to invalid syntax.
* `401 Unauthorized`: Indicates that the client must authenticate itself to get the requested response.
* `403 Forbidden`: Indicates that the server understands the request, but refuses to authorize it.
* `404 Not Found`: Indicates that the server could not retrieve the particular page requested. It may have been moved, deleted, or never existed.
* `500 Internal Server Error`: Indicates that the server encountered an unexpected condition that prevented it from fulfilling the request.
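
The codes listed above fall into two families: 4xx codes point to a problem with the request (client errors) and 5xx codes to a problem on the server. As a small offline sketch, the standard library's `http.HTTPStatus` can look up the official phrase for a code; the helper `describe_status` is a name made up for this example:

```python
from http import HTTPStatus

def describe_status(code):
    """Label a status code as a client error, server error, or neither."""
    status = HTTPStatus(code)
    if 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "not an error"
    return f"{code} {status.phrase} ({kind})"

for code in (200, 400, 401, 403, 404, 500):
    print(describe_status(code))
```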

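A URL or connection error occurs when the URL specified in a request is invalid or cannot be reached. This can happen if the URL is misspelled or if the server hosting the resource is down or has changed its address. (`URLError` is the name used by the standard library's `urllib`; in `requests`, these cases surface as exceptions such as `ConnectionError`, `InvalidURL`, or `MissingSchema`.)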

In [133]:
import requests

try:
    response = requests.get('https://jsonplaceholder.typicode.com/posts/101') 
    response.raise_for_status() # raise an exception if the status code signals an error (4xx or 5xx)
except requests.exceptions.HTTPError as err:
    print(f"HTTP error occurred: {err}")
else:
    response=response.json()

HTTP error occurred: 404 Client Error: Not Found for url: https://jsonplaceholder.typicode.com/posts/101


In the above example, we tried to retrieve the 101st post, which is not available because there are only 100 posts on the JSONPlaceholder server. The `raise_for_status()` method of the `Response` object returned by `get()` raises an `HTTPError` if the status code signals a client or server error (4xx or 5xx). This ensures that we handle any HTTP errors that may occur during the request. The `except` block handles what should be done in case of an HTTPError -- in this case, we are only printing out the error message. If the request was successful (no error encountered), the response body is converted to a Python object using `response.json()`. Multiple `except` blocks can be used to handle different errors that may occur during a request:

In [134]:
import requests

try:
    response = requests.get('https://jsonplaceholder.typicode.com/posts/1')
    response.raise_for_status()  # raise an exception if the status code signals an error (4xx or 5xx)
except requests.exceptions.HTTPError as err:
    print(f"HTTP error occurred: {err}")
except requests.exceptions.Timeout as err:
    print(f"Timeout occurred: {err}")
except requests.exceptions.RequestException as err:   # handles any kind of error
    print(f"An error occurred: {err}")
else:
    response=response.json()

The `requests.exceptions.RequestException` is a base class for all the exceptions that Requests library might raise. It is a subclass of Python's built-in Exception class, and it is used to catch any exception that occurs when making a request, regardless of whether it is a network-related error, a server-related error, or a client-related error. **This makes it easier to handle all errors that can occur when using the library in a single try-except block**.
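
To check this behaviour without touching the network, the sketch below builds bare `Response` objects by hand and sets only their status code. These hand-made objects are stand-ins for real server replies (an assumption made purely for illustration), and `check` is a helper name of our own:

```python
import requests

def check(status_code):
    """Build a bare Response (no network) and report what raise_for_status() does."""
    fake = requests.Response()        # hand-made stand-in, not a real server reply
    fake.status_code = status_code
    try:
        fake.raise_for_status()
    except requests.exceptions.RequestException as err:
        # Catching the base class also catches HTTPError
        return type(err).__name__
    return "no exception"

for code in (200, 201, 404, 500):
    print(code, check(code))

# The more specific exceptions all derive from RequestException
print(issubclass(requests.exceptions.Timeout, requests.exceptions.RequestException))
```

Because the specific classes inherit from `RequestException`, the most specific `except` clauses should come first and the base class last, as in the cell above.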

In [135]:
response

{'userId': 1,
 'id': 1,
 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}

### Collecting Weather Data

In this section we will use our knowledge so far to collect data from OpenWeather. OpenWeather is a web service that provides current weather data, weather forecasts, and historical weather data for various locations around the world. It has an API that allows developers to access its weather data and use it in their own applications. The API provides various endpoints for different weather-related data, including current weather data, 5-day weather forecasts, historical weather data, and more. Some endpoints are free and some are paid for to get access. To use the OpenWeather API, you need to sign up for an API key, which is used to authenticate your requests to the API.

To sign up and get an API key on OpenWeather, you can follow these steps:

1. Go to the OpenWeather website at https://openweathermap.org/.
2. Click on the "Sign Up" button at the top right corner of the page.
3. Fill in the registration form with your details, including your email address and password, or sign up directly with your Google account. Then, click on the "Create Account" button.
4. After creating an account, you will receive a verification email. Follow the instructions in the email to verify your account.
5. Once your account is verified, log in to your OpenWeather account.
6. Click on the "API keys" on the drop-down menu of your profile tab on the top menu bar of your dashboard.
7. Click on the "Generate API Key" button.
8. Your API key will be generated and displayed on the screen. You can copy it and start using it to make requests to the OpenWeather API.

An image of steps 6-8 is shown below:

![image.png](attachment:image.png)


To view all the endpoints and associated docs, click on the ["API"](https://openweathermap.org/api) tab on the top menu bar of your dashboard. We're interested in the ["Current weather data"](https://openweathermap.org/current) so we will be using that endpoint. If you follow the link, you'll have something like this:

![image-2.png](attachment:image-2.png)


It is always good to read an API's documentation to know how to communicate with it effectively. If you read through, you will see the API call template, `https://api.openweathermap.org/data/2.5/weather?lat={lat}&lon={lon}&appid={API key}`, along with a detailed explanation of the required parameters and of the fields in the JSON response.

We need the latitude (lat) and longitude (lon) of the location whose weather data we want. The code below fetches a JSON file from GitHub that contains the coordinates of all countries:

In [136]:
import requests

url = "https://raw.githubusercontent.com/eesur/country-codes-lat-long/master/country-codes-lat-long-alpha3.json"
try:
    response_coord = requests.get(url=url, timeout=10)
    response_coord.raise_for_status() 
except requests.exceptions.RequestException as err:      # Handles any type of error
    print(f"An error occurred: {err}")
else:
    response_coord = response_coord.json()

An error occurred: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=10)


If direct access to the raw file on GitHub is unavailable (as in the timeout shown above), you can download it using this [link](https://drive.google.com/file/d/11j7N50FNib6Dy5Ep3T1SjozBo8ffxs24/view?usp=sharing) and save it in your current working directory (i.e. the folder where this notebook is located). In that case, uncomment the code in the cell immediately below.

In [137]:
# import json
# with open('countries_lat_long.json') as f:
#     response_coord = json.load(f)          

In [138]:
type(response_coord)

dict

In [139]:
response_coord.keys()

dict_keys(['ref_country_codes'])

In [140]:
type(response_coord['ref_country_codes'])

list

In [141]:
response_coord['ref_country_codes'][:2]     # Select the first two items of the list

[{'country': 'Albania',
  'alpha2': 'AL',
  'alpha3': 'ALB',
  'numeric': 8,
  'latitude': 41,
  'longitude': 20},
 {'country': 'Algeria',
  'alpha2': 'DZ',
  'alpha3': 'DZA',
  'numeric': 12,
  'latitude': 28,
  'longitude': 3}]

In [142]:
country_coord = response_coord['ref_country_codes']
country_coord[:2]

[{'country': 'Albania',
  'alpha2': 'AL',
  'alpha3': 'ALB',
  'numeric': 8,
  'latitude': 41,
  'longitude': 20},
 {'country': 'Algeria',
  'alpha2': 'DZ',
  'alpha3': 'DZA',
  'numeric': 12,
  'latitude': 28,
  'longitude': 3}]

In [143]:
nigeria = country_coord[158] 
nigeria

{'country': 'Nigeria',
 'alpha2': 'NG',
 'alpha3': 'NGA',
 'numeric': 566,
 'latitude': 10,
 'longitude': 8}
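
Hard-coding the position `158` only works as long as the file keeps the same order. As a hedged alternative, the sketch below looks a country up by name. The helper `find_country` and the three-record `sample_coord` list are illustrative assumptions (the records are copied from the outputs above); in the notebook you would pass the full `country_coord` list instead.

```python
# Sample records copied from the outputs above (assumption: the real list
# has the same keys for every country).
sample_coord = [
    {'country': 'Albania', 'alpha2': 'AL', 'alpha3': 'ALB',
     'numeric': 8, 'latitude': 41, 'longitude': 20},
    {'country': 'Algeria', 'alpha2': 'DZ', 'alpha3': 'DZA',
     'numeric': 12, 'latitude': 28, 'longitude': 3},
    {'country': 'Nigeria', 'alpha2': 'NG', 'alpha3': 'NGA',
     'numeric': 566, 'latitude': 10, 'longitude': 8},
]


def find_country(records, name):
    """Return the first record whose 'country' equals name (case-insensitive), else None."""
    wanted = name.lower()
    return next((rec for rec in records if rec['country'].lower() == wanted), None)


nigeria_rec = find_country(sample_coord, 'nigeria')
print(nigeria_rec['latitude'], nigeria_rec['longitude'])   # 10 8
print(find_country(sample_coord, 'Atlantis'))              # None
```

Returning `None` for an unknown name lets the caller decide how to handle a missing country instead of raising an `IndexError`.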

In [144]:
nigeria_lat = nigeria['latitude']
nigeria_lon = nigeria['longitude']
print(nigeria_lat)
print(nigeria_lon)

10
8


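Now that we have the coordinates, let's get the weather data. Note that the cell below looks up Lagos by name through the `q` parameter; the coordinate-based call with `lat` and `lon` is used further down, when we loop over all countries: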

In [145]:
import requests

base_url = "https://api.openweathermap.org/data/2.5"
endpoint = "/weather"
api_key = 'e9c6e72f2ae584b2a98bd0471b506f89'
params = {'q':'Lagos',
          'appid':api_key}

try:
    response_weather = requests.get(url=base_url+endpoint, params=params, timeout=10)
    response_weather.raise_for_status() 
except requests.exceptions.RequestException as err:      # Handles any type of error
    print(f"An error occurred: {err}")
else:
    response_weather = response_weather.json()

In [146]:
response_weather

{'coord': {'lon': 3.75, 'lat': 6.5833},
 'weather': [{'id': 804,
   'main': 'Clouds',
   'description': 'overcast clouds',
   'icon': '04d'}],
 'base': 'stations',
 'main': {'temp': 305.66,
  'feels_like': 310.8,
  'temp_min': 305.66,
  'temp_max': 305.66,
  'pressure': 1011,
  'humidity': 58,
  'sea_level': 1011,
  'grnd_level': 1011},
 'visibility': 10000,
 'wind': {'speed': 3.11, 'deg': 228, 'gust': 2.47},
 'clouds': {'all': 91},
 'dt': 1740131446,
 'sys': {'country': 'NG', 'sunrise': 1740117617, 'sunset': 1740160637},
 'timezone': 3600,
 'id': 2332453,
 'name': 'Lagos',
 'cod': 200}

In [147]:
print(f"Country_code:{response_weather['sys']['country']}")
print(f"Country:{response_weather['name']}")
print(f"Time_unix:{response_weather['dt']}")
print(f"Timezone:{response_weather['timezone']}")  # Offset from UTC
print(f"Coordinates:{response_weather['coord']}")
print(f"Weather:{response_weather['weather'][0]['main']}")    # It is a list of dicts, so it must be indexed first
print(f"Weather description:{response_weather['weather'][0]['description']}")   
print(f"Temperature(K):{response_weather['main']['temp']}")
print(f"Humidity(%):{response_weather['main']['humidity']}")
print(f"Wind_speed(m/s):{response_weather['wind']['speed']}") 
print(f"Atm_pressure(hPa):{response_weather['main']['pressure']}") 

Country_code:NG
Country:Lagos
Time_unix:1740131446
Timezone:3600
Coordinates:{'lon': 3.75, 'lat': 6.5833}
Weather:Clouds
Weather description:overcast clouds
Temperature(K):305.66
Humidity(%):58
Wind_speed(m/s):3.11
Atm_pressure(hPa):1011


You can convert the Unix time to a datetime object in UTC or in the current system (local) time using `datetime.utcfromtimestamp()` or `datetime.fromtimestamp()` respectively:

In [148]:
from datetime import datetime
unix_time = response_weather['dt']
print(f"unix time:{unix_time}")

# Convert to UTC time
utc_time = datetime.utcfromtimestamp(unix_time)  
print(f"UTC time:{utc_time}")

# Convert directly to local(sys) time
local_time = datetime.fromtimestamp(unix_time)
print(f"Local time {local_time}")

unix time:1740131446
UTC time:2025-02-21 09:50:46
Local time 2025-02-21 10:50:46
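
`datetime.utcfromtimestamp()` returns a naive datetime and is deprecated as of Python 3.12. A minimal sketch of a timezone-aware alternative is given below. It hard-codes the `dt` and `timezone` values from the Lagos response above so that it runs without a network call, and it assumes the `timezone` field is an offset from UTC in seconds, as the earlier comment states.

```python
from datetime import datetime, timedelta, timezone

unix_time = 1740131446      # 'dt' value from the response above
utc_offset = 3600           # 'timezone' value from the response above (seconds)

# Timezone-aware UTC datetime
utc_time = datetime.fromtimestamp(unix_time, tz=timezone.utc)
print(utc_time)             # 2025-02-21 09:50:46+00:00

# Time at the queried location, built from the offset in the response
location_tz = timezone(timedelta(seconds=utc_offset))
location_time = datetime.fromtimestamp(unix_time, tz=location_tz)
print(location_time)        # 2025-02-21 10:50:46+01:00
```

Unlike `datetime.fromtimestamp(unix_time)` without `tz`, this result does not depend on the time zone of the machine running the notebook.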


#### NB:
This API call template will also work:

    "https://api.openweathermap.org/data/2.5/weather?q={Country/City}&appid={API key}"

An example:

    "https://api.openweathermap.org/data/2.5/weather?q=Lagos&appid=e9c6e72f2ae584b2a98bd0471b506f89" 

The `params` dict will look like this:

    params = {'q':'Lagos', 'appid':api_key}
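
To see how a `params` dict relates to the URL templates, the sketch below builds both query strings with the standard library's `urlencode`. This is only an illustration: `requests` assembles a comparable query string from `params` internally, and `YOUR_API_KEY` is a placeholder rather than a real key.

```python
from urllib.parse import urlencode

base_url = "https://api.openweathermap.org/data/2.5/weather"
api_key = "YOUR_API_KEY"          # placeholder, not a real key

# Query by city name
params_city = {'q': 'Lagos', 'appid': api_key}
# Query by coordinates (Nigeria's values from the coordinates file)
params_coord = {'lat': 10, 'lon': 8, 'appid': api_key}

url_city = f"{base_url}?{urlencode(params_city)}"
url_coord = f"{base_url}?{urlencode(params_coord)}"

print(url_city)    # ...weather?q=Lagos&appid=YOUR_API_KEY
print(url_coord)   # ...weather?lat=10&lon=8&appid=YOUR_API_KEY
```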


Now let's combine everything we have so far and write a program that collects weather data for all countries using their coordinates.

In [149]:
import requests
all_countries = {}

# Create a Session to submit multiple requests, one for each country's weather data
session = requests.Session()

# Iterate through each country's coordinate dict
for country in country_coord[150:]:    # Only the countries from position 150 to the end
    
    # Extract the name and coordinates for the country
    country_name = country['country']
    lat = country['latitude']
    lon = country['longitude']
    
    # Define the API url and the parameters
    url = "https://api.openweathermap.org/data/2.5/weather"
    api_key = 'e9c6e72f2ae584b2a98bd0471b506f89'
    params = {'lat':lat,
              'lon':lon,
              'appid':api_key}
    
    # create an empty dict for the country
    country_dict = {}
    
    # Make a request for the country weather data
    try:
        response_weather = session.get(url=url, params=params, timeout=10)
        response_weather.raise_for_status() 
    except requests.exceptions.RequestException as err:      # Skip the country if the request fails
        pass
    else:    
        response_weather = response_weather.json()
        
        # Extract only the needed data into the country's dict
        country_dict['country'] = country_name
#         country_dict['country_code'] = response_weather['sys']['country']
        country_dict['exact_loc'] = response_weather['name']
        country_dict['time_unix'] = response_weather['dt']
        country_dict['timezone'] = response_weather['timezone']  
        country_dict['coord'] = response_weather['coord']
        country_dict['weather'] = response_weather['weather'][0]['main'] 
        country_dict['temp(K)'] = response_weather['main']['temp']  
        country_dict['humidity(%)'] = response_weather['main']['humidity']
        country_dict['wind_speed(m/s)'] = response_weather['wind']['speed']
        country_dict['atm_pressure(hPa)'] = response_weather['main']['pressure']
        
        # Extract rain information if available
        try:
            country_dict['rain(mm)'] = response_weather['rain']['1h']
        except KeyError:
            country_dict['rain(mm)'] = 'Not Available'   
        
        # Save the country dict in all_countries dict
        all_countries[country_name] = country_dict
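
The `try`/`except KeyError` used for the optional rain field can also be written with chained `dict.get()` calls. The sketch below uses two made-up responses (not real API output) and a hypothetical helper `rain_last_hour`; it assumes, as the loop above does, that rain data sits under `response['rain']['1h']` when present.

```python
# Two hypothetical responses: one with rain data, one without
with_rain = {'name': 'Somewhere', 'rain': {'1h': 0.25}}
without_rain = {'name': 'Elsewhere'}


def rain_last_hour(response, default='Not Available'):
    """Return response['rain']['1h'] if both keys exist, else the default."""
    return response.get('rain', {}).get('1h', default)


print(rain_last_hour(with_rain))      # 0.25
print(rain_last_hour(without_rain))   # Not Available
```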

In [150]:
all_countries['United Kingdom']

{'country': 'United Kingdom',
 'exact_loc': 'Embsay',
 'time_unix': 1740131493,
 'timezone': 0,
 'coord': {'lon': -2, 'lat': 54},
 'weather': 'Clouds',
 'temp(K)': 285.97,
 'humidity(%)': 79,
 'wind_speed(m/s)': 12.18,
 'atm_pressure(hPa)': 1000,
 'rain(mm)': 'Not Available'}

In [151]:
all_countries['Russian Federation']

{'country': 'Russian Federation',
 'exact_loc': 'Russia',
 'time_unix': 1740131482,
 'timezone': 25200,
 'coord': {'lon': 100, 'lat': 60},
 'weather': 'Clouds',
 'temp(K)': 261.68,
 'humidity(%)': 90,
 'wind_speed(m/s)': 3.26,
 'atm_pressure(hPa)': 1036,
 'rain(mm)': 'Not Available'}

In [152]:
all_countries['United States']

{'country': 'United States',
 'exact_loc': 'Peabody',
 'time_unix': 1740131493,
 'timezone': -21600,
 'coord': {'lon': -97, 'lat': 38},
 'weather': 'Clouds',
 'temp(K)': 254.46,
 'humidity(%)': 59,
 'wind_speed(m/s)': 1.43,
 'atm_pressure(hPa)': 1042,
 'rain(mm)': 'Not Available'}
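
The temperatures above are in kelvin. As a small illustration, the sketch below converts the three values from these outputs to degrees Celsius by subtracting 273.15. The helper name `kelvin_to_celsius` and the `temps_k` dict are assumptions made for this example; in the notebook you would read the values from `all_countries`.

```python
# 'temp(K)' values copied from the three outputs above
temps_k = {
    'United Kingdom': 285.97,
    'Russian Federation': 261.68,
    'United States': 254.46,
}


def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius, rounded to 2 decimals."""
    return round(kelvin - 273.15, 2)


temps_c = {country: kelvin_to_celsius(k) for country, k in temps_k.items()}
for country, celsius in temps_c.items():
    print(f"{country}: {celsius} °C")
```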

We will look at how to plot this data in the next lesson.

## Regular Expression with `re` Module

A regular expression, also known as regex or regexp, is a sequence of characters that define a search pattern. Regular expressions are used in many programming languages, including Python, to search, replace, and manipulate text.
Regular expressions consist of a combination of literal characters and metacharacters that define a pattern. Literal characters match themselves exactly, while metacharacters have special meanings and are used to match more complex patterns.

Regular expressions can be used for a wide variety of tasks, such as:

* Validating user input, such as email addresses or phone numbers
* Searching and replacing text in a document or file
* Extracting specific information from a text string, such as dates or names

The `re` module is a built-in module in Python that provides support for regex. The `re` module allows you to use regular expressions to search, replace, and manipulate text data in Python. Some of the commonly used functions/methods in the `re` module include:

* `re.match()`: Searches for a pattern at the **beginning of a string** and returns a match object if there is a match.
* `re.search()`: Searches a string for a match to a pattern **at any location in the string** and returns a match object if there is a match.
* `re.findall()`: Returns a list of all non-overlapping matches in a string for a given regular expression pattern.
* `re.sub()`: Replaces one or more occurrences of a pattern in a string with a specified string.
* `re.compile()`: Compiles a regular expression pattern into a regular expression object, which can be used for more efficient pattern matching.
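
As a quick orientation before the details, the sketch below runs each of the five functions listed above on one sample sentence. The sentence is made up for illustration.

```python
import re

text = "Order 12 shipped, order 7 pending"

# match: only looks at the beginning of the string (and is case-sensitive)
at_start = re.match(r'order', text)
print(at_start)                       # None

# search: finds the first occurrence anywhere in the string
first = re.search(r'order', text)
print(first.span())                   # (18, 23)

# findall: list of all non-overlapping matches
numbers = re.findall(r'\d+', text)
print(numbers)                        # ['12', '7']

# sub: replace every match with another string
masked = re.sub(r'\d+', '#', text)
print(masked)                         # Order # shipped, order # pending

# compile: build a reusable pattern object with the same methods
digits = re.compile(r'\d+')
print(digits.findall(text))           # ['12', '7']
```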

### Regular Expression Pattern
A regex pattern is a sequence of characters that defines a search pattern used to search for matches within a string. Regular expression pattern characters can be broadly categorized into three groups:

* Literal characters: These are ordinary characters that match themselves. For example, the regular expression "dog" matches the characters "dog" in a string.


* Metacharacters: These are special characters that have a special meaning in regular expressions. Metacharacters are used to match specific patterns of characters in a string. Examples include `*`, which matches zero or more occurrences of the preceding character or group.


* Special sequences: These are combinations of characters that have a predefined meaning and are used to match certain types of patterns in a string. They are constructed using a backslash (`\`) followed by a special character or a combination of characters. An example is `\d`, which matches any digit character (0-9).

Let's discuss these in detail with examples because they form the basis of regular expressions:

### Literal Characters

In [153]:
import re

# Define the pattern to search for
pattern = r'Python'

# Search for the pattern in the string
string = 'Python Programming'

match = re.match(pattern, string)
print(match)

<re.Match object; span=(0, 6), match='Python'>


The `match` function returns a match object if there is a match and `None` if there isn't. The `group()` method of a match object can be used to return the matched string:

In [154]:
match.group()

'Python'

In Python, the `r` in front of a string literal denotes a "raw string", which is commonly used for regex patterns. A raw string is a string literal that does not interpret backslashes (`\`) as escape characters. This means that any backslashes in the string are treated as literal backslashes, rather than as escape characters for special characters. An example is shown below:

In [155]:
print('look\nnow')

look
now


In [156]:
print(r'look\nnow')

look\nnow


In the code above, the `r` in front of the string strips `\n` of its special meaning and makes it two ordinary characters, a backslash and `n`. You do not need to use a raw string if your regular expression does not contain backslashes. However, it is good practice to use raw strings consistently in your code to avoid any unexpected behavior due to backslash escaping.

In [157]:
import re

# Define the pattern to search for
pattern = r'Python'

# Search for the pattern in the string
string = 'I love Python Programming'

match = re.match(pattern, string)
print(match)

None


In the above cell, no match was found because the string begins with "I", not "Python".

In [158]:
match = re.match(r'the', "The quick brown fox")
print(match)     # Returns None because 'the' is not 'The'

None
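
If case should not matter, the optional `flags` argument can be used. The sketch below repeats the previous example with `re.IGNORECASE`, a flag provided by the `re` module.

```python
import re

# Same pattern and string as above, but ignoring case
match = re.match(r'the', "The quick brown fox", flags=re.IGNORECASE)
print(match)           # <re.Match object; span=(0, 3), match='The'>
print(match.group())   # The
```

The matched text keeps its original capitalisation; only the comparison ignores case.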


### Metacharacters
The metacharacters are `. ^ $ * + ? { } [ ] \ | ( )`. They can be further divided into several subcategories:

* Anchors, e.g. `^`, `$`
* Quantifiers, e.g. `*`, `+`
* Character classes, e.g. `[a-z]`, `[0-9]`
* Alternation, e.g. `|`
* Grouping, e.g. `( )`
* Escaped characters, e.g. `\.` (matches a period), `\?` (matches a question mark)

#### Anchor metacharacters 
They are used in regular expressions to specify the position of a match within a string. Here are the most common anchor metacharacters:

* `^`: Matches the beginning of a string.
* `$`: Matches the end of a string.

In [159]:
pattern = r'^t'   # matches string or word beginning with 't'
match = re.match(pattern, "total touch")
print(match)   

<re.Match object; span=(0, 1), match='t'>


In [160]:
pattern = r'^t'   # matches string or word beginning with 't'
match = re.match(pattern, "royal touch")
print(match)   

None


In [161]:
fruits = ["apple", "banana", "apricot", "orange", "avocado"]
 
for fruit in fruits:
    match = re.match(r'^a', fruit)    # Select only fruits starting with a
    if match:
        print(fruit)

apple
apricot
avocado


The regular expression pattern `r'^a'` matches any string that starts with the letter 'a'. In this code, the `re.match()` function is used to match the pattern against each fruit in the fruits list. If the fruit name starts with the letter 'a', the `print()` function is used to print the name of the fruit.

In [162]:
fruits = ["apple", "banana", "apricot", "orange", "avocado"]
 
for fruit in fruits:
    match = re.match(r'^ap', fruit)    # Select only fruits starting with ap
    if match:
        print(fruit)

apple
apricot


Note that since `re.match()` already looks for matches only at the beginning of the string, the `^` anchor is not strictly needed in these examples.
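
With `re.search()`, which scans the whole string, the `^` anchor does make a difference. The sketch below reuses the fruits list to compare the two patterns.

```python
import re

fruits = ["apple", "banana", "apricot", "orange", "avocado"]

# Without the anchor, search() finds an 'a' anywhere in the word
contains_a = [fruit for fruit in fruits if re.search(r'a', fruit)]
print(contains_a)      # ['apple', 'banana', 'apricot', 'orange', 'avocado']

# With the anchor, only words that begin with 'a' are kept
starts_with_a = [fruit for fruit in fruits if re.search(r'^a', fruit)]
print(starts_with_a)   # ['apple', 'apricot', 'avocado']
```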

In [163]:
pattern = r'grant$'   # matches 'grant' only when the string ends right after the 't'
match = re.match(pattern, "grant")
print(match)   

<re.Match object; span=(0, 5), match='grant'>


In [164]:
pattern = r'grant$'   # matches 'grant' only when the string ends right after the 't'
match = re.match(pattern, "granted")
print(match)   

None


If we had omitted the `$` anchor, the pattern would have matched!

In [165]:
pattern = r'grant'   # without $, matches any string that begins with 'grant'
match = re.match(pattern, "granted")
print(match)   

<re.Match object; span=(0, 5), match='grant'>


#### Quantifiers
Quantifiers are a type of metacharacter in regular expressions that specify how many times a character or group of characters can occur in a string. The most common quantifiers include:

* `*`: Matches **zero or more** occurrences of the preceding character or group.
* `+`: Matches **one or more** occurrences of the preceding character or group.
* `?`: Matches **zero or one** occurrence of the preceding character or group.
* `{n}`: Matches **exactly n** occurrences of the preceding character or group.
* `{n,}`: Matches **n or more** occurrences of the preceding character or group.
* `{n,m}`: Matches between **n and m** (inclusive) occurrences of the preceding character or group.

In [166]:
words = ['ale', 'ape', 'aple', 'apple', 'appple', 'apply']

# Select only words starting with 'a', followed by zero or more occurrences of 'p', then 'le'
pattern = 'ap*le'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

ale
aple
apple
appple


In [167]:
words = ['ale', 'ape', 'aple', 'apple', 'appple', 'apply']

# Select only words starting with 'a' and at least one or more occurrence of 'p' but ends in 'le'
pattern = 'ap+le'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

aple
apple
appple


In [168]:
words = ['ale', 'ape', 'aple', 'apple', 'appple', 'apply']

# Select only words starting with 'a' and matches zero or one occurrence of 'p' but ends in 'le'
pattern = 'ap?le'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

ale
aple


In [169]:
words = ['ale', 'ape', 'aple', 'apple', 'appple', 'apply']

# Select only words starting with 'a', matches exactly 2 occurrences of 'p' but ends in 'le'
pattern = 'ap{2}le'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

apple


In [170]:
words = ['ale', 'ape', 'aple', 'apple', 'appple', 'apply']

# Select only words starting with 'a' and matches 2 or more occurrences of 'p' but ends in 'le'
pattern = 'ap{2,}le'

for word in words:
    match = re.match(pattern, word)   
    if match:
        print(word)

apple
appple


In [171]:
words = ['ale', 'ape', 'aple', 'apple', 'appple', 'apply']

# Select only words starting with 'a', matches between 1 and 3 (both inclusive) occurrences of 'p' but ends in 'le'
pattern = 'ap{1,3}le'

for word in words:
    match = re.match(pattern, word)  
    if match:
        print(word)

aple
apple
appple


#### Character classes

In regular expressions, character classes are used to **match a single character from a set of characters**. Here are some common character classes:

* `.`: Matches any single character except for a newline character (`\n`).
* `[abc]`: Matches any single character that is either "a", "b", or "c".
* `[a-z]`: Matches any single lowercase letter from "a" to "z".
* `[A-Z]`: Matches any single uppercase letter from "A" to "Z".
* `[0-9]`: Matches any single digit from 0 to 9.


You can also use negated character classes to match any character that is not in a specified set. For example:

* `[^abc]`: Matches any single character that is **not** "a", "b", or "c".
* `[^a-z]`: Matches any single character that is **not** a lowercase letter from "a" to "z".
* `[^A-Z]`: Matches any single character that is **not** an uppercase letter from "A" to "Z".
* `[^0-9]`: Matches any single character that is **not** a digit.

In [172]:
import re

words = ['toll', 'ball', 'tall', 'till', 'tail', 't ll',]

# Select only words starting with 't', followed by any character and ends with 'll'
pattern = 't.ll'

for word in words:
    match = re.match(pattern, word)  
    if match:
        print(word)

toll
tall
till
t ll


In [173]:
words = ['all', 'ball', 'call', 'tall', 'fall']

# Select only words starting with either "a", "b", or "c", followed by any character and ending with 'll'
pattern = '[abc].ll'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

ball
call


To include "all" in the previous result, you can add either `?` (zero or one occurrence) or `*` (zero or more occurrences) after the `.`:

In [174]:
words = ['all', 'ball', 'call', 'tall', 'fall']

# Select only words starting with either "a", "b", or "c", followed by zero or one character and ending with 'll'
pattern = '[abc].?ll'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

all
ball
call


In [175]:
words = ['cities', 'Manager', '1_actor', 'actor_1', 'Git-Hub']

# Select only words starting with a lowercase letter followed by zero or more occurrences of any character
pattern = '[a-z].*'

for word in words:
    match = re.match(pattern, word)   
    if match:
        print(word)

cities
actor_1


In [176]:
words = ['cities', 'Manager', '1_actor', 'actor_1', 'Git-Hub']

# Select only words starting with a lowercase or uppercase letter followed by zero or more occurrences of any character
pattern = '[a-zA-Z].*'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

cities
Manager
actor_1
Git-Hub


In [177]:
words = ['image_1.jpeg', 'image_2.png', 'image_cat.jpeg', 'image_3.jpeg', 'image_4.gif']

# Select only jpeg images whose name ends with a single digit
pattern = r'image_[0-9]\.jpeg'

for word in words:
    match = re.match(pattern, word)
    if match:
        print(word)

image_1.jpeg
image_3.jpeg


All the previous examples (under character classes) can be negated:

In [178]:
import re

words = ['all', 'ball', 'call', 'tall', 'tail', 'fall']

# Select words that do not start with "a", "b", or "c", followed by zero or one character and ending with 'll'
pattern = '[^abc].?ll'

for word in words:
    match = re.match(pattern, word)  
    if match:
        print(word)

tall
fall


In [179]:
words = ['cities', 'Manager', '1_actor', 'actor_1', 'Git-Hub']

# Select words that don't start with a lowercase or uppercase letter, followed by zero or more occurrences of any character
pattern = '[^a-zA-Z].*'

for word in words:
    match = re.match(pattern, word)   
    if match:
        print(word)

1_actor


#### Alternation
Alternation, denoted by the pipe character `|`, is a feature in regular expressions that allows you to match one pattern **or** another. For example, the regular expression `a|b` would match either "a" or "b" and `a|b|c` would match either "a" or "b" or "c" and so on. 

In [180]:
import re

addresses = ['example@gmail.com', 'my_mail@yahoo.ng', 'dataclax@inbox.ng', 'new_email@yahoo.com', 'example@outlook.com']

# Select words that starts with one or more characters followed by "@", 
# then followed by either 'gmail' or 'yahoo' followed by one or more characters
pattern = '.+@(gmail|yahoo).+'

for address in addresses:
    match = re.match(pattern, address)    
    if match:
        print(address)

example@gmail.com
my_mail@yahoo.ng
new_email@yahoo.com
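
One detail worth illustrating is how far `|` reaches. The sketch below, using made-up word lists, suggests that without parentheses the alternation splits the whole pattern, while parentheses limit it to the part inside them. `re.fullmatch()` is used in the first case so that the entire word has to match.

```python
import re

# Parentheses limit the alternation to a single letter: 'gray' or 'grey'
words = ['gray', 'grey', 'griy', 'gr']
grouped = [word for word in words if re.fullmatch(r'gr(a|e)y', word)]
print(grouped)     # ['gray', 'grey']

# Without parentheses the alternatives are 'gra' OR 'ey'
others = ['gray', 'grab', 'eye']
ungrouped = [word for word in others if re.match(r'gra|ey', word)]
print(ungrouped)   # ['gray', 'grab', 'eye']
```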


#### Grouping
Grouping is a feature in regular expressions that allows you to **group together sub-patterns and treat them as a single unit**. This can be useful for a variety of purposes, including capturing substrings, applying quantifiers to a group of characters, and creating more complex regular expressions:

In [181]:
import re

words = ['poach', 'teal', 'oasis', 'foil', 'goal']

# Select words that start with or contain 'oa'
pattern = '.*(oa)'

for word in words:
    match = re.match(pattern, word)   
    if match:
        print(word)

poach
oasis
goal


In [182]:
words = ['gol', 'goal', 'goooal', 'gooaaaal', 'gooaaaoooaaal', 'gaoool']

# Select words that sound like "goal" irrespective of the spelling
pattern = 'g(o+a+)*l'

for word in words:
    match = re.match(pattern, word)   
    if match:
        print(word)

goal
goooal
gooaaaal
gooaaaoooaaal


The regular expression pattern `g(o+a+)*l` matches any word that starts with the letter 'g', followed by zero or more repetitions of the group `o+a+`, and then the letter 'l'. Inside the group, `o+` and `a+` ensure that 'o' and 'a' each occur one or more times, in that order. This pattern matches all the words in the list that sound like "goal" irrespective of the spelling.
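
Groups can also capture substrings. The sketch below is an illustration that reuses some of the email addresses from the alternation example: it wraps three parts of the pattern in parentheses and reads them back with `group(n)` and `groups()`.

```python
import re

addresses = ['example@gmail.com', 'my_mail@yahoo.ng', 'dataclax@inbox.ng']

# Group 1: user name, group 2: provider, group 3: top-level domain
pattern = r'(.+)@(gmail|yahoo)\.(.+)'

captured = []
for address in addresses:
    match = re.match(pattern, address)
    if match:
        print(match.group(1), match.group(2), match.group(3))
        captured.append(match.groups())   # all groups as one tuple

print(captured)
# [('example', 'gmail', 'com'), ('my_mail', 'yahoo', 'ng')]
```

`group(0)`, or `group()` with no argument, still returns the whole match.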

#### Escaped Characters
To match the metacharacters we have seen so far literally, you need to *escape* them using the backslash character `\`.
Here are some examples of how to escape metacharacters in regular expressions:

* To match a period (`.`), use `\.`.
* To match a literal question mark (`?`), use `\?`.
* To match a literal backslash (`\`), use `\\`.
* To match a literal opening square bracket (`[`), use `\[`.
* To match a literal vertical bar (`|`), use `\|`.

By escaping a metacharacter, you are telling the regular expression engine to treat it as a literal character instead of a character with a special meaning. In one of the previous examples, we selected only emails with a gmail or yahoo domain name. But that pattern also accepts an invalid email address with an incorrect *top-level domain name* (valid ones include `.com`, `.co`, `.uk`) like `not_valid@gmail.adress`, or even worse, one that omits the top-level domain name entirely (leaving whitespace at the end), i.e. `not_valid@yahoo `. This scenario is shown below:

In [183]:
addresses = ['example@gmail.com', 'my_mail@yahoo.ng', 'dataclax@inbox.ng', 'not_valid@gmail.adress', 'not_valid@yahoo ']

# Select words that starts with one or more characters followed by "@", 
# then followed by either 'gmail' or 'yahoo' followed by one or more characters
pattern = '.+@(gmail|yahoo).+'

for address in addresses:
    match = re.match(pattern, address)    
    if match:
        print(address)

example@gmail.com
my_mail@yahoo.ng
not_valid@gmail.adress
not_valid@yahoo 


We can improve the code by using the pattern below:

In [184]:
addresses = ['example@gmail.com', 'my_mail@yahoo.ng', 'dataclax@inbox.ng', 'not_valid@gmail.adress', 'not_valid@yahoo ']

pattern = r'.+@(gmail|yahoo)\..{2,3}$'

for address in addresses:
    match = re.match(pattern, address)    
    if match:
        print(address)

example@gmail.com
my_mail@yahoo.ng


The code uses regular expressions to select email addresses from a list that have either gmail or yahoo domain name and have a valid top-level domain of two or three characters.

Here is how the regular expression works:

* `.+@`: matches any sequence of one or more characters that ends with an at sign (`@`), indicating the start of an email address.
* `(gmail|yahoo)`: matches either the string "gmail" or "yahoo".
* `\.`: matches a literal period (`.`) character.
* `.{2,3}`: matches any two or three characters, which in this case represent the top-level domain of the email address.
* `$`: matches the end of the string.
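
To make the effect of escaping concrete, the sketch below compares an unescaped and an escaped period on two made-up strings, and also shows `re.escape()`, which escapes the special characters in a string for you.

```python
import re

candidates = ['user@gmail.com', 'user@gmailxcom']

# Unescaped: '.' matches any character, so 'gmailxcom' slips through
unescaped = [c for c in candidates if re.fullmatch(r'.+@gmail.com', c)]
print(unescaped)   # ['user@gmail.com', 'user@gmailxcom']

# Escaped: '\.' matches only a literal period
escaped = [c for c in candidates if re.fullmatch(r'.+@gmail\.com', c)]
print(escaped)     # ['user@gmail.com']

# re.escape() adds the backslashes automatically
literal = re.escape('gmail.com')
print(literal)     # gmail\.com
```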

### Special sequences

Special sequences are **predefined patterns** in regular expressions that match certain types of characters or patterns. Some common special sequences include:

* `\d`: Matches any digit character. Equivalent to `[0-9]` for ASCII text.
* `\D`: Matches any non-digit character. Equivalent to `[^0-9]` for ASCII text.
* `\w`: Matches any alphanumeric character or underscore. Equivalent to `[a-zA-Z0-9_]` for ASCII text.
* `\W`: Matches any character that is neither alphanumeric nor an underscore. Equivalent to `[^a-zA-Z0-9_]` for ASCII text.
* `\s`: Matches any whitespace character, including spaces, tabs, and newlines.
* `\S`: Matches any non-whitespace character.
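
A quick way to get a feel for these sequences is to run them through `re.findall()` and `re.split()`. The sketch below uses a made-up sentence containing an underscore, digits, spaces and a tab.

```python
import re

text = "Room_12 opens at 9 am\ton level 3"   # \t is a tab character

digits = re.findall(r'\d+', text)      # runs of digits
print(digits)                          # ['12', '9', '3']

tokens = re.findall(r'\w+', text)      # runs of letters, digits and underscores
print(tokens)                          # ['Room_12', 'opens', 'at', '9', 'am', 'on', 'level', '3']

pieces = re.split(r'\s+', text)        # split on any run of whitespace
print(pieces)                          # same list as tokens for this sentence

letters = re.findall(r'\D+', 'a1b22c') # runs of non-digit characters
print(letters)                         # ['a', 'b', 'c']
```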

In [185]:
words = ['image_1.jpeg', 'image_2.png', 'image_3.jpeg', 'image_4.gif', 'image_?.jpeg', 'image_cat.jpeg']

# Select only jpeg images ending with a digit
pattern = r'image_\d\.jpeg$'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

image_1.jpeg
image_3.jpeg


In [186]:
words = ['image_1.jpeg', 'image_2.png', 'image_3.jpeg', 'image_4.gif', 'image_?.jpeg', 'image_cat.jpeg']

# Select jpeg images ending with a non-digit character
pattern = r'image_\D\.jpeg$'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

image_?.jpeg


The regular expression `r'image_\D\.jpeg$'` matches a string that starts with the characters "image_", followed by exactly one non-digit character, and ends with the characters ".jpeg".

Here's a breakdown of the pattern:

* `r` at the beginning of the string indicates that it's a raw string and backslashes should be treated literally.
* `image_` matches the characters "image_" in the input string.
* `\D` matches any single non-digit character.
* `\.jpeg` matches the literal characters ".jpeg" in the input string; the backslash makes the period match only a period.
* `$` anchors the match at the end of the string.

To match **any number of occurrences** of non-digit characters, we can add the `*` quantifier after `\D`. This way we also include `image_cat.jpeg`:

In [187]:
words = ['image_1.jpeg', 'image_2.png', 'image_3.jpeg', 'image_4.gif', 'image_?.jpeg', 'image_cat.jpeg']

# Select jpeg images whose name ends with zero or more non-digit characters
pattern = r'image_\D*\.jpeg$'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

image_?.jpeg
image_cat.jpeg


In [188]:
words = ['image_a.jpeg', 'image_b.jpeg', 'image_3.jpeg', 'image_4.gif', 'image_?.jpeg', 'image_cat.jpeg']

# Select jpeg images ending with a digit, alphabet or underscore character
pattern = r'image_\w\.jpeg'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

image_a.jpeg
image_b.jpeg
image_3.jpeg


In [189]:
words = ['image_a.jpeg', 'image_b.jpeg', 'image_3.jpeg', 'image_4.gif', 'image_?.jpeg', 'image_cat.jpeg']

# Select jpeg files whose name consists only of letters, digits or underscores
pattern = r'\w+\.jpeg$'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

image_a.jpeg
image_b.jpeg
image_3.jpeg
image_cat.jpeg


In [190]:
fruits = ["apple", "banana", "apricot", "orange", "avocado"]

# Select words made of one or more alphanumeric or underscore characters that end with "e"
pattern = r"\w+e$"
    
for fruit in fruits:
    match = re.match(pattern, fruit)   
    if match:
        print(fruit)

apple
orange


In [191]:
words = ['my_love', 'mi amor', 'mon_amour', 'meu  amor', 'il mio amore']

# Select words separated by a single space
pattern = r'\w+\s\w+'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

mi amor
il mio amore


The above pattern matches one or more word characters (alphanumerics and/or underscores), followed by a single whitespace character, followed by one or more word characters. It didn't select `'meu  amor'` because, if you look at it carefully, it contains two consecutive spaces. To include words separated by multiple whitespace characters, we add `+` after `\s`:

In [192]:
words = ['my_love', 'mi amor', 'mon_amour', 'meu  amor', 'il mio amore']

pattern = r'\w+\s+\w+'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

mi amor
meu  amor
il mio amore


The code above matches words separated by whitespace, whether there are two, three or more words. This pattern will not work if we need exactly two words separated by space(s). The problem stems from the fact that `re.match()` only checks whether the beginning of a string matches the pattern; as long as the beginning matches, the string is returned irrespective of what follows:

In [193]:
words = ["Python", "Python Programming", "I love Python programming"]

pattern = r'Python'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

Python
Python Programming


To correct this, we can apply the `$` metacharacter to specify the end of a match.

In [194]:
words = ["Python", "Python Programming", "I love Python programming"]


pattern = r'Python$'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

Python


In [195]:
words = ['my_love', 'mi amor', 'mon-amour', 'meu  amor', 'il mio amore']

# Select two words separated by spaces
pattern = r'\w+\s+\w+$'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

mi amor
meu  amor


This highlights another use of `$`. It is very useful when we need to be strict with our pattern, to ensure that the selected words conform **totally** to the specified pattern and not just partially.
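As a complement to `$`, the `re` module also provides `re.fullmatch()`, which succeeds only when the whole string matches the pattern. The sketch below reuses the word list from the example above; the variable names `two_words`, `with_dollar`, and `with_fullmatch` are chosen only for illustration. It also shows one subtle difference: `$` still matches just before a newline at the very end of a string, while `fullmatch()` does not.

```python
import re

words = ['my_love', 'mi amor', 'mon-amour', 'meu  amor', 'il mio amore']

# fullmatch() requires the entire string to match, so no trailing `$` is needed
two_words = [word for word in words if re.fullmatch(r'\w+\s+\w+', word)]
print(two_words)

# `$` also matches just before a newline at the very end of the string
with_dollar = re.match(r'Python$', 'Python\n')
with_fullmatch = re.fullmatch(r'Python', 'Python\n')
print(with_dollar is not None, with_fullmatch is not None)
```

Which of the two to use is largely a matter of taste; `fullmatch()` simply makes the "whole string" intent explicit.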

If we want words separated by a non-whitespace character, we can use `\S` in place of the `\s` metacharacter. If you use the `r'\w+$'` pattern, you'll end up with only the words separated by an underscore (because `\w+` matches underscores too) but not those separated by a hyphen or other characters:

In [196]:
words = ['my_love', 'mi amor', 'mon-amour', 'meu amor', 'il mio amore']

pattern = r'\w+\S\w+$'

for word in words:
    match = re.match(pattern, word)    
    if match:
        print(word)

my_love
mon-amour
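
One caveat with `\S`: it matches any non-whitespace character, including letters, so `r'\w+\S\w+$'` also accepts a single word of three or more characters. The sketch below shows this, together with one possible stricter alternative that spells out the separator as "anything that is neither a letter nor whitespace". The stricter pattern and the extra test word `'amor'` are illustrative assumptions, not part of the original example.

```python
import re

words = ['my_love', 'mi amor', 'mon-amour', 'amor']

# \S happily matches the 'o' in 'amor', so a single word slips through
loose = [w for w in words if re.match(r'\w+\S\w+$', w)]

# Letters, then one character that is neither a letter nor whitespace, then letters
strict = [w for w in words if re.match(r'[A-Za-z]+[^A-Za-z\s][A-Za-z]+$', w)]

print(loose)
print(strict)
```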


Now that we have covered regex patterns and the various characters used to define them, let's look in detail at the functions provided by the `re` module for working with regular expressions.

### `match()`
As pointed out before, the `match()` function looks for a pattern only at the beginning of a string.

The syntax for `re.match` is:

    re.match(pattern, string, flags=0)

where `pattern` is the regular expression pattern to be searched, `string` is the input string to be searched, and `flags` is an optional argument to specify various flags that control how the pattern is interpreted.

In [197]:
import re

# Define the pattern to search for
pattern = r'Python'

# Search for the pattern in the string
string = 'Python Programming'

match = re.match(pattern, string)
print(match)

<re.Match object; span=(0, 6), match='Python'>


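Some common attributes and methods of the match object returned by `re.match()` include:

* `string`: The string passed to the match function.
* `pos`: The index in the string at which the regex engine started looking for a match (`0` unless a different start is given to a compiled pattern's `match()` or `search()`).
* `endpos`: The index beyond which the regex engine will not look (the length of the string by default). This is not the end of the match; use `end()` for that.
* `lastindex`: The integer index of the last matched capturing group, or `None` if no group was matched.
* `lastgroup`: The name of the last matched capturing group, or `None` if that group has no name or no group was matched.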
* `group([group1, ...])`: Returns the specified capturing group or groups as a string or tuple of strings.
* `groups()`: Returns all capturing groups as a tuple of strings.
* `groupdict()`: Returns a dictionary containing all named capturing groups.
* `start([group])`: Returns the starting index of the match or of the specified capturing group.
* `end([group])`: Returns the ending index of the match or of the specified capturing group.
* `span([group])`: Returns a tuple containing the starting and ending indices of the match or of the specified capturing group.

In [198]:
match.string

'Python Programming'

In [199]:
match.span()

(0, 6)

In [200]:
match.pos

0

In [201]:
match.endpos

18
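
To make the difference between `pos`/`endpos` and `start()`/`end()` concrete, here is a small sketch. It relies on the optional `pos` and `endpos` arguments accepted by the `search()` method of a compiled pattern (compiled patterns are covered later under `compile`); the pattern `gram` is just an illustrative choice.

```python
import re

string = 'Python Programming'
gram = re.compile(r'gram')

# Start scanning at index 7; the match itself is found further along
found = gram.search(string, 7)
print(found.pos, found.endpos)   # where the engine was allowed to look
print(found.span())              # where the match actually is

# Limit the scan to string[7:12] ('Progr'): 'gram' is cut off, so no match
clipped = gram.search(string, 7, 12)
print(clipped)
```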

You might be wondering what the term **"group"** means. Remember that when we treated "Grouping" under regex patterns, we used the `()` metacharacters to group characters that need to be matched together. In regular expressions, groups are also used to capture and extract specific parts of a pattern. We can create and extract groups as shown in the examples below:

In [202]:
pattern = r"(\w+)\s(\w+)$"
match = re.match(pattern, 'John Snow')
print(match)

<re.Match object; span=(0, 9), match='John Snow'>


In [203]:
match.groups()       # Returns all groups in a tuple

('John', 'Snow')

In [204]:
match.group(1)

'John'

In [205]:
match.group(2)

'Snow'

In [206]:
match.group(0)   # returns the entire matched string

'John Snow'

In [207]:
match.end(1)  # Returns the end index of group 1

4

In [208]:
match.start(2)    # Returns the start index of group 2

5

In [209]:
names = ['John Snow', 'Arya Stark', 'Daenerys Targaryen', 'Tyrion Lannister']
pattern = r"(\w+)\s(\w+)$"
for name in names:
    match = re.match(pattern, name)
    if match:
        print(f"First Name:{match.group(1)}, Last Name:{match.group(2)}")
        

First Name:John, Last Name:Snow
First Name:Arya, Last Name:Stark
First Name:Daenerys, Last Name:Targaryen
First Name:Tyrion, Last Name:Lannister


In [210]:
pattern = r"(\d{2})-(\d{2})-(\d{4})$"
m = re.match(pattern, '12-06-2022')
print(f"Day:{m.group(1)}")
print(f"Month:{m.group(2)}")
print(f"Year:{m.group(3)}")

Day:12
Month:06
Year:2022


The above code uses a regular expression pattern to match a date string in the format of "dd-mm-yyyy". The pattern is defined as `"(\d{2})-(\d{2})-(\d{4})$"` which contains three groups separated by hyphens. Each group corresponds to the day, month, and year of the date string.

When the string "12-06-2022" is matched against the pattern using `re.match()`, the `group()` method is used to retrieve the value of each group. The group number is passed as an argument to `group()`, with the first group being numbered 1 (`group(0)` is the entire match).

In the code, `m.group(1)` retrieves the value of the first group, which corresponds to the day of the date string, `m.group(2)` retrieves the value of the second group, which corresponds to the month of the date string, and `m.group(3)` retrieves the value of the third group, which corresponds to the year of the date string.
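As a possible next step, the captured strings can be converted into numbers and handed to the standard `datetime` module. This is only a sketch: the variable names are illustrative, and it assumes the input really is a valid calendar date (otherwise `date()` raises `ValueError`).

```python
import re
from datetime import date

m = re.match(r"(\d{2})-(\d{2})-(\d{4})$", '12-06-2022')

# groups() returns strings, so convert each part to an int first
day, month, year = (int(part) for part in m.groups())
parsed = date(year, month, day)

print(parsed.isoformat())
```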

In order to create a `groupdict`, you need to use *named capturing groups* in your regular expression. A named capturing group is created using the syntax `(?P<name>pattern)`, where `name` is the name of the group and `pattern` is the regular expression pattern you want to match. Once you have defined named capturing groups in your regular expression, you can use the `groupdict()` method on the resulting match object to create a dictionary of group names and their matched values:

In [211]:
pattern = r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})$"
m = re.match(pattern, '12-06-2022')
m.groupdict()

{'day': '12', 'month': '06', 'year': '2022'}

In [212]:
m.group('day')

'12'

In [213]:
m.group(3)

'2022'

In [214]:
m.group(1)

'12'

The above operations could easily have been carried out with the `split()` method of a string, using the character that separates the groups as the delimiter. However, some strings can't be easily split with `split()` because there is no defined or obvious delimiter. Consider this example: the LaLiga Twitter handle hashtags each match with something like #BarcaAtleti, #OsasunaGirona, or #RCDEspanyolElche. Two things can be easily inferred: each team name starts with capital letter(s), and the home team comes first, immediately after the `#`.

In [215]:
hashtag = '#BarcaAtleti'

# Extract the first group/team
pattern = r"^#([A-Z]*[a-z]*)"

match = re.match(pattern, hashtag)
print(match.groups())
               

('Barca',)


The pattern variable is defined as `^#([A-Z]*[a-z]*)`, which matches a string starting with `#` and followed by zero or more uppercase letters, followed by zero or more lowercase letters.

In [216]:
hashtag = '#RCDEspanyolElche'

# Extract the first team
pattern = r"^#([A-Z]*[a-z]*)"

match = re.match(pattern, hashtag)
print(match.groups())
        

('RCDEspanyol',)


In [217]:
hashtag = '#BarcaAtleti'

# Extract the first and second team at once
pattern = r"^#([A-Z]*[a-z]*)([A-Z]*[a-z]*)"

match = re.match(pattern, hashtag)
print(match.groups())
        

('Barca', 'Atleti')


Since the pattern is the same sub-pattern repeated twice, we can build it with string multiplication:

In [218]:
'([A-Z]*[a-z]*)' * 2

'([A-Z]*[a-z]*)([A-Z]*[a-z]*)'

In [219]:
hashtag = '#BarcaAtleti'

sub_pattern = '([A-Z]*[a-z]*)'
# Extract the first and second team
pattern = fr"^#{sub_pattern * 2}"    # uses f-string substitution with a raw string

match = re.match(pattern, hashtag)
print(match.groups())
        

('Barca', 'Atleti')


The regular expression pattern used in this example is `^#{sub_pattern * 2}`, which matches a string that starts with the "#" character, followed by two groups of the sub-pattern variable defined as `([A-Z]*[a-z]*)`, which matches zero or more uppercase letters followed by zero or more lowercase letters.

The `sub_pattern` was multiplied by 2 in `#{sub_pattern * 2}`, which means that we want to match two groups of the sub_pattern.

After running the `re.match` function with the pattern and hashtag string, the `match.groups()` method is used to extract the groups from the match object, which in this case are the names of the two teams. We can easily extend this code to capture up to four groups by using `^#{sub_pattern * 4}`, which handles hashtags such as '#RealMadridRealSociedad':

In [220]:
hashtags = ['#BarcaAtleti', '#OsasunaGirona', '#RCDEspanyolElche', '#RealMadridRealSociedad']
sub_pattern = '([A-Z]*[a-z]*)'
pattern = fr"^#{sub_pattern * 4}"
for hashtag in hashtags:
    match = re.match(pattern, hashtag)
    if match:
        print(match.groups())
        
        

('Barca', 'Atleti', '', '')
('Osasuna', 'Girona', '', '')
('RCDEspanyol', 'Elche', '', '')
('Real', 'Madrid', 'Real', 'Sociedad')


In [221]:
hashtags = ['#BarcaAtleti', '#OsasunaGirona', '#RCDEspanyolElche', '#RealMadridRealSociedad']
sub_pattern = '([A-Z]*[a-z]*)'
pattern = fr"^#{sub_pattern * 4}" # Concatenates the sub-pattern four times to create a pattern for a hashtag match

for hashtag in hashtags:
    match = re.match(pattern, hashtag)
    if not match:        # Skip strings that do not start with '#'
        continue
    if match.group(4):   # Fourth group is non-empty: both team names have two parts
        print(f"{match.group(1)} {match.group(2)} Vs. {match.group(3)} {match.group(4)}")
    else:                # Groups 3 and 4 are empty: both team names have one part
        print(f"{match.group(1)} Vs. {match.group(2)}")
        

Barca Vs. Atleti
Osasuna Vs. Girona
RCDEspanyol Vs. Elche
Real Madrid Vs. Real Sociedad


The above code first defines a list of hashtags, then creates a sub-pattern that matches capital and lowercase letters using regular expressions. This sub-pattern is then concatenated four times to create a pattern that matches a specific format for a sports match hashtag. The `re.match` method is then used to search for this pattern in each of the hashtags in the list. If there is a match, the groups of the match object are extracted and printed in a specific format that indicates the two teams playing against each other.

When there are three groups in the match, it can be challenging to determine the correct pairing of the groups. For example, if the hashtag is "#BarcaRealMadrid", the pattern alone cannot tell us whether it should read "Barca Vs. Real Madrid" or "Barca Real Vs. Madrid". So, we need additional information or constraints to be able to extract the three group names properly. This is where Machine Learning might be useful.
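One simple form of "additional information" is a list of known team tags. The sketch below is an assumption for illustration, not a complete solution: `known_teams` is a hypothetical, hand-written list, and `split_teams` is a helper name introduced here. The names are joined into an alternation such as `(RealSociedad|RealMadrid|...)`, so the pattern can only split a hashtag at a point where both halves are known teams.

```python
import re

# Hypothetical list of known team tags (an assumption for illustration only)
known_teams = ['Barca', 'Atleti', 'RealMadrid', 'RealSociedad']

# Build an alternation; longer names go first so they are tried before shorter ones
team = '|'.join(sorted(known_teams, key=len, reverse=True))
pattern = fr"^#({team})({team})$"

def split_teams(hashtag):
    """Return (home, away) if both teams are known, otherwise None."""
    match = re.match(pattern, hashtag)
    return match.groups() if match else None

for hashtag in ['#BarcaRealMadrid', '#RealMadridBarca', '#OsasunaGirona']:
    print(hashtag, '->', split_teams(hashtag))
```

The team names here contain only letters; names with special regex characters would need `re.escape()` before being joined.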

### Flags

Compilation flags are optional arguments that modify the behavior of a regular expression pattern. Flags are specified as additional arguments to `re.match`, `re.search`, and other functions in the `re` module and are represented using constants defined in the module.

The following are the most commonly used compilation flags:

`re.IGNORECASE` or `re.I`: This flag makes the pattern matching case-insensitive.

`re.MULTILINE` or `re.M`: This flag makes `^` and `$` match at the beginning and end of every line in the string, not just at the beginning and end of the whole string.

`re.DOTALL` or `re.S`: This flag makes the dot character (`.`) match all characters, including newlines. Without this flag, `.` will match anything except a newline.

`re.VERBOSE` or `re.X`: This flag allows you to write more readable and understandable regular expressions by ignoring whitespace and comments within the pattern.

Here's an example of how to use flags in a regular expression pattern:

In [222]:
import re
match = re.match(r'the', "The quick brown fox" , flags=re.I)   # Makes the search case-insensitive
print(match)     # 'the' now matches 'The'

<re.Match object; span=(0, 3), match='The'>


The `re.X` flag is used to enable verbose mode, which allows us to write the regular expression on multiple lines and add comments using the `#` character:

In [223]:
addresses = ['example@gmail.com', 'my_mail@yahoo.ng', 'dataclax@inbox.ng', 'not_valid@gmail.adress', 'not_valid@yahoo ']

pattern = r"""
.+@            # matches any sequence of one or more characters that ends with the at sign (`@`)
(gmail|yahoo)  # matches either the string "gmail" or "yahoo"
\.             # matches a literal period (`.`) character
.{2,3}         # matches any two or three characters, representing the top-level domain of the email address
$              # matches the end of the string
"""
         

for address in addresses:
    match = re.match(pattern, address, re.X)    
    if match:
        print(address)

example@gmail.com
my_mail@yahoo.ng


Two or more flags can be combined by separating them with a pipe (`|`):

In [224]:
addresses = ['example@gmail.com', 'my_mail@YAHOO.ng', 'dataclax@inbox.ng', 'not_valid@gmail.adress', 'not_valid@yahoo ']

pattern = r"""
.+@            # matches any sequence of one or more characters that ends with an at sign (`@`)
(gmail|yahoo)  # matches either the string "gmail" or "yahoo"
\.             # matches a literal period (`.`) character
.{2,3}         # matches any two or three characters, representing the top-level domain of the email address
$              # matches the end of the string
"""
         

for address in addresses:
    match = re.match(pattern, address, re.I|re.X)     # case-insensitive and verbose combined
    if match:
        print(address)

example@gmail.com
my_mail@YAHOO.ng


This is similar to the previous code, with the only difference being the addition of the `re.I` flag in the `re.match()` function. The `re.I` flag makes the pattern **case-insensitive**, so that both "gmail" and "yahoo" can be matched regardless of their capitalization. Therefore, this code will match email addresses containing either "gmail" or "yahoo", regardless of whether they are in upper or lower case letters.
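The `re.M` and `re.S` flags listed at the start of this section can be sketched in the same way. The three-line `text` below is made up for illustration; the example only uses `re.search` and `re.match` with and without each flag.

```python
import re

text = "first line\nsecond line\nthird line"

# Without re.M, `^` only matches at the very start of the string
no_flag = re.search(r'^second', text)
# With re.M, `^` also matches right after each newline
multiline = re.search(r'^second', text, re.M)
print(no_flag, multiline.span())

# Without re.S, `.` stops at a newline; with re.S it crosses lines
no_dotall = re.match(r'first.*third', text)
dotall = re.match(r'first.*third', text, re.S)
print(no_dotall, repr(dotall.group()))
```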

### `search()`
`re.search()` searches a given string for the first occurrence of a regular expression pattern and returns a match object if found. It returns `None` if no match is found.

The match object has the following common attributes and methods:
* `group(group1...)`: returns the specified capturing group(s) or the entire match if no group is specified.
* `start(group)`: returns the starting position of the match (or the specified group) in the string.
* `end(group)`: returns the ending position of the match (or the specified group) in the string.
* `span(group)`: returns a tuple containing the starting and ending positions of the match (or the specified group) in the string.
* `groups()`: returns a tuple containing all the capturing groups in the match; groups that did not take part in the match are given a default value (`None` unless another default is passed).
* `groupdict()`: returns a dictionary containing all the named capturing groups in the match, with the group names as keys and the corresponding matched strings as values.

In [225]:
import re

# Define the pattern to search for
pattern = r'Python'

# Search for the pattern in the string
string = 'I love Python Programming'

match = re.search(pattern, string)
print(match)

<re.Match object; span=(7, 13), match='Python'>


`search` can sometimes make a pattern simpler if you are only looking for a specific pattern or word anywhere in a string. Consider these two cases, where we use `match` and `search` to select any word spelt like "goal", irrespective of how drawn out its pronunciation is:

In [226]:
# Using match
words = ['gol', 'goal', 'goooal', 'gooaaaal', 'goooaaoooaaal', 'gaoooool']

# Select words that spell "goal" irrespective of the pronunciation
pattern = 'g(o+a+)*l'

for word in words:
    match = re.match(pattern, word)   
    if match:
        print(word)

goal
goooal
gooaaaal
goooaaoooaaal


In [227]:
# Using search
words = ['gol', 'goal', 'goooal', 'gooaaaal', 'goooaaoooaaal', 'gaoooool']

pattern = 'oa'

for word in words:
    match = re.search(pattern, word)   
    if match:
        print(word)

goal
goooal
gooaaaal
goooaaoooaaal


Yes, `search` can make pattern matching easier because it looks for an occurrence of the pattern anywhere in a string, irrespective of its position. But its greatest strength can also be its weakness. Imagine we have "oagol" instead of "gol": it is going to be matched as well, even though it comes no closer to pronouncing "goal".

In [228]:
words = ['oagol', 'goal', 'goooal', 'gooaaaal', 'gaoooool']

pattern = 'oa'

for word in words:
    match = re.search(pattern, word)   
    if match:
        print(word)

oagol
goal
goooal
gooaaaal
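
One way to get the strictness back while still using `search` is to anchor the pattern with `^` and `$`. The sketch below is one possible pattern, not the only one; it uses `+` after the group so that at least one "o...a..." run is required.

```python
import re

words = ['oagol', 'gol', 'goal', 'goooal', 'gooaaaal', 'goooaaoooaaal', 'gaoooool']

# Anchored: must start with 'g', contain one or more o...a... runs, and end with 'l'
pattern = r'^g(o+a+)+l$'

selected = [word for word in words if re.search(pattern, word)]
print(selected)
```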


### `findall`
`re.findall()` returns all non-overlapping matches of a regular expression in a string as a list of strings (or a list of tuples when the pattern contains more than one capturing group).

In [229]:
import re

string = "I love the Python language. It was actually named after Monty Python, and not python the snake"
pattern = r'python'

matches = re.findall(pattern, string, re.I)
matches

['Python', 'Python', 'python']

In the example code above, we have defined a string variable `string`, which is a sentence with 3 "python" words (lower and uppercase). A regular expression pattern `r'python'` is then defined that matches the string "python". We passed the pattern and the string to `re.findall()`, along with the `re.I` flag which makes the pattern case-insensitive and thus matches both occurrences of the word "Python" and "python". `re.findall()` searched the string for all occurrences of the pattern and returned them as a list of strings.

Another example is given below:

In [230]:
string = "The quick brown fox jumps over the lazy dog"
pattern = r'\w*o\w*'

matches = re.findall(pattern, string)
matches

['brown', 'fox', 'over', 'dog']

Here is a line by line explanation of the code:

1. `string = "The quick brown fox jumps over the lazy dog"` assigns a string to a variable named `string`.


2. `pattern = r'\w*o\w*'` assigns a regular expression pattern to a variable named pattern. This pattern matches any sequence of zero or more word characters (`\w*`) that contains the letter 'o', followed by any sequence of zero or more word characters (`\w*`).


3. `matches = re.findall(pattern, string)` uses the `re.findall()` method to find all non-overlapping occurrences of the pattern in the string. The results are stored in a list named matches.


4. `matches` contains the list of all the matches found in string that satisfy the given pattern.
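One detail worth knowing is that the shape of the list returned by `re.findall()` depends on the capturing groups in the pattern. The sketch below uses a made-up sentence with dates in the "dd-mm-yyyy" format to show the three cases: no group, several groups, and a single group.

```python
import re

text = "Kick-off dates: 12-06-2022, 19-06-2022 and 03-07-2023"

# No capturing group: a list of the full matches
full = re.findall(r'\d{2}-\d{2}-\d{4}', text)

# Several capturing groups: a list of tuples, one tuple per match
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)

# One capturing group: a list containing only that group
years = re.findall(r'\d{2}-\d{2}-(\d{4})', text)

print(full)
print(parts)
print(years)
```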

### `finditer`
`re.finditer()` returns all non-overlapping matches of a regular expression in a string as an **iterator** that yields match objects.

In [231]:
string = "The quick brown fox jumps over the lazy dog"
pattern = r'\w*o\w*'

matches = re.finditer(pattern, string)
print(matches)      # Iterable object

<callable_iterator object at 0x7aa53f5cb910>


The `next()` function in Python is used to retrieve the next item from an iterator. It takes two arguments: the iterator object and an optional default value to be returned if the iterator is exhausted.



In [232]:
next(matches)

<re.Match object; span=(10, 15), match='brown'>

In [233]:
next(matches, 'finish')

<re.Match object; span=(16, 19), match='fox'>

In [234]:
string = "The quick brown fox jumps over the lazy dog"
pattern = r'\w*o\w*'

matches = re.finditer(pattern, string)

for match in matches:
    print(match.group(), match.span())

brown (10, 15)
fox (16, 19)
over (26, 30)
dog (40, 43)
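
Because `finditer()` returns an iterator, it can only be consumed once. The short sketch below (variable names are illustrative) takes the first match with `next()`, drains the rest, and then shows what happens when the iterator is exhausted.

```python
import re

matches = re.finditer(r'\w*o\w*', "The quick brown fox jumps over the lazy dog")

first = next(matches)                    # takes 'brown' off the iterator
rest = [m.group() for m in matches]      # consumes everything that is left
print(first.group(), rest)

exhausted = next(matches, 'finish')      # nothing left, so the default is returned
leftover = list(matches)                 # an exhausted iterator yields nothing
print(exhausted, leftover)
```

If the matches are needed more than once, one option is to store them in a list first, or to call `finditer()` again.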


### `compile`

The `re.compile()` function compiles a regular expression pattern into a **regular expression object**, which can be used to match strings using methods such as `match()`, `search()`, and others. In other words, a compiled regex object has its own `match()`, `search()`, `findall()`, etc. methods.

The benefit of compiling a regular expression is that it can save processing time when the expression needs to be used multiple times, as the pattern only needs to be compiled once. Once the regular expression object is created, it can be used multiple times with different input strings.

In [235]:
string = "The quick brown fox jumps over the lazy dog"

# Creates a regex object
pattern = re.compile(r'\w*o\w*', re.I)

print(pattern.match(string))          # No match, pattern never occurred at the beginning of string
print(pattern.search(string))         # Pattern matched somewhere in the string. Returned first occurrence
print(pattern.findall(string))        # Pattern matched. Returned all occurrences

None
<re.Match object; span=(10, 15), match='brown'>
['brown', 'fox', 'over', 'dog']
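
A compiled pattern also carries some information about itself, which can be handy when a pattern is defined once and reused in several places. The sketch below compiles the date pattern used earlier under the illustrative name `DATE` and reuses it on a made-up list of candidate strings.

```python
import re

# Compile once, reuse many times
DATE = re.compile(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})$")

print(DATE.pattern)            # the original pattern string
print(DATE.groups)             # number of capturing groups
print(dict(DATE.groupindex))   # group names mapped to group numbers

candidates = ['12-06-2022', '2022-06-12', '01-01-1999']
valid = [s for s in candidates if DATE.match(s)]
print(valid)
```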


## More Regex Patterns

### Boundaries
In regular expressions, boundaries are used to match positions between characters or character groups, instead of matching actual characters.
The following are the boundary metacharacters in regular expressions:

* `^`: Matches at the beginning of a string or at the beginning of a line.
* `$`: Matches at the end of a string or at the end of a line.
* `\b`: Matches at a word boundary.
* `\B`: Matches at a position that is not a word boundary.

The line boundaries, `^` and `$`, have been explained previously. A word boundary, `\b`, matches the position between a word character `\w` (`[A-Za-z0-9_]`) and a non-word character `\W` (`[^A-Za-z0-9_]`), or between a word character and the start or end of the string. `\B` is the opposite: it matches at any position that is **not** a word boundary, for example **at any position that is within a word**. The image and examples below will make this clearer:

![image.png](attachment:image.png)

In [236]:
import re

string = "hello this is the final one"
pattern = re.compile(r'\bis\b')

matches = pattern.findall(string)
print(matches)

['is']


In [237]:
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = re.compile(r'\b\w{3}\b')

matches = pattern.findall(string)
print(matches)

['The', 'fox', 'the', 'dog']


In the above example, we take each word, as determined by the word boundaries, and check whether the word contains exactly three word characters.
For example, one of the matches is "fox". Before the 'f' in "fox" is a space, which is not in the `\w` character set, so there is a demarcation between 'f' (a word character) and the space before it (a non-word character). Likewise, there is a demarcation between 'x' and the space after it. These two demarcations form a sort of enclosed **region**. The number of word characters, `\w`, within this region must be exactly 3, or else there won't be a match.

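Non-word boundaries, `\B`, are the opposite. These are the positions in a string that are not word boundaries, most commonly the positions between two word (`\w`) characters:

* In "The", the positions before 'T' and after 'e' are word boundaries, but the positions between 'T' and 'h' and between 'h' and 'e' are not
* In "brown", the positions before 'b' and after 'n' are word boundaries, but the positions between its letters are not
* In "lazy", the positions before 'l' and after 'y' are word boundaries, but the positions between its letters are not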
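
Since `\b` and `\B` match positions rather than characters, one way to see them is to list the indices at which each one matches. The sketch below does this with `re.finditer()` on the short, made-up string `"The fox"`; every match is empty, so only its start index is of interest.

```python
import re

text = "The fox"

# Each match is zero-width, so start() gives the position of the boundary
boundaries = [m.start() for m in re.finditer(r'\b', text)]
non_boundaries = [m.start() for m in re.finditer(r'\B', text)]

print(boundaries)       # edges of 'The' and 'fox'
print(non_boundaries)   # positions inside the two words
```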

In [238]:
import re

words = ['can', 'uncanny', 'cancer', 'decanter', '_can_', '-can-']
pattern = re.compile(r'\Bcan\B')

for word in words:
    match = pattern.findall(word)
    if match:
        print(word)

uncanny
decanter
_can_


The code above defines a list of words containing the substring 'can', and compiles a regular expression pattern to match the exact substring 'can' with a non-word boundary on either side, using the `\B` metacharacter.
The for loop then iterates over each word in the `words` list, and applies the regular expression pattern to the word using the `findall()` method of the compiled pattern object.
If the pattern matches the word, **meaning that the exact substring 'can' has a non-word boundary on both sides**, i.e. it sits between other word characters, the word is printed.

In case you are wondering why `'_can_'` is returned and `'-can-'` isn't, remember the definition of a word boundary: it's a boundary that exists between a word character `\w` **and** a non-word character, or vice versa. The leading underscore (`_`) and 'c' are **both** word characters, so the position between them is **not** a word boundary; in other words, it's a non-word boundary `\B`. The same goes for 'n' and the trailing underscore. For `'-can-'`, the leading hyphen is not a word character but 'c' is, so the position between them is a valid word boundary `\b`. The same goes for 'n' and the trailing hyphen. Therefore `'-can-'` won't be returned, because we are looking for the pattern between non-word boundaries, not between word boundaries.

In summary, `\b` is usually used for standalone characters or words, while `\B` is used to find characters within words or text:

* `\bword\b`: This pattern would match the whole word "word" when it appears as a standalone word in the text.

* `\Bchar\B`: This pattern would match the characters "char" when they appear within a word or are surrounded by other word characters.
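
A practical note on writing `\b`: in an ordinary (non-raw) Python string, `'\b'` is the backspace character, so the pattern silently stops meaning "word boundary". The sketch below, reusing the sentence from the earlier example, shows why the patterns in this section are written as raw strings.

```python
import re

string = "hello this is the final one"

raw = re.findall(r'\bis\b', string)        # raw string: \b is a word boundary
not_raw = re.findall('\bis\b', string)     # plain string: \b is a backspace character
escaped = re.findall('\\bis\\b', string)   # doubling the backslash also works

print(raw, not_raw, escaped)
```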

### Lookaround Assertions

Lookaround assertions are zero-width assertions, meaning they do not consume any characters in the string but only assert whether a pattern is or is not present at a specific position. Lookaround assertions are divided into two types: **lookahead and lookbehind**. Both lookahead and lookbehind assertions can be either positive or negative, depending on whether they assert the presence or absence of a pattern.

#### Positive Lookahead 
This is a type of zero-width assertion that allows you to match a pattern **only if** it is followed by a given pattern. Positive lookahead is denoted by `(?=pattern)`. Consider the example below:

In [239]:
import re

sentences = ['Think positive', 'React positively', 'Genarally positive personality', 'Negative-positive terminals']
pattern = re.compile(r"\w+\s+(?=positive)", re.I)

for sentence in sentences:
    match = pattern.match(sentence)
    if match:
        print(match.group())

Think 
React 
Genarally 


Here's a breakdown of how the above code works:
* The pattern `r"\w+\s+(?=positive)"` is used. Let's further break it down:
    * `\w+\s+` matches one or more word characters followed by one or more whitespace characters.
    * `(?=positive)` is the lookahead part. It ensures that the matched text is immediately followed by "positive" (which is why "positively" also qualifies). **However, the lookahead part, i.e. "positive", is not included in the match**.

* The `re.I` flag is passed to the `re.compile` function to make the matching case-insensitive.

* The code then iterates over the sentences list and for each sentence:
    * `pattern.match(sentence)` looks for a match of the pattern at the beginning of the sentence.
    * If there is a match, `match.group()` retrieves the matched portion of the sentence.
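
To see what "not included in the match" means in practice, the sketch below compares the lookahead pattern with an otherwise identical pattern that consumes "positive". The variable names `lookahead` and `consuming` are illustrative.

```python
import re

sentence = 'Think positive'

lookahead = re.match(r"\w+\s+(?=positive)", sentence)
consuming = re.match(r"\w+\s+positive", sentence)

# The lookahead match stops before "positive"; the rest of the string is untouched
print(repr(lookahead.group()), lookahead.end())
print(repr(sentence[lookahead.end():]))

# Without the lookahead, "positive" becomes part of the match
print(repr(consuming.group()), consuming.end())
```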

If we only want to match "positive" exactly and not "positively", "positiveness", etc, we can modify the code as shown below:

In [240]:
sentences = ['Think positive', 'React positively', 'Genarally positive personality', 'Negative-positive terminals']
pattern = re.compile(r"\w+\s+(?=\bpositive\b)", re.I)

for sentence in sentences:
    match = pattern.match(sentence)
    if match:
        print(match.group())

Think 
Genarally 


Here, by using a word boundary on both sides, we are saying that "positive" is a word on its own and not part of a longer word.

#### Negative Lookahead 
This allows you to specify a pattern that **should not be present** after the current matching position. It is denoted by `(?!pattern)`. This is very useful when you want to match a pattern only if it is **not** followed by a specific pattern. Consider the example below where we select only words **not** followed by "pie":

In [241]:
words = ['apple pie', 'apple juice', 'orange juice', 'meat pie']
pattern = re.compile(r"\w+\s(?!pie)")

for word in words:
    match = pattern.match(word)
    if match:
        print(f"matched pattern: {match.group()}")
        print(f"word: {word}\n")

matched pattern: apple 
word: apple juice

matched pattern: orange 
word: orange juice
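
As with the positive lookahead, a word boundary can tighten the negative lookahead. In the sketch below, the extra entry `'apple pieces'` is an assumption added for illustration: `(?!pie)` rejects it because it starts with "pie", while `(?!pie\b)` only rejects the standalone word "pie".

```python
import re

words = ['apple pie', 'apple pieces', 'apple juice', 'meat pie']

loose = re.compile(r"\w+\s(?!pie)")      # rejects anything starting with "pie"
strict = re.compile(r"\w+\s(?!pie\b)")   # rejects only the whole word "pie"

loose_hits = [w for w in words if loose.match(w)]
strict_hits = [w for w in words if strict.match(w)]

print(loose_hits)
print(strict_hits)
```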



#### Positive Lookbehind
Positive lookbehind allows you to check whether a certain pattern is preceded by another pattern, without including the preceding pattern in the match. Positive lookbehind is denoted by `(?<=pattern)`. Consider the example below, where we select the text that is preceded by "apple" and starts with a space:

In [242]:
words = ['apple pie', 'apple juice', 'orange juice', 'meat pie']
pattern = re.compile(r"(?<=apple)\s\w+")

for word in words:
    match = pattern.search(word)
    if match:
        print(f"matched pattern: {match.group()}")
        print(f"word: {word}\n")

matched pattern:  pie
word: apple pie

matched pattern:  juice
word: apple juice



**NB**: We can't use `re.match` here because the pattern we want to match does not occur at the start of each string; it comes after a specific pattern, in this case "apple".

In [243]:
text = "The price is $50 and the discount is $10."

# Using a lookbehind to find the dollar amounts
pattern = r'(?<=\$)\d+'          # $ is a special character so it has to be escaped with \
matches = re.findall(pattern, text)

print(matches)

['50', '10']
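
One limitation to keep in mind: in Python's `re` module, the pattern inside a lookbehind must have a fixed width, so something like `(?<=\$\s*)` is rejected. The sketch below demonstrates the error and shows one common workaround, a capturing group; the modified `text` (with a space after the second `$`) is made up for illustration.

```python
import re

text = "The price is $50 and the discount is $ 10."

# A variable-width lookbehind is not accepted by the re module
try:
    re.compile(r'(?<=\$\s*)\d+')
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("Rejected:", exc)

# Workaround: match the prefix normally and capture only the digits
amounts = re.findall(r'\$\s*(\d+)', text)
print(amounts)
```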


#### Negative Lookbehind
Negative lookbehind allows you to check that a pattern is **not** preceded by a specified pattern, without including the preceding pattern in the match. It is denoted by `(?<!pattern)`. Consider the example below, where we select the text that starts with a space and is preceded by anything other than "apple":

In [244]:
words = ['apple pie', 'apple juice', 'orange juice', 'meat pie']
pattern = re.compile(r"(?<!apple)\s\w+")

for word in words:
    match = pattern.search(word)
    if match:
        print(f"matched pattern: {match.group()}")
        print(f"word: {word}\n")

matched pattern:  juice
word: orange juice

matched pattern:  pie
word: meat pie
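
Negative lookbehind pairs naturally with the dollar example above. The sketch below, on a made-up sentence, looks for numbers that are **not** prices. It also shows why a word boundary is often needed alongside the lookbehind: without `\b`, the engine can start in the middle of "50", where the preceding character is '5' rather than '$'.

```python
import re

text = "The price is $50 for 3 items"

# Without \b the match can begin at the '0' of "50"
without_boundary = re.findall(r'(?<!\$)\d+', text)

# With \b the number must start at a word boundary
with_boundary = re.findall(r'(?<!\$)\b\d+', text)

print(without_boundary)
print(with_boundary)
```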



## `sys` Module
The `sys` module provides functions and variables to interact with Python's runtime environment, allowing the programmer to control how the program interacts with the underlying system. Some of the commonly used attributes and functions include:

* `sys.argv`: Access command-line arguments passed to the script.
* `sys.exit()`: Exit the program with an optional status code.
* `sys.version`: Get Python version and related metadata.
* `sys.path`: List the directories where Python looks for modules.

In [245]:
import sys

# Print the command-line arguments
print("Arguments passed to the script:", sys.argv)


Arguments passed to the script: ['/home/peter/anaconda3/lib/python3.11/site-packages/ipykernel_launcher.py', '-f', '/home/peter/.local/share/jupyter/runtime/kernel-2282d7c5-f088-4925-b9e4-77be1248beb8.json']


`sys.argv` is a list that contains the arguments passed to the Python script. The first element (`sys.argv[0]`) is the name of the script itself, and the following elements are the additional arguments. Inside a Jupyter notebook, as in the output above, the "script" is the kernel launcher.
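Indexing `sys.argv[1]` directly raises an `IndexError` when no argument is supplied, so scripts usually check the length of the list first. The sketch below is one minimal way to do that; `get_user_name` and the default value `"stranger"` are hypothetical names introduced here for illustration.

```python
import sys

def get_user_name(argv, default="stranger"):
    """Return the first command-line argument, or a default if none was given."""
    return argv[1] if len(argv) > 1 else default

# Simulated argument lists, so the example behaves the same wherever it runs
print(get_user_name(['greet.py', 'peter']))
print(get_user_name(['greet.py']))

# With the real sys.argv, the result depends on how the program was launched
print(get_user_name(sys.argv))
```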

In [252]:
%%writefile greet.py

def greet(user_name):
    print(f"Nice to meet you {user_name.title()}")

if __name__=='__main__':
    import sys
    user_name = sys.argv[1]
    greet(user_name)
    sys.exit()

Overwriting greet.py


In [253]:
!python greet.py peter

Nice to meet you Peter


In [254]:
import sys
sys.version

'3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]'

In [255]:
sys.path

['/home/peter/Desktop/peter/DataClax_files/Lecture Notes/Python Language',
 '/home/peter/anaconda3/lib/python311.zip',
 '/home/peter/anaconda3/lib/python3.11',
 '/home/peter/anaconda3/lib/python3.11/lib-dynload',
 '',
 '/home/peter/anaconda3/lib/python3.11/site-packages']
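
Two further details are sketched below. First, `sys.exit()` works by raising a `SystemExit` exception, so the status code can be inspected by catching it. Second, `sys.version_info` offers the same information as `sys.version` in a form that is easier to compare. The minimum version `(3, 8)` is an arbitrary choice for illustration.

```python
import sys

# sys.exit() raises SystemExit; the status code is stored in its `code` attribute
try:
    sys.exit(2)
except SystemExit as exc:
    exit_code = exc.code
print(exit_code)

# sys.version_info compares like a tuple, unlike the sys.version string
is_supported = sys.version_info >= (3, 8)
print(sys.version_info.major, sys.version_info.minor, is_supported)
```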

### To be continued...

*Copyright &copy; 2024 DataClax. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*