# Urls

`urls` contains functions to parse and modify url strings.


# Initialization

The following code imports `urls`. It adds the project folder, two levels above the current directory, to `sys.path` and assumes that this folder contains the scrape package.

Urls can use schemes such as http, https and ftp, among others. For the syntax, see:

https://en.wikipedia.org/wiki/URL

In [1]:
import os
import sys
import time
PROJECT_DIR = os.path.dirname(os.path.abspath('..'))
print('Project folder: ' + PROJECT_DIR)
sys.path.append(PROJECT_DIR)

from scrape.utils import urls

Project folder: D:\Projects\Python\projects\scrape
Initializing scrape ...


# Working with urls
In this example we give an overview of the functions in `urls`.

## Get and set url components
The functions `get_components` and `set_components` can be used to get and set the components of a url string.

In [2]:
aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query=arg#frag'
components = urls.get_components(aurl)
components

{'scheme': 'http',
 'netloc': 'user:pwd@NetLoc.com:80',
 'path': '/p1;para/p2;para',
 'query': 'query=arg',
 'fragment': 'frag'}
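
The five keys in this output match what the standard library's `urllib.parse.urlsplit` returns. As a rough illustration of the idea, and not the package's actual code, the sketch below produces the same dictionary with the standard library only. The name `split_components` is hypothetical.

```python
from urllib.parse import urlsplit


def split_components(url):
    """Return the five top-level url components as a plain dict.

    Hypothetical stand-in for `urls.get_components`, built on the
    standard library; the real implementation may differ.
    """
    # urlsplit keeps ';para' inside the path, unlike urlparse, which
    # moves the parameters of the last path segment into a separate field.
    return dict(urlsplit(url)._asdict())


aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query=arg#frag'
components = split_components(aurl)
print(components)
```

Note that `urlsplit` lowercases the scheme but leaves the case of the netloc untouched, which is why `NetLoc.com` survives as written.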

`set_components` can be used with a dictionary:

In [3]:
components = {'scheme': 'http', 'netloc': 'user:pwd@NetLoc.com:80',
              'path': '/p1;para/p2;para', 'query': 'query=arg', 'fragment': 'frag'}
aurl = urls.set_components(aurl, **components)
print(aurl)

http://user:pwd@NetLoc.com:80/p1;para/p2;para?query=arg#frag


`set_components` can be used with keywords:

In [4]:
url_new = urls.set_components(aurl, scheme='ftp', netloc='newloc.com')
print(url_new)

ftp://newloc.com/p1;para/p2;para?query=arg#frag
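
For readers without the scrape package, the keyword form can be approximated with the standard library. The sketch below is only an assumption about how such a function might work: it splits the url, swaps the named fields and joins the parts again. The name `with_components` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_components(url, **changes):
    """Return `url` with the named components replaced.

    Hypothetical stand-in for `urls.set_components`. Accepted keywords
    are scheme, netloc, path, query and fragment.
    """
    # namedtuple._replace rejects field names it does not know.
    parts = urlsplit(url)._replace(**changes)
    return urlunsplit(parts)


aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query=arg#frag'
url_new = with_components(aurl, scheme='ftp', netloc='newloc.com')
print(url_new)
```

Components that are not named in the call are carried over unchanged, which mirrors the behaviour shown in the output above.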


## Working with queries
One can set an entire query using `set_components`. Alternatively, one can use the function `replace_subquery` to change the values of one or more existing keywords; keywords that are not yet in the query are appended.

In [5]:
aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
query_new = 'query1=arg11&query4=arg44'
url_new = urls.set_components(aurl, query=query_new)
print(url_new)

http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg11&query4=arg44#frag


In [6]:
aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
subquery = 'query1=arg11&query4=arg44'
url_new = urls.replace_subquery(aurl, subquery=subquery)
print(url_new)

http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg11&query2=arg2&query3=arg3&query4=arg44#frag
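
The merge seen in this output (existing keywords updated, new keywords appended) can be sketched with `parse_qsl` and `urlencode` from the standard library. This is a minimal illustration under stated assumptions, not the package's implementation: the name `merge_subquery` is hypothetical, repeated keywords collapse to their last value, and `urlencode` percent-encodes special characters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def merge_subquery(url, subquery):
    """Update the query of `url` with the key=value pairs in `subquery`.

    Hypothetical stand-in for `urls.replace_subquery`.
    """
    parts = urlsplit(url)
    # A dict keeps insertion order, so existing keywords stay in place
    # and new keywords end up at the end of the query.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(parse_qsl(subquery, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
url_new = merge_subquery(aurl, 'query1=arg11&query4=arg44')
print(url_new)
```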


## Working with paths
One can set an entire path using `set_components`. Alternatively, one can use the function `replace_subpath` to change the value of a single subpath. In this case one needs to specify the level of the subpath in the path; in the examples below, level 1 is the first subpath.

In [7]:
aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
path_new = '/p3/p4/p5'
url_new = urls.set_components(aurl, path=path_new)
print(url_new)

http://user:pwd@NetLoc.com:80/p3/p4/p5?query1=arg1&query2=arg2&query3=arg3#frag


In [8]:
aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
subpath = '/p3'
level = 1
url_new = urls.replace_subpath(aurl, subpath=subpath, level=level)
print(url_new)

http://user:pwd@NetLoc.com:80/p3/p2;para?query1=arg1&query2=arg2&query3=arg3#frag


In [9]:
aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
subpath = 'p3/p4'
level = 1
url_new = urls.replace_subpath(aurl, subpath=subpath, level=level)
print(url_new)

http://user:pwd@NetLoc.com:80/p3/p4/p2;para?query1=arg1&query2=arg2&query3=arg3#frag
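
Both outputs are consistent with replacing one slash-separated segment of the path, counted from 1, and ignoring leading or trailing slashes in the new subpath. The sketch below reproduces that behaviour with the standard library. It is an assumption drawn from these two examples only; the name `swap_subpath` and the error handling are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_subpath(url, subpath, level):
    """Replace the path segment at the 1-based `level` with `subpath`.

    Hypothetical stand-in for `urls.replace_subpath`.
    """
    parts = urlsplit(url)
    # '/p1/p2'.split('/') gives ['', 'p1', 'p2'], so index 1 is the
    # first segment and the 1-based level can be used as the list index.
    segments = parts.path.split('/')
    if not 1 <= level < len(segments):
        raise IndexError('path has no subpath at level %d' % level)
    segments[level] = subpath.strip('/')
    return urlunsplit(parts._replace(path='/'.join(segments)))


aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
print(swap_subpath(aurl, '/p3', 1))
print(swap_subpath(aurl, 'p3/p4', 1))
```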


## Validating a url

In [5]:
aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
is_http = urls.valid_http(aurl)
is_http
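
The notebook does not show the output of this cell, and the criteria used by `valid_http` are not stated. As a rough idea of what such a check could look like, the sketch below accepts a url when its scheme is http or https and it has a host. Both the rule and the name `looks_like_http` are assumptions, not the package's behaviour.

```python
from urllib.parse import urlsplit


def looks_like_http(url):
    """Return True if `url` has an http(s) scheme and a host.

    Hypothetical check; `urls.valid_http` may apply different rules.
    """
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError for some malformed netlocs.
        return False
    return parts.scheme in ('http', 'https') and bool(host)


aurl = 'http://user:pwd@NetLoc.com:80/p1;para/p2;para?query1=arg1&query2=arg2&query3=arg3#frag'
print(looks_like_http(aurl))
```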

# Versions

In [10]:
%reload_ext watermark
%watermark

Last updated: 2021-06-12T19:30:37.058274+02:00

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.19.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores   : 8
Architecture: 64bit

