<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# Advertools - Analyze website content using XML sitemap

**Tags:** #advertools #xml #sitemap #website #analyze #seo

**Author:** [Elias Dabbas](https://www.linkedin.com/in/eliasdabbas/)

**Description:** This notebook gives you an overview of a website's content by analyzing and visualizing its XML sitemap. It also serves as an SEO audit step that can uncover issues affecting the website.

**References:**
- [advertools Sitemaps](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)
- [XML Sitemap](https://en.wikipedia.org/wiki/Sitemaps)
- [Sitemaps Protocol](https://www.sitemaps.org/)

## Input

### Import libraries

In [None]:
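# Install each library only if it is not already available
try:
    import advertools as adv
except ImportError:
    !pip install advertools --user
    import advertools as adv
try:
    import adviz
except ImportError:
    !pip install adviz --user
    import adviz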


### Setup Variables
- `sitemap_url`: URL of the sitemap to analyze. Both plain and zipped (compressed) formats are supported. It can be any of:
    * The URL of an XML sitemap
    * The URL of an XML sitemap index (`sitemapindex`)
    * The URL of a robots.txt file
- `recursive`: If `sitemap_url` points to a sitemap index, whether all the sub-sitemaps should also be downloaded, parsed, and combined into one DataFrame.
- `max_workers`: Number of concurrent workers to fetch the sitemaps.

In [10]:
sitemap_url = "https://www.example.com/robots.txt"
recursive = True
max_workers = 8
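
When `sitemap_url` points to a robots.txt file, the sitemaps are discovered through its `Sitemap:` lines. The sketch below illustrates that lookup on an in-memory string. The helper `sitemaps_from_robots` and the sample file are hypothetical and are not part of advertools; matching the field name case-insensitively is an assumption of this sketch.

```python
def sitemaps_from_robots(robots_text):
    """Return the sitemap URLs declared in a robots.txt body, in order."""
    urls = []
    for line in robots_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on the first colon only, so the "https://" in the value
        # stays intact.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content, for illustration only.
example_robots = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap_index.xml
sitemap: https://www.example.com/news-sitemap.xml.gz
"""

declared_sitemaps = sitemaps_from_robots(example_robots)
print(declared_sitemaps)
```

In the notebook itself this step is not needed, since `adv.sitemap_to_df` accepts the robots.txt URL directly.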

## Model

### Analyze website content using XML sitemap

Download the sitemap(s) and combine them into a single DataFrame:

In [36]:
sitemap = adv.sitemap_to_df(
    sitemap_url=sitemap_url,
    max_workers=max_workers,
    recursive=recursive)
sitemap

Split the URLs into their components for further analysis:

In [35]:
urldf = adv.url_to_df(sitemap['loc'])
urldf
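
To make the component columns used later (`scheme`, `netloc`, `dir_1`, `dir_2`) more concrete, here is a minimal standard-library sketch of the idea for a single URL. The helper `split_url` is hypothetical; it is a simplified illustration and not how `adv.url_to_df` is implemented, which may return additional columns.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split one URL into scheme, netloc, path and numbered directories."""
    parts = urlsplit(url)
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
    }
    # Non-empty path segments become dir_1, dir_2, ... in order.
    segments = [segment for segment in parts.path.split("/") if segment]
    for position, segment in enumerate(segments, start=1):
        row[f"dir_{position}"] = segment
    return row


# Hypothetical URL, for illustration only.
example_row = split_url("https://www.example.com/blog/2023/post-title")
print(example_row)
```

Counting the values of `dir_1` across all URLs then shows which top-level sections hold most of the site's content.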

## Output

### Display results

#### Errors

In [34]:
from IPython.display import display

# The 'errors' column is only present when at least one sitemap failed
if 'errors' in sitemap:
    display(sitemap[sitemap['errors'].notnull()])
else:
    print('No errors found')

#### Duplicated URLs

In [33]:
duplicated = sitemap[sitemap['loc'].duplicated()]
if not duplicated.empty:
    display(duplicated)
else:
    print('No duplicated URLs found')

#### URL counts per sitemap and sitemap sizes

Each sitemap should contain a maximum of 50,000 URLs, and its uncompressed size should not exceed 50 MB.

URL counts:

In [32]:
adviz.value_counts_plus(sitemap['sitemap'], name='Sitemap URLs')

Sitemap sizes:

In [31]:
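# Keep one row per sitemap so each file's size is counted once, not once per URL
sitemap.drop_duplicates(subset='sitemap')['sitemap_size_mb'].describe().to_frame().T.style.format('{:,.2f}')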
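
The two limits stated above can also be checked programmatically rather than by reading the tables. The following is a rough sketch under stated assumptions: the helper `oversized_sitemaps` and the sample rows are hypothetical, and the input is assumed to be one `(sitemap, size_mb)` pair per listed URL, mirroring the `sitemap` and `sitemap_size_mb` columns.

```python
from collections import Counter

MAX_URLS = 50_000   # URL limit per sitemap file
MAX_SIZE_MB = 50    # size limit per sitemap file, in MB


def oversized_sitemaps(rows, max_urls=MAX_URLS, max_size_mb=MAX_SIZE_MB):
    """Return {sitemap: [reasons]} for sitemaps that exceed either limit.

    `rows` is an iterable of (sitemap, size_mb) pairs, one per listed URL.
    """
    url_counts = Counter()
    sizes = {}
    for sitemap_name, size_mb in rows:
        url_counts[sitemap_name] += 1
        sizes[sitemap_name] = size_mb
    problems = {}
    for sitemap_name, count in url_counts.items():
        reasons = []
        if count > max_urls:
            reasons.append(f"{count:,} URLs")
        if sizes[sitemap_name] > max_size_mb:
            reasons.append(f"{sizes[sitemap_name]:.2f} MB")
        if reasons:
            problems[sitemap_name] = reasons
    return problems


# Hypothetical data; tiny limits keep the example easy to follow.
example_rows = [("a.xml", 0.4), ("a.xml", 0.4), ("a.xml", 0.4), ("b.xml", 1.5)]
flagged = oversized_sitemaps(example_rows, max_urls=2, max_size_mb=1)
print(flagged)
```

With the notebook's data, the same helper could be called as `oversized_sitemaps(zip(sitemap['sitemap'], sitemap['sitemap_size_mb']))`, which uses the default limits.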

#### Count unique values of URL components

In [28]:
for col in ['scheme', 'netloc', 'dir_1', 'dir_2']:
    display(adviz.value_counts_plus(urldf[col], name=col))


#### Visualize the structure of the URLs

In [27]:
from urllib.parse import urlsplit
domain = urlsplit(sitemap_url).netloc

adviz.url_structure(
    urldf['url'].fillna(''),
    items_per_level=30,
    domain=domain,
    height=750,
    title=f'URL Structure: {domain} XML sitemap')