<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# Advertools - Analyze website content using XML sitemap
<a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/Advertools/Advertools_Analyze_website_content_using_XML_sitemap.ipynb" target="_parent"><img src="https://naasai-public.s3.eu-west-3.amazonaws.com/Open_in_Naas_Lab.svg"/></a><br><br><a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=&template=template-request.md&title=Tool+-+Action+of+the+notebook+">Template request</a> | <a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=bug&template=bug_report.md&title=Advertools+-+Analyze+website+content+using+XML+sitemap:+Error+short+description">Bug report</a> | <a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/Naas/Naas_Start_data_product.ipynb" target="_parent">Generate Data Product</a>

**Tags:** #advertools #xml #sitemap #website #analyze #seo

**Author:** [Elias Dabbas](https://www.linkedin.com/in/eliasdabbas/)

**Description:** This notebook helps you get an overview of a website's content by analyzing and visualizing its XML sitemap. It's also an important SEO audit process that can uncover some potential issues that might affect the website.

**References:**
- [advertools Sitemaps](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)
- [XML Sitemap](https://www.xml-sitemaps.com/)
- [Sitemaps Protocol](https://www.sitemaps.org/)

## Input

### Install pandas, advertools and adviz
This notebook pins pandas to version 1.5.3 and upgrades advertools and adviz to their latest releases.

In [1]:
!pip install pandas==1.5.3 --user
!pip install --upgrade advertools --user
!pip install --upgrade adviz --user



### Import libraries

In [2]:
import advertools as adv
import adviz
from urllib.parse import urlsplit

### Setup Variables
- `sitemap_url`: URL of the sitemap to analyze, which can be
    * The URL of an XML sitemap
    * The URL of an XML sitemapindex
    * The URL of a robots.txt file
    * Normal and zipped formats are supported
- `recursive`: If the URL points to a sitemapindex, whether all of its sub-sitemaps should also be downloaded, parsed and combined into one DataFrame.
- `max_workers`: Number of concurrent workers to fetch the sitemaps.
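
Since a robots.txt URL is one of the accepted inputs, the sketch below shows one way to derive it from any page URL when the sitemap location is unknown. The helper name `robots_txt_url` is hypothetical and not part of advertools; it only assumes that the file sits at the root of the host, which is the usual convention.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(any_url):
    """Build the robots.txt URL for the host of `any_url` (hypothetical helper)."""
    parts = urlsplit(any_url)
    # Keep scheme and host, replace the path, drop query and fragment
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


candidate = robots_txt_url("https://blog.sriniketh.design/tag/cloud")
print(candidate)
```

The resulting string could then be assigned to `sitemap_url` instead of a direct sitemap link.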

In [3]:
sitemap_url = "https://blog.sriniketh.design/sitemap.xml"
recursive = True
max_workers = 8

## Model

### Analyze website content using XML sitemap
Getting the sitemap(s)

In [4]:
sitemap = adv.sitemap_to_df(
    sitemap_url=sitemap_url,
    max_workers=max_workers,
    recursive=recursive
)
sitemap

2023-07-13 13:51:55,206 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://blog.sriniketh.design/sitemap.xml


Unnamed: 0,loc,changefreq,priority,lastmod,sitemap,sitemap_size_mb,download_date
0,https://blog.sriniketh.design,always,1.0,2023-04-05 19:18:36.914000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
1,https://blog.sriniketh.design/getting-credentials-from-gcp-google-cloud-platform,daily,0.8,2023-04-05 19:18:36.914000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
2,https://blog.sriniketh.design/send-sms-using-twilio-for-g-calendar-events-using-naas-template,daily,0.8,2023-04-12 08:38:25.759000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
3,https://blog.sriniketh.design/when-you-are-stuck-as-a-developer-what-do-you-do,daily,0.8,2022-09-15 22:01:40.656000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
4,https://blog.sriniketh.design/i-want-to-build-a-developer-portfolio-because,daily,0.8,2022-10-07 16:31:18.886000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
5,https://blog.sriniketh.design/what-have-been-the-most-helpful-online-tools-to-self-improve-as-a-developer,daily,0.8,2022-09-06 12:53:35.540000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
6,https://blog.sriniketh.design/what-made-you-want-to-be-a-developer,daily,0.8,2022-08-20 15:07:01.041000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
7,https://blog.sriniketh.design/sawo-latest-features,daily,0.8,2022-07-14 18:14:45.292000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
8,https://blog.sriniketh.design/sentiment-analysis-using-python,daily,0.8,2021-10-31 08:48:52.171000+00:00,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00
9,https://blog.sriniketh.design/tag/cloud,always,1.0,NaT,https://blog.sriniketh.design/sitemap.xml,0.004296,2023-07-13 11:51:55.211205+00:00


Split URLs into their components for further analysis/understanding

In [5]:
urldf = adv.url_to_df(sitemap['loc'])
urldf

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,last_dir
0,https://blog.sriniketh.design,https,blog.sriniketh.design,,,,,,
1,https://blog.sriniketh.design/getting-credentials-from-gcp-google-cloud-platform,https,blog.sriniketh.design,/getting-credentials-from-gcp-google-cloud-platform,,,getting-credentials-from-gcp-google-cloud-platform,,getting-credentials-from-gcp-google-cloud-platform
2,https://blog.sriniketh.design/send-sms-using-twilio-for-g-calendar-events-using-naas-template,https,blog.sriniketh.design,/send-sms-using-twilio-for-g-calendar-events-using-naas-template,,,send-sms-using-twilio-for-g-calendar-events-using-naas-template,,send-sms-using-twilio-for-g-calendar-events-using-naas-template
3,https://blog.sriniketh.design/when-you-are-stuck-as-a-developer-what-do-you-do,https,blog.sriniketh.design,/when-you-are-stuck-as-a-developer-what-do-you-do,,,when-you-are-stuck-as-a-developer-what-do-you-do,,when-you-are-stuck-as-a-developer-what-do-you-do
4,https://blog.sriniketh.design/i-want-to-build-a-developer-portfolio-because,https,blog.sriniketh.design,/i-want-to-build-a-developer-portfolio-because,,,i-want-to-build-a-developer-portfolio-because,,i-want-to-build-a-developer-portfolio-because
5,https://blog.sriniketh.design/what-have-been-the-most-helpful-online-tools-to-self-improve-as-a-developer,https,blog.sriniketh.design,/what-have-been-the-most-helpful-online-tools-to-self-improve-as-a-developer,,,what-have-been-the-most-helpful-online-tools-to-self-improve-as-a-developer,,what-have-been-the-most-helpful-online-tools-to-self-improve-as-a-developer
6,https://blog.sriniketh.design/what-made-you-want-to-be-a-developer,https,blog.sriniketh.design,/what-made-you-want-to-be-a-developer,,,what-made-you-want-to-be-a-developer,,what-made-you-want-to-be-a-developer
7,https://blog.sriniketh.design/sawo-latest-features,https,blog.sriniketh.design,/sawo-latest-features,,,sawo-latest-features,,sawo-latest-features
8,https://blog.sriniketh.design/sentiment-analysis-using-python,https,blog.sriniketh.design,/sentiment-analysis-using-python,,,sentiment-analysis-using-python,,sentiment-analysis-using-python
9,https://blog.sriniketh.design/tag/cloud,https,blog.sriniketh.design,/tag/cloud,,,tag,cloud,cloud
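
To make the columns above easier to read, here is a minimal standard-library sketch of how one URL can be broken into the same kind of components. It is an illustration only, not the implementation used by `adv.url_to_df`; the function name `split_url` is hypothetical.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split one URL into components similar to the columns shown above."""
    parts = urlsplit(url)
    # Non-empty path segments become dir_1, dir_2, ...
    dirs = [segment for segment in parts.path.split("/") if segment]
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    for level, segment in enumerate(dirs, start=1):
        row[f"dir_{level}"] = segment
    # The last directory is None for the home page, which has no path
    row["last_dir"] = dirs[-1] if dirs else None
    return row


tag_page = split_url("https://blog.sriniketh.design/tag/cloud")
home_page = split_url("https://blog.sriniketh.design")
print(tag_page["dir_1"], tag_page["dir_2"], tag_page["last_dir"])
```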


## Output

### Display results

#### Errors

In [6]:
# Import up front so that later cells can also call display()
from IPython.display import display

if 'errors' in sitemap:
    display(sitemap[sitemap['errors'].notnull()])
else:
    print('No errors found')

No errors found


#### Duplicated URLs

In [7]:
duplicated = sitemap[sitemap['loc'].duplicated()]
if not duplicated.empty:
    display(duplicated)
else:
    print('No duplicated URLs found')

No duplicated URLs found
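
Note that `duplicated()` with its default settings flags only the second and later occurrences of a value. If you want to inspect every copy of a repeated URL side by side, `keep=False` can be used, as in this small sketch. The URLs are made-up placeholders, not data from the sitemap above.

```python
import pandas as pd

# Placeholder URLs for illustration only
locs = pd.Series(
    [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
    name="loc",
)

# Default: only the repeat (index 2) is flagged
later_copies = locs[locs.duplicated()]

# keep=False: both occurrences (index 0 and 2) are flagged
all_copies = locs[locs.duplicated(keep=False)]

print(later_copies.index.tolist(), all_copies.index.tolist())
```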


#### URL counts per sitemap and sitemap sizes

Each sitemap should contain a maximum of 50,000 URLs, and its size should not exceed 50MB.

URL counts:

In [8]:
adviz.value_counts_plus(sitemap['sitemap'], name='Sitemap URLs')

Unnamed: 0,Sitemap URLs,count,cum. count,%,cum. %
1,https://blog.sriniketh.design/sitemap.xml,31,31,100.0%,100.0%


Sitemap sizes (MB):

In [9]:
sitemap['sitemap_size_mb'].describe().to_frame().T.style.format('{:,.2f}')

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sitemap_size_mb,31.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
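
The two limits mentioned above can also be checked programmatically. The following is a minimal sketch, assuming a DataFrame with the `sitemap`, `loc` and `sitemap_size_mb` columns seen in the earlier output. The function name `check_sitemap_limits`, the constants, and the demo rows are hypothetical and only serve the illustration.

```python
import pandas as pd

MAX_URLS = 50_000   # URL limit per sitemap, as stated above
MAX_SIZE_MB = 50    # size limit per sitemap, as stated above


def check_sitemap_limits(sitemap_df):
    """Summarize URL count and size per sitemap and flag limit violations."""
    summary = sitemap_df.groupby("sitemap").agg(
        url_count=("loc", "count"),
        size_mb=("sitemap_size_mb", "max"),
    )
    summary["too_many_urls"] = summary["url_count"] > MAX_URLS
    summary["too_large"] = summary["size_mb"] > MAX_SIZE_MB
    return summary


# Made-up demo rows, not real sitemap data
demo = pd.DataFrame(
    {
        "loc": ["https://example.com/a", "https://example.com/b"],
        "sitemap": ["https://example.com/sitemap.xml"] * 2,
        "sitemap_size_mb": [0.004, 0.004],
    }
)
report = check_sitemap_limits(demo)
print(report)
```

In the notebook itself, the same function could be called on the `sitemap` DataFrame created earlier.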


#### Count unique values of URL components

In [10]:
for col in ['scheme', 'netloc', 'dir_1', 'dir_2', 'dir_3']:
    try:
        display(adviz.value_counts_plus(urldf[col], name=col))
    except KeyError:
        # Skip directory levels that this site's URLs do not have (e.g. no dir_3 column)
        continue

Unnamed: 0,scheme,count,cum. count,%,cum. %
1,https,31,31,100.0%,100.0%


Unnamed: 0,netloc,count,cum. count,%,cum. %
1,blog.sriniketh.design,31,31,100.0%,100.0%


Unnamed: 0,dir_1,count,cum. count,%,cum. %
1,tag,22,22,71.0%,71.0%
2,,1,23,3.2%,74.2%
3,getting-credentials-from-gcp-google-cloud-platform,1,24,3.2%,77.4%
4,send-sms-using-twilio-for-g-calendar-events-using-naas-template,1,25,3.2%,80.6%
5,i-want-to-build-a-developer-portfolio-because,1,26,3.2%,83.9%
6,when-you-are-stuck-as-a-developer-what-do-you-do,1,27,3.2%,87.1%
7,what-have-been-the-most-helpful-online-tools-to-self-improve-as-a-developer,1,28,3.2%,90.3%
8,what-made-you-want-to-be-a-developer,1,29,3.2%,93.5%
9,sawo-latest-features,1,30,3.2%,96.8%
10,sentiment-analysis-using-python,1,31,3.2%,100.0%


Unnamed: 0,dir_2,count,cum. count,%,cum. %
1,,9,9,29.0%,29.0%
2,cloud,1,10,3.2%,32.3%
3,tools,1,11,3.2%,35.5%
4,data-science,1,12,3.2%,38.7%
5,machine-learning,1,13,3.2%,41.9%
6,google-cloud,1,14,3.2%,45.2%
7,apis,1,15,3.2%,48.4%
8,developer,1,16,3.2%,51.6%
9,python3,1,17,3.2%,54.8%
10,portfolio,1,18,3.2%,58.1%


#### Visualize the structure of the URLs

In [11]:
domain = urlsplit(sitemap_url).netloc
try:
    fig = adviz.url_structure(
        urldf['url'].fillna(''),
        items_per_level=30,
        domain=domain,
        height=750,
        title=f'URL Structure: {domain} XML sitemap'
    )
    # A value produced inside a try block is not auto-displayed, so show it explicitly
    display(fig)
except Exception as e:
    print(str(e))