<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# Advertools - Audit robots.txt and XML sitemap potential issues

**Tags:** #advertools #xml #sitemap #website #analyze #seo #robots.txt

**Author:** [Elias Dabbas](https://www.linkedin.com/in/eliasdabbas/)

**Description:** This notebook checks for conflicts between your robots.txt rules and your XML sitemap, such as URLs that are listed in the sitemap but blocked by robots.txt.

* Are you disallowing URLs that you shouldn't?
* Test and make sure you don't publish new pages with such conflicts.
* Do this in bulk: run the tests for all URL/rule/user-agent combinations with one command.

**References:**
- [advertools robots.txt functions](https://advertools.readthedocs.io/en/master/advertools.robotstxt.html)
- [Google's robots reference](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)


## Input

### Import libraries

In [None]:
try:
    import advertools as adv
except ModuleNotFoundError:
    # Install the package if it is missing, then import it
    !pip install advertools
    import advertools as adv

### Setup Variables
- `robotstxt_url`: URL of the robots.txt file to convert to a `DataFrame`

In [None]:
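# Replace with the robots.txt URL of the site you want to audit
robotstxt_url = "https://www.example.com/robots.txt"

# Just for testing: this overrides the value above, remove it when auditing your own site
robotstxt_url = "https://www.google.com/robots.txt"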
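
If you only have a page URL at hand, the robots.txt location can be derived from it, because the file is served from the root of the host. The following is a minimal sketch using only the standard library; `robotstxt_url_for` is a hypothetical helper name and is not part of advertools.

```python
from urllib.parse import urlsplit, urlunsplit


def robotstxt_url_for(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robotstxt_url_for("https://www.example.com/blog/post?id=1"))
```

The result can be assigned to `robotstxt_url` in the cell above.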

## Model

### Analyze potential robots.txt and XML sitemap conflicts

Get the robots.txt file and convert it to a `DataFrame`.

In [None]:
robots_df = adv.robotstxt_to_df(robotstxt_url=robotstxt_url)
robots_df

2023-05-24 15:22:39,594 | INFO | robotstxt.py:381 | robotstxt_to_df | Getting: https://www.google.com/robots.txt


Unnamed: 0,directive,content,robotstxt_last_modified,robotstxt_url,download_date
0,User-agent,*,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
1,Disallow,/search,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
2,Allow,/search/about,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
3,Allow,/search/static,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
4,Allow,/search/howsearchworks,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
...,...,...,...,...,...
296,Disallow,/search,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
297,Disallow,/groups,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
298,Disallow,/hosted/images/,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
299,Disallow,/m/,2023-05-22 21:30:00,https://www.google.com/robots.txt,2023-05-24 13:22:39.746456+00:00
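
To make the `directive` and `content` columns more concrete, here is a rough sketch of how the lines of a robots.txt file map to such pairs. It is not how advertools implements `robotstxt_to_df`; the sample file and the `robotstxt_rows` helper are made up for illustration, and the comment handling is deliberately simplistic.

```python
SAMPLE_ROBOTSTXT = """\
# Hypothetical robots.txt used only for illustration
User-agent: *
Disallow: /search
Allow: /search/about

Sitemap: https://www.example.com/sitemap.xml
"""


def robotstxt_rows(text):
    """Split robots.txt text into (directive, content) pairs."""
    rows = []
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank or malformed line
        # Split on the first colon only, so URLs in the content stay intact.
        directive, content = line.split(":", 1)
        rows.append((directive.strip(), content.strip()))
    return rows


rows = robotstxt_rows(SAMPLE_ROBOTSTXT)
for row in rows:
    print(row)
```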


Get XML sitemap(s) and convert to a `DataFrame`.

In [None]:
sitemap = adv.sitemap_to_df(
    # the function will extract and combine all available sitemaps
    # in the robots.txt file
    robotstxt_url,
    max_workers=8,
    recursive=True)
sitemap

2023-05-24 15:22:46,730 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/docs/sitemaps.xml
2023-05-24 15:22:46,753 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/sheets/sitemaps.xml
2023-05-24 15:22:46,775 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/slides/sitemaps.xml
2023-05-24 15:22:46,813 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/forms/sitemaps.xml
2023-05-24 15:22:46,845 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/gmail/sitemap.xml
2023-05-24 15:22:46,879 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/get/sitemap.xml
2023-05-24 15:22:47,033 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/drive/sitemap.xml
2023-05-24 15:22:47,204 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.google.com/admob/sitemap.xml
2023-05-24 15:22:47,222 | INFO | sitemaps.py:536 | sitemap_to_df | Ge

Unnamed: 0,loc,changefreq,priority,sitemap,sitemap_last_modified,sitemap_size_mb,download_date,errors
0,https://www.google.com/docs/about/,Weekly,0.8,https://www.google.com/docs/sitemaps.xml,2020-05-25 08:30:00,0.392524,2023-05-24 13:22:46.736858+00:00,
1,https://www.google.com/intl/af/docs/about/,Weekly,0.8,https://www.google.com/docs/sitemaps.xml,2020-05-25 08:30:00,0.392524,2023-05-24 13:22:46.736858+00:00,
2,https://www.google.com/intl/id/docs/about/,Weekly,0.8,https://www.google.com/docs/sitemaps.xml,2020-05-25 08:30:00,0.392524,2023-05-24 13:22:46.736858+00:00,
3,https://www.google.com/intl/ms/docs/about/,Weekly,0.8,https://www.google.com/docs/sitemaps.xml,2020-05-25 08:30:00,0.392524,2023-05-24 13:22:46.736858+00:00,
4,https://www.google.com/intl/ca/docs/about/,Weekly,0.8,https://www.google.com/docs/sitemaps.xml,2020-05-25 08:30:00,0.392524,2023-05-24 13:22:46.736858+00:00,
...,...,...,...,...,...,...,...,...
46666,https://www.google.com/intl/it_ch/retail/solutions/manufacturer-center/,,,https://www.google.com/retail/sitemap.xml,2023-04-17 05:30:00,4.386189,2023-05-24 13:22:51.261837+00:00,
46667,https://www.google.com/intl/it_ch/retail/solutions/merchant-center/,,,https://www.google.com/retail/sitemap.xml,2023-04-17 05:30:00,4.386189,2023-05-24 13:22:51.261837+00:00,
46668,https://www.google.com/intl/it_ch/retail/solutions/not-available/,,,https://www.google.com/retail/sitemap.xml,2023-04-17 05:30:00,4.386189,2023-05-24 13:22:51.261837+00:00,
46669,https://www.google.com/intl/it_ch/retail/solutions/performance-max/,,,https://www.google.com/retail/sitemap.xml,2023-04-17 05:30:00,4.386189,2023-05-24 13:22:51.261837+00:00,
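
For intuition about where columns such as `loc`, `changefreq`, and `priority` come from, the sketch below reads the `<url>` entries of a small, made-up sitemap with the standard library. It only illustrates the sitemap format; `sitemap_rows` is a hypothetical helper, and `sitemap_to_df` does considerably more (sitemap indexes, downloads, metadata columns).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.example.com/docs/about/</loc>
  </url>
</urlset>"""


def sitemap_rows(xml_text):
    """Return one dict per <url> entry; missing tags become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", SITEMAP_NS):
        rows.append({
            tag: url.findtext(f"sm:{tag}", default=None, namespaces=SITEMAP_NS)
            for tag in ("loc", "changefreq", "priority")
        })
    return rows


sitemap_sample = sitemap_rows(SAMPLE_SITEMAP)
for row in sitemap_sample:
    print(row)
```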


#### Testing robots.txt
For all URL/user-agent combinations, check whether the URL is blocked.

In [None]:
user_agents = robots_df[robots_df['directive'].str.contains('user-agent', case=False)]['content']
user_agents

0                        *
281          AdsBot-Google
288             Twitterbot
294    facebookexternalhit
Name: content, dtype: object
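
The same selection can be mirrored in plain Python. The sketch below assumes a list of (directive, content) pairs with made-up values, matches the directive name case-insensitively, and removes repeated user agents while keeping their order. `unique_user_agents` is a hypothetical helper; deduplicating is a suggestion to avoid testing the same combination twice, not something the notebook requires.

```python
# Hypothetical (directive, content) pairs, as they might appear in a robots.txt file.
sample_rows = [
    ("User-agent", "*"),
    ("Disallow", "/search"),
    ("user-agent", "ExampleBot"),
    ("Disallow", "/"),
    ("USER-AGENT", "ExampleBot"),
    ("Disallow", "/private/"),
]


def unique_user_agents(rows):
    """Collect user agents in order of first appearance, without repeats."""
    seen = {}
    for directive, content in rows:
        # Directive names are matched regardless of letter case.
        if directive.lower() == "user-agent":
            seen.setdefault(content, None)
    return list(seen)


print(unique_user_agents(sample_rows))
```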

In [None]:
robots_report = adv.robotstxt_test(
    robotstxt_url=robotstxt_url,
    user_agents=user_agents,
    urls=sitemap['loc'].dropna())
robots_report

Unnamed: 0,robotstxt_url,user_agent,url_path,can_fetch
0,https://www.google.com/robots.txt,*,https://www.google.ad/,True
1,https://www.google.com/robots.txt,*,https://www.google.ae/,True
2,https://www.google.com/robots.txt,*,https://www.google.am/,True
3,https://www.google.com/robots.txt,*,https://www.google.as/,True
4,https://www.google.com/robots.txt,*,https://www.google.at/,True
...,...,...,...,...
186671,https://www.google.com/robots.txt,facebookexternalhit,https://www.google.tn/,True
186672,https://www.google.com/robots.txt,facebookexternalhit,https://www.google.to/,True
186673,https://www.google.com/robots.txt,facebookexternalhit,https://www.google.tt/,True
186674,https://www.google.com/robots.txt,facebookexternalhit,https://www.google.vu/,True
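
To spot-check individual rows of such a report, the idea of testing every user-agent/URL combination can be sketched with the standard library's `urllib.robotparser`. This is not what `robotstxt_test` does internally; the robots.txt content, the `ExampleBot` name, and the `check_combinations` helper are assumptions for illustration. Note that `urllib.robotparser` applies the first matching rule in file order, while Google documents a most-specific-rule approach, so results can differ from other tools when `Allow` and `Disallow` rules overlap.

```python
from itertools import product
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for illustration.
SAMPLE_RULES = """\
User-agent: *
Allow: /search/about
Disallow: /search

User-agent: ExampleBot
Disallow: /
"""


def check_combinations(robotstxt_text, user_agents, urls):
    """Return one row per user-agent/URL pair with its can_fetch result."""
    parser = RobotFileParser()
    parser.parse(robotstxt_text.splitlines())
    return [
        {"user_agent": ua, "url": url, "can_fetch": parser.can_fetch(ua, url)}
        for ua, url in product(user_agents, urls)
    ]


sample_report = check_combinations(
    SAMPLE_RULES,
    user_agents=["*", "ExampleBot"],
    urls=[
        "https://www.example.com/search/about/",
        "https://www.example.com/search/results",
        "https://www.example.com/docs/",
    ],
)
for row in sample_report:
    print(row)
```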


Does Google list URLs in its XML sitemaps that are also disallowed by its robots.txt?

This is not necessarily a problem, because a URL might be disallowed only for some user-agents, but it's good to check.

## Output

In [None]:
robots_report[~robots_report['can_fetch']]

Unnamed: 0,robotstxt_url,user_agent,url_path,can_fetch
139889,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/about/,False
139890,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/,False
139891,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/algorithms/,False
139892,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/crawling-indexing/,False
139893,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/mission/,False
139894,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/mission/open-web/,False
139895,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/mission/site-owners/,False
139896,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/mission/web-users/,False
139897,https://www.google.com/robots.txt,Twitterbot,https://www.google.com/search/howsearchworks/responses/,False
186558,https://www.google.com/robots.txt,facebookexternalhit,https://www.google.com/search/about/,False
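
Because a URL blocked for one user agent may be fine for others, it can help to group the blocked rows by user agent and to single out URLs that no tested user agent can fetch. The sketch below works on a made-up list of report rows rather than on `robots_report` itself; the helper names and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical report rows, shaped loosely like the output above.
sample_rows = [
    {"user_agent": "*", "url": "https://www.example.com/docs/", "can_fetch": True},
    {"user_agent": "*", "url": "https://www.example.com/search/about/", "can_fetch": True},
    {"user_agent": "*", "url": "https://www.example.com/internal/", "can_fetch": False},
    {"user_agent": "ExampleBot", "url": "https://www.example.com/docs/", "can_fetch": True},
    {"user_agent": "ExampleBot", "url": "https://www.example.com/search/about/", "can_fetch": False},
    {"user_agent": "ExampleBot", "url": "https://www.example.com/internal/", "can_fetch": False},
]


def blocked_by_user_agent(rows):
    """Map each user agent to the URLs it is not allowed to fetch."""
    blocked = defaultdict(list)
    for row in rows:
        if not row["can_fetch"]:
            blocked[row["user_agent"]].append(row["url"])
    return dict(blocked)


def blocked_for_all(rows):
    """Return URLs that none of the tested user agents can fetch."""
    fetchable = {row["url"] for row in rows if row["can_fetch"]}
    return sorted({row["url"] for row in rows} - fetchable)


print(blocked_by_user_agent(sample_rows))
print(blocked_for_all(sample_rows))
```

URLs returned by `blocked_for_all` are the stronger candidates for a real sitemap/robots.txt conflict, since no tested crawler can reach them.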
