<a href="https://colab.research.google.com/github/ranksense/Twittorials/blob/master/robots_txt_Sitemap_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🐍🔥
#  Is your `robots.txt` file blocking any of the URLs in your sitemap?  

We know this generally shouldn't happen, but we also know that it might!  

Is it correctly blocking certain URLs for certain User-agents, or is there an error somewhere?  
This is a quick testing tool that lets you check exactly that.

The plan: 

1. Specify a `robots.txt` file URL
2. Extract and select one of the available sitemaps
3. Download the sitemap
4. Extract the `User-agent`s
5. For each (User-agent, URL) combination, check whether the user-agent can fetch the URL
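
The five steps can be tried offline with Python's standard library before running the notebook against a live site. This is only an illustration of the idea, not how `advertools` does it: the `robots_txt` text, the `example.com` URLs, and the hard-coded agent list are all made up. Note also that `urllib.robotparser` applies the first matching rule, so its answers can differ from other parsers on files with overlapping rules.

```python
from itertools import product
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not from a real site).
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

# Hypothetical URLs standing in for the sitemap's <loc> values.
urls = [
    "https://example.com/",
    "https://example.com/private/report.html",
]
user_agents = ["*", "BadBot"]

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# One row per (user-agent, URL) combination.
report = [
    {"user_agent": agent, "url": url, "can_fetch": parser.can_fetch(agent, url)}
    for agent, url in product(user_agents, urls)
]

for row in report:
    print(row)
```

Each row answers one question: may this user-agent fetch this URL? The notebook produces the same kind of table, but from a live `robots.txt` file and a real sitemap.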


In [None]:
%%capture
!pip install advertools

Import the required packages, and display their version numbers:

In [None]:
import advertools as adv
import pandas as pd
pd.options.display.max_columns = None
for p in [adv, pd]:
    print(f'{p.__name__:<13}', 'v' + p.__version__)

In [None]:
#@title Please enter a robots.txt URL e.g. https://www.nytimes.com/robots.txt
robotstxt_url = "" #@param {type:"string"}


In [None]:
robotstxt_df = adv.robotstxt_to_df(robotstxt_url)
robotstxt_df
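
The filters in the following cells rely on the `directive` and `content` columns of `robotstxt_df`. As a rough illustration of that two-column shape (not the parser `advertools` actually uses), a `robots.txt` file can be split into (directive, content) pairs with the standard library alone. The `robots_txt` text and the `parse_directives` helper are hypothetical.

```python
# Hypothetical robots.txt content (not from a real site).
robots_txt = """\
# Example file
User-agent: *
Disallow: /private/

user-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_directives(text):
    """Split robots.txt text into (directive, content) pairs.

    Hypothetical helper: skips blank lines and comments, and splits
    each remaining line on the first colon only, so URLs stay intact.
    """
    pairs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        directive, content = line.split(":", 1)
        pairs.append((directive.strip(), content.strip()))
    return pairs


pairs = parse_directives(robots_txt)

# Directive names are case-insensitive, so compare in lower case.
found_sitemaps = [c for d, c in pairs if d.lower() == "sitemap"]
found_agents = [c for d, c in pairs if d.lower() == "user-agent"]

print(found_sitemaps)
print(found_agents)
```

Comparing in lower case matters because site owners write `User-agent`, `user-agent`, and `USER-AGENT` interchangeably.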

#### Extract sitemaps from `robots.txt`

In [None]:
sitemaps = (robotstxt_df                             # take the robotstxt_df DataFrame
            [robotstxt_df['directive']               # look at its "directive" column
             .str.contains('^sitemap$', case=False)] # keep rows whose directive is exactly "sitemap" (case-insensitive)
            ['content']                              # now select the "content" column
            .tolist())                               # convert it to a list

sitemaps

In [None]:
#@title Enter one of the sitemaps extracted above (if none exist, try to find one on the website). If you provide a sitemap index, all of its sub-sitemaps will be downloaded, which might take a long time on large websites. Try a regular sitemap for faster results.
sitemap = "" #@param {type:"string"}


#### Download selected sitemap

In [None]:
sitemap_df = adv.sitemap_to_df(sitemap)
print('Sitemap rows:', sitemap_df.shape[0])
print('Sitemap columns:', sitemap_df.shape[1])
sitemap_df.head()
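
For readers curious about what the download step extracts, the sketch below pulls the `<loc>` values out of a small, made-up sitemap using only the standard library. It assumes a regular `<urlset>` sitemap in the sitemaps.org namespace. A sitemap index uses `<sitemapindex>` and lists further sitemaps instead of pages, which is why it takes longer to process. The XML and the `example.com` URLs are hypothetical, and this is not how `adv.sitemap_to_df` is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content (not from a real site).
sitemap_xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/private/report.html</loc>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(sitemap_xml.encode("utf-8"))

# Collect the text of every <loc> element, in document order.
locs = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]
print(locs)
```

These `<loc>` values correspond to the `loc` column that the report step reads from `sitemap_df`.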

#### Extract `User-agent`s

In [None]:
user_agents = (robotstxt_df                                  # take the robotstxt_df DataFrame
               [robotstxt_df['directive']                    # look at its "directive" column
                .str.contains('^user-agent$', case=False)]   # keep rows whose directive is exactly "user-agent" (case-insensitive)
               ['content']                                   # now select the "content" column
               .drop_duplicates()                            # test each user-agent only once
               .tolist())                                    # convert it to a list
user_agents

#### Run the report

In [None]:
robots_test_report = adv.robotstxt_test(robotstxt_url, user_agents, sitemap_df['loc'])
robots_test_report
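
With many URLs and several user-agents, the full report can be long. One way to get an overview is to count blocked and allowed URLs per user-agent and then list the blocked ones. The sketch below uses a small hand-made DataFrame in place of `robots_test_report`. The `user_agent`, `url_path`, and `can_fetch` column names are assumed to match the report's columns, so check them against your own output before reusing the code.

```python
import pandas as pd

# Hand-made stand-in for robots_test_report (hypothetical data;
# the column names are an assumption about the real report).
report = pd.DataFrame({
    "user_agent": ["*", "*", "BadBot", "BadBot"],
    "url_path": [
        "https://example.com/",
        "https://example.com/private/report.html",
        "https://example.com/",
        "https://example.com/private/report.html",
    ],
    "can_fetch": [True, False, False, False],
})

# Count allowed vs. blocked URLs for each user-agent.
summary = (report
           .groupby(["user_agent", "can_fetch"])
           .size()
           .unstack(fill_value=0))
print(summary)

# Sitemap URLs that at least one user-agent cannot fetch.
blocked = report[~report["can_fetch"]]
print(blocked)
```

Any row in `blocked` is a URL that the sitemap advertises but `robots.txt` disallows for that user-agent, which is the conflict this notebook sets out to find.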

### Homework

Now that you have parsed the XML sitemap and extracted all of its URLs, how can you analyze those URLs further?

Hint: `adv.url_to_df`
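
As a starting point, and without using the hinted function, here is one kind of breakdown you might aim for: counting URLs by their first directory. This sketch relies only on the standard library. The URLs and the `first_directory` helper are hypothetical, and the function from the hint may give you richer results with less code.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical URLs standing in for sitemap_df['loc'].
urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/shop/item?id=7",
]


def first_directory(url):
    """Return the first path segment of a URL, or '' for the root."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0] if segments else ""


# How many URLs fall under each top-level directory?
counts = Counter(first_directory(u) for u in urls)
print(counts.most_common())
```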