# 11. Data Wrangling With Screaming Frog

------------------------------------------------

## Learning Outcomes

- To learn how to automate Screaming Frog's command line commands with Python
- To learn how to wrangle 5 .csv files from Screaming Frog with Pandas
- To learn how to push the data into a BigQuery table
- To learn how to connect your BigQuery table to Google Data Studio

------------------------------------------------------

In the last tutorial, you learned how to [easily automate Screaming Frog on the command line for either Mac or Windows](https://sempioneer.com/python-for-seo/screaming-frog-automation/).

In this section we'll be focusing on automating the previous terminal commands with Python.

Then we'll wrangle the .csv data with Pandas, push it into [BigQuery](https://cloud.google.com/bigquery) and finally view it in [Google Data Studio](https://datastudio.google.com/u/0/navigation/reporting).

---------------------------------------------------------------

## Module Imports

References:
    
- https://docs.python.org/3/library/subprocess.html
- https://www.jstorimer.com/blogs/workingwithcode/7766119-when-to-use-stderr-instead-of-stdout
- https://pandas.pydata.org/
- https://www.vervesearch.com/blog/screaming-frog-google-compute-cloud-automatically-crawl-an-entire-industry-fast/

------------------------------------------------------------

<strong> Scripts To Refactor:</strong>
- https://www.vervesearch.com/screaming-frog-files/scream.py
- https://www.vervesearch.com/screaming-frog-files/auto-ssh.py
- https://raw.githubusercontent.com/skywind3000/terminal/master/terminal.py
- https://www.vervesearch.com/blog/compare-screaming-frog-crawl-files/
- https://github.com/skywind3000/terminal/

-----------------------------------------------------------------------------------------------

<strong> To Research: </strong>
    
- What does os.system do? (The subprocess module is now the recommended replacement.)

In [1]:
!pip install pandas



In [572]:
import os
import subprocess
import pandas as pd
import re
from datetime import datetime
from sys import platform

----------------------------------------------------------------------------------------

## How To Run The Command Line In Python

In this section we'll be running Linux shell commands from Python:

In [3]:
process = subprocess.run("ls", shell=True, check=True, capture_output=True)
print(process)

CompletedProcess(args='ls', returncode=0, stdout=b'data-wrangling-screaming-frog.ipynb\n', stderr=b'')


In [4]:
print(f"This is the return code of the subprocess: {process.returncode}")

This is the return code of the subprocess: 0


- Typically, a <strong>return code of 0 means that the command ran successfully.</strong>
- Also notice how the output of the command is captured in stdout (standard output), while any error messages are captured in stderr (standard error).
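
To see the return code, stdout and stderr side by side, here is a minimal sketch that runs a small child process and inspects each one. It uses `sys.executable` (the current Python interpreter) as a stand-in command so that it behaves the same on Mac, Linux and Windows; that choice is an assumption for illustration and is not part of the Screaming Frog workflow.

```python
import subprocess
import sys

# A tiny child program that writes to both pipes and then exits with code 3.
child_code = (
    "import sys; "
    "print('hello from stdout'); "
    "print('hello from stderr', file=sys.stderr); "
    "sys.exit(3)"
)

# Passing a list of arguments avoids the shell entirely.
# capture_output=True collects stdout and stderr as bytes.
result = subprocess.run([sys.executable, "-c", child_code], capture_output=True)

# The pipes hold bytes, so decode them before working with the text.
stdout_text = result.stdout.decode("utf-8").strip()
stderr_text = result.stderr.decode("utf-8").strip()

print("Return code:", result.returncode)
print("stdout:", stdout_text)
print("stderr:", stderr_text)
```

A non-zero return code does not raise an exception by default. Passing `check=True`, as in the `ls` example above, makes `subprocess.run` raise `CalledProcessError` instead.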

----------------------------------------------------------------------

## How To Run The Screaming Frog Command Line Scripts With Python

Let's extract our username, website, output location and Screaming Frog application and put them into variables:

In [5]:
!pwd

/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/11_data_wrangling_screaming_frog


In [25]:
username = 'jamesaphoenix'
website = 'https://phoenixandpartners.co.uk/'
output_location = '/users/jamesaphoenix/desktop'
# A raw string (r'...') keeps the backslashes that escape the spaces for the shell:
screaming_frog_app = r'/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher'

In [7]:
print("Username:", username, "\nWebsite:", website, "\nOutputLocation:", output_location,
      "\nScreamingFrogLocation:", screaming_frog_app)

Username: jamesaphoenix 
Website: https://phoenixandpartners.co.uk/ 
OutputLocation: /users/jamesaphoenix/desktop 
ScreamingFrogLocation: /Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher


------------------------------------------

Now let's create a couple of Screaming Frog string commands that we'll push into subprocess commands:

In [292]:
screaming_frog_open = "/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher"
screaming_frog_crawl=f'{screaming_frog_app} --headless --save-crawl --output-folder {output_location} --timestamped-output --crawl phoenixandpartners.co.uk'

In [294]:
screaming_frog_crawl

'/Applications/Screaming\\ Frog\\ SEO\\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl --output-folder /users/jamesaphoenix/desktop --timestamped-output --crawl phoenixandpartners.co.uk'

---------------------------------------------------------------

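Also notice how we've used an f-string for the screaming_frog_crawl variable, which means:

- The Screaming Frog launcher path will be passed into this text string instead of {screaming_frog_app}.
- /users/jamesaphoenix/desktop will be passed into this text string instead of {output_location}.

The domain to crawl (phoenixandpartners.co.uk) is typed directly into the string here; it could be passed in with a variable in the same way.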
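
As a small illustration of f-string substitution, the sketch below builds a similar command from variables. It uses `shlex.quote` from the standard library to escape the path that contains spaces, rather than hand-written backslashes. That swap is an assumption for illustration (it applies to POSIX shells such as those on Mac and Linux), and the variable values are copied from the cells above.

```python
import shlex

app_path = "/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher"
output_location = "/users/jamesaphoenix/desktop"
domain = "phoenixandpartners.co.uk"

# shlex.quote wraps a value in single quotes when it contains characters
# (such as spaces) that a POSIX shell would otherwise split on.
crawl_command = (
    f"{shlex.quote(app_path)} --headless --save-crawl "
    f"--output-folder {shlex.quote(output_location)} "
    f"--timestamped-output --crawl {shlex.quote(domain)}"
)

print(crawl_command)

# shlex.split reverses the quoting, which is a handy sanity check.
arguments = shlex.split(crawl_command)
print(arguments)
```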

In [323]:
print(screaming_frog_crawl)

/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl --output-folder /users/jamesaphoenix/desktop --timestamped-output --crawl phoenixandpartners.co.uk


------------------------------------------------------------------------------------------------

Let's now run them one by one:

In [295]:
open_sf = subprocess.run(screaming_frog_open)

This command should open Screaming Frog. The subprocess keeps running until the window is closed.

So close Screaming Frog to let the cell finish.

In [296]:
print(open_sf)

CompletedProcess(args='/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher', returncode=0)


We can see that the return code was 0!
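
`subprocess.run` blocks until the program exits, which is why the cell only finishes once the window is closed. If you would rather carry on working while the program runs, `subprocess.Popen` returns straight away. A minimal sketch, using a short-lived Python child process as a stand-in for the Screaming Frog launcher (an assumption so that the example runs anywhere):

```python
import subprocess
import sys

# Stand-in for a long-running desktop application: sleep for one second, then exit.
child = [sys.executable, "-c", "import time; time.sleep(1)"]

# Popen starts the process and returns immediately.
process = subprocess.Popen(child)

# poll() returns None while the process is still running.
still_running = process.poll() is None
print("Still running straight after launch:", still_running)

# wait() blocks until the process exits and returns its return code.
return_code = process.wait()
print("Return code after waiting:", return_code)
```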

------------------------------------------------------------------------------

In [297]:
screaming_frog=subprocess.run(screaming_frog_crawl, 
               shell=True, 
               capture_output=True)

In [298]:
screaming_frog_crawl

'/Applications/Screaming\\ Frog\\ SEO\\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl --output-folder /users/jamesaphoenix/desktop --timestamped-output --crawl phoenixandpartners.co.uk'

![](https://sempioneer.com/wp-content/uploads/2020/06/screaming-frog-1.png)

------------------------------------

### How To Find The Outputted Folder Name

As well as saving the crawl, we can parse the standard output pipe (stdout) and obtain the name of the timestamped folder:

In [43]:
dir(screaming_frog)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'args',
 'check_returncode',
 'returncode',
 'stderr',
 'stdout']

The stdout attribute is returned as bytes, so we decode it to turn all of the console messages into a string:

In [126]:
text = screaming_frog.stdout.decode('utf-8')
print(text[0:10])

2020-06-25


Now let's search for the timestamped folder name, which sits:

- after this: <strong> Output directory: </strong>
- before this: <strong> \n </strong> (the newline character)

![](https://sempioneer.com/wp-content/uploads/2020/06/output-directory.png)

In [391]:
timestamp = re.findall('(?<=Output directory:)(.*?)(?=\n)', 
                       str(text))

correct_folder = timestamp[0].strip()
print(f"This is the timestamped output folder: {correct_folder}")

This is the timestamped output folder: /users/jamesaphoenix/desktop/2020.06.25.15.01.49
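
The `timestamp[0]` lookup above raises an `IndexError` when the label is missing from the text. A slightly more defensive sketch is shown below, assuming the same `Output directory:` label. The sample text is a made-up placeholder rather than real Screaming Frog output, and `find_output_directory` is a hypothetical helper name.

```python
import re

# Placeholder text: only the "Output directory:" line mirrors the output shown above.
sample_stdout = (
    "placeholder line before\n"
    "Output directory: /users/jamesaphoenix/desktop/2020.06.25.15.01.49\n"
    "placeholder line after\n"
)

def find_output_directory(text):
    """Return the path after 'Output directory:', or None when the label is missing."""
    # '.' does not match a newline, so the capture stops at the end of the line.
    match = re.search(r"Output directory:[ \t]*(.+)", text)
    return match.group(1).strip() if match else None

found = find_output_directory(sample_stdout)
missing = find_output_directory("no directory in this text")

print(found)
print(missing)
```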


------------------------------------------------------------------------

We can also check what folders are in our current working directory with:

~~~

os.listdir()

~~~

In [139]:
os.chdir('/Users/jamesaphoenix/Desktop') # This changes the directory into the desktop
os.listdir()

['Music',
 'Screaming Frog - Data Manipulation.ipynb',
 'ngrok',
 '2020.06.25.15.01.49',
 '.DS_Store',
 '.localized',
 'config',
 'Coding_Marketing_Projects',
 'Google_cloud-sdk',
 'screaming-frog-remotedesktop-image.vmdk',
 'Screenshot 2020-06-25 at 15.07.17.png',
 'Extracting Schema At Bulk.ipynb',
 'Scripts_and_keys',
 'YouTube SEO.jpg',
 'Data_Science_Resources',
 'Screenshot 2020-06-25 at 15.07.17 (2).png',
 'Marketing',
 'Sort Through These',
 'Atom.app',
 'Math Textbooks',
 '.ipynb_checkpoints',
 'Client_Projects',
 'Imran_And_James',
 'General_Assembly',
 'layered_architecture.png',
 'Postman.app',
 'message.png']

![](https://sempioneer.com/wp-content/uploads/2020/06/output-of-directory.png)

-----------------------------------------------------------------------------

Another way to capture the relevant folder would be to:
    
1. Get today's date.
2. Only return folders that include today's date.

In [153]:
now = datetime.now()
todays_date = now.strftime("%Y.%m.%d")
print(todays_date)

2020.06.25


In [157]:
screaming_frog_folders = [file for file in os.listdir() if todays_date in file]
print(screaming_frog_folders)

['2020.06.25.15.01.49']
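
Matching on today's date returns every crawl made today, and it would also match any other file whose name happens to contain the date. If only the most recent crawl is needed, one option is to parse the full folder name as a timestamp. The sketch below assumes the `YYYY.MM.DD.HH.MM.SS` folder naming shown above; `parse_timestamp` and `latest_crawl_folder` are hypothetical helper names, and the demonstration runs in a temporary directory.

```python
import os
import tempfile
from datetime import datetime

def parse_timestamp(folder_name):
    """Return a datetime for names like '2020.06.25.15.01.49', or None otherwise."""
    try:
        return datetime.strptime(folder_name, "%Y.%m.%d.%H.%M.%S")
    except ValueError:
        return None

def latest_crawl_folder(parent):
    """Return the name of the newest timestamped sub-folder in parent, or None."""
    candidates = [
        name for name in os.listdir(parent)
        if os.path.isdir(os.path.join(parent, name)) and parse_timestamp(name)
    ]
    return max(candidates, key=parse_timestamp, default=None)

# Demonstration in a throwaway directory with example folder names.
with tempfile.TemporaryDirectory() as parent:
    for name in ["2020.06.25.15.01.49", "2020.06.25.23.09.50", "Music"]:
        os.mkdir(os.path.join(parent, name))
    newest = latest_crawl_folder(parent)

print(newest)
```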


------------------------------------------

## Enhancing Our Screaming Frog CLI Automation With Classes

Running the string command with subprocess is an improvement over having to open a terminal and enter the commands manually.

But let's take it a step further and create a Python class called 🐸 <strong> ScreamingFrogAnalyser </strong> 🐸

---------------------------------------

<strong>To Do:</strong>
    
- Run a standard Screaming Frog crawl
- Allow a specific folder to be the output folder
    
- Add the ability to:
    - Add reports
    - Bulk export
    
- Parse the newly created output directory
- Add an option to run Screaming Frog headless or not

---------------------------------------------------------------

In [249]:
output_location

'/users/jamesaphoenix/desktop'

In [None]:
screaming_frog_crawl=f'{screaming_frog_app} --headless --save-crawl --output-folder {output_location} --timestamped-output --crawl phoenixandpartners.co.uk'

------------------------------------------------------------------------------------------------------------------------

In [None]:
TESTS TO CREATE

- DIFFERENT RESULTS FOR CHECK FOR REPORTS
- DIFFERENT OPERATING SYSTEMS
- ADD IN CHECKING FOR ALL OF THE REPORT NAMES

### Having Fun With Python Classes

In [427]:
class UnsupportedPlatformError(Exception):
    def __init__(self, message, errors):

        # Call the base class constructor with the parameters it needs
        super().__init__(message)

        # Now for your custom code...
        self.errors = errors

In [428]:
class ValidationError(Exception):
    def __init__(self, message, errors):

        # Call the base class constructor with the parameters it needs
        super().__init__(message)

        # Now for your custom code...
        self.errors = errors
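
To show how the extra `errors` argument can be used, here is a minimal sketch that raises and catches one of these exceptions. The class is repeated so that the block runs on its own, and `check_output_folder` is a hypothetical helper written only for this illustration.

```python
class ValidationError(Exception):
    def __init__(self, message, errors):
        # The message goes to the base Exception class
        super().__init__(message)
        # The extra detail is stored on the exception instance
        self.errors = errors

def check_output_folder(outputfolder):
    """Hypothetical helper: reject an empty output folder."""
    if outputfolder == '':
        raise ValidationError('You must choose a valid output folder', 'outputfolder=""')
    return outputfolder

try:
    check_output_folder('')
except ValidationError as error:
    # str(error) gives the message, error.errors gives the extra detail
    caught_message = str(error)
    caught_errors = error.errors
    print(caught_message, '|', caught_errors)
```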

In [527]:
class ScreamingFrogAnalyser(object):
    def __init__(self, website_urls,
                 user_name,
                 outputfolder='',
                 export_tabs=False,
                 export_reports=False,
                 export_bulk_exports=False):
        
        self._website_urls = website_urls
        self._user_name = user_name
        self._output_folder = '--output-folder ' + outputfolder
        self._export_tabs = export_tabs
        self._export_reports = export_reports
        self.export_bulk_exports = export_bulk_exports
        
        # This will populate with a list of folders that Screaming Frog Creates via --timestamped folder:
        self._sf_folders = []
        
        # Check the raw argument: self._output_folder always starts with '--output-folder '
        if outputfolder == '':
            raise ValidationError('You must choose a valid output folder for your Screaming Frog Crawls',
                                 'outputfolder=""')
            
        # Creating the command based upon the Operating System:
        self._create_command()
        self._command_updater()
    
    def _create_command(self):
        if platform == "linux" or platform == "linux2":
            # Linux
            self._sf_command = 'screamingfrogspider --headless --save-crawl'
        elif platform == "darwin":
            # OS X
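            self._sf_command = r'/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl'
        elif platform == "win32":
            # Windows...
            raise UnsupportedPlatformError("Windows Is Currently Not Supported", 'Please stop using windows!')
        else:
            # Any other platform would otherwise leave self._sf_command undefined
            raise UnsupportedPlatformError(f"Unsupported platform: {platform}", platform)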
            
    def _add_reports(self):
        for value, argument in zip([self._export_tabs, self._export_reports, self.export_bulk_exports],
                    ['--export-tabs', '--save-report', '--bulk-export']):
            # False means the option was not requested, so only the generic .seospider crawl is saved
            if value is not False:
                self.sf_command = self.sf_command + f' {argument} "{value}"'
    
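    def _parse_subprocess_text(self, subprocess_text):
        # Returns the timestamped output folder, or None if Screaming Frog did not report one
        directory = re.findall('(?<=Output directory:)(.*?)(?=\n)', 
                       str(subprocess_text.decode('utf-8')))
        if not directory:
            return None
        return directory[0].strip()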

    def _command_updater(self):
        self.sf_command = self._sf_command + ' ' + self._output_folder + ' --timestamped-output'
        print(f"Please make sure that the {self._output_folder} is a valid destination! \n")
        self._add_reports()
        print(self.sf_command)
    
    # Execution Functions
    def run_screaming_headless_frog(self, website):
        final_command = self.sf_command + ' --crawl ' + website
        screaming_frog=subprocess.run(final_command, 
        shell=True, 
        capture_output=True)
        return screaming_frog
        
    # Run Multiple Websites:
    def run_crawls(self):
        for website in self._website_urls:
            # 1. Crawl the website: 
            output = self.run_screaming_headless_frog(website)
            # 2. Store the crawled files:
            resp = self._parse_subprocess_text(output.stdout)
            if isinstance(resp, str):
                self._sf_folders.append(resp)
            else:
                raise ValidationError('No folder was created, check your output folder and export settings', 'Incorrect Response')
            print('\n' + '----' + '\n')
            

In [528]:
class CSV_Parser():
    def __init__(self, file_paths):
        pass
    
    def group_multiple_csv_files(self):
        pass    
    
    def merge_multiple_csv_files(self):
        pass

In [529]:
class BigQuery():
    def __init__(self):
        pass

In [532]:
sf = ScreamingFrogAnalyser(user_name='jamesaphoenix',website_urls=['https://phoenixandpartners.co.uk/'],
                           outputfolder='/users/jamesaphoenix/desktop', 
                           export_tabs='Response Codes:Client Error (4xx)')

Please make sure that the --output-folder /users/jamesaphoenix/desktop is a valid destination! 

/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl --output-folder /users/jamesaphoenix/desktop --timestamped-output --export-tabs "Response Codes:Client Error (4xx)"


In [533]:
print(sf._sf_command, sf._sf_folders)

/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl []


In [537]:
sf.run_crawls()


----



In [535]:
sf._sf_folders

['/users/jamesaphoenix/desktop/2020.06.25.23.09.50']
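
The class above builds one long string and runs it with `shell=True`, which is why the spaces in the launcher path need escaping. An alternative is to pass `subprocess.run` a list of arguments, so no shell is involved and no escaping is needed. The sketch below only builds the list. `build_crawl_arguments` is a hypothetical helper, the flags are the ones already used in this tutorial, and the real call is left commented out because it needs Screaming Frog installed.

```python
def build_crawl_arguments(launcher, website, output_folder, export_tabs=None):
    """Return the headless crawl command as a list of arguments (hypothetical helper)."""
    arguments = [
        launcher,
        "--headless",
        "--save-crawl",
        "--output-folder", output_folder,
        "--timestamped-output",
    ]
    # Only add the export flag when a tab name was supplied.
    if export_tabs:
        arguments += ["--export-tabs", export_tabs]
    arguments += ["--crawl", website]
    return arguments

arguments = build_crawl_arguments(
    launcher="/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher",
    website="https://phoenixandpartners.co.uk/",
    output_folder="/users/jamesaphoenix/desktop",
    export_tabs="Response Codes:Client Error (4xx)",
)
print(arguments)

# To run it for real (requires Screaming Frog to be installed):
# import subprocess
# subprocess.run(arguments, capture_output=True)
```

Each list item reaches the program as a single argument, so the launcher path and the tab name can contain spaces without any quoting.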

------------------------------------------------------------------------------------------

In [541]:
### Working Here On Exporting All Of The Files:
outputfolder = '/users/jamesaphoenix/desktop'

In [539]:
folders = ['2020.06.25.23.10.31', '2020.06.25.23.11.30', '2020.06.25.23.09.50']

In [559]:
correct_directories = [outputfolder + '/' + folder for folder in folders]
correct_files = [directory + '/' + files for directory in correct_directories
                 for files in os.listdir(path=directory)]

In [578]:
### Working On Grouping Files
file_dict = {}

for file in correct_files:
    split_file = file.split('/')
    name = split_file[-1]
    if split_file[-1] not in file_dict.keys():
        file_dict[str(name)] = [file]
    else:
        file_dict[str(name)].append(file)
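
The grouping step can be checked without running a crawl. The sketch below is one possible variation that uses `collections.defaultdict` and `os.path.join`; it creates two example crawl folders in a temporary directory, each holding an empty file with the same name. `group_files_by_name` is a hypothetical helper name.

```python
import os
import tempfile
from collections import defaultdict

def group_files_by_name(directories):
    """Map each file name to the list of full paths where it appears."""
    grouped = defaultdict(list)
    for directory in directories:
        for file_name in sorted(os.listdir(directory)):
            grouped[file_name].append(os.path.join(directory, file_name))
    return dict(grouped)

# Demonstration with two example crawl folders in a throwaway directory.
with tempfile.TemporaryDirectory() as parent:
    crawl_folders = []
    for folder in ["2020.06.25.23.09.50", "2020.06.25.23.10.31"]:
        path = os.path.join(parent, folder)
        os.mkdir(path)
        crawl_folders.append(path)
        # Create an empty export with the same name in each crawl folder.
        open(os.path.join(path, "response_codes_client_error_(4xx).csv"), "w").close()
    grouped = group_files_by_name(crawl_folders)

for name, paths in grouped.items():
    print(name, "->", len(paths), "files")
```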

------------------------------------------------------------------------------------

## Loading The Screaming Frog CSV Data In Pandas

### TO DO:
- WE WILL NEED TO ADD THE DOMAIN BELOW!
- THE CODE BELOW NEEDS REFACTORING

In [605]:
### Working On Concatenating Files
data_dict = {}

for name, paths in file_dict.items():
    # 1. Only the .csv exports can be loaded into Pandas:
    if not name.endswith('.csv'):
        continue
    # 2. Read every copy of this report (one per crawl folder):
    frames = [pd.read_csv(path) for path in paths]
    # 3. Stack them into one dataframe. DataFrame.append was removed in pandas 2.0, so use pd.concat:
    data_dict[name] = pd.concat(frames)

In [607]:
data_dict['response_codes_client_error_(4xx).csv']

| | Address | Content | Status Code | Status | Indexability | Indexability Status | Inlinks | Response Time | Redirect URL | Redirect Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://phoenixandpartners.co.uk/wp-content/up... | text/html; charset=UTF-8 | 404 | Not Found | Non-Indexable | Client Error | 0 | 0.3 | | |
| 1 | https://phoenixandpartners.co.uk/wp-content/up... | text/html; charset=UTF-8 | 404 | Not Found | Non-Indexable | Client Error | 0 | 0.337 | | |
| 2 | https://www.bankrate.com/finance/real-estate/r... | text/html; charset=UTF-8 | 410 | Gone | Non-Indexable | Client Error | 1 | 0.01 | | |
| 3 | https://metro.co.uk/2018/03/15/5-people-share-... | text/html | 429 | Too Many Requests | Non-Indexable | Client Error | 1 | 0.012 | | |
| 4 | https://www.simplelandlordsinsurance.com/emerg... | text/html; charset=utf-8 | 404 | Not Found | Non-Indexable | Client Error | 1 | 0.409 | | |
| 0 | https://phoenixandpartners.co.uk/wp-content/up... | text/html; charset=UTF-8 | 404 | Not Found | Non-Indexable | Client Error | 0 | 0.293 | | |
| 1 | https://phoenixandpartners.co.uk/wp-content/up... | text/html; charset=UTF-8 | 404 | Not Found | Non-Indexable | Client Error | 0 | 0.306 | | |
| 2 | https://www.bankrate.com/finance/real-estate/r... | text/html; charset=UTF-8 | 410 | Gone | Non-Indexable | Client Error | 1 | 0.013 | | |
| 3 | https://metro.co.uk/2018/03/15/5-people-share-... | text/html | 429 | Too Many Requests | Non-Indexable | Client Error | 1 | 0.01 | | |
| 4 | https://www.simplelandlordsinsurance.com/emerg... | text/html; charset=utf-8 | 404 | Not Found | Non-Indexable | Client Error | 1 | 0.069 | | |
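
Notice that the row labels repeat (0 to 4, then 0 to 4 again) and that nothing records which crawl each row came from. One way to handle both is sketched below, assuming you want a `crawl_folder` column (a hypothetical column name). The CSV text is made-up stand-in data held in memory, not real Screaming Frog output, and the same pattern could be used to add a domain column.

```python
import io
import pandas as pd

# Made-up stand-ins for the CSV exports of two crawls, keyed by crawl folder.
exports = {
    "2020.06.25.23.09.50": "Address,Status Code\nhttps://example.com/a,404\n",
    "2020.06.25.23.10.31": "Address,Status Code\nhttps://example.com/a,404\nhttps://example.com/b,410\n",
}

frames = []
for crawl_folder, csv_text in exports.items():
    frame = pd.read_csv(io.StringIO(csv_text))
    # Record where each row came from before the frames are combined.
    frame = frame.assign(crawl_folder=crawl_folder)
    frames.append(frame)

# ignore_index=True renumbers the rows from 0 instead of repeating each file's own labels.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```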


## Data Wrangling All Of The CSV Data

------------------------------------------

## BigQuery Setup

- API Creation
- Table Creation
- Service Account Key Creation

TBC

------------------------------------------

## Pushing The Data To BigQuery



TBC

------------------------------------------

## Connecting The BigQuery Table to Google Data Studio



------------------------------------------------------------------------------------------