<a href="https://colab.research.google.com/github/jhayesn13/Test/blob/main/Working_Crawler_with_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Working Code

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import threading
import re

# Create webcrawler class
class WebCrawler:
    def __init__(self, start_url, visiting_strategy='preorder'):
        self.start_url = start_url
        self.visiting_strategy = visiting_strategy.lower()
        self.visited_urls = set()
        self.corpus = {}
        self.main_domain = urlparse(start_url).netloc
        self.lock = threading.Lock()  # Lock for thread-safe access to shared data

    def crawl(self, url):
        if url not in self.visited_urls and self.is_same_domain(url):
            print(f"Visiting: {url}")
            self.visited_urls.add(url)
            try:
                response = requests.get(url)
                soup = BeautifulSoup(response.content, 'html.parser')
                title = soup.title.string.strip() if soup.title else 'Untitled'
                text_content = self.extract_text_content(soup)

                with self.lock:  # Thread-safe update of shared data
                    self.corpus[url] = text_content

                print(f"Text Content: {text_content[:100]}...")  # Output a snippet of text

                if self.visiting_strategy == 'preorder':
                    links = self.extract_links(soup)
                    threads = []
                    for link in links:
                        thread = threading.Thread(target=self.crawl, args=(link,))
                        threads.append(thread)
                        thread.start()

                    # Wait for all threads to complete
                    for thread in threads:
                        thread.join()

                # Additional visiting strategies (inorder, postorder) can be implemented here

            except Exception as e:
                print(f"Error crawling {url}: {e}")

    def extract_text_content(self, soup):
        # Extract text content only from the body of the HTML
        text_content = ' '.join([p.get_text(separator=' ', strip=True) for p in soup.body.find_all('p')])
        return text_content

    def extract_links(self, soup):
        # Extract all links from the page
        links = [link.get('href') for link in soup.find_all('a', href=True)]
        # Filter internal links only
        links = [urljoin(self.start_url, link) for link in links if link.startswith(('http', 'https'))]
        # Exclude PDF links
        links = [link for link in links if not link.endswith('.pdf')]
        # Filter out external links
        links = [link for link in links if self.is_same_domain(link)]
        # Exclude links with 'resources' in the URL
        links = [link for link in links if 'resources' not in link.lower()]
        return links

    def is_same_domain(self, url):
        return urlparse(url).netloc == self.main_domain

    def start_crawling(self):
        self.crawl(self.start_url)

    def get_crawled_data(self):
        return self.corpus

if __name__ == "__main__":
    # Get the starting URL from the user
    start_url = input("Enter the website's URL: ")

    # Instantiate the WebCrawler with the provided URL and visiting strategy
    crawler = WebCrawler(start_url=start_url, visiting_strategy='preorder')

    # Start crawling
    crawler.start_crawling()

    # Get the crawled data
    crawled_data = crawler.get_crawled_data()

    # Print the crawled data
    for url, content in crawled_data.items():
        print(f"URL: {url}")
        print(f"Content: {content[:100]}...")  # Print a snippet of content


Enter the website's URL: https://www.stjohns.edu/
Visiting: https://www.stjohns.edu/
Text Content: See how your journey aligns with what drives you. Hannah M. Queens, NY Maria Orlando, FL Jenna Charl...
Visiting: https://www.stjohns.edu/life-st-johns/career-services
Visiting: https://www.stjohns.edu/about/leadership-and-administration/office-president/presidents-society
Visiting: https://www.stjohns.edu/who-we-are/faith-and-mission/campus-ministry/opportunities/plunge-program
Visiting: https://www.stjohns.edu/who-we-are/campus-sustainability
Visiting: https://www.stjohns.edu/academics/programs?level%5B151%5D=151
Visiting: https://www.stjohns.edu/admission/graduate-admission
Text Content: Founded in 1968, the President’s Society honors those students who combine scholarship, integrity, m...
Text Content: Plunges, or service immersion, are weeklong experiences where students are given the opportunity to ...
Visiting: https://www.stjohns.edu/who-we-are/history-and-facts/vincentian-heritag



Text Content: Build critical skills to develop, plan, launch, and sustain new, innovative ventures with an M.B.A. ...Visiting: https://www.stjohns.edu/about/news/all-news?school=36

Visiting: https://www.stjohns.edu/academics/programs/finance-master-science#stem-visa-extension-for-international-students
Text Content: The Economic Justice Legal Clinic is a full-year partner clinic offered in collaboration with the Ne...Text Content: The University’s campus in Manhattan is situated in the East Village, one of New York City’s most vi...
Text Content: Combine data-driven decision-making and analytics with the overall M.B.A. program learning objective...Visiting: https://www.stjohns.edu/academics/programs/finance-master-science#tracks


Visiting: https://www.stjohns.edu/law/alumni/st-johns-law-magazine
Text Content: The Securities Arbitration Clinic is part of the St. Vincent de Paul Legal Program, Inc. It is a one...
Visiting: https://www.stjohns.edu/law/law-career-development/current-stud



Visiting: https://www.stjohns.edu/about/news/2021-03-05/researcher-speak-quantum-computing-and-communication
Text Content: My name is Kate Smith and I grew up on a small dairy farm in the midland of Ireland. There was plent...Visiting: https://www.stjohns.edu/about/news/2021-02-12/zoom-presentation-focus-blockchain
Text Content: “Education is one thing no one can take away from you.” This quote by Elin Nordegren fuels my belief...

Visiting: https://www.stjohns.edu/about/news/2021-01-22/cyber-security-program-host-talk-robots
Visiting: https://www.stjohns.edu/academics/schools/college-professional-studies/faculty
Error crawling https://www.stjohns.edu/sites/default/files/2023-10/School%20of%20Education%20Visiting%20Scholar%20Review%20Checklist.docx: 'NoneType' object has no attribute 'find_all'Text Content: I am currently a junior at St. John’s University seeking a bachelor’s degree in Hospitality Manageme...
Visiting: https://www.stjohns.edu/academics/programs/business-administration-



Text Content: Dual degree programs are designed to provide highly motivated, qualified students with the opportuni...
Text Content: Introne, J., Yildirim, I. G., Iandoli, L., DeCook, J., and Elzeini, S. (2018). How People Weave Onli...
Text Content: St. John’s College of Liberal Arts and Sciences provides students with a firm foundation in analytic...
Error crawling https://www.stjohns.edu/sites/default/files/2022-04/SJU%202022%20Camp%20Emergency%20Contact%20%26%20Consent%20to%20Treat%20Form.docx: 'NoneType' object has no attribute 'find_all'
Text Content: Queens Campus 8000 Utopia Parkway St. Augustine Hall, Second Floor Queens, NY 11439 718-990-6414 Sta...
Visiting: http://www.stjohns.edu/ccpsappt
Visiting: https://www.stjohns.edu/academics/faculty/edrex-fontanilla




Text Content: Not all roads lead to riches... ...some lead to Johnny Thunderbird with an unfortunate message. Thin...Visiting: https://www.stjohns.edu/academics/faculty/lequez-spearman

Error crawling https://www.stjohns.edu/sites/default/files/2022-04/SJU%202022%20Instructions%20for%20Completing%20Summer%20Program%20Form.docx: 'NoneType' object has no attribute 'find_all'Text Content: The Lesley H. and William L. Collins College of Professional Studies is a launchpad for innovators, ...





Text Content: Please click below to access the page. 8000 Utopia Parkway Queens NY 11439 718-990-2000 St. John’s U...
Error crawling https://www.stjohns.edu/sites/default/files/2022-04/SJU%202022%20Meningitis%20Vaccination%20Parent%20Information%20%26%20Response%20Form.docx: 'NoneType' object has no attribute 'find_all'
Text Content: 8000 Utopia Parkway Queens NY 11439 718-990-2000 St. John’s University does not discriminate on the ...
Text Content: 8000 Utopia Parkway Queens NY 11439 718-990-2000 St. John’s University does not discriminate on the ...
Text Content: 8000 Utopia Parkway Queens NY 11439 718-990-2000 St. John’s University does not discriminate on the ...
Text Content: ...




Error crawling https://www.stjohns.edu/sites/default/files/uploads/2022_Clare_Boothe_Luce_Summer_Research_Scholarship_application%20%285%29.docx: 'NoneType' object has no attribute 'find_all'
Error crawling https://www.stjohns.edu/sites/default/files/2020-03/M1-12853%20MA%20Chinese%20%26%20EAS.PDF: 'NoneType' object has no attribute 'find_all'




Error crawling https://www.stjohns.edu/files/january-2020-review-business: 'NoneType' object has no attribute 'find_all'




Error crawling https://www.stjohns.edu/files/2023-annual-security-and-fire-safety-report: 'NoneType' object has no attribute 'find_all'




Error crawling https://www.stjohns.edu/sites/default/files/2020-01/53R%20Brief%20%28Revised%29.PDF: 'NoneType' object has no attribute 'find_all'
URL: https://www.stjohns.edu/
Content: See how your journey aligns with what drives you. Hannah M. Queens, NY Maria Orlando, FL Jenna Charl...
URL: https://www.stjohns.edu/about/leadership-and-administration/office-president/presidents-society
Content: Founded in 1968, the President’s Society honors those students who combine scholarship, integrity, m...
URL: https://www.stjohns.edu/who-we-are/faith-and-mission/campus-ministry/opportunities/plunge-program
Content: Plunges, or service immersion, are weeklong experiences where students are given the opportunity to ...
URL: https://www.stjohns.edu/academics/programs?level%5B151%5D=151
Content: Professional licensure and certification requirements often vary from state to state. St. John’s Uni...
URL: https://www.stjohns.edu/who-we-are/campus-sustainability
Content: Sustainability is a long-term 

In [None]:
#Code with best comments

import requests  # Importing a library to make HTTP requests
from bs4 import BeautifulSoup  # Importing a library to parse HTML content
from urllib.parse import urljoin, urlparse  # Importing utilities for URL handling
import threading  # Importing a tool to run multiple tasks concurrently
import re  # Importing a library for regular expressions

# Create webcrawler class
class WebCrawler:
    def __init__(self, start_url, visiting_strategy='preorder'):
        # Initialize the web crawler with a starting URL and a visiting strategy
        self.start_url = start_url  # The URL from where the crawling begins
        self.visiting_strategy = visiting_strategy.lower()  # How the crawler should visit links
        self.visited_urls = set()  # Keep track of visited URLs to avoid revisiting
        self.corpus = {}  # Store the text content of crawled pages
        self.main_domain = urlparse(start_url).netloc  # Get the domain of the starting URL
        self.lock = threading.Lock()  # A tool to ensure safe access to shared data

    def crawl(self, url):
        # Method to crawl a given URL and extract its content
        if url not in self.visited_urls and self.is_same_domain(url):
            # If the URL hasn't been visited yet and belongs to the same domain
            print(f"Visiting: {url}")  # Print the URL being visited
            self.visited_urls.add(url)  # Mark the URL as visited
            try:
                # Try to request the webpage content
                response = requests.get(url)  # Get the webpage content
                soup = BeautifulSoup(response.content, 'html.parser')  # Parse HTML content
                title = soup.title.string.strip() if soup.title else 'Untitled'  # Get the page title

                # Extract and store the text content of the page
                text_content = self.extract_text_content(soup)
                with self.lock:  # Ensure safe data access in a multi-threaded environment
                    self.corpus[url] = text_content

                print(f"Text Content: {text_content[:100]}...")  # Print a snippet of the text content

                if self.visiting_strategy == 'preorder':
                    # If the visiting strategy is 'preorder', crawl links immediately
                    links = self.extract_links(soup)  # Extract links from the current page
                    threads = []  # Store threads for concurrent crawling
                    for link in links:
                        # Start a new thread to crawl each link concurrently
                        thread = threading.Thread(target=self.crawl, args=(link,))
                        threads.append(thread)  # Store the thread for later use
                        thread.start()  # Start the thread

                    for thread in threads:
                        thread.join()  # Wait for all threads to complete

            except Exception as e:
                # Handle any errors that occur during crawling
                print(f"Error crawling {url}: {e}")

    def extract_text_content(self, soup):
        # Extract text content from the body of the HTML
        text_content = ' '.join([p.get_text(separator=' ', strip=True) for p in soup.body.find_all('p')])
        return text_content

    def extract_links(self, soup):
        # Extract all links from the page
        links = [link.get('href') for link in soup.find_all('a', href=True)]
        # Filter internal links only
        links = [urljoin(self.start_url, link) for link in links if link.startswith(('http', 'https'))]
        # Exclude PDF links
        links = [link for link in links if not link.endswith('.pdf')]
        # Filter out external links
        links = [link for link in links if self.is_same_domain(link)]
        # Exclude links with 'resources' in the URL
        links = [link for link in links if 'resources' not in link.lower()]
        return links

    def is_same_domain(self, url):
        # Check if a given URL belongs to the same domain as the starting URL
        return urlparse(url).netloc == self.main_domain

    def start_crawling(self):
        # Start crawling from the starting URL
        self.crawl(self.start_url)

    def get_crawled_data(self):
        # Get the crawled data (text content of each page)
        return self.corpus

if __name__ == "__main__":
    # Get the starting URL from the user
    start_url = input("Enter the website's URL: ")

    # Instantiate the WebCrawler with the provided URL and visiting strategy
    crawler = WebCrawler(start_url=start_url, visiting_strategy='preorder')

    # Start crawling
    crawler.start_crawling()

    # Get the crawled data
    crawled_data = crawler.get_crawled_data()

    # Print the crawled data
    for url, content in crawled_data.items():
        print(f"URL: {url}")
        print(f"Content: {content[:100]}...")  # Print a snippet of content


KeyboardInterrupt: Interrupted by user

In [None]:
#Displaying content from a specific webpage

# Assuming 'crawled_data' is the dictionary containing the crawled data
desired_url = 'https://www.stjohns.edu/academics/programs/business-bachelor-science'

# Check if the URL is present in the crawled data
if desired_url in crawled_data:
    content = crawled_data[desired_url]
    print("Content for URL:")
    print(content)
else:
    print("URL not found in crawled data.")

Content for URL:
{'title': "Interdisciplinary Business, Bachelor of Science | St. John's University", 'text_content': "An independent and interdisciplinary career focused-major. The Bachelor of Science degree is designed to provide a high-quality business education. The program is interdisciplinary and allows the student to select six advanced courses from the major disciplines in the Peter J. Tobin College of Business to tailor a program to their career of interest. Experiential courses are available that broaden the business education experience.  Director, B.S. in Business Degree Program [email\xa0protected] 718-990-1638 Bent Hall, Room 424 Poets&Quants for Undergrads , the leading online publication for undergraduate business education news, recognized the improved undergraduate business program at The Peter J. Tobin College of Business at St. John’s University by moving it up from #54 to 38 in its sixth annual list of “Best Undergraduate Business Schools 2022.” The Tobin College o

In [None]:
#Crawler with processed text. Includes tokens, removed stopwords, lemmatized tokens, stemming, and sentiment score

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import threading
import re
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.sentiment import SentimentIntensityAnalyzer

class TextProcessor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")
        nltk.download('punkt')
        nltk.download('stopwords')
        nltk.download('wordnet')
        nltk.download('vader_lexicon')

        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.stemmer = PorterStemmer()
        self.sentiment_analyzer = SentimentIntensityAnalyzer()

    def clean_text(self, text):
        cleaned_text = re.sub(r'[^\w\s]', '', text)
        cleaned_text = cleaned_text.lower()
        return cleaned_text

    def tokenize_text(self, text):
        doc = self.nlp(text)
        tokens = [token.text for token in doc]
        return tokens

    def remove_stopwords(self, tokens):
        filtered_tokens = [token for token in tokens if token.lower() not in self.stop_words]
        return filtered_tokens

    def lemmatize_text(self, tokens):
        doc = self.nlp(" ".join(tokens))
        lemmatized_tokens = [token.lemma_ for token in doc]
        return lemmatized_tokens

    def stem_text(self, tokens):
        stemmed_tokens = [self.stemmer.stem(token) for token in tokens]
        return stemmed_tokens

    def analyze_sentiment(self, text):
        sentiment_scores = self.sentiment_analyzer.polarity_scores(text)
        return sentiment_scores

class WebCrawler:
    def __init__(self, start_url, visiting_strategy='preorder'):
        self.start_url = start_url
        self.visiting_strategy = visiting_strategy.lower()
        self.visited_urls = set()
        self.corpus = {}
        self.main_domain = urlparse(start_url).netloc
        self.lock = threading.Lock()  # Lock for thread-safe access to shared data
        self.text_processor = TextProcessor()  # Instantiate TextProcessor

    def crawl(self, url):
        if url not in self.visited_urls and self.is_same_domain(url):
            print(f"Visiting: {url}")
            self.visited_urls.add(url)
            try:
                response = requests.get(url)
                soup = BeautifulSoup(response.content, 'html.parser')
                title = soup.title.string.strip() if soup.title else 'Untitled'
                text_content = self.extract_text_content(soup)

                # Process text content
                cleaned_text = self.text_processor.clean_text(text_content)
                tokens = self.text_processor.tokenize_text(cleaned_text)
                filtered_tokens = self.text_processor.remove_stopwords(tokens)
                lemmatized_tokens = self.text_processor.lemmatize_text(filtered_tokens)
                stemmed_tokens = self.text_processor.stem_text(filtered_tokens)
                sentiment_scores = self.text_processor.analyze_sentiment(text_content)

                with self.lock:  # Thread-safe update of shared data
                    self.corpus[url] = {
                        'title': title,
                        'text_content': text_content,
                        'cleaned_text': cleaned_text,
                        'tokens': tokens,
                        'filtered_tokens': filtered_tokens,
                        'lemmatized_tokens': lemmatized_tokens,
                        'stemmed_tokens': stemmed_tokens,
                        'sentiment_scores': sentiment_scores
                    }

                print(f"Title: {title}")

                if self.visiting_strategy == 'preorder':
                    links = self.extract_links(soup)
                    threads = []
                    for link in links:
                        thread = threading.Thread(target=self.crawl, args=(link,))
                        threads.append(thread)
                        thread.start()

                    # Wait for all threads to complete
                    for thread in threads:
                        thread.join()

                # Additional visiting strategies (inorder, postorder) can be implemented here

            except Exception as e:
                print(f"Error crawling {url}: {e}")

    def extract_text_content(self, soup):
        text_content = ' '.join([p.get_text(separator=' ', strip=True) for p in soup.body.find_all('p')])
        return text_content

    def extract_links(self, soup):
        links = [link.get('href') for link in soup.find_all('a', href=True)]
        links = [urljoin(self.start_url, link) for link in links if link.startswith(('http', 'https'))]
        links = [link for link in links if not link.endswith('.pdf')]
        links = [link for link in links if self.is_same_domain(link)]
        links = [link for link in links if 'resources' not in link.lower()]
        return links

    def is_same_domain(self, url):
        return urlparse(url).netloc == self.main_domain

    def start_crawling(self):
        self.crawl(self.start_url)

    def get_crawled_data(self):
        return self.corpus

if __name__ == "__main__":
    # Get the starting URL from the user
    start_url = input("Enter the website's URL: ")

    # Instantiate the WebCrawler with the provided URL and visiting strategy
    crawler = WebCrawler(start_url=start_url, visiting_strategy='preorder')

    # Start crawling
    crawler.start_crawling()

    # Get the crawled data
    crawled_data = crawler.get_crawled_data()

    # Print the crawled data
    for url, data in crawled_data.items():
        print(f"URL: {url}")
        print(f"Title: {data['title']}")
        print(f"Cleaned Text: {data['cleaned_text']}")
        print(f"Tokens: {data['tokens']}")
        print(f"Filtered Tokens: {data['filtered_tokens']}")
        print(f"Lemmatized Tokens: {data['lemmatized_tokens']}")
        print(f"Stemmed Tokens: {data['stemmed_tokens']}")
        print(f"Sentiment Scores: {data['sentiment_scores']}")


Enter the website's URL: https://www.stjohns.edu/


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Visiting: https://www.stjohns.edu/
Title: Turn Passion into Purpose | St. John's University
Visiting: https://www.stjohns.edu/life-st-johns/career-services
Visiting: https://www.stjohns.edu/about/leadership-and-administration/office-president/presidents-society
Visiting: https://www.stjohns.edu/who-we-are/faith-and-mission/campus-ministry/opportunities/plunge-program
Visiting: https://www.stjohns.edu/who-we-are/campus-sustainability
Visiting: https://www.stjohns.edu/academics/programs?level%5B151%5D=151Visiting: https://www.stjohns.edu/admission/graduate-admission

Title: Graduate Program Admissions and Application Requirements | St. John's University | New York
Visiting: https://www.stjohns.edu/admission/apply
Visiting: https://www.stjohns.edu/academics/programs?level=151
Visiting: https://www.stjohns.edu/about/campuses-and-locations/queens-campus/explore-queens-campus
Visiting: https://www.stjohns.edu/academics/programs
Visiting: https://www.stjohns.edu/academics/schools/school-educa



Error crawling https://www.stjohns.edu/files/phd-curriculum-and-instruction-essay-prompt-instructions: 'NoneType' object has no attribute 'find_all'
Title: Academics | St. John's University
Visiting: https://www.stjohns.edu/law/applyVisiting: https://www.stjohns.edu/law/give
Visiting: https://www.stjohns.edu/law/law-career-development

Visiting: https://www.stjohns.edu/law/academics/jd-programs
Visiting: https://www.stjohns.edu/law/academics/llm-programs
Visiting: https://www.stjohns.edu/law/academics/clinics
Visiting: https://www.stjohns.edu/law/academics/centersVisiting: https://www.stjohns.edu/law/academics/co-curricular-programs

Visiting: https://www.stjohns.edu/law/academics/study-abroad
Visiting: https://www.stjohns.edu/law/academics/course-catalog
Visiting: https://www.stjohns.edu/law/academics/academic-calendar
Visiting: https://www.stjohns.edu/law/academics/assessment
Visiting: https://www.stjohns.edu/law/admissions/jd-admissions/apply-st-johns-law
Visiting: https://www.stjoh



Title: International Admission | St. John's University

Error crawling https://www.stjohns.edu/sites/default/files/2023-10/School%20of%20Education%20Visiting%20Scholar%20Review%20Checklist.docx: 'NoneType' object has no attribute 'find_all'
Title: Explore St. John’s Today! | St. John's University
Title: Annual Security and Fire Safety Report | St. John's University
Visiting: https://www.stjohns.edu/files/2023-annual-security-and-fire-safety-report
Visiting: https://www.stjohns.edu/life-st-johns/public-safety
Title: Commencement Information for Faculty | St. John's University
Title: Diversity, Equity, & Inclusion Leadership | St. John's University
Title: International Engagement Opportunities | St. John's University
Visiting: https://www.stjohns.edu/academics/commencement/faculty/rented-attire
Title: Careers | St. John's University
Title: Pre-Law Advisement Program | St. John's UniversityVisiting: https://www.stjohns.edu/law/faculty/noa-ben-asher
Title: SignOn | St. John's University
Vi



Error crawling https://www.stjohns.edu/sites/default/files/2022-04/SJU%202022%20Meningitis%20Vaccination%20Parent%20Information%20%26%20Response%20Form.docx: 'NoneType' object has no attribute 'find_all'
Title: Jeremy Sheff | St. John's UniversityError crawling https://www.stjohns.edu/sites/default/files/2022-04/SJU%202022%20Instructions%20for%20Completing%20Summer%20Program%20Form.docx: 'NoneType' object has no attribute 'find_all'





Error crawling https://www.stjohns.edu/sites/default/files/2022-04/SJU%202022%20Camp%20Emergency%20Contact%20%26%20Consent%20to%20Treat%20Form.docx: 'NoneType' object has no attribute 'find_all'
Title: Global Passport | St. John's University
Title: St. John’s Law Commencement | St. John's University
Title: Application for the Insurance Leaders of the Year Fellows Program | St. John's University
Title: LGBTQ+ at St. John’s Law | St. John's UniversityTitle: Page Not Found | St. John's University

Title: Mark L. Movsesian | St. John's UniversityVisiting: https://www.stjohns.edu/academics/office-registrar/chosen-name-policy
Visiting: https://www.stjohns.edu/equity-and-inclusion/office-multicultural-affairs

Title: The Review of Business | St. John's University
Visiting: https://www.stjohns.edu/academics/faculty/yun-zhu-phdVisiting: https://www.stjohns.edu/files/january-2020-review-business

Visiting: https://www.stjohns.edu/queens-residential-campus/queens-campus-life/student-organizations



Error crawling https://www.stjohns.edu/sites/default/files/2020-03/M1-12853%20MA%20Chinese%20%26%20EAS.PDF: 'NoneType' object has no attribute 'find_all'
Title: Office of the Registrar | St. John's University
Visiting: https://www.stjohns.edu/node/27706#transcriptVisiting: https://www.stjohns.edu/node/27706#transcript

Title: Global Management and Entrepreneurship, Master of Science | St. John's University
Title: GDC Budget Planning Worksheet SAMPLE | St. John's University
Title: International Student and Scholar Services | St. John's UniversityTitle: Our Faculty | St. John's University | New York
Title: M.B.A. in Interdisciplinary Business | St. John's University | New YorkTitle: M.B.A. in Educational Leadership | St. John's University | New YorkTitle: Public Interest Center | St. John's University



Visiting: https://www.stjohns.edu/academics/faculty?school=21&department=7591
Title: Rome Campus | St. John's UniversityTitle: Graduate Programs | Peter J. Tobin College of Business | St



Title: Global Programs | St. John's University
Title: Page Not Found | St. John's University
Title: The LGBTQ+ Center | St. John's University
Visiting: https://www.stjohns.edu/life-st-johns/spectrum
Visiting: https://www.stjohns.edu/academics/faculty/candice-d-roberts
Error crawling https://www.stjohns.edu/files/january-2020-review-business: 'NoneType' object has no attribute 'find_all'
Title: Staten Island Campus | St. John's UniversityTitle: Graduate Programs | Peter J. Tobin College of Business | St. John's University | New York

Visiting: https://www.stjohns.edu/node/1826?location=241Visiting: https://www.stjohns.edu/life-st-johns/new-york-city-your-campus/queens-campus-life/spectrum

Visiting: https://www.stjohns.edu/life-st-johns/new-york-city-your-campus/staten-island-campus-life/spectrum-staten-island-campus
Visiting: https://www.stjohns.edu/life-st-johns/new-york-city-your-campus/staten-island-campus-life
Visiting: https://www.stjohns.edu/about/leadership-and-administration/ad



Error crawling https://www.stjohns.edu/sites/default/files/uploads/2022_Clare_Boothe_Luce_Summer_Research_Scholarship_application%20%285%29.docx: 'NoneType' object has no attribute 'find_all'Title: Page Not Found | St. John's University

Title: Elda Tsou | St. John's University
Title: SignOn | St. John's University
Title: Philosophy, Bachelor of Arts | St. John's University
Title: Consumer Justice for the Elderly: Litigation Clinic | St. John's University
Title: Raj Chetty | St. John's University
Title: Career Development at St. John's University | Empower Your Future
Title: Master of Science in Cyber and Information Security | St. John's University | New York, NY
Title: Where to Give | St. John's University
Title: St. John's College of Liberal Arts and Sciences | St. John's University | New York
Title: 32nd Annual Duberstein Bankruptcy Moot Court Competition | St. John's University
Title: St. John's University | The Lesley H. and William L. Collins College of Professional Studies | Ne



Error crawling https://www.stjohns.edu/files/2023-annual-security-and-fire-safety-report: 'NoneType' object has no attribute 'find_all'
Title: Spring 2020 | St. John's University
Title: Course Catalog | St. John's University
Title: GLOBE Entrepreneurs | St. John's University




Error crawling https://www.stjohns.edu/sites/default/files/2020-01/53R%20Brief%20%28Revised%29.PDF: 'NoneType' object has no attribute 'find_all'
URL: https://www.stjohns.edu/
Title: Turn Passion into Purpose | St. John's University
Cleaned Text: see how your journey aligns with what drives you hannah m queens ny maria orlando fl jenna charles queens ny joel stephen trinidad and tobago lucas shears warren ri with programs for every path a st johns university education gives you limitless opportunities to broaden your mind discover your passion and reach new heights at st johns you help make the world a better place by volunteering fighting injustice and serving those in need at st johns you help make the world a better place by volunteering fighting injustice and serving those in need our more than 100 undergraduate majors and programs of study are designed to prepare you for a successful future with online hybrid and inperson options you can earn a degree that works with your schedule

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)




Stemmed Tokens: ['play', 'key', 'role', 'shape', 'new', 'york', 'educ', 'four', 'doctor', 'degre', 'ie', 'edd', 'phd', 'program', '43', 'master', 'degre', 'program', 'accommod', 'career', 'changer', 'field', 'changer', 'seek', 'expertis', 'andor', 'certif', 'area', '\xa0', 'adolesc', 'educ', 'childhood', 'educ', 'clinic', 'mental', 'health', 'counsel', 'earli', 'childhood', 'educ', '\xa0', 'literaci', '\xa0', 'school', 'build', 'leadership', 'school', 'counsel', 'special', 'educ', '\xa0', 'tesol', 'program', 'design', 'flexibl', 'onlin', 'tradit', 'schedul', 'becom', 'student', 'st', 'john', 'also', 'becom', 'part', 'rich', 'histori', 'excel', 'educ', 'prepar', 'educ', '100', 'year', ' ', 'school', 'educ', 'offer', 'undergradu', 'dual', 'degre', 'program', 'student', 'seek', 'becom', 'teacher', 'want', 'transform', 'educ', 'outcom', 'children', 'school', 'educ', 'offer', 'master', 'degre', 'program', 'educ', 'enhanc', 'intellectu', 'pedagog', 'knowledg', ' ', 'administr', 'teacher', '

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

