Task Description: Find information on solar park investors: whether they invest in solar parks at all, the size of the equity checks they write, and the capacity (in megawatts) of the solar parks they invest in.

Companies: ENVIRIA Energy Holding GmbH, ENREGO Energy GmbH, HIH Invest Real Estate Austria GmbH, Merkle Germany GmbH.

**Problem Breakdown:** 
1. Fetch HTML Content: Retrieve HTML content from a given URL.
2. Parse HTML: Extract the visible text and the links from the HTML content.
3. URL Validation: Ensure the URLs are valid and within the same domain.
4. Deep Search: Recursively follow links up to a specified number of pages.
5. NLP Analysis: Analyze text using NLP to identify mentions of solar park investments.
6. Display Results: Present the extracted information in a structured format.
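
Steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration and not the code in main.py: the names `TextAndLinkParser`, `parse_page`, `is_same_domain`, and `crawl` are hypothetical, the standard-library `html.parser` stands in for BeautifulSoup, and an iterative queue replaces recursion. The `fetch` argument is assumed to be any callable that returns the HTML for a URL (or `None` on failure), so the demo below runs against an in-memory fake site instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class TextAndLinkParser(HTMLParser):
    """Collects visible text and href values, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (visible text, absolute links without #fragments)."""
    parser = TextAndLinkParser()
    parser.feed(html)
    links = [urldefrag(urljoin(base_url, href)).url for href in parser.links]
    return " ".join(parser.text_parts), links


def is_same_domain(url, root_url):
    """Accept only http(s) URLs on the same host as the start URL."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.netloc == urlparse(root_url).netloc


def crawl(start_url, fetch, max_pages=7):
    """Visit pages breadth-first until max_pages pages have been collected."""
    visited = set()            # cache of URLs already fetched
    queue = deque([start_url])
    pages = {}                 # url -> extracted text
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:       # fetch failed, skip this page
            continue
        text, links = parse_page(html, url)
        pages[url] = text
        queue.extend(
            link for link in links
            if is_same_domain(link, start_url) and link not in visited
        )
    return pages


# Demo on a fake two-page site (no network access needed).
fake_site = {
    "https://example.com/": '<a href="/about">About</a><script>var x;</script><p>Home</p>',
    "https://example.com/about": '<p>About us</p><a href="https://other.org/">Ext</a><a href="/">Home</a>',
}
pages = crawl("https://example.com/", fake_site.get, max_pages=5)
for url, text in pages.items():
    print(url, "->", text)
```

Passing the fetch function in keeps the page limit, the visited-URL cache, and the same-domain check testable without touching a real website.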

**Detailed Explanation of the Task Solution:**
1. Initial URL Fetching:
   Start with a set of initial URLs for each company. 
   Fetch the HTML content of each initial URL.

2. Parsing HTML Content:
    Parse the fetched HTML content using a tool like BeautifulSoup to extract the links and the text of elements that carry actual content (make sure styles and scripts are excluded).
    Store the text and links from the parsed content.
   
3. Recursively Follow Links:
    For each link extracted from the initial page, fetch the HTML content of the linked page.
    Parse the linked page's content to extract further text and links.
    Repeat this process for each new link found, until you reach a specified depth or a maximum number of links.
   
5. Maximum Fetching Control:
    Implement a counter or limit to control how deep the search goes. This prevents infinite recursion and ensures the search stops after a reasonable number of steps.

6. Data Collection:
    Collect and store relevant data (e.g., text content, links) from each page visited.
    Ensure that you are not revisiting the same pages by maintaining a cache of visited URLs.

7. Data Analysis:
   spaCy is used for NLP to find the relevant information in the collected text.
   Two methods are defined, and both give almost the same output.
   One uses a predefined model for natural language processing; the other uses a spaCy component defined with patterns (the Matcher).

8. Display Results:
   Show the results of the chosen method together with the source websites, so they can be checked by hand.
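
To make the pattern-based idea in the data analysis step more concrete, here is a rough sketch. It uses plain regular expressions rather than spaCy's Matcher, so it is a stand-in for illustration and not the method implemented in main.py. The function name `analyze`, the result keys, and the amount and capacity formats the patterns accept are all assumptions.

```python
import re

# Assumed amount formats: "€10 million", "$100 million", "50 million dollars".
MONEY_RE = re.compile(
    r"[€$]\s?\d+(?:[.,]\d+)?(?:\s?(?:million|billion))?"
    r"|\d+(?:[.,]\d+)?\s?(?:million|billion)\s(?:dollars|euros|EUR|USD)",
    re.IGNORECASE,
)
# Assumed capacity formats: a number followed by kW, MW or GW (optionally "p").
POWER_RE = re.compile(r"\d+(?:[.,]\d+)?\s?(?:kW|MW|GW)p?\b")
# Assumed solar vocabulary; extend as needed.
SOLAR_RE = re.compile(r"solar\s?parks?|solar\s(?:farm|plant)s?|photovoltaic", re.IGNORECASE)


def analyze(text):
    """Report amounts and capacities only when the text mentions solar parks."""
    invests = bool(SOLAR_RE.search(text))
    return {
        "investing": "Yes" if invests else "No",
        "equity_checks": MONEY_RE.findall(text) if invests else [],
        "capacities": POWER_RE.findall(text) if invests else [],
    }


solar_text = (
    "SolarPower Inc. has invested 50 million dollars in solarparks projects. "
    "The new solarparks are expected to generate over 100 MW of power."
)
wind_text = (
    "They are allocating above €10 million to develop wind farms "
    "with turbines ranging from 2 MW to 10 MW each."
)
print(analyze(solar_text))
print(analyze(wind_text))
```

A keyword check like this cannot tell who is investing or whether a sentence is negated, which is where spaCy's tokenization and entity recognition are meant to help.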

In [8]:
%run main.py

Do you want to run method Pattern[0] or method Predefined[1] for NLP?  0
Do you want to run Task[0] or Test[1]?  1


Text Example: WindGen Corporation, a leader in renewable energy, has announced a substantial investment in wind energy projects. They are allocating above €10 million to develop wind farms in coastal regions known for high wind speeds. These farms will have turbines with capacities ranging from 2 MW to 10 MW each. WindGen aims to harness the power of wind to generate clean energy for thousands of homes and businesses. The first phase of the project includes installing 20 turbines by the end of next year, with plans for further expansion in the coming years.
Investing in Solarparks: No
Equity Check Sizes: N/A
Power Capacities: N/A

--------------------------------------------------

Text Example: SolarPower Inc. has invested 50 million dollars in solarparks projects. The new solarparks are expected to generate over 100 MW of power, providing electricity to thousands of households. This investment marks a significant step towards sustainable energy solutions.
Investing in Solarparks: Yes

In [9]:
%run main.py

Do you want to run method Pattern[0] or method Predefined[1] for NLP?  1
Do you want to run Task[0] or Test[1]?  1


Text Example: WindGen Corporation, a leader in renewable energy, has announced a substantial investment in wind energy projects. They are allocating above €10 million to develop wind farms in coastal regions known for high wind speeds. These farms will have turbines with capacities ranging from 2 MW to 10 MW each. WindGen aims to harness the power of wind to generate clean energy for thousands of homes and businesses. The first phase of the project includes installing 20 turbines by the end of next year, with plans for further expansion in the coming years.
Investing in Solarparks: No
Equity Check Sizes: N/A
Power Capacities: N/A

--------------------------------------------------

Text Example: SolarPower Inc. has invested 50 million dollars in solarparks projects. The new solarparks are expected to generate over 100 MW of power, providing electricity to thousands of households. This investment marks a significant step towards sustainable energy solutions.
Investing in Solarparks: Yes

In [11]:
py_website = 'https://www.python.org/'
py_html = fetch_url_html_content(py_website)
py_data = parse_html(py_html,py_website)
print(py_data)

[('Notice:While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience.Skip to content▼ClosePythonPSFDocsPyPIJobsCommunity▲The Python NetworkDonate≡MenuSearch This SiteGOAASmallerLargerResetSocializeLinkedInMastodonChat on IRCTwitterAboutApplicationsQuotesGetting StartedHelpPython BrochureDownloadsAll releasesSource codeWindowsmacOSOther PlatformsLicenseAlternative ImplementationsDocumentationDocsAudio/Visual TalksBeginner\'s GuideDeveloper\'s GuideFAQNon-English DocsPEP IndexPython BooksPython EssaysCommunityDiversityMailing ListsIRCForumsPSF Annual Impact ReportPython ConferencesSpecial Interest GroupsPython LogoPython WikiCode of ConductCommunity AwardsGet InvolvedShared StoriesSuccess StoriesArtsBusinessEducationEngineeringGovernmentScientificSoftware DevelopmentNewsPython NewsPSF NewsletterPSF NewsPyCon US NewsNews from the CommunityEventsPython EventsUser Group EventsPython Events ArchiveU

**Ideas of How to Improve Solution:**
1. Instead of starting from each company's main website, it may be more effective to start from the pages where the information is actually located. Supplying such pages in advance, when they are known, would send more relevant text to the analysis phase.
2. Where available, the companies' databases or financial reports could be checked instead of their websites. Processing these more informative texts should yield more detailed results.
3. More currencies can be added to better capture equity check sizes.
4. Keep a list of search tokens (money amounts, capacity terms, and similar) and count how many occur on each fetched page. If the count falls below a threshold as the search goes deeper, for example when a page mentions only the company name, stop following links from that page. Treating the crawl as a tree, this prunes branches that no longer carry the information being sought and avoids irrelevant fetches.
5. Improve the data analysis step for more accurate results. At present it might read a negated sentence as positive and answer 'Yes' to the investment question. For example: 'While Company A has no investment in solar parks, it has a $100 million investment in wind energy.'
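
As a starting point for idea 5, a clause-level negation check could look like the sketch below. It is a naive regex heuristic under stated assumptions, not part of main.py: the helper names `split_clauses` and `invests_in_solar`, the negation cue list, and the comma-based clause splitting are all hypothetical. A more robust version would likely rely on spaCy's sentence segmentation and dependency parse.

```python
import re

SOLAR_RE = re.compile(r"solar\s?(?:park|farm|plant)s?", re.IGNORECASE)
INVEST_RE = re.compile(r"\binvest\w*", re.IGNORECASE)
# Assumed negation cues; far from complete.
NEGATION_RE = re.compile(r"\b(?:no|not|never|without)\b|n't", re.IGNORECASE)


def split_clauses(text):
    """Naively split on punctuation that is followed by whitespace."""
    return [c.strip() for c in re.split(r"[.!?;,]\s+", text) if c.strip()]


def invests_in_solar(text):
    """True if some clause mentions solar and investing without a negation cue."""
    for clause in split_clauses(text):
        if (
            SOLAR_RE.search(clause)
            and INVEST_RE.search(clause)
            and not NEGATION_RE.search(clause)
        ):
            return True
    return False


negated = (
    "While Company A has no investment in solar parks, "
    "it has a $100 million investment in wind energy."
)
positive = "SolarPower Inc. has invested 50 million dollars in solarparks projects."
print(invests_in_solar(negated))
print(invests_in_solar(positive))
```

Checking each clause separately keeps the negation in the first half of the example sentence from being cancelled out by the investment amount in the second half.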

**Thank you for reading this far. Now we can run the task.**

**Predefined Method:**

In [13]:
%run main.py

Do you want to run method Pattern[0] or method Predefined[1] for NLP?  1
Do you want to run Task[0] or Test[1]?  0
Please indicate number of websites search per company(Max. allowed limit is 50):   7


# of Remained Link: 6 , and Fetched Link:  https://enrego.de/en
# of Remained Link: 5 , and Fetched Link:  https://enrego.de/en/our-services
# of Remained Link: 4 , and Fetched Link:  https://enrego.de/en/our-project
# of Remained Link: 3 , and Fetched Link:  https://enrego.de/en/crowdfunding
# of Remained Link: 2 , and Fetched Link:  https://enrego.de/en/funds
# of Remained Link: 1 , and Fetched Link:  https://enrego.de/en/private-equity
# of Remained Link: 0 , and Fetched Link:  https://enrego.de/en/about
Processing data for ENREGO Energy GmbH
# of Remained Link: 6 , and Fetched Link:  https://enviria.energy/en/cases/maja-furniture-factory-energy-as-a-service
# of Remained Link: 5 , and Fetched Link:  https://enviria.energy/en/how-it-works
# of Remained Link: 4 , and Fetched Link:  https://enviria.energy/en/solarconfigurator
# of Remained Link: 3 , and Fetched Link:  https://enviria.energy/en
# of Remained Link: 2 , and Fetched Link:  https://enviria.energy/en/solar-options/your-sola

**Pattern Method:**

In [16]:
%run main.py

Do you want to run method Pattern[0] or method Predefined[1] for NLP?  0
Do you want to run Task[0] or Test[1]?  0
Please indicate number of websites search per company(Max. allowed limit is 50):   7


# of Remained Link: 6 , and Fetched Link:  https://enrego.de/en
# of Remained Link: 5 , and Fetched Link:  https://enrego.de/en/our-services
# of Remained Link: 4 , and Fetched Link:  https://enrego.de/en/our-project
# of Remained Link: 3 , and Fetched Link:  https://enrego.de/en/crowdfunding
# of Remained Link: 2 , and Fetched Link:  https://enrego.de/en/funds
# of Remained Link: 1 , and Fetched Link:  https://enrego.de/en/private-equity
# of Remained Link: 0 , and Fetched Link:  https://enrego.de/en/about
Processing data for ENREGO Energy GmbH
# of Remained Link: 6 , and Fetched Link:  https://enviria.energy/en/cases/maja-furniture-factory-energy-as-a-service
# of Remained Link: 5 , and Fetched Link:  https://enviria.energy/en/how-it-works
# of Remained Link: 4 , and Fetched Link:  https://enviria.energy/en/solarconfigurator
# of Remained Link: 3 , and Fetched Link:  https://enviria.energy/en
# of Remained Link: 2 , and Fetched Link:  https://enviria.energy/en/solar-options/your-sola