## The second In-class-exercise (09/13/2023, 40 points in total)

Kindly use the provided .ipynb document to write your code or respond to the questions. Avoid generating a new file.
Execute all the cells before your final submission.

This in-class exercise is due tomorrow, September 14, 2023, at 11:59 PM. No late submissions will be considered.

The purpose of this exercise is to understand users' information needs, then collect data from different sources for analysis.

Question 1 (10 points): Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? How much data is needed for the analysis? Give the detailed steps for collecting and saving the data.


**Research Question:** Is there a notable contrast in the stock performance between companies listed in the S&P 400 and those listed in the S&P 600 during a specific timeframe?

**Data Required for Analysis:**

1. **Historical Stock Price Data:** To address this question, historical stock price data is necessary for companies in both the S&P 400 and S&P 600 indices. This dataset should encompass daily or monthly closing prices spanning a specified period, such as the last five years.

2. **Financial Metrics:** In addition to stock price data, having access to financial metrics for each company is advantageous. These metrics, including earnings per share (EPS), price-to-earnings (P/E) ratios, and market capitalization, offer insights into financial health and company valuation.

3. **Timeframe:** The analysis period must be fixed before data collection begins. It could be a short recent window for assessing current performance or a longer historical span for long-term analysis, as long as the same period is used for both indices.

**Steps for Data Collection and Storage:**

1. **Gather Historical Stock Price Data:**
   - Retrieve historical stock price data for companies in both S&P 400 and S&P 600 using financial data APIs such as Alpha Vantage, Yahoo Finance, or Quandl.
   - You may need to iterate through the list of company tickers obtained from web scraping to fetch historical stock price data for each one. Organize and store this data systematically, such as in a CSV file or database.

2. **Collect Financial Metrics Data:**
   - As with stock price data, retrieve financial metrics for each company through financial data APIs or other financial data providers.
   - Store this data in an organized format for further analysis.

3. **Data Cleaning and Preparation:**
   - Ensure the collected data is devoid of errors and inconsistencies. Address missing values and data irregularities.

4. **Time Alignment:**
   - Align data for both S&P 400 and S&P 600 companies according to the chosen analysis timeframe. Ensure data covers the same time intervals to enable meaningful comparisons.

5. **Conduct Statistical Analysis:**
   - Compute relevant performance metrics, such as average returns, volatility, and risk-adjusted returns, for companies in both indices.
   - Utilize statistical tests, such as t-tests or ANOVA, to assess significant performance disparities between the two groups of companies.

6. **Visualization and Reporting:**
   - Create visual representations, such as line charts or bar graphs, to communicate findings effectively.
   - Prepare a report or presentation summarizing the analysis outcomes and conclusions.

7. **Hypothesis Testing:**
   - Formulate hypotheses, including a null hypothesis (e.g., No performance difference between S&P 400 and S&P 600 companies) and an alternative hypothesis (e.g., A significant difference exists).
   - Employ statistical tests to evaluate these hypotheses.
   
8. **Data Storage:**

   - Establish a structured and secure data storage system to archive all the collected data, including historical stock price data, financial metrics, and any other relevant information.
   - Implement a database or file storage solution that is robust and scalable to handle the volume of data.
   - Consider implementing data versioning and backup procedures to maintain data integrity and recover data in case of unexpected issues or data loss.
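
Steps 5 and 7 above can be sketched with the standard library alone. The snippet below is a minimal illustration, not the analysis actually performed: the helper names `simple_returns` and `welch_t` are hypothetical, the price lists are made-up placeholders, and Welch's t statistic is used here as one possible form of the t-test mentioned in step 5.

```python
import math
import statistics


def simple_returns(prices):
    """Period-over-period simple returns from a list of closing prices."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up closing prices for illustration only, not real index data
midcap_prices = [100.0, 102.0, 101.0, 104.0, 106.0]
smallcap_prices = [50.0, 49.0, 51.0, 50.5, 52.0]

midcap_returns = simple_returns(midcap_prices)
smallcap_returns = simple_returns(smallcap_prices)

print("mean return:", statistics.mean(midcap_returns), statistics.mean(smallcap_returns))
print("volatility:", statistics.stdev(midcap_returns), statistics.stdev(smallcap_returns))
print("Welch t, dof:", welch_t(midcap_returns, smallcap_returns))
```

With real data, the two return series would come from the aligned S&P 400 and S&P 600 price histories, and the t statistic would be compared against a t distribution with the reported degrees of freedom.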


Question 2 (10 points): Write python code to collect 1000 data samples you discussed above.

In [12]:
import requests
from bs4 import BeautifulSoup

# Define URLs for S&P 400 and S&P 600 companies
sp400_url = "https://en.wikipedia.org/wiki/List_of_S%26P_400_companies"
sp600_url = "https://en.wikipedia.org/wiki/List_of_S%26P_600_companies"

# Function to scrape company tickers and names from a given URL
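def scrape_tickers_and_names(url):
    # Fetch the page; the timeout avoids hanging on a stalled connection
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, "html.parser")
    # The constituents list is assumed to be the first table on the page
    table = soup.find("table")

    if table is None:
        print("Table not found on the webpage.")
        return []

    data = []
    # Skip the header row, then read ticker and name from the first two cells
    for row in table.find_all("tr")[1:]:
        cells = row.find_all("td")
        # Ignore rows that do not have both a ticker and a name cell
        if len(cells) < 2:
            continue
        ticker = cells[0].text.strip()
        name = cells[1].text.strip()
        data.append({"Ticker": ticker, "Name": name})
    return data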

# Scrape S&P 400 companies
sp400_data = scrape_tickers_and_names(sp400_url)

# Scrape S&P 600 companies
sp600_data = scrape_tickers_and_names(sp600_url)

# Print the ticker and name data for S&P 400 companies
print("S&P 400 Companies:")
for company in sp400_data:
    print(f"Ticker: {company['Ticker']}, Name: {company['Name']}")

# Print the ticker and name data for S&P 600 companies
print("\nS&P 600 Companies:")
for company in sp600_data:
    print(f"Ticker: {company['Ticker']}, Name: {company['Name']}")


S&P 400 Companies:
Ticker: AA, Name: Alcoa
Ticker: ACHC, Name: Acadia Healthcare
Ticker: ACIW, Name: ACI Worldwide
Ticker: ACM, Name: AECOM
Ticker: ADC, Name: Agree Realty
Ticker: ADNT, Name: Adient
Ticker: AFG, Name: American Financial Group
Ticker: AGCO, Name: AGCO
Ticker: AIRC, Name: Apartment Income REIT
Ticker: ALE, Name: ALLETE
Ticker: ALGM, Name: Allegro MicroSystems
Ticker: ALV, Name: Autoliv
Ticker: AM, Name: Antero Midstream
Ticker: AMED, Name: Amedisys
Ticker: AMG, Name: Affiliated Managers Group
Ticker: AMKR, Name: Amkor Technology
Ticker: AN, Name: AutoNation
Ticker: AR, Name: Antero Resources
Ticker: ARMK, Name: Aramark
Ticker: ARW, Name: Arrow Electronics
Ticker: ARWR, Name: Arrowhead Pharmaceuticals
Ticker: ASB, Name: Associated Bank
Ticker: ASGN, Name: ASGN
Ticker: ASH, Name: Ashland Global
Ticker: ATR, Name: AptarGroup
Ticker: AVNT, Name: Avient
Ticker: AVT, Name: Avnet
Ticker: AXTA, Name: Axalta
Ticker: AYI, Name: Acuity Brands
Ticker: AZPN, Name: Aspen Technology
Ti
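
The cell above only prints the scraped tickers. A minimal sketch of the "store this data in a CSV file" step from Question 1 follows. The helper `write_companies_csv`, the `Index` column, and the file name in the comment are assumptions made for illustration. The three sample rows are copied from the printed output above and stand in for the full scraped list.

```python
import csv
import io


def write_companies_csv(rows, index_name, handle):
    """Write ticker/name dictionaries to an open text handle as CSV rows."""
    writer = csv.DictWriter(handle, fieldnames=["Index", "Ticker", "Name"])
    writer.writeheader()
    for row in rows:
        writer.writerow({"Index": index_name, "Ticker": row["Ticker"], "Name": row["Name"]})
    return len(rows)


# Three rows copied from the printed output above stand in for the full list
sample_rows = [
    {"Ticker": "AA", "Name": "Alcoa"},
    {"Ticker": "ACHC", "Name": "Acadia Healthcare"},
    {"Ticker": "ACIW", "Name": "ACI Worldwide"},
]

# An in-memory buffer keeps the sketch self-contained; to write a real file,
# use open("sp400_companies.csv", "w", newline="") instead (hypothetical name)
buffer = io.StringIO()
written = write_companies_csv(sample_rows, "S&P 400", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Calling the same helper with `sp400_data` and `sp600_data` and a shared file handle would put both indices in one file, with the `Index` column telling them apart.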

Question 3 (10 points): Write python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "information retrieval". The articles should be published in the last 10 years (2013-2023).

The following information of the article needs to be collected:

(1) Title

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [13]:
# Your code here (Please add comments in the code):

import requests
from bs4 import BeautifulSoup


def scrape_google_scholar(query, num_articles=1000):
    """Collect up to num_articles result entries from Google Scholar."""
    base_url = "https://scholar.google.com/scholar"
    params = {
        "q": query,
        "as_ylo": 2013,  # earliest publication year
        "as_yhi": 2023,  # latest publication year
        "hl": "en",
        "as_sdt": "0,5",
    }

    articles = []

    while len(articles) < num_articles:
        # Offset of the first result on the next results page
        params["start"] = len(articles)
        response = requests.get(base_url, params=params)

        # Stop on any non-200 response (e.g. blocking) instead of looping forever
        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.text, "html.parser")
        results = soup.find_all("div", class_="gs_ri")

        # An empty page means there is nothing left to collect
        if not results:
            break

        for result in results:
            title = result.find("h3", class_="gs_rt").text
            # The gs_a line combines authors, venue, year, and source site
            byline = result.find("div", class_="gs_a").text
            venue = byline
            # Last " - " segment of the byline; this is the source site, not
            # necessarily the publication year
            year = byline.split(" - ")[-1]
            authors = byline.split(" - ")[0]
            # Some results have no snippet, so guard against a missing element
            snippet = result.find("div", class_="gs_rs")
            abstract = snippet.text if snippet else ""

            articles.append({
                "Title": title,
                "Venue": venue,
                "Year": year,
                "Authors": authors,
                "Abstract": abstract,
            })

    return articles


# Example usage
keyword = "information retrieval"
num_articles_to_collect = 1000

articles = scrape_google_scholar(keyword, num_articles=num_articles_to_collect)

# Print every collected article with its five fields
for index, article in enumerate(articles, start=1):
    print(f"Article {index}:")
    print("Title:", article["Title"])
    print("Venue:", article["Venue"])
    print("Year:", article["Year"])
    print("Authors:", article["Authors"])
    print("Abstract:", article["Abstract"])
    print("\n")

Article 1:
Title: Information retrieval as statistical translation
Venue: A Berger, J Lafferty - ACM SIGIR Forum, 2017 - dl.acm.org
Year: dl.acm.org
Authors: A Berger, J Lafferty - ACM SIGIR Forum, 2017
Abstract: … There is a large literature on probabilistic approaches to information retrieval, and we will 
not attempt to survey it here. Instead, we focus on the language modeling approach introduced …


Article 2:
Title: [BOOK][B] Information retrieval: Implementing and evaluating search engines
Venue: S Buttcher, CLA Clarke, GV Cormack - 2016 - books.google.com
Year: books.google.com
Authors: S Buttcher, CLA Clarke, GV Cormack
Abstract: … Information retrieval forms the foundation for modern search engines. In this textbook we 
provide an introduction to information retrieval targeted at graduate students and working …


Article 3:
Title: A language modeling approach to information retrieval
Venue: JM Ponte, WB Croft - ACM SIGIR Forum, 2017 - dl.acm.org
Year: dl.acm.org
Authors: JM P
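
In the output above, the Venue field repeats the whole byline and the Year field holds the source site (for example `dl.acm.org`), because the byline is split on `" - "` and the last piece is taken. One possible way to separate the three fields is sketched below. The helper `parse_gs_a` is hypothetical, and it assumes the `authors - venue, year - source` layout visible in the printed bylines; other layouts would need extra handling.

```python
import re


def parse_gs_a(line):
    """Split a Google Scholar byline into authors, venue, and year.

    Assumes the 'authors - venue, year - source' layout seen in the output above.
    """
    # Normalise non-breaking spaces so every separator reads as " - "
    parts = [p.strip() for p in line.replace("\xa0", " ").split(" - ")]
    authors = parts[0]
    middle = parts[1] if len(parts) > 1 else ""
    # Look for a four-digit year in the middle segment
    match = re.search(r"\b(19|20)\d{2}\b", middle)
    year = int(match.group()) if match else None
    # Whatever precedes the year in the middle segment is treated as the venue
    venue = middle[: match.start()].rstrip(", ") if match else middle
    return {"Authors": authors, "Venue": venue, "Year": year}


# Bylines copied from the printed output above
bylines = [
    "A Berger, J Lafferty - ACM SIGIR Forum, 2017 - dl.acm.org",
    "S Buttcher, CLA Clarke, GV Cormack - 2016 - books.google.com",
]
parsed = [parse_gs_a(line) for line in bylines]
for entry in parsed:
    print(entry)
```

Entries without a venue, such as the book result, come back with an empty `Venue` string, and a byline with no recognisable year yields `None` for `Year`.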

Complete either one of the two Question 4 tasks given below.

Question 4 (10 points): Write python code to collect 1000 posts from Twitter, or Facebook, or Instagram. You can either use hashtags, keywords, user_name, user_id, or other information to collect the data.

The following information needs to be collected:

(1) User_name

(2) Posted time

(3) Text

In [None]:
# Your code here (Please add comments in the code):




Question 4 (10 points):

In this task, you are required to identify and utilize online tools for web scraping data from websites without the need for coding, with a specific focus on Parsehub. The objective is to gather data and save it in formats like CSV, Excel, or any other suitable file format.

Provide an introduction to whichever tool you choose, the steps to follow for web scraping, and the final output of the data collected.

Upload a document (Word or PDF file) to the same repository and add its link in the .ipynb file.

https://github.com/mounicahandana/MounicaSiriChandana_INFO5731_Fall2023/blob/main/In_class_exercise/Tamalampudi_Exercise_02.docx

https://console.apify.com/actors/nFJndFXA5zjCTuudP/runs/YrLCFo33eljstVFJU#output


Apify is a helpful tool for making web scraping and data gathering from websites easier. It's user-friendly and versatile, designed to help users quickly get information from the internet. It's handy for various purposes like research, content gathering, and analyzing competitors.

Here are the steps to use Apify for web scraping:

1. **Create a Task**: Start by making a new task in Apify. A task is like a set of instructions that tells Apify what data to collect and how to do it. You can make your own task or choose from ready-made ones.

2. **Define Input**: Tell Apify what to scrape, like the website's link, what data to grab (like text, pictures, or links), and any special actions (like clicking buttons or filling forms).

3. **Configure Scrapers**: Apify has scraping tools like actors and crawlers. You can set them up to pull data from websites. You can point out what you want to scrape using selectors or expressions.

4. **Pagination Setup**: If the site has many pages of data, Apify can deal with it for you. It'll make sure all the needed pages are scraped.

5. **Run the Task**: Start the scraping task on Apify. It'll follow your instructions and collect data from the site.

6. **Data Collection**: As data is scraped, Apify organizes it neatly, usually in JSON or CSV files. You can keep it on Apify's cloud or download it for your own use.

In simple terms, Apify is a handy tool to gather data from websites. You tell it what you need, and it does the hard work for you, saving the data in an easy-to-use format.

The provided data is a tabular representation of search results obtained through a web scraping operation. Here's a brief explanation of each column:

1. `searchQuery/countryCode`: The country code used for the search (e.g., US).
2. `searchQuery/device`: The type of device used for the search (e.g., DESKTOP).
3. `searchQuery/domain`: The domain or website from which the search results were collected (e.g., google.com).
4. `searchQuery/languageCode`: The language code used for the search.
5. `searchQuery/locationUule`: Location information related to the search.
6. `searchQuery/page`: The page number of the search results.
7. `searchQuery/resultsPerPage`: The number of results displayed per page.
8. `searchQuery/term`: The search query or keyword used (e.g., "web scraping").
9. `searchQuery/type`: The type of search query (e.g., SEARCH).
10. `searchQuery/url`: The URL of the search query.
11. `resultsTotal`: The total number of search results found for the query.
12. `description`: A description or snippet related to the search result.
13. `displayedUrl`: The displayed URL of the search result.
14. `position`: The position or rank of the search result on the page.
15. `title`: The title of the search result.
16. `type`: The type of result (e.g., organic).
17. `url`: The URL of the search result.
18. `date`: The date associated with the search result, often indicating the publication date.
19. `emphasizedKeywords/0`, `emphasizedKeywords/1`, `emphasizedKeywords/2`, `emphasizedKeywords/3`: Keywords or terms emphasized in the search result.
20. `productInfo/rating`: Rating associated with the product or result.
21. `productInfo/numberOfReviews`: The number of reviews for the product or result.
22. `productInfo/price`: Price information associated with the product or result.

This data represents search results for the query "web scraping" on Google. It includes details about each search result, such as the title, URL, and publication date, and can be used for purposes such as analyzing search result rankings or extracting relevant information.
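
The slash-separated column names (such as `searchQuery/term`) indicate nested fields that were flattened for the tabular export. A minimal sketch of reading such a CSV and rebuilding the nesting is shown below. The helper `nest_row` is hypothetical, and the sample rows are made-up placeholders with only a handful of the columns listed above, not the collected data.

```python
import csv
import io


def nest_row(flat_row):
    """Turn {'searchQuery/term': 'x'} style keys into nested dictionaries."""
    nested = {}
    for key, value in flat_row.items():
        target = nested
        *parents, leaf = key.split("/")
        # Walk (or create) one dictionary level per path component
        for part in parents:
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested


# Placeholder export with made-up values; a real file would have every column
sample_csv = (
    "searchQuery/term,searchQuery/page,position,title,url\n"
    "web scraping,1,1,Example result,https://example.com/a\n"
    "web scraping,1,2,Another result,https://example.com/b\n"
)

rows = [nest_row(r) for r in csv.DictReader(io.StringIO(sample_csv))]
# CSV values are strings, so convert position before comparing ranks
top_result = min(rows, key=lambda r: int(r["position"]))
print(top_result["searchQuery"]["term"], top_result["title"])
```

For a downloaded export, `io.StringIO(sample_csv)` would be replaced by an opened file. Indexed columns such as `emphasizedKeywords/0` end up as dictionaries keyed by `"0"`, `"1"`, and so on under this simple scheme.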
