# Web Scraping (Data Collection)
The `requests` library, together with the `furl` library, was used to fetch HTML content from Google Scholar for machine learning articles published in 2022 or later (the `as_ylo=2022` query parameter).
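As a minimal sketch of how the `start` query parameter selects a results page, the snippet below rewrites it using only the standard library. The notebook itself uses `furl` for this; `with_start` and `BASE_URL` are hypothetical names introduced here for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Same search URL as in the notebook: query "machine learning", as_ylo=2022.
BASE_URL = (
    "https://scholar.google.com/scholar"
    "?start=0&q=machine+learning&hl=en&as_sdt=0,20&as_ylo=2022"
)


def with_start(url: str, start: int) -> str:
    """Return `url` with its `start` query parameter set to `start`."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))  # keeps the original parameter order
    params["start"] = str(start)
    # safe="," leaves the comma in as_sdt=0,20 unescaped.
    return urlunsplit(parts._replace(query=urlencode(params, safe=",")))


page_3_url = with_start(BASE_URL, 20)  # third page, at 10 results per page
```

Only the offset changes between requests; the query, language, and year filter stay fixed.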

The following cell defined two key functions: `get_all_articles` and `get_article_data`. The first took a starting offset into Google Scholar's search results and returned the list of elements on that page, one per research article. The second took a single article element and extracted its title, its publication details, and the text of its "Cited by" link, from which the citation count was parsed later.

To scrape multiple pages of results, the cell advanced the `start` offset by 10 after each request and stopped when a page returned no article elements. To avoid overloading Google Scholar's servers, a 5-second delay was inserted between requests. Intermediate data were stored as tuples in a list, which was converted to a pandas DataFrame once the loop finished. As shown, the loop begins at offset 770 and the CSV file is opened in append mode, which suggests the cell was last run to resume an earlier, interrupted scrape.
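The pagination logic described above can be sketched independently of Google Scholar. In this illustration, `scrape_all_pages` and its parameters are hypothetical names; the page fetcher and the sleep function are passed in, so the loop can be exercised without network access.

```python
import time
from typing import Callable, List, Sequence


def scrape_all_pages(
    fetch_page: Callable[[int], Sequence],
    start: int = 0,
    page_size: int = 10,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List:
    """Collect items page by page until a page comes back empty."""
    rows: List = []
    items = fetch_page(start)
    while len(items) > 0:
        rows.extend(items)
        start += page_size    # move to the next page of results
        sleep(delay_seconds)  # pause before the next request
        items = fetch_page(start)
    return rows


# Usage with an in-memory stand-in for the real fetcher:
fake_pages = {0: ["a", "b"], 10: ["c"]}
collected = scrape_all_pages(lambda s: fake_pages.get(s, []), delay_seconds=0)
```

One caveat of this stopping rule is that an empty page is indistinguishable from a blocked or failed request, so a scrape can end early without an error.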

In [None]:
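import time
from pathlib import Path

import furl
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Search URL: query "machine learning", results from 2022 onward (as_ylo=2022).
url = furl.furl("https://scholar.google.com/scholar?start=0&q=machine+learning&hl=en&as_sdt=0,20&as_ylo=2022")
OUTPUT_PATH = Path("ml_articles_info.csv")

def get_all_articles(start: int):
    """Fetch the results page at offset `start` and return its article elements."""
    url.args["start"] = start
    response = requests.get(url=url.url, timeout=30)
    soup = BeautifulSoup(response.content, "lxml")
    return soup.find_all("div", class_="gs_ri")

def get_article_data(article):
    """Extract (title, publication_info, cited_by) from one article element."""
    title = article.find("h3", class_="gs_rt").get_text()
    publication_info = article.find("div", class_="gs_a").get_text()
    # The footer row holds the "Cited by N" link; guard against it being absent.
    footer = article.find("div", class_="gs_fl gs_flb")
    links = footer.find_all("a") if footer is not None else []
    cited_by = next((link.text for link in links if "Cited by" in link.text), None)
    print(title, publication_info, cited_by)
    return title, publication_info, cited_by

raw_data = []
start = 770  # starting offset (770 here, presumably resuming an earlier run)
all_articles = get_all_articles(start)
while len(all_articles) > 0:
    print(start)
    raw_data.extend(get_article_data(article) for article in all_articles)
    start += 10  # advance to the next page of results
    time.sleep(5.)  # pause between requests to limit server load
    all_articles = get_all_articles(start)

data = pd.DataFrame(data=raw_data, columns=["title", "publication_info", "cited_by"])
# Append to any existing file, writing the header row only when the file is new.
data.to_csv(OUTPUT_PATH, mode="a", header=not OUTPUT_PATH.exists())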

# Data Wrangling and Exploration

In [74]:
import datetime
import re

import pandas as pd

YEAR = datetime.datetime.now().year

data = pd.read_csv("ml_articles_info.csv", index_col=0)

clean_data = []
for _, row in data.iterrows():
    # Drop bracketed tags such as "[HTML]" or "[BOOK]" from the title.
    clean_title = re.sub(r"\[.*?\]", "", str(row["title"])).strip().strip('"')

    # publication_info is split into authors, venue/year, and publisher;
    # the separator is " - " or a non-breaking space followed by "-".
    all_publication_info = str(row["publication_info"]).split(" - ")
    all_publication_info = [element.split("\xa0-") for element in all_publication_info]
    all_publication_info = [item for sublist in all_publication_info for item in sublist]
    if len(all_publication_info) != 3:
        continue  # skip rows that do not split into exactly three fields
    authors, venue_year, publisher = all_publication_info

    match = re.match(r"(.*?)\s+(\d{4})", venue_year)
    if match:
        venue, year = match.groups()
    else:
        venue_match = re.search(r"(.*?)\s", venue_year)
        venue = venue_match.group(1) if venue_match else None
        year_match = re.search(r"(\d{4})", venue_year)
        year = year_match.group(1) if year_match else None
    if year is None:
        continue  # the per-year average needs a publication year
    year = int(year)

    # cited_by is text like "Cited by 345"; it is NaN when the link was missing.
    match = re.findall(r"(\d+)", str(row["cited_by"]))
    if len(match) != 1:
        continue
    citation_count = int(match[0])
    # The publication year itself counts as one year, hence the +1.
    citation_per_year = citation_count / (YEAR - year + 1)

    clean_data.append([clean_title, authors, year, venue, publisher, citation_count, citation_per_year])

clean_data = pd.DataFrame(clean_data, columns=["title", "authors", "year", "venue", "publisher", "citation_count", "avg_citations_per_year"])
clean_data.to_csv("ml_articles_info-cleaned.csv")
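The publication-info parsing in the cell above can be isolated into a small function for testing. This is a sketch under assumptions: the input is taken to look like `authors\xa0- venue, year - publisher`, the sample strings are reconstructed from rows of the cleaned data rather than copied from raw scraper output, and `parse_publication_info` is a hypothetical helper. Unlike the notebook cell, it also strips the trailing comma from the venue.

```python
import re
from typing import Optional, Tuple


def parse_publication_info(info: str) -> Optional[Tuple[str, Optional[str], int, str]]:
    """Split a publication-info string into (authors, venue, year, publisher).

    Returns None when the string does not have three fields or has no year.
    """
    # Normalise the non-breaking-space separator, then split on " - ".
    parts = [p.strip() for p in info.replace("\xa0-", " - ").split(" - ")]
    if len(parts) != 3:
        return None
    authors, venue_year, publisher = parts

    # Assumption: the year is the last four-digit number in the middle field.
    year_match = re.search(r"(\d{4})\s*$", venue_year)
    if year_match is None:
        return None
    year = int(year_match.group(1))

    # Whatever precedes the year is the venue; it may be empty (e.g. books).
    venue = venue_year[: year_match.start()].strip().rstrip(",").strip() or None
    return authors, venue, year, publisher


journal = parse_publication_info("S Caton, C Haas\xa0- ACM Computing Surveys, 2024 - dl.acm.org")
book = parse_publication_info("KP Murphy\xa0- 2023 - books.google.com")
```

Returning `None` for malformed rows mirrors the notebook's choice to skip them rather than guess at missing fields.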

In [87]:
from IPython.display import display, HTML

print("Articles published in 2023 or later with a total citation count exceeding 300:")
display(clean_data.loc[(clean_data["citation_count"] > 300) & (clean_data["year"] >= 2023)])

Articles published in 2023 or later with a total citation count exceeding 300:


| index | title | authors | year | venue | publisher | citation_count | avg_citations_per_year |
|---|---|---|---|---|---|---|---|
| 3 | International conference on machine learning | W Li, C Wang, G Cheng, Q Song | 2023 | Transactions on machine learning …, | par.nsf.gov | 1568 | 784.0 |
| 8 | Probabilistic machine learning: Advanced topics | KP Murphy | 2023 | | books.google.com | 345 | 172.5 |
| 10 | Human-in-the-loop machine learning: a state of... | E Mosqueira-Rey, E Hernández-Pereira… | 2023 | Artificial Intelligence …, | Springer | 336 | 168.0 |
| 11 | Machine learning operations (mlops): Overview,... | D Kreuzberger, N Kühl, S Hirschl | 2023 | IEEE access, | ieeexplore.ieee.org | 399 | 199.5 |
| 27 | Understanding of machine learning with deep le... | MM Taye | 2023 | Computers, | mdpi.com | 315 | 157.5 |
| 31 | Fairness in machine learning: A survey | S Caton, C Haas | 2024 | ACM Computing Surveys, | dl.acm.org | 641 | 641.0 |
| 46 | Coronavirus disease (COVID-19) cases analysis ... | AS Kwekha-Rashid, HN Abduljabbar, B Alhayani | 2023 | Applied Nanoscience, | Springer | 328 | 164.0 |
| 75 | Artificial intelligence and machine learning i... | CJ Haug, JM Drazen | 2023 | New England Journal of Medicine, | Mass Medical Soc | 551 | 275.5 |
| 81 | Artificial intelligence, machine learning and ... | M Soori, B Arezoo, R Dastres | 2023 | Cognitive Robotics, | Elsevier | 409 | 204.5 |
| 247 | Machine learning advances for time series fore... | RP Masini, MC Medeiros… | 2023 | Journal of economic …, | Wiley Online Library | 305 | 152.5 |
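The `avg_citations_per_year` column follows from dividing the citation count by the article's age in years, counting the publication year as one year. The sketch below restates that formula with a hypothetical helper, `citations_per_year`. The reference year of 2024 is an inference from the table (for example, 1568 / 2 = 784.0 for a 2023 article), not something stated in the notebook.

```python
def citations_per_year(citation_count: int, publication_year: int, current_year: int) -> float:
    """Average citations per year, counting the publication year as year one."""
    age_in_years = current_year - publication_year + 1
    if age_in_years <= 0:
        raise ValueError("publication_year lies after current_year")
    return citation_count / age_in_years


# Assumed reference year, inferred from the output table.
REFERENCE_YEAR = 2024

avg_2023_article = citations_per_year(1568, 2023, REFERENCE_YEAR)
avg_2024_article = citations_per_year(641, 2024, REFERENCE_YEAR)
```

Because the divisor depends on the year the notebook is run, re-running the cleaning cell in a later year yields lower averages for the same citation counts.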
