# Web Scraping (Data Collection)
The `requests` library, together with `furl`, was used to fetch HTML content from Google Scholar for machine learning articles published from 2022 onward (the `as_ylo=2022` URL parameter). `BeautifulSoup` was used to parse the returned pages.

The following cell defines two key functions: `get_all_articles` and `get_article_data`. The first takes a starting position in Google Scholar's search results and returns the list of elements representing the individual research articles on that page. The second takes a single article element and extracts its title, its publication details, and the text of its "Cited by" link, which holds the citation count.

To scrape multiple pages of results, the cell advanced the URL's `start` parameter by 10 per request until a page returned no article elements. To avoid overloading Google Scholar's servers, a 5-second delay was inserted between requests. Intermediate results were stored as tuples in a list, which was converted to a pandas DataFrame once the loop finished. The run shown here resumes from position 770 and appends its rows to the existing CSV file.
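
The pagination pattern described above can be sketched without touching the network. This is a minimal illustration, not the notebook's code: `page_url`, `scrape_all`, and the stub `fake_fetch` are hypothetical names, the URL is rebuilt with the standard library's `urllib.parse` instead of `furl`, and the page size of 10 is an assumption taken from the step used in the loop.

```python
import time
from typing import Callable, List
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

BASE_URL = "https://scholar.google.com/scholar?start=0&q=machine+learning&hl=en&as_sdt=0,20&as_ylo=2022"


def page_url(base_url: str, start: int) -> str:
    """Return base_url with its `start` query parameter set to `start`."""
    parts = urlsplit(base_url)
    query = parse_qs(parts.query)
    query["start"] = [str(start)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def scrape_all(fetch: Callable[[int], List[str]], start: int = 0,
               page_size: int = 10, delay: float = 5.0) -> List[str]:
    """Collect items page by page until a page comes back empty."""
    collected: List[str] = []
    items = fetch(start)
    while items:
        collected.extend(items)
        start += page_size
        time.sleep(delay)  # pause before the next request to limit server load
        items = fetch(start)
    return collected


def fake_fetch(start: int) -> List[str]:
    """Hypothetical stub standing in for an HTTP request: 25 fake results in total."""
    total = 25
    return [f"article {i}" for i in range(start, min(start + 10, total))]


results = scrape_all(fake_fetch, delay=0.0)  # no delay needed for the stub
print(len(results))
print(page_url(BASE_URL, 20))
```

Passing the fetch function as an argument keeps the loop testable: the same `scrape_all` could be given a function that performs the real request, with the delay restored to 5 seconds.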

In [2]:
import os
import time

import furl
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Base search URL: query "machine learning", articles from 2022 onward (as_ylo=2022).
url = furl.furl("https://scholar.google.com/scholar?start=0&q=machine+learning&hl=en&as_sdt=0,20&as_ylo=2022")

def get_all_articles(start: int):
    """Return the article elements on the results page that begins at `start`."""
    url.args["start"] = start
    response = requests.get(url=url.url, timeout=30)
    soup = BeautifulSoup(response.content, "lxml")
    return soup.find_all("div", class_="gs_ri")

def get_article_data(article):
    """Extract the title, publication info, and "Cited by" text from one article element."""
    title = article.find("h3", class_="gs_rt").get_text(strip=False)
    publication_info = article.find("div", class_="gs_a").get_text(strip=False)
    citation_bar = article.find("div", class_="gs_fl gs_flb").find_all("a")
    # cited_by stays None when the article has no "Cited by" link.
    cited_by = next((cite.text for cite in citation_bar if "Cited by" in cite.text), None)
    print(title, publication_info, cited_by)
    return title, publication_info, cited_by

CSV_PATH = "ml_articles_info.csv"

raw_data = []
start = 770  # resume position; results are paged in steps of 10
all_articles = get_all_articles(start)
while len(all_articles) > 0:
    print(start)
    raw_data.extend(get_article_data(article) for article in all_articles)
    start += 10
    time.sleep(5.0)  # pause between requests to limit load on the server
    all_articles = get_all_articles(start)

data = pd.DataFrame(data=raw_data, columns=["title", "publication_info", "cited_by"])
# Append to the existing file, writing the header row only when the file is new.
data.to_csv(CSV_PATH, mode="a", header=not os.path.exists(CSV_PATH))

770
Enhanced membership inference attacks against machine learning models J Ye, A Maddi, SK Murakonda… - Proceedings of the …, 2022 - dl.acm.org Cited by 216
Web application based Diabetes prediction using Machine Learning GR Kumar, RV Reddy, M Jayarathna… - … on Advances in …, 2023 - ieeexplore.ieee.org Cited by 33
Fraudulent financial transactions detection using machine learning MMM Megdad, SS Abu-Naser, BS Abu-Nasser - 2022 - philpapers.org Cited by 50
Adversarial machine learning for network intrusion detection: A comparative study H Jmila, MI Khedher - Computer Networks, 2022 - Elsevier Cited by 56
Machine learning algorithms in the environmental corrosion evaluation of reinforced concrete structures-A review H Jia, G Qiao, P Han - Cement and Concrete Composites, 2022 - Elsevier Cited by 41
[HTML][HTML] Interpretable Ensemble-Machine-Learning models for predicting creep behavior of concrete M Liang, Z Chang, Z Wan, Y Gan, E Schlangen… - Cement and Concrete …, 2022 - Elsevier Cite

# Data Wrangling and Exploration
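The cleaning cell in this section hinges on two small operations: splitting Google Scholar's `publication_info` string into authors, venue, year, and publisher, and dividing the citation count by the number of years since publication. The sketch below isolates both under stated assumptions. `parse_publication_info` and `citations_per_year` are hypothetical helper names, the input is assumed to follow the `authors - venue, year - publisher` pattern seen in the scraped output, and the trailing comma is stripped from the venue, which the notebook's own cell does not do.

```python
import re
from typing import Optional, Tuple


def parse_publication_info(info: str) -> Optional[Tuple[str, Optional[str], int, str]]:
    """Split 'authors - venue, year - publisher' into its parts.

    Returns None when the string does not have three parts or has no year.
    """
    # Separators are " - " or a non-breaking space followed by "-".
    parts = [p.strip() for p in re.split(r"\xa0-| - ", info)]
    if len(parts) != 3:
        return None
    authors, venue_year, publisher = parts
    year_match = re.search(r"(\d{4})\s*$", venue_year)
    if year_match is None:
        return None
    # Everything before the year is the venue; it may be empty.
    venue = venue_year[:year_match.start()].rstrip(" ,") or None
    return authors, venue, int(year_match.group(1)), publisher


def citations_per_year(count: int, year: int, current_year: int) -> float:
    """Average citations per year, counting the publication year as year one."""
    return count / (current_year - year + 1)


sample = "H Jmila, MI Khedher - Computer Networks, 2022 - Elsevier"
print(parse_publication_info(sample))
# 2024 is an illustrative value for the current year, not taken from the notebook.
print(citations_per_year(56, 2022, 2024))
```

Taking the current year as an argument, instead of reading the clock inside the function, keeps the average reproducible when the analysis is rerun later.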

In [67]:
import datetime
import re

import pandas as pd

YEAR = datetime.datetime.now().year

data = pd.read_csv("ml_articles_info.csv", index_col=0)

clean_data = []
for _, row in data.iterrows():
    # Drop bracketed tags such as "[HTML]" and surrounding quotes from the title.
    clean_title = re.sub(r"\[.*?\]", "", str(row["title"])).strip().strip('"')

    # publication_info has the form "authors - venue, year - publisher";
    # the separator is " - " or a non-breaking space followed by "-".
    all_publication_info = str(row["publication_info"]).split(" - ")
    all_publication_info = [element.split("\xa0-") for element in all_publication_info]
    all_publication_info = [item for sublist in all_publication_info for item in sublist]
    if len(all_publication_info) != 3:
        continue  # skip rows that do not match the expected pattern
    authors, venue_year, publisher = all_publication_info

    match = re.match(r"(.*?)\s+(\d{4})", venue_year)
    if match:
        venue, year = match.groups()
        year = int(year)
    else:
        # No "venue year" pair: look for each part separately.
        venue = re.findall(r"(.*?)\s", venue_year)
        venue = venue[0] if venue else None
        year = re.findall(r"(\d{4})", venue_year)
        year = int(year[0]) if year else None
    if year is None:
        continue  # citations per year cannot be computed without a year

    # cited_by holds text like "Cited by 216"; it may be missing (NaN) in the CSV.
    match = re.findall(r"(\d+)", str(row["cited_by"]))
    if len(match) != 1:
        continue
    citation_count = int(match[0])
    # Average over the years since publication, counting the publication year itself.
    citation_per_year = citation_count / (YEAR - year + 1)

    clean_data.append([clean_title, authors, year, venue, publisher, citation_count, citation_per_year])

clean_data = pd.DataFrame(clean_data, columns=["title", "authors", "year", "venue", "publisher", "citation_count", "avg_citations_per_year"])
clean_data.to_csv("ml_articles_info-cleaned.csv")
print(clean_data)

                                                 title  \
0           A guide to machine learning for biologists   
1                    Open-environment machine learning   
2    Machine learning and deep learning: A review o...   
3         International conference on machine learning   
4      Probabilistic machine learning: an introduction   
..                                                 ...   
990  Biomimetic Mechanical Robust Cement‐Resin Comp...   
991  A machine learning‐assisted multifunctional ta...   
992            Tutorial on multimodal machine learning   
993  Técnicas y aplicaciones del Machine Learning e...   
994  Perbandingan Algoritma Machine Learning untuk ...   

                                             authors  year  \
0                JG Greener, SM Kandathil, L Moffat…  2022   
1                                            ZH Zhou  2022   
2                               K Sharifani, M Amini  2023   
3                      W Li, C Wang, G Cheng, Q Song  2