# Arxiv Explorer Tools - minimal weighted match
- Fast:
    - 1-2 sec to run in a local jupyter notebook
    - ~5-10 sec to run in google colab
    - vs. 5-10 min for embedding or TFIDF search
- extracts articles on topics of interest from the too-many-to-look-through daily pages of articles that come out each day.
- minimal weighted match uses a list of phrases and an weight for each
- saves search results to json (for automation later) and html (for easy reading and linking)
- saves all articles for archiving
- multi-topic: use as many pre-set seaches as you want
- set score_floor and top_n to filter which results you see
- arxiv site reading uses 'beautiful soup'

### Setup & Install:
- have python installed and use an python env
- use a jupyter notebook or script, etc.
- for specialty topics you can create extensive weighted search profiles.

Note: should be able to run as a script or in a server, but notebooks are useful

### For more on Arxiv search tools, See:
- https://medium.com/@GeoffreyGordonAshbrook/search-with-non-generative-ai-d0a3cc77164b
- https://github.com/lineality/arxiv_explorer_tools

# Build Instructions

Put this notebook (the .ipynb file) or a script (.py file) into your project directory.

requirements.txt ->
```
requests
scikit-learn
scipy
numpy
beautifulsoup4
jupyter
```
- https://pypi.org/project/beautifulsoup4/

#### make python env
```bash
python -m venv env; source env/bin/activate
python -m pip install --upgrade pip; python -m pip install -r requirements.txt
python -m pip install git+https://github.com/psf/black pylint pydocstyle flake8
```

#### run notebook
```bash
jupyter notebook
```

#### use notebook
- Select Notebook
- Run all Cells
- view and use printed and file-saved results

# setup

In [28]:
import re  # standard library
import time  # standard library
from datetime import datetime  # standard library
from bs4 import BeautifulSoup  # pip install beautifulsoup4
import requests  # standard library
import json  # standard library

In [29]:
"""
Code-Bundle For Time:
 Commented-out code is for use in different places in the code.
"""


def duration_min_sec(start_time, end_time):

    duration = end_time - start_time

    duration_seconds = duration.total_seconds()

    minutes = int(duration_seconds // 60)
    seconds = duration_seconds % 60
    time_message = f"{minutes}_min__{seconds:.1f}_sec"

    return time_message


"""
Start (and Stop) your Time Tracking:
"""
start_time_whole_single_task = datetime.now()
# end_time_whole_single_task = datetime.now()

"""
Tally time at end.
"""
# # start_time_whole_single_task = datetime.now()
# end_time_whole_single_task = datetime.now()
# duration_time = duration_min_sec(start_time_whole_single_task, end_time_whole_single_task)
# print(f"Duration to run -> {duration_time}")

'\nTally time at end.\n'

# minimal weighted matching code

In [30]:
# An simplistic basic key word search (with optional weights)

def minimal_wieghted_match_score(one_document, keyword_weights):
    """
    Simple weight score, lowercase.

    """

    score = 0

    try:
        # Make the document lowercase and strip all symbols, spaces, and newline characters
        match_this_cleaned_document = re.sub(r'[^\w\s]', '', one_document.lower()).replace('\n', '').replace(' ','')

        # print(match_this_cleaned_document)
        for keyword, weight in keyword_weights:

            # Make the keyword lowercase and strip all symbols, spaces, and newline characters
            match_this_cleaned_keyword = re.sub(r'[^\w\s]', '', keyword.lower()).replace('\n', '').replace(' ','')

            # print(match_this_cleaned_keyword)
            # Check if the keyword-phrase is in the document
            if match_this_cleaned_keyword in match_this_cleaned_document:
                # If the keyword-phrase is in the document, add its weight to the score
                score += weight

        return score
    except Exception as e:
        print(f"{str(e)}: {one_document} {type(one_document)}")


def rank_documents_on_weighted_matches(documents, keyword_weights):
    """
    Ranks documents based on the presence of weighted keywords-phrases.
    comparison looks at text without:
    - captialization
    - spaces
    - newlines
    - special symbols

    Parameters:
    - documents (list of str): The list of documents to be ranked.
    - keyword_weights (list of tuple): A list of tuples,
       where the first element is the keyword and the
       second element is the corresponding weight.

    Returns:
    list of (str, float): A list of tuples, where the first element is the document and the
    second element is the ranking score.
    """
    """
    string cleaning steps:
    - lower
    - strip extra spaces
    - remove symbols
    - remove newlines
    """

    ranked_documents = []

    for document in documents:
        score = 0

        # Make the document lowercase and strip all symbols, spaces, and newline characters
        match_this_cleaned_document = re.sub(r'[^\w\s]', '', document.lower()).replace('\n', '').replace(' ','')

        # print(match_this_cleaned_document)
        for keyword, weight in keyword_weights:

            # Make the keyword lowercase and strip all symbols, spaces, and newline characters
            match_this_cleaned_keyword = re.sub(r'[^\w\s]', '', keyword.lower()).replace('\n', '').replace(' ','')

            # print(match_this_cleaned_keyword)
            # Check if the keyword-phrase is in the document
            if match_this_cleaned_keyword in match_this_cleaned_document:
                # If the keyword-phrase is in the document, add its weight to the score
                score += weight

        ranked_documents.append((document, score))

    # Sort the documents by their ranking scores in descending order
    ranked_documents.sort(key=lambda x: x[1], reverse=True)

    return ranked_documents


# ################
# # Example usage
# ################
# corpus = [
#     "This is the first document about machine learning.",
#     "The second document discusses data analysis and visualization.",
#     "The third document focuses on natural language processing.",
#     "The fourth document talks about deep learning and neural networks.",
#     """to test line breaks
#     Emotion mining
#      data
#     analysis
#     Keywords: emotion mining, sentiment analysis, natural disasters, psychology, technological disasters""",
# ]

# one_doc = "This is the first document about machine learning."

# keyword_weights = [("machine learning", 3), ("data analysis", 2), ("natural language processing", 4), ("deep learning", 5), ("neural networks", 6)]

# ranked_documents = rank_documents_on_weighted_matches(corpus, keyword_weights)

# for document, score in ranked_documents:
#     print(f"Document: {document}\nScore: {score}\n")

# one_score = minimal_wieghted_match_score(one_doc, keyword_weights)

# print(one_score)

Document: The fourth document talks about deep learning and neural networks.
Score: 11

Document: The third document focuses on natural language processing.
Score: 4

Document: This is the first document about machine learning.
Score: 3

Document: The second document discusses data analysis and visualization.
Score: 2

Document: to test line breaks
    Emotion mining
     data
    analysis
    Keywords: emotion mining, sentiment analysis, natural disasters, psychology, technological disasters
Score: 2

3


# Arxiv Explorerer


In [31]:
###################
# Arxiv Explorerer
###################
# step 1: embed the search-phrase
# step 2: embed each text
# step 3: get scores
# step 4: evaluates if score is succss or fail
# step 5: if success: do stuff with text


# # Imports
# from bs4 import BeautifulSoup  # pip install beautifulsoup4
# import requests  # standard library
# import json  # standard library
# from datetime import datetime  # standard library

## Get Article Corpus

In [32]:
start_segment_time = datetime.now()

#####################
# Get Article Corpus
#####################

# List to hold all article data
article_data = []

# # Make a request to the website
r = requests.get('https://arxiv.org/list/cs/new')

url = "https://arxiv.org/list/cs/new"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# # Find all the articles
articles = soup.find_all('dt')

# # Find all the titles
articles_title = soup.find_all('div', {'class': 'list-title mathjax'})

# Find all the subject on the page
articles_subject = soup.find_all('dd')


###############
# make corpus
###############

corpus = []
report_list = []
article_dicts = []

for this_index, article in enumerate(articles):

    ################################################
    # Extract each field of data about each article
    ################################################

    # Extract the title
    title = articles_title[this_index].text.split('Title:')[1].strip()

    # Extract the subjects
    subjects = articles_subject[this_index].find('span', {'class': 'primary-subject'}).text

    arxiv_id = article.find('a', {'title': 'Abstract'}).text.strip()

    abstract_p = article.find_next_sibling('dd').find('p', {'class': 'mathjax'})

    # Extract the abstract
    if abstract_p:
        abstract = abstract_p.text.strip()
    else:
        abstract = ""

    pdf_link_segment = article.find('a', {'title': 'Download PDF'})['href']

    arxiv_id = article.find('a', {'title': 'Abstract'}).text.strip()
    pdf_link = f"https://arxiv.org{pdf_link_segment}"
    paper_link = f"https://arxiv.org/abs/{arxiv_id[6:]}"

    # extracted_article_string = title + " " + abstract + " " + str(subjects)

    # assemble corpus
    article_characters = f"{this_index}|||| "

    article_characters += f"\n'arxiv_id': {arxiv_id}, "
    article_characters += f"\n'paper_link': {paper_link}, "
    article_characters += f"\n'pdf_link': {pdf_link}, "

    article_characters += "\nTitle: " + title + " "
    article_characters += "\nSubjects: " + subjects + " "
    article_characters += "\nAbstract: " + abstract

    ##################################
    # Make Bundles (sharing an index)
    ##################################

    # # add to corpus: just the meaningful text
    # corpus.append(extracted_article_string)

    # add to simple report_list: includes link and article ID info
    report_list.append(article_characters)

    # Append the data to the list
    article_dicts.append({
        'title': title,
        'abstract': abstract,
        'paper_link': paper_link,
        'pdf_link': pdf_link,
        'subjects': subjects,
        'arxiv_id': arxiv_id,
        'article_sequence_index': this_index,
    })

    # using this because only basic search works
    corpus = report_list


# # Segment Timer
# start_segment_time = datetime.now()
end_segment_time = datetime.now()
duration_time = duration_min_sec(start_segment_time, end_segment_time)
print(f"Duration to run segment -> {duration_time}")

# ALL Save the data to a JSON file
date_time = datetime.now()
all_article_dicts_clean_timestamp = date_time.strftime('%Y-%m-%d__%H%M%S%f')
with open(f'all_arxiv_article_dicts_{all_article_dicts_clean_timestamp}.json', 'a') as f:
    json.dump(article_dicts, f)

Duration to run segment -> 0_min__2.4_sec


In [33]:
# inspection (size of corpus)
len(corpus)

779

In [34]:
corpus[0]

"0|||| \n'arxiv_id': arXiv:2412.14174, \n'paper_link': https://arxiv.org/abs/2412.14174, \n'pdf_link': https://arxiv.org/pdf/2412.14174, \nTitle: Steering Large Text-to-Image Model for Abstract Art Synthesis: Preference-based Prompt Optimization and Visualization \nSubjects: Human-Computer Interaction (cs.HC) \nAbstract: With the advancement of neural generative capabilities, the art community has increasingly embraced GenAI (Generative Artificial Intelligence), particularly large text-to-image models, for producing aesthetically compelling results. However, the process often lacks determinism and requires a tedious trial-and-error process as users often struggle to devise effective prompts to achieve their desired outcomes. This paper introduces a prompting-free generative approach that applies a genetic algorithm and real-time iterative human feedback to optimize prompt generation, enabling the creation of user-preferred abstract art through a customized Artist Model. The proposed tw

# print and save: code

In [35]:
# from datetime import datetime  # standard library

########################################
# Filter, Save, & Print the Raw Results
########################################
# ALL Save the data to a JSON file
date_time = datetime.now()
all_arxiv_results_clean_timestamp = date_time.strftime('%Y-%m-%d__%H%M%S%f')
all_articles_list = []
all_results_json_list = []


def result_counter(ranked_documents):
    """
    count non-zero scored results
    """

    result_count = 0

    for this_doc in ranked_documents:
        score = this_doc[1]

        if score != 0:
            result_count += 1

    return result_count


def score_filtered_result_counter(ranked_documents, score_floor=0):
    """
    count non-zero scored results that are greater than or equal to score_floor
    """

    result_count = 0

    for this_doc in ranked_documents:
        score = this_doc[1]

        if score != 0 and score >= score_floor:
            result_count += 1

    return result_count


def print_and_save(ranked_documents, top_n, name_of_set, score_floor=5):
    # Posix UTC Seconds
    # make readable time
    # from datetime import datetime
    date_time = datetime.now()
    clean_timestamp = date_time.strftime('%Y-%m-%d__%H%M%S%f')

    counter = 0

    results_json_list = []

    for document, score in ranked_documents:

        if score >= score_floor:

            blurb = f"Document: {document}\nScore: {score}\n"

            print(blurb)

        this_index = int(document.split('||||')[0])

        data_dict = article_dicts[this_index]

        results_json_list.append(data_dict)
        all_results_json_list.append(data_dict)

        counter += 1
        if counter >= top_n:
            break

    #############
    # Write Data
    #############

    # Save the data to a JSON file
    with open(f'{name_of_set}_articles_{clean_timestamp}.json', 'w') as f:
        json.dump(results_json_list, f)

    # Create an HTML file
    html = '<html><body>'
    for article in results_json_list:
        html += f'<h2><a href="{article["paper_link"]}">{article["title"]}</a></h2>'
        html += f'<p>{article["abstract"]}</p>'
        html += f'<p>Subjects: {str(article["subjects"])}</p>'

        html += f'<a href="{article["paper_link"]}">{article["paper_link"]}</a>'
        html += f'<p>paper link: {str(article["paper_link"])}</p>'

        html += f'<a href="{article["pdf_link"]}">{article["pdf_link"]}</a>'
        html += f'<p>pdf link: {str(article["pdf_link"])}</p>'

        html += f'<p>arxiv id: {str(article["arxiv_id"])}</p>'
        html += f'<p>article_sequence_index id: {str(article["article_sequence_index"])}</p>'

    html += '</body></html>'


    # Save the HTML to a file
    with open(f'{name_of_set}_articles{clean_timestamp}.html', 'w') as f:
        f.write(html)


def match_print_save(list_of_lists_of_weights, top_n, score_floor):
    date_time = datetime.now()
    clean_timestamp = date_time.strftime('%Y-%m-%d__%H%M%S%f')

    counter = 0
    for keyword_weights in list_of_lists_of_weights:

        ranked_documents = rank_documents_on_weighted_matches(corpus, keyword_weights)

        # user first list item as name of set
        name_of_set = list_of_lists_of_weights[counter][0][0]

        result_quantity = result_counter(ranked_documents)

        score_floor_filtered_quantity = score_filtered_result_counter(ranked_documents, score_floor)

        this_max_number = top_n

        if top_n > result_quantity:
            this_max_number = result_quantity

        print(f"\n\nSet Name: {name_of_set}")
        print(f"Total Matches in Set: {result_quantity}")
        print(f"Matches Above Score-Floor in Set: {score_floor_filtered_quantity}")
        print(clean_timestamp)

        print(f"\nShowing {score_floor_filtered_quantity} in top-{this_max_number} out of {result_quantity} total results.     -> {score_floor_filtered_quantity} of {this_max_number}/{result_quantity}")
        print(f"(Ceiling set at {top_n} (top_n) filtered results.)    -> {top_n}")
        print(f"(Minimum-included-score, 'Score-Floor' set at {score_floor}) -> {score_floor}\n\n")

        print_and_save(ranked_documents, top_n, name_of_set, score_floor)
        counter += 1


        # ALL Save the data to a JSON file
        with open(f'all_arxiv_results_{all_arxiv_results_clean_timestamp}.json', 'a') as f:
            json.dump(all_results_json_list, f)

# multi-set search(es)
(optional)

In [36]:
# ########
# # Batch
# ########

# # example multi-list

# list_of_lists_of_weights = [
#     # keyword_weights =
#     [
#         ("computer vision", 3),
#         ("resolution", 2),
#         # ("natural language processing", 4),
#         # ("deep learning", 5),
#         ("neural networks", 6),
#     ],


#     # keyword_weights =
#     [
#         ("distance measure", 10),
#         ("similarity measure", 10),
#         ("vector distance", 10),
#         ("distance metric", 10),
#         ("similarity metric", 10),
#         ("dimension reduction", 10),


#         ("similarity", 1),
#         ("distance", 1),
#         ("metric", 1),

#     ],


#     # # keyword_weights =
#     # ("cognitive science", 2),  # much too broad...
#     [
#         ("mental health", 5),
#         ("psychological health", 5),
#         ("psycholog", 2),  # stem vs. lemma


#         ("mental health care", 3),
#         ("neuroscience", 2),
#         ("psychological assessment", 2),
#         ("personality assessment", 2),
#         ("personality inference", 2),
#         ("personality traits", 2),
#         ("personality dimensions", 2),
#         ("emotion", 15),
#         ("sports psychology", 15),
#         # ("", 2),
#         # ("", 2),



#         # disease terms
#         ("depression", 5),
#         ("anxiety", 5),
#         ("mental disorders", 2),
#         ("social anxiety disorder", 4),
#         ("mental illness", 2),
#         ("Major Depressive Disorder", 2),
#         ("MDD", 2),
#         ("psychological stressors", 2),
#         ("cognitive impairment", 2),
#         ("mci", 2),
#         # ("", 2),
#         # ("", 2),
#         # ("", 2),

#         ],


#     # # keyword_weights =
#     [
#         ("benchmark", 5),
#         ("model evaluation", 5),
#         ("test", 2),
#         ("measure", 2),
#     ],


#     # # keyword_weights =
#     [
#         ("training set", 5),
#         ("synthetic", 2),
#         ("generate", 2),
#         ("measure", 2),
#     ],

#     # keyword_weights =
#     [
#         ("graph", 5),
#         ("graph generation", 8),
#         ("subgraph", 2),
#         ("hierarchical graph", 2),
#         ("embedding", 2),
#         ("knowledge graph", 2),

#         ("graph neural networks", 2),
#         ("graph representation", 2),
#         ("node", 2),
#          ## collisions: cryptograph, geograph,
#     ],

# ]

# top_n = 45
# score_floor = 3
# match_print_save(list_of_lists_of_weights, top_n, score_floor)

# Find articles:
use: keywords+weights, top_n, score_floor

In [37]:
top_n = 45
score_floor = 2
list_of_lists_of_weights = [[
        ("Manifold Approximation", 10),
        ("UMAP", 10),
        ("Uniform Manifold Approximation and Projection", 10),
        ("Manifold hypothesis", 10),
        ("dimensionality reduction", 10),
        ("dimension reduction", 10),
        ("dimension reduction technique", 10),

        ("stress", 1),
        ("Manifold", 1),
        ("lower-dimensional", 1),
        ("visualiz", 1),
        ("projection", 1),
        ("project", 1),
        ("dimensionality", 1),
        ("reduction", 1),
    ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: Manifold Approximation
Total Matches in Set: 94
Matches Above Score-Floor in Set: 20
2024-12-20__041109376782

Showing 20 in top-45 out of 94 total results.     -> 20 of 45/94
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 2) -> 2


Document: 264|||| 
'arxiv_id': arXiv:2412.14717, 
'paper_link': https://arxiv.org/abs/2412.14717, 
'pdf_link': https://arxiv.org/pdf/2412.14717, 
Title: Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm 
Subjects: Machine Learning (cs.LG) 
Abstract: In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach involves computing a kernel matrix using the Sinkhorn-Knopp alg

In [38]:
top_n = 45
score_floor = 3
list_of_lists_of_weights = [[
        ("distance measure", 10),
        ("similarity measure", 10),
        ("vector distance", 10),
        ("distance metric", 10),
        ("similarity metric", 10),
        ("dimension reduction", 10),

        ("similarity", 1),
        ("distance", 1),
        ("metric", 1),
    ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: distance measure
Total Matches in Set: 140
Matches Above Score-Floor in Set: 4
2024-12-20__041109494279

Showing 4 in top-45 out of 140 total results.     -> 4 of 45/140
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 3) -> 3


Document: 194|||| 
'arxiv_id': arXiv:2412.14580, 
'paper_link': https://arxiv.org/abs/2412.14580, 
'pdf_link': https://arxiv.org/pdf/2412.14580, 
Title: DiffSim: Taming Diffusion Models for Evaluating Visual Similarity 
Subjects: Computer Vision and Pattern Recognition (cs.CV) 
Abstract: Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object 

In [39]:
top_n = 45
score_floor = 3
list_of_lists_of_weights = [[
        ("parametric", 10),
    ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: parametric
Total Matches in Set: 8
Matches Above Score-Floor in Set: 8
2024-12-20__041109580067

Showing 8 in top-8 out of 8 total results.     -> 8 of 8/8
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 3) -> 3


Document: 276|||| 
'arxiv_id': arXiv:2412.14744, 
'paper_link': https://arxiv.org/abs/2412.14744, 
'pdf_link': https://arxiv.org/pdf/2412.14744, 
Title: A parametric algorithm is optimal for non-parametric regression of smooth functions 
Subjects: Machine Learning (cs.LG) 
Abstract: We address the regression problem for a general function $f:[-1,1]^d\to \mathbb R$ when the learner selects the training points $\{x_i\}_{i=1}^n$ to achieve a uniform error bound across the entire domain. In this setting, known historically as nonparametric regression, we aim to establish a sample complexity bound that depends solely on the function's degree of smoothness. Assuming periodicity at the domain boundaries, we introduce P

In [40]:
top_n = 45
score_floor = 2
list_of_lists_of_weights = [[
        ("survey", 1),
        ("election", 1),
        ("voting", 1),
        ("poll", 1),
        ("vote", 1),
        ("candidate", 1),

        ("selection", .5),
        ("coordination", .5),
        ("consensus", .5),
        ("campaign", .5),

        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: survey
Total Matches in Set: 72
Matches Above Score-Floor in Set: 4
2024-12-20__041109650292

Showing 4 in top-45 out of 72 total results.     -> 4 of 45/72
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 2) -> 2


Document: 405|||| 
'arxiv_id': arXiv:2412.15146, 
'paper_link': https://arxiv.org/abs/2412.15146, 
'pdf_link': https://arxiv.org/pdf/2412.15146, 
Title: Cruise Control: Dynamic Model Selection for ML-Based Network Traffic Analysis 
Subjects: Networking and Internet Architecture (cs.NI) 
Abstract: Modern networks increasingly rely on machine learning models for real-time insights, including traffic classification, application quality of experience inference, and intrusion detection. However, existing approaches prioritize prediction accuracy without considering deployment constraints or the dynamism of network traffic, leading to potentially suboptimal performance. Because of this, deploying ML models in real-wo

In [41]:
top_n = 45
score_floor = 1
list_of_lists_of_weights = [[
        ("disinformation", 1),
        ("manipulate public opinion", 1),
        ("conspiracy", 1),
        ("radicalization", 1),
        ("conspiracy theories", 1),
        ("violent extremism", 2),

        ("extremism", 1),
        ("extremist", 1),
        ("extreme views", 1),
        ("extreme beliefs", 1),
        ("extreme action", 1),
        ("ideology", .5),        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: disinformation
Total Matches in Set: 5
Matches Above Score-Floor in Set: 4
2024-12-20__041109749584

Showing 4 in top-5 out of 5 total results.     -> 4 of 5/5
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 1) -> 1


Document: 386|||| 
'arxiv_id': arXiv:2412.15098, 
'paper_link': https://arxiv.org/abs/2412.15098, 
'pdf_link': https://arxiv.org/pdf/2412.15098, 
Title: A Cross-Domain Study of the Use of Persuasion Techniques in Online Disinformation 
Subjects: Computers and Society (cs.CY) 
Abstract: Disinformation, irrespective of domain or language, aims to deceive or manipulate public opinion, typically through employing advanced persuasion techniques. Qualitative and quantitative research on the weaponisation of persuasion techniques in disinformation has been mostly topic-specific (e.g., COVID-19) with limited cross-domain studies, resulting in a lack of comprehensive understanding of these strategies. This study empl

In [42]:
top_n = 45
score_floor = 1
list_of_lists_of_weights = [[
        ("Speech-LLM", 1),

        ("spoken language understanding", 1),

        ("speech to text", 1),
        ("text to speech", 1),

        ("audio modality", .5),
        ("speech encoder", .5),
        ("SLU", .5),
        ("stt", .5),
        ("tts", .5),

        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: Speech-LLM
Total Matches in Set: 112
Matches Above Score-Floor in Set: 3
2024-12-20__041109849517

Showing 3 in top-45 out of 112 total results.     -> 3 of 45/112
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 1) -> 1


Document: 711|||| 
'arxiv_id': arXiv:2412.11795, 
'paper_link': https://arxiv.org/abs/2412.11795, 
'pdf_link': https://arxiv.org/pdf/2412.11795, 
Title: ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis 
Subjects: Computation and Language (cs.CL) 
Abstract: Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (F

In [43]:
top_n = 45
score_floor = .5
list_of_lists_of_weights = [[
        ("multiple agents", 1),
        ("Multiagent Systems", 1),
        ("Multiagent", 1),
        ("(cs.MA)", 1),
        ("multi-agent and multi-rack path finding", 1),  #  (MARPF)

        ("agent interactions", 1),
        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: multiple agents
Total Matches in Set: 24
Matches Above Score-Floor in Set: 24
2024-12-20__041109950529

Showing 24 in top-24 out of 24 total results.     -> 24 of 24/24
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 0.5) -> 0.5


Document: 286|||| 
'arxiv_id': arXiv:2412.14779, 
'paper_link': https://arxiv.org/abs/2412.14779, 
'pdf_link': https://arxiv.org/pdf/2412.14779, 
Title: Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning 
Subjects: Multiagent Systems (cs.MA) 
Abstract: In multi-agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards, particularly in long-horizon tasks where it is challenging to evaluate actions at intermediate time steps. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach designed to address the agent-temporal credit assignment problem by redistributing sparse

In [44]:
top_n = 45
score_floor = .5
list_of_lists_of_weights = [[
        ("Agents for Software Engineering", .5),
        ("ai writing code", .5),
        ("coding done by ai", .5),
        ("AI-Generated Code", .5),
        ("Generated Code", .5),
        ("code generation", .5),
        ("ai code writing", .5),
        ("solutions to produce computer code", .5),
        ("Generated Code", .5),

        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: Agents for Software Engineering
Total Matches in Set: 13
Matches Above Score-Floor in Set: 13
2024-12-20__041110059505

Showing 13 in top-13 out of 13 total results.     -> 13 of 13/13
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 0.5) -> 0.5


Document: 210|||| 
'arxiv_id': arXiv:2412.14611, 
'paper_link': https://arxiv.org/abs/2412.14611, 
'pdf_link': https://arxiv.org/pdf/2412.14611, 
Title: Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry 
Subjects: Software Engineering (cs.SE) 
Abstract: With the increasing popularity of LLM-based code completers, like GitHub Copilot, the interest in automatically detecting AI-generated code is also increasing-in particular in contexts where the use of LLMs to program is forbidden by policy due to security, intellectual property, or ethical this http URL introduce a novel technique for AI code stylometry, i.e., the ability to distinguish code gene

In [45]:
top_n = 45
score_floor = .5
list_of_lists_of_weights = [[
        ("e-Learners", 1),
        ("educational content", 1),
        ("learning styles", 1),
        ("educational process", 1),
        ("human learning", 1),

        ("education", .5),
        ("learner", .5),
        ("individual needs", .5),

        ("learning sciences", .5),
        ("educational technology", .5),
        ("human-computer interaction", .5),
        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: e-Learners
Total Matches in Set: 52
Matches Above Score-Floor in Set: 52
2024-12-20__041110164466

Showing 52 in top-45 out of 52 total results.     -> 52 of 45/52
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 0.5) -> 0.5


Document: 17|||| 
'arxiv_id': arXiv:2412.14195, 
'paper_link': https://arxiv.org/abs/2412.14195, 
'pdf_link': https://arxiv.org/pdf/2412.14195, 
Title: IMPROVE: Impact of Mobile Phones on Remote Online Virtual Education 
Subjects: Human-Computer Interaction (cs.HC) 
Abstract: This work presents the IMPROVE dataset, designed to evaluate the effects of mobile phone usage on learners during online education. The dataset not only assesses academic performance and subjective learner feedback but also captures biometric, behavioral, and physiological signals, providing a comprehensive analysis of the impact of mobile phone use on learning. Multimodal data were collected from 120 learners in three groups wi

In [46]:
top_n = 45
score_floor = 2
list_of_lists_of_weights = [[
        ("collective behavior", 1),
        ("collective", 1),
        ("coordination", 1),
        ("oganization", 1),
        ("behavior", 1),
        ("ants", 1),
        ("insects", 1),
        ("worms", 1),
        ("swarm", 1),
        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: collective behavior
Total Matches in Set: 119
Matches Above Score-Floor in Set: 13
2024-12-20__041110297445

Showing 13 in top-45 out of 119 total results.     -> 13 of 45/119
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 2) -> 2


Document: 585|||| 
'arxiv_id': arXiv:2407.20041, 
'paper_link': https://arxiv.org/abs/2407.20041, 
'pdf_link': https://arxiv.org/pdf/2407.20041, 
Title: Counterfactual rewards promote collective transport using individually controlled swarm microrobots 
Subjects: Robotics (cs.RO) 
Abstract: Swarm robots offer fascinating opportunities to perform complex tasks beyond the capabilities of individual machines. Just as a swarm of ants collectively moves a large object, similar functions can emerge within a group of robots through individual strategies based on local sensing. However, realizing collective functions with individually controlled microrobots is particularly challenging due to their mi

In [47]:
top_n = 45
score_floor = 2
list_of_lists_of_weights = [[
        ("Retrieval-Augmented Systems", 1),
        ("RAG systems", 1),
        ("Retrieval-Augmented Generation", 1),
        ("RAG evaluation metric ", 3),
        # ("", 1),
        # ("", 1),
        # ("", 1),
        # ("", 1),
        # ("", 1),
        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: Retrieval-Augmented Systems
Total Matches in Set: 13
Matches Above Score-Floor in Set: 3
2024-12-20__041110423195

Showing 3 in top-13 out of 13 total results.     -> 3 of 13/13
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 2) -> 2


Document: 122|||| 
'arxiv_id': arXiv:2412.14457, 
'paper_link': https://arxiv.org/abs/2412.14457, 
'pdf_link': https://arxiv.org/pdf/2412.14457, 
Title: VISA: Retrieval Augmented Generation with Visual Source Attribution 
Subjects: Information Retrieval (cs.IR) 
Abstract: Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attributio

In [48]:
top_n = 45
score_floor = 1
list_of_lists_of_weights = [[
        ("manifold hypothesis", 1),
        ("manifolds", 1),
        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: manifold hypothesis
Total Matches in Set: 2
Matches Above Score-Floor in Set: 2
2024-12-20__041110540396

Showing 2 in top-2 out of 2 total results.     -> 2 of 2/2
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 1) -> 1


Document: 92|||| 
'arxiv_id': arXiv:2412.14384, 
'paper_link': https://arxiv.org/abs/2412.14384, 
'pdf_link': https://arxiv.org/pdf/2412.14384, 
Title: I0T: Embedding Standardization Method Towards Zero Modality Gap 
Subjects: Machine Learning (cs.LG) 
Abstract: Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the issue of modality gap, which arises when the image and text embeddings are projected to disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific ch

In [49]:
top_n = 45
score_floor = 1
list_of_lists_of_weights = [[
        ("sentiment analysis", 3),
        ("semantic analysis", 3),
        ("semantic modeling", 3),
        ("emotion modeling", 3),
        ("emotion analysis", 3),
        ("sentiment recognition", 2),
        ("semantic recognition", 2),
        ("sentiment", 1),
        ("semantically blinding", 1),
        ("disambiguation", 1),
        # ("", 1),
        # ("", 1),
        # ("", 1),
        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: sentiment analysis
Total Matches in Set: 8
Matches Above Score-Floor in Set: 8
2024-12-20__041110643110

Showing 8 in top-8 out of 8 total results.     -> 8 of 8/8
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 1) -> 1


Document: 313|||| 
'arxiv_id': arXiv:2412.14849, 
'paper_link': https://arxiv.org/abs/2412.14849, 
'pdf_link': https://arxiv.org/pdf/2412.14849, 
Title: DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis 
Subjects: Computation and Language (cs.CL) 
Abstract: Recently developed large language models (LLMs) have presented promising new avenues to address data scarcity in low-resource scenarios. In few-shot aspect-based sentiment analysis (ABSA), previous efforts have explored data augmentation techniques, which prompt LLMs to generate new samples by modifying existing ones. However, these methods fail to produce adequately diverse data, impairing thei

In [50]:
top_n = 45
score_floor = 2
list_of_lists_of_weights = [[
        ("mental health", 5),
        ("psychological health", 5),
        ("psycholog", 2),  # stem vs. lemma
        ("mental health care", 3),
        ("neuroscience", 2),
        ("psychological assessment", 2),
        ("personality assessment", 2),
        ("personality inference", 2),
        ("personality traits", 2),
        ("personality dimensions", 2),
        ("emotion", 15),
        ("sports psychology", 15),
        ("sentiment recognition", 10),
        ("Emotion Recognition", 5),
        # ("", 5),
        # ("", 5),

        # disease terms
        ("depression", 5),
        ("anxiety", 5),
        ("mental disorders", 2),
        ("social anxiety disorder", 4),
        ("mental illness", 2),
        ("Major Depressive Disorder", 2),
        ("MDD", 2),
        ("psychological stressors", 2),
        ("cognitive impairment", 2),
        ("mci", 2),
        ("personality", 1)
        # ("", 2),
        ],]
match_print_save(list_of_lists_of_weights, top_n, score_floor)



Set Name: mental health
Total Matches in Set: 37
Matches Above Score-Floor in Set: 36
2024-12-20__041110765755

Showing 36 in top-37 out of 37 total results.     -> 36 of 37/37
(Ceiling set at 45 (top_n) filtered results.)    -> 45
(Minimum-included-score, 'Score-Floor' set at 2) -> 2


Document: 13|||| 
'arxiv_id': arXiv:2412.14190, 
'paper_link': https://arxiv.org/abs/2412.14190, 
'pdf_link': https://arxiv.org/pdf/2412.14190, 
Title: Lessons From an App Update at Replika AI: Identity Discontinuity in Human-AI Relationships 
Subjects: Human-Computer Interaction (cs.HC) 
Abstract: Can consumers form especially deep emotional bonds with AI and be vested in AI identities over time? We leverage a natural app-update event at Replika AI, a popular US-based AI companion, to shed light on these questions. We find that, after the app removed its erotic role play (ERP) feature, preventing intimate interactions between consumers and chatbots that were previously possible, this event triggered 

# Final Timer

In [51]:
end_time_whole_single_task = datetime.now()
duration_time = duration_min_sec(start_time_whole_single_task, end_time_whole_single_task)
print(f"Duration to run -> {duration_time}")

Duration to run -> 0_min__4.2_sec


In [52]:
# See files
print("List of results saved:")
!ls
print(f"All Articles-Found Results Count = {len(all_results_json_list)}")

List of results saved:
'Agents for Software Engineering_articles2024-12-20__022542279803.html'
'Agents for Software Engineering_articles_2024-12-20__022542279803.json'
'Agents for Software Engineering_articles2024-12-20__041110132006.html'
'Agents for Software Engineering_articles_2024-12-20__041110132006.json'
 all_arxiv_article_dicts_2024-12-20__022539101651.json
 all_arxiv_article_dicts_2024-12-20__041109292906.json
 all_arxiv_results_2024-12-20__022539321644.json
 all_arxiv_results_2024-12-20__041109354862.json
'collective behavior_articles2024-12-20__022542564102.html'
'collective behavior_articles_2024-12-20__022542564102.json'
'collective behavior_articles2024-12-20__041110389839.html'
'collective behavior_articles_2024-12-20__041110389839.json'
 disinformation_articles2024-12-20__022541239886.html
 disinformation_articles_2024-12-20__022541239886.json
 disinformation_articles2024-12-20__041109825584.html
 disinformation_articles_2024-12-20__041109825584.json
'distance measure_a