<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2023_10_05_GPT_Literature_Review_Assistant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Literature Review Assistant [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8199901.svg)](https://doi.org/10.5281/zenodo.8199901)

![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is a part of the social-media-lab.net project, which is a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **Literature Review Assistant** uses search result lists of Publish or Perish to get started with prompting. It uses GPT to extract information from abstracts to assist the literature review process.

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2023). michaelachmann/social-media-lab: 27.10.2023 (v0.0.1). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

In [8]:
#@title Setup
#@markdown At first we need to install necessary packages. Hit run and wait.

print("Install Packages")
!pip install -q openai crossref-commons

Install Packages
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.0/77.0 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for ratelimit (setup.py) ... [?25l[?25hdone
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.[0m[31m
[0m

In [3]:
#@title Import from Publish or Perish Data.
#@markdown If this is the start of your review process, upload the `csv` file exported from [Publish or Perish](https://harzing.com/resources/publish-or-perish) in the left-hand *Files* pane. Enter the filename in `publish_or_perish_file_name`. Define the output name in `file_name`. If you want to save the imported file in the google drive add `/content/drive/MyDrive/` to the path. <br/> **Skip this cell if you want to work with a file that has been imported in the past.**

import pandas as pd
import numpy as np
import io

publish_or_perish_file_name = "scholar.csv" # @param {type: "string"}
file_name = "2023-10-31-Literature-Review.csv" # @param {type: "string"}

# Initialize empty DataFrame
all_data = pd.DataFrame()


try:
    all_data = pd.read_csv(publish_or_perish_file_name)

    # Remove Duplicates
    initial_len = len(all_data)
    all_data = all_data.drop_duplicates(subset='DOI', keep='first')
    removed_len = initial_len - len(all_data)
    print(f'Removed {removed_len} duplicates based on DOI.')

    # Remove missing DOIs
    initial_len = len(all_data)
    all_data = all_data[~pd.isna(all_data['DOI'])]
    removed_len = initial_len - len(all_data)
    print(f'Removed {removed_len} rows without DOI.')

    all_data = all_data.sort_values(by='Cites', ascending=False).reset_index(drop=True)

    print('Sorted Table by Cites.')

    # Create empty columns for Literature Review
    all_data["Relevant"] = ""
    all_data["Notes"] = ""
    all_data["Checked"] = False

    print('Initialized Columns')

    all_data.to_csv(file_name)
    print(f"Success: Saved data to {file_name}")

    print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
    print(f"Error: Failed to load data from File. {str(e)}")

Removed 172 duplicates based on DOI.
Removed 1 rows without DOI.
Sorted Table by Cites.
Initializes Columns
Success: Saved data to 2023-10-31-Literature-Review.csv
Success: Data loaded from File "2023-10-31-Literature-Review.csv".


In [5]:
#@title Read previously imported File
#@markdown If you want to keep going with a former review process, we can read an uploaded file / a file from google drive. **Only run one cell, this one or the above.**
import pandas as pd
import numpy as np
import io

file_name = "2023-10-31-Literature-Review.csv" # @param {type: "string"}

try:
    all_data = pd.read_csv(file_name)

    print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
    print(f"Error: Failed to load data from File. {str(e)}")

Success: Data loaded from File "2023-10-31-Literature-Review.csv".


In this example we've saved the file locally. When working with Colab, the file will be deleted when we disconnect. For colab you should link your google drive (open the files pane on the left, click the Google Drive button). Once connected, save the file in the folder `/content/drive/MyDrive/YOUR-FILENAME.csv`. It will be accessible through Drive, and Colab is from now on going to connect automatically to drive.

Check the imported data. We're using pandas, the imported data is saved in the `all_data`variable. `head(2)`displays the two top rows of the table. Additionally, we have added three columns: `Relevant`, `Notes`, and `Checked`. We are going to make use of them to keep track of our progress.

In [6]:
# Check the structure (and content) of the file
all_data.head(2)

Unnamed: 0.1,Unnamed: 0,Cites,Authors,Title,Year,Source,Publisher,ArticleURL,CitesURL,GSRank,...,CitesPerYear,CitesPerAuthor,AuthorCount,Age,Abstract,FullTextURL,RelatedURL,Relevant,Notes,Checked
0,0,479,"M Tiggemann, M Zaccardo",'Strong is the new skinny': A content analysis...,2018.0,Journal of health psychology,journals.sagepub.com,https://journals.sagepub.com/doi/abs/10.1177/1...,https://scholar.google.com/scholar?cites=12266...,3,...,119.75,240,2,4.0,… This study provides a content analysis of fi...,,https://scholar.google.com/scholar?q=related:c...,,,False
1,1,116,"D Ging, S Garvey",'Written in these scars are the stories I can'...,2018.0,New Media & Society,journals.sagepub.com,https://journals.sagepub.com/doi/abs/10.1177/1...,https://scholar.google.com/scholar?cites=17319...,19,...,29.0,58,2,4.0,… such as Instagram. Using a dataset of 7560 i...,https://www.researchgate.net/profile/Debbie-Gi...,https://scholar.google.com/scholar?q=related:-...,,,False


In the next step we are going to start our literature review:
1. We filter for the first unchecked row, ordered by the cite count.
2. We retrieve the abstract from [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) using the [DOI](https://en.wikipedia.org/wiki/Digital_object_identifier).
3. We display all information
4. We answer whether the paper appear to be relevant by entering y or n for **y**es or **n**o.

For our session, the cell only runs through one row and finishes afterwards. For a real world application you'd probably like to add some kind of loop.

In [9]:
from crossref_commons.retrieval import get_publication_as_json
import json
import openai
import textwrap
import IPython
import re

# Get one row: Not checked, highest Citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract =  response.get("abstract", "")

# Remove XML
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract


# Display all information
IPython.display.clear_output(wait=True)
title_disp = IPython.display.HTML("<h2>{}</h2>".format(highest_cites_unchecked['Title']))
authors_disp = IPython.display.HTML("<p>{}</p>".format(highest_cites_unchecked['Authors']))
doi_disp = IPython.display.HTML("<p><a target='_blank' href='https://doi.org/{}'>{}</a></p>".format(highest_cites_unchecked['DOI'],highest_cites_unchecked['DOI']))
display(title_disp, authors_disp, doi_disp)
print(textwrap.fill(abstract, 80))
relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input

 ‘Fitspiration’ is an online trend designed to inspire viewers towards a
healthier lifestyle by promoting exercise and healthy food. This study provides
a content analysis of fitspiration imagery on the social networking site
Instagram. A set of 600 images were coded for body type, activity,
objectification and textual elements. Results showed that the majority of images
of women contained only one body type: thin and toned. In addition, most images
contained objectifying elements. Accordingly, while fitspiration images may be
inspirational for viewers, they also contain a number of elements likely to have
negative effects on the viewer’s body image.
Relevant? (y/n): y


Next, we check whether our input has been saved:

In [10]:
# Check the result
all_data.iloc[index]

Unnamed: 0                                                        0
Cites                                                           479
Authors                                     M Tiggemann, M Zaccardo
Title             'Strong is the new skinny': A content analysis...
Year                                                         2018.0
Source                                 Journal of health psychology
Publisher                                      journals.sagepub.com
ArticleURL        https://journals.sagepub.com/doi/abs/10.1177/1...
CitesURL          https://scholar.google.com/scholar?cites=12266...
GSRank                                                            3
QueryDate                                       2022-09-08 10:36:01
Type                                                            NaN
DOI                                        10.1177/1359105316639436
ISSN                                                            NaN
CitationURL                                     

## Using GPT to extract information from abstracts

Now for the fun part: Is it possible to use GPT to help us during the review process? We are going to try and extract text features automatically. For the moment we are going to use `gpt3.5-turbo`.

**Note:**
Please feel free to test different prompts and questions. The [Promptingguide](https://www.promptingguide.ai/) is a good resource to learn more about different prompting techniques. Use the [ChatGPT](https://chat.openai.com/) interface to cheaply test prompt prior to using them with the API. Use the [OpenAI Playground](https://platform.openai.com/playground) to optimize your prompts with a visual user interface for different settings and a prompting history (trust me, this can save your life!).

**Prompts:** We're going to use the **system prompt** for our instructions, and the **user prompt** to send our content.

In [11]:
system_prompt = """
You're an advanced AI research assistant. Your task is to extract **research questions**, **operationalization**, **data sources**, **population**, and **scientific disciplines** from user input. Return "None" if you can't find the information in user input.

**Formatting**
Return a markdown table, one row for each extracted feature: **research questions**, **operationalization**, **data sources**, **population**, and **scientific disciplines**.
"""

Please enter your API-Code in the next code cell for the `openai.api_key` variable. We have changed the cell to include the `gpt_prompt` variable, which sends the title and abstract as a **user prompt**. We're using the `openai.ChatCompletion.create()` method to send our request to the API. We expect the response in `api_response['choices'][0]['message']['content']` to be markdown (see prompt above), as such we display the markdown in our notebook.

In [14]:
from crossref_commons.retrieval import get_publication_as_json
import json
import openai
import textwrap
import IPython
import re

# Enter OpenAI API-Code
openai.api_key = "sk-XXXX"

# Get one row: Not checked, highest Citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract = response.get("abstract", "")

# Remove XML
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract

# Display all information (before we send the request to OpenAI)
IPython.display.clear_output(wait=True)
title_disp = IPython.display.HTML("<h2>{}</h2>".format(highest_cites_unchecked['Title']))
authors_disp = IPython.display.HTML("<p>{}</p>".format(highest_cites_unchecked['Authors']))
doi_disp = IPython.display.HTML("<p><a target='_blank' href='https://doi.org/{}'>{}</a></p>".format(highest_cites_unchecked['DOI'],highest_cites_unchecked['DOI']))
display(title_disp, authors_disp, doi_disp)
print(textwrap.fill(abstract, 80))

gpt_prompt = f"""
**Title**: {highest_cites_unchecked['Title']}
**Abstract**: {abstract}
"""

# Sending request, takes a moment. In the meantime you may read the abstract.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": abstract}
]

try:
  api_response = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=messages,
      temperature=0,
      timeout=30
    )

  gpt_result = api_response['choices'][0]['message']['content']

  # Display the GPT result
  display(IPython.display.HTML(f"<h3>GPT Extracted Data</h3>"))
  display(IPython.display.Markdown(gpt_result))
except:
  print("GPT API Error")

relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input

Since pro-anorexia websites began to appear in the 1990s, there has been a
growing body of academic work on pro-ana and thinspiration communities online.
Underpinned by a range of (inter)disciplinary perspectives, most of this work
focuses on websites and blogs. There is a dearth of research and, in particular,
gender-aware research on pro-ana practices and discourses in the context of
newer mobile social platforms such as Instagram. Using a dataset of 7560 images,
this study employs content analysis to ask whether, to what extent and how pro-
ana identities and discourses manifest themselves on a more open, image-based
platform such as Instagram. We demonstrate that, by mainstreaming pro-ana,
Instagram has rendered visible pro-ana sensibilities such as abstinence and
self-discipline in the broader context of distressed girls’ lives and Western
culture more generally. We conclude that this increased visibility may in fact
be a positive development.


| Feature                | Value                                                                                                                                                                                                                   |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Research Questions     | - To what extent and how do pro-ana identities and discourses manifest themselves on Instagram? <br>- Has Instagram mainstreamed pro-ana sensibilities such as abstinence and self-discipline? <br>- Is the increased visibility of pro-ana on Instagram a positive development? |
| Operationalization     | Content analysis of a dataset of 7560 images                                                                                                                                                                            |
| Data Sources           | Instagram                                                                                                                                                                                                               |
| Population             | Distressed girls and Western culture                                                                                                                                                                                    |
| Scientific Disciplines | - Social sciences <br>- Gender studies <br>- Cultural studies                                                                                                                                                           |

Relevant? (y/n): y
