<a href="https://colab.research.google.com/github/julianafalves/AI_PaperRelevance/blob/main/CoHereAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Alzheimer's Disease Research Paper Summarization using Cohere API

Project Description:
The goal of this project is to develop a Python-based application that leverages the CoHere API to summarize the main concepts of research papers, extract relevant papers based on an abstract, and highlight similarities between papers in terms of results and main ideas. The application will handle papers in PDF format, accommodating different layout structures, including those with two columns per page. The project will primarily focus on the fields of Machine Learning (ML) and Natural Language Processing (NLP).

##PDF Parsing and Text Extraction:
- Implement a PDF parsing mechanism to extract the text content from research papers.
- Account for different layout structures, including papers with two columns per page.
- Store the extracted text for further processing.


In [9]:
!pip install PyPDF2



In [10]:
import PyPDF2
pdf_file = open('paper1_yoa40625.pdf',"rb")
pdf_reader = PyPDF2.PdfReader(pdf_file)

In [11]:
extracted_text = ""
for page in pdf_reader.pages:
  extracted_text += page.extract_text()

pdf_file.close()

In [12]:
extracted_text

'ORIGINAL ARTICLE\nRole of Genes and Environments\nfor Explaining Alzheimer Disease\nMargaret Gatz, PhD; Chandra A. Reynolds, PhD; Laura Fratiglioni, MD, PhD; Boo Johansson, PhD;\nJames A. Mortimer, PhD; Stig Berg, PhD; Amy Fiske, PhD; Nancy L. Pedersen, PhD\nContext :Twin studies using selected samples have shown\nhigh heritability for Alzheimer disease (AD).\nObjective :To evaluate genetic and environmental in-\nfluences on AD in a fully ascertained population of oldertwins, including like- and unlike-sex pairs.\nDesign :Five-group quantitative genetic model: male\nmonozygotic twins, female monozygotic twins, male di-zygotic twins, female dizygotic twins, and unlike-sextwins.\nSetting and Participants :All twins in the Swedish\nTwin Registry aged 65 years and older. The study in-cluded 11 884 twin pairs, among whom were 392 pairsin which 1 or both members had AD.\nMain Outcome Measures :All individuals were\nscreened for cognitive dysfunction. Suspected cases of de-mentia and their c

##Paper Summarization using CoHere API:
- Utilize the CoHere API to generate summaries of research papers.
- Make API calls to retrieve key insights and main ideas from each paper.
- Process the obtained summaries to ensure readability and coherence.

In [3]:
!pip install cohere
#!pip install requests

Collecting cohere
  Downloading cohere-4.11.2-py3-none-any.whl (39 kB)
Collecting backoff<3.0,>=2.0 (from cohere)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting importlib_metadata<7.0,>=6.0 (from cohere)
  Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Installing collected packages: importlib_metadata, backoff, cohere
Successfully installed backoff-2.2.1 cohere-4.11.2 importlib_metadata-6.7.0


In [4]:
import requests
api_endpoint = "https://api.cohere.ai/paper/summarize"
x = requests.post(api_endpoint)

## CoHere - Data generation

In [23]:
prompt = "What is the main achievmentes and challenges in this study: f{extracted_text} "

In [25]:
import cohere
co = cohere.Client('dCbAGB2ASZccs43UaN1DcMdZIIVfp1vwQO6ojxpn') # This is your trial API key
response = co.generate(
  model='command',
  prompt=prompt,
  max_tokens=300,
  temperature=0.9,
  k=0,
  stop_sequences=[],
  return_likelihoods='NONE')
print('Prediction: {}'.format(response.generations[0].text))

Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


Prediction: 
The main achievements and challenges in this study are:

Achievements:
- The study successfully extracted and analyzed data from a large number of tweets.
- The study identified a number of trends and patterns in the data that were not previously known.
- The study developed a new method for extracting and analyzing data from tweets.

Challenges:
- The study had to deal with a large amount of data.
- The study had to develop a new method for extracting and analyzing data from tweets.


##CoHere - Data generation

In [20]:
text = extracted_text
response = co.summarize(
    model='summarize-xlarge',
    length='long',
    extractiveness='high',
    temperature = 1,
    format = 'bullets',
    text = text
)

summary = response.summary

In [21]:
summary

'- The aim of this study was to estimate the contribution of genetics and environment to Alzheimer’s disease in a large sample of elderly twins.\n- The Swedish Twin Registry was used to identify twins who had been diagnosed with Alzheimer’s disease and their co-twins.\n- For the analysis of Alzheimer’s disease, 3 fi ve groups were used: male and female monozygotic (MZ) and dizygotic (DZ) twins, and unlike-sex pairs.\n- Using advanced methods of twin methodology, we verified that heritability for Alzheimer’s disease is high and does not differ by sex.\n- The same genetic effects are operating in men and women.\n- Among all MZ individuals, average age at onset for Alzheimer’s disease was 8.12 years greater among those who were concordant for Alzheimer’s disease than among those who were discordant.\n- This study represents the largest population-based twin study of dementia to date.'

- The aim of this study was to estimate the contribution of genetics and environment to Alzheimer’s disease in a large sample of elderly twins.
- The Swedish Twin Registry was used to identify twins who had been diagnosed with Alzheimer’s disease and their co-twins.
- For the analysis of Alzheimer’s disease, 3 fi ve groups were used: male and female monozygotic (MZ) and dizygotic (DZ) twins, and unlike-sex pairs.
- Using advanced methods of twin methodology, we verified that heritability for Alzheimer’s disease is high and does not differ by sex.
- The same genetic effects are operating in men and women.
- Among all MZ individuals, average age at onset for Alzheimer’s disease was 8.12 years greater among those who were concordant for Alzheimer’s disease than among those who were discordant.
- This study represents the largest population-based twin study of dementia to date.

## Research Abstract.

The following text contain the abstract of my research. I will use it as the guideline to find similar papers that will help the most in my reserach

In [30]:
abstract = "Alzheimer\'s Disease (AD) is a complex neurodegenerative disorder that has gained significant attention in scientific research, particularly since the Human GenomeProject. In 2022, it is estimated that AD affects over 50 million people worldwide, and its economic burden exceeds a trillion US dollars per year. a deeper understanding of the complex genetic factors underlying AD. One promising approach is Genome-Wide Association Studies (GWAS), which allow the identification of geneticvariants associated with AD susceptibility. Of particular interest are Single Nucleotide Polymorphisms (SNPs), which represent variations in a single nucleotidebase in the DNA sequence.In this study, we investigated the association between SNPs and AD susceptibility by applying various quality control (QC) parameters during data pre-processing and rank the SNP associations through mixed linear models-based GWAS implemented in BLUPF90. Our findings indicate that the identified SNPs are located in regions already associated with Alzheimer\'s Disease, including non-coding regions, We also investigated the impact of incorporating demographic data into our models. However, the results indicated that the inclusion of such data did not yield any benefits for the model.This study highlights the importance of GWAS in identifying potential genetic risk factors for AD and underscores the need for further research to gain a better understanding of the complex genetic mechanisms underlying this debilitating disease."

In [38]:

len(abstract.split())

210

##Relevance Ranking based on Abstract:

- Collect the abstract of your research as input.
- Implement a method to compute the relevance score of each paper in relation to your research abstract.
- Rank the papers based on their relevance score, sorting them in descending order.


In [41]:
docs = [
    extracted_text,
    "Pamonha",
    "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.",
   ]
response = co.rerank(
  model = 'rerank-english-v2.0',
  query = f"Witch article are more helpful to the following abstract: "+abstract,
  documents = docs,
  top_n = 3,
)


Query has been truncated from the right to 256 tokens from 271 tokens.


In [42]:
for idx, r in enumerate(response):
  print(f"Document Rank: {idx + 1}, Document Index: {r.index}")
  print(f"Doc: {r.document['text'][0:50]}")
  print(f"Score: {r.relevance_score:.2f} \n")

Document Rank: 1, Document Index: 0
Doc: ORIGINAL ARTICLE
Role of Genes and Environments
fo
Score: 0.25 

Document Rank: 2, Document Index: 3
Doc: The Commonwealth of the Northern Mariana Islands i
Score: 0.20 

Document Rank: 3, Document Index: 1
Doc: Pamonha
Score: 0.17 



##Similarity Highlighting:

- Develop an algorithm to identify similarities between papers based on their results and main ideas.
- Implement a method to highlight the areas of overlap and similarity between papers.
- Provide visual or textual indicators to highlight these similarities within the application.


##Keyword Extraction and Highlighting:

- Leverage ML and NLP techniques to extract relevant keywords from the research papers.
- Identify keywords associated with ML and NLP to support your project development.
- Highlight these keywords within the summaries and relevant paper sections.
