<a href="https://colab.research.google.com/github/merlynjocol/DigitalActions_NLP_NLU/blob/main/HEIDI_Semantic_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Preprocessing

HEIDI Project

Previous step: creating the database
https://colab.research.google.com/drive/1Ddxa24UC5Oja4nbpZutVY3b2-mUvmEen#scrollTo=wY7z-qzeZbUx&uniqifier=3

# Import libraries




In [None]:
# libraries
import pandas as pd
from pandas import DataFrame
import numpy as np
import re
from tqdm import tqdm
import warnings


In [None]:
# Data Loading
from google.colab import drive
drive.mount('/gdrive', force_remount=True)
""

Mounted at /gdrive


''

In [None]:
#Transformer libraries
import torch
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer,  AutoModelForSequenceClassification

In [None]:
#Similarity search section: cosine similarity search and facebook AI research library
from sklearn.metrics.pairwise import cosine_similarity
!pip install faiss-gpu # needed the first time you run the notebook; comment this line out afterwards
import faiss

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
     |████████████████████████████████| 85.5 MB 92 kB/s 
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


# Reading Dataset

In [None]:
# load the full-text dataset from Google Drive
data = pd.read_csv("/gdrive/MyDrive/Colab Notebooks/HEIDI_SCIENTIFIC_ARTICLES/Datasets/fulltext.csv")


# Drop duplicates


In [None]:
data.drop_duplicates(['Abstract', 'fulltext'], inplace=True)


In [None]:
data['Abstract'].describe(include='all')

count                                                    20
unique                                                   20
top       Households are the most numerous and atomized ...
freq                                                      1
Name: Abstract, dtype: object

# Rules

Rule 1. Don't use standard preprocessing steps such as stemming or stopword removal when you have pre-trained embeddings.
You may have used these steps for count-based feature extraction (e.g. TF-IDF). With pre-trained embeddings the reason to avoid them is simple: you lose valuable information that would help your neural network figure things out.

Rule 2. Get your vocabulary as close to the embeddings as possible.
The focus is on how to achieve this.
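
One way to make Rule 2 measurable is to check how much of the corpus vocabulary is found in the embedding vocabulary. The following is a minimal sketch under assumptions: `toy_embeddings` is a hypothetical stand-in for a real pre-trained embedding lookup, the helper names are invented for illustration, and tokens are simply split on whitespace.

```python
from collections import Counter


def build_vocab(texts):
    """Count whitespace-separated tokens across all texts."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return vocab


def check_coverage(vocab, embeddings):
    """Return the share of unique words covered, the share of all tokens
    covered, and the out-of-vocabulary words sorted by frequency."""
    known = {w: c for w, c in vocab.items() if w in embeddings}
    oov = Counter({w: c for w, c in vocab.items() if w not in embeddings})
    vocab_share = len(known) / len(vocab)
    token_share = sum(known.values()) / sum(vocab.values())
    return vocab_share, token_share, oov.most_common()


# Hypothetical stand-in for a pre-trained embedding lookup (word -> vector)
toy_embeddings = {"citizen": [0.1], "science": [0.2], "is": [0.3], "useful": [0.4]}

# The second text contains a word broken by a line-break hyphen
texts = ["citizen science is useful", "citizen scien- ce is useful"]

vocab = build_vocab(texts)
vocab_share, token_share, oov = check_coverage(vocab, toy_embeddings)
print(f"{vocab_share:.0%} of the vocabulary and {token_share:.0%} of the tokens are covered")
print("Out of vocabulary:", oov)
```

The out-of-vocabulary list is the useful part: fragments such as `scien-` and `ce` show up there, which is the kind of extraction damage a cleaning step can repair.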

## 🔷 Cleaning process

In this case, I have divided the cleaning process into 4 steps.

1. Recover words that were split across a line break. After text extraction these words lose their meaning, e.g. "scien-" "ce".

2. Remove URLs and emails, which add noise to the model.

3. Create a variable with the cleaned text.

4. Enjoy the process! It's long! 😎

###  Resources 
- https://colab.research.google.com/github/hackveda-canada/Data-Science-Essentials/blob/master/Data_Science_Essentials_Day_5_NLP_%26_Text_Mining.ipynb#scrollTo=yjP65tpyLxPW

In [None]:
# function for text cleaning
def clean_meaning(text):
    text = re.sub(r"- ", "", text)  # remove "- ", which appears when a word is split across a line break, e.g. "scien- ce"
    text = re.sub(r'\S+@\S+', '', text)  # remove emails
    text = re.sub(r'https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&\/=]*', '', text)  # remove URLs
    #text = re.sub(r"\sd\s", " ", text) # removing single letters

    return text
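
A quick check on made-up input shows what the three substitutions do. The sample sentences and the address in them are invented for illustration, and the function is repeated so the snippet runs on its own. The second call shows a side effect worth keeping in mind: the `"- "` rule removes any hyphen followed by a space, not only line-break hyphens.

```python
import re


def clean_meaning(text):
    # Same three substitutions as in the cleaning function above
    text = re.sub(r"- ", "", text)       # rejoin words split across a line break
    text = re.sub(r'\S+@\S+', '', text)  # remove emails
    text = re.sub(r'https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&\/=]*', '', text)  # remove URLs
    return text


# Invented sample with a split word, an email address and a URL
sample = "Contact jane@example.org or see https://example.org/data for the scien- ce data."
cleaned = clean_meaning(sample)
print(cleaned)

# Side effect: a suspended hyphen is also removed and the words are merged
side_effect = clean_meaning("pre- and post-test scores")
print(side_effect)  # preand post-test scores
```

Removing the email and the URL leaves the surrounding spaces in place, so the cleaned text can contain double spaces.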

In [None]:
data.columns

Index(['Unnamed: 0', 'fulltext', 'Authors', 'Title', 'DOI', 'Abstract'], dtype='object')

In [None]:
# apply the cleaning function
# the "_clean" columns hold the text used for the later analysis
data['fulltext_clean'] = data['fulltext'].apply(clean_meaning)
data['Abstract_clean'] = data['Abstract'].apply(clean_meaning)

### Removing all references

In [None]:
# remove all the characters after the first occurrence of the word "References"
def removeReferences(doc):
    doc = doc.split("References", 1)
    return doc[0]

# remove all the characters after the first occurrence of the word "Acknowledgements"
def removeAcknowledgements(doc):
    doc = doc.split("Acknowledgements", 1)
    return doc[0]

# remove all the characters after the first occurrence of the word "Appendix"
def removeAppendix(doc):
    doc = doc.split("Appendix", 1)
    return doc[0]
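
The functions above cut at the first occurrence of the marker, so a sentence in the body that happens to contain the word "References" also truncates the text. The sketch below contrasts that behaviour with a hypothetical variant that cuts at the last occurrence; both helper names and the sample text are invented for illustration, and neither choice is right for every document.

```python
def remove_after_first(doc, marker):
    """Keep everything before the first occurrence of marker
    (the behaviour of the functions above)."""
    return doc.split(marker, 1)[0]


def remove_after_last(doc, marker):
    """Hypothetical variant: keep everything before the last occurrence of marker."""
    return doc.rsplit(marker, 1)[0]


# Invented text where the marker word appears in the body and as a section heading
doc = "Intro. References to prior work are listed below. Results. References [1] Example entry."

first = remove_after_first(doc, "References")
last = remove_after_last(doc, "References")
unchanged = remove_after_first("No marker here.", "References")

print(repr(first))      # 'Intro. '
print(repr(last))       # 'Intro. References to prior work are listed below. Results. '
print(repr(unchanged))  # 'No marker here.'
```

Counting the occurrences of the marker before and after the cut, as the notebook does next, is a simple way to spot documents where the marker appears more than once.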


## Removing references & checking (counting)

In [None]:
# count the occurrences of the word "References"
data['count_ref'] = data['fulltext_clean'].str.count("References")
# new text without references
data['fulltext_noref'] = data['fulltext_clean'].apply(removeReferences)
# check that "References" no longer occurs in the text
data['count_ref1'] = data['fulltext_noref'].str.count("References")

### Removing Acknowledgement

In [None]:
data['fulltext_no_Aknow'] = data['fulltext_noref'].apply(removeAcknowledgements) 
#data['text_no_Aknow'] = data['text_no_Aknow'].apply(removeAppendix) 

## Extracting conclusions

In [None]:
data['discussion']= data['fulltext_no_Aknow'].str.split('Discussion|DISCUSSION|Results and discussion').str[1]

In [None]:
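# "Conclusions?" matches both "Conclusion" and "Conclusions", so the plural "s" is not left at the start of the extracted text
data['conclusion']= data['fulltext_no_Aknow'].str.split('Conclusions?').str[1]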
data['conclusion']= data['conclusion'].str.split('Appendix').str[0]
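
To see what the split-and-index pattern returns for a single document, here is a plain-Python sketch built on `re.split`. It is an approximation of the pandas calls above, not the notebook's own code: `section_after` is a hypothetical helper, the sample text is invented, and where this sketch returns `None` pandas yields a missing value instead.

```python
import re


def section_after(text, heading_pattern):
    """Return the text between the first and second match of heading_pattern,
    or None when the heading does not occur."""
    parts = re.split(heading_pattern, text)
    return parts[1] if len(parts) > 1 else None


# Invented document with the section headings the notebook looks for
doc = "Abstract ... Discussion We found X. Conclusions Y matters. Appendix A"

discussion = section_after(doc, r"Discussion|DISCUSSION|Results and discussion")

# "Conclusions?" consumes the plural heading whole
conclusion = section_after(doc, r"Conclusions?")
conclusion = conclusion.split("Appendix")[0]  # drop anything after "Appendix"

missing = section_after("No headings here", r"Discussion")

print(repr(discussion))  # ' We found X. Conclusions Y matters. Appendix A'
print(repr(conclusion))  # ' Y matters. '
print(missing)           # None
```

Two things stand out: the extracted discussion runs to the end of the document unless the heading word occurs again, and a document without the heading produces a missing value that has to be handled afterwards.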



In [None]:
data.head(2)

Unnamed: 0.1,Unnamed: 0,fulltext,Authors,Title,DOI,Abstract,fulltext_clean,Abstract_clean,count_ref,fulltext_noref,count_ref1,fulltext_no_Aknow,discussion,conclusion
0,0,Citizen science and social licence_ Improving ...,"Kelly R,Fleming A,Pecl GT",Citizen science and social licence: Improving ...,10.1016/j.ocecoaman.2019.104855,Marine stakeholder groups have diverse relatio...,Citizen science and social licence_ Improving ...,Marine stakeholder groups have diverse relatio...,2,Citizen science and social licence_ Improving ...,0,Citizen science and social licence_ Improving ...,,The concept of social licence has many compon...
1,1,Urban regeneration_ Community engagement proce...,"Kim G,Newman G,Jiang B",Urban regeneration: Community engagement proce...,10.1016/j.cities.2020.102730,Vacant land presents many challenges for older...,Urban regeneration_ Community engagement proce...,Vacant land presents many challenges for older...,2,Urban regeneration_ Community engagement proce...,0,Urban regeneration_ Community engagement proce...,4.1. Understanding the problems and potential...,"Community engagement is an ongoing process, n..."


In [None]:
# where no discussion section was found, fall back to the conclusion
data["discussion"] = data["discussion"].fillna(data["conclusion"])
data["discussion"].str.len()

0      3688
1     15703
2     15203
3     11448
4     29583
5     26447
6     31507
7     11683
8     21043
9      7976
10    15664
11     2165
12    14738
13    21511
14    36419
15     6121
16     1673
17    10781
18    22646
19     1479
Name: discussion, dtype: int64

In [None]:
#checking the conclusion
print(data["Title"].iloc[19])
print(data["conclusion"].iloc[8])
print(data["discussion"].iloc[14])
print(data["fulltext_no_Aknow"].iloc[4])



Using community engagement to implement evidence-based practices for opioid use disorder: A data-driven paradigm & systems science approach
 The contributory citizen science project, the GKC, has been useful in the early stages of policy development. Utilising the phases in the policy process described by Walters et al. (2000), we have described the utility of citizen science in ‘Discovery’, ‘Measurement’ and ‘Education’. Citizen science projects might also be useful in ‘Persuasion’ and ‘Legitimization’. Our work supports the previous assertion (Shirk et al., 2012) that contributory citizen science projects can make valuable contributions outside scientific research (i.e., policy outcomes) if they are explicitly designed to achieve these outcomes. Our evaluation found some differences in opinions between citizen scientists involved in the GKC project, onlookers and a sample of the wider community. However, we contend that data from citizen science projects are useful for policy makers 

In [None]:
data.columns

Index(['Unnamed: 0', 'fulltext', 'Authors', 'Title', 'DOI', 'Abstract',
       'fulltext_clean', 'Abstract_clean', 'count_ref', 'fulltext_noref',
       'count_ref1', 'fulltext_no_Aknow', 'discussion', 'conclusion'],
      dtype='object')

## How many tokens does the discussion have?

In [None]:
!pip install spacy -qq
!python -m spacy download en_core_web_md 

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
     |████████████████████████████████| 96.4 MB 75.1 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [None]:
import spacy
# load the spaCy model downloaded in the previous cell
nlp = spacy.load("en_core_web_md")


In [None]:
data["token"]=  data['discussion'].apply(lambda x: nlp(x))

In [None]:
data["total_token"] = data["token"].str.len()

In [None]:
print(data["total_token"].max())

6175


### Drop unnecessary columns

In [None]:
# columns that are no longer needed
drop_cols = ['fulltext','Abstract','fulltext_clean','count_ref', 'count_ref1','fulltext_no_Aknow', 'conclusion', 'token',
       'total_token']

In [None]:
df = data.drop(columns=drop_cols)

In [None]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,Authors,Title,DOI,Abstract_clean,fulltext_noref,discussion
0,0,"Kelly R,Fleming A,Pecl GT",Citizen science and social licence: Improving ...,10.1016/j.ocecoaman.2019.104855,Marine stakeholder groups have diverse relatio...,Citizen science and social licence_ Improving ...,The concept of social licence has many compon...
1,1,"Kim G,Newman G,Jiang B",Urban regeneration: Community engagement proce...,10.1016/j.cities.2020.102730,Vacant land presents many challenges for older...,Urban regeneration_ Community engagement proce...,4.1. Understanding the problems and potential...


In [None]:
# exporting the csv file 
df.to_csv('fulltext_discussion.csv')