<a href="https://colab.research.google.com/github/nice-digital/text-classifier/blob/main/text-classifier-lr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text classifier Colab**

This Colab notebook classifies a set of scientific papers into two categories based on their titles and abstracts. The code is experimental.

**Note**: Name your training file *training.csv* and your test file *testing.csv*. The title column must be named 'Title' or 'title'; the abstract column, if present, must be named 'Abstract' or 'abstract'. Upload both files using the upload button at the top left of the left sidebar. The results will appear in a folder named *RESULTS*, which the code creates automatically.
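
The column requirements above can be checked before running the rest of the notebook. The following is a minimal sketch that uses only the standard library. `check_columns` is a hypothetical helper, not part of this notebook, and it assumes that a 'target' column is also required, as the file-settings cell states.

```python
import csv
import io

def check_columns(csv_text):
    """Return a list of problems found in the header row of a CSV file."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    names = set(header)
    problems = []
    # The notebook accepts either capitalisation for title and abstract.
    if not names & {"Title", "title"}:
        problems.append("missing required 'Title'/'title' column")
    if "target" not in names:
        problems.append("missing 'target' column (1 = include, 0 = exclude)")
    if not names & {"Abstract", "abstract"}:
        problems.append("no 'Abstract'/'abstract' column (optional)")
    return problems

# Toy file content, made up for illustration.
sample = "Title,Abstract,target\nA study of X,Some abstract text,1\n"
print(check_columns(sample))   # an empty list means the header looks fine
```

To check an uploaded file, the same function could be called on `open('training.csv', encoding='utf-8').read()`, assuming the file is UTF-8 encoded.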


In [None]:
#@title Install Python packages { form-width: "20%" }

#@markdown Please execute this cell by pressing the _Play_ button
#@markdown on the left to download and import third-party software
#@markdown in this Colab notebook.

#@markdown This installs the software on the Colab
#@markdown notebook in the cloud and not on your computer.
import subprocess  # needed for the CalledProcessError handler below
from IPython.utils import io
try:
  with io.capture_output() as captured:
    %shell pip install scispacy
    # %shell pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz
    # %shell pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bc5cdr_md-0.5.1.tar.gz
    # %shell pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz
    %shell pip install pyLDAvis==2.1.2
    %shell pip install import-ipynb
    %shell pip install pandas
    %shell pip install shutup

except subprocess.CalledProcessError:
  print(captured)
  raise
import shutup
shutup.please()

import os
import numpy as np
import spacy
import scispacy
import pandas as pd
from scispacy.abbreviation import AbbreviationDetector

from pathlib import Path
import collections
import csv
import multiprocessing as mp
from multiprocessing import Pool

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from fastai.imports import *


cpu_count = mp.cpu_count()

pd.set_option('display.max_colwidth', None)

In [None]:
#@title Create train/test datasets from human/animal datasets { form-width: "20%" }
animal = pd.read_csv('excludes_Animal_2200.csv')
human = pd.read_csv('includes_human_2400.csv')

#add target variable
animal['target'] = 0
human['target'] = 1

print(animal.columns)
print(human.columns)

#combine & shuffle the datasets
combined_data = pd.concat([animal, human], axis=0)
shuffled_combined_df = combined_data.sample(frac=1).reset_index(drop=True)

#create an 80-20 train/test split from it
training, testing = train_test_split(shuffled_combined_df, test_size=0.2, random_state=42)


Index(['Title', 'Abstract', 'Primary Author', 'Journal', 'Year', 'Volume',
       'Issue', 'Pages', 'Comments', 'Eppi ID', 'target'],
      dtype='object')
Index(['Title', 'Abstract', 'Primary Author', 'Journal', 'Year', 'Volume',
       'Issue', 'Pages', 'Comments', 'Eppi ID', 'target'],
      dtype='object')
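
The shuffle in the cell above calls `sample(frac=1)` without a `random_state`, so the row order, and therefore the seeded split that follows, can differ between runs. The sketch below shows one way to get a reproducible split that also keeps the class balance in both parts. It is a dependency-free illustration under stated assumptions, not the notebook's method: `stratified_split` is a hypothetical helper and the labels are toy data.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with about test_size of each class in the test set."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)               # shuffle within the class only
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 10 + [1] * 20           # toy labels: 10 excludes, 20 includes
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))   # 24 6
```

With pandas and scikit-learn, a comparable effect can probably be had by passing `random_state` to `sample` and `stratify=shuffled_combined_df['target']` to `train_test_split`.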


In [None]:
#@title File settings to get started  { form-width: "20%" }

#@markdown Please ensure that training.csv and testing.csv are uploaded, then execute this cell by pressing the _Play_ button
#@markdown on the left.

#@markdown Both files should have a 'title' field and, optionally, an 'abstract' field. Each file should also have a 'target' field
#@markdown that indicates whether the title/abstract is an include (coded as 1) or an exclude (coded as 0).
TRAIN_PATH = 'training.csv'
TEST_PATH = 'testing.csv'

results_folder = 'RESULTS'
RESULTS_FOLDER = results_folder     #***user input
if not os.path.isdir(RESULTS_FOLDER):
    os.makedirs(RESULTS_FOLDER)
RESULTS_PATH = Path(RESULTS_FOLDER)

In [None]:
#@title Read in input data as separate training.csv and testing.csv. **Ignore** this block if human/animal data was uploaded above { form-width: "20%" }
try:
    training = pd.read_csv(TRAIN_PATH)
    orig_colnames = training.columns
    print(orig_colnames)

    testing = pd.read_csv(TEST_PATH)

except Exception as e:
    print(e)
    raise

In [None]:
#@title Read in input data { form-width: "20%" }
#rename the columns so that the relevant column names are 'title' and 'abstract'
rename_map = {'Title': 'title', 'Abstract': 'abstract'}
training.rename(columns = rename_map, inplace = True)
testing.rename(columns = rename_map, inplace = True)
print("Number of studies in the training dataset: " + str(training.shape[0]))
print("Number of studies in the testing dataset: " + str(testing.shape[0]))

try:
  training['title_orig'] = training['title']
  testing['title_orig'] = testing['title']
except Exception as e:
  print(e)
  print("Error: no 'title' column detected. A 'title' column is required.")
  raise

# drop any duplicates based on 'title'
training.drop_duplicates(subset=['title'], inplace=True)
testing.drop_duplicates(subset=['title'], inplace=True)
print("Number of studies in the training dataset after de-dupe: " + str(training.shape[0]))
print("Number of studies in the testing dataset after de-dupe: " + str(testing.shape[0]))

training['titleabstract'] = training['title'] + " " + training['abstract']
training['titleabstract'] = training['titleabstract'].str.lower()

testing['titleabstract'] = testing['title'] + " " + testing['abstract']
testing['titleabstract'] = testing['titleabstract'].str.lower()

Number of studies in the training dataset: 3693
Number of studies in the testing dataset: 925
Number of studies in the training dataset after de-dupe: 3693
Number of studies in the testing dataset after de-dupe: 925
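
One caveat about the `titleabstract` column built above: in pandas, adding a string to a missing value gives a missing value, so a record without an abstract loses its title too, and the later `.astype(str)` turns it into the literal text `nan`. The null counts later in this notebook show 61 and 410 missing abstracts in the two source files. The sketch below illustrates a fallback to the title alone. `is_missing` and `build_text` are hypothetical helpers written in plain Python for illustration, not part of the notebook.

```python
import math

def is_missing(value):
    """True for None, blank strings and float NaN (how pandas marks missing cells)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def build_text(title, abstract=None):
    """Lower-cased 'title abstract' text that falls back to the title alone."""
    parts = [title] if is_missing(abstract) else [title, abstract]
    return " ".join(parts).lower()

print(build_text("A Rabbit Tendon Model", float("nan")))    # a rabbit tendon model
print(build_text("A Rabbit Tendon Model", "Purpose: ..."))  # a rabbit tendon model purpose: ...
```

In pandas, filling the abstract column with empty strings (for example with `fillna('')`) before concatenating should have a similar effect, though it would change the inputs and therefore the metrics reported in this notebook.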


In [None]:
#@title Fit logistic regression model (in progress) { form-width: "20%" }

#A sklearn pipeline comprising a TF-IDF vectorizer (word trigrams only) and a logistic regression model. The logistic regression
#parameters are taken from prior hyper-parameter tuning.
text_clf = Pipeline([
                ('tfidfvect', TfidfVectorizer(ngram_range = (3,3), stop_words = 'english')),
                ('clf', LogisticRegression(C=100, max_iter = 5000, solver = 'liblinear', penalty = 'l2', class_weight = 'balanced')),
               ])
y_train = training['target']
model = text_clf.fit(training['titleabstract'].astype(str),y_train)
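
With `ngram_range = (3,3)` the vectorizer keeps only word trigrams as features, with no single words or bigrams. scikit-learn removes stop words before it builds n-grams, so a trigram can join words that were not adjacent in the original text. The sketch below approximates that behaviour in plain Python. `word_trigrams` is a hypothetical helper, and the stop-word set is a small illustrative subset rather than scikit-learn's full English list.

```python
import re

# Illustrative subset only; scikit-learn's built-in 'english' list is much longer.
STOP_WORDS = {"an", "and", "in", "of", "the", "to", "with"}

def word_trigrams(text, stop_words=STOP_WORDS):
    """Approximate the word-trigram features extracted from one document."""
    # Tokens of two or more word characters, as in the default token pattern.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Stop words are dropped first, then trigrams are formed from what remains.
    tokens = [t for t in tokens if t not in stop_words]
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

print(word_trigrams("Effects of urea supplementation in sheep"))
# ['effects urea supplementation', 'urea supplementation sheep']
```

A document with fewer than three remaining tokens produces no features at all under this setting, which is worth keeping in mind for records that consist of a short title only.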



In [None]:
#@title Predict category and evaluate performance (in progress) { form-width: "20%" }

#Using the model that was fit to the training data above, evaluate the model's performance on test data.
data = testing['titleabstract'].astype(str)
y_test = testing['target']
yhat = model.predict(data)
yhat_probs = model.predict_proba(data)[:,1]
yhat_adjusted = np.zeros(data.shape[0], dtype=int)
THRESHOLD = 0.4
yhat_adjusted[yhat_probs >= THRESHOLD] = 1

report_dict = {}
decimal_places = 3
# round(float(...)) works for both Python floats and NumPy floats
report_dict['Accuracy'] = round(float(accuracy_score(y_test, yhat_adjusted)), decimal_places)
report_dict['Precision'] = round(float(precision_score(y_test, yhat_adjusted)), decimal_places)
report_dict['Recall'] = round(float(recall_score(y_test, yhat_adjusted, average = 'binary')), decimal_places)
report_dict['F1-Score'] = round(float(f1_score(y_test, yhat_adjusted)), decimal_places)
# NOTE: this is computed on the thresholded labels, so it equals (recall + specificity) / 2;
# pass yhat_probs instead of yhat_adjusted for the usual threshold-free ROC AUC.
report_dict['ROC_AUC'] = round(float(roc_auc_score(y_test, yhat_adjusted)), decimal_places)
cm = confusion_matrix(y_test, yhat_adjusted)
# rows are true labels, columns are predicted labels
TN, FP, FN, TP = cm.ravel()
specificity = round(float(TN / (TN + FP)), decimal_places)
FPR = round(float(FP / (FP + TN)), decimal_places)
FNR = round(float(FN / (FN + TP)), decimal_places)
report_dict['FPR'] = FPR
report_dict['FNR'] = FNR
report_dict['Specificity'] = specificity

print('Classification report:\n{}'.format(report_dict))


Classification report:
{'Accuracy': 0.826, 'Precision': 0.764, 'Recall': 0.974, 'F1-Score': 0.856, 'ROC_AUC': 0.816, 'FPR': 0.343, 'FNR': 0.026, 'Specificity': 0.657}
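
Every figure in the report follows from the four confusion-matrix counts obtained after thresholding the predicted probabilities. Because `roc_auc_score` is given the thresholded labels rather than the probabilities, the 'ROC_AUC' entry is the mean of recall and specificity, and the reported values are consistent with that: (0.974 + 0.657) / 2 ≈ 0.816. The sketch below reproduces the arithmetic on toy data. `threshold_metrics` is a hypothetical helper, the labels and probabilities are made up, and the function assumes that both classes occur and that at least one record is predicted positive.

```python
def threshold_metrics(y_true, probs, threshold=0.4):
    """Confusion-matrix metrics after labelling probs >= threshold as class 1."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "recall": recall,
        "specificity": specificity,
        "FPR": fp / (fp + tn),            # 1 - specificity
        "FNR": fn / (fn + tp),            # 1 - recall
        "balanced_accuracy": (recall + specificity) / 2,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0]                       # toy labels
probs = [0.9, 0.7, 0.45, 0.2, 0.6, 0.3, 0.1, 0.05]      # toy probabilities
print(threshold_metrics(y_true, probs))
```

Lowering the threshold trades specificity for recall, which matches the pattern in the report above: a high recall alongside a noticeably lower specificity.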


# Data Preprocessing
Convert text data to numerical features using TF-IDF.
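
As a rough illustration of what the TF-IDF step computes, the sketch below weights each term by its count in the document multiplied by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit length. This mirrors the default settings of scikit-learn's `TfidfVectorizer` as far as the weighting goes, but the tokenisation is simplified to whitespace splitting. `tfidf` is a hypothetical helper and the documents are toy data.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = ["rat liver study", "human liver trial", "human trial"]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents receive a larger weight: in the first toy document, "rat" outweighs "liver" because "liver" also appears in the second document.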

In [None]:
animal = pd.read_csv('excludes_Animal_2200.csv')
human = pd.read_csv('includes_human_2400.csv')

#animal.head(), human.head()

In [None]:
# shape of each DataFrame as (rows, columns)
animal.shape, human.shape

((2212, 10), (2411, 10))

In [None]:
#add target variable
animal['target'] = 0
human['target'] = 1

In [None]:
animal.head()

Unnamed: 0,Title,Abstract,Primary Author,Journal,Year,Volume,Issue,Pages,Comments,Eppi ID,target
0,An In Vivo Comparison: Novel Mesh Suture Versus Traditional Suture-Based Repair in a Rabbit Tendon Model,"Purpose: Despite advancements in surgical techniques, suture pull-though and rupture continue to limit the early range of motion and functional rehabilitation after flexor tendon repairs. The aim of this study was to evaluate a suturable mesh compared with a commonly used braided suture in an in vivo rabbit intrasynovial tendon model. Method(s): Twenty-four New Zealand female rabbits (3-4 kg) were injected with 2 units/kg botulinum toxin evenly distributed into 4 sites in the left calf. After 1 week, the animals underwent surgical tenotomy of the flexor digitorum tendon and were randomized to repair with either 2-0 Duramesh suturable mesh or to 2-0 Fiberwire using a 2-strand modified Kessler and 6-0 polypropylene running epitendinous suture. Rabbits were killed at 2, 4, and 9 weeks after surgery. Result(s): Grouping across time points, 58.3% (7 of 12) of Duramesh repairs were found to be intact for the explant compared with 16.7% (2 of 12) of Fiberwire repairs (P = .09). At 2 weeks, the mean Duramesh repairs were significantly stronger than the Fiberwire repairs with a mean failure load of 50.7 +/- 12.7 N compared to 14.8 +/- 18.3 N (P = .02). The load supported by the Duramesh repairs at 2 weeks (mean 50.7 +/- 12.7 N) was similar to the load supported by both Fiberwire (52.2 +/- 13.6 N) and Duramesh (57.6 +/- 22.3 N) at 4 weeks. The strength of repair between Fiberwire and Duramesh at 4 weeks and 9 weeks was not significantly different. Conclusion(s): The 2-strand tendon repair with suturable mesh achieved significantly greater strength at 2 weeks than the conventional suture material. Future studies should evaluate the strength of repair prior to 2 weeks to determine the strength curve for this novel suture material. 
Clinical Relevance: This study evaluates the utility of a novel suturable mesh for flexor tendon repair in an in vivo rabbit model compared with conventional suture material.Copyright ? 2021 The Authors","Janes, L.E.",Journal of Hand Surgery Global Online,2022,4.0,1,32-39,,14844675,0
1,"Effects of urea supplementation on ruminal fermentation characteristics, nutrient intake, digestibility, and performance in sheep: A meta-analysis","Background and Aim: As a non-protein nitrogen source, urea is a popular, low cost, and easily obtained protein supplement. The objective of the present study was to perform a meta-analysis of the effects of urea supplementation on rumen fermentation and sheep performance. Material(s) and Method(s): A total of 32 experiments from 21 articles were compiled into a dataset. The levels of dietary urea varied from 0 to 31 g/kg of dry matter (DM). Parameters observed were rumen fermentation product, nutrient intake, nutrient digestibility, and sheep performance. This dataset was analyzed using a mixed model methodology, with urea supplementation levels as fixed effects and the different experiments as random effects. Result(s): Increasing levels of urea were associated with increases (p=0.008) in rumen pH, butyrate (C4) production, and ammonia (NH3-N) concentration. Urea supplementation had minor effects on total volatile fatty acids (p=0.242), total protozoa (p=0.429), and the microbial N supply (p=0.619), but tended to increase methane production (CH4; p<0.001). Supplementation of urea increased the intake of dry matter (DM; p=0.004) and crude protein (CP; p=0.001). Digestibility parameters, such as DM digestibility (DMD) and CP digestibility (CPD), also increased (p<0.01) as a result of urea supplementation. Retained N (p=0.042) and N intake (p<0.001) were higher with increasing levels of urea supplementation. In terms of animal performance, supplementation of urea increased average daily gain (ADG; p=0.024), but decreased the hot carcass weight percentage (p=0.017). Conclusion(s): This meta-analysis reports the positive effects of urea supplementation on rumen fermentation products (i.e., pH, C4, and NH3-N), intake (DM, CP, and N), digestibility (DMD and CPD), and ADG in sheep.Copyright: Wahyono, et al. Open Access. 
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.","Wahyono, T.",Veterinary World,2022,15.0,2,331-340,,14844676,0
2,The in vitro and in vivo anti-inflammatory activities of triterpene saponins from Clematis florida.,"Clematis florida is widely used in She Ethnopharmacy in China owing to its significant anti-inflammatory activities. This study aimed to investigate the anti-inflammatory effect of the active fraction of C. florida (CFAF) in an arthritis animal model and its possible mechanism. Pre-inflammatory cytokine levels were examined by ELISA. CFAF can significantly improve the symptoms of arthritis such as paw swelling, arthritic index, and histological condition in AA rat. CFAF can also reduce levels of IL-1?, TNF-? and IL-6. Further studies showed that triterpene saponins from CFAF induced anti-inflammatory activity inhibited inflammatory mediators by blocking JAK/STAT signalling pathways in the LPS-treated macrophages.","Yang, Na-Na",Natural Product Research,2021,35.0,24,6180-6183,,14844677,0
3,Evaluating Mouse Fibroblast Interaction with Implant Surfaces in a 3D Microenvironment.,"Purpose: Previous studies assessing fibroblast interactions with implants have mainly relied on measurements such as cell migration, gene expression, and cell adhesion. For these studies, testing cellular behavior at the implant surface was done by imaging the cell-implant interface using standard microscopy techniques in 2D tissue culture dishes. The true behavior of cells relative to the implant can best be assessed in a more physiologic 3D microenvironment. Materials and Methods: The embedding of the implant disks in 3D collagen gels was standardized with labeled fibroblasts to allow the imaging of fibroblast morphology and behavior when proximal to or binding to the implant disks. This allowed comparison of the behavior of laser-microgrooved and machined implant disk surfaces quantitatively in an in vitro 3D microenvironment. Results: This in vitro imaging assay revealed for the first time in a 3D microenvironment setting the statistically significant impact laser-microgrooved disk surfaces have on both cell adherence and recruitment of cells in proximity to the disk. It also allowed visualization of membrane protrusivity and cytoskeletal organization in cells adherent to the implant disk. Conclusion: This assay provides a simple and effective way of observing cell behavior on and around the implant disk surface in a more physiologic 3D setting. Within the limits of this study, it revealed that the laser-microgrooved implant surface demonstrates significant superiority in fibroblast recruitment and binding in a 3D microenvironment.","Kasherwal, Vishakha",International Journal of Oral & Maxillofacial Implants,2021,36.0,6,1121-1128,,14844678,0
4,Deep learning-based image-analysis algorithm for classification and quantification of multiple histopathological lesions in rat liver.,"Artificial intelligence (AI)-based image analysis is increasingly being used for preclinical safety-assessment studies in the pharmaceutical industry. In this paper, we present an AI-based solution for preclinical toxicology studies. We trained a set of algorithms to learn and quantify multiple typical histopathological findings in whole slide images (WSIs) of the livers of young Sprague Dawley rats by using a U-Net-based deep learning network. The trained algorithms were validated using 255 liver WSIs to detect, classify, and quantify seven types of histopathological findings (including vacuolation, bile duct hyperplasia, and single-cell necrosis) in the liver. The algorithms showed consistently good performance in detecting abnormal areas. Approximately 75% of all specimens could be classified as true positive or true negative. In general, findings with clear boundaries with the surrounding normal structures, such as vacuolation and single-cell necrosis, were accurately detected with high statistical scores. The results of quantitative analyses and classification of the diagnosis based on the threshold values between ""no findings"" and ""abnormal findings"" correlated well with diagnoses made by professional pathologists. However, the scores for findings ambiguous boundaries, such as hepatocellular hypertrophy, were poor. These results suggest that deep learning-based algorithms can detect, classify, and quantify multiple findings simultaneously on rat liver WSIs. Thus, it can be a useful supportive tool for a histopathological evaluation, especially for primary screening in rat toxicity studies. Copyright ?2022 The Japanese Society of Toxicologic Pathology.","Shimazaki, Taishi",Journal of toxicologic pathology,2022,35.0,2,135-147,,14844679,0


In [None]:
human.head()

Unnamed: 0,Title,Abstract,Primary Author,Journal,Year,Volume,Issue,Pages,Comments,Eppi ID,target
0,How does pre-dialysis education need to change? Findings from a qualitative study with staff and patients.,"BACKGROUND: Pre-dialysis education (PDE) is provided to thousands of patients every year, helping them decide which renal replacement therapy (RRT) to choose. However, its effectiveness is largely unknown, with relatively little previous research into patients' views about PDE, and no research into staff views. This study reports findings relevant to PDE from a larger mixed methods study, providing insights into what staff and patients think needs to improve., METHODS: Semi-structured interviews in four hospitals with 96 clinical and managerial staff and 93 dialysis patients, exploring experiences of and views about PDE, and analysed using thematic framework analysis., RESULTS: Most patients found PDE helpful and staff valued its role in supporting patient decision-making. However, patients wanted to see teaching methods and materials improve and biases eliminated. Staff were less aware than patients of how informal staff-patient conversations can influence patients' treatment decision-making. Many staff felt ill equipped to talk about all treatment options in a balanced and unbiased way. Patient decision-making was found to be complex and patients' abilities to make treatment decisions were adversely affected in the pre-dialysis period by emotional distress., CONCLUSIONS: Suggested improvements to teaching methods and educational materials are in line with previous studies and current clinical guidelines. All staff, irrespective of their role, need to be trained about all treatment options so that informal conversations with patients are not biased. The study argues for a more individualised approach to PDE which is more like counselling than education and would demand a higher level of skill and training for specialist PDE staff. 
The study concludes that even if these improvements are made to PDE, not all patients will benefit, because some find decision-making in the pre-dialysis period too complex or are unable to engage with education due to illness or emotional distress. It is therefore recommended that pre-dialysis treatment decisions are temporary, and that PDE is replaced with on-going RRT education which provides opportunities for personalised education and on-going review of patients' treatment choices. Emotional support to help overcome the distress of the transition to end-stage renal disease will also be essential to ensure all patients can benefit from RRT education.","Combes, Gill",BMC nephrology,2017.0,18,1,334,,14842061,1
1,Factors associated with regular eye examinations in people with diabetes: results from the Victorian Population Health Survey.,"PURPOSE: Although diabetes increases the risk of becoming visually impaired or blind, a large proportion of people with diabetes are not receiving the recommended eye care to detect and prevent retinopathy. Assessing a broad range of demographic, health behavior, and societal characteristics in relation to eye care utilization, the present study aims to increase knowledge about the potential impact of such factors on eye care utilization., METHODS: In 2003, for the first time, the annual Victorian Population Health Survey (VPHS) incorporated various eye health-related questions. Approximately 12,600 primary approach letters were mailed to all eligible and randomly selected households. Using computer-assisted telephone interviewing (CATI), the interviewer selected the person aged 18 years or over with the most recent birthday within the contacted household., RESULTS: The mean age of all 7500 participants was 47.7 years (range, 18-99 years). Six percent (n = 424) of all participants had diabetes, of whom 80% (n = 345) reported a visit to an eye care specialist within the last 2 years. People with diabetes were more likely to have had an eye test within the last 2 years if they had seen a healthcare provider or had one of various health checks, including checks not related to diabetes, within the same time., CONCLUSION: Results suggest that people who take an interest in their general health may also be more aware of the importance of eye examinations to avoid vision loss. Eye health promotion activities therefore need to broaden their reach to approach from outside the health sector, targeting people with diabetes who normally do not receive health checks. 
The importance of dilated eye examinations for people with diabetes needs to be further promoted for eye care providers.","Muller, Andreas",Optometry and vision science : official publication of the American Academy of Optometry,2006.0,83,2,96-101,,14842062,1
2,Alirocumab for the treatment of hyperlipidemia in high-risk patients: an updated review.,"INTRODUCTION: Alirocumab is a fully human immunoglobulin G1 monoclonal antibody directed against proprotein convertase subtilisin/kexin type 9 (PCSK9) approved for the treatment of hypercholesterolemia in high-risk patients. The objective is to provide an updated review of the recent data published for alirocumab. Areas covered: The efficacy and safety of alirocumab has been initially evaluated in a comprehensive phase 3 program conducted in more than 6 000 patients with primary non-familial and heterozygous familial hypercholesterolemia: alirocumab reduced LDL-cholesterol up to 62% in phase 3 with every 2-week dosing compared with placebo, and up to 36% compared with ezetimibe, with an excellent safety and tolerability profile. Herein, the author describes new efficacy and safety data obtained from complementary analyses of the phase 3 program submitted for approval and reports data from new specific trials. Expert commentary: Based on current high pricing, the patient groups prioritized for alirocumab treatment are patients with heterozygous familial hypercholesterolemia and patients with atherosclerotic cardiovascular disease who have substantially elevated LDL cholesterol on maximally tolerated statin plus ezetimibe therapy. The ongoing ODYSSEY OUTCOMES trial will provide important information on the cost-effectiveness of alirocumab treatment.","Farnier, Michel",Expert review of cardiovascular therapy,2017.0,15,12,923-932,,14842063,1
3,Comparison of risk factors for hepatitis B and C in patients visiting a gastroenterology clinic.,"OBJECTIVE: To find out and compare the risk factors for hepatitis B and C infections in patients visiting a gastroenterology clinic., DESIGN: A case-control study., PLACE AND DURATION OF STUDY: The Liver Stomach Clinic, Karachi, from July 2004 to September 2004., PATIENTS AND METHODS: Patients of hepatitis B and C visiting the clinic were interviewed and data were noted on a prescribed form. Patients with dyspeptic symptoms who were negative for both hepatitis B and C were taken as controls. Statistical analysis was done using SPSS package., RESULTS: Total numbers of patients interviewed were 148; 63 with hepatitis C, 41 with hepatitis B and 44 in the control group. These patients hailed from various parts of Pakistan with diverse ethnicity. Comparing hepatitis C with the control group, important risk factors identified were lower level of education, the occupational exposure to the blood and syringes, history of blood transfusions, taking therapeutic injections and intravenous drips, and habit of getting shaved by barbers. Patients of hepatitis B were younger as compared to the control group. Their knowledge about spread of infection was poor. These patients had not received hepatitis B vaccine during childhood. Less number of risk factors could be identified in this group, Shaving from the barber s shop was also found to be a risk factor just like in hepatitis C., CONCLUSION: There is a need to educate general population about the possible risk factors associated with the spread of hepatitis C and B. Proper screening of blood products and universal precautions against the spread of infections are recommended. Treatment by 1/V drips and getting shaved by barbers should be discouraged. Vaccination against hepatitis B is recommended.","Shazi, Lubna",Journal of the College of Physicians and Surgeons--Pakistan : JCPSP,2006.0,16,2,104-7,,14842064,1
4,"Comparative safety of atorvastatin 80 mg versus 10 mg derived from analysis of 49 completed trials in 14,236 patients.","Atorvastatin has been shown to reduce coronary events and revascularization procedures in patients with multiple risk factors for coronary heart disease. Recent studies with atorvastatin 80 mg support the overall safety of this dose during long-term treatment. However, physicians appear reluctant to use high doses of statins. A retrospective analysis of pooled data from 49 clinical trials of atorvastatin in 14,236 patients treated for an average period of 2 weeks to 52 months was conducted. The study compared the safety of atorvastatin 10 mg (n = 7,258), atorvastatin 80 mg (n = 4,798), and placebo (n = 2,180) and included analyses on treatment-associated adverse events; nonserious and serious adverse events related to the musculoskeletal, hepatic, and renal systems; the incidence of elevations of creatine kinase >10 times the upper limit of normal (ULN); and hepatic transaminases >3 times ULN. Percentages of patients experiencing > or =1 adverse event were similar across all 3 groups. Withdrawals due to treatment-related adverse events were observed in 2.4%, 1.8%, and 1.2% of patients in the atorvastatin 10 mg, atorvastatin 80 mg, and placebo groups, respectively. Serious adverse events were rare and seldom led to treatment withdrawal with any dose. Treatment-associated myalgia was observed in 1.4%, 1.5%, and 0.7% of patients in the atorvastatin 10 mg, atorvastatin 80 mg, and placebo groups, respectively. No cases of rhabdomyolysis were reported in any group. Persistent elevations in hepatic transaminases >3 times ULN were observed in 0.1%, 0.6%, and 0.2% of patients in the atorvastatin 10 mg, atorvastatin 80 mg, and placebo groups, respectively. The incidence of treatment-associated adverse events for atorvastatin 80 mg was similar to that of atorvastatin 10 mg and placebo. 
In conclusion, the results of this analysis support the positive safety profile of atorvastatin at the highest dose.","Newman, Connie",The American journal of cardiology,2006.0,97,1,61-7,,14842065,1


In [None]:
animal.isnull().sum(), human.isnull().sum()


(Title                0
 Abstract            61
 Primary Author       4
 Journal              0
 Year                 0
 Volume              30
 Issue              687
 Pages              190
 Comments          2212
 Eppi ID              0
 target               0
 dtype: int64,
 Title                0
 Abstract           410
 Primary Author       0
 Journal              0
 Year                 2
 Volume              25
 Issue              113
 Pages                6
 Comments          2411
 Eppi ID              0
 target               0
 dtype: int64)

In [None]:
animal.columns, human.columns

(Index(['Title', 'Abstract', 'Primary Author', 'Journal', 'Year', 'Volume',
        'Issue', 'Pages', 'Comments', 'Eppi ID', 'target'],
       dtype='object'),
 Index(['Title', 'Abstract', 'Primary Author', 'Journal', 'Year', 'Volume',
        'Issue', 'Pages', 'Comments', 'Eppi ID', 'target'],
       dtype='object'))

Deleting unwanted columns

In [None]:
# drop() with inplace=True returns None, so assigning its result wiped out the
# DataFrames; call drop() without inplace and keep the returned copy instead
unwanted_columns = ['Primary Author', 'Journal', 'Year', 'Volume',
       'Issue', 'Pages', 'Comments', 'Eppi ID', 'target']
human = human.drop(columns=unwanted_columns)
animal = animal.drop(columns=unwanted_columns)

In [None]:
animal.head()

In [None]:
animal_df = animal.iloc[:2212]

In [None]:
human_df = human.iloc[:2411]

In [None]:
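vectorizer = TfidfVectorizer(max_features=1000)
# fit_transform expects an iterable of text documents, not a file path string,
# so pass the 'titleabstract' column built earlier instead of TRAIN_PATH / TEST_PATH
X_train_tfidf = vectorizer.fit_transform(training['titleabstract'].astype(str))
X_test_tfidf = vectorizer.transform(testing['titleabstract'].astype(str))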