# Text Extraction

This notebook extracts each article's full text from its PDF file.

In [1]:
import pandas as pd

# load the csv file that contains each article's info and file location
df = pd.read_csv('data/articles.csv')
df.head()

Unnamed: 0,Date,Title,Abstract,Keywords,File Name,URL
0,Published: 8 January 2021,A Systematic Literature Review on English and ...,Due to the enormous growth of information and ...,"English Bangla Comparison, Latent Dirichlet Al...",2021_17_1_jcssp.2021.1.18.pdf,https://thescipub.com/pdf/jcssp.2021.1.18.pdf
1,Published: 21 January 2021,DAD: A Detailed Arabic Dataset for Online Text...,This paper presents a novel Arabic dataset tha...,"Arabic Dataset, Arabic Benchmark, Arabic Recog...",2021_17_1_jcssp.2021.19.32.pdf,https://thescipub.com/pdf/jcssp.2021.19.32.pdf
2,Published: 20 January 2021,Collision Avoidance Modelling in Airline Traff...,An Air Traffic Controller (ATC) system aims to...,"Air Traffic Control, Collision Avoidance, Conf...",2021_17_1_jcssp.2021.33.43.pdf,https://thescipub.com/pdf/jcssp.2021.33.43.pdf
3,Published: 20 January 2021,Fine-Tuned MobileNet Classifier for Classifica...,"This paper proposed an accurate, fast and reli...","Strawberry, Cherry Fruit, Accuracy, MobileNet,...",2021_17_1_jcssp.2021.44.54.pdf,https://thescipub.com/pdf/jcssp.2021.44.54.pdf
4,Published: 21 January 2021,A Content Filtering from Spam Posts on Social ...,The system for filtering spam posts on social ...,"Content Filtering, Spam Detection, Multimodal ...",2021_17_1_jcssp.2021.55.66.pdf,https://thescipub.com/pdf/jcssp.2021.55.66.pdf


In [2]:
# display number of articles and their features
df.shape

(2694, 6)

In [3]:
# display data frame's info -- check for columns that have missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2694 entries, 0 to 2693
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       2694 non-null   object
 1   Title      2694 non-null   object
 2   Abstract   2693 non-null   object
 3   Keywords   2688 non-null   object
 4   File Name  2694 non-null   object
 5   URL        2694 non-null   object
dtypes: object(6)
memory usage: 126.4+ KB
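
The non-null counts reported by `df.info()` can also be read as a count of missing values per column. A minimal sketch, assuming a small hypothetical frame (`sample`) in place of `data/articles.csv`:

```python
import pandas as pd

# hypothetical stand-in for the articles data frame
sample = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Abstract': ['text', None, 'text'],
    'Keywords': [None, None, 'k1, k2'],
})

# number of missing values in each column
missing_counts = sample.isnull().sum()
print(missing_counts)
```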


## Missing Abstract

In [4]:
# look at the article with missing abstract
df[df['Abstract'].isnull()]

Unnamed: 0,Date,Title,Abstract,Keywords,File Name,URL
939,Published: 14 April 2014,A MOBILE AGENT BASED APPROACH FOR AUTOMATING &...,,,2014_10_9_jcssp.2014.1628.1641.pdf,https://thescipub.com/pdf/jcssp.2014.1628.1641...


In [5]:
# get the url of the article's pdf file
# so we can grab the missing contents from the pdf file
df[df['Abstract'].isnull()].URL.values

array(['https://thescipub.com/pdf/jcssp.2014.1628.1641.pdf'], dtype=object)

In [6]:
# grab the abstract and keywords from the article's pdf file
abstract = '''Nowadays, the focus is not only on how to exchange data between companies but also on how to
exchange services between them or between companies-customers in order to minimize IT charges and
increase their profit. Thus, web services are an adequate solution for the e-business thanks to their
interoperability and reusability. The combination of web services and semantic technology, which are
called Semantic Web Services (SWS), allows their discovery-composition-invocation process to be
automatically performed by programs or intelligent agents. However, this process is still a challenging
task that includes several issues such as the complexity of finding and composing distributed SWS. In
this study we present a mobile agent based approach to discover and compose SWS in a distributed
environment and extend some algorithms related to this field. The article reports examples and
experimental results in order to illustrate as well as to assess the benefits of the proposed approach.'''

keywords = '''Semantic Web Services, Semantic Web Services Discovery, Semantic Web Services Composition, Ontology, Mobile Agent, JADE'''

In [7]:
# get the row's index for the article with missing abstract
row_index = df[df['Abstract'].isnull()].index[0]

# fill in the abstract and keywords for the article
# (use .loc to avoid chained assignment, which may not update df)
df.loc[row_index, 'Abstract'] = abstract
df.loc[row_index, 'Keywords'] = keywords

# make sure the abstract has been updated
df.loc[row_index]

Date                                  Published: 14 April 2014
Title        A MOBILE AGENT BASED APPROACH FOR AUTOMATING &...
Abstract     Nowadays, the focus is not only on how to exch...
Keywords     Semantic Web Services, Semantic Web Services D...
File Name                   2014_10_9_jcssp.2014.1628.1641.pdf
URL          https://thescipub.com/pdf/jcssp.2014.1628.1641...
Name: 939, dtype: object
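
When filling individual cells, a single `.loc[row, column]` assignment is generally preferred over chained indexing such as `df.iloc[i].col = value`, which may write to a temporary copy depending on the pandas version. A minimal sketch, assuming a small hypothetical frame (`sample`) rather than the real articles data:

```python
import pandas as pd

# hypothetical stand-in for the articles data frame
sample = pd.DataFrame({
    'Title': ['A', 'B'],
    'Abstract': ['first abstract', None],
    'Keywords': ['k1', None],
})

# index label of the row whose abstract is missing
row_index = sample[sample['Abstract'].isnull()].index[0]

# one .loc call selects the row and columns together and writes in place
sample.loc[row_index, ['Abstract', 'Keywords']] = ['filled abstract', 'k2, k3']

print(sample.loc[row_index])
```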

## Extract Article's Full Text from PDF File

In [8]:
import PyPDF4

path = 'articles/'
article_contents = []
file_list = list(df['File Name'].values)

# a list of files for manual text extraction
exclude_files = ['2008_4_8_jcssp.2008.652.662.pdf',
                 '2007_3_4_jcssp.2007.242.248.pdf', 
                 '2005_1_1_jcssp.2005.98.102.pdf']

for filename in file_list:
    # ignore the specified files because they cannot be read
    # and the texts need to be extracted manually
    if filename in exclude_files:
        continue
    print(filename)
    
    # creating a pdf file object 
    with open(path + filename, 'rb') as pdfFileObj:
        # creating a pdf reader object 
        pdfReader = PyPDF4.PdfFileReader(pdfFileObj)

        # iterate through every page in the pdf document
        article_text = ''
        for page in range(pdfReader.numPages):
            # create a page object and extract text from the current page
            pageObj = pdfReader.getPage(page)
            # strip off all new line characters
            extracted_text = pageObj.extractText().replace('\n', '')

            # if the page contains the intro, assume this is the first page
            # and exclude all text that comes before the intro
            if 'INTRODUCTION' in extracted_text or 'Introduction' in extracted_text:
                # split the text into 2 portions
                first_page = extracted_text.split("INTRODUCTION")
                if len(first_page) == 1:
                    first_page = extracted_text.split("Introduction")

                # store main content, exclude the portion before "INTRODUCTION"
                article_text = first_page[-1]
            elif 'REFERENCES' in extracted_text or 'References' in extracted_text:
                # assume this is the last page of the article
                last_page = extracted_text.split('REFERENCES')
                if len(last_page) == 1:
                    last_page = extracted_text.split("References")

                # keep the article's text and ignore everything after REFERENCES
                article_text = article_text + ' ' + last_page[0]

                # stop here because the text below the References section is not needed
                break
            else:
                # store article's text
                article_text = article_text + ' ' + extracted_text

        # store article's text to a list
        article_contents.append(article_text.strip())

2021_17_1_jcssp.2021.1.18.pdf
2021_17_1_jcssp.2021.19.32.pdf
2021_17_1_jcssp.2021.33.43.pdf
2021_17_1_jcssp.2021.44.54.pdf
2021_17_1_jcssp.2021.55.66.pdf
2021_17_2_jcssp.2021.67.89.pdf
2021_17_2_jcssp.2021.90.111.pdf
2021_17_2_jcssp.2021.112.122.pdf
2021_17_2_jcssp.2021.123.134.pdf
2021_17_2_jcssp.2021.135.155.pdf
2021_17_2_jcssp.2021.156.166.pdf
2021_17_2_jcssp.2021.167.177.pdf
2021_17_3_jcssp.2021.178.187.pdf
2021_17_3_jcssp.2021.188.196.pdf
2021_17_3_jcssp.2021.197.204.pdf
2021_17_3_jcssp.2021.205.220.pdf
2021_17_3_jcssp.2021.221.230.pdf
2021_17_3_jcssp.2021.231.241.pdf
2021_17_3_jcssp.2021.242.250.pdf
2021_17_3_jcssp.2021.251.264.pdf
2021_17_3_jcssp.2021.265.274.pdf
2021_17_3_jcssp.2021.275.283.pdf
2021_17_3_jcssp.2021.284.295.pdf
2021_17_3_jcssp.2021.296.303.pdf
2021_17_3_jcssp.2021.304.318.pdf
2021_17_3_jcssp.2021.319.329.pdf
2021_17_3_jcssp.2021.330.348.pdf
2021_17_3_jcssp.2021.349.363.pdf
2021_17_4_jcssp.2021.364.370.pdf
2021_17_4_jcssp.2021.371.402.pdf
2021_17_4_jcssp.2021.403



2021_17_6_jcssp.2021.580.597.pdf
2021_17_6_jcssp.2021.598.609.pdf
2021_17_7_jcssp.2021.610.623.pdf
2021_17_7_jcssp.2021.624.638.pdf
2021_17_7_jcssp.2021.639.656.pdf
2021_17_7_jcssp.2021.657.669.pdf
2021_17_7_jcssp.2021.670.682.pdf
2021_17_8_jcssp.2021.683.691.pdf
2021_17_8_jcssp.2021.692.708.pdf
2021_17_8_jcssp.2021.709.723.pdf
2021_17_8_jcssp.2021.724.737.pdf
2021_17_8_jcssp.2021.738.747.pdf
2021_17_9_jcssp.2021.748.761.pdf
2021_17_9_jcssp.2021.762.775.pdf
2021_17_9_jcssp.2021.776.788.pdf
2021_17_9_jcssp.2021.789.802.pdf
2021_17_9_jcssp.2021.803.813.pdf
2021_17_9_jcssp.2021.814.824.pdf
2021_17_9_jcssp.2021.825.843.pdf
2021_17_9_jcssp.2021.833.838.pdf
2021_17_9_jcssp.2021.857.869.pdf
2021_17_10_jcssp.2021.870.888.pdf
2021_17_10_jcssp.2021.889.904.pdf
2021_17_10_jcssp.2021.905.914.pdf
2020_16_1_jcssp.2020.1.13.pdf
2020_16_1_jcssp.2020.14.24.pdf
2020_16_1_jcssp.2020.25.34.pdf
2020_16_1_jcssp.2020.35.49.pdf
2020_16_1_jcssp.2020.50.55.pdf
2020_16_1_jcssp.2020.56.71.pdf
2020_16_1_jcssp.2020

2019_15_6_jcssp.2019.824.831.pdf
2019_15_6_jcssp.2019.832.843.pdf
2019_15_6_jcssp.2019.844.854.pdf
2019_15_6_jcssp.2019.855.860.pdf
2019_15_6_jcssp.2019.861.872.pdf
2019_15_6_jcssp.2019.873.879.pdf
2019_15_7_jcssp.2019.880.885.pdf
2019_15_7_jcssp.2019.886.929.pdf
2019_15_7_jcssp.2019.930.943.pdf
2019_15_7_jcssp.2019.944.953.pdf
2019_15_7_jcssp.2019.954.960.pdf
2019_15_7_jcssp.2019.961.971.pdf
2019_15_7_jcssp.2019.972.982.pdf
2019_15_7_jcssp.2019.983.994.pdf
2019_15_7_jcssp.2019.995.1003.pdf
2019_15_7_jcssp.2019.1004.1011.pdf
2019_15_7_jcssp.2019.1012.1021.pdf
2019_15_7_jcssp.2019.1022.1039.pdf
2019_15_7_jcssp.2019.1040.1049.pdf
2019_15_8_jcssp.2019.1050.1064.pdf
2019_15_8_jcssp.2019.1065.1073.pdf
2019_15_8_jcssp.2019.1074.1084.pdf
2019_15_8_jcssp.2019.1085.1096.pdf
2019_15_8_jcssp.2019.1097.1107.pdf
2019_15_8_jcssp.2019.1108.1122.pdf
2019_15_8_jcssp.2019.1123.1132.pdf
2019_15_8_jcssp.2019.1133.1149.pdf
2019_15_8_jcssp.2019.1150.1160.pdf
2019_15_8_jcssp.2019.1161.1183.pdf
2019_15_8_jcss

2017_13_8_jcssp.2017.275.289.pdf
2017_13_8_jcssp.2017.290.300.pdf
2017_13_8_jcssp.2017.301.306.pdf
2017_13_8_jcssp.2017.307.319.pdf
2017_13_8_jcssp.2017.320.328.pdf
2017_13_8_jcssp.2017.329.336.pdf
2017_13_8_jcssp.2017.337.354.pdf
2017_13_8_jcssp.2017.355.362.pdf
2017_13_8_jcssp.2017.363.370.pdf
2017_13_9_jcssp.2017.371.379.pdf
2017_13_9_jcssp.2017.380.392.pdf
2017_13_9_jcssp.2017.393.399.pdf
2017_13_9_jcssp.2017.400.407.pdf
2017_13_9_jcssp.2017.408.415.pdf
2017_13_9_jcssp.2017.416.421.pdf
2017_13_9_jcssp.2017.422.429.pdf
2017_13_9_jcssp.2017.430.439.pdf
2017_13_9_jcssp.2017.440.451.pdf
2017_13_9_jcssp.2017.452.459.pdf
2017_13_10_jcssp.2017.460.469.pdf
2017_13_10_jcssp.2017.470.495.pdf
2017_13_10_jcssp.2017.496.504.pdf
2017_13_10_jcssp.2017.505.513.pdf
2017_13_10_jcssp.2017.514.523.pdf
2017_13_10_jcssp.2017.524.536.pdf
2017_13_10_jcssp.2017.537.547.pdf
2017_13_10_jcssp.2017.548.557.pdf
2017_13_10_jcssp.2017.558.571.pdf
2017_13_10_jcssp.2017.572.580.pdf
2017_13_11_jcssp.2017.581.589.pdf

2014_10_2_jcssp.2014.190.197.pdf
2014_10_2_jcssp.2014.198.209.pdf
2014_10_2_jcssp.2014.210.223.pdf
2014_10_2_jcssp.2014.224.232.pdf
2014_10_2_jcssp.2014.233.239.pdf
2014_10_2_jcssp.2014.240.250.pdf
2014_10_2_jcssp.2014.251.254.pdf
2014_10_2_jcssp.2014.255.263.pdf
2014_10_2_jcssp.2014.264.271.pdf
2014_10_2_jcssp.2014.272.284.pdf
2014_10_2_jcssp.2014.285.295.pdf
2014_10_2_jcssp.2014.296.304.pdf
2014_10_2_jcssp.2014.305.315.pdf
2014_10_2_jcssp.2014.316.324.pdf
2014_10_2_jcssp.2014.325.329.pdf
2014_10_2_jcssp.2014.330.340.pdf
2014_10_2_jcssp.2014.341.346.pdf
2014_10_2_jcssp.2014.347.352.pdf
2014_10_2_jcssp.2014.353.360.pdf
2014_10_2_jcssp.2014.361.365.pdf
2014_10_3_jcssp.2014.366.375.pdf
2014_10_3_jcssp.2014.376.381.pdf
2014_10_3_jcssp.2014.382.392.pdf
2014_10_3_jcssp.2014.393.402.pdf
2014_10_3_jcssp.2014.403.410.pdf
2014_10_3_jcssp.2014.411.422.pdf
2014_10_3_jcssp.2014.423.433.pdf
2014_10_3_jcssp.2014.434.442.pdf
2014_10_3_jcssp.2014.443.452.pdf
2014_10_3_jcssp.2014.453.457.pdf
2014_10_3_

2014_10_12_jcssp.2014.2358.2359.pdf
2014_10_12_jcssp.2014.2360.2365.pdf
2014_10_12_jcssp.2014.2366.2373.pdf
2014_10_12_jcssp.2014.2374.2382.pdf
2014_10_12_jcssp.2014.2383.2394.pdf
2014_10_12_jcssp.2014.2395.2407.pdf
2014_10_12_jcssp.2014.2408.2414.pdf
2014_10_12_jcssp.2014.2415.2421.pdf
2014_10_12_jcssp.2014.2422.2428.pdf
2014_10_12_jcssp.2014.2429.2441.pdf
2014_10_12_jcssp.2014.2442.2449.pdf
2014_10_12_jcssp.2014.2450.2463.pdf
2014_10_12_jcssp.2014.2464.2468.pdf
2014_10_12_jcssp.2014.2469.2480.pdf
2014_10_12_jcssp.2014.2481.2487.pdf
2014_10_12_jcssp.2014.2488.2493.pdf
2014_10_12_jcssp.2014.2494.2506.pdf
2014_10_12_jcssp.2014.2507.2517.pdf
2014_10_12_jcssp.2014.2518.2524.pdf
2014_10_12_jcssp.2014.2525.2537.pdf
2014_10_12_jcssp.2014.2538.2547.pdf
2014_10_12_jcssp.2014.2548.2552.pdf
2014_10_12_jcssp.2014.2553.2563.pdf
2014_10_12_jcssp.2014.2564.2575.pdf
2014_10_12_jcssp.2014.2576.2583.pdf
2014_10_12_jcssp.2014.2584.2592.pdf
2014_10_12_jcssp.2014.2593.2607.pdf
2014_10_12_jcssp.2014.2608.2

2012_8_1_jcssp.2012.61.67.pdf
2012_8_1_jcssp.2012.68.75.pdf
2012_8_1_jcssp.2012.76.83.pdf
2012_8_1_jcssp.2012.84.88.pdf
2012_8_1_jcssp.2012.89.98.pdf
2012_8_1_jcssp.2012.99.106.pdf
2012_8_1_jcssp.2012.107.120.pdf
2012_8_1_jcssp.2012.121.132.pdf
2012_8_1_jcssp.2012.133.140.pdf
2012_8_1_jcssp.2012.141.144.pdf
2012_8_1_jcssp.2012.145.148.pdf
2012_8_1_jcssp.2012.149.158.pdf
2012_8_1_jcssp.2012.159.162.pdf
2012_8_1_jcssp.2012.163.169.pdf
2012_8_1_jcssp.2012.170.174.pdf
2012_8_1_jcssp.2012.175.180.pdf
2012_8_2_jcssp.2012.181.187.pdf
2012_8_2_jcssp.2012.188.194.pdf
2012_8_2_jcssp.2012.195.199.pdf
2012_8_2_jcssp.2012.200.204.pdf
2012_8_2_jcssp.2012.205.215.pdf
2012_8_2_jcssp.2012.216.221.pdf
2012_8_2_jcssp.2012.222.226.pdf
2012_8_2_jcssp.2012.227.231.pdf
2012_8_2_jcssp.2012.232.238.pdf
2012_8_2_jcssp.2012.239.242.pdf
2012_8_2_jcssp.2012.243.250.pdf
2012_8_2_jcssp.2012.251.258.pdf
2012_8_2_jcssp.2012.259.264.pdf
2012_8_2_jcssp.2012.265.271.pdf
2012_8_2_jcssp.2012.272.276.pdf
2012_8_3_jcssp.2012

2012_8_11_jcssp.2012.1924.1931.pdf
2012_8_11_jcssp.2012.1932.1939.pdf
2012_8_12_jcssp.2012.1940.1945.pdf
2012_8_12_jcssp.2012.1946.1956.pdf
2012_8_12_jcssp.2012.1957.1960.pdf
2012_8_12_jcssp.2012.1961.1969.pdf
2012_8_12_jcssp.2012.1970.1978.pdf
2012_8_12_jcssp.2012.1979.1986.pdf
2012_8_12_jcssp.2012.1987.1995.pdf
2012_8_12_jcssp.2012.1996.2007.pdf
2012_8_12_jcssp.2012.2008.2016.pdf
2012_8_12_jcssp.2012.2017.2024.pdf
2012_8_12_jcssp.2012.2025.2031.pdf
2012_8_12_jcssp.2012.2032.2041.pdf
2012_8_12_jcssp.2012.2042.2052.pdf
2012_8_12_jcssp.2012.2053.2061.pdf
2012_8_12_jcssp.2012.2062.2067.pdf
2012_8_12_jcssp.2012.2068.2074.pdf
2012_8_12_jcssp.2012.2075.2082.pdf
2012_8_12_jcssp.2012.2083.2097.pdf
2012_8_12_jcssp.2012.2098.2105.pdf
2012_8_12_jcssp.2012.2106.2111.pdf
2011_7_1_jcssp.2011.1.5.pdf
2011_7_1_jcssp.2011.6.11.pdf
2011_7_1_jcssp.2011.12.16.pdf
2011_7_1_jcssp.2011.17.26.pdf
2011_7_1_jcssp.2011.27.31.pdf
2011_7_1_jcssp.2011.32.38.pdf
2011_7_1_jcssp.2011.39.45.pdf
2011_7_1_jcssp.2011.46.



2011_7_1_jcssp.2011.101.107.pdf
2011_7_1_jcssp.2011.108.113.pdf
2011_7_1_jcssp.2011.114.119.pdf
2011_7_1_jcssp.2011.120.128.pdf
2011_7_2_jcssp.2011.129.142.pdf
2011_7_2_jcssp.2011.143.147.pdf
2011_7_2_jcssp.2011.148.153.pdf
2011_7_2_jcssp.2011.154.158.pdf
2011_7_2_jcssp.2011.159.166.pdf
2011_7_2_jcssp.2011.167.172.pdf
2011_7_2_jcssp.2011.173.178.pdf
2011_7_2_jcssp.2011.179.187.pdf
2011_7_2_jcssp.2011.188.196.pdf
2011_7_2_jcssp.2011.197.205.pdf
2011_7_2_jcssp.2011.206.215.pdf
2011_7_2_jcssp.2011.216.224.pdf
2011_7_2_jcssp.2011.225.233.pdf
2011_7_2_jcssp.2011.234.241.pdf
2011_7_2_jcssp.2011.242.249.pdf
2011_7_2_jcssp.2011.250.254.pdf
2011_7_2_jcssp.2011.255.261.pdf
2011_7_2_jcssp.2011.262.269.pdf
2011_7_2_jcssp.2011.270.278.pdf
2011_7_2_jcssp.2011.279.283.pdf
2011_7_2_jcssp.2011.284.290.pdf
2011_7_2_jcssp.2011.291.297.pdf
2011_7_2_jcssp.2011.298.303.pdf
2011_7_2_jcssp.2011.304.313.pdf
2011_7_2_jcssp.2011.314.319.pdf
2011_7_3_jcssp.2011.320.327.pdf
2011_7_3_jcssp.2011.328.340.pdf
2011_7_3

2011_7_12_jcssp.2011.1867.1874.pdf
2011_7_12_jcssp.2011.1875.1880.pdf
2011_7_12_jcssp.2011.1881.1887.pdf
2011_7_12_jcssp.2011.1888.1893.pdf
2011_7_12_jcssp.2011.1894.1899.pdf
2011_7_12_jcssp.2011.1900.1907.pdf
2011_7_12_jcssp.2011.1908.1913.pdf
2011_7_12_jcssp.2011.1914.1920.pdf
2011_7_12_jcssp.2011.1921.1927.pdf
2010_6_1_jcssp.2010.1.11.pdf
2010_6_1_jcssp.2010.12.17.pdf
2010_6_1_jcssp.2010.18.23.pdf
2010_6_1_jcssp.2010.24.28.pdf
2010_6_1_jcssp.2010.29.35.pdf
2010_6_1_jcssp.2010.36.42.pdf
2010_6_1_jcssp.2010.43.46.pdf
2010_6_1_jcssp.2010.47.51.pdf
2010_6_1_jcssp.2010.52.59.pdf
2010_6_1_jcssp.2010.60.66.pdf
2010_6_1_jcssp.2010.67.74.pdf
2010_6_1_jcssp.2010.75.79.pdf
2010_6_1_jcssp.2010.80.86.pdf
2010_6_1_jcssp.2010.87.91.pdf
2010_6_1_jcssp.2010.92.100.pdf
2010_6_2_jcssp.2010.101.106.pdf
2010_6_2_jcssp.2010.107.111.pdf
2010_6_2_jcssp.2010.112.116.pdf
2010_6_2_jcssp.2010.117.125.pdf
2010_6_2_jcssp.2010.126.132.pdf
2010_6_2_jcssp.2010.133.140.pdf
2010_6_2_jcssp.2010.141.162.pdf
2010_6_2_jc

2009_5_2_jcssp.2009.123.130.pdf
2009_5_2_jcssp.2009.131.135.pdf
2009_5_2_jcssp.2009.136.139.pdf
2009_5_2_jcssp.2009.140.145.pdf
2009_5_2_jcssp.2009.146.153.pdf
2009_5_2_jcssp.2009.154.162.pdf
2009_5_2_jcssp.2009.163.171.pdf
2009_5_3_jcssp.2009.172.176.pdf
2009_5_3_jcssp.2009.177.183.pdf
2009_5_3_jcssp.2009.184.190.pdf
2009_5_3_jcssp.2009.191.198.pdf
2009_5_3_jcssp.2009.199.206.pdf
2009_5_3_jcssp.2009.207.213.pdf
2009_5_3_jcssp.2009.214.220.pdf
2009_5_3_jcssp.2009.221.225.pdf
2009_5_3_jcssp.2009.226.232.pdf
2009_5_3_jcssp.2009.233.241.pdf
2009_5_4_jcssp.2009.242.249.pdf
2009_5_4_jcssp.2009.250.254.pdf
2009_5_4_jcssp.2009.255.262.pdf
2009_5_4_jcssp.2009.263.269.pdf
2009_5_4_jcssp.2009.270.274.pdf
2009_5_4_jcssp.2009.275.282.pdf
2009_5_4_jcssp.2009.283.289.pdf
2009_5_4_jcssp.2009.290.296.pdf
2009_5_4_jcssp.2009.297.301.pdf
2009_5_4_jcssp.2009.302.310.pdf
2009_5_4_jcssp.2009.311.322.pdf
2009_5_4_jcssp.2009.323.329.pdf
2009_5_4_jcssp.2009.330.337.pdf
2009_5_5_jcssp.2009.338.346.pdf
2009_5_5

2008_4_10_jcssp.2008.864.870.pdf
2008_4_10_jcssp.2008.871.876.pdf
2008_4_10_jcssp.2008.877.887.pdf
2008_4_11_jcssp.2008.888.896.pdf
2008_4_11_jcssp.2008.897.902.pdf
2008_4_11_jcssp.2008.903.909.pdf
2008_4_11_jcssp.2008.910.915.pdf
2008_4_11_jcssp.2008.916.921.pdf
2008_4_11_jcssp.2008.922.927.pdf
2008_4_11_jcssp.2008.928.933.pdf
2008_4_11_jcssp.2008.934.941.pdf
2008_4_11_jcssp.2008.942.950.pdf
2008_4_11_jcssp.2008.951.958.pdf
2008_4_11_jcssp.2008.959.962.pdf
2008_4_11_jcssp.2008.963.966.pdf
2008_4_12_jcssp.2008.967.975.pdf
2008_4_12_jcssp.2008.976.981.pdf
2008_4_12_jcssp.2008.982.990.pdf
2008_4_12_jcssp.2008.991.998.pdf
2008_4_12_jcssp.2008.999.1002.pdf
2008_4_12_jcssp.2008.1003.1011.pdf
2008_4_12_jcssp.2008.1012.1019.pdf
2008_4_12_jcssp.2008.1020.1023.pdf
2008_4_12_jcssp.2008.1024.1029.pdf
2008_4_12_jcssp.2008.1030.1035.pdf
2008_4_12_jcssp.2008.1036.1041.pdf
2008_4_12_jcssp.2008.1042.1050.pdf
2008_4_12_jcssp.2008.1051.1055.pdf
2008_4_12_jcssp.2008.1056.1060.pdf
2008_4_12_jcssp.2008.106

2006_2_6_jcssp.2006.521.527.pdf
2006_2_6_jcssp.2006.528.534.pdf
2006_2_6_jcssp.2006.535.541.pdf
2006_2_6_jcssp.2006.542.549.pdf
2006_2_6_jcssp.2006.550.557.pdf
2006_2_7_jcssp.2006.558.564.pdf
2006_2_7_jcssp.2006.565.571.pdf
2006_2_7_jcssp.2006.572.576.pdf
2006_2_7_jcssp.2006.577.582.pdf
2006_2_7_jcssp.2006.583.588.pdf
2006_2_7_jcssp.2006.589.594.pdf
2006_2_7_jcssp.2006.595.599.pdf
2006_2_7_jcssp.2006.600.606.pdf
2006_2_8_jcssp.2006.607.611.pdf
2006_2_8_jcssp.2006.612.614.pdf
2006_2_8_jcssp.2006.615.618.pdf
2006_2_8_jcssp.2006.619.626.pdf
2006_2_8_jcssp.2006.627.633.pdf
2006_2_8_jcssp.2006.634.637.pdf
2006_2_8_jcssp.2006.638.645.pdf
2006_2_8_jcssp.2006.646.659.pdf
2006_2_8_jcssp.2006.660.664.pdf
2006_2_8_jcssp.2006.665.671.pdf
2006_2_9_jcssp.2006.672.675.pdf
2006_2_9_jcssp.2006.676.682.pdf
2006_2_9_jcssp.2006.683.689.pdf
2006_2_9_jcssp.2006.690.697.pdf
2006_2_9_jcssp.2006.698.703.pdf
2006_2_9_jcssp.2006.704.709.pdf
2006_2_9_jcssp.2006.710.715.pdf
2006_2_9_jcssp.2006.716.734.pdf
2006_2_9
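
The trimming idea used in the loop (drop what comes before the introduction and what follows the references) can be tried in isolation on a plain string. The sketch below is a simplified variant, not the page-by-page extraction above: it works on the whole document text, keeps what follows the first introduction heading and cuts at the last references heading. The helper name `trim_to_body` and the sample string are hypothetical.

```python
def trim_to_body(text):
    """Return the text after the first intro heading and before the last references heading."""
    # find the earliest occurrence of either spelling of the intro heading
    start = -1
    heading_len = 0
    for heading in ('INTRODUCTION', 'Introduction'):
        pos = text.find(heading)
        if pos != -1 and (start == -1 or pos < start):
            start, heading_len = pos, len(heading)
    body = text[start + heading_len:] if start != -1 else text

    # cut at the last occurrence of either spelling of the references heading
    end = max(body.rfind('REFERENCES'), body.rfind('References'))
    if end != -1:
        body = body[:end]
    return body.strip()


sample = 'Title Abstract text INTRODUCTION Body text here. More body. REFERENCES [1] Someone'
print(trim_to_body(sample))
```

Text without either heading is returned unchanged apart from stripped whitespace, so a document that the heuristics do not match still yields its full extracted text.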

In [9]:
# create a pandas data frame of the articles' full texts
full_texts = pd.DataFrame(article_contents, columns=['Text'])

# display the shape of the full-text data frame
full_texts.shape

(2691, 1)

In [10]:
# look at the full texts
full_texts.head()

Unnamed: 0,Text
0,Because of the rapid development of Informatio...
1,"In the literature, many papers that focus on A..."
2,Collision avoidance on air traffic becomes ver...
3,"In recent years, farmers in India eventually l..."
4,Spam is the use of electronic devices to trans...


In [11]:
# get the subset of data without the excluded files
data = df[~df['File Name'].isin(exclude_files)].reset_index(drop=True)

# display the shape of data
data.shape

(2691, 6)

In [12]:
# add article's full text to the data frame
data_new = pd.merge(data, full_texts, right_index=True, left_index=True)

print("New data shape:", data_new.shape)

# look at the first five rows
data_new.head()

New data shape: (2691, 7)


Unnamed: 0,Date,Title,Abstract,Keywords,File Name,URL,Text
0,Published: 8 January 2021,A Systematic Literature Review on English and ...,Due to the enormous growth of information and ...,"English Bangla Comparison, Latent Dirichlet Al...",2021_17_1_jcssp.2021.1.18.pdf,https://thescipub.com/pdf/jcssp.2021.1.18.pdf,Because of the rapid development of Informatio...
1,Published: 21 January 2021,DAD: A Detailed Arabic Dataset for Online Text...,This paper presents a novel Arabic dataset tha...,"Arabic Dataset, Arabic Benchmark, Arabic Recog...",2021_17_1_jcssp.2021.19.32.pdf,https://thescipub.com/pdf/jcssp.2021.19.32.pdf,"In the literature, many papers that focus on A..."
2,Published: 20 January 2021,Collision Avoidance Modelling in Airline Traff...,An Air Traffic Controller (ATC) system aims to...,"Air Traffic Control, Collision Avoidance, Conf...",2021_17_1_jcssp.2021.33.43.pdf,https://thescipub.com/pdf/jcssp.2021.33.43.pdf,Collision avoidance on air traffic becomes ver...
3,Published: 20 January 2021,Fine-Tuned MobileNet Classifier for Classifica...,"This paper proposed an accurate, fast and reli...","Strawberry, Cherry Fruit, Accuracy, MobileNet,...",2021_17_1_jcssp.2021.44.54.pdf,https://thescipub.com/pdf/jcssp.2021.44.54.pdf,"In recent years, farmers in India eventually l..."
4,Published: 21 January 2021,A Content Filtering from Spam Posts on Social ...,The system for filtering spam posts on social ...,"Content Filtering, Spam Detection, Multimodal ...",2021_17_1_jcssp.2021.55.66.pdf,https://thescipub.com/pdf/jcssp.2021.55.66.pdf,Spam is the use of electronic devices to trans...
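
The index-based merge relies on the extracted texts and the metadata rows being in exactly the same order. A sketch of an alternative, assuming the file name is stored next to each extracted text so the merge can use it as a key; the `records` list and all sample values are hypothetical.

```python
import pandas as pd

# hypothetical metadata, standing in for the filtered articles frame
data = pd.DataFrame({
    'Title': ['Paper A', 'Paper B', 'Paper C'],
    'File Name': ['a.pdf', 'b.pdf', 'c.pdf'],
})

# hypothetical extraction results, stored with their file names
# (deliberately in a different order than the metadata)
records = [
    {'File Name': 'c.pdf', 'Text': 'text of C'},
    {'File Name': 'a.pdf', 'Text': 'text of A'},
    {'File Name': 'b.pdf', 'Text': 'text of B'},
]
full_texts = pd.DataFrame(records)

# merge on the shared key; validate raises if a file name is duplicated
data_new = pd.merge(data, full_texts, on='File Name', how='left',
                    validate='one_to_one')
print(data_new)
```

With a key-based merge, a skipped or reordered file leaves a missing `Text` value in its own row instead of shifting every later text onto the wrong article.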


In [13]:
data_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2691 entries, 0 to 2690
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       2691 non-null   object
 1   Title      2691 non-null   object
 2   Abstract   2691 non-null   object
 3   Keywords   2686 non-null   object
 4   File Name  2691 non-null   object
 5   URL        2691 non-null   object
 6   Text       2691 non-null   object
dtypes: object(7)
memory usage: 147.3+ KB


In [14]:
# save to csv file
data_new.to_csv('data/article_full_text.csv', index=False)

In [19]:
# save article_contents to csv file
full_texts.to_csv('data/article_contents.csv', index=False)