# Description of Tabula

tabula-py is a simple Python wrapper for tabula-java that reads tables from PDF files. It can load those tables into pandas DataFrames, and it can also convert a PDF file into a CSV, TSV, or JSON file.

Any one of the following commands installs tabula-py in Anaconda:
* conda install -c conda-forge tabula-py
* conda install -c conda-forge/label/cf201901 tabula-py
* conda install -c conda-forge/label/cf202003 tabula-py

In [2]:
import tabula

In [89]:
df = tabula.read_pdf('Annexure-I.pdf',pages="all")
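
Recent tabula-py releases (2.0 and later) return a list of DataFrames from `read_pdf` by default, one per detected table, whereas the cells in this notebook treat `df` as a single frame. The following is a minimal sketch of one way to handle both cases, assuming pandas is installed; `as_single_frame` is a hypothetical helper name and not part of tabula-py.

```python
import pandas as pd

def as_single_frame(result):
    """Return one DataFrame whether `result` is a DataFrame or a list of them."""
    if isinstance(result, pd.DataFrame):
        return result
    # Stack the per-table frames; ignore_index renumbers the rows 0..n-1.
    return pd.concat(list(result), ignore_index=True)

# Usage with tabula (needs Java and the PDF on disk), shown as a comment only:
# import tabula
# df = as_single_frame(tabula.read_pdf('Annexure-I.pdf', pages="all"))
```

If the tables on different pages were detected with different column names, `pd.concat` keeps the union of the columns and fills the gaps with NaN, so the combined frame is worth inspecting before further cleaning.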

In [90]:
columns = df.loc[1]  # the row labelled 1 holds the real column names
df.columns = columns
df = df[2:]          # keep only the rows below the header row
df


1,Rank,Full Journal Title,ISSN,Publsiher,Country
2,1,CA-A CANCER JOURNAL FOR CLINICIANS,0007-9235,WILEY,UNITED STATES
3,2,NEW ENGLAND JOURNAL OF MEDICINE,0028-4793,MASSACHUSETTS MEDICAL SOC,UNITED STATES
4,3,LANCET,0140-6736,ELSEVIER SCIENCE INC,ENGLAND
5,4,CHEMICAL REVIEWS,0009-2665,AMER CHEMICAL SOC,UNITED STATES
6,5,Nature Reviews Materials,2058-8437,NATURE PUBLISHING GROUP,ENGLAND
...,...,...,...,...,...
11102,11101,Deviance et Societe,0378-7931,MEDECINE ET HYGIENE,SWITZERLAND
11103,11102,LIBRARY AND INFORMATION SCIENCE,0373-4447,MITA SOC LIBRARY INFORMATION SCIENCE,JAPAN
11104,11103,INTERNATIONAL JOURNAL,0020-7020,SAGE PUBLICATIONS LTD,ENGLAND
11105,11104,Srpski Arhiv za Celokupno Lekarstvo,0370-8179,SRPSKO LEKARSKO DRUSTVO,SERBIA
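
The header promotion above can also be written with purely positional indexing, which does not depend on the row labels still being 0, 1, 2, and so on. This is a minimal sketch on made-up data; `promote_header` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

def promote_header(frame, header_pos):
    """Use the row at position `header_pos` as column names and keep the rows below it."""
    header = frame.iloc[header_pos]
    body = frame.iloc[header_pos + 1:].copy()
    # list(...) drops the Series name, so the columns index is not titled after the row label.
    body.columns = list(header)
    return body

# Toy frame standing in for the raw PDF extraction (values are illustrative only).
raw = pd.DataFrame([
    ["junk", "junk"],
    ["Rank", "Full Journal Title"],
    ["1", "LANCET"],
])
clean = promote_header(raw, 1)
```

Passing the header as a plain list is a small design choice: it avoids the stray `1` that appears in the top-left corner of the table output above.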


# Finding Missing Values in the Dataset

In [92]:
def missing(Frame):
    return Frame[Frame.isnull().any(axis = 1)]

In [93]:
missing(df)

1,Rank,Full Journal Title,ISSN,Publsiher,Country
5626,5625,POLYMER-PLASTICS TECHNOLOGY AND MATERIALS,2574-0881,,


1
Rank                  False
Full Journal Title    False
ISSN                  False
Publsiher              True
Country                True
dtype: bool
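
The `missing` function selects rows that contain at least one null, and the boolean listing above resembles the result of the same check applied per column. A short sketch of both views, plus a per-column count, on made-up data (the values below are illustrative and not taken from the PDF):

```python
import pandas as pd

frame = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Publisher": ["P1", None, "P3"],
    "Country": ["X", None, "Z"],
})

# Rows with at least one missing value (axis=1 checks across the columns of each row).
rows_with_gaps = frame[frame.isnull().any(axis=1)]

# Columns with at least one missing value (axis=0 checks down the rows of each column).
columns_with_gaps = frame.isnull().any(axis=0)

# Number of missing values in each column.
missing_per_column = frame.isnull().sum()
```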

In [64]:
df = df.drop(df.index[[0]])


In [65]:
columns = df.loc[1]

#columns = list(df.loc[1])


KeyError: 1

In [66]:
list(df.loc[1])


KeyError: 1

In [67]:
df.columns = columns

In [68]:
df = df.loc[2:]
df


1,Rank,Full Journal Title,ISSN,Publsiher,Country
3,2,NEW ENGLAND JOURNAL OF MEDICINE,0028-4793,MASSACHUSETTS MEDICAL SOC,UNITED STATES
4,3,LANCET,0140-6736,ELSEVIER SCIENCE INC,ENGLAND
5,4,CHEMICAL REVIEWS,0009-2665,AMER CHEMICAL SOC,UNITED STATES
6,5,Nature Reviews Materials,2058-8437,NATURE PUBLISHING GROUP,ENGLAND
7,6,NATURE REVIEWS DRUG DISCOVERY,1474-1776,NATURE PUBLISHING GROUP,ENGLAND
...,...,...,...,...,...
11102,11101,Deviance et Societe,0378-7931,MEDECINE ET HYGIENE,SWITZERLAND
11103,11102,LIBRARY AND INFORMATION SCIENCE,0373-4447,MITA SOC LIBRARY INFORMATION SCIENCE,JAPAN
11104,11103,INTERNATIONAL JOURNAL,0020-7020,SAGE PUBLICATIONS LTD,ENGLAND
11105,11104,Srpski Arhiv za Celokupno Lekarstvo,0370-8179,SRPSKO LEKARSKO DRUSTVO,SERBIA


In [80]:
Country = df["Country"]
Country.count()

11103

In [1]:
URL1 = "https://cfr.annauniv.edu/research/annexure-1-journals.php"
URL2 = "https://cfr.annauniv.edu/research/annexure-2-journals.php"

In [2]:
import requests
import lxml.html as lh
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

In [3]:
def Dataset(Link):
    page = requests.get(Link)
    # Parse the page content into an lxml element tree
    doc = lh.fromstring(page.content)
    # Select every table row (<tr>..</tr>) in the page
    tr_elements = doc.xpath('//tr')
    # Collect the text of each row, skipping the first 18 rows
    # (this offset is specific to the layout of these pages)
    col = [t.text_content() for t in tr_elements[18:]]
    # The first collected row holds the column names, one per line
    ColumnName = [i.split('  ')[-1] for i in col[0].split('\n')[1:-1]]
    # Split every row into cells on runs of four spaces, dropping empty pieces
    Data = [[i for i in ' '.join(col[j].split('\n')[1:-1]).split('    ') if not i==''] for j in range(len(col))]
    # Drop the last six rows (presumably trailing rows that are not journal entries)
    Data = Data[:-6]
    # Row 0 is the header row, so the data starts at row 1
    DF = pd.DataFrame(Data[1:],columns = ColumnName)
    return DF
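
`Dataset` recovers the cells by splitting each row's text on runs of spaces, which depends on how the page happens to be indented. As a hedged alternative, the sketch below reads the cells tag by tag using only the standard library; `TableRows` and `parse_table` are hypothetical names, and the HTML in the usage lines is a made-up fragment, not the real page.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace and trim both ends of the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

def parse_table(html):
    """Return the table rows found in `html` as lists of cleaned cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows

sample = ("<table><tr><th>Sl.No</th><th>ISSN </th></tr>"
          "<tr><td> 1 </td><td>  2053-1583 </td></tr></table>")
rows = parse_table(sample)
```

Because each cell is trimmed as it is read, column names such as `'ISSN '` and values with leading spaces would come out clean. The sketch does not handle nested tables.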

In [5]:
A_I = Dataset(URL1)
A_II = Dataset(URL2)

In [6]:
A_I

Unnamed: 0,Sl.No,Full Journal Title,ISSN,Publisher,Country
0,1,2D MATERIALS,2053-1583,IOP PUBLISHING LTD,ENGLAND
1,2,3 BIOTECH,2190-572X,SPRINGER HEIDELBERG,GERMANY
2,3,3D PRINTING AND ADDITIVE MANUFACTURING,2329-7662,"MARY ANN LIEBERT, INC",UNITED STATES
3,4,4OR-A QUARTERLY JOURNAL OF OPERATIONS RESEARCH,1619-4500,SPRINGER HEIDELBERG,GERMANY
4,5,AAPG BULLETIN,0149-1423,AMER ASSOC PETROLEUM GEOLOGIST,UNITED STATES
...,...,...,...,...,...
11084,11085,ZOOSYSTEMA,1280-9551,"PUBLICATIONS SCIENTIFIQUES DU MUSEUM, PARIS",FRANCE
11085,11086,ZOOSYSTEMATICS AND EVOLUTION,1860-0743,PENSOFT PUBL,BULGARIA
11086,11087,ZOOTAXA,1175-5326,MAGNOLIA PRESS,NEW ZEALAND
11087,11088,ZYGON,0591-2385,WILEY,UNITED STATES


In [7]:
Column_Name = A_I.columns
Column_Name

Index(['Sl.No', 'Full Journal Title ', 'ISSN ', 'Publisher', 'Country '], dtype='object')

In [8]:
ISSN = A_I[Column_Name[2]]

In [9]:
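def Parsing(Link):
    # Fetch the SCImago search results page for the query
    response = urllib.request.urlopen(Link).read()
    soup = BeautifulSoup(response,'html.parser')
    try:
        # Follow the first search hit to the journal's own page
        link = 'https://www.scimagojr.com/'+soup.find('div', class_='search_results').a['href']
        response = urllib.request.urlopen(link).read()
        soup = BeautifulSoup(response,'html.parser')
        # The scope text is the third child of the 'fullwidth' div
        Scope = soup.find('div', class_='fullwidth').contents[2]
        return Scope
    except Exception:
        # No search hit or an unexpected page layout: report no scope.
        # Catching Exception (not a bare except) lets Ctrl+C still stop the run.
        return None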

In [10]:
Scope = []
URL = 'https://www.scimagojr.com/journalsearch.php?q='
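
The ISSN values scraped above carry leading spaces, so they need trimming before they are appended to the search URL. A minimal sketch of building the query URL with the standard library; `search_url` is a hypothetical helper, and percent-encoding the query is an added precaution rather than something the notebook does.

```python
from urllib.parse import quote

SEARCH_URL = 'https://www.scimagojr.com/journalsearch.php?q='

def search_url(issn):
    """Return the search URL for one ISSN, ignoring surrounding whitespace."""
    # quote() percent-encodes any character that is not safe inside a URL.
    return SEARCH_URL + quote(issn.strip())

example = search_url('  2053-1583')
```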

In [11]:
Scope = [Parsing(URL+i.strip()) for i in ISSN]
Scope

KeyboardInterrupt: 

In [12]:
A_I['Scope'] = Scope

ValueError: Length of values does not match length of index
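
The length mismatch arises because the list comprehension was interrupted, so `Scope` never received one entry per journal. One way around this, sketched below under the assumption that each ISSN is unique, is to store results in a dictionary as they arrive and then build a full-length column from it. `collect` and `aligned` are hypothetical helpers, and `fetch` stands for any function such as `Parsing` applied to a search URL.

```python
def collect(keys, fetch, results):
    """Fetch a value for every key not yet in `results`; return True when all are done."""
    try:
        for key in keys:
            if key not in results:
                results[key] = fetch(key)
    except KeyboardInterrupt:
        # Keep what was gathered so far; calling collect() again resumes the work.
        return False
    return True

def aligned(keys, results):
    """Return one value per key, in order, with None where nothing was fetched."""
    return [results.get(key) for key in keys]

# Simulated run: the fake fetcher is interrupted once, on the third key.
interrupted = []

def fake_fetch(key):
    if key == "c" and not interrupted:
        interrupted.append(key)
        raise KeyboardInterrupt
    return key.upper()

cache = {}
first_pass = collect(["a", "b", "c"], fake_fetch, cache)
partial = aligned(["a", "b", "c"], cache)
second_pass = collect(["a", "b", "c"], fake_fetch, cache)
complete = aligned(["a", "b", "c"], cache)
```

Since `aligned` always returns as many items as there are keys, its result can be assigned to a DataFrame column even after a partial run.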

In [None]:
A_I.to_csv('A_IDataBASe.csv')

In [14]:
Scope

[]

In [47]:
for i in ISSN:
    print("=>",i)

=>   2053-1583
=>   2190-572X
=>   2329-7662
=>   1619-4500
=>   0149-1423
=>   1550-7416
=>   1530-9932
=>   2330-5517
=>   1532-8813
=>   0001-3072
=>   2366-004X
=>   0025-5858
=>   1012-8255
=>   1069-6563
=>   1040-2446
=>   1876-2859
=>   1042-9670
=>   1076-6332
=>   1941-6520
=>   0001-4273
=>   1537-260X
=>   1558-9080
=>   0363-7425
=>   0044-586X
=>   0001-4575
=>   0898-9621
=>   0001-4788
=>   0810-5391
=>   0951-3574
=>   0888-7993
=>   0361-3682
=>   0001-4826
=>   0001-4842
=>   0949-1775
=>   0889-325X
=>   0889-3241
=>   0360-0300
=>   1556-4673
=>   1550-4832
=>   0146-4833
=>   1549-6325
=>   1544-3558
=>   1544-3566
=>   2375-4699
=>   1556-4665
=>   1529-3785
=>   0734-2071
=>   1073-0516
=>   1946-6226
=>   0362-5915
=>   1084-4309
=>   1539-9087
=>   0730-0301
=>   1046-8188
=>   2157-6904
=>   1533-5399
=>   1556-4681
=>   0098-3500
=>   1049-3301
=>   1551-6857
=>   2471-2566
=>   0164-0925
=>   1936-7406
=>   1550-4859
=>   1049-331X
=>   1553-3077
=>   1559-

=>   0225-5189
=>   1499-2671
=>   1486-3847
=>   0008-4077
=>   0008-4085
=>   0840-8688
=>   1481-8035
=>   1196-1961
=>   0706-652X
=>   0045-5067
=>   2291-2797
=>   1712-9532
=>   0008-414X
=>   0008-4166
=>   0317-1671
=>   0008-4174
=>   0008-4182
=>   0008-4204
=>   0008-4212
=>   0706-0661
=>   0008-4220
=>   0008-4239
=>   0706-7437
=>   0008-4263
=>   0703-8992
=>   0829-5735
=>   0318-6431
=>   0008-4271
=>   0319-5724
=>   0008-428X
=>   1195-9479
=>   0830-9000
=>   0008-4301
=>   0714-9808
=>   0008-4395
=>   0820-3946
=>   0008-4433
=>   0008-4476
=>   0008-4506
=>   0708-5591
=>   0008-4840
=>   0317-0861
=>   1198-2241
=>   1755-6171
=>   0380-1489
=>   0008-5286
=>   0701-1784
=>   0008-543X
=>   0167-7659
=>   2095-3941
=>   1538-4047
=>   1574-0153
=>   1084-9785
=>   0957-5243
=>   1535-6108
=>   1475-2867
=>   0344-5704
=>   1073-2748
=>   1934-662X
=>   2159-8274
=>   1877-7821
=>   1055-9965
=>   0929-1903
=>   2210-7762
=>   1109-6535
=>   1470-7330
=>   0340-

=>   2048-3694
=>   1867-0334
=>   0379-5721
=>   1557-1858
=>   2212-4292
=>   0890-5436
=>   0308-8146
=>   0956-7135
=>   1552-8014
=>   1866-7910
=>   0268-005X
=>   0015-6426
=>   0740-0020
=>   2214-2894
=>   0306-9192
=>   0950-3293
=>   0963-9969
=>   8755-9129
=>   2048-7177
=>   1226-7708
=>   0101-2061
=>   1082-0132
=>   1344-6606
=>   1876-4517
=>   1330-9862
=>   1535-3141
=>   1071-1007
=>   1083-7515
=>   1268-7731
=>   0015-6914
=>   0015-704X
=>   0015-7120
=>   0015-718X
=>   1743-8586
=>   0379-0738
=>   1872-4973
=>   1547-769X
=>   1860-8965
=>   0378-1127
=>   2095-6355
=>   1437-4781
=>   1389-9341
=>   0015-7473
=>   0015-749X
=>   2171-5068
=>   0015-752X
=>   0015-7546
=>   1999-4907
=>   0934-5043
=>   0925-9856
=>   0015-7899
=>   0720-4299
=>   0015-8208
=>   0933-7741
=>   2194-6183
=>   2193-0066
=>   1802-5439
=>   1554-0669
=>   1386-4238
=>   1615-3375
=>   0015-9018
=>   1233-1821
=>   0429-2766
=>   0218-348X
=>   1311-0454
=>   0891-5849
=>   1071-

=>   2095-2899
=>   1229-9162
=>   2190-9385
=>   0733-5210
=>   0271-678X
=>   0021-9568
=>   1074-1542
=>   0098-0331
=>   0021-9584
=>   0021-9592
=>   1549-9596
=>   0891-0618
=>   0021-9606
=>   1747-5198
=>   0974-3626
=>   0268-2575
=>   1549-9618
=>   0021-9614
=>   1758-2946
=>   2090-9063
=>   0886-9383
=>   1120-009X
=>   1067-828X
=>   1044-5463
=>   1062-1024
=>   1367-4935
=>   0305-0009
=>   0883-0738
=>   0021-9630
=>   1053-8712
=>   1863-2521
=>   1080-6954
=>   1755-5345
=>   0021-9665
=>   0021-9673
=>   1570-0232
=>   0218-1266
=>   1392-3730
=>   0176-4268
=>   0959-6526
=>   0894-8755
=>   1380-3395
=>   0952-8180
=>   0733-2459
=>   0912-0009
=>   1537-4416
=>   1094-6950
=>   0021-972X
=>   0895-4356
=>   0192-0790
=>   1524-6175
=>   0271-9142
=>   0021-9738
=>   0887-8013
=>   1933-2874
=>   2077-0383
=>   0095-1137
=>   1387-1307
=>   1738-6586
=>   0736-0258
=>   0967-5868
=>   0962-1067
=>   0732-183X
=>   0021-9746
=>   1053-4628
=>   0303-6979
=>   0091-

=>   0026-2617
=>   0385-5600
=>   1092-2172
=>   2045-8827
=>   1350-0872
=>   2049-2618
=>   0026-265X
=>   0026-3672
=>   1073-9688
=>   0167-9317
=>   1356-5362
=>   0026-2692
=>   0026-2714
=>   1613-4982
=>   0938-0108
=>   2072-666X
=>   0968-4328
=>   0026-2803
=>   1387-1811
=>   0141-9331
=>   2050-5698
=>   1431-9276
=>   1059-910X
=>   0738-1085
=>   0946-7076
=>   2055-7434
=>   0026-2862
=>   0895-2477
=>   0026-3141
=>   1061-1924
=>   0266-6138
=>   2049-5838
=>   0374-9096
=>   1424-9286
=>   0887-378X
=>   0026-4075
=>   0899-5605
=>   0305-8298
=>   0268-1064
=>   1751-2271
=>   1074-9039
=>   1868-8527
=>   0924-6495
=>   1025-9112
=>   0882-7508
=>   0026-4598
=>   0026-461X
=>   0930-0708
=>   2075-163X
=>   0747-9182
=>   0892-6875
=>   0026-4695
=>   0375-9393
=>   1120-4826
=>   0026-4725
=>   0026-4733
=>   0391-1977
=>   0026-4806
=>   0026-4946
=>   0393-2249
=>   1364-5706
=>   1389-5575
=>   1570-193X
=>   0026-5535
=>   1819-754X
=>   0276-7783
=>   1540-

=>   0253-939X
=>   1012-0750
=>   2224-7890
=>   1608-9685
=>   0081-2463
=>   0038-2353
=>   1808-9798
=>   0038-2876
=>   1360-8746
=>   0125-1562
=>   1468-3857
=>   1528-7092
=>   1608-9693
=>   1607-3614
=>   0038-3910
=>   0038-4038
=>   2070-2620
=>   0038-4348
=>   0147-1724
=>   0038-6073
=>   1206-3312
=>   0265-9646
=>   0038-6308
=>   1542-7390
=>   1695-971X
=>   0210-2412
=>   1138-7416
=>   1387-5868
=>   1742-1772
=>   2211-6753
=>   1064-6671
=>   1086-055X
=>   1930-1855
=>   1094-6470
=>   1386-1425
=>   0584-8547
=>   0887-6703
=>   1000-0593
=>   0038-7010
=>   0167-6393
=>   2010-3247
=>   1362-4393
=>   0362-2436
=>   1529-9430
=>   0341-8391
=>   1357-3322
=>   2157-3905
=>   1743-0437
=>   1441-3523
=>   1061-6934
=>   0888-4781
=>   1476-3141
=>   1941-7381
=>   0112-1642
=>   1062-8592
=>   0932-0555
=>   0370-8179
=>   1061-0022
=>   0038-9145
=>   1944-3277
=>   0731-5082
=>   0038-9765
=>   0038-9056
=>   1536-867X
=>   1532-4400
=>   0039-0402
=>   1017-
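
Some entries in the printed listing are cut off (for example `1559-`), presumably by the notebook truncating long output. If such a fragment ever reached the search step it would not be a usable query, so a format check can help. The sketch below validates the `NNNN-NNNC` pattern and the standard ISSN check digit; `is_valid_issn` is a hypothetical helper and not part of the notebook.

```python
import re

ISSN_PATTERN = re.compile(r"\d{4}-\d{3}[\dX]")

def is_valid_issn(text):
    """Return True if `text` is a well-formed ISSN with a correct check digit."""
    issn = text.strip().upper()
    if not ISSN_PATTERN.fullmatch(issn):
        return False
    digits = issn.replace("-", "")
    # Weight the first seven digits by 8, 7, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected

checks = [is_valid_issn(s) for s in ["  2053-1583", "  2190-572X", "  1559-"]]
```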

In [44]:
len(Scope)

5

In [54]:
Parsing(URL+ISSN[7].strip())

'\r\n      AATCC Journal of Research. This textile research journal has a broad scope: from advanced materials, fibers, and textile and polymer chemistry, to color science, apparel design, and sustainability.\r\n\r\nNow indexed by Science Citation Index Extended (SCIE) and discoverable in the Clarivate Analytics Web of Science Core Collection! The Journal’s impact factor is available in Journal Citation Reports.        '
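
The scope text comes back with carriage returns, blank lines, and padding spaces. A minimal sketch of normalising it before it is stored in the `Scope` column; `tidy` is a hypothetical helper, and it passes `None` through because `Parsing` returns `None` when no scope is found.

```python
def tidy(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    if text is None:
        return None
    # str() also covers string-like objects such as those BeautifulSoup returns.
    return " ".join(str(text).split())

cleaned = tidy('\r\n      AATCC Journal of Research.\r\n\r\nNow indexed!        ')
```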

In [51]:
Scope

["\r\n      2D Materials aims to curate the most significant and cutting-edge research being undertaken in the field of 2D materials science and engineering. Serving an expanding multidisciplinary community of researchers and technologists, our goal is to develop a selective journal dedicated to bringing together the most important new results and perspectives from across the discipline. Submissions should be essential reading for a particular sub-field, but should also be of multidisciplinary interest to the wider community, with the expectation that published work will have significant impact.\r\n\r\nSubmissions that do not meet 2DM's strict acceptance criteria may be transferred at the discretion of the journal's editors (with author approval) to other relevant journals in the IOP portfolio, for further consideration.\r\n2D Materials is a multidisciplinary journal devoted to publishing original research of the highest quality and impact covering all aspects of two-dimensional materi