<center><h1>Artificial Intelligence Journals Ranking (2000 - 2021)</h1></center>

<center><h3>Vásquez, V., Cruz, J. & Henao, M.</h3></center>

<center><h2 style="margin-top:50px;">Abstract</h2></center>
<p>SCImago Journal & Country Rank is a publicly available portal that includes journal and country scientific indicators developed from the information in the Scopus database. The dataset used in this document covers the Artificial Intelligence rankings from 2000 to 2021. With it, it is possible to determine the trends of the subtopics covered in the documents published throughout the years and which journals are the most significant in the area.</p>

<strong>Keywords:</strong> Artificial Intelligence, Data analysis, Journal ranking

<h2>0. Setting Up Environment</h2>

In [25]:
# System variables
import os 
from glob import glob

# Data processing libraries
import numpy as np
import pandas as pd

# Dataset connection
import opendatasets as od

# Graphic tools 
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt

<h3>0.1. Download Dataset</h3>

<p>The dataset is downloaded from Kaggle using the opendatasets library.</p>

In [None]:
# ================
# Download dataset
# ================
od.download("https://www.kaggle.com/datasets/yasirabdaali/artificial-intelligence-journals-ranking-20002021", "../dataset")

<h3>0.2. Reading Dataset</h3>

In [26]:
def path_csvFiles(PATH=os.getcwd(), EXT="*.csv"):
    """
    Walk PATH (the current working directory by default) and its subdirectories
    and collect every file that matches EXT (*.csv by default) and whose path
    contains "scimagojr". Matching uses the glob module, which follows the
    pattern rules of a Unix shell.

    Returns:
        A list with the paths of all matching files.
    """
    print(os.getcwd())
    list_paths = []
    for path, subdir, files in os.walk(PATH):
        for file in glob(os.path.join(path, EXT)):
            if file.find("scimagojr") != -1:
                list_paths.append(file)
    return list_paths


def concat_paths(all_paths):
    """
    Receive a list of CSV file paths, read each file, label its rows with the
    year taken from the file path (stored in the "Year" column) and
    concatenate everything into a single dataframe.

    Returns:
        A dataframe with the rows of all files.
    """

    all_df = []
    for path in all_paths:
        df = pd.read_csv(path, sep = ';')        
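        # Take the year from the file name only, so that spaces in the
        # directory part of the path do not shift the token position
        df['Year'] = int(os.path.basename(path).split()[1])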
        all_df.append(df)
    
    df = pd.concat(all_df, ignore_index=True)
    return df 
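
<p>The year label depends on the position of a token in the file name. A minimal sketch of a more tolerant alternative is shown next, assuming the file names contain a four-digit year (for example "scimagojr 2005 ..."). The helper <code>extract_year</code> and the example path are hypothetical and are not part of the original notebook.</p>

```python
import os
import re


def extract_year(path):
    """Return the first four-digit year (19xx or 20xx) found in the file name of `path`."""
    file_name = os.path.basename(path)
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", file_name)
    if match is None:
        raise ValueError(f"No year found in file name: {file_name!r}")
    return int(match.group(0))


# Hypothetical path: the directory contains a space, the file name holds the year.
example_path = os.path.join(
    "my data", "scimagojr 2005  Subject Category - Artificial Intelligence.csv"
)
print(extract_year(example_path))
```

<p>Searching for the year with a regular expression avoids relying on a fixed token position, and the explicit error makes unexpected file names visible instead of failing with an index error.</p>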

In [27]:
# ======================================================
# Call the functions that read and join the dataset tables
# ======================================================
df = concat_paths(path_csvFiles())

C:\Users\juandiego\Documents\Programacion\final_project_TEAM\final_project_TGL2022


<h2>1. Understanding the dataset</h2>

In [None]:
# ===========================================
# Display the columns that make up the dataset
# ===========================================
df.columns.values.tolist()[:21]

<ul>
    <li><b>Rank:</b> Consecutive number assigned to each record within its table </li>
    <li><b>Source ID:</b> Scopus Journal ID </li>
    <li><b>Title:</b> Journal’s title</li>
    <li><b>Type:</b> Type of publication (Journal, Book Series and Conference & Proceedings) </li>
    <li><b>ISSN:</b> International Standard Serial Number  </li>
    <li><b>SJR:</b> Weighted citations received in year X to documents published in the journal in years X-1, X-2 and X-3.</li>
    <li>
        <b>SJR Quartile:</b> Each thematic category is divided into quartiles.
        <ul>
            <li><b>Q1:</b> group made up of the first 25% of the journals on the list. </li>
            <li><b>Q2:</b> group that occupies from 25% to 50% </li>
            <li><b>Q3:</b> group that is positioned between 50% and 75% </li>
            <li><b>Q4:</b> group that is positioned between 75% and 100% </li>
        </ul>
    </li>
</ul>
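
<p>The quartile split described above can be illustrated with a toy example. This is only a sketch: the journals and SJR values are invented, and it assumes a single thematic category in which journals are ordered by SJR.</p>

```python
import pandas as pd

# Hypothetical SJR scores for eight journals in one category.
toy = pd.DataFrame({
    "Title": [f"Journal {c}" for c in "ABCDEFGH"],
    "SJR": [4.2, 3.1, 2.5, 1.9, 1.2, 0.8, 0.5, 0.1],
})

# Rank 1 is the highest SJR; qcut then splits the ranks into four equal groups.
toy["Rank"] = toy["SJR"].rank(ascending=False, method="first").astype(int)
toy["Quartile"] = pd.qcut(toy["Rank"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(toy)
```

<p>With eight journals, each quartile receives two of them: the two highest SJR values fall in Q1 and the two lowest in Q4.</p>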

<ul>
    <li>
        <b>H Index:</b>
        The h index expresses the journal's number of articles (h) that have received at least h citations. It quantifies both journal scientific productivity and scientific impact and it is also applicable to scientists, countries, etc.
    </li>
    <li>
        <b>Total Docs. (3years):</b>
        Published documents in the three previous years
    </li>
    <li>
        <b>Total Refs:</b>
        All the bibliographical references in a journal in the selected period.
    </li>
    <li>
        <b>Total Cites (3years):</b>
        Number of citations received in the selected year by a journal to the documents published in the three previous years
    </li>
    <li>
        <b>Citable Docs. (3years):</b>
        Number of citable documents published by a journal in the three previous years
    </li>
    <li>
        <b>Cites / Doc. (2years):</b>
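        Average citations per document over a two-year window: citations received in the selected year by the documents published in the two previous years, divided by the number of documents published in those two years.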
    </li>
    <li>
        <b>Ref. / Doc:</b>
        Average number of references per document: all the bibliographical references in a journal in the selected period divided by the total documents published in that period.
    </li>
    <li>
        <b>Publisher:</b> Journal Publisher.
    </li>
    <li>
        <b>Coverage:</b>
        The period of time, expressed in years, for which the journal is covered.
    </li>
    <li>
        <b>Categories:</b>
        Subject categories assigned to the journal.
    </li>
</ul>
<br>
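
<p>Two of the indicators above are easy to reproduce by hand. The sketch below computes an h index from a list of citation counts and a citations-per-document ratio. All numbers are invented, and the ratio assumes the usual reading of the indicator as citations divided by documents.</p>

```python
def h_index(citations):
    """Largest h such that at least h documents have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position
        else:
            break
    return h


# Hypothetical citation counts for the documents of one journal.
citations = [10, 8, 5, 4, 3, 0]
print(h_index(citations))  # 4: four documents have at least 4 citations each

# Citations per document as a ratio, using made-up totals.
cites_to_last_two_years = 150
docs_in_last_two_years = 60
print(cites_to_last_two_years / docs_in_last_two_years)  # 2.5
```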

In [None]:
df.head(5)

In [None]:
# =========================================================================
# Plot df as a color-encoded matrix.
# The heatmap shows which columns have missing values, so that strategies
# to handle them can be defined
# =========================================================================
sns.heatmap(df.notnull())

In [None]:
# ===========================================================================
# Plot a subset of df as a color-encoded matrix.
# Only the 'Conference and Proceedings' records are selected. The plot shows
# that most of the empty values in the 'Coverage' and 'Publisher' columns
# belong to this type of publication
# ===========================================================================
sns.heatmap(df[(df['Type'] == "conference and proceedings")][["Type","Coverage","Publisher"]].notnull())
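
<p>The heatmaps give a visual impression of the gaps. As a numeric complement, the share of missing values can be computed per column and per publication type. The sketch below uses an invented toy frame with the same column names, not the real dataset.</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "Type": ["journal", "conference and proceedings",
             "conference and proceedings", "book series"],
    "Coverage": ["2000-2021", np.nan, np.nan, "2010-2021"],
    "Publisher": ["Publisher A", np.nan, "Publisher B", "Publisher C"],
})

# Share of missing values per column, over all rows.
missing_share = toy.isnull().mean()
print(missing_share)

# Share of missing values per column within each publication type.
missing_by_type = (
    toy[["Coverage", "Publisher"]].isnull().astype(int).groupby(toy["Type"]).mean()
)
print(missing_by_type)
```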

<h2>2. Preprocessing data</h2>

In [None]:
# ==================================================
# Keep only 'Journal' and 'Book Series' type records
# ==================================================
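# .copy() makes the result an independent dataframe, so that the columns added
# later do not trigger a SettingWithCopyWarning
df = df.loc[df['Type'].isin(['journal', 'book series'])].copy()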

In [None]:
# ===================================================================================
# The data of the 'Total Docs.20##' columns is stored in the 'Total Docs. per Year' column
# ===================================================================================
df['Total Docs. per Year'] = df[list(df.filter(regex  = '20'))].fillna('').astype(str).apply(lambda x: "".join(x), axis =1)
df['Total Docs. per Year'] = df['Total Docs. per Year'].astype(float)
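
<p>The same consolidation can be done without passing through strings. The sketch below is an alternative under two assumptions: each row has a value in only one of the per-year columns, and the columns follow a pattern such as "Total Docs. (2000)". The column names in the toy frame are hypothetical.</p>

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical per-year columns; each row has one filled value.
toy = pd.DataFrame({
    "Title": ["Journal A", "Journal B", "Journal C"],
    "Total Docs. (2000)": [12.0, np.nan, np.nan],
    "Total Docs. (2001)": [np.nan, 30.0, np.nan],
    "Total Docs. (2002)": [np.nan, np.nan, 7.0],
})

year_columns = toy.filter(regex=r"Total Docs\. \(20\d{2}\)").columns

# Back-fill along each row so the first column holds the only non-null value.
toy["Total Docs. per Year"] = toy[year_columns].bfill(axis=1).iloc[:, 0]
toy = toy.drop(columns=year_columns)
print(toy)
```

<p>Back-filling keeps the numeric type throughout, so no conversion back to float is needed, and a row with no value at all stays as NaN instead of raising an error.</p>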

In [None]:
# ============================================
# The columns 'Total Docs.20##' are eliminated
# ============================================
df.drop(list(df.filter(regex  = '20')), inplace = True, axis=1)

In [None]:
# =============================================================================
# Inspect the data type of each column to identify columns read with the wrong type
# =============================================================================
df.dtypes

In [None]:
# ========================================================================================
# Inspect the values of the wrongly typed columns in detail to define a modification scheme
# ========================================================================================

for i in (5, 12, 13):
    print(f"\033[1m {df.columns[i]}:\n\033[0m {list((df[df.columns[i]]))[:30]}\n")

<h3>2.1. Modification scheme</h3>

In [None]:
# ==================================================================================
# The columns that were read as object (decimal commas) are converted to float
# ==================================================================================
df['SJR'] = (df['SJR'].replace(',','.', regex=True).astype(float)).fillna(0)
df['Cites / Doc. (2years)'] = (df['Cites / Doc. (2years)'].replace(',','.', regex=True).astype(float))
df['Ref. / Doc.'] = (df['Ref. / Doc.'].replace(',','.', regex=True).astype(float))
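
<p>The columns are read as text because the files use a comma as decimal mark. The sketch below shows two ways of handling this on an invented sample: letting <code>pd.read_csv</code> parse decimal commas through its <code>decimal</code> parameter, or converting afterwards with <code>pd.to_numeric</code>. The sample data is hypothetical.</p>

```python
import io

import pandas as pd

# Invented sample in the same layout: ';' separates fields, ',' marks decimals.
raw = "Title;SJR;H index\nJournal A;1,234;50\nJournal B;0,456;12\n"

# Option 1: let the parser handle decimal commas while reading.
parsed = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(parsed.dtypes)

# Option 2: convert after reading; errors="coerce" turns bad values into NaN.
as_text = pd.read_csv(io.StringIO(raw), sep=";", dtype=str)
converted = pd.to_numeric(
    as_text["SJR"].str.replace(",", ".", regex=False), errors="coerce"
)
print(converted.tolist())
```

<p>Option 1 removes the need for a later conversion step. Option 2 is closer to the approach used in this notebook and makes malformed values show up as NaN.</p>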

In [None]:
# ============================================================================
# The heatmap is plotted again to verify the result of the treatment applied
# to the columns that had empty values
# ============================================================================
sns.heatmap(df.notnull())

In [None]:
# ===============
# Numeric columns
# ===============
int_df = df.select_dtypes(include=['int64', 'float']).copy()
print(f"[{len(int_df)} rows x {len(int_df.columns)} columns]")

In [None]:
# ===========================================================================
# Summarize the mean, standard deviation, min and max values in the dataframe
# ===========================================================================
int_df = int_df.reset_index(drop=True)
int_df[['SJR', 'H index', 'Total Docs. per Year']].describe().loc[['mean', 'std', 'min', 'max']].applymap(lambda x: f"{x:0.3f}")

In [None]:
# ===================
# Categorical columns
# ===================
obj_df = df.select_dtypes(include=['object']).copy()
print(f"[{len(obj_df)} rows x {len(obj_df.columns)} columns]\n")

#Categorical description
obj_df[['Title', 'Country', 'Region', 'Publisher', 'Categories']].describe().loc[['count', 'unique']]

In [None]:
# =================================
# Categorical columns sets overview
# =================================
for column in obj_df.columns:
    print(f"\033[1m {column}: \n \033[0m {list(set(obj_df[column]))[:10]}\n")

In [None]:
# ==============================================
# Save the consolidated dataframe as a CSV file
# ==============================================
df.to_csv('../dataset/journalAI.csv',sep = ";", index=False)

<h2>Viz Context</h2>

<h2>References</h2>