<a href="https://colab.research.google.com/github/mahesh-tippanu/liberay_bPML/blob/feature%2Fapi-integration/liberay_bPML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data Preprocessing

Load the dataset from an Excel file using pandas.
Handle missing values by filling numerical columns (SNIP, SJR) with the median and categorical columns (Publisher, Main Publisher) with the mode.
Drop columns with excessive missing values (Print ISSN and E-ISSN).
Remove duplicate rows using drop_duplicates().
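
The steps above can be sketched on a small toy frame. The values are made up for illustration; only the column names come from the notebook. This version passes a dict to `DataFrame.fillna` so that several columns are filled in one call.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; column names mirror the notebook.
toy = pd.DataFrame({
    "SNIP": [1.0, np.nan, 3.0, 3.0],
    "Publisher": ["A", None, "A", "A"],
    "Print ISSN": ["1234-5678", None, None, None],
})

# Median for the numerical column, mode for the categorical one.
fill_values = {
    "SNIP": toy["SNIP"].median(),
    "Publisher": toy["Publisher"].mode()[0],
}

cleaned = (
    toy.fillna(fill_values)          # only the columns named in the dict are filled
       .drop(columns=["Print ISSN"])  # drop the sparsely populated column
       .drop_duplicates()             # remove rows that are now identical
)
print(cleaned)
```

Chaining the calls returns a new frame at each step, so nothing depends on `inplace=True`.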

In [6]:
import pandas as pd

file_path = '/content/drive/MyDrive/Machine Learning Project/Ist Quratile journals Scopus (1).xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')

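# Handle missing values. Assign the filled column back rather than calling
# fillna(inplace=True) on df[col]: pandas flags the chained inplace form with
# a FutureWarning and it stops working in pandas 3.0.
df['SNIP'] = df['SNIP'].fillna(df['SNIP'].median())
df['SJR'] = df['SJR'].fillna(df['SJR'].median())
df['Publisher'] = df['Publisher'].fillna(df['Publisher'].mode()[0])
df['Main Publisher'] = df['Main Publisher'].fillna(df['Main Publisher'].mode()[0])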

# Drop columns with excessive missing values
df_cleaned = df.drop(columns=['Print ISSN', 'E-ISSN'])

# Handle duplicate entries
df_cleaned.drop_duplicates(inplace=True)

# Save the cleaned data for the next step
df_cleaned.to_excel('cleaned_journal_data.xlsx', index=False)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['SNIP'].fillna(df['SNIP'].median(), inplace=True)

Feature Engineering

Aggregate the dataset by grouping duplicates based on Scopus Source ID and RANK, taking the mean of the Top 10% (CiteScore Percentile) feature.
Pivot the data to create a user-item matrix, where rows represent journals (based on Scopus Source ID), columns represent RANK, and values represent Top 10% (CiteScore Percentile).
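
As a rough illustration of the aggregate-then-pivot step, here is a sketch on four toy rows. The numbers are invented; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy rows for illustration; the values are made up.
toy = pd.DataFrame({
    "Scopus Source ID": [100, 100, 100, 200],
    "RANK": [1, 1, 2, 1],
    "Top 10% (CiteScore Percentile)": [90.0, 80.0, 70.0, 60.0],
})
value_col = "Top 10% (CiteScore Percentile)"

# Average duplicate (Scopus Source ID, RANK) pairs so pivot() sees unique keys.
aggregated = (
    toy.groupby(["Scopus Source ID", "RANK"])[value_col]
       .mean()
       .reset_index()
)

# Rows: journals, columns: RANK, cells: mean value; missing pairs become 0.
matrix = aggregated.pivot(
    index="Scopus Source ID", columns="RANK", values=value_col
).fillna(0)
print(matrix)
```

Journal 100 appears twice with RANK 1, so its cell holds the mean of 90 and 80; journal 200 has no RANK 2 row, so that cell is filled with 0.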

In [7]:
#pip install numpy pandas scikit-learn seaborn


Accessing Files: Files in your Drive are accessible under /content/drive/MyDrive/.
Adjust the path to match the folder and file structure in your Google Drive.
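
A possible way to mount Drive and check a path before loading it is sketched below. The helper names `mount_drive_if_colab` and `resolve_drive_file` are hypothetical and not part of the notebook; `google.colab.drive.mount` is only importable inside Colab, so the import is guarded.

```python
from pathlib import Path


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only available inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def resolve_drive_file(relative_path, root="/content/drive/MyDrive"):
    """Return the full path of a file under the Drive root, or raise if missing."""
    path = Path(root) / relative_path
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; adjust the folder or file name")
    return path


# Example usage in Colab (folder and file names are placeholders):
# mount_drive_if_colab()
# file_path = resolve_drive_file("my_folder/my_file.xlsx")
```

Failing early with a clear message is usually easier to debug than an error raised from inside `pd.read_excel`.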


In [8]:
import pandas as pd

file_path = '/content/drive/MyDrive/Machine Learning Project/Ist Quratile journals Scopus (1).xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')
print(df.head())


   Scopus Source ID                                     Title  Citation Count  \
0             12091         Atmospheric Chemistry and Physics           34849   
1             12459  Journal of the National Cancer Institute           11530   
2             12459  Journal of the National Cancer Institute           11530   
3             13877   European Journal of Mechanics, A/Solids            6294   
4             13877   European Journal of Mechanics, A/Solids            6294   

   Scholarly Output  Percent Cited  CiteScore   SNIP    SJR  \
0              3251             85       10.7  1.291  2.138   
1               679             88       17.0  2.556  4.986   
2               679             88       17.0  2.556  4.986   
3               899             83        7.0  1.447  0.993   
4               899             83        7.0  1.447  0.993   

   Scopus ASJC Code (Sub-subject Area)      Scopus Sub-Subject Area  ...  \
0                                 1902          Atmospheri

feature/model-training

In [9]:
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Load the cleaned data
df = pd.read_excel('cleaned_journal_data.xlsx')

# Before pivoting, aggregate duplicate entries based on 'Scopus Source ID' and 'RANK'
# Here, we take the mean of 'Top 10% (CiteScore Percentile)' for duplicates
df_aggregated = df.groupby(['Scopus Source ID', 'RANK'])['Top 10% (CiteScore Percentile)'].mean().reset_index()

# Now pivot the aggregated DataFrame
user_item_matrix = df_aggregated.pivot(index='Scopus Source ID', columns='RANK', values='Top 10% (CiteScore Percentile)').fillna(0)

# Create a K-NN model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')

# Fit the model to the user-item matrix
model_knn.fit(user_item_matrix.T)  # Transpose to use item-based K-NN

# Save the model for future use
import joblib
joblib.dump(model_knn, 'knn_model.joblib')

['knn_model.joblib']

feature/model-evaluation


In [10]:

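# Function to get recommendations.
# Note: the training cell above fits model_knn on user_item_matrix.T, so each
# sample is a RANK column. The ID passed in is matched against RANK values
# (the matrix columns), and the results are RANK values as well, not
# Scopus Source IDs.
def get_recommendations(journal_id, n_recommendations=5):
    try:
        # Find the position of the ID among the matrix columns
        journal_index = user_item_matrix.columns.get_loc(journal_id)

        # Find the K nearest neighbors (+1 because the query itself comes back first)
        query = user_item_matrix.T.iloc[journal_index].values.reshape(1, -1)
        distances, indices = model_knn.kneighbors(query, n_neighbors=n_recommendations + 1)

        # Map neighbor positions back to column labels, skipping the query itself
        recommended_journals = [user_item_matrix.columns[i] for i in indices.flatten()[1:]]
        return recommended_journals
    except KeyError:
        return f"Journal ID {journal_id} not found."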


journal_id_to_recommend = 11
recommendations = get_recommendations(journal_id_to_recommend)
print(f"Recommendations for Journal ID {journal_id_to_recommend}:", recommendations)


Recommendations for Journal ID 11: [6, 8, 5, 3, 22]
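
Whether the neighbors are journals or RANK columns depends on the orientation of the matrix the model is fit on. The sketch below uses a made-up 3x3 matrix and a hypothetical helper `nearest_labels` to show both cases; it is an illustration, not the notebook's own function.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy matrix: rows are hypothetical journal IDs, columns are RANK values.
matrix = pd.DataFrame(
    [[1.0, 0.0, 1.0],
     [1.0, 0.0, 0.9],
     [0.0, 1.0, 0.0]],
    index=[101, 102, 103],
    columns=[1, 2, 3],
)


def nearest_labels(frame, label, k=1):
    """Fit cosine k-NN on the rows of `frame`; return the k row labels closest to `label`."""
    knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(frame.values)
    query = frame.loc[[label]].values
    _, idx = knn.kneighbors(query, n_neighbors=k + 1)
    # Drop the query by label rather than by position
    return [frame.index[i] for i in idx.flatten() if frame.index[i] != label][:k]


print(nearest_labels(matrix, 101))   # neighbors among journals (rows)
print(nearest_labels(matrix.T, 1))   # neighbors among RANK columns
```

Fitting on `matrix` compares journals with each other, while fitting on `matrix.T` compares RANK columns, so the same call returns a different kind of label.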


In [11]:
# prompt: print the journal name of the corresponding journal id

# Assuming you have a dictionary that maps journal IDs to journal names
journal_id_to_name = {
    # ... your mapping of journal IDs to names ...
    11: "Journal of Biomedical Informatics",
    12: "Journal of the American Medical Informatics Association",
    # Add more entries here...
}

journal_id = 11  # Replace with the journal ID you want to look up

if journal_id in journal_id_to_name:
    print(f"Journal ID: {journal_id}, Journal Name: {journal_id_to_name[journal_id]}")
else:
    print(f"Journal ID {journal_id} not found in the mapping.")

Journal ID: 11, Journal Name: Journal of Biomedical Informatics
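
Instead of typing the mapping by hand, it could be derived from the data itself, assuming the `Scopus Source ID` and `Title` columns shown in the `df.head()` output are present. The sketch below rebuilds those five sample rows so that it runs on its own.

```python
import pandas as pd

# Sample rows copied from the df.head() output earlier in the notebook.
df = pd.DataFrame({
    "Scopus Source ID": [12091, 12459, 12459, 13877, 13877],
    "Title": [
        "Atmospheric Chemistry and Physics",
        "Journal of the National Cancer Institute",
        "Journal of the National Cancer Institute",
        "European Journal of Mechanics, A/Solids",
        "European Journal of Mechanics, A/Solids",
    ],
})

# Keep one row per journal, then map ID -> title.
journal_id_to_name = (
    df.drop_duplicates(subset="Scopus Source ID")
      .set_index("Scopus Source ID")["Title"]
      .to_dict()
)

print(journal_id_to_name[12459])
```

With the full dataset loaded, the same three chained calls would cover every journal in the file.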


In [12]:
import pandas as pd
import joblib
from sklearn.neighbors import NearestNeighbors


df = pd.read_excel('cleaned_journal_data.xlsx')
df_aggregated = df.groupby(['Scopus Source ID', 'RANK'])['Top 10% (CiteScore Percentile)'].mean().reset_index()

# Now pivot the aggregated DataFrame
user_item_matrix = df_aggregated.pivot(index='Scopus Source ID', columns='RANK', values='Top 10% (CiteScore Percentile)').fillna(0)

# Create and train the KNN model
model_knn = NearestNeighbors(n_neighbors=6, metric='cosine')
model_knn.fit(user_item_matrix)

# Save the model
joblib.dump(model_knn, 'knn_model.joblib')

['knn_model.joblib']

In [34]:
import pandas as pd
import joblib
from sklearn.neighbors import NearestNeighbors


df = pd.read_excel('cleaned_journal_data.xlsx')

# Aggregate the data, averaging 'Top 10% (CiteScore Percentile)' for duplicates
df_aggregated = df.groupby(['Scopus Source ID', 'RANK'])['Top 10% (CiteScore Percentile)'].mean().reset_index()

# Now pivot the aggregated DataFrame
# Using `pivot_table` allows for aggregation to handle potential duplicate values
user_item_matrix = df_aggregated.pivot_table(
    index='Scopus Source ID',
    columns='RANK',
    values='Top 10% (CiteScore Percentile)',
    aggfunc='mean'  # You can change the aggregation function if needed
).fillna(0)

# Create and train the KNN model
model_knn = NearestNeighbors(n_neighbors=6, metric='cosine')
model_knn.fit(user_item_matrix)

# Save the model
joblib.dump(model_knn, 'knn_model.joblib')

['knn_model.joblib']
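
A saved model is only usable with a matrix that has the same column layout as the one it was trained on. The sketch below round-trips a model through joblib and checks the feature count; the helper `model_matches_matrix` and the toy values are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy matrix standing in for user_item_matrix (values made up).
matrix = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    index=[101, 102, 103],
    columns=[1, 2],
)

model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix.values)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "knn_model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)


def model_matches_matrix(fitted_model, frame):
    """Return True if the fitted model expects as many features as `frame` has columns."""
    return fitted_model.n_features_in_ == frame.shape[1]


print(model_matches_matrix(loaded, matrix))    # same orientation as training
print(model_matches_matrix(loaded, matrix.T))  # transposed matrix has a different width
```

A check like this catches the case where a model fit on the transposed matrix is later queried with the untransposed one.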

In [42]:
# prompt: write the api

from flask import Flask, request, jsonify
import joblib
import pandas as pd
from sklearn.neighbors import NearestNeighbors

app = Flask(__name__)

# Load the pre-trained KNN model
model_knn = joblib.load('knn_model.joblib')

# Load the user-item matrix (ensure this matches the matrix used for training)
df = pd.read_excel('cleaned_journal_data.xlsx')
df_aggregated = df.groupby(['Scopus Source ID', 'RANK'])['Top 10% (CiteScore Percentile)'].mean().reset_index()
user_item_matrix = df_aggregated.pivot_table(
    index='Scopus Source ID',
    columns='RANK',
    values='Top 10% (CiteScore Percentile)',
    aggfunc='mean'
).fillna(0)

# Journal ID to name mapping (replace with your actual mapping)
journal_id_to_name = {
    11: "Journal of Biomedical Informatics",
    12: "Journal of the American Medical Informatics Association",
    # Add more entries as needed
}

@app.route('/recommend', methods=['POST'])
def recommend_journals():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    journal_id = data.get('journal_id')

    if journal_id is None:
        return jsonify({'error': 'journal_id is required'}), 400

    try:
        journal_index = user_item_matrix.index.get_loc(journal_id)
        query = user_item_matrix.iloc[journal_index].values.reshape(1, -1)
        distances, indices = model_knn.kneighbors(query, n_neighbors=6)
        # The first neighbor is the query journal itself, so skip it
        recommended_indices = indices.flatten()[1:]
        # Cast numpy integers to int so that jsonify can serialize them
        recommended_journal_ids = [int(user_item_matrix.index[i]) for i in recommended_indices]

        recommendations = []
        # Use a separate loop variable so the requested journal_id is not overwritten
        for rec_id in recommended_journal_ids:
            name = journal_id_to_name.get(rec_id, "Unknown Journal")
            recommendations.append({'journal_id': rec_id, 'journal_name': name})

        return jsonify({'recommendations': recommendations})

    except KeyError:
        return jsonify({'error': f'Journal ID {journal_id} not found.'}), 404
    except IndexError:
        return jsonify({'error': f'Journal ID {journal_id} not found in the dataset.'}), 404

if __name__ == '__main__':
    # Disable the reloader: inside a notebook it restarts the process and
    # the cell ends with SystemExit: 1
    app.run(debug=True, port=5020, use_reloader=False)

 * Serving Flask app '__main__'
 * Debug mode: on
 * Running on http://127.0.0.1:5020
INFO:werkzeug:Press CTRL+C to quit
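
A `/recommend` endpoint can be exercised without starting a server by using Flask's test client. The sketch below is a simplified stand-in, not the notebook's exact handler: the toy matrix, the journal IDs, and the response shape are all assumptions made for illustration.

```python
import pandas as pd
from flask import Flask, jsonify, request
from sklearn.neighbors import NearestNeighbors

# Toy matrix: rows are hypothetical journal IDs, columns are RANK values.
matrix = pd.DataFrame(
    [[1.0, 0.0, 1.0],
     [1.0, 0.0, 0.9],
     [0.0, 1.0, 0.0]],
    index=[101, 102, 103],
    columns=[1, 2, 3],
)
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix.values)

app = Flask(__name__)


@app.route("/recommend", methods=["POST"])
def recommend():
    data = request.get_json(silent=True) or {}
    journal_id = data.get("journal_id")
    if journal_id is None:
        return jsonify({"error": "journal_id is required"}), 400
    if journal_id not in matrix.index:
        return jsonify({"error": f"Journal ID {journal_id} not found."}), 404
    _, idx = knn.kneighbors(matrix.loc[[journal_id]].values, n_neighbors=2)
    # Skip the query itself and cast numpy integers for JSON serialization
    ids = [int(matrix.index[i]) for i in idx.flatten()[1:]]
    return jsonify({"recommendations": ids})


# The test client sends requests straight to the app, no port needed.
client = app.test_client()
ok = client.post("/recommend", json={"journal_id": 101})
missing = client.post("/recommend", json={"journal_id": 999})
empty = client.post("/recommend")

print(ok.status_code, ok.get_json())
print(missing.status_code, missing.get_json())
print(empty.status_code, empty.get_json())
```

Against a running server, the same request would be a POST to `http://127.0.0.1:5020/recommend` with a JSON body such as `{"journal_id": 101}`.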
