# Modeling (Title, Author & Program)

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Import natural language toolkit
import nltk

# Import tokenizer
from nltk.tokenize import RegexpTokenizer

# Import lemmatizer
from nltk.stem import WordNetLemmatizer

# Import regular expression
import re

# Import wordcloud 
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Import Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Import cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Import sparse to make matrix sparse (where most of the elements are zero)
from scipy import sparse

In [2]:
#setting the display options

# None means "do not truncate"; passing -1 has been deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [3]:
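#reading the datafile for Text Preprocessing
#ast.literal_eval parses the stringified lists without executing arbitrary code, unlike eval
import ast

df = pd.read_csv('../data/df.csv', converters={'author': ast.literal_eval, 'program': ast.literal_eval})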
df.head()

Unnamed: 0.1,Unnamed: 0,paper,year,month,title,author,code,program
0,73,74,1975,March,Variation Across Household in the Rate of Inflation,[Robert T Michael],w00074,[Economic Fluctuations and Growth]
1,86,87,1975,May,Exports and Foreign Investment in the Pharmaceutical Industry,"[Merle Yahr Weiss, Robert E Lipsey]",w00087,"[International Trade and Investment, International Finance and Macroeconomics]"
2,106,107,1975,October,Social Security and Retirement Decisions,[Michael J Boskin],w00107,[Public Economics]
3,115,116,1975,November,Notes on the Tax Treatment of Human Capital,[Michael J Boskin],w00116,[Public Economics]
4,116,117,1980,April,Job Mobility and Earnings Growth,[Ann P Bartel],w00117,[Labor Studies]


In [4]:
#dropping the leftover index column 'Unnamed: 0'
df.drop(columns='Unnamed: 0', inplace=True)

In [5]:
df.head()

Unnamed: 0,paper,year,month,title,author,code,program
0,74,1975,March,Variation Across Household in the Rate of Inflation,[Robert T Michael],w00074,[Economic Fluctuations and Growth]
1,87,1975,May,Exports and Foreign Investment in the Pharmaceutical Industry,"[Merle Yahr Weiss, Robert E Lipsey]",w00087,"[International Trade and Investment, International Finance and Macroeconomics]"
2,107,1975,October,Social Security and Retirement Decisions,[Michael J Boskin],w00107,[Public Economics]
3,116,1975,November,Notes on the Tax Treatment of Human Capital,[Michael J Boskin],w00116,[Public Economics]
4,117,1980,April,Job Mobility and Earnings Growth,[Ann P Bartel],w00117,[Labor Studies]


In [6]:
#instantiate tokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [7]:
#tokenizing the title of working papers
df['title'] = df['title'].apply(lambda x: tokenizer.tokenize(x))

In [8]:
#Create stopword list
#add new words to the stopwords
stopwords = set(STOPWORDS)
new_words = ["may", "aren", "couldn", "didn", "doesn", "don", "hadn", "hasn", "haven", "isn", "let",
             "ll", "mustn", "re", "shan", "shouldn", "ve", "wasn", "weren", "won", "wouldn", "t",
             "within", "upon", "greater", "effect", "new", "the"]
stopwords = stopwords.union(new_words)

In [9]:
#instantiate lemmatizer
lemmatizer = WordNetLemmatizer()

In [10]:
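#function to drop stop words and lemmatize the remaining title tokens
#note: the stop word check is case-sensitive, so capitalized tokens such as "The" are kept
def word_lemmatizer(title):
    lem_text = " ".join([lemmatizer.lemmatize(i) for i in title if i not in stopwords])
    return lem_text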

In [11]:
#applying the lemmatizer and checking the title column
df['title'] = df['title'].apply(lambda x: word_lemmatizer(x))
df['title'].head()

0    Variation Across Household Rate Inflation         
1    Exports Foreign Investment Pharmaceutical Industry
2    Social Security Retirement Decisions              
3    Notes Tax Treatment Human Capital                 
4    Job Mobility Earnings Growth                      
Name: title, dtype: object
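The stop word list is lowercase, while the tokenized titles keep their original capitalization, so a capitalized stop word such as "The" passes the filter (it still appears in several titles later in this notebook). The following is a minimal sketch of a case-insensitive filter, not the notebook's own method; `toy_stopwords` and `filter_tokens` are hypothetical names used only for illustration.

```python
# Hypothetical stop word set for illustration; the notebook builds its own
# from wordcloud.STOPWORDS plus a few extra words.
toy_stopwords = {"the", "in", "of", "and"}

def filter_tokens(tokens, stop_set, case_sensitive=True):
    """Drop tokens found in stop_set, optionally ignoring case."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_set]
    return [t for t in tokens if t.lower() not in stop_set]

tokens = ["The", "Marriage", "Market", "and", "Labor", "Supply"]

# Case-sensitive: "The" survives because only "the" is in the set.
kept_sensitive = filter_tokens(tokens, toy_stopwords)

# Case-insensitive: "The" is removed as well.
kept_insensitive = filter_tokens(tokens, toy_stopwords, case_sensitive=False)

print(kept_sensitive)
print(kept_insensitive)
```

Switching the notebook to the case-insensitive variant would change the processed titles, and therefore the titles that `get_recommendations` accepts as input.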

In [12]:
#joining all titles
title_text = " ".join(text for text in df['title'])

In [13]:
#checking the dataframe post tokenization and lemmatization
df.head()

Unnamed: 0,paper,year,month,title,author,code,program
0,74,1975,March,Variation Across Household Rate Inflation,[Robert T Michael],w00074,[Economic Fluctuations and Growth]
1,87,1975,May,Exports Foreign Investment Pharmaceutical Industry,"[Merle Yahr Weiss, Robert E Lipsey]",w00087,"[International Trade and Investment, International Finance and Macroeconomics]"
2,107,1975,October,Social Security Retirement Decisions,[Michael J Boskin],w00107,[Public Economics]
3,116,1975,November,Notes Tax Treatment Human Capital,[Michael J Boskin],w00116,[Public Economics]
4,117,1980,April,Job Mobility Earnings Growth,[Ann P Bartel],w00117,[Labor Studies]


In [14]:
#joining the authors together
df['author'] = df['author'].apply(', '.join)

In [15]:
#checking the tail
df.tail()

Unnamed: 0,paper,year,month,title,author,code,program
20691,21009,2015,March,Reference Points Redistributive Preferences Experimental Evidence,"Ilyana Kuziemko, Jimmy Charite, Raymond Fisman",w21009,"[Public Economics, Political Economy]"
20692,21010,2015,March,TFP News Sentiments The International Transmission Business Cycles,"Andrei A Levchenko, Nitya Pandalai-Nayar",w21010,"[Economic Fluctuations and Growth, International Finance and Macroeconomics]"
20693,21011,2015,March,Poisedness Propagation Organizational Emergence Transformation Civic Order 19th Century New York City,"Victoria Johnson, Walter W Powell",w21011,"[Development of the American Economy, , Productivity, Innovation, and Entrepreneurship]"
20694,21012,2015,March,Preventives Versus Treatments,"Christopher M Snyder, Michael R Kremer",w21012,"[Development Economics, Health Economics, Industrial Organization, Law and Economics, Productivity, Innovation, and Entrepreneurship]"
20695,21013,2015,March,Effects Peer Counseling Support Breastfeeding Assessing External Validity Randomized Field Experiment,"Julie A Reeder, Onur Altindag, Theodore J Joyce",w21013,[Health Economics]


In [16]:
#removing the spaces inside author names so that each author becomes a single token
df['author'] = df['author'].str.replace(' ','')

In [17]:
#checking the dataframe again
df.tail(10)

Unnamed: 0,paper,year,month,title,author,code,program
20686,21004,2015,April,The Marriage Market Labor Supply Education Choice,"CostasMeghir,MonicaCostaDias,Pierre-AndreChiappori",w21004,"[Economics of Education, Labor Studies]"
20687,21005,2015,March,Short term Long term Continuing Contracts,"MaijaHalonen-Akatwijuka,OliverDHart",w21005,"[Corporate Finance, Law and Economics]"
20688,21006,2015,March,Grasp Large Let Go Small The Transformation State Sector China,"Chang-TaiHsieh,Zheng(Michael)Song",w21006,"[Development Economics, Economic Fluctuations and Growth, International Trade and Investment, Productivity, Innovation, and Entrepreneurship]"
20689,21007,2015,March,Regional Redistribution Through U S Mortgage Market,"AmitSeru,BenjaminJKeys,ErikGHurst,JosephSVavra",w21007,"[Corporate Finance, Economic Fluctuations and Growth, Monetary Economics, Political Economy]"
20690,21008,2015,March,Intra Industry Trade Bertrand Cournot Oligopoly The Role Endogenous Horizontal Product Differentiation,"BarbaraJSpencer,JamesABrander",w21008,[International Trade and Investment]
20691,21009,2015,March,Reference Points Redistributive Preferences Experimental Evidence,"IlyanaKuziemko,JimmyCharite,RaymondFisman",w21009,"[Public Economics, Political Economy]"
20692,21010,2015,March,TFP News Sentiments The International Transmission Business Cycles,"AndreiALevchenko,NityaPandalai-Nayar",w21010,"[Economic Fluctuations and Growth, International Finance and Macroeconomics]"
20693,21011,2015,March,Poisedness Propagation Organizational Emergence Transformation Civic Order 19th Century New York City,"VictoriaJohnson,WalterWPowell",w21011,"[Development of the American Economy, , Productivity, Innovation, and Entrepreneurship]"
20694,21012,2015,March,Preventives Versus Treatments,"ChristopherMSnyder,MichaelRKremer",w21012,"[Development Economics, Health Economics, Industrial Organization, Law and Economics, Productivity, Innovation, and Entrepreneurship]"
20695,21013,2015,March,Effects Peer Counseling Support Breastfeeding Assessing External Validity Randomized Field Experiment,"JulieAReeder,OnurAltindag,TheodoreJJoyce",w21013,[Health Economics]


In [18]:
#joining the title and author column together
df['title_author'] = df['title'] + ' ' + df['author']

In [19]:
#checking the dataframe again
df.tail()

Unnamed: 0,paper,year,month,title,author,code,program,title_author
20691,21009,2015,March,Reference Points Redistributive Preferences Experimental Evidence,"IlyanaKuziemko,JimmyCharite,RaymondFisman",w21009,"[Public Economics, Political Economy]","Reference Points Redistributive Preferences Experimental Evidence IlyanaKuziemko,JimmyCharite,RaymondFisman"
20692,21010,2015,March,TFP News Sentiments The International Transmission Business Cycles,"AndreiALevchenko,NityaPandalai-Nayar",w21010,"[Economic Fluctuations and Growth, International Finance and Macroeconomics]","TFP News Sentiments The International Transmission Business Cycles AndreiALevchenko,NityaPandalai-Nayar"
20693,21011,2015,March,Poisedness Propagation Organizational Emergence Transformation Civic Order 19th Century New York City,"VictoriaJohnson,WalterWPowell",w21011,"[Development of the American Economy, , Productivity, Innovation, and Entrepreneurship]","Poisedness Propagation Organizational Emergence Transformation Civic Order 19th Century New York City VictoriaJohnson,WalterWPowell"
20694,21012,2015,March,Preventives Versus Treatments,"ChristopherMSnyder,MichaelRKremer",w21012,"[Development Economics, Health Economics, Industrial Organization, Law and Economics, Productivity, Innovation, and Entrepreneurship]","Preventives Versus Treatments ChristopherMSnyder,MichaelRKremer"
20695,21013,2015,March,Effects Peer Counseling Support Breastfeeding Assessing External Validity Randomized Field Experiment,"JulieAReeder,OnurAltindag,TheodoreJJoyce",w21013,[Health Economics],"Effects Peer Counseling Support Breastfeeding Assessing External Validity Randomized Field Experiment JulieAReeder,OnurAltindag,TheodoreJJoyce"


In [20]:
#replacing the commas in the title_author column with spaces
df['title_author'] = df['title_author'].str.replace(',',' ')

In [21]:
#checking the dataframe
df.tail()

Unnamed: 0,paper,year,month,title,author,code,program,title_author
20691,21009,2015,March,Reference Points Redistributive Preferences Experimental Evidence,"IlyanaKuziemko,JimmyCharite,RaymondFisman",w21009,"[Public Economics, Political Economy]",Reference Points Redistributive Preferences Experimental Evidence IlyanaKuziemko JimmyCharite RaymondFisman
20692,21010,2015,March,TFP News Sentiments The International Transmission Business Cycles,"AndreiALevchenko,NityaPandalai-Nayar",w21010,"[Economic Fluctuations and Growth, International Finance and Macroeconomics]",TFP News Sentiments The International Transmission Business Cycles AndreiALevchenko NityaPandalai-Nayar
20693,21011,2015,March,Poisedness Propagation Organizational Emergence Transformation Civic Order 19th Century New York City,"VictoriaJohnson,WalterWPowell",w21011,"[Development of the American Economy, , Productivity, Innovation, and Entrepreneurship]",Poisedness Propagation Organizational Emergence Transformation Civic Order 19th Century New York City VictoriaJohnson WalterWPowell
20694,21012,2015,March,Preventives Versus Treatments,"ChristopherMSnyder,MichaelRKremer",w21012,"[Development Economics, Health Economics, Industrial Organization, Law and Economics, Productivity, Innovation, and Entrepreneurship]",Preventives Versus Treatments ChristopherMSnyder MichaelRKremer
20695,21013,2015,March,Effects Peer Counseling Support Breastfeeding Assessing External Validity Randomized Field Experiment,"JulieAReeder,OnurAltindag,TheodoreJJoyce",w21013,[Health Economics],Effects Peer Counseling Support Breastfeeding Assessing External Validity Randomized Field Experiment JulieAReeder OnurAltindag TheodoreJJoyce


In [22]:
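#Define a TF-IDF Vectorizer object that uses the custom stop word list and keeps only bigrams and trigrams
#stop_words is passed as a list because recent scikit-learn versions expect a list rather than a set
tfidf = TfidfVectorizer(max_df=0.8, stop_words=list(stopwords), max_features=15000, ngram_range=(2, 3))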

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['title_author'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(20696, 15000)

In [23]:
#instantiate the Count Vectorizer
cvec = CountVectorizer(max_df=0.8, stop_words=list(stopwords), max_features=15000, ngram_range=(2, 3))

#construct the required count matrix by fitting and transforming the data
cvec_matrix = cvec.fit_transform(df['title_author'])

#output the shape of CVEC_matrix
cvec_matrix.shape

(20696, 15000)

### Text Processing for Program

In [24]:
# First let's make a copy of df
df_with_program = df.copy(deep=True)

# Let's iterate through df, then append the programs as columns of 1s or 0s.
# 1 if the working paper at the present index belongs to that program and 0 if not.

x = []
for index, row in df.iterrows():
    x.append(index)
    for program in row['program']:
        df_with_program.at[index, program] = 1

# Confirm that every row has been iterated and acted upon
print(len(x) == len(df))

df_with_program.head(3)

True


Unnamed: 0,paper,year,month,title,author,code,program,title_author,Economic Fluctuations and Growth,International Trade and Investment,International Finance and Macroeconomics,Public Economics,Labor Studies,Health Economics,Monetary Economics,"Productivity, Innovation, and Entrepreneurship",Law and Economics,Children,Corporate Finance,Economics of Aging,Development of the American Economy,Environment and Energy Economics,Industrial Organization,Asset Pricing,Unnamed: 25,Health Care,Economics of Education,Political Economy,Technical Working Papers,Development Economics
0,74,1975,March,Variation Across Household Rate Inflation,RobertTMichael,w00074,[Economic Fluctuations and Growth],Variation Across Household Rate Inflation RobertTMichael,1.0,,,,,,,,,,,,,,,,,,,,,
1,87,1975,May,Exports Foreign Investment Pharmaceutical Industry,"MerleYahrWeiss,RobertELipsey",w00087,"[International Trade and Investment, International Finance and Macroeconomics]",Exports Foreign Investment Pharmaceutical Industry MerleYahrWeiss RobertELipsey,,1.0,1.0,,,,,,,,,,,,,,,,,,,
2,107,1975,October,Social Security Retirement Decisions,MichaelJBoskin,w00107,[Public Economics],Social Security Retirement Decisions MichaelJBoskin,,,,1.0,,,,,,,,,,,,,,,,,,


In [25]:
#Filling in the NaN values with 0 to show that a working paper doesn't belong to that column's program
df_with_program = df_with_program.fillna(0)
df_with_program.head(3)

Unnamed: 0,paper,year,month,title,author,code,program,title_author,Economic Fluctuations and Growth,International Trade and Investment,International Finance and Macroeconomics,Public Economics,Labor Studies,Health Economics,Monetary Economics,"Productivity, Innovation, and Entrepreneurship",Law and Economics,Children,Corporate Finance,Economics of Aging,Development of the American Economy,Environment and Energy Economics,Industrial Organization,Asset Pricing,Unnamed: 25,Health Care,Economics of Education,Political Economy,Technical Working Papers,Development Economics
0,74,1975,March,Variation Across Household Rate Inflation,RobertTMichael,w00074,[Economic Fluctuations and Growth],Variation Across Household Rate Inflation RobertTMichael,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,87,1975,May,Exports Foreign Investment Pharmaceutical Industry,"MerleYahrWeiss,RobertELipsey",w00087,"[International Trade and Investment, International Finance and Macroeconomics]",Exports Foreign Investment Pharmaceutical Industry MerleYahrWeiss RobertELipsey,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,107,1975,October,Social Security Retirement Decisions,MichaelJBoskin,w00107,[Public Economics],Social Security Retirement Decisions MichaelJBoskin,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
df_with_program.reset_index(inplace=True)

In [27]:
df_with_program.columns

Index(['index', 'paper', 'year', 'month', 'title', 'author', 'code', 'program',
       'title_author', 'Economic Fluctuations and Growth',
       'International Trade and Investment',
       'International Finance and Macroeconomics', 'Public Economics',
       'Labor Studies', 'Health Economics', 'Monetary Economics',
       'Productivity, Innovation, and Entrepreneurship', 'Law and Economics',
       'Children', 'Corporate Finance', 'Economics of Aging',
       'Development of the American Economy',
       'Environment and Energy Economics', 'Industrial Organization',
       'Asset Pricing', '', 'Health Care', 'Economics of Education',
       'Political Economy', 'Technical Working Papers',
       'Development Economics'],
      dtype='object')

In [28]:
# Deleting every column that is not a program indicator (including the empty-named '' column and 'index').
df_with_program.drop(['paper','year','month','author','code','program','title_author','title','','index'], axis=1, inplace=True)

# Viewing changes.
df_with_program.head()

Unnamed: 0,Economic Fluctuations and Growth,International Trade and Investment,International Finance and Macroeconomics,Public Economics,Labor Studies,Health Economics,Monetary Economics,"Productivity, Innovation, and Entrepreneurship",Law and Economics,Children,Corporate Finance,Economics of Aging,Development of the American Economy,Environment and Energy Economics,Industrial Organization,Asset Pricing,Health Care,Economics of Education,Political Economy,Technical Working Papers,Development Economics
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
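The row-by-row loop used to build these indicator columns works, but it can be slow on larger frames. One possible alternative, sketched here on a made-up three-row frame (`toy` is hypothetical data, not the notebook's `df`), is scikit-learn's `MultiLabelBinarizer`, which encodes a column of lists in a single call.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data; the last row contains an empty program name,
# similar to the '' entry seen in the real data.
toy = pd.DataFrame({
    "program": [
        ["Economic Fluctuations and Growth"],
        ["International Trade and Investment",
         "International Finance and Macroeconomics"],
        ["Public Economics", ""],
    ]
})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(toy["program"]),  # one 0/1 column per distinct label
    columns=mlb.classes_,
    index=toy.index,
)

# Drop the column created by the empty program name, if present.
encoded = encoded.drop(columns="", errors="ignore")

print(encoded)
```

Two differences from the loop are worth noting: the columns come out in alphabetical order rather than in order of first appearance, and the values are integers rather than floats.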


In [29]:
#prog_matrix holds the values of the program indicator dataframe as a NumPy array
prog_matrix = df_with_program.values

In [30]:
#checking the shape of the program matrix
prog_matrix.shape

(20696, 21)

In [31]:
#checking the shape of tf-idf matrix
tfidf_matrix.shape

(20696, 15000)

In [32]:
#combining the tfidf matrix and program matrix (derived from the df_with_program dataframe)
combined_matrix = np.concatenate((tfidf_matrix.toarray(),prog_matrix),axis=1)

In [33]:
#checking the shape of combined matrix
combined_matrix.shape

(20696, 15021)
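Calling `.toarray()` before concatenating materializes the full dense matrix; at 20696 × 15021 float64 entries that is roughly 2.5 GB. A possible alternative, shown here as a sketch on toy matrices (`toy_text` and `toy_prog` are made-up stand-ins for `tfidf_matrix` and `prog_matrix`), is to stack the blocks with `scipy.sparse.hstack` so the result stays sparse throughout.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins: a sparse "text" block and a dense 0/1 "program" block.
toy_text = sparse.csr_matrix(np.array([[0.6, 0.8, 0.0],
                                       [0.0, 0.0, 1.0]]))
toy_prog = np.array([[1.0, 0.0],
                     [0.0, 1.0]])

# Dense route, as in the notebook: densify first, then concatenate.
combined_dense = np.concatenate((toy_text.toarray(), toy_prog), axis=1)

# Sparse route: convert the small block to sparse and stack column-wise.
combined_sp = sparse.hstack([toy_text, sparse.csr_matrix(toy_prog)], format="csr")

print(combined_sp.shape)
print(np.array_equal(combined_sp.toarray(), combined_dense))
```

Both routes produce the same values; the sparse route only avoids the intermediate dense array.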

In [34]:
#converting the combined matrix to a sparse (CSR) matrix; linear_kernel also accepts dense arrays, but most entries are zero so the sparse form is cheaper to multiply
combined_sparse = sparse.csr_matrix(combined_matrix)

In [35]:
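# Compute the pairwise similarity matrix with a linear kernel (dot product)
# note: this equals cosine similarity only for unit-length rows; the appended program columns change the row norms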
cosine_sim_final = linear_kernel(combined_sparse, combined_sparse)

In [36]:
#checking the shape of cosine_sim_final (computed from the tf-idf and program features)
cosine_sim_final.shape

(20696, 20696)
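A linear kernel is a plain dot product, so it coincides with cosine similarity only when every row has unit L2 norm. `TfidfVectorizer` normalizes its rows by default, but appending 0/1 program columns changes the norms. The toy example below (made-up numbers, hypothetical variable names) shows the two measures agreeing before the columns are appended and diverging afterwards.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Two unit-length "text" rows and their 0/1 "program" indicators.
text = np.array([[0.6, 0.8],
                 [1.0, 0.0]])
progs = np.array([[1.0, 1.0],
                  [1.0, 0.0]])

# On unit-length rows the two measures are identical.
same_before = np.allclose(linear_kernel(text), cosine_similarity(text))

# After appending the program columns the rows are no longer unit length.
combined = np.hstack([text, progs])
lk = linear_kernel(combined)
cs = cosine_similarity(combined)
same_after = np.allclose(lk, cs)

print(same_before, same_after)
print(lk[0, 0], cs[0, 0])  # self-similarity: squared norm vs. exactly 1
```

With the linear kernel, each shared program adds a full 1.0 to the score, whereas the text part of the score is at most 1.0 in total, so program overlap can dominate the ranking. Whether that weighting is desirable is a modeling choice.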

### Recommender System using Tf-idf vectorizer

In [37]:
#Construct a reverse map from working paper titles to row indices, keeping the first row for any repeated title
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

# Function that takes in paper title as input and outputs most similar papers
def get_recommendations(title, cosine_sim_final=cosine_sim_final):
    # Get the index of the wp that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all papers with that paper
    sim_scores_1 = list(enumerate(cosine_sim_final[idx]))

    # Sort the wp based on the similarity scores
    sim_scores_1 = sorted(sim_scores_1, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar wp, skipping the first entry (normally the paper itself)
    sim_scores_1 = sim_scores_1[1:11]

    # Get the wp indices
    wp_indices = [i[0] for i in sim_scores_1]

    # Return the top 10 most similar working papers
    return pd.DataFrame({'year_published': df['year'].iloc[wp_indices],
                         'author': df['author'].iloc[wp_indices],
                         'title': df['title'].iloc[wp_indices],
                        'program': df['program'].iloc[wp_indices]})
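The title lookup returns a single row position only when each title appears once in the index of the lookup Series. The sketch below, on a made-up three-row frame, contrasts `Series.drop_duplicates()`, which removes repeated values, with `Index.duplicated()`, which flags repeated labels; `toy`, `lookup`, `by_value` and `by_label` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame in which one title occurs twice.
toy = pd.DataFrame({"title": ["Oil Dollar", "Job Mobility", "Oil Dollar"]})
lookup = pd.Series(toy.index, index=toy["title"])

# drop_duplicates() compares the values (0, 1, 2), which are all distinct,
# so nothing is removed and the repeated title stays in the index.
by_value = lookup.drop_duplicates()

# Index.duplicated() compares the labels, so the second "Oil Dollar" is dropped.
by_label = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value), len(by_label))
print(type(by_value["Oil Dollar"]).__name__)  # a Series with two matches
print(by_label["Oil Dollar"])                 # a single row position
```

When the lookup yields a Series instead of a scalar, indexing the similarity matrix with it selects several rows at once, and the downstream sorting no longer behaves as intended.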

In [38]:
get_recommendations('Exports Foreign Investment Pharmaceutical Industry', cosine_sim_final=cosine_sim_final)

Unnamed: 0,year_published,author,title,program
6,1976,"MerleYahrWeiss,RobertELipsey",Exports Foreign Investment Manufacturing Industries,"[International Trade and Investment, International Finance and Macroeconomics]"
3369,1991,MagnusBlomstrom,Host Country Benefits Foreign Investment,"[International Trade and Investment, International Finance and Macroeconomics]"
2472,1988,"GuyVGStevens,RobertELipsey",Interactions Domestic Foreign Investment,"[International Trade and Investment, International Finance and Macroeconomics]"
2716,1989,"Jian-YeWang,MagnusBlomstrom",Foreign Investment Technology Transfer A Simple Model,"[International Trade and Investment, International Finance and Macroeconomics]"
564,1981,DavidGHartman,Domestic Tax Policy Foreign Investment Some Evidence,"[International Trade and Investment, Public Economics, International Finance and Macroeconomics]"
951,1983,GeneMGrossman,International Trade Foreign Investment Formation Entrepreneurial Class,"[International Trade and Investment, International Finance and Macroeconomics]"
81,1978,JacobAFrenkel,International Reserves Under Alternative Exchange Rate Regimes Aspects The Economics Managed Float,"[International Trade and Investment, International Finance and Macroeconomics]"
82,1978,"BoyanJovanovic,JacobAFrenkel",On Transactions Precautionary Demand For Money,"[International Trade and Investment, Economic Fluctuations and Growth, International Finance and Macroeconomics]"
83,1978,JacobAFrenkel,Further Evidence On Expectations And The Demand Money During German Hyperinflation,"[International Trade and Investment, International Finance and Macroeconomics]"
84,1978,"JacobAFrenkel,KennethWClements",Exchange Rates The 1920 s A Monetary Approach,"[International Trade and Investment, Economic Fluctuations and Growth, International Finance and Macroeconomics]"


In [39]:
#combining the count vectorizer matrix and program matrix (derived from the df_with_program dataframe)
combined_matrix_1 = np.concatenate((cvec_matrix.toarray(),prog_matrix),axis=1)

In [40]:
#checking the shape
combined_matrix_1.shape

(20696, 15021)

In [41]:
# Compute the cosine similarity matrix
cosine_sim_1 = cosine_similarity(combined_matrix_1, combined_matrix_1)

In [42]:
cosine_sim_1

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.18257419,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.18257419, 1.        ,
        0.18257419],
       [0.        , 0.        , 0.        , ..., 0.        , 0.18257419,
        1.        ]])

In [43]:
#checking the shape of cosine_sim_1 (has to be equal to the total no. of rows in the dataset - square matrix)
#computed from cvec
cosine_sim_1.shape

(20696, 20696)

### Recommender System using Count Vectorizer

In [44]:
#Construct a reverse map from working paper titles to row indices, keeping the first row for any repeated title
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

# Function that takes in paper title as input and outputs most similar papers
def get_recommendations_1(title, cosine_sim_1=cosine_sim_1):
    # Get the index of the wp that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all papers with that paper
    sim_scores_2 = list(enumerate(cosine_sim_1[idx]))

    # Sort the wp based on the similarity scores
    sim_scores_2 = sorted(sim_scores_2, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar wp, skipping the first entry (normally the paper itself)
    sim_scores_2 = sim_scores_2[1:11]

    # Get the wp indices
    wp_indices_1 = [i[0] for i in sim_scores_2]

    # Return the top 10 most similar working papers
    return pd.DataFrame({'year_published': df['year'].iloc[wp_indices_1],
                         'author': df['author'].iloc[wp_indices_1],
                         'title': df['title'].iloc[wp_indices_1],
                        'program': df['program'].iloc[wp_indices_1]})

In [45]:
get_recommendations_1('Exports Foreign Investment Pharmaceutical Industry', cosine_sim_1=cosine_sim_1)

Unnamed: 0,year_published,author,title,program
6,1976,"MerleYahrWeiss,RobertELipsey",Exports Foreign Investment Manufacturing Industries,"[International Trade and Investment, International Finance and Macroeconomics]"
2472,1988,"GuyVGStevens,RobertELipsey",Interactions Domestic Foreign Investment,"[International Trade and Investment, International Finance and Macroeconomics]"
3369,1991,MagnusBlomstrom,Host Country Benefits Foreign Investment,"[International Trade and Investment, International Finance and Macroeconomics]"
256,1980,JohnFOBilson,The Speculative Efficiency Hypothesis,"[International Trade and Investment, International Finance and Macroeconomics]"
323,1980,"ElianaACardoso,RudigerDornbusch",Three Papers Brazilian Trade Payments,"[International Trade and Investment, International Finance and Macroeconomics]"
336,1980,PaulRKrugman,Oil Dollar,"[International Trade and Investment, International Finance and Macroeconomics]"
372,1980,"ClaricePechman,DanielValenteDantas,DemetrioSimoes,RobertodeRezendeRocha,RudigerDornbusch",A Model Black Market Dollars,"[International Trade and Investment, International Finance and Macroeconomics]"
410,1981,MichaelRDarby,The Real Price Oil 1970s World Inflation,"[International Trade and Investment, International Finance and Macroeconomics]"
562,1981,RichardPortes,Central Planning Monetarism Fellow Travelers,"[International Trade and Investment, International Finance and Macroeconomics]"
571,1981,WilliamHBranson,The OPEC Surplus U S LDC Trade,"[International Trade and Investment, International Finance and Macroeconomics]"


## Conclusion and Limitations

The three recommender systems take the title of a working paper as input and return the working papers most similar to it, so users can search by title to find related work. Whether the text is vectorized with the count vectorizer or the tf-idf vectorizer, the recommendations are not significantly different (especially for the recommender systems based on a) title and author, and b) title, author and program). The recommender system is useful to academics, students and others researching any topic in economics. It could also serve as an e-library for finding working papers similar to a working paper of interest. The system is built to streamline the search for relevant literature among the large volume of working papers available on the NBER website.

One limitation of the model is that it is based only on the text of the working paper titles; it includes no information about user preferences for programs or authors. The results could be improved if more information about user profiles were collected and modeled in the recommender system. The recommender system could also be extended to other fields, such as science and the social sciences.