## Modeling: TF-IDF (Term Frequency-Inverse Document Frequency)

**Team: Title Generation**

Karina Huang, Abhimanyu Vasishth, Phoebe Wong

---

In [1]:
#import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

#NLP packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Import project utils
import utils

In [2]:
#check dataset
papers = pd.read_csv('../data/papers.csv')
print('Size of data: ', papers.shape)
papers.head()

Size of data:  (7241, 7)


Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


#### TF-IDF

In [3]:
def fit_tfidf_vectorizer(corpus, min_ngram=1, max_ngram=2, stop_words='english'):
    vectorizer = TfidfVectorizer(ngram_range=(min_ngram,max_ngram), stop_words=stop_words)
    return vectorizer, vectorizer.fit_transform(corpus)

In [4]:
# useful function to get words
def top_k_words(corpus, row, k, feature_array):
    tfidf_sorting = np.argsort(response[row].toarray()).flatten()[::-1]
    return feature_array[tfidf_sorting][:k]

In [5]:
def matrix_decomposition(matrix, components=50, n_iter=7, random_state=42):
    svd = TruncatedSVD(n_components=components, n_iter=n_iter, random_state=random_state)
    transformed_matrix = svd.fit_transform(matrix)
    return svd, transformed_matrix

In [6]:
def find_nearest_neighbor(transformed_response, chosen):
    X = transformed_response[chosen].reshape(-1,1)
    Y = transformed_response.copy()
    
    # setting the same row to 0s temporarily so it doesn't match with itself
    Y[chosen] = np.zeros(X.shape[0])
    
    # computing similarity
    similarity = cosine_similarity(X.T,Y)
    match = np.argmax(similarity)
    return match

In [7]:
def compute_title_score(generated_title, original_title):
    generated_wordlist = generated_title.split()
    original_wordlist = original_title.split()
    
    # identical words
    num_identical = 0
    for word in generated_wordlist:
        if word in original_wordlist:
            num_identical += 1
            
    if num_identical == 0:
        return 0
    
    precision = num_identical/len(generated_title)
    recall = num_identical/len(original_title)
    F1 = 2*precision*recall/(precision + recall)

    return F1

# Open questions
# 1. Include stop words in the set of identical words?
# 2. Does the length of the generated title vs the original title matter?
# 3. Weight important words more?
# 4. Other measures of similarity?

In [8]:
# preprocessing
papers = utils.preprocessing(papers)
papers = papers.dropna(subset=['abstract'])

# creating corpus
corpus = papers['abstract'].values
print('Length of corpus: {}'.format(len(corpus)))

# vectorizing using TF-IDF
vectorizer, response = fit_tfidf_vectorizer(corpus)
print('Shape of TF-IDF matrix: {}'.format(response.shape))

# matrix decomposition
svd, transformed_response = matrix_decomposition(response, components=100)
print('Variance captured by SVD: {}'.format(svd.explained_variance_ratio_.sum()))
print('Shape of transformed matrix: {}'.format(transformed_response.shape))

# creating vector of words
feature_array = np.array(vectorizer.get_feature_names())

Length of corpus: 7167
Shape of TF-IDF matrix: (7167, 5728468)
Variance captured by SVD: 0.10735829082813717
Shape of transformed matrix: (7167, 100)


In [9]:
generated_titles = []
match_indices = []
title_scores = []

# parameters
k = 10

for chosen in range(len(corpus)):
    if chosen % 1000 == 0:
        print('{} of {} done'.format(chosen, len(corpus)))
    
    # finding match using nearest neighbors on TF-IDF matrix 
    match = find_nearest_neighbor(transformed_response, chosen)
    match_indices.append(match)
    
    # generating titles
    original_title = papers['title'].values[chosen]
    generated_title = papers['title'].values[match]
    generated_titles.append(generated_title)
    
    # evaluating generated title
    score = compute_title_score(generated_title, original_title)
    title_scores.append(score)

print('{} of {} done'.format(len(corpus), len(corpus)))
papers['generated_title'] = generated_titles
papers['match_index'] = match_indices
papers['title_score'] = title_scores
papers.head()

0 of 7167 done
1000 of 7167 done
2000 of 7167 done
3000 of 7167 done
4000 of 7167 done
5000 of 7167 done
6000 of 7167 done
7000 of 7167 done
7167 of 7167 done


Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text,generated_title,match_index,title_score
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,An efficient method of self-organizing associa...,767 SELF-ORGANIZATION OF ASSOCIATIVE DATABASE...,Real-time autonomous robot navigation using VL...,3428,0.0
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,A single cell theory for the development of se...,683 A MEAN FIELD THEORY OF LAYER IV OF VISUAL...,Network activity determines spatio-temporal in...,4828,0.0
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,In modeling studies or memory based on neural...,394 STORING COVARIANCE BY THE ASSOCIATIVE LON...,Effective Learning Requires Neuronal Remodelin...,828,0.010929
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,"If data collection is costly, there is much to...",Bayesian Query Construction for Neural Network...,Adaptive Design Optimization in Experiments wi...,2982,0.0
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Learning of continuous valued functions using ...,"Neural Network Ensembles, Cross Validation, an...",Co-Validation: Using Model Disagreement on Unl...,1734,0.0


In [10]:
# printing best title
print(papers.loc[papers['title_score'].idxmax()].title)
print(papers.loc[papers['title_score'].idxmax()].generated_title)

Kernels and learning curves for Gaussian process regression on random graphs
Exact learning curves for Gaussian process regression on large random graphs


We should perhaps be checking for duplicates

---