# Text Retrieval 
This demo addresses the problem of text retrieval using two approaches:
1. Boolean retrieval
2. TF-IDF scores

## 1. Import Packages

In [0]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## 2. Read Dataset 
The dataset is a song database with two columns:
1. Song name (`track_title`)
2. Lyrics of the song (`lyric`)

A typical query supplies a fragment of lyrics, and the goal is to retrieve the name of the matching song.

In [0]:
df = pd.read_csv('taylor_swift_song_lyrics.csv', encoding='mac_roman')
print(df.head())

              track_title                                              lyric
0        ...Ready for It?  Knew he was a killer first time that I saw him...
1                      22  It feels like a perfect night to dress up like...
2  A Perfectly Good Heart  Why would you wanna break a perfectly good hea...
3   A Place In This World  I don't know what I want, so don't ask me Caus...
4            All Too Well  I walked through the door with you, the air wa...


## 3. TF-IDF Search Engine

In [0]:
# Get tf-idf matrix using fit_transform function
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['lyric']) # Store tf-idf representations of all docs
print(X.shape) # (Number of songs, Number of unique words)

(94, 2301)
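To make the weights in `X` less opaque, here is a minimal sketch of the computation on a toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed idf of the form `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document). The three documents and the whitespace tokenisation are assumptions made for illustration only; the real vectorizer tokenises differently.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
docs = [
    "you know you",
    "know me",
    "dancing all night",
]

# Simplifying assumption: split on whitespace instead of using sklearn's tokenizer.
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalised tf-idf weights of one document, in vocab order."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

tfidf = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocab, tfidf[0]):
    print(f"{term:>8}: {weight:.3f}")
```

In the first document, "you" receives a larger weight than "know" because it occurs twice and appears in fewer documents.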


## 4. Query Processing

In [0]:
query = "I don't know about you"
# Project the query into the same tf-idf space as the fitted corpus
query_vec = vectorizer.transform([query])
# Cosine similarity between the query and every song, flattened to shape (n_songs,)
results = cosine_similarity(X, query_vec).ravel()
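As a rough illustration of what the similarity step computes, the sketch below implements cosine similarity on small dense vectors and ranks documents by score. The vectors are hypothetical, and the `cosine` helper is written for this example; it is not scikit-learn's implementation, which works on sparse matrices.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document vectors over a three-term vocabulary.
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 1.0, 0.0]

scores = [cosine(doc, query_vector) for doc in doc_vectors]

# Document indices ordered from most to least similar.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The second document ranks first because it shares both query terms, while the third shares none and scores zero.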

## 5. Print Results

In [0]:
# Print Top 10 results
for i in results.argsort()[-10:][::-1]:
    print(df.iloc[i,0],"--",df.iloc[i,1]+"\n")

Everything Has Changed (Ft. Ed Sheeran) -- All I knew this morning when I woke Is I know something now, know something now I didn't before And all I've seen since 18 hours ago is green eyes and freckles and your smile In the back of my mind making me feel like I just want to know you better Know you better, know you better now I just want to know you better Know you better, know you better now I just want to know you better Know you better, know you better now I just want to know you, know you, know you 'Cause all I know is we said hello And your eyes look like coming home All I know is a simple name, everything has changed All I know is you held the door You'll be mine and I'll be yours All I know since yesterday is everything has changed Taylor Swift And all my walls stood tall painted blue And I'll take them down Take them down and open up the door for you And all I feel in my stomach is butterflies The beautiful kind, making up for lost time Taking flight, making me feel like I

## 6. Boolean Retrieval Model

In [0]:
brm_vectorizer = CountVectorizer(binary=True)
brm_X = brm_vectorizer.fit_transform(df['lyric'])
print(brm_X.shape) # (Number of songs, Number of unique words)

(94, 2301)


In [0]:
word1 = 'dancing'
word2 = 'confused'
# Select each word's column from the binary term-document matrix;
# the printed (row, 0) entries are the songs that contain the word
w1vec = brm_X[:, brm_vectorizer.vocabulary_[word1]]
print(w1vec)
w2vec = brm_X[:, brm_vectorizer.vocabulary_[word2]]
print(w2vec)

  (1, 0)	1
  (4, 0)	1
  (17, 0)	1
  (22, 0)	1
  (26, 0)	1
  (33, 0)	1
  (45, 0)	1
  (53, 0)	1
  (61, 0)	1
  (66, 0)	1
  (91, 0)	1
  (1, 0)	1
  (73, 0)	1
  (76, 0)	1
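Indexing `vocabulary_` directly raises a `KeyError` for any word the vectorizer never saw. With default settings, scikit-learn's text vectorizers lowercase the input and keep only tokens of two or more word characters, so lookup keys need that form. A minimal sketch of that tokenisation using only the standard library (the `tokenize` helper is hypothetical and not part of scikit-learn):

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("I don't know about you"))
```

Single-character tokens such as "I" and the "t" of "don't" are dropped, which is why a query word should be normalised the same way before it is looked up.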


In [0]:
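# AND operation: songs whose lyrics contain both words
w1 = w1vec.toarray().ravel().astype(bool)  # dense boolean vector, one entry per song
w2 = w2vec.toarray().ravel().astype(bool)
for i in np.flatnonzero(w1 & w2):
    print(df.iloc[i,0] + " ---- " + df.iloc[i,1]+"\n")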

22 ---- It feels like a perfect night to dress up like hipsters And make fun of our exes, uh uh, uh uh It feels like a perfect night for breakfast at midnight To fall in love with strangers, uh uh, uh uh Yeah We're happy, free, confused, and lonely at the same time It's miserable and magical, oh yeah Tonight's the night when we forget about the deadlines It's time, uh uh I don't know about you, but I'm feeling 22 Everything will be alright if you keep me next to you You don't know about me, but I bet you wanted to Everything will be alright if we just keep dancing like we're 22 22 It seems like one of those nights This place is too crowded, too many cool kids, uh uh, uh uh (Who's Taylor Swift, anyway? Ew) It seems like one of those nights We ditch the whole scene and end up dreaming Instead of sleeping Yeah We're happy, free, confused, and lonely in the best way It's miserable and magical, oh yeah Tonight's the night when we forget about The heartbreaks, it's time Oh oh I don't know ab

In [0]:
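# OR operation: songs whose lyrics contain at least one of the words
w1 = w1vec.toarray().ravel().astype(bool)  # dense boolean vector, one entry per song
w2 = w2vec.toarray().ravel().astype(bool)
for i in np.flatnonzero(w1 | w2):
    print(df.iloc[i,0] + " ---- " + df.iloc[i,1]+"\n")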

22 ---- It feels like a perfect night to dress up like hipsters And make fun of our exes, uh uh, uh uh It feels like a perfect night for breakfast at midnight To fall in love with strangers, uh uh, uh uh Yeah We're happy, free, confused, and lonely at the same time It's miserable and magical, oh yeah Tonight's the night when we forget about the deadlines It's time, uh uh I don't know about you, but I'm feeling 22 Everything will be alright if you keep me next to you You don't know about me, but I bet you wanted to Everything will be alright if we just keep dancing like we're 22 22 It seems like one of those nights This place is too crowded, too many cool kids, uh uh, uh uh (Who's Taylor Swift, anyway? Ew) It seems like one of those nights We ditch the whole scene and end up dreaming Instead of sleeping Yeah We're happy, free, confused, and lonely in the best way It's miserable and magical, oh yeah Tonight's the night when we forget about The heartbreaks, it's time Oh oh I don't know ab
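The AND and OR loops above scan every song for each query. A common alternative for Boolean retrieval is an inverted index that maps each term to the set of documents containing it, so that AND, OR and NOT become set operations. The following is a minimal sketch under assumed data: the `songs` dictionary and the `build_index` helper are hypothetical and only stand in for the song table.

```python
import re
from collections import defaultdict

# Hypothetical toy collection standing in for the song table.
songs = {
    "Song A": "we keep dancing all night",
    "Song B": "happy free confused and lonely",
    "Song C": "confused but still dancing",
}

def build_index(collection):
    """Map each lowercase token to the set of titles whose text contains it."""
    index = defaultdict(set)
    for title, text in collection.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(title)
    return index

index = build_index(songs)
all_titles = set(songs)

# Use .get so that an unseen word yields an empty result instead of an error.
dancing = index.get("dancing", set())
confused = index.get("confused", set())

print(sorted(dancing & confused))     # AND: both words present
print(sorted(dancing | confused))     # OR: at least one word present
print(sorted(all_titles - confused))  # NOT: 'confused' absent
```

Set intersection, union and difference correspond directly to the Boolean operators, and a word missing from the index simply produces an empty set.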