#AIWIR Assignment: Building Search Engine

##[Music Review Classification](https://www.kaggle.com/datasets/eswarchandt/amazon-music-reviews) is the dataset we have used.

###Importing Dataset from Local Machine to Colab

In [9]:
from google.colab import files
uploaded  = files.upload()

Saving Musical_instruments_reviews.csv to Musical_instruments_reviews (1).csv


###Displaying the Datset

In [38]:
import pandas as pd
import io
 
df = pd.read_csv(io.BytesIO(uploaded['Musical_instruments_reviews.csv']))
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2IBPI20UZIR0U,1384719342,"cassandra tu ""Yeah, well, that's just like, u...","[0, 0]","Not much to write about here, but it does exac...",5.0,good,1393545600,"02 28, 2014"
1,A14VAT5EAX3D9S,1384719342,Jake,"[13, 14]",The product does exactly as it should and is q...,5.0,Jake,1363392000,"03 16, 2013"
2,A195EZSQDW3E21,1384719342,"Rick Bennette ""Rick Bennette""","[1, 1]",The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000,"08 28, 2013"
3,A2C00NNG1ZQQG2,1384719342,"RustyBill ""Sunday Rocker""","[0, 0]",Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000,"02 14, 2014"
4,A94QU4C90B1AX,1384719342,SEAN MASLANKA,"[0, 0]",This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800,"02 21, 2014"


###We will be working on the **reviewText column** of the dataset because it has punctuations, numbers and NAN characters. So, we drop the rest of the columns

In [39]:
df = df['reviewText'].to_frame()
df.head()

Unnamed: 0,reviewText
0,"Not much to write about here, but it does exac..."
1,The product does exactly as it should and is q...
2,The primary job of this device is to block the...
3,Nice windscreen protects my MXL mic and preven...
4,This pop filter is great. It looks and perform...


### Corpus Length

In [34]:
print("Number of documents in the corpus: ", len(df))

Number of documents in the corpus:  10261


## Preprocessing stage

  ### 1. Removing all NAN characters

In [47]:
df['reviewText'] = df.fillna({'reviewText':''})

  ### 2. Removing punctuations

In [48]:
import string
i = 0;
for line in range(len(df['reviewText'])):
    # print(line)
     df['reviewText'][line] = df['reviewText'][line].translate(str.maketrans('', '', string.punctuation))
df.head()

Unnamed: 0,reviewText
0,Not much to write about here but it does exact...
1,The product does exactly as it should and is q...
2,The primary job of this device is to block the...
3,Nice windscreen protects my MXL mic and preven...
4,This pop filter is great It looks and performs...


  ### 3. Removing tags

In [50]:
import re
i = 0;
for line in df['reviewText']:
    df.at[i, 'reviewText'] = re.sub('<[^<]+?>','', line)
    i = i + 1
df.head()

Unnamed: 0,reviewText
0,Not much to write about here but it does exact...
1,The product does exactly as it should and is q...
2,The primary job of this device is to block the...
3,Nice windscreen protects my MXL mic and preven...
4,This pop filter is great It looks and performs...


  ### 4. Removing numbers

In [51]:
i = 0;
for line in df['reviewText']:
    df.at[i, 'reviewText'] = ''.join(c for c in line if not c.isdigit())
    i = i + 1
df.head()

Unnamed: 0,reviewText
0,Not much to write about here but it does exact...
1,The product does exactly as it should and is q...
2,The primary job of this device is to block the...
3,Nice windscreen protects my MXL mic and preven...
4,This pop filter is great It looks and performs...


####5. Stop Words removal and conversion to Lowercase

In [52]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
stopword = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [53]:
i = 0
for line in df['reviewText']:
    df.at[i, 'reviewText'] = ' '.join([i for i in line.lower().split() if i not in stopword])
    i += 1
df.head()

Unnamed: 0,reviewText
0,much write exactly supposed filters pop sounds...
1,product exactly quite affordablei realized dou...
2,primary job device block breath would otherwis...
3,nice windscreen protects mxl mic prevents pops...
4,pop filter great looks performs like studio fi...


  ### 6. Tokenization

In [54]:
i = 0
for line in df['reviewText']:
    df.at[i, 'reviewText'] = nltk.word_tokenize(line)
    i += 1
df.head()

Unnamed: 0,reviewText
0,"[much, write, exactly, supposed, filters, pop,..."
1,"[product, exactly, quite, affordablei, realize..."
2,"[primary, job, device, block, breath, would, o..."
3,"[nice, windscreen, protects, mxl, mic, prevent..."
4,"[pop, filter, great, looks, performs, like, st..."


  ### 7. Lemmatization

In [55]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [56]:
i = 0
for words in df['reviewText']:
    new_words = []
    for word in words:
        new_word = lemmatizer.lemmatize(word)
        new_words.append(new_word)
    df.at[i, 'reviewText'] = new_words
    i += 1
df.head()

Unnamed: 0,reviewText
0,"[much, write, exactly, supposed, filter, pop, ..."
1,"[product, exactly, quite, affordablei, realize..."
2,"[primary, job, device, block, breath, would, o..."
3,"[nice, windscreen, protects, mxl, mic, prevent..."
4,"[pop, filter, great, look, performs, like, stu..."


### 8. Stemming

In [57]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [58]:
i = 0
for words in df['reviewText']:
    new_words = []
    for word in words:
        new_word = stemmer.stem(word)
        new_words.append(new_word)
    df.at[i, 'reviewText'] = new_words
    i += 1
df.head()

Unnamed: 0,reviewText
0,"[much, write, exactli, suppos, filter, pop, so..."
1,"[product, exactli, quit, affordablei, realiz, ..."
2,"[primari, job, devic, block, breath, would, ot..."
3,"[nice, windscreen, protect, mxl, mic, prevent,..."
4,"[pop, filter, great, look, perform, like, stud..."


## Creation of Postings list and Inverted Index

In [59]:
index1 = dict()
i = 0
for words in df['reviewText']:
    for word in words:
        if word not in index1.keys():
            #add word to index along with docID
            #docIDs will be pre-sorted
            index1[word] = []
        elif i not in index1[word]:
            index1[word].append(i)
    i += 1

### 1. Sorting the terms

In [60]:
index = dict()
for i in sorted(index1):
    index[i] = index1[i]

In [61]:
count = 0
for j in index.items():
    print(j)
    count += 1
    ### print only the first 25 terms
    if count == 25:
        break

('aa', [4158, 4369, 6366, 6417, 6917, 6923, 7556, 7839, 8839, 9118, 9344, 10040])
('aaa', [4484, 5033, 5036, 5407, 5984, 6374, 6385, 6540, 7315, 7556, 8497, 8500, 9643])
('ab', [1398, 1994, 3092, 4445, 4872, 4873, 4874, 4875, 5129, 5246, 6192, 7121, 7464, 7493, 7695, 8056, 8059, 8063, 8398, 8532, 8788, 8954, 9049, 9092, 9283, 9886, 10202])
('aback', [])
('abalon', [1695, 1704, 1723, 1731, 3602, 4523, 6085])
('abalonefend', [])
('abandon', [])
('abb', [])
('abcd', [7297])
('abe', [])
('abehring', [8065])
('abelton', [])
('abercrombi', [])
('abhorr', [])
('abi', [])
('abid', [])
('abil', [346, 390, 467, 1177, 1430, 1558, 1563, 1894, 1945, 1958, 2218, 2507, 2524, 2667, 3044, 3233, 3300, 3444, 3613, 3789, 4169, 4420, 4821, 4958, 4982, 5101, 5356, 5420, 5731, 5777, 5778, 6115, 6162, 6169, 6189, 6207, 6576, 6915, 6927, 6998, 7061, 7275, 7298, 7427, 7465, 7472, 7480, 7487, 7526, 7714, 7782, 8000, 8068, 8274, 8311, 8357, 8367, 8407, 8424, 8480, 8487, 8496, 8500, 8644, 8647, 8727, 8777, 8782, 9

## Testing by searching words

### Test 1:Intersection of two search results: 

In [63]:
import timeit
start1 = timeit.default_timer()

d_book = set(index['product'])
d_school = set(index['job'])
d_query = d_book.intersection(d_school)

end1 = timeit.default_timer()

print("The documents containing both product AND job are:\n", sorted(list(d_query)))
print("\nNumber of documents retrieved: ", len(d_query))

The documents containing both product AND job are:
 [56, 152, 222, 228, 235, 337, 341, 401, 485, 544, 729, 1092, 1148, 1605, 1607, 1884, 1899, 2396, 2663, 2873, 2921, 3425, 3442, 3727, 4009, 4170, 4235, 4469, 4523, 4608, 4653, 4957, 5074, 5111, 5210, 5525, 5726, 5948, 6347, 6515, 6534, 6751, 6758, 7014, 7249, 7305, 8238, 8312, 8624, 8643, 8660, 8667, 8668, 9119, 9177, 9192, 9303, 9324, 9335, 9394, 9514, 9540, 9573, 9861, 9887, 9902, 9957]

Number of documents retrieved:  67


###Test 2: Either of the two search results:

In [65]:
start2 = timeit.default_timer()

d_photo = index['product']
d_teacher = index['job']
d_query = sorted(list(set(d_photo) | set(d_teacher)))

end2 = timeit.default_timer()

print("The documents containing product OR job are:\n", d_query)
print("\nNumber of documents retrieved: ", len(d_query))

The documents containing product OR job are:
 [1, 18, 27, 32, 37, 52, 56, 57, 61, 63, 67, 73, 77, 80, 90, 93, 99, 103, 107, 113, 123, 125, 128, 134, 139, 149, 150, 152, 153, 154, 156, 165, 168, 184, 188, 206, 222, 226, 228, 234, 235, 242, 256, 258, 259, 263, 268, 271, 286, 289, 291, 303, 307, 317, 337, 338, 341, 355, 359, 362, 371, 390, 393, 394, 397, 400, 401, 404, 412, 415, 416, 430, 442, 453, 455, 456, 457, 485, 489, 490, 503, 510, 515, 524, 527, 529, 541, 544, 551, 553, 554, 571, 585, 588, 595, 596, 597, 600, 604, 609, 611, 617, 632, 650, 655, 674, 685, 688, 694, 695, 698, 719, 721, 724, 726, 727, 729, 739, 779, 818, 840, 841, 843, 870, 873, 883, 894, 920, 936, 937, 944, 951, 952, 956, 960, 963, 965, 966, 967, 968, 969, 972, 975, 976, 978, 979, 980, 985, 988, 990, 993, 996, 998, 1003, 1007, 1027, 1028, 1040, 1042, 1047, 1049, 1069, 1070, 1086, 1087, 1092, 1094, 1098, 1099, 1100, 1101, 1120, 1122, 1145, 1148, 1180, 1194, 1209, 1212, 1227, 1229, 1231, 1233, 1236, 1237, 1242, 1246, 12

### Execution Time for Test1:

In [66]:
executionTime1 = end1 - start1
print("Retrieval Time for Query 1: ", executionTime1)

Retrieval Time for Query 1:  0.00026699900035964674


### Execution time for Test2:

In [67]:
executionTime2 = end2 - start2
print("Retrieval Time for Query 2: ", executionTime2)

Retrieval Time for Query 2:  0.0006747649999852001
