# [EX1] Data collection

## 1.1 Get the list of animes


### Libraries

In [2]:
# core libraries
import requests
from bs4 import BeautifulSoup

# for better time view 
from tqdm import tqdm

### Creating the anime_url.txt file

We are creating a file called "anime.txt" conteining the url of each anime in the 400 pages (actually 383) of the top anime rank list of the MyAnimeList site. 

* We initialize a filename and then we are doing a for loop over the 400 pages. If the url "exist" so the response == 200 and we can collect the 50 url in that page, otherwise if the response == 400, the page doesn't exist and we exit from the loop, no more interest in continue to loop (no anime url to collect).

* The url scrabbing consist on using BeautifulSoup to inspect the whole html page to find where the url of the anime are. In particolar under the < tr>..< /tr> and < a>..< /a> we can view all the link in the pages. Moreover the anime url are associated with an "id=#areaXXX" class. We get there and gather the right link.
```
<a href="https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood" id="#area5114" rel="#info5114">Fullmetal Alchemist: Brotherhood</a>
```


In [3]:
filename = r"./anime_url.txt"
with open(filename,'w', encoding='utf-8') as f:
    for page in tqdm(range(0, 400)):
        url = "https://myanimelist.net/topanime.php?limit="+str(page*50)
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            for tag in soup.find_all("tr"):
                links = tag.find_all("a")
                for link in links:
                    if type(link.get("id")) == str and len(link.contents[0]) > 1:
                        f.write(link.get("href"))
                        f.write("\n")
        else:
            print("End with page count: ", page)
            break

 96%|████████████████████████████████████████████████████████████████████████████▌   | 383/400 [05:02<00:13,  1.27it/s]

End with page count:  383





## 1.2 Crawl animes

We now need to collect for each anime the html and store it in a .html file. The anime needs to follow the ranking page of the MyAnimeList, i.e 50 anime for page.

For this we begin to create a list called "lines" that store all the link of the anime. We use this method instead of open directly the .txt file in for loop to split the crawl part among the group.

In [4]:
filename = r"./anime_url.txt"

lines = []
with open(filename, "r", encoding='utf-8') as f:
        lines = f.readlines()

In [5]:
# view the first 5 and last 5 anime url collected
lines[:5], lines[-5:]

(['https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood\n',
  'https://myanimelist.net/anime/28977/Gintama°\n',
  'https://myanimelist.net/anime/38524/Shingeki_no_Kyojin_Season_3_Part_2\n',
  'https://myanimelist.net/anime/9253/Steins_Gate\n',
  'https://myanimelist.net/anime/42938/Fruits_Basket__The_Final\n'],
 ['https://myanimelist.net/anime/42383/Konbini_Shoujo_Z\n',
  'https://myanimelist.net/anime/10564/Korogashi_Ryouta\n',
  'https://myanimelist.net/anime/50237/Kyonyuu_Elf_Oyako_Saimin\n',
  'https://myanimelist.net/anime/49876/Mahou_Shoujo_Elena_DVD-BOX_Special\n',
  'https://myanimelist.net/anime/49762/Mama_x_Holic__Miwaku_no_Mama_to_Amaama_Kankei_-_The_Animation\n'])

### Libraries

In [43]:
# core libraries
import requests
from bs4 import BeautifulSoup
from time import sleep
# for time view
from tqdm import tqdm

### The Crawl

We begin creating the pages needed to store the anime. 

The pages from 1 to 383 are stored in the Anime_pages folder.

In [None]:
import os

for page in tqdm(range(1, 384)):
    folder = "page"+str(page)
    path = "./Anime_pages/"+folder
    os.mkdir(path)

Now we create a double loop, over the pages and over the 50 anime, to collect the whole html and save it in the correct pages.

For group division, here the loop goes from page 1 to 130. For have the complete loop we can iter over 0 to 383.

In [None]:
for page in tqdm(range(0, 130)):  
    # positioning in the right folder page
    folder = "/page"+str(page+1)
    for i in range(0,50):   
        page_now = 50*page
        # gather the url
        url = lines[page_now+i]
        response = requests.get(url)
        # name of the file   
        filename = r"./Anime_pages"+folder+"/anime_"+str(page_now+i+1)+".txt"
        with open(filename,'w', encoding='utf-8') as f:
            f.write(response.text)

## 1.3 Parse downloaded pages


### Importing libraries

In [9]:
# core libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
# usefull libraries to manage file in folder 
import os
from natsort import natsorted
# for time view
from tqdm import tqdm

Because there are vary information to collect about an anime, to have a clear notebook we split in functions our parsing in the "functions.py" file.

We divide the crawl in three function (or three part):
* scrabbing_anime1:  Title, Type, number of Episode, Realease / End date
* scrabbing_anime2:  number of Members, Score, Users, Rank, Popularity
* scrabbing_anime3:  Synopsis, anime Related, anime Characters and Voices, anime Staff

Also we have a function called: 
* parse_time:  a function that parse the date collected from the html in a datetime object

The parsing process is this: we create a list called "anime_info" in which we collect all the info about an anime. Than using this list we create a dataframe which columns are coming from the "attrs" list (a list containing the title of each attributes) and save it as a .tsv file.

In [6]:
# importing the functions on functions.py file
import functions

Let's view the process with one anime

In [48]:
# name of the columns 
attrs = ["animeTitle", "animeType", "animeNumEpisode","releaseDate","endDate","animeNumMembers","animeScore","animeUsers","animeRank",
         "animePopularity","animeDescription","animeRelated","animeCharacters","animeVoices","animeStaff"]

# positioning 
html_name = r"./Anime_pages/page1/anime_1.html"

# take the html of the file 
with open(html_name, "r",  encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")

# collecting the information
anime_info = []
anime_info + functions.scrabbing_anime1(soup, anime_info)
anime_info + functions.scrabbing_anime2(soup, anime_info)
anime_info + functions.scrabbing_anime3(soup, anime_info)


# Creating the DataFrame
df = pd.DataFrame([anime_info], columns = attrs)
# change attributes to str
str_cols = ["animeTitle", "animeType", "animeDescription"]
df[str_cols] = df[str_cols].astype("string")
# Creating the tsv file, take the anime number (the id)
name = re.sub(".html","",anime)
df.to_csv("./tsv_anime/"+name+".tsv", index = False, sep = "\t")

In [49]:
df

Unnamed: 0,animeTitle,animeType,animeNumEpisode,releaseDate,endDate,animeNumMembers,animeScore,animeUsers,animeRank,animePopularity,animeDescription,animeRelated,animeCharacters,animeVoices,animeStaff
0,Fullmetal Alchemist: Brotherhood,TV,64,2009-04-05,2010-07-04,2676066,9.16,1622384,1,3,After a horrific alchemy experiment goes wrong...,[Fullmetal Alchemist: Brotherhood - 4-Koma The...,"[Elric, Edward, Elric, Alphonse, Mustang, Roy,...","[Park, RomiJapanese, Kugimiya, RieJapanese, Mi...","[[[Cook, Justin, Producer], [Yonai, Noritomo, ..."


In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   animeTitle        1 non-null      string        
 1   animeType         1 non-null      string        
 2   animeNumEpisode   1 non-null      int64         
 3   releaseDate       1 non-null      datetime64[ns]
 4   endDate           1 non-null      datetime64[ns]
 5   animeNumMembers   1 non-null      int64         
 6   animeScore        1 non-null      float64       
 7   animeUsers        1 non-null      int64         
 8   animeRank         1 non-null      int64         
 9   animePopularity   1 non-null      int64         
 10  animeDescription  1 non-null      string        
 11  animeRelated      1 non-null      object        
 12  animeCharacters   1 non-null      object        
 13  animeVoices       1 non-null      object        
 14  animeStaff        1 non-null  

For all the anime we proceed in this way:

In [None]:
attrs = ["animeTitle", "animeType", "animeNumEpisode","releaseDate","endDate","animeNumMembers","animeScore","animeUsers","animeRank",
         "animePopularity","animeDescription","animeRelated","animeCharacters","animeVoices","animeStaff"]


# iterate from page 1 to 384 
# ( remember the page are from 1 to 383 and range goes from 1 to 384 )
for page in tqdm(range(1,384)):
    # positioning
    folder = "./Anime_pages/page"+str(page)
    # iterate over the "ordered" list of anime
    for anime in natsorted(os.listdir(folder)):
            # open the anime
            with open(folder + "/" + anime, "r",  encoding='utf-8') as fp:
                soup = BeautifulSoup(fp, "html.parser")
            anime_info = []
            anime_info + scrabbing_anime1(soup, anime_info)
            anime_info + scrabbing_anime2(soup, anime_info)
            anime_info + scrabbing_anime3(soup, anime_info)
        
            # Creating the DataFrame
            df = pd.DataFrame([anime_info], columns = attrs)
            # change attributes to str
            str_cols = ["animeTitle", "animeType", "animeDescription"]
            df[str_cols] = df[str_cols].astype("string")
            # Creating the tsv file, take the anime number (the id)
            name = re.sub(".html","",anime)
            df.to_csv("./tsv_anime/"+name+".tsv", index = False, sep = "\t")

# [EX2] Search Engine


## 2.0 Preprocessing the documents

### Let's collect our corpus of documents

First of all lets gather all the documents in one list.
* documents: lists of document. Each document correspond to an anime description.

In [1]:
from tqdm import tqdm
from natsort import natsorted
import os
import pandas as pd

In [2]:
documents = []

# positioning
folder = r"./tsv_anime/"
# iter over the file
for anime in tqdm(natsorted(os.listdir(folder))):
    df = pd.read_csv(folder+anime, sep = "\t")
    # take only the description
    documents.append(df["animeDescription"][0])

100%|███████████████████████████████████████████████████████████████████████████| 19116/19116 [00:50<00:00, 380.37it/s]


In [3]:
# view a document 
documents[1]



### Clean the documents

Now let's cleaning all the documents. This step is colled preprocessing. We follow this order:
- 1) expand contraction type 1 + Normalization (capital lower words)
- 2) splitting number from text (ex 25min) and removing contraction type 2 
- 3) removing punctuation
- 4) removing special characters
- 3) Tokanize. We divide the string in words.
- 4) removing stopwords
- 5) removing some other words or non-text string
- 6) lemmatizing and stemming 

For contraction type 1 and 2 see below

Let's inspect the document to see what we can delete and what no:
- for example: "Philosopher's Stone—a powerful" << This is dash
- but "bio-mechanical engineering" << This is hyphen
- but we also have "15-year" << hyphen <br>

We decide to remove the hypen and the dash

We encounter also some special characters like: … • ♥ →☆‘ . We remove them.

At the end of a lot of anime description there is a "Written MAL Rewrite". We remove that.

When tokanizing there is also the saxon genitive " 's ": we remove that.

Also we encounter a lot of contraction form: using wikipedia https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
we store them in a dictionary and restore the long form.
Using the same idea we have term like: min or sec, so we use a second dictionary to restore the original form.

In [3]:
# core libraries 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re

# functions.py
import functions

In [5]:
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had",
"he'd've": "he would have",
"he'll": "he shall",
"he'll've": "he shall have",
"he's": "he has",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has",
"i'd": "I had",
"i'd've": "I would have",
"i'll": "I shall",
"i'll've": "I shall have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had",
"it'd've": "it would have",
"it'll": "it shall",
"it'll've": "it shall have",
"it's": "it has",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had",
"she'd've": "she would have",
"she'll": "she shall",
"she'll've": "she shall have",
"she's": "she has",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that has",
"there'd": "there had",
"there'd've": "there would have",
"there's": "there has",
"they'd": "they had",
"they'd've": "they would have",
"they'll": "they shall",
"they'll've": "they shall have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall",
"what'll've": "what shall have",
"what're": "what are",
"what's": "what has",
"what've": "what have",
"when's": "when has",
"when've": "when have",
"where'd": "where did",
"where's": "where has",
"where've": "where have",
"who'll": "who shall",
"who'll've": "who shall have",
"who's": "who has",
"who've": "who have",
"why's": "why has",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had",
"you'd've": "you would have",
"you'll": "you shall",
"you'll've": "you shall have",
"you're": "you are",
"you've": "you have"
}

In [6]:
contractions2 = {
"min": "minute",
"sec": "second"
}

In [7]:
def pre_processing(documents):
    stop = stopwords.words("english")
    snowball_stemmer = SnowballStemmer("english")
    lmtzr = WordNetLemmatizer()
    remove = ["Written", "MAL", "Rewrite"]+["'s"]
    
    # removing contraction + Normalization
    document_tmp = functions.replace_words(documents.lower(), contractions)
    # splitting number and text 
    document_tmp = " ".join(re.split(r"([0-9]+)([a-z]+)",document_tmp))
    document_tmp = functions.replace_words(document_tmp, contractions2)
    # removing punctuation
    document_tmp = re.sub(r"[{}\—ー``\"'“''――]".format(string.punctuation)," ",document_tmp)
    # removing special characters
    document_tmp = re.sub(r"[…•♥→☆‘★]"," ",document_tmp)
    # Tokenizing 
    document_tmp = word_tokenize(document_tmp) 
    # removing stopwords
    document_tmp = [ word for word in document_tmp if word not in stop]
    # removing "Written MAL Rewrite" and other stuff
    document_tmp = [ word for word in document_tmp if word not in remove]
    # lemmatize
    document_tmp = [ lmtzr.lemmatize(word) for word in document_tmp]
    # stemming 
    document_tmp = [ snowball_stemmer.stem(word) for word in document_tmp]

    return document_tmp

- documents_clean: is a list of list of the documents cleaned. Each list contain the tokenize cleaning document text.

In [8]:
# cleaning the documents
documents_clean = []
for d in documents:
    documents_clean.append(pre_processing(d))

Let's view how a document is processed:

In [9]:
documents_clean[0][:10]

['horrif',
 'alchemi',
 'experi',
 'go',
 'wrong',
 'elric',
 'household',
 'brother',
 'edward',
 'alphons']

### Creating vocabulary

In [10]:
# core libraries
import itertools
import numpy as np

I will create a list of each unique word among the all documents

In [11]:
# the list of all unique words
word_list = list(set(list(itertools.chain.from_iterable(documents_clean))))

Creating a dictionary that maps each word to an integer: we use the function zip to assign to each word an order integer.

In [12]:
vocabolary = dict(zip(word_list, range(len(word_list))))

Saving the dictionary in a .json file

In [13]:
import json

file = open("vocabolary.json", "w", encoding = "utf-8")
json.dump(vocabolary, file)
file.close()

Import the saved vocabolary.

In [14]:
with open( "vocabolary.json" ) as f:
    vocabolary = json.load( f )

In [15]:
# view the vocabolary
count = 0
for key, mapped_int in vocabolary.items():
    count +=1
    print(key,"-->",mapped_int)
    if count == 10: break

underworld --> 0
niizato --> 1
winri --> 2
itali --> 3
reinforc --> 4
nth --> 5
jeffri --> 6
encor --> 7
nyan --> 8
wecom --> 9


## 2.1. Conjunctive query

For this type of search engine we need only to have a search engine based on the query appear or not in each documents. 


* ### Prepare the mapped document

To do this we will create an array of documents of each len(document_j) in which thare are converted the word into integer based on the vocabolary

We will use numpy array for time optimitation.

In [16]:
def word_to_int(document, vocabolary):
    int_doc = np.zeros(len(document), dtype = np.int64)
    # iterating over the document that has len(d)<<len(vocabolary)
    # change the value of the document, otherwise remain zero
    for i, word in enumerate(document):
        # vocabolary[word] is the mapping function that return an integer i.e the index
        int_doc[i] = vocabolary[word]
        
    return np.sort(int_doc)

* documents_mapped: is a list of list that have the words mapped

In [17]:
documents_mapped = []
for d in documents_clean:
    documents_mapped.append(word_to_int(d,vocabolary))

In [18]:
# view a docoument mapped 
documents_mapped[0]

array([    2,     2,   223,   259,   259,   275,   350,   350,   350,
         350,   807,   879,  1026,  1319,  1504,  1780,  1785,  1891,
        3189,  4042,  4071,  4271,  4430,  4642,  5163,  5296,  5874,
        6255,  6667,  7065,  7065,  7797,  7974,  8053,  8272,  8589,
        8628,  8813,  9466,  9658,  9658,  9658,  9834, 10621, 11227,
       11227, 11553, 12827, 12881, 13535, 13926, 13950, 14612, 14694,
       15308, 15330, 15972, 16074, 16713, 16791, 16895, 18092, 18630,
       18952, 19058, 19241, 19353, 19378, 19637, 19689, 19810, 20020,
       20063, 20465, 21067, 21827, 21923, 22139, 22309, 23280, 23307,
       23844, 23878, 25024, 25158, 25619, 25619, 26391, 28726, 28764,
       28928, 29272, 29323, 29604, 29834, 29933, 30841, 31158, 31158,
       31184, 31217, 31567, 31599, 32901, 32909, 33917, 34129, 34129,
       34129, 34234, 34520, 34569, 34787, 35029], dtype=int64)

* ### Inverted Index v1

In [23]:
from collections import defaultdict  

In [24]:
# initialize the Inverted_index
Inverted_index = defaultdict(list)

To compute the Inverted Index we iterating over each document. Every time we encounter a word (that is now a integer) we insert in the dictionary the id of the documents, which is the row index in documents_mapped or in our dataset of tsv / url.txt.

In [25]:
for i,d in enumerate(documents_mapped):
    for word in set(d):
        Inverted_index[str(word)].append(i)

In [27]:
# let's view the Inverted_index
count = 0
for key, lis in Inverted_index.items():
    count +=1
    print("word: ",key,"-->","documents: ", lis)
    if count == 2: break

word:  2 --> documents:  [0, 525, 4155]
word:  1026 --> documents:  [0, 172, 356, 363, 414, 449, 530, 685, 693, 709, 727, 985, 1060, 1066, 1777, 1811, 1852, 2078, 2393, 2550, 3361, 3802, 3978, 3987, 4009, 4227, 4247, 4486, 4595, 4657, 5291, 5434, 5775, 6291, 6461, 6689, 7701, 7880, 8556, 8583, 8810, 8925, 9306, 9673, 9714, 10054, 10232, 10485, 10709, 11005, 11152, 11220, 11291, 12254, 12989, 14429, 14606, 16502, 17348]


Saving the Inverted Index in memory

In [28]:
import json

file = open("Inverted_index_v1.json", "w", encoding = "utf-8")
json.dump(Inverted_index, file)
file.close()

Import the saved inverted index

In [29]:
with open( "Inverted_index_v1.json" ) as f:
    Inverted_index = json.load( f )

* ### Searching

How it work:

* Given in input a query we first of all preprocessing like a document;
* then using the Inverted index we exctract the list corrisponding to each element (word) of the query, store them in a list of set (for intersection pourpose) called "index". 
* Than we intersect all the set for obtain the documents that match ALL the query elements.
* In the end we create the dataframe with the desire output.

In [30]:
def search_engine_v1(query_text):
    # pre-processing the query
    query_clean = pre_processing(query_text)
    query_int = word_to_int(query_clean, vocabolary)
    
    # for each element of the query we can obtain the list of doc 
    # in which the query element appears
    index = []
    
    for query in query_int:
        # creating a set of index (set is for intersection pourpose)
        index.append(set(Inverted_index[str(query)]))
    
    # intersect and obtain the documents that contain ALL the query
    index = list(index[0].intersection(*index))
    
    # taken the url of the anime list
    with open("./anime_url.txt", "r", encoding = "utf-8") as f:
        lines = f.readlines()
        
    # we are searching for the anime and the url corrisponding to the index
    # we found
    anime_path = []
    url = []

    for idx in index:
        # we need the +1 because we start indexing from 1
        name = "/anime_"+str(idx+1)+".tsv"
        anime_path.append(name)
        url.append(lines[idx])

    # creating the dataframe for view the result
    animes_df = []
    # folder of the anime_tsv
    folder = r"./tsv_anime/"
    # column I want
    cols = ["animeTitle","animeDescription"]
    for i,anime_tsv in enumerate(anime_path):
        df = pd.read_csv(folder+anime_tsv, sep = "\t", usecols = cols)
        # creating new column with the url
        df["animeURL"] = url[i]
        animes_df.append(df)
    
    frame = pd.concat(animes_df, ignore_index = True)
    display(frame)

#### Input and search

In [141]:
# input query 
query_text = input("Insert the query: ")

search_engine_v1(query_text)

Insert the query:  alchemist


Unnamed: 0,animeTitle,animeDescription,animeURL
0,Fullmetal Alchemist: Brotherhood,After a horrific alchemy experiment goes wrong...,https://myanimelist.net/anime/5114/Fullmetal_A...
1,Senki Zesshou Symphogear GX,Following the events of Senki Zesshou Symphoge...,https://myanimelist.net/anime/21573/Senki_Zess...
2,Gosick,Kazuya Kujou is a foreign student at Saint Mar...,https://myanimelist.net/anime/8425/Gosick\n
3,Garo: Vanishing Line,Corruption looms over the prosperous Russell C...,https://myanimelist.net/anime/36144/Garo__Vani...
4,Fullmetal Alchemist,"Edward Elric, a young, brilliant alchemist, ha...",https://myanimelist.net/anime/121/Fullmetal_Al...
5,Fullmetal Alchemist: Brotherhood Specials,Amazing secrets and startling facts are expose...,https://myanimelist.net/anime/6421/Fullmetal_A...
6,Arcana Famiglia: Capriccio - stile Arcana Fami...,"After toiling away in his lab, the alchemist J...",https://myanimelist.net/anime/15411/Arcana_Fam...
7,Ulysses: Jehanne Darc to Renkin no Kishi,"The story is set in the 15th century, during t...",https://myanimelist.net/anime/36510/Ulysses__J...
8,Shinmai Renkinjutsushi no Tenpo Keiei,Shoot for the stars! I'm going to be the count...,https://myanimelist.net/anime/49849/Shinmai_Re...
9,Bungou to Alchemist: Shinpan no Haguruma,Famous writers throughout history find themsel...,https://myanimelist.net/anime/40934/Bungou_to_...


# 2.2 Conjunctive query & Ranking score

Now we want for our Inverted_index two element:

* $\text{tf}_{i,j}$: occurancy of term $j$ in document $i$
* $\text{idf}_{j}$: Inverse Document Frequency of term $j$

Define:

* n_words = total number of words in vocabolary
* n = number of documents

In [34]:
n = len(documents)
n_words = len(vocabolary)

Creating the $\text{tf}_{i,j}$ matrix:

* ### Prepare the mapped document

To do this we will create an array of documents of each len(document_j) in which thare are converted the word into integer based on the vocabolary

We will use numpy array for time optimitation.

In [35]:
# function that map document text to the relative tf

def word_to_int2(document, vocabolary):
    int_doc = np.zeros(len(vocabolary), dtype = np.int64)
    # iterating over the document that has len(d)<<len(vocabolary)
    # change the value of the document, otherwise remain zero
    for word in document:
        # vocabolary[word] is the mapping function that return an integer i.e the index
        int_doc[vocabolary[word]] += 1
        
    return int_doc

* $tf$: is a list of list that have the words mapped with is count

In [36]:
tf = []
for d in documents_clean:
    tf.append(word_to_int2(d,vocabolary))

In [37]:
# view a doc
tf[0][tf[0]>0]

array([2, 1, 2, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1], dtype=int64)

Creating the $\text{idf}_{j}$ array:
* I need the $n_j$: number of documents containing term j

In [38]:
nj = np.zeros(n_words, dtype = np.float64)

# we create a numpy array of length n_words
# for each docoument we take the unique set of word
# and count
for d in documents_mapped:
    for word in set(d):
        nj[word] += 1

In [39]:
# view
nj

array([43.,  1.,  3., ...,  1., 77.,  5.])

In [40]:
# creating idf_j

idf = np.zeros(n_words, dtype = np.float64)
idf = np.log(n/nj) / np.log(n)

$\text{tfIdf}_{ij} = \text{tf}_{ij} * \text{idf}_{j} $

In [41]:
tfIdf = np.multiply(tf,idf)

In [42]:
np.shape(tfIdf)

(19116, 36056)

* ### Inverted index v2

In [43]:
from collections import defaultdict  

In [44]:
Inverted_indexv2 = defaultdict(list)

To compute the Inverted Index we iterating over each document. Every time we encounter a word (that is now a integer) we insert in the dictionary the __tupla__ of _id_ of the documents and the _tfIdf_.

In [45]:
for i,d in enumerate(documents_mapped):
    for word in set(d):
        # append (idd_i, tfidf_ij)
        Inverted_indexv2[str(word)].append((i,tfIdf[i,int(word)]))

In [163]:
# view
count = 0
for key, lis in Inverted_indexv2.items():
    count +=1
    print(key,"-->",lis)
    if count == 1: break

2 --> [[0, 1.7771188926021926], [525, 0.8885594463010963], [4155, 0.8885594463010963]]


Saving the Inverted Index in memory

In [47]:
import json

file = open("Inverted_index_v2.json", "w", encoding = "utf-8")
json.dump(Inverted_indexv2, file)
file.close()

Import the saved inverted index

In [48]:
with open( "Inverted_index_v2.json" ) as f:
    Inverted_indexv2 = json.load( f )

* ### Searching

To better have a use of the Inverted Index + Ranking we suppose that a __query is input as a text of unique word__.

We define than the cosine similarity as:
$$
\begin{equation}
\cos(q,d^i) =  \frac{\sum_{j: \hspace{0.1cm}  q_j=1} \text{tfIdf}_{ij}}{||d^i||*||q||}
\end{equation}
$$

Under this assumption we observe that: $||q|| = \sqrt(len(q)$

Recall: $d^i = [\text{tfIdf}_{i1}, \text{tfIdf}_{i2}, \ldots]$ 

In [49]:
# libraries to compute the manage the heap
import heapq

How is it work:

* we take a query text in input and preprocessing like a normal document. Than we obtain the "query_int" array of 0 and 1: q[j] = 1 if word_j in q, 0 otherwise;
* then using the Inverted index we exctract the list of tuple corrisponding to each element (word) of the query. We store also the len of each match list and the maximum list (we need it to manage the pointer of different length list);
* We now have a function "intersection_all" that return two things: a boolean variable, enough that is True if I have at least k document that match the all query and match list, that return the list of the index that match all the query if true or the original match list if False;
* Than the algorithm goes in two part:<br>
$\hspace{1cm}$  1. if True (I have at least k document) we compute the score of element in the list and return the top k<br>
$\hspace{1cm}$  2. if False, we compute the score af all the document and return the score ordered of the matches
* In the end: we create the dataframe of the documents ranked.

In [153]:
def search_engine2(query, k, tfIdf):
    # preprocessing the query
    query_clean = pre_processing(query)
    query_int = word_to_int2(query_clean, vocabolary)

    # initialize a match_list which store the list of match
    match_list = []
    # list of the length of the match 
    lenMatch = []
    # tupla = (#list, len(list))
    max_lenMatch = (0,-1)
    for i,query in enumerate(np.where(query_int>0)[0]):
        lis = Inverted_indexv2[str(query)]
        match_list.append(lis)
        tmplis, tmplen = i, len(lis)
        if tmplen>max_lenMatch[1]:
            max_lenMatch = (tmplis,tmplen)
        lenMatch.append(len(lis))
    
    enough, match_list = functions.intersection_all(k, match_list)
    m = len(query_clean)
    if enough:
        scores = functions.scoresK(match_list, tfIdf, m, query_int)
        topscore, topk = functions.find_topK(k, scores)
    else:
        scores = functions.scoresALL(match_list, tfIdf, m, lenMatch)
        k = len(scores)
        topscore, topk = functions.find_topK(k, scores)
            
    # search the url
    with open("./anime_url.txt", "r", encoding = "utf-8") as f:
        lines = f.readlines()
        
    # we are searching for the anime and the url 
    anime_path = []
    url = []

    for idx in topk:
        # we need the +1 because we start indexing from 1
        name = "/anime_"+str(idx+1)+".tsv"
        anime_path.append(name)
        url.append(lines[idx])

    # creating the dataframe for view the result
    animes_df = []
    # folder of the anime_tsv
    folder = r"./tsv_anime/"
    # column I want
    cols = ["animeTitle","animeDescription"]
    for i,anime_tsv in enumerate(anime_path):
        df = pd.read_csv(folder+anime_tsv, sep = "\t", usecols = cols)
        # creating new column
        df["animeURL"] = url[i]
        df["animeScores"] = topscore[i]
        animes_df.append(df)
    
    frame = pd.concat(animes_df, ignore_index = True)
    display(frame)
    
    return

### Input and searching

In [161]:
query = input("Insert the query: ")
k = int(input("Insert k: "))

search_engine2(query, k, tfIdf)

Insert the query:  alchemist alchemy
Insert k:  5


Unnamed: 0,animeTitle,animeDescription,animeURL,animeScores
0,Fullmetal Alchemist,"Edward Elric, a young, brilliant alchemist, ha...",https://myanimelist.net/anime/121/Fullmetal_Al...,0.399861
1,Fullmetal Alchemist: Brotherhood Specials,Amazing secrets and startling facts are expose...,https://myanimelist.net/anime/6421/Fullmetal_A...,0.268511
2,Ulysses: Jehanne Darc to Renkin no Kishi,"The story is set in the 15th century, during t...",https://myanimelist.net/anime/36510/Ulysses__J...,0.258751
3,Fullmetal Alchemist: Brotherhood,After a horrific alchemy experiment goes wrong...,https://myanimelist.net/anime/5114/Fullmetal_A...,0.208161
4,Baccano!,"During the early 1930s in Chicago, the transco...",https://myanimelist.net/anime/2251/Baccano\n,0.189374
