# **5. BONUS: More complex search engine**

We built this part in a separate notebook because the more complex search engine required us to redefine our approach: the user must be able to combine multiple search criteria.

### Objective

- For this part of the homework, the objective was to implement a search engine based on multiple criteria. The user can choose among 5 options, and option 1 additionally offers 3 suboptions. The options are:
1. Specify queries for three features: 'courseName', 'universityName', 'universityCity'.
2. Specify a range for the fees.
3. Specify a list of countries.
4. Filter based on the courses that have already started.
5. Filter based on the presence of online modality.

### Our implementation:
- The function asks the user to enter a number that represents one of the available options. The user can apply multiple search criteria in sequence, and entering 0 stops the search. Within option 1 the user can likewise filter on multiple queries; entering 0 there leaves option 1 and returns to the main menu, where the other options can be selected.
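
The menu flow described above can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: `run_menu`, `handlers` and the toy criteria are hypothetical names, and a list of integers stands in for the rows of the DataFrame. Passing a scripted `read` function instead of `input` makes the flow reproducible.

```python
def run_menu(rows, handlers, read=input):
    """Apply the chosen handlers in sequence until the user enters '0'.

    `handlers` maps a menu key to a function that takes the current rows
    and returns the filtered rows (hypothetical structure for illustration).
    """
    while True:
        choice = read("Choose an option: ").strip()
        if choice == '0':
            break
        handler = handlers.get(choice)
        if handler is None:
            # Unknown keys are ignored and the menu is shown again
            print(f"Unknown option: {choice!r}")
            continue
        # Each criterion is applied to the output of the previous one
        rows = handler(rows)
    return rows


# Two toy criteria standing in for the real filters
handlers = {
    '1': lambda rows: [r for r in rows if r % 2 == 0],  # keep even ids
    '2': lambda rows: [r for r in rows if r >= 4],      # keep ids >= 4
}

# Scripted answers replace keyboard input so the session can be replayed
answers = iter(['1', '7', '2', '0'])
final_rows = run_menu(list(range(10)), handlers, read=lambda prompt: next(answers))
print(final_rows)  # [4, 6, 8]
```

Because every handler receives the rows kept by the previous one, the criteria are combined as an intersection regardless of the order in which they are chosen.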


In [182]:
# Import Libraries
import nltk
import pandas as pd
import numpy as np
import warnings
import string
import collections
import pickle
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Genny\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Code Overview

### Processing Text 
- The `process_text` function tokenizes, removes stopwords, punctuation, and performs stemming on input text. We used it to stem our columns.

In [183]:
# Function to process and stem a text
def process_text(text):
    # Check for NaN values
    if pd.isna(text):
        return ""
    
    # Tokenize the text
    tokens = tokenizer.tokenize(text.lower())

    # Remove stopwords
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Remove punctuation
    filtered_tokens = [token for token in filtered_tokens if token not in string.punctuation]

    # Stemming
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    # Join the tokens back into a string
    processed_text = ' '.join(stemmed_tokens)

    return processed_text

### Search Functions
- **`search_city`**: Filters the DataFrame based on the specified city.
- **`option1`**: Allows the user to search for queries in 'courseName', 'universityName', and 'universityCity'. Implements inverted indexing and conjunction of query terms.
- **`option2`**: Filters courses based on a specified fee range.
- **`option3`**: Filters courses based on a list of specified countries.
- **`option4`**: Identifies and filters courses that have already started.
- **`option5`**: Filters courses based on the availability of online learning.


In [184]:
# Filter based on city (required for the suboption 3 of option 1)
def search_city(city_query,curr):
    
    result_df = curr[curr['city'].str.lower() == city_query.lower()]
    
    return result_df

### Option1 overview:
- With this implementation, the user can apply any combination of the suboptions ('courseName', 'universityName', 'city'), from none to all of them. The function runs a `while True` loop until 0 is entered. The boolean variable `start` distinguishes the first filtering step from all later ones. On the first step, option1 uses `inverted_index_course_name`, `vocabulary_course_name`, `inverted_index_university_name` and `vocabulary_university_name`, which are built on the full DataFrame. On later steps the search criteria must be intersected, so these dictionaries are rebuilt from the DataFrame returned by the previous iteration.

- For example: 
1. The first input query 'data science' for 'courseName' in option1 returns an output DataFrame (df_first_query). 
2. The second input query 'Leeds' for 'universityName' in option1 is then evaluated on df_first_query, so it returns only the rows of df_first_query that also satisfy the second search criterion.
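
The two-step example above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the rows are invented, `matching_positions` is a hypothetical helper, and stemming and stopword removal are left out. It shows why the second query has to be evaluated on the output of the first: the positions it returns refer to the filtered list, not to the full one.

```python
def matching_positions(query, texts):
    """Return the positions of the texts that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    return [i for i, text in enumerate(texts)
            if all(term in text.lower().split() for term in terms)]


# Toy rows (invented for illustration, not taken from the dataset)
rows = [
    {"courseName": "data science msc", "universityName": "university of leeds"},
    {"courseName": "data science msc", "universityName": "university of york"},
    {"courseName": "applied physics msc", "universityName": "university of leeds"},
]

# Step 1: query on 'courseName' over the full list
first = [rows[i] for i in matching_positions("data science", [r["courseName"] for r in rows])]

# Step 2: query on 'universityName' over the rows kept by step 1,
# so the returned positions index into `first`, not into `rows`
second = [first[i] for i in matching_positions("Leeds", [r["universityName"] for r in first])]

print(len(first), len(second))  # 2 1
print(second[0]["universityName"])  # university of leeds
```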

In [185]:
def option1(working_curr, start):
    
    while True:
        
        print("\nSelect the feature you prefer:")
        print("1. Search for 'courseName' ")
        print("2. Search for 'universityName' ")
        print("3. Search for 'universityCity' ")
        print("0. Digit 0 to end your search")
        
        choice = input("Choose a suboption: ")
        
        # To stop 
        if choice == '0':
            break
        
        elif choice == '1':
            
            # On the first filtering step, use the index and vocabulary built on the full DataFrame
            if start == True:
                inverted_index_course_name_, vocabulary_course_name_ = inverted_index_course_name, vocabulary_course_name
                
            # Otherwise, rebuild the index and vocabulary from the DataFrame left by the previous step
            else:
                inverted_index_course_name_, vocabulary_course_name_, _, _ = retrieve_df(working_curr)
            
            # Process the query like the indexed text: lower-case, remove stopwords, stem
            query1 = input("Put a query for 'courseName': ")
            query1 = tokenizer.tokenize(query1.lower())
            query1 = [stemmer.stem(word) for word in query1 if word not in lst_stopwords]
            
            # An empty query or an unknown term cannot match: keep the current results and ask again
            # (checking first also avoids a KeyError when the conjunctive list is initialized)
            if not query1 or any(term not in vocabulary_course_name_ for term in query1):
                print("Not all terms are in the course's descriptions")
                continue
            
            # The same approach we used for the first search engine (Q2)
            conjunctive_list_course_name = set(inverted_index_course_name_[vocabulary_course_name_[query1[0]]])  # initialize the conjunctive query list
            for term in query1:
                term_id = vocabulary_course_name_[term]
                term_list = inverted_index_course_name_[term_id]
                conjunctive_list_course_name = conjunctive_list_course_name.intersection(term_list)
                
            if start == False:
                working_curr = working_curr.iloc[list(conjunctive_list_course_name)]
            else:
                working_curr = original_df.iloc[list(conjunctive_list_course_name)]
                start = False
        
        
        elif choice == '2':
            
            if start == True:
                inverted_index_university_name_, vocabulary_university_name_ = inverted_index_university_name, vocabulary_university_name
                
            # Otherwise, rebuild the index and vocabulary from the DataFrame left by the previous step
            else:
                _, _, inverted_index_university_name_, vocabulary_university_name_ = retrieve_df(working_curr)
            
            # Process the query like the indexed text: lower-case, remove stopwords, stem
            query2 = input("Put a query for 'universityName': ")
            query2 = tokenizer.tokenize(query2.lower())
            query2 = [stemmer.stem(word) for word in query2 if word not in lst_stopwords]
            
            # An empty query or an unknown term cannot match: keep the current results and ask again
            # (checking first also avoids a KeyError when the conjunctive list is initialized)
            if not query2 or any(term not in vocabulary_university_name_ for term in query2):
                print("Not all terms are in the course's descriptions")
                continue
            
            # The same approach we used for the first search engine (Q2)
            conjunctive_list_university_name = set(inverted_index_university_name_[vocabulary_university_name_[query2[0]]])  # initialize the conjunctive query list
            for term in query2:
                term_id = vocabulary_university_name_[term]
                term_list = inverted_index_university_name_[term_id]
                conjunctive_list_university_name = conjunctive_list_university_name.intersection(term_list)
                
            if start == False:
                working_curr = working_curr.iloc[list(conjunctive_list_university_name)]
            else:
                working_curr = original_df.iloc[list(conjunctive_list_university_name)]
                start = False
            
            
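        elif choice == '3':
            query3 = input("Put a query for 'city': ")
            working_curr = search_city(query3,working_curr)
            # The results are now filtered, so later queries must not restart from the full DataFrame
            start = False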
            
    return working_curr

# Filter based on fees
def option2(min_fee, max_fee,curr):
    
    # Drop the rows with a missing fee (without modifying the caller's DataFrame)
    curr = curr.dropna(subset=['fees'])
    
    # Return the rows whose fee lies in the range [min_fee, max_fee], bounds included
    final_df = curr[(curr['fees'] >= min_fee) & (curr['fees'] <= max_fee)] 
    
    return final_df

#Filter based on country
def option3(countries,curr):
    
    # Return the rows whose country is in the given list (case-insensitive)
    final_df = curr[curr['country'].str.lower().isin([country.lower() for country in countries])]
    
    return final_df

#Filter based on courses already started
def option4(curr):
    # Work on a copy so the caller's DataFrame is not modified
    curr = curr.copy()
    
    # Mark missing or uninformative start dates as 'NO INFO'
    curr['startDate'] = curr['startDate'].replace(np.nan,'NO INFO')
    curr['startDate'] = curr['startDate'].replace('See Course','NO INFO')
    
    # Removing NO INFO rows
    curr = curr[curr['startDate'] != 'NO INFO'].copy()
    
    # Split the 'startDate' column into a list of months in a new column 'startMonths'
    curr['startMonths'] = curr['startDate'].apply(lambda x: x.split(', '))
    
    # Flag the courses that start in 'August', 'September', 'October' or 'November'
    started_months = ['August', 'September', 'October', 'November']
    curr['Already Started'] = curr['startMonths'].apply(
        lambda x: 1 if any(month in started_months for month in x) else 0
    )
    
    # Return courses already started
    final_df = curr[curr['Already Started'] == 1]
    
    # Compute the percentage (0 if no course has a known start date)
    percentage = len(final_df)/len(curr) if len(curr) > 0 else 0
    
    print(f"The percentage of courses already started based on your research is: {round(percentage * 100,2)}%")
    
    return final_df


# Filter based on online courses
def option5(curr):
    # Work on a copy so the caller's DataFrame is not modified
    curr = curr.copy()
    
    curr['administration'] = curr['administration'].replace(np.nan,'')
    
    # Rows without administration info are not relevant for this filter
    curr = curr[curr['administration'] != ''].copy()
    
    curr['administration'] = curr['administration'].apply(lambda x: x.split(', '))
    
    # Detect courses that offer an online option
    curr['IsItOnline'] = curr['administration'].apply(lambda x: 1 if 'Online' in x else 0)
    
    # Return online courses
    final_df = curr[curr['IsItOnline'] == 1]
    
    # Compute the percentage (0 if no course has administration info)
    percentage = len(final_df)/len(curr) if len(curr) > 0 else 0
    
    print(f"The percentage of online courses is: {round(percentage * 100,2)}%")
    
    return final_df
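
option4 hard-codes the months that count as "already started". As a hedged alternative, the list could be derived from a reference date. The sketch below is an assumption-laden illustration, not the notebook's method: `months_already_started` and `has_started` are hypothetical helpers, the intake period is assumed to open in August (mirroring the hard-coded list), and the date used in the example is invented.

```python
from datetime import date

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']


def months_already_started(today, first_month=8):
    """Return the month names from `first_month` up to the month of `today`.

    first_month=8 (August) is an assumption that mirrors the list used in
    option4; it is not derived from the dataset.
    """
    if today.month >= first_month:
        numbers = list(range(first_month, today.month + 1))
    else:
        # The intake period wraps around the end of the calendar year
        numbers = list(range(first_month, 13)) + list(range(1, today.month + 1))
    return [MONTHS[n - 1] for n in numbers]


def has_started(start_date, started):
    """True if any month in a 'September, January'-style string is in `started`."""
    return any(month.strip() in started for month in start_date.split(','))


# Hypothetical reference date chosen so the result matches the hard-coded list
started = months_already_started(date(2023, 11, 15))
print(started)  # ['August', 'September', 'October', 'November']
print(has_started('September, January', started))  # True
print(has_started('February', started))  # False
```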

In [186]:
# Body of our complex search engine
def complex_search_engine(df):
    
    # Create a copy of the DataFrame to work with
    working_df = df.copy()
    
    # To take into account the first request 
    start = True
    
    # Iterate until the user stops the search by entering 0.
    while True:
        print("\nSearch engine options:")
        print("1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'")
        print("2. Search for fees range")
        print("3. Search for country")
        print("4. Search for courses already started")
        print("5. Search for opportunity of online learning") 
        print("0. Put 0 to end your search")
        
        choice = input("Choose an option: ")
        
        # To stop 
        if choice == '0':
            break
        
        elif choice == '1':
            working_df = option1(working_df, start)
            if start == True:
                start = False
            
        elif choice == '2':
            min_fee = float(input("Min fee: "))
            max_fee = float(input("Max fee: "))
            working_df = option2(min_fee, max_fee, working_df)
            if start == True:
                start = False
                
        # Countries are separated by '-', so names containing spaces can be searched (e.g. United Kingdom)
        elif choice == '3':
            countries_query = input("Enter countries separated by -: ")
            # strip() tolerates spaces around the separator, e.g. "usa - france"
            countries = [country.strip().lower() for country in countries_query.split('-')]
            working_df = option3(countries, working_df)
            if start == True:
                start = False
        
        elif choice == '4':
            working_df = option4(working_df)
            if start == True:
                start = False
    
        elif choice == '5':
            working_df = option5(working_df)
            if start == True:
                start = False
    
    return working_df


# To re-obtain inverted_index_course_name, vocabulary_course_name, inverted_index_university_name, vocabulary_university_name based on the current dataframe
def retrieve_df(df):

    df_administration_restored = pd.read_csv('df_DEFINITIVO.tsv', sep='\t')
    df_process_fees = pd.read_csv('merged_courses_newfees.tsv', sep='\t')

    df['fees'] = df_process_fees['fees'].copy()
    df['administration'] = df_administration_restored['administration'].copy()

    # Apply the processing function to the 'courseName' column
    df['courseName'] = df['courseName'].apply(process_text)

    # Apply the processing function to the 'universityName' column
    df['universityName'] = df['universityName'].apply(process_text)

    # We don't need to stem city because it contains a single word.

    # Our documents of interest for 'courseName'
    course_name = df['courseName']

    # Our documents of interest for 'universityName'
    university_name = df['universityName']



    # Create vocabulary and inverted_index for the column 'courseName'
    # Tokenized 'courseName'
    tokenized_course_name = [text.split() if isinstance(text, str) else [] for text in course_name]

    # Vocabulary for the column 'courseName'
    vocabulary_course_name = {word: i for i, word in
                              enumerate(set(word for text in tokenized_course_name for word in text))}

    # Initialize an inverted index for 'courseName'
    inverted_index_course_name = collections.defaultdict(list)

    # Nested loop to find documents which contain every term_id
    for doc_id, text in enumerate(tokenized_course_name):
        for word in set(text):
            term_id = vocabulary_course_name[word]
            inverted_index_course_name[term_id].append(doc_id)

    # Create vocabulary and inverted_index for the column 'universityName'
    # Tokenized 'universityName'
    tokenized_university_name = [text.split() if isinstance(text, str) else [] for text in university_name]

    # Vocabulary for the column 'universityName'
    vocabulary_university_name = {word: i for i, word in
                                  enumerate(set(word for text in tokenized_university_name for word in text))}

    # Initialize an inverted index for the column 'universityName'
    inverted_index_university_name = collections.defaultdict(list)

    # Nested loop to find documents which contained that term_id
    for doc_id, text in enumerate(tokenized_university_name):
        for word in set(text):
            term_id = vocabulary_university_name[word]
            inverted_index_university_name[term_id].append(doc_id)

    return inverted_index_course_name, vocabulary_course_name, inverted_index_university_name, vocabulary_university_name

# Main body: here we define the global variables
if __name__ == "__main__":
    stemmer = PorterStemmer()
    tokenizer = RegexpTokenizer(r'\w+')
    stop_words = set(stopwords.words('english'))
    lst_stopwords = stopwords.words('english')
    
    #We noticed that administration included NaN so we restored this column
    df_administration_restored = pd.read_csv('df_DEFINITIVO.tsv', sep='\t')
    df = pd.read_csv('df.tsv', sep='\t')
    df_process_fees = pd.read_csv('merged_courses_newfees.tsv', sep='\t')

    df['fees'] = df_process_fees['fees'].copy()
    df['administration'] = df_administration_restored['administration'].copy()
    
    # Now df is ready for our complex search engine, 'fees' column has been processed and 'administration' has been restored
    
    original_df = df.copy()

    # Apply the processing function to the 'courseName' column
    df['courseName'] = df['courseName'].apply(process_text)

    # Apply the processing function to the 'universityName' column
    df['universityName'] = df['universityName'].apply(process_text)

    # We don't need to stem city because it contains a single word.

    # Our documents of interest for 'courseName'
    course_name = df['courseName']

    # Our documents of interest for 'universityName'
    university_name = df['universityName']

    # Our documents of interest for 'city'
    cities = df['city']

    # Create vocabulary and inverted_index for the column 'courseName'
    # Tokenized 'courseName'
    tokenized_course_name = [text.split() if isinstance(text, str) else [] for text in course_name]

    # Vocabulary for the column 'courseName'
    vocabulary_course_name = {word: i for i, word in
                              enumerate(set(word for text in tokenized_course_name for word in text))}

    # Initialize an inverted index for 'courseName'
    inverted_index_course_name = collections.defaultdict(list)

    # Nested loop to find documents which contain every term_id
    for doc_id, text in enumerate(tokenized_course_name):
        for word in set(text):
            term_id = vocabulary_course_name[word]
            inverted_index_course_name[term_id].append(doc_id)

    # Create vocabulary and inverted_index for the column 'universityName'
    # Tokenized 'universityName'
    tokenized_university_name = [text.split() if isinstance(text, str) else [] for text in university_name]

    # Vocabulary for the column 'universityName'
    vocabulary_university_name = {word: i for i, word in
                                  enumerate(set(word for text in tokenized_university_name for word in text))}

    # Initialize an inverted index for the column 'universityName'
    inverted_index_university_name = collections.defaultdict(list)

    # Nested loop to find documents which contained that term_id
    for doc_id, text in enumerate(tokenized_university_name):
        for word in set(text):
            term_id = vocabulary_university_name[word]
            inverted_index_university_name[term_id].append(doc_id)
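
The vocabulary and inverted-index construction above appears in both `retrieve_df` and the main block, once per column. A possible refactoring is sketched below; `build_inverted_index` is a hypothetical helper that is not part of the notebook, and the sample names are invented. It keeps the same data structures (a word to term_id vocabulary and a term_id to document positions index), with one deliberate difference: the words are sorted so that the term_id assignment is reproducible.

```python
import collections


def build_inverted_index(processed_texts):
    """Return (inverted_index, vocabulary) for an iterable of processed strings.

    vocabulary maps each word to an integer term_id; inverted_index maps each
    term_id to the list of document positions that contain the word.
    """
    # Non-string entries (e.g. missing values) are treated as empty documents
    tokenized = [text.split() if isinstance(text, str) else [] for text in processed_texts]

    # sorted() makes the term_id assignment reproducible across runs
    vocabulary = {word: i for i, word in
                  enumerate(sorted(set(word for text in tokenized for word in text)))}

    inverted_index = collections.defaultdict(list)
    for doc_id, text in enumerate(tokenized):
        for word in set(text):
            inverted_index[vocabulary[word]].append(doc_id)

    return inverted_index, vocabulary


# Toy processed names (invented for illustration); None stands for a missing value
names = ["data scienc msc", "appli data analyt", None]
inverted_index, vocabulary = build_inverted_index(names)
print(vocabulary)
print(inverted_index[vocabulary["data"]])  # [0, 1]
```

With such a helper, each of the four constructions would reduce to a single call, for example `build_inverted_index(df['courseName'])`.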

**Example:**

In [188]:
your_result = complex_search_engine(df)


Search engine options:
1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'
2. Search for fees range
3. Search for country
4. Search for courses already started
5. Search for opportunity of online learning
0. Put 0 to end your search


Choose an option:  1



Select the feature you prefer:
1. Search for 'courseName' 
2. Search for 'universityName' 
3. Search for 'universityCity' 
0. Digit 0 to end your search


Choose a suboption:  1
Put a query for 'courseName':  data science



Select the feature you prefer:
1. Search for 'courseName' 
2. Search for 'universityName' 
3. Search for 'universityCity' 
0. Digit 0 to end your search


Choose a suboption:  2
Put a query for 'universityName':  leeds



Select the feature you prefer:
1. Search for 'courseName' 
2. Search for 'universityName' 
3. Search for 'universityCity' 
0. Digit 0 to end your search


Choose a suboption:  0



Search engine options:
1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'
2. Search for fees range
3. Search for country
4. Search for courses already started
5. Search for opportunity of online learning
0. Put 0 to end your search


Choose an option:  4


The percentage of courses already started based on your research is: 87.5%

Search engine options:
1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'
2. Search for fees range
3. Search for country
4. Search for courses already started
5. Search for opportunity of online learning
0. Put 0 to end your search


Choose an option:  0


In [189]:
your_result

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,courseDescription,startDate,fees,modality,duration,city,country,administration,url,startMonths,Already Started
3585,health informat data scienc msc,univers leed,School of Medicine,Full time,Demand for professionals qualified in health i...,September,,MSc,"1 year full time, 3 years part time",Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,[September],1
35,environment data scienc analyt msc,univers leed,School of Geography,Full time,As global discussions are increasingly focused...,September,37645.86,MSc,1 year full time,Leeds,United Kingdom,,https://www.findamasters.com/masters-degrees/c...,[September],1
5689,advanc comput scienc data analyt msc,univers leed,School of Computing,Full time,"From science to marketing, engineering to medi...",September,17262.0,MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,[September],1
3779,precis medicin genom data scienc msc,univers leed,School of Biomedical Sciences,Full time,The rapid transformation of healthcare through...,September,,MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,[September],1
1827,data scienc analyt msc,univers leed,School of Mathematics,Full time,We’re surrounded by data. The variety and amou...,September,,MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,[September],1
1832,data scienc artifici intellig,leed triniti univers,School of Computer Science,Full time,Are you interested in understanding your right...,September,,MSc,Full-time (1 year) Part-time (2 years),Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,[September],1
4322,urban data scienc analyt msc,univers leed,School of Geography,Full time,Urban data science and analytics is critical t...,September,,MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,[September],1


This was the output for:
- Option1: 'courseName' = 'data science', 'universityName' = 'leeds'
- Option4: Courses already started

As the printed percentage shows, 87.5% of the courses returned by option1 that have a known start date have already started: the 7 rows above out of 8.

**Another example:**

In [192]:
your_result1 = complex_search_engine(df)


Search engine options:
1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'
2. Search for fees range
3. Search for country
4. Search for courses already started
5. Search for opportunity of online learning
0. Put 0 to end your search


Choose an option:  1



Select the feature you prefer:
1. Search for 'courseName' 
2. Search for 'universityName' 
3. Search for 'universityCity' 
0. Digit 0 to end your search


Choose a suboption:  1
Put a query for 'courseName':  applied



Select the feature you prefer:
1. Search for 'courseName' 
2. Search for 'universityName' 
3. Search for 'universityCity' 
0. Digit 0 to end your search


Choose a suboption:  2
Put a query for 'universityName':  university



Select the feature you prefer:
1. Search for 'courseName' 
2. Search for 'universityName' 
3. Search for 'universityCity' 
0. Digit 0 to end your search


Choose a suboption:  0



Search engine options:
1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'
2. Search for fees range
3. Search for country
4. Search for courses already started
5. Search for opportunity of online learning
0. Put 0 to end your search


Choose an option:  3
Enter countries separated by -:  usa-france-united kingdom-italy



Search engine options:
1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'
2. Search for fees range
3. Search for country
4. Search for courses already started
5. Search for opportunity of online learning
0. Put 0 to end your search


Choose an option:  2
Min fee:  5000
Max fee:  20000



Search engine options:
1. Search for a query in columns: 'courseName', 'universityName', 'universityCity'
2. Search for fees range
3. Search for country
4. Search for courses already started
5. Search for opportunity of online learning
0. Put 0 to end your search


Choose an option:  0


In [193]:
your_result1

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,courseDescription,startDate,fees,modality,duration,city,country,administration,url
49,appli analyt chemistri msc,univers colleg london,Department of Chemistry,Full time,Analytical chemistry underpins many important ...,September,17262.0,MSc,1 year full time,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
75,appli clinic psycholog msc,univers central lancashir,School of Psychology and Computer Science,Full time,The MSc in Applied Clinical Psychology has bee...,September,9610.41,MSc,"1 year full time, 2 years part time",Preston,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
76,appli clinic research,st cloud state univers,Postgraduate Programs,Full time,As an Applied Clinical Research student you wi...,See Course,17498.8,MSc,Full-time: 2 years (with internship); part-tim...,St Cloud,USA,On Campus,https://www.findamasters.com/masters-degrees/c...
85,appli comput msc,univers sunderland,Faculty of Technology,Full time,"Do you want to work in the IT sector, but don’...",September,17882.7,MSc,2 years full time,Sunderland,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
98,appli data scienc model msc,univers exet,Mathematics and Statistics,Full time,"The concurrent crises of health, climate and e...",September,9603.0,MSc,1 year full time,Exeter,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
117,appli environment geolog msc,cardiff univers,Cardiff School of Earth and Environmental Scie...,Full time,Our vocationally focused MSc degree in Applied...,September,14490.29,MSc,1 year full time,Cardiff,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
129,appli gender studi msc,univers strathclyd,School of Humanities,Full time,If you wish to pursue a career in the charitab...,September,10283.75,MSc,12 months full-time; 24 months part-time,Glasgow,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
142,appli instrument control msc,glasgow caledonian univers,School of Engineering and Built Environment,Full time,We worked with industry professionals to devel...,"September, January",12854.68,MSc,1 year full-time; 2-5 years distance learning,Glasgow,United Kingdom,Online,https://www.findamasters.com/masters-degrees/c...
162,appli microbiolog biotechnolog profession prac...,univers wolverhampton,School of Sciences,Full time,This postgraduate course provides an understan...,September,17262.0,MSc,2 years full time,Wolverhampton,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
184,appli psycholog healthcar children young peopl...,univers edinburgh,School of Health in Social Science,Full time,"This programme, developed in partnership with ...",February,11752.85,MSc,1 year full-time,Edinburgh,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...


This was the output for:
- Option1: 'courseName' = 'applied', 'universityName' = 'university'
- Option3: Countries = 'usa-france-united kingdom-italy'
- Option2: Min fee = 5000, Max fee = 20000