# HW1 Solutions


## Import libraries


##### `nltk`

- **Description**: The `Natural Language Toolkit`. I use it for tokenization, stop-word removal, stemming, and lemmatization.

##### `glob`

- **Description**: Used only to list the document file names in the `./docs` directory.

##### `string`

- **Description**: I use this package's `punctuation` attribute to strip punctuation characters.


In [1]:
import nltk
import glob
import string
# nltk.download('stopwords')
# nltk.download('wordnet')

## Load Documents


The `LoadDocuments` class is responsible for loading and managing a collection of documents. To load documents from a different location, change the `path` variable.

##### Properties

- **_`DOC_ID_MAPPER`_** _(class variable)_: A dictionary mapping document names to their respective IDs. Be careful: the IDs depend on the order in which your documents are listed, and they start from zero.

##### Methods

##### `__init__(self)`

- **Description**: Initializes a `LoadDocuments` object with a default path, which you can change to match your docs directory, and an empty `documentCollection` dictionary.

##### `Load(self)`

- **Description**: Loads text documents from the specified path and populates the `documentCollection` property.

##### `buildDocumentIDMapper(self)`

- **Description**: Builds a document ID mapper by associating each document's name with a unique ID and populating the `DOC_ID_MAPPER` dictionary. It also prints the mapping so that search results can be checked against it.


In [2]:
class LoadDocuments:

    DOC_ID_MAPPER = {}

    def __init__(self) -> None:
        self.path = "./docs/*.txt"
        self.documentCollection = {}

    def Load(self)->None:
        documents = glob.glob(self.path)
        for documentName in documents :
            # Use a context manager so each file is closed after it is read
            with open(documentName, "r", encoding = 'cp1252') as document:
                self.documentCollection[documentName] = document.read()

    def buildDocumentIDMapper(self)->None:
        files = glob.glob(self.path)
        for docId, documentName in enumerate(files):
            self.DOC_ID_MAPPER[documentName] = docId
            # [7:] strips the leading "./docs/" prefix from the printed name
            print(docId , ' -> ', documentName[7:])
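
Because the IDs depend on the order in which `glob.glob` returns the files, and that order is not guaranteed to be the same on every system, one possible way to make the mapping reproducible is to sort the paths first. The sketch below only illustrates that idea: `build_doc_id_mapper` is a hypothetical helper, not part of the class above, and sorting would assign different IDs from the ones printed later in this notebook.

```python
import glob
import os
import tempfile

def build_doc_id_mapper(pattern: str) -> dict:
    """Map each matching file path to an ID, in sorted (deterministic) order."""
    return {path: doc_id for doc_id, path in enumerate(sorted(glob.glob(pattern)))}

# Demo on a throwaway directory with three empty .txt files (hypothetical names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt", "c.txt"):
        open(os.path.join(tmp, name), "w").close()
    mapper = build_doc_id_mapper(os.path.join(tmp, "*.txt"))
    # Invert the mapping so it reads like the "id -> name" listing printed above
    names_by_id = {doc_id: os.path.basename(path) for path, doc_id in mapper.items()}
    print(names_by_id)
```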

## Document Preprocessing


The `DocumentPreProcessor` class extends the functionality of the `LoadDocuments` class by providing methods to preprocess text documents. These methods perform various text processing tasks, such as converting text to lowercase, tokenization, removing punctuations, removing stop words, stemming, and lemmatization.

#### Methods

##### `__init__(self)`

- **Description**: Initializes a `DocumentPreProcessor` object by calling the constructor of the base class `LoadDocuments`.

##### `convertToLower(self)`

- **Description**: Converts all text in the loaded documents to lowercase.

##### `tokenizer(self)`

- **Description**: Tokenizes the text in the loaded documents using a regular expression pattern.

##### `removePunctuations(self)`

- **Description**: Removes punctuation characters from the tokenized text in the loaded documents.

##### `removeStopWords(self)`

- **Description**: Removes stop words from the tokenized text in the loaded documents using NLTK's English stop words list.

##### `stemming(self)`

- **Description**: Applies stemming to the tokenized text in the loaded documents using the Porter Stemmer algorithm from NLTK.

##### `lemmatizer(self)`

- **Description**: Applies lemmatization to the tokenized text in the loaded documents using NLTK's WordNet Lemmatizer.


In [3]:
class DocumentPreProcessor(LoadDocuments):
    def __init__(self)->None:
        super().__init__()

    def convertToLower(self)->None:
        for documentName, document in self.documentCollection.items():
            self.documentCollection[documentName] = document.lower()

    def tokenizer(self)->None:
        pattern = r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'
        for documentName, document in self.documentCollection.items():
            self.documentCollection[documentName] = nltk.tokenize.regexp_tokenize(document, pattern)

    def removePunctuations(self)->None:
        for documentName, document in self.documentCollection.items():
            for ind, term in enumerate(document):
                self.documentCollection[documentName][ind] = "".join([i for i in term if i not in string.punctuation])

    def removeStopWords(self)->None:
        stopwords = nltk.corpus.stopwords.words('english')
        for documentName, document in self.documentCollection.items():
            self.documentCollection[documentName] = [i for i in document if i not in stopwords]

    def stemming(self)->None:
        porter_stemmer = nltk.stem.porter.PorterStemmer()
        for documentName, document in self.documentCollection.items():
            self.documentCollection[documentName] = [porter_stemmer.stem(term) for term in document]

    def lemmatizer(self)->None:
        wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()
        for documentName, document in self.documentCollection.items():
            self.documentCollection[documentName] = [wordnet_lemmatizer.lemmatize(term) for term in document]
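
To see what the tokenizer pattern and the punctuation step do to a piece of text, the following sketch applies the same regular expression with the standard `re` module instead of NLTK. It assumes that `nltk.tokenize.regexp_tokenize` returns the same matches as `re.findall` for a pattern without capturing groups; `tokenize` and `strip_punctuation` are hypothetical helper names, and the sample sentence is made up.

```python
import re
import string

# Same pattern as in tokenizer() above
PATTERN = r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'

def tokenize(text: str) -> list:
    """Lowercase the text and return every non-overlapping match of PATTERN."""
    return re.findall(PATTERN, text.lower())

def strip_punctuation(tokens: list) -> list:
    """Drop punctuation characters inside each token, as removePunctuations() does."""
    return ["".join(ch for ch in tok if ch not in string.punctuation) for tok in tokens]

tokens = tokenize("Gas cost 1,234.50 dollars in 2023")
cleaned = strip_punctuation(tokens)
print(tokens)   # ['gas', 'cost', '1,234.50', 'dollars', 'in', '202', '3']
print(cleaned)  # ['gas', 'cost', '123450', 'dollars', 'in', '202', '3']
```

Note that with `re.findall` a four-digit number written without a thousands separator is split into two tokens, because the first alternative of the pattern accepts at most three leading digits.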

## Inverted Index


The `InvertedIndex` class generates an inverted index that maps terms to the documents that contain them. It stores every position of each term so that proximity search is possible.

#### Properties

- **`invertedIndex`**: A dictionary representing the inverted index where terms are mapped to document IDs and their positions.

#### Methods

##### `__init__(self)`

- **Description**: Initializes an `InvertedIndex` object and calls the base class constructor.

##### `buildInvertedIndex(self)`

- **Description**: Builds the inverted index by iterating through preprocessed documents and mapping terms to document IDs and their positions.

##### `build(self) -> None`

- **Description**: Initiates the preprocessing steps by loading documents, building document ID mapper, converting to lowercase, tokenizing, removing punctuations, removing stop words, stemming, lemmatizing, and finally, building the inverted index.


In [4]:
class InvertedIndex(DocumentPreProcessor):
    def __init__(self) -> None:
        super().__init__()
        self.invertedIndex = {}    

    def buildInvertedIndex(self)->None:
        for documentName, document in self.documentCollection.items():
            docId = self.DOC_ID_MAPPER[documentName]
            for ind, term in enumerate(document):
                # term -> {docId -> [positions]}; missing levels are created on demand
                self.invertedIndex.setdefault(term, {}).setdefault(docId, []).append(ind)

    def build(self)->None:
        self.Load()
        self.buildDocumentIDMapper()
        self.convertToLower()
        self.tokenizer()
        self.removePunctuations()
        self.removeStopWords()
        self.stemming()
        self.lemmatizer()
        self.buildInvertedIndex()      
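
As a small illustration of the resulting structure, the sketch below builds the same kind of positional index, term -> {docId -> [positions]}, from two tiny documents that are assumed to be preprocessed already. The function name `build_positional_index` and the sample tokens are hypothetical.

```python
def build_positional_index(docs: dict) -> dict:
    """Build term -> {doc_id -> [positions]} from already-tokenized documents."""
    index = {}
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

# Two tiny, already-preprocessed documents (made-up data)
docs = {
    0: ["price", "gasolin", "hit", "record", "price"],
    1: ["hous", "price", "crazi"],
}
index = build_positional_index(docs)
print(index["price"])  # {0: [0, 4], 1: [1]}
```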

## Query Processing


The `QueryPreProcessing` class provides methods to preprocess search queries. It mirrors document preprocessing so that query terms match the indexed terms, but it does not remove stop words.

#### Properties

- **`query`**: The query string received from the user, split into terms.
- **`queryType`**: A string indicating the type of query (`AND`, `OR`, `NOT`, or `NEAR`). The type is recorded first, and the operator is then removed from the query terms.
- **`maxDistance`**: An integer representing the maximum distance for `NEAR` queries.

#### Methods

##### `__init__(self, query:str)`

- **Description**: Initializes a `QueryPreProcessing` object with the provided query string.

##### `determineQueryType(self)`

- **Description**: Determines the query type based on the query string and sets the `queryType` and `maxDistance` properties.

##### `convertToLower(self)`

- **Description**: Converts all query terms to lowercase.

##### `removePunctuations(self)`

- **Description**: Removes punctuation characters from query terms.

##### `removeQueryEXP(self)`

- **Description**: Removes specific query expressions (`AND`, `OR`, `NOT`, and `NEAR`) from the query.

##### `stemming(self)`

- **Description**: Stems query terms using the Porter Stemmer algorithm.

##### `lemmatizer(self)`

- **Description**: Lemmatizes query terms using the WordNet Lemmatizer.

##### `execute(self)`

- **Description**: Runs the entire preprocessing pipeline to produce a clean query.


In [5]:
class QueryPreProcessing:

    ALL_QUERY_TYPES = [
        "and",
        "or",
        "not",
        "near"
    ]

    def __init__(self, query:str) -> None:
        self.query = query.split(" ")
        self.queryType = ""
        self.maxDistance = 0

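    def determineQueryType(self)->None:
        # Supported forms: "a AND b", "a OR b", "NOT a", "a NEAR/k b"
        if self.query[0] == "NOT":
            self.queryType = "NOT"
        elif len(self.query) > 1 and self.query[1] == "AND":
            self.queryType = "AND"
        elif len(self.query) > 1 and self.query[1] == "OR":
            self.queryType = "OR"
        elif len(self.query) > 1 and self.query[1].startswith("NEAR/"):
            self.queryType = "NEAR"
            self.maxDistance = int(self.query[1].split("/")[1])
        else:
            # Fail with a clear message instead of an IndexError on short queries
            raise ValueError("Unsupported query: expected AND, OR, NOT, or NEAR/k")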

    def convertToLower(self)->None:
        for ind, term in enumerate(self.query):
            self.query[ind] = term.lower()

    def removePunctuations(self) ->None:
        for ind, term in enumerate(self.query):
            self.query[ind] ="".join([i for i in term if i not in string.punctuation])

    def removeQueryEXP(self)->None:
        # Terms are already lowercased and stripped of punctuation here,
        # so an operator such as "NEAR/3" has become "near3"
        self.query = [term for term in self.query
                      if term not in self.ALL_QUERY_TYPES
                      and not (term.startswith("near") and term[4:].isdigit())]

    def stemming(self)->None:
        porter_stemmer = nltk.stem.porter.PorterStemmer()
        self.query = [porter_stemmer.stem(term) for term in self.query]

    def lemmatizer(self)->None:
        wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()
        self.query = [wordnet_lemmatizer.lemmatize(term) for term in self.query]  

    def execute(self)->None:
        self.determineQueryType()
        self.convertToLower()
        self.removePunctuations()  
        self.removeQueryEXP()
        self.stemming()
        self.lemmatizer()       
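
The four query forms handled above can also be parsed in a single step. The sketch below is a simplified, standalone illustration that skips stemming and lemmatization; `parse_query` is a hypothetical function, and it assumes queries have exactly the shapes `a AND b`, `a OR b`, `NOT a`, and `a NEAR/k b`.

```python
def parse_query(raw: str) -> tuple:
    """Return (query_type, terms, max_distance) for the four supported query forms."""
    parts = raw.split()
    if len(parts) == 2 and parts[0] == "NOT":
        return "NOT", [parts[1].lower()], 0
    if len(parts) == 3:
        left, op, right = parts
        terms = [left.lower(), right.lower()]
        if op in ("AND", "OR"):
            return op, terms, 0
        # "NEAR/3" carries the maximum distance after the slash
        if op.startswith("NEAR/") and op[5:].isdigit():
            return "NEAR", terms, int(op[5:])
    raise ValueError(f"Unsupported query: {raw!r}")

print(parse_query("decision AND run"))   # ('AND', ['decision', 'run'], 0)
print(parse_query("NOT gun"))            # ('NOT', ['gun'], 0)
print(parse_query("food NEAR/3 fight"))  # ('NEAR', ['food', 'fight'], 3)
```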

## Utils


The `Utils` class contains utility methods for set operations (intersection, union, and complement) and a small custom proximity check, whose name could probably be improved.
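
The boolean operators map directly onto Python's built-in set operations. The sketch below shows that correspondence on made-up posting lists; the function names, the lists, and the total of 15 documents are assumptions for illustration only, and the results are sorted purely for readable output.

```python
def intersect(a: list, b: list) -> list:
    """AND: document IDs present in both posting lists."""
    return sorted(set(a) & set(b))

def union(a: list, b: list) -> list:
    """OR: document IDs present in either posting list."""
    return sorted(set(a) | set(b))

def complement(a: list, all_ids) -> list:
    """NOT: every known document ID that is absent from the posting list."""
    return sorted(set(all_ids) - set(a))

# Made-up posting lists for two terms in a 15-document collection
postings_a = [3, 8, 14]
postings_b = [1, 8, 14]
all_doc_ids = range(15)

print(intersect(postings_a, postings_b))    # [8, 14]
print(union(postings_a, postings_b))        # [1, 3, 8, 14]
print(complement(postings_a, all_doc_ids))  # [0, 1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 13]
```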

#### Methods

##### `interSect(ls: list)`

- **Description**: Calculates the intersection of two lists.

##### `union(ls: list)`

- **Description**: Calculates the union of two lists.

##### `complement(ls: list)`

- **Description**: Returns the document IDs from the ID mapper that are not in the given list (the complement used for `NOT` queries).

##### `proximateDistanceCal(ls: list, maxDistance: int)`

- **Description**: Returns the IDs of documents in which the positions of the two terms fall within the given maximum distance of each other.


In [6]:
class Utils:
    def __init__(self) -> None:
        pass

    @staticmethod
    def interSect(ls: list)->list:
        return list(set(ls[0]) & set(ls[1]))

    @staticmethod
    def union(ls :list)->list:
        return list(set(ls[0]) | set(ls[1]))

    @staticmethod
    def complement(ls: list)->list:
        # ls[0]: doc IDs containing the term; ls[1]: DOC_ID_MAPPER (name -> doc ID)
        return list(set(ls[1].values()) - set(ls[0]))
    
    @staticmethod
    def proximateDistanceCal(ls: list, maxDistance: int)->list:
        result = []
        # Both terms must be present in the index; otherwise no document can match
        if len(ls) < 2:
            return result
        for docId, indexes in ls[0].items():
            if docId in ls[1]:
                # Compare positions of the first term against positions of the second term
                for first in indexes:
                    if any(abs(first - second) <= maxDistance + 1 for second in ls[1][docId]):
                        result.append(docId)
                        break

        return list(set(result))
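
Since the positions of each term are stored in ascending order, a proximity check between two terms can also be done with a two-pointer scan instead of comparing every pair of positions. The following is a sketch of that alternative, not the method used above; `within_distance` and `near` are hypothetical names, the postings are made up, and the threshold here is a plain position difference of at most `max_gap`, which is an assumption about how `NEAR/k` should be interpreted.

```python
def within_distance(positions_a: list, positions_b: list, max_gap: int) -> bool:
    """True if some position in A and some position in B differ by at most max_gap.

    Both lists must be sorted in ascending order.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= max_gap:
            return True
        # Advance the pointer at the smaller position: moving the other one
        # could only increase the difference
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

def near(postings_a: dict, postings_b: dict, max_gap: int) -> list:
    """Doc IDs that contain both terms within max_gap positions of each other."""
    shared = postings_a.keys() & postings_b.keys()
    return sorted(d for d in shared if within_distance(postings_a[d], postings_b[d], max_gap))

# Made-up postings: term -> {doc_id -> [positions]}
food = {2: [5, 40], 11: [0, 17]}
fight = {11: [1, 30], 12: [3]}
print(near(food, fight, 1))          # [11]
print(near({2: [5]}, {2: [9]}, 3))   # []
```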


## IR System


The `IR` class is my information retrieval system. It builds the inverted index, reads your query, and prints the matching document IDs. If you want the system to stay up and answer many queries, you can call the `start` function inside a `while` loop.
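
A loop that keeps answering queries could be organized roughly as follows. This is only a sketch under stated assumptions: `run_queries` is a hypothetical helper, the word `exit` as a stop command is an invented convention, and `fake_search` is a stand-in for a function that would wrap the real search.

```python
from typing import Callable, Iterable, List

def run_queries(search: Callable[[str], list], queries: Iterable[str]) -> List[list]:
    """Answer queries one by one until the input is exhausted or 'exit' is seen."""
    results = []
    for raw in queries:
        if raw.strip().lower() == "exit":
            break
        results.append(search(raw))
    return results

# Stand-in search function (hypothetical): returns the number of words in the query
def fake_search(raw: str) -> list:
    return [len(raw.split())]

answers = run_queries(fake_search, ["decision AND run", "NOT gun", "exit", "ignored"])
print(answers)  # [[3], [2]]
```

For interactive use, the list of queries could be replaced by any iterable that yields user input, for example `iter(lambda: input("Enter your query:\n"), "exit")`.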

#### Methods

##### `booleanSearch(query: QueryPreProcessing)`

- **Description**: Performs boolean search operations (AND, OR, NOT) based on the provided query.

##### `proximateSearch(query: QueryPreProcessing)`

- **Description**: Performs proximate search operations based on the provided query and maximum distance.

##### `search(query: QueryPreProcessing)`

- **Description**: Determines the type of search operation (boolean or proximate) based on the query type and performs the search.

##### `start()`

- **Description**: Initiates the information retrieval process by taking user input for the query, processing it, and displaying the resulting document IDs.


In [7]:
class IR:
    def __init__(self) -> None:
        self.invertedIndex = InvertedIndex()
        self.invertedIndex.build()
        self.utils = Utils()

    def booleanSearch(self, query: QueryPreProcessing)->list:
        result_docs = []
        for term in query.query:
            if term in self.invertedIndex.invertedIndex:
                result_docs.append(list(self.invertedIndex.invertedIndex[term].keys()))
            else: result_docs.append([])
            
        if query.queryType == "AND":
            return Utils.interSect(result_docs)
        elif query.queryType == "OR":
            return Utils.union(result_docs)
        elif query.queryType == "NOT":
            # complement() needs the full ID mapper to know every document ID
            result_docs.append(InvertedIndex.DOC_ID_MAPPER)
            return Utils.complement(result_docs)
            
    def proximateSearch(self, query: QueryPreProcessing)->list:
        result_docs = []
        for term in query.query:
            if term in self.invertedIndex.invertedIndex:
                result_docs.append(self.invertedIndex.invertedIndex[term])

        return Utils.proximateDistanceCal(result_docs, query.maxDistance)
            
    def search(self, query: QueryPreProcessing)-> list:
        if(query.queryType != "NEAR"):
            return self.booleanSearch(query)
        else:
            return self.proximateSearch(query)   
        
    def start(self)->None:
        query = input("Enter your query:\n")
        print('\n') 
        print("Input query : " , query)
        query = QueryPreProcessing(query)
        query.execute()
        print("Result document Ids : " , self.search(query))

In [8]:
# Use a separate instance name so the IR class itself is not overwritten
ir = IR()
ir.start()

0  ->  Jerry Decided To Buy a Gun.txt
1  ->  Gasoline Prices Hit Record High.txt
2  ->  Man Injured at Fast Food Place.txt
3  ->  Freeway Chase Ends at Newsstand.txt
4  ->  Happy and Unhappy Renters.txt
5  ->  A Festival of Books.txt
6  ->  Cloning Pets.txt
7  ->  Rentals at the Oceanside Community.txt
8  ->  Crazy Housing Prices.txt
9  ->  A Murder-Suicide.txt
10  ->  Pulling Out Nine Tons of Trash.txt
11  ->  Food Fight Erupted in Prison.txt
12  ->  Sara Went Shopping.txt
13  ->  Trees Are a Threat.txt
14  ->  Better To Be Unlucky.txt


Input query :  decision AND run
Result document Ids :  [8, 14]


## Report:


### Steps:

1. For this mini project, I first tried to read all the files directly from Google Drive, but I ran into problems with the URLs because the whole directory could not be read. I then decided to download all the files and put them in a local directory.

2. After writing my document loader class, I read a little about nltk for preprocessing. We all know about removing punctuation and similar steps, but I had no idea about stemming and lemmatization, so I read about them and about nltk as a tool for doing them.

3. In this step I built my inverted index. Python's dictionary makes this step very easy: I had planned to implement my own trie and so on, but a dictionary works like a hash table and is very efficient, so I used it to build my inverted index.

4. I wrote my query preprocessor, which is similar to my document preprocessor.

5. I assembled all these components, and at this point I had to write a `Utils` class to provide some helper functionality. I then faced a problem: at first I inserted tuples like (docId, position), but that made searching through them awkward, so I converted the structure to a mapping like docId -> [positions ... ]. This step was the most critical and challenging part for me.

6. In this step I ran some queries to check my code.

7. In the last step I wrote the documentation for my code. For better style and more formal language, I gave my notes and my classes to ChatGPT, which helped me generate the documentation. Finally, I revised everything, merged it all, and wrote this document.


## References:


1. Our course slides
2. https://www.nltk.org/
3. https://chat.openai.com/
