


# Document Similarity with Latent Semantic Analysis (LSA)

The following notebook walks you through doing LSA document similarity in Python. We then output the document similarity matrix as a .csv file which can be manipulated to highlight similarity between documents. You then have the option of using our "doc_sim_lsa_heatmap" notebook to create a heatmap of cosine similarity scores between documents.

###  Before we begin
This notebook is set up to be used specifically on an HTRC Data Capsule and has default file paths and variable settings that you may need to change. Be sure to read all the annotations and directions thoroughly so you know what changes you may wish to make and where.

### Include necessary packages for notebook 

Python's extensibility comes in large part from packages. Packages are groups of functions, data, and algorithms that allow users to easily carry out processes without reinventing the wheel. Some packages are included in the basic installation of Python; others, created by Python users, are available for download.

If you decide to add to the code in this notebook you may need to install packages that are not pre-installed on the Data Capsule. In your terminal, packages can be installed by simply typing `pip install nameofpackage`. However, since you are using this notebook on the HTRC Data Capsule you will not need to install any of the packages below to use this notebook as it is. We do, however, need to import the packages we want to use. Installing a package just means we have it available to use, importing the package tells Python that our code below actually utilizes the package. Below is a brief description of the packages we are using in this notebook:  

- **os:** Provides a portable way of using operating system dependent functionality.
- **sklearn:** Simple and efficient tools for data mining and data analysis built on NumPy, SciPy, and matplotlib.
- **scipy:** Open-source software for mathematics, science, and engineering.
- **pandas:** An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- **warnings:** Allows for the manipulation of warning messages in Python.
- **numpy:** A general-purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays.
- **string:** Contains a number of functions to process standard Python strings.
- **nltk:** A leading platform for building Python programs to work with human language data.
- **spacy:** A library for advanced Natural Language Processing in Python and Cython.

Notice we import some of the packages differently. In some cases we just import the entire package when we say `import XYZ`. For some packages which are small, or, from which we are going to use a lot of the functionality it provides, this is fine. 

Sometimes when we import the package directly we say `import XYZ as X`. All this does is allow us to type `X` instead of `XYZ` when we use certain functions from the package. So we can now say `X.function()` instead of `XYZ.function()`. This saves time typing and eliminates errors from having to type out longer package names. I could just as easily type `import XYZ as potato` and whenever I use a function from the `XYZ` package I would need to type `potato.function()`. What we import the package as is up to you, but some commonly used packages have abbreviations that are standard amongst Python users such as `import pandas as pd` or `import matplotlib.pyplot as plt`. You do not need to use `pd` or `plt`; however, these are widely used, and using something else could confuse other users and is generally considered bad practice.

Other times we import only specific elements or functions from a package. This is common with packages that are very large and provide a lot of functionality, but from which we are only using a couple of functions or a specific subset of the package that contains the functionality we need. This is seen when we say `from XYZ import ABC`. This is saying I only want the `ABC` function from the `XYZ` package. Sometimes we need to point to the specific location where a function is located within the package. We do this by adding periods in between the directory names, so it would look like `from XYZ.DEF.GHI import LMN`. This says we want the `LMN` function which is located in the `XYZ` package and then in the `DEF` and `GHI` directories inside that package.

You can also import more than one function from a package by separating the functions with commas like this `from XYZ import ABC, LMN, QRS`. This imports the `ABC`, `LMN` and `QRS` functions from the `XYZ` package.
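
As a quick illustration of the import styles described above, here is a small sketch that uses only modules from Python's standard library. The modules chosen here are for illustration only and are not all needed by the rest of this notebook.

```python
# 1. Import an entire package
import math

# 2. Import a package under a shorter alias
import statistics as st

# 3. Import one function, using a period to reach the sub-module it lives in
from os.path import join

# 4. Import several names from one package, separated by commas
from string import digits, punctuation

print(math.sqrt(16))                    # called through the full package name
print(st.mean([1, 2, 3]))               # called through the alias
print(join("media", "secure_volume"))   # the imported function is used directly
print(digits, punctuation)              # both imported names are available
```

Notice that names imported with `from ... import ...` are used without any package prefix.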

In [1]:
import os
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
import pandas as pd
import warnings
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
import spacy



The next cell ignores specific deprecation, future, and user warnings raised by pandas and sklearn. None of these warnings are concerning, and they will not break the code or cause errors in the results.

In [2]:
# Suppress known, harmless warnings from the pandas and sklearn libraries
warnings.filterwarnings("ignore", category=DeprecationWarning,
                        module="pandas", lineno=570)
warnings.filterwarnings("ignore", category=FutureWarning,
                        module = "sklearn", lineno = 1059)
warnings.filterwarnings("ignore", category=UserWarning,
                        module = "sklearn", lineno = 300)
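
If you are curious what `warnings.filterwarnings` actually does, the following is a minimal sketch. The `noisy` function is hypothetical and only exists to raise a warning; `catch_warnings(record=True)` is used so we can count how many warnings get through before and after the filter is added.

```python
import warnings

def noisy():
    # Hypothetical function that emits a deprecation warning
    warnings.warn("this feature is going away", DeprecationWarning)
    return "result"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")      # start by showing every warning
    noisy()
    shown_before = len(caught)

    # Same kind of filter as in the cell above, without module/lineno limits
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    value = noisy()                      # same warning, now ignored
    shown_after = len(caught) - shown_before

print(shown_before, shown_after, value)
```

The function still runs and returns its value; only the warning message is hidden.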

### Getting your data

#### File paths
Here we are saving as variables different file paths that we need in our code. We do this so that they are easier to call later and so that you can make most of your changes now and not need to make as many changes later. 

First we point to the `secure_volume` directory and assign that file path to the variable `home_path` since all of the volumes used will be in the `secure_volume` directory.

Next, we combine the `home_path` variable with the folder names that lead to where our data is stored. Note that we do not use any file names yet, just the path to the folder. This is because we are comparing documents to one another, so we need to read in an entire directory or the contents of several directories. You will want to change the folder name(s) to match the folder names in your file path. Currently it is set to a directory of sample HathiTrust volumes.

Now we add the `home_path` variable to other folder names that lead to a folder where we will want to save our document similarity matrix. You again will want to change the folder names in the path to match your own folder names. We assign this file path to the variable `cleaned_data`.

In [3]:
home_path = os.path.join("/media", "secure_volume")
data_home = os.path.join(home_path, "workset")
cleaned_data = os.path.join(home_path, "jupyter_notebooks", "document_similarity", "cleaned_data" )
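
The folder that `cleaned_data` points to has to exist before anything can be saved there. The sketch below shows one possible way to check for a folder and create it if it is missing. It uses a temporary directory as a stand-in for `/media/secure_volume` so it can be run anywhere; the `demo_` names are made up for this example.

```python
import os
import tempfile

# Stand-in for "/media/secure_volume" so the sketch can run anywhere
demo_home = tempfile.mkdtemp()

demo_data = os.path.join(demo_home, "workset")
demo_out = os.path.join(demo_home, "jupyter_notebooks",
                        "document_similarity", "cleaned_data")

# Create the folders if they are missing; exist_ok=True avoids an
# error when a folder is already there
os.makedirs(demo_data, exist_ok=True)
os.makedirs(demo_out, exist_ok=True)

for path in (demo_data, demo_out):
    print(path, "exists:", os.path.isdir(path))
```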

#### Set needed variables

Next we assign values to variables that will inform various parts of our code. Just like the file path variables, this is done so you have to make fewer changes later and also to make the changes easier to find by putting them in one place.

- **nltk_stop:** If you want to use the stopword list that comes with the nltk package then set `nltk_stop` equal to `True`. If you do not wish to use the nltk stopword list then set `nltk_stop` equal to `False`.

- **custom_stop:** If you have created your own custom stopword list and wish to use that, then set `custom_stop` equal to `True`. If you do not have your own custom stopword list then set `custom_stop` equal to `False`.

**NOTE: You can use both the nltk and custom stopword lists or you can use neither or just one or the other. You do NOT need to set them both to True or both to False. Use whatever works best for you.**

- **lem:** Next we decide if we want to lemmatize our words. Lemmatizing words will turn certain words to the root of the word. So "are" and "is" become "be" and "runs" and "running" become "run". This will probably increase the similarity of documents as they will then share more words in common. If you want to lemmatize the words in your dataset then assign `True` to the variable `lem`. If you do not wish to lemmatize your words then assign `False` to the variable `lem`.

- **lower_case:** Then we decide if we want all the words in our dataset lowercased. This will change "Love" to "love" so that it is recognized as the same word for similarity purposes. However, there are some cases where the use of capitalization may be important to determining similarity, so we have the option to lowercase or not. If you want to lowercase all the words in your dataset assign `True` to the variable `lower_case`. If you do not wish to lowercase all the words in your dataset then assign `False` to the variable `lower_case`.

- **remove_digits:** Now we decide if we want to remove numbers from our text. Again, removing numbers will increase the similarity of texts as page numbers and other integers that may not be exactly alike will be removed. However, there are instances where numbers are thematically important, and they need to be kept. Here is where you make that decision. If you wish to remove all numbers then assign `True` to the `remove_digits` variable. If you wish to retain all numbers then assign `False` to the `remove_digits` variable.

- **language:** Now we choose the language we will be using for the nltk stopwords list. If you need a different language, simply change 'english' (keep the quotes) in the `language` variable to the anglicized name of the language you wish to use (e.g. 'spanish' instead of 'espanol' or 'german' instead of 'deutsch'). For a list of available stopword languages in nltk add a new code cell and type `print(stopwords.fileids())` and the list of available languages will print out below the cell.

- **lem_lang:** Now we choose the language for our lemmatizer. The languages available for spacy include the list below and the abbreviation spacy uses for that language. To choose a language simply type the two letter code following the anglicized language name in the list. So for Spanish it would be `'es'` (with the quotes) and for German `'de'` and so on.

    - **English:** `'en'`
    - **Chinese:** `'zh'`
    - **Spanish:** `'es'`
    - **German:** `'de'`
    - **French:** `'fr'`
    - **Italian:** `'it'`
    - **Portuguese:** `'pt'`
    - **Japanese:** `'ja'`
    - **Russian:** `'ru'`
    - **Multi-Language:** `'xx'`
    

- **encoding/errors:** The variable `encoding` is where you determine what type of encoding to use (ascii, ISO-8859-1, utf-8, etc...). We have it set to utf-8 at the moment as we have found it is less likely to have any problems. Encoding errors do still occur, however. They rarely impact our results, but left unhandled they cause the Python code to exit. So instead of dealing with unhelpful errors we ignore the ones dealing with encoding by assigning `'ignore'` to the `errors` variable. If you want to see any encoding errors then change `'ignore'` to `None` without the quotes.

- **concat:** The HTRC Workset Toolkit gives the option of concatenating the HathiTrust volumes. If this option is chosen when downloading your corpus then all of the individual page files will be combined into one volume file. If you did not choose to concatenate your volumes, then there will be a directory for each volume and each directory contains a file for every page of that volume. Later on in the code there will be two options for reading in your corpus, one for if you concatenated the volumes, and one for if you did not. So that the code knows which one to choose, we need to assign `True` to the `concat` variable if you did concatenate the volume pages or `False` if you did NOT concatenate the volume pages.

- **n_comp:** The LSA algorithm used by the sklearn Python package has a required parameter called `n_components` for "number of components" (the point of it will be explained later). There is a part of the code further down that will help determine what the best number of components is for your corpus to produce the most accurate result. This will also make the process take longer as it is an added complicated step. If you want to perform this part of the code and try to determine the best number of components, then assign `True` to the `n_comp` variable. If you do not wish to perform this added step then assign `False` to the `n_comp` variable. The default number of components will be the number of volumes in your corpus if you assign `False` to `n_comp`.

- **stop_words =[]:** The `stop_words =[]` variable is simply an empty list. This is where the words from the nltk stopword list or your custom stopword list or both combined or neither (depending on what you decide) will reside later on. You do not need to do anything to this line of code.

- **token_dict ={}:** The `token_dict = {}` variable is an empty dictionary. This is where your documents will reside later. The file name for the document will be the key and the content of the document will be the value. This will be explained in more detail later. For now, you do not need to do anything to this line.

In [4]:
nltk_stop = True
custom_stop = False
lem = True
lower_case = True
remove_digits = True
language = 'english'
lem_lang = "en"
encoding = 'utf-8'
errors = 'ignore'
concat = False
n_comp = True
stop_words = []
token_dict = {}
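
To make the `errors` setting more concrete, here is a small sketch of what happens when bytes that are not valid utf-8 are decoded. The sample text is made up; the same error-handling names (`'ignore'`, `'strict'`) are the ones accepted by `open()`.

```python
# Bytes written with a non-UTF-8 encoding (made-up sample)
raw = "café".encode("latin-1")

# errors="ignore" silently drops the bytes that cannot be decoded
ignored = raw.decode("utf-8", errors="ignore")
print(ignored)

# errors="strict" (what open() uses when errors is None) raises instead
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decoding failed:", exc.reason)
```

Note that ignoring errors means the offending characters are dropped from the text, so a few words may end up slightly altered.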

### Stopwords
If you set `nltk_stop` equal to **True** above then this will add the NLTK stopwords list to the empty list named `stop_words`

You should have already chosen your desired language above, but if you wish to add any words to the `stop_words` list then add the word(s) you want as stop words in the `stop_words.extend(['words', 'you', 'want', 'to', 'add'])` part of the code.

In [5]:
if nltk_stop is True:
    # NLTK Stop words
    stop_words = stopwords.words(language)

    stop_words.extend(['would', 'said', 'says', 'also', '-PRON-', '-pron-'])

#### Add own stopword list

Here is where your own stopwords list is added if you set `custom_stop` equal to **True** above. Here you will need to change the folder names and file name to match your folders and file. Remember to put each folder name in quotes and in the correct order always putting the file name including the file extension (.txt) last.

In [6]:
if custom_stop is True:
    stop_words_filepath = os.path.join(home_path, "data", "my_stopwords.txt")

    with open(stop_words_filepath, "r",encoding = encoding, errors = errors) as f:
        stop_words_list = [x.strip() for x in f.readlines()]

    stop_words.extend(stop_words_list)
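
The sketch below shows, under simplified assumptions, what a custom stopword file might look like and how the combined list is used. The file is written to a temporary folder, the three-word starting list is only a stand-in for the NLTK list, and all `demo_` names are made up for this example.

```python
import os
import tempfile

# Hypothetical stopword file with one word per line (and one blank line)
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "my_stopwords.txt")
with open(demo_file, "w", encoding="utf-8") as f:
    f.write("thee\nthou\n\nhath\n")

demo_stop_words = ["the", "and", "of"]   # stand-in for the NLTK list

with open(demo_file, "r", encoding="utf-8", errors="ignore") as f:
    # strip() removes the newline; the "if" skips blank lines
    custom_words = [line.strip() for line in f if line.strip()]

demo_stop_words.extend(custom_words)

# Words that appear in the stopword list are dropped from the text
stop_set = set(demo_stop_words)
tokens = ["thou", "hath", "the", "heart", "of", "a", "lion"]
kept = [t for t in tokens if t not in stop_set]
print(kept)
```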

### Functions
We need to create a function in order to lemmatize and tokenize our data. Any time you see `def` that means we are **DEF**ining a function. The `def` is followed by the name of the function being created and then in parentheses are the parameters required by the function. After the parentheses is a colon, which closes the declaration, then a bunch of code below which is indented. The indented code is the program statement or statements to be executed. Once you have created your function all you need to do in order to run it is call the function by name and make sure you have included all the required parameters in the parentheses. This allows you to call the function without having to write out all the code in the function every time you wish to perform that task.
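
Here is a tiny example of declaring a function and then calling it. The function `count_words` is made up for illustration and is not used elsewhere in this notebook.

```python
def count_words(text, min_length=1):
    """Return how many words in text have at least min_length characters."""
    words = text.split()
    return len([w for w in words if len(w) >= min_length])

# Call the function by name, passing the parameters in parentheses
print(count_words("the quick brown fox"))                 # counts every word
print(count_words("the quick brown fox", min_length=4))   # counts longer words only
```

The parameter `min_length` has a default value, so it can be left out of the call when the default is what we want.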

#### Tokenization functions
Here we have several "if...else" statements. First, if we set `lem` to **True** above we create two functions. One to ignore the "\n" markers in our text and the next to tokenize and lemmatize our text. Second, within that first `if` statement, if `lem_lang` is set to a certain language abbreviation above, we want to use the lemmatization for that language. 

The `else` is if we set `lem` to **False** above then we create one function that tokenizes our text only.

You should not need to make any changes to this block of code.

In [7]:
if lem is True:
    if lem_lang in ("en", "zh"):
        nlp = spacy.load(lem_lang+"_core_web_sm", disable=["parser","ner"])
    elif lem_lang == "xx":
        nlp = spacy.load(lem_lang+"_ent_wiki_sm", disable=["parser","ner"])
    else:
        nlp = spacy.load(lem_lang+"_core_news_sm", disable = ["parser","ner"])
        
    nlp.max_length=1500000
    def token_filter(token):
        return not (token.is_space)
    
    def tokenize(text):
        for doc in nlp.pipe([text]):
            tokens = [token.lemma_ for token in doc if token_filter(token)]
        return tokens

else:
    def tokenize(text):
        tokens = nltk.word_tokenize(text)
        return tokens
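
To get a feel for what a `tokenize` function returns, here is a simplified stand-in that does not need spacy or nltk. It is only a sketch: the lemma table is hand-made and covers just a few words, whereas spacy relies on a full language model. The names `demo_lemmas` and `demo_tokenize` are made up for this example.

```python
import re

# Tiny hand-made lemma table; spacy uses a full language model instead
demo_lemmas = {"is": "be", "are": "be", "runs": "run", "running": "run"}

def demo_tokenize(text, lemmatize=True):
    # \S+ matches runs of non-whitespace characters, so "\n" and
    # spaces never become tokens
    tokens = re.findall(r"\S+", text)
    if lemmatize:
        # Replace a word with its root when we know it, else keep it
        tokens = [demo_lemmas.get(tok, tok) for tok in tokens]
    return tokens

print(demo_tokenize("she runs\nthey are running"))
print(demo_tokenize("she runs\nthey are running", lemmatize=False))
```

With lemmatization on, "runs" and "running" both become "run", which is why lemmatized documents tend to share more words.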

#### Read in documents
Now we read in our documents and also perform some text cleaning. This code lower cases all the words as well as removes punctuation and digits, depending on what we set for the `lower_case` and `remove_digits` variables above. Then it adds the file names and cleaned content of each file to our previously empty `token_dict` dictionary above. You should not need to make any changes to this code.

A dictionary is similar to a list except it has what are called 'keys' and 'values'. This basically allows us to label our data. In this case we will be making the file names of our documents the 'keys' and the content of the file the 'values' so that each document name correlates to the content of that document.

The `if concat is True:` part says that if we assigned `True` to the `concat` variable above, then the code below the statement will be run, which reads in files as if each file was a separate volume (which it is if we concatenated the volume pages when we downloaded the volumes). If we did NOT assign `True` to `concat` (`else`), then the code below `else` will be run instead, which reads in the files for each directory and treats each directory as a single volume.

In [8]:
if concat is True:
    for subdir, dirs, files in os.walk(data_home):
        for file in files:
            if file.startswith('.'):
                    continue
            if file.startswith('volume-rights.txt'):
                    continue
            if not file.endswith('.txt'):
                    continue
            
            file_path = subdir + os.path.sep + file
            with open(file_path, 'r', encoding = encoding, errors = errors) as text_file:
                text = text_file.read()
                if lower_case is True and remove_digits is True:
                    lowers = text.lower()
                    no_punctuation = lowers.translate(str.maketrans('','', string.punctuation))
                    no_digits = no_punctuation.translate(str.maketrans('','', string.digits))
                    token_dict[file] = no_digits
                elif lower_case == True and remove_digits == False:
                    lowers = text.lower()
                    no_punctuation = lowers.translate(str.maketrans('','', string.punctuation))
                    token_dict[file] = no_punctuation
                elif lower_case == False and remove_digits == True:
                    no_punctuation = text.translate(str.maketrans('','', string.punctuation))
                    no_digits = no_punctuation.translate(str.maketrans('','', string.digits))
                    token_dict[file] = no_digits
                else:
                    no_punctuation = text.translate(str.maketrans('','', string.punctuation))
                    token_dict[file] = no_punctuation
else:
    
    data = []
    text = []
    for folder in sorted(os.listdir(data_home)):
        if not os.path.isdir(os.path.join(data_home, folder)):
            continue
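        for file in sorted(os.listdir(os.path.join(data_home, folder))):
            # Skip non-text files here so the rows of df line up with the text list built below
            if not file.endswith(".txt"):
                continue
            data.append((data_home, folder, file))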
    df = pd.DataFrame(data, columns = ["Root","Folder", "File"])
    df["Paths"] = df["Root"].astype(str) + "/" + df["Folder"].astype(str) + "/" + df["File"].astype(str)
    for path in df["Paths"]:
        if not path.endswith(".txt"):
            continue
        with open (path, "r", encoding = encoding, errors = errors) as f:
            t = f.read().strip().split()
            if lower_case is True and remove_digits is True:
                lowers = ' '.join(t).lower()
                no_punctuation = lowers.translate(str.maketrans('','', string.punctuation))
                no_digits = no_punctuation.translate(str.maketrans('','', string.digits))
                text.append(no_digits)
            elif lower_case == True and remove_digits == False:
                lowers = ' '.join(t).lower()
                no_punctuation = lowers.translate(str.maketrans('','', string.punctuation))
                text.append(no_punctuation)
            elif lower_case == False and remove_digits == True:
                no_punctuation = ' '.join(t).translate(str.maketrans('','', string.punctuation))
                no_digits = no_punctuation.translate(str.maketrans('','', string.digits))
                text.append(no_digits)
            else:
                no_punctuation = ' '.join(t).translate(str.maketrans('','', string.punctuation))
                text.append(no_punctuation)
                
    
    df["Text"] = pd.Series(text)
    df["Text"] = ["".join(map(str, l)) for l in df["Text"].astype(str)]
    d = {'Text':'merge'}
    df_text = df.groupby(['Folder'])["Text"].apply(lambda x: ' '.join(x)).reset_index()
    
    token_dict = dict(zip(df_text["Folder"], df_text["Text"]))
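
The text cleaning in the cell above boils down to three steps: optional lowercasing, removing punctuation, and optionally removing digits. The following sketch shows those steps on a single made-up sentence. The helper name `clean_text` is hypothetical and is a compact illustration of the idea, not a replacement for the cell above.

```python
import string

def clean_text(text, lower_case=True, remove_digits=True):
    if lower_case:
        text = text.lower()
    # Punctuation is always removed; digits only when requested
    remove = string.punctuation + (string.digits if remove_digits else "")
    # Characters listed in the third argument of maketrans are deleted
    return text.translate(str.maketrans("", "", remove))

sample = "Chapter 12: Love, actually!"
print(repr(clean_text(sample)))
print(repr(clean_text(sample, lower_case=False, remove_digits=False)))
```

Removed characters are deleted rather than replaced with a space, so a word like "don't" becomes "dont".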

Let's check and see if our dictionary now has our data. We are asking to see the first 10 keys of our dictionary.

In [9]:
print(list(token_dict.keys())[:10])

['season_01', 'season_02', 'season_03', 'season_04', 'season_05', 'season_06', 'season_07']


#### Tfidf Vectorizer

Here we weight the importance of each word in each document using Term Frequency-Inverse Document Frequency (Tfidf). A word's weight goes up the more often it appears in a given document and goes down the more documents in the corpus contain it. This allows words that do not have a high frequency across the entire collection, but do have a high frequency in one or two documents, to still be given a high level of importance in those documents.

In [10]:
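# lowercase defaults to True in TfidfVectorizer, so pass our own setting to respect lower_case = False
vectorizer = TfidfVectorizer(tokenizer = tokenize, stop_words = stop_words, lowercase = lower_case)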
dtm = vectorizer.fit_transform(token_dict.values())
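
To show where Tfidf weights come from, here is a hand-computed sketch on three made-up documents. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each document so its squared weights sum to 1. It splits on whitespace only, so it is a simplification of what the vectorizer above does.

```python
import math
from collections import Counter

docs = {
    "doc_a": "love war love",
    "doc_b": "war peace",
    "doc_c": "war garden",
}

n_docs = len(docs)
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for c in counts.values() for word in c)

def tfidf(name):
    # term frequency * smoothed inverse document frequency
    weights = {
        word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, tf in counts[name].items()
    }
    # Scale so that the squared weights sum to 1 (L2 normalization)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

for name in docs:
    print(name, {w: round(v, 3) for w, v in tfidf(name).items()})
```

The word "war" appears in every document, so it receives the lowest weight in each of them, while words unique to one document are weighted more heavily.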

The below code outputs the first 20 words that make up the columns once we have broken our corpus down into a Tfidf matrix. To output all of the words remove the square brackets and their contents in the `print(vectGFN[:20])` line of code. To change the number of words printed change `20` in the same line to the number of words you wish to see.

In [11]:
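# Get words that correspond to each column (get_feature_names() was removed in scikit-learn 1.2)
if hasattr(vectorizer, "get_feature_names_out"):
    vectGFN = list(vectorizer.get_feature_names_out())
else:
    vectGFN = vectorizer.get_feature_names()
print(vectGFN[:20])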

['\x00\x00\x00\x16', '\x00\x00\x00\x16b', '\x00\x00\x00\x16bb', '\x00\x00\x00\x16bbb', '\x00\x00\x00\x16bbbb', 'I', 'aa', 'aaaaall', 'aaaaard', 'aaagh', 'aaah', 'aah', 'aaron', 'ab', 'aback', 'aban', 'abandon', 'abash', 'abate', 'abatemarco']


If we assigned `True` to the `n_comp` variable above, this cell fits an SVD with as many components as there are documents and creates a function that finds the smallest number of components whose combined explained variance ratio reaches a threshold of our choice. The math is fairly complicated, but the idea is that we are reducing our Tfidf matrix to a more manageable size, and we need to know how few components we can reduce it to while still keeping a certain percentage of the variance in our data. If we assigned `False` to the `n_comp` variable, this cell will be skipped.

In [12]:
if n_comp is True:
    tsvd = TruncatedSVD(n_components=dtm.shape[0])
    tsvd.fit(dtm)
    tsvd_var_ratios = tsvd.explained_variance_ratio_

    def select_n_components(var_ratio, goal_var: float) -> int:
        total_variance = 0.0
        n_components = 0
        for explained_variance in var_ratio:
            total_variance += explained_variance
            n_components += 1
            if total_variance >= goal_var:
                break
        return n_components
else:
    None

If we assigned `True` to the `n_comp` variable we will now apply the function we created in the cell above. This will incrementally increase the number of components we reduce our matrix to until the cumulative explained variance ratio reaches 95% or higher, meaning the reduced matrix still captures 95% or more of the variance in our data. If we assigned `False` to `n_comp` then this cell will be skipped.

In [13]:
if n_comp is True:
    nc = select_n_components(tsvd_var_ratios, 0.95)
    print(nc)
else:
    None

7
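
To see how the component count is chosen, the sketch below applies the same idea to a short list of made-up explained variance ratios. The function `demo_select_n_components` is a hypothetical variant written with a running total; unlike real output from sklearn, the numbers here are invented for illustration.

```python
from itertools import accumulate

# Made-up explained variance ratios, largest component first
demo_ratios = [0.50, 0.30, 0.10, 0.06, 0.04]

def demo_select_n_components(var_ratio, goal_var):
    # accumulate() yields the variance explained by the first
    # 1, 2, 3, ... components added together
    for n_components, total in enumerate(accumulate(var_ratio), start=1):
        if total >= goal_var:
            return n_components
    return len(var_ratio)   # goal never reached: keep every component

print(demo_select_n_components(demo_ratios, 0.95))
print(demo_select_n_components(demo_ratios, 0.75))
```

With these numbers, the first three components together explain about 90% of the variance, so a fourth is needed to pass the 95% goal.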


#### Run SVD and Cosine Similarity

Here we run our Tfidf matrix created above through Singular Value Decomposition and then calculate the Cosine Similarity of the documents to one another. 

Singular Value Decomposition reduces our Tfidf matrix to a smaller number of dimensions, which makes it easier to process. Here we set the number of dimensions (`n_components`), the number of iterations used by the randomized SVD solver (`n_iter`), and the seed (`random_state`) so that the results are reproducible, since sklearn uses a bit of randomization in its algorithm. At the moment the `random_state` is set to 42, which sets the seed for the random number generator, but feel free to adjust the number to get a slightly different output. Just make sure you keep the seed the same once you find one you like for reproducibility. If you assigned `True` to `n_comp` then the `n_components` parameter will be set to whatever was determined to be the best option by the `select_n_components` function above. If you assigned `False` to the `n_comp` variable then the `n_components` will be the number of volumes in your corpus.

Cosine similarity measures how similar the documents are to one another. The result is a number between -1 and 1, with 1 meaning the two document vectors point in the same direction (which we will get when a document is compared to itself). A score of 0 means the documents have nothing in common, which is what we would get in the Tfidf matrix if, for example, one document consisted only of numbers and the other only of words. Tfidf weights are never negative, so scores below 0 can only arise after the SVD step, whose components can take negative values. Usually, even documents that are about unrelated topics share some common words and so are not completely dissimilar.

In [14]:
if n_comp is True:
    lsa = TruncatedSVD(n_components = nc, n_iter = 1000, random_state = 42)
    dtm_lsa = lsa.fit_transform(dtm)
    cosine_sim = cosine_similarity(dtm_lsa)
else:
    lsa = TruncatedSVD(n_components = dtm.shape[0], n_iter = 1000, random_state = 42)
    dtm_lsa = lsa.fit_transform(dtm)
    cosine_sim = cosine_similarity(dtm_lsa)
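
The cosine similarity calculation itself is short enough to write out by hand. The sketch below uses plain Python lists as made-up document vectors; the `cosine` helper is hypothetical and handles one pair at a time, whereas sklearn's `cosine_similarity` compares every pair of rows in a matrix at once.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0], [2, 4, 0]))     # same direction, different length
print(cosine([1, 0, 0], [0, 3, 0]))     # no shared dimensions
print(cosine([1, 2, 0], [-1, -2, 0]))   # opposite direction
```

Because the result is divided by the vector lengths, a long document and a short document can still be scored as very similar if they use words in the same proportions.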

#### Save as .csv

Now we save the results as a .csv file. First we name the output .csv file so it matches our data. We do this in the first line of the cell.

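Then we create the dataframe and say we want the rows and columns to be labeled with the file names. Then we sort the columns in alphanumeric order by column header, then we sort the rows alphanumerically by row label. We also blank out the diagonal of the matrix, since the similarity of a document to itself is always 1 and is not informative.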

Finally, we export the dataframe as a .csv file.

You can manipulate the .csv file in Excel or some other spreadsheet software, or you can use it in the "doc_sim_lsa_heatmap" notebook that accompanies this notebook. However, it is not recommended to use the accompanying heatmap notebook if your corpus exceeds 100 volumes as the heatmap becomes unwieldy and difficult to read and would, therefore, not be helpful in understanding your results.

In [15]:
csv_file_name = "doc_similarity_matrix_default.csv"

df = pd.DataFrame(cosine_sim, index = token_dict.keys(), columns=token_dict.keys())
df_s = df[sorted(df)]
sorted_df = df_s.sort_index(axis = 0)
np.fill_diagonal(sorted_df.values, np.nan)
sorted_df.to_csv(os.path.join(cleaned_data, csv_file_name))
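
One simple way to work with the saved matrix is to list the document pairs from most to least similar. The sketch below assumes the file layout that `to_csv` writes with its default settings: a header row of labels, one labeled row per document, and empty fields where values are missing (the diagonal). The contents of `demo_csv` are made up, and the standard `csv` module is used here only so the example runs without any extra packages.

```python
import csv
import io

# Made-up contents of a similarity file; the diagonal is empty
demo_csv = """,vol_a,vol_b,vol_c
vol_a,,0.91,0.12
vol_b,0.91,,0.40
vol_c,0.12,0.40,
"""

rows = list(csv.reader(io.StringIO(demo_csv)))
labels = rows[0][1:]          # column labels, skipping the empty corner cell

pairs = []
for i, row in enumerate(rows[1:]):
    for j, cell in enumerate(row[1:]):
        # j > i keeps each pair once and skips the empty diagonal
        if j > i and cell != "":
            pairs.append((float(cell), row[0], labels[j]))

ranked = sorted(pairs, reverse=True)
for score, doc_1, doc_2 in ranked:
    print(f"{doc_1} ~ {doc_2}: {score:.2f}")
```

To run this on your own results, replace `io.StringIO(demo_csv)` with the opened output file.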

This notebook was adapted from https://www.datascienceassn.org/sites/default/files/users/user1/lsa_presentation_final.pdf at Colorado University, Boulder. Accessed on 02/01/2019.