# An Introduction to Latent Semantic Analysis

<hr>

This Python notebook demonstrates how to perform Latent Semantic Analysis (LSA) on a corpus of text. It builds upon a couple of other tutorials (see here and here), and I hope the Python, along with the theory, will help you use this technique in your own research.

In this example, we are going to analyse a book of abstracts from a conference to determine the similarity between the conference papers that are going to be presented. This information could be useful to conference organisers wishing to group similar research together and/or as part of a search tool for the attendees to find similar work that might be of interest.

We will be treating each abstract as a list of words. From this, we will form an $N \times M$ matrix, where $N$ is the number of words and $M$ is the number of abstracts. The column vectors represent the abstracts (and thus the papers), and the row vectors represent the words. There are four key steps to the analysis, and we will step through each one, explaining the code and the output at each stage. The stages are:

* Generating a list of words from the abstracts
* Creating the word-abstract matrix
* Applying the Term Frequency - Inverse Document Frequency (TF-IDF) Weighting Scheme
* Using Singular Value Decomposition (SVD) to derive the underlying concepts across the documents
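
The four stages above can be sketched end to end on a toy corpus. This is a minimal illustration rather than the analysis used later in the notebook: the four short "abstracts" are invented, the TF-IDF formula is one common variant (term frequency normalised by abstract length, multiplied by the log of the inverse document frequency), and the choice of `k = 2` concepts is an assumption made only because the toy data has two obvious topics.

```python
import numpy as np

# Toy stand-ins for conference abstracts (invented for illustration only)
abstracts = [
    "turbine blade fatigue design",
    "turbine blade fatigue testing",
    "neural network image classification",
    "neural network image segmentation",
]

# Stage 1: generate the list of words (here, every distinct word)
tokenised = [a.lower().split() for a in abstracts]
words = sorted({w for tokens in tokenised for w in tokens})
row_of = {w: i for i, w in enumerate(words)}

# Stage 2: N x M word-abstract matrix (rows = words, columns = abstracts)
counts = np.zeros((len(words), len(abstracts)))
for j, tokens in enumerate(tokenised):
    for w in tokens:
        counts[row_of[w], j] += 1

# Stage 3: TF-IDF weighting (one common variant, assumed here)
tf = counts / counts.sum(axis=0, keepdims=True)   # term frequency per abstract
df = (counts > 0).sum(axis=1)                      # abstracts containing each word
idf = np.log(len(abstracts) / df)                  # inverse document frequency
tfidf = tf * idf[:, np.newaxis]

# Stage 4: SVD, keeping the k largest singular values ("concepts")
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2  # assumed number of concepts for this toy example
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T       # one row per abstract

# Cosine similarity between abstracts in the reduced concept space
unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
similarity = unit @ unit.T
print(np.round(similarity, 2))
```

In this toy case the two turbine abstracts should come out as near-identical in concept space, while abstracts from different topics should score close to zero. On real abstracts the separation will be far less clean, and the number of concepts to keep needs to be chosen by inspection.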

Before we get underway, we first need to import all the packages that we will be using. These are:

* NumPy - for handling matrices and the SVD function
* Matplotlib - for plotting
* json - for reading in the JSON data
* pprint - for pretty-printing text output to the console

In [16]:
# Jupyter magic command that renders plots inline with the code and text
%matplotlib inline

# Importing the packages that we need
import numpy as np # For matrices and SVD function
import matplotlib.pyplot as plt # For plotting
import json # To read in JSON data
from pprint import pprint # To pretty print text output to console

## Generating a list of words

Before we can form the $N \times M$ matrix of words against abstracts, we need to decide on the list of words that we will use to compare the abstracts. The first thing we need to do is load the dataset.

In [1]:
# load the data from the file
#with open('', 'r') as data_file:
#    data = json.load(data_file)
#pprint(data)
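
As a rough sketch of how a word list might be derived once the data is loaded, the example below uses an in-memory JSON string in place of the file. The record layout (a list of objects with `title` and `abstract` fields), the tiny stop-word set, and the rule of keeping only words that occur in at least two abstracts are all assumptions made for illustration; the real dataset may be structured differently.

```python
import json
import re
from collections import Counter

# Hypothetical stand-in for the JSON file: a list of records with an "abstract" field
raw = '''[
  {"title": "Paper A", "abstract": "The fatigue of turbine blades is measured."},
  {"title": "Paper B", "abstract": "A model of turbine blades and their fatigue."},
  {"title": "Paper C", "abstract": "Image classification with a neural model."}
]'''
data = json.loads(raw)

# Small illustrative stop-word set (an assumption, not a standard list)
stop_words = {"a", "the", "of", "is", "and", "their", "with"}

def tokenise(text):
    """Lower-case the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]

tokenised = [tokenise(record["abstract"]) for record in data]

# Count the number of abstracts each word appears in (not total occurrences)
doc_freq = Counter(w for tokens in tokenised for w in set(tokens))

# Keep only words shared by at least two abstracts, since a word found in a
# single abstract cannot help relate that abstract to any other
word_list = sorted(w for w, n in doc_freq.items() if n >= 2)
print(word_list)
```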

## Potential Improvements

* Considering tri-grams & bi-grams
* Stemming
* Synonyms
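
To illustrate the first of these, here is a minimal sketch of extracting bi-grams and tri-grams from a token list in plain Python. The `ngrams` helper is a hypothetical name introduced for this example. Stemming and synonym handling would typically rely on an external resource (for instance a stemmer from a library such as NLTK), so they are not sketched here.

```python
def ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example sentence, invented for illustration
tokens = "latent semantic analysis of conference abstracts".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

Each n-gram could then be treated as an extra "word" and given its own row in the word-abstract matrix, at the cost of a larger and sparser matrix.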