# Exercise/TA session on 25 October 2021

## Key points from comments on how to write an analysis report
* __Submit in PDF__. PDF gives you control of how the delivered content is displayed to the end user. It is a universally accepted document format where your formatting will not be altered by the reviewer's version of MS Word or equivalent software.
* __Structure your report__. For example, given the current assignment with a 4 page max length:
    * Introduction: Explain what the report is about: which task/analysis it covers and how the report is structured.
    * Method: Explain how the task is solved, and for this assignment, be specific on coding steps in order to show that you understand why the various preprocessing and processing steps are taken, and how the underlying basic models work. But do not include code snippets in the text; leave it for the appendix and the .py or .ipynb attachment.
    * Results and discussion: Present the results, and offer your own judgment on model performance: whether the results are reasonable, what matching scores have been achieved, and why the model might behave as it does with the given data. Discuss whether your results or your model are generalizable, and/or for what types of documents or problems the model is well suited. Discuss whether other models might have yielded different results on your data, and how - as in the "opportunities for future research" often seen in academic journal papers.
    * Conclusion: Briefly reiterate what you set out to do (very short, but same meaning as in the Introduction section), how you did it (very short, but same meaning as in the Method section), and what the results were. Close with a very brief note on some of the analysis from the Results and discussion section.
    * References
* __Be clear and concise__ in your choice of words. 
    * Use only words and concepts you fully understand, and create efficient, accurate sentences. 
    * Feel free to mix long and short sentences. Not every sentence has to be as complex as the most advanced NLP model. A concise, to-the-point and accurate language style is more impressive. 
    * Avoid 'I did...' and other personal pronouns. Rather, employ the slightly passive and impersonal language style of (good) research papers. For example: "The preprocessing includes tokenization and removal of stop words and special characters..."/"During preprocessing, the documents are tokenized..."
* __Make sure the report looks good__, typesetting/graphics wise. Take inspiration from (good) research papers in journals, except keep it single column and (for this assignment) skip the abstract.
* __Use__ charts and __figures if helpful__ to explain your analysis, underlying model logic et cetera.
* __Use references where applicable__. Any accepted academic referencing standard is fine.

## Key remarks on the code attachment
* __Use comments extensively__. Even on very simple operations. Show us that you understand what's going on in every line of your code. 
* __Make sure the code attachment is executable__ and bug-free. Deliver it as __.ipynb__ or __.py__.

## A little demonstration of selecting the top N matches based on a pairwise similarity array
### Prerequisite code snippets to get to the point - from previous classes

In [1]:
def preprocessing(corpus): #same as previous sessions
    """Take a list of strings (a corpus of documents), preprocess each one,
    and return the preprocessed corpus as a list of token lists.
    """
    
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    stopWords = stopwords.words('english') #getting the English stop words list from NLTK (all lowercase)
    corpusTokens = [word_tokenize(item) for item in corpus] #splitting each document into tokens
    #keeping alphanumeric tokens that are not stop words, then lowercasing them.
    #note: the stop word check runs on the original casing, so capitalized stop words ("The", "It") are kept
    corpusTokens = [ [word.lower().strip() for word in item if word.isalnum() and word not in stopWords] 
               for item in corpusTokens]
    
    return corpusTokens
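
Because NLTK's English stop word list is all lowercase, the order of lowercasing and filtering matters. The sketch below shows the lowercase-first variant on two toy sentences. It is only an illustration: `STOP_WORDS` is a tiny made-up set, `preprocess_lowercase_first` is a hypothetical name, and plain `str.split` stands in for `word_tokenize` so that the example runs without NLTK.

```python
# Hypothetical, tiny stop word set for illustration only (NLTK's English list is much longer).
STOP_WORDS = {"the", "it", "he", "is", "a", "to"}

def preprocess_lowercase_first(corpus):
    """Lowercase each token before the stop word check, so 'The' and 'the' are both removed."""
    cleaned = []
    for document in corpus:
        tokens = document.split()                     # naive whitespace tokenization (stand-in for word_tokenize)
        tokens = [t.lower().strip() for t in tokens]  # lowercase first ...
        cleaned.append([t for t in tokens if t.isalnum() and t not in STOP_WORDS])  # ... then filter
    return cleaned

example = preprocess_lowercase_first(["The star is bright", "He flew to space"])
print(example)  # [['star', 'bright'], ['flew', 'space']]
```

Switching the notebook function to this order would change the token lists, and therefore the similarity scores shown further down.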

In [2]:
import pandas as pd #same file
#this is the file uploaded on canvas under exercise 11
newsItems = pd.read_csv('../../../../Data Management/AC track data/2021-10-18 News items on space.csv')
newsItems.head()

| Unnamed: 0 | string | date | url | author | source |
|---|---|---|---|---|---|
| 0 | William Shatner becomes the oldest person to r... | 2021-10-13T14:54:44Z | https://www.engadget.com/william-shatner-blue-... | Jon Fingas | {'id': 'engadget', 'name': 'Engadget'} |
| 1 | Jett: The Far Shore Imagines Conscientious Spa... | 2021-10-13T12:00:00Z | https://www.wired.com/story/jett-the-far-shore... | Lewis Gordon | {'id': 'wired', 'name': 'Wired'} |
| 2 | 11 Scary Space Facts That'll Make You Apprecia... | 2021-10-08T12:00:00Z | https://lifehacker.com/11-scary-space-facts-th... | Stephen Johnson | {'id': None, 'name': 'Lifehacker.com'} |
| 3 | UK takes on Elon Musk in the broadband space r... | 2021-10-10T14:01:00Z | https://techncruncher.blogspot.com/2021/10/uk-... | noreply@blogger.com (Unknown) | {'id': None, 'name': 'Blogspot.com'} |
| 4 | Blue Origin postpones William Shatner’s space ... | 2021-10-10T19:12:00Z | https://techncruncher.blogspot.com/2021/10/blu... | noreply@blogger.com (Unknown) | {'id': None, 'name': 'Blogspot.com'} |


In [4]:
corpusTokens = preprocessing(newsItems.string) #preprocessing

In [5]:
corpusNonstop = [' '.join(document) for document in corpusTokens] #joining each token list back into a single string ("sentence")

In [6]:
corpusNonstop[0] #inspecting

'william shatner becomes oldest person reach space it official plenty hype slight delay william shatner become oldest person fly space the star trek icon one four crew members aboard blue origin mission flew altitude 66 miles he it official plenty hype slight delay william shatner become oldest person fly space the star trek icon one four crew members aboard blue origin chars'

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer  #using sklearn package to vectorize 

v_tr = TfidfVectorizer(min_df=1, use_idf=False) #this time only running TF, not IDF; rows are L2-normalized by default (norm='l2')
tfMatrix = v_tr.fit_transform(corpusNonstop)  #quickly creating sparse representations for all documents (BoW-TF)

In [8]:
cosine_sim = tfMatrix @ tfMatrix.T  #matrix product of unit-length rows = pairwise cosine similarity array
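
Why a plain matrix product gives cosine similarities here: `TfidfVectorizer` L2-normalizes each document vector by default, and the dot product of two unit-length vectors is their cosine similarity. A minimal NumPy sketch on made-up counts (the numbers and variable names are assumptions for illustration, not the course data):

```python
import numpy as np

# Toy term-frequency counts: 3 documents x 4 vocabulary terms (made-up numbers).
tf_counts = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])

# L2-normalize each row, mirroring the vectorizer's default norm='l2'.
row_norms = np.linalg.norm(tf_counts, axis=1, keepdims=True)
tf_normalized = tf_counts / row_norms

# Dot products of the unit-length rows ...
dot_products = tf_normalized @ tf_normalized.T

# ... compared with the textbook cosine formula applied to the raw counts.
cosine_explicit = (tf_counts @ tf_counts.T) / (row_norms * row_norms.T)

print(np.allclose(dot_products, cosine_explicit))  # True
print(np.round(dot_products, 2))
```

If the vectorizer were created with `norm=None`, the product would give raw dot products instead, and the explicit division by the norms would be needed.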

In [9]:
import numpy as np

In [10]:
pairwise_array = cosine_sim.toarray() #converting from a sparse matrix to a dense NumPy array
pairwise_array.shape # we have 20 documents, each compared with itself and the 19 others.

(20, 20)

### Now that we have our pairwise comparisons, let's display the best matches for a given document in the batch

In [11]:
for i in range(pairwise_array.shape[0]):
    pairwise_array[i,i] = 0   #we are not interested in matching a document with itself, so the diagonal values are set to zero
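
The same masking can be done without an explicit loop. A small sketch using `np.fill_diagonal` on a made-up 3x3 similarity matrix (`toy_sim` and `masked` are illustrative names, not part of the notebook):

```python
import numpy as np

# Toy 3x3 similarity matrix with ones on the diagonal (made-up values).
toy_sim = np.array([
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
])

masked = toy_sim.copy()       # work on a copy so the original stays intact
np.fill_diagonal(masked, 0)   # sets the diagonal in place; the call itself returns None

print(masked)
```

Zero works as a mask here because TF vectors are non-negative, so no similarity falls below zero. If a document could have zero similarity with several others, filling the diagonal with a negative value such as -1 would keep the self-match from tying with them.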

In [64]:
from random import random
random_select = int(random()*20) #generating a random integer from 0 to 19, i.e. a valid document index

n_matches = 5    #choosing number of matches to return

topNmatches = list(np.argsort(pairwise_array[random_select])[-n_matches:]) #argsort sorts ascending, so the last n_matches indices are the top matches for the random_select sample
topNmatches.reverse()  #reversing the list so that the best match comes first

print('text results from cosine similarity between BoW-TF representations') #printing results with basic print statements for quick analysis
print('------------------------------------------------------------------')

print('random source: index',random_select)
print('top '+str(n_matches)+' matches:')
print(topNmatches,'\n\n\n')
print('Source and results, original documents\n')
print('random source:\n---------------')
print(newsItems.string[random_select],'\n---------------')
print('\n\nresults:\n---------------')
for num,match in enumerate(topNmatches):
    print('( match ',num+1,')')
    print('( score:',np.round(pairwise_array[random_select,match],2),')')
    print(newsItems.string[match],'\n---------------\n')
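
The selection step in the cell above can also be wrapped in a small reusable function. This is a sketch under assumptions: `top_n_matches` is a hypothetical helper, and the 4x4 matrix is made up, with the diagonal already zeroed.

```python
import numpy as np

def top_n_matches(similarity, source_index, n):
    """Return (index, score) pairs for the n highest-scoring matches of one row, best first."""
    row = similarity[source_index]
    order = np.argsort(row)[::-1][:n]   # argsort is ascending, so reverse and keep the first n
    return [(int(j), float(row[j])) for j in order]

# Toy pairwise similarity array (made-up values, diagonal already set to zero).
toy_sim = np.array([
    [0.0, 0.2, 0.5, 0.1],
    [0.2, 0.0, 0.1, 0.7],
    [0.5, 0.1, 0.0, 0.3],
    [0.1, 0.7, 0.3, 0.0],
])

matches = top_n_matches(toy_sim, source_index=0, n=2)
print(matches)  # [(2, 0.5), (1, 0.2)]
```

Returning the scores together with the indices avoids a second lookup in the array when printing the results.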
    

text results from cosine similarity between BoW-TF representations
------------------------------------------------------------------
random source: index 4
top 5 matches:
[16, 6, 5, 17, 0] 



Source and results, original documents

random source:
---------------
Blue Origin postpones William Shatner’s space flight by a day: William Shatner is heading to space on October 13th | Photo by Axelle/Bauer-Griffin/FilmMagic
Jeff Bezos’ spaceflight company Blue Origin said Sunday it will postpone the flight that is slated to fly William Shatner to space due to forecasted high winds at i…: William Shatner is heading to space on October 13th | Photo by Axelle/Bauer-Griffin/FilmMagic
Jeff Bezos’ spaceflight company Blue Origin said Sunday it will postpone the flight that is slated to fl… [+736 chars] 
---------------


results:
---------------
( match  1 )
( score: 0.57 )
Winds delay Blue Origin's space launch with Shatner - Reuters: Jeff Bezos' space company Blue Origin said on Sunday it had pu