By now you should be fairly confident working with different types of data. Next we are going to explore building our own data processing pipelines for dealing with more complex language data.

You might want to remind yourself of Pythagoras and shortest distance calculations before we get started. 

[Euclidean distance (Pythagoras and shortest distance calculations)](https://en.wikipedia.org/wiki/Euclidean_distance#:~:text=In%20mathematics%2C%20the%20Euclidean%20distance,occasionally%20called%20the%20Pythagorean%20distance.)

Let's calculate this using our familiar NumPy.

`np.linalg` is NumPy's linear algebra module, and `norm` computes a matrix or vector norm.

[NumPy linalg docs](https://numpy.org/doc/stable/reference/routines.linalg.html)

`import numpy as np`

`a = np.array((1, 2, 3)) `

`b = np.array((1, 1, 2)) `

`dist = np.linalg.norm(a - b) `

`print("our euclidean distance is: " + str(dist))`
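To see what `np.linalg.norm` is doing, here is a minimal sketch that computes the same distance by hand using Pythagoras. The names `diff` and `manual` are ours, chosen for illustration only.

```python
import numpy as np

a = np.array((1, 2, 3))
b = np.array((1, 1, 2))

# Pythagoras in three dimensions: square the differences,
# add them up, then take the square root
diff = a - b
manual = np.sqrt(np.sum(diff ** 2))

# np.linalg.norm applies the same formula (the L2 norm is its default for vectors)
dist = np.linalg.norm(a - b)

print(manual, dist)  # both equal the square root of 2
```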

### We are cheating a bit here and not concentrating on minimising the average of the squared distances between observed and estimated values.

Scikit Learn offers a convenient alternative: its `euclidean_distances` function computes the distance between every pair of rows in a single call. It also gives us access to more comprehensive tooling than NumPy alone, which helps when we want to handle more complex mappings.

### Scikit Learn version
[sklearn euclidean distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html)

`from sklearn.metrics.pairwise import euclidean_distances`

`a = [[1, 2, 3], [1, 1, 2]]`

`euclidean_distances(a, a)`
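Because the same list is passed as both arguments, the result is a 2 x 2 matrix comparing every row with every row. The sketch below checks that against the NumPy calculation; `pairwise` is our own variable name.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

a = [[1, 2, 3], [1, 1, 2]]

# Entry [i, j] holds the distance between row i and row j
pairwise = euclidean_distances(a, a)
print(pairwise)

# The off-diagonal entry should agree with the NumPy version of the same distance
numpy_dist = np.linalg.norm(np.array(a[0]) - np.array(a[1]))
print(np.isclose(pairwise[0, 1], numpy_dist))
```

The diagonal is zero because each row is being compared with itself.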

### Pretty straightforward, right?

Now it's time to work independently and do some research. Create a new Jupyter Notebook on your system and write some code that prints out the distance matrix for a set of news articles of your choosing. The following library should make this easier:

[Scipy spatial distance matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html)
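Before trying this on text, it may help to see `distance_matrix` on a few hand-made vectors. The three points below are made up purely for illustration.

```python
import numpy as np
from scipy.spatial import distance_matrix

# Three made-up points in 2D (illustrative values only)
points = np.array([[0, 0], [3, 4], [6, 8]])

# Entry [i, j] is the Euclidean distance between points[i] and points[j]
dm = distance_matrix(points, points)
print(dm)
```

Once each article has been turned into a vector of numbers, the same call works on the articles.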

Python makes a lot of this work trivial – even if you are not familiar with the mechanics.

If you really want to stretch yourself, you might try exploring other approaches such as Jaccard similarity and cosine similarity. If you are feeling very confident, you may want to implement TF-IDF as well!
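As a starting point for that stretch task, here is a rough sketch of all three ideas on two made-up sentences. The sentences and the helper `jaccard_similarity` are our own assumptions; `TfidfVectorizer` and `cosine_similarity` are the scikit-learn tools for the other two.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up sentences standing in for articles (illustrative only)
docs = [
    "fuel prices fall as transport fares are cut",
    "transport fares are cut after fuel prices fall sharply",
]

def jaccard_similarity(text_a, text_b):
    """Shared vocabulary size divided by combined vocabulary size."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    return len(words_a & words_b) / len(words_a | words_b)

# TF-IDF weights a word by how often it appears in a document,
# scaled down when the word appears in many documents
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity compares the direction of two vectors and ignores their length
cosine = cosine_similarity(tfidf)[0, 1]
jaccard = jaccard_similarity(docs[0], docs[1])

print("jaccard:", jaccard)
print("cosine on TF-IDF:", cosine)
```

Note that these are similarities, so higher means closer, which is the opposite way round to a distance. Cosine similarity also ignores vector length, which can matter when articles differ a lot in length.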

Test out the program with different sets of news articles. Can you find or create your own dataset?

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance_matrix
from scipy.spatial.distance import pdist, squareform

import nltk
from nltk.stem import PorterStemmer

import re
import numpy as np
import pandas as pd

In [4]:
# READ IN DATA

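# 'mbcs' is only available on Windows; on other platforms use the file's actual encoding instead
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv("Data/Articles.csv", encoding='mbcs', keep_default_na=False, skipinitialspace=True)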

# CLEAN TEXT AND CREATE CORPUS LIST
def scrub_words(text):
    """Basic cleaning of texts."""
    
    # remove html markup
    text=re.sub("(<.*?>)","",text)
    
    # replace non-word characters and digits with spaces
    text=re.sub("(\\W|\\d)"," ",text)
    
    # remove leading and trailing whitespace
    text=text.strip()
    return text
corpus = []
# select the first 10 articles to work with from the dataset
for i in range(10):
    corpus.append(scrub_words(df.iloc[i,0]))

# DEFINE PREPROCESSOR USING PORTER STEMMER
porter_stemmer=PorterStemmer()


def my_preprocessor(text):
    text=text.lower() # lowercase text (done by default if you don't use a custom preprocessor)
    text=re.sub("\\W"," ",text) # remove special chars
    text=re.sub("\\s+(in|the|all|for|and|on)\\s+"," _connector_ ",text) # normalize certain words
    
    # stem words
    words=re.split("\\s+",text)
    stemmed_words=[porter_stemmer.stem(word=word) for word in words]
    return ' '.join(stemmed_words)

# TRANSFORM TEXT TO VECTOR DATA WITH COUNTVECTORIZER
# (the corpus is passed to fit_transform, not to the constructor)
cv = CountVectorizer(preprocessor=my_preprocessor)
count_vector=cv.fit_transform(corpus).toarray()

# GET DISTANCE MATRIX USING THE SCIPY DISTANCE_MATRIX FUNCTION
dm = distance_matrix(count_vector,count_vector)
pd.DataFrame(dm)

print(corpus[:2])



['KARACHI  The Sindh government has decided to bring down public transport fares by   per cent due to massive reduction in petroleum product prices by the federal government  Geo News reported Sources said reduction in fares will be applicable on public transport  rickshaw  taxi and other means of traveling Meanwhile  Karachi Transport Ittehad  KTI  has refused to abide by the government decision KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country  adding that   pc vehicles run on Compressed Natural Gas  CNG   Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made', 'HONG KONG  Asian markets started      on an upswing in limited trading on Friday  with mainland Chinese stocks surging in Hong Kong on speculation Beijing may ease monetary policy to boost slowing growth Hong Kong rose      percent  closing        points higher at          Seoul closed up      percent  rising    

From the distance matrix we can see that articles 0 and 8 are the most closely related and articles 1 and 2 are the furthest from each other. However, when we actually look at these articles, 1 and 2 have similar headings and topics, so this model might need more work.
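Reading the closest and furthest pairs off the matrix by eye gets harder as the corpus grows. Here is a minimal sketch of doing it in code, using a small made-up matrix rather than the real `dm`. The values and the helper name `extreme_pairs` are hypothetical.

```python
import numpy as np

def extreme_pairs(dm):
    """Return the (row, column) index pairs with the smallest and largest distance."""
    dm = np.asarray(dm, dtype=float)
    # Look only above the diagonal: the diagonal is all zeros and the
    # matrix is symmetric, so each pair of documents appears once there
    rows, cols = np.triu_indices(dm.shape[0], k=1)
    values = dm[rows, cols]
    lo, hi = values.argmin(), values.argmax()
    closest = (int(rows[lo]), int(cols[lo]))
    furthest = (int(rows[hi]), int(cols[hi]))
    return closest, furthest

# Made-up symmetric distance matrix for three documents
toy = [[0.0, 2.0, 5.0],
       [2.0, 0.0, 4.0],
       [5.0, 4.0, 0.0]]

closest, furthest = extreme_pairs(toy)
print(closest, furthest)
```

Calling `extreme_pairs(dm)` on the matrix from the cell above should give the same kind of answer for the articles.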

In [14]:
print('Article 0 Heading: ', df.iloc[0,2], '\n', 'Article 8 Heading: ', df.iloc[8,2])
print('Article 1 Heading: ', df.iloc[1,2], 'Article 2 Heading: ', df.iloc[2,2])

Article 0 Heading:  sindh govt decides to cut public transport fares by 7pc kti rej 
 Article 8 Heading:  sugar prices drop to rs 49.80 in sind
Article 1 Heading:  asia stocks up in new year trad Article 2 Heading:  hong kong stocks open 0.66 percent lower
