### Challenge
Repeat the exercise above with your own text. You can use something you have written, excerpts from your favorite books, or an article from a news site. Create a text file and read it in as a list of three strings (documents). Calculate the tf-idf vector for each document, then compute the cosine similarity between each pair. Judging by the scores, were your documents very different, very similar, or somewhere in between?
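If your text lives in a local file rather than at a URL, reading it might look like the minimal sketch below. The file name `my_corpus.txt`, the sample sentences, and the `;` separator are assumptions chosen for illustration; the sketch writes a temporary file first only so that it runs on its own.

```python
from pathlib import Path
import tempfile

# Hypothetical sample text: three short documents separated by ';'
sample_text = (
    "Cats sleep most of the day;"
    "Dogs enjoy long walks in the park;"
    "Stock markets fell sharply on Monday"
)

# Write the sample to a temporary file so the example is self-contained
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "my_corpus.txt"
path.write_text(sample_text, encoding="utf-8")

# Read the file back and split it into a list of documents,
# dropping surrounding whitespace and any empty pieces
raw = path.read_text(encoding="utf-8")
my_corpus = [doc.strip() for doc in raw.split(";") if doc.strip()]

print(len(my_corpus))
```

With your own file, you would skip the writing step and point `path` at the file you created.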

In [2]:
# Create the corpus (text is available in a github repo)

# Import module, open and read file
from urllib.request import urlopen

# The text holds three documents separated by ';' (consecutive passages from an Einstein biography)
link = 'https://raw.githubusercontent.com/marianvinas/DS-Unit-4-Sprint-1-NLP/main/assignment/einstein.txt'
# Download the file and decode the bytes to a string
with urlopen(link) as f:
    mystring = f.read().decode('utf-8')

# Split on ';' to get one string per document
corpus = mystring.split(';')

# Print out the first 300 characters for each document
for i in [0, 1, 2]:
    print('Document:', i)
    print(corpus[i][0:300])

Document: 0
Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879. Six weeks later the family moved to Munich, where he later on began his schooling at the Luitpold Gymnasium.
Document: 1
 Later, they moved to Italy and Albert continued his education at Aarau, Switzerland and in 1896 he entered the Swiss Federal Polytechnic School in Zurich to be trained as a teacher in physics and mathematics.
Document: 2
 In 1901, the year he gained his diploma, he acquired Swiss citizenship and, as he was unable to find a teaching post, he accepted a position as technical assistant in the Swiss Patent Office. In 1905 he obtained his doctor’s degree.


In [3]:
# Create the vectors for each document
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

# Create a vocabulary and get tfidf values per document
dtm = tfidf.fit_transform(corpus)

In [4]:
# Imports
import pandas as pd

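# Convert the sparse matrix to a DataFrame, using the vocabulary as column headers
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())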

# View the feature matrix as a DataFrame
dtm.head()

document,14,1879,1896,1901,1905,aarau,accepted,acquired,albert,assistant,...,teacher,teaching,technical,trained,ulm,unable,weeks,württemberg,year,zurich
0,0.232682,0.232682,0.0,0.0,0.0,0.0,0.0,0.0,0.17696,0.0,...,0.0,0.0,0.0,0.0,0.232682,0.0,0.232682,0.232682,0.0,0.0
1,0.0,0.0,0.240329,0.0,0.0,0.240329,0.0,0.0,0.182776,0.0,...,0.240329,0.0,0.0,0.240329,0.0,0.0,0.0,0.0,0.0,0.240329
2,0.0,0.0,0.0,0.216607,0.216607,0.0,0.216607,0.216607,0.0,0.216607,...,0.0,0.216607,0.216607,0.0,0.0,0.216607,0.0,0.0,0.216607,0.0


In [5]:
# Find the cosine similarity of tf-idf vectors
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(dtm)

# Turn it into a DataFrame
cosine_sim = pd.DataFrame(cosine_sim)
display(cosine_sim)

document,0,1,2
0,1.0,0.129377,0.0
1,0.129377,1.0,0.060219
2,0.0,0.060219,1.0
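
To see where these numbers come from, the sketch below recomputes cosine similarity from its definition (dot product divided by the product of the vector lengths) and compares it with `cosine_similarity`. It uses a small made-up `toy_corpus`, not the Einstein text, so the values differ from the table above. Because tf-idf weights are never negative, a score of 0.0, as between documents 0 and 2 above, means the two documents share no terms once stop words are removed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: the first two documents share terms, the third does not
toy_corpus = [
    "einstein studied physics in zurich",
    "einstein taught physics in berlin",
    "the patent office was in bern",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(toy_corpus).toarray()

# With the default norm='l2', every non-empty row has unit length
row_norms = np.linalg.norm(X, axis=1)

# Cosine similarity from the definition: dot product / product of norms
manual = (X @ X.T) / np.outer(row_norms, row_norms)

# Library result for comparison
library = cosine_similarity(X)
print(np.allclose(manual, library))
```

Since the rows already have unit length, the division changes nothing here: for tf-idf vectors built with the default settings, the plain dot product `X @ X.T` gives the same matrix.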
