## First Part: Notes on gensim, bs4, and Word2Vec
Link: [GitHub session notes covering the core libraries](https://github.com/ual-cci/MSc-Coding-2/blob/master/Week-4.md)

- tip: use help() and dir() to read an object's documentation and list its attributes
- urllib: https://pythonspot.com/urllib-tutorial-python-3/
- bs4: parsing html; use it when grabbing webpage data
- matplotlib
- numpy
- gensim
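
As a quick illustration of the `help()` and `dir()` tip, the sketch below inspects the standard-library `json` module. The choice of `json` is only an example; the same calls work on any module, class, or object.

```python
import json

# dir() lists the attribute names an object exposes.
public_names = [name for name in dir(json) if not name.startswith("_")]
print(public_names)

# help() prints the documentation built from the object's docstring.
help(json.dumps)
```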

### Gensim
- a general-purpose topic modelling and natural language processing (NLP) library based on machine learning
- auto-summarisation, sentiment analysis, word-vectors, and so on
- https://radimrehurek.com/gensim/

#### Word2Vec
- Word2Vec uses a shallow neural network to convert words into corresponding vectors
- in such a way that semantically similar words have vectors that are close to each other in N-dimensional space,
- where N is the number of dimensions of each vector.
- Word2Vec vectors preserve semantic relations. Example: King - Man + Woman ≈ Queen
- More info in [this tutorial](https://stackabuse.com/implementing-word2vec-with-gensim-library-in-python/)
- Pros:
    - retains the semantic meaning of different words in a document.
    - the embedding vectors are small compared with sparse representations such as bag of words
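
The King - Man + Woman analogy can be sketched with hand-made vectors. The three dimensions below (royalty, gender, fruit) and every value are hypothetical and chosen so the arithmetic is easy to follow; real Word2Vec vectors are learned from text and are not interpretable like this.

```python
import math

# Hypothetical 3-dimensional embeddings: (royalty, gender, fruit).
vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the query words themselves when searching for the nearest neighbour.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy(positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy values, king - man + woman lands exactly on the vector for "queen"; with learned embeddings the match is only approximate.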
    
#### NLTK Tokenize
- NLTK Tokenize: Words and Sentences Tokenizer
- Tokenization is the process of dividing a large quantity of text into smaller parts called tokens. Tokens are useful for finding patterns and are the basic first step before stemming and lemmatization.
- Natural language processing is used for building applications such as text classification, chatbots, sentiment analysis, and language translation.
- related example and tutorial: [link here](https://www.guru99.com/tokenize-words-sentences-nltk.html)
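
To make the idea of tokens concrete, here is a minimal sketch that uses only the standard-library `re` module. It is a deliberate simplification and not NLTK's algorithm: `nltk.sent_tokenize` and `nltk.word_tokenize` handle abbreviations, contractions, and punctuation far more carefully. The sample text is made up.

```python
import re

text = "Tokenization splits text into tokens. Tokens feed later steps, such as stemming!"

# Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Naive word split: keep runs of letters, digits, or apostrophes.
words = [re.findall(r"[A-Za-z0-9']+", sentence) for sentence in sentences]

print(sentences)
print(words)
```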


## Second part - The Exercise
- Build a simple webscraper that scrapes a set of documents from the internet and summarises them using gensim.
- If you manage to achieve this, extract keywords from all the different documents and see if any are more popular than others.
- Search for documents that contain those keywords using Python and then summarise those documents too.
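
One possible way to approach the keyword-popularity step is to count how many documents each keyword appears in. The sketch below assumes the keywords have already been extracted; the document names and keyword lists are hypothetical placeholders, not scraped results.

```python
from collections import Counter

# Hypothetical keyword lists, one per scraped document.
keywords_per_document = {
    "doc_a": ["learning", "data", "model"],
    "doc_b": ["data", "network", "model"],
    "doc_c": ["data", "training"],
}

# Count in how many documents each keyword appears (set() avoids double counting).
document_frequency = Counter()
for keywords in keywords_per_document.values():
    document_frequency.update(set(keywords))

most_popular = document_frequency.most_common(2)
print(most_popular)

# Find the documents that contain the most popular keyword.
top_keyword = most_popular[0][0]
matching = [name for name, kws in keywords_per_document.items() if top_keyword in kws]
print(matching)
```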

In [31]:
import bs4 as bs
import urllib.request
import re
import nltk

# Use urllib's urlopen to fetch the webpage and read the raw HTML.
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Machine_learning')
article = scraped_data.read()

#create a bs object to parse the data.
#remember to install lxml beforehand. It's used for parsing XML and HTML
parsed_article = bs.BeautifulSoup(article,'lxml')

# Text content is normally wrapped in <p> tags in HTML, so use BeautifulSoup's find_all() to collect the <p> elements.
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text
#print(article_text) #we get an article now.

# Summarize the text.
# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0.
from gensim.summarization import summarize
mySummary = summarize(article_text, word_count=150)
print("Summary")
print(mySummary)

Summary
Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation.[19]:488 By 1980, expert systems had come to dominate AI, and statistics was out of favor.[20] Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval.[19]:708–710; 755 Neural networks research had been abandoned by AI and computer science around the same time.
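
Because `gensim.summarization` is not available in gensim 4.0 and later, a rough stand-in can help when experimenting. The sketch below is a simple frequency-based extractive summarizer built on the standard library. It is not gensim's TextRank algorithm, and the function name `summarize_by_frequency` and the sample text are made up for illustration.

```python
import re
from collections import Counter

def summarize_by_frequency(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    word_counts = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average corpus frequency of the words in this sentence.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

sample = (
    "Machine learning builds models from data. "
    "Models from data make predictions. "
    "The weather was pleasant yesterday."
)
summary = summarize_by_frequency(sample, max_sentences=2)
print(summary)
```

Sentences that share frequent words score higher, so the unrelated weather sentence is dropped. Removing stop words before scoring would be a sensible refinement for real articles.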


In [30]:
#Find similar word using nltk tokenization and Word2Vec model


# Clean the text: remove extra spaces and special characters,
# and make all the text lower case,
# so that it contains only words, which is convenient for Word2Vec.
# Use re.sub() to replace the matches with a space.
# Note: this also removes sentence punctuation, so sent_tokenize() below
# will likely return the whole article as a single sentence.
processed = article_text.lower()
processed = re.sub(r'[^a-z]', ' ', processed)  # [^a-z] matches any character that is NOT a lowercase letter
processed = re.sub(r'\s+', ' ', processed)  # \s+ matches one or more whitespace characters; this collapses extra spaces

#use nltk.sent_tokenize to convert our article into sentences.
all_sentences = nltk.sent_tokenize(processed)
# print(all_sentences)

#To convert sentences into words,  use nltk.word_tokenize.
all_words = []
for sent in all_sentences:
    all_words.append(nltk.word_tokenize(sent))
# one line code: all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
# Now we have a list of sentences, where each sentence is a list of word strings.

# But there are many uninformative words, such as "is", "the", "by".
# These words are called "stop words" and need to be removed.
# More info here: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# Removing stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of on every lookup
all_words = [[w for w in sent if w not in stop_words] for sent in all_words]
    

# Create the Word2Vec model.
from gensim.models import Word2Vec
# min_count=2 keeps only the words that appear at least twice in the corpus.
word2vec = Word2Vec(all_words, min_count=2)

# the Word2Vec model converts words to their corresponding vectors. 
# v1 is the vector of the word artificial
v1 = word2vec.wv['artificial']
# Find the words most similar to the word "artificial".
sim_words = word2vec.wv.most_similar('artificial')
print("Find Similar Word of 'artificial'")
print(sim_words)

Find Similar Word of 'artificial'
[('training', 0.4113386869430542), ('rules', 0.39969879388809204), ('methods', 0.3768565058708191), ('program', 0.37293824553489685), ('intelligence', 0.3681725859642029), ('trained', 0.36725664138793945), ('hiring', 0.36059659719467163), ('difficult', 0.3424645960330963), ('inference', 0.3359124958515167), ('field', 0.3338709771633148)]
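
Word2Vec learns from the words that share a sentence, so it can help to keep sentence boundaries intact during cleanup. One possible ordering, sketched below with the standard-library `re` module as a stand-in for the NLTK tokenizers, is to split into sentences first and clean each sentence afterwards. The sample text is made up.

```python
import re

raw = "Machine learning (ML) is a field of study. It builds models from data!"

# Split into sentences first, while the punctuation is still present.
sentences = re.split(r"(?<=[.!?])\s+", raw.strip())

cleaned = []
for sentence in sentences:
    s = re.sub(r"[^a-z]", " ", sentence.lower())  # keep lowercase letters only
    s = re.sub(r"\s+", " ", s).strip()            # collapse repeated spaces
    cleaned.append(s.split())

print(cleaned)
```

The result is a list of token lists, one per sentence, which is the shape `Word2Vec` expects as input.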


In [60]:

# Fetch the category page and parse it.
page = urllib.request.urlopen('https://en.wikipedia.org/wiki/Category:Machine_learning').read()
soup = bs.BeautifulSoup(page, 'lxml')  # name the parser explicitly, as in the first cell
# soup = BeautifulSoup(page.content, 'html.parser')
# just get all the links. Links are 'a' (as in <a href = "">)

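# find_all() returns a ResultSet (a list of tags), so find() must be called on each item.
# Calling lists.find('a') on the whole ResultSet raised:
# AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
lists = soup.find_all('li')
lists_link = [li.find('a') for li in lists if li.find('a') is not None]
print(lists_link)
# for link in soup.find_all('a'):
#     print(link)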
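
The idea of collecting one link per list item can be tried offline without a network request. The sketch below uses the standard-library `html.parser` as a stand-in for BeautifulSoup; the `ListLinkParser` class and the HTML snippet are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

class ListLinkParser(HTMLParser):
    """Collect the href of every <a> that appears inside an <li>."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1

html = """
<ul>
  <li><a href="/wiki/Deep_learning">Deep learning</a></li>
  <li>No link here</li>
  <li><a href="/wiki/Data_mining">Data mining</a></li>
</ul>
<a href="/wiki/Main_Page">Outside the list</a>
"""

parser = ListLinkParser()
parser.feed(html)
print(parser.links)
```

List items without a link are skipped, and links outside any `<li>` are ignored.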