# Wiki scraper report

### Author:
Łukasz Andryszewski 151930

## Outline

The project consisted of:
- Creating a database of 1000 Wikipedia articles
- Scraping and lemmatizing them
- Creating a recommender system

## Creating the database

The database should cover a wide variety of topics, so the articles should be chosen at random. This could be achieved by generating 1000 random words and searching for each of them using:
- https://en.wikipedia.org/w/index.php?search=INPUT_TEXT

Thankfully, Wikipedia and Fandom offer special pages, one of which redirects to a random article:
- https://en.wikipedia.org/wiki/Special:Random

To retrieve the data the following parser was created:

In [None]:
import requests
import bs4

def get_title_and_text(url='https://en.wikipedia.org/wiki/Special:Random'):
    # Special:Random redirects, so response.url holds the final article URL
    response = requests.get(url)
    parsed = bs4.BeautifulSoup(response.text, features="lxml")
    # Concatenate the text of every <p> element into one string
    output = "".join(p.getText() for p in parsed.select('p'))
    return parsed.title.string, response.url, output

Using a GET request, it loads the HTML page and joins all paragraphs into one long string. To keep the data interesting, any article containing fewer than 2000 characters was replaced with another random article. This excludes bland articles like:
- https://en.wikipedia.org/wiki/1296_in_poetry

The resulting database is stored as a simple CSV file.
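
The collection and filtering step can be sketched as a small loop. This is an illustrative sketch, not the project's actual code: `collect_articles`, `save_articles`, `fetch`, `min_chars` and the CSV column names are hypothetical, and `fetch` stands in for any function returning a `(title, url, text)` tuple, such as `get_title_and_text`. Skipping articles that were already drawn is also an assumption.

```python
import csv

def collect_articles(fetch, n=1000, min_chars=2000):
    """Call fetch() until n distinct articles with at least min_chars characters are found."""
    articles = {}
    while len(articles) < n:
        title, url, text = fetch()
        # Skip short articles and articles that were already drawn
        if len(text) >= min_chars and url not in articles:
            articles[url] = (title, url, text)
    return list(articles.values())

def save_articles(articles, path):
    """Write the articles to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url", "text"])
        writer.writerows(articles)
```

With the parser above, a call such as `save_articles(collect_articles(get_title_and_text), "articles.csv")` would build the file; the file name here is only a placeholder.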

## Lemmatizing

To reduce the size of the data and remove uninformative words, the text is lemmatized and stopwords are filtered out:

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

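def word_lemmatizer(string, method=None):
    # method is unused; it is optional so that word_lemmatizer(text) works
    stops = set(stopwords.words('english'))
    lemmatize = WordNetLemmatizer().lemmatize
    # Lemmatize each token, then drop the stopwords
    return [w for w in map(lemmatize, word_tokenize(string)) if w not in stops]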

A lemmatizer is used instead of a stemmer because a stemmer can cut words down to stems that are not real words, which makes the processed articles harder to interpret.

Using these methods, 1000 articles were downloaded and saved to a CSV file along with their URLs.

## Recommender System

To recommend articles, a similarity measure must be established between them. The articles are first scored using TF-IDF. The score is calculated as follows:

$$ tfidf(t,d,D) = tf(t,d) \cdot idf(t,D) $$

$ tf(t,d) $ - term frequency in a given document

$ idf(t,D) $ - inverse document frequency, i.e. how much information the term provides within the domain of documents

$t$ - term, $d$ - document, $D$ - domain of documents
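
As a concrete instance, sklearn's `TfidfVectorizer` with `smooth_idf=False` and otherwise default settings documents the two factors as follows. The symbol $n_t$ is introduced here for illustration and denotes the number of documents in $D$ that contain $t$:

$$ tf(t,d) = \text{number of occurrences of } t \text{ in } d $$

$$ idf(t,D) = \ln\frac{|D|}{n_t} + 1 $$

For example, a term that appears in 10 out of 1000 documents gets $ idf = \ln(100) + 1 \approx 5.61 $, while a term that appears in every document gets $ idf = \ln(1) + 1 = 1 $. By default each document vector is then scaled to unit Euclidean length.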

The vectorizer used comes from the sklearn library:

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf=TfidfVectorizer(use_idf=True, smooth_idf=False)

The similarity between documents is calculated using the cosine distance on the normalized TF-IDF values. The distance is then subtracted from 1, which turns it into a similarity measure.
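
The conversion from distance to similarity can be illustrated on a few made-up vectors. This is a sketch in plain NumPy; the helper names `cosine_distance` and `cosine_similarity` are hypothetical and only mirror the definition given above.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_similarity(u, v):
    """Similarity obtained by subtracting the distance from 1."""
    return 1.0 - cosine_distance(u, v)

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

same = cosine_similarity(a, b)        # vectors pointing in the same direction
orthogonal = cosine_similarity(a, c)  # vectors with no shared terms
```

Since TF-IDF values are non-negative, the similarity between two articles falls between 0 (no shared terms) and 1 (identical direction).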

The recommender:
1. downloads the text of the query article
2. lemmatizes the article
3. transforms the query article using the fitted TF-IDF vectorizer
4. measures the similarity to each article in the database
5. returns the n most similar articles from the database, along with their similarity scores

In [None]:
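# Assumption: cosine is scipy's cosine distance; the original cell does not import it
from scipy.spatial.distance import cosine

def recommend_articles(df, tfidf, queries, n=3):
    # df is expected to hold one TF-IDF row per database article; queries are article URLs
    recommended = {}

    for query in queries:
        _, url, text = get_title_and_text(query)
        q_lemmas = " ".join(word_lemmatizer(text))
        q_array = tfidf.transform([q_lemmas]).toarray()[0]
        # Sort cosine distances ascending, then convert them to similarities
        values = 1 - df.apply(lambda x: cosine(x, q_array), axis=1).sort_values()
        recommended[query] = values[:n]

    return recommended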

## Recommendation explanation