Wikipedia Recommendation system

-----

Authors:
- Martyna Stasiak id.156071

The aim of this project is to recommend Wikipedia articles based on the ones that the user has liked. <br>
To do that, we used 10 000 initial articles obtained by web crawling, starting from the https://en.wikipedia.org/wiki/Machine_learning article. They are saved in a CSV file, so the file that serves as our database can be replaced if needed.

------

The following libraries are required for this project:

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, wordpunct_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Crawling and saving our articles

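In this part we create the file that will work as our database containing all candidate Wikipedia articles. <br>
We perform the crawling breadth-first: starting from the given URL, we keep a queue of pages to visit and a set of pages already visited. For each page we download the HTML, take the title from the first `h1` element and the content from all `p` elements, and then append every internal `/wiki/` link that contains neither `:` nor `#` to the end of the queue. The crawl stops when the requested number of articles has been collected or the queue is empty. Pages that fail to download are skipped, and we wait 0.5 s after each successfully processed page. <br>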
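
The traversal order and the link filter can be tried out without network access. The snippet below is a minimal sketch, assuming a hypothetical in-memory dictionary `FAKE_WIKI` in place of real HTTP requests; `is_article_link` and `crawl_order` are illustrative names and are not part of this notebook.

```python
from collections import deque

PREFIX = "https://en.wikipedia.org"

# Hypothetical stand-in for the web: page URL -> hrefs found on that page.
FAKE_WIKI = {
    PREFIX + "/wiki/A": ["/wiki/B", "/wiki/Help:Contents", "/wiki/C#History", "/wiki/C"],
    PREFIX + "/wiki/B": ["/wiki/A", "/wiki/D", "https://example.org/"],
    PREFIX + "/wiki/C": ["/wiki/D"],
    PREFIX + "/wiki/D": [],
}

def is_article_link(href):
    # Same rule as the crawler: internal /wiki/ links only, skipping
    # namespaced pages (':' as in Help:) and section anchors ('#').
    return href.startswith("/wiki/") and ":" not in href and "#" not in href

def crawl_order(start_url, max_articles):
    visited = set()
    to_visit = deque([start_url])  # FIFO queue gives breadth-first order
    order = []
    while to_visit and len(order) < max_articles:
        page = to_visit.popleft()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for href in FAKE_WIKI.get(page, []):
            if is_article_link(href):
                full_url = PREFIX + href
                if full_url not in visited:
                    to_visit.append(full_url)
    return order

order = crawl_order(PREFIX + "/wiki/A", 3)
print(order)
```

Pages closest to the start article are collected first, which is why the first crawled articles are all directly linked from the start page.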


In [2]:
def crawlArticles(start_url, max_articles):
    visited = set()
    to_visit = [start_url]
    articles = []
    
    while to_visit and len(articles) < max_articles:
        page = to_visit.pop(0)
        if page in visited:
            continue
        visited.add(page)
        
        try:
            response = requests.get(page)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            
            title = soup.find('h1').text # article's title
            paragraphs = soup.find_all('p') # article's paragraphs
            content = ' '.join([p.text for p in paragraphs]) # article's content that is inside paragraphs
            articles.append({"title": title, "link": page, "content": content})
            
            # extracting and filtering new links
            for link in soup.find_all('a', href=True): # we look for all links in the page
                href = link['href']
                if href.startswith('/wiki/') and ':' not in href and '#' not in href:
                    full_url = "https://en.wikipedia.org" + href
                    if full_url not in visited:
                        to_visit.append(full_url)
            time.sleep(0.5) # be polite to Wikipedia
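        except (requests.RequestException, AttributeError):
            # skip pages that fail to download or have no <h1> title
            continue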
        
    return articles

In [3]:
articles = crawlArticles("https://en.wikipedia.org/wiki/Machine_learning", 10)
df = pd.DataFrame(articles)

In [4]:
# def createDatabase(start_url, max_articles=10):
#     articles = crawlArticles(start_url, max_articles)
#     df = pd.DataFrame(articles)
#     # df.to_csv('articles.csv', index=False) no saving yet !!!!!!!!!!!!!!!!!!!!!!!!!!
#     return df

In [5]:
def saveDatabase(df, fileName):
    df.to_csv(fileName, index=False)
    return None

In [6]:
saveDatabase(df, 'articles.csv')

In [7]:
df.head()

| | title | link | content |
|---|---|---|---|
| 0 | Machine learning | https://en.wikipedia.org/wiki/Machine_learning | Machine learning (ML) is a field of study in a... |
| 1 | Main Page | https://en.wikipedia.org/wiki/Main_Page | Jochi (c. 1182 – c. 1225) was a prince in the ... |
| 2 | Machine Learning (journal) | https://en.wikipedia.org/wiki/Machine_Learning... | Machine Learning is a peer-reviewed scientifi... |
| 3 | Statistical learning in language acquisition | https://en.wikipedia.org/wiki/Statistical_lear... | Statistical learning is the ability for humans... |
| 4 | Data mining | https://en.wikipedia.org/wiki/Data_mining | Data mining is the process of extracting and d... |


In [8]:
# # Starting crawling from a Wikipedia article
# start_url = "https://en.wikipedia.org/wiki/Machine_learning"
# articles = crawl_articles(start_url, max_articles=10)

# # Save the articles
# df = pd.DataFrame(articles)
# df.to_csv("wikipedia_articles.csv", index=False)

### Text preprocessing

Now that we have the database containing the articles, we preprocess them. <br>
The steps are:
- lowercasing and tokenization
- deleting the stopwords and non-alphabetic tokens
- stemming (Porter or Lancaster) or lemmatization (WordNet)
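
The cleaning steps can be illustrated on a single sentence. This is a minimal sketch, not the notebook's pipeline: it assumes a regular-expression tokenizer and a tiny hand-written stopword set (`SAMPLE_STOPWORDS`) instead of NLTK's `word_tokenize` and stopword corpus, and it leaves out stemming and lemmatization so that it runs without downloaded NLTK data. The name `clean_text` is hypothetical.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's English list.
SAMPLE_STOPWORDS = {"is", "a", "of", "in", "the", "and"}

def clean_text(text):
    # Lowercase, then split into runs of letters, digits and underscores.
    tokens = re.findall(r"\w+", text.lower())
    # Keep purely alphabetic tokens that are not stopwords.
    terms = [t for t in tokens if t.isalpha() and t not in SAMPLE_STOPWORDS]
    return " ".join(terms)

cleaned = clean_text("Machine learning (ML) is a field of study in artificial intelligence; see section 2.")
print(cleaned)  # machine learning ml field study artificial intelligence see section
```

Punctuation and the number are dropped, and the stopwords "is", "a", "of" and "in" are removed; a stemmer or lemmatizer would then be applied to each remaining term.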

In [9]:
stopWords = set(stopwords.words('english'))
porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

In [10]:
def preprocessArticles(row, tokenizer=word_tokenize, stemmer=None, lemmatizer=None, useLemmatizer=False):
    # `row` is a single DataFrame row (the function is applied with axis=1), not the whole DataFrame
    tokens = tokenizer(row['content'].lower())
    terms = [word for word in tokens if word.isalpha() and word not in stopWords] # keep alphabetic tokens that are not stopwords
    if stemmer:
        # a stemmer, when given, takes precedence over the lemmatizer
        processed = [stemmer.stem(word) for word in terms]
    elif useLemmatizer and lemmatizer:
        processed = [lemmatizer.lemmatize(word) for word in terms]
    else:
        processed = terms
    return ' '.join(processed)
    

In [13]:
# Define preprocessing variations
variations = {
    "porter_stemmer": lambda row: preprocessArticles(row, tokenizer=word_tokenize, stemmer=porter),
    "lancaster_stemmer": lambda row: preprocessArticles(row, tokenizer=word_tokenize, stemmer=lancaster),
    "lemmatization": lambda row: preprocessArticles(row, tokenizer=word_tokenize, lemmatizer=lemmatizer, useLemmatizer=True)
}

# Apply variations without modifying the original DataFrame
results = pd.DataFrame({
    "title": df["title"],
    "original_content": df["content"]
})

for name, preprocess_function in variations.items():
    # Apply each variation to the content column using the original function
    results[name] = df.apply(preprocess_function, axis=1)

# Display first few rows of variations
columns_to_display = ["title", "original_content"] + list(variations.keys())
results[columns_to_display].head()


| | title | original_content | porter_stemmer | lancaster_stemmer | lemmatization |
|---|---|---|---|---|---|
| 0 | Machine learning | Machine learning (ML) is a field of study in a... | machin learn ml field studi artifici intellig ... | machin learn ml field study art intellig conce... | machine learning ml field study artificial int... |
| 1 | Main Page | Jochi (c. 1182 – c. 1225) was a prince in the ... | jochi princ mongol empir month birth mother bö... | joch print mongol empir month bir moth börte c... | jochi prince mongol empire month birth mother ... |
| 2 | Machine Learning (journal) | Machine Learning is a peer-reviewed scientifi... | machin learn scientif journal publish sinc for... | machin learn sci journ publ sint forty edit me... | machine learning scientific journal published ... |
| 3 | Statistical learning in language acquisition | Statistical learning is the ability for humans... | statist learn abil human anim extract statist ... | stat learn abl hum anim extract stat regul wor... | statistical learning ability human animal extr... |
| 4 | Data mining | Data mining is the process of extracting and d... | data mine process extract discov pattern larg ... | dat min process extract discov pattern larg da... | data mining process extracting discovering pat... |


In [15]:
# A stemmer, when given, takes precedence in preprocessArticles, so the stored text is Porter-stemmed
df['processedContent'] = df.apply(lambda row: preprocessArticles(row, tokenizer=word_tokenize, stemmer=porter), axis=1)
saveDatabase(df, 'processed_articles.csv')

columnstoUse = ['title', 'content','processedContent']
df[columnstoUse].head()

| | title | content | processedContent |
|---|---|---|---|
| 0 | Machine learning | Machine learning (ML) is a field of study in a... | machin learn ml field studi artifici intellig ... |
| 1 | Main Page | Jochi (c. 1182 – c. 1225) was a prince in the ... | jochi princ mongol empir month birth mother bö... |
| 2 | Machine Learning (journal) | Machine Learning is a peer-reviewed scientifi... | machin learn scientif journal publish sinc for... |
| 3 | Statistical learning in language acquisition | Statistical learning is the ability for humans... | statist learn abil human anim extract statist ... |
| 4 | Data mining | Data mining is the process of extracting and d... | data mine process extract discov pattern larg ... |


### TF-IDF and Cosine Similarity
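
For reference, the weighting and the similarity score used in this section can be written out as follows. The formulas assume scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2-normalised rows); the symbols $n$, $\mathrm{df}(t)$, $x_i$, $L$ and $s_j$ are introduced here for illustration only.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\left(\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1\right)
$$

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u\rVert\,\lVert v\rVert},
\qquad
s_j = \frac{1}{|L|}\sum_{i \in L} \cos(x_i, x_j)
$$

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in article $d$, $n$ is the number of articles, $\mathrm{df}(t)$ is the number of articles containing $t$, $x_i$ is the TF-IDF vector of article $i$, and $L$ is the set of liked articles. The recommendations are the `top_n` articles $j \notin L$ with the largest mean similarity $s_j$.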

In [9]:
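# Load the preprocessed articles saved in the previous section
df = pd.read_csv("processed_articles.csv")

# TF-IDF Vectorization (empty articles are read back from CSV as NaN, so replace them with '')
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['processedContent'].fillna(''))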

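def recommend_articles(input_titles, top_n=5):
    # Find indices of the input articles
    input_indices = df[df['title'].isin(input_titles)].index
    if len(input_indices) == 0:
        raise ValueError("None of the input titles were found in the database")
    
    # Compute the cosine similarity between each input article and all articles,
    # then average over the input articles to get one score per candidate
    input_vectors = tfidf_matrix[input_indices]
    similarities = cosine_similarity(input_vectors, tfidf_matrix)
    mean_similarity = similarities.mean(axis=0)
    
    # Attach the scores to a copy so that the global DataFrame is not modified
    scored = df.assign(mean_similarity=mean_similarity)
    
    # Exclude the input articles themselves and keep the top_n most similar ones
    candidates = scored.loc[~scored.index.isin(input_indices)]
    top_articles = candidates.nlargest(top_n, 'mean_similarity')
    
    return top_articles[['title', 'link', 'mean_similarity']]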


# Example: Recommend articles based on previously visited ones
visited_titles = ["Machine learning", "Artificial intelligence"]
recommended_articles = recommend_articles(visited_titles, top_n=1)
print(recommended_articles)

                   title                                               link  \
6  Unsupervised learning  https://en.wikipedia.org/wiki/Unsupervised_lea...   

   mean_similarity  
6         0.513629  
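
To make the scoring concrete, the ranking can be reproduced by hand on a toy corpus using only the standard library. This is a sketch under stated assumptions, not the notebook's implementation: `DOCS` is a hypothetical set of already-preprocessed texts, and `tfidf_vectors`, `cosine` and `recommend` are illustrative names. The weighting follows the smoothed IDF and L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-preprocessed texts (title -> text).
DOCS = {
    "Machine learning": "machine learning model data training",
    "Deep learning": "deep learning neural network model training",
    "Data mining": "data mining pattern data database",
    "Mongol Empire": "mongol empire prince history",
}

def tfidf_vectors(docs):
    n = len(docs)
    counts = {title: Counter(text.split()) for title, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts.values() for term in c)
    vectors = {}
    for title, c in counts.items():
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[title] = {t: w / norm for t, w in vec.items()}  # L2-normalise
    return vectors

def cosine(u, v):
    # Vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def recommend(liked, docs, top_n=1):
    vectors = tfidf_vectors(docs)
    # Mean similarity of each candidate to all liked articles.
    scores = {
        title: sum(cosine(vectors[name], vec) for name in liked) / len(liked)
        for title, vec in vectors.items()
        if title not in liked
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

ranking = recommend(["Machine learning"], DOCS, top_n=3)
print(ranking)
```

With "Machine learning" as the only liked article, "Deep learning" ranks first because it shares three terms with it, "Data mining" second because it shares only "data", and "Mongol Empire" last with a similarity of zero because it shares none.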
