Wikipedia Recommendation System

-----

Authors:
- Martyna Stasiak id.156071

The aim of this project is to recommend Wikipedia articles based on the ones that a user has liked. <br>
To do that, we used 10 000 initial articles obtained by web crawling, starting from the https://en.wikipedia.org/wiki/Machine_learning article. They were then saved to a CSV file, so the file that serves as our database can be replaced if needed.

------

Libraries used in this project (all of them are required):

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Crawling and saving our articles

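In this part we create the file that will serve as our database of candidate Wikipedia articles. <br>
We crawl breadth-first: URLs wait in a first-in, first-out queue that starts with a single page, and each URL is visited at most once. For every page we store the title (the first `h1` element), the link, and the text of all `p` elements joined together. We then add to the queue every `/wiki/` link whose `href` contains no colon, which skips namespace pages such as `File:` or `Category:`. The crawler pauses for 0.5 s between requests and stops once `max_articles` pages have been collected or the queue is empty. <br>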


In [5]:
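def crawl_articles(start_url, max_articles=1000):
    """Breadth-first crawl of English Wikipedia starting from start_url."""
    base_url = "https://en.wikipedia.org"
    visited = set()
    to_visit = [start_url]  # FIFO queue of URLs still to fetch
    articles = []

    while to_visit and len(articles) < max_articles:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that cannot be fetched

        soup = BeautifulSoup(response.content, 'html.parser')
        heading = soup.find('h1')
        if heading is None:
            continue  # not a regular article page

        title = heading.get_text().strip()
        content = ' '.join(p.get_text() for p in soup.find_all('p'))
        articles.append({"title": title, "link": url, "content": content})

        # Queue internal article links; a colon marks a namespace page
        # such as File: or Category:, which we do not want to crawl.
        for link in soup.find_all('a', href=True):
            href = link['href'].split('#')[0]  # drop in-page anchors
            if href.startswith('/wiki/') and ':' not in href:
                full_url = base_url + href
                if full_url not in visited:
                    to_visit.append(full_url)

        time.sleep(0.5)  # be polite to the server

    return articles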

# Start crawling from a Wikipedia page
start_url = "https://en.wikipedia.org/wiki/Machine_learning"
articles = crawl_articles(start_url, max_articles=10)

# Save the articles
df = pd.DataFrame(articles)
df.to_csv("wikipedia_articles.csv", index=False)
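
The link-filtering rule used by the crawler can be tried out in isolation, without any network access. The sketch below is only an illustration: the helper name `is_article_link` and the sample `href` values are assumptions, not part of the project code. It requires the `href` to start with `/wiki/`, which is a little stricter than a substring test, so protocol-relative links to other language editions are rejected as well.

```python
def is_article_link(href: str) -> bool:
    """Return True if href looks like an internal Wikipedia article link.

    Hypothetical helper, for illustration only.
    """
    path = href.split('#')[0]  # ignore in-page anchors
    return path.startswith('/wiki/') and ':' not in path


# Made-up sample hrefs of the kinds found on a Wikipedia page
samples = [
    '/wiki/Deep_learning',                          # article
    '/wiki/File:Example.png',                       # namespace page
    '//de.wikipedia.org/wiki/Maschinelles_Lernen',  # other language edition
    'https://example.org/wiki/Page',                # external site
    '#cite_note-1',                                 # in-page anchor
]
kept = [h for h in samples if is_article_link(h)]
print(kept)  # → ['/wiki/Deep_learning']
```

Only the first sample survives, so the queue would receive `https://en.wikipedia.org/wiki/Deep_learning` and nothing else from this list.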

### Text preprocessing

Once the database of articles exists, we preprocess the text. <br>
The preprocessing consists of:
- lowercasing the text and removing every character that is not a letter or whitespace
- tokenization
- removing the stopwords
- lemmatization

In [6]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    words = nltk.word_tokenize(text)
    clean_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(clean_words)

# Preprocess content
df['clean_content'] = df['content'].apply(preprocess_text)
df.to_csv("cleaned_wikipedia_articles.csv", index=False)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
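
To see what the cleaning steps do to a single sentence, here is a small standalone sketch. It is deliberately simplified and rests on several assumptions: `TINY_STOP_WORDS` is a hypothetical hand-picked set standing in for NLTK's English stopword list, whitespace splitting stands in for `nltk.word_tokenize`, and lemmatization is left out so that the example runs without downloading any NLTK data.

```python
import re

# Hypothetical, hand-picked stand-in for stopwords.words('english')
TINY_STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}


def simple_clean(text: str) -> str:
    """Lowercase, keep only letters and whitespace, drop stopwords."""
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    words = text.split()  # assumption: whitespace split, not word_tokenize
    return ' '.join(w for w in words if w not in TINY_STOP_WORDS)


example = "Machine learning (ML) is a field of study in AI, founded in 1959!"
cleaned = simple_clean(example)
print(cleaned)  # → machine learning ml field study ai founded
```

Note that the regular expression deletes digits and punctuation outright, so the year disappears and a hyphenated term such as "state-of-the-art" would be merged into a single token.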


### TF-IDF and Cosine Similarity

In [9]:
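df = pd.read_csv("cleaned_wikipedia_articles.csv")
# Pages without any paragraph text are read back as NaN; treat them as empty
df['clean_content'] = df['clean_content'].fillna('')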

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['clean_content'])

def recommend_articles(input_titles, top_n=5):
    # Find indices of the input articles
    input_indices = df[df['title'].isin(input_titles)].index
    if len(input_indices) == 0:
        raise ValueError("None of the input titles were found in the database.")

    # Cosine similarity between each input article and every article,
    # averaged over the input articles
    input_vectors = tfidf_matrix[input_indices]
    similarities = cosine_similarity(input_vectors, tfidf_matrix)
    mean_similarity = similarities.mean(axis=0)

    # Work on a copy so that the global DataFrame is not modified
    scored = df.assign(mean_similarity=mean_similarity)

    # Exclude the input articles themselves, then keep the top_n best matches
    candidates = scored.loc[~scored.index.isin(input_indices)]
    top_articles = candidates.nlargest(top_n, 'mean_similarity')

    return top_articles[['title', 'link', 'mean_similarity']]


# Example: Recommend articles based on previously visited ones
visited_titles = ["Machine learning", "Artificial intelligence"]
recommended_articles = recommend_articles(visited_titles, top_n=1)
print(recommended_articles)

                   title                                               link  \
6  Unsupervised learning  https://en.wikipedia.org/wiki/Unsupervised_lea...   

   mean_similarity  
6         0.513629  
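
The scoring idea behind `recommend_articles` can be reproduced by hand on a toy corpus. The following pure-Python sketch is an illustration under stated assumptions, not the project's code: the three documents are made up, and the weighting (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization) follows what we understand to be the default settings of `TfidfVectorizer`, so the exact numbers may differ from scikit-learn's.

```python
import math
from collections import Counter

# Made-up toy corpus (already preprocessed)
docs = {
    "A": "machine learning model data",
    "B": "learning model training data",
    "C": "roman empire history",
}


def tfidf_vectors(corpus):
    """Return {name: {term: weight}} with L2-normalized TF-IDF weights."""
    n = len(corpus)
    counts = {name: Counter(text.split()) for name, text in corpus.items()}
    # Number of documents in which each term occurs
    doc_freq = Counter(term for c in counts.values() for term in c)
    vectors = {}
    for name, c in counts.items():
        raw = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors[name] = {t: w / norm for t, w in raw.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


vectors = tfidf_vectors(docs)
liked = ["A"]
# Mean similarity of every other document to the liked ones
scores = {
    name: sum(cosine(vectors[src], vec) for src in liked) / len(liked)
    for name, vec in vectors.items() if name not in liked
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Document "B" shares three terms with the liked document "A" and is therefore ranked first, while "C" shares none and scores zero. With several liked articles, averaging the similarities favours candidates that are reasonably close to all of them over candidates that match only one.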
