# Intelligent Tagging and Recommendation System for StackOverflow Posts
### APAN 5430: Applied Text & Natural Language Analytics Term Project
#### Group 3
#### Group Members: Sixuan Li, Wenyang Cao, Haoran Yang, Wenling Zhou, Jake Xiao
#### GitHub Repo: [https://github.com/educated-fool/stack-overflow-intelligent-tagging](https://github.com/educated-fool/stack-overflow-intelligent-tagging)

## Data Overview and Cleansing

### Data Files Overview

The dataset comprises three CSV files: Questions.csv, Answers.csv, and Tags.csv.

1. **Questions.csv**:
   - Contains questions with fields such as Id, OwnerUserId, CreationDate, ClosedDate, Score, Title, and Body.
   - The Body field contains HTML content that needs cleaning.

2. **Answers.csv**:
   - Includes fields like Id, OwnerUserId, CreationDate, ParentId, Score, and Body.
   - Similar to questions, the Body field contains HTML content that needs cleaning.

3. **Tags.csv**:
   - Contains Id and Tag pairs, with each question associated with one or more tags.

### Data Cleansing Process

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re

In [2]:
# load the raw CSV files
questions_df = pd.read_csv('Questions.csv', encoding="ISO-8859-1")
answers_df = pd.read_csv('Answers.csv', encoding="ISO-8859-1")
tags_df = pd.read_csv('Tags.csv', encoding="ISO-8859-1")

# find & drop NaN
# note: with no subset, a row is dropped if ANY column is NaN (for questions this includes ClosedDate)
questions_df.dropna(inplace=True)
answers_df.dropna(inplace=True)
tags_df.dropna(subset=['Tag'], inplace=True)

print(questions_df.isnull().sum())
print(answers_df.isnull().sum())
print(tags_df.isnull().sum())

Id              0
OwnerUserId     0
CreationDate    0
ClosedDate      0
Score           0
Title           0
Body            0
dtype: int64
Id              0
OwnerUserId     0
CreationDate    0
ParentId        0
Score           0
Body            0
dtype: int64
Id     0
Tag    0
dtype: int64


- **Loading Data**:
  - All CSV files were read using Pandas with the appropriate encoding (`ISO-8859-1`).

- **Handling Missing Values**:
  - `dropna` was applied to `questions_df` and `answers_df` without a `subset`, so a row is removed if any of its columns is missing. For questions this includes `ClosedDate`, which means only questions with a recorded closing date are kept.
  - For `tags_df`, only rows with a missing `Tag` are dropped (`subset=['Tag']`), so every remaining Id and Tag pair is complete.
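
The difference between calling `dropna` with and without `subset` can be sketched on a small made-up DataFrame. The rows and values below are invented for illustration and are not taken from the dataset; the point is only that a missing `ClosedDate` is enough to drop a row when no `subset` is given.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Questions.csv; all values are invented for illustration.
toy_questions = pd.DataFrame({
    "Id": [1, 2, 3],
    "ClosedDate": ["2010-01-01T00:00:00Z", np.nan, np.nan],
    "Title": ["How to parse HTML?", "What is a lambda?", np.nan],
})

# No subset: a row is dropped if ANY column is NaN.
strict = toy_questions.dropna()

# With subset: only the listed columns are checked for NaN.
lenient = toy_questions.dropna(subset=["Title"])

print(strict["Id"].tolist())
print(lenient["Id"].tolist())
```

Whether open questions should be kept depends on the goal of the analysis; restricting the check to the text columns is one option if they should.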

In [3]:
# DataFrame shapes after cleaning: (rows, columns)
print(questions_df.shape)
print(answers_df.shape)
print(tags_df.shape)

(55240, 7)
(2001316, 6)
(3749881, 2)


- **Data Size**:
  - The shapes of the DataFrames after cleaning are as follows:
    - `questions_df`: 55,240 rows, 7 columns
    - `answers_df`: 2,001,316 rows, 6 columns
    - `tags_df`: 3,749,881 rows, 2 columns

## Data Transformation and Export

In [36]:
# transform HTML text into clean text
def clean_html(text):
    # parse the HTML and keep only the visible text
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_special_characters(text):
    # replace every non-word character with a space
    return re.sub(r'\W', ' ', text)

questions_df['Cleaned_Body'] = questions_df['Body'].apply(clean_html).apply(remove_special_characters)
answers_df['Cleaned_Body'] = answers_df['Body'].apply(clean_html).apply(remove_special_characters)
questions_df['Cleaned_Title'] = questions_df['Title'].apply(remove_special_characters)

To clean the textual data, we applied two main functions:

1. **`clean_html(text)`**:
   - Uses `BeautifulSoup` to parse the HTML content and extract the plain text.
   - This removes the HTML tags while keeping the text they enclose.

2. **`remove_special_characters(text)`**:
   - Uses a regular expression to replace every non-word character (anything other than letters, digits, and underscores) with a space.
   - This normalizes the text for further analysis at the cost of punctuation: for example, "can't" becomes "can t".


The cleaned text is stored in new columns in the DataFrame:
- For the `questions_df`, we added `Cleaned_Body` and `Cleaned_Title`.
- For the `answers_df`, we added `Cleaned_Body`.
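
A quick look at what the `\W` substitution does to a typical StackOverflow-style sentence may help here. In the sketch below, `remove_special_characters` is the same function as in the notebook, while `collapse_whitespace` and the sample sentence are hypothetical additions for illustration only.

```python
import re

def remove_special_characters(text):
    # Same pattern as in the notebook: every non-word character becomes a space.
    return re.sub(r'\W', ' ', text)

def collapse_whitespace(text):
    # Hypothetical follow-up step: squeeze runs of whitespace and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()

sample = "I can't compile C++ (or C#) code!"
spaced = remove_special_characters(sample)
tidy = collapse_whitespace(spaced)

print(repr(spaced))
print(repr(tidy))
```

Note that language names such as `C++` and `C#` both collapse to `C` under this pattern. If those distinctions matter for tagging, a narrower pattern that preserves `+` and `#` would be one option to consider.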

In [13]:
# save to csv (index=False keeps the row index from being written as an extra column)
answers_df.to_csv('new_answer.csv', index=False)
questions_df.to_csv('new_question.csv', index=False)

To avoid reprocessing the data and to preserve the cleaned data, the cleaned DataFrames were saved to new CSV files.
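
Two details of the save-and-reload round trip are worth keeping in mind, sketched here with an in-memory buffer and invented rows. By default `to_csv` writes the row index as an extra column, which comes back as `Unnamed: 0` unless `index=False` is passed. It also writes UTF-8 by default, so the file should be read back as UTF-8; decoding UTF-8 bytes as ISO-8859-1 garbles non-ASCII characters.

```python
import io
import pandas as pd

# Invented rows standing in for a cleaned DataFrame.
toy = pd.DataFrame({"Id": [10, 20], "Cleaned_Body": ["first post", "second post"]})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the real columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))

# Encoding mismatch: UTF-8 bytes decoded as ISO-8859-1 turn one character into two.
garbled = "café".encode("utf-8").decode("ISO-8859-1")
print(garbled)
```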

## Data Loading and Merging

In [38]:
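# the cleaned files were written by to_csv, which uses UTF-8 by default
questions_df = pd.read_csv('new_question.csv', encoding="utf-8")
answers_df = pd.read_csv('new_answer.csv', encoding="utf-8")
# Tags.csv is still the raw file, so it keeps its original encoding
tags_df = pd.read_csv('Tags.csv', encoding="ISO-8859-1")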

In [5]:
# merge questions with their answers (inner join on question Id = answer ParentId)
questions_answers_df = pd.merge(questions_df, answers_df, left_on='Id', right_on='ParentId')

The merge is an inner join on the question `Id` and the answer `ParentId`, producing `questions_answers_df` with one row per question-answer pair. Columns that share a name in both inputs receive the suffixes `_x` (question) and `_y` (answer), which is where `Cleaned_Body_x` and `Cleaned_Body_y` come from.
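
The behavior of this merge can be checked on a toy example. The two small DataFrames below are invented for illustration; they show that a question with two answers yields two rows, that questions without answers and answers without a matching question are dropped, and that overlapping column names get `_x` and `_y` suffixes.

```python
import pandas as pd

# Invented questions: question 2 has no answers.
toy_questions = pd.DataFrame({
    "Id": [1, 2],
    "Cleaned_Body": ["how do i sort a list", "what is a closure"],
})

# Invented answers: answer 103 points at a question that is not present.
toy_answers = pd.DataFrame({
    "Id": [101, 102, 103],
    "ParentId": [1, 1, 3],
    "Cleaned_Body": ["use sorted", "use list sort", "orphan answer"],
})

# Same call shape as in the notebook; the default is an inner join.
merged = pd.merge(toy_questions, toy_answers, left_on="Id", right_on="ParentId")

print(list(merged.columns))
print(len(merged))
```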

In [7]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec
import numpy as np



## Spark Setup and Word2Vec Model Training

In [11]:
#spark.stop() 

In [12]:
spark = SparkSession.builder \
    .appName("TextSimilarityWithWord2Vec") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "8g") \
    .config("spark.rpc.message.maxSize", "256") \
    .getOrCreate()

To handle the large dataset and perform distributed processing, we used Apache Spark. The Spark session is configured with 8 GB of executor memory, 8 GB of driver memory, and a maximum RPC message size of 256 MiB (`spark.rpc.message.maxSize`).

### Data Preparation

In [13]:
questions_answers_df['text'] = questions_answers_df['Title'] + ' ' + questions_answers_df['Cleaned_Body_x'] + ' ' + questions_answers_df['Cleaned_Body_y']

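We concatenated the question title, the cleaned question body (`Cleaned_Body_x`), and the cleaned answer body (`Cleaned_Body_y`) into a new `text` column for text processing. Note that the raw `Title` column is used here rather than `Cleaned_Title`.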

### Word2Vec Model Training

In [14]:
df = spark.createDataFrame(questions_answers_df[['text']])

regex_tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
df = regex_tokenizer.transform(df)

stopwords_remover = StopWordsRemover(inputCol="tokens", outputCol="tokens_sw_removed")
df = stopwords_remover.transform(df)

word2vec = Word2Vec(vectorSize=100, minCount=1, inputCol="tokens_sw_removed", outputCol="wordvectors")
model = word2vec.fit(df)
df = model.transform(df)

24/08/03 17:57:03 WARN TaskSetManager: Stage 0 contains a task of very large size (12242 KiB). The maximum recommended task size is 1000 KiB.
24/08/03 17:57:20 WARN TaskSetManager: Stage 2 contains a task of very large size (12242 KiB). The maximum recommended task size is 1000 KiB.
24/08/03 17:57:33 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
                                                                                



1. **DataFrame Creation**:  
    A Spark DataFrame is created from the `text` column of `questions_answers_df`.

2. **Tokenization**:
    `RegexTokenizer` splits the text on non-word characters (`pattern="\\W"`) and, by default, lowercases the resulting tokens.
   
3. **Stop Words Removal:**
    `StopWordsRemover` drops common stop words so that the remaining tokens carry more of the meaning.
4. **Word2Vec Training:**
    Finally, we trained the Word2Vec model on the processed tokens with 100-dimensional vectors (`vectorSize=100`) and no minimum frequency cutoff (`minCount=1`). When the fitted model transforms the DataFrame, each document's `wordvectors` value is the average of the vectors of its words.
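
The averaging step that turns word vectors into one document vector can be sketched in plain NumPy. This is a simplified illustration, not Spark's implementation: the 3-dimensional vectors, the `document_vector` helper, and the handling of unknown tokens and empty documents are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained model would supply these.
toy_vectors = {
    "python": np.array([1.0, 0.0, 2.0]),
    "list": np.array([0.0, 2.0, 0.0]),
    "sort": np.array([2.0, 1.0, 1.0]),
}

def document_vector(tokens, vectors, size=3):
    """Average the vectors of the tokens; unknown tokens contribute nothing."""
    if not tokens:
        # Assumed convention: an empty document maps to the zero vector.
        return np.zeros(size)
    total = np.zeros(size)
    for token in tokens:
        if token in vectors:
            total += vectors[token]
    return total / len(tokens)

doc = document_vector(["python", "sort", "list"], toy_vectors)
print(doc)
```

Because the result is an average, word order is lost and long texts tend to drift toward a generic vector, which is worth remembering when interpreting the similarity scores.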


## Implementing a Search Function for Similar Articles

In [27]:
# build a search function
def find_similar_articles(user_query):
    """
    Print the 5 articles most similar to a query.

    input:
    user_query (str): free-text search query

    output:
    None (results are printed)
    """

    # Run the query through the same pipeline as the articles
    query_df = spark.createDataFrame([(0, user_query)], ["id", "text"])
    query_df = regex_tokenizer.transform(query_df)
    query_df = stopwords_remover.transform(query_df)
    query_df = model.transform(query_df)

    query_vector = query_df.select("wordvectors").collect()[0][0].toArray()

    # cosine similarity; returns 0.0 when either vector has zero length
    def cos_sim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    # Calculate similarity scores (this collects every article to the driver)
    articles = df.select("text", "wordvectors").collect()
    similarities = [
        (article["text"], cos_sim(query_vector, article["wordvectors"].toArray()))
        for article in articles
    ]

    # Sort and print the top 5 similar articles
    similarities.sort(key=lambda x: x[1], reverse=True)  # rank
    for i, (text, similarity) in enumerate(similarities[:5]):  # top 5
        print(f"Top {i + 1}:")
        print(f"text: {text}")
        print(f"similarity: {similarity:.4f}")
        print("\n" + "-"*50 + "\n")


To find articles similar to a user query, we implemented a function that uses the trained Word2Vec model to compute similarities between the query and the dataset.

### Function: `find_similar_articles`

The function `find_similar_articles` takes a user query as input and prints the most similar articles from the dataset. Here's a breakdown of the steps:

1. **Query DataFrame Creation**:
   A one-row DataFrame is created from the user query, with a placeholder `id` column and a `text` column.
    
2. **Tokenization and Stop Words Removal:**
    The query text is tokenized and stripped of stop words with the same transformers used for the dataset.
    
3. **Word Vector Transformation:**
    The trained Word2Vec model turns the query tokens into a single query vector.

4. **Cosine Similarity Calculation:**
    A helper function `cos_sim` calculates the cosine similarity between the query vector and an article vector.
    
5. **Similarity Scores Calculation:**
    All article vectors are collected to the driver, and a similarity score with the query is computed for each one.

6. **Ranking and Output:**
    The articles are sorted by similarity score in descending order, and the top 5 are printed.
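
The scoring and ranking steps above can also be expressed with NumPy array operations instead of a Python loop. The following is a minimal sketch on invented 2-dimensional vectors; `top_k_similar` and the toy data are hypothetical and not part of the notebook. It assumes the article vectors have already been stacked into one matrix, and it gives zero vectors a score of 0 rather than dividing by zero.

```python
import numpy as np

def top_k_similar(query_vector, doc_matrix, k=5):
    """Return (row index, cosine score) pairs for the k most similar rows."""
    query_norm = np.linalg.norm(query_vector)
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    denom = doc_norms * query_norm
    # Zero vectors (for example empty documents) get a score of 0 instead of NaN.
    scores = np.divide(doc_matrix @ query_vector, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

toy_docs = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [0.0, 0.0],   # zero vector
    [1.0, 1.0],   # 45 degrees from the query
])
toy_query = np.array([2.0, 0.0])

results = top_k_similar(toy_query, toy_docs, k=2)
print(results)
```

Computing all scores in one matrix product avoids a Python-level loop over every article, although the vectors still have to fit in driver memory.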


In [34]:
find_similar_articles("python")

24/08/03 18:27:45 WARN TaskSetManager: Stage 21 contains a task of very large size (12242 KiB). The maximum recommended task size is 1000 KiB.

Top 1:
text: Is there an online interpreter for python 3?  Possible Duplicate  Python 3 online interpreter   shell   Where can I find an online interpreter for Python 3   I m learning Python but can t install it at work where I d like to do some practice  Thanks  Sorry to repeat the question  I can t bump earlier posts and was just hoping there is one out there now   I don t know of a Python 3  and presumably you know about the browser app based on Python 2 5     But if you re unable to install Python on your computer  I can point you to an interpreter configured to run from USB keys   Portable Python  supports python 3   
similarity: 0.8039

--------------------------------------------------

Top 2:
text: Import Errors with Python script run in R I have a Python program  which searches for an anomaly  First train  then test   Now I need to start this Python program from RStudio  I have read about system  python myfirstpythonfile py    but when I launch my Python program in this way I 



## Analysis and Summary of Model Metrics Visualization

This section summarizes how the Word2Vec model trained with PySpark behaved with respect to training time, accuracy, and text similarity. Here are the key observations:

* Training Time: Spark's distributed processing allowed the Word2Vec model to train on the full merged dataset relatively quickly despite its size.

* Accuracy: Word2Vec is trained without labels, so classification-style accuracy does not apply directly. Judged qualitatively, the model captured semantic similarities between texts: the top results for the query "python" are Python-related questions.

* Precision and Recall: Precision and recall are likewise not directly applicable to a Word2Vec model on its own. The focus is instead on the quality of the vector representations and their ability to capture meaningful relationships between words.

* Cosine Similarity: Cosine similarity measures how closely two vectors point in the same direction, so higher scores indicate more similar texts. For example, the top result for the query "python" scored 0.8039.

## Conclusion

Each modeling choice involves trade-offs, and the right one depends on the priorities of the application:

* If capturing semantic similarity is crucial, Word2Vec with PySpark is a strong choice due to its ability to handle large datasets and produce meaningful word vectors.
* For real-time or near real-time processing, the Spark setup should be tuned to the available hardware to minimize latency.
* For applications requiring deeper semantic understanding, further tuning of the Word2Vec parameters, such as vector size and context window, can yield improved results.

For a text similarity project such as this one, the choice of model depends on the specific goals. If the primary objective is to balance semantic similarity against processing efficiency, optimizing the Word2Vec model with PySpark is a reasonable direction, with continued tuning of the hyperparameters to improve performance.