# StudySphere Quality Control (QC) Module
This notebook showcases the Quality Control (QC) module for StudySphere. It includes dummy data to illustrate QC mechanisms for filtering and ranking user-generated content.

### Module Objective:
The QC module aims to ensure that only high-quality content is retained on the platform by:
- Removing low-quality or downvoted questions
- Highlighting the top-rated answers
- Ranking questions by popularity
- Banning users who consistently contribute low-quality content


## Input Data Structure
### Questions Data
The platform collects questions from users in the following JSON format:
- **Username**: Unique username of the user posting the question.
- **Question Type**: Type of question (e.g., Multiple Choice, Short Answer).
- **Question**: The text of the question.
- **QuestionId**: Unique identifier for the question.
- **Answer**: Suggested answer provided by the user.

Example:
```json
[
  {"Username": "User1", "Question Type": "Multiple Choice", "Question": "What elements are in the periodic table?", "QuestionId": 1, "Answer": "Carbon, Oxygen, Nitrogen"},
  {"Username": "User2", "Question Type": "Short Answer", "Question": "What is the mitochondria?", "QuestionId": 2, "Answer": "The powerhouse of the cell."}
]
```
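
The example above uses the key `Question Type`, while the code in Step 1 uses `QuestionType`. Assuming the raw input really arrives with the spaced key, a loader could normalize key names and check for required fields before building a DataFrame. The sketch below is only an illustration: `load_questions` and `REQUIRED_KEYS` are hypothetical names, not part of the module.

```python
import json

# Raw payload in the documented format (note the spaced "Question Type" key).
raw_payload = """
[
  {"Username": "User2", "Question Type": "Short Answer",
   "Question": "What is the mitochondria?", "QuestionId": 2,
   "Answer": "The powerhouse of the cell."}
]
"""

# Assumed set of required fields, taken from the field list above.
REQUIRED_KEYS = {"Username", "QuestionType", "Question", "QuestionId", "Answer"}

def load_questions(payload):
    """Parse a JSON string and return records with normalized keys (hypothetical helper)."""
    records = []
    for raw in json.loads(payload):
        # Drop spaces from key names so "Question Type" becomes "QuestionType".
        record = {key.replace(" ", ""): value for key, value in raw.items()}
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"Question record is missing fields: {sorted(missing)}")
        records.append(record)
    return records

questions = load_questions(raw_payload)
print(questions[0]["QuestionType"])  # Short Answer
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)` in the same way as `questions_data` in Step 1.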

### Voting Data Structure
- **Question Votes**: JSON format where each vote is linked to a question.
  - `Upvote` is `true` for upvotes and `false` for downvotes (these load as Python `True` and `False`).

Example:
```json
[{"Username": "User5", "QuestionId": 3, "Upvote": true}, {"Username": "User6", "QuestionId": 2, "Upvote": false}]
```

- **Answer Votes**: JSON format where users vote on a specific answer to a question.

Example:
```json
[{"Username": "User7", "QuestionId": 3, "Answer": "Quick Sort"}]
```
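
The notebook does not process answer votes further, so the following is only a sketch of how votes in this format could be tallied per question. `tally_answer_votes` is a hypothetical helper, and the two extra vote rows are made-up sample data.

```python
from collections import Counter, defaultdict

# Answer votes in the documented format; the last two rows are made-up sample data.
answer_votes = [
    {"Username": "User7", "QuestionId": 3, "Answer": "Quick Sort"},
    {"Username": "User8", "QuestionId": 3, "Answer": "Quick Sort"},
    {"Username": "User9", "QuestionId": 3, "Answer": "Merge Sort"},
]

def tally_answer_votes(votes):
    """Count votes per answer, grouped by QuestionId (hypothetical helper)."""
    tallies = defaultdict(Counter)
    for vote in votes:
        tallies[vote["QuestionId"]][vote["Answer"]] += 1
    return tallies

tallies = tally_answer_votes(answer_votes)
# most_common(n) returns the n answers with the highest counts, best first.
print(tallies[3].most_common(2))  # [('Quick Sort', 2), ('Merge Sort', 1)]
```

This sketch counts every row as one vote; it does not check whether the same user voted twice.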

## Step 1: Define Data Structures
We keep **questions** and **votes** as separate tables and later join them on `QuestionId` using helper functions.

In [2]:

import pandas as pd

# Define questions data
questions_data = [
    {"Username": "User1", "QuestionId": 1, "QuestionType": "Multiple Choice", "Question": "What elements are in the periodic table?", "Answer": "Carbon, Oxygen, Nitrogen"},
    {"Username": "User2", "QuestionId": 2, "QuestionType": "Short Answer", "Question": "What is the mitochondria?", "Answer": "The powerhouse of the cell."},
    {"Username": "User3", "QuestionId": 3, "QuestionType": "Short Answer", "Question": "What algorithm can be used to sort a list?", "Answer": "Quick Sort"},
    {"Username": "User4", "QuestionId": 4, "QuestionType": "Short Answer", "Question": "What algorithm can be used to sort a list?", "Answer": "Merge Sort"}
]

# Define voting data
question_votes = [
    {"Username": "User5", "QuestionId": 1, "Upvote": True},
    {"Username": "User6", "QuestionId": 2, "Upvote": False},
    {"Username": "User7", "QuestionId": 3, "Upvote": True},
    {"Username": "User8", "QuestionId": 4, "Upvote": False},
    {"Username": "User9", "QuestionId": 3, "Upvote": True},
    {"Username": "User10", "QuestionId": 4, "Upvote": False},
]

# Convert to DataFrames
questions_df = pd.DataFrame(questions_data)
votes_df = pd.DataFrame(question_votes)

# Display questions and votes
print("Questions Data:")
print(questions_df)
print("Votes Data:")
print(votes_df)


Questions Data:
  Username  QuestionId     QuestionType  \
0    User1           1  Multiple Choice   
1    User2           2     Short Answer   
2    User3           3     Short Answer   
3    User4           4     Short Answer   

                                     Question                       Answer  
0    What elements are in the periodic table?     Carbon, Oxygen, Nitrogen  
1                   What is the mitochondria?  The powerhouse of the cell.  
2  What algorithm can be used to sort a list?                   Quick Sort  
3  What algorithm can be used to sort a list?                   Merge Sort  
Votes Data:
  Username  QuestionId  Upvote
0    User5           1    True
1    User6           2   False
2    User7           3    True
3    User8           4   False
4    User9           3    True
5   User10           4   False


## Step 2: Define Function to Aggregate Votes
We will calculate upvotes and downvotes for each question by aggregating the votes data and joining it with the questions data.

In [3]:

# Function to process votes and join with questions
def process_votes(questions_df, votes_df):
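    # Aggregate votes: count upvotes and downvotes per question.
    # Named aggregation on the "Upvote" column replaces groupby(...).apply(...),
    # which triggers a deprecation warning in recent pandas versions.
    votes_aggregated = votes_df.groupby("QuestionId")["Upvote"].agg(
        Upvotes="sum",
        Downvotes=lambda votes: len(votes) - votes.sum()
    ).reset_index()

    # Merge aggregated votes with questions
    merged_df = pd.merge(questions_df, votes_aggregated, on="QuestionId", how="left")

    # Questions without any votes get zero counts; cast back to int because
    # the left merge turns columns with missing values into floats.
    merged_df = merged_df.fillna({"Upvotes": 0, "Downvotes": 0})
    merged_df = merged_df.astype({"Upvotes": int, "Downvotes": int})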
    
    return merged_df

# Apply the function
questions_with_votes = process_votes(questions_df, votes_df)

# Display merged DataFrame
print("Questions with Votes:")
print(questions_with_votes)


Questions with Votes:
  Username  QuestionId     QuestionType  \
0    User1           1  Multiple Choice   
1    User2           2     Short Answer   
2    User3           3     Short Answer   
3    User4           4     Short Answer   

                                     Question                       Answer  \
0    What elements are in the periodic table?     Carbon, Oxygen, Nitrogen   
1                   What is the mitochondria?  The powerhouse of the cell.   
2  What algorithm can be used to sort a list?                   Quick Sort   
3  What algorithm can be used to sort a list?                   Merge Sort   

   Upvotes  Downvotes  
0        1          0  
1        0          1  
2        2          0  
3        0          2  




## Step 3: Implementing Quality Control Rules
### QC Rule 1: Display Top 2-3 Answers
`display_top_answers` sorts the suggested answers by the number of upvotes their questions received and returns the first `top_n` rows (3 by default), so the top answers can be displayed prominently.

### QC Rule 2: Rank Questions by Votes
Questions are sorted by vote score (upvotes minus downvotes) to prioritize popular and high-quality questions.

### QC Rule 3: Remove Questions with Too Many Downvotes
Questions whose downvote count reaches a specified threshold (2 by default) are flagged for removal.

### QC Rule 4: Ban Users Who Post Irrelevant Questions
Users whose number of flagged questions reaches `user_threshold` are flagged for potential banning. With the default of 1, a single flagged question is enough.



In [9]:
# Ensure the questions_with_votes dataset exists before applying the QC rules
if 'questions_with_votes' in locals():
    print("Dataset 'questions_with_votes' loaded successfully.")
else:
    raise ValueError("Dataset 'questions_with_votes' is not defined. Ensure it is processed correctly before running QC rules.")

# QC Rule 1: Display Top 2-3 Answers
print("# QC Rule 1: Display Top 2-3 Answers\n"
      "# ----------------------------------\n"
      "# This function calculates the top answers based on upvotes.\n")

# Function to calculate and display top answers
def display_top_answers(df, top_n=3):
    """
    Display top N answers based on the number of upvotes.
    :param df: DataFrame containing questions and votes.
    :param top_n: Number of top answers to display.
    :return: DataFrame of top N answers.
    """
    return df.sort_values(by='Upvotes', ascending=False).head(top_n)

# Apply the function
top_answers = display_top_answers(questions_with_votes)

# Display the results
print("Top 2-3 Answers Based on Upvotes:")
print(top_answers[['QuestionId', 'Answer', 'Upvotes']])



Dataset 'questions_with_votes' loaded successfully.
# QC Rule 1: Display Top 2-3 Answers
# ----------------------------------
# This function calculates the top answers based on upvotes.

Top 2-3 Answers Based on Upvotes:
   QuestionId                       Answer  Upvotes
2           3                   Quick Sort        2
0           1     Carbon, Oxygen, Nitrogen        1
1           2  The powerhouse of the cell.        0
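
`display_top_answers` ranks rows across all questions. If the goal is instead the top answers for each distinct question (questions 3 and 4 share the same text but suggest different answers), the ranking could be done per group. The sketch below assumes that identical question text means the same question; `top_answers_per_question` is a hypothetical helper, and `sample` is a stand-in for `questions_with_votes`.

```python
import pandas as pd

# Small stand-in for questions_with_votes, using the vote counts shown above.
sample = pd.DataFrame({
    "QuestionId": [1, 2, 3, 4],
    "Question": [
        "What elements are in the periodic table?",
        "What is the mitochondria?",
        "What algorithm can be used to sort a list?",
        "What algorithm can be used to sort a list?",
    ],
    "Answer": ["Carbon, Oxygen, Nitrogen", "The powerhouse of the cell.",
               "Quick Sort", "Merge Sort"],
    "Upvotes": [1, 0, 2, 0],
})

def top_answers_per_question(df, top_n=3):
    """Return up to top_n answers for each distinct question text (hypothetical helper)."""
    # A stable sort keeps the original order among rows with equal upvotes.
    ranked = df.sort_values(by="Upvotes", ascending=False, kind="stable")
    # groupby(...).head(n) keeps the first n rows of each group in the sorted order.
    return ranked.groupby("Question", sort=False).head(top_n)

per_question = top_answers_per_question(sample, top_n=1)
print(per_question[["Question", "Answer", "Upvotes"]])
```

With `top_n=1`, only "Quick Sort" is kept for the sorting question, because "Merge Sort" has fewer upvotes within the same group.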


In [10]:

# QC Rule 2: Rank Questions by Votes
print("\n# QC Rule 2: Rank Questions by Votes\n"
      "# -----------------------------------\n"
      "# This ranks all questions based on their vote score (Upvotes - Downvotes).\n")

# Function to rank questions by votes
def rank_questions_by_votes(df):
    """
    Rank questions based on their vote score (Upvotes - Downvotes).
    :param df: DataFrame containing questions and votes.
    :return: DataFrame of ranked questions.
    """
    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()
    df['VoteScore'] = df['Upvotes'] - df['Downvotes']
    return df.sort_values(by='VoteScore', ascending=False)

# Apply the function
ranked_questions = rank_questions_by_votes(questions_with_votes)

# Display the results
print("Ranked Questions by Vote Score:")
print(ranked_questions[['QuestionId', 'Question', 'VoteScore']])


# QC Rule 2: Rank Questions by Votes
# -----------------------------------
# This ranks all questions based on their vote score (Upvotes - Downvotes).

Ranked Questions by Vote Score:
   QuestionId                                    Question  VoteScore
2           3  What algorithm can be used to sort a list?          2
0           1    What elements are in the periodic table?          1
1           2                   What is the mitochondria?         -1
3           4  What algorithm can be used to sort a list?         -2


In [11]:

# QC Rule 3: Remove Questions with Too Many Downvotes
print("\n# QC Rule 3: Remove Questions with Too Many Downvotes\n"
      "# ----------------------------------------------------\n"
      "# Questions with downvotes at or above the threshold are flagged for removal.\n")

# Function to flag questions for removal
def flag_questions_for_removal(df, threshold=2):
    """
    Flag questions with too many downvotes for removal.
    :param df: DataFrame containing questions and votes.
    :param threshold: Minimum number of downvotes at which a question is flagged.
    :return: Copy of df with a new boolean column 'FlaggedForRemoval'.
    """
    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()
    df['FlaggedForRemoval'] = df['Downvotes'] >= threshold
    return df

# Apply the function
questions_with_removal_flags = flag_questions_for_removal(questions_with_votes)

# Display the results
print("Questions Flagged for Removal:")
print(questions_with_removal_flags[['QuestionId', 'Question', 'Downvotes', 'FlaggedForRemoval']])



# QC Rule 3: Remove Questions with Too Many Downvotes
# ----------------------------------------------------
# Questions with downvotes at or above the threshold are flagged for removal.

Questions Flagged for Removal:
   QuestionId                                    Question  Downvotes  \
0           1    What elements are in the periodic table?          0   
1           2                   What is the mitochondria?          1   
2           3  What algorithm can be used to sort a list?          0   
3           4  What algorithm can be used to sort a list?          2   

   FlaggedForRemoval  
0              False  
1              False  
2              False  
3               True  
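
Rule 3 only adds a flag; no rows are dropped. If flagged questions should actually be removed from the retained set, a boolean filter is one way to do it. The sketch below uses a stand-in DataFrame with the values shown above; `drop_flagged_questions` is a hypothetical helper, not part of the module.

```python
import pandas as pd

# Stand-in for questions_with_removal_flags, using the values shown above.
flagged = pd.DataFrame({
    "QuestionId": [1, 2, 3, 4],
    "Username": ["User1", "User2", "User3", "User4"],
    "Downvotes": [0, 1, 0, 2],
    "FlaggedForRemoval": [False, False, False, True],
})

def drop_flagged_questions(df):
    """Return a copy of df without flagged rows (hypothetical helper)."""
    # ~ negates the boolean mask, so only unflagged rows are kept.
    return df[~df["FlaggedForRemoval"]].copy()

retained = drop_flagged_questions(flagged)
print(retained["QuestionId"].tolist())  # [1, 2, 3]
```

Returning a copy leaves the flagged DataFrame intact, so Rule 4 can still count flagged questions per user afterwards.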


In [14]:

# QC Rule 4: Ban Users Who Post Irrelevant Questions
print("\n# QC Rule 4: Ban Users Who Post Irrelevant Questions\n"
      "# -----------------------------------------------------\n"
      "# Users whose flagged-question count reaches the threshold are identified for potential banning.\n")

# Function to identify users for potential banning
def identify_users_for_banning(df, user_threshold=1):
    """
    Identify users whose flagged-question count reaches user_threshold.
    :param df: DataFrame containing flagged questions and user information.
    :param user_threshold: Minimum number of flagged questions at which a user is marked for banning.
    :return: DataFrame with one row per user who has flagged questions, plus a 'BanUser' column.
    """
    flagged_questions = df[df['FlaggedForRemoval']]
    user_flags = flagged_questions.groupby('Username').size().reset_index(name='FlaggedQuestionsCount')
    user_flags['BanUser'] = user_flags['FlaggedQuestionsCount'] >= user_threshold
    return user_flags

# Apply the function
users_flagged_for_banning = identify_users_for_banning(questions_with_removal_flags)

# Display the results
print("Users Flagged for Potential Ban:")
print(users_flagged_for_banning)


# QC Rule 4: Ban Users Who Post Irrelevant Questions
# -----------------------------------------------------
# Users whose flagged-question count reaches the threshold are identified for potential banning.

Users Flagged for Potential Ban:
  Username  FlaggedQuestionsCount  BanUser
0    User4                      1     True
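
With the default `user_threshold=1`, one flagged question is enough to mark a user, as the output for `User4` shows. To target only repeat offenders, a higher threshold could be used. The sketch below illustrates the difference on made-up data; `users_at_or_above` is a hypothetical helper that applies the same counting idea as `identify_users_for_banning`.

```python
import pandas as pd

# Made-up flagged-question data: UserA has two flagged questions, UserB has one.
flagged_sample = pd.DataFrame({
    "Username": ["UserA", "UserA", "UserB", "UserC"],
    "FlaggedForRemoval": [True, True, True, False],
})

def users_at_or_above(df, user_threshold):
    """Return usernames whose flagged-question count is >= user_threshold."""
    # size() counts flagged rows per user; groupby sorts usernames by default.
    counts = df[df["FlaggedForRemoval"]].groupby("Username").size()
    return counts[counts >= user_threshold].index.tolist()

print(users_at_or_above(flagged_sample, user_threshold=1))  # ['UserA', 'UserB']
print(users_at_or_above(flagged_sample, user_threshold=2))  # ['UserA']
```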
