## Text Data Preprocessing for CSVTU Website Data

This process focuses on transforming raw textual data scraped from the CSVTU (Chhattisgarh Swami Vivekanand Technical University) website into a structured and machine-readable format, optimizing it for machine learning tasks. The key steps are as follows:

1. **Objective**:
   - The aim is to convert unstructured scraped data into a clean and structured *question-answer (QA)* format suitable for training machine learning models, particularly those designed for natural language understanding.
   - Additionally, the syllabus of B.Tech Honors in Data Science will be extracted and structured into a well-defined format for easier processing and access.

2. **Key Steps**:
   - **Scraped Data Cleaning**:
     - Remove unnecessary HTML tags, JavaScript, and redundant information using libraries like `BeautifulSoup` for parsing and cleaning.
     - Normalize the text by converting it to lowercase and removing special characters, numbers, and stopwords that do not contribute to the QA format.

   - **Tokenization and Segmentation**:
     - Split the text into meaningful sentences and paragraphs, ensuring logical separation between topics.
     - Identify and categorize content into questions, answers, and explanatory text using text segmentation algorithms.

   - **Formatting into QA Pairs**:
     - Extract questions directly stated or implied in the text, ensuring they align with the answers provided.
     - Create synthetic questions (if necessary) based on the content using techniques like question-generation models or templates.
     - Structure the data as key-value pairs where each question is paired with its respective answer for easier integration into machine learning pipelines.

   - **Syllabus Structuring**:
     - Focus on extracting the syllabus for the B.Tech Honors in Data Science program.
     - Organize the syllabus into a hierarchical format (e.g., semesters, subjects, modules).
     - Ensure consistency in terminology and formatting, such as using predefined headers like "Semester 1: Core Subjects."

3. **Output Formats**:
   - **QA Format**:
     - Save the question-answer pairs into a CSV file where:
       - Column 1: Question
       - Column 2: Answer
     - A two-column CSV can be loaded directly by `pandas` and by most machine learning data loaders.

   - **Syllabus File**:
     - Create a separate structured document for the B.Tech syllabus in a machine-readable format like JSON or a tabular format like CSV.
     - Include metadata such as subject codes, credit hours, and descriptions for a comprehensive dataset.

4. **Benefits**:
   - **Efficiency**: Eliminates the need for manual data preparation or reliance on external APIs for similar tasks.
   - **Scalability**: Enables the creation of large-scale QA datasets and structured syllabi for diverse applications, from search engines to chatbots.
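
The hierarchical syllabus layout described above (semesters, subjects, modules, plus metadata) could be represented roughly as follows. This is only a sketch: the subject code, credit value, and field names are illustrative placeholders, not values taken from the CSVTU syllabus, and `iter_modules` is a hypothetical helper.

```python
import json

# Illustrative structure only: "EXAMPLE101" and the credit value are
# placeholders, not real CSVTU subject metadata.
syllabus = {
    "program": "B.Tech Honors in Data Science",
    "semesters": [
        {
            "semester": 1,
            "subjects": [
                {
                    "code": "EXAMPLE101",
                    "name": "Engineering Mathematics I",
                    "credits": 4,
                    "modules": ["Matrices", "Calculus"],
                }
            ],
        }
    ],
}


def iter_modules(doc):
    """Yield (semester, subject name, module) triples from the nested structure."""
    for sem in doc["semesters"]:
        for subject in sem["subjects"]:
            for module in subject["modules"]:
                yield sem["semester"], subject["name"], module


# Serialize to JSON, then flatten the parsed copy into rows for a CSV export.
serialized = json.dumps(syllabus, indent=2, ensure_ascii=False)
flat_rows = list(iter_modules(json.loads(serialized)))
print(flat_rows)
```

Keeping the nested JSON as the source of truth and deriving the flat rows from it means the same data can feed both the JSON and the tabular output mentioned above.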

In [1]:
# Open the file "Complete_website_data.txt" in read mode with UTF-8 encoding
# - "r" indicates read mode.
# - "encoding='utf-8'" ensures compatibility with a wide range of characters, 
#   especially for text containing special symbols or non-English characters.
with open("Dataset/Website_Data/Complete_website_data.txt", "r", encoding="utf-8") as file:
    # Read the entire content of the file into the variable `data`
    # - This will load the complete text file into memory as a single string.
    data = file.read()

# Display the first 1000 characters of the loaded text data
# - Slicing (`[:1000]`) is used to extract and print only the first 1000 characters.
# - This is helpful for inspecting the initial content of the file without printing the entire text, 
#   which might be too large for console output.
print(data[:1000])









Chhattisgarh Swami Vivekanand Technical University – CSVTU

FORMS / DOWNLOADS
CSVTU NSS
CSVTU STUDENT COUNCIL
LOCATION
PREVIOUS WEBSITE
ENROLL. DEFICIENCIES





Search for:





Recent Posts


Recruitment Notice For The Post of Principal, Professor, Asst. Professor, Asso. Professor & Lecturer  Under Statute-19 at Rungta Institute of Pharmaceutical Sciences, Bhilai


Public Relations Officer


AICTE Quality Improvement Scheme[AQIS] 2021-22 Financial Support


M.Tech/M.Plan Admissions 2020 at University Teaching Department,CSVTU,Newai,Bhilai


Important Notification-Suspicious Email Activities


Recent CommentsArchives

December 2021
August 2021
December 2020
September 2020
May 2020
April 2020
March 2020

Categories

Announcement

Notice

Uncategorized


Meta

Log in
Entries feed
Comments feed
WordPress.org










HOME
THE UNIVERSITY

About
Hon’ble Vice Chancellor
Hon’ble Pro Vice Chancellor
University Valu

### Cleaning the Text File and Converting It to Question-Answer Format

In [3]:
# Importing required modules

# For regular expression processing
import re

# For Data Analysis
import pandas as pd

# Step 1: Clean the text data
def clean_text(text):
    """
    Cleans the input text by:
    1. Collapsing every run of whitespace (spaces, tabs, newlines) into a single space.
    2. Retaining only word characters (letters, digits, underscore), whitespace, '.', and '?'.

    Because step 1 already turns newlines into spaces, no separate
    newline-collapsing step is needed.

    Args:
        text (str): The input text to clean.

    Returns:
        str: The cleaned text.
    """
    text = re.sub(r'\s+', ' ', text.strip())  # Collapse all whitespace, including newlines
    text = re.sub(r'[^\w\s\.\?]', '', text)  # Remove special characters except . and ?
    return text

# Clean the loaded text data
cleaned_data = clean_text(data)

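# Step 2: Split the cleaned text into sentences
# - Split after every '.', '?' or '!' and discard any whitespace that follows.
# - Because the pattern uses `\s*`, it also splits where no space follows the
#   punctuation (Python 3.7+), so "M.Tech" becomes "M." and "Tech".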
sentences = re.split(r'(?<=[.?!])\s*', cleaned_data)

# Step 3: Generate Question-Answer pairs
qa_pairs = []  # Initialize an empty list to store (question, answer) pairs

for sentence in sentences:
    sentence = sentence.strip()  # Remove any leading or trailing spaces
    if not sentence:  # Skip empty sentences resulting from splitting
        continue
    if sentence.endswith("?"):
        # If the sentence is already a question, pair it with a placeholder answer
        question = sentence
        answer = "Answer not provided."  # Placeholder for now
    else:
        # Create a question from the sentence
        question = f"What about: {sentence.capitalize()}"
        answer = sentence  # Use the sentence itself as the answer

    qa_pairs.append((question, answer))  # Append the pair to the list

# Step 4: Save the Question-Answer pairs to a DataFrame
# - The DataFrame allows for additional preprocessing before saving.
df = pd.DataFrame(qa_pairs, columns=["Question", "Answer"])

# Print information about the DataFrame
print(f"DataFrame shape: {df.shape}")  # Display the number of rows and columns
print("\n=======  DataFrame description: ========\n")
print(df.describe(include="all"))  # Provide a summary of the data

print("\n======= First 5 rows of the DataFrame:========\n")
print(df.head())  # Display the first 5 rows for inspection

DataFrame shape: (146293, 2)


                 Question  Answer
count              146293  146293
unique              11703   11708
top     What about: Tech.   Tech.
freq                 8073    8073


                                            Question  \
0  What about: Chhattisgarh swami vivekanand tech...   
1  What about: Deficiencies search for recent pos...   
2                        What about: Professor asso.   
3  What about: Professor  lecturer under statute1...   
4                                 What about: Techm.   

                                              Answer  
0  Chhattisgarh Swami Vivekanand Technical Univer...  
1  DEFICIENCIES Search for Recent Posts Recruitme...  
2                                    Professor Asso.  
3  Professor  Lecturer Under Statute19 at Rungta ...  
4                                             TechM.  
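
The description above shows `What about: Tech.` occurring 8,073 times, which is consistent with the split pattern breaking abbreviations such as "M.Tech" at every period. A possible alternative, sketched here as an assumption rather than the notebook's method, is to split only when whitespace actually follows the punctuation. `split_sentences` is a hypothetical helper.

```python
import re


def split_sentences(text):
    """Split after '.', '?' or '!' only when whitespace follows the mark."""
    parts = re.split(r'(?<=[.?!])\s+', text.strip())
    return [part for part in parts if part]


sample = "M.Tech admissions are open at CSVTU. Is the form online?"
sentences_demo = split_sentences(sample)
print(sentences_demo)
```

This keeps "M.Tech" intact, but abbreviations followed by a space (for example "Asst. Professor") would still be split, so a dedicated sentence tokenizer may be a better fit for this data.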


### Removing Questions and Answers with Only One Character

In [4]:
# Step 5: Remove question-answer pairs where either side is a single character
df = df[~((df['Question'].str.len() == 1) | (df['Answer'].str.len() == 1))]

# Print updated DataFrame information
print(f"Updated DataFrame shape: {df.shape}")  # Display the number of rows and columns
print("\n=======  Updated DataFrame description: ========\n")
print(df.describe(include="all"))  # Provide a summary of the data

print("\n======= First 5 rows of the Updated DataFrame:========\n")
print(df.head())  # Display the first 5 rows for inspection

Updated DataFrame shape: (146285, 2)


                 Question  Answer
count              146285  146285
unique              11702   11707
top     What about: Tech.   Tech.
freq                 8073    8073


                                            Question  \
0  What about: Chhattisgarh swami vivekanand tech...   
1  What about: Deficiencies search for recent pos...   
2                        What about: Professor asso.   
3  What about: Professor  lecturer under statute1...   
4                                 What about: Techm.   

                                              Answer  
0  Chhattisgarh Swami Vivekanand Technical Univer...  
1  DEFICIENCIES Search for Recent Posts Recruitme...  
2                                    Professor Asso.  
3  Professor  Lecturer Under Statute19 at Rungta ...  
4                                             TechM.  


### Removing Questions and Answers with Only 3 or 4 Words

In [170]:
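# Step 5 (extended): Remove pairs where either side is a single character or has exactly 3 or 4 words
# - Generated questions carry the two-word "What about:" prefix, so a 3- or 4-word
#   question corresponds to a 1- or 2-word source sentence.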
df = df[~((df['Question'].str.len() == 1) | 
          (df['Answer'].str.len() == 1) | 
          (df['Question'].str.split().apply(len).isin([3, 4])) | 
          (df['Answer'].str.split().apply(len).isin([3, 4])))]

# Print updated DataFrame information
print(f"Updated DataFrame shape: {df.shape}")  # Display the number of rows and columns
print("\n=======  Updated DataFrame description: ========\n")
print(df.describe(include="all"))  # Provide a summary of the data

print("\n======= First 5 rows of the Updated DataFrame:========\n")
print(df.head())  # Display the first 5 rows for inspection

Updated DataFrame shape: (76572, 2)


                                                 Question  \
count                                               76572   
unique                                              11274   
top     What about: Academic university teaching depar...   
freq                                                 5918   

                                                   Answer  
count                                               76572  
unique                                              11284  
top     ACADEMIC University Teaching Department Diplom...  
freq                                                 5918  


                                            Question  \
0  What about: Chhattisgarh swami vivekanand tech...   
1  What about: Deficiencies search for recent pos...   
3  What about: Professor  lecturer under statute1...   
5  What about: Plan admissions 2020 at university...   
6  What about: Org home the university about honb...   
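
Since every generated question carries the same "What about:" prefix, a simpler rule may be to look only at the answer and keep pairs whose answer reaches a minimum length. The sketch below illustrates that idea in plain Python. The `min_words=5` threshold and the `keep_pair` helper are assumptions chosen to roughly match the filter above, not values from the notebook.

```python
def answer_word_count(answer):
    """Count whitespace-separated words in the answer."""
    return len(answer.split())


def keep_pair(question, answer, min_words=5):
    """Keep a pair only when the answer has at least `min_words` words."""
    return answer_word_count(answer) >= min_words


pairs = [
    ("What about: Tech.", "Tech."),
    ("What about: Professor asso.", "Professor Asso."),
    ("What about: Plan admissions 2020 at university teaching department.",
     "Plan admissions 2020 at University Teaching Department."),
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
print(filtered)
```

Only the third pair survives, because its answer has seven words.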

                 

### Converting Questions and Answers to Sentence Case

In [7]:
# Step 6: Convert the 'Question' and 'Answer' columns to sentence case with proper punctuation
def to_sentence_case(text):
    """
    Converts text to sentence case, ensuring proper capitalization and punctuation.
    """
    # Capitalize the first letter of the first word and make the rest lowercase
    if text:
        text = text[0].upper() + text[1:].lower()
        # Add punctuation if it ends with a word (basic handling)
        if not text.endswith(('.', '!', '?')):
            text += '.'
    return text

# Apply the function to both 'Question' and 'Answer' columns
df['Question'] = df['Question'].apply(to_sentence_case)
df['Answer'] = df['Answer'].apply(to_sentence_case)

# Print updated DataFrame information
print(f"Updated DataFrame shape: {df.shape}")  # Display the number of rows and columns
print("\n=======  Updated DataFrame description: ========\n")
print(df.describe(include="all"))  # Provide a summary of the data

print("\n======= First 5 rows of the Updated DataFrame:========\n")
print(df.head())  # Display the first 5 rows for inspection

Updated DataFrame shape: (146285, 2)


                 Question  Answer
count              146285  146285
unique              11702   11693
top     What about: tech.   Tech.
freq                 8073    8073


                                            Question  \
0  What about: chhattisgarh swami vivekanand tech...   
1  What about: deficiencies search for recent pos...   
2                        What about: professor asso.   
3  What about: professor  lecturer under statute1...   
4                                 What about: techm.   

                                              Answer  
0  Chhattisgarh swami vivekanand technical univer...  
1  Deficiencies search for recent posts recruitme...  
2                                    Professor asso.  
3  Professor  lecturer under statute19 at rungta ...  
4                                             Techm.  
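
The output above shows that lowercasing everything after the first character also lowercases acronyms and proper nouns ("Techm.", "rungta"). If preserving them matters, one option is to change only the first character. This is a sketch of that variant; `capitalize_first` is a hypothetical name, not a function from the notebook.

```python
def capitalize_first(text):
    """Uppercase the first character, keep the rest as-is, and end with punctuation."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    # Append a period only when no terminal punctuation is present.
    if not text.endswith(('.', '!', '?')):
        text += '.'
    return text


print(capitalize_first("admissions at CSVTU are open"))
```

The trade-off is that text scraped in all capitals (such as menu labels) stays in all capitals.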


### Saving to CSV File

In [9]:
# Step 7: Randomly sample 20,000 rows from the DataFrame
sampled_df = df.sample(n=20000, random_state=42)  # Ensure reproducibility with random_state

# Drop rows with any NaN values (axis=1 would drop an entire column instead)
sampled_df = sampled_df.dropna(axis=0)

# Save the sampled DataFrame to CSV
sampled_df.to_csv("Dataset/Preprocessed_Dataset/University_Website_Data_Question_Answer.csv", index=False, encoding="utf-8")

print(f"Sampled and saved {len(sampled_df)} rows to CSV.")

Sampled and saved 20000 rows to CSV.
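
The earlier summary reports 146,285 rows but only about 11,700 unique questions, so a random sample of 20,000 rows will contain many repeats. A hedged sketch of deduplicating before sampling, and of capping the sample size at the number of rows available, is shown below using only the standard library. `dedupe_and_sample` is a hypothetical helper; in pandas, `DataFrame.drop_duplicates()` would play the role of the deduplication step.

```python
import random


def dedupe_and_sample(rows, n, seed=42):
    """Drop exact duplicate rows, then sample at most `n` of the remainder."""
    unique_rows = list(dict.fromkeys(rows))  # keeps first-seen order
    k = min(n, len(unique_rows))  # never request more rows than exist
    return random.Random(seed).sample(unique_rows, k)


rows = [("What about: Tech.", "Tech.")] * 3 + [("Q2?", "A2."), ("Q3?", "A3.")]
sample_rows = dedupe_and_sample(rows, n=20000)
print(len(sample_rows))
```

Capping the sample size matters because sampling without replacement fails when more rows are requested than are available.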


### Saving a Second CSV File of Well-Known University Facts as Question-Answer Pairs

In [11]:
# Create a list of question-answer pairs
qa_pairs = [
    ("When was CSVTU established?", "CSVTU was established in 2001, located in Bhilai, Chhattisgarh."),
    ("What is the main objective of CSVTU?", "The main objective is to provide quality education in engineering, technology, and management fields."),
    ("How many affiliated colleges are under CSVTU?", "CSVTU has over 100 affiliated colleges across Chhattisgarh."),
    ("What courses does CSVTU offer?", "CSVTU offers undergraduate, postgraduate, and doctoral programs in engineering, technology, and management."),
    ("Which engineering branches are available at CSVTU?", "CSVTU offers branches like Computer Science, Mechanical, Civil, Electrical, Electronics, and Information Technology."),
    ("What are some popular postgraduate courses at CSVTU?", "The popular postgraduate courses include M.Tech, MBA, and MCA programs."),
    ("Does CSVTU offer distance education programs?", "Yes, CSVTU provides distance education for working professionals and those unable to attend regular classes."),
    ("Where is the main campus of CSVTU located?", "The main campus is located in Bhilai, Chhattisgarh."),
    ("What is the vision of CSVTU?", "The vision is to become a leading center of technical and higher education, promoting research and innovation."),
    ("What is the mission of CSVTU?", "The mission is to provide quality technical education, promote research, and support students' holistic development."),
    ("What are some notable collaborations of CSVTU?", "CSVTU has partnerships with industries and academic institutions for research and development."),
    ("How does CSVTU support student development?", "CSVTU supports student development through workshops, clubs, seminars, and training programs."),
    ("Are there any research centers at CSVTU?", "Yes, CSVTU has research centers focusing on areas like robotics, renewable energy, and AI."),
    ("What is the student-teacher ratio at CSVTU?", "The student-teacher ratio varies, ensuring a conducive learning environment for students."),
    ("What facilities are available on campus at CSVTU?", "Facilities include modern classrooms, laboratories, libraries, hostels, and sports complexes."),
    ("What is the admission process for CSVTU?", "Admissions are based on entrance exams like JEE Mains for undergraduate courses and GATE for postgraduate courses."),
    ("Does CSVTU have an active alumni network?", "Yes, CSVTU has a strong alumni network that supports the growth and development of the university."),
    ("What kind of extracurricular activities are available at CSVTU?", "Students can participate in cultural events, sports competitions, workshops, and technical fests."),
    ("What are the key achievements of CSVTU?", "CSVTU has been recognized for its quality education, innovative research projects, and industry partnerships."),
    ("Are there any special scholarships offered by CSVTU?", "Yes, CSVTU offers scholarships for meritorious and economically disadvantaged students."),
    ("What is the university’s approach to sustainable development?", "CSVTU incorporates sustainable practices in its campus development and research."),
    ("What type of campus does CSVTU have?", "The university has a modern campus with state-of-the-art infrastructure and facilities."),
    ("How does CSVTU encourage research and innovation?", "CSVTU supports research with dedicated programs, funding, and collaborations with industries."),
    ("What industries partner with CSVTU for student internships?", "CSVTU partners with various industries, including tech and manufacturing companies, for student internships and placements."),
    ("What is the admission process for CSVTU's MBA program?", "The MBA program admits students based on entrance exams such as CAT or MAT and personal interviews."),
    ("How is CSVTU contributing to regional development?", "CSVTU contributes by providing skilled graduates who meet the demands of the regional workforce."),
    ("What is the role of CSVTU in promoting entrepreneurship?", "CSVTU encourages entrepreneurship through training programs, workshops, and startup incubation centers."),
    ("What is CSVTU's approach to international collaboration?", "CSVTU collaborates with international universities for exchange programs and joint research projects."),
    ("Does CSVTU have a placement cell?", "Yes, CSVTU has a dedicated placement cell that helps students connect with top employers."),
    ("What kind of support does CSVTU provide for research students?", "CSVTU offers research funding, guidance from faculty, and access to advanced research facilities."),
    ("What academic resources are available to students at CSVTU?", "CSVTU provides access to well-stocked libraries, online databases, and learning resources."),
    ("What is CSVTU address?", "CSVTU main address: CSVTU Bhilai, Newai, Bhilai, Chhattisgarh, India, 491107."),
    ("When was CSVTU established?", "CSVTU was established in 2001, located in Bhilai, Chhattisgarh."),
    ("When was CSVTU established?", "CSVTU was established in 2001, located in Bhilai, Chhattisgarh."),
    ("What act established CSVTU?", "CSVTU was established by an act (No. 25 of 2004) passed by the Chhattisgarh State Govt. Assembly."),
    ("When was CSVTU inaugurated?", "CSVTU was inaugurated on 30th April 2005."),
    ("Who inaugurated CSVTU?", "CSVTU was inaugurated by the Hon’ble Prime-Minister of India, Dr. Manmohan Singh."),
    ("What is the main purpose of CSVTU?", "The main purpose of CSVTU is to ensure systematic, efficient, and quality education in engineering and technology, including Architecture and Pharmacy."),
    ("What academic levels does CSVTU offer?", "CSVTU offers programs at Research, Postgraduate, Degree, and Diploma levels."),
    ("Where is the permanent campus of CSVTU located?", "The permanent campus of CSVTU is located in Bhilai, Chhattisgarh."),
    ("How much land does the CSVTU campus cover?", "The CSVTU campus encircles 250 acres of land."),
    ("How many engineering colleges are affiliated with CSVTU?", "There are 44 engineering colleges affiliated with CSVTU."),
    ("How many pharmacy colleges are affiliated with CSVTU?", "There are 11 pharmacy colleges affiliated with CSVTU."),
    ("How many polytechnic institutes are affiliated with CSVTU?", "There are 40 polytechnic institutions affiliated with CSVTU."),
    ("What special recognition did CSVTU receive in 2011?", "CSVTU was conferred the 'Emerging Technological University of the Year Award' in 2011."),
    ("Who conferred the 'Emerging Technological University of the Year Award' on CSVTU?", "The 'Emerging Technological University of the Year Award' was conferred by the WORLD MANAGEMENT CONGRESS."),
    ("What digital innovation has CSVTU introduced?", "CSVTU introduced a digitalized evaluation system to enhance the speed and accuracy of exam result publication."),
    ("What new areas have CSVTU focused on for research and development?", "CSVTU has identified frontier areas of research and development and organized outreach programs for societal benefits."),
    ("What are some soft skills courses offered at CSVTU?", "CSVTU offers courses in communication skills, group discussion, human values education, health hygiene and yoga, personality development, entrepreneurship, and project-based learning."),
    ("What facilities are included in CSVTU's permanent campus?", "The permanent campus includes administrative buildings, academic facilities, and supporting infrastructure."),
    ("What are the key reforms CSVTU has adopted?", "CSVTU has adopted reforms like introducing soft skills into the curriculum and redesigning teaching methods."),
    ("What is the importance of the 30th April 2005 date for CSVTU?", "30th April 2005 marks the inauguration date of CSVTU."),
    ("What are some outreach programs conducted by CSVTU?", "CSVTU has conducted academic programs, seminars, workshops, and conferences for the community."),
    ("How has CSVTU improved the publication of exam results?", "CSVTU improved the publication of results through its digital evaluation system, significantly increasing teaching time."),
    ("What is the vision behind CSVTU's curriculum design?", "The curriculum is designed with brainstorming sessions, task forces, and workshops aimed at enhancing technical education."),
    ("What activities have been conducted by CSVTU in the past 5 years?", "CSVTU has hosted various seminars, workshops, and conferences in the past five years."),
    ("What awards or recognitions has CSVTU received?", "CSVTU was recognized as the 'Emerging Technological University of the Year' by WORLD MANAGEMENT CONGRESS in 2011."),
    ("How does CSVTU contribute to regional and national development?", "CSVTU contributes by producing skilled graduates who meet regional and national workforce needs."),
    ("What email should be used for student queries and resolutions?", "sqrc@csvtu.ac.in."),
    ("Who is the Chairman of the Academic Council?", "Dr. Prof. Sachchidanand Shukla, Vice-Chancellor, CSVTU, Bhilai."),
    ("Who is the Member Secretary of the Academic Council?", "Dr. Ankit Arora, Registrar, CSVTU, Bhilai."),
    ("How can I contact CSVTU?", "Contact CSVTU at +91-788-2200062 or email studentenquiry@csvtu.ac.in."),
    ("What is the official address of CSVTU?", "Newai, P.O.-Newai, District-Durg, Chhattisgarh, PIN-491107."),
    ("What is the contact number for the Registrar's Office?", "+91-788-2445009."),
    ("How can I contact the Director of UTD at CSVTU?", "Phone: 9826431889."),
    ("What is the contact number for the Exam Controller?", "+91-788-2445004, 9179841986 (WhatsApp only)."),
    ("What are the working hours for student inquiries?", "10:30 AM to 5:30 PM."),
    ("Who can be contacted for admission-related queries for M.Tech?", "Call 7509983788, 9977889161, or 7974139459."),
    ("What is the contact for degree/diploma certificates?", "Email: degree@csvtu.ac.in, Phone: +91-788-2445030."),
    ("How can I reach the Public Information Officer for RTI?", "Phone: 9630458076, +91-788-2445015."),
    ("What is the email for transcript-related issues?", "deputyregistrar@csvtu.ac.in."),
    ("What are the contact numbers for the Assistant Finance Officer?", "+91-788-2445007, 8103021576."),
    ("What is the contact for the Chief Finance Officer?", "+91-788-2445006, 9584311365."),
    ("What email should be used for student queries and resolutions?", "sqrc@csvtu.ac.in."),
]

# Convert the question-answer pairs into a DataFrame
df_csvtu_data = pd.DataFrame(qa_pairs, columns=["Question", "Answer"])

# Print updated DataFrame information
print(f"Updated DataFrame shape: {df_csvtu_data.shape}")  # Display the number of rows and columns

print("\n=======  Updated DataFrame description: ========\n")
print(df_csvtu_data.describe(include="all"))  # Provide a summary of the data

print("\n======= First 5 rows of the Updated DataFrame:========\n")
print(df_csvtu_data.head())  # Display the first 5 rows for inspection


Updated DataFrame shape: (74, 2)


                           Question  \
count                            74   
unique                           71   
top     When was CSVTU established?   
freq                              3   

                                                   Answer  
count                                                  74  
unique                                                 71  
top     CSVTU was established in 2001, located in Bhil...  
freq                                                    3  


                                            Question  \
0                        When was CSVTU established?   
1               What is the main objective of CSVTU?   
2      How many affiliated colleges are under CSVTU?   
3                     What courses does CSVTU offer?   
4  Which engineering branches are available at CS...   

                                              Answer  
0  CSVTU was established in 2001, located in Bhil...  
1  The main objectiv

### Saving to File

In [13]:
# Save the DataFrame to a CSV file in the specified directory
output_path = "Dataset/Preprocessed_Dataset/CSVTU_Question_Answer.csv"

df_csvtu_data.to_csv(output_path, index=False, encoding="utf-8")

print(f"CSV file saved at {output_path}")

CSV file saved at Dataset/Preprocessed_Dataset/CSVTU_Question_Answer.csv
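
Several answers in this list contain commas (addresses, phone lists), so it can be worth checking that the pairs survive a write and read cycle. The sketch below does this in memory with the standard `csv` module and two pairs taken from the list above. The in-memory buffer stands in for the real output file and is an assumption of this example.

```python
import csv
import io

qa = [
    ("How can I contact CSVTU?",
     "Contact CSVTU at +91-788-2200062 or email studentenquiry@csvtu.ac.in."),
    ("What is the official address of CSVTU?",
     "Newai, P.O.-Newai, District-Durg, Chhattisgarh, PIN-491107."),
]

# Write the header and rows; fields containing commas are quoted automatically.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Question", "Answer"])
writer.writerows(qa)

# Read the data back and rebuild the (question, answer) tuples.
buffer.seek(0)
reader = csv.DictReader(buffer)
round_trip = [(row["Question"], row["Answer"]) for row in reader]
print(round_trip == qa)
```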


In [3]:
# For file handling
import os

# Importing for data analysis
import pandas as pd

# Import the regular expressions module
import re  

# For extracting text from PDF
import pdfplumber

# Reading Documents
from docx import Document

# Directory containing the syllabus files
input_directory_path = r"C:\Users\rawat\Documents\Semester Notes\5 SEMESTER\Minor Project\Codes\Dataset\Syllabus_Data_PDF"
output_directory_path = r"C:\Users\rawat\Documents\Semester Notes\5 SEMESTER\Minor Project\Codes\Dataset\Syllabus_Data"

# Ensure output directory exists
os.makedirs(output_directory_path, exist_ok=True)

# Function to read text from a .docx file
def read_docx(file_path):
    document = Document(file_path)
    text = ""
    for para in document.paragraphs:
        text += para.text + "\n"
    return text

# Function to read text from a .txt file
def read_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Function to read text from a .pdf file
def read_pdf(file_path):
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            text += (page.extract_text() or "") + "\n"  # Append each page's text, keeping a line break between pages
    return text

# Function to generate questions and answers from the content
def generate_questions(file_content, semester, subject_name):
    questions = []
    lines = file_content.splitlines()
    current_unit = None
    current_content = []
    subject_line_added = False  # Flag to ensure the subject line is only added once

    for line in lines:
        # Check and add the subject only once at the start
        if not subject_line_added and "subject" in line.lower():
            questions.append((f"Subject: {subject_name} (Semester {semester})", ""))
            subject_line_added = True  # Set the flag to prevent adding it again

        # Detect a new unit and create questions for it
        if line.startswith("Unit "):
            if current_unit is not None and current_content:
                # Add the previous unit's question and content to the list
                question = f"What is covered in {current_unit}?"
                answer = "\n".join(current_content).strip()
                questions.append((question, answer))

            # Update current unit and reset content for new unit
            current_unit = line.strip()
            current_content = []  # Reset content for new unit

        # Append lines to the current unit's content if a unit is active
        elif current_unit:
            current_content.append(line.strip())

        # Add questions for standalone content (e.g., summaries)
        elif not current_unit and line.strip():  # Ensure it is not just whitespace
            questions.append((f"What is covered in {subject_name} (Semester {semester})?", line.strip()))

    # Add the last unit content after the loop ends
    if current_unit and current_content:
        question = f"What is covered in {current_unit}?"
        answer = "\n".join(current_content).strip()
        questions.append((question, answer))

    return questions

# Iterate through files in the directory and process them one by one
for file_name in os.listdir(input_directory_path):
    if file_name.endswith(('.docx', '.txt', '.pdf')):
        file_path = os.path.join(input_directory_path, file_name)
        subject_name = os.path.splitext(file_name)[0]  # Use the file name (without extension) as the subject name

        # Find the first word in the filename that contains "SEM" (e.g., "SEM" or "Semester").
        # Note: this captures the word itself, not the semester number that precedes it.
        semester_match = [word for word in file_name.split() if 'SEM' in word.upper()]
        semester = semester_match[0] if semester_match else "Unknown"

        # Print message indicating the file is being read
        print(f"Reading file: {file_name}")

        if file_name.endswith('.docx'):
            file_content = read_docx(file_path)
        elif file_name.endswith('.txt'):
            file_content = read_txt(file_path)
        elif file_name.endswith('.pdf'):
            file_content = read_pdf(file_path)

        # Generate questions and store them in the appropriate semester
        questions = generate_questions(file_content, semester, subject_name)
        if semester != "Unknown":  # Ensure that we only add questions for recognized semesters
            # Create a CSV filename based on the semester and subject
            formatted_semester = semester.capitalize().replace(" ", "_")  # Format for file naming
            output_csv_path = os.path.join(output_directory_path, f"{formatted_semester}_{subject_name}_Question_Answer.csv")

            # Create a DataFrame and save to CSV
            output_df = pd.DataFrame(questions, columns=['Question', 'Answer'])
            output_df.to_csv(output_csv_path, index=False)

            # Print message indicating CSV creation for the file
            print(f"Questions and answers for {file_name} have been saved to {output_csv_path}")
        else:
            print(f"Semester not found in the filename: {file_name}")


Reading file: 1 Semester Syllabus.pdf
Questions and answers for 1 Semester Syllabus.pdf have been saved to C:\Users\rawat\Documents\Semester Notes\5 SEMESTER\Minor Project\Codes\Dataset\Syllabus_Data\Semester_1 Semester Syllabus_Question_Answer.csv
Reading file: 2 Semester Syllabus.pdf
Questions and answers for 2 Semester Syllabus.pdf have been saved to C:\Users\rawat\Documents\Semester Notes\5 SEMESTER\Minor Project\Codes\Dataset\Syllabus_Data\Semester_2 Semester Syllabus_Question_Answer.csv
Reading file: 3 Semester Syllabus.pdf
Questions and answers for 3 Semester Syllabus.pdf have been saved to C:\Users\rawat\Documents\Semester Notes\5 SEMESTER\Minor Project\Codes\Dataset\Syllabus_Data\Semester_3 Semester Syllabus_Question_Answer.csv
Reading file: 4 Semester Syllabus.pdf
Questions and answers for 4 Semester Syllabus.pdf have been saved to C:\Users\rawat\Documents\Semester Notes\5 SEMESTER\Minor Project\Codes\Dataset\Syllabus_Data\Semester_4 Semester Syllabus_Question_Answer.csv
Read
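
The extension checks in the cell above leave `file_content` unset when a file is not `.docx`, `.txt`, or `.pdf`. A minimal sketch of one way to make that case explicit is a dispatch table keyed by extension. The names `READERS` and `load_content` are hypothetical and not part of the notebook; only a plain-text reader is defined here, and the notebook's own `read_docx` and `read_pdf` are assumed to be registrable in the same way.

```python
import os


def read_txt(path):
    """Read a UTF-8 text file and return its contents as one string."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical dispatch table: lowercase extension -> reader function.
# The notebook's read_docx and read_pdf could be added under ".docx" and ".pdf".
READERS = {".txt": read_txt}


def load_content(path, readers=READERS):
    """Return the file's text, or None when no reader handles its extension."""
    extension = os.path.splitext(path)[1].lower()
    reader = readers.get(extension)
    if reader is None:
        return None  # Caller can skip the file instead of reusing stale content
    return reader(path)
```

A caller would check for `None` and skip the file before generating questions, so an unsupported file never reuses content from a previous iteration.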

In [11]:
import os
import re

import pandas as pd

# Dictionary mapping semesters to their subjects, each followed by topic keywords
Semester_subjects = {
    "Semester_1": [
        "Engineering Mathematics I: Integration, differentiation, matrices, calculus, probability, Fourier series, Laplace transforms.",
        "Environmental Science: Ecosystem, sustainability, climate change, pollution, renewable energy, green technology.",
        "Foundations of Electronics Engineering: Circuits, transistors, diodes, amplifiers, digital and analog circuits.",
        "Fundamentals of Computational Biology: Algorithms, DNA sequencing, bioinformatics, molecular biology.",
        "Language and Writing Skills: Grammar, essays, communication, creative and technical writing.",
        "Learning Programming Concepts With C: Variables, loops, pointers, memory management, file handling.",
        "Professional Ethics and Life Skills: Teamwork, leadership, communication, ethics, decision-making."
    ],
    "Semester_2": [
        "Data Structure Using C: Arrays, linked lists, stacks, trees, sorting, graph traversal.",
        "Digital Logic and Design: Gates, Boolean algebra, flip-flops, multiplexers, state machines.",
        "Engineering Mathematics II: Laplace transforms, matrices, differential equations, optimization.",
        "Entrepreneurship: Innovation, market research, funding, business strategies, risk management.",
        "Object-Oriented Programming: Classes, inheritance, polymorphism, encapsulation, design patterns.",
        "Python for Data Science: Numpy, Pandas, visualization, machine learning, regression."
    ],
    "Semester_3": [
        "Analysis and Design of Algorithm: Sorting, recursion, graph algorithms, dynamic programming.",
        "Computer Organization and Architecture: CPU, memory, pipeline, assembly language.",
        "Database Management System: SQL, transactions, schema, normalization, ER diagrams.",
        "Discrete Structure: Graph theory, combinatorics, logic, set theory, modular arithmetic.",
        "Independent Project: Research, development, documentation, problem-solving, presentation.",
        "Probability and Statistics: Distributions, regression, hypothesis testing, predictive modeling."
    ],
    "Semester_4": [
        "Artificial Intelligence Principles and Applications: Machine learning, neural networks, NLP, robotics.",
        "Computer Network: Protocols, IP addressing, routing, firewalls, VPN, cloud networking.",
        "Data Visualization: Dashboards, charts, storytelling with data, heatmaps, time-series plots.",
        "Operating System: Process scheduling, memory management, file system, deadlocks, shell scripting.",
        "R for Data Science: Data manipulation, regression, visualization, statistical inference.",
        "Theory of Computation: Automata, Turing machines, decidability, computational complexity."
    ],
    "Semester_5": [
        "Computational Complexity: P vs NP, algorithm efficiency, quantum computing, probabilistic algorithms.",
        "Cryptography and Network Security: Encryption, decryption, RSA, AES, hashing, blockchain.",
        "Intelligent Data Analysis: Data mining, clustering, decision trees, anomaly detection.",
        "Natural Language Processing: Text processing, sentiment analysis, transformers, summarization.",
        "Pattern Recognition and Machine Learning: Image recognition, CNN, clustering, feature extraction.",
        "Vocational Training: Hands-on skills, certifications, internships, industry collaboration."
    ],
    "Semester_7": [
        "Software Engineering: Agile, SDLC, testing, DevOps, risk analysis, documentation.",
        "Big Data Analytics: Hadoop, Spark, NoSQL, real-time processing, distributed systems.",
        "Image Processing: Filters, segmentation, object recognition, real-time processing.",
        "Data Wrangling: Cleaning, transformation, preprocessing, anomaly detection, standardization."
    ]
}

# Dictionary mapping Semesters to their corresponding full subject names
Semester_subject_names = {
    "Semester_1": [
        "Engineering Mathematics I",
        "Environmental Science",
        "Foundations of Electronics Engineering",
        "Fundamentals of Computational Biology",
        "Language and Writing Skills",
        "Learning Programming Concepts With C",
        "Professional Ethics and Life Skills"
    ],
    "Semester_2": [
        "Data Structure Using C",
        "Digital Logic and Design",
        "Engineering Mathematics II",
        "Entrepreneurship",
        "Object-Oriented Programming",
        "Python for Data Science"
    ],
    "Semester_3": [
        "Analysis and Design of Algorithm",
        "Computer Organization and Architecture",
        "Database Management System",
        "Discrete Structure",
        "Independent Project",
        "Probability and Statistics"
    ],
    "Semester_4": [
        "Artificial Intelligence Principles and Applications",
        "Computer Network",
        "Data Visualization",
        "Operating System",
        "R for Data Science",
        "Theory of Computation"
    ],
    "Semester_5": [
        "Computational Complexity",
        "Cryptography and Network Security",
        "Intelligent Data Analysis",
        "Natural Language Processing",
        "Pattern Recognition and Machine Learning",
        "Vocational Training"
    ],
    "Semester_7": [
        "Software Engineering",
        "Big Data Analytics",
        "Image Processing",
        "Data Wrangling",
    ]
}

# Directory containing the input CSV files
input_directory_path = r"C:\Users\rawat\Documents\Semester Notes\5 Semester\Minor Project\Codes\Dataset\Syllabus_Data"

# Directory where the preprocessed CSV files will be saved
output_directory_path = r"C:\Users\rawat\Documents\Semester Notes\5 Semester\Minor Project\Codes\Dataset\Preprocessed_Dataset"

# Iterate through each CSV file in the input directory
for file_name in os.listdir(input_directory_path):
    if file_name.endswith('.csv'):
        # Find the semester key (e.g. "Semester_1") that appears in the file name, ignoring case
        Semester_found = None
        for Semester in Semester_subject_names:
            if Semester.lower() in file_name.lower():
                Semester_found = Semester
                break

        if Semester_found:
            file_path = os.path.join(input_directory_path, file_name)

            # Load the CSV into a DataFrame
            df = pd.read_csv(file_path)

            # Iterate over the DataFrame rows
            for i, row in df.iterrows():
                if pd.notna(row['Question']) or pd.notna(row['Answer']):
                    # Extract words from both 'Question' and 'Answer' (str() guards against non-string cells)
                    question_words = set(re.findall(r'\b\w+\b', str(row['Question']).lower())) if pd.notna(row['Question']) else set()
                    answer_words = set(re.findall(r'\b\w+\b', str(row['Answer']).lower())) if pd.notna(row['Answer']) else set()
                    matched_subject = None

                    # Check for matches in the subject lists.
                    # Note: any shared word counts as a match, including common words such as "and" or "of".
                    for subject in Semester_subjects[Semester_found]:
                        subject_words = set(re.findall(r'\b\w+\b', subject.lower()))  # Extract words from the subject name and keywords
                        if question_words & subject_words or answer_words & subject_words:  # Check if any word matches
                            # Find the full name of the matched subject from Semester_subject_names (compared by first word)
                            matched_subject = next((s for s in Semester_subject_names[Semester_found] if s.split()[0].lower() == subject.split()[0].lower()), None)
                            break  # Only use the first matched subject

                    # Append the subject label, or "General" when nothing matched.
                    # A missing Question (NaN) is treated as an empty string so the concatenation cannot raise a TypeError.
                    label = f" (Related to: {matched_subject})" if matched_subject else " (General)"
                    question_text = str(row['Question']) if pd.notna(row['Question']) else ""
                    df.at[i, 'Question'] = (question_text + label).strip()

            # Save the modified DataFrame to the output directory with a "Preprocessed_" prefix
            preprocessed_file_name = f"Preprocessed_{file_name}"
            preprocessed_file_path = os.path.join(output_directory_path, preprocessed_file_name)
            df.to_csv(preprocessed_file_path, index=False)

            print(f"Updated {file_name} with subjects for {Semester_found}. Saved as {preprocessed_file_name}.")
        else:
            print(f"No valid Semester found in {file_name}. Skipping file.")

Updated Semester_1 Semester Syllabus_Question_Answer.csv with subjects for Semester_1. Saved as Preprocessed_Semester_1 Semester Syllabus_Question_Answer.csv.
Updated Semester_2 Semester Syllabus_Question_Answer.csv with subjects for Semester_2. Saved as Preprocessed_Semester_2 Semester Syllabus_Question_Answer.csv.
Updated Semester_3 Semester Syllabus_Question_Answer.csv with subjects for Semester_3. Saved as Preprocessed_Semester_3 Semester Syllabus_Question_Answer.csv.
Updated Semester_4 Semester Syllabus_Question_Answer.csv with subjects for Semester_4. Saved as Preprocessed_Semester_4 Semester Syllabus_Question_Answer.csv.
Updated Semester_5 Semester Syllabus_Question_Answer.csv with subjects for Semester_5. Saved as Preprocessed_Semester_5 Semester Syllabus_Question_Answer.csv.
Updated Semester_7 Semester Syllabus_Question_Answer.csv with subjects for Semester_7. Saved as Preprocessed_Semester_7 Semester Syllabus_Question_Answer.csv.
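
The matching loop above labels a row with the first subject that shares any word with it, so common words such as "and" or "of" can trigger a match. The following is a minimal sketch of an alternative, not the notebook's method: it drops a small, assumed stopword list and picks the subject with the largest keyword overlap. `STOPWORDS`, `tokenize`, and `match_subject` are hypothetical names, and the stopword list is illustrative only.

```python
import re

# Assumed stopword list for illustration; it is not taken from the notebook.
STOPWORDS = {"a", "an", "and", "for", "i", "ii", "in", "of", "the", "to", "using", "with"}


def tokenize(text):
    """Return the set of lowercase word tokens, minus the assumed stopwords."""
    return {word for word in re.findall(r"\b\w+\b", text.lower()) if word not in STOPWORDS}


def match_subject(question, answer, subjects):
    """Return the name of the subject sharing the most keywords with the row, or None."""
    row_words = tokenize(question) | tokenize(answer)
    best_name, best_overlap = None, 0
    for entry in subjects:
        # Entries follow the "Name: keyword, keyword, ..." layout used in Semester_subjects
        name = entry.split(":", 1)[0].strip()
        overlap = len(row_words & tokenize(entry))
        if overlap > best_overlap:  # Ties keep the earlier subject
            best_name, best_overlap = name, overlap
    return best_name
```

Because the subject name is taken from the text before the colon, this variant does not need the separate `Semester_subject_names` lookup; a row with no shared keywords returns `None` and can be labelled "(General)" as before.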
