# Python/Go/Java Project Requirements

## Project 1
## Read a PDF file from a folder. Refer to the PDF file Chemistry Questions.pdf
### Requirements
1. Store a PDF file in a folder called “/content”
2. Read PDF file from the folder
3. Write the content to a text file called “output.txt”
4. Store this file under the “/content” folder
### Error Handling
1. Handle the case where the folder is not available
2. Handle the case where the PDF file is not present in the content folder
3. Handle the case where the output.txt file is not available

In [4]:
pip install PyPDF2 pandas

Note: you may need to restart the kernel to use updated packages.


In [6]:
import os
import PyPDF2

In [8]:
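def read_pdf_from_folder(folder_path, filename):
    """Return the text of every page in the PDF, or None if it cannot be read."""
    # Report a missing folder separately from a missing file.
    if not os.path.isdir(folder_path):
        print(f"Folder not found: {folder_path}")
        return None

    file_path = os.path.join(folder_path, filename)
    if not os.path.isfile(file_path):
        print(f"File not found: {file_path}")
        return None

    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page_num, page in enumerate(reader.pages, start=1):
            text += f"\n--- Page {page_num} ---\n"
            text += page.extract_text() or "[No text found]"
        return text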

In [10]:
def write_to_text_file(text, output_file, output_folder):
    output_path = os.path.join(output_folder, output_file)
    try:
        # Create the output folder if it does not exist yet, then write the file.
        os.makedirs(output_folder, exist_ok=True)
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write(text)
    except OSError as e:
        print(f"Failed !!! Could not write {output_path}: {e}")
        return
    print(f"Success !!! Content written to {output_path}")

In [12]:
folder = "/Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/"  # Change this to your folder path
pdf_file = "Chemistry Questions.pdf"        # Change this to your PDF file name
output_file="output.txt"
content = read_pdf_from_folder(folder, pdf_file)
if content:
    write_to_text_file(content,output_file,folder)

Success !!! Content written to /Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/output.txt
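
The error-handling requirements for Project 1 separate a missing folder from a missing PDF. A minimal sketch of that check is given here, using only the standard library. The helper name `check_inputs` and the file name `a.pdf` are assumptions made for illustration and do not appear in the notebook.

```python
from pathlib import Path
import tempfile


def check_inputs(folder, pdf_name):
    """Return None if the PDF is present, else a message naming what is missing.

    Hypothetical helper, not part of the notebook cells.
    """
    folder_path = Path(folder)
    if not folder_path.is_dir():
        return f"Folder not found: {folder_path}"
    pdf_path = folder_path / pdf_name
    if not pdf_path.is_file():
        return f"PDF not found: {pdf_path}"
    return None


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    missing_folder = check_inputs(Path(tmp) / "nope", "a.pdf")
    missing_file = check_inputs(tmp, "a.pdf")
    (Path(tmp) / "a.pdf").write_bytes(b"%PDF-1.4\n")  # placeholder file, not a real PDF
    all_good = check_inputs(tmp, "a.pdf")

print(missing_folder)
print(missing_file)
print(all_good)
```

Returning a message instead of printing inside the helper lets the caller decide whether to log it, raise, or skip the file.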


## Project 2
## Traverse through folder tree and filter PDF files
### Requirements
1. Add sub-folders called “One”, “Two”, “Three” under the folder called “/content”
2. Add PDF files under each of the sub-folders
3. Load all PDF files under the sub-folders and load the PDF content
4. Write the content to a text file called “output.txt” under each sub-folder respectively
### Error Handling
1. Handle the case where the folder is not available
2. Handle the case where the PDF file is not present in a sub-folder
3. Handle the case where the output.txt file is not available in a sub-folder


In [15]:
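def traverse_subFolder(main_folder):
    # Collect every sub-folder (at any depth) under main_folder.
    if not os.path.isdir(main_folder):
        print(f"Folder not found: {main_folder}")
        return []
    subfolders = []
    for root, dirs, files in os.walk(main_folder):
        for dir_name in dirs:  # avoid shadowing the built-in dir()
            subfolders.append(os.path.join(root, dir_name))
    return subfolders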


subfolders = traverse_subFolder(folder)

print("Subfolders found under", folder)
for sub in subfolders:
    content = read_pdf_from_folder(sub, pdf_file)
    if content:
        write_to_text_file(content,output_file,sub)


Subfolders found under /Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/
Success !!! Content written to /Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/One/output.txt
Success !!! Content written to /Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/One/Two/output.txt
Success !!! Content written to /Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/One/Two/Three/output.txt
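
The cell above reads one fixed file name from each sub-folder. A possible variant that filters for every `.pdf` file while walking the tree is sketched here. The function name `pdfs_by_folder` and the demo folder layout are assumptions for illustration. Folders without PDFs map to an empty list, so the caller can report the "PDF not present" case.

```python
import os
import tempfile


def pdfs_by_folder(main_folder):
    """Map each sub-folder (path relative to main_folder) to its sorted PDF names."""
    result = {}
    for root, _dirs, files in os.walk(main_folder):
        rel = os.path.relpath(root, main_folder)
        if rel == ".":
            continue  # skip the top-level folder itself
        result[rel] = sorted(f for f in files if f.lower().endswith(".pdf"))
    return result


# Demonstration with a made-up layout: One/ holds a PDF, Two/ is empty.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "One"))
    os.makedirs(os.path.join(tmp, "Two"))
    for name in ("a.pdf", "notes.txt"):
        with open(os.path.join(tmp, "One", name), "wb") as fh:
            fh.write(b"")
    found = pdfs_by_folder(tmp)

print(found)
```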


## Project 3
## Read content from a particular page
### Requirements
1. Update project 1 to change how the content is read
2. Take a page number as an input from command prompt
3. Read content of the page number provided and write to the output file
### Error Handling
1. Handle the case where the folder is not available
2. Handle the case where the PDF file is not present in a sub-folder
3. Handle the case where the output.txt file is not available in a sub-folder


In [18]:
def read_pdf_from_folder_based_page_number(folder_path, filename, pagenumber):
    file_path = os.path.join(folder_path, filename)

    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        return

    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        total_pages = len(reader.pages)
        # Page numbers are 1-based; reject anything outside the document.
        if not 1 <= pagenumber <= total_pages:
            print(f"Page {pagenumber} is out of range (1-{total_pages})")
            return
        text = f"\n--- Page {pagenumber} ---\n"
        text += reader.pages[pagenumber - 1].extract_text() or "[No text found]"
        return text

In [23]:
folder = "/Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/"  # Change this to your folder path
pdf_file = "Chemistry Questions.pdf"        # Change this to your PDF file name
output_file="output.txt"
pagenumber = 1
content = read_pdf_from_folder_based_page_number(folder, pdf_file,pagenumber)
if content:
    write_to_text_file(content,output_file,folder)
else:
    print(f"Failed !!! No content found for page {pagenumber}")

Success !!! Content written to /Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/output.txt
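
The requirement asks for the page number to come from the command prompt, while the cell above hardcodes it. One way to do that with `argparse` is sketched here. The flag name `--page` and the function `parse_page_number` are assumptions for illustration. Inside a notebook, `int(input(...))` would be the closer equivalent.

```python
import argparse


def parse_page_number(argv, total_pages):
    """Parse --page from argv and validate it as a 1-based page number."""
    parser = argparse.ArgumentParser(description="Extract one page from a PDF")
    parser.add_argument("--page", type=int, required=True, help="1-based page number")
    args = parser.parse_args(argv)
    if not 1 <= args.page <= total_pages:
        raise ValueError(f"page must be between 1 and {total_pages}, got {args.page}")
    return args.page


# In a script this would be parse_page_number(sys.argv[1:], len(reader.pages)).
page = parse_page_number(["--page", "2"], total_pages=5)
print(page)
```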


## Project 4
## Read regular expression from a config file and extract content
### Requirements
1. Update project 3
2. Add support for a configuration file
3. In the configuration file, set a key “regex” whose value is a regular expression that matches part of the content in the PDF
4. Update the code to extract only the content matching the regular expression
5. Write to the output file
### Error Handling
1. Handle the case where the folder is not available
2. Handle the case where the PDF file is not present in a sub-folder
3. Handle the case where the output.txt file is not available in a sub-folder
4. Handle the case where no configuration file is available
5. Handle the case where the configuration file does not have the regular expression
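
The last two error cases (no configuration file, no regular expression in it) can be told apart when the config is loaded. A minimal sketch is given here, assuming the config is a JSON object with a `regex` key. The function name `load_regex` and the sample pattern are assumptions for illustration.

```python
import json
import re
import tempfile
from pathlib import Path


def load_regex(config_path):
    """Return a compiled pattern from the "regex" key, or raise ValueError with the reason."""
    path = Path(config_path)
    if not path.is_file():
        raise ValueError(f"config file not found: {path}")
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"config is not valid JSON: {exc}") from exc
    pattern = config.get("regex") if isinstance(config, dict) else None
    if not pattern:
        raise ValueError('config has no "regex" key')
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid regular expression: {exc}") from exc


# Demonstration with a made-up config that matches decimal numbers.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"regex": r"\d+\.\d+"}), encoding="utf-8")
    compiled = load_regex(cfg)
    found = compiled.findall("pH 7.25 at 25.0 C")

print(found)
```

Compiling the pattern at load time also surfaces an invalid regular expression before any PDF is opened.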


In [35]:
import os
import json
import re
import PyPDF2

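def load_config(config_path):
    # Return the "regex" value, or "" when the file or the key is missing.
    try:
        with open(config_path, 'r', encoding='utf-8') as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        print(f"Failed to load config {config_path}: {e}")
        return ""
    return config.get("regex", "")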

def extract_matching_text_from_pdf(pdf_path, pattern):
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            matches = []
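            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    # group(0) is the whole match; re.findall would return only the
                    # captured groups (as tuples) when the pattern contains groups.
                    matches.extend(m.group(0) for m in re.finditer(pattern, page_text))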
            return "\n".join(matches)
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
        return ""

def process_folder_with_regex(folder_path, regex_pattern):
    output = ""
    for filename in os.listdir(folder_path):
        if filename.lower().endswith(".pdf"):
            pdf_path = os.path.join(folder_path, filename)
            matched_text = extract_matching_text_from_pdf(pdf_path, regex_pattern)
            if matched_text:
                output += f"From {filename}:\n{matched_text}\n\n"
    if output:
        output_path = os.path.join(folder_path, "output.txt")
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(output)
        print(f"✔ Matched content written to {output_path}")

def get_all_subfolders(main_folder):
    return [os.path.join(root, d)
            for root, dirs, _ in os.walk(main_folder)
            for d in dirs]

In [39]:
config_path = os.path.join(folder, "config.json")
regex_pattern = load_config(config_path)

print(f"🔍 Using regex: {regex_pattern}")

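# `base_folder` is not defined in the cells shown; the content folder is assumed here.
subfolders = get_all_subfolders(folder)
for subfolder in subfolders:  # separate name so `folder` is not overwritten
    process_folder_with_regex(subfolder, regex_pattern)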

🔍 Using regex: \d+\.\d+\s*(×|\\times|x|\*)\s*10\s*(\^|(\^{))?\s*-?\d+\}?
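
The regex printed above contains capturing groups, which changes what `re.findall` returns. A short illustration of the difference between `re.findall` and `re.finditer` with `group(0)` is given here. The sample sentence is made up for the demonstration.

```python
import re

# Pattern copied from the output above; it contains three capturing groups.
pattern = r"\d+\.\d+\s*(×|\\times|x|\*)\s*10\s*(\^|(\^{))?\s*-?\d+\}?"
sample = "Avogadro's number is 6.022 × 10^23."  # made-up sample text

# findall returns one tuple of captured groups per match, not the matched text.
groups_only = re.findall(pattern, sample)

# finditer plus group(0) returns the full matched text.
whole_matches = [m.group(0) for m in re.finditer(pattern, sample)]

print(groups_only)    # [('×', '^', '')]
print(whole_matches)  # ['6.022 × 10^23']
```

Joining the `findall` result with `"\n".join(...)` would raise a `TypeError` because the items are tuples, so `group(0)` (or non-capturing groups `(?:...)`) is the safer choice when the whole match is wanted.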


## Project 5
## Store extracted questions in MySQL
### Requirements
1. Update project 4 and add support for a database
2. Create a database to store the following:
   1. Subject Name
   2. Question Text
   3. Answer options
   4. Chapter name
3. Load a PDF containing questions
4. Extract each question as per a regular expression
5. Store each question in the database
### Error Handling
1. Handle the case where the database is not available
2. Handle the case where the table is not available
3. Handle any errors raised by DB operations


In [62]:
import os
import re
import json
import pandas as pd
from sqlalchemy import create_engine
from PyPDF2 import PdfReader

# ---------- CONFIG ----------
folder = "/Users/naganatarajan/Desktop/GEN_AI_Tasks/my_python_tasks/content/" 
config_path = os.path.join(folder, "config.json")
PDF_PATH = os.path.join(folder,"Chemistry Questions.pdf")
MYSQL_URI = "mysql+pymysql://root:admin1234@localhost:3306/natarajan_main"  # Update credentials

# ---------- LOAD CONFIG ----------
def load_config(path):
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except Exception as e:
        print(f" Failed to load config: {e}")
        return {}


# ---------- PDF TO TEXT ----------
def extract_text(pdf_path):
    with open(pdf_path, 'rb') as f:
        reader = PdfReader(f)
        # Extract each page once and skip pages that yield no text.
        page_texts = (page.extract_text() for page in reader.pages)
        return "\n".join(t for t in page_texts if t)

# ---------- PARSE QUESTIONS ----------
def extract_questions(text, pattern):
    questions = []
    chapters = re.split(r"(Chapter\s+\d+:\s+.*?)\n", text)
    subject = "Chemistry"

    for i in range(1, len(chapters), 2):
        chapter = chapters[i].strip()
        body = chapters[i + 1]

        for match in re.finditer(pattern, body, re.DOTALL):
            block = match.group(1).strip()
            lines = block.split("\n")
            question_text = lines[0].strip()
            options = {l[0]: l[3:].strip() for l in lines[1:5] if len(l) > 3}
            #correct_letter = re.search(r"Answer:\s+([A-D])", block).group(1)
            #correct_text = options.get(correct_letter, "")

            questions.append({
                "Subject_Name": subject,
                "Chapter_Name": chapter,
                "Question_Text": question_text,
                "Answer_option_A": options.get('A', ''),
                "Answer_option_B": options.get('B', ''),
                "Answer_option_C": options.get('C', ''),
                "Answer_option_D": options.get('D', '')
                #"correct_answer": correct_text
            })

    return pd.DataFrame(questions)

# ---------- WRITE TO MYSQL ----------
def write_to_db(df):
    try:
        engine = create_engine(MYSQL_URI)
        with engine.begin() as conn:  # begin() commits on success, rolls back on error
            df.to_sql(name='Questions_Table', con=conn, if_exists='append', index=False)
        print(f" {len(df)} questions inserted into MySQL.")
    except Exception as e:
        print(f" Error writing to DB: {e}")

# ---------- MAIN ----------
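def main():
    config = load_config(config_path)
    # load_config returns {} on failure, so check for the key before using it.
    pattern = config.get("question_regex")
    if not pattern:
        print(' Config is missing the "question_regex" key; nothing to extract.')
        return
    text = extract_text(PDF_PATH)
    df = extract_questions(text, pattern)
    print(df.head())  # Preview first few rows
    write_to_db(df)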

if __name__ == "__main__":
    main()

  Subject_Name                            Chapter_Name  \
0    Chemistry  Chapter 1: Basic concepts of chemistry   
1    Chemistry  Chapter 1: Basic concepts of chemistry   
2    Chemistry  Chapter 1: Basic concepts of chemistry   
3    Chemistry  Chapter 1: Basic concepts of chemistry   
4    Chemistry  Chapter 1: Basic concepts of chemistry   

                                       Question_Text  \
0                    1. What is the SI unit of mass?   
1  2. Which of the following is an example of a c...   
2                      3. What is Avogadro's number?   
3  4. Which of the following elements has the hig...   
4        5. What is the chemical form ula for water?   

                            Answer_option_A                   Answer_option_B  \
0                                  Gram (g)                     Kilogram (kg)   
1                            Melting of ice                  Cutting of paper   
2  6.022×10236.022 \times 10^{23}6.022×1023  3.14×1033.14 \times 10^33.
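
The value of `question_regex` is not shown in the notebook, so the following is only a sketch of how a question block could be split into question text and options with regular expressions. The layout (`A) ...` to `D) ...`), both patterns, and options C and D in the sample are assumptions for illustration.

```python
import re

# Hypothetical layout: a numbered question line followed by four option lines.
QUESTION_RE = re.compile(
    r"^(\d+\.\s.*?)\n((?:[A-D]\)\s.*\n?){4})",
    re.MULTILINE,
)
OPTION_RE = re.compile(r"^([A-D])\)\s*(.*)$", re.MULTILINE)


def parse_questions(body):
    """Return one dict per question with its text and lettered options."""
    questions = []
    for match in QUESTION_RE.finditer(body):
        options = dict(OPTION_RE.findall(match.group(2)))
        questions.append({"Question_Text": match.group(1).strip(), **options})
    return questions


# Sample text; options C and D are made up for the demonstration.
sample = (
    "1. What is the SI unit of mass?\n"
    "A) Gram (g)\n"
    "B) Kilogram (kg)\n"
    "C) Pound (lb)\n"
    "D) Ounce (oz)\n"
)
parsed = parse_questions(sample)
print(parsed)
```

Matching the option letter with a pattern avoids relying on fixed character positions, which helps when PDF extraction inserts stray spaces.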

In [54]:
pip install pymysql

Collecting pymysql
  Downloading PyMySQL-1.1.1-py3-none-any.whl.metadata (4.4 kB)
Downloading PyMySQL-1.1.1-py3-none-any.whl (44 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.1.1
Note: you may need to restart the kernel to use updated packages.
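
The Project 5 error-handling requirements (database unavailable, table missing, failed operations) can be sketched without a running MySQL server. The example below swaps in the standard-library `sqlite3` module as a stand-in, so the connection call and SQL dialect differ from the MySQL setup above. The function name `store_questions` is an assumption; the table and column names follow the notebook.

```python
import sqlite3

# CREATE TABLE IF NOT EXISTS covers the "table is not available" case.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS Questions_Table (
    Subject_Name TEXT, Chapter_Name TEXT, Question_Text TEXT,
    Answer_option_A TEXT, Answer_option_B TEXT,
    Answer_option_C TEXT, Answer_option_D TEXT
)
"""
INSERT_SQL = "INSERT INTO Questions_Table VALUES (?, ?, ?, ?, ?, ?, ?)"


def store_questions(db_path, rows):
    """Insert rows, creating the table if needed. Returns rows stored, or -1 on a DB error."""
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.Error as exc:
        print(f"Database not available: {exc}")
        return -1
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(CREATE_SQL)
            conn.executemany(INSERT_SQL, rows)
        return len(rows)
    except sqlite3.Error as exc:
        print(f"Database operation failed: {exc}")
        return -1
    finally:
        conn.close()


rows = [(
    "Chemistry",
    "Chapter 1: Basic concepts of chemistry",
    "1. What is the SI unit of mass?",
    "Gram (g)", "Kilogram (kg)", "", "",
)]
stored = store_questions(":memory:", rows)
failed = store_questions(":memory:", [("too", "few")])  # wrong column count
print(stored, failed)
```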
