# Text Markdown Formatting and Processing with Python
This notebook contains a set of Python utilities that convert raw text into structured Markdown suitable for web presentation. It is aimed at web developers, content creators, and anyone who wants to improve the readability and structure of text automatically.

## Overview
1. Import and Setup: The code begins by importing the required libraries and loading the resources it depends on: machine learning models, tokenizers, and text processors.

2. Utility Functions: A collection of utility functions provides the backbone of the text processing workflow. They detect bullet points, identify links, and format text snippets.

3. Text Processing: The main functionality segments the text into "tiles", determines the format of each line within a tile, and reassembles the tiles into a structured document.

4. Execution: Finally, sample texts are processed and printed to show what the functions produce.

## Key Features
1. Bullet Point Identification: The tool can detect bullet points and format them correctly for markdown.

2. Link Detection: All URLs in the text are identified, whether they point to web pages or images. This feature ensures that links are presented correctly in the final output.

3. Code Detection: Using a machine learning model, the tool can identify code snippets within the text and format them within markdown code blocks.

4. Title Generation: Titles and sub-titles are generated automatically with a T5 transformer model, which gives the output a clear heading structure.

5. Text Segmentation: Using the TextTilingTokenizer from the NLTK library, the text is segmented into logical "tiles" or blocks. Each block is then processed and formatted.

6. Table Conversion: If there's tabular data within the text, the tool can convert it into a structured markdown table format.

In short, the tool converts raw text into structured Markdown automatically, which helps when preparing a technical document, a blog post, or similar content.
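
As an illustration of the link-detection feature, here is a minimal sketch that wraps each URL in a line in Markdown link syntax while leaving the surrounding text untouched. The function name `format_inline_links`, the trailing-punctuation handling, and the `http://` fallback for `www.` addresses are assumptions made for this example; they are not part of the notebook's code.

```python
import re

# URL shape: an http(s) scheme or a leading "www.", followed by non-space characters.
# Case-insensitivity is requested through the flags argument.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def format_inline_links(line):
    """Wrap every URL found in `line` in Markdown link syntax."""
    def to_markdown(match):
        url = match.group(0)
        # Sentence punctuation right after a URL is usually not part of it.
        stripped = url.rstrip(".,;:!?")
        trailing = url[len(stripped):]
        # Assumption: scheme-less "www." addresses are given an http:// prefix
        # so the Markdown link is not treated as a relative path.
        href = stripped if "://" in stripped else "http://" + stripped
        return f"[{stripped}]({href}){trailing}"
    return URL_PATTERN.sub(to_markdown, line)

print(format_inline_links("Download Python from https://www.python.org. Then install Flask."))
```

Keeping the link text equal to the URL is a deliberately simple choice; a real tool might prefer a shorter label.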



## 1. Import Statements and Initial Setup
The first section imports the required libraries and performs the initial setup: loading the models and tokenizer, setting an environment variable, and checking for NLTK resources.

In [1]:
import re
import math
import pandas as pd
import numpy as np
import requests
import nltk
import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.signal import argrelextrema
from fastapi import FastAPI, HTTPException, Request
from typing import Optional
import joblib
from helper import preprocess
from nltk.tokenize.texttiling import TextTilingTokenizer
from fastapi.responses import Response
from table_generation import generate_table_endpoint
from transformers import AutoTokenizer, T5ForConditionalGeneration


os.environ["TOKENIZERS_PARALLELISM"] = "false"

    
# 1. Load the code-detection classifier used by is_code
model = joblib.load('code_model.joblib')

sentence_transformer_model = SentenceTransformer('all-mpnet-base-v2')

# 2. Load the model and tokenizer for headings
tokenizer = AutoTokenizer.from_pretrained("czearing/article-title-generator")
heading_model = T5ForConditionalGeneration.from_pretrained("czearing/article-title-generator")


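# 3. Check NLTK resources and download if necessary
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# TextTilingTokenizer falls back to NLTK's English stopword list,
# so make sure that corpus is available as well
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')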

## 2. Utility Functions
This section contains several utility functions to assist with processing the text. These functions determine various attributes of the text, such as whether it's a bullet point or a link, or they format the text into a specific style, such as markdown.
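
The bullet-point pattern used in this section is quite permissive: any line that starts with a run of letters or digits followed by `.`, `)`, `-`, or `*` counts as a bullet, so ordinary prose such as "e.g. see below" also matches. The sketch below contrasts it with a stricter variant. `STRICT_BULLET` and `is_strict_bullet` are hypothetical names introduced for this example, and the stricter rules (marker followed by whitespace, short markers only) are an assumption about what a bullet should look like, not the notebook's behaviour.

```python
import re

# The loose pattern from this section, copied here for comparison.
LOOSE_BULLET = re.compile(r'^\s*(?:[iIvVxX]{1,4})*(?:[a-zA-Z0-9])*(?:[.\)\-\*])\s?([^.\n]*)')

# Hypothetical stricter variant: either a symbol marker (*, +, -) or a short
# number / single letter / roman numeral followed by "." or ")", and in both
# cases at least one whitespace character before the item text.
STRICT_BULLET = re.compile(r'^\s*(?:[*+\-]|(?:\d{1,3}|[a-zA-Z]|[ivxIVX]{1,4})[.)])\s+\S')

def is_strict_bullet(line):
    return bool(STRICT_BULLET.match(line))

samples = ["1. Linear regression", "* Spur Gear", "iv) Fourth item",
           "e.g. see below", "Flask. A micro framework"]
for sample in samples:
    loose = bool(LOOSE_BULLET.match(sample))
    print(f"{sample!r}: loose={loose}, strict={is_strict_bullet(sample)}")
```

The trade-off is that markers written without a following space, such as `*COG:`, are no longer recognised by the stricter variant.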




In [2]:
# Utility function to check if the line starts with a bullet point
def is_bullet_point(line):
    # Regular expression pattern to match bullet points
    bullet_point_regex = r'^\s*(?:[iIvVxX]{1,4})*(?:[a-zA-Z0-9])*(?:[.\)\-\*])\s?([^.\n]*)'
    return bool(re.match(bullet_point_regex, line))

# Utility function to detect hyperlinks in the text
def detect_links(text):
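    # Regular expression pattern to match URLs. The case-insensitive flag is passed
    # to re.search because an inline (?i) in mid-pattern is rejected by newer Python versions
    pattern = r"(?:https?://|www\.)\S+"
    match = re.search(pattern, text, re.IGNORECASE)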
    return bool(match)

# Utility function to find image links in the text
def find_image_links(text):
    # Regular expression pattern to match image URLs (case-insensitive)
    url_pattern = re.compile(r"(?:https?://)?(?:www\.)?\S+(?:/\S+)*\.(?:jpg|jpeg|png|gif|bmp|svg)", re.IGNORECASE)
    matches = re.findall(url_pattern, text)
    return bool(matches)

# Reversed sigmoid function to transform values
def rev_sigmoid(x: float) -> float:
    return (1 / (1 + math.exp(0.5*x)))

# Utility function to activate the similarities array
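def activate_similarities(similarities: np.ndarray, p_size=10) -> np.ndarray:
    # Never use more weights than there are sentences, otherwise the padding below would be negative
    p_size = min(p_size, similarities.shape[0])
    x = np.linspace(-10, 10, p_size)
    y = np.vectorize(rev_sigmoid)
    activation_weights = np.pad(y(x), (0, similarities.shape[0]-p_size))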
    diagonals = [similarities.diagonal(each) for each in range(0, similarities.shape[0])]
    diagonals = [np.pad(each, (0, similarities.shape[0]-len(each))) for each in diagonals]
    diagonals = np.stack(diagonals)
    diagonals = diagonals * activation_weights.reshape(-1, 1)
    activated_similarities = np.sum(diagonals, axis=0)
    return activated_similarities

# Utility function to check if a line is nested (starts with a tab)
def is_Nested(line):
    return line.startswith('\t')

# Function to generate a title for the given text using a pretrained model
def generate_title(text):
    input_text = "summarize: " + text
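    # Truncate long inputs to 512 tokens, the maximum sequence length of this model
    input_tensor = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=512)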
    generated = heading_model.generate(input_tensor, max_length=20, num_return_sequences=1, top_k=50, top_p=0.95, do_sample=True, early_stopping=False, num_beams=2)
    generated_title = tokenizer.decode(generated[0], skip_special_tokens=True)
    return generated_title

# Utility function to check if the text is a code snippet
def is_code(text):
    # Check for empty or whitespace-only strings
    if not text.strip():
        return False
    # Predict if the text is a code snippet or not
    prediction = model.predict(text)
    return prediction[0] == "Code"

# Function to generate a main heading for the text
def get_main_heading(text):
    return "# " + generate_title(text)

# Function to generate a sub-heading for the text
def get_sub_heading(text):
    return "## " + generate_title(text)

# Utility function to check if the text has multiple paragraphs
def has_multiple_paragraphs(text):
    paragraphs = text.split('\n\n')
    return len(paragraphs) > 1
    
# Utility function to format bullet points in the text
def format_bullet_points(line):
    # Extract the bullet point marker from the line
    bullet_point_marker = re.search(r'^\s*(?:[iIvVxX]{1,4})*(?:[a-zA-Z0-9])*(?:[.\)\-\*])\s?', line).group()
    # Format nested and non-nested bullet points differently
    if is_Nested(line):
        # Replace only the leading marker, not later occurrences in the line
        return line.replace(bullet_point_marker, '\t+ ', 1)
    return line.replace(bullet_point_marker, '1. ', 1)

# Utility function to format image links in the text
def format_image_links(line):
    # Regular expression pattern to match image URLs (case-insensitive)
    url_pattern = re.compile(r"(?:https?://)?(?:www\.)?\S+(?:/\S+)*\.(?:jpg|jpeg|png|gif|bmp|svg)", re.IGNORECASE)
    # Replace image URLs with markdown image syntax
    return re.sub(url_pattern, r"![Image](\g<0>)", line)

# Utility function to format hyperlinks in the text
def format_links(line):
    return f"[Link]({line})"
    
# Utility function to format code blocks in the text
def format_code_block(line, is_in_code_block):
    if not is_in_code_block:
        # Start of a new code block
        return "```\n" + line, True
    # Inside a code block
    return line, True

# Utility function to end a code block
def format_end_code_block(line):
    return "```\n" + line, False

# Utility function to create paragraphs in the text
def create_paragraphs(text):
    sentences = text.split('. ')
    embeddings = sentence_transformer_model.encode(sentences)
    similarities = cosine_similarity(embeddings)
    activated_similarities = activate_similarities(similarities, p_size=5)
    minima = argrelextrema(activated_similarities, np.less, order=2)
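    split_points = set(minima[0])
    parts = []
    for num, each in enumerate(sentences):
        # Start a new paragraph at each local minimum of the activated similarities
        prefix = '\n\n' if num in split_points else ''
        parts.append(prefix + each)
    # Re-join with the separator removed by split('. '), without adding an extra one at the end
    return '. '.join(parts)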
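
To get a feel for what `rev_sigmoid`, `activate_similarities`, and the local-minimum search in `create_paragraphs` are doing, the following pure-Python sketch reproduces the same arithmetic on small lists. The helpers `activation_weights`, `activate`, and `local_minima` are hypothetical stand-ins written for this example; `local_minima` is only intended to behave like `argrelextrema(..., np.less, order=2)` on a one-dimensional list, and the `scores` values are made up.

```python
import math

def rev_sigmoid(x):
    return 1 / (1 + math.exp(0.5 * x))

def activation_weights(p_size):
    """Weights at p_size evenly spaced points on [-10, 10] (requires p_size >= 2)."""
    xs = [-10 + 20 * i / (p_size - 1) for i in range(p_size)]
    return [rev_sigmoid(x) for x in xs]

def activate(similarities, weights):
    """For sentence i, sum weights[k] * similarities[i][i + k] over in-range offsets k."""
    n = len(similarities)
    return [
        sum(w * similarities[i][i + k] for k, w in enumerate(weights) if i + k < n)
        for i in range(n)
    ]

def local_minima(values, order=2):
    """Indices strictly smaller than all neighbours within `order` steps.

    Out-of-range neighbours are clipped to the nearest edge, so the first and
    last positions are never reported.
    """
    n = len(values)
    found = []
    for i in range(n):
        neighbours = []
        for shift in range(1, order + 1):
            neighbours.append(values[max(i - shift, 0)])
            neighbours.append(values[min(i + shift, n - 1)])
        if all(values[i] < v for v in neighbours):
            found.append(i)
    return found

# Nearby sentences (small offsets) receive weights close to 1, distant ones close to 0.
weights = activation_weights(5)
print([round(w, 3) for w in weights])

# Made-up activated scores for nine sentences; dips suggest paragraph breaks.
scores = [0.9, 0.8, 0.3, 0.7, 0.9, 0.85, 0.2, 0.6, 0.9]
print(local_minima(scores))  # [2, 6]
```

Sentences at the reported indices are the ones that `create_paragraphs` would start a new paragraph with.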



## 3. Text Processing Functions
This section comprises the main functionality where the text is processed, segmented, and formatted. It includes converting the text into tiles, formatting each tile, and then assembling the entire formatted text.

In [3]:
# Custom string builder class for efficient string concatenation
class StringBuilder:
    def __init__(self):
        self._strings = []

    def append(self, value):
        self._strings.append(value)

    def __str__(self):
        return ''.join(self._strings)

# Function to process individual text tiles
def process_tile(tile):
    sb = StringBuilder()
    try:
        # Generate sub-heading for the tile
        sub_heading = get_sub_heading(tile)
        sb.append(sub_heading + '\n')
        # Process each line in the tile
        lines = tile.split('\n')
        is_in_code_block = False
        formatted_lines = []
        for line in lines:
            # Format the line based on its content
            if is_bullet_point(line):
                formatted_lines.append(format_bullet_points(line))
            elif find_image_links(line):
                formatted_lines.append(format_image_links(line))
            elif detect_links(line):
                formatted_lines.append(format_links(line))
            elif is_code(line) or line.strip().startswith('#'):
                line, is_in_code_block = format_code_block(line, is_in_code_block)
                formatted_lines.append(line)
            else:
                if is_in_code_block:
                    line, is_in_code_block = format_end_code_block(line)
                formatted_lines.append(line)
        # If we're inside a code block at the end of the tile, end it
        if is_in_code_block:
            formatted_lines.append("```")
        processed_tile = '\n'.join(formatted_lines)
        sb.append(processed_tile)
    except Exception as e:
        sb.append(f"An error occurred while processing the tile: {e}")
    return str(sb)

# Main function to process and format the text
def process_text(main_text, sub_text=""):
    sb = StringBuilder()
    # Check for empty main text or if sub_text is not part of main_text
    if not main_text:
        sb.append("Main text is empty.")
        return str(sb)
    if sub_text and sub_text not in main_text:
        sb.append("Sub_text is not part of the main_text.")
        return str(sb)
    main_text = main_text.replace(sub_text, "")
    try:
        if not has_multiple_paragraphs(main_text):
            main_text = create_paragraphs(main_text)
    except Exception as e:
        sb.append(f"Error while creating paragraphs: {e}")
        return str(sb)
    try:
        # Tokenize the main text into tiles
        tt = TextTilingTokenizer()
        tiles = tt.tokenize(main_text)
    except Exception as e:
        sb.append(f"Error during text tiling tokenization: {e}")
        return str(sb)
    try:
        # Generate a main heading for the text
        main_heading = get_main_heading(main_text)
        sb.append(main_heading + '\n\n')
    except Exception as e:
        sb.append(f"Error generating main heading: {e}")
        return str(sb)
    # Process each tile
    for tile in tiles:
        tile_result = process_tile(tile)
        sb.append(tile_result)
        sb.append('\n\n')
    # If there's sub_text, convert it to a table
    if sub_text:
        try:
            table = generate_table_endpoint(sub_text)
            sb.append(table)
        except Exception as e:
            sb.append(f"Error converting sub_text to table: {e}")
    return str(sb)
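
The code-block handling in `process_tile` is a small state machine: a fence is opened when the first code line appears, closed when the next non-code line arrives, and closed again at the end if the tile finishes inside a block. The sketch below isolates that logic. `looks_like_code` is a hypothetical keyword heuristic standing in for the trained classifier behind `is_code`, and `fence_code_lines` is a name introduced only for this example.

```python
def looks_like_code(line):
    """Hypothetical stand-in for the trained classifier used by is_code."""
    stripped = line.strip()
    starts_like_code = stripped.startswith(("import ", "from ", "def ", "print(", "return "))
    return starts_like_code or " = " in stripped

def fence_code_lines(lines):
    """Insert ``` fences around consecutive runs of code-like lines."""
    out = []
    in_code = False
    for line in lines:
        is_code_line = looks_like_code(line)
        if is_code_line and not in_code:
            out.append("```")   # open a fence before the first code line
            in_code = True
        elif not is_code_line and in_code:
            out.append("```")   # close the fence before the next prose line
            in_code = False
        out.append(line)
    if in_code:
        out.append("```")       # close a block that runs to the end
    return out

sample = [
    "Here is an example:",
    "import numpy as np",
    "X = np.random.rand(100, 1)",
    "That is all.",
]
print("\n".join(fence_code_lines(sample)))
```

Emitting the fences as separate list items, rather than prepending them to a line, keeps each output line independent when the result is joined with newlines.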

## 4. Execution
The last section runs `process_text` on several sample texts and prints the formatted output.

In [7]:
main_text = """
Welcome to the Ultimate Guide to Web Development with Python.Flask, a micro web framework written in Python, was invented by Armin Ronacher.It was initially released on April 1, 2010.This guide is intended to provide you with a comprehensive understanding of how to utilize Python for web development, from setting up your environment to deploying your first web application.To get started with web development in Python, you need to have Python installed on your machine. You can download the latest version of Python from the official website: https://www.python.org. Once you've installed Python, you can check the version by running the following command in your terminal or command prompt:
python --version. Next, let's set up a virtual environment for our project. A virtual environment is a self-contained directory that contains a Python installation for a particular version of Python, plus a number of additional packages. To create a virtual environment, navigate to your project directory and run:
python -m venv myenv. This will create a new directory called "myenv" in your project directory. To activate the virtual environment, use the following command:
On Windows: myenv\Scripts\activate. On macOS and Linux:source myenv/bin/activate. With our environment set up, we can now install the necessary packages. For web development, we'll be using the Flask framework. To install Flask, run: pip install Flask. Now, let's write our first simple web application using Flask. Create a new file called "app.py" and add the following code:
from flask import Flask
app = Flask(__name__)
@app.route('/')
def hello():
    return "Hello, World!"
if __name__ == "__main__":
    app.run(debug=True)
To run the application, navigate to the directory containing "app.py" in your terminal and run:python app.py. This will start a development server, and you should be able to see your application by visiting http://localhost:5000 in your web browser. For more advanced topics, such as integrating databases, user authentication, and deploying your application, you might want to check out the following resources: Flask Mega-Tutorial by Miguel Grinberg: https://blog.miguelgrinberg.com Flask Web Development with Python Tutorial series on YouTube: https://youtube.com/user/somechannel.Lastly, if you want to view some example projects or contribute to open-source Flask projects, there are numerous repositories available on GitHub. One of the popular repositories is located at: https://github.com/pallets/flask.In addition to online resources, there are also several books available on the topic. One highly recommended book is "Flask Web Development" by Miguel Grinberg. You can purchase it from various online retailers or check if it's available at your local library. If you have any images or assets for your web application, it's a good practice to store them in a separate directory, typically named "static". For instance, if you have a logo for your website, you might store it at: https://i.ytimg.com/vi/_If4ikS43_Q/maxresdefault.jpg. Thank you for going through this guide. We hope it serves as a helpful starting point on your web development journey with Python. Remember, practice is key, so start building and happy coding!
"""

sub_text = """Flask, a micro web framework written in Python, was invented by Armin Ronacher.It was initially released on April 1, 2010."""


In [8]:
main_text = """
In recent years, machine learning has emerged as a powerful tool for solving a wide range of problems. This field involves using algorithms to learn patterns from data, without being explicitly programmed to do so. Machine learning has been applied to a variety of domains, including image recognition, natural language processing, and recommendation systems.One popular type of machine learning is supervised learning. This involves training a model on a labeled dataset, where the desired output is already known. The model can then use this knowledge to make predictions on new, unlabeled data.Some common algorithms used in supervised learning include:
1. Linear regression 
2. Decision trees 
3. Random forests
The three machine learning types are supervised, unsupervised, and reinforcement learning.
Here's an example of how to implement linear regression in Python:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.random.rand(100, 1)
y = 2*X + 1 + 0.1*np.random.randn(100, 1)
model = LinearRegression()
model.fit(X, y)
print("Intercept:", model.intercept_)
print("Slope:", model.coef_)

Another type of machine learning is unsupervised learning. This involves training a model on an unlabeled dataset, and allowing it to discover patterns on its own. Unsupervised learning can be used for tasks such as clustering and anomaly detection.Some common algorithms used in unsupervised learning include:
1. K-means clustering 
2. Principal component analysis 
3. Autoencoders
Here's an example of how to perform K-means clustering in Python:
import numpy as np
from sklearn.cluster import KMeans
X = np.random.rand(100, 2)
model = KMeans(n_clusters=3)
model.fit(X)
print("Cluster centers:", model.cluster_centers_)
For more information
Check out this cool image https://av-eks-blogoptimized.s3.amazonaws.com/Screenshot-from-2019-08-08-14-47-20.png learn more about Machine Learning on https://machinelearningmastery.com/
"""

sub_text = """
The three machine learning types are supervised, unsupervised, and reinforcement learning.
"""


In [5]:
main_text = """
The gearbox is one of the building blocks of the modern car. Not surprisingly, it's also one of the most complex bits of hardware inside any vehicle.
A car's engine connects to its crankshaft, which  rotates thousands of times per minute, according to the book Tribological Processes in the Valve Train Systems with Lightweight Valves. This is too fast for the wheels, so gears convert the power to speeds the wheels can handle. Gears use interlocking teeth, connecting small, fast moving cogs to a larger gears with more teeth, and that larger cog rotates at a reduced speed. 
Lower gears have larger cogs and enable the engine to deliver high levels of force without the car moving quickly, which is ideal when you're driving slowly or uphill, according to the journal Transportation Research Part A: General. 
Higher gears deliver more speed rather than torque, which is great for smooth  driving on the freeway. 
Lower gears deliver more power, while higher gears supply more speed — but you've got to shift your way up through the gears to reach those high speeds. Most cars have at least five gears, which gives drivers finer control over power and speed.
Different types of gear and the applications and industries that utilize them.
* Spur Gear
* Helical Gear
* Double Helical Gear
* Herringbone Gear
* Bevel Gear
* Worm Gear
* Hypoid Gear

Gearboxes have been around for a while. According-to the AIP Conference Proceedings, the first modern manual transmission was demonstrated by French inventors Louis Rene Panhard and Emile Lavassor in 1894, and Louis Renault improved the design in 1898. In 1928, Cadillac introduced a synchronized system to make gear changes smoother.
Gears for other uses are even older than automobiles, and remains of ancient gears have been discovered dating back to 4th century China. The world’s oldest working clock has been using gears to tell the time in Salisbury Cathedral since 1386. 
Most vehicles still use either a manual or automatic gearbox, but some new vehicles have modern CVT or dual clutch systems, according to Driving.ca.
CVT stands for Continuously Variable Transmission. It's a type of automatic gearbox, but it differs from conventional hardware in one key way. Instead of using normal gears where drivers can feel the transitions, CVT gearboxes have a cone shaped design and a system of pulleys to create a smooth transition up and down the car's power range, according to the International Journal of Scientific & Engineering Research. This is an efficient system that's often used in hybrid vehicles.
Dual clutch systems are frequently found in semiautomatic cars, enabling the car to preselect gears, meaning the driver can shift gear incredibly quickly. This system is often present on high performance cars, and vehicles with gearshifting paddles on the steering wheel.
Manual gearboxes aren't immune from innovation, either. Volkswagen has produced a new manual gearbox called the MQ281 that improves efficiency and reduces emissions, according to Volkswagen's website.
Automatic gearboxes appeared commercially in the 1940s, according to the International Journal of Vehicle Design, and now they are often more popular than manual transmissions. In fact, a large majority of U.S. drivers use an automatic vehicle, according to Reader's Digest.
Automatic gearboxes remove the need for a clutch pedal, which disconnects the engine from the transmission in order to change gear. Instead, they use a torque converter system, relying on gearbox oil to transfer energy from the input shaft to the gears. Computers in the car determine exactly when gears need to be changed.
This enables smooth gear shifts and easier driving, and automatic gearboxes are often more efficient than their manual counterparts.
"""

sub_text = ""
print(process_text(main_text,sub_text))



# The Evolution of Automatic Gearboxes

## Gearboxes and Gearboxes

The gearbox is one of the building blocks of the modern car. Not surprisingly, it's also one of the most complex bits of hardware inside any vehicle.
A car's engine connects to its crankshaft, which  rotates thousands of times per minute, according to the book Tribological Processes in the Valve Train Systems with Lightweight Valves. This is too fast for the wheels, so gears convert the power to speeds the wheels can handle. Gears use interlocking teeth, connecting small, fast moving cogs to a larger gears with more teeth, and that larger cog rotates at a reduced speed. 
Lower gears have larger cogs and enable the engine to deliver high levels of force without the car moving quickly, which is ideal when you're driving slowly or uphill, according to the journal Transportation Research Part A: General. 
Higher gears deliver more speed rather than torque, which is great for smooth  driving on the freeway. 
Lower gears del

In [6]:
main_text = """
Capacitors are used in virtually every electronic circuit that is built today. Capacitors are manufactured in their millions each day, but there are several different types of capacitor that are available.

Each type of capacitor has its own advantages and disadvantages can be used in different applications - different sorts of electronic circuit design: general analogue design, RF design, digital circuits, etc.

Accordingly it is necessary to know a little about each capacitor type so that the correct one can be chosen for any given use, application, electronic circuit design, etc.

There are many variations including whether the capacitor is fixed or variable, whether it is leaded or uses surface mount technology, and of course the dielectric: aluminium electrolytic, tantalum, ceramic, plastic film, paper and more.

It is important to not only determine the right capacitor value, but also which capacitor type is appropriate for the particular circuit design. Each type has its own properties and some may be appropriate for low frequency circuit designs, others for power supplies and some for RF designs. Selecting the right type is crucial.

One of the main distinctions between various types of capacitor is whether they are polarised or not.

Essentially a polarised capacitor is one that must be run with the voltage across it in a certain polarity.
Some of the more popular types of polarised capacitor include the aluminium electrolytic and tantalums.

These electronic components are marked to indicate the positive or negative terminal and they should only be operated with a voltage bias in this direction - reverse bias can damage or destroy them. As capacitors perform many tasks like coupling and decoupling, there will be a permanent DC voltage across them, and they will pass only any AC components.

The other form of capacitor is a non-polarised or non-polar capacitor. This type of capacitor has no polarity requirement and it can be connected either way in a circuit design. Ceramic, plastic film, silver mica and a number of other capacitors are non-polar or non-polarised capacitors.

Another type distinction for capacitors is whether they are fixed or variable.

The greatest majority of capacitors by far are fixed capacitors, i.e. they do not have any adjustment. However in some instances it may be necessary to have an adjustable or variable capacitor where the value of the capacitor may need to be varied. Typically these capacitors are relatively low in value, sometimes having maximum values up to 1000pF. The uses for these tend to be in RF designs.

Variable capacitors may also be classified as variable and preset. The main variable ones may be adjusted by a control knob and may be used for tuning a radio, etc.

Preset variable capacitors normally have a screw adjustment and are intended to be adjusted during setup, calibration and test, etc. They are not intended to be adjusted in normal use.

There are very many different fixed value capacitor types that can be bought and used in electronic circuit designs.

These capacitors are generally categorised by the dielectric that is used within the capacitor as this governs the major properties: electrolytic, ceramic, silver mica, metallised plastic film and a number of others.

While the list below gives some of the major capacitor types, not all can be listed and described and there are some less well used or less common types that can be seen. However it does include most of the major capacitor types.

Ceramic capacitor:   As the name indicates, this type of capacitor gains its name from the fact that it uses a ceramic dielectric. This gives the many properties including a low loss factor, and a reasonable level of stability, but this depends upon the exact type of ceramic used.

Ceramic dielectrics do not give as high a level of capacitance per unit volume as some types of capacitor and as a result ceramic capacitors typically range in value from a few picofarads up to values around 0.1 µF.

For leaded components, disc ceramic capacitors are widely used. This type of ceramic capacitor is extensively for applications like decoupling and coupling applications. More highly specified capacitors, especially used in surface mount types of capacitor often have specific types of ceramic dielectric specified. The more commonly seen types include:

*COG: Normally used for low values of capacitance. It has a low dielectric constant, but gives a high level of stability.
*X7R: Used for higher capacitance levels as it has a much higher dielectric constant than COG, but a lower stability.
*Z5U: Used for even higher values of capacitance, but has a lower stability than either COG or X7R.
"""

sub_text = ""
print(process_text(main_text,sub_text))

# Major Types of Capacitors

## Capacitors — Different Types of Capacitors

Capacitors are used in virtually every electronic circuit that is built today. Capacitors are manufactured in their millions each day, but there are several different types of capacitor that are available.

Each type of capacitor has its own advantages and disadvantages can be used in different applications - different sorts of electronic circuit design: general analogue design, RF design, digital circuits, etc.

## How to select the right capacitor type for any given use, application, electronic circuit design, etc


Accordingly it is necessary to know a little about each capacitor type so that the correct one can be chosen for any given use, application, electronic circuit design, etc.

There are many variations including whether the capacitor is fixed or variable, whether it is leaded or uses surface mount technology, and of course the dielectric: aluminium electrolytic, tantalum, ceramic, plastic film, pape