# Regular Expressions Cleanup

### Background:
Regular expressions are used primarily for find and replace.
What distinguishes them from an ordinary text search, such as Ctrl+F in Microsoft Word, is their ability to match characters that fall into specific categories.
For example, consider the text below:

`th2e c87at jump43e328d ov02918102er2389 the wal090193l2891.`

Remove all digits and you'd be left with the sentence

`the cat jumped over the wall.`

There are several ways to do this.  You could:

<ol>
    <li> Do it manually.
        This becomes impractical when the same problem occurs many times in a much larger file. </li>
    <li> Use Ctrl+F to find each digit (1, 2, 3, 4, 5, 6, 7, 8, 9, 0) in turn and replace it with nothing.
        This works, but takes time. </li>
    <li> <b>Regular Expressions.</b> Search for '\d' (represents any digit 0-9) and replace with nothing.
        This is easiest.  </li>
</ol>     
    
Regular expressions are not limited to digits: they can also represent word boundaries, letters, spaces, newlines, and much more.
For an introduction to the basics, I recommend watching [this YouTube video](https://www.youtube.com/watch?v=sa-TUpSx1JA).
Additionally, [regex101](https://regex101.com/) is a great tool for testing regular expressions.
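
As a minimal sketch of the digit-removal example above, Python's built-in `re` module can do the whole job in one call. The variable names are illustrative only.

```python
import re

garbled = "th2e c87at jump43e328d ov02918102er2389 the wal090193l2891."

# \d matches any single digit; replacing each match with "" deletes it.
cleaned = re.sub(r"\d", "", garbled)

print(cleaned)  # the cat jumped over the wall.
```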

### Purpose:
When cataloging physics questions, you run into a fair amount of text that should be expressed in math mode.  
For example:

$\sqrt{x^2 + y^2}$ looks a whole lot nicer than (x^2 + y^2)^(1/2)

The goal of this Python program is to automatically locate and edit equations and variables within a text document so that they look professional when compiled in LaTeX.
Put simply, it changes ugly equations like the one on the right above into pretty equations like the one on the left.
We can do this with carefully chosen substitutions and by surrounding equation elements with tags.
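
To give a feel for what one such substitution could look like, here is a small sketch that rewrites a parenthesized expression raised to the power `(1/2)` as a LaTeX square root. This pattern is an assumption made for illustration; it is not one of the rules used by the program further down, and it only handles expressions without nested parentheses.

```python
import re

plain = "(x^2 + y^2)^(1/2)"

# Capture everything inside one pair of parentheses that is followed by ^(1/2).
# In the replacement, \\ produces a literal backslash and \g<1> is the captured text.
pretty = re.sub(r"\(([^()]*)\)\^\(1/2\)", r"$\\sqrt{\g<1>}$", plain)

print(pretty)  # $\sqrt{x^2 + y^2}$
```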

### Application:
There's a whole lot to unpack in this process, but at the most basic level, the program locates things that typically appear in equations (digits, math symbols, isolated single letters, etc.).
It then replaces each one with the same text, sandwiched between dollar signs.
For example, the program would take the text:

`If we consider the equation x + 1 = 873 where x is a constant, solve for x.`

and transform it into:

`If we consider the equation $x$ $+$ $1$ $=$ $8$$7$$3$ where $x$ is a constant, solve for $x$.`

After removing the spaces between adjacent dollar signs, we are left with:

`If we consider the equation $x$$+$$1$$=$$8$$7$$3$ where $x$ is a constant, solve for $x$.`

If we then delete every run of two or more consecutive dollar signs, we are left with:

`If we consider the equation $x+1=873$ where $x$ is a constant, solve for $x$.`

Notice that at this point, all equations and mathy things are sandwiched by dollar signs.  
Dollar signs delimit inline math mode in LaTeX, so all of our mathy things can now be typeset nicely!
I should note that this is a simple example, and the code below handles many more cases than this one.
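
The three steps walked through above can be sketched on their own, using a deliberately reduced set of patterns (digits, `+`, `=`, and isolated letters). This is a simplified illustration under those assumptions, not the full rule list; the variable names are made up for the example.

```python
import re

text = "If we consider the equation x + 1 = 873 where x is a constant, solve for x."

# Step 1: wrap every digit, "+", "=", and isolated lowercase letter b-z in dollar signs.
# (The range b-z skips the English word "a".) \g<0> stands for the whole match.
wrapped = re.sub(r"\d|[+=]|\b[b-z]\b", r"$\g<0>$", text)

# Step 2: remove a single space that sits between two dollar signs.
joined = re.sub(r"\$ \$", "$$", wrapped)

# Step 3: delete every run of two or more dollar signs, merging neighbouring pieces.
merged = re.sub(r"\${2,}", "", joined)

print(merged)
```

Each intermediate string matches the corresponding line of the walkthrough, which makes it easy to see why the deletion in step 3 has to come last: it relies on the earlier steps having placed dollar signs back to back.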

### Reflection:
<b><u>Positives</u></b>

This program works very well and will save lots of time.
I estimate that it catches about 98% of the things that I would like it to capture.
The remaining 2% should be edited manually.

<b><u>Negatives</u></b>

The program does not consistently capture or exclude units.
For example, both of the following transformations could occur in the same text document:

"10 cm" ------> "$10$ cm" 

"35cm" ------> "$35cm$"

The difference is whether or not the unit ends up in math mode.
Not a huge deal, but if you're being picky, this could be improved.
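
One possible improvement, sketched here as an assumption and not as part of the program, is a final pass that moves a known unit out of math mode when it is glued to a number. The `UNITS` list and the function name `move_units_out_of_math` are hypothetical, and a variable that happens to be named like a unit (for example `$2s$`) would be misread as a unit.

```python
import re

# Hypothetical unit list; extend it to whatever units the source texts use.
UNITS = ["cm", "mm", "kg", "m", "s"]

# A number directly followed by a known unit inside one pair of dollar signs, e.g. "$35cm$".
UNIT_IN_MATH = re.compile(r"\$(\d+(?:\.\d+)?)\s*(" + "|".join(UNITS) + r")\$")

def move_units_out_of_math(text):
    """Rewrite "$35cm$" as "$35$ cm", leaving everything else untouched."""
    return UNIT_IN_MATH.sub(r"$\g<1>$ \g<2>", text)

print(move_units_out_of_math("a rod of length $35cm$ and width $10$ cm"))
# a rod of length $35$ cm and width $10$ cm
```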

In [5]:
###############################################################################################
# Import basic libraries
###############################################################################################


import numpy as np
import re
import os

###############################################################################################
# Defining what to find, what to replace with
###############################################################################################

# These must run first: dollar signs introduced by later rules would keep these patterns from matching.
# Replacement templates are raw strings so that \g<1> is not an invalid Python string escape.
carrot = [r'\^[\{\(]?([\-]?[\d/]{1,4})[}\)]?', r'$^{\g<1>}']
underscore = [r'(.)_[\(\{]?([^\s}\)]{1,7})[}\)]?', r'$\g<1>_{\g<2>}']
expo = [r'(\d)E([\d\-]{1,3})', r'$$\g<1>^{\g<2>}']
carrot2 = [r'\^', '$$^$$']
mult = [r'([^a-z])x([^a-z])', r'\g<1>$\\times $\g<2>']

# Secondary finds, still somewhat sensitive
left_paren = [r'\(([^i])', r'$($$\g<1>']
right_paren = [r'([^iv])\)', r'$$\g<1>$$)$']
floating_variable = [r'(\b)([b-z])(\b)', r'$\g<1>\g<2>\g<3>$']
div = [r'/', '$/$']
plus = [r'\+', '$+$$']

# General replacements that could happen at almost any stage in the process.
# In a raw replacement template, \\ yields one literal backslash in the output.
pi = [r'([ \$])pi([ \$])', r'\g<1>$\\pi$\g<2>']
sqrt = [r'sqrt', r'$\\sqrt$$']
left_sqiuggle = [r'\{', '$${$$']
right_squiggle = [r'}', '$$}$']
deg = [r'([ \$])degrees([ \$\.])', r'$$^{\\circ}$\g<2>']
gamma = [r'gamma', r'$\\gamma $']
alpha = [r'alpha(?![- ]?part)', r'$\\alpha $']
rho = [r'rho', r'$\\rho $']
pm = [r'plusminus', r'$\\pm $']
sigma = [r'sigma', r'$\\sigma $']
theta = [r'theta', r'$\\theta $']
omega = [r'omega', r'$\\omega $']
omega2 = [r'ohm[s]?(?!i)', r'$\\Omega $']
lamb = [r'lambda', r'$\\lambda $']
mu = [r' mu ', r'$\\mu $']
epsilon = [r'epsilon', r'$\\epsilon $']
beta = [r'beta(?![- ]?part)', r'$\\beta $']
percent = [r'%', r'$\%$']
equals = [r'\=', '$=$']
minus = [r'\-', '$-$']

# Last stages before duplicate dollar sign removal
all_digits = [r'(\d)(?!}\\)', r'$\g<1>$']
remove_space = [r'\$ \$', '$$']
periods_stranded = [r'\$\.\$', '$$.$$']

# Big reveal: delete every run of two or more dollar signs so adjacent math pieces merge
remove_unnecissary_dollars = [r'[\$]{2,100}', '']

# Post-edits: undo math mode where the earlier rules were too eager
minus_cleanup = [r'([a-z \.]{2})\$\-\$([a-z \.]{2})', r'\g<1>-\g<2>']
div_cleanup = [r'\$/\$', '/']
i_cleanup = [r'\$i\$', 'i']
pos = [r'\$s\$', 's']
left_parenth_cleanup = [r'\$\(([a-z\. \\]{4})', r'(\g<1>']
right_parenth_cleanup = [r'([a-z\. \\]{4})\)\$', r'\g<1>)']
slow = [r'(?<=\{)([^}\$]*)\$', r'\g<1>']
last = [r'\$\\times \$', ' $x$ ']
last2 = [r'\$([^\s\d\\][a-z. ]{1,7}\S)\$', r'\g<1>']
carrot_cleanup = [r'\$([^\d\$]{1,3}\^)', r'\g<1>']
x_underscore = [r'\\times [\$]?_', 'x_']



###############################################################################################
# The order of the find/replace.  Follows how they are defined above.
###############################################################################################

orderz = [carrot,underscore,expo,carrot2,mult,left_paren,right_paren,
          floating_variable,div,plus,pi,sqrt,left_sqiuggle,
          right_squiggle,deg,gamma,alpha,rho,pm,sigma,theta,omega,
          omega2,lamb,mu,epsilon,beta,percent,equals,minus,all_digits,
          remove_space,remove_space,periods_stranded,
          remove_unnecissary_dollars,
          minus_cleanup,div_cleanup,i_cleanup,pos,left_parenth_cleanup,
          right_parenth_cleanup,slow,slow,slow,slow,slow,last,last2,
         carrot_cleanup,carrot_cleanup,x_underscore]

###############################################################################################
# Open file, find and replace, you are left with altered text as desired.
###############################################################################################

# Read the source text; the context manager closes the file even if reading fails.
with open('../finished_texts/1998_p2b.txt', 'r') as file:
    text = file.read()

# Apply each [pattern, replacement] pair in order, multiline and case-insensitive.
for find, repl in orderz:
    text = re.sub(find, repl, text, flags=re.M | re.I)

print(text)

Physics $2$B $-1998$

Define simple harmonic motion.

Prove that, the velocity $v$ of a particle moving in simple
harmonic motion is given by: $v=w(A^{2}-y^{2}^{0}.5$, where A is the amplitude of
oscillation, $w$ the angular frequency and $y$ the displacement from the mean position.

A simple pendulum has a period of $2.8$ seconds. When its length
is shortened by $1.0$ metre, the period becomes $2.0$ seconds.
From this information, determine the acceleration $g$, of
gravity and the original length of the pendulum.

A particle rests on a horizontal platform which is moving
vertically in simple harmonic motion with an amplitude of
$50$ mm. Above a certain frequency the particle ceases to remain
in contact with the platform throughout the motion. With a
help of a diagram and illustrative equations, find;
(i) the lowest frequency at which this situation occurs.
(ii) the position at which contact ceases.

What is terminal velocity?

Briefly explain an experiment designed to measure
terminal