# m# - A Cryptographic Hashing Algorithm
#### Mano Rajesh Robotics Honors
Spring 2022

---

# Introduction
To test my knowledge of Python, I decided to create my own hashing algorithm written entirely in Python. A hashing algorithm maps an input of any length to a fixed-length output. For example, the string `hello, world` might hash to an output such as `@OOk##x39aVXMyUBu?(*hNx9PGHTjzHCx`, and ideally no other input would produce that same output.
<figure><img src="/work/mhash-paper/figures/hashing_visual.png"><figcaption><center><b>Fig. 1 - Visual Example of Hashing, from Brilliant.org; 2022</b></center></figcaption></figure><br>
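As a quick illustration of the fixed-length property, the sketch below uses SHA-256 from Python's standard `hashlib` module as a stand-in (it is not m#). The helper name `digest_length` is made up for this example.

```python
import hashlib

def digest_length(message: str) -> int:
    """Return the length of the SHA-256 hex digest of `message`."""
    return len(hashlib.sha256(message.encode("utf-8")).hexdigest())

# Inputs of very different lengths all map to a digest of the same length.
for message in ["", "hello, world", "hello, world" * 1000]:
    print(len(message), digest_length(message))
```

Whatever the input length, the digest length stays the same, which is the property Fig. 1 illustrates.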

In reality, creating a perfect **cryptographic** hashing algorithm (one that is infeasible to break) is practically impossible to verify by testing, since the only way to test every possibility is to brute-force every possible input and check for collisions. A hashing collision occurs when _two different pieces_ of data produce the _same_ hash (the tester in this project also flags hashes that are merely very similar). For example, if `hello, world` and `hello, earth` both produce the hash `@OOk##x39aVXMyUBu?(*hNx9PGHTjzHCx`, then there is a **collision**. In simpler terms, two different inputs share an output. Collisions mean that data can be mimicked, altered, or accessed. If the hashing algorithm is widely used (e.g. SHA, CRC32, etc.), collisions can be exploited to pass off malicious code as legitimate or to carry out other malevolent actions.
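To make the idea of a collision concrete, here is a minimal sketch with a deliberately weak toy hash (the function `toy_hash` is invented for this example and has nothing to do with m#). Because it only sums character codes, any two strings with the same characters in a different order collide.

```python
def toy_hash(data: str) -> int:
    """Deliberately weak hash: sum of the character codes modulo 256."""
    return sum(ord(ch) for ch in data) % 256

# "ab" and "ba" are different inputs, but their character sums are identical.
print(toy_hash("ab"), toy_hash("ba"))
print(toy_hash("ab") == toy_hash("ba"))  # a collision
```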

My goal was to create a cryptographic hashing algorithm made entirely in Python. This requires quite a bit of math, planning, testing, and analyzing. 

# Development
To begin, I used my knowledge of encryption and hashing algorithms to create a base for my own hashing algorithm. This way, I could make my own algorithm rather than making a copy of an existing one. The initial idea for the hashing algorithm was to derive arbitrary and random  

<sub>Note: You can interact with any of the hashing algorithms. Simply type your input, click Apply, and run the code block below</sub>

In [None]:
plaintext_input_1 = 'fasdfasdf'

In [None]:
def hashing(plaintext, length=32):
    seed = 0
    hash = []
    salt = 0
    random_length_num = 1
    text = "abcdefghjiklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ)(*&%$#@!"

    for char in plaintext:
        random_length_num += ord(char)

    while salt <= length:
        seed += ord(plaintext[salt % len(plaintext)]) + salt * random_length_num
        hash.append(text[(seed**salt*random_length_num) % len(text)])
        salt += 1
    return hash

print("".join(hashing(plaintext_input_1))) 

jxEbbRYEz7fT2)g%uD*MH5J&J%pXEvsmH


With a hashing algorithm that _seems_ to work, I began implementing a testing program. Below is the pseudo-code:

```
def generate string(seed):
    generate a string like 'a' or 'b' depending on the seed

    if seed is greater than 26
    return 'aa' or 'ab' depending on seed
    # incrementing the letter depending on seed

while True:
    hash(generate string(number))
    write hash result to file
    increment number by 1

for word1 in file:
    for word2 in file:
        if word1 is similar to word2 and word1 is not word2: # Check every word against every other
            file write word2
            print("Collision found")
```
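The pseudo-code above can be sketched as runnable Python. This is only an illustration under stated assumptions: `generate_string`, `find_exact_collisions`, and `weak_hash` are hypothetical names, the stand-in `weak_hash` is not m#, and the sketch groups *exactly equal* hashes with a dictionary instead of comparing every pair for similarity as the pseudo-code does.

```python
from collections import defaultdict

def generate_string(number: int) -> str:
    """Map 0, 1, 2, ... to 'a', 'b', ..., 'z', 'aa', 'ab', ..."""
    letters = []
    number += 1
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        letters.append(chr(ord("a") + remainder))
    return "".join(reversed(letters))

def find_exact_collisions(hash_function, count: int) -> list:
    """Hash the first `count` generated strings and group inputs by hash."""
    inputs_by_hash = defaultdict(list)
    for number in range(count):
        word = generate_string(number)
        inputs_by_hash[hash_function(word)].append(word)
    # Any hash reached by more than one input is a collision.
    return [words for words in inputs_by_hash.values() if len(words) > 1]

def weak_hash(word: str) -> int:
    """Stand-in hash with only 64 possible outputs, so collisions are certain."""
    return sum(ord(ch) for ch in word) % 64

print(generate_string(0), generate_string(25), generate_string(26))
print(find_exact_collisions(weak_hash, 100)[:3])
```

Grouping by hash takes one pass over the inputs, but it cannot detect hashes that are merely similar, which the pairwise comparison can.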

This testing program brute-forces the hashing algorithm. This means that it is very, very slow; however, it is very effective at finding collisions as mentioned previously. Below is an excerpt of one of the outputs. The output (or ciphertext) is on the left, and the input (or plaintext) is on the right. This simple, consistent format makes the results easy to read, analyze, and error-check.

```
.
.
.
qx@6JYGBY01SnvviOs2Mz1IzlJ?njcCmq jjjj
7v!m53p@z)YhLOXJx&mB19&UU5?4ER84P kkkk
8t<%p@Z3(qv8jgyj@D!12Gno3p?J?63Kn llll
Lr?LaB9Y3HTXG0)IGnTp3O9Jaa?Z6ix(M mmmm
Epa3Wfht%!qmd*2hpYEe4WTdJW?e(Yshk nnnn
wnbiHKRP6yOCBu&HZ9z%5%d9rH?uwCnyJ oooo
8lc(3o1k@Pl*<N5g9jkT6by&)3?ASqjFh pppp
oidInT?G9fJr7f#GhT#I7iJx9n?Qn$dWG qqqq
vhez<xJb?7gH%98fR4R88r%Sg<?#JJ<ce rrrr
XffgU*s8BXE@2(<F1dCw9zomPU?lex%tD ssss
cdgYF7*&cnbwZtAe?Oxl080HxF?2AbZAb tttt
dbhF1aByEE0MwMbEJyjaAFUb#1?H#QURA uuuu
r?jwlFkUf$@bUeDds<%)BNe7El?X25P!< vvvv
k!id@iUpHv52r8eD*JPPCVz(m@?cXjKo8 wwww
&#kVSO4LjM*RP)GcBtAED&KvVS?ssXF6# xxxx
d%lCDscgKczgmshCk%v4Ea$Q4D?9OBAM5 yyyy
V*mtyXMCl4X7KLJbUEgsFjpkby?Oip6&& zzzz
$qmkF%XC%?)wL9v0VziexJ1csv?4G1bmw aaaa
fon*197!61xMj(X<5?$%yRL8(g?Jbe@4V bbbbb
XmoJlcf4@RVbGty9dKQTzZ#*0*?Z8T*Kt ccccc
%kp1@HPZ9hs2dM)!NuBI1@qwhN?e&8X(S ddddd
6jqhSlyu?9QRBe28w$w82eBRQ9?uylShq eeeee
LgrZDQ!QBZng<8&@#Fhw3mWlyt?AU)NyP fffff
MesGyuHlcpL77)57Fp&l4ugG@e?QpEIFn ggggg
)ctxiZqHEGjW%s##o)Oa532aF)?#LsDWM hhhhh
Taue$4)cf@Gl2L86YA0)6AM6nL?lg@9ck iiiii
B<vWQ!09HxdBZd<$8kuP7I@)W7?2CL4tJ jjjjj
M@wDBCj%jOB(w7A5gVfE8Qru5r?H!zyAh kkkkk
4$xuwgSzKe<qUZb%Q6(49YCPcc?X4dtRG lllll
A&ybhL2Vl67GrrD4zfMs0#XiLY?cZSo!e mmmmm
b(zT&paqNW%#PKe&<Q8hAdhEtJ?su7ioD nnnnn
rZ1AOUKMom2vmcG3I1s@Bl3<*5?9Qke6b ooooo
.
.
.
```

The number of collisions found in the first version of the hash was significant: there were around 1,000 collisions in a million hashes. In other words, there was one collision for every 1,000 hashes. This may seem like an acceptable number of collisions (only 0.1% of hashes collide); however, the SHA-2 family of the Secure Hash Algorithm (a standard hashing algorithm for the US Government) has **no** publicly known collisions at all.

In light of this, I aimed to deconstruct my hashing algorithm and find how those collisions were created.

---

To understand the hashing algorithm, we must deconstruct each section of the algorithm. Broadly, the algorithm is split into two parts: a number linked to the length of the input and the assignment of letters to the output. Firstly, the number linked to the length of the input is important since the length of the input should drastically change the output. If it does, the hashes of the inputs `i` and `ii` should be very different even though they contain the same character.

With that said, this version of the hash does not account for this, as seen in the example of `gn` and `gnof`. The hash for `gn` is `bbsdO54K4B7%J7V7Bwiu7fb0JDs9Sma!t`, yet the hash for `gnof` is `bbddO5AK4BB%J7S7Bwmu7fq0JDp9SmM!t`. The resultant hashes are extremely close because the input strings are also very similar to each other. This is a major fault since an attacker can use this knowledge to estimate the data behind a hash. For example, if the attacker knows that a particular secret hash is similar to another hash, they can estimate what the secret data is by using the hash input they already know.
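How close the two quoted hashes are can be checked directly. The sketch below counts the positions at which the `gn` and `gnof` hashes from the text agree; `matching_positions` is a helper name invented for this example.

```python
hash_gn = "bbsdO54K4B7%J7V7Bwiu7fb0JDs9Sma!t"
hash_gnof = "bbddO5AK4BB%J7S7Bwmu7fq0JDp9SmM!t"

def matching_positions(first: str, second: str) -> int:
    """Count the positions at which two strings hold the same character."""
    return sum(1 for a, b in zip(first, second) if a == b)

matches = matching_positions(hash_gn, hash_gnof)
print(f"{matches} of {len(hash_gn)} positions match")
```

For a well-behaved hash, two different inputs would be expected to agree at only a small fraction of positions.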

In [None]:
plaintext = plaintext_input_1

seed = 0
hash = []
salt = 0
random_length_num = 1
text = "abcdefghjiklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ)(*&%$#@!"

for char in plaintext:
    random_length_num += ord(char)

print(random_length_num)

427


In [None]:
while salt <= 32:
    seed += ord(plaintext[salt % len(plaintext)]) + salt * random_length_num
    hash.append(text[(seed**salt*random_length_num) % len(text)])
    salt += 1

print("".join(hash))

bbddO5AK4BB%J7S7Bwmu7fq0JDp9SmM!t


This section of the hash is where the `seed`, `salt`, and `random_length_num` variables come together to calculate a pseudo-random (yet consistent) character for each place in the hash. The nominal length of the hash is 32 characters, but because the loop condition is `salt <= 32`, the loop runs 33 times and produces 33 characters.

Since the `random_length_num` variable is vulnerable, the entire main loop is affected. This area is also of concern because it contains compute-intensive calculations (exponentiation of ever-larger integers followed by a modulo). This makes it an inefficient part of the algorithm that can be improved.
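One possible way to cut the cost of the exponentiation, offered here as a sketch rather than as part of m#, is Python's built-in three-argument `pow`, which reduces modulo `modulus` at every step instead of building the full power first. The function names `slow_index` and `fast_index` and the sample values are assumptions made for this example.

```python
def slow_index(seed: int, salt: int, factor: int, modulus: int) -> int:
    """Index computed the way the loop does it: full power first, then modulo."""
    return (seed**salt * factor) % modulus

def fast_index(seed: int, salt: int, factor: int, modulus: int) -> int:
    """Same index, but the power is reduced modulo `modulus` as it is built."""
    return (pow(seed, salt, modulus) * factor) % modulus

# Both expressions select the same position in a 71-character alphabet.
print(slow_index(12345, 32, 427, 71), fast_index(12345, 32, 427, 71))
```

The two functions return the same index because taking the remainder before or after multiplying gives the same result modulo `modulus`.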

---

Now, the algorithm can be improved if the variable `random_length_num` changes drastically with the input length and the character assignment section is adjusted for speed. Below is the new algorithm:

In [None]:
plaintext_input_1 = 'this is a neuasjdf;lkasjdfkl'

In [None]:
plaintext = plaintext_input_1

def hashing(plaintext, length=32):
    seed = 0
    hash = []
    salt = 0
    random_length_num = 1
    text = "abcdefghjiklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ)(*&%$#@!"

    for char in plaintext:
        random_length_num += ord(char)
        random_length_num += ~ len(plaintext)

    while salt <= length:
        seed += ord(plaintext[salt % len(plaintext)]) + salt * random_length_num
        hash.append(text[(seed**salt*random_length_num) % len(text)])
        salt += 1
        random_length_num += 1
    return hash

print("".join(hashing(plaintext_input_1))) 

R3Fb5LFixpGp&kP*rkaoapsBmla#kt2SQ


The algorithm was changed very minimally, but those changes were very important. The first and most important change was the line `random_length_num += ~ len(plaintext)`. This was important because the result of the line is heavily dependent on the length of the input. This is because it takes the length of the input and inverts its bits (0 becomes 1 and 1 becomes 0). In Python, `~n` evaluates to `-(n + 1)`, so each pass through the loop subtracts `len(plaintext) + 1` from `random_length_num`. The variable then depends not only on the characters of the input (through the `ord()` function) but also on the length of the input.

Additionally, I added the line `random_length_num += 1` within the second part in order to make repeated values of it (across different inputs) far less likely.

These two changes made collisions far rarer: no collisions were found in 10,000 hashes and only **one** collision was found in 1,000,000 hashes.
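The effect of `~ len(plaintext)` can be checked in isolation. The sketch below reproduces only the first loop of the revised function under the made-up name `length_dependent_sum` and compares it with the closed form that follows from `~n == -(n + 1)`.

```python
def length_dependent_sum(plaintext: str) -> int:
    """First loop of the revised hash: character codes plus ~len on every pass."""
    random_length_num = 1
    for char in plaintext:
        random_length_num += ord(char)
        random_length_num += ~len(plaintext)
    return random_length_num

def closed_form(plaintext: str) -> int:
    """Same value written out: each of the n passes subtracts n + 1."""
    n = len(plaintext)
    return 1 + sum(ord(char) for char in plaintext) - n * (n + 1)

for n in (1, 2, 9, 28):
    print(n, ~n, -(n + 1))

print(length_dependent_sum("gn"), length_dependent_sum("gnof"))
```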

# Analysis

To analyze the result, I used the aforementioned testing algorithm, hashing time program, and a plotting program.

Beginning with the testing algorithm, the algorithm was designed around the idea that a lot of hashes would be generated, and each hash would be compared against each other to find collisions. With this, there would have to be 2 sections to the tester: word and hash generator and the comparison between the hashes.

In [None]:
seed_input_1 = 683

In [None]:
seed = seed_input_1

def lowercase_word(seed):
    word = ""
    seed *= 1
    while seed > 0:
        word += chr(seed % 26 + 97)
        seed = seed // 26
    return word

def multiple_letters(seed):
    word = ""
    while seed > 0:
        word += chr(seed % 26 + 97)
        seed -= 26
    return word

print(lowercase_word(seed))
print(multiple_letters(seed))

hab
hhhhhhhhhhhhhhhhhhhhhhhhhhh


The generation of the inputs is quite easy: depending on the `seed` variable, return a series of letters. This sequential generation of inputs is quite efficient. This allows this step to take very little time and compute so that the analysis can use those resources.

In [None]:
import time
import jellyfish as jf

def thread_function(hashes):
    file2 = open("hash_collisions.txt", "a")
    try:
        counter = 0
        for line in hashes:
            for word in hashes:
                if jf.jaro_distance(line.split()[0], word.split()[0]) > 0.9 and line != word:
                    file2.write(line + word + '\n')

            if counter % 500 == 0:
                print(counter, time.asctime())
            counter += 1
        file2.close()
    except KeyboardInterrupt:
        print("Keyboard interrupt")
        file2.close()
        exit()

This function first opens the file that any collisions found are written to. Then, the two nested `for` loops iterate through all the elements in the `hashes` list. As seen in Fig. 2, the outer loop selects the hash on the left, which the inner loop then checks against every other hash. This is a simple, exhaustive way to compare each hash to everything else.

<figure><img src="/work/mhash-paper/figures/hashes.png"><figcaption><center><b>Fig. 2 - Visual Example of Collision Checking, from Mano Rajesh; May 24, 2022</b></center></figcaption></figure><br>

This method, although very comprehensive, is very slow: the comparison has to be repeated $n^2$ times, where $n$ is the number of hashes, since each of the $n$ hashes is compared to all $n$ hashes. This is very inefficient and is a prime candidate for multi-processing. 
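As a side note on the $n^2$ cost, the nested loops visit every pair twice (once in each order) and also compare each hash with itself. A possible alternative, sketched here with the standard library's `itertools.combinations` and not used by the tester above, visits each unordered pair once. The name `count_unordered_pairs` and the sample list are placeholders.

```python
from itertools import combinations

def count_unordered_pairs(items) -> int:
    """Number of distinct pairs when each pair is visited exactly once."""
    return sum(1 for _ in combinations(items, 2))

hashes = ["h1", "h2", "h3", "h4"]  # placeholder values, not real hashes
print(len(hashes) ** 2, count_unordered_pairs(hashes))  # 16 vs 6
```

That is $n(n-1)/2$ comparisons instead of $n^2$, at the price of each collision being reported once rather than in both orders.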

In [None]:
import multiprocessing as mp
import time
from multiprocessing import Process

# mp.cpu_count() returns the number of logical CPU cores
# text_buffer is the list of lines read from the file with the hashes

start = 0
processes = []
for i in range(mp.cpu_count()):
    end = start + len(text_buffer) // mp.cpu_count()
    p = Process(target=thread_function, args=(text_buffer[start:end],))
    p.start()
    processes.append(p)
    start = end

# Wait for every worker process, not only the last one started
for p in processes:
    p.join()
print("Done at " + time.asctime())

This section simply divides the hash file into equal (or nearly equal) pieces that are sent independently to the CPU cores. The division is done by the `end` variable assignment: `end = start + len(text_buffer) // mp.cpu_count()`. This creates a slice of the data that is then sent to a core. Afterwards, the next slice starts where the previous one ended (`start = end`) so that each core gets an equal amount of work.

With multi-processing, instead of one CPU core computing $10000^2 = 100{,}000{,}000$ collision checks sequentially, each of 4 CPU cores computes $({\frac {10000}{4}})^2 = 2500^2 = 6{,}250{,}000$ collision checks simultaneously, a reduction of 93,750,000 checks per core. Note that each core compares only the hashes within its own slice, so collisions between hashes in different slices are not detected.

The speed improvements are quite significant: with 1000 hashes, single-core execution time is 31.3 seconds while four-core execution time is 1.2 seconds. In other words, the multi-core execution takes 3.8% of the original time. The speedup is greater than fourfold because the four slices together require fewer comparisons than the single full pass.
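The arithmetic behind these counts can be reproduced in a few lines. The helper `checks_per_core` is a name made up for this sketch, and it assumes every core compares only the hashes inside its own slice.

```python
def checks_per_core(total_hashes: int, cores: int) -> int:
    """Pairwise checks one core performs on its own slice of the hashes."""
    slice_size = total_hashes // cores
    return slice_size ** 2

single_core = checks_per_core(10_000, 1)
four_cores_each = checks_per_core(10_000, 4)

print(single_core)                    # 100000000
print(four_cores_each)                # 6250000
print(single_core - four_cores_each)  # 93750000
print(4 * four_cores_each)            # 25000000 checks across all four cores
```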

---

Any collisions found during testing were then put into the program below to measure the similarity of the two hashes with the Jaro similarity metric, computed by `jf.jaro_distance` (also used by the testing program).

In [None]:
plaintext_input_2 = 'fg'

In [None]:
plaintext_input_3 = 'f'

In [None]:
import jellyfish as jf

def hashing(plaintext, length=32):
    seed = 0
    hash = []
    salt = 0
    random_length_num = 1
    text = "abcdefghjiklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ)(*&%$#@!"

    for char in plaintext:
        random_length_num += ord(char)
        random_length_num += ~ len(plaintext)

    while salt <= length:
        seed += ord(plaintext[salt % len(plaintext)]) + salt * random_length_num
        hash.append(text[(seed**salt*random_length_num) % len(text)])
        salt += 1
        random_length_num += 1
    return hash 

plaintext1 = plaintext_input_2
plaintext2 = plaintext_input_3

print("".join(hashing(plaintext1)))
print("".join(hashing(plaintext2)))
print(jf.jaro_distance("".join(hashing(plaintext1)), "".join(hashing(plaintext2))))

W2sM1%fjQIKExab6mm7PlngkG2x143Zb5
5HNe&WmZCVzGQ#09)ty7e#8aOSF$E7E$8
0.3434343434343434


---

Now with a practical hashing function, the next step was finding and quantifying the amount of time needed to compute the hashes. In other words, how compute-intensive is the hashing function? Additionally, finding the Big O number was important as it describes how the intensity increases with input size (e.g. $O(n^2)$ means the intensity increases quadratically).
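To show what different growth rates mean in practice, the sketch below compares how the cost changes when the input size doubles under a quadratic model and under an exponential model. Both cost functions are illustrative models invented for this example, not measurements of m#.

```python
def quadratic_cost(n: int) -> int:
    """Model cost for an O(n^2) algorithm."""
    return n ** 2

def exponential_cost(n: int) -> int:
    """Model cost for an O(2^n) algorithm."""
    return 2 ** n

# Doubling n multiplies the quadratic cost by 4 every time,
# while the exponential cost is multiplied by 2^n.
for n in (10, 20, 40):
    quadratic_ratio = quadratic_cost(2 * n) // quadratic_cost(n)
    exponential_ratio = exponential_cost(2 * n) // exponential_cost(n)
    print(n, quadratic_ratio, exponential_ratio)
```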

To find this, a simple hasher and timer program was created.

In [None]:
arguments = []
file = open("hash-times.txt", "w")
file.close()

for i in range(1000):
    arguments.append("xg"*(i+1))

for i in range(1000//mp.cpu_count()):
    try:
        processes = []
        for j in range(mp.cpu_count()):
            processes.append(mp.Process(target=hashing, args=(arguments[i*mp.cpu_count()+j],)))
        for process in processes:
            process.start()
        for process in processes:
            process.join()
    except (KeyboardInterrupt, IndexError):
        for process in processes:
            process.terminate()
        break

The first `for` loop makes a list of strings that repeat `xg` an increasing number of times (i.e. `arguments = ['xg', 'xgxg', 'xgxgxg'...]`). I chose this method because timing the hashing algorithm on each element of the `arguments` list gives a series of times that illustrates how input length affects execution time.

The second `for` loop is the core assignment of the hashing and timing function. This has a very similar structure to the collision tester. The main difference is that the processes (i.e. the function calls being sent to the individual cores) are put into a list and started from there. This was chosen since the hashing algorithm takes one element of the list as an argument, not a list, so slices can't be used; instead, `arguments` is handed out one element per process. This method means that several hashes can be produced simultaneously, decreasing the amount of time needed to collect the execution times.
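The cell above does not show how a single call is timed, so here is one possible approach using `time.perf_counter` from the standard library. The names `time_call` and `stand_in_hash` are hypothetical, and `stand_in_hash` is a placeholder workload rather than m#.

```python
import time

def time_call(function, argument) -> float:
    """Return the wall-clock seconds one call to `function(argument)` takes."""
    start = time.perf_counter()
    function(argument)
    return time.perf_counter() - start

def stand_in_hash(text: str) -> int:
    """Placeholder workload whose cost grows with the input length."""
    total = 0
    for ch in text:
        total = (total * 31 + ord(ch)) % 1_000_003
    return total

# Time the same function on inputs of increasing length, as in the text.
for repeats in (1, 10, 100):
    elapsed = time_call(stand_in_hash, "xg" * repeats)
    print(repeats, f"{elapsed:.6f}")
```

Writing each `elapsed` value on its own line of a text file would give the newline-separated format that the plotting code reads.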

---

Now, with a list of the execution times, the data can be plotted to visualize and compute the equation to find the complexity of the hashing algorithm. The plotter is split into 5 distinct parts: library imports and function definitions, data preparation, line of best fit calculations, difference graphs, and plotting. 

In [None]:
from statistics import fmean
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d
import numpy as np
from scipy.optimize import curve_fit

def horizontal_line(limit, y):
    return [y for i in range(limit)]

# Exponential formula
def exponential(x, a, b):
    return a*np.exp(b*x)

# My usual line of best fit for exponential
def line_of_best_fit(x1, x2, y1, y2):
    b = (y2/y1)**(1/(x2-x1))
    a = y1
    print(f"y = {a}*{b}^x")
    
    values = [a*b**i for i in range(x2)]
    return values, f"y = {a}({b})^x"

def average_graphs(y_val1, y_val2):
    average_graph = []
    for i in range (len(y_val1)):
        average_graph.append(fmean([y_val1[i], y_val2[i]]))
    return average_graph

Firstly, the libraries `statistics`, `matplotlib`, `scipy`, and `numpy` were used. Their use is explained below. The `horizontal_line()` function is used for clarity on the final graph. The `exponential()` function is used as a model equation for the `curve_fit()` function used later on. The `line_of_best_fit()` function is used to calculate the *exponential* line of best fit between two points. Lastly, the `average_graphs()` function takes the sets of y-values of two graphs and averages them.

In [None]:
########### Data Preparation ###########

# File reading
file = open(r"tests\hash-times_50000.txt", "r")
times = file.read().split()
times = [float(i) for i in times]

# Smooth the data with a gaussian filter
times_smoothed = gaussian_filter1d(times, sigma=25)

The raw data is then read from a `hash-times` file (here `hash-times_50000.txt`) produced by the program previously described. Each entry in that file is the time in fractional seconds, separated by newlines. The data read from the file is a string, so in order to use it for calculations, each element is converted into a float with the line `times = [float(i) for i in times]`. After the raw data is prepared, it is smoothed with a 1D Gaussian filter since the data is quite irregular. Shown below is the **1D Gaussian kernel** the filter applies, where $x$ is the distance from the point being smoothed and $\sigma$ is the standard deviation that sets how strongly the data is smoothed. In simpler terms, a larger $\sigma$ gives a smoother curve.

$$
G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^{2}}{2\sigma^{2}}}
$$
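To show what the kernel does to a list of values, here is a small pure-Python sketch of Gaussian smoothing. It is not SciPy's implementation: the names `gaussian_kernel` and `smooth` are invented, the kernel is cut off at four standard deviations, and the edges are handled by renormalising the weights, which may differ from how `gaussian_filter1d` treats the boundaries.

```python
import math

def gaussian_kernel(sigma: float, radius: int) -> list:
    """Sample G(x) at integer offsets and normalise the weights to sum to 1."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)  # the 1 / (sqrt(2*pi) * sigma) factor cancels here
    return [w / total for w in weights]

def smooth(values, sigma: float = 2.0) -> list:
    """Replace each value with a Gaussian-weighted average of its neighbours."""
    radius = int(4 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        weight_sum = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = i + offset
            if 0 <= j < len(values):  # ignore neighbours outside the data
                acc += weight * values[j]
                weight_sum += weight
        smoothed.append(acc / weight_sum)
    return smoothed

spiky = [0.0] * 10 + [10.0] + [0.0] * 10
print(max(spiky), round(max(smooth(spiky)), 3))
```

A single spike is spread over its neighbours, which is why the smoothed timing curve looks far less jagged than the raw one.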

In [None]:
########### Line of Best Fit ###########

# Find line of best fit with usual formula
lineofbestfit1, equation = line_of_best_fit(0, len(times), times_smoothed[0], times_smoothed[-1])

# Find line of best fit with curve_fit
pars, cov = curve_fit(f=exponential, xdata=range(len(times_smoothed)), ydata=times_smoothed, p0=[0, 0], bounds=(-np.inf, np.inf))
lineofbestfit2 = exponential(range(len(times_smoothed)), *pars)

# Find line of best fit by averaging the two lobfs
average_lobf = average_graphs(lineofbestfit1, lineofbestfit2)

The smoothed data is then used to calculate the lines of best fit with several different methods. `lineofbestfit1` is the exponential curve $y = a \cdot b^{x}$ through the first point $(x_1, y_1)$ and the last point $(x_2, y_2)$, with $a = y_1$ (since $x_1 = 0$) and
$$
b = \left({\frac{y_2}{y_1}}\right)^{\frac{1}{x_2-x_1}} 
$$
This has the benefit of being simple; however, it uses only two separate points on the graph, so it is not the most accurate.
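A worked example of the two-point formula, written as a sketch: the function `two_point_exponential` is a hypothetical generalisation that also handles $x_1 \neq 0$ by solving for $a$, and it assumes both y-values are positive.

```python
def two_point_exponential(x1, y1, x2, y2):
    """Return (a, b) such that y = a * b**x passes through both points."""
    b = (y2 / y1) ** (1 / (x2 - x1))
    a = y1 / b ** x1  # reduces to a = y1 when x1 == 0
    return a, b

# Curve through (0, 2) and (3, 16): b is the cube root of 16 / 2.
a, b = two_point_exponential(0, 2.0, 3, 16.0)
print(a, b)
print(a * b ** 3)  # recovers the second y-value, up to rounding
```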

---

`lineofbestfit2` was found using the library function `scipy.optimize.curve_fit()`. This function takes the data and models (or _fits_) it to a given curve or equation. This turned out to be the most accurate line of best fit but only for certain datasets.

---

`average_lobf` was found by averaging `lineofbestfit1` and `lineofbestfit2` point by point in order to get a slightly different output. This was meant to compensate for the fact that the first line of best fit was most accurate at the ends of the data but not in the center, unlike the second line of best fit. Writing $y_1(x)$ and $y_2(x)$ for the values of the two lines of best fit at $x$, the average is
$$
y_\mu(x)={\frac{y_1(x)+y_2(x)}{2}}
$$

In [None]:
########### Difference Graphs ###########

# Find difference between parent graph and first line of best fit
difference_graph1 = []
for i in range(len(times_smoothed)):
    difference_graph1.append(times_smoothed[i] - lineofbestfit1[i])

# Find difference between parent graph and second line of best fit
difference_graph2 = []
for i in range(len(times_smoothed)):
    difference_graph2.append(times_smoothed[i] - lineofbestfit2[i])

difference_graph3 = []
for i in range(len(times_smoothed)):
    difference_graph3.append(times_smoothed[i] - average_lobf[i])

To find the accuracy of the lines of best fit (i.e. LOBF), the difference between the smoothed data and each LOBF can be plotted. In other words, the closer to 0 the difference graph is, the more accurate the LOBF is. To do this, a simple `for` loop iterates through each output (or y-value) and subtracts the LOBF value from the smoothed data value. The differences are saved to a list, which is then graphed.
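One caveat when summarising a difference graph with a single number is that positive and negative differences cancel in a plain mean. The sketch below contrasts the mean signed difference with the mean absolute difference on made-up values; both function names are assumptions for this example, and the absolute version is offered only as an alternative measure.

```python
from statistics import fmean

def mean_signed_difference(data, fit) -> float:
    """Average of data - fit; errors of opposite sign cancel out."""
    return fmean(d - f for d, f in zip(data, fit))

def mean_absolute_difference(data, fit) -> float:
    """Average of |data - fit|; every error counts toward the total."""
    return fmean(abs(d - f) for d, f in zip(data, fit))

data = [1.0, 2.0, 3.0, 4.0]  # made-up values for illustration
fit = [2.0, 1.0, 4.0, 3.0]   # always off by exactly 1, alternating sign

print(mean_signed_difference(data, fit))    # 0.0
print(mean_absolute_difference(data, fit))  # 1.0
```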

In [None]:
########### Plotting Graphs ###########

print(f"difference_graph1: {fmean(difference_graph1)} \ndifference_graph2: {fmean(difference_graph2)} \ndifference_graph3: {fmean(difference_graph3)}")
print(min((fmean(difference_graph1)), fmean(difference_graph2), fmean(difference_graph3)))

# Plot the data graphs
fig, ax = plt.subplots(2)
ax[0].plot(times_smoothed, color="black")
ax[0].plot(lineofbestfit1, linestyle='dashdot', color="green")
ax[0].plot(lineofbestfit2, linestyle=':', linewidth=2, color='blue')
#ax[0].plot(average_lobf, color="red")
#ax[0].plot(lineofbestfit3, color="orange")

# Plot the difference graphs
ax[1].plot(difference_graph1, color="green")
ax[1].plot(difference_graph2, color="blue")
#ax[1].plot(difference_graph3, color="red")
ax[1].plot(horizontal_line(len(times_smoothed), 0), linestyle=':')

# Formatting
ax[0].set_title(file.name, y=1.2, pad=-14)
ax[0].set_ylabel("Time (s)", labelpad=10)
ax[0].set_xlabel("Hash Length", labelpad=10)
ax[1].set_ylabel("Difference (s)", labelpad=10)
ax[1].set_xlabel("Hash Length", labelpad=10)
ax[0].legend(["Data", "Line of Best Fit", "Line of Best Fit (curve_fit)"], loc="upper left")
ax[1].legend(["Difference (Line of Best Fit)", "Difference (Line of Best Fit (curve_fit))", "Reference Horizontal Line"], loc="upper left")


ax = plt.gca()
ax.set_yticks(ax.get_yticks()[::2]) # Set y-ticks to every second value
plt.grid(False) # Remove grid
file.close()
plt.show()

To plot the graphs, the library `matplotlib` was used to easily plot a list of values to a GUI. The smoothed data and LOBFs are plotted on the first subplot and color-coded. The difference graphs are then plotted on another subplot and color-coded to match their parent graphs. From there, the plots' ticks are set to appear every other tick for clarity. Finally, labels and a legend are added to the axes for readability. All of this formatting is done with `matplotlib`.

---

Several hash-time datasets were tried: 1000 hashes, 10000 hashes, 20000 hashes, and 50000 hashes. Among those, a few were single-threaded to check for stability in the multi-processing tester. Below are the figures found with this plotter:

<figure><img src="/work/mhash-paper/figures/hash_times_10000.png"><figcaption><center><b>Fig. 3 - Graph of the hashing times, from Mano Rajesh; May 24, 2022</b></center></figcaption></figure><br>
<hr>
<figure><img src="/work/mhash-paper/figures/hash_times_20000.png"><figcaption><center><b>Fig. 4 - Graph of the hashing times, from Mano Rajesh; May 24, 2022</b></center></figcaption></figure><br>
<hr>
<figure><img src="/work/mhash-paper/figures/hash_times_50000.png"><figcaption><center><b>Fig. 5 - Graph of the hashing times, from Mano Rajesh; May 24, 2022</b></center></figcaption></figure><br>


---

Finally, to find the **Big-O number**, I used guess-and-check, matching different curves to the original dataset. Using *Desmos's Graphing Calculator*, I input values to fit several different curve equations (e.g. quadratic, rational, exponential, etc.) to the original dataset. Eventually, the quadratic curve fit the dataset the best, so the Big-O estimate is quadratic. In other words, the Big-O for this hashing algorithm is $O(n^2)$. This means that the algorithm's compute intensity grows quadratically (polynomially, to the second degree) with input length, not exponentially.

The exact function for the Big-O estimate is
$f(x) = 2.3\times10^{-10}\,x^{2} + 0.0166$

<figure><img src="/work/mhash-paper/figures/big-o-estimate.png"><figcaption><center><b>Fig. 6 - Graph of the hashing time and Big-O estimate, from Mano Rajesh; May 24, 2022</b></center></figcaption></figure><br>
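The estimate can be evaluated directly to see how the predicted time grows. This sketch assumes $x$ is the position along the horizontal axis of the timing plots; the function name `estimated_hash_time` is made up for this example.

```python
def estimated_hash_time(x: float) -> float:
    """Quadratic estimate from the text: 2.3e-10 * x^2 + 0.0166 seconds."""
    return 2.3e-10 * x**2 + 0.0166

for x in (1_000, 10_000, 50_000):
    print(x, round(estimated_hash_time(x), 5))
```

For small $x$ the constant term dominates, while at the largest value the quadratic term accounts for almost all of the predicted time.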

---

### Potential Improvements
Even though the m# (Mano Hash) hashing algorithm is pretty good at the moment, there are a few improvements to be implemented.

<hr>

| Improvement                       | Description |
|-----------------------------------|-------------|
| **Bitwise Operations**            | Rather than picking random numbers and cycling through a set of letters (as represented by the `text` variable in the function), operating directly on the bits of the input means that every bit of the input data can influence the hash, which should make collisions less likely. Additionally, bitwise operations have the potential to be faster than the current integer operations. |
| **Faster Time to Hash**           | In conjunction with bitwise operations, make the second loop (the `while` loop) faster by replacing needless, compute-intensive operations with clean, quick bitwise operations. |
| **Refined <br>Testing Algorithm** | Since many of the hash collisions share input characters, I could test a hash only against other hashes with similar input characters. This would reduce the testing time by orders of magnitude, and the effectiveness of the tester should be the same or similar. |
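To give a feel for what hashing with bitwise operations looks like, here is a sketch of 32-bit FNV-1a, a small existing non-cryptographic hash. It is shown purely as an example of the XOR-and-multiply style and is neither m# nor a planned design for it.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 2166136261, the 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 16777619, the 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    digest = FNV_OFFSET_BASIS
    for byte in data:
        digest ^= byte
        digest = (digest * FNV_PRIME) & 0xFFFFFFFF  # keep the state at 32 bits
    return digest

print(f"{fnv1a_32(b'hello, world'):08x}")
print(f"{fnv1a_32(b'hello, earth'):08x}")
```

Each input byte costs one XOR, one multiplication, and one mask, with no exponentiation of large integers.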

# Conclusion
The **m# hashing algorithm** is not yet cryptographic like the title implies: that requires heavy cryptanalysis. With that said, this hashing algorithm is quite good with encoded strings as shown throughout this paper. It also functions quite well with large files and large inputs. The hashing algorithm was tested against encoded strings (which implies numbers, although these were not explicitly tested), and it shines in its simple calculations. Even so, the algorithm is still in its infancy and cannot compete with more mature algorithms like SHA or xxHash. 

This was a project that tested my skills with a language I am only beginning to understand. This project touched nearly all aspects of the language, testing me to find the most efficient approach to a problem.

You can visit the GitHub repository [here](https://github.com/manorajesh/hashing) for an in-depth view of the project.
