# Resaltador léxico secuencial
---

- [x] Regex expr
- [x] Tokenize file
- [x] Write HTML file
- [x] Highlight syntax
- [x] Paralelo
- [x] 1000 files
- [x] Poster

> Isaac Cortés Martínez       A01378642

> Alejandro Enríquez Coronado A01378141

---
### Analyzed tokens:
- **Keywords**
- **Comments**
- **Literals**
    - Complex
    - Float
    - Integers
    - Boolean
    - String
    - None
- **Delimiters**
    - Grouping
    - Punctuation
    - Arithmetic assignment
    - Bitwise assignment
- **Operators**
    - Arithmetic
    - Bitwise
    - Relational
    - Logical
- **Identifiers**
- **Spaces**
- **Newlines**
- **Invalid characters**
---

### Regular expression

In [1]:
(def regex #"
(?xi)
    (?:ð)
# 1. Keywords
    |(\b(?:class|finally|return|continue|for|lambda|try|def|from|nonlocal|while|del|global|with|as|elif|if|yield|assert|else|import|pass|break|except|raise)\b)
# Comments
    |((?:\#.*|\"{3}[\w\W]*?\"{3}|\'{3}[\w\W]*?\'{3}))                                              # 2. Multi and single-line comments
# 3. Misc
    |(->|>>>)                                                                                      # Function Annotations, command prompt
# Literals
    |((?:\d+j{1}|(?:(?: [+-]? \d*\.\d* (?: e [+-]? \d+ | ) | \d+ (?:e[+-]?\d+) ) ) [+-]?\.?\d+j))  # 4. Complex
    |((?:[+-]?\d*\.\d+(?:e[+-]?\d+|)|\d+(?:e[+-]?\d+)))                                            # 5. Float
    |(\d+)                                                                                         # 6. Ints
    |(\bTrue\b|\bFalse\b)                                                                          # 7. Boolean
    |((?:\".*\"|\'.*\'))                                                                           # 8. String
    |(None)                                                                                        # 9. None
# Delimiters
    |((?:\(|\)|\[|\]|\{|\}))                                                                       # 10. Grouping
    |((?:\.|\,|\:|\;|\@))                                                                          # 11. Punctuation
    |((?:=(?!=)|\+=|-=|\*{1}=|\/{1}=|\/{2}=|%=|\*{2}=))                                            # 12. Arithmetic assignment
    |((?:\&=|\|=|\^=|<{2}=|>{2}=))                                                                 # 13. Bitwise assignment
# Operators
    |((?:\+|\-|\*{2}|\/{2}|\/{1}|\%|\*{1}))                                                        # 14. Arithmetic
    |((?:\&|\||\~|\^|\<{2}|\>{2}))                                                                 # 15. Bitwise 
    |((?:<=|>=|<{1}|>{1}|!=|={2}|\bis\b|\bin\b))                                                   # 16. Relational
    |((?:\band\b|\bor\b|\bnot\b))                                                                  # 17. Logical
# Identifiers
    |(\b(?!class|finally|return|continue|for|lambda|try|def|from|nonlocal|while|del|global|with|as|elif|if|yield|assert|else|import|pass|break|except|raise)[a-z_]?[a-z0-9_]+)
                                                                                                   # 18. Identifiers
# Spaces
    |([\ \t])                                                                                      # 19. Spaces
# Newlines
    |([\v\r\n\f])                                                                                  # 20. Newlines
# Invalid characters
    |(.)                                                                                           # 21. Invalid
")

#'user/regex

#### **html-friendlynize** turns a string into it's HTML friendly version by escaping ">" and "<".

In [2]:
(defn html-friendlynize
  [token]
  (clojure.string/replace token #">|<" {">" "&gt;" "<" "&lt;"}))

#'user/html-friendlynize

#### **htmlize-file** reads a python file, breaks it down into the pre-defined lexical groups of a regex expression and writes an HTML pre-formatted text version of those groups (preserving their lexical group as the HTML class)

In [3]:
(defn htmlize-file
  [file-name]
  (->> (re-seq regex (slurp file-name))
     (map (fn [match]
            (let [token (match 0)]
              (cond
                (match 1)  (format "<span class=\"keyword\">%s</span>" token)
                (match 2)  (format "<span class=\"comment\">%s</span>" (html-friendlynize token))
                (match 3)  (format "<span class=\"misc\">%s</span>" (html-friendlynize token))
                (match 4)  (format "<span class=\"complex-literal\">%s</span>" token)
                (match 5)  (format "<span class=\"float-literal\">%s</span>" token)
                (match 6)  (format "<span class=\"int-literal\">%s</span>" token)
                (match 7)  (format "<span class=\"bool-literal\">%s</span>" token)
                (match 8)  (format "<span class=\"string-literal\">%s</span>" (html-friendlynize token))
                (match 9)  (format "<span class=\"none-literal\">%s</span>" token)
                (match 10) (format "<span class=\"grouping\">%s</span>" token)
                (match 11) (format "<span class=\"punctuation\">%s</span>" token)  
                (match 12) (format "<span class=\"arithmetic-assignment\">%s</span>" token)
                (match 13) (format "<span class=\"bitwise-assignment\">%s</span>" (html-friendlynize token))
                (match 14) (format "<span class=\"arithmetic-operator\">%s</span>" token)
                (match 15) (format "<span class=\"bitwise-operator\">%s</span>" (html-friendlynize token))
                (match 16) (format "<span class=\"relational-operator\">%s</span>" (html-friendlynize token))
                (match 17) (format "<span class=\"logical-operator\">%s</span>" token)
                (match 18) (format "<span class=\"identifier\">%s</span>" (html-friendlynize token))
                (match 19) " "
                (match 20) (format "%s" token)
                (match 21) (format "<span class=\"invalid\">%s</span>" (html-friendlynize token))       
                ))))))

#'user/htmlize-file

¹ No need to make html-friendly the tokens that will never encounter any of the HTML delimiters.

#### **highlight-file** produces the final HTML file, concatenating the head, body, pre, spans and the css stylesheet route.

In [4]:
(defn highlight-file
  [input-file output-file]
  (spit output-file (format "<html>
  <head>
    <title>Python Lexical Highliter</title>
    <link rel=\"stylesheet\" href=\"token_colors.css\">
  </head>
  <body>
    <pre>%s
    </pre>
  </body>
</html>", (clojure.string/join (concat (htmlize-file input-file))))))

#'user/highlight-file

---
### Parallel


#### **get-file-info** returns a list of every file on a directory and its subdirectories, filtering the folders themselves

In [5]:
(defn get-file-info
    [dir]
    (for [f (map (fn [f]
             [(.getAbsolutePath f) (subs (.getName f) 0 (- (.length (.getName f)) 3)) (.isFile f)])
         (file-seq (clojure.java.io/file dir))) :when (f 2) ] [(f 0) (f 1)]))

#'user/get-file-info

#### **ranges** separates the file list indexes into separate parts to distribute between processes

In [6]:
(defn ranges
  [dir threads]
   (let [files (get-file-info dir)]
       (partition 2 1 (range 0 (inc (count files)) (quot (count files) threads)))))

#'user/ranges

#### **highlight-file-seq gets the file list and loops between a start and end range to produce the html output with the **highlight-file** function

In [7]:
(defn highlight-file-seq
    [dir start end]
    (let [files (get-file-info dir)]
        (loop [i start]
        (if (< i end)
            (do (highlight-file (nth (nth files i) 0)
                                (str "./Outputs/" (nth (nth files i) 1) ".html"))
                (recur (inc i)))
            nil))))

#'user/highlight-file-seq

#### **highlight-file-par uses pmap and ranges to divide workload from highlight-file-seq into the desired number of threads

In [8]:
(defn highlight-file-par
    [dir threads]
    (nth (->> (ranges dir threads)
        (pmap (fn [[start end]]
              (highlight-file-seq dir start end)))) 0))

#'user/highlight-file-par

#### Sequential Test

In [9]:
(def file-numbers (count (get-file-info "Python")))
(doseq [i (range 5)]
  (time (highlight-file-seq "Python" 0 file-numbers)))

"Elapsed time: 9553.8826 msecs"
"Elapsed time: 9180.5234 msecs"
"Elapsed time: 9154.4379 msecs"
"Elapsed time: 9469.622 msecs"
"Elapsed time: 9157.7746 msecs"


nil

#### Parallel Test

In [10]:
(doseq [i (range 5)]
  (time (highlight-file-par "Python" 8)))

"Elapsed time: 1103.854 msecs"
"Elapsed time: 1305.0051 msecs"
"Elapsed time: 1606.4679 msecs"
"Elapsed time: 1613.7931 msecs"
"Elapsed time: 1638.2633 msecs"


nil

#### Speedup

In [12]:
(def average-sequential
  (/ (+ 9553.8826  9180.5234 9154.4379
        9469.622 9157.7746)
     5))

(def average-parallel
  (/ (+ 1103.854  1305.0051 1606.4679
        1613.7931 1638.2633)
     5))

;; Speedup
(/ average-sequential average-parallel)

6.400686180943749