# Create Trigram From Markdown

## Summary

Read a Markdown file, tokenize it, ignore stop words, and build trigrams. Display the trigrams on the screen with their counts.

## Credits

- OpenAI Chat GPT 3, Dec 15 Version.

## Python Script

### Import Libraries

- `sys` to read command-line arguments (`sys.argv`).
- `re` for regular expressions, used to match the YAML front matter delimiter.
- `markdown` to convert the Markdown input to HTML.

In [2]:
import sys
import re
import markdown

### Read the input file

- [ ] Change file from command line. [created::2023-01-20] Use a test file rather than reading from command line.
- [ ] Does the file exist? [created::2023-01-20] If error, exit the script. File needs to be there to read it.

In [3]:
input_file = sys.argv[1]
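
The two checklist items above could be handled along these lines. This is a minimal sketch, not part of the original script: the helper name `resolve_input_file` and the set of accepted suffixes are assumptions made for illustration.

```python
import sys
from pathlib import Path

# Assumption: these suffixes are what counts as "Markdown" for this script.
MARKDOWN_SUFFIXES = {".md", ".markdown"}

def resolve_input_file(argv):
    """Return the input path from argv, or exit with a message if it is unusable."""
    if len(argv) < 2:
        sys.exit("usage: script.py <file.md>")
    path = Path(argv[1])
    if not path.is_file():
        sys.exit(f"Input file not found: {path}")
    if path.suffix.lower() not in MARKDOWN_SUFFIXES:
        sys.exit(f"Not a Markdown file: {path}")
    return path
```

With this in place, `input_file = resolve_input_file(sys.argv)` would replace the direct `sys.argv[1]` lookup. Inside a notebook, passing a hand-built list such as `["script", "test.md"]` is one way to use a test file instead of the command line.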

### Initialize a dictionary

Initialize a dictionary to store the trigrams and their counts. A trigram is a sequence of three consecutive words.

In [None]:
trigrams = {}

### Define YAML Frontmatter

Use a regular expression to match the `---` delimiter lines of the YAML front matter so the values inside it can be ignored. This can break when a `---` line appears elsewhere in the file, for example as a Markdown horizontal rule.

In [None]:
front_matter_regex = r'^---\s*$'
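
To make the failure mode concrete, the sketch below applies the same pattern to a few made-up lines. The `sample` list is invented for illustration; only `front_matter_regex` comes from the script.

```python
import re

front_matter_regex = r'^---\s*$'

# Hypothetical file contents: front matter followed by a body with a horizontal rule.
sample = [
    "---\n",
    "title: Example\n",
    "---\n",
    "Body text.\n",
    "---\n",
    "More text.\n",
]

# Indexes of every line the pattern treats as a delimiter.
delimiter_lines = [i for i, line in enumerate(sample) if re.match(front_matter_regex, line)]
```

Here the pattern matches three lines, not two: the third match is the horizontal rule in the body, which is why a simple toggle on every match can misfire.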

### Read file contents

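Read the input file, ignore the YAML front matter, and load the lines into a list. A flag (`in_front_matter`) is meant to track whether the current line is inside the front matter; the cell below initializes it but does not yet use it to filter lines.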

In [None]:
with open(input_file, 'r') as f:
    in_front_matter = False
    lines = f.readlines()
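
One way the flag could be put to work is sketched below. The helper name `strip_front_matter` is hypothetical, and the decision to recognize front matter only when the very first line is `---` is an assumption, chosen so that later horizontal rules are left alone.

```python
import re

front_matter_regex = r'^---\s*$'

def strip_front_matter(lines):
    """Return lines without a leading YAML front matter block."""
    # Assumption: front matter only counts when the first line is a '---' delimiter.
    if not lines or not re.match(front_matter_regex, lines[0]):
        return lines
    in_front_matter = True
    body = []
    for line in lines[1:]:
        if in_front_matter:
            # The first closing '---' ends the front matter; skip everything up to it.
            if re.match(front_matter_regex, line):
                in_front_matter = False
            continue
        body.append(line)
    # An unterminated block is treated as "no front matter" instead of dropping the file.
    return lines if in_front_matter else body
```

Calling `lines = strip_front_matter(lines)` right after `f.readlines()` would keep the rest of the script unchanged.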

### Convert Markdown to Plaintext

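The goal is to remove formatting so the tokenizer does not trip over Markdown symbols. This step may not be necessary, since Markdown is already plain text. Note that `markdown.markdown()` returns HTML rather than plain text, so `plain_text` still contains tags such as `<p>` that end up in the trigrams unless they are stripped.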

In [None]:
plain_text = markdown.markdown(''.join(lines))
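
If the HTML tags should not reach the tokenizer, they could be stripped with the standard library. The sketch below is one possible approach, not the original method; `html_to_text` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; entities are already decoded.
        self.parts.append(data)

def html_to_text(html):
    """Return the text of an HTML string with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)
```

Under this sketch the conversion step would become `plain_text = html_to_text(markdown.markdown(''.join(lines)))`.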

### Split the plain text into lines

It may be more effective to split at sentence boundaries, such as periods, and work with the language rather than with line breaks. That would give shorter chunks to process.

In [None]:
lines = plain_text.splitlines()
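
A sentence-based split could look roughly like the following. It is a naive sketch: the helper name `split_sentences` is hypothetical, and the rule "sentence ends at `.`, `!`, or `?` followed by whitespace" is an assumption that will mis-split abbreviations such as "e.g.".

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # Collapse line breaks first so sentences that wrap across lines stay together.
    flat = ' '.join(text.split())
    return [s for s in re.split(r'(?<=[.!?])\s+', flat) if s]
```

Replacing `plain_text.splitlines()` with `split_sentences(plain_text)` would keep trigrams from being cut at line wraps, at the cost of the edge cases noted above.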

### Purge stopwords and distractions

Removing stopwords would make for better trigrams. Removing unnecessary symbols would also help.
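
The script does not implement this step yet. A minimal sketch of what it might look like follows; the `STOPWORDS` set is a tiny made-up sample rather than a real stopword library, and `clean_words` is a hypothetical helper name.

```python
import re

# Assumption: a small illustrative stopword set; a real run would load a fuller list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def clean_words(line, stopwords=STOPWORDS):
    """Lowercase a line, strip symbols, and drop stopwords."""
    # Keep word characters, whitespace, and apostrophes; replace everything else with a space.
    stripped = re.sub(r"[^\w\s']", " ", line.lower())
    return [w for w in stripped.split() if w not in stopwords]
```

One consequence worth keeping in mind: once stopwords are dropped, a trigram can join words that were not adjacent in the original text.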

### Iterate through the lines of the file

1. Split each line into words.
2. Extract the current trigram. Trigrams overlap: "one two three four five" yields "one two three", "two three four", and "three four five".
3. Increment the count for that trigram. Each distinct trigram is counted separately.

In [None]:
# Iterate through the lines of the file
for line in lines:
    # Split the line into words
    words = line.split()

    # Iterate through the words in the line
    for i in range(len(words) - 2):
        # Extract the current tri-gram
        trigram = ' '.join(words[i:i+3])

        # Increment the count for this tri-gram in the dictionary
        trigrams[trigram] = trigrams.get(trigram, 0) + 1
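
For comparison, the same sliding-window count can be written with `collections.Counter`. This is an alternative sketch rather than the script's own code, and `count_trigrams` is a hypothetical name.

```python
from collections import Counter

def count_trigrams(lines):
    """Count overlapping three-word sequences within each line."""
    counts = Counter()
    for line in lines:
        words = line.split()
        # Zipping three offset views of the list yields each consecutive triple once.
        counts.update(' '.join(t) for t in zip(words, words[1:], words[2:]))
    return counts
```

Running it on `["one two three four five"]` gives the three overlapping trigrams from the list above, each with a count of 1. Lines with fewer than three words contribute nothing.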


### Sort the trigrams

Sorting the trigrams by count in descending order may not be necessary. If there are many of them, only the ones with high counts are likely to matter.

In [None]:
sorted_trigrams = sorted(trigrams.items(), key=lambda x: x[1], reverse=True)
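
If only the most frequent trigrams are of interest, `Counter.most_common` can sort and truncate in one step. The sample counts and the cutoff of 2 below are made up for illustration.

```python
from collections import Counter

# Hypothetical sample counts standing in for the script's `trigrams` dictionary.
trigrams = {"one two three": 4, "two three four": 1, "three four five": 2}

# Keep only the n most frequent trigrams; n=2 is an arbitrary illustrative cutoff.
top_trigrams = Counter(trigrams).most_common(2)
```

The result is a list of `(trigram, count)` pairs in descending order of count, the same shape as `sorted_trigrams`.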

### Print the trigrams and their counts

This prints a long, raw list. In the future, make this output the input for another tool, in a format that is easy to visualize or to feed into graph analysis.

In [None]:
for trigram, count in sorted_trigrams:
    print(f'{trigram}: {count}')
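
One candidate format is CSV with a column per word, which spreadsheet and graph tools can usually import. The sketch below is only a suggestion; the helper name `trigrams_to_csv` and the column names are assumptions.

```python
import csv
import io

def trigrams_to_csv(sorted_trigrams):
    """Render (trigram, count) pairs as CSV text with one column per word."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # Assumed header; each word gets its own column so it can serve as a graph node.
    writer.writerow(['word1', 'word2', 'word3', 'count'])
    for trigram, count in sorted_trigrams:
        writer.writerow(trigram.split(' ') + [count])
    return buffer.getvalue()
```

The returned text can be printed or written to a file, for example with `print(trigrams_to_csv(sorted_trigrams))`.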

## Requirements

- The script reads a Markdown file. It proceeds only after confirming that the file exists and is Markdown, and it ignores any YAML front matter.
- The script removes stopwords to improve the quality of the trigrams. It uses a common stopword library as well as a local configuration file.
