# Working with digital textual data: a Python primer
### aka *Things I wish someone had told me when I started using Python ~10 years ago*

#### [https://github.com/mdic/python_primer](https://github.com/mdic/python_primer)

# Disclaimer and scope
The materials contained in this interactive Jupyter notebook are meant to provide a set of (some of the) main concepts and mechanisms underlying the use of Python for working with digital textual data. Rather than covering the "basics" of Python, it takes into account things that cover the full range of "proficiency levels", from *basic* to *advanced*. As such, it is meant to be used as a cheat-sheet of "things that need to be known" when you start experimenting with Python on your own.  

This notebook is in no way meant to be an "*introductory course to Python*", nor is it meant to teach you how to write or use Python fluently.  
Think of this notebook as a **heavily-opinionated** tour guide: it is the result of personal experience, and as such it is based on personal habits and needs. It's as if you wanted to visit a place you have never been to (e.g. India), and asked an Indian friend to provide you with a list of "things to do and see in India": while the list may be more or less complete, you will never know exactly what it means to experience any of the suggested things until you are there.  
  
When using and studying Python you will soon find out that a lot of the things written in this notebook are imprecise, and that a lot of the code exemplified could have been written in different ways. That is, you will soon find out that principle n.13 of the so-called [*Zen of Python*](https://en.wikipedia.org/wiki/Zen_of_Python):

> There should be one-- and preferably only one --obvious way to do it.

doesn't hold in real life. Rather - just like natural languages - each person develops linguistic habits that contribute to a constantly varying - and subjective-first - ecosystem.  
Finally, let's be frank: using Python (just like any other programming language) is a *complex* endeavour. It's neither difficult nor complicated, but **complex**; meaning that a lot of interconnected things (mostly from computer science, but from other fields as well) are involved. The only effective way to learn a language (whether programming or natural) is to experience it, and make a lot of mistakes!  
  
I hope this notebook will invite you to experiment first-hand with Python.

## How to use this notebook
You should run this notebook (i.e. this `.ipynb` file) from a local copy on your PC; this assumes that Python (or, even better, a Python virtual environment) is installed on your PC -  along with JupyterLab from which this notebook can be run - and that you have [downloaded the notebook files](https://github.com/mdic/python_primer/archive/refs/heads/main.zip). More details are available in section *Installing Python*.  
You may then click on any cell (text or code) and modify it, adding notes, experiments, etc...  
Nothing can go wrong, and you can also go back to the [original file](https://github.com/mdic/python_primer/blob/main/Python_primer.ipynb).  
In addition, you may want to check out Di Cristofaro (2023) and the accompanying [online compendium `catlism`](https://catlism.github.io/).

# Why Python?

The question might as well be "why programming language(s)?", but the answer remains pretty much the same.  

> "Some people think of corpus linguistics as the action that starts with the analysis of a corpus of texts: data is selected, collected, and processed, and only then does corpus linguistics begin. This is a narrow view and one that [should be broadened] by incorporating into corpus approaches those technical notions and procedures that define what corpus data is." (Di Cristofaro 2023:1)

Approaching digital textual data should therefore be inclusive of everything that concerns data processing, or - to summarise - of *digital technicalities*

> "that is, those notions and mechanisms that – while not classically associated with natural language – are i) foundational of the digital environments in which language production and exchanges occur and ii) at the core of the techniques that are used to produce, collect, and process the focus of investigation, that is, digital textual data." (Di Cristofaro 2023:4)

This does not however mean that we should become computer scientists nor expert coders, but rather

> "to acquire a degree of proficiency with the ‘digital language’ sufficient enough to ensure that no disconnect is present among the digital data, the corpus itself, and the methods through which it is investigated and interpreted – even when the person who collects the data is not the same one who conducts the analysis." (Di Cristofaro 2023:15)

In fact we must not forget that
> "data processing is as relevant as data analysis; even more, it might be argued that the former is more crucial than the latter. Knowledge of how to use corpus tools and of the underlying theories is paramount to guarantee a scientifically valid analysis and should never be overlooked or ignored. This can, however, be learnt or improved along the way during the analysis of the data once the corpus has already been created. Data from the web, on the contrary, does not usually permit this flexibility: ensuring that what a researcher needs for their analysis is correctly collected may be a one-time chance, not replicable in the future." (Di Cristofaro 2023:72)

# Programming languages
To simplify, the thousands of available programming languages can be categorised according to two major characteristics:

a. Low- or high-level  
b. General-purpose (GPL) or domain-specific (DSL) language  
  
**a.** A high-level programming language (such as Python) provides a "strong abstraction from the details of the computer. In contrast to low-level programming languages, it may use natural language elements, be easier to use, or may automate (or even hide entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable than when using a lower-level language." ([source](https://en.wikipedia.org/wiki/High-level_programming_language)).  
For this reason Python is considered to be among the most "intuitive" programming languages, since it uses English words as *keywords*: "predefined, reserved words used in Python programming that have special meanings to the compiler. We cannot use a keyword as a variable name, function name, or any other identifier. They are used to define the syntax and structure of the Python language. All the keywords except `True`, `False` and `None` are in lowercase" ([source](https://www.programiz.com/python-programming/keywords-identifier)).  Some examples are `for`, `if`, `else`, `with`, `as`, `and`, `or`; see [here](https://www.w3schools.com/python/python_ref_keywords.asp) or [here](https://www.geeksforgeeks.org/python-keywords/) for the full list.  
This notebook will present a number of them through practical examples.
  
**b.** A general-purpose programming language (such as Python) can be used to develop a wide range of different applications (e.g. videogames, web apps, websites, editing images, processing textual data, etc...), while domain-specific ones are tailored to one (or a small number of) specific purposes.

# Cooking with Python: a culinary analogy
Let's pretend you have just bought a new apartment in Italy, and that it comes pre-furnished. In it you have a kitchen, a basic one; let's pretend it's a *basic Italian kitchen*. A kitchen where you will find all the tools and ingredients that are needed to prepare basic Italian dishes: an oven, a stove, some pans, some cutlery; pasta, salt, pepper, olive oil, tomatoes, onions, garlic.  
You won't find Garam Masala in it, nor sushi rice, nor a wok. Just a basic Italian kitchen to get you started with Italian dishes.  
The default Python installation is just like this kitchen: it has the basic tools and ingredients (called **modules** or **libraries**) and nothing more.  
  
Now, you have invited some friends over for an Indian dinner, and you are going to prepare a Tikka Masala curry. Your basic kitchen doesn't have a lot of the things needed - some of which are commonly available from any store in Italy, some others that can only be found in specialised stores. Chicken and yoghurt can easily be found in any supermarket; curry spices and coconut milk must be bought in Asian markets. 
Similarly in Python you may need to get some **modules from "outside" as they are not available in the basic installation**; this is commonly done through a module manager called [`pip`](https://pip.pypa.io/en/stable/).  

What if one of your guests is coeliac? Well, you'd have to get some special ingredients, as well as using tools that must not come in contact with non-gluten-free ingredients. You would have to keep these ingredients and tools separate from the other ones, since mixing them may cause serious consequences: the former are conflicting with the latter.  
The same may happen with Python modules: you may be using a module that requires specific versions of other modules, which in turn may be incompatible with another module. Luckily with Python you can create a **virtual environment** (often called `venv`), a self-contained and isolated "box" into which you can install modules and keep them separated from modules installed inside of a second virtual environment.  
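
If you ever wonder which "kitchen" you are currently cooking in, Python itself can tell you. The lines below are only a minimal sketch using the built-in `sys` module: they print the location of the Python program that is running the code and the folder of the active environment. Note that the last check only recognises environments created with Python's own `venv` tool; for `conda` environments, looking at the printed folder is usually the more reliable clue.

```python
import sys

# Location of the Python program that is running this code;
# inside a virtual environment it points into that environment's folder
print(sys.executable)

# Root folder of the environment currently in use
print(sys.prefix)

# True inside environments created with Python's own `venv` tool;
# conda environments usually report False here
in_venv = sys.prefix != sys.base_prefix
print(in_venv)
```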

So our next step is to install Python through `conda`, a package manager which simplifies the installation of modules as well as the creation of virtual environments.

# Installing Python

Install [Miniconda](https://docs.anaconda.com/free/miniconda/index.html) or alternatively [Miniforge](https://conda-forge.org/miniforge/) (for the latter, select the installer for **Miniforge3**, the last in the table with the download links). Setup guides for Miniconda are available for [Windows](https://katiekodes.com/setup-python-windows-miniconda/) and [macOS](https://medium.com/@sophieowen_40339/how-to-install-conda-and-create-virtual-environments-on-mac-m1-a3a15093820b).  
Miniconda/Miniforge will install Python on your system, and automatically create a virtual environment called `base`, which takes the place of (without deleting) any Python installation already present on your system.  
Once installed, opening a command prompt (Windows) or terminal (macOS) will show something like the following:

> `(base) catlism@debian:~$`

where your username, the name of the PC, and the name of the folder you are currently in are preceded by the label `(base)`. This is how `conda` lets you know that you are currently using its basic (default) virtual environment.  
From here we can create a new environment called `test` by writing the following command followed by `Enter`:

> `conda create --name test`

which we can then activate using the command:

> `conda activate test`

This will be reflected in the terminal, where something like the line below will replace the previous `(base)` version:

> `(test) catlism@debian:~$`

If you want to deactivate the `test` environment and switch back to the `base` one, use the following command:

> `conda deactivate`

Further details may be found in [this page](https://catlism.github.io/setup_env/conda.html).

## Installing JupyterLab to run this notebook

Now, let's create a `venv` for this notebook, asking `conda` to place a copy of Python inside it (so that `pip` installs modules into this environment and not somewhere else).

> `conda create --name primer python`

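We now activate the newly created `primer` virtual environment

> `conda activate primer`

Finally, we install the required modules (the quotes prevent some terminals, e.g. the default one on macOS, from misreading the square brackets)

> `pip install jupyterlab "python-lsp-server[all]"`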

We can now start JupyterLab

> `jupyter lab`

and then, once inside JupyterLab, load this notebook by double-clicking on the file *Python_primer.ipynb*.  


## Notes
1. Remember: JupyterLab will use as working folder the folder in which the command `jupyter lab` is run.
2. Windows and macOS use different syntax to identify paths (i.e. the locations of folders and files); the path to the Instagram data used in this notebook is written as `data\instagram\` on Windows, and as `data/instagram/` on macOS.
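
One way to avoid worrying about the path syntax mentioned in note 2 is the built-in `pathlib` module, which picks the right separator for the operating system in use. What follows is just a sketch: the folder and file names are the ones used later in this notebook, and nothing is actually opened or read here.

```python
from pathlib import Path

# Build the path piece by piece; pathlib inserts the separator
# that is appropriate for the operating system in use
folder = Path("data") / "instagram"
file_path = folder / "2022-01-02_14-00-14_UTC.txt"

print(file_path)         # the full (relative) path
print(file_path.name)    # the file name only
print(file_path.stem)    # the file name without its extension
print(file_path.suffix)  # the extension
print(Path.cwd())        # the current working folder (see note 1)
```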

# The two golden rules
In order to begin approaching Python code, two rules need to be known:

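1. Text preceded by a `#` symbol is a **comment meant for humans and is ignored by Python**. Text enclosed in triple single or double quotes (`'''` or `"""`) is commonly used in the same way for multiline comments, although strictly speaking it is a string that Python reads and then discards when it is not assigned to anything (in a notebook, it is echoed as output if it is the last thing in a cell). Everything else is interpreted by Python as code.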
2. Graphical indentation is meaningful in Python, and defines the hierarchy of the code.

In [None]:
# This is a single-line comment

"""
This is a
multiline
comment
"""

In [None]:
# the following code has a hierarchy, whereby `print(c)` is a child of `for c in "example"`
for c in "example":
    print(c)

# Types of data
Python is able to read different types of data, and do different things with each one of them. This is similar to what humans can and cannot do: we can add/divide/multiply/subtract numbers, but not letters.

## Numbers
Python (just like the majority of programming languages) distinguishes [numbers](https://www.w3schools.com/python/python_numbers.asp) by grouping them into two sub-categories: `integers` and `floats` (a third type exists, but we'll ignore it as we're not going to need it).  
In the code below, we assign four different numbers to four different variables (`w`, `x`, `y`, `z`), and ask Python to print the `type` of each one.

In [None]:
w = 1
x = -3255522
y = 1.09834
z = -20.976362

print(type(w))
print(type(x))
print(type(y))
print(type(z))
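
To make the "numbers, but not letters" point more tangible, here is a small sketch (the variable names are made up for the occasion) showing what happens when integers, floats, and a string that merely *looks like* a number are combined.

```python
count = 3      # an integer
price = 2.5    # a float
label = "3"    # a string that happens to contain a digit

# Combining an integer and a float gives a float
total = count * price
print(total, type(total))

# Adding a number and a string is refused by Python with a TypeError
try:
    count + label
except TypeError as error:
    print(f"Python refuses: {error}")

# The string must first be explicitly converted into a number
converted = count + int(label)
print(converted)
```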

## Text strings
A string of text is defined by enclosing it in single or double quotes.  
In the example below, we assign a string to the variable called `text`; then we ask Python to operate on the string. We want Python to take each minimal unit - which we call `c` - of the object stored in `text` and print it, one after the other, until the object (i.e. the string) is over.

In [None]:
text = "This is a sample sentence."

for c in text:
    print(c)

By default Python sees as minimal unit of a string what we humans call a character.  
We can change this behaviour by applying one or more **methods**.  
A **method** is a function attached to an object: the dot `.` indicates that the function is applied to what is on its left, and the parentheses `()` following the name indicate that the function is being called.  
For example the [`.split()`](https://www.w3schools.com/python/ref_string_split.asp) method splits a string of text whenever it finds a whitespace:

In [None]:
for c in text.split():
    print(c)
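
`.split()` is only one of many string methods. The sketch below shows a few others that tend to come up when cleaning text; none of them modifies the original string, they all return a new one (or a list, in the case of `.split()`), and they can be chained one after the other.

```python
text = "This is a sample sentence."

lowered = text.lower()                      # everything in lowercase
replaced = text.replace("sample", "short")  # swap one substring for another
words = text.split()                        # list of whitespace-separated units
chained = text.lower().split()              # methods applied left to right

print(lowered)
print(replaced)
print(words)
print(chained)
print(text)  # the original string is untouched
```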

## Lists

Lists are used to store multiple items (e.g. strings or numbers) in a single variable, such as:

In [None]:
ingredients = ["chicken", "curry spices", "yoghurt", "coconut milk"]

The four elements are seen by Python as the minimal units of the object `ingredients`:

In [None]:
for i in ingredients:
    print(i)

List items are **ordered**, **changeable**, **allow duplicate values**, and are **indexed** - i.e. the first item has index `[0]`, the second item has index `[1]` etc.

In [None]:
# Lists are ordered and indexed

print(ingredients[1])

In [None]:
# Lists are changeable
ingredients[2] = "vegan yoghurt"
print(ingredients)

In [None]:
# Lists may contain duplicate values
ingredients.append("chicken")
# we need A LOT of chicken!
print(ingredients)

In [None]:
# How many items are in a list?
print(len(ingredients))

## Dictionaries
Dictionaries are used to store data values in `key:value` pairs.

A dictionary is a collection which is **ordered**, **changeable** and **does not allow duplicate keys**.

In [None]:
shopping_list = {
    "chicken": "1 whole",
    "curry spices": "150gr",
    "yoghurt": "200gr",
    "coconut milk": "400ml",
}

print(shopping_list)

In [None]:
# Dictionaries keep their insertion order, but values are accessed by key rather than by position
print(shopping_list["chicken"])

In [None]:
# Dictionaries are changeable
shopping_list["yoghurt"] = "400gr"
print(shopping_list)

In [None]:
# Dictionaries do not allow duplicate keys: when a key is repeated, the last value wins
shopping_list = {
    "chicken": "1 whole",
    "curry spices": "150gr",
    "yoghurt": "200gr",
    "coconut milk": "400ml",
    "coconut milk": "300ml",
}

print(shopping_list)

In [None]:
# How many items are in a dictionary?
print(len(shopping_list))
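
Since dictionaries store pairs, a few extra operations are worth knowing: looping over keys and values together, checking whether a key exists, and asking for a key that may be missing without causing an error. The sketch below uses a made-up `spice_rack` dictionary so as not to interfere with `shopping_list`.

```python
spice_rack = {"cumin": "50gr", "turmeric": "30gr", "garam masala": "40gr"}

# .items() gives access to key and value at the same time
for spice, quantity in spice_rack.items():
    print(f"{spice}: {quantity}")

# .keys() returns the keys only, in insertion order
spices = list(spice_rack.keys())

# `in` checks whether a key is present
has_saffron = "saffron" in spice_rack

# .get() returns a fallback value instead of raising an error
saffron_quantity = spice_rack.get("saffron", "not in the rack")

# assigning to a new key adds a new pair
spice_rack["saffron"] = "1gr"
print(spice_rack)
```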

## Other types
Other types of data exist, but we're not going to cover them here (nor use them!). You may read more details about them [here](https://python101.pythonlibrary.org/chapter3_lists_dicts.html) and [here](https://www.w3schools.com/python/python_datatypes.asp). 

# Regular expressions

> Regular expressions (also known as regexes or regex patterns) are strings of text interpreted by a software or a programming language as rules for matching one or more patterns inside a set of strings [...] Regexes are extremely flexible and can be used to match any type of pattern, from simple words to email addresses, to more complex constructions; this flexibility comes at the expense of their readability, as they can become extremely complex to interpret or to build. The easiest way to approach them and to learn their syntax is by using a tool such as [RegExr](https://regexr.com/) (Skinner 2022), an open source application for creating, testing, and learning regular expressions (Di Cristofaro 2023:130-132)

Many of the code examples below make use of regular expressions through the built-in library `re`, and we're going to see through such examples some use-cases.
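
As a first taste, here is a minimal sketch of the three `re` functions that appear most often: `re.findall` (collect all matches), `re.search` (find the first match), and `re.sub` (replace matches). The caption is invented for illustration, and the patterns are deliberately simple; writing them as raw strings (`r"..."`) stops Python from interpreting the backslashes itself.

```python
import re

# An invented caption, used only for illustration
caption = "Dinner time! #tikkamasala with @friend_1 and @friend_2 at 20:30 #foodlover"

# All sequences made of `#` followed by one or more "word" characters
hashtags = re.findall(r"#\w+", caption)

# Same idea, but for mentions
mentions = re.findall(r"@\w+", caption)

# First occurrence of two digits, a colon, and two digits
time_match = re.search(r"\d{2}:\d{2}", caption)

# Replace every hashtag with nothing (i.e. delete it)
no_hashtags = re.sub(r"#\w+", "", caption)

print(hashtags)
print(mentions)
print(time_match.group())
print(no_hashtags)
```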

# Reading files


## Reading one single file
Reading a file entails two operations: first we `open` it, then we `read` it.

In [None]:
with open("./data/instagram/2022-01-02_14-00-14_UTC.txt", "r") as text:
    print(text)
    

In [None]:
with open("./data/instagram/2022-01-02_14-00-14_UTC.txt", "r") as text:
    print(text.read())

## Reading multiple files

In [None]:
from glob import glob

files = glob("./data/instagram/*.txt")

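for file in files:
    # `with` closes each file as soon as it has been read;
    # the encoding is stated explicitly so that emojis are read correctly on any system
    with open(file, "r", encoding="utf-8") as f:
        text = f.read()
    print(text)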

# Writing to file(s)
There are different `modes` to write data to a file; the most important ones to work with digital textual data are `w` and `a`.

In [None]:
ingredients = ["chicken", "curry spices", "yoghurt", "coconut milk"]
more_ingredients = ["rice","ghee","salt","bay leaves"]

In [None]:
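# "w" (write) creates the file, or empties it if it already exists
out = open("list_of_ingredients.txt", "w")
for i in ingredients:
    out.write(f"{i}\n")
out.close()

In [None]:
# opening the same file in "w" mode again overwrites its previous contents
out = open("list_of_ingredients.txt", "w")
for m in more_ingredients:
    out.write(f"{m}\n")
out.close()

In [None]:
# "a" (append) adds to the end of the file, keeping what is already there
out = open("list_of_ingredients.txt", "a")
for i in ingredients:
    out.write(f"{i}\n")
out.close()

In [None]:
# `with` closes the file automatically, so no out.close() is needed
with open("list_of_ingredients.txt", "a") as out:
    for i in ingredients:
        out.write(f"{i}\n")
    for m in more_ingredients:
        out.write(f"{m}\n")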
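
To see the difference between `w` and `a` without opening the file by hand, the sketch below writes to a throwaway file inside a temporary folder and reads it back after each step. The names `first_batch`, `second_batch`, and `demo.txt` are made up for this example.

```python
import tempfile
from pathlib import Path

first_batch = ["chicken", "yoghurt"]
second_batch = ["rice", "ghee"]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.txt"

    # "w" creates the file and writes the first batch
    with open(target, "w", encoding="utf-8") as out:
        for item in first_batch:
            out.write(f"{item}\n")

    # "w" again: the previous contents are overwritten
    with open(target, "w", encoding="utf-8") as out:
        for item in second_batch:
            out.write(f"{item}\n")
    after_w = target.read_text(encoding="utf-8").splitlines()

    # "a": the new lines are added after the existing ones
    with open(target, "a", encoding="utf-8") as out:
        for item in first_batch:
            out.write(f"{item}\n")
    after_a = target.read_text(encoding="utf-8").splitlines()

print(after_w)
print(after_a)
```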
        

# Tabular data: `csv`

Exemplified below is an advanced usage of `pandas`, a module made for working with a very powerful version of spreadsheets called `dataframes`. The module is one of the most important tools for data science, and has a lot of features used to process, transform, and analyse data stored in tabular format.  
An easier - but less powerful - option to work with tabular data is the module `csv` (included in any default Python installation, but it needs to be imported through the line `import csv` at the beginning of a script).  
Whenever we work with tabular data, it's always a good choice to save it in `csv`, a plain-text format that any tool can read; other formats (e.g. Excel `xlsx`) are more complex and usually require additional modules to be read by Python.  
So, when working with Excel/Google Sheets/etc... select `csv` from the "Save as" option (see Di Cristofaro 2023:104-116 for more details).
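
As a rough sketch of the simpler option, the built-in `csv` module can turn a small table into a dictionary that works as a lookup table. The table below is invented and written directly in the code as a tab-separated string (read through `io.StringIO`, which lets a string be treated like a file), so that the example runs without any file; the column names `filename` and `type` and all the values are assumptions made for illustration.

```python
import csv
import io

# An invented tab-separated table, standing in for a real file
sample = "filename\ttype\npost_a.txt\tphoto\npost_b.txt\tvideo\n"

# DictReader uses the first row as column names
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")

# Build a lookup table: file name -> the whole row
lookup = {row["filename"]: row for row in reader}

print(lookup["post_b.txt"]["type"])
```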

The code below exemplifies a common operation when working with digital textual data: we have a `csv` file where we have saved metadata (additional, often manually entered, details) for the Instagram files in the `instagram` folder. At some point we will likely need to merge the source data with our metadata details, and `pandas` can help us automate this operation (thus reducing the risk of errors) by using the metadata `csv` file as a **lookup table**.

In [None]:
import pandas as pd

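# the columns in the file are separated by tabs, hence sep="\t"
metadata = pd.read_csv("./data/metadata.csv", sep="\t")
# use the file names as row labels, so that rows can be looked up by file name
metadata = metadata.set_index("filename")
# keep one row per file name (the first non-empty value of each column)
metadata = metadata.groupby(metadata.index).first()

# look up the value of the column "type" for one specific file
print(metadata.loc["2022-12-12_22-00-37_UTC.txt", "type"])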

# Marked-up data: `xml`
Arguably the single most important format for working with digital textual data, `xml` is a markup language that allows us to include additional information alongside text contents (see Di Cristofaro 2023:105-111 for more details and for the basic rules of `xml`).  
At their core, `xml` files are text files where things enclosed in tags (i.e. between an opening `<` and a closing `>`) have a special meaning, and are interpreted by computers as special objects.  
Tags create a structure, which **parsers** (such as Python `lxml` or `beautifulsoup`) can read and understand, allowing them to navigate inside the data using said structure. As such, we should never treat `xml` as "text files only", meaning that using **regular expressions** to create/edit them is (almost) always a bad idea!  
The code below exemplifies some basic operations using a dataset of webpages collected in `xml` through [`trafilatura`](https://trafilatura.readthedocs.io/en/latest/) (see also Di Cristofaro 2023:156-160; [catlism](https://catlism.github.io/data_collection/general_purpose/trafilatura.html))
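
Before turning to that dataset, here is a dependency-free warm-up that uses `xml.etree.ElementTree`, a parser included in the default Python installation (so not `lxml` or `beautifulsoup`, which are used in the rest of this section). The tiny document is invented for illustration and is much simpler than real `trafilatura` output; the sketch reads an attribute, collects the text of some tags, and renames the root tag.

```python
import xml.etree.ElementTree as ET

# An invented, minimal document used only for illustration
sample = '<doc author="Jane Doe"><p>First paragraph.</p><p>Second paragraph.</p></doc>'

# Parse the string into a tree and get its root tag
root = ET.fromstring(sample)

# Read the value of an attribute of the root tag
author = root.get("author")

# Collect the text contained in each <p> tag
paragraphs = [p.text for p in root.findall("p")]

# Rename the root tag, then turn the tree back into a string
root.tag = "text"
serialised = ET.tostring(root, encoding="unicode")

print(author)
print(paragraphs)
print(serialised)
```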

### Parsing existing metadata
The first example shows how to extract the value of the attribute `author` in each `xml` file. But first we need to install beautifulsoup and lxml through

> `pip install beautifulsoup4 lxml`

In [None]:
from bs4 import BeautifulSoup
from glob import glob

data = glob("./data/xml/*.xml")

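for d in data:
    with open(d, encoding="utf-8") as f:
        # the "xml" parser (provided by lxml) treats the file as XML rather than HTML
        soup = BeautifulSoup(f, "xml")
    # find the first <doc> tag and read the value of its `author` attribute
    author = soup.find("doc")["author"]
    print(author)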

### Changing the name of a tag in an xml file
The second example involves parsing existing `xml` files, and changing the main (root) tag `doc` into `text`, so that they are compatible with all the corpus tools that accept `xml` files as input.

In [None]:
# first, we need to install beautifulsoup and lxml through
# pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
from glob import glob

data = glob("./data/xml/*.xml")

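for d in data:
    # path of the file without the .xml extension (handy if the result is later saved)
    filename = d.replace(".xml", "")
    with open(d, encoding="utf-8") as f:
        # the "xml" parser (provided by lxml) keeps the document as XML, instead of
        # wrapping it in <html>/<body> tags as the "lxml" HTML parser does
        soup = BeautifulSoup(f, "xml")
    doc_tag = soup.find("doc")
    # renaming a tag changes both its opening and its closing part
    doc_tag.name = "text"
    print(soup)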

### Creating xml files
In this last example we switch back to Instagram data, and create an `xml` file for each post, adding the metadata for each post from the previously seen `csv` file. This time we use `lxml` instead of `beautifulsoup` for a change: both can read (parse) and write `xml` and `html` files, and `beautifulsoup` in fact relies on a parser such as `lxml` to do its work.

In [None]:
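from glob import glob
from pathlib import Path

import pandas as pd
from lxml import etree

# load the metadata and prepare it as a lookup table (see the `csv` section)
metadata = pd.read_csv("./data/metadata.csv", sep="\t")
metadata = metadata.set_index("filename")
metadata = metadata.groupby(metadata.index).first()

data = glob("./data/instagram/*.txt")

for d in data:
    # file name without the folder part (works with both / and \ separators)
    filename = Path(d).name
    with open(d, "r", encoding="utf-8") as f:
        text = f.read()
    # create the root tag <text> and add the attribute `type` from the metadata
    root_tag = etree.Element("text")
    # attribute values must be strings
    root_tag.attrib["type"] = str(metadata.loc[filename, "type"])
    # the contents of the post become the text of the root tag
    root_tag.text = text
    tree = etree.ElementTree(root_tag)
    tree.write(
        f"./data/instagram/{filename}.xml",
        pretty_print=True,
        xml_declaration=True,
        encoding="utf-8",
    )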
    
    

# Language recognition
As with our Instagram data, it may be the case that our dataset contains multiple languages - an issue if we are going to use e.g. corpus tools to analyse it!  
Luckily there are many libraries that allow Python to identify which language is used in a string/text; one such library is [`lingua-py`](https://github.com/pemistahl/lingua-py).  
We install it with the command  

> `pip install lingua-language-detector`

In [None]:
from lingua import Language, LanguageDetectorBuilder
from glob import glob
import re

# Setup some variables and parameters to be later used
languages = [Language.ENGLISH, Language.FRENCH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

In [None]:
posts = glob("./data/instagram/*.txt")

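for post in posts:
    with open(post, "r", encoding="utf-8") as f:
        lines = f.readlines()
    for line in lines:
        # the result is None when no language can be reliably detected
        lang = detector.detect_language_of(line)
        print(f"TEXT::{line}\nLANG::{lang}\n")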
        

Alternatively, we can use `lingua-py` to separate text in English from text in French by creating two separate files, named after the original post filename. 

In [None]:
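from pathlib import Path

posts = glob("./data/instagram/*.txt")

for post in posts:
    with open(post, "r", encoding="utf-8") as f:
        lines = f.readlines()
    # file name without folder and extension, on any operating system
    post_name = Path(post).stem
    post_lines = []
    recognised_lang = None
    for line in lines:
        lang = detector.detect_language_of(line)
        if lang is None:
            # nothing detected (e.g. a line of emojis only):
            # fall back on the language of the previous line
            lang = recognised_lang
        else:
            recognised_lang = lang
        post_lines.append((line, lang))
    lines_en = [line for line, lang in post_lines if lang == Language.ENGLISH]
    lines_fr = [line for line, lang in post_lines if lang == Language.FRENCH]
    # lines read with readlines() already end with a newline;
    # "w" is used so that re-running the cell does not duplicate the contents
    with open(f"{post_name}_EN.txt", "w", encoding="utf-8") as post_en:
        post_en.writelines(lines_en)
    with open(f"{post_name}_FR.txt", "w", encoding="utf-8") as post_fr:
        post_fr.writelines(lines_fr)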

# Emoji transliteration
Emojis pose a number of issues when working with corpus tools (AntConc, SketchEngine, LancsBox, LancsBox X, WordSmith, etc...) since they are often not fully supported (yet). The reason is extremely complex, but the consequences are straightforward: corpus tools don't "see" emojis as we humans do, and - more often than not - miscount and misrepresent them. This leads in turn to a skewed corpus and skewed results.  

## Emoji consequences
For example, the following "sentence" is interpreted by us as composed of 3 different emojis, while corpus tools may interpret them in various ways, including seeing 6 (or more) emojis.

> 👯‍♂️ 👩🏿‍🦰 🥷🏼

The three emojis above all have an official description (called [CLDR](https://unicode.org/emoji/charts/full-emoji-list.html)) provided by the Unicode consortium. These are:  

- [Men with Bunny Ears](https://emojipedia.org/men-with-bunny-ears)
- [Woman: Dark Skin Tone, Red Hair](https://emojipedia.org/woman-dark-skin-tone-red-hair)
- [Ninja: Medium-Light Skin Tone](https://emojipedia.org/ninja-medium-light-skin-tone)

The peculiarity of these 3 emojis (and of a number of other ones) is that they are created by combining together two or more emojis:  

> 👯 People with Bunny Ears + ♂️ Male Sign  
> 👩 Woman + 🏿 Dark Skin Tone + 🦰 Red Hair  
> 🥷 Ninja + 🏼 Medium-Light Skin Tone  
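
We can ask Python to show what is "inside" one of these emojis. In the sketch below the first emoji is written through its Unicode code points (rather than pasted in) to make sure that exactly these characters are used; besides the two visible components listed above, the sequence also contains two invisible characters that glue the parts together and select the emoji-style rendering. The names printed are the formal Unicode character names, which may differ from the CLDR descriptions.

```python
import unicodedata

# "Men with Bunny Ears", written as a sequence of code points
men_with_bunny_ears = "\U0001F46F\u200D\u2642\uFE0F"

# What we see as one emoji is counted by Python as several characters
print(len(men_with_bunny_ears))

# Print the code point and the official Unicode name of each character
names = [unicodedata.name(ch) for ch in men_with_bunny_ears]
for ch, name in zip(men_with_bunny_ears, names):
    print(f"U+{ord(ch):04X} {name}")
```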

To avoid issues - and have a corpus as faithful as possible to the original data - we may *transliterate* emojis into their CLDR, thus turning each emoji into a "bundle of words" that contain their official descriptions; e.g.

> 👯‍♂️ -> {men_with_bunny_ears}

The syntax used for the description is applied so that the corpus tool doesn't interpret each word of the description as a token, but rather treats the `{men_with_bunny_ears}` as a single (unknown) token.  
Emoji transliteration can be achieved through the Python module [`emoji`](https://github.com/carpedm20/emoji) (see also [here](https://catlism.github.io/data_processing/emoticons_emojis.html)), which we now install with the command

> `pip install emoji`

In [None]:
import emoji

# Define a custom function to transliterate emojis and add curly brackets as delimiters for the CLDR

def demoji(chars, data_dict):
    trans = emoji.demojize(chars, delimiters=("{", "}"))
    return trans

In [None]:
posts = glob("./data/instagram/*.txt")

for post in posts:
    with open(post, "r", encoding="utf-8") as f:
        lines = f.readlines()
    for line in lines:
        # replace_emoji calls demoji(chars, data_dict) for every emoji it finds
        line = emoji.replace_emoji(line, replace=demoji)
        print(line)

# Hashtag segmentation

Oftentimes hashtags are formed by two or more words merged together, making the actual linguistic contents incomprehensible to language analysis tools.  
The following script is an adaptation of [s5.17](https://github.com/catlism/catlism_scripts/raw/main/s5.17_wordsegment_hashtags.py) (Di Cristofaro 2023), and uses the module [`wordsegment`](https://github.com/grantjenks/python-wordsegment) to identify when two or more words are merged and subsequently split them. More info can be found [here](https://catlism.github.io/data_processing/hashtags.html); we now install the library with the command

> `pip install wordsegment`

In [None]:
import re
import wordsegment

wordsegment.load()
# a raw string (r"...") stops Python from interpreting the backslashes itself
hashtag_re = re.compile(r"(?:^|\s)([＃#]{1})(\w+)", re.UNICODE)

In [None]:
posts = glob("./data/instagram/*.txt")

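for post in posts:
    with open(post, "r", encoding="utf-8") as f:
        text = f.read()
    for hashtag in re.findall(hashtag_re, text):
        # each match is a pair: (hash symbol, text of the hashtag)
        found_hashtag = "".join(hashtag)
        clean_hashtag = hashtag[1]
        # split the hashtag into words and join them with spaces
        segmented = " ".join(wordsegment.segment(clean_hashtag))
        # wrap the segmented version in a tag that preserves the original form
        tag = f"<exhashtag original='{clean_hashtag}'>{segmented}</exhashtag>"
        text = text.replace(found_hashtag, tag)
    print(text)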
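
One possible variation, sketched below, lets `re.sub` do the replacement through a small function, so that each hashtag is replaced exactly where it was matched (plain string replacement could also touch a longer hashtag that begins with the same letters). To keep the sketch runnable without installing anything, `toy_segment` and its tiny dictionary are made-up stand-ins for `wordsegment.segment`, and the caption is invented; the pattern is a slightly modified version of the one above, capturing the leading whitespace so that it can be put back.

```python
import re

# Hypothetical stand-in for wordsegment.segment(): a tiny hand-made dictionary
KNOWN_SPLITS = {"foodlover": ["food", "lover"], "tikkamasala": ["tikka", "masala"]}

def toy_segment(tag):
    # return the known split, or the tag itself when it is not in the dictionary
    return KNOWN_SPLITS.get(tag.lower(), [tag])

# groups: leading whitespace (or start of text), hash symbol, hashtag text
hashtag_pattern = re.compile(r"(^|\s)([＃#])(\w+)")

def wrap(match):
    leading, _, tag = match.groups()
    segmented = " ".join(toy_segment(tag))
    # put the leading whitespace back in front of the new tag
    return f"{leading}<exhashtag original='{tag}'>{segmented}</exhashtag>"

caption = "Best dinner ever #foodlover #food #tikkamasala"
result = hashtag_pattern.sub(wrap, caption)
print(result)
```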

# Where to next?
What follows is a list of "topics" and suggested "materials" (e.g. names of modules, technical terms used to refer to a topic, references, guides, etc...) you may want to investigate and consider while continuing your journey into Python.  
You may also want to consult some Python guides, such as:
- [Python 101](https://python101.pythonlibrary.org/index.html)
- [Introduction to Cultural Analytics & Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html)
- [Tutorials on W3Schools](https://www.w3schools.com/python/default.asp)

|topic|terms|
|---|---|
|markdown|[markdownguide](https://www.markdownguide.org/); also this entire notebook (at least the textual parts) is written in markdown, so you may want to double click on a cell and see how it was written|
|manipulating text strings|f-strings (i.e. the syntax `f""`)|
|character encodings/UTF-8/Unicode|[The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles](https://zenodo.org/records/1300528); [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/); [What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text](https://kunststube.net/encoding/)|
|tabular data (reading/parsing/editing/creating)|[pandas](https://www.w3schools.com/python/pandas/default.asp)|
|HTML data (reading/parsing)|[beautifulsoup](https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/)|
|XML data (reading/parsing/editing/creating)|[lxml](https://lxml.de/tutorial.html); [beautifulsoup](https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/)|
|deleting/renaming/creating local files/folders|[os](https://www.w3schools.com/python/module_os.asp)|
|digital textual data formats|along with `csv` and `xml`, also check out `json` (Di Cristofaro 2023:104-116)|
|linguistic annotation/tagging|[PyMUSAS](https://ucrel.github.io/pymusas/), which powers WMatrix, and is powered by [spaCy](https://spacy.io/) - THE most important NLP Python library|