# Working with digital textual data: a Python primer

## Disclaimer and scope
The materials contained in this interactive Jupyter notebook are meant to provide a set of (some of the) main concepts and mechanisms @@@. As such, they are meant to be used as a @@@ when you start experimenting with Python on your own.  
This notebook is in no way meant to be an "introductory course to Python", nor is it meant to teach you how to write or use Python fluently.

## How to use this notebook
You should run this notebook (i.e. this `.ipynb` file) from a local copy on your PC; this assumes @@@

# Programming languages
To simplify, the thousands of available programming languages can be categorised according to two major characteristics:

a. Low- or high-level  
b. General-purpose (GPL) or domain-specific (DSL) language  
  
a. A high-level programming language (such as Python) provides a "strong abstraction from the details of the computer. In contrast to low-level programming languages, it may use natural language elements, be easier to use, or may automate (or even hide entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable than when using a lower-level language." ([source](https://en.wikipedia.org/wiki/High-level_programming_language))  
  
b. A general-purpose programming language (such as Python) can be used to develop a wide range of different applications (e.g. videogames, web apps, websites, image editing, textual data processing, etc.), while domain-specific ones are tailored to one specific purpose (or a small number of them).

In [None]:
# This is a single-line comment

"""
This is a
multiline
comment
"""

# Cooking with Python: a culinary analogy
Let's pretend you have just bought a new apartment in Italy, and that it comes pre-furnished. In it you have a kitchen, a basic one; let's pretend it's a *basic Italian kitchen*. A kitchen where you will find all the tools and ingredients that are needed to prepare the basic Italian dishes: an oven, a stove, some pans, some cutlery; pasta, salt, pepper, olive oil, tomatoes, onions, garlic.  
You won't find Garam Masala in it, nor sushi rice, nor a wok. Just a basic Italian kitchen to get you started with Italian dishes.  
The default Python installation is just like this kitchen: it has the basic tools and ingredients (called **modules** or **libraries**) and nothing more.  
  
Now, you have invited some friends over for an Indian dinner, and you are going to prepare a Tikka Masala curry. Your basic kitchen doesn't have a lot of the things needed - some of which are commonly available from any store in Italy, some others that can only be found in specialised stores. Chicken and yoghurt can easily be found in any supermarket; curry spices and coconut milk can be bought in Asian markets. 
Similarly, in Python you may need to get some **modules from outside, as they are not available in the basic installation**; this is commonly done through a package installer called [`pip`](https://pip.pypa.io/en/stable/).  

What if one of your guests is coeliac? Well, you'd have to get some special ingredients, and use tools that must not come in contact with @@@. You would have to keep these ingredients and tools separate from the other ones, since mixing them may have serious consequences: the former conflict with the latter.  
The same may happen with Python modules: you may be using a module that requires specific versions of other modules, which in turn may be incompatible with an additional module. Luckily with Python you can create a **virtual environment** (often called `venv`), a self-contained and isolated "box" into which you can install modules and keep them separated from modules installed inside of a second virtual environment.

Throughout this notebook we will install Python through `conda`, a package manager which simplifies the installation of modules as well as the creation of virtual environments.

# Installing Python

Install [Miniconda](https://docs.anaconda.com/free/miniconda/index.html) or, alternatively, [Miniforge](https://conda-forge.org/miniforge/) (select the installer for **Miniforge3**, the last in the table with the download links).  
Miniconda/Miniforge will install Python on your system and automatically create a virtual environment called `base`, which will replace (not delete, but "take the place of") any Python already installed on your system.  
Once installed, opening a command prompt (Windows) or terminal (macOS, Linux) will show something like the following:

> `(base) catlism@debian:~$`

where your username, the name of the PC, and the name of the folder you are currently in are preceded by the label `(base)`. This is how `conda` lets you know that you are currently using its basic (default) virtual environment.  
From here we can create a new environment called `test` by writing the following command followed by `Enter`:

> `conda create --name test`

which we can then activate using the command:

> `conda activate test`

This will be reflected in the terminal, where something like the line below will replace the previous `(base)` version:

> `(test) catlism@debian:~$`

If you want to deactivate the `test` environment and switch back to the `base` one, use the following command:

> `conda deactivate`

Further details may be found on [this page](https://catlism.github.io/setup_env/conda.html).
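
Once an environment is active, it may help to check from within Python which interpreter is actually being used. The lines below are a minimal sketch relying only on the standard `sys` module; the path printed will differ on your machine, and the variable name `python_version` is just an illustrative choice.

```python
import sys

# Path of the Python interpreter that is running this code; with an active
# conda environment it normally points inside that environment's folder
print(sys.executable)

# Version of the interpreter, as a (major, minor, micro) tuple
python_version = tuple(sys.version_info[:3])
print(python_version)
```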

# The two golden rules
In order to begin approaching Python code, two rules need to be known:

1. Text preceded by a `#` symbol is a **comment meant for humans and is not read by Python**. Text enclosed between a pair of triple quotes (`'''` or `"""`) is technically a string but, when it is not assigned to a variable, it has no effect on the program, so it is commonly used as a multiline comment. Everything else is interpreted by Python as code.
2. Indentation is meaningful in Python, and defines the hierarchy of the code.

In [None]:
# the following code has a hierarchy, whereby `print(c)` is a child of `for c in "example"`
for c in "example":
    print(c)
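
The hierarchy can have more than one level. The sketch below is only an illustration (the variable name `count` and the choice of counting vowels are ours): each additional level of indentation makes a line the child of the line above it.

```python
count = 0

for c in "example":
    # this line is a child of the `for` loop: it runs once for every character
    if c in "aeiou":
        # this line is a child of the `if`: it runs only when `c` is a vowel
        count = count + 1

# this line is not indented, so it runs only once, after the loop has ended
print(count)
```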

# Types of data
Python is able to read different types of data, and do different things with each one of them. This is similar to 

## Numbers
Python (just like the majority of programming languages) distinguishes [numbers](https://www.w3schools.com/python/python_numbers.asp) by grouping them into two sub-categories: `integers` and `floats` (a third type exists, but we'll ignore it as we're not going to need it).  
In the code below, we assign four different numbers to four different variables (`w`, `x`, `y`, `z`), and ask Python to print the `type` of each one.

In [None]:
w = 1
x = -3255522
y = 1.09834
z = -20.976362

print(type(w))
print(type(x))
print(type(y))
print(type(z))
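
The distinction between the two types matters when doing arithmetic, and when numbers are extracted from a text (where they are initially strings, a type described in the next section). The following is a small sketch with made-up values; all variable names are illustrative.

```python
a = 7
b = 2

total = a + b        # adding two integers gives an integer
ratio = a / b        # the `/` operator always gives a float
whole = a // b       # the `//` operator drops the decimal part

print(total, type(total))
print(ratio, type(ratio))
print(whole, type(whole))

# Numbers written inside quotes are text: `int()` and `float()` convert them
year = int("2022")
price = float("19.99")
print(year + 1, price)
```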

## Text strings
A string of text is defined by enclosing it in single or double quotes.  
In the example below, we assign a string to the variable called `text`; then we ask Python to operate on the string. We want Python to take each minimal unit - which we call `c` - of the object stored in `text` and print it, one after the other until the object (i.e. the string) is over.

In [None]:
text = "This is a sample sentence."

for c in text:
    print(c)

By default, Python treats what we humans call a character as the minimal unit of a string.  
We can change this behaviour by applying one or more **methods**: these are special words that apply a function (the parentheses `()` following the special word indicate it is a function) to the variable that precedes them (the dot `.` indicates that the function is applied to what is on the left of it).  
For example the [`.split()`](https://www.w3schools.com/python/ref_string_split.asp) method splits a string of text whenever it finds a whitespace:

In [None]:
for c in text.split():
    print(c)
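
Other string methods work in the same way, and several methods can be chained one after the other. As a hedged illustration, the sketch below uses the built-in methods `.lower()` and `.replace()` together with `.split()`; the variable names `words`, `lowered` and `cleaned` are ours.

```python
text = "This is a sample sentence."

# Splitting on whitespace and counting the resulting units
words = text.split()
print(len(words))

# Converting every character to lowercase
lowered = text.lower()
print(lowered)

# Methods are applied left to right: lowercase, remove the full stop, then split
cleaned = text.lower().replace(".", "").split()
print(cleaned)
```

Note that `.split()` alone leaves punctuation attached to the preceding word (`"sentence."`), which is why the full stop is removed before splitting in the last step.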

## Lists

Lists are used to store multiple items in a single variable, such as:

In [None]:
ingredients = ["chicken", "curry spices", "yoghurt", "coconut milk"]

The four elements are seen by Python as the minimal units of the object `ingredients`:

In [None]:
for i in ingredients:
    print(i)

List items are **ordered**, **changeable**, **allow duplicate values**, and are **indexed** - i.e. the first item has index `[0]`, the second item has index `[1]` etc.

In [None]:
# Lists are ordered and indexed

print(ingredients[1])

In [None]:
# Lists are changeable
ingredients[2] = "vegan yoghurt"
print(ingredients)

In [None]:
# Lists may contain duplicate values
ingredients.append("chicken")
# we need A LOT of chicken!
print(ingredients)

In [None]:
# How many items are in a list?
print(len(ingredients))
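
A few more list operations are often handy. The sketch below uses a separate, made-up list called `spices` (so that `ingredients` is left untouched) to show negative indexes, slices, removal of an item, and the `in` operator.

```python
spices = ["cumin", "coriander", "turmeric", "paprika"]

# A negative index counts from the end of the list
last = spices[-1]
print(last)

# A slice [start:stop] returns the items from `start` up to (not including) `stop`
first_two = spices[0:2]
print(first_two)

# `.remove()` deletes the first item equal to the given value
spices.remove("coriander")
print(spices)

# `in` checks whether a value is among the items
print("cumin" in spices)
```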

## Dictionaries
Dictionaries are used to store data values in `key:value` pairs.

A dictionary is a collection which is **ordered** (since Python 3.7 it keeps the order in which the items were inserted), **changeable**, and **does not allow duplicate keys**.

In [None]:
shopping_list = {
    "chicken": "1 whole",
    "curry spices": "150gr",
    "yoghurt": "200gr",
    "coconut milk": "400ml",
}

print(shopping_list)

In [None]:
# Values are retrieved through their key, not through a numeric index
print(shopping_list["chicken"])

In [None]:
# Dictionaries are changeable
shopping_list["yoghurt"] = "400gr"
print(shopping_list)

In [None]:
# Dictionaries do not allow duplicate keys: when a key is repeated, the last value wins
shopping_list = {
    "chicken": "1 whole",
    "curry spices": "150gr",
    "yoghurt": "200gr",
    "coconut milk": "400ml",
    "coconut milk": "300ml",
}

print(shopping_list)

In [None]:
# How many items are in a dictionary?
print(len(shopping_list))
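
Dictionaries are particularly useful with textual data, for instance to count how many times each word occurs. The following is a minimal sketch on a made-up sentence; the names `sentence` and `counts` are illustrative, while `.get()` and `.items()` are standard dictionary methods.

```python
sentence = "the cat saw the dog and the dog saw the cat"

counts = {}
for word in sentence.split():
    # `.get(word, 0)` returns the current count, or 0 if the word is not a key yet
    counts[word] = counts.get(word, 0) + 1

# `.items()` yields each key together with its value
for word, n in counts.items():
    print(word, n)
```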

## Other types
There exist other types of data, but we're not going to cover them here (nor use them!). You may read more details about them [here](https://python101.pythonlibrary.org/chapter3_lists_dicts.html) and [here](https://www.w3schools.com/python/python_datatypes.asp). 

# Reading files


## Reading one single file

In [None]:
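import os

# Print the current working directory, to check that the relative path below is valid
print(os.getcwd())
with open("./python_primer/data/ABERCROMBIE_FITCH__AR__NYSE_ANF_2022.txt", "r") as text:
    # `text` is a file object: printing it shows its description, not its contents
    print(text)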

In [None]:
with open("./python_primer/data/ABERCROMBIE_FITCH__AR__NYSE_ANF_2022.txt", "r") as text:
    # `.read()` returns the whole content of the file as a single string
    print(text.read())

## Reading multiple files

In [None]:
from glob import glob

files = glob("./python_primer/data/*.txt")

for file in files:
    # `with` closes each file automatically once it has been read
    with open(file, "r") as f:
        text = f.read()
    print(text)
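
Instead of printing each file, its content can be processed and the result stored. The sketch below is self-contained: it first creates two small, made-up files in a temporary folder (so it does not depend on the `data` folder above), then counts the whitespace-separated words of each one. The file names and the `word_counts` dictionary are assumptions made for illustration.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as folder:
    # Create two hypothetical sample files to work on
    samples = {"a.txt": "first sample file", "b.txt": "second file"}
    for name, content in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as out:
            out.write(content)

    # Read every .txt file in the folder and store its word count
    word_counts = {}
    for file in sorted(glob(os.path.join(folder, "*.txt"))):
        with open(file, "r", encoding="utf-8") as f:
            text = f.read()
        word_counts[os.path.basename(file)] = len(text.split())

print(word_counts)
```

`glob()` does not guarantee any particular order, which is why the list of files is passed through `sorted()`.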

# Writing to file(s)

In [None]:
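with open("output.txt", "w") as out:
    # "w" creates the file, or overwrites it if it already exists; the string is a placeholder
    out.write("This is a sample sentence.\n")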

In [None]:
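with open("output_continuous.txt", "a") as out:
    # "a" appends to the end of the file instead of overwriting it; the string is a placeholder
    out.write("This is a sample sentence.\n")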
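
The difference between the `"w"` (write) and `"a"` (append) modes can be checked with a small experiment. The sketch below works inside a temporary folder, so no file is left behind; the file name `demo.txt` and the three lines of text are made up for the purpose.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")

    with open(path, "w", encoding="utf-8") as out:
        out.write("first line\n")

    # Opening again in "w" mode discards what was written before
    with open(path, "w", encoding="utf-8") as out:
        out.write("second line\n")

    # Opening in "a" mode keeps the existing content and adds to it
    with open(path, "a", encoding="utf-8") as out:
        out.write("third line\n")

    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

print(content)
```

Only the second and third lines survive, since the second `"w"` overwrote the first one.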

# Tabular data: `csv`
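
As a minimal sketch of what working with tabular data may look like, the standard `csv` module can read comma-separated values row by row. The data below is made up and read from a string through `io.StringIO` rather than from a file, so that the example is self-contained; with a real file, the object returned by `open()` would be passed to `csv.DictReader` instead.

```python
import csv
import io

# Hypothetical CSV content: the first row contains the column names
raw = "ingredient,quantity\nchicken,1 whole\nyoghurt,200gr\n"

# `DictReader` turns every row into a dictionary keyed by column name
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

for row in rows:
    print(row["ingredient"], row["quantity"])
```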

# Marked-up data: `xml`
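
As a minimal sketch of what working with marked-up data may look like, the standard `xml.etree.ElementTree` module can parse XML and give access to tags, attributes and text. The XML fragment below, including the tag names `text` and `s` and the attributes `lang` and `n`, is made up for illustration and does not follow any specific annotation scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML content, stored in a string to keep the example self-contained
raw = """<text lang="en">
  <s n="1">And now for something completely different!</s>
  <s n="2">This is a sample sentence.</s>
</text>"""

root = ET.fromstring(raw)

# The tag name and the attributes of the outermost element
print(root.tag, root.attrib["lang"])

# Collect the textual content of every <s> element
sentences = []
for s in root.iter("s"):
    sentences.append(s.text)

print(sentences)
```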

# Language recognition

The language of a text can be recognised with the [`lingua-py`](https://github.com/pemistahl/lingua-py) module, which can be installed through `pip`:

> `pip install lingua-language-detector`

In [23]:
from lingua import Language, LanguageDetectorBuilder

# Setup some variables and parameters to be later used
languages = [Language.ENGLISH, Language.ITALIAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

In [25]:
sent1 = "And now for something completely different!"
sent2 = "E ora, qualcosa di completamente diverso!"
sent3 = "Y ahora algo completamente diferente"

print(detector.detect_language_of(sent1))
print(detector.detect_language_of(sent2))
print(detector.detect_language_of(sent3))

Language.ENGLISH
Language.ITALIAN
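Language.ITALIAN

The third sentence is Spanish, but the detector was built with English and Italian only, so it returns the closer of the two. Adding `Language.SPANISH` to `languages` would let it be recognised.
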
# Linguistic annotations