# **Basics of Python**

**Creators:** Ivan Vazquez, Iwan Paolucci, and William Shropshire

This is a short course covering some fundamental concepts in Python. The course is designed to help you get started with Python programming, and it assumes that you have little knowledge of Python and programming in general. Python programming is a great skill to learn. Python often ranks as the [best language to master](https://spectrum.ieee.org/top-programming-languages/).

**Learning objectives:**
1. Learn about the syntax, operators, common data structures, and fundamental constructs used in Python programs
1. Learn about Python libraries
  - Importing a library



---

## **Variables with Python**

You can create a variable in Python by writing a variable name and assigning a "value" using `=`. For example, `x = 10` and `dog = 'Spot'` are two variables representing an integer and a string, respectively. Variables in Python can represent a variety of data structures ranging from basic numbers (integers and floats) to functions and *classes*<sup>*</sup>. 

<small><sup>*</sup>Please do not worry if you are unfamiliar with programming terms like functions, methods, classes, etc. We will cover these terms later.</small>


### **Creating a variable**

In [None]:
my_first_variable = 'Hello World!' # This is a string

my_second_variable = 100 # This is an integer

You can print the value of a variable using the function `print()`.

In [None]:
print(my_first_variable)

<font color='red'>Pro Tip:</font> There is a style guide for Python called [PEP 8](https://peps.python.org/pep-0008/#introduction). It specifies how to format the names of variables, functions, classes, and other types of constructs. For example, `my_variable` (lowercase words separated by underscores) is the recommended way to name a variable, while `myVariable` is accepted only in code that already uses that style. Using `MyVariable` is recommended for classes and not for "regular" variables. You will not get an error for using a variable naming convention that is different from what PEP 8 recommends, but your code might not be as readable. 



In [None]:
awesome_variable_name = 10

awesomeVariableName = 20 # the first word is all lower case while every other word is Capitalized.

BadName = 30 # This is not going to give an error, but it can confuse someone who may assume that you use good practices

⛔  **Note:** Variable names that start with a number are illegal. Try running the code below.

In [None]:
1my_variable

---
### **Standard operations with variables**


In [None]:
x = 7
n = 2
h = 'Hello'
w = ' world!'

print(x + n) # Addition
print(x*n)   # Multiplication 
print(x**n)  # Exponentiation 
print(x/n)   # Division
print(x//n)  # Integer division (same as floor(x/n) if that is familiar to you)
print(h + w) # Strings can be added! 
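
Two related operations that the cell above does not show are the remainder operator `%` and string repetition with `*`. The sketch below reuses the values of `x`, `n`, and `h` from the cell above; the names `quotient` and `remainder` are only illustrative.

```python
x = 7
n = 2
h = 'Hello'

quotient = x // n    # integer division
remainder = x % n    # remainder left over after the integer division

print(quotient, remainder)
print(quotient * n + remainder == x)  # the two results rebuild x
print(h * 3)                          # a string "multiplied" by an integer is repeated
```

The remainder operator is a common way to check whether a number is even (`number % 2 == 0`).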

<font color='red'>Pro Tip:</font> You can assign values to multiple variables in one line using the technique shown below

In [None]:
a, b = 5, 7 # PT: single line assignment

print(a,'and', b)

**Exercise 1:** First, create two variables that meet PEP 8 standards and set each of them equal to an integer of your choice (e.g. 4, 47, -100,...). Next, set a third variable to be equal to the product of the two variables. Finally, do an integer division between your third variable and 7. Your last line should look something like: `my_last_variable//7`.

In [None]:
# WRITE YOUR ANSWER IN THIS CELL!

### **Logical operators**

In [None]:
a, b = 3, 11

print(a==b), print(a > b), print(a + b >= (a**2+b**2)**0.5); # try removing the ; and see what happens...
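
Each comparison above produces `True` or `False`. Such results can be combined with the keywords `and`, `or`, and `not`. A minimal sketch using the same values of `a` and `b` (the names `both`, `either`, and `negated` are illustrative):

```python
a, b = 3, 11

both = (a < b) and (a > 0)     # True only if both comparisons are True
either = (a == b) or (a < b)   # True if at least one comparison is True
negated = not (a == b)         # flips True to False and vice versa

print(both, either, negated)
```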

---
## **Lists, tuples, sets, and dictionaries**


### **Lists**

A **list** represents a collection of values (enclosed by square brackets) that can be *indexed*. For example, a list with three items can look like `my_list_name = [item_1, item_2, item_3]`. The elements of a list can have different types. For example, the list `my_list = [1, 'a', 10.3]` contains, in order, an integer, a string, and a float. Lists can contain "simple" elements (like numbers and strings) or more exotic elements representing functions or entire datasets. In Python, many functions use lists as inputs and outputs, so it is important to become familiar with lists.



#### **Indexing**

<img src=https://cdn.programiz.com/sites/tutorial2program/files/python-list-index.png width="500">


The elements of a list can be accessed by indexing their location using square brackets. The first element in Python is indexed with `0` while the second element is `1` and so on. For example, if we have the list `my_list = ['p', 'r', 'o', 'b', 'e']` and want to grab the first element, then we type `my_list[0]` and will get the value `'p'` from the list. In Python, you can also index the last element using `-1`. For example, `my_list[-1]` returns `'e'`.   

In [None]:
my_list_of_strings = ['no', 'pressure', 'no', 'diamonds'] # list of like elements (all strings)

print(my_list_of_strings[-2])
print(my_list_of_strings[3])

You can index several elements in a list using a technique called 'slicing' with the following notation: `my_list[first_element:last_element+1]`. Note, the index on the right side of the colon should be 1 element higher than the last element you want to grab. 

In [None]:
my_mixed_list = [10, 'b', 2.3, 'c', 'a small sentence'] # list with mixed elements               

print(my_mixed_list[0:3])
print(my_mixed_list[1:4])
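
Slices also let you leave out either side of the colon, and they accept an optional third value for the step size. The sketch below reuses `my_mixed_list` from the cell above; the other variable names are illustrative.

```python
my_mixed_list = [10, 'b', 2.3, 'c', 'a small sentence']

first_two = my_mixed_list[:2]      # leaving out the start means "from the beginning"
last_two = my_mixed_list[-2:]      # leaving out the stop means "to the end"
every_other = my_mixed_list[::2]   # a third value sets the step size

print(first_two)
print(last_two)
print(every_other)
```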

As mentioned, the elements in a list can be more "exotic". For example, we can have a list nested inside of a list: `my_list = [1, [1,2], 2]`. The list in the cell below has a function for its first element, a nested list for the second element, and a simple float for the third element. 

In [None]:
my_exotic_list = [round, ['nested list', 1.4, '1', 10], 50.3] # The elements of this list are: a function, a (nested) list, a float

Since the first element is the rounding function, we can use it to round numbers:

In [None]:
my_exotic_list[0](1.3567437, ndigits=3) # Note: In a notebook environment like Colab, you do not need to type print to see an output.

❓ What will we see if we run `print(my_exotic_list[1][1])`?

In [None]:
my_exotic_list[1][1]

Lists are **mutable**, which means that you *can* change their content.

In [None]:
my_list = ['i', 8, 'an', 'apple'] # create a list with four mixed elements

In [None]:
my_list[2] = 'd' # change the value of the third element (index = 2)

❓ What will we see if we run `print(my_list)`?

In [None]:
print(my_list)

#### Methods

Different data structures in Python have functions (or methods that belong to classes) associated with them. For example, you can add elements to a list using the `.append()` method. 
```python
>>> my_list = [1,2,3]
>>> my_list.append(4)
>>> print(my_list)
[1, 2, 3, 4]
```
Another example is the `.split()` method for strings, which can split a string relative to a delimiter (see the example below). When we use a method, we "connect" the object (list, string,...) with the method using a period. This is a common syntax for object-oriented programming (OOP), which we can explore more in an advanced seminar. 

In the next cell, we will see that two lists can be combined with `+`. We will also see the effect of the `.append()` method.


In [None]:
my_list = ['i', 8, 'an', 'apple']

added_list_1 = my_list + [7] # if you use 7 without [], you will get an error. Do you know why?

added_list_2 = [7] + my_list

my_list.append(7) # The change happens in-place, meaning that your `my_list` variable changes.

print(my_list, added_list_1)
print(added_list_1 == added_list_2)
print(added_list_1 == my_list)

Next, let's see what happens when we use sort with a list:

In [None]:
my_list = [10, 7, 3, 1, 9, 13]

print(my_list)

my_list.sort() # sorting happens in-place

print(my_list)

print('The largest number is: ', my_list[-1]) # get largest number

As mentioned earlier, many functions in Python return (or output) lists. In the next example, we will use the `.split()` method (or function) of strings to break a sentence down into its words.

In [None]:
full_sentence = 'Wise men speak because they have something to say; fools because they have to say something.'

parts = full_sentence.split(' ') # we are using a space as a delimiter

print(parts)
print(type(parts))

<small>There are [other tricks](https://docs.python.org/3/tutorial/datastructures.html) specific to lists like `.remove()` to eliminate an element. You can learn about these from online sources as you practice your Python programming.</small>
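
As a small illustration of a few of these list methods, the sketch below uses `.remove()`, `.insert()`, and `.pop()` on the same example list used earlier. The variable `last_item` is an illustrative name.

```python
my_list = ['i', 8, 'an', 'apple']

my_list.remove(8)           # removes the first element equal to 8
my_list.insert(1, 'ate')    # inserts 'ate' at index 1
last_item = my_list.pop()   # removes the last element and returns it

print(my_list)
print(last_item)
print(len(my_list))         # len() gives the number of elements
```

Like `.append()` and `.sort()`, these methods change the list in-place.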

### **Tuples**

A **tuple**, like a list, can have elements that are of different types. However, unlike a list, a tuple is **immutable**, which means that its elements cannot change once defined. To create a tuple, we use parentheses.

In [None]:
my_tuple = (10, 'b', 2.3, 'c')

another_tuple = 'hello', [1, 2, 3] # This tuple has two elements. One is a string and one is a list

print(my_tuple)
print(another_tuple)
print(my_tuple[-1], my_tuple[2], my_tuple[0:2])

⛔ **Note:** Trying to change the value of an element in a tuple will result in an error. Run the code below and see what happens.

In [None]:
my_tuple[1] = 'd'

However, if an element of the tuple is mutable, you can change its values!

In [None]:
updatable_tuple = ('hello', [1, 2, 3])

print(type(updatable_tuple))

updatable_tuple[1][0] = 0

print(updatable_tuple)

<font color='red'>Pro Tip:</font> You can assign multiple elements of a tuple to variables with syntax like `a, b, c = (1, 2, 3)`. This also works for lists and other *iterables*. As you will learn later, an iterable is an object that can be "looped over". 

In [None]:
a, b, c = (10, 11, 12)

❓ What will we see if we run `print(c)`?

In [None]:
print(c)
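
The same unpacking syntax supports two other handy patterns: swapping values without a temporary variable and collecting "the rest" of an iterable with a starred name. A short sketch (the names `first` and `rest` are illustrative):

```python
a, b, c = (10, 11, 12)

a, b = b, a                   # swap two values in one line
first, *rest = [1, 2, 3, 4]   # the starred name collects the remaining elements in a list

print(a, b, c)
print(first, rest)
```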

### **Sets**

A set is an **unordered** collection with no duplicate elements. Here, **unordered** means that indexing is not allowed. One common use for a set is to eliminate repeated elements in a list, which is easily done by converting the list to a set using the function `set()`. For example, if you have a list `my_list = [1,2,2,3]` and use `set(my_list)`, then you get a set equal to `{1,2,3}`.



In [None]:
my_set = {'a', 1, 3}

a_list = [1, 4, 6, 9, 9, 3, 6, 1, 3]

another_set = set(a_list)

print(my_set, another_set)

There are also functions to convert sets to lists and tuples. Like we did with the `set()` function, you can use `list()` to convert a set back to a list. You can "chain" these functions to do things like creating lists without repeated elements.
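
The chaining described above can be sketched as follows, reusing `a_list` from the earlier cell. The names `unique_list` and `sorted_unique` are illustrative. Because sets are unordered, the order of `unique_list` is not guaranteed to match the original list; the built-in `sorted()` function is one way to get a predictable order.

```python
a_list = [1, 4, 6, 9, 9, 3, 6, 1, 3]

unique_list = list(set(a_list))       # set() drops duplicates, list() converts back
sorted_unique = sorted(set(a_list))   # sorted() returns a list in ascending order

print(unique_list)
print(sorted_unique)
```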

⛔ **Note:** You cannot index a set. Sets are not as flexible in terms of the data types you can use for elements. You can only use *hashable* elements. Try running the code below.

In [None]:
my_set[1]

In [None]:
a_list = [1, 'a', [1,2,3]]
a_tuple = (1, 'a', [1,2,3])
a_set = {1,'a',(1,2,3)}
another_set = {1, 'a', [1,2,3]} # This line will fail since lists are not hashable

<small>An object is *hashable* if it has a hash value which never changes during its lifetime [(source).](https://docs.python.org/3/glossary.html#term-hashable)</small>
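
One way to check whether an object is hashable is the built-in `hash()` function. The sketch below wraps the failing call in a `try`/`except` block so that the error is caught and printed instead of stopping the cell; the name `message` is illustrative.

```python
print(hash((1, 2, 3)))   # a tuple of numbers is hashable
print(hash('hello'))     # strings are hashable too

try:
    hash([1, 2, 3])      # lists are mutable, so they are not hashable
except TypeError as error:
    message = str(error)
    print('Error:', message)
```

The exact hash values are not important and can differ between sessions; what matters is that the first two calls succeed and the third one raises a `TypeError`.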

### **Dictionaries**

A dictionary is a *mapping type* where hashable values are mapped to arbitrary objects. Don't worry if this sounds complicated. Dictionaries are intuitive once you create and use a few of them. The structure of a dictionary is as follows: `name_of_variable = {key_1 : value_for_key_1, key_2 : value_for_key_2}`. Rather than using integers for indexing a dictionary, we use the keys.

In [None]:
monthly_budget = {'groceries':500, 'car':1000, 'health':200, 'rent':2000} # Here the keys are on the left of the ":"

monthly_budget['car'], monthly_budget['rent'] # Here we index the dictionary with the 'car' and 'rent' keys to get the corresponding values

What if you want to grab all the keys in a dictionary? For example, what if you wanted to print the value stored under each key to inspect a dictionary? In such a case, you would use the `.keys()` method (or function), which returns an object containing all of the keys.  

In [None]:
keys = monthly_budget.keys()
print(keys)
print(type(keys))

The output from the cell above lets us know that the object is a `dict_keys`. This is **not** a list but it is an iterable object. You will learn more about that later. If you want a list containing the keys, you can turn the output from `.keys()` into a list with the `list()` function.

In [None]:
list_of_keys = list(monthly_budget.keys())
print(list_of_keys)
print(type(list_of_keys))
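
Dictionaries are mutable, so you can change existing values and add new keys after creating them. The sketch below reuses the `monthly_budget` example; the `'internet'` entry and the names `has_pets` and `total` are made up for illustration.

```python
monthly_budget = {'groceries': 500, 'car': 1000, 'health': 200, 'rent': 2000}

monthly_budget['car'] = 900        # change the value of an existing key
monthly_budget['internet'] = 60    # assigning to a new key adds it to the dictionary

has_pets = 'pets' in monthly_budget     # check whether a key exists
total = sum(monthly_budget.values())    # .values() gives all of the values

print(monthly_budget)
print(has_pets, total)
```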

⛔ **Note:** Keys have to be *hashable*, so you are not allowed to use things like lists. However, you can use tuples.

In [None]:
my_good_dictionary = {(1,3):1000, 'hello':'world', 7:True} # This is ok since the keys are hashable

In [None]:
my_bad_dictionary = {[1,3]:1000, 'hello':'world', 7:True} # This is NOT ok since a list is not hashable

---

## **`if` statements, `for` loops, and `while` loops**

Having discussed data types and structures, we can now look at ways to modify, manipulate, and parse through them in Python. When you begin to handle large amounts of data, you will often want to apply the same operation to every element of an *iterable* such as a list of numbers: `num_list = [5, 10, 15, 20]`. For example, you may want to compute the square of each integer in a list:

In [None]:
square_list = []
num_list = [5, 10, 15, 20]
square_list.append(num_list[0]**2)
square_list.append(num_list[1]**2)
square_list.append(num_list[2]**2)
square_list.append(num_list[3]**2)
print(square_list)

This example achieves the desired result, but appending each value with a separate function call becomes cumbersome for larger lists. A `for` loop achieves the same result while keeping the code clean:

In [None]:
square_list = []
num_list = [5, 10, 15, 20]
for n in num_list:
    square_list.append(n**2)
print(square_list)

In the following sections, we will discuss how conditional statements and loops let us control when and how often a piece of code runs, and how they lead to concise constructs (e.g. list comprehensions) that make our code both more flexible and more readable.

### **Importance of Block Indentation**

As you learn about the different statements and loops used in Python, keep in mind that indentation is important. For example, the statement
```python
if x > 9:
y = 2*x + 10
```
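is illegal since the indentation of `y = 2*x + 10` is aligned with that of the `if` statement. To correct this code, you need to indent `y = 2*x + 10`. You can indent the second line with a tab or with spaces, as long as you are consistent within a block. My recommendation is to press the Tab key, which notebook editors like Colab typically convert to spaces; PEP 8 recommends four spaces per indentation level. 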
```python
if x > 9:
    y = 2*x + 10
```


### **`if` statements**
`if` statements are an essential tool for programming. They give a way to enforce a condition. 
```python
if x >= 10:  # first condition
    y = x**2   # what you want to do when your first condition is met
elif x >= 7: # second condition
    y = x**3   # what you want to do when your second condition is met
```
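
In the snippet above, `y` is never assigned when `x` is smaller than 7. An `else` branch covers every case that the earlier conditions miss. A minimal sketch, where the starting value of `x` and the fallback value `0` are arbitrary choices for illustration:

```python
x = 5

if x >= 10:      # first condition
    y = x**2
elif x >= 7:     # second condition, checked only if the first one is False
    y = x**3
else:            # runs only when none of the conditions above are met
    y = 0

print(y)
```

Try changing `x` to 8 and then to 12 to see the other branches run.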


Try running the code below. The indentation is purposely wrong. You should get an error as a result.

In [None]:
best_programming_languages = ['python', 'java', 'C']

if best_programming_languages[0] == 'python':
print(f'The best language is...{best_programming_languages[0]}.')

Re-run the code now that the line below the `if` statement was indented.

In [None]:
best_programming_languages = ['python', 'java', 'C']

if best_programming_languages[0] == 'python':
  print(f'The best language is...{best_programming_languages[0]}.') # This was indented 

Keep this rule about indentation in mind as you work with `for` loops, `while` loops, and other types of constructs in Python.

<font color='red'>Pro Tip:</font> You can have single-line `if` statements. For example: `if patient_name == 'John': print('John is here')`

In [None]:
x = 2**16

if x > 1000: print(f'x is {x}') 

<small>If you wonder what the `f` in front of the string means, it stands for **formatted** and it allows you to create a versatile string that can take variables as inputs. This is a great way to print nice strings that are easy to read.</small>
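
Formatted strings can also control how numbers are displayed and can contain small expressions inside the curly braces. A short sketch with made-up values (`patient_name` and `weight_kg` are illustrative):

```python
patient_name = 'John'
weight_kg = 81.456

message = f'{patient_name} weighs {weight_kg:.1f} kg'      # .1f keeps one decimal place
summary = f'Twice the weight is {2 * weight_kg:.0f} kg'    # expressions are allowed inside {}

print(message)
print(summary)
```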


### **`for` loops**

`for` loops allow you to perform iterative tasks. The structure of a for loop (or for statement) is: 
```python
for variable in iterable_object:
    # Do something 
```
So, what is an *iterable* object? We can define an iterable as an object that can hand out its elements one at a time, such as a list, tuple, set, dictionary, or string. The elements do not have to be hashable, so a list of lists is a perfectly valid iterable. In the example below, we will iterate over the sequence of integers generated by the function `range(start,stop[,step])`.

In [None]:
for n in range(5): # range(5) creates an iterable sequence of integers from 0 to 4
  print(n)

Note that we wrote `range(5)` but the number 5 was never printed. This is because Python starts counting from 0 and `range(5)` creates five integers in sequence: (0,1,2,3,4). You can make the loop start from any number you want by specifying your first value in the `range()` function. However, the stop value itself is never included, so the final element in the sequence will be at most the stop value minus 1. 

In [None]:
for n in range(2,10,3): # here we specify the start, the stop (which is excluded), and the size of the step to take
  print(n)
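
A quick way to inspect what `range()` generates, without writing a loop, is to convert it to a list. The sketch below also shows a negative step, which counts down; the variable names are illustrative.

```python
first_five = list(range(5))          # start defaults to 0
with_step = list(range(2, 10, 3))    # start, stop (excluded), step
count_down = list(range(10, 0, -2))  # a negative step counts down

print(first_five)   # [0, 1, 2, 3, 4]
print(with_step)    # [2, 5, 8]
print(count_down)   # [10, 8, 6, 4, 2]
```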

A for loop can access the elements of any iterable. This is a fan favorite feature in Python since *lists* and other iterables can contain exotic objects that you can manipulate inside of the loop with less effort than with other languages. For example, you can pull each element of a list (one-by-one) and do something with each. 

In [None]:
a_list = [1, 3, 'a', ['I am a whole list', 10, 20], {'s', 'e', 't'}, ('This is a tuple',), round] # a one-element tuple needs a trailing comma

for n in a_list:
  print(n)

<font color='red'>Pro Tip:</font> Knowledge of how `for` loops work allows you to use a powerful feature in Python called **comprehension**. This makes it possible to create objects in a single line with a syntax that is intuitive yet powerful.

In [None]:
my_family = ['Maria Pesek', 'Liz Vazquez', "Liam Vazquez", 'Ivan Vazquez', 'Dagmaris Cepero', 'Nicholas Smith', 'Noah Smith']

last_names_list = [full_name.split(' ')[1] for full_name in my_family] 

name_parts_dictionary = {'last_names' : [full_name.split(' ')[1] for full_name in my_family], 
                         'first_names' : [full_name.split(' ')[0] for full_name in my_family]}

In [None]:
print(last_names_list)

In [None]:
print(name_parts_dictionary['first_names'])
print(name_parts_dictionary['last_names'])
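
A comprehension can also include a condition, which keeps only the elements that pass a test. The sketch below compares an explicit `for` loop with the equivalent comprehension, using a shortened version of the `my_family` list; the names `vazquez_first_names` and `vazquez_first_names_short` are illustrative.

```python
my_family = ['Maria Pesek', 'Liz Vazquez', 'Liam Vazquez', 'Ivan Vazquez']

# Version with an explicit loop
vazquez_first_names = []
for full_name in my_family:
    if full_name.split(' ')[1] == 'Vazquez':
        vazquez_first_names.append(full_name.split(' ')[0])

# Same result with a comprehension that includes a condition
vazquez_first_names_short = [full_name.split(' ')[0] for full_name in my_family
                             if full_name.split(' ')[1] == 'Vazquez']

print(vazquez_first_names)
print(vazquez_first_names == vazquez_first_names_short)
```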

**Exercise 2:** Use the `set()` and `list()` functions to create a list without any repeated last names. To find out how many different last names you have, use the `len()` function to find the length of the list of unique last names. Your code should look something like this:

```python
unique_last_names_set = set(INSERT THE CORRECT VARIABLE HERE)
unique_last_names_list = list(unique_last_names_set)
print(f'I found {len(unique_last_names_list)} unique last names')
```

In [None]:
# WRITE YOUR CODE IN THIS CELL!

### **`while` loops**

A `while` loop, like a `for` loop, is good for performing iterative tasks. `while` loops are formatted as: 
```python
while condition:
    # Do something useful if condition is true
```
A `while` loop runs for as long as the **condition** it checks is true. For example, if `x = 10`, the statement `while x > 9:` would keep executing the code inside the loop until something inside the loop makes `x` less than or equal to 9. This means that you can create **infinite loops** when the condition is always true. For example, `while True:` will run forever 😦 unless the loop contains a `break` statement. Don't worry! You can always stop your code while it is running. 

In Python, you can use `1` and `0` as substitutes for `True` and `False`. 

In [None]:
print(f'1. Is True == 1? {True == 1}\n',
      f'2. Is False == 0? {False == 0}')
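
Most `while` loops are written so that the condition eventually becomes false. The sketch below counts down from 3 and stops on its own; the names `countdown` and `steps` are illustrative.

```python
countdown = 3
steps = []

while countdown > 0:         # the condition is checked before every pass
    steps.append(countdown)
    countdown -= 1           # without this line the condition would stay true forever

print(steps)
print(countdown)
```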

Time to have fun with a `while` loop. Remember, it can go wild if we are not careful!


In [None]:
inputs = [] # create an empty list
cues = ['name for a person', 'name for a fake country', 'color',  'food item'] # list of strings
i = 0 # integer we will use as an index

while True:
  
  inputs.append(input(f"Please enter a {cues[i]}: "))

  i+=1 # increases the value of i by one: equivalent to i = i + 1
  if len(inputs)>= len(cues): break

print(f'\n {inputs[0].capitalize()} is from {inputs[1].capitalize()} where people like to eat {inputs[2]} {inputs[3]}.')

---
## **Python libraries (or modules)**


Like other languages, Python gets a lot of versatility from libraries created by a large community of skilled users. The effectiveness of Python libraries, especially those for machine learning, is likely the main reason for the popularity of Python. 

Python libraries are usually designed to be easy to install and use. You can often find a library to do what you need from a quick google search. For example, if you want to get started with data science, you can search for "best libraries for data science" and find sites like [this one](https://www.simplilearn.com/top-python-libraries-for-data-science-article). Once you identify a library that does what you need, the next step may involve reading the documentation that explains how to use it or going over relevant examples provided by the developers. Understanding documentation and examples generally requires good knowledge of the fundamentals of Python programming, which is where tutorials like this come in.


Two popular libraries are `numpy` (numerical Python) and `matplotlib` (a plotting library). Before using a library, you need to **import** it into your code. This is done using the syntax `import library_name as nickname`. You can also import specific functions using the syntax `from library import function_name`. In the script below, we will import `matplotlib` and `numpy` with the nicknames `mp` and `np`. 
```python
import matplotlib as mp # This says to import a library and assign the content to a variable called mp
import numpy as np 
```

Assigning a different name to the libraries (i.e. `mp` and `np`) is not required, but it reduces the amount of typing you will need to do. For example, you could write:
```python
# Long version
import numpy
X, Y = numpy.meshgrid(numpy.linspace(-3, 3, 256), numpy.linspace(-3, 3, 256))
Z = (1 - X/2 + X**5 + Y**3) * numpy.exp(-X**2 - Y**2)
```
or, equivalently, you could use `np` as a "nickname" for `numpy`:
```python
# short version
import numpy as np # import the library as np
X, Y = np.meshgrid(np.linspace(-3, 3, 256), np.linspace(-3, 3, 256))
Z = (1 - X/2 + X**5 + Y**3) * np.exp(-X**2 - Y**2)
```
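
The `from library import function_name` form mentioned earlier works the same way. The sketch below uses the built-in `math` library (chosen here only because it needs no installation) to compare the two import styles; the nickname `m` is an arbitrary choice.

```python
import math as m             # import a whole library with a nickname
from math import sqrt, pi    # import only specific names

print(m.sqrt(16))   # access a function through the nickname
print(sqrt(16))     # an imported function can be used directly
print(pi)
```

Importing specific names saves typing, while the nickname style makes it obvious which library a function comes from.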

The cell below shows an example that uses the `matplotlib` and `numpy` libraries to create a stream plot with little effort.

In [None]:
import matplotlib.pyplot as plt # we are importing the pyplot module in matplotlib
import numpy as np

# make a stream function:
X, Y = np.meshgrid(np.linspace(-3, 3, 256), np.linspace(-3, 3, 256))
Z = (1 - X/2 + X**5 + Y**3) * np.exp(-X**2 - Y**2)

# make U and V out of the streamfunction using numerical differentiation:
V, U = np.diff(Z[1:, :], axis=1), -np.diff(Z[:, 1:], axis=0)

# plot:
plt.subplot(1,2,1)
plt.pcolor(Z, cmap='jet'); plt.axis('off')
plt.title('Z')
plt.subplot(1,2,2)
plt.streamplot(X[1:, 1:], Y[1:, 1:], U, V)
plt.title('Stream Plot');

<font color='red'>Pro Tip:</font> You should use the conventional "nicknames" for some of the popular libraries used in Python. For example, use `plt` when you import the `pyplot` module in `matplotlib`. In other words, type `import matplotlib.pyplot as plt`. For `numpy`, you should use `np`. The `pandas` library is usually imported as `pd` and the `seaborn` library is imported as `sns`. 

---
### **Installing a new library (or module)**

Let us assume that we want to read medical images stored as DICOM files using the `pydicom` library. As we just learned, the first step is to `import` the library: 

In [None]:
import pydicom

As you can see, `pydicom` is not installed in our **environment**, and this resulted in an error. You can install libraries using **package managers** like [`pip`](https://packaging.python.org/en/latest/tutorials/installing-packages/) (package installer for Python) and [`conda`](https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/). There are many tutorials on how to use these package managers. In Colab or Jupyter, you can install packages with `pip` using `!pip install package_name`

In [None]:
!pip install pydicom

Now that we installed `pydicom` in our Python environment, we should be able to import the library and use it.

In [None]:
import pydicom # no nickname here since `pd` is conventionally reserved for pandas

In [None]:
from pydicom.data import get_testdata_file

import matplotlib.pyplot as plt # import the pyplot module

fpath = get_testdata_file("CT_small.dcm") # find the location of the test data

ds = pydicom.dcmread(fpath) # read the DICOM file and store content in variable `ds`

plt.imshow(ds.pixel_array, cmap='gray'); # create a quick plot

---

### **NumPy**

The [`numpy`](https://numpy.org/) library is widely used in scientific computing in part because "NumPy brings the computational power of languages like C and Fortran to Python, a language much easier to learn and use" [(source)](https://numpy.org/). If your work requires processing large datasets, then you will likely benefit from using NumPy. Also, many existing libraries use NumPy to handle intensive calculations. As we saw above, to start using NumPy, you need to first import it. It is quite common to import NumPy using the variable `np`:
```python
import numpy as np
```
Once imported, you can use the functions (or methods) in NumPy with the syntax `np.function_name`. The script below shows some examples of `numpy` functions:



```python
a = np.array([1, 2, 3]) # convert a list of numbers to a numpy array

A = np.random.random((100,100)) # create array of random numbers

B = np.dot(A,A.T) # matrix multiplication of A and its transpose 

indices = np.where(A > 0.1) # find indices of array where values are greater than 0.1
```

#### **Arrays**

NumPy works with data stored as *numpy arrays*, which means that you need to transform your data into this format to use the powerful functions in NumPy. Luckily, there are many ways to create a *numpy array*, including [functions](https://numpy.org/doc/stable/reference/routines.io.html) that read data from `.txt` files into numpy arrays. In the examples below, we will create arrays and perform some basic operations. First, let's create a 1D and 2D array by inputting values by hand.

In [None]:
my_1d_array = np.array([1,2,3]) # you can convert a list into a 1D array

my_2d_array = np.array([[1,2,3],[4,5,6]]) # a 2D array is a list of lists [[],[]]

In [None]:
print(f'Shape of 1D array: {my_1d_array.shape}\n Shape of 2D array: {my_2d_array.shape}')
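
Arrays can be indexed and sliced much like lists, with one index per dimension separated by commas. A minimal sketch that reuses `my_2d_array` from the cell above (it assumes `numpy` is installed, as it is in Colab):

```python
import numpy as np

my_2d_array = np.array([[1, 2, 3], [4, 5, 6]])

print(my_2d_array[0, 1])   # row 0, column 1
print(my_2d_array[:, 0])   # every row, column 0
print(my_2d_array[1, :])   # row 1, every column
print(my_2d_array * 10)    # arithmetic is applied to every element
```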

The cell below shows some ways you can use NumPy to operate on arrays. Although these are basic operations, the syntax for more sophisticated ones will be similar.

In [None]:
import numpy as np 

# Create two arrays
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

print(f'Add two arrays:\n {np.add(x, y)}\n')

print(f'Subtract two arrays:\n {np.subtract(x, y)}\n')

print(f'Elementwise product:\n {np.multiply(x, y)}\n')

print(f'Elementwise division:\n {np.divide(x,y)}\n')

print(f'Elementwise square root:\n {np.sqrt(x)}\n')

print(f'Matrix multiplication:\n {np.dot(x,y)}')

---
### Pandas and Seaborn

[`pandas`](https://pandas.pydata.org/) is a popular tool for "data analysis and manipulation". In this section, we will use `pandas` to read a CSV file, inspect the content, and generate some figures. Although the `pandas` library has some built-in plotting functions, we will use the `seaborn` library for our figures. [`seaborn`](https://seaborn.pydata.org/index.html) is described as a "Python data visualization library based on matplotlib". To start, let's import the two libraries.

In [None]:
import pandas as pd 
import seaborn as sns

We can now read a CSV file by first creating a variable that describes its location (as a string) and then using the `read_csv()` function in `pandas`. The result of using `pd.read_csv(directory_of_data_file)` is a `DataFrame`, which is an object specific to the `pandas` library. In one of our Pro Tips, we talked about PEP 8 naming conventions. By this convention, `DataFrame` would be a class, which means that it will have methods (or functions) associated with it. This is indeed the case. Once you create a `DataFrame`, you can use several functions designed to clean data, compute quick statistics, find specific values, etc. 

In [None]:
directory_of_data_file = '/content/sample_data/california_housing_train.csv'

california_housing = pd.read_csv(directory_of_data_file)

print(type(california_housing))

Notebook environments, like Colab, are designed for fast prototyping and, thus, give ways to quickly visualize data. We can visualize the content of the CSV file by just typing the name of the variable we used when reading it.

In [None]:
california_housing

Next, we will generate a quick histogram plot of the median house value stored in the data using the Seaborn library. 

In [None]:
sns.histplot(data=california_housing, x="median_house_value", kde=True);

---

## Working with files 

### The `os` library
A useful library for wrangling files from your Python code is the `os` library, which contains tools for interacting with the operating system of your machine. The `os` library has functions for creating directories `.mkdir()`, deleting files `.remove()`, listing the files in a directory `.listdir()`, and [much more](https://docs.python.org/3/library/os.html). 

### File paths
When working with files it is important to remember that Windows and Linux/Mac use different *path separators*. For example, Windows uses `\` while Linux and Mac use `/`. Luckily, one of the functions in the `os` module can account for this.

In [None]:
import os

os.makedirs("tmp", exist_ok=True) # make a directory; nothing happens if it already exists

testfile_read = os.path.join("tmp", "testfile_read.txt") # create a file path
testfile_write = os.path.join("tmp", "testfile_write.txt")
testfile_append = os.path.join("tmp", "testfile_append.txt")

print(testfile_read)

Notice that the file path above shows the folder and file name separated by a `/` (Colab runs on Linux). The `os` module chose this separator in the background.
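
The `os.path` module also offers functions for taking a path apart again. The sketch below only manipulates strings and does not create anything on disk; the path `tmp/results/testfile.txt` is a hypothetical example.

```python
import os

file_path = os.path.join('tmp', 'results', 'testfile.txt')  # hypothetical path, nothing is created

print(file_path)
print(os.sep)                        # the separator used by your operating system
print(os.path.basename(file_path))   # the file name
print(os.path.dirname(file_path))    # the folder part of the path
print(os.path.splitext(file_path))   # the path split into (root, extension)
```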

### I/O modes

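A file can be opened in *different modes*, and the mode is passed as the second argument to `open()` (if you leave it out, Python uses `r`): 
*   `w` for *write*
*   `a` for *append*
*   `r` for *read*
*   `b` for  *binary mode* (e.g. images), combined with one of the modes above as in `rb` or `wb`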

#### Read

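After opening a file, you can use its read/write functions. Reading only works if the file already exists (otherwise `open()` raises a `FileNotFoundError`), so the cell below first creates a small file and then reads it.

In [None]:
with open(testfile_read, "w") as f: # create the file so that there is something to read
  f.write("Hello World!")

with open(testfile_read, "r") as f:
  content = f.read()
  
print(content)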

<font color='red'>Pro Tip:</font>  Always use the `with open(...) as nickname:` syntax when reading or writing files with Python. It ensures that the file is closed even if there is an error during file manipulation.

*Good practice*
```python
with open(os.path.join('folder','file_name.ext'),'w') as f: 
  f.write('Hello World!')
```
*Bad practice*
```python
f = open(os.path.join('folder','file_name.ext'),'w')
f.write('Hello World!') 
f.close() # you need to remember to close your file
```


#### Write

Write mode `w` will always create a new file. 

⛔ **Note:** If a file with the same path already exists, the content will be overwritten!

In [None]:
with open(testfile_write, "w") as f:
  f.write("Hello World!")

#### Append

Append mode `a` creates a new file if one with the specified file path does not exist. If a file with the same path exists, then it *appends* to the existing file. In other words, it can add data to a file without overwriting the existing content.

In [None]:
with open(testfile_append, "a") as f:
  f.write("Hello World!")
  f.write("\n") # in text mode, "\n" is the line terminator to use on every platform
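
To see the difference between the write and append modes side by side, the sketch below writes to the same file three times and then reads it back. It uses the built-in `tempfile` library to create a temporary folder that is deleted automatically, so it does not touch the `tmp` folder used above; the file name `demo.txt` is made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:   # temporary folder, deleted automatically
    demo_file = os.path.join(folder, 'demo.txt')

    with open(demo_file, 'w') as f:
        f.write('first\n')
    with open(demo_file, 'w') as f:   # 'w' again: the earlier content is replaced
        f.write('second\n')
    with open(demo_file, 'a') as f:   # 'a': the new text is added at the end
        f.write('third\n')

    with open(demo_file, 'r') as f:
        content = f.read()

print(content)
```

Only `second` and `third` are printed because the second `w` call replaced the line written by the first one.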

### Configuration files

Configuration (or config) files are a great way to keep certain kinds of data out of the code. Avoid hard coding URLs, usernames, passwords, data directories, etc. 

Perhaps the most popular and convenient *config* file types are YAML or JSON files.

In [None]:
!pip install pyyaml

#### As YAML file

```
data:
  url: 'https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv'
```


In [None]:
import yaml

with open("config/config.yaml", "r") as f:
    try:
        config = yaml.safe_load(f)
    except yaml.YAMLError as exc:
        print(exc)

print(config['data']['url'])

#### **As *JSON***

```
{
  "data": {
    "url": "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
  }
}
```

In [None]:
import json

with open("config/config.json", "r") as f:
  config = json.load(f)
  
print(config['data']['url'])

This config file contains the URL of the dataset file. If the URL changes, we do not have to modify any code, only the config file.
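
A minimal sketch of this idea, written so that it can run on its own: it first writes a small JSON config to a temporary folder and then reads it back. The URL and the `timeout_seconds` key are placeholders invented for the example.

```python
import json
import os
import tempfile

settings = {"data": {"url": "https://example.com/dataset_v1.csv"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")

    # write the config file (normally you would edit it by hand)
    with open(config_path, "w") as f:
        json.dump(settings, f, indent=2)

    # the code only ever refers to the key, never to the URL itself
    with open(config_path, "r") as f:
        config = json.load(f)

url = config["data"]["url"]
# .get() returns a fallback when an optional key is missing
timeout = config["data"].get("timeout_seconds", 30)
print(url, timeout)
```

Pointing the code at a new dataset then only means editing the `url` value in the file.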

#### Using a `config` file to download data from the internet

Let's look at an example of downloading a file and saving it to disk. We use a config file to store the URL of the dataset we want to download. Then we use the `requests` package to download the content. To write the content to disk we use `open()` with the mode `'wb'` (binary write mode), because `r.content` holds raw bytes.

In [None]:
import requests
import os

r = requests.get(config['data']['url'])
r.raise_for_status() # stop here if the download failed (e.g. a 404 error)

with open(os.path.join('tmp', 'iris.csv'), 'wb') as f:
  f.write(r.content)

In [None]:
import pandas as pd
df = pd.read_csv(os.path.join('tmp', 'iris.csv'))
df

### **HDF5 files**

Hierarchical Data Format version 5 (HDF5) files are useful for working with large data. A key benefit is that the data stays on the hard drive and you can slice (or index) just the section you need, which loads only that section into RAM. This is especially helpful when your RAM is insufficient to load a large file but your hard drive (or SSD) is large enough. Python makes working with HDF5 files (or H5 for short) easy with the help of the [`h5py` module](https://docs.h5py.org).
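
The partial loading described above can be sketched as follows. This is a minimal illustration that assumes `h5py` and `numpy` are installed; the dataset name `big_array` and the array contents are made up, and the file lives in a temporary folder.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.hdf5")

    # store a 1000 x 1000 array on disk
    with h5py.File(demo_path, "w") as f:
        f.create_dataset("big_array", data=np.arange(1_000_000).reshape(1000, 1000))

    with h5py.File(demo_path, "r") as f:
        dset = f["big_array"]  # a handle to the data on disk, not a NumPy array
        print(dset.shape)      # the shape is known without reading the values
        block = dset[:10, :5]  # only the requested 10 x 5 block is returned as a NumPy array

print(type(block).__name__, block.shape)
```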


In [None]:
import h5py
import os
import numpy as np
from scipy import datasets # the sample data moved from scipy.misc to scipy.datasets in SciPy 1.10 (requires the `pooch` package)

# prepare the file name
h5_file_name = os.path.join("tmp", "test_file.hdf5")

# fake data for this exercise
suspect_data = {'head_shot':datasets.face(), 'crime_scene':datasets.ascent(), 
                'lie_detector_results':datasets.electrocardiogram()}

# create the H5 file with the data you want to store
with h5py.File(h5_file_name, "w") as f:
  for name, data in suspect_data.items():
    f.create_dataset(name=f"suspects/{name}", data=data)

The "H" in HDF5 stands for hierarchical, and it refers to the ability to structure an HDF5 file as a hierarchy of groups, subgroups, datasets, and attributes. In the example above we created a file that has a group called "suspects". There are several datasets inside this group. To view the members of a file or group we can use the `.keys()` method.

In [None]:
hf = h5py.File(h5_file_name,'r')
print('Groups:', list(hf.keys()))
print('Subgroups:', list(hf['suspects'].keys()))
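
The hierarchy can also be built explicitly. The sketch below is a minimal illustration, assuming `h5py` and `numpy` are installed; the group, dataset, and attribute names are made up, and the file is written to a temporary folder rather than to `tmp`.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.hdf5")

    with h5py.File(demo_path, "w") as f:
        group = f.create_group("experiment")    # a group works like a folder
        subgroup = group.create_group("day_1")  # groups can be nested
        dset = subgroup.create_dataset("readings", data=np.zeros(5))
        dset.attrs["units"] = "mV"              # attributes hold small pieces of metadata

    with h5py.File(demo_path, "r") as f:
        names = []
        f.visit(names.append)                   # collect the name of every group and dataset
        units = f["experiment/day_1/readings"].attrs["units"]

print(names)
print(units)
```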

Data can be read from an H5 file using `[:]`, which loads a whole dataset into memory. Let's grab the head shot of the criminal in our data bank and display it. We will also grab his lie detector results and a picture of the scene where he terrorizes people. 

In [None]:
import matplotlib.pyplot as plt

head_shot = hf['suspects/head_shot'][:]
lie_detector_results = hf['suspects/lie_detector_results'][:]
crime_scene = hf['suspects/crime_scene'][:]

hf.close() # Since with...as... was not used, you need to close the file manually...like a cave-man/woman

# Generate some figures showing the bad guy
plt.figure(figsize=(12,3))
plt.subplot(1,3,1)
plt.title('The Bad Guy')
plt.imshow(head_shot); plt.axis('off')
plt.subplot(1,3,2)
plt.title('Lie Detector Results')
plt.plot(lie_detector_results)
plt.subplot(1,3,3)
plt.title('Scene of the Crime')
plt.imshow(crime_scene, cmap='gray'); plt.axis('off');

### **Progress bars**

Progress bars are useful for long-running tasks such as processing many images. To use progress bars in a notebook environment like **Jupyter** or **Colab**, we can use the [`tqdm` package](https://github.com/tqdm/tqdm).

In [None]:
!pip install tqdm

In [None]:
from time import sleep
from tqdm import tqdm

for i in tqdm(range(100)):
  sleep(0.05)

We can use the same approach to iterate through a list and display the element that is currently being processed.

In [None]:
list_to_process = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l",
                   "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", 
                   "x", "y", "z"]

pbar = tqdm(list_to_process)
for char in pbar:
    pbar.set_description(f"Processing {char}") # label the bar before the work starts
    sleep(0.25)
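
When the steps of a task are not equally large, the bar can also be advanced manually. The sketch below assumes `tqdm` is installed; the file sizes are made up and the "work" is just a placeholder addition.

```python
from tqdm import tqdm

file_sizes = [120, 45, 300, 35]  # made-up sizes, e.g. in megabytes

processed = 0
# total tells tqdm how much work there is; update() advances the bar manually
with tqdm(total=sum(file_sizes), unit="MB") as pbar:
    for size in file_sizes:
        processed += size  # placeholder for the real work
        pbar.update(size)

print(processed)
```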