# Section 2. Packages and Files

#### 2025 Spring Semester / Instructor: Pierre Biscaye

The content of this notebook draws on material from UC Berkeley's D-Lab Python Fundamentals [course](https://github.com/dlab-berkeley/Python-Fundamentals).

### Sections
1. Importing modules/packages
2. Reading from and writing to files

# 1. Importing Modules or Libraries

Don't be put off by the word 'module'. Importing a module is like loading a package so that we can use the functions it contains. If you are an R user, it is similar to `library(package_name)`. Even if a package or library is already installed, you still need to import it in each notebook where you want to use it.

## Most of the power of a programming language is in its libraries.

*   A *library* is a collection of functions that can be used by other programs.
    *   May also contain data values (e.g., numerical constants).
    *   Library's contents are supposed to be related, but there's no way to enforce that.
*   Python's [standard library](https://docs.python.org/3/library/) is installed with it.
*   Many additional libraries are available from [PyPI](https://pypi.python.org/pypi) (the Python Package Index).

## A program must import a library in order to use it.

*   Use `import` to load a library into a program's memory.
    
### A common module is numerical python, or `numpy`. We used it last section to create an array.

In [None]:
# Import numpy
import numpy
type(numpy)

If you encounter an `ImportError` (or its more specific form, `ModuleNotFoundError`) when trying to import a package, it means that the package has not been installed in your Python environment or is not accessible from your current environment.

You can resolve this by installing the package using the command `!conda install package_name` directly within a Jupyter cell, replacing `package_name` with the name of the package you need. 

Why the exclamation mark `!`? It lets you run **shell commands** from within a Jupyter Lab notebook. A **shell command** is an instruction you type to tell your computer to do something, like installing a program or managing files. Starting a line in a cell with `!` sends the rest of that line to the shell. Include `-y` so that the installation does not hang while waiting for confirmation.

In [None]:
# Uncomment the following line to install numpy
# !conda install numpy -y
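Before installing anything, it can help to check whether Python can already find the package. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` and the package names passed to it are made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    # find_spec returns None when no package with that name can be found
    return importlib.util.find_spec(package_name) is not None


# `json` ships with Python, so it should always be found
print(is_installed("json"))

# A made-up name that should not exist in any environment
print(is_installed("a_package_that_does_not_exist"))
```

If the check comes back `False` for a package you need, that is the moment to run the install command.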

For many packages, like `numpy`, we use an **alias**, or nickname, when importing them. This is just done to save some typing when we refer to the package in our code. The common alias for numpy is `np`.

In [None]:
# Import numpy as np
# using standard abbreviations can make code more elegant
import numpy as np
print(type(np))
print(type(np.sum))

Then refer to things from the library as `library_name.thing_name`.


Python uses `.` to mean "part of".

In [None]:
# use ? to get help for using a function
?np.sum
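The `?` shortcut only works in Jupyter and IPython. As a small sketch of the plain-Python equivalent, the built-in `help()` function shows the same documentation, which is stored on the function as its docstring. The variable name `doc_text` is just an illustrative choice.

```python
import numpy as np

# help() is built into Python, so it also works outside of Jupyter
help(np.sum)

# The same text is stored on the function itself as its docstring
doc_text = np.sum.__doc__

# Print only the first line, which is a one-sentence summary
print(doc_text.strip().splitlines()[0])
```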

In [None]:
# Sum up 1 and 2 using np.sum()
c=[1,2]
np.sum(c)

In [None]:
# Make an array with np.array. An array is similar to a list.
# The difference: all elements of a NumPy array are required
# to be of the same data type.
ar = np.array([1, 2, 3])
ar
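To see the "same data type" rule in action, here is a small sketch with made-up values: a list can mix integers and floats, but NumPy converts them to one common type when building the array.

```python
import numpy as np

# A list can hold an int next to a float
mixed_list = [1, 2.5, 3]
print([type(x).__name__ for x in mixed_list])

# NumPy picks a single data type that can represent every element
mixed_array = np.array(mixed_list)
print(mixed_array)
print(mixed_array.dtype)   # float64: the integers were converted to floats
```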

In [None]:
np.sum(ar)

In [None]:
# importing a specific function in a module
from numpy import vstack
from numpy import hstack
print(type(vstack))
print(type(hstack))

In [None]:
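# These are all equivalent: the function can be reached through any of
# the imported names, and it accepts either a list or a tuple of rows.
a = [5, 7, 9]
vstack([a, a])
vstack((a, a))
np.vstack([a, a])
np.vstack((a, a))
numpy.vstack([a, a])
numpy.vstack((a, a))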

print(np.vstack([a, a]))
print(vstack([a, a]))
print(numpy.vstack([a, a]))

print(np.vstack((a, a)))
print(vstack((a, a)))
print(numpy.vstack((a, a)))

In [None]:
# Check the dimensions of an object: .shape
# This is an attribute, not a method, so there are no parentheses.
# For a NumPy array, it returns a tuple with the size of each dimension.
ar.shape

In [None]:
# Try hstack using 'a' twice and check the new shape.
# You don't need to make a new object.
print(np.hstack((a,a)))
print(np.hstack((a,a)).shape)

In [None]:
print(np.vstack((a,a)))
print(np.vstack((a,a)).shape)

In [None]:
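# Arithmetic on arrays is element-wise: every entry is multiplied by 2
np.vstack((a,a))*2

In [None]:
# * between two arrays also multiplies entry by entry (this is not a matrix product)
np.vstack((a,a))*np.vstack((a,a))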
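The `*` operator on arrays multiplies matching entries. For a true matrix product, NumPy provides the `@` operator. The sketch below contrasts the two on the same stacked array; the variable names are illustrative only.

```python
import numpy as np

a = [5, 7, 9]
stacked = np.vstack((a, a))             # shape (2, 3)

# Element-wise: each entry is multiplied by the entry in the same position
elementwise = stacked * stacked         # shape stays (2, 3)

# Matrix product: (2, 3) @ (3, 2) gives a (2, 2) result
matrix_product = stacked @ stacked.T    # .T is the transpose

print(elementwise)
print(matrix_product)
```

Note that `stacked @ stacked` would raise an error, because a (2, 3) matrix cannot be multiplied by another (2, 3) matrix.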

Another example: the `math` library.

In [None]:
import math

print('pi is', math.pi)
print('cos(pi) is', math.cos(math.pi))

*   You have to refer to each item with the library's name.
    *   `math.cos(pi)` won't work: the reference to `pi` doesn't somehow "inherit" the function's reference to `math`.
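A quick sketch of what goes wrong: with only `import math` in effect, the bare name `pi` is undefined, so Python raises a `NameError`. The `try`/`except` is there only so the example keeps running after the error.

```python
import math

try:
    math.cos(pi)              # `pi` on its own is not defined; only `math.pi` is
except NameError as err:
    message = str(err)
    print("NameError:", message)

# Prefixing both names with the library works
print(math.cos(math.pi))
```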

## Import specific items from a library to shorten programs.

*   Use `from...import...` to load only specific items from a library.
*   Then refer to them directly without the library name as prefix.


In [None]:
from math import cos, pi

print('cos(pi) is', cos(pi))

## Create an alias for a library when importing it to shorten programs.

*   Use `import...as...` to give a library a short *alias* while importing it. We used `np` above for `numpy`.
*   Then refer to items in the library using that shortened name.

In [None]:
import math as m

print('cos(pi) is', m.cos(m.pi))

*   Commonly used for libraries that are frequently used or have long names.
    *   E.g., `matplotlib` plotting library is often aliased as `mpl`.
*   But can make programs harder to understand,
    since readers must learn your program's aliases.

*****
## Keypoints

- "Most of the power of a programming language is in its libraries."
- "A program must import a library in order to use it."
- "Use `help` to find out more about a library's contents."
- "Import specific items from a library to shorten programs."
- "Create an alias for a library when importing it to shorten programs."


# 2. Files

## Reading from text files

Reading a text file requires three steps:

1. Opening the file
2. Reading the file
3. Closing the file

In [None]:
my_file = open("Data/example.txt", "r")
text = my_file.read()
my_file.close()

print(text)

- It is better to use the `with open` syntax, which automatically closes the file for you.
- The `'r'` indicates that you are reading the file, as opposed to, say, writing to it.

In [None]:
# better code
with open('Data/example.txt', 'r') as my_file:
    text = my_file.read()
    
print(text)

`with` keeps the file open only while the program is inside the indented block. Once the block ends, the file is closed and you can no longer read from it; all that remains is what you saved to a variable (here, `text`).
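The sketch below makes the closing behavior visible through the file object's `closed` attribute. To stay self-contained it writes its own small file in a temporary folder rather than relying on `Data/example.txt`; the file name and contents are made up.

```python
import os
import tempfile

# Create a throwaway file so the example runs anywhere
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w") as out_file:
    out_file.write("first line\nsecond line\n")

with open(path, "r") as my_file:
    text = my_file.read()
    print(my_file.closed)    # False: we are still inside the block

print(my_file.closed)        # True: the file was closed on leaving the block
print(text)                  # the saved string is still available

try:
    my_file.read()           # reading from a closed file is an error
except ValueError as err:
    print("Cannot read:", err)
```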

## Reading a file as a list

- Sometimes we want to read in a file line by line, storing those lines as a list.
- To do that, we can use a loop, as in `for line in my_file`:

In [None]:
stored = []
with open('Data/example.txt', 'r') as my_file:
    for line in my_file:
        stored.append(line)

In [None]:
stored

The loop variable can have any name; it does not have to be `line`. Looping over a file object always yields one line at a time, and each line keeps its trailing line break (`\n`).

- We can use the `strip` [method](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method) to get rid of those line breaks at the end.

In [None]:
stored = []
with open('Data/example.txt', 'r') as my_file:
    for line in my_file:
        stored.append(line.strip())

In [None]:
stored
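As a possible shortcut, the same list can be built without an explicit `append` loop. This sketch compares a list comprehension, which keeps the line breaks, with `read().splitlines()`, which drops them. It writes its own temporary file with made-up contents so that it runs on its own.

```python
import os
import tempfile

# Create a throwaway file so the example runs anywhere
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w") as out_file:
    out_file.write("bears\nbeets\nBattlestar Galactica\n")

# Looping over the file keeps the trailing line break on every line
with open(path, "r") as my_file:
    raw_lines = [line for line in my_file]

# splitlines() splits the full text into lines and drops the line breaks
with open(path, "r") as my_file:
    clean_lines = my_file.read().splitlines()

print(raw_lines)
print(clean_lines)
```

One difference to keep in mind: `strip()` removes all leading and trailing whitespace, while `splitlines()` only removes the line breaks.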

## Writing to a file

We can use the `with open` syntax for writing files (saving them to your computer) as well. Here we use `'w'` to indicate that we are writing to a file. Note that `'w'` creates the file if it does not exist and overwrites it if it does.

In [None]:
# this is okay...
new_file = open("example2.txt", "w")
bees = ['bears', 'beets', 'Battlestar Galactica']
for i in bees:
    new_file.write(i + '\n')
new_file.close()

In [None]:
# but this is better...
bees = ['bears', 'beets', 'Battlestar Galactica']
with open('example2.txt', 'w') as new_file:
    for i in bees:
        new_file.write(i + '\n')

Let's take a look at the file we wrote.
- Recall that a leading exclamation point `!` runs the rest of the line as a shell command.

In [None]:
# on macOS or Linux, use the `cat` command
!cat example2.txt

In [None]:
# on Windows, use the `type` command
!type example2.txt
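Both shell commands depend on the operating system. A platform-independent alternative is to read the file back in Python. The sketch below writes the same three lines and then prints them; it uses a temporary folder (an assumption made so that it runs anywhere) instead of the working directory.

```python
import os
import tempfile

# Write to a throwaway folder so the example does not touch your files
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example2.txt")

bees = ['bears', 'beets', 'Battlestar Galactica']
with open(path, 'w') as new_file:
    for item in bees:
        new_file.write(item + '\n')

# Read the file back to check what was written
with open(path, 'r') as check_file:
    contents = check_file.read()

print(contents)
```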

# Reading/Writing csv files using `pandas`

Reading in a dataset that is stored as a "comma-separated values" (csv) file is easy in Python using the `pandas` package. We will learn more about pandas in the next class, but will preview it for now.

Central to the `pandas` package is the `DataFrame` type, which stores 2-dimensional tabular data in a format similar to Excel spreadsheets.

Let's import `pandas` (using the common abbreviation `pd`) and use its `read_csv()` function to load the data stored in a csv file into a `DataFrame`.

In [None]:
import pandas as pd

caps = pd.read_csv('Data/capitals.csv')

We can look at the first 5 (or any number) rows of data using the `.head()` method of the `DataFrame` object.

In [None]:
caps.head()

To see how many data points and variables exist in the dataframe we can simply use the `.shape` attribute.

In [None]:
caps.shape

Or we can get more detailed information about the number of entries (e.g. observations, data points) and the variables for each entry using the `.info()` method.

In [None]:
caps.info()

It looks like there is a single missing value in the Capital variable (there are 199 non-null objects, not 200). Let's remove the row containing that missing value (or `na`) using the `dropna()` method so that we can save an updated version of the csv file. By default, `dropna()` drops every row that has a missing value in any column.

In [None]:
caps_nomissing = caps.dropna()
caps_nomissing.info()
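To see what `dropna()` does on a table small enough to inspect by eye, here is a sketch with a made-up three-row `DataFrame`. The country and capital values are placeholders, not rows from `capitals.csv`.

```python
import numpy as np
import pandas as pd

# A tiny made-up table with one missing capital
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Capital": ["Alpha", np.nan, "Gamma"],
})

# Count the missing values in each column
print(toy.isna().sum())

# dropna() returns a new DataFrame; the original is left unchanged
toy_nomissing = toy.dropna()

print(toy.shape)             # (3, 2)
print(toy_nomissing.shape)   # (2, 2)
```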

That looks better. Now let's write this updated `DataFrame` out to a csv file. The code below saves a new csv file in the `Data` folder inside your working directory.

In [None]:
caps_nomissing.to_csv('Data/capitals_nomissing.csv')
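One detail worth knowing: by default, `to_csv()` also writes the row index as an extra, unnamed first column. Passing `index=False` leaves it out. The sketch below uses a made-up two-row table and calls `to_csv()` without a file name, in which case pandas returns the csv text as a string, so the two versions can be compared directly.

```python
import pandas as pd

# A tiny made-up table for illustration
toy = pd.DataFrame({"Country": ["A", "B"], "Capital": ["Alpha", "Beta"]})

# With no file name, to_csv() returns the csv text instead of saving it
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Compare the header lines of the two versions
print(with_index.splitlines()[0])      # starts with a comma: the unnamed index column
print(without_index.splitlines()[0])
```

Whether to keep the index is a judgment call; dropping it avoids an extra `Unnamed: 0` column when the file is read back with `read_csv()`.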

## There are also specific packages for reading csv, dta, and other types of data files.

### If in doubt of how to load a specific file, Google it!
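One such package ships with Python itself: the `csv` module in the standard library. The sketch below reads made-up csv text from a string (via `io.StringIO`, which stands in for an open file) and returns each row as a dictionary keyed by the column names.

```python
import csv
import io

# Made-up csv text; in practice this would come from open('some_file.csv')
csv_text = "Country,Capital\nA,Alpha\nB,Beta\n"

# DictReader uses the first row as the column names
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(rows[0]["Capital"])
print(len(rows))
```

For tabular analysis, `pandas` is usually more convenient, but the `csv` module is handy when you only need to loop over rows.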