# Virtual Environments

Many data scientists aren't particularly careful about knowing *where* their packages live on their machine.  They usually just install their packages at the system level and then get on with their data analysis.

In a high-stakes context, you will want to make sure that your work is completely reproducible.  Ensuring reproducibility will also be important when you start collaborating with others.

*Virtual environments* are a way to ensure reproducibility of your work; they are a specific example of the general concept of a *project-specific library*.

## Where Do Packages Live on Your Machine?

Where does Python look for packages to use in your projects?  You can see the list of directories that it searches, in order, using the following code.

```python
import sys
sys.path
```

```
['/home/pritam/Desktop/p4dsf/chapters/42_virtual_environments',
 '/usr/lib/python310.zip',
 '/usr/lib/python3.10',
 '/usr/lib/python3.10/lib-dynload',
 '',
 '/home/pritam/.local/lib/python3.10/site-packages',
 '/usr/local/lib/python3.10/dist-packages',
 '/usr/lib/python3/dist-packages']
```
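To see which entry on the search path a particular import resolves to, you can ask Python directly. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `locate` is made up for illustration.

```python
import importlib.util

def locate(module_name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin

# A standard library module resolves to a file under the stdlib directory.
print(locate("json"))

# A module that is not on sys.path is reported as None.
print(locate("no_such_module_xyz"))
```

Running `locate("pandas")` in the same way shows which directory a third party package is being picked up from.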

## Dist-packages and Site-packages

The search path above contains three kinds of directories.  `/usr/lib/python3.10` holds the *standard library* that comes out-of-the-box with every Python distribution.  Directories named *site-packages* hold the third party packages that you install using `pip`.  Directories named *dist-packages* are a Debian and Ubuntu convention: they play the same role as site-packages for the system Python, for example for packages installed through the system package manager.

From what I can tell, the packages that I have installed with `pip` *mostly* live in `/home/pritam/.local/lib/python3.10/site-packages`.

Let's hop into the terminal and take a look.

```shell
$$ ls /usr/lib/python3.10

$$ ls /home/pritam/.local/lib/python3.10/site-packages
```

(Of course, the specific file paths will be different on your machine.)
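If you would rather not guess the paths, Python can report them for the interpreter you are running. This is a small sketch using the standard library modules `sysconfig` and `site`; the function name `package_locations` is an illustrative choice.

```python
import site
import sysconfig

def package_locations():
    """Collect the main install locations for the running interpreter."""
    paths = sysconfig.get_paths()
    return {
        "stdlib": paths["stdlib"],               # the standard library
        "purelib": paths["purelib"],             # default target for third party packages
        "user_site": site.getusersitepackages(), # per-user target (pip install --user)
    }

for name, path in package_locations().items():
    print(f"{name:10s} {path}")
```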

## The Need for Virtual Environments

Consider the following scenario where you have two projects: ProjectA and ProjectB, both of which have a dependency on the same library, ProjectC. The problem becomes apparent when we start requiring different versions of ProjectC. Maybe ProjectA needs v1.0.0, while ProjectB requires the newer v2.0.0, for example.

This is a real problem for Python since it can’t differentiate between versions in the site-packages directory.  Both v1.0.0 and v2.0.0 would have to reside in the same directory under the same name, so only one of them can be installed at a time.

A virtual environment solves this problem by giving each project its own site-packages directory.
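The name collision can be illustrated with a toy model. The sketch below is not how `pip` works internally; it simply treats a temporary directory as a stand-in for site-packages and "installs" two versions of a made-up module called `projectc_demo` by writing a file with the same name twice.

```python
import importlib
import sys
import tempfile
from pathlib import Path

def fake_install(site_dir, name, version):
    """Toy stand-in for an installer: write a one-line module into site_dir."""
    (site_dir / f"{name}.py").write_text(f'__version__ = "{version}"\n')

with tempfile.TemporaryDirectory() as tmp:
    site_dir = Path(tmp)
    sys.path.insert(0, str(site_dir))

    # "Install" v1.0.0 for ProjectA, then v2.0.0 for ProjectB.
    fake_install(site_dir, "projectc_demo", "1.0.0")
    fake_install(site_dir, "projectc_demo", "2.0.0")  # overwrites the same file

    importlib.invalidate_caches()
    module = importlib.import_module("projectc_demo")
    installed_version = module.__version__

    sys.path.remove(str(site_dir))

# Only the most recently installed version is left for both projects to import.
print(installed_version)  # prints 2.0.0
```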

## What is a Virtual Environment?

A virtual environment is simply a directory with three important components:

- A `site-packages/` folder where third party libraries are installed.
- Symlinks to Python executables installed on your system.
- Scripts that ensure executed Python code uses the Python interpreter and site packages installed inside the given virtual environment.

## Setting Up a Test Project

Let's begin by creating a simple project to demonstrate these concepts.

```shell
$$ mkdir test_project

$$ cd test_project
```

Next, let's create a simple script inside our project and call it `script.py`:

```python
import pandas as pd
import yfinance as yf
yf.pdr_override()
from pandas_datareader import data as pdr

df = pdr.get_data_yahoo('SPY')
print(df)
```

Let's try running our script at the terminal:

```shell
$$ python3 script.py
```

This runs just fine because Python finds the system-wide installation of `pandas-datareader` on its search path.

We can see a list of all installed packages with the following:

```shell
$$ pip list
```

Notice that `pandas-datareader` is in this list.

## Creating a Virtual Environment

Now let's create a virtual environment in `test_project`.  There are many different ways of creating virtual environments in Python.  We are going to use the **venv** package which is part of the standard library.

The command for creating a virtual environment is simply:

```shell
$$ python3 -m venv .venv
```

Let's take a look at the contents of our project directory with `ls -la`.  As you can see, there is a new directory called `.venv`.  If we look at the content of `.venv` we see a number of subdirectories.

Here is what the important folders contain:

- `bin`: the activation scripts and the links to the Python executable for the virtual environment
- `include`: C headers used when compiling Python packages
- `lib`: a `python3.10/` directory containing the `site-packages` folder where each dependency is installed
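The same layout can be inspected from Python. The following sketch uses the standard library's `venv.create` to build a throwaway environment in a temporary directory and list its top-level entries. It passes `with_pip=False` to keep the example fast, which is a simplification compared to the `python3 -m venv .venv` command above; the helper name `create_and_list` is illustrative.

```python
import tempfile
import venv
from pathlib import Path

def create_and_list(env_dir):
    """Create a virtual environment at env_dir and return its top-level entries."""
    venv.create(env_dir, with_pip=False)  # skip installing pip to keep this quick
    return sorted(p.name for p in env_dir.iterdir())

with tempfile.TemporaryDirectory() as tmp:
    contents = create_and_list(Path(tmp) / ".venv")

# The exact entries depend on the platform (for example, Windows uses Scripts instead of bin).
print(contents)
```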

Now that we have created our virtual environment inside our `test_project`, let's try rerunning our script.

```shell
$$ python3 script.py
```

As you can see the script still runs fine.

## Activating the Virtual Environment

Even though we have created our virtual environment, we still have not activated it.  Let's do so now.

```shell
$$ source .venv/bin/activate

(.venv) $$
```

Notice the `(.venv)` prefix at the terminal prompt.  This indicates that the virtual environment has been activated.
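The prompt is only a visual cue. You can also check from inside Python: for environments created with `venv`, `sys.prefix` points at the environment while `sys.base_prefix` points at the Python installation it was created from, so the two differ. The function name `in_virtual_environment` below is just an illustrative helper.

```python
import sys

def in_virtual_environment():
    """Return True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix

print("interpreter:", sys.executable)
print("prefix:     ", sys.prefix)
print("base prefix:", sys.base_prefix)
print("in venv:    ", in_virtual_environment())
```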

Let's try running our script again:

```shell
$$ python3 script.py
```

This time it fails.  To see why, let's look at the search path in the Python console.

```python
>>> import sys
>>> import pprint
>>> pprint.pprint(sys.path)
```

As you can see, `sys.path` is much shorter now.  All the site packages associated with the project now live in `/home/pritam/Desktop/test_project/.venv/lib/python3.10/site-packages`.

In the terminal, let's look at the contents of this directory with `ls -la`.  As you can see, the contents of this directory are minimal and, in particular, `pandas-datareader` is not present.

Another way to verify this is to run:

```shell
$$ pip list
```
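A rough Python equivalent of `pip list` can be sketched with the standard library's `importlib.metadata`, which reads the metadata of every distribution visible to the running interpreter. The helper name `installed_packages` is illustrative, and the result depends entirely on which environment you run it in.

```python
from importlib import metadata

def installed_packages():
    """Map each installed distribution name to its version for this interpreter."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            packages[name] = dist.version
    return packages

for name, version in sorted(installed_packages().items()):
    print(f"{name}=={version}")
```

Run inside the freshly activated virtual environment, this prints only a handful of entries; run with the system interpreter, it prints the long system-wide list.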

## Installing Packages into a Virtual Environment

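Let's now use `pip` to install `pandas-datareader` into our virtual environment.  Our script also imports `yfinance`, which is not a dependency of `pandas-datareader`, so we install it at the same time.

```shell
$$ pip install pandas-datareader yfinance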

$$ ls /home/pritam/Desktop/test_project/.venv/lib/python3.10/site-packages
```

Now we can see that `/home/pritam/Desktop/test_project/.venv/lib/python3.10/site-packages` contains `pandas-datareader` and all its dependencies.  We also have a longer `pip list`.

```shell
$$ pip list
```

If we try to rerun our script, it now works.

```shell
$$ python3 script.py
```

## Creating a `requirements.txt`

In order to make our project reproducible we will need to make a `requirements.txt` which will detail all the packages in our virtual environment.  This is done as follows:

```shell
$$ pip freeze > requirements.txt
```

We now have a text file called `requirements.txt` which details all the packages in our project as well as their version numbers.  Let's check the contents of this text file:

```shell
$$ cat requirements.txt
```
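Each line that `pip freeze` writes has the form `name==version`. As a sketch of how little structure is involved, the following parser turns such text into a dictionary. The function name `parse_pinned` and the sample pins are hypothetical and only for illustration; the sketch ignores anything that is not a simple `==` pin.

```python
def parse_pinned(text):
    """Parse 'name==version' lines into a dict, ignoring comments and blank lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins

# Hypothetical pins, for illustration only.
sample = """\
# example requirements
examplepkg==1.2.3
another-pkg==0.4.0
"""

print(parse_pinned(sample))
# prints {'examplepkg': '1.2.3', 'another-pkg': '0.4.0'}
```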

## Deactivating a Virtual Environment

Deactivating a virtual environment is straightforward.

```shell
(.venv) $$ deactivate

$$
```

We can verify that our virtual environment is deactivated by running:

```shell
$$ pip list
```

Notice the much longer list of system-wide site package installations.

## Reproducing a Development Environment with `requirements.txt`

Let's delete the `.venv/` virtual environment in our `test_project` and pretend that we cloned this project from GitHub.  (Conventional wisdom is that it is best to put virtual environments in your `.gitignore`.)

```shell
$$ rm -rf .venv
```

Because we have the `requirements.txt` we can reproduce the set of site package installations that we had before.

First, let's create a new virtual environment and activate it:

```shell
$$ python3 -m venv .venv/

$$ source .venv/bin/activate

(.venv) $$
```

We can see that we once again have a minimal number of site-packages in our virtual environment.

```shell
$$ pip list
```

Also, our `script.py` once again doesn't run because `pandas-datareader` is not in the virtual environment's `site-packages` directory.

```shell
$$ python3 script.py
```

All we have to do to reproduce the development environment is run the following code:

```shell
$$ pip install -r requirements.txt
```

Now we can check `pip list` and also verify that our code runs.

```shell
$$ pip list

$$ python3 script.py
```
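Beyond eyeballing `pip list`, you can check programmatically that the environment matches a set of pins. This is a minimal sketch built on `importlib.metadata.version`; the helper name `check_pins` and the package name in the usage line are made up for illustration, and in practice the pins would come from `requirements.txt`.

```python
from importlib import metadata

def check_pins(pins):
    """Return a dict describing every pinned package that is missing or at another version."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"installed {found}, expected {wanted}"
    return problems

# A deliberately nonexistent package name, to show what a failure looks like.
print(check_pins({"surely-not-a-real-package-xyz": "1.0.0"}))
```

An empty dictionary means every pinned package is installed at the expected version.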

## Virtual Environments in Jupyter Notebooks

Working with virtual environments in Jupyter is a little trickier.

Let's begin by creating a fresh virtual environment in our project and activating it.

```shell
$$ deactivate

$$ rm -rf .venv/

$$ python3 -m venv .venv

$$ source .venv/bin/activate
```

You can verify that our script once again does not run in the project.

Next, let's create a blank Jupyter notebook in our project and call it `notebook.ipynb`.  And let's type the following code in a cell and run it:

```python
import pandas as pd
import yfinance as yf
yf.pdr_override()
from pandas_datareader import data as pdr

df = pdr.get_data_yahoo('SPY')
print(df)
```

Notice that this is the same code that is in `script.py`, yet it runs even though the virtual environment is activated and `script.py` doesn't run.  

The reason for this is that behind the scenes of a Jupyter notebook, the computational engine is an *IPython kernel*.  In order to use a virtual environment in a Jupyter notebook you have to take the extra step of registering it as a kernel.  That is what we will do next with the following commands:

```shell
$$ pip install ipykernel

$$ python3 -m ipykernel install --user --name=.venv

$$ cat /home/pritam/.local/share/jupyter/kernels/.venv/kernel.json
```

Notice that `.venv` is now available in the JupyterLab launcher.
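The `kernel.json` file printed by the last command is what ties the kernel name to a specific interpreter. The sketch below assumes the usual layout, in which the first element of `argv` is the Python executable used to launch the kernel. The sample contents are illustrative rather than copied from a real file, and the helper name `kernel_interpreter` is made up.

```python
import json

def kernel_interpreter(kernel_json_text):
    """Return the interpreter path that a kernel spec launches (first argv entry)."""
    spec = json.loads(kernel_json_text)
    return spec["argv"][0]

# Illustrative kernel spec; your file will contain your own paths.
sample = """
{
  "argv": ["/home/pritam/Desktop/test_project/.venv/bin/python3",
           "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": ".venv",
  "language": "python"
}
"""

print(kernel_interpreter(sample))
```

If the path points inside `.venv/`, notebooks that select this kernel use the virtual environment's packages rather than the system-wide ones.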

Go into `notebook.ipynb` and choose the `.venv` kernel that appears in the drop-down menu.  Now, rerun the code chunk from above and you will find that it fails.  This is because we are now working in the context of `.venv`, which doesn't have `pandas-datareader` in it.

## Removing a Virtual Environment from a Jupyter Notebook

It is straightforward to remove a virtual environment kernel.  To see all the available kernels run the following:

```shell
$$ jupyter kernelspec list
```

Notice that `.venv` is in the list.

To remove the `.venv` kernel simply run the following code:

```shell
$$ jupyter kernelspec uninstall .venv
```

You can rerun `jupyter kernelspec list` to verify that the kernel has been removed.  It is also no longer available in the JupyterLab launcher.