# 4. Packages and Environments

A software package (or library) is a pre-written collection of code which you can import into your project. This is extremely useful, as it means we don't have to start from scratch every time we write some code!

## 4.1 Built-In Python Packages

The example below imports a module called `math`, which contains a host of mathematical functions to play with:

```python
import math
```
In Code Block 1, import the `math` library and use the `help()` function to see what functions `math` makes available. Use the syntax `math.function(argument)` (e.g. `math.cos(0)`) to play with some of the functions. 
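
As a rough sketch of what this exploration might look like, the snippet below calls a handful of `math` functions and constants. The particular functions chosen here are only illustrative; anything listed by `help(math)` can be called in the same way.

```python
import math

# Constants are accessed with the same dot syntax as functions.
print(math.pi)

# Trigonometric functions take their argument in radians.
print(math.cos(0))        # cosine of 0 radians

# A few other everyday helpers.
print(math.sqrt(16))      # square root
print(math.floor(2.7))    # round down to the nearest integer
print(math.factorial(5))  # 5 * 4 * 3 * 2 * 1
```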

Now that we have the `math` package imported, we can use functions like `math.cos()` or `math.atan()` anywhere in our code. If these were the only functions we were interested in, we could also specify that in our import using the line:

```python
from math import cos, atan
```
These can now be used in our code without needing the `math.` prefix:

```python
x = 90  # note: cos() reads this as 90 radians, not 90 degrees
cos_x = cos(x)
print(cos_x)
```
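
Because the trigonometric functions in `math` work in radians, an angle given in degrees needs converting first. A minimal sketch using `math.radians()` (which is not part of the import shown above):

```python
from math import cos, radians, isclose

angle_deg = 90
angle_rad = radians(angle_deg)  # 90 degrees is pi/2 radians

cos_deg = cos(angle_rad)  # cosine of 90 degrees: zero up to floating-point error
cos_rad = cos(90)         # cosine of 90 radians: a different number entirely

print(cos_deg)
print(cos_rad)
print(isclose(cos_deg, 0.0, abs_tol=1e-12))
```

The first result will not be exactly zero because `pi/2` cannot be represented exactly as a floating-point number, which is why the comparison uses a small tolerance.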

## 4.2 Third Party Packages and Coding Environments

Third party packages are libraries which are not included in the base Python installation. To use them we must first install them on our computer using a package manager like Pip or Conda. 

### 4.2.1 Creating a Conda Environment
It is good practice to keep all the packages you need for a project in the same place, called an *environment*. Keeping our packages in a clean and minimal environment makes it easier for others to re-use our code, as they know exactly what packages to install to make it work. This is particularly important in science, where we want our code to be as reproducible as possible.

To do this, search for *Anaconda Prompt* on your computer and open it. You should get something like this:

![image](conda_prompt_1.PNG)

The first thing to note is the `(base)` before your current working directory (in this case `C:\Users\jws10y>`). This indicates that we are in our `base` environment. We could start installing software into this; however, if we want to keep track of which packages we use on different projects, we need to create a project-specific environment.

Let's create that new environment. We'll name this environment `NWB-hackathon`. We're also going to specify the Python version we will be using for this, in this case `Python==3.11`. Do this with the following line:

![image](conda_prompt_1.PNG)

You will see a load of text generated, then be prompted to confirm the action with the line `Proceed ([y]/n)?`. Input `y`, then hit `Enter`. This environment is now a clean install of Python which we can build a coding environment on. First, though, we need to enter that environment by *activating* it. You need to do this any time you want to use the environment for coding. Enter the following line into your Anaconda Prompt:

![image](conda_prompt_3.PNG)

### 4.2.2 Installing Packages

If this has worked, you will see the `(base)` has now changed to `(NWB-hackathon)`, indicating our environment has changed. We now need to install some packages into this environment. We're going to install a few commonly used data science libraries, but there are thousands of others out there. First, let's install *Jupyter* again. Since we have made a clean environment, the packages installed with Anaconda might not be available in our `NWB-hackathon` environment. We'll be using `pip` to install these packages. Pip will pull all the code for Jupyter from [PyPI](https://pypi.org/), a huge repository of open-source Python code. Install Jupyter with the following command:

![image](conda_prompt_4.PNG)

Use the same approach to install a few more packages:

* `matplotlib` : for creating plots and graphs,

* `numpy` : for handling numerical data and matrices,

* `pandas` : for data wrangling and analysis,

* `scikit-learn` : for data analysis and machine learning.

Use `pip install <package>` in your Anaconda Prompt.
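
Once the installs have finished, one way to confirm they worked is to ask Python which versions it can see. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The names below are the distribution names used with `pip install`.
for name in ["jupyter", "matplotlib", "numpy", "pandas", "scikit-learn"]:
    print(name, installed_version(name))
```

Any package that prints `None` is not visible from the environment the code is running in.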

### 4.2.3 Opening a Jupyter Notebook in your Environment

Once those are installed, save all your notebooks and close your Jupyter window in your browser. We'll open it again running inside our `NWB-hackathon` environment.

Go back to your Anaconda Prompt terminal, and check that the `(NWB-hackathon)` environment is running (it should be the first thing on your current line). To open Jupyter again, all we need to do is type in `jupyter notebook` and hit `Enter`. This should automatically open Jupyter in your browser.
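
To double-check that a notebook really is running inside the new environment, a quick sketch is to print the path of the Python interpreter from a code cell. If the environment is active, the path would be expected to contain the environment name (here `NWB-hackathon`); the exact path depends on where Anaconda is installed on your machine.

```python
import sys

# Path to the Python interpreter that is running this code.
print(sys.executable)

# Root folder of the active environment.
print(sys.prefix)

# Version of Python in use, e.g. to confirm it matches the one requested.
print(sys.version_info.major, sys.version_info.minor)
```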

## 4.3 Using Third Party Packages
Let's take a look at a few of the packages we installed. It is common to import many of these under shortened aliases, especially for packages which see very frequent use such as `numpy`, `pandas` and `matplotlib`. To import these packages, use the code below:

```python
import pandas as pd  # `as pd` means we only need to type "pd" when we want to use pandas
import numpy as np
import matplotlib.pyplot as plt  # we only need the `pyplot` module from matplotlib

import sklearn
```
Import these packages in the code block below, and use `help()` to read a little about what they do.  

### 4.3.1 Reading Tabular Data with `pandas`
The pandas library is used for reading and analysing tabular data. We're going to use it to read in `aminoacids.csv` as a *DataFrame* (or `df`) and view some of the data. We'll use the `DataFrame.head()` method to view the first 5 rows of the dataframe.

```python
amino_acids_df = pd.read_csv('aminoacids.csv')
amino_acids_df.head()
```
💡**Task:** Use the `len()` function on your dataframe to check how many rows it has.

We can get a full list of the columns in our dataframe using the `DataFrame.columns` attribute (it is an attribute rather than a method, so it takes no brackets). Try that on our `amino_acids_df` dataframe in the Code Block below.
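
Rows and columns are easy to mix up. The sketch below builds a tiny hand-made DataFrame (three amino acids with approximate molecular weights, not the workshop dataset) to show what `len()`, `.shape` and `.columns` each return:

```python
import pandas as pd

# A small illustrative table; values are approximate and for demonstration only.
toy_df = pd.DataFrame(
    {
        "Name": ["Glycine", "Alanine", "Serine"],
        "Molecular Weight": [75.07, 89.09, 105.09],
    }
)

print(len(toy_df))           # number of rows
print(toy_df.shape)          # (rows, columns)
print(list(toy_df.columns))  # column names
```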

Our dataframe has 22 columns and 22 rows of data. We can extract a single column (or *Series*) using the following syntax:

```python
amino_acid_masses = amino_acids_df['Molecular Weight']
print(amino_acid_masses)
```
💡 **Task:** Extract a few more columns from our dataframe.

We can use the `numpy` package to get basic statistics for our series. Try the following examples to see what they do:

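```python
# wrap each call in print() so that every result is shown, not just the last one
print(np.amax(amino_acid_masses))
print(np.amin(amino_acid_masses))
print(np.mean(amino_acid_masses))
print(np.quantile(amino_acid_masses, [0.05, 0.25, 0.5, 0.75, 0.95]))
print(np.std(amino_acid_masses))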
```
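
A pandas Series also has its own summary methods, which can be handy alongside the NumPy functions. The sketch below uses a short made-up Series rather than the workshop data. One detail worth knowing: `np.std()` on an array divides by `n` by default, whereas `Series.std()` divides by `n - 1`, so the two can give slightly different answers unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Made-up values purely for illustration.
masses = pd.Series([75.0, 89.0, 105.0, 121.0], name="Molecular Weight")

print(masses.describe())  # count, mean, std, min, quartiles and max in one call

population_std = np.std(masses.to_numpy())  # NumPy default: ddof=0 (divide by n)
sample_std = masses.std()                   # pandas default: ddof=1 (divide by n - 1)

print(population_std, sample_std)
print(np.isclose(population_std, masses.std(ddof=0)))  # same once ddof matches
```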

Pandas has hundreds more features, some of which we'll see in the next session of the workshop. For now, let's take a look at plotting data with `Matplotlib` and doing some basic modelling with `scikit-learn`.

### 4.3.2 Plotting data with `matplotlib`
Matplotlib offers a range of data visualisation tools. Today, we'll look at two simple plot types: histograms and scatter plots.

As a quick tip - the standard plotting style for Matplotlib is a little sparse. We can change the plotting style easily by running the following code:

```python
plt.style.use("ggplot")
```
This will change the style to be more similar to R's *ggplot*. You can see a full range of styles [here](https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html).

Choose a style from the style list and run the code above (with `"ggplot"` replaced with your chosen style).

We'll start by producing a histogram of the different masses of amino acids using the following code:

```python
plt.hist(amino_acids_df['Molecular Weight'])

plt.title('Amino Acid Molar Masses') # adds a title
plt.xlabel('Molecular mass (g/mol)') # adds label for x axis
plt.ylabel('Count') # adds label for y axis
plt.show() # shows the plot
```
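
The same kind of plot can also be built with Matplotlib's object-oriented interface, where labels are set on an `Axes` object with `set_title()`, `set_xlabel()` and `set_ylabel()`. The sketch below uses randomly generated numbers in place of the amino acid data, and the `bins=10` choice is arbitrary:

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-in data (not the amino acid dataset).
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=130, scale=30, size=100)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=10)  # returns the bar heights and bin edges

ax.set_title("Example histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
plt.show()
```

Working with `fig` and `ax` directly tends to be easier once a figure contains more than one plot, since each set of axes can be labelled separately.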

Use the Code Block below to produce histograms for a few other features in our dataset.

We can also compare two variables using the `scatter` plotting tool. The code below produces a scatter plot comparing `Molecular Weight` with `VSC` (volatile sulfur compounds). 

```python
plt.scatter(amino_acids_df['Molecular Weight'], amino_acids_df['VSC'])
```
💡**Task:** Amend the code above so that it adds a title and axis labels. Take a look at [matplotlib's documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) to see how else you can customise your plot. 

### 4.3.3 Simple Modelling with `scikit-learn`, `numpy` and `pandas`
The `scikit-learn` library has hundreds of tools for data processing, analysis and modelling. We're going to use a few here to create a simple linear regression which predicts `VSC` given an amino acid's molecular mass.

Our first task is to create an X variable for our regression. We'll put this into the form of a NumPy array by using the function `np.array()`. NumPy arrays are a bit like lists, except they store elements of a single data type (typically numerical, such as `int` or `float`) and are much more memory efficient, which is very important for complex machine learning tasks.
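
As a small illustration of how arrays differ from lists, the sketch below converts a made-up list of numbers into an array and compares how the two respond to multiplication:

```python
import numpy as np

masses_list = [75.0, 89.0, 105.0]   # a plain Python list (made-up values)
masses_array = np.array(masses_list)

print(masses_array.dtype)   # every element shares one data type
print(masses_array.shape)   # a one-dimensional array with three elements

# Arithmetic on an array is applied to every element at once...
print(masses_array * 2)

# ...whereas multiplying a list repeats the list instead.
print(masses_list * 2)
```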

Create your input variable using the code below:

```python
# first, drop anything which is missing values for either VSC or Molecular weight
input_df = amino_acids_df.dropna(subset=['Molecular Weight', 'VSC'])
X = np.array(input_df['Molecular Weight'])
```
💡**Task:** Copy the code above and add a new line which extracts the `y` variable from the `VSC` column. Remember to make it a numpy array. 

We now want to split our data into a test set and a training set. The code below randomly assigns 10% of our data to a test set. We'll set a random state so that our random split is reproducible for others:

```python
from sklearn.model_selection import train_test_split

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.1, random_state=42)
```
Copy this code into the Code Block below to generate a train and test set.
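
To see what the split does, here is a sketch on 20 made-up data points (not the amino acid data). With `test_size=0.1`, scikit-learn rounds the test set size up, so 20 samples give 2 test points and 18 training points; repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, with y = 2 * x.
X_demo = np.arange(20)
y_demo = 2 * X_demo

X_a_trn, X_a_tst, y_a_trn, y_a_tst = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)
X_b_trn, X_b_tst, y_b_trn, y_b_tst = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

print(len(X_a_trn), len(X_a_tst))            # sizes of the training and test sets
print(np.array_equal(X_a_tst, X_b_tst))      # same random_state, same split
print(np.array_equal(y_a_tst, 2 * X_a_tst))  # X and y rows stay paired
```

With a dataset as small as ours, a 10% test set contains only a few points, so the test score can change noticeably depending on which rows happen to be selected.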

We can now train a linear regression model on this data! First, we need to import the `LinearRegression` model from sklearn and initialise a new model.

```python
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
model = regressor.fit(X_trn, y_trn)
```
Copy this into the Code Block below. You will get an error. Read the error carefully and adjust your data accordingly.


We can now test our model. Use the `model.score(X, y)` method to get a regression score (the R² value) for the model on our test data!
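
The score returned for a regression model is the coefficient of determination, calculated as `1 - SS_res / SS_tot`. The sketch below computes it by hand for made-up true and predicted values and compares the result with scikit-learn's `r2_score` function, which computes the same quantity:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values standing in for real test targets and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions. A model that always predicts the mean of `y` scores 0, and a model that does worse than that can score below 0.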

💡**Task:** Have a go at making a linear regression model with multiple features from our dataset. Can you improve the performance on predicting VSC for different amino acids?