# Introduction to the pandas Python Package

---

In this section we will briefly review what `Python` modules and packages are in general and why they are useful to us as Data Scientists. We will then move on to describe specifically the `Python` package `pandas`, which will be used throughout this entire course on data wrangling.

# Python Packages and Modules

---


You will find that certain functions and definitions are needed again and again in your code. An example might be a function that finds the average value of a list of integers. Rather than repeatedly implementing the same function or redefining the same variable, you can follow the software engineering principle of DRY (Don't Repeat Yourself), and thankfully `Python` supports this quite seamlessly. A `Python` file can be written once and serve as the home for those recurring definitions and functions. The file can then be imported at the beginning of a new project, saving time and effort. This type of file is called a `Python` ***Module***.
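As a minimal sketch, the averaging function mentioned above could live in a module like the one below. The file name `stats_helpers.py` and the function name `average` are hypothetical, chosen only for illustration.

```python
# stats_helpers.py (hypothetical file name, used only for illustration)

def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        # An empty list has no meaningful average, so fail loudly.
        raise ValueError("average() requires at least one value")
    return sum(values) / len(values)
```

Assuming the file sits in the same directory as your project, another file could then run `import stats_helpers` and call `stats_helpers.average([1, 2, 3, 4])`, which evaluates to `2.5`.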

These recurring functions and definitions are in fact so common that it would be impractical to put all of them into one file, so we organize them into categories; for example, we might have one module for math and another for making plots. Furthermore, even those categories may be too comprehensive when we only need a subset of the definitions and functions in a module. To solve this, we can break the module up into smaller modules called submodules. Because these submodules are related, they should still live together somewhere, so we organize them into a `Python` ***Package*** (you may also see packages referred to as *libraries*). A `Python` package structures the way we refer to its modules and submodules.
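To make the package and submodule relationship concrete, here is a small example using `urllib`, a package from the `Python` standard library, and `urllib.parse`, one of its submodules. The URL is simply the `conda` documentation link used later in this section.

```python
# `urllib` is a package; `urllib.parse` is one of its submodules.
import urllib.parse

# Split a URL into its components using a function from the submodule.
parts = urllib.parse.urlparse("https://conda.io/docs/")

print(parts.netloc)  # conda.io
print(parts.path)    # /docs/
```

Notice how the dotted name mirrors the structure: package first, then submodule, then the function we want.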

Luckily, we don't always have to build our own modules and packages. There is a large community of `Python` programmers who have run into many of the same problems and have created `Python` modules, organized into open source packages, that are available for you to use in your own projects!

# Types of Python Modules

---

In the context of module installation, there are essentially three types of `Python` modules we will be concerned with:

1.   Modules bundled with the `Python` distribution but not available for use by default.
  - Ex. datetime, random, collections, etc.
  - These modules do not need to be installed because they are included in the `Python` standard library, but they must be imported before they can be used.
  
2.   Modules typically not available in the default `Python` installation but which are bundled in packages that may be installed by the `Conda` installer. These are also not available by default.
  - Ex. numpy, pandas, matplotlib, etc.
  - These modules may be installed using the conda installer on the command line, for example with `conda install numpy`.

3.   Other modules contained in specialized packages that need to be installed manually.
  - Ex. BioPython, astropy, tensorflow, etc.
  - We will not be using these modules.
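The difference between the first two types can be sketched in code. Standard library modules such as `collections` and `datetime` only need an `import`, while packages such as `numpy` or `pandas` may or may not be present in your environment. The helper name `is_installed` below is hypothetical; it relies on `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import collections
import datetime
import importlib.util

# Type 1: standard library modules work as soon as they are imported.
counts = collections.Counter(["a", "b", "a"])
next_day = datetime.date(2020, 1, 31) + datetime.timedelta(days=1)
print(counts["a"])  # 2
print(next_day)     # 2020-02-01


def is_installed(module_name):
    """Return True if a top-level module can be found (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


# Type 2: these are only available if they have been installed.
for name in ["numpy", "pandas", "matplotlib"]:
    print(name, "installed:", is_installed(name))
```

The output of the final loop depends on your own environment, so it is not shown here.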

# Importing Python Modules

---

Once the `Python` package containing the module you wish to use is installed on your machine or in your virtual environment, you must import the module into your file to use it. To do this we have two options:

1.   We can import an entire module, and with it **all** of its functionality and definitions:

```python
>>> import math
```
To reference definitions or functionality after this import statement we use *dot* notation. For example, if we wanted to access the `math` module's definition for *$\pi$* we would type:

```python
>>> math.pi
3.141592653589793
```

2. Or we can import only some of the modules in a package or, even more specifically, only a particular function or definition. If we wanted to import only the `math` module's definition for *$\pi$* we would type:

```python
>>> from math import pi
```

Then to reference `pi` all we would type is:

```python
>>> pi
3.141592653589793
```
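The two import styles can be mixed, and several names can be imported at once by separating them with commas. The short script below is a sketch that computes the area of a circle both ways; the variable names are chosen only for illustration.

```python
import math                 # option 1: import the whole module
from math import pi, sqrt   # option 2: import only specific names

radius = 2.0

# Dot notation: reach into the module for the definition we need.
area_with_dot = math.pi * radius ** 2

# Direct reference: `pi` was imported by name, so no prefix is needed.
area_direct = pi * radius ** 2

print(area_with_dot == area_direct)  # True
print(sqrt(16))                      # 4.0
```

Importing by name saves typing, while dot notation makes it obvious where a definition comes from, which helps when two modules define the same name.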



# Aliasing

It is common practice to rename modules when they are imported; this is called *aliasing*. For instance, if we wanted to refer to the `math` module as `mth` instead, we could do so using the `as` keyword with the following syntax:

```python
>>> import math as mth
```

We can now reference the functionality of the module using the shortened alias, saving us a keystroke! (The benefits of this feature are more evident when module names are longer.) Accessing the functionality and definitions of the module works just the same as before, except that we now use `mth` rather than `math`:

```python
>>> mth.pi
3.141592653589793
```
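Aliasing pays off more with longer names. The sketch below aliases the standard library's `statistics` module; the alias `stats` is just one possible choice. By the same mechanism, `pandas` is conventionally imported as `pd`, which is the alias you will see in most `pandas` code.

```python
# Alias a longer module name to something shorter.
import statistics as stats

scores = [1, 2, 3, 4]

# Every call now goes through the alias instead of the full name.
print(stats.mean(scores))    # 2.5
print(stats.median(scores))  # 2.5

# The same pattern, once pandas is installed, would read:
#   import pandas as pd
```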




# What is Data Wrangling? Why pandas?

---

*Data Wrangling* is the process of transforming your data from one form into another, usually with the intent of making it more suitable for analysis. For instance, perhaps you are given a data set in the form of a `.tsv` file that contains information about thousands of baseball players over many years of playing in the MLB, but you are only interested in comparing the batting averages of players who played specific positions during a single season. We would need to reformat this data set, taking a subset of the features and entries provided and somehow grouping players by the positions they played. This is going to take some data wrangling.

`pandas` is the de facto standard package for wrangling data in `Python`. `pandas` provides an abundance of functionality for each step of the data wrangling workflow, from reading and writing various file formats, to cleaning your data, to merging data sets, all of which we will learn how to do by the end of this course.
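As a rough preview, the baseball scenario above might look something like the following in `pandas`. This is only a sketch: the records, player names, and column names are made up for illustration (they are not real MLB data), and the tab-separated text is read from a string rather than from an actual `.tsv` file.

```python
import io

import pandas as pd

# Made-up, tab-separated records standing in for a real .tsv file.
tsv_text = (
    "player\tseason\tposition\thits\tat_bats\n"
    "Player A\t2015\t1B\t150\t600\n"
    "Player B\t2015\tSS\t120\t400\n"
    "Player C\t2015\tSS\t180\t600\n"
    "Player D\t2016\t1B\t160\t500\n"
    "Player E\t2016\tSS\t100\t400\n"
)

# Read the tab-separated data into a DataFrame.
players = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Keep a single season and only the columns we care about.
is_2015 = players["season"] == 2015
season_2015 = players.loc[is_2015, ["position", "hits", "at_bats"]]

# Total hits and at-bats per position, then batting average = hits / at-bats.
totals = season_2015.groupby("position").sum()
totals["batting_avg"] = totals["hits"] / totals["at_bats"]

print(totals)
```

With this made-up data, the batting average works out to 0.25 for `1B` and 0.30 for `SS`. Each step used here is covered in detail later in the course.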

![](images/pandas_architecture.png)


# Installing the `pandas` Python Package

---

Before we discuss how to install `pandas` on your machine, I am assuming that you have already installed the *Anaconda* distribution of `Python`. If not, you can see how this is done here: [Anaconda Installation](https://conda.io/docs/user-guide/install/index.html)

The Anaconda distribution of `Python` comes with many useful tools, including a package manager referred to as [*`conda`*](https://conda.io/docs/). A *package manager* helps you install, update, and organize your packages.

`pandas` is a package that is not typically available with a standard `Python` installation, but it can be installed using the `conda` installer. To install it, type the following command on the command line (you can use Terminal on Mac or Command Prompt on Windows):

```bash
conda install pandas
```

<video width="900" height="720" controls loop autoplay>
  <source src="Videos/conda_install_pandas.mp4" type="video/mp4">
</video>
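Once the installation finishes, one way to confirm that it worked is to try importing `pandas` from `Python` and print its version. The function name `pandas_version` below is hypothetical; the check itself relies only on the `import` statement and the package's `__version__` attribute.

```python
def pandas_version():
    """Return the installed pandas version string, or None if it is missing."""
    try:
        import pandas as pd
    except ImportError:
        # pandas could not be found in this environment.
        return None
    return pd.__version__


version = pandas_version()
if version is None:
    print("pandas is not installed in this environment")
else:
    print("pandas version:", version)
```

If the first message is printed, double-check that the command was run in the same environment that `Python` is using.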