# Getting Started with Python: Part One

So you've decided to learn programming using Python. Wonderful! In this tutorial, we'll take the basics of programming and apply them to data sets.

## Importing Packages

Python's biggest strength is not its simple syntax (though that is a nice quality-of-life feature); it is the fact that it is **open source**. Anyone can write and publish a package. These packages can contain classes, functions, and methods that you would otherwise have to write on your own.

Let's take advantage of the work that others have put into Python. Some of these packages are fundamental to performing certain tasks with the programming language.

To install a package on your machine, use `pip`, the package installer for Python. In your command prompt/terminal, run the following command:

`pip install <package_name>`

If you want to update an existing package, use this command instead:

`pip install --upgrade <package_name>`

Once installed, we can then **import** that package into whichever Python scripts/notebooks we're using. Some of these packages are quite large (machine learning libraries, for example), so Python doesn't import them all by default, which keeps the program light.

Use only what you need! :)

Let's import the packages we're going to be using in this tutorial:

In [None]:
import pandas as pd
import numpy as np

If these packages are installed, then this should have run with no problem. Not only are we importing these two packages, we're also giving each of them an alias. This way we don't have to write `pandas` over and over again while coding; we can just type `pd`.
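
To make the alias idea concrete, here is a minimal sketch showing that an alias is only a second name for the same module. Importing `pandas` twice in this way is done purely for illustration; in practice you would pick one spelling.

```python
import pandas
import pandas as pd  # same module, shorter name

# The alias points at the very same module object.
print(pd is pandas)

# Both spellings reach the same class.
print(pd.DataFrame is pandas.DataFrame)
```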

## Pandas and Numpy

`pandas` and `numpy` are two of the most widely used packages for working with tabular data. `numpy` is the workhorse of the two: much of `pandas` is built on top of `numpy` and makes its functionality easier to work with. We'll need both to get going.

### Read in Data

To start, let's bring some data into Python. `pandas` can read many different kinds of data files, but the most common for our purposes are CSVs and Excel workbooks. Let's go ahead and read in one of the files we have:

In [None]:
# Making sure we know what directory we're currently in
import os
os.getcwd()

In [None]:
df = pd.read_csv('../data/flights.csv')
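
If you don't have the file handy, the same call can be tried on a small in-memory CSV. This is only a sketch: the rows and the `passengers` column below are made up for illustration and are not taken from `flights.csv`.

```python
import io

import pandas as pd

# Hypothetical sample data, not the contents of flights.csv.
csv_text = """year,month,passengers
1949,January,112
1949,February,118
1950,January,115
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)
print(sample_df.columns.tolist())
```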

If everything worked correctly, we should be able to see inside of this file now that it is read into our Python session:

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head(10)

In [None]:
df.tail(5)

What we've done here is read in our data using `pandas`. Doing so, we've put our data into an *object* called a `pandas DataFrame`. You can think of this as similar to our list of lists, except that we have much more functionality here. You can even see that we've called a few methods to look around our data.
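
As a rough sketch of the list-of-lists comparison, the snippet below builds a tiny DataFrame by hand. The rows and the name `small_df` are hypothetical and exist only for this example.

```python
import pandas as pd

# A plain list of lists: each inner list is one row.
rows = [
    [1949, "January"],
    [1949, "February"],
    [1950, "January"],
]

# The same rows as a DataFrame, now with named columns.
small_df = pd.DataFrame(rows, columns=["year", "month"])

# The list only knows positions ...
print(rows[0][1])

# ... while the DataFrame knows column names and comes with helper methods.
print(small_df["month"].tolist())
print(small_df.head(2))
```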

We can even look at specific columns/rows:

In [None]:
df['month']

In [None]:
df[['year']]

In [None]:
print(type(df['month']))
print(type(df[['year']]))
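
The two `type` calls above return different things: single brackets give back one column as a `Series`, while double brackets give back a (one-column) `DataFrame`. A minimal sketch on a hypothetical two-row table:

```python
import pandas as pd

# Hypothetical data for illustration only.
small_df = pd.DataFrame({"year": [1949, 1950], "month": ["January", "June"]})

one_column = small_df["month"]    # single brackets -> Series
sub_table = small_df[["month"]]   # double brackets -> DataFrame

print(type(one_column).__name__)
print(type(sub_table).__name__)
print(sub_table.shape)
```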

`pandas` has two ways you can index by row. `loc[]` uses the bolded column on the left (called the *index*) and subsets by those labels. Alternatively, we can treat our DataFrame like a list and index by position using `iloc[]` instead. Both work; they are just different ways of selecting rows.

In [None]:
df.loc[20:40,]

In [None]:
df.iloc[-10:-1,]
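
The difference is easier to see when the index labels are not simply 0, 1, 2, and so on. The sketch below uses a hypothetical four-row table with labels 10 through 40. Note that a `loc[]` slice includes its end label, while an `iloc[]` slice excludes its end position, just like a list slice.

```python
import pandas as pd

# Hypothetical data with a non-default index.
small_df = pd.DataFrame(
    {"month": ["January", "February", "March", "April"]},
    index=[10, 20, 30, 40],
)

by_label = small_df.loc[10:30]        # labels 10 through 30, end included
by_position = small_df.iloc[0:2]      # positions 0 and 1, end excluded
without_last = small_df.iloc[-2:-1]   # stops before the final row
through_last = small_df.iloc[-2:]     # runs through the final row

print(by_label["month"].tolist())
print(by_position["month"].tolist())
print(without_last["month"].tolist())
print(through_last["month"].tolist())
```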

### Filter Data

Let's go ahead and subset our data on a condition, also known as filtering:

In [None]:
df['year'].value_counts()

In [None]:
df['year'] >= 1955

In [None]:
df[df['year'] >= 1955]

In [None]:
df[
    (df['year'] >= 1955)
    & (df['month'] == 'January')
]

In [None]:
df[
    (df['year'] >= 1955)
    & (df['month'].str.len() > 6)
]
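
Each comparison above produces a column of `True`/`False` values (a boolean mask), and the DataFrame keeps only the rows where the mask is `True`. The sketch below names the masks so the logic is easier to read. The data and variable names are hypothetical. Masks are combined with `&` (and), `|` (or), and `~` (not), and each comparison needs its own parentheses when written inline.

```python
import pandas as pd

# Hypothetical data for illustration only.
small_df = pd.DataFrame({
    "year": [1950, 1955, 1960, 1960],
    "month": ["January", "January", "May", "September"],
})

is_recent = small_df["year"] >= 1955
is_long_name = small_df["month"].str.len() > 6

both = small_df[is_recent & is_long_name]     # rows meeting both conditions
either = small_df[is_recent | is_long_name]   # rows meeting at least one
not_recent = small_df[~is_recent]             # rows failing the first one

print(both)
print(len(either))
print(not_recent)
```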

### Transforming Data

`pandas` and `numpy` are special in the way that they perform operations on data. They use what is called *vectorization*: an operation is applied to a whole column or array at once, in fast compiled code, instead of looping over the values one at a time in Python. The next result doesn't rely on the previous result, so the work can be done on all of the values together.

This is why we'll be using the built-in DataFrame methods rather than some of the Python logic we learned earlier. We'll still use that knowledge, but only when the built-in methods are not what we need.
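
As a small illustration, here is the same calculation written both ways on a hypothetical column of years. Both give the same values; the second form is the style used in the rest of this section.

```python
import pandas as pd

# Hypothetical data for illustration only.
years = pd.Series([1949, 1955, 1960])

# Element by element, using plain Python logic.
looped = [year + 1 for year in years]

# Whole column at once, using the built-in operation.
whole_column = years + 1

print(looped == whole_column.tolist())
```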

In [None]:
df_transform = df.copy()

In [None]:
df_transform['year'] = np.where(
    df_transform['year'] == 1960,  # if this condition is true
    'bad_year',                    # use this value
    df_transform['year'],          # otherwise keep the original value
)
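
`np.where` works like an if/else applied to every row at once. Here is a minimal sketch on a hypothetical table; the `season` column and its labels are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
small_df = pd.DataFrame({"month": ["May", "June", "July"]})

small_df["season"] = np.where(
    small_df["month"] == "June",  # condition checked for every row
    "summer_start",               # value where the condition is True
    "other",                      # value where the condition is False
)

print(small_df)
```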

In [None]:
conditions = [
    df_transform['month'] == 'June',
    df_transform['month'] == 'December',
]
choices = [
    'good_month',
    'bad_month',
]
df_transform['month'] = np.select(
    conditions, 
    choices, 
    default=df_transform['month'],
)

In [None]:
df_transform
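
`np.select` extends the same idea to several conditions. The conditions are checked in order and, for each row, the first one that is `True` decides the value, so the order of the list matters. Rows that match nothing get the `default`. A small sketch with hypothetical data and labels:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
small_df = pd.DataFrame({"year": [1949, 1955, 1960]})

conditions = [
    small_df["year"] >= 1960,  # checked first
    small_df["year"] >= 1955,  # only used when the first one is False
]
choices = ["latest", "recent"]

small_df["era"] = np.select(conditions, choices, default="early")

print(small_df)
```

The year 1960 satisfies both conditions, but it is labeled `latest` because that condition comes first.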

### Group By

In [None]:
# WIP

### Merge

In [None]:
# WIP