# Introduction to Python

#### Parker H. Holzer,   Department of Statistics & Data Science,  Yale University

Goals:
----------
1. Become familiar with Jupyter notebooks
2. Understand Python programming basics
3. Learn to work with lists and arrays
4. Get comfortable working with dataframes

## Part 1: Jupyter notebooks

While Python is a scripting language that can be executed in the terminal/command prompt, Jupyter notebooks provide a good interface for writing Python code and for doing data analysis. These notebooks are composed of cells, of which there are two types:

* Code cells: chunks of Python code that should all be run together
* Markdown cells: areas to add text for describing code, adding math, presentations like this, etc.

#### *Exercise:*     **DOUBLE-CLICK HERE**

Which of these two cell types do you think this is?

Also, did you notice how the appearance of this cell changed when you double-clicked it? That's because each cell can be in one of two modes: Command Mode (surrounded by a blue box) and Edit Mode (surrounded by a green box). In Command Mode you cannot edit the contents of the cell. Double-clicking put this cell into Edit Mode. To go from Command Mode to Edit Mode, either click inside the cell (double-click for a markdown cell like this one) or press Enter on your keyboard. When you are ready to run a cell (either code or markdown type), you can either click the *Run* button near the top left of the notebook or press Shift-Enter on your keyboard. Try it out right now on this cell!

If you wish to leave Edit Mode without running the cell, you can press Escape on your keyboard. (Note: there are many other keyboard shortcuts for Jupyter notebooks that you can check out by clicking *Help* near the top and selecting *Keyboard Shortcuts*.)

#### *Exercise*:

Fill in the missing text of the following lines:

**- - - - -**

**Bow, wow, wow**

**Eli Yale**

**- - - - -**

**Bow, wow, wow**

**Our team can never fail**

In [1]:
#This is a code cell, which can contain comments (like this) and Python code. Within a code cell, the lines are
#run from top to bottom. The cells themselves can be run in any order; the number to the left of a cell shows
#the order in which it was run.
print("Hello! And welcome to Jupyter notebooks")

Hello! And welcome to Jupyter notebooks


## Part 2: Python basics

### Arithmetic and Math Operations

Each of the following is a math operator in Python: `+`, `-`, `*`, `/`, `**`, `//`, `%`

Try each of them out by replacing `+` in the following cell to see if you can figure out what they do.

In [24]:
-7 + 3

-4
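
Once you have experimented, the sketch below can serve as a reference: it applies each operator to the same pair of numbers. The names `p` and `q` are placeholders chosen for this illustration and are not used elsewhere in the notebook.

```python
# Placeholder values for illustration (not from the notebook)
p = -7
q = 3

print(p + q)    # addition
print(p - q)    # subtraction
print(p * q)    # multiplication
print(p / q)    # true division, always returns a float
print(p ** q)   # exponentiation
print(p // q)   # floor division: rounds down toward negative infinity
print(p % q)    # remainder, takes the sign of the divisor

# Floor division and remainder always fit together like this
print((p // q) * q + (p % q) == p)
```

Putting the numbers in variables first avoids a precedence surprise: written directly, `-7 ** 3` is read as `-(7 ** 3)`.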

To assign a value to a variable, use the `=` symbol. The following cell gives the variable `x` the value `6` and the variable `y` the value `4`.

In [25]:
x = 6
y = 4

If you want to see the value of a variable, just run a cell with it as the last line of the cell.

In [26]:
x

6

There are also comparison operators: `==`, `>`, `<`, `!=`, `>=`, `<=` .

Try each of these operators out by replacing `==` in the cell to get a feel for what they do.

In [28]:
x == y + 2

True
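
A comparison always evaluates to a boolean, so its result can be stored in a variable like any other value. The sketch below uses the placeholder names `m`, `n`, and `same`, which are not part of the notebook.

```python
# Placeholder values for illustration (not from the notebook)
m = 6
n = 4

same = (m == n + 2)    # == tests equality; a single = assigns
print(same)
print(type(same))      # the result of a comparison is a bool
print(m != n)          # "not equal"
print(m >= n, m <= n)  # "at least" and "at most"
```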

#### *Exercise:*

What will the following cell output?

In [35]:
z = (y - 3)**6 + 5
z = z/2
(z % 2) >= (x % 2)

True

### Variable Types

The variables `x`, `y`, and `z` each have numeric values. But there are many other types of variables available. Jupyter notebook color-codes each of these types to make that more obvious.

In [36]:
var1 = "This is a string"
var2 = False

To see what type a variable is, use the `type()` function.

In [43]:
type(var1)  #replace var1 with var2, x, or x/5 to see how the type changes

str

So do the math and comparison operators work with these other types too?

In [52]:
var1 = var1 + " that works with addition!"
var1

'This is a string that works with addition!'

In [53]:
var1 * 2

'This is a string that works with addition!This is a string that works with addition!'

In [56]:
var1 * var2

''

In [57]:
var2 * x

0

In [64]:
var3 = True
var3 + var3

2
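
The last three results make sense once you know that Python treats `True` and `False` as the integers 1 and 0 in arithmetic. A short sketch, using a placeholder string `word` that is not part of the notebook:

```python
print(isinstance(True, int))   # bool is a subclass of int
print(int(True), int(False))

word = "ab"                    # placeholder string for illustration
print(word * 3)                # repeats the string three times
print(word * True)             # same as word * 1
print(repr(word * False))      # same as word * 0: the empty string

# Adding a string and a number is not defined and raises a TypeError
try:
    word + 1
except TypeError as err:
    print("TypeError:", err)
```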

## Part 3: Lists and Arrays

Often we want a single variable to hold a collection of values. Lists and arrays are used for this.

In [18]:
a = [5,9,3,-10,'bulldogs',True]    #this is the syntax for creating a list from scratch
a

[5, 9, 3, -10, 'bulldogs', True]

To get the length of a list, use the `len()` function.

In [None]:
len(a)

### Indexing

In [70]:
a[0]    #get the first element in a list

5

In [71]:
a[1]   #get the second element in a list

9

In [72]:
a[-1]  #get the last element in a list

True

In [73]:
a[-2]  #get the second-to-last element in a list

'bulldogs'

In [74]:
a[:4]  #get the first four elements of a list

[5, 9, 3, -10]

In [75]:
a[-3:]  #get the last three elements of a list

[-10, 'bulldogs', True]

In [76]:
a[2:5] #get the elements of a list with index 2 up to (not including) index 5

[3, -10, 'bulldogs']
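
The pattern behind these examples is that a slice `a[start:stop]` includes index `start` and excludes index `stop`, and that a negative index counts from the end. The sketch below checks this on a placeholder list `letters`, which is not part of the notebook.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']   # placeholder list for illustration

# A negative index i refers to the same element as len(letters) + i
print(letters[-2] == letters[len(letters) - 2])

# Leaving out start means "from the beginning"; leaving out stop means "to the end"
print(letters[:4] == letters[0:4])
print(letters[-3:] == letters[3:len(letters)])

# A slice from start to stop has stop - start elements
print(len(letters[2:5]))

# Splitting at the same index gives two pieces that rebuild the list
print(letters[:4] + letters[4:] == letters)
```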

#### *Exercise:*

Complete the second line of code below so that the cell returns the list `[-4,-6,-9,-15]`

In [4]:
b = [0,-1,-2,-4,-6,-9,-15,-20,-45,-90]
b ...

While anything can go into a list, lists are not well suited to applying a mathematical operation to every element. For instance, what if we wanted to multiply every element of `b` by 2.5?

In [7]:
2.5*b

TypeError: can't multiply sequence by non-int of type 'float'

We need to convert it to an array (which behaves much like a mathematical vector). For this we need a package called `numpy`, which is so useful that I recommend getting into the habit of importing it at the beginning of every Python script you write!

In [8]:
import numpy as np

In [9]:
b2 = np.array(b)
b2

array([  0,  -1,  -2,  -4,  -6,  -9, -15, -20, -45, -90])

In [10]:
b2*2.5

array([   0. ,   -2.5,   -5. ,  -10. ,  -15. ,  -22.5,  -37.5,  -50. ,
       -112.5, -225. ])

That time it worked!!! 
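
To see why the list version failed, note that `*` on a list means repetition, which is only defined for whole numbers. A minimal sketch with a placeholder list `c`; the list comprehension is one plain-Python alternative, shown here only for comparison.

```python
import numpy as np

c = [1, 2, 3]                      # placeholder list for illustration

print(c * 2)                       # repeats the list: [1, 2, 3, 1, 2, 3]
print([2.5 * item for item in c])  # multiplies each element, one at a time
print(np.array(c) * 2.5)           # the array does the same in one step
print(np.array(c) * 2)             # for arrays, * 2 doubles each element
```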

#### *Exercise*:

Without running the next cell, what do you think will be the output?

In [11]:
x = np.array([6,-1,6,-1,6])
y = np.array([1,2,3,4,5])
x + y[-1]**2

array([31, 24, 31, 24, 31])

You can also apply various operations to two arrays of the same length. From the next few cells, see if you can spot the pattern that explains how the operations are applied.

In [12]:
x + y

array([ 7,  1,  9,  3, 11])

In [13]:
x - y

array([ 5, -3,  3, -5,  1])

In [14]:
x * y

array([ 6, -2, 18, -4, 30])

In [15]:
x == y

array([False, False, False, False, False])

In [16]:
x % y

array([0, 1, 0, 3, 1])

In [17]:
x > y

array([ True, False,  True, False,  True])

You can also index an array with a boolean array of the same length, which keeps only the elements where the boolean array is `True`. This is particularly useful in data analysis!

In [20]:
boolarray = np.array([True, False, False, True, True, False, False, False, True])
v = np.array([2,4,6,8,10,12,14,16,17])
v[boolarray]

array([ 2,  8, 10, 17])
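
In practice the boolean array usually comes from a comparison rather than being typed by hand. A sketch that reuses the values of `v` from above, with `mask` as a placeholder name:

```python
import numpy as np

v = np.array([2, 4, 6, 8, 10, 12, 14, 16, 17])

mask = v > 9            # comparing an array to a number gives a boolean array
print(mask)
print(v[mask])          # keeps only the elements where mask is True
print(v[v % 2 == 1])    # the comparison can also be written inline
print(mask.sum())       # True counts as 1, so this counts the matches
```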

## Part 4: Dataframes

A very common way to organize data is a dataframe, which is essentially a matrix with variables assigned to the columns and observations assigned to the rows, except that different columns may hold different types of values. In Python, the most popular package for dataframes is pandas.

In [21]:
import pandas as pd

Let's actually read in some real data now! Here is a dataset of recently sold houses (as of March, 2021) in Bountiful, Utah.

In [29]:
url = "https://raw.githubusercontent.com/parkerholzer/Bountiful_houses_sold/master/Bountiful_UT_3-25-2021.csv"
data = pd.read_csv(url)

Let's take a quick look at the dataframe!

In [30]:
data

| Unnamed: 0 | Type | Built | Lot | Bed | Bath | Area | Cost | URLaddress |
|---|---|---|---|---|---|---|---|---|
| 0 | SingleFamily | 1979.0 | 0.21 Acres | 6.0 | 5.0 | 5550 | | https://www.zillow.com/homedetails/910-E-Mills... |
| 1 | Townhouse | 2018.0 | 0.02 Acres | 3.0 | 4.0 | 2105 | $420,183 | https://www.zillow.com/homedetails/261-E-340-N... |
| 2 | Townhouse | 1983.0 | 0.01 Acres | 3.0 | 2.0 | 1648 | $305,125 | https://www.zillow.com/homedetails/188-E-2050-... |
| 3 | SingleFamily | 1959.0 | 0.20 Acres | 4.0 | 3.0 | 2268 | $400,571 | https://www.zillow.com/homedetails/840-N-600-E... |
| 4 | SingleFamily | 1973.0 | 0.43 Acres | 3.0 | 3.0 | 2766 | $533,270 | https://www.zillow.com/homedetails/102-E-Oakri... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 790 | Condo | 1973.0 | | 2.0 | 1.0 | 925 | $202,252 | https://www.zillow.com/homedetails/314-W-Cente... |
| 791 | SingleFamily | 1997.0 | 9,147 sqft | 5.0 | 3.0 | 2843 | $553,730 | https://www.zillow.com/homedetails/119-W-1500-... |
| 792 | SingleFamily | 1993.0 | 0.41 Acres | 5.0 | 4.0 | 3166 | $574,166 | https://www.zillow.com/homedetails/1130-S-800-... |
| 793 | Townhouse | 2014.0 | 436 sqft | 3.0 | 2.0 | 1798 | $357,503 | https://www.zillow.com/homedetails/169-E-Orcha... |


#### *Exercise*:

Answer the following questions:

1. How many sold properties do we have data about?
2. What are the different variables for each property?
3. Are there any missing values in the dataset?
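
Questions like these can also be answered in code. The sketch below uses a small made-up dataframe named `toy` (its values are invented for illustration and are not taken from the housing data); the same attributes and methods work on `data`.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Type": ["Condo", "Townhouse", "Condo"],
    "Bed": [2.0, np.nan, 3.0],
    "Area": [925, 1798, 1100],
})

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # the variable names
print(toy.isna().sum())    # number of missing values in each column
```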

But how do we get the data out of the dataframe? Here we introduce one of the main aspects of Python: object-oriented programming. Everything in Python is an object, and the type of an object determines which attributes and methods (functions attached to the object) are available for it. For example, our object might be the dataframe `data`. One thing it provides is the `.loc` indexer, which selects rows and columns by label and is used with square brackets rather than parentheses.

In [47]:
built = data.loc[:,"Built"]  # get the data for all observations (indicated by the colon) of the variable "Built"
built

0      1979.0
1      2018.0
2      1983.0
3      1959.0
4      1973.0
        ...  
790    1973.0
791    1997.0
792    1993.0
793    2014.0
794    1996.0
Name: Built, Length: 795, dtype: float64

In [48]:
type(built)

pandas.core.series.Series

If we want an array instead of a pandas Series, we can use the Series' `.values` attribute.

In [49]:
built2 = built.values
type(built2)

numpy.ndarray
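
More generally, `.loc` takes a row selection followed by a column selection, and the row selection can be a boolean Series. A sketch on a made-up dataframe `toy` (values invented for illustration, `is_condo` is a placeholder name):

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Type": ["Condo", "Townhouse", "Condo"],
    "Built": [1973.0, 2014.0, 1996.0],
    "Bath": [1.0, 2.0, 2.0],
})

print(toy.loc[0, "Built"])                 # one row label, one column
print(toy.loc[:, ["Built", "Bath"]])       # all rows, two columns
is_condo = toy.loc[:, "Type"] == "Condo"   # boolean Series, one entry per row
print(toy.loc[is_condo, "Built"])          # only the rows where is_condo is True

# .to_numpy() is a method that returns the same kind of array as .values
print(type(toy.loc[:, "Built"].to_numpy()))
```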

### Summary Statistics

For the most basic statistics there are at least two ways to calculate them: (1) with methods of the pandas object or (2) with NumPy functions.

In [50]:
built.mean()

1956.1300813008131

In [59]:
np.mean(built)

1956.1300813008131

What about the median?

In [52]:
built.median()

1970.0

In [53]:
np.median(built)

nan

#### *Exercise:*

Why do you think these two ways of calculating the median didn't give the same result? (Hint: if you ever need to take a closer look at what a function does, just type `??` followed by the name of the function in a cell and run it. Try running `??np.median` and `??built.median`.)
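
Once you have an answer, the sketch below may help you check it. It uses a made-up Series `s` containing one missing value; `np.nanmedian` is NumPy's median that ignores missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1950.0, np.nan, 1970.0, 1990.0])   # made-up values

print(s.median())                  # pandas skips missing values by default
print(s.median(skipna=False))      # asking pandas not to skip them gives nan
print(np.median(s.to_numpy()))     # NumPy's median does not skip them
print(np.nanmedian(s.to_numpy()))  # NumPy's version that ignores nan
```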

## Additional Exercises

#### *Exercise:*

Give the 5-number summary (i.e., minimum, 25th percentile, median, 75th percentile, maximum) of the Bed variable in the dataframe `data`, only including properties of Type "SingleFamily".

In [66]:
# Hint: You might find the pandas Series method .quantile useful

#### *Exercise:*

Find out how many different property types there are in the dataframe `data`, and how many of each type there are.

#### *Exercise:*

How many properties in the dataframe `data` have more bathrooms than bedrooms?