# Introduction to Python 2

#### Parker H. Holzer,   Department of Statistics & Data Science,  Yale University

Recap:
---------
Last time we went over ...
1. Structure of Jupyter notebooks
2. Basic Python commands
3. Different Python objects such as lists, arrays, dictionaries, strings, etc.
4. Intro. to dataframes

Goals:
----------
Understand ...
1. Python syntax
2. User-defined functions
3. Loops
4. Introductory data analysis

## Part 0: Review!

Let's start by importing three standard packages that we will need.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### *Exercise*:
Print the value of the "Area" variable (the square footage) for the first row in the dataframe `data`.

In [None]:
url = "https://raw.githubusercontent.com/parkerholzer/Bountiful_houses_sold/master/Bountiful_UT_3-25-2021.csv"
data = pd.read_csv(url)


#### *Exercise*:
Convert the square feet area printed above to be in units of square meters. (Note: 1 ft. = 0.3048 m.)
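
Because area is a squared length, the conversion factor has to be squared as well. A minimal sketch of the arithmetic, using a made-up area of 2,000 square feet (the names `FT_TO_M`, `SQFT_TO_SQM`, and `area_sqft` are illustrative and not taken from `data`):

```python
FT_TO_M = 0.3048                     # 1 ft = 0.3048 m (from the note above)
SQFT_TO_SQM = FT_TO_M ** 2           # 1 sq ft = 0.3048^2 sq m, roughly 0.0929

area_sqft = 2000.0                   # hypothetical value, not taken from data
area_sqm = area_sqft * SQFT_TO_SQM   # convert square feet to square meters
print("%.2f square meters" % area_sqm)
```

Multiplying by 0.3048 only once would convert a length, not an area.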

## Part 1: Loops and Conditional Statements

Often we want to apply a certain operation to every entry in a list or array. For example, we probably want all the values of "Area" in `data` as floats instead of strings.

In [None]:
data.loc[:10,"Area"].values   # Area for rows 0 through 10 (.loc slices include the end label), stored as strings

To do this, we can use a 'for loop'. This type of loop goes through each element in a list (or array) and applies a certain operation to it. The structure of a for loop is as follows:

`for ` *element* `in` *mylist*`:`

$\ \ \ \ \ \ $ *operation(s) to be applied to element*

In [None]:
for a in data.loc[:10,"Area"]:   # begin the for loop, each round using a single element in data.loc[:10,"Area"]
    s = a.replace(',', '')       # first, remove any commas from the element
    s = float(s)                 # second, convert the string to a floating point variable
    print(s)                     # last, print the value of the variable

#### Important Note:

Did you notice the indentation and the lack of semicolons and braces? In many other programming languages (C and Java, for example), braces delimit the body of a loop and indentation is only a readability convention. In Python, the indentation itself defines which lines belong to the loop, so it must be consistent.
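
As a small illustration with made-up numbers, the indented line below belongs to the loop and runs once per element, while the unindented `print` runs only once, after the loop has finished:

```python
values = [1, 2, 3]        # made-up numbers for illustration
total = 0
for v in values:
    total = total + v     # indented: part of the loop body, runs three times
print(total)              # not indented: runs once, after the loop ends
```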

Now let's try applying the loop to all the values of "Area" in `data`.

In [None]:
newarea = []                      # initialize an empty list
for a in data.loc[:,"Area"]:          
    s = a.replace(',', '')              
    s = float(s)
    newarea.append(s)             # add the value of the variable s to the end of the list newarea

Oops! It looks like there are some values of "Area" that are not strings. Let's take a closer look at the raw values.

In [None]:
data.loc[:100,"Area"].values

So it looks like the operation we apply needs to depend a bit on what variable type each element is. For this we can use conditional statements inside our loop.
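
A minimal sketch of such a loop, using a short made-up list (`raw_area`, a hypothetical stand-in for the "Area" column) so that it runs on its own. Strings are cleaned and converted; anything that is not a string is assumed to be a missing value and is kept as it is.

```python
import math

raw_area = ["1,850", "2,400", math.nan, "975"]   # hypothetical stand-in for data.loc[:,"Area"]

area_demo = []                                   # initialize an empty list
for a in raw_area:
    if isinstance(a, str):                       # the element is a string, so clean it
        area_demo.append(float(a.replace(',', '')))
    else:                                        # not a string (assumed missing), so keep it unchanged
        area_demo.append(a)
print(area_demo)
```

Running the same loop over `data.loc[:,"Area"]` and appending to `newarea` would give the cleaned list for the full column.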

That worked! Let's take a look at the list to see if it did what it was supposed to.

In [None]:
np.array(newarea)

There are also ways to shorten loops like this. One way is to stack operations together in a single line.

In [None]:
newarea = []                      
for a in data.loc[:,"Area"]:
    if pd.isna(a):                   
        newarea.append(a)                        
    else:                            
        newarea.append(float(a.replace(',', '')))

You can also use a clever technique called [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).

In [None]:
newarea = [a if pd.isna(a) else float(a.replace(',', '')) for a in data.loc[:,"Area"]]

#### *Exercise*: 
Clean up the "Cost" variable in `data` to be numerical instead of strings, and add the cleaned array to `data` as a new variable called "Cost2" in units of dollars.

## Part 2: User-defined Functions

Often we have a set of operations that we would like to apply in a more concise, reusable way. This is the main purpose of functions. While many functions are already built into Python and its packages, it is often useful to define your own functions for your own purposes.

For example, consider the "Lot" variable in `data` (which gives the lot size of each property).

In [None]:
data.loc[:,"Lot"]

This variable is also stored as strings, and different rows use different units. To clean it up, we need to convert each string to a number and put every value in the same units. Writing a function that does all of this makes the code more straightforward.

The structure of functions is as follows:

`def` *name_of_function*`(` *arguments* `):`

$\ \ \ \ \ \ $ *body of function*

$\ \ \ \ \ \ $ `return` *output*
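
As a small illustration of this structure, here is a hypothetical function (`sqft_to_acres`, not part of the original notebook) that converts square feet to acres, using the fact that 1 acre = 43,560 square feet:

```python
def sqft_to_acres(sqft):          # name of the function and its single argument
    acres = sqft / 43560          # body: 1 acre = 43,560 square feet
    return acres                  # output handed back to the caller

print(sqft_to_acres(8712))        # a made-up lot size in square feet
```

Once defined, the function can be called on any number, which avoids repeating the division everywhere it is needed.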

#### *Exercise:*
If you split each entry in the "Lot" column of `data` by a single space, what are all the unique values that the second half of the entry takes? (Hint: use the `np.unique()` function.)

Now let's write a function to clean up the "Lot"!

In [None]:
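def clean_lot(rawlot):                        # initialize the function name and its arguments
    if pd.isna(rawlot):                       # take care of the possibility of missing values
        return np.nan
    parts = str(rawlot).split(" ")            # if not a missing value, split the string at every space
    number = parts[0].replace(",", "")        # remove any commas from the numeric part
    # ASSUMPTION: the unit labels start with "acre" or "sq"; check them against the exercise above
    unit = " ".join(parts[1:]).lower()
    if unit.startswith("acre"):               # if the lot size is already in acres,
        lot = float(number)                   # convert the numeric part to a float
    elif unit.startswith("sq"):               # if the lot size has units of square feet,
        lot = float(number) / 43560           # convert to a float and put in units of acres (1 acre = 43,560 sq ft)
    else:                                     # if the lot size doesn't have units of acres or square feet,
        lot = np.nan                          # consider the value to be missing
    return lot                                # return the value of the lot variable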

In [None]:
clean_lot(data.loc[791,"Lot"])

In [None]:
Lot2 = np.array([clean_lot(l) for l in data.loc[:,"Lot"]])

In [None]:
data.loc[:,"Lot2"] = Lot2
data.loc[:,"Area2"] = newarea

In [None]:
data

#### *Exercise:*
Write a function that takes an array and returns a dictionary of the number of occurrences of each unique value, with the keys of the dictionary being the unique values.

In [None]:
def uniquevals(a):
    mydict = {"NA": sum(pd.isna(a))}
    a = a[~pd.isna(a)]
    for x in np.unique(a):
        mydict[str(x)] = sum(a == x)
    return mydict

In [None]:
uniquevals(data.loc[:,"Type"].values)

## Part 3: Data Analysis 101

Data analysis essentially comes down to two components: basic statistics and plots. Let's start by analyzing one variable at a time. 

### One Categorical Variable

Statistic: one-way frequency table

In [None]:
data.loc[:,"Type"].value_counts(dropna=False)

Plot: bar plot

In [None]:
d = data.loc[:,"Type"].value_counts(dropna=False)        # save the one-way frequency table as d
plt.bar(d.index.astype(str), d.values, align='edge')     # create a bar plot
plt.xticks(rotation = 45, fontsize=12)                   # rotate labels for better visualization
plt.xlabel("Property Type", fontsize=14)                 # add an x-axis label
plt.ylabel("Count", fontsize=14)                         # add a y-axis label
plt.title("Bountiful, UT Properties Sold", fontsize=16)  # add a plot title
plt.show()                                               # show us the final plot

What is that "Unknown" property type?

### One Discrete Quantitative Variable

Statistics: mean, median, IQR, standard deviation, one-way frequency table

In [None]:
notna = ~pd.isna(data.loc[:,"Bed"])
print("Mean Bedrooms: %.2f"%np.mean(data.loc[notna,"Bed"]))
print("Median Bedrooms: %.2f"%np.median(data.loc[notna,"Bed"]))
print("IQR Bedrooms: %.2f"%(np.percentile(data.loc[notna,"Bed"],75) - np.percentile(data.loc[notna,"Bed"],25)))
print("Std. Dev. Bedrooms: %.2f"%np.std(data.loc[notna,"Bed"]))
data.loc[notna,"Bed"].value_counts().sort_index()

Plot: bar chart

In [None]:
d = data.loc[:,"Bed"].value_counts(dropna=False).sort_index()        
plt.bar(d.index.astype(str), d.values)     
plt.xticks(fontsize=12)                   
plt.xlabel("Bedrooms", fontsize=14)                 
plt.ylabel("Count", fontsize=14)                         
plt.title("Bountiful, UT Properties Sold", fontsize=16)  
plt.show()  

#### *Exercise:*

Give the five-number summary of the Bathroom counts **(minimum, 25th percentile, median, 75th percentile, maximum)**.
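
One possible approach, sketched on a made-up array (`baths`, hypothetical and not taken from `data`), is to ask `np.percentile` for the 0th, 25th, 50th, 75th, and 100th percentiles in a single call:

```python
import numpy as np

baths = np.array([1, 2, 2, 3, 3, 3, 4, 6])             # made-up counts, not taken from data
summary = np.percentile(baths, [0, 25, 50, 75, 100])   # min, 25th pct., median, 75th pct., max
print(summary)
```

With the real column, missing values would need to be removed first, as was done for the bedroom counts above.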

### One Continuous Quantitative Variable

Statistics: mean, median, standard deviation, IQR, ...

Plots: histograms, boxplots

In [None]:
plt.hist(data.loc[:,"Area2"].dropna(), bins=20)
plt.show()

In [None]:
data.loc[:,"Area2"].idxmax()   # index label of the largest area, for use with data.loc below

In [None]:
data.loc[645]

#### *Exercise:*

Explore, and clean up, the "Built" variable of `data` (which represents the year the property was built).

### Multiple Variable Types

In [None]:
import statsmodels.formula.api as smf   # smf is the conventional alias for the formula interface

In [None]:
mdl = smf.ols(formula = "Cost2 ~ Type + Built + Lot2 + Bed + Bath + Area2", data = data.dropna()).fit()
mdl.summary()

In [None]:
plt.hist(mdl.resid, bins=25)
plt.show()

In [None]:
plt.scatter(mdl.fittedvalues, mdl.resid)
plt.hlines(0, np.min(mdl.fittedvalues), np.max(mdl.fittedvalues))
plt.show()