# Performance 3

In [1]:
# known import statements
import pandas as pd
import csv
from subprocess import check_output

# new import statements
import zipfile
from io import TextIOWrapper

### Let's take a look at the files inside the current working directory.

In [2]:
str(check_output(["ls", "-lh"]), encoding="utf-8").split("\n")

['total 21M',
 '-rw-rw-r-- 1 msyamkumar msyamkumar 19K Feb  8 15:21 lecture.ipynb',
 '-rw-rw-r-- 1 msyamkumar msyamkumar 16K Feb  8 15:22 template.ipynb',
 '-rw-rw-r-- 1 msyamkumar msyamkumar 21M Feb  8 14:53 wi.zip',
 '']

### Let's `unzip` "wi.zip".

In [3]:
check_output(["unzip", "wi.zip"])

b'Archive:  wi.zip\n  inflating: wi.csv                  \n'

### Let's take a look at the files inside the current working directory.

In [4]:
str(check_output(["ls", "-lh"]), encoding="utf-8").split("\n")

['total 197M',
 '-rw-rw-r-- 1 msyamkumar msyamkumar  19K Feb  8 15:21 lecture.ipynb',
 '-rw-rw-r-- 1 msyamkumar msyamkumar  16K Feb  8 15:22 template.ipynb',
 '-rw-rw-r-- 1 msyamkumar msyamkumar 177M Jan 14  2022 wi.csv',
 '-rw-rw-r-- 1 msyamkumar msyamkumar  21M Feb  8 14:53 wi.zip',
 '']

### Traditional way of reading data using pandas

In [5]:
df = pd.read_csv("wi.csv")



In [6]:
df.head(5) # Top 5 rows within the DataFrame

Unnamed: 0,activity_year,lei,derived_msa-md,state_code,county_code,census_tract,conforming_loan_limit,derived_loan_product_type,derived_dwelling_category,derived_ethnicity,...,denial_reason-2,denial_reason-3,denial_reason-4,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
0,2020,549300FX7K8PTEQUU487,31540,WI,55025.0,55025000000.0,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,3572,41.15,96600,64,812,910,45
1,2020,549300FX7K8PTEQUU487,99999,WI,55013.0,55013970000.0,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,2333,9.9,68000,87,1000,2717,34
2,2020,549300FX7K8PTEQUU487,99999,WI,55127.0,55127000000.0,C,VA:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,5943,13.26,68000,104,1394,1856,44
3,2020,549300FX7K8PTEQUU487,99999,WI,55127.0,55127000000.0,C,Conventional:Subordinate Lien,Single Family (1-4 Units):Site-Built,Ethnicity Not Available,...,,,,5650,7.63,68000,124,1712,2104,36
4,2020,549300FX7K8PTEQUU487,33460,WI,55109.0,55109120000.0,C,VA:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,7210,4.36,97300,96,2101,2566,22


### How can we see all the column names?

In [7]:
df.columns

Index(['activity_year', 'lei', 'derived_msa-md', 'state_code', 'county_code',
       'census_tract', 'conforming_loan_limit', 'derived_loan_product_type',
       'derived_dwelling_category', 'derived_ethnicity', 'derived_race',
       'derived_sex', 'action_taken', 'purchaser_type', 'preapproval',
       'loan_type', 'loan_purpose', 'lien_status', 'reverse_mortgage',
       'open-end_line_of_credit', 'business_or_commercial_purpose',
       'loan_amount', 'loan_to_value_ratio', 'interest_rate', 'rate_spread',
       'hoepa_status', 'total_loan_costs', 'total_points_and_fees',
       'origination_charges', 'discount_points', 'lender_credits', 'loan_term',
       'prepayment_penalty_term', 'intro_rate_period', 'negative_amortization',
       'interest_only_payment', 'balloon_payment',
       'other_nonamortizing_features', 'property_value', 'construction_method',
       'occupancy_type', 'manufactured_home_secured_property_type',
       'manufactured_home_land_property_interest', 'total_

### How to extract `interest_rate`?

In [8]:
df["interest_rate"] # observe that there are missing values

0          NaN
1          3.0
2          NaN
3         3.75
4          2.5
          ... 
468267    2.25
468268     2.5
468269    4.25
468270     NaN
468271     NaN
Name: interest_rate, Length: 468272, dtype: object

### How to count unique values in a column `Series`?

In [9]:
df["interest_rate"].value_counts()

Exempt    37959
3.0       22584
2.75      22480
3.25      21343
2.875     21201
          ...  
3.023         1
2.632         1
3.345         1
3.364         1
2.32          1
Name: interest_rate, Length: 2080, dtype: int64

### Let's eliminate the strings (Exempt) and missing values (NaN).
Let's try `pd.to_numeric(...)`. We need a way to specify that strings need to be converted into NaN values.

In [10]:
pd.to_numeric(df["interest_rate"]) 
# TODO: open the documentation and figure out what parameter will help us
# Recall that we can press shift + tab after a function name to open the documentation

ValueError: Unable to parse string "Exempt" at position 1505

In [11]:
pd.to_numeric(df["interest_rate"], errors="coerce")

0          NaN
1         3.00
2          NaN
3         3.75
4         2.50
          ... 
468267    2.25
468268    2.50
468269    4.25
468270     NaN
468271     NaN
Name: interest_rate, Length: 468272, dtype: float64

### Let's drop the NaN values and compute average interest rate.

In [12]:
pd.to_numeric(df["interest_rate"], errors="coerce").dropna()

1         3.00
3         3.75
4         2.50
5         2.75
7         3.75
          ... 
468265    4.50
468266    2.50
468267    2.25
468268    2.50
468269    4.25
Name: interest_rate, Length: 324658, dtype: float64

In [13]:
pd.to_numeric(df["interest_rate"], errors="coerce").dropna().mean()

3.266264315063852

### Clearing memory using re-assignment.
In Python, the memory used by an object can be reclaimed once all active references to it are gone. Re-assigning `df` is not enough in the current notebook, because we used `df` in other operations, so more than one active reference exists. In fact, we don't even have access to some of those references. In that case, the memory is only freed after you shut down the current notebook.

In [14]:
df = "some string"
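
A minimal sketch of the idea above, assuming CPython's reference counting (other Python implementations may free memory later). `Big` is a hypothetical stand-in for a large object such as a DataFrame, and `weakref.ref` lets us check whether the object is still alive without keeping it alive ourselves.

```python
import weakref

class Big:
    """Hypothetical stand-in for a large object such as a DataFrame."""
    def __init__(self):
        self.data = list(range(1000))

obj = Big()
alias = obj                  # a second active reference to the same object
probe = weakref.ref(obj)     # weak reference: does not keep the object alive

obj = "some string"          # re-assign one name; `alias` still refers to the Big
still_alive = probe() is not None

alias = None                 # drop the last strong reference
freed = probe() is None      # in CPython the object is reclaimed right away
```

Re-assigning `obj` alone does not free anything, because `alias` still points at the object. This is the situation we are in with `df` in this notebook.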

### How can we read the data without creating an uncompressed version called "wi.csv"?

- Why would we want to do something like that?
    1. lower memory usage (we can try to load information on one loan at a time, instead of all the loans): that will still work for average interest rate computation
    2. lower storage usage (you can directly work with compressed data)
    
**IMPORTANT**: do not run this cell unless you have first shut down the notebook to free its memory. Otherwise your kernel will crash (you will run out of memory).

In [2]:
# IMPORTANT: do not run this cell unless you have first shut down the notebook - your kernel will crash (you will run out of memory)
f = open("wi.csv")
# instead of passing a relative file path, we can pass a reference to a file object
df = pd.read_csv(f) 
f.close()



### Let's free up memory and delete "wi.csv".

In [3]:
df = "some string"

In [4]:
check_output(["rm", "wi.csv"])
str(check_output(["ls", "-lh"]), encoding="utf-8").split("\n")

['total 21M',
 '-rw-rw-r-- 1 msyamkumar msyamkumar 37K Feb  8 15:22 lecture.ipynb',
 '-rw-rw-r-- 1 msyamkumar msyamkumar 16K Feb  8 15:22 template.ipynb',
 '-rw-rw-r-- 1 msyamkumar msyamkumar 21M Feb  8 14:53 wi.zip',
 '']

### How can we read data directly from a zip file?
`zipfile.ZipFile(...)`

### Goals:
1. directly access the data without decompressing: `zipfile.ZipFile(...)` - saves storage space by directly opening a zip file
2. only look at one row at a time: `csv.DictReader(...)` - saves memory space by enabling us to read one row at a time (as `dict`)

In [5]:
# code for goal 1
zf = zipfile.ZipFile("wi.zip")
f = zf.open("wi.csv")

df = pd.read_csv(f) 

f.close()
zf.close()



In [6]:
# Free up the memory again
df = "some string"

In [7]:
# code for goal 2 & goal 1
zf = zipfile.ZipFile("wi.zip")
f = zf.open("wi.csv")

reader = csv.DictReader(f)

for row in reader:
    print(row)
    break

f.close()
zf.close()

Error: iterator should return strings, not bytes (the file should be opened in text mode)

### Let's learn more modes for the built-in `open` function
- `open(..., mode="r")`   => text (default)
- `open(..., mode="rb")`  => bytes
- `zf.open(...)`          => always bytes

With the `zipfile` module, there isn't a way for us to specify that we need text.
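
A small sketch of the bytes vs text difference, using a tiny archive built in memory. The archive name `demo.csv` and its contents are made up for illustration and are not part of `wi.zip`.

```python
import io
import zipfile

# Build a tiny zip archive in memory (hypothetical data, not wi.zip).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo.csv", "rate\n3.0\n")

with zipfile.ZipFile(buf) as zf:
    # zf.open(...) gives a binary file object, so readline() returns bytes.
    with zf.open("demo.csv") as f:
        raw_line = f.readline()

    # Wrapping the binary file object in TextIOWrapper decodes bytes into str.
    with io.TextIOWrapper(zf.open("demo.csv"), encoding="utf-8") as text:
        text_line = text.readline()

print(type(raw_line), type(text_line))
```

Closing the `TextIOWrapper` also closes the file object it wraps, which is why the second `with` block needs no separate `close` call.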

### `TextIOWrapper` inside the `io` module enables us to convert `bytes` into `str`

In [8]:
# code for goal 2 & goal 1
zf = zipfile.ZipFile("wi.zip")
f = zf.open("wi.csv")

reader = csv.DictReader(TextIOWrapper(f))

for row in reader:
    print(row)
    break

f.close()
zf.close()

{'activity_year': '2020', 'lei': '549300FX7K8PTEQUU487', 'derived_msa-md': '31540', 'state_code': 'WI', 'county_code': '55025', 'census_tract': '55025002402', 'conforming_loan_limit': 'C', 'derived_loan_product_type': 'Conventional:First Lien', 'derived_dwelling_category': 'Single Family (1-4 Units):Site-Built', 'derived_ethnicity': 'Not Hispanic or Latino', 'derived_race': 'White', 'derived_sex': 'Male', 'action_taken': '3', 'purchaser_type': '0', 'preapproval': '2', 'loan_type': '1', 'loan_purpose': '4', 'lien_status': '1', 'reverse_mortgage': '2', 'open-end_line_of_credit': '1', 'business_or_commercial_purpose': '2', 'loan_amount': '225000.0', 'loan_to_value_ratio': '78.671', 'interest_rate': 'NA', 'rate_spread': 'NA', 'hoepa_status': '3', 'total_loan_costs': 'NA', 'total_points_and_fees': 'NA', 'origination_charges': 'NA', 'discount_points': 'NA', 'lender_credits': 'NA', 'loan_term': '360', 'prepayment_penalty_term': 'NA', 'intro_rate_period': '1', 'negative_amortization': '2', 'in

### Let's go back to calculating average interest rate.
- Algorithm / Pseudocode steps:
    1. print "interest rate" and type of "interest rate"
    2. convert "interest rate" into `float` - how can we handle errors? `try` ... `except` ... (*IMPORTANT*: always have your `except` block catch specific exceptions)
    3. calculate running total, count for each row of data
    4. calculate average

In [9]:
zf = zipfile.ZipFile("wi.zip")
f = zf.open("wi.csv")

total = 0
count = 0

reader = csv.DictReader(TextIOWrapper(f))

for row in reader:
    try:
        total += float(row["interest_rate"])
        count += 1
    except ValueError:
        pass # do nothing

f.close()
zf.close()

total / count

3.266264315063054
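
Why should the `except` block catch a specific exception? Here is a hedged sketch with a few made-up rows (`average_rate` is a hypothetical helper, not part of the lecture code). Catching only `ValueError` skips values like "NA", but a misspelled column name still surfaces as a `KeyError` instead of being silently ignored.

```python
# Hypothetical rows, shaped like the dicts that csv.DictReader produces.
rows = [
    {"interest_rate": "3.0"},
    {"interest_rate": "NA"},
    {"interest_rate": "2.5"},
]

def average_rate(rows, column):
    total = 0.0
    count = 0
    for row in rows:
        try:
            total += float(row[column])
            count += 1
        except ValueError:
            pass  # skip values that cannot be converted, such as "NA"
    return total / count

avg = average_rate(rows, "interest_rate")

# A typo in the column name is a bug, not bad data: it should not be hidden.
try:
    average_rate(rows, "interest_rat")
    typo_detected = False
except KeyError:
    typo_detected = True
```

With a bare `except:` the typo would be swallowed for every row, `count` would stay at 0, and the real cause of the problem would be much harder to find.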

### Let's generalize the code to read "interest rate" into a function.

- This does make things worse because we are going back to reading all the data before doing the computation.
- But this sets us up to learn about generators.

In [10]:
def get_rates_v1():
    rates = []
    
    zf = zipfile.ZipFile("wi.zip")
    f = zf.open("wi.csv")

    reader = csv.DictReader(TextIOWrapper(f))
    
    for row in reader:
        try:
            rates.append(float(row["interest_rate"]))
        except ValueError:
            pass # do nothing

    f.close()
    zf.close()
    
    return rates

rates = get_rates_v1()
sum(rates) / len(rates)

3.266264315063054
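
To get a rough feel for the cost of holding every value in a list, here is a sketch that compares a list with a generator producing the same values. `values_list` and `values_gen` are hypothetical helpers. Note that `sys.getsizeof` only measures the container itself (not the `float` objects inside the list), so the real gap is even larger.

```python
import sys

def values_list(n):
    # builds and returns all n values at once
    return [float(i) for i in range(n)]

def values_gen(n):
    # produces one value at a time, on demand
    for i in range(n):
        yield float(i)

as_list = values_list(100_000)
as_gen = values_gen(100_000)

list_size = sys.getsizeof(as_list)  # grows with the number of items
gen_size = sys.getsizeof(as_gen)    # small, regardless of n

list_total = sum(as_list)
gen_total = sum(as_gen)             # same result, without storing the values

print(list_size, gen_size)
```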

### Using a generator
- `yield` each value
- use `next` to get the next value => internally, a `for` loop invokes `next` on each iteration
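
A minimal sketch of what a `for` loop does with a generator, written out by hand. `count_up` is a hypothetical generator used only for illustration. When the generator has no more values, `next` raises `StopIteration`, and that is the signal for the loop to end.

```python
def count_up(limit):
    n = 1
    while n <= limit:
        yield n   # pause here and hand one value to the caller
        n += 1

gen = count_up(3)
it = iter(gen)    # for a generator, iter() returns the generator itself

collected = []
while True:
    try:
        value = next(it)        # resume the generator until its next yield
    except StopIteration:       # raised once the generator function returns
        break
    collected.append(value)

print(collected)
```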

In [11]:
def get_rates_v2():
    print("Starting generator")
    
    zf = zipfile.ZipFile("wi.zip")
    f = zf.open("wi.csv")
    
    reader = csv.DictReader(TextIOWrapper(f))
    
    for row in reader:
        try:
            yield float(row["interest_rate"])
        except ValueError:
            pass # do nothing

    f.close()
    zf.close()

rates = get_rates_v2()
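
One caveat with the generator above: the `close` calls only run if the generator is consumed all the way to the end. A possible variant, sketched here under assumptions, uses `with` statements so the files are also closed when the generator is exhausted or closed early. The function name `get_rates`, its parameters, and the in-memory `demo.csv` archive are hypothetical.

```python
import csv
import io
import zipfile

def get_rates(zip_source, member, column="interest_rate"):
    """Yield the numeric values of `column`, one row at a time."""
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(member) as f:
            reader = csv.DictReader(io.TextIOWrapper(f, encoding="utf-8", newline=""))
            for row in reader:
                try:
                    yield float(row[column])
                except ValueError:
                    continue  # skip non-numeric entries such as "NA" or "Exempt"

# Tiny stand-in archive built in memory (hypothetical data, not the real wi.zip).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo.csv", "interest_rate\nNA\n3.0\nExempt\n2.5\n")

demo_rates = list(get_rates(buf, "demo.csv"))
print(demo_rates)
```

`zipfile.ZipFile` accepts either a path or a file-like object, so the same function could be called as `get_rates("wi.zip", "wi.csv")`.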

In [12]:
next(rates) # gives us the next value

Starting generator


3.0

In [13]:
next(rates) # gives us the next value

3.75

In [14]:
next(rates) # gives us the next value

2.5

### Let's use a `for` loop to keep getting all the rates.

- `len` function doesn't work with generators
- indexing doesn't work with generators

In [15]:
len(rates)

TypeError: object of type 'generator' has no len()

In [16]:
rates[4]

TypeError: 'generator' object is not subscriptable

In [17]:
rates = get_rates_v2()

total = 0
count = 0

for rate in rates: # keeps calling next(rates) to get values from yield
    total += rate
    count += 1
    
total / count

Starting generator


3.266264315063054

This approach doesn't work for median calculation. Why? Remember we have to sort, so we need all values in memory.

In [18]:
rates = list(get_rates_v2())
rates.sort()

Starting generator
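
Once all the values are in a sorted list, the median is the middle value (or the average of the two middle values). A minimal sketch, where `median` is a hypothetical helper and the input must be non-empty:

```python
def median(values):
    ordered = sorted(values)   # sorted() needs every value in memory at once
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of the two middle values

odd_case = median([3.0, 2.5, 3.75])
even_case = median([1.0, 2.0, 3.0, 4.0])

# A generator also works as input, but sorted() still collects all of its values.
from_generator = median(x for x in [5.0, 1.0, 3.0])
```

The standard library also provides `statistics.median`, which could be used instead of writing the helper by hand.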


# OOP 1: Classes

Classes enable us to create custom types. Attributes within the class will be used to store information about each object instance.

### Let's create a `Dog` class.

In [19]:
class Dog:
    pass # eventually we will learn how to write code inside a class

### Let's create `Dog` object instances and add attributes.

In [20]:
dog1 = Dog()
dog1.name = "Jimmy"
dog1.age = 2

In [21]:
dog2 = Dog()
dog2.name = "Buster"

### Let's define a `speak` function that will make the `Dog` bark.
- Algorithm / pseudocode steps:
    1. puppies bark thrice (age < 2)
    2. dogs bark once

In [22]:
def speak(dog):
    if dog.age < 2:
        print(f"{dog.name}: bark bark bark!")
    else:
        print(f"{dog.name}: bark!")

### Let's invoke `speak` for dog1 and dog2.

In [23]:
speak(dog1)

Jimmy: bark!


In [24]:
speak(dog2)

AttributeError: 'Dog' object has no attribute 'age'

### How can we standardize the attribute initialization to avoid bugs?

- Eventually we will learn how to define methods inside the class, including the `__init__` method.
- For now, let's define an `init` function.

In [25]:
def init(dog, name, how_old):
    dog.name = name
    dog.age = how_old

In [26]:
dog2 = Dog()
init(dog2, "Spark", 10)
speak(dog2)

Spark: bark!
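
As a preview only, here is a rough sketch of how the `init` and `speak` functions might look once they are moved inside the class as methods. This is an assumption about where the example is heading, not the lecture's final code. One deliberate difference: `speak` returns the message instead of printing it, and the puppy's age is made up.

```python
class Dog:
    def __init__(self, name, age):
        # runs automatically when we write Dog(...), so every Dog gets both attributes
        self.name = name
        self.age = age

    def speak(self):
        if self.age < 2:
            return f"{self.name}: bark bark bark!"
        return f"{self.name}: bark!"

puppy = Dog("Jimmy", 1)    # hypothetical age, chosen to show the puppy branch
adult = Dog("Spark", 10)
messages = [puppy.speak(), adult.speak()]
print(messages)
```

Because `__init__` requires both arguments, forgetting the age (as we did with "Buster") fails right away with a `TypeError` when the object is created, instead of failing later inside `speak`.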
