# Building the Dataset
This notebook contains the bare minimum needed to create the dataset used by philly-landlord-spotter.com.

### Prerequisites
* Use pip to install `tqdm`
* `opa_properties_public.csv`, saved in `./data_sets/`
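
Before running anything, it can help to confirm both prerequisites. The following is a minimal sketch, not part of the notebook itself: `REQUIRED_COLUMNS`, `missing_columns`, and `tqdm_installed` are hypothetical names, and the column list is simply the set of fields the Step 1 functions read. It uses a tiny in-memory CSV in place of the real file.

```python
import csv
import importlib.util
import io

# Columns the Step 1 functions read from each CSV row
REQUIRED_COLUMNS = {
    "owner_1", "owner_2", "location", "lat", "lng",
    "category_code_description", "sale_date", "sale_price",
    "year_built", "year_built_estimate", "recording_date", "zip_code",
}

def missing_columns(csv_file):
    """Return the required columns absent from an open CSV file's header."""
    header = csv.DictReader(csv_file).fieldnames or []
    return REQUIRED_COLUMNS - set(header)

def tqdm_installed():
    """True when `tqdm` can be imported in this environment."""
    return importlib.util.find_spec("tqdm") is not None

# Tiny in-memory stand-in for opa_properties_public.csv (hypothetical data)
sample = io.StringIO("owner_1,location,lat,lng\nJANE DOE,1 MAIN ST,39.9,-75.1\n")
print("tqdm installed:", tqdm_installed())
print("missing columns:", sorted(missing_columns(sample)))
```

With the real file, the same check would take the handle returned by `open('./data_sets/opa_properties_public.csv', newline='')`.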

## Step 1
Run the following code block using `shift+enter`.

This block imports the required modules and defines the functions used in the later steps.

In [11]:
import csv
from tqdm import tqdm
import json

def json_creator(source, output):
    """Group every property in the CSV under its `owner_1` and write the result as JSON."""
    data = {}
    with open(source, mode='r') as csv_file:
        # DictReader consumes the header row itself, so every row here is data
        csv_reader = csv.DictReader(csv_file)
        for row in tqdm(csv_reader, total=581456):
            owner = row['owner_1'].strip()
            address = row['location'].strip()

            # First time this owner appears: start an empty record
            if owner not in data:
                data[owner] = {'total_properties': 0, 'properties': {}}

            # Update and add values
            data[owner]['total_properties'] += 1
            data[owner]['properties'][address] = {
                'location': [row['lat'].strip(), row['lng'].strip()],
                'category': row['category_code_description'].strip(),
                'owner_2': row['owner_2'].strip(),
                'sale_date': row['sale_date'].strip(),
                'sale_price': row['sale_price'].strip(),
                'year_built': row['year_built'].strip(),
                'year_built_estimate': row['year_built_estimate'].strip(),
                'recording_date': row['recording_date'],
                'zip_code': row['zip_code']
            }

    with open(output, 'w') as file:
        file.write(json.dumps(data))
        
def landlord_json_creator(source, output):
    """Count properties per `owner_1` and write [owner, count] pairs, largest first."""
    landlords = {}

    with open(source, mode="r") as csv_file:
        csv_reader = csv.DictReader(csv_file)
        for row in tqdm(csv_reader, total=581456):
            # Strip before the lookup so padded names do not reset the count
            owner = row["owner_1"].strip()
            landlords[owner] = landlords.get(owner, 0) + 1
    sorted_landlords = sorted(landlords.items(), key=lambda x: x[1], reverse=True)
    with open(output, 'w') as file:
        file.write(json.dumps(sorted_landlords))
        
def property_json_creator(source, output):
    """Write a JSON list of every property address (`location`) in the CSV."""
    properties = []

    with open(source, mode="r") as csv_file:
        csv_reader = csv.DictReader(csv_file)
        for row in tqdm(csv_reader, total=581456):
            properties.append(row['location'].strip())
    with open(output, 'w') as file:
        file.write(json.dumps(properties))

def remove_one_off_landlords(source, output, significant_property_count):
    with open(source, mode='r') as file:
        data = file.read()
    landlords_and_properties = json.loads(data)
    landlord_count = len(landlords_and_properties.keys())
    significant_landlords = {}
    for landlord in tqdm(landlords_and_properties, total=landlord_count):
        if landlords_and_properties[landlord]['total_properties'] >= significant_property_count:
            significant_landlords[landlord] = {
                'total_properties': landlords_and_properties[landlord]['total_properties'],
                'properties': landlords_and_properties[landlord]['properties']
            }
    print('Significance Threshold: ', significant_property_count, 'Properties Owned')
    print('Significant Landlords: ', len(significant_landlords))
    with open(output, 'w') as file:
        file.write(json.dumps(significant_landlords))
            
def significant_landlords_generator(source, output, significant_property_count):
    with open(source, mode='r') as file:
        data = file.read()
    landlords_and_properties = json.loads(data)
    landlord_count = len(landlords_and_properties)
    significant_landlords = []
    for landlord in tqdm(landlords_and_properties, total=landlord_count):
        if landlord[1] >= significant_property_count:
            significant_landlords.append(landlord)
    print('Significance Threshold: ', significant_property_count, 'Properties Owned')
    print('Significant Landlords: ', len(significant_landlords))
    with open(output, 'w') as file:
        file.write(json.dumps(significant_landlords))


## Step 2
Run the following code block using `shift+enter`.

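This block runs the initial JSON creator for the dataset. It writes `landlords_and_properties.json`, which maps each `owner_1` to its property count and properties.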

In [12]:
json_creator('./data_sets/opa_properties_public.csv', './data_sets/landlords_and_properties.json')

100%|██████████| 581456/581456 [00:14<00:00, 39770.09it/s]
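
To make the shape of `landlords_and_properties.json` concrete, here is a small sketch that applies the same grouping idea to a three-row in-memory CSV. `group_by_owner` and the sample rows are hypothetical, and only `zip_code` is kept per property to keep the example short; the real function stores more fields.

```python
import csv
import io
import json

# Hypothetical sample rows; the first owner name has trailing whitespace on purpose
sample_csv = io.StringIO(
    "owner_1,location,zip_code\n"
    "JANE DOE ,1 MAIN ST,19104\n"
    "JANE DOE,2 MAIN ST,19104\n"
    "ACME LLC,9 OAK AVE,19147\n"
)

def group_by_owner(csv_file):
    """Group rows under the stripped owner name, in the same shape json_creator writes."""
    data = {}
    for row in csv.DictReader(csv_file):
        owner = row["owner_1"].strip()
        entry = data.setdefault(owner, {"total_properties": 0, "properties": {}})
        entry["total_properties"] += 1
        entry["properties"][row["location"].strip()] = {"zip_code": row["zip_code"]}
    return data

grouped = group_by_owner(sample_csv)
print(json.dumps(grouped, indent=2))
```

Each top-level key is an owner, and each owner holds a `total_properties` count plus a `properties` dict keyed by address.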


## Step 3
Warning: Step 1 must be run successfully before this.

Run the following code block using `shift+enter`.

This block creates a JSON list of `[owner_1, property count]` pairs, sorted from most to fewest properties.

In [13]:
landlord_json_creator('./data_sets/opa_properties_public.csv', './data_sets/sorted_landlords.json')

100%|██████████| 581456/581456 [00:08<00:00, 68582.72it/s]
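
As an aside, the per-owner count can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not what the notebook runs: `count_owners` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

def count_owners(csv_file):
    """Return (owner, count) pairs ordered from most to fewest properties."""
    counts = Counter(row["owner_1"].strip() for row in csv.DictReader(csv_file))
    return counts.most_common()

# Hypothetical sample rows
sample_csv = io.StringIO(
    "owner_1,location\n"
    "ACME LLC,9 OAK AVE\n"
    "JANE DOE,1 MAIN ST\n"
    "JANE DOE ,2 MAIN ST\n"
)
pairs = count_owners(sample_csv)
print(pairs)
```

Stripping the name before counting matters here: without it, `JANE DOE` and `JANE DOE ` would be counted as two different owners.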


## Step 4
Warning: Step 1 must be run successfully before this.

Run the following code block using `shift+enter`.

This block creates a JSON list of every property address (`location`) in the dataset.

In [14]:
property_json_creator('./data_sets/opa_properties_public.csv', './data_sets/properties.json')

100%|██████████| 581456/581456 [00:08<00:00, 69940.63it/s]
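
The progress bars use a hard-coded `total=581456`, which matches one particular copy of the CSV. If the file is refreshed, the total could be derived instead. Below is a minimal sketch under that assumption; `count_data_rows` is a hypothetical helper, and it costs one extra pass over the file.

```python
import csv
import io

def count_data_rows(csv_file):
    """Count CSV records, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row if there is one
    return sum(1 for _ in reader)

# Hypothetical sample; the quoted field spans two lines but is still one record
sample = io.StringIO('owner_1,location\nJANE DOE,"1 MAIN ST\nUNIT 2"\nACME LLC,9 OAK AVE\n')
print(count_data_rows(sample))
```

Counting through `csv.reader` rather than counting raw lines keeps the total correct when a quoted field contains a line break. The result could then be passed to `tqdm` as `total=`.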


## Step 5
Warning: Step 2 must be run successfully before this.

Run the following code block using `shift+enter`.

This block creates a JSON object of significant `owner_1`s in the dataset and their properties.

Currently I have the significance threshold on the website at `200`, so it's probably best to use that. I don't quite remember where this file is used right now, if anywhere, but it's generated just in case.

In [15]:
remove_one_off_landlords('./data_sets/landlords_and_properties.json', './data_sets/significant_landlords.json', 200)

100%|██████████| 429982/429982 [00:00<00:00, 1555032.37it/s]


Significance Threshold:  200 Properties Owned
Significant Landlords:  16
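
The threshold test is inclusive: an owner with exactly the threshold number of properties is kept. A compact sketch of the same filter on in-memory data follows; `keep_significant` and the sample owners are hypothetical.

```python
def keep_significant(landlords, threshold):
    """Keep only owners whose total_properties is at least `threshold`."""
    return {
        name: entry
        for name, entry in landlords.items()
        if entry["total_properties"] >= threshold
    }

# Hypothetical data in the same shape as landlords_and_properties.json
landlords = {
    "BIG OWNER": {"total_properties": 250, "properties": {}},
    "EXACT OWNER": {"total_properties": 200, "properties": {}},
    "SMALL OWNER": {"total_properties": 3, "properties": {}},
}
significant = keep_significant(landlords, 200)
print(sorted(significant))
```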


## Step 6
Warning: Step 3 must be run successfully before this.

Run the following code block using `shift+enter`.

This block filters the sorted list from Step 3 down to the significant `owner_1`s and their property counts. The result keeps the sorted order.

Currently I have the significance threshold on the website at `50`, so it's probably best to use that. This one is definitely used on the site.

In [16]:
significant_landlords_generator('./data_sets/sorted_landlords.json', './data_sets/significant_sorted_landlords.json', 50)

100%|██████████| 429983/429983 [00:00<00:00, 3228876.74it/s]

Significance Threshold:  50 Properties Owned
Significant Landlords:  23
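
Because the list from Step 3 is already sorted from most to fewest properties, the significant owners always form a prefix of it. The sketch below uses that property to stop at the first owner under the threshold instead of scanning every entry. It is an optional alternative, and `significant_prefix` and the sample pairs are hypothetical.

```python
from itertools import takewhile

def significant_prefix(sorted_pairs, threshold):
    """Take [owner, count] pairs from the front until the count drops below `threshold`.

    Assumes `sorted_pairs` is sorted by count, largest first.
    """
    return list(takewhile(lambda pair: pair[1] >= threshold, sorted_pairs))

# Hypothetical pairs in the same shape as sorted_landlords.json
sorted_pairs = [["BIG OWNER", 300], ["EXACT OWNER", 50], ["NEAR MISS", 49], ["SMALL OWNER", 1]]
print(significant_prefix(sorted_pairs, 50))
```

On an unsorted list this shortcut would give wrong results, so the full scan in `significant_landlords_generator` is the safer choice when the input order is not guaranteed.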





# Scrubbing PHA From Public Facing Maps

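To protect the privacy of some housing movements, PHA (the Philadelphia Housing Authority, listed in the data as `PHILADELPHIA HOUSING AUTH`) is not shown so readily on the public-facing maps.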

This isn't needed if you don't plan on working with these files.

## Step 1
Run the following code block using `shift+enter`.

This block imports the required modules and defines the functions used in the next step.

In [17]:
import json
from tqdm import tqdm

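def remove_entry(source, entry_to_remove):
    with open(source) as file:
        data = json.load(file)
    # Raises KeyError if the entry is not present (e.g. on a second run)
    data.pop(entry_to_remove)
    with open(source, 'w') as file:
        json.dump(data, file)

def remove_significant_landlord(source, entry_to_remove):
    with open(source) as file:
        data = json.load(file)
    # Build a new list; removing from `data` while looping over it skips entries
    kept = [sig_landlord for sig_landlord in tqdm(data)
            if sig_landlord[0] != entry_to_remove]
    with open(source, 'w') as file:
        json.dump(kept, file)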

## Step 2
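Run the following code block using `shift+enter`. This one takes a moment and won't print anything until it's done.

This block removes PHA from `landlords_and_properties.json` and `significant_sorted_landlords.json`.

If this has already run successfully once, it will fail the second time, because `remove_entry` raises a `KeyError` when the entry is no longer in the file.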

In [18]:
remove_entry('./data_sets/landlords_and_properties.json', "PHILADELPHIA HOUSING AUTH")
remove_significant_landlord('./data_sets/significant_sorted_landlords.json', "PHILADELPHIA HOUSING AUTH")
print('done')

 96%|█████████▌| 22/23 [00:00<00:00, 96521.64it/s]

done
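
If re-running this step without an error would be more convenient, the removal can be made safe to repeat. The following is a minimal sketch on in-memory data rather than the JSON files. `remove_owner` and `remove_owner_pairs` are hypothetical helpers, not functions from this notebook.

```python
def remove_owner(landlords, owner):
    """Drop `owner` from the dict in place; return True if it was present."""
    return landlords.pop(owner, None) is not None

def remove_owner_pairs(pairs, owner):
    """Return a new list of [owner, count] pairs without `owner`."""
    return [pair for pair in pairs if pair[0] != owner]

# Hypothetical data in the shapes used by the two JSON files
landlords = {"PHILADELPHIA HOUSING AUTH": {"total_properties": 2}, "ACME LLC": {"total_properties": 1}}
pairs = [["PHILADELPHIA HOUSING AUTH", 2], ["ACME LLC", 1]]

first_run = remove_owner(landlords, "PHILADELPHIA HOUSING AUTH")
second_run = remove_owner(landlords, "PHILADELPHIA HOUSING AUTH")  # no KeyError this time
pairs = remove_owner_pairs(pairs, "PHILADELPHIA HOUSING AUTH")
print(first_run, second_run, pairs)
```

Passing a default to `dict.pop` is what avoids the `KeyError`; the return value still reports whether anything was actually removed.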





# Done
Assuming all of this runs correctly, it should be everything needed to get the site going. If there are any problems with anything, let me know. Removing the PHA stuff isn't entirely needed if you don't plan on working with it.