# Introduction

#### top-level question:
Do we see evidence of discrimination against housing vouchers?

In Texas, discrimination against vouchers is enshrined in law, so the question shifts from "is there discrimination?" to "how does discrimination against vouchers play out in practice, and what might the effects be on tenants?" While attempting to investigate this question, I was immediately derailed by the large number of spammy apartment locator postings among the actual apartment listings. These postings often directly reference Section 8 vouchers and could bias an analysis of vouchers, so it is necessary to identify them first.

### immediate question: 
can I identify apartment locator postings among the actual apartment listings?

#### outline:

1. obtain the data by scraping craigslist
2. explore its structure
3. convert into DataFrame for analysis
4. featurize text data
5. explore possible clusters of postings

### imports & helper fxns

In [1]:
from requests import get
from bs4 import BeautifulSoup
import json
import time
import os
import random

import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
# URL = "https://providence.craigslist.org/"
URL = "https://dallas.craigslist.org/"
QUERY = "search/apa"

DTYPE = 'craigslist-v1'

def query_url():
    return URL + QUERY

def filter_line(line):
    # keep non-empty lines, dropping the boilerplate QR-code caption
    return bool(line) and line != 'QR Code Link to This Post'

def map_line(line):
    return line.strip('\n').strip(' ')

def get_description(href):
    # pause briefly between requests
    time.sleep(random.uniform(.2, .8))
    response_details = get(href)
    detail_soup = BeautifulSoup(response_details.text, 'html.parser')
    body = detail_soup.find(id="postingbody")
    if body is None:
        # no posting body found (e.g. the post was removed after the search)
        return []

    return list(filter(filter_line, map(map_line, body.find_all(string=True))))

def extract_repost(duplicate_rows):
    if duplicate_rows:
        return [row.get('data-repost-of') for row in duplicate_rows.find_all(class_="result-row")]
    else:
        return []

def listing(post):
    title = post.find(class_ = "result-title hdrlnk")
    print("Getting: {}".format(title.text))
    return {
        'price': post.find(class_ = "result-price").text, 
        'title': title.text,
        'href': title.get('href'),
        'description': get_description(title.get('href')),
        'duplicates': extract_repost(post.find(class_="duplicate-rows")),
    }

def extract_listings(postal_code, area):
    time.sleep(random.uniform(.3, .8))
    response = get(
        query_url(), 
        params={
            'bundleDuplicates':1, 
            'availabilityMode':0, 
            'sale_date':'all dates', 
            'search_distance': 0, 
            'postal': postal_code
        }
    )
    html_soup = BeautifulSoup(response.text, 'html.parser')
    posts = html_soup.find_all('li', class_= ['result-row', 'result-row duplicate-row'])
    dup_posts = html_soup.find_all('li', class_= 'result-row duplicate-row')

    print(
        "Got type: {}; listings: {}; dup listings: {} for {}".format(
            type(posts), len(posts), len(dup_posts), postal_code)
    )

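    # Is there pagination? Seems to match UI well but not exactly?
    # `posts` already contains the duplicate rows (class_='result-row' matches
    # any <li> carrying that class), so iterate over `posts` alone to avoid
    # fetching each duplicate twice.
    extracted = [listing(post) for post in posts]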
       
    raw_path = os.path.join(DTYPE, 'raw_html')
    os.makedirs(raw_path, exist_ok=True)
    with open(os.path.join(raw_path, '{}-{}.html'.format(postal_code, area)), 'w', encoding='utf-8') as f:
        f.write(response.text)

    extracted_path = os.path.join(DTYPE, 'extracted')
    os.makedirs(extracted_path, exist_ok=True)
    with open(os.path.join(extracted_path, '{}-{}.json'.format(postal_code, area)), 'w', encoding='utf-8') as f:
        json.dump(extracted, f, indent=2)
        
# extract_listings('02860', 'foo_bar, baz')
        
# for postal_code, area in dallas_area_postal_codes.items():
#      extract_listings(postal_code, area)

### scrape/load

In [None]:
# I hand-labeled this set of zip codes at one point
dallas_area_postal_codes = {
    "75023": "Plano",
    "75024": "Plano",
    "75025": "Plano",
    "75040": "Garland",
    "75041": "Garland",
    "75042": "Garland",
    "75043": "Garland",
    "75044": "Garland",
    "75048": "Garland",
    "75074": "Plano",
    "75075": "Plano",
    "75080": "Richardson",
    "75081": "Richardson",
    "75082": "Richardson",
    "75093": "Plano",
    "75094": "Plano",
    "75240": "Far North Dallas",
    "75248": "Far North Dallas",
    "75252": "Far North Dallas",
    "75254": "Far North Dallas",
    "75287": "Far North Dallas",
}

# there are many more postal codes in the city of Dallas, so 
# copy entire list of zip codes from google search for "dallas zip codes":
dallas_zip_codes = [
    "75001",
    "75019",
    "75032",
    "75039",
    "75043",
    "75051",
    "75052",
    "75061",
    "75063",
    "75080",
    "75087",
    "75088",
    "75089",
    "75098",
    "75104",
    "75116",
    "75126",
    "75166",
    "75182",
    "75201",
    "75202",
    "75203",
    "75204",
    "75205",
    "75206",
    "75207",
    "75208",
    "75209",
    "75210",
    "75211",
    "75212",
    "75214",
    "75215",
    "75216",
    "75217",
    "75218",
    "75219",
    "75220",
    "75221",
    "75222",
    "75223",
    "75224",
    "75225",
    "75226",
    "75227",
    "75228",
    "75229",
    "75230",
    "75231",
    "75232",
    "75233",
    "75234",
    "75235",
    "75236",
    "75237",
    "75238",
    "75240",
    "75241",
    "75242",
    "75243",
    "75244",
    "75246",
    "75247",
    "75248",
    "75249",
    "75250",
    "75251",
    "75252",
    "75253",
    "75254",
    "75260",
    "75262",
    "75263",
    "75264",
    "75265",
    "75266",
    "75267",
    "75270",
    "75277",
    "75285",
    "75287",
    "75301",
    "75303",
    "75312",
    "75313",
    "75315",
    "75320",
    "75336",
    "75339",
    "75342",
    "75354",
    "75355",
    "75356",
    "75357",
    "75359",
    "75360",
    "75367",
    "75370",
    "75371",
    "75372",
    "75373",
    "75374",
    "75376",
    "75378",
    "75379",
    "75380",
    "75382",
    "75389",
    "75390",
    "75392",
    "75393",
    "75394",
    "75395",
    "75397",
    "75398",
    "76217",
]

# plug in known area names from earlier work, fill the remainder as "Dallas"
dallas_postal_dict = {
    zipcode: dallas_area_postal_codes.get(zipcode, "Dallas")
    for zipcode in dallas_zip_codes
}

# combine two dicts, preferring dallas_area_postal_codes
tx_zip_codes = {**dallas_postal_dict, **dallas_area_postal_codes}

# run the code below to do the actual scrape:

for postal_code, area in tx_zip_codes.items():
    extract_listings(postal_code, area)

Got type: <class 'bs4.element.ResultSet'>; listings: 156; dup listings: 10 for 75001
Getting: LOVE the Place you call Home at Bent Tree Trails
Getting: DON'T Miss Out on a Great Place to Call HOME!!!
Getting: Upscale Townhome in North Dallas/Addison**Attached Garage!**
Getting: Dishwasher, Limited Access Gates, Clothes Care Center
Getting: Home Sweet Home at Bent Tree Trails
Getting: Clubhouse, Covered Parking, Ice Maker
Getting: We are conveniently located with walking distance to retail&Restaurant
Getting: HOME SWEET HOME...
Getting: Addison is the spot
Getting: Free Premium Cable TV
Getting: HOME SWEET HOME...
Getting: Move in today.
Getting: 💕FREE APARTMENT LOCATOR SERVICE THAT HELPS WITH 2ND CHANCE LEASING 💕
Getting: Get a roof over your head TODAY - Immediate availability
Getting: WELCOME TO YOUR NEW HOME
Getting: Manager's Special $150 deposit
Getting: Come live a while
Getting: Come live a while
Getting: Walk-In Closet, Upgraded Appliances, Controlled Access
Getting: Tenemos cu

In [None]:
def flatten(nested_list):
    return [item for sublist in nested_list for item in sublist]

def load_listing(postal_code, area):
    extracted_path = os.path.join(DTYPE, 'extracted')
    with open(os.path.join(extracted_path, '{}-{}.json'.format(postal_code, area)), 'r', encoding='utf-8') as f:
        return json.load(f)

def load_all_listings(postal_codes):
    zip_codes = {}
    for postal_code, area in postal_codes.items():
        name = "{}_{}".format(postal_code, area)
        try:
            zip_codes[name] = load_listing(postal_code, area)
        except FileNotFoundError:
            pass
    return zip_codes

zip_codes = load_all_listings(tx_zip_codes)

### explore structure of data

In [None]:
zip_codes['75023_Plano'][:3]

In [None]:
[listing['price'] for listing in zip_codes['75023_Plano']][:5]

In [None]:
def print_matches_to(term, limit=3):
    # print up to `limit` listings whose description mentions `term`,
    # skipping listings priced at $0 or $1
    matches = [
        [
            "\n".join([listing['title'], listing['price'], listing['href']] + listing['description']) for listing in zipcode
            if term in " ".join(listing['description']).lower() and int(listing['price'][1:]) > 1
        ] for zipcode in zip_codes.values()
    ]
    for i, item in enumerate(flatten(matches)):
        if limit and i == limit:
            break
        print(item)
        print()

print_matches_to('voucher')
print_matches_to('section 8')


In [None]:
def find_matches_to(term):
    l = [
        [
            listing['href'] for listing in zipcode 
            if term in " ".join(listing['description']).lower() and int(listing['price'][1:]) > 1
        ] for zipcode in zip_codes.values()
    ]
    return flatten(l)

print("voucher:", find_matches_to("voucher"))
# print("section 8:", find_matches_to("section 8"))


In [None]:
zip_codes.keys()
len(zip_codes['75025_Plano'])
sum([len(zip_codes[where]) for where in zip_codes])

### convert to pandas DataFrame for analysis

In [None]:
def extract_nested(field):
    return flatten(
        [
            [
                listing[field] for listing in zip_codes[where]
            ] for where in zip_codes
        ]
    )

raw_df = pd.DataFrame({
    "href":extract_nested('href'),
    "title":extract_nested('title'),
    "price":extract_nested('price'),
    "description":extract_nested('description'),
    "duplicates":extract_nested('duplicates')
})

raw_df['city'] = flatten([
    np.tile(town, count) for town, count in [
        (key.split('_')[1], len(zip_codes[key])) for key in zip_codes.keys()
    ]
])

raw_df.head(3)

## featurize

eventually the featurization steps should be unit tested and combined into a pipeline, but for now it is unclear which steps will still be needed down the line
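
As a starting point for the unit tests mentioned above, a minimal sketch could pin down the URL-parsing behaviour with plain assertions. `parse_listing_url` is a hypothetical standalone helper written for this example (it mirrors the splitting logic used in this notebook), and the second URL is made up to illustrate the multi-word city case.

```python
def parse_listing_url(url):
    """Split a craigslist listing URL into (city, title, id).

    Hypothetical helper for illustration; assumes URLs shaped like
    .../d/<city>-<title-words>/<id>.html
    """
    slug = url.split('/')[-2]
    city, _, title = slug.partition('-')
    listing_id = url.split('/')[-1].split('.')[0]
    return city, title, listing_id


def test_parse_listing_url():
    url = 'https://dallas.craigslist.org/dal/apa/d/richardson-luxury-second-chance/7112620971.html'
    assert parse_listing_url(url) == ('richardson', 'luxury-second-chance', '7112620971')


def test_multiword_city_spills_into_title():
    # made-up URL: only the first word of a multi-word city is captured
    url = 'https://dallas.craigslist.org/dal/apa/d/grand-prairie-two-bed/1234567890.html'
    city, title, _ = parse_listing_url(url)
    assert city == 'grand'
    assert title.startswith('prairie')


test_parse_listing_url()
test_multiword_city_spills_into_title()
```

Tests like these document the known limitation (multi-word cities) explicitly, so a later fix shows up as a deliberate change to the test rather than a silent change in the data.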

In [None]:
def parse_href(url):
    full_title = url.split('/')[-2]
    city = full_title.split('-')[0] # will capture only 1st word, e.g. "Grand Prairie" --> "Grand"
    title = "-".join(full_title.split('-')[1:]) # will capture any spillover from city, e.g. "Prairie"
    c_id = url.split('/')[-1].split('.')[0]
    return city, title, c_id

parse_href('https://dallas.craigslist.org/dal/apa/d/richardson-luxury-second-chance/7112620971.html')

In [None]:
def parse_title(url):
    return url.split('/')[-2]

# def parse_city(url):
#     # will capture only 1st word, e.g. "Grand Prairie" --> "Grand"
#     return _parse_title(url).split('-')[0]
# 
# def parse_title(url):
#     return "-".join(_parse_title(url).split('-')[1:])

def parse_id(url):
    return url.split('/')[-1].split('.')[0]

def flatten_description(description):
    return "\n".join(description)


In [None]:
new_df = raw_df.copy()

new_df['parsed_id'] = raw_df['href'].apply(parse_id)
# .copy() so the assignments below write to an independent frame, not a slice
new_df = new_df.loc[~new_df['parsed_id'].duplicated()].copy()

new_df['title'] = new_df['title'].str.lower()
new_df['parsed_title'] = new_df['href'].apply(parse_title).str.lower()
new_df['description'] = new_df['description'].apply(flatten_description).str.lower()
new_df['description_oneline'] = new_df['description'].str.replace("\n", " ", regex=False)

new_df['n_dupes'] = new_df['duplicates'].apply(len)

# price: chop off the '$', drop any thousands separators, and convert to int
new_df['price'] = new_df['price'].apply(lambda x: int(x[1:].replace(',', '')))

new_df.head(3)

In [None]:
print(new_df['description'][0])

In [None]:
plt.hist(new_df.loc[new_df['price']<5000, 'price'], bins=50)
plt.xlabel('price')
plt.ylabel('count')
plt.show()

In [None]:
for city in np.sort(new_df['city'].unique()):
    plt.hist(
        new_df.loc[
            (new_df['price']<5000) & (new_df['city']==city),
            'price'
        ],
        bins=50)
    plt.xlabel('price')
    plt.xlim((0, 5000))
    plt.ylabel('count')
    plt.title(city)
    plt.show()

### featurize bedrooms/bathrooms

In [None]:
re_num = r'\d{1,10}\.?\d*'  # raw string avoids invalid-escape warnings

bed_match = new_df["description"].apply(
    lambda x: re.search(
        "({} (bed|bd)|(bed|beds|bd|bds|bedroom|bedrooms|bdrm|bdrms): {})".format(re_num, re_num), x
    )
)

bath_match = new_df["description"].apply(
    lambda x: re.search(
        "({} bath|(bath|bathroom|baths|bathrooms|bth|bths): {})".format(re_num, re_num), x
    )
)

def return_matches(match):
    # matched text, or None when the pattern was not found
    try:
        return match.group(0)
    except AttributeError:
        return None

def strip_text(string):
    # pull the first number out of the matched text; None when there was no match
    try:
        match = re.search(re_num, string)
    except TypeError:
        return None
    return float(return_matches(match))

new_df['bedrooms'] = bed_match.apply(return_matches).apply(strip_text)
new_df['bathrooms'] = bath_match.apply(return_matches).apply(strip_text)


In [None]:
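# drop listings with no parsed count; start the bins at 0 and extend them
# past the max so the largest value is not cut off
beds = new_df['bedrooms'].dropna()
plt.hist(beds, bins=np.arange(0, beds.max() + 1, .5))
plt.title('beds')
plt.show()

baths = new_df['bathrooms'].dropna()
plt.hist(baths, bins=np.arange(0, baths.max() + 1, .5))
plt.title('baths')
plt.show()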

In [None]:
# multi-unit listings currently return only a single value for bed/bath counts
print(new_df.loc[new_df['parsed_id']=='7107475787', 'description'].values[0])
# TODO: capture all matches, not just the first
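
One way to approach the TODO above is `re.findall`, which returns every match rather than only the first. The sketch below is an assumption about how that could look, not the notebook's method: the `all_bed_counts` helper and the sample text are made up, and the pattern only covers the "<number> bed/bd" form.

```python
import re

RE_NUM = r'\d{1,10}\.?\d*'
# "<number> bed..." or "<number> bd..."; a simplified, hypothetical pattern
BED_PATTERN = re.compile(r'({})\s*(?:bed|bd)'.format(RE_NUM))


def all_bed_counts(description):
    """Return every bedroom count mentioned in a description, in order."""
    return [float(num) for num in BED_PATTERN.findall(description.lower())]


# made-up multi-unit description
sample = "1 bed / 1 bath from $950\n2 bed / 2 bath from $1250\n3 Bedrooms available soon"
print(all_bed_counts(sample))
```

Storing a list per posting would change the shape of the `bedrooms` column, so downstream steps (histograms, model features) would need a summary such as the minimum or maximum count.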

### duplicates
need to address these esp. if doing supervised learning

In [None]:
new_df.loc[new_df['description'].duplicated(keep=False)].sort_values('description')

In [None]:
# mark posts that have duplicates,
# then deduplicate without attending to raw 'duplicates' column for now

# CAUTION: mutates in place. Avoid this pattern in production.

new_df['has_repost'] = False
new_df.loc[new_df['description'].duplicated(keep=False), 'has_repost'] = True

new_df = new_df.loc[
    ~new_df['description'].duplicated()
]

new_df = new_df.drop(columns=['duplicates', 'n_dupes'])
new_df.head(3)

### TF-IDF

In [None]:
spanglish = stopwords.words('spanish') + stopwords.words('english')

vectorizer = TfidfVectorizer(
    stop_words=spanglish, 
    token_pattern="[a-z]{2,}"
)

X = vectorizer.fit_transform(new_df['description_oneline'])

print(X.shape)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[:20])

In [None]:
# some of those tokens look surprising

surprise_term = 'abacus'
for entry in new_df['description_oneline']:
    if surprise_term in entry:
        print(entry)
        break

## Clustering
use tf-idf scores to cluster the postings

In [None]:
# fix the seed so the cluster labels are reproducible across runs
kmeans = KMeans(n_clusters=2, random_state=10).fit(X)

In [None]:
# count the class labels
pd.Series(kmeans.labels_).value_counts()

In [None]:
new_df.loc[kmeans.labels_==1].head()

#### understand the clusters

In [None]:
kmeans.cluster_centers_

In [None]:
# what are the distinctive terms in the second (smaller) cluster?
pd.Series(
    vectorizer.get_feature_names_out()
).loc[
    kmeans.cluster_centers_[1]>0.1
]

##### is "locator" the distinguishing characteristic?

In [None]:
new_df.loc[
    (kmeans.labels_==1), 'description_oneline'
].str.contains("locator").value_counts()

In [None]:
new_df.loc[
    (kmeans.labels_==0), 'description_oneline'
].str.contains("locator").value_counts()

the proportions of "locator" postings appear to differ between the two clusters, and a contingency-table test (e.g. chi-squared) could check whether the difference is significant, but there are probably better clusterings to be found
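
The contingency-table test mentioned above could be sketched as follows. The counts are hypothetical placeholders, not results from this data. The statistic is Pearson's chi-squared computed by hand for a 2x2 table (`scipy.stats.chi2_contingency` offers a library version), compared against 3.841, the critical value for one degree of freedom at the 0.05 level.

```python
def chi_squared_2x2(table):
    """Pearson's chi-squared statistic for a 2x2 table of counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count if cluster and keyword were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# rows: cluster 0, cluster 1; columns: mentions "locator", does not
# HYPOTHETICAL counts, for illustration only
counts = [[30, 970],
          [60, 140]]

stat = chi_squared_2x2(counts)
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841
print("chi-squared = {:.1f}".format(stat))
print("significant at 0.05:", stat > CRITICAL_VALUE_DF1_ALPHA_05)
```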

### Silhouette analysis
what value of K maximizes silhouette score?

In [None]:
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(18, 7)

    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is:", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    plt.suptitle(("Silhouette analysis for KMeans clustering "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

* increasing the number of clusters increases the "fit" of postings within their assigned cluster
* no clustering yet solves the problem of one or more large "miscellaneous" clusters with poor fit

### try direct match to "locator" in the posting? 
##### probably should have started with this approach as a baseline

In [None]:
new_df['description_oneline'].str.contains('locator').value_counts(normalize=True)

should spot-check some of these postings, in case legitimate posts use the word "locator"
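
A quick way to spot-check is to print each hit with a little surrounding context. The sketch below uses made-up example descriptions and a hypothetical `keyword_in_context` helper; in the notebook the same function could be applied to a random sample of `new_df['description_oneline']`.

```python
import random
import re


def keyword_in_context(text, term, width=30):
    """Return each occurrence of `term` with up to `width` characters on either side."""
    snippets = []
    for match in re.finditer(re.escape(term), text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


# made-up descriptions, for illustration only
posts = [
    "free apartment locator service, we work with all credit types",
    "spacious 2 bed near the park, covered parking included",
    "gps pet locator tags available at the front office for residents",
]

random.seed(0)  # reproducible sample
hits = [post for post in posts if "locator" in post]
for post in random.sample(hits, k=min(2, len(hits))):
    for snippet in keyword_in_context(post, "locator"):
        print("...{}...".format(snippet))
```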

future direction:

## Supervised Learning
plan:
* hand-label spammy posts from a random subset of the data?
* grab processed features from new_df (price, bedrooms, bathrooms, has_repost)
* maybe include tf-idf scores, consider either dimensionality reduction or regularization given the number of features
* fit logistic regression, random forest, xgboost models
* compare precision, recall after determining whether false positives or false negatives are more problematic
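
To make the precision/recall comparison concrete, the sketch below scores a keyword baseline (flag a post if it contains "locator") against hand labels. The posts and labels are invented for illustration, and `precision_recall` is a hypothetical helper, not part of the notebook; `sklearn.metrics.precision_score` and `recall_score` would serve the same purpose for the fitted models.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (True = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)          # spam, flagged
    fp = sum((not t) and p for t, p in pairs)    # legitimate, flagged
    fn = sum(t and (not p) for t, p in pairs)    # spam, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# (description, hand label) pairs; INVENTED examples
labeled_posts = [
    ("free apartment locator service, all vouchers welcome", True),
    ("call our locators today for second chance leasing", True),
    ("we find your next home for free, bad credit ok", True),
    ("2 bed 1 bath with pet locator tags for residents", False),
    ("quiet community close to shopping and dining", False),
]

y_true = [label for _, label in labeled_posts]
y_pred = ["locator" in text for text, _ in labeled_posts]

precision, recall = precision_recall(y_true, y_pred)
print("precision = {:.2f}, recall = {:.2f}".format(precision, recall))
```

In this toy set the spam post that never says "locator" lowers recall, while the legitimate post that does say it lowers precision. Which of the two errors matters more decides which metric to weight when comparing models.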

potential issues:
* spanish/english appearing within/across posts
* spammers must balance the tradeoff between evading automatic filters and still getting people to click through or call

big picture:
* will filtering out these spammy posts help evaluate discrimination against housing vouchers?
* are there differences by geography?