# Filtering Craigslist Furniture & Antique Postings

Ideally, I would like to look at two separate streams of data: one for posts in which sellers recognize the manufacturer or style of their item, and another for posts in which they do not.

### Manufacturer & Designer Filtered List

For the first stream, I would like to search postings (either under the furniture section of Craigslist or under both the furniture and antiques sections) that include any keyword from the three lists below, but exclude any posting that includes more than one keyword each from the manufacturer and designer lists. I would also like to remove duplicates, i.e. posts that have been posted more than once, and to exclude any posting that includes a keyword from the blacklist.

Manufacturer List: “Knoll” “Herman Miller” “Fritz Hansen” “Selig” “Modernica” “Carl Hansen” “Swedese” “Artek” “Moreddi” “Cassina” “Thayer Coggin” “Bramin” “Howard Miller” “Drexel Declaration” “Broyhill Brasilia” “Lane Acclaim” “Westnofa” “USM Haller”

Designer List: “Nelson” “Eames” “Saarinen” “Florence Knoll” “Paulin” “Risom” “Ekstrom” “Schultz” “Wegner” “Vodder” “Jalk” “Olsen”

Style List: “mid century” “mid century modern” “teak”

Blacklist: “cubicle” “aeron” 
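
The rules above can be tried out on a handful of titles before any scraped data is involved. This is only a minimal sketch under assumptions: the keyword lists are shortened, titles are assumed to be lower-cased already, `count_hits` and `keep_post` are hypothetical helper names, and "more than one keyword each" is read as "more than one manufacturer or more than one designer". Duplicate removal is left out because it needs the whole set of posts rather than a single title.

```python
import re

# Shortened, illustrative keyword lists (assumption: titles are lower-cased).
manufacturers = ["knoll", "herman miller", "fritz hansen"]
designers = ["nelson", "eames", "saarinen"]
styles = ["mid century", "teak"]
blacklist = ["cubicle", "aeron"]


def count_hits(title, terms):
    """Count how many distinct terms occur in the title as whole words."""
    return sum(
        1 for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", title)
    )


def keep_post(title):
    """Apply the stream-one rules to a single lower-cased title."""
    if count_hits(title, blacklist) > 0:
        return False
    n_manufacturers = count_hits(title, manufacturers)
    n_designers = count_hits(title, designers)
    if n_manufacturers > 1 or n_designers > 1:
        return False
    # At least one keyword from any of the three lists must be present.
    return (n_manufacturers + n_designers + count_hits(title, styles)) > 0


titles = [
    "eames molded plywood chair",
    "knoll herman miller eames nelson lot",
    "herman miller aeron chair",
    "oak bathroom storage cabinet",
]
kept = [t for t in titles if keep_post(t)]
print(kept)
```

Only the first title survives: the second names two manufacturers and two designers, the third hits the blacklist, and the fourth contains no keyword at all.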

### Generic Filtered List Under X Dollar Amount
For the second stream, I would like to see a raw stream of posts under the furniture section, excluding duplicate postings, any posting with a price over $400, and, as above, any posting that includes more than one keyword each from the manufacturer and designer lists. I would also like to exclude any posting that includes a keyword from the blacklist.
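
One detail of the price cap is worth illustrating: a comparison against a missing price is false, so posts without a listed price drop out of a plain `<=` filter. The sketch below uses a made-up three-row table; the column names mirror the ones used later in this notebook, and the "keep unpriced posts" variant is only one possible policy, not something the requirements above ask for.

```python
import numpy as np
import pandas as pd

# Toy data: one cheap post, one expensive post, one post with no price.
posts = pd.DataFrame({
    "Title": ["oak cabinet", "teak sideboard", "wool rug 8x10"],
    "Price": [35.0, 650.0, np.nan],
})

max_price = 400

# NaN <= 400 evaluates to False, so the unpriced rug is dropped here.
under_cap = posts[posts["Price"] <= max_price]

# Alternative policy (assumption): keep posts that list no price at all.
under_cap_or_unpriced = posts[(posts["Price"] <= max_price) | posts["Price"].isna()]

print(under_cap["Title"].tolist())
print(under_cap_or_unpriced["Title"].tolist())
```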

Lastly, I would like to search the Providence, Eastern Connecticut, New Haven, and New York City Craigslist sites.

Things I’d like to see in Data Display: Title, Picture, Hyperlink, Price, Location

In [13]:
import timeit

import numpy as np
import pandas as pd
from collections import Counter


df = pd.read_csv('/Users/katielazell-fairman/desktop/projects/furniture_craigslist/scrapy/all_results-10-25-17-4PM-v4.csv')

print(df.at[0,'Image'])
print("https://images.craigslist.org/00b0b_eF4D0qhS1MV_600x450.jpg")
#Convert title to lower case
#df['Title'] = df['Title'].str.lower()


#describe Table
#df.tail()


https://newhaven.craigslist.org/fuo/d/new-frames-all-5/6328477083.html
https://images.craigslist.org/00b0b_eF4D0qhS1MV_600x450.jpg


In [2]:
#Generate combined search terms

manufacturer_list = '''“Knoll” “Herman Miller” “Fritz Hansen” “Selig” “Modernica” “Carl Hansen” 
“Swedese” “Artek” “Moreddi” “Cassina” “Thayer Coggin” “Bramin” “Howard Miller” “Drexel Declaration” 
“Broyhill Brasilia” “Lane Acclaim” “Westnofa” “USM Haller"'''

designer_list = '''“Nelson” “Eames” “Saarinen” “Florence Knoll” “Paulin” “Risom” “Ekstrom” “Schultz” “Wegner” “Vodder” “Jalk” “Olsen”'''

style_list = ''' “mid century” “mid-century” “century modern” “teak” '1960s' '''

blacklist = '''“cubicle” “aeron” 'bed' 'mattress' 'chest' 'ethan allen' 'ikea' 'baker' 'grandfather clock'
'industrial' 'farmhouse' 'pottery barn' 'furnishare' '''

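def clean_string(x):
    # Turn a quoted, space-separated keyword list into a lower-case,
    # pipe-delimited string that can be used as a regex alternation.
    if isinstance(x, bytes):
        # Python 2 reads the literals above as UTF-8 bytes; decode them so the
        # curly quotes can be replaced as single characters on Python 2 and 3.
        x = x.decode("utf-8")
    rep_char = [u"\u201c", u"\u201d", u"' '", u"'", u'"']
    rep_with = [u"'", u"'", u"|", u"|", u""]
    for char, rep in zip(rep_char, rep_with):
        x = x.replace(char, rep)

    return x.lower()


# Note: the wrapping quotes and the line breaks inside the lists end up as
# literal alternatives in the patterns (see the printed output).
search_terms = '"' + clean_string(manufacturer_list) + clean_string(designer_list) + clean_string(style_list) + '"'
search_terms = search_terms.replace("| |", "|")
blacklist = '"' + clean_string(blacklist) + '"'
print(search_terms)
print("")
print("blacklist : " + blacklist)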

"|knoll|herman miller|fritz hansen|selig|modernica|carl hansen| 
|swedese|artek|moreddi|cassina|thayer coggin|bramin|howard miller|drexel declaration| 
|broyhill brasilia|lane acclaim|westnofa|usm haller|nelson|eames|saarinen|florence knoll|paulin|risom|ekstrom|schultz|wegner|vodder|jalk|olsen|mid century|mid-century|century modern|teak|1960s| "

blacklist : "|cubicle|aeron|bed|mattress|chest|ethan allen|ikea|baker|grandfather clock|
|industrial|farmhouse|pottery barn|furnishare| "
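
The printed patterns begin and end with a literal `"` and carry whitespace entries left over from the line breaks in the lists, so a title that contains an inch mark matches both the search pattern and the blacklist. A possible alternative, sketched here with a hypothetical `build_pattern` helper, is to build the alternation from a clean Python list, escape each term, and anchor on word boundaries. Note that word boundaries are a deliberate change from the substring matching used in this notebook: `bed` would no longer match `bedroom`.

```python
import re


def build_pattern(terms):
    """Join cleaned terms into one alternation with word boundaries."""
    # Drop empty and whitespace-only entries, normalise case.
    cleaned = [t.strip().lower() for t in terms if t.strip()]
    # Longest first so that "florence knoll" is tried before "knoll".
    cleaned.sort(key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(t) for t in cleaned) + r")\b"


# Includes the kind of leftovers seen in the printed output (" \n" and "").
blacklist_pattern = build_pattern(["cubicle", "aeron", "bed", " \n", ""])
print(blacklist_pattern)

print(bool(re.search(blacklist_pattern, "platform bed frame")))
print(bool(re.search(blacklist_pattern, "bedroom dresser")))
print(bool(re.search(blacklist_pattern, 'marble table 36" x 36"')))
```

A pattern built this way can be passed to `Series.str.contains` in the same way as `search_terms` and `blacklist` are used in the cells that follow.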


### Manufacturer & Designer Filtered List

In [3]:
#Contains key search term
df1 = df[df['Title'].str.contains(search_terms)]
df1.shape

(32, 6)

In [4]:
#Remove rows which contain words from blacklist
df1 = df1[df1['Title'].str.contains(blacklist) == False]
df1.shape

(16, 6)

In [6]:
#Remove Duplicates (don't keep any posts which appear more than once)
df1 = df1.drop_duplicates(subset = 'Title', keep = False)
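
With `keep = False`, every copy of a repeated title is dropped, including the first one. If the intent is instead to keep one copy of each repeated post, `keep = 'first'` does that. The toy table below is made up purely to show the difference.

```python
import pandas as pd

posts = pd.DataFrame({"Title": ["teak desk", "teak desk", "eames chair"]})

# keep=False: drop every title that appears more than once.
no_repeats = posts.drop_duplicates(subset="Title", keep=False)

# keep="first": keep the first copy of each repeated title.
one_of_each = posts.drop_duplicates(subset="Title", keep="first")

print(no_repeats["Title"].tolist())
print(one_of_each["Title"].tolist())
```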


In [7]:

#Remove Listings where multiple designers or multiple manufacturers are mentioned


def term_size(lst):
    # Splits a list of search terms into two lists:
    # - the first contains only single-word terms
    # - the second contains only terms of two or more words
    # Empty strings and whitespace-only leftovers from the split are dropped.
    terms = [term for term in lst if term.strip()]
    terms_with_2_words = [term for term in terms if ' ' in term]
    terms_with_single_words = [term for term in terms if ' ' not in term]
    return terms_with_single_words, terms_with_2_words
            
    

manufacturer_ls_1, manufacturer_ls_2 = term_size(clean_string(manufacturer_list).split("|"))

designer_ls_1, designer_ls_2 = term_size(clean_string(designer_list).split("|"))

def multi_manufacturer_designer(df, column):
    # Counts manufacturer and designer terms in df[column] for each row and
    # keeps only the rows that mention at most one of each.
    for n in df.index:
        count_m = 0
        count_d = 0
        words = df.at[n, column].split()

        for i, word in enumerate(words):
            if word in manufacturer_ls_1:
                count_m += 1
            if word in designer_ls_1:
                count_d += 1
            if i > 0:
                # Check the two-word phrase ending at this word. Skipping i == 0
                # avoids pairing the last word of the title with the first.
                pair = words[i - 1] + ' ' + word
                if pair in manufacturer_ls_2:
                    count_m += 1
                if pair in designer_ls_2:
                    count_d += 1

        df.at[n, 'Manufacturers Listed'] = count_m
        df.at[n, 'Designers Listed'] = count_d

    return df[(df['Manufacturers Listed'] <= 1) & (df['Designers Listed'] <= 1)]

df1 = multi_manufacturer_designer(df1, 'Title')

df1


Unnamed: 0,Title,URL,Section,Time,Meta_HTML,Image,Manufacturers Listed,Designers Listed
27,eames molded plywood chair ebony,https://newhaven.craigslist.org/fuo/d/eames-mo...,CT - Furniture,2017-10-25 12:21,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,1.0
29,mid-century modern dresser,https://newhaven.craigslist.org/fuo/d/mid-cent...,CT - Furniture,2017-10-25 12:15,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,0.0
51,vintage mid-century modern eames fiberglass sw...,https://newhaven.craigslist.org/fuo/d/vintage-...,CT - Furniture,2017-10-25 11:08,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,1.0
152,huge warehouse packed with great vintage!! lot...,https://newyork.craigslist.org/brk/atq/d/huge-...,New york - Furniture,2017-10-25 13:22,"<span class=""result-meta"">\n\n\n ...",http://pictaram.today/post/1588071282535283001...,0.0,0.0
209,mid century danish modern teak square dinning ...,https://newhaven.craigslist.org/fuo/d/mid-cent...,CT - Furniture,2017-10-25 08:34,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,0.0
267,vintage mid century bernard buffet clown print,https://newyork.craigslist.org/que/atq/d/vinta...,New york - Furniture,2017-10-25 12:55,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,0.0
342,midcentury modern 1950's chrome orb table lamp...,https://newyork.craigslist.org/mnh/atq/d/midce...,New york - Furniture,2017-10-25 12:21,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,0.0
343,mid century modern dining room table,https://newyork.craigslist.org/que/fuo/d/mid-c...,New York - Antiques,2017-10-25 13:40,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,0.0
404,eames aluminum group management chair,https://newyork.craigslist.org/brk/fuo/d/eames...,New York - Antiques,2017-10-25 13:33,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,1.0
407,"mid century lamps, hollywood regency, pair",https://newyork.craigslist.org/wch/fuo/d/mid-c...,New York - Antiques,2017-10-25 13:33,"<span class=""result-meta"">\n <s...",http://pictaram.today/post/1588071282535283001...,0.0,0.0
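
The row-by-row loop can also be expressed with `Series.str.count`, which counts regex matches per title in one call. This is a sketch rather than a drop-in replacement: `alternation` is a hypothetical helper, the keyword lists are shortened, the titles are invented, and matching on word boundaries may count slightly differently from the word-splitting approach above (for example around punctuation).

```python
import re
import pandas as pd


def alternation(terms):
    """Build a whole-word regex alternation, longest term first."""
    ordered = sorted(terms, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"


# Shortened, illustrative lists.
manufacturers = ["knoll", "herman miller"]
designers = ["eames", "wegner", "florence knoll"]

posts = pd.DataFrame({"Title": [
    "eames molded plywood chair",
    "wegner and eames chairs, herman miller",
    "mid century modern dresser",
]})

# Count matches per title without an explicit Python loop.
posts["Manufacturers Listed"] = posts["Title"].str.count(alternation(manufacturers))
posts["Designers Listed"] = posts["Title"].str.count(alternation(designers))

kept = posts[(posts["Manufacturers Listed"] <= 1) & (posts["Designers Listed"] <= 1)]
print(kept["Title"].tolist())
```

The second title is dropped because it names two designers.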


### List 2: Generic Filtered List Under X Dollar Amount 

In [8]:
#Remove rows which contain words from blacklist
df2 = df[df['Title'].str.contains(blacklist) == False]

#Remove Duplicates (don't keep any posts which appear more than once)
df2 = df2.drop_duplicates(subset = 'Title', keep = False)

df2 = multi_manufacturer_designer(df2, 'Title')


In [None]:
list(df2.columns)

In [9]:
def extract_price(result_meta_html):
    # Pull the text between <span class="result-price"> and the next </span>.
    tag = '<span class="result-price">'
    start_price = result_meta_html.find(tag) + len(tag)
    end_price = start_price + result_meta_html[start_price:start_price+80].find('</span>')
    price = result_meta_html[start_price:end_price].replace('$', "")
    try:
        price = float(price)
    except ValueError:
        # No price span, or text that does not parse as a number
        price = np.nan
    return price

def extract_location(result_meta_html):
    # Pull the text between <span class="result-hood"> and the next </span>.
    # Caveat: when the span is missing, find() returns -1 and the slice picks up
    # unrelated markup (rows 16 and 17 in the output below show this).
    tag = '<span class="result-hood"> '
    start_loc = result_meta_html.find(tag) + len(tag)
    end_loc = start_loc + result_meta_html[start_loc:start_loc+80].find('</span>')
    loc = result_meta_html[start_loc:end_loc].replace('(', "").replace(')', "")
    return loc
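
Because `str.find` returns -1 when a tag is absent, slicing by offset can return unrelated markup for posts that have no neighborhood. A regex-based variant makes the "not found" case explicit. The sketch below is an assumption-laden alternative, not the method used in this notebook: `parse_price` and `parse_location` are hypothetical names, and the two sample strings are hand-written to resemble the `result-price` and `result-hood` spans searched for above.

```python
import re

PRICE_RE = re.compile(r'<span class="result-price">\s*\$?(\d[\d,]*(?:\.\d+)?)\s*</span>')
HOOD_RE = re.compile(r'<span class="result-hood">\s*\(?([^<)]*)\)?\s*</span>')


def parse_price(meta_html):
    """Return the price as a float, or NaN when no price span is present."""
    match = PRICE_RE.search(meta_html)
    if match is None:
        return float("nan")
    # Remove thousands separators such as the comma in "1,250".
    return float(match.group(1).replace(",", ""))


def parse_location(meta_html):
    """Return the neighborhood text, or an empty string when it is missing."""
    match = HOOD_RE.search(meta_html)
    return match.group(1).strip() if match else ""


# Hand-written samples (assumption: real markup may differ in detail).
no_hood = '<span class="result-meta">\n <span class="result-price">$200</span>\n</span>'
with_hood = ('<span class="result-meta"><span class="result-price">$1,250</span>'
             '<span class="result-hood"> (West Haven)</span></span>')

print(parse_price(no_hood), repr(parse_location(no_hood)))
print(parse_price(with_hood), repr(parse_location(with_hood)))
```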

In [12]:
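def add_price_location_reformat(df, max_price=1000000000):
    for ix, meta in zip(df.index, df['Meta_HTML']):
        df.at[ix, 'Price'] = extract_price(meta)
        df.at[ix, 'Location'] = extract_location(meta).lower()

    # Show full cell contents (e.g. URLs) instead of truncating them.
    # Newer pandas versions expect None here instead of -1.
    pd.set_option('display.max_colwidth', -1)
    # Rows without a parsable price are NaN and are dropped by this comparison.
    df_view = df[df['Price'] <= max_price]

    return df_view[['Title', 'Price', 'Location', 'Time', 'Section', 'URL', 'Image']]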


In [13]:
df2_view = add_price_location_reformat(df2,400)
df2_view

Unnamed: 0,Title,Price,Location,Time,Section,URL,Image
0,oak bathroom storage cabinet,35.0,shelton,2017-10-25 13:34,CT - Furniture,https://newhaven.craigslist.org/fuo/d/oak-bathroom-storage-cabinet/6354595705.html,http://pictaram.today/post/1588071282535283001_5754592301
7,refrigerator - very clean almost new never store produce and meat,125.0,barkhamsted,2017-10-25 13:06,CT - Furniture,https://newhaven.craigslist.org/fuo/d/refrigerator-very-clean/6360514256.html,http://pictaram.today/post/1588071282535283001_5754592301
9,"solid oak dining table set (6 chairs, extendable table)",148.0,woodbridge,2017-10-25 13:02,CT - Furniture,https://newhaven.craigslist.org/fuo/d/solid-oak-dining-table-set-6/6360508428.html,http://pictaram.today/post/1588071282535283001_5754592301
10,room and board bamboo table - modified to computer desk,100.0,"milford, ct",2017-10-25 13:02,CT - Furniture,https://newhaven.craigslist.org/fuo/d/room-and-board-bamboo-table/6360508130.html,http://pictaram.today/post/1588071282535283001_5754592301
14,wool rug 8x10,100.0,west haven,2017-10-25 12:53,CT - Furniture,https://newhaven.craigslist.org/fuo/d/wool-rug-8x10/6360493453.html,http://pictaram.today/post/1588071282535283001_5754592301
16,genuine corian marble table 36x36,200.0,"\n <span class=""result-price"">$200",2017-10-25 12:42,CT - Furniture,https://newhaven.craigslist.org/fuo/d/genuine-corian-marble-table/6341763087.html,http://pictaram.today/post/1588071282535283001_5754592301
17,"classroom desk, virco",75.0,"\n <span class=""result-price"">$75",2017-10-25 12:42,CT - Furniture,https://newhaven.craigslist.org/fuo/d/classroom-desk-virco/6341751202.html,http://pictaram.today/post/1588071282535283001_5754592301
22,"modern glass tv stand, entertainment center",60.0,bethany,2017-10-25 12:35,CT - Furniture,https://newhaven.craigslist.org/fuo/d/modern-glass-tv-stand/6360436770.html,http://pictaram.today/post/1588071282535283001_5754592301
23,wicker arm chair with tan cushion - outside furniture,50.0,"milford, ct",2017-10-25 12:35,CT - Furniture,https://newhaven.craigslist.org/fuo/d/wicker-arm-chair-with-tan/6360463529.html,http://pictaram.today/post/1588071282535283001_5754592301
24,recliners ( 2 ) la-z-boy,100.0,derby,2017-10-25 12:30,CT - Furniture,https://newhaven.craigslist.org/fuo/d/recliners-2-la-boy/6339916540.html,http://pictaram.today/post/1588071282535283001_5754592301


In [None]:
df1_view = add_price_location_reformat(df1)
df1_view

In [None]:
def export_df_to_html(df, title):
    # Add a header with the title
    header = '<h1 style="font-family:Verdana"> {} </h1>'.format(title)
    html = header + "\n" + df.to_html()
    # Update the table format (HTML attributes are separated by spaces, not commas)
    html = html.replace('<table border="1" class="dataframe">',
                        '<table border="1" class="greyGridTable" style="font-family:Verdana"> ')
    # Replace hyperlink text with working links
    html = html.replace('<td>https://','<td><a href="https://' ).replace('.html</td>','.html"> link </a></td>')
    return html
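
The string replacement above depends on the exact text that `to_html` emits. An alternative worth considering is to let pandas build the links through the `formatters` and `escape` arguments of `DataFrame.to_html`. The sketch below uses an invented one-row table and a placeholder URL; `as_link` is a hypothetical helper, and `display.max_colwidth=None` assumes a reasonably recent pandas (older versions use -1).

```python
import pandas as pd

# Invented sample row with a placeholder URL.
posts = pd.DataFrame({
    "Title": ["eames chair"],
    "URL": ["https://example.org/post/1.html"],
})


def as_link(url):
    """Wrap a URL in an anchor tag with a short link text."""
    return '<a href="{0}">link</a>'.format(url)


# Lift the column-width limit so long cell contents are not truncated.
with pd.option_context("display.max_colwidth", None):
    # escape=False keeps the anchor tag intact. It also leaves the other
    # columns unescaped, so it is only suitable for trusted text.
    html = posts.to_html(formatters={"URL": as_link}, escape=False, index=False)

print(html)
```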
    
    

In [None]:
test1 = export_df_to_html(df1_view, "List 1: Manufacturer, Designer, Style Filtered List")

In [None]:
print(export_df_to_html(df2_view, "List 2: Generic Filtered List Under $400"))

In [None]:
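def save_html(html_text, title, dir_=""):
    # dir_ is the output directory; if given, it must end with a path separator.
    # The default writes to the current working directory.
    time = str(pd.to_datetime('now')).replace(" ", "_")[:13] + "hrs"
    filename = "{}-{}.html".format(title, time)

    with open(dir_ + filename, 'w') as f:
        f.write(html_text)
        # Append the blacklist so the report records which terms were excluded
        f.write('<p style="font-family:Verdana">Blacklist:<br>{}<br></p>'.format(blacklist))

    print(filename + " created in " + (dir_ or "current directory"))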
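
The file name is built by slicing the string form of a timestamp. An equivalent, somewhat more explicit construction uses `strftime` and `os.path.join`, as sketched below. `timestamped_path` is a hypothetical helper, the fixed date is only there to make the result reproducible, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile
from datetime import datetime


def timestamped_path(directory, title, now=None):
    """Build '<title>-YYYY-MM-DD_HHhrs.html' inside the given directory."""
    now = now or datetime.now()
    filename = "{}-{}hrs.html".format(title, now.strftime("%Y-%m-%d_%H"))
    return os.path.join(directory, filename)


out_dir = tempfile.mkdtemp()  # stand-in for the real output directory
path = timestamped_path(out_dir, "test1", now=datetime(2017, 10, 25, 16))

with open(path, "w") as f:
    f.write("<h1>List 1</h1>")

print(os.path.basename(path))
```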
    

In [None]:
save_html(test1, "test1")

In [None]:
'''<p style="font-family:Verdana">
   Blacklist:<br>{}<br></p>'''.format(blacklist)

In [None]:
st = "Hello this is a test"

findindex = st.find('<span class="result-hood"> ') 

In [None]:
findindex