# Goal: Scrape a forum website for the following:

- `link_to_thread`

- `name_of_thread`

- `views, replies`

- `last_post_time`

- `last_post_date`

and save the result to a .csv file



### Note:

- This script was built against the Ford F150 Ecoboost forum and should be able to handle the 2015-2017 forums.

- The script accounts for threads that have been **moved**: such a thread has a title but no stats for last post date, last post time, reply count, or view count.

- This script is built to produce the top 100 posts by popularity, namely views and replies. I have also written a function that sorts the posts by popularity using either one of these inputs (Replies or Views) or both.

- Because the forum is dynamic, the site may be updated while the script is running. If that happens, the lists in the dictionary can end up with different lengths, and the dictionary then cannot be converted to a DataFrame. Re-running the script resolves this issue.
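The last note can be checked before the conversion step. The sketch below is not part of the original script: `check_equal_lengths` is a hypothetical helper that assumes the scraped result is a plain dict of lists (as returned by `lookup`) and raises a readable error when the lists are out of step.

```python
def check_equal_lengths(post_dict):
    """Return the length of every list in post_dict, or raise if they differ.

    Hypothetical helper: it only assumes post_dict maps column names to lists.
    """
    lengths = {key: len(values) for key, values in post_dict.items()}
    if len(set(lengths.values())) > 1:
        # Unequal lengths mean the forum changed mid-crawl; the note above
        # recommends simply re-running the script in that case.
        raise ValueError(
            "Column lengths differ: {}; re-run the scraper.".format(lengths)
        )
    return lengths


# Example with made-up data
print(check_equal_lengths({'link': ['a', 'b'], 'title': ['x', 'y']}))
```

Calling it right after `lookup(driver)` would surface the problem before `pd.DataFrame.from_dict` fails with a less specific message.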


In [1]:
import time
import datetime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

In [2]:
def init_driver():
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.wait = WebDriverWait(driver, 3)
    return driver

In [3]:
def lookup(driver):
    post_dict = {'link': [], 'title': [], 'stats': [], 'last_post_stats': []}
    #driver.get('http://www.f150ecoboost.net/forum/68-2016-ford-f150-ecoboost-chat')
    driver.get('http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat')
    counter1 = 0

    while True:
        try:
            driver.find_element_by_xpath('''//*[@id="yui-gen11"]''').click()
            posts = driver.find_elements_by_xpath('''.//*[@id='threads']''')
            for post in posts:
                titles = post.find_elements_by_xpath('''//a[@class='title']''')
                for title in titles:
                    # print(title.text)
                    post_dict['title'].append(title.text)
                    # print(title.get_attribute('href'))
                    post_dict['link'].append(title.get_attribute('href'))
                subs_stats = post.find_elements_by_xpath('''.//*[@class='threadstats td alt']''')
                for sub_stats in subs_stats:
                    # print(sub_stats.text)
                    if sub_stats.text != '&nbsp;':
                        post_dict['stats'].append(sub_stats.text)
                subs_dates = post.find_elements_by_xpath('''.//*[@class='threadlastpost td']''')
                for sub_dates in subs_dates:
                    # print(sub_dates.text)
                    if sub_dates.text != '&nbsp;':
                        post_dict['last_post_stats'].append(sub_dates.text)
                counter1 += 1
            print(counter1)
            print(driver.current_url)
            page = driver.find_element_by_xpath('''//img[@alt='Next']''')
            page.click()
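        except Exception:
            # Any failure (e.g. no 'Next' button on the last page) ends the crawl
            # and returns whatever has been collected so far.
            return post_dict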
    

In [4]:
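def process_df(post_dict):
    # Despite its name, post_dict is expected to be a pandas DataFrame here.
    # last_post_stats looks like 'username\n08-25-2017, 12:03 PM'; keep the part after the newline.
    post_dict['last_post_datetime'] = (post_dict.last_post_stats.apply(lambda x:x.split('\n'))).str[1]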
    post_dict['Last_post_date'] = (post_dict.last_post_datetime.str.split(', ')).str[0]
    post_dict['Last_post_time'] = (post_dict.last_post_datetime.str.split(', ')).str[1]

    #post_dict.Last_post_date = post_dict.Last_post_date.apply(
    #    lambda x: datetime.datetime.strptime(x, '%m-%d-%Y') if x not in ('Yesterday', 'Today') else x)
    #post_dict.Last_post_time = post_dict.Last_post_time.apply(
    #    lambda x: (time.strftime('%H:%M', time.strptime(x, '%I:%M %p'))))

    #post_dict['repliesString'] = post_dict.stats.apply(lambda x: (x.split('\n'))[0])#.split(':')[1])
    #post_dict['Replies'] = post_dict.repliesString.str.split(':').str[1]
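    # stats looks like 'Replies: 1,587\nViews: 91,300'; drop thousands separators, then keep the numbers.
    post_dict['repliesViews'] = post_dict.stats.apply(lambda x: x.replace(',',''))
    post_dict['counts'] = post_dict.repliesViews.apply(lambda x:[int(s) for s in x.split() if s.isdigit()])
    # Rows without numeric stats (e.g. moved threads) give an empty list, so Replies/Views become NaN.
    post_dict['Replies'] = post_dict.counts.str[0]
    post_dict['Views'] = post_dict.counts.str[1]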
    #post_dict['Views'] = post_dict.stats.apply(lambda x: (x.split('\n'))[1])#.split(':')[1])
    #post_dict['Views'] = post_dict.stats.apply(lambda x: (x.split('\n'))[1].split(':')[1])
    post_dict.drop(['last_post_stats', 'stats', 'last_post_datetime', 'repliesViews', 'counts'], axis=1, inplace=True)

    return post_dict
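
The commented-out lines in `process_df` hint at converting the date and time strings into proper values while leaving `Yesterday` and `Today` alone. As a minimal sketch of one way to go further, the hypothetical helpers below (`normalize_post_date` and `to_24_hour`, neither of which exists in the original script) resolve the relative labels against a reference date and convert 12-hour times to 24-hour strings. They assume the forum uses the `MM-DD-YYYY` and `HH:MM AM/PM` formats seen in the output.

```python
import datetime


def normalize_post_date(text, today=None):
    """Turn 'Today', 'Yesterday' or 'MM-DD-YYYY' into a datetime.date."""
    # The reference date is an assumption: by default, the day the script runs.
    today = today or datetime.date.today()
    if text == 'Today':
        return today
    if text == 'Yesterday':
        return today - datetime.timedelta(days=1)
    return datetime.datetime.strptime(text, '%m-%d-%Y').date()


def to_24_hour(text):
    """Convert a time such as '03:55 PM' to '15:55'."""
    return datetime.datetime.strptime(text, '%I:%M %p').strftime('%H:%M')


# Example with values taken from the output shown in this notebook
print(normalize_post_date('08-25-2017'))
print(to_24_hour('03:55 PM'))
```

These could be applied with `Series.apply` on `Last_post_date` and `Last_post_time`, provided rows with missing values (such as moved threads) are skipped first.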

In [5]:
def sort_values_by(df, by, ascending):
    sorted_df = df.sort_values(by=by, ascending=ascending)
    return sorted_df

In [6]:
driver = init_driver()
data = lookup(driver)
time.sleep(5)
driver.quit()

1
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat
2
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index2.html
3
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index3.html
4
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index4.html
5
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index5.html
6
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index6.html
7
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index7.html
8
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index8.html
9
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index9.html
10
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index10.html
11
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index11.html
12
http://www.f150ecoboost.net/forum/42-2015-ford-f150-ecoboost-chat/index12.html
13
http://www.f150ecoboost.net/forum/42-2015-f

In [7]:
data = pd.DataFrame.from_dict(data)

In [8]:
for e in data:
    print(e)
    print(len(data[e]))

last_post_stats
578
link
578
stats
578
title
578


In [9]:
data.head()

| | last_post_stats | link | stats | title |
|---|---|---|---|---|
| 0 | SSGTTJ\nYesterday, 06:53 AM | http://www.f150ecoboost.net/forum/42-2015-ford... | Replies: 1,587\nViews: 91,300 | What did you do to your truck today? |
| 1 | johnsnowkornar\n08-25-2017, 12:03 PM | http://www.f150ecoboost.net/forum/42-2015-ford... | Replies: 15\nViews: 10,052 | Bed Liners for the Aluminum F150 |
| 2 | Sir_Boosted\n08-24-2017, 12:40 PM | http://www.f150ecoboost.net/forum/42-2015-ford... | Replies: 8\nViews: 1,448 | Lower Active Grill Shutters = CAC Condensation... |
| 3 | jjc155\n08-24-2017, 12:51 AM | http://www.f150ecoboost.net/forum/42-2015-ford... | Replies: 17\nViews: 942 | Blown motor |
| 4 | dwrenchz\n08-21-2017, 03:55 PM | http://www.f150ecoboost.net/forum/42-2015-ford... | Replies: 0\nViews: 101 | My real life mpg |
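
The raw `stats` strings in this output are what `process_df` later splits into `Replies` and `Views`. The sketch below restates that parsing step for a single string in plain Python, so the handling of rows without numbers is easier to see. `parse_stats` is a hypothetical name, and the exact text a moved thread produces in this column is an assumption; the function simply returns `None` for any count it cannot find.

```python
def parse_stats(stats_text):
    """Extract (replies, views) from text like 'Replies: 1,587\\nViews: 91,300'."""
    # Drop thousands separators so '1,587' becomes '1587'.
    cleaned = stats_text.replace(',', '')
    # Keep only the whitespace-separated tokens that are pure digits.
    counts = [int(token) for token in cleaned.split() if token.isdigit()]
    replies = counts[0] if len(counts) > 0 else None
    views = counts[1] if len(counts) > 1 else None
    return replies, views


print(parse_stats('Replies: 1,587\nViews: 91,300'))
print(parse_stats(''))  # assumed shape of a row with no stats
```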


In [10]:
data = process_df(data)

In [11]:
data.head()

| | link | title | Last_post_date | Last_post_time | Replies | Views |
|---|---|---|---|---|---|---|
| 0 | http://www.f150ecoboost.net/forum/42-2015-ford... | What did you do to your truck today? | Yesterday | 06:53 AM | 1587.0 | 91300.0 |
| 1 | http://www.f150ecoboost.net/forum/42-2015-ford... | Bed Liners for the Aluminum F150 | 08-25-2017 | 12:03 PM | 15.0 | 10052.0 |
| 2 | http://www.f150ecoboost.net/forum/42-2015-ford... | Lower Active Grill Shutters = CAC Condensation... | 08-24-2017 | 12:40 PM | 8.0 | 1448.0 |
| 3 | http://www.f150ecoboost.net/forum/42-2015-ford... | Blown motor | 08-24-2017 | 12:51 AM | 17.0 | 942.0 |
| 4 | http://www.f150ecoboost.net/forum/42-2015-ford... | My real life mpg | 08-21-2017 | 03:55 PM | 0.0 | 101.0 |


#### Below, I have used the `sort_values_by()` function defined above to sort the result

Note that you can switch the order of the input list to sort the results in a different order.
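
To make the effect of the column order concrete, here is a small sketch on made-up rows (the three titles and their counts are invented for illustration, not scraped data). It repeats the `sort_values_by` definition so the block runs on its own.

```python
import pandas as pd


def sort_values_by(df, by, ascending):
    # Same one-line wrapper as defined earlier in this notebook.
    return df.sort_values(by=by, ascending=ascending)


# Made-up rows: 'a' and 'b' tie on Replies, 'c' has the most Views.
toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'Replies': [10, 10, 5],
    'Views': [100, 300, 900],
})

# Replies first, Views as the tie-breaker
by_replies = sort_values_by(toy, ['Replies', 'Views'], [False, False])
# Views first, Replies as the tie-breaker
by_views = sort_values_by(toy, ['Views', 'Replies'], [False, False])

print(list(by_replies['title']))
print(list(by_views['title']))
```

The first column in the list is the primary sort key; later columns only break ties among rows that share the same value in the earlier ones.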

In [12]:
sorted_data = sort_values_by(data, ['Replies','Views'], [False,False])

In [13]:
sorted_data.head()
type(sorted_data)

pandas.core.frame.DataFrame

A related use case, sorting the posts by `Replies` within each `Last_post_date` group, is demonstrated below.

In [14]:
another = sorted_data.groupby('Last_post_date').apply(pd.DataFrame.sort_values, 'Replies')

In [15]:
another.head()

| Last_post_date | | link | title | Last_post_date | Last_post_time | Replies | Views |
|---|---|---|---|---|---|---|---|
| 01-03-2015 | 483 | http://www.f150ecoboost.net/forum/42-2015-ford... | Motor Trend Idiocy - Chev#$^*let Colorado 2015... | 01-03-2015 | 03:34 PM | 10.0 | 1589.0 |
| 01-04-2017 | 56 | http://www.f150ecoboost.net/forum/42-2015-ford... | door handles | 01-04-2017 | 12:54 PM | 10.0 | 755.0 |
| 01-05-2016 | 241 | http://www.f150ecoboost.net/forum/42-2015-ford... | 2015 2.7 Eco 4x4 with electrical issues | 01-05-2016 | 10:56 AM | 15.0 | 1721.0 |
| 01-06-2017 | 55 | http://www.f150ecoboost.net/forum/42-2015-ford... | Spark Plug change results in misfire that can'... | 01-06-2017 | 02:41 PM | 39.0 | 2444.0 |
| 01-09-2015 | 482 | http://www.f150ecoboost.net/forum/42-2015-ford... | Car and Driver mostly positive on the 2015 | 01-09-2015 | 04:01 PM | 3.0 | 837.0 |


In [18]:
# Keep only the 100 most popular posts, matching the file name and the stated goal.
sorted_data.head(100).to_csv('Top100_Posts.csv')