In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import requests
import pandas as pd
import numpy as np

# 1. Scraping the main thread for links of all the published projects:
[back to top](#index)

We'll start by scraping Dataquest's forum page for published projects, where every thread represents a different project. We're interested in the content of those threads, and to reach it we first need the links to them. So we'll scrape the guided project page that lists all the threads, then scrape each thread individually.

In [2]:

url = "https://community.dataquest.io/c/share/guided-project/55"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
list_all = soup.find_all("a", class_="title raw-link raw-topic-link")
len(list_all)
# print(soup)

30

Trying to scrape this website leads to our first problem: the page displays only 30 threads. I tried different path options and couldn't find a way to target the next 30 threads. So I came up with a crude but simple solution:
* manually scroll down to the bottom of the page (so that all threads are displayed)
* save the page to a file
* load the file into the notebook and keep on scraping
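
The last two steps can be sketched with the standard library alone. The HTML snippet, the temporary file and the `RowCounter` helper below are made up for illustration; the notebook itself runs BeautifulSoup on the real saved page.

```python
import os
import tempfile
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> start tags: one per thread row, plus the header row."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


# Made-up stand-in for the page saved after scrolling to the bottom.
sample_page = (
    "<table>"
    "<tr><th>Topic</th></tr>"
    "<tr class='topic-list-item'><td>Project A</td></tr>"
    "<tr class='topic-list-item'><td>Project B</td></tr>"
    "</table>"
)

# Save the page to disk, then load it back, as in the workaround above.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "projects.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample_page)
    with open(path, encoding="utf-8") as f:
        saved_html = f.read()

counter = RowCounter()
counter.feed(saved_html)
row_count = counter.rows
print(row_count)
```

Because the header row is also a `<tr>`, the count is one higher than the number of threads, which is why the first row has to be dropped later.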

In [3]:
# this is the saved copy of the page, after scrolling all the way down:
with open("../input/dq-projects/projects.html", "r", encoding="utf-8") as file:
    parser = BeautifulSoup(file, 'html.parser')
list_all = parser.find_all('tr')
series_4_df = pd.Series(list_all)
# create a dataframe with values(title, link, etc.) extracted from the html file:
df = pd.DataFrame(series_4_df, columns=['content'])
df['content'] = df['content'].astype(str)
df.head()

Unnamed: 0,content
0,"<tr><th class=""default"" data-sort-order=""defau..."
1,"<tr class=""topic-list-item category-share-guid..."
2,"<tr class=""topic-list-item category-share-guid..."
3,"<tr class=""topic-list-item category-share-guid..."
4,"<tr class=""topic-list-item category-share-guid..."


In [4]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1445 entries, 0 to 1444
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  1445 non-null   object
dtypes: object(1)
memory usage: 3.7 MB


In [5]:
# check what one cell looks like:
df.loc[2,'content']

'<tr class="topic-list-item category-share-guided-project tag-257 tag-sql-fundamentals tag-257-8 has-excerpt unseen-topic ember-view" data-topic-id="558357" id="ember71">\n<td class="main-link clearfix" colspan="">\n<div class="topic-details">\n<div class="topic-title">\n<span class="link-top-line">\n<a class="title raw-link raw-topic-link" data-topic-id="558357" href="https://community.dataquest.io/t/analyzing-cia-factbook-with-sql-full-project/558357" level="2" role="heading"><span dir="ltr">Analyzing CIA Factbook with SQL - Full Project</span></a>\n<span class="topic-post-badges">\xa0<a class="badge badge-notification new-topic" href="https://community.dataquest.io/t/analyzing-cia-factbook-with-sql-full-project/558357" title="new topic"></a></span>\n</span>\n</div>\n<div class="discourse-tags"><a class="discourse-tag bullet" data-tag-name="257" href="https://community.dataquest.io/tag/257">257</a> <a class="discourse-tag bullet" data-tag-name="sql-fundamentals" href="https://communi

In [6]:
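# drop the header row (copy so the column assignments below don't trigger SettingWithCopyWarning):
df = df.iloc[1:,:].copy()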
# extract title, link and number of replies:
df['title'] = df['content'].str.extract('<span dir="ltr">(.*?)</span>')
df['link'] = df['content'].str.extract('href=(.*?)level="2"')
df['link'] = df['link'].str.extract('\"(.*?)\"')
df['replies'] = df['content'].str.extract("This topic has (.*?) re").astype(int)
df['views'] = df['content'].str.extract("this topic has been viewed (.*?) times")
df['views'] = df['views'].str.replace(',','').astype(int)
# remove posts with 0 replies and the 1 generic post (100 or more replies):
df = df[df['replies']>0]
df = df[df['replies']<100]
df.head()

Unnamed: 0,content,title,link,replies,views
4,"<tr class=""topic-list-item category-share-guid...",Predicting house prices,https://community.dataquest.io/t/predicting-ho...,1,26
5,"<tr class=""topic-list-item category-share-guid...",[Re-upload]Project Feedback - Popular Data Sci...,https://community.dataquest.io/t/re-upload-pro...,3,47
7,"<tr class=""topic-list-item category-share-guid...",GP: Clean and Analyze Employee Exit Surveys ++,https://community.dataquest.io/t/gp-clean-and-...,2,53
10,"<tr class=""topic-list-item category-share-guid...",Project Feedback - Popular Data Science Questions,https://community.dataquest.io/t/project-feedb...,5,71
12,"<tr class=""topic-list-item category-share-guid...",Guided Project: Answer to Albums vs. Singles w...,https://community.dataquest.io/t/guided-projec...,5,370
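
To see what each pattern in the cell above captures, here is a minimal sketch that applies the same regular expressions to a single row with Python's `re` module. The row is made up (modelled on the structure shown earlier, not real data), and `first_group` is a hypothetical helper that mimics taking the first match, as `str.extract` does.

```python
import re

# Made-up row, modelled on the structure of the saved page (not real data).
row = (
    '<tr class="topic-list-item" data-topic-id="1">'
    '<a class="title raw-link raw-topic-link" data-topic-id="1" '
    'href="https://example.com/t/sample-project/1" level="2" role="heading">'
    '<span dir="ltr">Sample Project</span></a>'
    '<td title="This topic has 3 replies">3</td>'
    '<td title="this topic has been viewed 1,234 times">1.2k</td>'
    '</tr>'
)


def first_group(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


title = first_group(r'<span dir="ltr">(.*?)</span>', row)

# Two-step link extraction: isolate the href attribute that precedes
# level="2", then keep only what sits between the quotes.
href_part = first_group(r'href=(.*?)level="2"', row)
link = first_group(r'"(.*?)"', href_part)

replies = int(first_group(r"This topic has (.*?) re", row))
# Views use a thousands separator, so the comma is removed before int().
views = int(first_group(r"this topic has been viewed (.*?) times", row).replace(",", ""))

print(title, link, replies, views)
```

The lazy quantifier `(.*?)` matters here: it stops at the first occurrence of the text that follows the group, so each pattern captures only the value next to its marker.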


# 2. Scraping the individual posts for feedback:

**We've collected all the necessary links, so now we can start scraping the actual pages containing the original post (the published project) and the replies (the feedback).** We've already filtered out the posts without any replies, and now we're going to assume that **only the first reply is valuable to us**. Many threads contain long discussions about various features of the published projects, but my gut feeling is that the first reply usually contains the best feedback; after that, the conversations, thank-yous, etc. start.

In [7]:
# create a function for scraping the actual posts website:
def get_reply(one_link):
    response = requests.get(one_link)
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    tag_numbers = parser.find_all("div", class_="post")
    # we're only going to scrape the content of the first reply (that's usually the feedback)
    feedback = tag_numbers[1].text
    return feedback

# create a test dataframe to test scraping on 5 rows:
df_test = df[:5].copy()

# we'll use a loop on all the elements of pd.Series (faster than using 'apply')
feedback_list = []
for el in df_test['link']:
    feedback_list.append(get_reply(el))
df_test['feedback'] = feedback_list
df_test['feedback']

4     \nprocessing data inside a function saves memo...
5     \nHi,\nI’ve been going through your project an...
7     \n\n\nnoticed that you’re deleting objects, af...
10            \nthink you forgot to attach your file…\n
12    \n@gdelaserre: recategorized your topic. The E...
Name: feedback, dtype: object
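
One caveat: `get_reply` indexes `tag_numbers[1]` directly, so it raises an `IndexError` whenever a page yields fewer than two post elements. A minimal sketch of a more defensive selection step follows; `pick_first_reply` is a hypothetical helper, the sample threads are made up, and stripping the surrounding whitespace is an extra choice the original function does not make.

```python
def pick_first_reply(post_texts):
    """Return the stripped text of the first reply, or None if there is none.

    post_texts is a list of post bodies in page order: index 0 is the
    original post, index 1 is the first reply.
    """
    if len(post_texts) < 2:
        return None
    return post_texts[1].strip()


# Made-up threads: one with replies, one with only the original post.
thread_with_reply = ["My project...", "\nNice work, consider adding comments.\n", "Thanks!"]
thread_without_reply = ["My project..."]

print(pick_first_reply(thread_with_reply))
print(pick_first_reply(thread_without_reply))
```

Returning `None` instead of raising keeps a long scraping run alive; the missing rows can be dropped or inspected afterwards.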

In [8]:
df_test['feedback'][4]

'\nprocessing data inside a function saves memory (the variables you create stay inside the function and are not stored in memory, when you’re done with the function) it’s important when you’re working with larger datasets - if you’re interested with experimenting:\nhttps://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page\nTry cleaning 1 month of this dataset on kaggle notebook (and look at your RAM usage) outside the function and inside the function, compare the RAM usage in both examples\n'

### It works, time to try it on the full dataset:

In [9]:
# let's scrape all the posts, not just 5 of them:
def scrape_replies(df):
    feedback_list = []
    for el in df['link']:
        feedback_list.append(get_reply(el))
    df['feedback'] = feedback_list
    return df    

df = scrape_replies(df)
# save the file:
df.to_csv('dq.csv',index=False)
df.head()

Unnamed: 0,content,title,link,replies,views,feedback
4,"<tr class=""topic-list-item category-share-guid...",Predicting house prices,https://community.dataquest.io/t/predicting-ho...,1,26,\nprocessing data inside a function saves memo...
5,"<tr class=""topic-list-item category-share-guid...",[Re-upload]Project Feedback - Popular Data Sci...,https://community.dataquest.io/t/re-upload-pro...,3,47,"\nHi,\nI’ve been going through your project an..."
7,"<tr class=""topic-list-item category-share-guid...",GP: Clean and Analyze Employee Exit Surveys ++,https://community.dataquest.io/t/gp-clean-and-...,2,53,"\n\n\nnoticed that you’re deleting objects, af..."
10,"<tr class=""topic-list-item category-share-guid...",Project Feedback - Popular Data Science Questions,https://community.dataquest.io/t/project-feedb...,5,71,\nthink you forgot to attach your file…\n
12,"<tr class=""topic-list-item category-share-guid...",Guided Project: Answer to Albums vs. Singles w...,https://community.dataquest.io/t/guided-projec...,5,370,\n@gdelaserre: recategorized your topic. The E...
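
With more than a thousand requests in one run, a single failed page would abort the whole loop, and the requests hit the server back to back. The sketch below shows one possible way to pause between requests and record failures instead of stopping. `scrape_all`, its `delay_seconds` parameter and `fake_fetch` are all hypothetical names; in the notebook, `get_reply` would be passed in place of `fake_fetch`.

```python
import time


def scrape_all(links, fetch, delay_seconds=1.0):
    """Call fetch(link) for every link, pausing between requests.

    A failed request is recorded as None instead of aborting the whole run.
    """
    results = []
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay_seconds)
        try:
            results.append(fetch(link))
        except Exception as error:  # broad on purpose in this sketch
            print(f"failed: {link} ({error})")
            results.append(None)
    return results


# Hypothetical stand-in for get_reply so the sketch runs offline.
def fake_fetch(link):
    if "broken" in link:
        raise ValueError("no reply found")
    return f"feedback for {link}"


demo = scrape_all(["a", "broken", "c"], fake_fetch, delay_seconds=0)
print(demo)
```

The results list keeps the same length and order as the input links, so it can be assigned straight to a `feedback` column.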


In [10]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1102 entries, 4 to 1444
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   content   1102 non-null   object
 1   title     1102 non-null   object
 2   link      1102 non-null   object
 3   replies   1102 non-null   int64 
 4   views     1102 non-null   int64 
 5   feedback  1102 non-null   object
dtypes: int64(2), object(4)
memory usage: 4.8 MB
