<a href="https://colab.research.google.com/github/matakahas/portfolio/blob/main/reddit_proed_pt1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic modeling and flair prediction from the banned r/proED/ subreddit (Part 1)
In Part 1 of this project, I will scrape the [mirror site](https://goutiest-zorse-5012.dataplicity.io/) of r/proED and save the resulting dataset.

### required packages

In [None]:
import numpy as np
import pandas as pd
import requests 
import html5lib
from bs4 import BeautifulSoup
import re
import datetime
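
The scraping function in the next section calls `format_dt` on each raw date string, but no such helper is defined in this notebook. Here is a minimal sketch of what it could look like. The timestamp layout (`YYYY-MM-DD HH:MM:SS`) is an assumption, since the mirror's exact date format isn't shown here; adjust the pattern and format string to whatever the page actually contains.

```python
import re
import datetime

# ASSUMPTION: timestamps on the page look like "2015-05-16 22:31:00".
# Change this pattern and the strptime format below if the layout differs.
_TS_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def format_dt(raw):
    """Return the first timestamp found in `raw` as a datetime, or None."""
    match = _TS_PATTERN.search(raw)
    if match is None:
        # Covers missing values such as the string 'nan'
        return None
    return datetime.datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
```

Returning `None` for strings without a timestamp keeps the `map` call in the scraping function from failing on rows with a missing date.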

### scraping
I will scrape posts from the entire available period (2015-2018). The function below does the scraping and parses the data into a cleaner format (I will do a bit more pre-processing in Part 2).

In [None]:
def proed_scraping(link, headers):
    url = "https://goutiest-zorse-5012.dataplicity.io/" + link
    search_response = requests.get(url, headers=headers)
    if search_response.status_code == 200:
        soup = BeautifulSoup(search_response.content, 'html.parser')
    else:
        raise Exception(f"request is not processed correctly. Code: {search_response.status_code}")
    #extract titles
    titles = []
    for tag in soup.find_all('strong'):
        titles.append(tag.text)
    #extract flairs
    flairs = []
    new_titles = []
    for t in titles:
        pattern = re.match(r"\[[a-z\/?A-Z]+\]", t)
        if pattern:
            flairs.append(re.sub(r"\[|\]", '', pattern.group(0)))
            new_titles.append(re.sub(r"\[[a-z\/?A-Z]+\]\s", '', t))
        else:
            flairs.append('none')
            new_titles.append(t)
    #make a dataframe
    df = pd.DataFrame(data={'Flair':flairs, 'Title':new_titles})
    #extract usernames, dates, and texts
    users = []
    dates = []
    texts = []
    txt = []
    count = 0
    for br in soup.find_all("br"):
        next_s = br.next_sibling
        if str(next_s) == '<hr/>':
            texts.append(txt)
            txt = []
            count = 0
        else:
            count += 1
            if count == 1:
                users.append(next_s)
            elif count == 2:
                dates.append(next_s)
            elif count >= 5:
                txt.append(next_s)
    #remove the line break tags from each text 
    new_texts = []
    for text in texts:
        new_text = ''.join(str(t) for t in text if t.name != 'br')
        new_texts.append(new_text)
    #append the extracted data to the dataframe
    df['User'], df['Date'], df['Text'] = [pd.Series(users), pd.Series(dates), pd.Series(new_texts)]
    #parse the info under the Date column
    df['Date'] = df['Date'].map(lambda x: format_dt(str(x)))
    #finally, export the dataframe
    l = link.split('/')[1]
    df.to_csv(f'{l}.csv')
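
To see what the flair-extraction step does in isolation, here is a small sketch that applies the same regular expression to a few made-up titles. The helper name `split_flair` and the sample titles are hypothetical, for illustration only.

```python
import re

# Same pattern as in proed_scraping: letters, '/' and '?' inside square brackets
FLAIR_PATTERN = r"\[[a-z\/?A-Z]+\]"

def split_flair(title):
    """Return (flair, title_without_flair); flair is 'none' if no match."""
    match = re.match(FLAIR_PATTERN, title)
    if match:
        flair = re.sub(r"\[|\]", "", match.group(0))
        rest = re.sub(FLAIR_PATTERN + r"\s", "", title)
        return flair, rest
    return "none", title

print(split_flair("[Help] Looking for advice"))  # ('Help', 'Looking for advice')
print(split_flair("No flair here"))              # ('none', 'No flair here')
print(split_flair("[Goal Weight] Made it"))      # ('none', '[Goal Weight] Made it')
```

Note that the pattern does not allow spaces or digits inside the brackets, so a title like the third one is labelled `'none'` and keeps its bracketed prefix.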

Now, get a list of log URLs and scrape posts from each of them.

In [None]:
url = "https://goutiest-zorse-5012.dataplicity.io/"

headers = {'User-Agent': 'Mozilla 5.0'}

search_response = requests.get(url, headers=headers)

if search_response.status_code == 200:
    soup = BeautifulSoup(search_response.content, 'html.parser')

In [None]:
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
links

['ProED_Summary/15_05_16_22_31-15_09_05_07_48.html',
 'ProED_Summary/15_09_05_08_00-16_04_04_10_03.html',
 'ProED_Summary/16_04_04_10_30-16_08_06_19_39.html',
 'ProED_Summary/16_08_06_19_49-16_11_11_05_11.html',
 'ProED_Summary/16_11_11_05_11-17_03_14_00_34.html',
 'ProED_Summary/17_03_14_00_54-17_06_17_00_26.html',
 'ProED_Summary/17_06_17_00_29-17_08_23_12_34.html',
 'ProED_Summary/17_08_23_12_42-17_11_05_19_32.html',
 'ProED_Summary/17_11_05_19_38-18_01_21_11_15.html',
 'ProED_Summary/18_01_21_11_21-18_04_02_11_12.html',
 'ProED_Summary/18_04_02_11_45-18_06_04_09_55.html',
 'ProED_Summary/18_06_04_10_06-18_07_24_13_56.html',
 'ProED_Summary/18_07_24_14_09-18_09_06_20_56.html',
 'ProED_Summary/18_09_06_21_30-18_10_14_09_13.html',
 'ProED_Summary/18_10_14_09_15-18_11_14_15_33.html']

In [None]:
#don't forget to specify the user agent
headers = {'User-Agent': 'Mozilla 5.0'}

for link in links:
    proed_scraping(link, headers)
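
Because the scraping function raises an exception on any non-200 response, one bad page stops the whole loop. A slightly more defensive variant is sketched below; it is not what I ran, and `scrape_all` and its `delay` parameter are hypothetical names. It pauses between requests and collects the links that failed so they can be retried later.

```python
import time

def scrape_all(links, scrape_fn, headers, delay=1.0):
    """Call scrape_fn(link, headers) for every link; return the links that failed."""
    failed = []
    for link in links:
        try:
            scrape_fn(link, headers)
        except Exception as err:
            # Keep going, but remember which log could not be scraped
            print(f"skipping {link}: {err}")
            failed.append(link)
        # Pause between requests to go easy on the server
        time.sleep(delay)
    return failed
```

With the function defined earlier, this would be called as `failed = scrape_all(links, proed_scraping, headers)`.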



I then combined all the generated datasets into one and saved it as "proED_full_dataset.csv", which I will load in Part 2.
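
The combining step isn't shown above. Here is a minimal sketch of one way to do it with `pandas.concat`, assuming the per-log CSV files sit in the working directory and end in `.html.csv` (which is what `proed_scraping` produces, since the `.html` suffix is kept in the file name). `combine_csvs` is a hypothetical helper name.

```python
import glob
import pandas as pd

def combine_csvs(pattern, out_path):
    """Concatenate all CSV files matching `pattern` and save them to `out_path`."""
    # File names start with the yy_mm_dd date range, so sorting keeps them chronological
    paths = sorted(glob.glob(pattern))
    # index_col=0 drops the unnamed index column written by df.to_csv
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    full = pd.concat(frames, ignore_index=True)
    full.to_csv(out_path, index=False)
    return full
```

For example, `combine_csvs('*.html.csv', 'proED_full_dataset.csv')` would write the combined file next to the per-log ones without picking up the output file itself.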

**EDIT:** The mirror site seems to be down, so the above code won't work. I'm not sure whether this is temporary, but in the meantime you can download the zip file of the dataset I made from my GitHub repository, or from [Kaggle](https://www.kaggle.com/matakahas/reddit-proana-dataset).

That's it for Part 1 - thanks for tagging along!