# Blog data preparation

## Environment prep and imports

In [1]:
from glob import glob
import pandas as pd
import string

In [2]:
import xml.etree.ElementTree as et
import lxml
from bs4 import BeautifulSoup as bs
from tqdm.notebook import tqdm
import pickle
import nltk

In [3]:
import chardet

## Download the data
This data is provided by https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm. [1] It is an academic dataset of blogger metadata and blog contents. We can either use the original dataset in its `.xml` form or [Kaggle's version](https://www.kaggle.com/rtatman/blog-authorship-corpus). The original files come in a mix of encodings, so the correct encoding often has to be detected before a file can be read, and even then more than 30 files fail to load. Furthermore, the original download address has recently changed without a new location being provided. We therefore recommend the Kaggle approach. However, the following code is preserved in case anyone wants to process the original dataset by hand.
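To make the encoding problem concrete, here is a minimal sketch of decoding raw bytes with a fallback. It is not the approach used in the code that follows, which relies on `chardet`; instead it tries a short, assumed list of candidate encodings in order. The `decode_with_fallback` helper and the candidate list are hypothetical.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw bytes.

    Returns (None, None) when every candidate fails.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    return None, None


# "café" encoded as cp1252 is not valid UTF-8, so the first attempt fails
sample = "café".encode("cp1252")
text, used = decode_with_fallback(sample)
print(text, used)
```

A fixed list can silently pick the wrong encoding when several candidates decode without error, which is one reason to prefer detection or the pre-cleaned Kaggle file.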

## Prepare the raw data

In [4]:
def get_blog_entries(filename, min_length=1):
    """Read a blog file and return its entries with at least min_length words.

    The compilers of this data left the blog files in an assortment of different encodings.
    Short of hard-coding the exact encoding for each file, the only solution seems to be
    to detect them dynamically at file read time and then attempt to use that encoding.
    This should not ordinarily be done, since detection is slow and unreliable.
    """
    vals = filename.split("/")[-1].split(".")
    base_dict = {"blogger_id": vals[0], 
                 "sex": vals[1],
                 "age": vals[2],
                 "topic": vals[3],
                 "sign": vals[4],
                }
    valid_posts = []
    try:  # start with the default encoding
        with open(filename, "r") as infile:
            content = infile.read()
    except UnicodeDecodeError:  # if that fails, try to detect the encoding
        try:
            with open(filename, "rb") as infile:
                bcontent = infile.read()
            detect = chardet.detect(bcontent)
            content = bcontent.decode(encoding=detect["encoding"])
        # undecodable bytes, unknown codec name, or no encoding detected (None)
        except (UnicodeDecodeError, LookupError, TypeError):
            print(f"failed on {filename}")
            return valid_posts
            
    bs_content = bs(content, "lxml")
    dates = bs_content.find_all("date")
    posts = bs_content.find_all("post")
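    # pair each post with its date; assumes the tags appear in matching order
    for date, post in zip(dates, posts):
        text = post.text.strip()
        # split() ignores repeated whitespace, so empty posts count as zero words
        if len(text.split()) >= min_length:
            post_dict = base_dict.copy()
            post_dict["date"] = date.text
            post_dict["post_content"] = text
            valid_posts.append(post_dict)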
 
    return valid_posts
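
The metadata step in `get_blog_entries` relies on each file name carrying five dot-separated fields before the `.xml` extension. Below is a minimal sketch of that step in isolation, using `pathlib` to take the base name and checking the number of fields before trusting them. The `parse_blog_filename` helper and the example file name are made up for illustration.

```python
from pathlib import PurePosixPath

FIELDS = ("blogger_id", "sex", "age", "topic", "sign")


def parse_blog_filename(filename):
    """Map '<id>.<sex>.<age>.<topic>.<sign>.xml' to a metadata dict."""
    parts = PurePosixPath(filename).name.split(".")
    if len(parts) != len(FIELDS) + 1:
        raise ValueError(f"unexpected file name layout: {filename}")
    return dict(zip(FIELDS, parts))  # zip drops the trailing 'xml' part


# Hypothetical file name following the assumed layout
meta = parse_blog_filename("../Data/blogs/12345.male.25.Student.Leo.xml")
print(meta)
```

All values stay strings here, as they do in `get_blog_entries`; converting `age` to an integer would be a separate step.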

In [5]:
## Uncomment to run
# ! wget http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
# ! unzip blogs.zip -d ../Data -q
# ! rm blogs.zip
# df_columns = ["blogger_id", "sex", "age", "topic", "sign", "date", "post_content"]
# df_entries = []
# for filename in tqdm(glob("../Data/blogs/*.xml")):
#     df_entries += get_blog_entries(filename)
# df = pd.DataFrame(df_entries, columns=df_columns)
# with open("../Data/blogs_df.pkl", "wb") as outfile:
#     pickle.dump(df, outfile)

## Prepare the Kaggle version
First, sign into Kaggle and download the data from https://www.kaggle.com/rtatman/blog-authorship-corpus/, saving it in the `../Data` directory as `blogtext.csv`.

In [6]:
! head -n 2 ../Data/blogtext.csv

head: cannot open '../Data/blogtext.csv' for reading: No such file or directory


In [9]:
# parse_dates=True only affects the index, so it is dropped; the date column stays as strings
kdf = pd.read_csv("../Data/blogtext.csv")

In [10]:
kdf

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 681279 | 1713845 | male | 23 | Student | Taurus | 01,July,2004 | Dear Susan, I could write some really ... |
| 681280 | 1713845 | male | 23 | Student | Taurus | 01,July,2004 | Dear Susan, 'I have the second yeast i... |
| 681281 | 1713845 | male | 23 | Student | Taurus | 01,July,2004 | Dear Susan, Your 'boyfriend' is fuckin... |
| 681282 | 1713845 | male | 23 | Student | Taurus | 01,July,2004 | Dear Susan: Just to clarify, I am as... |


Well, this version definitely preserved the text I couldn't read due to encoding issues in the original data. We'll roll with this data.
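
The `date` column in the preview holds strings such as `14,May,2004` rather than datetimes. If real dates are needed later, one possible approach is sketched below. The `parse_blog_date` helper is hypothetical, it assumes English month names, and it returns `None` for anything it cannot parse instead of guessing.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Lowercased month name -> month number (1-12)
MONTH_NUMBER = {name.lower(): number for number, name in enumerate(MONTHS, start=1)}


def parse_blog_date(raw):
    """Parse 'day,Month,year' (e.g. '14,May,2004') into a date, or None."""
    try:
        day, month, year = raw.strip().split(",")
        return date(int(year), MONTH_NUMBER[month.strip().lower()], int(day))
    except (AttributeError, KeyError, ValueError):
        # non-string input, unknown month name, or malformed/invalid date
        return None


print(parse_blog_date("14,May,2004"))
```

Under these assumptions it could be applied with `kdf["date"].apply(parse_blog_date)`, after which the `None` entries show how many rows use some other date format.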

In [11]:
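def prep_text(text_string):
    '''Strip ASCII punctuation from a text string, then lowercase and tokenize it.'''
    # translation table that deletes every character in string.punctuation
    strip_punct = str.maketrans('', '', string.punctuation)
    tokenized_text = nltk.casual_tokenize(
        text_string.translate(strip_punct),
        preserve_case=False,
        strip_handles=True,
    )
    return tokenized_text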
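
One consequence of the ordering in `prep_text` is easier to see on a small input: `string.punctuation` includes `@`, so punctuation is removed before the tokenizer ever sees a handle. The sketch below reproduces only the stripping step with the standard library, on a made-up sentence; it does not call `nltk`.

```python
import string

# Same kind of translation table as prep_text: delete every ASCII punctuation character
strip_punct = str.maketrans("", "", string.punctuation)

sample = "@susan I can't wait... see you at 5:30!"  # made-up example text
stripped = sample.translate(strip_punct)
print(stripped)
```

Because the `@` is already gone, `strip_handles=True` should have nothing left to match, and the handle survives as an ordinary token. Contractions and times are also fused (`cant`, `530`). Whether that matters depends on the downstream task.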

In [12]:
kdf['tokens'] = kdf.text.apply(prep_text)
with open("../Data/blog_df_casual_tokens.pkl", "wb") as outfile:
    pickle.dump(kdf, outfile)

# References
1. J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.