# Post data cleaning

## Importing and cleaning post data

Before we can do any text analysis on the post data, we need to clean up the `selftext` and `title` fields by removing punctuation and numbers and converting the text to lowercase. The `clean` function from the [cleantext](https://github.com/jfilter/clean-text) package makes this easy. First, we import the raw data:


In [25]:
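#Data import
import pandas as pd
from glob import glob

# Read every raw JSON file into its own DataFrame.
frames = [pd.read_json(file, orient="records") for file in glob("../data/raw/*json")]

# DataFrame.append was removed in pandas 2.0, so concatenate the frames instead.
posts = pd.concat(frames, ignore_index=True, sort=False)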

In [26]:
posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Columns: 104 entries, all_awardings to user_reports
dtypes: bool(3), float64(40), int64(5), object(56)
memory usage: 61.9+ MB


Let's remove posts that have no `selftext` or whose `selftext` contains the "\[removed\]" or "\[deleted\]" tag:

In [31]:
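# Raw-string regex; na=True also drops posts with missing selftext, .copy() avoids SettingWithCopyWarning later
posts = posts[~posts["selftext"].str.contains(r"\[removed\]|\[deleted\]|^$", case=False, regex=True, na=True)].copy()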
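
To see which values this pattern catches, the same regular expression can be tried on a few strings with the standard `re` module. This is only an illustrative sketch: the helper name `is_unusable` and the sample strings are made up for the example, and `re.IGNORECASE` stands in for `case=False`.

```python
import re

# Same pattern as in the filter above; IGNORECASE mirrors case=False.
PATTERN = re.compile(r"\[removed\]|\[deleted\]|^$", flags=re.IGNORECASE)

def is_unusable(selftext):
    """Return True if the text is empty or contains a removed/deleted tag."""
    return bool(PATTERN.search(selftext))

# Hypothetical sample values, not taken from the dataset.
samples = ["[removed]", "[Deleted]", "", "Im an undergraduate junior", "text [removed] text"]
flags = [is_unusable(s) for s in samples]
print(flags)
```

Note that the tags are matched anywhere in the text, so a post that merely quotes "[removed]" is dropped as well.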

In [28]:
#Remove punctuation and numbers and change to lowercase (clean lowercases by default):
from cleantext import clean

posts["selftext_clean"] = posts["selftext"].apply(lambda row: clean(row, no_punct=True, no_numbers=True))
posts["title_clean"] = posts["title"].apply(lambda row: clean(row, no_punct=True, no_numbers=True))


| | title | title_clean | selftext | selftext_clean |
|---|---|---|---|---|
| 1 | Should I switch labs (undergrad) | should i switch labs undergrad | Im an undergraduate junior doing work in a bio... | im an undergraduate junior doing work in a bio... |
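
To make the effect of the cleaning step concrete, here is a rough standard-library approximation of the three operations (drop punctuation, drop numbers, lowercase). It is a simplified stand-in, not the `cleantext` implementation: `simple_clean` is a hypothetical helper, it only handles ASCII punctuation, and the library may treat numbers, unicode and whitespace differently.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def simple_clean(text):
    """Approximate the cleaning step: no punctuation, no digits, lowercase."""
    text = str(text).translate(_PUNCT_TABLE)  # remove punctuation
    text = re.sub(r"\d+", "", text)           # remove runs of digits
    text = text.lower()                       # lowercase
    return " ".join(text.split())             # collapse leftover whitespace

title_clean = simple_clean("Should I switch labs (undergrad)")
print(title_clean)
```

On the title shown above, this sketch produces the same `title_clean` value.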


Save the cleaned `posts` DataFrame as a pickle so it can be reloaded later without repeating the cleaning:

In [32]:
posts.to_pickle("../data/interim/clean_posts.pkl")

In [30]:
posts = pd.read_pickle("../data/interim/clean_posts.pkl")