# Lemmatize Forum Text
Cleans and lemmatizes the forum text, then builds the dictionary and corpus used for topic modelling.

## Data Sources
- youbemom-merged.db


## Changes
- 2020-09-14: Created
- 2020-09-15: Added functions for accessing database, cleaning/tokenizing text
- 2020-09-16: Generated and saved corpus and dictionary
- 2020-12-13: Updated for new data
- 2021-01-26: Updated for new clean text

## TODO
- Tutorial: https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21
- Can I tokenize emoji?
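
On the emoji question, one possible approach is a regex alternation that matches either a run of word characters or a single emoji. The sketch below uses only the standard `re` module. The code-point ranges are an assumption that covers many common emoji, not an exhaustive list, and the function name `tokenize_with_emoji` is hypothetical. Since `RegexpTokenizer` takes a regex pattern, a similar pattern could probably be passed to it, but that is untested here.

```python
import re

# Assumption: these ranges cover many common emoji, not all of them.
# Variation selectors and zero-width joiners are not matched, so
# multi-code-point emoji are split into their component symbols.
EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
TOKEN_RE = re.compile(r"\w+|" + EMOJI)


def tokenize_with_emoji(text):
    """Return lowercase word tokens plus each emoji as its own token."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize_with_emoji("So tired 😴 today!")
```

Punctuation is dropped because it matches neither alternative; each emoji survives as a separate token.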

## Imports

In [1]:
import sqlite3
import pandas as pd
import numpy as np
from datetime import datetime
from pathlib import Path
from io import FileIO
import re
import string
# tokenizing
from nltk.tokenize import RegexpTokenizer #, word_tokenize
import spacy
# saving the corpus and dictionary
from gensim import corpora, models
import pickle
# my functions
from scraping import create_connection
from lemmatize import *

Prerequisite: download the NLTK data below once (uncomment and run).

In [2]:
# import nltk
# nltk.download('wordnet')
# nltk.download('stopwords')

## Functions

`process_data` reads one subforum/group from the database, cleans the text, and saves the results.

In [3]:
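def process_data(subforum="special-needs", group="parent"):
    """Load posts for one subforum/group, clean them, and save the results."""
    conn = create_connection(path_db)
    try:
        # gen_sql and clean_data are expected to come from `from lemmatize import *`
        sql = gen_sql(subforum, group)
        df = pd.read_sql_query(sql, conn)
    finally:
        conn.close()  # release the connection even if the query fails
    text = clean_data(df)
    save_data(text, subforum, group)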
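
One way to guarantee that a SQLite connection is closed even when a query raises is `contextlib.closing`. The sketch below is a standalone illustration using only the standard library. The `posts` table, its columns, and the helper name `fetch_rows` are hypothetical and are not the schema of `youbemom-merged.db`.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def fetch_rows(db_path, sql, params=()):
    """Run a query and close the connection afterwards, even on error."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(sql, params).fetchall()


# Demo against a throwaway database with a hypothetical schema
with tempfile.TemporaryDirectory() as tmp:
    db_path = str(Path(tmp) / "demo.db")
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE posts (subforum TEXT, body TEXT)")
        conn.execute("INSERT INTO posts VALUES ('toddler', 'nap time')")
        conn.commit()
    rows = fetch_rows(
        db_path, "SELECT body FROM posts WHERE subforum = ?", ("toddler",)
    )
```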

`save_data` pickles the lemmatized text, then builds and saves a gensim dictionary and bag-of-words corpus.

In [4]:
def save_data(text, subforum="special-needs", group="parent"):
    # lemmatized documents (one token list per document)
    with open(path_lemma_pkl.format(subforum, group), 'wb') as f:
        pickle.dump(text, f)
    # token -> id mapping
    dictionary = corpora.Dictionary(text)
    with FileIO(path_dictionary_gensim.format(subforum, group), "wb") as f:
        dictionary.save(f)
    # bag-of-words corpus: (token id, count) pairs per document
    corpus = [dictionary.doc2bow(t) for t in text]
    with open(path_corpus_pkl.format(subforum, group), 'wb') as f:
        pickle.dump(corpus, f)
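
To make the dictionary and corpus step concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is a simplified stand-in, not gensim's implementation: the helper names `build_vocab` and `to_bow` are hypothetical, and the token ids it assigns will not necessarily match those from `corpora.Dictionary`.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one token list into sorted (token id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Toy documents standing in for lemmatized forum posts
docs = [["sleep", "toddler", "sleep"], ["toddler", "school"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
```

Each document becomes a sparse list of pairs, which is the general shape of input that gensim's topic models consume.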

## File Locations

In [5]:
p = Path.cwd()
path_parent = p.parents[0]

In [6]:
path_db = str(path_parent / "database" / "youbemom-merged.db")
# {0} = subforum, {1} = group; filled in with str.format when saving
path_lemma_pkl = str(path_parent / "clean_data" / "lemmatized_text_{0}_{1}.pkl")
path_corpus_pkl = str(path_parent / "clean_data" / "corpus_{0}_{1}.pkl")
path_dictionary_gensim = str(path_parent / "clean_data" / "dictionary_{0}_{1}.gensim")
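
The `{0}` and `{1}` placeholders in these paths are filled in later with the subforum and group names. A small sketch of how one template expands, using a hypothetical relative `clean_data` directory instead of the notebook's real parent folder:

```python
from pathlib import Path

# Hypothetical base directory, for illustration only
base = Path("clean_data")

# Same pattern as path_corpus_pkl: a string template with two placeholders
template = str(base / "corpus_{0}_{1}.pkl")

# {0} is replaced by the subforum and {1} by the group
filled = Path(template.format("special-needs", "parent"))
```

Here `filled.name` is `corpus_special-needs_parent.pkl`, so each subforum/group pair gets its own output file.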

## Process Data

In [7]:
process_data("special-needs", "all")

In [None]:
process_data("special-needs", "parent")

In [None]:
process_data("special-needs", "child")

In [None]:
process_data("toddler", "parent")