# Lemmatize Forum Text
Cleans, tokenizes, and lemmatizes the forum text, then builds the dictionary and corpus used for topic modelling.

## Data Sources
- youbemomTables.db (scraped with 1-Scrape_Forum.ipynb)
    - sentiments table (created with 2-Sentiment_Analysis.ipynb)


## Changes
- 2020-09-14: Created
- 2020-09-15: Added functions for accessing database, cleaning/tokenizing text
- 2020-09-16: Generated and saved corpus and dictionary

## Database Structure

## TODO
- Tutorial: https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21
- Add actual database structure

## Imports

In [1]:
import sqlite3
import pandas as pd
import numpy as np
from datetime import datetime
from pathlib import Path
from io import FileIO
import re
import string
# tokenizing
from nltk.corpus import wordnet as wn
from nltk.tokenize import RegexpTokenizer #, word_tokenize
from nltk.corpus import stopwords
# saving the corpus and dictionary
from gensim import corpora, models
import pickle

Prerequisite: download the NLTK WordNet and stopword data once (uncomment to run).

In [2]:
#import nltk
#nltk.download('wordnet')
#nltk.download('stopwords')

## Functions
For accessing the database

In [3]:
def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by the db_file
    :param db_file: database file
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
    except sqlite3.Error as err:
        print(err)
    return conn
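
As a usage sketch, the helper can be paired with `contextlib.closing` so the connection is released even when a query fails. The in-memory database and the `posts` table below are assumptions for illustration only; they are not part of youbemomTables.db.

```python
import sqlite3
from contextlib import closing


def create_connection(db_file):
    """Return a connection to the SQLite database, or None on failure."""
    try:
        return sqlite3.connect(db_file)
    except sqlite3.Error as err:
        print(err)
        return None


# ":memory:" and the "posts" table are hypothetical stand-ins for the real database
with closing(create_connection(":memory:")) as conn:
    conn.execute("CREATE TABLE posts (title TEXT, body TEXT)")
    conn.execute("INSERT INTO posts VALUES (?, ?)", ("hello", "first post"))
    rows = conn.execute("SELECT title, body FROM posts").fetchall()
```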

For formatting the data

In [4]:
def clean_text(text):
    """ cleans the input text of punctuation, extra
        spaces, and makes letters lower case
    :param text: text (= title + body here)
    :return clean: clean text
    """
    clean = "".join([t for t in text if t not in string.punctuation])
    clean = re.sub(" +", " ", clean)
    clean = clean.strip()
    clean = clean.lower()
    return clean
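
A quick check of `clean_text` on a made-up post (the sample string is an assumption) shows one side effect worth knowing: stripping punctuation fuses contractions, so "don't" becomes "dont".

```python
import re
import string


def clean_text(text):
    """Strip punctuation, collapse repeated spaces, trim, and lowercase."""
    clean = "".join([t for t in text if t not in string.punctuation])
    clean = re.sub(" +", " ", clean)
    return clean.strip().lower()


sample = "  Don't   PANIC!!  It's fine...  "
cleaned = clean_text(sample)
print(cleaned)
```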

In [5]:
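def remove_stopwords(text):
    """ remove all stop words from the tokenized text
        using stopwords from nltk.corpus
    :param text: list of tokens with stopwords
    :return words: list of tokens without stopwords
    """
    # build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    words = [w for w in text if w not in stop_words]
    return words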
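
Because `remove_stopwords` expects a list of tokens rather than a raw string, the filtering step can be sketched without the NLTK download by using a small hand-picked stopword set. `STOP_WORDS` and `remove_stopwords_with` below are hypothetical; NLTK's English list is much longer.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {"the", "is", "a", "to", "and", "my"}


def remove_stopwords_with(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stopword set."""
    return [w for w in tokens if w not in stop_words]


tokens = ["my", "toddler", "is", "refusing", "to", "nap"]
filtered = remove_stopwords_with(tokens)
```

Membership tests against a set take constant time on average, which matters when the filter runs over every token of every post.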

In [6]:
def encode_utf8(text):
    """ encode each token in the list as UTF-8 bytes """
    words = [w.encode() for w in text]
    return words

In [7]:
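def get_lemma(word):
    """ look up the WordNet base form of a word
    :param word: single token
    :return: lemma, or the word itself if WordNet has no match
    """
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma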

In [8]:
def lemmatize(text):
    """ lemmatize every token in a list of tokens """
    lemmas = [get_lemma(w) for w in text]
    return lemmas
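
The fallback in `get_lemma` matters because `wn.morphy` returns `None` for words it cannot resolve. A minimal sketch of that behaviour, with a hypothetical `TOY_LEMMAS` dictionary standing in for WordNet so it runs without the corpus download:

```python
# Hypothetical lookup table standing in for wn.morphy
TOY_LEMMAS = {"kids": "kid", "running": "run", "went": "go"}


def get_lemma(word, lookup=TOY_LEMMAS.get):
    """Return the lemma from lookup, or the word itself if lookup gives None."""
    lemma = lookup(word)
    return word if lemma is None else lemma


def lemmatize(tokens):
    """Lemmatize every token in a list."""
    return [get_lemma(w) for w in tokens]


lemmas = lemmatize(["kids", "running", "ybm"])
```

Unknown tokens such as forum slang pass through unchanged instead of being dropped.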

## File Locations

In [9]:
p = Path.cwd()
path_parent = p.parents[0]
path_db = path_parent / "database" / "youbemomTables.db"
path_db = str(path_db)
path_lemma_pkl = path_parent / "clean_data" / "lemmatized_text.pkl"
path_corpus_pkl = path_parent / "clean_data" / "corpus.pkl"
path_dictionary_gensim = path_parent / "clean_data" / "dictionary.gensim"
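
The save steps later in the notebook assume the `clean_data` directory already exists. One way to guard against a missing directory, sketched here against a temporary folder (the `base` path is an assumption standing in for `path_parent`):

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for path_parent in this sketch
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    path_clean = base / "clean_data"
    # parents=True creates missing ancestors; exist_ok=True makes reruns safe
    path_clean.mkdir(parents=True, exist_ok=True)
    path_lemma_pkl = path_clean / "lemmatized_text.pkl"
    created = path_clean.is_dir()
    pkl_name = path_lemma_pkl.name
```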

## Load Data

In [10]:
conn = sqlite3.connect(path_db)
df = pd.read_sql_query("SELECT * FROM sentiments", conn)
# release the database file once the data is in memory
conn.close()

## Tokenize/Lemmatize Text
Strip punctuation and lowercase the text, tokenize it, and remove stopwords. Then lemmatize the remaining tokens and save the result.

In [11]:
text = df['text']
text = text.apply(clean_text)
tokenizer = RegexpTokenizer(r'\w+')
text = text.apply(tokenizer.tokenize)
text = text.apply(remove_stopwords)
text = text.apply(lemmatize)
# use a context manager so the file handle is closed after writing
with open(path_lemma_pkl, 'wb') as f:
    pickle.dump(text, f)
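
To check that the pickled tokens can be read back in a later notebook, a round trip like the following may help. The toy token lists and the temporary file are assumptions; the notebook itself pickles a pandas Series.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for the lemmatized Series of token lists
tokens = [["toddler", "refuse", "nap"], ["school", "lunch", "idea"]]

with tempfile.TemporaryDirectory() as tmp:
    path_pkl = Path(tmp) / "lemmatized_text.pkl"
    # context managers close the file handles even if pickling fails
    with open(path_pkl, "wb") as f:
        pickle.dump(tokens, f)
    with open(path_pkl, "rb") as f:
        restored = pickle.load(f)
```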

Create a gensim dictionary and a bag-of-words corpus from the lemmatized tokens, and save both.

In [12]:
dictionary = corpora.Dictionary(text)
# pass the path as a string so gensim opens and closes the file itself
dictionary.save(str(path_dictionary_gensim))

In [13]:
corpus = [dictionary.doc2bow(t) for t in text]
# use a context manager so the file handle is closed after writing
with open(path_corpus_pkl, 'wb') as f:
    pickle.dump(corpus, f)
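
For intuition about what `doc2bow` produces, each document becomes a list of `(token_id, count)` pairs. The sketch below mimics that idea with the standard library only; `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will not necessarily match gensim's.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


# Toy documents, not taken from the forum data
docs = [["nap", "toddler", "nap"], ["toddler", "school"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
```

Word order is discarded in this representation; only the count of each token per document is kept.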