# Problem Statement

The objective of this project is to build a machine learning model capable of accurately classifying Reddit posts into two categories: posts from the r/movies subreddit and posts from the r/books subreddit. The model will assist in automating the process of categorizing posts, potentially improving the user experience by providing targeted content and aiding in content moderation.

## Imports

In [21]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import pos_tag
import nltk

import pandas as pd
import string
import praw
import re

## NLTK Downloads

In [2]:
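# Note: WordNetLemmatizer and nltk.word_tokenize, used later, also need the
# 'wordnet' and 'punkt' resources (names may vary by NLTK version) if they
# are not already installed.
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')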

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/erikmercado/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/erikmercado/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

# Data Collection

This section covers data collection. Using Reddit's API through `praw`, I fetch posts from the `movies` and `books` subreddits across five listing categories: 'hot', 'new', 'top', 'rising', and 'controversial'. The function `fetch_posts` requests the same number of posts from each category so that no single listing dominates the dataset.

For each post, the title and selftext are concatenated into a single text document that represents the post.

## API Setup

In [3]:
reddit = praw.Reddit(client_id=None,
                     client_secret=None,
                     user_agent=None,
                     username=None,
                     password=None)


## Utility Functions

In [4]:
def fetch_posts(subreddit_name, limit=100):
    """
    Fetch an equal number of posts from different categories within a subreddit.
    
    :param subreddit_name: The name of the subreddit from which to fetch posts.
    :param limit: The total maximum number of posts to fetch across all categories.
    :return: A list of posts.
    """
    subreddit = reddit.subreddit(subreddit_name)
    categories = ['hot', 'new', 'top', 'rising', 'controversial']
    posts = []
    limit_per_category = limit // len(categories)
    
    try:
        for category in categories:
            submissions = getattr(subreddit, category)(limit=limit_per_category)
            for submission in submissions:
                posts.append(submission.title + " " + submission.selftext)
    except Exception as e:
        print(f"An error occurred: {e}")
    
    return posts
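
Because the same submission can appear in more than one listing (for example in both 'hot' and 'top'), the combined list may contain duplicates. A minimal sketch of one way to drop them, assuming each submission exposes `id`, `title`, and `selftext` attributes as PRAW submissions do. The helper name `dedupe_submissions` and the stand-in objects are hypothetical and not part of the original notebook:

```python
from types import SimpleNamespace


def dedupe_submissions(submissions):
    """Return title + selftext strings, keeping the first copy of each submission id."""
    seen_ids = set()
    texts = []
    for submission in submissions:
        if submission.id in seen_ids:
            continue  # Already collected from another listing
        seen_ids.add(submission.id)
        texts.append(submission.title + " " + submission.selftext)
    return texts


# Stand-in objects for illustration only; real code would pass PRAW submissions.
sample = [
    SimpleNamespace(id="a1", title="Dune review", selftext="Loved it."),
    SimpleNamespace(id="b2", title="Favorite novels?", selftext=""),
    SimpleNamespace(id="a1", title="Dune review", selftext="Loved it."),
]
unique_texts = dedupe_submissions(sample)
print(len(unique_texts))
```

Whether deduplication is worth doing here depends on how much the listings overlap in practice, which this notebook does not measure.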


# Text Preprocessing

Text preprocessing is a critical step in natural language processing (NLP) tasks. The function `preprocess_text` performs several standard preprocessing steps on the raw text data:

1. Converting all text to lowercase to ensure uniformity.
2. Removing URLs, special characters, and numbers to focus on the textual content.
3. Eliminating stopwords that do not contribute to the overall meaning.
4. Lemmatizing the words, which involves converting each word to its base or dictionary form.

These preprocessing steps are designed to clean and standardize the text data, thereby making it more suitable for modeling and analysis.
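
To make steps 1 to 3 concrete, here is a small self-contained sketch that applies the same kind of cleaning to one sample string. The function name `clean_text_demo`, the sample sentence, and the three-word stopword set are assumptions for illustration; the set is only a stand-in for NLTK's English stopword list, and lemmatization (step 4) is left out so the example runs without NLTK data:

```python
import re

# Tiny stand-in for NLTK's English stopword list (illustration only).
DEMO_STOPWORDS = {"was", "the", "a"}


def clean_text_demo(text):
    text = text.lower()                        # Step 1: lowercase
    text = re.sub(r'\w+://\S+', '', text)      # Step 2a: remove URLs
    text = re.sub(r'[^a-z\s]', '', text)       # Step 2b: remove non-letters
    text = re.sub(r'\s+', ' ', text).strip()   # Collapse leftover whitespace
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS]  # Step 3
    return ' '.join(tokens)


cleaned = clean_text_demo("Check https://example.com NOW!!  Dune (2021) was great")
print(cleaned)  # check now dune great
```

Note that removing digits also removes release years such as "2021", which may or may not matter for telling the two subreddits apart.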


In [5]:
def preprocess_text(text):
    """
    Preprocesses text by removing URLs, special characters, numbers, and stopwords, and by performing lemmatization.
    """

    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\w+:\/\/\S+', '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.MULTILINE)  # Remove special characters and numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove excessive whitespace

    tokens = text.split()

    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)


# Tokenization and POS Tagging

After preprocessing the text, the next step involves tokenization and part-of-speech (POS) tagging using the function `tokenize_and_tag`. Tokenization breaks down the text into individual words or tokens. POS tagging assigns grammatical categories to each token, such as noun, verb, adjective, etc.

POS tags provide additional contextual information that may be useful for distinguishing between the language used in `movies` and `books` posts. For instance, I might expect more past tense verbs in `books` discussions relating to narratives and storylines.

By combining preprocessed text with POS tags, I create a rich set of features that could improve the performance of my classification model.


In [6]:
def tokenize_and_tag(text):
    tokens = nltk.word_tokenize(text)
    # Remove punctuation from tokens
    tokens = [token for token in tokens if token not in string.punctuation]
    tagged_tokens = pos_tag(tokens)
    # Combine tokens and tags, exclude punctuation tags
    tagged_tokens_str = ["{}_{}".format(token.lower(), tag) for token, tag in tagged_tokens if tag not in string.punctuation]
    return ' '.join(tagged_tokens_str)
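
The `token_TAG` output format can be illustrated without running the tagger. The sketch below applies the same string formatting to a few hand-written pairs; the helper name `format_tagged` and the example tags are assumptions chosen for illustration, not output produced by `pos_tag`:

```python
def format_tagged(tagged_tokens):
    """Join (token, tag) pairs into 'token_TAG' strings separated by spaces."""
    return ' '.join("{}_{}".format(token.lower(), tag) for token, tag in tagged_tokens)


# Hand-written Penn Treebank style tags for illustration, not real tagger output.
example_pairs = [("I", "PRP"), ("loved", "VBD"), ("the", "DT"), ("book", "NN")]
formatted = format_tagged(example_pairs)
print(formatted)  # i_PRP loved_VBD the_DT book_NN
```

Each `token_TAG` pair contains no whitespace, so it stays a single unit when the string is later split on whitespace.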

# Dataframe Creation and Export

With my text data preprocessed and POS tags generated, I compile my dataset into a Pandas DataFrame. This DataFrame includes the preprocessed text, POS tags, a combination of text and POS tags, and the corresponding labels (`0` for movies and `1` for books).

The final step in the data preparation process is to save this DataFrame to a CSV file. This file will serve as an input for the exploratory data analysis (EDA) and modeling stages of my project.

## Fetch Posts

In [7]:
movies_posts = fetch_posts('movies', limit=2000)
books_posts = fetch_posts('books', limit=2000)
print(f"Fetched {len(movies_posts)} posts from movies and {len(books_posts)} posts from books")

Fetched 1624 posts from movies and 1323 posts from books
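
Both calls returned fewer than the 2,000 posts requested, and the two classes differ in size. A quick way to quantify the imbalance, sketched here with the counts printed above and the labeling convention described earlier (`0` for movies, `1` for books); the variable names are illustrative:

```python
from collections import Counter

# Counts taken from the fetch output above.
n_movies, n_books = 1624, 1323

# Labeling convention: 0 = movies, 1 = books.
labels = [0] * n_movies + [1] * n_books
counts = Counter(labels)
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

for label, share in sorted(shares.items()):
    print(f"label {label}: {counts[label]} posts ({share:.1%})")
```

This works out to roughly a 55% / 45% split between movies and books, which is worth keeping in mind when interpreting accuracy later.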


## Preprocessing

In [8]:
preprocessed_text = [preprocess_text(text) for text in movies_posts + books_posts]

# Generate POS tags for the original (not preprocessed) text
pos_tags = [tokenize_and_tag(text) for text in movies_posts + books_posts]

# Combine preprocessed text with POS tags
text_and_pos = [text + " " + pos for text, pos in zip(preprocessed_text, pos_tags)]

# Create DataFrame
data = pd.DataFrame({
    'text': preprocessed_text,
    'pos': pos_tags,
    'text_and_pos': text_and_pos, 
    'label': [0]*len(movies_posts) + [1]*len(books_posts)
})



data.to_csv('movies_books_dataset.csv', index=False)
print("Dataset saved to movies_books_dataset.csv.")

Dataset saved to movies_books_dataset.csv.


# Conclusion

In this notebook, I successfully collected, preprocessed, and saved a dataset of Reddit posts from the `movies` and `books` subreddits. I have prepared the ground for the following stages, which will involve exploring this dataset to gain insights and developing a classification model that can accurately categorize new posts.

My dataset is roughly balanced (1,624 `movies` posts and 1,323 `books` posts) and includes cleaned text, POS-tagged text, and a combination of the two, providing a solid starting point for building and training my machine learning models.
