# Data Cleaning, EDA and Models
This notebook imports the data collected from my selected subreddits, then cleans and explores it. Initial models are built and tested.

## Imports

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

## Read in the data

~~Something went wrong during data collection so I've added a subreddit column to use as target.~~

In [2]:
unresolved = pd.read_csv('../data/unresolved.csv').drop(columns='Unnamed: 0')
#unresolved['subreddit'] = 'unresolved'
unresolved.head(2)

Unnamed: 0,author,awarders,created_utc,selftext,subreddit,title
0,TheBonesOfAutumn,[],1643236773,"On April 8th, 1981, 19-year-old David Huff dec...",UnresolvedMysteries,"In April of 1981, the body of 23-year-old Shar..."
1,Skoodilypoop666,[],1643233541,"In the fall of 2021, The small town of London ...",UnresolvedMysteries,“Whodunnit” the murder of 62 year old Bryan Mc...


In [3]:
unsolved = pd.read_csv('../data/unsolved.csv').drop(columns='Unnamed: 0')
#unsolved['subreddit'] = 'unsolved'
unsolved.head(2)

Unnamed: 0,author,awarders,created_utc,selftext,subreddit,title
0,amkakis,[],1643239264,,UnsolvedMysteries,An 18 year old leaves home to retrieve a purse...
1,Once_a_TQ,[],1643227462,,UnsolvedMysteries,Search continues for retired Cape Breton veter...


In [4]:
unsolved['selftext'].isna().sum()

1090

In [5]:
unsolved['awarders'].value_counts()

[]    1099
Name: awarders, dtype: int64

In [6]:
unresolved['awarders'].value_counts()

[]    1009
Name: awarders, dtype: int64

**'selftext' is missing in almost all UnsolvedMysteries posts (1,090 of 1,099), so I will focus on the title only for now.**

## Data Cleaning

### Concat/Merge Data
Drop the 'selftext' and 'awarders' columns from both datasets ('awarders' is always an empty list) and concatenate them.

In [7]:
unsolved = unsolved.drop(columns=['awarders', 'selftext'])
unresolved = unresolved.drop(columns=['awarders', 'selftext'])

In [8]:
data = pd.concat([unsolved, unresolved], ignore_index=True)
data.head()

Unnamed: 0,author,created_utc,subreddit,title
0,amkakis,1643239264,UnsolvedMysteries,An 18 year old leaves home to retrieve a purse...
1,Once_a_TQ,1643227462,UnsolvedMysteries,Search continues for retired Cape Breton veter...
2,lexx999,1643224623,UnsolvedMysteries,"On December 31 2021, around 01:50, every trace..."
3,lexx999,1643224227,UnsolvedMysteries,"On December 31 2021, around 01:50, every trace..."
4,010010100100011001,1643216812,UnsolvedMysteries,McMartin Preschool Satanic abuse: 20-minute pr...


In [9]:
data.shape

(2108, 4)

### Drop duplicates

In [10]:
data.duplicated().sum()

1

In [11]:
data = data.drop_duplicates()
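
`data.duplicated()` compares whole rows, including `created_utc`, so a repost of the same title by the same author at a different time is not flagged. Rows 2 and 3 in the preview above look like that kind of repost, although the titles are truncated so I can't be sure. Here is a minimal sketch of a stricter check on a made-up toy frame; using `['author', 'title']` as the subset is my assumption, not something the notebook does.

```python
import pandas as pd

# Toy frame standing in for `data`; all values are made up for illustration.
toy = pd.DataFrame({
    "author": ["a", "a", "b", "b"],
    "created_utc": [100, 200, 300, 300],
    "subreddit": ["X", "X", "Y", "Y"],
    "title": ["same title", "same title", "other", "other"],
})

# Exact copies only: every column must match.
full_row_dupes = toy.duplicated().sum()

# Reposts: same author and title, timestamp ignored.
repost_dupes = toy.duplicated(subset=["author", "title"]).sum()

# Keep the first occurrence of each (author, title) pair.
deduped = toy.drop_duplicates(subset=["author", "title"]).reset_index(drop=True)
```

Whether reposts should be dropped depends on the goal: keeping them lets near-identical titles land in both the train and test sets.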

### Binarize target variable
UnresolvedMysteries == 1

UnsolvedMysteries == 0

In [12]:
data['subreddit'] = data['subreddit'].map(
    {
        'UnresolvedMysteries': 1,
        'UnsolvedMysteries': 0,
    }
)
data.head(2)

Unnamed: 0,author,created_utc,subreddit,title
0,amkakis,1643239264,0,An 18 year old leaves home to retrieve a purse...
1,Once_a_TQ,1643227462,0,Search continues for retired Cape Breton veter...
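
One thing to watch with `Series.map` and a dict: any label that is not a key in the dict silently becomes `NaN`. A quick sketch of a sanity check, on a toy series with a deliberately misspelled label (the toy values are assumptions for illustration):

```python
import pandas as pd

# Toy labels; the last one is deliberately not a key in the mapping.
labels = pd.Series(["UnresolvedMysteries", "UnsolvedMysteries", "unsolvedmysteries"])
mapping = {"UnresolvedMysteries": 1, "UnsolvedMysteries": 0}

encoded = labels.map(mapping)

# Unmapped labels show up as NaN, so count them before modelling.
n_unmapped = int(encoded.isna().sum())
```

On the real data the same check would be `data['subreddit'].isna().sum()` right after the mapping cell.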


## EDA
### Let's add a length and word count based on title and examine distributions.


In [13]:
data['title_len'] = data['title'].str.len()

In [14]:
data['title_count'] = data['title'].map(lambda x: len(x.split()))
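
The lambda above works because every title here is a string. A sketch of an equivalent version that uses the pandas string accessor for both features and returns `NaN` instead of raising if a title is ever missing (toy titles, made up for illustration):

```python
import pandas as pd

# Toy titles, including one missing value.
titles = pd.Series(["An 18 year old leaves home", "Search continues", None])

title_len = titles.str.len()            # characters per title
title_count = titles.str.split().str.len()  # whitespace-separated words per title
```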

### **Good place to add some kind of sentiment score**

In [15]:
# Sanity check
data.head(2)

Unnamed: 0,author,created_utc,subreddit,title,title_len,title_count
0,amkakis,1643239264,0,An 18 year old leaves home to retrieve a purse...,107,21
1,Once_a_TQ,1643227462,0,Search continues for retired Cape Breton veter...,67,10


### Full data pairplot

In [16]:
# With help from: https://github.com/matplotlib/ipympl/issues/25
# plt.style.use(['seaborn-whitegrid'])
# sns.pairplot(data.drop(columns='subreddit'), corner=True);

There are two outliers where title length doesn't align with title count. Nothing alarming though.

Title length and title count both look roughly bell-shaped with a right skew, which is expected: most posts have typical-length titles and a few have much longer ones.

Let's run two more pairplots, one for each subreddit.

### UnresolvedMysteries pairplot

In [17]:
# With help from: https://github.com/matplotlib/ipympl/issues/25
# plt.style.use(['seaborn-whitegrid'])
# sns.pairplot(data[data['subreddit'] == 1].drop(columns='subreddit'), corner=True);

### UnsolvedMysteries pairplot

In [18]:
# With help from: https://github.com/matplotlib/ipympl/issues/25
# plt.style.use(['seaborn-whitegrid'])
# sns.pairplot(data[data['subreddit'] == 0].drop(columns='subreddit'), corner=True);

### Summary Statistics

In [19]:
data.describe()

Unnamed: 0,created_utc,subreddit,title_len,title_count
count,2107.0,2107.0,2107.0,2107.0
mean,1635176000.0,0.478405,99.335548,16.843854
std,5428918.0,0.499652,69.133024,12.282324
min,1622510000.0,0.0,1.0,1.0
25%,1632448000.0,0.0,50.0,8.0
50%,1635985000.0,0.0,79.0,13.0
75%,1639200000.0,1.0,126.0,22.0
max,1643239000.0,1.0,301.0,61.0


### EDA notes

The classes are fairly balanced (about 48% UnresolvedMysteries and 52% UnsolvedMysteries). The numerical features are right-skewed rather than normal: the mean title length (99.3) sits well above the median (79). This looks like a good dataset for NLP, where the goal is to classify the subreddit from the words in the title of each post.
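
A quick way to put a number on the shape of a distribution is the sample skewness, together with the gap between mean and median. This is a sketch on made-up title lengths, not the real column; on the notebook's data the call would be `data['title_len'].skew()`.

```python
import pandas as pd

# Made-up title lengths with a long right tail.
lengths = pd.Series([20, 35, 50, 60, 79, 90, 126, 200, 301])

skewness = lengths.skew()                            # > 0 means right-skewed
mean_minus_median = lengths.mean() - lengths.median()  # > 0 points the same way
```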

## Preprocessing

### Train-Test split
Start with 'title' only to build our baseline and first models

In [20]:
X = data['title']
y = data['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1331)
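
With no `test_size` given, `train_test_split` holds out 25% of the rows, and `stratify=y` keeps the class proportions the same in both parts. A small sketch on toy labels (the 52/48 split is made up to resemble this dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 52 of class 0 and 48 of class 1.
y_toy = np.array([0] * 52 + [1] * 48)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=1331
)

test_share = len(y_te) / len(y_toy)   # default hold-out fraction
train_pos_rate = y_tr.mean()          # share of class 1 in the train part
test_pos_rate = y_te.mean()           # share of class 1 in the test part
```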

### Baseline model
Count Vectorizer and Dummy Classifier

In [21]:
cv = CountVectorizer()
X_train_transformed = cv.fit_transform(X_train)
X_test_transformed = cv.transform(X_test)

In [22]:
# DummyClassifier ignores the features; it only uses the class distribution of y_train
null_model = DummyClassifier()
null_model.fit(X_train_transformed, y_train)
null_model.score(X_test_transformed, y_test)

0.5218216318785579

In [23]:
y_test.value_counts(normalize=True)

0    0.521822
1    0.478178
Name: subreddit, dtype: float64
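
The baseline score matches the share of the majority class in `y_test`, which is what a classifier that always predicts the most common class would get. In recent scikit-learn releases the default `DummyClassifier` strategy is `'prior'`, which predicts that way; older releases defaulted to a different strategy, so a sketch that sets the strategy explicitly is safer. Toy labels below are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels; the features are irrelevant to a dummy model.
y_train_toy = np.array([0] * 6 + [1] * 4)
y_test_toy = np.array([0] * 3 + [1] * 2)
X_train_toy = np.zeros((10, 1))
X_test_toy = np.zeros((5, 1))

# Always predict the most common class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_acc = baseline.score(X_test_toy, y_test_toy)

# Accuracy of that rule equals the majority-class share of the test labels.
majority_share = np.bincount(y_test_toy).max() / len(y_test_toy)
```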

## Models
A Naive Bayes Classifier and..

### Naive Bayes and CountVectorizer Pipeline. Initial score: 64.71% accuracy

In [24]:
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

In [25]:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.6470588235294118

The initial score, without tuning any hyperparameters, is not great: 64.71% accuracy against a 52.18% baseline. Let's include the author feature to see if it improves the score.
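
For reference, tuning this kind of pipeline could look roughly like the sketch below, using the `GridSearchCV` already imported at the top. The toy titles, labels, and the parameter grid are all my assumptions for illustration; nothing here has been run on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up titles and labels, only to make the sketch runnable.
docs = [
    "body found in river remains unidentified",
    "woman vanished from small town in 1981",
    "unidentified remains found near highway",
    "man vanished after leaving town bar",
    "search continues for missing veteran",
    "police ask public for help in search",
    "missing teen last seen leaving home",
    "police release new video of missing man",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 1.0],
}

search = GridSearchCV(toy_pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

best_params = search.best_params_
best_cv_score = search.best_score_
```

On the real data, `search.fit(X_train, y_train)` with a larger `cv` would be the natural call, with `X_test` kept aside for one final score.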

### Include author feature and model.

In [29]:
X = data[['author', 'title']]
y = data['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1331)

In [31]:
X_train.shape

(1580, 2)

In [32]:
y_train.shape

(1580,)

In [33]:
X_test.shape

(527, 2)

In [34]:
y_test.shape

(527,)

In [30]:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

ValueError: Found input variables with inconsistent numbers of samples: [2, 1580]
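
The error comes from handing a two-column DataFrame to `CountVectorizer`. Iterating over a DataFrame yields its column labels, so the vectorizer sees two "documents" (`'author'` and `'title'`) instead of 1580 rows, and the classifier then gets 2 samples against 1580 labels. One possible fix, sketched on made-up data, is to vectorize each column separately with a column transformer. The `token_pattern` used for authors is my assumption, chosen to keep each username as a single token.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up rows standing in for data[['author', 'title']].
toy = pd.DataFrame({
    "author": ["alice", "bob", "alice", "carol", "dave", "bob"],
    "title": [
        "body found in river remains unidentified",
        "search continues for missing veteran",
        "woman vanished from small town",
        "police release new video",
        "unidentified remains found near highway",
        "police ask public for help",
    ],
})
y_toy = [1, 0, 1, 0, 1, 0]

# Reproduce the cause: the vectorizer treats the column labels as documents.
n_docs_seen = CountVectorizer().fit_transform(toy).shape[0]

# Pass each column name as a plain string (not a list) so that each
# vectorizer receives a 1-D column of text.
features = make_column_transformer(
    (CountVectorizer(token_pattern=r"\S+", lowercase=False), "author"),
    (CountVectorizer(), "title"),
)

model = make_pipeline(features, MultinomialNB())
model.fit(toy, y_toy)
preds = model.predict(toy)
```

Authors who only appear in the test set contribute nothing to the prediction, so any gain from this feature depends on how many authors post repeatedly.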