# Baseline and Model Benchmark
#### Credit to Aiden Curley for assistance

<a name="contents"></a>
- [Contents](#contents)  
    - [Imports](#imports)  
    - [Data Cleaning](#cleaning)  
    - [Baseline Score](#baseline)  
    - [Benchmark Model with Title and Contents](#benchmark)  
    - [Benchmark Model Score](#benchmark_score)  


<a name="imports"></a>
- [Back to Contents](#contents)
## Imports

In [1]:
# Import Libraries
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split, GridSearchCV
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB


In [2]:
amd = pd.read_csv('./Data/amd_big.csv')
buildpc = pd.read_csv('./Data/build_pc_big.csv')

In [3]:
amd.head(1)

Unnamed: 0.1,Unnamed: 0,subreddit,title,selftext
0,0,Amd,Low VRAM Issues?,[removed]


<a name="cleaning"></a>
- [Back to Contents](#contents)
## Data Cleaning

In [4]:
# Drop unnecessary columns
amd.drop(columns='Unnamed: 0', inplace=True)
buildpc.drop(columns='Unnamed: 0', inplace=True)

In [5]:
buildpc

Unnamed: 0,subreddit,title,selftext
0,buildapc,Windows 10,Hey guys just finished my build not able to ge...
1,buildapc,Is there a somewhat reasonably priced GPU that...,I have been rocking a Vega blower for a couple...
2,buildapc,How's This Budget Build?,Mobo: ASUS PRIMW B450M-A $109\n\nCPU: Ryzen 3...
3,buildapc,New case and liquid cooling,So I’m thinking about buying a new case and ad...
4,buildapc,Quick questions on possible GPUs,"I've currently got my Motherboard mATX, 3X 120..."
...,...,...,...
110320,buildapc,Looking for feedback on my first build.,This build is primarily a PC Part Picker build...
110321,buildapc,Sleeper Update (Final?),"So those of you who've seen my last post, I ha..."
110322,buildapc,27 Inch IPS 1440P 100HZ + Monitor for under 30...,"Heya!, im still on the hunt for a monitor :')\..."
110323,buildapc,CPU issue,Having an issue where randomly my CPU fan will...


In [6]:
amd.shape[0]

109740

In [7]:
amd.shape[0] - buildpc.shape[0]

-585

In [8]:
buildpc = buildpc[:-585]
buildpc.shape

(109740, 3)
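
The hard-coded `585` above only holds for the current CSV files. As a minimal sketch of the same truncation, the row count can be derived from the data instead. The frames `amd_demo` and `buildpc_demo` are hypothetical stand-ins, not the real data:

```python
import pandas as pd

# Hypothetical stand-ins for the two subreddit frames
amd_demo = pd.DataFrame({"subreddit": ["Amd"] * 3, "title": ["a", "b", "c"]})
buildpc_demo = pd.DataFrame({"subreddit": ["buildapc"] * 5, "title": list("vwxyz")})

# Keep the first n rows of each frame, where n is the size of the smaller one
n = min(len(amd_demo), len(buildpc_demo))
amd_bal = amd_demo.iloc[:n].copy()
# .copy() makes an independent frame, so later in-place edits do not raise SettingWithCopyWarning
buildpc_bal = buildpc_demo.iloc[:n].copy()

print(amd_bal.shape, buildpc_bal.shape)
```

Truncating from the end mirrors what the cell above does; sampling rows at random would be an alternative if row order carries a time trend.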

In [9]:
amd.head()

Unnamed: 0,subreddit,title,selftext
0,Amd,Low VRAM Issues?,[removed]
1,Amd,[Level1Techs] 32 Core Threadripper Workstation...,
2,Amd,Decent cheap AMD prebuilt systems? Canada,Trying to see if there's anything worth recomm...
3,Amd,The new drivers don't wanna download,I have been trying to update my drivers from t...
4,Amd,aorus redemption I on b550m,"last time, i said only gigabyte has not used d..."


In [10]:
# Check nan values
amd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109740 entries, 0 to 109739
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   subreddit  109740 non-null  object
 1   title      109740 non-null  object
 2   selftext   78415 non-null   object
dtypes: object(3)
memory usage: 2.5+ MB


In [11]:
buildpc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109740 entries, 0 to 109739
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   subreddit  109740 non-null  object
 1   title      109740 non-null  object
 2   selftext   106497 non-null  object
dtypes: object(3)
memory usage: 2.5+ MB


In [12]:
# Replace nan values
amd.fillna("", inplace=True)

In [13]:
buildpc = buildpc.fillna("")  # reassign instead of filling in place: buildpc is a slice, and in-place fillna raises SettingWithCopyWarning


In [14]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([amd, buildpc], ignore_index=True)

In [29]:
df['title_text_combined'] = df['title'] + ' ' + df['selftext']

In [30]:
df.head(2)

Unnamed: 0,subreddit,title,selftext,title_text_combined
0,Amd,Low VRAM Issues?,[removed],Low VRAM Issues? [removed]
1,Amd,[Level1Techs] 32 Core Threadripper Workstation...,,[Level1Techs] 32 Core Threadripper Workstation...
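
The earlier `fillna("")` step matters for this concatenation: adding a string to a missing value yields a missing value, so posts without `selftext` would lose their titles too. A small sketch on a made-up frame (`demo` is hypothetical):

```python
import pandas as pd

# Made-up frame: the second post has no selftext
demo = pd.DataFrame({"title": ["Low VRAM?", "New build"],
                     "selftext": ["help", None]})

# Without filling, the combined text is missing wherever selftext is missing
raw = demo["title"] + " " + demo["selftext"]

# Filling with an empty string keeps the title
filled = demo["title"] + " " + demo["selftext"].fillna("")

print(raw.isna().tolist())
print(filled.tolist())
```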


In [31]:
# First test on title and Content
X = df[['title_text_combined']]
y = df['subreddit']

In [32]:
X.shape

(219480, 1)

In [33]:
y.shape

(219480,)

<a name="baseline"></a>
- [Back to Contents](#contents)
## Baseline Score

In [44]:
y.value_counts(normalize=True) # Our baseline is 50.00%

buildapc    0.5
Amd         0.5
Name: subreddit, dtype: float64
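
The baseline is the accuracy of always predicting the most frequent class, which is the largest normalized value count. A minimal sketch on a made-up label series (`y_demo` is hypothetical):

```python
import pandas as pd

# Made-up balanced labels standing in for df['subreddit']
y_demo = pd.Series(["Amd", "buildapc", "Amd", "buildapc"], name="subreddit")

# Share of the most frequent class = accuracy of a majority-class guess
baseline = y_demo.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.2%}")
```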

<a name="benchmark"></a>
- [Back to Contents](#contents)
## Benchmark Model with Title and Contents

In [34]:
# Train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state=42)
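
With no `test_size` given, `train_test_split` holds out 25% of the rows, and `stratify=y` keeps the class proportions the same in both parts. A small sketch on made-up data (`X_demo` and `y_demo` are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up balanced data: 8 posts per class
y_demo = pd.Series(["Amd"] * 8 + ["buildapc"] * 8)
X_demo = pd.DataFrame({"text": [f"post {i}" for i in range(16)]})

# Default test_size is 0.25; stratify keeps the 50/50 class split in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```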

In [35]:
vectorizer = CountVectorizer(max_features=25000)

In [36]:
train_features = vectorizer.fit_transform(X_train['title_text_combined'])
test_features = vectorizer.transform(X_test['title_text_combined'])
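
The vocabulary is learned from the training text only; `transform` then counts just those words in the test text, and unseen words are ignored. A sketch on a made-up corpus (the documents and `demo_vec` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents
train_docs = ["ryzen cpu fan", "gpu driver update", "cpu cooler fan"]
test_docs = ["new gpu fan", "motherboard bios"]

# max_features caps the vocabulary at the most frequent training terms
demo_vec = CountVectorizer(max_features=4)
train_counts = demo_vec.fit_transform(train_docs)  # learns the vocabulary
test_counts = demo_vec.transform(test_docs)        # reuses it, learns nothing new

print(sorted(demo_vec.vocabulary_))
print(train_counts.shape, test_counts.shape)

# "motherboard bios" shares no words with the training text, so its row is all zeros
print(test_counts[1].sum())
```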

In [37]:
# Dense conversion is optional here: MultinomialNB also accepts the sparse matrix
train_features = train_features.toarray()

In [38]:
train_features.shape

(164610, 25000)
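
A rough estimate of what the dense array costs in memory, assuming 8-byte `int64` counts (the default `dtype` of `CountVectorizer`):

```python
# Shape of the dense training matrix reported above
rows, cols = 164610, 25000
bytes_per_value = 8  # assumption: int64 counts

dense_gib = rows * cols * bytes_per_value / 2**30
print(f"{dense_gib:.1f} GiB")
```

The sparse matrix stores only the non-zero counts, so it is far smaller for short posts.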

In [39]:
test_features.shape

(54870, 25000)

In [40]:
nb = MultinomialNB()

In [41]:
nb.fit(train_features, y_train)

MultinomialNB()

<a name="benchmark_score"></a>
- [Back to Contents](#contents)
## Benchmark Model Score

In [42]:
nb.score(train_features, y_train)

0.7855233582406901

In [43]:
nb.score(test_features, y_test)

0.7825405503918352
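
Accuracy alone does not show which subreddit gets misclassified more often. One possible next step, sketched on made-up posts, is to wrap the two stages in a `Pipeline` and inspect a confusion matrix. The texts, labels and step names below are hypothetical, and this is not a result from the real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up posts and labels
texts = ["ryzen driver crash", "radeon driver update",
         "first build parts list", "which case for my build"]
labels = ["Amd", "Amd", "buildapc", "buildapc"]

# Vectorizer and classifier chained so that raw text goes in directly
pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(texts, labels)

# Rows are true classes, columns are predicted classes, in the order given by labels=
preds = pipe.predict(texts)
cm = confusion_matrix(labels, preds, labels=["Amd", "buildapc"])
print(cm)
```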

In [None]:
# Benchmark model accuracy: 78.55% on the training set
# Benchmark model accuracy: 78.25% on the test set (baseline: 50.00%)
