## 02_NLP

**02_NLP** 

- Convert text to word-count or TF-IDF frequency vectors with [CountVectorizer/TfidfVectorizer](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)
- Remove stop words (from test?)
- Stemming and lemmatization
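
As a rough illustration of the first two bullets, the sketch below vectorizes a tiny corpus with both vectorizers. The two sentences are hypothetical and not taken from the project data, and `stop_words="english"` uses scikit-learn's built-in list, which may differ from the NLTK list used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (an assumption, not the project data).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Raw word counts, dropping scikit-learn's built-in English stop words.
cvec = CountVectorizer(stop_words="english")
counts = cvec.fit_transform(docs)
print(sorted(cvec.vocabulary_))
print(counts.toarray())

# TF-IDF weights: terms shared by both documents are down-weighted
# relative to terms that appear in only one of them.
tvec = TfidfVectorizer(stop_words="english")
weights = tvec.fit_transform(docs)
print(weights.toarray().round(2))
```

Both vectorizers return a sparse matrix with one row per document and one column per vocabulary term.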

**03_Classification_Modeling** 
Each document is an “input” and a class label is the “output” for our predictive algorithm.
For our $X$ variable, we will only use the `post` variable. For our $Y$ variable, we will only use the xx variable.

- Train/test split
- Identify and explain the baseline score
- Bayesian model
- Logistic regression, KNN, SVM
- Explanation of reasoning behind choosing production models
- Evaluate model performance
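
The checklist above could be sketched end to end as follows. The eight short texts, the `label` column name, and the `test_size` value are assumptions made for illustration; the real notebook uses the posts data instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real posts.
toy = pd.DataFrame({
    "text": [
        "solar power is clean", "wind farms cut emissions",
        "recycle plastic bottles", "plant more trees",
        "stock markets fell today", "interest rates rose again",
        "the bank cut rates", "shares hit a record",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    toy["text"], toy["label"],
    stratify=toy["label"], test_size=0.25, random_state=42,
)

# Baseline: accuracy from always predicting the majority class.
baseline = y_train.value_counts(normalize=True).max()

# Fit the vectorizer on the training text only, then reuse it on the test text.
cvec = CountVectorizer()
model = LogisticRegression()
model.fit(cvec.fit_transform(X_train), y_train)
test_accuracy = model.score(cvec.transform(X_test), y_test)

print(f"Baseline: {baseline:.2f}, test accuracy: {test_accuracy:.2f}")
```

A model is only worth keeping if its test accuracy clearly beats the baseline.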

**Preprocessing Options**

- Tokenizing
- Regular Expression
- Lemmatizing/Stemming
- Cleaning (i.e. removing HTML)
- CountVectorizer
- TfidfVectorizer
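
A minimal sketch of the tokenizing, regular-expression, and stemming options, assuming NLTK is installed. The sample sentence is hypothetical. Lemmatizing with NLTK's `WordNetLemmatizer` follows the same pattern but needs the WordNet corpus downloaded first, so it is left out here.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Keep runs of lowercase letters only; digits and punctuation are dropped.
tokenizer = RegexpTokenizer(r"[a-z]+")
stemmer = PorterStemmer()

# Hypothetical example sentence (not from the project data).
text = "Recycling programs are reducing landfill waste!"

# Lowercase first so the pattern matches every word.
tokens = tokenizer.tokenize(text.lower())

# Reduce each token to its stem; stems are not always dictionary words.
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```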

**Model Options**

- Logistic Regression
- Naive Bayes (Multinomial, Bernoulli, Gaussian)
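
The three Naive Bayes variants share the same scikit-learn interface, so they can be compared in a loop. This is a sketch on hypothetical toy documents; the scores are training accuracy only and say nothing about how the models would generalize.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy documents and labels (an assumption for illustration).
docs = [
    "solar power is clean", "wind farms cut emissions",
    "stock markets fell today", "interest rates rose again",
]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(docs)

models = {
    "multinomial": MultinomialNB(),  # models word counts
    "bernoulli": BernoulliNB(),      # models word presence/absence
    "gaussian": GaussianNB(),        # models continuous features
}

scores = {}
for name, model in models.items():
    # GaussianNB does not accept sparse input, so densify for it only.
    features = X.toarray() if name == "gaussian" else X
    model.fit(features, labels)
    scores[name] = model.score(features, labels)

print(scores)
```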

# Pre-Processing 

In [21]:
import re
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from nltk.tokenize import RegexpTokenizer
from bs4 import BeautifulSoup  
from nltk.corpus import stopwords

In [14]:
# read in csv files
posts = pd.read_csv('posts_clean.csv')
posts.head()

| Unnamed: 0 | subreddit | text |
|---|---|---|
| 0 | 0 | https://www.dailymail.co.uk/news/article-7922... |
| 1 | 0 | There is a search engine called [Ecosia](https... |
| 2 | 0 | [Vandana Shiva](https://youtu.be/MNM833K22LM) ... |
| 3 | 0 | If you have a weak stomach, I wouldn’t watch t... |
| 4 | 0 | Breathing Pattern Disorders Caused by Environ... |


In [15]:
# prepare the data for modeling
X = posts['text']
y = posts['subreddit']

In [16]:
# check distribution of y variable
y.value_counts(normalize=True)

1    0.525467
0    0.474533
Name: subreddit, dtype: float64

In [9]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=42)
                                                    

In [23]:
# Function to clean raw posts into a clean string of words

def post_to_words(raw_text):
    # Remove HTML with Beautiful Soup; name the parser explicitly to avoid a warning
    clean_text = BeautifulSoup(raw_text, "html.parser").get_text()
    
    # Remove all non-letters with regex
    letters_only = re.sub("[^a-zA-Z]", " ", clean_text)
    
    # Convert everything to lower case, split/tokenize into individual words.
    words = letters_only.lower().split()
    
    # Convert the stop words to a set, which makes membership checks faster than a list
    stops = set(stopwords.words('english'))
    
    # Remove stop words.
    meaningful_words = [w for w in words if w not in stops]
    
    # Join the meaningful words back into one string separated by spaces
    return " ".join(meaningful_words)

In [25]:
# Get the number of posts in the training set.
total_posts = X_train.shape[0]
print(f'There are {total_posts} posts.')

There are 4420 posts.


In [None]:
# Initialize empty lists to hold the cleaned posts.
clean_train_posts = []
clean_test_posts = []

print("Cleaning and parsing the training set posts...")

# Count from 1 so progress reports are easy to read during a long run.
for j, train_post in enumerate(X_train, start=1):

    # Convert the post to words, then append to clean_train_posts.
    clean_train_posts.append(post_to_words(train_post))

    # Print a progress message every 1,000 posts rather than on every iteration.
    if j % 1000 == 0:
        print(f'Post {j} of {total_posts}.')

# Do the same for the testing set.
total_test_posts = X_test.shape[0]

print("Cleaning and parsing the testing set posts...")

for j, test_post in enumerate(X_test, start=1):
    # Convert the post to words, then append to clean_test_posts.
    clean_test_posts.append(post_to_words(test_post))

    # Print a progress message every 1,000 posts.
    if j % 1000 == 0:
        print(f'Post {j} of {total_test_posts}.')

## Tokenizing

## Modeling
- Transformer: CountVectorizer
- Estimator: Multinomial Naive Bayes

In [68]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics

In [None]:
# Baseline accuracy: the share of the majority class in the test set
y_test.value_counts(normalize=True)

In [69]:
# Instantiate
nb = MultinomialNB()
cvec = CountVectorizer()

In [72]:
# Fit the CountVectorizer on the training data and transform it
X_train_cvec = cvec.fit_transform(X_train)

In [73]:
X_train_cvec

<6644x12289 sparse matrix of type '<class 'numpy.int64'>'
	with 83565 stored elements in Compressed Sparse Row format>

In [74]:
# Transform the test
X_test_cvec = cvec.transform(X_test)

In [76]:
# Fit to train data
nb.fit(X_train_cvec, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [78]:
# Make predictions on the training data
y_pred_train = nb.predict(X_train_cvec)

In [79]:
# Make predictions on the test data
y_pred_test = nb.predict(X_test_cvec)

In [84]:
# Calculate accuracy on the train and test sets
print(f"MN Cvec Train score: {metrics.accuracy_score(y_train, y_pred_train)}")
print(f"MN Cvec Test score: {metrics.accuracy_score(y_test, y_pred_test)}")

MN Cvec Train score: 0.8749247441300422
MN Cvec Test score: 0.6003666361136571
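
The gap between the train score (about 0.87) and the test score (about 0.60) suggests the model is overfitting. One possible next step, sketched below, is to wrap a vectorizer and an estimator in a `Pipeline` and tune both with `GridSearchCV`, which were imported earlier but not yet used. The toy documents, the parameter grid, and `cv=2` are assumptions chosen to keep the example small, not settings taken from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train and y_train.
docs = [
    "solar power is clean", "wind farms cut emissions",
    "recycle plastic bottles", "plant more trees",
    "stock markets fell today", "interest rates rose again",
    "the bank cut rates", "shares hit a record",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Chain the transformer and estimator so the vectorizer is refit
# inside each cross-validation fold, avoiding leakage between folds.
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("lr", LogisticRegression()),
])

# Parameter names are "<step name>__<parameter>".
param_grid = {
    "tvec__stop_words": [None, "english"],
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(docs, labels)

print(grid.best_params_)
print(grid.best_score_)
```

On the real data, `grid.score(X_test, y_test)` would then give a test score to compare against the baseline.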
