**02_NLP** 

- Convert text to word count vectors/frequency vectors [Countvectorize/Tfidfvectorize](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)
- Remove stop words (from test?)
- stemming, and lemmatization

**03_Classification_Modeling** 
Each document is an “input” and a class label is the “output” for our predictive algorithm.
For our $X$ variable, we will only use the `post` variable. For our $Y$ variable, we will only use the xx variable.

- Train, test, split
- Identify and explain the baseline score
- Bayesian model
- Logistic regression, KNN, SVM
- Explanation of reasoning behind choosing production models
- Evaluate model performance

**Preprocessing Options**

- Tokenizing
- Regular Expression
- Lemmatizing/Stemming
- Cleaning (i.e. removing HTML)
- Countvectorize
- Tfidfvectorize

**Model Options**

- Logistic Regression
- Naive Bayes (Multinomial, Bernoulli, Guassian)

# Pre-Processing 

### Imports

In [212]:
# Standard imports
import pandas as pd
import regex as re


# Processing and Models
from bs4 import BeautifulSoup 
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split, GridSearchCV
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer




### Cleaning and EDA

In [264]:
# read in csv files
posts = pd.read_csv('posts_clean.csv')
posts.head()

Unnamed: 0,subreddit,text
0,0,https://www.dailymail.co.uk/news/article-7922...
1,0,There is a search engine called [Ecosia](https...
2,0,[Vandana Shiva](https://youtu.be/MNM833K22LM) ...
3,0,"If you have a weak stomach, I wouldn’t watch t..."
4,0,Breathing Pattern Disorders Caused by Environ...


In [265]:
# Clean the text column by removing html code
posts["text"] = posts["text"].str.replace('[^a-zA-Z ]', ' ')
posts["text"] = posts["text"].str.replace(r'http\S+', '')
posts["text"] = posts["text"].str.replace(r'\[http\S+', '')
posts["text"] = [post.lower().strip() for post in posts['text']]

In [266]:
# Get the total character count for all posts in each subreddit
def word_count(num):
    characters = 0
    if posts[(posts['subreddit'] == num)]:
        characters += [len(post) for post in posts['text']]
    return characters

## Train, test, split

1. Train-test-split
2. Tokenize
3. Stem/Lemmatize
4. Apply function that further cleans text
5. CountVectorizer/TFIDFVectorizer (stop_words go here)
6. Pipelines and GridSearch

In [267]:
# prepare the data for modeling
X = posts['text']
y = posts['subreddit']

In [268]:
# check distribution of y variable
y.value_counts(normalize=True)

1    0.525467
0    0.474533
Name: subreddit, dtype: float64

In [269]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=42)

## Tokenize, Stem, Lemmatize

### Create function

In [124]:
# Create function to clean text into a clean string of words
# Remove HTML with beautiful soup
# Remove all non-letters with regex
# Convert everything to lower case, split into individual words.

def post_to_words(raw_text):
    clean_text = BeautifulSoup(raw_text).get_text()
    clean_text = clean_text.strip()
    letters = re.sub("[^a-zA-Z]", " ", clean_text)
    no_url = re.sub(r'http\S+', ' ', letters)
    srsly_no_url = re.sub(r'\[http\S+', ' ', no_url)
    words = srsly_no_url.lower().split()
    return(' '.join(words))

# Create function to clean text into a clean string of words
# Remove HTML links with beautiful soup and returns text
# Remove all extra spaces after strings
# Remove all non-letters with regex
# Convert everything to lower case, split into individual words.
# join together as a string

def post_to_words(raw_text):
        clean_text = BeautifulSoup(i).get_text()
        no_spaces = clean_text.strip()
        letters = re.sub("[^a-zA-Z]", " ", str(no_spaces))
        no_url = re.sub(r'http\S+', ' ', letters)
        srsly_no_url = re.sub(r'\[http\S+', ' ', no_url)
        words = srsly_no_url.lower().split()
        return words

In [270]:
# Get the number of posts based on the dataframe size.
total_posts = X_train.shape[0]
print(f'There are a total of {total_posts} posts.')

There are a total of 4137 posts.


### Apply cleaning function to train and test data set

In [126]:
# Create two empty lists to hold the clean posts
clean_train_posts = []
clean_test_posts = []

In [127]:
# print statement to indicate start of cleaning training set
print("Cleaning and parsing the training set movie reviews...")

# create a counter to print progress
j = 0

for post in X_train:
    # Use function to convert to words, then append to clean posts list
    clean_train_posts.append(post_to_words(post))
    # If the index is divisible by 1000, print a message
    if (j + 1) % 1000 == 0:
        print(f'Review {j + 1} of {total_posts}.')
    j += 1

    
# print statement to indicate start of cleaning testing set
print("Cleaning and parsing the testing set movie reviews...")

for post in X_test:
    # Use function to convert to words, then append to clean posts list
    clean_test_posts.append(post_to_words(post))
    # If the index is divisible by 1000, print a message
    if (j + 1) % 1000 == 0:
        print(f'Review {j + 1} of {total_posts}.')
    j += 1

Cleaning and parsing the training set movie reviews...
Review 1000 of 4137.
Review 2000 of 4137.
Review 3000 of 4137.
Review 4000 of 4137.
Cleaning and parsing the testing set movie reviews...


  ' that document to Beautiful Soup.' % decoded_markup


Review 5000 of 4137.


In [130]:
# Check the clean train posts
# make sure to add ve, www, m, to stopwords
clean_train_posts

['my friend has very bad wifi which her pc and it can be sometimes i possible to play shooters like overwatch i think the problem is that her router is too far from it would there be a solution to the problem without moving the router',
 'i m a uk customer with three is it possible to download your voicemail online three how do i access my voicemail online',
 'recycling and gathering old plastic would be like mining for gold recycling plastic would become extremely valuable and encourage more people to do so heres an idea if we as a planet could agree to make it illegal to create any new plastic for years',
 'www livescience com greenland ice sheet sliding html www livescience com greenland ice sheet sliding html climate change is worsening wildfires new study highlights research shows that warming temperatures are likely fueling more deadly and devastating fires',
 'adhesive solutions for consumer electronics',
 'can someone help me asap please people can t hear me on my phone during 

In [131]:
# Reset x train and x test to clean posts variables
X_train = clean_train_posts
X_test = clean_test_posts

### Stem the data for train and test

In [275]:
# Instantiate object of class PorterStemmer.
p_stemmer = PorterStemmer()
tokenizer = RegexpTokenizer(r'\w+')

In [276]:
# Tokenize train and test set
token_train = tokenizer.tokenize(X_train)
token_test = tokenizer.tokenize(X_test)

TypeError: expected string or bytes-like object

In [277]:
# Stem train and test set
stemmed_train = [p_stemmer.stem(i) for i in token_train]
stemmed_test = [p_stemmer.stem(i) for i in token_test]

NameError: name 'token_train' is not defined

In [273]:
stemmed_train

['my friend has very bad wifi which her pc and it can be sometimes i possible to play shooters like overwatch  i think the problem is that her router is too far from it  would there be a solution to the problem without moving the rout',
 'i m a uk customer with three  is it possible to download your voicemail online three  how do i access my voicemail onlin',
 'recycling and gathering old plastic would be like mining for gold  recycling plastic would become extremely valuable and encourage more people to do so heres an idea  if we as a planet could agree to make it illegal to create any new plastic for     year',
 'www livescience com       greenland ice sheet sliding html     www livescience com       greenland ice sheet sliding html climate change is worsening wildfires  new study highlights research shows that warming temperatures are likely fueling more deadly and devastating fir',
 'adhesive solutions for consumer electron',
 'can someone help me asap please  people can t hear me 

### Stopwords

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Adding my own stopwords to the already-made
# set of stopwords
ENGLISH_STOP_WORDS.union({
    'zebra'
})

# stopwords from nltk
stopwords.words('english')
stop = set(stopwords.words('english'))
stop

#Method for adding new stopwords to the list
stop + ['']

# Modeling

## Baseline Accuracy

In [44]:
# Get baseline accuracy
y_test.value_counts(normalize=True)

1    0.525362
0    0.474638
Name: subreddit, dtype: float64

In [None]:
# model imports
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

## Logistic Regression
- Transformer: CountVectorizer
- Estimator: Logistic Regression
- Gridsearch

In [48]:
pipe_cvec_lr = Pipeline([('cvec', CountVectorizer()),
                         ('lr', LogisticRegression(random_state=42))])

In [49]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 2500, 3000, 3500
# Minimum number of documents needed to include token: 2, 3
# Maximum number of documents needed to include token: 90%, 95%
# N-gram check individual tokens and bigrams
# cvec__stopwords: [None, 'english', ENGLISH_STOP_WORDS, ENGLISH_STOP_WORDS.union({})]

params_cvec_lr = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)]
    'cvec__stopwords': [None, 'english'],
    'lr_C': [1.0, 0.1]
}

# Instantiate GridSearchCV
gs_cvec_lr = GridSearchCV(pipe_cvec_lr,
                  param_grid=params_cvec_lr, 
                  cv=3) 

# Fit GridSearch to training data.
gs_cvec_lr.fit(X_train, y_train)

# Score model and check best params/model
print(f'Best Params: {gs_cvec_lr.best_params_}')
print(f'Best Score: {gs_cvec_lr.best_score_}')
print(f'Train Score: {gs_cvec_lr.score(X_train, y_train)}')
print(f'Test Score: {gs_cvec_lr.score(X_test, y_test)}')

## Multinomial Naive Bayes
- Transformer: CountVectorizer
- Estimator: Multinomial Naive Bayes
- Gridsearch

In [36]:
pipe_cvec_nb = Pipeline([('cvec', CountVectorizer()),
                         ('nb', MultinomialNB())])

In [None]:
# Create parameters
params_cvec_nb = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)]
}

# Instantiate GridSearchCV
gs_cvec_nb = GridSearchCV(pipe_cvec_nb,
                          param_grid=params_cvec_nb, 
                          cv=3) 

# Fit Gridsearch to training data
gs_cvec_nb.fit(X_train, y_train)

# Score model and check best params/model
print(f'Best Params: {gs_cvec_nb.best_params_}')
print(f'Best Score: {gs_cvec_nb.best_score_}')
print(f'Train Score: {gs_cvec_nb.score(X_train, y_train)}')
print(f'Test Score: {gs_cvec_nb.score(X_test, y_test)}') 

## Random Forest
- Estimator: Random Forest Classifier
- Gridsearch

In [None]:
rf = RandomForestClassifier(random_state=42)

In [None]:
params_rf ={
    'n_estimators':[100,150,200],
    'max_depth':[None, 1,2,3,4,5],
    'min_samples_split':[2,4,6]
}

# Instantiate Gridsearch
gs_rf = GridSearchCV(rf, 
                     param_grid=params_rf, 
                     cv=5)

# Fit Gridsearch to the training data
gs_rf.fit(X_train, y_train)

# Score model and check best params/model
print(f'Best Params: {gs_rf.best_params_}')
print(f'Best Score: {gs_rf.best_score_}')
print(f'Train Score: {gs_rf.score(X_train, y_train)}')
print(f'Test Score: {gs_rf.score(X_test, y_test)}') 

In [None]:
# Check the important features attribute
for name, score in zip(X_train.columns, best_rf.feature_importances_):
    print(name, score)

In [None]:
# sort them in a nice list
pd.Series(data=best_rf.feature_importances_,index=X_train.columns).sort_values(ascending=False)

# Model Evaluation

In [None]:
# Generating a confusion matrix on the test results?

metrics.confusion_matrix_dataframe(y_test, 
                                   cvec_lr_preds,
                                   columns = ['', ''],
                                   index   = ['', ''])

References

https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python