## 02_NLP

**02_NLP** 

- Convert text to word-count or frequency vectors with [CountVectorizer/TfidfVectorizer](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)
- Remove stop words (from test?)
- Stemming and lemmatization
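
The vectorizing and stop-word steps above can be sketched on two made-up sentences. This is only an illustration, assuming scikit-learn's built-in English stop-word list (`stop_words="english"`); the `docs` list is hypothetical and not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical documents standing in for Reddit post text
docs = [
    "The river is polluted and the fish are dying",
    "New phones are released and the prices are rising",
]

# Raw word counts, with common English stop words removed
cvec = CountVectorizer(stop_words="english")
counts = cvec.fit_transform(docs)

# Same vocabulary, but weighted by TF-IDF instead of raw counts
tvec = TfidfVectorizer(stop_words="english")
tfidf = tvec.fit_transform(docs)

print(sorted(cvec.vocabulary_))      # learned vocabulary (stop words excluded)
print(counts.toarray())              # one row per document, one column per word
print(tfidf.toarray().round(2))      # same shape, fractional weights
```

Both vectorizers return sparse matrices with one row per document, so either can be passed to the same downstream model.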

**03_Classification_Modeling** 
Each document is an “input” and a class label is the “output” for our predictive algorithm.
For our $X$ variable, we will only use the `post` variable. For our $Y$ variable, we will only use the xx variable.

- Train/test split
- Identify and explain the baseline score
- Bayesian model
- Logistic regression, KNN, SVM
- Explanation of reasoning behind choosing production models
- Evaluate model performance
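
For the baseline score in the list above, one common choice is the accuracy of always predicting the most frequent class. A minimal sketch, assuming binary labels like the `subreddit` column; the label list is invented for illustration and `baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical labels: 0 and 1 stand in for the two subreddits
y = [0, 0, 0, 1, 1]
print(baseline_accuracy(y))  # 0.6
```

Any model worth keeping should beat this number on the test set.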

### Pre-Processing Options

- Tokenizing
- Regular Expression
- Lemmatizing/Stemming
- Cleaning (i.e. removing HTML)
- CountVectorizer
- TfidfVectorizer
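
A minimal sketch of the cleaning, regular-expression, and tokenizing options, using only the standard library. `clean_text` is a hypothetical helper, and stemming or lemmatizing (which would need a library such as NLTK) is deliberately left out.

```python
import html
import re

def clean_text(raw):
    """Strip HTML tags and non-letter characters, then lowercase and split."""
    text = html.unescape(raw)                # e.g. "&amp;" becomes "&"
    text = re.sub(r"<[^>]+>", " ", text)     # drop HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    return text.lower().split()              # whitespace tokenization

tokens = clean_text("<p>Sea-level rise &amp; 3 new studies!</p>")
print(tokens)  # ['sea', 'level', 'rise', 'new', 'studies']
```

Note that removing every non-letter character also splits hyphenated words, which may or may not be desirable for this data.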

### Model Options

- Logistic Regression
- Naive Bayes (Multinomial, Bernoulli, Gaussian)
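
One way to compare these options is to put each estimator behind the same vectorizer. The sketch below uses invented texts and labels, not the project data. Gaussian Naive Bayes is left out because scikit-learn's `GaussianNB` expects dense input rather than the sparse matrix a vectorizer produces.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 0 = environment, label 1 = technology
texts = [
    "river pollution harms fish",
    "forest fires spread in drought",
    "sea level rise threatens coasts",
    "endangered birds lose habitat",
    "new phone released with faster chip",
    "broadband fund expands rural internet",
    "android malware found in app store",
    "chip maker reports record sales",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

models = {
    "multinomial_nb": MultinomialNB(),
    "bernoulli_nb": BernoulliNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

fitted = {}
for name, model in models.items():
    # Each pipeline vectorizes the text and then fits the estimator
    pipe = make_pipeline(CountVectorizer(), model)
    pipe.fit(texts, labels)
    fitted[name] = pipe
    print(name, pipe.score(texts, labels))  # training accuracy only
```

The printed scores are training accuracy on a toy corpus, so they say nothing about how the models would rank on the real data.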

### Imports

In [48]:
import pandas as pd

In [50]:
# read in csv files
df = pd.read_csv('combined.csv')
df

| | title | score | id | url | comms_num | created | body | timestamp | subreddit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | When Is a Pesticide Not a Pesticide? When It C... | 2 | ev5x7m | https://news.bloombergenvironment.com/environm... | 0 | 1.580249e+09 | | 2020-01-28 16:57:52 | 0 |
| 1 | Lawsuits Seeking Damages for Climate Change Fa... | 2 | ev5vp3 | https://insideclimatenews.org/news/27012020/su... | 0 | 1.580248e+09 | | 2020-01-28 16:53:53 | 0 |
| 2 | Lawsuits Seeking Damages for Climate Change Fa... | 6 | ev5q8p | https://insideclimatenews.org/news/27012020/su... | 0 | 1.580248e+09 | | 2020-01-28 16:39:48 | 0 |
| 3 | Endangered cheetahs can return to Indian fores... | 2 | ev5pt4 | https://www.bbc.co.uk/news/world-asia-india-51... | 0 | 1.580248e+09 | | 2020-01-28 16:38:38 | 0 |
| 4 | French NGOs and local authorities take court a... | 3 | ev5krd | https://www.theguardian.com/world/2020/jan/27/... | 0 | 1.580247e+09 | | 2020-01-28 16:25:57 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1245 | Seventeen Android Nasties Spotted in Google Pl... | 7 | ep74i2 | https://labs.bitdefender.com/2020/01/seventeen... | 0 | 1.579146e+09 | | 2020-01-15 22:36:15 | 1 |
| 1246 | U.S. states tell court prices to rise if Sprin... | 41 | ep6xer | https://www.reuters.com/article/us-sprint-corp... | 2 | 1.579145e+09 | | 2020-01-15 22:22:31 | 1 |
| 1247 | FCC ID's State-by-State Rural Broadband Fund B... | 2 | ep6v2i | https://www.multichannel.com/news/fcc-ids-stat... | 2 | 1.579145e+09 | | 2020-01-15 22:17:52 | 1 |
| 1248 | Airframe: The SR-71 Blackbird | 13 | ep6ub2 | https://airman.dodlive.mil/2017/07/10/airframe... | 2 | 1.579145e+09 | | 2020-01-15 22:16:17 | 1 |


## Data Cleaning

- Check for null values - combine body and title columns
- Drop the 'created' and 'url' columns
- Check datatypes
- Check for duplicated rows
- Remove stickied posts?
- Remove non-letter characters

In [51]:
# Check for null values
df.isnull().sum()

title           0
score           0
id              0
url             0
comms_num       0
created         0
body         1226
timestamp       0
subreddit       0
dtype: int64

In [52]:
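# Fill null values with an empty string, then combine the title and body columns

df['body'] = df['body'].fillna('')

# Join with a space so the last word of the title does not fuse with the body
df['text'] = (df['title'] + ' ' + df['body']).str.strip()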

In [54]:
df.isnull().sum()

title        0
score        0
id           0
url          0
comms_num    0
created      0
body         0
timestamp    0
subreddit    0
text         0
dtype: int64

In [44]:
# Check the shape
df.shape

(1564, 9)

In [26]:
# Drop unneeded columns
#df.drop(columns=['score','id','url','comms_num','created','body','timestamp'], inplace=True)

In [45]:
# Check for duplicates
df.duplicated().sum()

0

In [46]:
# Check for duplicate titles
df[(df['title'].duplicated())]

| | title | score | id | url | comms_num | created | body | timestamp | subreddit |
|---|---|---|---|---|---|---|---|---|---|
| 28 | Climate change-driven sea-level rise could tri... | 3 | euole1 | https://viterbischool.usc.edu/news/2020/01/sea... | 0 | 1580164000.0 | | 2020-01-27 17:28:19 | 0 |
| 55 | Climate change-driven sea-level rise could tri... | 5 | euge8z | https://viterbischool.usc.edu/news/2020/01/sea... | 0 | 1580116000.0 | | 2020-01-27 04:11:11 | 0 |
| 56 | When It’s Migratory Birds vs. the Trump Admini... | 8 | eufvht | https://blog.ucsusa.org/maria-caffrey/when-its... | 4 | 1580114000.0 | | 2020-01-27 03:29:30 | 0 |
| 75 | Climate change-driven sea-level rise could tri... | 7 | eu9i86 | https://viterbischool.usc.edu/news/2020/01/sea... | 1 | 1580088000.0 | | 2020-01-26 20:16:48 | 0 |
| 99 | Climate change-driven sea-level rise could tri... | 5 | etyfg2 | https://viterbischool.usc.edu/news/2020/01/sea... | 1 | 1580023000.0 | | 2020-01-26 02:21:53 | 0 |
| 110 | Birds in California's desert are dying | 24 | etvez9 | https://thehill.com/opinion/energy-environment... | 0 | 1580010000.0 | | 2020-01-25 22:38:29 | 0 |
| 132 | The western United States has experienced such... | 15 | etpfym | https://www.news.ucsb.edu/2020/019761/warmer-d... | 10 | 1579979000.0 | | 2020-01-25 13:55:27 | 0 |
| 147 | The western United States has experienced such... | 6 | etiqzn | https://www.news.ucsb.edu/2020/019761/warmer-d... | 0 | 1579938000.0 | | 2020-01-25 02:44:11 | 0 |
| 172 | Peruvian indigenous group wins suit to block o... | 73 | etcj3e | https://www.reuters.com/article/us-peru-indige... | 1 | 1579911000.0 | | 2020-01-24 19:18:10 | 0 |
| 205 | NOAA Gets Go-Ahead to Study Controversial Clim... | 1 | et6ys3 | https://www.scientificamerican.com/article/noa... | 0 | 1579879000.0 | | 2020-01-24 10:13:37 | 0 |
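
Rows can repeat a title even when `df.duplicated()` reports no fully duplicated rows, because reposts differ in `id`, `score`, and `timestamp`. If reposts should count only once, one possible approach is `drop_duplicates` on the `title` column. The sketch below uses a small hypothetical frame, and whether to drop reposts at all is a modeling decision the notebook has not made.

```python
import pandas as pd

# Hypothetical frame with one reposted title
toy = pd.DataFrame({
    "title": ["Sea-level rise study", "Sea-level rise study", "Birds in the desert"],
    "score": [3, 5, 24],
    "subreddit": [0, 0, 0],
})

# Count titles that repeat an earlier row
print(toy["title"].duplicated().sum())

# Keep the first occurrence of each title and renumber the index
deduped = toy.drop_duplicates(subset="title", keep="first").reset_index(drop=True)
print(deduped.shape)
```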


In [14]:
# What text is in the body
env['body'].value_counts()

It may be easy to get pessimistic about environmental problems, but if we examine things carefully, we find that our environment is in much better shape now than 50 years ago.\n\nIn the 60s and 70s, things were really, really bad. Our rivers and lakes were being rendered lifeless as they filled with raw sewage and industrial effluent. New York Governor Nelson Rockefeller was blunt in his assessment of the Hudson River. “The river from Troy to the south of Albany is one great septic tank that has been rendered nearly useless for water supply, for swimming, or to support the rich fish life that once abounded there.” There were advisories against swimming in the Great Lakes. The Cuyahoga river, famously, caught fire 13 times.\n\nOur air was also being poisoned by emissions from automobile exhausts, garbage incinerators, and power plants. Southern California was blanketed with a thick layer of photochemical smog. A smog event over Thanksgiving 1966 in New York killed 168 people. \n\nThe pe

In [13]:
# Check for duplicates
env.duplicated().sum()

0

## Modeling
- Transformer: CountVectorizer
- Estimator: Multinomial Naive Bayes

In [68]:
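from sklearn.model_selection import train_test_split  # assumed: needed to create the X_train/X_test split used below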
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics

In [69]:
# Instantiate
nb = MultinomialNB()
cvec = CountVectorizer()
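
The cells that follow rely on `X_train`, `X_test`, `y_train`, and `y_test`, which are not created anywhere in this section. A minimal sketch of how such a split might look, assuming the combined `text` column as features and `subreddit` as the label. Both choices are assumptions, and the toy frame is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined DataFrame; column names follow the notebook
toy = pd.DataFrame({
    "text": [f"post number {i}" for i in range(10)],
    "subreddit": [0] * 5 + [1] * 5,
})

X = toy["text"]        # assumed feature column
y = toy["subreddit"]   # assumed label column

# stratify=y keeps the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Splitting before vectorizing matters: the vectorizer should learn its vocabulary from the training text only, which is why the later cells call `fit_transform` on the training data and only `transform` on the test data.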

In [72]:
# Fit CountVectorizer on the training text and transform it
X_train_cvec = cvec.fit_transform(X_train)

In [73]:
X_train_cvec

<6644x12289 sparse matrix of type '<class 'numpy.int64'>'
	with 83565 stored elements in Compressed Sparse Row format>

In [74]:
# Transform the test
X_test_cvec = cvec.transform(X_test)

In [76]:
# Fit to train data
nb.fit(X_train_cvec, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [78]:
# Make predictions on the training data
y_pred_train = nb.predict(X_train_cvec)

In [79]:
# Make predictions on the test data
y_pred_test = nb.predict(X_test_cvec)

In [84]:
# Calculate accuracy on the train and test sets
print(f"MN Cvec Train score: {metrics.accuracy_score(y_train, y_pred_train)}")
print(f"MN Cvec Test score: {metrics.accuracy_score(y_test, y_pred_test)}")

MN Train score: 0.8749247441300422
MN Test score: 0.6003666361136571
