# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills. Collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [5]:
import requests
import pandas as pd
import json
import time
import csv
import numpy as np
import sklearn as sk
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
import matplotlib as plt

%matplotlib inline

In [6]:
colors = {'blue': '#729ECE',
          'brown': '#A8786E',
          'green': '#67BF5C',
          'grey': '#A2A2A2',
          'orange': '#FF9E4A',
          'pink': '#ED97CA',
          'purple': '#AD8BC9',
          'red': '#ED665D',
          'teal': '#6DCCDA',
          'yellow': '#CDCC5D'}

In [7]:
chronic_pain = "https://www.reddit.com/r/ChronicPain.json"
migraine = "https://www.reddit.com/r/migraine.json"
back_pain = "https://www.reddit.com/r/backpain.json"


In [8]:
res_cp = requests.get(chronic_pain, headers={'User-agent': 'KatBot'})
res_m = requests.get(migraine, headers={'User-agent': 'KatBot'})
res_bp = requests.get(back_pain, headers={'User-agent': 'KatBot'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [9]:
data_cp = res_cp.json()
# data_cp

In [10]:
data_m = res_m.json()
# data_m

In [11]:
data_bp = res_bp.json()
# data_bp

In [12]:
post_content_cp = data_cp['data']['children'][1]['data']['selftext']
post_content_m = data_m['data']['children'][1]['data']['selftext']
post_content_bp = data_bp['data']['children'][1]['data']['selftext']

data_bp['data']['children'][1]['data']['title'] # Is this thing on?

'The First Steps For Dealing With Back Pain'

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do two things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

In [574]:
# Kat Bot 2.0
def post_loop(URL, post_content): 
    
#    Information the Function Will Return:

#   Posts
    post_list = []
    
#   Titles
    title_list = []
    
#   Author Self Descriptions
    author_flair_text = []
    
#   Container URL Variable, Will Change Each Loop
    base_URL = URL

    while len(post_list) < 775:  #The lowest number of posts I was able to scrape across 3 datasets

    # Import Json and Get Request
        get_request = requests.get(URL, headers={'User-agent': 'KatBot'})
    # Make Dict of Info
        container_dict = get_request.json()
        if container_dict == False:
            print("No Response from Reddit!")
            break
        next_page = base_URL + '?after=' + container_dict['data']['after']
#         print(container_dict['data']['after']) #Is this thing on?
#         print(container_dict['data']['children'][2]['data']['post_hint']) #Is this thing on?
        data_dict = container_dict['data']['children'] # index into data
#         print(len(data_dict)) #Is this thing on? 
        posts = [p['data']['selftext'] for p in data_dict] # index into posts
#         print(len(posts)) #Is this thing on? 
        post_titles = [t['data']['title'] for t in data_dict] # index into titles
#         print(len(post_titles)) #Is this thing on?
        author_flair = [f['data']['author_flair_text'] for f in data_dict] # index into author self-descript
#         print(len(author_flair), author_flair_text) #Is this thing on?

        for child in data_dict:
            is_new = True
            for post in post_list: 
                if post == child['data']['selftext']:
                    for title in title_list:
                        if title == child['data']['title']:
                            is_new = False
            
#             for flair in author_flair_text: 
#                 is_new = True
            
            
            if is_new:
#                 if child['data']['selftext']:
                post_list.append(child['data']['selftext'])
#                 print(child['data'])
#                 try:
#                     post_list.append(child['data']['post_hint'])
#                     print(child['data']['post_hint']) #Is this thing on?
#                 except KeyError:
#                     continue
                title_list.append(child['data']['title'])

                author_flair_text.append(child['data']['author_flair_text'])
#                 print('length of post list in loop', len(post_list)) #Is this thing on?
#                 print('length of post titles in loop', len(title_list)) #Is this thing on?
#                 print('length of flairs in loop', len(author_flair_text)) #Is this thing on?
    
        URL = next_page

#       Don't Wake The Sleeping Monster (Don't Make Reddit Boot You)
        time.sleep(3)
        
#       Turn Data into Dataframe
#       Check if same length
#         print(len(post_list), len(title_list), len(author_flair_text))
        
#       Here's that dataframe!
        df = pd.DataFrame({'Titles': title_list, 'Posts': post_list, 'Author_Self_Descript': author_flair_text})

    return df

In [575]:
post_loop(chronic_pain, post_content_cp)

Unnamed: 0,Titles,Posts,Author_Self_Descript
0,Final Coping Skills List,"Hey everyone, thanks for participation in the ...",
1,PSA about generics and shortages,Hello my wonderful pain ninjas! I am glad to s...,Trigeminal Neuralgia (Atypical)
2,me🤕irl,,
3,NYS MMJ program may be slinging overpriced gar...,,
4,Gathered energy &amp; pushed past my chronic p...,,
5,"Jeez, we still have a long way to go in layman...",I am personally not on opiates and likely will...,
6,[Australia] Fentanyl in the news again,,
7,"I wanna die,please :( (Life-Story+Rant)","Here is my story how rare disease is ""choking""...",
8,Every time,,
9,Hey everyone! Let's vent about our symptoms.,I don't really have anyone to talk to about th...,30/m/Canada/Nerve pain (pelvic floor) 5


In [469]:
# Run Kat Bot 2.0 on Chronic Pain Reddit
# Set up Chronic Pain Dataframe
cp_df = post_loop(chronic_pain, post_content_cp)

In [189]:
# Run Kat Bot 2.0 on Migraine Reddit
# Set up Migraine Dataframe
m_df = post_loop(migraine, post_content_m)

In [470]:
# Run Kat Bot 2.0 on Back Pain Reddit
# Set up Back Pain Dataframe
bp_df = post_loop(back_pain, post_content_bp)

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops of your computer crashes, you don't lose all your data.

In [471]:
# Export Chronic Pain Dataframe to csv
cp_df.to_csv("chronic_pain_csv.csv")

In [4]:
cp_df = pd.read_csv('../chronic_pain_csv.csv')

NameError: name 'pd' is not defined

In [472]:
# Export Migraine Dataframe to csv
m_df.to_csv("migraine_csv.csv")
m_df = pd.read_csv('migraine_csv.csv')

In [473]:
# Export Back Pain Dataframe to csv
bp_df.to_csv("back_pain_csv.csv")
bp_df = pd.read_csv('../back_pain_csv.csv')

In [None]:
import

<h1>Basic EDA, Feature Creation and Data Cleaning</h1>
<h3>(Subsection Created for Organization's Sake)</h3>

In [474]:
#Some Basic EDA of the Chronic Pain Dataset
cp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 3 columns):
Titles                  777 non-null object
Posts                   777 non-null object
Author_Self_Descript    81 non-null object
dtypes: object(3)
memory usage: 18.3+ KB


In [475]:
cp_df.describe()

Unnamed: 0,Titles,Posts,Author_Self_Descript
count,777,777.0,81
unique,777,605.0,45
top,Shivering to reduce pain?,,6
freq,1,173.0,6


In [476]:
cp_df.head(10)

Unnamed: 0,Titles,Posts,Author_Self_Descript
0,Final Coping Skills List,"Hey everyone, thanks for participation in the ...",
1,PSA about generics and shortages,Hello my wonderful pain ninjas! I am glad to s...,Trigeminal Neuralgia (Atypical)
2,me🤕irl,,
3,NYS MMJ program may be slinging overpriced gar...,,
4,Gathered energy &amp; pushed past my chronic p...,,
5,"Jeez, we still have a long way to go in layman...",I am personally not on opiates and likely will...,
6,[Australia] Fentanyl in the news again,,
7,Every time,,
8,Hey everyone! Let's vent about our symptoms.,I don't really have anyone to talk to about th...,30/m/Canada/Nerve pain (pelvic floor) 5
9,Tips for Pain Relief and Weight Loss?,My first post here so I hope I can get some ad...,


In [477]:
#Some Basic EDA of the Migraine Dataset
m_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 776 entries, 0 to 775
Data columns (total 5 columns):
Titles                   776 non-null object
Posts                    776 non-null object
Author Self Descript.    35 non-null object
Is M                     776 non-null int64
Is_M                     776 non-null int64
dtypes: int64(2), object(3)
memory usage: 30.4+ KB


In [478]:
m_df.describe()

Unnamed: 0,Is M,Is_M
count,776.0,776.0
mean,1.0,1.0
std,0.0,0.0
min,1.0,1.0
25%,1.0,1.0
50%,1.0,1.0
75%,1.0,1.0
max,1.0,1.0


In [479]:
m_df.head(10)

Unnamed: 0,Titles,Posts,Author Self Descript.,Is M,Is_M
0,Resources,Hey all!\nI had hoped the wiki to be running b...,TN/AFP + Weird migraines.,1,1
1,Aimovig (CGRP) month 2 megathread.,"We can only have two sticky threads at a time,...",TN/AFP + Weird migraines.,1,1
2,"Pain consuming me, sculpture by me.",,,1,1
3,im letting my feelings out,"apologies for formatting, im on mobile \nim al...",,1,1
4,Sporadic Hemiplegic Migraine,"Just found this sub, thought I'd see if anyone...",,1,1
5,My migraine “buddy”,,bop,1,1
6,Anybody else throw up their nausea pills this ...,Just me? O good.,,1,1
7,Neck popping-new migraine symptom?,I’ve had migraines for over 20 years now. Just...,,1,1
8,barometric graphs?,Anybody know of a site that shows barometric d...,,1,1
9,"Any observant, orthodox Jews here? I'm in need...",My apologies in advance to those reading who w...,,1,1


In [480]:
#Some Basic EDA of the Back Pain Dataset
bp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 3 columns):
Titles                  777 non-null object
Posts                   777 non-null object
Author_Self_Descript    0 non-null object
dtypes: object(3)
memory usage: 18.3+ KB


In [481]:
bp_df.describe()

Unnamed: 0,Titles,Posts,Author_Self_Descript
count,777,777.0,0.0
unique,773,655.0,0.0
top,Lower back pain,,
freq,4,123.0,


In [482]:
bp_df.head(10)

Unnamed: 0,Titles,Posts,Author_Self_Descript
0,Please Read New Rules for /r/backpain,"Hi everyone, I have taken over /r/backpain and...",
1,The First Steps For Dealing With Back Pain,Do you have back pain and you're not sure what...,
2,Anyone have any tips for lower lumber back pai...,I have muscle/joint damage and arthritis in my...,
3,"Brazilian Jiu-Jitsu sparring last night, now I...",I haven't been doing it that long but I notice...,
4,Getting rid of my old oxy 30s cheap,Dm me or wickr me noddingforfun. Selling my ol...,
5,Lower Right Side Back Pain for Three Weeks,I injured myself in a crossfit class while doi...,
6,Seeking advice on sharing our journey,I have gone through chronic pain for a very lo...,
7,Foundation training - ankle pain,I recently discovered Eric Goodman's 12 minute...,
8,Putting a pillow under my butt,"I don't know if it's my mattress, or what, but...",
9,Back Pain Remedies at Roseville Disc Center,Roseville Disc Center offer results-oriented ...,


In [483]:
cp_df["Is_CP"] = 1
cp_df

Unnamed: 0,Titles,Posts,Author_Self_Descript,Is_CP
0,Final Coping Skills List,"Hey everyone, thanks for participation in the ...",,1
1,PSA about generics and shortages,Hello my wonderful pain ninjas! I am glad to s...,Trigeminal Neuralgia (Atypical),1
2,me🤕irl,,,1
3,NYS MMJ program may be slinging overpriced gar...,,,1
4,Gathered energy &amp; pushed past my chronic p...,,,1
5,"Jeez, we still have a long way to go in layman...",I am personally not on opiates and likely will...,,1
6,[Australia] Fentanyl in the news again,,,1
7,Every time,,,1
8,Hey everyone! Let's vent about our symptoms.,I don't really have anyone to talk to about th...,30/m/Canada/Nerve pain (pelvic floor) 5,1
9,Tips for Pain Relief and Weight Loss?,My first post here so I hope I can get some ad...,,1


In [484]:
m_df["Is_M"] = 1
m_df

Unnamed: 0,Titles,Posts,Author Self Descript.,Is M,Is_M
0,Resources,Hey all!\nI had hoped the wiki to be running b...,TN/AFP + Weird migraines.,1,1
1,Aimovig (CGRP) month 2 megathread.,"We can only have two sticky threads at a time,...",TN/AFP + Weird migraines.,1,1
2,"Pain consuming me, sculpture by me.",,,1,1
3,im letting my feelings out,"apologies for formatting, im on mobile \nim al...",,1,1
4,Sporadic Hemiplegic Migraine,"Just found this sub, thought I'd see if anyone...",,1,1
5,My migraine “buddy”,,bop,1,1
6,Anybody else throw up their nausea pills this ...,Just me? O good.,,1,1
7,Neck popping-new migraine symptom?,I’ve had migraines for over 20 years now. Just...,,1,1
8,barometric graphs?,Anybody know of a site that shows barometric d...,,1,1
9,"Any observant, orthodox Jews here? I'm in need...",My apologies in advance to those reading who w...,,1,1


In [485]:
bp_df["Is_BP"] = 1
bp_df

Unnamed: 0,Titles,Posts,Author_Self_Descript,Is_BP
0,Please Read New Rules for /r/backpain,"Hi everyone, I have taken over /r/backpain and...",,1
1,The First Steps For Dealing With Back Pain,Do you have back pain and you're not sure what...,,1
2,Anyone have any tips for lower lumber back pai...,I have muscle/joint damage and arthritis in my...,,1
3,"Brazilian Jiu-Jitsu sparring last night, now I...",I haven't been doing it that long but I notice...,,1
4,Getting rid of my old oxy 30s cheap,Dm me or wickr me noddingforfun. Selling my ol...,,1
5,Lower Right Side Back Pain for Three Weeks,I injured myself in a crossfit class while doi...,,1
6,Seeking advice on sharing our journey,I have gone through chronic pain for a very lo...,,1
7,Foundation training - ankle pain,I recently discovered Eric Goodman's 12 minute...,,1
8,Putting a pillow under my butt,"I don't know if it's my mattress, or what, but...",,1
9,Back Pain Remedies at Roseville Disc Center,Roseville Disc Center offer results-oriented ...,,1


In [486]:
# Make Master Dataframe of All Information Scraped from Reddit
a_world_of_hurt = cp_df.append(m_df, sort=True)
a_world_of_hurt = a_world_of_hurt.append(bp_df, sort=True)
a_world_of_hurt # Is this thing on?

Unnamed: 0,Author Self Descript.,Author_Self_Descript,Is M,Is_BP,Is_CP,Is_M,Posts,Titles
0,,,,,1.0,,"Hey everyone, thanks for participation in the ...",Final Coping Skills List
1,,Trigeminal Neuralgia (Atypical),,,1.0,,Hello my wonderful pain ninjas! I am glad to s...,PSA about generics and shortages
2,,,,,1.0,,,me🤕irl
3,,,,,1.0,,,NYS MMJ program may be slinging overpriced gar...
4,,,,,1.0,,,Gathered energy &amp; pushed past my chronic p...
5,,,,,1.0,,I am personally not on opiates and likely will...,"Jeez, we still have a long way to go in layman..."
6,,,,,1.0,,,[Australia] Fentanyl in the news again
7,,,,,1.0,,,Every time
8,,30/m/Canada/Nerve pain (pelvic floor) 5,,,1.0,,I don't really have anyone to talk to about th...,Hey everyone! Let's vent about our symptoms.
9,,,,,1.0,,My first post here so I hope I can get some ad...,Tips for Pain Relief and Weight Loss?


In [488]:
a_world_of_hurt

Unnamed: 0,Author Self Descript.,Author_Self_Descript,Is M,Is_BP,Is_CP,Is_M,Posts,Titles
0,0,0,0.0,0.0,1.0,0.0,"Hey everyone, thanks for participation in the ...",Final Coping Skills List
1,0,Trigeminal Neuralgia (Atypical),0.0,0.0,1.0,0.0,Hello my wonderful pain ninjas! I am glad to s...,PSA about generics and shortages
2,0,0,0.0,0.0,1.0,0.0,,me🤕irl
3,0,0,0.0,0.0,1.0,0.0,,NYS MMJ program may be slinging overpriced gar...
4,0,0,0.0,0.0,1.0,0.0,,Gathered energy &amp; pushed past my chronic p...
5,0,0,0.0,0.0,1.0,0.0,I am personally not on opiates and likely will...,"Jeez, we still have a long way to go in layman..."
6,0,0,0.0,0.0,1.0,0.0,,[Australia] Fentanyl in the news again
7,0,0,0.0,0.0,1.0,0.0,,Every time
8,0,30/m/Canada/Nerve pain (pelvic floor) 5,0.0,0.0,1.0,0.0,I don't really have anyone to talk to about th...,Hey everyone! Let's vent about our symptoms.
9,0,0,0.0,0.0,1.0,0.0,My first post here so I hope I can get some ad...,Tips for Pain Relief and Weight Loss?


In [489]:
a_world_of_chronic_pain_and_migraines = cp_df.append(m_df, sort=True)

# The Hunt for NaN October
a_world_of_chronic_pain_and_migraines.isnull().sum()

# Replace NaN's with Zeros
a_world_of_chronic_pain_and_migraines = a_world_of_chronic_pain_and_migraines.fillna(0)

#Convert Mixed Data Types in Author Self Descript. to String
a_world_of_chronic_pain_and_migraines.Author_Self_Descript.apply(str)

a_world_of_chronic_pain_and_migraines

Unnamed: 0,Author Self Descript.,Author_Self_Descript,Is M,Is_CP,Is_M,Posts,Titles
0,0,0,0.0,1.0,0.0,"Hey everyone, thanks for participation in the ...",Final Coping Skills List
1,0,Trigeminal Neuralgia (Atypical),0.0,1.0,0.0,Hello my wonderful pain ninjas! I am glad to s...,PSA about generics and shortages
2,0,0,0.0,1.0,0.0,,me🤕irl
3,0,0,0.0,1.0,0.0,,NYS MMJ program may be slinging overpriced gar...
4,0,0,0.0,1.0,0.0,,Gathered energy &amp; pushed past my chronic p...
5,0,0,0.0,1.0,0.0,I am personally not on opiates and likely will...,"Jeez, we still have a long way to go in layman..."
6,0,0,0.0,1.0,0.0,,[Australia] Fentanyl in the news again
7,0,0,0.0,1.0,0.0,,Every time
8,0,30/m/Canada/Nerve pain (pelvic floor) 5,0.0,1.0,0.0,I don't really have anyone to talk to about th...,Hey everyone! Let's vent about our symptoms.
9,0,0,0.0,1.0,0.0,My first post here so I hope I can get some ad...,Tips for Pain Relief and Weight Loss?


In [490]:
a_world_of_back_pain_and_migraines = bp_df.append(m_df, sort=True)

# The Hunt for NaN October
a_world_of_back_pain_and_migraines.isnull().sum()

# Replace NaN's with Zeros
a_world_of_back_pain_and_migraines = a_world_of_back_pain_and_migraines.fillna(0)

#Convert Mixed Data Types in Author Self Descript. to String
a_world_of_back_pain_and_migraines.Author_Self_Descript.apply(str)

a_world_of_back_pain_and_migraines

Unnamed: 0,Author Self Descript.,Author_Self_Descript,Is M,Is_BP,Is_M,Posts,Titles
0,0,0,0.0,1.0,0.0,"Hi everyone, I have taken over /r/backpain and...",Please Read New Rules for /r/backpain
1,0,0,0.0,1.0,0.0,Do you have back pain and you're not sure what...,The First Steps For Dealing With Back Pain
2,0,0,0.0,1.0,0.0,I have muscle/joint damage and arthritis in my...,Anyone have any tips for lower lumber back pai...
3,0,0,0.0,1.0,0.0,I haven't been doing it that long but I notice...,"Brazilian Jiu-Jitsu sparring last night, now I..."
4,0,0,0.0,1.0,0.0,Dm me or wickr me noddingforfun. Selling my ol...,Getting rid of my old oxy 30s cheap
5,0,0,0.0,1.0,0.0,I injured myself in a crossfit class while doi...,Lower Right Side Back Pain for Three Weeks
6,0,0,0.0,1.0,0.0,I have gone through chronic pain for a very lo...,Seeking advice on sharing our journey
7,0,0,0.0,1.0,0.0,I recently discovered Eric Goodman's 12 minute...,Foundation training - ankle pain
8,0,0,0.0,1.0,0.0,"I don't know if it's my mattress, or what, but...",Putting a pillow under my butt
9,0,0,0.0,1.0,0.0,Roseville Disc Center offer results-oriented ...,Back Pain Remedies at Roseville Disc Center


In [491]:
a_world_of_back_pain_and_chronic_pain = bp_df.append(cp_df, sort=True)

# The Hunt for NaN October
a_world_of_back_pain_and_chronic_pain.isnull().sum()

# Replace NaN's with Zeros
a_world_of_back_pain_and_chronic_pain = a_world_of_back_pain_and_chronic_pain.fillna(0)

#Convert Mixed Data Types in Author Self Descript. to String
a_world_of_back_pain_and_chronic_pain.Author_Self_Descript.apply(str)

a_world_of_back_pain_and_chronic_pain

Unnamed: 0,Author_Self_Descript,Is_BP,Is_CP,Posts,Titles
0,0,1.0,0.0,"Hi everyone, I have taken over /r/backpain and...",Please Read New Rules for /r/backpain
1,0,1.0,0.0,Do you have back pain and you're not sure what...,The First Steps For Dealing With Back Pain
2,0,1.0,0.0,I have muscle/joint damage and arthritis in my...,Anyone have any tips for lower lumber back pai...
3,0,1.0,0.0,I haven't been doing it that long but I notice...,"Brazilian Jiu-Jitsu sparring last night, now I..."
4,0,1.0,0.0,Dm me or wickr me noddingforfun. Selling my ol...,Getting rid of my old oxy 30s cheap
5,0,1.0,0.0,I injured myself in a crossfit class while doi...,Lower Right Side Back Pain for Three Weeks
6,0,1.0,0.0,I have gone through chronic pain for a very lo...,Seeking advice on sharing our journey
7,0,1.0,0.0,I recently discovered Eric Goodman's 12 minute...,Foundation training - ankle pain
8,0,1.0,0.0,"I don't know if it's my mattress, or what, but...",Putting a pillow under my butt
9,0,1.0,0.0,Roseville Disc Center offer results-oriented ...,Back Pain Remedies at Roseville Disc Center


In [392]:
# a_world_of_hurt["Cat_CP_Or_M"] = np.where((a_world_of_hurt.Is_CP == 1) | (a_world_of_hurt.Is_M == 1), 0, 1)
# a_world_of_hurt

In [393]:
# a_world_of_hurt["Cat_BP_Or_CP"] = np.where((a_world_of_hurt.Is_BP == 1) | (a_world_of_hurt.Is_CP == 1), 0, 1)
# a_world_of_hurt

In [349]:
a_world_of_hurt.head(10)

In [335]:
a_world_of_hurt.describe()

Unnamed: 0,Is BP,Is CP,Is M,Is_BP,Is_CP,Is_M,Cat_BP_Or_M,Cat_CP_Or_M,Cat_BP_Or_CP
count,2330.0,2330.0,2330.0,2330.0,2330.0,2330.0,2330.0,2330.0,2330.0
mean,0.333476,0.333476,0.333047,0.333476,0.333476,0.333047,0.333476,0.333476,0.333047
std,0.471556,0.471556,0.471404,0.471556,0.471556,0.471404,0.471556,0.471556,0.471404
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [336]:
a_world_of_hurt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2330 entries, 0 to 776
Data columns (total 12 columns):
Author Self Descript.    2330 non-null object
Is BP                    2330 non-null float64
Is CP                    2330 non-null float64
Is M                     2330 non-null float64
Is_BP                    2330 non-null float64
Is_CP                    2330 non-null float64
Is_M                     2330 non-null float64
Posts                    2330 non-null object
Titles                   2330 non-null object
Cat_BP_Or_M              2330 non-null int64
Cat_CP_Or_M              2330 non-null int64
Cat_BP_Or_CP             2330 non-null int64
dtypes: float64(6), int64(3), object(3)
memory usage: 236.6+ KB


<h1>Lemmatization</h1>

In [1]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemma_pain = lemmatizer.lemmatize("a_world_of_chronic_pain_and_migraines['Posts']")
lemma_pain

'says_Pain'

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [678]:
import seaborn as sns

In [3]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data

# Set X and y Values
X = a_world_of_chronic_pain_and_migraines['Posts']
y = a_world_of_chronic_pain_and_migraines['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
# sns.regplot(X = log_reg.predict_proba(X_train_counts), y = y
#            data = a_world_of_chronic_pain_and_migraines,
#            logistic = True)
print(log_reg.score(X_test_counts, y_test))
print('log_reg intercept:', log_reg.intercept_)
print('log_reg coef(s):', log_reg.coef_)
# a_world_of_chronic_pain_and_migraines['Is_CP'].unique()

print('log_reg predicted probabilities: ', log_reg.predict_proba(X_train_counts))

# print(len(a_world_of_chronic_pain_and_migraines['Is_CP']).unique())
# is_cp_a_world_of_chronic_pain_and_migraines.groupby('Is_CP').mean()

# def odds_ratio(p):
#     return p / (1-p)

# a_world_of_chronic_pain_and_migraines["odds_ratio"] = a_world_of_chronic_pain_and_migraines['Is_CP'].map(odds_ratio)
# a_world_of_chronic_pain_and_migraines["odds_ratio"]



NameError: name 'a_world_of_chronic_pain_and_migraines' is not defined

In [669]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "Pain"

# Filter Chronic Pain/Migraine for Instance of Pain
says_Pain = a_world_of_chronic_pain_and_migraines[a_world_of_chronic_pain_and_migraines['Posts'].str.contains("pain")]

# Set X and y Values
X = says_Pain['Posts']
y = says_Pain['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
print(log_reg.score(X_test_counts, y_test))
print('log_reg intercept:', log_reg.intercept_)
print('log_reg coef(s):', log_reg.coef_)

0.9354838709677419
log_reg intercept: [0.90455295]
log_reg coef(s): [[ 2.37346389e-02 -7.01646205e-05 -2.97758856e-02 ... -6.23614746e-02
  -1.43389275e-01 -1.09452279e-03]]


In [670]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "feel"

# Filter Chronic Pain/Migraine for Instance of Pain
says_Feel = a_world_of_chronic_pain_and_migraines[a_world_of_chronic_pain_and_migraines['Posts'].str.contains("feel")]

# Set X and y Values
X = says_Feel['Posts']
y = says_Feel['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
# cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
print(log_reg.score(X_test_counts, y_test))
print('log_reg intercept:', log_reg.intercept_)
print('log_reg coef(s):', log_reg.coef_)


0.9004524886877828
log_reg intercept: [-0.41753486]
log_reg coef(s): [[-0.0636389  -0.03251449 -0.00701226 ...  0.0353856   0.05481945
  -0.00667306]]


In [671]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "I feel like"

# Filter Chronic Pain/Migraine for Instance of Pain
says_feel_like = a_world_of_chronic_pain_and_migraines[a_world_of_chronic_pain_and_migraines['Posts'].str.contains("I feel like")]

# Set X and y Values
X = says_feel_like['Posts']
y = says_feel_like['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
# cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
print(log_reg.score(X_test_counts, y_test))
print('log_reg intercept:', log_reg.intercept_)
print('log_reg coef(s):', log_reg.coef_)


0.825
log_reg intercept: [-0.06125609]
log_reg coef(s): [[-0.04909738 -0.01851399  0.00478414 ... -0.07994178  0.00098798
   0.02556408]]


In [672]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "I feel like"

# Filter Chronic Pain/Migraine for Instance of Pain
says_mg = a_world_of_chronic_pain_and_migraines[a_world_of_chronic_pain_and_migraines['Posts'].str.contains("mg")]

# Set X and y Values
X = says_mg['Posts']
y = says_mg['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
# cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
print(log_reg.score(X_test_counts, y_test))
print('log_reg intercept:', log_reg.intercept_)
print('log_reg coef(s):', log_reg.coef_)


0.9
log_reg intercept: [0.19100891]
log_reg coef(s): [[ 0.00508966 -0.10167776  0.02558049 ... -0.06528728  0.00508966
  -0.03264364]]


In [673]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "I feel like"

# Filter Chronic Pain/Migraine for Instance of Pain
says_hurts = a_world_of_chronic_pain_and_migraines[a_world_of_chronic_pain_and_migraines['Posts'].str.contains("hurts")]

# Set X and y Values
X = says_hurts['Posts']
y = says_hurts['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
# cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
print(log_reg.score(X_test_counts, y_test))
print('log_reg intercept:', log_reg.intercept_)
print('log_reg coef(s):', log_reg.coef_)


0.9411764705882353
log_reg intercept: [-0.12793516]
log_reg coef(s): [[ 0.05256124 -0.04893398  0.01339069 ...  0.01023688 -0.11879847
   0.00840536]]


In [638]:
# Set plot up.
plt.figure(figsize=(16,9))
# plt.xlabel("Age", fontsize = 20)
# plt.ylabel("Income", fontsize = 20)
plt.xticks([])
plt.yticks([])

## Plot classification line.
x = np.linspace(min(says_Feel['Posts']), max(says_Feel['Posts']))
plt.plot(x, c = 'orange', lw = 5)

## Generate scatterplot.
# plt.scatter(age_train, income_train, c=party_train, s=100);

TypeError: 'module' object is not callable

In [601]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Titles

# Set X and y Values
X = a_world_of_chronic_pain_and_migraines['Titles']
y = a_world_of_chronic_pain_and_migraines['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.7847938144329897

In [602]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Title Data with Instance of String "Pain"

# Filter Chronic Pain/Migraine for Instance of Pain
says_Pain = a_world_of_chronic_pain_and_migraines[a_world_of_chronic_pain_and_migraines['Titles'].str.contains("pain")]

# Set X and y Values
X = says_Pain['Titles']
y = says_Pain['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.9270833333333334

In [508]:
# # Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# # Using Author Self Description

# # Set X and y Values
# X = a_world_of_chronic_pain_and_migraines['Author_Self_Descript']
# y = a_world_of_chronic_pain_and_migraines['Is_CP']

# # Set Training and Testing Data
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 300, random_state=42)

# # Is this thing even on?
# for x in X_train:
#     if type(x) is not str: str(x).to_string()
# #     else: print("Zilch")

# # Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
# cvec = CountVectorizer(stop_words='english')
# X_train_counts = cvec.fit_transform(X_train)
# X_test_counts = cvec.transform(X_test)

# # # Fit model, Get Score
# # log_reg = LogisticRegression()
# # log_reg.fit(X_train_counts, y_train)
# # log_reg.score(X_test_counts, y_test)

In [603]:
# Back Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data

# Set X and y Values
X = a_world_of_back_pain_and_migraines['Posts']
y = a_world_of_back_pain_and_migraines['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_back_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.9278350515463918

In [604]:
# Chronic Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "Pain"

# Filter Chronic Pain/Migraine for Instance of Pain
says_Pain_bp = a_world_of_back_pain_and_migraines[a_world_of_back_pain_and_migraines['Posts'].str.contains("pain")]

# Set X and y Values
X = says_Pain_bp['Posts']
y = says_Pain_bp['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.9583333333333334

In [605]:
# Back Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Titles

# Set X and y Values
X = a_world_of_back_pain_and_migraines['Titles']
y = a_world_of_back_pain_and_migraines['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_back_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.8376288659793815

In [606]:
# Back Pain Vs. Migraines Count Vectorizer and Logistic Regression Model
# Using Title Data with Instance of String "Pain"

# Filter Chronic Pain/Migraine for Instance of Pain
says_Pain_bp = a_world_of_back_pain_and_migraines[a_world_of_back_pain_and_migraines['Titles'].str.contains("pain")]

# Set X and y Values
X = says_Pain_bp['Titles']
y = says_Pain_bp['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.9404761904761905

In [607]:
# Back Pain Vs. Chronic Pain Count Vectorizer and Logistic Regression Model
# Using Post Data

# Set X and y Values
X = a_world_of_back_pain_and_chronic_pain['Posts']
y = a_world_of_back_pain_and_chronic_pain['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_back_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.7709137709137709

In [608]:
# Back Pain Vs. Chronic Pain Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "Pain"

# Filter Chronic Pain/Migraine for Instance of Pain
says_Pain_bp = a_world_of_back_pain_and_chronic_pain[a_world_of_back_pain_and_chronic_pain['Posts'].str.contains("pain")]

# Set X and y Values
X = says_Pain_bp['Posts']
y = says_Pain_bp['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.8035714285714286

In [609]:
# Back Pain Vs. Chronic Pain Count Vectorizer and Logistic Regression Model
# Using Titles

# Set X and y Values
X = a_world_of_back_pain_and_chronic_pain['Titles']
y = a_world_of_back_pain_and_chronic_pain['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_back_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.7091377091377091

In [610]:
# Back Pain Vs. Chronic Pain Count Vectorizer and Logistic Regression Model
# Using Post Data with Instance of String "Pain"

# Filter Chronic Pain/Migraine for Instance of Pain
says_Pain_bp = a_world_of_back_pain_and_chronic_pain[a_world_of_back_pain_and_chronic_pain['Titles'].str.contains("pain")]

# Set X and y Values
X = says_Pain_bp['Posts']
y = says_Pain_bp['Is_BP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Is this thing even on?
# for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
#     if x == 1.0: print("We got one at least")
#     else: print("Zilch")

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Fit model, Get Score
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.7344398340248963

In [512]:
# Set X and y Values
X = a_world_of_chronic_pain_and_migraines["Posts"].apply(str)
X #Is this thing on?
# y = a_world_of_chronic_pain_and_migraines['Is_CP']

# # Set Training and Testing Data
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 300, random_state=42)

# # Is this thing even on?
# # for x in a_world_of_chronic_pain_and_migraines['Is_CP']:
# #     if x == 1.0: print("We got one at least")
# #     else: print("Zilch")

# # Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
# cvec = CountVectorizer(stop_words='english')
# X_train_counts = cvec.fit_transform(X_train)
# X_test_counts = cvec.transform(X_test)

# # Fit model, Get Score
# log_reg = LogisticRegression()
# log_reg.fit(X_train_counts, y_train)
# log_reg.score(X_test_counts, y_test)

pandas.core.series.Series

In [562]:
# Chronic Pain Vs. Migraines TF-IDF Vectorizer and Logistic Regression Model
# Using Post Data

def post_list_for_TFIDF_maker(dataframe): 
    post_list_for_TFIDF = []
    for value in dataframe["Posts"][1: ]: 
        new_val = value.replace("\n", "")
        post_list_for_TFIDF.append(new_val)
    return post_list_for_TFIDF     
    
post_list_for_TFIDF_CPM = post_list_for_TFIDF_maker(a_world_of_chronic_pain_and_migraines)
post_list_for_TFIDF_CPM

# #TF-IDF Vectorizer

# corpus = post_list_for_TFIDF_CPM

# tvec = TfidfVectorizer(stop_words='english')
# tvec.fit(corpus)

# tf_idf  = pd.DataFrame(tvec.transform(corpus).todense(),
#                    columns=tvec.get_feature_names(),
#                    index=['spam', 'ham'])

# # df.transpose().sort_values('spam', ascending=False).head(10).transpose()

In [None]:
corpus_m =

#CountVectorizer

#TF-IDF Vectorizer

In [None]:
corpus_bp =

#CountVectorizer

#TF-IDF Vectorizer

## Predicting subreddit using Random Forests + Another Classifier

In [None]:
bootstrap(lst, size = 5)

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [None]:
# See Above for Code. Will input previously constructed dataframes that satisfy this
# this part of the assignment.

# Master Dataframe:
# Contains all data, including binary classification of sub-reddits.
a_world_of_hurt

# Dataframe for an Analysis of Migraine Language versus Chronic Pain
# Contains data pertaining to migraine subreddit and chronic pain subreddit.
a_world_of_chronic_pain_and_migraines

# Dataframe for an Analysis of Migraine Language versus Back Pain
# Contains data pertaining to migraine subreddit and back pain subreddit.
a_world_of_back_pain_and_migraines

# Dataframe for an Analysis of Back Pain versus Chronic Pain
# Contains data pertaining to back pain subreddit and chronic pain subreddit.
a_world_of_back_pain_and_chronic_pain

#### Thought experiment: What is the baseline accuracy for this model?

In [None]:
# Baseline accuracy should be 50/50.

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [615]:
a_world_of_chronic_pain_and_migraines

Unnamed: 0,Author Self Descript.,Author_Self_Descript,Is M,Is_CP,Is_M,Posts,Titles
0,0,0,0.0,1.0,0.0,"Hey everyone, thanks for participation in the ...",Final Coping Skills List
1,0,Trigeminal Neuralgia (Atypical),0.0,1.0,0.0,Hello my wonderful pain ninjas! I am glad to s...,PSA about generics and shortages
2,0,0,0.0,1.0,0.0,,me🤕irl
3,0,0,0.0,1.0,0.0,,NYS MMJ program may be slinging overpriced gar...
4,0,0,0.0,1.0,0.0,,Gathered energy &amp; pushed past my chronic p...
5,0,0,0.0,1.0,0.0,I am personally not on opiates and likely will...,"Jeez, we still have a long way to go in layman..."
6,0,0,0.0,1.0,0.0,,[Australia] Fentanyl in the news again
7,0,0,0.0,1.0,0.0,,Every time
8,0,30/m/Canada/Nerve pain (pelvic floor) 5,0.0,1.0,0.0,I don't really have anyone to talk to about th...,Hey everyone! Let's vent about our symptoms.
9,0,0,0.0,1.0,0.0,My first post here so I hope I can get some ad...,Tips for Pain Relief and Weight Loss?


In [629]:
# Chronic Pain Vs. Migraines Count Vectorizer and Random Forest Using Post Data

# Set X and y Values
X = a_world_of_chronic_pain_and_migraines['Posts']
y = a_world_of_chronic_pain_and_migraines['Is_CP']

# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= int(len(X)/2), random_state=42)

# Count Vectorize Posts in a_world_of_chronic_pain_and_migraines Dataframe
cvec = CountVectorizer(stop_words='english')
X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

# Begin Random Forest Instantiation

# Create a random forest Classifier. By convention, clf means 'Classifier'
rfc = RandomForestClassifier(n_jobs=2, random_state=42)

# Fit model, Get Score
rfc.fit(X_train_counts, y_train)
rfc.score(X_test_counts, y_test)

# Ok, Again, But Better

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
s = cross_val_score(rfc, X_train_counts, y_train, cv=cv, n_jobs=-1)
s
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.785 ± 0.02


#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [None]:
# See Above ^^

# Executive Summary
---
Put your executive summary in a Markdown cell below.