# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & Classification

### Description

In week four we learned about a few different classifiers. In week five we'll learn about web scraping, APIs, and Natural Language Processing (NLP). Now we're going to put those skills to the test.

For project 3, your goal is two-fold:
1. Using Reddit's API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.


#### About the API

Reddit's API is fairly straightforward. For example, if I want the posts from [`/r/boardgames`](https://www.reddit.com/r/boardgames), all I have to do is add `.json` to the end of the URL: https://www.reddit.com/r/boardgames.json

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk
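
As a minimal sketch of the `.json` trick above: the helper names (`listing_url`, `extract_posts`, `fetch_posts`) and the User-Agent string are placeholders, not part of Reddit's API, and the code assumes the listing payload nests each post under `data` → `children` → `data`.

```python
import requests


def listing_url(subreddit):
    """Build the JSON listing URL for a subreddit by appending `.json`."""
    return f"https://www.reddit.com/r/{subreddit}.json"


def extract_posts(listing):
    """Pull the per-post dictionaries out of a parsed listing payload."""
    return [child["data"] for child in listing["data"]["children"]]


def fetch_posts(subreddit, user_agent="project-3-student-script"):
    """Request one page of posts (the User-Agent value is a placeholder)."""
    response = requests.get(
        listing_url(subreddit),
        headers={"User-agent": user_agent},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on a non-2xx response
    return extract_posts(response.json())
```

Calling `fetch_posts("boardgames")` would return a list of dictionaries, one per post, which can be passed straight to `pd.DataFrame`.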

---

### Requirements

- Gather and prepare your data using the `requests` library.
- **Create and compare two models**. One of these must be a Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.

**Pro Tip 1:** You can find a good example executive summary [here](https://www.proposify.biz/blog/executive-summary).

**Pro Tip 2:** Reddit will give you 25 posts **per request**. To get enough data, you'll need to hit Reddit's API **repeatedly** (most likely in a `for` loop). _Be sure to use the `time.sleep()` function at the end of your loop to allow for a break in between requests. **THIS IS CRUCIAL**_

**Pro Tip 3:** The API will cap you at 1,000 posts for each subreddit (assuming the subreddit has that many posts).

**Pro Tip 4:** At the end of each loop, be sure to save the results from your scrape as a `csv`: JSON from Reddit > Pandas DataFrame > CSV. That way, if something goes wrong in your loop, you won't lose all your data.
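
The tips above can be sketched as a single loop. This is one possible shape rather than a required implementation: `collect_posts`, `fetch_page`, and the default file name are hypothetical, and the sketch assumes each listing payload carries an `after` token that identifies the next page.

```python
import time

import pandas as pd


def collect_posts(fetch_page, max_requests=40, pause=1.0, csv_path="posts.csv"):
    """Page through a listing, saving progress to CSV after every request.

    `fetch_page(after)` is a caller-supplied function (hypothetical) that
    returns the parsed listing JSON for the page following `after`.
    """
    posts, after = [], None
    for _ in range(max_requests):  # 40 requests x 25 posts is about 1,000 posts
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        # Save on every pass so a failure mid-loop does not lose earlier pages.
        pd.DataFrame(posts).to_csv(csv_path, index=False)
        after = listing["data"]["after"]
        if after is None:  # no further pages to request
            break
        time.sleep(pause)  # break between requests
    return pd.DataFrame(posts)
```

In practice, `fetch_page` could wrap a `requests.get` call that passes the token as an `after` query parameter. Keeping the network call separate also makes the loop easy to test with canned pages.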

---

### Necessary Deliverables / Submission

- Code and executive summary must be in a clearly commented Jupyter Notebook.
- You must submit your slide deck.
- Materials must be submitted by **10:00 AM on Monday, April 8th**.

---

## Rubric
Your local instructor will evaluate your project (for the most part) using the following criteria.  You should make sure that you consider and/or follow most if not all of the considerations/recommendations outlined below **while** working through your project.

For Project 3 the evaluation categories are as follows:<br>
**The Data Science Process**
- Problem Statement
- Data Collection
- Data Cleaning & EDA
- Preprocessing & Modeling
- Evaluation and Conceptual Understanding
- Conclusion and Recommendations

**Organization and Professionalism**
- Organization
- Visualizations
- Python Syntax and Control Flow
- Presentation

**Scores will be out of 30 points based on the 10 categories in the rubric.** <br>
*3 points per section*<br>

| Score | Interpretation |
| --- | --- |
| **0** | *Project fails to meet the outlined expectations; many major issues exist.* |
| **1** | *Project close to meeting expectations; many minor issues or a few major issues.* |
| **2** | *Project meets expectations; few (and relatively minor) mistakes.* |
| **3** | *Project demonstrates a thorough understanding of all of the considerations outlined.* |


### The Data Science Process

**Problem Statement** 
- Is it clear what the goal of the project is?
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

**Data Collection** 
- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests such as considering number of requests per second?

**Data Cleaning and EDA** 
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Preprocessing and Modeling** 
- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** Bayes and one other model)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?
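
To make these checks concrete, here is a minimal sketch of converting text to a matrix, splitting the data, and fitting a Bayes classifier with `sklearn`. The titles and labels are toy placeholders, not real subreddit data, and the parameter choices are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy placeholder data; real titles would come from the collected posts.
titles = [
    "best worker placement board game",
    "new dice tower review posted",
    "favorite cooperative card game",
    "painting miniatures for campaign",
    "sourdough starter feeding tips",
    "baking crusty bread at home",
    "rye flour hydration question",
    "dutch oven loaf results",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratify so both classes appear in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=42
)

# CountVectorizer builds the document-term matrix; MultinomialNB is the Bayes model.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Putting the vectorizer inside the pipeline means the vocabulary is learned from the training split only, so no information from the test posts leaks into the features.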

**Evaluation and Conceptual Understanding** 
- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?
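
For a binary classifier, the baseline score is usually taken to be the accuracy of always predicting the majority class. A minimal sketch, using a hypothetical helper name and made-up labels:

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Three posts from one subreddit and one from the other.
baseline_accuracy([1, 1, 1, 0])  # 0.75
```

A model is only informative if its test accuracy beats this number.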

**Conclusion and Recommendations** 
- Does the student provide appropriate context to connect individual steps back to the overall project?
- Is it clear how the final recommendations were reached?
- Are the conclusions/recommendations clearly stated?
- Does the conclusion answer the original problem statement?
- Does the student address how findings of this research can be applied for the benefit of stakeholders?
- Are future steps to move the project forward identified?


### Organization and Professionalism

**Project Organization**
- Are modules imported correctly (using appropriate aliases)?
- Are data imported/saved using relative paths?
- Does the README provide a good executive summary of the project?
- Is markdown formatting used appropriately to structure notebooks?
- Is there an appropriate number of comments to support the code?
- Are files & directories organized correctly?
- Are there unnecessary files included?
- Do files and directories have well-structured, appropriate, consistent names?

**Visualizations**
- Are sufficient visualizations provided?
- Do plots accurately demonstrate valid relationships?
- Are plots labeled properly?
- Are plots interpreted appropriately?
- Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

**Python Syntax and Control Flow**
- Is care taken to write human readable code?
- Is the code syntactically correct (no runtime errors)?
- Does the code generate desired results (logically correct)?
- Does the code follow general best practices and style guidelines?
- Are Pandas functions used appropriately?
- Are `sklearn` and `NLTK` methods used appropriately?

**Presentation**
- Is the problem statement clearly presented?
- Does a strong narrative run through the presentation building toward a final conclusion?
- Are the conclusions/recommendations clearly stated?
- Is the level of technicality appropriate for the intended audience?
- Is the student substantially over or under time?
- Does the student appropriately pace their presentation?
- Does the student deliver their message with clarity and volume?
- Are appropriate visualizations generated for the intended audience?
- Are visualizations necessary and useful for supporting conclusions/explaining findings?


---

### Why did we choose this project for you?
This project covers three of the biggest concepts we cover in the class: Classification Modeling, Natural Language Processing and Data Wrangling/Acquisition.

Part 1 of the project focuses on **Data wrangling/gathering/acquisition**. This is a very important skill, as not all the data you need will be in clean CSVs or a single table in SQL. There is a good chance that, wherever you land, you will have to gather data from unstructured or semi-structured sources: requesting it from an API when possible, but often scraping it because the source has no API (or the API is terribly documented).

Part 2 of the project focuses on **Natural Language Processing** and converting standard text data (like Titles and Comments) into a format that allows us to analyze it and use it in modeling.

Part 3 of the project focuses on **Classification Modeling**. Given that project 2 was a regression-focused problem, we needed to give you a classification-focused problem to practice the various models, means of assessment, and preprocessing associated with classification.


# Code Starts Here

In [2]:
import praw
import pandas as pd
import pickle
from datetime import datetime as dt


In [3]:
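import os

# Credentials are read from environment variables (the variable names are
# placeholders) so that secrets are never stored in the notebook.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='praw',
                     username=os.environ['REDDIT_USERNAME'],
                     password=os.environ['REDDIT_PASSWORD'])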

In [5]:
right = reddit.subreddit('republican')

In [6]:
data_dict = {}

In [7]:
lim = 10
right_top = right.top(limit=lim)
i = 0

for title in right_top:
    print(i) # A progress tracker
    
    key = str(title.id) + "_" + str(title.subreddit)
    data_dict[key] = {}
    data_dict[key]["title"] = title.title
    data_dict[key]["id"] = title.id
    data_dict[key]["subreddit"] = title.subreddit
    data_dict[key]["time"] = dt.fromtimestamp(title.created).strftime('%c')
    data_dict[key]["num_comments"] = title.num_comments
    
    submission = reddit.submission(id=title.id)
    submission.comments.replace_more(limit=None)
    comments = [comment.body for comment in submission.comments.list()]
    
    data_dict[key]["comments"] = comments
    i += 1 

0
1
2
3
4
5
6
7
8
9


In [None]:
# right_top = right.top(limit=1000)
# i = 0

# for title in right_top:
#     print(i) # A progress tracker
    
#     key = str(title.id) + "_" + str(title.subreddit)
#     data_dict[key] = {}
#     data_dict[key]["title"] = title.title
#     data_dict[key]["id"] = title.id
#     data_dict[key]["subreddit"] = title.subreddit
#     data_dict[key]["time"] = dt.fromtimestamp(title.created).strftime('%c')
#     data_dict[key]["num_comments"] = title.num_comments
    
#     submission = reddit.submission(id=title.id)
#     submission.comments.replace_more(limit=None)
#     comments = [comment.body for comment in submission.comments.list()]
    
#     data_dict[key]["comments"] = comments
#     i += 1 

In [None]:
left = reddit.subreddit('democrats')

left_top = left.top(limit=1000)
i = 0

for title in left_top:
    print(i) # A progress tracker
    
    key = str(title.id) + "_" + str(title.subreddit)
    data_dict[key] = {}
    data_dict[key]["title"] = title.title
    data_dict[key]["id"] = title.id
    data_dict[key]["subreddit"] = title.subreddit
    data_dict[key]["time"] = dt.fromtimestamp(title.created).strftime('%c')
    data_dict[key]["num_comments"] = title.num_comments
    
    submission = reddit.submission(id=title.id)
    submission.comments.replace_more(limit=None)
    comments = [comment.body for comment in submission.comments.list()]
    
    data_dict[key]["comments"] = comments
    i += 1 

In [None]:
with open("rep_dem.pkl","wb") as f:
    pickle.dump(data_dict,f)

In [2]:
with open ("rep_dem.pkl","rb") as f:
    data_dict = pickle.load(f)

In [3]:
df = pd.DataFrame(data_dict).T

In [148]:
with open("rep_dem_df.pkl","wb") as f:
    pickle.dump(df,f)

In [4]:
#Head and Tail
pd.concat([df.tail(2), df.head(2)])  # DataFrame.append was removed in pandas 2.0

Unnamed: 0,comments,id,num_comments,subreddit,time,title
8nbfwl_democrats,[Why bother lying about such things? Why can’t...,8nbfwl,50,democrats,Wed May 30 15:27:11 2018,Trump places Nashville rally crowd size above ...
7rft0w_democrats,[That whole family values thing was a lie just...,7rft0w,21,democrats,Thu Jan 18 23:21:04 2018,Keep a record of their treachery and hypocrisy.
atkbwd_Republican,"[Also r/politicalhumor, When I first joined re...",atkbwd,129,Republican,Fri Feb 22 12:30:39 2019,Thought you guys might like this
axz3oi_Republican,"[18 U.S. Code § 2381. Treason\nWhoever, owing ...",axz3oi,158,Republican,Wed Mar 6 08:57:51 2019,I’m sure everyone agrees to not let her back i...


In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from itertools import chain

corpus = list(chain.from_iterable(df["comments"].values))

In [6]:
corpus[0]

'Also r/politicalhumor'

In [7]:
cvec = CountVectorizer(ngram_range=(2,3),
                       stop_words='english',
                       max_features=None,
                       max_df = 650,
                       min_df = 50
                      )

cvec.fit(corpus)

cvec_corp = cvec.transform(corpus)

cvec_corp = pd.DataFrame(cvec_corp.toarray(), columns=cvec.get_feature_names())

cvec_sum = pd.DataFrame(cvec_corp.sum(), columns=["count"]).sort_values("count")

print(cvec_sum.shape)
cvec_sum.tail(20)

(1239, 1)


Unnamed: 0,count
https np,543
democratic party,544
supreme court,561
years ago,574
wikipedia org,590
wikipedia org wiki,590
en wikipedia,590
en wikipedia org,590
org wiki,594
feel like,595


In [8]:
for i in cvec_sum.tail(10).index:
    print(i)

don care
donald trump
gun control
people don
doesn mean
people like
white house
fox news
np reddit com
np reddit


In [54]:
A = pd.DataFrame(cvec.transform(df["comments"]["atkbwd_Republican"]).toarray(),
             columns=cvec.get_feature_names()).sum().values

A = pd.DataFrame(A, columns = ["atkbwd_Republican"], index = cvec.get_feature_names()).T

In [55]:
B = pd.DataFrame(cvec.transform(df["comments"]["8nbfwl_democrats"]).toarray(),
             columns=cvec.get_feature_names()).sum().values

B = pd.DataFrame(B, columns = ["8nbfwl_democrats"], index = cvec.get_feature_names()).T

In [56]:
C = pd.DataFrame()

In [76]:
A = pd.concat([B,A])

In [77]:
A.head()

Unnamed: 0,000 000,000 people,10 000,10 years,100 years,12 years,15 years,20 years,2016 election,2017 11,...,year olds,year year,years ago,years later,years old,yes know,york times,young people,youtube com,youtube com watch
8nbfwl_democrats,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
atkbwd_Republican,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [83]:
complete = (pd.DataFrame(columns=cvec.get_feature_names()))

for i in df.index:
    print(i)
    line = pd.DataFrame(cvec.transform(df["comments"][i]).toarray(),
                 columns=cvec.get_feature_names()).sum().values
    
    line = pd.DataFrame(line, columns = [i],
                     index = cvec.get_feature_names()).T
    
    complete = pd.concat([complete,line])

atkbwd_Republican
axz3oi_Republican
ao4pso_Republican
asul6v_Republican
aemlra_Republican
av19dd_Republican
aizbuk_Republican
arlssn_Republican
aw7qz1_Republican
an2ujz_Republican
at5sza_Republican
aulyhb_Republican
awm1k3_Republican
avg3om_Republican
ahu3ge_Republican
b29czb_Republican
amgq3y_Republican
9v0wvf_Republican
az76gm_Republican
auahip_Republican
aqkhlw_Republican
anve66_Republican
anacem_Republican
9dmmm6_Republican
aci274_Republican
aq9vay_Republican
b1tfuc_Republican
9s12dq_Republican
apt4nm_Republican
a3dmab_Republican
b825l2_Republican
azmits_Republican
b17euh_Republican
ap44xf_Republican
araulh_Republican
ax0y6l_Republican
apmlvw_Republican
94iyq4_Republican
ajsq10_Republican
9qp1hp_Republican
akrmty_Republican
axqi7t_Republican
b27j8y_Republican
azx3pp_Republican
9uj3zf_Republican
amrsy3_Republican
9k34vl_Republican
b3j5dn_Republican
b1m5hb_Republican
9lz4rv_Republican
apeu59_Republican
b631ow_Republican
ax75j6_Republican
9v9vro_Republican
aydfsc_Republican
9xd9oo_Rep

9wp6j3_Republican
1ls5b1_Republican
a706y9_Republican
9fxo3t_Republican
950h2y_Republican
90jtda_Republican
8cmtg4_Republican
9zed8x_Republican
9cb9dk_Republican
8zy14o_Republican
5hpxvb_Republican
10wy0o_Republican
au78io_Republican
a64mnu_Republican
7vx2fm_Republican
9zudq0_Republican
9r9kuh_Republican
8reofv_Republican
8owijk_Republican
9pyvpq_Republican
9f91av_Republican
8s8h33_Republican
8znw0y_Republican
anmk6y_Republican
a1iico_Republican
9xu4yy_Republican
99l4sx_Republican
95bb3d_Republican
5cl3as_Republican
8h3swg_Republican
9s5ku7_Republican
9hzdrm_Republican
9hfavo_Republican
5uptgc_Republican
94ep5o_Republican
8drkc8_Republican
9mnp7j_Republican
8go5ts_Republican
8a62pu_Republican
8hhwwu_Republican
xtwwy_Republican
9ru4co_Republican
97htjq_Republican
b5mwdq_Republican
9igj2p_Republican
8ulbr1_Republican
4af7tq_Republican
9nkp9x_Republican
b732eb_Republican
92swlc_Republican
8pf1os_Republican
6qs0dp_Republican
9i2wih_Republican
8fb51z_Republican
arxx82_Republican
ap5m1o_Repu

5pnp2p_Republican
4tz4vi_Republican
4gzgke_Republican
1wwb1g_Republican
ab82jh_Republican
9nrnno_Republican
8l2i1z_Republican
7seg6r_Republican
62hjaa_Republican
5wh7c8_Republican
5pp6nm_Republican
1rhze4_Republican
1nwh20_Republican
azg9lx_Republican
augp9w_Republican
9m8tjt_Republican
9c8ayn_Republican
8h9s9n_Republican
7uc0w7_Republican
18ujt0_Republican
10i6ym_Republican
zcc74_Republican
aic556_Republican
8kcetg_Republican
4er8yb_Republican
1pfcqh_Republican
9w3swg_Republican
9txtk2_Republican
6nxlvi_Republican
6j2b0k_Republican
5k8lnz_Republican
1rduzx_Republican
9p0gmy_Republican
8tcnpe_Republican
7wvnsv_Republican
768wwi_Republican
5x7hx1_Republican
5q4jhq_Republican
5nkdy0_Republican
4399q3_Republican
uuyl5_Republican
b05ren_Republican
a08tcr_Republican
861utr_Republican
829n5u_Republican
7mv9za_Republican
7hye2v_Republican
6hzh2t_Republican
6d1ysb_Republican
5g3263_Republican
b9ch9m_Republican
98n504_Republican
92383t_Republican
8jla4t_Republican
8hrfse_Republican
5phpkp_Repub

8oy9ov_democrats
7fz4jx_democrats
9ndifz_democrats
8v6b69_democrats
8gjadu_democrats
8mpwta_democrats
77lw2z_democrats
a1dk4m_democrats
aoufg3_democrats
asnq70_democrats
b3xn2i_democrats
9x7u08_democrats
91v7ba_democrats
8ye327_democrats
6enosb_democrats
a9kney_democrats
5y2mvs_democrats
a1u06t_democrats
8j43qr_democrats
9lux8w_democrats
9e49qd_democrats
awi3h7_democrats
b49sez_democrats
ax08tt_democrats
apbfib_democrats
8g58js_democrats
7aqut0_democrats
80dhyi_democrats
7xyddm_democrats
9n3yob_democrats
a86943_democrats
9gbbpr_democrats
86k8i8_democrats
9c2tsl_democrats
agzgsk_democrats
967bn6_democrats
8q4wg4_democrats
afv7xj_democrats
9rkua2_democrats
7abgi4_democrats
b1sxw4_democrats
amplch_democrats
9hpcex_democrats
7e8l6c_democrats
b02hbe_democrats
aej6hj_democrats
9iyhjd_democrats
b52bqu_democrats
axg5r8_democrats
9n9gq3_democrats
8lohoz_democrats
9hfsyq_democrats
8on2ws_democrats
8g0ojw_democrats
8s69ba_democrats
8hbana_democrats
anmjc6_democrats
5xgu1e_democrats
aiv7ei_democra

72tksm_democrats
75qlaf_democrats
6rrfxs_democrats
aal6m9_democrats
8dnbvq_democrats
8kqtqa_democrats
8wjroz_democrats
6tw8to_democrats
948v7i_democrats
80wf4n_democrats
7j09yw_democrats
67nupg_democrats
8nxli7_democrats
7wiq87_democrats
7bu7fa_democrats
5g6fid_democrats
9o3atv_democrats
7ztfqy_democrats
6spmbd_democrats
8f233w_democrats
80m5uh_democrats
7s2uys_democrats
6y8jof_democrats
6cu7qh_democrats
5pf3k7_democrats
61zhce_democrats
9it30c_democrats
8rkyk7_democrats
8wzcif_democrats
9u2vru_democrats
7nmiin_democrats
5vrkgy_democrats
95lkgy_democrats
8db1v0_democrats
8ppno5_democrats
94lwv9_democrats
8k3lqb_democrats
akj9do_democrats
8bthin_democrats
7f1onv_democrats
ajl6k0_democrats
88jha9_democrats
7rpzp6_democrats
az3a2a_democrats
8iyhew_democrats
5zjitn_democrats
7zabm5_democrats
72kjsk_democrats
6inmy1_democrats
8cdvod_democrats
8b5q6b_democrats
8u8ysc_democrats
8u5y1s_democrats
62a9uj_democrats
85s1ot_democrats
6qsj40_democrats
8yp9v0_democrats
b1ay6i_democrats
9rnis9_democra

In [116]:
complete.head()

Unnamed: 0,000 000,000 people,10 000,10 years,100 years,12 years,15 years,20 years,2016 election,2017 11,...,year year,years ago,years later,years old,yes know,york times,young people,youtube com,youtube com watch,target
atkbwd_Republican,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,1
axz3oi_Republican,0,0,2,1,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
ao4pso_Republican,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
asul6v_Republican,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,1
aemlra_Republican,0,0,0,0,1,0,0,1,0,0,...,0,5,0,0,1,0,0,0,0,1


In [111]:
tryer = df.columns

# complete[tryer] = df[tryer]
# complete.drop(tryer,axis = 1, inplace = True)

complete["target"] = df["subreddit"].apply(lambda a: 1 if a == "Republican" else 0)

In [118]:
with open("cvec(2,4).pkl","wb") as f:
    pickle.dump(complete,f)

pd.concat([complete.tail(2), complete.head(2)])  # DataFrame.append was removed in pandas 2.0

Unnamed: 0,000 000,000 people,10 000,10 years,100 years,12 years,15 years,20 years,2016 election,2017 11,...,year year,years ago,years later,years old,yes know,york times,young people,youtube com,youtube com watch,target
8nbfwl_democrats,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7rft0w_democrats,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
atkbwd_Republican,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,1
axz3oi_Republican,0,0,2,1,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1


In [124]:
complete = complete.reset_index(drop=True)

In [125]:
X = complete.drop("target",axis = 1)
y = complete["target"]

In [144]:
from sklearn.linear_model import LogisticRegression, RidgeClassifierCV
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   random_state = 42,
                                                   stratify = y)

rc = RidgeClassifierCV(fit_intercept=False, cv=5)

rc.fit(X_train, y_train)



RidgeClassifierCV(alphas=array([ 0.1,  1. , 10. ]), class_weight=None, cv=5,
         fit_intercept=False, normalize=False, scoring=None,
         store_cv_values=False)

In [145]:
# `lr` is not defined in the cells shown here; it was presumably fitted in an earlier run.
lr.score(X_train,y_train)

0.9759036144578314

In [146]:
lr.score(X_test,y_test)

0.7655310621242485

In [136]:
for i in range(len(X_train.columns)):
    print(X_train.columns[i],lr.coef_[0][i])

000 000 0.5365632009337907
000 people 0.1422299376073018
10 000 0.05704788286249892
10 years 0.08245032316576387
100 years 0.2009029815039907
12 years 0.2074846604123926
15 years 0.3729834022164442
20 years 0.02564505963618533
2016 election -0.31014552155111363
2017 11 0.218633845287166
2018 06 0.12910822043108608
2018 07 0.6976783602081381
2018 11 0.12858918735198777
2018 election -0.6826398175690165
2018 election primary -0.15289334864116044
2018 general -0.3568809782481147
2018 general election -0.3568809782481147
2nd amendment 0.3633341670679911
30 years 0.26883850693575123
40 years -0.042367774780605084
50 years -0.7283095621193516
act like 0.49480747472073877
acting like -0.42220967893907174
actually believe 0.02949466558495593
actually care -0.7073719013467225
actually did -0.3225711679853326
actually read 0.12065453543244989
actually said -0.33629508213406256
actually think -0.4825540069847679
ad hominem -0.1546200148892635
african american 0.45064442876674765
african americans

In [None]:
df["comments"]["atkbwd_Republican"]

In [None]:
for i in cvec_sum.index:
    df[i] = 0

In [None]:
len(df["comments"]["atkbwd_Republican"])

In [None]:
for title in df.index:
    print(title)  # A progress tracker
    cvec_title = cvec.transform(df["comments"][title])
    cvec_title = pd.DataFrame(cvec_title.toarray(), columns=cvec.get_feature_names()).sum()
    for i in cvec_title.index:
        df.loc[title, i] = cvec_title[i]  # .loc avoids chained assignment
    

In [None]:
bb = cvec.transform(df["comments"]["atkbwd_Republican"])

In [None]:
bb = pd.DataFrame(bb.toarray(), columns=cvec.get_feature_names()).sum()

In [None]:
for i in bb.index:
    print(i, bb[i])

In [None]:
bb

In [None]:
df["000 000"]

In [None]:
A = df["comments"]["a039wy_democrats"][0:10]

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
# A is a list of comments, so tokenize each comment separately.
A_tokens = [tokenizer.tokenize(comment.lower()) for comment in A]

In [None]:
from bs4 import BeautifulSoup  
example1 = BeautifulSoup(A[0], "html.parser")  # parse a single comment, not the whole list
print(example1.get_text())

In [None]:
import re

def ngrams(text, n):
    # Split on single spaces only, so punctuation stays attached to tokens.
    tokens = text.lower().split(' ')
    output = []
    for i in range(len(tokens) - n + 1):
        output.append(tokens[i:i+n])
    return output

ngrams('a bm,C d e', 2) # [['a', 'bm,c'], ['bm,c', 'd'], ['d', 'e']]

In [None]:
A = re.split(r"(\W+)", r'a bm,C \d e')  # list of tokens and the separators between them
re.search(r"\w{1,}", "".join(A))        # first run of word characters

In [None]:
r"\w{1,}"  # one or more word characters; equivalent to r"\w+"

In [None]:
" ".join(A)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
import string

listed = []
for i in df["comments"]["a039wy_democrats"]:
    listed.append(i.lower().translate(str.maketrans('', '', string.punctuation)).split(" "))

In [None]:
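two_gram = []
for tokens in listed:
    # Assumed intent (the original cell was left unfinished): collect
    # adjacent token pairs for each comment.
    gram = [tokens[j:j+2] for j in range(len(tokens) - 1)]
    two_gram.append(gram)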

In [None]:
# Import lemmatizer.
from nltk.stem import WordNetLemmatizer

# Instantiate lemmatizer.
lemmatizer = WordNetLemmatizer()

In [None]:
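# Lemmatize word by word; passing a whole comment to lemmatize() leaves it unchanged.
tokens_lem = [[lemmatizer.lemmatize(word) for word in comment.split()]
              for comment in df["comments"]["a039wy_democrats"]]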