# Universal Healthcare NLP Project

###### This project explores Twitter discussions surrounding the implementation of universal healthcare. The aim is to build a topic model that surfaces the most common areas of concern among people discussing universal healthcare, so that campaigns and solutions can better address hesitations about it.

This notebook follows the **NMF topic modeling** process post data cleaning.

In [13]:
#some libraries
import pandas as pd
import numpy as np
#Base and Cleaning 
import json
import requests
import emoji
import regex
import re
import string
from collections import Counter

# #Visualizations
# import plotly.express as px
# import seaborn as sns
# import matplotlib.pyplot as plt 
# import pyLDAvis.gensim
# import chart_studio
# import chart_studio.plotly as py 
# import chart_studio.tools as tls

#Natural Language Processing (NLP)
import spacy
import gensim
from spacy.tokenizer import Tokenizer
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel
from gensim.parsing.preprocessing import STOPWORDS as SW
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
from wordcloud import STOPWORDS
stopwords = set(STOPWORDS)

In [14]:
#pull in data
df = pd.read_csv('/Users/mehikapatel/Universal_Healthcare_NLP/Data/SecondTwitterDF').drop(columns='Unnamed: 0')

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26457 entries, 0 to 26456
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 26457 non-null  object
 1   timezone             26457 non-null  int64 
 2   username             26457 non-null  object
 3   day                  26457 non-null  int64 
 4   hour                 26457 non-null  int64 
 5   nlikes               26457 non-null  int64 
 6   reply_to             26457 non-null  object
 7   tweet                26457 non-null  object
 8   tokens               26457 non-null  object
 9   tokens_back_to_text  26454 non-null  object
 10  lemmas               26457 non-null  object
 11  lemmas_back_to_text  26452 non-null  object
 12  lemma_tokens         26457 non-null  object
dtypes: int64(4), object(9)
memory usage: 2.6+ MB


### Document Term Matrix

Preliminary step before modeling: vectorize the data.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

from sklearn.decomposition import NMF
from wordcloud import WordCloud, STOPWORDS

In [17]:
stopwords = set(STOPWORDS)

stopwords.update(['affordable','universal','healthcare'])

In [18]:
#pass stop_words as a list: recent scikit-learn versions reject a set here
tfidf = TfidfVectorizer(max_df=.95, min_df=3, stop_words=list(stopwords), ngram_range=(1,1))
dtm = tfidf.fit_transform(df['lemma_tokens'])
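
To make the effect of `min_df`, `max_df`, and the custom stop words concrete, here is a minimal sketch on a made-up corpus. The documents, thresholds, and `toy_*` names are assumptions for illustration only; they are not the project's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df['lemma_tokens'].
toy_docs = [
    "health care access insurance",
    "health insurance cost pay",
    "pay wage cost people",
    "people need health care",
]

# min_df=2 drops terms that appear in fewer than 2 documents;
# max_df=0.95 drops terms that appear in more than 95% of documents;
# stop_words removes the listed terms before any counting happens.
toy_tfidf = TfidfVectorizer(min_df=2, max_df=0.95, stop_words=["people"])
toy_dtm = toy_tfidf.fit_transform(toy_docs)

toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_dtm.shape)  # (number of documents, number of retained terms)
print(toy_vocab)
```

Rare words such as `access` and `wage` fall below `min_df` and disappear, which is the same pruning that `min_df=3` applies to the tweet vocabulary.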




#### Creating an initial NMF model

In [19]:
#creating NMF with 5 components

nmf_model = NMF(n_components = 5, random_state = 60)

#fit/transform dtm
#get weights of docs to each topic

topics = nmf_model.fit_transform(dtm)

topics[15]



array([0.01810371, 0.        , 0.        , 0.00167009, 0.00134833])

The above shows that the document at index 15 most likely belongs to the first topic (index 0), since that component carries by far the largest weight.
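
As a small sketch of how a single row of `topics` can be read, the snippet below reuses the weights printed for `topics[15]`. Rescaling the row to sum to 1 is only a convenience for comparison; NMF weights are not probabilities, and the variable names are illustrative.

```python
import numpy as np

# Weights copied from the topics[15] output above.
doc_weights = np.array([0.01810371, 0.0, 0.0, 0.00167009, 0.00134833])

dominant_topic = int(doc_weights.argmax())  # 0-based index of the largest weight
shares = doc_weights / doc_weights.sum()    # relative weight of each topic in this row

print(dominant_topic)
print(shares.round(3))
```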

#### Interpreting Topics:

In [20]:
#use nmf_model.components_ to get the relation of topics to words: view the 10 words with the highest coefficients for each topic

feature_names = tfidf.get_feature_names_out() #scikit-learn >= 1.0; older versions use get_feature_names()
for i, topic in enumerate(nmf_model.components_):
    print(f'TOP TEN WORDS BY TOPIC: {i+1}')
    print([feature_names[j] for j in topic.argsort()[-10:]]) #ascending weight, strongest word last
    print('\n')#print new line per topic

TOP TEN WORDS BY TOPIC: 1
['work', 'money', 'israel', 'wage', 'american', 'right', 'country', 'free', 'pay', 'people']


TOP TEN WORDS BY TOPIC: 2
['education', 'job', 'quality', 'water', 'salary', 'rent', 'road', 'drinking', 'ask', 'good']


TOP TEN WORDS BY TOPIC: 3
['childcare', 'infrastructure', 'food', 'education', 'fixthecountry', 'job', 'nakufoaddo', 'fortunate', 'housing', 'need']


TOP TEN WORDS BY TOPIC: 4
['social', 'retrain', 'canada', 'socialist', 'month', 'leave', 'close', '25', 'wage', 'college']


TOP TEN WORDS BY TOPIC: 5
['quality', 'service', 'need', 'medical', 'system', 'mental', 'access', 'insurance', 'care', 'health']
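
The lists above run from weakest to strongest word, because `argsort` sorts in ascending order. A minimal sketch of a helper that returns the strongest word first is shown below; `top_words` and the toy matrix are hypothetical and only stand in for `nmf_model.components_` and the fitted vocabulary.

```python
import numpy as np

def top_words(components, feature_names, n=10):
    """Return the n highest-weighted words per topic, strongest first."""
    feature_names = np.asarray(feature_names)
    result = []
    for topic in components:
        order = np.argsort(topic)[::-1][:n]  # indices by descending weight
        result.append(feature_names[order].tolist())
    return result

# Hypothetical 2-topic x 4-word matrix, standing in for nmf_model.components_.
toy_components = np.array([[0.1, 0.9, 0.0, 0.4],
                           [0.7, 0.0, 0.3, 0.2]])
toy_vocab = ["care", "pay", "health", "wage"]

print(top_words(toy_components, toy_vocab, n=2))
```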




### Topic Interpretations:
###### Topic 1: Living Costs


###### Topic 2: Quality of life


###### Topic 3: Infrastructure


###### Topic 4: Socialism


###### Topic 5: Accessibility to Public Services 


Now we can go in and label our tweets according to these topics to later pull examples of each topic.

In [21]:
df['Topic'] = topics.argmax(axis=1)

naming={0:'Cost of Living',1:'Quality of Life',2:'Infrastructure',3:'Socialism',4:'Public Services & Accessibility'}

df['Topic_name'] = df['Topic'].map(naming)
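
One caveat that may be worth checking with `argmax` labeling: a row whose weights are all zero (for example, a tweet with no retained vocabulary) is assigned index 0 by default. Whether such rows exist in this dataset is not shown here, so the sketch below uses made-up weights, and the `max_weight` and `unassigned` columns are hypothetical additions.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights; the last row is all zeros.
toy_topics = np.array([
    [0.02, 0.00, 0.01],
    [0.00, 0.05, 0.00],
    [0.00, 0.00, 0.00],
])

toy_df = pd.DataFrame({"Topic": toy_topics.argmax(axis=1)})
toy_df["max_weight"] = toy_topics.max(axis=1)
# argmax returns 0 for an all-zero row, so flag those rows explicitly.
toy_df["unassigned"] = toy_df["max_weight"] == 0

print(toy_df)
```

Counting the flagged rows would show how much of the first topic's size, if any, comes from tweets the model could not place.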


In [22]:
df[['tweet','Topic_name']].head(70)

|  | tweet | Topic_name |
|---|---|---|
| 0 | Lyra on the importance of Medicare and univers... | Infrastructure |
| 1 | @JordanChariton She was right. People don’t wa... | Public Services & Accessibility |
| 2 | @SenTedCruz That’s cool you stand for a countr... | Cost of Living |
| 3 | @JordanChariton Hillary Clinton was 100% ride ... | Public Services & Accessibility |
| 4 | thinking about that guy who was like “if you h... | Cost of Living |
| ... | ... | ... |
| 65 | @Guyperson654 @magdalaheals @EventuallyTruth @... | Cost of Living |
| 66 | @rafaelshimunov Need to end privatized medicin... | Infrastructure |
| 67 | @biancoresearch @GlideIsh @arik_shalom Doesn't... | Cost of Living |
| 68 | @PoliticsFan10 What side is the leave me alone... | Socialism |


In [23]:
# df.iloc[0].tweet #This one fits well in the infrastructure topic!
# df.iloc[1].tweet #This one fits well in the PS&A topic! //more about hybrid models
# df.iloc[2].tweet #This one fits well in the Cost of living topic! //BETTER WITH RIGHTS
# df.iloc[3].tweet #This one fits well in the PS&A topic!
# df.iloc[4].tweet #This one fits well in the cost of living topic!//BETTER WITH RIGHTS
# df.iloc[5].tweet #This one fits well in the cost of living topic! //Kinda also goes with rights
# df.iloc[6].tweet #This one fits well in the cost of living  topic!//BETTER WITH RIGHTS
# df.iloc[7].tweet #This one fits well in the cost of living  topic!
# df.iloc[8].tweet #This one fits well in the PS&A topic!
# df.iloc[9].tweet #This one fits well in the quality of life topic! GOOD ALSO W RIGHTS
# df.iloc[10].tweet #This one fits well in the cost of living  topic! it works! lol
# df.iloc[11].tweet #This one fits well in the PS&A topic!
# df.iloc[12].tweet #This one fits well in the Socialism topic! //BETTER W QUALITY OF LIFE
# df.iloc[13].tweet #This one fits well in the infrastructure topic!
# df.iloc[14].tweet #This one fits well in the cost of living topic!//QUALITY OF LIFE
# df.iloc[15].tweet #This one fits well in the cost of living topic!//QUALITY OF LIFE//INFRASTRUCTURE//POLITICAL
# df.iloc[16].tweet #This one fits well in the socialism topic!
# df.iloc[17].tweet #This one fits well in the cost of living topic!//POLITICAL
# df.iloc[18].tweet #This one fits well in the cost of living topic!//POLITICAL
# df.iloc[19].tweet #This one fits well in the cost of living topic!//POLITICAL
# df.iloc[20].tweet #This one fits well in the cost of living topic!//INFRASTRUCTURE AND QUALITY OF LIFE
df.iloc[68].tweet #Labeled Socialism in the table above

'@PoliticsFan10 What side is the leave me alone but give me universal healthcare side?'

It seems that a more appropriate label for the original "Cost of Living" category might be "Political/Rights", since that label better reflects the actual content of these tweets. The original "Socialism" topic might be more appropriately labeled "Comparisons", meaning comparisons to other countries with more "socialist" policies such as universal healthcare.

In [24]:
naming={0:'Political/Rights',1:'Quality of Life',2:'Infrastructure',3:'Comparisons',4:'Public Services & Accessibility'}

df['Topic_name'] = df['Topic'].map(naming)

In [25]:
df[['tweet','Topic_name']].head()

|  | tweet | Topic_name |
|---|---|---|
| 0 | Lyra on the importance of Medicare and univers... | Infrastructure |
| 1 | @JordanChariton She was right. People don’t wa... | Public Services & Accessibility |
| 2 | @SenTedCruz That’s cool you stand for a countr... | Political/Rights |
| 3 | @JordanChariton Hillary Clinton was 100% ride ... | Public Services & Accessibility |
| 4 | thinking about that guy who was like “if you h... | Political/Rights |


Now there is a list of broader topics, some of which can be split into more specific categories:

1. **Politically related** -- Discussions directly pertaining to how politics are involved in decisions about healthcare, about how party lines shape perceptions of universal healthcare, etc.

2. **Rights related**-- Discussions about universal healthcare as a human right

3. **Quality of life**-- Discussions surrounding healthcare as a part of broader quality of life improvements

4. **Infrastructure**-- Discussions about American healthcare infrastructure

5. **Comparisons**-- Discussions comparing American healthcare and spending infrastructure to other countries

6. **Public Services** -- Discussions describing healthcare as a public service, sometimes in efforts to describe it as a publicly funded necessity

7. **Accessibility** -- Conversations about accessibility to public services like healthcare

*Save the dataframes by topic for future sentiment analysis by topic*

In [32]:
df.Topic_name.value_counts()

Political/Rights                   11550
Public Services & Accessibility     9197
Infrastructure                      2477
Quality of Life                     1719
Comparisons                         1514
Name: Topic_name, dtype: int64

In [35]:
#t1
df[df.Topic_name =='Political/Rights'].to_csv(r'/Users/mehikapatel/Universal_Healthcare_NLP/Data/Topic1DF')
#t2
df[df.Topic_name =='Public Services & Accessibility'].to_csv(r'/Users/mehikapatel/Universal_Healthcare_NLP/Data/Topic2DF')
#t3
df[df.Topic_name =='Infrastructure'].to_csv(r'/Users/mehikapatel/Universal_Healthcare_NLP/Data/Topic3DF')
#t4
df[df.Topic_name =='Quality of Life'].to_csv(r'/Users/mehikapatel/Universal_Healthcare_NLP/Data/Topic4DF')
#t5
df[df.Topic_name =='Comparisons'].to_csv(r'/Users/mehikapatel/Universal_Healthcare_NLP/Data/Topic5DF')
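
The five exports above could also be written as one loop. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses a tiny made-up frame, a temporary directory in place of the project's Data folder, and numbers the files in order of first appearance rather than by topic size.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df; only the columns needed for the split.
toy_df = pd.DataFrame({
    "tweet": ["a", "b", "c"],
    "Topic_name": ["Political/Rights", "Comparisons", "Political/Rights"],
})

out_dir = Path(tempfile.mkdtemp())  # placeholder for the project's Data folder

written = {}
groups = toy_df.groupby("Topic_name", sort=False)
for i, (name, group) in enumerate(groups, start=1):
    path = out_dir / f"Topic{i}DF"   # hypothetical naming, mirrors Topic1DF ... Topic5DF
    group.to_csv(path, index=False)  # index=False avoids an 'Unnamed: 0' column on reload
    written[name] = path

print({name: path.name for name, path in written.items()})
```

Writing with `index=False` means the files can be read back later without the `.drop(columns='Unnamed: 0')` step used when loading the data at the top of this notebook.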