# Abstract
Natural Language Processing (NLP) is, in itself, a massive subfield of computer science. I used this project as an opportunity to begin exploring that field while demonstrating the role that machine learning models play in it. The data used in this study is a collection of tweets from 16 verified health-focused accounts (BBC Health, CBC Health, etc.). I found this data while searching the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php), although the original credit belongs to Amir Karami of the University of South Carolina.

Hence, the goal of this project is to detect trends in tweets from Twitter accounts whose target audience is users interested in general health.

# Motivation
Initially, the goal of this project was to perform a very similar analysis on Donald Trump's rally speeches leading up to the 2020 presidential election. The hope was to group words together in a meaningful way so that I could extract topics from his speeches. While I am still working to finalize that project, this one accomplishes a similar task and has better prepared me to revisit that goal.

# Representing Words Numerically
Before performing any analysis, we first have to clean our data frame and convert it from a list of tweets into a numerical format. Below is a handful of tweets from the data frame, with the unused columns removed from view. This data frame serves as our corpus, where each tweet is treated as its own document.

In [1]:
import pandas as pd

df = pd.read_csv("../Datasets/all_tweets.csv")
display(df.head(3))

Unnamed: 0,tweet
0,Breast cancer risk test devised http://bbc.in/...
1,GP workload harming care - BMA poll http://bbc...
2,Short people's 'heart risk greater' http://bbc...


# Data Cleaning
To clean this data, I removed hyperlinks, retweet symbols (RT), and mention symbols (@) from each document. The cleaned text is also lowercased, as the sample below shows.
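
The cleaning steps described here can be sketched with the standard `re` module. This is only an illustration, not the code used to produce `all_tweets_cleaned.csv`: the helper name `clean_tweet`, the exact regular expressions, and the example URL are assumptions.

```python
import re


def clean_tweet(text: str) -> str:
    """Strip links, retweet markers, and mention symbols, then lowercase.

    Hypothetical helper: the patterns below are assumptions about how the
    cleaning could be done, not the original implementation.
    """
    text = re.sub(r"https?://\S+", " ", text)  # hyperlinks
    text = re.sub(r"\bRT\b", " ", text)        # retweet marker (uppercase only)
    text = text.replace("@", " ")              # mention symbol; the handle itself is kept
    # Collapse the leftover whitespace and normalize case.
    return re.sub(r"\s+", " ", text).strip().lower()


# Made-up URL used purely for illustration.
print(clean_tweet("Breast cancer risk test devised http://example.com/abc"))
```

Applied to a whole column, this would look like `df["tweet"].apply(clean_tweet)`.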

In [2]:
df = pd.read_csv("../Datasets/all_tweets_cleaned.csv")
display(df.head(3))

Unnamed: 0,tweet
0,breast cancer risk test devised
1,gp workload harming care - bma poll
2,short people's 'heart risk greater'


# Term Frequency Inverse Document Frequency (TF-IDF)
Before explaining the cell below, it helps to have some background on what TF-IDF represents. TF-IDF produces a matrix that describes how words appear in each document. The rows represent the documents (tweets, in this case), and the columns represent the unique vocabulary built from those tweets. Each cell holds a term's frequency within that row's tweet multiplied by the term's inverse document frequency, which shows how important a word is to a tweet's meaning relative to the overall collection of tweets.
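
To make the weighting concrete, here is a small hand computation in plain Python. It assumes the variant that sklearn's `TfidfVectorizer` applies by default: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and rows scaled to unit length. The three toy documents are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration, not the real tweets).
docs = ["cancer risk test", "nhs", "heart risk"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(word for doc in tokenized for word in set(doc))


def idf(term: str) -> float:
    """Smoothed inverse document frequency (sklearn's default form)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(tokens: list[str]) -> list[float]:
    """TF-IDF weights for one document, scaled to unit (L2) length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


matrix = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(value, 3) for value in row])
```

Two things are worth noticing: "risk" appears in two documents, so it receives less weight than "cancer" within the first row, and a document with a single vocabulary term ends up with a weight of exactly 1.0 after normalization.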

In the cell below, we create this matrix using the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform) class from sklearn. This class is versatile and lets us specify how we want to define our vocabulary. The parameters we used for this transformation were:
> Only including words that have a document frequency greater than or equal to 10%.<br>
> Removing stop words as defined by the NLTK library.<br>
> Looking at words individually, as well as groups of two and three.  

In [3]:
df = pd.read_csv("../Datasets/df_tfidf.csv")
display(df.head(3))

Unnamed: 0,cancer,risk,care,people,heart,new,nhs,doctors,video,day,...,want,10,know,well,foods,reports,obamacare,fda,ways,healthtalk
0,0.0,0.0,0.0,0.0,0.0,0.678162,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Elbow Analysis
Elbow analysis is a technique for choosing the number of clusters for a model. Since K-Means requires this number as an initial parameter, elbow analysis fits a model for each candidate number of clusters and compares the resulting scores on a line chart. The metrics used to score the different models are:

> Distortion: The sum of the squared distances from each point to its assigned cluster center.<br>
> Silhouette: A measure of how well separated the clusters are from one another.<br>
> Calinski Harabasz: "The ratio of dispersion between and within clusters." ([Yellowbrick Documentation](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html))
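
The three scores above can also be computed directly with sklearn, without a plotting helper. The sketch below is an assumption-laden illustration rather than the code behind the charts in this notebook: it uses synthetic two-dimensional blobs in place of the TF-IDF matrix, a small range of `k`, and `inertia_` as the distortion score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Stand-in data: three well-separated blobs instead of the TF-IDF matrix.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        # Sum of squared distances from each point to its cluster center.
        "distortion": model.inertia_,
        # Ranges from -1 to 1; higher means better separated clusters.
        "silhouette": silhouette_score(X, model.labels_),
        # Ratio of between-cluster to within-cluster dispersion.
        "calinski_harabasz": calinski_harabasz_score(X, model.labels_),
    }

for k, metrics in scores.items():
    print(k, {name: round(value, 3) for name, value in metrics.items()})
```

Plotting each metric against `k` gives the kind of line chart that elbow analysis relies on.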

In the first two graphs, there isn't a clear "elbow" to indicate an obvious choice of clusters. For the distortion graph, the "elbow" would be a sharp decline, indicating a substantial improvement from the newly added cluster; for the silhouette score, it would be an upward spike toward a score of one from the addition of a new cluster. Unfortunately, neither indicator was visibly present, so despite the algorithm's best guess at the dotted lines, I continued to search. In the Calinski Harabasz test in graph three, there was a much more convincing "elbow" to choose from. While the algorithm suggested 7 clusters, indicated by the dotted black line, I chose to implement my model with four clusters, marked by the solid black line, because that was the significant improvement I was looking for in the graph.
<table>
    <tr>
        <td><img src="../Images/Elbow_Distortion_k_in_2-12.png"></td>
        <td><img src="../Images/Elbow_Silhouette_k_in_2-12.png"></td>
        <td colspan="1"><img src="../Images/Elbow_Harabasz_k_in_2-12_Labeled.png"></td>
    </tr>
</table>

# Clusters Found
To find which cluster each tweet belongs to, I ran a K-Means clustering model with 4 clusters, as determined above. Now, the final task is to see what words define each cluster and what information can be extracted from this model. It is helpful to note that the term frequencies were determined by summing the columns of the TF-IDF matrix above. Feel free to interact with the cells below each cluster to see what kind of tweets you can find.

In [4]:
df = pd.read_csv("../Datasets/df_cleaned_clusters.csv")

<table>
    <tr>
        <th style="width:5%; text-align:center">Cluster</th>
        <th style="width:25%; text-align:center">Word Cloud</th>
        <th style="width:75%; text-align:center">Description</th>
    </tr>
    <tr>
        <td><h1>0</h1></td>
        <td><img src="../Images/C_0.png"></td>
        <td>
            <p>Of the words that make up this cluster, the most defining were healthcare, make, medical, nhs, brain, and study. The grouping is largely dominated by healthcare, and the words that follow suggest that this cluster describes tweets about studies in the field.</p>
        </td>
    </tr>
</table>

In [5]:
# Keep cluster 0 tweets that contain any of the cluster's defining words
keywords = ["healthcare", "make", "study"]
temp = df[df["cluster"] == 0]
temp = temp[temp["tweet"].apply(lambda t: any(word in t for word in keywords))]
display(temp.head(5)[["tweet", "source", "cluster"]])

Unnamed: 0,tweet,source,cluster
1528,video: 'make mental health bigger priority',bbchealth,0
1530,make mental health 'bigger priority',bbchealth,0
4281,racism against aboriginal people in health-car...,cbchealth,0
4297,canadian seniors satisfied with health-care qu...,cbchealth,0
4726,shorter overseas troop deployments tied to bet...,cbchealth,0


<table>
    <tr>
        <th style="width:5%; text-align:center">Cluster</th>
        <th style="width:25%; text-align:center">Word Cloud</th>
        <th style="width:75%; text-align:center">Description</th>
    </tr>
    <tr>
        <td><h1>1</h1></td>
        <td><img src="../Images/C_1.png"></td>
        <td>
            <p>Of the words that make up this cluster, the most defining were healthcare, make, medical, nhs, brain, and study. The grouping is largely dominated by healthcare, and the words that follow suggest that this cluster describes tweets about studies in the field.</p>
        </td>
    </tr>
</table>

In [6]:
# Keep cluster 1 tweets that contain any of the words of interest
keywords = ["food", "law", "better"]
temp = df[df["cluster"] == 1]
temp = temp[temp["tweet"].apply(lambda t: any(word in t for word in keywords))]
display(temp.head(3)[["tweet", "source", "cluster"]])

Unnamed: 0,tweet,source,cluster
25,unsafe food 'growing global threat',bbchealth,1
301,cigarette packet law 'would save lives',bbchealth,1
312,unlabelled nuts in food prompts probe,bbchealth,1


<table>
    <tr>
        <th style="width:5%; text-align:center">Cluster</th>
        <th style="width:25%; text-align:center">Word Cloud</th>
        <th style="width:75%; text-align:center">Description</th>
    </tr>
    <tr>
        <td><h1>2</h1></td>
        <td><img src="../Images/C_2.png"></td>
        <td>
            <p>Similarly, this cluster seems to describe tweets that are centered around daily updates and advertising healthy behaviors. Some of the tweets observed in this cluster took the form of "You could start this today" and, more notably, "New today: ..."</p>
        </td>
    </tr>
</table>

In [7]:
# Keep cluster 2 tweets that contain any of the words of interest
keywords = ["today", "better", "food"]
temp = df[df["cluster"] == 2]
temp = temp[temp["tweet"].apply(lambda t: any(word in t for word in keywords))]
display(temp.head(5)[["tweet", "source", "cluster"]])

Unnamed: 0,tweet,source,cluster
1596,new hospital food rules introduced,bbchealth,2
4407,how to resist holiday junk food habits into th...,cbchealth,2
4806,"double-gloving, better face protection added i...",cbchealth,2
5963,new documentary says added sugar in food cause...,cbchealth,2
7469,lionel shriver probes obsession with food in n...,cbchealth,2


<table>
    <tr>
        <th style="width:5%; text-align:center">Cluster</th>
        <th style="width:25%; text-align:center">Word Cloud</th>
        <th style="width:75%; text-align:center">Description</th>
    </tr>
    <tr>
        <td><h1>3</h1></td>
        <td><img src="../Images/C_3.png"></td>
        <td>
            <p>With patients and drugs as the leading terms, this cluster appears to capture tweets about new drugs on the market and the audience they are intended for. For example:</p>
            <p style="text-indent: 50px">"Drug hope for leukaemia patients http://bbc.in/1dg81pI" from BBC Health on line 3383</p>
            <p style="text-indent: 50px">"Health Canada blocks dying patients from access to drug http://bit.ly/12eCbb0" from CBC Health on line 7424</p>
        </td>
    </tr>
</table>

In [8]:
# Keep cluster 3 tweets that contain any of the cluster's defining words
keywords = ["patients", "drug", "new"]
temp = df[df["cluster"] == 3]
temp = temp[temp["tweet"].apply(lambda t: any(word in t for word in keywords))]
display(temp.head(5)[["tweet", "source", "cluster"]])

Unnamed: 0,tweet,source,cluster
29,drug giant 'blocks' eye treatment,bbchealth,3
35,ms drug 'may already be out there',bbchealth,3
189,clegg in drug law election pledge,bbchealth,3
198,cancer drug patient's england move,bbchealth,3
206,drug drivers targeted by new rules,bbchealth,3


# Conclusion
K-Means clustering is a powerful tool and an excellent introduction to the field of unsupervised learning. Pairing it with a handful of Natural Language Processing libraries, I was able to extract value from more than 60,000 tweets and group them in a meaningful way. While the value of the tweets analyzed may vary from reader to reader, the techniques presented are well worth studying and applying to future data.