## WordClouds

#### WordClouds represent the importance of a word by its frequency. The more often a word appears in a document, the more prominently it is displayed.
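
##### As a rough illustration of this idea, the sketch below counts word frequencies with the standard library and scales each count relative to the most frequent word. The sample sentence and the names `sample_text`, `counts`, and `relative_size` are made up for this example, and this is not how the wordcloud package computes its layout internally.

```python
from collections import Counter

# Hypothetical sample text, not taken from the dataset.
sample_text = "run more read more sleep more run daily"

# Count how often each whitespace-separated word occurs.
counts = Counter(sample_text.split())

# Scale every count by the highest count, so the most frequent word gets 1.0.
max_count = max(counts.values())
relative_size = {word: count / max_count for word, count in counts.items()}

# Print the words from most to least frequent.
for word, size in sorted(relative_size.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.2f}")
```

##### A word that appears more often gets a larger relative size, which is the intuition behind the bigger fonts in a wordcloud.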


##### We start by loading the relevant libraries. You might have to install the wordcloud package first, for example with `pip install wordcloud`.

In [None]:
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt

##### The data we are using is a subset of tweets that users sent about their New Year's resolutions for 2015. We will generate a WordCloud from the text of the tweets.

In [None]:
df = pd.read_csv("2015-nyrs.csv", engine='python')

In [None]:
df.head()

In [None]:
df.info()

##### Let's make a wordcloud from the very first tweet.

In [None]:
text = df.text[0]

##### We can specify the background color along with other aesthetic parameters.

In [None]:
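# Build the cloud from the text; background_color sets the canvas color.
wordcloud = WordCloud(background_color="white").generate(text)
# Render the cloud as an image and hide the axes.
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()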

##### A single tweet isn't the best application of wordclouds, because you need a large corpus (collection) of text for the frequencies to be meaningful. So let's apply it to the entire dataset.

In [None]:
# Lowercase every tweet and join them all into one long string.
text = " ".join(tweet.lower() for tweet in df.text)
text[:1000]
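
##### The join above assumes every entry in the column is a string. If the column contained missing values, calling `.lower()` on them would fail. A minimal sketch of a guarded version, using a made-up list named `tweets` in place of the real column:

```python
# Hypothetical tweets; None and NaN stand in for missing values.
tweets = ["Run MORE this year", None, float("nan"), "Read more BOOKS"]

# Keep only real strings, lowercase them, and join them with spaces.
corpus = " ".join(t.lower() for t in tweets if isinstance(t, str))
print(corpus)
```

##### Whether this guard is needed depends on the data; the output of `df.info()` shows how many non-null entries each column has.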

In [None]:
wordcloud = WordCloud(background_color="white").generate(text)
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

##### Many of the most prominent words are variations of "new years resolution", because that is the main topic of this dataset. They tell us nothing new, so we need a way to remove them.

##### We also need to remove common words such as prepositions and articles from the final wordcloud. These words are called stopwords and convey little relevant information.

In [None]:
stopwords = set(STOPWORDS)

##### We will also add some more words to the stopwords set because they act as stopwords for this data, such as new, year, and resolution.

In [None]:
stopwords.update(['new year','new','years','resolution','year','rt','amp','thi','newyearsresolution','will'])
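
##### To make the effect of a stopword set concrete, the sketch below filters tokens in plain Python. It is a simplified illustration, not the wordcloud package's own implementation; the names `extra_stopwords`, `sample_text`, `tokens`, and `kept` and the sample sentence are made up for this example.

```python
import re

# Hypothetical stopword set and text, made up for this illustration.
extra_stopwords = {"new", "year", "resolution", "my", "is", "to", "new year"}
sample_text = "my new year resolution is to read more"

# Split into lowercase word tokens, then drop every token found in the set.
tokens = re.findall(r"[a-z']+", sample_text.lower())
kept = [token for token in tokens if token not in extra_stopwords]
print(kept)
```

##### In this sketch tokens are compared one at a time, so a multi-word entry such as "new year" never matches anything; the single words "new" and "year" do the actual filtering.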

In [None]:
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

##### This is much better! But weblinks, or small pieces of them, are now among the most frequent words. So we will remove all weblinks using regular expressions.

In [None]:
import re
# Strip URLs: a scheme such as http://, a host with optional dotted parts, and any path segments.
text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)
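
##### To see what this substitution does, here is a small sketch that applies the same pattern to a single made-up tweet. The names `URL_PATTERN`, `sample`, `without_links`, and `cleaned` are introduced only for this example.

```python
import re

# Same pattern as in the cell above; the sample tweet is hypothetical.
URL_PATTERN = r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'
sample = "my resolution: run more http://t.co/abc123 #fitness"

# Remove the link, then collapse the double space it leaves behind.
without_links = re.sub(URL_PATTERN, '', sample)
cleaned = " ".join(without_links.split())
print(cleaned)
```

##### The colon in "resolution:" is left alone because the pattern only matches when the colon is followed by two slashes.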

In [None]:
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

##### Now we finally have a meaningful wordcloud. As a rule of thumb, most of the time in this kind of analysis (often quoted as about 80%) is spent cleaning up the text.